Evaluation of Automatic Text Summarization - KTH
Evaluation of Automatic Text Summarization - KTH
Evaluation of Automatic Text Summarization - KTH
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
16 CHAPTER 1. SUMMARIES AND THE PROCESS OF SUMMARIZATION<br />
texts collected were news flashes and press releases from large Swedish newspapers<br />
like Svenska Dagbladet, Dagens Nyheter and Aftonbladet, business and computer<br />
magazines as well as press releases from humanitary organizations like Amnesty<br />
International, RFSL 16 and authorities like Riksdagen. 17<br />
The <strong>KTH</strong> News Corpus has since been used to train a Named Entity tagger<br />
(Dalianis and Åström 2001), that also is used in the SweSum text summarizer (see<br />
Paper 4). The <strong>KTH</strong> News Corpus has also been used to evaluate the robustness<br />
<strong>of</strong> the SweSum summarizer by summarizing Swedish news text collected by the<br />
Business Intelligence tool NyhetsGuiden, “NewsGuide” (see Hassel 2001b), that<br />
actually is an interface to the <strong>KTH</strong> News Corpus tool.<br />
Furthermore, the corpus is currently used in experiments concerning the training<br />
<strong>of</strong> a Random Indexer (see Karlgren and Sahlgren 2001, Sahlgren 2001) for experiments<br />
where “synonym sets”, or rather semantic sets, are used to augment the<br />
frequency count. Gong and Liu (2001) have carried out similar experiments, but<br />
using LSA for creating their sets, where they used the semantic sets to pin point<br />
the topically central sentences, using each set only once in an attempt to avoid<br />
redundancy. LSA, however, is a costly method for building these types <strong>of</strong> sets, and<br />
if RI performs equally or better there would be much benefit.<br />
1.4.2 Paper 2.<br />
Improving Precision in Information Retrieval for Swedish using Stemming<br />
(Carlberger, Dalianis, Hassel, and Knutsson, 2001)<br />
An early stage <strong>of</strong> the <strong>KTH</strong> News Corpus 18 , as from October 2000, was used to evaluate<br />
a stemmer for Swedish and its inpact on the performance <strong>of</strong> a search engine,<br />
SiteSeeker 19 . The work was completely manual in the sense that three persons first<br />
had to each annotate ≈33 20 randomly selected texts from the <strong>KTH</strong> News Corpus<br />
each, and for each text construct a relevant/central question with corresponding<br />
answer, in total 100 texts and 100 answers. The next step was to use the search engine<br />
formulating queries to retrieve information, i.e to find answers to our allotted<br />
questions, with and without stemming in hope that one would find answers to the<br />
questions. Here the questions where swapped one step so that no one formulated<br />
queries concerning their own questions. The last step was to assess the answers <strong>of</strong><br />
the partner, again a swap was made, to find out the precision and recall. This time<br />
consuming and manual evaluation schema showed us that precision and relative<br />
recall improved with 15 respectively 18 percent for Swedish in information retrieval<br />
using stemming.<br />
16Riksförbundet För Sexuellt Likaberättigande; A gay, lesbian, bisexual and transgendered<br />
lobby organization.<br />
17The Swedish Parliament.<br />
18The corpus at this stage contained about 54,000 texts.<br />
19http://www.euroling.se/siteseeker/ 20One person actually had to do 34 questions/queries/evaluations at each corresponding stage,<br />
but this “extra load” was shifted with each stage.