A naïve Bayes Classifier for Web Document Summarie...

470 M. S. Pera & Y.-K. Ng

3.1.1. Word-Correlation factors and sentence similarity

The word-correlation matrix $M$ introduced in Ref. 12, which includes the correlation factors of non-stop, stemmed words^b, is a 54,625 × 54,625 symmetric matrix. The correlation factor of any two words $w_i$ and $w_j$, which indicates how closely related $w_i$ and $w_j$ are semantically, is computed based on (i) the frequency of co-occurrence and (ii) the relative distance of $w_i$ and $w_j$ in each document $D$ in a collection, and is defined as follows:

\[ wcf(w_i, w_j) = \sum_{x \in V(w_i)} \sum_{y \in V(w_j)} \frac{1}{d(x, y)} \tag{1} \]

where $d(x, y)$ denotes the distance between $x$ and $y$ in $D$ (i.e., the number of words in between $x$ and $y$ plus 1), and $V(w_i)$ ($V(w_j)$, respectively) is the set of words that include $w_i$ ($w_j$, respectively) and its stem variations in $D$.

The word-correlation factors in $M$ were computed using the Wikipedia Database Dump (http://en.wikipedia.org/wiki/Wikipedia:Databasedownload), which consists of 930,000 documents written by more than 89,000 authors on various topics, and hence is diverse in content and writing styles and is an ideal choice for measuring word similarity. Using the word-correlation factors in $M$, we compute the degree of similarity of any two sentences $S_1$ and $S_2$ by adding the word-correlation factors of each word in $S_1$ with every word in $S_2$ as follows:

\[ Sim(S_1, S_2) = \sum_{i=1}^{n} \sum_{j=1}^{m} wcf(w_i, w_j) \tag{2} \]

where $w_i$ ($w_j$, respectively) is a word in $S_1$ ($S_2$, respectively), $n$ ($m$, respectively) is the number of words in $S_1$ ($S_2$, respectively), and $wcf(w_i, w_j)$ is defined in Equation 1.

3.1.2. Most representative sentences

CorSum identifies sentences in a document $D$ that most accurately represent the content of $D$ during the summarization process. To determine which sentences should be included in the summary of $D$, CorSum computes the overall similarity of each sentence $S_i$ in $D$, denoted $OS(S_i)$, with respect to the other sentences in $D$ as

\[ OS(S_i) = \sum_{j=1, j \neq i}^{n} Sim(S_i, S_j) \tag{3} \]

where $n$ is the number of sentences in $D$, $S_j$ is a sentence in $D$, and $Sim(S_i, S_j)$ is defined in Equation 2.

^b Stopwords are commonly occurring words such as articles, prepositions, and conjunctions, which are poor discriminators in representing the content of a sentence (or document), whereas stemmed words are words reduced to their grammatical root. From now on, unless stated otherwise, whenever we refer to words, we mean non-stop, stemmed words.
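As a rough illustration of Equations 1 to 3, the following Python sketch computes word-correlation factors from one tokenized document and then derives sentence similarity and overall similarity from them. It is not the authors' implementation. The function names, the toy input, the choice to skip pairs of identical words, and the choice to treat unseen word pairs as having factor 0 are all assumptions made here; the sketch also handles a single document only and does not attempt to show how factors are combined across a collection.

```python
from collections import defaultdict
from itertools import combinations


def word_correlation_factors(tokens):
    """Equation 1 for a single document (hypothetical helper).

    tokens: the document as a list of non-stop, stemmed words.
    Returns a dict mapping an unordered word pair to the sum of 1/d(x, y).
    """
    wcf = defaultdict(float)
    for (pos_x, x), (pos_y, y) in combinations(enumerate(tokens), 2):
        if x == y:
            continue  # assumption: identical words are skipped in this sketch
        # d(x, y) = words in between + 1, which equals the position difference
        wcf[frozenset((x, y))] += 1.0 / (pos_y - pos_x)
    return wcf


def sim(s1, s2, wcf):
    """Equation 2: add wcf for each word in s1 paired with every word in s2."""
    # assumption: word pairs missing from the table contribute 0
    return sum(wcf.get(frozenset((wi, wj)), 0.0) for wi in s1 for wj in s2)


def overall_similarity(sentences, wcf):
    """Equation 3: OS(S_i) is the sum of Sim(S_i, S_j) over all j != i."""
    return [
        sum(sim(si, sj, wcf) for j, sj in enumerate(sentences) if j != i)
        for i, si in enumerate(sentences)
    ]


# Toy input: three one-word "sentences" drawn from a three-word document.
toy_wcf = word_correlation_factors(["web", "document", "summary"])
toy_sentences = [["web"], ["document"], ["summary"]]
toy_os = overall_similarity(toy_sentences, toy_wcf)
print(toy_os)
```

In the toy document, adjacent words have distance 1 and the two outer words have distance 2, so the middle sentence ends up with the largest overall similarity.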
Classifying Summaries of Web Documents 471

Fig. 1. A document $D$ from the DUC-2002 dataset with (the most representative, highlighted) sentences extracted by CorSum to form the summary of $D$.

We rely on the Odds ratio (Ref. 10), which is defined as the ratio of the probability ($p$) that an event occurs to the probability ($1 - p$) that it does not, i.e., $Odds = \frac{p}{1 - p}$, to compute the Ranking value of $S_i$ in $D$. We treat $OS(S_i)$ as the positive evidence of $S_i$ in representing the content of $D$. The Ranking value of $S_i$ determines the content significance of $S_i$ in $D$: the higher the Ranking value of $S_i$ is, the more significant (content-wise) $S_i$ is in $D$. It is defined as

\[ Ranking(S_i) = \frac{OS(S_i)}{1 - OS(S_i)} \tag{4} \]

Having computed the Ranking value of each sentence in $D$, CorSum chooses the top $N$ ($\geq 1$) ranked sentences in $D$ as the most representative sentences of $D$ to create the summary of $D$. In establishing the proper value of $N$, i.e., the number of representative sentences to be included in a summary, we follow the results of a study reported in Ref. 23 on two different popular datasets, i.e., Reuters-21578 (http://archive.ics.uci.edu/ml/databases/reuters21578/) and WebKB (http://www.cs.umd.edu/ sen/lbc-proj/data/WebKB.tgz), using different classifiers, which concludes that in general a summary with six sentences can accurately represent the overall content of a document. More importantly, Mihalcea and Hassan (Ref. 23) show in their study that one can achieve the highest accuracy by using summaries with six sentences for clustering/classification, an assumption we adopt in designing CorSum. If a document $D$ contains fewer than six sentences, CorSum includes the entire content of $D$ as the summary of $D$.

Example 1. Figure 1 shows a document $D$ from the DUC-2002 dataset (to be introduced in Section 4.1) in which the six most representative sentences (i.e., the sentences whose Ranking values are higher than those of the remaining ones) are highlighted, whereas Table 1 shows the Ranking values of the sentences in $D$, and Table 2 includes the degrees of similarity of the highest ranked (i.e., Sentence 10) and lowest ranked (i.e., Sentence 11) sentences with the remaining sentences in $D$.
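The selection step can be sketched in Python as follows. This is only an illustration under stated assumptions, not CorSum itself: the function names and sample values are hypothetical, the overall-similarity values are assumed to be already scaled into [0, 1) so that the odds ratio in Equation 4 is well defined (this section does not say how that scaling is done), and returning the chosen sentences in their original document order is a choice made here.

```python
def ranking(os_value):
    """Equation 4: the odds ratio OS / (1 - OS)."""
    if not 0.0 <= os_value < 1.0:
        # assumption: OS has been scaled into [0, 1) before ranking
        raise ValueError("this sketch assumes OS is scaled into [0, 1)")
    return os_value / (1.0 - os_value)


def summarize(sentences, os_values, n=6):
    """Pick the n highest ranked sentences (hypothetical helper).

    sentences: the sentences of a document, in document order.
    os_values: one overall-similarity value per sentence.
    """
    if len(sentences) <= n:
        # short documents are used in full as their own summary
        return list(sentences)
    by_rank = sorted(
        range(len(sentences)),
        key=lambda i: ranking(os_values[i]),
        reverse=True,
    )
    # assumption: keep the selected sentences in their original order
    return [sentences[i] for i in sorted(by_rank[:n])]


# Made-up values for illustration only; they are not taken from Table 1.
demo_sentences = ["a", "b", "c", "d"]
demo_os = [0.2, 0.8, 0.5, 0.1]
print(summarize(demo_sentences, demo_os, n=2))
```

Because the odds ratio increases monotonically on [0, 1), ranking by it yields the same order as ranking by the overall similarity itself; the ratio only changes the spacing between the scores.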
Page 1 and 2: International Journal on Artificial