12.07.2015 Views

A naïve Bayes Classifier for Web Document Summarie...



470 M. S. Pera & Y.-K. Ng

3.1.1. Word-Correlation factors and sentence similarity

The word-correlation matrix $M$ introduced in Ref. 12, which includes the correlation factors of non-stop, stemmed words^b, is a 54,625 × 54,625 symmetric matrix. The correlation factor of any two words $w_i$ and $w_j$, which indicates how closely related $w_i$ and $w_j$ are semantically, is computed based on the (i) frequency of co-occurrence and (ii) relative distance of $w_i$ and $w_j$ in each document $D$ in a collection and is defined as follows:

$$\mathit{wcf}(w_i, w_j) = \sum_{x \in V(w_i)} \sum_{y \in V(w_j)} \frac{1}{d(x, y)} \tag{1}$$

where $d(x, y)$ denotes the distance (i.e., the number of words in between) between $x$ and $y$ in $D$ plus 1, and $V(w_i)$ ($V(w_j)$, respectively) is the set of words that include $w_i$ ($w_j$, respectively) and its stem variations in $D$.

The word-correlation factors in $M$ were computed using the Wikipedia Database Dump (http://en.wikipedia.org/wiki/Wikipedia:Databasedownload), which consists of 930,000 documents written by more than 89,000 authors on various topics, and hence is diverse in content and writing styles and is an ideal choice for measuring word similarity. Using the word-correlation factors in $M$, we compute the degree of similarity of any two sentences $S_1$ and $S_2$ by adding the word-correlation factors of each word in $S_1$ with every word in $S_2$ as follows:

$$\mathit{Sim}(S_1, S_2) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathit{wcf}(w_i, w_j) \tag{2}$$

where $w_i$ ($w_j$, respectively) is a word in $S_1$ ($S_2$, respectively), $n$ ($m$, respectively) is the number of words in $S_1$ ($S_2$, respectively), and $\mathit{wcf}(w_i, w_j)$ is defined in Equation 1.

3.1.2. Most representative sentences

CorSum identifies sentences in a document $D$ that most accurately represent the content of $D$ during the summarization process. To determine which sentences should be included in the summary of $D$, CorSum computes the overall similarity of each sentence $S_i$ in $D$, denoted $\mathit{OS}(S_i)$, with respect to the other sentences in $D$ as

$$\mathit{OS}(S_i) = \sum_{j=1,\, j \neq i}^{n} \mathit{Sim}(S_i, S_j) \tag{3}$$

where $n$ is the number of sentences in $D$, $S_j$ is a sentence in $D$, and $\mathit{Sim}(S_i, S_j)$ is defined in Equation 2.

^b Stopwords are commonly occurring words such as articles, prepositions, and conjunctions, which are poor discriminators in representing the content of a sentence (or document), whereas stemmed words are words reduced to their grammatical root. From now on, unless stated otherwise, whenever we refer to words, we mean non-stop, stemmed words.
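
Equations 1 to 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`wcf_in_document`, `sim`, `overall_similarity`) are hypothetical, a document is assumed to be a list of already stemmed, stopword-free tokens (so $V(w)$ reduces to the occurrences of the stem $w$), the factor is computed for a single document only (how factors are combined across the collection is not described in this section), and pairing an occurrence with itself is skipped to avoid a zero distance.

```python
from itertools import product


def wcf_in_document(tokens, wi, wj):
    """Equation 1 restricted to one document (an assumption of this sketch).

    tokens: list of non-stop, stemmed words in document order.
    The distance d(x, y) is the number of words in between plus 1,
    which equals the absolute difference of the two token positions.
    """
    pos_i = [p for p, t in enumerate(tokens) if t == wi]
    pos_j = [p for p, t in enumerate(tokens) if t == wj]
    total = 0.0
    for x, y in product(pos_i, pos_j):
        if x != y:  # assumption: an occurrence is never paired with itself
            total += 1.0 / abs(x - y)
    return total


def sim(s1, s2, wcf):
    """Equation 2: add wcf(w_i, w_j) for every word pair across two sentences.

    wcf is any callable returning the correlation factor of two words,
    for example a lookup into a precomputed matrix.
    """
    return sum(wcf(wi, wj) for wi in s1 for wj in s2)


def overall_similarity(sentences, wcf):
    """Equation 3: for each sentence, sum its similarity to all other sentences."""
    return [
        sum(sim(si, sj, wcf) for j, sj in enumerate(sentences) if j != i)
        for i, si in enumerate(sentences)
    ]


# Toy usage with made-up tokens and a made-up symmetric lookup table.
doc = ["web", "document", "summar", "web", "classifi"]
print(wcf_in_document(doc, "web", "summar"))

toy_table = {("a", "b"): 0.5, ("b", "c"): 0.25}


def toy_wcf(wi, wj):
    # Symmetric lookup; unseen pairs get a factor of 0.
    return toy_table.get((wi, wj), toy_table.get((wj, wi), 0.0))


print(overall_similarity([["a"], ["b"], ["c"]], toy_wcf))
```

Passing `wcf` as a callable keeps the sentence-level functions independent of how the factors are stored, so a precomputed matrix such as $M$ could be swapped in for the toy table.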
