A naïve Bayes Classifier for Web Document Summarie...

468 M. S. Pera & Y.-K. Ng

Antiqueira et al. 1 adopt metrics and concepts from complex networks to select the sentences in a document D that compose the summary of D. The authors represent D as a graph in which the sentences in D are denoted as nodes, and sentences that share common nouns are connected by edges. Thereafter, the nodes in the graph are ranked according to various network measurements, such as the (i) number of nodes a particular node is connected to, (ii) length of the shortest path between any two nodes, (iii) locality index, which identifies central nodes in partition groups of nodes, and (iv) network modularity, which measures the proportion of edges connecting intra-community nodes,ᵃ and the highest-ranked nodes are chosen to create the corresponding summary. Unlike the approach in Ref. 8, which depends on sentence-based features to train the proposed summarizer, or the approach in Ref. 1, which relies on network-based features to create the summary of a document, CorSum-SF relies solely on the word-correlation factors among (the words in) the sentences within a document D and the sentence significance factor to determine the sentences that should be included in the summary of D.

The authors in Ref. 15 claim that many existing methodologies treat summarization as a binary classification problem (i.e., sentences are either included in or excluded from a summary), which generates redundant, unbalanced, and low-recall summaries. To solve this problem, Li et al. 15 propose a Support Vector Machine summarization method in which summaries (i) are diverse, i.e., they include as few redundant sentences as possible, (ii) contain (most of) the important aspects of the corresponding documents, and (iii) are balanced, i.e., they emphasize different aspects of the corresponding documents. By selecting the most representative sentences, CorSum-SF, similar to the approach in Ref. 15, creates summaries that are balanced and diverse, but it does not require prior training to generate summaries, and hence is less computationally expensive than the summarization approach in Ref. 15.

Besides using text summarization for capturing the main content of web documents, constructed summaries can be further classified. Yang and Pedersen 34 present several feature-selection approaches for text classification and compare the performance of two classifiers, K Nearest Neighbor (KNN) and Linear Least Squares Fit mapping (LLSF). The classifiers compute the confidence score CS of a document D in each category. CS in KNN is determined by the degrees of similarity of D with respect to the K nearest training documents in each category, whereas LLSF calculates the CS of D in each category using a regression model based on the words in D.

McCallum and Nigam 22 discuss the differences between the Multi-variate Bernoulli and Multinomial Naïve Bayes classifiers. Multi-variate Bernoulli represents a document D using binary attributes, indicating the absence and occurrence

ᵃThe concept of communities is defined in Ref. 5 as sets of nodes that are highly interconnected, whereas other sets of nodes are scarcely connected to each other.
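As a hypothetical illustration of the network-based ranking attributed to Ref. 1 above, the sketch below builds the sentence graph (an edge between two sentences that share a common noun) and ranks nodes by measure (i), the number of nodes a node is connected to, only; the shortest-path, locality, and modularity measures the authors also use are omitted. The function name and inputs are invented for this example and are not from the paper.

```python
from itertools import combinations

def summarize_by_degree(sentences, nouns_per_sentence, k=2):
    """Rank sentences by their degree in a graph whose edges link
    sentences sharing a common noun; the k best-connected sentences
    (in document order) form the summary.  Hypothetical sketch only."""
    n = len(sentences)
    degree = [0] * n
    # Connect every pair of sentences whose noun sets intersect.
    for i, j in combinations(range(n), 2):
        if nouns_per_sentence[i] & nouns_per_sentence[j]:
            degree[i] += 1
            degree[j] += 1
    # Pick the k highest-degree sentences, restored to document order.
    ranked = sorted(range(n), key=lambda i: degree[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

For instance, in a four-sentence document where only the first two sentences are well connected through shared nouns, those two are selected as the summary.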
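The representational difference McCallum and Nigam draw can be made concrete. The sketch below, with invented helper names, encodes a tokenized document against a fixed vocabulary both ways: binary presence/absence attributes (the Multi-variate Bernoulli view) versus per-word occurrence counts (the Multinomial view); it is a minimal illustration of the two document models, not either classifier.

```python
from collections import Counter

def bernoulli_features(doc_tokens, vocab):
    """Multi-variate Bernoulli view: one binary attribute per
    vocabulary word, recording only its presence or absence in D."""
    present = set(doc_tokens)
    return [1 if w in present else 0 for w in vocab]

def multinomial_features(doc_tokens, vocab):
    """Multinomial view: one count per vocabulary word, so repeated
    occurrences of a word in D contribute repeatedly."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]
```

A document in which a word occurs twice is indistinguishable, under the Bernoulli view, from one in which it occurs once; only the Multinomial view preserves that frequency information.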
