A naïve Bayes Classifier for Web Document Summarie...

468 M. S. Pera & Y.-K. Ng

Antiqueira et al. 1 adopt metrics and concepts from complex networks to select the sentences in a document D that compose the summary of D. The authors represent D as a graph in which the sentences in D are denoted as nodes, and sentences that share common nouns are connected by edges. Thereafter, the nodes in the graph are ranked according to various network measurements, such as the (i) number of nodes a particular node is connected to, (ii) length of the shortest path between any two nodes, (iii) locality index, which identifies central nodes in partition groups of nodes, and (iv) network modularity, which measures the proportion of edges connecting intra-community nodes,ᵃ and the highest-ranked nodes are chosen to create the corresponding summary. Unlike the approach in Ref. 8, which depends on sentence-based features to train the proposed summarizer, or the approach in Ref. 1, which relies on network-based features to create the summary of a document, CorSum-SF relies solely on the word-correlation factors among (the words in) the sentences within a document D and the sentence significance factor to determine the sentences that should be included in the summary of D.

The authors in Ref. 15 claim that many existing methodologies treat summarization as a binary classification problem (i.e., sentences are either included in or excluded from a summary), which generates redundant, unbalanced, and low-recall summaries. To solve this problem, Li et al. 15 propose a Support Vector Machine summarization method in which summaries (i) are diverse, i.e., they include as few redundant sentences as possible, (ii) contain (most of) the important aspects of the corresponding documents, and (iii) are balanced, i.e., they emphasize different aspects of the corresponding documents. By selecting the most representative sentences, CorSum-SF, similar to the approach in Ref. 15, creates summaries that are balanced and diverse, but it does not require prior training to generate summaries, and hence is less computationally expensive than the summarization approach in Ref. 15.

Besides using text summarization for capturing the main content of web documents, constructed summaries can be further classified. Yang and Pedersen 34 present several feature-selection approaches for text classification and compare the performance of two classifiers, K Nearest Neighbor (KNN) and Linear Least Squares Fit mapping (LLSF). The classifiers compute the confidence score CS of a document D in each category. CS in KNN is determined by the degrees of similarity of D with respect to the K nearest training documents in each category, whereas LLSF calculates the CS of D in each category using a regression model based on the words in D.

McCallum and Nigam 22 discuss the differences between the Multi-variate Bernoulli and Multinomial Naïve Bayes classifiers. Multi-variate Bernoulli represents a document D using binary attributes, indicating the absence and occurrence

ᵃThe concept of communities is defined in Ref. 5 as sets of nodes that are highly interconnected, whereas other sets of nodes are scarcely connected to each other.
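As a hypothetical illustration of the network-based ranking attributed to Ref. 1 above, the sketch below builds the sentence graph (an edge between two sentences that share a common noun) and ranks nodes by measure (i), the number of nodes a node is connected to, only; the shortest-path, locality, and modularity measures the authors also use are omitted. The function name and inputs are invented for this example and are not from the paper.

```python
from itertools import combinations

def summarize_by_degree(sentences, nouns_per_sentence, k=2):
    """Rank sentences by their degree in a graph whose edges link
    sentences sharing a common noun; the k best-connected sentences
    (in document order) form the summary.  Hypothetical sketch only."""
    n = len(sentences)
    degree = [0] * n
    # Connect every pair of sentences whose noun sets intersect.
    for i, j in combinations(range(n), 2):
        if nouns_per_sentence[i] & nouns_per_sentence[j]:
            degree[i] += 1
            degree[j] += 1
    # Pick the k highest-degree sentences, restored to document order.
    ranked = sorted(range(n), key=lambda i: degree[i], reverse=True)
    return [sentences[i] for i in sorted(ranked[:k])]
```

For instance, in a four-sentence document where only the first two sentences are well connected through shared nouns, those two are selected as the summary.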
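The representational difference McCallum and Nigam draw can be made concrete. The sketch below, with invented helper names, encodes a tokenized document against a fixed vocabulary both ways: binary presence/absence attributes (the Multi-variate Bernoulli view) versus per-word occurrence counts (the Multinomial view); it is a minimal illustration of the two document models, not either classifier.

```python
from collections import Counter

def bernoulli_features(doc_tokens, vocab):
    """Multi-variate Bernoulli view: one binary attribute per
    vocabulary word, recording only its presence or absence in D."""
    present = set(doc_tokens)
    return [1 if w in present else 0 for w in vocab]

def multinomial_features(doc_tokens, vocab):
    """Multinomial view: one count per vocabulary word, so repeated
    occurrences of a word in D contribute repeatedly."""
    counts = Counter(doc_tokens)
    return [counts[w] for w in vocab]
```

A document in which a word occurs twice is indistinguishable, under the Bernoulli view, from one in which it occurs once; only the Multinomial view preserves that frequency information.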
