12.07.2015 Views

A naïve Bayes Classifier for Web Document Summarie...


484 M. S. Pera & Y.-K. Ng

as well as the performance of MNB, were performed on an HP workstation running under the Windows 7 operating system, with 2 Intel Core Duo 3.166 GHz processors, 8 GB RAM, and a hard disk of 460 GB.

5. Conclusions

Locating relevant information on the web in a timely manner is often a challenging task, even using well-known web search engines, due to the vast amount of data available for the users to process. Although retrieved documents can be pre-categorized based on their contents using a text classifier, web users are still required to analyze the entire documents in each category (or class) to determine their relevance with respect to their information needs. To assist web users in speeding up the process of identifying relevant web information, we have introduced CorSum, an extractive summarization approach which requires only precomputed word similarity to select the most representative sentences of a document D (that capture its main content) as the summary of D. We further enhance CorSum by considering the significance factor of a sentence S, in addition to the correlation factors of the words in S, to measure the relative degree of representativeness of S with respect to the content of the document D to which S belongs. We denote the enhanced summarization approach CorSum-SF. CorSum-SF selects the most representative sentences in D based on their combined significance factor and ranking score, which are the highest among all the sentences in D, to create the summary of D. We have also used summaries generated by CorSum-SF to train a multinomial Naïve Bayes (MNB) classifier and verified its effectiveness and efficiency in performing the classification task.

Empirical studies conducted using the DUC-2002 dataset have shown that CorSum-SF creates high-quality summaries compared with other well-known extractive summarization approaches. Furthermore, by applying the MNB classifier on CorSum-SF generated summaries of the news articles in the 20NG dataset, we have validated that in classifying a large document collection C, the classification task using CorSum-SF generated summaries is an order of magnitude faster than using the entire documents in C, with comparable accuracy.

For future work, we will consider applying feature extractors and selectors, such as sentence length, topical words, mutual information, or log-likelihood ratio, on a classifier to (i) further enhance the classification accuracy when using CorSum-SF generated summaries and (ii) minimize the classifier's training and classification time.
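To make the summary-based classification step concrete, the following is a minimal sketch of a multinomial Naïve Bayes classifier trained on short, already tokenized summaries. It is an illustration under stated assumptions, not the authors' implementation: the function names `train_mnb` and `classify`, the add-one (Laplace) smoothing parameter `alpha`, and the toy summaries and labels are all hypothetical, and the summarization step itself (CorSum-SF) is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def train_mnb(docs, labels, alpha=1.0):
    """Fit a multinomial Naive Bayes model on tokenized documents.

    docs   : list of token lists (here: toy stand-ins for summaries)
    labels : list of class labels, one per document
    alpha  : additive smoothing constant (an assumption; 1.0 = Laplace)
    """
    class_docs = Counter(labels)            # number of documents per class
    word_counts = defaultdict(Counter)      # per-class word frequencies
    vocab = set()
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
        vocab.update(tokens)

    n_docs = len(docs)
    model = {"log_prior": {}, "log_lik": {}}
    for c in class_docs:
        model["log_prior"][c] = math.log(class_docs[c] / n_docs)
        # Smoothed denominator: all word occurrences in class c plus alpha per vocab word.
        total = sum(word_counts[c].values()) + alpha * len(vocab)
        model["log_lik"][c] = {
            w: math.log((word_counts[c][w] + alpha) / total) for w in vocab
        }
    return model


def classify(model, tokens):
    """Return the class with the highest log-posterior score.

    Words that never appeared in training are ignored.
    """
    scores = {}
    for c, log_prior in model["log_prior"].items():
        log_lik = model["log_lik"][c]
        scores[c] = log_prior + sum(log_lik[w] for w in tokens if w in log_lik)
    return max(scores, key=scores.get)


# Toy data (hypothetical, not taken from the 20NG dataset).
summaries = [
    ["team", "won", "match"],
    ["player", "scored", "goal"],
    ["senate", "passed", "bill"],
    ["president", "signed", "bill"],
]
classes = ["sports", "sports", "politics", "politics"]

model = train_mnb(summaries, classes)
predicted_sports = classify(model, ["player", "won", "goal"])
predicted_politics = classify(model, ["senate", "bill"])
```

Because both training and classification cost grow with the number of tokens processed, feeding the classifier summaries instead of full documents reduces the work per document, which is consistent with the speed-up reported above.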
