A naïve Bayes Classifier for Web Document Summarie...


484 M. S. Pera & Y.-K. Ng

as well as the performance of MNB, were performed on an HP workstation running under the Windows 7 operating system, with 2 Intel Core Duo 3.166 GHz processors, 8 GB RAM, and a hard disk of 460 GB.

5. Conclusions

Locating relevant information on the web in a timely manner is often a challenging task, even using well-known web search engines, due to the vast amount of data available for the users to process. Although retrieved documents can be precategorized based on their contents using a text classifier, web users are still required to analyze the entire documents in each category (or class) to determine their relevance with respect to their information needs. To assist web users in speeding up the process of identifying relevant web information, we have introduced CorSum, an extractive summarization approach which requires only precomputed word similarity to select the most representative sentences of a document D (that capture its main content) as the summary of D. We further enhance CorSum by considering, in addition to the correlation factors of the words in a sentence S, the significance factor of S to measure the relative degree of representativeness of S with respect to the content of the document D to which S belongs. We denote the enhanced summarization approach CorSum-SF. CorSum-SF selects the most representative sentences in D, i.e., those whose combined significance factor and ranking score are the highest among all the sentences in D, to create the summary of D. We have also used summaries generated by CorSum-SF to train a multinomial Naïve Bayes (MNB) classifier and verified its effectiveness and efficiency in performing the classification task.

Empirical studies conducted using the DUC-2002 dataset have shown that CorSum-SF creates high-quality summaries compared with other well-known extractive summarization approaches.
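
The selection step summarized above (keep the sentences of D whose combined significance factor and ranking score are highest) can be illustrated with a minimal Python sketch. It assumes both scores have already been computed for every sentence. The function name `select_summary`, the parameter `k`, and the use of a plain sum to combine the two scores are hypothetical choices made for illustration; this section does not state how CorSum-SF actually combines them.

```python
from typing import List, Sequence


def select_summary(sentences: Sequence[str],
                   rank_scores: Sequence[float],
                   sig_factors: Sequence[float],
                   k: int) -> List[str]:
    """Return the k highest-scoring sentences, kept in document order."""
    if not (len(sentences) == len(rank_scores) == len(sig_factors)):
        raise ValueError("one ranking score and one significance factor per sentence")
    # Assumption: the two scores are combined by a plain sum.
    combined = [r + s for r, s in zip(rank_scores, sig_factors)]
    # Indices of the k best sentences; ties go to the earlier sentence.
    top = sorted(range(len(sentences)), key=lambda i: (-combined[i], i))[:k]
    # Restore document order so the extract reads naturally.
    return [sentences[i] for i in sorted(top)]


# Toy input: four sentences with made-up scores.
summary = select_summary(["A", "B", "C", "D"],
                         [0.2, 0.9, 0.4, 0.7],
                         [0.1, 0.3, 0.9, 0.1],
                         k=2)
print(summary)
```

Returning the chosen sentences in their original order is a design choice for readability of the extract; a ranking-ordered summary would only need the final `sorted(top)` replaced by `top`.
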
Furthermore, by applying the MNB classifier on CorSum-SF generated summaries of the news articles in the 20NG dataset, we have validated that in classifying a large document collection C, the classification task using CorSum-SF generated summaries is an order of magnitude faster than using the entire documents in C, with comparable accuracy.

For future work, we will consider applying feature extractors and selectors, such as sentence length, topical words, mutual information, or log-likelihood ratio, on a classifier to (i) further enhance the classification accuracy on using CorSum-SF generated summaries and (ii) minimize the classifier's training and classification time.

Classifying Summaries of Web Documents 485
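
As a rough illustration of training a multinomial Naïve Bayes classifier on summaries rather than full documents, the following Python sketch implements a small MNB with add-one (Laplace) smoothing using only the standard library. It is not the authors' implementation: the class name, the smoothing choice, the handling of unseen words, and the toy "summaries" and labels are all assumptions made for this example.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, labeled_docs):
        self.class_docs = Counter()               # number of documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        self.vocab = set()
        for tokens, label in labeled_docs:
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        self.total_docs = sum(self.class_docs.values())
        return self

    def predict(self, tokens):
        best_label, best_score = None, -math.inf
        for label in sorted(self.class_docs):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior of the class plus log likelihood of each known word.
            score = math.log(self.class_docs[label] / self.total_docs)
            for tok in tokens:
                if tok in self.vocab:  # assumption: unseen words are ignored
                    score += math.log((counts[tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical short summaries standing in for summarizer output.
toy_summaries = [
    ("team wins final match".split(), "sports"),
    ("player scores late goal".split(), "sports"),
    ("new graphics card released".split(), "computers"),
    ("software update fixes bug".split(), "computers"),
]
clf = MultinomialNB().fit(toy_summaries)
print(clf.predict("late goal in the final".split()))
```

Because each training item is a short summary, the per-class word counts and the number of log-probability terms evaluated per prediction are both smaller than with full documents, which is consistent with the speed-up reported in the conclusions.
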
Page 1 and 2: International Journal on Artificial