as well as the performance of MNB, were performed on an HP workstation running under the Windows 7 operating system, with 2 Intel Core Duo 3.166 GHz processors, 8 GB RAM, and a 460 GB hard disk.

5. Conclusions

Locating relevant information on the web in a timely manner is often a challenging task, even with well-known web search engines, because of the vast amount of data that users must process. Although retrieved documents can be pre-categorized based on their contents using a text classifier, web users are still required to analyze the entire documents in each category (or class) to determine their relevance with respect to their information needs. To help web users speed up the process of identifying relevant web information, we have introduced CorSum, an extractive summarization approach which requires only precomputed word similarity to select the most representative sentences of a document D (those that capture its main content) as the summary of D. We further enhance CorSum by considering the significance factor of a sentence S, in addition to the correlation factors of the words in S, to measure the relative degree of representativeness of S with respect to the content of the document D to which S belongs. We denote the enhanced summarization approach CorSum-SF. To create the summary of D, CorSum-SF selects the sentences in D whose combined significance factor and ranking score are the highest among all the sentences in D. We have also used summaries generated by CorSum-SF to train a multinomial Naïve Bayes (MNB) classifier and verified its effectiveness and efficiency in performing the classification task.

Empirical studies conducted using the DUC-2002 dataset have shown that CorSum-SF creates high-quality summaries compared with other well-known extractive summarization approaches.
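
As a rough illustration of the sentence-selection idea summarized above, the following Python sketch ranks sentences by a similarity-based score and a significance factor and keeps the top k. It is a minimal sketch under stated assumptions, not the CorSum-SF implementation: the word-similarity table `WORD_SIM`, the Luhn-style significance factor with a `min_count` threshold, the max-normalization, and the unweighted sum used to combine the two scores are all hypothetical choices made for this example and are not specified in this section.

```python
from collections import Counter
from itertools import product

# Hypothetical precomputed word-similarity table (symmetric, values in [0, 1]).
# Identical words score 1.0; unlisted pairs score 0.0.
WORD_SIM = {
    frozenset(("car", "vehicle")): 0.8,
    frozenset(("engine", "motor")): 0.9,
}

def word_sim(a, b):
    if a == b:
        return 1.0
    return WORD_SIM.get(frozenset((a, b)), 0.0)

def sentence_sim(s, t):
    """Sum of pairwise word similarities between two tokenized sentences."""
    return sum(word_sim(a, b) for a, b in product(s, t))

def ranking_scores(sentences):
    """Score each sentence by its total similarity to all other sentences."""
    return [
        sum(sentence_sim(s, t) for j, t in enumerate(sentences) if j != i)
        for i, s in enumerate(sentences)
    ]

def significance_factors(sentences, min_count=2):
    """Assumed Luhn-style factor: (significant words in S)^2 / |S|.

    A word counts as significant if it occurs at least min_count
    times in the whole document.
    """
    freq = Counter(w for s in sentences for w in s)
    factors = []
    for s in sentences:
        sig = sum(1 for w in s if freq[w] >= min_count)
        factors.append(sig * sig / len(s) if s else 0.0)
    return factors

def normalize(values):
    """Scale values so the largest becomes 1.0 (all zeros stay zero)."""
    top = max(values, default=0.0)
    return [v / top if top else 0.0 for v in values]

def summarize(sentences, k=1):
    """Return the indices of the k best sentences, in document order."""
    rank = normalize(ranking_scores(sentences))
    sig = normalize(significance_factors(sentences))
    combined = [r + f for r, f in zip(rank, sig)]  # assumption: unweighted sum
    best = sorted(range(len(sentences)),
                  key=lambda i: combined[i], reverse=True)[:k]
    return sorted(best)

doc = [
    ["car", "engine", "failed"],
    ["vehicle", "motor", "failed", "again"],
    ["weather", "was", "sunny"],
]
print(summarize(doc, k=1))  # → [0]
```

In this toy document the first two sentences are equally similar to the rest of the text, so the significance factor breaks the tie in favor of the shorter sentence; the third sentence shares no related words with the others and is never selected first.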
Furthermore, by applying the MNB classifier on CorSum-SF generated summaries of the news articles in the 20NG dataset, we have validated that, in classifying a large document collection C, the classification task using CorSum-SF generated summaries is an order of magnitude faster than using the entire documents in C, with comparable accuracy.

For future work, we will consider applying feature extractors and selectors, such as sentence length, topical words, mutual information, or log-likelihood ratio, to a classifier to (i) further enhance the classification accuracy when using CorSum-SF generated summaries and (ii) minimize the classifier's training and classification time.