Web classification based features extracted and K-NN algorithm

British Journal of Science, April 2013, Vol. 8 (2). © 2013 British Journals, ISSN 2047-3745.

documents that contain the terms "stemmer" and "stems", because all share the common root word "stem". It also has applications in machine translation, document summarization, and text classification [14]. It is also essential to support effective Indonesian information retrieval, and has uses as diverse as defense intelligence applications, document translation, and web search.

3. Features extraction

Several feature formulas are used in this paper:

1. $W_{ij} = n_{ij} \log(n / n_j)$ (1)

where $n_{ij}$ is the frequency of the term $t_j$ in document $i$, and $n_j$ is the number of documents containing the term $t_j$. This formula represents the weight of term $t_j$ in document $i$, with $i = 1, \ldots, n$ and $j = 1, \ldots, m$, where $n$ is the number of documents in the collection.

2. In the vector space model, the feature weight is computed using TF-IDF, as follows:

… (2)

where TF(t, d) is the frequency of t in d, IDF(t) is the inverse document frequency, namely IDF = 1/DF(t), and δ is an adjustment coefficient, usually about 0.01.

3. The frequency of feature t in the whole training corpus (TC) is called the overall frequency (OF), which is defined as:

… (3)

4. The information contained in the document can be defined as:

Entropy = … (4)

where $p_{ij}$ is the frequency of the term in the document.

5. The distribution of the term in the document is given by:

Energy = … (5)

All features are then grouped for each document, and the features are summed for each document.
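As an illustration of Eq. (1) only, the following minimal Python sketch computes the weight of each term in each document and then sums the weights per document. It is not the authors' implementation: the function name `term_weights`, the toy documents, the use of the natural logarithm (the paper does not state a base), and the reading of "summation" as a per-document sum of term weights are all assumptions made for this example.

```python
import math
from collections import Counter


def term_weights(docs):
    """Return one {term: weight} dict per document, following Eq. (1).

    docs is a list of token lists. The weight is n_ij * log(n / n_j), where
    n_ij is the count of term j in document i, n is the number of documents,
    and n_j is the number of documents containing term j.
    The natural logarithm is an assumption; the paper gives no base.
    """
    n = len(docs)
    counts = [Counter(doc) for doc in docs]

    # n_j: number of documents in which each term occurs at least once
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())

    return [
        {term: n_ij * math.log(n / doc_freq[term]) for term, n_ij in c.items()}
        for c in counts
    ]


# Hypothetical toy corpus, already tokenized and stemmed
docs = [
    ["web", "mining", "web"],
    ["image", "processing", "web"],
    ["neural", "network", "image"],
]
weights = term_weights(docs)

# One summed feature value per document (an assumed reading of the text)
doc_totals = [sum(w.values()) for w in weights]
```

A term that occurs in every document receives a weight of zero under this formula, since log(n / n_j) = log(1) = 0, so such terms contribute nothing to the per-document total.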

The system was also tested on different web mining pages; the result is shown in Figure 4.

Figure 4: The implementation of the proposed system for web mining

When the proposed system was tested on other web subjects, such as image processing, the resulting classification is shown in Figure 5.

Figure 5: The implementation of the proposed system for image processing web pages

7. Conclusions

1- The k-NN algorithm is the most suitable method for classifying these types of web pages.

2- In the classification of network security, mining, and database web pages, the feature weight, overall frequency, TF-IDF, energy, and entropy show stable behavior across all web pages of this type.

3- The relation between some web subjects allows them to be interrelated in the classification process, as with image processing and neural networks. Since most recognition methods in image processing depend on neural networks, the feature weight, overall frequency, and entropy are used to classify the image processing web pages, while the other features, such as TF-IDF and energy, classify the related neural network web pages.

4- There is a relation between the title and contents of a web page and the classification process.

8. References

1. Dumais, S., Chen, H.: Hierarchical classification of web content. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. pp. 256-263 (2000).

2. Forman, G.: An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3, 1289-1305 (2003).

3. Zu Eissen, S. M. and B. Stein (2004). Genre classification of web pages. In Proceedings of the 27th German Conference on Artificial Intelligence, Volume 3238 of LNCS, Berlin, pp. 256–269. Springer.

4. Gyöngyi, Z. and H. Garcia-Molina (2005b, May). Web spam taxonomy. In B. D. Davison (Ed.), Proceedings of the First International Workshop on Adversarial Information Retrieval (AIRWeb), Bethlehem, PA, pp. 39–47. Lehigh University, Department of Computer Science. Technical Report LU-CSE-05-030.

5. Castillo, C., D. Donato, A. Gionis, V. Murdock, and F. Silvestri (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY. ACM Press. In press.

6. Huang, C.-C., S.-L. Chuang and L.-F. Chien (2004a). Live classifier: Creating hierarchical text classifiers through web corpora. In WWW ’04: Proceedings of the 13th International Conference on World Wide Web, New York, NY, pp. 184–192. ACM Press.

7. Huang, C.-C., S.-L. Chuang and L.-F. Chien (2004b). Using a web-based categorization approach to generate thematic metadata from texts. ACM Transactions on Asian Language Information Processing (TALIP) 3(3), 190–212.

8. Hammami, M., Y. Chahir, and L. Chen (2003). Web guard: Web based adult content detection and filtering system. In WI ’03: Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, Washington, DC, pp. 574. IEEE Computer Society.

9. Chen, Z., O. Wu, M. Zhu, and W. Hu (2006). A novel web page filtering system by combining texts and images. In WI ’06: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, Washington, DC, pp. 732–735. IEEE Computer Society.

10. Zhang, D. and W. S. Lee (2003). Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, pp. 26–32. ACM Press.

11. Chakrabarti, S., M. van den Berg, and B. Dom (1999, May). Focused crawling: A new approach to topic-specific Web resource discovery. In WWW ’99: Proceedings of the 8th International Conference on World Wide Web, New York, NY, pp. 1623–1640. Elsevier.

12. Maggie Johnson and Julie Zelenski, Lexical Analysis, June 25, 2008.

13. Ho, T. K. (1999). Fast Identification of Stop Words for Font Learning and Keyword Spotting. In Proceedings of the Fifth International Conference on Document Analysis and Recognition, pp. 333-336. IEEE Computer Society.

14. Orasan, C., Pekar, V. & Hasler, L. (2004). A comparison of summarization methods based on term specificity estimation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC2004), Lisbon, Portugal, pp. 1037-1041.

15. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. pp. 42-49 (1999).

16. Hao Luo, Faxin Yu, Zheming Lu & Pinghui Wang (2010). “Three-dimensional model analysis and processing”, Advanced topics in science and technology, Springer.
