
Web classification based features extracted and K-NN algorithm



British Journal of Science, April 2013, Vol. 8 (2)

documents that contain the terms "stemmer" and "stems" because all share the common root word "stem". It also has applications in machine translation, document summarization, and text classification [14]. It is also essential for supporting effective Indonesian information retrieval, and has uses as diverse as defense intelligence applications, document translation, and web search.

3. Features extraction

A number of feature formulas are used in this paper:

1. $W_{ij} = n_{ij} \cdot \log(n / n_j)$ (1)
where $n_{ij}$ is the frequency of the term $t_j$ in document $i$, $n_j$ is the number of documents containing the term $t_j$, and $n$ is the number of documents in the collection. This formula represents the weight of term $t_j$ in document $i$, for $i = 1, \dots, n$ and $j = 1, \dots, m$.

2. In the vector space model, the feature weight is computed using TF-IDF, as follows:
… (2)
where TF(t, d) is the frequency of t in d, IDF(t) is the inverse document frequency, namely IDF = 1/DF(t), and δ is an adjustment coefficient, usually about 0.01.

3. The frequency of feature t in the whole training corpus (TC) is called the overall frequency (OF), which is defined as:
… (3)

4. The information contained in the document can be defined as:
Entropy = … (4)
where $p_{ij}$ is the frequency of the term in the document.

5. The distribution of the term in the document is given by:
Energy = … (5)

All features are then grouped for each document, and the features of each document are summed.

© 2013 British Journals ISSN 2047-3745
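As a minimal sketch of the term weight in Eq. (1), the computation could look as follows. It assumes natural logarithms (the paper does not state the base) and documents that are already tokenized into lists of terms. The function name `term_weights` and the toy documents are illustrative assumptions, not part of the original system.

```python
import math
from collections import Counter


def term_weights(docs):
    """Return one {term: weight} dict per document, W_ij = n_ij * log(n / n_j).

    docs is a list of token lists (hypothetical input format).
    """
    n = len(docs)                                # n: documents in the collection
    counts = [Counter(doc) for doc in docs]      # n_ij: term frequency per document
    doc_freq = Counter()                         # n_j: documents containing term t_j
    for c in counts:
        doc_freq.update(c.keys())
    return [
        {term: freq * math.log(n / doc_freq[term]) for term, freq in c.items()}
        for c in counts
    ]


# Toy collection, made up for illustration only.
docs = [
    ["web", "mining", "web"],
    ["image", "processing", "web"],
    ["neural", "network"],
]
weights = term_weights(docs)
```

Under this formula a term that occurs in every document gets weight 0, since log(n / n_j) = log(1) = 0, so only terms that discriminate between documents contribute.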


The system was also tested on different web mining pages; the result is shown in Figure 4.

Figure 4: The implementation of the proposed system for web mining

When the proposed system was tested on other web subjects, such as image processing, the resulting classification is shown in Figure 5.


Figure 5: The implementation of the proposed system for image processing web pages
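The classification step can be illustrated with a minimal k-NN sketch. The paper does not specify the distance measure, the value of k, or the layout of the feature vector, so Euclidean distance, k = 3, and two-dimensional vectors are assumptions made here for illustration. The function name `knn_classify` and all feature values are hypothetical.

```python
import math
from collections import Counter


def knn_classify(train, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors.

    train is a list of (feature_vector, label) pairs; Euclidean distance
    is an assumption, not something stated in the paper.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up feature vectors standing in for per-document feature sums.
train = [
    ((0.9, 0.1), "mining"),
    ((0.8, 0.2), "mining"),
    ((0.7, 0.3), "mining"),
    ((0.1, 0.9), "image processing"),
    ((0.2, 0.8), "image processing"),
    ((0.3, 0.7), "image processing"),
]
predicted = knn_classify(train, (0.85, 0.15), k=3)
```

In practice each vector would hold the features from Section 3 (weight, overall frequency, TF-IDF, entropy, energy), and the features would usually be scaled to comparable ranges so that no single one dominates the distance.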


7. Conclusions

1- The k-NN algorithm is the most suitable method for classifying these types of webs.
2- In the classification process for the network security, mining, and database webs, the features weight, overall frequency, TF-IDF, energy, and entropy show stable behavior in all webs of this type.
3- The relation between some web subjects allows them to be interrelated in the classification process, for example image processing and neural networks. Since most recognition methods in image processing depend on neural networks, the features weight, overall frequency, and entropy are used to classify the image processing webs, while the other features, TF-IDF and energy, classify the related neural network webs.
4- There is a relation between the title and the contents of the webs and the classification process.

8. References

1. Dumais, S., Chen, H.: Hierarchical classification of web content. In: Proceedings of the 23rd annual international ACM SIGIR conference on Research and development in information retrieval. pp. 256-263 (2000).
2. Forman, G.: An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3, 1289-1305 (2003).
3. Zu Eissen, S. M. and B. Stein (2004). Genre classification of web pages. In Proceedings of the 27th German Conference on Artificial Intelligence, Volume 3238 of LNCS, Berlin, pp. 256–269. Springer.
4. Gyöngyi, Z. and H. Garcia-Molina (2005b, May). Web spam taxonomy. In B. D. Davison (Ed.), Proceedings of the First International Workshop on Adversarial Information Retrieval (AIRWeb), Bethlehem, PA, pp. 39–47. Lehigh University, Department of Computer Science. Technical Report LU-CSE-05-030.
5. Castillo, C., D. Donato, A. Gionis, V. Murdock, and F. Silvestri (2007). Know your neighbors: Web spam detection using the web topology. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY. ACM Press. In press.
6. Huang, C.-C., S.-L. Chuang and L.-F. Chien (2004a). Live classifier: Creating hierarchical text classifiers through web corpora. In WWW ’04: Proceedings of the 13th International Conference on World Wide Web, New York, NY, pp. 184–192. ACM Press.
7. Huang, C.-C., S.-L. Chuang and L.-F. Chien (2004b). Using a web-based categorization approach to generate thematic metadata from texts. ACM Transactions on Asian Language Information Processing (TALIP) 3(3), 190–212.
8. Hammami, M., Y. Chahir, and L. Chen (2003). Web guard: Web based adult content detection and filtering system. In WI ’03: Proceedings of the 2003 IEEE/WIC International Conference on Web Intelligence, Washington, DC, pp. 574. IEEE Computer Society.
9. Chen, Z., O. Wu, M. Zhu, and W. Hu (2006). A novel web page filtering system by combining texts and images. In WI ’06: Proceedings of the 2006 IEEE/WIC/ACM International Conference on Web Intelligence, Washington, DC, pp. 732–735. IEEE Computer Society.
10. Zhang, D. and W. S. Lee (2003). Question classification using support vector machines. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, pp. 26–32. ACM Press.


11. Chakrabarti, S., M. van den Berg, and B. Dom (1999, May). Focused crawling: A new approach to topic-specific Web resource discovery. In WWW ’99: Proceeding of the 8th International Conference on World Wide Web, New York, NY, pp. 1623–1640. Elsevier.
12. Maggie Johnson and Julie Zelenski, Lexical Analysis, June 25, 2008.
13. Ho, T. K. (1999). Fast Identification of Stop Words for Font Learning and Keyword Spotting. In Proceedings of the Fifth International Conference on Document Analysis and Recognition (pp. 333-336). IEEE Computer Society.
14. Orasan, C., Pekar, V. & Hasler, L. (2004), A comparison of summarization methods based on term specificity estimation, in "Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC2004)", Lisbon, Portugal, pp. 1037-1041.
15. Yang, Y., Liu, X.: A re-examination of text categorization methods. In: Proceedings of the 22nd annual international ACM SIGIR conference on Research and development in information retrieval. pp. 42-49 (1999).
16. Hao Luo, Faxin Yu, Zheming Lu & Pinghui Wang (2010), "Three-dimensional model analysis and processing", Advanced topics in science and technology, Springer.
