
WWW/Internet - Portal do Software Público Brasileiro

IADIS International Conference WWW/Internet 2010

features. For the development of this blog crawler, three algorithms were considered: one for recognizing blog pages, one for blog crawling, and one for link extraction.

Although all the works described above explore important and interesting features of the blogosphere, none is directly concerned with blog retrieval and blog description. For Glance et al., blog retrieval is important; however, they use metrics that do not increase the description level of the blogs. Wei-jiang et al. focused on blog searching, considering some aspects of extraction and information retrieval. Finally, Hurst and Maykov treated blog searching as the most important requirement of their system.

In our framework's instantiation, we used tagging systems to enhance the semantic level of retrieved blogs, improving the search process. Moreover, all the cited works focus on one specific domain. In contrast, our proposal is more open and general: the user needs little effort to extend our framework to create new applications. In other words, our proposed framework sits at a higher level of abstraction, which facilitates customizing blog crawlers within any larger application.

5. CONCLUSION AND FUTURE WORK

The main purpose of this work was to propose a framework for building context-based blog crawlers. The proposed framework comprises a large set of tools and keeps a high level of abstraction for the developer, providing easy access to its tools and APIs. Many of its aspects were detailed.
The potential of our framework was evidenced by an instantiation on a typical blog extraction task that achieved satisfactory results: an average precision of 73.46% and an average recall of 71.92%.

The framework offers many features recommended by the software engineering field, such as reliability, risk reduction, rapid prototype development, and easy maintenance, among others. Furthermore, our framework provides access to many tools and is able to deal with multiple languages.

As future work, we intend to: (1) perform a deeper evaluation of an improved version of the SummaryStrategy algorithm (with additional heuristics) and compare it with other layout template-detection algorithms; (2) add an information extraction module to enable further blog analysis; and finally, (3) provide a component for combining the tags suggested by several blog indexing services, based on fine-grained information extraction performed by the aforementioned module. Moreover, this same component would be able to classify newly retrieved untagged blog pages, thereby improving the robustness of our framework.

REFERENCES

Arasu, A. et al., 2001. Searching the web. ACM Transactions on Internet Technology, 1(1), pp. 2-43.
Berwick, R. C., Abney, S. P. and Tenny, C., 1991. Principle-Based Parsing. Computation and Psycholinguistics. SIGART Bull, 3(2), pp. 26-28.
Baeza-Yates, R. A. and Ribeiro-Neto, B., 1999. Modern Information Retrieval. Boston: Addison-Wesley Longman Publishing.
Chau, M., Lam, P., Shiu, B., Xu, J. and Cao, J., 2009. A Blog Mining Framework. Social Network Applications. IEEE Computer Society.
Frakes, W. B. and Baeza-Yates, R. A., 1992. Information Retrieval: Data Structures & Algorithms. Upper Saddle River: Prentice-Hall.
Glance, N. S., Hurst, M. and Tomokiyo, T., 2004. Blogpulse: Automated trend discovery for weblogs. In: WWW 2004, workshop on the weblogging ecosystem: aggregation, analysis and dynamics.
New York City, USA, 17-22.
Hotho, A. et al., 2005. A Brief Survey of Text Mining. GLDV Journal for Computational Linguistics and Language Technology, 20(1), pp. 19-62.
Hurst, M. and Maykov, A., 2009. Social streams blog crawler. In: Proceedings of the 2009 IEEE International Conference on Data Engineering, Shanghai, China.
Isotani, S. et al., 2009. Estado da arte em web semântica e web 2.0: Potencialidades e tendências da nova geração de ambientes de ensino na internet. Revista Brasileira de Informática na Educação, 17(1), pp. 30-42.
