13.07.2015 Views

WWW/Internet - Portal do Software Público Brasileiro


IADIS International Conference WWW/Internet 2010

the blogs. This approach is used by blog indexing services like Technorati, Icerocket and Blogcatalog¹. These companies maintain tagging services for Internet sites and blogs, considerably increasing the semantic level associated with them [Mathes, 2004]. For instance, if a blog post related to a math class is marked with an education tag, it is inserted into the educational context, which facilitates the search process and provides better search results to the user.

In this light, in order to enable users to perform their searches, the text and information related to each blog need to be properly indexed and stored. This task is performed by blog crawlers [Arasu et al., 2001]. In order to build a blog crawler, we should consider many aspects of the blogs themselves, such as the language they use, the blog indexing service, preprocessing tasks, and indexing techniques. In addition, the blog crawler should be easily extendable, due to the dynamic nature of the Web.

Thus, a good framework for constructing blog crawlers should address all these issues [Johnson and Foote, 1988]. With a framework that is easily extended to different applications, users could create specific blog crawlers with little effort.

Accordingly, we propose a framework for building context-based blog crawlers. The framework uses tags to increase the semantic level of the blogs and provides many services, such as preprocessing, indexing and general text extraction from HTML. We also present an instantiation of the framework based on Technorati's tagging system.

This article is structured as follows: Section 2 details the components of the framework's architecture. An example of how to instantiate the proposed framework, as well as first results and discussion, are shown in Section 3. Related work is presented in Section 4. Finally, in Section 5, we present our conclusions and future work.

2. THE FRAMEWORK'S ARCHITECTURE

The proposed system architecture consists of a white-box framework which allows the development of context-based blog crawlers. This framework provides application services such as text preprocessing, blog indexing, text extraction from HTML pages, and an API allowing the user to easily implement data persistence. The general architecture is shown in Figure 1. The aforementioned services are explained in more detail below.

2.1 The Crawler Module

This is the main module of the framework. The Application and TagParser classes from this module are directly connected to the framework's core. In order to create an application with it, the user must extend these two classes. The user can configure some aspects of the application by initializing a few attributes when extending the Application class. For example, she can define the language that will be used by the crawler, the preprocessing algorithm, and the text extraction method, just to mention a few. Concerning the TagParser class, the only requirement for using it is to define the tagSearch() method, which executes a search and returns a list of blogs matching a given tag. This is discussed further in Section 3.1.

2.2 The Application and General Services Modules

The Application Services Module provides many services that are used in the crawling process. These services also allow users to create applications for other scenarios; that is, the framework is not limited to the creation of blog crawlers. A short description of these services follows:

Preprocessing: This service removes the information considered irrelevant to the blog analysis. The current version of the framework can perform four types of preprocessing: i) CleanHTML [Hotho et al., 2005], which removes all HTML tags; ii) EnglishFiltering [Frakes and Baeza-Yates, 1992], which keeps only English text; iii) EnglishStemming [Porter, 1980], which applies a classical stemming technique, reducing each word to its stem; and iv) WhiteSpace, which removes extra whitespace;

¹ http://technorati.com, http://www.icerocket.com and http://www.blogcatalog.com, respectively.
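To make the extension points described above more concrete, the following is a minimal sketch in Python and not the framework's actual code. It assumes an Application base class configured through attributes and a TagParser base class with a single search method; the method is called tag_search here to mirror the paper's tagSearch(). The attribute names, the InMemoryTagParser and EducationCrawler classes, the example URL, and the two preprocessing functions (simple stand-ins for CleanHTML and WhiteSpace only) are all hypothetical.

```python
import re
from abc import ABC, abstractmethod
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects only the text nodes of an HTML document."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def clean_html(text):
    """Hypothetical stand-in for CleanHTML: drop tags, keep text nodes."""
    collector = _TextCollector()
    collector.feed(text)
    collector.close()
    return " ".join(collector.chunks)


def white_space(text):
    """Hypothetical stand-in for WhiteSpace: collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text).strip()


class TagParser(ABC):
    """Extension point: subclasses must define the tag search."""

    @abstractmethod
    def tag_search(self, tag):
        """Return a list of blog URLs matching the given tag."""


class Application:
    """Extension point: subclasses configure the crawler via attributes."""

    language = "en"        # assumed attribute name
    preprocessing = ()     # assumed attribute name; steps applied in order

    def __init__(self, tag_parser):
        self.tag_parser = tag_parser

    def preprocess(self, text):
        for step in self.preprocessing:
            text = step(text)
        return text


class InMemoryTagParser(TagParser):
    """Toy parser backed by a dict instead of a real indexing service."""

    def __init__(self, index):
        self.index = index

    def tag_search(self, tag):
        return list(self.index.get(tag, []))


class EducationCrawler(Application):
    """Example application: English text, HTML cleaning, then whitespace."""

    language = "en"
    preprocessing = (clean_html, white_space)


if __name__ == "__main__":
    parser = InMemoryTagParser({"education": ["http://example.org/math-blog"]})
    app = EducationCrawler(parser)
    print(app.tag_parser.tag_search("education"))
    print(app.preprocess("<p>Math  <b>class</b>\n notes</p>"))
```

Keeping the preprocessing steps as an ordered sequence of plain functions is one possible design choice: further steps, such as language filtering or stemming, could be appended in a subclass without changing the base class.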
