
WWW/Internet - Portal do Software Público Brasileiro


ISBN: 978-972-8939-25-0 © 2010 IADIS

the preprocessing and indexing techniques, and the tags to search for. This gives the application a lot of flexibility.

To extend the TagParser class, the user only needs to implement the tagSearch() method, which is responsible for finding blogs with a specific tag. At first this may seem a difficult task; however, it becomes much simpler with a tagging service such as the one provided by Technorati. Nevertheless, it is worth mentioning that we do not recommend the use of multiple data sources, i.e., using several blog indexing services at the same time, because this can cause information inconsistency.

In our instantiation we used five categories (tags): education, economy, entertainment, sports and movies. We chose a limited number of categories to simplify our analysis. For the tagSearch() method we used Technorati's search service because it provides a wide range of tags, both general and specific. For instance, there are general tags in the politics context and specific tags related to political personalities, such as Barack Obama. In addition, even though the framework allowed us to work with blogs in many languages, for the sake of simplicity we preferred to deal only with blogs in English. The SummaryStrategy algorithm was used to extract text from blog posts.

3.2 Results and Discussion

We evaluated one instantiation of the framework, and the results are discussed below. The evaluation was based on two common metrics of information retrieval systems: precision and recall [Baeza-Yates and Ribeiro-Neto, 1999]. Precision tells us how many of the retrieved posts were relevant. In our context, a post was considered relevant only if it was complete and written in English.
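As a minimal sketch of how the two metrics are computed, the Python snippet below treats precision as the share of retrieved posts that are relevant and recall as the share of all relevant posts that were retrieved. The function names and the example counts are hypothetical and are not taken from the framework. The per-category percentages are the ones reported in Table 1, and they are used only to recompute the averages.

```python
def precision(relevant_retrieved: int, retrieved: int) -> float:
    """Share of retrieved posts that are relevant."""
    return relevant_retrieved / retrieved if retrieved else 0.0


def recall(relevant_retrieved: int, total_relevant: int) -> float:
    """Share of all relevant posts that were actually retrieved."""
    return relevant_retrieved / total_relevant if total_relevant else 0.0


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical counts, for illustration only (not data from the paper):
# 10 posts retrieved, 7 of them relevant, 14 relevant posts in total.
p = precision(relevant_retrieved=7, retrieved=10)
r = recall(relevant_retrieved=7, total_relevant=14)

# Per-category (precision %, recall %) as reported in Table 1.
table1 = {
    "Education": (61.5, 65.0),
    "Economy": (83.3, 66.6),
    "Entertainment": (60.0, 60.0),
    "Sports": (87.5, 80.0),
    "Movies": (75.0, 88.0),
}

# Unweighted average over the five categories.
avg_precision = round(mean(pr for pr, _ in table1.values()), 2)
avg_recall = round(mean(rc for _, rc in table1.values()), 2)

print(f"example precision={p:.2f}, recall={r:.2f}")
print(f"average precision={avg_precision}%, average recall={avg_recall}%")
```

Averaging the five per-category values in this way gives the 73.46% precision and 71.92% recall reported for the whole evaluation.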
Recall indicates how many of the relevant posts were retrieved, considering the total set of relevant posts.

For this first simple analysis, we selected twenty blogs from each of the five aforementioned categories. The crawling results were inspected manually. Table 1 shows the results.

Table 1. Results by category

Category        Precision   Recall
Education       61.5%       65%
Economy         83.3%       66.6%
Entertainment   60%         60%
Sports          87.5%       80%
Movies          75%         88%
Average         73.46%      71.92%

As can be seen, some categories performed better in this evaluation than others. For instance, the sports category achieved a precision of 87.5% and a recall of 80%, while the entertainment category scored 60% for both precision and recall. This variability is due to differences in the level of language in which the blogs are written. Sports blogs tend to be more formal and better structured, because they tend to have connections with larger organizations. On the other hand, blogs about entertainment subjects are mainly written by the general public; they tend to be less formal and contain slang that affects the text extraction process.

The algorithm achieved an average precision and recall of 73.46% and 71.92%, respectively.

4. RELATED WORK

[Glance et al., 2004] proposed an application for analyzing collections of blogs using machine learning and natural language processing techniques to select the most quoted phrases and the most accessed blogs.

In [Hurst and Maykov, 2009], the authors proposed a blog crawler concerned with specific issues such as performance, scalability and fast update. Their crawler takes into account low latency, high scalability, data quality, and appropriate network politeness.
The distributed crawler architecture is composed of scheduling, fetching, and list creation subsystems, as well as a natural language parser.

More recently, [Wei-jiang et al., 2009] tried to increase the quality and accuracy of searches in blogs. The authors proposed a blog-oriented crawler, and compared it to a topical crawler using their specific
