13.07.2015

WWW/Internet - Portal do Software Público Brasileiro



IADIS International Conference WWW/Internet 2010

2.4 The SummaryStrategy Algorithm

During the implementation of this framework, we faced the task of extracting the textual content of blog posts. To do that, we had to find the delimiters indicating both the beginning and the end of the posts within a blog page. The fact that each blog has its own HTML structure makes this task more difficult. Therefore, we propose the SummaryStrategy algorithm, which uses the link to the original blog to retrieve the page and a brief summary of the post to find its text within that page. Since we rely on blog indexing services such as Technorati’s to retrieve the summaries of the blogs, acquiring them was a minor issue. Because the summary text and the text inside the blog are exactly the same, the algorithm can use the former as a reference to find the latter inside the blog page. However, the blog summary is seldom written in plain text (without any formatting markup), in contrast with the text inside the blog, and this can prevent the algorithm from locating the summary within a blog page. To overcome this difficulty, when the algorithm cannot find the beginning of the post's text content, it tries progressively smaller fragments of the summary as a reference until one is found.

The algorithm also has to detect where each post ends. By analyzing the layout tags used by different blog pages, we noticed that we could use the block-delimiting HTML tags (i.e. the div tag) as delimiters of the textual content of blog posts.
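The search described above can be sketched in Python as follows. This is only an illustrative reading of the description, not the authors' implementation: the function name `summary_strategy`, the default ending `"</div>"`, and the use of `None` for a failed lookup are assumptions made for the example.

```python
from typing import Optional, Sequence


def summary_strategy(
    summary: str,
    blog_html: str,
    possible_endings: Sequence[str] = ("</div>",),  # assumed default delimiter
) -> Optional[str]:
    """Return the post text located through its summary, or None if not found."""
    # Shrink the summary one character at a time until it matches the page.
    while summary:
        text_start = blog_html.find(summary)
        if text_start != -1:
            # Use the first listed ending that occurs after the start position.
            for ending in possible_endings:
                text_end = blog_html.find(ending, text_start)
                if text_end != -1:
                    return blog_html[text_start:text_end]
            # The start was found, but none of the endings follows it.
            return None
        summary = summary[:-1]
    return None


# Hypothetical page: the summary and the page differ in formatting markup,
# so the full summary is not found and a shorter prefix is used instead.
page = (
    '<html><body><div class="post">Hello <b>world</b>, this is my post. '
    "More text here.</div><div>footer</div></body></html>"
)
print(summary_strategy("Hello world, this is my post.", page))
```

In this sketch a very short leftover fragment can match an unrelated part of the page, so a practical variant might stop shrinking at some minimum fragment length. That threshold is not part of the description above.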
Figure 2 shows the pseudocode of the SummaryStrategy algorithm, where Summary and blogHTML are strings, and possibleEndings is a list of strings.

while (length of Summary > 0) do:
    if (Summary is found within blogHTML) then:
        textStart ← position of Summary within blogHTML
        for each ending in possibleEndings do:
            if (ending is found within blogHTML, after textStart) then:
                textEnd ← ending position within blogHTML, after textStart
                return text between textStart and textEnd
            end if
        end for
        return “not found”
    end if
    delete the last character of Summary
end while
return “not found”

Figure 2. Pseudocode of the SummaryStrategy algorithm.

The algorithm is fast and has low processing requirements. Its downside is that it relies on a summary and on hardcoded delimiters (i.e. the div tag) to extract the full text from blog posts.

In a future version of our framework, we could implement webpage template-detection techniques such as those proposed by [Wang et al., 2008]. Template-detection algorithms are able to infer the layout template of a web site, generally by counting block frequencies. Thus, these algorithms could be used to delimit the blog posts.

3. THE FRAMEWORK’S INSTANTIATION

This section briefly describes the development of an application using the proposed framework and discusses an analysis of the developed application.

3.1 Developing an Application with the Framework

As previously stated, the only requirement to create an application using our framework is to extend the Application and TagParser classes.

When extending the Application class, the user can configure some aspects of the application by setting attributes of the class. The user can define features such as the desired language of blogs, the type of persistence,
