13.07.2015 Views

WWW/Internet - Portal do Software Público Brasileiro



ISBN: 978-972-8939-25-0 © 2010 IADIS

Indexing: We delegated the implementation of this service to the Lucene [2] tool. The service is in charge of indexing texts for the search process;
Text Extraction from HTML pages: This service extracts the textual content of blog posts. It is based on the SummaryStrategy algorithm that will be discussed in Section 2.4;
Persistent storage API: This API is responsible for saving, retrieving and updating instances in either a MySQL [3] database or an RDF schema-based repository such as Sesame [4].

The General services module implements the basic services used in the crawling process. It provides file manipulation, HTTP requests, XML file manipulation, and language detection. All web connections are handled by the HTTP service provided by this module.

Figure 1. The framework's architecture

2.3 The Toolkit and Persistence Modules

The Toolkit Module has several interfaces to the following set of APIs and tools: i) Lucene and Lingpipe [5], for extraction and text retrieval; ii) Hibernate and Elmo, which handle data persistence transparently; iii) HttpClient, for HTTP page retrieval; iv) Google Language Detection, to detect the language of the blog's text. The Toolkit Module makes the use of each of these tools transparent, providing effortless access to them and thereby reducing the learning time. For instance, the user does not need to go into the details of the Lucene API in order to build an application that actually uses it. Instead, she can delegate to the framework's Lucene functions.

The Persistence Module is responsible for storage. It supports MySQL databases and Sesame, which is an open source Java framework for the storage and querying of RDF data. In particular, if the user chooses to use Sesame, she has to define the ontology that will be used as the database schema.

[2] http://lucene.apache.org
[3] http://www.mysql.com
[4] http://www.openrdf.org
[5] Lingpipe is a toolkit for processing text using computational linguistics: http://alias-i.com/lingpipe/
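
The storage contract described for the Persistence Module (saving, retrieving and updating instances behind a backend chosen by the user) can be illustrated with a minimal sketch. It is written in Python for brevity, although the framework's own tools are Java-based, and it is not the framework's actual API: the names BlogPost, PostStore, InMemoryStore and open_store are hypothetical, the stored fields are assumed, and an in-memory dictionary stands in for both MySQL and Sesame. The only behavior taken from the text is that the Sesame option requires an ontology to serve as the schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Protocol


@dataclass
class BlogPost:
    """Hypothetical record type; the paper does not list the stored fields."""
    url: str
    text: str


class PostStore(Protocol):
    """The three operations the text attributes to the persistent storage API."""

    def save(self, post: BlogPost) -> None: ...
    def retrieve(self, url: str) -> Optional[BlogPost]: ...
    def update(self, post: BlogPost) -> None: ...


class InMemoryStore:
    """Stand-in backend; a real one would wrap MySQL or an RDF repository."""

    def __init__(self) -> None:
        self._posts: Dict[str, BlogPost] = {}

    def save(self, post: BlogPost) -> None:
        # Saving is only valid for instances that are not stored yet.
        if post.url in self._posts:
            raise KeyError(f"already stored: {post.url}")
        self._posts[post.url] = post

    def retrieve(self, url: str) -> Optional[BlogPost]:
        # Returns None when no instance is stored under this URL.
        return self._posts.get(url)

    def update(self, post: BlogPost) -> None:
        # Updating is only valid for instances that already exist.
        if post.url not in self._posts:
            raise KeyError(f"not stored: {post.url}")
        self._posts[post.url] = post


def open_store(backend: str, ontology: Optional[str] = None) -> PostStore:
    """Select a backend by name (assumed names: "mysql" and "sesame")."""
    if backend not in ("mysql", "sesame"):
        raise ValueError(f"unknown backend: {backend}")
    if backend == "sesame" and ontology is None:
        # Per the text, the ontology plays the role of the database schema.
        raise ValueError("the Sesame backend needs an ontology as its schema")
    # Both names map to the in-memory stand-in in this sketch.
    return InMemoryStore()
```

Code written against PostStore does not change when the backend does, which is the kind of transparency the text attributes to the Toolkit and Persistence Modules.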
