WWW/Internet - Portal do Software Público Brasileiro


ISBN: 978-972-8939-25-0 © 2010 IADIS

added to the queue for future expansion. Later, when the crawler parses another page C and again finds a link to page B, we do not ignore this link: we record its appearance as new information about the site's link structure. However, we do not send page B to the queue again for downloading and expansion. The following table sums up how we deal with all possible cases.

Table 1. The behavior of the crawler depending on the found links

  - Found link to a new page: no check needed; submit the new page to the queue and record the link.
  - Found link to an already queued or parsed page: check whether the link {father_node, child_node} is unique; record the link if yes, ignore it if no.
  - Found other links not considered proper links (files etc.), or links disallowed by the webmaster: ignore the link.

There were several other issues concerning the crawling algorithm that we had to deal with. We provided flexibility by allowing the webmaster to block certain types of files, or even whole directories of his site, from the crawling process. Consider, for example, a web site that supports more than one language. Some such sites implement multilingual services by "repeating" the whole site, with the same structure, under another directory (www.website.com/index.html - www.website.com/gr/index.html). Such a design would be mapped by our crawler as a repetition of the whole link structure one level below the home page, since the URLs of the pages would appear different for each language and the links would normally not be identified as already parsed. Cases like this example need to be treated specially, and we provide such options for the webmasters.
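The link-handling policy of Table 1 can be sketched in Java roughly as follows. This is an illustrative reconstruction, not the paper's actual code; the class and method names (CrawlFrontier, handleLink) are our own, and edges are kept in a set so the {father_node, child_node} uniqueness check falls out of set membership:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

/** Minimal sketch of the Table 1 link-handling policy (names are illustrative). */
class CrawlFrontier {
    private final Queue<String> queue = new ArrayDeque<>();   // pages awaiting download and expansion
    private final Set<String> seenPages = new HashSet<>();    // pages already queued or parsed
    private final Set<String> links = new HashSet<>();        // recorded {father, child} edges
    private final Set<String> disallowed = new HashSet<>();   // URLs blocked by the webmaster

    void disallow(String url) { disallowed.add(url); }

    /** Handle one link from page 'father' to page 'child' found during parsing. */
    void handleLink(String father, String child) {
        if (disallowed.contains(child)) {
            return;                          // Table 1, row 3: ignore the link entirely
        }
        String edge = father + " -> " + child;
        if (seenPages.add(child)) {
            queue.add(child);                // row 1: new page, submit to queue...
            links.add(edge);                 // ...and record the link
        } else {
            links.add(edge);                 // row 2: page known; record the edge only if
        }                                    // unique (set add is a no-op otherwise),
    }                                        // and never re-queue the page

    int queuedCount() { return queue.size(); }
    int linkCount()   { return links.size(); }
}
```

Recording the duplicate edge from C to B while refusing to re-queue B is what lets the crawler capture the full link structure without re-downloading pages.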
Table 2. Data retrieved by the crawler

  URL            The URL of the link currently parsed.
  Father URL     The URL of the father node.
  Title          The title of the web page, as retrieved from its title tag.
  Depth          The minimum number of steps required to reach this page from the home page.
  Page file      The filename of the web page on the server.
  Parent code    The HTML code of the link that led to this page; useful for automatic hotlink application.
  Directory URL  The URL of the page's parent directory on the web server.
  Appearance     The order of appearance of this specific page, as opposed to appearances of the same page under other father pages. The greatest Appearance value of a page equals the number of different parent pages from which the page can occur.

Among other, equally important issues, we allow only one thread to undertake the crawling, in synchronous mode, since we noticed rare but real anomalies in the sequence of the parsed pages when operating in multi-threaded mode. The embedding of the crawler in a web environment also raised many usability and security concerns. In addition, the crawling algorithm underwent several changes in order to support our need to store several extra data during the crawl: in two distinct, interleaved stages of the crawling process we collect, for each page, the data shown in Table 2. Finally, some further interventions were made concerning the parsing of the pages, pages pointing to themselves, URL normalization, and the storing of the collected information in the database.

3.3 Configuring the WebSPHINX Crawler

All the specifications and needs previously described call for a very different web crawler. We clearly needed to start our work from a highly dynamic and configurable machine. The link-oriented character, combined with the web environment in which the crawler is embedded, led us to choose a lightweight Java
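The URL normalization mentioned above is needed so that the same page, reached under different spellings of its URL, is not treated as two pages. The paper does not specify its exact rules, so the sketch below is an assumption built on the standard `java.net.URI` API: lowercase the scheme and host, resolve "." and ".." path segments, and drop default ports.

```java
import java.net.URI;

/** Illustrative URL normalizer; the paper's actual normalization rules are not given. */
class UrlNormalizer {
    static String normalize(String url) {
        URI u = URI.create(url).normalize();            // resolves "." and ".." segments
        String scheme = u.getScheme().toLowerCase();    // scheme is case-insensitive
        String host = u.getHost().toLowerCase();        // host is case-insensitive
        String path = (u.getPath() == null || u.getPath().isEmpty()) ? "/" : u.getPath();
        int port = u.getPort();
        boolean defaultPort = port == -1
                || (scheme.equals("http") && port == 80)
                || (scheme.equals("https") && port == 443);
        String portPart = defaultPort ? "" : ":" + port;
        return scheme + "://" + host + portPart + path;
    }
}
```

For example, `HTTP://WWW.Website.com:80/a/../index.html` and `http://www.website.com/index.html` normalize to the same string, so the uniqueness check on {father_node, child_node} pairs sees them as one page.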
