researcher’s goals, a web site is usually represented as a full binary tree, a directed acyclic graph, or some other restrictive structure. It is common ground that most of the link structure information is lost in that first step and, what is more, it is never retrieved. When we say that we have extracted and stored the “real map” of a web site, we mean that no possible link is passed over by the crawling process and that every possible pair of {father node, child node} is recorded. With this last requirement alone, it becomes clear that we build the algorithm of our crawler around the notion of the link between two nodes (web pages of the site), contrary to what applies to most web-page-oriented crawlers. Or, as we said before, we implement a link-oriented web site crawler.

3.2 The Notion of a Link Oriented Crawler

Our specific crawling needs, as described above, imply several other matters that we need to consider when specifying the crawling algorithm. Firstly, it is common ground that in actual web sites a user can be led to a web page from more than one parent page. It is important, however, for our study to record the distance of each web page from the root, because this information provides a measure of the access cost of the page, which can be exploited later on. In particular, we take special interest in identifying the first appearance of a web page, which is the appearance with the minimum depth from the root. Recording this information means that our crawling algorithm must be able to identify that first appearance, process it in a different manner and, of course, store that information. To succeed in that, we require that our crawler follows breadth-first search. The reason is better depicted in the two following figures. In the first we see how the web site would be traversed by a depth-first algorithm, and in the second by a breadth-first algorithm.

Figure 1. a) Depth first. The numbers next to the nodes depict the order in which the crawler visits the nodes. In this tree-like representation we see that each page occurs more than once, since in real sites each page can emerge from many different parents. With depth-first search, node ‘E’ would first occur at a depth of d=2. b) Breadth first. This time node ‘E’ is rightly recorded as having its first appearance at depth d=1.

In the case of recording the web site’s real map, the mistake described above would be significant. For instance, if we miscalculate the depth of the first appearance of a web page, this results in the miscalculation of the depths of all its children pages, and so on. In the end we would not be able to regenerate the correct map from the stored data. We could seek the first appearance of a web page by simply looking for the minimum depth (since the depth is retrieved anyway). However, we want to store all the appearances of a web page in the site, and store them in the order in which they actually occur. The downloading and parsing of a page is performed only once, at its first appearance (the following appearances are simply recorded). So the correct determination of that first appearance is a necessary step for the correct mapping of the site. Since we need to know all the transitions between the nodes of the site, we must record all possible parents from which a child node can occur. This means that crawling only once for each page-node is not enough.
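As a minimal sketch of the traversal just described (not the authors’ actual implementation), the following Python fragment performs a breadth-first crawl that records every {father node, child node} pair in the order it first occurs, stores the depth of each page’s first appearance, and downloads and parses a page only once. The helper fetch_links and the function name crawl_site_map are assumptions introduced for the example.

    from collections import deque

    def fetch_links(url):
        """Hypothetical helper: download the page at `url` and return the list of
        URLs it links to. A real implementation would fetch the page and parse its
        HTML; here it only marks where the single download-and-parse step happens."""
        raise NotImplementedError

    def crawl_site_map(root):
        """Breadth-first, link-oriented crawl of a single web site.

        Returns:
            edges       -- (father, child) pairs in the order they first occur
            first_depth -- depth of the first appearance of every discovered page
        """
        edges = []                   # ordered record of new {father, child} pairs
        seen_pairs = set()           # pairs already recorded
        first_depth = {root: 0}      # page -> depth of its first appearance
        queue = deque([root])        # pages waiting to be downloaded and parsed

        while queue:
            father = queue.popleft()
            depth = first_depth[father]
            for child in fetch_links(father):        # download/parse once per page
                if (father, child) in seen_pairs:    # this exact link was already recorded
                    continue
                seen_pairs.add((father, child))
                edges.append((father, child))        # a new {father, child} pair is new information
                if child not in first_depth:         # first appearance of the child page
                    first_depth[child] = depth + 1   # BFS guarantees this is the minimum depth
                    queue.append(child)              # only first appearances are expanded further
        return edges, first_depth

Because the frontier is processed level by level, the depth stored when a page is first enqueued is necessarily its minimum distance from the root, which is exactly the “first appearance” the algorithm must single out.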
When a common crawler opens a new page, it looks for links. If those links have already been found, or lead to pages that have already been crawled, they are ignored, since they are treated as old pages with no new information to provide. In our case, however, we record the link structure and are not interested in storing the content of the pages. Consequently, our crawler does not ignore links that lead to pages that have already been traversed. Instead, we check whether those links occur for the first time, considering of course the current father node. In other words, if the current pair {father_node, child_node} occurs for the first time, this is new information to us and therefore needs to be recorded. For instance, consider the case where, while the crawler parses a page A, it finds a link to a page B that has not appeared before. Page B will be
