Roxana-Gabriela HORINCAR, Refresh Strategies and Online ... - LIP6
frequency and in-place page update or page collection shadowing.
Third, some well-studied factors that are widely considered in the design of a crawling policy include:
• importance - of a piece of web content (e.g., a web page or RSS feed) relative to other pieces of content,
• relevance - of a piece of web content to the purpose served by the crawl, and
• content changes - or how the web content tends to evolve over time.
Last, we mention some other problems that must be considered in crawler design, such as:
• content avoidance - what is the appropriate way for a crawler to detect and avoid content that is redundant, malicious, wasteful, or misleading?
• hidden or deep web - how should web content available only by filling in HTML forms be crawled?
• personalized content - how should a crawler treat web sites that offer personalized content to individual users?
• erased content - if a crawler runs out of storage space, how should it select the pages to retire from its repository?
In the following sections we introduce selected references that address some of the crawling factors and objectives discussed above.
3.2.1 First-time Downloading Strategies
This group of techniques focuses on ordering the content for first-time download, with the goal of achieving wide coverage of the available web content. Roughly put, these strategies download the web content in descending order of importance, where the importance function reflects the purpose of the crawl.
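The ordering described above can be sketched as a best-first crawl driven by a priority queue. The `importance` and `fetch_links` callables below are hypothetical stand-ins for whatever scoring function and link extractor a concrete crawler would use; this is a minimal sketch, not the implementation evaluated in the studies cited below.

```python
import heapq

def crawl_by_importance(seed_pages, importance, fetch_links, budget):
    """Download pages in descending order of an importance score.

    `importance(page)` returns a numeric score; `fetch_links(page)`
    returns the pages linked from a downloaded page (both hypothetical).
    """
    # heapq is a min-heap, so scores are negated to pop the most
    # important page first.
    frontier = [(-importance(p), p) for p in seed_pages]
    heapq.heapify(frontier)
    seen = set(seed_pages)
    downloaded = []
    while frontier and len(downloaded) < budget:
        _, page = heapq.heappop(frontier)
        downloaded.append(page)          # "download" the page
        for link in fetch_links(page):   # discover new pages via its links
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-importance(link), link))
    return downloaded
```

Note that the importance of a page must be estimated before it is downloaded, which is exactly why proxies such as indegree or PageRank over the already-discovered link graph are used in practice.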
When the goal is to cover high-quality content of all types, an importance score extensively used in the literature is PageRank [PBMW99] and its variations. Three published empirical studies evaluate different ordering policies on real web data: [CGMP98, NW01, BYCMR05]. In [CGMP98], the authors use the indegree and the PageRank of web pages to prioritize crawling. Indegree priority favors pages with higher numbers of incoming hyperlinks.
There is no consensus on prioritization by indegree: [CGMP98] finds that it works fairly well (almost as well as prioritization by PageRank), whereas [BYCMR05] finds that it performs very poorly. [NW01] uses breadth-first search, an appealing crawling strategy due to its simplicity: pages are downloaded in the order in which they are discovered, where discovery occurs via extracting links from each downloaded page. [BYCMR05] proposes a crawl policy that gives priority to sites containing a large number of discovered but