
frequency and in-place page update or page collection shadowing.

Third, some well-studied factors that are widely considered in the design of a crawling policy include:

• importance - of a piece of web content (e.g., a web page or an RSS feed) relative to other pieces of content,

• relevance - of a piece of web content to the purpose served by the crawl, and

• content changes - or how the web content tends to evolve over time.

And last, we mention some other problems that must be considered in the crawler design, such as:

• content avoidance - what is the appropriate way for a crawler to detect and avoid content that is redundant, malicious, wasteful or misleading?

• hidden or deep web - how should web content that is available only by filling in HTML forms be crawled?

• personalized content - how should a crawler treat web sites that offer personalized content to individual users?

• erased content - if a crawler runs out of storage space, how should it select the pages to retire from its repository?

In the following sections we introduce selected references that address some of the crawling factors and objectives discussed above.

3.2.1 First-time Downloading Strategies

This group of techniques focuses on ordering the content for first-time download with the goal of achieving wide coverage of the available web content. Roughly put, these strategies download the web content in descending order of importance, where the importance function reflects the purpose of the crawl.
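
As an illustration, the following is a minimal sketch of such an importance-ordered download frontier. The `importance`, `fetch`, and `extract_links` callables are hypothetical stand-ins for the crawler's scoring, download, and link-extraction components; none of these names come from the cited systems.

```python
import heapq

def crawl_by_importance(seeds, importance, fetch, extract_links, budget):
    """Download pages in descending order of an importance score.

    `importance(url) -> float`, `fetch(url) -> page`, and
    `extract_links(page) -> iterable of urls` are hypothetical
    stand-ins for the crawler's components.
    """
    # heapq is a min-heap, so scores are negated to pop the
    # highest-scoring URL first.
    frontier = [(-importance(u), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    order = []

    while frontier and len(order) < budget:
        _, url = heapq.heappop(frontier)
        order.append(url)
        # Newly discovered links join the frontier, ranked by score.
        for link in extract_links(fetch(url)):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-importance(link), link))
    return order
```

Replacing the heap with a plain FIFO queue turns the same loop into the breadth-first strategy mentioned below.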

When the goal is to cover high-quality content of all types, an importance score extensively used in the literature is PageRank [PBMW99] and its variations. Three published empirical studies evaluate different ordering policies on real web data: [CGMP98, NW01, BYCMR05]. In [CGMP98], the authors use the indegree and the PageRank of web pages to prioritize crawling. Indegree priority favors pages with higher numbers of incoming hyperlinks. There is no consensus on prioritization by indegree: [CGMP98] finds that it works fairly well (almost as well as prioritization by PageRank), whereas [BYCMR05] finds that it performs very poorly. [NW01] uses breadth-first search, an appealing crawling strategy due to its simplicity: pages are downloaded in the order in which they are discovered, where discovery occurs via extracting links from each downloaded page. [BYCMR05] proposes a crawl policy that gives priority to sites containing a large number of discovered but not yet downloaded pages.
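
For concreteness, the following is a compact power-iteration sketch of the PageRank score discussed above. It assumes, as a simplification, that the link graph fits in memory as a dictionary mapping each page to its outlinks and that every linked page also appears as a key; production crawlers compute this at far larger scale.

```python
def pagerank(graph, damping=0.85, iterations=50):
    """Power-iteration PageRank over {page: [outgoing links]}.

    Assumes every linked page also appears as a key in `graph`.
    The damping factor 0.85 is the commonly used default.
    """
    n = len(graph)
    rank = {page: 1.0 / n for page in graph}
    for _ in range(iterations):
        # Each page starts with the uniform "teleport" mass.
        nxt = {page: (1.0 - damping) / n for page in graph}
        for page, outlinks in graph.items():
            if outlinks:
                # Split this page's rank evenly among its outlinks.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    nxt[target] += share
            else:
                # Dangling page: spread its rank uniformly.
                for target in graph:
                    nxt[target] += damping * rank[page] / n
        rank = nxt
    return rank

if __name__ == "__main__":
    toy = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    print(pagerank(toy, iterations=100))
```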

