18.12.2012 Views

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6


optimization method. In these works, a famous counterintuitive result was obtained: a
uniform crawl strategy (where all pages are crawled with the same frequency) is superior
to the proportional one (where a page is crawled with a frequency proportional to its
change frequency). Furthermore, their optimal crawling strategy may never revisit pages
that change too often. The intuition is that the crawling resources (bandwidth) needed to
keep a fast-changing page weakly synchronized are better spent keeping slower-changing
pages strongly synchronized, which yields higher freshness scores. The same goal is
pursued in [WSY+02], with the difference that the authors no longer use a homogeneous
Poisson page change model, but a quasi-deterministic one in which page update events are
nonuniform in time and the change distribution is known a priori. The authors of [EMT01]
make no particular assumption about the page change model; instead, they use an adaptive
approach that estimates and adjusts the page change frequency on the fly.
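The uniform-versus-proportional result can be checked numerically. The sketch below is only an illustration and is not taken from the cited works: it assumes the standard closed form for the time-averaged freshness of a page that changes according to a Poisson process with rate λ and is re-crawled at regular intervals 1/f, namely (1 − e^(−λ/f)) / (λ/f). The three change rates and the crawl budget are hypothetical values chosen for the example.

```python
import math


def expected_freshness(change_rate, crawl_freq):
    """Time-averaged freshness of one page under a Poisson change model
    when the page is re-crawled at regular intervals of 1 / crawl_freq."""
    if crawl_freq <= 0:
        return 0.0  # never crawled: the copy is stale in the long run
    if change_rate == 0:
        return 1.0  # never changes: the copy is always fresh
    r = change_rate / crawl_freq  # expected number of changes per crawl interval
    return (1.0 - math.exp(-r)) / r


def average_freshness(change_rates, crawl_freqs):
    """Mean freshness over all pages for a given allocation of crawl frequencies."""
    scores = [expected_freshness(l, f) for l, f in zip(change_rates, crawl_freqs)]
    return sum(scores) / len(scores)


def uniform_allocation(change_rates, budget):
    """Every page gets the same share of the crawl budget."""
    n = len(change_rates)
    return [budget / n] * n


def proportional_allocation(change_rates, budget):
    """Each page gets a share of the budget proportional to its change rate."""
    total = sum(change_rates)
    return [budget * l / total for l in change_rates]


# Hypothetical example: three pages (changes per day) and six crawls per day.
rates = [1.0, 2.0, 9.0]
budget = 6.0

uniform_score = average_freshness(rates, uniform_allocation(rates, budget))
proportional_score = average_freshness(rates, proportional_allocation(rates, budget))

print(f"uniform:      {uniform_score:.4f}")
print(f"proportional: {proportional_score:.4f}")
```

Under the proportional policy every page ends up with the same ratio λ/f and therefore the same freshness, while the uniform policy gains more on the slow pages than it loses on the fast one. Under this formula the freshness function is convex in λ/f, so the uniform average is never lower than the proportional one for a fixed budget.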

Continuous freshness measures typically classify a content element as being "fresher" than
others. As a first example, [CGM03a] introduces the age of an element as the amount of
time during which the real and the cached copy of the element have been different, and
proposes a refresh strategy that minimizes age. Content-based freshness measures are
introduced in [OP08], where the degree of change is reflected directly by the changes in
the page content. Fragment staleness (the authors use the opposite of freshness) is
estimated by the Jaccard set distance on the component fragments of a page. In addition to
characterizing a page by its update frequency, they also introduce the longevity of a
page, defined as the lifetime of page fragments that appear and disappear over time. Their
best-effort refresh strategy is also based on the Lagrange multipliers [Ste91]
optimization method, but it uses a variant that employs a utility function. This solution
avoids having to solve the system of differential equations of the classical Lagrange
method. The same method was first presented in [OW02] in the context of cache
synchronization with source cooperation and a push protocol, and it is also a source of
inspiration for our two-step refresh strategy for RSS feeds introduced in Chapter 4. When
pages are revisited in order to maintain the index of a search engine, [PO05] proposes a
user-centric method that assigns weights to individual content changes based on how they
impact the ranking of the page.
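Two of the measures above, age and Jaccard-based staleness, can be made concrete with a minimal sketch. It is not the implementation of [CGM03a] or [OP08]. The function names are hypothetical, and a page is assumed to be already split into a set of fragments, which the text above does not specify.

```python
def age(now, last_sync, change_times):
    """Age of a cached copy at time `now`: zero if the source has not changed
    since the last synchronization, otherwise the time elapsed since the
    first change that the cached copy missed."""
    missed = [t for t in change_times if last_sync < t <= now]
    return now - min(missed) if missed else 0.0


def jaccard_staleness(cached_fragments, live_fragments):
    """Jaccard set distance between the fragments of the cached copy and the
    fragments of the live page: 0.0 means identical, 1.0 means disjoint."""
    cached, live = set(cached_fragments), set(live_fragments)
    union = cached | live
    if not union:
        return 0.0  # two empty pages are considered identical
    return 1.0 - len(cached & live) / len(union)


# Hypothetical example: the page was synchronized at t=4 and changed at t=2, 6 and 9.
example_age = age(now=10.0, last_sync=4.0, change_times=[2.0, 6.0, 9.0])

# Hypothetical example: one fragment disappeared and one appeared since the last crawl.
example_staleness = jaccard_staleness({"a", "b", "c"}, {"b", "c", "d"})

print(example_age, example_staleness)
```

Age grows with the time elapsed since the first missed change, whereas the Jaccard distance grows with the amount of content that differs, so a page can be old under one measure while only slightly stale under the other.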

If the purpose of the crawl is to capture as many individual content updates as possible,
then the goal of the refresh strategy is to maximize completeness. This is typical of
applications such as web archiving or temporal data mining. A first algorithm of this kind
was proposed in [PRC03]. In subsequent work [PDO04], a more general algorithm is
introduced that supports a flexible combination of freshness and completeness
optimization.
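As a rough illustration of what completeness measures, the sketch below counts how many individual updates a crawl schedule captures. It is not the algorithm of [PRC03] or [PDO04]. It assumes a simplified model in which each update overwrites the previous version, so an update is captured only if a crawl occurs after it and before the next update. The function name and the timestamps are hypothetical.

```python
import bisect


def completeness(update_times, crawl_times):
    """Fraction of individual updates captured by the crawler, assuming each
    update overwrites the previous version of the content."""
    if not update_times:
        return 1.0  # nothing to capture
    updates = sorted(update_times)
    crawls = sorted(crawl_times)
    captured = 0
    for i, t in enumerate(updates):
        # The version created at time t is visible until the next update.
        next_update = updates[i + 1] if i + 1 < len(updates) else float("inf")
        j = bisect.bisect_left(crawls, t)  # index of the first crawl at or after t
        if j < len(crawls) and crawls[j] < next_update:
            captured += 1
    return captured / len(updates)


# Hypothetical example: five updates, three crawls.
score = completeness(update_times=[1, 2, 5, 6, 9], crawl_times=[3, 7, 8])
print(score)
```

In this example the updates at times 1 and 5 are overwritten before any crawl sees them, and the update at time 9 happens after the last crawl, so only two of the five updates are captured. A copy can therefore be fresh at crawl time while completeness remains low.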

3.3 Crawling RSS Feeds

A refresh strategy specifically designed for crawling RSS feeds is proposed in [SCC07],
based on the Lagrange multipliers [Ste91] optimization method. The challenge for the
