Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6
optimization method. These works obtained a famous counterintuitive result: a
uniform crawl strategy (where all pages are crawled with the same frequency) is superior
to the proportional one (where a page is crawled at a frequency proportional to its change frequency).
Furthermore, their optimal crawling strategy may never revisit pages that change too
often. The intuition is that the crawling resources (bandwidth) needed to keep a
fast-changing page weakly synchronized are better spent keeping slower-changing pages
strongly synchronized, which yields higher freshness scores. The same goal is pursued
by [WSY+02], with the difference that they replace the homogeneous Poisson page
change model with a quasi-deterministic one, in which page update events
are nonuniform in time and the change distribution is known a priori. The authors of
[EMT01] make no particular assumption about the page change model;
instead, they use an adaptive approach that estimates and adapts the page change
frequency on the fly.
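The uniform-versus-proportional result can be checked numerically with a minimal sketch. It assumes the homogeneous Poisson change model and refreshes at regular intervals, for which the time-averaged freshness of a page with change rate λ and refresh frequency f is (1 − e^(−λ/f)) / (λ/f). The function names, the three change rates and the budget are hypothetical values chosen for illustration only.

```python
import math

def expected_freshness(change_rate, refresh_freq):
    """Time-averaged freshness of one page, assuming Poisson changes at
    `change_rate` and refreshes at regular intervals of 1 / refresh_freq."""
    if refresh_freq <= 0:
        return 0.0          # a page that is never refreshed ends up stale
    if change_rate == 0:
        return 1.0          # a page that never changes stays fresh
    r = change_rate / refresh_freq
    return (1.0 - math.exp(-r)) / r

def average_freshness(change_rates, refresh_freqs):
    """Mean freshness over all pages (all pages weighted equally)."""
    scores = [expected_freshness(l, f) for l, f in zip(change_rates, refresh_freqs)]
    return sum(scores) / len(scores)

def uniform_allocation(change_rates, budget):
    """Every page gets the same share of the refresh budget."""
    n = len(change_rates)
    return [budget / n] * n

def proportional_allocation(change_rates, budget):
    """Each page gets a share proportional to its change rate."""
    total = sum(change_rates)
    return [budget * l / total for l in change_rates]

# Hypothetical example: three pages, change rates in changes per day,
# and a total budget of three refreshes per day.
change_rates = [0.1, 1.0, 10.0]
budget = 3.0

uniform_score = average_freshness(
    change_rates, uniform_allocation(change_rates, budget))
proportional_score = average_freshness(
    change_rates, proportional_allocation(change_rates, budget))

print(f"uniform:      {uniform_score:.3f}")
print(f"proportional: {proportional_score:.3f}")
```

In this example the uniform allocation scores higher because the proportional one spends most of the budget on the page that changes ten times a day, which stays mostly stale anyway. Neither allocation is the optimal one discussed above; the sketch only compares the two baselines.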
Continuous freshness measures typically classify a content element as being "fresher" than
others. As a first example, [CGM03a] introduces the age of an element as the amount
of time during which the real and the cached copy of the element have been different. They propose
a refresh strategy that minimizes age. Content-based freshness measures are introduced
in [OP08], where the degree of change is reflected directly by the changes in the page
content. Staleness (the inverse of freshness, which they use instead) is estimated by the
Jaccard set distance over the component fragments of a page. In addition to characterizing
a page by its update frequency, they also introduce the longevity of a page, defined as the lifetime
of page fragments that appear and disappear over time. Their best-effort
refresh strategy is also based on the Lagrange multipliers [Ste91] optimization method,
but they use a variant that employs a utility function. This solution avoids solving the
system of differential equations of the classical Lagrange method. The same method was
first presented in [OW02] in the context of cache synchronization with source cooperation
and a push protocol; it also inspired our two-step refresh
strategy for RSS feeds, introduced in Chapter 4. When pages are revisited in order
to maintain the index of a search engine, [PO05] propose a user-centric method
that weights individual content changes by their impact on the ranking
of the page.
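The Jaccard set distance used as a staleness estimate can be illustrated with a short sketch. How [OP08] split a page into fragments is not detailed here, so the sketch assumes that fragments are already available as hashable identifiers (plain strings below); the function name and the sample fragments are hypothetical.

```python
def jaccard_distance(cached_fragments, live_fragments):
    """Jaccard set distance between two fragment sets:
    1 - |intersection| / |union|. Returns 0.0 for identical sets and
    1.0 for sets with no fragment in common."""
    cached, live = set(cached_fragments), set(live_fragments)
    union = cached | live
    if not union:
        return 0.0  # two empty pages are treated as identical
    return 1.0 - len(cached & live) / len(union)

# Hypothetical fragments of a cached copy and of the live page.
cached_copy = {"headline-a", "story-1", "story-2", "footer"}
live_page = {"headline-b", "story-2", "story-3", "footer"}

# Two fragments are shared out of six distinct ones overall.
staleness = jaccard_distance(cached_copy, live_page)
print(f"staleness: {staleness:.3f}")
```

A distance of 0 means the cached copy still holds exactly the fragments of the live page, and the value grows as fragments appear on or disappear from the live page.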
If the purpose of the crawl is to capture as many individual content updates as possible, then the
goal of the refresh strategy is to maximize completeness. This is typical of applications such as
web archiving or temporal data mining analysis. A first algorithm with this goal was proposed
in [PRC03]. Subsequent work [PDO04] introduces a more general algorithm
that supports a flexible combination of freshness and completeness optimization.
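As a rough illustration of completeness, the sketch below counts the fraction of updates that a given crawl schedule captures. It assumes a simple overwrite model, in which each update replaces the previous one and a crawl only sees the latest update at that moment; this model, the function name and the timestamps are illustrative assumptions and not the definitions used in [PRC03] or [PDO04].

```python
from bisect import bisect_right

def completeness(update_times, crawl_times):
    """Fraction of updates captured by at least one crawl, assuming each
    update overwrites the previous one (overwrite model, an assumption)."""
    updates = sorted(update_times)
    if not updates:
        return 1.0  # nothing to miss
    captured = set()
    for t in sorted(crawl_times):
        k = bisect_right(updates, t)   # number of updates at or before t
        if k > 0:
            captured.add(k - 1)        # index of the update visible at t
    return len(captured) / len(updates)

# Hypothetical timestamps (for example, in hours).
updates = [1.0, 2.0, 2.5, 6.0]
crawls = [1.5, 3.0, 7.0]

# The update at 2.0 is overwritten at 2.5 before the crawl at 3.0,
# so it is the only one that is missed.
score = completeness(updates, crawls)
print(f"completeness: {score:.2f}")
```

Under this model a strategy aimed at freshness can afford to skip intermediate versions, whereas a strategy aimed at completeness has to crawl before each version is overwritten.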
3.3 Crawling RSS Feeds<br />
A refresh strategy specially conceived for crawling RSS feeds is proposed in [SCC07],<br />
based on the Lagrange multipliers [Ste91] optimization method. The challenge for the<br />