18.12.2012 Views

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

fetch new or updated information or they receive information that suits their interests<br />

from the content providers that push it to the aggregators.<br />

One of the earliest search services that adopted a push model is the Harvest system<br />

[BDH + 95], but this approach did not succeed on the long run. There are a number of<br />

papers that discuss server initiated events (push), often in comparison with pull techniques.<br />

Most of them focus on distributed publish-subscribe systems [JA06, SMPD05, RPS06,<br />

RMP + 07, CY08, ZSA09], multi-casting with a single publisher [AFZ97, FZ98, TV00] <strong>and</strong><br />

AJAX applications [BMvD07]. In a more recent work, [STCR10] uses a hybrid push-pull<br />

refresh strategy for materializing users views over event feeds.<br />

An initiative to define a push based protocol for RSS feeds has been proposed by the<br />

PubSubHubbub [pub] open protocol for distributed publish-subscribe communication on<br />

the Internet. The main purpose is to provide near-instant notifications of change updates,<br />

which would improve the typical situation where a client periodically polls the feed server<br />

at some arbitrary interval. In essence PubSubHubbub provides pushed RSS/Atom update<br />

notifications instead of requiring clients to poll whole feeds. However, the PubSubHubbub<br />

adoption heavily depends on whether both RSS feeds publishers <strong>and</strong> clients support the<br />

protocol extensions, which is rarely the case. For example, [URM + 11] estimates that only<br />

2% of their 180, 000 feeds data set support the PubSubHubbub protocol.<br />

Other researchers [OW02] have used the push protocol with a source cooperation approach,<br />

where data sources actively notify the remote data cache servers about their changes. They<br />

propose different types of divergence metrics <strong>and</strong> while lag is similar to our window/feed<br />

divergence measures, these metrics are conceived for a different application context <strong>and</strong><br />

only partially reflect the particularities of RSS feeds.<br />

Virtually all content providers <strong>and</strong> aggregators adopt the pull approach. Content providers<br />

sometime exclude all or part of their content from being crawled <strong>and</strong> may provide hints<br />

about their content, rate of change <strong>and</strong> importance. Such efforts to make web crawling<br />

more efficient by improving the underlying protocol were made by introducing the Sitemaps<br />

protocol [sit] that allows a site administrator to publish the list of available pages <strong>and</strong> their<br />

last modification date. Tradeoffs between using the classical pull approach <strong>and</strong> Sitemaps<br />

for content discovery were examined in [SS09]. Even though this protocol offers a crawler<br />

useful information about new created pages <strong>and</strong> their changes, the crawler still has to use<br />

the pull architecture in order to regularly get the information of its interest <strong>and</strong> thus, pull<br />

based refresh strategies are still needed.<br />

Both push <strong>and</strong> pull protocols have their advantages <strong>and</strong> disadvantages. When the new<br />

content is pushed by the content providers, a big advantage is that the subscribed aggregators<br />

can receive instant update notifications as soon as new content is published.<br />

However, the problem is that the content providers must be aware of their subscribers list<br />

<strong>and</strong> manage a trust relationship with them, which is difficult <strong>and</strong> highly unlikely in the<br />

web’s reality. On the other h<strong>and</strong>, when the new available content is proactively pulled<br />

from the content providers, the crawler obeys a refresh strategy that must predict the<br />

update moments <strong>and</strong> sources <strong>and</strong> that decides when the crawler fetches the newly pub-<br />

34

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!