Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6
Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6
Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6
Create successful ePaper yourself
Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.
fetch new or updated information or they receive information that suits their interests<br />
from the content providers that push it to the aggregators.<br />
One of the earliest search services that adopted a push model is the Harvest system<br />
[BDH + 95], but this approach did not succeed on the long run. There are a number of<br />
papers that discuss server initiated events (push), often in comparison with pull techniques.<br />
Most of them focus on distributed publish-subscribe systems [JA06, SMPD05, RPS06,<br />
RMP + 07, CY08, ZSA09], multi-casting with a single publisher [AFZ97, FZ98, TV00] <strong>and</strong><br />
AJAX applications [BMvD07]. In a more recent work, [STCR10] uses a hybrid push-pull<br />
refresh strategy for materializing users views over event feeds.<br />
An initiative to define a push based protocol for RSS feeds has been proposed by the<br />
PubSubHubbub [pub] open protocol for distributed publish-subscribe communication on<br />
the Internet. The main purpose is to provide near-instant notifications of change updates,<br />
which would improve the typical situation where a client periodically polls the feed server<br />
at some arbitrary interval. In essence PubSubHubbub provides pushed RSS/Atom update<br />
notifications instead of requiring clients to poll whole feeds. However, the PubSubHubbub<br />
adoption heavily depends on whether both RSS feeds publishers <strong>and</strong> clients support the<br />
protocol extensions, which is rarely the case. For example, [URM + 11] estimates that only<br />
2% of their 180, 000 feeds data set support the PubSubHubbub protocol.<br />
Other researchers [OW02] have used the push protocol with a source cooperation approach,<br />
where data sources actively notify the remote data cache servers about their changes. They<br />
propose different types of divergence metrics <strong>and</strong> while lag is similar to our window/feed<br />
divergence measures, these metrics are conceived for a different application context <strong>and</strong><br />
only partially reflect the particularities of RSS feeds.<br />
Virtually all content providers <strong>and</strong> aggregators adopt the pull approach. Content providers<br />
sometime exclude all or part of their content from being crawled <strong>and</strong> may provide hints<br />
about their content, rate of change <strong>and</strong> importance. Such efforts to make web crawling<br />
more efficient by improving the underlying protocol were made by introducing the Sitemaps<br />
protocol [sit] that allows a site administrator to publish the list of available pages <strong>and</strong> their<br />
last modification date. Tradeoffs between using the classical pull approach <strong>and</strong> Sitemaps<br />
for content discovery were examined in [SS09]. Even though this protocol offers a crawler<br />
useful information about new created pages <strong>and</strong> their changes, the crawler still has to use<br />
the pull architecture in order to regularly get the information of its interest <strong>and</strong> thus, pull<br />
based refresh strategies are still needed.<br />
Both push <strong>and</strong> pull protocols have their advantages <strong>and</strong> disadvantages. When the new<br />
content is pushed by the content providers, a big advantage is that the subscribed aggregators<br />
can receive instant update notifications as soon as new content is published.<br />
However, the problem is that the content providers must be aware of their subscribers list<br />
<strong>and</strong> manage a trust relationship with them, which is difficult <strong>and</strong> highly unlikely in the<br />
web’s reality. On the other h<strong>and</strong>, when the new available content is proactively pulled<br />
from the content providers, the crawler obeys a refresh strategy that must predict the<br />
update moments <strong>and</strong> sources <strong>and</strong> that decides when the crawler fetches the newly pub-<br />
34