Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6
and therefore, hard to manage. However, the pull protocol works without needing the
content providers to be aware of the subscribed aggregators. And finally, adopting a push
protocol requires the existence and maintenance of a trust relationship between content
providers and content aggregators, which is very difficult to manage and highly unlikely
in the Web’s reality. A more elaborate discussion together with related work references
on pull and push protocols is presented in Section 3.1.
When considering the underlying communication protocol, from the point of view of a
web server there is no distinction between RSS feeds and web pages or any other web
resources. All kinds of resources have to be refreshed by using the standard pull-based
HTTP protocol, where changes can only be detected by explicitly contacting the content
provider server. Therefore, aggregators as well as search engines face the same type of
challenge: deciding the optimal moment at which to refresh each resource.
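As a rough illustration of pull-based change detection, the following minimal sketch compares a digest of each newly pulled copy with the digest of the previous one. The names `PullMonitor` and `fetch` are hypothetical and introduced only for this example; `fetch` stands in for an HTTP GET, so the sketch runs without network access. It is not the detection mechanism of any particular aggregator.

```python
import hashlib
from typing import Callable, Dict


def digest(body: bytes) -> str:
    """Return a stable fingerprint of a resource body."""
    return hashlib.sha256(body).hexdigest()


class PullMonitor:
    """Remembers the digest of the last pulled copy of each resource.

    Hypothetical helper for illustration only.
    """

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        # `fetch` is an assumed stand-in for an HTTP GET on the provider.
        self._fetch = fetch
        self._last: Dict[str, str] = {}

    def refresh(self, url: str) -> bool:
        """Contact the provider once and report whether the content changed.

        The first pull of a resource always counts as a change.
        """
        new = digest(self._fetch(url))
        changed = self._last.get(url) != new
        self._last[url] = new
        return changed


# Simulated content provider: the monitor only learns about an update
# when it explicitly pulls the resource again.
store = {"feed": b"item 1"}
monitor = PullMonitor(lambda url: store[url])
first = monitor.refresh("feed")    # never seen before
second = monitor.refresh("feed")   # unchanged since the last pull
store["feed"] = b"item 1, item 2"  # the provider publishes silently
third = monitor.refresh("feed")    # change detected only on this pull
```

The update in the simulated store stays invisible to the client until the next pull, which is why the timing of refreshes matters.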
Our research work is situated in the context of the RoSeS project (Really Open Simple
and Efficient Syndication) [rosa], a content-based feed aggregation system. Its goal is to
define a set of web resource syndication services and tools for localizing, querying, generating,
composing and personalizing RSS feeds available on the Web. A RoSeS aggregator
represents an RSS feed sharing system in which users are allowed to register content-based
aggregation queries defined on a set of input RSS feeds, whose results are made available
to users as a newly created output RSS feed. For that, it has to periodically retrieve the
newly published content on the ever changing RSS feed sources.
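The idea of a content-based aggregation query over several input feeds can be sketched as follows. This is only an illustration under simplifying assumptions: the query is reduced to a single keyword matched against item titles, and the names `Item` and `aggregate` are hypothetical. It does not reproduce the RoSeS query language, which is not described in this passage.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Item:
    """A simplified feed item (assumed structure, for illustration)."""
    feed: str
    title: str
    published: int  # publication time as a Unix timestamp


def aggregate(feeds: Iterable[List[Item]], keyword: str) -> List[Item]:
    """Merge the input feeds and keep the items whose title contains `keyword`.

    The result plays the role of the output feed, newest items first.
    """
    kw = keyword.lower()
    matches = [item for feed in feeds for item in feed if kw in item.title.lower()]
    return sorted(matches, key=lambda item: item.published, reverse=True)


feed_a = [Item("a", "RSS crawling at scale", 100), Item("a", "Cooking tips", 110)]
feed_b = [Item("b", "Why RSS still matters", 120)]
output_feed = aggregate([feed_a, feed_b], "rss")
```

The output feed can only be as fresh as the locally retrieved copies of the input feeds, which is where the periodic retrieval mentioned above comes in.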
Placed in the context of content-based feed aggregation systems, the main goal of this
dissertation is to improve the aggregation quality by designing optimal refresh strategies
and online change estimation techniques adapted for crawling highly dynamic RSS feed
sources. Despite their apparent simplicity, feed aggregation systems have many inherent
challenges:
• Large scale. Not only is the Web very large, but it is also permanently
evolving, and so are the RSS feeds hosted on websites. Therefore, feed aggregators
must support an extremely high throughput by dealing with a large number of feed
content providers and consumers while seeking high aggregation quality.
• Highly dynamic content. Like any web resource, RSS feeds evolve independently
of their clients and may suddenly change their publication activity. Even if the
aggregator supports large scale syndication, it could never keep up with all the
dynamic changes of all the feeds.
• Refresh decision. In order to ensure the proper functioning of an aggregator, it
must use an intelligent refresh strategy that decides when to refresh each feed source
so as to maximize aggregation quality at minimum cost.
• Limited resources. Not only should crawlers respect politeness policies by
not imposing too much of a burden on the content providers, but they are also
constrained by their own internal resource limitations, such as bandwidth, storage,
memory or computing resources, and must therefore minimize the overall cost.
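To make the tension between the refresh decision and limited resources concrete, the sketch below applies a simple greedy heuristic: with a budget of a few refreshes per round, pull the feeds expected to have accumulated the most new items. This is not the refresh strategy developed in this dissertation. The function `pick_feeds`, the per-feed publication rates, and the linear estimate (rate times elapsed time) are all assumptions made for the example.

```python
from typing import Dict, List


def pick_feeds(rates: Dict[str, float], elapsed: Dict[str, float], budget: int) -> List[str]:
    """Choose at most `budget` feeds to refresh in this round.

    rates:   assumed average number of new items per hour for each feed
    elapsed: hours since each feed was last refreshed
    Feeds are ranked by expected number of unseen items (rate * elapsed),
    with ties broken alphabetically so the result is deterministic.
    """
    expected = {feed: rates[feed] * elapsed[feed] for feed in rates}
    ranked = sorted(expected, key=lambda feed: (-expected[feed], feed))
    return ranked[:max(budget, 0)]


rates = {"news": 2.0, "blog": 0.5, "forum": 1.0}
elapsed = {"news": 1.0, "blog": 10.0, "forum": 3.0}
# Expected unseen items: news 2.0, blog 5.0, forum 3.0
chosen = pick_feeds(rates, elapsed, budget=2)
```

In this example the slow but long-neglected blog outranks the fast but recently refreshed news feed. In practice the rates are unknown and change over time, so they would have to be estimated online.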