18.12.2012 Views

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

SHOW MORE
SHOW LESS

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

<strong>and</strong> therefore, hard to manage. However, the pull protocol works without needing the<br />

content providers to be aware of the subscribed aggregators. And finally, adopting a push<br />

protocol requires the existence <strong>and</strong> maintenance of a trust relationship between content<br />

providers <strong>and</strong> content aggregators, which is very difficult to manage <strong>and</strong> highly unlikely<br />

in the Web’s reality. A more elaborate discussion together with related work references<br />

on pull <strong>and</strong> push protocols is presented in Section 3.1.<br />

When considering the underlying communication protocol, from the point of view of a<br />

web server there is no distinction between RSS feeds <strong>and</strong> web pages or any other web<br />

resources. All kinds of resources have to be refreshed by using the st<strong>and</strong>ard pull-based<br />

HTTP protocol where changes can only be detected by explicitely contacting the content<br />

provider server. Therefore aggregators as well as search engines face the same type of<br />

challenge: deciding when is the optimum refresh time moment of each resource.<br />

Our research work is situated in the context of the RoSeS project (Really Open Simple<br />

<strong>and</strong> Efficient Syndication) [rosa], a content-based feed aggregation system. Its goal is to<br />

define a set of web resource syndication services <strong>and</strong> tools for localizing, querying, generating,<br />

composing <strong>and</strong> personalizing RSS feeds available on the Web. A RoSeS aggregator<br />

represents an RSS feed sharing system in which users are allowed to register content-based<br />

aggregation queries defined on a set of input RSS feeds, whose results are made available<br />

to users as a newly created output RSS feed. For that, it has to periodically retrieve the<br />

newly published content on the ever changing RSS feed sources.<br />

Placed in the context of the content-based feed aggregation systems, the main goal of this<br />

dissertation is to improve the aggregation quality by designing optimal refresh strategies<br />

<strong>and</strong> online change estimation techniques adapted for crawling highly dynamic RSS feed<br />

sources. Despite the apparent simplicity, feed aggregation systems have many inherent<br />

challenges:<br />

• Large scale. Not only that the Web is very large, but it is also permanently<br />

evolving <strong>and</strong> so are the RSS feeds hosted on websites. Therefore, feed aggregators<br />

must support an extremely high throughput by dealing with a large number of feed<br />

content providers <strong>and</strong> consumers while seeking high aggregation quality.<br />

• Highly dynamic content. As any web resource, RSS feeds evolve independently<br />

of their clients <strong>and</strong> may suddenly change their publication activity. Even if the<br />

aggregator supports large scale syndication, it could never keep up with all the<br />

dynamic changes of all the feeds.<br />

• <strong>Refresh</strong> decision. In order to assure the proper functioning of an aggregator, it<br />

must use an intelligent refresh strategy that decides when to refresh each feed source<br />

such as to assure aggregation quality maximization within a minimum cost.<br />

• Limited resources. Not only that crawlers should respect politeness policies by<br />

not imposing too much of a burden on the content providers, but they are also<br />

constrained by their own internal resource limitations, such as b<strong>and</strong>width, storage,<br />

memory or computing consumed resources, minimizing the overall cost.<br />

3

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!