18.12.2012 Views

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

Roxana - Gabriela HORINCAR Refresh Strategies and Online ... - LIP6

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

fraction of recently published feed items that have not been retrieved yet. A maximum<br />

window freshness shows that the RSS feed result of an aggregation query is up to date<br />

with all its RSS feed sources. The proposed quality metrics (feed completeness <strong>and</strong> window<br />

freshness) considered simultaneously were designed to reflect the quality of newly<br />

produced aggregation feeds within a content-based feed aggregation system.<br />

When should the crawler of a feed aggregation system refresh the RSS feed<br />

sources <strong>and</strong> what makes a refresh strategy optimal? We propose a best effort refresh<br />

strategy based on the Lagrange multipliers [Ste91] optimization method that retrieves<br />

new postings from different feed sources <strong>and</strong> maintains an optimal level of aggregation<br />

quality, obtained with a minimum cost. Our strategy maximizes both feed completeness<br />

<strong>and</strong> window freshness, fetching new items in time <strong>and</strong> avoiding item loss. The proposed<br />

policy is supported by an extensive experimental evaluation, where its performance <strong>and</strong><br />

robustness are tested by comparing it to other refresh strategies.<br />

How much new information do the RSS feeds publish every day? When exactly<br />

do they publish it? Are there any patterns in their publication activity? In<br />

order to better underst<strong>and</strong> the publication behavior of real RSS feeds, we propose a general<br />

characteristics analysis with a focus on the temporal dimension of real RSS feed sources,<br />

using data collected over four weeks from more than 2500 RSS feeds. We start by looking<br />

into the intensity of the feed publication activity <strong>and</strong> into their daily periodicity features.<br />

We also classify the source feeds according to three different publication shapes: peaks,<br />

uniform <strong>and</strong> waves.<br />

How should the feed publication behavior should be modeled? What is an<br />

appropriate estimation model? How can we keep the estimation models up to<br />

date with the continually changing feed publication activities? Inspired by the<br />

observations made on the real RSS feeds publication behavior, we propose <strong>and</strong> study two<br />

different models that reflect different types of feed publication activities. Furthermore, we<br />

propose two online estimation methods that correspond to the publication models that are<br />

continually updated in order to reflect the changes in the real feed publication activities.<br />

We provide an experimental evaluation of the online estimation methods in cohesion with<br />

different refresh strategies <strong>and</strong> an analysis of their effectiveness on sources with different<br />

publication behavior was made.<br />

Organization of the Dissertation<br />

This rest of the dissertation is structured as follows. Chapter 1 introduces some fundamental<br />

notions of the content-based feed aggregation domain. We describe the architecture<br />

of our aggregation system in order to better underst<strong>and</strong> the proposed methods <strong>and</strong> also<br />

introduce a formal content-based feed aggregation model. Chapter 2 defines our two aggregation<br />

oriented quality measures: feed completeness <strong>and</strong> window freshness. We also<br />

explain here the saturation problem specific to RSS feeds <strong>and</strong> give an item loss measure.<br />

In Chapter 3 we discuss the general problem of web crawling, by introducing some key<br />

5

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!