18.07.2013 Views

considering autocorrelation in predictive models - Department of ...


Introduction

analysis of such data needs to take this into account. Such work either removes the autocorrelation dependencies during pre-processing and then uses traditional algorithms (e.g., Hardisty and Klippel, 2010; Huang et al., 2004) or modifies classical machine learning, data mining and statistical methods to take the autocorrelation into account (e.g., Bel et al., 2009; Rinzivillo and Turini, 2004, 2007). There are also approaches that use a relational setting (e.g., Ceci and Appice, 2006; Malerba et al., 2005), where the autocorrelation is usually incorporated through the data structure or defined implicitly through relationships among the data and other data properties.

However, one limitation of most of the approaches that take autocorrelation into account is that they assume that autocorrelation dependencies are constant (i.e., do not change) throughout the space/network (Angin and Neville, 2008). This means that possibly significant variability in autocorrelation dependencies at different points of the space/network cannot be represented and modeled. Such variability could result from a different underlying latent structure of the space/network that varies among its parts in terms of properties of nodes or associations between them. For example, different research communities may have different levels of cohesiveness and thus cite papers on other topics to varying degrees. As pointed out by Angin and Neville (2008), when autocorrelation varies significantly throughout a network, it may be more accurate to model the dependencies locally rather than globally.
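To make the contrast between global and local modeling concrete, the following minimal sketch computes an autocorrelation index once over a whole toy network and once per region. It assumes Moran's I with unit edge weights as the autocorrelation measure; the function name `morans_i`, the chain-shaped network and the values are all hypothetical and chosen only for illustration.

```python
def morans_i(values, edges, nodes=None):
    """Moran's I with unit weights on undirected edges, restricted to `nodes`.

    Assumption for this sketch: returns 0.0 when the index is undefined
    (no edges inside the region, or no variation in the values).
    """
    if nodes is None:
        nodes = range(len(values))
    nodes = set(nodes)
    # Keep only edges whose two endpoints both lie inside the region.
    inside = [(i, j) for i, j in edges if i in nodes and j in nodes]
    n = len(nodes)
    mean = sum(values[i] for i in nodes) / n
    dev = {i: values[i] - mean for i in nodes}
    denom = sum(d * d for d in dev.values())
    if not inside or denom == 0:
        return 0.0
    # Each undirected edge appears twice in the symmetric weight matrix.
    cross = 2 * sum(dev[i] * dev[j] for i, j in inside)
    total_weight = 2 * len(inside)
    return (n / total_weight) * (cross / denom)


# Toy chain network: nodes 0..5 vary smoothly, nodes 6..11 alternate.
values = [1, 2, 3, 4, 5, 6, 1, 6, 1, 6, 1, 6]
edges = [(i, i + 1) for i in range(len(values) - 1)]

global_i = morans_i(values, edges)
first_i = morans_i(values, edges, nodes=range(0, 6))
second_i = morans_i(values, edges, nodes=range(6, 12))

print("global:", round(global_i, 3))
print("first region:", round(first_i, 3))
print("second region:", round(second_i, 3))
```

In this toy case the first region is positively autocorrelated and the second negatively, while the single global value is negative and so hides the positive dependence in the first region. This is the kind of variability that a stationary, global model cannot represent.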

In this dissertation, we extend the predictive clustering framework, in the context of predictive clustering trees (PCTs), so that it can deal with spatial and network data that do not follow the i.i.d. assumption. The distinctive characteristic of the proposed approach is that it explicitly considers non-stationary (spatial and network) autocorrelation when building the predictive models. Such a method not only extends the applicability of the predictive clustering approach, but also exploits the autocorrelation phenomenon and uses it to make better predictions and build better models.

In traditional PCTs (Blockeel, 1998), the tree construction is performed by maximizing variance reduction. This heuristic guarantees, in principle, accurate models since it reduces the error on the training set. However, it neglects the possible presence of autocorrelation in the training data. To address this issue, we propose to simultaneously maximize autocorrelation for spatial/network domains. In this way, we exploit the spatial/network structure of the data in the PCT induction phase and obtain predictive models that naturally deal with the phenomenon of autocorrelation.
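As a rough illustration of scoring a candidate split on both criteria, the sketch below computes variance reduction and an autocorrelation term for two candidate partitions of a toy chain network. It is not the method defined in the dissertation: the use of Moran's I, the weighted sum, the parameter `alpha` and all function names are assumptions introduced here for illustration only.

```python
def variance(ys):
    """Population variance of a list of numbers (0.0 for an empty list)."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys) / len(ys)


def variance_reduction(ys, left, right):
    """Parent variance minus the size-weighted variance of the two children."""
    n = len(left) + len(right)
    parent = variance([ys[i] for i in left + right])
    children = sum(len(part) / n * variance([ys[i] for i in part])
                   for part in (left, right))
    return parent - children


def morans_i(ys, edges, nodes):
    """Moran's I with unit edge weights inside `nodes` (0.0 if undefined)."""
    nodes = set(nodes)
    inside = [(i, j) for i, j in edges if i in nodes and j in nodes]
    mean = sum(ys[i] for i in nodes) / len(nodes)
    dev = {i: ys[i] - mean for i in nodes}
    denom = sum(d * d for d in dev.values())
    if not inside or denom == 0:
        return 0.0
    cross = 2 * sum(dev[i] * dev[j] for i, j in inside)
    return (len(nodes) / (2 * len(inside))) * (cross / denom)


def split_score(ys, edges, left, right, alpha=0.5):
    """Hypothetical combined score: alpha weights relative variance reduction,
    (1 - alpha) weights the size-weighted Moran's I of the children."""
    n = len(left) + len(right)
    parent = variance([ys[i] for i in left + right])
    rel_reduction = variance_reduction(ys, left, right) / parent if parent else 0.0
    autocorr = sum(len(part) / n * morans_i(ys, edges, part)
                   for part in (left, right))
    return alpha * rel_reduction + (1 - alpha) * autocorr


# Toy data: eight nodes on a chain, target values rising along the chain.
ys = [1, 2, 3, 4, 7, 8, 9, 10]
edges = [(i, i + 1) for i in range(len(ys) - 1)]

contiguous = ([0, 1, 2, 3], [4, 5, 6, 7])   # children are connected regions
interleaved = ([0, 2, 4, 6], [1, 3, 5, 7])  # children ignore the structure

score_contiguous = split_score(ys, edges, *contiguous)
score_interleaved = split_score(ys, edges, *interleaved)
print(score_contiguous, score_interleaved)
```

Here the contiguous split scores higher on both terms, so any `alpha` between 0 and 1 prefers it. Setting `alpha=1.0` in this sketch recovers a purely variance-based choice, which corresponds to the traditional heuristic described above.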

The consideration of autocorrelation in clustering has already been investigated in the literature, both for spatial clustering (Glotsos et al, 2004) and network clustering (Jahani and Bagherpour, 2011). Motivated by the demonstrated benefits of considering autocorrelation, we exploit some characteristics of autocorrelated data to improve the quality of PCTs. The consideration of autocorrelation in clustering offers several advantages, since it allows us to:

• determine the strength of the spatial/network arrangement on the variables in the model;
• evaluate stationarity and heterogeneity of the autocorrelation phenomenon across space;
• identify the possible role of the spatial/network arrangement/distance decay on the predictions associated with each of the nodes of the tree;
• focus on the spatial/network “neighborhood” to better understand the effects that it can have on other neighborhoods and vice versa.

These advantages of considering spatial autocorrelation in clustering, identified by Arthur (2008), fit well into the case of PCTs. Moreover, as recognized by Griffith (2003), autocorrelation implicitly defines a zoning of a (spatial) phenomenon: taking this into account reduces the effect of autocorrelation on prediction errors. Therefore, we propose to perform clustering by maximizing both variance reduction
