18.07.2013 Views

considering autocorrelation in predictive models - Department of ...


Abstract

Most machine learning, data mining and statistical methods rely on the assumption that the analyzed data
are independent and identically distributed (i.i.d.). More specifically, the individual examples included
in the training data are assumed to be drawn independently of each other from the same probability
distribution. However, cases where this assumption is violated can be easily found: for example, species
are distributed non-randomly across a wide range of spatial scales. The i.i.d. assumption is often violated
because of the phenomenon of autocorrelation.

The cross-correlation of an attribute with itself is typically referred to as autocorrelation: This is
the most general definition found in the literature. Specifically, in statistics, temporal autocorrelation is
defined as the cross-correlation between the attribute of a process at different points in time. In time-series
analysis, temporal autocorrelation is defined as the correlation among time-stamped values due
to their relative proximity in time. In spatial analysis, spatial autocorrelation has been defined as the
correlation among data values, which is strictly due to the relative location proximity of the objects
that the data refer to. It is justified by Tobler’s first law of geography according to which “everything
is related to everything else, but near things are more related than distant things”. In network studies,
autocorrelation is defined by the homophily principle as the tendency of nodes with similar values to be
linked with each other.
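
As a rough illustration of the temporal and spatial definitions above, the sketch below computes a lag-k sample autocorrelation for a time series and the global Moran's I statistic for values observed at a set of locations. These are standard textbook estimators and are not necessarily the measures used in the dissertation; the function names `temporal_autocorrelation` and `morans_i`, the toy data and the chain-shaped neighbourhood matrix are assumptions made purely for illustration.

```python
def temporal_autocorrelation(values, lag):
    """Sample autocorrelation of a series with a copy of itself shifted by `lag` steps."""
    n = len(values)
    mean = sum(values) / n
    # Total squared deviation from the mean (normalising term).
    denom = sum((v - mean) ** 2 for v in values)
    # Co-variation between each value and the value `lag` steps later.
    num = sum((values[t] - mean) * (values[t + lag] - mean) for t in range(n - lag))
    return num / denom


def morans_i(values, weights):
    """Global Moran's I for values at n locations, given an n x n spatial weight matrix."""
    n = len(values)
    mean = sum(values) / n
    dev = [v - mean for v in values]
    w_sum = sum(sum(row) for row in weights)
    # Weighted co-variation between every pair of locations.
    num = sum(weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n))
    denom = sum(d * d for d in dev)
    return (n / w_sum) * (num / denom)


# Hypothetical toy data: four sites on a line, each a neighbour of the adjacent sites.
chain = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]

trend_series = [1, 2, 3, 4, 5, 6]
smooth_sites = [1, 2, 3, 4]       # neighbouring sites have similar values
alternating_sites = [1, 4, 1, 4]  # neighbouring sites have dissimilar values

r1 = temporal_autocorrelation(trend_series, 1)
i_smooth = morans_i(smooth_sites, chain)
i_alternating = morans_i(alternating_sites, chain)
```

With these inputs, the smoothly varying sites give a positive Moran's I, in line with Tobler's law as quoted above, while the alternating pattern gives a negative value.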

In this dissertation, we first give a clear and general definition of the autocorrelation phenomenon,
which includes spatial and network autocorrelation for continuous and discrete responses. We then
present a broad overview of the existing autocorrelation measures for the different types of autocorrelation
and data analysis methods that consider them. Focusing on spatial and network autocorrelation, we
propose three algorithms that handle non-stationary autocorrelation within the framework of predictive
clustering, which deals with the tasks of classification, regression and structured output prediction. These
algorithms and their empirical evaluation are the major contributions of this thesis.

We first propose a data mining method called SCLUS that explicitly considers spatial autocorrelation
when learning predictive clustering models. The method is based on the concept of predictive clustering
trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive
model is associated with each cluster. In particular, our approach is able to learn predictive models for both
a continuous response (regression task) and a discrete response (classification task). It properly deals with
autocorrelation in data and provides a multi-level insight into the spatial autocorrelation phenomenon.
The predictive models adapt to the local properties of the data, providing at the same time spatially
smoothed predictions. We evaluate our approach on several real-world problems of spatial regression
and spatial classification.

The problem of “network inference” is known to be a challenging task. In this dissertation, we
propose a data mining method called NCLUS that explicitly considers autocorrelation when building
predictive models from network data. The algorithm is based on the concept of PCTs that can be used
for clustering, prediction and multi-target prediction, including multi-target regression and multi-target
classification. We evaluate our approach on several real-world problems of network regression, coming
from the areas of social and spatial networks. Empirical results show that our algorithm performs better
