considering autocorrelation in predictive models - Department of ...
Abstract
Most machine learning, data mining and statistical methods rely on the assumption that the analyzed data
are independent and identically distributed (i.i.d.). More specifically, the individual examples included
in the training data are assumed to be drawn independently of each other from the same probability
distribution. However, cases where this assumption is violated are easy to find: for example, species
are distributed non-randomly across a wide range of spatial scales. The i.i.d. assumption is often violated
because of the phenomenon of autocorrelation.
The cross-correlation of an attribute with itself is typically referred to as autocorrelation; this is
the most general definition found in the literature. Specifically, in statistics, temporal autocorrelation is
defined as the cross-correlation between the values of an attribute of a process at different points in time.
In time-series analysis, temporal autocorrelation is defined as the correlation among time-stamped values due
to their relative proximity in time. In spatial analysis, spatial autocorrelation has been defined as the
correlation among data values that is strictly due to the relative proximity of the locations of the objects
that the data refer to. It is justified by Tobler’s first law of geography, according to which “everything
is related to everything else, but near things are more related than distant things”. In network studies,
autocorrelation is defined by the homophily principle as the tendency of nodes with similar values to be
linked with each other.
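
The definitions above can be made concrete with a minimal sketch. It assumes NumPy and uses two common textbook statistics: the sample autocorrelation of a series at a given lag, and global Moran's I computed over a neighbourhood weight matrix. The function names `lag_autocorrelation` and `morans_i` are hypothetical, and Moran's I is chosen here only for illustration; it is not necessarily the measure adopted in the dissertation. Passing a network adjacency matrix as `weights` gives a simple analogue for network autocorrelation.

```python
import numpy as np


def lag_autocorrelation(x, lag):
    """Sample autocorrelation of a 1-D series at a positive integer lag."""
    x = np.asarray(x, dtype=float)
    d = x - x.mean()                      # deviations from the series mean
    return float(np.sum(d[:-lag] * d[lag:]) / np.sum(d * d))


def morans_i(values, weights):
    """Global Moran's I for values observed at n locations (or nodes).

    weights[i, j] > 0 marks i and j as neighbours; the diagonal is zero.
    """
    y = np.asarray(values, dtype=float)
    w = np.asarray(weights, dtype=float)
    d = y - y.mean()                      # deviations from the global mean
    n = y.size
    return float(n / w.sum() * (d @ w @ d) / (d @ d))


# Illustrative data (an assumption, not from the source): four sites on a
# line, each one a neighbour of the sites directly next to it.
chain = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
])

clustered = morans_i([1, 1, 5, 5], chain)    # similar values sit together
alternating = morans_i([1, 5, 1, 5], chain)  # neighbours always differ
trend = lag_autocorrelation([1, 2, 3, 4, 5], lag=1)
```

In this toy setting, `clustered` is positive and `alternating` is negative, which matches the intuition that positive autocorrelation means nearby (or linked) observations tend to carry similar values.
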
In this dissertation, we first give a clear and general definition of the autocorrelation phenomenon,
which includes spatial and network autocorrelation for continuous and discrete responses. We then
present a broad overview of the existing autocorrelation measures for the different types of autocorrelation
and data analysis methods that consider them. Focusing on spatial and network autocorrelation, we
propose three algorithms that handle non-stationary autocorrelation within the framework of predictive
clustering, which deals with the tasks of classification, regression and structured output prediction. These
algorithms and their empirical evaluation are the major contributions of this thesis.
We first propose a data mining method called SCLUS that explicitly considers spatial autocorrelation
when learning predictive clustering models. The method is based on the concept of predictive clustering
trees (PCTs), according to which hierarchies of clusters of similar data are identified and a predictive
model is associated with each cluster. In particular, our approach is able to learn predictive models for both
a continuous response (regression task) and a discrete response (classification task). It properly deals with
autocorrelation in data and provides a multi-level insight into the spatial autocorrelation phenomenon.
The predictive models adapt to the local properties of the data, providing at the same time spatially
smoothed predictions. We evaluate our approach on several real-world problems of spatial regression
and spatial classification.
The problem of “network inference” is known to be a challenging task. In this dissertation, we
propose a data mining method called NCLUS that explicitly considers autocorrelation when building
predictive models from network data. The algorithm is based on the concept of PCTs that can be used
for clustering, prediction and multi-target prediction, including multi-target regression and multi-target
classification. We evaluate our approach on several real-world problems of network regression, coming
from the areas of social and spatial networks. Empirical results show that our algorithm performs better