01.04.2015 Views

1FfUrl0


Chapter 11

Detecting redundant features using filters

Filters try to clean up the feature forest independent of any machine learning method used later. They rely on statistical methods to find out which of the features are redundant (in which case, we need to keep only one per redundant feature group) or irrelevant. In general, the filter works as depicted in the workflow shown in the following diagram:

[Diagram: all features (x1, x2, ..., xN) -> select features that are not redundant -> some features (x2, x7, ..., xM) -> select features that are not irrelevant -> resulting features (x2, x10, x14). The label y also appears in the diagram.]
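The two-stage workflow above can be sketched in a few lines. This is only a minimal illustration, assuming absolute Pearson correlation as the statistic for both stages; the helper names `drop_redundant` and `drop_irrelevant`, the thresholds 0.9 and 0.3, and the toy data are hypothetical and not taken from the text.

```python
import numpy as np


def drop_redundant(X, threshold=0.9):
    """Return column indices of X, keeping one feature per correlated group.

    `threshold` is a hypothetical cutoff on |Pearson r| between two features.
    """
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        # Keep column j only if it is not strongly correlated
        # with any column that has already been kept.
        if all(corr[j, k] < threshold for k in kept):
            kept.append(j)
    return kept


def drop_irrelevant(X, y, columns, threshold=0.3):
    """Among `columns`, keep those whose |Pearson r| with y reaches `threshold`."""
    return [j for j in columns
            if abs(np.corrcoef(X[:, j], y)[0, 1]) >= threshold]


# Toy data (made up for illustration).
x0 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
x1 = 2.0 * x0 + 1.0                                # linear copy of x0: redundant
x2 = np.array([1.0, -1.0, -1.0, 1.0, 1.0, -1.0])   # barely related to y
y = 3.0 * x0
X = np.column_stack([x0, x1, x2])

not_redundant = drop_redundant(X)
selected = drop_irrelevant(X, y, not_redundant)
print("after redundancy filter:", not_redundant)
print("after relevance filter:", selected)
```

Because this sketch relies on correlation alone, it only catches linear redundancy and linear relevance; a feature that relates to the target in a non-linear way would be discarded by the second stage.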

Correlation

Using correlation, we can easily see linear relationships between pairs of features, that is, relationships that can be modeled using a straight line. In the graphs shown in the following screenshot, we can see different degrees of correlation together with a potential linear dependency plotted as a red dashed line (a fitted first-degree polynomial). The correlation coefficient at the top of the individual graphs is the common Pearson correlation coefficient (the Pearson r value), calculated with the pearsonr() function of scipy.stats.

Given two equal-sized data series, it returns a tuple of the correlation coefficient and the p-value, which is roughly the probability that an uncorrelated system would produce data series with a correlation at least as extreme as the one observed. In other words, the higher the p-value, the less we should trust the correlation coefficient:

>>> from scipy.stats import pearsonr
>>> pearsonr([1,2,3], [1,2,3.1])
(0.99962228516121843, 0.017498096813278487)
>>> pearsonr([1,2,3], [1,20,6])
(0.25383654128340477, 0.83661493668227405)

In the first case, we have a clear indication that both series are correlated. In the second one, we still clearly have a non-zero coefficient, but its p-value of roughly 0.84 tells us that we should not put much trust in it.
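To make the connection between the coefficient and the fitted straight line more concrete, the following sketch recomputes the Pearson coefficient directly from its definition and fits a first-degree polynomial to the same two data series used above. The helper names `pearson_by_hand` and `fit_line` are hypothetical; `np.polyfit` is assumed here as the fitting routine, which the text does not name.

```python
import numpy as np
from scipy.stats import pearsonr


def pearson_by_hand(x, y):
    """Pearson r from its definition: covariance over the product of spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))


def fit_line(x, y):
    """Least-squares straight line (first-degree polynomial) through the points."""
    slope, intercept = np.polyfit(x, y, deg=1)
    return float(slope), float(intercept)


for x, y in [([1, 2, 3], [1, 2, 3.1]), ([1, 2, 3], [1, 20, 6])]:
    r, p = pearsonr(x, y)  # unpacks as (coefficient, p-value)
    slope, intercept = fit_line(x, y)
    print(f"r={r:.4f} (by hand: {pearson_by_hand(x, y):.4f}), p={p:.4f}, "
          f"line: y = {slope:.2f} * x + {intercept:.2f}")
```

A line can always be fitted, even to the second series; the coefficient and the p-value are what indicate whether that line deserves any attention.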

