08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 11

Detecting redundant features using filters

Filters try to clean up the feature forest independent of any machine learning method used later. They rely on statistical methods to find out which of the features are redundant (in which case, we need to keep only one per redundant feature group) or irrelevant. In general, the filter works as depicted in the workflow shown in the following diagram:

[Diagram: filter workflow. All features (x1, x2, ..., xN) -> select features that are not redundant -> some features (x2, x7, ..., xM) -> select features that are not irrelevant -> resulting features (x2, x10, x14). The label y also appears in the diagram.]
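The two stages of this workflow can be sketched in a few lines of Python. This is only a minimal illustration, not the book's implementation: it assumes that absolute Pearson correlation (computed with `numpy.corrcoef`) serves as the criterion for both redundancy and relevance, and the function name `filter_features` and both threshold values are hypothetical choices made for this example.

```python
import numpy as np


def filter_features(X, y, redundancy_threshold=0.9, relevance_threshold=0.1):
    """Return the column indices of X kept by a two-stage correlation filter.

    Both thresholds are hypothetical defaults chosen for illustration only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)

    # Stage 1: drop redundant features. Walk through the columns in order and
    # keep a column only if it is not strongly correlated with a kept one.
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < redundancy_threshold for k in kept):
            kept.append(j)

    # Stage 2: drop irrelevant features, here taken to mean features that are
    # barely (linearly) correlated with the target y.
    relevant = []
    for j in kept:
        r = abs(np.corrcoef(X[:, j], y)[0, 1])
        if r >= relevance_threshold:
            relevant.append(j)
    return relevant


# Deterministic toy data: x1 nearly duplicates x0, x2 is linearly unrelated to y.
x0 = np.linspace(-1, 1, 201)
x1 = 2 * x0 + 0.01 * np.sin(40 * x0)
x2 = np.cos(3 * x0)
y = 3 * x0 + 1
X = np.column_stack([x0, x1, x2])

selected = filter_features(X, y)
print(selected)
```

In this toy setup, the near-duplicate column is removed in the first stage and the column that has no linear relationship with `y` is removed in the second. Which member of a redundant group survives depends purely on column order here; a real filter might prefer the member that is most relevant to the target.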

Correlation

Using correlation, we can easily see linear relationships between pairs of features, that is, relationships that can be modeled using a straight line. In the graphs shown in the following screenshot, we can see different degrees of correlation together with a potential linear dependency plotted as a red dashed line (a fitted first-degree polynomial). The correlation coefficient at the top of the individual graphs is calculated using the common Pearson correlation coefficient (the Pearson r value) by means of the pearsonr() function of scipy.stats.

Given two equal-sized data series, it returns a tuple of the correlation coefficient value and the p-value, which is roughly the probability that an uncorrelated system would produce data series with a correlation at least this strong. In other words, the higher the p-value, the less we should trust the correlation coefficient:

>>> from scipy.stats import pearsonr
>>> pearsonr([1,2,3], [1,2,3.1])
(0.99962228516121843, 0.017498096813278487)
>>> pearsonr([1,2,3], [1,20,6])
(0.25383654128340477, 0.83661493668227405)

In the first case, we have a clear indication that both series are correlated. In the second one, we still clearly have a non-zero correlation coefficient.
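To connect the coefficient with the fitted straight line mentioned earlier, the following sketch computes both for the same two pairs of series. It is an illustration under assumptions rather than the code behind the book's graphs: the helper name `correlation_and_fit` is hypothetical, and `numpy.polyfit` with degree 1 is used here as one way to fit the line.

```python
import numpy as np
from scipy.stats import pearsonr


def correlation_and_fit(x, y):
    """Return (r, p, slope, intercept) for two equal-sized data series."""
    r, p = pearsonr(x, y)
    # A degree-1 polynomial fit is a least-squares straight line.
    slope, intercept = np.polyfit(x, y, 1)
    return float(r), float(p), float(slope), float(intercept)


x = [1, 2, 3]
for y in ([1, 2, 3.1], [1, 20, 6]):
    r, p, slope, intercept = correlation_and_fit(x, y)
    print(f"r={r:.4f}  p={p:.4f}  line: y = {slope:.2f}*x + {intercept:.2f}")
```

A straight line can be fitted to both pairs, so the fit alone says little; it is the correlation coefficient together with its p-value that indicates how far the linear relationship can be trusted.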
