10.11.2016 Views

Learning Data Mining with Python


Extracting Features with Transformers

The preceding function almost fits the interface required by scikit-learn's univariate transformers. A score function must accept two arrays as parameters (X and y in our example) and return two arrays: the scores for each feature and the corresponding p-values. The chi2 function we used earlier already follows this interface, which allowed us to pass it directly to SelectKBest.
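As a quick check of that interface, the following sketch calls chi2 directly on a tiny made-up dataset (the values of `X_toy` and `y_toy` are invented for illustration and are not the chapter's data). It only shows the shape of what comes back: one score and one p-value per feature.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Hypothetical toy data: 4 samples, 3 non-negative features (chi2 requires X >= 0)
X_toy = np.array([[1, 0, 3],
                  [0, 1, 2],
                  [2, 0, 0],
                  [0, 3, 1]])
y_toy = np.array([1, 0, 1, 0])

# A score function takes (X, y) and returns (scores, pvalues)
toy_scores, toy_pvalues = chi2(X_toy, y_toy)
print(toy_scores.shape, toy_pvalues.shape)
```

Any function that returns a pair of arrays like this, one entry per column of X, can be passed to SelectKBest as score_func.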

The pearsonr function in SciPy accepts two arrays; however, the X array it expects is one-dimensional. We will write a wrapper function that lets us use it on multivariate arrays like the one we have. Let's look at the code:
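To make the one-dimensional case concrete, here is a minimal sketch that calls pearsonr on a single column sliced out of a two-dimensional array. The array `X_small` and the target `y_small` are made up for this illustration.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical 2-D feature array: 5 samples, 2 features
X_small = np.array([[1.0, 9.0],
                    [2.0, 3.0],
                    [3.0, 7.0],
                    [4.0, 1.0],
                    [5.0, 6.0]])
# Target built as an exact linear function of the first column
y_small = 2.0 * X_small[:, 0] + 1.0

# X_small[:, 0] is a 1-D view of the first column, which is what pearsonr expects
r, p = pearsonr(X_small[:, 0], y_small)
print(r)
```

Slicing with `X[:, column]` is the step the wrapper repeats for every column.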

def multivariate_pearsonr(X, y):

We create our scores and pvalues arrays, and then iterate over each column of the dataset:

    scores, pvalues = [], []
    for column in range(X.shape[1]):

We compute the Pearson correlation for this column only and then record both the score and the p-value:

        cur_score, cur_p = pearsonr(X[:, column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)

The Pearson value lies between -1 and 1. A value of 1 implies a perfect positive correlation between two variables, while a value of -1 implies a perfect negative correlation, that is, high values in one variable go with low values in the other and vice versa. Features with a strong negative correlation are really useful to have, but because SelectKBest keeps the highest-scoring features, they would be discarded if we stored the signed value. For this reason, we store the absolute value in the scores array rather than the original signed value.
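The effect of taking the absolute value can be seen with three invented correlation values (the numbers in `signed` are hypothetical, chosen only to illustrate the ranking). Picking the top two by signed value drops the strongly negative feature, while ranking by absolute value keeps it.

```python
import numpy as np

# Hypothetical Pearson values for three features
signed = np.array([0.10, -0.95, 0.40])

# Indices of the two highest scores, ranked by signed value
top_signed = np.argsort(signed)[::-1][:2]

# Indices of the two highest scores, ranked by absolute value
top_abs = np.argsort(np.abs(signed))[::-1][:2]

print(top_signed.tolist())  # [2, 0]
print(top_abs.tolist())     # [1, 2]
```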

Finally, we return the scores and p-values in a tuple:

    return (np.array(scores), np.array(pvalues))
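For reference, the fragments above can be assembled into one runnable function as sketched below. The two imports are the ones the fragments rely on; the data in `X_demo` and `y_demo` is synthetic and assumed only for this check, with the target made perfectly negatively correlated with the second column.

```python
import numpy as np
from scipy.stats import pearsonr


def multivariate_pearsonr(X, y):
    """Return (scores, pvalues), one entry per column of X."""
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        # pearsonr works on one column at a time
        cur_score, cur_p = pearsonr(X[:, column], y)
        # Store the magnitude so strong negative correlations rank highly
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))


# Synthetic check data (not the chapter's dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 3))
y_demo = -2.0 * X_demo[:, 1]

demo_scores, demo_pvalues = multivariate_pearsonr(X_demo, y_demo)
print(demo_scores)
```

The second column receives a score of 1 (up to floating-point error) even though its raw correlation with the target is -1.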

Now, we can use the transformer class as before to rank the features using the Pearson correlation coefficient:

transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_pearson = transformer.fit_transform(X, y)
print(transformer.scores_)
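The following self-contained sketch runs the same three steps on synthetic data, since the chapter's X and y are not reproduced here. The arrays `X_syn` and `y_syn` are assumptions made for illustration: five random features, with the target built from columns 0, 2, and 4 (column 2 with a negative sign). The call to get_support is one way to see which columns were kept.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import SelectKBest


def multivariate_pearsonr(X, y):
    # Same wrapper as in the text: absolute Pearson score per column
    scores, pvalues = [], []
    for column in range(X.shape[1]):
        cur_score, cur_p = pearsonr(X[:, column], y)
        scores.append(abs(cur_score))
        pvalues.append(cur_p)
    return (np.array(scores), np.array(pvalues))


# Synthetic data (hypothetical): only columns 0, 2 and 4 influence the target
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(500, 5))
y_syn = 3.0 * X_syn[:, 0] - 3.0 * X_syn[:, 2] + 3.0 * X_syn[:, 4]

transformer = SelectKBest(score_func=multivariate_pearsonr, k=3)
Xt_syn = transformer.fit_transform(X_syn, y_syn)

print(transformer.scores_)
# Column indices of the k features that were kept
selected = transformer.get_support(indices=True)
print(selected)
```

Because the scores are absolute values, the negatively correlated column 2 is selected alongside columns 0 and 4.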

