Extracting Features with Transformers

Principal Component Analysis

In some datasets, features heavily correlate with each other. For example, the speed and the fuel consumption would be heavily correlated in a go-kart with a single gear. While it can be useful to find these correlations for some applications, data mining algorithms typically do not need the redundant information.

The ads dataset has heavily correlated features, as many of the keywords are repeated across the alt text and caption.
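You can check this kind of redundancy directly by computing the correlation between pairs of feature columns. A minimal sketch with NumPy, assuming X is the ads dataset's feature matrix (as a NumPy array) loaded earlier in the chapter:

import numpy as np

# Pearson correlation between the first two feature columns;
# values near +1 or -1 indicate heavily correlated, redundant features
correlation = np.corrcoef(X[:, 0], X[:, 1])[0, 1]
print(correlation)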

Principal Component Analysis (PCA) aims to find combinations of features that describe the dataset using less information. It aims to discover principal components, which are features that do not correlate with each other and that explain the information, specifically the variance, of the dataset. What this means is that we can often capture most of the information in a dataset in fewer features.

We apply PCA just like any other transformer. It has one key parameter, which is the number of components to find. By default, it will result in as many features as you have in the original dataset. However, these principal components are ranked: the first feature explains the largest amount of the variance in the dataset, the second a little less, and so on. Therefore, finding just the first few features is often enough to explain much of the dataset. Let's look at the code:

from sklearn.decomposition import PCA

# Keep only the first five principal components of the dataset
pca = PCA(n_components=5)
Xd = pca.fit_transform(X)
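The projection learned here can be reused on new samples with transform rather than refitting. A minimal sketch, where X_test is a hypothetical held-out portion of the same dataset:

# Project new samples onto the components already learned from X
Xd_test = pca.transform(X_test)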

The resulting matrix, Xd, has just five features. However, let's look at the amount of variance that is explained by each of these features:

import numpy as np

np.set_printoptions(precision=3, suppress=True)
pca.explained_variance_ratio_

The result, array([ 0.854, 0.145, 0.001, 0. , 0. ]), shows us that the first feature accounts for 85.4 percent of the variance in the dataset, the second accounts for 14.5 percent, and so on. By the fourth feature, less than one-tenth of a percent of the variance is contained in the feature. The other 1,553 features explain even less.
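To see how much of the dataset the leading components capture together, we can accumulate these ratios. A minimal sketch using NumPy's cumsum:

import numpy as np

# Cumulative share of variance explained by the first k components;
# from the values above, the first two already cover about 99.9 percent
print(np.cumsum(pca.explained_variance_ratio_))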
