08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 2

In the preceding screenshot, the Canadian examples are shown as diamonds, Kama seeds as circles, and Rosa seeds as triangles. Their respective decision regions are shown as white, black, and grey. You might be wondering why the regions are so horizontal, almost weirdly so. The problem is that the x axis (area) ranges from 10 to 22, while the y axis (compactness) ranges from 0.75 to 1.0. This means that a small change in x, relative to its range, is numerically much larger than a comparable change in y. So, when we compute the distance according to the preceding function, we are, for the most part, only taking the x axis into account.
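To make the imbalance concrete, here is a minimal sketch, assuming the preceding distance function is the plain Euclidean distance. The two points are hypothetical values chosen to lie inside the ranges quoted above; they are not taken from the seeds dataset.

```python
import numpy as np

# Two hypothetical seeds as (area, compactness) pairs.
# Illustrative values only, picked from within the ranges in the text.
a = np.array([12.0, 0.80])
b = np.array([13.0, 0.95])

# Squared per-feature differences that feed the Euclidean distance
squared = (a - b) ** 2
dist = np.sqrt(squared.sum())

# Fraction of the squared distance contributed by the area feature
area_share = squared[0] / squared.sum()
print("distance: {:.3f}, share due to area: {:.1%}".format(dist, area_share))
```

Even though the compactness values differ by more than half of that feature's range, almost all of the distance comes from the area difference.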

If you have a physics background, you might have already noticed that we have been summing up lengths, areas, and dimensionless quantities, mixing up our units (which is something you never want to do in a physical system). We need to normalize all of the features to a common scale. There are many solutions to this problem; a simple one is to normalize to Z-scores. The Z-score of a value is how far away from the mean it is, measured in units of standard deviation. It comes down to this simple pair of operations:

# subtract the mean for each feature:
features -= features.mean(axis=0)
# divide each feature by its standard deviation
features /= features.std(axis=0)
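The two operations above can be checked on a small array. The following is only a sketch: the helper name `zscore` and the sample values are hypothetical, and it returns a new array rather than modifying `features` in place.

```python
import numpy as np

def zscore(features):
    """Return a Z-scored copy of a 2D float array (rows = examples)."""
    # Hypothetical helper: same two steps as in the text, but not in place.
    centered = features - features.mean(axis=0)
    return centered / features.std(axis=0)

# Illustrative (area, compactness) rows, not real data
features = np.array([[10.0, 0.75],
                     [16.0, 0.85],
                     [22.0, 1.00]])

normalized = zscore(features)

# After normalization, each column has mean 0 and standard deviation 1
print(normalized.mean(axis=0))
print(normalized.std(axis=0))
```

Note that `std` computes the population standard deviation by default (`ddof=0`), and that the in-place form used in the text requires `features` to be a floating-point array.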
