08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Dimensionality Reduction

Garbage in, garbage out: that's what we know from real life. Throughout this book, we have seen that this pattern also holds true when applying machine learning methods to training data. Looking back, we realize that the most interesting machine learning challenges always involved some sort of feature engineering, where we tried to use our insight into the problem to carefully craft additional features that the machine learner would hopefully pick up.

In this chapter, we will go in the opposite direction with dimensionality reduction, which involves cutting away features that are irrelevant or redundant. Removing features might seem counter-intuitive at first, as more information should be better than less. Shouldn't the learner simply ignore the unnecessary features, for example, by setting their weights to 0 inside the machine learning algorithm? In practice, there are still several good reasons for trimming down the dimensions as much as possible:

• Superfluous features can irritate or mislead the learner. This is not the case with all machine learning methods (for example, Support Vector Machines love high-dimensional spaces), but most models feel safer with fewer dimensions.

• Another argument against high-dimensional feature spaces is that more features mean more parameters to tune and a higher risk of overfitting.

• The data we retrieved to solve our task might have an artificially high number of dimensions, whereas its real dimension might be small.

• Fewer dimensions mean faster training and more variations to try out, resulting in better end results.

• If we want to visualize the data, we are restricted to two or three dimensions, so the data has to be reduced to that many dimensions before it can be plotted.

So, here we will show you how to get rid of the garbage within our data while keeping the valuable part of it.
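To make the point about artificially high dimensions more concrete, here is a minimal sketch that is not taken from the chapter itself. It assumes synthetic data: 200 samples that really live in 2 dimensions but are observed through 10 linearly mixed features. All names (`latent`, `mixing`, `effective_dim`) and the tolerance are hypothetical choices for illustration. Counting the non-negligible singular values of the centered data is one simple way to see how many directions actually carry variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples whose real dimension is 2 ...
latent = rng.normal(size=(200, 2))
# ... observed through 10 features that are linear mixes of those 2.
mixing = rng.normal(size=(2, 10))
X = latent @ mixing

# Center the data, then look at its singular values: each one measures
# how much the data spreads along one orthogonal direction.
X_centered = X - X.mean(axis=0)
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Count the directions whose singular value is not numerically zero.
# The relative tolerance is an assumption made for this sketch.
tol = singular_values.max() * 1e-10
effective_dim = int((singular_values > tol).sum())

print("observed features:", X.shape[1])
print("directions carrying variance:", effective_dim)
```

Real data is rarely this clean: noise usually makes every singular value non-zero, so deciding how many dimensions to keep becomes a judgment call rather than a simple count.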
