10.11.2016 Views

Learning Data Mining with Python


Extracting Features with Transformers

Feature extraction

Extracting features is one of the most critical tasks in data mining, and it generally affects your end result more than the choice of data mining algorithm. Unfortunately, there are no hard and fast rules for choosing features that will result in high-performance data mining. In many ways, this is where the science of data mining becomes more of an art. Creating good features relies on intuition, domain expertise, data mining experience, trial and error, and sometimes a little luck.

Representing reality in models

Not all datasets are presented in terms of features. Sometimes, a dataset consists of nothing more than all of the books that have been written by a given author. Sometimes, it is the film of every movie released in 1979. At other times, it is a library's collection of interesting historical artifacts.

From these datasets, we may want to perform a data mining task. For the books, we may want to know the different categories that the author writes in. For the films, we may wish to see how women are portrayed. For the historical artifacts, we may want to know whether they are from one country or another. It isn't possible to just pass these raw datasets into a decision tree and see what the result is.

For a data mining algorithm to assist us here, we need to represent these datasets as features. Features are a way to create a model, and the model provides an approximation of reality in a way that data mining algorithms can understand. A model is therefore just a simplified version of some aspect of the real world. As an example, the game of chess is a simplified model of historical warfare.
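As a rough illustration of representing raw data as features, the sketch below reduces a piece of text to three numbers that an algorithm could consume. The function name `extract_features`, the feature names, and the tiny sample texts are all hypothetical and chosen only for this example; they are not taken from the book's own code.

```python
def extract_features(text):
    """Reduce a raw text to a small, fixed set of numeric features.

    The three features here are illustrative assumptions, not a
    recommended feature set.
    """
    words = text.lower().split()
    n_words = len(words)
    if n_words == 0:
        # An empty text has no meaningful averages, so report zeros.
        return {"n_words": 0, "avg_word_length": 0.0, "vocabulary_ratio": 0.0}
    return {
        "n_words": n_words,
        "avg_word_length": sum(len(w) for w in words) / n_words,
        "vocabulary_ratio": len(set(words)) / n_words,
    }


# Hypothetical stand-ins for the full text of two books.
books = {
    "book_a": "the ship sailed and the ship sank",
    "book_b": "a short tale",
}

# Each book is now a row of numbers rather than raw text.
feature_rows = {title: extract_features(text) for title, text in books.items()}
```

Every book, whatever its length, ends up as the same three values, which is the kind of fixed-size input that an algorithm such as a decision tree expects.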

Selecting features has another advantage: it reduces the complexity of the real world into a more manageable model. Imagine how much information it would take to properly, accurately, and fully describe a real-world object to someone who has no background knowledge of the item. You would need to describe the size, weight, texture, composition, age, flaws, purpose, origin, and so on.

The complexity of real objects is too much for current algorithms, so we use these simpler models instead.

This simplification also focuses our intent in the data mining application. In later chapters, we will look at clustering, where the choice of features is critically important. If you put random features in, you will get random results out.

However, there is a downside: this simplification reduces the detail and may remove good indicators of the things we wish to perform data mining on.
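A small sketch can make this loss of detail concrete. It assumes a simple word-count representation (often called a bag of words), which is one possible feature choice among many; the helper name `bag_of_words` and the sample sentences are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text only by how often each word occurs."""
    return Counter(text.lower().split())


first = bag_of_words("dog bites man")
second = bag_of_words("man bites dog")

# Word order is discarded, so two sentences with opposite meanings
# become indistinguishable once reduced to these features.
same_features = first == second
```

Here `same_features` is `True`: the word order that told the two sentences apart was a good indicator, and this particular simplification removed it.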

