10.11.2016 Views

Learning Data Mining with Python


Chapter 5

Thought should always be given to how to represent reality in the form of a model. Rather than just using what has been used in the past, you need to consider the goal of the data mining exercise. What are you trying to achieve? In Chapter 3, Predicting Sports Winners with Decision Trees, we created features by thinking about the goal (predicting winners) and used a little domain knowledge to come up with ideas for new features.

Not all features need to be numeric or categorical. Algorithms have been developed that work directly on text, graphs, and other data structures. Unfortunately, those algorithms are outside the scope of this book. In this book, we mainly use numeric or categorical features.

The Adult dataset is a great example of taking a complex reality and attempting to model it using features. In this dataset, the aim is to estimate whether someone earns more than $50,000 per year. To download the dataset, navigate to http://archive.ics.uci.edu/ml/datasets/Adult and click on the Data Folder link. Download the adult.data and adult.names files into a directory named Adult in your data folder.

This dataset takes a complex task and describes it in features. These features describe the person, their environment, their background, and their life status.

Open a new IPython Notebook for this chapter, import pandas, and set the dataset's filename:

import os
import pandas as pd

data_folder = os.path.join(os.path.expanduser("~"), "Data", "Adult")
adult_filename = os.path.join(data_folder, "adult.data")
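Before loading, it can help to confirm that the downloaded file is where the notebook expects it. The following is a minimal sketch rather than part of the original code; the helper name dataset_path is hypothetical, and the default folder simply mirrors the data_folder shown above.

```python
import os


def dataset_path(filename, data_folder=None):
    """Return the full path to a dataset file, failing early if it is missing.

    `dataset_path` is a hypothetical helper introduced for illustration.
    """
    if data_folder is None:
        # Assumed default: the same ~/Data/Adult folder used in this chapter.
        data_folder = os.path.join(os.path.expanduser("~"), "Data", "Adult")
    path = os.path.join(data_folder, filename)
    if not os.path.isfile(path):
        raise FileNotFoundError(
            "Expected {} in {}; download it from the Data Folder link "
            "first.".format(filename, data_folder)
        )
    return path
```

Calling dataset_path("adult.data") would then either return the path to pass to pandas or raise an error that names the missing file.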

Using pandas as before, we load the file with read_csv:

adult = pd.read_csv(adult_filename, header=None,
                    names=["Age", "Work-Class", "fnlwgt",
                           "Education", "Education-Num",
                           "Marital-Status", "Occupation",
                           "Relationship", "Race", "Sex",
                           "Capital-gain", "Capital-loss",
                           "Hours-per-week", "Native-Country",
                           "Earnings-Raw"])

Most of the code is the same as in the previous chapters.
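To see what this call produces without the downloaded file, the sketch below runs the same read_csv arguments on two made-up rows laid out like adult.data. The rows are illustrative, not real records. Two things here are assumptions rather than part of the original code: the skipinitialspace=True argument, which assumes fields are separated by a comma followed by a space, and the HighEarner column, a hypothetical name for a boolean version of the earnings goal described earlier.

```python
import io

import pandas as pd

COLUMN_NAMES = ["Age", "Work-Class", "fnlwgt",
                "Education", "Education-Num",
                "Marital-Status", "Occupation",
                "Relationship", "Race", "Sex",
                "Capital-gain", "Capital-loss",
                "Hours-per-week", "Native-Country",
                "Earnings-Raw"]

# Two made-up rows in the assumed layout of adult.data (not real records).
sample_file = io.StringIO(
    "35, Private, 100000, Bachelors, 13, Married-civ-spouse, Sales, "
    "Husband, White, Male, 0, 0, 45, United-States, >50K\n"
    "22, Private, 200000, HS-grad, 9, Never-married, Other-service, "
    "Own-child, White, Female, 0, 0, 20, United-States, <=50K\n"
)

# header=None because the data has no header row; names supplies the labels.
# skipinitialspace=True drops the space that follows each comma (assumption).
sample_adult = pd.read_csv(sample_file, header=None, names=COLUMN_NAMES,
                           skipinitialspace=True)

# Hypothetical boolean target: does this person earn more than $50,000?
sample_adult["HighEarner"] = sample_adult["Earnings-Raw"] == ">50K"

print(sample_adult[["Age", "Hours-per-week", "Earnings-Raw", "HighEarner"]])
```

Without skipinitialspace, the text columns would keep a leading space, so a comparison such as == ">50K" would never match.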

