10.11.2016 Views

Learning Data Mining with Python


Chapter 5

A transformer is akin to a converting function: it takes data of one form as input and returns data of another form as output. Transformers can be trained on a training dataset, and the parameters learned during training can then be used to convert testing data.

The transformer API is quite simple. It takes data of a specific format as input and returns data of another format (either the same as the input or different) as output. Not much else is required of the programmer.

The transformer API

Transformers have two key functions:

• fit(): This takes a training set of data as input and sets internal parameters.
• transform(): This performs the transformation itself. It can take either the training dataset or a new dataset of the same format.

Both the fit() and transform() functions should take the same data type as input, but transform() can return data of a different type.
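The fit()/transform() contract described above can be sketched roughly as follows. The RangeScaler class and the sample arrays are hypothetical, invented only for illustration; they come neither from the book nor from scikit-learn. The sketch assumes every training column has a non-zero range.

```python
import numpy as np


class RangeScaler:
    """Hypothetical transformer, used only to illustrate fit()/transform()."""

    def fit(self, X):
        X = np.asarray(X, dtype=float)
        # Learn the per-column minimum and range from the training data only.
        # Assumption: no column is constant, so the range is never zero.
        self.min_ = X.min(axis=0)
        self.range_ = X.max(axis=0) - self.min_
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Apply the parameters learned in fit(); nothing is re-learned here.
        return (X - self.min_) / self.range_


# Made-up training and testing data with the same two-column format.
X_train = np.array([[0.0, 10.0], [5.0, 20.0], [10.0, 30.0]])
X_test = np.array([[2.5, 40.0]])

scaler = RangeScaler().fit(X_train)
train_scaled = scaler.transform(X_train)  # training data maps into [0, 1]
test_scaled = scaler.transform(X_test)    # new data may fall outside [0, 1]
```

Note that transform() is called on both the training data and the new data, but the learned minimum and range always come from the training data.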

We are going to create a trivial transformer to show the API in action. The transformer will take a NumPy array as input and discretize it based on the mean. Any value higher than the mean (of the training data) will be given the value 1, and any value lower than or equal to the mean will be given the value 0.
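Before wrapping this rule in a class, the rule itself can be tried out directly in NumPy. This is a minimal sketch on made-up arrays, and it assumes that one mean is computed per feature (column), which the text above does not state explicitly.

```python
import numpy as np

# Made-up data: three training rows and two new rows, two features each.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [6.0, 30.0]])
X_new = np.array([[3.0, 20.0], [4.0, 25.0]])

# "Training": learn one mean per feature from the training data only.
train_mean = X_train.mean(axis=0)

# "Transforming": strictly greater than the training mean gives 1,
# lower than or equal to it gives 0.
train_discrete = (X_train > train_mean).astype(int)
new_discrete = (X_new > train_mean).astype(int)
```

The first new row sits exactly on the training means, so both of its values become 0; only values strictly above the mean become 1.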

We did a similar transformation with the Adult dataset using pandas: we took the Hours-per-week feature and created a LongHours feature indicating whether the value was more than 40 hours per week. This transformer is different for two reasons. First, the code will conform to the scikit-learn API, allowing us to use it in a pipeline. Second, the code will learn the mean, rather than taking it as a fixed value (such as 40 in the LongHours example).

Implementation details

To start, open up the IPython Notebook that we used for the Adult dataset. Then, click on the Cell menu item and choose Run All. This will rerun all of the cells and ensure that the notebook is up to date.

First, we import the TransformerMixin, which sets the API for us. While Python doesn't have strict interfaces (as opposed to languages like Java), using a mixin like this allows scikit-learn to determine that the class is actually a transformer. We also need to import a function that checks that the input is of a valid type. We will use that soon.
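As a rough sketch of what these imports make possible, the block below inherits from TransformerMixin and uses scikit-learn's as_float_array as the input-checking function. Both the choice of as_float_array and the class name MeanThreshold are assumptions made for illustration, since the text above names neither; the per-feature mean is likewise an assumption.

```python
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.utils import as_float_array


class MeanThreshold(TransformerMixin):
    """Hypothetical name; an illustrative sketch, not the book's own listing."""

    def fit(self, X, y=None):
        # as_float_array converts the input to a floating-point array.
        X = as_float_array(X)
        # Assumption: one mean is learned per feature (column).
        self.mean_ = X.mean(axis=0)
        return self

    def transform(self, X):
        X = as_float_array(X)
        # True where a value is strictly above the learned mean.
        return X > self.mean_


# Made-up integer data; as_float_array turns it into floats internally.
X = np.array([[1, 10], [2, 20], [6, 30]])
model = MeanThreshold()
# fit_transform() is supplied by TransformerMixin: it calls fit(), then transform().
result = model.fit_transform(X)
```

Because the class only defines fit() and transform(), the fit_transform() call shows what the mixin contributes on top of marking the class as a transformer.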
