
Authorship Attribution

Kernels

When the data cannot be separated linearly, the trick is to embed it into a higher-dimensional space. What this means, with a lot of hand-waving about the details, is to add pseudo-features until the data is linearly separable (which will always happen if you add enough of the right kinds of features).

The trick is that we often compute the inner product of pairs of samples when finding the best line to separate the dataset. Given a function that uses the dot product, we effectively manufacture new features without having to actually define those new features. This is handy because we don't know what those features would have been anyway. We now define a kernel as a function that computes the dot product of some transformation of two samples from the dataset, rather than a dot product based on the samples (and the made-up features) themselves.

We can now compute what that dot product is (or approximate it) and then just use that.
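
To make this concrete, here is a minimal sketch (not taken from the book's code; the sample vectors are illustrative) showing that a degree-2 polynomial kernel gives the same result as explicitly constructing the expanded features and taking their dot product:

import numpy as np

def polynomial_kernel(x, y, degree=2):
    # Kernel trick: only the dot product of the raw samples is needed
    return np.dot(x, y) ** degree

def expand(v):
    # Explicit degree-2 feature map for a two-dimensional sample:
    # (v1**2, sqrt(2)*v1*v2, v2**2)
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 4.0])

print(polynomial_kernel(x, y))       # 121.0
print(np.dot(expand(x), expand(y)))  # 121.0 -- same value, without building the features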

There are a number of kernels in common use. The linear kernel is the most straightforward: it is simply the dot product of the two sample feature vectors, along with a weight and a bias value. There is also a polynomial kernel, which raises the dot product to a given degree (for instance, 2). Others include the Gaussian (rbf) and sigmoid kernels. In our previous code sample, we tested both the linear and rbf kernels.
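
As a rough sketch (the parameter names gamma and coef0, and the default values here, are illustrative, borrowed from scikit-learn's conventions), these kernel functions can be written as follows:

import numpy as np

def linear_kernel(x, y):
    return np.dot(x, y)

def rbf_kernel(x, y, gamma=0.1):
    # Gaussian (rbf) kernel: based on the squared distance between the samples
    return np.exp(-gamma * np.sum((x - y) ** 2))

def sigmoid_kernel(x, y, gamma=0.1, coef0=0.0):
    return np.tanh(gamma * np.dot(x, y) + coef0)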

The end result of all this derivation is that these kernels effectively define a distance between two samples, which is used when classifying new samples in SVMs. In theory, any distance could be used, although it may not share the characteristics that enable easy optimization of SVM training.

In scikit-learn's implementation of SVMs, we can set the kernel parameter to change which kernel function is used in the computation, as we saw in the previous code sample.
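
For example, here is a minimal sketch comparing kernels through that parameter (the synthetic dataset merely stands in for the authorship features used earlier):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data for the authorship features
X, y = make_classification(n_samples=200, n_features=20, random_state=14)

for kernel in ('linear', 'poly', 'rbf', 'sigmoid'):
    scores = cross_val_score(SVC(kernel=kernel), X, y, cv=5)
    print(kernel, scores.mean())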

Character n-grams

We saw how function words can be used as features to predict the author of a document. Another feature type is the character n-gram. An n-gram is a sequence of n objects, where n is a value (for text, generally between 2 and 6). Word n-grams have been used in many studies, usually relating to the topic of the documents. However, character n-grams have proven highly effective for authorship attribution.
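As a brief sketch of extracting character n-grams with scikit-learn (the sample sentence is illustrative):

from sklearn.feature_extraction.text import CountVectorizer

# analyzer='char' splits documents into characters rather than words;
# ngram_range=(3, 3) keeps only character 3-grams
vectorizer = CountVectorizer(analyzer='char', ngram_range=(3, 3))
counts = vectorizer.fit_transform(["the cat sat on the mat"])
print(vectorizer.get_feature_names_out())
print(counts.toarray())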
