Learning Data Mining with Python

Classifying with function words

Next, we import our classes. The only new thing here is the support vector machine (SVM), which we will cover in the next section (for now, just consider it a standard classification algorithm). We import the SVC class, an SVM for classification, as well as the other standard workflow tools we have seen before:

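from sklearn.svm import SVC
# cross_validation and grid_search were replaced by sklearn.model_selection
# in scikit-learn 0.18 and removed in 0.20; on newer releases, import
# cross_val_score and GridSearchCV from sklearn.model_selection instead.
from sklearn.cross_validation import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn import grid_search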

Chapter 9

Support vector machines take a number of parameters. As mentioned, we will use one here without examining it closely, before going into detail in the next section. We use a dictionary to set which parameters we are going to search. For the kernel parameter, we will try linear and rbf. For C, we will try values of 1 and 10 (both parameters are described in the next section). We then create a grid search to find the best combination of these values:

parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
svr = SVC()
grid = grid_search.GridSearchCV(svr, parameters)
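To see which candidates the grid search will evaluate, the parameter dictionary can be expanded by hand. The following is a minimal sketch using only the standard library; `expand_grid` is a hypothetical helper written for illustration, not part of scikit-learn, and it only lists the combinations rather than fitting any model.

```python
from itertools import product


def expand_grid(parameters):
    """Return every combination of parameter values as a list of dicts."""
    names = sorted(parameters)  # fixed order so the output is predictable
    value_lists = [parameters[name] for name in names]
    return [dict(zip(names, values)) for values in product(*value_lists)]


parameters = {'kernel': ('linear', 'rbf'), 'C': [1, 10]}
combinations = expand_grid(parameters)
for combo in combinations:
    print(combo)
```

Two kernels and two values of C give four candidate settings, each of which the grid search fits and scores before keeping the best one.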

Gaussian kernels (such as rbf) only work for reasonably sized datasets, such as when the number of features is fewer than about 10,000.

Next, we set up a pipeline that combines the feature extraction step, a CountVectorizer restricted to function words, with our grid search over the SVM parameters. The code is as follows:

pipeline1 = Pipeline([('feature_extraction', extractor),
                      ('clf', grid)
                      ])
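To make the feature extraction step concrete, here is a rough standard-library sketch of what counting against a fixed vocabulary produces for one document. The word list is a tiny illustrative subset chosen for this example, not the list used in the book, and the simple tokenizer is an assumption that only approximates what CountVectorizer does.

```python
import re
from collections import Counter

# Hypothetical, shortened vocabulary; the real function word list is much longer.
FUNCTION_WORDS = ["a", "and", "but", "of", "the", "to", "with"]


def function_word_counts(text, vocabulary=FUNCTION_WORDS):
    """Count how often each vocabulary word occurs in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())  # crude lowercase tokenizer
    counts = Counter(tokens)
    # One entry per vocabulary word, in vocabulary order; other words are ignored.
    return [counts[word] for word in vocabulary]


vector = function_word_counts(
    "The cat sat with the dog, and the bird flew to a tree.")
print(vector)
```

Each document becomes a fixed-length vector of function word counts, and it is vectors of this kind that the classifier at the end of the pipeline receives.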

Next, we apply cross_val_score to get the cross-validated score for this pipeline. The result is 0.811, which means we get approximately 81 percent of the predictions correct. For 7 authors, this is a good result!
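To put that score in context, it helps to compare it with random guessing. The short calculation below assumes each of the 7 authors is equally represented in the dataset, which the text does not state; with unbalanced classes the baseline would be different.

```python
n_authors = 7
reported_score = 0.811  # cross-validated score quoted in the text

# Assumption: balanced classes, so a random guess is right 1 time in 7.
chance_level = 1 / n_authors
improvement = reported_score / chance_level

print("Chance level: {:.3f}".format(chance_level))
print("Improvement over chance: {:.1f}x".format(improvement))
```

Under that assumption, the classifier is correct several times more often than a random guess would be.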
