08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 6

° Experiment with whether or not to use the logarithm of the word counts (sublinear_tf)
° Experiment with whether to track word counts or simply track whether words occur at all by setting binary to True or False
• MultinomialNB
° Decide which of the following smoothing methods to use by setting alpha:
  ° Add-one or Laplace smoothing: 1
  ° Lidstone smoothing: 0.01, 0.05, 0.1, or 0.5
  ° No smoothing: 0

A simple approach could be to train a classifier for each of those reasonable exploration values while keeping the other parameters constant, and then check the classifier's results. As we do not know whether those parameters affect each other, doing it right would require that we train a classifier for every possible combination of all parameter values. Obviously, this is too tedious to do by hand.
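To get a feel for how quickly the exhaustive approach grows, here is a minimal sketch that enumerates every combination of the sublinear_tf, binary, and alpha values listed above, together with three candidate ngram_range settings. The dictionary name candidate_values is hypothetical and only serves this illustration; it uses the standard library, not scikit-learn.

```python
from itertools import product

# Candidate values per parameter (candidate_values is an illustrative name).
candidate_values = {
    "ngram_range": [(1, 1), (1, 2), (1, 3)],
    "sublinear_tf": [True, False],
    "binary": [True, False],
    "alpha": [0, 0.01, 0.05, 0.1, 0.5, 1],
}

# One dict per point of the grid: the Cartesian product of all value lists.
combinations = [
    dict(zip(candidate_values, values))
    for values in product(*candidate_values.values())
]

print(len(combinations))  # 3 * 2 * 2 * 6 = 72
```

Even with only four parameters, that is 72 classifiers to train and evaluate, and every additional parameter multiplies the count.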

Because this kind of parameter exploration occurs frequently in machine learning tasks, scikit-learn has a dedicated class for it called GridSearchCV. It takes an estimator (an instance with a classifier-like interface), which would be the pipeline instance in our case, and a dictionary of parameters with their potential values.

GridSearchCV expects the dictionary's keys to obey a certain format so that it is able to set the parameters of the correct estimator. The format is as follows:

<estimator>__<subestimator>__...__<param_name>

Now, if we want to specify the desired values to explore for the ngram_range parameter of TfidfVectorizer (named vect in the Pipeline description), we would have to say:

param_grid = {"vect__ngram_range": [(1, 1), (1, 2), (1, 3)]}

This would tell GridSearchCV to try out unigrams, bigrams, and trigrams as parameter values for the ngram_range parameter of TfidfVectorizer.
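The double-underscore key format can be checked directly on a pipeline, since it is the same naming scheme that get_params() and set_params() use. The sketch below assumes a two-step pipeline in which the vectorizer is named vect, as in the text, and the classifier is named clf; the name clf is an assumption made for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# "vect" follows the text; "clf" is an assumed step name for the classifier.
pipeline = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])

# Nested parameters are exposed as <step name>__<parameter name>.
nested_keys = sorted(key for key in pipeline.get_params() if "__" in key)
print([key for key in nested_keys if key.startswith("clf__")])

# The same keys can be used to set a parameter on the right step.
pipeline.set_params(vect__ngram_range=(1, 2), clf__alpha=0.5)
print(pipeline.named_steps["vect"].ngram_range)
print(pipeline.named_steps["clf"].alpha)
```

Any key that shows up in get_params() is a valid key for the parameter dictionary handed to GridSearchCV.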

GridSearchCV then trains the estimator with all possible parameter/value combinations. Finally, it provides the best estimator in the form of the member variable best_estimator_.

As we want to compare the returned best classifier with our current best one, we need to evaluate it the same way. Therefore, we can pass the ShuffleSplit instance using the cv parameter (cross-validation is what the CV in GridSearchCV stands for).
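Putting the pieces together, a grid search over the pipeline might look like the following sketch. It assumes a recent scikit-learn release, where GridSearchCV and ShuffleSplit live in sklearn.model_selection (older releases used different module paths). The toy corpus, the step name clf, and the ShuffleSplit settings are made up for illustration, and alpha=0 is left out of the grid here to keep the toy run free of warnings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up toy corpus: 1 = positive, 0 = negative.
texts = ["good movie", "great film", "nice plot", "loved it",
         "bad movie", "awful film", "boring plot", "hated it"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0] * 5

# "vect" follows the text; "clf" is an assumed step name.
pipeline = Pipeline([("vect", TfidfVectorizer()), ("clf", MultinomialNB())])

param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2), (1, 3)],
    "vect__sublinear_tf": [True, False],
    "vect__binary": [True, False],
    "clf__alpha": [0.01, 0.05, 0.1, 0.5, 1],
}

# The same splitter is reused for every combination, so scores are comparable.
cv = ShuffleSplit(n_splits=3, test_size=0.3, random_state=0)

grid_search = GridSearchCV(pipeline, param_grid, cv=cv)
grid_search.fit(texts, labels)

best_clf = grid_search.best_estimator_  # refitted on all of the data
print(grid_search.best_params_)
print(len(grid_search.cv_results_["params"]))  # number of combinations tried
```

By default, GridSearchCV ranks combinations with the estimator's own score method (accuracy for classifiers); a different metric can be requested through its scoring parameter.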
