08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 6

Our first estimator

Now we have everything in place to create our first vectorizer. The most convenient way to do this is to inherit from scikit-learn's BaseEstimator, which requires us to implement the following three methods:

• get_feature_names(): This returns a list of strings naming the features that transform() will return.

• fit(documents, y=None): As we are not implementing a classifier, there is nothing to learn here, so we can ignore this one and simply return self.

• transform(documents): This returns a numpy.array() of shape (len(documents), len(get_feature_names())). This means that for every document in documents, it has to return a value for every feature name in get_feature_names().
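To make the contract between the three methods concrete, here is a minimal sketch, assuming scikit-learn and NumPy are installed. PunctuationVectorizer and its two features are hypothetical names chosen for illustration only; they are not the vectorizer built in this chapter.

```python
import numpy as np
from sklearn.base import BaseEstimator


class PunctuationVectorizer(BaseEstimator):
    """Toy vectorizer (hypothetical): counts '!' and '?' per document."""

    def get_feature_names(self):
        # One name per column that transform() produces
        return np.array(['exclamation', 'question'])

    def fit(self, documents, y=None):
        # Nothing to learn; return self so that fit(d).transform(d) chains
        return self

    def transform(self, documents):
        # One row per document, one column per feature name
        rows = [[d.count('!'), d.count('?')] for d in documents]
        return np.array(rows, dtype=float)


docs = ["Great phone!!", "Is it any good?", "meh"]
vec = PunctuationVectorizer()
X = vec.fit(docs).transform(docs)

# The shape contract described above
assert X.shape == (len(docs), len(vec.get_feature_names()))
```

Note that recent scikit-learn releases name this method get_feature_names_out() on their built-in vectorizers. The sketch keeps the older name so that it matches the text.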

Let us now implement these methods:

    import nltk
    import numpy as np
    from sklearn.base import BaseEstimator

    # load_sent_word_net() is assumed to be defined earlier in the chapter
    sent_word_net = load_sent_word_net()

    class LinguisticVectorizer(BaseEstimator):

        def get_feature_names(self):
            return np.array(['sent_neut', 'sent_pos', 'sent_neg',
                             'nouns', 'adjectives', 'verbs', 'adverbs',
                             'allcaps', 'exclamation', 'question', 'hashtag',
                             'mentioning'])

        # we don't fit here but need to return the reference
        # so that it can be used like fit(d).transform(d)
        def fit(self, documents, y=None):
            return self

        def _get_sentiments(self, d):
            # split the document on whitespace and tag each token
            # with its part of speech
            sent = tuple(d.split())
            tagged = nltk.pos_tag(sent)

            pos_vals = []
            neg_vals = []

            nouns = 0.
            adjectives = 0.
            verbs = 0.
            adverbs = 0.
