08.06.2015 Views

Building Machine Learning Systems with Python - Richert, Coelho


Chapter 6

Determining the word types

Determining the word types is what part-of-speech (POS) tagging is all about. A POS tagger parses a full sentence with the goal of arranging it into a dependency tree, where each node corresponds to a word and the parent-child relationship determines which word it depends on. With this tree, it can then make more informed decisions; for example, whether the word "book" is a noun ("This is a good book.") or a verb ("Could you please book the flight?").

You might have already guessed that NLTK will also play a role in this area. And indeed, it comes readily packaged with all sorts of parsers and taggers. The POS tagger we will use, nltk.pos_tag(), is actually a full-blown classifier trained using manually annotated sentences from the Penn Treebank Project (http://www.cis.upenn.edu/~treebank). It takes as input a list of word tokens and outputs a list of tuples, each of which contains a token of the original sentence and its part-of-speech tag:

>>> import nltk
>>> nltk.pos_tag(nltk.word_tokenize("This is a good book."))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
>>> nltk.pos_tag(nltk.word_tokenize("Could you please book the flight?"))
[('Could', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('book', 'NN'), ('the', 'DT'), ('flight', 'NN'), ('?', '.')]
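
Because the tagger returns plain (token, tag) tuples, ordinary Python is enough to work with the result. The following is a minimal sketch of two ways one might slice such a list: grouping tokens by tag, and selecting tokens whose tag starts with a given prefix. The helper names group_by_tag and tokens_with_prefix are hypothetical and not part of NLTK, and the sample list is copied from the first tagged sentence above so that the snippet runs without NLTK installed.

```python
from collections import defaultdict

# Tagged output copied from the interpreter session above; in practice it
# would come from nltk.pos_tag(nltk.word_tokenize(sentence)).
tagged = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'),
          ('book', 'NN'), ('.', '.')]


def group_by_tag(tagged_tokens):
    """Map each POS tag to the tokens that received it, in sentence order.

    Hypothetical helper for illustration; not an NLTK function.
    """
    groups = defaultdict(list)
    for token, tag in tagged_tokens:
        groups[tag].append(token)
    return dict(groups)


def tokens_with_prefix(tagged_tokens, prefix):
    """Return the tokens whose tag starts with the given prefix.

    Hypothetical helper for illustration; a prefix such as 'VB' also
    matches the longer tag 'VBZ' seen in the sample.
    """
    return [token for token, tag in tagged_tokens if tag.startswith(prefix)]


groups = group_by_tag(tagged)
print(groups['DT'])                      # ['This', 'a']
print(tokens_with_prefix(tagged, 'NN'))  # ['book']
print(tokens_with_prefix(tagged, 'VB'))  # ['is']
```

Matching on a tag prefix rather than the exact tag is a design choice: it treats all tags that share a prefix as one word type, which is often what a feature extractor wants.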

The POS tag abbreviations are taken from the Penn Treebank Project (adapted from http://americannationalcorpus.org/OANC/penn.html):

POS tag   Description                              Example
CC        coordinating conjunction                 or
CD        cardinal number                          2, second
DT        determiner                               the
EX        existential there                        there are
FW        foreign word                             kindergarten
IN        preposition/subordinating conjunction    on, of, like
JJ        adjective                                cool
JJR       adjective, comparative                   cooler
JJS       adjective, superlative                   coolest
LS        list marker                              1)
MD        modal                                    could, will
