Building Machine Learning Systems with Python - Richert, Coelho
Chapter 6<br />
Determining the word types<br />
Determining the word types is what part-of-speech (POS) tagging is all about. A<br />
POS tagger parses a full sentence with the goal of arranging it into a dependency tree,<br />
where each node corresponds to a word and the parent-child relationship determines<br />
which word it depends on. With this tree, it can then make more informed decisions;<br />
for example, whether the word "book" is a noun ("This is a good book.") or a verb<br />
("Could you please book the flight?").<br />
You might have already guessed that NLTK will play a role in this area as well.<br />
And indeed, it comes readily packaged with all sorts of parsers and taggers. The POS<br />
tagger we will use, nltk.pos_tag(), is actually a full-blown classifier trained using<br />
manually annotated sentences from the Penn Treebank Project<br />
(http://www.cis.upenn.edu/~treebank). It takes a list of word tokens as input and<br />
outputs a list of tuples, each of which pairs a token from the original sentence with<br />
its part-of-speech tag:<br />
>>> import nltk<br />
>>> nltk.pos_tag(nltk.word_tokenize("This is a good book."))<br />
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('good', 'JJ'), ('book',<br />
'NN'), ('.', '.')]<br />
>>> nltk.pos_tag(nltk.word_tokenize("Could you please book the flight?"))<br />
[('Could', 'MD'), ('you', 'PRP'), ('please', 'VB'), ('book', 'NN'),<br />
('the', 'DT'), ('flight', 'NN'), ('?', '.')]<br />
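Because the tagger returns a plain list of (token, tag) tuples, its output can be<br />
post-processed with ordinary Python. The following is a minimal sketch and not part of<br />
NLTK: the helper names words_with_tag_prefix() and tag_counts() are hypothetical, and<br />
the tagged sentences are copied from the session above so that the code runs without<br />
NLTK or its models being installed.<br />

```python
from collections import Counter

# Tagger output copied verbatim from the interpreter session above
# (assumption: a different NLTK version or model may tag these differently).
tagged_statement = [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'),
                    ('good', 'JJ'), ('book', 'NN'), ('.', '.')]
tagged_question = [('Could', 'MD'), ('you', 'PRP'), ('please', 'VB'),
                   ('book', 'NN'), ('the', 'DT'), ('flight', 'NN'),
                   ('?', '.')]


def words_with_tag_prefix(tagged, prefix):
    """Return the tokens whose POS tag starts with `prefix`.

    Hypothetical helper: matching on a prefix such as 'NN' also catches
    the related Penn Treebank noun tags (NNS, NNP, NNPS).
    """
    return [word for word, tag in tagged if tag.startswith(prefix)]


def tag_counts(tagged):
    """Count how often each POS tag occurs in one tagged sentence."""
    return Counter(tag for _, tag in tagged)


print(words_with_tag_prefix(tagged_statement, 'DT'))
print(words_with_tag_prefix(tagged_question, 'NN'))
print(tag_counts(tagged_question))
```

Matching on a tag prefix rather than an exact tag is a design choice for this sketch;<br />
an exact comparison (tag == 'NN') would be the stricter alternative.<br />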
The POS tag abbreviations are taken from the Penn Treebank Project (adapted from<br />
http://americannationalcorpus.org/OANC/penn.html):<br />
POS tag   Description                              Example
CC        coordinating conjunction                 or
CD        cardinal number                          2, second
DT        determiner                               the
EX        existential there                        there are
FW        foreign word                             kindergarten
IN        preposition/subordinating conjunction    on, of, like
JJ        adjective                                cool
JJR       adjective, comparative                   cooler
JJS       adjective, superlative                   coolest
LS        list marker                              1)
MD        modal                                    could, will