10.11.2016 Views

Learning Data Mining with Python


Chapter 9

The use of function words is determined less by the content of the document and more by the decisions made by the author. This makes them good candidates for separating the authorship traits of different users. For instance, while many Americans are particular about the difference in usage between that and which in a sentence, people from other countries, such as Australia, are less particular about this. This means that some Australians will lean towards almost exclusively using one word or the other, while others may use which much more. This difference, combined with thousands of other nuanced differences, makes a model of authorship.
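As a rough illustration of this idea, the following sketch compares how often two writers use that and which relative to the total number of words they write. The sample sentences, the function_word_rates helper, and the simple regular-expression tokenizer are all assumptions made for this example; they are not taken from the chapter's dataset or code.

```python
import re
from collections import Counter


def function_word_rates(text, words):
    """Return each word's share of all tokens in the text (hypothetical helper)."""
    # Very simple tokenizer: lowercase runs of letters and apostrophes
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens)
    return {word: (counts[word] / total if total else 0.0) for word in words}


# Made-up sentences standing in for two different authors
author_a = "The report that we filed is the one that matters."
author_b = "The report which we filed is the one which matters."

rates_a = function_word_rates(author_a, ["that", "which"])
rates_b = function_word_rates(author_b, ["that", "which"])

print(rates_a)
print(rates_b)
```

Both sentences carry the same content, yet the two rate profiles differ. This is the kind of signal that function word features are meant to capture.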

Counting function words

We can count function words using the CountVectorizer class we used in Chapter 6, Social Media Insight Using Naive Bayes. This class can be passed a vocabulary, which is the set of words it will look for. If a vocabulary is not passed (we didn't pass one in the code of Chapter 6), then it will learn the vocabulary from the dataset: every word in the training set of documents (depending on the other parameters, of course).
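The difference between a fixed and a learned vocabulary can be sketched as follows. The two documents and the four-word vocabulary are made up for illustration and are not the chapter's data. With scikit-learn's default token pattern, single-character tokens are ignored, so words such as "a" or "i" would need a custom token_pattern to be counted.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents, for illustration only
documents = [
    "The book that I read was better than the film which you saw.",
    "I read the book, and the film was about the book.",
]

# Hypothetical mini-vocabulary; a real run would pass the full function word list
vocabulary = ["that", "which", "the", "about"]

# Fixed vocabulary: only these words are counted, in the order given
fixed = CountVectorizer(vocabulary=vocabulary)
counts = fixed.fit_transform(documents).toarray()
print(counts)

# No vocabulary passed: it is learned from the documents instead
learned = CountVectorizer()
learned.fit(documents)
print(sorted(learned.vocabulary_))
```

Each row of counts corresponds to one document, and each column to one word of the supplied vocabulary. The learned vocabulary also contains content words such as "book" and "film", which is what passing a fixed list of function words avoids.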

First, we set up our vocabulary of function words, which is just a list containing each of them. Exactly which words are function words and which are not is up for debate. I've found this list, from published research, to be quite good:

function_words = ["a", "able", "aboard", "about", "above", "absent",
                  "according", "accordingly", "across", "after", "against",
                  "ahead", "albeit", "all", "along", "alongside", "although",
                  "am", "amid", "amidst", "among", "amongst", "amount", "an",
                  "and", "another", "anti", "any", "anybody", "anyone",
                  "anything", "are", "around", "as", "aside", "astraddle",
                  "astride", "at", "away", "bar", "barring", "be", "because",
                  "been", "before", "behind", "being", "below", "beneath",
                  "beside", "besides", "better", "between", "beyond", "bit",
                  "both", "but", "by", "can", "certain", "circa", "close",
                  "concerning", "consequently", "considering", "could",
                  "couple", "dare", "deal", "despite", "down", "due", "during",
                  "each", "eight", "eighth", "either", "enough", "every",
                  "everybody", "everyone", "everything", "except", "excepting",
                  "excluding", "failing", "few", "fewer", "fifth", "first",
                  "five", "following", "for", "four", "fourth", "from", "front",
                  "given", "good", "great", "had", "half", "have", "he",
                  "heaps", "hence", "her", "hers", "herself", "him", "himself",
                  "his", "however", "i", "if", "in", "including", "inside",

