
Smart Reader: Building a Naive Bayes Classifier

Dana Scott
Department of Computer Science
University of Massachusetts Lowell
dana-s@charter.net

ABSTRACT

Smart Reader is software written entirely in Python that attempts to classify text. The algorithm calculates the probability that a given text file belongs to a particular category based on a previously defined language model. The algorithm and model use a naive Bayesian network of prior probabilities and features.

Author Keywords

bag-of-words model, Laplacian smoothing, naive Bayes, text classification

INTRODUCTION

Smart Reader solves a text or document classification problem. That is, when given an unlabeled piece of text, to which category does it belong? More specifically, this project aims to classify business news stories into 78 different genres based solely on their content.

PRODUCT DESCRIPTION

I began by building a language model based on the data set, Reuters-21578, Distribution 1.0 (Reuters, 1997). This data was already partitioned into 7,769 training files and 3,019 test files. The ApteMod version provides an index of file names and their corresponding categories. It also includes a list of stopwords that are common to all categories and can therefore be excluded.
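As a rough illustration of that preprocessing step (a sketch, not the author's code; the stopword file name and helper names are assumptions), stopword filtering might look like this:

    # Minimal stopword-filtering sketch. Assumes the ApteMod stopword
    # list is a plain text file with one lowercase word per line.
    def load_stopwords(path="stopwords"):
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def filter_tokens(tokens, stopwords):
        # Drop non-alphabetic tokens and anything on the stopword list.
        return [t.lower() for t in tokens
                if t.isalpha() and t.lower() not in stopwords]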

A Python dictionary, or hash table, holds 23,357 unique words, which are the features in my Bayesian network. I copied this dictionary 78 times, thereby creating one for


each category. The categories were provided in the ApteMod version of the Reuters corpus. I counted the frequency of each word for each category in the 7,769 training files; I was able to identify them because each training file is labeled with the classes to which it belongs, and I chose only the dominant, or first, class listed. If a word did not appear at all, I gave it a count of one, which is one way to implement Laplacian, or add-one, smoothing. Once I had a frequency count for each word in each category, I divided each count by the total number of non-unique words in that category. This gave me a floating-point probability of each feature occurring in each category.
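A minimal sketch of this training step, assuming labeled_files maps each category to its list of tokenized training documents (the names and data layout are hypothetical, not the author's code):

    from collections import Counter

    def train(labeled_files, vocabulary):
        # labeled_files: dict of category -> list of token lists
        # vocabulary:    the 23,357 unique training words
        likelihoods = {}
        for category, docs in labeled_files.items():
            counts = Counter()
            for tokens in docs:
                counts.update(tokens)
            total = sum(counts.values())  # non-unique words in the category
            # Words that never appear get a count of one, the add-one
            # variant described above.
            likelihoods[category] = {w: max(counts[w], 1) / total
                                     for w in vocabulary}
        return likelihoods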

Calculating naive Bayes requires a prior probability. This is the probability of a text file occurring in a category before accounting for its features. I calculated this by counting the number of training files per category, then dividing that count by the total number of training files, 7,769. The algorithm works by creating a new dictionary of categories populated by each category's prior probability, that is, the frequency with which it occurred in the training set. The document is then parsed into word tokens; I used a tokenizer from the Natural Language Toolkit for this task (Bird, 2009). The classifier looks up each token in a document and multiplies its probability by the prior probability for that category and all the other feature probabilities. This multiplication is carried out for all 78 categories. The classifier then chooses the category with the highest product as the most likely label.
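A sketch of that classification step under the same assumptions. One deliberate substitution: summing log-probabilities rather than multiplying raw probabilities gives the same argmax while avoiding floating-point underflow; the paper itself describes a direct product.

    import math
    import nltk  # word_tokenize requires the NLTK "punkt" tokenizer data

    def classify(text, priors, likelihoods):
        # priors:      dict of category -> P(category)
        # likelihoods: dict of category -> {word: P(word | category)}
        tokens = nltk.word_tokenize(text)
        scores = {}
        for category, prior in priors.items():
            score = math.log(prior)
            feats = likelihoods[category]
            for token in tokens:
                if token in feats:  # skip out-of-vocabulary tokens
                    score += math.log(feats[token])
            scores[category] = score
        # The highest log score corresponds to the highest product of
        # probabilities, i.e., the most likely label.
        return max(scores, key=scores.get)

    # Priors, hypothetically, from the training partition sizes:
    # priors = {c: len(docs) / 7769 for c, docs in labeled_files.items()}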

ANALYSIS OF RESULTS

I tested my classifier on the 3,019 test files in the data set and compared its results to the index of labels. It correctly classified 2,124 files and incorrectly classified 895 files; its accuracy is therefore 2,124 / 3,019, or about 70 percent. I had planned to compare its accuracy against other working naive Bayes classifiers, such as the one in the Natural Language Toolkit, but ran out of time to implement that comparison.
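The evaluation loop itself is straightforward. A sketch, using the classify function from the previous sketch and assuming test_files is a list of (text, true_label) pairs from the test partition:

    def evaluate(test_files, priors, likelihoods):
        correct = sum(1 for text, label in test_files
                      if classify(text, priors, likelihoods) == label)
        # Here: 2,124 correct out of 3,019, about 70 percent.
        return correct / len(test_files)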

DISCUSSION

Primarily, I learned the importance of choosing an appropriate data set. Russell and Norvig warn that choosing a large enough data set is fundamental to a properly functioning naive Bayes classifier. Additionally, they mention that feature selection or extraction also determines results (Russell, 2010). Gabrilovich and Markovitch used Reuters-21578 in their research. They conclude that "at higher OC [Outlier Count] values much more moderate (if any) feature selection should be performed, while aggressive selection causes degradation in accuracy." They found that for the 10 largest categories the Outlier Count is 78, "which explains why feature selection does... more harm than good" (Gabrilovich, 2004).

Initially, I had planned to use a much larger Reuters corpus, RCV1, which contains 810,000 files. However, the project's time frame only allowed for the smaller Reuters-21578, Distribution 1.0 (Reuters, 1997). In either case, the training data would yield a specialized business-news classifier rather than a more general breakdown of topics such as politics, sports, arts, entertainment, and science. So, a careful choice of data set and subsequent selection of features will determine much of a project's success.

Secondly, I was surprised by the size and processing time of the data set. In future projects, I will try to estimate the size of my data structures and the time it takes to compute them. Even if this is not foreseeable from the outset, I will build in milestones wherein I can evaluate how much work the program is doing, how efficiently, and how well that work addresses the initial software design or research questions.

Thirdly, I realized that Bayes nets are limited in that they require many features to effectively classify text, and many domain-specific techniques must be employed to actually get one working. I employed Laplacian, or add-one, smoothing to achieve 70 percent accuracy on the test data. However, more is required to meet people's expectation that such a system will work at least 90 percent of the time.

I also question the algorithm's intelligence in light of Marvin Minsky's talk and Society of Mind. Text classification can be a valuable tool, but I am curious about where calculating probabilities fits into the artificial intelligence framework and whether it could play a role, supporting or otherwise, in strong AI. This brings me to research conducted on topic modeling.

CONCLUSION

Given that I was able to accurately classify 7 out of 10 business news stories, I plan to use similar techniques, such as naive Bayes, to experiment with feature selection by creating multiple classifiers that I can compare with each other as well as with existing ones, such as the classifier in the Natural Language Toolkit.

Topic modeling seems promising, as it allows for classification even before all the categories are fully defined. According to Wallach, "Latent Dirichlet allocation (Blei et al., 2003) provides an alternative approach to modeling textual corpora. Documents are modeled as finite mixtures over an underlying set of latent topics inferred from correlations between words, independent of word order." However, she insists that although ignoring word order makes computational sense, it is unrealistic, and she goes on to suggest that bigram models are more realistic (Wallach, 2006).
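As a concrete pointer for that direction (the choice of tooling is my assumption; the paper names no library), the gensim package can fit an LDA topic model over bag-of-words documents:

    from gensim import corpora, models

    # texts: tokenized documents, e.g. a handful of Reuters stories.
    texts = [["oil", "prices", "rose", "sharply"],
             ["shares", "fell", "on", "weak", "earnings"]]
    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]

    # Each document becomes a finite mixture over latent topics,
    # independent of word order, as in the Wallach quote above.
    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary)
    print(lda.print_topics())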

These are some of the ideas of which I have become aware by undertaking the problem of text classification. I would incorporate these new directions in any future work, in the hope of writing software that is more general and realistic and that addresses actual needs.

ACKNOWLEDGMENT

The work described in this paper was conducted as part of a Fall 2010 Artificial Intelligence course, taught in the Computer Science department of the University of Massachusetts Lowell by Prof. Fred Martin.

REFERENCES

1. Bird, S., Klein, E., and Loper, E. Natural Language Processing with Python. O'Reilly Media, Inc., Sebastopol, CA, USA, 2009.

2. Gabrilovich, E. and Markovitch, S. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. ICML '04 (2004), pp. 198-205.

3. Reuters. Reuters-21578 text categorization collection, Distribution 1.0, 1997. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

4. Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, Third Edition. Pearson Education, Inc., Upper Saddle River, New Jersey, USA, 2010.

5. Wallach, H. Topic modeling: beyond bag-of-words. ICML '06 (2006), pp. 997-1005.
