Smart Reader: Building a Naive Bayes Classifier

Dana Scott
Department of Computer Science
University of Massachusetts Lowell
dana-s@charter.net
ABSTRACT

Smart Reader is software written entirely in Python that attempts to classify text. The algorithm calculates the probability that a given text file belongs to a particular category based on a previously defined language model. The algorithm and model use a naive Bayesian network of prior probabilities and features.
Author Keywords

bag-of-words model, Laplacian smoothing, naive Bayes, text classification
INTRODUCTION

Smart Reader solves a text or document classification problem. That is, when given an unlabeled piece of text, to which category does it belong? More specifically, this project aims to classify business news stories into 78 different genres based solely on their content.
PRODUCT DESCRIPTION

I began by building a language model based on the data set, Reuters-21578, Distribution 1.0 (Reuters, 1997). This data was already partitioned into 7,769 training and 3,019 test files. The ApteMod version provides an index of file names and their corresponding categories. It also includes a list of stopwords that are common to all categories and therefore can be excluded.
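The stopword-exclusion step described above can be sketched as follows. The paper does not show its code, so the helper names and the assumption that the stopword list is a plain-text file with one word per line are mine:

```python
# Sketch of stopword exclusion, assuming the ApteMod stopword list is
# a plain-text file with one word per line (file name is hypothetical).

def load_stopwords(path="stopwords.txt"):
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def filter_tokens(tokens, stopwords):
    """Normalize case and drop stopwords before counting features."""
    return [t.lower() for t in tokens if t.lower() not in stopwords]
```

Because the stopwords are common to all categories, removing them shrinks the feature set without losing any category-discriminating information.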
A Python dictionary or hash table holds 23,357 unique words, which are the features in my Bayesian network. I copied this dictionary 78 times, thereby creating one for
each category. The categories were provided in the ApteMod version of the Reuters corpus. I counted up the frequency of each word for each category in the 7,769 training files. I was able to identify them, as each training file is labeled with the classes to which it belongs; I chose only the dominant or first class listed. If a word did not appear at all, I gave it a count of one. This is one way to implement Laplacian or Add-one smoothing. Once I had a frequency count for each word in each category, I divided them by the total number of non-unique words per category. This gave me a floating-point probability of each feature occurring in each category.
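A minimal sketch of this training step follows. The `labeled_docs` interface of (category, token list) pairs is an assumption (the paper reads these from the ApteMod index and files), and the smoothing is implemented exactly as described above: only unseen words are given a count of one, and counts are divided by the category's non-unique word total:

```python
from collections import Counter, defaultdict

def train(labeled_docs, vocabulary):
    """Build per-category word frequencies, then smoothed probabilities.

    labeled_docs: iterable of (category, token_list) pairs -- an assumed
    interface standing in for the Reuters training files and index.
    vocabulary: the set of unique words used as features.
    """
    counts = defaultdict(Counter)
    for category, tokens in labeled_docs:
        counts[category].update(t for t in tokens if t in vocabulary)

    probs = {}
    for category, counter in counts.items():
        # Total number of non-unique words seen in this category.
        total = sum(counter.values())
        # Smoothing as described in the text: a word that never appeared
        # is given a count of one, so no feature probability is zero.
        probs[category] = {w: max(counter[w], 1) / total for w in vocabulary}
    return probs
```

Note that this variant of add-one smoothing follows the paper's description literally; the more common variant adds one to every count (seen and unseen) and adjusts the denominator accordingly so the probabilities sum to one.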
Calculating naive Bayes requires a prior probability. This is the probability of a text file occurring in a category before accounting for its features. I calculated this by counting the number of training files per category, then dividing that count by the total number of training files, 7,769. The algorithm works by creating a new dictionary of categories populated by each category's prior probability, that is, the frequency with which it occurred in the training set. The document is then parsed into word tokens. I used a tokenizer from the Natural Language Toolkit for this task (Bird, 2009). The classifier looks up each token in a document and multiplies its probability by the prior probability for that category and all the other feature probabilities. This multiplication is carried out for all 78 categories. The classifier then chooses the category with the highest product as the most likely label.
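The scoring loop can be sketched as below. The `priors` and `probs` dictionaries follow from the training step; summing log probabilities instead of multiplying raw probabilities is my substitution, not something the paper states it did, but it picks the same argmax while avoiding floating-point underflow on long documents:

```python
import math

def classify(tokens, priors, probs):
    """Return the category maximizing prior * product of feature probs.

    priors: {category: prior probability}
    probs:  {category: {word: P(word | category)}}
    Computed in log space (sum of logs) to avoid underflow; the
    highest log score corresponds to the highest product.
    """
    best_category, best_score = None, float("-inf")
    for category, prior in priors.items():
        score = math.log(prior)
        for token in tokens:
            if token in probs[category]:   # skip out-of-vocabulary tokens
                score += math.log(probs[category][token])
        if score > best_score:
            best_category, best_score = category, score
    return best_category
```

Repeating this loop over all 78 categories and taking the maximum is exactly the "highest product wins" decision rule described above.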
ANALYSIS OF RESULTS

I tested my classifier on the 3,019 test files in the data set and compared its results to the index of labels. It correctly classified 2,124 files and incorrectly classified 895 files. Therefore, its accuracy is 70 percent. I had planned to compare its accuracy against other working naive Bayes classifiers, such as the one in the Natural Language Toolkit, but ran out of time to implement it.
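The accuracy figure above is the usual correct-over-total ratio, which can be computed as:

```python
def accuracy(predictions, gold_labels):
    """Fraction of predicted labels that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)
```

With the reported counts, 2,124 / 3,019 ≈ 0.7035, which rounds to the 70 percent stated above.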
DISCUSSION

Primarily, I learned the importance of choosing an appropriate data set. Russell and Norvig warn that choosing a large enough data set is fundamental to a properly functioning naive Bayes classifier. Additionally, they mention that feature selection or extraction also determines results (Russell, 2010). Gabrilovich and Markovitch used Reuters-21578 in their research. They conclude “at higher OC [Outlier Count] values much more moderate (if any) feature selection should be performed, while aggressive selection causes degradation in accuracy.” They found that for the 10 largest categories the Outlier Count is 78, “which explains why feature selection does... more harm than good” (Gabrilovich, 2004).
Initially, I had planned to use a much larger Reuters corpus, RCV1, which contains 810,000 files. However, the project's time frame only allowed for the smaller Reuters-21578, Distribution 1.0 (Reuters, 1997). In either case, the training data would yield a specialized business news classifier rather than a more general breakdown of topics such as politics, sports, arts, entertainment, science, etc. So, a careful choice of data set and subsequent selection of features will determine much of a project's success.
Secondly, I was surprised by the size and processing time of the data set. In future projects, I will try to estimate the size of my data structures and the time it takes to compute them. Even if this is not foreseeable from the outset, I will build in milestones wherein I can evaluate how much work the program is doing, how efficiently, and how well that work addresses the initial software design or research questions.
Thirdly, I realized that Bayes' nets are limited in that they require many features to effectively classify text, and many domain-specific techniques must be employed to actually get one working. I employed Laplacian or Add-one smoothing to achieve 70 percent accuracy with the test data. However, more is required to raise that to people's expectations that it will work at least 90 percent of the time. I also question the algorithm's intelligence in light of Marvin Minsky's talk and Society of Mind. Text classification can be a valuable tool. But I am curious about where calculating probabilities fits into the artificial intelligence framework and whether it could play a role, supporting or otherwise, in strong AI. This brings me to research conducted on topic modeling.
CONCLUSION

Given that I was able to accurately classify 7 out of 10 business news stories, I plan to use similar techniques such as naive Bayes to experiment with feature selection by creating multiple classifiers, which I can compare with each other as well as with existing ones such as the one in the Natural Language Toolkit.
Topic modeling seems promising, as it allows for classification even before all the categories are fully defined. According to Wallach, “Latent Dirichlet allocation (Blei et al., 2003) provides an alternative approach to modeling textual corpora. Documents are modeled as finite mixtures over an underlying set of latent topics inferred from correlations between words, independent of word order.” However, she insists that although ignoring word order makes computational sense, it is unrealistic. She goes on to suggest that bigram models are more realistic (Wallach, 2006).
These are some of the ideas of which I have become aware by undertaking the problem of text classification. I would incorporate these new directions in any future work, in the hope of writing software that is more general and realistic and that addresses actual needs.
ACKNOWLEDGMENT

The work described in this paper was conducted as part of a Fall 2010 Artificial Intelligence course, taught in the Computer Science department of the University of Massachusetts Lowell by Prof. Fred Martin.
REFERENCES

1. Bird, S., Klein, E., Loper, E. Natural Language Processing with Python. O'Reilly Media, Inc., Sebastopol, CA, USA, 2009.

2. Gabrilovich, E., Markovitch, S. (2004). Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. ICML '04 (pp. 198-205).

3. Reuters (1997). Reuters-21578 text categorization collection, Distribution 1.0. Reuters. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

4. Russell, S., Norvig, P. Artificial Intelligence: A Modern Approach, Third Edition. Pearson Education, Inc., Upper Saddle River, New Jersey, USA, 2010.

5. Wallach, H. (2006). Topic modeling: beyond bag-of-words. ICML '06 (pp. 997-1005).