
Smart Reader: Building a Naive Bayes Classifier

Dana Scott
Department of Computer Science
University of Massachusetts Lowell
dana-s@charter.net

ABSTRACT

Smart Reader is software written entirely in Python that attempts to classify text. The algorithm calculates the probability that a given text file belongs to a particular category based on a previously defined language model. The algorithm and model use a naive Bayesian network of prior probabilities and features.

Author Keywords

bag-of-words model, Laplacian smoothing, naive Bayes, text classification

INTRODUCTION

Smart Reader solves a text or document classification problem. That is, when given an unlabeled piece of text, to which category does it belong? More specifically, this project aims to classify business news stories into 78 different genres based solely on their content.

PRODUCT DESCRIPTION

I began by building a language model based on the data set, Reuters-21578, Distribution 1.0 (Reuters, 1997). This data was already partitioned into 7,769 training files and 3,019 test files. The ApteMod version provides an index of file names and their corresponding categories. It also includes a list of stopwords that are common to all categories and can therefore be excluded.
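As a rough illustration of that preprocessing step (a sketch, not the author's code; the stopword file name and helper names are assumptions), stopword filtering might look like this:

    # Minimal stopword-filtering sketch. Assumes the ApteMod stopword
    # list is a plain text file with one lowercase word per line.
    def load_stopwords(path="stopwords"):
        with open(path) as f:
            return {line.strip() for line in f if line.strip()}

    def filter_tokens(tokens, stopwords):
        # Drop non-alphabetic tokens and anything on the stopword list.
        return [t.lower() for t in tokens
                if t.isalpha() and t.lower() not in stopwords]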

A Python dictionary, or hash table, holds 23,357 unique words, which are the features in my Bayesian network. I copied this dictionary 78 times, thereby creating one for


each category. The categories were provided in the ApteMod version of the Reuters corpus. I counted the frequency of each word for each category in the 7,769 training files; I was able to identify them because each training file is labeled with the classes to which it belongs, and I chose only the dominant, or first, class listed. If a word did not appear at all, I gave it a count of one, which is one way to implement Laplacian, or add-one, smoothing. Once I had a frequency count for each word in each category, I divided each count by the total number of non-unique words in that category. This gave me a floating-point probability of each feature occurring in each category.
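A minimal sketch of this training step, assuming labeled_files maps each category to its list of tokenized training documents (the names and data layout are hypothetical, not the author's code):

    from collections import Counter

    def train(labeled_files, vocabulary):
        # labeled_files: dict of category -> list of token lists
        # vocabulary:    the 23,357 unique training words
        likelihoods = {}
        for category, docs in labeled_files.items():
            counts = Counter()
            for tokens in docs:
                counts.update(tokens)
            total = sum(counts.values())  # non-unique words in the category
            # Words that never appear get a count of one, the add-one
            # variant described above.
            likelihoods[category] = {w: max(counts[w], 1) / total
                                     for w in vocabulary}
        return likelihoods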

Calculating naive Bayes requires a prior probability. This is the probability of a text file occurring in a category before accounting for its features. I calculated this by counting the number of training files per category, then dividing that count by the total number of training files, 7,769. The algorithm works by creating a new dictionary of categories populated by each category's prior probability, that is, the frequency with which it occurred in the training set. The document is then parsed into word tokens; I used a tokenizer from the Natural Language Toolkit for this task (Bird, 2009). The classifier looks up each token in a document and multiplies its probability by the prior probability for that category and all the other feature probabilities. This multiplication is carried out for all 78 categories. The classifier then chooses the category with the highest product as the most likely label.
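A sketch of that classification step under the same assumptions. One deliberate substitution: summing log-probabilities rather than multiplying raw probabilities gives the same argmax while avoiding floating-point underflow; the paper itself describes a direct product.

    import math
    import nltk  # word_tokenize requires the NLTK "punkt" tokenizer data

    def classify(text, priors, likelihoods):
        # priors:      dict of category -> P(category)
        # likelihoods: dict of category -> {word: P(word | category)}
        tokens = nltk.word_tokenize(text)
        scores = {}
        for category, prior in priors.items():
            score = math.log(prior)
            feats = likelihoods[category]
            for token in tokens:
                if token in feats:  # skip out-of-vocabulary tokens
                    score += math.log(feats[token])
            scores[category] = score
        # The highest log score corresponds to the highest product of
        # probabilities, i.e., the most likely label.
        return max(scores, key=scores.get)

    # Priors, hypothetically, from the training partition sizes:
    # priors = {c: len(docs) / 7769 for c, docs in labeled_files.items()}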

ANALYSIS OF RESULTS

I tested my classifier on the 3,019 test files in the data set and compared its results to the index of labels. It correctly classified 2,124 files and incorrectly classified 895 files; its accuracy is therefore 2,124 / 3,019, or about 70 percent. I had planned to compare its accuracy against other working naive Bayes classifiers, such as the one in the Natural Language Toolkit, but ran out of time to implement that comparison.
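The evaluation loop itself is straightforward. A sketch, using the classify function from the previous sketch and assuming test_files is a list of (text, true_label) pairs from the test partition:

    def evaluate(test_files, priors, likelihoods):
        correct = sum(1 for text, label in test_files
                      if classify(text, priors, likelihoods) == label)
        # Here: 2,124 correct out of 3,019, about 70 percent.
        return correct / len(test_files)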

DISCUSSION

Primarily, I learned the importance of choosing an appropriate data set. Russell and Norvig warn that choosing a large enough data set is fundamental to a properly functioning naive Bayes classifier. Additionally, they mention that feature selection or extraction also determines results (Russell, 2010). Gabrilovich and Markovitch used Reuters-21578 in their research. They conclude that "at higher OC [Outlier Count] values much more moderate (if any) feature selection should be performed, while aggressive selection causes degradation in accuracy." They found that for the 10 largest categories the Outlier Count is 78, "which explains why feature selection does... more harm than good" (Gabrilovich, 2004).

Initially, I had planned to use a much larger Reuters corpus, RCV1, which contains 810,000 files. However, the project's time frame only allowed for the smaller Reuters-21578, Distribution 1.0 (Reuters, 1997). In either case, the training data would yield a specialized business-news classifier rather than a more general breakdown of topics such as politics, sports, arts, entertainment, and science. So, a careful choice of data set and subsequent selection of features will determine much of a project's success.

Secondly, I was surprised by the size and processing time of the data set. In future projects, I will try to estimate the size of my data structures and the time it takes to compute them. Even if this is not foreseeable from the outset, I will build in milestones wherein I can evaluate how much work the program is doing, how efficiently, and how well that work addresses the initial software design or research questions.

Thirdly, I realized that Bayes nets are limited in that they require many features to effectively classify text, and many domain-specific techniques must be employed to actually get one working. I employed Laplacian, or add-one, smoothing to achieve 70 percent accuracy on the test data. However, more is required to meet people's expectation that such a system will work at least 90 percent of the time.

I also question the algorithm's intelligence in light of Marvin Minsky's talk and Society of Mind. Text classification can be a valuable tool, but I am curious about where calculating probabilities fits into the artificial intelligence framework and whether it could play a role, supporting or otherwise, in strong AI. This brings me to research conducted on topic modeling.

CONCLUSION

Given that I was able to accurately classify 7 out of 10 business news stories, I plan to use similar techniques, such as naive Bayes, to experiment with feature selection by creating multiple classifiers that I can compare with each other as well as with existing ones, such as the classifier in the Natural Language Toolkit.

Topic modeling seems promising, as it allows for classification even before all the categories are fully defined. According to Wallach, "Latent Dirichlet allocation (Blei et al., 2003) provides an alternative approach to modeling textual corpora. Documents are modeled as finite mixtures over an underlying set of latent topics inferred from correlations between words, independent of word order." However, she insists that although ignoring word order makes computational sense, it is unrealistic, and she goes on to suggest that bigram models are more realistic (Wallach, 2006).
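As a concrete pointer for that direction (the choice of tooling is my assumption; the paper names no library), the gensim package can fit an LDA topic model over bag-of-words documents:

    from gensim import corpora, models

    # texts: tokenized documents, e.g. a handful of Reuters stories.
    texts = [["oil", "prices", "rose", "sharply"],
             ["shares", "fell", "on", "weak", "earnings"]]
    dictionary = corpora.Dictionary(texts)
    bow = [dictionary.doc2bow(t) for t in texts]

    # Each document becomes a finite mixture over latent topics,
    # independent of word order, as in the Wallach quote above.
    lda = models.LdaModel(bow, num_topics=2, id2word=dictionary)
    print(lda.print_topics())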

These are some of the ideas of which I have become aware by undertaking the problem of text classification. I would incorporate these new directions in any future work, in the hope of writing software that is more general and realistic and that addresses actual needs.

ACKNOWLEDGMENT

The work described in this paper was conducted as part of a Fall 2010 Artificial Intelligence course, taught in the Computer Science department of the University of Massachusetts Lowell by Prof. Fred Martin.

REFERENCES

1. Bird, S., Klein, E., and Loper, E. Natural Language Processing with Python. O'Reilly Media, Inc., Sebastopol, CA, USA, 2009.

2. Gabrilovich, E. and Markovitch, S. Text categorization with many redundant features: using aggressive feature selection to make SVMs competitive with C4.5. ICML '04 (2004), pp. 198-205.

3. Reuters. Reuters-21578 text categorization collection, Distribution 1.0, 1997. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

4. Russell, S. and Norvig, P. Artificial Intelligence: A Modern Approach, Third Edition. Pearson Education, Inc., Upper Saddle River, New Jersey, USA, 2010.

5. Wallach, H. Topic modeling: beyond bag-of-words. ICML '06 (2006), pp. 997-1005.
