
Ensemble Classification for Spam Filtering Based on Clustering of Text Corpora

Robert Neumayer
neumayer@tcd.ie

Visiting Student
Final Year Project 2006
Supervisor: Dr. Pádraig Cunningham


Contents

1 Introduction
  1.1 General Objectives
  1.2 Synopsis

2 Related Work
  2.1 Supervised Machine Learning
    2.1.1 k-Nearest Neighbour Classifiers
  2.2 Unsupervised Machine Learning
    2.2.1 k-Means Clustering
    2.2.2 Self-Organizing Map
    2.2.3 Ensemble Classification
  2.3 Objectives Revisited

3 Text Categorisation and Text Information Retrieval
  3.1 Application Scenario
  3.2 Performance Measurements
  3.3 Database Implementation

4 Feature Selection and Dimensionality Reduction
  4.0.1 Information Gain
  4.0.2 Odds Ratio
  4.1 Combination of Feature Selection Approaches
    4.1.1 Feature Selection Experiments
    4.1.2 Document Frequency Thresholding

5 Ensembles of Classifiers for Spam Classification
  5.1 Experimental Setup
  5.2 Preprocessing
  5.3 Classification Experiments
    5.3.1 k-NN
    5.3.2 Decision Function
    5.3.3 Ensemble Classification
    5.3.4 Lessons Learned from the First Shot
  5.4 New Winner Take All Strategy
    5.4.1 Aggregation Technique Revisited

6 Clustering of the TREC Spam Corpus
  6.1 Cluster Validation Techniques: Silhouette Value
  6.2 Cluster Validation of the Self-Organizing Map
  6.3 Finally: Subsampling of the TREC Corpus

7 Conclusions and Future Work

8 Acknowledgements


Abstract

Spam filtering has become a very important issue over recent years, as unsolicited bulk e-mail imposes large problems in terms of both the amount of time spent on it and the resources needed to automatically filter such messages. Text information retrieval offers the tools and algorithms to handle text documents in an abstract vector form, on which machine learning algorithms can then be applied. This work deals with the possible improvements gained from ensembles, i.e. multiple, differing classifiers for the same task. The individual classifiers can fit parts of the training data better and may therefore improve classification results, provided the best-fitting classifier can be found. Basic classification algorithms as well as clustering are introduced. Furthermore, the application of the ensemble idea is explained and experimental results are presented.


Chapter 1

Introduction

Unsolicited commercial e-mail has become a serious problem for both private and commercial e-mail users. By now, spam is estimated to account for a very high percentage of all e-mail traffic (in some estimates up to 80 % or even more). The resulting waste of resources (hardware, time, etc.) is enormous.

The currently employed infrastructure for e-mail transfer, the Simple Mail Transfer Protocol (SMTP), hardly provides any support for detecting or preventing spam. A widely accepted and deployed authentication mechanism for sending e-mail is also lacking. Thus, until a new global e-mail infrastructure has been developed that allows for better control of this problem, two major approaches show the greatest potential for coping with it: detecting spam based on content filtering, or preventing spam by increasing the costs associated with sending out e-mail messages.

This work concentrates on the application of machine learning techniques to the spam problem and therefore clearly belongs to the content filtering category. The main task of machine learning is to "learn" labels of instances (like legitimate or non-legitimate for e-mail messages) from a given training set, and subsequently tag new or unknown instances (e.g. incoming e-mails).

More specifically, the feasibility of decomposing spam training sets into smaller, more specialised subsets for spam filtering is investigated. The main idea is that the similarity of an incoming message to all those subsets can be computed, and the message is likely to fit one of them best. That subset can then be used to label the specific instance. Therefore, the ensemble classification approach was chosen. Usually, one classifier is built on one training set, and this classifier is then used to categorise incoming messages. Ensembles comprise several classifiers for the same task, varying either in parameter settings or in the data used for training. This work relies on the latter type of ensemble.


1.1 General Objectives

The goal of this work is to show that exploiting the diversity amongst spam and ham messages, and using multiple classifiers built on those subgroups, can outperform a single classifier in terms of classification accuracy. The primary purpose is to show the applicability of ensemble techniques to the spam problem, in particular how far different aggregation techniques for combining the results of several classifiers are feasible in this scenario. Moreover, the issue of classifier confidence, i.e. how confident a classifier is that its decision is the right one, will be addressed.

1.2 Synopsis

The remainder of this report is structured as follows. Chapter 2 gives an overview of previous research. Chapter 3 introduces the main principles of text categorisation and the application of machine learning techniques. Chapter 4 deals with the problem of feature selection in the context of text information retrieval. The application of ensemble techniques to the spam problem is explained in Chapter 5. Chapter 6 gives an overview of clustering and experimental results. Finally, Chapter 7 gives a short outlook and summarises the work done.


Chapter 2

Related Work

Numerous spam filtering strategies with varying degrees of effectiveness have been proposed and developed. Many of them implement static or rule-based approaches, which leave opportunities for spammers to react and circumvent existing anti-spam solutions. As a result, many of the existing approaches yield good classification results only over short periods of time (until spammers have found a workaround). This manifests itself in an ongoing "arms race" between spammers and developers of spam filtering software. The most important and most widespread example of such an approach is Apache's SpamAssassin [1], currently a de-facto standard for e-mail classification and spam filtering. Here, a large number (currently over 700) of different types of tests and checks are combined. The decision whether a message is spam or not is based on a simple thresholding mechanism; the weightings for the different tests are computed by a linear perceptron trained on large spam and ham corpora. Beyond that, many other products and tools have been developed for spam filtering (both commercial and open source). Although they may have different names, they often rely on similar methods, and some are even directly based on SpamAssassin.

More generally, by omitting the special nature of e-mail traffic, spam filtering can be regarded as a classic text categorisation task, a field in which thorough research has been conducted.

Machine learning techniques are widely used in automated text classification, including spam filtering (see [2] for a survey of techniques used). Due to their simplicity and performance advantages over many more sophisticated approaches, naïve Bayes classifiers [3] are perhaps the most prevalent. They have been used for spam filtering with great success, since online training is particularly well supported. Many implementations are available, most of them based on Paul Graham's ideas [4, 5]. Case-based techniques, concentrating on editing the case base and counteracting concept drift in spam mails, are described in [6]. [7] shows that n-gram frequencies do not necessarily boost classification performance in the spam context, and also points out the importance of meta data for this task.


In terms of classifier techniques, the feasibility of applying support vector machines (SVMs) to text classification has been investigated in [8]. The use of latent semantic indexing (LSI) for spam filtering is assessed in [9]. A comprehensive comparison of different text categorisation methods is also described in [10], where a k-NN classifier is compared to different LSI variants and support vector machines. SVMs and a k-NN supported LSI performed best, each having both advantages and disadvantages.

2.1 Supervised Machine Learning

The main task in classification or supervised learning is to train a system on a given labelled training set and then assign unknown instances to one of those labels. For the e-mail scenario this means that a known set of legitimate as well as non-legitimate e-mails is used to train a classifier; the information taken from those labelled messages is then used later when new and unknown messages come in. How exactly that works depends on the algorithm used. The k-nearest neighbour algorithm is one of the simplest machine learning techniques, yet it is quite a powerful one and will be used as the classification technique throughout this report.

2.1.1 k-Nearest Neighbour Classifiers

k-NN compares a given query to all vectors in the training set and assigns the query to the class that the majority of the k most similar vectors belong to. Similarity in this context can be computed by a range of distance measures, of which the Euclidean distance is the most common. k-NN classifiers belong to the category of so-called lazy learners, i.e. all computational resources are needed at classification time. The training phase involves merely the storage of the training set, as opposed to other algorithms like back-propagation neural networks.

Each query is compared to every single stored instance. The k closest instances in the training set are determined and the query is assigned to the prevailing class of those neighbours. If, for example, the three closest instances to a query are all labelled spam, the query message is labelled spam as well. The confidence of the 3-NN classifier would then be 1.0, but would be 0.66 if only two out of the three neighbours were spam.
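To make this concrete, the following sketch shows a k-NN decision including the confidence value just described. It is written in Python purely for illustration (the report's actual implementation used Java and Matlab); the toy term-frequency vectors, labels and queries are invented.

    import math
    from collections import Counter

    def euclidean(a, b):
        # Euclidean distance between two equal-length feature vectors
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    def knn_classify(query, training, k=3):
        # training is a list of (vector, label) pairs; the confidence is the
        # fraction of the k nearest neighbours carrying the winning label
        neighbours = sorted(training, key=lambda item: euclidean(query, item[0]))[:k]
        votes = Counter(label for _, label in neighbours)
        label, count = votes.most_common(1)[0]
        return label, count / k

    train = [([5, 0, 1], "spam"), ([4, 1, 0], "spam"), ([6, 1, 1], "spam"),
             ([0, 3, 2], "ham"), ([1, 4, 3], "ham"), ([0, 5, 1], "ham")]
    print(knn_classify([4, 0, 1], train))   # ('spam', 1.0): all three neighbours are spam
    print(knn_classify([2, 2, 1], train))   # ('ham', 0.66...): two of three neighbours are ham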

2.2 Unsupervised Machine Learning

Unsupervised learning denotes the partitioning of a data set into subsets that contain similar elements in terms of a distance function. Clustering does not need class labels, as the partitioning is done based on the data vectors only. In this context, clustering is used to partition existing ham and spam sets into subgroups.


2.2.1 k-Means Clustering

k-Means clustering is a rather simple and well known technique. The input for the algorithm is a possibly unlabelled data set and the number of clusters to partition the data into. For each cluster, a centre vector is randomly initialised. Each data point is then assigned to the closest cluster centre, and the new cluster centres are computed as the average over all data points assigned to each cluster. This is repeated until there are no more changes in cluster assignments; a minimal sketch is given below.

This technique was not actually used for any experiments, but is mentioned to introduce the general clustering idea as well as to point out the differences between this algorithm and the Self-Organizing Map that is introduced in the next section.

2.2.2 Self-Organizing Map

The Self-Organizing Map (SOM), an unsupervised neural network that provides a mapping from a high-dimensional input space to a (usually two-dimensional) output space, was used for clustering [11, 12]. During that process, topological relations are preserved as faithfully as possible. A SOM consists of a set of i units arranged in a two-dimensional grid, each attached to a weight vector m_i ∈ R^n. Elements from the high-dimensional input space, referred to as input vectors x ∈ R^n, are presented to the SOM, and the activation of each unit for the presented input vector is calculated using an activation function (usually the Euclidean distance). In the next step, the weight vector of the unit showing the highest activation is modified so as to more closely resemble the presented input vector: the weight vector of the winner is moved towards the presented input signal by a certain fraction of the Euclidean distance, as indicated by a time-decreasing learning rate α. Consequently, the next time the same input signal is presented, this unit's activation will be even higher. Furthermore, the weight vectors of units neighbouring the winner, as described by a time-decreasing neighbourhood function, are modified accordingly, yet to a smaller extent than the winner's. Figure 2.1 shows that data points x can be assigned to different weight vectors m over time; this behaviour is particularly influenced by the learning rate and the neighbourhood function. The result of this learning procedure is a topologically ordered mapping of the presented input signals in two-dimensional space. Accordingly, similar input data are mapped onto neighbouring regions of the map.
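The fragment below sketches a single training step of this procedure. It is a simplified Python illustration (not the toolbox used for the actual experiments), with an invented 2x2 map, a Gaussian neighbourhood function and ad-hoc schedules for the learning rate and the neighbourhood radius.

    import math
    import random

    random.seed(1)
    DIM, ROWS, COLS = 4, 2, 2
    # one weight vector m_i in R^DIM per unit, the units being arranged on a 2-D grid
    weights = {(r, c): [random.random() for _ in range(DIM)]
               for r in range(ROWS) for c in range(COLS)}

    def train_step(x, alpha, sigma):
        # find the winner (smallest Euclidean distance to x), then pull it and,
        # to a lesser extent, its grid neighbours towards the input vector
        winner = min(weights, key=lambda u: math.dist(weights[u], x))
        for unit, m in weights.items():
            grid_dist = math.dist(unit, winner)                # distance on the map grid
            h = math.exp(-grid_dist ** 2 / (2 * sigma ** 2))   # neighbourhood function
            weights[unit] = [w + alpha * h * (xi - w) for w, xi in zip(m, x)]

    # alpha and sigma decrease over time, so later inputs cause ever smaller adjustments
    for t, x in enumerate([[1, 0, 0, 0], [0, 0, 1, 1]] * 50):
        train_step(x, alpha=0.5 / (1 + 0.05 * t), sigma=1.0 / (1 + 0.05 * t))
    print(weights)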

Figure 2.1: The Self-Organizing Map training algorithm maps data points from a high-dimensional input space to a low-dimensional output space.

2.2.3 Ensemble Classification

Ensembles are groups of machine learning instances that work together and are used to improve the classification results of the overall system. The most popular ensemble methods are boosting and bagging [13]. Those algorithms train classifier instances on different subsets of the overall data set; bagging combines the results of classifiers trained on sub-samples of the data set. A good overview of different ways of constructing ensembles, as well as an explanation of why ensembles are able to outperform their single members, is given in [14]. The importance of diversity for the successful combination of classifiers is discussed in [15].

An impressive application of ensemble techniques for decision trees are Random Forests [16], where several decision trees are constructed for the same problem and their results are aggregated to find the best overall classification decision.

In this context, ensembles are used in the sense that several classifiers are trained on the same problem (spam detection), but on different subsets of spam and ham messages. Assume that there are two distinct groups in the spam data, say medical and investment topics, and two main groups of ham messages, say private and business mails. A newly incoming e-mail is likely to best fit into one category only, depending on its content. For that scenario, four classifiers would be trained, one for each combination of ham and spam:

1. Medical / private

2. Medical / business

3. Investment / private

4. Investment / business


Every new message is now classified by all of those four classifiers. Their possible output labellings and levels of confidence (for one incoming message) could be:

1. Spam / 1.00

2. Spam / 0.66

3. Spam / 0.33

4. Ham / 0.66

This message would clearly be tagged as spam, as most classifiers are confident that it is. This example, of course, needs the data to be categorised into subgroups a priori, which usually is not the case, and manually sorting a mail repository into such categories is not very feasible either. The clustering part (identifying existing subgroups of the data) will be covered in more detail in Chapter 6.
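As a toy illustration of this voting step, the snippet below (Python, illustrative only) combines the four hypothetical (label, confidence) outputs listed above and tags the message with the majority label.

    from collections import Counter

    # hypothetical outputs of the four ham/spam classifiers for one incoming message
    votes = [("spam", 1.00), ("spam", 0.66), ("spam", 0.33), ("ham", 0.66)]

    label_counts = Counter(label for label, _ in votes)
    decision = label_counts.most_common(1)[0][0]   # the majority of classifiers wins
    print(decision)                                # spam: three of the four classifiers say so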

2.3 Objectives Revisited

The basic objective of this work is to provide e-mail classification based on the concept of the vector space model, which was introduced in [17] and is widely used, particularly in the area of information retrieval [18, 19]. The next chapters will cover the application of that model to text data and introduce the techniques associated with it, as well as the e-mail categorisation task in general. After that, the process of subsampling and ensemble classification will be explained in detail.


Chapter 3

Text Categorisation and Text Information Retrieval

In the information retrieval context, an e-mail is treated as a simple text document and is represented by a set of (characteristic) words or tokens. Models for text representation range from lists of whole words to vectors of n-grams (i.e., tokens of size n). Tokenisation may include stemming, i.e., stripping off affixes of words, leaving only word stems. It is very common to use lists of stop words, i.e., static, predefined lists of words that are removed from the documents before further processing (see [20] or ranks.nl 1 for a sample list of English stop words). Features in this category are purely content-based, computed only from the message text itself (without aggregated "meta-information" of any kind), and will be denoted as "low-level" features in this report.

In classic text categorisation, low-level features are computed from a labelled training set of sufficient size. New messages can then be assigned to the class represented by the most "similar" messages in terms of word co-occurrences.

An introduction to Information Retrieval as such is given in [21]. The basic idea is to treat text as a bag of words or tokens. IR abstracts from any kind of linguistic information and is often referred to as statistical natural language processing (NLP). Documents are represented as term vectors. A document collection containing the following two documents:

This is a text document.

and

And so is this document a text document.

would represent its documents by vectors of length 7; details are shown in Table 3.1.

1 http://www.ranks.nl/tools/stopwords.html


Table 3.1: Term frequencies for the two example documents and the resulting document frequencies.

Document/Token       this  is  a  text  document  and  so
1                     1    1   1   1       1       0    0
2                     1    1   1   1       2       1    1
Document frequency    2    2   2   2       2       1    1

This representation is subsequently used to calculate distances between, or similarities of, documents in the vector space.

Once a text is represented by tokens, more sophisticated techniques can be applied. In the context of a vector space model a document is denoted by d, a term (token) by t, and the number of documents in a corpus by N.

The number of times term t appears in document d is denoted as the term frequency tf(t,d); the number of documents in the collection that term t occurs in is denoted as the document frequency df(t), as shown in Table 3.1. The process of assigning weights to terms according to their importance or significance for the classification is called "term weighting". The basic assumptions are that terms occurring very often in a document are more important for classification, whereas terms that occur in a high fraction of all documents are less important. The weighting we rely on is the most common model of term frequency inverse document frequency [22], where the weight tfidf of a term in a document is computed as:

$\mathrm{tfidf}(t,d) = \mathrm{tf}(t,d) \cdot \ln(N/\mathrm{df}(t))$   (3.1)

This results in vectors of weight values for each document d in the collection. Based on such vector representations of documents, classification methods can be applied.
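A minimal sketch of this weighting, applied to the two example documents from above (Python, for illustration; the natural logarithm is used as in Equation 3.1):

    import math
    from collections import Counter

    docs = ["this is a text document",
            "and so is this document a text document"]
    tokenised = [d.split() for d in docs]
    N = len(tokenised)

    # document frequency df(t): number of documents containing term t
    df = Counter(t for doc in tokenised for t in set(doc))

    def tfidf(doc):
        tf = Counter(doc)                                  # term frequency tf(t, d)
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    # terms occurring in every document get weight 0, matching the intuition above
    for i, doc in enumerate(tokenised, 1):
        print(i, tfidf(doc))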

3.1 Application Scenario

Spam filtering can be regarded as a binary classification problem: there are two categories, called spam and ham, and each incoming message has to be assigned to one of them.

More generally, the two categories are also called positives and negatives. In spam filtering the term positive usually denotes spam. Since this terminology may be ambiguous, accuracy rates by class, i.e. sensitivity and specificity, are emphasised instead.

3.2 Performance Measurements

A very important question in this context is how to measure the detection rate of a spam filter. Averaged classification accuracy does not reflect the negative effects of misclassifying legitimate messages, i.e. misclassifications of spam and of ham messages have the same weight, which might not reflect their actual cost. Therefore, the terms sensitivity s_1 and specificity s_2 shall be introduced first.

A message classified as spam can either be a "true positive", a spam message that is correctly classified as spam, or a "false positive", a ham message that is wrongly classified as spam. Hence, "false positives" are legitimate mail messages that are wrongly classified as spam. A low false positive rate is a highly critical aspect in spam filtering (no one wants to lose legitimate mail).

The sensitivity s_1 denotes the fraction of correctly classified members of class 1 (spam, positives). Analogously, the specificity s_2 denotes the fraction of correctly classified members of class 2 (ham, negatives). The averaged classification accuracy a of a spam filter therefore is the average of s_1 and s_2:

$a = \frac{s_1 + s_2}{2} = \frac{1}{2}\left(\frac{\text{true positives}}{\text{positives}} + \frac{\text{true negatives}}{\text{negatives}}\right)$   (3.2)

We will still list the ham and spam accuracies as well as the averaged accuracy, to reflect the higher importance of misclassified ham messages.
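Written out as a small helper (Python, illustrative; the confusion counts in the example are invented):

    def accuracy_by_class(true_pos, positives, true_neg, negatives):
        # sensitivity s1, specificity s2 and averaged accuracy a, as in Equation 3.2
        s1 = true_pos / positives      # fraction of spam correctly classified
        s2 = true_neg / negatives      # fraction of ham correctly classified
        return s1, s2, (s1 + s2) / 2

    # e.g. 470 of 500 spam and 985 of 1000 ham messages classified correctly
    print(accuracy_by_class(470, 500, 985, 1000))   # (0.94, 0.985, 0.9625)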

3.3 Database Implementation

The back-end was implemented in Java using MySQL 2. The main design goal was a flexible architecture that is easily extensible. An overview of the database schema is depicted in Figure 3.1. Basically, documents are grouped by corpora. Documents are indexed and preprocessed, after which tokenisation is done and feature vectors can be generated directly by SQL statements. Preprocessing and indexing are done via different implementations for different types of documents. The current implementation includes processing of maildir messages, Wikipedia pages and song lyrics, but is meant to be easily extensible.

Feature selection methods implemented include Information Gain, Odds Ratio and document frequency thresholding. The output formats are somlib, used in [23], arff, used by the Weka machine learning suite [24], as well as plain text for use within Matlab.

A k-NN classifier is implemented, including different aggregation methods for the ensemble results, which is mainly used for experiments including individual feature selection.

2 http://www.mysql.org

Figure 3.1: Database back-end for document indexing.


Chapter 4

Feature Selection and Dimensionality Reduction

When tokenising text documents one often faces very high-dimensional data. Tens of thousands of dimensions are not easy to handle, therefore feature selection plays a significant role. Document frequency thresholding achieves reductions in dimensionality by excluding terms having very high or very low document frequencies. Terms that occur in almost all documents in a collection do not provide any discriminating information. It is similar for terms that have a very low document frequency, although those features might be helpful if they are not distributed evenly across classes. If the term "golf" has a low document frequency, it can still help to discriminate between ham and spam if it only occurs in ham messages, or vice versa.

4.0.1 Information Gain

Information Gain is a technique originally used to compute splitting criteria for decision trees. Different feature selection models, including Information Gain, are described in [25]. The basic idea behind IG is to find out how well each single feature separates the given data set.

The overall entropy I for a given dataset S is computed as:

$I(S) = -\sum_{i=1}^{C} p_i \log_2 p_i$   (4.1)

where C denotes the number of available classes and p_i the proportion of instances that belong to class i. The reduction in entropy, or gain in information, is then computed for each attribute or token:

$IG(S, A) = I(S) - \sum_{v \in A} \frac{|S_v|}{|S|} I(S_v)$   (4.2)

where v ranges over the values of attribute A and S_v is the subset of instances for which A takes that value.

This results in an Information Gain value for each token extracted from a given document collection; for the content-based experiments, documents are represented by a given number of tokens having the highest Information Gain values.
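The computations of Equations 4.1 and 4.2, for a single binary attribute (token present or absent), can be sketched as follows (Python, illustrative; the labelled toy data are invented):

    import math
    from collections import Counter

    def entropy(labels):
        # I(S) = -sum_i p_i * log2(p_i) over the class proportions p_i
        total = len(labels)
        return -sum((n / total) * math.log2(n / total)
                    for n in Counter(labels).values())

    def information_gain(presence, labels):
        # IG of a binary attribute (token present/absent) with respect to the labels
        gain = entropy(labels)
        for value in set(presence):
            subset = [l for p, l in zip(presence, labels) if p == value]
            gain -= len(subset) / len(labels) * entropy(subset)
        return gain

    labels   = ["spam", "spam", "spam", "ham", "ham", "ham"]
    presence = [1, 1, 1, 0, 0, 1]    # the token occurs in all spam and one ham message
    print(information_gain(presence, labels))   # about 0.46 bits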

4.0.2 Odds Ratio

$OR(w, c_i) = \frac{P(w \mid c_i)\,[1 - P(w \mid \bar{c}_i)]}{[1 - P(w \mid c_i)]\,P(w \mid \bar{c}_i)}$   (4.3)

Equation 4.3 shows how the Odds Ratio for a given word and class is calculated. P(w|c_i) denotes the probability of word w occurring in category c_i, and P(w|c̄_i) the probability of it occurring in the remaining categories. A high Odds Ratio for a word means that it is very predictive for a class. This results in a list of predictive tokens for each class. The best features according to the Odds Ratio measure are therefore the ones having the highest predictability for a class.
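A direct transcription of Equation 4.3 (Python, illustrative; the word probabilities are invented, and the small epsilon, which a practical implementation needs to avoid division by zero, is an addition of this sketch):

    def odds_ratio(p_w_given_c, p_w_given_not_c, eps=1e-6):
        # OR(w, c_i) as in Equation 4.3; eps keeps the denominator away from zero
        num = p_w_given_c * (1 - p_w_given_not_c)
        den = (1 - p_w_given_c) * p_w_given_not_c
        return num / max(den, eps)

    # a word appearing in 40% of spam but only 2% of ham messages is highly predictive
    print(odds_ratio(0.40, 0.02))   # roughly 32.7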

4.1 Combination of Feature Selection Approaches

Information Gain and Odds Ratio differ slightly in which features they select. Both selection criteria are justified: OR selects rather frequent features, whereas IG also takes less frequent terms into account. The combination of both techniques therefore seemed promising. In a set of experiments to test this, messages were represented by the terms having the highest values for IG plus the terms having the highest values for OR.

4.1.1 Feature Selection Experiments

McNemar's test [26] can be applied to compare two different classification algorithms, or algorithms trained on different data, A and B. Each classifier is trained on a training set, and the results of the test run are used to construct a contingency table, for which the following notation is used:

• n00: number of examples misclassified by both algorithms.

• n01: number of examples misclassified by A but not by B.

• n10: number of examples misclassified by B but not by A.

• n11: number of examples misclassified by neither algorithm.

The test statistic is given in Equation 4.4:

$\frac{(|n_{01} - n_{10}| - 1)^2}{n_{01} + n_{10}}$   (4.4)


Figure 4.1: Classification results for different feature combinations calculated by Odds Ratio and Information Gain.

The null hypothesis in this context is that both algorithms/classifiers have the same error rate. Given a level of significance α = .05, it can be rejected if the result of the statistic is higher than $\chi^2_{1,.95} = 3.8415$.
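A small helper for this test (Python, illustrative; the counts in the example are invented):

    def mcnemar(n01, n10):
        # statistic from Equation 4.4 plus the decision at significance level .05
        statistic = (abs(n01 - n10) - 1) ** 2 / (n01 + n10)
        return statistic, statistic > 3.8415   # chi-squared critical value, 1 df, 95%

    # n01: misclassified by A only, n10: misclassified by B only
    print(mcnemar(n01=12, n10=25))   # (about 3.89, True): reject equal error rates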

Figure 4.1 shows classification results obtained from a combination of feature selection techniques. Setup 1 was conducted using Information Gain, selecting the best 612 features; setup 2 is based on the best 619 features for Odds Ratio; setup 3 combines the best 300 terms taken from IG with the best 300 terms taken from OR. Information Gain performs slightly better than Odds Ratio, but the difference between IG and the combination was not significant: the result of the test was 0.0529 and the null hypothesis could not be rejected.

4.1.2 Document Frequency Thresholding

Document frequency thresholding is a feasible feature selection method for unsupervised data too. The basic assumption here is that very frequent terms are less discriminative for distinguishing between classes (a term occurring in every single spam and ham message would not contribute to differentiating between them). The largest number of tokens, however, occurs only in a very small number of documents. The biggest advantage of document frequency thresholding is that there is no need for class information; it is therefore mainly used for clustering applications, where the data only consist of one class or no class information is available at all. Besides, document frequency thresholding is far less expensive in terms of computational power. In this context the technique is used for dimensionality reduction for clustering and to compare the classification results with those obtained by the more sophisticated approaches. Document frequency thresholding proceeds as follows:

• At first the upper threshold is fixed to .5, hence all terms that occur in more than half of the documents are omitted.

• The lower boundary is dynamically changed so as to achieve the desired number of features.

It is therefore possible to reach the requested dimensionality, usually about 700 features.
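The two-step procedure above can be sketched as follows (Python, illustrative; the 0.5 upper bound follows the text, while the toy documents, the target dimensionality and the exact way the lower bound is raised are assumptions of this sketch):

    from collections import Counter

    def df_threshold(tokenised_docs, target_dim, upper=0.5):
        # step 1: drop terms occurring in more than half of all documents;
        # step 2: raise the lower bound by keeping only the target_dim most
        # frequent of the remaining terms
        n_docs = len(tokenised_docs)
        df = Counter(t for doc in tokenised_docs for t in set(doc))
        candidates = {t: n for t, n in df.items() if n / n_docs <= upper}
        ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
        return [t for t, _ in ranked[:target_dim]]

    docs = [["cheap", "pills", "now"], ["meeting", "now"], ["pills", "offer", "now"],
            ["project", "meeting", "now"], ["cheap", "offer"], ["report", "project"]]
    print(df_threshold(docs, target_dim=4))   # 'now' is dropped: it occurs in 4 of 6 documents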



Chapter 5

Ensembles of Classifiers for Spam Classification

5.1 Experimental Setup

Table 5.1 lists the corpora that were used for the initial experiments. The lingspam corpus 1 is publicly available, whereas the other ones were collected by myself in 2004 and 2005 and by the Department of Distributed Systems at the University of Vienna in August 2004 2, respectively. The Lingspam corpus is the only one that is entirely English; the other ones contain both German and English messages.

1 http://www.aueb.gr/users/ion/publications.html

2 http://spam.ani.univie.ac.at

Table 5.1: Corpora used for the initial experiments.

Corpus           Class   Size
lingspamham      ham     2412
lingspamspam     spam     481
own2005ham       ham     2151
own2005spam      spam    2274
august2004ham    ham     1494
august2004spam   spam    1498


Table 5.2: Results obtained by a 3-NN classifier based on ten-fold cross-validation (dataset of size 10310).

Dim.   Corr. Ham   Corr. Spam   Corr.
1000   0.9741      0.9807       0.9767
  57   0.9098      0.9697       0.9450

5.2 Preprocessing

Those datasets were indexed, and all attachments as well as German and English stop words taken from ranks.nl 3 were removed. After that, tokenisation was done via the Apache Lucene 4 standard tokenizer. Porter stemming may be an option here in the future. This resulted in 313224 distinct features for those documents. To reduce the dimensionality of the resulting vectors, I used Information Gain to get a weighting for each term [25], computed over all six classes (as opposed to two types, namely ham or spam).

Using the Information Gain weighting I reduced the dimensionality to 1000 and 57 respectively, and will present experimental results for both. For the upcoming experiments, I used the term frequency vectors for those features.

3 http://www.ranks.nl/stopwords

4 http://lucene.apache.org
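A rough sketch of the tokenisation step (Python, illustrative only; the real pipeline used the Apache Lucene standard tokenizer, a much larger stop word list from ranks.nl, and also stripped attachments, none of which is reproduced here):

    import re

    STOP_WORDS = {"the", "and", "is", "a", "der", "die", "das", "und", "um"}  # tiny sample list

    def tokenise(message):
        # lower-case, split on non-letter characters, drop English/German stop words
        tokens = re.findall(r"[a-zäöüß]+", message.lower())
        return [t for t in tokens if t not in STOP_WORDS]

    messages = ["Buy the cheap pills now!", "Das Meeting findet um 10 Uhr statt."]
    vocabulary = sorted({t for m in messages for t in tokenise(m)})
    print(vocabulary)   # the distinct features across all messages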

5.3 Classification Experiments

I present two scenarios: one using a single k-NN classifier and ten-fold cross-validation, and one using ensembles of k-NN classifiers.

5.3.1 k-NN

All data is tagged according to its class (spam or ham) only. Further, it is evaluated by ten-fold cross-validation based on those classes. The overall size of the data is 10310 instances.

Table 5.2 summarises the results for both dimensionalities, showing that the classification rates are very high even when using very low-dimensional data. To find out whether the sheer size of the set is causing these good results, I used only a subset of that data by randomly removing all instances except 3000. The results obtained for the smaller dataset are detailed in Table 5.3. Those figures show that the negative impact on accuracy affects ham messages only, and only for the low-dimensional set. The trade-off between classification accuracy and the size of the training set is shown clearly.


Table 5.3: Results obtained by a 3-NN classifier based on ten-fold cross-validation (reduced dataset of size 3000).

Dim.   Corr. Ham   Corr. Spam   Corr.
1000   0.9715      0.9091       0.9617
  57   0.3876      0.9187       0.8333

5.3.2 Decision Function

The scenario of several classifiers needs another, meta-level classifier that evaluates the results of the single classifiers. The two basic methods used to evaluate the results of an ensemble of classifiers are described below.

Majority Count Ensemble An instance is tagged with the class that is chosen by the majority of the classifiers. If a message is classified by nine classifiers 5 and it is classified as spam by five, spam is the result of the ensemble classification.

5 Three sets × two classes

Winner(s) Take All Ensemble The "winner take all" concept only uses the results from the classifier that is most confident of its choice. For k-NN classifiers this means that the classifier that found the most matching neighbours is the winner and is solely responsible for a given message. The ensemble of 3-NN classifiers used for the experiments often shows several winners (i.e. classifiers that find either three or zero neighbours for a given class); therefore, majority count is used again. For example, if four classifiers find three spam neighbours and five classifiers find zero spam neighbours, a message will be classified as ham.
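Both aggregation strategies can be sketched as follows (Python, illustrative). Each classifier's output is represented only by the number of spam neighbours it found among its k = 3 nearest neighbours, which also determines its confidence; the counts in the example are invented but mirror the situation described above.

    def majority_count(spam_neighbour_counts, k=3):
        # a message is spam if the majority of classifiers lean towards spam
        spam_votes = sum(1 for n in spam_neighbour_counts if n > k / 2)
        return "spam" if spam_votes > len(spam_neighbour_counts) / 2 else "ham"

    def winners_take_all(spam_neighbour_counts, k=3):
        # only the most confident classifiers decide; if several are equally
        # confident, fall back to a majority count among those winners
        confidence = [max(n, k - n) for n in spam_neighbour_counts]
        winners = [n for n, c in zip(spam_neighbour_counts, confidence)
                   if c == max(confidence)]
        return majority_count(winners, k)

    # nine classifiers: four find three spam neighbours, five find zero spam neighbours
    counts = [3, 3, 3, 3, 0, 0, 0, 0, 0]
    print(majority_count(counts))     # ham: only four of the nine classifiers say spam
    print(winners_take_all(counts))   # ham: the five fully confident classifiers say ham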

5.3.3 Ensemble Classification

In contrast to using only a single classifier, multiple classifiers can be used to determine whether a message is spam or ham. For this purpose the dataset was split by the initial corpora assignments, consisting of six corpora, each being assigned to either spam or ham. As shown in Algorithm 1, each of those six corpora is split into a training and a test set. A classifier is now trained for every plausible combination of corpora, i.e. only spam/ham combinations, and evaluated against the test set of its own corpus. The evaluation results are then rounded and set to the majority count. In the case of six corpora, each corpus is used for training three k-NN classifiers (the number of corpora not being of the same class). E.g., the first ham corpus is used together with all spam corpora, three in total. The test set of that ham corpus is then evaluated by all three classifiers. Those results are summarised into one by majority count; whatever most of the classifiers assign to the message is compared to its actual class, yielding very stable and good results.

Table 5.4: Ensemble classification results obtained by a 3-NN classifier (full dataset of size 10310).

Dim.   Corr. Ham   Corr. Spam   Corr.
1000   0.9903      0.9925       0.9914
  57   0.9882      0.9907       0.9895

Algorithm 1 Classification Using Ensembles of k-NN Classifiers
for each corpus do
    split into corpus.training and corpus.test
    for each corpus_comp ≠ corpus and corpus_comp.class ≠ corpus.class do
        knn = knn(corpus.data, corpus_comp.data, corpus.target, corpus_comp.target)
        result = knnforward(corpus.target)
    end for
    compute average accuracy
    compute voting accuracy
end for

Table 5.4 shows the accuracy values for both classes.

5.3.4 Lessons Learned from the First Shot

The results obtained in the last section are misleading and strongly influenced by the choice of training and building classifiers with data taken from the same set. To overcome that weakness, a new algorithm for splitting the sets is introduced in Algorithm 2. The new algorithm splits the data into training and test sets first and then sorts all training data by corpus. Those corpus data are used to train the classifiers (nine-fold). The test data now consist of data across all corpora and yield much worse but more reliable results. As outlined in Table 5.5, accuracy for ham outperforms accuracy for spam by far. I investigated those differences by manually testing the accuracies for the individual corpora. A classifier was trained for corpora one and two and tested against corpora three and four, and five and six, respectively. The results are displayed in Table 5.6. Those experiments were conducted for the expanded dataset as well, but did not lead to significantly different results, although the ham detection rates for classifiers built on the lingspam data were higher. The lingspam dataset seems to have a negative effect on classifiers, as the results for all training sets that contained those data are much lower.



Algorithm 2 Classification Using Ensembles of k-NN Classifiers
split into n folds
for each fold n do
    split sets into sets by corpus
    for each corpus do
        for each corpus_comp ≠ corpus and corpus_comp.class ≠ corpus.class do
            knn = knn(corpus.data, corpus_comp.data, corpus.target, corpus_comp.target)
            result = knnforward(fold.training)
        end for
    end for
    compute average accuracy
    compute voting accuracy
    compute winners take all accuracy
end for
compute average accuracies over all folds
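A rough sketch of this evaluation driver is given below: the fold is built first, its training part is grouped by corpus, one member classifier is trained per spam/ham corpus pair, and the members' predictions on the mixed test fold are combined by voting. The types (Corpus, Fold, PairClassifier) are assumptions introduced for illustration, not the thesis' Java or Matlab code.

```java
// Sketch of Algorithm 2's driver under assumed, illustrative types.
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class EnsembleDriver {

    /** One source corpus with its class label and term vectors (illustrative). */
    record Corpus(String name, boolean isSpam, double[][] vectors) {}

    /** Training part grouped by corpus plus a mixed test set across all corpora. */
    record Fold(List<Corpus> trainingByCorpus, double[][] testVectors, boolean[] testIsSpam) {}

    /** A trained spam/ham pair classifier, e.g. a 3-NN model. */
    interface PairClassifier { boolean isSpam(double[] vector); }

    /** Train one member per spam/ham corpus pair and report the voting accuracy. */
    static double votingAccuracy(Fold fold, BiFunction<Corpus, Corpus, PairClassifier> train) {
        List<PairClassifier> members = new ArrayList<>();
        for (Corpus spam : fold.trainingByCorpus())
            for (Corpus ham : fold.trainingByCorpus())
                if (spam.isSpam() && !ham.isSpam())     // only spam/ham combinations
                    members.add(train.apply(spam, ham));

        int correct = 0;
        for (int i = 0; i < fold.testVectors().length; i++) {
            int spamVotes = 0;
            for (PairClassifier m : members)
                if (m.isSpam(fold.testVectors()[i])) spamVotes++;
            boolean ensembleSaysSpam = spamVotes > members.size() / 2.0;
            if (ensembleSaysSpam == fold.testIsSpam()[i]) correct++;
        }
        return (double) correct / fold.testVectors().length;
    }
}
```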

Table 5.5: Ensemble classification results obtained by a 3-NN classifier (extended dataset of 10310).

Result Type    Corr. Ham   Corr. Spam   Corr
Avg.           0.7479      0.4616       0.5891
Avg. voting    0.9802      0.3317       0.6205
Avg. winning   0.9348      0.5218       0.7056

Table 5.6: Classification results across different corpora (corr. ham, corr. spam, corr).

Train / Test   lingspam           own2005            august2004
lingspam       X                  0.90 0.06 0.49     0.96 0.02 0.48
own2005        0.93 0.08 0.79     X                  0.94 0.38 0.66
august2004     0.91 0.12 0.78     0.95 0.43 0.68     X



Table 5.7: Ensemble classification results obtained by a 3-NN classifier, size of data set 7417 (without lingspam). The first three lines are the results for the information gain features; the others are taken from the same corpora, but document frequency thresholding was used for feature selection.

Result Type           Corr. Ham   Corr. Spam   Corr
Avg.                  0.9282      0.6462       0.7868
Avg. winning (vot)    0.9529      0.9854       0.9693
Avg. winning (dist)   0.9222      0.9905       0.9571

Avg.                  0.9446      0.7077       0.8246
Avg. voting           0.9673      0.9801       0.9738
Avg. winning (dist)   0.9761      0.9782       0.9771

5.4 New Winner Take All Strategy

The winner take all strategy was changed so that the classifier with the lowest distances over the nearest neighbours is chosen. This identifies the most certain classifier, which is very likely to be the one that has been trained with data from the same corpus as the testing data. Furthermore, it was adapted to only take the most confident classifier into account, defined by the probability of neighbours in the same class.

The ensemble has a much stronger effect on the noisy results obtained using lingspam as training corpus.
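A minimal sketch of this distance-based rule follows, assuming each member reports its decision together with the summed distance to its k nearest neighbours; the type names are illustrative and not part of the thesis code.

```java
// Distance-based winner take all: the member whose k nearest neighbours lie
// closest to the query message wins and decides the class on its own.
import java.util.Comparator;
import java.util.List;

public class DistanceWinnerTakeAll {

    /** Result reported by one 3-NN member for a single message (illustrative). */
    record MemberResult(boolean predictsSpam, double summedNeighbourDistance) {}

    static boolean classify(List<MemberResult> results) {
        // the member with the smallest summed distance is assumed to be the one
        // trained on data from the same corpus as the test message
        return results.stream()
                .min(Comparator.comparingDouble(MemberResult::summedNeighbourDistance))
                .orElseThrow()
                .predictsSpam();
    }
}
```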

5.4.1 Aggregation Technique Revisited

As shown in Table 5.7, both aggregation functions have positive effects on classification accuracy, whereas the low rates for spam stem from the poor performance of the individual classifiers. There are two sets each of spam and ham left. To illustrate the effects of different corpora, I trained a classifier on 90 per cent of the data of corpora one and two, as well as on corpora three and four; the results were obtained on the remaining ten per cent of that data. Table 5.8 points out that accuracy on test data from the same corpus is high, whereas the accuracy obtained by classifiers trained on one corpus and evaluated on another drops for spam. This could be explained by differences across spam corpora, i.e. spam corpora differ more than ham corpora do.



Table 5.8: Evaluation of 10-fold cross-validation runs for individual corpora and results for cross-corpora classifiers (single runs only, as there is no volatility between runs) that are trained on one corpus and evaluated on the other one.

Training corpora        Corr. Ham   Corr. Spam   Corr
lingspam                0.9918      0.8131       0.9634
lingspam (extended)     0.9925      0.8066       0.9613
own                     0.9940      0.9805       0.9869
august                  0.9900      0.9764       0.9833

Training / Test corpora   Corr. Ham   Corr. Spam   Corr
lingspam / own            0.9265      0.1139       0.5089
lingspam / august         0.7885      0.2877       0.5378
own / lingspam            0.9349      0.0852       0.5579
own / august              0.9498      0.3812       0.6651
august / lingspam         0.9117      0.1227       0.5616
august / own              0.9512      0.4371       0.6870



Chapter 6

Clustering of the TREC Spam Corpus

To come closer to a real-world scenario, I decided to redo the experiments for another corpus, determining the subclusters using the Self-Organizing Map. The TREC spam corpus was used for the 2005 TREC experiments, so comparable values exist. It is a corpus of about 92,000 messages: 42,000 ham and 50,000 spam. Its sheer size would make experiments very difficult, so I decided to use a subset of 25 per cent of each class.

The remaining messages, 13,179 spam and 9,751 ham, were initially clustered on a 10 × 1 and a 3 × 1 Self-Organizing Map. Figure 6.2 gives an overview of the respective clusterings. The next step was to subsample that corpus according to the data points mapped best to each unit, i.e. taking the best n mapped data points from each unit and assuming those to be the clusters for the ensembles. Those clusters did not improve the classification results; therefore cluster validation techniques were used to find an optimal number of units to fit the data.

6.1 Cluster Validation Techniques: Silhouette Value

Measuring the quality of clusterings produced by different algorithms or by different parameter settings is a vital issue in clustering. It is mostly used to find the right setting for the number of clusters. The following experiments assume that the number of units of the Self-Organizing Map is the number of clusters and therefore is the value to be optimized.

S(i) = (b(i) − a(i)) / max{a(i), b(i)}    (6.1)

where i is an index over all data vectors, a(i) is the average distance of i to all other vectors of its cluster, and b(i) is the average distance of i to all data vectors in the closest cluster. The value lies between −1 and 1:

−1 ≤ S(i) ≤ 1    (6.2)

S(i) therefore is the silhouette value for data vector i; the overall silhouette value for a clustering is the average over all single silhouette values,

∑_{i=1}^{n} S(i) / n    (6.3)

where n is the number of instances. Analogously, the silhouette for a single cluster c can be defined as

∑_{j=1}^{m} S(j) / m    (6.4)

where j runs over the m instances assigned to cluster c.

To get started, I subsampled 500 messages from the TREC ham messages and did the cluster validation for different numbers of units. From Table 6.1 it can be inferred that the clustering quality peaks at about 8 units.

Table 6.1: Cluster validation experimental results for a subset of 500 ham messages of the TREC corpus.

Number of units   Silhouette value
2                 .6251866819394923
3                 .1718971524408012
4                 .2465533744740692
7                 .3051800716163628
8                 .4343567080307692
9                 .3779828692816084
10                .1307392552296967
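The following sketch computes the silhouette values of Equations 6.1 to 6.4 for a given cluster assignment, assuming Euclidean distances. It is illustrative code rather than the Somtoolbox plugin written for this thesis.

```java
// Minimal silhouette computation over a flat array of cluster assignments.
public class Silhouette {

    static double dist(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) s += (x[d] - y[d]) * (x[d] - y[d]);
        return Math.sqrt(s);
    }

    /** Average distance from vectors[i] to all vectors assigned to cluster c. */
    static double meanDistToCluster(double[][] vectors, int[] cluster, int i, int c) {
        double sum = 0; int count = 0;
        for (int j = 0; j < vectors.length; j++) {
            if (j == i || cluster[j] != c) continue;
            sum += dist(vectors[i], vectors[j]);
            count++;
        }
        return count == 0 ? 0 : sum / count;
    }

    /** Overall silhouette value: the average of S(i) over all data vectors. */
    static double overall(double[][] vectors, int[] cluster, int numClusters) {
        double total = 0;
        for (int i = 0; i < vectors.length; i++) {
            double a = meanDistToCluster(vectors, cluster, i, cluster[i]);
            double b = Double.MAX_VALUE;          // distance to the closest other cluster
            for (int c = 0; c < numClusters; c++)
                if (c != cluster[i])
                    b = Math.min(b, meanDistToCluster(vectors, cluster, i, c));
            total += (b - a) / Math.max(a, b);    // Equation 6.1
        }
        return total / vectors.length;            // Equation 6.3
    }
}
```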

6.2 Cluster Validation of the Self-Organizing Map

Due to performance issues (the silhouette validation compares every data vector to all other vectors assigned to its unit and to all vectors in the closest unit), I introduced modifications to better fit the Self-Organizing Map scenario.

Let each comparison be based on the units' weight vectors rather than the actual data vectors; a(i) is then defined as

a(i) = dist(w(i), i)    (6.5)

and b(i) is defined as

b(i) = dist(i, wc(i))    (6.6)

where w(i) denotes the weight vector of the unit data point i is assigned to and wc(i) denotes the weight vector of the closest unit. The overall silhouette computation is then based on those values for a(i) and b(i). The experimental evaluation from now on is done using this technique, because it needs significantly less computational power. Hence, the quality of different clusterings can be compared by their SOM Silhouette values. Furthermore, the results can be used to visualize the correctness of the clustering. Figure 6.1 shows the color coding for units in a 4 × 1 Self-Organizing Map. Blue colors denote low values, hence low clustering quality, medium values are represented by yellow, whereas high values are shown in green.

Figure 6.1: Color coding of the Self-Organizing Map according to SOM Silhouette values showing the clustering quality for each unit.
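A corresponding sketch of the modified SOM Silhouette of Equations 6.5 and 6.6 is shown below, again assuming Euclidean distances and illustrative names rather than the actual Somtoolbox plugin code.

```java
// Weight-vector based silhouette: a(i) is the distance from data point i to the
// weight vector of its best-matching unit, b(i) the distance to the weight
// vector of the second-closest unit.
public class SomSilhouette {

    static double dist(double[] x, double[] y) {
        double s = 0;
        for (int d = 0; d < x.length; d++) s += (x[d] - y[d]) * (x[d] - y[d]);
        return Math.sqrt(s);
    }

    /** Overall SOM Silhouette value for data mapped onto a map with the given unit weights. */
    static double overall(double[][] data, double[][] unitWeights) {
        double total = 0;
        for (double[] x : data) {
            double a = Double.MAX_VALUE, b = Double.MAX_VALUE;
            for (double[] w : unitWeights) {
                double d = dist(x, w);
                if (d < a) { b = a; a = d; }      // a: best-matching unit, b: second closest
                else if (d < b) { b = d; }
            }
            total += (b - a) / Math.max(a, b);
        }
        return total / data.length;
    }
}
```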

6.3 Finally: Subsampling of the TREC Corpus

Table 6.2 lists the overall SOM Silhouette values for different numbers of units and the 25 per cent ham subset of the TREC corpus. The optimal number of units was chosen as 9 (3 × 3) and will be used as the number of units for ham clustering from now on. Four subsamples were then chosen from each corpus (ham and spam). To better resemble the nature of the Self-Organizing Map clustering, those subsamples overlap by about 10 per cent each. This resulted in eight clusters of about 1,000 to 2,000 messages each. All upcoming experiments use those clusters for training and are validated on a large holdout test set of about 10,000 messages.

Table 6.3 shows comparable values for the spam data.

Figure 6.2 shows the clustering of the TREC corpus on a 3 × 3 Self-Organizing Map, which will consequently be used.
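The thesis does not spell out the exact construction of the overlapping subsamples, so the following sketch shows only one plausible reading: neighbouring SOM units are grouped into subsamples and roughly 10 per cent of each subsample is shared with the next one. All names are illustrative.

```java
// One possible way to draw overlapping subsamples from per-unit assignments.
import java.util.ArrayList;
import java.util.List;

public class OverlappingSubsamples {

    /** messagesPerUnit[u] holds the ids of all messages mapped to SOM unit u. */
    static List<List<Integer>> build(List<List<Integer>> messagesPerUnit,
                                     int numSubsamples, double overlap) {
        List<List<Integer>> subsamples = new ArrayList<>();
        int unitsPerSample = (int) Math.ceil(messagesPerUnit.size() / (double) numSubsamples);
        for (int s = 0; s < numSubsamples; s++) {
            List<Integer> sample = new ArrayList<>();
            int from = s * unitsPerSample;
            int to = Math.min((s + 1) * unitsPerSample, messagesPerUnit.size());
            for (int u = from; u < to; u++)
                sample.addAll(messagesPerUnit.get(u));
            // share a fraction of the previous subsample to obtain the overlap
            if (s > 0) {
                List<Integer> prev = subsamples.get(s - 1);
                sample.addAll(prev.subList(0, (int) (prev.size() * overlap)));
            }
            subsamples.add(sample);
        }
        return subsamples;
    }
}
```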



Table 6.2: Cluster validation experimental results for the 25 per cent ham messages of the TREC corpus using SOM Silhouette Validation.

Number of units   SOM Silhouette value
3                 .19816503285425702
4                 .15606009446852034
5                 .10004831518442062
6                 .05601588449922962
7                 .07248564371459906
9 (3 × 3)         .10156611162236466
9 (9 × 1)         .04731680857423294
12 (4 × 3)        .00125944139418076
12 (6 × 2)        .05679888210695271
100 (10 × 10)     .01840285484455534

Table 6.3: Cluster validation experimental results for the 25 per cent spam messages of the TREC corpus using SOM Silhouette Validation.

Number of units   SOM Silhouette value
2                 .09956252884023553
3                 .04810049914144519
4                 .06804273304015597
5                 .04054336515792361
6                 .05319534384274752
7                 .02452131491388584
9 (3 × 3)         .11044377249864945
9 (9 × 1)         .02227273382246468
12 (4 × 3)        .06919165522839628
12 (6 × 2)        .07843263638337034

Table 6.4: Results for different machine learning approaches and feature selection methods.

Feature selection     k-NN                     k-NN ensemble
Infogain              0.9307/0.9830/0.9607     0.8351/0.9498/0.9010
Thresholding          0.9775/0.8107/0.9066
Corpus Thres.         0.9822/0.9077/0.9505     0.9071/0.9861/0.9466
Ind. Infogain         X                        0.9715/0.9085/0.9353
Ind. Thresholding     X                        0.6857/0.9903/0.8608
Ind. Corpus Thresh.   X



Figure 6.2: Clustering of the TREC ham corpus on a 3 × 3 Self-Organizing Map.



Chapter 7

Conclusions and Future Work

Results for both corpora show that exploiting natural clusters in ham and spam corpora is promising in general. I, however, failed to prove that ensembles of k-NN classifiers outperform a single classifier that is trained on all the data vectors. I could show that the ensemble results are always better than the average result obtained from classifiers within the ensemble. Experimental results also showed that the decrease in performance from lower dimensionality is stronger for the monolithic classifier.

Future work may include classifiers based on individual feature selection but the full training set, since the classification accuracy of the single classifier trained on the full set could never be outperformed by any of the ensemble solutions.

Clustering quality may not be at an optimum. The Self-Organizing Map clustering is best suited for exploratory data analysis; the application of different clustering algorithms is more likely to produce better results. In particular, the cluster validation technique fails because the assumption that clusters and units are equivalent does not hold. Therefore cluster-based algorithms seem more promising.

Table 7.1: Final results.

Result Type           IG      Ind. IG
Manually selected clusters
Monolithic k-NN       0.988   X
k-NN ensemble         X       0.975
Cluster similarity    0.974   X
Self-Organizing Map clusters of TREC data set
Monolithic k-NN       0.960   X
k-NN ensemble         X       0.946
Cluster similarity    0.970   X

The ensemble results can be considered very stable; there is almost no volatility between test runs. To evaluate the pure k-NN case, cross-validation is needed, because the datasets are shuffled before each run and lead to different results depending on the order of the training set. The most interesting result is that combinations of corpora that initially do not belong to each other (e.g. lingspam ham and non-lingspam spam) can often lead to better results than the ones that initially belong to each other (lingspam ham and lingspam spam). The lingspam corpus seems to be the one yielding the lowest classification accuracies. This may be caused by the uneven distribution between ham and spam messages of almost .75 and .25, as well as the nature of the ham messages themselves, because most of them seem to be retrieved via mailing lists. On the other hand, the lingspam corpus shows how dissimilar corpora can improve the ensemble results.

Since the more different the corpora are, the more the results benefit from ensemble classification, there is probably much improvement to be found in improving the clustering quality, which is definitely worthwhile.



Chapter 8

Acknowledgements

All clustering code used in this project was taken from the Somtoolbox application, part of the data mining working group at the Department of Software Technology and Interactive Systems, Vienna University of Technology (1). A plugin for the Silhouette cluster validation parts was implemented for this thesis. The text information retrieval system shortly described in Chapter 3 was partly implemented for this project too. Its further preprocessing and indexing parts are beyond the scope of this work. Machine learning code was implemented in Java and Matlab, covering the ensemble classification, aggregation functions, and cluster-based classification parts of the project. I also used Netlab (2), as well as the Som toolbox (3), and the Colt library for Java for matrix calculations (4).

1 http://www.ifs.tuwien.ac.at/dm
2 http://www.ncrg.aston.ac.uk/netlab
3 http://www.cis.hut.fi/projects/somtoolbox/
4 http://hoschek.home.cern.ch/hoschek/colt/



List of Tables

3.1 Cluster validation experimental results for a subset of 500 ham messages of the TREC corpus . . . 9
5.1 Feature sets and classification methods used for experiments . . . 16
5.2 Results obtained by a 3-NN classifier based on ten-fold cross-validation (dataset of size 10310) . . . 17
5.3 Results obtained by a 3-NN classifier based on ten-fold cross-validation (reduced dataset of size 3000) . . . 18
5.4 Ensemble classification results obtained by a 3-NN classifier (full dataset of size 10310) . . . 19
5.5 Ensemble classification results obtained by a 3-NN classifier (extended dataset of 10310) . . . 20
5.6 Classification results across different corpora (corr. ham, corr. spam, corr) . . . 20
5.7 Ensemble classification results obtained by a 3-NN classifier, size of data set 7417 (without lingspam); information gain vs. document frequency thresholding features . . . 21
5.8 Evaluation of 10-fold cross-validation runs for individual corpora and results for cross-corpora classifiers (single runs only, as there is no volatility between runs) . . . 22
6.1 Cluster validation experimental results for a subset of 500 ham messages of the TREC corpus . . . 24
6.2 Cluster validation experimental results for the 25 per cent ham messages of the TREC corpus using SOM Silhouette Validation . . . 26
6.3 Cluster validation experimental results for the 25 per cent spam messages of the TREC corpus using SOM Silhouette Validation . . . 26
6.4 Results for different machine learning approaches and feature selection methods . . . 26
7.1 Final results . . . 28



List of Figures

2.1 The Self-Organizing Map training algorithm basically maps data points from a high-dimensional input space to a low-dimensional output space . . . 6
3.1 Database back-end for document indexing . . . 11
4.1 Classification results for different feature combinations calculated by Odds Ratio and Information Gain . . . 14
6.1 Color coding of the Self-Organizing Map according to SOM Silhouette values showing the clustering quality for each unit . . . 25
6.2 Clustering of the TREC ham corpus on a 3 × 3 Self-Organizing Map . . . 27



