12.06.2015 Views

The Annoyance Filter.pdf - Fourmilab

3. Getting started.

The Annoyance Filter is organised as a toolbox which can be used to explore content-based mail filtering. It includes diagnostic tools and output which will eventually be little used once the program is tuned and put into production.

The program is normally run in two phases. In the training phase, collections of legitimate and junk mail stored in UNIX mail folders are read and used to build a dictionary in which the probability of a word’s identifying a message as junk is computed. This dictionary is then exported to be used in subsequent runs to classify incoming messages based on the word probabilities determined from prior messages.

3.1. Building

If you have a more or less standard present-day UNIX system, you should be able to build and install the program with the commands:

    ./configure
    make
    make check
    make install

3.2. Training

Now you must train the program to discriminate legitimate mail from junk by showing it collections of such mail you’ve hand-sorted into a pile of stuff you want to receive and another pile you don’t. Assuming you have mail folders containing collections of legitimate mail and junk named “m-good” and “m-junk” respectively, you can perform the training phase and create a binary dictionary file named “dict.bin” and a fast dictionary “fdict.bin” for classifying messages with the command:

    annoyance-filter --mail m-good --junk m-junk --prune \
        --write dict.bin --fwrite fdict.bin

The arguments to the --mail and --junk options can be either UNIX “mail folders” consisting of one or more E-mail messages concatenated into a single file, or the name of a directory containing messages in individual files. In either case, the files may be compressed with gzip; annoyance-filter will automatically expand them. You can supply as many --mail and --junk options as you like on a command line; the contents are added cumulatively to the dictionary.
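As a rough illustration of combining several collections in one training run, the sketch below assembles a command line with one --mail option per legitimate collection and one --junk option per junk collection. The helper name train_command and the collection names sent-mail.gz and junk-dir are hypothetical, collection names are assumed to contain no whitespace, and the command is only printed so it can be inspected before being run by hand.

```shell
#!/bin/sh
# Sketch only: print (do not run) a training command with one --mail option
# per legitimate collection and one --junk option per junk collection.
# train_command is a hypothetical helper, not part of annoyance-filter.
# Assumes collection names contain no whitespace.
train_command() {
    cmd="annoyance-filter"
    for f in $1; do            # $1: space-separated legitimate collections
        cmd="$cmd --mail $f"
    done
    for f in $2; do            # $2: space-separated junk collections
        cmd="$cmd --junk $f"
    done
    # Keep --prune and the write options after all collections, as in the
    # command shown in the text.
    printf '%s\n' "$cmd --prune --write dict.bin --fwrite fdict.bin"
}

# Example with hypothetical collection names.
train_command "m-good sent-mail.gz" "m-junk junk-dir"
```

The option order mirrors the single-folder command above: all collections first, then --prune, then the two write options.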

It is absolutely essential that the collections of legitimate and junk mail used to train annoyance-filter be completely clean: no junk in the --mail collection or vice versa. Pollution of either collection by messages belonging in the other is very likely to corrupt the calculation of probabilities, resulting in messages which belong in one category being assigned to the other. The utilities/splitmail.pl program can help in manually sorting mail into the required two piles, and I hope some day I will have the time to adequately document it.
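One cheap sanity check, offered only as a sketch, is to look for messages that appear byte-for-byte in both collections. It assumes each collection is a directory of uncompressed, one-message-per-file entries; the function name find_cross_duplicates is hypothetical. It will not catch junk that merely resembles legitimate mail, and the pairwise comparison is quadratic, so it suits small collections only.

```shell
#!/bin/sh
# Sketch only: report pairs of byte-identical files found in both a
# legitimate-mail directory ($1) and a junk directory ($2).
# find_cross_duplicates is a hypothetical helper, not part of annoyance-filter.
find_cross_duplicates() {
    for g in "$1"/*; do
        [ -f "$g" ] || continue
        for j in "$2"/*; do
            [ -f "$j" ] || continue
            # cmp -s is silent and succeeds only when the files are identical.
            if cmp -s "$g" "$j"; then
                printf '%s %s\n' "$g" "$j"
            fi
        done
    done
}
```

Calling `find_cross_duplicates good-dir junk-dir` (directory names are placeholders) prints one line per identical pair; no output means no exact duplicates were found, not that the collections are clean.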

You may find it worthwhile to add an archive of mail you’ve sent to the legitimate category with --mail. In many cases, the words you use in mail you send are an excellent predictor of how worthy an incoming message is of your attention. I’ve found this works well with my own archives, but I haven’t tested how effective it is for a broader spectrum of users.

When you compile the collections of junk and legitimate mail to train annoyance-filter, it’s important to include all the copies of similar or identical messages you’ve received in either category. annoyance-filter bases its classifications on the frequency of indicative words in the entire set of mail you receive. An obscure string embedded in a mail worm spewed onto the net may not be enough to filter it out if you train annoyance-filter with only one copy, but will certainly consign it to the junk heap if you train annoyance-filter with the twenty or thirty copies you receive a day.

3.3. Scoring

Dictionary in hand, you can now proceed to the scoring phase, where the dictionary is used, along with the list of words appearing in a message, to determine its overall probability of being junk. If you have a mail message in a file “mail.txt”, you can compute and display its junk probability with:

    annoyance-filter --fread fdict.bin --test mail.txt
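To score several messages in one go, the same command can be wrapped in a loop. The sketch below is illustrative only: the function name score_all and the ANNOYANCE_FILTER variable (used to point at a differently named or located binary) are hypothetical, and whatever the program prints for --test is passed through unchanged after each file name.

```shell
#!/bin/sh
# Sketch only: run the scoring command from the text once per message file.
# score_all and ANNOYANCE_FILTER are hypothetical names, not part of the
# program; ANNOYANCE_FILTER defaults to "annoyance-filter" on the PATH.
# Usage: score_all FAST_DICTIONARY MESSAGE_FILE...
score_all() {
    dict=$1
    shift
    for msg in "$@"; do
        # Label each result with the file it belongs to.
        printf '%s: ' "$msg"
        "${ANNOYANCE_FILTER:-annoyance-filter}" --fread "$dict" --test "$msg"
    done
}
```

For example, `score_all fdict.bin mail.txt other.txt` would score both files against the fast dictionary built during training (other.txt is a placeholder name).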
