The Annoyance Filter.pdf - Fourmilab
3. Getting started.<br />
<strong>The</strong> <strong>Annoyance</strong> <strong>Filter</strong> is organised as a toolbox which can be used to explore content-based mail filtering.<br />
It includes diagnostic tools and output which will eventually be little used once the program is tuned and<br />
put into production.<br />
<strong>The</strong> program is normally run in two phases. In the training phase, collections of legitimate and junk mail<br />
stored in UNIX mail folders are read and used to build a dictionary in which the probability of a word’s<br />
identifying a message as junk is computed. This dictionary is then exported to be used in subsequent runs<br />
to classify incoming messages based on the word probabilities determined from prior messages.<br />
3.1. Building<br />
If you have a more or less standard present-day UNIX system, you should be able to build and install the<br />
program with the commands:<br />
./configure<br />
make<br />
make check<br />
make install<br />
3.2. Training<br />
Now you must train the program to discriminate legitimate mail from junk by showing it collections of<br />
such mail you’ve hand sorted into a pile of stuff you want to receive and another which you don’t. Assuming<br />
you have mail folders containing collections of legitimate mail and junk named “m-good” and “m-junk”<br />
respectively, you can perform the training phase and create a binary dictionary file named “dict.bin” and<br />
a fast dictionary “fdict.bin” for classifying messages with the command:<br />
annoyance-filter --mail m-good --junk m-junk --prune \
--write dict.bin --fwrite fdict.bin<br />
<strong>The</strong> arguments to the --mail and --junk options can be either UNIX “mail folders” consisting of one<br />
or more E-mail messages concatenated into a single file, or the name of a directory containing messages in<br />
individual files. In either case, the files may be compressed with gzip; annoyance-filter will automatically<br />
expand them. You can supply as many --mail and --junk options as you like on a command line; the<br />
contents are added cumulatively to the dictionary.<br />
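As a sketch of cumulative training from several sources, the shell fragment below assembles one command line from two lists of collections.<br />
The names “sent-archive.gz” and “junk-dir” are hypothetical, chosen only to illustrate a compressed folder and a directory of messages.<br />
The fragment prints the command rather than running it, so the result can be inspected before use.<br />

```shell
#!/bin/sh
# Hypothetical collection names; substitute your own folders or directories.
good_folders="m-good sent-archive.gz"
junk_folders="m-junk junk-dir"

# Emit one --mail option per legitimate collection and one --junk per junk collection.
args=""
for f in $good_folders; do
    args="$args --mail $f"
done
for f in $junk_folders; do
    args="$args --junk $f"
done

# Dry run: display the assembled command instead of executing it.
cmd="annoyance-filter$args --prune --write dict.bin --fwrite fdict.bin"
echo "$cmd"
```

Once the printed command looks right, it can be pasted at the shell prompt to perform the training run.<br />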
It is absolutely essential that the collections of legitimate and junk mail used to train annoyance-filter<br />
be completely clean: no junk in the --mail collection or vice versa. Pollution of either collection by messages<br />
belonging in the other is very likely to corrupt the calculation of probabilities, resulting in messages which<br />
belong in one category being assigned to the other. <strong>The</strong> utilities/splitmail.pl program can help in<br />
manually sorting mail into the required two piles, and I hope some day I will have the time to adequately<br />
document it.<br />
You may find it worthwhile to add an archive of mail you’ve sent to the legitimate category with --mail.<br />
In many cases, the words you use in mail you send are an excellent predictor of how worthy an incoming<br />
message is of your attention. I’ve found this works well with my own archives, but I haven’t tested how<br />
effective it is for a broader spectrum of users.<br />
When you compile the collections of junk and legitimate mail to train annoyance-filter, it’s important to<br />
include all the copies of similar or identical messages you’ve received in either category. annoyance-filter<br />
bases its classifications on the frequency of indicative words in the entire set of mail you receive. An obscure<br />
string embedded in a mail worm spewed onto the net may not filter it out if you train annoyance-filter<br />
with only one copy, but will certainly consign it to the junk heap if you train annoyance-filter with the<br />
twenty or thirty you receive a day.<br />
3.3. Scoring<br />
Dictionary in hand, you can now proceed to the scoring phase, where the dictionary is used, along with<br />
the list of words appearing in a message, to determine its overall probability of being junk. If you have a<br />
mail message in a file “mail.txt”, you can compute and display its junk probability with:<br />
annoyance-filter --fread fdict.bin --test mail.txt
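To score several messages in turn, the same command can be wrapped in a shell loop.<br />
This is only a sketch: the function name “score_all”, the FILTER variable and the directory “inbox” are hypothetical, and each message is assumed to be in its own file.<br />
It uses just the --fread and --test options and makes no assumption about the format of the program's output.<br />

```shell
#!/bin/sh
# Program to run; override with FILTER=echo for a dry run (hypothetical convenience).
FILTER=${FILTER:-annoyance-filter}

# Score every regular file in the directory given as the first argument.
score_all() {
    dir=$1
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue    # skip subdirectories and unmatched globs
        echo "== $msg"
        $FILTER --fread fdict.bin --test "$msg"
    done
}

# Example: score_all inbox
```

Setting FILTER=echo before calling the function prints each command instead of running it, which is a cheap way to check the file list first.<br />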