The Annoyance Filter.pdf - Fourmilab
256. Development log.
2002 August 28
Created development tree and commenced implementation.
2002 September 1
Release 0.1 circulated for review.
2002 September 6
Added the ability to compute descriptive statistics of the dictionary built by parsing the --mail and
--junk folders, using the facilities of the statlib.w program. Statistics are written to standard output.
Added a --plot option to plot a histogram of words in a newly parsed dictionary (not a lookup
dictionary loaded with --read). Creating the plot requires the GNUPLOT and PBMPlus utilities to be
installed.
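
The log does not say which statistics are computed or how the histogram is binned. As a minimal sketch of the idea, assuming a dictionary that maps each word to a junk probability between 0 and 1 (the sample words, the chosen statistics, and the bin count are all hypothetical):

```python
import statistics

def describe(values):
    """Return basic descriptive statistics for a sequence of numbers."""
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "pstdev": statistics.pstdev(values),  # population standard deviation
    }

def histogram(values, bins=10, lo=0.0, hi=1.0):
    """Count how many values fall into each of `bins` equal-width bins."""
    counts = [0] * bins
    width = (hi - lo) / bins
    for value in values:
        # Clamp so that value == hi lands in the last bin.
        index = min(int((value - lo) / width), bins - 1)
        counts[index] += 1
    return counts

# Hypothetical dictionary: word -> junk probability.
dictionary = {"meeting": 0.02, "lunch": 0.05, "invoice": 0.45,
              "free": 0.85, "viagra": 0.99}
probabilities = list(dictionary.values())
stats = describe(probabilities)
counts = histogram(probabilities, bins=5)

for key, value in stats.items():
    print(f"{key}: {value}")
print("histogram:", counts)
```

A real plot would hand the bin counts to an external plotting tool; here they are just printed.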
2002 September 7
Well, after a huge amount of hunkering down and twiddling, parsing of MIME multi-part messages and
decoding of parts encoded in Base64 and Quoted-Printable encoding now seems to be working. This
drastically improves the quality of parsing, particularly for junk where these forms of encoding are used
as “stealth” to evade other content-based filters.
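
To show why decoding matters, here is a small sketch that uses Python's standard `email` package rather than the program's own parser. The sample message is made up; its two parts hide their text behind Base64 and Quoted-Printable, and a filter that looked only at the raw body would never see the words.

```python
import email

# Hypothetical two-part message; neither body is readable until decoded.
RAW_MESSAGE = (
    "MIME-Version: 1.0\n"
    'Content-Type: multipart/mixed; boundary="XYZ"\n'
    "\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=us-ascii\n"
    "Content-Transfer-Encoding: base64\n"
    "\n"
    "QnV5IG5vdyE=\n"
    "--XYZ\n"
    "Content-Type: text/plain; charset=iso-8859-1\n"
    "Content-Transfer-Encoding: quoted-printable\n"
    "\n"
    "caf=E9 gratuit\n"
    "--XYZ--\n"
)

def decoded_text_parts(raw_message):
    """Return the decoded text of every leaf part of a MIME message."""
    message = email.message_from_string(raw_message)
    texts = []
    for part in message.walk():
        if part.is_multipart():
            continue  # containers hold no text of their own
        # decode=True undoes the Content-Transfer-Encoding and yields bytes.
        payload = part.get_payload(decode=True)
        charset = part.get_content_charset() or "iso-8859-1"
        texts.append(payload.decode(charset, errors="replace").strip())
    return texts

parts = decoded_text_parts(RAW_MESSAGE)
for text in parts:
    print(text)
```

Only after this step do tokens such as "Buy" and "gratuit" become visible to a content-based classifier.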
2002 September 8
Added the ability to read mail folders compressed with gzip or other compressors detected by the
Autoconf script. This saves a lot of space when you’re keeping large training archives around. This will
work only on systems with suitable decompressors and the popen facility.
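
The program itself pipes the folder through an external decompressor with popen. The sketch below swaps in Python's built-in `gzip` module to illustrate the same transparent-read idea, and it assumes the compressed folder is recognised by its two leading magic bytes; the log does not say how the real detection is done. The file names are hypothetical.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream

def open_mail_folder(path):
    """Open a mail folder as text, decompressing it if it is gzip-compressed."""
    with open(path, "rb") as probe:
        magic = probe.read(2)
    # latin-1 maps every byte to a character, so no message can fail to decode.
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="latin-1")
    return open(path, "r", encoding="latin-1")

def demo():
    """Write the same folder plain and compressed, then read both back."""
    text = "From alice@example.com Sat Sep  7 12:00:00 2002\nSubject: hello\n\nHi.\n"
    with tempfile.TemporaryDirectory() as folder:
        plain_path = os.path.join(folder, "mail.mbox")
        gz_path = os.path.join(folder, "mail.mbox.gz")
        with open(plain_path, "w", encoding="latin-1", newline="") as out:
            out.write(text)
        with gzip.open(gz_path, "wt", encoding="latin-1", newline="") as out:
            out.write(text)
        with open_mail_folder(plain_path) as plain, open_mail_folder(gz_path) as packed:
            return plain.read(), packed.read()

plain_text, gz_text = demo()
print(plain_text == gz_text)
```

The caller gets the same text either way, so the training code never needs to know whether an archive was compressed.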
2002 September 9
Added the --pdiag option to write the parser diagnostics to a designated file. Previously this was
controlled by a gnarly #define.
Added a “X-Annoyance-Filter-Decoder” line to the --pdiag output to indicate the activation of
decoders (including the sink) for MIME parts in the message. These lines are not seen by the token
parser.
Fixed a bug in parsing of tokens including ISO accented characters... signed characters strike again.
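
The log does not describe the bug itself. As an illustration of the general hazard only: on platforms where C's plain char is signed, an ISO 8859-1 accented letter has its high bit set and reads back as a negative number, so range comparisons and table lookups misbehave unless the value is first converted to unsigned. The helper names below are hypothetical and simply mimic that arithmetic.

```python
def as_signed_char(byte_value):
    """Interpret an 8-bit value the way a signed C char would."""
    return byte_value - 256 if byte_value >= 128 else byte_value

def as_unsigned_char(value):
    """Recover the 0..255 code, as a cast to unsigned char would in C."""
    return value & 0xFF

E_ACUTE = "\u00e9".encode("iso-8859-1")[0]  # e with acute accent, code 0xE9

signed_view = as_signed_char(E_ACUTE)
unsigned_view = as_unsigned_char(signed_view)

# A naive "is this at least as large as 'a'?" test fails on the signed view.
naive_test = signed_view >= ord("a")
fixed_test = unsigned_view >= ord("a")

print(signed_view, unsigned_view, naive_test, fixed_test)
```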
2002 September 10
Added a --ptrace option to include the actual tokens parsed as indented, quoted lines following each
line of parser input in the --pdiag file.
Added code to classifyMessage which appends lines to the message header in the --pdiag file giving
the aggregate junk probability and the most significant words and their individual probabilities.
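
This entry does not give the formula behind the aggregate probability or the rule for picking significant words. The sketch below is one common approach, stated here as an assumption and not as what classifyMessage does: rank words by how far their probability lies from the neutral value 0.5, then combine the selected probabilities as if the words were independent. The function names, the sample words, and the count of words kept are hypothetical.

```python
import math

def most_significant(word_probs, count=15):
    """Return the `count` (word, p) pairs whose p is farthest from neutral 0.5."""
    ranked = sorted(word_probs.items(),
                    key=lambda item: abs(item[1] - 0.5),
                    reverse=True)
    return ranked[:count]

def aggregate_probability(probs):
    """Combine per-word junk probabilities, assuming independent words.

    Every probability must lie strictly between 0 and 1, otherwise both
    products can be zero and the division fails.
    """
    junk = math.prod(probs)
    mail = math.prod(1.0 - p for p in probs)
    return junk / (junk + mail)

# Hypothetical per-word junk probabilities for one message.
message_words = {"viagra": 0.99, "meeting": 0.02, "free": 0.90, "the": 0.50}

significant = most_significant(message_words, count=2)
aggregate = aggregate_probability([p for _, p in significant])

print(f"aggregate junk probability: {aggregate:.4f}")
for word, p in significant:
    print(f'    "{word}" {p:.2f}')
```

Printing the selected words next to the aggregate, as the diagnostics do, makes it easy to see which tokens pushed a message one way or the other.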
Separated the mail and junk thresholds, which may now be set independently by the --threshjunk
and --threshmail options. The --classify command now writes “INDT” (for “indeterminate”) if a
message falls between the two thresholds and exits with a return status of 4.
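
A sketch of the three-way decision follows. Only the "INDT" verdict and its exit status of 4 come from the entry above; the "JUNK" and "MAIL" labels, the threshold values, and the choice to treat the thresholds as inclusive are assumptions made for the example.

```python
EXIT_INDETERMINATE = 4  # exit status named in the log for "INDT"

def classify(junk_probability, thresh_mail, thresh_junk):
    """Map an aggregate junk probability to a three-way verdict.

    "JUNK" and "MAIL" are hypothetical labels; only "INDT" appears in the log.
    Treating the thresholds as inclusive is also an assumption.
    """
    if thresh_mail > thresh_junk:
        raise ValueError("mail threshold must not exceed junk threshold")
    if junk_probability >= thresh_junk:
        return "JUNK"
    if junk_probability <= thresh_mail:
        return "MAIL"
    return "INDT"

# Hypothetical threshold values, chosen only for the demonstration.
verdicts = [classify(p, thresh_mail=0.2, thresh_junk=0.9)
            for p in (0.05, 0.5, 0.97)]
print(verdicts)
```

Separate thresholds leave a middle band in which the filter declines to decide, and a calling script can test for status 4 to route such messages to manual review.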
Added the --binwrite and --binread options to export and import a dictionary as a portable
(assuming IEEE floating point on all platforms) binary file. This will permit easier distribution of
dictionary databases and may be faster to load than the lookupDictionary.
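
The actual file layout is not given here. The sketch below shows what makes such a file portable under the stated IEEE assumption: every field is written with an explicit byte order and a fixed width, so the same bytes load identically on any machine. The record layout (entry count, then length-prefixed word and 8-byte double) is hypothetical.

```python
import io
import struct

def write_dictionary(stream, dictionary):
    """Write word -> probability pairs in a fixed, big-endian layout."""
    stream.write(struct.pack(">I", len(dictionary)))      # entry count
    for word, probability in dictionary.items():
        encoded = word.encode("utf-8")
        stream.write(struct.pack(">H", len(encoded)))      # word length
        stream.write(encoded)                              # word bytes
        stream.write(struct.pack(">d", probability))       # IEEE 754 double

def read_dictionary(stream):
    """Read back a dictionary written by write_dictionary."""
    (count,) = struct.unpack(">I", stream.read(4))
    result = {}
    for _ in range(count):
        (length,) = struct.unpack(">H", stream.read(2))
        word = stream.read(length).decode("utf-8")
        (probability,) = struct.unpack(">d", stream.read(8))
        result[word] = probability
    return result

# Hypothetical dictionary contents.
original = {"meeting": 0.02, "free": 0.85, "viagra": 0.99}

buffer = io.BytesIO()
write_dictionary(buffer, original)
size = buffer.tell()
buffer.seek(0)
restored = read_dictionary(buffer)

print(size, restored == original)
```

Because doubles are stored bit for bit, the probabilities survive the round trip exactly, with none of the rounding a text format can introduce.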