
The Annoyance Filter.pdf - Fourmilab


204 DEVELOPMENT LOG ANNOYANCE-FILTER §256

256. Development log.

2002 August 28

Created development tree and commenced implementation.

2002 September 1

Release 0.1 circulated for review.

2002 September 6

Added the ability to compute descriptive statistics of the dictionary built by parsing the --mail and --junk folders, using the facilities of the statlib.w program. Statistics are written to standard output.

Added a --plot option to plot a histogram of words in a newly parsed dictionary (not a lookup dictionary loaded with --read). Creating the plot requires the GNUPLOT and PBMPlus utilities to be installed.

2002 September 7

Well, after a huge amount of hunkering down and twiddling, parsing of MIME multi-part messages and decoding of parts encoded in Base64 and Quoted-Printable encoding now seems to be working. This drastically improves the quality of parsing, particularly for junk, where these forms of encoding are used as “stealth” to evade other content-based filters.
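To illustrate the kind of decoding involved, here is a minimal sketch of a Quoted-Printable decoder following RFC 2045. It is not the program's actual decoder; the names `hexValue` and `decodeQuotedPrintable` are hypothetical, and malformed escapes are simply passed through unchanged, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Value of a hexadecimal digit, or -1 if c is not one.
static int hexValue(char c) {
    unsigned char u = static_cast<unsigned char>(c);
    if (std::isdigit(u)) return u - '0';
    if (u >= 'A' && u <= 'F') return u - 'A' + 10;
    if (u >= 'a' && u <= 'f') return u - 'a' + 10;
    return -1;
}

// Hypothetical helper: decode Quoted-Printable text (RFC 2045, section 6.7).
std::string decodeQuotedPrintable(const std::string &in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] != '=') {
            out += in[i];               // literal character
            continue;
        }
        // Soft line break: "=" immediately before LF or CRLF is dropped.
        if (i + 1 < in.size() && in[i + 1] == '\n') {
            i += 1;
            continue;
        }
        if (i + 2 < in.size() && in[i + 1] == '\r' && in[i + 2] == '\n') {
            i += 2;
            continue;
        }
        // "=XY" escape: two hexadecimal digits encode one byte.
        if (i + 2 < in.size()) {
            int hi = hexValue(in[i + 1]);
            int lo = hexValue(in[i + 2]);
            if (hi >= 0 && lo >= 0) {
                out += static_cast<char>(hi * 16 + lo);
                i += 2;
                continue;
            }
        }
        out += '=';                     // malformed escape: pass through
    }
    return out;
}
```

A word obfuscated as `V=49AGRA` decodes to `VIAGRA`, so the token parser sees the same word it would see in an unencoded message.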

2002 September 8

Added the ability to read mail folders compressed with gzip or other compressors detected by the Autoconf script. This saves a lot of space when you’re keeping large training archives around. This will work only on systems with suitable decompressors and the popen facility.
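A minimal sketch of reading a compressed folder through `popen` might look like the following. The helper names `decompressCommand` and `readPipe` are hypothetical, only the `.gz` suffix is handled, and the naive single-quote wrapping of the path is an assumption that breaks on paths containing a single quote.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <stdio.h>   // popen and pclose are POSIX, not ISO C++

// Hypothetical helper: command that decompresses `path` to standard
// output, or an empty string if the file does not look compressed.
std::string decompressCommand(const std::string &path) {
    const std::string suffix = ".gz";
    if (path.size() >= suffix.size() &&
        path.compare(path.size() - suffix.size(), suffix.size(), suffix) == 0) {
        return "gzip -dc '" + path + "'";   // assumes no quote in path
    }
    return "";
}

// Hypothetical helper: run a command and return everything it writes
// to standard output.
std::string readPipe(const std::string &command) {
    FILE *fp = popen(command.c_str(), "r");
    if (fp == nullptr) {
        throw std::runtime_error("popen failed: " + command);
    }
    std::string data;
    char buffer[4096];
    size_t n;
    while ((n = fread(buffer, 1, sizeof buffer, fp)) > 0) {
        data.append(buffer, n);
    }
    pclose(fp);
    return data;
}
```

A caller would pass the result of `decompressCommand` to `readPipe` when it is non-empty and open the folder as an ordinary file otherwise.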

2002 September 9

Added the --pdiag option to write the parser diagnostics to a designated file. Previously this was controlled by a gnarly #define.

Added a “X-Annoyance-Filter-Decoder” line to the --pdiag output to indicate the activation of decoders (including the sink) for MIME parts in the message. These lines are not seen by the token parser.

Fixed a bug in parsing of tokens including ISO accented characters... signed characters strike again.
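The log does not say exactly what the bug was, but one common way signed characters break the handling of accented letters can be sketched as follows. Both function names are hypothetical, and the range test assumes ISO 8859-1, where the accented letters occupy 0xC0 to 0xFF except the multiplication and division signs.

```cpp
#include <cassert>

// Buggy pattern: where char is signed, the byte 0xE9 ("é" in
// ISO 8859-1) has the value -23, so this comparison is never true.
// The parameter is declared signed char to make that explicit.
bool isAccentedLetterBuggy(signed char c) {
    return c >= 0xC0;
}

// Fixed pattern: convert to unsigned char before comparing, so the
// byte is treated as a value in 0..255 on every platform.
bool isAccentedLetter(char c) {
    unsigned char u = static_cast<unsigned char>(c);
    // 0xD7 is the multiplication sign and 0xF7 the division sign.
    return u >= 0xC0 && u != 0xD7 && u != 0xF7;
}
```

The same conversion is needed before passing a `char` to the `<cctype>` classification functions, whose behaviour is undefined for negative arguments other than `EOF`.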

2002 September 10

Added a --ptrace option to include the actual tokens parsed as indented, quoted lines following each line of parser input in the --pdiag file.

Added code to classifyMessage which appends lines to the message header in the --pdiag file giving the aggregate junk probability and the most significant words and their individual probabilities.
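This entry does not give the formula used, so the following is only a sketch of one widely used approach: rank words by how far their probability lies from the neutral value 0.5, then combine the top few with the naive Bayesian product rule. The names `WordProb`, `mostSignificant`, and `aggregateJunkProbability` are hypothetical, as is the choice to return 0.5 when the evidence is exactly contradictory.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <string>
#include <utility>
#include <vector>

// A word and its individual junk probability (hypothetical type).
using WordProb = std::pair<std::string, double>;

// Keep the n words whose probability is furthest from the neutral 0.5.
std::vector<WordProb> mostSignificant(std::vector<WordProb> words,
                                      std::size_t n) {
    std::sort(words.begin(), words.end(),
              [](const WordProb &a, const WordProb &b) {
                  return std::fabs(a.second - 0.5) >
                         std::fabs(b.second - 0.5);
              });
    if (words.size() > n) {
        words.resize(n);
    }
    return words;
}

// Combine individual probabilities p_i into
//   prod(p_i) / (prod(p_i) + prod(1 - p_i)).
double aggregateJunkProbability(const std::vector<WordProb> &words) {
    double junk = 1.0;
    double mail = 1.0;
    for (const WordProb &w : words) {
        junk *= w.second;
        mail *= 1.0 - w.second;
    }
    double total = junk + mail;
    return total == 0.0 ? 0.5 : junk / total;  // 0.5: contradictory evidence
}
```

Listing the selected words next to the aggregate value, as the --pdiag output does, makes it easy to see which tokens drove a classification.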

Separated the mail and junk thresholds, which may now be set independently by the --threshjunk and --threshmail options. The --classify command now writes “INDT” (for “indeterminate”) if a message falls between the two thresholds and exits with a return status of 4.
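The two-threshold decision can be sketched as below. Only the “INDT” label and its exit status of 4 come from this entry; the function name `classify`, the “JUNK” and “MAIL” labels, and the inclusive comparisons are assumptions made for illustration.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: map an aggregate junk probability to a label.
// threshMail and threshJunk correspond to the values set with
// --threshmail and --threshjunk, with threshMail <= threshJunk.
std::string classify(double junkProbability,
                     double threshMail,
                     double threshJunk) {
    if (junkProbability >= threshJunk) {
        return "JUNK";   // assumed label
    }
    if (junkProbability <= threshMail) {
        return "MAIL";   // assumed label
    }
    return "INDT";       // between the thresholds: exit status 4
}
```

Widening the gap between the two thresholds sends more messages to the indeterminate category rather than risking a wrong call in either direction.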

Added the --binwrite and --binread options to export and import a dictionary as a portable (assuming IEEE floating point on all platforms) binary file. This will permit easier distribution of dictionary databases and may be faster to load than the lookupDictionary.
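The portability caveat can be illustrated with a sketch that writes a probability as eight bytes in a fixed byte order, so that machines of either endianness read back the same value. The helper names `putDouble` and `getDouble` and the big-endian order are assumptions; the file layout actually used by --binwrite is not described in this entry.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

static_assert(sizeof(double) == sizeof(std::uint64_t),
              "sketch assumes 64-bit IEEE 754 doubles");

// Hypothetical helper: append a double to `out` as 8 big-endian bytes.
void putDouble(std::string &out, double value) {
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);   // reinterpret without aliasing
    for (int shift = 56; shift >= 0; shift -= 8) {
        out += static_cast<char>((bits >> shift) & 0xFF);
    }
}

// Hypothetical helper: read 8 big-endian bytes at `offset` as a double.
// The caller must ensure that offset + 8 <= in.size().
double getDouble(const std::string &in, std::size_t offset) {
    std::uint64_t bits = 0;
    for (std::size_t i = 0; i < 8; ++i) {
        bits = (bits << 8) | static_cast<unsigned char>(in[offset + i]);
    }
    double value;
    std::memcpy(&value, &bits, sizeof value);
    return value;
}
```

Because the shifts operate on the integer value and not on memory, the byte sequence is the same on little-endian and big-endian hosts, provided both use the IEEE 754 double format.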
