12.06.2015 Views

The Annoyance Filter.pdf - Fourmilab

The Annoyance Filter.pdf - Fourmilab

The Annoyance Filter.pdf - Fourmilab

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

§256 ANNOYANCE-FILTER DEVELOPMENT LOG 205<br />

Added the −−clearjunk and −−clearmail options to clear counts of junk and mail. This can be used,<br />

in conjunction with the binwrite option, to prepare databases for use by folks who do not wish to<br />

prepare their own.<br />

2002 September 11<br />

Added the ability to enforce minimum and maximum length constraints on tokens returned by tokenParser.<br />

<strong>The</strong> limits are set to accept tokens from 1 to 255 characters in the tokenParser constructor, and may<br />

be changed at any time with the setTokenLengthLimits method. Note that the length limits are not<br />

reset by a call to setSource .<br />

Set the default token parser length limits to accept tokens between 1 and 64 characters.<br />

doubtless be the subject of yet more command line options before long.<br />

This will<br />

Modified the code which decides whether a mail folder is compressed to check for the argument being a<br />

symbolic link. If so, the link target is tested for the extension indicating a compressed file. I only follow<br />

links one level—if this poses a problem, your life is probably too complicated.<br />

Fixed computation of probability to avoid crashes if no words are present in a category. Probabilities<br />

don’t make any sense in such circumstances, but you may wish to create such a database for use with<br />

−−binread.<br />

Added logic to dictionary ::exportToBinaryFile and dictionary ::exportToBinaryFile to save and<br />

restore the count of messages contributing to the dictionary in the messageCount array in a pseudo-word<br />

called “␣COUNTS␣” (obligatorily) at the start of the dictionary. <strong>The</strong>se counts are required should we<br />

need to recompute the probability subsequent to loading the dictionary.<br />

Added the −−newword and.–sigwords options to specify the probability given to words in a message<br />

which don’t appear in the dictionary and the number of “most significant” words whose probabilities<br />

are used to determine the aggregate probability a given message is junk.<br />

2002 September 12<br />

Added logic to cope with the body of a message being encoded in a Content−TransferEncoding.<br />

While processing the header, this and the Content−Type are parsed as in MIME headers, with their<br />

arguments saved in bodyContentType , bodyContentTypeCharset , and bodyContentTransferEncoding . At<br />

the end of the header, if a bodyContentTransferEncoding has been specified, the values are transferred<br />

to the corresponding mime . . . variables and multiPart is set with an end terminator of the null string.<br />

<strong>The</strong> latter disables the decoder’s test for a part end sentinel and the warning for an unterminated part.<br />

Messages with Subject lines which contain ISO 8859 encoded characters employ a form of Quoted−Printable<br />

encoding to permit these characters to appear in a mail header where only 7 bit ASCII is permitted.<br />

I added code to mailFolder to detect these lines and call a new decodeEscapedText method of<br />

quotedPrintableMIMEdecoder to decode them if properly formed. This will permit parsing of ISO<br />

subject lines, which may prove critical in discriminating among messages with very short body copy.<br />

Yikes! As far as I can determine from the RFCs, what we’re supposed to do with continued header<br />

lines is just concatenate them, discarding all white space on the continuation even if this runs together<br />

tokens on adjacent lines. At least, if you don’t do this, encoded words split across continued Subject<br />

lines end up with nugatory white space in the middle. So, I fixed 〈 Check for continuation of mail header<br />

lines 143 〉 to “work this way”. Given our definition of tokens, it’s likely to fix more things than it breaks<br />

anyway.<br />

Added documentation to the CWEB file for yesterday’s new options.<br />

2002 September 13

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!