12.06.2015 Views

The Annoyance Filter.pdf - Fourmilab

The Annoyance Filter.pdf - Fourmilab

The Annoyance Filter.pdf - Fourmilab

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

§256 ANNOYANCE-FILTER DEVELOPMENT LOG 207<br />

fixed a bug in updating the message counts which applied to both −−csvread and the existing −−read<br />

code, but which only manifested itself when loading multiple dictionaries.<br />

Wheels within wheels. . .MIME multipart messages can, of course, be nested. You can be blithely parsing<br />

your way through a message when you trip over a part with a Content−type of “multipart/alternative”,<br />

which pushes a new part boundary onto the stack, to be popped when the end sentinel of that nested<br />

section is encountered. What fun. We consequently introduce a new partBoundaryStack to keep track<br />

of the nested part boundary sentinels, along with all of the defensive code needed to cope with the<br />

realities of real world mail.<br />

2002 September 15<br />

Loosened up the test for multipart Content−type so that “multipart/related” types will be recognised.<br />

Added the long-awaited −−transcript option. (Thanks, Kern, for suggesting it!) A transcript of the input<br />

message for a −−test or −−classify operation is written to the argument file name (standard output<br />

if the argument is “−”, with X−<strong>Annoyance</strong>−<strong>Filter</strong>−Junk−Probability and X−<strong>Annoyance</strong>−<strong>Filter</strong>−Classification<br />

items appended to the header indicating the calculated junk probability and classification according to<br />

the thresholds.<br />

Finished the first cut of multiple byte character set decoders and interpreters. A decoder scans the mail<br />

body (encoded or not), and parses the byte stream into logical characters up to 32 bits in width. An<br />

interpreter expresses these characters in a form suitable for analysis. Ideographic languages are typically<br />

interpreted as one word per character, other languages as one letter per character. <strong>The</strong>se components<br />

must, of course, be utterly bullet-proof as they will be subjected to every possibly kind of garbage in the<br />

course of parsing real-world mail. At the moment, we have decoders for EUC and Big5, and interpreters<br />

for GB2312 and Big5.<br />

Added a decoder for EUC-encoded Korean (euc−kr) as an example of how to handle an alphabetic<br />

language with a non-Western character set.<br />

2002 September 16<br />

Modified EUC MBCSdecoder to discard the balance of any encoded line in which an invalid EUC<br />

second byte is encountered. After encountering such garbage, the rest of the line is usually junk and<br />

there’s no profit in blithering through it.<br />

Added logic to scan application binary byte streams for possible embedded tokens. <strong>The</strong> new −−binword<br />

option sets the shortest sequence of contiguous ASCII alphanumeric characters or dollar signs (with possible<br />

embedded hyphens and apostrophes, but not permitting these character at the start or end of a<br />

token—the default is 5 characters, which is a tad more discriminating than the UNIX strings which<br />

defaults to 4 printable characters. You can disable the scanning of binary streams entirely by setting<br />

−−binword to zero. Scanning binary streams might seem to be a curious endeavour, but it’s highly<br />

effective at percolating text embedded in viruses and worm attachments to junk mail to the top of the<br />

junk probability hit parade, then screening them out when the arrive in incoming mail.<br />

Although the Subject line is the most important, any line in a mail header may actually contain quoted<br />

sequences specifying a character set and Quoted−Printable or Base64 encoded characters. I modified<br />

〈 Check for encoded header line and decode 147 〉 to no longer restrict decoding to the subject line.<br />

Once decoded, if the charset specification in a header line quoted sequence is a character set we<br />

understand, it is not decoded and interpreted. ISO−8859 sets of all flavours are decoded but not<br />

processed further.<br />

Fixed a few gcc −Wall quibbles in tokenDefinition which popped up on Solaris compiler but didn’t<br />

seem to perturb the almost identical version of gcc on Linux.

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!