The Annoyance Filter.pdf - Fourmilab
The Annoyance Filter.pdf - Fourmilab
The Annoyance Filter.pdf - Fourmilab
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
§256 ANNOYANCE-FILTER DEVELOPMENT LOG 207<br />
fixed a bug in updating the message counts which applied to both −−csvread and the existing −−read<br />
code, but which only manifested itself when loading multiple dictionaries.<br />
Wheels within wheels. . .MIME multipart messages can, of course, be nested. You can be blithely parsing<br />
your way through a message when you trip over a part with a Content−type of “multipart/alternative”,<br />
which pushes a new part boundary onto the stack, to be popped when the end sentinel of that nested<br />
section is encountered. What fun. We consequently introduce a new partBoundaryStack to keep track<br />
of the nested part boundary sentinels, along with all of the defensive code needed to cope with the<br />
realities of real world mail.<br />
2002 September 15<br />
Loosened up the test for multipart Content−type so that “multipart/related” types will be recognised.<br />
Added the long-awaited −−transcript option. (Thanks, Kern, for suggesting it!) A transcript of the input<br />
message for a −−test or −−classify operation is written to the argument file name (standard output<br />
if the argument is “−”, with X−<strong>Annoyance</strong>−<strong>Filter</strong>−Junk−Probability and X−<strong>Annoyance</strong>−<strong>Filter</strong>−Classification<br />
items appended to the header indicating the calculated junk probability and classification according to<br />
the thresholds.<br />
Finished the first cut of multiple byte character set decoders and interpreters. A decoder scans the mail<br />
body (encoded or not), and parses the byte stream into logical characters up to 32 bits in width. An<br />
interpreter expresses these characters in a form suitable for analysis. Ideographic languages are typically<br />
interpreted as one word per character, other languages as one letter per character. <strong>The</strong>se components<br />
must, of course, be utterly bullet-proof as they will be subjected to every possibly kind of garbage in the<br />
course of parsing real-world mail. At the moment, we have decoders for EUC and Big5, and interpreters<br />
for GB2312 and Big5.<br />
Added a decoder for EUC-encoded Korean (euc−kr) as an example of how to handle an alphabetic<br />
language with a non-Western character set.<br />
2002 September 16<br />
Modified EUC MBCSdecoder to discard the balance of any encoded line in which an invalid EUC<br />
second byte is encountered. After encountering such garbage, the rest of the line is usually junk and<br />
there’s no profit in blithering through it.<br />
Added logic to scan application binary byte streams for possible embedded tokens. <strong>The</strong> new −−binword<br />
option sets the shortest sequence of contiguous ASCII alphanumeric characters or dollar signs (with possible<br />
embedded hyphens and apostrophes, but not permitting these character at the start or end of a<br />
token—the default is 5 characters, which is a tad more discriminating than the UNIX strings which<br />
defaults to 4 printable characters. You can disable the scanning of binary streams entirely by setting<br />
−−binword to zero. Scanning binary streams might seem to be a curious endeavour, but it’s highly<br />
effective at percolating text embedded in viruses and worm attachments to junk mail to the top of the<br />
junk probability hit parade, then screening them out when the arrive in incoming mail.<br />
Although the Subject line is the most important, any line in a mail header may actually contain quoted<br />
sequences specifying a character set and Quoted−Printable or Base64 encoded characters. I modified<br />
〈 Check for encoded header line and decode 147 〉 to no longer restrict decoding to the subject line.<br />
Once decoded, if the charset specification in a header line quoted sequence is a character set we<br />
understand, it is not decoded and interpreted. ISO−8859 sets of all flavours are decoded but not<br />
processed further.<br />
Fixed a few gcc −Wall quibbles in tokenDefinition which popped up on Solaris compiler but didn’t<br />
seem to perturb the almost identical version of gcc on Linux.