The Annoyance Filter.pdf - Fourmilab
206 DEVELOPMENT LOG ANNOYANCE-FILTER §256
Subject lines can, of course, also contain sequences encoded in Base64, tagged with a “?B?” following
the charset specification. Added decoding of these sequences, along with the requisite decodeEscapedText
method of base64MIMEdecoder.
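As a rough illustration of what decoding such a sequence involves, here is a minimal sketch. It assumes the usual MIME encoded-word layout, "=?charset?B?text?=", and performs no character set conversion. The function names decodeBase64 and decodeEncodedWords are hypothetical and are not the program's decodeEscapedText or base64MIMEdecoder.

```cpp
#include <cctype>
#include <string>

// Decode a Base64 string, skipping any character outside the alphabet.
// Hypothetical helper, not the program's base64MIMEdecoder.
std::string decodeBase64(const std::string &in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    unsigned int buffer = 0;    // accumulates 6-bit groups
    int bits = 0;               // number of undelivered bits in buffer
    for (char c : in) {
        if (c == '=') break;    // padding marks the end of the data
        std::string::size_type v = alphabet.find(c);
        if (v == std::string::npos) continue;
        buffer = (buffer << 6) | static_cast<unsigned int>(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out += static_cast<char>((buffer >> bits) & 0xFF);
        }
    }
    return out;
}

// Replace every "=?charset?B?text?=" sequence in text with its decoded
// bytes. Anything else, including "?Q?" sequences, is copied unchanged.
std::string decodeEncodedWords(const std::string &text) {
    const std::string::size_type npos = std::string::npos;
    std::string out;
    std::string::size_type pos = 0;
    while (pos < text.size()) {
        std::string::size_type start = text.find("=?", pos);
        if (start == npos) break;
        std::string::size_type q1 = text.find('?', start + 2);  // end of charset
        bool isB = q1 != npos && q1 + 2 < text.size() &&
                   std::toupper(static_cast<unsigned char>(text[q1 + 1])) == 'B' &&
                   text[q1 + 2] == '?';
        std::string::size_type end = isB ? text.find("?=", q1 + 3) : npos;
        if (end == npos) {
            // Not a complete Base64 encoded word: copy "=?" and move on.
            out.append(text, pos, start + 2 - pos);
            pos = start + 2;
            continue;
        }
        out.append(text, pos, start - pos);
        out += decodeBase64(text.substr(q1 + 3, end - (q1 + 3)));
        pos = end + 2;
    }
    out.append(text, pos, npos);
    return out;
}
```

Copying unrecognised sequences through untouched means a malformed subject line is never made worse by the decoder.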
Made a slight revision to the definition of tokens in the tokenParser. While “-” and “'” continue to
be considered part of a token if embedded within it, they can no longer be the first or last characters of
a token. This improves recognition of words in typical text, based on tests against the big collection. A
new not_at_ends array of bool is used to define which characters may not begin or end a token.
Completely rewrote how the tokenParser determines character types in parsing for tokens. Previously,<br />
characters were classified by looking them up in a collection of global arrays of bool. To permit changing<br />
the definition of a token on the fly, I defined a new class, tokenDefinition, which collects together the<br />
lookup tables which determine which characters constitute a token and indicate the sets of characters<br />
(if any) which cannot exclusively make up a token and which cannot be the first or last character of a<br />
token. In addition, the minimum and maximum acceptable length for tokens are stored and methods<br />
permit testing all of these quantities. You can initialise the values as you wish with the methods provided,
or use pre-defined initialiser functions for ISO-8859 and ASCII alphanumeric sets.<br />
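The shape of such a class might be sketched as follows. This is only an illustration under stated assumptions: the class name TokenDefinitionSketch, its member names, and the choice of digits and hyphens as the "cannot exclusively make up a token" set are hypothetical and are not taken from the program's tokenDefinition.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical sketch: gathers the per-character lookup tables and the
// length limits that together define what counts as a token.
class TokenDefinitionSketch {
public:
    TokenDefinitionSketch(std::size_t minLen, std::size_t maxLen)
        : minLength(minLen), maxLength(maxLen) {
        for (int c = 0; c < 256; c++) {
            isTokenChar[c] = notExclusively[c] = notAtEnds[c] = false;
        }
    }

    // Pre-defined initialiser (assumed rules): ASCII letters and digits
    // are token characters; '-' and '\'' are allowed only when embedded;
    // a token may not consist solely of digits and hyphens.
    void setAsciiAlphanumeric() {
        for (int c = 0; c < 256; c++) {
            isTokenChar[c] = c < 128 && std::isalnum(c) != 0;
            notExclusively[c] = c < 128 && std::isdigit(c) != 0;
            notAtEnds[c] = false;
        }
        isTokenChar[idx('-')] = isTokenChar[idx('\'')] = true;
        notAtEnds[idx('-')] = notAtEnds[idx('\'')] = true;
        notExclusively[idx('-')] = true;
    }

    bool isTokenCharacter(char c) const { return isTokenChar[idx(c)]; }

    // Test a candidate against every rule in the definition.
    bool isValidToken(const std::string &t) const {
        if (t.empty() || t.size() < minLength || t.size() > maxLength) {
            return false;
        }
        if (notAtEnds[idx(t.front())] || notAtEnds[idx(t.back())]) {
            return false;
        }
        bool onlyRestricted = true;
        for (char c : t) {
            if (!isTokenChar[idx(c)]) return false;
            if (!notExclusively[idx(c)]) onlyRestricted = false;
        }
        return !onlyRestricted;
    }

private:
    static std::size_t idx(char c) { return static_cast<unsigned char>(c); }

    bool isTokenChar[256];      // characters which may appear in a token
    bool notExclusively[256];   // characters which cannot form a whole token
    bool notAtEnds[256];        // characters which cannot begin or end a token
    std::size_t minLength, maxLength;
};
```

Because the tables live in an object rather than in global arrays, a parser can hold several definitions and switch between them while running.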
Well, let’s declare this a red banner day for the annoyance-filter! No, you’re not dreaming... we’re
actually ending this day with fewer command line options than those which greeted the dawn, and<br />
the whole concept of the “lookup dictionary” has been banished, along with snowdrifts of prose in<br />
the documentation explaining the difference between a “dictionary” and a “lookup dictionary” and the
things you could or couldn’t do with, or to, them respectively. The original idea was that you work with
dictionary objects when assembling the database of mail and junk, and then export the results as a<br />
lean and mean lookup dictionary which could be loaded like lightning to classify subsequent messages.<br />
Well, it turns out that if you use binary I/O for the dictionary, it’s just as fast as loading the lookup<br />
dictionary, and all of the confusion is eliminated. Further, the user is thereby encouraged to keep a<br />
dictionary on hand which can be updated at any time to incorporate new examples of mail and junk.<br />
This is all much more the Bayesian spirit of eternal refinement than settling on a probability set without<br />
subsequent refinement.<br />
Since the lookup dictionary is no more, there’s no need to distinguish the dictionary read and write<br />
commands as binary. Hence, the --binread and --binwrite options have been renamed --read and
--write, names freed up by the elimination of the lookup dictionary.
2002 September 14<br />
The direct concatenation of multiple-line header items added a couple of days ago broke 〈 Process
multipart MIME header declaration 150 〉 thanks to fat-fingered character counting in the recognition of<br />
sentinels. I fixed this, and modified the code to perform all parsing on a canonicalised string to avoid<br />
case sensitivity problems. Note that the boundary itself is and must remain case sensitive.<br />
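One way to parse on a canonicalised string while leaving the boundary untouched is sketched below: the sentinel is located in a lower-cased copy of the header, and the value is then sliced from the original at the same offset. The function name extractBoundary is hypothetical, and this is not the code of the section cited above.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Return the multipart boundary declared in a Content-Type header value,
// or an empty string if there is none. Hypothetical helper.
std::string extractBoundary(const std::string &header) {
    const std::string::size_type npos = std::string::npos;

    // Canonicalised copy used only for case-insensitive searching.
    std::string canon = header;
    std::transform(canon.begin(), canon.end(), canon.begin(),
                   [](unsigned char c) {
                       return static_cast<char>(std::tolower(c));
                   });

    static const std::string sentinel = "boundary=";
    std::string::size_type p = canon.find(sentinel);
    if (p == npos) return "";
    p += sentinel.size();   // length comes from the sentinel, not a hand count

    // Slice the value from the ORIGINAL header so its case is preserved.
    if (p < header.size() && header[p] == '"') {
        std::string::size_type close = header.find('"', p + 1);
        if (close == npos) return header.substr(p + 1);
        return header.substr(p + 1, close - p - 1);
    }
    std::string::size_type end = header.find_first_of("; \t", p);
    return header.substr(p, end == npos ? npos : end - p);
}
```

Lower-casing one byte at a time keeps the copy the same length as the original, so an offset found in one is valid in the other.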
Fixed some gcc -Wall natters which had crept in since the option was accidentally removed by
autoconf.<br />
Added the ability to read a mailFolder from standard input. If the fname argument to the constructor<br />
is “-”, cin is used as the input stream.
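A common way to arrange this is to read through a pointer to istream which refers either to cin or to a file stream the object owns. The sketch below assumes that approach. The class name MailFolderSketch and its methods are hypothetical and do not reproduce the program's mailFolder.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Hypothetical sketch: a folder reader which takes its input from
// standard input when the file name is "-", and from the named file
// otherwise.
class MailFolderSketch {
public:
    explicit MailFolderSketch(const std::string &fname) : in(nullptr) {
        if (fname == "-") {
            in = &std::cin;             // read from standard input
        } else {
            file.open(fname.c_str());
            if (file.is_open()) {
                in = &file;             // read from the named file
            }
        }
    }

    bool isOpen() const { return in != nullptr; }
    bool usesStandardInput() const { return in == &std::cin; }

    // Fetch the next line; returns false at end of input or if the
    // folder could not be opened.
    bool nextLine(std::string &line) {
        return in != nullptr && static_cast<bool>(std::getline(*in, line));
    }

private:
    std::ifstream file;     // used only when reading a named file
    std::istream *in;       // points at either file or std::cin
};
```

Everything downstream of the constructor reads through the same pointer, so the rest of the parser need not know where the folder came from.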
Renamed the --csv option --csvwrite in keeping with nefarious plans soon to be disclosed, and added
a pseudo “␣COUNTS␣” word to the start of the CSV file giving the number of mail and junk messages in<br />
the dictionary as is done in binary dictionary dumps. Changed the sort order for the CSV file so that<br />
words with identical probabilities are sorted into lexical order.<br />
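The revised ordering amounts to a two-key comparison: probability first, with the word itself as the tie-breaker. A minimal sketch follows, assuming ascending probability (the log does not say which direction the file is sorted). The WordEntry structure and the sortForCsv function are hypothetical.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record for one dictionary word.
struct WordEntry {
    std::string word;
    double junkProbability;
};

// Sort by probability (ascending, an assumption); words with identical
// probabilities fall into lexical order.
void sortForCsv(std::vector<WordEntry> &entries) {
    std::sort(entries.begin(), entries.end(),
              [](const WordEntry &a, const WordEntry &b) {
                  if (a.junkProbability != b.junkProbability) {
                      return a.junkProbability < b.junkProbability;
                  }
                  return a.word < b.word;
              });
}
```

With the tie-breaker in place, two exports of the same dictionary produce files in the same order, which makes them easier to compare.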
Added a --csvread option to import a dictionary from a CSV file in the format created by --csvwrite.
The CSV file is added to the existing in-memory dictionary; multiple --csvread and --read commands
may be used to assemble a dictionary. The CSV file imported need not be sorted in any particular order
and may contain comments whose first nonblank character is “;” or “#”. In the process, I found and