The Annoyance Filter.pdf - Fourmilab


206 DEVELOPMENT LOG ANNOYANCE-FILTER §256

Subject lines can, of course, also contain sequences encoded in Base64, tagged with a “?B?” following the charset specification. Added decoding of these sequences, along with the requisite decodeEscapedText method of base64MIMEdecoder.
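To illustrate what decoding such a sequence involves, here is a minimal standalone sketch. It is not the program's decodeEscapedText method: the function names decodeBase64 and decodeEncodedWords are hypothetical, the encoded-word layout =?charset?B?text?= is assumed from RFC 2047, and the charset is ignored, so the decoded bytes are passed through unchanged.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper: decode Base64 text, stopping at padding ('=')
// or at the first character outside the Base64 alphabet.
std::string decodeBase64(const std::string &in) {
    static const std::string alphabet =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    unsigned int acc = 0;   // bit accumulator; only the low bits are used
    int bits = 0;           // number of valid bits currently in acc
    for (char c : in) {
        std::string::size_type v = alphabet.find(c);
        if (c == '=' || v == std::string::npos) break;
        acc = (acc << 6) | static_cast<unsigned int>(v);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out += static_cast<char>((acc >> bits) & 0xFF);
        }
    }
    return out;
}

// Hypothetical helper: replace every "=?charset?B?text?=" sequence in a
// subject line with its decoded bytes. Other encodings (such as "?Q?")
// and malformed sequences are copied through untouched.
std::string decodeEncodedWords(const std::string &subject) {
    std::string out;
    std::string::size_type pos = 0;
    while (true) {
        std::string::size_type start = subject.find("=?", pos);
        if (start == std::string::npos) break;
        // q1 marks the '?' which ends the charset specification.
        std::string::size_type q1 = subject.find('?', start + 2);
        std::string::size_type end = std::string::npos;
        bool isB = q1 != std::string::npos && q1 + 2 < subject.size() &&
                   std::toupper(static_cast<unsigned char>(subject[q1 + 1])) == 'B' &&
                   subject[q1 + 2] == '?';
        if (isB) end = subject.find("?=", q1 + 3);
        if (!isB || end == std::string::npos) {
            // Not a Base64 encoded word: keep the "=?" and scan onward.
            out.append(subject, pos, start + 2 - pos);
            pos = start + 2;
            continue;
        }
        out.append(subject, pos, start - pos);
        out += decodeBase64(subject.substr(q1 + 3, end - (q1 + 3)));
        pos = end + 2;
    }
    out.append(subject, pos, std::string::npos);
    return out;
}
```

With these definitions, a subject such as "Re: =?ISO-8859-1?B?SGVsbG8=?= world" decodes to "Re: Hello world", while a subject with no encoded words is returned as it was.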

Made a slight revision to the definition of tokens in the tokenParser. While “-” and “'” continue to be considered part of a token if embedded within it, they can no longer be the first or last characters of a token. This improves recognition of words in typical text, based on tests against the big collection. A new not_at_ends array of bool is used to define which characters may not begin or end a token.

Completely rewrote how the tokenParser determines character types in parsing for tokens. Previously, characters were classified by looking them up in a collection of global arrays of bool. To permit changing the definition of a token on the fly, I defined a new class, tokenDefinition, which collects together the lookup tables which determine which characters constitute a token and indicate the sets of characters (if any) which cannot exclusively make up a token and which cannot be the first or last character of a token. In addition, the minimum and maximum acceptable lengths for tokens are stored, and methods permit testing all of these quantities. You can initialise the values as you wish with the methods provided, or use pre-defined initialiser functions for ISO-8859 and ASCII alphanumeric sets.
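The following is a rough sketch of how a class along these lines might be organised. It is not the program's tokenDefinition: the class name TokenDefinitionSketch, its member names, and the setAsciiAlphanumeric initialiser are all hypothetical, and treating digits as characters which cannot exclusively make up a token is purely an illustrative choice.

```cpp
#include <bitset>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: per-character lookup tables plus length limits.
class TokenDefinitionSketch {
public:
    TokenDefinitionSketch(std::size_t minLen, std::size_t maxLen)
        : minLength(minLen), maxLength(maxLen) {}

    // Assumed initialiser: ASCII letters and digits form tokens, while
    // '-' and '\'' are permitted only when embedded within a token.
    void setAsciiAlphanumeric() {
        isToken.reset();
        notExclusively.reset();
        notAtEnds.reset();
        for (unsigned char c = '0'; c <= '9'; c++) {
            isToken.set(c);
            notExclusively.set(c);   // illustrative: reject all-digit tokens
        }
        for (unsigned char c = 'a'; c <= 'z'; c++) isToken.set(c);
        for (unsigned char c = 'A'; c <= 'Z'; c++) isToken.set(c);
        const unsigned char embedded[] = {'-', '\''};
        for (unsigned char c : embedded) {
            isToken.set(c);
            notAtEnds.set(c);
        }
    }

    // Split text into the tokens accepted by this definition.
    std::vector<std::string> tokenize(const std::string &text) const {
        std::vector<std::string> tokens;
        std::size_t i = 0;
        while (i < text.size()) {
            // Skip separators, then collect a run of token characters.
            while (i < text.size() && !isToken.test(index(text[i]))) i++;
            std::size_t start = i;
            while (i < text.size() && isToken.test(index(text[i]))) i++;
            std::size_t end = i;
            // Trim characters which may not begin or end a token.
            while (start < end && notAtEnds.test(index(text[start]))) start++;
            while (end > start && notAtEnds.test(index(text[end - 1]))) end--;
            if (acceptable(text, start, end)) {
                tokens.push_back(text.substr(start, end - start));
            }
        }
        return tokens;
    }

private:
    static std::size_t index(char c) { return static_cast<unsigned char>(c); }

    // Check the length limits and the "not exclusively" rule.
    bool acceptable(const std::string &text, std::size_t start, std::size_t end) const {
        std::size_t length = end - start;
        if (length == 0 || length < minLength || length > maxLength) return false;
        for (std::size_t k = start; k < end; k++) {
            if (!notExclusively.test(index(text[k]))) return true;
        }
        return false;   // every character was in the "not exclusively" set
    }

    std::bitset<256> isToken, notExclusively, notAtEnds;
    std::size_t minLength, maxLength;
};
```

Because the tables live in an object rather than in global arrays, a parser could hold a pointer to one of these and be handed a different definition at any time, which is the on-the-fly flexibility described above.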

Well, let’s declare this a red banner day for the annoyance-filter! No, you’re not dreaming... we’re actually ending this day with fewer command line options than those which greeted the dawn, and the whole concept of the “lookup dictionary” has been banished, along with snowdrifts of prose in the documentation explaining the difference between a “dictionary” and a “lookup dictionary” and the things you could or couldn’t do with, or to, them respectively. The original idea was that you work with dictionary objects when assembling the database of mail and junk, and then export the results as a lean and mean lookup dictionary which could be loaded like lightning to classify subsequent messages. Well, it turns out that if you use binary I/O for the dictionary, it’s just as fast as loading the lookup dictionary, and all of the confusion is eliminated. Further, the user is thereby encouraged to keep a dictionary on hand which can be updated at any time to incorporate new examples of mail and junk. This is all much more in the Bayesian spirit of eternal refinement than settling on a probability set without subsequent refinement.

Since the lookup dictionary is no more, there’s no need to distinguish the dictionary read and write commands as binary. Hence, the --binread and --binwrite options have been renamed --read and --write, freed up by the lookup dictionary elimination.

2002 September 14

The direct concatenation of multiple-line header items added a couple of days ago broke 〈 Process multipart MIME header declaration 150 〉 thanks to fat-fingered character counting in the recognition of sentinels. I fixed this, and modified the code to perform all parsing on a canonicalised string to avoid case sensitivity problems. Note that the boundary itself is and must remain case sensitive.

Fixed some gcc -Wall natters which had crept in since the option was accidentally removed by autoconf.

Added the ability to read a mailFolder from standard input. If the fname argument to the constructor is “-”, cin is used as the input stream.
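The “-” convention can be sketched as follows. This is only an illustration of the idea, not the program's mailFolder class: the class name MailFolderSketch and its methods isOpen, usesStandardInput, and nextLine are hypothetical.

```cpp
#include <fstream>
#include <iostream>
#include <istream>
#include <string>

// Hypothetical sketch: read lines either from a named file or, when
// the file name is "-", from standard input.
class MailFolderSketch {
public:
    explicit MailFolderSketch(const std::string &fname) : in(&file) {
        if (fname == "-") {
            in = &std::cin;            // "-" selects standard input
        } else {
            file.open(fname.c_str());  // otherwise open the named file
        }
    }

    // The object points at its own member, so copying is disabled.
    MailFolderSketch(const MailFolderSketch &) = delete;
    MailFolderSketch &operator=(const MailFolderSketch &) = delete;

    bool usesStandardInput() const { return in == &std::cin; }
    bool isOpen() const { return usesStandardInput() || file.is_open(); }

    // Fetch the next line; returns false at end of input or on error.
    bool nextLine(std::string &line) {
        return static_cast<bool>(std::getline(*in, line));
    }

private:
    std::ifstream file;
    std::istream *in;
};
```

Keeping a pointer to a std::istream means the rest of the reading code need not know which source was chosen, so a folder can be piped in from another program without further changes.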

Renamed the --csv option --csvwrite in keeping with nefarious plans soon to be disclosed, and added a pseudo “␣COUNTS␣” word to the start of the CSV file giving the number of mail and junk messages in the dictionary as is done in binary dictionary dumps. Changed the sort order for the CSV file so that words with identical probabilities are sorted into lexical order.
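A two-level ordering of this kind can be sketched with a comparison function. The WordEntry structure and the names csvOrder and sortForCsv are hypothetical, and sorting probabilities in ascending order is an assumption; the log states only that ties in probability are broken lexically.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record: a word and its junk probability.
struct WordEntry {
    std::string word;
    double junkProbability;
};

// Order by probability (ascending, an assumption), then by word so
// that entries with identical probabilities appear in lexical order.
bool csvOrder(const WordEntry &a, const WordEntry &b) {
    if (a.junkProbability != b.junkProbability) {
        return a.junkProbability < b.junkProbability;
    }
    return a.word < b.word;
}

void sortForCsv(std::vector<WordEntry> &entries) {
    std::sort(entries.begin(), entries.end(), csvOrder);
}
```

Breaking ties on the word itself makes the output deterministic, so two dumps of the same dictionary can be compared line by line.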

Added a --csvread option to import a dictionary from a CSV file in the format created by --csvwrite. The CSV file is added to the existing in-memory dictionary; multiple --csvread and --read commands may be used to assemble a dictionary. The CSV file imported need not be sorted in any particular order and may contain comments whose first nonblank character is “;” or “#”. In the process, I found and