The Annoyance Filter.pdf - Fourmilab


DEVELOPMENT LOG ANNOYANCE-FILTER §256

modified the clean target in the actual Makefile to leave cweave.c around. I also modified our own clean target to clean the cweb directory as well.

Attempting to build .dvi or pdf targets after you’d cleaned the cweb directory failed for lack of cweave; I added a dependency to Makefile.in to ensure it’s rebuilt when needed.

Since certain recent versions of gcc libraries have begun to natter if C++ include files specify the .h extension (which, for years, was required by those self-same libraries), I eliminated them from our list of includes, which finally seems to work on gcc 2.96. Doubtless this will torpedo somebody using an earlier version.

Broke up the unreadably monolithic list of include files into sections which explain what’s what.

Dooooh! Forgot to disable the declaration of the pdfTextExtractor in mailFolder when HAVE_PDF_DECODER was not defined, which was the undoing of the Win32 build; fixed.
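The kind of fix described above can be sketched as follows. This is only an illustration under stated assumptions: MailFolderSketch, canDecodePdf, and the body of pdfTextExtractor are hypothetical stand-ins, since the log does not show the real declarations. Only the HAVE_PDF_DECODER guard around the member mirrors what the entry describes.

```cpp
#include <cassert>
#include <string>

#ifdef HAVE_PDF_DECODER
// Hypothetical stand-in; the real pdfTextExtractor is not shown in the log.
class pdfTextExtractor {
public:
    std::string extract(const std::string &pdfData) { return pdfData; }
};
#endif

// Hypothetical folder class: the member is declared only when the decoder
// was configured, so builds without it never mention the missing type.
class MailFolderSketch {
public:
    bool canDecodePdf() const {
#ifdef HAVE_PDF_DECODER
        return true;
#else
        return false;
#endif
    }

private:
#ifdef HAVE_PDF_DECODER
    pdfTextExtractor pdfDecoder;
#endif
};
```

Guarding the declaration as well as its uses is the point: a build without the decoder fails at compile time if either one slips through.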

Release 0.1-RC5.

2002 October 19

Added a check in classifyMessages to verify that a dictionary has been loaded before attempting to classify a message. If no dictionary is present, a warning is written to standard error and the junk probability is returned as 0.5.
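A minimal sketch of such a guard, assuming a dictionary that maps words to junk probabilities. The Dictionary alias and the classifyToken function are hypothetical; the actual classifyMessages code is not shown in this log, and the lookup here is only a placeholder for the real classification.

```cpp
#include <cassert>
#include <iostream>
#include <map>
#include <string>

// Hypothetical stand-in for the program's dictionary: word -> junk probability.
using Dictionary = std::map<std::string, double>;

// With no dictionary loaded, warn on standard error and return the neutral
// probability 0.5. Otherwise look the token up; unknown tokens are treated as
// neutral too (an assumption of this sketch).
double classifyToken(const Dictionary &dict, const std::string &token) {
    if (dict.empty()) {
        std::cerr << "Warning: no dictionary loaded; cannot classify.\n";
        return 0.5;
    }
    auto it = dict.find(token);
    if (it == dict.end()) {
        return 0.5;
    }
    assert(it->second >= 0.0 && it->second <= 1.0); // stored values are probabilities
    return it->second;
}
```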

Added a warning if command line arguments are specified after a --classify command. Since this command always exits with an exit code indicating the classification, specifying subsequent arguments is always an error.

Added a bunch of consistency checking for combinations of options which don’t make any sense and suggest the user doesn’t understand in which order they should be specified. To facilitate this, I modified the code for the --classify option to set a new lastOption flag to bail out of the option processing loop and set exitStatus to the classification rather than exiting directly before the option consistency checks are performed. This cleans up the control structure in any case.
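The control structure described above might look roughly like this. It is a sketch, not the program's code: processOptions, OptionResult, and the classification parameter are hypothetical, and only the lastOption flag and exitStatus variable are names taken from the log.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

struct OptionResult {
    int exitStatus = 0;        // status the program would eventually exit with
    bool trailingArgs = false; // true if arguments followed --classify
};

// classification stands in for the result the real classifier would produce.
OptionResult processOptions(const std::vector<std::string> &args,
                            int classification) {
    OptionResult result;
    bool lastOption = false;
    std::size_t i = 0;
    for (; i < args.size() && !lastOption; ++i) {
        if (args[i] == "--classify") {
            // Record the status and leave the loop instead of exiting here.
            result.exitStatus = classification;
            lastOption = true;
        }
        // ... other options would be handled here ...
    }
    // Anything left unprocessed after --classify is an error worth a warning.
    result.trailingArgs = lastOption && i < args.size();
    // ... option consistency checks would run here, before the real exit ...
    return result;
}
```

Because the loop ends through the flag rather than a direct exit, the consistency checks and the trailing-argument warning both get a chance to run.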

In the process of adding the above code, I discovered that the any() method of bitset seems to be broken in the glibc which accompanies gcc 2.96. I tested count() against zero and that seems to work OK.
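The workaround amounts to something like the following helper. The name anySet is hypothetical; on a conforming library it gives the same answer as any().

```cpp
#include <bitset>
#include <cassert>
#include <cstddef>

// Equivalent to bits.any(), expressed through count() as described above.
template <std::size_t N>
bool anySet(const std::bitset<N> &bits) {
    return bits.count() != 0;
}
```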

Implemented phrase tokens. You can consider phrases of consecutive tokens as primitive tokens by specifying the minimum and maximum words composing a phrase with the --phrasemin and --phrasemax options. These default to 1 and 1, which suppresses all phrase-related flailing around. If set otherwise, tokens are assembled into a queue and all phrases within the length bounds are emitted as tokens. How well this works is a research question we may now address with the requisite tool in hand.
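One way to realise the queue-based scheme is sketched below. The function name phraseTokens, the single-space separator between words, and the order in which phrases are emitted are all assumptions of this sketch; the log does not specify them.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Emit every run of phraseMin..phraseMax consecutive tokens as one token.
std::vector<std::string> phraseTokens(const std::vector<std::string> &tokens,
                                      std::size_t phraseMin,
                                      std::size_t phraseMax) {
    assert(phraseMin >= 1 && phraseMin <= phraseMax);
    std::vector<std::string> out;
    std::deque<std::string> window; // holds at most the last phraseMax tokens
    for (const std::string &tok : tokens) {
        window.push_back(tok);
        if (window.size() > phraseMax) {
            window.pop_front();
        }
        // Emit each phrase of an allowed length that ends at the newest token,
        // so every phrase in the stream is produced exactly once.
        for (std::size_t n = phraseMin;
             n <= phraseMax && n <= window.size(); ++n) {
            std::string phrase;
            for (std::size_t i = window.size() - n; i < window.size(); ++i) {
                if (!phrase.empty()) {
                    phrase += ' ';
                }
                phrase += window[i];
            }
            out.push_back(phrase);
        }
    }
    return out;
}
```

With both bounds set to 1 the function simply returns the input tokens, matching the default behaviour described above. With bounds 1 and 2, the tokens "buy", "cheap", "pills" yield "buy", "cheap", "buy cheap", "pills", "cheap pills".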

2002 October 20

Added code to import a binary dictionary file with the --read option using memory-mapped I/O if ./configure detects that facility and defines HAVE_MMAP. This isn’t a big win on individual runs of the program, but if you’re installing it on a high volume server, multiple read-only references to the dictionary file (be sure to make the file read-only, by the way) can simply bring the file into memory where it is re-used by multiple instances of the program. (Of course, if the system has an efficient file system cache, that may work just as well, but there’s no harm in memory mapping in any case.) Thanks to the C++ theologians who deprecated the incredibly useful strstream facility, which is precisely what you need to efficiently access a block of memory mapped data as a stream, I included a copy of the definition of this facility in mystrstream.h so we don’t have to depend on the C++ library providing it.
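For illustration, here is a sketch of reading a memory-mapped file through a stream on a POSIX system. It does not use the strstream copy in mystrstream.h; it substitutes a small std::streambuf subclass that serves the same purpose. MappedFile and MemoryBuf are hypothetical names, and the real program would compile such code only when HAVE_MMAP is defined.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <streambuf>

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Read-only stream buffer over an existing block of memory.
class MemoryBuf : public std::streambuf {
public:
    MemoryBuf(const char *base, std::size_t size) {
        // setg() takes char *, but this buffer never writes through it.
        char *p = const_cast<char *>(base);
        setg(p, p, p + size);
    }
};

// Maps a whole file read-only; the mapping is released by the destructor.
class MappedFile {
public:
    explicit MappedFile(const char *path) {
        int fd = open(path, O_RDONLY);
        if (fd < 0) {
            return;
        }
        struct stat st;
        if (fstat(fd, &st) == 0 && st.st_size > 0) {
            std::size_t len = static_cast<std::size_t>(st.st_size);
            void *p = mmap(nullptr, len, PROT_READ, MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) {
                data_ = static_cast<const char *>(p);
                size_ = len;
            }
        }
        close(fd); // the mapping stays valid after the descriptor is closed
    }
    ~MappedFile() {
        if (data_ != nullptr) {
            munmap(const_cast<char *>(data_), size_);
        }
    }
    MappedFile(const MappedFile &) = delete;
    MappedFile &operator=(const MappedFile &) = delete;

    bool ok() const { return data_ != nullptr; }
    const char *data() const { return data_; }
    std::size_t size() const { return size_; }

private:
    const char *data_ = nullptr;
    std::size_t size_ = 0;
};
```

To use it, construct a MappedFile, check ok(), then wrap data() and size() in a MemoryBuf and hand that to a std::istream. The stream reads straight from the mapped pages, so several processes mapping the same read-only file can share one copy in memory.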

I was a little worried about writing phrases in CSV format without quoting the fields, but I did an experiment with Excel and discovered it doesn’t quote such fields either; it only uses quotes if the cell
