The Annoyance Filter.pdf - Fourmilab
216 DEVELOPMENT LOG ANNOYANCE-FILTER §256<br />
modified the clean target in the actual Makefile to leave cweave.c around. I also modified our own<br />
clean target to clean the cweb directory as well.<br />
Attempting to build .dvi or pdf targets after you’d cleaned the cweb directory failed for lack of cweave;<br />
I added a dependency to Makefile.in to ensure it’s rebuilt when needed.<br />
Since certain recent versions of gcc libraries have begun to natter if C++ include files specify the .h<br />
extension (which, for years, was required by those self-same libraries), I eliminated them from our list<br />
of includes, which finally seems to work on gcc 2.96. Doubtless this will torpedo somebody using an<br />
earlier version.<br />
Broke up the unreadably monolithic list of include files into sections which explain what’s what.<br />
Dooooh! Forgot to disable the declaration of the pdfTextExtractor in mailFolder when HAVE_PDF_DECODER<br />
was not defined, which was the undoing of the Win32 build; fixed.<br />
Release 0.1-RC5.<br />
2002 October 19<br />
Added a check in classifyMessages to verify that a dictionary has been loaded before attempting to<br />
classify a message. If no dictionary is present, a warning is written to standard error and the junk<br />
probability is returned as 0.5.<br />
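The guard described above can be sketched roughly as follows. This is only an illustration: the `Dictionary` typedef, the `classifyTokens` name, and the averaging used as a score are assumptions made for the example, not the program's actual classifyMessages code.

```cpp
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for the program's dictionary: token -> junk probability.
typedef std::map<std::string, double> Dictionary;

// Returns an estimated junk probability for a tokenised message.
// If no dictionary has been loaded, warn on standard error and return
// the neutral value 0.5 instead of attempting to classify.
double classifyTokens(const Dictionary &dict,
                      const std::vector<std::string> &tokens)
{
    if (dict.empty()) {
        std::cerr << "Warning: no dictionary loaded; cannot classify message."
                  << std::endl;
        return 0.5;
    }

    // Placeholder scoring (an assumption, not the program's method):
    // average the probabilities of the tokens found in the dictionary.
    double sum = 0.0;
    int found = 0;
    for (const std::string &tok : tokens) {
        Dictionary::const_iterator it = dict.find(tok);
        if (it != dict.end()) {
            sum += it->second;
            found++;
        }
    }
    return found == 0 ? 0.5 : sum / found;
}
```

Returning 0.5 keeps the result neutral: with nothing to go on, the message is called neither junk nor legitimate mail.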
Added a warning if command line arguments are specified after a --classify command. Since this command<br />
always exits with an exit code indicating the classification, specifying subsequent arguments is always<br />
an error.<br />
Added a bunch of consistency checking for combinations of options which don’t make any sense and<br />
suggest the user doesn’t understand in which order they should be specified. To facilitate this, I modified<br />
the code for the --classify option to set a new lastOption flag to bail out of the option processing<br />
loop and set exitStatus to the classification rather than exiting directly before the option consistency<br />
checks are performed. This cleans up the control structure in any case.<br />
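A minimal sketch of this control structure, under stated assumptions: `OptionResult`, `processOptions`, the warning texts, and the placeholder classification code are all invented for the example, and only two options are handled. The point is that --classify sets lastOption and records an exit status, so the consistency checks after the loop still run.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical result of option processing.
struct OptionResult {
    int exitStatus;                     // value the caller should exit with
    std::vector<std::string> warnings;  // problems found by the checks
};

// Placeholder for whatever code the classification would produce.
const int kPlaceholderClassification = 1;

OptionResult processOptions(const std::vector<std::string> &args)
{
    OptionResult r;
    r.exitStatus = 0;
    bool lastOption = false;        // set by --classify to end the loop
    bool dictionaryLoaded = false;
    std::size_t i = 0;

    for (; i < args.size() && !lastOption; i++) {
        if (args[i] == "--read") {
            dictionaryLoaded = true;        // real code would load the file
            if (i + 1 < args.size()) {
                i++;                        // skip the file name argument
            }
        } else if (args[i] == "--classify") {
            r.exitStatus = kPlaceholderClassification;
            lastOption = true;              // leave the loop, do not exit
        }
    }

    // Consistency checks run after the loop, even when --classify was seen.
    if (lastOption && i < args.size()) {
        r.warnings.push_back("arguments specified after --classify");
    }
    if (lastOption && !dictionaryLoaded) {
        r.warnings.push_back("--classify specified before a dictionary was loaded");
    }
    return r;
}
```

The caller would print any warnings and then exit with `exitStatus`, leaving a single exit path.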
In the process of adding the above code, I discovered that the any() method of bitset seems to be<br />
broken in the glibc which accompanies gcc 2.96. I tested count() against zero and that seems to work<br />
OK.<br />
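The workaround amounts to something like the following helper. The name `anyBitSet` is made up for the example; `std::bitset::count()` is standard and returns the number of bits set.

```cpp
#include <bitset>
#include <cstddef>

// Returns true if at least one bit is set, testing count() against zero
// instead of calling any().
template <std::size_t N>
bool anyBitSet(const std::bitset<N> &bits)
{
    return bits.count() != 0;
}
```

On a conforming library this gives the same answer as `bits.any()`, so the substitution is harmless where any() does work.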
Implemented phrase tokens. You can consider phrases of consecutive tokens as primitive tokens by<br />
specifying the minimum and maximum words composing a phrase with the --phrasemin and --phrasemax<br />
options. These default to 1 and 1, which suppresses all phrase-related flailing around. If set otherwise,<br />
tokens are assembled into a queue and all phrases within the length bounds are emitted as tokens. How<br />
well this works is a research question we may now address with the requisite tool in hand.<br />
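One way the queue-based phrase assembly might look is sketched below. The function name, the single space used to join words, and the order in which phrases are emitted are assumptions for illustration; the program's own tokeniser may differ on all three.

```cpp
#include <cstddef>
#include <deque>
#include <string>
#include <vector>

// Emit every phrase of phraseMin..phraseMax consecutive tokens.
// Each phrase is emitted once, at the moment its final word arrives.
std::vector<std::string> phraseTokens(const std::vector<std::string> &tokens,
                                      std::size_t phraseMin,
                                      std::size_t phraseMax)
{
    std::vector<std::string> out;
    if (phraseMin == 0 || phraseMax < phraseMin) {
        return out;                     // nonsensical bounds: emit nothing
    }

    std::deque<std::string> window;     // the most recent phraseMax tokens
    for (const std::string &tok : tokens) {
        window.push_back(tok);
        if (window.size() > phraseMax) {
            window.pop_front();
        }
        // Phrases ending at the newest token, shortest first.
        for (std::size_t len = phraseMin; len <= window.size(); len++) {
            const std::size_t start = window.size() - len;
            std::string phrase;
            for (std::size_t k = start; k < window.size(); k++) {
                if (k > start) {
                    phrase += ' ';      // assumed word separator
                }
                phrase += window[k];
            }
            out.push_back(phrase);
        }
    }
    return out;
}
```

With both bounds at 1, the output is just the input tokens. With bounds 1 and 2, the tokens make, money, fast yield "make", "money", "make money", "fast", "money fast".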
2002 October 20<br />
Added code to import a binary dictionary file with the --read option using memory-mapped I/O if<br />
./configure detects that facility and defines HAVE_MMAP. This isn’t a big win on individual runs of<br />
the program, but if you’re installing it on a high volume server, multiple read-only references to the<br />
dictionary file (be sure to make the file read-only, by the way) can simply bring the file into memory<br />
where it is re-used by multiple instances of the program. (Of course, if the system has an efficient file<br />
system cache, that may work just as well, but there’s no harm in memory mapping in any case.) Thanks<br />
to the C++ theologians who deprecated the incredibly useful strstream facility, which is precisely what<br />
you need to efficiently access a block of memory mapped data as a stream, I included a copy of the<br />
definition of this facility in mystrstream.h so we don’t have to depend on the C++ library providing it.<br />
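For comparison, reading an existing block of memory as a stream can also be done in standard C++ without strstream, using a small stream buffer. This is a substitute technique shown for illustration, not the mystrstream.h approach described above; the `MemoryBuf` name is invented, and the block is assumed to have been obtained elsewhere (for instance from mmap).

```cpp
#include <cstddef>
#include <istream>
#include <streambuf>

// Read-only stream buffer over an existing block of memory.
// No copy is made; the block must outlive the buffer.
class MemoryBuf : public std::streambuf {
public:
    MemoryBuf(const char *data, std::size_t length)
    {
        // setg() takes non-const pointers, but this class defines only
        // a get area and no put area, so the block is never written.
        char *p = const_cast<char *>(data);
        setg(p, p, p + length);
    }
};
```

To use it, construct `MemoryBuf buf(ptr, len);` and then `std::istream in(&buf);`, after which `in` can be read like any other input stream. This sketch does not override `seekoff` or `seekpos`, so seeking within the block is not supported.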
I was a little worried about writing phrases in CSV format without quoting the fields, but I did an<br />
experiment with Excel and discovered it doesn’t quote such fields either—it only uses quotes if the cell