The Annoyance Filter.pdf - Fourmilab
§256 ANNOYANCE-FILTER DEVELOPMENT LOG 223<br />
Now, of course, we must deal with this. I installed the fdstream.hpp package developed by Nicolai<br />
M. Josuttis in the source directory, extending it to permit declaration of fdistream and fdostream<br />
objects with a default file descriptor of zero, which can be specified later by a new attach method, thus<br />
requiring fewer changes to existing code which uses the fstream::attach mechanism. There is little or<br />
no error checking: you can screw things up mightily by swapping file descriptors on the fly, but then<br />
you could before with fstream::attach!<br />
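The idea of a descriptor-based input stream that defaults to descriptor zero and can be re-targeted later might be sketched as follows. This is a minimal illustration and not the fdstream.hpp code or the extension described above: the class names `FdInBuf` and `FdIStream` are hypothetical, the buffer size is arbitrary, and, like the original, it does almost no error checking.

```cpp
#include <istream>
#include <streambuf>
#include <string>
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // POSIX read(), pipe(), write(), close()

// Hypothetical input buffer over a POSIX file descriptor (illustration only).
class FdInBuf : public std::streambuf {
public:
    explicit FdInBuf(int fd = 0) : fd_(fd) { resetArea(); }

    // Switch to another descriptor, discarding any input already buffered.
    void attach(int fd) { fd_ = fd; resetArea(); }

protected:
    int_type underflow() override {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        ssize_t n = ::read(fd_, buf_, sizeof buf_);
        if (n <= 0) {
            return traits_type::eof();  // end of file or read error
        }
        setg(buf_, buf_, buf_ + n);
        return traits_type::to_int_type(*gptr());
    }

private:
    void resetArea() { setg(buf_, buf_, buf_); }  // empty get area

    int fd_;
    char buf_[4096];
};

// Hypothetical stream wrapper: defaults to descriptor 0, re-targetable later.
class FdIStream : public std::istream {
public:
    explicit FdIStream(int fd = 0)
        : std::istream(static_cast<std::streambuf*>(nullptr)), buf_(fd) {
        rdbuf(&buf_);  // install the buffer once it has been constructed
    }

    // Point the stream at a new descriptor and reset its state flags.
    void attach(int fd) { buf_.attach(fd); clear(); }

private:
    FdInBuf buf_;
};
```

Code that previously declared a stream and later called attach could then declare an `FdIStream` with no arguments and call `attach(fd)` once the pipe is open. Swapping descriptors while buffered input is pending silently discards that input.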
To test this class and dip my toe into the acid bath of post-fstream::attach plumbing, I modified<br />
pdfTextExtractor to use fdistream to read the pipe from pdftotext, which is a simpler case than<br />
the tangle associated with compressed file decoding. This worked the first time, meaning I should<br />
look over my shoulder when migrating the attach references in the compressed file code to the new<br />
mechanism. Note that the existing code has lots of ad hoc tweaks, all tagged with OLDWAY, to enable<br />
the currently-working code. Before we’re ready to ship, all of the OLDWAY dust-bunnies should be<br />
cleaned up and a clean build and regression test run on 2.96 and 3.2.2 parameterised exclusively by the<br />
configure script.<br />
Added code to mailFolder to use a new fdistream to read the pipe when decompressing mail folder<br />
files and compressed files in mail directories.<br />
In the gcc 3.2.2 library, closing and opening an ifstream does not clear ios::eofbit in the descriptor as<br />
it used to. (I consider this a stone bug: when you close one file and open another, only an idiot would<br />
consider the end of file condition from the previous file still asserted.) In any case, I added a clear() of<br />
the ifstream we use while traversing a directory in 〈 Advance to next file if traversing directory 138 〉<br />
so this doesn’t sabotage reading messages in a directory.<br />
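A minimal sketch of the workaround, assuming one ifstream is reused for every file in a directory. The function `countLinesPerFile` is hypothetical and stands in for the directory traversal code; it only shows where the clear() call goes.

```cpp
#include <fstream>
#include <string>
#include <vector>

// Count the lines in each named file, reusing a single std::ifstream.
// Hypothetical helper for illustration; not part of the program itself.
std::vector<int> countLinesPerFile(const std::vector<std::string>& paths) {
    std::vector<int> counts;
    std::ifstream in;
    for (const std::string& path : paths) {
        in.close();  // on the first pass nothing is open; this only sets failbit
        in.clear();  // drop eofbit/failbit left over from the previous file
        in.open(path.c_str());

        int n = 0;
        std::string line;
        while (std::getline(in, line)) {
            ++n;
        }
        counts.push_back(n);  // stays 0 if the file could not be opened
    }
    return counts;
}
```

Without the clear() call, a library that leaves eofbit set across close and open would report every file after the first as empty. Where the library already resets the state, the extra call is harmless.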
Re-tested directory traversal, with and without compressed files in the directory, on gcc 2.96 and 3.2.2<br />
to verify the modified code works on both. It does.<br />
2003 February 18<br />
Added logic to configure.in to test whether the C++ library is compatible with the fdstream.hpp<br />
package. If so, we use it; otherwise we assume it’s an old library which supports the attach method for<br />
fstream I/O. <strong>The</strong> config.h.in variable HAVE_FDSTREAM_COMPATIBILITY will be defined if fdstream.hpp<br />
is to be used.<br />
Added a test to configure.in which determines whether the C++ library is compatible with the new<br />
mystrstream_new.h. If so, it’s included. Otherwise, the earlier mystrstream.h is used as before. If<br />
the new strstream package works, HAVE_NEW_STRSTREAM is defined in config.h.in.<br />
With these changes, the source configures and builds correctly on gcc 2.96 and 3.2.2 without any tweaks<br />
or changes.<br />
As suggested by Kern Sibbald, I changed the default --phraselimit to 48 characters.<br />
As reported by Jim Hamilton, some mail systems which store individual messages as separate files in<br />
folder directories do not prefix each message file with the “From␣” sentinel we were counting on to mark<br />
message boundaries. This resulted in bad message counts, affecting probability computation and, worse,<br />
failure to reset decoder modes, etc. after a malformed message. I added a new expectingNewMessage<br />
flag, which is set at the start of every new file mailFolder reads (whether a composite mail folder or<br />
a file within a directory). When expectingNewMessage is set, the first line of the file with a nonblank<br />
character in the leftmost character position is considered the start of a new message regardless of its<br />
contents.<br />
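The flag logic described above might look roughly like this. It is a sketch under stated assumptions, not the mailFolder code: the class `MessageCounter` and its methods are hypothetical, "nonblank" is taken to mean "not whitespace", and once a file is under way any line beginning with "From " is treated as a message boundary.

```cpp
#include <cctype>
#include <string>

// Hypothetical message counter illustrating the expectingNewMessage flag.
class MessageCounter {
public:
    MessageCounter() : expectingNewMessage(false), messages(0) {}

    // Call at the start of every file, whether a composite folder
    // or a single message file within a directory.
    void startFile() { expectingNewMessage = true; }

    // Feed one line of the current file (without its trailing newline).
    void line(const std::string& s) {
        bool startsMessage = false;
        if (expectingNewMessage) {
            // The first line with a nonblank character in column one starts
            // a message, whatever it contains.
            if (!s.empty() && !std::isspace(static_cast<unsigned char>(s[0]))) {
                startsMessage = true;
                expectingNewMessage = false;
            }
        } else if (s.compare(0, 5, "From ") == 0) {
            startsMessage = true;  // ordinary sentinel inside a folder file
        }
        if (startsMessage) {
            ++messages;  // decoder modes would also be reset at this point
        }
    }

    int count() const { return messages; }

private:
    bool expectingNewMessage;
    int messages;
};
```

A file that begins directly with header lines such as `Return-Path:` is then counted as one message, and a file that does begin with a "From " line is still counted only once, because the flag is cleared by the first qualifying line.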
2003 February 19<br />
Added the ability to parse a composite mail folder file using either pure BSD format (“From␣” always denotes<br />
start of message and is quoted in every other case) or “consensus UNIX” format, where “From␣” only<br />
marks the start of a new message when it appears after a blank line. Sun “Content-Length:” folders