The Annoyance Filter.pdf - Fourmilab


§256 ANNOYANCE-FILTER DEVELOPMENT LOG

Now, of course, we must deal with this. I installed the fdstream.hpp package developed by Nicolai M. Josuttis in the source directory, extending it to permit declaration of fdistream and fdostream objects with a default file descriptor of zero; the actual descriptor can be specified later by a new attach method, thus requiring fewer changes to existing code which uses the fstream::attach mechanism. There is little or no error checking: you can screw things up mightily by swapping file descriptors on the fly, but then you could before with fstream::attach!
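To make the idea concrete, here is a minimal sketch of an input stream that reads from a POSIX file descriptor and can be re-pointed with an attach method. It is not the fdstream.hpp source or the extension described above: the class names FdInBuf and FdIStream, the buffer sizes, and the decision to discard buffered data on attach are all assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>      // std::memmove
#include <istream>
#include <streambuf>
#include <sys/types.h>  // ssize_t
#include <unistd.h>     // POSIX read()

// Stream buffer that pulls characters from a file descriptor.
// Hypothetical sketch, not the actual fdstream.hpp code.
class FdInBuf : public std::streambuf {
public:
    explicit FdInBuf(int fd = 0) : fd_(fd) { resetGetArea(); }

    // Switch to another descriptor, discarding anything still buffered.
    void attach(int fd) {
        assert(fd >= 0);  // debug-only sanity check
        fd_ = fd;
        resetGetArea();
    }

protected:
    // Refill the buffer once the get area is exhausted.
    int_type underflow() override {
        if (gptr() < egptr()) {
            return traits_type::to_int_type(*gptr());
        }
        // Keep up to kPutback already-read characters so unget() works.
        std::size_t keep = static_cast<std::size_t>(gptr() - eback());
        if (keep > kPutback) keep = kPutback;
        std::memmove(buffer_ + (kPutback - keep), gptr() - keep, keep);

        ssize_t n = ::read(fd_, buffer_ + kPutback, kBufSize);
        if (n <= 0) return traits_type::eof();  // end of file or error

        setg(buffer_ + (kPutback - keep),  // start of putback area
             buffer_ + kPutback,           // next character to read
             buffer_ + kPutback + n);      // end of valid data
        return traits_type::to_int_type(*gptr());
    }

private:
    static constexpr std::size_t kPutback = 4;
    static constexpr std::size_t kBufSize = 1024;

    // Empty get area: the next read triggers underflow().
    void resetGetArea() {
        setg(buffer_ + kPutback, buffer_ + kPutback, buffer_ + kPutback);
    }

    int fd_;
    char buffer_[kPutback + kBufSize];
};

// Input stream declared with a default descriptor of zero, attachable later.
class FdIStream : public std::istream {
public:
    explicit FdIStream(int fd = 0) : std::istream(nullptr), buf_(fd) {
        rdbuf(&buf_);  // the base is built first, so install the buffer here
    }

    // Re-point the stream and reset eofbit/failbit from the old descriptor.
    void attach(int fd) {
        buf_.attach(fd);
        clear();
    }

private:
    FdInBuf buf_;
};
```

A caller would declare `FdIStream in;`, obtain a descriptor (for example the read end of a pipe), call `in.attach(fd)`, and then read with the usual istream operations. As in the log entry, nothing here stops a caller from swapping descriptors in the middle of a read.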

To test this class and dip my toe into the acid bath of post-fstream::attach plumbing, I modified pdfTextExtractor to use fdistream to read the pipe from pdftotext, which is a simpler case than the tangle associated with compressed file decoding. This worked the first time, meaning I should look over my shoulder when migrating the attach references in the compressed file code to the new mechanism. Note that the existing code has lots of ad hoc tweaks, all tagged with OLDWAY, to enable the currently-working code. Before we’re ready to ship, all of the OLDWAY dust-bunnies should be cleaned up and a clean build and regression test run on gcc 2.96 and 3.2.2, parameterised exclusively by the configure script.

Added code to mailFolder to use a new fdistream to read the pipe when decompressing mail folder files and compressed files in mail directories.

In the gcc 3.2.2 library, closing and opening an ifstream does not clear ios::eofbit in the descriptor as it used to. (I consider this a stone bug: when you close one file and open another, only an idiot would consider the end of file condition from the previous file still asserted.) In any case, I added a clear() of the ifstream we use while traversing a directory in 〈 Advance to next file if traversing directory 138 〉 so this doesn’t sabotage reading messages in a directory.
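The defensive pattern can be sketched as follows, reusing a single std::ifstream across several files. The function name countLinesInFiles and the line-counting task are illustrative assumptions, not code from the program. On library versions where open() already resets the state after a successful open, the explicit clear() is redundant but harmless.

```cpp
#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

// Count lines across several files with one reusable std::ifstream.
// Hypothetical helper, shown only to illustrate the clear() call.
std::size_t countLinesInFiles(const std::vector<std::string>& paths) {
    std::ifstream in;
    std::string line;
    std::size_t total = 0;
    for (const std::string& path : paths) {
        in.close();   // sets failbit if nothing was open; cleared just below
        in.clear();   // drop eofbit/failbit left over from the previous file
        in.open(path.c_str());
        if (!in) continue;  // skip files that cannot be opened
        while (std::getline(in, line)) ++total;
    }
    return total;
}
```

Without the clear(), a library that preserves eofbit across close() and open() would make the getline loop exit immediately on every file after the first.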

Re-tested directory traversal, with and without compressed files in the directory, on gcc 2.96 and 3.2.2 to verify the modified code works on both. It does.

2003 February 18

Added logic to configure.in to test whether the C++ library is compatible with the fdstream.hpp package. If so, we use it; otherwise we assume it’s an old library which supports the attach method for fstream I/O. The config.h.in variable HAVE_FDSTREAM_COMPATIBILITY will be defined if fdstream.hpp is to be used.

Added a test to configure.in which determines whether the C++ library is compatible with the new mystrstream_new.h. If so, it’s included. Otherwise, the earlier mystrstream.h is used as before. If the new strstream package works, HAVE_NEW_STRSTREAM is defined in config.h.in.

With these changes, the source configures and builds correctly on gcc 2.96 and 3.2.2 without any tweaks or changes.

As suggested by Kern Sibbald, I changed the default --phraselimit to 48 characters.

As reported by Jim Hamilton, some mail systems which store individual messages as separate files in folder directories do not prefix each message file with the “From␣” sentinel we were counting on to mark message boundaries. This resulted in bad message counts, affecting probability computation and, worse, failure to reset decoder modes, etc. after a malformed message. I added a new expectingNewMessage flag, which is set at the start of every new file mailFolder reads (whether a composite mail folder or a file within a directory). When expectingNewMessage is set, the first line of the file with a nonblank character in the leftmost character position is considered the start of a new message regardless of its contents.
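The flag logic can be illustrated with a small sketch that counts message starts in the lines of a single file. Only the name expectingNewMessage comes from the log; the function countMessageStarts, the line-vector interface, and the simplification that any line beginning with “From␣” starts a message are assumptions, and the real mailFolder code does considerably more.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Count message starts among the lines of one file.
// Hypothetical sketch of the flag logic, not the actual mailFolder code.
std::size_t countMessageStarts(const std::vector<std::string>& lines) {
    bool expectingNewMessage = true;  // set at the start of every new file
    std::size_t messages = 0;
    for (const std::string& line : lines) {
        // A nonblank character in the leftmost character position.
        bool startsInColumnOne =
            !line.empty() &&
            !std::isspace(static_cast<unsigned char>(line[0]));
        // Simplification: "From " at the start of a line always marks a message.
        bool hasFromSentinel = line.compare(0, 5, "From ") == 0;

        if (hasFromSentinel || (expectingNewMessage && startsInColumnOne)) {
            ++messages;
            expectingNewMessage = false;  // the first message has been seen
        }
    }
    return messages;
}
```

A file that opens directly with a header such as “Return-Path:” is counted as one message, while a composite folder with sentinels is counted once per sentinel, and a file whose first line carries the sentinel is not counted twice.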

2003 February 19

Added the ability to parse a composite mail folder file using either pure BSD format (“From␣” always denotes start of message and is quoted in every other case) or “consensus UNIX” format, where “From␣” only marks the start of a new message when it appears after a blank line. Sun “Content-Length:” folders
