12.06.2015 Views

The Annoyance Filter.pdf - Fourmilab


§4 ANNOYANCE-FILTER OPTIONS

4. Options.

Options are specified on the command line. Options are treated as commands: most instruct the program to perform some specific action. Consequently, the order in which they are specified is significant; they are processed left to right. Long options beginning with “--” may be abbreviated to any unambiguous prefix; single-letter options that take no arguments, introduced by a single “-”, may be aggregated.
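
The prefix rule for long options can be illustrated with a short Python sketch. This is not the program's own parser: the function name `resolve_long_option` is hypothetical, the option table contains only the long options named in this section (the real program may have more, so real unambiguous prefixes may need to be longer), and letting an exact match win over longer options that share the same prefix is an assumption.

```python
# Long options named in this section only; the full option list is not
# reproduced here, so this table is an illustrative assumption.
KNOWN_OPTIONS = [
    "annotate", "autoprune", "biasmail", "binword", "bsdfolder",
    "classify", "junk", "mail", "phrasemax", "prune",
    "threshjunk", "threshmail", "transcript",
]


def resolve_long_option(arg, known=KNOWN_OPTIONS):
    """Expand an abbreviated long option such as "--cl" to its full name."""
    if not arg.startswith("--"):
        raise ValueError("not a long option: " + arg)
    name = arg[2:]
    # Assumption: an exact match always wins ("--prune" is not ambiguous).
    if name in known:
        return name
    matches = [option for option in known if option.startswith(name)]
    if len(matches) == 1:
        return matches[0]
    if not matches:
        raise ValueError("unknown option: " + arg)
    raise ValueError("ambiguous option " + arg + ": " + ", ".join(matches))
```

With this table, `--cl` resolves to `classify` and `--tr` to `transcript`, while `--bi` is rejected because it matches both `biasmail` and `binword`.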

--annotate options

Add the annotations requested by the characters in options to the transcript generated by the --transcript option. Upper and lower case options are treated identically. Available annotations are:

d  Decoder diagnostics
p  Parser warnings and error messages
w  Most significant words and their probabilities

--autoprune n

As the dictionary is being built by appending mail to it with the --mail and --junk options, unique words will automatically be pruned from it whenever the dictionary exceeds approximately n bytes. This is particularly handy when loading large collections of messages with --phrasemax set greater than one, as a very large number of unique phrases may clutter the dictionary being built and exceed the memory capacity of your computer. You could split the mail collection into multiple parts and explicitly --prune after each part, but --autoprune is much more convenient.

--biasmail n

The frequency of words appearing in legitimate mail is inflated by the floating point factor n, which defaults to 2. This biases the classification of messages in favour of “false negatives” (junk mail deemed legitimate), while reducing the probability of “false positives” (legitimate mail erroneously classified as junk, which is bad). The higher the setting of --biasmail, the greater the bias in favour of false negatives will be.

--binword n

Binary character streams (for example, attachments of application-specific files, including the executable code of worm and virus attachments) are scanned, and contiguous sequences of alphanumeric ASCII characters n characters or longer are added to the list of words in the message. The dollar sign (“$”) is considered an alphanumeric character for these purposes, and words may have embedded hyphens and apostrophes, but may not begin or end with those characters. If --binword is set to zero, scanning of binary attachments is disabled entirely. The default setting is 5 characters.
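
The scanning rule for binary streams described above can be approximated with a regular expression. The following Python sketch is an illustration under stated assumptions, not the program's implementation: the name `binary_words` is hypothetical, embedded hyphens and apostrophes are assumed to occur singly between alphanumeric runs, and they are assumed to count toward the minimum length n.

```python
import re

# A run of ASCII letters, digits, or "$", optionally continued by single
# embedded hyphens or apostrophes, each followed by another such run.
# Assumption: doubled punctuation ("a--b") splits the word in two.
WORD = re.compile(rb"[A-Za-z0-9$]+(?:[-'][A-Za-z0-9$]+)*")


def binary_words(data, binword=5):
    """Return candidate words of at least `binword` characters found in `data`."""
    if binword == 0:
        # A setting of zero disables scanning of binary content entirely.
        return []
    return [
        match.group().decode("ascii")
        for match in WORD.finditer(data)
        if len(match.group()) >= binword
    ]
```

For example, `binary_words(b"\x00kernel32.dll\xff-self-extract-\x90$PATH1")` keeps `kernel32`, `self-extract`, and `$PATH1`, and drops `dll` because it is shorter than the default of 5 characters.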

--bsdfolder

The next --mail or --junk folder will be parsed using “classic BSD” rules for identifying the start of individual messages in the folder. In BSD-style folders, the text “From␣” as the leftmost characters of a line always denotes the start of a new message; any appearance of this text in any other context is always quoted, often by prefixing a “>” character. In the default UNIX folder syntax, “From␣” only marks the start of a new message if it appears following one or more blank lines. Note that you must specify --bsdfolder before each folder to be read with BSD rules; it is not a modal setting.
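
The difference between the two folder syntaxes can be shown with a small Python sketch. It is a simplified illustration, not the program's folder parser: `split_folder` is a hypothetical name, the start of the folder is assumed to count as a message boundary under both rules, and unquoting of “>From␣” lines is not attempted.

```python
def split_folder(text, bsd=False):
    """Split folder text into messages at "From " separator lines.

    bsd=True : every line starting with "From " begins a new message.
    bsd=False: such a line begins a new message only after a blank line
               (assumption: the start of the folder counts as one too).
    """
    messages = []
    current = []
    previous_blank = True
    for line in text.splitlines():
        if line.startswith("From ") and (bsd or previous_blank):
            if current:
                messages.append("\n".join(current))
            current = [line]
        else:
            current.append(line)
        previous_blank = line.strip() == ""
    if current:
        messages.append("\n".join(current))
    return messages
```

A body line such as “From here on it is body text” that directly follows another body line stays inside its message under the default rule, but starts a spurious new message under BSD rules unless the folder writer quoted it.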

--classify fname

Classify the message in fname. If its score equals or exceeds the junk threshold (see --threshjunk), “JUNK” is written to standard output and the program exits with status code 3. If the message scores less than or equal to the mail threshold (see --threshmail), “MAIL” is written to standard output and the program exits with status 0. If the message's score falls between the two thresholds, its content is deemed indeterminate; “INDT” is written to standard output and the program exits with a status of 4. The output can be used to set an environment variable in Procmail to control the disposition of the message. If fname is “-”, the message is read from standard input.
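
The three outcomes above amount to a simple decision rule, sketched here in Python. The function name `classify_verdict` is hypothetical, and the threshold values 0.9 and 0.1 are placeholders chosen for illustration; this section does not state the program's defaults for --threshjunk or --threshmail. The junk test is applied first, following the order of the description.

```python
def classify_verdict(score, thresh_junk=0.9, thresh_mail=0.1):
    """Map a message score to (text written to standard output, exit status).

    The threshold defaults are illustrative assumptions, not the
    program's documented defaults.
    """
    if score >= thresh_junk:
        return ("JUNK", 3)
    if score <= thresh_mail:
        return ("MAIL", 0)
    # Strictly between the two thresholds: indeterminate.
    return ("INDT", 4)
```

A caller such as a mail delivery script can branch either on the four-letter word or on the exit status; only status 0 means the message was deemed legitimate mail.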
