The Annoyance Filter.pdf - Fourmilab


§256 ANNOYANCE-FILTER DEVELOPMENT LOG

the iterator argument to the erase become invalid; in certain cases (but not always) an iterator to the previous item, not erased, becomes invalid, leading to perdition when you attempt to pick up the scan for purgeable words from that point. After a second tussle with remove_if, no more fruitful than the last (for further detail, see the dictionary::purge implementation), I gave up and rewrote purge to resume the scan from the start of the set every time it erases a member. This may not be efficient, but at least it doesn't crash! In circumstances where a large percentage of the dictionary is going to be purged, it would probably be better to scan for contiguous groups of words eligible for purging, then erase them with the flavour of the method which takes a start and end iterator, but given how infrequently −−purge is likely to be used, I don't think it's worth the complication.
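The restart-on-erase approach described above can be sketched as follows. This is a minimal illustration of the idiom, not the actual dictionary::purge code: the function name, the `std::map<std::string, int>` container, and the count threshold are all assumptions made for the example.

```cpp
#include <map>
#include <string>

// Sketch of the restart-on-erase scan: after each erase the scan resumes
// from begin(), so no potentially-invalidated iterator is ever reused.
// purgeRare and the minCount criterion are illustrative only.
int purgeRare(std::map<std::string, int> &dict, int minCount) {
    int purged = 0;
    bool erasedOne = true;
    while (erasedOne) {
        erasedOne = false;
        for (auto it = dict.begin(); it != dict.end(); ++it) {
            if (it->second < minCount) {
                dict.erase(it);   // "it" is now invalid...
                ++purged;
                erasedOne = true;
                break;            // ...so restart the scan from the top
            }
        }
    }
    return purged;
}
```

As the log notes, this is O(n) restarts in the worst case, but it sidesteps the iterator-invalidation trap entirely.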

In a fit of false economy, I accidentally left the door open to the possibility that with an improbable albeit conceivable sequence of options we might try to classify a message without updating the probabilities in the dictionary to account for words added in this run. I added calls on updateProbability() in the appropriate places to guarantee this cannot happen. The only circumstances in which this will result in redundant computation of probabilities is while building dictionaries, and the probability computation time is trivial next to the I/O and parsing in that process.

In the normal course of events the vast majority of runs of the program will load a single dictionary and use it to classify a single message. Since we've guaranteed that the probabilities will always be updated before they're written to a file, there's no need to recompute the probabilities when we're only importing a single dictionary. I added a check for this and optimised out the probability computation. When merging dictionaries with multiple −−read and/or −−csvread commands, the probability is recomputed after adding words to the dictionary.
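The guard described in the last two entries amounts to a dirty flag: recompute only when words have been added since the last update, so a run that merely loads one already-up-to-date dictionary pays nothing. A minimal sketch, assuming an illustrative Dictionary class (the real class and members differ):

```cpp
// Sketch of the "update before use" guard with a dirty flag.
// Class and member names are illustrative, not the program's actual API.
class Dictionary {
public:
    void addWord() { dirty = true; }
    void updateProbability() {
        if (dirty) {              // cheap no-op when nothing has changed
            ++recomputes;         // stands in for the real recomputation
            dirty = false;
        }
    }
    int recomputeCount() const { return recomputes; }

private:
    int recomputes = 0;
    bool dirty = false;
};
```

Calling updateProbability() before every classification and every dictionary write is then always safe: redundant calls cost one boolean test.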

If you used a dictionary in which rare words had not been removed with −−purge to classify a message, you got screwball results because the −1 probability used to flag rare words was treated as if it were genuine. It occurred to me that folks building a dictionary by progressive additions might want to keep unusual words around on the possibility they'd eventually be seen enough times to assign a significant probability. I fixed 〈Classify message tokens by probability of significance 188〉 to treat words with a probability of −1 as if they had not been found, thus simulating the effect of a −−purge. Minor changes were also required to CSV import to avoid confusion between rare words and the pseudo-word used to store message counts. Note that it's still more efficient to −−purge the dictionary you use on classification runs, but if you don't want to keep separate purged and unpurged dictionaries around, you don't need to any more.

Added a new −−annotate option, which takes an argument consisting of one or more single-character flags (case insensitive) which request annotations to be added to the −−transcript. The first such flag is “w”, which adds the list of words and probabilities used to rank the message in the same form as included in the −−pdiag report. To avoid duplication, I broke the code which generates the word list out into a new addSignificantWordDiagnostics method of classifyMessage.

Added a “p” annotation which causes parser diagnostics to be included in the −−transcript. This gets rid of all the conditional compilation based on PARSE_DEBUG and automatically copies the diagnostics to standard error if verbose is set. Parser diagnostics are reported with the reportParserDiagnostic method of mailFolder; other classes which report errors do so via a pointer to the mailFolder they’re acting on behalf of.
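The delegation pattern in the last sentence can be sketched as follows. The mailFolder class name and reportParserDiagnostic method are taken from the log; everything else (the mimeDecoder helper, the diagnostic text, collecting messages in a vector) is an assumption for illustration:

```cpp
#include <string>
#include <vector>

// Sketch: one class owns the diagnostic transcript, and helper classes
// report through a pointer to the folder they act on behalf of.
class mailFolder {
public:
    void reportParserDiagnostic(const std::string &msg) {
        diagnostics.push_back(msg);   // a real implementation would write
                                      // to the transcript / standard error
    }
    std::vector<std::string> diagnostics;
};

class mimeDecoder {                   // illustrative helper class
public:
    explicit mimeDecoder(mailFolder *owner) : folder(owner) {}
    void decode() {
        // ...on a malformed part, report through the owning folder:
        folder->reportParserDiagnostic("Malformed MIME boundary");
    }
private:
    mailFolder *folder;
};
```

Centralising the reporting in one method is what makes the later “copy to standard error if verbose” behaviour a one-line change.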

Well, my sleazy reset-to-the-beginning trick for dictionary purge really was intolerably slow for real-world dictionaries. I pitched the whole mess and replaced it with code which makes a queue of the words we wish to leave in the dictionary, then does a clear on the dictionary and re-inserts the items which survived. This is simple enough to entirely avoid map iterator hooliganism and runs like lightning, albeit using more memory.
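The collect-clear-reinsert approach can be sketched as below. As with the earlier example, the function name, container type, and minCount criterion are illustrative stand-ins for the real dictionary::purge:

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Sketch of the replacement purge: copy the survivors into a queue,
// clear() the whole container, then re-insert them. No iterator ever
// outlives an erase, at the cost of temporary extra memory.
std::size_t purgeByRebuild(std::map<std::string, int> &dict, int minCount) {
    std::vector<std::pair<std::string, int>> keep;
    for (const auto &entry : dict) {
        if (entry.second >= minCount) {
            keep.push_back(entry);
        }
    }
    std::size_t purged = dict.size() - keep.size();
    dict.clear();
    dict.insert(keep.begin(), keep.end());
    return purged;
}
```

One pass over the dictionary plus a bulk re-insert replaces the quadratic restart-from-begin scan.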

Break out the champagne! The detestable MIME_DEBUG conditional compilation is now a thing of the past, supplanted by a new “d” −−annotate flag. No need to recompile every time you’re inclined to psychoanalyse a message the parser spit up.
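With all three annotation flags now described (“w”, “p”, “d”), the case-insensitive single-character parsing of the −−annotate argument can be sketched like this. The bitmask names and values are assumptions for the example, not the program's actual flags:

```cpp
#include <cctype>
#include <string>

// Sketch of parsing an --annotate argument of single-character,
// case-insensitive flags: "w" word list, "p" parser diagnostics,
// "d" MIME diagnostics. Bitmask names and values are illustrative.
enum AnnotateFlags {
    ANNOTATE_WORDS  = 1,
    ANNOTATE_PARSER = 2,
    ANNOTATE_MIME   = 4
};

int parseAnnotateFlags(const std::string &arg) {
    int flags = 0;
    for (char c : arg) {
        switch (std::tolower(static_cast<unsigned char>(c))) {
        case 'w': flags |= ANNOTATE_WORDS;  break;
        case 'p': flags |= ANNOTATE_PARSER; break;
        case 'd': flags |= ANNOTATE_MIME;   break;
        default:  break;   // a real implementation would diagnose this
        }
    }
    return flags;
}
```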
