
The Annoyance Filter.pdf - Fourmilab



5. Phrase-based classification.

annoyance-filter can classify messages based on occurrences of multiple-word phrases as well as
individual words. The table below gives results from an empirical test comparing classification by
single-word frequencies alone with classification by phrases of one to two words, one to three words,
and two to three words. With this test set (compiled by hand-sorting three years of legitimate and junk
mail), adding classification by two-word phrases reduces the number of false negatives (junk mail
erroneously classified as legitimate) by more than 90%, while preserving 100% accuracy in identifying
legitimate mail.

Folder  --phrasemin  --phrasemax  Total  Mail  Junk    Prob
Junk              1            1   8957    37  8920  0.9970
Mail              1            1   2316  2316     0  0.0000
Junk              1            2   8957     3  8954  0.9997
Mail              1            2   2316  2316     0  0.0000
Junk              1            3   8957     9  8948  0.9983
Mail              1            3   2316  2316     0  0.0000
Junk              2            3   8957     9  8948  0.9981
Mail              2            3   2316  2316     0  0.0000
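The "more than 90%" figure can be checked directly from the counts in the table. The following is a minimal Python sketch of that arithmetic only; the names `FALSE_NEGATIVES`, `reduction_vs_baseline`, and `junk_caught_fraction` are hypothetical and are not part of annoyance-filter. The ratio it computes is a plain count ratio and is not an attempt to reproduce the table's Prob column, which this section does not define.

```python
# Counts copied from the table above. Each key is a (phrasemin, phrasemax)
# setting; each value is the number of junk messages misclassified as mail.
JUNK_TOTAL = 8957
FALSE_NEGATIVES = {(1, 1): 37, (1, 2): 3, (1, 3): 9, (2, 3): 9}


def reduction_vs_baseline(setting, baseline=(1, 1)):
    """Fractional drop in false negatives relative to the baseline setting."""
    base = FALSE_NEGATIVES[baseline]
    return (base - FALSE_NEGATIVES[setting]) / base


def junk_caught_fraction(setting):
    """Share of the junk folder classified as junk (a plain count ratio)."""
    return (JUNK_TOTAL - FALSE_NEGATIVES[setting]) / JUNK_TOTAL


if __name__ == "__main__":
    for setting in sorted(FALSE_NEGATIVES):
        print(setting,
              f"reduction={reduction_vs_baseline(setting):.1%}",
              f"caught={junk_caught_fraction(setting):.4%}")
```

Under these counts, moving from single words to phrases of one and two words cuts false negatives from 37 to 3, a reduction of about 92%.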

There’s no need to overdo it, however. Note that extending classification to phrases of up to three
words actually slightly reduced the accuracy with which junk was recognised. In most circumstances,
classifying based on phrases of one and two words will yield the best results.
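To make the phrase settings concrete, here is a rough Python sketch of one way a list of words could be expanded into phrases whose lengths run from a minimum to a maximum word count. It is an illustration under stated assumptions, not annoyance-filter's own code: the function name `phrases`, the space-joined phrase representation, and the already-tokenised input are all hypothetical, and the parameters `phrasemin` and `phrasemax` merely borrow the names of the table's column headings.

```python
def phrases(words, phrasemin=1, phrasemax=2):
    """Return every run of phrasemin..phrasemax consecutive words.

    Assumption: `words` is already a list of tokens; how the real program
    tokenises a message is outside the scope of this sketch.
    """
    if phrasemin < 1 or phrasemax < phrasemin:
        raise ValueError("require 1 <= phrasemin <= phrasemax")
    result = []
    for start in range(len(words)):
        for length in range(phrasemin, phrasemax + 1):
            if start + length > len(words):
                break  # longer phrases would run past the end of the message
            result.append(" ".join(words[start:start + length]))
    return result


if __name__ == "__main__":
    sample = ["buy", "cheap", "pills"]
    print(phrases(sample, 1, 1))  # single words only
    print(phrases(sample, 1, 2))  # words plus two-word phrases
    print(phrases(sample, 2, 3))  # two- and three-word phrases only
```

With a minimum of 1 and a maximum of 2, the three-word sample yields its three single words plus the two adjacent pairs "buy cheap" and "cheap pills". Raising the maximum adds longer, rarer phrases, which is consistent with the observation above that going beyond two words did not improve results on this test set.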
