22.02.2014 Views

Discrete Mathematics University of Kentucky CS 275 Spring ... - MGNet


Now we calculate (a):

$$p(F \mid E) = \frac{p(E \mid F)\,p(F)}{p(E \mid F)\,p(F) + p(E \mid \bar{F})\,p(\bar{F})} = \frac{0.99 \cdot 10^{-5}}{0.99 \cdot 10^{-5} + 0.005 \cdot 0.99999} \approx 0.002.$$

Roughly 0.2% of people who test positive actually have the disease. Getting a positive result should not be an immediate cause for alarm (famous last words).

Now we calculate (b):

$$p(\bar{F} \mid \bar{E}) = \frac{p(\bar{E} \mid \bar{F})\,p(\bar{F})}{p(\bar{E} \mid \bar{F})\,p(\bar{F}) + p(\bar{E} \mid F)\,p(F)} = \frac{0.995 \cdot 0.99999}{0.995 \cdot 0.99999 + 0.01 \cdot 10^{-5}} \approx 0.9999999.$$

Thus, 99.99999% of people who test negative really do not have the disease.
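Both calculations can be checked numerically. The following is a minimal sketch, assuming the values that appear in the formulas above: p(F) = 10^{-5}, p(E|F) = 0.99, and p(E|not F) = 0.005. The function name `posterior` and the variable names are illustrative choices, not part of the original notes.

```python
def posterior(likelihood, prior, likelihood_given_complement):
    """Bayes' theorem for an event and its complement.

    Returns p(F|E) given p(E|F), p(F), and p(E|not F).
    """
    numerator = likelihood * prior
    denominator = numerator + likelihood_given_complement * (1.0 - prior)
    return numerator / denominator


# Values read off the formulas above (variable names are illustrative).
p_disease = 1e-5              # p(F)
p_pos_given_disease = 0.99    # p(E|F)
p_pos_given_healthy = 0.005   # p(E|not F)

# (a) p(F|E): probability of disease given a positive test.
a = posterior(p_pos_given_disease, p_disease, p_pos_given_healthy)

# (b) p(not F|not E): the same formula with every event replaced
# by its complement, so the "prior" is now p(not F).
b = posterior(1 - p_pos_given_healthy, 1 - p_disease, 1 - p_pos_given_disease)

print(f"(a) p(F|E)         = {a:.4f}")
print(f"(b) p(not F|not E) = {b:.7f}")
```

Reusing one function for both parts works because (b) is the same theorem applied to the complementary events.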


Bayesian spam filters used to be the first line of defense for email programs. Like many good things, the approach was overrun by spammers within about two years. However, it remains an interesting example of useful discrete mathematics.

The filtering involves a training period. Email messages need to be marked as Good or Bad; we denote the resulting sets of messages as G and B. Eventually the filter will mark messages for you, hopefully accurately.

The filter finds all of the words in both sets and keeps a running total of each word per set. We construct two functions $n_G(w)$ and $n_B(w)$ that return the number of messages containing the word $w$ in the G and B sets, respectively.

We use a uniform distribution over the messages in each set. The empirical probability that a spam message contains the word $w$ is $p(w) = n_B(w) / |B|$. The empirical probability that a nonspam message contains the word $w$ is $q(w) = n_G(w) / |G|$.
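The counting step can be sketched in a few lines. This is only an illustration: the training messages are invented, and splitting on whitespace after lowercasing is an assumed tokenization rule that the notes do not specify. The helper name `message_counts` is likewise hypothetical.

```python
from collections import Counter


def message_counts(messages):
    """For each word, count how many messages contain it at least once."""
    counts = Counter()
    for text in messages:
        # set() ensures a word repeated inside one message counts once.
        counts.update(set(text.lower().split()))
    return counts


# Hypothetical training data, invented for illustration only.
B = [
    "cheap watches for sale",
    "cheap loans approved today",
    "watches watches watches",
]
G = [
    "meeting moved to today",
    "lecture notes for sale at the bookstore",
]

n_B = message_counts(B)  # n_B[w]: number of Bad messages containing w
n_G = message_counts(G)  # n_G[w]: number of Good messages containing w


def p(w):
    """Empirical probability that a spam message contains w."""
    return n_B[w] / len(B)


def q(w):
    """Empirical probability that a nonspam message contains w."""
    return n_G[w] / len(G)


for word in ("cheap", "sale", "today"):
    print(word, p(word), q(word))
```

A `Counter` returns 0 for words it has never seen, so `p` and `q` are defined for every word, including ones absent from one of the sets.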

We can use p and q to estimate whether an incoming message is spam, based on a set of words that we build dynamically over time.
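The text at this point does not say how p and q are combined. One standard possibility, assuming that an incoming message is a priori equally likely to be spam or not, is to apply Bayes' theorem to a single word w, which gives the estimate p(w) / (p(w) + q(w)). The sketch below uses that assumption; the counts, the function name `spam_estimate`, and the threshold of 0.9 are all hypothetical.

```python
def spam_estimate(p_w, q_w):
    """Estimate p(spam | message contains w), assuming equal priors of 1/2.

    With p(spam) = p(not spam) = 1/2, Bayes' theorem reduces to
    p_w / (p_w + q_w).
    """
    if p_w + q_w == 0:
        return None  # w never appeared in training, so no estimate exists
    return p_w / (p_w + q_w)


# Hypothetical counts: w occurs in 250 of 2000 spam messages
# and in 5 of 1000 nonspam messages.
p_w = 250 / 2000
q_w = 5 / 1000

r = spam_estimate(p_w, q_w)

THRESHOLD = 0.9  # hypothetical cutoff for rejecting a message
is_spam = r > THRESHOLD

print(f"estimate = {r:.4f}, flagged as spam: {is_spam}")
```

Returning `None` for an unseen word keeps the caller from treating a missing estimate as evidence in either direction.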
