22.02.2014 Views

Discrete Mathematics University of Kentucky CS 275 Spring ... - MGNet


Using just one word to determine whether a message is spam leads to excessive numbers of false positives and false negatives. We actually have to use the generalized Bayes theorem with a large set of words.

$$
p\Big(S \,\Big|\, \bigcap_{i=1}^{k} E_i\Big) = \frac{\prod_{i=1}^{k} p(E_i \mid S)}{\prod_{i=1}^{k} p(E_i \mid S) + \prod_{i=1}^{k} p(E_i \mid \bar{S})},
$$

which we estimate, assuming an incoming message is equally likely to be spam or not, by

$$
r(w_1, w_2, \ldots, w_k) = \frac{\prod_{i=1}^{k} p(w_i)}{\prod_{i=1}^{k} p(w_i) + \prod_{i=1}^{k} q(w_i)}.
$$


Example: The word $w_1$ = stock appears in 400 of 2000 spam messages and in just 60 of 1000 good messages. The word $w_2$ = undervalued appears in 200 of 2000 spam messages and in just 25 of 1000 good messages. Estimate the likelihood that an incoming message with both words in it is spam. From these counts, p(stock) = 400/2000 = 0.2 and q(stock) = 60/1000 = 0.06. Similarly, p(undervalued) = 200/2000 = 0.1 and q(undervalued) = 25/1000 = 0.025. So,

$$
r(\text{stock}, \text{undervalued}) = \frac{p(\text{stock})\, p(\text{undervalued})}{p(\text{stock})\, p(\text{undervalued}) + q(\text{stock})\, q(\text{undervalued})} = \frac{0.2 \cdot 0.1}{0.2 \cdot 0.1 + 0.06 \cdot 0.025} \approx 0.930 > 0.9
$$
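The arithmetic in this example can be checked with a short Python sketch. The function name `spam_estimate` and the dictionary layout are illustrative assumptions, not part of the original slides; the counts are the ones given in the example, and the formula assumes equal prior probability of spam and non-spam.

```python
from math import prod

def spam_estimate(p_values, q_values):
    """Return r = prod(p) / (prod(p) + prod(q)).

    p_values: fraction of spam messages containing each word
    q_values: fraction of good messages containing each word
    Assumes (as in the text) that spam and non-spam are equally likely a priori.
    """
    p_prod = prod(p_values)
    q_prod = prod(q_values)
    return p_prod / (p_prod + q_prod)

# Counts taken from the example (hypothetical variable names).
spam_counts = {"stock": 400, "undervalued": 200}
good_counts = {"stock": 60, "undervalued": 25}
n_spam, n_good = 2000, 1000

words = ["stock", "undervalued"]
p = [spam_counts[w] / n_spam for w in words]  # [0.2, 0.1]
q = [good_counts[w] / n_good for w in words]  # [0.06, 0.025]

r = spam_estimate(p, q)
print(f"r = {r:.3f}")  # r = 0.930
```

Because the function takes lists of any length, the same call covers the general case of k words.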

Note: Looking for particular pairs or triplets of words and treating each as a single entity is another method for filtering. For example, "enhance performance" probably indicates spam to almost anyone, but "high performance computing" probably does not indicate spam to someone in the computational sciences (though it probably would to someone working in, say, Maytag repair).
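One way to treat word pairs as single entities is to add each pair of adjacent words to the list of tokens before estimating p and q. The sketch below is only an illustration of that idea; the helper name `tokens_with_pairs` and the whitespace-based splitting are assumptions, not something specified in the slides.

```python
def tokens_with_pairs(message):
    """Return single words plus adjacent word pairs, all lowercased.

    Each pair is joined with a space so it can be counted like a single token.
    """
    words = message.lower().split()
    pairs = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + pairs

features = tokens_with_pairs("High performance computing")
print(features)
# ['high', 'performance', 'computing', 'high performance', 'performance computing']
```

Triplets could be formed the same way by zipping three shifted copies of the word list.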

