
and
$$\text{Prob}(A|B) = \frac{\text{Prob}(B|A)\,\text{Prob}(A)}{\text{Prob}(B)} \approx \frac{0.99 \times 0.001}{0.0210} = 0.0471.$$

Even though the test fails to detect a defective product only 1% of the time when it is defective, and claims that a product is defective when it is not only 2% of the time, the test is correct only 4.7% of the time when it says a product is defective. This comes about because of the low frequency of defective products.

The words prior, a posteriori, and likelihood come from Bayes' theorem:
$$\text{a posteriori} = \frac{\text{likelihood} \times \text{prior}}{\text{normalizing constant}}, \qquad \text{Prob}(A|B) = \frac{\text{Prob}(B|A)\,\text{Prob}(A)}{\text{Prob}(B)}.$$

The a posteriori probability is the conditional probability of A given B. The likelihood is the conditional probability Prob(B|A).
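As a quick numerical check of the calculation above, here is a minimal Python sketch; the 0.1% defect rate, 1% false-negative rate, and 2% false-positive rate are the figures from the example.

```python
# Bayes' rule for the defective-product test, using the rates from the example:
# a 0.1% defect rate, a 1% false-negative rate, and a 2% false-positive rate.
p_defective = 0.001                    # prior Prob(A)
p_pos_given_defective = 0.99           # likelihood Prob(B|A)
p_pos_given_ok = 0.02                  # false-positive rate Prob(B|not A)

# Normalizing constant Prob(B): total probability the test says "defective".
p_pos = p_pos_given_defective * p_defective + p_pos_given_ok * (1 - p_defective)

posterior = p_pos_given_defective * p_defective / p_pos   # Prob(A|B)
print(f"Prob(B)   = {p_pos:.4f}")      # ~ 0.0210
print(f"Prob(A|B) = {posterior:.3f}")  # ~ 0.047, the 4.7% above
```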

Unbiased Estimators

Consider $n$ samples $x_1, x_2, \ldots, x_n$ from a Gaussian distribution of mean $\mu$ and variance $\sigma^2$. For this distribution, $m = \frac{x_1 + x_2 + \cdots + x_n}{n}$ is an unbiased estimator of $\mu$, which means that $E(m) = \mu$, and $\frac{1}{n}\sum_{i=1}^{n}(x_i - \mu)^2$ is an unbiased estimator of $\sigma^2$. However, if $\mu$ is not known and is approximated by $m$, then $\frac{1}{n-1}\sum_{i=1}^{n}(x_i - m)^2$ is an unbiased estimator of $\sigma^2$.
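To see the difference between the $1/n$ and $1/(n-1)$ estimators concretely, the following is a small simulation sketch in Python; the choices n = 5, mu = 2, sigma = 3, and 200,000 trials are arbitrary illustration values, not from the text.

```python
import numpy as np

# Empirical check of the estimators above: draw many batches of n Gaussian
# samples and average each estimator over the batches.
rng = np.random.default_rng(0)
n, mu, sigma, trials = 5, 2.0, 3.0, 200_000
x = rng.normal(mu, sigma, size=(trials, n))

m = x.mean(axis=1)                                            # sample mean, estimates mu
var_known_mu = ((x - mu) ** 2).mean(axis=1)                   # (1/n) sum (x_i - mu)^2
var_biased   = ((x - m[:, None]) ** 2).sum(axis=1) / n        # (1/n) sum (x_i - m)^2
var_unbiased = ((x - m[:, None]) ** 2).sum(axis=1) / (n - 1)  # (1/(n-1)) sum (x_i - m)^2

print(m.mean())             # ~ 2.0 : E(m) = mu
print(var_known_mu.mean())  # ~ 9.0 : unbiased when mu is known
print(var_biased.mean())    # ~ 7.2 : biased low, expected value is (n-1)/n * sigma^2
print(var_unbiased.mean())  # ~ 9.0 : unbiased when mu is replaced by m
```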

Maximum Likelihood Estimation (MLE)

Suppose the probability distribution of a random variable x depends on a parameter r. With slight abuse of notation, since r is a parameter rather than a random variable, we denote the probability distribution of x as p(x|r). This is the likelihood of observing x if r were in fact the parameter value. The job of the maximum likelihood estimator, MLE, is to find the best r after observing values of the random variable x. The likelihood of r being the parameter value given that we have observed x is denoted L(r|x). This is again not a probability, since r is a parameter, not a random variable. However, if we were to apply Bayes' rule as if this were a conditional probability, we would get
$$L(r|x) = \frac{\text{Prob}(x|r)\,\text{Prob}(r)}{\text{Prob}(x)}.$$

Now, assume Prob(r) is the same for all r. The denominator Prob(x) is the absolute probability of observing x and is independent of r. So to maximize L(r|x), we just maximize Prob(x|r). In some situations, one has a prior guess as to the distribution Prob(r). This is then called the "prior," and in that case we call Prob(x|r) Prob(r) the posterior, which we try to maximize.
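As a concrete illustration of this recipe (not an example from the text), the sketch below estimates the mean r of a Gaussian with known variance by maximizing log Prob(x|r) over a grid of candidate values; the maximizer agrees, up to the grid resolution, with the sample mean.

```python
import numpy as np

# Illustrative MLE: estimate the mean r of a Gaussian with known variance by
# maximizing log Prob(x|r) over a grid of candidate values.  The true mean,
# variance, sample size, and grid are arbitrary illustration choices.
rng = np.random.default_rng(1)
true_r, sigma = 1.5, 1.0
x = rng.normal(true_r, sigma, size=100)          # observed data

candidates = np.linspace(-5, 5, 2001)
# Log-likelihood of each candidate r, summed over the observations.
# Constant terms are dropped since they do not affect the argmax.
log_lik = np.array([-((x - r) ** 2).sum() / (2 * sigma**2) for r in candidates])

r_mle = candidates[np.argmax(log_lik)]
print(r_mle, x.mean())   # the grid maximizer matches the sample mean
```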

