
Theory and Applications of Pattern Recognition
0909.402.02 / 0909.504.04, Dept. of Electrical and Computer Engineering
© 2003, Robi Polikar, Rowan University, Glassboro, NJ

Lecture 4: Bayes Classification Rule

A priori / a posteriori probabilities
Loss function
Bayes decision rule
The likelihood ratio test
Maximum a posteriori (MAP) criterion
Minimum error rate classification
Discriminant functions
Error bounds and probabilities

Background: Pattern Classification, Duda, Hart and Stork, Copyright © John Wiley and Sons, 2001. PR logo Copyright © Robi Polikar, 2001.



Today in PR

Review of Bayes theorem
Bayes decision theory
Bayes rule
Loss function & expected loss
Minimum error rate classification
Classification using discriminant functions
Error bounds & probabilities



Bayes Rule

Suppose we know P(ω1), P(ω2), P(x|ω1) and P(x|ω2), and that we have observed the value of the feature (a random variable) x. How would you decide on the "state of nature" – the type of fish – based on this information?

Bayes theory allows us to compute the posterior probabilities from the prior and class-conditional probabilities.

Likelihood: The (class-conditional) probability of observing a feature value of x, given that the correct class is ωj. All things being equal, the category with the higher class-conditional probability is more "likely" to be the correct class.

Posterior Probability: The (conditional) probability of the correct class being ωj, given that feature value x has been observed:

$$P(\omega_j \mid x) = \frac{P(x \cap \omega_j)}{P(x)} = \frac{P(x \mid \omega_j)\, P(\omega_j)}{\sum_{k=1}^{C} P(x \mid \omega_k)\, P(\omega_k)}$$

Prior Probability: The total probability of the correct class being ωj, determined based on prior experience.

Evidence: The total probability of observing the feature value x.
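
As a quick numerical illustration of this formula, here is a minimal NumPy sketch; the prior and likelihood values are made-up numbers for illustration, not taken from the lecture.

```python
import numpy as np

# Hypothetical two-class fish example: priors and class-conditional
# likelihoods p(x | w_j) evaluated at one observed feature value x.
priors = np.array([0.6, 0.4])          # P(w1), P(w2)  (assumed values)
likelihoods = np.array([0.3, 0.7])     # p(x | w1), p(x | w2) at the observed x

evidence = np.sum(likelihoods * priors)          # P(x) = sum_k p(x|w_k) P(w_k)
posteriors = likelihoods * priors / evidence     # P(w_j | x) by Bayes rule

print(posteriors)            # e.g. [0.391, 0.609]; they sum to 1
print(posteriors.argmax())   # index of the most probable class given x
```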



Bayes Decision Rule

Choose ωi if P(ωi | x) > P(ωj | x) for all j ≠ i, i, j = 1, 2, …, c.

If there are multiple features, x = {x1, x2, …, xd}:
Choose ωi if P(ωi | x) > P(ωj | x) for all j ≠ i, i, j = 1, 2, …, c.



The Loss Function

A mathematical description of how costly each action (making a class decision) is. Are certain mistakes costlier than others?

{ω1, ω2, …, ωc}: the set of states of nature (classes).
{α1, α2, …, αa}: the set of possible actions. Note that a need not equal c, because we may allow more (or fewer) actions than there are classes. For example, not making a decision (rejection) is also an action.
{λ1, λ2, …, λa}: the losses associated with each action.
λ(αi | ωj): the loss function – the loss incurred by taking action αi when the true state of nature is in fact ωj.
R(αi | x): the conditional risk – the expected loss for taking action αi:

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$$

The Bayes decision takes the action that minimizes the conditional risk!
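
A minimal sketch of the conditional-risk computation, assuming a made-up two-class loss matrix that also includes a hypothetical reject action:

```python
import numpy as np

# Hypothetical loss matrix: lam[i, j] = loss of taking action a_i when the
# true class is w_j.  Row 2 is a "reject" action with a small constant cost.
lam = np.array([[0.0, 2.0],     # a_1: decide w_1
                [1.0, 0.0],     # a_2: decide w_2
                [0.25, 0.25]])  # a_3: reject (a != c is allowed)

posteriors = np.array([0.35, 0.65])    # P(w_j | x) for the observed x (assumed)

# Conditional risk R(a_i | x) = sum_j lam(a_i | w_j) * P(w_j | x)
risks = lam @ posteriors
best_action = risks.argmin()           # Bayes decision: minimum conditional risk

print(risks)        # [1.3, 0.35, 0.25] -> rejecting is cheapest here
print(best_action)  # 2
```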



Bayes Decision Rule Using Conditional Risk

1. Compute the conditional risk

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x)$$

for each action that can be taken.

2. Select the action that has the minimum conditional risk; call this action α(x).

3. The overall risk is then

$$R = \int_{x \in X} R(\alpha(x) \mid x)\, p(x)\, dx$$

where R(α(x) | x) is the conditional risk of taking action α(x) based on the observation x, p(x) is the probability that x will be observed, and the integral runs over all possible values of x.

4. This is the Bayes risk, the minimum possible risk that can be achieved by any classifier!
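
The overall risk can be approximated numerically. The sketch below assumes two 1-D Gaussian class-conditionals with equal priors and a zero-one loss; under those assumptions the computed Bayes risk equals the Bayes error (about 0.159 here).

```python
import numpy as np
from scipy.stats import norm

# Assumed 1-D setup: two Gaussian class-conditionals and given priors.
priors = np.array([0.5, 0.5])
dists  = [norm(-1.0, 1.0), norm(1.0, 1.0)]
lam = np.array([[0.0, 1.0],          # zero-one loss for illustration
                [1.0, 0.0]])

x = np.linspace(-8, 8, 4001)                       # grid over the feature axis
dx = x[1] - x[0]
likes = np.vstack([d.pdf(x) for d in dists])       # p(x | w_j) on the grid
px = priors @ likes                                # evidence p(x)
post = likes * priors[:, None] / px                # P(w_j | x)

cond_risk = lam @ post                             # R(a_i | x) for each action
min_risk = cond_risk.min(axis=0)                   # risk of the Bayes action at each x

bayes_risk = np.sum(min_risk * px) * dx            # R = int R(a(x)|x) p(x) dx
print(bayes_risk)                                  # ~0.159 for this setup
```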



Two-Class Special Case

Definitions:
α1: decide on ω1;  α2: decide on ω2
λij = λ(αi | ωj): the loss for deciding on ωi when the state of nature (SON) is ωj

Conditional risks:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Note that λ11 and λ22 need not be zero, though we expect λ11 < λ21 and λ22 < λ12 (a correct decision should cost less than the corresponding error).

Decide on ω1 if R(α1 | x) < R(α2 | x); decide on ω2 otherwise. This is equivalent to

$$\Lambda(x) = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{(\lambda_{12} - \lambda_{22})}{(\lambda_{21} - \lambda_{11})} \cdot \frac{P(\omega_2)}{P(\omega_1)}$$

The Likelihood Ratio Test (LRT): pick ω1 if the likelihood ratio is greater than a threshold that is independent of x. This rule, which minimizes the Bayes risk, is also called the Bayes Criterion.
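
A small sketch of the likelihood ratio test, assuming Gaussian likelihoods and made-up priors and losses; the threshold is computed once and is independent of x.

```python
import numpy as np
from scipy.stats import norm

# Assumed two-class problem: Gaussian likelihoods, unequal priors and losses.
p1, p2 = 0.7, 0.3
lam11, lam12, lam21, lam22 = 0.0, 3.0, 1.0, 0.0
pdf1, pdf2 = norm(0.0, 1.0).pdf, norm(2.0, 1.0).pdf

# Threshold of the likelihood ratio test (independent of x)
theta = (lam12 - lam22) / (lam21 - lam11) * (p2 / p1)

def decide(x):
    """Pick w1 if the likelihood ratio exceeds the threshold, else w2."""
    return 1 if pdf1(x) / pdf2(x) > theta else 2

print(theta)                       # 3 * 0.3/0.7 ~= 1.286
print([decide(x) for x in (-1.0, 1.0, 1.5, 3.0)])
```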



Example

(Example figure.) From R. Gutierrez @ TAMU



Example
(To be fully solved on request on Friday)

(Example figure with losses λ11, λ22, λ12, λ21.) Modified from R. Gutierrez @ TAMU



Minimum Error-Rate Classification: Multiclass Case

If we associate taking action αi with selecting class ωi, and if all errors are equally costly, we obtain the zero-one loss (symmetric cost function):

$$\lambda(\alpha_i \mid \omega_j) = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{if } i \neq j \end{cases}$$

This loss function assigns no loss to a correct classification and a loss of 1 to any misclassification. The risk corresponding to this loss function is then

$$R(\alpha_i \mid x) = \sum_{\substack{j = 1,\ldots,c \\ j \neq i}} P(\omega_j \mid x) = 1 - P(\omega_i \mid x)$$

What does this tell us? To minimize this risk (the average probability of error), we need to choose the class that maximizes the posterior probability P(ωi | x):

$$\Lambda(x) = \frac{p(x \mid \omega_i)}{p(x \mid \omega_j)} \;\underset{\omega_j}{\overset{\omega_i}{\gtrless}}\; \frac{P(\omega_j)}{P(\omega_i)} \quad\Longleftrightarrow\quad \frac{P(\omega_i \mid x)}{P(\omega_j \mid x)} \;\underset{\omega_j}{\overset{\omega_i}{\gtrless}}\; 1$$

Maximum a posteriori (MAP) criterion; this reduces to the maximum likelihood criterion for equal priors.
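
A short sketch confirming that, under the zero-one loss, minimizing the conditional risk and maximizing the posterior select the same class; the posterior values are assumed for illustration.

```python
import numpy as np

# With the zero-one loss, R(a_i | x) = 1 - P(w_i | x), so minimizing the
# conditional risk and maximizing the posterior pick the same class.
posteriors = np.array([0.2, 0.5, 0.3])        # P(w_i | x) for c = 3 (assumed values)

lam = 1.0 - np.eye(3)                         # zero-one (symmetric) loss matrix
risks = lam @ posteriors                      # R(a_i | x) = sum_{j != i} P(w_j | x)

print(risks)                                  # [0.8, 0.5, 0.7] = 1 - posteriors
print(risks.argmin(), posteriors.argmax())    # both select class index 1 (MAP)
```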



Error Probabilities (Bayes Rule Rules!)

In a two-class case, there are two sources of error: x falls in R1 while the state of nature is ω2, or vice versa.

$$P(error) = \int_{-\infty}^{\infty} P(error \mid x)\, p(x)\, dx = \int_{R_1} P(\omega_2 \mid x)\, p(x)\, dx + \int_{R_2} P(\omega_1 \mid x)\, p(x)\, dx$$

Equivalently,

$$P(error) = P(x \in R_1, \omega_2) + P(x \in R_2, \omega_1) = P(x \in R_1 \mid \omega_2)\, P(\omega_2) + P(x \in R_2 \mid \omega_1)\, P(\omega_1)$$

(Figure: xB marks the optimal Bayes decision boundary; x* marks a non-optimal boundary.)
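
The two error integrals can be evaluated numerically. The sketch below assumes two unit-variance Gaussians at ±1 with equal priors, for which the Bayes error is about 0.159.

```python
import numpy as np
from scipy.stats import norm

# Assumed 1-D two-class problem with equal priors and unit-variance Gaussians.
p1, p2 = 0.5, 0.5
pdf1, pdf2 = norm(-1.0, 1.0).pdf, norm(1.0, 1.0).pdf

x = np.linspace(-10, 10, 8001)
dx = x[1] - x[0]
joint1, joint2 = p1 * pdf1(x), p2 * pdf2(x)      # P(w_i) p(x | w_i) on the grid
R1 = joint1 >= joint2                            # Bayes decision region R1 (decide w1)

# P(error) = int_{R1} p(x|w2)P(w2) dx + int_{R2} p(x|w1)P(w1) dx
p_error = (np.sum(np.where(R1, joint2, 0.0)) +
           np.sum(np.where(~R1, joint1, 0.0))) * dx
print(p_error)     # ~0.159, the Bayes error for this configuration
```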



Probability of Error

In the multi-class case, there are more ways to be wrong than to be right, so we exploit the fact that P(error) = 1 − P(correct), where

$$P(correct) = \sum_{i=1}^{C} P(x \in R_i, \omega_i) = \sum_{i=1}^{C} \int_{x \in R_i} p(x \mid \omega_i)\, P(\omega_i)\, dx = \sum_{i=1}^{C} \int_{x \in R_i} P(\omega_i \mid x)\, p(x)\, dx$$

Of course, in order to minimize P(error), we need to maximize P(correct), for which we need to maximize each and every one of the integrals. Note that p(x) is common to all integrals; therefore the expression is maximized by choosing the decision regions Ri where the posterior probabilities P(ωi | x) are maximum.

From R. Gutierrez @ TAMU
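
A numerical sketch of P(correct) for an assumed three-class 1-D problem, using the MAP decision regions on a grid:

```python
import numpy as np
from scipy.stats import norm

# Assumed three-class 1-D problem; regions R_i are where P(w_i|x) is maximal.
priors = np.array([0.3, 0.4, 0.3])
means  = np.array([-2.0, 0.0, 2.0])

x = np.linspace(-12, 12, 12001)
dx = x[1] - x[0]
joint = np.vstack([P * norm(m, 1.0).pdf(x) for P, m in zip(priors, means)])
winner = joint.argmax(axis=0)                     # MAP decision on the grid

# P(correct) = sum_i int_{R_i} p(x|w_i) P(w_i) dx
p_correct = sum(np.sum(joint[i][winner == i]) for i in range(3)) * dx
print(p_correct, 1.0 - p_correct)                 # P(correct) and P(error)
```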



Discriminant Based Classification

A discriminant is a function gi(x) that discriminates between classes. It assigns the input vector to a class according to its definition: choose class ωi if

$$g_i(x) > g_j(x) \quad \forall\, j \neq i, \quad i, j = 1, 2, \ldots, c$$

Bayes rule can be implemented in terms of discriminant functions:

$$g_i(x) = P(\omega_i \mid x)$$

The c discriminant functions divide the feature space into c decision regions, R1, …, Rc, which are separated by decision boundaries. Decision regions need NOT be contiguous. A point x lies in Ri if and only if gi(x) > gj(x) for all j ≠ i, and the decision boundary between Ri and Rj satisfies gi(x) = gj(x).
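
A minimal sketch of discriminant-based classification as an argmax over c functions; the linear discriminants below are made up purely for illustration.

```python
import numpy as np

# A discriminant-based classifier is just an argmax over c functions g_i(x).
# The g_i below are made-up linear discriminants, not from the lecture.
def make_linear_g(w, w0):
    return lambda x: w @ x + w0

g = [make_linear_g(np.array([1.0, 0.0]), 0.0),    # g_1
     make_linear_g(np.array([-1.0, 1.0]), 0.5),   # g_2
     make_linear_g(np.array([0.0, -1.0]), 1.0)]   # g_3

def classify(x):
    """Choose class i such that g_i(x) > g_j(x) for all j != i."""
    scores = [gi(x) for gi in g]
    return int(np.argmax(scores)) + 1             # classes labelled 1..c

print(classify(np.array([2.0, 0.5])))             # -> 1 here
```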



Discriminant Functions

We may view the classifier as an automated machine that computes c discriminants and selects the category corresponding to the largest discriminant. A neural network is one such classifier.

For the Bayes classifier with non-uniform risks R(αi | x):  gi(x) = −R(αi | x)
For the MAP classifier (uniform risks):  gi(x) = P(ωi | x)
For the maximum likelihood classifier (equal priors):  gi(x) = P(x | ωi)



Discriminant Functions

In fact, multiplying every DF by the same positive constant, or adding/subtracting the same constant to all DFs, does not change the decision boundary.

In general, every gi(x) can be replaced by f(gi(x)), where f(·) is any monotonically increasing function, without affecting the actual decision boundary.

Some linear or non-linear transformations of the previously stated DFs may greatly simplify the design of the classifier. What examples can you think of?
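
One common example is the logarithm: the sketch below shows that taking f = ln of the discriminant gi(x) = P(x|ωi)P(ωi) leaves the decision unchanged (the likelihood values are assumed, deliberately tiny numbers).

```python
import numpy as np

# Taking the log of the Bayes discriminant g_i(x) = P(x|w_i) P(w_i) is a
# monotonically increasing transformation, so it leaves the decision unchanged
# while turning products into sums (numerically much better behaved).
likelihoods = np.array([1e-12, 4e-12, 2.5e-12])   # p(x | w_i), assumed tiny values
priors      = np.array([0.2, 0.5, 0.3])

g     = likelihoods * priors                      # original discriminants
g_log = np.log(likelihoods) + np.log(priors)      # f(g_i) with f = ln

print(g.argmax(), g_log.argmax())                 # same winning class (index 1)
```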



Normal Densities

If the likelihoods (class-conditional densities) are normally distributed, a number of simplifications can be made. In particular, the discriminant function can be written in this greatly simplified form (!)

$$p(x \mid \omega_i) \sim N(\mu_i, \Sigma_i): \qquad p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2}\,|\Sigma_i|^{1/2}}\; e^{-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)}$$

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

There are three distinct cases that can occur:
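
A direct transcription of this discriminant into code, with assumed means, covariances, and priors:

```python
import numpy as np

def gaussian_discriminant(x, mu, Sigma, prior):
    """g_i(x) = -0.5 (x-mu)^T Sigma^{-1} (x-mu) - d/2 ln(2 pi)
                - 0.5 ln|Sigma| + ln P(w_i)   (general normal-density case)."""
    d = len(mu)
    diff = x - mu
    return (-0.5 * diff @ np.linalg.solve(Sigma, diff)
            - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))

# Assumed 2-D, two-class example with arbitrary covariances.
mus    = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]
Sigmas = [np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])]
priors = [0.5, 0.5]

x = np.array([1.0, 2.0])
scores = [gaussian_discriminant(x, m, S, P) for m, S, P in zip(mus, Sigmas, priors)]
print(int(np.argmax(scores)))     # index of the chosen class
```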



Case 1: Σi = σ²I

Features are statistically independent, and all features have the same variance: the distributions are spherical in d dimensions, the boundary is a generalized hyperplane (linear discriminant) of d − 1 dimensions, and the features form equal-sized hyperspherical clusters (see the figure for examples of such clusters).

The general form of the discriminant is then

$$g_i(x) = -\frac{\lVert x - \mu_i \rVert^2}{2\sigma^2} + \ln P(\omega_i)$$

If the priors are the same:

$$g_i(x) = -\frac{(x - \mu_i)^T (x - \mu_i)}{2\sigma^2}$$

Minimum Distance Classifier


Case 1: Σi = σ²I (continued)

This case results in linear discriminants that can be written in the form

$$g_i(x) = w_i^T x + w_{i0}, \qquad w_i = \frac{1}{\sigma^2}\,\mu_i, \qquad w_{i0} = -\frac{1}{2\sigma^2}\,\mu_i^T \mu_i + \ln P(\omega_i)$$

where wi0 is the threshold (bias) of the i-th category.

(Figure: the 1-D, 2-D and 3-D cases.) Note how unequal priors shift the decision boundary away from the more likely mean!
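
A sketch of the Case 1 linear discriminant with assumed means, variance, and (unequal) priors, showing the boundary shift described above:

```python
import numpy as np

# Case 1 (Sigma_i = sigma^2 I): the discriminant reduces to a linear function
# g_i(x) = w_i^T x + w_i0 with w_i = mu_i / sigma^2 and
# w_i0 = -mu_i^T mu_i / (2 sigma^2) + ln P(w_i).  Values below are assumed.
sigma2 = 1.0
mus    = np.array([[0.0, 0.0], [2.0, 2.0]])
priors = np.array([0.8, 0.2])

W  = mus / sigma2                                          # one weight vector per class
w0 = -np.sum(mus * mus, axis=1) / (2 * sigma2) + np.log(priors)

def classify(x):
    return int(np.argmax(W @ x + w0))

# With equal priors the boundary would pass through the midpoint [1, 1];
# the larger prior of class 0 shifts the boundary toward the less likely mean.
print(classify(np.array([1.0, 1.0])))    # -> 0 because of the priors
```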



Case 2: Σi = Σ

Covariance matrices are arbitrary, but equal to each other for all classes. Features then form hyperellipsoidal clusters of equal size and shape. This also results in linear discriminant functions whose decision boundaries are again hyperplanes:

$$g_i(x) = -\frac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i)$$

$$g_i(x) = w_i^T x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\frac{1}{2}\,\mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i)$$
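
A sketch of the Case 2 linear discriminant, assuming a shared covariance matrix and made-up class means:

```python
import numpy as np

# Case 2 (Sigma_i = Sigma for all classes): again linear, with
# w_i = Sigma^{-1} mu_i and w_i0 = -0.5 mu_i^T Sigma^{-1} mu_i + ln P(w_i).
Sigma  = np.array([[1.0, 0.5], [0.5, 2.0]])      # assumed shared covariance
mus    = np.array([[0.0, 0.0], [3.0, 1.0], [1.0, 4.0]])
priors = np.array([1/3, 1/3, 1/3])

Sinv = np.linalg.inv(Sigma)
W    = mus @ Sinv                                # row i is w_i^T (Sinv is symmetric)
w0   = -0.5 * np.einsum('ij,jk,ik->i', mus, Sinv, mus) + np.log(priors)

def classify(x):
    return int(np.argmax(W @ x + w0))

print(classify(np.array([2.0, 1.0])))            # class with the largest g_i(x)
```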



Case 3: Σi = arbitrary

All bets are off! In the two-class case, the decision boundaries are hyperquadrics. The discriminant functions are now, in general, quadratic (not linear), and the decision regions need not be contiguous:

$$g_i(x) = x^T W_i x + w_i^T x + w_{i0}, \qquad W_i = -\frac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i$$

$$w_{i0} = -\frac{1}{2}\,\mu_i^T \Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i)$$

(Figure: two-class decision boundaries can be linear, circular, ellipsoidal, parabolic, or hyperbolic.)
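
A sketch of the Case 3 quadratic discriminant with assumed parameters:

```python
import numpy as np

# Case 3 (arbitrary Sigma_i): quadratic discriminant
# g_i(x) = x^T W_i x + w_i^T x + w_i0 with W_i = -0.5 Sigma_i^{-1},
# w_i = Sigma_i^{-1} mu_i,
# w_i0 = -0.5 mu_i^T Sigma_i^{-1} mu_i - 0.5 ln|Sigma_i| + ln P(w_i).
def quadratic_g(x, mu, Sigma, prior):
    Sinv = np.linalg.inv(Sigma)
    Wi   = -0.5 * Sinv
    wi   = Sinv @ mu
    wi0  = (-0.5 * mu @ Sinv @ mu
            - 0.5 * np.log(np.linalg.det(Sigma))
            + np.log(prior))
    return x @ Wi @ x + wi @ x + wi0

# Assumed two classes with different covariances -> hyperquadric boundary.
params = [(np.array([0.0, 0.0]), np.eye(2),                          0.5),
          (np.array([2.0, 0.0]), np.array([[3.0, 0.0], [0.0, 0.5]]), 0.5)]

x = np.array([1.0, 1.0])
print(int(np.argmax([quadratic_g(x, m, S, P) for m, S, P in params])))
```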



Case 3: Σi = arbitrary

For the multi-class case, the boundaries can look even more complicated. As an example:

(Figure: decision boundaries for a multi-class problem.)



Case 3: Σi = arbitrary

(Figure: an example in 3-D.)



Conclusions

The Bayes classifier for normally distributed classes is, in general, a quadratic classifier whose discriminants can be computed in closed form.

The Bayes classifier for normally distributed classes with equal covariance matrices is a linear classifier.

For normally distributed classes with equal covariance matrices and equal priors, it is a minimum Mahalanobis distance classifier.

For normally distributed classes with equal covariance matrices proportional to the identity matrix and with equal priors, it is a minimum Euclidean distance classifier.

Note that using a minimum Euclidean or Mahalanobis distance classifier implicitly makes certain assumptions regarding the statistical properties of the data, which may or may not – and in general are not – true.

However, in many cases, certain simplifications and approximations can be made that warrant making such assumptions even if they are not strictly true. The bottom line in practice, in deciding whether the assumptions are warranted, is: does the damn thing solve my classification problem?



Error Bounds

It is difficult at best, if possible at all, to compute the error probabilities analytically, particularly when the decision regions are not contiguous. However, upper bounds on this error can be obtained.

The Chernoff bound and its approximation, the Bhattacharyya bound, are two such bounds that are often used. If the distributions are Gaussian, these expressions are relatively easy to compute; oftentimes even non-Gaussian cases are treated as Gaussian.
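
For the Gaussian case, the Bhattacharyya bound (the Chernoff bound evaluated at β = 1/2) has a closed form; the sketch below implements it with assumed means, covariances, and priors.

```python
import numpy as np

# Bhattacharyya bound for two Gaussian classes:
# P(error) <= sqrt(P1 * P2) * exp(-k(1/2)), with
# k(1/2) = 1/8 (mu2-mu1)^T [(S1+S2)/2]^{-1} (mu2-mu1)
#          + 1/2 ln( |(S1+S2)/2| / sqrt(|S1| |S2|) ).
# The means, covariances and priors below are assumed for illustration.
def bhattacharyya_bound(mu1, S1, mu2, S2, P1, P2):
    Savg = 0.5 * (S1 + S2)
    dmu = mu2 - mu1
    k = (0.125 * dmu @ np.linalg.solve(Savg, dmu)
         + 0.5 * np.log(np.linalg.det(Savg)
                        / np.sqrt(np.linalg.det(S1) * np.linalg.det(S2))))
    return np.sqrt(P1 * P2) * np.exp(-k)

mu1, S1 = np.array([0.0, 0.0]), np.eye(2)
mu2, S2 = np.array([2.0, 1.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
print(bhattacharyya_bound(mu1, S1, mu2, S2, 0.5, 0.5))   # upper bound on P(error)
```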
