Theory and Applications of Pattern Recognition
© 2003, Robi Polikar, Rowan University, Glassboro, NJ

Lecture 4: Bayes Classification Rule
Dept. of Electrical and Computer Engineering
0909.402.02 / 0909.504.04 Theory & Applications of Pattern Recognition

A priori / a posteriori probabilities
Loss function
Bayes decision rule
The likelihood ratio test
Maximum a posteriori (MAP) criterion
Minimum error rate classification
Discriminant functions
Error bounds and probabilities

Background: Pattern Classification, Duda, Hart and Stork, Copyright © John Wiley and Sons, 2001. PR logo Copyright © Robi Polikar, 2001.
Today in PR

Review of Bayes theorem
Bayes decision theory
Bayes rule
Loss function & expected loss
Minimum error rate classification
Classification using discriminant functions
Error bounds & probabilities
Bayes Rule

Suppose we know P(ω1), P(ω2), P(x|ω1) and P(x|ω2), and that we have observed the value of the feature (a random variable) x.
How would you decide on the "state of nature" (the type of fish) based on this information?
Bayes theorem allows us to compute the posterior probabilities from the prior and class-conditional probabilities.
Likelihood: The (class-conditional) probability of observing a feature value of x, given that the correct class is ωj. All things being equal, the category with the higher class-conditional probability is more "likely" to be the correct class.

Posterior probability: The (conditional) probability of the correct class being ωj, given that the feature value x has been observed.

Prior probability: The total probability of the correct class being ωj, determined from prior experience.

Evidence: The total probability of observing the feature value x.

$$ P(\omega_j \mid x) = \frac{P(x \cap \omega_j)}{P(x)} = \frac{P(x \mid \omega_j)\, P(\omega_j)}{\sum_{k=1}^{C} P(x \mid \omega_k)\, P(\omega_k)} $$
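As a quick numerical illustration of this formula, here is a minimal Python sketch; the priors and likelihood values are made-up placeholders, not numbers from the lecture.

```python
# Minimal sketch of Bayes' theorem for a single observed feature value x.
# All numbers below are illustrative placeholders.

def posteriors(priors, likelihoods):
    """Return P(w_j | x) for every class j.

    priors      : list of P(w_j), summing to 1
    likelihoods : list of p(x | w_j) evaluated at the observed x
    """
    evidence = sum(p * l for p, l in zip(priors, likelihoods))  # P(x)
    return [p * l / evidence for p, l in zip(priors, likelihoods)]

# Hypothetical two-class fish example: w1 = sea bass, w2 = salmon.
priors = [0.6, 0.4]          # P(w1), P(w2)
likelihoods = [0.05, 0.20]   # p(x | w1), p(x | w2) at the observed lightness x

print(posteriors(priors, likelihoods))  # -> [0.2727..., 0.7272...]
```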
Bayes Decision Rule

Choose ωi if P(ωi | x) > P(ωj | x) for all j ≠ i, i, j = 1, 2, …, c.
The rule is unchanged when there are multiple features: for a feature vector x = {x1, x2, …, xd}, choose ωi if P(ωi | x) > P(ωj | x) for all j ≠ i, i, j = 1, 2, …, c.
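Continuing the sketch above, the decision rule is simply an argmax over the posteriors (posteriors() is the hypothetical helper defined in the previous snippet):

```python
# Bayes decision rule: pick the class with the largest posterior probability.
post = posteriors(priors, likelihoods)
decision = max(range(len(post)), key=lambda j: post[j])
print(f"decide omega_{decision + 1}")   # here omega_2, since P(w2|x) > P(w1|x)
```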
The Loss Function

A mathematical description of how costly each action (making a class decision) is. Are certain mistakes costlier than others?

{ω1, ω2, …, ωc}: The set of states of nature (classes).
{α1, α2, …, αa}: The set of possible actions. Note that a need not equal c, because we may allow more (or fewer) actions than there are classes; for example, not making a decision (rejection) is also an action.
{λ1, λ2, …, λa}: The losses associated with each action.
λ(αi | ωj): The loss function: the loss incurred by taking action αi when the true state of nature is in fact ωj.
R(αi | x): The conditional risk, i.e. the expected loss for taking action αi given the observation x:

$$ R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) $$

The Bayes decision takes the action that minimizes the conditional risk!
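A small sketch of this conditional-risk computation follows; the loss matrix and posteriors are assumed, illustrative values rather than numbers from the lecture.

```python
# Conditional risk R(a_i | x) = sum_j lambda(a_i | w_j) * P(w_j | x).
# loss[i][j] = loss for taking action a_i when the true class is w_j.

def conditional_risks(loss, posteriors):
    return [sum(lam_ij * p_j for lam_ij, p_j in zip(row, posteriors))
            for row in loss]

loss = [[0.0, 10.0],   # action a1 (decide w1): no loss if right, costly if truly w2
        [ 1.0,  0.0]]  # action a2 (decide w2): small loss if truly w1
post = [0.27, 0.73]    # P(w1|x), P(w2|x), e.g. from the earlier sketch

print(conditional_risks(loss, post))  # [7.3, 0.27] -> a2 has the smaller risk
```

Note how an asymmetric loss can overturn the plain MAP decision when one kind of mistake is much costlier than the other.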
Bayes Decision Rule Using Conditional Risk

1. Compute the conditional risk
$$ R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda(\alpha_i \mid \omega_j)\, P(\omega_j \mid x) $$
for each possible action.
2. Select the action that has the minimum conditional risk; call this action α(x).
3. The overall risk is then
$$ R = \int_{x \in X} R(\alpha(x) \mid x)\, p(x)\, dx $$
where R(α(x) | x) is the conditional risk of taking action α(x) based on the observation x, p(x) is the probability of observing x, and the integral is taken over all possible values of x.
4. This is the Bayes risk, the minimum possible risk that can be achieved by any classifier!
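To make the overall risk concrete, the sketch below evaluates the integral numerically on a grid for a hypothetical 1-D, two-class problem; the Gaussian likelihoods, priors and loss matrix are all assumed values for illustration.

```python
import numpy as np

# Hypothetical 1-D two-class problem with Gaussian likelihoods.
priors = np.array([0.5, 0.5])
means, sigmas = np.array([-1.0, 1.0]), np.array([1.0, 1.0])
loss = np.array([[0.0, 2.0],     # lambda(a_i | w_j), illustrative values
                 [1.0, 0.0]])

def gauss(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

x = np.linspace(-8, 8, 4001)
like = np.stack([gauss(x, m, s) for m, s in zip(means, sigmas)])  # p(x|w_j)
evidence = (priors[:, None] * like).sum(axis=0)                   # p(x)
post = priors[:, None] * like / evidence                          # P(w_j|x)

cond_risk = loss @ post              # R(a_i|x) for every grid point x
min_risk = cond_risk.min(axis=0)     # risk of the Bayes-optimal action a(x)
bayes_risk = np.sum(min_risk * evidence) * (x[1] - x[0])   # rectangle-rule integral
print(f"estimated Bayes risk: {bayes_risk:.4f}")
```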
Two-Class Special Case

Definitions:
α1: Decide on ω1
α2: Decide on ω2
λij = λ(αi | ωj): Loss for deciding on ωi when the state of nature is ωj

Conditional risks:
R(α1 | x) = λ11 P(ω1 | x) + λ12 P(ω2 | x)
R(α2 | x) = λ21 P(ω1 | x) + λ22 P(ω2 | x)

Note that λ11 and λ22 need not be zero, though we expect λ11 < λ12 and λ22 < λ21.

Decide on ω1 if R(α1 | x) < R(α2 | x); decide on ω2 otherwise. Equivalently,

$$ \Lambda(x) = \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} \;\underset{\omega_2}{\overset{\omega_1}{\gtrless}}\; \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \cdot \frac{P(\omega_2)}{P(\omega_1)} $$

The Likelihood Ratio Test (LRT): pick ω1 if the likelihood ratio Λ(x) is greater than a threshold that is independent of x. This rule, which minimizes the Bayes risk, is also called the Bayes criterion.
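A minimal sketch of the likelihood ratio test; the Gaussian likelihoods, losses and priors are assumed, illustrative values.

```python
import math

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def lrt_decide(x, priors, lam):
    """Return 1 or 2. lam[i][j] = loss for deciding w_{i+1} when truth is w_{j+1}."""
    ratio = gauss(x, -1.0, 1.0) / gauss(x, 1.0, 1.0)        # p(x|w1) / p(x|w2), assumed densities
    threshold = ((lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0])) \
                * (priors[1] / priors[0])                   # independent of x
    return 1 if ratio > threshold else 2

lam = [[0.0, 2.0], [1.0, 0.0]]
print(lrt_decide(-0.5, [0.5, 0.5], lam))  # -> 1
print(lrt_decide( 0.5, [0.5, 0.5], lam))  # -> 2
```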
Example

(Worked example shown as a figure; from R. Gutierrez @ TAMU.)
Example (to be fully solved on request on Friday)

(Figure modified from R. Gutierrez @ TAMU, annotating the loss terms λ11, λ22, λ12 and λ21.)
Minimum Error-Rate Classification: Multiclass Case

If we associate taking action αi with selecting class ωi, and if all errors are equally costly, we obtain the zero-one loss (symmetric cost function)

$$ \lambda(\alpha_i \mid \omega_j) = \begin{cases} 0, & \text{if } i = j \\ 1, & \text{if } i \neq j \end{cases} $$

This loss function assigns no loss to a correct classification and a loss of 1 to any misclassification. The risk corresponding to this loss function is then

$$ R(\alpha_i \mid x) = \sum_{\substack{j = 1,\ldots,c \\ j \neq i}} P(\omega_j \mid x) = 1 - P(\omega_i \mid x) $$

What does this tell us? To minimize this risk (the average probability of error), we need to choose the class that maximizes the posterior probability P(ωi | x):

$$ \Lambda(x) = \frac{p(x \mid \omega_i)}{p(x \mid \omega_j)} \;\underset{\omega_j}{\overset{\omega_i}{\gtrless}}\; \frac{P(\omega_j)}{P(\omega_i)} \quad\Longleftrightarrow\quad \frac{P(\omega_i \mid x)}{P(\omega_j \mid x)} \;\underset{\omega_j}{\overset{\omega_i}{\gtrless}}\; 1 $$

This is the maximum a posteriori (MAP) criterion, which reduces to the maximum likelihood criterion for equal priors.
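The sketch below contrasts the MAP and maximum-likelihood decisions under the zero-one loss; the priors and likelihood values are assumed placeholders chosen so that the two criteria disagree.

```python
# Zero-one loss: minimizing the risk = maximizing the posterior (MAP).
def map_decide(priors, likelihoods):
    scores = [p * l for p, l in zip(priors, likelihoods)]   # proportional to P(w_j|x)
    return max(range(len(scores)), key=lambda j: scores[j])

def ml_decide(likelihoods):
    # With equal priors, MAP reduces to the maximum likelihood decision.
    return max(range(len(likelihoods)), key=lambda j: likelihoods[j])

priors = [0.8, 0.2]          # assumed priors, heavily favouring w1
likelihoods = [0.05, 0.15]   # assumed p(x|w1), p(x|w2) at the observed x

print(map_decide(priors, likelihoods))  # 0 -> w1 (the prior wins: 0.04 vs 0.03)
print(ml_decide(likelihoods))           # 1 -> w2 (the likelihood alone favours w2)
```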
Error Probabilities (Bayes Rule Rules!)

In a two-class case, there are two sources of error: x falls in R1 while the state of nature is ω2, or x falls in R2 while the state of nature is ω1:

$$ P(\text{error}) = \int_{-\infty}^{\infty} p(\text{error} \mid x)\, p(x)\, dx = \int_{R_1} p(\omega_2 \mid x)\, p(x)\, dx + \int_{R_2} p(\omega_1 \mid x)\, p(x)\, dx $$

Equivalently,

$$ P(\text{error}) = P(x \in R_1, \omega_2) + P(x \in R_2, \omega_1) = P(x \in R_1 \mid \omega_2)\, P(\omega_2) + P(x \in R_2 \mid \omega_1)\, P(\omega_1) $$

(In the accompanying figure, xB denotes the optimal Bayes decision threshold and x* a non-optimal one.)
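For two 1-D Gaussian classes and a single decision threshold, both error terms can be evaluated with the Gaussian CDF; the class parameters, priors and threshold below are assumed for illustration.

```python
import math

def norm_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Assumed two-class problem: w1 ~ N(-1, 1), w2 ~ N(+1, 1), equal priors.
mu1, mu2, sigma, p1, p2 = -1.0, 1.0, 1.0, 0.5, 0.5
x_star = 0.0   # decision threshold: R1 = (-inf, x*), R2 = (x*, inf)

# P(error) = P(x in R1 | w2) P(w2) + P(x in R2 | w1) P(w1)
p_error = norm_cdf(x_star, mu2, sigma) * p2 + (1.0 - norm_cdf(x_star, mu1, sigma)) * p1
print(f"P(error) = {p_error:.4f}")   # about 0.1587 for this symmetric setup
```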
Probability of Error

In the multi-class case there are more ways to be wrong than to be right, so we exploit the fact that P(error) = 1 − P(correct), where

$$ P(\text{correct}) = \sum_{i=1}^{C} P(x \in R_i, \omega_i) = \sum_{i=1}^{C} P(x \in R_i \mid \omega_i)\, P(\omega_i) = \sum_{i=1}^{C} \int_{x \in R_i} p(x \mid \omega_i)\, P(\omega_i)\, dx = \sum_{i=1}^{C} \int_{x \in R_i} P(\omega_i \mid x)\, p(x)\, dx $$

Of course, to minimize P(error) we need to maximize P(correct), for which we need to maximize each and every one of the integrals. Note that p(x) is common to all integrals, so the expression is maximized by choosing the decision regions Ri where the posterior probabilities P(ωi | x) are maximum.

(Figure from R. Gutierrez @ TAMU.)
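The same quantity can be sanity-checked by Monte Carlo: sample a class from the priors, sample x from that class, classify by the maximum posterior, and count how often the decision is correct. The sketch reuses the assumed 1-D Gaussians from the previous example.

```python
import random, math

def gauss(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

priors = [0.5, 0.5]
means, sigma = [-1.0, 1.0], 1.0   # assumed class means and common std. dev.

random.seed(0)
correct, n = 0, 200_000
for _ in range(n):
    true_class = 0 if random.random() < priors[0] else 1
    x = random.gauss(means[true_class], sigma)
    post_scores = [priors[j] * gauss(x, means[j], sigma) for j in range(2)]
    decision = 0 if post_scores[0] >= post_scores[1] else 1
    correct += (decision == true_class)

print(f"P(correct) ~ {correct / n:.3f}")   # about 0.841 = 1 - 0.159
```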
Discriminant-Based Classification

A discriminant is a function gi(x) that discriminates between classes. The classifier assigns the input vector to a class according to the rule: choose class ωi if

$$ g_i(x) > g_j(x) \quad \forall j \neq i, \quad i, j = 1, 2, \ldots, c $$

Bayes rule can be implemented in terms of discriminant functions by choosing

$$ g_i(x) = P(\omega_i \mid x) $$

The set of c discriminant functions partitions the feature space into c decision regions R1, …, Rc, separated by decision boundaries. Decision regions need NOT be contiguous:

$$ x \in R_i \iff g_i(x) > g_j(x) \quad \forall j \neq i, \quad i, j = 1, 2, \ldots, c $$

The decision boundary between two regions satisfies gi(x) = gj(x).
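In code, a discriminant-based classifier is simply an argmax over the gi(x); the sketch below is generic and takes the discriminant functions as arguments (the functions and test points are made-up examples).

```python
# Generic discriminant-based classifier: choose the class whose g_i(x) is largest.
def classify(x, discriminants):
    """discriminants: list of callables g_i(x); returns the index of the winning class."""
    scores = [g(x) for g in discriminants]
    return max(range(len(scores)), key=lambda i: scores[i])

# Example with two hypothetical 1-D discriminants (posteriors up to a common constant).
g = [lambda x: -(x + 1.0) ** 2,    # favours points near -1
     lambda x: -(x - 1.0) ** 2]    # favours points near +1
print(classify(-0.3, g), classify(0.7, g))   # -> 0 1
```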
Discriminant Functions

We may view the classifier as an automated machine that computes c discriminants and selects the category corresponding to the largest discriminant. A neural network is one such classifier.

For the Bayes classifier with non-uniform risks R(αi | x): gi(x) = -R(αi | x)
For the MAP classifier (uniform risks): gi(x) = P(ωi | x)
For the maximum likelihood classifier (equal priors): gi(x) = P(x | ωi)
Discriminant Functions

In fact, multiplying every discriminant function by the same positive constant, or adding/subtracting the same constant to all of them, does not change the decision boundary.
In general, every gi(x) can be replaced by f(gi(x)), where f(·) is any monotonically increasing function, without affecting the actual decision boundary.
Some linear or non-linear transformations of the previously stated discriminant functions may greatly simplify the design of the classifier.
What examples can you think of? (Hint: the next slide uses the natural logarithm.)
Normal Densities

If the likelihoods are normally distributed, p(x | ωi) ~ N(µi, Σi), a number of simplifications can be made. The class-conditional density is

$$ p(x \mid \omega_i) = \frac{1}{(2\pi)^{d/2} \lvert\Sigma_i\rvert^{1/2}} \exp\!\left[-\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right] $$

and, taking the logarithm, the discriminant function can be written in this greatly simplified form:

$$ g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i) - \frac{d}{2}\ln 2\pi - \tfrac{1}{2}\ln\lvert\Sigma_i\rvert + \ln P(\omega_i) $$

There are three distinct cases that can occur:
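A sketch of this Gaussian discriminant with NumPy; the mean, covariance and prior passed in are whatever the designer estimates for class ωi, and the values in the usage line are made up.

```python
import numpy as np

def gaussian_discriminant(x, mu, sigma, prior):
    """g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - d/2 ln(2 pi) - 1/2 ln|Sigma| + ln P(w_i)."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    sigma = np.atleast_2d(np.asarray(sigma, float))
    d = mu.size
    diff = x - mu
    maha = diff @ np.linalg.solve(sigma, diff)          # (x-mu)^T Sigma^{-1} (x-mu)
    return (-0.5 * maha - 0.5 * d * np.log(2 * np.pi)
            - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))

# Hypothetical 2-D class: mu = [0, 0], Sigma = diag(1, 2), prior 0.5.
print(gaussian_discriminant([1.0, 1.0], [0.0, 0.0], np.diag([1.0, 2.0]), 0.5))
```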
Case 1: Σi = σ²I

Features are statistically independent and all features have the same variance: the class distributions are spherical in d dimensions, the decision boundary is a generalized hyperplane (a linear discriminant) of d−1 dimensions, and the classes form equal-sized hyperspherical clusters. (The slide's figures show examples of such hyperspherical clusters.)

The general form of the discriminant is then

$$ g_i(x) = -\frac{\lVert x - \mu_i \rVert^2}{2\sigma^2} + \ln P(\omega_i) $$

If the priors are equal, this reduces to

$$ g_i(x) = -\frac{(x - \mu_i)^T (x - \mu_i)}{2\sigma^2} $$

which is a minimum (Euclidean) distance classifier.
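A minimal sketch of the resulting minimum-distance classifier for spherical, equal-variance classes; the class means, priors and variance are assumed placeholder values.

```python
import numpy as np

def min_distance_classify(x, means, priors, sigma2=1.0):
    """Case 1: g_i(x) = -||x - mu_i||^2 / (2 sigma^2) + ln P(w_i); return the winning class."""
    x = np.asarray(x, float)
    scores = [-np.sum((x - np.asarray(m, float)) ** 2) / (2 * sigma2) + np.log(p)
              for m, p in zip(means, priors)]
    return int(np.argmax(scores))

means = [[0.0, 0.0], [3.0, 3.0]]   # assumed class means
print(min_distance_classify([1.0, 1.0], means, priors=[0.5, 0.5]))   # -> 0
```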
Case 1: Σi = σ²I (continued)

This case results in linear discriminants that can be written in the form

$$ g_i(x) = w_i^T x + w_{i0}, \qquad w_i = \frac{1}{\sigma^2}\mu_i, \qquad w_{i0} = -\frac{1}{2\sigma^2}\mu_i^T \mu_i + \ln P(\omega_i) $$

where wi0 is the threshold (bias) of the i-th category. (The slide's figures show the resulting boundaries for the 1-D, 2-D and 3-D cases.)

Note how the priors shift the decision boundary away from the more likely mean!
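The same classifier in its explicit linear form, computing wi and wi0 from assumed means, a shared variance and priors (function and variable names are illustrative):

```python
import numpy as np

def case1_linear_weights(mu, prior, sigma2):
    """w_i = mu_i / sigma^2,  w_i0 = -mu_i . mu_i / (2 sigma^2) + ln P(w_i)."""
    mu = np.asarray(mu, float)
    return mu / sigma2, -mu @ mu / (2 * sigma2) + np.log(prior)

w1, w10 = case1_linear_weights([0.0, 0.0], 0.5, 1.0)
w2, w20 = case1_linear_weights([3.0, 3.0], 0.5, 1.0)
x = np.array([1.0, 1.0])
print(w1 @ x + w10, w2 @ x + w20)   # g_1(x) > g_2(x), so decide w_1
```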
Case 2: Σi = Σ

Covariance matrices are arbitrary, but equal to each other for all classes. Features then form hyperellipsoidal clusters of equal size and shape. This also results in linear discriminant functions whose decision boundaries are again hyperplanes:

$$ g_i(x) = -\tfrac{1}{2}(x - \mu_i)^T \Sigma^{-1} (x - \mu_i) + \ln P(\omega_i) $$

which can be written as

$$ g_i(x) = w_i^T x + w_{i0}, \qquad w_i = \Sigma^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma^{-1} \mu_i + \ln P(\omega_i) $$
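A sketch for this shared-covariance case; the shared Σ, the class means and the priors below are assumed for illustration, and the helper name is hypothetical.

```python
import numpy as np

def case2_linear_weights(mu, prior, sigma):
    """w_i = Sigma^{-1} mu_i,  w_i0 = -1/2 mu_i^T Sigma^{-1} mu_i + ln P(w_i)."""
    mu = np.asarray(mu, float)
    sigma_inv_mu = np.linalg.solve(sigma, mu)
    return sigma_inv_mu, -0.5 * mu @ sigma_inv_mu + np.log(prior)

sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # shared covariance (assumed)
params = [case2_linear_weights(m, 0.5, sigma) for m in ([0.0, 0.0], [2.0, 1.0])]

x = np.array([1.5, 1.0])
scores = [w @ x + w0 for w, w0 in params]
print(int(np.argmax(scores)))           # index of the winning class
```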
Case 3: Σi = arbitrary

All bets are off! In the two-class case, the decision boundaries form hyperquadrics. The discriminant functions are now, in general, quadratic (not linear), and the decision regions need not be contiguous:

$$ g_i(x) = x^T W_i x + w_i^T x + w_{i0} $$
$$ W_i = -\tfrac{1}{2}\Sigma_i^{-1}, \qquad w_i = \Sigma_i^{-1}\mu_i, \qquad w_{i0} = -\tfrac{1}{2}\mu_i^T \Sigma_i^{-1}\mu_i - \tfrac{1}{2}\ln\lvert\Sigma_i\rvert + \ln P(\omega_i) $$

Depending on the covariances, the two-class decision boundary can be linear, circular, ellipsoidal, parabolic or hyperbolic (see the figures on the slide).
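A sketch of the general quadratic discriminant in this Wi, wi, wi0 form; the per-class covariances, means and priors are assumed values chosen only to exercise the code.

```python
import numpy as np

def quadratic_discriminant(x, mu, sigma, prior):
    """Case 3: g_i(x) = x^T W_i x + w_i^T x + w_i0 with a per-class Sigma_i."""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    sigma_inv = np.linalg.inv(np.asarray(sigma, float))
    W = -0.5 * sigma_inv
    w = sigma_inv @ mu
    w0 = (-0.5 * mu @ sigma_inv @ mu
          - 0.5 * np.log(np.linalg.det(sigma)) + np.log(prior))
    return x @ W @ x + w @ x + w0

# Two assumed classes with different covariances -> a quadratic boundary.
g1 = quadratic_discriminant([0.5, 0.5], [0.0, 0.0], np.diag([1.0, 1.0]), 0.5)
g2 = quadratic_discriminant([0.5, 0.5], [2.0, 0.0], np.diag([4.0, 0.5]), 0.5)
print(g1 > g2)   # True here: this point is assigned to class 1
```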
Case 3: Σi = arbitrary

For the multi-class case, the decision boundaries look even more complicated. (The slides show an example in 2-D, with the decision boundaries marked, and another example in 3-D.)
Conclusions

The Bayes classifier for normally distributed classes is, in general, a quadratic classifier and can be computed analytically.
The Bayes classifier for normally distributed classes with equal covariance matrices is a linear classifier.
For normally distributed classes with equal covariance matrices and equal priors, the Bayes classifier is a minimum Mahalanobis distance classifier.
For normally distributed classes with equal covariance matrices proportional to the identity matrix and with equal priors, the Bayes classifier is a minimum Euclidean distance classifier.
Note that using a minimum Euclidean or Mahalanobis distance classifier implicitly makes certain assumptions regarding the statistical properties of the data, which may or may not be true, and in general are not.
However, in many cases, certain simplifications and approximations can be made that warrant such assumptions even when they do not strictly hold. The bottom line in practice, in deciding whether the assumptions are warranted, is: does the damn thing solve my classification problem?
Error Bounds

It is difficult at best, if possible at all, to compute the error probabilities analytically, particularly when the decision regions are not contiguous. However, upper bounds on this error can be obtained.
The Chernoff bound and its approximation, the Bhattacharyya bound, are two such bounds that are often used. If the distributions are Gaussian, these expressions are relatively easy to compute, and often even non-Gaussian cases are treated as Gaussian.
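For reference, a commonly quoted form of the Bhattacharyya bound for two Gaussian classes (as given in Duda, Hart & Stork, Ch. 2; it is not spelled out on the slide) is

$$ P(\text{error}) \le \sqrt{P(\omega_1)\, P(\omega_2)}\; e^{-k(1/2)} $$

$$ k(1/2) = \frac{1}{8}(\mu_2 - \mu_1)^T \left[\frac{\Sigma_1 + \Sigma_2}{2}\right]^{-1} (\mu_2 - \mu_1) + \frac{1}{2}\ln\frac{\left|\tfrac{\Sigma_1 + \Sigma_2}{2}\right|}{\sqrt{|\Sigma_1|\,|\Sigma_2|}} $$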