Pattern Classification
Chapter 2. Bayesian Decision Theory
Pattern Recognition: Statistical Decision Theory

Introduction

• What is a pattern?
• In statistical pattern recognition, a pattern is a d-dimensional feature vector
  x = (x_1, x_2, …, x_d)^t
• The sea bass/salmon example
• State of nature; prior
• The state of nature is a random variable
• The catch of salmon and sea bass is equiprobable:
  P(ω_1) = P(ω_2)          (prior)
  P(ω_1) + P(ω_2) = 1      (exclusivity and exhaustivity)
• Decision rule with only the prior information:
  Decide ω_1 if P(ω_1) > P(ω_2); otherwise decide ω_2
• Use of the class-conditional information:
  P(x | ω_1) and P(x | ω_2) describe the difference in lightness between the
  populations of sea bass and salmon
[Figure: class-conditional probability densities P(x | ω_i)]
• Posterior, likelihood, evidence
  P(ω_j | x) = P(x | ω_j) P(ω_j) / P(x)
  where, in the case of two categories,
  P(x) = Σ_{j=1}^{2} P(x | ω_j) P(ω_j)
  Posterior = (Likelihood × Prior) / Evidence
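As a minimal numeric sketch of this formula (the priors and class-conditional values below are made-up illustrative numbers, not taken from the text):

    # Posterior = (Likelihood * Prior) / Evidence, two-category case
    prior = [0.5, 0.5]               # P(w1), P(w2): assumed equiprobable
    likelihood = [0.6, 0.2]          # P(x | w1), P(x | w2) at some observed x (illustrative)
    evidence = sum(l * p for l, p in zip(likelihood, prior))            # P(x)
    posterior = [l * p / evidence for l, p in zip(likelihood, prior)]
    print(posterior)                 # approximately [0.75, 0.25]; posteriors sum to 1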
• Decision given the posterior probabilities
  x is an observation for which:
  if P(ω_1 | x) > P(ω_2 | x), the true state of nature is taken to be ω_1
  if P(ω_1 | x) < P(ω_2 | x), the true state of nature is taken to be ω_2
• Therefore, whenever we observe a particular x, the probability of error is:
  P(error | x) = P(ω_1 | x) if we decide ω_2
  P(error | x) = P(ω_2 | x) if we decide ω_1
• Minimizing the probability of error:
  Decide ω_1 if P(ω_1 | x) > P(ω_2 | x); otherwise decide ω_2
• Therefore:
  P(error | x) = min [P(ω_1 | x), P(ω_2 | x)]     (Bayes decision)
[Figure: Probability of Error]
Generalization of the preceding ideas:
– Use of more than one feature
– Use of more than two states of nature
– Allowing actions other than merely deciding on the state of nature
– Introducing a loss function that is more general than the probability of error
• Allowing actions other than classification primarily allows the possibility of
  rejection: refusing to make a decision in close or bad cases
• The loss function states how costly each action taken is
• Let {ω_1, ω_2, …, ω_c} be the set of c states of nature ("categories")
• Let {α_1, α_2, …, α_a} be the set of a possible actions
• Let λ(α_i | ω_j) be the loss incurred for taking action α_i when the state of
  nature is ω_j
• Overall risk R: the expected loss associated with a decision rule; it is
  minimized by choosing, for every x, the action that minimizes the conditional
  risk R(α_i | x), i = 1, …, a
• Conditional risk:
  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x)     for i = 1, …, a
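A sketch of this computation (the loss matrix and posterior values below are illustrative assumptions, not from the text):

    import numpy as np

    # lam[i, j] = loss for taking action a_i when the true state of nature is w_j
    lam = np.array([[0.0, 2.0],
                    [1.0, 0.0]])
    posterior = np.array([0.7, 0.3])     # P(w_j | x), assumed already computed

    # Conditional risk R(a_i | x) = sum_j lam[i, j] * P(w_j | x)
    R = lam @ posterior                  # -> array([0.6, 0.7])
    best_action = np.argmin(R)           # take the action with minimum conditional risk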
• Select the action α_i for which R(α_i | x) is minimum
• The resulting overall risk R is then minimum; this minimum R is called the
  Bayes risk, the best performance that can be achieved.
• Two-category classification
  α_1: deciding ω_1
  α_2: deciding ω_2
  λ_ij = λ(α_i | ω_j): loss incurred for deciding ω_i when the true state of
  nature is ω_j
• Conditional risk:
  R(α_1 | x) = λ_11 P(ω_1 | x) + λ_12 P(ω_2 | x)
  R(α_2 | x) = λ_21 P(ω_1 | x) + λ_22 P(ω_2 | x)
Our rule is the following:
  if R(α_1 | x) < R(α_2 | x), take action α_1: "decide ω_1"
This results in the equivalent rule: decide ω_1 if
  (λ_21 - λ_11) P(x | ω_1) P(ω_1) > (λ_12 - λ_22) P(x | ω_2) P(ω_2)
and decide ω_2 otherwise.
• Likelihood ratio
  The preceding rule is equivalent to the following rule: if
  P(x | ω_1) / P(x | ω_2) > [(λ_12 - λ_22) / (λ_21 - λ_11)] × [P(ω_2) / P(ω_1)]
  then take action α_1 (decide ω_1); otherwise take action α_2 (decide ω_2)
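A sketch of the likelihood-ratio rule as a function (the losses, priors, and density values passed in are assumed illustrative numbers):

    # Decide w1 if  P(x|w1)/P(x|w2)  >  [(lam12 - lam22)/(lam21 - lam11)] * P(w2)/P(w1)
    def decide(px_w1, px_w2, prior1, prior2, lam11=0.0, lam12=1.0, lam21=1.0, lam22=0.0):
        threshold = (lam12 - lam22) / (lam21 - lam11) * (prior2 / prior1)
        return "w1" if px_w1 / px_w2 > threshold else "w2"

    print(decide(px_w1=0.4, px_w2=0.1, prior1=0.5, prior2=0.5))   # ratio 4 > 1, so "w1"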
• Optimal decision property:
  "If the likelihood ratio exceeds a threshold value that is independent of the
  input pattern x, we can take optimal actions."
Minimum-Error-Rate Classification

• Actions are decisions on classes
• If action α_i is taken and the true state of nature is ω_j, then the decision
  is correct if i = j and in error if i ≠ j
• Seek a decision rule that minimizes the probability of error, i.e. the error rate
Introducing the zero-one loss function:
  λ(α_i, ω_j) = 0 if i = j, and 1 if i ≠ j     (i, j = 1, …, c)
• Therefore, the conditional risk is:
  R(α_i | x) = Σ_{j=1}^{c} λ(α_i | ω_j) P(ω_j | x)
             = Σ_{j≠i} P(ω_j | x)
             = 1 - P(ω_i | x)
• "The risk corresponding to this loss function is the average probability of error."
• Minimizing the risk requires maximizing P(ω_i | x), since R(α_i | x) = 1 - P(ω_i | x)
• Minimum-error-rate decision rule:
  decide ω_i if P(ω_i | x) > P(ω_j | x) for all j ≠ i
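With the zero-one loss, the decision reduces to picking the largest posterior; a minimal sketch (the posterior values are illustrative and assumed already computed):

    import numpy as np

    posterior = np.array([0.2, 0.5, 0.3])   # P(w_i | x) for c = 3 classes, illustrative
    decision = np.argmax(posterior)          # decide the w_i with the largest posterior
    p_error = 1.0 - posterior[decision]      # conditional risk R(a_i | x) = 1 - P(w_i | x)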
• Investigating the loss function: let
  θ_λ = [(λ_12 - λ_22) / (λ_21 - λ_11)] × [P(ω_2) / P(ω_1)]
  then decide ω_1 if P(x | ω_1) / P(x | ω_2) > θ_λ
• If λ is the zero-one loss function,
  λ_a = | 0  1 |
        | 1  0 |
  then θ_λ = P(ω_2) / P(ω_1) = θ_a
• If instead
  λ_b = | 0  2 |
        | 1  0 |
  then θ_λ = 2 P(ω_2) / P(ω_1) = θ_b
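A quick numeric check of the two thresholds (the prior values below are assumed for illustration):

    # With assumed priors P(w1) = 2/3, P(w2) = 1/3:
    P1, P2 = 2 / 3, 1 / 3
    theta_a = P2 / P1          # zero-one loss:      theta_a = P(w2)/P(w1) = 0.5
    theta_b = 2 * P2 / P1      # loss lam12 = 2:     theta_b = 2 P(w2)/P(w1) = 1.0
    # The larger threshold theta_b shrinks the region where we decide w1,
    # because misclassifying a true w2 as w1 is now twice as costly.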
[Figure: Decision Regions: Effect of Loss Functions]
Classifiers, Discriminant Functions and Decision Surfaces

• The multicategory case
  Set of discriminant functions g_i(x), i = 1, …, c
  The classifier assigns a feature vector x to class ω_i if:
  g_i(x) > g_j(x) for all j ≠ i
• Let g_i(x) = -R(α_i | x)
  (the maximum discriminant corresponds to the minimum risk)
• For the minimum error rate, we take g_i(x) = P(ω_i | x)
  (the maximum discriminant corresponds to the maximum posterior)
• In this case, we can also write:
  g_i(x) = P(x | ω_i) P(ω_i), or
  g_i(x) = ln P(x | ω_i) + ln P(ω_i)     (ln: natural logarithm)
[Figure: General Statistical Classifier]
• The feature space is divided into c decision regions:
  if g_i(x) > g_j(x) for all j ≠ i, then x is in R_i
  (R_i means: assign x to ω_i)
• The two-category case:
  The classifier is a dichotomizer with two discriminant functions g_1 and g_2
  Let g(x) ≡ g_1(x) - g_2(x)
  Decide ω_1 if g(x) > 0; otherwise decide ω_2
• The computation of g(x) (two equivalent choices for the dichotomizer):
  g(x) = P(ω_1 | x) - P(ω_2 | x)
  or
  g(x) = ln [P(x | ω_1) / P(x | ω_2)] + ln [P(ω_1) / P(ω_2)]
The Normal Density

• Univariate density
  – A density that is analytically tractable
  – Continuous density
  – Many processes are asymptotically Gaussian
  – Handwritten characters and speech sounds are examples of prototypes
    corrupted by a random process (central limit theorem)

  p(x) = [1 / (σ √(2π))] exp[ -(1/2) ((x - µ) / σ)² ]

  where: µ = mean (expected value) of x
         σ² = variance of x
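A sketch of evaluating this density with NumPy (the mean and standard deviation passed in are assumed example values):

    import numpy as np

    def univariate_normal(x, mu, sigma):
        # p(x) = 1/(sigma*sqrt(2*pi)) * exp(-0.5*((x - mu)/sigma)**2)
        return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    print(univariate_normal(x=0.0, mu=0.0, sigma=1.0))   # about 0.3989, the peak of the standard normal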
[Figure: The Normal Density]
• Multivariate density
• The multivariate normal density in d dimensions is:

  p(x) = [1 / ((2π)^{d/2} |Σ|^{1/2})] exp[ -(1/2) (x - µ)^t Σ⁻¹ (x - µ) ]

  where: x = (x_1, x_2, …, x_d)^t
         µ = (µ_1, µ_2, …, µ_d)^t is the mean vector
         Σ is the d×d covariance matrix
         |Σ| and Σ⁻¹ are its determinant and inverse, respectively
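An equivalent sketch in d dimensions (the mean vector and covariance matrix below are assumed example values):

    import numpy as np

    def multivariate_normal(x, mu, Sigma):
        # p(x) = exp(-0.5*(x-mu)^t Sigma^{-1} (x-mu)) / ((2*pi)^{d/2} |Sigma|^{1/2})
        d = len(mu)
        diff = x - mu
        norm = (2 * np.pi) ** (d / 2) * np.sqrt(np.linalg.det(Sigma))
        return np.exp(-0.5 * diff @ np.linalg.inv(Sigma) @ diff) / norm

    mu = np.zeros(2)
    Sigma = np.array([[1.0, 0.5], [0.5, 2.0]])
    print(multivariate_normal(np.array([0.2, -0.1]), mu, Sigma))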
Discriminant Functions for the Normal Density

• We saw that minimum-error-rate classification can be achieved by the
  discriminant function
  g_i(x) = ln P(x | ω_i) + ln P(ω_i)
• In the case of the multivariate normal distribution:

  g_i(x) = -(1/2)(x - µ_i)^t Σ_i⁻¹ (x - µ_i) - (d/2) ln 2π - (1/2) ln |Σ_i| + ln P(ω_i)
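A sketch of this discriminant, assigning x to the class with the largest g_i(x) (the class means, covariances, and priors below are assumed for illustration):

    import numpy as np

    def g(x, mu, Sigma, prior):
        # g_i(x) = -0.5*(x-mu)^t Sigma^{-1} (x-mu) - (d/2) ln(2 pi) - 0.5 ln|Sigma| + ln P(w_i)
        d = len(mu)
        diff = x - mu
        return (-0.5 * diff @ np.linalg.inv(Sigma) @ diff
                - 0.5 * d * np.log(2 * np.pi)
                - 0.5 * np.log(np.linalg.det(Sigma))
                + np.log(prior))

    x = np.array([0.5, 1.0])
    classes = [(np.array([0.0, 0.0]), np.eye(2), 0.5),
               (np.array([2.0, 2.0]), np.eye(2), 0.5)]
    decision = np.argmax([g(x, m, S, P) for m, S, P in classes])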
Discrimination and Classification for Different Cases

• Case Σ_i = σ²I

  g_i(x) = w_i^t x + w_i0     (linear discriminant function)

  where:
  w_i  = µ_i / σ²
  w_i0 = -(1/(2σ²)) µ_i^t µ_i + ln P(ω_i)
  (w_i0 is called the threshold for the i-th category)
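A sketch of the resulting linear machine (σ², the class means, and the priors below are assumed values):

    import numpy as np

    sigma2 = 1.0                                        # common variance sigma^2
    mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]  # class means, illustrative
    priors = [0.5, 0.5]

    # w_i = mu_i / sigma^2,   w_i0 = -mu_i^t mu_i / (2 sigma^2) + ln P(w_i)
    w  = [mu / sigma2 for mu in mus]
    w0 = [-(mu @ mu) / (2 * sigma2) + np.log(P) for mu, P in zip(mus, priors)]

    x = np.array([0.8, 1.3])
    decision = np.argmax([wi @ x + wi0 for wi, wi0 in zip(w, w0)])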
– A classifier that uses linear discriminant functions is called a "linear machine"
– The decision surfaces for a linear machine are pieces of hyperplanes defined by
  g_i(x) = g_j(x)
[Figure: Equal Covariances]
• The hyperplane separating R_i and R_j passes through the point

  x_0 = (1/2)(µ_i + µ_j) - [σ² / ‖µ_i - µ_j‖²] ln [P(ω_i) / P(ω_j)] (µ_i - µ_j)

• It is always orthogonal to the line linking the means
• If P(ω_i) = P(ω_j), then x_0 = (1/2)(µ_i + µ_j)
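A quick check of this boundary point (the means, variance, and priors below are assumed values; with equal priors the log term vanishes and x_0 is the midpoint of the means):

    import numpy as np

    mu_i, mu_j = np.array([0.0, 0.0]), np.array([2.0, 2.0])
    sigma2, P_i, P_j = 1.0, 0.6, 0.4

    diff = mu_i - mu_j
    x0 = 0.5 * (mu_i + mu_j) - sigma2 / (diff @ diff) * np.log(P_i / P_j) * diff
    # x0 is shifted away from the mean of the more probable class,
    # enlarging that class's decision region.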
[Figures: Shift in Priors]
• Case Σ_i = Σ (the covariance matrices of all classes are identical but arbitrary)
  The hyperplane separating R_i and R_j passes through

  x_0 = (1/2)(µ_i + µ_j) - [ln (P(ω_i) / P(ω_j)) / ((µ_i - µ_j)^t Σ⁻¹ (µ_i - µ_j))] (µ_i - µ_j)

  (the hyperplane separating R_i and R_j is generally not orthogonal to the line
  between the means)
[Figures: Decision Surfaces]
• Case Σ_i = arbitrary
  – The covariance matrices are different for each category

  g_i(x) = x^t W_i x + w_i^t x + w_i0

  where:
  W_i  = -(1/2) Σ_i⁻¹
  w_i  = Σ_i⁻¹ µ_i
  w_i0 = -(1/2) µ_i^t Σ_i⁻¹ µ_i - (1/2) ln |Σ_i| + ln P(ω_i)

  (The decision surfaces are hyperquadrics: hyperplanes, pairs of hyperplanes,
  hyperspheres, hyperellipsoids, hyperparaboloids, etc.)
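A sketch of the general quadratic discriminant (the class parameters below are assumed for illustration):

    import numpy as np

    def quadratic_g(x, mu, Sigma, prior):
        # g_i(x) = x^t W_i x + w_i^t x + w_i0, with W_i, w_i, w_i0 as defined above
        Sigma_inv = np.linalg.inv(Sigma)
        W  = -0.5 * Sigma_inv
        w  = Sigma_inv @ mu
        w0 = -0.5 * mu @ Sigma_inv @ mu - 0.5 * np.log(np.linalg.det(Sigma)) + np.log(prior)
        return x @ W @ x + w @ x + w0

    x = np.array([1.0, 0.0])
    classes = [(np.array([0.0, 0.0]), np.array([[1.0, 0.0], [0.0, 1.0]]), 0.5),
               (np.array([2.0, 0.0]), np.array([[2.0, 0.3], [0.3, 0.5]]), 0.5)]
    decision = np.argmax([quadratic_g(x, m, S, P) for m, S, P in classes])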
[Figures: Decision Boundaries]
Bayesian Decision Theory – Discrete Features

• The components of x are binary or integer valued; x can take only one of m
  discrete values v_1, v_2, …, v_m
• Case of independent binary features in a two-category problem:
  Let x = (x_1, x_2, …, x_d)^t, where each x_i is either 0 or 1, with probabilities:
  p_i = P(x_i = 1 | ω_1)
  q_i = P(x_i = 1 | ω_2)
• The discriminant function in this case is:

  g(x) = Σ_{i=1}^{d} w_i x_i + w_0

  where
  w_i = ln [ p_i (1 - q_i) / (q_i (1 - p_i)) ],     i = 1, …, d
  and
  w_0 = Σ_{i=1}^{d} ln [(1 - p_i) / (1 - q_i)] + ln [P(ω_1) / P(ω_2)]

  Decide ω_1 if g(x) > 0, and ω_2 if g(x) ≤ 0
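A sketch of this linear discriminant for independent binary features (the p_i, q_i, and priors below are assumed illustrative values):

    import numpy as np

    p = np.array([0.8, 0.6, 0.7])   # p_i = P(x_i = 1 | w1), illustrative
    q = np.array([0.3, 0.4, 0.2])   # q_i = P(x_i = 1 | w2), illustrative
    P1, P2 = 0.5, 0.5               # priors, assumed equal

    w  = np.log(p * (1 - q) / (q * (1 - p)))                  # feature weights w_i
    w0 = np.sum(np.log((1 - p) / (1 - q))) + np.log(P1 / P2)  # bias term w_0

    x = np.array([1, 0, 1])         # an observed binary feature vector
    g = w @ x + w0
    decision = "w1" if g > 0 else "w2"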