CS590 Pattern Recognition<br />
cheatsheet<br />
Wei Peng<br />
1 Bayesian Decision Theory<br />
1.1 Preliminaries<br />
$$P(\omega_i|x) = \frac{p(x|\omega_i)P(\omega_i)}{p(x)}$$<br />
$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$<br />
$p(X = x|\omega)$ is the class-conditional density function.<br />
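As a minimal sketch of the posterior formula above, the following computes $P(\omega_i|x)$ for two classes. The Gaussian class-conditional densities, the equal priors, and the function name `posteriors` are illustrative assumptions, not part of the cheatsheet.

```python
from statistics import NormalDist

def posteriors(x, class_densities, priors):
    """Return P(w_i | x) for each class via Bayes' rule."""
    # likelihood * prior for each class
    joint = [d.pdf(x) * p for d, p in zip(class_densities, priors)]
    # evidence p(x) = sum_i p(x | w_i) P(w_i)
    evidence = sum(joint)
    return [j / evidence for j in joint]

# Assumed example: two unit-variance Gaussians with equal priors.
densities = [NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)]
priors = [0.5, 0.5]
post = posteriors(1.0, densities, priors)
print(post)
```

At the midpoint between the two means, with equal priors and equal variances, the two posteriors coincide.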
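Risk function<br />
$$R(\omega,\delta) = R_\delta = E_\omega[L(\omega,\delta(x))] = \int_x L(\omega,\delta(x))\,p(x|\omega)\,dx$$<br />
Bayesian risk function (conditional risk of action $\alpha_i$, i.e., deciding $\omega_i$)<br />
$$R(\alpha_i|x) = \sum_{j=1}^{c} \lambda_{ij}P(\omega_j|x)$$<br />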
Two-category classification<br />
R(α1|x) = λ11P(ω1|x)+λ12P(ω2|x)<br />
R(α2|x) = λ21P(ω1|x)+λ22P(ω2|x).<br />
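Choose $\alpha_1$ if $R(\alpha_1|x) < R(\alpha_2|x)$, i.e. (assuming $\lambda_{21} > \lambda_{11}$)<br />
$$\frac{p(x|\omega_1)}{p(x|\omega_2)} > \frac{\lambda_{12}-\lambda_{22}}{\lambda_{21}-\lambda_{11}} \times \frac{P(\omega_2)}{P(\omega_1)}$$<br />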
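The two-category rule can be sketched as a likelihood-ratio test. The Gaussian densities, the loss values, and the function name `decide` below are assumptions chosen for illustration; the sketch also assumes $\lambda_{21} > \lambda_{11}$ so that the threshold is well defined.

```python
from statistics import NormalDist

def decide(x, p1, p2, prior1, prior2, lam):
    """Return 1 to take action alpha_1 (decide w_1), otherwise 2.

    lam[i][j] is the loss lambda_{(i+1)(j+1)}: taking action i+1
    when the true class is j+1. Assumes lam[1][0] > lam[0][0].
    """
    ratio = p1.pdf(x) / p2.pdf(x)
    threshold = ((lam[0][1] - lam[1][1]) / (lam[1][0] - lam[0][0])) * (prior2 / prior1)
    return 1 if ratio > threshold else 2

p1, p2 = NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)
zero_one = [[0.0, 1.0], [1.0, 0.0]]
costly_miss = [[0.0, 5.0], [1.0, 0.0]]   # deciding w_1 when w_2 is true costs 5

print(decide(0.5, p1, p2, 0.5, 0.5, zero_one))
print(decide(0.5, p1, p2, 0.5, 0.5, costly_miss))
```

Raising $\lambda_{12}$ raises the threshold on the likelihood ratio, so the same observation can flip from $\omega_1$ to $\omega_2$.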
1.2 Minimax Criterion<br />
Minimax criterion implies worst case analysis.<br />
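$$R = \int_{\mathcal{R}_1} [\lambda_{11}P(\omega_1)p(x|\omega_1)+\lambda_{12}P(\omega_2)p(x|\omega_2)]\,dx + \int_{\mathcal{R}_2} [\lambda_{21}P(\omega_1)p(x|\omega_1)+\lambda_{22}P(\omega_2)p(x|\omega_2)]\,dx$$<br />
Use the facts $P(\omega_2) = 1-P(\omega_1)$ and $\int_{\mathcal{R}_1} p(x|\omega_1)\,dx = 1-\int_{\mathcal{R}_2} p(x|\omega_1)\,dx$:<br />
$$R(P(\omega_1)) = \underbrace{\lambda_{22}+(\lambda_{12}-\lambda_{22})\int_{\mathcal{R}_1} p(x|\omega_2)\,dx}_{\text{the minimax risk}} + P(\omega_1)\Big[(\lambda_{11}-\lambda_{22}) + (\lambda_{21}-\lambda_{11})\int_{\mathcal{R}_2} p(x|\omega_1)\,dx - (\lambda_{12}-\lambda_{22})\int_{\mathcal{R}_1} p(x|\omega_2)\,dx\Big].$$<br />
For the minimax solution, the decision regions are chosen so that the coefficient of $P(\omega_1)$ is 0; the risk then no longer depends on the prior.<br />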
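A minimal numerical sketch of the minimax condition: for two equal-variance univariate Gaussians with $\mu_1 < \mu_2$, the decision regions reduce to a single threshold $t$ (region 1 is $x < t$), and bisection finds the $t$ at which the coefficient of $P(\omega_1)$ vanishes. The densities, the loss matrix, and the function name `minimax_threshold` are assumptions for illustration.

```python
from statistics import NormalDist

def minimax_threshold(p1, p2, lam, lo, hi, tol=1e-12):
    """Bisection for the threshold t (R1 = {x < t}, R2 = {x >= t}) at which
    the coefficient of P(w1) in R(P(w1)) is zero."""
    (l11, l12), (l21, l22) = lam

    def coefficient(t):
        int_r2_p1 = 1.0 - p1.cdf(t)   # integral of p(x|w1) over R2
        int_r1_p2 = p2.cdf(t)         # integral of p(x|w2) over R1
        return (l11 - l22) + (l21 - l11) * int_r2_p1 - (l12 - l22) * int_r1_p2

    # coefficient(t) is decreasing in t when l21 > l11 and l12 > l22
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coefficient(mid) > 0:
            lo = mid
        else:
            hi = mid
    t = 0.5 * (lo + hi)
    risk = l22 + (l12 - l22) * p2.cdf(t)   # the minimax risk term
    return t, risk, coefficient(t)

p1, p2 = NormalDist(0.0, 1.0), NormalDist(2.0, 1.0)
lam = [[0.0, 2.0], [1.0, 0.0]]
t, risk, coeff = minimax_threshold(p1, p2, lam, lo=-5.0, hi=7.0)
print(t, risk, coeff)
```

With unequal variances the Bayes regions are generally not a single interval, so this one-threshold search would only be an approximation there.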
1.3 Neyman-Pearson<br />
For a hypothesis test, among all tests of size $\alpha$ (the given Type I error, i.e., the false positive rate)<br />
$$\alpha = \int_{\Omega_2} p(x|\omega_1)\,dx,$$<br />
the likelihood-ratio test is the most powerful one (minimal Type II error, i.e., false negative rate), with power $1-\beta$, in which $\beta$ is given by<br />
$$\beta = \int_{\Omega_1} p(x|\omega_2)\,dx.$$<br />
The likelihood-ratio test favors $\omega_1$ if $p(x|\omega_1)/p(x|\omega_2) \ge T$, in which $T$ is some threshold; otherwise, it favors $\omega_2$.<br />
Thus, the Neyman-Pearson rule states that the decision region $\Omega_1$ for $\omega_1$ is<br />
$$C_\alpha = \{x \mid \Lambda(x) \ge T_\alpha\}$$<br />
in which<br />
$$\Lambda(x) \doteq \frac{p(x|\omega_1)}{p(x|\omega_2)}$$<br />
is the likelihood ratio and $T_\alpha$ satisfies<br />
$$\alpha = \int_{\Omega_2} p(x|\omega_1)\,dx = \int_{-\infty}^{T_\alpha} p(\Lambda|\omega_1)\,d\Lambda.$$<br />
Then the power of the test is<br />
$$1-\beta = \int_{\Omega_2} p(x|\omega_2)\,dx = \int_{-\infty}^{T_\alpha} p(\Lambda|\omega_2)\,d\Lambda.$$<br />
Example<br />
$$p(x|\omega_1) \sim N(\mu_1,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu_1)^2}{2\sigma^2}\right)$$<br />
$$p(x|\omega_2) \sim N(\mu_2,\sigma^2) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu_2)^2}{2\sigma^2}\right)$$<br />
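Let the decision region for $\omega_2$ be $\Omega_2 = \{x \mid x \ge x_1\}$ (this is the region $\Lambda(x) < T_\alpha$ when $\mu_2 > \mu_1$); we have<br />
$$\alpha = \frac{1}{2} - \frac{1}{2}\,\mathrm{erf}\left(\frac{x_1-\mu_1}{\sqrt{2}\sigma}\right),$$<br />
or equivalently<br />
$$x_1 = \mu_1 + \sqrt{2}\sigma\,\mathrm{erf}^{-1}(1-2\alpha).$$<br />
The power is<br />
$$1-\beta = \frac{1}{2} - \frac{1}{2}\,\mathrm{erf}\left(\frac{x_1-\mu_2}{\sqrt{2}\sigma}\right).$$<br />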
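The Gaussian example can be checked numerically. The sketch below assumes $\mu_2 > \mu_1$ and uses the standard normal quantile from Python's `statistics.NormalDist` in place of $\mathrm{erf}^{-1}$; the parameter values and the function name `neyman_pearson_gaussian` are illustrative assumptions.

```python
import math
from statistics import NormalDist

def neyman_pearson_gaussian(mu1, mu2, sigma, alpha):
    """Threshold x1 and power for deciding w_2 when x >= x1 (assumes mu2 > mu1)."""
    # P(x >= x1 | w1) = alpha  <=>  x1 = mu1 + sigma * Phi^{-1}(1 - alpha)
    x1 = mu1 + sigma * NormalDist().inv_cdf(1.0 - alpha)
    # power = P(x >= x1 | w2)
    power = 1.0 - NormalDist(mu2, sigma).cdf(x1)
    return x1, power

mu1, mu2, sigma, alpha = 0.0, 2.0, 1.0, 0.05
x1, power = neyman_pearson_gaussian(mu1, mu2, sigma, alpha)

# Cross-check against the erf form of the size of the test.
alpha_check = 0.5 - 0.5 * math.erf((x1 - mu1) / (math.sqrt(2.0) * sigma))
print(x1, power, alpha_check)
```

The cross-check recovers the requested size $\alpha$ from the erf expression, which is a quick way to confirm the threshold formula.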
1.4 Bayesian Classifier: Gaussian Density<br />
1.4.1 Background<br />
Use a discriminant function, e.g., the posterior probabilities (with or without the normalization factor) or the log-posterior $g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i)$.<br />
Univariate and multivariate Gaussian<br />
$$p(x) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$<br />
$$p(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$<br />
We have<br />
$$\mu = E[x] = \int x\,p(x)\,dx,$$<br />
$$\Sigma = E[(x-\mu)(x-\mu)^T] = \int (x-\mu)(x-\mu)^T p(x)\,dx,$$<br />
and, in the univariate case,<br />
$$\text{cdf} = \frac{1}{2}\left[1+\mathrm{erf}\left(\frac{x-\mu}{\sqrt{2}\sigma}\right)\right]$$<br />
in which<br />
$$\mathrm{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$$<br />
is the error function, defined here for $x \ge 0$ and extended to $x < 0$ by $\mathrm{erf}(-x) = -\mathrm{erf}(x)$.<br />
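The Mahalanobis distance is $\sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}$.<br />
Let $\Lambda = \mathrm{diag}(\lambda_1,\ldots,\lambda_d)$ and $A^T = [e_1,\ldots,e_d]$, in which $\lambda_k$ and $e_k$ are the eigenvalues and orthonormal eigenvectors of $\Sigma$; then<br />
$$\Sigma = A^T\Lambda A.$$<br />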
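A small sketch of two quantities from this subsection: the univariate Gaussian cdf written with `math.erf`, and the Mahalanobis distance for a two-dimensional covariance inverted by hand. The function names and the example values are assumptions for illustration.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Univariate Gaussian cdf via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (math.sqrt(2.0) * sigma)))

def mahalanobis_2d(x, mu, cov):
    """sqrt((x - mu)^T cov^{-1} (x - mu)) for a 2x2 covariance matrix."""
    (a, b), (c, d) = cov
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]   # closed-form 2x2 inverse
    dx = [x[0] - mu[0], x[1] - mu[1]]
    q = sum(dx[i] * inv[i][j] * dx[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

print(gaussian_cdf(0.0, 0.0, 1.0))
print(mahalanobis_2d((2.0, 0.0), (0.0, 0.0), [[4.0, 0.0], [0.0, 1.0]]))
```

With the identity covariance the Mahalanobis distance reduces to the Euclidean distance; a larger variance along one axis shrinks distances along that axis.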
1.4.2 Bayesian Decision Boundary<br />
Take log-posterior as the discriminant function<br />
$$g_i(x) = \ln p(x|\omega_i) + \ln P(\omega_i) = -\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).$$<br />
The decision boundary between $\omega_i$ and $\omega_j$ is<br />
• $\Sigma_i = \sigma^2 I$<br />
$$w^T(x-x_0) = 0$$<br />
in which $w = \mu_i-\mu_j$ and<br />
$$x_0 = \frac{1}{2}(\mu_i+\mu_j) - \frac{\sigma^2}{\|\mu_i-\mu_j\|^2}\ln\left[\frac{P(\omega_i)}{P(\omega_j)}\right](\mu_i-\mu_j).$$<br />
• $\Sigma_i = \Sigma$<br />
$$w^T(x-x_0) = 0$$<br />
in which $w = \Sigma^{-1}(\mu_i-\mu_j)$ and<br />
$$x_0 = \frac{1}{2}(\mu_i+\mu_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\mu_i-\mu_j)^T\Sigma^{-1}(\mu_i-\mu_j)}(\mu_i-\mu_j).$$<br />
• $\Sigma_i$ is arbitrary<br />
$$g_i(x) = x^T W_i x + w_i^T x + w_{i0}$$<br />
in which $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, and<br />
$$w_{i0} = -\frac{1}{2}\mu_i^T\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).$$<br />
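For the $\Sigma_i = \sigma^2 I$ case, a short sketch can confirm that the two discriminants agree at $x_0$ and at any other point of the hyperplane $w^T(x-x_0)=0$. It uses $\|\mu_i-\mu_j\|^2$ in the denominator of the shift term. The means, priors, and function names are illustrative assumptions.

```python
import math

def g(x, mu, sigma2, prior):
    """Log-posterior discriminant for Sigma_i = sigma^2 I, where ln|Sigma_i| = d ln(sigma^2)."""
    d = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
    return (-sq / (2.0 * sigma2) - 0.5 * d * math.log(2.0 * math.pi)
            - 0.5 * d * math.log(sigma2) + math.log(prior))

def boundary(mu_i, mu_j, sigma2, prior_i, prior_j):
    """Return (w, x0) of the hyperplane w^T (x - x0) = 0."""
    w = [a - b for a, b in zip(mu_i, mu_j)]
    norm2 = sum(c * c for c in w)
    shift = sigma2 / norm2 * math.log(prior_i / prior_j)
    x0 = [0.5 * (a + b) - shift * c for a, b, c in zip(mu_i, mu_j, w)]
    return w, x0

mu_i, mu_j = (0.0, 0.0), (2.0, 0.0)
w, x0 = boundary(mu_i, mu_j, 1.0, 0.7, 0.3)
on_plane = (x0[0], x0[1] + 3.0)   # x0 moved along a direction orthogonal to w
print(x0, g(x0, mu_i, 1.0, 0.7) - g(x0, mu_j, 1.0, 0.3))
```

With unequal priors, $x_0$ moves away from the more probable class mean, enlarging that class's decision region.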
1.5 Two-class Error Boundary<br />
1.5.1 Chernoff Bound<br />
$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error}|x)\,p(x)\,dx$$<br />
and by the Bayesian decision rule<br />
$$P(\text{error}|x) = \min[P(\omega_1|x),P(\omega_2|x)] \doteq \min[a,b].$$<br />
By the inequality $\min[a,b] \le a^\beta b^{1-\beta}$ for $a,b \ge 0$ and $0 \le \beta \le 1$, and the Bayes rule $P(\omega_i|x)p(x) = p(x|\omega_i)P(\omega_i)$, we have<br />
$$P(\text{error}) \le P(\omega_1)^\beta P(\omega_2)^{1-\beta}\int p(x|\omega_1)^\beta\,p(x|\omega_2)^{1-\beta}\,dx$$<br />
for any $0 \le \beta \le 1$.<br />
For the case that $p(x|\omega_i) \sim N(\mu_i,\Sigma_i)$, we get<br />
$$\int p(x|\omega_1)^\beta\,p(x|\omega_2)^{1-\beta}\,dx = \exp[-k(\beta)]$$<br />
where<br />
$$k(\beta) = \frac{\beta(1-\beta)}{2}(\mu_2-\mu_1)^T[(1-\beta)\Sigma_1+\beta\Sigma_2]^{-1}(\mu_2-\mu_1) + \frac{1}{2}\ln\frac{|(1-\beta)\Sigma_1+\beta\Sigma_2|}{|\Sigma_1|^{1-\beta}|\Sigma_2|^{\beta}}.$$<br />
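Find the $\beta = \beta^*$ that minimizes the right-hand side $P(\omega_1)^\beta P(\omega_2)^{1-\beta}e^{-k(\beta)}$ of the bound. The Chernoff bound is found by substituting $\beta^*$ into this bound on $P(\text{error})$.<br />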
1.5.2 Bhattacharyya Bound<br />
The Bhattacharyya bound is found by substituting $\beta = \frac{1}{2}$ into the bound on $P(\text{error})$.<br />
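A univariate sketch of both bounds, under the assumption of two Gaussian class-conditional densities with the parameters shown. The function `k` uses the weights $(1-\beta)\sigma_1^2 + \beta\sigma_2^2$, which is what direct integration of $p(x|\omega_1)^\beta p(x|\omega_2)^{1-\beta}$ gives in one dimension. A simple grid search stands in for a proper one-dimensional optimizer; all names and values here are illustrative.

```python
import math

def k(beta, mu1, var1, mu2, var2):
    """k(beta) with exp(-k) = integral of p(x|w1)^beta p(x|w2)^(1-beta) dx (1-D Gaussians)."""
    mix = (1.0 - beta) * var1 + beta * var2
    quad = beta * (1.0 - beta) / 2.0 * (mu2 - mu1) ** 2 / mix
    logdet = 0.5 * math.log(mix / (var1 ** (1.0 - beta) * var2 ** beta))
    return quad + logdet

def bound(beta, prior1, prior2, mu1, var1, mu2, var2):
    """Upper bound on P(error) for a given beta."""
    return prior1 ** beta * prior2 ** (1.0 - beta) * math.exp(-k(beta, mu1, var1, mu2, var2))

params = (0.0, 1.0, 3.0, 4.0)     # mu1, var1, mu2, var2 (assumed values)
priors = (0.6, 0.4)

betas = [i / 1000 for i in range(1001)]
beta_star = min(betas, key=lambda b: bound(b, *priors, *params))
chernoff = bound(beta_star, *priors, *params)
bhattacharyya = bound(0.5, *priors, *params)
print(beta_star, chernoff, bhattacharyya)
```

Because $\beta = 1/2$ is one of the grid points, the Chernoff value found this way can never exceed the Bhattacharyya value; the latter trades some tightness for not needing a search.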
1.6 Noisy Feature<br />
Deal with bad (noisy) features by marginalization<br />
$$P(\omega_i|x_g) = \frac{\int P(\omega_i|x_g,x_b)\,p(x_g,x_b)\,dx_b}{\int p(x_g,x_b)\,dx_b}$$<br />
Choose ωi if P(ωi|xg) > P(ωj|xg), ∀j ≠ i.<br />
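The marginalization can be sketched on a discrete grid, where the integrals over the bad feature $x_b$ become sums. The joint probability table below is hypothetical, as is the function name `posterior_given_good`.

```python
def posterior_given_good(joint, xg):
    """P(w | xg), marginalizing the bad feature xb.

    joint[(w, xg, xb)] = P(w, xg, xb) on a discrete grid (hypothetical table).
    """
    classes = sorted({w for (w, g, b) in joint if g == xg})
    bad_vals = sorted({b for (w, g, b) in joint if g == xg})
    num = {w: 0.0 for w in classes}
    den = 0.0
    for b in bad_vals:
        p_x = sum(joint.get((w, xg, b), 0.0) for w in classes)   # p(xg, xb)
        if p_x == 0.0:
            continue
        for w in classes:
            p_w_given_x = joint.get((w, xg, b), 0.0) / p_x        # P(w | xg, xb)
            num[w] += p_w_given_x * p_x
        den += p_x
    return {w: num[w] / den for w in classes}

joint = {
    (1, 0, 0): 0.10, (1, 0, 1): 0.20, (2, 0, 0): 0.05, (2, 0, 1): 0.15,
    (1, 1, 0): 0.10, (1, 1, 1): 0.05, (2, 1, 0): 0.20, (2, 1, 1): 0.15,
}
post = posterior_given_good(joint, xg=0)
best = max(post, key=post.get)
print(post, best)
```

The result matches what one gets by summing the joint table over $x_b$ directly, which is the point of the formula: the bad feature is integrated out rather than guessed.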
2 Parameter Estimation/Learning<br />
2.1 Maximum Likelihood Estimation<br />
2.1.1 Background<br />
p(D|θ) is the likelihood of θ with respect to the sample<br />
D. By the independent-sample assumption,<br />
$$p(D|\theta) = \prod_{k=1}^{n} p(x_k|\theta).$$<br />
Working with the log-likelihood $l(\theta) = \ln p(D|\theta) = \sum_{k=1}^{n}\ln p(x_k|\theta)$ makes the maths tractable.<br />
2.1.2 Plug-in Rule<br />
Substitute the estimated parameters $\hat\theta$ for the true ones in the class-conditional densities. Then use $p(x|\omega_i,\hat\theta)$ as if they were the true densities in constructing the decision rule.<br />
2.1.3 Example: Gaussian<br />
Conclusion: use sample mean and covariance.<br />
$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \text{sample mean}$$<br />
$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n}(x_i-\hat\mu)(x_i-\hat\mu)^T = \text{sample covariance}.$$<br />
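A minimal pure-Python sketch of the maximum likelihood estimates, using the $1/n$ normalization of the formulas above (not the unbiased $1/(n-1)$ variant). The data points and the function name `gaussian_mle` are made up for illustration.

```python
def gaussian_mle(samples):
    """ML estimates (mean vector, covariance matrix) for d-dimensional samples."""
    n = len(samples)
    d = len(samples[0])
    mu = [sum(x[k] for x in samples) / n for k in range(d)]
    cov = [[sum((x[r] - mu[r]) * (x[c] - mu[c]) for x in samples) / n
            for c in range(d)]
           for r in range(d)]
    return mu, cov

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 9.0)]
mu_hat, cov_hat = gaussian_mle(data)
print(mu_hat)
print(cov_hat)
```

The resulting `mu_hat` and `cov_hat` are what the plug-in rule would substitute into the Gaussian class-conditional density.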
2.2 Bayesian Learning<br />
2.2.1 Idea<br />
• Treats the parameters themselves as random variables and estimates a density for them.<br />
• Can be formulated as a recursive estimation, allowing new evidence to be incorporated one sample at a time as it arrives.<br />
2.2.2 Formulation<br />
$$P(\omega_i|x,D) = \frac{p(x|\omega_i,D_i)P(\omega_i)}{\sum_{j=1}^{c} p(x|\omega_j,D_j)P(\omega_j)}$$<br />
$$p(x|D) = \int p(x|\theta)\,p(\theta|D)\,d\theta$$<br />
The key lies in estimating p(θ|D).<br />
2.2.3 Example: Univariate Gaussian<br />
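If $p(x_k|\mu) \sim N(\mu,\sigma^2)$ and $p(\mu) \sim N(\mu_0,\sigma_0^2)$, we have<br />
$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2+\sigma^2}\right)\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2+\sigma^2}\,\mu_0$$<br />
$$\sigma_n^2 = \frac{\sigma_0^2\sigma^2}{n\sigma_0^2+\sigma^2},$$<br />
in which $\hat\mu_n$ is the sample mean of the $n$ observations.<br />
$\sigma^2/\sigma_0^2$ is called dogmatism.<br />
We get<br />
$$p(\mu|D) \sim N(\mu_n,\sigma_n^2), \qquad p(x|D) \sim N(\mu_n,\sigma^2+\sigma_n^2).$$<br />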
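The univariate update can be sketched in a few lines, together with a check that feeding the samples one at a time (using each posterior as the next prior) gives the same result as the batch formula. The sample values, prior parameters, and function names are assumptions for illustration.

```python
def posterior_params(samples, sigma2, mu0, sigma0_2):
    """Posterior mean and variance of mu given samples with known variance sigma2."""
    n = len(samples)
    mu_hat = sum(samples) / n                 # sample mean
    denom = n * sigma0_2 + sigma2
    mu_n = (n * sigma0_2 / denom) * mu_hat + (sigma2 / denom) * mu0
    sigma_n2 = sigma0_2 * sigma2 / denom
    return mu_n, sigma_n2

def recursive_update(samples, sigma2, mu0, sigma0_2):
    """Same posterior, obtained by incorporating one sample at a time."""
    mu, var = mu0, sigma0_2
    for x in samples:
        mu, var = posterior_params([x], sigma2, mu, var)
    return mu, var

data = [1.2, 0.8, 1.5, 0.9]
mu_n, sigma_n2 = posterior_params(data, sigma2=1.0, mu0=0.0, sigma0_2=4.0)
mu_r, sigma_r2 = recursive_update(data, sigma2=1.0, mu0=0.0, sigma0_2=4.0)
predictive_var = 1.0 + sigma_n2               # variance of p(x|D)
print(mu_n, sigma_n2, predictive_var)
```

As $n$ grows, `sigma_n2` shrinks toward zero and `mu_n` moves from the prior mean toward the sample mean.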
2.2.4 Example: Multivariate Gaussian<br />
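Similar results:<br />
$$p(x|D) \sim N(\mu_n,\Sigma+\Sigma_n)$$<br />
$$\mu_n = \Sigma_0\left(\Sigma_0+\tfrac{1}{n}\Sigma\right)^{-1}\hat\mu_n + \tfrac{1}{n}\Sigma\left(\Sigma_0+\tfrac{1}{n}\Sigma\right)^{-1}\mu_0$$<br />
$$\Sigma_n = \Sigma_0\left(\Sigma_0+\tfrac{1}{n}\Sigma\right)^{-1}\tfrac{1}{n}\Sigma.$$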