CS590 Pattern Recognition 

cheatsheet 

Wei Peng 

1 Bayesian Decision Theory 

1.1 Preliminaries 

$$P(\omega_i \mid x) = \frac{p(x \mid \omega_i)\,P(\omega_i)}{p(x)}$$

$$\text{posterior} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

$p(X = x \mid \omega)$ is the class-conditional probability density function.
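As a minimal sketch of Bayes' rule for a finite set of classes, the snippet below turns class-conditional likelihood values and priors into posteriors. The function name `posteriors` and the numbers are illustrative assumptions, not taken from the course material.

```python
def posteriors(likelihoods, priors):
    """Return P(w_i | x) for each class from p(x | w_i) and P(w_i)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(x) = sum_i p(x | w_i) P(w_i)
    return [j / evidence for j in joint]

# Hypothetical values: p(x|w1) = 0.6, p(x|w2) = 0.2, P(w1) = 0.3, P(w2) = 0.7
post = posteriors([0.6, 0.2], [0.3, 0.7])
```

Dividing by the evidence only rescales the numerators, so the class with the largest likelihood times prior is also the class with the largest posterior.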

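Risk function

$$R(\omega, \delta) = R_\delta = E_\omega[L(\omega, \delta(x))] = \int_x L(\omega, \delta(x))\,p(x \mid \omega)\,dx$$

Bayesian risk function (conditional risk of action $\alpha_i$)

$$R(\alpha_i \mid x) = \sum_{j=1}^{c} \lambda_{ij}\,P(\omega_j \mid x)$$

in which $\lambda_{ij}$ is the loss incurred by taking action $\alpha_i$ when the true class is $\omega_j$.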

Two-category classification 

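$$R(\alpha_1 \mid x) = \lambda_{11}P(\omega_1 \mid x) + \lambda_{12}P(\omega_2 \mid x)$$

$$R(\alpha_2 \mid x) = \lambda_{21}P(\omega_1 \mid x) + \lambda_{22}P(\omega_2 \mid x)$$

Choose $\alpha_1$ if $R(\alpha_1 \mid x) < R(\alpha_2 \mid x)$, i.e. (assuming $\lambda_{21} > \lambda_{11}$)

$$\frac{p(x \mid \omega_1)}{p(x \mid \omega_2)} > \frac{\lambda_{12} - \lambda_{22}}{\lambda_{21} - \lambda_{11}} \times \frac{P(\omega_2)}{P(\omega_1)}$$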
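The two-category rule can be sketched in a few lines of Python. This is only an illustration: the function names, the `loss` layout (`loss[i][j]` holding the loss for action i+1 when the true class is j+1) and the numbers are assumptions made for the example.

```python
def choose_action(lik1, lik2, prior1, prior2, loss):
    """Return 1 or 2: the action with the smaller conditional risk."""
    evidence = lik1 * prior1 + lik2 * prior2
    post1 = lik1 * prior1 / evidence
    post2 = lik2 * prior2 / evidence
    risk1 = loss[0][0] * post1 + loss[0][1] * post2  # R(alpha_1 | x)
    risk2 = loss[1][0] * post1 + loss[1][1] * post2  # R(alpha_2 | x)
    return 1 if risk1 < risk2 else 2

def likelihood_ratio_threshold(prior1, prior2, loss):
    """Threshold on p(x|w1)/p(x|w2); assumes loss[1][0] > loss[0][0]."""
    return (loss[0][1] - loss[1][1]) / (loss[1][0] - loss[0][0]) * prior2 / prior1

# Zero-one loss: no cost for a correct decision, unit cost for an error.
zero_one = [[0.0, 1.0], [1.0, 0.0]]
threshold = likelihood_ratio_threshold(0.3, 0.7, zero_one)
decision = choose_action(0.6, 0.2, 0.3, 0.7, zero_one)
```

With zero-one loss the threshold reduces to the prior ratio, so comparing risks and comparing the likelihood ratio against the threshold give the same decision.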

1.2 Minimax Criterion 

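The minimax criterion is a worst-case analysis: choose the decision regions so that the largest risk over all possible priors is as small as possible.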

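$$R = \int_{\mathcal{R}_1} [\lambda_{11}P(\omega_1)p(x \mid \omega_1) + \lambda_{12}P(\omega_2)p(x \mid \omega_2)]\,dx + \int_{\mathcal{R}_2} [\lambda_{21}P(\omega_1)p(x \mid \omega_1) + \lambda_{22}P(\omega_2)p(x \mid \omega_2)]\,dx$$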

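Use the facts $P(\omega_2) = 1 - P(\omega_1)$ and $\int_{\mathcal{R}_1} p(x \mid \omega_1)\,dx = 1 - \int_{\mathcal{R}_2} p(x \mid \omega_1)\,dx$ to write the risk as a linear function of $P(\omega_1)$:

$$R(P(\omega_1)) = \overbrace{\lambda_{22} + (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(x \mid \omega_2)\,dx}^{\text{the minimax risk}} + P(\omega_1)\Big[(\lambda_{11} - \lambda_{22}) + (\lambda_{21} - \lambda_{11})\int_{\mathcal{R}_2} p(x \mid \omega_1)\,dx - (\lambda_{12} - \lambda_{22})\int_{\mathcal{R}_1} p(x \mid \omega_2)\,dx\Big].$$

For the minimax solution, the coefficient of $P(\omega_1)$ is 0, so the risk does not depend on the priors.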
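Under zero-one loss the zero-coefficient condition says that the two error integrals are equal. The sketch below solves that condition by bisection for two univariate Gaussians. It assumes a single threshold t, with class 1 decided for x < t and class 2 for x >= t; the function name and the bracketing interval are illustrative choices.

```python
from statistics import NormalDist

def minimax_threshold(d1, d2, lo, hi, tol=1e-10):
    """Bisection for t with P(x >= t | w1) = P(x < t | w2).

    Assumes zero-one loss, decision regions {x < t} and {x >= t},
    and that the root lies in [lo, hi].
    """
    def gap(t):
        return (1.0 - d1.cdf(t)) - d2.cdf(t)  # decreasing in t

    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid  # class-1 error still larger: move threshold right
        else:
            hi = mid
    return 0.5 * (lo + hi)

t_equal = minimax_threshold(NormalDist(0, 1), NormalDist(3, 1), -10, 10)
t_unequal = minimax_threshold(NormalDist(0, 1), NormalDist(3, 2), -10, 10)
```

With equal variances the threshold lands at the midpoint of the two means; with unequal variances it shifts toward the narrower class.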

1.3 Neyman-Pearson 

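For a hypothesis test, among the tests with size $\alpha$ (a given Type I error, i.e. false-positive rate)

$$\alpha = \int_{\Omega_2} p(x \mid \omega_1)\,dx,$$

the likelihood ratio test is the most powerful one (minimal Type II error, i.e. false-negative rate), with power $1 - \beta$, in which $\beta$ is given by

$$\beta = \int_{\Omega_1} p(x \mid \omega_2)\,dx.$$

The likelihood ratio test favors $\omega_1$ if $p(x \mid \omega_1)/p(x \mid \omega_2) \ge T$, in which $T$ is some threshold; otherwise, it favors $\omega_2$.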

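Thus, the Neyman-Pearson rule states that the decision region $\Omega_1$ for $\omega_1$ is

$$C_\alpha = \{x \mid \Lambda(x) \ge T_\alpha\}$$

in which

$$\Lambda(x) \doteq \frac{p(x \mid \omega_1)}{p(x \mid \omega_2)}$$

is the likelihood ratio and $T_\alpha$ satisfies

$$\alpha = \int_{\Omega_2} p(x \mid \omega_1)\,dx = \int_{-\infty}^{T_\alpha} p(\Lambda \mid \omega_1)\,d\Lambda.$$

Then the power of the test is

$$1 - \beta = \int_{\Omega_2} p(x \mid \omega_2)\,dx = \int_{-\infty}^{T_\alpha} p(\Lambda \mid \omega_2)\,d\Lambda.$$

Example

$$p(x \mid \omega_1) \sim N(\mu_1, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - \mu_1)^2}{2\sigma^2}\right)$$

$$p(x \mid \omega_2) \sim N(\mu_2, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x - \mu_2)^2}{2\sigma^2}\right)$$

With $\mu_2 > \mu_1$, $\Lambda(x)$ decreases in $x$, so $\omega_2$ is chosen on $\Omega_2 = \{x \mid x \ge x_1\}$ and we have

$$\alpha = \frac{1}{2} - \frac{1}{2}\operatorname{erf}\left(\frac{x_1 - \mu_1}{\sqrt{2}\,\sigma}\right),$$

or equivalently

$$x_1 = \mu_1 + \sqrt{2}\,\sigma\,\operatorname{erf}^{-1}(1 - 2\alpha).$$

The power is

$$1 - \beta = \frac{1}{2} - \frac{1}{2}\operatorname{erf}\left(\frac{x_1 - \mu_2}{\sqrt{2}\,\sigma}\right).$$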
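The Gaussian example can be checked numerically. The sketch below picks x1 so that P(x >= x1 | ω1) = α and then evaluates the power P(x >= x1 | ω2). Python's `math` module has no inverse error function, so the threshold is obtained from the standard normal quantile `statistics.NormalDist().inv_cdf` instead. The function names and parameter values are assumptions for illustration.

```python
from math import erf, sqrt
from statistics import NormalDist

def np_threshold(alpha, mu1, sigma):
    """x1 such that P(x >= x1 | w1) = alpha for p(x|w1) = N(mu1, sigma^2)."""
    return mu1 + sigma * NormalDist().inv_cdf(1.0 - alpha)

def tail_prob(x1, mu, sigma):
    """P(x >= x1) for N(mu, sigma^2), written with erf."""
    return 0.5 - 0.5 * erf((x1 - mu) / (sqrt(2.0) * sigma))

# Hypothetical setting: size 0.05, means 0 and 2, common sigma 1.
alpha, mu1, mu2, sigma = 0.05, 0.0, 2.0, 1.0
x1 = np_threshold(alpha, mu1, sigma)
size_check = tail_prob(x1, mu1, sigma)  # should reproduce alpha
power = tail_prob(x1, mu2, sigma)       # 1 - beta
```

Plugging x1 back into the erf expression recovers α, which is a quick way to confirm the threshold formula before trusting the power value.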

1.4 Bayesian Classifier: Gaussian Density 

1.4.1 Background 

Use a discriminant function, e.g., the posterior probabilities (with or without the normalization factor) or the log-posterior

$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i).$$

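Univariate and multivariate Gaussian

$$p(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$$

$$p(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x-\mu)^T\Sigma^{-1}(x-\mu)\right)$$

We have

$$\mu = E[x] = \int x\,p(x)\,dx,$$

$$\Sigma = E[(x-\mu)(x-\mu)^T] = \int (x-\mu)(x-\mu)^T p(x)\,dx,$$

and, in the univariate case,

$$\text{cdf} = \frac{1}{2}\left[1 + \operatorname{erf}\left(\frac{x-\mu}{\sqrt{2}\,\sigma}\right)\right],$$

in which

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}}\int_0^x e^{-t^2}\,dt$$

is the error function; for $x < 0$, $\operatorname{erf}(x) = -\operatorname{erf}(-x)$.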

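The Mahalanobis distance is $\sqrt{(x-\mu)^T\Sigma^{-1}(x-\mu)}$.

Let $\Lambda = \operatorname{diag}(\lambda_1, \ldots, \lambda_d)$ and $A^T = [e_1, \ldots, e_d]$, where $\lambda_k$ and $e_k$ are the eigenvalues and orthonormal eigenvectors of $\Sigma$; then

$$\Sigma = A^T \Lambda A.$$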
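A small sketch of the Mahalanobis distance in two dimensions, using the closed-form inverse of a symmetric 2x2 covariance so that no linear-algebra library is needed. The function name and the test vectors are hypothetical.

```python
from math import sqrt

def mahalanobis_2d(x, mu, cov):
    """Mahalanobis distance for 2-D vectors; cov is [[a, b], [b, c]]."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # (x - mu)^T cov^{-1} (x - mu), with cov^{-1} = [[c, -b], [-b, a]] / det
    q = (c * d0 * d0 - 2.0 * b * d0 * d1 + a * d1 * d1) / det
    return sqrt(q)

d_identity = mahalanobis_2d([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
d_scaled = mahalanobis_2d([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]])
```

With the identity covariance the result is the ordinary Euclidean distance; a larger variance along one axis shrinks distances measured along that axis.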

1.4.2 Bayesian Decision Boundary 

Take log-posterior as the discriminant function 

$$g_i(x) = \ln p(x \mid \omega_i) + \ln P(\omega_i) = -\frac{1}{2}(x-\mu_i)^T\Sigma_i^{-1}(x-\mu_i) - \frac{d}{2}\ln 2\pi - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).$$

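The decision boundary between $\omega_i$ and $\omega_j$ is

• $\Sigma_i = \sigma^2 I$

$$w^T(x - x_0) = 0$$

in which $w = \mu_i - \mu_j$ and

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\sigma^2}{\|\mu_i - \mu_j\|^2}\ln\left[\frac{P(\omega_i)}{P(\omega_j)}\right](\mu_i - \mu_j).$$

• $\Sigma_i = \Sigma$

$$w^T(x - x_0) = 0$$

in which $w = \Sigma^{-1}(\mu_i - \mu_j)$ and

$$x_0 = \frac{1}{2}(\mu_i + \mu_j) - \frac{\ln[P(\omega_i)/P(\omega_j)]}{(\mu_i - \mu_j)^T\Sigma^{-1}(\mu_i - \mu_j)}(\mu_i - \mu_j).$$

• $\Sigma_i$ is arbitrary

$$g_i(x) = x^T W_i x + w_i^T x + w_{i0}$$

in which $W_i = -\frac{1}{2}\Sigma_i^{-1}$, $w_i = \Sigma_i^{-1}\mu_i$, and

$$w_{i0} = -\frac{1}{2}\mu_i^T\Sigma_i^{-1}\mu_i - \frac{1}{2}\ln|\Sigma_i| + \ln P(\omega_i).$$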
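The log-posterior discriminant can be evaluated directly for small problems. The sketch below handles 2-D Gaussian class models with a hand-written 2x2 inverse and picks the class with the largest discriminant. The function names and the class parameters are assumptions chosen for the example.

```python
from math import log, pi

def gaussian_discriminant_2d(x, mu, cov, prior):
    """g_i(x) = ln p(x|w_i) + ln P(w_i) for a 2-D Gaussian class model."""
    (a, b), (_, c) = cov
    det = a * c - b * b
    d0, d1 = x[0] - mu[0], x[1] - mu[1]
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha_sq = (c * d0 * d0 - 2.0 * b * d0 * d1 + a * d1 * d1) / det
    # For d = 2 the term (d/2) ln(2 pi) is ln(2 pi).
    return -0.5 * maha_sq - log(2.0 * pi) - 0.5 * log(det) + log(prior)

def classify(x, classes):
    """classes: list of (mu, cov, prior); returns the index of the largest g_i."""
    scores = [gaussian_discriminant_2d(x, m, s, p) for m, s, p in classes]
    return scores.index(max(scores))

identity = [[1.0, 0.0], [0.0, 1.0]]
classes = [([0.0, 0.0], identity, 0.5), ([2.0, 0.0], identity, 0.5)]
label_left = classify([0.5, 0.0], classes)
label_right = classify([1.5, 0.0], classes)
```

With equal priors and identical spherical covariances the two discriminants tie at the midpoint of the means, so points on either side are assigned to the nearer mean.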

1.5 Two-class Error Boundary 

1.5.1 Chernoff Bound 

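$$P(\text{error}) = \int_{-\infty}^{\infty} P(\text{error} \mid x)\,p(x)\,dx$$

and by the Bayesian decision rule

$$P(\text{error} \mid x) = \min[P(\omega_1 \mid x), P(\omega_2 \mid x)] \doteq \min[a, b].$$

By the inequality $\min[a, b] \le a^\beta b^{1-\beta}$ for $a, b \ge 0$ and $0 \le \beta \le 1$, and the Bayes rule $P(\omega_i \mid x)p(x) = p(x \mid \omega_i)P(\omega_i)$, we have

$$P(\text{error}) \le P(\omega_1)^\beta P(\omega_2)^{1-\beta}\int p(x \mid \omega_1)^\beta\,p(x \mid \omega_2)^{1-\beta}\,dx$$

for any $0 \le \beta \le 1$.

For the case that $p(x \mid \omega_i) \sim N(\mu_i, \Sigma_i)$, we get

$$\int p(x \mid \omega_1)^\beta\,p(x \mid \omega_2)^{1-\beta}\,dx = \exp[-k(\beta)]$$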

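where

$$k(\beta) = \frac{\beta(1-\beta)}{2}(\mu_2 - \mu_1)^T[(1-\beta)\Sigma_1 + \beta\Sigma_2]^{-1}(\mu_2 - \mu_1) + \frac{1}{2}\ln\frac{|(1-\beta)\Sigma_1 + \beta\Sigma_2|}{|\Sigma_1|^{1-\beta}\,|\Sigma_2|^{\beta}}$$

(completing the square in the exponent of $p(x \mid \omega_1)^\beta p(x \mid \omega_2)^{1-\beta}$ pairs the weight $1-\beta$ with $\Sigma_1$ and $\beta$ with $\Sigma_2$).

Find the $\beta = \beta^*$ that minimizes the bound $P(\omega_1)^\beta P(\omega_2)^{1-\beta}\exp[-k(\beta)]$; for equal priors this is the $\beta$ that maximizes $k(\beta)$. The Chernoff bound is found by substituting $\beta^*$ into the bound on $P(\text{error})$.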

1.5.2 Bhattacharyya Bound

The Bhattacharyya bound is found by substituting $\beta = 1/2$ into the bound on $P(\text{error})$.
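Both bounds can be evaluated without a closed form by integrating the product of powered densities numerically. The sketch below does this for two univariate Gaussians with a trapezoid rule and a grid search over β. The function names, the integration range and the grid resolution are assumptions chosen for the example, not part of the notes.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    return exp(-0.5 * ((x - mu) / sigma) ** 2) / (sqrt(2.0 * pi) * sigma)

def bound_rhs(beta, prior1, mu1, s1, mu2, s2, lo=-10.0, hi=12.0, n=2200):
    """P(w1)^beta P(w2)^(1-beta) * integral of p1^beta p2^(1-beta) dx."""
    h = (hi - lo) / n
    total = 0.0
    for k in range(n + 1):
        x = lo + k * h
        f = normal_pdf(x, mu1, s1) ** beta * normal_pdf(x, mu2, s2) ** (1.0 - beta)
        total += f if 0 < k < n else 0.5 * f  # trapezoid end-point weights
    return prior1 ** beta * (1.0 - prior1) ** (1.0 - beta) * total * h

def chernoff_bound(prior1, mu1, s1, mu2, s2, steps=50):
    """Grid search over beta in [0, 1]; returns (beta_star, bound)."""
    grid = [k / steps for k in range(steps + 1)]
    pairs = [(b, bound_rhs(b, prior1, mu1, s1, mu2, s2)) for b in grid]
    return min(pairs, key=lambda pair: pair[1])

# Hypothetical case: equal priors, means 0 and 2, common sigma 1.
bhattacharyya = bound_rhs(0.5, 0.5, 0.0, 1.0, 2.0, 1.0)
beta_star, chernoff = chernoff_bound(0.5, 0.0, 1.0, 2.0, 1.0)
```

The Chernoff bound is never looser than the Bhattacharyya bound, because β = 1/2 is one of the candidates in the search; in this symmetric case the two coincide.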

1.6 Noisy Feature 

Deal with bad (noisy) features by marginalization 

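$$P(\omega_i \mid x_g) = \frac{\int P(\omega_i \mid x_g, x_b)\,p(x_g, x_b)\,dx_b}{\int p(x_g, x_b)\,dx_b}$$

Choose $\omega_i$ if $P(\omega_i \mid x_g) > P(\omega_j \mid x_g)$, $\forall j \ne i$.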
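When the bad feature takes only a few discrete values, the integrals become sums. The sketch below marginalizes such a feature out of the posterior for one fixed value of the good feature. The discretization, the function name and the numbers are assumptions made for illustration.

```python
def posterior_given_good(post_full, p_joint):
    """Marginalize a discrete bad feature x_b out of the posterior.

    post_full[b][i] = P(w_i | x_g, x_b = b) and p_joint[b] = p(x_g, x_b = b),
    both for one fixed value of the good feature x_g.
    """
    n_classes = len(post_full[0])
    norm = sum(p_joint)  # sum over x_b of p(x_g, x_b)
    return [
        sum(post_full[b][i] * p_joint[b] for b in range(len(p_joint))) / norm
        for i in range(n_classes)
    ]

# Hypothetical numbers: two classes, bad feature takes three values.
post_full = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
p_joint = [0.1, 0.3, 0.1]
post_good = posterior_given_good(post_full, p_joint)
chosen = post_good.index(max(post_good))
```

The result is a weighted average of the full posteriors, with weights proportional to how likely each value of the bad feature is alongside the observed good feature.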

2 Parameter Estimation/Learning 

2.1 Maximum Likelihood Estimation 

2.1.1 Background 

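$p(D \mid \theta)$ is the likelihood of $\theta$ with respect to the sample $D$. By the independent-sample assumption,

$$p(D \mid \theta) = \prod_{k=1}^{n} p(x_k \mid \theta).$$

Working with the log-likelihood $\ln p(D \mid \theta) = \sum_{k=1}^{n} \ln p(x_k \mid \theta)$ keeps the maths tractable.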

2.1.2 Plug-in Rule 

Substitute the estimated parameters $\hat\theta$ for the true ones in the class-conditional densities. Then use $p(x \mid \omega_i, \hat\theta)$ as if they were the true densities in constructing the decision rule.

2.1.3 Example: Gaussian 

Conclusion: use the sample mean and sample covariance.

$$\hat\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \text{sample mean}$$

$$\hat\Sigma = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat\mu)(x_i - \hat\mu)^T = \text{sample covariance}.$$
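A minimal pure-Python sketch of the Gaussian maximum-likelihood estimates, assuming the samples are given as equal-length lists of floats. The function name and the data are hypothetical.

```python
def gaussian_mle(samples):
    """Return (mean, covariance) ML estimates for a list of d-dim samples."""
    n, d = len(samples), len(samples[0])
    mean = [sum(x[j] for x in samples) / n for j in range(d)]
    # Divide by n (the ML estimate), not n - 1 (the unbiased estimate).
    cov = [
        [sum((x[r] - mean[r]) * (x[c] - mean[c]) for x in samples) / n
         for c in range(d)]
        for r in range(d)
    ]
    return mean, cov

data = [[1.0, 2.0], [3.0, 6.0], [5.0, 4.0]]
mu_hat, sigma_hat = gaussian_mle(data)
```

The covariance here divides by n, which is what maximizing the likelihood gives; it is slightly biased, and dividing by n - 1 instead yields the unbiased estimate.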

2.2 Bayesian Learning 

2.2.1 Idea 

• Treats the parameters themselves as random variables and estimates a density for them.

• Can be formulated as a recursive estimation, so new evidence can be incorporated one sample at a time as it arrives.

2.2.2 Formulation 

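$$P(\omega_i \mid x, D) = \frac{p(x \mid \omega_i, D_i)\,P(\omega_i)}{\sum_{j=1}^{c} p(x \mid \omega_j, D_j)\,P(\omega_j)}$$

$$p(x \mid D) = \int p(x \mid \theta)\,p(\theta \mid D)\,d\theta$$

The key lies in estimating $p(\theta \mid D)$.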

2.2.3 Example: Univariate Gaussian 

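If $p(x_k \mid \mu) \sim N(\mu, \sigma^2)$ and $p(\mu) \sim N(\mu_0, \sigma_0^2)$, we have

$$\mu_n = \left(\frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\right)\hat\mu_n + \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\,\mu_0$$

$$\sigma_n^2 = \frac{\sigma_0^2\,\sigma^2}{n\sigma_0^2 + \sigma^2},$$

in which $\hat\mu_n$ is the sample mean. $\sigma^2/\sigma_0^2$ is called dogmatism.

We get $p(\mu \mid D) \sim N(\mu_n, \sigma_n^2)$ for the parameter and

$$p(x \mid D) \sim N(\mu_n, \sigma^2 + \sigma_n^2)$$

for a new sample.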
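The univariate update is easy to run both in batch and recursively, which illustrates the point about incorporating evidence one sample at a time. The sketch below assumes a known noise variance; the function names and the data are hypothetical.

```python
def posterior_mean_var(xs, mu0, var0, var):
    """Batch update: return (mu_n, var_n) given prior N(mu0, var0)
    and samples xs drawn with known variance var."""
    n = len(xs)
    if n == 0:
        return mu0, var0
    sample_mean = sum(xs) / n
    denom = n * var0 + var
    mu_n = (n * var0 / denom) * sample_mean + (var / denom) * mu0
    var_n = var0 * var / denom
    return mu_n, var_n

def recursive_update(xs, mu0, var0, var):
    """Same posterior, obtained by feeding the samples one at a time:
    each posterior becomes the prior for the next sample."""
    mu, v = mu0, var0
    for x in xs:
        mu, v = posterior_mean_var([x], mu, v, var)
    return mu, v

xs = [2.1, 1.9, 2.4, 2.0]
batch = posterior_mean_var(xs, mu0=0.0, var0=1.0, var=0.25)
online = recursive_update(xs, mu0=0.0, var0=1.0, var=0.25)
```

Both routes give the same posterior mean and variance. The posterior mean sits between the prior mean and the sample mean, and moves toward the sample mean as n grows.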

2.2.4 Example: Multivariate Gaussian 

Similar results: 

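$$p(x \mid D) \sim N(\mu_n, \Sigma + \Sigma_n)$$

$$\mu_n = \Sigma_0\left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1}\hat\mu_n + \frac{1}{n}\Sigma\left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1}\mu_0$$

$$\Sigma_n = \Sigma_0\left(\Sigma_0 + \frac{1}{n}\Sigma\right)^{-1}\frac{1}{n}\Sigma.$$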
