
CS590 Pattern Recognition cheatsheet

Wei Peng

1 Bayesian Decision Theory

1.1 Preliminaries

P(ω_i|x) = p(x|ω_i) P(ω_i) / p(x),   i.e.   posterior = (likelihood × prior) / evidence.

p(X = x|ω_i) is the class-conditional density function.

Risk function:

R(ω, δ) = R_δ = E[L(ω, δ(x))] = ∫_x L(ω, δ(x)) p(x|ω) dx

Bayesian (conditional) risk function:

R(ω_i|x) = ∑_{j=1}^{c} λ_{ij} P(ω_j|x)

Two-category classification:

R(α_1|x) = λ_{11} P(ω_1|x) + λ_{12} P(ω_2|x)
R(α_2|x) = λ_{21} P(ω_1|x) + λ_{22} P(ω_2|x).

Choose α_1 if R(α_1|x) < R(α_2|x), i.e.

p(x|ω_1) / p(x|ω_2) > [(λ_{12} − λ_{22}) / (λ_{21} − λ_{11})] × [P(ω_2) / P(ω_1)].
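A minimal numerical sketch of this rule (a sketch only, assuming numpy/scipy and made-up Gaussian class-conditionals, losses, and priors that are not part of the original cheatsheet):

import numpy as np
from scipy.stats import norm

# Hypothetical two-class setup: lam[i, j] is the loss for deciding omega_{i+1}
# when the true class is omega_{j+1}; assumes lam21 > lam11 so the inequality keeps its direction.
lam = np.array([[0.0, 2.0],
                [1.0, 0.0]])
P1, P2 = 0.6, 0.4              # priors P(omega_1), P(omega_2)
p1 = norm(0.0, 1.0).pdf        # p(x | omega_1), assumed N(0, 1)
p2 = norm(2.0, 1.0).pdf        # p(x | omega_2), assumed N(2, 1)

def decide(x):
    # Choose omega_1 if p(x|w1)/p(x|w2) > (lam12 - lam22)/(lam21 - lam11) * P(w2)/P(w1)
    threshold = (lam[0, 1] - lam[1, 1]) / (lam[1, 0] - lam[0, 0]) * P2 / P1
    return 1 if p1(x) / p2(x) > threshold else 2

print([decide(x) for x in (-1.0, 1.0, 3.0)])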

1.2 Minimax Criterion

The minimax criterion performs worst-case analysis: it guards against the worst possible prior P(ω_1).

R = ∫_{R_1} [λ_{11} P(ω_1) p(x|ω_1) + λ_{12} P(ω_2) p(x|ω_2)] dx
  + ∫_{R_2} [λ_{21} P(ω_1) p(x|ω_1) + λ_{22} P(ω_2) p(x|ω_2)] dx.

Use the facts P(ω_2) = 1 − P(ω_1) and ∫_{R_1} p(x|ω_1) dx = 1 − ∫_{R_2} p(x|ω_1) dx:

R(P(ω_1)) = λ_{22} + (λ_{12} − λ_{22}) ∫_{R_1} p(x|ω_2) dx        (the minimax risk)
          + P(ω_1) [ (λ_{11} − λ_{22})
                     + (λ_{21} − λ_{11}) ∫_{R_2} p(x|ω_1) dx
                     − (λ_{12} − λ_{22}) ∫_{R_1} p(x|ω_2) dx ].

For the minimax solution, choose the decision regions R_1, R_2 so that the coefficient of P(ω_1) is 0; the risk is then independent of the prior and equals the first term above, the minimax risk.
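As a sketch of how one might locate the minimax operating point numerically (assumed 1-D Gaussian class-conditionals and zero-one loss; scipy's brentq finds the root of the prior coefficient):

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Hypothetical 1-D Gaussians; zero-one loss (lam11 = lam22 = 0, lam12 = lam21 = 1).
mu1, sigma1 = 0.0, 1.0
mu2, sigma2 = 2.0, 1.5

def coeff_of_prior(t):
    # Decision regions R1 = {x < t}, R2 = {x >= t} (assumes mu1 < mu2).
    # Under zero-one loss the coefficient of P(omega_1) reduces to
    # int_{R2} p(x|w1) dx - int_{R1} p(x|w2) dx.
    err1 = 1.0 - norm.cdf(t, mu1, sigma1)   # omega_1 mass falling in R2
    err2 = norm.cdf(t, mu2, sigma2)         # omega_2 mass falling in R1
    return err1 - err2

t_star = brentq(coeff_of_prior, mu1, mu2)      # threshold where the coefficient vanishes
minimax_risk = norm.cdf(t_star, mu2, sigma2)   # remaining term: int_{R1} p(x|w2) dx
print(t_star, minimax_risk)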

1.3 Neyman-Pearson

For a hypothesis test, among all tests with size α (the given Type I error, i.e. false-positive rate)

α = ∫_{Ω_2} p(x|ω_1) dx,

the likelihood-ratio test is the most powerful one (minimal Type II error, i.e. false-negative rate), with power 1 − β, in which β is given by

β = ∫_{Ω_1} p(x|ω_2) dx.

The likelihood-ratio test favors ω_1 if Λ(x) ≥ T, in which T is some threshold; otherwise it favors ω_2.

Thus, the Neyman-Pearson rule states that the decision region Ω_1 for ω_1 is

C_α = {x | Λ(x) ≥ T_α},

in which

Λ(x) ≐ p(x|ω_1) / p(x|ω_2)

is the likelihood ratio and T_α satisfies

α = ∫_{Ω_2} p(x|ω_1) dx = ∫_{−∞}^{T_α} p(Λ|ω_1) dΛ.

Then the power of the test is

1 − β = ∫_{Ω_2} p(x|ω_2) dx = ∫_{−∞}^{T_α} p(Λ|ω_2) dΛ.

Example

p(x|ω_1) ∼ N(µ_1, σ²):   p(x|ω_1) = (1/(√(2π) σ)) exp(−(x − µ_1)² / (2σ²))

p(x|ω_2) ∼ N(µ_2, σ²):   p(x|ω_2) = (1/(√(2π) σ)) exp(−(x − µ_2)² / (2σ²))

Let µ_2 > µ_1, so the decision region for ω_2 is Ω_2 = {x | x ≥ x_1}. We have

α = 1/2 − 1/2 erf((x_1 − µ_1) / (√2 σ)),

or equivalently

x_1 = µ_1 + √2 σ erf^{−1}(1 − 2α).

The power is

1 − β = 1/2 − 1/2 erf((x_1 − µ_2) / (√2 σ)).
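A quick numerical check of these expressions, assuming values for µ_1, µ_2, σ, and α and scipy's erf/erfinv:

import numpy as np
from scipy.special import erf, erfinv

mu1, mu2, sigma = 0.0, 1.5, 1.0    # assumed: omega_2 lies to the right of omega_1
alpha = 0.05                        # chosen size (Type I error)

# Threshold from alpha = 1/2 - 1/2 erf((x1 - mu1) / (sqrt(2) sigma))
x1 = mu1 + np.sqrt(2.0) * sigma * erfinv(1.0 - 2.0 * alpha)

# Power of the test
power = 0.5 - 0.5 * erf((x1 - mu2) / (np.sqrt(2.0) * sigma))

# Sanity check: recompute alpha from x1
alpha_check = 0.5 - 0.5 * erf((x1 - mu1) / (np.sqrt(2.0) * sigma))
print(x1, power, alpha_check)       # x1 ~ 1.645, alpha_check ~ 0.05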

1.4 Bayesian Classifier: Gaussian Density

1.4.1 Background

Use a discriminant function, e.g., the posterior probabilities (with or without the normalization factor) or the log-posterior

g_i(x) = ln p(x|ω_i) + ln P(ω_i).

Univariate and multivariate Gaussian:

p(x) = (1/(√(2π) σ)) exp(−(x − µ)² / (2σ²))

p(x) = (1/((2π)^{d/2} |Σ|^{1/2})) exp(−(1/2)(x − µ)^T Σ^{−1} (x − µ))

We have

µ = E[x] = ∫ x p(x) dx,
Σ = E[(x − µ)(x − µ)^T] = ∫ (x − µ)(x − µ)^T p(x) dx,

and the univariate cdf is

cdf(x) = (1/2)[1 + erf((x − µ) / (√2 σ))],

in which

erf(x) = (2/√π) ∫_0^x e^{−t²} dt

is the error function, defined for x ≥ 0.

The Mahalanobis distance is √((x − µ)^T Σ^{−1} (x − µ)).

Let Λ = diag(λ_1, ..., λ_d) and A^T = [e_1, ..., e_d] (the eigenvalues and eigenvectors of Σ); then Σ = A^T Λ A.
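A short numpy sketch of the Mahalanobis distance and the eigendecomposition Σ = A^T Λ A, with an assumed µ and Σ:

import numpy as np

# Assumed mean and covariance for illustration.
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
x = np.array([2.0, 0.0])

# Mahalanobis distance sqrt((x - mu)^T Sigma^{-1} (x - mu))
d = x - mu
maha = np.sqrt(d @ np.linalg.inv(Sigma) @ d)

# Eigendecomposition: Sigma = A^T Lambda A with the rows of A the eigenvectors e_i
lam, vecs = np.linalg.eigh(Sigma)     # columns of vecs are e_1, ..., e_d
A = vecs.T
Sigma_rebuilt = A.T @ np.diag(lam) @ A
print(maha, np.allclose(Sigma, Sigma_rebuilt))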

1.4.2 Bayesian Decision Boundary

Take the log-posterior as the discriminant function:

g_i(x) = ln p(x|ω_i) + ln P(ω_i)
       = −(1/2)(x − µ_i)^T Σ_i^{−1} (x − µ_i) − (d/2) ln 2π − (1/2) ln|Σ_i| + ln P(ω_i).

The decision boundary g_i(x) = g_j(x) takes the following forms:

• Σ_i = σ² I:

  w^T (x − x_0) = 0,

  in which w = µ_i − µ_j and

  x_0 = (1/2)(µ_i + µ_j) − [σ² / ‖µ_i − µ_j‖²] ln[P(ω_i)/P(ω_j)] (µ_i − µ_j).

• Σ_i = Σ:

  w^T (x − x_0) = 0,

  in which w = Σ^{−1}(µ_i − µ_j) and

  x_0 = (1/2)(µ_i + µ_j) − [ln[P(ω_i)/P(ω_j)] / ((µ_i − µ_j)^T Σ^{−1} (µ_i − µ_j))] (µ_i − µ_j).

• Σ_i arbitrary:

  g_i(x) = x^T W_i x + w_i^T x + w_{i0},

  in which W_i = −(1/2) Σ_i^{−1}, w_i = Σ_i^{−1} µ_i, and

  w_{i0} = −(1/2) µ_i^T Σ_i^{−1} µ_i − (1/2) ln|Σ_i| + ln P(ω_i).
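A minimal sketch of the arbitrary-Σ_i (quadratic) discriminant with made-up class parameters; like the expressions above, it drops the class-independent −(d/2) ln 2π term:

import numpy as np

def quadratic_discriminant(x, mu_i, Sigma_i, prior_i):
    # g_i(x) = x^T W_i x + w_i^T x + w_i0 with
    # W_i = -1/2 Sigma_i^{-1}, w_i = Sigma_i^{-1} mu_i,
    # w_i0 = -1/2 mu_i^T Sigma_i^{-1} mu_i - 1/2 ln|Sigma_i| + ln P(omega_i)
    Sinv = np.linalg.inv(Sigma_i)
    W = -0.5 * Sinv
    w = Sinv @ mu_i
    w0 = (-0.5 * mu_i @ Sinv @ mu_i
          - 0.5 * np.log(np.linalg.det(Sigma_i))
          + np.log(prior_i))
    return x @ W @ x + w @ x + w0

# Assumed two-class example: pick the class with the larger discriminant.
x = np.array([0.5, 0.5])
g1 = quadratic_discriminant(x, np.array([0.0, 0.0]), np.eye(2), 0.5)
g2 = quadratic_discriminant(x, np.array([2.0, 2.0]), 2 * np.eye(2), 0.5)
print("omega_1" if g1 > g2 else "omega_2")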

1.5 Two-class Error Bounds

1.5.1 Chernoff Bound

P(error) = ∫_{−∞}^{∞} P(error|x) p(x) dx,

and by the Bayes decision rule

P(error|x) = min[P(ω_1|x), P(ω_2|x)] ≐ min[a, b].

By the inequality min[a, b] ≤ a^β b^{1−β} for a, b ≥ 0 and 0 ≤ β ≤ 1, and the Bayes rule P(ω_i|x) p(x) = p(x|ω_i) P(ω_i), we have

P(error) ≤ P(ω_1)^β P(ω_2)^{1−β} ∫ p(x|ω_1)^β p(x|ω_2)^{1−β} dx

for any 0 ≤ β ≤ 1.

For the case that p(x|ω_i) ∼ N(µ_i, Σ_i), we get

∫ p(x|ω_1)^β p(x|ω_2)^{1−β} dx = exp[−k(β)],

where

k(β) = (β(1−β)/2) (µ_2 − µ_1)^T [βΣ_1 + (1−β)Σ_2]^{−1} (µ_2 − µ_1)
     + (1/2) ln( |βΣ_1 + (1−β)Σ_2| / (|Σ_1|^β |Σ_2|^{1−β}) ).

Find the β = β* that minimizes the bound e^{−k(β)}, i.e. maximizes k(β). The Chernoff bound is obtained by substituting β* into the bound on P(error).

1.5.2 Bhattacharyya Bound

The Bhattacharyya bound is obtained by substituting β = 1/2 into the bound on P(error):

P(error) ≤ √(P(ω_1) P(ω_2)) ∫ √(p(x|ω_1) p(x|ω_2)) dx = √(P(ω_1) P(ω_2)) e^{−k(1/2)}.
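A numerical sketch of both bounds under assumed Gaussian parameters and priors: evaluate k(β) on a grid, take the minimizing β* for the Chernoff bound, and β = 1/2 for the Bhattacharyya bound.

import numpy as np

def k(beta, mu1, mu2, S1, S2):
    # k(beta) from the Gaussian Chernoff integral
    S = beta * S1 + (1 - beta) * S2
    dm = mu2 - mu1
    quad = 0.5 * beta * (1 - beta) * dm @ np.linalg.inv(S) @ dm
    logdet = 0.5 * np.log(np.linalg.det(S)
                          / (np.linalg.det(S1) ** beta * np.linalg.det(S2) ** (1 - beta)))
    return quad + logdet

# Assumed class parameters and priors for illustration.
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
S1, S2 = np.eye(2), np.array([[2.0, 0.3], [0.3, 1.0]])
P1, P2 = 0.5, 0.5

betas = np.linspace(1e-3, 1 - 1e-3, 999)
bounds = np.array([P1 ** b * P2 ** (1 - b) * np.exp(-k(b, mu1, mu2, S1, S2)) for b in betas])

chernoff = bounds.min()                     # tightest bound over beta
beta_star = betas[bounds.argmin()]
bhatt = np.sqrt(P1 * P2) * np.exp(-k(0.5, mu1, mu2, S1, S2))   # beta = 1/2
print(beta_star, chernoff, bhatt)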

1.6 Noisy Features

Deal with bad (noisy) features by marginalization:

P(ω_i|x_g) = ∫ P(ω_i|x_g, x_b) p(x_g, x_b) dx_b / ∫ p(x_g, x_b) dx_b

Choose ω_i if P(ω_i|x_g) > P(ω_j|x_g), ∀ j ≠ i.
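A sketch of this marginalization with two features (one good, one bad/noisy), assumed Gaussian class-conditionals, and a simple grid approximation of the integral over x_b:

import numpy as np
from scipy.stats import multivariate_normal

# Assumed 2-D features (x_g good, x_b bad) with made-up Gaussian class-conditionals.
priors = [0.5, 0.5]
classes = [multivariate_normal([0.0, 0.0], [[1.0, 0.3], [0.3, 1.0]]),
           multivariate_normal([2.0, 1.0], [[1.0, -0.2], [-0.2, 1.5]])]

def posterior_good_only(xg, n_grid=4001, lim=12.0):
    # Approximate int P(w_i|x_g, x_b) p(x_g, x_b) dx_b / int p(x_g, x_b) dx_b,
    # which reduces to P(w_i) int p(x_g, x_b|w_i) dx_b up to normalization.
    xb, dx = np.linspace(-lim, lim, n_grid, retstep=True)
    pts = np.column_stack([np.full(n_grid, xg), xb])
    num = np.array([P * (cls.pdf(pts).sum() * dx) for P, cls in zip(priors, classes)])
    return num / num.sum()

print(posterior_good_only(1.5))   # posterior over classes given only the good feature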

2 Parameter Estimation/Learning

2.1 Maximum Likelihood Estimation

2.1.1 Background

p(D|θ) is the likelihood of θ with respect to the sample D. By the independent-sample assumption,

p(D|θ) = ∏_{k=1}^{n} p(x_k|θ).

Working with the log-likelihood keeps the math tractable.

2.1.2 Plug-in Rule

Substitute the estimated parameters θ̂ for the true ones in the class-conditional densities; then use p(x|ω_i, θ̂) as if they were the true densities when constructing the decision rule.

2.1.3 Example: Gaussian

Conclusion: use the sample mean and sample covariance.

µ̂ = (1/n) ∑_{i=1}^{n} x_i   (sample mean)

Σ̂ = (1/n) ∑_{i=1}^{n} (x_i − µ̂)(x_i − µ̂)^T   (sample covariance).
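A toy numpy check of these plug-in estimates, using data sampled from an assumed Gaussian:

import numpy as np

# Draw a toy sample from an assumed Gaussian and recover its parameters by MLE.
rng = np.random.default_rng(0)
X = rng.multivariate_normal(mean=[1.0, -2.0], cov=[[2.0, 0.4], [0.4, 1.0]], size=500)

mu_hat = X.mean(axis=0)                              # sample mean
Sigma_hat = (X - mu_hat).T @ (X - mu_hat) / len(X)   # sample covariance (1/n, the MLE)
print(mu_hat, Sigma_hat, sep="\n")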

2.2 Bayesian Learning

2.2.1 Idea

• Treats the parameters themselves as random variables and estimates a density over the parameters.

• Can be formulated as recursive estimation, allowing new evidence to be incorporated one sample at a time as it arrives.

2.2.2 Formulation

P(ω_i|x, D) = p(x|ω_i, D_i) P(ω_i) / ∑_{j=1}^{c} p(x|ω_j, D_j) P(ω_j)

p(x|D) = ∫ p(x|θ) p(θ|D) dθ

The key lies in estimating p(θ|D).

2.2.3 Example: Univariate Gaussian

If p(x_k|µ) ∼ N(µ, σ²) and p(µ) ∼ N(µ_0, σ_0²), we have

µ_n = [n σ_0² / (n σ_0² + σ²)] µ̂_n + [σ² / (n σ_0² + σ²)] µ_0

σ_n² = σ_0² σ² / (n σ_0² + σ²),

where µ̂_n is the sample mean of the n samples. σ²/σ_0² is called the dogmatism.

We get p(µ|D) ∼ N(µ_n, σ_n²) and hence the predictive density

p(x|D) ∼ N(µ_n, σ² + σ_n²).
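A small numeric illustration with assumed σ², prior (µ_0, σ_0²), and toy data:

import numpy as np

# Assumed known noise variance and prior; toy data drawn around mu = 2.
sigma2 = 1.0                       # known sigma^2 of p(x | mu)
mu0, sigma0_2 = 0.0, 4.0           # prior p(mu) ~ N(mu0, sigma0^2)
rng = np.random.default_rng(1)
x = rng.normal(2.0, np.sqrt(sigma2), size=20)

n = len(x)
mu_hat_n = x.mean()
mu_n = (n * sigma0_2 / (n * sigma0_2 + sigma2)) * mu_hat_n \
     + (sigma2 / (n * sigma0_2 + sigma2)) * mu0
sigma_n_2 = sigma0_2 * sigma2 / (n * sigma0_2 + sigma2)

# Predictive density p(x | D) ~ N(mu_n, sigma^2 + sigma_n^2)
print(mu_n, sigma_n_2, sigma2 + sigma_n_2)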

2.2.4 Example: Multivariate Gaussian

Similar results:

p(x|D) ∼ N(µ_n, Σ + Σ_n)

µ_n = Σ_0 (Σ_0 + (1/n)Σ)^{−1} µ̂_n + (1/n)Σ (Σ_0 + (1/n)Σ)^{−1} µ_0

Σ_n = Σ_0 (Σ_0 + (1/n)Σ)^{−1} (1/n)Σ.
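And the matrix version of the same update, again with assumed Σ, prior (µ_0, Σ_0), and toy data:

import numpy as np

# Assumed known Sigma and Gaussian prior N(mu0, Sigma0); toy data around (1, -1).
Sigma = np.array([[1.0, 0.2], [0.2, 1.0]])
mu0 = np.zeros(2)
Sigma0 = 4.0 * np.eye(2)
rng = np.random.default_rng(2)
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=30)

n = len(X)
mu_hat_n = X.mean(axis=0)
M = np.linalg.inv(Sigma0 + Sigma / n)       # (Sigma0 + Sigma/n)^{-1}, shared factor
mu_n = Sigma0 @ M @ mu_hat_n + (Sigma / n) @ M @ mu0
Sigma_n = Sigma0 @ M @ (Sigma / n)

# Predictive density p(x | D) ~ N(mu_n, Sigma + Sigma_n)
print(mu_n, Sigma + Sigma_n, sep="\n")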
