Lecture7 Slide - The Department of Statistics and Applied Probability ...


ST5207 Nonparametric Regression, Lecture 7 

Lijian Yang 

Department of Statistics & Probability 

Michigan State University 

East Lansing, MI 48824 

and 

Department of Statistics & Applied Probability 

National University of Singapore 

Singapore 117546 

ST5207 Nonparametric Regression, 10th March 2005


Multivariate nonparametric estimation 

• Let $\{Y_i, X_i^T\}_{i=1}^n = \{Y_i, X_{i1}, \ldots, X_{id}\}_{i=1}^n$ be an i.i.d. sample from the model

$$Y = m(X) + \sigma(X)\,\varepsilon, \quad X = (X_1, \ldots, X_d),$$

where the noise satisfies $E(\varepsilon \mid X) = 0$, $\mathrm{var}(\varepsilon \mid X) = 1$.



• How do we estimate the multivariate function $m$?





• We will discuss the Nadaraya-Watson (NW) and local linear methods.



• Conventions on the multivariate kernel and bandwidth vector:

$$K_h(u) = \prod_{\alpha=1}^{d} \frac{1}{h_\alpha} K\!\left(\frac{u_\alpha}{h_\alpha}\right), \quad u = (u_1, \ldots, u_d), \quad h = (h_1, \ldots, h_d)$$



• Wand & Jones (1995), Kernel Smoothing, Chapman and Hall, London, and Wand & Ruppert (1994) (see the reference list in the syllabus) use a bandwidth matrix instead of a bandwidth vector.
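The product-kernel convention above can be sketched in a few lines of Python. This is only an illustration: the Gaussian choice of the univariate kernel $K$ and the function names `gaussian_kernel` and `product_kernel` are assumptions made here, not part of the lecture.

```python
import numpy as np

def gaussian_kernel(t):
    """Univariate standard normal density, used here as an example of K."""
    return np.exp(-0.5 * t**2) / np.sqrt(2.0 * np.pi)

def product_kernel(u, h, K=gaussian_kernel):
    """K_h(u) = prod over alpha of K(u_alpha / h_alpha) / h_alpha.

    u: array of shape (..., d); h: bandwidth vector of shape (d,).
    Returns an array of shape (...).
    """
    u = np.asarray(u, dtype=float)
    h = np.asarray(h, dtype=float)
    return np.prod(K(u / h) / h, axis=-1)

# K_h evaluated at the origin for d = 2 with bandwidths (0.5, 2.0)
value_at_origin = product_kernel(np.zeros(2), np.array([0.5, 2.0]))
```

With a bandwidth vector, each coordinate is rescaled separately; a bandwidth matrix would additionally allow smoothing along rotated directions.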




• NW estimator:

$$\hat m(x) = \arg\min_{c} \sum_{i=1}^{n} (Y_i - c)^2\, w_i(x), \quad w_i(x) = K_h(X_i - x)$$




• The explicit formula is

$$\hat m(x) = \frac{\sum_{i=1}^{n} Y_i\, K_h(X_i - x)}{\sum_{i=1}^{n} K_h(X_i - x)}$$



• The limiting distribution is

$$\sqrt{n h_1 \cdots h_d}\left\{\hat m(x) - m(x) - \sum_{\alpha=1}^{d} h_\alpha^2\, b_\alpha(x)\right\} \xrightarrow{D} N\{0, v(x)\}$$
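The explicit NW formula can be sketched as follows. This is a minimal illustration, assuming a Gaussian product kernel; the function name `nw_estimate` and the toy data are hypothetical and chosen only to show that the estimator is a weighted average of the responses.

```python
import numpy as np

def nw_estimate(x, X, Y, h):
    """Nadaraya-Watson estimate of m at a single point x.

    X: (n, d) covariates, Y: (n,) responses, x: (d,) point, h: (d,) bandwidths.
    A Gaussian product kernel is assumed for illustration.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    u = (X - np.asarray(x, dtype=float)) / np.asarray(h, dtype=float)
    # Product-kernel weights; the normalising constant of K_h is the same
    # for every i and cancels in the ratio, so it is omitted.
    w = np.exp(-0.5 * np.sum(u**2, axis=1))
    return np.sum(w * Y) / np.sum(w)

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(500, 2))
Y = np.full(500, 3.0)  # constant response: a weighted average must return it
m_hat = nw_estimate(np.array([0.2, -0.1]), X, Y, np.array([0.3, 0.3]))
```

Because the NW estimate is a locally constant fit, it reproduces constant regression functions exactly but not linear ones.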




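• The bias and variance functions are

$$b_\alpha(x) = \mu_2(K)\left\{\frac{1}{2}\frac{\partial^2 m}{\partial x_\alpha^2}(x) + \frac{\partial m}{\partial x_\alpha}(x)\,\frac{\partial f}{\partial x_\alpha}(x)\, f^{-1}(x)\right\}$$

$$v(x) = \sigma^2(x)\, f^{-1}(x)\left\{\int K^2(u)\,du\right\}^{d}$$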



• The local linear estimator (to be discussed next) has a limiting distribution of the same form, but with

$$b_\alpha(x) = \mu_2(K)\,\frac{1}{2}\,\frac{\partial^2 m}{\partial x_\alpha^2}(x)$$



• The local linear weighted least squares problem is

$$\{\hat m(x), \widehat{\nabla m}(x)\} = \arg\min_{a,\,b} \sum_{i=1}^{n} \left\{Y_i - a - (X_i - x)^T b\right\}^2 w_i(x)$$
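The kernel constants $\mu_2(K) = \int u^2 K(u)\,du$ and $\int K^2(u)\,du$ that enter the bias and variance functions above can be checked numerically. The sketch below uses a plain trapezoidal rule; the helper name `kernel_constants` and the two example kernels are illustrative choices, not part of the lecture.

```python
import numpy as np

def kernel_constants(K, lo, hi, num=200001):
    """Approximate mu_2(K) = int u^2 K(u) du and int K^2(u) du on [lo, hi]."""
    u = np.linspace(lo, hi, num)
    du = u[1] - u[0]

    def trapezoid(y):
        # Composite trapezoidal rule on the equally spaced grid u
        return du * (np.sum(y) - 0.5 * (y[0] + y[-1]))

    return trapezoid(u**2 * K(u)), trapezoid(K(u) ** 2)

def epanechnikov(u):
    """Epanechnikov kernel 0.75 * (1 - u^2) on [-1, 1], zero elsewhere."""
    return 0.75 * np.clip(1.0 - u**2, 0.0, None)

def gaussian(u):
    """Standard normal density."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

mu2_epa, r_epa = kernel_constants(epanechnikov, -1.0, 1.0)
mu2_gau, r_gau = kernel_constants(gaussian, -10.0, 10.0)
```

For the Epanechnikov kernel the exact values are $\mu_2(K) = 1/5$ and $\int K^2 = 3/5$; for the Gaussian kernel they are $1$ and $1/(2\sqrt{\pi})$.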



Multivariate local linear estimation 

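• Matrices: $W = W(x) = \mathrm{diag}\left\{n^{-1} K_h(X_i - x)\right\}$,

$$X = X(x) = \begin{pmatrix} 1, & (X_1 - x)^T \\ 1, & (X_2 - x)^T \\ & \cdots \\ 1, & (X_n - x)^T \end{pmatrix}, \quad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \cdots \\ Y_n \end{pmatrix}$$

$$m = \begin{pmatrix} m(X_1) \\ m(X_2) \\ \cdots \\ m(X_n) \end{pmatrix}, \quad e = \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \cdots \\ \varepsilon_n \end{pmatrix}$$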


• Weighted least squares solution:

$$\{\hat m(x), \widehat{\nabla m}(x)\}^T = \left(X^T W X\right)^{-1} X^T W Y$$




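• Separately, the estimators are ($\alpha = 1, \ldots, d$)

$$\hat m(x) = e_0^T \left(X^T W X\right)^{-1} X^T W Y, \quad e_0^T = (1, 0, \ldots, 0)$$

$$\widehat{\frac{\partial m}{\partial x_\alpha}}(x) = e_\alpha^T \left(X^T W X\right)^{-1} X^T W Y, \quad e_\alpha^T = (0, 0, \ldots, 0, 1, 0, \ldots, 0)$$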



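• The corresponding functions are ($\alpha = 1, \ldots, d$)

$$m(x) = m(x)\, e_0^T \left(X^T W X\right)^{-1} X^T W X e_0$$

$$\frac{\partial m}{\partial x_\alpha}(x) = \frac{\partial m}{\partial x_\alpha}(x)\, e_\alpha^T \left(X^T W X\right)^{-1} X^T W X e_\alpha$$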



• In addition, observe that

$$e_0^T \left(X^T W X\right)^{-1} X^T W \sum_{\alpha=1}^{d} \frac{\partial m}{\partial x_\alpha}(x)\, X e_\alpha = 0$$
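The matrix formulas above can be sketched in Python as follows. This is a minimal illustration, assuming a Gaussian product kernel; the function name `local_linear` and the test data are hypothetical. The example uses an exactly linear, noise-free response: the identities above imply that the intercept then returns $m(x)$ and the slope returns the gradient, with no contribution of the gradient terms to $\hat m(x)$.

```python
import numpy as np

def local_linear(x, X, Y, h):
    """Local linear fit at x: returns (m_hat(x), estimated gradient at x).

    X: (n, d) covariates, Y: (n,) responses, x: (d,) point, h: (d,) bandwidths.
    A Gaussian product kernel is assumed for illustration.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    x = np.asarray(x, dtype=float)
    h = np.asarray(h, dtype=float)
    n = X.shape[0]
    Z = np.hstack([np.ones((n, 1)), X - x])       # rows (1, (X_i - x)^T)
    u = (X - x) / h
    # Diagonal of W up to a constant factor, which cancels in the solution
    w = np.exp(-0.5 * np.sum(u**2, axis=1)) / n
    ZtW = Z.T * w                                  # X^T W without an n x n matrix
    beta = np.linalg.solve(ZtW @ Z, ZtW @ Y)       # (X^T W X)^{-1} X^T W Y
    return beta[0], beta[1:]

rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(300, 2))
Y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]            # exactly linear, no noise
x0 = np.array([0.2, -0.1])
m_hat, grad_hat = local_linear(x0, X, Y, h=np.array([0.5, 0.5]))
```

Solving the normal equations with `np.linalg.solve` avoids forming the inverse explicitly, which is usually the safer numerical choice.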




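• The error decomposition for $\hat m(x)$ is

$$\begin{aligned}
\hat m(x) - m(x) = {} & e_0^T \left(X^T W X\right)^{-1} X^T W e \\
& + e_0^T \left(X^T W X\right)^{-1} X^T W m - m(x)\, e_0^T \left(X^T W X\right)^{-1} X^T W X e_0 \\
& - e_0^T \left(X^T W X\right)^{-1} X^T W \sum_{\alpha=1}^{d} \frac{\partial m}{\partial x_\alpha}(x)\, X e_\alpha
\end{aligned}$$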


• Which becomes

$$\hat m(x) - m(x) = e_0^T \left(X^T W X\right)^{-1} X^T W e + e_0^T \left(X^T W X\right)^{-1} X^T W \left\{ m - m(x)\, X e_0 - \sum_{\alpha=1}^{d} \frac{\partial m}{\partial x_\alpha}(x)\, X e_\alpha \right\}$$
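To see where the bias term comes from, it may help to write out one entry of the vector in braces. Since the $i$-th entry of $X e_0$ is $1$ and the $i$-th entry of $X e_\alpha$ is $X_{i\alpha} - x_\alpha$, the $i$-th entry is the remainder of a first-order Taylor expansion of $m$ around $x$. The sketch below assumes $m$ is twice continuously differentiable near $x$; the Hessian symbol $H_m(x)$ is notation introduced here, not in the slides.

```latex
m(X_i) - m(x) - \sum_{\alpha=1}^{d} \frac{\partial m}{\partial x_\alpha}(x)\,(X_{i\alpha} - x_\alpha)
  \approx \frac{1}{2}\,(X_i - x)^T H_m(x)\,(X_i - x),
\qquad
H_m(x) = \left[ \frac{\partial^2 m}{\partial x_\alpha\, \partial x_\beta}(x) \right]_{\alpha,\beta=1}^{d}
```

Only observations with $X_i - x$ of order $h$ receive non-negligible weight, so this remainder is of order $h_\alpha^2$, which matches the $h_\alpha^2 b_\alpha(x)$ form of the bias.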




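• The limiting distribution for $\hat m(x)$ is

$$\sqrt{n h_1 \cdots h_d}\left\{\hat m(x) - m(x) - \sum_{\alpha=1}^{d} h_\alpha^2\, b_\alpha(x)\right\} \xrightarrow{D} N\{0, v(x)\}$$

$$b_\alpha(x) = \frac{d_K}{2}\,\frac{\partial^2 m}{\partial x_\alpha^2}(x), \quad v(x) = \sigma^2(x)\, f^{-1}(x)\, c_K^{d}$$

Here $d_K$ and $c_K$ correspond to $\mu_2(K)$ and $\int K^2(u)\,du$ in the earlier notation.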



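• The asymptotic MISE, $\mathrm{AMISE}\{\hat m(x); h\}$, is

$$\frac{\int \sigma^2(x)\,dx\; c_K^{d}}{n h_1 \cdots h_d} + \frac{d_K^2}{4} \sum_{\alpha,\beta=1}^{d} h_\alpha^2 h_\beta^2 \int \frac{\partial^2 m}{\partial x_\alpha^2}(x)\, \frac{\partial^2 m}{\partial x_\beta^2}(x)\, f(x)\,dx$$

$$h_{opt} = v(m, \sigma, K)\, n^{-1/(d+4)}, \quad \mathrm{AMISE}\{\hat m(x); h_{opt}\} = C(m, \sigma, K)\, n^{-4/(d+4)}$$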



• The "curse of dimensionality": the convergence rate $n^{-2/(d+4)}$ (the square root of the optimal AMISE rate) becomes slower as the dimension $d$ grows. (Intuitively, why?)
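One way to get a feel for the curse of dimensionality is to ask how many observations are needed in dimension $d$ to match the AMISE order achieved by a given sample size in dimension 1. The sketch below ignores the constant $C(m, \sigma, K)$, which is an assumption made purely for illustration; the function names are hypothetical.

```python
def amise_rate(n, d):
    """Order of the optimal AMISE, n^(-4/(d+4)), with the constant ignored."""
    return n ** (-4.0 / (d + 4))

def equivalent_sample_size(n1, d):
    """Sample size in dimension d whose AMISE order equals that of n1 points
    in dimension 1, obtained by solving n^(-4/(d+4)) = n1^(-4/5) for n."""
    return n1 ** ((d + 4) / 5.0)

# Compare dimensions: rate at n = 1000, and the n needed to match n1 = 100 at d = 1
for d in (1, 2, 5, 10):
    print(d, amise_rate(1000, d), round(equivalent_sample_size(100, d)))
```

Intuitively, a local neighbourhood with side lengths $h_1, \ldots, h_d$ contains on the order of $n h_1 \cdots h_d$ observations, so keeping the same number of local observations in higher dimension requires either many more data points or wider, more biased neighbourhoods.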



Computing and dimension reduction 

• In XploRe, there are two related quantlets, “lregxestp” for local 

linear and “regxestp” for NW estimators 






• We show the output on an example of $n = 200$ observations generated with $m(x) = \cos(x_1) + \cos(x_2)$ for $X$ distributed uniformly on $[-\pi, \pi]^2$
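A comparable experiment can be sketched in Python rather than XploRe. This does not reproduce the quantlets' output: the noise level `sigma = 0.5`, the bandwidths, the seed, the evaluation grid, and the Gaussian product kernel are all assumptions made here, since the slides do not state them.

```python
import numpy as np

def m_true(x):
    """Regression function from the example: cos(x_1) + cos(x_2)."""
    return np.cos(x[..., 0]) + np.cos(x[..., 1])

def nw_on_grid(grid, X, Y, h):
    """NW estimates at every row of grid, using a Gaussian product kernel."""
    u = (grid[:, None, :] - X[None, :, :]) / h     # shape (g, n, d)
    w = np.exp(-0.5 * np.sum(u**2, axis=2))        # shape (g, n)
    return (w @ Y) / w.sum(axis=1)

rng = np.random.default_rng(2005)                  # arbitrary seed
n = 200
X = rng.uniform(-np.pi, np.pi, size=(n, 2))
sigma = 0.5                                        # assumed noise level
Y = m_true(X) + sigma * rng.standard_normal(n)

g = np.linspace(-2.0, 2.0, 9)                      # interior grid, away from the boundary
grid = np.array([(a, b) for a in g for b in g])
fit = nw_on_grid(grid, X, Y, h=np.array([0.6, 0.6]))
rmse = np.sqrt(np.mean((fit - m_true(grid)) ** 2))
```

The root mean squared error over the grid gives a rough single-number summary of the fit and can be used to compare bandwidth choices.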









• One natural way to "reduce" dimension is the additive model. This means that

$$m(x) = c + \sum_{\alpha=1}^{d} m_\alpha(x_\alpha)$$

with the identification conditions $E\, m_\alpha(X_\alpha) \equiv 0$, $\alpha = 1, \ldots, d$.



• In XploRe, there are two related quantlets, “backfit” for backfitting and “intest” for integration estimators of the additive model.
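The idea behind backfitting can be sketched generically in Python. This is not the XploRe "backfit" quantlet; it is a textbook-style loop that assumes a univariate NW smoother with a Gaussian kernel, a fixed number of iterations, and hypothetical function names. Each component is updated by smoothing the partial residuals against its own covariate and is then centred, which is the empirical counterpart of the identification condition.

```python
import numpy as np

def smooth_1d(x, r, h):
    """Univariate NW smooth of r against x, evaluated at the sample points."""
    u = (x[:, None] - x[None, :]) / h
    w = np.exp(-0.5 * u**2)
    return (w @ r) / w.sum(axis=1)

def backfit(X, Y, h, n_iter=20):
    """Backfitting for m(x) = c + sum over alpha of m_alpha(x_alpha).

    Returns (c, M), where M[:, a] holds the fitted component a at the sample.
    """
    n, d = X.shape
    c = Y.mean()
    M = np.zeros((n, d))
    for _ in range(n_iter):
        for a in range(d):
            # Partial residuals: remove the constant and all other components
            partial = Y - c - M.sum(axis=1) + M[:, a]
            M[:, a] = smooth_1d(X[:, a], partial, h[a])
            M[:, a] -= M[:, a].mean()   # empirical version of E m_alpha(X_alpha) = 0
    return c, M

rng = np.random.default_rng(7)                      # arbitrary seed
n = 200
X = rng.uniform(-np.pi, np.pi, size=(n, 2))
Y = np.cos(X[:, 0]) + np.cos(X[:, 1]) + 0.3 * rng.standard_normal(n)
c_hat, M_hat = backfit(X, Y, h=np.array([0.5, 0.5]))
```

Each update only involves one-dimensional smoothing, which is how the additive structure sidesteps the slow multivariate rate.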
