LECTURE 18 - Cornell - Cornell University

LECTURE 18: NONLINEAR MODELS 

The basic point is that smooth nonlinear models 

look like linear models locally. 

Models linear in parameters are no problem even if 

they are nonlinear in variables. For example: 

$$ \phi(y) = \beta_0 + \beta_1 \psi_1(x_1) + \beta_2 \psi_2(x_2) + \dots $$

with ϕ and ψ known functions of observable 

regressors, is still a linear regression model. 
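A minimal sketch of this point, assuming NumPy and made-up choices ϕ = log, ψ_1 = log and ψ_2(x) = x² (none of these come from the lecture, and the coefficient values are hypothetical): once the known transformations are applied, ordinary least squares recovers the β's.

```python
import numpy as np

def fit_linear_in_parameters(y, x1, x2):
    """OLS of phi(y) on [1, psi1(x1), psi2(x2)], with phi = log, psi1 = log, psi2 = square."""
    phi_y = np.log(y)                                   # known transform of y
    X = np.column_stack([np.ones_like(x1), np.log(x1), x2 ** 2])
    beta, *_ = np.linalg.lstsq(X, phi_y, rcond=None)    # plain OLS
    return beta

# Noise-free illustration with hypothetical coefficients (0.5, 2.0, -0.3).
rng = np.random.default_rng(0)
x1 = rng.uniform(1.0, 3.0, size=50)
x2 = rng.uniform(0.0, 2.0, size=50)
y = np.exp(0.5 + 2.0 * np.log(x1) - 0.3 * x2 ** 2)
beta_hat = fit_linear_in_parameters(y, x1, x2)
```

The model is nonlinear in y, x_1 and x_2 but the fit is still a single linear regression.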

However, 

$$ y = \theta_1 + \theta_2 e^{\theta_3 x} + \varepsilon $$

is nonlinear (arising, for example, as a solution to a 

differential equation). 

N.M. Kiefer, Cornell University, Econ 620, Lecture 18 


Notation: 

$$ y_i = f(X_i, \theta) + \varepsilon_i = f_i(\theta) + \varepsilon_i, \qquad i = 1, 2, \dots, N $$

Stacking up yields 

y = f(θ) + ε 

where y is Nx1, f(θ) is Nx1, θ is Kx1 and ε is Nx1. 

For X_i, i = 1,2,...,N, fixed, f: R^K → R^N.

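$$
\frac{\partial f}{\partial\theta'} = F(\theta) =
\begin{bmatrix}
\dfrac{\partial f_1}{\partial\theta_1} & \cdots & \dfrac{\partial f_1}{\partial\theta_K} \\
\vdots & & \vdots \\
\dfrac{\partial f_N}{\partial\theta_1} & \cdots & \dfrac{\partial f_N}{\partial\theta_K}
\end{bmatrix}.
$$

Obviously F(θ) is N×K. Assume Eε = 0 and Vε = σ²I.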
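As an illustration of F(θ), here is a sketch for the exponential example y = θ_1 + θ_2 exp(θ_3 x) + ε, assuming NumPy. The function names, the regressor values and the parameter values are hypothetical; the finite-difference routine is only a check on the analytic derivatives.

```python
import numpy as np

def f(theta, x):
    """Mean function f_i(theta) = theta1 + theta2 * exp(theta3 * x_i), stacked to length N."""
    return theta[0] + theta[1] * np.exp(theta[2] * x)

def F(theta, x):
    """Analytic N x K Jacobian: row i holds the derivatives of f_i with respect to theta'."""
    e = np.exp(theta[2] * x)
    return np.column_stack([np.ones_like(x), e, theta[1] * x * e])

def F_numeric(theta, x, h=1e-6):
    """Central finite-difference approximation to the same Jacobian."""
    theta = np.asarray(theta, dtype=float)
    cols = []
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = h
        cols.append((f(theta + step, x) - f(theta - step, x)) / (2 * h))
    return np.column_stack(cols)

x = np.linspace(0.0, 2.0, 7)           # N = 7 hypothetical regressor values
theta = np.array([1.0, 2.0, -0.5])     # K = 3 hypothetical parameter values
J = F(theta, x)
```

Comparing `J` with `F_numeric(theta, x)` is a cheap way to catch algebra mistakes in hand-coded derivatives.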



The nonlinear least squares estimator minimizes 

$$ S(\theta) = \sum_{i=1}^{N} (y_i - f_i(\theta))^2 = (y - f(\theta))'(y - f(\theta)). $$

Differentiation yields 

$$ \frac{\partial S(\theta)}{\partial\theta'} = \frac{\partial}{\partial\theta'}\,(y - f(\theta))'(y - f(\theta)) = -2\,(y - f(\theta))' F(\theta). $$

Thus, the nonlinear least squares (NLS) estimator θ-hat satisfies

$$ F(\hat\theta)'(y - f(\hat\theta)) = 0. $$

(Are these equations familiar?)

This is like the property X'e = 0 in the LS method.



Computing θ-hat:

The Gauss-Newton method is to use a first-order expansion of f(θ) around θ_T (a "trial" value) in S(θ), giving

$$ S_T(\theta) = (y - f(\theta_T) - F(\theta_T)(\theta - \theta_T))'(y - f(\theta_T) - F(\theta_T)(\theta - \theta_T)). $$

Minimizing S_T in θ gives

$$ \theta_M = \theta_T + [F(\theta_T)'F(\theta_T)]^{-1} F(\theta_T)'(y - f(\theta_T)). $$

(Exercise: Show θ_M is the minimizer.)

We know S_T(θ_M) ≤ S_T(θ_T).

Is it true that S(θ_M) ≤ S(θ_T)?



[Figure: two panels, each plotting S(θ) and its approximation S_T(θ) against θ, with θ-hat, θ_M and θ_T marked.]



The method is to use θ_M for the new trial value, expand again, minimize and iterate. But there can be problems. For example, it is possible that S(θ_M) > S(θ_T), but there is a θ* between θ_T and θ_M,

$$ \theta^* = \theta_T + \lambda(\theta_M - \theta_T) $$

for some λ < 1, with S(θ*) ≤ S(θ_T). This suggests trying a decreasing sequence of λ's in [0,1], leading to the modified Gauss-Newton method.

(1) Start with θ_T and compute

$$ D_T = [F(\theta_T)'F(\theta_T)]^{-1} F(\theta_T)'(y - f(\theta_T)). $$

(2) Find λ such that S(θ_T + λD_T) < S(θ_T).

(3) Set θ* = θ_T + λD_T and go to (1) using θ* as a new value for θ_T.



Stop when the changes in parameter values and S 

between iterations are small. Good practice is to 

try several different starting values. 
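Steps (1) to (3) and the stopping rule can be sketched as follows, assuming NumPy and the exponential example model. The step-halving sequence λ = 1, 1/2, 1/4, ..., the tolerances, the data and the starting value are illustrative choices, not prescribed by the lecture.

```python
import numpy as np

def f(theta, x):
    """Example mean function theta1 + theta2 * exp(theta3 * x)."""
    return theta[0] + theta[1] * np.exp(theta[2] * x)

def F(theta, x):
    """N x K Jacobian of f with respect to theta'."""
    e = np.exp(theta[2] * x)
    return np.column_stack([np.ones_like(x), e, theta[1] * x * e])

def S(theta, y, x):
    """Sum of squared residuals."""
    r = y - f(theta, x)
    return r @ r

def modified_gauss_newton(y, x, theta_start, tol=1e-10, max_iter=200):
    theta = np.asarray(theta_start, dtype=float)
    for _ in range(max_iter):
        S_old = S(theta, y, x)
        # Step (1): D_T = [F'F]^{-1} F'(y - f), computed as an OLS fit.
        D, *_ = np.linalg.lstsq(F(theta, x), y - f(theta, x), rcond=None)
        # Step (2): decreasing lambdas 1, 1/2, 1/4, ... until S falls.
        lam = 1.0
        while not S(theta + lam * D, y, x) < S_old:
            lam /= 2.0
            if lam < 1e-12:
                return theta          # no improving step found
        # Step (3): accept theta* and iterate.
        theta_new = theta + lam * D
        converged = (np.max(np.abs(theta_new - theta)) < tol
                     and S_old - S(theta_new, y, x) < tol)
        theta = theta_new
        if converged:
            break
    return theta

x = np.linspace(0.0, 4.0, 40)
y = 1.0 + 2.0 * np.exp(-0.5 * x)      # noise-free data, hypothetical theta = (1, 2, -0.5)
theta_hat = modified_gauss_newton(y, x, theta_start=[0.8, 1.7, -0.4])
```

Calling `modified_gauss_newton` from several different `theta_start` values and comparing the results is one way to follow the advice about starting values.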

Estimation of σ² = V(ε):

$$ s^2 = \frac{(y - f(\hat\theta))'(y - f(\hat\theta))}{N - K} $$

$$ \hat\sigma^2 = \frac{(y - f(\hat\theta))'(y - f(\hat\theta))}{N} $$

Why are there two possibilities? 
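A small sketch of the two estimators, assuming NumPy; the residual values and K = 3 are made up for illustration. The two differ only in the divisor, so s² = σ-hat² · N/(N − K).

```python
import numpy as np

def variance_estimates(residuals, K):
    """Return (s2, sigma2_hat): residual sum of squares over N - K and over N."""
    e = np.asarray(residuals, dtype=float)
    N = e.size
    rss = e @ e
    return rss / (N - K), rss / N

# Hypothetical residuals y - f(theta_hat) from a model with K = 3 parameters.
s2, sigma2_hat = variance_estimates([0.5, -1.0, 0.25, 0.75, -0.5], K=3)
```

The gap between the two shrinks as N grows relative to K.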



Note the simplification possible when the model is linear in some of the parameters (as in the example)

$$ y = \theta_1 + \theta_2 \exp(\theta_3 x) + \varepsilon. $$

Here, given θ_3, the other parameters can be estimated by OLS. Thus the sum of squares function S can be concentrated, that is, written as a function of one parameter alone. The nonlinear minimization problem is 1-dimensional, not 3-dimensional. This is an important trick.

We used the same device to estimate models with autocorrelation: the ρ-differenced model could be estimated by OLS conditional on ρ.
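A sketch of the concentration trick for the exponential example, assuming NumPy. The grid search over θ_3, the grid range and the data are illustrative assumptions; any one-dimensional minimizer could replace the grid.

```python
import numpy as np

def concentrated_S(theta3, y, x):
    """S with theta1, theta2 replaced by their OLS values given theta3."""
    X = np.column_stack([np.ones_like(x), np.exp(theta3 * x)])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ b
    return r @ r, b

def fit_by_concentration(y, x, grid):
    """One-dimensional grid search over theta3; returns (theta1, theta2, theta3)."""
    values = [concentrated_S(t, y, x)[0] for t in grid]
    t_best = grid[int(np.argmin(values))]
    b = concentrated_S(t_best, y, x)[1]
    return np.array([b[0], b[1], t_best])

x = np.linspace(0.0, 4.0, 40)
y = 1.0 + 2.0 * np.exp(-0.5 * x)          # noise-free data, hypothetical theta = (1, 2, -0.5)
grid = np.linspace(-2.0, -0.05, 40)       # hypothetical search range for theta3 (step 0.05)
theta_hat = fit_by_concentration(y, x, grid)
```

Only θ_3 is searched over; θ_1 and θ_2 come out of a linear regression at each trial value.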



Inference: 

We can show that

$$ \hat\theta = \theta_0 + (F(\theta_0)'F(\theta_0))^{-1} F(\theta_0)'\varepsilon + r, $$

where θ_0 is the true parameter value and plim N^{1/2} r = 0 (show). So r can be ignored in calculating the asymptotic distribution of θ-hat. This is just like the expression in the linear model: decomposing β-hat into the true value plus sampling error.

Thus the asymptotic distribution of N^{1/2}(θ-hat - θ_0) is

$$ N\left(0,\ \sigma^2 \left(\frac{F(\theta_0)'F(\theta_0)}{N}\right)^{-1}\right). $$

So the approximate distribution of θ-hat becomes

$$ N\left(\theta_0,\ \sigma^2 (F(\theta_0)'F(\theta_0))^{-1}\right). $$



In practice σ² is estimated by s² or σ-hat², and F(θ_0)'F(θ_0) is estimated by F(θ-hat)'F(θ-hat).

Check that this is OK.
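The plug-in variance estimate can be sketched as below, assuming NumPy and the s² version of the variance estimator. The function name and the simulated data are hypothetical. The example uses the linear special case f(θ) = Xθ, where F(θ) = X and the formula reduces to the usual OLS covariance matrix.

```python
import numpy as np

def nls_covariance(residuals, J):
    """Estimate s^2 (F'F)^{-1}, with J = F(theta_hat) and residuals = y - f(theta_hat)."""
    N, K = J.shape
    s2 = residuals @ residuals / (N - K)
    return s2 * np.linalg.inv(J.T @ J)

# Hypothetical check in the linear case, where the Jacobian is X itself.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(30), rng.normal(size=30)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=30)
b = np.linalg.solve(X.T @ X, X.T @ y)
cov = nls_covariance(y - X @ b, X)
std_err = np.sqrt(np.diag(cov))       # approximate standard errors of theta_hat
```

Note that no factor of N appears here: this approximates the variance of θ-hat itself, not of N^{1/2}(θ-hat - θ_0).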

Applications: 

1. The overidentified SEM is a nonlinear 

regression model (linear in variables, nonlinear in 

parameters) - consider the reduced form equations 

in terms of structural parameters. 

2. Many other models - more to come. 


Efficiency: Consider another consistent 

estimator θ * with “sampling error” in the 

form 

$$ N^{1/2}(\theta^* - \theta_0) = (F(\theta_0)'F(\theta_0))^{-1} F(\theta_0)'\varepsilon + C'\varepsilon + r. $$

It can be shown, in a proof like that of the 

Gauss-Markov Theorem, that the minimum 

variance estimator in this class (locally 

linear) has C=0. 

A better property holds when the errors are 

iid normal. Then, the NLS estimator is the 

MLE, and we have the Cramer-Rao 

efficiency result. 



Nonlinear SEM 

Let

$$ y_i = f(Z_i, \theta) + \varepsilon_i = f_i(\theta) + \varepsilon_i, $$

where now Z is used to indicate included endogenous variables. NLS will not be consistent (why not?). The trick is to find instruments W and look at

$$ W'y = W'f(\theta) + W'\varepsilon. $$

If the model is just-identified, we can just set W'ε = 0 (its expected value) and solve. Otherwise, we can do nonlinear GLS, minimizing the variance-weighted sum of squares

$$ (y - f(\theta))' W (W'W)^{-1} W' (y - f(\theta)). $$
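The criterion above is the squared length of the residual after projection onto the columns of W. A minimal sketch, assuming NumPy; the mean function `f_example`, the data and the instrument matrix are hypothetical, and W is built from Z here only to keep the example short (real instruments would have to be uncorrelated with ε).

```python
import numpy as np

def nl2sls_objective(theta, y, Z, W, f):
    """(y - f(Z, theta))' W (W'W)^{-1} W' (y - f(Z, theta))."""
    r = y - f(Z, theta)
    Wr = W.T @ r                                   # W'(y - f)
    return Wr @ np.linalg.solve(W.T @ W, Wr)       # quadratic form in (W'W)^{-1}

def f_example(Z, theta):
    """Hypothetical mean function, nonlinear in theta."""
    return theta[0] * np.exp(theta[1] * Z)

Z = np.linspace(0.0, 1.0, 5)
W = np.column_stack([np.ones(5), Z, Z ** 2])       # hypothetical instrument matrix
theta0 = np.array([1.0, 0.5])
y = f_example(Z, theta0)                           # noise-free, for illustration only
q_true = nl2sls_objective(theta0, y, Z, W, f_example)
q_other = nl2sls_objective(np.array([1.2, 0.3]), y, Z, W, f_example)
```

Because the criterion is a projection, it never exceeds the ordinary sum of squares S(θ), and it coincides with S(θ) when W spans all of R^N.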



This θ-hat is called the nonlinear 2SLS 

estimator (by Amemiya, who studied its 

properties in 1974). Note that there are not 

two separate stages. 

Specifically, it might be tempting to just obtain a first-stage Z-hat = (I-M)Z = [Y_2-hat, X_1] and do nonlinear regression of y on f(Z-hat, θ).

This does not necessarily work. Z-hat is 

orthogonal to ε, but f may not be. In the 

linear regression case, we want Z-hat’ ε =0. 

The corresponding condition here is F-hat’ 

ε=0. Since F-hat depends generally on θ, 

there is no real way to do this in stages. 



Asymptotic Distribution Theory 

$$ N^{1/2}(\hat\theta - \theta_0) \to N\left(0,\ \sigma^2 \left(\frac{F(\theta_0)'W(W'W)^{-1}W'F(\theta_0)}{N}\right)^{-1}\right). $$

In practice σ² is estimated by

$$ (y - f(Z, \hat\theta))'(y - f(Z, \hat\theta))/N $$

(or over N-K) and F(θ_0) is estimated by F(θ-hat).

Don't forget to remove the factor N^{-1} in the variance when approximating the variance of your estimator.

The calculation is exactly like the usual calculation of the variance of the GLS estimator.



MISCELLANEOUS:

Proposition:

$$ \sum_{i=1}^{\infty} (f_i(\theta_0) - f_i(\theta'))^2 = \infty, \qquad \theta' \neq \theta_0, $$

is necessary for the existence of a consistent NLS estimator. (Compare this with OLS).

Funny example: Consider y_i = e^{-αi} + ε_i, where α ∈ (0, 2π) and V(ε_i) = σ². Is there a consistent estimator?
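One way to approach the question is through the proposition. A sketch of the calculation, using only the bound (a − b)² ≤ a² + b² for a, b ≥ 0 and the geometric series, with α_0 the true value and α′ any other value in (0, 2π):

```latex
\sum_{i=1}^{\infty}\left(e^{-\alpha_0 i}-e^{-\alpha' i}\right)^2
\le \sum_{i=1}^{\infty}\left(e^{-2\alpha_0 i}+e^{-2\alpha' i}\right)
= \frac{1}{e^{2\alpha_0}-1}+\frac{1}{e^{2\alpha'}-1} < \infty .
```

The sum is finite, so the necessary condition fails. This suggests that no consistent NLS estimator of α exists here: the regression function dies out so quickly that later observations carry almost no information about α.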

Consistency can be shown (generally, not just in 

this example) using regularity conditions, not 

requiring derivatives. Normality can be shown 

using first but not second derivatives . 



Many models can be set up as nonlinear 

regressions - like qualitative dependent variable 

models. Sometimes it is hard to interpret 

parameters. We might be interested in functions 

of parameters, for example elasticities. 

Tests on functions of parameters: 

Remember the delta-method. Let θ-hat ~ N(θ_0, Σ) be a consistent estimator of θ whose true value is θ_0. Suppose g(θ) is the quantity of interest. Expanding g(θ-hat) around θ_0 yields

$$ g(\hat\theta) = g(\theta_0) + \left(\left.\frac{\partial g}{\partial\theta}\right|_{\theta=\theta_0}\right)'(\hat\theta - \theta_0) + \text{more}. $$



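This implies that

$$ V(g(\hat\theta)) = E(g(\hat\theta) - g(\theta_0))^2 \approx \left(\left.\frac{\partial g}{\partial\theta}\right|_{\theta_0}\right)' \Sigma \left(\left.\frac{\partial g}{\partial\theta}\right|_{\theta_0}\right). $$

Asymptotically, g(θ-hat) ~ N(g(θ_0), V(g(θ-hat))).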
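A sketch of the delta-method variance, assuming NumPy. The gradient is taken by central differences and evaluated at θ-hat in place of the unknown θ_0; the ratio g(θ) = θ_1/θ_2 and the numbers for θ-hat and Σ are made up for illustration.

```python
import numpy as np

def delta_method_variance(g, theta_hat, Sigma, h=1e-6):
    """Approximate V(g(theta_hat)) by grad' Sigma grad, with a central-difference gradient."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    grad = np.zeros_like(theta_hat)
    for k in range(theta_hat.size):
        step = np.zeros_like(theta_hat)
        step[k] = h
        grad[k] = (g(theta_hat + step) - g(theta_hat - step)) / (2 * h)
    return grad @ Sigma @ grad

theta_hat = np.array([2.0, 4.0])                    # hypothetical estimates
Sigma = np.array([[0.04, 0.01], [0.01, 0.09]])      # hypothetical covariance matrix
ratio_var = delta_method_variance(lambda t: t[0] / t[1], theta_hat, Sigma)
```

For a linear function g(θ) = a'θ the result is exactly a'Σa; for nonlinear g it is only as good as the first-order expansion.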

This is a very useful method. It also shows that 

normal approximation may be poor. (why?) 

How can the normal approximation be 

improved? This is a deep question. One trick is 

to choose a parametrization for which ∂²l/∂θ∂θ′

is constant or nearly so. Why does this work? 

Consider the Taylor expansion of the score … . 


