STA 414S/2104S: Some notes on bias-variance trade-off                February, 2010

The goal is to understand in more detail the role of prediction error in model selection and assessment, and to study the notion of bias-variance trade-off. We assume that we have a set of training data $T$, typically $N$ instances $(x_1, y_1), \ldots, (x_N, y_N)$, where $x$ is usually a vector of features and $y$ a response, either continuous or discrete. We further assume that we have a fitted function $\hat f(\cdot)$, which is a rule that has been constructed from $T$, that maps a vector $x$ to a value $y$. In this notation the dependence of $\hat f(\cdot)$ on $T$ is suppressed. In most applications, $\hat f(\cdot)$ is obtained by minimizing
$$\sum_{i=1}^N L\{y_i, f(x_i)\},$$
where $L\{Y, f(X)\}$ is a loss function, and the minimum is calculated over some class of functions $\{f\}$ to be specified.

Common choices of loss function are squared error loss, absolute error, and, more rarely, losses based on the $p$th norm, for continuous responses $y$ and especially for models where $f(x) = E(Y \mid x)$. For discrete responses a loss function that counts misclassifications is quite common.
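As a small numerical illustration (the responses and fitted values below are invented for the example, not from the notes), the two losses just mentioned amount to averaging a per-observation penalty:

```python
import numpy as np

# Toy illustration of two common loss functions, evaluated as average
# training loss; all values are invented for the example.
y_cont = np.array([1.2, 0.7, 2.4, 1.9])   # continuous responses
f_cont = np.array([1.0, 1.0, 2.0, 2.0])   # fitted values f_hat(x_i)
squared_error = np.mean((y_cont - f_cont) ** 2)   # squared error loss

y_class = np.array([0, 1, 1, 0])          # discrete responses
f_class = np.array([0, 1, 0, 0])          # predicted classes
misclassification = np.mean(y_class != f_class)   # 0-1 loss

print(squared_error, misclassification)   # -> 0.075 0.25
```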
The loss function that leads to maximum likelihood estimation is $-2 \log \Pr(Y; \theta)$, or more generally $-2 \log \Pr(Y; f(\cdot))$.

In considering errors made in using $\hat f(\cdot)$ for prediction, or more specifically on test data, two quantities of interest are
$$\mathrm{Err}_T = E\{L(Y, \hat f(X)) \mid T\}, \qquad (1)$$
$$\mathrm{Err} = E(\mathrm{Err}_T), \qquad (2)$$
where the expectation in (1) is over the joint distribution of $(Y, X)$, conditional on the training data $T$, and the expectation in (2) is over the training data. $\mathrm{Err}_T$ is called test error, prediction error, or generalization error, and $\mathrm{Err}$ is called expected prediction/test error. Loosely speaking, $\mathrm{Err}_T$ is relevant to model selection: how well will a range of models fitted to the training data serve to predict new data? $\mathrm{Err}$ is relevant to model assessment: how well will a range of possible models work for a range of applications? However, $\mathrm{Err}_T$ is more difficult to analyse theoretically than $\mathrm{Err}$. Of course, neither of these measures can address errors in prediction due to changes in the underlying structure of the data or the errors; the test data must have something in common with the training data in order that the training data be useful for prediction.

Example: additive errors. Suppose the model is $Y = f(X) + \epsilon$, where $E(\epsilon) = 0$ and $\mathrm{var}(\epsilon) = \sigma^2_\epsilon$. It is natural in this setting to use squared error loss, and to condition on the feature variables. Thus we compute, as in §7.3,
$$\mathrm{Err}(x_0) = E\{(Y - \hat f(x_0))^2 \mid X = x_0\} = \sigma^2_\epsilon + \{E(\hat f(x_0)) - f(x_0)\}^2 + \mathrm{var}\{\hat f(x_0)\},$$
the decomposition of prediction error into irreducible noise, squared bias, and variance.
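The decomposition above can be checked by simulation. The following sketch (model, sample sizes, and seed are illustrative choices, not from the notes) repeatedly refits least squares on fresh training sets and compares a Monte Carlo estimate of $\mathrm{Err}(x_0)$ with $\sigma^2_\epsilon + \mathrm{bias}^2 + \mathrm{variance}$:

```python
import numpy as np

# Monte Carlo check of Err(x0) = sigma^2 + bias^2 + variance for a
# least-squares fit; all settings (f, sigma, x0, N) are illustrative.
rng = np.random.default_rng(0)
sigma, N, R = 1.0, 50, 4000
beta_true = np.array([1.0, 2.0])          # true f(x) = 1 + 2x
x0 = np.array([1.0, 0.5])                 # prediction point (intercept, x)
f_x0 = x0 @ beta_true

preds = np.empty(R)
for r in range(R):
    x = rng.uniform(-1, 1, N)
    X = np.column_stack([np.ones(N), x])
    y = X @ beta_true + rng.normal(0, sigma, N)
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    preds[r] = x0 @ beta_hat              # f_hat(x0) for this training set

bias2 = (preds.mean() - f_x0) ** 2        # squared bias of f_hat(x0)
var = preds.var()                         # variance of f_hat(x0)
# Err(x0): average squared error against fresh responses Y = f(x0) + eps
err_x0 = np.mean((f_x0 + rng.normal(0, sigma, R) - preds) ** 2)
print(bias2, var, sigma**2 + bias2 + var, err_x0)
```

Since least squares is unbiased here, the squared-bias term is essentially zero and $\mathrm{Err}(x_0)$ is close to $\sigma^2_\epsilon$ plus the variance term.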


In the text in (7.12) the squared bias term is included, but for this particular example it is zero.

Example: ridge regression. In the same linear model, if we have $\hat\beta = (X^T X + \lambda I)^{-1} X^T y$, then $E\hat f(x_0) = x_0^T (X^T X + \lambda I)^{-1} X^T X \beta$ and $\mathrm{var}(\hat f(x_0)) = \sigma^2_\epsilon\, x_0^T (X^T X + \lambda I)^{-1} X^T X (X^T X + \lambda I)^{-1} x_0$, and it is at least plausible that there is some value of $\lambda$ for which $\mathrm{Err}(x_0)$ is smaller than its value when $\lambda = 0$.

We now consider the estimation of $\mathrm{Err}$ in more general settings. The training error is defined as
$$\mathrm{err} = \frac{1}{N} \sum_{i=1}^N L(y_i, \hat f(x_i)),$$
and it is clearly an underestimate of $\mathrm{Err}$, because $\hat f(\cdot)$ is usually determined to minimize $\mathrm{err}$. In §7.4 the "in-sample generalization error", or "in-sample test error", is defined as
$$\mathrm{Err}_{\mathrm{in}} = \frac{1}{N} \sum_{i=1}^N E\{L(Y_i^0, \hat f(x_i)) \mid T\},$$
where the expectation is over a new sample of $Y$'s of size $N$, one $Y$ for each $x_i$. This is still conditional on $x_1, \ldots, x_N$. It can be shown that
$$E_y(\mathrm{Err}_{\mathrm{in}} - \mathrm{err}) = \frac{2}{N} \sum_{i=1}^N \mathrm{cov}(y_i, \hat f(x_i))$$
for both squared error loss and 0–1 loss, and that this holds approximately for log-likelihood loss.¹ Further, for $\hat f(x_i)$ determined by linear regression with $d$ basis functions, the second term is $(2/N)\, d \sigma^2_\epsilon$. This could be simple least squares regression as above, with $d = p$, or it could be any type of regression spline, or regression on a wavelet basis.
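The covariance-penalty identity can be verified by simulation for a linear smoother $\hat f = Sy$. This sketch (design, true means, $\lambda$, and seed all illustrative) uses ridge regression as the smoother, for which §7.6 gives $\sum_i \mathrm{cov}(y_i, \hat f(x_i)) = \mathrm{trace}(S)\,\sigma^2_\epsilon$, so the expected optimism should be $(2/N)\,\mathrm{trace}(S)\,\sigma^2_\epsilon$:

```python
import numpy as np

# Monte Carlo check that E_y(Err_in - err) = (2/N) trace(S) sigma^2 for a
# linear smoother f_hat = S y (here: ridge with fixed lambda); all
# settings are illustrative.
rng = np.random.default_rng(2)
N, p, sigma, lam, R = 40, 6, 1.0, 3.0, 3000
X = rng.normal(size=(N, p))               # fixed design, conditioned on
f = X @ rng.normal(size=p)                # true means at the fixed x's
S = X @ np.linalg.inv(X.T @ X + lam * np.eye(p)) @ X.T  # ridge smoother

opt = np.empty(R)
for r in range(R):
    y = f + rng.normal(0, sigma, N)       # training responses
    fhat = S @ y
    err = np.mean((y - fhat) ** 2)        # training error
    y_new = f + rng.normal(0, sigma, N)   # fresh responses, same x's
    opt[r] = np.mean((y_new - fhat) ** 2) - err   # Err_in estimate - err

penalty = 2 / N * np.trace(S) * sigma**2
print(opt.mean(), penalty)                # the two should nearly agree
```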
More generally, as stated in §7.6,
$$\sum_{i=1}^N \mathrm{cov}(y_i, \hat f(x_i)) = \mathrm{trace}(S)\, \sigma^2_\epsilon$$
for a linear model and any linear fitting method $\hat f = Sy$, which includes regularization methods such as spline smoothing and ridge regression.

This result gives a way to correct $\mathrm{err}$ to give an estimate of $\mathrm{Err}_{\mathrm{in}}$: since
$$E_y(\mathrm{Err}_{\mathrm{in}}) = E_y(\mathrm{err}) + \frac{2}{N} d \sigma^2_\epsilon,$$
an estimate of the LHS is given by
$$\mathrm{err} + \frac{2}{N} d \sigma^2_\epsilon.$$
It is further claimed that for log-likelihood loss, it can be shown that as $N \to \infty$,
$$-2 E\{\log \Pr(y; \hat\theta)\} \simeq -\frac{2}{N} \sum_{i=1}^N \log \Pr(y_i; \hat\theta) + 2\frac{d}{N},$$

¹ Here $E_y$ means expectation over the training $y_1, \ldots, y_N$. This is similar to the calculation that takes $\mathrm{Err}_T$ to $\mathrm{Err}$, but emphasizes via the notation that the training $x$'s are fixed.


where $d$ is the number of parameters in $\hat\theta$, thus motivating estimating the expected loss by the RHS, which is Akaike's information criterion AIC: see, e.g., (7.30) (where, however, there is a typo: $\sigma^2_\epsilon$ should not be in the equation).

A different derivation of AIC is given in Davison (2003, §4.7), tied more closely to maximum likelihood fitting. Suppose we have a random sample $Y_1, \ldots, Y_N$ from an unknown true density $g(y)$, but that we fit the family of models $\{f(y; \theta);\ \theta \in \Theta\}$ by maximizing the log-likelihood function $\ell(\theta) = \sum \log f(y_i; \theta)$. Define the Kullback–Leibler discrepancy
$$D(f_\theta, g) = \int \log\left\{\frac{g(y)}{f(y; \theta)}\right\} g(y)\, dy;$$
this is $0$ if $f(y; \theta) = g(y)$ and otherwise is positive. Denote by $\theta_g$ the value of $\theta$ that minimizes $D(f_\theta, g)$. The expected likelihood ratio statistic for comparing $g$ with $f_\theta$ at $\theta = \hat\theta$, for a new random sample $Y_1^+, \ldots, Y_N^+$ from $g$, independent of $Y_1, \ldots, Y_N$, is
$$E_{g^+}\left[\sum_{i=1}^N \log\left\{\frac{g(Y_i^+)}{f(Y_i^+; \hat\theta)}\right\}\right] = N D(f_{\hat\theta}, g) \ge N D(f_{\theta_g}, g).$$
Davison shows that
$$N D(f_{\hat\theta}, g) \doteq N D(f_{\theta_g}, g) + \tfrac{1}{2}\, \mathrm{tr}\{(\hat\theta - \theta_g)(\hat\theta - \theta_g)^T I_g(\theta_g)\},$$
where $I_g(\theta) = -N \int \{\partial^2 \log f(y; \theta)/\partial\theta\, \partial\theta^T\}\, g(y)\, dy$ and $E_g$ is over the distribution of $\hat\theta$. He then shows that this can be estimated by
$$-\ell(\hat\theta) + c,$$
where $c$ estimates $\tfrac{1}{2}\, \mathrm{tr}\{(\hat\theta - \theta_g)(\hat\theta - \theta_g)^T I_g(\theta_g)\}$, and finally shows that $c = p = \dim(\theta)$ is the expected value of this quantity, so it serves as a reasonable estimator. This gives the estimator
$$-\ell(\hat\theta) + p;$$
the AIC is typically a scalar multiple of this.
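As a sketch of AIC on the $-\ell(\hat\theta) + p$ scale just derived, the following compares Gaussian polynomial regressions of increasing degree; the data-generating model (a quadratic with unit noise) and all constants are invented for illustration:

```python
import numpy as np

# AIC on the -l(theta_hat) + p scale for Gaussian polynomial models of
# increasing degree; the quadratic data-generating model is illustrative.
rng = np.random.default_rng(3)
N = 100
x = rng.uniform(-2, 2, N)
y = 1.0 - 0.5 * x + 2.0 * x**2 + rng.normal(0, 1.0, N)

def aic(degree):
    X = np.vander(x, degree + 1)                  # polynomial basis
    fhat = X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sig2 = np.mean((y - fhat) ** 2)               # ML estimate of sigma^2
    loglik = -0.5 * N * (np.log(2 * np.pi * sig2) + 1)
    p = degree + 2                                # coefficients + sigma^2
    return -loglik + p                            # minimize over models

scores = {d: aic(d) for d in range(5)}
best = min(scores, key=scores.get)
print(scores, best)
```

Degrees 0 and 1 incur a large likelihood penalty for missing the quadratic term, so the selected degree is at least 2; the $+p$ term discourages the over-fitted degrees.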
Our book uses the multiple $2/N$; other books use $2$.

Finally, a seemingly more direct estimate of $\mathrm{Err}_T$ is the cross-validation estimate
$$CV(\hat f) = \frac{1}{N} \sum_{i=1}^N L\{y_i, \hat f^{-\kappa(i)}(x_i)\},$$
where $\hat f^{-\kappa(i)}(x_i)$ is the prediction of $y_i$ based on a fitted model that omits either the partition $\kappa(i)$ that $y_i$ falls in ($K$-fold cross-validation) or simply the $i$th observation (leave-one-out cross-validation). This would seem to give an internal estimate of $\mathrm{Err}_T$, but the book argues that in fact it seems to estimate $\mathrm{Err}$ instead. For linear smoothers and squared-error loss it can be shown that, for leave-one-out cross-validation,
$$CV = \frac{1}{N} \sum_{i=1}^N \frac{\{y_i - \hat f(x_i)\}^2}{(1 - S_{ii})^2};$$
replacing $1 - S_{ii}$ by $1 - \mathrm{trace}(S)/N$ makes computations even simpler. This latter expression is known as the GCV criterion.
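The leave-one-out shortcut for linear smoothers can be checked directly against brute-force refitting. This sketch uses ordinary least squares as the smoother, with an invented design and response:

```python
import numpy as np

# Check the identity CV = (1/N) sum {(y_i - fhat_i)/(1 - S_ii)}^2 for a
# linear smoother (here: least squares) against brute-force leave-one-out
# refitting; the design and coefficients are illustrative.
rng = np.random.default_rng(4)
N, p = 25, 3
X = rng.normal(size=(N, p))
y = X @ np.array([1.0, -1.0, 0.5]) + rng.normal(0, 1.0, N)

S = X @ np.linalg.inv(X.T @ X) @ X.T              # hat matrix, fhat = S y
fhat = S @ y
cv_shortcut = np.mean(((y - fhat) / (1 - np.diag(S))) ** 2)

cv_brute = 0.0
for i in range(N):                                # refit without point i
    keep = np.arange(N) != i
    b = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
    cv_brute += (y[i] - X[i] @ b) ** 2
cv_brute /= N

# GCV: replace each 1 - S_ii by the average 1 - trace(S)/N
gcv = np.mean((y - fhat) ** 2) / (1 - np.trace(S) / N) ** 2
print(cv_shortcut, cv_brute, gcv)
```

The two CV values agree to numerical precision, confirming that no refitting is needed for a linear smoother; GCV is close but not identical, since it averages the leverages $S_{ii}$.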
