Linear Regression
March 28, 2013

Introduction

Linear regression is used when we have a numeric response variable and numeric (and possibly categorical) predictor (explanatory) variable(s). The mean of the response variable is related to the predictor(s), with random error terms assumed to be independent and normally distributed with constant variance. The fitting of linear regression models is very flexible, allowing for fitting curvature and interactions between factors.

Simple Linear Regression

When there is a single numeric predictor, we refer to the model as Simple Regression. The response variable is denoted as Y and the predictor variable is denoted as X. The model is:

Y = \beta_0 + \beta_1 X + \varepsilon \qquad \varepsilon \sim N(0, \sigma^2) \text{ independent}

Here β0 is the intercept (the mean of Y when X = 0) and β1 is the slope (the change in the mean of Y when X increases by 1 unit). Of primary concern is whether β1 = 0, which implies the mean of Y is constant (β0), and thus Y and X are not associated.
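As a quick illustration of this model (not part of the original notes), the sketch below simulates data from a simple linear regression. The parameter values, sample size, and use of numpy are arbitrary choices for the example.

```python
# Minimal simulation sketch of the simple linear regression model above.
# The parameter values (beta0, beta1, sigma) and sample size are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
n = 50
beta0, beta1, sigma = 10.0, 2.5, 3.0        # illustrative "true" parameters

X = rng.uniform(0, 20, size=n)              # predictor values
eps = rng.normal(0.0, sigma, size=n)        # independent N(0, sigma^2) errors
Y = beta0 + beta1 * X + eps                 # responses generated from the model
```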


Estimation of Model Parameters

We obtain a sample of pairs (X_i, Y_i), i = 1, ..., n. Our goal is to choose estimators of β0 and β1 that minimize the error sum of squares: Q = Σ_{i=1}^n ε_i². The resulting estimators are (from calculus):

\hat{\beta}_1 = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^n (X_i - \bar{X})^2} \qquad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}

Once we have estimates, we obtain fitted values and residuals for each observation. The error sum of squares (SSE) is obtained as the sum of the squared residuals:

\text{Fitted Values: } \hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i \qquad \text{Residuals: } e_i = Y_i - \hat{Y}_i \qquad SSE = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2

The (unbiased) estimate of the error variance σ² is s² = MSE = SSE/(n − 2), where MSE is the Mean Square Error. The subtraction of 2 reflects the fact that we have estimated two parameters: β0 and β1.

The estimators of β1 and β0 are linear functions of Y_1, ..., Y_n, and thus, using basic rules of mathematical statistics, their sampling distributions are:

\hat{\beta}_1 \sim N\left(\beta_1, \; \frac{\sigma^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\right) \qquad \hat{\beta}_0 \sim N\left(\beta_0, \; \sigma^2\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\right]\right)

The standard error is the square root of the variance, and the estimated standard error is the standard error with the unknown σ² replaced by MSE:

SE\{\hat{\beta}_1\} = \sqrt{\frac{MSE}{\sum_{i=1}^n (X_i - \bar{X})^2}} \qquad SE\{\hat{\beta}_0\} = \sqrt{MSE\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^n (X_i - \bar{X})^2}\right]}

Inference Regarding β1

Primarily of interest are inferences regarding β1. Note that if β1 = 0, Y and X are not associated. We can test hypotheses and construct confidence intervals based on the estimate of β1 and its estimated standard error.
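A minimal numpy sketch of the formulas above, assuming X and Y are 1-D arrays of equal length (for example, the simulated data from the previous sketch); the function name simple_ols is an illustrative choice.

```python
# Least squares estimates, fitted values, MSE, and standard errors, computed
# directly from the simple-regression formulas above.
import numpy as np

def simple_ols(X, Y):
    n = len(Y)
    Xbar, Ybar = X.mean(), Y.mean()
    Sxx = np.sum((X - Xbar) ** 2)
    b1 = np.sum((X - Xbar) * (Y - Ybar)) / Sxx        # slope estimate
    b0 = Ybar - b1 * Xbar                             # intercept estimate
    fitted = b0 + b1 * X
    resid = Y - fitted
    SSE = np.sum(resid ** 2)
    MSE = SSE / (n - 2)                               # unbiased estimate of sigma^2
    se_b1 = np.sqrt(MSE / Sxx)
    se_b0 = np.sqrt(MSE * (1.0 / n + Xbar ** 2 / Sxx))
    return b0, b1, se_b0, se_b1, MSE
```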


The sample correlation is commonly referred to as Pearson's Product Moment Coefficient of Correlation. It is invariant to linear transformations of Y and X, and does not distinguish which is the dependent and which is the independent variable. This makes it a widely reported measure when researchers are interested in how two random variables vary together in a population. The population correlation coefficient is labeled ρ, and the sample correlation is labeled r, and is computed as:

r = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^n (X_i - \bar{X})^2 \sum_{i=1}^n (Y_i - \bar{Y})^2}} = \left(\frac{s_X}{s_Y}\right)\hat{\beta}_1

where s_X and s_Y are the standard deviations of X and Y, respectively. While the estimated slope can take on any value, r lies between −1 and +1, taking on the extreme values only if all of the points fall on a straight line. The test of whether ρ = 0 is mathematically equivalent to the t-test for testing whether β1 = 0. The 2-sided test is given below:

H_0: \rho = 0 \qquad H_A: \rho \neq 0 \qquad TS: t_{obs} = \frac{r}{\sqrt{\frac{1-r^2}{n-2}}} \qquad RR: |t_{obs}| \geq t_{\alpha/2, n-2} \qquad P\text{-value}: 2P(t_{n-2} \geq |t_{obs}|)

To construct a large-sample confidence interval, we use Fisher's z transform to make r approximately normal. We then construct a confidence interval for the transformed correlation, and "back transform" the end points:

z' = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \qquad (1-\alpha)100\% \text{ CI for } \frac{1}{2}\ln\left(\frac{1+\rho}{1-\rho}\right): \; z' \pm z_{\alpha/2}\sqrt{\frac{1}{n-3}}

Labeling the endpoints of the confidence interval as (a, b), we obtain:

(1-\alpha)100\% \text{ Confidence Interval for } \rho: \; \left(\frac{e^{2a}-1}{e^{2a}+1}, \; \frac{e^{2b}-1}{e^{2b}+1}\right)

Model Diagnostics

The inferences regarding the simple linear regression model (tests and confidence intervals) are based on the following assumptions:

• Relation between Y and X is linear
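A sketch of the correlation inference above, assuming numpy/scipy and 1-D arrays X and Y; the default α = 0.05 is an illustrative choice.

```python
# Sample correlation, t-test of H0: rho = 0, and Fisher-z confidence interval,
# following the formulas above.
import numpy as np
from scipy import stats

def corr_inference(X, Y, alpha=0.05):
    n = len(Y)
    r = np.corrcoef(X, Y)[0, 1]                       # sample correlation
    t_obs = r / np.sqrt((1 - r ** 2) / (n - 2))       # equivalent to the t-test for beta1 = 0
    p_value = 2 * stats.t.sf(abs(t_obs), df=n - 2)
    z = 0.5 * np.log((1 + r) / (1 - r))               # Fisher z transform
    half = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    a, b = z - half, z + half                         # CI on the transformed scale
    ci = ((np.exp(2 * a) - 1) / (np.exp(2 * a) + 1),  # back-transform the endpoints
          (np.exp(2 * b) - 1) / (np.exp(2 * b) + 1))
    return r, t_obs, p_value, ci
```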


• Errors are normally distributed
• Errors have constant variance
• Errors are independent

These assumptions can be checked graphically, as well as by statistical tests.

Checking Linearity

A plot of the residuals versus X should be a random cloud of points centered at 0 (they sum to 0). A "U-shaped" or "inverted U-shaped" pattern is inconsistent with linearity.

A test for linearity can be conducted when there are repeat observations at certain X-levels (methods have also been developed to "group" X values). Suppose we have c distinct X-levels, with n_j observations at the j-th level. The data need to be re-labeled as Y_ij, where j represents the X group and i represents the individual case within the group (i = 1, ..., n_j). We compute the following quantities:

\bar{Y}_j = \frac{\sum_{i=1}^{n_j} Y_{ij}}{n_j} \qquad \hat{Y}_j = \hat{\beta}_0 + \hat{\beta}_1 X_j

We then decompose the Error Sum of Squares into Pure Error and Lack of Fit:

\sum_{i=1}^n (Y_i - \hat{Y}_i)^2 = \sum_{j=1}^c \sum_{i=1}^{n_j} \left(Y_{ij} - \bar{Y}_j\right)^2 + \sum_{j=1}^c n_j\left(\bar{Y}_j - \hat{Y}_j\right)^2 \qquad SSE = SSPE + SSLF

We then partition the error degrees of freedom (n − 2) into Pure Error (n − c) and Lack of Fit (c − 2). This leads to an F-test for testing H_0: Relation is Linear versus H_A: Relation is not Linear:

TS: F_{obs} = \frac{SSLF/(c-2)}{SSPE/(n-c)} = \frac{MSLF}{MSPE} \qquad RR: F_{obs} \geq F_{\alpha, c-2, n-c} \qquad P\text{-value}: P(F_{c-2, n-c} \geq F_{obs})

If the relationship is not linear, we can add polynomial terms to allow for "bends" in the relationship between Y and X using multiple regression.
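A sketch of the lack-of-fit F-test above, assuming X and Y are numpy arrays with repeat X-values and that b0, b1 are the fitted simple-regression coefficients (e.g., from the simple_ols sketch earlier).

```python
# Decompose SSE into pure error (SSPE) and lack of fit (SSLF), then form the F-test.
import numpy as np
from scipy import stats

def lack_of_fit_test(X, Y, b0, b1):
    n = len(Y)
    levels = np.unique(X)                       # the c distinct X-levels
    c = len(levels)
    SSPE, SSLF = 0.0, 0.0
    for x in levels:
        grp = Y[X == x]                         # observations at this X-level
        SSPE += np.sum((grp - grp.mean()) ** 2)
        SSLF += len(grp) * (grp.mean() - (b0 + b1 * x)) ** 2
    F_obs = (SSLF / (c - 2)) / (SSPE / (n - c))
    p_value = stats.f.sf(F_obs, c - 2, n - c)
    return F_obs, p_value
```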


For the equal-variance test, the residuals are split into two groups based on the level of X; here e_ij denotes the j-th residual in group i, and ẽ_i denotes the median residual in group i. We compute:

d_{ij} = |e_{ij} - \tilde{e}_i| \quad i = 1, 2; \; j = 1, \ldots, n_i \qquad \bar{d}_i = \frac{\sum_{j=1}^{n_i} d_{ij}}{n_i} \qquad s_i^2 = \frac{\sum_{j=1}^{n_i}\left(d_{ij} - \bar{d}_i\right)^2}{n_i - 1} \qquad s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}

Then, a 2-sample t-test is conducted to test H_0: Equal Variances in the 2 groups:

TS: t_{obs} = \frac{\bar{d}_1 - \bar{d}_2}{\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}} \qquad RR: |t_{obs}| \geq t_{\alpha/2, n-2} \qquad P\text{-value} = 2P(t_{n-2} \geq |t_{obs}|)

Breusch-Pagan Test - Fits a regression of the squared residuals on X and tests whether the (natural) log of the variance is linearly related to X. When the regression of the squared residuals is fit, we obtain SSR_{e²}, its regression sum of squares. The test is conducted as follows, where SSE is the Error Sum of Squares for the regression of Y on X:

TS: X^2_{obs} = \frac{SSR_{e^2}/2}{(SSE/n)^2} \qquad RR: X^2_{obs} \geq \chi^2_{\alpha, 1} \qquad P\text{-value}: P(\chi^2_1 \geq X^2_{obs})

When the variance is not constant, we can transform Y (often the Box-Cox transformation can be used to obtain constant variance).

We can also use Estimated Weighted Least Squares by relating the standard deviation (or variance) of the errors to the mean. This is an iterative process, where the weights are re-computed at each iteration. The weights are the reciprocal of the estimated variance (as a function of the mean). Iteration continues until the regression coefficient estimates stabilize.

When the distribution of Y is from a known family (e.g. Binomial, Poisson, Gamma), we can fit a Generalized Linear Model.

Checking Independence

When the data are a time (or spatial) series, the errors can be correlated over time (or space), referred to as autocorrelated. A plot of residuals versus time should be random, not displaying a trending pattern (linear or cyclical). If it does show these patterns, autocorrelation may be present.

The Durbin-Watson test is used to test for serial autocorrelation in the errors, where the null hypothesis is that the errors are uncorrelated. Unfortunately, the formal test can end in one of 3 possible outcomes: reject the null hypothesis, fail to reject it, or remain inconclusive.
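Returning to the Breusch-Pagan test described above, here is a minimal sketch that implements exactly the statistic given in these notes (statistical packages may print slightly different versions). It assumes X is a 1-D numpy array and resid holds the residuals from the fit of Y on X.

```python
# Breusch-Pagan statistic: regress squared residuals on X, compare
# SSR_{e^2}/2 to (SSE/n)^2, and refer to a chi-square with 1 df.
import numpy as np
from scipy import stats

def breusch_pagan(X, resid):
    n = len(resid)
    e2 = resid ** 2
    D = np.column_stack([np.ones(n), X])            # intercept plus X
    coef, *_ = np.linalg.lstsq(D, e2, rcond=None)   # regression of e^2 on X
    fitted_e2 = D @ coef
    SSR_e2 = np.sum((fitted_e2 - e2.mean()) ** 2)   # regression SS for e^2
    SSE = np.sum(resid ** 2)                        # error SS from the fit of Y on X
    X2_obs = (SSR_e2 / 2) / (SSE / n) ** 2
    p_value = stats.chi2.sf(X2_obs, df=1)
    return X2_obs, p_value
```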


DFFITS - These measure how much an individual case's fitted value shifts when it is included in the regression fit versus when it is excluded. The shift is divided by its standard error, so we are measuring how many standard errors a fitted value shifts due to its being included in the regression model. Cases with DFFITS values greater than 2\sqrt{p'/n} in absolute value are considered influential on their own fitted values.

DFBETAS - One of these is computed for each case, for each regression coefficient (including the intercept). DFBETAS measures how much the estimated regression coefficient shifts when that case is included versus excluded from the model, in units of standard errors. Cases with DFBETAS values larger than 2/\sqrt{n} in absolute value are considered to be influential on the estimated regression coefficient.

Cook's D - A measure that represents each case's aggregate influence on all regression coefficients, and on all cases' fitted values. Cases with Cook's D larger than F_{.50, p', n-p'} are considered influential.

COVRATIO - This measures each case's influence on the estimated standard errors of the regression coefficients (inflating or deflating them). Cases with COVRATIO outside of 1 ± 3p'/n are considered influential.

Multiple Linear Regression

When there is more than one predictor variable, the model generalizes to multiple linear regression. The calculations become more complex, but conceptually the ideas remain the same. We will use the notation p for the number of predictors, and p' = p + 1 for the number of parameters in the model (including the intercept). The model can be written as:

Y = \beta_0 + \beta_1 X_1 + \cdots + \beta_p X_p + \varepsilon \qquad \varepsilon \sim N(0, \sigma^2) \text{ independent}

We then obtain least squares (and maximum likelihood) estimates of β0, β1, ..., βp that minimize the error sum of squares. The fitted values, residuals, and error sum of squares are obtained as:

\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{i1} + \cdots + \hat{\beta}_p X_{ip} \qquad e_i = Y_i - \hat{Y}_i \qquad SSE = \sum_{i=1}^n e_i^2

The degrees of freedom for error are now n − p' = n − (p + 1), as we have now estimated p' = p + 1 parameters.

In the multiple linear regression model, β_j represents the change in E{Y} when X_j increases by 1 unit, with all other predictor variables held constant. It is thus often referred to as the partial regression coefficient.
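A sketch of a multiple regression fit with plain numpy, following the notation above; Xmat is assumed to be an n × p array of predictor values and Y an n-vector. The standard errors and t-statistics computed here feed the partial-coefficient tests described in the next section.

```python
# Least squares fit with p predictors: estimates, residuals, SSE, MSE,
# and the estimated variance-covariance matrix of the coefficients.
import numpy as np
from scipy import stats

def multiple_ols(Xmat, Y):
    n, p = Xmat.shape
    D = np.column_stack([np.ones(n), Xmat])          # design matrix with intercept
    beta_hat = np.linalg.solve(D.T @ D, D.T @ Y)     # least squares estimates
    resid = Y - D @ beta_hat
    SSE = np.sum(resid ** 2)
    MSE = SSE / (n - (p + 1))                        # error df = n - p'
    cov_beta = MSE * np.linalg.inv(D.T @ D)          # estimated var-cov of coefficients
    se = np.sqrt(np.diag(cov_beta))
    t_obs = beta_hat / se                            # t-statistics for H0: beta_j = 0
    p_values = 2 * stats.t.sf(np.abs(t_obs), df=n - (p + 1))
    return beta_hat, se, t_obs, p_values, SSE, MSE
```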


Testing and Estimation for Partial Regression Coefficients

Once we fit the model, obtaining the estimated regression coefficients, we also obtain standard errors for each coefficient (actually, we obtain an estimated variance-covariance matrix for the coefficients).

If we wish to test whether Y is associated with X_j, after controlling for the remaining p − 1 predictors, we are testing whether β_j = 0. This is equivalent to the t-test from simple regression (in general, we can test whether a regression coefficient is any specific number, although software packages test whether it is 0 by default):

H_0: \beta_j = \beta_{j0} \qquad H_A: \beta_j \neq \beta_{j0} \qquad TS: t_{obs} = \frac{\hat{\beta}_j - \beta_{j0}}{SE\{\hat{\beta}_j\}} \qquad RR: |t_{obs}| \geq t_{\alpha/2, n-p'} \qquad P\text{-value}: 2P(t_{n-p'} \geq |t_{obs}|)

One-sided tests make the same adjustments as in simple linear regression:

H_A^+: \beta_j > \beta_{j0} \qquad RR: t_{obs} \geq t_{\alpha, n-p'} \qquad P\text{-value}: P(t_{n-p'} \geq t_{obs})

H_A^-: \beta_j < \beta_{j0} \qquad RR: t_{obs} \leq -t_{\alpha, n-p'} \qquad P\text{-value}: P(t_{n-p'} \leq t_{obs})

A (1 − α)100% confidence interval for β_j is obtained as:

\hat{\beta}_j \pm t_{\alpha/2, n-p'} SE\{\hat{\beta}_j\}

Note that the confidence interval represents the values of β_{j0} for which the two-sided test H_0: β_j = β_{j0} versus H_A: β_j ≠ β_{j0} fails to reject the null hypothesis.

Analysis of Variance

When there is no association between Y and X_1, ..., X_p (β_1 = · · · = β_p = 0), the best predictor of each observation is Ȳ = β̂_0 (in terms of minimizing the sum of squares of prediction errors). In this case, the total variation can be denoted as TSS = Σ_{i=1}^n (Y_i − Ȳ)², the Total Sum of Squares, just as with simple regression.


When there is an association between Y and at least one of X_1, ..., X_p (not all β_j = 0), the best predictor of each observation is Ŷ_i = β̂_0 + β̂_1 X_{i1} + · · · + β̂_p X_{ip} (in terms of minimizing the sum of squares of prediction errors). In this case, the error variation can be denoted as SSE = Σ_{i=1}^n (Y_i − Ŷ_i)², the Error Sum of Squares.

The difference between TSS and SSE is the variation "explained" by the regression of Y on X_1, ..., X_p (as opposed to having ignored X_1, ..., X_p). It represents the difference between the fitted values and the mean: SSR = Σ_{i=1}^n (Ŷ_i − Ȳ)², the Regression Sum of Squares.

TSS = SSE + SSR \qquad \sum_{i=1}^n (Y_i - \bar{Y})^2 = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2 + \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2

Each sum of squares has degrees of freedom associated with it. The Total Degrees of Freedom is df_Total = n − 1. The Error Degrees of Freedom is df_Error = n − p'. The Regression Degrees of Freedom is df_Regression = p. Note that when we have p = 1 predictor, this reduces to simple regression.

df_{Total} = df_{Error} + df_{Regression} \qquad n - 1 = (n - p') + p

The Error and Regression sums of squares each have a Mean Square, which is the sum of squares divided by its corresponding degrees of freedom: MSE = SSE/(n − p') and MSR = SSR/p. It can be shown that these mean squares have the following Expected Values (average values in repeated sampling at the same observed X levels):

E\{MSE\} = \sigma^2 \qquad E\{MSR\} \geq \sigma^2

Note that when β_1 = · · · = β_p = 0, then E{MSR} = E{MSE}; otherwise E{MSR} > E{MSE}. A way of testing whether β_1 = · · · = β_p = 0 is the F-test:

H_0: \beta_1 = \cdots = \beta_p = 0 \qquad H_A: \text{Not all } \beta_j = 0 \qquad TS: F_{obs} = \frac{MSR}{MSE} \qquad RR: F_{obs} \geq F_{\alpha, p, n-p'} \qquad P\text{-value}: P(F_{p, n-p'} \geq F_{obs})

The Analysis of Variance is typically set up in a table as in Table 2.

A measure often reported from a regression analysis is the Coefficient of Determination, or R². This represents the variation in Y "explained" by X_1, ..., X_p, divided by the total variation in Y:

R^2 = \frac{\sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2}{\sum_{i=1}^n (Y_i - \bar{Y})^2} = \frac{SSR}{TSS} = 1 - \frac{SSE}{TSS} \qquad 0 \leq R^2 \leq 1
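A sketch of the ANOVA decomposition, the overall F-test, and R² above, assuming Y and the fitted values come from a fit with p predictors (e.g., the multiple_ols sketch earlier).

```python
# TSS = SSE + SSR decomposition, F-test of H0: beta_1 = ... = beta_p = 0, and R^2.
import numpy as np
from scipy import stats

def anova_summary(Y, fitted, p):
    n = len(Y)
    TSS = np.sum((Y - Y.mean()) ** 2)        # total sum of squares
    SSE = np.sum((Y - fitted) ** 2)          # error sum of squares
    SSR = TSS - SSE                          # regression sum of squares
    MSR, MSE = SSR / p, SSE / (n - p - 1)
    F_obs = MSR / MSE
    p_value = stats.f.sf(F_obs, p, n - p - 1)
    R2 = SSR / TSS                           # coefficient of determination
    return F_obs, p_value, R2
```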


For a qualitative predictor with m levels, one level is not assigned one of the m − 1 dummy variables, making it the reference group. The βs for the other groups (levels of the qualitative variable) reflect the difference in the mean for that group relative to the reference group, controlling for all other predictors.

Note that if the qualitative variable has 2 levels, there will be a single dummy variable, and we can test for differences in the effects of the 2 levels with a t-test, controlling for all other predictors. If there are m − 1 ≥ 2 dummy variables, we can use the F-test to test whether all m − 1 βs are 0, controlling for all other predictors.

Models With Interaction Terms

When the effect of one predictor depends on the level of another predictor (and vice versa), the predictors are said to interact. The way we can model interaction(s) is to create a new variable that is the product of the 2 predictors. Suppose we have Y and 2 numeric predictors, X_1 and X_2. We create a new predictor X_3 = X_1 X_2. Now, consider the model:

E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_1 X_2 = \beta_0 + \beta_2 X_2 + (\beta_1 + \beta_3 X_2) X_1

Thus, the slope with respect to X_1 depends on the level of X_2, unless β_3 = 0, which we can test with a t-test. This logic extends to qualitative variables as well. We create cross-product terms between numeric (or other categorical) predictors and the m − 1 dummy variables representing the qualitative predictor. Then a t-test (m − 1 = 1) or F-test (m − 1 ≥ 2) can be conducted to test for interactions among predictors.

Models With Curvature

When a plot of Y versus one or more of the predictors displays curvature, we can include polynomial terms to "bend" the regression line. Often, to avoid multicollinearity, we work with centered predictor(s), obtained by subtracting off their mean(s). If the data show k bends, we will include polynomial terms up to order k + 1. Suppose we have a single predictor variable, with 2 "bends" appearing in a scatterplot. Then we will include terms up to a third-order term. Note that even if lower-order terms are not significant, when a higher-order term is significant we keep the lower-order terms (unless there is some physical reason not to). We can now fit the model:

E\{Y\} = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3

If we wish to test whether the fit is linear, as opposed to "not linear," we could test H_0: β_2 = β_3 = 0.
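One way to carry out the test of H_0: β_2 = β_3 = 0 is the general F-test comparing the complete (cubic) and reduced (linear) models. The notes do not show this computation explicitly, so the sketch below is an assumed implementation in numpy, with the predictor centered as suggested above.

```python
# Nested-model F-test for H0: beta_2 = beta_3 = 0 (straight-line vs. cubic fit).
import numpy as np
from scipy import stats

def sse_of_fit(D, Y):
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    return np.sum((Y - D @ coef) ** 2)

def test_curvature(X, Y):
    n = len(Y)
    Xc = X - X.mean()                                        # centered predictor
    full = np.column_stack([np.ones(n), Xc, Xc**2, Xc**3])   # cubic model (p' = 4)
    reduced = np.column_stack([np.ones(n), Xc])              # straight-line model
    SSE_full, SSE_red = sse_of_fit(full, Y), sse_of_fit(reduced, Y)
    q = 2                                                    # number of coefficients tested
    F_obs = ((SSE_red - SSE_full) / q) / (SSE_full / (n - 4))
    p_value = stats.f.sf(F_obs, q, n - 4)
    return F_obs, p_value
```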


Response surfaces are often fit when we have 2 or more predictors, and include "linear effects," "quadratic effects," and "interaction effects." In the case of 3 predictors, a full model would be of the form:

E\{Y\} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \beta_{11} X_1^2 + \beta_{22} X_2^2 + \beta_{33} X_3^2 + \beta_{12} X_1 X_2 + \beta_{13} X_1 X_3 + \beta_{23} X_2 X_3

We typically wish to simplify the model, to make it more parsimonious, when possible.

Model Building

When we have many predictors, we may wish to use an algorithm to determine which variables to include in the model. These variables can be main effects, interactions, and polynomial terms. There are two common approaches. One method involves testing variables based on t-tests, or equivalently F-tests, for partial regression coefficients. An alternative method involves comparing models based on model-based measures, such as the Akaike Information Criterion (AIC) or the Schwarz Bayesian Information Criterion (BIC or SBC). These measures can be written as follows (note that different software packages print different versions, as some parts are constant across all potential models). The goal is to minimize the measures:

AIC(\text{Model}) = n\ln(SSE(\text{Model})) + 2p' - n\ln(n) \qquad BIC(\text{Model}) = n\ln(SSE(\text{Model})) + [\ln(n)]p' - n\ln(n)

Note that SSE(Model) depends on the variables included in the current model. The measures put a penalty on excess predictor variables, with BIC placing a higher penalty when ln(n) > 2. Note that p' is the number of parameters in the model (including the intercept), and n is the sample size.

Backward Elimination

This is a "top-down" method, which begins with a "Complete" Model containing all potential predictors. The analyst then chooses a significance level to stay in the model (SLS). The model is fit, and the predictor with the lowest t-statistic in absolute value (largest P-value) is identified. If that P-value is larger than SLS, the variable is dropped from the model. Then the model is re-fit with all other predictors (this will change all regression coefficients, standard errors, and P-values). The process continues until all remaining variables have P-values below SLS.

The model-based approach fits the full model, with all predictors, and computes AIC (or BIC). Then each variable is dropped one at a time, and AIC (or BIC) is obtained for each model. If none of the models with one dropped variable has AIC (or BIC) below that for the full model, the full model is kept; otherwise the model with the lowest AIC (or BIC) is kept as the new full model. The process continues until no variables should be dropped (none of the "drop one variable" models has a lower AIC (or BIC) than the current "full model").
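A minimal sketch of AIC-based backward elimination as described above, assuming Xmat is an n × p numpy array and names is a list of p predictor labels; the AIC formula is the one given in these notes.

```python
# Drop one variable at a time as long as doing so lowers AIC.
import numpy as np

def aic_of(Xcols, Y):
    n = len(Y)
    D = np.column_stack([np.ones(n)] + Xcols)           # intercept plus chosen columns
    coef, *_ = np.linalg.lstsq(D, Y, rcond=None)
    SSE = np.sum((Y - D @ coef) ** 2)
    return n * np.log(SSE) + 2 * D.shape[1] - n * np.log(n)

def backward_elimination(Xmat, Y, names):
    keep = list(range(Xmat.shape[1]))                    # start from the complete model
    best = aic_of([Xmat[:, j] for j in keep], Y)
    while keep:
        # AIC for each model with one of the current variables dropped
        trials = [(aic_of([Xmat[:, j] for j in keep if j != d], Y), d) for d in keep]
        cand, drop = min(trials)
        if cand >= best:                                 # no drop lowers AIC: stop
            break
        keep.remove(drop)
        best = cand
    return [names[j] for j in keep], best
```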


Forward Selection

This is a "bottom-up" method, which begins with all "Simple" Models, each with one predictor. The analyst then chooses a significance level to enter into the model (SLE). Each model is fit, and the predictor with the highest t-statistic in absolute value (smallest P-value) is identified. If that P-value is smaller than SLE, the variable is entered into the model. Then all two-variable models containing the best predictor from the first round, each paired with one of the other predictors, are fit. The best second variable is identified, and its P-value is compared with SLE. If its P-value is below SLE, the variable is added to the model. The process continues until no potential added variable has a P-value below SLE.

The model-based approach fits each simple model, with one predictor, and computes AIC (or BIC). The best variable is identified (assuming its AIC (or BIC) is smaller than that for the null model, with no predictors). Then each potential variable is added one at a time, and AIC (or BIC) is obtained for each model. If none of the models with one added variable has AIC (or BIC) below that for the best simple model, the simple model is kept; otherwise the model with the lowest AIC (or BIC) is kept as the new base model. The process continues until no variables should be added (none of the "add one variable" models has a lower AIC (or BIC) than the current model).

Stepwise Regression

This approach is a hybrid of forward selection and backward elimination. It begins like forward selection, but then applies backward elimination at each step. In forward selection, once a variable is entered, it stays in the model. In stepwise regression, once a new variable is entered, all previously entered variables are tested to confirm they should stay in the model, after controlling for the new entrant as well as the other previous entrants.

All Possible Regressions

We can fit all possible regression models, and use model-based measures to choose the "best" model. Commonly used measures are: Adjusted-R² (equivalently MSE), Mallows' C_p statistic, AIC, and BIC. The formulas and decision criteria are given below (where p' is the number of parameters in the "current" model being fit):

Adjusted-R²: \; 1 - \left(\frac{n-1}{n-p'}\right)\frac{SSE}{TSS} - Goal is to maximize

Mallows' C_p: \; C_p = \frac{SSE(\text{Model})}{MSE(\text{Complete})} + 2p' - n - Goal is to have C_p \leq p'

AIC: \; AIC(\text{Model}) = n\ln(SSE(\text{Model})) + 2p' - n\ln(n) - Goal is to minimize

BIC: \; BIC(\text{Model}) = n\ln(SSE(\text{Model})) + [\ln(n)]p' - n\ln(n) - Goal is to minimize
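A sketch computing the four criteria above for one candidate model, directly from the formulas in these notes; the argument names (SSE_model, TSS, MSE_complete, n, p_prime) are illustrative.

```python
# Adjusted R^2, Mallows' Cp, AIC, and BIC for a candidate model.
import numpy as np

def selection_criteria(SSE_model, TSS, MSE_complete, n, p_prime):
    adj_R2 = 1 - (n - 1) / (n - p_prime) * SSE_model / TSS             # maximize
    Cp = SSE_model / MSE_complete + 2 * p_prime - n                    # want Cp <= p'
    AIC = n * np.log(SSE_model) + 2 * p_prime - n * np.log(n)          # minimize
    BIC = n * np.log(SSE_model) + np.log(n) * p_prime - n * np.log(n)  # minimize
    return adj_R2, Cp, AIC, BIC
```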
