28.07.2014 Views

Linear Regression

In this tutorial we will explore fitting linear regression models using STATA. We will also cover ways of re-expressing variables in a data set if the conditions for linear regression aren’t satisfied.

We will be working with the data set discussed in examples 9.43-44 on page 210 of the textbook. The data set consists of three variables: waist (waist size in inches), weight (weight in pounds) and fat (body fat in %), measured on 20 male subjects. To access the data, type:

use http://www.stat.columbia.edu/~martin/W1111/Data/Body_fat

in the command window.
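Before plotting, it can help to confirm that the data loaded as expected. The sketch below uses standard STATA commands and assumes the variable names waist, weight and fat described above.

```stata
* Show variable names, storage types and labels
describe

* Summary statistics (count, mean, sd, min, max) for the three variables
summarize waist weight fat

* Display the first five observations
list in 1/5
```

If the file matches the description above and has no missing values, summarize should report 20 observations for each variable.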

To create a scatter plot of fat against waist, type:

scatter fat waist

This gives rise to the following plot:

[Figure: scatter plot of fat (vertical axis, 0 to 40) against waist (horizontal axis, 30 to 45)]

Studying the plot, the association between the variables appears to be strong, linear and positive. Because the scatter plot indicates a linear relationship between the variables, we decide to find the least-squares regression line.

We do this by typing the command:

regress fat waist

In this notation the first variable, fat, is the response variable and the second variable, waist, is the explanatory variable.
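For reference, the least-squares line is the line that makes the sum of squared vertical distances between the observed and predicted responses as small as possible. In the usual textbook notation, with x = waist and y = fat (the symbols b_0, b_1, r, s_x and s_y are introduced here for illustration and do not appear under these names in the STATA output):

```latex
\hat{y} = b_0 + b_1 x, \qquad
(b_0, b_1) = \arg\min_{a,\,b} \sum_{i=1}^{n} \bigl(y_i - a - b\,x_i\bigr)^2,
\qquad
b_1 = r\,\frac{s_y}{s_x}, \qquad
b_0 = \bar{y} - b_1\,\bar{x}
```

Here r is the correlation between x and y, and s_x and s_y are their sample standard deviations.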


This command gives rise to the following output in the results window:

The output indicates that the least-squares regression line is given by:

$\widehat{\text{fat}} = -62.55 + 2.22 \times \text{waist}$

This implies that for each additional inch in waist size, the model predicts an increase of 2.22 percentage points in body fat. The fraction of the variability in fat that is explained by the least-squares line of fat on waist is equal to 0.7865.
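After regress, STATA keeps the estimates in memory, so they can be reused in further calculations. The sketch below assumes it is run immediately after regress fat waist; the waist size of 38 inches is an arbitrary illustrative value, not one taken from the textbook.

```stata
* Slope and intercept of the fitted line
display _b[waist]
display _b[_cons]

* R-squared of the most recent regression
display e(r2)

* Predicted body fat (%) for a hypothetical 38-inch waist
display _b[_cons] + _b[waist]*38
```

With the rounded coefficients reported above, the last line corresponds to -62.55 + 2.22 × 38, or about 21.8% body fat. STATA's value may differ slightly because it uses the unrounded coefficients.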

Next, we want to calculate the predicted values from the regression. We can do this by typing:

predict yhat, xb

This command only creates a new variable, yhat, so there will be no output in the results window. However, if you look in the variables window, a new variable yhat is now present. To plot the regression line together with the data, type:

scatter fat waist || line yhat waist

The vertical bar (|) is typed by holding the shift key and pressing the backslash (\) key, which on a standard US keyboard is located directly above the enter key. To obtain the two vertical bars used in the command, type it twice.


The command above tells STATA to create a scatterplot of fat against waist and superimpose the line given by yhat created in the previous command. This command gives the following plot:

[Figure: scatter plot of fat against waist with the fitted regression line superimposed (legend: fat, Linear prediction)]
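As an alternative that does not require creating yhat first, the twoway lfit plot type fits and draws the least-squares line directly. This is a sketch of another way to get an equivalent picture, not the approach used in the rest of this tutorial:

```stata
* Scatter plot of fat against waist with the least-squares line overlaid
twoway (scatter fat waist) (lfit fat waist)
```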

The line appears to fit the data well. However, it is important to make residual plots when performing regression. We can calculate the residuals by typing the command:

predict r, resid

Again, note that other than creating a new variable, r, there will be no additional output. The new variable consists of the set of residuals, and a residual plot can be created by typing:

scatter r waist

This gives rise to the following plot:

[Figure: residual plot of Residuals (vertical axis, -10 to 10) against waist (horizontal axis, 30 to 45)]

The residual plot shows no apparent pattern. The residual plot and the relatively high value of $R^2$ indicate that the linear model we fit is appropriate.
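A horizontal reference line at zero often makes a residual plot easier to read. The sketch below assumes the variable r created above; rvfplot is a built-in post-estimation command, so it assumes regress fat waist is still the most recent estimation.

```stata
* Residuals against the explanatory variable, with a reference line at 0
scatter r waist, yline(0)

* Built-in residual-versus-fitted plot for the most recent regression
rvfplot, yline(0)

* Least-squares residuals average to zero (up to rounding error)
summarize r
```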


Re-expressing Data

Often the conditions necessary for performing linear regression aren’t satisfied in a data set. However, it may still be possible to use these methods if we re-express one or both of the variables.

To re-express data we need to be able to create new variables using STATA. We can do this using the generate command. For example, to create a new variable named logx that is the logarithm of an existing variable x, we type:

generate logx = log(x)

If we instead wanted to create a variable that is the square root of x, we could type:

generate sqx = sqrt(x)

In general, the command has the form:

generate new_variable = expression(old_variable)

where expression is the mathematical function applied to the old variable.

Note that STATA's log() function is the natural logarithm (base e).
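A few other common re-expressions follow the same pattern. As in the text, x stands for an existing variable; the new variable names are placeholders chosen for this sketch.

```stata
* Natural logarithm; ln() and log() are synonyms in STATA
generate lnx = ln(x)

* Base-10 logarithm
generate log10x = log10(x)

* Reciprocal and square
generate invx = 1/x
generate x2 = x^2
```

Where a function is undefined (for example, the logarithm of zero or of a negative number), STATA stores a missing value, so it is worth checking each new variable with summarize.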

Linear regression using re-expressed data

In this portion of the tutorial we will be working with the data set discussed in example 10.11 on page 256 of the textbook. The data set gives information on the highest paid baseball players in the period spanning 1980-2001. The data set consists of three variables: player, year and salary. To access the data, type:

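use http://www.stat.columbia.edu/~martin/W1111/Data/salary

in the command window. If the body fat data from the first part of the tutorial are still in memory, STATA will refuse to load a new data set over the modified one; in that case add the clear option by typing , clear at the end of the command.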

We begin by making a scatter plot of salary against year:

scatter salary year

This gives rise to the following plot:

[Figure: scatter plot of salary (vertical axis, 0 to 25) against year (horizontal axis, 1980 to 2000)]

The relationship between year and highest salary is moderately strong, positive and curved. Since the scatter plot shows a curved relationship, a linear model is not appropriate. However, it appears that taking the logarithm of salary may help straighten the plot. We can generate a new variable named logsalary, which is the logarithm of the variable salary, by typing:

generate logsalary = log(salary)

We can make a scatter plot of this new variable against year by typing:

scatter logsalary year

This gives rise to the following plot:

[Figure: scatter plot of logsalary (vertical axis, 0 to 3) against year (horizontal axis, 1980 to 2000)]

It appears that the transformation has substantially straightened the scatter plot. We can now proceed with fitting a linear regression model to the transformed data by typing:

regress logsalary year

Note that the response variable is now logsalary instead of salary.
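One way to see why the logarithm can straighten this kind of curve: if salary grows by a roughly constant percentage each year, it follows an exponential curve, and taking logs turns that curve into a line. The constants a and b below are generic symbols introduced for this illustration, not values from the output:

```latex
\text{salary} = a\,e^{\,b \cdot \text{year}}
\quad\Longrightarrow\quad
\log(\text{salary}) = \log(a) + b \cdot \text{year}
```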


This gives rise to the following output:

The output indicates that the least-squares regression line is given by:

$\widehat{\log(\text{salary})} = -261.28 + 0.13 \times \text{year}$

The fraction of the variability in log(salary) that is explained by the least-squares line of log(salary) on year is equal to 0.9622.
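Because the response is on the natural-log scale, the slope is easiest to interpret after exponentiating: each additional year multiplies the predicted salary by exp(slope). A minimal sketch, assuming it is run right after regress logsalary year:

```stata
* Multiplicative change in predicted salary per additional year
display exp(_b[year])
```

With the rounded slope of 0.13 this factor is exp(0.13), approximately 1.14, which suggests that the predicted top salary grows by roughly 14% per year. The exact figure depends on the unrounded coefficient.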

Next, we want to calculate the predicted values from our regression. We can do this by typing:

predict yhat, xb

Note that other than creating a new variable, yhat, there will be no additional output. To plot the regression line together with the data, type:

scatter logsalary year || line yhat year

The command above tells STATA to create a scatterplot of logsalary against year and to superimpose the line given by yhat. This command gives the following plot:

[Figure: scatter plot of logsalary against year with the fitted regression line superimposed (legend: logsalary, Linear prediction)]


The line appears to fit the data well. However, we always want to make sure to check the residual plots. We can calculate the residuals by typing the command:

predict r, resid

Again, note that other than creating a new variable named r, there will be no additional output. We can use this new variable to create a residual plot by typing:

scatter r year

This gives rise to the following plot:

[Figure: residual plot of Residuals (vertical axis, -0.4 to 0.4) against year (horizontal axis, 1980 to 2000)]

The residual plot shows no apparent pattern.
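To see the fit on the original salary scale, the predictions can be transformed back with exp(), since logsalary was created with the natural logarithm. This sketch assumes yhat comes from the regression of logsalary on year; the name salaryhat is chosen here for illustration.

```stata
* Back-transform the fitted values from log(salary) to salary
generate salaryhat = exp(yhat)

* Original data with the back-transformed (curved) fit overlaid
scatter salary year || line salaryhat year, sort
```

The sort option draws the curve in order of year, which matters here because the back-transformed fit is no longer a straight line.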

Homework:

Do problems RII.8 and 10.9 from the textbook.

Solve both of these problems using STATA. For each question, make sure to hand in

(a) your log file,

(b) a scatter plot with a regression line superimposed,

(c) a residual plot, and

(d) answers to all the questions in the text.
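A log file records everything that appears in the results window, and graphs can be saved to disk for handing in. The sketch below uses standard STATA commands; the file names homework.log and scatter_fit.png are placeholders chosen for this example.

```stata
* Start recording the session as a plain-text log
log using homework.log, text replace

* ... run the analysis for the problem here ...

* Save the graph currently displayed (for example, the fitted-line plot)
graph export scatter_fit.png, replace

* Stop recording
log close
```

Note that graph export only works once a graph has been drawn, so it should follow the plotting command whose output you want to keep.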
