Linear Regression
Linear Regression
Linear Regression
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>Linear</strong> <strong>Regression</strong><br />
In this tutorial we will explore fitting linear regression models using STATA. We<br />
will also cover ways of re-expressing variables in a data set if the conditions for<br />
linear regression aren’t satisfied.<br />
We will be working with the data set discussed in examples 9.43-44 on page 210<br />
of the textbook. The data set consists of three variables waist (waist size in<br />
inches), weight (weight in pounds) and fat (body fat in %) measured on 20 male<br />
subjects. To access the data type:<br />
use http://www.stat.columbia.edu/~martin/W1111/Data/Body_fat<br />
in the command window.<br />
To create a scatter plot for the variables fat and waist type:<br />
scatter fat waist<br />
This gives rise to the following plot:<br />
fat<br />
0 10 20 30 40<br />
30 35 40 45<br />
waist<br />
Studying the plot, the association between the variables appears to be strong,<br />
linear and positive. As the scatter plot indicates a linear relationship between the<br />
variables we decide to find the least-squares regression line.<br />
We do this by typing the command:<br />
regress fat waist<br />
In this notation the first variable, fat, is the response variable and the second<br />
variable, waist, is the explanatory variable.
This command gives rise to the following output in the results window:<br />
The output indicates that the least-square regression line is given by.<br />
fat<br />
ˆ = −62.55<br />
+ 2. 22waist<br />
This implies that for each additional inch in waist size, the model predicts an<br />
increase of 2.22% body fat. The fraction of the variability in fat that is explained<br />
by the least squares line of fat on waist is equal to 0.7865.<br />
Next, we want to calculate the predicted values from the regression. We can do<br />
this by typing:<br />
predict yhat, xb<br />
This command is solely used to create a new variable, yhat, and there will be no<br />
output in the results window. However, if you look in the variables window a new<br />
variable yhat is now present. To plot the regression line together with the data<br />
type:<br />
scatter fat waist || line yhat waist<br />
A vertical line can be obtained by simultaneously pressing the shift and the<br />
backslash (\) button on your keyboard. This button is located directly above the<br />
enter key. To obtain two vertical lines, repeat this procedure twice.
The command above tells STATA to create a scatterplot of fat against waist and<br />
superimpose the line given by yhat created in the previous command. This<br />
command gives the following plot:<br />
fat/<strong>Linear</strong> prediction<br />
0 10 20 30 40<br />
30 35 40 45<br />
waist<br />
fat<br />
<strong>Linear</strong> prediction<br />
The line appears to fit the data well. However, it is important to make residual<br />
plots when performing regression. We can calculate the residuals by typing the<br />
command:<br />
predict r, resid<br />
Again, note that other than creating a new variable, r, there will be no additional<br />
output. The new variable consists of the set of residuals, and a residual plot can<br />
be created by typing:<br />
scatter r waist<br />
This gives rise to the following plot:<br />
Residuals<br />
-10 -5 0 5 10<br />
30 35 40 45<br />
waist<br />
The residual plot shows no apparent pattern. The residual plot and the relatively<br />
2<br />
high value of R indicate that the linear model we fit is appropriate.
Re-expressing Data<br />
Often the conditions necessary for performing linear regression aren’t satisfied in<br />
a data set. However, it may still be possible to use these methods if we reexpress<br />
one or both of the variables.<br />
To re-express data we need be able to create new variables using STATA. We<br />
can do this using the generate command. For example to create a new variable<br />
named logx which is the logarithm of an already existing variable x, we type:<br />
generate logx = log(x)<br />
If we instead wanted to create a variable that is the square root of x, we could<br />
type<br />
generate sqx = sqrt(x)<br />
In general, the command is on the format:<br />
generate new_variable = expression(old_variable)<br />
where expression is the mathematical function applied to the old variable.<br />
Note that by default STATA uses log base e.<br />
<strong>Linear</strong> regression using re-expressed data<br />
In this portion of the tutorial we will be working with the data set discussed in<br />
example 10.11 on page 256 of the textbook. The data set gives information on<br />
the highest paid baseball players in the period spanning 1980-2001. The data set<br />
consists of 3 variables player, year and salary. To access the data type:<br />
use http://www.stat.columbia.edu/~martin/W1111/Data/salary<br />
in the command window.<br />
We begin by making a scatter plot of salary and year.<br />
scatter salary year
This gives rise to the following plot:<br />
salary<br />
0 5 10 15 20 25<br />
1980 1985 1990 1995 2000<br />
year<br />
The relationship between year and highest salary is moderately strong, positive<br />
and curved. Since the scatter plot shows a curved relationship, a linear model is<br />
not appropriate. However, it appears that taking the logarithm of salary may help<br />
straighten the plot. We can generate a new variable named logsalary, which is<br />
the logarithm of the variable salary, by typing:<br />
generate logsalary = log(salary)<br />
We can make a scatter plot of this new variable against year by typing<br />
scatter logsalary year<br />
This gives rise to the following plot:<br />
logsalary<br />
0 1 2 3<br />
1980 1985 1990 1995 2000<br />
year<br />
It appears that the transformation has significantly straightened the scatter plot.<br />
We can now proceed with fitting a linear regression model to the transformed<br />
data by typing:<br />
regress logsalary year<br />
Note that now the response variable is logsalary instead of salary.
This gives rise to the following output:<br />
The output indicates that the least-square regression line is given by.<br />
log( salary<br />
ˆ ) = −261.28<br />
+ 0. 13year<br />
The fraction of the variability in log(salary) that is explained by the least squares<br />
line of log(salary) on year is equal to 0.9622.<br />
Next, we want to calculate the predicted values from our regression. We can do<br />
this by typing:<br />
predict yhat, xb<br />
Note that other than creating a new variable, yhat, there will be no additional<br />
output. To plot the regression line together with the data type:<br />
scatter logsalary year || line yhat year<br />
The command above tells STATA to create a scatterplot of logsalary against year<br />
and to superimpose the line given by yhat. This command gives the following<br />
output<br />
logsalary/<strong>Linear</strong> prediction<br />
0 1 2 3<br />
1980 1985 1990 1995 2000<br />
year<br />
logsalary<br />
<strong>Linear</strong> prediction
The line appears to fit the data well. However, we always want to make sure to<br />
check the residual plots. We can calculate the residuals by typing the command:<br />
predict r, resid<br />
Again, note that other than creating a new variable named r there will be no<br />
additional output. We can use this new variable to create a residual plot by<br />
typing:<br />
scatter r year<br />
This gives rise to the following output:<br />
Residuals<br />
-.4 -.2 -5.55e-17 .2 .4<br />
1980 1985 1990 1995 2000<br />
year<br />
The residual plot shows no apparent pattern.<br />
Homework:<br />
Do problems RII.8 and 10.9 from the textbook.<br />
Solve both of these problems using STATA. For each questions make sure to<br />
hand in<br />
(a) your log file,<br />
(b) a scatter plot with a regression line superimposed,<br />
(c) a residual plot, and<br />
(d) answers to all the questions in the text.