
Homework Assignment 5 Warm-up Problems - UCLA Biostatistics

Biostatistics 100B Homework 5
February 12th, 2007

Homework Assignment 5
Due Date: Wednesday, February 21st

Note: There are 6 problems on this assignment. The first 4 problems should provide you with basic practice on the material. They are relevant for the exams, but do NOT need to be turned in. You must turn in problems 5 and 6 with the corresponding STATA printouts to receive full credit. The assignment is due Wednesday, February 21st, in class. Note the different turn-in day due to the President’s Day holiday!

Note: Output from any calculations done in STATA MUST be included with your assignment for full credit. If I do not specify which way to do a problem you may choose whether to do it by hand or in STATA. All the STATA commands needed to complete this homework are given at the end of the assignment along with a template for a suggested STATA session to be done during the corresponding lab. You do not need to turn in a separate lab report; simply turn in the relevant STATA output from the lab as part of your homework.

Note: You are encouraged to work with fellow students, as necessary, on these problems. However, each of you MUST write up your solution ON YOUR OWN and IN YOUR OWN WORDS. The style of your write-up is as important as getting the correct answer. Your solutions should be easy to follow, and contain English explanations of what you are doing and why. You do not have to write an essay for each problem, but you should give enough comments so that someone who has not seen the problem statement can understand your work. In the real world you will not be answering predetermined questions from a textbook for someone who already knows the answer! You do not have to type your assignments. However, if they are too sloppy to read, too hard to understand, or give just numbers with no comments, you WILL lose points. Problems labeled (WBCH) are based on the text Statistics for Management and Economics by Watson, Billingsley, Crofts and Huntsberger, which I have used in previous courses.

Warm-up Problems

(1) Interpreting A Multiple Regression Equation: (WBCH 13.1) A regression equation was found to be Ŷ = 20 + 14X1 − 7X2. Which of the following statements are correct? You should explain your reasoning in each case. If the statement is not correct, say how you would correct it or how you could check if the statement were true.

(a) A one unit increase in X1 causes Y to increase by fourteen units.

(b) Variable Y is more highly correlated with X1 than X2 since the coefficient on X1 is positive.

(c) If the value of X2 is large enough, one can obtain negative predictions of Y.

(2) Special Issues In Multiple Regression:

(a) Briefly explain what is meant by the term overfitting, why it can be a problem in multiple regression, and how you can check for it or avoid it.

(b) Briefly explain what is meant by the term multicollinearity and give an example in which it might occur.

(3) Condo Prices (WBCH 13.27): A real estate broker with Berg Land and Development engaged the services of a consultant to develop a multiple regression model to predict the sales price of condominiums, Y, in thousands of dollars from the area of the floor space, X1, in hundreds of square feet and whether or not the condominium had access to a swimming pool, X2. (X2 = 0 with no pool access and X2 = 1 with pool access.) Data for n=20 condos are shown below.

Price   Area   Pool      Price   Area   Pool
67.9    12.0   1         59.1    11.1   0
69.5    13.8   1         59.3    10.5   0
67.2    14.6   0         64.3    11.9   1
61.1    12.6   0         56.6     9.7   0
68.3    12.1   1         50.3     9.2   1
43.5     8.0   0         63.7    12.5   0
59.2    11.0   1         65.5    13.4   1
52.6     9.6   0         61.4    12.9   0
56.2     9.9   1         63.2    13.3   0
56.1    10.6   0         64.8    13.5   1

(a) Find the estimated multiple regression equation.

(b) Find the predicted price of a condo that is 1200 square feet and has a pool. (Be careful of your units!)

(c) What is the estimated influence of having a swimming pool on the sales price of a condo? (In other words, what is the interpretation of b2?)

(d) Find a 95% confidence interval for β2 and explain its meaning.

(e) Logically, would you expect having a pool to raise or lower the price of a condo? Set up and perform the appropriate hypothesis test to prove your point. Make sure you state your hypotheses, p-value, and conclusions.

(4) Salary Prediction: The manager of Statisticorp, a company based in the city of Los Seraphim, is wondering what factors determine employees’ salaries. In the past, number of years spent with the company has been considered the most important predictor. However, in recent times gender and age discrimination have become hot-button issues, so she is particularly interested in knowing whether the age and gender of an employee give additional information about salary. She takes a random sample of 20 employees. Let Y be salary (in thousands of dollars), X1 the number of years with the company, X2 age, and let X3 be an indicator which is 1 if the employee is female, and 0 if the employee is male. A data set containing information for the employees is given in the table below.
Use it to answer the following questions.

Salary   Years   Age   Gender      Salary   Years   Age   Gender
23       2       25    1           25       3       26    1
27       4       28    1           35       5       29    1
45       6       28    1           47       6       30    1
50       8       40    1           65       10      40    1
70       12      45    1           102      20      45    1
27       2       25    0           30       3       26    0
30       4       28    0           39       5       29    0
47       6       28    0           51       6       30    0
57       8       40    0           70       10      40    0
77       12      45    0           110      20      45    0
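
If the course data file is not at hand, the employee table above can also be typed into STATA directly. The following is only a sketch of one way to do that: the variable names salary, years, age and female are illustrative assumptions, not necessarily the names used in the course data set, and the rows are read from the left panel of the table first, then the right panel.

```stata
* Sketch: enter the Problem 4 table by hand.
* Variable names here are assumed for illustration only.
clear
input salary years age female
23 2 25 1
27 4 28 1
45 6 28 1
50 8 40 1
70 12 45 1
27 2 25 0
30 4 28 0
47 6 28 0
57 8 40 0
77 12 45 0
25 3 26 1
35 5 29 1
47 6 30 1
65 10 40 1
102 20 45 1
30 3 26 0
39 5 29 0
51 6 30 0
70 10 40 0
110 20 45 0
end

* Pairwise correlations among salary, years and age
correlate salary years age

* Multiple regression of salary on years, gender and age
regress salary years female age
```

Once the data are entered this way, the commands listed at the end of the assignment apply unchanged; only the variable names differ.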


(a) Find the correlations among Y, X1, and X2. Do you think X2 (age) is a good predictor of Y (salary)? Why or why not?

(b) Fit the multiple regression of Salary on Years, Gender and Age. Based on the STATA multiple regression printout perform a hypothesis test to determine whether β2, the coefficient of age, is significantly different from 0. Clearly state the null and alternative hypotheses and the p-value. What does this test tell you about the usefulness of age as a predictor in this model? In light of your answer to (a), how could this have happened? What should you do to resolve the problem? Explain briefly.

(c) Test the null hypothesis β3 = 0 versus the alternative β3 ≠ 0, giving the test statistic and the p-value. What does this say about the respective salaries of men and women?

(d) Suppose you had started with a theory that women make less money than men with the same qualifications and wanted to prove it. How would this have changed your hypotheses in part (c)? What would your new p-value and conclusions have been? Explain briefly.

(e) Perform an overall F test for this regression. Make sure you state the null and alternative hypotheses, both mathematically and in words, give the test statistic and p-value, say whether or not you reject and why, and explain your conclusions.

(f) Based on the answers to the previous parts, what final model would you choose to use for this data set? Briefly justify your choice. Note that the model you choose does not need to be one you have actually fit!

Problems To Turn In

(5) Still More Height and Weight: A doctor is interested in understanding what factors predict the weight of teenagers. As we have seen before, height is a good predictor of weight. The doctor suspects age may also be an important factor. She collects data from the next ten teens who come to her office on weight (in pounds), age (in years) and height (in inches).
The data are presented in the accompanying table.

Weight   Age   Height      Weight   Age   Height
90       16    63          86       13    47.8
72       13    48.6        112      14    55.8
132      16    62.2        122      16    55
100      14    48.6        88       13    50.2
62       10    39.8        98       15    56.6

(a) Fit two separate simple linear regressions, one of weight on height and one of weight on age. Is there a significant linear relationship in each case? Explain.

(b) Fit the multiple regression of weight on height and age. What is the estimated regression equation? Does the regression overall explain a significant amount of the variability in weight? Explain using an F test. Make sure to write out the null and alternative hypotheses, both mathematically and in words, give the test statistic and p-value, say whether or not you reject and why, and explain your real-world conclusions. Use α = .05.

(c) In the regression of part (b) are the variables height and age significant? Explain using individual t tests. In each case, write the null and alternative hypotheses, both mathematically and in words, give the test statistic and p-value, say whether or not you reject and why, and explain your conclusions. Use α = .05. Also use the STATA printout to find confidence intervals for β1 and β2 and give brief interpretations of them. Do these intervals confirm the results of your hypothesis tests? Explain.

(d) Explain what has caused the apparent contradictions in (a), (b) and (c). Hint: Do you expect there to be a relationship between age and height? Why? Verify this by computing the correlation between the two variables. How would you fix this problem? Be as specific as you can.

(6) Statistics Leaves Me Breathless (based on Rosner Problem 11.46): A study was performed to assess the relationship between pulmonary function and a variety of other factors in children and teens. The data (which are not reproduced here since there are hundreds of observations) are provided on the web site in both Excel and STATA formats along with the variables for the other problems on this assignment. The response variable, Y, is forced expiratory volume (fev) and the predictor variables are age (in years), height (in inches), gender (male=1, female=0) and smoking status (current smoker=1, non-smoker=0). Use the data to answer the following questions.

(a) Before performing the regression, what signs do you expect for the coefficients of the four predictor variables? Explain briefly in each case.

(b) Fit a simple linear regression of forced expiratory volume (fev) on smoking using STATA. Does the sign of the coefficient surprise you? Calculate the table of correlations among fev, height, age and smoking and use these numbers to explain what has happened.

(c) Use STATA to fit a multiple regression of forced expiratory volume on the four predictor variables. Now are your predictions for the signs verified?

(d) Give careful interpretations of the coefficients of the age and smoking variables.
Your answer should include the numeric values of the coefficients and incorporate the appropriate units.

(e) Predict the fev score for Silly Sally, a 14-year-old girl who is 5 feet tall and smokes, based on your model. If we wanted a range of values that we could be fairly sure would include Sally’s fev, what kind of interval should we use? Explain briefly.

(f) According to your model who should have a higher fev, a boy, or a girl who is two inches taller but a year younger? (You may assume they have the same smoking status.) Briefly explain your reasoning.

(g) What percentage of the variability in fev score is explained by age, height, gender and smoking status? Which number should you use to answer this question and why? Does it make much difference in this model? What does that tell you?

(h) Find the average distance from points to the estimated multiple regression surface. Do you think this model does a good job of predicting fev score? Explain briefly.

(i) Label the values of SSE, SSR, SST, MSR, and MSE on your STATA printout and explain briefly what each one is telling you in the context of the problem.

(j) Is the overall regression useful for explaining the variability in fev scores? Check this by performing an F test. Be sure to state the null and alternative hypotheses, both mathematically and in words, give the test statistic and the p-value, say whether or not you reject and why, and explain your real-world conclusions. You may use α = .05.

(k) Suppose you had a theory that smoking would harm your pulmonary function and hence decrease your fev score. Perform an appropriate test to try to prove your theory. Be sure to state the null and alternative hypotheses, both mathematically and in words, give the test statistic and the p-value, say whether or not you reject and why, and explain your real-world conclusions. Does the STATA confidence interval for the coefficient of the smoking variable confirm your test result? Explain briefly.

(l) Your answer to part (k) is a bit disappointing. It turns out that this data set includes children who are as young as four years old. Obviously children this young should not be smoking. In fact, the youngest smoker in the data set is nine years old. Perhaps we should restrict our model to children nine years and older. Refit the model with this restriction and see if it makes a difference to the significance of the smoking variable. Also comment on what has happened to the p-value of the gender variable, the value of the F statistic, and the value of R²-adjusted. You can get bonus credit for explaining what has probably caused these changes.

STATA Commands and a Sample Lab Session

There isn’t really much new in STATA this week because multiple regression and simple regression work the same way! For instance, to fit a regression with three predictor (X) variables you would type

    regress yvar xvar1 xvar2 xvar3

You may also find it useful to recall how to obtain correlations in STATA. For instance, if you want a table of pairwise correlations for variables 1 to 3 simply type

    cor var1 var2 var3

The only other thing you need to know is how to restrict your analysis to a particular part of a data set. You do this by using an “if” statement after your main command.
For instance, if we were fitting a regression where our Y variable was infant mortality rate and our X variable was year and we wanted to restrict the analysis to years after 1975 we would type

    regress mortality year if year > 1975

If we were fitting a regression model of house price on size and whether or not there was a pool and wanted to restrict the analysis to houses with a pool, as in warm-up problem three, we would type

    regress price size pool if pool==1

Note that in this case I have typed “=” twice to tell STATA I mean exactly equal to. Not equal is written != and other restrictions work in similar ways.

Note: In order to figure out what I have named the variables for this assignment it is helpful to look in the STATA data editor. For Problem 5 the variable names are Weight, Age and Height (note the capital letters, which are used to distinguish them from variables with the same name in Problem 6). In Problem 6 the variable names are fev, age, height, gender and smoke.
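
Putting these commands together for Problem 6, a lab session might look roughly like the sketch below. It is only an outline under stated assumptions: the Problem 6 data set from the course web site is taken to be already open in STATA, the variable names are the ones listed in the note above, and the cutoff of nine years comes from the age restriction described in the last part of Problem 6.

```stata
* Sketch of a Problem 6 session; assumes the course data set is already open
* and that the variables are named fev, age, height, gender and smoke.

* Simple linear regression of fev on smoking status
regress fev smoke

* Table of pairwise correlations among fev, height, age and smoking
correlate fev height age smoke

* Multiple regression on all four predictors, full data set
regress fev age height gender smoke

* Same model restricted to children nine years and older
regress fev age height gender smoke if age >= 9
```

Because Problem 5 uses capitalized names, the corresponding commands there would refer to Weight, Age and Height instead, for example regress Weight Height Age.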
