12.07.2015 Views

Winter 2009 Final Exam

Name:

Final Exam
Wednesday, March 18th, 2009

General Comments:

• This exam is closed book. However, you may use six pages, front and back, of notes and formulas (your cheat sheets from midterms 1 and 2 plus two additional sheets). Write your answers on the exam sheets. If you need more space, continue your answer on the back of the page. Normal and t-tables are attached at the end. Make sure you have all 18 pages!

• The exam is designed to be about 2 hours long. However you have 3 hours to complete it. There are 3 questions, worth a total of 100 points. They are not equally weighted, nor are they of equal difficulty. The number of points each question is worth is printed with the problem. Read the questions carefully. If you are unsure of the interpretation, come ask.

• You must show your work to obtain full credit. If you use a result from class, state what result you are using. If you can’t complete a problem for any reason, explain what concepts are at issue, and how you would attack the problem. If you can’t work out a number you need for a later part of a problem give it a symbol and show how you would do the calculations with a symbol in place of the missing number. It is a good idea to explain your reasoning briefly in English. If I can’t tell that you understood what you were doing, I can’t give you credit, particularly if you get the wrong numerical answer. GOOD LUCK!

    Question Number    Total Points Possible    Score Received
    1                  26
    2                  26
    3                  48
    Total              100

THE STORY BEHIND THE EXAM: Professor U. R. Helpful is an AIDS researcher at our favorite school, the University of Calculationally Literate Adults. (Did he perversely choose his field based on his name?) He is interested in reducing the viral load of patients in general and in particular in reducing the risk that pregnant women transmit HIV to their babies. Your job is to help him analyze some of his data.


Question 1: Special Delivery (26 points, 35 minutes)

In the developed world most people with HIV receive some form of “highly active antiretroviral therapy” or HAART. (HAART regimens are basically cocktails of multiple drugs that are more effective because the virus is less likely to become resistant in their presence.) However, in underdeveloped nations HAART is rarer because of its cost. Professor Helpful believes that HAART regimens will help reduce the risk of HIV positive pregnant women passing on the infection to their babies and must therefore be aggressively promoted in poor countries. He has followed n=300 HIV positive pregnant women, 100 of whom are receiving at most a basic non-HAART treatment, 100 of whom are taking HAART regimen A, and 100 of whom are taking HAART regimen B. (I’ll skip the drug names to keep this simple!) He records Y, whether or not the baby is HIV positive (1 = yes, 0 = no) and which treatment regimen the mother was on (X1 = 1 if the mother was on HAART A and 0 otherwise, X2 = 1 if the mother was on HAART B and 0 otherwise), and fits a logistic regression. The corresponding STATA printouts are below. Use them to answer the following questions.

    . logit HIVplus HAART_A HAART_B

    Logistic regression                               Number of obs   =        300
                                                      LR chi2(2)      =       6.75
                                                      Prob > chi2     =     0.0342
    Log likelihood = -96.32681                        Pseudo R2       =     0.0339

    ------------------------------------------------------------------------------
         HIVplus |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         HAART_A |     -0.539       .431    -1.25   0.211       -1.383       0.305
         HAART_B |     -1.286       .534    -2.41   0.016       -2.332      -0.240
           _cons |     -1.658       .273    -6.08   0.000       -2.193      -1.124
    ------------------------------------------------------------------------------

    . logistic HIVplus HAART_A HAART_B

    Logistic regression                               Number of obs   =        300
                                                      LR chi2(2)      =       6.75
                                                      Prob > chi2     =     0.0342
    Log likelihood = -96.32681                        Pseudo R2       =     0.0339

    ------------------------------------------------------------------------------
         hivplus | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         HAART_A |       .583       .251    -1.25   0.211         .251       1.357
         HAART_B |       .276       .147    -2.41   0.016         .097       0.787
    ------------------------------------------------------------------------------
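The two printouts above report the same fit on two scales: `logit` gives log-odds coefficients and `logistic` gives odds ratios. The following Python sketch (not part of the original exam; the variable and function names are illustrative assumptions) shows the arithmetic that links them, using only numbers copied from the first printout.

```python
import math

# Coefficients and 95% CI limits copied from the `logit` printout.
logit_rows = {
    "HAART_A": {"coef": -0.539, "ci": (-1.383, 0.305)},
    "HAART_B": {"coef": -1.286, "ci": (-2.332, -0.240)},
}

def to_odds_ratio(row):
    """Exponentiate a log-odds coefficient and its CI limits."""
    lo, hi = row["ci"]
    return {"or": math.exp(row["coef"]), "ci": (math.exp(lo), math.exp(hi))}

odds_ratios = {name: to_odds_ratio(row) for name, row in logit_rows.items()}

for name, r in odds_ratios.items():
    print(f"{name}: OR = {r['or']:.3f}, "
          f"95% CI = ({r['ci'][0]:.3f}, {r['ci'][1]:.3f})")
```

Up to rounding, the exponentiated values should agree with the odds ratios and interval limits in the `logistic` printout.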


Part a

Overall, is treatment regimen useful for explaining whether a woman passes on HIV infection to her baby? Write down the mathematical hypotheses you are testing, circle the relevant p-value on one of the printouts and give your real-world conclusions using α = .05. You do NOT need to provide any other details.

Part b

Give a brief interpretation of the odds ratio for the HAART A variable and show how to compute it from the first regression printout.

Part c

Do HAART A and HAART B appear to reduce a mother’s risk of passing on HIV to her infant? Explain briefly using α = .05 and give the p-values corresponding to the tests you are performing. You do NOT need to write out any other details of the tests.


Part d

Find the odds ratio comparing the risk of HIV transmission for mothers in the HAART A group compared to those in the HAART B group. Show your work. Based on this estimate which of these treatment regimens is more effective? Briefly explain your reasoning. Do you think you can be 95% sure this treatment is better? Explain.

Part e

Suppose that the women in the HAART A and B groups had also had a previous pregnancy during which they were HIV positive but NOT treated with HAART. What sort of test could Professor Helpful have used then to determine whether these two treatments were helpful? What would be the advantage of this sort of study?

Part f

Suppose Professor Helpful wants to determine whether the rate of mother to infant transmission of HIV in the non-HAART group is different from the overall rate in the two HAART groups (i.e. subjects are classified as HAART or non-HAART with no specification of whether they got HAART A or B). What method could he use other than logistic regression or a contingency table test to do this comparison?
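For the A-versus-B comparison in Part d, one possible route is to work on the log-odds scale: both HAART coefficients are measured against the same non-HAART baseline, so their difference is the log odds ratio of A relative to B. The Python sketch below is an illustration under that assumption, not the exam's official solution; the variable names are hypothetical.

```python
import math

# Log-odds coefficients from the Question 1 `logit` printout.
b_haart_a = -0.539
b_haart_b = -1.286

# Both coefficients share the nonHAART reference group, so the
# baseline cancels when one is subtracted from the other.
log_or_a_vs_b = b_haart_a - b_haart_b
or_a_vs_b = math.exp(log_or_a_vs_b)

print(f"log OR (A vs B) = {log_or_a_vs_b:.3f}")
print(f"OR (A vs B)     = {or_a_vs_b:.3f}")
```

A confidence interval for this ratio needs the standard error of the difference of the two coefficients, which depends on their covariance. The printout does not report that covariance, so the sketch stops at the point estimate.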


Part g (Optional Bonus)

We could have solved the original problem in this question using a contingency table approach rather than logistic regression. Use the information from the regression printout to figure out what Professor Helpful’s contingency table must have been. Show your work.

                   nonHAART    HAART A    HAART B    Total
    Infant HIV+
    Infant HIV-
    Total          100         100        100        300

Part h (Optional Bonus)

Perform the test you described in part (f) giving the null and alternative hypotheses mathematically and in words, computing the test statistic and p-value and giving your real-world conclusions. (Part (g) will be helpful here!)
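For Part g, a model whose only predictors are group indicators reproduces each group's observed proportion, so the fitted probabilities can be turned back into counts. The Python sketch below shows one way to do that arithmetic; it is not part of the original exam, and the helper `inv_logit` and the dictionary names are illustrative assumptions.

```python
import math

def inv_logit(eta):
    """Convert a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

# Coefficients from the Question 1 `logit` printout.
b_cons, b_haart_a, b_haart_b = -1.658, -0.539, -1.286
group_size = 100  # each regimen has 100 mothers

fitted_prob = {
    "nonHAART": inv_logit(b_cons),
    "HAART A": inv_logit(b_cons + b_haart_a),
    "HAART B": inv_logit(b_cons + b_haart_b),
}

# Rounding absorbs the three-decimal rounding of the printed coefficients.
hiv_pos = {g: round(p * group_size) for g, p in fitted_prob.items()}
hiv_neg = {g: group_size - k for g, k in hiv_pos.items()}

for g in fitted_prob:
    print(f"{g:9s} p = {fitted_prob[g]:.4f}  "
          f"HIV+ = {hiv_pos[g]}  HIV- = {hiv_neg[g]}")
```

The implied counts can then be pooled across the two HAART groups for the comparison described in parts (f) and (h).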


Question 2: Prenatal Care-acteristics (26 points, 35 minutes)

Professor Helpful recognizes that there are probably many factors besides treatment regimen that affect whether a mother transmits HIV to her baby. He has thus added the following variables to his logistic regression model from Question 1: X3, the mother’s viral load in copies per milliliter of blood (higher viral load is worse), X4, the mother’s age in years, X5, the number of years the mother has been HIV positive, X6, the number of weeks during the pregnancy for which the mother was receiving HAART therapy, and X7, the method by which the baby was delivered (1 = C-section, 0 = natural delivery). The new printouts are given below. Use them to answer the following questions.

    . logit HIVplus HAART_A HAART_B VLoad Age YrsHIV WksHAART Delivery

    Logistic regression                               Number of obs   =        300
                                                      LR chi2(7)      =      32.47
                                                      Prob > chi2     =      0.000
    Log likelihood = -26.51722                        Pseudo R2       =      0.500

    ------------------------------------------------------------------------------
         HIVplus |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         HAART_A |      -0.70       0.250     2.80   0.005     [-1.19, -0.21]
         HAART_B |      -1.80       0.300     6.00   0.000     [-2.39, -1.21]
           VLoad |    0.00001   0.0000025     4.00   0.000     [.000005, .000015]
             Age |       0.10       0.050     2.00   0.046     [ 0.00, 0.20]
          YrsHIV |       0.10       0.080     1.25   0.211     [-0.06, 0.26]
        WksHAART |      -0.05       0.010    -5.00   0.000     [-0.07, -0.03]
        Delivery |      -0.40       0.150    -2.67   0.004     [-0.69, -0.11]
           _cons |      -5.00       0.500   -10.00   0.000     [-5.98, -4.02]
    ------------------------------------------------------------------------------

    . logistic HIVplus HAART_A HAART_B

    Logistic regression                               Number of obs   =        300
                                                      LR chi2(7)      =      32.47
                                                      Prob > chi2     =      0.000
    Log likelihood = -26.51722                        Pseudo R2       =      0.500

    ------------------------------------------------------------------------------
         HIVplus |  OddsRatio       z    P>|z|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
         HAART_A |     0.4966     2.80   0.005     [0.3042, 0.8106]
         HAART_B |     0.1652     6.00   0.000     [0.0916, 0.2982]
           VLoad |    1.00001     4.00   0.000     [1.000005, 1.000015]
             Age |     1.1052     2.00   0.046     [1.0020, 1.2190]
          YrsHIV |     1.1052     1.25   0.211     [0.9448, 1.2928]
        WksHAART |     0.9512    -5.00   0.000     [0.9328, 0.9700]
        Delivery |     0.6703    -2.67   0.004     [0.4996, 0.8994]
    ------------------------------------------------------------------------------


Part a

Find the probability that a 30 year old woman on HAART A for 20 weeks of her pregnancy with a viral load of 10,000 who has been HIV positive for 10 years will have an HIV negative baby if she delivers by Cesarean section. Show your work.

Part b

Explain as precisely as you can the meaning of the p-value for X7, the delivery variable. Your answer should be specific to this context and incorporate the relevant numeric value(s).

Part c

(i) Give a brief interpretation of the confidence interval for the odds ratio for X6, the weeks treated variable. (ii) Find a 95% confidence interval for the odds ratio associated with an extra MONTH (4 weeks) of HAART treatment. Based on this latter interval can you be sure that, all else equal, an extra month of HAART treatment will reduce the risk of mother to child transmission by 10%?
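Parts a and c(ii) both come down to arithmetic on the log-odds scale. The Python sketch below illustrates one way to carry it out with the coefficients from the Question 2 `logit` printout. It is an illustration rather than the exam's official solution, and the dictionary layout and names are assumptions made for the example.

```python
import math

# Coefficients copied from the Question 2 `logit` printout.
coef = {
    "_cons": -5.00, "HAART_A": -0.70, "HAART_B": -1.80, "VLoad": 0.00001,
    "Age": 0.10, "YrsHIV": 0.10, "WksHAART": -0.05, "Delivery": -0.40,
}

# The mother described in Part a.
mother = {
    "HAART_A": 1, "HAART_B": 0, "VLoad": 10_000, "Age": 30,
    "YrsHIV": 10, "WksHAART": 20, "Delivery": 1,
}

# Linear predictor (log odds that the baby is HIV positive).
eta = coef["_cons"] + sum(coef[name] * value for name, value in mother.items())
p_hiv_positive = 1.0 / (1.0 + math.exp(-eta))
p_hiv_negative = 1.0 - p_hiv_positive

# Part c(ii): scale the per-week log-odds interval by 4 weeks, then exponentiate.
wks_ci_log_odds = (-0.07, -0.03)
month_or_ci = tuple(math.exp(4 * limit) for limit in wks_ci_log_odds)

print(f"eta = {eta:.2f}, P(HIV-) = {p_hiv_negative:.4f}")
print(f"95% CI for 4-week odds ratio: "
      f"({month_or_ci[0]:.4f}, {month_or_ci[1]:.4f})")
```

The interval is rescaled on the coefficient scale first because a multiple of a log-odds interval is still an interval for a log-odds quantity, whereas multiplying the odds-ratio limits by 4 would not be meaningful.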


Part d

Professor Helpful believes overfitting is an issue in this model. (i) Explain why he is correct. (ii) Give a possible real-world cause of the overfitting and say how you would check whether your idea was correct. (iii) Say what variable you would remove first in a backwards stepwise procedure and why. (iv) What do you think would happen to the pseudo R2 if you removed this variable? Why?


Question 3: Health is Where the HAART Is (48 points, 70 minutes)

Professor Helpful wants to understand how the different medication regimens he studied in Questions 1 and 2 affect an AIDS patient’s viral load. He has selected n=124 HIV positive individuals and randomly assigned them to four treatment groups: nonHAART (N), HAART A (A), HAART B (B), and a combination of HAART A and B (AB). (Ignore the ethical implications of this design for now!) He has fit two ANOVA models, one with the raw viral load, Y, (in copies per milliliter of blood) as the response and one with log10 Y, the base 10 logarithm of the viral load, as the response. (Note: This has nothing to do with logistic regression; it is just that viral loads are often measured on a log scale!) The corresponding STATA printouts are below along with their associated residual plots. Use them to answer the questions on the following pages.

    . oneway vload group

    Number of obs =     124        R-squared     = 0.1727
    Root MSE      =  146094        Adj R-squared = 0.1520

                            Analysis of Variance
        Source              SS         df      MS            F     Prob > F
    ------------------------------------------------------------------------
    Between groups      5.3456e+11      3   1.7819e+11      8.35     0.0000
     Within groups      2.5612e+12    120   2.1344e+10
    ------------------------------------------------------------------------
        Total           3.0958e+12    123   2.5169e+10

    **************************************************************************

    . oneway log10_vload group, tabulate

                |      Summary of log10_vload
          group |        Mean   Std. Dev.       Freq.
    ------------+------------------------------------
              N |   4.6157984   .91861186          31
              A |   4.2090149   .60964526          31
              B |   3.9325629   .55513912          31
             AB |   3.2174303   .26721532          31
    ------------+------------------------------------
          Total |   3.9937017   .80689929         124

    Number of obs =     124        R-squared     = 0.4025
    Root MSE      = .631486        Adj R-squared = 0.3875

                            Analysis of Variance
        Source              SS         df      MS            F     Prob > F
    ------------------------------------------------------------------------
    Between groups      32.2306776      3   10.7435592     26.94     0.0000
     Within groups      47.8529569    120   .398774641
    ------------------------------------------------------------------------
        Total           80.0836345    123   .65108646
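The summary numbers on each ANOVA printout (F, R-squared, Root MSE) all follow from the sums of squares and degrees of freedom in the table. The Python sketch below recomputes them for both models as a check on how the pieces fit together; the function `anova_summary` is a hypothetical helper written for this illustration, not something from the exam or from Stata.

```python
import math

def anova_summary(ss_between, df_between, ss_within, df_within):
    """Recompute one-way ANOVA summary statistics from the table entries."""
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "F": ms_between / ms_within,
        "R2": ss_between / (ss_between + ss_within),
        "root_mse": math.sqrt(ms_within),
    }

# Entries copied from the two `oneway` printouts.
raw_model = anova_summary(5.3456e11, 3, 2.5612e12, 120)
log_model = anova_summary(32.2306776, 3, 47.8529569, 120)

for label, s in (("raw vload", raw_model), ("log10 vload", log_model)):
    print(f"{label:12s} F = {s['F']:.2f}  R2 = {s['R2']:.4f}  "
          f"Root MSE = {s['root_mse']:.6g}")
```

Note that the two Root MSE values are on different scales (copies per milliliter versus log10 units), so they cannot be compared directly across the two models.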


Part a

Which of these two models do you prefer? Justify your choice in terms of (i) how well the models meet the regression assumptions and (ii) using one other appropriate number from the printouts.


Note: For the remaining parts of the problem base your answers on the model with log10 viral load as the response regardless of your answers to part (a).

Part b

Based on this data is there evidence that the mean log10 viral load differs across the treatment regimens? Justify your answer by performing an appropriate overall hypothesis test. State the mathematical hypotheses in the classical ANOVA framework, give the p-value, and your real-world conclusions.

The test statistics and p-values for pairwise comparisons of the group means are given below. Use them to answer parts (c) - (e).

    Pair       t_obs    p-value
    N vs A     2.535    0.01253
    N vs B     4.259    0.00004
    N vs AB    8.718    0.00000
    A vs B
    A vs AB    6.182    0.00000
    B vs AB    4.459    0.00000

Part c

Calculate the 95% confidence interval associated with testing the difference between the A and B groups, and provide a brief interpretation of it.
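For the A-versus-B interval in Part c, the usual ANOVA approach pools the within-group variance across all four groups. The Python sketch below works through that calculation with numbers from the log10 printout. It is an illustration rather than the official solution, and the critical value 1.980 is an assumption taken from a standard t-table at 120 degrees of freedom.

```python
import math

# Group means and within-group mean square from the log10 `oneway` printout.
mean_a, mean_b = 4.2090149, 3.9325629
n_a = n_b = 31
ms_within = 0.398774641

# Assumed two-sided 95% critical value for t with 120 df (table lookup).
t_crit = 1.980

diff_a_b = mean_a - mean_b
se_diff = math.sqrt(ms_within * (1 / n_a + 1 / n_b))
t_obs = diff_a_b / se_diff
ci_a_b = (diff_a_b - t_crit * se_diff, diff_a_b + t_crit * se_diff)

print(f"difference = {diff_a_b:.4f}, SE = {se_diff:.4f}, t = {t_obs:.3f}")
print(f"95% CI = ({ci_a_b[0]:.4f}, {ci_a_b[1]:.4f})")
```

The same standard error applies to every pairwise comparison here because all four groups have 31 observations.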


Part d

Which pairs of means are significantly different from one another at the α = .05 level without adjusting for multiple testing? Explain briefly without writing out any of the test details. What does this suggest about the relative merits of the treatment regimens?

Part e

Suppose you used the Bonferroni correction to adjust for the multiple comparisons in parts (c) and (d). Would this change any of your conclusions from part (d)? Explain briefly. Would it change the confidence interval in part (c)? If not, say why not. If it would, say how, though you do not need to recalculate the interval.
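The Bonferroni correction in Part e amounts to dividing α by the number of comparisons and re-checking each p-value against the smaller threshold. The Python sketch below applies that rule to the p-values printed in the pairwise table. It is an illustration with hypothetical names, and the A vs B entry is left as `None` because the table leaves it blank.

```python
alpha = 0.05

# p-values copied from the pairwise comparison table; A vs B is blank there.
pairwise_p = {
    "N vs A": 0.01253,
    "N vs B": 0.00004,
    "N vs AB": 0.00000,
    "A vs B": None,
    "A vs AB": 0.00000,
    "B vs AB": 0.00000,
}

n_comparisons = len(pairwise_p)            # 4 groups give 6 pairs
bonferroni_alpha = alpha / n_comparisons   # per-comparison threshold

def is_significant(p, threshold):
    """Return None when the p-value is not available."""
    return None if p is None else p < threshold

unadjusted = {pair: is_significant(p, alpha) for pair, p in pairwise_p.items()}
adjusted = {pair: is_significant(p, bonferroni_alpha)
            for pair, p in pairwise_p.items()}

# Pairs whose conclusion flips once the correction is applied.
changed = [pair for pair in pairwise_p
           if unadjusted[pair] and not adjusted[pair]]

print(f"Bonferroni threshold = {bonferroni_alpha:.5f}")
print("Conclusions that change:", changed)
```

The same adjustment widens any confidence interval, since the critical value is then taken at α/6 rather than α.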


Part f

Professor Helpful is interested in proving that patients on HAART (the A, B and combo regimens) do better overall than patients who are not taking HAART. (i) Write down an appropriate linear combination, LC, for the comparison he wishes to do. (You may assume that the various HAART regimens are equally common in the population of interest.) (ii) Give your best estimate of LC and the corresponding standard error. (iii) Use these numbers to perform the hypothesis test of interest to Professor Helpful. Give the null and alternative hypotheses, mathematically and in words with a justification of your choice, compute the test statistic and give your real-world conclusions using α = .05.

Part g

Suppose that instead of doing an ANOVA Professor Helpful had fit a regression model to this data using HAART B as the reference group. Write down the estimated regression equation he would have obtained.
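For the contrast in Part f, one reasonable choice (an assumption of this sketch, not necessarily the intended answer) is the nonHAART mean minus the equally weighted average of the three HAART means. The Python sketch below estimates that linear combination and its standard error from the log10 printout; the weights and variable names are illustrative.

```python
import math

# Group means, sizes and within-group mean square from the log10 printout.
group_mean = {"N": 4.6157984, "A": 4.2090149, "B": 3.9325629, "AB": 3.2174303}
group_n = {"N": 31, "A": 31, "B": 31, "AB": 31}
ms_within = 0.398774641

# Assumed contrast: nonHAART minus the average of the HAART groups.
# The weights sum to zero, and a positive value favors HAART.
weight = {"N": 1.0, "A": -1 / 3, "B": -1 / 3, "AB": -1 / 3}

lc_estimate = sum(weight[g] * group_mean[g] for g in weight)
lc_se = math.sqrt(ms_within * sum(weight[g] ** 2 / group_n[g] for g in weight))
lc_t = lc_estimate / lc_se  # compare with a t distribution on 120 df

print(f"LC = {lc_estimate:.4f}, SE = {lc_se:.4f}, t = {lc_t:.2f}")
```

Reversing the signs of the weights gives an equivalent contrast; only the direction of the one-sided alternative changes.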


Part h

Professor Helpful has decided to fit a slightly different model to this data. He creates two indicators, X1 = 1 if the person’s regimen includes HAART A, X2 = 1 if the person’s regimen includes HAART B, and X3 = X1 ∗ X2, an interaction term. The resulting regression printout is below. (i) Explain as precisely as you can what the interaction means in this model including what the sign of b3 tells you. (ii) Is the interaction significant at α = .05? What does this tell you?

    . regress log10_vload A B A*B

          Source |       SS       df       MS              Number of obs =     124
    -------------+------------------------------           F(  3,   120) =   26.94
           Model |  32.2306776     3  10.7435592           Prob > F      =  0.0000
        Residual |  47.8529569   120  .398774641           R-squared     =  0.4025
    -------------+------------------------------           Adj R-squared =  0.3875
           Total |  80.0836345   123   .65108646           Root MSE      =  .63149

    ------------------------------------------------------------------------------
     log10_vload |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
               A |  -.4067835   .1603976    -2.54   0.012    -.7243596   -.0892073
               B |  -.6832355   .1603976    -4.26   0.000    -1.000812   -.3656593
             A*B |  -.3083491   .2268365    -1.36   0.177    -.7574696    .1407713
           _cons |   4.615798   .1134182    40.70   0.000     4.391238    4.840359
    ------------------------------------------------------------------------------
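One way to see what the interaction model is doing is to rebuild the four group means from its coefficients. The Python sketch below (an illustration with hypothetical names, not part of the exam) plugs each combination of the two indicators into the fitted equation and compares the result with the group means tabulated in the earlier `oneway` printout.

```python
# Coefficients copied from the `regress` printout.
b_cons, b_a, b_b, b_ab = 4.615798, -0.4067835, -0.6832355, -0.3083491

def fitted_mean(x1, x2):
    """Fitted log10 viral load for indicator values x1 (A) and x2 (B)."""
    return b_cons + b_a * x1 + b_b * x2 + b_ab * x1 * x2

# Each treatment group corresponds to one (X1, X2) combination.
group_indicators = {"N": (0, 0), "A": (1, 0), "B": (0, 1), "AB": (1, 1)}
fitted = {g: fitted_mean(x1, x2) for g, (x1, x2) in group_indicators.items()}

# Group means from the `oneway log10_vload group, tabulate` printout.
tabulated = {"N": 4.6157984, "A": 4.2090149, "B": 3.9325629, "AB": 3.2174303}

for g in group_indicators:
    print(f"{g:2s} fitted = {fitted[g]:.6f}  tabulated = {tabulated[g]:.6f}")
```

With four coefficients and four groups, the regression has exactly enough parameters to match every group mean, which is relevant to the comparison of the two models asked about in part (j).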


Part i (Optional Bonus)

As we have discussed, in an ANOVA it is hard to judge an independence violation from the residual plot. In this problem is the independence assumption justified and how can you tell?

Part j (Optional Bonus)

Note that the overall F test, RMSE and R2 values are the same for the log10 viral load model at the start of the problem and the interaction model in part (h). Explain why this is the case and what the essential difference between the two models is.

Part k (Optional Bonus)

Explain why the log transformation is often used to resolve issues of non-normality and why it is sometimes called a variance stabilizing transformation.


Part l (Hard Optional Bonus)

We never learned how to calculate CIs and PIs for Y in a multiple regression because the formulas for the standard deviations s_Ŷ0 and s_Y0 are messy. However in an ANOVA context where there are only indicator variables it is possible to write down simple formulas. (i) Explain what the CI and PI formulas ought to be using ANOVA notation and (ii) for the very brave, verify that if your ANOVA has two groups then these formulas reduce to the formulas we learned for simple regression involving X̄ and SSX. (Note: To do the latter part you may find it useful to remember that your X values are all 0’s and 1’s denoting group membership, and to let n1 be the number of 1’s and n2 the number of 0’s, with n1 + n2 = n, the total sample size.)

Congratulations!!! You Are Done!! Have A Great Spring Break!!
