12.07.2015 Views

Midterm 2 Practice Problems With Solutions


Problem 1: Television Ads

You own a chain of stores that sells television sets and you want to know whether your advertising is increasing your sales. Let Y be the number of TVs you sell in a given month, and let X be the amount of money you spend on advertising in a given month, in thousands of dollars. You have data on advertising expenditures and sales for n = 42 months and have fit a simple linear regression of Y on X. The printout for this regression is given below along with a few useful summary statistics. Use it to answer the following questions:

    The regression equation is
    TV Sales = 48.4 + 10.2 Ad-Spending

    Parameter Estimates
    Predictor        Coef    Stdev  t-ratio      p
    Constant        48.40    17.61     2.75  0.009
    Ad Spending   10.2457   0.5224    19.61  0.000

    RMSE = 38.54    R-sq = 90.6%    R-sq(adj) = 90.3%

    Analysis of Variance
    SOURCE        DF      SS      MS       F      p
    Regression     1  571411  571411  384.70  0.000
    Error         40   59413    1485
    Total         41  630824
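The entries of this printout are tied together by a few identities (t-ratio = Coef/Stdev, MS = SS/DF, F = MSR/MSE, RMSE = √MSE, R-sq = SSR/SST). The following is a minimal Python sketch, not part of the original handout, that recomputes the derived entries from the printed coefficients and sums of squares. The variable names are illustrative choices.

```python
import math

# Values copied from the Problem 1 printout.
b1, s_b1 = 10.2457, 0.5224              # slope and its standard error
ssr, sse, sst = 571411.0, 59413.0, 630824.0
df_reg, df_err = 1, 40                  # n = 42 months, so n - 2 = 40

t_ratio = b1 / s_b1                     # t-ratio for the slope
msr = ssr / df_reg                      # mean square for regression
mse = sse / df_err                      # mean square error
f_stat = msr / mse                      # overall F statistic
rmse = math.sqrt(mse)                   # typical size of a residual
r_sq = ssr / sst                        # proportion of variability explained
r_sq_adj = 1 - mse / (sst / (df_reg + df_err))

print(f"t = {t_ratio:.2f}, F = {f_stat:.2f}, RMSE = {rmse:.2f}")
print(f"R-sq = {100 * r_sq:.1f}%, R-sq(adj) = {100 * r_sq_adj:.1f}%")
```

Up to rounding in the printed inputs, the square of the slope's t-ratio matches F, which is why the two tests give the same p-value in simple linear regression.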


Our t statistic would be \(t_{obs} = \frac{b_1 - 0}{s_{b_1}} = 19.61\) from the printout. The F statistic would be \(F_{obs} = \frac{MSR}{MSE} = 384.70\). The corresponding p-value can be read off the printout, either from the predictor table or the ANOVA table, as .000. This is certainly smaller than \(\alpha = .005\) so we reject the null hypothesis. We conclude that there is a significant linear relationship between advertising expenditures and television sales. Knowing how much you spend on advertising does tell you something about what your sales will be like. In fact, the relationship is positive so spending more on ads is associated with higher sales, just as we would hope.

(d) Find a 99% confidence interval for \(\beta_1\), the slope of the regression line, and briefly explain what it tells you about the relationship between advertising and television sales.

Solution: The general formula for a confidence interval for \(\beta_1\) in a simple linear regression is
\[ b_1 \pm t_{\alpha/2,\,n-2}\; s_{b_1} \]
From the printout \(b_1 = 10.2457\) and \(s_{b_1} = .5224\). We have n = 42 months' worth of data and \(\alpha/2 = .005\) for a 99% interval, so we need \(t_{.005,40} = 2.704\). The resulting confidence interval is \(10.2457 \pm (2.704)(.5224)\), or [8.83, 11.66]. This means that we are 99% sure that \(\beta_1\) is between 8.83 and 11.66. What does that mean? It means for every extra $1000 we spend on advertising we sell between 8.83 and 11.66 more TVs on average. This uses the definition that \(\beta_1\) gives the change in Y (here TV sales) associated with a one unit change in X (here $1000 more spent on advertising).

(e) Suppose your company makes a $100 profit per television sold BEFORE taking advertising costs into account. According to your best estimate, do the ads appear to be paying for themselves? Explain briefly. Can you be 99% (really 99.5%) sure the ads are paying for themselves? Explain briefly.

Solution: Our best estimate of \(\beta_1\) is \(b_1 = 10.2457\). In other words, for every extra $1000 spent on advertising we sell an extra 10.2457 TVs.
Since we make a profit of $100 per TV before advertising expenses, this means every $1000 we spend on advertising results in an extra $1024.57 in profit from TV sales. That profit exceeds the cost of the ads (barely!) so our best guess is that the ads are paying for themselves. However, we are NOT 99% sure the ads are paying for themselves. From part (d) all we can say is that we are 99% sure that we have increased our profit by between $883 and $1166 for each $1000 spent on ads. The values at the low end of the interval do NOT cover the advertising costs. In fact we might lose over $100 for every $1000 spent on ads! For those of you keeping track of the exact percentages, we can be 99.5% sure of generating AT LEAST $883 in additional profit (the remaining .5% chance lies above $1166), which is why I worded the question as I did.

Problem 2: Leaping Into the Future

In the modern Olympic era, performances in track and field have been steadily improving. The table below gives the winning distance (in inches) for the Olympic long jump from 1952 to 1984. Use the simple linear regression printout to answer the following questions.

Scatterplot


[Scatterplot of Distance (vertical axis, ticks at 300, 320, 340) against Year (horizontal axis, ticks at 1956, 1962, 1968, 1974, 1980).]

Regression Analysis

    reg Distance Year

          Source |       SS       df       MS              Number of obs =       9
    -------------+------------------------------           F(  1,     7) =    9.21
           Model |  1137.52604     1  1137.52604           Prob > F      =  0.0190
        Residual |  864.973958     7  123.567708           R-squared     =  0.5681
    -------------+------------------------------           Adj R-squared =  0.5063
           Total |      2002.5     8    250.3125           Root MSE      =  11.116

    ------------------------------------------------------------------------------
        Distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            Year |   1.088542   .3587706     3.03   0.019     .2401839    1.936899
           _cons |  -1817.833   706.0703    -2.57   0.037    -3487.424   -148.2423
    ------------------------------------------------------------------------------

    . reg Distance Year Yearsq

          Source |       SS       df       MS              Number of obs =       9
    -------------+------------------------------           F(  2,     6) =    6.88
           Model |   1394.3493     2  697.174648           Prob > F      =  0.0280
        Residual |  608.150703     6  101.358451           R-squared     =  0.6963
    -------------+------------------------------           Adj R-squared =  0.5951


           Total |      2002.5     8    250.3125           Root MSE      =  10.068

    ------------------------------------------------------------------------------
        Distance |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
    -------------+----------------------------------------------------------------
            Year |   225.7233   141.1208     1.60   0.161    -119.5868    571.0333
          Yearsq |  -.0570718   .0358538    -1.59   0.163    -.1448028    .0306591
           _cons |  -222852.3   138860.1    -1.60   0.160    -562630.8    116926.1
    ------------------------------------------------------------------------------

    Year    Distance
    1952    298
    1956    308.25
    1960    319.75
    1964    317.75
    1968    350.5
    1972    324.5
    1976    328.5
    1980    336.25
    1984    336.25

(a) Give the units and interpretation of \(b_1\) in the simple regression model.

Solution: The regression coefficient \(b_1\) always gives the change in Y associated with a one unit change in X. Since \(b_1\) must convert from X units to Y units, the units of \(b_1\) are the units of Y divided by the units of X. In this problem, X is in years and Y is distance in inches, so the units of \(b_1\) are inches per year. Since \(b_1 = 1.08854\), a one unit change in year is associated with a 1.08854 inch change in distance, i.e. the winning long jump distance increases by 1.08854 inches per year. Naturally, since the Olympics are only held every four years, this really means that the winning distance increases by about 4.35 inches every Olympiad.

(b) What proportion of the variability in distance is explained by year using the simple linear regression model? Does the model do a good job in this respect?

Solution: The proportion or percentage of variability explained by the regression is given by \(R^2 = 56.8\%\), or, if we want a less biased estimate, by \(R^2_{adj} = 50.6\%\). Whichever number you use, the regression is explaining barely over half the variability and leaving nearly half the variability unexplained. This is not very good, though it is certainly better than nothing.

(c) Does the simple linear regression model do a good job of predicting the Y values?
Make sure you justify your answer.

Solution: This was one of the most frequently missed questions on the exam on which it appeared. In order to tell whether a regression makes good predictions, you need to know how big the errors


made by the regression are. One way of evaluating this is to look at the typical distance from the points to the regression line. This number is estimated by \(s_{Y|X} = \sqrt{MSE}\). It can be found as Root MSE on the printout, or by taking the square root of MSE from the ANOVA table. Here \(RMSE = \sqrt{123.57} = 11.116\). To tell whether this means the errors are large, we must compare RMSE to the Y values we are trying to predict. The Y values in this problem range from 298 to 350.5. Thus we are making an error of roughly 3-4%. This seems pretty good. However, we really should consider the context of the problem. The errors we are making are on the order of 11 inches, nearly a foot. Long jump competitions are usually decided by much less than this, so our errors, in context, are still rather large. Note: Many people tried to use \(R^2\) or an F test to say whether the model is a good predictor. These values try to get at whether the model explains a lot of variability. You can explain quite a lot of variability and still have bad predictions.

(d) Is there a significant linear relationship between years and distance? Justify your answer using an appropriate test.

Solution: We could use either a t test or an F test since they are the same for simple linear regression. Our null and alternative hypotheses are

\(H_0: \beta_1 = 0\), i.e. there is not a significant relationship
\(H_A: \beta_1 \neq 0\), i.e. there is a significant relationship between the year and the distance of the winning long jump.

From the printout, the test statistics are \(t_{obs} = 3.03\) for the t test, and \(F = 9.2057\) for the F test. In both cases, the p-value for the test is .0190, which is less than \(\alpha = .05\). Therefore, we reject the null hypothesis and conclude that there is a significant linear relationship between year and the winning long jump distance.
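The statement that the t test and the F test agree in simple linear regression can be checked by refitting the model from the nine data points in the table. The sketch below is not part of the original exam. It uses only the standard least-squares formulas, and the function name `simple_regression` and the dictionary keys are illustrative choices.

```python
import math

# Year and winning distance (inches), copied from the data table.
years = [1952, 1956, 1960, 1964, 1968, 1972, 1976, 1980, 1984]
dist = [298, 308.25, 319.75, 317.75, 350.5, 324.5, 328.5, 336.25, 336.25]


def simple_regression(x, y):
    """Least-squares fit of y on x; returns the usual summary statistics."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    scp = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    b1 = scp / ssx                      # slope
    b0 = y_bar - b1 * x_bar             # intercept
    ssr = b1 * scp                      # regression sum of squares
    mse = (sst - ssr) / (n - 2)         # mean square error
    s_b1 = math.sqrt(mse / ssx)         # standard error of the slope
    return {
        "b0": b0,
        "b1": b1,
        "rmse": math.sqrt(mse),
        "r_sq": ssr / sst,
        "t": b1 / s_b1,
        "F": ssr / mse,
    }


fit = simple_regression(years, dist)
for name, value in fit.items():
    print(f"{name:>5} = {value:.4f}")
```

The fitted slope, intercept, Root MSE and R-squared agree with the first printout, and `fit["t"] ** 2` equals `fit["F"]` up to floating-point error.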
To get full credit, you only needed to quote the p-value and explain your conclusions.

(e) In 1968, the Olympics were held in Mexico City, and many records were set, probably due to the high altitude. Is the point for 1968 an outlier or an influential point or both? Explain. What would happen to your answers to (b)-(d) if this point were removed?

Solution: The 1968 point is both an outlier and an influential point. It is an outlier because it has an unusual Y value. (Note that its X value is right in the middle of the data set and is NOT unusual.) It is an influential point because it has single-handedly pulled the line up towards itself and away from the rest of the points. If the point is removed, the regression line will go right through the middle of the rest of the points. Thus the amount of unexplained variability will be smaller and the amount of explained variability will be higher. This will cause \(R^2\) to go up, \(s_{Y|X}\) to go down (and hence we will get better predictions), and F to increase (resulting in a lower p-value for our test). It is never easy to tell just how influential a point is without removing it and refitting the regression. In this case the point is highly influential. For instance, \(R^2\) goes up from about 57% to well over 90%.

(f) Looking at the diagnostic plots for the simple linear regression, does it appear that any of our regression assumptions have been violated? Make sure you state each of the assumptions that can


be checked with each plot and whether you think they are OK. What do you think is causing any problems you see with the plots, and how might you fix them?

Solution: Using a residual plot we can check mean 0, constant variance, and independence/appropriateness of the linear model. Normality can be checked with a histogram and/or QQ plot.

From the residual plot, it appears that the mean 0 assumption is violated. For most values of X, the residuals are all negative. We need the residuals to be centered about the line. This is caused in part by the influential point in 1968. If we took the point out, the regression would go more through the middle of the remaining points and the residuals would be more balanced. However, I think there might still be a curved shape, from the dip in the 60s and 70s, so I would still consider this assumption to be violated.

Whether you consider the constant variance assumption to be met depends on whether or not you include the 1968 point. If you do, the spread of the residuals is much wider at 1968 than anywhere else. If you leave the point out, there is a fairly even band around the residuals. In general I prefer not to judge a model bad if there is only a single point causing the problems, so I would say this assumption is mostly OK.

The assumption that caused the most disagreement was the one involving independence/appropriateness of the linear model. I see a bit of a curved pattern in this data but it is hard to tell whether this is meaningful, especially given the 1968 point and the fact that we only have one observation every 4 years. We gave credit on this assumption either way as long as people explained carefully.

Normality also looks a little questionable for this data, as the histogram is not too symmetric and the points in the quantile plot don't follow the straight line all that well.
It's not terrible but overall I would say this assumption is violated too.

No matter what you think is the right model for this data, the 1968 point makes the error assumptions much more questionable. Removing it will definitely improve the model. It is OK to remove the point in this case because we have a good reason to believe it is abnormal and not representative of what will happen in the future. Mexico City is at an extremely high altitude and this resulted in abnormally strong performances in the short distance track and field events.

(g) A zealous sports fan suggests that the winning distance in the long jump cannot increase forever, but should instead level off. He therefore suggests fitting a curvilinear regression to the data. The second printout shows the results of fitting the model
\[ Y = \beta_0 + \beta_1\,\text{Year} + \beta_2\,\text{Year}^2 \]
Is it worth adding the term \(\text{Year}^2\) to the model based on the data presented here? Answer this question using an appropriate test. Make sure you state the null and alternative hypotheses, the p-value for the test, and your conclusions.

Solution: First, I certainly agree with the sports fan that the winning long jump distance should level off eventually. The real issues are (a) has that leveling off already begun or is a linear model


OK in the range of data we have, and (b) is a parabolic model the right one to take into account the leveling off. The data does not curve too much, so I suspect the answer to (a) is that a linear model is OK for now. Logically, I think the answer to (b) is no: parabolas do not level off as X increases. Thus I suspect ahead of time that the term \(\text{Year}^2\) is not going to add much to this model. To check this, I need to do a t test to see whether \(\beta_2 = 0\). The null and alternative hypotheses are as usual

\(H_0: \beta_2 = 0\), i.e. \(\text{Year}^2\) does not add anything to the model beyond what was already given by Year
\(H_A: \beta_2 \neq 0\), i.e. \(\text{Year}^2\) does make a significant contribution to the model

From the curvilinear regression printout, the test statistic is \(t_{obs} = -1.59\) and the p-value is .163. Since this p-value is much larger than \(\alpha = .05\) we fail to reject the null hypothesis and conclude that \(\text{Year}^2\) adds nothing new to the model. It is not worth including when Year is already in the model. Note: Many people who took this exam tried to use an F test. This is a multiple regression problem. In multiple regression, an F test checks whether the variables collectively are useful. In this case, the F test is significant. However, that only tells us that at least one of Year and \(\text{Year}^2\) is useful. It says nothing specific about \(\text{Year}^2\).

Problem 3: Regression Assumptions

A residual plot from a simple linear regression analysis is shown below. It is followed by four statements about the error assumptions for this model. In each case, say whether the statement is correct. If the statement is not correct, give an appropriate statement about the error assumption referred to.

[Residual plot: residuals on the vertical axis (labeled Y) with a horizontal reference line at 0, plotted against X on the horizontal axis.]

(a) The mean 0 assumption is correct because there are approximately as many points above the line as below it.

Solution: This is FALSE. We need the points to be centered about the line for EVERY value of X, not just overall.
The mean 0 assumption is clearly violated for this plot. The residuals are positive, then negative, then positive again.

(b) The constant variance assumption is violated because there is a curved pattern to the points.


Solution: This is FALSE. The constant variance assumption has nothing to do with whether there is a curved pattern to the data. It has to do with whether the points have the same spread for each value of X. If we draw a band about these points it seems to be of roughly constant width. Thus the constant variance assumption is not violated.

(c) The errors for this data set are approximately normally distributed.

Solution: We can't really tell that from the residual plot. We would need a histogram or a normal quantile plot to determine this properly.

(d) A linear model is not appropriate for this data set because of the curved pattern in the points.

Solution: This is TRUE. The curved pattern in the points suggests that a polynomial model is probably more appropriate for this data.

Problem 4: Computer Chaos

You have been hired as a statistical consultant by a large hardware store. They are interested in knowing how their sales of fans depend on the weather. They have presented you with data from the previous summer. Their data consists of two variables: Y, the number of fans sold in each week, and X, the hottest temperature during that week. They have given you data for n = 12 weeks. During those weeks the average temperature was found to be \(\bar{X} = 80\) and the average number of fans sold per week was \(\bar{Y} = 160\). You have further managed to calculate from your data that SCP = 200, SSX = 100, and RMSE = 4. You have gotten sick of doing the calculations by hand and decided to use the computer. Unfortunately (what a shock) the program is malfunctioning and your printout has a lot of blanks. In this problem you will fill in the blanks and answer some questions for the hardware store. Note: It is possible to completely answer parts (b)-(g) even if you can't fill in a single number in the printout, so don't give up on them!!

(a) Below is a printout given by your computer. Fill in the blanks (____) with the appropriate numbers using the information given above.
I have left a blank page after this one on which to show your work and a suggested order for doing the calculations. Give at least a brief indication, either in formulas or words, of how you got the numbers. If there is a number you can't figure out, put in a symbol for it and show how you would get all the other numbers using the symbol.

    The regression equation is
    Fans = _____ + ______ Temperature

    Predictor     Coef    SE Coef   T         P
    Constant      ____    15        _______   1.000
    Temperature   ____    ____      _______   .000


    RMSE = _____    R-sq = ____    R-sq(adj) = .6857

    Analysis of Variance
    SOURCE       DF    Sum Squares   Mean Squares   F       P
    Regression   ___   ____          _____          _____   _____
    Error        ___   ____          _____
    Total        ___   560

(1) Find the estimated regression equation.
(2) Fill in RMSE and MSE.
(3) Fill in the degrees of freedom in the ANOVA table, and then the rest of the ANOVA table.
(4) Fill in \(R^2\).

Solution: I give the filled in printout below. I got the numbers in this order. First,
\[ b_1 = \frac{SCP}{SSX} = \frac{200}{100} = 2 \]
Second,
\[ b_0 = \bar{Y} - b_1 \bar{X} = 160 - (2)(80) = 0 \]
This lets us fill in the regression equation and the Coef column of the table below it. To get the missing entry in the SE Coef column we need to compute
\[ s_{b_1} = \frac{RMSE}{\sqrt{SSX}} = \frac{4}{\sqrt{100}} = .4 \]
Then we get the t ratios by dividing the Coef column values by the SE Coef column values to get 0 and 5.

We are given that RMSE = 4. Also, \(MSE = s^2_{Y|X} = 4^2 = 16\) in the ANOVA table. Then we can fill in the degrees of freedom. In a simple regression we have 1 degree of freedom for regression, n-2 = 12-2 = 10 for error, and n-1 = 12-1 = 11 for total.

Now we can fill in the various sums of squares. We know MSE = 16 and has 10 degrees of freedom. Since MSE = SSE/(n-2) we must have SSE = 16*10 = 160. Next, we note that SSR + SSE = SST and we know SST = 560 from the table. Thus SSR = 400. MSR = SSR/1 for simple regression


so MSR = 400 also. F = MSR/MSE = 400/16 = 25. The p-value for the F test in a simple linear regression is the same as that for the t-test of \(\beta_1\), so we fill in .000. This completes the ANOVA table.

Finally we must obtain \(R^2\). From the ANOVA table, \(R^2 = SSR/SST = 400/560 = 71.43\%\). The complete printout is below.

    The regression equation is
    Fans = 0 + 2 Temperature

    Predictor     Coef   SE Coef   T      P
    Constant      0      15        0.00   1.00
    Temperature   2      .4        5.00   0.00

    RMSE = 4    R-sq = 71.43%    R-sq(adj) = 68.57%

    Analysis of Variance
    SOURCE       DF   SS    MS    F    p
    Regression    1   400   400   25   .000
    Error        10   160    16
    Total        11   560

(b) Give the units and real-world interpretations of the regression coefficients \(\beta_0\) and \(\beta_1\). (Note: You do not need to quote the numbers to do this, though it may be helpful to do so if you know them.)

Solution: This question, especially the part about units, created some difficulties so take careful note of my answers. First, \(\beta_0\) is the average value of Y when X = 0. In real world terms this means that \(\beta_0\) represents the number of fans sold by the store WHEN THE TEMPERATURE IS 0 DEGREES. Since \(\beta_0\) is in essence a Y value it must have the same units as Y, in this case, fans per week. It was necessary to say this explicitly! (Note that we found \(b_0 = 0\), meaning we estimate that on average the store sells 0 fans when the temperature is 0 degrees. This makes perfect sense!)

\(\beta_1\) is the slope of the regression line and represents the change in Y associated with a one unit change in X. Here that means that \(\beta_1\) tells us how many extra fans we sell for each 1 degree that the temperature increases. Our best estimate is \(b_1 = 2\), meaning that we sell, on average, 2 extra fans for every additional degree of temperature. The units of \(b_1\) are always the units of Y divided by the units of X. This is because we must multiply X by \(b_1\) and come out with a Y value. Here those units are fans per week per degree.

(c) Is there a significant linear relationship between temperature and the number of fans sold by the store?
Answer this question by performing a t test. You must state the null and alternative hypotheses, both mathematically and in words, quote the p-value, and give your conclusions. You do not need to quote the test statistic and no calculations are required. Use \(\alpha = .05\).
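The numbers behind the printout in part (a), including the t ratio for temperature that the test in part (c) relies on, follow from the summary values given in the problem statement. A minimal Python sketch of that arithmetic, not part of the original exam and with illustrative variable names:

```python
import math

# Summary values given in the Problem 4 statement and partial printout.
n, x_bar, y_bar = 12, 80.0, 160.0
scp, ssx, rmse, sst = 200.0, 100.0, 4.0, 560.0
s_b0 = 15.0                             # SE of the intercept, printed

# (1) Estimated regression equation and t ratios.
b1 = scp / ssx
b0 = y_bar - b1 * x_bar
s_b1 = rmse / math.sqrt(ssx)
t_b0 = b0 / s_b0
t_b1 = b1 / s_b1

# (2)-(3) Degrees of freedom and the ANOVA table.
df_reg, df_err, df_tot = 1, n - 2, n - 1
mse = rmse ** 2
sse = mse * df_err
ssr = sst - sse
msr = ssr / df_reg
f_stat = msr / mse

# (4) R-squared and adjusted R-squared.
r_sq = ssr / sst
r_sq_adj = 1 - mse / (sst / df_tot)

print(f"Fans = {b0:g} + {b1:g} Temperature")
print(f"s_b1 = {s_b1:g}, t ratios = {t_b0:.2f} and {t_b1:.2f}")
print(f"SSE = {sse:g}, SSR = {ssr:g}, MSR = {msr:g}, F = {f_stat:g}")
print(f"R-sq = {100 * r_sq:.2f}%, R-sq(adj) = {100 * r_sq_adj:.2f}%")
```

The adjusted R-squared it produces matches the one entry (.6857) that the malfunctioning printout did show, which is a useful check on the reconstruction.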


Solution: Testing whether there is a significant relationship between fan sales and temperature is equivalent to testing whether \(\beta_1 = 0\). Our hypotheses are

\(H_0: \beta_1 = 0\), i.e. there is not a significant relationship between fan sales and temperature, or equivalently, temperature does not help explain the variability in fan sales.
\(H_A: \beta_1 \neq 0\), i.e. there is a significant linear relationship between temperature and fan sales, or equivalently, temperature does help explain the variability in fan sales.

From the printout in (a) the p-value for the t-test is .000. This is much smaller than \(\alpha = .05\) so we reject the null hypothesis and conclude that there is a significant linear relationship between fan sales and temperature. Knowing how warm it is does tell the store something about how many fans they will sell. This is hardly a surprise. However, knowing \(b_1\) does give them an idea of how many fans they should stock at any given time. Note that we could also have used an F test for this problem if I hadn't specified the t test!

(d) Calculate a 95% confidence interval for \(\beta_0\). Based on your interval, is \(\beta_0\) different from 0? Explain. (Note: If you couldn't get \(b_0\) in part (a), you may assume it is 1 for this part of the problem.) What does this interval tell you?

Solution: The formula for a confidence interval for \(\beta_0\) is
\[ b_0 \pm t_{\alpha/2,\,n-2}\; s_{b_0} \]
Here we found \(b_0 = 0\) in part (a), and \(s_{b_0} = 15\) from the printout in part (a). We have n = 12 data points and want a 95% confidence interval so we use \(t_{.025,10} = 2.228\). The resulting confidence interval is \(0 \pm (2.228)(15)\) or [-33.42, 33.42]. Since this interval includes 0 we cannot conclude that \(\beta_0\) is different from 0. In fact, since \(b_0\) was exactly 0 it is obvious that our data are consistent with \(\beta_0\) being 0! This makes perfect sense. \(\beta_0\) is the number of fans the store sells when the temperature outside is 0. Obviously if it is below freezing the store won't be selling any fans!

(e) Is temperature a good predictor of fan sales?
Quote the number that you use to determine this and briefly explain your reasoning.

Solution: This was one of the most frequently missed questions on the exam. Take careful note!! To tell if the regression is making good predictions we must look at RMSE. This quantity tells us the typical distance from the data points to the regression line and can be roughly interpreted as the average error we are making in guessing Y. If this value is small we are doing a good job and if it is large we are doing a bad job. Here RMSE = 4, which means our predictions are typically off by about 4 fans per week. We are told that the store sells \(\bar{Y} = 160\) fans in an average week, so only being off by 4 fans is very good: an error of about 2.5%. Note that we MUST compare RMSE to the Y values to tell if our predictions are good! An error of 4 is very small compared to sales of 160 but would be very large if we were only selling 5 fans a week! Also note that a high \(R^2\) does NOT prove our predictions are good. It says we have explained a high percentage of the variability in Y, but even a small percentage of unexplained variability can result in large errors from a practical point of view. Similarly, a very small p-value does not prove our predictions are good. It says our X variable is useful, that our predictions are much BETTER than if we didn't use


X, but it doesn't prove they are right. For an example simply see the homework warmup problem on electricity usage. We had a p-value of 0 but horrible predictions! Of course in that case we also had the wrong model, but the basic idea still holds.

(f) What percentage of the variability in fan sales is explained by the regression on temperature? Quote the number that you use to determine this and say whether you think the regression is doing a good job in this respect.

Solution: We use \(R^2\) or \(R^2_{adj}\) to find the percentage of variability explained by the regression. They have the same intuitive meaning but \(R^2_{adj}\) is a little more accurate because it takes degrees of freedom into account. We have \(R^2_{adj} = 68.57\%\), so the regression explains roughly two thirds of the variability in fan sales out of a possible 100%. This is pretty good (well over half) but not fabulous: values in the 80s or 90s are usually considered very high.

(g) A weather forecast says next week's temperature will soar to 100! Predict the number of fans you will sell next week. Suppose you want a range of possible values for the number of fans you will sell. Calculate the appropriate interval and explain your reasoning. (Use \(\alpha = .05\).) How many fans should you stock to be sure you have enough on hand?

Solution: For the prediction we simply plug X = 100 into the regression equation and find that we expect to sell \(\hat{Y} = 0 + 2(100) = 200\) fans. For the interval, since we are dealing with a single specific Y, namely next week's sales, rather than average sales when it is 100 degrees, we want a prediction interval. The basic formula for a prediction interval is
\[ \hat{Y}_0 \pm t_{\alpha/2,\,n-2}\; s_{Y|X} \sqrt{1 + \frac{1}{n} + \frac{(X_0 - \bar{X})^2}{SSX}} \]
We have \(t_{.025,10} = 2.228\), \(s_{Y|X} = 4\), n = 12, \(X_0 = 100\), \(\bar{X} = 80\), SSX = 100, so the margin of error is
\[ (2.228)(4)\sqrt{1 + \frac{1}{12} + \frac{(100 - 80)^2}{100}} = (2.228)(4)(2.2546) \approx 20.09 \]
The resulting interval is [179.91, 220.09]. To be sure we have enough fans in stock we should have 221, using the high end of the interval to be safe since this is the most we could plausibly sell, and rounding up since we can't sell part of a fan and 220 could be slightly too low.

Multiple Regression

Problem 5: College Tuition

A researcher at US Views and World Seaports is conducting a study about tuition at American colleges and universities. So far, he has collected data from 20 schools about their tuition costs, Y (in thousands of dollars), their score on an independent rating scale, \(X_1\) (in points out of 100), their size, \(X_2\) (in thousands of undergraduates), and whether they are a public (\(X_3 = 0\)) or private (\(X_3 = 1\)) school. A printout for the multiple regression of Y on the three X variables is shown below. Use it to answer the questions on the following pages.

Regression Analysis


    The regression equation is
    Tuition = - 2.41 + 0.0967 Rating - 0.0192 Size + 16.9 Type

    Predictor       Coef     StDev       T       P
    Constant     -2.4053    0.9257   -2.60   0.019
    Rating       0.09671   0.01172    8.25   0.000
    Size        -0.01923   0.01606   -1.20   0.249
    Type         16.8581    0.3357   50.21   0.000

    S = 0.5869    R-Sq = 99.7%    R-Sq(adj) = 99.6%

    Analysis of Variance
    Source           DF        SS       MS         F       P
    Regression        3   1742.29   580.76   1686.01   0.000
    Residual Error   16      5.51     0.34
    Total            19   1747.80

    Source   DF   Seq SS
    Rating    1   774.83
    Size      1    98.89
    Type      1   868.57

    Predicted Values
    Fit      StDev Fit   95.0% CI             95.0% PI
    20.236   0.402       ( 19.384, 21.088)    ( 18.728, 21.744)

(a) Interpret \(b_0\), \(b_1\), and \(b_3\) in terms of tuition costs, rating, and type of school.

Solution: The intercept, \(b_0\), gives the average tuition value, Y, for a public school (\(X_3 = 0\)) with no students (\(X_2 = 0\)) and a score of 0 (\(X_1 = 0\)) on the rating scale. Note that this does not really make sense since you would not have a school with no students. According to the printout, such a school would in fact charge negative tuition!

The coefficient of rating, \(b_1\), tells you how much increase (or decrease) in tuition is associated with a one unit increase in rating, ASSUMING size and type of school are held FIXED. A one unit increase in rating is associated with an extra $96.71 in tuition.

The coefficient of type of school, \(b_3\), has a different interpretation because type of school is an indicator variable. It makes no sense to talk about a "one unit increase in type of school." A school is


either public or private. What \(b_3\) gives you is the DIFFERENCE in tuition between a public and private school with the SAME size and rating. The private school costs about $16,900 more, all else equal.

(b) The author of the study wants to know whether the combination of rating, size, and type of school is OVERALL useful for predicting tuition costs. Perform an appropriate hypothesis test to answer this question. Your answer must include a statement of the null and alternative hypotheses, the rejection rule, the test statistic, the p-value, and an interpretation of your decision to reject or not reject. However, you may use the printout to obtain any numbers that you want; you need not perform any calculations.

Solution: We check the overall usefulness of the model using an F test. The null and alternative hypotheses are

\(H_0: \beta_1 = \beta_2 = \beta_3 = 0\) (i.e. none of the variables rating, size and type is useful for explaining the variability in schools' tuitions.)
\(H_A\): At least one of the \(\beta\)'s is not 0 (i.e. overall the model is useful: at least one of the predictors contains information about tuition costs)

Our test statistic is F = MSR/MSE = 1686.01, which seems very large. In this problem, \(\alpha = .05\), we have data on n = 20 schools, and there are k = 3 predictor variables in the model. The p-value for this test, from the printout, is .000, which is much less than \(\alpha = .05\). This reinforces our decision to reject the null hypothesis. We conclude that at least one of rating, size and type is useful in explaining the variability in tuition costs. This is hardly a surprise.

(c) There are two numbers that give, approximately, the percentage of the variability in college tuition that is explained by the regression on ratings, size, and type. What are these two numbers? Which is more appropriate, in the multiple regression setting, for measuring how good the model is? Explain your choice.

Solution: The two numbers that give the percentage of variability explained by a regression model are \(R^2\) and \(R^2_{adj}\).
These values are 99.7% and 99.6% respectively. In the multiple regression setting it is more appropriate to use \(R^2_{adj}\). Some people in the past have said this was because \(R^2_{adj}\) takes into account the degrees of freedom. This is true, and an important part of the answer, but it is not sufficient. \(R^2_{adj}\) also takes degrees of freedom into account in simple linear regression. The more important point is that \(R^2_{adj}\) helps prevent you from overfitting by penalizing you for using extra predictors that provide no real information. \(R^2\) will always go up as you add new predictors, whether they have anything to do with Y or not. This may make your model look much better than it really is. However, \(R^2_{adj}\) can actually decrease if the predictors you are adding are not worthwhile.

(d) Use the printout to compute a 95% confidence interval for \(\beta_2\). Explain what the resulting confidence interval tells you about the usefulness of size as a predictor of college tuition.

Solution: A 95% confidence interval for \(\beta_2\) has the form
\[ b_2 \pm t_{\alpha/2,\,n-k-1}\; s_{b_2} \]


Here α = .05, n = 20, and k = 3. From the t table, t_{.025,16} = 2.120. From the printout, b_2 = −.01923 and s_{b_2} = .01606. The resulting confidence interval is

−.01923 ± (2.120)(.01606) or [−.05328, .01482]

Note that you must keep the negative sign on the value of b_2! The confidence interval contains the value 0. This means that β_2 MIGHT equal 0. It does not mean β_2 = 0 for certain; it simply means, in essence, that we cannot reject the null hypothesis that β_2 is 0. Thus we conclude that size of school may not be useful for predicting tuition costs, ASSUMING THAT RATING AND TYPE OF SCHOOL ARE IN THE MODEL. It is quite possible that by itself size is a useful predictor.

(e) Does whether a school is public or private have a significant impact on tuition? Perform an appropriate hypothesis test to answer this question. You must write down the null and alternative hypotheses, and explain how you came to your conclusion, but you need not show all the steps of the test.

Solution: We need to perform a t test. The appropriate null and alternative hypotheses are

H_0: β_3 = 0 (i.e. there is no difference in tuition between public and private schools if rating and size have been taken into consideration.)
H_A: β_3 ≠ 0 (i.e. there is a difference in tuition between public and private schools even after size and rating have been accounted for.)

Note that you were NOT asked to show private schools are more expensive, so a one-sided test is not appropriate. Also, some people said that a public school has no impact because its X value is 0 but that a private school does have an impact on tuition. This makes no sense. For there to be a difference, there have to be two types of schools.
It matters which type you are, public or private. We wrote the model as "you pay this much more for a private school than a public one" but could just as easily have written "you pay this much less for a public school than for a private one."

The test statistic is t_obs = 50.21 and the corresponding p-value is .000, so we reject the null hypothesis and conclude that type of school is useful for predicting tuition, ASSUMING SIZE AND RATING ARE ALSO IN THE MODEL.

(f) A student is interested in attending the University of Southern North Dakota at Hoople, a private school which supposedly has 1000 students and a rating of 60 on the scoring system used in this study. Unfortunately, the student's parents are having a hard time finding out what the tuition at Southern North Dakota is. (This may have something to do with the fact that the school is completely fictitious! I will be impressed if you know where it comes from...) Use the printout to find (i) an estimate for tuition at Southern North Dakota, and (ii) a range of values in which you can be 95% certain that the true tuition lies. Explain why you chose the interval that you did.

Solution: The University of Southern North Dakota at Hoople was invented by Peter Schickele, a composer who writes humorous music purporting to be the work of P.D.Q. Bach, an illegitimate son of the famous Baroque composer J.S. Bach. SND at Hoople is the infamous P.D.Q. Bach's alma


mater. We can find an estimate of the tuition at SND by plugging into the estimated regression equation. To do so we need to know X_1, X_2 and X_3. We know X_3 = 1 since this was said to be a private school. We are also given that X_1 = 60 since the school has a rating of 60. Some people try to use a rating of .6. The ratings are NOT percentages. It is true that the maximum rating is 100, but I told you the scores were between 0 and 100, not between 0 and 1! Just because a variable lies between 0 and 100 does not mean you should automatically convert it to a percentage! The most confusion was caused by the value of X_2. In the problem statement, I told you that X_2 was measured in thousands of students. This means that if the school has 1000 students, X_2 = 1.

Ŷ = −2.41 + .0967(60) − .0192(1) + 16.9(1) ≈ 20.236

Thus the estimated tuition is $20,236. (Plugging in the rounded coefficients shown here gives about 20.27; the value 20.236 is the center of the prediction interval on the printout, which presumably uses unrounded coefficients.)

Since we are talking about an individual school we want a prediction interval for the tuition. (If we were interested in the tuition at the average private school with a rating of 60 and 1000 students we would use a confidence interval.) From the printout we are 95% sure that the tuition at SND at Hoople will be between $18,728 and $21,744.

(g) A colleague of the author of the study suggests that perhaps it would be worth adding another variable, X_4, the number of faculty members at a university, to improve the model. Explain why this is unlikely to be useful. (Hint: the correlation between number of students and number of faculty members is r = .95.)

Solution: Since we already know the number of students at the school and there is a very strong relationship between number of students and number of faculty members, having the number of faculty members probably won't tell us anything we didn't already know about tuition. Therefore it won't be a useful predictor.
In fact, because of the high correlation, adding the number of faculty members to our model will result in a multicollinearity problem, which may make some of our useful predictors appear useless and may also lead to bad predictions. It is definitely NOT a good idea to add this variable.

(6) When I Finish Summer School....I'm Going To StatisticsLand?!

In this problem Professor Sadisticus is trying to determine what factors affect attendance at his parks. He has recorded Y, the number of visitors (in millions) to each of his parks each quarter for the past 5 years. He has also recorded data on X_1, time (with time 1 being the first quarter, winter, five years ago), X_2, the price of tickets to the park (in dollars), X_3, the number of rides at the park, X_4, the size of the park (in acres), X_5, the population of the city in which the park is located (in hundreds of thousands of people), and X_6, the average temperature during the quarter (in degrees). He also has indicator variables for whether there were special discounts offered to local residents (X_7 = 1 if there was a discount and X_7 = 0 if there wasn't) and for the region of the country in which the park was located (X_8 = X_9 = 0 for the west coast, X_8 = 1, X_9 = 0 for the midwest, and X_8 = 0, X_9 = 1 for the east coast). He has fit a multiple regression of Y on these nine variables. Use the printout from his multiple regression and the accompanying summary statistics to answer the questions on the following pages.


Ȳ = 5

***************************************************************************
The regression equation is
Attendence = 2 + .25 Time - .2 Price + .01 Rides - .001 Size + .05 Population + .1 Temperature + 1 Discount - 2 Midwest - 1 East

Predictor       Coef      SE Coef    T        P
Constant        2.000     .500       4.00     .0001
Time            .250      .100       2.50     .0142
Price           -.200     .050       -4.00    .0001
Rides           .010      .004       2.50     .0142
Size            -.001     .002       -.50     .6180
Population      .050      .025       2.00     .0480
Temperature     .100      .048       2.08     .0396
Discount        1.000     .400       2.50     .0142
Midwest         -2.000    1.000      -2.00    .0480
East            -1.000    1.000      -1.00    .3196

S = .200   R-Sq = 95.6%   R-Sq(adj) = 95.24%

Analysis of Variance
Source            DF     SS        MS       F        P
Regression        9      95.60     10.62    265.5    0.000
Residual Error    110    4.40      .04
Total             119    100.00
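As a sanity check on how the summary lines of the printout relate to the ANOVA table, the following sketch recomputes F, S, R-Sq, and R-Sq(adj) from the sums of squares, and each t ratio as Coef / SE Coef. It is an illustration only: the variable names are invented here, the numbers are typed in from the printout, and agreement is only expected up to the rounding used in the printout.

```python
import math

# Sums of squares and degrees of freedom copied from the ANOVA table.
ss_regression, df_regression = 95.60, 9
ss_error, df_error = 4.40, 110
ss_total, df_total = 100.00, 119

ms_regression = ss_regression / df_regression   # MSR
ms_error = ss_error / df_error                  # MSE
f_stat = ms_regression / ms_error               # overall F statistic
s = math.sqrt(ms_error)                         # S (RMSE), in millions of visitors
r_sq = ss_regression / ss_total
r_sq_adj = 1 - (ss_error / df_error) / (ss_total / df_total)

# (Coef, SE Coef, printed T) for each row of the predictor table.
predictor_rows = {
    "Constant": (2.000, 0.500, 4.00),
    "Time": (0.250, 0.100, 2.50),
    "Price": (-0.200, 0.050, -4.00),
    "Rides": (0.010, 0.004, 2.50),
    "Size": (-0.001, 0.002, -0.50),
    "Population": (0.050, 0.025, 2.00),
    "Temperature": (0.100, 0.048, 2.08),
    "Discount": (1.000, 0.400, 2.50),
    "Midwest": (-2.000, 1.000, -2.00),
    "East": (-1.000, 1.000, -1.00),
}

# Rows whose printed T differs from Coef / SE Coef by more than rounding.
mismatches = [
    name
    for name, (coef, se, t_printed) in predictor_rows.items()
    if abs(coef / se - t_printed) > 0.005 + 1e-9
]

print(f"F = {f_stat:.1f}, S = {s:.3f}")
print(f"R-Sq = {r_sq:.1%}, R-Sq(adj) = {r_sq_adj:.2%}")
print("t ratios inconsistent with the printout:", mismatches)
```

The recomputed F differs slightly from the printed 265.5 because the printout divides the rounded MSR of 10.62 by .04.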


Note: EXCEPT WHERE INDICATED OTHERWISE YOU SHOULD USE α = .05 FOR ALL HYPOTHESIS TESTS ON THIS EXAM

(a) Does the model as a whole do a good job of explaining attendance at StatisticsLand parks? Answer this question by performing an appropriate hypothesis test. State the null and alternative hypotheses both mathematically and in words, give the test statistic and p-value, and explain your real-world conclusions.

Solution: To test whether the model as a whole is useful we need an overall F test. Our hypotheses are

H_0: β_1 = β_2 = ··· = β_9 = 0. None of time, price, etc. is helpful for explaining the attendance at StatisticsLand parks.
H_A: At least one β_i ≠ 0. At least one of time, price, etc. is useful for explaining attendance. The model as a whole does a good job of explaining the number of visitors at the theme parks.

Our test statistic is F_obs = 265.5 and the corresponding p-value from the ANOVA table is 0.000. Since this is less than our significance level of α = .05 we reject the null hypothesis. We conclude (not surprisingly) that at least one of the nine variables does help explain the attendance at the parks.

(b) Does this model do a good job of predicting attendance at StatisticsLand parks? Carefully justify your answer.

Solution: To decide whether a model makes good predictions we must look at our typical error, RMSE, and compare it to the Y values we are trying to predict. Here RMSE = .2, meaning that we are typically off by about 200,000 people when we try to predict the number of visitors to a StatisticsLand park in a given quarter. Since we average Ȳ = 5, or 5 million visitors, a typical error of 200,000 is only about 4% of the average, so the predictions look quite good.

(c) Give the real-world interpretations of b_3 and b_7 in this model. Your answer should include the actual numerical values and units as appropriate.

Solution: The coefficient for the Rides variable is b_3 = .01. This means that for every additional ride at the park, attendance goes up by .01 million = 10,000 people, assuming all the other variables are held fixed.
The coefficient for the discount variable is b_7 = 1. This tells us that on average, when there is a discount being offered, 1 million more people come to the park per quarter than when there isn't a discount, assuming all other variables are held fixed.

(d) Based on the correlation table given above the regression printout, should X_4, the size of the park, be a good predictor of attendance? Should its coefficient, β_4, be positive or negative? Explain briefly in each case.

Solution: The size variable has a quite strong correlation with attendance, Y, so it should by itself be a good predictor. Since the correlation is positive we would expect b_4 to be positive as well, indicating that as size increases so does attendance. This


makes real-world sense. The bigger the park is, the more people it can hold.

(e) Are b_4, the estimated regression coefficient for X_4, and its p-value consistent with your expectations from part (d)? Justify your answer, and, if there is a lack of consistency, say what has gone wrong, giving evidence from the data to support your argument.

Solution: No, the coefficient b_4 is negative and its p-value of .6180 is well above .05, suggesting that this variable is not useful WHEN ALL THE OTHER VARIABLES ARE IN THE MODEL. This is probably a multicollinearity issue. From the correlation table we see that Size is also highly correlated with the number of rides (.9), which is not a surprise since to have a lot of rides you need to have a lot of space! Since the rides variable is more highly correlated with attendance, this is the one that has stayed significant in the model.

(f) Does it appear that there are statistically significant differences in park attendance in the three regions of the country? If so, in which region(s) is attendance the highest? Briefly justify your answers.

Solution: We need to look at the p-values for the indicator variables for the different regions. The West is our reference region. The Midwest indicator has a p-value of .048, meaning attendance there is significantly different from attendance in the West. From the coefficient of −2 we see that on average attendance there is 2 million people lower per quarter. However, the p-value for the East region is .3196, so there is insufficient evidence to show that attendance is different in the East than in the West. Thus we conclude that the West and East have the highest attendance. It appears, since East's coefficient is negative, that West is the highest, but we do not have sufficient evidence to say this for sure.

(g) In the accompanying graphics file are a residual plot and a histogram of the errors for this multiple regression model.
Use them to explain whether or not each of the four assumptions we make about the errors in a regression model is violated and why. (Hint: Recall that you judge the regression assumptions for a residual plot in multiple regression the same way you would evaluate them using a scatterplot in simple linear regression.)

Solution: The residual plot shows that the errors are centered about the 0 line for all X and there is no curved pattern. Thus the independence and mean 0 assumptions are satisfied. However, the constant variance assumption is violated: the errors get larger as X (time) gets larger. We judge normality from the histogram; it looks symmetrical and hump-shaped, so the normality assumption seems reasonable.

(h) On average, how much would you expect attendance at a theme park to decrease in a quarter if you raised the ticket prices by $5, all other things being equal? Briefly explain your reasoning.


Solution: The coefficient for the price is −.2, meaning that for every extra dollar charged we get a decrease in attendance of 200,000 people, all else being equal. Thus a $5 increase would correspond to a loss of 1 million people.

(i) Find the predicted attendance for the Los Seraphim theme park this summer (that is, summer of the first year after the recorded data). Los Seraphim is a west coast city with a population of 5 million people and an average summer temperature of 70 degrees. You may assume that the park has 50 rides, has 50 acres of space, that admission is $45, and that there are currently no discounts being offered.

Solution: We just plug into the regression equation, being careful of our units. We know the data has been measured quarterly for 5 years, so there have been 20 time points before this year. Summer is the 3rd quarter of the year (winter, spring, summer, fall), so this must be time point 23. The population is measured in hundreds of thousands of people, so we have X_5 = 50 for 5 million people. There is no discount offered, so X_7 = 0, and the park is in the west, so X_8 = X_9 = 0. The rest of the numbers can be plugged in as they are given, yielding

Ŷ = 2 + .25(23) − .2(45) + .01(50) − .001(50) + .05(50) + .1(70) + 1(0) − 2(0) − 1(0) = 8.7

Thus we predict the park will have 8.7 million customers this summer.

(j) Professor Sadisticus is planning to open a new theme park in the city of Hollybrick. He knows it will take a while for the park to become profitable but would like to be 95% sure that in total over the next 10 years attendance will be high enough so that he does not lose money on it. (i) What sort of interval should he use when predicting quarterly attendance at the park to find his projected profits over this period? (ii) Do you see any potential problems with the predictions he is making? Briefly explain your reasoning in each case. No calculations are required.

Solution: Prof.
Sadisticus should use a confidence interval, not a prediction interval. He isn't interested in what happens in one particular quarter; he wants to know on average over the long haul how much money he will make. The problem with his predictions is that he is extrapolating 10 years into the future, way outside the range of his data. There is no guarantee his parks will continue to grow at the same rate for such a long time.

(k) Professor Sadisticus has a theory that as it gets hotter more people come to his theme parks. Because of this he is considering adding a new water ride, the Random Splatter, and setting up ice-cream stands on hot days, but before he does this he


wants to be sure that his theory is correct. Perform the appropriate hypothesis test to prove Professor Sadisticus' theory. Be sure you state your null and alternative hypotheses both mathematically and in words with a justification of your choice, give the test statistic and p-value, and explain your real-world conclusions.

Solution: Here we are being asked to do a 1-sided test because he wants to prove that as temperature goes up attendance goes up. This must be our alternative hypothesis, so we have

H_0: β_6 ≤ 0. Once the other variables have been taken into account, temperature has a negative or no relationship with attendance.
H_A: β_6 > 0. After accounting for the other factors, temperature has a positive relationship with attendance.

Our test statistic is t_obs = 2.08. Since the test is 1-sided we have to divide the p-value from the printout in half, yielding .0396/2 = .0198. Since this is less than our significance level of .05 we reject the null hypothesis and conclude that higher temperatures are associated with higher attendance even after accounting for all the other variables.

(l) Silly Sally, a summer intern at StatisticsLand, is very excited by the results of your test in part (k). She concludes that hotter weather causes people to visit your theme parks and proposes that new parks should be opened in desert areas like Arizona or Saharan Africa. Explain what is wrong with (i) Sally's conclusion and (ii) her proposal. Your answer should include an example of why Sally's conclusion might be wrong.

Solution: Sally is making several mistakes. First, correlation is not necessarily causation. People may be visiting the parks because it is summer and that is when they are on vacation, not because it is hot outside. Her proposal is bad because even if warm weather brought people to the parks, extreme heat as in the desert might not. Models are only useful for making predictions about situations similar to the data on which they were built.
Saharan Africa, for instance, is very different from any of the places in the US where Prof. Sadisticus has his parks, and our model will not be reliable there. In fact, attendance at parks in such an area would probably be very low.

(7) Attendance Trends: Professor Sadisticus has decided to look more closely at patterns of attendance at his Los Seraphim park. He has recorded monthly attendance for 60 months and has fit three regression models of attendance on time to help him evaluate the shape of the attendance trend. Partial regression printouts are given below.

(a) Is the quadratic trend superior to the simple linear trend? Justify your answer by performing an appropriate hypothesis test. You do not need to write out all the details; just explain your basic reasoning and circle the p-value you are using on the printout.


Solution: We need to look at the p-value of the time-squared term in Model 2. We see that the p-value is .0150, which is less than α = .05. This allows us to conclude that adding time-squared to the model is useful, or equivalently that a curvilinear model is better than a simple linear model.

(b) Rank the trend models from best to worst. Explain briefly what numbers you are using to make your decisions.

Solution: Model 3 is the best, followed by Model 2, and the worst is Model 1. Model 3 has the highest R^2_adj, meaning it explains the most variability, and the smallest RMSE, meaning it makes the best predictions. Model 2 is second best in these categories. Note that we can't really compare the p-values because the models don't all have the same number of variables, and we were not given the F statistics.

***********************************************************
Model 1:
The regression equation is
Attendence = 3 + .25 Time

Predictor       Coef     SE Coef    T        P
Constant        3.00     .750       4.00     .0001
Time            .25      .100       2.00     .0500

RMSE = .4   R-sq = 65%   R-sq(adj) = 64.6%

***********************************************************
Model 2:
The regression equation is
Attendence = 3 + .3 Time - .01 Time^2

Predictor       Coef     SE Coef    T        P
Constant        3.00     .750       4.00     .0001
Time            .30      .200       1.50     .1390
Time-squared    -.01     .004       -2.50    .0150

RMSE = .3   R-sq = 80%   R-sq(adj) = 79.7%

***********************************************************
Model 3:


The regression equation is
Attendence = 6 - 15 (1/Time)

Predictor       Coef      SE Coef    T        P
Constant        6.00      1.00       6.00     .0000
1/Time          -15.00    6.82       -2.20    .0320

RMSE = .2   R-sq = 95%   R-sq(adj) = 94.4%
***********************************************************
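The ranking in part (b) can be reproduced mechanically from the three partial printouts. The sketch below is only an illustration: the dictionary layout and labels are invented here, and the numbers are typed in from the printouts. It sorts the models by adjusted R-squared (higher is better) and by RMSE (lower is better) and checks that the two criteria agree.

```python
# RMSE and R-sq(adj) (in percent) copied from the three partial printouts.
models = {
    "Model 1 (linear)": {"rmse": 0.4, "r_sq_adj": 64.6},
    "Model 2 (quadratic)": {"rmse": 0.3, "r_sq_adj": 79.7},
    "Model 3 (1/Time)": {"rmse": 0.2, "r_sq_adj": 94.4},
}

# Best-to-worst by adjusted R-squared: larger values explain more variability.
by_r_sq_adj = sorted(models, key=lambda m: models[m]["r_sq_adj"], reverse=True)

# Best-to-worst by RMSE: smaller values mean smaller typical prediction errors.
by_rmse = sorted(models, key=lambda m: models[m]["rmse"])

criteria_agree = by_r_sq_adj == by_rmse

print("ranked by R-sq(adj):", by_r_sq_adj)
print("ranked by RMSE:     ", by_rmse)
print("criteria agree:", criteria_agree)
```

When the two criteria disagree, the choice depends on whether explaining variability or prediction accuracy matters more; here they happen to give the same order.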
