
Homework 2 Solutions

STA 106, Winter 2012

1.) a. r = 5
n.i = 5
n.t = r * n.i
T = (1 / sqrt(2)) * qtukey(.95, r, n.t - r)
S = sqrt((r - 1) * qf(.95, r - 1, n.t - r))
g = c(2, 5, 10)
B = qt(1 - (.05 / (2 * g)), n.t - r)

T = 2.99, S = 3.39, B(g = 2) = 2.42, B(g = 5) = 2.84, B(g = 10) = 3.15

Here we see that when there are only two comparisons Bonferroni performs the best, but when the number of comparisons is 10, as it would be if we conducted every pairwise comparison, Bonferroni performs the worst. From this we can generalize that Bonferroni should be used when the number of comparisons of interest is small.

b. r = 5
n.i = 20
#The rest of the code is the same as above

T = 2.78, S = 3.14, B(g = 2) = 2.28, B(g = 5) = 2.63, B(g = 10) = 2.87

When we increase the sample size we see that Scheffé is now the worst in all cases. In fact, at 10 comparisons Bonferroni is nearly as good as Tukey. We can then generalize that Bonferroni performs very efficiently for large sample sizes and few comparisons, but if the number of comparisons is large, Tukey remains the best option.

2.) All three procedures are of the form “estimator ± statistic × SE.” The only difference between the intervals is the choice of statistic: T, S, or B. The smaller the statistic, the more precise the interval. Since all three properly control the Type I error rate, and the choice of statistic does not depend on the observed data (i.e. we’re not data snooping), it is appropriate to choose the smallest interval.
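As a minimal sketch of this rule (reusing the r = 5, n.i = 5 setting of problem 1a, with all g = 10 pairwise comparisons), picking the smallest statistic can be automated:

```r
# Compare the three multipliers for r = 5 groups of size 5 and g = 10
# pairwise comparisons; none depends on the observed data, so the
# smallest may be chosen in advance.
r <- 5; n.i <- 5; n.t <- r * n.i; g <- 10
mult <- c(T = qtukey(.95, r, n.t - r) / sqrt(2),
          S = sqrt((r - 1) * qf(.95, r - 1, n.t - r)),
          B = qt(1 - .05 / (2 * g), n.t - r))
round(mult, 2)           # T = 2.99, S = 3.39, B = 3.15, as in problem 1a
names(which.min(mult))   # "T": Tukey gives the narrowest intervals here
```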

3.) r = 2
n.i = 10
n.t = r * n.i
g = 1
#The rest of the code is the same as above

T = 2.88, S = 2.88, B = 2.88

In the case where we are comparing two groups (i.e. a two-sample t-test) all three procedures are equivalent and return the critical value of a standard two-sample t-test. This is because when r = 2 the error degrees of freedom, n.t − r, equal the n1 + n2 − 2 degrees of freedom of the pooled two-sample test, so each statistic reduces to the same t quantile.
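A minimal sketch of this equivalence at the 95% level (the 2.88 above matches the 99% level, since qt(.995, 18) ≈ 2.878):

```r
# With r = 2 groups, Tukey, Scheffe, and Bonferroni (g = 1) all reduce
# to the pooled two-sample t critical value on n1 + n2 - 2 = 18 df.
r <- 2; n.i <- 10; n.t <- r * n.i
T.stat <- qtukey(.95, r, n.t - r) / sqrt(2)
S.stat <- sqrt((r - 1) * qf(.95, r - 1, n.t - r))
B.stat <- qt(1 - .05 / 2, n.t - r)          # g = 1 comparison
t.2samp <- qt(1 - .05 / 2, n.i + n.i - 2)   # standard two-sample t
round(c(T.stat, S.stat, B.stat, t.2samp), 3)  # all approximately 2.101
```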

4.) a. #READ IN DATA
rehab = read.table(file = "http://br312.math.tntech.edu/.../CH16PR09.txt")
#RENAME THE VARIABLES
names(rehab) = c("time", "status", "ID")
#LOAD GPLOTS PACKAGE
library(gplots)
#MAKE MAIN EFFECTS PLOT
plotmeans(rehab$time ~ as.factor(rehab$status))



It is also acceptable if you made a “line plot” here.

b. #FIT ANOVA MODEL
model = aov(rehab$time ~ as.factor(rehab$status))
#INPUT BASIC INFORMATION FROM THE DATA
r = 3
n.i = c(8, 10, 6)
n.t = sum(n.i)
#OBTAIN GROUP MEANS
#I use "unique" so I only end up with one value for each group
#and I use "round" to avoid issues with computer error
ybar.i = unique(round(model$fitted.values, 2))
#CALCULATE MSE - REMEMBER MSE IS JUST THE SUM OF THE SQUARED RESIDUALS
#DIVIDED BY THE DEGREES OF FREEDOM
MSE = sum(model$residuals^2)/(n.t - r)
SE = sqrt(MSE / n.i[2])
#CALCULATE THE t-STATISTIC
t = qt(1 - .01 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(ybar.i[2] - t * SE, ybar.i[2] + t * SE)

99% CI for group 2: (28.01497, 35.98503)

c. #FIND POINT ESTIMATES
D.1 = ybar.i[2] - ybar.i[3]
D.2 = ybar.i[1] - ybar.i[2]
#CALCULATE B
g = 2
B = qt(1 - (.05 / (2 * g)), n.t - r)

#CALCULATE STANDARD ERRORS
SE.1 = sqrt(MSE * (1/n.i[2] + 1/n.i[3]))
SE.2 = sqrt(MSE * (1/n.i[1] + 1/n.i[2]))
#CALCULATE CONFIDENCE INTERVALS
c(D.1 - B * SE.1, D.1 + B * SE.1)
c(D.2 - B * SE.2, D.2 + B * SE.2)

95% Family Confidence Intervals for:
µ2 − µ3 : (2.452, 13.548)
µ1 − µ2 : (0.904, 11.096)

d. T = (1 / sqrt(2)) * qtukey(.95, r, n.t - r)

T = 2.52 while B = 2.41. Since we are only doing two comparisons, the Bonferroni method is more efficient in this case.

e. If the researcher chose to estimate D3 = µ1 − µ3 as well, the Bonferroni statistic would need to be recalculated, but the Tukey statistic would remain the same.

f. NOTE: Until now I have been calculating all pairwise comparisons directly. Here I will use the built-in function to compute all Tukey intervals. Similarly, all Bonferroni intervals, and several other methods not covered in this course, can be calculated with the function pairwise.t.test.

TukeyHSD(model)

Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = rehab$time ~ as.factor(rehab$status))

$`as.factor(rehab$status)`
    diff        lwr        upr     p adj
2-1   -6  -11.32141 -0.6785856 0.0253639
3-1  -14  -20.05870 -7.9413032 0.0000254
3-2   -8  -13.79322 -2.2067778 0.0060547

No interval contains 0; thus, all the means are significantly different. That is, the recovery time for each initial fitness status is significantly different from the other two.

5.) a. #READ IN DATA
cash = read.table(file = "http://br312.math.tntech.edu/.../CH16PR10.txt")
#RENAME VARIABLES
names(cash) = c("offer", "age", "ID")
#MAKE PLOT
plotmeans(cash$offer ~ as.factor(cash$age))



b. #FIT ANOVA MODEL
model = aov(cash$offer ~ as.factor(cash$age))
#INPUT BASIC INFORMATION FROM THE DATA
r = 3
n.i = c(12, 12, 12)
n.t = sum(n.i)
#OBTAIN GROUP MEANS
ybar.i = unique(round(model$fitted.values, 2))
#CALCULATE MSE
MSE = sum(model$residuals^2)/(n.t - r)
SE = sqrt(MSE / n.i[1])
#CALCULATE THE t-STATISTIC
t = qt(1 - .01 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(ybar.i[1] - t * SE, ybar.i[1] + t * SE)

99% CI for the mean offer for young individuals: (20.25496, 22.74504)

c. #OBTAIN POINT ESTIMATE
D.hat = ybar.i[3] - ybar.i[1]
#CALCULATE STANDARD ERROR
SE = sqrt(MSE * (1 / n.i[3] + 1 / n.i[1]))
#COMPUTE t-STATISTIC
t = qt(1 - .01 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(D.hat - t * SE, D.hat + t * SE)

99% CI for D = µ3 − µ1: (-1.840755, 1.680755)

We are 99% confident that the true value of the difference in mean offers lies within the given interval. Since this interval contains zero, or no difference in the means, there is no significant evidence to suggest that younger and older individuals receive different offers.

d. #OBTAIN POINT ESTIMATE
L.hat = 2 * ybar.i[2] - ybar.i[1] - ybar.i[3]
#CALCULATE STANDARD ERROR
SE = sqrt(MSE * (4/n.i[2] + 1/n.i[1] + 1/n.i[3]))
#COMPUTE TEST STATISTIC
t = L.hat/SE
#DETERMINE CRITICAL VALUE
t.crit = qt(1 - .01/2, n.t - r)

The method for solving this problem becomes much clearer when we rewrite the null hypothesis in the following way:

H0 : µ2 − µ1 = µ3 − µ2
⇒ H0 : 2µ2 − µ1 − µ3 = 0
H1 : 2µ2 − µ1 − µ3 ≠ 0

Decision Rule: fail to reject H0 if |t*| ≤ 2.733; reject H0 if |t*| > 2.733.

We get t* = 11.275, so we reject H0. There is significant evidence to suggest that the mean offer for the young and old age groups differs from that for the middle age group.

6.) (a) Let L = µ1 + µ2 − 2µ3.

#READ IN DATA
prod = read.table(file = "http://br312.math.tntech.edu/.../CH16PR07.txt")
#RENAME VARIABLES
names(prod) = c("prod", "exp", "ID")
#FIT ANOVA MODEL
model = aov(prod$prod ~ as.factor(prod$exp))
#INPUT BASIC INFORMATION FROM THE DATA
r = 3
n.i = c(9, 12, 6)
n.t = sum(n.i)
#OBTAIN GROUP MEANS
ybar.i = unique(round(model$fitted.values, 2))
#OBTAIN POINT ESTIMATE
L.hat = ybar.i[1] + ybar.i[2] - 2*ybar.i[3]
#CALCULATE STANDARD ERROR
MSE = sum(model$residuals^2)/(n.t - r)
SE = sqrt(MSE * (1 / n.i[1] + 1 / n.i[2] + 4 / n.i[3]))
#COMPUTE t-STATISTIC
t = qt(1 - .05 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(L.hat - t * SE, L.hat + t * SE)

The 95% confidence interval for L is given by: (-4.922284, -1.857716). This interval does not contain zero, so there is strong evidence to suggest a difference in the mean productivity improvement between firms with low or moderate research and development expenditures and firms with high expenditures.

(b) Let L = (9/27)µ1 + (12/27)µ2 + (6/27)µ3.

#OBTAIN POINT ESTIMATE
L.hat = (3/9)*ybar.i[1] + (4/9)*ybar.i[2] + (2/9)*ybar.i[3]
#CALCULATE STANDARD ERROR
SE = sqrt(MSE * ((3/9)^2 / n.i[1] + (4/9)^2 / n.i[2] + (2/9)^2 / n.i[3]))

#COMPUTE t-STATISTIC
t = qt(1 - .05 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(L.hat - t * SE, L.hat + t * SE)

We are 95% confident that the mean overall productivity lies within the interval (7.633330, 8.268892).

(c) #OBTAIN POINT ESTIMATES
D1.hat = ybar.i[3] - ybar.i[2]
D2.hat = ybar.i[3] - ybar.i[1]
D3.hat = ybar.i[2] - ybar.i[1]
L4.hat = .5 * (ybar.i[1] + ybar.i[2]) - ybar.i[3]
#CALCULATE STANDARD ERRORS
SE1 = sqrt(MSE * (1 / n.i[3] + 1 / n.i[2]))
SE2 = sqrt(MSE * (1 / n.i[3] + 1 / n.i[1]))
SE3 = sqrt(MSE * (1 / n.i[2] + 1 / n.i[1]))
SE4 = sqrt(MSE * (.5^2 / n.i[1] + .5^2 / n.i[2] + 1 / n.i[3]))
#COMPUTE S
S = sqrt((r - 1) * qf(.90, r - 1, n.t - r))
#CALCULATE CONFIDENCE INTERVALS
c(D1.hat - S * SE1, D1.hat + S * SE1)
c(D2.hat - S * SE2, D2.hat + S * SE2)
c(D3.hat - S * SE3, D3.hat + S * SE3)
c(L4.hat - S * SE4, L4.hat + S * SE4)

The 90% family confidence intervals are given by:

D1 : (0.169, 1.971)
D2 : (1.370, 3.270)
D3 : (0.455, 2.045)
L4 : (−2.531, −0.859)

No interval contains 0, so all differences are significant. That is, there is strong evidence to suggest that the mean productivity improvement for every group of firms differs from the other two. Furthermore, the interval for L4 confirms our results from part (a).

7.) In order to solve this problem we need to make a couple of assumptions. First, we assume that the sample size ni is the same for every group; denote ni = n, so that nT = 3n. Further, we assume that the estimate of MSE from the data is the true value. A Tukey interval has the form,

D̂ ± T × s(D̂)

Now set the margin of error, the term T × s(D̂) above, equal to 3 and let α = .05. We also have from the data that MSE = 3.8. We get,

T × s(D̂) = 3
(1/√2) q(1 − α; r, nT − r) √(MSE (1/n1 + 1/n2)) = 3
(1/√2) q(.95; 3, 3n − 3) √(3.8 (1/n + 1/n)) = 3
q(.95; 3, 3n − 3) (1/√n) = 1.539
q(.95; 3, 3n − 3) (1/√n) − 1.539 = 0



If the left-hand side above is negative for a given value of n, then that sample size satisfies our minimum required precision. Trying several values, we find that n = 6 is the smallest value for which it is negative.
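The trial-and-error search can be sketched directly (assuming, as above, r = 3, α = .05, and MSE = 3.8; margin.gap is a hypothetical helper name):

```r
# Margin of error of a Tukey pairwise interval minus the target 3,
# as a function of the common group size n (r = 3, alpha = .05, MSE = 3.8).
margin.gap <- function(n) {
  qtukey(.95, 3, 3 * n - 3) / sqrt(2) * sqrt(3.8 * (1/n + 1/n)) - 3
}
n <- 2:10
round(sapply(n, margin.gap), 3)    # gap shrinks as n grows
min(n[sapply(n, margin.gap) < 0])  # smallest adequate sample size: 6
```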
