Homework 2 Solutions
STA 106, Winter 2012
1.) a. r = 5
n.i = 5
n.t = r * n.i
T = (1 / sqrt(2)) * qtukey(.95, r, n.t - r)
S = sqrt((r - 1) * qf(.95, r - 1, n.t - r))
g = c(2, 5, 10)
B = qt(1 - (.05 / (2 * g)), n.t - r)
T = 2.99, S = 3.39, B(g = 2) = 2.42, B(g = 5) = 2.84, B(g = 10) = 3.15
Here we see that when there are only two comparisons Bonferroni performs the best, but when the
number of comparisons is 10, as there would be if we were to conduct every pairwise comparison,
Bonferroni performs the worst. From this we can generalize that Bonferroni should be used when
the number of comparisons of interest is small.
b. r = 5
n.i = 20
#The rest of the code is the same as above
T = 2.78, S = 3.14, B(g = 2) = 2.28, B(g = 5) = 2.63, B(g = 10) = 2.87
When we increase the sample size we see that Scheffe is now the worst in all cases. In fact, at 10
comparisons Bonferroni is nearly as good as Tukey. We can then generalize that Bonferroni performs
very efficiently for large sample sizes and few comparisons, but if the number of comparisons is large
Tukey remains the best option.
2.) All three procedures are of the form “estimator ± statistic × SE.” The only difference between the
intervals is the choice of statistic: T, S, or B. The smaller the statistic, the more precise the interval.
Since all three properly control the family-wise Type I error rate, and the choice of statistic does not
depend on the observed data (i.e. we are not data snooping), it is appropriate to choose the smallest interval.
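Since all three multipliers are fixed once r, the group sizes, and g are known, they can be computed side by side and the smallest chosen up front. Below is a sketch in Python using scipy (mirroring the R calls qtukey, qf, and qt used above); the helper name multipliers is my own, not part of any library:

```python
from scipy import stats

def multipliers(r, n_i, g, alpha=0.05):
    """Tukey (T), Scheffe (S), and Bonferroni (B) multipliers for r groups
    of n_i observations each, with g planned comparisons."""
    df = r * n_i - r  # error degrees of freedom, n.t - r
    T = stats.studentized_range.ppf(1 - alpha, r, df) / 2 ** 0.5
    S = ((r - 1) * stats.f.ppf(1 - alpha, r - 1, df)) ** 0.5
    B = stats.t.ppf(1 - alpha / (2 * g), df)
    return T, S, B

# Problem 1a: r = 5, n.i = 5, g = 2 comparisons -> Bonferroni is smallest
T, S, B = multipliers(5, 5, 2)
```

Because none of these quantities depend on the observed responses, picking min(T, S, B) before looking at the data is legitimate.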
3.) r = 2
n.i = 10
n.t = r * n.i
g = 1
#The rest of the code is the same as above, but with alpha = .01
T = 2.88, S = 2.88, B = 2.88
In the case where we are comparing only two groups (i.e. a two-sample t-test) all three procedures
are equivalent and return exactly the critical value of a standard pooled two-sample t-test, which has
the same n1 + n2 - 2 = n.t - r = 18 degrees of freedom.
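The three-way agreement for r = 2 can be checked numerically. Below is a sketch in Python using scipy, assuming α = .01 (the value that reproduces the 2.88 reported above); the identities behind it are that F(1, df) is the square of a t(df) variable and that the studentized range of two groups is √2 times a t variable:

```python
from scipy import stats

alpha, df = 0.01, 18  # two groups of 10: df = n.t - r = 18

B = stats.t.ppf(1 - alpha / 2, df)                          # Bonferroni, g = 1
S = stats.f.ppf(1 - alpha, 1, df) ** 0.5                    # Scheffe, r = 2
T = stats.studentized_range.ppf(1 - alpha, 2, df) / 2**0.5  # Tukey, r = 2

# All three collapse to the pooled two-sample critical value qt(.995, 18).
```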
4.) a. #READ IN DATA
rehab = read.table(file = "http://br312.math.tntech.edu/.../CH16PR09.txt")
#RENAME THE VARIABLES
names(rehab) = c("time", "status", "ID")
#LOAD GPLOTS PACKAGE
library(gplots)
#MAKE MAIN EFFECTS PLOT
plotmeans(rehab$time ~ as.factor(rehab$status))
It is also acceptable if you made a “line plot” here.
b. #FIT ANOVA MODEL
model = aov(rehab$time ~ as.factor(rehab$status))
#INPUT BASIC INFORMATION FROM THE DATA
r = 3
n.i = c(8, 10, 6)
n.t = sum(n.i)
#OBTAIN GROUP MEANS
#I use "unique" so I only end up with one value for each group
#and I use "round" to avoid issues with computer error
ybar.i = unique(round(model$fitted.values, 2))
#CALCULATE MSE - REMEMBER MSE IS JUST THE SUM OF THE SQUARED RESIDUALS
#DIVIDED BY THE DEGREES OF FREEDOM
MSE = sum(model$residuals^2)/(n.t - r)
SE = sqrt(MSE / n.i[2])
#CALCULATE THE t-STATISTIC
t = qt(1 - .01 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(ybar.i[2] - t * SE, ybar.i[2] + t * SE)
99% CI for group 2: (28.01497, 35.98503)
c. #FIND POINT ESTIMATES
D.1 = ybar.i[2] - ybar.i[3]
D.2 = ybar.i[1] - ybar.i[2]
#CALCULATE B
g = 2
B = qt(1 - (.05 / (2 * g)), n.t - r)
#CALCULATE STANDARD ERRORS
SE.1 = sqrt(MSE * (1/n.i[2] + 1/n.i[3]))
SE.2 = sqrt(MSE * (1/n.i[1] + 1/n.i[2]))
#CALCULATE CONFIDENCE INTERVALS
c(D.1 - B * SE.1, D.1 + B * SE.1)
c(D.2 - B * SE.2, D.2 + B * SE.2)
95% Family Confidence Intervals for:
µ2 − µ3 : (2.452, 13.548)
µ1 − µ2 : (0.904, 11.096)
d. T = (1 / sqrt(2)) * qtukey(.95, r, n.t - r)
T = 2.52 while B = 2.41. Since we are only doing two comparisons, the Bonferroni method is more
efficient in this case.
e. If the researcher chose to estimate D3 = µ1 − µ3 as well, the Bonferroni statistic would need to be
recalculated, but the Tukey statistic would remain the same.
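This can be seen numerically; the sketch below (Python with scipy, using df = n.t − r = 21 from this problem) shows that moving from g = 2 to g = 3 inflates the Bonferroni multiplier, while the Tukey multiplier, which already covers every pairwise difference, is untouched:

```python
from scipy import stats

df = 21  # n.t - r = 24 - 3 for the rehab data

T = stats.studentized_range.ppf(0.95, 3, df) / 2 ** 0.5  # same for any g
B2 = stats.t.ppf(1 - 0.05 / (2 * 2), df)  # g = 2 planned comparisons
B3 = stats.t.ppf(1 - 0.05 / (2 * 3), df)  # g = 3: must be recalculated
```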
f. NOTE: Until now I have been calculating all pairwise comparisons directly. Here I will use the built-in
function to compute all Tukey intervals. Similarly, all Bonferroni intervals and several other methods
not covered in this course can be calculated with the function pairwise.t.test.
TukeyHSD(model)
Tukey multiple comparisons of means
95% family-wise confidence level
Fit: aov(formula = rehab$time ~ as.factor(rehab$status))
$`as.factor(rehab$status)`
diff lwr upr p adj
2-1 -6 -11.32141 -0.6785856 0.0253639
3-1 -14 -20.05870 -7.9413032 0.0000254
3-2 -8 -13.79322 -2.2067778 0.0060547
No intervals contain 0; thus, all the means are significantly different. That is, the recovery time for
each initial fitness status is significantly different from the other two.
5.) a. #READ IN DATA
cash = read.table(file = "http://br312.math.tntech.edu/.../CH16PR10.txt")
#RENAME VARIABLES
names(cash) = c("offer", "age", "ID")
#MAKE PLOT
plotmeans(cash$offer ~ as.factor(cash$age))
b. #FIT ANOVA MODEL
model = aov(cash$offer ~ as.factor(cash$age))
#INPUT BASIC INFORMATION FROM THE DATA
r = 3
n.i = c(12, 12, 12)
n.t = sum(n.i)
#OBTAIN GROUP MEANS
ybar.i = unique(round(model$fitted.values, 2))
#CALCULATE MSE
MSE = sum(model$residuals^2)/(n.t - r)
SE = sqrt(MSE / n.i[1])
#CALCULATE THE t-STATISTIC
t = qt(1 - .01 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(ybar.i[1] - t * SE, ybar.i[1] + t * SE)
99% CI for the mean offer for young individuals: (20.25496, 22.74504)
c. #OBTAIN POINT ESTIMATE
D.hat = ybar.i[3] - ybar.i[1]
#CALCULATE STANDARD ERROR
SE = sqrt(MSE * (1 / n.i[3] + 1 / n.i[1]))
#COMPUTE t-STATISTIC
t = qt(1 - .01 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(D.hat - t * SE, D.hat + t * SE)
99% CI for D = µ3 − µ1: (-1.840755, 1.680755)
We are 99% confident that the true value of the difference in mean offers lies within the given interval.
Since this interval contains zero (i.e. no difference in the means), there is no significant evidence to
suggest that younger and older individuals receive different offers.
d. #OBTAIN POINT ESTIMATE
L.hat = 2 * ybar.i[2] - ybar.i[1] - ybar.i[3]
#CALCULATE STANDARD ERROR
SE = sqrt(MSE * (4/n.i[2] + 1/n.i[1] + 1/n.i[3]))
#COMPUTE TEST STATISTIC
t = L.hat/SE
#DETERMINE CRITICAL VALUE
t.crit = qt(1 - .01/2, n.t - r)
The method of solving this problem becomes much clearer when we rewrite the null hypothesis in the
following way:
H0 : µ2 − µ1 = µ3 − µ2
⇒ H0 : 2µ2 − µ1 − µ3 = 0
H1 : 2µ2 − µ1 − µ3 ≠ 0
Decision Rule: |t*| ≤ 2.733, Fail to Reject H0; |t*| > 2.733, Reject H0
We get t* = 11.275, so we reject H0. There is significant evidence to suggest that the mean offer for
the young and old age groups differs from that for the middle age group.
6.) (a) Let L = µ1 + µ2 − 2µ3.
#READ IN DATA
prod = read.table(file = "http://br312.math.tntech.edu/.../CH16PR07.txt")
#RENAME VARIABLES
names(prod) = c("prod", "exp", "ID")
#FIT ANOVA MODEL
model = aov(prod$prod ~ as.factor(prod$exp))
#INPUT BASIC INFORMATION FROM THE DATA
r = 3
n.i = c(9, 12, 6)
n.t = sum(n.i)
#OBTAIN GROUP MEANS
ybar.i = unique(round(model$fitted.values, 2))
#OBTAIN POINT ESTIMATE
L.hat = ybar.i[1] + ybar.i[2] - 2*ybar.i[3]
#CALCULATE STANDARD ERROR
MSE = sum(model$residuals^2)/(n.t - r)
SE = sqrt(MSE * (1 / n.i[1] + 1 / n.i[2] + 4 / n.i[3]))
#COMPUTE t-STATISTIC
t = qt(1 - .05 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(L.hat - t * SE, L.hat + t * SE)
The 95% confidence interval for L is given by: (-4.922284, -1.857716). This interval does not
contain zero, so there is strong evidence to suggest there is a difference in the mean productivity
improvement between firms with low or moderate research and development expenditures and firms
with high expenditures.
(b) Let L = (9/27)µ1 + (12/27)µ2 + (6/27)µ3 = (3/9)µ1 + (4/9)µ2 + (2/9)µ3.
#OBTAIN POINT ESTIMATE
L.hat = (3/9)*ybar.i[1] + (4/9)*ybar.i[2] + (2/9)*ybar.i[3]
#CALCULATE STANDARD ERROR
SE = sqrt(MSE * ((3/9)^2 / n.i[1] + (4/9)^2 / n.i[2] + (2/9)^2 / n.i[3]))
#COMPUTE t-STATISTIC
t = qt(1 - .05 / 2, n.t - r)
#COMPUTE CONFIDENCE INTERVAL
c(L.hat - t * SE, L.hat + t * SE)
We are 95% confident that the mean overall productivity lies within the interval (7.633330, 8.268892).
(c) #OBTAIN POINT ESTIMATES
D1.hat = ybar.i[3] - ybar.i[2]
D2.hat = ybar.i[3] - ybar.i[1]
D3.hat = ybar.i[2] - ybar.i[1]
L4.hat = .5 * (ybar.i[1] + ybar.i[2]) - ybar.i[3]
#CALCULATE STANDARD ERRORS
SE1 = sqrt(MSE * (1 / n.i[3] + 1 / n.i[2]))
SE2 = sqrt(MSE * (1 / n.i[3] + 1 / n.i[1]))
SE3 = sqrt(MSE * (1 / n.i[2] + 1 / n.i[1]))
SE4 = sqrt(MSE * (.5^2 / n.i[1] + .5^2 / n.i[2] + 1 / n.i[3]))
#COMPUTE S
S = sqrt((r - 1) * qf(.90, r - 1, n.t - r))
#CALCULATE CONFIDENCE INTERVALS
c(D1.hat - S * SE1, D1.hat + S * SE1)
c(D2.hat - S * SE2, D2.hat + S * SE2)
c(D3.hat - S * SE3, D3.hat + S * SE3)
c(L4.hat - S * SE4, L4.hat + S * SE4)
The 90% family confidence intervals are given by:
D1 : (0.169, 1.971)
D2 : (1.370, 3.270)
D3 : (0.455, 2.045)
L4 : (−2.531, −0.859)
No intervals contain 0, so all differences are significant. That is, there is strong evidence to suggest
that the mean productivity improvement for each expenditure group differs from the other two. Furthermore,
the interval for L4 confirms our results from part (a).
7.) In order to solve this problem we need to make a couple of assumptions. First, we assume that the sample
size, ni, is the same for every group. Denote ni = n; then nT = 3n. Further, assume that
the estimate of MSE from the data is the true value. A Tukey interval has the form
D̂ ± T × s(D̂).
Now set the margin of error, the half-width T × s(D̂), equal to 3 and let α = .05. We also
have from the data that MSE = 3.8. We get,
T × s(D̂) = 3
(1/√2) q(1 − α; r, nT − r) √(MSE (1/n1 + 1/n2)) = 3
(1/√2) q(.95; 3, 3n − 3) √(3.8 (2/n)) = 3
q(.95; 3, 3n − 3) √(1/n) = 1.539
q(.95; 3, 3n − 3) √(1/n) − 1.539 = 0
If the left-hand side above is negative for a given value of n, then that sample size satisfies
our minimum required precision. Trying several values, we find that n = 6 is the smallest value
for which the function is negative.
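The trial-and-error search can be scripted. Below is a sketch in Python using scipy (qtukey corresponds to stats.studentized_range.ppf); the helper, which I call margin_gap, is the Tukey half-width minus the target of 3, which differs from the final equation above only by the positive factor √3.8 and so has the same sign and root:

```python
from scipy import stats

def margin_gap(n, target=3.0, mse=3.8, alpha=0.05, r=3):
    """Tukey half-width for a pairwise difference, minus the target
    margin of error, with n observations in each of r groups."""
    q = stats.studentized_range.ppf(1 - alpha, r, r * n - r)
    return (q / 2 ** 0.5) * (mse * (2 / n)) ** 0.5 - target

# Increase n until the half-width first drops below the target.
n = 2
while margin_gap(n) > 0:
    n += 1
```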