
Stat 849- Homework 3 

Question 1 

(a) Plot: 

[Figure: scatterplot of VL against GSS]

Since at least one observation lies far from the rest, it is worth testing for the presence of outliers.

> hm3q1 <- ...

> plot(hm3q1)

> lm1 <- lm(VL ~ GSS, data = hm3q1)

> outlier.test(lm1)

max|rstudent| = 10.58428, degrees of freedom = 44, 

unadjusted p = 1.125766e-13, Bonferroni p = 5.291101e-12 

Observation: 28 

> plot(hm3q1[-28,]) 

> abline(coef(lm1)) 

> 



unadjusted p = 5.113096e-05, Bonferroni p = 0.002352024 




unadjusted p = 0.000333085, Bonferroni p = 0.01498882 


> 







All three successive tests give a Bonferroni p-value below 0.05. Therefore, we exclude observations 6, 16 and 28.
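The Bonferroni adjustment reported above is the unadjusted two-sided p-value of the largest studentized residual multiplied by the number of observations (here 1.125766e-13 × 47 ≈ 5.29e-12). A minimal base-R sketch of that calculation follows. The helper name `bonferroni_outlier` and the simulated data are assumptions for illustration; they are not the homework data or the function used in the transcript.

```r
# Hypothetical helper (not from the source): Bonferroni outlier test, base R only.
bonferroni_outlier <- function(fit) {
  rs <- rstudent(fit)                      # externally studentized residuals
  rs <- rs[is.finite(rs)]
  n  <- length(rs)
  df <- df.residual(fit) - 1               # t reference distribution for rstudent
  i  <- which.max(abs(rs))                 # most extreme observation
  p  <- 2 * pt(abs(rs[[i]]), df = df, lower.tail = FALSE)
  list(obs = names(rs)[i], rstudent = rs[[i]],
       p_unadjusted = p, p_bonferroni = min(1, n * p))
}

# Simulated data (an assumption; the homework data are not reproduced here).
set.seed(1)
x <- 1:30
y <- 2 + 0.5 * x + rnorm(30)
y[10] <- y[10] + 15                        # plant one gross outlier
res <- bonferroni_outlier(lm(y ~ x))
res$obs                                    # label of the flagged observation
res$p_bonferroni < 0.05                    # TRUE when the test flags it
```

To mimic the procedure in (a), one would refit without the flagged observation and repeat until the Bonferroni p-value exceeds the chosen level.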

(b) 

> summary(lm4) 

Call: 

lm(formula = VL ~ GSS, data = hm3q1[c(-6, -16, -28), ]) 

Residuals: 

Min 1Q Median 3Q Max 

-20544 -12787 -8729 2756 49174 

Coefficients: 

Estimate Std. Error t value Pr(>|t|) 

(Intercept) 11780.0 6900.0 1.707 0.0952 . 

GSS 619.2 811.5 0.763 0.4497 

--- 

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 19460 on 42 degrees of freedom 

(1 observation deleted due to missingness) 

Multiple R-squared: 0.01367, Adjusted R-squared: -0.009813 

F-statistic: 0.5822 on 1 and 42 DF, p-value: 0.4497 

The plot: 

> plot(hm3q1[c(-6,-16,-28),1:2]) 

> abline(coef(lm4)) 



(c) The diagnostic plots below indicate that the model assumptions are violated. The residual variance is not constant in the first plot (Residuals vs Fitted), and the Q-Q plot shows that the residuals depart from a normal distribution.

[Figure: diagnostic plots for the fitted model (Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage with Cook's distance contours); observations 3, 21 and 44 are labeled, and observation 9 also appears in the leverage panel.]

(d) We can apply a Box-Cox transformation.

> par(mfrow=c(1,2)) 

> b1 <- ...

> lambda <- ...

> hm3q1$y <- ...

> summary(lmb1 <- ...)

Coefficients:

Estimate Std. Error t value Pr(>|t|)
(Intercept) 18.43851 2.16262 8.526 5.99e-11 *** 

GSS 0.02962 0.24779 0.120 0.905 

--- 

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

Residual standard error: 6.302 on 45 degrees of freedom 

(1 observation deleted due to missingness) 

Multiple R-squared: 0.0003173, Adjusted R-squared: -0.0219 

F-statistic: 0.01428 on 1 and 45 DF, p-value: 0.9054 

> plot(hm3q1[,-2]) 

> abline(coef(lmb1)) 

> outlier.test(lmb1) 
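Since the commands that chose lambda are lost from the transcript above, here is a minimal base-R sketch of how a Box-Cox parameter can be picked by maximizing the profile log-likelihood over a grid. The function name `boxcox_loglik`, the grid and the simulated data are all assumptions for illustration; a packaged routine such as `MASS::boxcox` performs the same kind of search.

```r
# Profile log-likelihood of the Box-Cox parameter for a simple linear model.
# Hand-rolled sketch; names and data are hypothetical.
boxcox_loglik <- function(lambda, y, x) {
  n <- length(y)
  z <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(resid(lm(z ~ x))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))   # includes the Jacobian term
}

# Simulated positive response that is linear in x on the log scale.
set.seed(1)
x <- seq(1, 10, length.out = 50)
y <- exp(0.3 * x + rnorm(50, sd = 0.1))

grid <- seq(-2, 2, by = 0.05)
ll <- sapply(grid, boxcox_loglik, y = y, x = x)
lambda_hat <- grid[which.max(ll)]
lambda_hat                                  # should be close to 0 for these data
```

A value near 0 points to a log transformation, near 0.5 to a square root, and near 1 to no transformation.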







(a) 

[Figure: scatterplot matrix of Brand_Liking, Moisture_Content and Sweetness]

The plot suggests that Brand_Liking is approximately linearly related to both Moisture_Content and Sweetness.

(b) 

> brand <- ...

> fit <- lm(Brand_Liking ~ as.factor(Moisture_Content) + as.factor(Sweetness), data = brand)

> summary(fit)

Call: 

lm(formula = Brand_Liking ~ as.factor(Moisture_Content) + 

as.factor(Sweetness), 

data = brand) 

Residuals: 

Min 1Q Median 3Q Max 

-3.625 -1.312 -0.125 1.563 4.125 

Coefficients: 

Estimate Std. Error t value Pr(>|t|)
(Intercept) 64.125 1.500 42.737 1.40e-13 *** 

as.factor(Moisture_Content)6 8.000 1.898 4.215 0.00145 ** 

as.factor(Moisture_Content)8 19.250 1.898 10.142 6.42e-07 *** 

as.factor(Moisture_Content)10 25.750 1.898 13.567 3.26e-08 *** 

as.factor(Sweetness)4 8.750 1.342 6.520 4.31e-05 *** 

--- 

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 


Multiple R-squared: 0.9597, Adjusted R-squared: 0.9451 



F-statistic: 65.51 on 4 and 11 DF, p-value: 1.338e-07 

The coefficients for moisture content levels 6, 8 and 10 are the expected differences in Brand_Liking between that level and the baseline level 4, holding Sweetness fixed.
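The baseline interpretation can be checked numerically. In a balanced additive design, each factor coefficient equals the difference between that level's mean response and the baseline level's mean response. The sketch below uses simulated data with the same layout as the summary above (4 moisture levels × 2 sweetness levels × 2 replicates); the data values are assumptions, not the homework data.

```r
# Simulated balanced design; variable names mirror the output above.
set.seed(3)
d <- expand.grid(rep = 1:2, Sweetness = c(2, 4), Moisture_Content = c(4, 6, 8, 10))
d$Brand_Liking <- 60 + 3 * d$Moisture_Content + 4 * d$Sweetness + rnorm(nrow(d))

fit <- lm(Brand_Liking ~ as.factor(Moisture_Content) + as.factor(Sweetness), data = d)

# Mean response at a given moisture level, averaged over sweetness.
mean_at <- function(level) mean(d$Brand_Liking[d$Moisture_Content == level])

coef_diff <- unname(coef(fit)[2:4])                      # levels 6, 8, 10 versus 4
mean_diff <- sapply(c(6, 8, 10), mean_at) - mean_at(4)   # differences of level means
all.equal(coef_diff, mean_diff)                          # TRUE in a balanced design
```

In an unbalanced design the coefficients are still baseline contrasts, but they no longer coincide with raw differences of level means.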

(c) 

The first plot shows a curved pattern in the residuals. This contradicts the Gauss-Markov assumptions, under which the error terms have mean zero and are uncorrelated, so the residuals should show no systematic pattern.

(d) 



(e) 

The plot in (d) shows a curved relationship. After a log transformation of X1 (Moisture_Content), the diagnostic plots below look better.

> summary(lm2 <- ...)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 14.3428 4.6599 3.078 0.00881 ** 

log(Moisture_Content) 28.7205 2.1391 13.427 5.37e-09 *** 

Sweetness 4.3750 0.7328 5.970 4.67e-05 *** 

--- 

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 



F-statistic: 108 on 2 and 13 DF, p-value: 7.993e-09 

> par(mfrow=c(2,2)) 

> plot(lm2) 




[Figure: output of plot(lm2), the standard diagnostic panels including Normal Q-Q and Residuals vs Leverage with Cook's distance contours; observations 4, 7 and 15 are labeled, with 14 labeled in the leverage panel.]

(f) 

> summary(lm2 <- ...)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 14.3428 4.6599 3.078 0.00881 ** 

log(Moisture_Content) 28.7205 2.1391 13.427 5.37e-09 *** 

Sweetness 4.3750 0.7328 5.970 4.67e-05 *** 

--- 

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 





F-statistic: 108 on 2 and 13 DF, p-value: 7.993e-09 

> outlier.test(lm2) 




From the outlier test above, we find no evidence that the 15th observation is an outlier.

(g) 

> plot(leverage) 

> abline(h=2*p/n,col='red',lty=1) 

> x <- ...

> (fcv <- ...)

> abline(h=fcv,col='red',lty=3)

> ind <- ...

> bigcd <- ...(... > fcv)

> text(ind[bigcd],cooks.distance(lm2)[bigcd]-0.02,row.names(bp)[bigcd],col='red')

[Figure: "cooks distance plot", cooks.distance(lm2) against Index; observations 4, 7, 14 and 15 are labeled.]

So observations 4, 7, 14 and 15 are influential points.
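The screening in (g) combines a leverage rule (hat values above 2p/n, as in the `abline(h=2*p/n)` call) with a Cook's distance cut-off. Below is a minimal base-R sketch on simulated data. The definition of `fcv` is lost from the transcript, so the median of the F(p, n − p) distribution used here is an assumption, as are the data and variable names.

```r
# Simulated data: the last point sits far from the rest in x (high leverage).
x <- c(seq(-1, 1, length.out = 19), 10)
set.seed(4)
y <- 1 + 2 * x + rnorm(20)
fit <- lm(y ~ x)

n <- length(y)
p <- length(coef(fit))
h  <- hatvalues(fit)                         # leverages; they sum to p
cd <- cooks.distance(fit)

high_leverage <- which(h > 2 * p / n)        # same 2p/n rule as in the plot above
fcv <- qf(0.5, p, n - p)                     # assumed cut-off, not confirmed by the source
influential <- which(cd > fcv)               # may be empty if no point is influential

high_leverage
```

High leverage alone does not make a point influential: a point is influential only if it also pulls the fit, which is what Cook's distance measures.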




(a) 

> hay <- ...

> summary(lm1 <- ...)

Estimate Std. Error t value Pr(>|t|)

(Intercept) 2.4750 0.1227 20.177 < 2e-16 *** 

as.factor(A)2 2.9750 0.1735 17.150 4.82e-16 *** 

as.factor(A)3 3.5000 0.1735 20.176 < 2e-16 *** 

as.factor(B)2 2.1250 0.1735 12.250 1.55e-12 *** 

as.factor(B)3 2.1000 0.1735 12.106 2.03e-12 *** 

as.factor(A)2:as.factor(B)2 1.3500 0.2453 5.503 7.91e-06 *** 



as.factor(A)3:as.factor(B)3 5.1750 0.2453 21.094 < 2e-16 *** 

--- 

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 



F-statistic: 774.9 on 8 and 27 DF, p-value: < 2.2e-16 

> anova(lm1) 

Analysis of Variance Table 

Response: hours 

Df Sum Sq Mean Sq F value Pr(>F) 

as.factor(A) 2 220.020 110.010 1827.86 < 2.2e-16 *** 

as.factor(B) 2 123.660 61.830 1027.33 < 2.2e-16 *** 

as.factor(A):as.factor(B) 4 29.425 7.356 122.23 < 2.2e-16 *** 

Residuals 27 1.625 0.060 

--- 

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

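From the results above, the fitted value for the cell A = 2, B = 3 is the sum of the intercept, the A2 and B3 main effects, and the A2:B3 interaction: $\hat{y}_{23} = 2.475 + 2.975 + 2.100 + 1.575 = 9.125$.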
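The same bookkeeping can be verified in R: with treatment contrasts and a full interaction model, the sum of the four relevant coefficients is exactly the sample mean of that cell. The sketch uses simulated data with the same 3 × 3 layout and 4 replicates per cell (36 observations, 27 residual degrees of freedom); the data frame `d` and its values are assumptions.

```r
# Simulated 3 x 3 two-way layout with 4 replicates per cell.
set.seed(5)
d <- expand.grid(rep = 1:4, A = factor(1:3), B = factor(1:3))
d$hours <- rnorm(nrow(d), mean = 5)

fit <- lm(hours ~ A * B, data = d)
b <- coef(fit)

# Fitted value for A = 2, B = 3: intercept + A2 + B3 + A2:B3.
yhat_23 <- unname(b["(Intercept)"] + b["A2"] + b["B3"] + b["A2:B3"])
cell_mean <- mean(d$hours[d$A == 2 & d$B == 3])
all.equal(yhat_23, cell_mean)

# Arithmetic from the homework output.
sum(c(2.475, 2.975, 2.100, 1.575))
```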

(b) Diagnostic plots: the second plot (Normal Q-Q) looks slightly skewed, but overall the diagnostics look acceptable.




[Figure: diagnostic plots for the two-way model, including Normal Q-Q and "Constant Leverage: Residuals vs Factor Levels"; observations 6, 23 and 29 are labeled.]

(c) There is some interaction, since the profile lines in the interaction plot are not all parallel.

[Figure: interaction plot of hours against A, with one line for each level of B]



(d) 

From the ANOVA table above, the F statistic for the A:B interaction is 122.23 on 4 and 27 degrees of freedom (p < 2.2e-16). So we reject the null hypothesis that there is no interaction.

(e) 

The F statistic is 1827.86 for A and 1027.33 for B, each on 2 and 27 degrees of freedom (p < 2.2e-16). So for both factors we reject the null hypothesis that there is no main effect.
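Each F statistic in the ANOVA table is the mean square of the term divided by the residual mean square. The short sketch below recomputes them from the printed sums of squares and degrees of freedom; because the printed values are rounded, agreement with the table is only up to rounding. The variable names are illustrative.

```r
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss <- c(A = 220.020, B = 123.660, AB = 29.425)
df <- c(A = 2, B = 2, AB = 4)
mse <- 1.625 / 27                            # residual mean square

f_stat <- (ss / df) / mse                    # F = MS(term) / MSE
p_val  <- pf(f_stat, df, 27, lower.tail = FALSE)

round(f_stat, 2)                             # compare with 1827.86, 1027.33, 122.23
p_val
```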


$\hat{\varepsilon} = Y - \hat{Y} = (I - H)Y$, and $H$ and $(I - H)$ are idempotent.

(a) $E(\hat{\varepsilon}) = E(Y - \hat{Y}) = X\beta - X\beta = 0$

(b) $\operatorname{cov}(\hat{\varepsilon}) = \operatorname{cov}((I - H)Y) = \sigma^2 (I - H)^2 = \sigma^2 (I - H)$

(c) $\operatorname{cov}(\hat{\varepsilon}, Y) = \operatorname{cov}(Y - \hat{Y}, Y) = (I - H)\operatorname{cov}(Y) = \sigma^2 (I - H)$

(d) $\operatorname{cov}(\hat{\varepsilon}, \hat{Y}) = \operatorname{cov}((I - H)Y, HY) = (I - H)\operatorname{cov}(Y)H = \operatorname{cov}(Y)(I - H)H = 0$

(e) $X'\hat{\varepsilon} = X'(Y - X\hat{\beta}) = X'Y - X'X\hat{\beta} = X'Y - X'X(X'X)^{-1}X'Y = X'Y - X'Y = 0_{p \times 1}$

$\sum_{i=1}^{n} \hat{\varepsilon}_i$ is the first row of $X'\hat{\varepsilon}$ because the first row of $X'$ is the vector of ones. Therefore $\sum_{i=1}^{n} \hat{\varepsilon}_i = 0$.

(f) $\hat{\varepsilon}'Y = ((I - H)Y)'Y = Y'(I - H)Y$ because $(I - H)$ is symmetric.

(g) $\hat{\varepsilon}'\hat{Y} = ((I - H)Y)'\hat{Y} = Y'(I - H)HY = 0$

(h) $\hat{\varepsilon}'X = ((I - H)Y)'X = Y'(I - H)X = Y'X - Y'HX = Y'X - Y'X(X'X)^{-1}X'X = Y'X - Y'X = 0$
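These identities are easy to confirm numerically. The sketch below builds a hat matrix for a simulated design with an intercept column and checks that $(I - H)$ is idempotent, that $X'\hat{\varepsilon} = 0$, that the residuals sum to zero, and that $\hat{\varepsilon}'\hat{Y} = 0$. The design and coefficients are arbitrary assumptions chosen only for illustration.

```r
set.seed(7)
n <- 20
X <- cbind(1, rnorm(n), rnorm(n))            # design matrix with an intercept column
Y <- X %*% c(1, 2, -1) + rnorm(n)

H <- X %*% solve(crossprod(X)) %*% t(X)      # hat matrix H = X (X'X)^{-1} X'
M <- diag(n) - H                             # I - H
e <- M %*% Y                                 # residuals
Yhat <- H %*% Y                              # fitted values

# Each entry should be zero up to floating-point error.
chk <- c(idempotent = max(abs(M %*% M - M)),
         Xt_e       = max(abs(crossprod(X, e))),
         sum_e      = abs(sum(e)),
         e_Yhat     = abs(drop(crossprod(e, Yhat))))
chk
```

The zero residual sum relies on the intercept column: without it, the first row of $X'$ is not the vector of ones and the sum need not vanish.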




(a) 

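$$
X^{*\prime}X^{*} =
\begin{bmatrix}
\dfrac{1}{n-1}\,\dfrac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)^2}{S_{X_1} S_{X_1}} &
\dfrac{1}{n-1}\,\dfrac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)}{S_{X_1} S_{X_2}} \\[2ex]
\dfrac{1}{n-1}\,\dfrac{\sum_{i=1}^{n}(X_{i1}-\bar{X}_1)(X_{i2}-\bar{X}_2)}{S_{X_1} S_{X_2}} &
\dfrac{1}{n-1}\,\dfrac{\sum_{i=1}^{n}(X_{i2}-\bar{X}_2)^2}{S_{X_2} S_{X_2}}
\end{bmatrix}
=
\begin{bmatrix} 1 & r \\ r & 1 \end{bmatrix}
$$

Therefore, $\operatorname{Var}(\beta^{*}) = (X^{*\prime}X^{*})^{-1}\sigma^{*2}$. Then we will have $\sigma^2_{\beta_1^*} = \sigma^2_{\beta_2^*} = \sigma^{*2}/(1 - r^2)$.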

(b) As the correlation $r$ between the predictor variables increases in magnitude, $1 - r^2$ shrinks toward 0, so the variance of the coefficient estimates increases.
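To see how quickly the variance grows, the factor $1/(1 - r^2)$ can be tabulated for a few correlations and compared with the diagonal of the inverse of the 2 × 2 correlation matrix. This is a small illustrative sketch; the chosen values of `r` are arbitrary.

```r
r <- c(0, 0.5, 0.9, 0.99)

# Closed form from part (a): variance is sigma*^2 / (1 - r^2).
vif_formula <- 1 / (1 - r^2)

# Same quantity read off the inverse of the correlation matrix [1 r; r 1].
vif_matrix <- sapply(r, function(ri) solve(matrix(c(1, ri, ri, 1), 2, 2))[1, 1])

cbind(r, vif_formula, vif_matrix)
```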




(a) 

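$$
X = \begin{bmatrix}
1 & 0 & 0 & 1 & 0 & 0 \\
1 & 0 & 0 & 0 & 1 & 0 \\
0 & 1 & 0 & 1 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 & 1 \\
0 & 0 & 1 & 0 & 1 & 0 \\
0 & 0 & 1 & 0 & 0 & 1
\end{bmatrix},
\qquad \operatorname{rank}(X) = 5
$$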

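(b) First we show why $\beta_1$ is not estimable. Suppose $\beta_1$ were estimable. Then there would be a vector $C = (c_1, \dots, c_6)$ such that $CX = (1,0,0,0,0,0)$, that is,

$$
\begin{cases}
c_1 + c_2 = 1 \\
c_3 + c_4 = 0 \\
c_5 + c_6 = 0 \\
c_1 + c_3 = 0 \\
c_2 + c_5 = 0 \\
c_4 + c_6 = 0
\end{cases}
$$

This system has no solution: adding the first three equations gives $c_1 + \dots + c_6 = 1$, while adding the last three gives $c_1 + \dots + c_6 = 0$. Therefore $\beta_1$ is not estimable. Similarly, none of the remaining components of $\beta$ is estimable.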

(c) Let $CX = (1, -1, 0, 0, 0, 0)$. Then we will have

$$
\begin{cases}
c_1 + c_2 = 1 \\
c_3 + c_4 = -1 \\
c_5 + c_6 = 0 \\
c_1 + c_3 = 0 \\
c_2 + c_5 = 0 \\
c_4 + c_6 = 0
\end{cases}
$$

We can set $C = (1, 0, -1, 0, 0, 0)$. Therefore $\psi_1$ is estimable.

Let $CX = (1, 1, -2, 0, 0, 0)$. Then we will have

$$
\begin{cases}
c_1 + c_2 = 1 \\
c_3 + c_4 = 1 \\
c_5 + c_6 = -2 \\
c_1 + c_3 = 0 \\
c_2 + c_5 = 0 \\
c_4 + c_6 = 0
\end{cases}
$$

We can set $C = (0, 1, 0, 1, -1, -1)$. Therefore $\psi_2$ is estimable.
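Estimability can also be checked mechanically: a linear combination $\lambda'\beta$ is estimable exactly when $\lambda'$ lies in the row space of $X$, so appending $\lambda'$ as an extra row must not raise the rank. The sketch below applies this rank test to the design matrix from (a) and verifies the two choices of $C$ from (c). The helper names `rank_of` and `estimable` are hypothetical.

```r
# Design matrix from part (a).
X <- rbind(c(1, 0, 0, 1, 0, 0),
           c(1, 0, 0, 0, 1, 0),
           c(0, 1, 0, 1, 0, 0),
           c(0, 1, 0, 0, 0, 1),
           c(0, 0, 1, 0, 1, 0),
           c(0, 0, 1, 0, 0, 1))

rank_of <- function(A) qr(A)$rank
# lambda'beta is estimable iff adding lambda' as a row leaves the rank unchanged.
estimable <- function(lambda) rank_of(rbind(X, lambda)) == rank_of(X)

rank_of(X)                                   # rank claimed in (a)
estimable(c(1, 0, 0, 0, 0, 0))               # beta_1 alone, part (b)
estimable(c(1, -1, 0, 0, 0, 0))              # psi_1, part (c)
estimable(c(1, 1, -2, 0, 0, 0))              # psi_2, part (c)

# The explicit C vectors from (c) reproduce the target rows.
drop(c(1, 0, -1, 0, 0, 0) %*% X)
drop(c(0, 1, 0, 1, -1, -1) %*% X)
```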
