
A brief solution to the written exam in STK9900/STK4900 Statistical methods and applications.

Thursday 6. June 2013.

The exams in STK4900 and STK9900 have substantial overlap, but are not the same. This is a short solution to the STK9900 questions, which includes the questions for STK4900.

Exercise 1

a) To test if the expression of gene 1 is a significant explanatory variable for bone density, we use the t-test statistic

$$t = \frac{\hat\beta_1}{\mathrm{se}(\hat\beta_1)},$$

where $\hat\beta_1$ is the estimated effect of gene 1 and $\mathrm{se}(\hat\beta_1)$ is the corresponding standard error. Using the output in (I), the test statistic takes the value

$$t = \frac{-1.1297}{0.3609} = -3.13.$$

The value $-3.13$ is to be compared with the t-distribution with 82 degrees of freedom. Using the table for t-distributions, we find that the P-value is smaller than $2 \cdot 0.005 = 0.01$ (two-sided, using 60 degrees of freedom from the table). Hence $\beta_1$ is significantly different from 0 and gene 1 is a significant explanatory variable for bone density.
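The arithmetic in a) can be checked with a short sketch. It assumes the standard table value 2.660 for the upper 0.005 quantile of the t-distribution with 60 degrees of freedom; the variable names are illustrative only. Since t quantiles decrease as the degrees of freedom increase, using 60 instead of 82 degrees of freedom is conservative.

```python
# Values quoted from output (I) in the solution.
beta_hat = -1.1297  # estimated effect of gene 1
se = 0.3609         # standard error of the estimate

t = beta_hat / se

# Assumed table value: upper 0.005 quantile of the t-distribution, 60 df.
T_CRIT_60_DF = 2.660

# If |t| exceeds the table value, the two-sided P-value is below 2 * 0.005.
p_below_001 = abs(t) > T_CRIT_60_DF

print(f"t = {t:.2f}")
print("two-sided P-value below 0.01:", p_below_001)
```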

b) In the results (II) we see that gene 1 is no longer significant when gene 2 is included. Gene 2 is highly significant. Hence if we have information on gene 2, gene 1 does not contain enough additional information on bone density to be significant. This can happen when the covariates are (highly) correlated. We see that the correlation between gene 1 and gene 2 is 0.73.

c) The second model has only significant covariates and a slightly higher adjusted $R^2$, hence it should be considered better than the first (the first model has a higher $R^2$, but also has one more covariate, so we use adjusted $R^2$ for comparison here). For the second model, we interpret the estimated effects as follows: A high gene expression for gene 2 tends to reduce expected bone density, when the other covariates (genes) are kept unchanged. A high gene expression for gene 3 tends to increase expected bone density, when the other genes are unchanged. The same holds for gene 4. We also see that gene 3 has the largest estimated effect on bone density.

d) [9900] Short answer: Interaction between covariates is when the effect of one covariate on the response depends on the level of another covariate. Some further discussion is appreciated, for example explaining how this can be modelled; see the lecture notes for Lecture 4. In the setting here, interaction between genes would mean, for example, that the effect of a high expression of gene r could depend on the expression level of another gene s.
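One common way to model such an interaction, sketched here as an assumption since the solution only refers to the lecture notes, is to add a product term to the linear predictor. The symbols $x_r$, $x_s$ and $\beta_{rs}$ are introduced for illustration and do not appear in the exam output.

```latex
% Linear model with an interaction between gene r and gene s (illustrative notation)
E(y \mid x_r, x_s) = \beta_0 + \beta_r x_r + \beta_s x_s + \beta_{rs}\, x_r x_s

% The effect of a one-unit increase in x_r then depends on the level of x_s:
E(y \mid x_r + 1, x_s) - E(y \mid x_r, x_s) = \beta_r + \beta_{rs}\, x_s
```

With $\beta_{rs} = 0$ the effect of gene r is the same at every level of gene s, which is the model without interaction.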



Exercise 2

a) We have data $(x_1, x_2, y)$ for $n = 35$ subjects (surgeries), where $x_1$ and $x_2$ are explanatory variables and the outcome $y$ is binary (0 and 1). We are interested in modelling the probability of getting a sore throat as a function of the explanatory variables, hence we model $p(x_1, x_2) = P(y = 1 \mid x_1, x_2)$ with a logistic regression model

$$p(x_1, x_2) = \frac{\exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}{1 + \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2)}.$$

This ensures that $p(x_1, x_2)$ is a proper probability, and the log odds will be linear in the covariates.
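The linearity of the log odds follows directly from the model formula. A short derivation, writing $\eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2$ as shorthand (the symbol $\eta$ is introduced here and is not used in the exam text):

```latex
p = \frac{e^{\eta}}{1 + e^{\eta}}, \qquad 1 - p = \frac{1}{1 + e^{\eta}}

\frac{p}{1 - p} = e^{\eta}
\quad\Longrightarrow\quad
\log\frac{p}{1 - p} = \eta = \beta_0 + \beta_1 x_1 + \beta_2 x_2
```

Since $e^{\eta} > 0$ for every real $\eta$, the probability $p$ always lies strictly between 0 and 1.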

b) With only one explanatory variable $x_1$, we find

$$P(y = 1 \mid x_1 = 30) = \frac{\exp(\beta_0 + \beta_1 \cdot 30)}{1 + \exp(\beta_0 + \beta_1 \cdot 30)},$$

which is estimated as

$$\frac{\exp(-2.21358 + 0.07038 \cdot 30)}{1 + \exp(-2.21358 + 0.07038 \cdot 30)} = 0.4745.$$
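As a check of the plug-in calculation, the following minimal sketch evaluates the fitted logistic curve at 30 minutes. The function name `logistic_probability` is illustrative; the coefficients are the ones quoted above.

```python
import math

def logistic_probability(b0, b1, x):
    """P(y = 1 | x) under a logistic model with intercept b0 and slope b1."""
    eta = b0 + b1 * x  # linear predictor (log odds)
    return math.exp(eta) / (1.0 + math.exp(eta))

# Estimates quoted in the solution for the model with duration only.
p30 = logistic_probability(-2.21358, 0.07038, 30)
print(f"Estimated P(sore throat | 30 minutes) = {p30:.4f}")
```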

c) The odds ratio for 40 minutes with respect to 30 minutes is $\exp(10 \cdot \hat\beta_1) = \exp(0.7038) = 2.02$. A confidence interval for $10 \cdot \beta_1$ is found through

$$10 \cdot \hat\beta_1 \pm 1.96 \cdot \mathrm{se}(10 \cdot \hat\beta_1),$$

where $\mathrm{se}(10 \cdot \hat\beta_1) = 10 \cdot \mathrm{se}(\hat\beta_1)$. With the estimates from R we get $0.7038 \pm 1.96 \cdot 0.2667$, giving the 95% confidence interval $(0.181068, 1.226532)$. This gives a 95% confidence interval for the odds ratio as

$$(\exp(0.181068), \exp(1.226532)),$$

that is, $(1.198, 3.409)$. The OR tells us that the odds for a sore throat are doubled if the surgery lasts 10 minutes more. From the confidence interval we can conclude that there is some uncertainty about this doubling, but at least we can say that the odds are increased (the interval lies entirely above 1).
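The interval can be reproduced step by step. This is a small sketch using only the numbers quoted in the solution; the variable names are illustrative.

```python
import math

beta1_hat = 0.07038  # estimated log odds ratio per minute
se_10 = 0.2667       # se(10 * beta1_hat) = 10 * se(beta1_hat), as used above

log_or = 10 * beta1_hat          # log odds ratio for 10 extra minutes
lower = log_or - 1.96 * se_10    # 95% limits on the log scale
upper = log_or + 1.96 * se_10

odds_ratio = math.exp(log_or)
ci = (math.exp(lower), math.exp(upper))  # transform limits to the OR scale

print(f"OR = {odds_ratio:.2f}, 95% CI = ({ci[0]:.3f}, {ci[1]:.3f})")
```

The interval is computed on the log scale and then exponentiated, which is why it is not symmetric around the estimated odds ratio.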

d) Wald test:

$$H_0: \beta_2 = 0 \quad \text{versus} \quad H_a: \beta_2 \neq 0$$

From R we find the z-value $-1.798$, and the P-value is $2P(Z > 1.798) = 2 \cdot 0.036 = 0.072$, where $Z$ is standard normal. This P-value is not small enough to reject $H_0$ at level 0.05, so type ($x_2$) does not significantly improve the model.
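The two-sided P-value of the Wald test can be computed without a table. A minimal sketch using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

z = -1.798  # z-value for beta_2 quoted from the R output

# Two-sided P-value: 2 * P(Z > |z|) for a standard normal Z.
p_wald = 2 * (1 - NormalDist().cdf(abs(z)))

print(f"Wald P-value = {p_wald:.3f}")
print("reject H0 at level 0.05:", p_wald < 0.05)
```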

Deviance test:

$$G = D_0 - D = 33.651 - 30.138 = 3.513$$

$G$ is $\chi^2$ with 1 df under $H_0$. From the $\chi^2$ table we find that the P-value must be larger than 0.05, and conclude again that type ($x_2$) does not significantly improve the model.
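For 1 degree of freedom the $\chi^2$ tail probability has a simple closed form, since a $\chi^2_1$ variable is the square of a standard normal. The sketch below uses this identity; the helper name `chi2_sf_1df` is illustrative.

```python
import math

def chi2_sf_1df(x):
    """P(X > x) for X chi-square with 1 df, using X = Z**2 with Z standard normal."""
    return math.erfc(math.sqrt(x / 2))

D0, D = 33.651, 30.138  # deviances without and with type (x2)
G = D0 - D
p_dev = chi2_sf_1df(G)

print(f"G = {G:.3f}, P-value = {p_dev:.3f}")
```

The resulting P-value is above 0.05, in line with the table lookup.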



Exercise 3

a) The estimated rate ratio is $\exp(1.2016) = 3.325$, hence the rate of wave-caused damage is more than three times larger for the ships built in 70-74 compared to 60-64.

b) [9900] For an explanation of overdispersion, see the lecture notes for Lecture 8. One should write down the overdispersed Poisson regression model, introducing $\mathrm{Var}(Y_i) = \phi\mu_i$, and explain that using the quasi family in the glm allows $\phi$ to be estimated and corrects the standard errors to

$$\mathrm{se}^* = \mathrm{se} \cdot \sqrt{\hat\phi}.$$

For the ship damage data, $\hat\phi = 3.19643$ and the standard errors of the estimated coefficients are scaled up by a factor

$$\sqrt{\hat\phi} = 1.788,$$

and the test statistics and p-values are corrected correspondingly. The p-values become bigger, but the effects are still significant.
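The effect of the correction can be illustrated numerically. The sketch below computes the scaling factor and, as a derived consequence, how large a Poisson z-value must be to stay significant at the 5% level after the correction. It uses the normal critical value 1.96 as a simplifying assumption; no standard errors from the R output are assumed.

```python
import math

phi_hat = 3.19643            # estimated dispersion parameter from the solution
scale = math.sqrt(phi_hat)   # factor by which standard errors are inflated

# Corrected test statistic: z* = z / sqrt(phi_hat).
# So |z*| > 1.96 requires |z| > 1.96 * sqrt(phi_hat) in the ordinary Poisson fit.
z_needed = 1.96 * scale

print(f"scale factor = {scale:.3f}")
print(f"Poisson |z| needed for significance after correction = {z_needed:.2f}")
```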

c) This is a Poisson model with observations that are aggregated counts. If $Y_i$ is the number of counts in group $i$, the model can be written as

$$E(Y_i) = \exp\bigl(\log(\text{months}_i) + \beta_0 + \beta_1 x_{Bi} + \beta_2 x_{Ci} + \beta_3 x_{Di} + \beta_4 x_{Ei} + \beta_5 x_{65\text{-}69,i} + \beta_6 x_{70\text{-}74,i} + \beta_7 x_{75\text{-}79,i} + \beta_8 x_{\text{op}75\text{-}79,i}\bigr),$$

where $x_{Bi} = 1$ if the ship type in group $i$ is B and 0 otherwise, etc. We test

$$H_0: \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_8 = 0$$

using a deviance test:

$$G = D_0 - D = 73.225 - 38.695 = 34.53.$$

$G$ is $\chi^2$ with 5 df under $H_0$. $P(\chi^2_5 > 34.53) < 0.005$ and we reject $H_0$ at any reasonable level of significance. Hence we can conclude that the new variables significantly improve the model.
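The tail probability can also be evaluated directly rather than bounded from a table. The sketch below uses the closed-form upper tail of the $\chi^2$ distribution with 5 degrees of freedom (available for odd degrees of freedom); the helper name `chi2_sf_5df` is illustrative.

```python
import math

def chi2_sf_5df(x):
    """P(X > x) for X chi-square with 5 df (closed form for odd df)."""
    normal_part = math.erfc(math.sqrt(x / 2))
    series_part = math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * (1 + x / 3)
    return normal_part + series_part

D0, D = 73.225, 38.695  # deviances of the smaller and the larger model
G = D0 - D
p_value = chi2_sf_5df(G)

print(f"G = {G:.2f}, P-value = {p_value:.2e}")
```

As a sanity check, the same function evaluated at the tabulated 5% critical value 11.070 returns approximately 0.05.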

d) Using estimated rate ratios, we find that ship types B, C and D are all safer than ship type A, when all other covariates are unchanged. Ship type E is more dangerous.

Specifically we find $RR = \exp(-0.6874) = 0.5029$ for ship type C compared to ship type A, when all other covariates are unchanged. Similarly we find $RR = \exp(-0.5433) = 0.5808$ for ship type B compared to A. Ship type C therefore has the lowest rate of wave-caused damage, but the standard error of $\hat\beta_2$ is much larger than that of $\hat\beta_1$, so it is also reasonable to argue that ship type B should be considered the safest, given that the difference in RR is very small.

The RR for operation in the last period with respect to the first is $RR = \exp(0.38447) = 1.4688$. This means that there was an increased risk of damage for operations in the period 1975-79 compared to the previous period 1960-74, when all other covariates, such as ship type and year of construction, are the same.
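The rate ratios quoted in Exercise 3 are obtained by exponentiating the estimated coefficients. A small sketch that reproduces them, where the dictionary labels are illustrative and the coefficients are those quoted in a) and d):

```python
import math

# Estimated coefficients quoted in the solution (log rate ratios).
coefficients = {
    "built 1970-74 vs 1960-64": 1.2016,
    "type B vs type A": -0.5433,
    "type C vs type A": -0.6874,
    "operation 1975-79 vs 1960-74": 0.38447,
}

# A rate ratio is the exponential of the corresponding coefficient.
rate_ratios = {label: math.exp(beta) for label, beta in coefficients.items()}

for label, rr in rate_ratios.items():
    print(f"{label}: RR = {rr:.4f}")
```

A rate ratio below 1 indicates a lower damage rate than in the reference category, and a value above 1 a higher rate.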
