Solutions to the practice problems. - UCLA Biostatistics
Biostatistics 201B
Final Practice Solutions
March 6th, 2013

Final Exam Practice Problems With Solutions
(1) Survival Analysis Basics:

(a) Explain what the term censoring means, some different ways in which it can occur, and why it is an important issue in analyses of time-to-event or survival data.
Solution: Censoring refers to the situation in which a value of interest is only partially observed. It is a general phenomenon but comes up most frequently when analyzing "time to event" or survival data, and I will use that terminology. In the survival setting, censoring means you have some information about when the event of interest occurred but you don't know it exactly. Because of this, you would be throwing away information if you treated the value as missing, but you would be biasing your answer if you used the partial information as if it were the true time of the event.
There are many kinds of censoring. Right censoring refers to the situation in which you know the survival time was longer than a certain value. This could happen, for instance, if you were measuring time to cancer recurrence and you know the patient was cancer free at their last visit at time t, but now they have dropped out of the study and you don't know their status. If you let T be the true time until cancer recurrence, you know T > t but you don't know any more than that. Left censoring occurs when you know the survival time is less than a certain value. Suppose a subject in our cancer study was cancer free at baseline but by the time of their first visit at time t the cancer had already recurred. You would know T < t but you would have no idea exactly when it happened. Interval censoring occurs when you know the survival time is between two values. For instance, we might know that our cancer patient was disease free at their previous visit at time t_1 but now at their current visit at t_2 the cancer has recurred. We know t_1 < T < t_2 but we don't know the exact time.
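One common way to represent these censoring types in code is as an interval (L, R) known to contain the true event time T. This is a minimal sketch; the field names and numbers are my own, not from the text:

```python
import math

# Each partially observed survival time is stored as an interval (L, R)
# that is known to contain the true event time T (values invented):
observations = [
    {"kind": "exact",    "L": 4.2, "R": 4.2},       # event observed at t = 4.2
    {"kind": "right",    "L": 3.0, "R": math.inf},  # last seen event-free at t = 3
    {"kind": "left",     "L": 0.0, "R": 1.5},       # event had occurred by t = 1.5
    {"kind": "interval", "L": 2.0, "R": 5.0},       # event between visits at 2 and 5
]

def contains(obs, true_t):
    """Is true_t consistent with the partial information in obs?"""
    return obs["L"] <= true_t <= obs["R"]

print([contains(o, 4.2) for o in observations])  # [True, True, False, True]
```

Exact observations are the degenerate interval L = R; right censoring has an infinite right endpoint, which is why it costs information but is still usable.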
Censoring can occur for a variety of reasons. It can be known about in advance, for instance if you plan to stop the study at a particular time when many subjects will still not have had the event of interest. Everyone still in the study who has not had the event will be censored at the end point. It can also occur randomly, for instance because a subject drops out, dies for a reason other than the event under study, or otherwise becomes unobservable. What is really important is whether the censoring is non-informative, i.e. knowing when or if the subject was censored doesn't tell you anything about when the actual event occurred. You can have non-informative censoring, for instance because a study participant moves for job-related reasons and therefore can't participate any more. However, subjects frequently drop out for reasons having to do with the event under study, e.g. they become too sick to participate, which may in turn suggest the event is imminent. Most standard survival methods assume to some extent or another that the censoring is non-informative, at least after you have adjusted for appropriate covariates. However, this is always an issue that needs careful attention.
(b) Give an intuitive description of the key quantities in survival analysis: the survival function, S(t); the cumulative death function, F(t); the death density, f(t); and the hazard function, h(t) = f(t)/S(t).
Solution: The survival function is simply the probability that a subject survives past a certain point, or equivalently the proportion of the population that survives past that point. If T is the survival time then S(t) = P(T ≥ t). The cumulative death distribution is the opposite of the survival function. It gives the proportion of a population that has the event prior to a given time, F(t) = P(T ≤ t). The death density, f(t), gives the relative likelihood of the event of interest happening at time t. You can think of it as the probability that the event happens in a tiny interval around time t. The hazard function is like the density function except that it is conditional on having survived up to time t. It gives the relative likelihood of the event happening at time t given that the event has not happened up to time t. You can think of it as the probability that the event will happen in a tiny interval after time t assuming it hasn't happened prior to t. The hazard function is given by h(t) = f(t)/S(t). Note that all these quantities are inter-related. If you know one of them for all values of t then you can calculate all the others.
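The relationships among the four quantities are easy to check numerically for a concrete distribution. A minimal sketch for an exponential survival time (the rate lam = 0.1 is an arbitrary choice of mine):

```python
import math

lam = 0.1  # hypothetical constant event rate

def S(t):   # survival function: P(T >= t)
    return math.exp(-lam * t)

def F(t):   # cumulative death distribution: P(T <= t) = 1 - S(t)
    return 1 - S(t)

def f(t):   # death density: derivative of F
    return lam * math.exp(-lam * t)

def h(t):   # hazard: f(t) / S(t)
    return f(t) / S(t)

# For the exponential, the e^{-lam*t} factors cancel and the hazard
# is constant at lam for every t:
print(h(1.0), h(10.0))  # both 0.1
```

The constant hazard is exactly what makes the exponential the "memoryless" survival model; any one of S, F, f, or h pins down the other three.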
(c) Describe the two major categories of approaches to estimating the survival quantities above.
Solution: There are both parametric and non-parametric techniques for estimating the key survival functions. In the parametric approach you assume that the survival time, T, has a particular distribution (e.g. an exponential distribution), or equivalently that the survival curve or hazard function has a particular shape, and you then use maximum likelihood ideas to estimate the parameters of that distribution (e.g. mean, variance, etc.). In the non-parametric framework you use the empirical distribution given by the observed event times and censoring times in your sample. The Kaplan-Meier product limit estimator of the survival curve is the classic non-parametric technique. There is also a famous estimator of the cumulative hazard function, H(t), which "adds up" the total hazard a subject has been exposed to up to time t, called the Nelson-Aalen estimator. Both require only that you know the event and censoring times.
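The Nelson-Aalen estimator Ĥ(t) = ∑_{t_i ≤ t} d_i/Y_i can be sketched in a few lines of pure Python; the (time, died) pairs below are made up for illustration (died = 0 means censored):

```python
# Hypothetical data: (time, died) pairs, died = 0 meaning censored.
data = sorted([(2.0, 1), (3.5, 0), (4.0, 1), (4.0, 1), (6.0, 0), (7.5, 1)])

def nelson_aalen(data, t):
    """Cumulative hazard estimate H_hat(t) = sum of d_i / Y_i over t_i <= t."""
    n_at_risk = len(data)
    h_cum = 0.0
    i = 0
    while i < len(data):
        time = data[i][0]
        if time > t:
            break
        # group all records tied at this time so d_i counts ties correctly
        same = [rec for rec in data if rec[0] == time]
        deaths = sum(d for _, d in same)
        if deaths:
            h_cum += deaths / n_at_risk
        n_at_risk -= len(same)  # deaths and censorings both leave the risk set
        i += len(same)
    return h_cum

print(nelson_aalen(data, 5.0))  # 1/6 + 2/4 ≈ 0.667
```

At t = 5 the two contributions are 1/6 (one death among six at risk at time 2.0) and 2/4 (two tied deaths among four at risk at time 4.0); the censored subjects contribute nothing directly but shrink the risk set.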
(d) Explain briefly the ideas behind the accelerated failure time model and the Cox proportional hazards model for incorporating covariates into a survival model.
Solution: The accelerated failure time (AFT) model is basically a generalized linear model with a log link and the usual systematic component consisting of a linear combination of the predictor variables or X's. It can be fit in conjunction with a number of distribution functions and is a parametric model. The trick is that the censoring times have to be built into the likelihood function that you are trying to maximize. The interpretation of the coefficients proceeds in the usual way, as it would for a Poisson or other model that has a log link. The Cox proportional hazards model is what is called a "semi-parametric" model. You make some assumptions about the form of the distribution/hazard function but stop short of estimating the whole thing. Specifically, the proportional hazards model assumes that the hazard function can be written as

h(t) = h_0(t)c(Xβ)

where h_0(t) is called the baseline hazard function and c is a known function (frequently the exponential, so that c(Xβ) = e^{Xβ}) applied to our usual linear combination of the X's. This is a multiplicative model. If c is the exponential function then taking logs puts us back on our favorite additive scale. Note that while we assume the form of the function c is known, we make no assumptions about the form of the baseline hazard function (other than that it is non-negative and usually continuous). This is why the model is called semi-parametric. In fact, to understand the impact of the X's we do not need to estimate the shape of the baseline hazard, because if we take the hazard ratio for two people with different covariate values the baseline hazard cancels out, leaving us a piece that depends only on c.
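The cancellation of the baseline hazard is easy to see numerically. A sketch with invented coefficients and two deliberately different baseline hazards:

```python
import math

# Hypothetical Cox model pieces (all numbers made up):
beta = [0.5, -1.2]   # coefficients
x1 = [1.0, 0.0]      # "treated" subject's covariates
x2 = [0.0, 0.0]      # "control" subject's covariates

def linpred(x, beta):
    return sum(xi * bi for xi, bi in zip(x, beta))

def hazard(t, x, h0):
    """Proportional hazards: h(t) = h0(t) * exp(X beta)."""
    return h0(t) * math.exp(linpred(x, beta))

# Two very different baseline hazard functions:
def h0_a(t):
    return 0.05            # constant baseline

def h0_b(t):
    return 0.01 * t ** 2   # increasing baseline

for h0 in (h0_a, h0_b):
    ratio = hazard(3.0, x1, h0) / hazard(3.0, x2, h0)
    print(ratio)  # exp(0.5) ~ 1.6487 both times: h0 cancels out
```

Whatever shape h_0(t) takes, the hazard ratio between the two subjects is exp(β₁) = e^0.5, which is why Cox regression can estimate β without ever estimating the baseline hazard.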
(2) Weighted Analysis Basics:

(a) Give four examples of situations in which you might want to perform a weighted analysis.
Solution: There are many situations in which you might want to perform a weighted analysis. Examples include: fitting an OLS regression model in which the constant variance assumption is violated; fitting a regression model in which the observed Y's are measured with different levels of accuracy (this occurs, for example, if the Y's are actually averages of several replications and the number of replicates varies from observation to observation); fitting a model where the observational units have differing sizes or perceived importance (e.g. measurements on countries); observational studies where some types of subjects were harder to reach than others and so are not represented proportionally in the sample; studies in which sampling was deliberately stratified to allow for obtaining adequate estimates for rare groups (in this case the rare group needs to be down-weighted when the estimates from the strata are combined); studies where the sampling was done in a hierarchical or clustered manner, e.g. by neighborhood or school, so that certain areas are over-represented (and their measurements are correlated); and many, many others. Basically, you need weighting when your sample is not a nice independent and identically distributed draw from the population of interest, so that it is in some way un-representative of that population.
(b) Describe briefly some of the ways in which you could incorporate weights into an estimation procedure.
Solution: For many estimation and modeling situations the weights (once you have them!) can be built right into the likelihood function you need to maximize. For instance, in OLS regression the maximum likelihood solution is the same as the least squares solution, which minimizes the sum of squared differences between the observed Y's and the values predicted by the regression line. In the weighted regression setting you simply minimize the sum of squared errors for each point multiplied by their weights. A similar strategy applies to all sorts of other regression models. When you are doing something like stratified sampling, rather than writing down the likelihood, it may be obvious how to apply the weights. For instance, if you have stratified your sample into men and women and calculated the mean value of your variable of interest separately in both groups, then you could create the weighted mean simply by multiplying the group means by the relative proportions of men and women in the population. Sometimes you do this even if the sample was not deliberately obtained by stratification. For instance, in an observational sample where you were concerned about representativeness, you might group the observations into "census" bins (could be age ranges, neighborhoods, or any other grouping of interest) and then use the census proportions for those bins to weight the estimates calculated separately for the bins. Sometimes the calculation of the appropriate weights can get very complex, depending on the sampling scheme and the sources of bias.
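The stratified-mean idea can be sketched in a few lines. The sample values and the "census" shares below are entirely made up; the point is only that the reweighted mean differs from the naive pooled mean when the sample composition is off:

```python
# Hypothetical sample that over-represents women (all numbers invented):
sample = {"men": [4.0, 6.0], "women": [10.0, 12.0, 14.0, 8.0]}
pop_props = {"men": 0.5, "women": 0.5}  # assumed population (census) shares

# Per-stratum means, then a mean reweighted to the population shares:
group_means = {g: sum(v) / len(v) for g, v in sample.items()}
weighted_mean = sum(pop_props[g] * group_means[g] for g in sample)

# The naive pooled mean implicitly weights by the *sample* shares (2/6, 4/6):
naive_mean = (sum(sum(v) for v in sample.values())
              / sum(len(v) for v in sample.values()))

print(group_means, weighted_mean, naive_mean)  # means 5 and 11; 8.0 vs 9.0
```

Here the naive mean (9.0) is pulled toward the over-sampled women, while the weighted mean (8.0) recovers what a balanced sample would estimate.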
(c) Define as carefully as you can what a propensity score is and give two examples of when and how you might want to use it.
Solution: A propensity score is simply the probability of having or not having a characteristic of interest. The most common use of propensity scores is in observational studies where you wish to know the causal impact of a treatment or other binary characteristic on an outcome of interest, but the subjects weren't randomized to the treatment and so there may be confounding factors. In this case, the propensity score is the probability that a subject receives the treatment given their observed values of the potential confounders. Propensity scores provide an alternative to matching or covariate adjustment. It may be very difficult to match subjects adequately if the number of possible confounders is large and the number of available cases is small. One could add the confounders as predictors, along with the treatment, into the model for the outcome. However, this requires knowing the "right" model, and if there are large numbers of confounders this may be complicated by multicollinearity or overfitting issues. Instead, one creates a logistic regression model with the treatment (yes or no) as the outcome variable and the potential confounders as predictors. The resulting predicted probabilities of treatment are the propensity scores. These scores can be used in a number of ways. First, one can match subjects on their propensity score. This is much easier than matching on many covariates, some values of which may be rare. Second, one may stratify the analysis by creating bins of people with similar propensity scores. One then examines the difference between the treatment and non-treatment groups within each bin and combines them. A common choice is to create 5 bins based on the quintiles of the propensity score distribution, but you need to check that within these bins the original confounders are well enough balanced between the treatment and control groups. If they are not, you may need to use finer binning. Third, you can simply use the propensity score as a covariate, along with the treatment variable, in the analysis of the outcome. It can be shown, under certain not-too-rigid requirements, that matching, stratifying, or adjusting by the propensity score will correctly account for the confounders, giving you an improved estimate of the "true" treatment effect.
Propensity scores can also be used to deal with missing data. Instead of creating a logistic model for "treatment", one creates a logistic model for "missingness" for each variable that has missing values, based on the other available variables. One can then use the propensity scores either to up-weight points that are similar to the observations that had missing values, or else use the propensity scores to perform imputation, using subjects with propensity scores similar to those of the subjects with missing values to do the filling in.
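The stratification approach can be sketched as follows. The propensity scores here are assumed to be precomputed (e.g. fitted probabilities from a logistic model), and every record is invented; real analyses typically use quintile bins and then check covariate balance within each bin:

```python
# Hypothetical records: (propensity score, treated flag, outcome).
records = [
    (0.12, 0, 5.0), (0.15, 1, 6.0), (0.18, 0, 5.5),
    (0.42, 0, 7.0), (0.45, 1, 9.0), (0.48, 1, 8.5),
    (0.81, 0, 10.0), (0.85, 1, 12.0), (0.88, 1, 11.5),
]

# Coarse score bins (a real analysis would often use quintiles):
bins = [(0.0, 0.3), (0.3, 0.6), (0.6, 1.0)]

effects, weights = [], []
for lo, hi in bins:
    grp = [r for r in records if lo <= r[0] < hi]
    treated = [y for p, t, y in grp if t == 1]
    control = [y for p, t, y in grp if t == 0]
    if treated and control:
        # within-bin treated-minus-control difference in mean outcome
        effects.append(sum(treated) / len(treated) - sum(control) / len(control))
        weights.append(len(grp))

# Combine the within-bin differences, weighting each bin by its size
# (one simple combination rule among several):
overall = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
print(effects, overall)
```

Because subjects within a bin have similar propensity scores, the within-bin treated-vs-control comparison is approximately free of confounding by the covariates that went into the score; the overall estimate just pools those comparisons.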
(3) Survival Analysis Example: The numbers below represent the survival times in years for two groups with a rare disease. The values with (+) are censored. Use the values to answer the following questions. (Note: You should know how to do the calculations in this problem manually, but it's much easier in STATA or SAS. In STATA the relevant commands are under the "sts" header. You use "stset" to tell STATA what your time and censoring variables are. You use "sts graph, by(group)" to get survival curves, separated by groups if you like. You use "sts test group, logrank" to obtain the log rank test. I would never make the calculations this messy on the exam; I might simply give you the tables of calculations like the one on the class handout and ask you to explain how one of the values was computed....)

Group 0: 1.33+, 2.21, 2.80+, 3.92, 5.44+, 9.99, 12.01, 12.07, 17.31, 28.65
Group 1: 0.122+, 0.43, 0.64, 1.58, 2.02, 3.08, 3.62+, 4.33, 5.52+, 11.86
(a) and (b) Show how to find the Kaplan-Meier estimates of the survival curves for these two groups and sketch the first few values. Does it appear that one group does better than the other? (The complete curves are shown in the accompanying graphics file for your reference.) Specifically, give the estimated 3-year survival probabilities, S_0(3) and S_1(3), based on these data, along with the corresponding confidence intervals.
Solution: To compute the Kaplan-Meier curve we need to order the observed event times (deaths, not censoring times) and count the number of people who die at those event times and who are at risk at each of those event times. If we let d_i be the number who die at event time t_i and Y_i be the number at risk at time t_i, then the K-M estimator of the survival function is

Ŝ(t) = ∏_{t_i ≤ t} (1 − d_i/Y_i)

The corresponding variance is

V̂(Ŝ(t)) = Ŝ(t)² ∑_{t_i ≤ t} d_i / [Y_i(Y_i − d_i)]
For group 0 we have 7 uncensored event times, and no two people die at the same time, so we have d_i = 1 in all cases. To get the number at risk we need to account for the censoring: only people who have not previously died or been censored by time t_i get counted as at risk for the subsequent interval. I show the first few calculations for both groups. Below we are interested in 3-year survival, so we really only need the calculations that occur before t = 3. For group 0, the first event time is t_1 = 2.21. One person is censored before that time, so 9 out of 10 subjects are still in the study and we have Y_1 = 9. The resulting estimate of the survival probability is Ŝ(t_1) = 1 − 1/9 = 8/9 = .889. The corresponding estimate of the variance is (.889)²(1/(9·(9 − 1))) = .011. The second observed death occurs at 3.92, which is already past t = 3. This means that our 3-year estimate of the survival probability for group 0 is .889. Let's do the next calculation anyway. There is another person censored in the interim, so Y_2 = 7. Our estimated survival value is Ŝ(t_2) = (1 − 1/9)(1 − 1/7) = .762. The corresponding variance is (.762)²(1/72 + 1/42) = .022. To plot these we draw a "step function" which is flat with Ŝ(t) = 1 for times prior to t_1 = 2.21, then drops down to .889 between t_1 = 2.21 and t_2 = 3.92, and then to .762 until t_3 = 9.99, and so on.
For group 1 there are four events before time t = 3 and only one censoring time, which occurs before the first event. Thus our numbers at risk are Y_1 = 9, Y_2 = 8, Y_3 = 7 and Y_4 = 6. The 3-year estimate of the survival curve is therefore

Ŝ(3) = (1 − 1/9)(1 − 1/8)(1 − 1/7)(1 − 1/6) = .556

The corresponding variance estimate is

V̂(Ŝ(3)) = (.556)² [1/(9·8) + 1/(8·7) + 1/(7·6) + 1/(6·5)] = .0275
The plots of both survival curves are shown in the accompanying graphics file.
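The hand calculations above can be reproduced with a short pure-Python Kaplan-Meier routine (data entry from the problem statement; rounding choices are mine):

```python
# (time, died) pairs for the two groups; died = 0 means censored.
group0 = [(1.33, 0), (2.21, 1), (2.80, 0), (3.92, 1), (5.44, 0),
          (9.99, 1), (12.01, 1), (12.07, 1), (17.31, 1), (28.65, 1)]
group1 = [(0.122, 0), (0.43, 1), (0.64, 1), (1.58, 1), (2.02, 1),
          (3.08, 1), (3.62, 0), (4.33, 1), (5.52, 0), (11.86, 1)]

def km_at(data, t):
    """Kaplan-Meier estimate S_hat(t) and Greenwood variance at time t."""
    data = sorted(data)
    n_at_risk = len(data)
    s_hat, var_sum = 1.0, 0.0
    for time, died in data:
        if time > t:
            break
        if died:
            s_hat *= 1 - 1 / n_at_risk           # d_i = 1 at every event here
            var_sum += 1 / (n_at_risk * (n_at_risk - 1))
        n_at_risk -= 1          # a death or a censoring leaves the risk set
    return s_hat, s_hat ** 2 * var_sum

s0, v0 = km_at(group0, 3)   # ~ .889 and ~ .011
s1, v1 = km_at(group1, 3)   # ~ .556 and ~ .027
print(round(s0, 3), round(v0, 3), round(s1, 3), round(v1, 3))
```

Note that the routine only handles the no-ties case (one death per event time), which is all this data set needs; with ties, d_i would have to be counted per event time.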
To get confidence intervals for the survival probabilities we simply take the square roots of the variance estimates above to get the standard errors and use a Z-interval. For group 0 the CI for the 3-year survival rate is .889 ± (1.96)√.011 = [.68, 1]. The estimated upper limit of the interval actually goes above 1, but of course we can't have a probability greater than 1. Our interval is very wide because we do not have very many data points with events in this interval! For group 1 the confidence interval is .556 ± 1.96√.0275 = [.23, .88].
(c) Estimate the median survival time for people in these two groups.
Solution: The median survival time is the time at which 50% of people live longer than that and 50% of people live less long than that. For group 0, if you carry out the calculations you find that the survival estimate drops from .61 between times t_3 = 9.99 and t_4 = 12.01 to .45 between t_4 = 12.01 and t_5 = 12.07. Therefore the median survival time must be somewhere between 9.99 and 12.01. Our best guess is that it is around 11 years. For group 1, the drop below 50% occurs between the 4th and 5th event times, t_4 = 2.02 and t_5 = 3.08. Thus our estimate of the median event time is around two and a half years, much, much lower than for group 0.
(d) Test whether there is a difference in 3-year survival probability for these two groups using (a) a Z-test and (b) a confidence interval for the difference between the survival curves.
Solution: The test statistic for a difference between two survival curves at time t is

Z = (Ŝ_0(t) − Ŝ_1(t)) / √(V̂(S_0(t)) + V̂(S_1(t)))

For our data at three years we have

Z = (.889 − .556) / √(.011 + .0275) = 1.70

If we are doing a two-sided test our p-value would be 2P(Z ≥ 1.70) = 2(.045) = .089, which is not quite good enough. I got this from a Z table, but we could alternatively use the fact that we would need the test statistic to be greater than z_.025 = 1.96 in absolute value. Either way, we don't quite make it. However, if we were doing a 1-sided test that the survival probability was greater in group 0 than in group 1, then the p-value would be .045 (or cutoff z_.05 = 1.645) and we would find a significant result.
We can also find a 95% confidence interval for the difference in 3-year survival probabilities. The estimate and the standard error just correspond to the two pieces of our Z statistic. We have

(.889 − .556) ± (1.96)√(.011 + .0275) = [−.05, .72]

Since the interval includes 0 we can't be 95% sure there is a difference, though it is pretty close.
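The Z statistic and interval are a two-line computation once the estimates and variances are in hand (small differences from the text's figures are rounding):

```python
import math

# Two-sample comparison at t = 3, using the rounded estimates above:
s0, v0 = 0.889, 0.011
s1, v1 = 0.556, 0.0275

se = math.sqrt(v0 + v1)                # standard error of the difference
z = (s0 - s1) / se                     # ~ 1.70
ci = ((s0 - s1) - 1.96 * se,
      (s0 - s1) + 1.96 * se)           # ~ (-0.05, 0.72)
print(round(z, 2), [round(c, 2) for c in ci])
```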
(e) The printout below shows the log-rank test comparing these two groups. Explain what hypotheses are being tested and give your real-world conclusions.

Solution: The log-rank test tests whether the hazard rates for the two groups are the same over a given interval (generally up to the time of the last observed event, which for us occurred at 28.65 years). The hypotheses are therefore:

H_0: h_0(t) = h_1(t) for all t ≤ 28.65; the two groups have the same hazard function (or equivalently survival function, distribution of survival times, etc.) over the period covered by our study.

H_A: h_0(t) ≠ h_1(t) for at least some t; the two hazard functions, survival curves, etc. are not identical.

From the printout we see that the p-value for the test is .0189, which is less than our usual significance level of α = .05, so we reject the null hypothesis and conclude that there is a difference in the hazard functions, at least over some part of the 0-30 year time interval. Given what we have seen in the K-M plot of the two survival curves this isn't too surprising: the curves look very different. The reason our results have been so close is that our sample is very small.
Log-rank test for equality of survivor functions<br />
| Events Events<br />
group | observed expected<br />
------+-------------------------<br />
0 | 7 10.47<br />
1 | 7 3.53<br />
------+-------------------------<br />
Total | 14 14.00<br />
chi2(1) = 5.51<br />
Pr>chi2 = 0.0189<br />
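The numbers in the printout can be reproduced with a pure-Python sketch of the log-rank calculation: at each distinct event time, compare the observed deaths in each group with the deaths expected under equal hazards, given each group's share of the risk set:

```python
# (time, died) pairs for the two groups; died = 0 means censored.
group0 = [(1.33, 0), (2.21, 1), (2.80, 0), (3.92, 1), (5.44, 0),
          (9.99, 1), (12.01, 1), (12.07, 1), (17.31, 1), (28.65, 1)]
group1 = [(0.122, 0), (0.43, 1), (0.64, 1), (1.58, 1), (2.02, 1),
          (3.08, 1), (3.62, 0), (4.33, 1), (5.52, 0), (11.86, 1)]

pooled = [(t, d, 0) for t, d in group0] + [(t, d, 1) for t, d in group1]
event_times = sorted({t for t, d, g in pooled if d == 1})

obs = [0, 0]        # observed deaths per group
exp_ = [0.0, 0.0]   # expected deaths under H0: equal hazards
var = 0.0           # hypergeometric variance of the group-1 death count

for et in event_times:
    at_risk = [sum(1 for t, d, g in pooled if t >= et and g == j) for j in (0, 1)]
    deaths = [sum(1 for t, d, g in pooled if t == et and d == 1 and g == j)
              for j in (0, 1)]
    n, d_tot = sum(at_risk), sum(deaths)
    for j in (0, 1):
        obs[j] += deaths[j]
        exp_[j] += d_tot * at_risk[j] / n   # group j's share of the deaths
    if n > 1:
        var += d_tot * (at_risk[0] / n) * (at_risk[1] / n) * (n - d_tot) / (n - 1)

chi2 = (obs[1] - exp_[1]) ** 2 / var
print(obs, [round(e, 2) for e in exp_], round(chi2, 2))
# obs = [7, 7]; expected ~ 10.47 and 3.53; chi2 ~ 5.51, as in the printout
```

Group 1 has twice its expected number of deaths (7 observed vs. 3.53 expected), which is what drives the significant chi-square statistic.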
(f) Suppose none of the observations had been censored. What would your estimates of the 3-year survival probabilities have been and why?
Solution: If there is no censoring then the Kaplan-Meier estimate of the survival curve is simply the proportion of people who have survived up to that time. For 3-year survival, if all our observations were uncensored we'd have p̂_0 = 7/10 = .7 surviving in group 0 and p̂_1 = 5/10 = .5 surviving in group 1.