Solutions to the practice problems. - UCLA Biostatistics
Biostatistics 201B
Final Practice Solutions
March 6th, 2013

Final Exam Practice Problems With Solutions
(1) Survival Analysis Basics:

(a) Explain what the term censoring means, some different ways in which it can occur, and why it is an important issue in analyses of time-to-event or survival data.
Solution: Censoring refers to the situation in which a value of interest is only partially observed. It is a general phenomenon but comes up most frequently when analyzing "time to event" or survival data, and I will use that terminology. In the survival setting, censoring means you have some information about when the event of interest occurred but you don't know it exactly. Because of this, you would be throwing away information if you treated the value as missing, but you would be biasing your answer if you used the partial information as if it were the true time of the event.
There are many kinds of censoring. Right censoring refers to the situation in which you know the survival time was longer than a certain value. This could happen, for instance, if you were measuring time to cancer recurrence and you know the patient was cancer free at their last visit at time t, but now they have dropped out of the study and you don't know their status. If you let T be the true time until cancer recurrence, you know T > t but you don't know any more than that. Left censoring occurs when you know the survival time is less than a certain value. Suppose a subject in our cancer study was cancer free at baseline but by the time of their first visit at time t the cancer had already recurred. You would know T < t but you would have no idea exactly when it happened. Interval censoring occurs when you know the survival time is between two values. For instance, we might know that our cancer patient was disease free at their previous visit at time t_1 but now at their current visit at t_2 the cancer has recurred. We know t_1 < T < t_2 but we don't know the exact time.
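One common way to represent these censoring types in code is as an interval (L, R) known to contain the true event time T. This is a minimal sketch; the field names and numbers are my own, not from the text:

```python
import math

# Each partially observed survival time is stored as an interval (L, R)
# that is known to contain the true event time T (values invented):
observations = [
    {"kind": "exact",    "L": 4.2, "R": 4.2},       # event observed at t = 4.2
    {"kind": "right",    "L": 3.0, "R": math.inf},  # last seen event-free at t = 3
    {"kind": "left",     "L": 0.0, "R": 1.5},       # event had occurred by t = 1.5
    {"kind": "interval", "L": 2.0, "R": 5.0},       # event between visits at 2 and 5
]

def contains(obs, true_t):
    """Is true_t consistent with the partial information in obs?"""
    return obs["L"] <= true_t <= obs["R"]

print([contains(o, 4.2) for o in observations])  # [True, True, False, True]
```

Exact observations are the degenerate interval L = R; right censoring has an infinite right endpoint, which is why it costs information but is still usable.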
Censoring can occur for a variety of reasons. It can be known about in advance, for instance if you plan to stop the study at a particular time when many subjects will still not have had the event of interest. Everyone still in the study who has not had the event will be censored at the end point. It can also occur randomly, for instance because a subject drops out, dies for a reason other than the event under study, or otherwise becomes unobservable. What is really important is whether the censoring is non-informative, i.e. knowing when or if the subject was censored doesn't tell you anything about when the actual event occurred. You can have non-informative censoring, for instance because a study participant moves for job-related reasons and therefore can't participate any more. However, subjects frequently drop out for reasons having to do with the event under study, e.g. they become too sick to participate, which may in turn suggest the event is imminent. Most standard survival methods assume to some extent or another that the censoring is non-informative, at least after you have adjusted for appropriate covariates. However, this is always an issue that needs careful attention.
(b) Give an intuitive description of the key quantities in survival analysis: the survival function, S(t); the cumulative death function, F(t); the death density, f(t); and the hazard function, h(t) = f(t)/S(t).
Solution: The survival function is simply the probability that a subject survives past a certain point, or equivalently the proportion of the population that survives past that point. If T is the survival time then S(t) = P(T ≥ t). The cumulative death distribution is the opposite of the survival function. It gives the proportion of a population that has the event prior to a given time, F(t) = P(T ≤ t). The death density, f(t), gives the relative likelihood of the event of interest happening at time t. You can think of it as the probability that the event happens in a tiny interval around time t. The hazard function is like the density function except that it is conditional on having survived up to time t. It gives the relative likelihood of the event happening at time t given that the event has not happened up to time t. You can think of it as the probability that the event will happen in a tiny interval after time t assuming it hasn't happened prior to t. The hazard function is given by h(t) = f(t)/S(t). Note that all these quantities are inter-related. If you know one of them for all values of t then you can calculate all the others.
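The relationships among the four quantities are easy to check numerically for a concrete distribution. A minimal sketch for an exponential survival time (the rate lam = 0.1 is an arbitrary choice of mine):

```python
import math

lam = 0.1  # hypothetical constant event rate

def S(t):   # survival function: P(T >= t)
    return math.exp(-lam * t)

def F(t):   # cumulative death distribution: P(T <= t) = 1 - S(t)
    return 1 - S(t)

def f(t):   # death density: derivative of F
    return lam * math.exp(-lam * t)

def h(t):   # hazard: f(t) / S(t)
    return f(t) / S(t)

# For the exponential, the e^{-lam*t} factors cancel and the hazard
# is constant at lam for every t:
print(h(1.0), h(10.0))  # both 0.1
```

The constant hazard is exactly what makes the exponential the "memoryless" survival model; any one of S, F, f, or h pins down the other three.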
(c) Describe the two major categories of approaches to estimating the survival quantities above.
Solution: There are both parametric and non-parametric techniques for estimating the key survival functions. In the parametric approach you assume that the survival time, T, has a particular distribution (e.g. an exponential distribution), or equivalently that the survival curve or hazard function has a particular shape, and you then use maximum likelihood ideas to estimate the parameters of that distribution (e.g. mean, variance, etc.). In the non-parametric framework you use the empirical distribution given by the observed event times and censoring times in your sample. The Kaplan-Meier product limit estimator of the survival curve is the classic non-parametric technique. There is also a famous estimator of the cumulative hazard function, H(t), which "adds up" the total hazard a subject has been exposed to up to time t, called the Nelson-Aalen estimator. Both require only that you know the event and censoring times.
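The Nelson-Aalen estimator Ĥ(t) = ∑_{t_i ≤ t} d_i/Y_i can be sketched in a few lines of pure Python; the (time, died) pairs below are made up for illustration (died = 0 means censored):

```python
# Hypothetical data: (time, died) pairs, died = 0 meaning censored.
data = sorted([(2.0, 1), (3.5, 0), (4.0, 1), (4.0, 1), (6.0, 0), (7.5, 1)])

def nelson_aalen(data, t):
    """Cumulative hazard estimate H_hat(t) = sum of d_i / Y_i over t_i <= t."""
    n_at_risk = len(data)
    h_cum = 0.0
    i = 0
    while i < len(data):
        time = data[i][0]
        if time > t:
            break
        # group all records tied at this time so d_i counts ties correctly
        same = [rec for rec in data if rec[0] == time]
        deaths = sum(d for _, d in same)
        if deaths:
            h_cum += deaths / n_at_risk
        n_at_risk -= len(same)  # deaths and censorings both leave the risk set
        i += len(same)
    return h_cum

print(nelson_aalen(data, 5.0))  # 1/6 + 2/4 ≈ 0.667
```

At t = 5 the two contributions are 1/6 (one death among six at risk at time 2.0) and 2/4 (two tied deaths among four at risk at time 4.0); the censored subjects contribute nothing directly but shrink the risk set.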
(d) Explain briefly the ideas behind the accelerated failure time model and the Cox proportional hazards model for incorporating covariates into a survival model.
Solution: The accelerated failure time (AFT) model is basically a generalized linear model with a log link and the usual systematic component consisting of a linear combination of the predictor variables or X's. It can be fit in conjunction with a number of distribution functions and is a parametric model. The trick is that the censoring times have to be built into the likelihood function that you are trying to maximize. The interpretation of the coefficients proceeds in the usual way, as it would for a Poisson or other model that has a log link. The Cox proportional hazards model is what is called a "semi-parametric" model. You make some assumptions about the form of the distribution/hazard function but stop short of estimating the whole thing. Specifically, the proportional hazards model assumes that the hazard function can be written as

h(t) = h_0(t)c(Xβ)

where h_0(t) is called the baseline hazard function and c is a known function (frequently the exponential, so that c(Xβ) = e^{Xβ}) applied to our usual linear combination of the X's. This is a multiplicative model. If c is the exponential function then taking logs puts us back on our favorite additive scale. Note that while we assume the form of the function c is known, we make no assumptions about the form of the baseline hazard function (other than that it is non-negative and usually continuous). This is why the model is called semi-parametric. In fact, to understand the impact of the X's we do not need to estimate the shape of the baseline hazard, because if we take the hazard ratio for two people with different covariate values the baseline hazard cancels out, leaving us a piece that depends only on c.
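The cancellation of the baseline hazard is easy to see numerically. A sketch with invented coefficients and two deliberately different baseline hazards:

```python
import math

# Hypothetical Cox model pieces (all numbers made up):
beta = [0.5, -1.2]   # coefficients
x1 = [1.0, 0.0]      # "treated" subject's covariates
x2 = [0.0, 0.0]      # "control" subject's covariates

def linpred(x, beta):
    return sum(xi * bi for xi, bi in zip(x, beta))

def hazard(t, x, h0):
    """Proportional hazards: h(t) = h0(t) * exp(X beta)."""
    return h0(t) * math.exp(linpred(x, beta))

# Two very different baseline hazard functions:
def h0_a(t):
    return 0.05            # constant baseline

def h0_b(t):
    return 0.01 * t ** 2   # increasing baseline

for h0 in (h0_a, h0_b):
    ratio = hazard(3.0, x1, h0) / hazard(3.0, x2, h0)
    print(ratio)  # exp(0.5) ~ 1.6487 both times: h0 cancels out
```

Whatever shape h_0(t) takes, the hazard ratio between the two subjects is exp(β₁) = e^0.5, which is why Cox regression can estimate β without ever estimating the baseline hazard.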
(2) Weighted Analysis Basics:

(a) Give four examples of situations in which you might want to perform a weighted analysis.
Solution: There are many situations in which you might want to perform a weighted analysis. Examples include: fitting an OLS regression model in which the constant variance assumption is violated; fitting a regression model in which the observed Y's are measured with different levels of accuracy (this occurs, for example, if the Y's are actually averages of several replications and the number of replicates varies from observation to observation); fitting a model where the observational units have differing sizes or perceived importance (e.g. measurements on countries); observational studies where some types of subjects were harder to reach than others and so are not represented proportionally in the sample; studies in which sampling was deliberately stratified to allow for obtaining adequate estimates for rare groups (in this case the rare group needs to be down-weighted when the estimates from the strata are combined); studies where the sampling was done in a hierarchical or clustered manner, e.g. by neighborhood or school, so that certain areas are over-represented (and their measurements are correlated); and many, many others. Basically, you need weighting when your sample is not a nice independent and identically distributed draw from the population of interest, so that it is in some way un-representative of that population.
(b) Describe briefly some of the ways in which you could incorporate weights into an estimation procedure.
Solution: For many estimation and modeling situations the weights (once you have them!) can be built right into the likelihood function you need to maximize. For instance, in OLS regression the maximum likelihood solution is the same as the least squares solution, which minimizes the sum of squared differences between the observed Y's and the values predicted by the regression line. In the weighted regression setting you simply minimize the sum of squared errors for each point multiplied by their weights. A similar strategy applies to all sorts of other regression models. When you are doing something like stratified sampling, rather than writing down the likelihood, it may be obvious how to apply the weights. For instance, if you have stratified your sample into men and women and calculated the mean value of your variable of interest separately in both groups, then you could create the weighted mean simply by multiplying the group means by the relative proportions of men and women in the population. Sometimes you do this even if the sample was not deliberately obtained by stratification. For instance, in an observational sample where you were concerned about representativeness, you might group the observations into "census" bins (could be age ranges, neighborhoods, or any other grouping of interest) and then use the census proportions for those bins to weight the estimates calculated separately for the bins. Sometimes the calculation of the appropriate weights can get very complex, depending on the sampling scheme and the sources of bias.
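The stratified-mean idea can be sketched in a few lines. The sample values and the "census" shares below are entirely made up; the point is only that the reweighted mean differs from the naive pooled mean when the sample composition is off:

```python
# Hypothetical sample that over-represents women (all numbers invented):
sample = {"men": [4.0, 6.0], "women": [10.0, 12.0, 14.0, 8.0]}
pop_props = {"men": 0.5, "women": 0.5}  # assumed population (census) shares

# Per-stratum means, then a mean reweighted to the population shares:
group_means = {g: sum(v) / len(v) for g, v in sample.items()}
weighted_mean = sum(pop_props[g] * group_means[g] for g in sample)

# The naive pooled mean implicitly weights by the *sample* shares (2/6, 4/6):
naive_mean = (sum(sum(v) for v in sample.values())
              / sum(len(v) for v in sample.values()))

print(group_means, weighted_mean, naive_mean)  # means 5 and 11; 8.0 vs 9.0
```

Here the naive mean (9.0) is pulled toward the over-sampled women, while the weighted mean (8.0) recovers what a balanced sample would estimate.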
(c) Define as carefully as you can what a propensity score is and give two examples of when and how you might want to use it.
Solution: A propensity score is simply the probability of having or not having a characteristic of interest. The most common use of propensity scores is in observational studies where you wish to know the causal impact of a treatment or other binary characteristic on an outcome of interest, but the subjects weren't randomized to the treatment and so there may be confounding factors. In this case, the propensity score is the probability that a subject receives the treatment given their observed values of the potential confounders. Propensity scores provide an alternative to matching or covariate adjustment. It may be very difficult to match subjects adequately if the number of possible confounders is large and the number of available cases is small. One could add the confounders as predictors, along with the treatment, into the model for the outcome. However, this requires knowing the "right" model, and if there are large numbers of confounders this may be complicated by multicollinearity or overfitting issues. Instead, one creates a logistic regression model with the treatment (yes or no) as the outcome variable and the potential confounders as predictors. The resulting predicted probabilities of treatment are the propensity scores. These scores can be used in a number of ways. First, one can match subjects on their propensity score. This is much easier than matching on many covariates, some values of which may be rare. Second, one may stratify the analysis by creating bins of people with similar propensity scores. One then examines the difference between the treatment and non-treatment groups within each bin and combines them. A common choice is to create 5 bins based on the quintiles of the propensity score distribution, but you need to check that within these bins the original confounders are well enough balanced between the treatment and control groups. If they are not, you may need to use finer binning. Third, you can simply use the propensity score as a covariate, along with the treatment variable, in the analysis of the outcome. It can be shown, under certain not-too-rigid requirements, that matching, stratifying, or adjusting by the propensity score will correctly account for the confounders, giving you an improved estimate of the "true" treatment effect.
Propensity scores can also be used to deal with missing data. Instead of creating a logistic model for "treatment", one creates a logistic model for "missingness" for each variable that has missing values, based on the other available variables. One can then use the propensity scores either to up-weight points that are similar to the observations that had missing values, or else use the propensity scores to perform imputation, using subjects with propensity scores similar to those of the subjects with missing values to do the filling in.
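The stratification approach can be sketched as follows. The propensity scores here are assumed to be precomputed (e.g. fitted probabilities from a logistic model), and every record is invented; real analyses typically use quintile bins and then check covariate balance within each bin:

```python
# Hypothetical records: (propensity score, treated flag, outcome).
records = [
    (0.12, 0, 5.0), (0.15, 1, 6.0), (0.18, 0, 5.5),
    (0.42, 0, 7.0), (0.45, 1, 9.0), (0.48, 1, 8.5),
    (0.81, 0, 10.0), (0.85, 1, 12.0), (0.88, 1, 11.5),
]

# Coarse score bins (a real analysis would often use quintiles):
bins = [(0.0, 0.3), (0.3, 0.6), (0.6, 1.0)]

effects, weights = [], []
for lo, hi in bins:
    grp = [r for r in records if lo <= r[0] < hi]
    treated = [y for p, t, y in grp if t == 1]
    control = [y for p, t, y in grp if t == 0]
    if treated and control:
        # within-bin treated-minus-control difference in mean outcome
        effects.append(sum(treated) / len(treated) - sum(control) / len(control))
        weights.append(len(grp))

# Combine the within-bin differences, weighting each bin by its size
# (one simple combination rule among several):
overall = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
print(effects, overall)
```

Because subjects within a bin have similar propensity scores, the within-bin treated-vs-control comparison is approximately free of confounding by the covariates that went into the score; the overall estimate just pools those comparisons.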
(3) Survival Analysis Example: The numbers below represent the survival times in years for two groups with a rare disease. The values with (+) are censored. Use the values to answer the following questions. (Note: You should know how to do the calculations in this problem manually, but it's much easier in STATA or SAS. In STATA the relevant commands are under the "sts" header. You use "stset" to tell STATA what your time and censoring variables are. You use "sts graph, by(group)" to get survival curves, separated by groups if you like. You use "sts test group, logrank" to obtain the log rank test. I would never make the calculations this messy on the exam; I might simply give you the tables of calculations like the one on the class handout and ask you to explain how one of the values was computed....)

Group 0: 1.33+, 2.21, 2.80+, 3.92, 5.44+, 9.99, 12.01, 12.07, 17.31, 28.65
Group 1: 0.122+, 0.43, 0.64, 1.58, 2.02, 3.08, 3.62+, 4.33, 5.52+, 11.86
(a) and (b) Show how to find the Kaplan-Meier estimates of the survival curves for these two groups and sketch the first few values. Does it appear that one group does better than the other? (The complete curves are shown in the accompanying graphics file for your reference.) Specifically, give the estimated 3-year survival probabilities, S_0(3) and S_1(3), based on these data, along with the corresponding confidence intervals.
Solution: To compute the Kaplan-Meier curve we need to order the observed event times (deaths, not censoring times) and count the number of people who die at those event times and who are at risk at each of those event times. If we let d_i be the number who die at event time t_i and Y_i be the number at risk at time t_i, then the K-M estimator of the survival function is

Ŝ(t) = ∏_{t_i ≤ t} (1 − d_i/Y_i)

The corresponding variance is

V̂(Ŝ(t)) = Ŝ(t)² ∑_{t_i ≤ t} d_i / [Y_i(Y_i − d_i)]
For group 0 we have 7 uncensored event times, and no two people die at the same time, so we have d_i = 1 in all cases. To get the number at risk we need to account for the censoring: only people who have not previously died or been censored by time t_i get counted as at risk for the subsequent interval. I show the first few calculations for both groups. Below we are interested in 3-year survival, so we really only need the calculations that occur before t = 3. For group 0, the first event time is t_1 = 2.21. One person is censored before that time, so 9 out of 10 subjects are still in the study and we have Y_1 = 9. The resulting estimate of the survival probability is Ŝ(t_1) = 1 − 1/9 = 8/9 = .889. The corresponding estimate of the variance is (.889)²(1/(9·(9 − 1))) = .011. The second observed death occurs at 3.92, which is already past t = 3. This means that our 3-year estimate of the survival probability for group 0 is .889. Let's do the next calculation anyway. There is another person censored in the interim, so Y_2 = 7. Our estimated survival value is Ŝ(t_2) = (1 − 1/9)(1 − 1/7) = .762. The corresponding variance is (.762)²(1/72 + 1/42) = .022. To plot these we draw a "step function" which is flat with Ŝ(t) = 1 for times prior to t_1 = 2.21, then drops down to .889 between t_1 = 2.21 and t_2 = 3.92, and then to .762 until t_3 = 9.99, and so on.
For group 1 there are four events before time t = 3 and only one censoring time, which occurs before the first event. Thus our numbers at risk are Y_1 = 9, Y_2 = 8, Y_3 = 7 and Y_4 = 6. The 3-year estimate of the survival curve is therefore

Ŝ(3) = (1 − 1/9)(1 − 1/8)(1 − 1/7)(1 − 1/6) = .556

The corresponding variance estimate is

V̂(Ŝ(3)) = (.556)² [1/(9·8) + 1/(8·7) + 1/(7·6) + 1/(6·5)] = .0275
The plots of both survival curves are shown in the accompanying graphics file.
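The hand calculations above can be reproduced with a short pure-Python Kaplan-Meier routine (data entry from the problem statement; rounding choices are mine):

```python
# (time, died) pairs for the two groups; died = 0 means censored.
group0 = [(1.33, 0), (2.21, 1), (2.80, 0), (3.92, 1), (5.44, 0),
          (9.99, 1), (12.01, 1), (12.07, 1), (17.31, 1), (28.65, 1)]
group1 = [(0.122, 0), (0.43, 1), (0.64, 1), (1.58, 1), (2.02, 1),
          (3.08, 1), (3.62, 0), (4.33, 1), (5.52, 0), (11.86, 1)]

def km_at(data, t):
    """Kaplan-Meier estimate S_hat(t) and Greenwood variance at time t."""
    data = sorted(data)
    n_at_risk = len(data)
    s_hat, var_sum = 1.0, 0.0
    for time, died in data:
        if time > t:
            break
        if died:
            s_hat *= 1 - 1 / n_at_risk           # d_i = 1 at every event here
            var_sum += 1 / (n_at_risk * (n_at_risk - 1))
        n_at_risk -= 1          # a death or a censoring leaves the risk set
    return s_hat, s_hat ** 2 * var_sum

s0, v0 = km_at(group0, 3)   # ~ .889 and ~ .011
s1, v1 = km_at(group1, 3)   # ~ .556 and ~ .027
print(round(s0, 3), round(v0, 3), round(s1, 3), round(v1, 3))
```

Note that the routine only handles the no-ties case (one death per event time), which is all this data set needs; with ties, d_i would have to be counted per event time.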
To get confidence intervals for the survival probabilities we simply take the square roots of the variance estimates above to get the standard errors and use a Z-interval. For group 0 the CI for the 3-year survival rate is .889 ± (1.96)√.011 = [.68, 1]. The estimated upper limit of the interval actually goes above 1, but of course we can't have a probability greater than 1. Our interval is very wide because we do not have very many data points with events in this interval! For group 1 the confidence interval is .556 ± 1.96√.0275 = [.23, .88].
(c) Estimate the median survival time for people in these two groups.
Solution: The median survival time is the time at which 50% of people live longer than that and 50% of people live less long than that. For group 0, if you carry out the calculations you find that the survival estimate drops from .61 between times t_3 = 9.99 and t_4 = 12.01 to .45 between t_4 = 12.01 and t_5 = 12.07. Therefore the median survival time must be somewhere between 9.99 and 12.01. Our best guess is that it is around 11 years. For group 1, the drop below 50% occurs between the 4th and 5th event times, t_4 = 2.02 and t_5 = 3.08. Thus our estimate of the median event time is around two and a half years, much, much lower than for group 0.
(d) Test whether there is a difference in 3-year survival probability for these two groups using (a) a Z-test and (b) a confidence interval for the difference between the survival curves.
Solution: The test statistic for a difference between two survival curves at time t is

Z = (Ŝ_0(t) − Ŝ_1(t)) / √(V̂(S_0(t)) + V̂(S_1(t)))

For our data at three years we have

Z = (.889 − .556) / √(.011 + .0275) = 1.70

If we are doing a two-sided test our p-value would be 2P(Z ≥ 1.70) = 2(.045) = .089, which is not quite good enough. I got this from a Z table, but we could alternatively use the fact that we would need the test statistic to be greater than z_.025 = 1.96 in absolute value. Either way, we don't quite make it. However, if we were doing a 1-sided test that the survival probability was greater in group 0 than in group 1, then the p-value would be .045 (or cutoff z_.05 = 1.645) and we would find a significant result.
We can also find a 95% confidence interval for the difference in 3-year survival probabilities. The estimate and the standard error just correspond to the two pieces of our Z statistic. We have

(.889 − .556) ± (1.96)√(.011 + .0275) = [−.05, .72]

Since the interval includes 0 we can't be 95% sure there is a difference, though it is pretty close.
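The Z statistic and interval are a two-line computation once the estimates and variances are in hand (small differences from the text's figures are rounding):

```python
import math

# Two-sample comparison at t = 3, using the rounded estimates above:
s0, v0 = 0.889, 0.011
s1, v1 = 0.556, 0.0275

se = math.sqrt(v0 + v1)                # standard error of the difference
z = (s0 - s1) / se                     # ~ 1.70
ci = ((s0 - s1) - 1.96 * se,
      (s0 - s1) + 1.96 * se)           # ~ (-0.05, 0.72)
print(round(z, 2), [round(c, 2) for c in ci])
```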
(e) The printout below shows the log-rank test comparing these two groups. Explain what hypotheses are being tested and give your real-world conclusions.

Solution: The log-rank test tests whether the hazard rates for the two groups are the same over a given interval (generally up to the time of the last observed event, which for us occurred at 28.65 years). The hypotheses are therefore:

H_0: h_0(t) = h_1(t) for all t ≤ 28.65; the two groups have the same hazard function (or equivalently survival function, distribution of survival times, etc.) over the period covered by our study.

H_A: h_0(t) ≠ h_1(t) for at least some t; the two hazard functions, survival curves, etc. are not identical.

From the printout we see that the p-value for the test is .0189, which is less than our usual significance level of α = .05, so we reject the null hypothesis and conclude that there is a difference in the hazard functions, at least over some part of the 0-30 year time interval. Given what we have seen in the K-M plot of the two survival curves this isn't too surprising: the curves look very different. The reason our results have been so close is that our sample is very small.
Log-rank test for equality of survivor functions<br />
| Events Events<br />
group | observed expected<br />
------+-------------------------<br />
0 | 7 10.47<br />
1 | 7 3.53<br />
------+-------------------------<br />
Total | 14 14.00<br />
chi2(1) = 5.51<br />
Pr>chi2 = 0.0189<br />
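The numbers in the printout can be reproduced with a pure-Python sketch of the log-rank calculation: at each distinct event time, compare the observed deaths in each group with the deaths expected under equal hazards, given each group's share of the risk set:

```python
# (time, died) pairs for the two groups; died = 0 means censored.
group0 = [(1.33, 0), (2.21, 1), (2.80, 0), (3.92, 1), (5.44, 0),
          (9.99, 1), (12.01, 1), (12.07, 1), (17.31, 1), (28.65, 1)]
group1 = [(0.122, 0), (0.43, 1), (0.64, 1), (1.58, 1), (2.02, 1),
          (3.08, 1), (3.62, 0), (4.33, 1), (5.52, 0), (11.86, 1)]

pooled = [(t, d, 0) for t, d in group0] + [(t, d, 1) for t, d in group1]
event_times = sorted({t for t, d, g in pooled if d == 1})

obs = [0, 0]        # observed deaths per group
exp_ = [0.0, 0.0]   # expected deaths under H0: equal hazards
var = 0.0           # hypergeometric variance of the group-1 death count

for et in event_times:
    at_risk = [sum(1 for t, d, g in pooled if t >= et and g == j) for j in (0, 1)]
    deaths = [sum(1 for t, d, g in pooled if t == et and d == 1 and g == j)
              for j in (0, 1)]
    n, d_tot = sum(at_risk), sum(deaths)
    for j in (0, 1):
        obs[j] += deaths[j]
        exp_[j] += d_tot * at_risk[j] / n   # group j's share of the deaths
    if n > 1:
        var += d_tot * (at_risk[0] / n) * (at_risk[1] / n) * (n - d_tot) / (n - 1)

chi2 = (obs[1] - exp_[1]) ** 2 / var
print(obs, [round(e, 2) for e in exp_], round(chi2, 2))
# obs = [7, 7]; expected ~ 10.47 and 3.53; chi2 ~ 5.51, as in the printout
```

Group 1 has twice its expected number of deaths (7 observed vs. 3.53 expected), which is what drives the significant chi-square statistic.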
(f) Suppose none of the observations had been censored. What would your estimates of the 3-year survival probabilities have been and why?
Solution: If there is no censoring then the Kaplan-Meier estimate of the survival curve is simply the proportion of people who have survived up to that time. For 3-year survival, if all our observations were uncensored we'd have p̂_0 = 7/10 = .7 surviving in group 0 and p̂_1 = 5/10 = .5 surviving in group 1.