mixed - Stata

Title 

stata.com 

mixed — Multilevel mixed-effects linear regression 

Syntax 

Syntax Menu Description Options 

Remarks and examples Stored results Methods and formulas Acknowledgments 

References 

Also see 

mixed depvar fe equation [ || re equation ] [ || re equation . . . ] [ , options ] 

where the syntax of fe equation is 

[ 

indepvars 

] [ 

if 

] [ 

in 

] [ 

weight 

] [ 

, fe options 

] 

and the syntax of re equation is one of the following: 

for random coefficients and intercepts 

levelvar: [ varlist ] [ , re options ] 

for random effects among the values of a factor variable 

levelvar: R.varname [ , re options ] 

levelvar is a variable identifying the group structure for the random effects at that level or is 

representing one group comprising all observations. 

all 

fe options 

Model 

noconstant 

Description 

suppress constant term from the fixed-effects equation 

re options 

Model 

covariance(vartype) 

noconstant 

collinear 

fweight(exp) 

pweight(exp) 

Description 

variance–covariance structure of the random effects 

suppress constant term from the random-effects equation 

keep collinear variables 

frequency weights at higher levels 

sampling weights at higher levels 

1

2 mixed — Multilevel mixed-effects linear regression 

options 

Model 

mle 

reml 

pwscale(scale method) 

residuals(rspec) 

SE/Robust 

vce(vcetype) 

Reporting 

level(#) 

variance 

stddeviations 

noretable 

nofetable 

estmetric 

noheader 

nogroup 

nostderr 

nolrtest 

display options 

EM options 

emiterate(#) 

emtolerance(#) 

emonly 

emlog 

emdots 

Maximization 

maximize options 

matsqrt 

matlog 

coeflegend 

Description 

fit model via maximum likelihood; the default 

fit model via restricted maximum likelihood 

control scaling of sampling weights in two-level models 

structure of residual errors 

vcetype may be oim, robust, or cluster clustvar 

set confidence level; default is level(95) 

show random-effects and residual-error parameter estimates as variances 

and covariances; the default 

show random-effects and residual-error parameter estimates as standard 

deviations 

suppress random-effects table 

suppress fixed-effects table 

show parameter estimates in the estimation metric 

suppress output header 

suppress table summarizing groups 

do not estimate standard errors of random-effects parameters 

do not perform likelihood-ratio test comparing with linear regression 

control column formats, row spacing, line width, display of omitted 

variables and base and empty cells, and factor-variable labeling 

number of EM iterations; default is emiterate(20) 

EM convergence tolerance; default is emtolerance(1e-10) 

fit model exclusively using EM 

show EM iteration log 

show EM iterations as dots 

control the maximization process; seldom used 

parameterize variance components using matrix square roots; the default 

parameterize variance components using matrix logarithms 

display legend instead of statistics

vartype 

Description 

mixed — Multilevel mixed-effects linear regression 3 

independent one unique variance parameter per random effect, all covariances 0; 

the default unless the R. notation is used 

exchangeable equal variances for random effects, and one common pairwise 

covariance 

identity equal variances for random effects, all covariances 0; 

the default if the R. notation is used 

unstructured all variances and covariances to be distinctly estimated 

indepvars may contain factor variables; see [U] 11.4.3 Factor variables. 

depvar, indepvars, and varlist may contain time-series operators; see [U] 11.4.4 Time-series varlists. 

bootstrap, by, jackknife, mi estimate, rolling, and statsby are allowed; see [U] 11.1.10 Prefix commands. 

Weights are not allowed with the bootstrap prefix; see [R] bootstrap. 

pweights and fweights are allowed; see [U] 11.1.6 weight. 

coeflegend does not appear in the dialog box. 

See [U] 20 Estimation and postestimation commands for more capabilities of estimation commands. 

Menu 

Statistics > Multilevel mixed-effects models > Linear regression 

Description 

mixed fits linear mixed-effects models. The overall error distribution of the linear mixed-effects 

model is assumed to be Gaussian, and heteroskedasticity and correlations within lowest-level groups 

also may be modeled. 

Options 

✄ 

✄ 

Model 

 

noconstant suppresses the constant (intercept) term and may be specified for the fixed-effects 

equation and for any or all of the random-effects equations. 

covariance(vartype), where vartype is 

independent | exchangeable | identity | unstructured 

specifies the structure of the covariance matrix for the random effects and may be specified for 

each random-effects equation. An independent covariance structure allows for a distinct variance 

for each random effect within a random-effects equation and assumes that all covariances are 0. 

exchangeable structure specifies one common variance for all random effects and one common 

pairwise covariance. identity is short for “multiple of the identity”; that is, all variances are 

equal and all covariances are 0. unstructured allows for all variances and covariances to be 

distinct. If an equation consists of p random-effects terms, the unstructured covariance matrix will 

have p(p + 1)/2 unique parameters. 

covariance(independent) is the default, except when the R. notation is used, in which 

case covariance(identity) is the default and only covariance(identity) and covariance(exchangeable) 

are allowed.


collinear specifies that mixed not omit collinear variables from the random-effects equation. 

Usually, there is no reason to leave collinear variables in place; in fact, doing so usually causes 

the estimation to fail because of the matrix singularity caused by the collinearity. However, with 

certain models (for example, a random-effects model with a full set of contrasts), the variables 

may be collinear, yet the model is fully identified because of restrictions on the random-effects 

covariance structure. In such cases, using the collinear option allows the estimation to take 

place with the random-effects equation intact. 

fweight(exp) specifies frequency weights at higher levels in a multilevel model, whereas frequency 

weights at the first level (the observation level) are specified in the usual manner, for example, 

[fw=fwtvar1]. exp can be any valid Stata expression, and you can specify fweight() at levels 

two and higher of a multilevel model. For example, in the two-level model 

. mixed fixed_portion [fw = wt1] || school: . . . , fweight(wt2) . . . 

the variable wt1 would hold the first-level (the observation-level) frequency weights, and wt2 

would hold the second-level (the school-level) frequency weights. 

pweight(exp) specifies sampling weights at higher levels in a multilevel model, whereas sampling 

weights at the first level (the observation level) are specified in the usual manner, for example, 

[pw=pwtvar1]. exp can be any valid Stata expression, and you can specify pweight() at levels 

two and higher of a multilevel model. For example, in the two-level model 

. mixed fixed_portion [pw = wt1] || school: . . . , pweight(wt2) . . . 

variable wt1 would hold the first-level (the observation-level) sampling weights, and wt2 would 

hold the second-level (the school-level) sampling weights. 

See Survey data in Remarks and examples below for more information regarding the use of 

sampling weights in multilevel models. 

Weighted estimation, whether frequency or sampling, is not supported under restricted maximumlikelihood 

estimation (REML). 

mle and reml specify the statistical method for fitting the model. 

mle, the default, specifies that the model be fit using maximum likelihood (ML). 

reml specifies that the model be fit using restricted maximum likelihood (REML), also known as 

residual maximum likelihood. 

pwscale(scale method), where scale method is 

size | effective | gk 

controls how sampling weights (if specified) are scaled in two-level models. 

scale method size specifies that first-level (observation-level) weights be scaled so that they 

sum to the sample size of their corresponding second-level cluster. Second-level sampling 

weights are left unchanged. 

scale method effective specifies that first-level weights be scaled so that they sum to the 

effective sample size of their corresponding second-level cluster. Second-level sampling weights 

are left unchanged. 

scale method gk specifies the Graubard and Korn (1996) method. Under this method, secondlevel 

weights are set to the cluster averages of the products of the weights at both levels, and 

first-level weights are then set equal to 1. 

pwscale() is supported only with two-level models. See Survey data in Remarks and examples 

below for more details on using pwscale().


residuals(rspec), where rspec is 

restype [ , residual options ] 

specifies the structure of the residual errors within the lowest-level groups (the second level of a 

multilevel model with the observations comprising the first level) of the linear mixed model. For 

example, if you are modeling random effects for classes nested within schools, then residuals() 

refers to the residual variance–covariance structure of the observations within classes, the lowestlevel 

groups. 

restype is 

independent | exchangeable | ar # | ma # | unstructured | 

banded # | toeplitz # | exponential 

By default, restype is independent, which means that all residuals are independent and 

identically distributed (i.i.d.) Gaussian with one common variance. When combined with 

by(varname), independence is still assumed, but you estimate a distinct variance for each 

level of varname. Unlike with the structures described below, varname does not need to be 

constant within groups. 

restype exchangeable estimates two parameters, one common within-group variance and one 

common pairwise covariance. When combined with by(varname), these two parameters 

are distinctly estimated for each level of varname. Because you are modeling a withingroup 

covariance, varname must be constant within lowest-level groups. 

restype ar # assumes that within-group errors have an autoregressive (AR) structure of 

order #; ar 1 is the default. The t(varname) option is required, where varname is an 

integer-valued time variable used to order the observations within groups and to determine 

the lags between successive observations. Any nonconsecutive time values will be treated 

as gaps. For this structure, # + 1 parameters are estimated (# AR coefficients and one 

overall error variance). restype ar may be combined with by(varname), but varname 

must be constant within groups. 

restype ma # assumes that within-group errors have a moving average (MA) structure of 

order #; ma 1 is the default. The t(varname) option is required, where varname is an 

integer-valued time variable used to order the observations within groups and to determine 

the lags between successive observations. Any nonconsecutive time values will be treated 

as gaps. For this structure, # + 1 parameters are estimated (# MA coefficients and one 

overall error variance). restype ma may be combined with by(varname), but varname 


restype unstructured is the most general structure; it estimates distinct variances for 

each within-group error and distinct covariances for each within-group error pair. The 

t(varname) option is required, where varname is a nonnegative-integer–valued variable 

that identifies the observations within each group. The groups may be unbalanced in that 

not all levels of t() need to be observed within every group, but you may not have 

repeated t() values within any particular group. When you have p levels of t(), then 

p(p + 1)/2 parameters are estimated. restype unstructured may be combined with 

by(varname), but varname must be constant within groups. 

restype banded # is a special case of unstructured that restricts estimation to the 

covariances within the first # off-diagonals and sets the covariances outside this band to 

0. The t(varname) option is required, where varname is a nonnegative-integer–valued 

variable that identifies the observations within each group. # is an integer between 0 and 

p − 1, where p is the number of levels of t(). By default, # is p − 1; that is, all elements


✄ 

of the covariance matrix are estimated. When # is 0, only the diagonal elements of the 

covariance matrix are estimated. restype banded may be combined with by(varname), 

but varname must be constant within groups. 

restype toeplitz # assumes that within-group errors have Toeplitz structure of order #, 

for which correlations are constant with respect to time lags less than or equal to # 

and are 0 for lags greater than #. The t(varname) option is required, where varname 

is an integer-valued time variable used to order the observations within groups and to 

determine the lags between successive observations. # is an integer between 1 and the 

maximum observed lag (the default). Any nonconsecutive time values will be treated as 

gaps. For this structure, # + 1 parameters are estimated (# correlations and one overall 

error variance). restype toeplitz may be combined with by(varname), but varname 


restype exponential is a generalization of the AR covariance model that allows for unequally 

spaced and noninteger time values. The t(varname) option is required, where varname 

is real-valued. For the exponential covariance model, the correlation between two errors 

is the parameter ρ, raised to a power equal to the absolute value of the difference between 

the t() values for those errors. For this structure, two parameters are estimated (the 

correlation parameter ρ and one overall error variance). restype exponential may be 

combined with by(varname), but varname must be constant within groups. 

residual options are by(varname) and t(varname). 

✄ 

by(varname) is for use within the residuals() option and specifies that a set of distinct 

residual-error parameters be estimated for each level of varname. In other words, you 

use by() to model heteroskedasticity. 

t(varname) is for use within the residuals() option to specify a time variable for the 

ar, ma, toeplitz, and exponential structures, or to identify the observations when 

restype is unstructured or banded. 

SE/Robust 

 

vce(vcetype) specifies the type of standard error reported, which includes types that are derived 

from asymptotic theory (oim), that are robust to some kinds of misspecification (robust), and 

that allow for intragroup correlation (cluster clustvar); see [R] vce option. If vce(robust) is 

specified, robust variances are clustered at the highest level in the multilevel model. 

vce(robust) and vce(cluster clustvar) are not supported with REML estimation. 

✄ 

✄ Reporting 

level(#); see [R] estimation options. 

variance, the default, displays the random-effects and residual-error parameter estimates as variances 

and covariances. 

stddeviations displays the random-effects and residual-error parameter estimates as standard 

deviations and correlations. 

noretable suppresses the random-effects table from the output. 

nofetable suppresses the fixed-effects table from the output. 

estmetric displays all parameter estimates in the estimation metric. Fixed-effects estimates are 

unchanged from those normally displayed, but random-effects parameter estimates are displayed 

as log-standard deviations and hyperbolic arctangents of correlations, with equation names that


organize them by model level. Residual-variance parameter estimates are also displayed in their 

original estimation metric. 

noheader suppresses the output header, either at estimation or upon replay. 

nogroup suppresses the display of group summary information (number of groups, average group 

size, minimum, and maximum) from the output header. 

nostderr prevents mixed from calculating standard errors for the estimated random-effects parameters, 

although standard errors are still provided for the fixed-effects parameters. Specifying this option 

will speed up computation times. nostderr is available only when residuals are modeled as 

independent with constant variance. 

nolrtest prevents mixed from fitting a reference linear regression model and using this model to 

calculate a likelihood-ratio test comparing the mixed model to ordinary regression. This option 

may also be specified on replay to suppress this test from the output. 

display options: noomitted, vsquish, noemptycells, baselevels, allbaselevels, nofvlabel, 

fvwrap(#), fvwrapon(style), cformat(% fmt), pformat(% fmt), sformat(% fmt), and 

nolstretch; see [R] estimation options. 

✄ 

✄ 

EM options 

 

These options control the expectation-maximization (EM) iterations that take place before estimation 

switches to a gradient-based method. When residuals are modeled as independent with constant 

variance, EM will either converge to the solution or bring parameter estimates close to the solution. 

For other residual structures or for weighted estimation, EM is used to obtain starting values. 

✄ 

emiterate(#) specifies the number of EM iterations to perform. The default is emiterate(20). 

emtolerance(#) specifies the convergence tolerance for the EM algorithm. The default is 

emtolerance(1e-10). EM iterations will be halted once the log (restricted) likelihood changes 

by a relative amount less than #. At that point, optimization switches to a gradient-based method, 

unless emonly is specified, in which case maximization stops. 

emonly specifies that the likelihood be maximized exclusively using EM. The advantage of specifying 

emonly is that EM iterations are typically much faster than those for gradient-based methods. 

The disadvantages are that EM iterations can be slow to converge (if at all) and that EM provides 

no facility for estimating standard errors for the random-effects parameters. emonly is available 

only with unweighted estimation and when residuals are modeled as independent with constant 

variance. 

emlog specifies that the EM iteration log be shown. The EM iteration log is, by default, not 

displayed unless the emonly option is specified. 

emdots specifies that the EM iterations be shown as dots. This option can be convenient because 

the EM algorithm may require many iterations to converge. 

✄ 

Maximization 

 

maximize options: difficult, technique(algorithm spec), iterate(#), [ no ] log, trace, 

gradient, showstep, hessian, showtolerance, tolerance(#), ltolerance(#), 

nrtolerance(#), and nonrtolerance; see [R] maximize. Those that require special mention 

for mixed are listed below. 

For the technique() option, the default is technique(nr). The bhhh algorithm may not be 

specified. 

matsqrt (the default), during optimization, parameterizes variance components by using the matrix 

square roots of the variance–covariance matrices formed by these components at each model level.


matlog, during optimization, parameterizes variance components by using the matrix logarithms of 

the variance–covariance matrices formed by these components at each model level. 

The matsqrt parameterization ensures that variance–covariance matrices are positive semidefinite, 

while matlog ensures matrices that are positive definite. For most problems, the matrix square root 

is more stable near the boundary of the parameter space. However, if convergence is problematic, 

one option may be to try the alternate matlog parameterization. When convergence is not an issue, 

both parameterizations yield equivalent results. 

The following option is available with mixed but is not shown in the dialog box: 

coeflegend; see [R] estimation options. 

Remarks and examples 

stata.com 

Remarks are presented under the following headings: 

Introduction 

Two-level models 

Covariance structures 

Likelihood versus restricted likelihood 

Three-level models 

Blocked-diagonal covariance structures 

Heteroskedastic random effects 

Heteroskedastic residual errors 

Other residual-error structures 

Crossed-effects models 

Diagnosing convergence problems 

Survey data 

Introduction 

Linear mixed models are models containing both fixed effects and random effects. They are a 

generalization of linear regression allowing for the inclusion of random deviations (effects) other than 

those associated with the overall error term. In matrix notation, 

y = Xβ + Zu + ɛ (1) 

where y is the n × 1 vector of responses, X is an n × p design/covariate matrix for the fixed effects 

β, and Z is the n × q design/covariate matrix for the random effects u. The n × 1 vector of errors 

ɛ is assumed to be multivariate normal with mean 0 and variance matrix σ 2 ɛ R. 

The fixed portion of (1), Xβ, is analogous to the linear predictor from a standard OLS regression 

model with β being the regression coefficients to be estimated. For the random portion of (1), Zu+ɛ, 

we assume that u has variance–covariance matrix G and that u is orthogonal to ɛ so that 

[ ] 

u 

Var = 

ɛ 

[ 

G 0 

] 

0 σɛ 2 R 

The random effects u are not directly estimated (although they may be predicted), but instead are 

characterized by the elements of G, known as variance components, that are estimated along with 

the overall residual variance σɛ 2 and the residual-variance parameters that are contained within R.


The general forms of the design matrices X and Z allow estimation for a broad class of linear 

models: blocked designs, split-plot designs, growth curves, multilevel or hierarchical designs, etc. 

They also allow a flexible method of modeling within-cluster correlation. Subjects within the same 

cluster can be correlated as a result of a shared random intercept, or through a shared random 

slope on (say) age, or both. The general specification of G also provides additional flexibility—the 

random intercept and random slope could themselves be modeled as independent, or correlated, or 

independent with equal variances, and so forth. The general structure of R also allows for residual 

errors to be heteroskedastic and correlated, and allows flexibility in exactly how these characteristics 

can be modeled. 

Comprehensive treatments of mixed models are provided by, among others, Searle, Casella, and 

McCulloch (1992); McCulloch, Searle, and Neuhaus (2008); Verbeke and Molenberghs (2000); 

Raudenbush and Bryk (2002); Demidenko (2004); and Pinheiro and Bates (2000). In particular, 

chapter 2 of Searle, Casella, and McCulloch (1992) provides an excellent history. 

The key to fitting mixed models lies in estimating the variance components, and for that there exist 

many methods. Most of the early literature in mixed models dealt with estimating variance components 

in ANOVA models. For simple models with balanced data, estimating variance components amounts 

to solving a system of equations obtained by setting expected mean-squares expressions equal to their 

observed counterparts. Much of the work in extending the ANOVA method to unbalanced data for 

general ANOVA designs is due to Henderson (1953). 

The ANOVA method, however, has its shortcomings. Among these is a lack of uniqueness in that 

alternative, unbiased estimates of variance components could be derived using other quadratic forms 

of the data in place of observed mean squares (Searle, Casella, and McCulloch 1992, 38–39). As a 

result, ANOVA methods gave way to more modern methods, such as minimum norm quadratic unbiased 

estimation (MINQUE) and minimum variance quadratic unbiased estimation (MIVQUE); see Rao (1973) 

for MINQUE and LaMotte (1973) for MIVQUE. Both methods involve finding optimal quadratic forms 

of the data that are unbiased for the variance components. 

The most popular methods, however, are ML and REML, and these are the two methods that are 

supported by mixed. The ML estimates are based on the usual application of likelihood theory, given 

the distributional assumptions of the model. The basic idea behind REML (Thompson 1962) is that 

you can form a set of linear contrasts of the response that do not depend on the fixed effects β, but 

instead depend only on the variance components to be estimated. You then apply ML methods by 

using the distribution of the linear contrasts to form the likelihood. 

Returning to (1): in clustered-data situations, it is convenient not to consider all n observations at 

once but instead to organize the mixed model as a series of M independent groups (or clusters) 

y j = X j β + Z j u j + ɛ j (2) 

for j = 1, . . . , M, with cluster j consisting of n j observations. The response y j comprises the rows 

of y corresponding with the jth cluster, with X j and ɛ j defined analogously. The random effects u j 

can now be thought of as M realizations of a q × 1 vector that is normally distributed with mean 0 

and q × q variance matrix Σ. The matrix Z i is the n j × q design matrix for the jth cluster random 

effects. Relating this to (1), note that 

⎡ 

⎤ 

Z 1 0 · · · 0 

0 Z 

Z = ⎢ 2 · · · 0 

⎣ 

. 

. 

. .. 

. .. 

0 0 0 Z M 

⎥ 

⎦ ; 

⎡ ⎤ 

u 1 

u = ⎣ . ⎦ 

. ; G = I M ⊗ Σ; R = I M ⊗ Λ (3) 

u M 

The mixed-model formulation (2) is from Laird and Ware (1982) and offers two key advantages. 

First, it makes specifications of random-effects terms easier. If the clusters are schools, you can


simply specify a random effect at the school level, as opposed to thinking of what a school-level 

random effect would mean when all the data are considered as a whole (if it helps, think Kronecker 

products). Second, representing a mixed-model with (2) generalizes easily to more than one set of 

random effects. For example, if classes are nested within schools, then (2) can be generalized to allow 

random effects at both the school and the class-within-school levels. This we demonstrate later. 

In the sections that follow, we assume that residuals are independent with constant variance; that 

is, in (3) we treat Λ equal to the identity matrix and limit ourselves to estimating one overall residual 

variance, σɛ 2 . Beginning in Heteroskedastic residual errors, we relax this assumption. 

Two-level models 

We begin with a simple application of (2) as a two-level model, because a one-level linear model, 

by our terminology, is just standard OLS regression. 

Example 1 

Consider a longitudinal dataset, used by both Ruppert, Wand, and Carroll (2003) and Diggle 

et al. (2002), consisting of weight measurements of 48 pigs on 9 successive weeks. Pigs are 

identified by the variable id. Below is a plot of the growth curves for the first 10 pigs. 

. use http://www.stata-press.com/data/r13/pig 

(Longitudinal analysis of pig weights) 

. twoway connected weight week if id


. mixed weight week || id: 

Performing EM optimization: 

Performing gradient-based optimization: 

Iteration 0: log likelihood = -1014.9268 


Computing standard errors: 

Mixed-effects ML regression Number of obs = 432 

Group variable: id Number of groups = 48 

Obs per group: min = 9 

avg = 9.0 

max = 9 

Wald chi2(1) = 25337.49 

Log likelihood = -1014.9268 Prob > chi2 = 0.0000 

weight Coef. Std. Err. z P>|z| [95% Conf. Interval] 

week 6.209896 .0390124 159.18 0.000 6.133433 6.286359 

_cons 19.35561 .5974059 32.40 0.000 18.18472 20.52651 

Random-effects Parameters Estimate Std. Err. [95% Conf. Interval] 

id: Identity 

var(_cons) 14.81751 3.124226 9.801716 22.40002 

Notes: 

var(Residual) 4.383264 .3163348 3.805112 5.04926 

LR test vs. linear regression: chibar2(01) = 472.65 Prob >= chibar2 = 0.0000 

1. By typing weight week, we specified the response, weight, and the fixed portion of the model 

in the same way that we would if we were using regress or any other estimation command. Our 

fixed effects are a coefficient on week and a constant term. 

2. When we added || id:, we specified random effects at the level identified by the group variable 

id, that is, the pig level (level two). Because we wanted only a random intercept, that is all we 

had to type. 

3. The estimation log consists of three parts: 

a. A set of EM iterations used to refine starting values. By default, the iterations themselves are 

not displayed, but you can display them with the emlog option. 

b. A set of gradient-based iterations. By default, these are Newton–Raphson iterations, but other 

methods are available by specifying the appropriate maximize options; see [R] maximize. 

c. The message “Computing standard errors”. This is just to inform you that mixed has finished 

its iterative maximization and is now reparameterizing from a matrix-based parameterization 

(see Methods and formulas) to the natural metric of variance components and their estimated 

standard errors. 

4. The output title, “Mixed-effects ML regression”, informs us that our model was fit using ML, the 

default. For REML estimates, use the reml option. 

Because this model is a simple random-intercept model fit by ML, it would be equivalent to using 

xtreg with its mle option. 

5. The first estimation table reports the fixed effects. We estimate β 0 = 19.36 and β 1 = 6.21.


6. The second estimation table shows the estimated variance components. The first section of the 

table is labeled id: Identity, meaning that these are random effects at the id (pig) level and that 

their variance–covariance matrix is a multiple of the identity matrix; that is, Σ = σuI. 2 Because 

we have only one random effect at this level, mixed knew that Identity is the only possible 

covariance structure. In any case, the variance of the level-two errors, σu 2 , is estimated as 14.82 

with standard error 3.12. 

7. The row labeled var(Residual) displays the estimated variance of the overall error term; that 

is, ̂σ 2 ɛ = 4.38. This is the variance of the level-one errors, that is, the residuals. 

8. Finally, a likelihood-ratio test comparing the model with one-level ordinary linear regression, model 

(4) without u j , is provided and is highly significant for these data. 

We now store our estimates for later use: 

. estimates store randint 

Example 2 

Extending (4) to allow for a random slope on week yields the model 

and we fit this with mixed: 

. mixed weight week || id: week 


weight ij = β 0 + β 1 week ij + u 0j + u 1j week ij + ɛ ij (5) 








avg = 9.0 

max = 9 

Wald chi2(1) = 4689.51 



week 6.209896 .0906819 68.48 0.000 6.032163 6.387629 

_cons 19.35561 .3979159 48.64 0.000 18.57571 20.13551 


id: Independent 

var(week) .3680668 .0801181 .2402389 .5639103 

var(_cons) 6.756364 1.543503 4.317721 10.57235 

var(Residual) 1.598811 .1233988 1.374358 1.85992 

LR test vs. linear regression: chi2(2) = 764.42 Prob > chi2 = 0.0000 

Note: LR test is conservative and provided only for reference. 

. estimates store randslope


Because we did not specify a covariance structure for the random effects (u 0j , u 1j ) ′ , mixed used 

the default Independent structure; that is, 

[ ] [ ] 

u0j σ 

2 

Σ = Var = u0 0 

u 1j 0 σu1 

2 

with ̂σ u0 2 = 6.76 and ̂σ u1 2 = 0.37. Our point estimates of the fixed effects are essentially identical to 

those from model (4), but note that this does not hold generally. Given the 95% confidence interval 

for ̂σ u1 2 , it would seem that the random slope is significant, and we can use lrtest and our two 

stored estimation results to verify this fact: 

. lrtest randslope randint 

Likelihood-ratio test LR chi2(1) = 291.78 

(Assumption: randint nested in randslope) Prob > chi2 = 0.0000 

Note: The reported degrees of freedom assumes the null hypothesis is not on 

the boundary of the parameter space. If this is not true, then the 

reported test is conservative. 

The near-zero significance level favors the model that allows for a random pig-specific regression 

line over the model that allows only for a pig-specific shift. 

(6) 

Covariance structures 

In example 2, we fit a model with the default Independent covariance given in (6). Within any 

random-effects level specification, we can override this default by specifying an alternative covariance 

structure via the covariance() option. 

Example 3 

We generalize (6) to allow u 0j and u 1j to be correlated; that is, 

[ ] [ ] 

u0j σ 

2 

Σ = Var = u0 σ 01 

u 1j σ 01 σu1 

2


. mixed weight week || id: week, covariance(unstructured) 









avg = 9.0 

max = 9 

Wald chi2(1) = 4649.17 



week 6.209896 .0910745 68.18 0.000 6.031393 6.388399 

_cons 19.35561 .3996387 48.43 0.000 18.57234 20.13889 


id: Unstructured 

var(week) .3715251 .0812958 .2419532 .570486 

var(_cons) 6.823363 1.566194 4.351297 10.69986 

cov(week,_cons) -.0984378 .2545767 -.5973991 .4005234 

var(Residual) 1.596829 .123198 1.372735 1.857505 



But we do not find the correlation to be at all significant. 

. lrtest . randslope 


(Assumption: randslope nested in .) Prob > chi2 = 0.6959 

Instead, we could have also specified covariance(identity), restricting u 0j and u 1j to not 

only be independent but also to have common variance, or we could have specified covariance(exchangeable), 

which imposes a common variance but allows for a nonzero correlation. 

Likelihood versus restricted likelihood 

Thus far, all our examples have used ML to estimate variance components. We could have just as 

easily asked for REML estimates. Refitting the model in example 2 by REML, we get


. mixed weight week || id: week, reml 



Iteration 0: log restricted-likelihood = -870.51473 

Iteration 1: log restricted-likelihood = -870.51473 


Mixed-effects REML regression Number of obs = 432 



avg = 9.0 

max = 9 

Wald chi2(1) = 4592.10 

Log restricted-likelihood = -870.51473 Prob > chi2 = 0.0000 


week 6.209896 .0916387 67.77 0.000 6.030287 6.389504 

_cons 19.35561 .4021144 48.13 0.000 18.56748 20.14374 



var(week) .3764405 .0827027 .2447317 .5790317 

var(_cons) 6.917604 1.593247 4.404624 10.86432 

var(Residual) 1.598784 .1234011 1.374328 1.859898 



Although ML estimators are based on the usual likelihood theory, the idea behind REML is to 

transform the response into a set of linear contrasts whose distribution is free of the fixed effects β. 

The restricted likelihood is then formed by considering the distribution of the linear contrasts. Not 

only does this make the maximization problem free of β, it also incorporates the degrees of freedom 

used to estimate β into the estimation of the variance components. This follows because, by necessity, 

the rank of the linear contrasts must be less than the number of observations. 

As a simple example, consider a constant-only regression where y i ∼ N(µ, σ 2 ) for i = 1, . . . , n. 

The ML estimate of σ 2 can be derived theoretically as the n-divided sample variance. The REML 

estimate can be derived by considering the first n − 1 error contrasts, y i − y, whose joint distribution 

is free of µ. Applying maximum likelihood to this distribution results in an estimate of σ 2 , that is, 

the (n − 1)-divided sample variance, which is unbiased for σ 2 . 

The unbiasedness property of REML extends to all mixed models when the data are balanced, and 

thus REML would seem the clear choice in balanced-data problems, although in large samples the 

difference between ML and REML is negligible. One disadvantage of REML is that likelihood-ratio (LR) 

tests based on REML are inappropriate for comparing models with different fixed-effects specifications. 

ML is appropriate for such LR tests and has the advantage of being easy to explain and being the 

method of choice for other estimators. 

Another factor to consider is that ML estimation under mixed is more feature-rich, allowing for 

weighted estimation and robust variance–covariance matrices, features not supported under REML. In 

the end, which method to use should be based both on your needs and on personal taste.


Examining the REML output, we find that the estimates of the variance components are slightly 

larger than the ML estimates. This is typical, because ML estimates, which do not incorporate the 

degrees of freedom used to estimate the fixed effects, tend to be biased downward. 

Three-level models 

The clustered-data representation of the mixed model given in (2) can be extended to two nested 

levels of clustering, creating a three-level model once the observations are considered. Formally, 

y jk = X jk β + Z (3) 

jk u(3) k 

+ Z (2) 

jk u(2) jk + ɛ jk (7) 

for i = 1, . . . , n jk first-level observations nested within j = 1, . . . , M k second-level groups, which 

are nested within k = 1, . . . , M third-level groups. Group j, k consists of n jk observations, so y jk , 

X jk , and ɛ jk each have row dimension n jk . Z (3) 

jk 

is the n jk × q 3 design matrix for the third-level 

random effects u (3) 

k 

, and Z(2) 

jk is the n jk × q 2 design matrix for the second-level random effects u (2) 

jk . 

Furthermore, assume that 

u (3) 

k 

∼ N(0, Σ 3 ); u (2) 

jk ∼ N(0, Σ 2); ɛ jk ∼ N(0, σ 2 ɛ I) 

and that u (3) 

k 

, u(2) jk , and ɛ jk are independent. 

Fitting a three-level model requires you to specify two random-effects equations: one for level 

three and then one for level two. The variable list for the first equation represents Z (3) 

jk 

second equation represents Z (2) 

jk 

; that is, you specify the levels top to bottom in mixed. 

Example 4 

and for the 

Baltagi, Song, and Jung (2001) estimate a Cobb–Douglas production function examining the 

productivity of public capital in each state’s private output. Originally provided by Munnell (1990), 

the data were recorded over 1970–1986 for 48 states grouped into nine regions.


. use http://www.stata-press.com/data/r13/productivity 

(Public Capital Productivity) 

. describe 

Contains data from http://www.stata-press.com/data/r13/productivity.dta 

obs: 816 Public Capital Productivity 

vars: 11 29 Mar 2013 10:57 

size: 29,376 (_dta has notes) 

storage display value 

variable name type format label variable label 

state byte %9.0g states 1-48 

region byte %9.0g regions 1-9 

year int %9.0g years 1970-1986 

public float %9.0g public capital stock 

hwy float %9.0g log(highway component of public) 

water float %9.0g log(water component of public) 

other float %9.0g log(bldg/other component of 

public) 

private float %9.0g log(private capital stock) 

gsp float %9.0g log(gross state product) 

emp float %9.0g log(non-agriculture payrolls) 

unemp float %9.0g state unemployment rate 

Sorted by: 

Because the states are nested within regions, we fit a three-level mixed model with random intercepts 

at both the region and the state-within-region levels. That is, we use (7) with both Z (3) 

jk 

and Z(2) 

jk 

set 

to the n jk × 1 column of ones, and Σ 3 = σ3 2 and Σ 2 = σ2 2 are both scalars. 

. mixed gsp private emp hwy water other unemp || region: || state: 

(output omitted ) 


No. of Observations per Group 

Group Variable Groups Minimum Average Maximum 

region 9 51 90.7 136 

state 48 17 17.0 17 

Wald chi2(6) = 18829.06 

Log likelihood = 1430.5017 Prob > chi2 = 0.0000 

gsp Coef. Std. Err. z P>|z| [95% Conf. Interval] 

private .2671484 .0212591 12.57 0.000 .2254814 .3088154 

emp .754072 .0261868 28.80 0.000 .7027468 .8053973 

hwy .0709767 .023041 3.08 0.002 .0258172 .1161363 

water .0761187 .0139248 5.47 0.000 .0488266 .1034109 

other -.0999955 .0169366 -5.90 0.000 -.1331906 -.0668004 

unemp -.0058983 .0009031 -6.53 0.000 -.0076684 -.0041282 

_cons 2.128823 .1543854 13.79 0.000 1.826233 2.431413



region: Identity 

state: Identity 

var(_cons) .0014506 .0012995 .0002506 .0083957 

var(_cons) .0062757 .0014871 .0039442 .0099855 

Notes: 

var(Residual) .0013461 .0000689 .0012176 .0014882 



1. Our model now has two random-effects equations, separated by ||. The first is a random intercept 

(constant only) at the region level (level three), and the second is a random intercept at the state 

level (level two). The order in which these are specified (from left to right) is significant—mixed 

assumes that state is nested within region. 

2. The information on groups is now displayed as a table, with one row for each grouping. You can 

suppress this table with the nogroup or the noheader option, which will suppress the rest of the 

header, as well. 

3. The variance-component estimates are now organized and labeled according to level. 

After adjusting for the nested-level error structure, we find that the highway and water components 

of public capital had significant positive effects on private output, whereas the other public buildings 

component had a negative effect. 

Technical note 

In the previous example, the states are coded 1–48 and are nested within nine regions. mixed 

treated the states as nested within regions, regardless of whether the codes for each state were unique 

between regions. That is, even if codes for states were duplicated between regions, mixed would 

have enforced the nesting and produced the same results. 

The group information at the top of the mixed output and that produced by the postestimation 

command estat group (see [ME] mixed postestimation) take the nesting into account. The statistics 

are thus not necessarily what you would get if you instead tabulated each group variable individually. 

Model (7) extends in a straightforward manner to more than three levels, as does the specification 

of such models in mixed. 

Blocked-diagonal covariance structures 

Covariance matrices of random effects within an equation can be modeled either as a multiple of 

the identity matrix, as diagonal (that is, Independent), as exchangeable, or as general symmetric 

(Unstructured). These may also be combined to produce more complex block-diagonal covariance 

structures, effectively placing constraints on the variance components.


Example 5 

Returning to our productivity data, we now add random coefficients on hwy and unemp at the 

region level. This only slightly changes the estimates of the fixed effects, so we focus our attention 

on the variance components: 

. mixed gsp private emp hwy water other unemp || region: hwy unemp || state:, 

> nolog nogroup nofetable 


Wald chi2(6) = 17137.94 



region: Independent 

var(hwy) .0000209 .0001103 6.71e-10 .650695 

var(unemp) .0000238 .0000135 7.84e-06 .0000722 

var(_cons) .0030349 .0086684 .0000112 .8191296 


var(_cons) .0063658 .0015611 .0039365 .0102943 

var(Residual) .0012469 .0000643 .001127 .0013795 



. estimates store prodrc 

This model is the same as that fit in example 4 except that Z (3) 

jk 

is now the n jk × 3 matrix with 

columns determined by the values of hwy, unemp, and an intercept term (one), in that order, and 

(because we used the default Independent structure) Σ 3 is 

Σ 3 = 

( 

hwy unemp cons 

σa 2 ) 

0 0 

0 σb 2 0 

0 0 σc 

2 

The random-effects specification at the state level remains unchanged; that is, Σ 2 is still treated as 

the scalar variance of the random intercepts at the state level. 

An LR test comparing this model with that from example 4 favors the inclusion of the two random 

coefficients, a fact we leave to the interested reader to verify. 

The estimated variance components, upon examination, reveal that the variances of the random 

coefficients on hwy and unemp could be treated as equal. That is, 

Σ 3 = 

( 

hwy unemp cons 

σa 2 ) 

0 0 

0 σa 2 0 

0 0 σc 

2 

looks plausible. We can impose this equality constraint by treating Σ 3 as block diagonal: the first 

block is a 2 × 2 multiple of the identity matrix, that is, σ 2 aI 2 ; the second is a scalar, equivalently, a 

1 × 1 multiple of the identity.


We construct block-diagonal covariances by repeating level specifications: 

. mixed gsp private emp hwy water other unemp || region: hwy unemp, 

> cov(identity) || region: || state:, nolog nogroup nofetable 


Wald chi2(6) = 17136.65 




var(hwy unemp) .0000238 .0000134 7.89e-06 .0000719 



var(_cons) .0028191 .0030429 .0003399 .023383 

var(_cons) .006358 .0015309 .0039661 .0101925 

var(Residual) .0012469 .0000643 .001127 .0013795 



We specified two equations for the region level: the first for the random coefficients on hwy and 

unemp with covariance set to Identity and the second for the random intercept cons, whose 

covariance defaults to Identity because it is of dimension 1. mixed labeled the estimate of σa 2 as 

var(hwy unemp) to designate that it is common to the random coefficients on both hwy and unemp. 

An LR test shows that the constrained model fits equally well. 

. lrtest . prodrc 


(Assumption: . nested in prodrc) Prob > chi2 = 0.9784 




Because the null hypothesis for this test is one of equality (H 0 : σa 2 = σb 2 ), it is not on the 

boundary of the parameter space. As such, we can take the reported significance as precise rather 

than a conservative estimate. 

You can repeat level specifications as often as you like, defining successive blocks of a blockdiagonal 

covariance matrix. However, repeated-level equations must be listed consecutively; otherwise, 

mixed will give an error. 


In the previous estimation output, there was no constant term included in the first region equation, 

even though we did not use the noconstant option. When you specify repeated-level equations, 

mixed knows not to put constant terms in each equation because such a model would be unidentified. 

By default, it places the constant in the last repeated-level equation, but you can use noconstant 

creatively to override this. 

Linear mixed-effects models can also be fit using meglm with the default gaussian family. meglm 

provides two more covariance structures through which you can impose constraints on variance 

components; see [ME] meglm for details.


Heteroskedastic random effects 

Blocked-diagonal covariance structures and repeated-level specifications of random effects can also 

be used to model heteroskedasticity among random effects at a given level. 

Example 6 

Following Rabe-Hesketh and Skrondal (2012, sec. 7.2), we analyze data from Asian children in 

a British community who were weighed up to four times, roughly between the ages of 6 weeks and 

27 months. The dataset is a random sample of data previously analyzed by Goldstein (1986) and 

Prosser, Rasbash, and Goldstein (1991). 

. use http://www.stata-press.com/data/r13/childweight 

(Weight data on Asian children) 

. describe 

Contains data from http://www.stata-press.com/data/r13/childweight.dta 

obs: 198 Weight data on Asian children 

vars: 5 23 May 2013 15:12 




id int %8.0g child identifier 

age float %8.0g age in years 

weight float %8.0g weight in Kg 

brthwt int %8.0g Birth weight in g 

girl float %9.0g bg gender 

Sorted by: id age 

. graph twoway (line weight age, connect(ascending)), by(girl) 

> xtitle(Age in years) ytitle(Weight in kg) 

boy 

girl 

Weight in kg 

5 10 15 20 

0 1 2 3 0 1 2 3 

Graphs by gender 

Age in years 

Ignoring gender effects for the moment, we begin with the following model for the ith measurement 

on the jth child: 

weight ij = β 0 + β 1 age ij + β 2 age 2 ij + u j0 + u j1 age ij + ɛ ij


This models overall mean growth as quadratic in age and allows for two child-specific random 

effects: a random intercept u j0 , which represents each child’s vertical shift from the overall mean 

(β 0 ), and a random age slope u j1 , which represents each child’s deviation in linear growth rate from 

the overall mean linear growth rate (β 1 ). For simplicity, we do not consider child-specific changes in 

the quadratic component of growth. 

. mixed weight age c.age#c.age || id: age, nolog 




avg = 2.9 

max = 5 

Wald chi2(2) = 1863.46 



age 7.693701 .2381076 32.31 0.000 7.227019 8.160384 

c.age#c.age -1.654542 .0874987 -18.91 0.000 -1.826037 -1.483048 

_cons 3.497628 .1416914 24.68 0.000 3.219918 3.775338 



var(age) .2987207 .0827569 .1735603 .5141388 

var(_cons) .5023857 .141263 .2895294 .8717297 

var(Residual) .3092897 .0474887 .2289133 .417888 



Because there is no reason to believe that the random effects are uncorrelated, it is always a good 

idea to first fit a model with the covariance(unstructured) option. We do not include the output 

for such a model because for these data the correlation between random effects is not significant; 

however, we did check this before reverting to mixed’s default Independent structure. 

Next we introduce gender effects into the fixed portion of the model by including a main gender 

effect and a gender–age interaction for overall mean growth:


. mixed weight i.girl i.girl#c.age c.age#c.age || id: age, nolog 




avg = 2.9 

max = 5 

Wald chi2(4) = 1942.30 



girl 

girl -.5104676 .2145529 -2.38 0.017 -.9309835 -.0899516 

girl#c.age 

boy 7.806765 .2524583 30.92 0.000 7.311956 8.301574 

girl 7.577296 .2531318 29.93 0.000 7.081166 8.073425 

c.age#c.age -1.654323 .0871752 -18.98 0.000 -1.825183 -1.483463 

_cons 3.754275 .1726404 21.75 0.000 3.415906 4.092644 



var(age) .2772846 .0769233 .1609861 .4775987 

var(_cons) .4076892 .12386 .2247635 .7394906 

var(Residual) .3131704 .047684 .2323672 .422072 



. estimates store homoskedastic 

The main gender effect is significant at the 5% level, but the gender–age interaction is not: 

. test 0.girl#c.age = 1.girl#c.age 

( 1) [weight]0b.girl#c.age - [weight]1.girl#c.age = 0 

chi2( 1) = 1.66 

Prob > chi2 = 0.1978 

On average, boys are heavier than girls, but their average linear growth rates are not significantly 

different. 

In the above model, we introduced a gender effect on average growth, but we still assumed that the 

variability in child-specific deviations from this average was the same for boys and girls. To check 

this assumption, we introduce gender into the random component of the model. Because support 

for factor-variable notation is limited in specifications of random effects (see Crossed-effects models 

below), we need to generate the interactions ourselves.


. gen boy = !girl 

. gen boyXage = boy*age 

. gen girlXage = girl*age 

. mixed weight i.girl i.girl#c.age c.age#c.age || id: boy boyXage, noconstant 

> || id: girl girlXage, noconstant nolog nofetable 




avg = 2.9 

max = 5 

Wald chi2(4) = 2358.11 




var(boy) .3161091 .1557911 .1203181 .8305061 

var(boyXage) .4734482 .1574626 .2467028 .9085962 


var(girl) .5798676 .1959725 .2989896 1.124609 

var(girlXage) .0664634 .0553274 .0130017 .3397538 

var(Residual) .3078826 .046484 .2290188 .4139037 



. estimates store heteroskedastic 

In the above, we suppress displaying the fixed portion of the model (the nofetable option) 

because it does not differ much from that of the previous model. 

Our previous model had the random-effects specification 

|| id: age 

which we have replaced with the dual repeated-level specification 

|| id: boy boyXage, noconstant || id: girl girlXage, noconstant 

The former models a random intercept and random slope on age, and does so treating all children as 

a random sample from one population. The latter also specifies a random intercept and random slope 

on age, but allows for the variability of the random intercepts and slopes to differ between boys and 

girls. In other words, it allows for heteroskedasticity in random effects due to gender. We use the 

noconstant option so that we can separate the overall random intercept (automatically provided by 

the former syntax) into one specific to boys and one specific to girls. 

There seems to be a large gender effect in the variability of linear growth rates. We can compare 

both models with an LR test, recalling that we stored the previous estimation results under the name 

homoskedastic: 

. lrtest homoskedastic heteroskedastic 


(Assumption: homoskedastic nested in heteroskedas ~ c) Prob > chi2 = 0.0145 



reported test is conservative.


Because the null hypothesis here is one of equality of variances and not that variances are 0, the 

above does not test on the boundary; thus we can treat the significance level as precise and not 

conservative. Either way, the results favor the new model with heteroskedastic random effects. 

Heteroskedastic residual errors 

Up to this point, we have assumed that the level-one residual errors—the ɛ’s in the stated 

models—have been i.i.d. Gaussian with variance σɛ 2 . This is demonstrated in mixed output in the 

random-effects table, where up until now we have estimated a single residual-error variance, labeled 

as var(Residual). 

To relax the assumptions of homoskedasticity or independence of residual errors, use the residuals() 

option. 

Example 7 

West, Welch, and Galecki (2007, chap. 7) analyze data studying the effect of ceramic dental veneer 

placement on gingival (gum) health. Data on 55 teeth located in the maxillary arches of 12 patients 

were considered. 

. use http://www.stata-press.com/data/r13/veneer, clear 

(Dental veneer data) 

. describe 

Contains data from http://www.stata-press.com/data/r13/veneer.dta 

obs: 110 Dental veneer data 

vars: 7 24 May 2013 12:11 




patient byte %8.0g Patient ID 

tooth byte %8.0g Tooth number with patient 

gcf byte %8.0g Gingival crevicular fluid (GCF) 

age byte %8.0g Patient age 

base_gcf byte %8.0g Baseline GCF 

cda float %9.0g Average contour difference after 

veneer placement 

followup byte %9.0g t Follow-up time: 3 or 6 months 

Sorted by: 

Veneers were placed to match the original contour of the tooth as closely as possible, and researchers 

were interested in how contour differences (variable cda) impacted gingival health. Gingival health 

was measured as the amount of gingival crevical fluid (GCF) at each tooth, measured at baseline 

(variable base gcf) and at two posttreatment follow-ups at 3 and 6 months. The variable gcf records 

GCF at follow-up, and the variable followup records the follow-up time. 

Because two measurements were taken for each tooth and there exist multiple teeth per patient, we 

fit a three-level model with the following random effects: a random intercept and random slope on 

follow-up time at the patient level, and a random intercept at the tooth level. For the ith measurement 

of the jth tooth from the kth patient, we have 

gcf ijk = β 0 + β 1 followup ijk + β 2 base gcf ijk + β 3 cda ijk + β 4 age ijk + 

u 0k + u 1k followup ijk + v 0jk + ɛ ijk


which we can fit using mixed: 

. mixed gcf followup base_gcf cda age || patient: followup, cov(un) || tooth:, 

> reml nolog 




patient 12 2 9.2 12 

tooth 55 2 2.0 2 

Wald chi2(4) = 7.48 


gcf Coef. Std. Err. z P>|z| [95% Conf. Interval] 

followup .3009815 1.936863 0.16 0.877 -3.4952 4.097163 

base_gcf -.0183127 .1433094 -0.13 0.898 -.299194 .2625685 

cda -.329303 .5292525 -0.62 0.534 -1.366619 .7080128 

age -.5773932 .2139656 -2.70 0.007 -.9967582 -.1580283 

_cons 45.73862 12.55497 3.64 0.000 21.13133 70.34591 


patient: Unstructured 

var(followup) 41.88772 18.79997 17.38009 100.9535 

var(_cons) 524.9851 253.0205 204.1287 1350.175 

cov(followup,_cons) -140.4229 66.57623 -270.9099 -9.935908 

tooth: Identity 

var(_cons) 47.45738 16.63034 23.8792 94.3165 

var(Residual) 48.86704 10.50523 32.06479 74.47382 



We used REML estimation for no other reason than variety. 

Among the other features of the model fit, we note that the residual variance σɛ 

2 was estimated 

as 48.87 and that our model assumed that the residuals were independent with constant variance 

(homoskedastic). Because it may be the case that the precision of gcf measurements could change 

over time, we modify the above to estimate two distinct error variances: one for the 3-month follow-up 

and one for the 6-month follow-up. 

To fit this model, we add the residuals(independent, by(followup)) option, which maintains 

independence of residual errors but allows for heteroskedasticity with respect to follow-up time.


. mixed gcf followup base_gcf cda age || patient: followup, cov(un) || tooth:, 

> residuals(independent, by(followup)) reml nolog 




patient 12 2 9.2 12 

tooth 55 2 2.0 2 

Wald chi2(4) = 7.51 


gcf Coef. Std. Err. z P>|z| [95% Conf. Interval] 

followup .2703944 1.933096 0.14 0.889 -3.518405 4.059193 

base_gcf .0062144 .1419121 0.04 0.965 -.2719283 .284357 

cda -.2947235 .5245126 -0.56 0.574 -1.322749 .7333023 

age -.5743755 .2142249 -2.68 0.007 -.9942487 -.1545024 

_cons 45.15089 12.51452 3.61 0.000 20.62288 69.6789 


patient: Unstructured 

var(followup) 41.75169 18.72989 17.33099 100.583 

var(_cons) 515.2018 251.9661 197.5542 1343.596 

cov(followup,_cons) -139.0496 66.27806 -268.9522 -9.14694 

tooth: Identity 

var(_cons) 47.35914 16.48931 23.93514 93.70693 

Residual: Independent, 

by followup 

3 months: var(e) 61.36785 18.38913 34.10946 110.4096 

6 months: var(e) 36.42861 14.97501 16.27542 81.53666 



Comparison of both models via an LR test reveals the difference in residual variances to be not 

significant, something we leave to you to verify as an exercise. 

The default residual-variance structure is independent, and when specified without by() is 

equivalent to the default behavior of mixed: estimating one overall residual standard variance for the 

entire model. 

Other residual-error structures 

Besides the default independent residual-error structure, mixed supports four other structures that 

allow for correlation between residual errors within the lowest-level (smallest or level two) groups. 

For purposes of notation, in what follows we assume a two-level model, with the obvious extension 

to higher-level models. 

The exchangeable structure assumes one overall variance and one common pairwise covariance; 

that is,


⎡ 

ɛ 

⎤ ⎡ 

j1 σɛ 2 σ 1 · · · σ 

⎤ 

1 

ɛ j2 

Var(ɛ j ) = Var ⎢ 

⎣ . ⎥ 

⎦ = σ 1 σɛ 2 · · · σ 1 

⎢ 

⎣ 

. 

. 

. 

. .. 

. ⎥ .. ⎦ 

ɛ jnj σ 1 σ 1 σ 1 σɛ 

2 

By default, mixed will report estimates of the two parameters as estimates of the common variance 

σ 2 ɛ and of the covariance σ 1. If the stddeviations option is specified, you obtain estimates of σ ɛ 

and the pairwise correlation. When the by(varname) option is also specified, these two parameters 

are estimated for each level varname. 

The ar p structure assumes that the errors have an AR structure of order p. That is, 

ɛ ij = φ 1 ɛ i−1,j + · · · + φ p ɛ i−p,j + u ij 

where u ij are i.i.d. Gaussian with mean 0 and variance σu 2. mixed reports estimates of φ 1, . . . , φ p 

and the overall error variance σɛ 2 , which can be derived from the above expression. The t(varname) 

option is required, where varname is a time variable used to order the observations within lowest-level 

groups and to determine any gaps between observations. When the by(varname) option is also 

specified, the set of p + 1 parameters is estimated for each level of varname. If p = 1, then the 

estimate of φ 1 is reported as rho, because in this case it represents the correlation between successive 

error terms. 

The ma q structure assumes that the errors are an MA process of order q. That is, 

ɛ ij = u ij + θ 1 u i−1,j + · · · + θ q u i−q,j 

where u ij are i.i.d. Gaussian with mean 0 and variance σu 2. mixed reports estimates of θ 1, . . . , θ q 

and the overall error variance σɛ 2 , which can be derived from the above expression. The t(varname) 

option is required, where varname is a time variable used to order the observations within lowest-level 

groups and to determine any gaps between observations. When the by(varname) option is also 

specified, the set of q + 1 parameters is estimated for each level of varname. 

The unstructured structure is the most general and estimates unique variances and unique pairwise 

covariances for all residuals within the lowest-level grouping. Because the data may be unbalanced 

and the ordering of the observations is arbitrary, the t(varname) option is required, where varname 

is an identification variable that matches error terms in different groups. If varname has n distinct 

levels, then n(n + 1)/2 parameters are estimated. Not all n levels need to be observed within each 

group, but duplicated levels of varname within a given group are not allowed because they would 

cause a singularity in the estimated error-variance matrix for that group. When the by(varname) 

option is also specified, the set of n(n + 1)/2 parameters is estimated for each level of varname. 

The banded q structure is a special case of unstructured that confines estimation to within 

the first q off-diagonal elements of the residual variance–covariance matrix and sets the covariances 

outside this band to 0. As is the case with unstructured, the t(varname) option is required, where 

varname is an identification variable that matches error terms in different groups. However, with 

banded variance structures, the ordering of the values in varname is significant because it determines 

which covariances are to be estimated and which are to be set to 0. For example, if varname has 

n = 5 distinct values t = 1, 2, 3, 4, 5, then a banded variance–covariance structure of order q = 2 

would estimate the following:


⎡ ⎤ ⎡ 

ɛ 1j σ 2 ⎤ 

1 σ 12 σ 13 0 0 

ɛ 2j 

σ Var(ɛ j ) = Var ⎢ ɛ 3j ⎥ 

⎣ ⎦ = 12 σ2 2 σ 23 σ 24 0 

⎢ σ 13 σ 23 σ 2 3 σ 34 σ 35 ⎥ 

⎣ 

ɛ 4j 0 σ 24 σ 34 σ4 2 ⎦ 

σ 45 

ɛ 5j 0 0 σ 35 σ 45 σ5 

2 

In other words, you would have an unstructured variance matrix that constrains σ 14 = σ 15 = σ 25 = 0. 

If varname has n distinct levels, then (q + 1)(2n − q)/2 parameters are estimated. Not all n levels 

need to be observed within each group, but duplicated levels of varname within a given group are 

not allowed because they would cause a singularity in the estimated error-variance matrix for that 

group. When the by(varname) option is also specified, the set of parameters is estimated for each 

level of varname. If q is left unspecified, then banded is equivalent to unstructured; that is, all 

variances and covariances are estimated. When q = 0, Var(ɛ j ) is treated as diagonal and can thus be 

used to model uncorrelated yet heteroskedastic residual errors. 

The toeplitz q structure assumes that the residual errors are homoskedastic and that the correlation 

between two errors is determined by the time lag between the two. That is, Var(ɛ ij ) = σ 2 ɛ and 

Corr(ɛ ij , ɛ i+k,j ) = ρ k 

If the lag k is less than or equal to q, then the pairwise correlation ρ k is estimated; if the lag is greater 

than q, then ρ k is assumed to be 0. If q is left unspecified, then ρ k is estimated for each observed lag 

k. The t(varname) option is required, where varname is a time variable t used to determine the lags 

between pairs of residual errors. As such, t() must be integer-valued. q +1 parameters are estimated: 

one overall variance σɛ 2 and q correlations. When the by(varname) option is also specified, the set 

of q + 1 parameters is estimated for each level of varname. 

The exponential structure is a generalization of the AR structure that allows for noninteger and 

irregularly spaced time lags. That is, Var(ɛ ij ) = σ 2 ɛ and 

Corr(ɛ ij , ɛ kj ) = ρ |i−k| 

for 0 ≤ ρ ≤ 1, with i and k not required to be integers. The t(varname) option is required, where 

varname is a time variable used to determine i and k for each residual-error pair. t() is real-valued. 

mixed reports estimates of σɛ 

2 and ρ. When the by(varname) option is also specified, these two 

parameters are estimated for each level of varname. 

Example 8 

Pinheiro and Bates (2000, chap. 5) analyze data from a study of the estrus cycles of mares. 

Originally analyzed in Pierson and Ginther (1987), the data record the number of ovarian follicles 

larger than 10mm, daily over a period ranging from three days before ovulation to three days after 

the subsequent ovulation.


. use http://www.stata-press.com/data/r13/ovary 

(Ovarian follicles in mares) 

. describe 

Contains data from http://www.stata-press.com/data/r13/ovary.dta 

obs: 308 Ovarian follicles in mares 

vars: 6 20 May 2013 13:49 




mare byte %9.0g mare ID 

stime float %9.0g Scaled time 

follicles byte %9.0g Number of ovarian follicles > 10 

mm in diameter 

sin1 float %9.0g sine(2*pi*stime) 

cos1 float %9.0g cosine(2*pi*stime) 

time float %9.0g time order within mare 

Sorted by: mare stime 

The stime variable is time that has been scaled so that ovulation occurs at scaled times 0 and 1, 

and the time variable records the time ordering within mares. Because graphical evidence suggests 

a periodic behavior, the analysis includes the sin1 and cos1 variables, which are sine and cosine 

transformations of scaled time, respectively. 

We consider the following model for the ith measurement on the jth mare: 

follicles ij = β 0 + β 1 sin1 ij + β 2 cos1 ij + u j + ɛ ij 

The above model incorporates the cyclical nature of the data as affecting the overall average 

number of follicles and includes mare-specific random effects u j . Because we believe successive 

measurements within each mare are probably correlated (even after controlling for the periodicity in 

the average), we also model the within-mare errors as being AR of order 2. 

. mixed follicles sin1 cos1 || mare:, residuals(ar 2, t(time)) reml nolog 


Group variable: mare Number of groups = 11 


avg = 28.0 

max = 31 

Wald chi2(2) = 34.72 


follicles Coef. Std. Err. z P>|z| [95% Conf. Interval] 

sin1 -2.899228 .5110786 -5.67 0.000 -3.900923 -1.897532 

cos1 -.8652936 .5432926 -1.59 0.111 -1.930127 .1995402 

_cons 12.14455 .9473631 12.82 0.000 10.28775 14.00135



mare: Identity 

Residual: AR(2) 

var(_cons) 7.092439 4.401937 2.101337 23.93843 

phi1 .5386104 .0624899 .4161325 .6610883 

phi2 .144671 .0632041 .0207933 .2685488 

var(e) 14.25104 2.435238 10.19512 19.92054 



We picked an order of 2 as a guess, but we could have used LR tests of competing AR models to 

determine the optimal order, because models of smaller order are nested within those of larger order. 

Example 9 

Fitzmaurice, Laird, and Ware (2011, chap. 7) analyzed data on 37 subjects who participated in an 

exercise therapy trial. 

. use http://www.stata-press.com/data/r13/exercise 

(Exercise Therapy Trial) 

. describe 

Contains data from http://www.stata-press.com/data/r13/exercise.dta 

obs: 259 Exercise Therapy Trial 

vars: 4 24 Jun 2012 18:35 




id byte %9.0g Person ID 

day byte %9.0g Day of measurement 

program byte %9.0g 1 = reps increase; 2 = weights 

increase 

strength byte %9.0g Strength measurement 

Sorted by: id day 

Subjects (variable id) were placed on either an increased-repetition regimen (program==1) or a program 

that kept the repetitions constant but increased weight (program==2). Muscle-strength measurements 

(variable strength) were taken at baseline (day==0) and then every two days over the next twelve 

days. 

Following Fitzmaurice, Laird, and Ware (2011, chap. 7), and to demonstrate fitting residual-error 

structures to data collected at uneven time points, we confine our analysis to those data collected at 

baseline and at days 4, 6, 8, and 12. We fit a full two-way factorial model of strength on program 

and day, with an unstructured residual-error covariance matrix over those repeated measurements 

taken on the same subject:


. keep if inlist(day, 0, 4, 6, 8, 12) 

(74 observations deleted) 

. mixed strength i.program##i.day || id:, 

> noconstant residuals(unstructured, t(day)) nolog 




avg = 4.7 

max = 5 

Wald chi2(9) = 45.85 


strength Coef. Std. Err. z P>|z| [95% Conf. Interval] 

2.program 1.360119 1.003549 1.36 0.175 -.6068016 3.32704 

day 

4 1.125 .3322583 3.39 0.001 .4737858 1.776214 

6 1.360127 .3766894 3.61 0.000 .6218298 2.098425 

8 1.583563 .4905876 3.23 0.001 .6220287 2.545097 

12 1.623576 .5372947 3.02 0.003 .5704977 2.676654 

program#day 

2 4 -.169034 .4423472 -0.38 0.702 -1.036019 .6979505 

2 6 .2113012 .4982385 0.42 0.671 -.7652283 1.187831 

2 8 -.1299763 .6524813 -0.20 0.842 -1.408816 1.148864 

2 12 .3212829 .7306782 0.44 0.660 -1.11082 1.753386 

_cons 79.6875 .7560448 105.40 0.000 78.20568 81.16932 


id: 

(empty) 

Residual: Unstructured 

var(e0) 9.14566 2.126248 5.79858 14.42475 

var(e4) 11.87114 2.761219 7.524948 18.72757 

var(e6) 10.06571 2.348863 6.371091 15.90284 

var(e8) 13.22464 3.113921 8.335981 20.98026 

var(e12) 13.16909 3.167347 8.219208 21.09995 

cov(e0,e4) 9.625236 2.33197 5.054659 14.19581 

cov(e0,e6) 8.489043 2.106377 4.36062 12.61747 

cov(e0,e8) 9.280414 2.369554 4.636173 13.92465 

cov(e0,e12) 8.898006 2.348243 4.295535 13.50048 

cov(e4,e6) 10.49185 2.492529 5.606578 15.37711 

cov(e4,e8) 11.89787 2.848751 6.314421 17.48132 

cov(e4,e12) 11.28344 2.805027 5.785689 16.78119 

cov(e6,e8) 11.0507 2.646988 5.862697 16.2387 

cov(e6,e12) 10.5006 2.590278 5.423748 15.57745 

cov(e8,e12) 12.4091 3.010796 6.508051 18.31016 





Because we are using the variable id only to group the repeated measurements and not to introduce 

random effects at the subject level, we use the noconstant option to omit any subject-level effects.


The unstructured covariance matrix is the most general and contains many parameters. In this example, 

we estimate a distinct residual variance for each day and a distinct covariance for each pair of days. 

That there is positive covariance between all pairs of measurements is evident, but what is not as 

evident is whether the covariances may be more parsimoniously represented. One option would be to 

explore whether the correlation diminishes as the time gap between strength measurements increases 

and whether it diminishes systematically. Given the irregularity of the time intervals, an exponential 

structure would be more appropriate than, say, an AR or MA structure. 

. estimates store unstructured 

. mixed strength i.program##i.day || id:, noconstant 

> residuals(exponential, t(day)) nolog nofetable 




avg = 4.7 

max = 5 

Wald chi2(9) = 36.77 



id: 

(empty) 

Residual: Exponential 

rho .9786462 .0051238 .9659207 .9866854 

var(e) 11.22349 2.338371 7.460765 16.88389 





In the above example, we suppressed displaying the main regression parameters because they 

did not differ much from those of the previous model. While the unstructured model estimated 15 

variance–covariance parameters, the exponential model claims to get the job done with just 2, a fact 

that is not disputed by an LR test comparing the two nested models (at least not at the 0.01 level). 

. lrtest unstructured . 


(Assumption: . nested in unstructured) Prob > chi2 = 0.0481 



reported test is conservative.


Crossed-effects models 

Not all mixed models contain nested levels of random effects. 

Example 10 

Returning to our longitudinal analysis of pig weights, suppose that instead of (5) we wish to fit 

for the i = 1, . . . , 9 weeks and j = 1, . . . , 48 pigs and 

weight ij = β 0 + β 1 week ij + u i + v j + ɛ ij (8) 

u i ∼ N(0, σ 2 u); v j ∼ N(0, σ 2 v); ɛ ij ∼ N(0, σ 2 ɛ ) 

all independently. Both (5) and (8) assume an overall population-average growth curve β 0 + β 1 week 

and a random pig-specific shift. 

The models differ in how week enters into the random part of the model. In (5), we assume 

that the effect due to week is linear and pig specific (a random slope); in (8), we assume that the 

effect due to week, u i , is systematic to that week and common to all pigs. The rationale behind (8) 

could be that, assuming that the pigs were measured contemporaneously, we might be concerned that 

week-specific random factors such as weather and feeding patterns had significant systematic effects 

on all pigs. 

Model (8) is an example of a two-way crossed-effects model, with the pig effects v j being crossed 

with the week effects u i . One way to fit such models is to consider all the data as one big cluster, 

and treat the u i and v j as a series of 9 + 48 = 57 random coefficients on indicator variables for 

week and pig. In the notation of (2), 

⎡ ⎤ 

u 1 

. . 

u 

u = 

9 

∼ N(0, G); G = 

v 

⎢ 1 

⎣ . ⎥ 

. 

⎦ 

v 48 

[ ] 

σ 

2 

u I 9 0 

0 σvI 2 48 

Because G is block diagonal, it can be represented in mixed as repeated-level equations. All we need 

is an identification variable to identify all the observations as one big group and a way to tell mixed 

to treat week and pig as factor variables (or equivalently, as two sets of overparameterized indicator 

variables identifying weeks and pigs, respectively). mixed supports the special group designation 

all for the former and the R.varname notation for the latter.


. use http://www.stata-press.com/data/r13/pig, clear 

(Longitudinal analysis of pig weights) 

. mixed weight week || _all: R.week || _all: R.id 







Group variable: _all Number of groups = 1 


avg = 432.0 

max = 432 

Wald chi2(1) = 13258.28 



week 6.209896 .0539313 115.14 0.000 6.104192 6.315599 

_cons 19.35561 .6333982 30.56 0.000 18.11418 20.59705 


_all: Identity 

_all: Identity 

var(R.week) .0849874 .0868856 .0114588 .6303302 

var(R.id) 14.83623 3.126142 9.816733 22.42231 

var(Residual) 4.297328 .3134404 3.724888 4.957741 



. estimates store crossed 

Thus we estimate ̂σ 2 u = 0.08 and ̂σ 2 v = 14.84. Both (5) and (8) estimate a total of five parameters: 

two fixed effects and three variance components. The models, however, are not nested within each 

other, which precludes the use of an LR test to compare both models. Refitting model (5) and looking 

at the Akaike information criteria values by using estimates stats, 

. quietly mixed weight week || id:week 

. estimates stats crossed . 

Akaike’s information criterion and Bayesian information criterion 

Model Obs ll(null) ll(model) df AIC BIC 

crossed 432 . -1013.824 5 2037.648 2057.99 

. 432 . -869.0383 5 1748.077 1768.419 

Note: 

N=Obs used in calculating BIC; see [R] BIC note 

definitely favors model (5). This finding is not surprising given that our rationale behind (8) was 

somewhat fictitious. In our estimates stats output, the values of ll(null) are missing. mixed 

does not fit a constant-only model as part of its usual estimation of the full model, but you can use 

mixed to fit a constant-only model directly, if you wish.


The R.varname notation is equivalent to giving a list of overparameterized (none dropped) 

indicator variables for use in a random-effects specification. When you specify R.varname, mixed 

handles the calculations internally rather than creating the indicators in the data. Because the set of 

indicators is overparameterized, R.varname implies noconstant. You can include factor variables in 

the fixed-effects specification by using standard methods; see [U] 11.4.3 Factor variables. However, 

random-effects equations support only the R.varname factor specification. For more complex factor 

specifications (such as interactions) in random-effects equations, use generate to form the variables 

manually, as we demonstrated in example 6. 


Although we were able to fit the crossed-effects model (8), it came at the expense of increasing the 

column dimension of our random-effects design from 2 in model (5) to 57 in model (8). Computation 

time and memory requirements grow (roughly) quadratically with the dimension of the random effects. 

As a result, fitting such crossed-effects models is feasible only when the total column dimension is 

small to moderate. 

Reexamining model (8), we note that if we drop u i , we end up with a model equivalent to (4), 

meaning that we could have fit (4) by typing 

instead of 

. mixed weight week || _all: R.id 

. mixed weight week || id: 

as we did when we originally fit the model. The results of both estimations are identical, but the 

latter specification, organized at the cluster (pig) level with random-effects dimension 1 (a random 

intercept) is much more computationally efficient. Whereas with the first form we are limited in how 

many pigs we can analyze, there is no such limitation with the second form. 

Furthermore, we fit model (8) by using 

. mixed weight week || _all: R.week || _all: R.id 

as a direct way to demonstrate the R. notation. However, we can technically treat pigs as nested 

within the all group, yielding the equivalent and more efficient (total column dimension 10) way 

to fit (8): 

. mixed weight week || _all: R.week || id: 

We leave it to you to verify that both produce identical results. See Rabe-Hesketh and Skrondal (2012) 

for additional techniques to make calculations more efficient in more complex models. 

Example 11 

As another example of how the same model may be fit in different ways by using mixed (and 

as a way to demonstrate covariance(exchangeable)), consider the three-level model used in 

example 4: 

y jk = X jk β + u (3) 

k 

+ u (2) 

jk + ɛ jk


where y jk represents the logarithms of gross state products for the n jk = 17 observations from state 

j in region k, X jk is a set of regressors, u (3) 

k 

is a random intercept at the region level, and u (2) 

jk 

is 

a random intercept at the state (nested within region) level. We assume that u (3) 

k 

∼ N(0, σ3) 2 and 

u (2) 

jk 

∼ N(0, σ2 2) independently. Define 

⎡ 

v k = ⎢ 

⎣ 

u (3) 

k 

+ u (2) 

1k 

u (3) 

k 

+ u (2) 

2k 

. 

. 

u (3) 

k 

+ u (2) 

M k ,k 

where M k is the number of states in region k. Making this substitution, we can stack the observations 

for all the states within region k to get 

⎤ 

⎥ 

⎦ 

y k = X k β + Z k v k + ɛ k 

where Z k is a set of indicators identifying the states within each region; that is, 

for a k-column vector of 1s J k , and 

Z k = I Mk ⊗ J 17 

⎡ 

σ3 2 + σ2 2 σ3 2 · · · σ 2 ⎤ 

3 

σ3 2 σ3 2 + σ2 2 · · · σ3 

2 Σ = Var(v k ) = ⎢ 

⎣ 

. 

. 

. .. 

. ⎥ .. ⎦ 

σ3 2 σ3 2 σ3 2 σ3 2 + σ2 

2 

M k ×M k 

Because Σ is an exchangeable matrix, we can fit this alternative form of the model by specifying the 

exchangeable covariance structure.


. use http://www.stata-press.com/data/r13/productivity 

(Public Capital Productivity) 

. mixed gsp private emp hwy water other unemp || region: R.state, 

> cov(exchangeable) 



Group variable: region Number of groups = 9 


avg = 90.7 

max = 136 

Wald chi2(6) = 18829.06 


gsp Coef. Std. Err. z P>|z| [95% Conf. Interval] 

private .2671484 .0212591 12.57 0.000 .2254813 .3088154 

emp .7540721 .0261868 28.80 0.000 .7027468 .8053973 

hwy .0709767 .023041 3.08 0.002 .0258172 .1161363 

water .0761187 .0139248 5.47 0.000 .0488266 .1034109 

other -.0999955 .0169366 -5.90 0.000 -.1331907 -.0668004 

unemp -.0058983 .0009031 -6.53 0.000 -.0076684 -.0041282 

_cons 2.128823 .1543855 13.79 0.000 1.826233 2.431413 


region: Exchangeable 

var(R.state) .0077263 .0017926 .0049032 .0121749 

cov(R.state) .0014506 .0012995 -.0010963 .0039975 

var(Residual) .0013461 .0000689 .0012176 .0014882 



The estimates of the fixed effects and their standard errors are equivalent to those from example 4, 

and remapping the variance components from (σ 2 3 + σ 2 2, σ 2 3, σ 2 ɛ ), as displayed here, to (σ 2 3, σ 2 2, σ 2 ɛ ), 

as displayed in example 4, will show that they are equivalent as well. 

Of course, given the discussion in the previous technical note, it is more efficient to fit this model 

as we did originally, as a three-level model. 

Diagnosing convergence problems 

Given the flexibility of mixed-effects models, you will find that some models fail to converge 

when used with your data; see Diagnosing convergence problems in [ME] me for advice applicable 

to mixed-effects models in general. 

In unweighted LME models with independent and homoskedastic residuals, one useful way to 

diagnose problems of nonconvergence is to rely on the EM algorithm (Dempster, Laird, and Rubin 1977), 

normally used by mixed only as a means of refining starting values. The advantages of EM are that it 

does not require a Hessian calculation, each successive EM iteration will result in a larger likelihood, 

iterations can be calculated quickly, and iterations will quickly bring parameter estimates into a 

neighborhood of the solution. The disadvantages of EM are that, once in a neighborhood of the


solution, it can be slow to converge, if at all, and EM provides no facility for estimating standard 

errors of the estimated variance components. One useful property of EM is that it is always willing 

to provide a solution if you allow it to iterate enough times, if you are satisfied with being in a 

neighborhood of the optimum rather than right on the optimum, and if standard errors of variance 

components are not crucial to your analysis. 

If you encounter a nonconvergent model, try using the emonly option to bypass gradient-based 

optimization. Use emiterate(#) to specify the maximum number of EM iterations, which you will 

usually want to set much higher than the default of 20. If your EM solution shows an estimated 

variance component that is near 0, a ridge is formed by an interval of values near 0, which produces 

the same likelihood and looks equally good to the optimizer. In this case, the solution is to drop the 

offending variance component from the model. 

Survey data 

Multilevel modeling of survey data is a little different from standard modeling in that weighted 

sampling can take place at multiple levels in the model, resulting in multiple sampling weights. Most 

survey datasets, regardless of the design, contain one overall inclusion weight for each observation in 

the data. This weight reflects the inverse of the probability of ultimate selection, and by “ultimate” we 

mean that it factors in all levels of clustered sampling, corrections for noninclusion and oversampling, 

poststratification, etc. 

For simplicity, in what follows assume a simple two-stage sampling design where groups are 

randomly sampled and then individuals within groups are sampled. Also assume that no additional 

weight corrections are performed; that is, sampling weights are simply the inverse of the probability 

of selection. The sampling weight for observation i in cluster j in our two-level sample is then 

w ij = 1/π ij , where π ij is the probability that observation i, j is selected. If you were performing a 

standard analysis such as OLS regression with regress, you would simply use a variable holding w ij 

as your pweight variable, and the fact that it came from two levels of sampling would not concern 

you. Perhaps you would type vce(cluster groupvar) where groupvar identifies the top-level groups 

to get standard errors that control for correlation within these groups, but you would still use only a 

single weight variable. 

Now take these same data and fit a two-level model with mixed. As seen in (14) in Methods and 

formulas later in this entry, it is not sufficient to use the single sampling weight w ij , because weights 

enter into the log likelihood at both the group level and the individual level. Instead, what is required 

for a two-level model under this sampling design is w j , the inverse of the probability that group j 

is selected in the first stage, and w i|j , the inverse of the probability that individual i from group j is 

selected at the second stage conditional on group j already being selected. It simply will not do to 

just use w ij without making any assumptions about w j . 

Given the rules of conditional probability, w ij = w j w i|j . If your dataset has only w ij , then you 

will need to either assume equal probability sampling at the first stage (w j = 1 for all j) or find 

some way to recover w j from other variables in your data; see Rabe-Hesketh and Skrondal (2006) 

and the references therein for some suggestions on how to do this, but realize that there is little yet 

known about how well these approximations perform in practice. 

What you really need to fit your two-level model are data that contain w j in addition to either w ij 

or w i|j . If you have w ij —that is, the unconditional inclusion weight for observation i, j—then you 

need to either divide w ij by w j to obtain w i|j or rescale w ij so that its dependence on w j disappears. 

If you already have w i|j , then rescaling becomes optional (but still an important decision to make). 

Weight rescaling is not an exact science, because the scale of the level-one weights is at issue 

regardless of whether they represent w ij or w i|j : because w ij is unique to group j, the group-to-group


magnitudes of these weights need to be normalized so that they are “consistent” from group to group. 

This is in stark contrast to a standard analysis, where the scale of sampling weights does not factor 

into estimation, instead only affecting the estimate of the total population size. 

mixed offers three methods for standardizing weights in a two-level model, and you can specify 

which method you want via the pwscale() option. If you specify pwscale(size), then the w i|j (or 

w ij , it does not matter) are scaled to sum to the cluster size n j . Method pwscale(effective) adds 

in a dependence on the sum of the squared weights so that level-one weights sum to the “effective” 

sample size. Just like pwscale(size), pwscale(effective) also behaves the same whether you 

have w i|j or w ij , and so it can be used with either. 

Although both pwscale(size) and pwscale(effective) leave w j untouched, the pwscale(gk) 

method is a little different in that 1) it changes the weights at both levels and 2) it does assume 

you have w i|j for level-one weights and not w ij (if you have the latter, then first divide by w j ). 

Using the method of Graubard and Korn (1996), it sets the weights at the group level (level two) to 

the cluster averages of the products of both level weights (this product being w ij ). It then sets the 

individual weights to 1 everywhere; see Methods and formulas for the computational details of all 

three methods. 

Determining which method is “best” is a tough call and depends on cluster size (the smaller 

the clusters, the greater the sensitivity to scale), whether the sampling is informative (that is, the 

sampling weights are correlated with the residuals), whether you are interested primarily in regression 

coefficients or in variance components, whether you have a simple random-intercept model or a 

more complex random-coefficients model, and other factors; see Rabe-Hesketh and Skrondal (2006), 

Carle (2009), and Pfeffermann et al. (1998) for some detailed advice. At the very least, you want 

to compare estimates across all three scaling methods (four, if you add no scaling) and perform a 

sensitivity analysis. 

If you choose to rescale level-one weights, it does not matter whether you have w i|j or w ij . For 

the pwscale(size) and pwscale(effective) methods, you get identical results, and even though 

pwscale(gk) assumes w i|j , you can obtain this as w i|j = w ij /w j before proceeding. 

If you do not specify pwscale(), then no scaling takes place, and thus at a minimum, you need 

to make sure you have w i|j in your data and not w ij . 

Example 12 

Rabe-Hesketh and Skrondal (2006) analyzed data from the 2000 Programme for International 

Student Assessment (PISA) study on reading proficiency among 15-year-old American students, as 

performed by the Organisation for Economic Co-operation and Development (OECD). The original 

study was a three-stage cluster sample, where geographic areas were sampled at the first stage, schools 

at the second, and students at the third. Our version of the data does not contain the geographic-areas 

variable, so we treat this as a two-stage sample where schools are sampled at the first stage and 

students at the second.


. use http://www.stata-press.com/data/r13/pisa2000 

(Programme for International Student Assessment (PISA) 2000 data) 

. describe 

Contains data from http://www.stata-press.com/data/r13/pisa2000.dta 

obs: 2,069 Programme for International 

Student Assessment (PISA) 2000 

data 

vars: 11 12 Jun 2012 10:08 




female byte %8.0g 1 if female 

isei byte %8.0g International socio-economic 

index 

w_fstuwt float %9.0g Student-level weight 

wnrschbw float %9.0g School-level weight 

high_school byte %8.0g 1 if highest level by either 

parent is high school 

college byte %8.0g 1 if highest level by either 

parent is college 

one_for byte %8.0g 1 if one parent foreign born 

both_for byte %8.0g 1 if both parents are foreign 

born 

test_lang byte %8.0g 1 if English (the test language) 

is spoken at home 

pass_read byte %8.0g 1 if passed reading proficiency 

threshold 

id_school int %8.0g School ID 

Sorted by: 

For student i in school j, where the variable id school identifies the schools, the variable 

w fstuwt is a student-level overall inclusion weight (w ij , not w i|j ) adjusted for noninclusion and 

nonparticipation of students, and the variable wnrschbw is the school-level weight w j adjusted for 

oversampling of schools with more minority students. The weight adjustments do not interfere with 

the methods prescribed above, and thus we can treat the weight variables simply as w ij and w j , 

respectively. 

Rabe-Hesketh and Skrondal (2006) fit a two-level logistic model for passing a reading proficiency 

threshold. We fit a two-level linear random-intercept model for socioeconomic index. Because we 

have w ij and not w i|j , we rescale using pwscale(size) and thus obtain results as if we had w i|j .


. mixed isei female high_school college one_for both_for test_lang 

> [pw=w_fstuwt] || id_school:, pweight(wnrschbw) pwscale(size) 


Mixed-effects regression Number of obs = 2069 

Group variable: id_school Number of groups = 148 


avg = 14.0 

max = 28 

Wald chi2(6) = 187.23 

Log pseudolikelihood = -1443093.9 Prob > chi2 = 0.0000 

(Std. Err. adjusted for 148 clusters in id_school) 

Robust 

isei Coef. Std. Err. z P>|z| [95% Conf. Interval] 

female .59379 .8732886 0.68 0.497 -1.117824 2.305404 

high_school 6.410618 1.500337 4.27 0.000 3.470011 9.351224 

college 19.39494 2.121145 9.14 0.000 15.23757 23.55231 

one_for -.9584613 1.789947 -0.54 0.592 -4.466692 2.54977 

both_for -.2021101 2.32633 -0.09 0.931 -4.761633 4.357413 

test_lang 2.519539 2.393165 1.05 0.292 -2.170978 7.210056 

_cons 28.10788 2.435712 11.54 0.000 23.33397 32.88179 

Robust 


id_school: Identity 

var(_cons) 34.69374 8.574865 21.37318 56.31617 

var(Residual) 218.7382 11.22111 197.8147 241.8748 

Notes: 

1. We specified the level-one weights using standard Stata weight syntax, that is, [pw=w fstuwt]. 

2. We specified the level-two weights via the pweight(wnrschbw) option as part of the randomeffects 

specification for the id school level. As such, it is treated as a school-level weight. 

Accordingly, wnrschbw needs to be constant within schools, and mixed did check for that before 

estimating. 

3. Because our level-one weights are unconditional, we specified pwscale(size) to rescale them. 

4. As is the case with other estimation commands in Stata, standard errors in the presence of sampling 

weights are robust. 

5. Robust standard errors are clustered at the top level of the model, and this will always be true unless 

you specify vce(cluster clustvar), where clustvar identifies an even higher level of grouping. 

As a form of sensitivity analysis, we compare the above with scaling via pwscale(gk). Because 

pwscale(gk) assumes w i|j , you want to first divide w ij by w j . But you can handle that within the 

weight specification itself.


. mixed isei female high_school college one_for both_for test_lang 

> [pw=w_fstuwt/wnrschbw] || id_school:, pweight(wnrschbw) pwscale(gk) 


Mixed-effects regression Number of obs = 2069 

Group variable: id_school Number of groups = 148 


avg = 14.0 

max = 28 

Wald chi2(6) = 291.37 

Log pseudolikelihood = -7270505.6 Prob > chi2 = 0.0000 

(Std. Err. adjusted for 148 clusters in id_school) 

Robust 

isei Coef. Std. Err. z P>|z| [95% Conf. Interval] 

female -.3519458 .7436334 -0.47 0.636 -1.80944 1.105549 

high_school 7.074911 1.139777 6.21 0.000 4.84099 9.308833 

college 19.27285 1.286029 14.99 0.000 16.75228 21.79342 

one_for -.9142879 1.783091 -0.51 0.608 -4.409082 2.580506 

both_for 1.214151 1.611885 0.75 0.451 -1.945085 4.373388 

test_lang 2.661866 1.556491 1.71 0.087 -.3887996 5.712532 

_cons 31.20145 1.907413 16.36 0.000 27.46299 34.93991 

Robust 


id_school: Identity 

var(_cons) 31.67522 6.792239 20.80622 48.22209 

var(Residual) 226.2429 8.150714 210.8188 242.7955 

The results are somewhat similar to before, which is good news from a sensitivity standpoint. Note 

that we specified [pw=w fstwtw/wnrschbw] and thus did the conversion from w ij to w i|j within 

our call to mixed. 

We close this section with a bit of bad news. Although weight rescaling and the issues that arise 

have been well studied for two-level models, as pointed out by Carle (2009), “. . . a best practice 

for scaling weights across multiple levels has yet to be advanced.” As such, pwscale() is currently 

supported only for two-level models. If you are fitting a higher-level model with survey data, you 

need to make sure your sampling weights are conditional on selection at the previous stage and not 

overall inclusion weights, because there is currently no rescaling option to fall back on if you do not.


Stored results 

mixed stores the following in e(): 

Scalars 

e(N) 

number of observations 

e(k) 

number of parameters 

e(k f) 

number of fixed-effects parameters 

e(k r) 

number of random-effects parameters 

e(k rs) 

number of variances 

e(k rc) 

number of covariances 

e(k res) 

number of residual-error parameters 

e(N clust) 

number of clusters 

e(nrgroups) 

number of residual-error by() groups 

e(ar p) 

AR order of residual errors, if specified 

e(ma q) 

MA order of residual errors, if specified 

e(res order) 

order of residual-error structure, if appropriate 

e(df m) 

model degrees of freedom 

e(ll) 

log (restricted) likelihood 

e(chi2) χ 2 

e(p) 

significance 

e(ll c) 

log likelihood, comparison model 

e(chi2 c) 

χ 2 , comparison model 

e(df c) 

degrees of freedom, comparison model 

e(p c) 

significance, comparison model 

e(rank) 

rank of e(V) 

e(rc) 

return code 

e(converged) 

1 if converged, 0 otherwise 

Macros 

e(cmd) 

mixed 

e(cmdline) 

command as typed 

e(depvar) 

name of dependent variable 

e(wtype) 

weight type (first-level weights) 

e(wexp) 

weight expression (first-level weights) 

e(fweightk) 

fweight expression for kth highest level, if specified 

e(pweightk) 

pweight expression for kth highest level, if specified 

e(ivars) 

grouping variables 

e(title) 

title in estimation output 

e(redim) 

random-effects dimensions 

e(vartypes) 

variance-structure types 

e(revars) 

random-effects covariates 

e(resopt) 

residuals() specification, as typed 

e(rstructure) 

residual-error structure 

e(rstructlab) 

residual-error structure output label 

e(rbyvar) 

residual-error by() variable, if specified 

e(rglabels) 

residual-error by() groups labels 

e(pwscale) 

sampling-weight scaling method 

e(timevar) 

residual-error t() variable, if specified 

e(chi2type) Wald; type of model χ 2 test 

e(clustvar) 

name of cluster variable 

e(vce) 

vcetype specified in vce() 

e(vcetype) 

title used to label Std. Err. 

e(method) 

ML or REML 

e(opt) 

type of optimization 

e(optmetric) 

matsqrt or matlog; random-effects matrix parameterization 

e(emonly) 

emonly, if specified 

e(ml method) 

type of ml method 

e(technique) 

maximization technique 

e(properties) 

b V 

e(estat cmd) 

program used to implement estat 

e(predict) 

program used to implement predict 

e(asbalanced) 

factor variables fvset as asbalanced 

e(asobserved) 

factor variables fvset as asobserved


Matrices 

e(b) 

e(N g) 

e(g min) 

e(g avg) 

e(g max) 

e(tmap) 

e(V) 

e(V modelbased) 

Functions 

e(sample) 

coefficient vector 

group counts 

group-size minimums 

group-size averages 

group-size maximums 

ID mapping for unstructured residual errors 

variance–covariance matrix of the estimator 

model-based variance 

marks estimation sample 

Methods and formulas 

As given by (1), in the absence of weights we have the linear mixed model 

y = Xβ + Zu + ɛ 

where y is the n × 1 vector of responses, X is an n × p design/covariate matrix for the fixed effects 

β, and Z is the n × q design/covariate matrix for the random effects u. The n × 1 vector of errors 

ɛ is for now assumed to be multivariate normal with mean 0 and variance matrix σɛ 2 I n . We also 

assume that u has variance–covariance matrix G and that u is orthogonal to ɛ so that 

[ ] [ ] 

u G 0 

Var = 

ɛ 0 σɛ 2 I n 

Considering the combined error term Zu + ɛ, we see that y is multivariate normal with mean Xβ 

and n × n variance–covariance matrix 

V = ZGZ ′ + σ 2 ɛ I n 

Defining θ as the vector of unique elements of G results in the log likelihood 

L(β, θ, σ 2 ɛ ) = − 1 2 

{ 

n log(2π) + log |V| + (y − Xβ) ′ V −1 (y − Xβ) } (9) 

which is maximized as a function of β, θ, and σɛ 2 . As explained in chapter 6 of Searle, Casella, 

and McCulloch (1992), considering instead the likelihood of a set of linear contrasts Ky that do not 

depend on β results in the restricted log likelihood 

L R (β, θ, σ 2 ɛ ) = L(β, θ, σ 2 ɛ ) − 1 2 log ∣ ∣ X ′ V −1 X ∣ ∣ (10) 

Given the high dimension of V, however, the log-likelihood and restricted log-likelihood criteria are 

not usually computed by brute-force application of the above expressions. Instead, you can simplify 

the problem by subdividing the data into independent clusters (and subclusters if possible) and using 

matrix decomposition methods on the smaller matrices that result from treating each cluster one at a 

time. 

Consider the two-level model described previously in (2), 

y j = X j β + Z j u j + ɛ j 

for j = 1, . . . , M clusters with cluster j containing n j observations, with Var(u j ) = Σ, a q × q 

matrix.


Efficient methods for computing (9) and (10) are given in chapter 2 of Pinheiro and Bates (2000). 

Namely, for the two-level model, define ∆ to be the Cholesky factor of σɛ 2Σ−1 , such that σɛ 2Σ−1 = 

∆ ′ ∆. For j = 1, . . . , M, decompose 

[ ] [ ] 

Zj R11j 

= Q 

∆ j 

0 

by using an orthogonal-triangular (QR) decomposition, with Q j a (n j + q)-square matrix and R 11j 

a q-square matrix. We then apply Q j as follows: 

[ ] [ ] [ ] [ ] 

R10j 

= Q ′ Xj c1j 

j ; 

= Q ′ yj 

j 

R 00j 0 c 0j 0 

Stack the R 00j and c 0j matrices, and perform the additional QR decomposition 

⎡ 

⎤ 

R 001 c 01 [ ] 

⎣ 

. 

⎦ R00 c 

. = Q 0 

0 

0 c 1 

c 0M 

R 00M 

Pinheiro and Bates (2000) show that ML estimates of β, σɛ 2 , and ∆ (the unique elements of ∆, 

that is) are obtained by maximizing the profile log likelihood (profiled in ∆) 

L(∆) = n 2 {log n − log(2π) − 1} − n log ||c 1|| + 

where || · || denotes the 2-norm. Following this maximization with 

REML estimates are obtained by maximizing 

followed by 

L R (∆) = n − p 

2 

M∑ 

log 

det(∆) 

∣det(R 11j ) ∣ (11) 

j=1 

̂β = R −1 

00 c 0; ̂σ 2 ɛ = n −1 ||c 1 || 2 (12) 

{log(n − p) − log(2π) − 1} − (n − p) log ||c 1 || 

− log |det(R 00 )| + 

M∑ 

log 

det(∆) 

∣det(R 11j ) ∣ 

j=1 

̂β = R −1 

00 c 0; ̂σ 2 ɛ = (n − p) −1 ||c 1 || 2 

For numerical stability, maximization of (11) and (13) is not performed with respect to the unique 

elements of ∆ but instead with respect to the unique elements of the matrix square root (or matrix 

logarithm if the matlog option is specified) of Σ/σɛ 2 ; define γ to be the vector containing these 

elements. 

Once maximization with respect to γ is completed, (γ, σɛ 2 ) is reparameterized to {α, log(σ ɛ )}, 

where α is a vector containing the unique elements of Σ, expressed as logarithms of standard 

deviations for the diagonal elements and hyperbolic arctangents of the correlations for off-diagonal 

elements. This last step is necessary 1) to obtain a joint variance–covariance estimate of the elements 

of Σ and σɛ 2 ; 2) to obtain a parameterization under which parameter estimates can be interpreted 

individually, rather than as elements of a matrix square root (or logarithm); and 3) to parameterize 

these elements such that their ranges each encompass the entire real line. 

(13)


Obtaining a joint variance–covariance matrix for the estimated {α, log(σ ɛ )} requires the evaluation 

of the log likelihood (or log-restricted likelihood) with only β profiled out. For ML, we have 

L ∗ {α, log(σ ɛ )} = L{∆(α, σɛ 2 ), σɛ 2 } 

= − n 2 log(2πσ2 ɛ ) − ||c 1|| 2 M∑ 

+ log 

det(∆) 

∣det(R 11j ) ∣ 

with the analogous expression for REML. 

The variance–covariance matrix of ̂β is estimated as 

̂Var(̂β) = ̂σ 

2 

ɛ R −1 

00 

2σ 2 ɛ 

( ) 

R 

−1 ′ 

00 

but this does not mean that ̂Var(̂β) is identical under both ML and REML because R00 depends on 

∆. Because ̂β is asymptotically uncorrelated with {̂α, log(̂σɛ )}, the covariance of ̂β with the other 

estimated parameters is treated as 0. 

Parameter estimates are stored in e(b) as {̂β, ̂α, log(̂σɛ )}, with the corresponding (block-diagonal) 

variance–covariance matrix stored in e(V). Parameter estimates can be displayed in this metric by 

specifying the estmetric option. However, in mixed output, variance components are most often 

displayed either as variances and covariances or as standard deviations and correlations. 

EM iterations are derived by considering the u j in (2) as missing data. Here we describe the 

procedure for maximizing the log likelihood via EM; the procedure for maximizing the restricted log 

likelihood is similar. The log likelihood for the full data (y, u) is 

L F (β, Σ, σ 2 ɛ ) = 

j=1 

M∑ { 

log f1 (y j |u j , β, σɛ 2 ) + log f 2 (u j |Σ) } 

j=1 

where f 1 (·) is the density function for multivariate normal with mean X j β + Z j u j and variance 

σɛ 2 I nj , and f 2 (·) is the density for multivariate normal with mean 0 and q × q covariance matrix 

Σ. As before, we can profile β and σɛ 2 out of the optimization, yielding the following EM iterative 

procedure: 

1. For the current iterated value of Σ (t) , fix ̂β = ̂β(Σ (t) ) and ̂σ 2 ɛ = ̂σ 2 ɛ (Σ (t) ) according to (12). 

2. Expectation step: Calculate 

{ 

} 

D(Σ) ≡ E L F (̂β, Σ, ̂σ 

2 

ɛ )|y 

= C − M 2 log det (Σ) − 1 2 

M∑ 

E ( u ′ j Σ−1 u j |y ) 

where C is a constant that does not depend on Σ, and the expected value of the quadratic form 

u ′ j Σ−1 u j is taken with respect to the conditional density f(u j |y, ̂β, Σ (t) , ̂σ 2 ɛ ). 

3. Maximization step: Maximize D(Σ) to produce Σ (t+1) . 

j=1


For general, symmetric Σ, the maximizer of D(Σ) can be derived explicitly, making EM iterations 

quite fast. 

For general, residual-error structures, 

Var(ɛ j ) = σ 2 ɛ Λ j 

where the subscript j merely represents that ɛ j and Λ j vary in dimension in unbalanced data, the 

data are first transformed according to 

y ∗ j = ̂Λ−1/2 

j y j ; X ∗ j = ̂Λ−1/2 

j X j ; Z ∗ j = ̂Λ−1/2 

j Z j ; 

and the likelihood-evaluation techniques described above are applied to yj ∗, X∗ j , and Z∗ j instead. 

The unique elements of Λ, ρ, are estimated along with the fixed effects and variance components. 

Because σɛ 2 is always estimated and multiplies the entire Λ j matrix, ̂ρ is parameterized to take this 

into account. 

In the presence of sampling weights, following Rabe-Hesketh and Skrondal (2006), the weighted 

log pseudolikelihood for a two-level model is given as 

[ 

M∑ ∫ { nj 

} ] 

∑ 

L(β, Σ, σɛ 2 ) = w j log exp w i|j log f 1 (y ij |u j , β, σɛ 2 ) f 2 (u j |Σ)du j (14) 

j=1 

i=1 

where w j is the inverse of the probability of selection for the jth cluster, w i|j is the inverse of the 

conditional probability of selection of individual i given the selection of cluster j, and f 1 (·) and 

f 2 (·) are the multivariate normal densities previously defined. 

Weighted estimation is achieved through incorporating w j and w i|j into the matrix decomposition 

methods detailed above to reflect replicated clusters for w j and replicated observations within clusters 

for w i|j . Because this estimation is based on replicated clusters and observations, frequency weights 

are handled similarly. 

Rescaling of sampling weights can take one of three available forms: 

Under pwscale(size), 

Under pwscale(effective), 

w ∗ i|j = n jw ∗ i|j 

w ∗ i|j = w∗ i|j 

{ nj 

{ nj 

} −1 

∑ 

w i|j 

i=1 

} { 

∑ ∑ 

nj 

w i|j 

i=1 

Under both the above, w j remains unchanged. For method pwscale(gk), however, both weights are 

modified: 

n 

∑ j 

wj ∗ = n −1 

j w i|j w j ; wi|j ∗ = 1 

i=1 

Under ML estimation, robust standard errors are obtained in the usual way (see [P] robust) with 

the one distinction being that in multilevel models, robust variances are, at a minimum, clustered at 

the highest level. This is because given the form of the log likelihood, scores aggregate at the top-level 

clusters. For a two-level model, scores are obtained as the partial derivatives of L j (β, Σ, σɛ 2 ) with 

respect to {β, α, log(σ ɛ )}, where L j is the log likelihood for cluster j and L = ∑ M 

j=1 L j. Robust 

variances are not supported under REML estimation because the form of the log restricted likelihood 

does not lend itself to separation by highest-level clusters. 

i=1 

w 2 i|j 

} −1


EM iterations always assume equal weighting and an independent, homoskedastic error structure. 

As such, with weighted data or when error structures are more complex, EM is used only to obtain 

starting values. 

For extensions to models with three or more levels, see Bates and Pinheiro (1998) and Rabe-Hesketh 

and Skrondal (2006). 

✄ 

✂ 

Charles Roy Henderson (1911–1989) was born in Iowa and grew up on the family farm. His 

education in animal husbandry, animal nutrition, and statistics at Iowa State was interspersed 

with jobs in the Iowa Extension Service, Ohio University, and the U.S. Army. After completing 

his PhD, Henderson joined the Animal Science faculty at Cornell. He developed and applied 

statistical methods in the improvement of farm livestock productivity through genetic selection, 

with particular focus on dairy cattle. His methods are general and have been used worldwide 

in livestock breeding and beyond agriculture. Henderson’s work on variance components and 

best linear unbiased predictions has proved to be one of the main roots of current mixed-model 

methods. 

 

✁ 

Acknowledgments 

We thank Badi Baltagi of the Department of Economics at Syracuse University and Ray Carroll 

of the Department of Statistics at Texas A&M University for each providing us with a dataset used 

in this entry. 

References 

Andrews, M. J., T. Schank, and R. Upward. 2006. Practical fixed-effects estimation methods for the three-way 

error-components model. Stata Journal 6: 461–481. 

Baltagi, B. H., S. H. Song, and B. C. Jung. 2001. The unbalanced nested error component regression model. Journal 

of Econometrics 101: 357–381. 

Bates, D. M., and J. C. Pinheiro. 1998. Computational methods for multilevel modelling. In Technical Memorandum 

BL0112140-980226-01TM. Murray Hill, NJ: Bell Labs, Lucent Technologies. 

http://stat.bell-labs.com/NLME/CompMulti.pdf. 

Cameron, A. C., and P. K. Trivedi. 2010. Microeconometrics Using Stata. Rev. ed. College Station, TX: Stata Press. 

Canette, I. 2011. Including covariates in crossed-effects models. The Stata Blog: Not Elsewhere Classified. 

http://blog.stata.com/2010/12/22/including-covariates-in-crossed-effects-models/. 

Carle, A. C. 2009. Fitting multilevel models in complex survey data with design weights: Recommendations. BMC 

Medical Research Methodology 9: 49. 

Demidenko, E. 2004. Mixed Models: Theory and Applications. Hoboken, NJ: Wiley. 

Dempster, A. P., N. M. Laird, and D. B. Rubin. 1977. Maximum likelihood from incomplete data via the EM 

algorithm. Journal of the Royal Statistical Society, Series B 39: 1–38. 

Diggle, P. J., P. J. Heagerty, K.-Y. Liang, and S. L. Zeger. 2002. Analysis of Longitudinal Data. 2nd ed. Oxford: 

Oxford University Press. 

Fitzmaurice, G. M., N. M. Laird, and J. H. Ware. 2011. Applied Longitudinal Analysis. 2nd ed. Hoboken, NJ: Wiley. 

Goldstein, H. 1986. Efficient statistical modelling of longitudinal data. Annals of Human Biology 13: 129–141. 

Graubard, B. I., and E. L. Korn. 1996. Modelling the sampling design in the analysis of health surveys. Statistical 

Methods in Medical Research 5: 263–281. 

Harville, D. A. 1977. Maximum likelihood approaches to variance component estimation and to related problems. 

Journal of the American Statistical Association 72: 320–338.


Henderson, C. R. 1953. Estimation of variance and covariance components. Biometrics 9: 226–252. 

Hocking, R. R. 1985. The Analysis of Linear Models. Monterey, CA: Brooks/Cole. 

Horton, N. J. 2011. Stata tip 95: Estimation of error covariances in a linear model. Stata Journal 11: 145–148. 

Laird, N. M., and J. H. Ware. 1982. Random-effects models for longitudinal data. Biometrics 38: 963–974. 

LaMotte, L. R. 1973. Quadratic estimation of variance components. Biometrics 29: 311–330. 

Marchenko, Y. V. 2006. Estimating variance components in Stata. Stata Journal 6: 1–21. 

McCulloch, C. E., S. R. Searle, and J. M. Neuhaus. 2008. Generalized, Linear, and Mixed Models. 2nd ed. Hoboken, 

NJ: Wiley. 

Munnell, A. H. 1990. Why has productivity growth declined? Productivity and public investment. New England 

Economic Review Jan./Feb.: 3–22. 

Nichols, A. 2007. Causal inference with observational data. Stata Journal 7: 507–541. 

Pantazis, N., and G. Touloumi. 2010. Analyzing longitudinal data in the presence of informative drop-out: The jmre1 

command. Stata Journal 10: 226–251. 

Pfeffermann, D., C. J. Skinner, D. J. Holmes, H. Goldstein, and J. Rasbash. 1998. Weighting for unequal selection 

probabilities in multilevel models. Journal of the Royal Statistical Society, Series B 60: 23–40. 

Pierson, R. A., and O. J. Ginther. 1987. Follicular population dynamics during the estrous cycle of the mare. Animal 

Reproduction Science 14: 219–231. 

Pinheiro, J. C., and D. M. Bates. 2000. Mixed-Effects Models in S and S-PLUS. New York: Springer. 

Prosser, R., J. Rasbash, and H. Goldstein. 1991. ML3 Software for 3-Level Analysis: User’s Guide for V. 2. London: 

Institute of Education, University of London. 

Rabe-Hesketh, S., and A. Skrondal. 2006. Multilevel modelling of complex survey data. Journal of the Royal Statistical 

Society, Series A 169: 805–827. 

. 2012. Multilevel and Longitudinal Modeling Using Stata. 3rd ed. College Station, TX: Stata Press. 

Rao, C. R. 1973. Linear Statistical Inference and Its Applications. 2nd ed. New York: Wiley. 

Raudenbush, S. W., and A. S. Bryk. 2002. Hierarchical Linear Models: Applications and Data Analysis Methods. 

2nd ed. Thousand Oaks, CA: Sage. 

Ruppert, D., M. P. Wand, and R. J. Carroll. 2003. Semiparametric Regression. Cambridge: Cambridge University 

Press. 

Schunck, R. 2013. Within and between estimates in random-effects models: Advantages and drawbacks of correlated 

random effects and hybrid models. Stata Journal 13: 65–76. 

Searle, S. R. 1989. Obituary: Charles Roy Henderson 1911–1989. Biometrics 45: 1333–1335. 

Searle, S. R., G. Casella, and C. E. McCulloch. 1992. Variance Components. New York: Wiley. 

Thompson, W. A., Jr. 1962. The problem of negative estimates of variance components. Annals of Mathematical 

Statistics 33: 273–289. 

Verbeke, G., and G. Molenberghs. 2000. Linear Mixed Models for Longitudinal Data. New York: Springer. 

West, B. T., K. B. Welch, and A. T. Galecki. 2007. Linear Mixed Models: A Practical Guide Using Statistical 

Software. Boca Raton, FL: Chapman & Hall/CRC. 

Wiggins, V. L. 2011. Multilevel random effects in xtmixed and sem—the long and wide of it. The Stata Blog: Not 

Elsewhere Classified. 

http://blog.stata.com/2011/09/28/multilevel-random-effects-in-xtmixed-and-sem-the-long-and-wide-of-it/.

Also see 


[ME] mixed postestimation — Postestimation tools for mixed 

[ME] me — Introduction to multilevel mixed-effects models 

[MI] estimation — Estimation commands for use with mi estimate 

[SEM] intro 5 — Tour of models (Multilevel mixed-effects models) 

[XT] xtrc — Random-coefficients model 

[XT] xtreg — Fixed-, between-, and random-effects and population-averaged linear models 

[U] 20 Estimation and postestimation commands

mixed - Stata

Create successful ePaper yourself

Delete template?

Save as template?