12.07.2015 Views

correlated random effects models with unbalanced panels

correlated random effects models with unbalanced panels

correlated random effects models with unbalanced panels

SHOW MORE
SHOW LESS
  • No tags were found...

Create successful ePaper yourself

Turn your PDF publications into a flip-book with our unique Google optimized e-Paper software.

Abstract: I propose some strategies for allowing unobserved heterogeneity to be <strong>correlated</strong><strong>with</strong> observed covariates and sample selection for <strong>unbalanced</strong> <strong>panels</strong>. The methods areextensions of the Chamberlain-Mundlak approach for balanced <strong>panels</strong>. Even for nonlinear<strong>models</strong>, in many cases the estimators can be implemented using standard software. Theframework suggests straightforward tests for sample selection that is <strong>correlated</strong> <strong>with</strong>unobserved shocks while allowing selection to be <strong>correlated</strong> <strong>with</strong> the observed covariates andunobserved heterogeneity.2


1. IntroductionCorrelated <strong>random</strong> <strong>effects</strong> (CRE) approaches to nonlinear panel data <strong>models</strong> are popular<strong>with</strong> empirical researchers, partly because of their simplicity but also because recent research(for example, Blundell and Powell (2003), Altonji and Matzkin (2005), and Wooldridge(2005)) shows that quantities of interest – usually called “average marginal <strong>effects</strong>” (AMEs) or“average partial <strong>effects</strong>” (APEs) – are identified under nonparametric restrictions on thedistribution of heterogeneity given the covariate process. (Exchangeability is one suchrestriction, but it is not the only one.) Wooldridge (2002) shows how the CRE approach appliesto commonly used <strong>models</strong>, such as unobserved <strong>effects</strong> probit, tobit, and count <strong>models</strong>. Papkeand Wooldridge (2008) propose simple CRE methods when the response variable is a fractionor proportion.The leading competitor to CRE approaches are so-called “fixed <strong>effects</strong>” (FE) methods,which, for the purposes of this paper, treat the heterogeneity as parameters to be estimated.(Perhaps a better characterization is that the FE approach studies the properties of the fixedpopulation parameters and, more recently, average partial <strong>effects</strong>, when heterogeneity ishandled via estimating separate parameters for each population unit.) As is well known, exceptin some very special cases, estimating unobserved heterogeneity for each unit in the samplegenerally suffer from the incidental parameters problem – both in estimating populationparameters and APEs. Some headway has been made in obtaining bias-corrected versions of“fixed <strong>effects</strong>” estimators for nonlinear <strong>models</strong> – for example, Hahn and Newey (2004) andFernandez-Val (2008). These methods are promising, but they currently have several practicalshortcomings. First, the number of time periods needed for the bias adjustments to work well isoften greater than is available in many applications. Second, an important point is that recent3


Another method that is often used, but only in special cases, is conditional maximumlikelihood estimation. The general approach is to find a conditional likelihood function that isfree of the unobserved heterogeneity but depends on the population parameters. Unfortunately,this method has rather limited scope. It applies to linear <strong>models</strong> (but where we do not needed)and a few nonlinear <strong>models</strong>. It is most commonly applied to the unobserved <strong>effects</strong> logit modeland also to the unobserved <strong>effects</strong> Poisson regression model. When it applies, the CMLE hasthe advantage that it puts no restrictions on the heterogeneity distribution – eitherunconditionally or conditionally. Unfortunately, even in the limited cases where it applies,CMLE can impose substantive restrictions. For example, the CMLE for the logit model isinconsistent if the conditional independence assumption fails – see Kwak and Wooldridge(2009). (Other CMLEs are more robust, such as those for the linear and Poisson unobserved<strong>effects</strong> <strong>models</strong>, but again these are special cases. See Wooldridge (1999) for the Poisson case.)A positive feature of the CMLE approach is that it works for any number of time periods andimposes no restrictions on the time series properties of the covariates. However, becauseCMLEs are intended to leave heterogeneity distributions unspecified, it is unclear how toobtain average partial <strong>effects</strong>. In other words, we cannot estimate the magnitudes of <strong>effects</strong> ofcovariates.In the balanced panel case, CRE approaches put restrictions on the conditional distributionof heterogeneity given the entire history of the covariates. This is its drawback compared <strong>with</strong>FE or CMLE approaches. But it requires few other assumptions for estimating average partial<strong>effects</strong>, and the restrictions needed on the conditional heterogeneity distribution can be fairlyweak. For example, stationarity and weak dependence of the processes over time are notnecessary, although restrictions such as exchangeability can be very useful – see Altonji and5


future research.2. The Linear Model <strong>with</strong> AdditiveHeterogeneityIt is useful to begin <strong>with</strong> the standard linear model <strong>with</strong> additive heterogeneity. We can setthe framework for more complicated settings and at the same time summarize some results thatare useful for testing key assumptions.Assume that an underlying population consists of a large number of units for whom data onT time periods are potentially observable. We assume <strong>random</strong> sampling from this population,and denote a <strong>random</strong> draw, i. Along <strong>with</strong> the potentially observed outcome, y it , are potentiallyobserved covariates, x it . Generally, we also draw unobservables for each i; we are particularlyinterested in the unobserved heterogeneity, c i .To allow for <strong>unbalanced</strong> <strong>panels</strong>, we explicitly introduce a series of selection indicators foreach i, s i1 ,...,s iT ,wheres it 1 if time period t for unit i can be used in estimation. In thispaper, we only use information on units where a full set of data are observed. Therefore,s it 1 if and only if x it ,y it is fully observed; otherwise, s it 0. This is very common inpanel data applications <strong>with</strong> <strong>unbalanced</strong> <strong>panels</strong>.The linear model <strong>with</strong> additive heterogeneity isy it x it c i u it , t 1,...,T, (2.1)where x it can generally include a full set of time dummies, or other aggretate time variables.We view this as the equation that holds for underlying <strong>random</strong> variables in all T time periods.We are interested in this paper in estimators of that allow for correlation between c i and the8


history of covariates, x it : t 1,...,T. With balanced <strong>panels</strong>, a common assumption is strictexogeneity of the covariates <strong>with</strong> respect to the idiosyncratic errors, which leads to thewell-known fixed <strong>effects</strong> estimator and variants. With an <strong>unbalanced</strong> panel, the keyassumption is most easily stated asEu it |x i ,c i ,s i 0, t 1,...,T (2.2)where x i x i1 ,x i2 ,...,x iT and s i s i1 ,s i2 ,...,s iT . Assumption (2.2) implies that observinga data point in any time period cannot be systematically related to the idiosyncratic errors, u it .It is a version of strict exogeneity of selection (along <strong>with</strong> strict exogeneity of the covariates)conditional on c i . As a practical matter, (2.2) allows selection s it at time period t to bearbitrarily <strong>correlated</strong> <strong>with</strong> x i ,c i , that is, <strong>with</strong> the observable covariates and the unobservedheterogeneity. For later comparisons <strong>with</strong> nonlinear <strong>models</strong>, note that we can combine (2.1)and (2.2) asEy it |x i ,c i ,s i Ey it |x i ,c i x it c i , (2.3)which means we can start from an assumption about a conditional expectation involving theresponse variable, as is crurical for nonlinear <strong>models</strong>.It is well-known – see, for example, Verbeek and Nijman (1996), Hayashi (2001) andWooldridge (2002, Chapter 17) – that the fixed <strong>effects</strong> (<strong>with</strong>in) estimator on the <strong>unbalanced</strong>panel is generally consistent under (2.3), provided there is sufficient time variation in thecovariates and the selected sample is not “too small.” If selection in every time period isindependent of the covariates and idiosyncratic errors in every time period then we can get by<strong>with</strong> a zero correlation assumption between x ir and u it for all r,t 1,...,T in the population.Because we are interested in nonlinear <strong>models</strong>, we will use assumptions stated in terms of9


conditional means.One way to characterize the FE estimator on the <strong>unbalanced</strong> panel is to simply multiplyequation (2.1) through by the selection indicator to gets it y it s it x it s it c i s it u it , t 1,...,T, (2.4)and when we average this equation across t for each i we get−1 Twhere ȳ i T i ∑ r1ȳ i x̄ i c i ū i , t 1,...,T, (2.5)Ts ir y ir is the average of the selected observations and T i ∑ r1s ir is thenumber of time periods observed for unit i; the other averages in (2.5) are defined similarly. Ifwe now multiply (2.5) by s it and subtract from (2.4) we remove c i :s it y it − ȳ i s it x it − x̄ i s it u it − ū i . (2.6)Now we can apply pooled OLS to this equation to obtain the FE estimator on the <strong>unbalanced</strong>panel. It is straightforward to show that (2.2), along <strong>with</strong> a rank condition, is sufficient forconsistency.As a computational point that becomes more important in complicated <strong>models</strong>, note thatthe time averages of y it and x it are computed only for time periods where data exist on the fullset of variables x it ,y it . Consequently, there are often pairs i,t where we may observe someelements in x it ,y it but where the information on these variables is not used in estimation.In the balanced case, it has been known for some time – see Mundlak (1978) – that the FEestimator can be computed as a pooled OLS estimator using the original data, but adding thetime averages of the covariates as additional explanatory variables. Perhaps less well known isthat this algebraic result carries over to the <strong>unbalanced</strong> case. In particular, let−1 Tx̄ i T i ∑ r1s ir x ir be the average of the covariates over the time periods where we observe a10


full set of data on the covariates and response variables. Then, estimate the equationy it x it x̄ i v it (2.7)by pooled OLS using the s it 1 observations. The coefficient vector ̂ is identical to the fixed<strong>effects</strong> (<strong>with</strong>in) estimator on the <strong>unbalanced</strong> panel. Any aggregate time variables, includingtime dummies, should be part of x it , and their time averages must be included in x̄ i. The reasonis that, unlike in the balanced case, the time average of aggregate time variables changes acrossi because we average different time periods for different i.In addition, if we run any pooled regression of the formy it on 1,x it ,x̄ i,z i if s it 1, (2.8)where z i is any vector of time-constant variables, then ̂ is still the fixed <strong>effects</strong> estimate. Forexample, if we add the number of time periods, T i , or interactions of the form T i x̄ i (that is,the sums in addition to the averages), the estimated coefficients on x it are the fixed <strong>effects</strong>estimates. The same is true if we allow a different set of coefficients on x̄ i depending on T i .Ofcourse, we can also add variables such as gender in a wage equation.As other examples of z i , one can use x̄ i1 ,T i1 ,x̄ i2 ,T i2 ,...,x̄ iG ,T iG where we partition1,2,...,T into G groups and then compute the averages of the selection observations, x̄ ig ,and the total number of selected periods, T ig . Because we can get x̄ i as a linear combination ofx̄ i1 ,x̄ i2 ,...,x̄ iG , the pooled OLS estimator of is still the FE estimate. In other words,allowing for very general correlation between c i and the (selected) sequence of covariates inthe standard linear model produces an estimator that is commonly used, and is robust to anykind of correlation between c i and x i1 ,s i1 ,x i2 ,s i2 ,...,x iT ,s it . When we discuss <strong>models</strong><strong>with</strong> <strong>random</strong> slopes in the next section, and nonlinear <strong>models</strong> in Section 4, this point is useful11


ecause we will have to take <strong>models</strong> relating heterogeneity to the covariates and selectionmore seriously. Yet at least in the leading case, the estimator is not sensitive to thespecification of Ec i |x i ,z i .It is useful to have a general result that contains algebraic equivalances for pooled OLS aswell as <strong>random</strong> <strong>effects</strong>. Recall that for a model <strong>with</strong> response variable y it and covariatesx it ,z i ,wherez i contains unity and x it contains any aggregate time variables, the RE estimatorcan be obtained from the pooled OLS regressiony it − i ȳ i on x it − i x̄ i, 1 − i x̄ i, 1 − i z i if s it 1, (2.9)where i 1 − u 2 / u 2 T i c 2 1/2 is a function of T i and the variance parameters; see, forexample, Baltagi (2001, Section 9.2). (Of course, in practice, the variance parameters arereplaced <strong>with</strong> estimates, but that is unimportant for an algebraic equivalance.) For ourpurposes, all that matters is that pooled OLS <strong>with</strong> the time averages added ( i 0) and <strong>random</strong><strong>effects</strong> are special cases.Proposition 2.1: Consider pooled OLS regressions of the form in (2.9), where the time−1 Taverages are computed using the selected observations (so, for example, x̄ i T i ∑ r1s ir x ir ).Note that z i can include the intercept and x it any aggregate time variables. Let ̃ be the vector(K 1 of coefficients on x it − i x̄ i. Then ̃ ̂ FE, the fixed <strong>effects</strong> estimate on the <strong>unbalanced</strong>panel.Proof: The case i 1 for all i is obvious, because then the estimate ̃ is from the pooledregression y it − ȳ i on x it − x̄ i <strong>with</strong> s it 1 – and this defines the FE estimate on the <strong>unbalanced</strong>panel. To handle other cases, we assume that the appropriate matrices are invertible. Generally,the invertibility requirement holds under standard assumptions of time-variation in the x it 12


and no perfect collinearity when 0 ≤ i 1.First consider the case <strong>with</strong>out z i . Then ̃ can be obtained from the Frisch-Waugh theorem.First, regress x it − i x̄ i on 1 − i x̄ i (using the selected sample) and obtain the residuals, sayr̃ it . Then obtain ̃ from the pooled OLS regression (again on the selected sample) of y it − i ȳon r̃ it . The residuals r̃ it are simple to obtain. We can write them asr̃ it x it − i x̄ i − 1 − i x̄ ĩ (2.10)wherẽ N T∑ ∑i1 t1s it 1 − i 2 x̄ i′ x̄ iN∑ T i 1 − i 2 x̄ i′ x̄ ii1N∑ T i 1 − i 2 x̄ i′ x̄ ii1N∑ T i 1 − i 2 x̄ i′ x̄ ii1−1−1−1−1N T∑ ∑i1 t1N∑i1N T∑ ∑i1 t1s it 1 − i x̄ i′ x it − i x̄ iNs it 1 − i x̄ i′ x it − ∑ T i i 1 − i x̄ i′ x̄ ii1NT i 1 − i x̄ i′ x̄ i − ∑ T i i 1 − i x̄ i′ x̄ ii1N∑ T i 1 − i 2 x̄ i′ x̄ i I K .i1(2.11)It follows that r̃ it x it − i x̄ i − 1 − i x̄ i x it − x̄ i, which is simply the time-demeanedcovariates. Now we can writẽ N T∑ ∑i1 t1N T∑ ∑i1 t1s it x it − x̄ i ′ x it − x̄ is it x it − x̄ i ′ x it − x̄ i−1−1N T∑ ∑i1 t1N T∑ ∑i1 t1s it x it − x̄ i ′ y it − i ȳ i (2.12)s it x it − x̄ i ′ y itusing the fact ∑ Tt1s it x it − x̄ i ′ i ȳ i i ȳ i ∑ Tt1s it x it − x̄ i ′ 0 because x̄ i is the average overthe selected time periods. But this final formula is just ̂ FEon the selected sample.For the case <strong>with</strong> z i , we can apply the Frisch-Waugh theorem again to obtain the13


appropriate residuals. That is, now r̃ it are from the regression x it − i x̄ i on 1 − i x̄ i, 1 − i z i<strong>with</strong> s it 1. But now we partial out x it − i x̄ i from 1 − i x̄ i to get residuals q̃ it ,say,andwejust showed q̃ it x it − x̄ i. The other residuals we need are from 1 − i z i on 1 − i x̄ i <strong>with</strong>s it 1, and it is obvious that these, say ẽ i ,dependonlyoni. Sother̃ it are from x it − x̄ i on ẽ iacross i and t <strong>with</strong> s it 1, and because ∑ Tt1s it x it − x̄ i 0 for all i, it follows thatN T∑ ∑i1 t1s it ẽ i ′ q̃ it 0.This means r̃ it q̃ it x it − x̄ i, as before. Now the rest of the proof is the same. The conclusions of Proposition 2.1 verify the previous claims made for pooled OLS as wellas <strong>random</strong> <strong>effects</strong> on the <strong>unbalanced</strong> panel. A nice application of this algebraic equivalanceresult is a simple way to obtain regression-based, fully robust Hausman tests using <strong>unbalanced</strong><strong>panels</strong>. Write a model <strong>with</strong> time-constant variables z i asy it x it z i c i u it , t 1,...,T, (2.13)where, again, we use a data point if s it 1. Assume that z i includes a constant. If we use theMundlak equationy it x it x̄ i z i a i u it (2.14)and estimate this by, say, RE, we know from Proposition 2.1 that the estimate of is the FEestimate. Now the regression based Hausman test is just, say, a Wald test of H 0 : 0 afterRE estimation of the augmented equation. Any unit <strong>with</strong> T i 1 can be included in theestimation and testing, but, of course, if we only have units <strong>with</strong> T i 1 then x̄ i x it for thesingle time period t <strong>with</strong> s it 1; thus, and cannot be distinguished.We can also use Mundlak’s CRE formulation, whether the panel is balanced or not, to test14


a subset of coefficients in . For example, we might postulateEc i |x i ,z i Ec i |x̄ i,z i Ec i |x̄ i1 ,z i , (2.15)where x̄ i1 is x̄ i but <strong>with</strong> the first element, x̄ i1, removed. This allows us to test the possibilitythat, after controlling for x̄ i2,...,x̄ iK,z i , x it1 is exogenous <strong>with</strong> respect to c i (as well asu it ). The test is just a fully robust t test of H 0 : 1 0. A failure to reject provides somejustification for estimating the equationy it x it 2 x̄ i2 ... K x̄ iK z i c i u it (2.16)by <strong>random</strong> <strong>effects</strong> (on the <strong>unbalanced</strong> panel) – likely leading to a more precise, perhaps muchmore precise, estimate of 1 (the coefficient on x it1 ). Naturally, one should make inferencefully robust to heteroskedasticity in the composite error and serial correlation in u it .Using a t statistic on ̂ 1 is not the same, even asymptotically, as comparing ̂ FE,1 and ̂ RE,1via a one degree-of-freedom Hausman test. The latter maintains 0 under the null – that is,the RE estimator of 1 is generally inconsistent under (2.15) – whereas a t test of H 0 : 1 0is silent on other elements of . If 1 is the coefficient of primary interest, it may make moresense to test H 0 : 1 0, as it allows (partial) correlation between c i and x̄ i1 .Excluding x̄ i1 from (2.16) is in the spirit of imposing extra restrictions to estimate theparameters in (2.13). Hausman and Taylor (1981) use exogeneity assumptions on bothtime-constant and time-varying covariates in (2.13), mainly to identify elements of when thefull RE orthogonality conditions do not hold. In (2.16), we do not need to exclude x̄ i1 in orderto identify 1 (because x it1 has some time variation), but doing so may result in an estimateof 1 <strong>with</strong> more precision.Insummary,thissectionhasshownthatevenifwemakeaverystrongassumptioninthe15


<strong>unbalanced</strong> case, namely, c i x̄ i a i , Ea i |s it ,s it x it : t 1,...,T 0, theresulting estimator – pooled OLS or RE – is identical to an estimator, fixed <strong>effects</strong>, that puts norestrictions on Ec i |s it ,s it x it : t 1,...,T. This extension of the usual Mundlak (1978)result for the balanced case suggests that in <strong>models</strong> <strong>with</strong> more complicated functional formsand heterogeneity, simple <strong>models</strong> of Ec i |s it ,s it x it : t 1,...,T may work reasonablywell.3. Linear Models <strong>with</strong> Correlated RandomSlopesIf we start <strong>with</strong> a model that has individual-specific slopes, the presence of <strong>unbalanced</strong><strong>panels</strong> is more difficult to treat. Wooldridge (2005) shows that using fixed <strong>effects</strong> in a linearmodel where the <strong>random</strong> slopes are ignored has some robustness properties for estimating theaverage effect. But those findings do not carry over to <strong>unbalanced</strong> <strong>panels</strong> where selection maybe <strong>correlated</strong> <strong>with</strong> heterogeneity: the slope heterogeneity becomes part of the error term, andcorrelation between selection and the heterogeneity generally causes inconsistency.To see how to handle <strong>unbalanced</strong> <strong>panels</strong>, state the model asEy it |x i ,a i ,b i a i x it b i , (3.1)so, in the population, x it : t 1,...,T is strictly exogenous conditional on a i ,b i .Definea i c i , b i d i and writey it x it c i x it d i u it (3.2)where Eu it |x i ,a i ,b i Eu it |x i ,c i ,d i for all t. We also assume that selection may be relatedto x i ,a i ,b i but not the idiosyncratic shocks:16


Ey it |x i ,a i ,b i ,s i Ey it |x i ,a i ,b i (3.3)orEu it |x i ,a i ,b i ,s i 0, t 1,...,T. (3.4)This is an obvious extension of assumption (2.3).In what follows, we assume that we do not want to select elements of b i that are allowed tochange <strong>with</strong> i. If we had only a few such elements, and a sufficient number of time periods, wecould proceed by eliminating those elements of b i via a generalized <strong>with</strong>in transformation andthen proceed <strong>with</strong> estimation of the constant slopes. Such an approach would be the<strong>unbalanced</strong> version of the methods described by Wooldridge (2002, Chapter 11). Thisapproach is attractive in specific instances, but it cannot be used in general (and not at all fornonlinear <strong>models</strong>).To study estimation on an <strong>unbalanced</strong> panel, multiply (3.2) through by the selectionindicator:s i y it s it s it x it s it c i s it x it d i s it u it (3.5)Now, because we only use observations <strong>with</strong> s it 1, we handle the presence of intercept andslope heterogeneity by conditioning on the entire history of selection and the values of thecovariates if selected. That is, we condition on s it ,s it x it : t 1,...,T. Ifs it 0theobservation is not used; if s it 1 the observation is used and we observe x it . It might seembetter to condition on x i1 ,s i1 ,x i2 ,s i2 ,...,x iT ,s iT , but then if the heterogeneity dependsonly on the history of covariates, we would be left <strong>with</strong> an equation that is not estimable unlessthe covariates are always observed. We want to be able to handle cases where the covariatesare missing, too, as happens when units are simply not observed at all for some time periods.17


Therefore, to obtain a true estimating equation, we condition on s it ,s it x it : t 1,...,T.For notational simplicity, write h i ≡ h it : t 1,...,T ≡ s it ,s it x it : t 1,...,T.Then, extending Mundlak (1978) and Chamberlain (1982, 1984), we work <strong>with</strong>Es i y it |h i s it s it x it s it Ec i |h i s it x it Ed i |h i (3.6)and then make assumptions concerning Ec i |h i and Ed i |h i . Actually, because we caneliminate c i using the <strong>with</strong>in transformation, we could just focus on Ed i |h i . However,assuming we know <strong>models</strong> for Ed i |h i but not for Ec i |h i is somewhat arbitrary, and so wefirst consider the case where we model all expectations.At this point it is useful to point out that if a i ,b i are assumed to be independent (or, atleast, mean independent) of x i1 ,s i1 ,x i2 ,s i2 ,...,x iT ,s iT – an assumption often implicit in<strong>random</strong> coefficient frameworks – then the issue of how to model Ec i |h i and Ed i |h i disappears. The term s it Ec i |h i s it x it Ed i |h i would be identically zero, which means wewould be left <strong>with</strong> Es i y it |h i s it s it x it . Pooled estimation using the selected sample orgeneralized least squares methods can be applied.A simple approach to allowing Ea i ,b i |h i to depend on h i is to model the expectations asexchangeable functions of h it : t 1,...,T. In the balanced panel case, this approach wassuggested by Altonji and Matzkin (2005) in a fully nonparametric setting. The leadingexamples of exchangeable functions are sums (or averages). In keeping <strong>with</strong> the motivationfrom Section 2, we might choosew i ≡ T i ,x̄ i (3.7)as the exchangeable functions satisfyingEc i |h i Ec i |w i , Ed i |h i Ed i |w i . (3.8)18


Further, if we assume that these expectations depend only on x̄ i, and in a linear fashion, wehaveEc i |h i x̄ i − x̄ iEd i |h i x̄ i − x̄ i⊗ I K ,(3.9)(3.10)where x̄ i Ex̄ i is subtracted from x̄ i to ensure the zero unconditional means of c i and d i .Ifwe insert these expectations into (3.6) we obtainEs i y it |h i s it s it x it s it x̄ i − x̄ i s it x̄ i − x̄ i⊗ x it , (3.11)which is an equation <strong>with</strong> the time averages and each time average interacted <strong>with</strong> eachtime-varying covariate. It is now obvious that we can use pooled OLS on the selected sampleto consistently estimate , (the main vector of interest), , and. We can even use, say,<strong>random</strong> <strong>effects</strong> estimation (adjusted, of course, for the <strong>unbalanced</strong> panel), but inference shouldbe made robust to arbitrary heteroskedasticity and serial correlation. As a practical matter, wereplace x̄ i<strong>with</strong> ̂ x̄ i N −1 ∑ Ni1x̄ i as a consistent estimator of x̄ i.Notice that ̂ x̄ iis consistent−1 Tfor the quantity we need, which is the expected value of x̄ i T i ∑ r1s ir x ir .If we drop the set of interactions x̄ i − x̄ i⊗ x it , we know from Section 2 the resultingestimator would be the FE estimator on the <strong>unbalanced</strong> panel. This suggests a simple test forwhether we need to further consider correlation of selection and the <strong>random</strong> slopes. Estimatethe equationy it x it x̄ i − x̄ i⊗ x it a i u it (3.12)by fixed <strong>effects</strong>, so that a i is removed <strong>with</strong>out imposing any assumptions on its conditionaldistribution. If we cannot reject H 0 : 0, we might ignore the possibility of <strong>random</strong> slopesand just use standard FE estimation on the <strong>unbalanced</strong> panel.19


If we conclude that we need to account for the <strong>random</strong> slopes, and that selection might be<strong>correlated</strong> <strong>with</strong> the slopes, the assumptions in (3.9) and (3.10) might be too restrictive. For one,they assume that T i does not directly appear in Ec i ,d i |h i . Second, since x̄ i is an average usingT i elements, it is possible the coefficients change <strong>with</strong> T i . (This certainly would be the caseunder joint normality given any sequence of selection indicators <strong>with</strong> sum T i .) We can allowan unrestricted set of slopes by extending the earlier assumption toTEc i |h i Ec i |T i ,x̄ i ∑r1T r 1T i r − r ∑ 1T i rx̄ i − r rr1TTEd i |h i Ed i |T i ,x̄ i ∑1T i r − r r ∑r1r11T i rx̄ i − r ⊗ I K r,(3.13)(3.14)where the rare the expected values of x̄ i given r time periods observed and r is the fractionof observations <strong>with</strong> r time periods: r Ex̄ i|T i r, r E1T i r (3.15)As a practical matter, the formulation in (3.13) and (3.14) is identical to running separateregressions for each T i :y it on 1, x it , x̄ i, x̄ i − ̂ r ⊗ x it ,fors it 1 (3.16)where ̂ r N r−1∑ N i11T i rx̄ i and N r is the number of observations <strong>with</strong> T i r. Thecoefficient on x it , ̂ r ,istheAPEgivenT i r. We can average these across r to obtain theoverall APE. There is, however, a cost in allowing the flexibility in (3.13) and (3.14): wecannot identify an APE for T i 1 unless we set the coefficients on x̄ i and x̄ i − r ⊗ x it equalto zero. So, we could just exclude the T i 1 observations from the APE calculations, or wecan impose restrictions that we did previously. (The same issue arises if we use fixed <strong>effects</strong>20


estimation to obtain a different ̂ r for each T i: we must exclude the T i 1 subsample.) Underthe assumption that s it ,x it : t 1,...,T is independent and identically distributed, thecoefficients in (3.14) and (3.14) are linear functions of T i , and such a restriction means we canuse the T i 1 observations.A special case of the previous model is the so-called <strong>random</strong> trend model, where x itincludes (in the simplest case) t, so that each unit has its own linear trend. Then, we might wantto allow the <strong>random</strong> trend to be <strong>correlated</strong> <strong>with</strong> features of s it ,s it x it : t 1,...,T otherthan the average and number of observed time periods. For example, for each i we could“estimate” unit-specific intercept and trend coefficient by running regressionss ir x it on s it , s it t, t 1,...,T, (3.17)and then allow these to be <strong>correlated</strong> <strong>with</strong> c i and d i . We omit the details so that we can moveon to nonlinear <strong>models</strong>.As a final comment before we turn to nonlinear <strong>models</strong>, note that in any of the previousformulations we have simple tests of dynamic selection bias available. We have assumed thatour model for Ec i ,d i |s it ,s it x it : t 1,...,T captures how heterogeneity depends on entiresequence of selection and the sequence of selected covariates. Therefore, under the ignorabilityassumption (3.3), no other functions of s it ,s it x it : t 1,...,T should appear in Es i y it |h i .Of course, we cannot include s it as an explanatory variable at time t because we only use data<strong>with</strong> t 1. But we can use lagged and lead values. A simple and possibly revealing test is toadd as extra regressors at time t the variables s i,t1 ,s i,t1 x i,t1 . We can compute a fully robust(to serial correlation and heteroskedasticity) Wald test of the null hypothesis that (3.3) holdsand that our model for Ec i ,d i |h i is correct as the joint significance test of s i,t1 ,s i,t1 x i,t1 . (Ifwe want more of a pure test for selection bias, we can include just s i,t1 and just use a robust t21


statistic.) If we reject the null we might have a selection problem where being in the sample attime t 1 is <strong>correlated</strong> <strong>with</strong> shocks to y at time t, thatis,s i,t1 is <strong>correlated</strong> <strong>with</strong> u it . (In carryingout this test, our hope is that the time-constant parts of the error term are un<strong>correlated</strong> <strong>with</strong> theentire history, s it ,s it x it : t 1,...,T, as is the case when we have properly modeledEc i ,d i |h i .)4. A Modeling Approach for NonlinearModelsWe can apply the general approach for linear <strong>models</strong> <strong>with</strong> <strong>random</strong> slopes to generalnonlinear <strong>models</strong>, although in some cases we have to work in terms of conditional distributionsrather than conditional means. We assume that interest lies in the distributionDy it |x it ,c i (4.1)where, in generaly, y it can be a vector and x it is a set of observed conditioning variables. Inthis section,we denote the vector of heterogeneity by c i . We also restrict attention in this paperto strictly exogenous covariates, so that we impose the substantive restrictionDy it |x i ,c i Dy it |x it ,c i (4.2)where, again, x i x i1 ,x i2 ,...,x iT is the entire history of covariates (whether or not the entirehistory is observed). We assume that we have specified, for each t, a density for Dy it |x it ,c i ,which we write <strong>with</strong> dummy arguments as g t y t |x t ,c;, where is a set of finite dimensionalparameters. Here we focus on the case of specifying marginal distributions for each t, ratherthan a joint distribution. Pooled methods are generally more robust because they do not restrictthe (conditional) independence over time. Plus, as discussed in Wooldridge (2002), the average22


partial <strong>effects</strong> are generally identified by pooled estimation methods, and computationally theyare relatively simple.Given the strict exogeneity assumption, selection is assumed to be ignorable conditonal onx i ,c i :Dy it |x i ,c i ,s i Dy it |x it ,c i , t 1,...,T. (4.3)As in the case of linear <strong>models</strong>, this assumption allows selection to be abitrarily <strong>correlated</strong> <strong>with</strong>x i ,c i but not generally <strong>with</strong> “shocks” to y it .As in the case of the linear model, our <strong>correlated</strong> <strong>random</strong> <strong>effects</strong> approach will be tospecify a model forDc i |s it ,s it x it : t 1,...,T. (4.4)Let w i be a vector of known functions of s it ,s it x it : t 1,...,T that act as sufficientstatistics, so thatDc i |s it ,s it x it : t 1,...,T Dc i |w i , (4.5)just as in Section 3.Now, because Dy it |x it ,c i ,s it 1 Dy it |x it ,c i , it follows that the density of y it givens it ,s it x it ,c i is g t y t |s it x it ,c i ; when s it 1. As we are only using data <strong>with</strong> s it 1, this isenough to construct the density used in estimation: that conditional ons it ,s it x it : t 1,...,T. Lethc|w i ; be a parametric density for Dc i |w i . Then the densitywe need (again for s it 1) isf t y t |x it ,w i ;, RM g ty t |x it ,c;hc|w i ;dc (4.6)where M is the dimension of c i , and this is obtainable (at least in principle) given <strong>models</strong>23


g t y t |x t ,c; and hc|w;. In effect, the same calculations used to “integrate out” unobservedheterogeneity in the balanced case can be used here, too.For each i, a partial log-likelihood function (abusing notation by not distinguishing the trueparameters from the dummy arguments) isT∑ s it logf t y it |x it ,w i ;,. (4.8)t1The true values of the parameters maximize Elogf t y it |x it ,w i ;, given s it 1,andsothepartial MLE generally identifies and . The partial log likelihood for the full sample isN T∑ ∑i1 t1s it logf t y it |x it ,w i ;,, (4.9)andinleadingcases–aswewillseeinSection7–thepartialloglikelihood for the<strong>unbalanced</strong> case is simple to compute. Further, the large-N, fixed-T asymptotics isstraightforward. We do not provide regularity conditions here because in most CREapplications the log likelihoods are very smooth.In general, inference needs to be made robust to the serial dependence in the scores from(4.9). Let be the vector of all parameters, and assume identification holds along <strong>with</strong>regularity conditions. Further, define the scores and Hessians asThenr it ∇ logf t y it |x it ,w i ; ′H it ∇ 2 logf t y it |x it ,w i ;whereN ̂ − d → Normal0,A −1 BA −1 24


A −ET∑ s it H it t1B VarT∑t1s it r it ET∑ s it r it t1T∑ s it r it t1′Notice that the definition of B allows correlation across the scores for different time periodsEstimators of these matrices are standard: we can replace the expectation <strong>with</strong> an averageacross i and replace <strong>with</strong> ̂. In many cases we can replace H it <strong>with</strong> its expectationconditional on w i .Before we turn to estimating quantities of interest, it is useful to know that essentially thesame arguments carry through for estimating conditional means. That is, if we start <strong>with</strong>Ey it |x it ,c i m t x it ,c i as the object of interest, then we can obtain Ey it |x it ,w i byintegrating m t x it ,c i <strong>with</strong> respect to the density of c i given w i (which, again, is valid for themean when s it 1). Of course, the resulting conditional mean will generally depend on the<strong>models</strong> m t x t ,c and hc|w being correctly specified, but if we assume correct specification,we can use a variety of pooled quasi-MLEs for estimation. For example, if y it is a fractionalresponse, we can use the Bernoulli quasi-log likelihood (QLL); if y it is nonnegative, such as acount variable, we can use the Poisson QLL. We will cover some examples in Section 6.5. Estimating Average Partial EffectsIn most nonlinear <strong>models</strong>, the parameters appearing in f t y t |x t ,c; provide only part ofthe story for the effect of x t on y t . The presence of heterogeneity usually means that theelements of can, at best, provide directions and relative magnitudes of <strong>effects</strong>. Fortunately,25


generally for the setup described in Section 4 we have enough information to identify andestimate partial <strong>effects</strong> <strong>with</strong> the hetereogeneity averaged out.We follow Blundell and Powell (2003) and define the average structural function (ASF)for a scalar response, y t .LetEy it |x it ,c i m t x it ,c i be the mean function. ThenASFx t E ci m t x t ,c i (5.1)is the conditional mean function (as a function of the dummy argument, x t ) <strong>with</strong> theheterogeneity, c i , averaged out. Given the ASF, we can compute partial derivatives, or discretechanges, <strong>with</strong> respect to the elements of x t . As discussed in Imbens and Wooldridge (2007),this generally produces the average partial <strong>effects</strong> (APEs), that is, the partial derivatives (orchanges) <strong>with</strong> the heterogeneity averaged out. Fortunately, the ASF (and, therefore, APEs) areoften easy to obtain.Let q t x t ,w; denote the mean associated <strong>with</strong> f t y t |x t ,w;. Then, as discussed inWooldridge (2002) for the balanced case,ASFx t E wi q t x t ,w i ;, (5.2)that is, we can obtain the ASF by averaging out the observed vector of sufficient statistics, w i ,from Ey it |x t ,w i ,s it 1 rather than averaging out c i from Ey it |x t ,c i . In leading cases, wehave direct estimates of q t x t ,w;, in which case we have a simple, consistent estimator ofASFx t :NASFx t N −1 ∑ q t x t ,w i ;̂ (5.3)i1We can use this expression to obtain APEs by taking derivatives or changes <strong>with</strong> respect toelements of x t , for example,26


NAPE tj x t N −1 ∑i1∂q t x t ,w i ;̂∂x tj(5.4)Standard errors of such quantities can be difficult to obtain by the deta method, but the panelbootstrap – where resampling is done in the cross section dimension – is straightforward.further, because we are using pooled methods, the bootstrap is usually quite tractablecomputationally.As a general approach to flexibly estimating APEs, we might choose to estimate a separatemodel for every possible value of T i (except T i 1, which is ruled out by most standard<strong>models</strong> and choices of Dc i |w i . Suppose, for example, that we choose w i T i ,x̄ i. Inpractice, the structure of the density f t y t |x it ,T i ,x̄ i; is the same across all values of T i .(Wewill see examples in the next section.) Thus, for r 2,...,T, weestimateavectorofparameters, ̂ r, using pooled MLE or QMLE <strong>with</strong> T i r. We can then estimate the ASF foreach t asN TASFx t N −1 ∑ ∑i1 r21T i rq t x t ,x̄ i;̂ r (5.5)where q t x t ,x̄ i;̂ r is the estimated conditional mean function using the T i r observations.Of course, because (5.5) does not include the T i 1 observations, the estimated APEsnecessarily exclude that part of the population. Perhaps this is as it should be because, just aswe saw in Section 4 for the linear model <strong>with</strong> <strong>random</strong> coefficients, the T i 1 observationscannot be used to estimate the coefficients. In some cases, though, we may wish to imposeenough constant coefficients in Dy it |x it ,c i so that the T i 1 observations are helpful andused in estimation.With an <strong>unbalanced</strong> panel, there is a somewhat subtle point about computing a single27


average partial effect from the ASF. With a balanced panel, it is natural to average APE tj x t across the distribution of x it , and then possibly across t, too. With a <strong>random</strong> sample in eachtime period, this averaging is straightforward. But if selection s it depends on x it , averagingacross the selected sample does not consistently estimate E xit APE tj x it . Presumably we stillhave an idea of useful values to plug in for x t , but generally estimating specific features of thedistribution of x it can be difficult. We might have to be satisfied <strong>with</strong> computing the average ofAPE tj x it for the selected sample, that is, Es it APE tj x it .6. ExamplesWe now consider a few examples to show how simply the proposed methods apply tostandard <strong>models</strong>. We begin <strong>with</strong> the unobserved <strong>effects</strong> probit model <strong>with</strong>out restricting serialdependence. The model <strong>with</strong> a single source of heterogeneity and strictly exogenous covariatesisPy it 1|x i ,c i Py it 1|x it ,c i x it c i , t 1,...,T (6.1)where x it can include time dummies or other aggregate time variables. Once we specify (6.1)and assume that selection is conditionally ignorable for all t, thatis,Py it 1|x i ,c i ,s i Py it 1|x i ,c i , (6.2)allthatisleftistospecifyamodelforDc i |w i for suitably chosen functions w i ofs it ,s it x it : t 1,...,T. As in the linear case, it makes sense to at least initially chooseexchangeable functions that extend the usual choices in the balanced case. For example, wecan allow Ec i |w i to be a linear function of the time averages <strong>with</strong> different coefficients for28


each number of periods:TEc i |w i ∑r1T r 1T i r ∑ 1T i r x̄ i r(6.3)r1Thus, we either have to be content <strong>with</strong> estimating the APE over the subpopulation <strong>with</strong> T i ≥ 2or imposing more restrictions, such as linear functions in T i in (6.3). At a minimum we shouldallow the variance of c i to change <strong>with</strong> T i ; a simple yet flexible specification isVarc i |w i expT−1 ∑ 1T i r r (6.4)r1where is the variance for the base group, T i T, and the each r is the deviation from thebase group. If we also maintain that Dc i |w i is normal, then we obtain the following responseprobability for s it 1:Py it 1|x it ,w i ,s it 1 x Tit ∑ r2T r 1T i r ∑ r21T i r x̄ i r(6.5)1 exp ∑ T−1 1/2r11T i r rIn the case of the usual model <strong>with</strong> balanced data, the r are all zero, and then only thecoefficients scaled by 1 exp 1/2 are identified. Fortunately, it is exactly these scaledcoefficients that determine the average partial <strong>effects</strong>. A convenient reparameterization isPy it 1|x it ,w i x Tit ∑ r1T r 1T i r ∑ r11T i r x̄ i r(6.6)T1/2exp ∑ r21T i r rso that the denominator is unity when all r are zero. As an additional bonus, the formulationin (6.6) is directly estimable by so-called “heteroskedastic probit” software, where theexplanatory variables at time t are 1,x it ,1T i 2 x̄ i,...,1T i T x̄ i and the explanatoryvariables in the variance are simply the dummy variables 1T i 2,...,1T i T − 1.29


With the estimating equation specified as in (6.6), the average structural function is fairlystraightforward to estimate:NASFx t N −1 ∑ i1x Tt̂ ∑ r1T̂ r1T i r ∑ r1Texp ∑ r21T i r̂ r1T i r x̄ î r(6.7)1/2where the coefficients <strong>with</strong> “^” are from the pooled heteroskedastic probit estimation. Noticehow the functions of T i ,x̄ i are averaged out, leaving the result a function of x t .If,say,x tj iscontinuous, its APE is estimated aŝ jNN −1 ∑ i1Tx t ̂ ∑ r1T̂ r1T i r ∑ r1Texp ∑ r21T i r̂ r1T i r x̄ î r(6.8)1/2where is the standard normal pdf. This is still a function of x t . Notice that in thecontinuous or discrete case, ̂ j provides the direction of the effect, but the magnitude of theeffect is considerably more complicated (and generally a function of x t , of course). Theparameters of the model for Dc i |w i appear directly in the ASF and APEs, and so they cannotbe considered “nuisance” or “incidental” parameters.The above procedure applies, <strong>with</strong>out change, if y it is a fractional response; that is,0 ≤ y it ≤ 1. Then, we interpret the orginal model as Ey it |x it ,c i x it c i ,andthenpartial <strong>effects</strong> are on the mean response. As is well known – for example, Gourieroux,Monfort, and Trognon (1984) – the Bernoulli log likelihood is in the linear exponential family,and so it identifies the parameters of a correctly specified conditional mean. Under theassumptions given, we have the correct functional form for Ey it |x it ,w i ,s it 1.We can easily add the interactions 1T i r x̄ i to the variance function for addedflexibility; if we maintain conditional normality of the heterogeneity, we are still left <strong>with</strong> an30


estimating equation of the heteroskedastict probit form. As in (6.7) and (6.8), those extrafunctions of T i ,x̄ i get averaged out in computing APEs.The normality assumption, as well as specific functional forms for the mean and variance,might seem restrictive. An important practical point is that, once we know the APEs areidentified by averaging w i out of q t x t ,w i , Em t x t ,c i |w i , we are free to use any numberof approximations to the true distribution. For example, one could use a logit functional formrather than probit – even though that particular response probability cannot be easily derivedfrom an underlying model for Ey it |x it ,c i .Perhaps more useful is extending the functional form inside the probit function. Becausewe probably should allow different coefficients for each T i , the notation gets complicated, butwe can add interactions of the form1T i rx̄ i ⊗ x it . (6.9)This is in the spirt of allowing <strong>random</strong> slopes on x it in the orginal probit specification, but thisparticular estimating equation would not be easily derivable from such a model. Instead, as inBlundell and Powell (2003), it recognizes that quantities of interest can be obtained <strong>with</strong>outeven specifying a particular model for Ey it |x it ,c i . We could more formally take asemiparametric approach and assume, say, that Dc i |w i depends on a linear index in1T i 2,...,1T i T,1T i 2 x̄ i,...,1T i T x̄ i. Then, we can modify the Blundelland Powell (2003) approach for cross section data <strong>with</strong> endogenous explanatory variables forthe current panel data setting.Other approaches to estimation are possible. For example, the key ignorability of selectionassumption justifies estimation on any balanced subset of data. So we could only useobservations <strong>with</strong> T i T, say, and then apply the usual CRE methods for the balanced case;31


see Wooldridge (2002) and Imbens and Wooldridge (2007). This is attractive in situationswhere the vast majority of observations have a full set of time periods. (Technically, wereplace the selection indicator, s it , in (4.9) <strong>with</strong> the product, s i1 s i2 s iT to pick outobservations <strong>with</strong> a full set of time periods.) We can also pick out, say, pairs of observations;andsoon.An estimation approach that may be more efficient than just pooling is minimum distanceestimation. In the current setting, we can estimate a different set of parameters, including for ,for each T i 2,...,T, and then impose the restrictions across r using minimum distance. Aless efficient but more flexible approach would be to estimate a standard CRE probit model foreach T i 2,...,T (allowing the variance to change only <strong>with</strong> T i and not x̄ i). This would giveus (implicitly scaled) coefficients ̂ ,̂ r r,̂ r for each r. We can easily compute the ASFsconditional on each r asNASF r x t N r−1 ∑ i11T i rx t ̂ r ̂ r x̄ î r, (6.10)where N r is the number of i <strong>with</strong> T i r. Thes weighted average of these across r is an estimateof the overall ASF for each t (again, as a function of x t ):N TASFx t N −1 ∑ ∑i1 r21T i rx t ̂ r ̂ r x̄ î r, (6.11)Allowing heteroskedasticity as a function of x̄ i is almost trivial because for each T i we can useheteroskedastic probit to estimate the coefficients. ThenN TASFx t N −1 ∑ ∑i1 r21T i rx t̂ r ̂ r x̄ î rexpx̄ î r/2, (6.12)where ̂ r is the vector of variance parameters. Even though estimates for each r may not be32


especially precise,.averaging across all of the estimates can lead to precise estimates of theASF. For even more flexibility we can add interactions x̄ i ⊗ x it in the estimation for each r,<strong>with</strong> or <strong>with</strong>out heteroskedasticity. In the more general case, the ASF is then estimated asN TASFx t N −1 ∑ ∑i1 r21T i rx t̂ r ̂ r x̄ î r x̄ i ⊗ x t ̂ rexpx̄ î r/2, (6.13)so that partial <strong>effects</strong> <strong>with</strong> respect to the elements of x t need to account for the interactions<strong>with</strong> the time averages, x̄ i. The specification underlying (6.13) is very quite flexible, but it doesmean the T i 1 observations are dropped for computing the APEs. If we assumedindependent, identically distributed s it ,x it , Ec i |T i ,x̄ i T i x̄ i andVarc i |T i ,x̄ i exp 0 1 logT i . Thus, we could use interactions <strong>with</strong> T i and x̄ i in the“mean” part of the probit and simply logT i in the variance part, possibly also adding x̄ i foradditional flexibility. Of course, this allows us to include the T i 1 observations at the cost ofmore assumptions.As discussed in Sections 3 and 4 for linear <strong>models</strong>, we can easily relax the restriction thatDc i |s it ,s it x it : t 1,...,T depends only on T i ,x̄ i. We can use sample variances andcovariances, individual-specific trends, or break the time period into intervals and use averagesover those intervals.As in the linear case, we can easily test for dynamic forms of selection bias by including,say, s i,t1 and s i,t1 x i,t1 in any of the previous estimations and conducting a robust, joint test.Everything just covered for the probit (or fractional probit) case extends to, say, theordered probit case. Further, one can use some new strategies for handling multinomialresponses that are computationally simple. Let y it be an (unordered) multinomial response.Then, rather than specifying Dy it |x it ,c i to have any specify form, we can move directly to33


specifications for Dy it |x it ,w i ,s it 1 for w i the chosen sufficient statistics ofs it ,s it x it : t 1,...,T. For example, we might just estimate multinomial logit forDy it |x it ,w i ,s it 1, or nested logit, or some other relatively simple model. Then, the APEs –in this case, for the response probabilities – are obtained by averaging out w i when it is alldone. In fact, the multinomial quasi-MLE can be applied when the y it are shares summing toone, again relying on Gourieroux, Monfort, and Trognon (1984).7. A Proposal for Goodness of FitAn issue that arises in comparing across <strong>models</strong> <strong>with</strong> unobserved heterogeneity is how onemeasures goodness of fit. Measuring fit is further complicated when different estimationmethods are used. For example, suppose that y it is a fractional response, and we want tocompare a linear model – <strong>with</strong> just a single, additive heterogeneity – to a fractional responsemodel, also <strong>with</strong> a single source of hetereogeneity. If the linear model is estimated by fixed<strong>effects</strong> and the fractional model using the methods proposed in Section 6, it is not clear howone can determine which model fits best, and whether the functional form and distributionalassumptions imposed in the fractional case are contributing to a poor fit.By recognizing that the linear fixed <strong>effects</strong> estimator is actually a CRE estimator, we canconsider the goodness-of-fit problem in a unified setting. Suppose initially that there is nomissing data problem and let w i be the functions of x i1 ,x i2 ,...,x iT such thatDc i|x i1 ,x i2 ,...,x iT Dc i|w i . The partial MLE approach implies densities for theconditional distributions Dy it |x i Dy it |x it ,w i , which we denoted f t y t |x t ,w;,. Thus,tocompare fit across <strong>models</strong> where the densities f t y t |x t ,w;, are implied, we can use the value34


of the partial log likelihood,N T∑ ∑i1 t1logf t y it |x it ,w i ;̂,̂. (7.1)The density f t y t |x t ,w;, is obtained from the densities g t y t |x t ,c; and hc|w; –including the choice of w – and so the goodness-of-fit based on the log likelihood depends onthe choice of g t y t |x t ,c; and hc|w;, including which functions of x i1 ,...,x iT are allowedto be related to c i .Different choices can be compared. Naturally, we can add penalties to thelog likelihood for the number of overall parameters, as is done in <strong>with</strong> the Bayesian andAkaike information criteria.Extending this approach to the <strong>unbalanced</strong> case is simple. The partial log likelihood is, ofcourse, evaluated for the observed sample, which means inserting an s it into (7.1). Of course,w i is now a function of s it ,s it x it : t 1,...,T, suchasT i ,x̄ i.In many cases we do not want to specify an entire conditional density f t y t |x t ,w;,. Evenif we do, we might be directly interested in the fit of the mean – something we can compareacross <strong>models</strong> where we may or may not specify a full conditional distribution. As in Section5, let q t x it ,w i be Ey it |x it ,w i ,s it 1. Then we can compute a sum of squared residuals onthe <strong>unbalanced</strong> panel (<strong>with</strong> a balanced panel being as special case) asN T∑ ∑i1 t1s it y it − q t x it ,w i ;̂,̂ 2 , (7.2)Again, this measure of fit is comparable across <strong>models</strong> that differ in Ey it |x it ,c i and Dc i |w i ,including the choice of w i . We can compare, say, a linear model estimated by fixed <strong>effects</strong> to aCRE fractional response model by using35


for the linear case and, say,T−1q t x it ,w i ;, x it x̄ i ∑ 1T i r rr2x Tit ∑ r2T r 1T i r ∑ r21T i r x̄ i r1 exp ∑ T−1 1/2r11T i r rfor the nonlinear case. We can again use penalties for number of parameters if desired.36


8. Concluding RemarksI have offered some simple strategies for allowing <strong>unbalanced</strong> <strong>panels</strong> in <strong>correlated</strong> <strong>random</strong><strong>effects</strong> <strong>models</strong>. Hopefully these methods make applying CRE <strong>models</strong> to panel data setscollected in practice somewhat easier. The nature of the approach – which extends thebalanced case – is the need to model Dc i |s it ,s it x it : t 1,...,T in terms of a set ofsufficient statistics, w i . I have focused on parametric approximations, but the generalapproaches of Altonji and Matzkin (2005) and Blundell and Powell (2003, 2004)A general charge leveled at parametric CRE approaches is that having to model Dc i |w i means that we may face logical inconsistencies when we think of adding another time period tothe data set. for example, the restrictions on the covariate (and, in this case, the selectionprocess) such that Dc i |w i is normal for any T are quite strong. Technically, this is a validcriticism, and it motivates pursuing the nonparametric and semiparametric approaches citedabove. Still, empirical researchers ignore essentially the same logical inconsistencies on a dailybasis. Whenever, say, a new covariate is added to a probit model, the new model cannot be aprobit model if the original model was, unless the new covariate is essentially normallydistributed.One we focus on average partial <strong>effects</strong>, we can think of the original specificationsDy it |x it ,c i and Dc i |w i as convenient ways to obtain estimable distributionsDy it |x it ,w i ,s it 1. We can choose, say, Dy it |x it ,c i in ways that are more flexible thanallowed by either fixed <strong>effects</strong> approaches that treat c i as parameters to estimate or conditionalMLE approaches that try to eliminate c i . For example, of we start <strong>with</strong> a probit modelPy it 1|x it ,c i a i x it b i , the CMLE approach does not apply, and to treat a i ,b i asparameters to estimate requires T i ≥ K 1; in practice, T i should be much larger than K 1, or37


the incidental parameters problem will be severe. By contrast, a CRE approach – whilecomputationally intensive if we carry through <strong>with</strong> the <strong>random</strong> slopes probit model – istractable, and has known large-N properties if we properly model Dc i |w i . Moreoever, if wesimply start <strong>with</strong> very flexible <strong>models</strong> for Py it 1|x it ,w i ,s it 1 –asdiscussedinSection6– and then average out w i , we can approximate the APEs. How well we do requires a fairlysophisticated simulation study.I have focused on pooled estimation methods. These are simple but may be inefficient. Imentioned the possibility of minimum distance estimation when we have restrictions imposedacross the different values of T i . But other possibilities suggest themselves. For example, wemight restrict the joint distribution Dy i1 ,...,y iT |x i ,c i , as is common in pure <strong>random</strong> <strong>effects</strong> orCRE approaches <strong>with</strong> balanced panel. (Usually independence is assumed conditional onx i ,c i .) Of course, such methods are generally more difficult computationally, but the generalapproach should carry through <strong>with</strong> the unconfoundedness assumptionDy i1 ,...,y iT |x i ,c i ,s i Dy i1 ,...,y iT |x i ,c i ,s i .Another possibility, which is more directly applicable, is to use a generalized estimatingequation (GEE) approach. GEE is effectively a multivariate weighted nonlinear least squaresmethod, where the marginal means are assumed to be correctly specified but the jointdistribution is unrestricted – as in the pooled case. But GEE attempts to exploit the neglectedcorrelation over time by specifying simple correlation patters. In the case of a binary orfractional response, specifications such as (6.6) <strong>with</strong> the r 0 are straightforward to handleusing GEE software. See Imbens and Wooldridge (2007) or Papke and Wooldridge (2008) forfurther discussion <strong>with</strong> balanced <strong>panels</strong>.The assumption of strictly exogenous covariates is strong and needs to be relaxed. Of38


course, relaxing strict exogeneity poses challenges for all approaches to nonlinear unobserved<strong>effects</strong> <strong>models</strong>. CRE approaches for the case of lagged dependent variables, but <strong>with</strong> otherwisestrictly exogenous covariates, are available for balanced <strong>panels</strong>; see Wooldridge (2005) for asummary. Wooldridge (2008) suggests an approach, under ignorable selection, that can workin the case of pure attrition (which imposes a particular patter on the selection indicators). Butmore work needs to be done. For specific <strong>models</strong>, a balanced panel is not needed for certainmethods that eliminate the heterogeneity – for example, Honoré and Kyriazidou (2000) fordynamic binary response, Honoré and Hu (2004) for dynamic corner solutions – but thesemethods have other restrictions and effectively require dropping lots of data. Fixed <strong>effects</strong>methods for large T, particularly <strong>with</strong> bias adjustments, seem promising, but their asymptoticproperties <strong>with</strong> small T need not be good, and it is unclear how features such as time dummiesand unit root processes can be handled.39


ReferencesAltonji, J.G. and R.L. Matzkin (2005), “Cross Section and Panel Data Estimators forNonseparable Models <strong>with</strong> Endogenous Regressors,” Econometrica 73, 1053-1102.Baltagi, B. (2001), Econometric Analysis of Panel Data, 2e. Wiley: New York.Blundell, R. and J.L. Powell (2003), “Endogeneity in Nonparametric and SemiparametricRegression Models, <strong>with</strong> Richard Blundell,” in Advances in Economics and Econonometrics:Theory and Applications, Eighth World Congress, Volume 2, M. Dewatripont, L.P. Hansenand S.J. Turnovsky, eds. Cambridge: Cambridge University Press, 312-357.Blundell, R. and J.L. Powell (2004), “Endogeneity in Semiparametric Binary ResponseModels,” Review of Economic Studies 71, 655-679.Chamberlain, G. (1982), “Multivariate Regression Models for Panel Data,” Journal ofEconometrics 1, 5-46.Chamberlain, G. (1984), “Panel Data,” in Handbook of Econometrics, Volume 2, ed. Z.Griliches and M.D. Intriligator. Amsterdam: North Holland, 1248-1318.Chernozhukov, V., I. Fernández-Val, J. Hahn, and W.K. Newey (2009), “Identification andEstimation of Marginal Effects in Nonlinear Panel Models.” Mimeo, Boston UniversityDepartment of Economics.Fernández-Val, I. (2008), “Fixed Effects Estimation of Structural Parameters and MarginalEffects in Panel Probit Models.” Mimeo, Boston University Department of Economics.Hahn, J. and W.K. Newey (2004), “Jackknife and Analytical Bias Reduction for NonlinearPanel Models,” Econometrica 72, 1295-1319.Hayashi, F. (2001), Econometrics. Princeton University Press: Princeton, NJ.Honoré, B.E. and L. Hu (2004), “Estimation of Cross Sectional and Panel Data Censored40


Regression Models <strong>with</strong> Endogeneity,” Journal of Econometrics 122, 293-316.Honoré, B.E. and E. Kyriazidou (2000), “Panel Data Discrete Choice Models <strong>with</strong> LaggedDependent Variables,” Econometrica 68, 839-874.Imbens, G.W. and J.M. Wooldridge (2007), “What’s New in Econometrics?” NBERResearch Summer Institute, Cambridge, July/August, 2007.Kwak, D.W. and J.M. Wooldridge (2009), “The Robustness of the Fixed Effects LogitEstimator to Violations of Conditional Independence,” mimeo, Michigan State UniversityDepartment of Economics.Mundlak, Y. (1978), “On the Pooling of Time Series and Cross Section Data,Econometrica 46, 69-85.Papke, L.E. and J.M. Wooldridge (2008), “Panel Data Methods for Fractional ResponseVariables <strong>with</strong> an Application to Test Pass Rates,” Journal of Econometrics 145, 121-133.Verbeerk, M. and T. Nijman (1996), “Testing for Selectivity Bias in Panel Data,”International Economic Review 33, 681-703.Wooldridge, J.M. (1999), “Distribution-Free Estimation of Some Nonlinear Panel DataModels,” Journal of Econometrics 90, 77-97.Wooldridge, J.M. (2002), Econometric Analysis of Cross Section and Panel Data. MITPress: Cambridge, MA.Wooldridge, J.M. (2005), “Unobserved Heterogeneity and Estimation of Average PartialEffects,” in Identification and Inference for Econometric Models: Essays in Honor of ThomasRothenberg, ed. D.W.K. Andrews and J.H. Stock. Cambridge: Cambridge University Press,27-55.Wooldridge, J.M. (2008), “Nonlinear Dynamic Panel Data Models <strong>with</strong> Unobserved41


Effects,” invited lecture, Canadian Econometrics Study Group, Montreal, September 2008.42

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!