11.07.2015 Views

2DkcTXceO


X.-L. Meng 553

vironmental progress must become a major focus of central government statistical agencies.” (Groves, February 2, 2012)

Multi-source inference therefore refers to situations where we need to draw inference by using data coming from different sources and some (but not all) of which were not collected for inference purposes. It is thus broader and more challenging than multi-frame inference, where multiple data sets are collected for inference purposes but with different survey frames; see Lohr and Rao (2006). Most of us would agree that the very foundation of statistical inference is built upon having a representative sample; even in notoriously difficult observational studies, we still try hard to create pseudo “representative” samples to reduce the impact of confounding variables. But the availability of a very large subpopulation, however biased, poses new opportunities as well as challenges.

45.4.1 Large absolute size or large relative size?

Let us consider a case where we have an administrative record covering $f_a$ percent of the population, and a simple random sample (SRS) from the same population which only covers $f_s$ percent, where $f_s \ll f_a$. Ideally, we want to combine the maximal amount of information from both of them to reach our inferential conclusions. But combining them effectively will depend critically on the relative information content in them, both in terms of how to weight them (directly or implied) and how to balance the gain in information with the increased analysis cost. Indeed, if the larger administrative dataset is found to be too biased relative to the cost of processing it, we may decide to ignore it.
Wu’s question therefore is a good starting point because it directly asks how the relative information changes as their relative sizes change: how large should $f_a/f_s$ be before an estimator from the administrative record dominates the corresponding one from the SRS, say in terms of MSE?

As an initial investigation, let us denote our finite population by $\{x_1, \ldots, x_N\}$. For the administrative record, we let $R_i = 1$ whenever $x_i$ is recorded and zero otherwise; and for SRS, we let $I_i = 1$ if $x_i$ is sampled, and zero otherwise, where $i \in \{1, \ldots, N\}$. Here we assume $n_a = \sum_{i=1}^{N} R_i \gg n_s = \sum_{i=1}^{N} I_i$, and both are considered fixed in the calculations below. Our key interest here is to compare the MSEs of two estimators of the finite-sample population mean $\bar{X}_N$, namely,

\[
\bar{x}_a = \frac{1}{n_a} \sum_{i=1}^{N} x_i R_i \quad \text{and} \quad \bar{x}_s = \frac{1}{n_s} \sum_{i=1}^{N} x_i I_i .
\]

Recall that for finite-population calculations, all $x_i$’s are fixed, and all the randomness comes from the response/recording indicator $R_i$ for $\bar{x}_a$ and the sampling indicator $I_i$ for $\bar{x}_s$. Although the administrative record has no probabilistic mechanism imposed by the data collector, it is a common strategy to model the responding (or recording or reporting) behavior via a probabilistic model.
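
To make the comparison concrete, the two estimators and their MSEs can be approximated by simulation. The following is only a minimal sketch under stated assumptions: the function names are illustrative, and the recording mechanism (a fixed-size draw in which units with larger `weights` are more likely to be recorded) is a hypothetical stand-in, since no specific model for $R_i$ has been given at this point. The SRS indicator $I_i$ is drawn without replacement with $n_s$ fixed, as in the text.

```python
import math
import random
from statistics import fmean


def mean_from_indicators(x, ind):
    """Return (1/n) * sum_i x_i * ind_i, with n = sum_i ind_i."""
    n = sum(ind)
    return sum(xi * di for xi, di in zip(x, ind)) / n


def srs_indicators(N, n_s, rng):
    """I_i for a simple random sample of fixed size n_s, without replacement."""
    chosen = set(rng.sample(range(N), n_s))
    return [1 if i in chosen else 0 for i in range(N)]


def biased_record_indicators(weights, n_a, rng):
    """R_i under a HYPOTHETICAL recording model (an assumption, not the source's).

    Draws a fixed-size set of n_a units without replacement, favouring units
    with larger positive weights (Efraimidis-Spirakis keys u ** (1 / w)).
    """
    keys = [rng.random() ** (1.0 / w) for w in weights]
    order = sorted(range(len(weights)), key=keys.__getitem__, reverse=True)
    chosen = set(order[:n_a])
    return [1 if i in chosen else 0 for i in range(len(weights))]


def simulate_mse(x, n_a, n_s, weights, reps, seed=0):
    """Monte Carlo estimates of MSE(x_bar_a) and MSE(x_bar_s).

    The x_i are held fixed; only the indicators R_i and I_i are random.
    """
    rng = random.Random(seed)
    N = len(x)
    target = fmean(x)  # finite-population mean
    sq_err_a = sq_err_s = 0.0
    for _ in range(reps):
        R = biased_record_indicators(weights, n_a, rng)
        I = srs_indicators(N, n_s, rng)
        sq_err_a += (mean_from_indicators(x, R) - target) ** 2
        sq_err_s += (mean_from_indicators(x, I) - target) ** 2
    return sq_err_a / reps, sq_err_s / reps


if __name__ == "__main__":
    pop_rng = random.Random(42)
    x = [pop_rng.gauss(0.0, 1.0) for _ in range(500)]
    # Assumed recording propensity: increases with x_i (illustration only).
    weights = [math.exp(0.5 * xi) for xi in x]
    mse_a, mse_s = simulate_mse(x, n_a=250, n_s=10, weights=weights, reps=200)
    print(f"MSE of administrative mean: {mse_a:.5f}")
    print(f"MSE of SRS mean:            {mse_s:.5f}")
```

Varying `n_a` and `n_s` (and the strength of the assumed dependence of `weights` on `x`) gives a rough empirical view of how the ratio $f_a/f_s$ affects which estimator has the smaller MSE.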
