
Practical Considerations in Raking Survey Data

Michael P. Battaglia (1), David Izrael (1), David C. Hoaglin (1), and Martin R. Frankel (1,2)

(1) Abt Associates Inc., (2) Baruch College, CUNY

Contact Author: Michael P. Battaglia

Abt Associates Inc., 55 Wheeler Street, Cambridge, MA 02138

(v) 617-349-2425, (f) 617-349-2605, mike_battaglia@abtassoc.com

Abstract

A survey sample may cover segments of the target population in proportions that do not match the proportions of those segments in the population itself. The differences may arise from sampling fluctuations, from nonresponse, or because the sample design was not able to cover the entire population. In such situations one can use raking to improve the relation between the sample and the population by adjusting the sampling weights of the cases in the sample so that the marginal totals of the adjusted weights on specified characteristics agree with the corresponding totals for the population. The raking procedure is described, and convergence issues and problems are discussed. The details of several practical aspects of raking are then given. The topics covered have not received much attention in the literature on raking. Specific aspects of raking are illustrated with graphical displays of output from a SAS macro that can be obtained for free from the authors.

Key Words<br />

Control totals, convergence, rak<strong>in</strong>g marg<strong>in</strong>s, weights, nonresponse<br />

1. Introduction

A survey sample may cover segments of the target population in proportions that do not match the proportions of those segments in the population itself. The differences may arise, for example, from sampling fluctuations, from nonresponse, or because the sample design was not able to cover the entire population. In such situations one can often improve the relation between the sample and the population by adjusting the sampling weights of the cases in the sample so that the marginal totals of the adjusted weights on specified characteristics agree with the corresponding totals for the population. This operation is known as raking ratio estimation (Kalton 1983), raking, or sample-balancing, and the population totals are usually referred to as control totals. Raking may reduce nonresponse and noncoverage biases, as well as sampling variability. The initial sampling weights in the raking process are often equal to the reciprocal of the probability of selection and may have undergone some adjustments for unit nonresponse and noncoverage. The weights from the raking process are used in estimation and analysis.

The adjustment to control totals is sometimes achieved by creating a cross-classification of the categorical control variables (e.g., age categories x gender x race x family-income categories) and then matching the total of the weights in each cell to the control total. This approach, however, can spread the sample thinly over a large number of cells. It also requires control totals for all cells of the cross-classification. Often this is not feasible (e.g., control totals may be available for age x gender x race but not when those cells are subdivided by family income). The use of marginal control totals for single variables (i.e., each margin involves only one control variable) often avoids many of these difficulties. In return, of course, the two-variable (and higher-order) weighted distributions of the sample are not required to mimic those of the population.

A somewhat different problem motivated the original development of sample-balancing (Deming 1943). The Census Bureau needed to produce tabulations for the joint distribution of two (or more) variables in the U.S. population, in situations where information on the joint distribution was available only from a sample. The marginal totals, however, were available for the full population, and so the sample counts in the cells of the cross-classification were adjusted to provide an estimated tabulation that had the correct marginal totals.

Raking (or sample-balancing) usually proceeds one variable at a time, applying a proportional adjustment to the weights of the cases that belong to the same category of the control variable. Software for sample-balancing has been available for many years, but not as part of SAS (except for the CALMAR macro from France) or most other major software systems (WESVAR includes a raking algorithm). Older readers may be familiar with a FORTRAN program developed in the 1960s by MarketMath, Inc. Although that program executed rapidly, it had a variety of disadvantages. The user had to create an ASCII input data set, painstakingly prepare control statements (the original program was designed to read input from cards), and then process its ASCII output data set. It could rake on at most 12 variables. Also, it handled rounding in a way that could lose precision. Izrael et al. (2000) introduced a SAS macro for raking (sometimes referred to as the IHB raking macro) that combines simplicity and versatility. More recently, the IHB raking macro was enhanced to increase its utility and diagnostics (Izrael et al. 2004).

The raking algorithm and issues related to convergence are discussed next. Several practical raking applications are then covered.

2. Basic Algorithm

The procedure known as raking adjusts a set of data so that its marginal totals match specified control totals on a specified set of variables. The term “raking” suggests an analogy with the process of smoothing the soil in a garden plot by alternately working it back and forth with a rake in two perpendicular directions.

In a simple 2-variable example the marginal totals in various categories for the two variables are known from the entire population, but the joint distribution of the two variables is known only from a sample. In the cross-classification of the sample, arranged in rows and columns, one might begin with the rows, taking each row in turn and multiplying each entry in the row by the ratio of the population total to the weighted sample total for that category, so that the row totals of the adjusted data agree with the population totals for that variable. The weighted column totals of the adjusted data, however, may not yet agree with the population totals for the column variable. Thus the next step, taking each column in turn, multiplies each entry in the column by the ratio of the population total to the current total for that category. Now the weighted column totals of the adjusted data agree with the population totals for that variable, but the new weighted row totals may no longer match the corresponding population totals. The process continues, alternating between the rows and the columns, and agreement on both rows and columns is usually achieved after a few iterations. The result is a tabulation for the population that reflects the relation of the two variables in the sample.

The above sketch of the raking procedure focuses on the counts in the cells and on the margins of a two-variable cross-classification of the sample. In the applications that survey statisticians often encounter, involving data from complex surveys, it is more common to work with the survey weights of the n individual respondents. Thus, the basic raking algorithm is described in terms of those individual weights, $w_i$, $i = 1, 2, \ldots, n$. For an unweighted (i.e., equally weighted) sample, one can simply take the initial weights to be $w_i = 1$ for each $i$.

In a cross-classification that has J rows and K columns, denote the sum of the $w_i$ in cell $(j, k)$ by $w_{jk}$. To indicate further summation, replace a subscript by a + sign. Thus, the initial row totals and column totals of the sample weights are $w_{j+}$ and $w_{+k}$, respectively. Analogously, denote the corresponding population control totals by $T_{j+}$ and $T_{+k}$.

The iterative raking algorithm produces modified weights, whose sums are denoted by a suitably subscripted m with a parenthesized superscript for the number of the step. Thus, in the two-variable cross-classification $m^{(1)}_{jk}$ denotes the sum of the modified weights in cell $(j, k)$ at the end of Step 1. If one begins by matching the control totals for the rows, $T_{j+}$, the initial steps of the algorithm are

$$m^{(0)}_{jk} = w_{jk} \qquad (j = 1, \ldots, J;\ k = 1, \ldots, K)$$

$$m^{(1)}_{jk} = m^{(0)}_{jk} \left( T_{j+} / m^{(0)}_{j+} \right) \qquad \text{(for each } k \text{ within each } j\text{)}$$

$$m^{(2)}_{jk} = m^{(1)}_{jk} \left( T_{+k} / m^{(1)}_{+k} \right) \qquad \text{(for each } j \text{ within each } k\text{)}$$

The adjustment factors, $T_{j+} / m^{(0)}_{j+}$ and $T_{+k} / m^{(1)}_{+k}$, are actually applied to the individual weights, which could be denoted by $m^{(2)}_i$, for example. In the iterative process an iteration rakes both rows and columns. Thus, for iteration $s$ ($s = 0, 1, \ldots$) one may write

$$m^{(2s+1)}_{jk} = m^{(2s)}_{jk} \left( T_{j+} / m^{(2s)}_{j+} \right)$$

$$m^{(2s+2)}_{jk} = m^{(2s+1)}_{jk} \left( T_{+k} / m^{(2s+1)}_{+k} \right)$$
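
The two alternating steps can be illustrated with a short sketch in Python. It is not the IHB macro: the function name `rake_two_way`, the stopping rule (largest absolute row discrepancy after the column step), and the 2 x 2 example values are all assumptions made for illustration, and every row and column is assumed to have a positive weighted total.

```python
def rake_two_way(cells, row_targets, col_targets, tol=1e-8, max_iter=50):
    """Rake a J x K table of weight sums to row and column control totals.

    Hypothetical helper for illustration. Returns the raked table, the
    number of iterations used, and a flag saying whether the row totals
    ended within `tol` of their control totals.
    """
    m = [row[:] for row in cells]  # m^(0)_{jk} = w_{jk}
    n_rows, n_cols = len(m), len(m[0])
    for iteration in range(1, max_iter + 1):
        # Row step: multiply row j by T_{j+} / m_{j+}.
        for j in range(n_rows):
            factor = row_targets[j] / sum(m[j])
            m[j] = [value * factor for value in m[j]]
        # Column step: multiply column k by T_{+k} / m_{+k}.
        for k in range(n_cols):
            col_sum = sum(m[j][k] for j in range(n_rows))
            factor = col_targets[k] / col_sum
            for j in range(n_rows):
                m[j][k] *= factor
        # Columns now match; measure how far the rows have drifted.
        worst = max(abs(sum(m[j]) - row_targets[j]) for j in range(n_rows))
        if worst <= tol:
            return m, iteration, True
    return m, max_iter, False


# Hypothetical weighted cell sums and control totals (both margins add to 100).
sample = [[20.0, 30.0], [40.0, 10.0]]
raked, n_iter, ok = rake_two_way(sample, [60.0, 40.0], [55.0, 45.0])
```

Because each step only rescales whole rows or whole columns, the cross-product ratio of the 2 x 2 table is unchanged by raking, which is the sense in which the raked table keeps the association observed in the sample.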

Bishop et al. (1975) discuss the relationship between iterative proportional fitting and raking. They point out that raking was originally developed not for fitting an unsaturated model to a data set, but rather for combining information from two or more data sets. In the two-way table discussed above, one is in effect fitting a fully saturated log-linear model: the two-factor interaction present in the sample persists after raking, and the one-factor terms (reflected in the population control totals) are also fitted. In some ways raking can thus be thought of as fitting a “main effects” model, where the main effects correspond to the given margins.

Raking can also adjust a set of data to control totals on three or more variables. In such situations the control totals often involve single variables, but they may involve two or more variables. In one example, in raking on three variables one might have control totals $T_{a++}$, $T_{+b+}$, and $T_{++c}$. In another example, the control totals might be $T_{ab+}$ and $T_{++c}$: a two-variable margin and a one-variable margin. In actually carrying out the raking for this second example, it suffices to treat the two-variable margin as the one-variable margin for a composite variable, whose values simply index the cells of the underlying two-variable margin.
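
As a small illustration of the composite-variable device, the following Python sketch pairs the values of two variables into a single raking variable and totals the weights by composite category. The helper names, the category labels, and the weights are hypothetical.

```python
from collections import defaultdict


def composite_variable(values_a, values_b):
    """Index the cells of a two-variable margin with one composite label."""
    return list(zip(values_a, values_b))


def weighted_totals(categories, weights):
    """Sum the weights within each category of a raking variable."""
    totals = defaultdict(float)
    for category, weight in zip(categories, weights):
        totals[category] += weight
    return dict(totals)


# Hypothetical cases: two variables and a weight per case.
sex = ["F", "M", "F", "M"]
age = ["18-44", "18-44", "45+", "45+"]
weights = [1.5, 2.0, 1.0, 0.5]

sex_by_age = composite_variable(sex, age)
totals = weighted_totals(sex_by_age, weights)
```

The composite variable can then be raked exactly like any single-variable margin, with the two-variable control totals keyed by the same composite labels.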

Ideally, one should rake on variables that exhibit strong associations with the key survey outcome variables or that are strongly related to nonresponse or noncoverage. This strategy will reduce the mean squared error of the key outcome variables. In practice, other considerations may enter. A variable such as gender may not be related to key outcome variables or to nonresponse or noncoverage, but raking on it may be desirable to preserve the “face validity” of the sample.

3. Convergence

Convergence of the raking algorithm has received considerable attention in the statistical literature, especially in the context of iterative proportional fitting for log-linear models, where the number of variables is at least three and the process begins with a different set of initial values in the fitted table (often 1 in each cell). For raking survey data it is enough that the iterative raking algorithm (ordinarily) converges, as one would expect from the fact that (in a suitable scale) the fitted cell counts produced by the raking are the weighted-least-squares fit to the observed cell counts in the full cross-classification of the sample by all the raking variables (Deming 1943). Convergence is not guaranteed, however: as an extreme example, for the 2 x 2 table shown in Table 1, convergence is impossible.

Convergence may require a large number of iterations. Oh and Scheuren (1978) note that the available convergence proofs make strong assumptions about the cell counts in the cross-classification of the raking variables: that no cells are empty or that some particular combination of nonempty cells is present. They recommend setting up the raking problem in a “sensible” manner to avoid: 1) imposing too many marginal constraints on the sample, 2) defining marginal categories that contain a small percentage of the sample, and 3) imposing contradictory constraints on the sample.

The authors’ experience indicates that, in general, raking on a large number of variables slows the convergence process. However, other factors also affect convergence. One is the number of categories of the raking variables. Convergence will typically be slower for raking on 10 variables each with 5 categories than for 10 variables each with only 2 categories. A second factor is the number of sample cases in each category of the raking variables. Convergence may be slow if any categories contain fewer than 5% of the sample cases. A third factor is the size of the difference between each control total and the corresponding weighted sample total prior to raking. If some differences are large, the number of iterations will typically be higher. One can guard against the possibility of nonconvergence or slow convergence by setting an upper limit on the number of iterations (e.g., 50).
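
A case-level raking loop with an iteration cap might look like the following Python sketch. It is an illustration rather than the IHB macro: the names `margin_totals` and `rake_weights`, the stopping rule (every margin within an absolute tolerance `tol` of its control total), and the six-case data set are assumptions, and every category that appears in the sample is assumed to have a control total.

```python
def margin_totals(labels, weights):
    """Sum the weights within each category of one raking variable."""
    totals = {}
    for label, weight in zip(labels, weights):
        totals[label] = totals.get(label, 0.0) + weight
    return totals


def rake_weights(weights, variables, controls, tol=1.0, max_iter=50):
    """Rake individual weights to single-variable control totals.

    variables: dict of variable name -> list of category labels per case.
    controls:  dict of variable name -> dict of category -> control total.
    Returns (raked weights, iterations used, converged flag).
    """
    w = list(weights)
    for iteration in range(1, max_iter + 1):
        # One iteration rakes on every variable in turn.
        for name, labels in variables.items():
            totals = margin_totals(labels, w)
            w = [weight * controls[name][label] / totals[label]
                 for label, weight in zip(labels, w)]
        # Largest absolute gap between a raked margin and its control total.
        worst = 0.0
        for name, labels in variables.items():
            totals = margin_totals(labels, w)
            for category, target in controls[name].items():
                worst = max(worst, abs(totals.get(category, 0.0) - target))
        if worst <= tol:
            return w, iteration, True
    return w, max_iter, False  # cap reached without meeting the tolerance


# Hypothetical sample of six cases with equal initial weights.
variables = {
    "sex": ["F", "F", "M", "M", "F", "M"],
    "age": ["young", "old", "young", "old", "young", "old"],
}
controls = {
    "sex": {"F": 330.0, "M": 270.0},
    "age": {"young": 320.0, "old": 280.0},
}
raked, n_iter, converged = rake_weights([100.0] * 6, variables, controls)
```

Returning the iteration count together with the convergence flag lets the caller tell a run that met the tolerance apart from one that merely hit the cap.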

Brick et al. (2003) also discuss problems with convergence. They point out that a large number of iterations indicates a raking application that is not “well-behaved” and that problems may exist with the resulting weights: highly variable weights inflate sampling variances and produce unstable domain estimates. One example of a problem is the use of raking variables that have a strong association (correlation). In this situation the number of iterations may be large, and convergence will not occur if there are inconsistencies between the associations in the sample and the control totals (Table 1 shows such an example). The log-linear models literature on structural zeros in contingency tables is directly related to this issue. For example, if one rakes on Food Stamps eligibility and a poverty status variable, the cross-tabulation of these two variables in the sample will likely result in one or more cells that must be empty by definition.

One simple definition of convergence requires that each marginal total of the raked weights be within a specified tolerance of the corresponding control total. As noted above, in practice, when a number of raking variables are involved, one must check for the possibility that the iterations do not converge (e.g., because of sparseness or some other feature in the full cross-classification of the sample). As already noted, one can guard against this possibility by setting an upper limit on the number of iterations. As elsewhere in data analysis, it is sensible to examine the sample (including its joint distribution with respect to all the raking variables) before doing any raking. For example, if the sample contains no cases in a category of one of the raking variables, it will be necessary to revise the set of categories and their control totals (say, by combining categories). The authors recommend, at a minimum, checking the unweighted percentage of sample cases and the percentage of control cases in each category of each raking variable. Small categories in the sample or in the control totals (say under 5%) are potential candidates for collapsing. This step will reduce the chance of creating very unequal weights in raking. Category collapsing always needs to be done carefully, and in some instances it may be important to retain a small category in the raking.
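
The recommended pre-raking check can be sketched in a few lines of Python. The function name `small_categories`, the default 5% threshold taken from the text, and the example data are illustrative assumptions; the sketch only flags candidates and leaves the decision to collapse to the analyst.

```python
from collections import Counter


def small_categories(labels, control_totals, threshold=0.05):
    """Flag categories whose sample share or control share is under threshold.

    labels:         category label of the raking variable for each sample case.
    control_totals: dict of category -> population control total.
    Returns a list of (category, unweighted sample share, control share).
    """
    counts = Counter(labels)
    n_cases = len(labels)
    control_sum = sum(control_totals.values())
    flagged = []
    for category, target in control_totals.items():
        sample_share = counts.get(category, 0) / n_cases
        control_share = target / control_sum
        if sample_share < threshold or control_share < threshold:
            flagged.append((category, sample_share, control_share))
    return flagged


# Hypothetical raking variable with three categories.
labels = ["A"] * 90 + ["B"] * 7 + ["C"] * 3
control_totals = {"A": 800.0, "B": 140.0, "C": 60.0}
flagged = small_categories(labels, control_totals)
```

A category with no sample cases at all has a sample share of zero and is therefore flagged as well, which matches the case in which the categories and their control totals must be revised before raking.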

4. The IHB Raking Macro

The IHB SAS macro produces diagnostic output that contains the following information: number of iterations, name of variable currently being raked on, name of BY-variable if there is one, and marginal control total and calculated total weight for each level of the current raking variable, along with their difference and percentage difference. At termination, the macro gives the iteration number at which termination occurred and the reason, which is either that the tolerance has been met or that the process did not converge. The macro also writes diagnostics into the SAS LOG, from several of the checks that it makes.

Table 2 illustrates the use of the macro with an example involving two raking variables (called VARIABLE1 and VARIABLE2 in Table 2) and a BY-variable, AREA, which has two levels. The marginal percentage and general control total for each level of the BY-variable are obtained outside the example, from PROC FREQ. Preliminary analyses of the data set showed that all categories of the raking variables represented in the marginal control data sets exist in the sample as well. Table 2 shows the unweighted distribution of each variable. The actual raking uses the weights of the individual cases. With the convergence tolerance set to 1, the raking converged after 3 iterations for Area 1, and also after 3 iterations for Area 2.

5. Sources of Control Totals

The discussion of control totals refers to actual totals as opposed to percents. Surveys that use demographic and socioeconomic variables for raking must locate a source for the population control totals. An example of a source of true population control totals is the 2000 U.S. Census short-form data. The U.S. Census long-form variables, the 2000 U.S. Census 5-Percent Public Use Microdata Sample (PUMS) files, the Current Population Survey (CPS), U.S. Census Bureau population projections, the National Health Interview Survey, and private-sector sources such as Claritas are better viewed as estimated control totals, because they are based either on large samples or on projection methodologies.

Control totals obtained from a sample such as the CPS are subject to much smaller sampling variability and nonresponse bias, and may be subject to much lower noncoverage bias, than a survey sample. For state-specific control totals, say for persons aged 0-17 years, the CPS estimates will be subject to considerably larger sampling variability; thus they are useful for national control totals, but potentially less useful for stable state control totals. Combining two years of CPS data can reduce the sampling variability of the state control totals. For projection methods (e.g., age by sex by race mid-year population projections from the U.S. Census Bureau), the basic approach is to project information forward from 2000 for the non-censal years. Clearly, the farther one gets from 2000, the greater the likelihood that the projections will be off. This happened, for example, with the projection of the size of the Hispanic population for the years before the 2000 Census results came out. Eventually, the American Community Survey should provide a new source of information for non-censal years.

It is important to make sure that control totals from different sources all add to the same population total. If not, the raking will not converge. For example, for a survey in the middle of 2003, one would use Census Bureau age, sex, and race projections of the civilian noninstitutionalized population for July 2003, and obtain control totals by household income from the March 2003 CPS. In this situation one would most likely need to ratio-adjust the CPS income control totals so that they summed to the Census projection control totals for July 2003.
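
The ratio adjustment amounts to multiplying every category total by one common factor. A minimal Python sketch follows; the function name `ratio_adjust`, the income categories, and all of the totals are hypothetical and are not actual CPS or Census figures.

```python
def ratio_adjust(control_totals, target_total):
    """Scale control totals proportionally so that they sum to target_total."""
    factor = target_total / sum(control_totals.values())
    return {category: total * factor
            for category, total in control_totals.items()}


# Hypothetical income control totals from one source (sum = 105 million)
# and a hypothetical population total from another source (108 million).
income_controls = {
    "low": 30.0e6,
    "middle": 50.0e6,
    "high": 25.0e6,
}
adjusted = ratio_adjust(income_controls, 108.0e6)
```

The adjustment changes the level of the income totals but leaves their percentage distribution untouched, so the income margin and the demographic margins describe the same population size.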

One must also consider how the variables are measured. A telephone survey may ask a single question to obtain household income. The source for the control totals, however, may have an income variable that is constructed from a series of questions about income from several sources (wages, cash-assistance programs, interest, dividends, etc.). One needs to consider carefully whether using income as a raking variable makes sense. If the sample is thought to substantially under-represent low-income persons, then raking on income may be preferred. If, on the other hand, there is concern that the survey is measuring income very differently from the source of the control totals, then consideration should be given to raking on a proxy variable such as educational attainment or even a dichotomous poverty-status variable.

Control totals usually do not come with a “missing” category. The same variable in the survey may have a nontrivial percentage of cases that fall in a DK (don't know) or Refused category. In this situation it may be possible to impute for item nonresponse in the survey before the raking takes place. When imputation is not feasible, the following procedure can be used to adjust the control totals. Run a weighted frequency distribution on the raking variable in order to determine the percentage of sample cases that have a missing value (e.g., 4.3%). Allocate 4.3% of the control total to a newly created missing category (e.g., 4.3% of 1,500,000 = 64,500). Reapportion the control totals in the other categories so that they add to the reduced control total (1,500,000 - 64,500 = 1,435,500). After raking, the weighted distribution of the sample will agree with the revised control totals and will reflect a 4.3% missing-data rate in weighted frequencies and tabulations.
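
The arithmetic of this procedure can be sketched in Python using the figures from the text (a 4.3% missing rate and a control total of 1,500,000). The function name `add_missing_category`, the two category labels, their 900,000 / 600,000 split, and the choice to reapportion the remaining categories proportionally are assumptions for illustration.

```python
def add_missing_category(control_totals, missing_rate, label="Missing"):
    """Carve a missing category out of the control totals.

    The missing category receives missing_rate times the overall total,
    and the other categories are scaled down proportionally so that they
    add to the reduced total (proportional scaling is an assumption).
    """
    grand_total = sum(control_totals.values())
    missing_total = grand_total * missing_rate
    scale = (grand_total - missing_total) / grand_total
    adjusted = {category: total * scale
                for category, total in control_totals.items()}
    adjusted[label] = missing_total
    return adjusted


# Hypothetical split of the 1,500,000 control total into two categories.
controls = {"category_1": 900000.0, "category_2": 600000.0}
adjusted = add_missing_category(controls, 0.043)
# adjusted["Missing"] is about 64,500 and the other two categories
# together add to about 1,435,500, as in the worked figures above.
```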

6. Trade-offs Related to Number of Margins and Numbers of Categories

Some raking applications use margins for age, sex, and race, because it is relatively easy to obtain control totals for these variables. In other situations (especially in surveys with lower response or important noncoverage issues) one may need to rake on a considerably larger number of variables. This is feasible if control totals can be assembled. The authors have seen rakings that used well over ten variables. Raking on many variables will almost always require a large number of iterations. The authors have also seen rakings that used a smaller number of variables, but with fairly detailed categories. Again, a large number of iterations may be required. In both situations the cross-classification of the raking variables often yields an extremely large number of cells. For example, raking on 12 dichotomous variables yields 4,096 cells. Raking on five variables each containing six categories yields 7,776 cells. Many of these cells will contain no cases in the sample. Such cells, by definition, remain empty after raking. However, the two-variable, three-variable, and higher-order interactions in the sample are maintained in the raking to the marginal control totals. The small cell sizes increase the chance that the raked weights will exhibit considerable variability, because those weights are maintaining sample interactions that are quite unstable.

On top of the challenges of the numbers of variables and categories and the resulting number of<br />

underlying cells, large differences, before raking, between the weighted sample totals and the<br />

marginal control totals will generally increase the number of iterations. These issues point to the<br />

need to closely examine: 1) the variables selected for raking, 2) the number and size of the<br />

categories of those raking variables, and 3) the magnitude of differences between the weighted<br />

sample totals and the control totals. Ideal variables for raking are those related to the key survey<br />

outcome variables and related to nonresponse and/or noncoverage. Variables that do not meet<br />

these conditions are candidates for exclusion from raking when a large number of variables are<br />

being considered. The categories of each candidate raking variable should be examined to see<br />

whether they contain a small proportion of the sample cases (say, under 5%) or whether the control<br />

total percentage is small (also, say, under 5%). Such small categories should be considered for<br />

collapsing. Sometimes the small categories of a nominal categorical variable can be collapsed<br />

into a larger residual category. For ordinal variables, collapsing with an adjacent category is often<br />

the best approach. If one or more weighted sample totals differ by a large amount from the<br />

corresponding control totals, one should first try to determine the source of the difference. Is it<br />

extreme differential nonresponse, or has the variable in the sample been measured in a very<br />

different manner than the corresponding variable used to form the control total? One should<br />

consider whether it is appropriate to use such a variable in raking.<br />
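
The screening of categories described above can be sketched in a few lines of Python. This is only an illustration and not part of the IHB macro: the function name `small_categories`, the example data, and the choice of unweighted case counts for the sample share are assumptions made here; the 5% threshold is the one suggested in the text.

```python
from collections import Counter

def small_categories(sample_values, control_totals, threshold=0.05):
    """Flag categories whose share of sample cases or of the control total is below threshold."""
    counts = Counter(sample_values)
    n_cases = len(sample_values)
    control_sum = sum(control_totals.values())
    flagged = {}
    for category, control in control_totals.items():
        sample_share = counts.get(category, 0) / n_cases
        control_share = control / control_sum
        if sample_share < threshold or control_share < threshold:
            flagged[category] = (sample_share, control_share)
    return flagged

# Hypothetical raking variable with three categories.
sample = ["A"] * 60 + ["B"] * 37 + ["C"] * 3
controls = {"A": 5500.0, "B": 4000.0, "C": 500.0}
flagged = small_categories(sample, controls)
print(flagged)  # category "C" holds only 3% of the sample cases
```

Flagged categories are candidates for collapsing; whether to merge them into a residual category or into an adjacent one remains a judgment for the statistician.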

7. Examining and Diagnosing Slow Convergence<br />

Sometimes the raking process does not converge in a specified number of iterations. As an aid to<br />

diagnosing such situations and taking appropriate action, the enhanced IHB raking macro<br />

incorporates a module that, in case of non-convergence, uses the data to predict the number of<br />

iterations needed for convergence.<br />

The prediction is based on an empirical observation that the logarithm of the magnitude of the<br />

difference between an adjusted weighted total and its control total declines linearly with the<br />

number of iterations. In the authors’ experience, this relation holds reasonably well when a slowly<br />

converging raking process approaches the specified number of iterations (50 in most applications).<br />

The enhanced macro extrapolates the slope observed at the last iteration and estimates the iteration at which the<br />

slowest converging variable will cross a given tolerance threshold.<br />
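
A minimal sketch of such an extrapolation follows. It is not the macro's actual code: the function name `predict_iterations`, the use of only the last two iterations to obtain the slope, and the example trace are assumptions made for illustration.

```python
import math

def predict_iterations(max_abs_diffs, tolerance=1.0):
    """Extrapolate the last log10 slope to the iteration where the difference falls below tolerance.

    max_abs_diffs[k] is the largest |adjusted weighted total - control total|
    for the slowest converging variable after iteration k + 1.
    """
    if len(max_abs_diffs) < 2:
        raise ValueError("need the differences from at least two iterations")
    done = len(max_abs_diffs)
    if max_abs_diffs[-1] < tolerance:
        return done  # already within tolerance
    y_prev = math.log10(max_abs_diffs[-2])
    y_last = math.log10(max_abs_diffs[-1])
    slope = y_last - y_prev  # change in log10(difference) per iteration
    if slope >= 0:
        return None  # the differences are not shrinking, so no prediction
    extra = math.ceil((math.log10(tolerance) - y_last) / slope)
    return done + max(1, extra)

# Hypothetical trace: the difference shrinks by 10% per iteration for 50 iterations.
trace = [1000.0 * 0.9 ** k for k in range(1, 51)]
print(predict_iterations(trace))
```

If the log-linear relation holds only approximately, the prediction is best treated as a starting value for the rerun rather than as a guarantee.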

One usually considers a raking process to be “converging slowly” if either it does not converge in a<br />

specified number of iterations or convergence takes substantially more iterations than usual. In the<br />

authors’ work, convergence usually takes place in 5 to 20 iterations. However, when the number of<br />

raking variables is large (say, more than 8) and some of the raking variables have numerous levels<br />

(the variable State with 51 categories, for instance), the process may take much longer to converge<br />

or may not converge at all within the initially specified number of iterations. The statistician then has two<br />

options. The first is to use the predicted number of iterations from the<br />

diagnostics to rerake the sample, trying to achieve complete convergence. This option is illustrated<br />

later. However, the predicted number of iterations may be impractically large. Then, as a second<br />

option, one may attempt to preprocess the sample data.<br />

A common strategy collapses categories of slowly converging variables. If, for instance, State is<br />

such a variable (with a value for each U.S. state and D.C.), it could be collapsed into, say, Census<br />

Division (9 levels) or even Census Region (4 levels). Of course, the statistician may not always<br />

have flexibility in collapsing. He/she may be required to rake by the original variables, or the<br />

“slow” variables may already be dichotomous. But if there is some flexibility in the statistical<br />

weighting methods, the authors recommend trying collapsing to accelerate convergence.<br />

How does one determine which raking variables are “slow”? The most effective way to examine a<br />

convergence process is to draw graphs. Figure 1 displays a plot of a slow raking process involving<br />

12 variables; the x-axis is the iteration number, and the y-axis is log10 of the maximum (taken over<br />

all categories of a given raking variable) of the absolute value of the difference between the<br />

adjusted weighted total and the control total. The reference line indicates the tolerance level, in this<br />

example log10(1) = 0. One can easily construct this kind of graph using standard SAS/GRAPH<br />

facilities.<br />
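
For readers working outside SAS, the quantity plotted on the y-axis can be computed as in the following sketch. The function name `log10_max_abs_diff` and the example data are hypothetical; the calculation follows the description above (maximum over the categories of one raking variable, then log10).

```python
import math
from collections import defaultdict

def log10_max_abs_diff(weights, categories, control_totals):
    """log10 of the largest |adjusted weighted total - control total| over one variable's categories."""
    totals = defaultdict(float)
    for weight, category in zip(weights, categories):
        totals[category] += weight
    worst = max(abs(totals[cat] - control) for cat, control in control_totals.items())
    return math.log10(worst) if worst > 0 else float("-inf")

# Hypothetical weights after some iteration, for one dichotomous raking variable.
weights = [10.0, 20.0, 30.0, 40.0]
categories = [1, 1, 2, 2]
controls = {1: 40.0, 2: 60.0}
y_value = log10_max_abs_diff(weights, categories, controls)
print(y_value)
```

Evaluating this after every iteration for each raking variable gives one trace per variable; a trace that stays above log10 of the tolerance identifies a variable that has not yet converged.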

From the graph, one can easily single out the four slowest converging variables (their traces cluster<br />

distinctly higher): EEE, JJJ, GGG, and AAA. The variables GGG and AAA are dichotomous, so it<br />

is not possible to collapse them. To explore how categories of the variables EEE and JJJ (which<br />

are ordinal) converge and which of them might be collapsed, similar graphs show the individual<br />

categories of those two variables (Figure 2).<br />

Besides visual exploration of convergence of slow categories, one should apply common sense<br />

when combining them. For ordinal variables, for instance, it would be logical to combine adjacent<br />

categories. Taking the meaning of the values of EEE and JJJ into account, in addition to the graphs in<br />

Figure 2, the authors combined Categories 1 and 2, and Categories 4 and 5, for both variables<br />

(keeping Category 3 separate). Correspondingly, the respective marginal totals were combined,<br />

after which the raking was rerun and new convergence graphs were constructed for those two<br />

collapsed variables (Figure 3). Because convergence of EEE and JJJ looked promising, a new<br />

overall convergence graph was constructed for all 12 raking variables (Figure 4). Comparing this<br />

graph with Figure 1, one can see that collapsing did play a dramatic role in speeding convergence.<br />

The raking process now converges in 17 iterations.<br />

As already noted, the statistician may not always have the flexibility to collapse categories, or<br />

he/she may still want to achieve convergence without altering the raking variables, i.e., using as<br />

many iterations as required. But how many are required? The enhanced macro calculates a<br />

predicted number of iterations needed for full convergence. The graph in Figure 5 demonstrates a<br />

two-variable raking process that initially did not converge in the default 50 iterations (vertical<br />

reference line) and for which the macro predicted 65 as the needed number. When rerun, the raking did converge at<br />

exactly the 65th iteration. In a fairly rare situation, rerunning the raking with the predicted number<br />

of iterations could give non-convergence again, with a new and much larger number of predicted<br />

iterations. If this occurs, it makes sense to thoroughly examine sample and population data and<br />

make appropriate changes.<br />

8. Inclusion of Two-Variable Raking Margins<br />

Raking can be viewed as analogous to fitting a main-effects-only model. Because of sample size<br />

limitations and/or availability of only one-variable (factor or dimension) control totals, many<br />

raking applications follow this approach. In some situations it may be important to fit a two-variable<br />

interaction to the data. For example, suppose one is planning to rake on Variables A, B, C, and D.<br />

However, control totals for Variable C crossed with Variable D are available and exhibit a strong<br />

interaction (e.g., persons aged 0-17 years are more likely to be Hispanic than persons aged 65+<br />

years). If the cell counts in the C x D margin of the sample are large enough to support fitting a C<br />

x D interaction, one would rake on three margins: A, B, and C x D. It is not necessary also to rake<br />

on separate margins for Variables C and D. If, however, the C x D raking margin involved<br />

collapsing, one could consider adding one-variable margins to the raking for Variables C and D<br />

without any collapsing of their categories.<br />
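
As a rough illustration of forming a C x D margin and checking its sample cell counts, consider the sketch below. The function name `crossed_margin`, the example data, and the `min_count` value are hypothetical; the text does not prescribe a minimum cell size, so that threshold is left to the user.

```python
from collections import Counter

def crossed_margin(c_values, d_values, min_count):
    """Form C x D category labels and report cells with fewer than min_count sample cases."""
    crossed = list(zip(c_values, d_values))
    counts = Counter(crossed)
    # Cells with no sample cases at all do not appear in counts and are not reported here.
    sparse = {cell: n for cell, n in counts.items() if n < min_count}
    return crossed, sparse

# Hypothetical sample: C = age group, D = Hispanic origin.
age = ["0-17", "0-17", "0-17", "65+", "65+", "65+", "65+"]
hispanic = ["yes", "yes", "no", "no", "no", "no", "yes"]
crossed, sparse = crossed_margin(age, hispanic, min_count=2)
print(sparse)
```

The `crossed` labels would then serve as a single raking variable whose control totals are the C x D population totals; sparse cells point to where collapsing may be needed.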

9. Forming Control Totals for Quantity Variables<br />

In a specialized raking situation one is planning on raking a sample of persons on some categorical<br />

variables (e.g., age, sex, and race), but the source of the control totals also has a quantity variable<br />

related to, say, the total number of glasses of milk consumed in a week. The survey has also<br />

measured this same quantity variable; but the survey response rate is, let us assume, only 50%.<br />

One may want to ensure that the weighted total number of glasses of milk consumed per week from<br />

the sample agrees closely with the control total. This can be accomplished by dividing the sample<br />

into groups; each group will have a mean number of glasses of milk consumed in a week and a sum<br />

of weights. In the raking process one can modify the sum of the weights in each group so that the<br />

sum of the weights times the mean, summed over all the groups, adds to the control value of total<br />

glasses of milk consumed in a week. In the simplest application one can divide the sample into two<br />

groups: below versus above the median number of glasses of milk consumed in a week based on<br />

the control total data. For each group one can use the control data to obtain the total number of<br />

glasses of milk consumed in a week. This two-category margin is then added to the raking.<br />

Convergence may not occur, making it necessary to shift the group boundary point away from the<br />

median in order to achieve convergence. Once convergence is achieved the weighted total number<br />

of glasses of milk consumed in a week will be in close agreement with the control total value. This<br />

procedure may be extended to control not only the total over the entire sample but also the totals for various<br />

subpopulations.<br />

10. Raking at the State Level in a Large National Survey<br />

Some large surveys stratify by state and are designed to yield state estimates. The resulting total<br />

national sample is usually very large. The survey statisticians seek to provide national estimates as<br />

well as state estimates. Often one sets up raking control totals at the state level and carries out 51<br />

individual rakings. Assume those rakings use Variables A, B, and C; but the number of categories<br />

of each variable is limited because of the state sample sizes. For example, one might collapse<br />

Variables A, B, and C differently by state. If Variable A were race/ethnicity, one might be able to<br />

use Hispanic as a separate race/ethnicity category in California, but not in Vermont because of the<br />

small sample size. After the 51 rakings one might compare the weighted distributions of Variables A,<br />

B, and C with national control totals and observe some differences that are caused by the state-level<br />

collapsing of categories. If having precise weighted distributions at the national level is important<br />

for analytic or “face validity” reasons, one can use the IHB raking macro in the following manner.<br />

Set up a single raking that includes margins for State x A, State x B, and State x C (i.e., combine<br />

the 51 individual state rakings into a single raking). Then add detailed national margins for<br />

Variables A, B, and C. Another, similar example would involve adding Variable D as a national<br />

raking margin because its control total is available only at the national level (e.g., household<br />

income). This strategy needs to be implemented carefully. Checks should be made for raking<br />

variables whose categories contain small sample sizes. The coefficient of variation of the weights prior to raking<br />

and after raking should be examined in each state to check for large increases in the variability of<br />

the weights. Finally, the raking diagnostics discussed above should be used if convergence<br />

problems arise.<br />
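
The check on the coefficient of variation (CV) of the weights by state might look like the following sketch. The function name `cv_by_group`, the example weights, and the use of the population standard deviation (rather than the sample version) are assumptions made for illustration.

```python
import statistics
from collections import defaultdict

def cv_by_group(weights, groups):
    """Coefficient of variation (standard deviation / mean) of the weights within each group."""
    by_group = defaultdict(list)
    for weight, group in zip(weights, groups):
        by_group[group].append(weight)
    return {g: statistics.pstdev(ws) / statistics.mean(ws) for g, ws in by_group.items()}

# Hypothetical weights before and after raking for cases in two states.
states = ["VT", "VT", "VT", "VT", "CA", "CA"]
before = [100.0, 120.0, 80.0, 100.0, 300.0, 310.0]
after = [90.0, 150.0, 60.0, 100.0, 280.0, 330.0]
cv_before = cv_by_group(before, states)
cv_after = cv_by_group(after, states)
# Ratio of the CV after raking to the CV before raking, by state.
increase = {s: cv_after[s] / cv_before[s] for s in cv_before}
print(increase)
```

States with a ratio well above 1 are those in which the raking has increased the variability of the weights the most and deserve a closer look.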

11. Maintaining Prior Nonresponse and Noncoverage Adjustments in the Final Weights<br />

Frankel et al. (2003) have discussed methods based on data on interruptions in telephone service<br />

(of a week or longer in the past 12 months) to compensate for the exclusion of persons in<br />

nontelephone households in random-digit-dialing surveys. One typically adjusts the base sampling<br />

weights of persons with versus without an interruption in telephone service. The resulting<br />

interruption-based weight adjusts for the noncoverage of nontelephone households. If one then<br />

rakes the sample on age, sex, and race, the impact of the nontelephone adjustment may be diluted<br />

somewhat, even though the raking starts with the interruption-based weight. In that case it generally<br />

makes sense to create weighted control totals (using the interruption-based weight) from the sample<br />

for persons residing in households with versus without an interruption in telephone service. These<br />

weighted control totals should be ratio-adjusted so that they have the same sum as the age, sex, and<br />

race control totals. For example, if the age, sex, and race margins sum to 180,000,000 persons,<br />

then the interruption margin needs to be adjusted so that it also sums to 180,000,000. The raking<br />

would use the four variables instead of just three and would ensure that the nontelephone<br />

adjustment is fully reflected in the final weights. This would be appropriate where the interruption-in-telephone-service<br />

category could be small (e.g., in states where telephone coverage is very<br />

high), but one still wants to maintain that small category in the raking.<br />
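
The ratio adjustment of the interruption margin is simple arithmetic, sketched below. The function name `ratio_adjust` and the weighted sample totals are hypothetical; only the target of 180,000,000 comes from the example in the text.

```python
def ratio_adjust(weighted_totals, target_sum):
    """Scale weighted totals by a common factor so that they add to target_sum."""
    factor = target_sum / sum(weighted_totals.values())
    return {category: total * factor for category, total in weighted_totals.items()}

# Hypothetical totals formed from the sample with the interruption-based weight.
interruption_totals = {"interruption": 9_000_000.0, "no interruption": 141_000_000.0}
adjusted = ratio_adjust(interruption_totals, target_sum=180_000_000.0)
print(adjusted)
```

Because every category is multiplied by the same factor, the proportion of persons with an interruption in telephone service is unchanged; only the level is brought into line with the other margins.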

12. Raking Surveys that Screen for a Specific Target Population<br />

A common survey model for obtaining interviews with a specific target population is to screen a<br />

sample of households for the presence of members of the target population. An example would be<br />

children with special health care needs. The screening interview collects a roster of children with,<br />

say, their age, sex, and race, and determines whether each child has special health care needs. If<br />

the household contains one child with special health care needs, a detailed interview is conducted<br />

for that child. If the household has two or more such children, one is selected at random for the<br />

detailed interview. Of course, the interview response rate will be less than 100%, because some<br />

parents will not agree to do the detailed interview.<br />

Assume that the survey statisticians need to look at the prevalence of children with special health<br />

care needs, and they will also be analyzing the detailed interview data. In this situation one would<br />

calculate the usual base sampling weights, make adjustments for unit nonresponse and possibly<br />

make a noncoverage adjustment if warranted. One first obtains control totals for age, sex, and race<br />

in the U.S. population aged 0-17 years. One then rakes the entire sample of children in the<br />

screened households to those control totals, because that sample is a sample of children aged 0-17<br />

in the U.S. The resulting screener weights can then be used to estimate the prevalence of children<br />

with special health care needs in the U.S.<br />

That screener weight would typically serve as the input weight in the calculation of weights for the<br />

children with completed detailed interviews. As part of that calculation process one also seeks to<br />

weight the detailed-interview sample by age, sex, and race. Of course, control totals are unlikely to<br />

be available for children with special health care needs. One can, however, use the screener weight<br />

and the sample of children with special health care needs identified in the screened households to<br />

form weighted control totals for age, sex, and race and then use those in raking the detailed-interview<br />

weights. This method ensures that the survey analysts do not ask why the age<br />

distribution of children with special health care needs from the screener sample does not agree<br />

exactly with the distribution in the detailed interview data. Some caution needs to be exercised in<br />

using this approach when the screener shows survey evidence of false positives.<br />
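
Forming weighted control totals from the screener sample can be sketched as follows. The function name `screener_control_totals`, the record layout, and the example values are hypothetical; the idea of summing the screener weights of the identified children within each category is taken from the text.

```python
from collections import defaultdict

def screener_control_totals(records, category_key,
                            weight_key="weight", flag_key="special_needs"):
    """Sum screener weights by category over children flagged as having special needs."""
    totals = defaultdict(float)
    for record in records:
        if record[flag_key]:
            totals[record[category_key]] += record[weight_key]
    return dict(totals)

# Hypothetical screener roster with final screener weights.
records = [
    {"age": "0-5", "weight": 1200.0, "special_needs": True},
    {"age": "0-5", "weight": 900.0, "special_needs": False},
    {"age": "6-11", "weight": 1100.0, "special_needs": True},
    {"age": "6-11", "weight": 1000.0, "special_needs": True},
    {"age": "12-17", "weight": 950.0, "special_needs": False},
]
age_controls = screener_control_totals(records, "age")
print(age_controls)
```

The same call with a different `category_key` (for example, sex or race) would give the other margins used in raking the detailed-interview weights.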

13. Raking to Control Totals Expressed as Percentages and Raking with No “Input” Weight<br />

Frequently, the user working with a weighted or an unweighted sample needs to weight it to fit<br />

marginal population proportions. As an example (Table 3), the authors created an 11-case sample<br />

data set that contains two variables: VAR1, which takes values 1, 2, and 3 with frequencies<br />

27.27%, 45.45% and 27.27%, respectively; and VAR2, which takes values 1 and 2 with<br />

frequencies 45.45% and 54.55%, respectively. The objective was to weight this sample so that the<br />

distributions of VAR1 and VAR2 met the population distributions, (20%, 35%, 45%) and (60%,<br />

40%), respectively, within a tolerance of 0.001%.<br />
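
A small Python sketch of raking to percentage targets with no input weight is given below. It is a generic iterative proportional fitting loop, not the IHB macro. The marginal frequencies (3/5/3 and 5/6 cases out of 11) and the targets match the description above, but the cross-classification of the 11 cases is hypothetical, as are the function names and the choice to treat the percentages as control totals that sum to 100.

```python
def margin_totals(weights, cases, var):
    """Sum the current weights within each category of one raking variable."""
    totals = {}
    for w, case in zip(weights, cases):
        totals[case[var]] = totals.get(case[var], 0.0) + w
    return totals

def rake_to_percentages(cases, targets, tolerance=0.001, max_iter=50):
    """Rake unit input weights so that each margin matches its target percentages."""
    weights = [1.0] * len(cases)  # no input weight: every case starts at 1
    for iteration in range(1, max_iter + 1):
        for var, target in targets.items():
            totals = margin_totals(weights, cases, var)
            weights = [w * target[case[var]] / totals[case[var]]
                       for w, case in zip(weights, cases)]
        # After a full cycle, find the largest gap in percentage points.
        worst = max(
            abs(margin_totals(weights, cases, var).get(cat, 0.0) - pct)
            for var, target in targets.items()
            for cat, pct in target.items()
        )
        if worst < tolerance:
            return weights, iteration
    return weights, None  # did not converge within max_iter

# Hypothetical cross-classification consistent with the stated marginal frequencies.
cells = {(1, 1): 1, (1, 2): 2, (2, 1): 2, (2, 2): 3, (3, 1): 2, (3, 2): 1}
cases = [{"VAR1": v1, "VAR2": v2}
         for (v1, v2), n in cells.items() for _ in range(n)]
targets = {"VAR1": {1: 20.0, 2: 35.0, 3: 45.0},
           "VAR2": {1: 60.0, 2: 40.0}}
weights, n_iter = rake_to_percentages(cases, targets)
print(n_iter, margin_totals(weights, cases, "VAR1"))
```

The resulting weights sum to 100; multiplying them by the population size divided by 100 would express them as population counts.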

14. Weight Trimming and Raking<br />

Weight trimming refers to truncation of high or extreme weight values in order to reduce their<br />

impact on the variance of the estimates, especially for subgroup estimates. One consequence of the<br />

truncation of high weight values is that the weights of the entire sample will not add to the<br />

population size. Although weight trimming is a separate topic from raking, the two are certainly<br />

related in the sense that weight trimming typically takes place at the last step in the calculations,<br />

which is often raking. Many large surveys use weight trimming (Srinath 2003, Abt Associates<br />

memorandum). Its objective is to reduce the mean squared error (MSE) of the key outcome estimates. By<br />

trimming high weight values one generally lowers sampling variability but may incur some bias.<br />

The MSE will be lower if the reduction in variance is large relative to the increase in bias arising<br />

from weight trimming. There are no established rules for weight trimming; rather most people use<br />

a general set of guidelines. Some common truncation points are: 1) the median weight plus five or<br />

six times the interquartile range (IQR) of the weights, 2) five times the mean weight, 3) the 95th<br />

percentile of the weights.<br />

How can weight trimming be incorporated in raking? The IHB SAS macro can be used for weight<br />

trimming in the following steps (using as an example the median weight plus six times the IQR as<br />

the truncation point) [1]:<br />

1. Prior to raking i, where i references the number of times the raking is run, examine the<br />

distribution of the raking “input” weight and calculate the median weight plus six times the<br />

interquartile range (IQR) of the weights.<br />

2. Truncate values of the input weight that are above the median weight plus six times the<br />

IQR plus one to the median weight plus six times the IQR (values at or below the median<br />

weight plus six times the IQR plus one are not altered).<br />

3. Using the truncated input weight, run the raking to obtain raking weight i.<br />

4. Repeat Steps 1 to 3 (i.e., run the raking a second time, third time, etc.) until there are no<br />

weights that are above the median weight plus six times the IQR plus one.<br />

Although the cutoff value equals the median weight plus six times the IQR, only weights that exceed<br />

that value plus one are truncated (to the cutoff itself). The margin of one is used<br />

because the raking may increase the weight values of the cases that have been truncated;<br />

without it, the raking steps could repeat endlessly. The approach described above does not<br />

guarantee convergence (i.e., after running the raking several times there could still be weights<br />

above the median weight plus six times the IQR plus one), and one could consider adding a larger<br />

constant to increase the chances of convergence, but the authors have found in their applications<br />

that convergence is often achieved by adding a constant of one. Table 4 shows an example of the<br />

use of weight trimming with raking. Before raking there are four cases with “input” weights that<br />

exceed the condition value of 439.847 (the median weight plus six times the IQR, plus one). The weights of<br />

those cases are truncated to 438.847 (cutoff) and the raking is run for the first time. After the first<br />

raking the condition equals 444.490. Only one case has a weight that exceeds this value and that<br />

weight is truncated to the cutoff of 443.490. After the second raking no cases have a weight that<br />

exceeds the condition and the process is stopped. The weights from the second raking add to the<br />

population size and meet the raking control totals.<br />

[1] A somewhat more sophisticated, but computer-intensive, procedure is to apply bounds to the weights as the<br />

raking is taking place.<br />
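
Steps 1 and 2 of the trimming procedure (computing the cutoff and the condition, then truncating) can be sketched as below; the raking in Step 3 and the repetition in Step 4 are left to the caller. The function name `trim_weights` and the example weights are hypothetical, and the quartiles come from Python's `statistics.quantiles` with its default method, which may differ slightly from the percentile definition used in SAS.

```python
import statistics

def trim_weights(weights, iqr_multiplier=6.0, constant=1.0):
    """Truncate weights above (median + multiplier * IQR + constant) to (median + multiplier * IQR)."""
    q1, median, q3 = statistics.quantiles(weights, n=4)
    cutoff = median + iqr_multiplier * (q3 - q1)
    condition = cutoff + constant  # only weights above this value are truncated
    n_trimmed = sum(1 for w in weights if w > condition)
    trimmed = [cutoff if w > condition else w for w in weights]
    return trimmed, cutoff, n_trimmed

# Hypothetical input weights with one extreme value.
input_weights = [100.0, 110.0, 120.0, 130.0, 140.0,
                 150.0, 160.0, 170.0, 180.0, 5000.0]
trimmed, cutoff, n_trimmed = trim_weights(input_weights)
print(cutoff, n_trimmed)
```

In a full run one would rake the trimmed weights, call `trim_weights` again on the raked weights, and stop once `n_trimmed` is zero.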

15. Summary<br />

The authors have sought to give some background on how raking works and to discuss the<br />

convergence process. They have also sought to give some warnings of conditions that need to be<br />

checked before and after raking. Brick et al. (2003) discuss other examples of issues that one<br />

should be aware of when using raking. The IHB SAS macro discussed in this paper is available for<br />

free from the first author.<br />

References<br />

Bishop YMM, Fienberg SE, and Holland PW. (1975). Discrete Multivariate Analysis: Theory and<br />

Practice. Cambridge, MA: MIT Press.<br />

Brick JM, Montaquila J, and Roth S. (2003). Identifying Problems with Raking Estimators. 2003<br />

Proceedings of the Annual Meeting of the American Statistical Association [CD-ROM],<br />

Alexandria, VA: American Statistical Association, pp. 710-717.<br />

Deming WE. (1943). Statistical Adjustment of Data. New York: Wiley.<br />

Frankel MR, Srinath KP, Hoaglin DC, Battaglia MP, Smith PJ, Wright RA, and Khare M. (2003).<br />

Adjustments for non-telephone bias in random-digit-dialling surveys. Statistics in Medicine,<br />

Volume 22, pp. 1611-1626.<br />

Izrael D, Hoaglin DC, and Battaglia MP. (2000). A SAS Macro for Balancing a Weighted Sample.<br />

Proceedings of the Twenty-Fifth Annual SAS Users Group International Conference, Cary, NC:<br />

SAS Institute Inc., pp. 1350-1355.<br />

Izrael D, Hoaglin DC, and Battaglia MP. (2004). To Rake or Not To Rake Is Not the Question<br />

Anymore with the Enhanced Raking Macro. May 2004 SUGI Conference, Montreal, Canada.<br />

Kalton G. (1983). Compensating for Missing Survey Data. Survey Research Center, Institute for<br />

Social Research, University of Michigan.<br />

Oh HL and Scheuren F. (1978). Some Unresolved Application Issues in Raking Ratio Estimation.<br />

1978 Proceedings of the Section on Survey Research Methods, Washington, DC: American<br />

Statistical Association, pp. 723-728.<br />

Table 1. A 2 x 2 Table for Which Raking Cannot Produce Agreement with the Control Totals<br />

                          Variable 2
Variable 1                1      2      Marginal Control Total
1                         20     0      70
2                         0      10     30
Marginal Control Total    50     50     100

Table 2. Example of Rak<strong>in</strong>g Us<strong>in</strong>g the IHB SAS Macro<br />

Rak<strong>in</strong>g AREA - 1 VARIABLE1, iteration - 1<br />

Marg<strong>in</strong>al<br />

Marg<strong>in</strong>al<br />

Calculated Control Calculated Control Difference<br />

VARIABLE1 marg<strong>in</strong> Total Difference % % <strong>in</strong> %<br />

1 15915.87 22154.39 6238.52 35.486 35.278 0.209<br />

2 10912.05 16533.88 5621.83 24.330 26.328 -1.998<br />

3 18022.90 24112.03 6089.13 40.184 38.395 1.789<br />

========== ======== ========== ========<br />

44850.82 62800.30 100.00 100.00<br />

Rak<strong>in</strong>g AREA - 1 VARIABLE2, iteration - 1<br />

Marg<strong>in</strong>al<br />

Marg<strong>in</strong>al<br />

Calculated Control Calculated Control Difference<br />

VARIABLE2 marg<strong>in</strong> Total Difference % % <strong>in</strong> %<br />

1 32684.74 30697.33 -1987.40 52.046 48.881 3.165<br />

2 30115.56 32102.97 1987.40 47.954 51.119 -3.165<br />

========== ======== ========== ========<br />

62800.30 62800.30 100.00 100.00<br />

Rak<strong>in</strong>g AREA - 1 VARIABLE1, iteration - 2<br />

Marg<strong>in</strong>al<br />

Marg<strong>in</strong>al<br />

Calculated Control Calculated Control Difference<br />

VARIABLE1 marg<strong>in</strong> Total Difference % % <strong>in</strong> %<br />

1 22102.81 22154.39 51.586 35.195 35.278 -0.082<br />

2 16442.32 16533.88 91.553 26.182 26.328 -0.146<br />

3 24255.17 24112.03 -143.139 38.623 38.395 0.228<br />

========== ======== ========== ========<br />

62800.30 62800.30 100.00 100.00<br />

Rak<strong>in</strong>g AREA - 1 VARIABLE2, iteration - 2<br />

Marg<strong>in</strong>al<br />

Marg<strong>in</strong>al<br />

Calculated Control Calculated Control Difference<br />

VARIABLE2 marg<strong>in</strong> Total Difference % % <strong>in</strong> %<br />

1 30708.98 30697.33 -11.6455 48.899 48.881 0.019<br />

2 32091.32 32102.97 11.6455 51.101 51.119 -0.019<br />

========== ======== ========== ========<br />

62800.30 62800.30 100.00 100.00<br />

Raking AREA - 1 VARIABLE1, iteration - 3
            Calculated       Marginal               Calculated   Marginal  Difference
VARIABLE1       margin  Control Total   Difference           %  Control %        in %
1             22154.09       22154.39      0.29992      35.277     35.278      -0.000
2             16533.34       16533.88      0.53717      26.327     26.328      -0.001
3             24112.87       24112.03     -0.83710      38.396     38.395       0.001
            ==========       ========               ==========   ========
              62800.30       62800.30                   100.00     100.00

Raking AREA - 1 VARIABLE2, iteration - 3
            Calculated       Marginal               Calculated   Marginal  Difference
VARIABLE2       margin  Control Total   Difference           %  Control %        in %
1             30697.40       30697.33    -0.068148      48.881     48.881       0.000
2             32102.90       32102.97     0.068148      51.119     51.119      -0.000
            ==========       ========               ==========   ========
              62800.30       62800.30                   100.00     100.00

**** Program for AREA 1 terminated at iteration 3 because all calculated margins differ from Marginal Control Totals by less than 1

Raking AREA - 2 VARIABLE1, iteration - 1
            Calculated       Marginal               Calculated   Marginal  Difference
VARIABLE1       margin  Control Total   Difference           %  Control %        in %
1             31377.80       38598.04      7220.24      41.734     37.292       4.441
2             17512.57       29596.11     12083.54      23.292     28.595      -5.303
3             26295.48       35307.30      9011.82      34.974     34.113       0.861
            ==========      =========               ==========   ========
              75185.84      103501.44                   100.00     100.00

Raking AREA - 2 VARIABLE2, iteration - 1
            Calculated       Marginal               Calculated   Marginal  Difference
VARIABLE2       margin  Control Total   Difference           %  Control %        in %
1             51930.05       51902.14     -27.9123      50.173     50.146       0.027
2             51571.39       51599.30      27.9123      49.827     49.854      -0.027
            ==========      =========               ==========   ========
             103501.44      103501.44                   100.00     100.00

Raking AREA - 2 VARIABLE1, iteration - 2
            Calculated       Marginal               Calculated   Marginal  Difference
VARIABLE1       margin  Control Total   Difference           %  Control %        in %
1             38596.66       38598.04      1.37510      37.291     37.292      -0.001
2             29599.80       29596.11     -3.69114      28.598     28.595       0.004
3             35304.98       35307.30      2.31605      34.111     34.113      -0.002
            ==========      =========               ==========   ========
             103501.44      103501.44                   100.00     100.00

Raking AREA - 2 VARIABLE2, iteration - 2
            Calculated       Marginal               Calculated   Marginal  Difference
VARIABLE2       margin  Control Total   Difference           %  Control %        in %
1             51902.75       51902.14     -0.61296      50.147     50.146       0.001
2             51598.69       51599.30      0.61296      49.853     49.854      -0.001
            ==========      =========               ==========   ========
             103501.44      103501.44                   100.00     100.00

Raking AREA - 2 VARIABLE1, iteration - 3
            Calculated       Marginal               Calculated   Marginal  Difference
VARIABLE1       margin  Control Total   Difference           %  Control %        in %
1             38598.01       38598.04     0.030193      37.292     37.292      -0.000
2             29596.19       29596.11    -0.081052      28.595     28.595       0.000
3             35307.25       35307.30     0.050859      34.113     34.113      -0.000
            ==========      =========               ==========   ========
             103501.44      103501.44                   100.00     100.00

Raking AREA - 2 VARIABLE2, iteration - 3
            Calculated       Marginal               Calculated   Marginal  Difference
VARIABLE2       margin  Control Total   Difference           %  Control %        in %
1             51902.15       51902.14    -0.013460      50.146     50.146       0.000
2             51599.29       51599.30     0.013460      49.854     49.854      -0.000
            ==========      =========               ==========   ========
             103501.44      103501.44                   100.00     100.00

**** Program for AREA 2 terminated at iteration 3 because all calculated margins differ from Marginal Control Totals by less than 1
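As a reading aid, the quantities in each block of the listing above can be recomputed as follows. This is a minimal Python sketch, not the SAS macro that produced the output; the function names `margin_report` and `converged` are hypothetical, and each category's calculated margin is passed as a single weight for brevity. The stopping rule mirrors the termination message: every calculated margin must differ from its control total by less than 1.

```python
from collections import defaultdict

def margin_report(categories, weights, controls):
    """Build one row per category:
    (category, calculated margin, control total, difference,
     calculated %, control %, difference in %)."""
    calculated = defaultdict(float)
    for cat, w in zip(categories, weights):
        calculated[cat] += w
    calc_total = sum(calculated.values())
    ctrl_total = sum(controls.values())
    rows = []
    for cat in sorted(controls):
        calc, ctrl = calculated[cat], controls[cat]
        calc_pct = 100.0 * calc / calc_total
        ctrl_pct = 100.0 * ctrl / ctrl_total
        # The listing reports control minus calculated for totals,
        # and calculated minus control for percentages.
        rows.append((cat, calc, ctrl, ctrl - calc,
                     calc_pct, ctrl_pct, calc_pct - ctrl_pct))
    return rows

def converged(rows, tolerance=1.0):
    """True when every margin is within `tolerance` of its control total."""
    return all(abs(row[3]) < tolerance for row in rows)

# Control totals and calculated margins for AREA 1, VARIABLE2 (from the listing).
controls = {1: 30697.33, 2: 32102.97}
iteration_1 = margin_report([1, 2], [32684.74, 30115.56], controls)
iteration_3 = margin_report([1, 2], [30697.40, 32102.90], controls)
print(converged(iteration_1), converged(iteration_3))
```

With these inputs the first row of `iteration_1` gives a difference of about -1987.4 and a percentage difference of about 3.165, in line with the listing; only the iteration 3 margins satisfy the tolerance.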

Figure 1. Convergence of a Raking Process Involving 12 Variables
[Plot not reproduced. Horizontal axis: 0 to 51; vertical axis: -2 to 6; legend (variable): AAA, BBB, CCC, DDD, EEE, FFF, GGG, HHH, III, JJJ, KKK, LLL.]

Figure 2. Convergence of Variables EEE and JJJ before Collapsing
[Plots not reproduced. Two panels, Variable EEE and Variable JJJ; horizontal axis: 0 to 50; legend (category): 1, 2, 3, 4, 5.]

Figure 3. Convergence of Variables EEE and JJJ after Collapsing
[Plots not reproduced. Two panels, Variable EEE and Variable JJJ; horizontal axis: 0 to 50; legend (category): 1, 3, 5. In both panels, categories 2 and 4 are collapsed into 1 and 5, respectively.]
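Figures 2 and 3 differ only in that categories 2 and 4 of EEE and JJJ are merged into categories 1 and 5 before raking. A minimal Python sketch of such a collapse follows, assuming the merge has to be applied to both the case-level codes and the control totals; the function name `collapse` and the control values are hypothetical and are not taken from the data behind the figures.

```python
def collapse(categories, controls, mapping):
    """Merge categories according to `mapping` (old category -> new category).

    categories -- case-level category codes
    controls   -- dict: category -> control total (or control percent)
    Returns the recoded categories and the summed control totals.
    """
    recoded = [mapping.get(cat, cat) for cat in categories]
    merged = {}
    for cat, total in controls.items():
        target = mapping.get(cat, cat)
        merged[target] = merged.get(target, 0.0) + total
    return recoded, merged

# Hypothetical five-category variable; 2 -> 1 and 4 -> 5 as in Figure 3.
mapping = {2: 1, 4: 5}
categories = [1, 2, 3, 4, 5, 2]
controls = {1: 10.0, 2: 5.0, 3: 30.0, 4: 15.0, 5: 40.0}
recoded, merged = collapse(categories, controls, mapping)
print(recoded)  # [1, 1, 3, 5, 5, 1]
print(merged)   # {1: 15.0, 3: 30.0, 5: 55.0}
```

Summing the control totals of the merged categories keeps the overall control total unchanged, so the collapsed variable can be raked exactly as before.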

Figure 4. Convergence of All 12 Variables in the Raking Process after Collapsing Variables EEE and JJJ
[Plot not reproduced. Horizontal axis: 0 to 51; vertical axis: -3 to 6; legend (variable): AAA, BBB, CCC, DDD, EEE, FFF, GGG, HHH, III, JJJ, KKK, LLL.]

Figure 5. Prediction of the Number of Iterations Needed for Convergence
[Plot not reproduced. Horizontal axis: 0 to 66; vertical axis: -1 to 6; legend (variable): AAA, BBB; annotation: predicted number of iterations for convergence: 65.]

Table 3. Raking Using Marginal Percentage Controls and No “Input” Weight (first and last iteration shown).

The FREQ Procedure

                                 Cumulative    Cumulative
VAR1    Frequency     Percent     Frequency       Percent
---------------------------------------------------------
1               3       27.27             3         27.27
2               5       45.45             8         72.73
3               3       27.27            11        100.00

                                 Cumulative    Cumulative
VAR2    Frequency     Percent     Frequency       Percent
---------------------------------------------------------
1               5       45.45             5         45.45
2               6       54.55            11        100.00

Raking VAR1, iteration - 1
            Calculated       Marginal               Calculated   Marginal  Difference
VAR1            margin  Control Total   Difference           %  Control %        in %
1                    3             20           17      27.273     20.000       7.273
2                    5             35           30      45.455     35.000      10.455
3                    3             45           42      27.273     45.000     -17.727
            ==========       ========               ==========   ========
                    11            100                   100.00     100.00

Raking VAR2, iteration - 1
            Calculated       Marginal               Calculated   Marginal  Difference
VAR2            margin  Control Total   Difference           %  Control %        in %
1               42.667             60      17.3333      42.667     60.000     -17.333
2               57.333             40     -17.3333      57.333     40.000      17.333
            ==========       ========               ==========   ========
               100.000            100                   100.00     100.00

Raking VAR1, iteration - 5
            Calculated       Marginal               Calculated   Marginal  Difference
VAR1            margin  Control Total   Difference           %  Control %        in %
1               20.000             20  0.000256716      20.000     20.000      -0.000
2               35.001             35  -.000834329      35.001     35.000       0.001
3               44.999             45  0.000577612      44.999     45.000      -0.001
            ==========       ========               ==========   ========
               100.000            100                   100.00     100.00

Raking VAR2, iteration - 5
            Calculated       Marginal               Calculated   Marginal  Difference
VAR2            margin  Control Total   Difference           %  Control %        in %
1               60.000             60  0.000205597      60.000     60.000      -0.000
2               40.000             40  -.000205597      40.000     40.000       0.000
            ==========       ========               ==========   ========
               100.000            100                   100.00     100.00

**** Program terminated at iteration 5 because all Calculated Percents differ from Marginal Percents by less than 0.001
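The run in Table 3 can be retraced with a short script. The following is a minimal Python sketch, not the SAS macro itself; the function name `rake_to_percents` is hypothetical. The cell counts are an assumption inferred from the listing: they are the only integer cross-tabulation consistent with the VAR1 and VAR2 frequencies and with the VAR2 margin of 42.667 after the first VAR1 adjustment. Every case starts with a weight of 1, and the loop stops once all calculated percents, measured just before each adjustment, are within 0.001 of the control percents.

```python
def rake_to_percents(cases, controls, tolerance=0.001, max_iter=50):
    """Rake unit starting weights to percentage controls.

    cases    -- list of dicts mapping variable name -> category
    controls -- dict: variable name -> {category: control percent}
    Returns (weights, number of iterations used).
    """
    weights = [1.0] * len(cases)  # no input weight: every case starts at 1
    for iteration in range(1, max_iter + 1):
        worst = 0.0
        for var, targets in controls.items():
            # Weighted margin of this variable before adjustment.
            margin = {cat: 0.0 for cat in targets}
            for case, w in zip(cases, weights):
                margin[case[var]] += w
            total = sum(margin.values())
            for cat, target in targets.items():
                worst = max(worst, abs(100.0 * margin[cat] / total - target))
            # Scale each case so the margin equals the control percent.
            weights = [w * targets[case[var]] / margin[case[var]]
                       for case, w in zip(cases, weights)]
        if worst < tolerance:
            return weights, iteration
    raise RuntimeError("raking did not converge")

# Assumed cell counts (VAR1, VAR2) -> number of cases; inferred, not in the source.
counts = {(1, 1): 1, (1, 2): 2, (2, 1): 3, (2, 2): 2, (3, 1): 1, (3, 2): 2}
cases = [{"VAR1": a, "VAR2": b} for (a, b), n in counts.items() for _ in range(n)]
controls = {"VAR1": {1: 20.0, 2: 35.0, 3: 45.0}, "VAR2": {1: 60.0, 2: 40.0}}

weights, iterations = rake_to_percents(cases, controls)
print(iterations, round(sum(weights), 3))
```

Because the control percents sum to 100, the raked weights also sum to 100 after the first adjustment, which is why the calculated margins and calculated percents coincide from that point on in the listing.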

Table 4: Example of Weight Trimming During Raking

OBSERVATIONS IN ORIGINAL DATASET TO BE TRUNCATED
CUTOFF: MEDIAN+6*IQR
CONDITION: MEDIAN+6*IQR+1

  id  weight_to_truncate     mean   median      IQR   cutoff  condition
 715             477.576  144.250  132.491  51.0592  438.847    439.847
 651             509.018  144.250  132.491  51.0592  438.847    439.847
1085             690.762  144.250  132.491  51.0592  438.847    439.847
 770             515.720  144.250  132.491  51.0592  438.847    439.847

OBSERVATIONS TO BE TRUNCATED AFTER ITERATION = 1
CUTOFF: MEDIAN+6*IQR
CONDITION: MEDIAN+6*IQR+1

  id    truncated_weight     mean   median      IQR   cutoff  condition
1085             451.059  144.250  133.108  51.7302  443.490    444.490

OBSERVATIONS TO BE TRUNCATED AFTER ITERATION = 2
CUTOFF: MEDIAN+6*IQR
CONDITION: MEDIAN+6*IQR+1

THERE ARE NO WEIGHTS TO TRUNCATE

RERAKING-TRUNCATION PROCESS CONVERGED IN 2 ITERATIONS WITH CONDITION MEDIAN+6*IQR+1.
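The trimming rule in Table 4 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SAS macro: the names `trim_once` and `rerake_with_trimming` are hypothetical, weights above the condition are assumed to be set to the cutoff, the `rake` argument stands in for any reraking routine, and the quartiles use Python's "inclusive" method, which may differ slightly from the percentile definition used in SAS.

```python
import statistics

def trim_once(weights, multiplier=6.0, slack=1.0):
    """Truncate weights above median + multiplier*IQR + slack to the cutoff.

    Returns (new weights, cutoff, number of weights truncated).
    """
    q1, _, q3 = statistics.quantiles(weights, n=4, method="inclusive")
    cutoff = statistics.median(weights) + multiplier * (q3 - q1)
    condition = cutoff + slack
    flagged = [w > condition for w in weights]
    trimmed = [cutoff if hit else w for w, hit in zip(weights, flagged)]
    return trimmed, cutoff, sum(flagged)

def rerake_with_trimming(weights, rake, max_rounds=20):
    """Alternate reraking and trimming until no weight exceeds the condition.

    `rake` is any callable that takes a list of weights and returns
    the reraked weights (an assumed interface).
    """
    weights, _, _ = trim_once(weights)      # truncate the original weights
    for round_no in range(1, max_rounds + 1):
        weights = rake(weights)             # restore the control margins
        weights, _, n_trimmed = trim_once(weights)
        if n_trimmed == 0:
            return weights, round_no
    raise RuntimeError("reraking-truncation did not converge")

# Hypothetical weights with one extreme value.
weights = [100.0, 110.0, 120.0, 130.0, 140.0, 150.0, 160.0, 1000.0]
trimmed, cutoff, n_trimmed = trim_once(weights)
print(cutoff, n_trimmed, max(trimmed))
```

As a check against the listing, the median and IQR in the first block of Table 4 give 132.491 + 6 × 51.0592 ≈ 438.846, which agrees with the printed cutoff of 438.847 up to rounding of the inputs.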
