
Summer Methods Workshop:<br />

Presenting Statistical Results<br />

<strong>Daina</strong> <strong>Chiba</strong><br />

d.chiba@rice.edu<br />

June 16, 2010<br />

Contents<br />

1 Why you want to learn both Stata and R<br />

2 Stata<br />

2.1 Producing Regression Tables Efficiently<br />

2.1.1 DOs and DON’Ts<br />

2.1.2 Tables with Multiple Columns<br />

2.1.3 Exporting Stata Results to a LaTeX document<br />

2.1.4 Exporting Stata Results to Word/Powerpoint documents<br />

2.2 Additional Tips<br />

2.2.1 Rescaling<br />

2.2.2 Putting Stars<br />

2.3 Exporting Stata Results to R<br />

3 R<br />

3.1 Text Editor<br />

3.1.1 How to Install Tinn-R (PC users)<br />

3.1.2 How to Use Tinn-R (PC users)<br />

3.1.3 How to install Aquamacs (Mac users)<br />

3.1.4 How to use Aquamacs (Mac users)<br />

3.2 Calculating Substantive Effects (a.k.a., CLARIFY “by hand”)<br />

3.3 Plotting Substantive Effects<br />

3.4 Plotting Substantive Effects: When X is a Discrete Variable...<br />

3.5 Descriptive Statistics: Plotting (Im)Balance<br />

3.6 Plotting Regression Coefficients<br />


1 Why you want to learn both Stata and R<br />

• The field of Political Science is making a gradual transition from Stata to R.<br />

• It may be the case that your co-author can use only one of them.<br />

• Stata’s advantage<br />

– Fast & easy<br />

– Some methods are available only in Stata (e.g., selection models)<br />

• R’s advantage<br />

– Flexible graphic devices<br />

– Able to handle multiple data sets simultaneously.<br />

– Some statistical models are available only in R (e.g., IRT models, Bayesian)<br />

2 Stata<br />

2.1 Producing Regression Tables Efficiently<br />

Although some professional journals encourage authors to use figures instead of tables in reporting statistical results, some people still prefer old-fashioned tables to figures (and they can be your co-author!). So, let us first see how to make a publication-quality regression table efficiently.<br />

The table on the next page is a screenshot of a regression table from Russett & Oneal (2005, 299). 1 We will see how to automate the process of making a regression table like this in Stata.<br />

2.1.1 DOs and DON’Ts<br />

• Don’t copy and paste from Stata’s Results Window (or a log file) to Word or Excel.<br />

• Automate the process as much as possible to avoid human error and save time.<br />

2.1.2 Tables with Multiple Columns<br />

If you want to produce a regression table with multiple columns, use esttab, which is part of the estout package (ssc install estout). Try running the following chunk of the do-file (in 1_StataPart.do):<br />

1 The paper itself is included in the workshop packet. Find /resources/ONealRussett 2005.pdf<br />



Stata code<br />

use "output/smw06162010dataL.dta", clear<br />

global covariates "smldem lrgdem smldep allies lncaprat dircont lndstab majpower"<br />

logit mzmid1 $covariates systsize midpy*, cluster(dyadid)<br />

estimates store allmids<br />

logit mzfatald1 $covariates systsize fatalpy*, cluster(dyadid)<br />

estimates store fatalmids<br />

logit mzcowwar1 $covariates systsize warpy*, cluster(dyadid)<br />

estimates store wars<br />

Now we have run the three models shown in TABLE 1 and stored each set of results in memory. Let’s display the three models side by side. To do this, run the following<br />

Stata code<br />

esttab allmids fatalmids wars, ///<br />

b(%10.3f) se scalars("ll Log lik." "chi2 Chi-squared") ///<br />

label mtitles keep($covariates _cons)<br />

The first line tells Stata to show the three models we estimated above (allmids, fatalmids, wars). The second line requests coefficients rounded to three decimal places, b(%10.3f); standard errors, se; and some other scalar values of interest (the maximized value of the log-likelihood function and χ²; N is reported by default). The third line uses variable labels as row names (label), uses the stored model names as column titles (mtitles), and keeps only the listed covariates and the constant (keep).<br />


You should be seeing a table like this in your Stata window.<br />

Compare the table in the original paper with the one in your Stata window. Can you find a mistake that Oneal and Russett made?<br />

I suspect that O & R copied and pasted the results from the Stata window into a Word document, and somewhere in the process, they accidentally flipped the sign of one coefficient. If a mistake is this easy to find, you might expect one of the authors, editors, reviewers, or the publisher to catch it (although in this case nobody did). But the mistakes you make may be less conspicuous and more consequential. So, again, don’t use copy & paste!<br />


2.1.3 Exporting Stata Results to a LaTeX document<br />

If you use LaTeX, incorporating Stata results is very easy, and the process can be almost completely automated. First, run the following<br />

Stata code<br />

esttab allmids fatalmids wars using "handout/ro2005tbl.tex", tex replace ///<br />

b(%10.3f) se scalars("ll Log lik." "chi2 Chi-squared") ///<br />

label mtitles keep($covariates _cons) ///<br />

title("Models of the onset of militarized interstate disputes, fatal disputes, and war, 1885--2001")<br />

Running this chunk of code will create (or replace) a text file ro2005tbl.tex under the handout folder. To insert the created table into your paper, add the following line to the LaTeX source file:<br />

\input{ro2005tbl.tex}<br />

Then you will have the following table (Table 1).<br />

Every time you change your statistical model and re-estimate it, rerunning the do-file and recompiling the LaTeX source will regenerate a publication-quality table for you automatically. Days of copying and pasting are over.<br />


Table 1: Models of the onset of militarized interstate disputes, fatal disputes, and war, 1885–2001<br />

(1) (2) (3)<br />

allmids fatalmids wars<br />

Lower democracy -0.069 ∗∗∗ -0.096 ∗∗∗ -0.162 ∗∗∗<br />

(0.008) (0.017) (0.030)<br />

Higher democracy 0.038 ∗∗∗ 0.038 ∗∗∗ 0.043<br />

(0.007) (0.011) (0.024)<br />

Lower trade-to-GDP ratio -32.262 ∗∗∗ -95.833 ∗∗∗ -45.763<br />

(8.987) (25.895) (26.878)<br />

Allies 0.075 -0.199 -0.562<br />

(0.101) (0.178) (0.346)<br />

Capability ratio (log) -0.284 ∗∗∗ -0.410 ∗∗∗ -0.754 ∗∗∗<br />

(0.030) (0.049) (0.096)<br />

Contiguous 1.127 ∗∗∗ 1.151 ∗∗∗ 0.982 ∗∗<br />

(0.152) (0.249) (0.357)<br />

Distance (log) -0.289 ∗∗∗ -0.466 ∗∗∗ -0.365 ∗<br />

(0.053) (0.081) (0.161)<br />

Major power 1.007 ∗∗∗ 1.118 ∗∗∗ 2.240 ∗∗∗<br />

(0.133) (0.227) (0.398)<br />

Constant -0.489 -0.643 -3.176 ∗<br />

(0.406) (0.618) (1.267)<br />

Observations 464953 464692 464953<br />

Log lik. -8118.156 -2926.785 -757.367<br />

Chi-squared 3861.950 1345.308 472.660<br />

Standard errors in parentheses<br />

∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001<br />



2.1.4 Exporting Stata Results to Word/Powerpoint documents<br />

Microsoft Word is NOT good software for scientific documentation, in the sense that you have to do a significant part of the otherwise automated process “by hand,” which not only increases the risk of making errors but also requires laborious copying, pasting, and editing. So, I strongly encourage you to learn how to use LaTeX. One first step would be to play with the source file of this handout (handout/handout.tex). You should be able to create a pdf from the source file.<br />

But, again, you may co-author with your superior, who would not want to learn LaTeX and would simply order you to create a fancy table in Word or PowerPoint. Here is a way to minimize the copying and pasting and the risk of making errors in the process.<br />

First, run the following<br />

Stata code<br />

estout allmids fatalmids wars using "handout/ro2005tbl.txt", replace ///<br />

cells(b(star fmt(%9.3f)) se(par fmt(%9.3f))) style(fixed) ///<br />

stats(N, fmt(a1 %9.3f %9.3f))<br />

Running this chunk of code will create (or replace) a text file ro2005tbl.txt under the handout folder.<br />

1. Open Microsoft Excel.<br />

2. Go to “File” and choose “Open” (or, hit Control + O on PC or Command + O on Mac)<br />

3. Find ro2005tbl.txt. You may have to change the “Files of type” option.<br />

4. Specify the numerical columns as “Text” not “General” (Otherwise, Excel will recognize<br />

the numbers in parentheses as negative.)<br />

5. Format as you like, and copy and paste it into a Word or PowerPoint document.<br />



2.2 Additional Tips<br />

2.2.1 Rescaling<br />

If one of your coefficients exceeds 10 in absolute value, consider rescaling the variable. If it exceeds 1,000, you simply must rescale it. Coefficients greater than 1,000,000 are not only ugly but also misleading.<br />

There are several ways to rescale a variable, none of which changes the statistical results substantively.<br />

• Multiply the variable by some number (a natural choice would be 10^x with x = {−3, −2, −1, 1, 2, 3}, for example).<br />

• Standardize the variable (subtract its mean and then divide by its standard deviation). After standardization, the variable has mean 0 and unit variance.<br />

You can apply either one of the above to some or all of the variables included in your regression.<br />
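The claim that rescaling leaves the substance untouched can be checked on simulated data. The following is a minimal R sketch, not part of the workshop code: the data are simulated, and the names x, y, fit.orig, and fit.resc are illustrative.<br />

```r
# Toy data (simulated, not the workshop data): one regressor on a tiny scale
set.seed(1)
n <- 2000
x <- runif(n, 0, 0.01)
y <- rbinom(n, 1, plogis(-1 + 150 * x))

fit.orig <- glm(y ~ x, family = binomial)           # original scale
fit.resc <- glm(y ~ I(x * 100), family = binomial)  # x multiplied by 10^2

# The coefficient shrinks by a factor of 100 ...
coef(fit.orig)[2] / coef(fit.resc)[2]

# ... but the z statistic and the log-likelihood do not change
z.orig <- summary(fit.orig)$coefficients[2, "z value"]
z.resc <- summary(fit.resc)$coefficients[2, "z value"]
c(z.orig, z.resc)
c(logLik(fit.orig), logLik(fit.resc))
```

Only the units in which the coefficient is expressed differ between the two fits, which is the pattern Table 2 shows for the trade variable.<br />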

Table 2: Models with rescaled Trade variables<br />

(1) (2) (3) (4)<br />

original multiply standardizeOne standardizeALL<br />

Lower trade-to-GDP ratio -32.262 ∗∗∗ -0.323 ∗∗∗ -0.094 ∗∗∗ -0.094 ∗∗∗<br />

(-3.59) (-3.59) (-3.59) (-3.59)<br />

Lower democracy -0.069 ∗∗∗ -0.069 ∗∗∗ -0.069 ∗∗∗ -0.403 ∗∗∗<br />

(-8.55) (-8.55) (-8.55) (-8.55)<br />

Higher democracy 0.038 ∗∗∗ 0.038 ∗∗∗ 0.038 ∗∗∗ 0.256 ∗∗∗<br />

(5.69) (5.69) (5.69) (5.69)<br />

Allies 0.075 0.075 0.075 0.018<br />

(0.75) (0.75) (0.75) (0.75)<br />

Observations 464953 464953 464953 464953<br />

Log lik. -8118.156 -8118.156 -8118.156 -8118.156<br />

Chi-squared 3861.950 3861.950 3861.950 3861.950<br />

t statistics in parentheses<br />

∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001<br />

We can see in the above table that the t statistics, the log-likelihood, and the chi-squared statistic are identical after (2) multiplication of one variable, (3) standardization of one variable, or (4) standardization of all variables. Only the scale of the coefficients changes.<br />


2.2.2 Putting Stars<br />

For the sake of convenience, I think it is good to highlight statistically significant coefficients in some way. I have no objection to putting stars, but keep in mind that there are some people who think putting stars is a sin. Some of those people who hate stars, however, think that it is OK to highlight statistically significant coefficients by writing them in bold font (I honestly don’t know why). If, unfortunately, your reviewer or your TA expresses such a belief, you’d better go along with it rather than try to point out the stupidity of that person.<br />

FYI, AJPS prohibits us from using multiple stars to differentiate multiple significance levels.<br />

2.3 Exporting Stata Results to R<br />

This will become essential when you need to estimate some fancy statistical models that are<br />

easier to implement in Stata, but then want to use R to make graphs.<br />

Run the following chunk of Stata code:<br />

Stata code<br />

insheet using "output/smw06162010dataS.txt", clear<br />

global covariates "smldem lrgdem smldep allies lncaprat dircont lndstab majpower"<br />

replace smldep = smldep*1000<br />

logit mzmid1 $covariates<br />

mat beta = e(b)<br />

mat vcov = e(V)<br />

// Export Beta and V-COV matrix<br />

preserve<br />

svmat beta, names(matcol)<br />

outsheet beta* in 1 using "output/sr_beta.txt", replace nolabel<br />

restore<br />

preserve<br />

svmat vcov, names(matcol)<br />

outsheet vcov* in 1/9 using "output/sr_vcov.txt", replace nolabel<br />

restore<br />

Two text files will be created (replaced) within the output folder.<br />



3 R<br />

3.1 Text Editor<br />

Why would you want to use a text editor?<br />

• Syntax highlighting<br />

• Parenthesis matching<br />

• Lets you send part or all of your R code to R<br />

A powerful text editor significantly reduces careless mistakes and saves your day.<br />

3.1.1 How to Install Tinn-R (PC users)<br />

If you want to install Tinn-R on a Rice computer (for which you don’t have Administrator privileges):<br />

1. Go to the Tinn-R website and download the last runnable version without installer (1.15.1.7) (.zip, 2.9 MB) at the bottom of the page.<br />

2. Unzip the downloaded zip file.<br />

3. Copy the Tinn-R folder to “System (C:) Program Files”<br />

4. Under Program Files/Tinn-R/bin, there is an executable file named “Tinn-R.exe”. Create a shortcut to this file and place it in some convenient location (e.g., Desktop).<br />

3.1.2 How to Use Tinn-R (PC users)<br />

1. Open R<br />

2. Open Tinn-R, open the accompanying R code (2_RPart.R) from Tinn-R.<br />

3. Arrange the two windows side by side, so that you can see both windows simultaneously<br />

(See the top figure on the next page.)<br />

4. Assign “Hot keys.”<br />

(a) Click the R menu in Tinn-R, and choose “Hotkeys of R”<br />

(b) A small window will appear (See the bottom figure on the next page).<br />

(c) Choose “Send selection” from the left menu, and then click on “Active” on the<br />

bottom.<br />

(d) Assign a shortcut key of your choosing. I chose “Ctrl + j”<br />

(e) Click “Add”<br />



Now, when you select several lines of R code in the right window and hit “Ctrl + j”, the selected text will be sent to R and should appear in the left window.<br />

You may need to set the directory:<br />

R code<br />

> setwd("2010_0616")<br />



3.1.3 How to install Aquamacs (Mac users)<br />

Download and install Aquamacs from http://aquamacs.org/. Plain and simple.<br />

3.1.4 How to use Aquamacs (Mac users)<br />

1. Open the accompanying R code (2_RPart.R) with Aquamacs.<br />

2. Open a new window by hitting “Command + N”<br />

3. Rearrange the two windows side by side if necessary.<br />

4. When the new window is active, hit “Alt + x” and then “Shift + R”.<br />

5. You should see a message “M-x R” on the bottom of the window. Then hit return<br />

twice. R will begin.<br />

6. Short-cut keys are already assigned. Two of them I frequently use are<br />

• Hit “Ctrl + C + C” (hit C twice while holding down the Ctrl key): send a<br />

“paragraph” to R<br />

• Hit “Ctrl + C + N”: send a line to R.<br />


3.2 Calculating Substantive Effects (a.k.a., CLARIFY “by hand”)<br />

Suppose you want to calculate substantive effects of some variable after running some fancy statistical model, but CLARIFY does not support the model. In this case, you have to implement the CLARIFY procedure “by hand.” I believe Randy has already covered how to do it with Stata, so I will show you how to do it with R.<br />

Assume further that the model you want to estimate is not readily available in R. For example, models such as Heckman’s probit, Sartori’s selection model, von Stein’s selection model, and the censored duration model by Boehmke et al. can be easily estimated in Stata with ado files, but estimating these models in R requires a bit of programming. 2 The workflow is then:<br />

1. Estimate a model in Stata<br />

2. Save the estimates (coefficient vector and variance-covariance matrix) and export them as text files<br />

3. From R, import the estimates<br />

4. Simulate coefficients (β̃)<br />

5. Set the X of interest at some value while holding the other variables at their mean or median.<br />

6. Calculate E(Y) = F(Xβ̃), where F(·) is the inverse link function (the normal CDF for probit, the logistic CDF for logit, etc.)<br />

7. Repeat steps 5 and 6 for each value of X<br />

As we have already completed steps 1 & 2 in Section 2.3, we will begin with step 3.<br />

A nice thing about using R is that we don’t really need to iterate steps 5 and 6 in a loop. Instead, we will use a matrix to set X at many values and calculate E(Y) in one pass.<br />

Step 3: Importing estimates from Stata<br />

Run the following chunk of R code (in 2_RPart.R).<br />

R code<br />

> beta vcov std.err z.score p.value reg.tbl colnames(reg.tbl) reg.tbl<br />

Compare the table shown in your R window and the one in your Stata window. They should<br />

be identical.<br />
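A minimal sketch of what this import step could look like. It assumes that the two files written in Section 2.3 are tab-delimited with a header row of names; to run on its own, it first writes small stand-in files with toy numbers (three coefficients rather than nine). The object names follow the listing above, while the file contents and the column labels of the table are illustrative.<br />

```r
# Stand-ins for the two files exported from Stata in Section 2.3
# (toy numbers, three coefficients only, so that the sketch is self-contained)
beta.file <- tempfile(fileext = ".txt")
vcov.file <- tempfile(fileext = ".txt")
writeLines(c("x1\tx2\t_cons", "0.50\t-0.20\t-1.00"), beta.file)
writeLines(c("x1\tx2\t_cons",
             "0.0100\t0.0010\t-0.0020",
             "0.0010\t0.0400\t-0.0030",
             "-0.0020\t-0.0030\t0.0900"), vcov.file)

# Read the coefficient vector and the variance-covariance matrix
beta <- unlist(read.table(beta.file, header = TRUE, sep = "\t"))
vcov <- as.matrix(read.table(vcov.file, header = TRUE, sep = "\t"))

# Rebuild the regression table from the two pieces
std.err <- sqrt(diag(vcov))
z.score <- beta / std.err
p.value <- 2 * pnorm(-abs(z.score))   # two-sided p-value
reg.tbl <- cbind(beta, std.err, z.score, p.value)
colnames(reg.tbl) <- c("Coef.", "Std. Err.", "z", "P>|z|")
round(reg.tbl, 3)
```

With the real files, only the two paths would change (output/sr_beta.txt and output/sr_vcov.txt).<br />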

2 In the future, you might want to learn how to program these models in R, but for now, let’s assume<br />

that we don’t have time to do it.<br />



Step 4: Simulate coefficients<br />

Then, we resample parameters (beta coefficients) from the sampling distribution characterized by the estimated coefficient vector and the variance-covariance matrix.<br />

Let’s draw 1,000 sets of simulated coefficients.<br />

R code<br />

> require(MASS)<br />

> nsims set.seed(12345)<br />

> simb dim(simb)<br />

> summary(simb)<br />

> hist(simb[1,])<br />
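A minimal sketch of the simulation step, assuming the draws come from a multivariate normal distribution via mvrnorm() in the MASS package. The toy beta and vcov below stand in for the estimates imported from Stata; their values are made up for illustration.<br />

```r
library(MASS)  # provides mvrnorm()

# Toy estimates standing in for beta and vcov imported from Stata
beta <- c(x1 = 0.5, x2 = -0.2, cons = -1.0)
vcov <- matrix(c( 0.010,  0.001, -0.002,
                  0.001,  0.040, -0.003,
                 -0.002, -0.003,  0.090),
               nrow = 3, byrow = TRUE)

nsims <- 1000
set.seed(12345)
# Each row of simb is one draw of the whole coefficient vector
simb <- mvrnorm(n = nsims, mu = beta, Sigma = vcov)
dim(simb)        # nsims rows, one column per coefficient
colMeans(simb)   # should be close to beta
```

Drawing the whole vector jointly, rather than each coefficient separately, keeps the covariances between coefficients in the simulated draws.<br />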

Step 5: Set X<br />

Let’s say we are interested in the effect of the military capability variable (Capability ratio) and want to calculate its substantive effects along with 95% confidence intervals. We will vary the capability variable from its minimum to its maximum value in the sample, while holding the other variables constant at their mean or median values.<br />

Let’s first load the data set.<br />

R code<br />

> data data$smldep x.of.i range(x.of.i)<br />

It turns out that x.of.i, the capability variable, ranges from 0 to 8.5. We split this range into 49 intervals, which gives 50 evenly spaced values.<br />

R code<br />

set.x


Now, we have set X by “column binding” (cbind) the x.of.i vector and representative values of the other variables.<br />

• What is the dimension of set.x?<br />

• It is important that the orders of the regressors in set.x and coefficients in beta are<br />

matched exactly.<br />
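A minimal sketch of how such a profile matrix could be built. The variable names are those of the covariates in Section 2.3; the data frame is simulated so that the sketch runs on its own, and treating allies, dircont, and majpower as the binary variables (held at the median) is an assumption.<br />

```r
# Toy data standing in for the workshop data set (simulated values)
set.seed(1)
n <- 500
data <- data.frame(
  smldem   = runif(n, -10, 10), lrgdem   = runif(n, -10, 10),
  smldep   = runif(n, 0, 5),    allies   = rbinom(n, 1, 0.1),
  lncaprat = runif(n, 0, 8.5),  dircont  = rbinom(n, 1, 0.2),
  lndstab  = runif(n, 2, 9),    majpower = rbinom(n, 1, 0.3)
)

# 50 evenly spaced values = 49 intervals over the observed range
x.of.i <- seq(min(data$lncaprat), max(data$lncaprat), length.out = 50)

# Columns in the same order as the coefficients, constant last.
# Continuous variables at their means, binary variables at their medians;
# cbind() recycles the scalars down the 50 rows.
set.x <- cbind(
  smldem   = mean(data$smldem),
  lrgdem   = mean(data$lrgdem),
  smldep   = mean(data$smldep),
  allies   = median(data$allies),
  lncaprat = x.of.i,
  dircont  = median(data$dircont),
  lndstab  = mean(data$lndstab),
  majpower = median(data$majpower),
  cons     = 1
)
dim(set.x)
```

Naming the columns makes it easy to check the ordering against the coefficient vector before multiplying.<br />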

Step 6: Calculate E(Y) = F (Xβ)<br />

Now we have<br />

• Profile X with 50 rows and 9 columns, and<br />

• Simulated β̃ with 1000 rows and 9 columns.<br />

This means that we are to obtain a distribution of E(Y) for 50 different regressor profiles.<br />

Then,<br />

• What should be the dimension of E(Y)?<br />

• How do we calculate E(Y)? (Note: our model is logit.)<br />

> x.beta exp.y
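A minimal sketch of this calculation for a logit model. The inputs are toy stand-ins (three coefficients instead of nine), and the summary objects pred.mean and pred.ci are illustrative names.<br />

```r
library(MASS)
set.seed(12345)

# Toy inputs: 3 coefficients instead of 9, 50 profiles, 1000 draws
beta  <- c(x1 = 0.5, x2 = -0.2, cons = -1.0)
vcov  <- diag(c(0.01, 0.04, 0.09))
simb  <- mvrnorm(1000, beta, vcov)                  # 1000 x 3
set.x <- cbind(x1 = seq(0, 8.5, length.out = 50),   # 50 x 3
               x2 = 1, cons = 1)

# Linear predictor for every profile (rows) and every draw (columns)
x.beta <- set.x %*% t(simb)       # (50 x 3) %*% (3 x 1000) = 50 x 1000

# Inverse logit turns the linear predictor into a probability
exp.y <- 1 / (1 + exp(-x.beta))   # equivalent to plogis(x.beta)

# Summarize each profile across the 1000 draws
pred.mean <- rowMeans(exp.y)
pred.ci   <- apply(exp.y, 1, quantile, probs = c(0.025, 0.975))  # 2 x 50
```

Transposing simb is what lets a single matrix product replace the loop over profiles and draws.<br />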


3.3 Plotting Substantive Effects<br />

Run the R code under section 3.3. You will obtain Figure 1 below (output/figs/subeff.pdf).<br />

[Plot: line for the Predicted Probability with 95% Confidence Bands. Horizontal axis: Capability Ratio (log), 0 to 8. Vertical axis: Probability of MID Onset, 0.0000 to 0.0030.]<br />

Figure 1: The effect of military capability balance on the probability of MID onset<br />

Insert figure caption here. Explain what this figure means substantively, how the other variables are treated<br />

(held at mean / median, etc.), how the confidence bands are calculated (in this case, CLARIFY-style), etc.<br />

Make sure that readers can understand what’s going on without reading the text.<br />
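A minimal base-graphics sketch of a plot of this kind. The three curves are toy stand-ins for the simulated mean and 95% bounds, the axis labels and legend entries are taken from Figure 1, and the output goes to a temporary file rather than output/figs/subeff.pdf.<br />

```r
# Toy summaries standing in for the simulation output
x.of.i    <- seq(0, 8.5, length.out = 50)
pred.mean <- plogis(-6.0 - 0.3 * x.of.i)
pred.lo   <- plogis(-6.3 - 0.3 * x.of.i)
pred.hi   <- plogis(-5.7 - 0.3 * x.of.i)

out.file <- file.path(tempdir(), "subeff.pdf")
pdf(out.file, width = 6, height = 4.5)
plot(x.of.i, pred.mean, type = "l", lwd = 2,
     ylim = range(pred.lo, pred.hi),
     xlab = "Capability Ratio (log)", ylab = "Probability of MID Onset")
lines(x.of.i, pred.lo, lty = 2)   # lower 95% bound
lines(x.of.i, pred.hi, lty = 2)   # upper 95% bound
legend("topright",
       legend = c("Predicted Probability", "95% Confidence Bands"),
       lty = c(1, 2), lwd = c(2, 1), bty = "n")
dev.off()
```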



3.4 Plotting Substantive Effects: When X is a Discrete Variable...<br />

If the regressor of interest takes only a limited number of discrete values (for example, 0 or 1), then consider using histograms instead of line plots. An example of a histogram representation of substantive effects is shown in Figure 2. To produce this figure, run the R code under section 3.4 and you will obtain output/figs/subeff2.pdf.<br />

[Plot: two histograms, one for Noncontiguous dyads and one for Contiguous dyads. Horizontal axis: Probability of MID Onset, 0.00 to 0.07. Vertical axis: Frequency, 0 to 400.]<br />

Figure 2: The effect of geographic contiguity on the probability of MID onset<br />

The figure illustrates the effect of contiguity on the probability that a pair of states experiences a militarized interstate dispute with one another. The black histogram on the left shows the distribution of the predicted probability of MID onset for dyads that do not share a geographic border, whereas the gray histogram on the right shows that for contiguous dyads. All the covariates other than the contiguity variable are held at their median (binary variables) or mean (continuous variables) values. Uncertainty estimates are obtained by resampling 1000 draws of β coefficients from the sampling distribution reported in Figure 4.<br />
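A minimal sketch of overlaying two histograms of simulated probabilities. The two vectors are toy draws, not the workshop results, and the output goes to a temporary file rather than output/figs/subeff2.pdf.<br />

```r
set.seed(1)
# Toy simulated probabilities for the two profiles (not the workshop results)
p.noncontig <- plogis(rnorm(1000, mean = -5.0, sd = 0.15))
p.contig    <- plogis(rnorm(1000, mean = -3.5, sd = 0.15))

# Common breaks so that the bars of the two histograms are comparable
brks <- seq(0, max(p.contig, p.noncontig) * 1.05, length.out = 60)

out.file <- file.path(tempdir(), "subeff2.pdf")
pdf(out.file, width = 6, height = 4.5)
hist(p.noncontig, breaks = brks, col = "black", border = "black",
     xlab = "Probability of MID Onset", main = "")
hist(p.contig, breaks = brks, col = "gray", border = "gray", add = TRUE)
legend("topright", legend = c("Noncontiguous dyads", "Contiguous dyads"),
       fill = c("black", "gray"), bty = "n")
dev.off()
```

When the two histograms barely overlap, as here, the difference between the two profiles is easy to read without any further test statistic.<br />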



3.5 Descriptive Statistics: Plotting (Im)Balance<br />

This figure illustrates the difference between treated and untreated (control) observation units in terms of pre-treatment covariates (and/or the outcome variable). You might find this type of figure useful when you want to show (im)balance statistics before / after matching. In this particular example, the unit of observation is the dyad-year, and the “treatment” is a binary variable measuring whether or not the members of a dyad are allied in the observation year.<br />

Run the R code under section 3.5, and you will obtain Figure 3 (output/figs/sumstat.pdf).<br />

[Plot: one panel per variable: MID Onset; Lower trade-to-GDP ratio; Distance (log); Lower democracy; Capability ratio (log); Major Power; Higher democracy; Contiguity. Legend: mean for allied dyads (N = 788); mean for not allied dyads (N = 11,445); | median; 25% and 75% percentile.]<br />

Figure 3: Summary statistics<br />

Each panel shows descriptive statistics of each variable. Circles show the mean, vertical ticks show the<br />

median, solid horizontal line segments show the lower and upper quartiles, and dotted lines span the range<br />

of the variables. Black circles represent the mean of the 788 dyad-years that have formal alliance ties with<br />

one another, and white circles represent the mean of the 11,445 dyad-years that are not allied.<br />



3.6 Plotting Regression Coefficients<br />

Run the R code under section 3.6, and you will obtain Figure 4 (output/figs/coefplot.pdf).<br />

[Plot: point estimates with horizontal confidence intervals, one row each for Lower democracy; Higher democracy; Lower trade-to-GDP ratio; Allies; Capability ratio; Contiguity; Distance; Major Power. Horizontal axis: −1 to 4.]<br />

Figure 4: Estimated logit coefficients<br />

This figure shows the estimated coefficients from the logistic regression of MID onset (n = 12,233). Circles<br />

represent the point estimates, and the associated horizontal line segments show the 90% confidence intervals<br />

for the estimate.<br />
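A minimal base-graphics sketch of a coefficient plot with 90% confidence intervals. The estimates and standard errors are toy numbers chosen for illustration, and the output goes to a temporary file rather than output/figs/coefplot.pdf.<br />

```r
# Toy estimates (not the workshop results)
est <- c("Lower democracy" = -0.07, "Allies" = 0.08, "Contiguity" = 1.13)
se  <- c(0.01, 0.10, 0.15)
lo  <- est - qnorm(0.95) * se   # lower end of the 90% confidence interval
hi  <- est + qnorm(0.95) * se   # upper end of the 90% confidence interval
k   <- length(est)

out.file <- file.path(tempdir(), "coefplot.pdf")
pdf(out.file, width = 6, height = 4)
par(mar = c(4, 9, 1, 1))        # wide left margin for the variable labels
plot(est, k:1, pch = 19, xlim = range(lo, hi), yaxt = "n",
     xlab = "Logit coefficient", ylab = "")
segments(lo, k:1, hi, k:1)      # horizontal interval for each coefficient
abline(v = 0, lty = 2)          # reference line at zero
axis(2, at = k:1, labels = names(est), las = 1)
dev.off()
```

An interval that crosses the dashed line at zero corresponds to a coefficient that would not get a star at the 10% level.<br />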
