handout - Daina Chiba
Summer Methods Workshop:<br />
Presenting Statistical Results<br />
Daina Chiba<br />
d.chiba@rice.edu<br />
June 16, 2010<br />
Contents<br />
1 Why you want to learn both Stata and R<br />
2 Stata<br />
2.1 Producing Regression Tables Efficiently<br />
2.1.1 DOs and DON’Ts<br />
2.1.2 Tables with Multiple Columns<br />
2.1.3 Exporting Stata Results to a LaTeX document<br />
2.1.4 Exporting Stata Results to Word/Powerpoint documents<br />
2.2 Additional Tips<br />
2.2.1 Rescaling<br />
2.2.2 Putting Stars<br />
2.3 Exporting Stata Results to R<br />
3 R<br />
3.1 Text Editor<br />
3.1.1 How to Install Tinn-R (PC users)<br />
3.1.2 How to Use Tinn-R (PC users)<br />
3.1.3 How to install Aquamacs (Mac users)<br />
3.1.4 How to use Aquamacs (Mac users)<br />
3.2 Calculating Substantive Effects (a.k.a., CLARIFY “by hand”)<br />
3.3 Plotting Substantive Effects<br />
3.4 Plotting Substantive Effects: When X is a Discrete Variable...<br />
3.5 Descriptive Statistics: Plotting (Im)Balance<br />
3.6 Plotting Regression Coefficients<br />
1 Why you want to learn both Stata and R<br />
• The field of Political Science is making a gradual transition from Stata to R.<br />
• It may be the case that your co-author can use only one of them.<br />
• Stata’s advantage<br />
– Fast & easy<br />
– Some methods are available only in Stata (e.g., selection models)<br />
• R’s advantage<br />
– Flexible graphic devices<br />
– Able to handle multiple data sets simultaneously<br />
– Some statistical models are available only in R (e.g., IRT models, Bayesian)<br />
2 Stata<br />
2.1 Producing Regression Tables Efficiently<br />
Although some professional journals encourage authors to use figures instead of tables when<br />
reporting statistical results, some people still prefer old-fashioned tables to figures (and one of them<br />
may be your co-author!). So, let us first see how to make a publication-quality regression table<br />
efficiently.<br />
The table on the next page is a screenshot of a regression table from Oneal & Russett<br />
(2005, 299). 1 We will see how to automate the process of making a regression table like this<br />
in Stata.<br />
2.1.1 DOs and DON’Ts<br />
• Don’t copy and paste from Stata’s Results window (or a log file) into Word or Excel.<br />
• Automate the process as much as possible to avoid human error and save time.<br />
2.1.2 Tables with Multiple Columns<br />
If you want to produce a regression table with multiple columns, use the esttab command (part of the user-written estout package).<br />
Try running the following chunk of the do-file (in 1_StataPart.do):<br />
1 The paper itself is included in the workshop packet. See /resources/ONealRussett 2005.pdf<br />
Stata code<br />
use "output/smw06162010dataL.dta", clear<br />
global covariates "smldem lrgdem smldep allies lncaprat dircont lndstab majpower"<br />
logit mzmid1 $covariates systsize midpy*, cluster(dyadid)<br />
estimates store allmids<br />
logit mzfatald1 $covariates systsize fatalpy*, cluster(dyadid)<br />
estimates store fatalmids<br />
logit mzcowwar1 $covariates systsize warpy*, cluster(dyadid)<br />
estimates store wars<br />
Now, we have run three models shown in TABLE 1, and stored each of the results in<br />
memory. Let’s display the three models simultaneously. To do this, run the following<br />
Stata code<br />
esttab allmids fatalmids wars, ///<br />
b(%10.3f) se scalars("ll Log lik." "chi2 Chi-squared") ///<br />
label mtitles keep($covariates _cons)<br />
The first line tells Stata to show the three models we estimated above (allmids, fatalmids,<br />
wars). The second line requests coefficients displayed with three decimal places,<br />
b(%10.3f), standard errors, se, and some other scalar values of interest (the maximized<br />
value of the log-likelihood function and χ², in addition to N). The third line uses variable labels (label) and<br />
model names as column titles (mtitles), and keeps only the listed covariates and the constant (keep).<br />
You should be seeing a table like this in your Stata Results window.<br />
Compare the table in the original paper with the one in your Stata window. Can you find a<br />
mistake that Oneal and Russett made?<br />
I suspect that O & R copied and pasted the results from the Stata window into a Word<br />
document, and somewhere in the process, they accidentally flipped the sign of one coefficient.<br />
A mistake this easy to find could have been caught by the authors, the reviewers, the editors, or the<br />
publisher (although in this case nobody caught it). The mistakes<br />
you make may be less conspicuous and more consequential. So, again, don’t use copy &<br />
paste!<br />
2.1.3 Exporting Stata Results to a LaTeX document<br />
If you use LaTeX, incorporating Stata results is very easy, and the process can be almost<br />
completely automated. First, run the following<br />
Stata code<br />
esttab allmids fatalmids wars using "handout/ro2005tbl.tex", tex replace ///<br />
b(%10.3f) se scalars("ll Log lik." "chi2 Chi-squared") ///<br />
label mtitles keep($covariates _cons) ///<br />
title("Models of the onset of militarized interstate disputes, fatal disputes, and war, 1885--2001")<br />
Running this chunk of code will create (or replace) a text file ro2005tbl.tex in the handout<br />
folder. To insert the created table into your paper, add the following line to the LaTeX source<br />
file:<br />
\input{ro2005tbl.tex}<br />
Then you will have the following table (next page).<br />
Every time you change your statistical model and re-estimate it, rerunning the do-file and recompiling the<br />
LaTeX document will give you an updated publication-quality table automatically. The days of<br />
copying and pasting are over.<br />
Table 1: Models of the onset of militarized interstate disputes, fatal disputes, and war, 1885–2001<br />
                            (1)            (2)            (3)<br />
                            allmids        fatalmids      wars<br />
Lower democracy             -0.069∗∗∗      -0.096∗∗∗      -0.162∗∗∗<br />
                            (0.008)        (0.017)        (0.030)<br />
Higher democracy            0.038∗∗∗       0.038∗∗∗       0.043<br />
                            (0.007)        (0.011)        (0.024)<br />
Lower trade-to-GDP ratio    -32.262∗∗∗     -95.833∗∗∗     -45.763<br />
                            (8.987)        (25.895)       (26.878)<br />
Allies                      0.075          -0.199         -0.562<br />
                            (0.101)        (0.178)        (0.346)<br />
Capability ratio (log)      -0.284∗∗∗      -0.410∗∗∗      -0.754∗∗∗<br />
                            (0.030)        (0.049)        (0.096)<br />
Contiguous                  1.127∗∗∗       1.151∗∗∗       0.982∗∗<br />
                            (0.152)        (0.249)        (0.357)<br />
Distance (log)              -0.289∗∗∗      -0.466∗∗∗      -0.365∗<br />
                            (0.053)        (0.081)        (0.161)<br />
Major power                 1.007∗∗∗       1.118∗∗∗       2.240∗∗∗<br />
                            (0.133)        (0.227)        (0.398)<br />
Constant                    -0.489         -0.643         -3.176∗<br />
                            (0.406)        (0.618)        (1.267)<br />
Observations                464953         464692         464953<br />
Log lik.                    -8118.156      -2926.785      -757.367<br />
Chi-squared                 3861.950       1345.308       472.660<br />
Standard errors in parentheses<br />
∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001<br />
2.1.4 Exporting Stata Results to Word/Powerpoint documents<br />
Microsoft Word is NOT good software for scientific documentation, in the sense that you<br />
have to do a significant part of the otherwise automated process “by hand,” which not<br />
only increases the risk of making errors but also requires laborious copying, pasting, and<br />
editing. So, I strongly encourage you to learn how to use LaTeX. A good first step would<br />
be to play with the source file of this handout (handout/handout.tex). You should be able<br />
to create a PDF from the source file.<br />
But, again, you may co-author with a superior who does not want to learn LaTeX and<br />
simply orders you to create a fancy table in Word or PowerPoint. Here is a way to<br />
minimize the copying and pasting and the risk of making errors in the process.<br />
First, run the following<br />
Stata code<br />
estout allmids fatalmids wars using "handout/ro2005tbl.txt", replace ///<br />
cells(b(star fmt(%9.3f)) se(par fmt(%9.3f))) style(fixed) ///<br />
stats(N, fmt(a1 %9.3f %9.3f))<br />
Running this chunk of code will create (or replace) a text file ro2005tbl.txt in the handout<br />
folder.<br />
1. Open Microsoft Excel.<br />
2. Go to “File” and choose “Open” (or, hit Control + O on PC or Command + O on Mac)<br />
3. Find ro2005tbl.txt. You may have to change the “Files of type” option.<br />
4. Specify the numerical columns as “Text” not “General” (Otherwise, Excel will recognize<br />
the numbers in parentheses as negative.)<br />
5. Format as you like, and copy and paste it into a Word or PowerPoint document.<br />
2.2 Additional Tips<br />
2.2.1 Rescaling<br />
If one of your coefficients exceeds 10 in absolute value, consider rescaling the variable. If it<br />
exceeds 1,000, you simply must rescale it. Coefficients greater than 1,000,000 are not only<br />
ugly but also misleading.<br />
There are several ways to rescale a variable, none of which changes the statistical results<br />
substantively.<br />
• Multiply the variable by some constant (a natural choice would be 10^x with x ∈<br />
{−3, −2, −1, 1, 2, 3}, for example).<br />
• Standardize the variable (subtract its mean and then divide by its standard deviation).<br />
After standardization, the variable has mean 0 and unit variance.<br />
You can apply either of the above to some or all of the variables included in your<br />
regression.<br />
Table 2: Models with rescaled Trade variables<br />
                            (1)             (2)             (3)             (4)<br />
                            original        multiply        standardizeOne  standardizeALL<br />
Lower trade-to-GDP ratio    -32.262∗∗∗      -0.323∗∗∗       -0.094∗∗∗       -0.094∗∗∗<br />
                            (-3.59)         (-3.59)         (-3.59)         (-3.59)<br />
Lower democracy             -0.069∗∗∗       -0.069∗∗∗       -0.069∗∗∗       -0.403∗∗∗<br />
                            (-8.55)         (-8.55)         (-8.55)         (-8.55)<br />
Higher democracy            0.038∗∗∗        0.038∗∗∗        0.038∗∗∗        0.256∗∗∗<br />
                            (5.69)          (5.69)          (5.69)          (5.69)<br />
Allies                      0.075           0.075           0.075           0.018<br />
                            (0.75)          (0.75)          (0.75)          (0.75)<br />
Observations                464953          464953          464953          464953<br />
Log lik.                    -8118.156       -8118.156       -8118.156       -8118.156<br />
Chi-squared                 3861.950        3861.950        3861.950        3861.950<br />
t statistics in parentheses<br />
∗ p < 0.05, ∗∗ p < 0.01, ∗∗∗ p < 0.001<br />
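We can see in the above table that the t statistics, log likelihood, and chi-squared statistic are identical after (2) multiplication of one variable,<br />
(3) standardization of one variable, or (4) standardization of all variables; only the scale of the rescaled coefficients changes.<br />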
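The same point can be checked in R with a minimal sketch. The data below are simulated and the variable names (x, x.mult, x.std, slope.info) are made up for illustration; this is not the workshop data set. The slope changes scale, while its z statistic and the log likelihood stay the same.<br />

```r
# Simulated data (hypothetical; NOT the workshop data set)
set.seed(2010)
n <- 2000
x <- runif(n, 0, 0.01)                    # a regressor on a very small scale
y <- rbinom(n, 1, plogis(-1 - 100 * x))   # true slope is -100

# Two rescaled versions of the same regressor
x.mult <- x * 100                         # multiply by 10^2
x.std  <- (x - mean(x)) / sd(x)           # standardize: mean 0, sd 1

fit.orig <- glm(y ~ x,      family = binomial)
fit.mult <- glm(y ~ x.mult, family = binomial)
fit.std  <- glm(y ~ x.std,  family = binomial)

# Helper: slope estimate, its z statistic, and the log likelihood
slope.info <- function(fit) {
  tbl <- coef(summary(fit))
  c(coef = tbl[2, "Estimate"], z = tbl[2, "z value"],
    loglik = as.numeric(logLik(fit)))
}
res <- rbind(original    = slope.info(fit.orig),
             multiply    = slope.info(fit.mult),
             standardize = slope.info(fit.std))
print(round(res, 3))
```

Only the coef column should differ across the three rows; the constant (not shown) does change when the regressor is centered.<br />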
2.2.2 Putting Stars<br />
For the sake of convenience, I think it is good to highlight statistically significant coefficients<br />
in some way. I have no objection to putting stars, but keep in mind that there are some<br />
people who think putting stars is a sin. Some of those people who hate stars, however, think<br />
that it is OK to highlight statistically significant coefficients by writing them in bold font<br />
(I honestly don’t know why). If, unfortunately, your reviewer or your TA expresses such a<br />
belief, you’d better go along with it rather than try to point out the stupidity of that person.<br />
FYI, AJPS prohibits us from placing multiple stars to differentiate multiple significance<br />
levels.<br />
2.3 Exporting Stata Results to R<br />
This will become essential when you need to estimate some fancy statistical models that are<br />
easier to implement in Stata, but then want to use R to make graphs.<br />
Run the following chunk of Stata code:<br />
Stata code<br />
insheet using "output/smw06162010dataS.txt", clear<br />
global covariates "smldem lrgdem smldep allies lncaprat dircont lndstab majpower"<br />
replace smldep = smldep*1000<br />
logit mzmid1 $covariates<br />
mat beta = e(b)<br />
mat vcov = e(V)<br />
// Export Beta and V-COV matrix<br />
preserve<br />
svmat beta, names(matcol)<br />
outsheet beta* in 1 using "output/sr_beta.txt", replace nolabel<br />
restore<br />
preserve<br />
svmat vcov, names(matcol)<br />
outsheet vcov* in 1/9 using "output/sr_vcov.txt", replace nolabel<br />
restore<br />
Two text files will be created (replaced) within the output folder.<br />
3 R<br />
3.1 Text Editor<br />
Why would you want to use a text editor?<br />
• Syntax highlighting<br />
• Parenthesis matching<br />
• Lets you send part or all of your R code to R<br />
A powerful text editor significantly reduces careless mistakes and saves your day.<br />
3.1.1 How to Install Tinn-R (PC users)<br />
If you want to install Tinn-R on a Rice computer (on which you don’t have administrator<br />
privileges):<br />
1. Go to the Tinn-R website and download the last runnable version without installer<br />
(1.15.1.7) (.zip, 2.9 Mb) at the bottom of the page.<br />
2. Unzip the downloaded zip file.<br />
3. Copy the Tinn-R folder to “System (C:) Program Files”.<br />
4. Under Program Files/Tinn-R/bin, there is an executable file named “Tinn-R.exe”.<br />
Create a shortcut to this file and place it in some convenient location (e.g., Desktop).<br />
3.1.2 How to Use Tinn-R (PC users)<br />
1. Open R<br />
2. Open Tinn-R, open the accompanying R code (2_RPart.R) from Tinn-R.<br />
3. Arrange the two windows side by side, so that you can see both windows simultaneously<br />
(See the top figure on the next page.)<br />
4. Assign “Hot keys.”<br />
(a) Click the R menu in Tinn-R and choose “Hotkeys of R”.<br />
(b) A small window will appear (see the bottom figure on the next page).<br />
(c) Choose “Send selection” from the left menu, and then click on “Active” at the<br />
bottom.<br />
(d) Assign a shortcut key of your choosing. I chose “Ctrl + j”.<br />
(e) Click “Add”.<br />
Now, when you select several lines of R code in the right window and then hit “Ctrl<br />
+ j”, the selected text will be sent to R and should appear in the left window.<br />
You may need to set the working directory:<br />
R code<br />
> setwd("2010_0616")<br />
3.1.3 How to install Aquamacs (Mac users)<br />
Download and install Aquamacs from http://aquamacs.org/. Plain and simple.<br />
3.1.4 How to use Aquamacs (Mac users)<br />
1. Open the accompanying R code (2_RPart.R) with Aquamacs.<br />
2. Open a new window by hitting “Command + N”<br />
3. Rearrange the two windows side by side if necessary.<br />
4. When the new window is active, hit “Alt + x” and then “Shift + R”.<br />
5. You should see a message “M-x R” on the bottom of the window. Then hit return<br />
twice. R will begin.<br />
6. Shortcut keys are already assigned. Two that I frequently use are<br />
• “Ctrl + C + C” (hit C twice while holding down the Ctrl key): send a<br />
“paragraph” to R.<br />
• “Ctrl + C + N”: send a line to R.<br />
3.2 Calculating Substantive Effects (a.k.a., CLARIFY “by hand”)<br />
Suppose you want to calculate substantive effects of some variable after running some fancy<br />
statistical model, but CLARIFY does not support the model. In this case, you have to<br />
implement the CLARIFY procedure “by hand.” I believe Randy has already covered how to<br />
do it with Stata, so I will show you how to do it with R.<br />
Assume further that the model you want to estimate is not readily available in R. For<br />
example, models such as Heckman’s probit, Sartori’s selection model, von Stein’s selection<br />
model, and the censored duration model by Boehmke et al. can be easily estimated in Stata<br />
with ado files, but estimating these models in R requires a bit of programming. 2<br />
1. Estimate a model in Stata.<br />
2. Save the estimates (coefficient vector and variance-covariance matrix) and export them<br />
as text files.<br />
3. From R, import the estimates.<br />
4. Simulate coefficients (β̃).<br />
5. Set X of your interest at some value while holding the other variables at their mean or<br />
median.<br />
6. Calculate E(Y) = F(Xβ̃), where F(·) is the inverse link function (the normal CDF for probit, the logistic function for logit, etc.).<br />
7. Repeat steps 5 and 6 for each value of X you are interested in.<br />
As we have already completed steps 1 & 2 in Section 2.3, we will begin with step 3.<br />
A nice thing about using R is that we don’t really need to iterate steps 5 and 6<br />
with a loop. Instead, we will use a matrix to set X at many values and calculate E(Y) in one<br />
pass.<br />
Step 3: Importing estimates from Stata<br />
Run the following chunk of R code (in 2_RPart.R).<br />
R code<br />
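> # The right-hand sides below are assumed; they read the two files exported in Section 2.3.<br />
> beta <- unlist(read.table("output/sr_beta.txt", header = TRUE))      # coefficient vector<br />
> vcov <- as.matrix(read.table("output/sr_vcov.txt", header = TRUE))   # variance-covariance matrix<br />
> std.err <- sqrt(diag(vcov))<br />
> z.score <- beta / std.err<br />
> p.value <- 2 * pnorm(-abs(z.score))<br />
> reg.tbl <- cbind(beta, std.err, z.score, p.value)<br />
> colnames(reg.tbl) <- c("Coef.", "Std. Err.", "z", "P>|z|")<br />
> reg.tbl<br />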
Compare the table shown in your R window and the one in your Stata window. They should<br />
be identical.<br />
2 In the future, you might want to learn how to program these models in R, but for now, let’s assume<br />
that we don’t have time to do it.<br />
Step 4: Simulate coefficients<br />
Next, we draw parameters (beta coefficients) from the sampling distribution characterized<br />
by the estimated coefficient vector and the variance-covariance matrix, treated as a multivariate normal distribution.<br />
Let’s draw 1,000 sets of simulated coefficients.<br />
R code<br />
> require(MASS)<br />
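> nsims <- 1000<br />
> set.seed(12345)<br />
> simb <- mvrnorm(nsims, mu = beta, Sigma = vcov)   # assumed call: nsims x 9 matrix of draws<br />
> dim(simb)<br />
> summary(simb)<br />
> hist(simb[, 1])   # distribution of the first simulated coefficient<br />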
Step 5: Set X<br />
Let’s say we are interested in the effect of the military capability variable (Capability ratio)<br />
and want to calculate its substantive effects along with 95% confidence intervals. We will vary the<br />
capability variable from its minimum to its maximum value in the sample, while holding<br />
the other variables constant at their mean or median values.<br />
Let’s first load the data set.<br />
R code<br />
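> # The right-hand sides below are assumed; they use the data file and rescaling from Section 2.3.<br />
> data <- read.table("output/smw06162010dataS.txt", header = TRUE)   # set sep= to match the file<br />
> data$smldep <- data$smldep * 1000   # same rescaling as in the Stata code<br />
> x.of.i <- data$lncaprat<br />
> range(x.of.i)<br />
It turns out that the capability variable (x.of.i) ranges from 0 to 8.5. We split this range into 49<br />
intervals, which gives 50 evaluation points.<br />
R code<br />
> x.of.i <- seq(0, 8.5, length.out = 50)<br />
> # Assumed: continuous variables held at their means, binary variables at their medians<br />
> set.x <- cbind(mean(data$smldem), mean(data$lrgdem), mean(data$smldep),<br />
+ median(data$allies), x.of.i, median(data$dircont),<br />
+ mean(data$lndstab), median(data$majpower), 1)<br />
Now, we have set X by “column binding” (cbind) the x.of.i vector and representative values<br />
of the other variables.<br />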
• What is the dimension of set.x?<br />
• It is important that the orders of the regressors in set.x and coefficients in beta are<br />
matched exactly.<br />
Step 6: Calculate E(Y) = F(Xβ̃)<br />
Now we have<br />
• Profile X with 50 rows and 9 columns, and<br />
• Simulated β̃ with 1000 rows and 9 columns.<br />
This means that we will obtain a distribution of E(Y) for each of the 50 regressor profiles.<br />
Then,<br />
• What should be the dimension of E(Y)?<br />
• How do we calculate E(Y)? (Note: our model is logit.)<br />
R code<br />
> x.beta <- set.x %*% t(simb)        # assumed: (50 x 9) %*% (9 x 1000) = 50 x 1000<br />
> exp.y <- 1 / (1 + exp(-x.beta))    # inverse logit<br />
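Steps 4 to 6 can be tried without the workshop files. The following is a minimal sketch under stated assumptions: the estimates (beta.hat, vcov.hat) are made-up numbers for a logit with two regressors and a constant, and all object names are hypothetical. It is not the code in 2_RPart.R.<br />

```r
library(MASS)  # for mvrnorm()

# Made-up logit estimates for illustration only (NOT the workshop results):
# two regressors plus a constant, in that order.
beta.hat <- c(x1 = -0.3, x2 = 1.1, cons = -4)
vcov.hat <- diag(c(0.03, 0.15, 0.40)^2)   # assumed: uncorrelated estimates

# Step 4: simulate coefficients from N(beta.hat, vcov.hat)
set.seed(12345)
nsims <- 1000
simb  <- mvrnorm(nsims, mu = beta.hat, Sigma = vcov.hat)   # 1000 x 3

# Step 5: 50 profiles; x1 varies from 0 to 8.5, x2 is held at 0
x1.grid <- seq(0, 8.5, length.out = 50)
set.x   <- cbind(x1.grid, 0, 1)   # columns in the same order as beta.hat

# Step 6: E(Y) = F(X beta) for every profile and every draw, without a loop
x.beta <- set.x %*% t(simb)       # (50 x 3) %*% (3 x 1000) = 50 x 1000
exp.y  <- 1 / (1 + exp(-x.beta))  # inverse logit

dim(exp.y)                        # rows are profiles, columns are draws
```

Each row of exp.y holds 1,000 simulated values of E(Y) for one profile, so row-wise means and quantiles give point estimates and confidence bands.<br />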
3.3 Plotting Substantive Effects<br />
Run the R code under section 3.3. You will obtain Figure 1 below (output/figs/subeff.pdf).<br />
[Figure 1: line plot of the predicted probability of MID onset (vertical axis, 0 to 0.003) against Capability Ratio (log) (horizontal axis, 0 to 8), with 95% confidence bands.]<br />
Figure 1: The effect of military capability balance on the probability of MID onset<br />
Insert figure caption here. Explain what this figure means substantively, how the other variables are treated<br />
(held at mean / median, etc.), how the confidence bands are calculated (in this case, CLARIFY-style), etc.<br />
Make sure that readers can understand what’s going on without reading the text.<br />
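A figure of this kind can be drawn with base R graphics. The sketch below is not the code under section 3.3; it builds a stand-in 50 x 1000 matrix of simulated probabilities from made-up numbers (draws.a, draws.b, x.grid are hypothetical names) and then summarizes each row by its mean and its 2.5% and 97.5% quantiles.<br />

```r
# Stand-in for a 50 x 1000 matrix of simulated probabilities (made-up numbers)
set.seed(1)
x.grid  <- seq(0, 8.5, length.out = 50)
draws.b <- rnorm(1000, mean = -0.3, sd = 0.03)   # simulated slopes
draws.a <- rnorm(1000, mean = -6,   sd = 0.2)    # simulated constants
lin.pred <- outer(x.grid, draws.b) +
  matrix(draws.a, nrow = 50, ncol = 1000, byrow = TRUE)
exp.y <- 1 / (1 + exp(-lin.pred))

# Point estimate and 95% band: summarize each row (profile) across draws
pred  <- apply(exp.y, 1, mean)
lower <- apply(exp.y, 1, quantile, probs = 0.025)
upper <- apply(exp.y, 1, quantile, probs = 0.975)

plot(x.grid, pred, type = "l", lwd = 2, ylim = range(lower, upper),
     xlab = "Capability Ratio (log)", ylab = "Probability of MID Onset")
lines(x.grid, lower, lty = 2)
lines(x.grid, upper, lty = 2)
legend("topright", bty = "n", lty = c(1, 2), lwd = c(2, 1),
       legend = c("Predicted Probability", "95% Confidence Bands"))
```

To save the figure to a file, wrap the plotting calls between pdf("some-file.pdf") and dev.off(); the file name here is a placeholder.<br />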
3.4 Plotting Substantive Effects: When X is a Discrete Variable...<br />
If the regressor of interest takes only a limited number of discrete values (for example,<br />
0 or 1), then consider using histograms instead of line plots. An example of a histogram<br />
representation of substantive effects is shown in Figure 2. To produce this figure, run the<br />
R code under section 3.4 and you will obtain output/figs/subeff2.pdf.<br />
[Figure 2: two histograms of the simulated probability of MID onset (horizontal axis, 0 to 0.07; vertical axis, frequency, 0 to 400), one for noncontiguous dyads and one for contiguous dyads.]<br />
Figure 2: The effect of geographic contiguity on the probability of MID onset<br />
The figure illustrates the effect of contiguity on the probability that a pair of states experiences a militarized<br />
interstate dispute. The black histogram located to the left shows the distribution<br />
of the predicted probability of MID onset for dyads that do not share a geographic border, whereas the gray<br />
histogram to the right shows that for contiguous dyads. All the covariates other than the contiguity variable<br />
are held at their median (binary variables) or mean (continuous variables) values. Uncertainty estimates are<br />
obtained from 1000 draws of β coefficients from the sampling distribution reported in Figure 4.<br />
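A minimal sketch of this kind of plot, assuming a logit with only a constant and a binary contiguity variable: the simulated coefficients below (sim.cons, sim.contig) are made-up numbers, not the workshop estimates, and this is not the code under section 3.4.<br />

```r
# Made-up simulated coefficients (constant, contiguity); NOT the workshop estimates
set.seed(1)
sim.cons   <- rnorm(1000, mean = -5.5, sd = 0.10)
sim.contig <- rnorm(1000, mean =  1.1, sd = 0.15)

# Predicted probabilities for the two profiles, one value per draw
p.noncontig <- 1 / (1 + exp(-sim.cons))                  # contiguity = 0
p.contig    <- 1 / (1 + exp(-(sim.cons + sim.contig)))   # contiguity = 1

# Common breaks so that the two histograms are comparable
breaks <- seq(0, max(p.contig) * 1.05, length.out = 60)
hist(p.noncontig, breaks = breaks, col = "black", border = "white",
     main = "", xlab = "Probability of MID Onset")
hist(p.contig, breaks = breaks, col = "gray", add = TRUE)
legend("topright", bty = "n", fill = c("black", "gray"),
       legend = c("Noncontiguous dyads", "Contiguous dyads"))
```

When the two histograms do not overlap, the difference between the two profiles is distinguishable given the estimation uncertainty.<br />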
3.5 Descriptive Statistics: Plotting (Im)Balance<br />
This figure illustrates the difference between treated and untreated (control) observation<br />
units in terms of pre-treatment covariates (and/or the outcome variable). You might find this<br />
type of figure useful when you want to show (im)balance statistics before / after matching.<br />
In this particular example, the unit of observation is the dyad-year, and the “treatment” variable<br />
is a binary variable measuring whether the members of a dyad are allied in the<br />
observation year.<br />
Run the R code under section 3.5, and you will obtain Figure 3 (output/figs/sumstat.pdf).<br />
[Figure 3: eight panels, one per variable: MID Onset, Lower trade-to-GDP ratio, Distance (log), Lower democracy, Capability ratio (log), Major Power, Higher democracy, and Contiguity. Legend: black circle = mean for allied dyads (N = 788); white circle = mean for not allied dyads (N = 11,445); vertical tick = median; solid segment = 25% and 75% percentiles.]<br />
Figure 3: Summary statistics<br />
Each panel shows descriptive statistics of each variable. Circles show the mean, vertical ticks show the<br />
median, solid horizontal line segments show the lower and upper quartiles, and dotted lines span the range<br />
of the variables. Black circles represent the mean of the 788 dyad-years that have formal alliance ties with<br />
one another, and white circles represent the mean of the 11,445 dyad-years that are not allied.<br />
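One panel of such a figure can be sketched as follows. The data are simulated (allied and lndstab are made-up vectors that only borrow the variable names), and the layout choices are assumptions for illustration; this is not the code under section 3.5.<br />

```r
# Made-up dyad-year data for illustration (NOT the workshop data)
set.seed(1)
n <- 1000
allied  <- rbinom(n, 1, 0.1)
lndstab <- rnorm(n, mean = 8 - 1.5 * allied, sd = 1)

# Mean, median, quartiles, and range of the covariate by treatment status
stats <- sapply(split(lndstab, allied), function(v)
  c(mean = mean(v), median = median(v),
    q25 = unname(quantile(v, 0.25)), q75 = unname(quantile(v, 0.75)),
    min = min(v), max = max(v)))
colnames(stats) <- c("not allied", "allied")

# One panel: dotted range, solid interquartile segment, median tick, mean circle
plot(NA, xlim = range(lndstab), ylim = c(0.5, 2.5), yaxt = "n",
     xlab = "Distance (log)", ylab = "")
for (g in 1:2) {
  segments(stats["min", g], g, stats["max", g], g, lty = 3)
  segments(stats["q25", g], g, stats["q75", g], g, lwd = 2)
  points(stats["median", g], g, pch = "|")
  # black circle for allied dyads, white circle for not allied dyads
  points(stats["mean", g], g, pch = if (g == 2) 19 else 21, bg = "white")
}
axis(2, at = 1:2, labels = colnames(stats), las = 1)
```

Repeating the panel for each covariate, for example with par(mfrow = c(3, 3)), gives a multi-panel display.<br />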
3.6 Plotting Regression Coefficients<br />
Run the R code under section 3.6, and you will obtain Figure 4 (output/figs/coefplot.pdf).<br />
[Figure 4: dot plot of the estimated coefficients with horizontal interval segments (horizontal axis, −1 to 4) for Lower democracy, Higher democracy, Lower trade-to-GDP ratio, Allies, Capability ratio, Contiguity, Distance, and Major Power.]<br />
Figure 4: Estimated logit coefficients<br />
This figure shows the estimated coefficients from the logistic regression of MID onset (n = 12,233). Circles<br />
represent the point estimates, and the associated horizontal line segments show the 90% confidence intervals<br />
for the estimate.<br />
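A coefficient plot with 90% intervals can be sketched in a few lines of base R. For illustration, the estimates and standard errors below are copied from column (1) of Table 1 (trade variable and constant omitted), so they are not the numbers behind Figure 4, and this is not the code under section 3.6.<br />

```r
# Estimates and standard errors copied from column (1) of Table 1
# (trade variable and constant omitted); these are not the Figure 4 numbers
est <- c("Lower democracy" = -0.069, "Higher democracy" = 0.038,
         "Allies" = 0.075, "Capability ratio (log)" = -0.284,
         "Contiguous" = 1.127, "Distance (log)" = -0.289,
         "Major power" = 1.007)
se  <- c(0.008, 0.007, 0.101, 0.030, 0.152, 0.053, 0.133)

# 90% confidence intervals: estimate +/- qnorm(0.95) standard errors
crit  <- qnorm(0.95)
lower <- est - crit * se
upper <- est + crit * se

ypos <- length(est):1                # first coefficient at the top
op <- par(mar = c(4, 11, 1, 1))      # wide left margin for the labels
plot(est, ypos, xlim = range(lower, upper), pch = 19, yaxt = "n",
     xlab = "Logit coefficient", ylab = "")
segments(lower, ypos, upper, ypos)   # interval for each estimate
abline(v = 0, lty = 2)               # reference line at zero
axis(2, at = ypos, labels = names(est), las = 1)
par(op)
```

An interval that crosses the dashed line at zero (here, Allies) marks a coefficient that is not statistically distinguishable from zero at the 10% level.<br />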