Problems for reproducibility and some possible solutions - Stata
Problems for reproducibility and some possible solutions - Stata
Problems for reproducibility and some possible solutions - Stata
You also want an ePaper? Increase the reach of your titles
YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong><br />
<strong>possible</strong> <strong>solutions</strong><br />
Ulrich Kohler<br />
kohler@wz-berlin.de<br />
Wissenschaftszentrum Berlin<br />
(Presentation prepared with LAT E X)<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.1/25
Plan of the presentation<br />
Introduction: On <strong>reproducibility</strong><br />
OS <strong>and</strong> environment dependency<br />
Software dependency<br />
Dataset confusion<br />
Transcription errors<br />
Outlook<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.2/25
On Reproducibility<br />
To ensure <strong>reproducibility</strong>, all steps of an analysis should be<br />
documented in program files <strong>and</strong> be made accessible to<br />
others. In this definition<br />
“all steps” means all steps (!),<br />
“documented” means, that it should be easy to find the<br />
program file <strong>for</strong> a specific analysis <strong>and</strong> the program file<br />
should be easy to underst<strong>and</strong>,<br />
“program files” means Do-file, SPS-File, etc, <strong>and</strong><br />
“accessible” means, program files should be h<strong>and</strong>led<br />
to others when requested <strong>and</strong> program files should<br />
produce the same results when running by others.<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.3/25
On Reproducibility<br />
I have tried to make reproducible analyses in this sense.<br />
They are accessible on the web. When preparing these<br />
analyses I <strong>some</strong>times was caught in more or less hidden<br />
traps. This presentation shows how I escaped.<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.4/25
OS <strong>and</strong> environment dependency<br />
With <strong>Stata</strong>, OS <strong>and</strong> environment dependency is a minor<br />
problem. <strong>Problems</strong> can arise if you specify filenames in<br />
do-files—especially if your research-group works in<br />
different institutions with different local network structures.<br />
Do not use the backslash as directory separator. Code<br />
use subdir/mydata instead of use<br />
subdir\mydata<br />
Do not use absolute pathnames in a do-file. Do not<br />
code use c:/path/mydata. Interactively cd to<br />
c:/path <strong>and</strong> code use mydata in the do-file<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.5/25
OS <strong>and</strong> environment dependency<br />
Major databases are used <strong>for</strong> many projects. They<br />
shouldn’t be copied to each project path. Absolute<br />
Pathnames could be circumvented by setting global<br />
macros in a project-setup file. This way, the pathnames<br />
must be only edited once.<br />
global soepdir<br />
master.do<br />
"c:/important data/gsoep16" //
Software dependency<br />
<strong>Stata</strong><br />
Obviously, a <strong>Stata</strong> do-file requires <strong>Stata</strong>. But with version<br />
control you can specify which <strong>Stata</strong> should be used:<br />
* Analysis of the xyz-hypothesis<br />
version 3<br />
...<br />
xyz.do<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.7/25
Software dependency<br />
Cool Ados<br />
User written programs are useful tools. But sharing do-files<br />
can getting painful if you often use them. And: User written<br />
programs <strong>some</strong>times change. Installing a new version of a<br />
user written program might break do-files.<br />
Install Ados within a do-file:<br />
capture which mkdat<br />
if _rc ˜= 0 {<br />
net from http://www.sowi.uni-mannheim.de/lesas/ado<br />
net install mkdat<br />
}<br />
master.do<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.8/25
Software dependency<br />
Consider copying the program-definition into your<br />
do-file:<br />
capture program drop soepren //
Software dependency<br />
Consider copying a renamed copy of the program in<br />
your project directory (<strong>and</strong> share the renamed copy<br />
with your colleagues):<br />
*! soepren.ado 0.1, ukohler@sowi.uni-mannheim.de<br />
program define mysoepren //
Dataset confusion<br />
Many datasets regularly gets updated by their originators.<br />
Sometimes the update ships with the same name but it is a<br />
complete different file (<strong>for</strong> example the cumulated<br />
ALLBUS).<br />
Check the properties of such datasets:<br />
describe using $allbdir/s1795, short<br />
assert r(N) == 34956 & r(k) == 847<br />
use $allbdir/s1795<br />
mysoepren.do<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.11/25
Transcription errors<br />
At one point of an analysis, results from the <strong>Stata</strong> output<br />
needs to be filled into a text document. This is nowadays<br />
often done by copy <strong>and</strong> paste. However transcription<br />
errors still arise, because<br />
pasted results needs to be “rearranged” in <strong>some</strong> way,<br />
<strong>and</strong>/or<br />
specific numbers needs to be “picked” from different<br />
comm<strong>and</strong> outputs.<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.12/25
Transcription errors<br />
The likelihood of transcription errors raise if you did the<br />
copy-paste-rearrangement-picking procedure the fifth time<br />
after detecting data errors over <strong>and</strong> over again.<br />
Minimize copy-paste-rearrangement-picking<br />
procedures by<br />
1. using graphs<br />
2. using available specialized tools<br />
3. exporting datasets which contain the results<br />
4. using file<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.13/25
Transcription errors<br />
Graphs<br />
<strong>Stata</strong> graphs can be included into Ms-Word or L A T E X without<br />
copy-paste. This way the text contains always the latest<br />
version of the graph.<br />
graph dot uc, over(country, sort((mean) uc))<br />
graph export turnout.eps, replace<br />
graph export turnout.wmf, replace<br />
xyz.do<br />
\includegraphics{turnout}<br />
xyz.tex<br />
Einfuegen->Objekt->Aus Datei ertellen-><br />
"turnout.wmf" (Als Verknuepfung markieren)<br />
xyz.doc<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.14/25
Transcription errors<br />
Specialized Tools<br />
There are several tools to get <strong>Stata</strong> output in a more<br />
publication ready <strong>for</strong>mat. Some of them will be described<br />
by Roger Newson in the next issue of the <strong>Stata</strong> Journal<br />
(2003, Nr. 3). Examples are<br />
outreg, re<strong>for</strong>mat, mktab, slist, fsum, ci<strong>for</strong>m,<br />
xcontract<br />
Tools to work with L A T E X: sjlog, latab, est2tex,<br />
outtable, sutex, outtex, maketex, dotex,<br />
listtex<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.15/25
Transcription errors<br />
Export datasets containing the results<br />
Exporting datasets containing the results provide a general<br />
way to get publication-ready output. I used this technique<br />
to produce the following table, which is quite different from<br />
the normal <strong>Stata</strong> output. The table has been fully produced<br />
within <strong>Stata</strong> <strong>and</strong> inserted into this presentation with the<br />
L A T E X-comm<strong>and</strong> input{tables.tex}.<br />
Kriminalitaetsfurcht<br />
Gross Mittel Klein Gueltig Fehlend Gesamt<br />
Maennlich 57.4 38.4 4.2 6377 43 6420<br />
Weiblich 63.5 33.5 3 6807 56 6863<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.16/25
Transcription errors<br />
To produce the table I constructed a <strong>Stata</strong> dataset<br />
which contains the numbers <strong>for</strong> the table <strong>and</strong> nothing<br />
else.<br />
. list<br />
gender n1 n2 n3 n99 Nv N<br />
Maennlich 57.4 38.4 4.2 43 6377 6420<br />
Weiblich 63.5 33.5 3 56 6807 6863<br />
This dataset can be exported to a <strong>for</strong>mat suited <strong>for</strong><br />
Word-Processors <strong>and</strong>/or L A T E X.<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.17/25
Transcription errors<br />
To export to MS-Word, type<br />
outsheet gender n1 n2 n3 Nv n99 N using xyz.tsv<br />
inside <strong>Stata</strong> <strong>and</strong> use Word to import xyz.tsv:<br />
Einfügen, Datei, Namen auswählen<br />
Gesamten eingefügten Bereich markieren<br />
Tabelle, Umw<strong>and</strong>eln, Text in Tabelle, Trennzeichen:<br />
Tabstops<br />
Now, dress up the table with Word. Un<strong>for</strong>tunatelly there<br />
seems no way to automate the Word-steps from within<br />
<strong>Stata</strong> (but listtex brings you <strong>some</strong>what closer).<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.18/25
Transcription errors<br />
To export to L A T E X type<br />
tables.do<br />
listtex gender n1 n2 n3 Nv n99 N using tables.tex, rstyle(tabular<br />
head("\begin{tabular}{lrrrrrrr}\hline"<br />
" & \multicolumn{6}{c}{\emph{Kriminalitaetsfurcht}} \\\\"<br />
" & Gross & Mittel & Klein & Gueltig & Missing & Valid \\\<br />
foot("\hline\end{tabular}")<br />
One can also use listtex to export tables to Word.<br />
However you still need to convert text to tables within<br />
word.<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.19/25
Transcription errors<br />
Finally, here is how I produced the <strong>Stata</strong> dataset:<br />
* Produce absolute Frequencies<br />
by gender worries, sort: gen n = _N<br />
by gender worries: gen Nv = sum(worries˜=99)<br />
by gender worries: keep if _n == _N<br />
by gender (worries), sort: gen N = sum(n)<br />
by gender (worries): replace Nv = sum(Nv)<br />
by gender: replace N = N[_N]<br />
by gender: replace Nv = Nv[_N]<br />
tables.do<br />
* Re<strong>for</strong>mat the data<br />
reshape wide n, j(worries) i(gender)<br />
* Calculate percentage values from valid n<br />
<strong>for</strong>v i = 1/3 {<br />
replace n‘i’ = round(n‘i’/Nv*100,.1)<br />
}<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.20/25
Transcription errors<br />
Producing publication ready output often is a<br />
data-management problem.<br />
There are specialized tools, but in the long run learning<br />
to use the general data-mangement tools will pay back.<br />
by varlist:, post, reshape, file<br />
Local-macros<br />
Saved Results<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.21/25
Transcription errors<br />
file<br />
The comm<strong>and</strong> file is often convenient if you want to<br />
produce an output with numbers from different procedures.<br />
I used file to produce the following table, which picks<br />
Likelihood-Ratio-Tests from different models. The table has<br />
been fully produced within <strong>Stata</strong> <strong>and</strong> inserted into this<br />
presentation with input{anincchi2.tex}.<br />
Baseline<br />
Hypotheses<br />
Country Main Effects 307.96 261.57<br />
(0) (0)<br />
Interaction Effects 25.08 26.32<br />
(0) (0)<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.22/25
Transcription errors<br />
Here is how I produced the table:<br />
aninc.do<br />
file open an1 using anincchi2.tex, write text replace<br />
file write an1 "\begin{tabular}{lrrr} \hline " _n<br />
file write an1 "& Baseline & Hypotheses \\\\ \hline" _n<br />
file write an1 "Country Main Effects & ‘lrctr1’ & ‘lrctr3’ \\\\"<br />
file write an1 "& (‘pctr1’) & (‘pctr3’) \\\\" _n<br />
file write an1 "Interaction Effects & ‘lria1’ & ‘lria3’ \\\\"<br />
file write an1 "& (‘pia1’) & (‘pia3’) \\\\ \hline" _n<br />
file write an1 "\end{tabular} "<br />
file close an1<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.23/25
Outlook<br />
Combining the tools mentioned so far, it is a natural step to<br />
produce a publication ready PDF-file with tables <strong>for</strong> all<br />
variables in a data-set. The general outline would be:<br />
dreams.do<br />
file open tables using alltables.tex, write text replace<br />
file write tables "\documentclass{book}" _n<br />
file write tables "\begin{document}" _n<br />
<strong>for</strong>each var of varlist _all {<br />
mytabs ‘var’ //
Outlook<br />
The crucial step in the program outline is mytabs<br />
which doesn’t exist so far. At the WZB we have been<br />
quite successful in constructing a specialized version<br />
of mytabs <strong>for</strong> a single dataset with 658 variables. But<br />
a generalization needs <strong>some</strong> further ef<strong>for</strong>ts.<br />
Note also that file can read from arbitrary files.<br />
There<strong>for</strong>e one may also add additional in<strong>for</strong>mation<br />
from a database—i.e. questions from a<br />
questionnaire—to each of the tables.<br />
<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.25/25