29.07.2014 Views

Problems for reproducibility and some possible solutions - Stata

Problems for reproducibility and some possible solutions - Stata

Problems for reproducibility and some possible solutions - Stata

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong><br />

<strong>possible</strong> <strong>solutions</strong><br />

Ulrich Kohler<br />

kohler@wz-berlin.de<br />

Wissenschaftszentrum Berlin<br />

(Presentation prepared with LAT E X)<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.1/25


Plan of the presentation<br />

Introduction: On <strong>reproducibility</strong><br />

OS <strong>and</strong> environment dependency<br />

Software dependency<br />

Dataset confusion<br />

Transcription errors<br />

Outlook<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.2/25


On Reproducibility<br />

To ensure <strong>reproducibility</strong>, all steps of an analysis should be<br />

documented in program files <strong>and</strong> be made accessible to<br />

others. In this definition<br />

“all steps” means all steps (!),<br />

“documented” means, that it should be easy to find the<br />

program file <strong>for</strong> a specific analysis <strong>and</strong> the program file<br />

should be easy to underst<strong>and</strong>,<br />

“program files” means Do-file, SPS-File, etc, <strong>and</strong><br />

“accessible” means, program files should be h<strong>and</strong>led<br />

to others when requested <strong>and</strong> program files should<br />

produce the same results when running by others.<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.3/25


On Reproducibility<br />

I have tried to make reproducible analyses in this sense.<br />

They are accessible on the web. When preparing these<br />

analyses I <strong>some</strong>times was caught in more or less hidden<br />

traps. This presentation shows how I escaped.<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.4/25


OS <strong>and</strong> environment dependency<br />

With <strong>Stata</strong>, OS <strong>and</strong> environment dependency is a minor<br />

problem. <strong>Problems</strong> can arise if you specify filenames in<br />

do-files—especially if your research-group works in<br />

different institutions with different local network structures.<br />

Do not use the backslash as directory separator. Code<br />

use subdir/mydata instead of use<br />

subdir\mydata<br />

Do not use absolute pathnames in a do-file. Do not<br />

code use c:/path/mydata. Interactively cd to<br />

c:/path <strong>and</strong> code use mydata in the do-file<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.5/25


OS <strong>and</strong> environment dependency<br />

Major databases are used <strong>for</strong> many projects. They<br />

shouldn’t be copied to each project path. Absolute<br />

Pathnames could be circumvented by setting global<br />

macros in a project-setup file. This way, the pathnames<br />

must be only edited once.<br />

global soepdir<br />

master.do<br />

"c:/important data/gsoep16" //


Software dependency<br />

<strong>Stata</strong><br />

Obviously, a <strong>Stata</strong> do-file requires <strong>Stata</strong>. But with version<br />

control you can specify which <strong>Stata</strong> should be used:<br />

* Analysis of the xyz-hypothesis<br />

version 3<br />

...<br />

xyz.do<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.7/25


Software dependency<br />

Cool Ados<br />

User written programs are useful tools. But sharing do-files<br />

can getting painful if you often use them. And: User written<br />

programs <strong>some</strong>times change. Installing a new version of a<br />

user written program might break do-files.<br />

Install Ados within a do-file:<br />

capture which mkdat<br />

if _rc ˜= 0 {<br />

net from http://www.sowi.uni-mannheim.de/lesas/ado<br />

net install mkdat<br />

}<br />

master.do<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.8/25


Software dependency<br />

Consider copying the program-definition into your<br />

do-file:<br />

capture program drop soepren //


Software dependency<br />

Consider copying a renamed copy of the program in<br />

your project directory (<strong>and</strong> share the renamed copy<br />

with your colleagues):<br />

*! soepren.ado 0.1, ukohler@sowi.uni-mannheim.de<br />

program define mysoepren //


Dataset confusion<br />

Many datasets regularly gets updated by their originators.<br />

Sometimes the update ships with the same name but it is a<br />

complete different file (<strong>for</strong> example the cumulated<br />

ALLBUS).<br />

Check the properties of such datasets:<br />

describe using $allbdir/s1795, short<br />

assert r(N) == 34956 & r(k) == 847<br />

use $allbdir/s1795<br />

mysoepren.do<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.11/25


Transcription errors<br />

At one point of an analysis, results from the <strong>Stata</strong> output<br />

needs to be filled into a text document. This is nowadays<br />

often done by copy <strong>and</strong> paste. However transcription<br />

errors still arise, because<br />

pasted results needs to be “rearranged” in <strong>some</strong> way,<br />

<strong>and</strong>/or<br />

specific numbers needs to be “picked” from different<br />

comm<strong>and</strong> outputs.<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.12/25


Transcription errors<br />

The likelihood of transcription errors raise if you did the<br />

copy-paste-rearrangement-picking procedure the fifth time<br />

after detecting data errors over <strong>and</strong> over again.<br />

Minimize copy-paste-rearrangement-picking<br />

procedures by<br />

1. using graphs<br />

2. using available specialized tools<br />

3. exporting datasets which contain the results<br />

4. using file<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.13/25


Transcription errors<br />

Graphs<br />

<strong>Stata</strong> graphs can be included into Ms-Word or L A T E X without<br />

copy-paste. This way the text contains always the latest<br />

version of the graph.<br />

graph dot uc, over(country, sort((mean) uc))<br />

graph export turnout.eps, replace<br />

graph export turnout.wmf, replace<br />

xyz.do<br />

\includegraphics{turnout}<br />

xyz.tex<br />

Einfuegen->Objekt->Aus Datei ertellen-><br />

"turnout.wmf" (Als Verknuepfung markieren)<br />

xyz.doc<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.14/25


Transcription errors<br />

Specialized Tools<br />

There are several tools to get <strong>Stata</strong> output in a more<br />

publication ready <strong>for</strong>mat. Some of them will be described<br />

by Roger Newson in the next issue of the <strong>Stata</strong> Journal<br />

(2003, Nr. 3). Examples are<br />

outreg, re<strong>for</strong>mat, mktab, slist, fsum, ci<strong>for</strong>m,<br />

xcontract<br />

Tools to work with L A T E X: sjlog, latab, est2tex,<br />

outtable, sutex, outtex, maketex, dotex,<br />

listtex<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.15/25


Transcription errors<br />

Export datasets containing the results<br />

Exporting datasets containing the results provide a general<br />

way to get publication-ready output. I used this technique<br />

to produce the following table, which is quite different from<br />

the normal <strong>Stata</strong> output. The table has been fully produced<br />

within <strong>Stata</strong> <strong>and</strong> inserted into this presentation with the<br />

L A T E X-comm<strong>and</strong> input{tables.tex}.<br />

Kriminalitaetsfurcht<br />

Gross Mittel Klein Gueltig Fehlend Gesamt<br />

Maennlich 57.4 38.4 4.2 6377 43 6420<br />

Weiblich 63.5 33.5 3 6807 56 6863<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.16/25


Transcription errors<br />

To produce the table I constructed a <strong>Stata</strong> dataset<br />

which contains the numbers <strong>for</strong> the table <strong>and</strong> nothing<br />

else.<br />

. list<br />

gender n1 n2 n3 n99 Nv N<br />

Maennlich 57.4 38.4 4.2 43 6377 6420<br />

Weiblich 63.5 33.5 3 56 6807 6863<br />

This dataset can be exported to a <strong>for</strong>mat suited <strong>for</strong><br />

Word-Processors <strong>and</strong>/or L A T E X.<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.17/25


Transcription errors<br />

To export to MS-Word, type<br />

outsheet gender n1 n2 n3 Nv n99 N using xyz.tsv<br />

inside <strong>Stata</strong> <strong>and</strong> use Word to import xyz.tsv:<br />

Einfügen, Datei, Namen auswählen<br />

Gesamten eingefügten Bereich markieren<br />

Tabelle, Umw<strong>and</strong>eln, Text in Tabelle, Trennzeichen:<br />

Tabstops<br />

Now, dress up the table with Word. Un<strong>for</strong>tunatelly there<br />

seems no way to automate the Word-steps from within<br />

<strong>Stata</strong> (but listtex brings you <strong>some</strong>what closer).<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.18/25


Transcription errors<br />

To export to L A T E X type<br />

tables.do<br />

listtex gender n1 n2 n3 Nv n99 N using tables.tex, rstyle(tabular<br />

head("\begin{tabular}{lrrrrrrr}\hline"<br />

" & \multicolumn{6}{c}{\emph{Kriminalitaetsfurcht}} \\\\"<br />

" & Gross & Mittel & Klein & Gueltig & Missing & Valid \\\<br />

foot("\hline\end{tabular}")<br />

One can also use listtex to export tables to Word.<br />

However you still need to convert text to tables within<br />

word.<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.19/25


Transcription errors<br />

Finally, here is how I produced the <strong>Stata</strong> dataset:<br />

* Produce absolute Frequencies<br />

by gender worries, sort: gen n = _N<br />

by gender worries: gen Nv = sum(worries˜=99)<br />

by gender worries: keep if _n == _N<br />

by gender (worries), sort: gen N = sum(n)<br />

by gender (worries): replace Nv = sum(Nv)<br />

by gender: replace N = N[_N]<br />

by gender: replace Nv = Nv[_N]<br />

tables.do<br />

* Re<strong>for</strong>mat the data<br />

reshape wide n, j(worries) i(gender)<br />

* Calculate percentage values from valid n<br />

<strong>for</strong>v i = 1/3 {<br />

replace n‘i’ = round(n‘i’/Nv*100,.1)<br />

}<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.20/25


Transcription errors<br />

Producing publication ready output often is a<br />

data-management problem.<br />

There are specialized tools, but in the long run learning<br />

to use the general data-mangement tools will pay back.<br />

by varlist:, post, reshape, file<br />

Local-macros<br />

Saved Results<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.21/25


Transcription errors<br />

file<br />

The comm<strong>and</strong> file is often convenient if you want to<br />

produce an output with numbers from different procedures.<br />

I used file to produce the following table, which picks<br />

Likelihood-Ratio-Tests from different models. The table has<br />

been fully produced within <strong>Stata</strong> <strong>and</strong> inserted into this<br />

presentation with input{anincchi2.tex}.<br />

Baseline<br />

Hypotheses<br />

Country Main Effects 307.96 261.57<br />

(0) (0)<br />

Interaction Effects 25.08 26.32<br />

(0) (0)<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.22/25


Transcription errors<br />

Here is how I produced the table:<br />

aninc.do<br />

file open an1 using anincchi2.tex, write text replace<br />

file write an1 "\begin{tabular}{lrrr} \hline " _n<br />

file write an1 "& Baseline & Hypotheses \\\\ \hline" _n<br />

file write an1 "Country Main Effects & ‘lrctr1’ & ‘lrctr3’ \\\\"<br />

file write an1 "& (‘pctr1’) & (‘pctr3’) \\\\" _n<br />

file write an1 "Interaction Effects & ‘lria1’ & ‘lria3’ \\\\"<br />

file write an1 "& (‘pia1’) & (‘pia3’) \\\\ \hline" _n<br />

file write an1 "\end{tabular} "<br />

file close an1<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.23/25


Outlook<br />

Combining the tools mentioned so far, it is a natural step to<br />

produce a publication ready PDF-file with tables <strong>for</strong> all<br />

variables in a data-set. The general outline would be:<br />

dreams.do<br />

file open tables using alltables.tex, write text replace<br />

file write tables "\documentclass{book}" _n<br />

file write tables "\begin{document}" _n<br />

<strong>for</strong>each var of varlist _all {<br />

mytabs ‘var’ //


Outlook<br />

The crucial step in the program outline is mytabs<br />

which doesn’t exist so far. At the WZB we have been<br />

quite successful in constructing a specialized version<br />

of mytabs <strong>for</strong> a single dataset with 658 variables. But<br />

a generalization needs <strong>some</strong> further ef<strong>for</strong>ts.<br />

Note also that file can read from arbitrary files.<br />

There<strong>for</strong>e one may also add additional in<strong>for</strong>mation<br />

from a database—i.e. questions from a<br />

questionnaire—to each of the tables.<br />

<strong>Problems</strong> <strong>for</strong> <strong>reproducibility</strong> <strong>and</strong> <strong>some</strong> <strong>possible</strong> <strong>solutions</strong> – p.25/25

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!