12.07.2015 Views

Stata Tutorial - Data and Statistical Services - Princeton University


1.3. About Data for Stata

To put a data set into Stata’s memory, the data set has to be in a format Stata understands. The following is a list of the extensions of files Stata can read directly.

Data Format                File Extension    Command to read the data
Stata                      .dta              . use
Text (ASCII)
  Free or fixed columns    .raw, .txt        . infile using
  Comma separated values   .csv              . insheet using
  Fixed columns            .dat              . infix using
SAS export                 .xport, .xpt      . fdause
MS Access                  .mdb              . odbc

Many data download sites provide you with data already formatted for a common statistical program such as Stata, SPSS, or SAS. Formatted data often contain variable labels and value labels that make it easier for you to understand the contents of the data.

If Stata data are not available, and you can choose between SPSS and SAS formats, then I would recommend selecting SPSS. You can use SPSS to open SPSS data, then save the data as Stata data. SPSS versions 12 and up can save the data as Stata 8 data [1]. A Windows version of SPSS is available in the McCosh 59 cluster or the DSS computer lab. If data are only available in SAS format, you may use SAS to open the SAS data, then create a SAS export file, as Stata can read a SAS export file. A Windows version of SAS is available in the DSS computer lab. Unix versions of SPSS and SAS are available at Tombstone.

Also, if you acquire data that are in a format other than Stata, you may use DBMS/Copy to convert them into Stata format. A Windows version of DBMS/Copy is available in the DSS computer lab. A Unix version of DBMS/Copy is available at Tombstone.

If you have SAS data, we recommend converting them into a SAS transport file in SAS instead of using DBMS/Copy. DBMS/Copy has a known issue in converting value labels from SAS to Stata.

If formatted data are not available, data distributors may provide set up files in Stata, SPSS, or SAS along with ASCII data. An ASCII data set is a text file with rows (or columns) of numbers. If a set up file is available in Stata, you can attach the variable information using Stata. If a set up file is available in SPSS, it will be easier to use SPSS to attach the definition, then save the data as Stata data. If a set up file is available in SAS, you may use SAS to attach the file definition, then create a SAS export file in SAS. You may also modify the set up files in a text editor to use them in Stata. The commands to define data are different in all three programs. If no set up files are available and only PDF codebooks are available, you will need to select the variables you want to use and create your own set up file for Stata.

If you need help in defining or converting data, please come by the Data and Statistical Services computer lab at A-16-H-3 in Firestone Library during walk-in hours or email data@princeton.edu. The hours and directions are available at http://dss.princeton.edu. If you are emailing questions, please use your Princeton email. Our resources and assistance are available to Princeton University community members.

The Stata data format changed between version 9 and version 10. Stata 10 can read data saved by Stata 9, but Stata 9 cannot read data saved by Stata 10, even though both use the same extension, .dta. If you plan to use Stata 9 after using Stata 10, you may save the data as Stata 9 data in Stata 10. The following commands allow you to save data as Stata 9 data in Stata 10.

CMD: . saveold filename
MNU: File => Save As. Then select “Stata 9 Data” from the drop down list for the box “Save As Type:”

[1] Stata 8 and Stata 9 data are interchangeable.
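As a rough illustration of the read commands and of saveold, here is a minimal sketch that writes a small comma separated file from the auto data shipped with Stata, reads it back with insheet, and saves the result in the older format. The file names auto_sample.csv and auto_sample.dta are made up for this example. In newer Stata releases, import delimited and export delimited supersede insheet and outsheet, although the older commands should still run.

```stata
* Sketch only: auto_sample.csv and auto_sample.dta are hypothetical names.
sysuse auto, clear

* Write three variables to a comma separated file; names go in the first line.
outsheet make price mpg using auto_sample.csv, comma replace

* Read the .csv file back; the first line supplies the variable names.
insheet using auto_sample.csv, comma clear

* Check the storage types: make should be a string, price and mpg numeric.
describe

* Save in the older .dta format so an earlier Stata version can open it.
saveold auto_sample, replace
```

If describe shows a storage type starting with "str" for price or mpg, the file contains non-numeric characters in those columns.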


2. Read in data

2.1. Reading in an ASCII data file using a Stata set up file.

Often, you will obtain a command file and a dictionary file as a set of Stata set up files along with a data file. I suggest that you save all three files in the same directory. The command file has the extension .do, the dictionary file .dct, and the data file .txt (or .dat). Command files in Stata are also called do files. Sometimes the do file contains the dictionary, so you have only two files, the do file and the data file. The procedure is similar to having three files.

As an example, I downloaded a Stata set up file and data file for the National Health Interview Survey from the Inter-university Consortium for Political and Social Research (ICPSR) web site, http://www.icpsr.umich.edu. The files usually are zipped when you download them. I extracted the zipped files using WinZip, and put them in the C:\StataHandsOn\SampleData directory. WinZip is available on DSS lab computers. OIT computers do not have WinZip, but the extraction software that comes with Windows can unzip files.

Then I opened the Stata command file using NotePad (any text editor will do, but not a word processor like MS Word). Instructions are given at the beginning of the command file, sandwiched between lines of asterisks as in the picture below. Text placed between a forward slash and asterisk pair (/* text */) is treated as a comment. Follow the instructions and specify the names and paths of the data, dictionary, and output data files in the do file. Here is a copy of the beginning of the do file for the National Health Interview Survey data.


/**************************************************************************
| STATA SETUP FILE FOR ICPSR 04349
| NATIONAL HEALTH INTERVIEW SURVEY, 2004
| (DATASET 0004: SAMPLE ADULT)
|
| Please edit this file as instructed below.
| To execute, start Stata, change to the directory containing:
| - this do file
| - the ASCII data file
| - the dictionary file
|
| Then execute the do file (e.g., do 04349-0004-statasetup.do)
**************************************************************************/

set mem 40m /* Allocating 40 megabyte(s) of RAM for Stata SE to read the
               data file into memory. */
set more off /* This prevents the Stata output viewer from pausing the
                process */

/****************************************************
Section 1: File Specifications
This section assigns local macros to the necessary files.
Please edit:
"data-filename" ==> The name of data file downloaded from ICPSR
"dictionary-filename" ==> The name of the dictionary file downloaded.
"stata-datafile" ==> The name you wish to call your Stata data file.
Note: We assume that the raw data, dictionary, and setup (this do file) all
reside in the same directory (or folder). If that is not the case
you will need to include paths as well as filenames in the macros.
********************************************************/
local raw_data "C:\StataHandsOn\SampleData\04349-0004-Data.txt"
local dict "C:\StataHandsOn\SampleData\04349-0004-Stata_dictionary.dct"
local outfile "C:\StataHandsOn\SampleData\health.dta"

/********************************************************
Section 2: Infile Command
This section reads the raw data into Stata format. If Section 1 was defined
properly, there should be no reason to modify this section. These macros
should inflate automatically.
**********************************************************/
infile using `dict', using (`raw_data') clear

In each of the three lines that start with “local”, the quoted text gives the file path followed by the file name. Replace the file names there with your own.

Once you have the file paths and names inserted into the do file, execute the do file (in this example named 04349-0004-Setup.do) in Stata by giving the command:

. do 04349-0004-Setup

In this case, you do not need to modify the dictionary file. In some cases, you may need to specify the data file path and name in the dictionary file.

I specified in the do file that the output Stata data should be named health.dta (see the third line that starts with “local outfile”), and you can see the file listed in the directory in the picture on the previous page.

You may obtain a data definition file for SAS or SPSS. The idea of attaching the data definition in SAS or in SPSS is the same as in Stata, except that the data definition is in a single file, and it needs to be executed in the respective program. Please refer to the separate handouts for details on running data definition files using SAS or SPSS.
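To see how the local macros and the infile command fit together without downloading anything, here is a small self-contained sketch. It writes a three-record raw data file and a matching dictionary, then reads them the same way the setup file does. Everything named here (demo_raw.txt, demo.dct, demo.dta, and the variables id, age, and sex) is invented for illustration and is not part of the ICPSR files. Run it as a do file, because local macros disappear once a do file finishes.

```stata
* Sketch only: all file and variable names below are hypothetical.
clear

* Write a three-record, fixed-column raw data file.
file open raw using "demo_raw.txt", write text replace
file write raw "001 25 1" _n
file write raw "002 31 2" _n
file write raw "003 47 1" _n
file close raw

* Write a dictionary that tells Stata which columns hold which variable.
file open dct using "demo.dct", write text replace
file write dct "dictionary {" _n
file write dct "    _column(1) int  id  %3f" _n
file write dct "    _column(5) byte age %2f" _n
file write dct "    _column(8) byte sex %1f" _n
file write dct "}" _n
file close dct

* Same pattern as Section 1 of the setup file: file names kept in local macros.
local raw_data "demo_raw.txt"
local dict "demo.dct"
local outfile "demo.dta"

* Same pattern as Section 2: the macros expand to the names defined above.
infile using `dict', using(`raw_data') clear
save `outfile', replace
list
```

Keeping the file names in macros means that only the three local lines need editing when the files move to another folder.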


2.2. Creating a Stata set up file.

When you have an ASCII data file but not a set up file, you will need to create one to define variables. An ASCII data file contains many rows of numbers and Stata will not know which numbers belong to which variables. You also need to define the type of each variable, whether it is numeric (numbers) or string (text or characters). ASCII data may be in free format, comma separated, or fixed columns. The example files used in this exercise are 2008 Democratic and Republican Presidential Primaries/Iraq (United States), downloaded from the Roper Center, http://www.ropercenter.uconn.edu.

Here is a portion of a fixed column ASCII data file, called lat544.dat:

.25 1 2391 1 1 1 4 2 1 2 3 3 2 & 338 & & & & & & & 131 5 6 7 7 3 1 4 4 4 4 4 4 & & & & 0 & & & 2 0 2 2 1 2 2 0 1 2 1 1 2 2 2 1 4 2 & 5 1 2 3 2 65 & 6 112 5 1 2
1.66 2 9041 1 1 1 4 1 2 1 2 3 2 & 8 13& & & & & & & 5 2 1 3 3 6 3 1 4 & & & & & 4 4 4 4 4 3 5 & 1 2 1 2 4 2 2 0 1 2 1 4 1 1 1 1 5 2 & 1 3 4 1 1 55 & 6 6 1 5 1 2
.47 3 2122 4 4 4 2 3 2 4 2 1 1 & 1 2 112 5 3 6 6 6 & & & & & & & & 1 & & & & & 1 1 1 1 0 & & & 2 2 2 1 1 1 1 4 2 1 2 1 2 1 1 1 1 1 & 1 7 & & & 65 & 7 8 1 2 1 1
.41 4 4122 4 4 4 1 2 4 1 1 1 1 & 1 8 7 2 3 1 2 5 6 & & & & & & & & 1 & & & & & 1 1 1 1 0 & & & 3 3 3 3 1 1 1 4 0 1 2 0 2 1 2 1 2 1 & 1 2 3 1 1 & 5 5 5 2 2 1 1

A portion of the corresponding codebook says ...

Data Locations
Variable   Rec  Start  End  Format
WTVAR      1    1      7    F7.2
CASE       1    8      14   F7.0
AREACODE   1    15     21   F7.0
TRACK      1    22     23   A2
GWBUSHJO   1    24     25   A2
GWBECON    1    26     27   A2
GWBIRAQ    1    28     29   A2

This means that the variable WTVAR is in the first record, starts in column 1, and ends in column 7. The data format is F7.2, meaning that it is a numeric variable with width 7 that includes two decimal places. In the data above, .25, 1.66, .47, and .41 correspond to this variable.

To define the variables in Stata, you can create a “dictionary” file that contains the variable information as shown below. You can use any text editor, but here we use Stata’s “do-file editor.” Open a do-file editor:

CMD: . doedit
MNU: Window => Do File Editor => New Do File.

There, type

infix dictionary using H:\lat544.dat {
WTVAR 1-7
CASE 8-14
AREACODE 15-21
str TRACK 22-23
str GWBUSHJO 24-25
str GWBECON 26-27
str GWBIRAQ 28-29
}

A carriage return is Stata’s default signal to end a command, so it is important to type the dictionary exactly as it appears here. The first line ends after the opening squiggly-brace ( { ), each variable name and its column locations are on one line, and the closing squiggly-brace ( } ) is on its own line.

Save the file as a dictionary (.dct) file in the same directory as the data file. For example, I saved the file as H:\lat544.dct, as I have the data file, lat544.dat, at the root of the H drive. The str in front of a variable name indicates that it is a string variable. I have omitted the record number, as there is only one record in this data. If your data file has more than one record, you need to define which record you are referring to for each of the variables. Please see help infix for the syntax for multiple records.

Once you save the dictionary file,

CMD: . infix using H:\lat544.dct
MNU: File => Import => ASCII data in fixed format. Then Browse to find the dictionary file name and path.

Stata will show the following in the output window.

. infix dictionary using H:\lat544.dat {
WTVAR 1-7
CASE 8-14
AREACODE 15-21
str TRACK 22-23
str GWBUSHJO 24-25
str GWBECON 26-27
str GWBIRAQ 28-29
}
(1373 observations read)

Check the codebook to see whether the total number of records is 1373.
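For a quick experiment with fixed columns, infix can also take the column locations directly on the command line instead of from a dictionary file. The sketch below writes a small three-record file and reads it that way. The file name demo_fixed.dat, the variable names, and the values are invented for illustration and only loosely imitate the first columns of lat544.dat.

```stata
* Sketch only: demo_fixed.dat and its contents are hypothetical.
clear

* Columns 1-7 hold a weight, 8-14 a case number, 15-16 a two-character code.
file open raw using "demo_fixed.dat", write text replace
file write raw "    .25      1 1" _n
file write raw "   1.66      2 &" _n
file write raw "    .47      3 4" _n
file close raw

* Give the column ranges inline; str marks the string variable.
infix wtvar 1-7 caseid 8-14 str track 15-16 using demo_fixed.dat, clear

* Check the storage types and the values that were read.
describe
list
```

A dictionary file is still the better choice when there are many variables, since it can be saved, reviewed, and reused.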


2.3. Reading in an Excel file

If the data file is “clean,” all you need to do is save the file as a .csv file in Excel and import it into Stata. However, if the data file is “not clean,” editing it first will make it easier to import into Stata. Here is an example of a “not clean” Excel file. The problems marked in the example are:

• header lines
• variable names that include special characters, start with a number, or have spaces between words
• values that include a special character and a comma
• a blank line and a blank column

Stata reads the values in the first line as the variable names. Header lines prevent the program from reading the variable names. Also, the program expects data from the second line on, so in this example, Stata will read all the variables as strings. Variable names in Stata cannot have special symbols or start with a number, and should not start with an underscore (_).

The following is an example of a “clean” Excel sheet. It has the following characteristics:

• The first line has Stata variable names: 32 characters or less, no special characters, and not starting with an underscore or number. Data begin from the second line.
• No blank rows or columns. (Blank cells are ok. Stata automatically adds a period (.) if the variable is numeric. Do not manually add . in blank cells.)
• Missing numeric data should be an empty cell or values defined as missing, such as 0, 9, or 99. A space (a stored space, which is different from an empty cell), a dot, or any other non-numeric character such as n/a will cause the variable to become a string.
• Commas in numbers or text are particularly problematic because Stata may see them as a delimiter and will not read the data properly. You should remove the commas from numeric values before saving the file.


Once you have examined the file and made sure that it is clean, follow these step-by-step instructions for saving a worksheet as a comma separated values file in Excel. As practice, let’s read in a sample Excel data file.

1. Open Internet Explorer, download auto.xls from https://webshare.princeton.edu/users/furuichi/auto.xls, and select Save to Disk.
2. Save the file in your H:\ directory.
3. Start Excel and read the file by selecting File => Open.
4. Under the File menu, select Save As, then Save as type 'CSV' (comma separated values).
5. Open Stata.
6. Change the directory in Stata.

Note: Renaming the file with a .csv extension in Windows Explorer is not the same as saving the file as a .csv file.

If the spreadsheet is small, you may copy the data and paste them into Stata’s data editor. Highlight all the data in Excel, and select Edit => Copy. Open Stata, then select Data Editor. Right click and select Paste, or press the Control and v keys at the same time, to paste the Excel data contents into Stata’s Data editor.

Stata may mistakenly read numeric variables as strings. Check that the original numeric values are numeric in the converted Stata data by issuing the command –describe- in Stata and examining the storage type. If the variable has a storage type that starts with "str," then Stata has made it a string variable.

If you see that a numeric variable in the original data file has a string storage type in Stata, go back to Excel, change the variable’s format to numeric, and re-save the file as a .csv file. Here is how:

1. Highlight the column with the numeric variable name.
2. Click Format => Cells.
3. In the Format Cells window, select the Number tab.
4. Under the Category drop down list, select Number.
5. Click OK.
6. Under the File menu, select Save As, then Save as type 'CSV' (comma separated values).

You can also change the variable type from string to numeric in Stata. At the command prompt, type

. destring stringvariablename, replace

For this command to work, stringvariablename cannot have any non-numeric characters in its values. If it fails, check the values of the variable to find the non-numeric characters.
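The sketch below shows one way to track down the values that block destring and then convert the variable anyway. The variable income_str and its four values are invented for this example. The ignore() and force options of destring are not covered in this handout, so treat their use here as a suggestion: ignore() strips the listed characters before converting, and force turns whatever still cannot be converted into missing.

```stata
* Sketch only: income_str and its values are hypothetical.
clear
input str8 income_str
"52000"
"48,500"
"n/a"
"61000"
end

* real() returns missing for text that is not a plain number,
* so this lists the values that would stop a plain destring.
list income_str if missing(real(income_str))

* Strip the commas, and turn anything still non-numeric into missing.
destring income_str, generate(income) ignore(",") force

* Compare the original text with the converted numbers.
list income_str income
```

Using generate() rather than replace keeps the original string variable, so the conversion can be checked before the string version is dropped.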


2.4. Reading in Stata data

Now, let’s start using Stata. From an OIT computer, a link to Stata may be found in Start => All Programs => Stata10 => StataSE10. A shortcut to Stata may be available from the Special Applications folder on the desktop. Double click the Stata icon.

Typing commands in the Command window

Stata starts in its default working folder, typically C:\Program Files\Stata\Stata10. Let’s change the directory to H:\StataHandsOn.
. cd H:\
Let’s create a StataHandsOn directory.
. mkdir StataHandsOn
. cd StataHandsOn
Before reading in a data file, let’s open a log file. A log file stores the output that appears in the Results Window.
. log using stata1.log
Now let’s read in the 1978 Auto data. It is a data file that comes with the Stata installation, and it is available in Stata format.
. sysuse auto
Suppose you want to add a label to the data, so that you can remember what the dataset is about. This is convenient if you make many subsets of data files from the original file. As an exercise, let’s label the data to say that it is the 1978 auto data for hands on training.
. label data "1978 auto data for hands on training"
We will work more on this data, but let’s save the data at this time. We’ll give it a name, testauto.
. save testauto
Let’s close the log file at this time and look at the file. Issue the command:
. log close
Use a text editor or MS Word to open the log file. You can also view a log file in Stata. Remember to include the extension with the file name when typing the –view- command.
. view stata1.log
Let’s clear the data in memory at this time and exit from Stata.
. clear
. exit

using Menus

File => Change Working Directory... (navigate to H:), then select Make New Folder
File => Log => Begin...
File => Example datasets => Example datasets installed with Stata => (auto.dta) use
Data => Labels => Label dataset
File => Save as
File => Log => Close
File => Log => View
File => Exit

Notes and Tips

If you don't change the directory, Stata will assume that the file name you type is in the default directory. The log and data files you save will also be in this directory unless you change it.

If you do not open a log file at the beginning of the session, the output will only be available in temporary memory. Once you exit from the program, the output will be lost.

A log file with the extension ".log" is a plain text file. This means you can open and read it in almost any text editor or word processor. If you issued the –log- command without the file extension,
. log using stata1
Stata would create "stata1.smcl." smcl is a log file type specific to Stata.

Notice that “log on (text)” appears in the rectangular space between the Results Window and the Command Window once you begin a log.

If you issue the –save, replace- command without specifying a file name, what is currently in memory will overwrite the original input file. To avoid losing the original data file by mistake, always remember to make a master copy before starting to work on the data.

Notice that “log on (text)” disappears from the rectangular space after you close the log.

The Results window cannot be cleared while in session.
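The same session can also be collected in a do file so that it can be rerun later. Here is a minimal sketch, assuming the current directory is writable. The names demo_session.log and testauto_demo are placeholders chosen so that the sketch does not overwrite the files created above.

```stata
* Sketch only: demo_session.log and testauto_demo.dta are hypothetical names.

* Close any log left open by an earlier run; capture ignores the error if none is open.
capture log close

* Open a plain text log; replace overwrites a log of the same name.
log using demo_session.log, text replace

* Read the example data shipped with Stata and label it.
sysuse auto, clear
label data "1978 auto data for hands on training"

* Save a working copy under a new name so the original stays untouched.
save testauto_demo, replace

* Stop logging.
log close
```

Adding the replace option to log using and save lets the do file run more than once without stopping on a "file already exists" error.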


Typing commands in the Command window

Restart Stata, and check whether we are in the directory we first specified.
. pwd
then list the files:
. ls
It should show that we are in the H:\StataHandsOn directory. Can you find testauto.dta and stata1.log? What is the size of the data file? It is a small data file, of only 5.4 kilobytes. As you may have seen on the very first screen, Stata’s memory may be initially set to 10 megabytes. Because this data file is smaller than what Stata allows in memory at this time, you will not have problems reading in the data.
Let’s turn the log file back on to continue saving the output in the same log file. Issue the command:
. log using stata1.log, append
To read the testauto data:
. use testauto
If the data file is larger than Stata’s current memory, Stata will issue an error message. Check the file size and set memory to give Stata more space. For example, if the data file is 36 megabytes, type
. set memory 40m
This gives Stata 40 megabytes of data memory to read the data.

using Menus

File => Change working directory...
File => Log => Begin, then select stata1.log, and Append to existing file
File => Open
(no equivalent menu)

Notes and Tips

pwd stands for Present Working Directory.

You can see the directory Stata is pointing at by looking at the bottom bar of Stata’s window. If you are not at H:\StataHandsOn, change the directory by typing in the Command window:
. cd H:\StataHandsOn

You can, of course, start a new log file instead of appending the new results to the existing log file. To start a new log file, give a new file name as in:
. log using stata2.log

You could clear the data in memory and read in a new data file in one step, by issuing the command:
. use testauto, clear

Only one data file can be read into Stata’s memory at a time. You need to clear the memory before reading in another set of data. (You can, however, open many instances of Stata on one computer.)

To see the maximum limits in Stata, type in the Command window:
. help limits

Review Questions:
1. How can I start Stata?
2. Which directory is this program pointing at?
3. How can I change the directory to H:\?
4. How large is the Auto data?
5. How do I read the data into Stata?
6. How do I label the data?
7. How do I save the data?
8. I don’t know what commands to use. How do I get help in Stata?
9. How do I record the output?

Hints:
. pwd
. cd C:\mydata
. dir   . ls
. set memory 20m
. use filename
. label data "descriptions"
. save filename, replace
. help commandname
. search keyword
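To see why the clear option matters, the following sketch changes the data in memory and then tries to read a file with and without clear. The file name testauto_demo and the variable touched are placeholders. The capture prefix is used only so that the expected error does not stop the do file.

```stata
* Sketch only: testauto_demo.dta and the variable touched are hypothetical.

* Show the current working directory.
pwd

* Create a small data file to read back.
sysuse auto, clear
save testauto_demo, replace

* Change the data in memory so that Stata considers it unsaved.
generate byte touched = 1

* Without clear, Stata refuses to replace changed data; capture traps the error.
capture use testauto_demo
display "return code without clear: " _rc

* With clear, the data in memory are discarded and the file is read.
use testauto_demo, clear
describe, short
```

A nonzero return code from the first use shows that Stata protected the unsaved change.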


Typing commands in the Command window

The –list- command is particularly helpful to use after sorting data, or in combination with if. For example, you can obtain the five lowest values of MPG by listing the first five records after sorting.
. sort mpg
. list mpg in 1/5
Suppose you want to see the make of the cars whose price is less than $5000. Try:
. list make price if price < 5000

Relational operators:
== equal to
!= not equal to
> greater than
>= greater than or equal to
< less than
<= less than or equal to

You can put spaces around these operators (e.g., a >= b or a>=b), but you cannot put spaces within them (e.g., it must be ‘>=’, not ‘> =’).

Combining tests: -and- and –or-

-if- can be combined with and (&) to evaluate more than one condition. Let's say you want to find out the MAKE of the cars whose MPG is greater than 30 and PRICE is less than $5000.
. list make if mpg>30 & price<5000
To list the cars that meet either condition, use or (|):
. list make if mpg>30 | price<5000


Typing commands in the Command window

About Missing Values

Stata indicates a missing numerical value as a period (.), and a missing string value as an empty string, "". Missing numerical values are treated as larger than any number.
We know from the previous examination (. codebook rep78) that five out of 74 records of REP78 are missing. We can use the period to indicate missing records in the command and see which MAKEs of cars have missing values in the data.
. list make if rep78 >= .
Check that a period is the largest value in rep78 by sorting by rep78 and listing the last six values.
. sort rep78
. list make rep78 in -6/-1

using Menus

Data => Describe data => List data, click the by/if/in tab in the list dialog box
Data => Sort => Ascending sort

Review Questions:
5. How many variables and records are in the data?
6. What does the note say?
7. How can I add notes or comments to the data?
8. What variables are in the data?
9. How do I sort?
10. Which variables have missing values?
11. List the cars for which data is missing.
12. List the cars whose repair record is less than 3 and the price is less than $5,000.

Hints:
. describe
. codebook   . labelbook
. inspect
. summarize
. sort   . gsort
. list [if] [in]
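Because missing values sort above every number, a condition such as rep78 > 3 silently includes the cars with a missing repair record. The sketch below, run on the auto data shipped with Stata, counts the records both ways. The scalar names n_naive and n_valid are my own, and missing() is used here as an alternative to comparing with a period.

```stata
* Sketch only: the scalar names n_naive and n_valid are hypothetical.
sysuse auto, clear

* Count and list the records with a missing repair record.
count if missing(rep78)
list make rep78 if rep78 >= .

* Sorting puts the missing values last, after the largest number.
sort rep78
list make rep78 in -6/-1

* Pitfall: this condition also selects the missing values.
count if rep78 > 3
scalar n_naive = r(N)

* Excluding the missing values explicitly gives the intended count.
count if rep78 > 3 & !missing(rep78)
scalar n_valid = r(N)

display "rep78 > 3 including missing: " n_naive
display "rep78 > 3 excluding missing: " n_valid
```

The difference between the two counts should equal the number of missing records reported by the first count.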


5. Obtain descriptive statistics

Goal: find out the number of missing records, minimum and maximum values, means, and medians; view frequency tables and cross tabulations.

The commands that are useful for getting basic descriptive statistics include tabulate, summarize, tabstat, and table.

Typing commands in the Command window

The –tabulate- command gives you a frequency distribution if only one variable is specified, and a cross-tabulation if two variables are specified. If two variables are specified, the first variable will be shown in rows, and the second in columns.
. tabulate rep78
. tabulate rep78 foreign
-summarize- gives the number of valid observations, mean, standard deviation, minimum, and maximum values.
. summarize price mpg
What if you wanted to see the average MPG for foreign and domestic cars? The –tabulate- command can be combined with –summarize- to produce a summary of one variable for each value of the variable specified in –tabulate-. For example, if you want to see the average MPG by car type, type:
. tabulate foreign, sum(mpg)
If you want to see more statistics such as total, range, or median, you may use tabstat.
. tabstat price mpg, stat(sum range median)
There are more statistics you can see using tabstat. See . help tabstat for a list of statistics.
The –table- command lets you create three-way (or four-way if combined with –by-) cross-tabulations. We can try that after we create more categorical variables in the next section.

using Menus

Statistics => Summaries, tables, and tests => Tables => One-way tables, or All possible two-way tabulations
Statistics => Summaries, tables, and tests => Summary and descriptive statistics => Summary statistics
Statistics => Summaries, tables, and tests => Tables => One/two way table of summary statistics

Notes and Tips

–tabulate- cannot cross-tabulate more than two variables. If you have more than two categorical variables to crosstab, use –table- (see below).

Because –tabulate- gives you frequency counts, it makes sense to use it for categorical variables rather than continuous variables.

It makes sense to summarize continuous variables rather than categorical variables.

You can also see the average MPG by FOREIGN by using –by- and –summarize-.
. bysort foreign: summarize mpg
To use –by-, the data have to be sorted by FOREIGN. You could do . sort foreign, then . by foreign: summarize mpg. The . bysort foreign prefix does the sorting and the by in one step.

Stata allows shorthand in some commands. –sum- is the shorthand for –summarize-. The shorthand is shown as underlined letters in the help page.

Review Questions:
1. Which five cars yield the lowest gas mileage? Which five cars yield the highest gas mileage?
2. What is the average price and average miles per gallon (MPG) of a car in the 1978 auto data?
3. What is the average price of cars that are below and above the mean MPG?
4. What is the median MPG?
5. How are price and MPG different for domestic and foreign cars?
6. How can I see the number of cars by the car type?
7. How are the cars distributed by the repair records?
8. Compare frequency-of-repair records for domestic and foreign cars.

Hints:
. sort
. list
. summarize
. tabulate
. table
. by groupingvarname: summarize varnames
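The commands in this section can be strung together in one short do file. The sketch below also attempts review question 3 by building a 0/1 indicator for cars above the mean MPG. The variable name above_mean_mpg is my own, and this is only one of several ways to answer that question.

```stata
* Sketch only: the variable above_mean_mpg is hypothetical.
sysuse auto, clear

* Frequency table, then a cross-tabulation (rows: rep78, columns: foreign).
tabulate rep78
tabulate rep78 foreign

* Basic summary statistics for two continuous variables.
summarize price mpg

* Mean MPG within each category of foreign, two equivalent approaches.
tabulate foreign, summarize(mpg)
bysort foreign: summarize mpg

* Additional statistics with tabstat.
tabstat price mpg, statistics(sum range median)

* Average price for cars below and above the mean MPG.
summarize mpg, meanonly
generate byte above_mean_mpg = mpg > r(mean)
tabulate above_mean_mpg, summarize(price)
```

The meanonly option of summarize suppresses the output and stores the mean in r(mean), which the next line uses.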


6. Transform variables and records

Goal: create and label new variables, modify existing variables, keep or delete variables and records from the file, recode values, create dummy variables from existing variables.

The basic commands for creating new variables and modifying old ones are –generate- and –replace-.

Typing commands in the Command window

The command
. generate newvar = something
creates a new variable named newvar and sets it equal to something. Something can be a number, a string, a mathematical expression, or a function of other variables. You can combine –if-, -&-, and -|- in generating new variables.
. generate two = 1+1
. generate mycars = 1 if (rep78==1 & price [...]

using Menus

Data => Create or change variables => Create new variable, then click the if/in tab, select Create..., type in the criteria in the window, click OK
Data => Create or change variables => Change contents of variable
Data => Variable utilities => Keep or drop variables => select Drop variables
Data => Create or change variables => Other variable transformation commands => Recode categorical variable

Notes and Tips

You normally want to use replace for the second and later steps in multi-step variable creations. When you modify existing variables, make sure you will still have a way to recreate the original variable or have a back-up copy of the variable. Once you write over an existing variable, there is no way to get the original data back.

. list hirep if rep78==.

Notice that hirep disappears from the Variables window. Once you delete a variable, you cannot undo the deletion. If you issue the command –preserve- before removing a variable, you may restore the deleted variable by issuing the command –restore-. This is a temporary measure and only works as a set. Once you issue the –restore- command, you need to issue another –preserve- command before you can restore again.

If you do not specify a new variable name with the generate option, you will overwrite the original variable. Let’s try that with the –preserve- and –restore- commands.
. tab rep78
. preserve
. recode rep78 (1/2=1) (3/4=2) (5=3)
. tab rep78
. restore
. tab rep78

gen is short for generate.


Typing commands in the Command window

We have already seen how to create a dummy variable (whose outcome is either 0 or 1) using -generate- and -replace-. Another easy way to create dummy variables is to use the -tabulate- command. The -tabulate- command, when used with the generate option, produces one dummy variable for each value. For example, suppose we want to create a dummy variable for each of the outcomes of the categorical variable REP78.
. tabulate rep78, gen(dumrep78)

Suppose you want to group a continuous variable, PRICE, into five equal ranges. First use -summarize- to find the minimum and maximum values that you want to use to group PRICE. Then,
. generate ivprice = autocode(price,5,3291,15906)

If you want to group PRICE into five groups of equal frequencies, first sort PRICE, then issue the following command:
. sort price
. generate fqprice = group(5)

Now we have several more categorical variables to make a four-way table. Let's create a table of repair records by HIREP by IVPRICE by FOREIGN. Here is how:
. table rep78 hirep ivprice, by(foreign)

You can label the variables so that you know what they are later on. Let's add a label to HIREP as an example.
. label variable hirep "repair record is 3 or higher"
. label define yesno 1 "yes" 0 "no"
. label values hirep yesno

using Menus

Data=> Create or change variables=> Other variable creation commands=> Create indicator variables
Data=> Create or change variables=> Create new variable, then enter autocode function in the box
Data=> Sort=> Ascending sort
Statistics=> Summaries, tables, and tests=> Tables=> Table of summary statistics (table)
Data=> Labels=> Label variable
Data=> Labels=> Label values=> Define or modify value labels
Data=> Labels=> Label variable
Data=> Labels=> Label values=> Assign value labels to variable

Notes and Tips

Scroll down the Variables window to see what Stata created. Alternatively, view the list of variables by:
. describe

You can also add notes to the variables.
. note hirep: "temporary variable created on October 1, 2006"
When you describe the data (-describe-), you will see an asterisk (*) by the variable label indicating that the variable hirep has notes. See the notes by typing
. notes

The maximum number of variables you can list in -table- is three.

-label variable- adds a label to the variable.
-label define- defines the values of a label. The label name can be different from the variable name, and the same label can be used for other variables.
-label values- attaches the value label to the variable.

Review Questions:
What is the command to
1. create new variables?
2. delete variables?
3. regroup variables?
4. group continuous variables?
5. create dummy variables?

Hints:
. generate newvar =
. drop varnames
. recode oldvar (1/2=1) (3/4=2) (5=3), gen(newvar)
. generate varname = group(5)
. generate newvar = autocode(oldvar,5,min,max)
. generate newvar = 0
. replace newvar = 1 if oldvar > 6165
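The steps in this section can be collected into one short sequence. The following is a minimal sketch, assuming the auto data that ships with Stata (loaded here with -sysuse auto-). The cutoff used for hirep (rep78 of 3 or higher) is inferred from the variable label above, and the names rep3 and qprice are hypothetical. -xtile- is swapped in as an alternative way to form equal-frequency groups; it is not the group() function used in the text.

```stata
* Load the example data shipped with Stata
sysuse auto, clear

* Dummy variable in two steps: generate first, then replace
generate hirep = 0
replace hirep = 1 if rep78 >= 3
* Missing values sort above every number, so set them back to missing
replace hirep = . if missing(rep78)

* Recode into a NEW variable (rep3) so the original rep78 is kept
recode rep78 (1/2=1) (3/4=2) (5=3), generate(rep3)

* Five groups of roughly equal frequency (alternative to group(5))
xtile qprice = price, nquantiles(5)

* Label the dummy variable and attach value labels
label variable hirep "repair record is 3 or higher"
label define yesno 1 "yes" 0 "no"
label values hirep yesno

* Check the result, showing missing values as well
tabulate rep3 hirep, missing
```

Because -recode- is given the generate() option here, no -preserve-/-restore- pair is needed to protect rep78.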


7. Graph

Goal: view the relationships of the variables by graphing and save graphs.

Stata has several graphs for graphing distributions of individual variables and the relationships between variables, as well as many more specialized graphs. Shown here are commands for some basic graphs. You may explore graphs using the menus as well. In Stata, graphs appear in separate windows that pop up. The graphs do not appear in the Results window, and will not be stored in the log file. If you want to save the graphs, you will need to save each graph as a file.

Typing commands in the Command window

Here's a simple histogram of PRICE.
. histogram price

You can see the histogram separately for different groups. For example, you can see a histogram of price for foreign and domestic cars separately and have Y values in frequency.
. histogram price, by(foreign) freq

Another popular graph is the box plot. Let's see box plots of price by foreign.
. graph box price, by(foreign)

The basic command for drawing a bivariate graph is twoway. The command twoway is followed by a keyword indicating the type of graph. To obtain a scatter plot showing the relationship between MPG and WEIGHT, type
. graph twoway scatter mpg weight

We can obtain the scatter plot by the car type, FOREIGN.
. graph twoway scatter mpg weight, by(foreign)

Twoway graphs can be overlaid: you can draw two twoway graphs on the same set of axes. A common use of this is to draw a scatter plot with a regression line laid over it to show how the regression line fits the data. We will overlay a scatter plot of MPG and WEIGHT with the fitted regression line.
. graph twoway (scatter mpg weight) (lfit mpg weight)

Let's save the graph. In the Command window, type:
. graph save OverlaidMpgWeight

Once it's saved, close the graph window, and bring it up again.
. graph use OverlaidMpgWeight

Review Questions:
How can I ...
1. make a histogram of MPG?
2. see a scatter plot of MPG against WEIGHT?
3. fit a regression line over the previous scatter plot?
4. bring the graph up again after I close the graph window?

using Menus

Graphics=> Histogram, insert variable name PRICE in the Variable: box and check the box next to Bins, change the number to 5
Graphics=> Box plot
Graphics=> Twoway Graph, click Create, select Scatter in the Basic plots: box, Y variable: mpg, X variable: weight, click Accept, then in the "By" tab, select Draw subgraphs..., input foreign in Variables: box
Graphics=> Twoway Graph, click Create, select Fit plots under plot category, and Linear prediction under Fit plots:, Y variable: mpg, X variable: weight
File=> Save Graph... or, in the Stata Graph window, File=> Save
File=> Open Graph...

Notes and Tips

For an introduction to Stata graphs, type
. help graph intro

The default Y value of a histogram is density. To see the histogram in frequency or percentage, type freq or percent after a comma:
. histogram price, freq
To see more options, see
. help histogram

Typing scatter y x draws a graph of y against x.

Here, scatter and lfit are plot types within the twoway family. Alternatively, you can use || to separate the plot types.
. graph twoway || scatter mpg weight || lfit mpg weight
You do need to separate the plot types with either the parentheses or the pipes.

Hints:
. histogram
. graph twoway scatter
. graph twoway (scatter y x) (lfit y x)
. graph save
. graph use
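Saving a graph as a file, as described in this section, can be sketched as follows. This is a minimal example assuming the auto data shipped with Stata and a Stata installation with a graph window; the replace option and the -graph export- step (which writes a picture file rather than Stata's own .gph format) are additions not shown in the text above, and PNG export may not be available in every installation.

```stata
* Load the example data shipped with Stata
sysuse auto, clear

* Scatter plot with the fitted regression line overlaid
graph twoway (scatter mpg weight) (lfit mpg weight)

* Save in Stata's own graph format (.gph); replace overwrites an old copy
graph save OverlaidMpgWeight, replace

* Export the graph currently displayed as a picture file
graph export OverlaidMpgWeight.png, replace

* Bring the saved .gph graph back up
graph use OverlaidMpgWeight
```

The .gph file can only be opened in Stata, while the exported picture can be inserted into other documents.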


9. Obtain linear regression estimates

Goal: run a multiple linear regression model.

Typing commands in the Command window

In estimating relationships among variables, you may first want to examine how the variables are correlated. We suspect that MPG and WEIGHT are correlated. Let's see the correlation:
. correlate mpg weight

In addition, we suspect that the correlation may be different between foreign and domestic cars. We can combine the -correlate- command with a by statement. Before using a by statement, the data need to be sorted by the by-variable.
. sort foreign
. by foreign: correlate mpg weight

It seems that MPG and WEIGHT have a relatively high correlation. The correlation is different for foreign and domestic cars, which suggests that FOREIGN may also be related to MPG.

From the scatter plots we saw earlier, we also discovered that the relationship between WEIGHT and MPG is not exactly linear. We'll include the square of WEIGHT to improve the model. Let's run a regression estimating MPG by WEIGHT, WEIGHT2 and FOREIGN.
. regress mpg weight weight2 foreign

After estimating a regression model, we can use the values estimated by the model, called post-estimation values. Using the estimated MPG, we can see how the estimated line fits the original distribution by viewing an overlaid graph. To do so, we first need to create a variable for the predicted MPG. We'll call this MPGHAT.
. predict mpghat
. graph twoway (scatter mpg weight) (line mpghat weight), by(foreign)

Review Questions:
1. What is the correlation between MPG and WEIGHT?
2. Is the correlation different between domestic and foreign cars?
3. How do I obtain regression estimates?
4. How can I compare observed and predicted values on a graph?

using Menus

Statistics=> Summaries, tables, and tests=> Summary and descriptive statistics=> Correlations and covariances
Data=> Sort=> Ascending sort
Statistics=> Summaries, tables, and tests=> Summary and descriptive statistics=> Correlations and covariances, in "by/if/in" tab click Repeat command by groups, insert foreign in Variables that define groups:
Statistics=> Linear models and related=> Linear regression
Statistics=> Postestimation=> Predictions, residuals, etc.
Graphics=> Two-way graph (if there are already defined plots in "Plot definitions:", either Disable or Edit them to create new combinations)

Notes and Tips

. pwcorr mpg weight, star(.05)
adds an asterisk (*) next to the correlation coefficients that are statistically significant at the 5% level.

You can also sort and use the by statement in one step:
. bysort foreign: correlate mpg weight

There is a series of regression diagnostics you can do using graphs. See UCLA's Stata tutorial site for more information.

To compute the square of WEIGHT, WEIGHT2, you can multiply WEIGHT by itself, or raise it to the power of 2.
. generate weight2 = weight*weight
. generate weight2 = weight^2
do the same thing.

Stata has a series of "post estimation commands." After running a regression, for example, you can test whether a coefficient is statistically significantly different from 0 or from the coefficient of another independent variable (Wald test), or test for heteroscedasticity. For details, see
. help regress postestimation

The option ", xb" that appears in the command when the menu is used is the default for -predict-, so it does not need to be typed, and does not appear in the Results window, when the command is entered in the Command window.

Hints:
. correlate
. by varname: correlate
. regress
. predict yhat
. graph twoway (scatter y x) (line yhat x)
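The regression steps in this section can be run end to end as in the sketch below. It assumes the auto data shipped with Stata. The sort option on the line plot and the two post-estimation commands (-test- and -estat hettest-) are additions chosen for illustration; they are standard Stata commands but are not part of the commands listed above.

```stata
* Load the example data shipped with Stata
sysuse auto, clear

* Correlation overall, then separately for domestic and foreign cars
correlate mpg weight
bysort foreign: correlate mpg weight

* Squared term for the non-linear relationship
generate weight2 = weight^2

* Multiple linear regression
regress mpg weight weight2 foreign

* Post-estimation: Wald test that the coefficient on weight2 is zero,
* and a test for heteroscedasticity
test weight2
estat hettest

* Predicted values (linear prediction, the default of -predict-)
predict mpghat

* Observed versus predicted; sort draws the line in order of weight
graph twoway (scatter mpg weight) (line mpghat weight, sort), by(foreign)
```

Without the sort option, -line- connects the points in the order they appear in the data, which can produce a zigzag instead of a smooth fitted curve.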


10. Do files

When you have rather intense computations, or you repeat or modify existing computations, it may be helpful to create a file that contains a set of Stata commands. Such files are called "do files" in Stata. Do files can be created by manually entering commands in any text editor, or by using Stata's do-file editor. In Stata, the do-file editor can be invoked by:
CMD: . doedit
MNU: Window=> Do-file editor=> New do-file

You may also create do-files by saving commands you submit interactively. When you start a Stata session, start a "command log," which is a log file with only the commands. By default it attaches the .txt file extension if you do not specify the extension. If that is the case, you can change it in Windows file explorer. For this command, I have not found a menu version.
CMD: . cmdlog using filename.do

If you forget to start a command log, you may save the commands in the Review window. First, right click in the Review window, then select "Select All". Right click in the Review window again, then select "Send to Do-file Editor". You can eliminate commands that produced errors by clicking _rc at the top of the Review window, which sorts the commands by their errors; then select the error commands, right click, and choose "Delete". You can re-sort the commands into the original order by clicking the top of the numbered column on the far left. By the same token, you can sort the commands by clicking the top bar where it says "Command" and delete commands like -browse- and -help-. By the way, if you use the menu for help and search, they do not appear in the Review or Results window.

11. Shortcut menus

The toolbar buttons are:
Open data
Save data
Print results
Log (note 1)
Open/close viewers
Graph window (note 2)
Open do-file editor
Open data editor
Open data browser
Scroll Results window (note 3)
Quit (note 4)

1. Begins a log if no log file is open. If a log file is open, it lets you view, close, or suspend the log. You may append to a previous log by selecting an existing log file. The dialog box menu changes accordingly.
2. Moves the graph window to the front. It only becomes active when a graph window is open.
3. Scrolls the Results window one screen at a time, when you have -more- at the bottom of the Results window. It is equivalent to hitting the space bar or clicking -more-.
4. Quits processing. Useful when a process is taking a long time and you want to stop it, or when you have -more- but do not want to see more. It is equivalent to hitting q in the Command window or pressing Ctrl and c at the same time.
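Returning to the do files of section 10, the contents of a small do file might look like the sketch below. The file name example.do, the log name example.log, and the choice of commands are all hypothetical; the sketch only assumes the auto data shipped with Stata.

```stata
* example.do : a hypothetical do file (file names are assumptions)

* Close any log left open by an earlier run, ignoring errors if none is open
capture log close

* Keep the output from pausing at -more-
set more off

* Record the output in a plain-text log, overwriting an older copy
log using example.log, text replace

* The analysis itself
sysuse auto, clear
summarize price mpg weight
tabulate rep78 foreign

* Always close the log at the end
log close
```

Once the commands are saved in a file such as example.do, the whole file can be run from the Command window with -do example.do-.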


12. Exporting results

You can copy what appears in the Results window by highlighting and right clicking. There are several options: Copy Text, Copy Table, Copy Table as HTML, and Copy as Picture. Here are pasted tables for each.

Copy Text

     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

Copy Table

Repair
Record 1978    Freq.    Percent    Cum.
1              2        2.90       2.90
2              8        11.59      14.49
3              30       43.48      57.97
4              18       26.09      84.06
5              11       15.94      100.00
Total          69       100.00

Copy Table as HTML

Repair
Record 1978    Freq.    Percent    Cum.
1              2        2.90       2.90
2              8        11.59      14.49
3              30       43.48      57.97
4              18       26.09      84.06
5              11       15.94      100.00
Total          69       100.00

Copy as Picture

Repair
Record 1978    Freq.    Percent    Cum.
1              2        2.90       2.90
2              8        11.59      14.49
3              30       43.48      57.97
4              18       26.09      84.06
5              11       15.94      100.00
Total          69       100.00

If you are pasting tables into Excel, copying either as table or HTML will work well. If you are pasting tables into Word, copying as picture seems to produce the best appearance. If you save them as pictures, though, the contents can only be modified using graphics software.

Log files with the extension .log can be opened in Word. Log files with the extension .smcl will show the tags for Stata. See the command in the next section to convert .smcl files into .log files.

Graphs saved as a picture (see section 7. Graph) can be imported into a document. There are several options for the format. Use the drop-down list in the Save As box for the selection. Graphs can also be copied and pasted into another application like MS Word. Right click the graph you want to copy, then select Copy Graph. Paste the graph in Word using Edit=> Paste, right click and Paste, or hit Control and v at the same time. When graphs are copied into Word 2003, they may not appear correctly when the file is converted into Word 2007.

There are also user-created commands to output results. You may check out commands such as outreg, outreg2, estout, tabout, est2tex, mktab, and xml_tab. To read about the commands, use search. For example, type in Stata's Command window,
. search outreg, all

Note about user-created commands: Stata, being a programmer-friendly program, makes it easy to install and use user-made commands. If you see a user-made command that you want to use, you can install it by first finding the command by searching for it (you can also type -findit commandname- in Stata's Command window) and then clicking the blue letters "click here to install." The help pages on the command become available after installing the program.
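As an illustration of the user-created output commands mentioned above, the sketch below uses the estout package to write two regression models to a file that Word can open. It assumes an internet connection for the installation step and the auto data shipped with Stata; the output file name mpgmodels.rtf and the choice of models are hypothetical. -ssc install- is used here as an alternative to the search-and-click route described in the text.

```stata
* Install the user-written estout package from the SSC archive
* (needs internet access; replace updates an existing copy)
ssc install estout, replace

* Load the example data shipped with Stata
sysuse auto, clear

* Clear any stored estimates, then store two models
eststo clear
eststo: regress mpg weight
eststo: regress mpg weight foreign

* Write both models side by side, with standard errors, to a rich text file
esttab using mpgmodels.rtf, se replace
```

After installation, -help estout- and -help esttab- describe the many formatting options.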


13. Other helpful commands

If working with a large file:

You can describe data without loading the data by specifying the location and the name of the data file.
. describe using datafilename

You can load only the variables you need by specifying the variable names.
. use var1 var2 var3 using datafilename

Some commands produce output that is more than a page long (-compress-, for example). To save yourself from pressing a key to scroll each page, you may use
. set more off

If you are seeing -more- at the end of the screen after typing search, and want to quit seeing more screens, press q, or press the Control and c keys at the same time. Clicking the red X button does the same thing.

You can save some memory by compressing the data.
. compress

Shortcuts

Stata can fill in a variable name with the tab key after enough characters to recognize the name are entered. For example, while you have the auto data open, try:
. describe h [hit tab key]
Stata fills in the rest of the variable name as headroom.

You can bring up previously used commands in the Command window by hitting the Page Up key.

You can refer to a set of variables with the same stem using an asterisk (*), as in:
. describe weight*
If you had created weight2, it will show both weight and weight2.

Miscellaneous

If you forget to start a log file at the beginning of a Stata session, but want to save what you have in the output window, use
. translate @Results outputfilename.txt
The file can be viewed using a text editor or a word processor.
Note: -translate- only saves what is in the buffer (what you can scroll back to in the Results window). Depending on the length of the output you have produced, earlier results may have been lost. It is a good habit to start a log file each time you start a Stata session.

If you created a Stata log file that has the file extension .smcl, you can reformat it into a text file by giving the command:
. translate filename.smcl filename.log

If you want to perform a mathematical operation on the spot, you can use the -display- command.
. display 1+1
will return 2.
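The large-file commands above can be tried out on a small file first. The sketch below is a minimal example assuming the auto data shipped with Stata; the file name autocopy is hypothetical and simply stands in for a large data file in the current working directory.

```stata
* Make a local copy of the auto data to stand in for a large file
sysuse auto, clear
save autocopy, replace
clear

* Look at the variables without loading the data
describe using autocopy

* Load only the variables that are needed
use make price mpg using autocopy, clear

* Stop output from pausing at -more-, then shrink storage types
set more off
compress

* Quick arithmetic in the Results window
display 1+1
```

After the -use- line, only make, price, and mpg are in memory, which is what keeps the memory footprint small with a genuinely large file.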


14. On-line tutorials

UCLA
http://www.ats.ucla.edu/stat/stata/

UNC
http://www.cpc.unc.edu/services/computer/presentations/statatutorial/

Princeton
http://data.princeton.edu/stata/
http://www.princeton.edu/~eszter/stata.html
http://www.princeton.edu/~otorres/Stata/
http://opr.princeton.edu/computing/software/stata/intro/default.asp

15. References

Hamilton, Lawrence C. 2006. Statistics With Stata. Updated for Version 9. Pacific Grove, CA: Duxbury Press.
Stata Corporation. 2008. Using Stata Effectively: Data Management, Analysis, and Graphics Fundamentals.
Data and Statistical Services, Princeton University. Fall 2007. Stata Hands-on Instruction Guide. Windows version 9.0.
