Setting up

A Few Words on Workflow

A workflow is the framework for conducting data analysis. There is no single standard practice, but it usually involves the following steps: planning and documentation, cleaning data, analyzing data, presenting the results, and archiving. In each step, standardization and automation are the keys to a good workflow.

Why is the workflow important? An effective workflow saves us a lot of time, trouble, and sometimes even our lives. We want a workflow that makes it easy to reproduce our work, collaborate in a group, and debug; an efficient workflow also keeps us away from annoying retractions.

More on workflow in data analysis with Stata:
J. Scott Long (2009), The Workflow of Data Analysis Using Stata, Stata Press

Stata’s Graphical User Interface

The five main windows in Stata are Review, Results, Variables, Properties and Command.

To set the windows to the way we prefer, go to Preferences.
To reveal a hidden window, go to the Window menu and select the window’s name.

The Review window shows the history of commands, both the successful ones and the ones that produced errors; the latter are marked with their error codes in the “_rc” column.

Click Command or _rc to sort commands by alphabetical order.
Use Page Up and Page Down keys to select, copy, run, and export the commands.

The Results window displays the past commands and the textual results.

Click the search icon, the magnifier, to find text in the Results window.

The Variables window lists the variables’ names and their labels.

Select a variable and its other properties will be reflected in the Properties window simultaneously.

The Properties window summarizes the various properties of a variable.

Click the lock icon to unlock the ability to change the variable properties directly.
Use the arrows to the right of the lock icon to select variables in the Variables window.

More on the Stata user interface from the Stata manual

The current working directory is shown below the Command window. The little menu icon offers a way to change the working directory without typing the cd command. We will talk about the working directory shortly.

The Toolbar on top of the Results window gives us easy access to several features, most of which are self-explanatory.

In the middle of the bar, Log allows us to record all the output we produce in a log file. We will explain the usage of a log in the next chapter.
Viewer does many things. It is commonly used as a help system and as a place to find and download user-written packages. The Stata manual details its other uses.
Do-file Editor is a text editor specific to Stata, where we write the commands.
Data Browser lets you view the data in a spreadsheet. You can also open the Browser by typing browse.
Data Editor lets us actually modify the data, which is strongly discouraged. If you have edited data manually in the Editor, remember to copy the commands produced in the Results window and paste them into your do-file to keep a record of what you did. You can also open the Data Editor by typing edit.
To the right, More asks Stata to continue running the program if it pauses; Break asks it to stop the command that is currently running.

Getting Help

Type help command to look for the syntax and usage of a command.

Type search keyword to 1) look up a topic in Stata’s local keyword database and in materials on the net; 2) find a command whose exact name we do not know; or 3) find a user-written package to download.

The Very First Steps to Reproducible Work

For simple tasks and for exploratory purposes, one can use Stata interactively by selecting commands from the menu bar or by typing commands line by line in the Command window. However, for any reproducible work we will want to keep track of the code and methods we use. To that end, below are the very first steps.

Set the current working directory

A working directory is the default directory where Stata reads and writes files. Usually for each project we will want to set up a different working directory. We will set the current working directory to where we work with the project before we get started.

cd changes the working directory.
. cd ~/Desktop/WD/

On a Mac, typing cd by itself, with no directory specified, sets the home directory as the current working directory.

If the directory contains spaces, we need to include quotes.
. cd "~/Desktop/WD/my folder/"

cd .. changes the working directory one level up.

pwd displays the path of the current working directory.

dir or ls lists the files in the current working directory.

sysuse dir lists the names of the datasets installed with Stata and any other datasets stored along the adopath, which includes the current working directory.
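
As a minimal sketch of how these commands fit together, the lines below create a scratch folder, move into it, and come back. The folder name wd_demo is hypothetical, and c(pwd) is the value Stata stores for the current working directory. The // comments assume the lines are run from a do-file rather than typed interactively.

```stata
* Remember the starting directory (c(pwd) holds the current working directory)
local start "`c(pwd)'"

* wd_demo is a hypothetical folder name
* capture ignores the error if the folder already exists
capture mkdir "wd_demo"

cd "wd_demo"     // move into the new folder
pwd              // show where we are now
cd ..            // go back up one level
pwd              // same directory as at the start
```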

Open a Log File

A log file is a record of what we did with Stata in a session, including all the commands and outputs.

log using filename.log, text replace starts logging.

The replace option tells Stata to overwrite the log file if a log file with the same name already exists.

The text option saves the log file in plain text format which can be read by any text editor.

If neither the text option nor a .log extension is given, Stata saves the log in Stata Markup and Control Language (SMCL) format, which is meant to be read in Stata’s Viewer.

log using filename.log, append adds the output from the current session to the end of the existing log file instead of overwriting it.
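
The options above could be combined as in the following sketch. The file name session_demo.log is hypothetical, and log close (used here to end each logging session) is assumed to be run from a do-file together with the other lines.

```stata
* Close any log left open earlier
* capture ignores the error if no log is open
capture log close

* session_demo.log is a hypothetical file name
log using session_demo.log, text replace
display "this line is recorded in the log"
log close

* Reopen the same file and add to it instead of overwriting it
log using session_demo.log, text append
display "this line is appended"
log close
```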

Use a Do-file

Stata comes with its own text editor, the Do-file editor. A do-file is a text file where we keep, edit and save our commands.

To open the do-file editor, type doedit or click the Do-file Editor icon on the toolbar. To run the commands, click Do on the top right.

We can also choose our own external editor. However, Microsoft Word, or any other word processor, is generally not recommended, since it tends to reformat the text, which can distort the syntax and break our code.

For a comprehensive review of text editors for Stata users, see:
N. J. Cox, Some notes on text editors for Stata users

*
Usually a do-file should start with the following commands:

version 14
clear
capture log close
set more off

version specifies the version of Stata we use. Why bother doing that? Because Stata keeps evolving, and programs written for older versions may not run the same way in later versions. Specifying the current version ensures that future versions will continue to interpret the code correctly.

clear removes data and value labels from the memory before Stata can read another file. Stata works with one dataset at a time in memory.

capture log close closes the open log file if there is one and does nothing if there is not, so that output from the current session is not written to a log left open by the last session. capture suppresses the error a command would otherwise raise and lets the do-file or program continue; thus capture log close tells Stata to carry on even if no log file is open.

set more off tells Stata to run the commands continuously without pausing when the Results window fills up. Otherwise Stata pauses each time the screen is full and shows --more-- at the bottom of the Results window until we tell it to continue.
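
Putting the pieces together, a do-file built on this header might look like the sketch below. The log file name template_demo.log is hypothetical, and the body uses the auto example dataset only as a placeholder for real work.

```stata
* ----- header -----
version 14
clear
capture log close
set more off

* ----- body (template_demo.log is a hypothetical file name) -----
log using template_demo.log, text replace
sysuse auto, clear        // example dataset shipped with Stata
summarize price mpg

* ----- footer -----
log close
```

Closing the log at the end of the do-file keeps each run's output in its own file.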

*
Do-files can run other do-files. If we are working on a project that involves several tasks, we will probably have several do-files, each doing one task, and a master do-file that runs all the task-specific do-files. We can call other do-files from a master do-file:

do cleaning.do
do models.do
do graph.do
...

Reading and Saving Files

Reading and writing Stata binary files is much faster than using text files, among other advantages in data management. We may want to work with a file in Stata format as early as possible.

Read and save Stata files (.dta)

Stata will always go to the current working directory when working with files.

use filename, clear loads a Stata file (.dta) from the current working directory. The clear option clears the memory so that Stata can load a new file. If no data are in memory, we can omit it.

save filename, replace saves a Stata file to the current working directory. replace overwrites the file saved earlier with the same file name. If such a file does not exist, we can skip the replace option.

To load a file located in a subdirectory under the working directory, we need to specify the relative path.
. use folder/mydata.dta, clear

To tell Stata to read a file elsewhere, we need to specify a full path.
. use ~/elsewhere/mydata.dta, clear

We can also read a file over the Internet.
. use http://www.stata-press.com/data/r14/auto, clear

sysuse loads the example datasets installed with Stata.
. sysuse auto, clear
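
A short round trip, assuming a hypothetical file name auto_copy, illustrates save and use together. The // comments assume the lines are run from a do-file.

```stata
sysuse auto, clear            // example dataset shipped with Stata
save auto_copy, replace       // hypothetical name; writes auto_copy.dta

clear                         // empty the memory
use auto_copy, clear          // read the copy back from the working directory
describe, short
```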

Read and save delimited text files (.csv, .txt)

import/export delimited reads/saves text-delimited files. The data can be tab-separated (often .txt) or comma-separated (.csv). If the extension is not specified, Stata assumes that it is a .csv file.
. import delimited filename, clear
. import delimited var1 var2 var3 using filename.csv, clear
The second command loads the dataset and names the three variables var1, var2 and var3.
. export delimited output.txt, delimiter(tab)
This command exports the data in memory and saves it as a tab-delimited text file.
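
To see both commands at work, here is a small round-trip sketch. The file name auto_demo.csv is hypothetical, and varnames(1) is added as a precaution so that the first row is always read as variable names.

```stata
sysuse auto, clear
keep make price mpg

* auto_demo.csv is a hypothetical file name
export delimited using auto_demo.csv, replace

* varnames(1) says the variable names are in the first row of the file
import delimited using auto_demo.csv, varnames(1) clear
list in 1/3
```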

A Real Case

Sometimes we need to take a few more steps to read a text file. Let's look at a real case.

Let's say we have a text file downloaded from a database and we try to read it into Stata with
. import delimited output.txt
. list in 1/10


     +-------------------------------------------------------------------------------------------+
     | v1                                                                                        |
     |-------------------------------------------------------------------------------------------|
  1. |                                                                                           |
  2. |                                                                                        .. |
  3. |                                                                                           |
  4. | ticker      evtdate       car                                                             |
  5. |  EGAS     12FEB2014     0.02356                                                           |
     |-------------------------------------------------------------------------------------------|
  6. |  EGAS     27JUL2012    -0.00276                                                           |
  7. |  HGR      03JUL2002     0.01478                                                           |
  8. |  HGR      03MAY2005    -0.05070                                                           |
  9. |  HGR      05AUG2002     0.02337                                                           |
 10. |  HGR      20JAN2004     0.04206                                                           |
     +-------------------------------------------------------------------------------------------+

We can see that all variables are squeezed into one column "v1", that there are empty rows, and that the variable names are actually in the fourth row. To fix those problems, first let's try
. import delimited output.txt, rowrange(4:l) varnames(4) delimiters(space) clear
. list ticker-car in 1/5

rowrange(4:l) reads data from the fourth to the last row (l stands for "last"). varnames(4) means the fourth row holds the variable names. delimiters(space) tells Stata to use a space as the delimiter. clear replaces the data already in memory from the first attempt.


     +--------------------------------------------------------------------------------------------------------+
     | ticker     v2   v3   v4   v5   v6     evtdate          v8   v9   v10       v11      v12      v13   car |
     |--------------------------------------------------------------------------------------------------------|
  1. |          EGAS    .    .    .    .   12FEB2014                                .   .02356        .       |
  2. |          EGAS    .    .    .    .   27JUL2012                          -.00276        .        .       |
  3. |           HGR    .    .    .    .               03JUL2002                    .        .   .01478       |
  4. |           HGR    .    .    .    .               03MAY2005                    .   -.0507        .       |
  5. |           HGR    .    .    .    .               05AUG2002                    .        .   .02337       |
     +--------------------------------------------------------------------------------------------------------+

Now the first few empty rows have been removed, and (some) variables are properly named. However, something is not right. We had three variables, and now we have a lot more (260 if we list them all!). What happened? Stata treated every single space as a delimiter, so the values of one variable were spread across different columns.

Here is what we want to do:
. import delimited output.txt, rowrange(4:l) varnames(4) delimiters(space, collapse) clear
Here delimiters(space, collapse) treats consecutive spaces as a single delimiter.

. list in 1/5


     +----------------------------------------+
     | ticker   evtdate         car        v4 |
     |----------------------------------------|
  1. |             EGAS   12FEB2014    .02356 |
  2. |             EGAS   27JUL2012   -.00276 |
  3. |              HGR   03JUL2002    .01478 |
  4. |              HGR   03MAY2005    -.0507 |
  5. |              HGR   05AUG2002    .02337 |
     +----------------------------------------+

Now it's about right. We got the correct variable names, just in the wrong places. We need to fix that.
. drop ticker
. rename evtdate ticker
. rename car evtdate
. rename v4 car
. list in 1/5


     +------------------------------+
     | ticker     evtdate       car |
     |------------------------------|
  1. |   EGAS   12FEB2014    .02356 |
  2. |   EGAS   27JUL2012   -.00276 |
  3. |    HGR   03JUL2002    .01478 |
  4. |    HGR   03MAY2005    -.0507 |
  5. |    HGR   05AUG2002    .02337 |
     +------------------------------+

Done.
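
The three rename commands could also be written as a single group rename, which assigns all the new names at once. The sketch below rebuilds the first rows of the shifted dataset by hand so that it can be run on its own; the storage types in the input line are assumptions, not taken from the imported file.

```stata
* Rebuild the first rows of the shifted dataset by hand
clear
input str1 ticker str4 evtdate str9 car double v4
"" "EGAS" "12FEB2014"  0.02356
"" "EGAS" "27JUL2012" -0.00276
"" "HGR"  "03JUL2002"  0.01478
end

drop ticker                                    // the empty first column
rename (evtdate car v4) (ticker evtdate car)   // shift all three names at once
list
```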

Read and save Excel files

import/export excel reads/saves worksheets from Microsoft Excel (.xls and .xlsx) files.
. import excel filename.xlsx, clear
This loads the dataset, but the variable names are lost. What we see instead are Excel's column names.
. import excel filename.xls, firstrow
This loads the dataset and reads the original variable names from the first row.
. export excel output.xlsx, firstrow(varlabels) sheet("output1")
This saves the dataset to the sheet "output1" of the Excel file "output.xlsx", with the variable labels written in the first row.
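
A round trip through Excel might look like the following sketch. The file name auto_demo.xlsx is hypothetical, and firstrow(variables) is used here (rather than varlabels) so that the variable names themselves are written and can be read back.

```stata
sysuse auto, clear
keep make price mpg

* auto_demo.xlsx is a hypothetical file name
export excel using auto_demo.xlsx, firstrow(variables) replace

* firstrow reads the first row of the sheet as variable names
import excel using auto_demo.xlsx, firstrow clear
describe
```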

Stata as a Calculator

Stata can be used as a calculator interactively. display displays the results of expressions right away.
. display 100^2

10000

. display %6.0fc 100*100

10,000

. display as text "the square root of 10000 is " sqrt(10000)

the square root of 10000 is 100

. display _pi

3.1415927

. display normal(0.5)

.69146246
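
A few more expressions in the same spirit, as a sketch to be run from a do-file (the // comments do not work when typed interactively). The scalar name z is arbitrary.

```stata
display 7/2                       // division
display mod(17, 5)                // remainder of 17 divided by 5
display %9.2f exp(1)              // e, shown with two decimals
display invnormal(0.975)          // inverse of normal(): a standard normal quantile

* Store a result in a scalar and reuse it
scalar z = invnormal(0.975)
display "the 97.5th percentile is " z
```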

Datasets to Play with

If you are shopping for datasets to play with, NYU Shanghai's Chinese Datasets Archive offers many datasets on a wide range of topics related to China.

Chinese Datasets Archive