Being Reproducible

A Few Words on Workflow

A workflow is the framework for conducting a data analysis. There is no single standard practice, but it usually involves the following steps: planning and documentation, cleaning data, analyzing data, presenting the results, and archiving. In each step, standardization and automation are key to a good workflow.

Why is the workflow important? An effective workflow saves us a lot of time and trouble. We want a workflow that makes it easy to reproduce our work, collaborate in a group, and debug; it also helps us avoid the kind of errors that lead to retractions.

More on workflow in data analysis with Stata:
J. Scott Long (2009), The Workflow of Data Analysis Using Stata, Stata Press

The Very First Steps to Reproducible Work

For simple tasks and for exploratory purposes, one can use Stata interactively by selecting commands from the menu bar or typing them line by line in the Command window. However, for any reproducible work we will want to keep track of the code and methods we use. To that end, below are the very first steps.

Set the current working directory

A working directory is the default directory where Stata reads and writes files. Usually we will want a separate working directory for each project. Before we get started, we set the current working directory to the project's folder.

cd changes the working directory.
. cd ~/Desktop/WD/

If no directory is specified, typing cd on its own on a Mac changes the working directory to the home directory. On Windows, cd on its own only displays the current working directory.

If the directory contains spaces, we need to include quotes.
. cd "~/Desktop/WD/my folder/"

cd .. changes the working directory one level up.

pwd displays the path of the current working directory.

dir or ls lists the files in the current working directory.

sysuse dir lists the names of the datasets installed with Stata, plus any other datasets stored along Stata's adopath (which, by default, includes the working directory).
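
As a small illustration, the commands above might be combined at the start of a session as follows. This is only a sketch: it reuses the example path ~/Desktop/WD/ from above, which is assumed to exist on a Mac or Linux machine. The lines are written as they would appear in a do-file, so the leading dot prompt is omitted, and lines starting with * are comments.

```stata
* Move to the project folder (assumed to exist)
cd "~/Desktop/WD/"

* Confirm which directory Stata is now using
pwd

* List the files in the working directory
dir

* List the example datasets that ship with Stata
sysuse dir
```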

Open a Log File

A log file is a record of what we did with Stata in a session, including all the commands and their output.

log using filename.log, text replace starts logging.

The replace option tells Stata to overwrite the log file if a log file with the same name already exists.

The text option saves the log file in plain text format which can be read by any text editor.

If we give neither the text option nor a .log extension, Stata saves the log in Stata Markup and Control Language (SMCL) format with a .smcl extension, which is meant to be read in Stata’s Viewer.

log using filename.log, append adds the output from the current session to the end of the existing log file instead of overwriting it.
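
A minimal sketch of a logged session might look like the following. The file name session.log is a hypothetical placeholder, and the auto dataset (one of the example datasets shipped with Stata) is used only to produce some output to record.

```stata
* Start a plain-text log, overwriting any earlier log with the same name
log using session.log, text replace

* Commands and their output are now recorded in session.log
sysuse auto, clear
summarize price

* Stop logging and close the file
log close
```

Replacing the replace option with append in the first command would keep the earlier contents of session.log and add the new output after them.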

Use a Do-file

Stata comes with its own text editor, the Do-file editor. A do-file is a text file where we keep, edit and save our commands.

To open the Do-file Editor, type doedit or click the Do-file Editor icon on the toolbar. To run the commands, click Do at the top right.

We can also choose our own external editor. However, Microsoft Word, or any other word processor, is generally not recommended: it tries to format the text, which can alter the syntax and break our code.

For a comprehensive review of text editors for Stata users:
N. J. Cox, Some notes on text editors for Stata users

Usually a do-file should start with the following commands:

version 14
clear
capture log close
set more off

version specifies the version of Stata we use. Why bother doing that? Because Stata keeps being updated, and programs written for an older version may not run the same way in a later one. Stating the version the code was written for ensures that future versions of Stata will continue to interpret it correctly.

clear removes data and value labels from memory. Stata works with one dataset in memory at a time, so the existing data must be cleared before Stata can read another file.

capture log close closes the log file if one is still open. Without it, commands run in the current session could be written to a log left open by an earlier run. The capture prefix suppresses any error raised by the command that follows it and lets the do-file or program continue. Since log close raises an error when no log is open, capture log close lets Stata carry on either way.
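
The effect of capture can be checked with a short sketch like the one below. It relies on _rc, the return code that capture stores after running a command; the exact nonzero value reported when no log is open is not shown here.

```stata
* With no log open, a bare log close would stop the do-file with an error.
* Prefixing it with capture suppresses the error so execution continues.
capture log close

* capture stores the return code in _rc: 0 means the command succeeded,
* any other value means it failed (here, that no log was open).
display _rc
```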

set more off tells Stata to run the commands continuously, without pausing when the Results window fills up. Otherwise Stata pauses each time the screen is full and shows --more-- at the bottom of the Results window until we press a key or click it.
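
Putting these pieces together, a complete do-file might be laid out roughly as follows. This is a sketch rather than a required template: the names analysis.do and analysis.log are hypothetical, and the auto dataset stands in for a real project's data.

```stata
* analysis.do (hypothetical file name)
version 14
clear
capture log close
set more off

* Open a plain-text log for this run
log using analysis.log, text replace

* Analysis commands go here; the auto dataset is only a stand-in
sysuse auto
summarize price mpg

* Close the log at the end of the run
log close
```

Opening the log after the header commands and closing it on the last line means each run of the do-file produces one self-contained log.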

Do-files can run other do-files.

If we are working on a project that involves several tasks, we will probably have several do-files, each handling one task. We will also have a master do-file that runs all the task-specific do-files. A master do-file calls the other do-files with the do command.
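
A minimal sketch of such a master do-file is shown below. The three task file names are hypothetical placeholders, assumed to sit in the current working directory.

```stata
* master.do: runs each task-specific do-file in order
* (the file names below are hypothetical placeholders)
version 14
set more off

do clean_data.do
do analyze_data.do
do make_tables.do
```

Using run instead of do executes a do-file the same way but without displaying its commands and output.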


Author: Yun Dai, 2018