Getting Started

Taking a First Look at Your Data

browse; edit

browse and edit let us view the data in a spreadsheet-like form.

browse opens the view-only Data Browser, where we are not allowed to modify data.

edit similarly opens the Data Editor, which, unlike the Browser, allows us to modify data. However, it is not recommended that we change our data interactively in the Editor, at least not without copying and pasting the relevant commands of the changes reflected in the Results window to the do-file.

describe; codebook; list; count

describe, codebook, list, and count describe our variables.

describe produces a summary of the dataset, including its storage type, display format, value label and variable label. If no variable is specified, describe will describe every variable in the dataset.
. sysuse auto
. describe m*


              storage   display    value
variable name   type    format     label      variable label
-------------------------------------------------------------------
make            str18   %-18s                 Make and Model
mpg             int     %8.0g                 Mileage (mpg)

This gives us all variables whose names start with m, which are make and mpg.

codebook produces more information than describe does. It gives descriptive statistics for numeric variables and a sample of observations for string variables, in addition to the variable names, labels, and data type.
. sysuse auto
. codebook m*


---------------------------------------------------------------------------------------------------
make                                                                                 Make and Model
---------------------------------------------------------------------------------------------------
      
                  type:  string (str18), but longest is str17
      
         unique values:  74                       missing "":  0/74
      
              examples:  "Cad. Deville"
                         "Dodge Magnum"
                         "Merc. XR-7"
                         "Pont. Catalina"
      
               warning:  variable has embedded blanks
      
---------------------------------------------------------------------------------------------------
mpg                                                                                   Mileage (mpg)
---------------------------------------------------------------------------------------------------
      
                  type:  numeric (int)
      
                 range:  [12,41]                      units:  1
         unique values:  21                       missing .:  0/74
      
                  mean:   21.2973
              std. dev:    5.7855
      
           percentiles:        10%       25%       50%       75%       90%
                                14        18        20        25        29

list lists values of the variables.
. sysuse auto
. list make-rep78 in 10/20 if price > 6000, table lists the values of the variables from make to rep78 from the 10th to 20th rows, whose prices are greater than 6000, in the form of a table.


    +--------------------------------------+
    | make             price   mpg   rep78 |
    |--------------------------------------|
11. | Cad. Deville    11,385    14       3 |
12. | Cad. Eldorado   14,500    14       2 |
13. | Cad. Seville    15,906    21       3 |
    +--------------------------------------+

count counts the number of observations that satisfy the specified conditions.
. sysuse auto
. count in 10/20 if price > 6000 counts the number of observations from the 10th to 20th rows and whose prices are greater than 6000. In this case Stata will return 3.

summarize; tabulate; tabstat; table

The four commands all produce summary statistics; the choice of the command, however, depends on the question we ask.

summarize

summarize returns a variety of univariate summary statistics. If no options have been specified for the variables, it displays the number of observations, mean, standard deviation, minimum and maximum values.
. sysuse auto
. sum weight, detail produces additional statistics with the detail option, including skewness, kurtosis, the four smallest and four largest values, and various percentiles.

                              Weight (lbs.)
-------------------------------------------------------------
      Percentiles      Smallest
 1%         1760           1760
 5%         1830           1800
10%         2020           1800       Obs                  74
25%         2240           1830       Sum of Wgt.          74

50%         3190                      Mean           3019.459
                        Largest       Std. Dev.      777.1936
75%         3600           4290
90%         4060           4330       Variance       604029.8
95%         4290           4720       Skewness       .1481164
99%         4840           4840       Kurtosis       2.118403

tabulate; tab1; tabulate, summarize(); tabulate, generate()

tabulate produces one- and two-way tables of frequencies.
. sysuse auto
. tab rep78 produces the distribution of the variable car repair record rep78.


     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00

In combination with the categorical variable, tabulate can produce a cross-tabulation of summary statistics by category.
. tab rep78 foreign produces the car repair record frequencies by foreign or domestic car type, where foreign is a categorical variable.


    Repair |
    Record |       Car type
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69

. tab rep78 foreign, column row reports relative frequencies within the columns and rows when the options column and row are specified.


+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
+-------------------+
     
    Repair |
    Record |       Car type
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
           |    100.00       0.00 |    100.00 
           |      4.17       0.00 |      2.90 
-----------+----------------------+----------
         2 |         8          0 |         8 
           |    100.00       0.00 |    100.00 
           |     16.67       0.00 |     11.59 
-----------+----------------------+----------
         3 |        27          3 |        30 
           |     90.00      10.00 |    100.00 
           |     56.25      14.29 |     43.48 
-----------+----------------------+----------
         4 |         9          9 |        18 
           |     50.00      50.00 |    100.00 
           |     18.75      42.86 |     26.09 
-----------+----------------------+----------
         5 |         2          9 |        11 
           |     18.18      81.82 |    100.00 
           |      4.17      42.86 |     15.94 
-----------+----------------------+----------
     Total |        48         21 |        69 
           |     69.57      30.43 |    100.00 
           |    100.00     100.00 |    100.00

Using by with the categorical or indicator variables produces separate tables of the statistics by each category.
. by foreign: tab rep78 returns the summary statistics in two tables for each foreign and domestic car type.


------------------------------------------------------------
-> foreign = Domestic
    
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        4.17        4.17
          2 |          8       16.67       20.83
          3 |         27       56.25       77.08
          4 |          9       18.75       95.83
          5 |          2        4.17      100.00
------------+-----------------------------------
      Total |         48      100.00
      
-------------------------------------------------------------
-> foreign = Foreign
     
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |          3       14.29       14.29
          4 |          9       42.86       57.14
          5 |          9       42.86      100.00
------------+-----------------------------------
      Total |         21      100.00

tab1 is a convenience tool for one-way table of frequencies.
. tab1 rep78 foreign headroom returns three tables of frequencies for each variable, equivalent to
. tab rep78
. tab foreign
. tab headroom


-> tabulation of rep78  
      
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.90        2.90
          2 |          8       11.59       14.49
          3 |         30       43.48       57.97
          4 |         18       26.09       84.06
          5 |         11       15.94      100.00
------------+-----------------------------------
      Total |         69      100.00
      
-> tabulation of foreign  
      
  Car type |      Freq.     Percent        Cum.
-----------+-----------------------------------
  Domestic |         52       70.27       70.27
   Foreign |         22       29.73      100.00
-----------+-----------------------------------
     Total |         74      100.00
      
-> tabulation of headroom  
      
   Headroom |
      (in.) |      Freq.     Percent        Cum.
------------+-----------------------------------
        1.5 |          4        5.41        5.41
        2.0 |         13       17.57       22.97
        2.5 |         14       18.92       41.89
        3.0 |         13       17.57       59.46
        3.5 |         15       20.27       79.73
        4.0 |         10       13.51       93.24
        4.5 |          4        5.41       98.65
        5.0 |          1        1.35      100.00
------------+-----------------------------------
      Total |         74      100.00

tabulate category, summarize(var) produces one- and two-way tables of means and standard deviations by category on var.
. tab foreign, sum(weight) returns the summary table of the means and standard deviations of the car weight by foreign and domestic car types.


            |      Summary of Weight (lbs.)
   Car type |        Mean   Std. Dev.       Freq.
------------+------------------------------------
   Domestic |   3,317.115   695.36374          52
    Foreign |   2,315.909   433.00345          22
------------+------------------------------------
      Total |   3,019.459   777.19357          74

tabulate var, generate(newvar) creates indicator variables.
. tab foreign, gen(import) generates two new variables import1, indicating whether the car is domestic made, and import2, indicating whether the car is foreign made.


   Car type |      Freq.     Percent        Cum.
------------+-----------------------------------
   Domestic |         52       70.27       70.27
    Foreign |         22       29.73      100.00
------------+-----------------------------------
      Total |         74      100.00

tabstat

tabstat varlist, by(category) statistics (stat1 stat2…) reports summary statistics by groups, where multiple statistics can be specified, including mean, sum, count, standard deviation, percentiles etc.
. sysuse auto
. tabstat price weight, by(foreign) statistics(n mean sd p50)
reports the number of nonmissing observations, mean, standard deviation, and the 50th percentile (median) of variable price and weight by their car type of foreign or domestic made.


Summary statistics: N, mean, sd, p50
  by categories of: foreign (Car type)
      
 foreign |     price    weight
---------+--------------------
Domestic |        52        52
         |  6072.423  3317.115
         |  3097.104  695.3637
         |    4782.5      3360
---------+--------------------
 Foreign |        22        22
         |  6384.682  2315.909
         |  2621.915  433.0035
         |      5759      2180
---------+--------------------
   Total |        74        74
         |  6165.257  3019.459
         |  2949.496  777.1936
         |    5006.5      3190
------------------------------

table

table category1, by(category2) contents (stat1 var stat2 var stat3 var…) is a more flexible command that produces one-, two-, and up-to-seven-way tables of various summary statistics. Up to four variables can be selected as categories and up to five statistics can be specified.
. sysuse auto
. table rep78, by(foreign) contents(n weight mean weight sd weight p50 weight) reports the mean and the 50th percentile of car weight by repair record rep78 and foreign/domestic car types. foreign is the main category under which each value of price is specified.


------------------------------------------------------------------
Car type  |
and       |
Repair    |
Record    |
1978      |    N(weight)  mean(weight)    sd(weight)   med(weight)
----------+-------------------------------------------------------
Domestic  |
        1 |            2         3,100       523.259         3,100
        2 |            8       3,353.8      445.9961         3,465
        3 |           27       3,442.2      645.3701         3,350
        4 |            9       3,532.2      806.2068         3,700
        5 |            2         1,960      226.2742         1,960
----------+-------------------------------------------------------
Foreign   |
        1 |                                                       
        2 |                                                       
        3 |            3         2,010      158.7451         2,070
        4 |            9       2,207.8      335.4764         2,160
        5 |            9       2,403.3      405.0926         2,240
------------------------------------------------------------------

correlate; pwcorr

correlate and pwcorr both calculate correlations. But corr calculates correlations with nonmissing values only in the entire sample.

. sysuse auto
. corr rep78 mpg price
calculates pairwise correlations among the three variables. rep78 contains missing values; so when calculating the correlation of mpg and price, it only includes the observations where values of rep78 are nonmissing.


(obs=69)
   
             |    rep78      mpg    price
-------------+---------------------------
       rep78 |   1.0000
         mpg |   0.4023   1.0000
       price |   0.0066  -0.4559   1.0000

. pwcorr rep78 mpg price if !missing(rep78, mpg, price) does the same thing as above.


   
             |    rep78      mpg    price
-------------+---------------------------
       rep78 |   1.0000
         mpg |   0.4023   1.0000
       price |   0.0066  -0.4559   1.0000

However, pwcorr rep78 mpg price, without if !missing(rep78, mpg, price), would include the observations where rep78 has missing values when calculating the correlation of mpg and price.


             |    rep78      mpg    price
-------------+---------------------------
       rep78 |   1.0000 
         mpg |   0.4023   1.0000 
       price |   0.0066  -0.4686   1.0000

Command Syntax

Abbreviation

Stata commands can be abbreviated. For instance, instead of typing summarize each time, we used sum.

Type help command to find out how each command can be abbreviated by looking at where they are underlined (e.g. summarize shows that the shortest legitimate form of summarize is su).

Case-sensitivity

Stata commands are case-sensitive and must all be in lowercase.

Command structure

[by varlist]: command [varlist] [=exp] [if exp] [in range] [weight] [using filename][, options]

The above line shows the Stata grammar. Except for the command itself, everything else in the square brackets is optional.

[by varlist] repeats the command on each subset of data. Data must be sorted by varlist, a list of variable names, before we use by. If no variable list is specified, Stata assumes all variables will be used. Alternatively, use bysort to do both by and sort.
. sysuse auto
. sort foreign
. by foreign: tab rep78
or
. bysort foreign: tab rep78


-----------------------------------------------------------
-> foreign = Domestic
      
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        4.17        4.17
          2 |          8       16.67       20.83
          3 |         27       56.25       77.08
          4 |          9       18.75       95.83
          5 |          2        4.17      100.00
------------+-----------------------------------
      Total |         48      100.00
      
-------------------------------------------------------------
-> foreign = Foreign
      
     Repair |
Record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          3 |          3       14.29       14.29
          4 |          9       42.86       57.14
          5 |          9       42.86      100.00
------------+-----------------------------------
      Total |         21      100.00

The by prefix is not the same as the by option that groups a variable in hierarchical data. In addition to the by prefix, Stata has a number of other command prefixes. Type help prefix for a thorough list.

[= exp] is an algebraic expression specifying the value to be assigned to a variable.

[if exp] restricts the commands to perform on a subset of observations that satisfy the conditions.

[in range] specifies an observation range. Negative integers mean “from the end of the data” (e.g. -1 refers to the last observation). f refers to the first observation, while l refers to the last observation.
. sort price
. list make-rep78 in f/10
lists the make and model, repair record, price and mileages of ten cars whose prices are the lowest.


     +--------------------------------------+
     | make             price   mpg   rep78 |
     |--------------------------------------|
  1. | Merc. Zephyr     3,291    20       3 |
  2. | Chev. Chevette   3,299    29       3 |
  3. | Chev. Monza      3,667    24       2 |
  4. | Toyota Corolla   3,748    31       5 |
  5. | Subaru           3,798    35       5 |
     |--------------------------------------|
  6. | AMC Spirit       3,799    22       . |
  7. | Merc. Bobcat     3,829    22       4 |
  8. | Renault Le Car   3,895    26       3 |
  9. | Chev. Nova       3,955    19       3 |
 10. | Dodge Colt       3,984    30       5 |
     +--------------------------------------+

[weight = exp] is a weighting expression to attach weight to each observation.
. sysuse census
. sum medage [weight = pop]
summarizes the median age weighted by the population in 50 states of the U.S.


(analytic weights assumed)
      
    Variable |     Obs      Weight        Mean   Std. Dev.       Min        Max
-------------+-----------------------------------------------------------------
      medage |      50   225907472    30.11047    1.66933       24.2       34.7

Type help weight for more instructions.

[using filename] reads or writes files. We have introduced how Stata reads and writes files in the previous chapter.

[,options] gives a list of options following the comma. Search the help files to learn more for each command that allows for different options.

Operators

Arithmetic

+ addition
- subtraction or negation
* multiplication
/ division
^ exponentiation

String

+ concatenation
e.g. “nyu”+”sh” produces “nyush”

* duplication
e.g. “nyu”*2 produces “nyunyu”

. input str20 A str20 B
. "nyu" "sh"
. "nyu" "ny"
. end
. gen C = A + B
. gen D = A*2
. list


     +---------------------------+
     |   A    B       C        D |
     |---------------------------|
  1. | nyu   sh   nyush   nyunyu |
  2. | nyu   ny   nyuny   nyunyu |
     +---------------------------+

Relational

Relational expressions return results that are either true or false.

== equal
< less than
> greater than
>= greater than or equal to
<= less than or equal to
!= (or ~=) not equal to

Relational operators can be used on both numbers and strings. For instance, both 2>1 and “newyork”>”abudhabi” are true.

Note that = exp is for assignment and == is for comparison.
. sysuse auto
. replace rep78 = 99 if rep78 == .
first evaluates if the value is missing and then assigns 99 to replace the missing values.

Logical

The logical operators return 1 for true and 0 for false.

& and
| or (called the “pipe”)
! (or ~) not

. list make price mpg rep78 if price > 10000 & mpg > 30 |rep78 <= 2 lists car models that meet several conditions.

It is the same as
. list make price mpg rep78 if (price > 10000 |rep78 <= 2) & (mpg > 30 |rep78 <= 2)


    +------------------------------------------+
    | make                 price   mpg   rep78 |
    |------------------------------------------|
12. | Cad. Eldorado       14,500    14       2 |
17. | Chev. Monte Carlo    5,104    22       2 |
18. | Chev. Monza          3,667    24       2 |
21. | Dodge Diplomat       4,010    18       2 |
22. | Dodge Magnum         5,886    16       2 |
    |------------------------------------------|
23. | Dodge St. Regis      6,342    17       2 |
40. | Olds Starfire        4,195    24       1 |
46. | Plym. Volare         4,060    18       2 |
48. | Pont. Firebird       4,934    18       1 |
52. | Pont. Sunbird        4,172    24       2 |
    +------------------------------------------+

Note how it differs from what we would get if we run
. list make price mpg rep78 if price > 10000 & (mpg > 30 | rep78 <= 2)


    +--------------------------------------+
    | make             price   mpg   rep78 |
    |--------------------------------------|
12. | Cad. Eldorado   14,500    14       2 |
    +--------------------------------------+

Comments

Comments are useful in Stata in several ways. While the commands tell Stata what to do, the comments remind ourselves or the readers what we have been doing.

Adding comments

Depending on one's preference, there are three ways to indicate comments in Stata.

*this is a comment
//this is a comment
/* this is a comment */

//this is a comment and /* this is a comment */ are used in do-files.

Anything commented by // and /* */ will not be executed.

Annotating codes

We use comments to annotate our codes, reminding ourselves of why we did one thing and helping our collaborators understand what we did.
. sum length if price > 10000 & !missing(price) //make sure only nonmissing cases are used.

Commenting out commands

/* */ can be used in the middle of the commands to comment out the commands.
. list make if price > 10000 /*|rep78 <= 2 */ if we are testing alternative codes and not sure yet if we want to run the piece of codes in /**/.

Making long commands readable

To make our long commands readable, use /// to tell Stata that the command continues to the next line and everything following the continuation lines is part of the commands.
. sum length if price > 10000 ///
& rep78 <= 2 ///
& make !="Audi Fox"

Starting a do-file properly

We may want to start our do-files with the information of title, author, date etc. using comments.
* project
* author, date

Dividing long scripts

Comments can divide our codes into different parts to make long scripts more readable.
**************
//model1
some commands

**************
//model2
more commands

Data Properties

Missing values

Missing values of numeric variables are denoted by a period (.), the system missing value. Additionally, Stata provides ways to denote extended missing values: . a, .b, …,and .z. We may run into the extended missing values in survey data, which could refer to nonresponse, non-applicable, question unasked etc.

The missing values take the largest positive values. The ordering is numbers< .< .a< .b< … < .z. Results of numeric expressions involving missing values will be missing values.

Missing values of strings are denoted by “” without space.

We should be careful when working with data containing missing values.
. sysuse auto
. list make if rep78 > 4
returns both car models whose repair records were larger than 4 and those that were missing. If we only want those with a record, we should use:
. list make if rep78 > 4 & !missing(rep78)

To check for missing values we can use one of the following two methods:
. if var> .
. if missing(var) that uses the missing() function

More on how Stata handles missing values in various scenarios:
UCLA: Statistical Consulting Group, Missing Values | Stata Learning Modules

Variable names and labels; value labels

Variable names

When it comes to naming variables, some people prefer camel casing (e.g. GearRatio); some prefer underscore (e.g. gear_ratio); and others prefer lowercase abbreviations (e.g. gr). We may also develop our own naming convention. But whichever we use, we should keep to one style throughout.

To change a variable's name, use rename oldvar newvar.

Variable labels

Each variable can have a label attached to it.

To define a new variable label, use label variable varname “label”.
. label variable price "car price" renames the variable name to “car price” from “price”.

Value labels

When browsing the auto data, you may have noticed that the foreign variable has values of characters in blue. This means that this is a numeric variable with labeled values. If you see the values in red, then it is a string variable. The black values are numbers.

We will discuss how to define variable value labels in the next chapter.

Data types and storage types

Variable values can be numbers or strings. In the auto dataset, for instance, make is a string variable and the rest are numeric variables.

Numbers are stored as five types which differ in range and accuracy: byte, int, long, float or double. float is the default storage type. Strings are stored as str#, where # indicates the maximum length of the string.

Storage types affect mainly how much memory will be needed. Usually we will not need to worry about the storage types since Stata will take care of that and find the most efficient ways to store data.

However,

To store an identification variable taking more than 7 digits, store them as long, double, or string. The default float takes only 7 digits and will have the ID values rounded.
Stata performs calculations in double precision. So if we store numbers less than 1 in float, they will be converted to double with 16 digits’ accuracy.
It would save space in memory to store indicator or categorical variables as byte, which stores numbers to the maximum of 100.

There are ways to convert numeric to string or string to numeric variables in Stata; we will return to this topic in strings, dates and times.

Type help data types to learn more.

Display format

format var %fmt changes the display format but not the contents of the variable. % starts the formatting, which can be a number, date, string, business calendar etc.

To use an example from the last chapter, display 100*100 returns "10000" while display %6.0fc 100*100 returns "10,000" with comma. 6 is the width of the result; 0 is the number of digits to follow the decimal point; f is for the fixed format; and c is to indicate the comma format.

We will deal with formats more frequently when we talk about dates and times. Just to give two quick examples here:
. format date %tw changes the display format of the date variable date to be a weekly date.
. format time %tcHH:MM:SS changes the display format of the time variable time to be a datetime consisting of hour(00-23), minute(00-59) and second(00-60).

Type help format to learn more on formatting the outputs.

More examples on data storage type, format and precision:
UCLA: Statistical Consulting Group, Why am I losing precision with large whole numbers (such as an ID variable)? | Stata FAQ

To find out more on how data precision can yield unexpected results and cause confusion, check J.S.Long (2009, p.77-p.81).