We are moving everything to the new site. Please visit our new Stata page. This site will no longer be updated.
browse and edit let us view the data in a spreadsheet-like form.
browse opens the view-only Data Browser, where we are not allowed to modify data.
edit similarly opens the Data Editor, which, unlike the Browser, allows us to modify data. However, it is not recommended that we change our data interactively in the Editor, at least not without copying and pasting the relevant commands of the changes reflected in the Results window to the do-file.
describe, codebook, list, and count describe our variables.
describe produces a summary of the dataset, including its storage type, display format, value label and variable label. If no variable is specified, describe will describe every variable in the dataset.
. sysuse auto
. describe m*
storage display value
variable name type format label variable label
-------------------------------------------------------------------
make str18 %-18s Make and Model
mpg int %8.0g Mileage (mpg)
This gives us all variables whose names start with m, which are make and mpg.
codebook produces more information than describe does. It gives descriptive statistics for numeric variables and a sample of observations for string variables, in addition to the variable names, labels, and data type.
. sysuse auto
. codebook m*
---------------------------------------------------------------------------------------------------
make Make and Model
---------------------------------------------------------------------------------------------------
type: string (str18), but longest is str17
unique values: 74 missing "": 0/74
examples: "Cad. Deville"
"Dodge Magnum"
"Merc. XR-7"
"Pont. Catalina"
warning: variable has embedded blanks
---------------------------------------------------------------------------------------------------
mpg Mileage (mpg)
---------------------------------------------------------------------------------------------------
type: numeric (int)
range: [12,41] units: 1
unique values: 21 missing .: 0/74
mean: 21.2973
std. dev: 5.7855
percentiles: 10% 25% 50% 75% 90%
14 18 20 25 29
list lists values of the variables.
. sysuse auto
. list make-rep78 in 10/20 if price > 6000, table lists the values of the variables from make to rep78 from the 10th to 20th rows, whose prices are greater than 6000, in the form of a table.
+--------------------------------------+
| make price mpg rep78 |
|--------------------------------------|
11. | Cad. Deville 11,385 14 3 |
12. | Cad. Eldorado 14,500 14 2 |
13. | Cad. Seville 15,906 21 3 |
+--------------------------------------+
count counts the number of observations that satisfy the specified conditions.
. sysuse auto
. count in 10/20 if price > 6000 counts the number of observations from the 10th to 20th rows and whose prices are greater than 6000. In this case Stata will return 3.
The four commands all produce summary statistics; the choice of the command, however, depends on the question we ask.
summarize returns a variety of univariate summary statistics. If no options have been specified for the variables, it displays the number of observations, mean, standard deviation, minimum and maximum values.
. sysuse auto
. sum weight, detail produces additional statistics with the detail option, including skewness, kurtosis, the four smallest and four largest values, and various percentiles.
Weight (lbs.)
-------------------------------------------------------------
Percentiles Smallest
1% 1760 1760
5% 1830 1800
10% 2020 1800 Obs 74
25% 2240 1830 Sum of Wgt. 74
50% 3190 Mean 3019.459
Largest Std. Dev. 777.1936
75% 3600 4290
90% 4060 4330 Variance 604029.8
95% 4290 4720 Skewness .1481164
99% 4840 4840 Kurtosis 2.118403
tabulate produces one- and two-way tables of frequencies.
. sysuse auto
. tab rep78 produces the distribution of the variable car repair record rep78.
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 2.90 2.90
2 | 8 11.59 14.49
3 | 30 43.48 57.97
4 | 18 26.09 84.06
5 | 11 15.94 100.00
------------+-----------------------------------
Total | 69 100.00
In combination with the categorical variable, tabulate can produce a cross-tabulation of summary statistics by category.
. tab rep78 foreign produces the car repair record frequencies by foreign or domestic car type, where foreign is a categorical variable.
Repair |
Record | Car type
1978 | Domestic Foreign | Total
-----------+----------------------+----------
1 | 2 0 | 2
2 | 8 0 | 8
3 | 27 3 | 30
4 | 9 9 | 18
5 | 2 9 | 11
-----------+----------------------+----------
Total | 48 21 | 69
. tab rep78 foreign, column row reports relative frequencies within the columns and rows when the options column and row are specified.
+-------------------+
| Key |
|-------------------|
| frequency |
| row percentage |
| column percentage |
+-------------------+
Repair |
Record | Car type
1978 | Domestic Foreign | Total
-----------+----------------------+----------
1 | 2 0 | 2
| 100.00 0.00 | 100.00
| 4.17 0.00 | 2.90
-----------+----------------------+----------
2 | 8 0 | 8
| 100.00 0.00 | 100.00
| 16.67 0.00 | 11.59
-----------+----------------------+----------
3 | 27 3 | 30
| 90.00 10.00 | 100.00
| 56.25 14.29 | 43.48
-----------+----------------------+----------
4 | 9 9 | 18
| 50.00 50.00 | 100.00
| 18.75 42.86 | 26.09
-----------+----------------------+----------
5 | 2 9 | 11
| 18.18 81.82 | 100.00
| 4.17 42.86 | 15.94
-----------+----------------------+----------
Total | 48 21 | 69
| 69.57 30.43 | 100.00
| 100.00 100.00 | 100.00
Using by with the categorical or indicator variables produces separate tables of the statistics by each category.
. by foreign: tab rep78 returns the summary statistics in two tables for each foreign and domestic car type.
------------------------------------------------------------
-> foreign = Domestic
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 4.17 4.17
2 | 8 16.67 20.83
3 | 27 56.25 77.08
4 | 9 18.75 95.83
5 | 2 4.17 100.00
------------+-----------------------------------
Total | 48 100.00
-------------------------------------------------------------
-> foreign = Foreign
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
3 | 3 14.29 14.29
4 | 9 42.86 57.14
5 | 9 42.86 100.00
------------+-----------------------------------
Total | 21 100.00
tab1 is a convenience tool for one-way table of frequencies.
. tab1 rep78 foreign headroom returns three tables of frequencies for each variable, equivalent to
. tab rep78
. tab foreign
. tab headroom
-> tabulation of rep78
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 2.90 2.90
2 | 8 11.59 14.49
3 | 30 43.48 57.97
4 | 18 26.09 84.06
5 | 11 15.94 100.00
------------+-----------------------------------
Total | 69 100.00
-> tabulation of foreign
Car type | Freq. Percent Cum.
-----------+-----------------------------------
Domestic | 52 70.27 70.27
Foreign | 22 29.73 100.00
-----------+-----------------------------------
Total | 74 100.00
-> tabulation of headroom
Headroom |
(in.) | Freq. Percent Cum.
------------+-----------------------------------
1.5 | 4 5.41 5.41
2.0 | 13 17.57 22.97
2.5 | 14 18.92 41.89
3.0 | 13 17.57 59.46
3.5 | 15 20.27 79.73
4.0 | 10 13.51 93.24
4.5 | 4 5.41 98.65
5.0 | 1 1.35 100.00
------------+-----------------------------------
Total | 74 100.00
tabulate category, summarize(var) produces one- and two-way tables of means and standard deviations by category on var.
. tab foreign, sum(weight) returns the summary table of the means and standard deviations of the car weight by foreign and domestic car types.
| Summary of Weight (lbs.)
Car type | Mean Std. Dev. Freq.
------------+------------------------------------
Domestic | 3,317.115 695.36374 52
Foreign | 2,315.909 433.00345 22
------------+------------------------------------
Total | 3,019.459 777.19357 74
tabulate var, generate(newvar) creates indicator variables.
. tab foreign, gen(import) generates two new variables import1, indicating whether the car is domestic made, and import2, indicating whether the car is foreign made.
Car type | Freq. Percent Cum.
------------+-----------------------------------
Domestic | 52 70.27 70.27
Foreign | 22 29.73 100.00
------------+-----------------------------------
Total | 74 100.00
tabstat varlist, by(category) statistics (stat1 stat2…) reports summary statistics by groups, where multiple statistics can be specified, including mean, sum, count, standard deviation, percentiles etc.
. sysuse auto
. tabstat price weight, by(foreign) statistics(n mean sd p50)
reports the number of nonmissing observations, mean, standard deviation, and the 50th percentile (median) of variable price and weight by their car type of foreign or domestic made.
Summary statistics: N, mean, sd, p50
by categories of: foreign (Car type)
foreign | price weight
---------+--------------------
Domestic | 52 52
| 6072.423 3317.115
| 3097.104 695.3637
| 4782.5 3360
---------+--------------------
Foreign | 22 22
| 6384.682 2315.909
| 2621.915 433.0035
| 5759 2180
---------+--------------------
Total | 74 74
| 6165.257 3019.459
| 2949.496 777.1936
| 5006.5 3190
------------------------------
table category1, by(category2) contents (stat1 var stat2 var stat3 var…) is a more flexible command that produces one-, two-, and up-to-seven-way tables of various summary statistics. Up to four variables can be selected as categories and up to five statistics can be specified.
. sysuse auto
. table rep78, by(foreign) contents(n weight mean weight sd weight p50 weight)
reports the mean and the 50th percentile of car weight by repair record rep78 and foreign/domestic car types. foreign is the main category under which each value of price is specified.
------------------------------------------------------------------
Car type |
and |
Repair |
Record |
1978 | N(weight) mean(weight) sd(weight) med(weight)
----------+-------------------------------------------------------
Domestic |
1 | 2 3,100 523.259 3,100
2 | 8 3,353.8 445.9961 3,465
3 | 27 3,442.2 645.3701 3,350
4 | 9 3,532.2 806.2068 3,700
5 | 2 1,960 226.2742 1,960
----------+-------------------------------------------------------
Foreign |
1 |
2 |
3 | 3 2,010 158.7451 2,070
4 | 9 2,207.8 335.4764 2,160
5 | 9 2,403.3 405.0926 2,240
------------------------------------------------------------------
correlate and pwcorr both calculate correlations. But corr calculates correlations with nonmissing values only in the entire sample.
. sysuse auto
. corr rep78 mpg price
calculates pairwise correlations among the three variables. rep78 contains missing values; so when calculating the correlation of mpg and price, it only includes the observations where values of rep78 are nonmissing.
(obs=69)
| rep78 mpg price
-------------+---------------------------
rep78 | 1.0000
mpg | 0.4023 1.0000
price | 0.0066 -0.4559 1.0000
. pwcorr rep78 mpg price if !missing(rep78, mpg, price) does the same thing as above.
| rep78 mpg price
-------------+---------------------------
rep78 | 1.0000
mpg | 0.4023 1.0000
price | 0.0066 -0.4559 1.0000
However, pwcorr rep78 mpg price, without if !missing(rep78, mpg, price), would include the observations where rep78 has missing values when calculating the correlation of mpg and price.
| rep78 mpg price
-------------+---------------------------
rep78 | 1.0000
mpg | 0.4023 1.0000
price | 0.0066 -0.4686 1.0000
Stata commands can be abbreviated. For instance, instead of typing summarize each time, we used sum.
Type help command to find out how each command can be abbreviated by looking at where they are underlined (e.g. summarize shows that the shortest legitimate form of summarize is su).
Stata commands are case-sensitive and must all be in lowercase.
[by varlist]: command [varlist] [=exp] [if exp] [in range] [weight] [using filename][, options]
The above line shows the Stata grammar. Except for the command itself, everything else in the square brackets is optional.
[by varlist] repeats the command on each subset of data. Data must be sorted by varlist, a list of variable names, before we use by. If no variable list is specified, Stata assumes all variables will be used. Alternatively, use bysort to do both by and sort.
. sysuse auto
. sort foreign
. by foreign: tab rep78
or
. bysort foreign: tab rep78
-----------------------------------------------------------
-> foreign = Domestic
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
1 | 2 4.17 4.17
2 | 8 16.67 20.83
3 | 27 56.25 77.08
4 | 9 18.75 95.83
5 | 2 4.17 100.00
------------+-----------------------------------
Total | 48 100.00
-------------------------------------------------------------
-> foreign = Foreign
Repair |
Record 1978 | Freq. Percent Cum.
------------+-----------------------------------
3 | 3 14.29 14.29
4 | 9 42.86 57.14
5 | 9 42.86 100.00
------------+-----------------------------------
Total | 21 100.00
The by prefix is not the same as the by option that groups a variable in hierarchical data. In addition to the by prefix, Stata has a number of other command prefixes. Type help prefix for a thorough list.
[= exp] is an algebraic expression specifying the value to be assigned to a variable.
[if exp] restricts the commands to perform on a subset of observations that satisfy the conditions.
[in range] specifies an observation range. Negative integers mean “from the end of the data” (e.g. -1 refers to the last observation). f refers to the first observation, while l refers to the last observation.
. sort price
. list make-rep78 in f/10
lists the make and model, repair record, price and mileages of ten cars whose prices are the lowest.
+--------------------------------------+
| make price mpg rep78 |
|--------------------------------------|
1. | Merc. Zephyr 3,291 20 3 |
2. | Chev. Chevette 3,299 29 3 |
3. | Chev. Monza 3,667 24 2 |
4. | Toyota Corolla 3,748 31 5 |
5. | Subaru 3,798 35 5 |
|--------------------------------------|
6. | AMC Spirit 3,799 22 . |
7. | Merc. Bobcat 3,829 22 4 |
8. | Renault Le Car 3,895 26 3 |
9. | Chev. Nova 3,955 19 3 |
10. | Dodge Colt 3,984 30 5 |
+--------------------------------------+
[weight = exp] is a weighting expression to attach weight to each observation.
. sysuse census
. sum medage [weight = pop]
summarizes the median age weighted by the population in 50 states of the U.S.
(analytic weights assumed)
Variable | Obs Weight Mean Std. Dev. Min Max
-------------+-----------------------------------------------------------------
medage | 50 225907472 30.11047 1.66933 24.2 34.7
Type help weight for more instructions.
[using filename] reads or writes files. We have introduced how Stata reads and writes files in the previous chapter.
[,options] gives a list of options following the comma. Search the help files to learn more for each command that allows for different options.
+ addition
- subtraction or negation
* multiplication
/ division
^ exponentiation
+ concatenation
e.g. “nyu”+”sh” produces “nyush”
* duplication
e.g. “nyu”*2 produces “nyunyu”
. input str20 A str20 B
. "nyu" "sh"
. "nyu" "ny"
. end
. gen C = A + B
. gen D = A*2
. list
+---------------------------+
| A B C D |
|---------------------------|
1. | nyu sh nyush nyunyu |
2. | nyu ny nyuny nyunyu |
+---------------------------+
Relational expressions return results that are either true or false.
== equal
< less than
> greater than
>= greater than or equal to
<= less than or equal to
!= (or ~=) not equal to
Relational operators can be used on both numbers and strings. For instance, both 2>1 and “newyork”>”abudhabi” are true.
Note that = exp is for assignment and == is for comparison.
. sysuse auto
. replace rep78 = 99 if rep78 == .
first evaluates if the value is missing and then assigns 99 to replace the missing values.
The logical operators return 1 for true and 0 for false.
& and
| or (called the “pipe”)
! (or ~) not
. list make price mpg rep78 if price > 10000 & mpg > 30 |rep78 <= 2 lists car models that meet several conditions.
It is the same as
. list make price mpg rep78 if (price > 10000 |rep78 <= 2) & (mpg > 30 |rep78 <= 2)
+------------------------------------------+
| make price mpg rep78 |
|------------------------------------------|
12. | Cad. Eldorado 14,500 14 2 |
17. | Chev. Monte Carlo 5,104 22 2 |
18. | Chev. Monza 3,667 24 2 |
21. | Dodge Diplomat 4,010 18 2 |
22. | Dodge Magnum 5,886 16 2 |
|------------------------------------------|
23. | Dodge St. Regis 6,342 17 2 |
40. | Olds Starfire 4,195 24 1 |
46. | Plym. Volare 4,060 18 2 |
48. | Pont. Firebird 4,934 18 1 |
52. | Pont. Sunbird 4,172 24 2 |
+------------------------------------------+
Note how it differs from what we would get if we run
+--------------------------------------+
| make price mpg rep78 |
|--------------------------------------|
12. | Cad. Eldorado 14,500 14 2 |
+--------------------------------------+
Comments are useful in Stata in several ways. While the commands tell Stata what to do, the comments remind ourselves or the readers what we have been doing.
Depending on one's preference, there are three ways to indicate comments in Stata.
*this is a comment
//this is a comment
/* this is a comment */
//this is a comment and /* this is a comment */ are used in do-files.
Anything commented by // and /* */ will not be executed.
We use comments to annotate our codes, reminding ourselves of why we did one thing and helping our collaborators understand what we did.
. sum length if price > 10000 & !missing(price) //make sure only nonmissing cases are used.
/* */ can be used in the middle of the commands to comment out the commands.
. list make if price > 10000 /*|rep78 <= 2 */ if we are testing alternative codes and not sure yet if we want to run the piece of codes in /**/.
To make our long commands readable, use /// to tell Stata that the command continues to the next line and everything following the continuation lines is part of the commands.
. sum length if price > 10000 ///
& rep78 <= 2 ///
& make !="Audi Fox"
We may want to start our do-files with the information of title, author, date etc. using comments.
* project
* author, date
Comments can divide our codes into different parts to make long scripts more readable.
**************
//model1
some commands
**************
//model2
more commands
Missing values of numeric variables are denoted by a period (.), the system missing value. Additionally, Stata provides ways to denote extended missing values: . a, .b, …,and .z. We may run into the extended missing values in survey data, which could refer to nonresponse, non-applicable, question unasked etc.
The missing values take the largest positive values. The ordering is numbers< .< .a< .b< … < .z. Results of numeric expressions involving missing values will be missing values.
Missing values of strings are denoted by “” without space.
We should be careful when working with data containing missing values.
. sysuse auto
. list make if rep78 > 4
returns both car models whose repair records were larger than 4 and those that were missing. If we only want those with a record, we should use:
. list make if rep78 > 4 & !missing(rep78)
To check for missing values we can use one of the following two methods:
. if var> .
. if missing(var) that uses the missing() function
More on how Stata handles missing values in various scenarios:
UCLA: Statistical Consulting Group, Missing Values | Stata Learning Modules
When it comes to naming variables, some people prefer camel casing (e.g. GearRatio); some prefer underscore (e.g. gear_ratio); and others prefer lowercase abbreviations (e.g. gr). We may also develop our own naming convention. But whichever we use, we should keep to one style throughout.
To change a variable's name, use rename oldvar newvar.
Each variable can have a label attached to it.
To define a new variable label, use label variable varname “label”.
. label variable price "car price" renames the variable name to “car price” from “price”.
When browsing the auto data, you may have noticed that the foreign variable has values of characters in blue. This means that this is a numeric variable with labeled values. If you see the values in red, then it is a string variable. The black values are numbers.
We will discuss how to define variable value labels in the next chapter.
Variable values can be numbers or strings. In the auto dataset, for instance, make is a string variable and the rest are numeric variables.
Numbers are stored as five types which differ in range and accuracy: byte, int, long, float or double. float is the default storage type. Strings are stored as str#, where # indicates the maximum length of the string.
Storage types affect mainly how much memory will be needed. Usually we will not need to worry about the storage types since Stata will take care of that and find the most efficient ways to store data.
However,
There are ways to convert numeric to string or string to numeric variables in Stata; we will return to this topic in strings, dates and times.
Type help data types to learn more.
format var %fmt changes the display format but not the contents of the variable. % starts the formatting, which can be a number, date, string, business calendar etc.
To use an example from the last chapter, display 100*100 returns "10000" while display %6.0fc 100*100 returns "10,000" with comma. 6 is the width of the result; 0 is the number of digits to follow the decimal point; f is for the fixed format; and c is to indicate the comma format.
We will deal with formats more frequently when we talk about dates and times. Just to give two quick examples here:
. format date %tw changes the display format of the date variable date to be a weekly date.
. format time %tcHH:MM:SS changes the display format of the time variable time to be a datetime consisting of hour(00-23), minute(00-59) and second(00-60).
Type help format to learn more on formatting the outputs.
More examples on data storage type, format and precision:
UCLA: Statistical Consulting Group, Why am I losing precision with large whole numbers (such as an ID variable)? | Stata FAQ
To find out more on how data precision can yield unexpected results and cause confusion, check J.S.Long (2009, p.77-p.81).