We are moving everything to the new site. Please visit our new Stata page. This site will no longer be updated.
The four commands are often used to create or change the contents of variables.
generate newvar = exp creates the new variable from existing variables through an expression.
. sysuse auto
. gen price1 = price/2.5
replace changes the contents of a variable. It can be used in combination with generate to recode the newly created variable from the existing numeric variable.
. gen price2 = price
. replace price2 = price*0.8 if foreign == 1
recode var (rule)..., generate(newvar) changes the contents of numeric variables, usually to create categorical variables.
. sysuse auto
. recode rep78 (min/2 = 1 good) (3 = 2 fair) (4 5 = 3 poor) (missing = .), gen(rep78_scale) label(repair_record_scale)
Some notes for recode:
ranges
Values specified in the () includes the two boundaries.
Numbers to the left of = are values to be recoded, while the number following = is the new value to be assigned.
/ means “through”.
values
min refers to the smallest value; max refers to the largest value.
missing indicates all missing values; nonmissing for all nonmissing values; and else for both missing and non-missing values.
If there are values unassigned, they will be taken to the new variable as they are.
value labels
Contents following the assigned value is the value label (e.g. good).
label() gives a name to the new value label.
To check if the transformations have worked as we would like them to be, it is always a good idea to cross-tabulate the newly defined variables and the variables created from.
. tab rep78 rep78_scale
Repair | RECODE of rep78 (Repair Record 1978)
Record |
1978 | good fair poor | Total
-----------+---------------------------------+----------
1 | 2 0 0 | 2
2 | 8 0 0 | 8
3 | 0 30 0 | 30
4 | 0 0 18 | 18
5 | 0 0 11 | 11
-----------+---------------------------------+----------
Total | 10 30 29 | 69
Another way to convert numeric variables to categorical or factor variables is to use autocode, the automated version of recode.
generate newvar = autocode(var, n, a, b) recodes the numeric variable var into a categorical variable newvar with n equal-length intervals; a and b are the two boundaries where a is inclusive.
. sysuse auto
. sort price
. gen price3 = autocode(price,3,price[1],price[_N])
recodes the numeric variable price into a variable of three categories ranging from its minimum to the maximum value. price[1] and price[_N] are explicit subscripting that tells Stata where to refer to.
These three commands provide additional information to a variable. We have met the first two in the previous chapter, but for convenience let’s review them altogether here.
label variable varname “label” attaches a label to a variable.
. label variable rep78 “repair_record_78”
rename oldvar newvar renames the variable name.
. rename rep78 repair_record
notes: varname adds notes to a variable.
. notes rep78: check the missing values
label define and label values attach value labels to a numeric variable in two steps:
label define labelname # label # label … first creates a label name for all the labels we are going to attach to the variable values .
label values var labelname then matches the labels to the values.
For instance,
. sysuse auto
. gen priceIndicator = (price >= 5000 & price < .)
. label define priceLabel 0 affordable 1 expensive
. label values priceIndicator priceLabel
first creates an indicator variable based on an arbitrary standard of affordability (price<5000), and then attaches the labels of “affordable” to the numeric value 0 and “expensive” to 1.
. codebook priceIndicator
------------------------------------------------------------------------------------------
priceIndicator (unlabeled)
------------------------------------------------------------------------------------------
type: numeric (float)
label: priceLabel
range: [0,1] units: 1
unique values: 2 missing .: 0/74
tabulation: Freq. Numeric Label
37 0 affordable
37 1 expensive
We can see how each value of priceIndicator is labeled. The label name is "priceLabel".
Once we got the value label name, we can list the labels for each value of the variable.
label list labelname returns a list.
. label list priceLabel
priceLabel:
0 affordable
1 expensive
In fact, in the auto data, variable foreign came with attached value lables.
. codebook foreign
--------------------------------------------------------------------------------------
foreign Car type
--------------------------------------------------------------------------------------
type: numeric (byte)
label: origin
range: [0,1] units: 1
unique values: 2 missing .: 0/74
tabulation: Freq. Numeric Label
52 0 Domestic
22 1 Foreign
. label list origin
origin:
0 Domestic
1 Foreign
label drop labelname removes the value labels from a variable.
. label drop priceLabel
encode var, gen(newvar) creates a numeric variable newvar by assigning a numeric value to each of var's string value. It provides a mapping between the numeric value and the value label.
. sysuse auto
. encode make, gen(make_num) creates the numeric variable make_num.
. label list make_num gives us the mapping between the numeric values and the labels of the newly defined numeric variable make_num.
make_num:
1 AMC Concord
2 AMC Pacer
3 AMC Spirit
4 Audi 5000
5 Audi Fox
6 BMW 320i
7 Buick Century
8 Buick Electra
9 Buick LeSabre
10 Buick Opel
11 Buick Regal
12 Buick Riviera
13 Buick Skylark
14 Cad. Deville
15 Cad. Eldorado
16 Cad. Seville
17 Chev. Chevette
18 Chev. Impala
19 Chev. Malibu
20 Chev. Monte Carlo
21 Chev. Monza
22 Chev. Nova
23 Datsun 200
24 Datsun 210
25 Datsun 510
26 Datsun 810
27 Dodge Colt
28 Dodge Diplomat
29 Dodge Magnum
30 Dodge St. Regis
31 Fiat Strada
32 Ford Fiesta
33 Ford Mustang
34 Honda Accord
35 Honda Civic
36 Linc. Continental
37 Linc. Mark V
38 Linc. Versailles
39 Mazda GLC
40 Merc. Bobcat
41 Merc. Cougar
42 Merc. Marquis
43 Merc. Monarch
44 Merc. XR-7
45 Merc. Zephyr
46 Olds 98
47 Olds Cutl Supr
48 Olds Cutlass
49 Olds Delta 88
50 Olds Omega
51 Olds Starfire
52 Olds Toronado
53 Peugeot 604
54 Plym. Arrow
55 Plym. Champ
56 Plym. Horizon
57 Plym. Sapporo
58 Plym. Volare
59 Pont. Catalina
60 Pont. Firebird
61 Pont. Grand Prix
62 Pont. Le Mans
63 Pont. Phoenix
64 Pont. Sunbird
65 Renault Le Car
66 Subaru
67 Toyota Celica
68 Toyota Corolla
69 Toyota Corona
70 VW Dasher
71 VW Diesel
72 VW Rabbit
73 VW Scirocco
74 Volvo 260
decode var, gen(newvar) works the other way around. It creates a string variable newvar using the value labels attached earlier to the numeric variable var.
. decode foreign, gen(foreign_str) creates a string variable that has two values “Domestic” and “Foreign”.
. codebook foreign_str
-----------------------------------------------------------------------------------
foreign_str Car type
-----------------------------------------------------------------------------------
type: string (str8)
unique values: 2 missing "": 0/74
tabulation: Freq. Value
52 "Domestic"
22 "Foreign"
Compare with
-----------------------------------------------------------------------------------
foreign Car type
-----------------------------------------------------------------------------------
type: numeric (byte)
label: origin
range: [0,1] units: 1
unique values: 2 missing .: 0/74
tabulation: Freq. Numeric Label
52 0 Domestic
22 1 Foreign
destring/tostring, replace or destring/tostring, generate() converts string variables to numeric variables and vice versa.
. tostring make_num, gen(make_str) converts the numeric variable make_num previously encoded from make back to a string variable but with values in numbers rather than alphabetical letters.
. codebook make_str
-----------------------------------------------------------------------------------
make_str Make and Model
-----------------------------------------------------------------------------------
type: string (str2)
unique values: 74 missing "": 0/74
examples: "21"
"35"
"49"
"62"
. destring make_str, replace converts make_str to a numeric variable.
. codebook make_str
-----------------------------------------------------------------------------------
make_str Make and Model
-----------------------------------------------------------------------------------
type: numeric (byte)
range: [1,74] units: 1
unique values: 74 missing .: 0/74
mean: 37.5
std. dev: 21.5058
percentiles: 10% 25% 50% 75% 90%
8 19 37.5 56 67
See also the real()/string() functions.
label define creates labels to each value of a numeric variable.
encode works on strings. It creates a new variable, and assigns a numeric value to each of the string value.
decode works on numeric variables that has attached value labels. It creates a new string variable that gets its variable values from the previous label values.
destring works on string variables that contain numbers stored as strings, and converts them to numbers.
tostring converts numeric variables to strings.
real()/string() functions convert strings/numeric variables between each other, but cannot be used just by themselves.
assert verifies if an expression is true. It returns no output if the statement is indeed true.
. egen car_space = rowmean(headroom length)
. gen car_space2 = (headroom+length)/2
both have created an arbitrary measure for car space using the mean of headroom and car length.
More on comparison of the two methods where we discuss rowmean().
. assert car_space == car_space2 examines if the two methods yield the same results.
keep and drop work in the same way: they keep or drop variables or observations.
keep/drop var subsets data by variables.
. sysuse auto
. keep price
keep/drop if exp subsets observations by conditions.
. drop if price < 5000
keep/drop in range subsets observations within certain rows.
. keep in 10/20
sort/gsort arranges the observations in ascending order.
. sysuse auto
. sort make
gsort -var can arrange the observations in descending order.
. gsort mpg sorts mpg in ascending order.
. gsort -mpg price sorts mpg in descending order and price in ascending order.
order arranges the order of variables.
order varlist(, first) or order varlist, last moves the variables to the first or last column.
. order trunk turn, last moves the two variables trunk and turn to the last two columns.
order varlist, before(var) or order varlist, after(var) moves the variables before or after a certain variable.
. order trunk turn, before(mpg) moves trunk and turn before mpg.
append, merge and joinby bind the datasets by attaching the “using dataset” stored on disk to the “master dataset” stored in memory.
append binds datasets vertically by adding observations from the using dataset to the master dataset.
The datasets should contain mostly the same variables; absent variables from one dataset will be given missing values in the combined file. append applies, for instance, when we want to combine datasets of different times or regions with mostly the same variables.
. use file1
. list
+--------------------------------------+
| id year family role rating |
|--------------------------------------|
1. | 1 1995 1 child A |
2. | 2 1995 1 child B |
3. | 3 1995 1 mother B |
4. | 1 1995 2 father C |
5. | 2 1995 2 mother B |
|--------------------------------------|
6. | 3 1995 2 child A |
+--------------------------------------+
. use file2, clear
+--------------------------------------+
| id year family role rating |
|--------------------------------------|
1. | 1 2005 1 child A |
2. | 2 2005 1 child A |
3. | 3 2005 1 mother B |
4. | 1 2005 2 father B |
5. | 2 2005 2 mother C |
|--------------------------------------|
6. | 3 2005 2 child B |
+--------------------------------------+
. append using file1
+--------------------------------------+
| id year family role rating |
|--------------------------------------|
1. | 1 2005 1 child A |
2. | 2 2005 1 child A |
3. | 3 2005 1 mother B |
4. | 1 2005 2 father B |
5. | 2 2005 2 mother C |
|--------------------------------------|
6. | 3 2005 2 child B |
7. | 1 1995 1 child A |
8. | 2 1995 1 child B |
9. | 3 1995 1 mother B |
10. | 1 1995 2 father C |
|--------------------------------------|
11. | 2 1995 2 mother B |
12. | 3 1995 2 child A |
+--------------------------------------+
. append using file
adds the observations from file to the master file.
. append using file, keep (var1 var2 var3)
appends the file file and keeps only the three specified variables in the combined dataset.
. append using file1 file2, generate (filenum)
appends file1 and file2 to the master file and assigns a number to the observations to indicate the source file.
We can also create an id for each source file by ourselves that are meaningful to the datasets before appending the datasets.
. use file, clear
. generate str4 source = ”2016”
. save fileb
. use masterfile, clear
. generate str4 source = ”2015”
. append using fileb
. append using file1b file2b
appends file1 and file2 to the master file.
What if we have a long list of files to append to the master file? We will offer a method to automate the appending that greatly saves our time in the automation chapter.
merge combines datasets horizontally by adding variables to the existing dataset.
The datasets are combined by one or more key variables. The key variables must have the same names and data type (numeric or string), otherwise we will receive an error message.
If one dataset contains more observations than the other, missing values will be generated where not matched.
types of merges
There are three types of match merges that we frequently use.
merge 1:1 varlist using filename
In one-to-one merge, we use a unique key variable in both the master and the using datasets to merge files. In both files there should be only one unique value for the key variable.
merge 1:m varlist using filename
merge m:1 varlist using filename
In one-to-many and many-to-one merges, one observation from one dataset is matched to many observations in the other dataset. If the master dataset has many observations to match with the single observation in the using dataset, we use m:1; or we use 1:m if it is the using dataset that has many to match.
One-to-many or many-to-one merges are most frequently met when dealing with hierarchical data.
. use file5, clear
. list
+---------------------+
| family area SES |
|---------------------|
1. | 1 N 1 |
2. | 2 S 2 |
+---------------------+
. use file1, clear
+--------------------------------------+
| id year family role rating |
|--------------------------------------|
1. | 1 1995 1 child A |
2. | 2 1995 1 child B |
3. | 3 1995 1 mother B |
4. | 1 1995 2 father C |
5. | 2 1995 2 mother B |
|--------------------------------------|
6. | 3 1995 2 child A |
+--------------------------------------+
. merge m:1 family using file5
Result # of obs.
-----------------------------------------
not matched 0
matched 6 (_merge==3)
-----------------------------------------
. list
+-----------------------------------------------------------------+
| id year family role rating area SES _merge |
|-----------------------------------------------------------------|
1. | 1 1995 1 child A N 1 matched (3) |
2. | 2 1995 1 child B N 1 matched (3) |
3. | 3 1995 1 mother B N 1 matched (3) |
4. | 1 1995 2 father C S 2 matched (3) |
5. | 2 1995 2 mother B S 2 matched (3) |
|-----------------------------------------------------------------|
6. | 3 1995 2 child A S 2 matched (3) |
+-----------------------------------------------------------------+
The fourth type, many-to-many merges, can yield random results and be problematic. We will not discuss it here.
multiple key variables
We can merge files using more than one key variable when a single variable is not adequate to match the two datasets.
. use file3, clear
. list
+----------------------+
| id family role |
|----------------------|
1. | 1 1 child |
2. | 2 1 child |
3. | 3 1 mother |
4. | 1 2 father |
5. | 2 2 mother |
|----------------------|
6. | 3 2 child |
+----------------------+
. use file4, clear
+-----------------------------+
| id year family rating |
|-----------------------------|
1. | 1 1995 1 A |
2. | 2 1995 1 B |
3. | 3 1995 1 B |
4. | 1 1995 2 C |
5. | 2 1995 2 B |
|-----------------------------|
6. | 3 1995 2 A |
+-----------------------------+
. merge 1:1 id family using file3
Result # of obs.
-----------------------------------------
not matched 0
matched 6 (_merge==3)
-----------------------------------------
. list
+----------------------------------------------------+
| id year family rating role _merge |
|----------------------------------------------------|
1. | 1 1995 1 A child matched (3) |
2. | 1 1995 2 C father matched (3) |
3. | 2 1995 1 B child matched (3) |
4. | 2 1995 2 B mother matched (3) |
5. | 3 1995 1 B mother matched (3) |
|----------------------------------------------------|
6. | 3 1995 2 A child matched (3) |
+----------------------------------------------------+
_merge
After the merge the _merge variable will be automatically created. It tells us how the observations have been matched. The _merge variable, in the simple cases, has three values:
_merge == 1, observation appeared in the master file only
_merge == 2, observation appeared in the using file only
_merge == 3, observation appeared in both files
The _merge variable must be dropped or renamed before we perform the next merge.
tab _merge to find out how your merge went.
Type help merge for more merge options.
joinby forms all pairwise combinations within groups.
. use file6, clear
. list
+---------------------------+
| industry company rank |
|---------------------------|
1. | A a 1 |
2. | A b 2 |
3. | A c 3 |
4. | B l 4 |
5. | B m 5 |
|---------------------------|
6. | B n 6 |
+---------------------------+
. use file7, clear
+-----------------+
| industry year |
|-----------------|
1. | A 1995 |
2. | A 2005 |
3. | B 1995 |
4. | B 2005 |
+-----------------+
. use file6, clear
+----------------------------------+
| industry company rank year |
|----------------------------------|
1. | A a 1 1995 |
2. | A a 1 2005 |
3. | A b 2 1995 |
4. | A b 2 2005 |
5. | A c 3 1995 |
|----------------------------------|
6. | A c 3 2005 |
7. | B l 4 2005 |
8. | B l 4 1995 |
9. | B m 5 1995 |
10. | B m 5 2005 |
|----------------------------------|
11. | B n 6 1995 |
12. | B n 6 2005 |
+----------------------------------+
For the unmatched observations from the datasets, the default is to be ignored. But you can also set them to be kept in different ways by the unmatched(both/master/using) option.
reshape long varlist, i(i) j(j) or reshape wide varlist, i(i) j(j) converts data from long to wide and vice versa.
i is the id variable of a higher level;
j is the id variable of a lower level;
wide means reshaping to wide;
long means reshaping to long.
There are more observations in the long format and more variables in the wide format. In the long format, variable values of the lower levels are grouped under values of a higher level. In the wide format, each column represents variable values of lower levels.
. use long
. list
+------------------------------------------+
| id semester course gpa attend~e |
|------------------------------------------|
1. | 1 1 A 3.81 1 |
2. | 1 1 B 3.82 2 |
3. | 1 2 A 3.76 3 |
4. | 1 2 B 3.77 4 |
5. | 2 1 A 3.56 4 |
|------------------------------------------|
6. | 2 1 B 3.55 3 |
7. | 2 2 A 3.45 2 |
8. | 2 2 B 3.47 1 |
+------------------------------------------+
. reshape wide gpa attendance, i(id semester) j(course) string converts the dataset from long to wide.
(note: j = A B)
Data long -> wide
-----------------------------------------------------------------------------
Number of obs. 8 -> 4
Number of variables 5 -> 6
j variable (2 values) course -> (dropped)
xij variables:
gpa -> gpaA gpaB
attendance -> attendanceA attendanceB
-----------------------------------------------------------------------------
. list
+---------------------------------------------------+
| id semester gpaA attend~A gpaB attend~B |
|---------------------------------------------------|
1. | 1 1 3.81 1 3.82 2 |
2. | 1 2 3.76 3 3.77 4 |
3. | 2 1 3.56 4 3.55 3 |
4. | 2 2 3.45 2 3.47 1 |
+---------------------------------------------------+
. reshape long changes it back to the long format.
(note: j = A B)
Data wide -> long
-----------------------------------------------------------------------
Number of obs. 4 -> 8
Number of variables 6 -> 5
j variable (2 values) -> course
xij variables:
gpaA gpaB -> gpa
attendanceA attendanceB -> attendance
-----------------------------------------------------------------------
collapse (stat1) varlist1 (stat2) varlist2…, by(group varlist) aggregates the dataset to summary statistics. stat options include mean, median, percentiles, standard deviations, standard errors, first/last values, maximum/minimum etc.
. use long
. collapse (mean) gpa, by(semester id)
returns the mean of gpa of each id by semester.
. list
+-----------------------+
| id semester gpa |
|-----------------------|
1. | 1 1 3.815 |
2. | 2 1 3.555 |
3. | 1 2 3.765 |
4. | 2 2 3.46 |
+-----------------------+