We are moving everything to the new site. Please visit our new Stata page. This site will no longer be updated.
egen is the extended generate and requires a function to be specified to generate a new variable.
egen newvar = function(arguments) creates the new variable. Commonly used functions include but are not limited to mean(), sd(), min(), max(), rowmean(), diff(), total(), std(), group() etc.
Type help egen to view a complete list and descriptions of the functions that go with egen.
. sysuse auto
. egen total_weight=total(weight), by(foreign) creates the total car weight by car type.
Note that egen newvar = total() treats missing values as 0. Therefore, if we want to include only the nonmissing cases, we need to
. egen total_weight = total(weight) if !missing(weight), by(foreign)
. egen car_space = rowmean(headroom length) creates an arbitrary measure for car space using the mean of headroom and car length.
Note that if in some cases one of the two variables headroom and length is missing, egen newvar = rowmean() will ignore the missing observations and use the non-missing observations for calculation. If both are missing, egen newvar = rowmean() will then return a missing value. In this example neither variable contains missing values.
Compare this method to the generate method:
. gen car_space2 = (headroom+length)/2 where if any of the variables has missing values, generate will ignore the entire rows and return missing values.
egen group_id = group(old_group_var) creates a new group id with numeric values for the categorical variable.
. sysuse citytemp
. egen region_id = group(region)
. tostring(region_id), replace
generates a new group id with values from 1 to 4 for the categorical variable region and then converts the id variable to a string.
egen newvar = cut(var),at(#,#,…,#) provides one more method of recoding numeric to categorical variables. # specifies the cut-offs with its left-side being inclusive.
. egen price4 = cut(price),at(3291,5000,15906) recodes price into price4 with three intervals [3291,5000), [5000, 15906), and [5000, 15906).
egen newvar = cut(var),group(#) alternatively divides the newly defined variable into groups of equal frequencies.
. egen price5 = cut(price), group(5) generates price5 into 5 groups of the same size.
The four methods of transforming numeric to categorical variables that we have come across so far:
egen newvar = ends() takes out whatever precedes the first space in the string, or the entire string if the string variable does not contain a space. Space is the default separator.
. egen make3 = ends(make) takes out the car make from the combination of make and model by the space between the two.
. list make make3 in 5/15
+------------------------+
| make make3 |
|------------------------|
5. | Buick Electra Buick |
6. | Buick LeSabre Buick |
7. | Buick Opel Buick |
8. | Buick Regal Buick |
9. | Buick Riviera Buick |
|------------------------|
10. | Buick Skylark Buick |
11. | Cad. Deville Cad. |
12. | Cad. Eldorado Cad. |
13. | Cad. Seville Cad. |
14. | Chev. Chevette Chev. |
|------------------------|
15. | Chev. Impala Chev. |
+------------------------+
The punct() option allows one to change where to parse the substring; the default is to parse on the space.
. egen make4 = ends(make), punct(.) takes out either the portion precedes the “.”, or the entire string without “.”.
. list make make4 in 5/15
+--------------------------------+
| make make4 |
|--------------------------------|
5. | Buick Electra Buick Electra |
6. | Buick LeSabre Buick LeSabre |
7. | Buick Opel Buick Opel |
8. | Buick Regal Buick Regal |
9. | Buick Riviera Buick Riviera |
|--------------------------------|
10. | Buick Skylark Buick Skylark |
11. | Cad. Deville Cad |
12. | Cad. Eldorado Cad |
13. | Cad. Seville Cad |
14. | Chev. Chevette Chev |
|--------------------------------|
15. | Chev. Impala Chev |
+--------------------------------+
The punct() trim head|last|tail option further allows one to choose the portion of the string to take out: head, the first substring; last, the last substring; or tail, the remaining substring following the first parsing character.
. egen make5 = ends(make), trim last parses out the last portion from make.
. list make make5 in 5/15
+---------------------------+
| make make5 |
|---------------------------|
5. | Buick Electra Electra |
6. | Buick LeSabre LeSabre |
7. | Buick Opel Opel |
8. | Buick Regal Regal |
9. | Buick Riviera Riviera |
|---------------------------|
10. | Buick Skylark Skylark |
11. | Cad. Deville Deville |
12. | Cad. Eldorado Eldorado |
13. | Cad. Seville Seville |
14. | Chev. Chevette Chevette |
|---------------------------|
15. | Chev. Impala Impala |
+---------------------------+
See summary statistics by group.
See by prefix with min(), max(), sum(), mean() etc. in hierarchical data.
More examples on egen:
Roy Mill, Step #4: Thank God for the egen Command
An indicator variable denotes whether something is true, which is 1, or false, which is 0. Indicator variables are also called dummy variables.
For instance, foreign in Stata's auto dataset is an indicator variable: 1 if the car is foreign made and 0 if domestic made.
. sysuse auto
. codebook foreign
-------------------------------------------------------------------------------
foreign Car type
-------------------------------------------------------------------------------
type: numeric (byte)
label: origin
range: [0,1] units: 1
unique values: 2 missing .: 0/74
tabulation: Freq. Numeric Label
52 0 Domestic
22 1 Foreign
In Stata we can state something as true like below: use the dummy variable without explicitly specifying the condition but with the variable name alone. Stata will know that it means if foreign == 1 or if foreign ~= 1.
. list make if foreign
. list make if ~ foreign
One way to create an indicator variable is to use generate with an statement.
. sysuse auto
. gen rep78In = rep78 >= 5 if !missing(rep78)
The result would be 1 where the condition is true (repair record is more than or equal to 5) and 0 elsewhere.
In hierarchical data, in combination with the by prefix , generate and egen can be used to create indicator variables on lower levels.
. sysuse citytemp
. by region (division),sort: gen heat_Ind1 = heatdd > 8000
defines if each division, a subcategory under a region, has heating degree days larger than 8000.
. sysuse citytemp
. by region (division), sort: egen heat_Ind2 = max(heatdd > 8000)
defines if a region has divisions whose heating degree days are larger than 8000.
We can also use tabulate var, generate(newvar) to create a series of indicator variables.
. tab foreign, gen(import) generates two new variables import1, indicating whether the car is domestic, and import2, indicating whether the car is foreign made.
More on creating indicator variables:
William Gould, StataCorp, How do I create dummy variables?
Factor variables create indicator variables from categorical variables.
The example below contains several factor variables:
. sysuse auto
. reg price mpg c.weight##c.weight ib3.rep78 i.foreign
c.weight##c.weight gives us the squared weight, in addition to the main effect of weight. ib3.rep78 sets the base value at rep78=3 and creates indicators at each value of rep78. i.foreign creates indicators at each value of foreign.
i. indicates unique values/levels of a group
c. indicates a continuous variables
o. omits a variable or indicator
# specifies interactions
## specifies interactions including main effects
i.foreign
indicator variables for each level of foreign
i.foreign#i.rep78
indicator variables for all combinations of each value of foreign and rep78
i.foreign##i.rep78
the same as i.foreign i.rep78 i.foreign#i.rep78
i.foreign#i.rep78#i.make
indicator variables for all combinations of each value of foreign, rep78 and make (not saying i.make would make sense since it has 74 unique levels)
i.foreign##i.rep78##i.make
the same as i.foreign i.rep78 i.make i.foreign#i.rep78 i.rep78#i.make i.foreign#i.make i.foreign#i.rep78#i.make
i.rep78#c.mpg
variables created for the number of the levels of rep78. For each variable, it will be the value of mpg if at the level of rep78 and it will be 0 otherwise.
For instance, if the first observation has rep78=3 and mpg=22, then 3.rep78#c.mpg will be 22 and it will be 0 for 1b.rep78#c.mpg, 2.rep78#c.mpg, 4.rep78#c.mpg and 5.rep78#c.mpg.
3.rep78 at the level where rep78=3
i3.rep78 the same as above
i(2/4).rep78 selects the levels from rep78=2 through rep78=4
i(1 5).rep78 selects the levels where rep78=1 and rep78=5
o(1 5).rep78 omits the levels where rep78=1 and rep78=5
ib#.var changes the base level of the variable, where b is the marker indicating the base value.
For instance, ib3.rep78 sets the base value at rep78=3.
We can refer to observations by subscripting the variables. var[exp] does the explicit subscripting.
When the expressions are the internal variables _n and _N:
_n gives the number of current observations;
_N gives the total number of observations.
var[_n] refers to the current observation;
var[_N] refers to the last observation.
. sysuse auto
. sort price
. price[1] refers to the first observation of price and is now the lowest value of price after sorting.
. price[0] is missing since there is no such observation.
. price[_N] refers to the last observation of price and is now the largest value of price.
Subscripting can be useful in hierarchical data. We will see some cases below.
Suppose we have a dataset that has variables of companies, analyst ids, some event dates and times, analyst scores and ranks, ratings on companies etc.
.list in 1/10
+------------------------------------------------------------------------------------------------------------+
| company sector id datetime rank score year rating |
|------------------------------------------------------------------------------------------------------------|
1. | 0000 Banks/Midcap 79092 2014.03.10 00:25:00 1 15.61 2013 3 |
2. | 0001 Integrated Oil 132083 2014.02.28 19:22:00 4 9.89 2013 2 |
3. | 0001 Oil & Gas Exploration & Production 71894 2014.02.26 00:17:00 4 5.57 2013 3 |
4. | 0001 Oil & Gas Exploration & Production 9928 2014.02.28 04:15:00 2 8.76 2013 3 |
5. | 0001 Oil & Gas Exploration & Production 122739 2014.02.26 03:05:00 3 8.11 2013 3 |
|------------------------------------------------------------------------------------------------------------|
6. | 0001 Oil & Gas Exploration & Production 122739 2014.11.10 04:13:00 3 8.11 2014 2 |
7. | 000R Internet 142116 2014.02.27 01:10:00 4 9.14 2013 3 |
8. | 000R Internet 142116 2014.02.18 00:18:00 4 9.14 2013 2 |
9. | 000R Internet 73111 2014.02.19 15:21:00 1 14.96 2013 2 |
10. | 000R Internet 73111 2014.10.31 06:50:00 1 14.96 2013 1 |
+------------------------------------------------------------------------------------------------------------+
. by company, sort: gen ingroup_id = _n
gives us a unique identifier to each observation within each company.
. list company id datetime ingroup_id in 1/6
+---------------------------------------------------+
| company id datetime ingrou~d |
|---------------------------------------------------|
1. | 0000 79092 2014.03.10 00:25:00 1 |
2. | 0001 132083 2014.02.28 19:22:00 1 |
3. | 0001 71894 2014.02.26 00:17:00 2 |
4. | 0001 9928 2014.02.28 04:15:00 3 |
5. | 0001 122739 2014.02.26 03:05:00 4 |
|---------------------------------------------------|
6. | 0001 122739 2014.11.10 04:13:00 5 |
+---------------------------------------------------+
Say we want to get the mean of the 3 most recent ratings by id and company:
. by id company (datetime), sort: gen rating_3rec_avg = (rating[1] + rating[2] + rating[3]) / 3
Alternatively, if we want to obtain the mean of the 3 most latest ratings:
. by id company (datetime), sort: gen rating_3last_avg = (rating[_N] + rating[_N-1] + rating[_N-2]) / 3
If a group contains all the same values, then the first should be equal to the last one.
. by id company, sort: gen flag = rating[1] == rating[_N]
Say we want to perform t tests on each pair (1 vs 2; 2 vs 3 etc.) of ratings for each sector with year. Before we do that, we want to make sure that each pair exists. Otherwise Stata will throw a warning message at us saying only one group of the pair found.
foreach i in 1 2 3{
//make sure the pairs for t tests exist
gen rk`i' = .
by sector year (rank), sort: ///
replace rk`i' = rank if rank[1] == `i' & rank[_N] == `i'+1
//create score_diff variables by sector*year
egen group`i' = group(sector year) if !missing(rk`i')
qui levelsof group`i', local(g)
gen score_diff`i' = .
foreach x of local g{
qui estpost ttest score if group`i' == `x', by(rk`i')
matrix b = e(b)
replace score_diff`i' = b[1,1] if group`i' == `x'
}
//drop the temp vars
drop rk`i' group`i'
}
Subscripting with _n and _N can be used to create lags and leads. [_n-1] refers to the previous observation; [_n+1] refers to the next observation. Use time-series operators L for lag and F for lead if you are dealing with time-series data.
Duplicates are observations with identical values. The duplicates commands provide a way to report on, give examples of, list, browse, tag, or drop duplicate observations.
duplicates report reports duplicates.
duplicates examples lists one example of the group of the duplicated observations.
duplicates list lists all duplicated observations.
duplicates tag, generate(var) creates a new variable var that gives the number of duplicates of each variable.
duplicates drop keeps only the first of the duplicates and drops the rest.
More examples on the duplicates commands and an alternative method of detecting duplicates:
UCLA: Statistical Consulting Group, How can I detect duplicate observations? | Stata FAQ
egen group_id = group(old_group_var) creates a new group id with numeric values for the categorical variable.
. sysuse citytemp
. egen region_id = group(region)
. tostring(region_id), replace
generates a new group id with values from 1 to 4 for the categorical variable region and then converts the id variable to a string.
We have already introduced earlier several commands that produce summary statistics by groups. To summarize them below:
To calculate summary statistics:
To aggregate data to summary statistics:
collapse (stat1) varlist1 (stat2) varlist2…, by(group varlist)
Use egen:
by group: egen, function()
fillin varlist creates additional rows of observations by filling in all combinations of the specified variables. A new variable _fillin will be created automatically to indicate where the observations come from: 1 if the observations come from the original dataset, or 0 if they are filled in.
Suppose we have a dataset on a course.
. list
+-----------------------------------+
| id course placem~t attend~e |
|-----------------------------------|
1. | 1 A 1 3 |
2. | 1 B 2 4 |
3. | 2 A 3 2 |
4. | 2 C 1 3 |
5. | 3 B 2 4 |
|-----------------------------------|
6. | 3 C 3 5 |
+-----------------------------------+
. fillin id course adds observations from all combinations of id and course; missing values have been created where the person does not have a course record.
. list
+---------------------------------------------+
| id course placem~t attend~e _fillin |
|---------------------------------------------|
1. | 1 A 1 3 0 |
2. | 1 B 2 4 0 |
3. | 1 C . . 1 |
4. | 2 A 3 2 0 |
5. | 2 B . . 1 |
|---------------------------------------------|
6. | 2 C 1 3 0 |
7. | 3 A . . 1 |
8. | 3 B 2 4 0 |
9. | 3 C 3 5 0 |
+---------------------------------------------+
expand [=]exp creates duplicates of each observation with n copies specified in the expression, where the original observation is kept and n-1 copies are created.
Using the same data before fillin above:
. expand attendance makes duplicates of the observations by the number of attendance so that each person will have a record of each attendance of each course. We can then, for instance, add course performance data to each attendance.
. list
. sort id course
+-----------------------------------+
| id course placem~t attend~e |
|-----------------------------------|
1. | 1 A 1 3 |
2. | 1 A 1 3 |
3. | 1 A 1 3 |
4. | 1 B 2 4 |
5. | 1 B 2 4 |
|-----------------------------------|
6. | 1 B 2 4 |
7. | 1 B 2 4 |
8. | 2 A 3 2 |
9. | 2 A 3 2 |
10. | 2 C 1 3 |
|-----------------------------------|
11. | 2 C 1 3 |
12. | 2 C 1 3 |
13. | 3 B 2 4 |
14. | 3 B 2 4 |
15. | 3 B 2 4 |
|-----------------------------------|
16. | 3 B 2 4 |
17. | 3 C 3 5 |
18. | 3 C 3 5 |
19. | 3 C 3 5 |
20. | 3 C 3 5 |
|-----------------------------------|
21. | 3 C 3 5 |
+-----------------------------------+
expand [=]exp, generate(newvar) creates a new variable to indicate if the observations come from the existing dataset or if they are the expanded ones.
We had a dataset with variables of companies, analyst ids, some event dates and times, analyst scores and ranks, ratings on companies etc.
.list in 1/10
+------------------------------------------------------------------------------------------------------------+
| company sector id datetime rank score year rating |
|------------------------------------------------------------------------------------------------------------|
1. | 0000 Banks/Midcap 79092 2014.03.10 00:25:00 1 15.61 2013 3 |
2. | 0001 Integrated Oil 132083 2014.02.28 19:22:00 4 9.89 2013 2 |
3. | 0001 Oil & Gas Exploration & Production 71894 2014.02.26 00:17:00 4 5.57 2013 3 |
4. | 0001 Oil & Gas Exploration & Production 9928 2014.02.28 04:15:00 2 8.76 2013 3 |
5. | 0001 Oil & Gas Exploration & Production 122739 2014.02.26 03:05:00 3 8.11 2013 3 |
|------------------------------------------------------------------------------------------------------------|
6. | 0001 Oil & Gas Exploration & Production 122739 2014.11.10 04:13:00 3 8.11 2014 2 |
7. | 000R Internet 142116 2014.02.27 01:10:00 4 9.14 2013 3 |
8. | 000R Internet 142116 2014.02.18 00:18:00 4 9.14 2013 2 |
9. | 000R Internet 73111 2014.02.19 15:21:00 1 14.96 2013 2 |
10. | 000R Internet 73111 2014.10.31 06:50:00 1 14.96 2013 1 |
+------------------------------------------------------------------------------------------------------------+
We wanted to build models comparing results on ranks 1 versus 2, 2 versus 3, 3 versus the first runner-up, and the last runner-up versus the first non-runner-up. All sounded fine, until it occurred to us that there could be only one runner-up within a group (sector * year). In this case, we want to expand the groups where we only have one runner up, and recode it to rank 4 and 5; the first non-runner-up would be rank 6. Here is what we did.
gen rank2 = .
replace rank2 = rank if rank <= 3
//expand the RUs if we only have one rank == 4
gen expd = .
by sector year rank (score), sort: replace expd = 1 if score[1] == score[_N] & rank == 4
expand expd + 1 if expd == 1
sort sector year rank id score
sort _all
by sector year rank, sort: replace rank2 = 4 if mod(_n,2) == 1 & expd == 1
by sector year rank, sort: replace rank3 = 5 if mod(_n,2) == 0 & expd == 1
//1st RU, last RU
by sector year rank, sort: egen max = max(score) if rank == 4 & expd != 1
by sector year rank: replace rank2 = 4 if rank == 4 & score == max & expd != 1
by sector year rank, sort: egen min = min(score) if rank == 4 & expd != 1
by sector year rank: replace rank2 = 5 if rank == 4 & score == min & expd != 1
An example of using fillin and expand to change time periods in panel data:
Social Science Computing Cooperative, UW-Madison, Stata Programming Techniques for Panel Data: Changing Time Periods
by prefix in combination with functions sum(), max(), min(), and mean() can serve many purposes and be quite helpful in handling hierarchical data.
Results Spreading with mean() in calculating summary statistics:
Social Science Computing Cooperative, UW-Madison, Stata for Researchers: Working with Groups
Another example on spreading results with sum() in creating group id:
Nicholas J. Cox and William Gould, How do I create individual identifiers numbered from 1 upwards?
Creating indicators with sum() to refer to locations of certain values of a variable:
Nicholas J. Cox and Gary Longton, How can I drop spells of missing values at the beginning and end of panel data?