Preparing Your Data II: More Tricks and Cases

egen

egen is the extended generate and requires a function to be specified to generate a new variable.

egen newvar = function(arguments) creates the new variable. Commonly used functions include but are not limited to mean(), sd(), min(), max(), rowmean(), diff(), total(), std(), group() etc.

Type help egen to view a complete list and descriptions of the functions that go with egen.

Generating new variables

total()

. sysuse auto
. egen total_weight=total(weight), by(foreign) creates the total car weight by car type.

Note that egen newvar = total() treats missing values as 0. Therefore, if we want to include only the nonmissing cases, we need to
. egen total_weight = total(weight) if !missing(weight), by(foreign)

rowmean()

. egen car_space = rowmean(headroom length) creates an arbitrary measure for car space using the mean of headroom and car length.

Note that if in some cases one of the two variables headroom and length is missing, egen newvar = rowmean() will ignore the missing observations and use the non-missing observations for calculation. If both are missing, egen newvar = rowmean() will then return a missing value. In this example neither variable contains missing values.

Compare this method to the generate method:
. gen car_space2 = (headroom+length)/2 where if any of the variables has missing values, generate will ignore the entire rows and return missing values.

group()

egen group_id = group(old_group_var) creates a new group id with numeric values for the categorical variable.

. sysuse citytemp
. egen region_id = group(region)
. tostring(region_id), replace
generates a new group id with values from 1 to 4 for the categorical variable region and then converts the id variable to a string.

Recoding variables

cut()

egen newvar = cut(var),at(#,#,…,#) provides one more method of recoding numeric to categorical variables. # specifies the cut-offs with its left-side being inclusive.
. egen price4 = cut(price),at(3291,5000,15906) recodes price into price4 with three intervals [3291,5000), [5000, 15906), and [5000, 15906).

egen newvar = cut(var),group(#) alternatively divides the newly defined variable into groups of equal frequencies.
. egen price5 = cut(price), group(5) generates price5 into 5 groups of the same size.

Summary

The four methods of transforming numeric to categorical variables that we have come across so far:

generate var, replace

recode var, generate()

generate var, autocode()

egen var = cut()

Extracting substrings

ends()

egen newvar = ends() takes out whatever precedes the first space in the string, or the entire string if the string variable does not contain a space. Space is the default separator.

. egen make3 = ends(make) takes out the car make from the combination of make and model by the space between the two.
. list make make3 in 5/15


     +------------------------+
     | make             make3 |
     |------------------------|
  5. | Buick Electra    Buick |
  6. | Buick LeSabre    Buick |
  7. | Buick Opel       Buick |
  8. | Buick Regal      Buick |
  9. | Buick Riviera    Buick |
     |------------------------|
 10. | Buick Skylark    Buick |
 11. | Cad. Deville      Cad. |
 12. | Cad. Eldorado     Cad. |
 13. | Cad. Seville      Cad. |
 14. | Chev. Chevette   Chev. |
     |------------------------|
 15. | Chev. Impala     Chev. |
     +------------------------+

The punct() option allows one to change where to parse the substring; the default is to parse on the space.
. egen make4 = ends(make), punct(.) takes out either the portion precedes the “.”, or the entire string without “.”.
. list make make4 in 5/15


     +--------------------------------+
     | make                     make4 |
     |--------------------------------|
  5. | Buick Electra    Buick Electra |
  6. | Buick LeSabre    Buick LeSabre |
  7. | Buick Opel          Buick Opel |
  8. | Buick Regal        Buick Regal |
  9. | Buick Riviera    Buick Riviera |
     |--------------------------------|
 10. | Buick Skylark    Buick Skylark |
 11. | Cad. Deville               Cad |
 12. | Cad. Eldorado              Cad |
 13. | Cad. Seville               Cad |
 14. | Chev. Chevette            Chev |
     |--------------------------------|
 15. | Chev. Impala              Chev |
     +--------------------------------+

The punct() trim head|last|tail option further allows one to choose the portion of the string to take out: head, the first substring; last, the last substring; or tail, the remaining substring following the first parsing character.
. egen make5 = ends(make), trim last parses out the last portion from make.
. list make make5 in 5/15


     +---------------------------+
     | make                make5 |
     |---------------------------|
  5. | Buick Electra     Electra |
  6. | Buick LeSabre     LeSabre |
  7. | Buick Opel           Opel |
  8. | Buick Regal         Regal |
  9. | Buick Riviera     Riviera |
     |---------------------------|
 10. | Buick Skylark     Skylark |
 11. | Cad. Deville      Deville |
 12. | Cad. Eldorado    Eldorado |
 13. | Cad. Seville      Seville |
 14. | Chev. Chevette   Chevette |
     |---------------------------|
 15. | Chev. Impala       Impala |
     +---------------------------+

Summary statistics by group

See summary statistics by group.

Other scenarios

See by prefix with min(), max(), sum(), mean() etc. in hierarchical data.

More examples on egen:
Roy Mill, Step #4: Thank God for the egen Command

Creating Indicator Variables (Dummy Variables)

An indicator variable denotes whether something is true, which is 1, or false, which is 0. Indicator variables are also called dummy variables.

For instance, foreign in Stata's auto dataset is an indicator variable: 1 if the car is foreign made and 0 if domestic made.
. sysuse auto
. codebook foreign


-------------------------------------------------------------------------------
foreign                                                                Car type
-------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  origin

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/74

            tabulation:  Freq.   Numeric  Label
                            52         0  Domestic
                            22         1  Foreign

In Stata we can state something as true like below: use the dummy variable without explicitly specifying the condition but with the variable name alone. Stata will know that it means if foreign == 1 or if foreign ~= 1.
. list make if foreign
. list make if ~ foreign

generate

One way to create an indicator variable is to use generate with an statement.
. sysuse auto
. gen rep78In = rep78 >= 5 if !missing(rep78)
The result would be 1 where the condition is true (repair record is more than or equal to 5) and 0 elsewhere.

In hierarchical data, in combination with the by prefix , generate and egen can be used to create indicator variables on lower levels.
. sysuse citytemp
. by region (division),sort: gen heat_Ind1 = heatdd > 8000
defines if each division, a subcategory under a region, has heating degree days larger than 8000.

. sysuse citytemp
. by region (division), sort: egen heat_Ind2 = max(heatdd > 8000)
defines if a region has divisions whose heating degree days are larger than 8000.

tabulate

We can also use tabulate var, generate(newvar) to create a series of indicator variables.
. tab foreign, gen(import) generates two new variables import1, indicating whether the car is domestic, and import2, indicating whether the car is foreign made.

More on creating indicator variables:
William Gould, StataCorp, How do I create dummy variables?

Factor variables

Factor variables create indicator variables from categorical variables.

The example below contains several factor variables:
. sysuse auto
. reg price mpg c.weight##c.weight ib3.rep78 i.foreign

c.weight##c.weight gives us the squared weight, in addition to the main effect of weight. ib3.rep78 sets the base value at rep78=3 and creates indicators at each value of rep78. i.foreign creates indicators at each value of foreign.

operators

i. indicates unique values/levels of a group
c. indicates a continuous variables
o. omits a variable or indicator
# specifies interactions
## specifies interactions including main effects

what factor variables actually mean

i.foreign
indicator variables for each level of foreign

i.foreign#i.rep78
indicator variables for all combinations of each value of foreign and rep78

i.foreign##i.rep78
the same as i.foreign i.rep78 i.foreign#i.rep78

i.foreign#i.rep78#i.make
indicator variables for all combinations of each value of foreign, rep78 and make (not saying i.make would make sense since it has 74 unique levels)

i.foreign##i.rep78##i.make
the same as i.foreign i.rep78 i.make i.foreign#i.rep78 i.rep78#i.make i.foreign#i.make i.foreign#i.rep78#i.make

i.rep78#c.mpg
variables created for the number of the levels of rep78. For each variable, it will be the value of mpg if at the level of rep78 and it will be 0 otherwise.

For instance, if the first observation has rep78=3 and mpg=22, then 3.rep78#c.mpg will be 22 and it will be 0 for 1b.rep78#c.mpg, 2.rep78#c.mpg, 4.rep78#c.mpg and 5.rep78#c.mpg.

selecting levels

3.rep78 at the level where rep78=3

i3.rep78 the same as above

i(2/4).rep78 selects the levels from rep78=2 through rep78=4

i(1 5).rep78 selects the levels where rep78=1 and rep78=5

o(1 5).rep78 omits the levels where rep78=1 and rep78=5

changing the base level

ib#.var changes the base level of the variable, where b is the marker indicating the base value.

For instance, ib3.rep78 sets the base value at rep78=3.

Subscripting

We can refer to observations by subscripting the variables. var[exp] does the explicit subscripting.

When the expressions are the internal variables _n and _N:
_n gives the number of current observations;
_N gives the total number of observations.
var[_n] refers to the current observation;
var[_N] refers to the last observation.

. sysuse auto
. sort price
. price[1] refers to the first observation of price and is now the lowest value of price after sorting.
. price[0] is missing since there is no such observation.
. price[_N] refers to the last observation of price and is now the largest value of price.

Subscripting can be useful in hierarchical data. We will see some cases below.

Suppose we have a dataset that has variables of companies, analyst ids, some event dates and times, analyst scores and ranks, ratings on companies etc.
.list in 1/10


     +------------------------------------------------------------------------------------------------------------+
     | company                               sector       id              datetime   rank   score   year   rating |
     |------------------------------------------------------------------------------------------------------------|
  1. |    0000                         Banks/Midcap    79092   2014.03.10 00:25:00      1   15.61   2013        3 |
  2. |    0001                       Integrated Oil   132083   2014.02.28 19:22:00      4    9.89   2013        2 |
  3. |    0001   Oil & Gas Exploration & Production    71894   2014.02.26 00:17:00      4    5.57   2013        3 |
  4. |    0001   Oil & Gas Exploration & Production     9928   2014.02.28 04:15:00      2    8.76   2013        3 |
  5. |    0001   Oil & Gas Exploration & Production   122739   2014.02.26 03:05:00      3    8.11   2013        3 |
     |------------------------------------------------------------------------------------------------------------|
  6. |    0001   Oil & Gas Exploration & Production   122739   2014.11.10 04:13:00      3    8.11   2014        2 |
  7. |    000R                             Internet   142116   2014.02.27 01:10:00      4    9.14   2013        3 |
  8. |    000R                             Internet   142116   2014.02.18 00:18:00      4    9.14   2013        2 |
  9. |    000R                             Internet    73111   2014.02.19 15:21:00      1   14.96   2013        2 |
 10. |    000R                             Internet    73111   2014.10.31 06:50:00      1   14.96   2013        1 |
     +------------------------------------------------------------------------------------------------------------+

Creating group identifier

. by company, sort: gen ingroup_id = _n
gives us a unique identifier to each observation within each company.

. list company id datetime ingroup_id in 1/6


     +---------------------------------------------------+
     | company       id              datetime   ingrou~d |
     |---------------------------------------------------|
  1. |    0000    79092   2014.03.10 00:25:00          1 |
  2. |    0001   132083   2014.02.28 19:22:00          1 |
  3. |    0001    71894   2014.02.26 00:17:00          2 |
  4. |    0001     9928   2014.02.28 04:15:00          3 |
  5. |    0001   122739   2014.02.26 03:05:00          4 |
     |---------------------------------------------------|
  6. |    0001   122739   2014.11.10 04:13:00          5 |
     +---------------------------------------------------+

First / last several cases within a group

Say we want to get the mean of the 3 most recent ratings by id and company:
. by id company (datetime), sort: gen rating_3rec_avg = (rating[1] + rating[2] + rating[3]) / 3

Alternatively, if we want to obtain the mean of the 3 most latest ratings:
. by id company (datetime), sort: gen rating_3last_avg = (rating[_N] + rating[_N-1] + rating[_N-2]) / 3

Checking if values are consistent within a group

If a group contains all the same values, then the first should be equal to the last one.
. by id company, sort: gen flag = rating[1] == rating[_N]

Checking if the pairs exist for t tests within a group

Say we want to perform t tests on each pair (1 vs 2; 2 vs 3 etc.) of ratings for each sector with year. Before we do that, we want to make sure that each pair exists. Otherwise Stata will throw a warning message at us saying only one group of the pair found.


foreach i in 1 2 3{
	//make sure the pairs for t tests exist
	gen rk`i' = .
	by sector year (rank), sort: ///
	replace rk`i' = rank if rank[1] == `i' & rank[_N] == `i'+1

	//create score_diff variables by sector*year
	egen group`i' = group(sector year) if !missing(rk`i')
	qui levelsof group`i', local(g)
	gen score_diff`i' = .
	foreach x of local g{
		qui estpost ttest score if group`i' == `x', by(rk`i')
		matrix b = e(b)
		replace score_diff`i' = b[1,1] if group`i' == `x'
	}

	//drop the temp vars
	drop rk`i' group`i'
}

Carrying forward non-missing values

Nicholas J. Cox, How can I replace missing values with previous or following nonmissing values or within sequences?

Lags and leads

Subscripting with _n and _N can be used to create lags and leads. [_n-1] refers to the previous observation; [_n+1] refers to the next observation. Use time-series operators L for lag and F for lead if you are dealing with time-series data.

Duplicates

Duplicates are observations with identical values. The duplicates commands provide a way to report on, give examples of, list, browse, tag, or drop duplicate observations.

duplicates report reports duplicates.

duplicates examples lists one example of the group of the duplicated observations.

duplicates list lists all duplicated observations.

duplicates tag, generate(var) creates a new variable var that gives the number of duplicates of each variable.

duplicates drop keeps only the first of the duplicates and drops the rest.

More examples on the duplicates commands and an alternative method of detecting duplicates:
UCLA: Statistical Consulting Group, How can I detect duplicate observations? | Stata FAQ

Creating Group Identifiers

egen group_id = group(old_group_var) creates a new group id with numeric values for the categorical variable.

Summary Statistics by Group

We have already introduced earlier several commands that produce summary statistics by groups. To summarize them below:

To calculate summary statistics:

tabulate group, summarize(var)

tabstat varlist, by(group) statistics (stat1 stat2…)

table row_var col_var super_col_var, by(group) contents(stat1 var stat2 var stat3 var…)

by group: tab var

To aggregate data to summary statistics:
collapse (stat1) varlist1 (stat2) varlist2…, by(group varlist)

Use egen:
by group: egen, function()

Spreading Rows in Hierarchical Data

fillin; expand

fillin

fillin varlist creates additional rows of observations by filling in all combinations of the specified variables. A new variable _fillin will be created automatically to indicate where the observations come from: 1 if the observations come from the original dataset, or 0 if they are filled in.

Suppose we have a dataset on a course.
. list


     +-----------------------------------+
     | id   course   placem~t   attend~e |
     |-----------------------------------|
  1. |  1        A          1          3 |
  2. |  1        B          2          4 |
  3. |  2        A          3          2 |
  4. |  2        C          1          3 |
  5. |  3        B          2          4 |
     |-----------------------------------|
  6. |  3        C          3          5 |
     +-----------------------------------+

. fillin id course adds observations from all combinations of id and course; missing values have been created where the person does not have a course record.
. list


     +---------------------------------------------+
     | id   course   placem~t   attend~e   _fillin |
     |---------------------------------------------|
  1. |  1        A          1          3         0 |
  2. |  1        B          2          4         0 |
  3. |  1        C          .          .         1 |
  4. |  2        A          3          2         0 |
  5. |  2        B          .          .         1 |
     |---------------------------------------------|
  6. |  2        C          1          3         0 |
  7. |  3        A          .          .         1 |
  8. |  3        B          2          4         0 |
  9. |  3        C          3          5         0 |
     +---------------------------------------------+

expand

expand [=]exp creates duplicates of each observation with n copies specified in the expression, where the original observation is kept and n-1 copies are created.

Case 1

Using the same data before fillin above:
. expand attendance makes duplicates of the observations by the number of attendance so that each person will have a record of each attendance of each course. We can then, for instance, add course performance data to each attendance.
. list
. sort id course


     +-----------------------------------+
     | id   course   placem~t   attend~e |
     |-----------------------------------|
  1. |  1        A          1          3 |
  2. |  1        A          1          3 |
  3. |  1        A          1          3 |
  4. |  1        B          2          4 |
  5. |  1        B          2          4 |
     |-----------------------------------|
  6. |  1        B          2          4 |
  7. |  1        B          2          4 |
  8. |  2        A          3          2 |
  9. |  2        A          3          2 |
 10. |  2        C          1          3 |
     |-----------------------------------|
 11. |  2        C          1          3 |
 12. |  2        C          1          3 |
 13. |  3        B          2          4 |
 14. |  3        B          2          4 |
 15. |  3        B          2          4 |
     |-----------------------------------|
 16. |  3        B          2          4 |
 17. |  3        C          3          5 |
 18. |  3        C          3          5 |
 19. |  3        C          3          5 |
 20. |  3        C          3          5 |
     |-----------------------------------|
 21. |  3        C          3          5 |
     +-----------------------------------+

expand [=]exp, generate(newvar) creates a new variable to indicate if the observations come from the existing dataset or if they are the expanded ones.

Case 2

We had a dataset with variables of companies, analyst ids, some event dates and times, analyst scores and ranks, ratings on companies etc.
.list in 1/10


     +------------------------------------------------------------------------------------------------------------+
     | company                               sector       id              datetime   rank   score   year   rating |
     |------------------------------------------------------------------------------------------------------------|
  1. |    0000                         Banks/Midcap    79092   2014.03.10 00:25:00      1   15.61   2013        3 |
  2. |    0001                       Integrated Oil   132083   2014.02.28 19:22:00      4    9.89   2013        2 |
  3. |    0001   Oil & Gas Exploration & Production    71894   2014.02.26 00:17:00      4    5.57   2013        3 |
  4. |    0001   Oil & Gas Exploration & Production     9928   2014.02.28 04:15:00      2    8.76   2013        3 |
  5. |    0001   Oil & Gas Exploration & Production   122739   2014.02.26 03:05:00      3    8.11   2013        3 |
     |------------------------------------------------------------------------------------------------------------|
  6. |    0001   Oil & Gas Exploration & Production   122739   2014.11.10 04:13:00      3    8.11   2014        2 |
  7. |    000R                             Internet   142116   2014.02.27 01:10:00      4    9.14   2013        3 |
  8. |    000R                             Internet   142116   2014.02.18 00:18:00      4    9.14   2013        2 |
  9. |    000R                             Internet    73111   2014.02.19 15:21:00      1   14.96   2013        2 |
 10. |    000R                             Internet    73111   2014.10.31 06:50:00      1   14.96   2013        1 |
     +------------------------------------------------------------------------------------------------------------+

We wanted to build models comparing results on ranks 1 versus 2, 2 versus 3, 3 versus the first runner-up, and the last runner-up versus the first non-runner-up. All sounded fine, until it occurred to us that there could be only one runner-up within a group (sector * year). In this case, we want to expand the groups where we only have one runner up, and recode it to rank 4 and 5; the first non-runner-up would be rank 6. Here is what we did.


gen rank2 = .
replace rank2 = rank if rank <= 3

//expand the RUs if we only have one rank == 4 
gen expd = .
by sector year rank (score), sort: replace expd = 1 if score[1] == score[_N] & rank == 4 
expand expd + 1 if expd == 1
sort sector year rank id score
sort _all
by sector year rank, sort: replace rank2 = 4 if mod(_n,2) == 1 & expd == 1
by sector year rank, sort: replace rank3 = 5 if mod(_n,2) == 0 & expd == 1

//1st RU, last RU
by sector year rank, sort: egen max = max(score) if rank == 4 & expd != 1
by sector year rank: replace rank2 = 4 if rank == 4 & score == max & expd != 1

by sector year rank, sort: egen min = min(score) if rank == 4 & expd != 1
by sector year rank: replace rank2 = 5 if rank == 4 & score == min & expd != 1

Case 3

An example of using fillin and expand to change time periods in panel data:
Social Science Computing Cooperative, UW-Madison, Stata Programming Techniques for Panel Data: Changing Time Periods

by prefix with sum(), max(), min(), mean() etc.

by prefix in combination with functions sum(), max(), min(), and mean() can serve many purposes and be quite helpful in handling hierarchical data.

Results Spreading with mean() in calculating summary statistics:
Social Science Computing Cooperative, UW-Madison, Stata for Researchers: Working with Groups

Another example on spreading results with sum() in creating group id:
Nicholas J. Cox and William Gould, How do I create individual identifiers numbered from 1 upwards?

Creating indicators with sum() to refer to locations of certain values of a variable:
Nicholas J. Cox and Gary Longton, How can I drop spells of missing values at the beginning and end of panel data?

Preparing Your Data II:More Tricks