Preparing Your Data I: The Essentials

Managing Your Data

generate; replace; recode; autocode

The four commands are often used to create or change the contents of variables.

generate; replace

generate newvar = exp creates the new variable from existing variables through an expression.
. sysuse auto
. gen price1 = price/2.5

replace changes the contents of a variable. It can be used in combination with generate to recode the newly created variable from the existing numeric variable.
. gen price2 = price
. replace price2 = price*0.8 if foreign == 1

recode

recode var (rule)..., generate(newvar) changes the contents of numeric variables, usually to create categorical variables.

. sysuse auto
. recode rep78 (min/2 = 1 good) (3 = 2 fair) (4 5 = 3 poor) (missing = .), gen(rep78_scale) label(repair_record_scale)

Some notes for recode:

ranges

Values specified in the () includes the two boundaries.

Numbers to the left of = are values to be recoded, while the number following = is the new value to be assigned.

/ means “through”.

values

min refers to the smallest value; max refers to the largest value.

missing indicates all missing values; nonmissing for all nonmissing values; and else for both missing and non-missing values.

If there are values unassigned, they will be taken to the new variable as they are.

value labels

Contents following the assigned value is the value label (e.g. good).

label() gives a name to the new value label.

To check if the transformations have worked as we would like them to be, it is always a good idea to cross-tabulate the newly defined variables and the variables created from.
. tab rep78 rep78_scale

		
    Repair |  RECODE of rep78 (Repair Record 1978)
    Record |              
      1978 |      good       fair       poor |     Total
-----------+---------------------------------+----------
         1 |         2          0          0 |         2 
         2 |         8          0          0 |         8 
         3 |         0         30          0 |        30 
         4 |         0          0         18 |        18 
         5 |         0          0         11 |        11 
-----------+---------------------------------+----------
     Total |        10         30         29 |        69

autocode

Another way to convert numeric variables to categorical or factor variables is to use autocode, the automated version of recode.

generate newvar = autocode(var, n, a, b) recodes the numeric variable var into a categorical variable newvar with n equal-length intervals; a and b are the two boundaries where a is inclusive.

. sysuse auto
. sort price
. gen price3 = autocode(price,3,price[1],price[_N])
recodes the numeric variable price into a variable of three categories ranging from its minimum to the maximum value. price[1] and price[_N] are explicit subscripting that tells Stata where to refer to.

label variable; rename; notes

These three commands provide additional information to a variable. We have met the first two in the previous chapter, but for convenience let’s review them altogether here.

label variable varname “label” attaches a label to a variable.
. label variable rep78 “repair_record_78”

rename oldvar newvar renames the variable name.
. rename rep78 repair_record

notes: varname adds notes to a variable.
. notes rep78: check the missing values

label define; label values; label list; label drop

label define and label values attach value labels to a numeric variable in two steps:

label define labelname # label # label … first creates a label name for all the labels we are going to attach to the variable values .
label values var labelname then matches the labels to the values.

For instance,
. sysuse auto
. gen priceIndicator = (price >= 5000 & price < .)
. label define priceLabel 0 affordable 1 expensive
. label values priceIndicator priceLabel
first creates an indicator variable based on an arbitrary standard of affordability (price<5000), and then attaches the labels of “affordable” to the numeric value 0 and “expensive” to 1.

. codebook priceIndicator


------------------------------------------------------------------------------------------
priceIndicator                                                                 (unlabeled)
------------------------------------------------------------------------------------------

                  type:  numeric (float)
                 label:  priceLabel

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/74

            tabulation:  Freq.   Numeric  Label
                            37         0  affordable
                            37         1  expensive

We can see how each value of priceIndicator is labeled. The label name is "priceLabel".

Once we got the value label name, we can list the labels for each value of the variable.

label list labelname returns a list.
. label list priceLabel


priceLabel:
           0 affordable
           1 expensive

In fact, in the auto data, variable foreign came with attached value lables.

. codebook foreign


--------------------------------------------------------------------------------------
foreign                                                                       Car type
--------------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  origin

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/74

            tabulation:  Freq.   Numeric  Label
                            52         0  Domestic
                            22         1  Foreign

We can see the label name is "origin", and the value labels are "Domestic" and "Foreign".

. label list origin


origin:
           0 Domestic
           1 Foreign

label drop labelname removes the value labels from a variable.
. label drop priceLabel

encode; decode

encode var, gen(newvar) creates a numeric variable newvar by assigning a numeric value to each of var's string value. It provides a mapping between the numeric value and the value label.

. sysuse auto
. encode make, gen(make_num) creates the numeric variable make_num.
. label list make_num gives us the mapping between the numeric values and the labels of the newly defined numeric variable make_num.


make_num:
           1 AMC Concord
           2 AMC Pacer
           3 AMC Spirit
           4 Audi 5000
           5 Audi Fox
           6 BMW 320i
           7 Buick Century
           8 Buick Electra
           9 Buick LeSabre
          10 Buick Opel
          11 Buick Regal
          12 Buick Riviera
          13 Buick Skylark
          14 Cad. Deville
          15 Cad. Eldorado
          16 Cad. Seville
          17 Chev. Chevette
          18 Chev. Impala
          19 Chev. Malibu
          20 Chev. Monte Carlo
          21 Chev. Monza
          22 Chev. Nova
          23 Datsun 200
          24 Datsun 210
          25 Datsun 510
          26 Datsun 810
          27 Dodge Colt
          28 Dodge Diplomat
          29 Dodge Magnum
          30 Dodge St. Regis
          31 Fiat Strada
          32 Ford Fiesta
          33 Ford Mustang
          34 Honda Accord
          35 Honda Civic
          36 Linc. Continental
          37 Linc. Mark V
          38 Linc. Versailles
          39 Mazda GLC
          40 Merc. Bobcat
          41 Merc. Cougar
          42 Merc. Marquis
          43 Merc. Monarch
          44 Merc. XR-7
          45 Merc. Zephyr
          46 Olds 98
          47 Olds Cutl Supr
          48 Olds Cutlass
          49 Olds Delta 88
          50 Olds Omega
          51 Olds Starfire
          52 Olds Toronado
          53 Peugeot 604
          54 Plym. Arrow
          55 Plym. Champ
          56 Plym. Horizon
          57 Plym. Sapporo
          58 Plym. Volare
          59 Pont. Catalina
          60 Pont. Firebird
          61 Pont. Grand Prix
          62 Pont. Le Mans
          63 Pont. Phoenix
          64 Pont. Sunbird
          65 Renault Le Car
          66 Subaru
          67 Toyota Celica
          68 Toyota Corolla
          69 Toyota Corona
          70 VW Dasher
          71 VW Diesel
          72 VW Rabbit
          73 VW Scirocco
          74 Volvo 260

decode var, gen(newvar) works the other way around. It creates a string variable newvar using the value labels attached earlier to the numeric variable var.
. decode foreign, gen(foreign_str) creates a string variable that has two values “Domestic” and “Foreign”.
. codebook foreign_str


-----------------------------------------------------------------------------------
foreign_str                                                                Car type
-----------------------------------------------------------------------------------

                  type:  string (str8)

         unique values:  2                        missing "":  0/74

            tabulation:  Freq.  Value
                            52  "Domestic"
                            22  "Foreign"

Compare with
. codebook foreign


-----------------------------------------------------------------------------------
foreign                                                                    Car type
-----------------------------------------------------------------------------------

                  type:  numeric (byte)
                 label:  origin

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/74

            tabulation:  Freq.   Numeric  Label
                            52         0  Domestic
                            22         1  Foreign

destring; tostring

destring/tostring, replace or destring/tostring, generate() converts string variables to numeric variables and vice versa.

. tostring make_num, gen(make_str) converts the numeric variable make_num previously encoded from make back to a string variable but with values in numbers rather than alphabetical letters.
. codebook make_str


-----------------------------------------------------------------------------------
make_str                                                             Make and Model
-----------------------------------------------------------------------------------

                  type:  string (str2)

         unique values:  74                       missing "":  0/74

              examples:  "21"
                         "35"
                         "49"
                         "62"

. destring make_str, replace converts make_str to a numeric variable.
. codebook make_str


-----------------------------------------------------------------------------------
make_str                                                             Make and Model
-----------------------------------------------------------------------------------

                  type:  numeric (byte)

                 range:  [1,74]                       units:  1
         unique values:  74                       missing .:  0/74

                  mean:      37.5
              std. dev:   21.5058

           percentiles:        10%       25%       50%       75%       90%
                                 8        19      37.5        56        67

Summary

label define creates labels to each value of a numeric variable.

encode works on strings. It creates a new variable, and assigns a numeric value to each of the string value.
decode works on numeric variables that has attached value labels. It creates a new string variable that gets its variable values from the previous label values.

destring works on string variables that contain numbers stored as strings, and converts them to numbers.
tostring converts numeric variables to strings.

real()/string() functions convert strings/numeric variables between each other, but cannot be used just by themselves.

assert

assert verifies if an expression is true. It returns no output if the statement is indeed true.

. egen car_space = rowmean(headroom length)
. gen car_space2 = (headroom+length)/2
both have created an arbitrary measure for car space using the mean of headroom and car length. More on comparison of the two methods where we discuss rowmean().

. assert car_space == car_space2 examines if the two methods yield the same results.

keep; drop

keep and drop work in the same way: they keep or drop variables or observations.

keep/drop var subsets data by variables.
. sysuse auto
. keep price

keep/drop if exp subsets observations by conditions.
. drop if price < 5000

keep/drop in range subsets observations within certain rows.
. keep in 10/20

sort; gsort; order

sort/gsort arranges the observations in ascending order.
. sysuse auto
. sort make

gsort -var can arrange the observations in descending order.

. gsort mpg sorts mpg in ascending order.
. gsort -mpg price sorts mpg in descending order and price in ascending order.

order arranges the order of variables.

order varlist(, first) or order varlist, last moves the variables to the first or last column.
. order trunk turn, last moves the two variables trunk and turn to the last two columns.

order varlist, before(var) or order varlist, after(var) moves the variables before or after a certain variable.
. order trunk turn, before(mpg) moves trunk and turn before mpg.

append; merge; joinby

append, merge and joinby bind the datasets by attaching the “using dataset” stored on disk to the “master dataset” stored in memory.

append

append binds datasets vertically by adding observations from the using dataset to the master dataset.

The datasets should contain mostly the same variables; absent variables from one dataset will be given missing values in the combined file. append applies, for instance, when we want to combine datasets of different times or regions with mostly the same variables.

. use file1
. list


     +--------------------------------------+
     | id   year   family     role   rating |
     |--------------------------------------|
  1. |  1   1995        1    child        A |
  2. |  2   1995        1    child        B |
  3. |  3   1995        1   mother        B |
  4. |  1   1995        2   father        C |
  5. |  2   1995        2   mother        B |
     |--------------------------------------|
  6. |  3   1995        2    child        A |
     +--------------------------------------+

. use file2, clear
. list


     +--------------------------------------+
     | id   year   family     role   rating |
     |--------------------------------------|
  1. |  1   2005        1    child        A |
  2. |  2   2005        1    child        A |
  3. |  3   2005        1   mother        B |
  4. |  1   2005        2   father        B |
  5. |  2   2005        2   mother        C |
     |--------------------------------------|
  6. |  3   2005        2    child        B |
     +--------------------------------------+

. append using file1
. list


     +--------------------------------------+
     | id   year   family     role   rating |
     |--------------------------------------|
  1. |  1   2005        1    child        A |
  2. |  2   2005        1    child        A |
  3. |  3   2005        1   mother        B |
  4. |  1   2005        2   father        B |
  5. |  2   2005        2   mother        C |
     |--------------------------------------|
  6. |  3   2005        2    child        B |
  7. |  1   1995        1    child        A |
  8. |  2   1995        1    child        B |
  9. |  3   1995        1   mother        B |
 10. |  1   1995        2   father        C |
     |--------------------------------------|
 11. |  2   1995        2   mother        B |
 12. |  3   1995        2    child        A |
     +--------------------------------------+

. append using file
adds the observations from file to the master file.

. append using file, keep (var1 var2 var3)
appends the file file and keeps only the three specified variables in the combined dataset.

. append using file1 file2, generate (filenum)
appends file1 and file2 to the master file and assigns a number to the observations to indicate the source file.

We can also create an id for each source file by ourselves that are meaningful to the datasets before appending the datasets.
. use file, clear
. generate str4 source = ”2016”
. save fileb
. use masterfile, clear
. generate str4 source = ”2015”
. append using fileb

. append using file1b file2b
appends file1 and file2 to the master file.

What if we have a long list of files to append to the master file? We will offer a method to automate the appending that greatly saves our time in the automation chapter.

merge

merge combines datasets horizontally by adding variables to the existing dataset.

The datasets are combined by one or more key variables. The key variables must have the same names and data type (numeric or string), otherwise we will receive an error message.

If one dataset contains more observations than the other, missing values will be generated where not matched.

types of merges

There are three types of match merges that we frequently use.

merge 1:1 varlist using filename
In one-to-one merge, we use a unique key variable in both the master and the using datasets to merge files. In both files there should be only one unique value for the key variable.

merge 1:m varlist using filename
merge m:1 varlist using filename
In one-to-many and many-to-one merges, one observation from one dataset is matched to many observations in the other dataset. If the master dataset has many observations to match with the single observation in the using dataset, we use m:1; or we use 1:m if it is the using dataset that has many to match.

One-to-many or many-to-one merges are most frequently met when dealing with hierarchical data.

. use file5, clear
. list


     +---------------------+
     | family   area   SES |
     |---------------------|
  1. |      1      N     1 |
  2. |      2      S     2 |
     +---------------------+

. use file1, clear
. list


     +--------------------------------------+
     | id   year   family     role   rating |
     |--------------------------------------|
  1. |  1   1995        1    child        A |
  2. |  2   1995        1    child        B |
  3. |  3   1995        1   mother        B |
  4. |  1   1995        2   father        C |
  5. |  2   1995        2   mother        B |
     |--------------------------------------|
  6. |  3   1995        2    child        A |
     +--------------------------------------+

. merge m:1 family using file5


Result                           # of obs.
-----------------------------------------
not matched                             0
matched                                 6  (_merge==3)
-----------------------------------------

. list


   +-----------------------------------------------------------------+
   | id   year   family     role   rating   area   SES        _merge |
   |-----------------------------------------------------------------|
1. |  1   1995        1    child        A      N     1   matched (3) |
2. |  2   1995        1    child        B      N     1   matched (3) |
3. |  3   1995        1   mother        B      N     1   matched (3) |
4. |  1   1995        2   father        C      S     2   matched (3) |
5. |  2   1995        2   mother        B      S     2   matched (3) |
   |-----------------------------------------------------------------|
6. |  3   1995        2    child        A      S     2   matched (3) |
   +-----------------------------------------------------------------+

The fourth type, many-to-many merges, can yield random results and be problematic. We will not discuss it here.

multiple key variables

We can merge files using more than one key variable when a single variable is not adequate to match the two datasets.

. use file3, clear
. list


     +----------------------+
     | id   family     role |
     |----------------------|
  1. |  1        1    child |
  2. |  2        1    child |
  3. |  3        1   mother |
  4. |  1        2   father |
  5. |  2        2   mother |
     |----------------------|
  6. |  3        2    child |
     +----------------------+

. use file4, clear
. list


     +-----------------------------+
     | id   year   family   rating |
     |-----------------------------|
  1. |  1   1995        1        A |
  2. |  2   1995        1        B |
  3. |  3   1995        1        B |
  4. |  1   1995        2        C |
  5. |  2   1995        2        B |
     |-----------------------------|
  6. |  3   1995        2        A |
     +-----------------------------+

. merge 1:1 id family using file3


Result                           # of obs.
-----------------------------------------
not matched                             0
matched                                 6  (_merge==3)
-----------------------------------------

. list


   +----------------------------------------------------+
   | id   year   family   rating     role        _merge |
   |----------------------------------------------------|
1. |  1   1995        1        A    child   matched (3) |
2. |  1   1995        2        C   father   matched (3) |
3. |  2   1995        1        B    child   matched (3) |
4. |  2   1995        2        B   mother   matched (3) |
5. |  3   1995        1        B   mother   matched (3) |
   |----------------------------------------------------|
6. |  3   1995        2        A    child   matched (3) |
   +----------------------------------------------------+

_merge

After the merge the _merge variable will be automatically created. It tells us how the observations have been matched. The _merge variable, in the simple cases, has three values:
_merge == 1, observation appeared in the master file only
_merge == 2, observation appeared in the using file only
_merge == 3, observation appeared in both files

The _merge variable must be dropped or renamed before we perform the next merge.

tab _merge to find out how your merge went.

Type help merge for more merge options.

joinby

joinby forms all pairwise combinations within groups.

. use file6, clear
. list


   +---------------------------+
   | industry   company   rank |
   |---------------------------|
1. |        A         a      1 |
2. |        A         b      2 |
3. |        A         c      3 |
4. |        B         l      4 |
5. |        B         m      5 |
   |---------------------------|
6. |        B         n      6 |
   +---------------------------+

. use file7, clear
. list


   +-----------------+
   | industry   year |
   |-----------------|
1. |        A   1995 |
2. |        A   2005 |
3. |        B   1995 |
4. |        B   2005 |
   +-----------------+

. use file6, clear
. joinby industry using file7
. list


     +----------------------------------+
     | industry   company   rank   year |
     |----------------------------------|
  1. |        A         a      1   1995 |
  2. |        A         a      1   2005 |
  3. |        A         b      2   1995 |
  4. |        A         b      2   2005 |
  5. |        A         c      3   1995 |
     |----------------------------------|
  6. |        A         c      3   2005 |
  7. |        B         l      4   2005 |
  8. |        B         l      4   1995 |
  9. |        B         m      5   1995 |
 10. |        B         m      5   2005 |
     |----------------------------------|
 11. |        B         n      6   1995 |
 12. |        B         n      6   2005 |
     +----------------------------------+

For the unmatched observations from the datasets, the default is to be ignored. But you can also set them to be kept in different ways by the unmatched(both/master/using) option.

reshape; collapse

reshape

reshape long varlist, i(i) j(j) or reshape wide varlist, i(i) j(j) converts data from long to wide and vice versa.

i is the id variable of a higher level;
j is the id variable of a lower level;
wide means reshaping to wide;
long means reshaping to long.

There are more observations in the long format and more variables in the wide format. In the long format, variable values of the lower levels are grouped under values of a higher level. In the wide format, each column represents variable values of lower levels.

. use long
. list


     +------------------------------------------+
     | id   semester   course    gpa   attend~e |
     |------------------------------------------|
  1. |  1          1        A   3.81          1 |
  2. |  1          1        B   3.82          2 |
  3. |  1          2        A   3.76          3 |
  4. |  1          2        B   3.77          4 |
  5. |  2          1        A   3.56          4 |
     |------------------------------------------|
  6. |  2          1        B   3.55          3 |
  7. |  2          2        A   3.45          2 |
  8. |  2          2        B   3.47          1 |
     +------------------------------------------+

. reshape wide gpa attendance, i(id semester) j(course) string converts the dataset from long to wide.


(note: j = A B)

Data                               long   ->   wide
-----------------------------------------------------------------------------
Number of obs.                        8   ->       4
Number of variables                   5   ->       6
j variable (2 values)            course   ->   (dropped)
xij variables:
                                    gpa   ->   gpaA gpaB
                             attendance   ->   attendanceA attendanceB
-----------------------------------------------------------------------------

. list


     +---------------------------------------------------+
     | id   semester   gpaA   attend~A   gpaB   attend~B |
     |---------------------------------------------------|
  1. |  1          1   3.81          1   3.82          2 |
  2. |  1          2   3.76          3   3.77          4 |
  3. |  2          1   3.56          4   3.55          3 |
  4. |  2          2   3.45          2   3.47          1 |
     +---------------------------------------------------+

. reshape long changes it back to the long format.


(note: j = A B)

Data                               wide   ->   long
-----------------------------------------------------------------------
Number of obs.                        4   ->       8
Number of variables                   6   ->       5
j variable (2 values)                     ->   course
xij variables:
                              gpaA gpaB   ->   gpa
                attendanceA attendanceB   ->   attendance
-----------------------------------------------------------------------

collapse

collapse (stat1) varlist1 (stat2) varlist2…, by(group varlist) aggregates the dataset to summary statistics. stat options include mean, median, percentiles, standard deviations, standard errors, first/last values, maximum/minimum etc.

. use long
. collapse (mean) gpa, by(semester id)
returns the mean of gpa of each id by semester.
. list


     +-----------------------+
     | id   semester     gpa |
     |-----------------------|
  1. |  1          1   3.815 |
  2. |  2          1   3.555 |
  3. |  1          2   3.765 |
  4. |  2          2    3.46 |
     +-----------------------+

save; export

See Reading and Saving Files of Various Types.