Subscripting

We can refer to observations by subscripting the variables. var[exp] does the explicit subscripting.

When the expressions are the internal variables _n and _N:
_n gives the number of current observations;
_N gives the total number of observations.
var[_n] refers to the current observation;
var[_N] refers to the last observation.

. sysuse auto
. sort price
. price[1] refers to the first observation of price and is now the lowest value of price after sorting.
. price[0] is missing since there is no such observation.
. price[_N] refers to the last observation of price and is now the largest value of price.

Cases

Subscripting can be useful in hierarchical data. We will see some cases below.

Suppose we have a dataset with variables of companies, analyst ids, some event dates and times, analyst scores and ranks, ratings on companies etc.
.list in 1/10


     +------------------------------------------------------------------------------------------------------------+
     | company                               sector       id              datetime   rank   score   year   rating |
     |------------------------------------------------------------------------------------------------------------|
  1. |    0000                         Banks/Midcap    79092   2014.03.10 00:25:00      1   15.61   2013        3 |
  2. |    0001                       Integrated Oil   132083   2014.02.28 19:22:00      4    9.89   2013        2 |
  3. |    0001   Oil & Gas Exploration & Production    71894   2014.02.26 00:17:00      4    5.57   2013        3 |
  4. |    0001   Oil & Gas Exploration & Production     9928   2014.02.28 04:15:00      2    8.76   2013        3 |
  5. |    0001   Oil & Gas Exploration & Production   122739   2014.02.26 03:05:00      3    8.11   2013        3 |
     |------------------------------------------------------------------------------------------------------------|
  6. |    0001   Oil & Gas Exploration & Production   122739   2014.11.10 04:13:00      3    8.11   2014        2 |
  7. |    000R                             Internet   142116   2014.02.27 01:10:00      4    9.14   2013        3 |
  8. |    000R                             Internet   142116   2014.02.18 00:18:00      4    9.14   2013        2 |
  9. |    000R                             Internet    73111   2014.02.19 15:21:00      1   14.96   2013        2 |
 10. |    000R                             Internet    73111   2014.10.31 06:50:00      1   14.96   2013        1 |
     +------------------------------------------------------------------------------------------------------------+

Creating group identifier

. by company, sort: gen ingroup_id = _n
gives us a unique identifier to each observation within each company.

. list company id datetime ingroup_id in 1/6


     +---------------------------------------------------+
     | company       id              datetime   ingrou~d |
     |---------------------------------------------------|
  1. |    0000    79092   2014.03.10 00:25:00          1 |
  2. |    0001   132083   2014.02.28 19:22:00          1 |
  3. |    0001    71894   2014.02.26 00:17:00          2 |
  4. |    0001     9928   2014.02.28 04:15:00          3 |
  5. |    0001   122739   2014.02.26 03:05:00          4 |
     |---------------------------------------------------|
  6. |    0001   122739   2014.11.10 04:13:00          5 |
     +---------------------------------------------------+

First / last several cases within a group

Say we want to get the mean of the 3 most recent ratings by id and company:
. by id company (datetime), sort: gen rating_3rec_avg = (rating[1] + rating[2] + rating[3]) / 3

Alternatively, if we want to obtain the mean of the 3 most latest ratings:
. by id company (datetime), sort: gen rating_3last_avg = (rating[_N] + rating[_N-1] + rating[_N-2]) / 3

Checking if values are consistent within a group

If a group contains all the same values, then the first should be equal to the last one.
. by id company, sort: gen flag = rating[1] == rating[_N]

Checking if the pairs exist for t tests within a group

Say we want to perform t tests on each pair (1 vs 2; 2 vs 3 etc.) of ratings for each sector with year. Before we do that, we want to make sure that each pair exists. Otherwise Stata will throw a warning message at us saying only one group of the pair found.

If a pair exists, then the first obsrvation should be of value i and the last of value i + 1 (i.e. rank[1] == `i' & rank[_N] == `i'+1 ).


foreach i in 1 2 3{
	//make sure the pairs for t tests exist
	gen rk`i' = .
	by sector year (rank), sort: ///
	replace rk`i' = rank if rank[1] == `i' & rank[_N] == `i'+1

	//create score_diff variables by sector*year
	egen group`i' = group(sector year) if !missing(rk`i')
	qui levelsof group`i', local(g)
	gen score_diff`i' = .
	foreach x of local g{
		qui estpost ttest score if group`i' == `x', by(rk`i')
		matrix b = e(b)
		replace score_diff`i' = b[1,1] if group`i' == `x'
	}

	//drop the temp vars
	drop rk`i' group`i'
}

Carrying forward non-missing values

Nicholas J. Cox, How can I replace missing values with previous or following nonmissing values or within sequences?

Lags and leads

Subscripting with _n and _N can be used to create lags and leads. [_n-1] refers to the previous observation; [_n+1] refers to the next observation. Use time-series operators L for lag and F for lead if you are dealing with time-series data.