We can refer to observations by subscripting the variables. var[exp] does the explicit subscripting.
When the expressions are the internal variables _n and _N:
_n gives the number of current observations;
_N gives the total number of observations.
var[_n] refers to the current observation;
var[_N] refers to the last observation.
. sysuse auto
. sort price
. price[1] refers to the first observation of price and is now the lowest value of price after sorting.
. price[0] is missing since there is no such observation.
. price[_N] refers to the last observation of price and is now the largest value of price.
Subscripting can be useful in hierarchical data. We will see some cases below.
Suppose we have a dataset with variables of companies, analyst ids, some event dates and times, analyst scores and ranks, ratings on companies etc.
.list in 1/10
+------------------------------------------------------------------------------------------------------------+
| company sector id datetime rank score year rating |
|------------------------------------------------------------------------------------------------------------|
1. | 0000 Banks/Midcap 79092 2014.03.10 00:25:00 1 15.61 2013 3 |
2. | 0001 Integrated Oil 132083 2014.02.28 19:22:00 4 9.89 2013 2 |
3. | 0001 Oil & Gas Exploration & Production 71894 2014.02.26 00:17:00 4 5.57 2013 3 |
4. | 0001 Oil & Gas Exploration & Production 9928 2014.02.28 04:15:00 2 8.76 2013 3 |
5. | 0001 Oil & Gas Exploration & Production 122739 2014.02.26 03:05:00 3 8.11 2013 3 |
|------------------------------------------------------------------------------------------------------------|
6. | 0001 Oil & Gas Exploration & Production 122739 2014.11.10 04:13:00 3 8.11 2014 2 |
7. | 000R Internet 142116 2014.02.27 01:10:00 4 9.14 2013 3 |
8. | 000R Internet 142116 2014.02.18 00:18:00 4 9.14 2013 2 |
9. | 000R Internet 73111 2014.02.19 15:21:00 1 14.96 2013 2 |
10. | 000R Internet 73111 2014.10.31 06:50:00 1 14.96 2013 1 |
+------------------------------------------------------------------------------------------------------------+
. by company, sort: gen ingroup_id = _n
gives us a unique identifier to each observation within each company.
. list company id datetime ingroup_id in 1/6
+---------------------------------------------------+
| company id datetime ingrou~d |
|---------------------------------------------------|
1. | 0000 79092 2014.03.10 00:25:00 1 |
2. | 0001 132083 2014.02.28 19:22:00 1 |
3. | 0001 71894 2014.02.26 00:17:00 2 |
4. | 0001 9928 2014.02.28 04:15:00 3 |
5. | 0001 122739 2014.02.26 03:05:00 4 |
|---------------------------------------------------|
6. | 0001 122739 2014.11.10 04:13:00 5 |
+---------------------------------------------------+
Say we want to get the mean of the 3 most recent ratings by id and company:
. by id company (datetime), sort: gen rating_3rec_avg = (rating[1] + rating[2] + rating[3]) / 3
Alternatively, if we want to obtain the mean of the 3 most latest ratings:
. by id company (datetime), sort: gen rating_3last_avg = (rating[_N] + rating[_N-1] + rating[_N-2]) / 3
If a group contains all the same values, then the first should be equal to the last one.
. by id company, sort: gen flag = rating[1] == rating[_N]
Say we want to perform t tests on each pair (1 vs 2; 2 vs 3 etc.) of ratings for each sector with year. Before we do that, we want to make sure that each pair exists. Otherwise Stata will throw a warning message at us saying only one group of the pair found.
If a pair exists, then the first obsrvation should be of value i and the last of value i + 1 (i.e. rank[1] == `i' & rank[_N] == `i'+1 ).
foreach i in 1 2 3{
//make sure the pairs for t tests exist
gen rk`i' = .
by sector year (rank), sort: ///
replace rk`i' = rank if rank[1] == `i' & rank[_N] == `i'+1
//create score_diff variables by sector*year
egen group`i' = group(sector year) if !missing(rk`i')
qui levelsof group`i', local(g)
gen score_diff`i' = .
foreach x of local g{
qui estpost ttest score if group`i' == `x', by(rk`i')
matrix b = e(b)
replace score_diff`i' = b[1,1] if group`i' == `x'
}
//drop the temp vars
drop rk`i' group`i'
}
Subscripting with _n and _N can be used to create lags and leads. [_n-1] refers to the previous observation; [_n+1] refers to the next observation. Use time-series operators L for lag and F for lead if you are dealing with time-series data.