We are moving everything to the new site. Please visit our new Stata page. This site will no longer be updated.
Common tasks with string variables include matching a pattern or searching a location in a string, parsing or subsetting a string, concatenating strings, converting a string to a number etc.
These are the three functions that use regular expressions to perform matching. Regular expression is a method that allows for systematic searching, matching and replacing within strings using operators and letters.
Read how Stata’s regular expressions work:
Kevin S. Turner, What are regular expressions and how can I use them in Stata?
regexm(s,re) performs matching on the string s by regular expression re. If the matching is successful, it returns 1; otherwise it returns 0.
For instance, gen flag = regexm(id, "[^0-9 .]") marks observations that contain numeric characters. Then if we want to drop cases with non-numeric characters, we can easily do that.
regexr(s1,re,s2) replaces the string s1 that matches the regular expression re with s2.
. gen id2 = regexr(id, "[^0-9 .]", "marked") generates a new variable id2. Cases with all numeric characters have been replaced by "marked".
regexs(n) must appear after regexm(s,re) and returns the nth substring from what is previously matched by regexm(s,re).
. sysuse auto
. gen make2 = regexs(1) if regexm(make, "([a-zA-Z]+)([ ]|[.])*([a-zA-Z]+)") takes out the car make, the first part from its make and model combination in make strings.
. list make make2 in 5/15
+------------------------+
| make make2 |
|------------------------|
5. | Buick Electra Buick |
6. | Buick LeSabre Buick |
7. | Buick Opel Buick |
8. | Buick Regal Buick |
9. | Buick Riviera Buick |
|------------------------|
10. | Buick Skylark Buick |
11. | Cad. Deville Cad |
12. | Cad. Eldorado Cad |
13. | Cad. Seville Cad |
14. | Chev. Chevette Chev |
|------------------------|
15. | Chev. Impala Chev |
+------------------------+
In this example above:
[] matches at least one of the values in the bracket;
a-zA-Z matches any value of the letters, uppercase or lowercase; 0-9, similarly, will match any numbers in the specified numeric range;
* matches zero or more of the preceding expression;
| is the logical operator “or”.
More examples on the three functions:
UCLA: Statistical Consulting Group, How can I extract a portion of a string variable using regular expressions? | Stata FAQ
strpos(s1, s2) returns the position of s2 where first found in s1;
strrpos(s1, s2) returns the position of s2 where last found in s1;
when positions are not found they return 0.
. sysuse auto
. gen position = strpos(make,".") creates the new variable position that gives us the position of “.” in make.
. list make position in 5/15
+---------------------------+
| make position |
|---------------------------|
5. | Buick Electra 0 |
6. | Buick LeSabre 0 |
7. | Buick Opel 0 |
8. | Buick Regal 0 |
9. | Buick Riviera 0 |
|---------------------------|
10. | Buick Skylark 0 |
11. | Cad. Deville 4 |
12. | Cad. Eldorado 4 |
13. | Cad. Seville 4 |
14. | Chev. Chevette 5 |
|---------------------------|
15. | Chev. Impala 5 |
+---------------------------+
split splits string variables by separators into several components, and generates new string variables for each component taken out from the original string. The default separator is the space.
The parse() option allows one to specify the separator. The generate() option allows one to change the prefix of the new variables to be generated.
. sysuse auto
. split make, p("." " ") gen(substr) splits make by “.” and space into four new variables substr1-substr4.
. list make substr1-substr4 in 10/20
+------------------------------------------------------------+
| make substr1 substr2 substr3 substr4 |
|------------------------------------------------------------|
10. | Buick Skylark Buick Skylark |
11. | Cad. Deville Cad Deville |
12. | Cad. Eldorado Cad Eldorado |
13. | Cad. Seville Cad Seville |
14. | Chev. Chevette Chev Chevette |
|------------------------------------------------------------|
15. | Chev. Impala Chev Impala |
16. | Chev. Malibu Chev Malibu |
17. | Chev. Monte Carlo Chev Monte Carlo |
18. | Chev. Monza Chev Monza |
19. | Chev. Nova Chev Nova |
|------------------------------------------------------------|
20. | Dodge Colt Dodge Colt |
+------------------------------------------------------------+
substr(s, n1, n2) extracts the substring of s from n1 for the length of n2.
Let's say we have a dataset with a string variable time and we want to extract the date and time components to the seconds.
. list time in 1/5
+----------------------------------+
| time |
|----------------------------------|
1. | 2016-03-31T04:20:00.122823+00:00 |
2. | 2016-03-31T04:24:25.447985+00:00 |
3. | 2016-03-31T04:24:25.683670+00:00 |
4. | 2016-03-31T04:24:27.338571+00:00 |
5. | 2016-03-31T04:24:27.552821+00:00 |
+----------------------------------+
. gen clock = substr(time, 1, 19) generates a new variable clock with the first 19 characters of time. From there we can convert it to a date and time variable.
. list clock in 1/5
+---------------------+
| clock |
|---------------------|
1. | 2016-03-31T04:20:00 |
2. | 2016-03-31T04:24:25 |
3. | 2016-03-31T04:24:25 |
4. | 2016-03-31T04:24:27 |
5. | 2016-03-31T04:24:27 |
+---------------------+
subinstr(s1, s2, s3, n) replaces the first n occurrences in s1 of s2 with s3. If n is missing, all occurrences are replaced.
Here we have a dataset with a variable rank:
. tab rank
rank | Freq. Percent Cum.
------------------------------+-----------------------------------
1 | 506 10.44 10.44
1 (1T) | 262 5.40 15.84
1(1T) | 122 2.52 18.36
1(1T) | 72 1.49 19.84
2 | 509 10.50 30.34
2 (2T) | 262 5.40 35.75
2(2T) | 122 2.52 38.26
(2T) | 72 1.49 39.75
3 | 503 10.38 50.12
3 (3T) | 260 5.36 55.49
3(3T) | 121 2.50 57.98
(3T) | 72 1.49 59.47
4 | 442 9.12 68.58
4 (RU) | 228 4.70 73.29
4(RU) | 107 2.21 75.50
4(RU) | 63 1.30 76.79
5 | 324 6.68 83.48
5 (RU) | 162 3.34 86.82
5(RU) | 82 1.69 88.51
5(RU) | 46 0.95 89.46
......
We want to remove all the spaces, (1T), (2T), (3T) and (RU). Here is what we would do:
local n (1T) (2T) (3T) (RU) " "
foreach x in `n'{
replace rank = subinstr(rank,"`x'", "",.)
}
. tab rank
rank | Freq. Percent Cum.
-------------+-----------------------------------
1 | 962 19.84 19.84
2 | 965 19.91 39.75
3 | 956 19.72 59.47
4 | 840 17.33 76.79
5 | 614 12.67 89.46
6 | 333 6.87 96.33
7 | 127 2.62 98.95
8 | 44 0.91 99.86
9 | 7 0.14 100.00
-------------+-----------------------------------
Total | 4,848 100.00
The four functions trim the strings by removing the spaces.
strtrim(s) removes the leading or trailing spaces.
e.g. strtrim(“ nyush ”) = “nyush”
stritrim(s) removes the multiple internal spaces.
e.g. stritrim(“nyu     sh”) = “nyu sh”
strltrim(s) removes the leading spaces.
e.g. strltrim(“ nyush”) = “nyush”
strrtrim(s) removes the trailing spaces.
e.g. strrtrim(“nyush ”) = “nyush”
string(n) and real(s) are two string functions that convert numeric/string to string/numeric variables.
string(n) is a synonym for strofreal(n) and converts numeric or missing values to strings.
. sysuse auto
. gen rep78_str = string(rep78) converts the numeric and missing values of rep78 to strings.
real(s) converts strings to numeric or missing values. The string variable must contain number characters, otherwise missing values will be generated.
. gen rep78_str_num = real(rep78_str) converts the string variable rep78_str created above by string(n) back to numeric again.
. gen make_real = real(make) will create the new variable make_real with all missing values.
See also the destring/tostring commands. Note that real()/string() are functions and must be used in conjunction with a Stata command.
egen newvar, concat(varlist) concatenates a variable list and generates a new string variable.
. sysuse auto
. egen make_foreign = concat(foreign make)
Or simply use + to concatenate several strings when the variables are already strings.
. gen make_foreign2 = string(foreign) + make
A quick review of data storage types and display format before we start.
Dates and times usually come in the human readable string forms, such as “March 22, 2017 16:15 pm”, “2017.03.22 16:15” etc. But Stata internally stores dates and times as integers and reads them as numeric values.
In fact, Stata understands a date and time variable as the difference from the base.The base (the numeric value 0) of a datetime variable begins at 01jan1960 00:00:00.000 (the first millisecond of 01jan1960); therefore “25jan2016 08:30:25” for us human beings will be 1769329825000 (milliseconds) for Stata.
Specifying the display format allows us to choose the output of the date/time variable.
. format timevar %td sets timevar as a date variable. Here timevar is already a numeric date and time variable.
. format timevar %tcHH:MM:SS further sets timevar to be displayed as hour (00-23) : minute (00-59) :second (00-60).
We will see more example below.
There are multiple ways to display a year, month, week, day, hour, minute, second etc. For instance, there are three additional codes to display the hour: Hh (00-12), hH (0-23), or hh (0-12).
Check the Stata help pages for the full list and the default display formats.
Base | Display format | Storage type | String-to-numeric function | |
---|---|---|---|---|
datetime (assuming 86400 s/day) |
01jan1960 | %tc | must be double | clock() |
datetime (equivalent to UTC) |
01jan1960 | %tC | must be double | Clock() |
date | 01jan1960 | %td | may be float or long | date() |
weekly date | 1960 week 1 | %tw | may be float or int | weekly() |
monthly date | jan1960 | %tm | may be float or int | monthly() |
quarterly date | 1960 quarter 1 | %tq | may be float or int | quarterly() |
half-yearly date | 1960 half-year 1 | %th | may be float or int | halfyearly() |
yearly date | 0 A.D | %ty | may be float or int | yearly() |
*Scroll left and right if the table exceeds the screen on mobile devices.
*Adapted from tables in Working with dates and times and datetime – Date and time values and variables.
In the last column of the table above we have introduced the functions function(string, mask[,topyear]) that transform strings to the numeric date and time variables.
In those functions, mask specifies the order of the components appearing in the string.
Mask | |
---|---|
M | month |
D | day |
Y | 4-digit year |
19Y | 2-digit year to be interpreted as 19xx |
20Y | 2-digit year to be interpreted as 20xx |
h | hour of day |
m | minutes within hour |
s | seconds within minute |
# | ignore one element |
*Scroll left and right if the table exceeds the screen on mobile devices.
Source: datetime translation – String to numeric date translation functions
Here are some examples of the strings and their corresponding masks:
datetime | mask |
---|---|
“25jan2016 08:30:25” | “DMYhms” |
“2016-01-25 08:30:25” | “YMDhms” |
“16-01-25 08:30” | “20YMDhm” |
“08:30:25 UTC 01252016 ” | “hms#MDY” |
*Scroll left and right if the table exceeds the screen on mobile devices.
Suppose we have the following dataset and the string variable anntims:
. list in 1/5
+----------------------+
| anntims anndats |
|----------------------|
1. | 00:46:00 25feb2008 |
2. | 00:46:00 25feb2008 |
3. | 00:46:00 25feb2008 |
4. | 00:46:00 25feb2008 |
5. | 00:46:00 25feb2008 |
+----------------------+
We want to generate a new time variable from it:
. gen double anntims2 = clock(anntims,"hms")
. format anntims2 %tcHH:MM:SS
. list anntims2 in 1/5
+----------+
| anntims2 |
|----------|
1. | 00:46:00 |
2. | 00:46:00 |
3. | 00:46:00 |
4. | 00:46:00 |
5. | 00:46:00 |
+----------+
We also want to create a new variable with both date and time from the variables anntims (string) and anndats (numeric).
. gen double anndatims = dhms(anndats,hh(anntims2),mm(anntims2),ss(anntims2))
. format anndatims %tcCCYY.NN.DD_HH:MM:SS
. list anndatims in 1/5
+---------------------+
| anndatims |
|---------------------|
1. | 2008.02.25 00:46:00 |
2. | 2008.02.25 00:46:00 |
3. | 2008.02.25 00:46:00 |
4. | 2008.02.25 00:46:00 |
5. | 2008.02.25 00:46:00 |
+---------------------+
When working with the two-digit years, in addition to specifying the 19Y or 20Y masks, we can specify the topyear option in date and time function(string, mask, topyear). The full year to be returned will be the year that does not exceed the largest value of the specified topyear.
This is useful when we want to interpret the first two digits of the year components differently of a variable.
For instance, suppose we have a date variable date (string). We want to obtain the full years in the range of 1949-2048 and, based on the knowledge of the data, would like to interpret digits equal to or larger than 49 as from the years 19** and others as from 20**:
. list in 1/5
+--------+
| date |
|--------|
1. | 470505 |
2. | 480404 |
3. | 490101 |
4. | 500202 |
5. | 510303 |
+--------+
. gen float fulldate = date(date, "YMD", 2048)
. format fulldate %td
. list in 1/5
+--------------------+
| date fulldate |
|--------------------|
1. | 470505 05may2047 |
2. | 480404 04apr2048 |
3. | 490101 01jan1949 |
4. | 500202 02feb1950 |
5. | 510303 03mar1951 |
+--------------------+
The functions below extract the components from the date and time variables.
time-of-day component | function |
---|---|
hour of day | hh(tc) or hhC(tC) |
minutes of day | mm(tc) or mmC(tC) |
seconds of day | ss(tc) or ssC(tC) |
date component | function |
---|---|
calendar year | year(td) |
calendar month | month(td) |
calendar day | day(td) |
day of week (0=Sunday) | dow(td) |
day of year (1=first day) | doy(td) |
week within year (1=first week) | week(td) |
quarter within year (1=first quarter) | quarter(td) |
half within year (1=first half) | halfyear(td) |
*Scroll left and right if the table exceeds the screen on mobile devices.
*Adapted from tables in datetime – Date and time values and variables
Suppose we want to extract the year from the variable anndats we've seen earlier:
+----------------------+
| anntims anndats |
|----------------------|
1. | 00:46:00 25feb2008 |
2. | 00:46:00 25feb2008 |
3. | 00:46:00 25feb2008 |
4. | 00:46:00 25feb2008 |
5. | 00:46:00 25feb2008 |
+----------------------+
. gen annyrs = year(anndats)
. list annyrs in 1/5
+--------+
| annyrs |
|--------|
1. | 2008 |
2. | 2008 |
3. | 2008 |
4. | 2008 |
5. | 2008 |
+--------+
Suppose we want to extract the minutes from the time variable anntims2:
+----------+
| anntims2 |
|----------|
1. | 00:46:00 |
2. | 00:46:00 |
3. | 00:46:00 |
4. | 00:46:00 |
5. | 00:46:00 |
+----------+
. gen mins = mm(anntims2)
+------+
| mins |
|------|
1. | 46 |
2. | 46 |
3. | 46 |
4. | 46 |
5. | 46 |
+------+
Since Stata internally uses the difference from the base to read the dates and times, calculation of durations is simply addition or subtraction of the date and time units.
. list time in 1/5
+----------------------------------+
| time |
|----------------------------------|
1. | 2016-03-31T04:20:00.122823+00:00 |
2. | 2016-03-31T04:24:25.447985+00:00 |
3. | 2016-03-31T04:24:25.683670+00:00 |
4. | 2016-03-31T04:24:27.338571+00:00 |
5. | 2016-03-31T04:24:27.552821+00:00 |
+----------------------------------+
Here are the codes to get them:
. gen datetime = substr(time,1,19)
. gen microsec = substr(time,21,6)
. destring microsec, replace
. replace microsec = microsec/1000
. gen double timestamp = clock(datetime, "YMD#hms") + microsec
. format timestamp %tcCCYY.NN.DD_HH:MM:SS.sss
. gen duration = timestamp - timestamp[_n-1]
. gen minutes = minutes(duration)