We are moving everything to the new site. Please visit our new Stata page. This site will no longer be updated.

String Variables

Common tasks with string variables include matching a pattern or searching a location in a string, parsing or subsetting a string, concatenating strings, converting a string to a number etc.

Read the Stata manual for a full list of string functions

Back to top

Matching and searching

regexm(); regexr(); regexs()

These are the three functions that use regular expressions to perform matching. Regular expression is a method that allows for systematic searching, matching and replacing within strings using operators and letters.

Read how Stata’s regular expressions work:
Kevin S. Turner, What are regular expressions and how can I use them in Stata?

Back to top

regexm()

regexm(s,re) performs matching on the string s by regular expression re. If the matching is successful, it returns 1; otherwise it returns 0.

For instance, gen flag = regexm(id, "[^0-9 .]") marks observations that contain numeric characters. Then if we want to drop cases with non-numeric characters, we can easily do that.

Back to top

regexr()

regexr(s1,re,s2) replaces the string s1 that matches the regular expression re with s2.

. gen id2 = regexr(id, "[^0-9 .]", "marked") generates a new variable id2. Cases with all numeric characters have been replaced by "marked".

Back to top

regexs()

regexs(n) must appear after regexm(s,re) and returns the nth substring from what is previously matched by regexm(s,re).

. sysuse auto
. gen make2 = regexs(1) if regexm(make, "([a-zA-Z]+)([ ]|[.])*([a-zA-Z]+)")
takes out the car make, the first part from its make and model combination in make strings.

. list make make2 in 5/15


     +------------------------+
     | make             make2 |
     |------------------------|
  5. | Buick Electra    Buick |
  6. | Buick LeSabre    Buick |
  7. | Buick Opel       Buick |
  8. | Buick Regal      Buick |
  9. | Buick Riviera    Buick |
     |------------------------|
 10. | Buick Skylark    Buick |
 11. | Cad. Deville       Cad |
 12. | Cad. Eldorado      Cad |
 13. | Cad. Seville       Cad |
 14. | Chev. Chevette    Chev |
     |------------------------|
 15. | Chev. Impala      Chev |
     +------------------------+
			

In this example above:
[] matches at least one of the values in the bracket;
a-zA-Z matches any value of the letters, uppercase or lowercase; 0-9, similarly, will match any numbers in the specified numeric range;
* matches zero or more of the preceding expression;
| is the logical operator “or”.


More examples on the three functions:
UCLA: Statistical Consulting Group, How can I extract a portion of a string variable using regular expressions? | Stata FAQ

Back to top

strpos(); strrpos()

strpos(s1, s2) returns the position of s2 where first found in s1;
strrpos(s1, s2) returns the position of s2 where last found in s1
when positions are not found they return 0.

. sysuse auto
. gen position = strpos(make,".")
creates the new variable position that gives us the position of “.” in make.

. list make position in 5/15


     +---------------------------+
     | make             position |
     |---------------------------|
  5. | Buick Electra           0 |
  6. | Buick LeSabre           0 |
  7. | Buick Opel              0 |
  8. | Buick Regal             0 |
  9. | Buick Riviera           0 |
     |---------------------------|
 10. | Buick Skylark           0 |
 11. | Cad. Deville            4 |
 12. | Cad. Eldorado           4 |
 13. | Cad. Seville            4 |
 14. | Chev. Chevette          5 |
     |---------------------------|
 15. | Chev. Impala            5 |
     +---------------------------+
			

Back to top

Parsing and extraction

split

split splits string variables by separators into several components, and generates new string variables for each component taken out from the original string. The default separator is the space.

The parse() option allows one to specify the separator. The generate() option allows one to change the prefix of the new variables to be generated.

. sysuse auto
. split make, p("." " ") gen(substr)
splits make by “.” and space into four new variables substr1-substr4.

. list make substr1-substr4 in 10/20


     +------------------------------------------------------------+
     | make                substr1   substr2    substr3   substr4 |
     |------------------------------------------------------------|
 10. | Buick Skylark         Buick   Skylark                      |
 11. | Cad. Deville            Cad              Deville           |
 12. | Cad. Eldorado           Cad             Eldorado           |
 13. | Cad. Seville            Cad              Seville           |
 14. | Chev. Chevette         Chev             Chevette           |
     |------------------------------------------------------------|
 15. | Chev. Impala           Chev               Impala           |
 16. | Chev. Malibu           Chev               Malibu           |
 17. | Chev. Monte Carlo      Chev                Monte     Carlo |
 18. | Chev. Monza            Chev                Monza           |
 19. | Chev. Nova             Chev                 Nova           |
     |------------------------------------------------------------|
 20. | Dodge Colt            Dodge      Colt                      |
     +------------------------------------------------------------+
			

Back to top

substr()

substr(s, n1, n2) extracts the substring of s from n1 for the length of n2.

Let's say we have a dataset with a string variable time and we want to extract the date and time components to the seconds.
. list time in 1/5


     +----------------------------------+
     |                             time |
     |----------------------------------|
  1. | 2016-03-31T04:20:00.122823+00:00 |
  2. | 2016-03-31T04:24:25.447985+00:00 |
  3. | 2016-03-31T04:24:25.683670+00:00 |
  4. | 2016-03-31T04:24:27.338571+00:00 |
  5. | 2016-03-31T04:24:27.552821+00:00 |
     +----------------------------------+
			

. gen clock = substr(time, 1, 19) generates a new variable clock with the first 19 characters of time. From there we can convert it to a date and time variable.
. list clock in 1/5


     +---------------------+
     |               clock |
     |---------------------|
  1. | 2016-03-31T04:20:00 |
  2. | 2016-03-31T04:24:25 |
  3. | 2016-03-31T04:24:25 |
  4. | 2016-03-31T04:24:27 |
  5. | 2016-03-31T04:24:27 |
     +---------------------+
			

Back to top

subinstr()

subinstr(s1, s2, s3, n) replaces the first n occurrences in s1 of s2 with s3. If n is missing, all occurrences are replaced.

Here we have a dataset with a variable rank:
. tab rank


               rank |      Freq.     Percent        Cum.
------------------------------+-----------------------------------
                  1 |        506       10.44       10.44
             1 (1T) |        262        5.40       15.84
              1(1T) |        122        2.52       18.36
            1(1T)   |         72        1.49       19.84
                  2 |        509       10.50       30.34
             2 (2T) |        262        5.40       35.75
              2(2T) |        122        2.52       38.26
             (2T)   |         72        1.49       39.75
                  3 |        503       10.38       50.12
             3 (3T) |        260        5.36       55.49
              3(3T) |        121        2.50       57.98
             (3T)   |         72        1.49       59.47
                  4 |        442        9.12       68.58
             4 (RU) |        228        4.70       73.29
              4(RU) |        107        2.21       75.50
             4(RU)  |         63        1.30       76.79
                  5 |        324        6.68       83.48
             5 (RU) |        162        3.34       86.82
              5(RU) |         82        1.69       88.51
            5(RU)   |         46        0.95       89.46

      ......
			

We want to remove all the spaces, (1T), (2T), (3T) and (RU). Here is what we would do:


local n (1T) (2T) (3T) (RU) " "
foreach x in `n'{
	replace rank = subinstr(rank,"`x'", "",.) 
}			
		

. tab rank


        rank |      Freq.     Percent        Cum.
-------------+-----------------------------------
           1 |        962       19.84       19.84
           2 |        965       19.91       39.75
           3 |        956       19.72       59.47
           4 |        840       17.33       76.79
           5 |        614       12.67       89.46
           6 |        333        6.87       96.33
           7 |        127        2.62       98.95
           8 |         44        0.91       99.86
           9 |          7        0.14      100.00
-------------+-----------------------------------
       Total |      4,848      100.00			
		

Back to top

egen newvar = ends()

See egen.

Back to top

strtrim(); stritrim();strltrim();strrtrim()

The four functions trim the strings by removing the spaces.

strtrim(s) removes the leading or trailing spaces.
e.g. strtrim(“ nyush ”) = “nyush”

stritrim(s) removes the multiple internal spaces.
e.g. stritrim(“nyu     sh”) = “nyu sh”

strltrim(s) removes the leading spaces.
e.g. strltrim(“ nyush”) = “nyush”

strrtrim(s) removes the trailing spaces.
e.g. strrtrim(“nyush ”) = “nyush”

Back to top

Conversion

string(n) and real(s) are two string functions that convert numeric/string to string/numeric variables.

string()

string(n) is a synonym for strofreal(n) and converts numeric or missing values to strings.
. sysuse auto
. gen rep78_str = string(rep78)
converts the numeric and missing values of rep78 to strings.

Back to top

real()

real(s) converts strings to numeric or missing values. The string variable must contain number characters, otherwise missing values will be generated.
. gen rep78_str_num = real(rep78_str) converts the string variable rep78_str created above by string(n) back to numeric again.
. gen make_real = real(make) will create the new variable make_real with all missing values.


See also the destring/tostring commands. Note that real()/string() are functions and must be used in conjunction with a Stata command.

Back to top

Concatenation

egen newvar, concat(varlist) concatenates a variable list and generates a new string variable.
. sysuse auto
. egen make_foreign = concat(foreign make)

Or simply use + to concatenate several strings when the variables are already strings.
. gen make_foreign2 = string(foreign) + make

Back to top

Dates and Times

Storage types and display formats

A quick review of data storage types and display format before we start.

Storage type

Dates and times usually come in the human readable string forms, such as “March 22, 2017 16:15 pm”, “2017.03.22 16:15” etc. But Stata internally stores dates and times as integers and reads them as numeric values.

In fact, Stata understands a date and time variable as the difference from the base.The base (the numeric value 0) of a datetime variable begins at 01jan1960 00:00:00.000 (the first millisecond of 01jan1960); therefore “25jan2016 08:30:25” for us human beings will be 1769329825000 (milliseconds) for Stata.

Back to top

Display format

Specifying the display format allows us to choose the output of the date/time variable.
. format timevar %td sets timevar as a date variable. Here timevar is already a numeric date and time variable.
. format timevar %tcHH:MM:SS further sets timevar to be displayed as hour (00-23) : minute (00-59) :second (00-60).

We will see more example below.

There are multiple ways to display a year, month, week, day, hour, minute, second etc. For instance, there are three additional codes to display the hour: Hh (00-12), hH (0-23), or hh (0-12).

Check the Stata help pages for the full list and the default display formats.

Base Display format Storage type String-to-numeric function
datetime
(assuming 86400 s/day)
01jan1960 %tc must be double clock()
datetime
(equivalent to UTC)
01jan1960 %tC must be double Clock()
date 01jan1960 %td may be float or long date()
weekly date 1960 week 1 %tw may be float or int weekly()
monthly date jan1960 %tm may be float or int monthly()
quarterly date 1960 quarter 1 %tq may be float or int quarterly()
half-yearly date 1960 half-year 1 %th may be float or int halfyearly()
yearly date 0 A.D %ty may be float or int yearly()

*Scroll left and right if the table exceeds the screen on mobile devices.

*Adapted from tables in Working with dates and times and datetime – Date and time values and variables.

More on the syntax of display formats for dates and times from the Stata data management manual

Back to top

Conversion

In the last column of the table above we have introduced the functions function(string, mask[,topyear]) that transform strings to the numeric date and time variables.

In those functions, mask specifies the order of the components appearing in the string.

Mask
M month
D day
Y 4-digit year
19Y 2-digit year to be interpreted as 19xx
20Y 2-digit year to be interpreted as 20xx
h hour of day
m minutes within hour
s seconds within minute
# ignore one element

*Scroll left and right if the table exceeds the screen on mobile devices.

Source: datetime translation – String to numeric date translation functions

Here are some examples of the strings and their corresponding masks:

datetime mask
“25jan2016 08:30:25” “DMYhms”
“2016-01-25 08:30:25” “YMDhms”
“16-01-25 08:30” “20YMDhm”
“08:30:25 UTC 01252016 ” “hms#MDY”

*Scroll left and right if the table exceeds the screen on mobile devices.

Suppose we have the following dataset and the string variable anntims:
. list in 1/5


     +----------------------+
     |  anntims     anndats |
     |----------------------|
  1. | 00:46:00   25feb2008 |
  2. | 00:46:00   25feb2008 |
  3. | 00:46:00   25feb2008 |
  4. | 00:46:00   25feb2008 |
  5. | 00:46:00   25feb2008 |
     +----------------------+		
		

We want to generate a new time variable from it:
. gen double anntims2 = clock(anntims,"hms")
. format anntims2 %tcHH:MM:SS
. list anntims2 in 1/5


     +----------+
     | anntims2 |
     |----------|
  1. | 00:46:00 |
  2. | 00:46:00 |
  3. | 00:46:00 |
  4. | 00:46:00 |
  5. | 00:46:00 |
     +----------+			
		

We also want to create a new variable with both date and time from the variables anntims (string) and anndats (numeric).
. gen double anndatims = dhms(anndats,hh(anntims2),mm(anntims2),ss(anntims2))
. format anndatims %tcCCYY.NN.DD_HH:MM:SS
. list anndatims in 1/5


     +---------------------+
     |           anndatims |
     |---------------------|
  1. | 2008.02.25 00:46:00 |
  2. | 2008.02.25 00:46:00 |
  3. | 2008.02.25 00:46:00 |
  4. | 2008.02.25 00:46:00 |
  5. | 2008.02.25 00:46:00 |
     +---------------------+		
		

Back to top

Working with two-digit-years

When working with the two-digit years, in addition to specifying the 19Y or 20Y masks, we can specify the topyear option in date and time function(string, mask, topyear). The full year to be returned will be the year that does not exceed the largest value of the specified topyear.

This is useful when we want to interpret the first two digits of the year components differently of a variable.

For instance, suppose we have a date variable date (string). We want to obtain the full years in the range of 1949-2048 and, based on the knowledge of the data, would like to interpret digits equal to or larger than 49 as from the years 19** and others as from 20**:
. list in 1/5


     +--------+
     |   date |
     |--------|
  1. | 470505 |
  2. | 480404 |
  3. | 490101 |
  4. | 500202 |
  5. | 510303 |
     +--------+			
		

. gen float fulldate = date(date, "YMD", 2048)
. format fulldate %td
. list in 1/5


     +--------------------+
     |   date    fulldate |
     |--------------------|
  1. | 470505   05may2047 |
  2. | 480404   04apr2048 |
  3. | 490101   01jan1949 |
  4. | 500202   02feb1950 |
  5. | 510303   03mar1951 |
     +--------------------+			
		

More on string to numeric date translation functions from the Stata data management manual

Back to top

Extraction

The functions below extract the components from the date and time variables.

time-of-day component function
hour of day hh(tc) or hhC(tC)
minutes of day mm(tc) or mmC(tC)
seconds of day ss(tc) or ssC(tC)
date component function
calendar year year(td)
calendar month month(td)
calendar day day(td)
day of week (0=Sunday) dow(td)
day of year (1=first day) doy(td)
week within year (1=first week) week(td)
quarter within year (1=first quarter) quarter(td)
half within year (1=first half) halfyear(td)

*Scroll left and right if the table exceeds the screen on mobile devices.

*Adapted from tables in datetime – Date and time values and variables

Suppose we want to extract the year from the variable anndats we've seen earlier:


     +----------------------+
     |  anntims     anndats |
     |----------------------|
  1. | 00:46:00   25feb2008 |
  2. | 00:46:00   25feb2008 |
  3. | 00:46:00   25feb2008 |
  4. | 00:46:00   25feb2008 |
  5. | 00:46:00   25feb2008 |
     +----------------------+		
		

. gen annyrs = year(anndats)
. list annyrs in 1/5


     +--------+
     | annyrs |
     |--------|
  1. |   2008 |
  2. |   2008 |
  3. |   2008 |
  4. |   2008 |
  5. |   2008 |
     +--------+			
		

Suppose we want to extract the minutes from the time variable anntims2:


     +----------+
     | anntims2 |
     |----------|
  1. | 00:46:00 |
  2. | 00:46:00 |
  3. | 00:46:00 |
  4. | 00:46:00 |
  5. | 00:46:00 |
     +----------+			
		
. gen mins = mm(anntims2)
. list mins in 1/5

     +------+
     | mins |
     |------|
  1. |   46 |
  2. |   46 |
  3. |   46 |
  4. |   46 |
  5. |   46 |
     +------+			
		

Back to top

Calculation

Since Stata internally uses the difference from the base to read the dates and times, calculation of durations is simply addition or subtraction of the date and time units.

. list time in 1/5


     +----------------------------------+
     |                             time |
     |----------------------------------|
  1. | 2016-03-31T04:20:00.122823+00:00 |
  2. | 2016-03-31T04:24:25.447985+00:00 |
  3. | 2016-03-31T04:24:25.683670+00:00 |
  4. | 2016-03-31T04:24:27.338571+00:00 |
  5. | 2016-03-31T04:24:27.552821+00:00 |
     +----------------------------------+
			

Here are the codes to get them:
. gen datetime = substr(time,1,19)
. gen microsec = substr(time,21,6)
. destring microsec, replace
. replace microsec = microsec/1000
. gen double timestamp = clock(datetime, "YMD#hms") + microsec
. format timestamp %tcCCYY.NN.DD_HH:MM:SS.sss
. gen duration = timestamp - timestamp[_n-1]
. gen minutes = minutes(duration)

Read the full documentation on date and time values and variables from the Stata data management manual

Back to top