In this post, we briefly discuss string manipulation, focusing on pattern searching, matching, and replacement.

We first introduce three groups of functions:

  • grep() and grepl() for pattern searching and matching
  • sub() and gsub() for pattern matching and replacement
  • stringr functions to
    • detect strings
    • count the number of matches in strings
    • locate strings
    • extract strings
    • match strings
    • split strings

Then we show an example of string manipulation.


Regular expressions

There is a group of base R functions that search, match, and replace strings based on a pattern. grep(), grepl(), regexpr(), gregexpr() and regexec() search for matches of a pattern within each element of a character vector. sub() and gsub() replace the first match and all matches of a pattern, respectively.
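
As a small illustration of the position-reporting functions named above, the sketch below uses regexpr() to find where the first run of digits starts in each string, and the base R helper regmatches() to pull out the matched text. The two sentences are reused from the examples in this post; the variable names are only for illustration.

```r
x <- c("NYU Shanghai was founded in 2012.",
       "NYU Shanghai is China's first Sino-US research university.")

# regexpr() returns the start of the first match in each element (-1 if none)
m <- regexpr("[0-9]+", x)
start <- as.integer(m)                 # 29 -1
match_len <- attr(m, "match.length")   # 4 -1

# regmatches() returns the matched text, skipping elements without a match
first_number <- regmatches(x, m)       # "2012"
```

Note that regmatches() silently drops the elements with no match, so its result can be shorter than the input.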

A pattern is described by a regular expression, which is a sequence of characters that specifies a search pattern. Regular expressions are constructed analogously to arithmetic expressions, by using various operators to combine smaller expressions. Such patterns are typically used by string-searching algorithms for “find” or “find and replace” operations on strings. Use ?"regular expression" to learn about regular expressions as used in R.

As a simple case, suppose we write a program that checks whether a password entered by a user satisfies the rules below. Regular expressions in their most basic form are enough for this.

  • At least 1 letter between a-z and 1 letter between A-Z.
  • At least 1 number between 0-9.
  • At least 1 character from $#@.
  • Minimum length 6 characters.
  • Maximum length 16 characters.

For instance, to check whether the input password contains at least one letter in the range a-z, we can use grepl("[a-z]", password).
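
Putting all of the rules together might look like the sketch below. The function name is_valid_password() is hypothetical and not part of any package, and the function assumes it receives a single string.

```r
# Hypothetical helper: checks one password against the rules listed above
is_valid_password <- function(password) {
  grepl("[a-z]", password) &&      # at least one lowercase letter
    grepl("[A-Z]", password) &&    # at least one uppercase letter
    grepl("[0-9]", password) &&    # at least one digit
    grepl("[$#@]", password) &&    # at least one character from $#@
    nchar(password) >= 6 &&        # minimum length
    nchar(password) <= 16          # maximum length
}

is_valid_password("ABd1234@1")  # TRUE
is_valid_password("abc123")     # FALSE: no uppercase letter, no $#@
```

The length limits are checked with nchar() rather than with a regular expression, which keeps each rule on its own line.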


grep(), grepl()

We start with the pair of functions grep(pattern, x) and grepl(pattern, x).

grep() and grepl() search for matches of a pattern in a character vector.

Both functions need a pattern and an x argument. pattern is a character string containing a regular expression to be matched in the given character vector. x is the character vector from which matches are sought.


grep() returns an integer vector, which refers to the elements in the character vector that contain a match.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
# letters
grep("[a-zA-Z]", string)
## [1] 1 2 3
# numbers
grep("[0-9]", string)
## [1] 1 3
# characters from $#@
grep("[$#@]", string)
## integer(0)
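
If the matching elements themselves are wanted rather than their positions, grep() accepts value = TRUE. The short sketch below uses abbreviated versions of the sentences above; indexing the vector with the positions gives the same result.

```r
string <- c("NYU Shanghai was founded in 2012.",
            "NYU Shanghai is China's first Sino-US research university.",
            "Of the class of 294 students, 51% came from China.")

# value = TRUE returns the matching elements instead of their indices
with_digits <- grep("[0-9]", string, value = TRUE)

# Equivalent: use the returned indices to subset the vector
same_result <- string[grep("[0-9]", string)]
```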

grepl() returns a logical vector with TRUE or FALSE, and indicates which elements of the character vector contain a match. It returns TRUE when a pattern is found in the string.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
# letters
grepl("[a-zA-Z]", string)
## [1] TRUE TRUE TRUE
# numbers
grepl("[0-9]", string)
## [1]  TRUE FALSE  TRUE
# characters from $#@
grepl("[$#@]", string)
## [1] FALSE FALSE FALSE

sub(), gsub()

The second pair is sub(pattern, replacement, x) and gsub(pattern, replacement, x).

These functions search a character vector for matches, and replace the substrings where a pattern is matched.

Argument x is a character vector where matches are sought. Elements of character vectors x that are not substituted will be returned unchanged.

sub() replaces only the first occurrence of a pattern. gsub() replaces all occurrences. Compare the results from sub() and gsub() below.

string <- c("A_a_1", "B_b_2", "C_c_3")
sub("_",".", string)
## [1] "A.a_1" "B.b_2" "C.c_3"
gsub("_",".", string)
## [1] "A.a.1" "B.b.2" "C.c.3"
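
One thing to watch when the roles are reversed: a period is a regular expression metacharacter that matches any character. The sketch below, on made-up strings, shows two ways to replace literal periods, either by escaping the period or by passing fixed = TRUE so the pattern is treated as plain text.

```r
string <- c("A.a.1", "B.b.2")

# Unescaped "." matches every character, so everything is replaced
all_replaced <- gsub(".", "_", string)           # "_____" "_____"

# Escape the period, or treat the pattern as a literal string
escaped <- gsub("\\.", "_", string)              # "A_a_1" "B_b_2"
literal <- gsub(".", "_", string, fixed = TRUE)  # "A_a_1" "B_b_2"
```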

In the next example, we remove the prefixes from the Open variables. The regular expression "^.*\\." matches everything from the start of the string up to and including a period; because .* is greedy, that is the last period in the string. Setting replacement to "" deletes whatever the pattern matches.

string <- c("AAPL.Open", "AFL.Open", "MMM.Open")
sub("^.*\\.","", string)
## [1] "Open" "Open" "Open"
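
The greedy behaviour of .* only matters when a name contains more than one period. In the sketch below, "BRK.B.Open" is a hypothetical variable name used to show the difference between removing everything up to the last period and removing only the part up to the first one.

```r
string <- c("AAPL.Open", "BRK.B.Open")

# Greedy: ".*" runs to the last period in the string
up_to_last <- sub("^.*\\.", "", string)      # "Open" "Open"

# "[^.]*" cannot cross a period, so the match stops at the first one
up_to_first <- sub("^[^.]*\\.", "", string)  # "Open" "B.Open"
```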

stringr package

The stringr package provides pattern matching functions for common string manipulation tasks: detecting, locating, extracting, matching, replacing, and splitting strings. The package is part of the tidyverse and is built on top of the stringi package.

library(stringr)

Each stringr pattern matching function has the same first two arguments, a character vector of strings to process (string) and a single pattern to match (pattern).


detecting strings

str_detect(string, pattern) detects the presence or absence of a pattern in a string and returns a logical vector, similar to grepl().

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
# letters
str_detect(string, "[a-zA-Z]")
## [1] TRUE TRUE TRUE
# numbers
str_detect(string, "[0-9]")
## [1]  TRUE FALSE  TRUE
# characters from $#@
str_detect(string, "[$#@]")
## [1] FALSE FALSE FALSE

counting the number of matches in strings

str_count(string, pattern) counts the number of matches in a string.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")
# letters
str_count(string, "[a-zA-Z]")
## [1]  23  48 107
# numbers
str_count(string, "[0-9]")
## [1] 4 0 7
# characters from $#@
str_count(string, "[$#@]")
## [1] 0 0 0
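
For comparison, a rough base R counterpart of str_count() can be assembled from gregexpr() and regmatches(). This is a sketch of one possible approach, not how stringr implements it.

```r
string <- c("NYU Shanghai was founded in 2012.",
            "NYU Shanghai is China's first Sino-US research university.",
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# gregexpr() finds all matches; regmatches() turns them into a list of
# character vectors (character(0) where nothing matched)
digit_matches <- regmatches(string, gregexpr("[0-9]", string))

# The number of matches per string is the length of each list element
digit_counts <- lengths(digit_matches)  # 4 0 7
```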

locating strings

str_locate_all(string, pattern) locates the positions of all matches in a string. It returns a list of integer matrices. In each of these matrices, the first column gives start positions of matches, and the second column gives their end positions.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_locate_all(string, "NYU")
## [[1]]
##      start end
## [1,]     1   3
## 
## [[2]]
##      start end
## [1,]     1   3
## 
## [[3]]
##      start end

extracting strings

str_extract_all(string, pattern) extracts all matches and returns a list of character vectors.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_extract_all(string, "[0-9]")
## [[1]]
## [1] "2" "0" "1" "2"
## 
## [[2]]
## character(0)
## 
## [[3]]
## [1] "2" "9" "4" "5" "1" "4" "9"

str_extract(string, pattern) extracts text corresponding to the first match, and returns a character vector.

str_extract(string, "[0-9]")
## [1] "2" NA  "2"
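
The pattern "[0-9]" matches one digit at a time, which is why the results above are single characters. Adding the quantifier + makes each match a whole run of digits. The sketch below assumes stringr is installed and reuses the same three sentences.

```r
library(stringr)

string <- c("NYU Shanghai was founded in 2012.",
            "NYU Shanghai is China's first Sino-US research university.",
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

# "+" means "one or more", so consecutive digits form a single match
first_number <- str_extract(string, "[0-9]+")      # "2012" NA "294"
all_numbers  <- str_extract_all(string, "[0-9]+")  # third element: "294" "51" "49"
```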

matching strings

str_match_all(string, pattern) extracts matched groups from all matches and returns a list of character matrices.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_match_all(string, "[0-9]")
## [[1]]
##      [,1]
## [1,] "2" 
## [2,] "0" 
## [3,] "1" 
## [4,] "2" 
## 
## [[2]]
##      [,1]
## 
## [[3]]
##      [,1]
## [1,] "2" 
## [2,] "9" 
## [3,] "4" 
## [4,] "5" 
## [5,] "1" 
## [6,] "4" 
## [7,] "9"
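
Because the pattern above has no parentheses, each matrix has only one column, the full match. The matched groups become visible once the pattern contains a capture group. The sketch below, on a shortened version of the third sentence, captures the number in front of each percent sign; it assumes stringr is installed.

```r
library(stringr)

x <- "51% came from China, with the remaining 49% coming from other countries."

# Column 1 holds the full match, column 2 the first capture group
m <- str_match_all(x, "([0-9]+)%")[[1]]

full_match <- m[, 1]  # "51%" "49%"
percentage <- m[, 2]  # "51"  "49"
```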

splitting strings

str_split(string, pattern) splits a string into pieces and returns a list of character vectors.

string <- c("NYU Shanghai was founded in 2012.", 
            "NYU Shanghai is China's first Sino-US research university.", 
            "Of the class of 294 students, 51% came from The People's Republic of China, with the remaining 49% coming from other countries around the world.")

str_split(string," ")
## [[1]]
## [1] "NYU"      "Shanghai" "was"      "founded"  "in"       "2012."   
## 
## [[2]]
## [1] "NYU"         "Shanghai"    "is"          "China's"     "first"      
## [6] "Sino-US"     "research"    "university."
## 
## [[3]]
##  [1] "Of"        "the"       "class"     "of"        "294"       "students,"
##  [7] "51%"       "came"      "from"      "The"       "People's"  "Republic" 
## [13] "of"        "China,"    "with"      "the"       "remaining" "49%"      
## [19] "coming"    "from"      "other"     "countries" "around"    "the"      
## [25] "world."

Example

Here we show an application of the functions introduced above for string manipulation.

We have a variable Founded in a dataset sp500tickers, which records the year a company was founded. Some cases contain more than one year, for example because the company was acquired by another company or went through some other change in its history. The variable records up to three events/years for each case.

head(sp500tickers$Founded, 10)
##  [1] "1902"        "1888"        "2013 (1888)" "1981"        "1989"       
##  [6] "2008"        "1982"        "1969"        "1932"        "1981"

Founded is a character variable, and the number strings in it are not formatted consistently. For instance, some contain parentheses, whitespace, commas, slashes, or founders’ names, as in “1994 (Northrop 1939, Grumman 1930)” or “1881/1894 (1980)”. In addition, the years are not necessarily in descending order, as in “2005 (Molson 1786, Coors 1873)”.

Our goals are, first, to extract the numbers (years) from each case and discard all other characters; then to separate the years into three columns, since a case has up to three years; and finally to order the three years from the most recent to the most remote.

We take the steps below to achieve those goals.


Step 1: extract all the number strings.

founded <- sp500tickers$Founded
founded <- str_extract_all(founded, "[0-9]")

We use the stringr function str_extract_all() to do this. The pattern "[0-9]" matches one digit at a time, so only digits are returned and every other character is dropped.

The function returns a list of character vectors, one per case, in which each element is a single digit.

head(founded)
## [[1]]
## [1] "1" "9" "0" "2"
## 
## [[2]]
## [1] "1" "8" "8" "8"
## 
## [[3]]
## [1] "2" "0" "1" "3" "1" "8" "8" "8"
## 
## [[4]]
## [1] "1" "9" "8" "1"
## 
## [[5]]
## [1] "1" "9" "8" "9"
## 
## [[6]]
## [1] "2" "0" "0" "8"

Step 2: cut the strings into 3 groups, each consisting of 4 digits.

year1 <- t(sapply(founded, function(x) x[1:4]))
year2 <- t(sapply(founded, function(x) x[5:8]))
year3 <- t(sapply(founded, function(x) x[9:12]))

We use the sapply() function to handle the list outputs from str_extract_all().

head(year1)
##      [,1] [,2] [,3] [,4]
## [1,] "1"  "9"  "0"  "2" 
## [2,] "1"  "8"  "8"  "8" 
## [3,] "2"  "0"  "1"  "3" 
## [4,] "1"  "9"  "8"  "1" 
## [5,] "1"  "9"  "8"  "9" 
## [6,] "2"  "0"  "0"  "8"
head(year2)
##      [,1] [,2] [,3] [,4]
## [1,] NA   NA   NA   NA  
## [2,] NA   NA   NA   NA  
## [3,] "1"  "8"  "8"  "8" 
## [4,] NA   NA   NA   NA  
## [5,] NA   NA   NA   NA  
## [6,] NA   NA   NA   NA
head(year3)
##      [,1] [,2] [,3] [,4]
## [1,] NA   NA   NA   NA  
## [2,] NA   NA   NA   NA  
## [3,] NA   NA   NA   NA  
## [4,] NA   NA   NA   NA  
## [5,] NA   NA   NA   NA  
## [6,] NA   NA   NA   NA

The outputs are matrices.

class(year1)
## [1] "matrix" "array"

Step 3: paste the 4 single digits into a whole string to indicate a year.

y1 <- apply(year1, 1, paste0, collapse = "")
y2 <- apply(year2, 1, paste0, collapse = "")
y3 <- apply(year3, 1, paste0, collapse = "")
head(y1)
## [1] "1902" "1888" "2013" "1981" "1989" "2008"
head(y2)
## [1] "NANANANA" "NANANANA" "1888"     "NANANANA" "NANANANA" "NANANANA"
head(y3)
## [1] "NANANANA" "NANANANA" "NANANANA" "NANANANA" "NANANANA" "NANANANA"

The outputs are vectors.

class(y1)
## [1] "character"

Step 4: convert the strings to numbers.

y1 <- as.numeric(y1)
y2 <- as.numeric(y2)
## Warning: NAs introduced by coercion
y3 <- as.numeric(y3)
## Warning: NAs introduced by coercion

The "NANANANA" strings cannot be parsed as numbers, so they are coerced to NA; this is what triggers the warnings.

head(y1)
## [1] 1902 1888 2013 1981 1989 2008
head(y2)
## [1]   NA   NA 1888   NA   NA   NA
head(y3)
## [1] NA NA NA NA NA NA
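
If the coercion warnings are unwanted, one option is to turn the placeholder strings into proper missing values before converting. The sketch below uses a small made-up vector in place of the real y2.

```r
# Stand-in for the pasted year strings from Step 3
y2_chr <- c("NANANANA", "NANANANA", "1888")

# Replace the placeholder text with a real missing value
y2_chr[y2_chr == "NANANANA"] <- NA

# NA converts to NA without a warning
y2_num <- as.numeric(y2_chr)  # NA NA 1888
```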

Step 5: organize the three year vectors into a data frame of three columns, and order the years from the most recent to the most remote for each case.

years <- data.frame(y1, y2, y3)
Founded1 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][1])
Founded2 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][2])
Founded3 <- apply(years, 1, function(x) x[order(x, decreasing = TRUE)][3])

We use the function order(x, decreasing = TRUE) to obtain the positions of the three years in descending order. Then x[order(x, decreasing = TRUE)] places the three years from the most recent to the most remote.
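
Most cases have fewer than three years, so it is worth checking what order() does with missing values. By default (na.last = TRUE) it puts NA at the end, even when decreasing = TRUE, which is the behaviour we want here. A small sketch with made-up values:

```r
x <- c(1888, NA, 2013)

# NA is placed last regardless of the sort direction
positions <- order(x, decreasing = TRUE)  # 3 1 2
sorted_x  <- x[positions]                 # 2013 1888 NA
```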

The example below shows how this works on one case.

years[327,]
##       y1   y2   y3
## 327 2004 1865 1909
order(years[327,], decreasing = TRUE)
## [1] 1 3 2
years[327,][order(years[327,], decreasing = TRUE)]
##       y1   y3   y2
## 327 2004 1909 1865

Next, function(x) x[order(x, decreasing = TRUE)][n] extracts the nth element from the output above. [1] means the most recent year, and [3] means the most remote one.

years[327,][order(years[327,], decreasing = TRUE)][1]
##       y1
## 327 2004
years[327,][order(years[327,], decreasing = TRUE)][2]
##       y3
## 327 1909
years[327,][order(years[327,], decreasing = TRUE)][3]
##       y2
## 327 1865

Finally, we use apply() to run the anonymous function function(x) x[order(x, decreasing = TRUE)][n] on each row of the data frame years.
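
For reference, the five steps can also be condensed into a shorter pipeline. The sketch below is an alternative under two assumptions, not the approach taken above: every year has exactly four digits, and no case has more than three years. It assumes stringr is installed and runs on sample values quoted in this post.

```r
library(stringr)

# Sample values taken from the text above, not the full dataset
founded_raw <- c("1902",
                 "2013 (1888)",
                 "1994 (Northrop 1939, Grumman 1930)",
                 "1881/1894 (1980)",
                 "2005 (Molson 1786, Coors 1873)")

# Extract whole four-digit years for each case
year_list <- str_extract_all(founded_raw, "[0-9]{4}")

# Sort each case from most recent to most remote and pad to three values;
# indexing past the end of a vector yields NA
sorted_years <- t(sapply(year_list, function(y) {
  sort(as.numeric(y), decreasing = TRUE)[1:3]
}))
colnames(sorted_years) <- c("Founded1", "Founded2", "Founded3")
```

Each row of sorted_years then holds one case, for example 1980, 1894, 1881 for "1881/1894 (1980)".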