Regular expressions

Dr. Alexander Fisher

Duke University

What is a regular expression?

Definition and utility

A regular expression (aka regex or regexp) is a custom defined string matching pattern. A regular expression lets you:

  1. extract only the phone number from this string: “My phone number is (123) 456-7890, not to be confused with my birth month which is 0”

  2. search and replace multiple spellings of the word gray (grey, 6R3Y) in a document simultaneously

  3. search through all files in a directory for the one that contains a specific string

  4. find the specific line number from a file that contains a string

  5. find and replace through multiple files simultaneously

And much, much more!

Quick example

grep and grepl are base R functions that return the index of a match and the logical value of a match respectively.

text = c("Wealth, fame, power. Gold Roger, the King of the Pirates, attained",
         "everything this world has to offer.")
text
[1] "Wealth, fame, power. Gold Roger, the King of the Pirates, attained"
[2] "everything this world has to offer."                               
grep("Pirate", text)
[1] 1
grepl("Pirate", text)
[1]  TRUE FALSE

regular expressions are case-sensitive

grep("pirate", text)
integer(0)
grepl("pirate", text)
[1] FALSE FALSE

stringr

introducing stringr

stringr hosts a convenient set of tools to manipulate strings and extract regular expressions. All functions begin with the prefix str.

The best summary of stringr functions is on this cheatsheet

Notice below that the string comes first in these functions (in contrast with grep)

Example

txt = c("Luffy: 'I'm going to be king of the pirates! !'", 
        "The straw hat crew set sail.", 
        "Nami: 'I'm Going to be the world's greatest navigator!'")
  • like grepl
str_detect(txt, ":")
[1]  TRUE FALSE  TRUE
  • return the match
str_extract(txt, "\\s([A-Z]|[a-z])*ng") # first instance
[1] " going" NA       " Going"
str_extract_all(txt, "\\s([A-Z]|[a-z])*ng") %>% str()
List of 3
 $ : chr [1:2] " going" " king"
 $ : chr(0) 
 $ : chr " Going"
str_replace(txt, "Nami", "Zoro") %>%
  str_replace("navigator", "swordsman")
[1] "Luffy: 'I'm going to be king of the pirates! !'"        
[2] "The straw hat crew set sail."                           
[3] "Zoro: 'I'm Going to be the world's greatest swordsman!'"

The power of str_replace

no_bots = "My number is one Two tHree 456 fOuR 3 2 1"
str_to_lower(no_bots) %>%
  str_replace_all(c("one" = "1", "two" = "2", 
                    "three" = "3", "four" = "4")) %>%
  str_extract_all("\\d") %>% # simplify arg here could change pipeline
  unlist() %>%
  paste(collapse = "") # or use str_c(collapse = "")
[1] "1234564321"

Basic principles

  • To match a string exactly, just write those characters.

  • To match a single character from a set of possibilities, use square brackets, e.g. [0123456789] matches any digit.

  • To group characters together into an expression, use parentheses, ()

Repeaters: * , + and { }: the preceding character is to be used for more than once

  • * match zero or more occurrences of the preceding expression.

  • + match one or more occurrences of the preceding expression.

  • {} match the preceding expression for as many times as the value inside this bracket.

Some repeater examples:

regexp explanation
a* match 0 or more occurences of “a”
a+ match 1 more occurences of “a”
(abc)+ match 1 or more back-to-back occurence of the group “abc”
a{3} match a 3 times
a{3,} match a 3 or more times
a{3,5} match “a” 3, 4 or 5 times

{citation: https://www.geeksforgeeks.org/write-regular-expressions/}

Symbols

  • . symbol for wildcard. The dot symbol can take place of any other symbol.

  • ? symbol for optional character. The preceding character may or may not be present in the string to be matched. Example: docx? will match both docx and doc

  • $ symbol for position match end. Tells the computer that the match must occur at the end of the string or before \n at the end of the line or string.

  • \ symbol for escaping characters. If you want to match for the actual + or ., etc. add a backslash \ before that character.

  • | symbol for “or”. Match any one element separated by the vertical bar | character. Example: th(e|is|at) will match words “the”, “this” and “that”.

  • ^ symbol has two meanings.

    • By itself, ^ sets the position of the match to the beginning of the string or line. Example: ^\d{3} says to match the first three digits at the beginning of the string and will return 919 from 919-123-4567.

    • Together with brackets, [^set_of_characters] implies exclusion. Example: [^abc] will match any character except a, b, c.

Character classes

Character classes: match a character by its class, for example: letter, digit, space, and symbols.

  • \s : matches any whitespace characters such as space and tab

  • \S : matches any non-whitespace characters

  • \d : matches any digit character

  • \D : matches any non-digit characters

  • \w : matches any word character (basically alpha-numeric)

  • \W : matches any non-word character

  • \b : matches any word boundary (this would include spaces, dashes, commas, semi-colons, etc)

{citation: https://www.geeksforgeeks.org/write-regular-expressions/}

A hierarchical view of character classes

{citation: http://perso.ens-lyon.fr/lise.vaudor/strings-et-expressions-regulieres/}

Ranges

- can be used to interpolate between first and last and grab consecutive values. Example: [A-Z] matches any capital letters from “A” to “Z”. [1-4] matches any integer digit from 1 to 4.

To match an alphabetical character (upper or lower case “A-Z” or “a-z”) but not numbers, you can use the regular expression ([A-Z]|[a-z])

To match everything but capital “F” through “N”, you can use the regular expression [^F-N]

Escapism

When to escape?

. ^ $ * + ? { } [ ] \ | ( )

Are all special and perform as described on the previous slides by default. Therefore, these special characters must be escaped to match directly. You need to use two levels of escape to escape a special character. Example:

txt = "To be, [or] not to be that is the question."
str_detect(txt, "[")
Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET, context=`[`)
str_detect(txt, "\[")
Error: '\[' is an unrecognized escape in character string starting ""\["
str_detect(txt, "\\[")
[1] TRUE

Escape classes

In order to access the presumed functionality of character classes, you need to use a double escape as well. Example:

txt = c("A1", "B2", "CC", "DD", "EE2")
  • Which strings end with a digit?
str_detect(txt, "\d$")
Error: '\d' is an unrecognized escape in character string starting ""\d"
str_detect(txt, "\\d$")
[1]  TRUE  TRUE FALSE FALSE  TRUE
  • Which strings contain 3 alphanumeric characters?
str_detect(txt, "\\w{3}")
[1] FALSE FALSE FALSE FALSE  TRUE

tl;dr

to match a symbol or character class, use double escapes

Exercise 1

Download the files secret-message.txt and emails.txt using the command below in the console:

WARNING

DO NOT VIEW THE FILE – YOUR CONTAINER MAY CRASH!

download.file("https://sta323-sp23.github.io/data/secret-message.txt", 
              destfile = "secret-message.txt")

download.file("https://sta323-sp23.github.io/data/emails.txt", 
              destfile = "emails.txt")

Hint: read in the file as a string with read_lines()

part 1

In secret-message.txt, find the secret message. It will be of the form sta323{secret-message} where secret-message is replaced by some other text.

part 2

In emails.txt extract the unique part of the email address (part before the “@”) and count the number of each hosting domain, i.e. count how many emails are Duke emails and how many are gmail.

Examples

In the following example we will search through the text, line by line, and extract matches.

Luffy's phone number is 123 456 7890
Zoro doesn't have a phone number
Nami's number is 012-345-6789
Usopp's number is (919)000 0000
Sanji's telephone number is (919) 123 4567
0000000000 is Robin's number.
Chopper doesn't have a phone number, but his lucky number is 1.

regex extraction

You can download the text file with

download.file("https://sta323-sp23.github.io/data/pirate-phone.txt",
destfile = "pirate-phone.txt")
str_extract(pirate_phone, "123 456 7890")

123 456 7890

exact match

str_extract(pirate_phone, "[0-9]{3}\\-[0-9]{3}\\-[0-9]{4}")

012-345-6789

matching xxx-xxx-xxxx using ranges, repeaters, escaped characters

Examples

Luffy's phone number is 123 456 7890
Zoro doesn't have a phone number
Nami's number is 012-345-6789
Usopp's number is (919)000 0000
Sanji's telephone number is (919) 123 4567
0000000000 is Robin's number.
Chopper doesn't have a phone number, but his lucky number is 1.

regex extraction

str_extract_all(pirate_phone, "\\(\\d{3}\\)\\d{3}\\s\\d{4}")

(919)000 0000

matching (xxx)xxx xxxx using character classes (\d for digit, \s for whitespace) and repeaters

str_extract_all(pirate_phone, "\\(\\d{3}\\).?\\d{3}\\s[0-9]{4}")
(919)000 0000
(919) 123 4567

matching (xxx)*xxx xxxx using character classes, wildcard (.) and optional chracter (?)

# hidden exercise, code here!
123 456 7890
012-345-6789
(919)000 0000
(919) 123 4567
0000000000

Multiple optional matching

Greedy vs ungreedy matching

What went wrong here?

text = "<div class='main'> <div> <a href='here.pdf'>Here!</a> </div> </div>"
str_extract(text, "<div>.*</div>")
[1] "<div> <a href='here.pdf'>Here!</a> </div> </div>"

If you add ? after a repeater, the matching will be non-greedy (find the shortest possible match, not the longest).

str_extract(text, "<div>.*?</div>")
[1] "<div> <a href='here.pdf'>Here!</a> </div>"

Appendix

base R string toolkit

The base language provides a number helper functions for additional manipulation of string objects:

  • paste(), paste0() - concatenate strings
  • substr(), substring() - extract or replace substrings
  • sprintf() - C-like string construction
  • nchar() - counts characters
  • strsplit() - split a string into substrings
  • tolower() - make string all lowercase
  • toupper() - make string all uppercase

…many more.

the “See Also” section of the the above functions’ documentation is a good place to discover additional functions.

On the commmand line (grep)

There are three fundamental tools on unix systems to process text patterns: grep, sed and awk.

The simplest is grep. grep looks by default for lines of files that match the regex.

Check out the documentation with

$ man grep

The basics are:

$ grep -option path/to/file(s)

Common options include:

grep option what it does
-c count lines with a match
-i case insensitive search
-l list only names of matching files
-n each output is preceded by its line number
-o print only the matching parts of lines
-v invert; list only lines that do not match pattern

Exercise 2

The power and ease of the terminal…

Using an appropriate regex and the terminal grep, find the secret message hidden in some of the files in the zip folder:

download.file("https://sta323-sp23.github.io/data/secret-messages.zip", 
              destfile = "secret-messages.zip")

unzip("secret-messages.zip", exdir = "secret-messages")

Again, look for sta323{message-here}.

base R regex

  • grep(), grepl() - regular expression pattern matching, “l” for return logical
  • sub(), gsub() - regular expression pattern replacement (replace first, replace all)
  • regmatches() - extract or replace matched strings

Example

txt = c("Luffy: 'I'm going to be king of the pirates! !'", 
        "The straw hat crew set sail.", 
        "Nami: 'I'm Going to be the world's greatest navigator!'")
grep(":", txt)
[1] 1 3
grepl(":", txt)
[1]  TRUE FALSE  TRUE
grep(":", txt, value = TRUE)
[1] "Luffy: 'I'm going to be king of the pirates! !'"        
[2] "Nami: 'I'm Going to be the world's greatest navigator!'"
sub("!", "!!!", txt) # first ! in a string, vectorized
[1] "Luffy: 'I'm going to be king of the pirates!!! !'"        
[2] "The straw hat crew set sail."                             
[3] "Nami: 'I'm Going to be the world's greatest navigator!!!'"
gsub("!", "!!!", txt) # all !, vectorized
[1] "Luffy: 'I'm going to be king of the pirates!!! !!!'"      
[2] "The straw hat crew set sail."                             
[3] "Nami: 'I'm Going to be the world's greatest navigator!!!'"
regmatches(txt, regexpr(".*:", txt))
[1] "Luffy:" "Nami:" 
regmatches(txt, regexpr("\\s[a-z]*ng", txt))
[1] " going"
regmatches(txt, regexpr("\\s[A-Z\\|a-z]*ng", txt))
[1] " going" " Going"

basic unix toolkit

A “folder” aka a “directory” is a container. A “file* is an element of a container, e.g. lab-1.qmd is a file contained in a lab-1-username directory.

command action
$ ls list files in current directory
$ pwd print working directory
$ cd change directory
example: $ cd .. to go to parent directory
$ mkdir dname make directory “dname”
$ mkdir s{1..5} overpowered file creation
$ rm /path/to/file remove a file
$ rm -rf dname recursively remove a directory and its contents
$ rm core* remove all objects in the current working directory that begin with “core”
$ wc -l show # lines in a file
$ y > x.txt pass printed output from command “y” on the left to file “x.txt” on the right
example: $ head -N file1.txt > file2.txt creates a new file called “file2” that is a replica of the first N lines of file 1
$ echo 'text here' >> filename add text to the end of a file
$ man x pull up documentation for command x, example: $ man ls