text = c("Wealth, fame, power. Gold Roger, the King of the Pirates, attained",
"everything this world has to offer.")
text
[1] "Wealth, fame, power. Gold Roger, the King of the Pirates, attained"
[2] "everything this world has to offer."
Dr. Alexander Fisher
Duke University
A regular expression (aka regex or regexp) is a custom defined string matching pattern. A regular expression lets you:
extract only the phone number from this string: “My phone number is (123) 456-7890, not to be confused with my birth month which is 0”
search and replace multiple spellings of the word gray (grey, 6R3Y) in a document simultaneously
search through all files in a directory for the one that contains a specific string
find the specific line number from a file that contains a string
find and replace through multiple files simultaneously
And much, much more!
grep
and grepl
are base R functions that return the index of a match and the logical value of a match respectively.
stringr
hosts a convenient set of tools to manipulate strings and extract regular expressions. All functions begin with the prefix str
.
The best summary of stringr
functions is on this cheatsheet
Notice below that the string comes first in these functions (in contrast with grep
)
str_replace
To match a string exactly, just write those characters.
To match a single character from a set of possibilities, use square brackets, e.g. [0123456789]
matches any digit.
To group characters together into an expression, use parentheses, ()
Repeaters: *
, +
and { }
: the preceding character is to be used for more than once
*
match zero or more occurrences of the preceding expression.
+
match one or more occurrences of the preceding expression.
{}
match the preceding expression for as many times as the value inside this bracket.
Some repeater examples:
regexp | explanation |
---|---|
a* |
match 0 or more occurences of “a” |
a+ |
match 1 more occurences of “a” |
(abc)+ |
match 1 or more back-to-back occurence of the group “abc” |
a{3} |
match a 3 times |
a{3,} |
match a 3 or more times |
a{3,5} |
match “a” 3, 4 or 5 times |
{citation: https://www.geeksforgeeks.org/write-regular-expressions/}
.
symbol for wildcard. The dot symbol can take place of any other symbol.
?
symbol for optional character. The preceding character may or may not be present in the string to be matched. Example: docx?
will match both docx
and doc
$
symbol for position match end. Tells the computer that the match must occur at the end of the string or before \n
at the end of the line or string.
\
symbol for escaping characters. If you want to match for the actual +
or .
, etc. add a backslash \
before that character.
|
symbol for “or”. Match any one element separated by the vertical bar |
character. Example: th(e|is|at)
will match words “the”, “this” and “that”.
^
symbol has two meanings.
By itself, ^
sets the position of the match to the beginning of the string or line. Example: ^\d{3}
says to match the first three digits at the beginning of the string and will return 919
from 919-123-4567
.
Together with brackets, [^set_of_characters]
implies exclusion. Example: [^abc]
will match any character except a, b, c.
Character classes: match a character by its class, for example: letter, digit, space, and symbols.
\s
: matches any whitespace characters such as space and tab
\S
: matches any non-whitespace characters
\d
: matches any digit character
\D
: matches any non-digit characters
\w
: matches any word character (basically alpha-numeric)
\W
: matches any non-word character
\b
: matches any word boundary (this would include spaces, dashes, commas, semi-colons, etc)
{citation: https://www.geeksforgeeks.org/write-regular-expressions/}
{citation: http://perso.ens-lyon.fr/lise.vaudor/strings-et-expressions-regulieres/}
-
can be used to interpolate between first and last and grab consecutive values. Example: [A-Z]
matches any capital letters from “A” to “Z”. [1-4]
matches any integer digit from 1 to 4.
To match an alphabetical character (upper or lower case “A-Z” or “a-z”) but not numbers, you can use the regular expression ([A-Z]|[a-z])
To match everything but capital “F” through “N”, you can use the regular expression [^F-N]
When to escape?
. ^ $ * + ? { } [ ] \ | ( )
Are all special and perform as described on the previous slides by default. Therefore, these special characters must be escaped to match directly. You need to use two levels of escape to escape a special character. Example:
Error in stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)): Missing closing bracket on a bracket expression. (U_REGEX_MISSING_CLOSE_BRACKET, context=`[`)
In order to access the presumed functionality of character classes, you need to use a double escape as well. Example:
Download the files secret-message.txt
and emails.txt
using the command below in the console:
Hint: read in the file as a string with read_lines()
In secret-message.txt
, find the secret message. It will be of the form sta323{secret-message}
where secret-message
is replaced by some other text.
In emails.txt
extract the unique part of the email address (part before the “@”) and count the number of each hosting domain, i.e. count how many emails are Duke
emails and how many are gmail
.
In the following example we will search through the text, line by line, and extract matches.
Luffy's phone number is 123 456 7890
Zoro doesn't have a phone number
Nami's number is 012-345-6789
Usopp's number is (919)000 0000
Sanji's telephone number is (919) 123 4567
0000000000 is Robin's number.
Chopper doesn't have a phone number, but his lucky number is 1.
You can download the text file with
Luffy's phone number is 123 456 7890
Zoro doesn't have a phone number
Nami's number is 012-345-6789
Usopp's number is (919)000 0000
Sanji's telephone number is (919) 123 4567
0000000000 is Robin's number.
Chopper doesn't have a phone number, but his lucky number is 1.
What went wrong here?
The base language provides a number helper functions for additional manipulation of string objects:
paste()
, paste0()
- concatenate stringssubstr()
, substring()
- extract or replace substringssprintf()
- C-like string constructionnchar()
- counts charactersstrsplit()
- split a string into substringstolower()
- make string all lowercasetoupper()
- make string all uppercase…many more.
the “See Also” section of the the above functions’ documentation is a good place to discover additional functions.
There are three fundamental tools on unix systems to process text patterns: grep, sed and awk.
The simplest is grep. grep looks by default for lines of files that match the regex.
Check out the documentation with
$ man grep
The basics are:
$ grep -option path/to/file(s)
Common options include:
grep option | what it does |
---|---|
-c | count lines with a match |
-i | case insensitive search |
-l | list only names of matching files |
-n | each output is preceded by its line number |
-o | print only the matching parts of lines |
-v | invert; list only lines that do not match pattern |
The power and ease of the terminal…
Using an appropriate regex and the terminal grep
, find the secret message hidden in some of the files in the zip folder:
Again, look for sta323{message-here}
.
grep()
, grepl()
- regular expression pattern matching, “l” for return logicalsub()
, gsub()
- regular expression pattern replacement (replace first, replace all)regmatches()
- extract or replace matched strings[1] "Luffy: 'I'm going to be king of the pirates!!! !'"
[2] "The straw hat crew set sail."
[3] "Nami: 'I'm Going to be the world's greatest navigator!!!'"
[1] "Luffy: 'I'm going to be king of the pirates!!! !!!'"
[2] "The straw hat crew set sail."
[3] "Nami: 'I'm Going to be the world's greatest navigator!!!'"
A “folder” aka a “directory” is a container. A “file* is an element of a container, e.g. lab-1.qmd
is a file contained in a lab-1-username
directory.
command | action |
---|---|
$ ls |
list files in current directory |
$ pwd |
print working directory |
$ cd |
change directory example: $ cd .. to go to parent directory |
$ mkdir dname |
make directory “dname” |
$ mkdir s{1..5} |
overpowered file creation |
$ rm /path/to/file |
remove a file |
$ rm -rf dname |
recursively remove a directory and its contents |
$ rm core* |
remove all objects in the current working directory that begin with “core” |
$ wc -l |
show # lines in a file |
$ y > x.txt |
pass printed output from command “y” on the left to file “x.txt” on the right example: $ head -N file1.txt > file2.txt creates a new file called “file2” that is a replica of the first N lines of file 1 |
$ echo 'text here' >> filename |
add text to the end of a file |
$ man x |
pull up documentation for command x, example: $ man ls |