5 Regular expressions

Regular expressions are patterns used to match combinations of characters in a string. Before we begin, a cautionary note: if your regular expressions are becoming too complex, it may be time to step back and ask whether the complexity is necessary, and whether the pattern can be split into multiple simpler expressions that are easier to understand.
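As a rough sketch of this advice, suppose you want to check that a string contains both a digit and an uppercase letter. The vector passwords and both patterns are made up for illustration, and str_detect() is a stringr function not covered in this section (it returns TRUE or FALSE for each string). The point is only that two simple patterns can stand in for one dense pattern.

```r
library(stringr)

# Hypothetical inputs, used only for illustration
passwords <- c("Secret1", "secret1", "SECRET", "s3Cret")

# One dense pattern using lookaheads: harder to read and to debug
dense <- str_detect(passwords, "^(?=.*\\d)(?=.*[A-Z]).+$")

# Two simple patterns combined with &: each part can be checked on its own
has_digit <- str_detect(passwords, "\\d")
has_upper <- str_detect(passwords, "[A-Z]")
simple <- has_digit & has_upper

simple                    # TRUE FALSE FALSE TRUE
identical(dense, simple)  # TRUE
```

Both approaches give the same answer here, but the second one can be read and tested one condition at a time.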

5.1 Prerequisites

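The functions str_view() and str_view_all() from the stringr package (part of tidyverse) will be used to learn regular expressions interactively. str_view() shows the first match while str_view_all() shows all the matches. Note that from stringr 1.5.0 onwards, str_view() shows all matches and str_view_all() is deprecated, so the output you see may differ depending on the installed version.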

5.2 Basic matches

The most basic form of matching is to match exact strings:

library(stringr)  # load stringr for str_view() and str_view_all()
x <- c("abc ABC\n123. !?\\(){}")
cat(x)
## abc ABC
## 123. !?\(){}
str_view(x, "abc")
str_view(x, "123")
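One detail worth seeing in code is that exact matching is case sensitive. The sketch below uses str_extract_all() and regex(ignore_case = TRUE), two stringr functions that are not introduced in this section, to return the matches as values instead of displaying them.

```r
library(stringr)

x <- "abc ABC\n123. !?\\(){}"

# Exact matching is case sensitive: only the lowercase "abc" is found
exact <- str_extract_all(x, "abc")[[1]]

# regex(ignore_case = TRUE) relaxes this and also finds "ABC"
any_case <- str_extract_all(x, regex("abc", ignore_case = TRUE))[[1]]

exact     # "abc"
any_case  # "abc" "ABC"
```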

5.3 Character classes

Character classes allow you to specify a list of characters for matching.

Brackets [...] can be used to specify a list of characters. The bracket expression matches any single character that is listed within the brackets.

str_view_all(x, "[bcde]")

If a caret ^ is added to the start of the list of characters, it will match any character that is NOT in the list.

str_view_all(x, "[^bcde]")

You can also specify a range expression using a hyphen - between two characters.

str_view_all(x, "[a-zA-Z]")
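To check what these three bracket expressions match without relying on the viewer, the matches can be extracted or counted. This is a small sketch using str_extract_all() and str_count() from stringr, which are not covered in this section; the variable names are chosen for illustration.

```r
library(stringr)

x <- "abc ABC\n123. !?\\(){}"

# A list of characters: only b and c occur in x
listed <- str_extract_all(x, "[bcde]")[[1]]

# A negated list: x has 20 characters, and all but b and c match
n_not_listed <- str_count(x, "[^bcde]")

# A range expression: every lowercase and uppercase letter
letters_found <- str_extract_all(x, "[a-zA-Z]")[[1]]

listed         # "b" "c"
n_not_listed   # 18
letters_found  # "a" "b" "c" "A" "B" "C"
```

Note that the negated list also matches the newline, the spaces and the punctuation, since none of them are in the list.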

You can also specify character classes using the pre-defined names of these classes, written as a bracket expression.

Regex What it matches
[:alnum:] letters and numbers
[:alpha:] letters
[:upper:] uppercase letters
[:lower:] lowercase letters
[:digit:] digits
[:punct:] punctuation
[:space:] whitespace characters

Try out some of the named classes using the function str_view_all():

str_view_all(x, "[:alnum:]")
str_view_all(x, "[:punct:]")

There are also special metacharacters you can use to match entire classes of characters.

Regex What it matches
. any character except newline "\n"
\d digit
\D non-digit
\s whitespace
\S non-whitespace
\t tab
\n newline
\w “word” character, i.e., letters (a-z and A-Z), digits (0-9) or underscore (_)
\W non-“word”

Note that to include a \ in a regular expression written as an R string, you need to type it as \\. This is explained in the next subsection on escaping.

str_view_all(x, ".")
str_view_all(x, "\\d")
str_view_all(x, "\\D")
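Counting the matches is a quick way to confirm what each metacharacter covers. The sketch below uses str_count() from stringr (not introduced in this section) on the same x, which has 20 characters in total.

```r
library(stringr)

x <- "abc ABC\n123. !?\\(){}"

n_any       <- str_count(x, ".")    # every character except the newline
n_digit     <- str_count(x, "\\d")  # 1, 2 and 3
n_non_digit <- str_count(x, "\\D")  # everything else, including the newline
n_space     <- str_count(x, "\\s")  # two spaces and the newline
n_word      <- str_count(x, "\\w")  # six letters and three digits

c(n_any, n_digit, n_non_digit, n_space, n_word)  # 19 3 17 3 9
```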

5.4 Escaping

In regular expressions, the backslash \ is used as an escape character: it “escapes” the special character that comes after it. However, in R, the backslash is also the escape character in strings. For example, the string "abc ABC\n123. !?\\(){}" represents the characters abc ABC, a newline, and then 123. !?\(){}. Notice that the string representation contains an additional backslash \. This is because \ is a special character in R strings. Therefore, to represent a literal backslash \ in a string, another backslash must be added to escape it. This means that to create a string containing \, you need to write "\\".

x <- c("abc ABC\n123. !?\\(){}")
cat(x)
## abc ABC
## 123. !?\(){}

Therefore, to create any regular expression that contains a backslash, you need to write a string in which each backslash of the regular expression is itself escaped with another backslash \.

For example, how would you create a regular expression to match the character ".", given that . is defined to match any character except newline? You would need to escape it as \.. However, the backslash is a special character in a string, so you need the string "\\." to represent the regular expression \.. The same logic applies to metacharacters such as \d, \D, \w and \W: you need to write "\\" in the string to represent a \ in the regular expression.

str_view_all(x, ".")
str_view_all(x, "\\.")

To match a literal backslash \, two levels of escaping are required. The regular expression that matches a \ is \\, because the backslash must itself be escaped. Each of these two backslashes must in turn be escaped in the R string. Therefore, to match a \ you need to write the string "\\\\".

str_view(x, "\\\\")
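The two levels of escaping can be checked directly. The following sketch uses nchar() and writeLines() from base R to show what a string really contains, and str_count() from stringr to show what each pattern matches. The use of fixed(), which asks stringr to treat the pattern as literal text, is an extra not covered in this section.

```r
library(stringr)

x <- "abc ABC\n123. !?\\(){}"

# Level 1: the R string. "\\" holds one backslash, "\\\\" holds two.
one_backslash  <- nchar("\\")
two_backslashes <- nchar("\\\\")
writeLines("\\.")  # prints the regular expression as the regex engine sees it: \.

# Level 2: the regular expression.
n_any       <- str_count(x, ".")       # unescaped . matches any character except newline
n_dot       <- str_count(x, "\\.")     # regex \. matches only the literal dot
n_backslash <- str_count(x, "\\\\")    # regex \\ matches only the literal backslash

# fixed() skips regular expression rules, so no regex-level escape is needed
n_dot_fixed <- str_count(x, fixed("."))

c(one_backslash, two_backslashes)           # 1 2
c(n_any, n_dot, n_backslash, n_dot_fixed)   # 19 1 1 1
```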

5.5 Anchors

Regular expressions will match any part of a string unless you use anchors. Rather than matching characters, anchors match positions, such as the start or end of the string.

Regex Position
^ start of string
$ end of string
\b word boundaries
\B non-word boundaries

You can use ^ to match the start of a string and $ to match the end of a string. For a complete match, you could anchor a regular expression with ^...$.

str_view_all(c("apple", "apple pie", "juicy apple"), "apple")
str_view_all(c("apple", "apple pie", "juicy apple"), "^apple")
str_view_all(c("apple", "apple pie", "juicy apple"), "apple$")
str_view_all(c("apple", "apple pie", "juicy apple"), "^apple$")
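The effect of each anchor is easy to see as a logical vector. This sketch uses str_detect() from stringr, which is not introduced in this section, on the same three strings. The last two lines are an additional assumption-labelled aside: by default ^ and $ refer to the whole string, and regex(multiline = TRUE) makes them apply to each line instead.

```r
library(stringr)

fruit <- c("apple", "apple pie", "juicy apple")

anywhere <- str_detect(fruit, "apple")    # TRUE TRUE TRUE
at_start <- str_detect(fruit, "^apple")   # TRUE TRUE FALSE
at_end   <- str_detect(fruit, "apple$")   # TRUE FALSE TRUE
complete <- str_detect(fruit, "^apple$")  # TRUE FALSE FALSE

# Aside (illustrative string): anchors and strings that contain a newline
two_lines <- "abc ABC\n123"
whole_string <- str_detect(two_lines, "^123")                           # FALSE
per_line     <- str_detect(two_lines, regex("^123", multiline = TRUE))  # TRUE
```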

\b matches a position known as a word boundary, so you can use it to match the start or end of a word. \B is the inverse of \b: it matches any position that \b does not.

str_view_all(c("apple", "pineapple", "juicy apple"), "\\bapple\\b")
str_view_all(c("apple", "pineapple", "juicy apple"), "apple\\b")
str_view_all(c("apple", "pineapple", "juicy apple"), "\\Bapple\\B")
str_view_all(c("apple", "pineapple", "juicy apple"), "\\Bapple")
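The same four patterns can be summarised with str_detect() (a stringr function not covered in this section). The extra string "pineapples" is a made-up addition, included only because none of the original three strings match \Bapple\B.

```r
library(stringr)

fruit <- c("apple", "pineapple", "juicy apple")

whole_word   <- str_detect(fruit, "\\bapple\\b")  # TRUE FALSE TRUE
word_end     <- str_detect(fruit, "apple\\b")     # TRUE TRUE TRUE
inside_word  <- str_detect(fruit, "\\Bapple\\B")  # FALSE FALSE FALSE
not_at_start <- str_detect(fruit, "\\Bapple")     # FALSE TRUE FALSE

# Hypothetical extra input: "apple" with word characters on both sides
inside_plural <- str_detect("pineapples", "\\Bapple\\B")  # TRUE
```

Every string in fruit ends with "apple", so there is always a word boundary after it, which is why \Bapple\B finds nothing there.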

5.6 Quantifiers

You can use quantifiers to specify the number of times a pattern matches.

Regex Number of times
+ one or more
* zero or more
? zero or one
{m} exactly m
{m,} at least m (m or more)
{m,n} between m and n (both inclusive)

str_view(c("a", "abb", "abbb"), "ab+")
str_view(c("a", "abb", "abbb"), "ab*")
str_view(c("a", "abb", "abbb"), "ab?")
str_view(c("a", "abb", "abbb"), "ab{1}")
str_view(c("a", "abb", "abbb"), "ab{1,}")
str_view(c("a", "abb", "abbb"), "ab{1,2}")
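To compare the quantifiers side by side, the matched text can be returned with str_extract() (a stringr function not introduced in this section), which gives the first match in each string or NA when there is none. The variable names are chosen for illustration.

```r
library(stringr)

s <- c("a", "abb", "abbb")

one_or_more  <- str_extract(s, "ab+")      # NA   "abb" "abbb"
zero_or_more <- str_extract(s, "ab*")      # "a"  "abb" "abbb"
zero_or_one  <- str_extract(s, "ab?")      # "a"  "ab"  "ab"
exactly_one  <- str_extract(s, "ab{1}")    # NA   "ab"  "ab"
at_least_one <- str_extract(s, "ab{1,}")   # NA   "abb" "abbb"
one_to_two   <- str_extract(s, "ab{1,2}")  # NA   "abb" "abb"
```

In each pattern the quantifier applies only to the b, and it takes as many b characters as its limits allow.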

By default, a quantifier applies only to the single character that precedes it. You can use (...) to apply a quantifier to more than one character.

str_view(c("ab", "abab", "ababab"), "ab+")
str_view(c("ab", "abab", "ababab"), "(ab)+")
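The difference between the two patterns shows up clearly in the extracted text. This sketch again uses str_extract() from stringr (not covered in this section); the third pattern, (ab){2}, is an extra example combining a group with a counted quantifier.

```r
library(stringr)

s <- c("ab", "abab", "ababab")

# + applies only to b, so the first match is always a single "ab"
ungrouped <- str_extract(s, "ab+")      # "ab" "ab"   "ab"

# + applies to the whole group, so repeated "ab" is matched as one unit
grouped <- str_extract(s, "(ab)+")      # "ab" "abab" "ababab"

# Exactly two repetitions of the group
twice <- str_extract(s, "(ab){2}")      # NA   "abab" "abab"
```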