5 Regular expressions
Regular expressions are patterns that are used to match combinations of characters in a string. Before we begin, just a cautionary note that if your regular expressions are becoming too complex, perhaps it is time to step back and think about whether it is necessary and if they can be represented multiple expressions that are easier to understand.
5.1 Prerequisites
The functions str_view()
and str_view_all()
from the stringr package
(part of tidyverse) will be used to learn regular expressions interactively.
str_view()
shows the first match while str_view_all()
shows all the matches.
5.2 Basic matches
The most basic form of matching is to match exact strings
str_view(x, "abc")
str_view(x, "123")
5.3 Character classes
Character classes allow you to specify a list of characters for matching.
A bracket [...]
can be used to specify a list of character. Therefore, it will
match any characters that was specified within the brackets.
str_view_all(x, "[bcde]")
If a caret ^
is added to the start of the list of characters, it will match
any characters that are NOT in the list.
str_view_all(x, "[^bcde]")
You can also specify a range expression using a hyphen -
between two
characters.
str_view_all(x, "[a-zA-Z]")
You can also specify character classes using pre-defined names of these classes with a bracket expression.
Regex | What it matches |
---|---|
[:alnum:] |
letters and numbers |
[:alpha:] |
letters |
[:upper:] |
uppercase letters |
[:lower:] |
lowercase letters |
[:digit:] |
numbers |
[:punct:] |
punctuations |
[:space:] |
space characters |
Try out some of the named classes using the function str_view_all()
str_view_all(x, "[:alnum:]")
str_view_all(x, "[:punct:]")
There are also special metacharacters you can use to match entire classes of characters.
Regex | What it matches |
---|---|
. |
any character except newline "\n"
|
\d |
digit |
\D |
non-digit |
\s |
whitespace |
\S |
non-whitespace |
\t |
tab |
\n |
newline |
\w |
“word” i.e., letters (a-z and A-Z), digits (0-9) or (_) |
\W |
non-“word” |
Note that to include a \
in a regular expression, you need to escape it using
\\
. This is explained in the next subsection on escaping.
str_view_all(x, ".")
str_view_all(x, "\\d")
str_view_all(x, "\\D")
5.4 Escaping
In regular expressions, the backslash \
is used as an escape character and
is used to “escape” any special characters that comes after the backslash.
However, in R, the same backslashes \
are also used as an escape character
in strings. For example, the string "abc ABC 123.\n!?\\(){}"
is used to
represent the characters abc ABC 123.\n!?\(){}
. You will see that there is any
additional backslash \
in the string representation. This is because \
is a
special character for strings in R. Therefore, to represent a backslash \
in a
string, another backslash needs to be added to escape the special representation
of a \
in strings. This means that to create a string containing ““, you need
to write "\\"
.
Therefore, to create any regular expressions that contains a backslash, you
would need to use a string that contains another backslash \
to escape the
backslash \
that forms a part of the regular expression.
For example, how would you create a regular expression to match the character
"."
if it is defined to match any character except newline. You would need to
escape it with \.
. However, the backslash is a special character in a string.
Therefore you need the string "\\."
to represent the regular expression \.
.
The same logic is applied when representing metacharacters such as \d
, \D
,
\w
, \W
, etc. You need to use “\” to represent \
in regular expressions.
str_view_all(x, ".")
str_view_all(x, "\\.")
To represent a backslash \
as a regular expression, two levels of escape would
be required! To elaborate, to represent a \
as a regular expression, you would
need to escape it by creating the regular expression \\
. To represent each of
these \
you need to use a string, which also requires you to add an additional
\
to escape it. Therefore, to match a \
you need to write \\\\
.
str_view(x, "\\\\")
5.5 Anchors
Regular expressions will match any part of a string unless you use anchors to specify positions such as the start or end of the string. Instead of characters, anchors are used to specify position.
Regex | Position |
---|---|
^ |
start of string |
$ |
end of string |
\b |
word boundaries |
\B |
non-word boundaries |
You can use ^
to match the start of a string and $
to match the end of a
string. For a complete match, you could anchor a regular expression with
^...$
.
str_view_all(c("apple", "apple pie", "juicy apple"), "apple")
str_view_all(c("apple", "apple pie", "juicy apple"), "^apple")
str_view_all(c("apple", "apple pie", "juicy apple"), "apple$")
str_view_all(c("apple", "apple pie", "juicy apple"), "^apple$")
\b
is used to match a position known as word boundary. You can use \b
to
match the start or end of a word. You can think of \B
as the inverse of \b
and basically it matches any position that \b
does not.
str_view_all(c("apple", "pineapple", "juicy apple"), "\\bapple\\b")
str_view_all(c("apple", "pineapple", "juicy apple"), "apple\\b")
str_view_all(c("apple", "pineapple", "juicy apple"), "\\Bapple\\B")
str_view_all(c("apple", "pineapple", "juicy apple"), "\\Bapple")
5.6 Quantifies
You can use quantifiers to specify the number of times a pattern matches.
Regex | Number of times |
---|---|
+ |
one or more |
* |
zero or more |
? |
zero or one |
{m} |
exactly m
|
{m,} |
at least m (m or more) |
{m,n} |
between m and n (both inclusive) |
By default, quantifies are applied to a single character. You can use (...)
to
apply quantifies to more than one character.