Regular Expressions

Regular expressions are a big part of the Perl language. They are essential when processing text, which is one of the things Perl is best at.

In short, regular expressions are special patterns that, when applied to strings, either match them or not. They can also capture substrings, modify the original string by substituting parts of it, and so on.

The most common use of regular expressions is to find out whether a particular string occurs inside another string.

my $string = 'Hello, world!';
if ($string =~ m/Hello/) {
    say 'We found Hello!';
}

So here are two new things: =~ and m// (m is for match). There is also the opposite operator, !~, which evaluates to true when the regular expression does not match.

my $string = 'Hello, world!';
if ($string !~ m/Bye/) {
    say 'No Bye was found';
}
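The two operators can be tried side by side in one small script. This is only a sketch: the variable names $has_hello and $lacks_bye are illustrative, and the "use feature 'say'" line is included because say has to be enabled before it can be used (Perl 5.10 or later).

```perl
use strict;
use warnings;
use feature 'say';    # enables say (Perl 5.10 or later)

my $string = 'Hello, world!';

# =~ is true when the pattern matches, !~ is true when it does not.
my $has_hello = $string =~ m/Hello/ ? 1 : 0;
my $lacks_bye = $string !~ m/Bye/   ? 1 : 0;

say "has Hello: $has_hello";    # has Hello: 1
say "lacks Bye: $lacks_bye";    # lacks Bye: 1
```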

Character Classes

Character classes are regular expression constructs that match exactly one character out of a set of possible characters. We create a character class by listing those characters between square brackets []. For example:

my $string = 'Bye';
if ($string =~ m/[aeiou]/) {
    say 'Found a vowel';
}

In this case, the conditional is true because one of the characters inside the character class [aeiou] was found in the string.

We can also indicate a range instead of writing all the characters we want to try to match:

my $string = 'hello';
if ($string =~ m/[a-g]/) {
    say 'Found a letter between a and g in the string';
}

In this example, we have written [a-g] instead of [abcdefg]. Ranges also work with digits ([2-5]) and uppercase letters ([A-Z]).

Keep in mind that a range must always run from the lower character to the higher one, in alphabetical or numerical order. For example, [8-3] does not work: Perl rejects it with an "Invalid [] range" error when it compiles the regular expression.
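As a rough illustration of these rules, the sketch below combines several ranges in one class, places a hyphen last so that it is treated as a literal character, and compiles a reversed range at run time inside eval so that the error can be observed without stopping the script. The test strings and variable names are made up for the example.

```perl
use strict;
use warnings;
use feature 'say';

# Several ranges can share one class. A hyphen placed last is a literal hyphen.
my @found;
for my $candidate ('Perl5', '-', '???') {
    push @found, $candidate if $candidate =~ m/[a-zA-Z0-9-]/;
}
say "Matched: @found";    # Matched: Perl5 -

# A reversed range is rejected when the pattern is compiled.
# Building the pattern from a string delays compilation until run time,
# so eval can trap the error.
my $reversed    = '[8-3]';
my $compiled    = eval { qr/$reversed/ };
my $reversed_ok = defined $compiled ? 1 : 0;
say "Reversed range compiles: ", ($reversed_ok ? 'yes' : 'no');    # no
```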

Exercise

Modify this regular expression to detect whether there are any 'x', 'y' or 'z' letters in the sentence:

my $string = "We're looking for any x, y or z";
if ($string =~ ) {
    say "Found an x, y or z";
}

Other character classes are written without the []. Instead, they consist of a backslash \ and a letter indicating a set of characters. For example, to find out whether a string contains at least one digit, we can use \d:

my $string = 'March has 31 days';
if ($string =~ m/\d/) {
    say 'Found a number!';
}

With \d we are saying we want to find digits. We can also use \w to match any word character (letters, digits and the underscore _):

my $string = 'This sentence is 6 words long';
if ($string =~ m/\w/) {
    say 'Found a word character';
}

And if we want to find out whether a string contains whitespace, we can use \s:

my $string = 'White space';
if ($string =~ m/\s/) {
    say 'There is a whitespace in the string';
}

Whitespace includes spaces, tabs, line feeds, carriage returns and form feeds.
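One way to check this is to test each kind of whitespace character against \s separately. The hash of labelled characters below is just scaffolding for the example.

```perl
use strict;
use warnings;
use feature 'say';

# Each value is a one-character string holding a different kind of whitespace.
my %whitespace = (
    'space'           => " ",
    'tab'             => "\t",
    'line feed'       => "\n",
    'carriage return' => "\r",
    'form feed'       => "\f",
);

my $matched = 0;
for my $name (sort keys %whitespace) {
    if ($whitespace{$name} =~ m/\s/) {
        $matched++;
        say "$name is matched by \\s";
    }
}
say "$matched of 5 characters matched";    # 5 of 5 characters matched
```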

We can use the uppercase version of a character class to match the opposite. For example, \S matches any character that is not whitespace. It still needs a character to match, so it fails on the empty string as well as on a string made only of spaces:

my $string = '    ';
if ($string =~ m/\S/) {
    say 'There is at least one character that is not a whitespace';
}
else {
    say 'There is not a non-space character in the string';
}

Or with \D we match any character that is not a digit:

my $string = '42';
if ($string =~ m/\D/) {
    say 'There is at least one character that is not a digit';
}
else {
    say 'There is not a non-digit character in the string';
}

Finally, if we want to match any character at all, we use the dot:

my $string = 'Hello, World!';
if ($string =~ m/./) {
    say 'Found a character';
}

The dot fails to match only when the string is empty or contains nothing but newlines, because by default the dot matches any character except a newline.
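The following sketch tries the dot against an empty string, a string holding only a newline, and a string with a letter followed by a newline. The labels and the %dot_matches hash exist only for this illustration.

```perl
use strict;
use warnings;
use feature 'say';

my @cases = (
    [ 'empty string',        ''    ],
    [ 'newline only',        "\n"  ],
    [ 'letter then newline', "a\n" ],
);

my %dot_matches;
for my $case (@cases) {
    my ($label, $text) = @$case;
    # By default the dot matches any single character except a newline.
    $dot_matches{$label} = $text =~ m/./ ? 1 : 0;
    say "$label: ", ($dot_matches{$label} ? 'match' : 'no match');
}
# empty string: no match
# newline only: no match
# letter then newline: match
```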

Metacharacters

Regular expressions can be really sophisticated. For example, if we want to check whether a string contains an a or an o:

my $string = 'Hello';
if ($string =~ m/a|o/) {
    say 'a or o was found';
}

In this case, we use the metacharacter | to indicate that we want to match a or o. Another metacharacter is +, which matches one or more consecutive appearances of the same character:

my $string = 'Hello, World!';
if ($string =~ m/l+/) {
    say 'Found at least one l';
}

Similarly, * is used to indicate that there can be 0 or more appearances of the character:

my $string = 'Hello, World!';
if ($string =~ m/l*/) {
    say 'There may be a letter l or more in the string';
}

Since * also accepts zero appearances, the match succeeds even when the character is not in the string at all, so this condition is true as well:

my $string = 'Hello, World!';
if ($string =~ m/j*/) {
    say 'There may be a letter j or more in the string';
}

Finally, ? indicates that something appears once or not at all. Like *, it also succeeds when the character is missing:

my $string = 'Hello, World!';
if ($string =~ m/j?/) {
    say 'There may be a letter j or not in the string';
}
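To see what these quantifiers really match, it can help to capture the matched text with parentheses and print it. The sketch below does that for + and *, and uses the strings 'color' and 'colour' (chosen only for this example) to show a more practical use of ?.

```perl
use strict;
use warnings;
use feature 'say';

my $string = 'Hello, World!';

# In list context a match returns the text captured by the parentheses.
my ($plus) = $string =~ m/(l+)/;    # first run of one or more l
my ($star) = $string =~ m/(l*)/;    # zero l at the very start is already enough

say "l+ matched '$plus'";    # l+ matched 'll'
say "l* matched '$star'";    # l* matched ''

# j never appears: j+ fails, while j* succeeds with an empty match.
my $j_plus = $string =~ m/j+/ ? 1 : 0;
my $j_star = $string =~ m/j*/ ? 1 : 0;
say "j+: $j_plus, j*: $j_star";    # j+: 0, j*: 1

# ? makes the preceding character optional, so both spellings match.
my $spellings = 0;
for my $word ('color', 'colour') {
    $spellings++ if $word =~ m/colou?r/;
}
say "$spellings spellings matched";    # 2 spellings matched
```

Because * and ? can succeed without consuming any character, they are rarely useful on their own in a condition. They become useful when combined with other parts of a pattern.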

These metacharacters are more useful in slightly more complex regular expressions. For example, if we want to know whether a user has written their name, we can make sure there's at least one lowercase letter:

my $string = 'Larry';
if ($string =~ m/[a-z]+/) {
    say 'The string has at least one letter, so it can be a name';
}

Substitutions

Up until now, we have used regular expressions to match characters, with the m at the beginning of the expression. Another widely used operation is substitution, which is written with s:

my $string = 'Hello, World!';
$string =~ s/Hello/Good Bye/;
say $string;

In this case, we first indicate that we want to do a substitution with s, then we give the pattern we want to replace, and finally the text that replaces it.
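A substitution also returns a value: the number of replacements it made, or a false value when nothing matched. The sketch below runs the same substitution twice to show both outcomes. The variable names $first and $second are illustrative.

```perl
use strict;
use warnings;
use feature 'say';

my $string = 'Hello, World!';

# The first attempt finds Hello and replaces it.
my $first = $string =~ s/Hello/Good Bye/;

# The second attempt finds nothing, because Hello is already gone.
my $second = $string =~ s/Hello/Good Bye/;

say "First attempt replaced something: ",  ($first  ? 'yes' : 'no');    # yes
say "Second attempt replaced something: ", ($second ? 'yes' : 'no');    # no
say $string;                                                            # Good Bye, World!
```

This makes it possible to put a substitution inside an if condition and react only when the text was actually changed.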

Exercise

Perl's motto is "There is more than one way to do it". In this exercise, change this sentence so that it prints the motto:

my $string = "There is only one way to do it";
$string =~
say $string;

Modifiers

Modifiers are letters that are written at the end of the regular expression and that influence the result. For example, we can use i to do case-insensitive matching:

my $string = 'Hello, World!';
if ($string =~ m/h/i) {
    say 'There is an h or an H in the string';
}

Without the i at the end, there would be no match, since it would look only for a lowercase h. Another very common modifier is g:

my $string = 'Hello, World!';
$string =~ s/l/L/g;
say $string;

The g stands for global: it tells the substitution to replace all appearances of l with L, so the result is HeLLo, WorLd!. If we leave out the g, the result is HeLlo, World!, as only the first match is substituted.

Modifiers can be used in substitutions and in matches alike.
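As a small sketch of this, g can be used on a match to collect every occurrence (a match with g in list context returns all the matches), and several modifiers can be combined on one expression. The names @all_l, $count and $masked are made up for the example.

```perl
use strict;
use warnings;
use feature 'say';

my $string = 'Hello, World!';

# With g, a match in list context returns every occurrence it finds.
my @all_l = $string =~ m/l/g;
my $count = scalar @all_l;
say "The letter l appears $count times";    # The letter l appears 3 times

# Modifiers can be combined: g replaces every match, i ignores case.
# The copy keeps the original string untouched.
(my $masked = $string) =~ s/L/_/gi;
say $masked;    # He__o, Wor_d!
say $string;    # Hello, World!
```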

Anchors

Anchors are special characters in regular expressions that tie a match to the beginning (^) or the end ($) of the string. For example:

my $string = 'Hello, World!';
if ($string =~ m/^H/) {
    say 'There is an H at the beginning of the string';
}

This is true because the string begins with an H. However:

my $string = 'Hello, World!';
if ($string =~ m/^o/) {
    say 'There is an o at the beginning of the string';
} else {
    say 'There is no o at the beginning of the string';
}

Notice how it doesn't match in this example because, even though the string contains two o characters, neither of them is at the beginning. The same happens when we anchor the regular expression to the end of the string:

my $string = 'Hello, World!';
if ($string =~ m/!$/) {
    say 'There is an ! at the end of the string';
}

In this case, it matches because the last character of the string is an exclamation mark.
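Both anchors can be used in the same expression, so that the pattern has to cover the whole string rather than just a part of it. The sketch below, with made-up candidate strings, checks whether a string consists only of lowercase letters.

```perl
use strict;
use warnings;
use feature 'say';

my %only_lowercase;
for my $candidate ('larry', 'larry wall', 'Larry') {
    # ^ and $ together force [a-z]+ to span the string from start to end.
    $only_lowercase{$candidate} = $candidate =~ m/^[a-z]+$/ ? 1 : 0;
    say "'$candidate': ", ($only_lowercase{$candidate} ? 'only lowercase letters' : 'something else');
}
# 'larry': only lowercase letters
# 'larry wall': something else
# 'Larry': something else
```

Without the anchors, all three strings would match, because each contains at least one lowercase letter somewhere. Note that $ also matches just before a newline at the very end of the string, so a string read from input with its trailing newline would still pass this check.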

Exercise

In this exercise, create the regular expression so that the printed statement is true, using character classes and anchors:

my $string = "Perl was born in 1987";
if ($string =~ ) {
    say "The string ends with a year";
}