
Let's master Perl regular expressions

This article describes Perl's regular expressions. After reading it, you will be able to quickly master the regular expressions you use day to day. Regular expressions let you represent a set of strings, and you can search for and replace strings that match a regular expression.

You can use a regular expression to represent a set of strings. For example, let's express the three strings "a", "aa", and "aaa" with a regular expression. You can write the following using the regular expression "{}" that expresses the number of consecutive characters.

a{1,3}

The set of strings "a", "aa", and "aaa" is represented by the single regular expression "a{1,3}". "{}" is called a quantifier and expresses how many times the preceding character repeats.

a
aa->a{1,3}
aaa

Let's look at another example of a regular expression. The set of strings "p1", "q1", and "r1" can be represented by the single regular expression "[pqr]1". "[]" is called a character class and represents a set of characters, any one of which may appear at that position.

p1
q1->[pqr]1
r1

In this way, you can write multiple strings with one regular expression.

Pattern matching

Pattern matching is an operation that checks whether the target string contains any of the strings represented by a regular expression. Use the pattern matching operator to do pattern matching.

# Pattern matching
$str =~ /regular expression/</pre>

The pattern matching operator is "=~". The regular expression must be enclosed in slashes "/". The expression returns true if there is a match and false if there is not. Pattern matching is often used in combination with an if statement.

<pre>
if ($str =~ /regular expression/) {
  # What to do if there is a match
}

Let's check if the regular expression "a{1,3}" pattern matches the string "This string is aa.".

String: "This string is aa."

Regular expression: "a{1,3}"

The script is as follows.

my $str = 'This string is aa.';

if ($str =~ /a{1,3}/) {
  print "Match\n";
}

"Match" is output.

On the other hand, there is also the operator "!~", which is true when the pattern does not match.

$str !~ /regular expression/</pre>
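
As a minimal sketch of the two operators side by side, the following checks the regular expression "a{1,3}" against two strings. The second sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Sample strings chosen only for illustration.
my @samples = ('This string is aa.', 'No such letter here.');

my @results;
for my $str (@samples) {
    # =~ is true when the pattern is found somewhere in the string.
    push @results, "match: $str"    if $str =~ /a{1,3}/;
    # !~ is true when the pattern is not found anywhere in the string.
    push @results, "no match: $str" if $str !~ /a{1,3}/;
}

print "$_\n" for @results;
```

Each string produces exactly one line, because "=~" and "!~" are opposites of each other.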

Get the matched string

You can use pattern matching to get the matched string. To get the matched string, enclose the part you want to get in "()". Let's get "aa" and "r1" in the string "This string is aar1".

<pre>
String: "This string is aar1"

Regular expression: (a{1,3})([pqr]1)

The script looks like this. The parts enclosed in parentheses are assigned to the predefined variables "$1" and "$2" in order.

my $str = 'This string is aar1';
if ($str =~ /(a{1,3})([pqr]1)/) {
  print "Get $1 and $2\n";
}

The output is as follows.

Get aa and r1

Let's change the string to "This string is aaap1". The output is as follows.

Get aaa and p1
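
As a small variation, and going slightly beyond what is described above, a match evaluated in list context returns the captured parts directly, so they can be assigned to named variables instead of reading "$1" and "$2". This is only a sketch; the names $letters and $code are illustrative.

```perl
use strict;
use warnings;

my @captured;
for my $str ('This string is aar1', 'This string is aaap1') {
    # In list context a successful match returns the captured groups;
    # a failed match returns an empty list, which makes the condition false.
    if (my ($letters, $code) = $str =~ /(a{1,3})([pqr]1)/) {
        push @captured, "$letters and $code";
    }
}

print "Get $_\n" for @captured;
```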

Replacement

You can use a regular expression to replace the matched string. The replacement syntax is as follows:

Target string =~ s/regular expression/replacement string/;

Now let's replace the string "This string is aa" with the string "This string is bb". This example uses the regular expression "a{1,3}" that matches aa.

Regular expression   Replacement string
a{1,3}            -> bb

Target string
This string is aa -> This string is bb

The script looks like this:

my $str = 'This string is aa';
$str =~ s/a{1,3}/bb/;

Only the first match is replaced. For example, if you have the string "This string is aa aaa", the above replacement results in "This string is bb aaa".

Use the regular expression option "g" to replace all matching strings. Regular expression options are specified at the end.

# Replace all matched strings
Target string =~ s/regular expression/replaced string/g;

The script that replaces all the matched strings looks like this:

my $str = 'This string is aa aaa';
$str =~ s/a{1,3}/bb/g;

The output of the replaced string is as follows.

This string is bb bb
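
The difference between the two forms can be checked with a short sketch. It also uses the return value of the substitution, which is the number of replacements made; the variable names are illustrative.

```perl
use strict;
use warnings;

my $first_only = 'This string is aa aaa';
my $all        = 'This string is aa aaa';

# s/// returns the number of substitutions it performed.
my $count_first = ($first_only =~ s/a{1,3}/bb/);   # without g: first match only
my $count_all   = ($all        =~ s/a{1,3}/bb/g);  # with g: every match

print "$first_only ($count_first replaced)\n";
print "$all ($count_all replaced)\n";
```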

Now let's look at the various regular expression characters.

Regular expression characters

The regular expression characters that represent a set of characters or a position include:

.   Any character except a line break
\d  Digit
\D  Character other than a digit
\w  Word character ("a-z", "A-Z", "0-9", and underscore "_")
\W  Character other than a word character
\s  Whitespace (space " ", tab "\t", line break characters "\n" and "\r", etc.)
\S  Character other than whitespace
^   Beginning of the string
$   End of the string
\b  Word boundary

Any single character except a line break "."

Use "." to represent any single character except a line break.

a
b
c->.
)
@</pre>

You can also make "." match any character, including line breaks. To do so, use the regular expression option "s".

<pre>
# Make "." match any character, including line breaks
$str =~ /./s;
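
A minimal sketch of the difference, using a made-up string that contains a line break between "a" and "b":

```perl
use strict;
use warnings;

my $str = "a\nb";

# Without s, "." refuses to match the line break between "a" and "b".
my $without_s = ($str =~ /a.b/)  ? 1 : 0;
# With s, "." matches the line break as well.
my $with_s    = ($str =~ /a.b/s) ? 1 : 0;

print "without s: $without_s, with s: $with_s\n";
```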

Digit "\d"

Use "\d" to represent a digit. It matches 0-9.

0
1
2->\d
8
9

Other than digits "\D"

Use "\D" to represent a non-digit character. It is the complement of "\d".

Other than a digit "\d"->\D

Whitespace "\s"

Whitespace means spaces, tabs, form feeds, line feeds, and carriage returns. Use "\s" to represent whitespace.

space " "
Tab "\t"
Form feed "\f"->\s
Line feed "\n"
Carriage return "\r"

Other than whitespace "\S"

Use "\S" to represent non-whitespace characters. It is the complement of "\s".

Other than whitespace "\s"->\S

Word character "\w"

A word character is a letter, a digit, or the underscore. Use "\w" to represent a word character.

A-Z
a-z
       ->\w
0-9
_

Non-word characters "\W"

Use "\W" to represent non-word characters.

Other than the word character "\w"->\W
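
The following sketch classifies a few sample characters with "\d", "\w", and "\s". The sample characters and the labels "space" and "tab" are chosen only for illustration.

```perl
use strict;
use warnings;

# Each entry is [label, character]; labels make whitespace visible in the output.
my @samples = (
    ['7', '7'], ['x', 'x'], ['_', '_'],
    ['space', ' '], ['tab', "\t"], ['@', '@'],
);

my %classes_for;
for my $sample (@samples) {
    my ($label, $ch) = @$sample;
    my @classes;
    push @classes, '\d' if $ch =~ /\d/;   # digit
    push @classes, '\w' if $ch =~ /\w/;   # word character
    push @classes, '\s' if $ch =~ /\s/;   # whitespace
    $classes_for{$label} = join(' ', @classes) || 'none';
    printf "%-5s => %s\n", $label, $classes_for{$label};
}
```

Note that a digit is also a word character, so "7" is reported under both "\d" and "\w".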

The beginning of the string "^"

Some regular expression characters represent a position rather than a character. Use "^" to represent the beginning of the string.

Beginning of string->^</pre>
This regular expression character has zero width: it matches a position and consumes no characters, so use it in combination with other characters. For example, the regular expression "^abc" matches strings that start with "abc". A string that merely contains "abc" does not match unless it starts with "abc".

<pre>
abcppp
abcqqq->^abc
abcrrr

When combined with the m option, "^" changes to mean the beginning of a line. Use "\A" if you want to represent the beginning of the string when using the m option. "\A" always represents the beginning of the string, whether or not the option is present.

End of string "$"

Use "$" to represent the end of a string. For example, the regular expression "abc$" matches strings ending in "abc".

pppabc
qqqabc->abc$
rrrabc

When combined with the m option, "$" changes to mean the end of a line. Use "\z" if you want to represent the end of the string when using the m option. "\z" always represents the very end of the string, whether or not the option is present. Note that even without the m option, "$" also matches just before a line break at the very end of the string.
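
The behavior of the anchors with and without the m option can be checked with a short sketch. The two-line sample text and the variable names are illustrative.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# "^": start of string only, unless m is given.
my $caret_plain = ($text =~ /^second/)   ? 1 : 0;   # 0
my $caret_m     = ($text =~ /^second/m)  ? 1 : 0;   # 1
# "\A": always the start of the string, even with m.
my $start_A_m   = ($text =~ /\Asecond/m) ? 1 : 0;   # 0

# "$" with m matches at the end of each line.
my $dollar_m    = ($text =~ /first line$/m)   ? 1 : 0;   # 1
# "\z" is only the very end of the string, even with m.
my $end_z_m     = ($text =~ /first line\z/m)  ? 1 : 0;   # 0
# Without m, "$" still matches before a final line break; "\z" does not.
my $dollar_last = ($text =~ /second line$/)   ? 1 : 0;   # 1
my $end_z_last  = ($text =~ /second line\z/)  ? 1 : 0;   # 0

print join(' ', $caret_plain, $caret_m, $start_A_m,
                $dollar_m, $end_z_m, $dollar_last, $end_z_last), "\n";
```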

Word boundary "\b"

Use "\b" to represent a word boundary. A word boundary is the position between a word character and a non-word character (or the beginning or end of the string). For example, the regular expression "abc\b" does not match "abcd" but does match "abc/", "abc@", and "abc".

abc/
abc@->abc\b
abc
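
The four strings mentioned above can be tested directly. This is a minimal sketch; the hash name %hit is illustrative.

```perl
use strict;
use warnings;

my %hit;
for my $str ('abcd', 'abc/', 'abc@', 'abc') {
    # \b requires that "abc" is not followed by another word character.
    $hit{$str} = ($str =~ /abc\b/) ? 1 : 0;
}

print "$_: $hit{$_}\n" for sort keys %hit;
```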

Character class

Use a character class to represent a set of multiple characters.

[Character set]

For example, if you want to express the letters "a", "b", or "c", write [abc].

a
b->[abc]
c

You can use the hyphen "-" symbol to specify a range of letters or numbers.

a
b
c->[a-e]
d
e

0
1
2->[0-4]
3
4

If you want to express either a letter or a digit, write [a-zA-Z0-9].

Letters and digits->[a-zA-Z0-9]

You can also use a character class to represent any character other than the listed ones. Put "^" at the beginning of the character class to do so. Note that "^" at the beginning of a character class "[]" means "other than", not "the beginning of the string".

For example, to represent a character other than "a", "b", and "c":

Characters other than "a", "b" and "c"->[^abc]
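
As a sketch of character classes in use, the following collects every character of a sample string that belongs to each class. It relies on the g option in list context, which returns all matches; the sample string is made up.

```perl
use strict;
use warnings;

my $str = 'room 42b, floor 7';

# With g in list context, the match returns every matching substring.
my @digits    = $str =~ /[0-9]/g;          # range of digits
my @a_to_e    = $str =~ /[a-e]/g;          # range of letters
my @not_alnum = $str =~ /[^a-zA-Z0-9]/g;   # negated class

print "digits: @digits\n";
print "a-e: @a_to_e\n";
print "non-alphanumeric count: ", scalar(@not_alnum), "\n";
```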

Quantifier

You can use a quantifier to specify how many times the immediately preceding character repeats. The quantifiers are:

?      0 or 1 of the preceding character
*      0 or more of the preceding character
+      1 or more of the preceding character
{m,n}  At least m and at most n of the preceding character
{m,}   m or more of the preceding character
{0,n}  n or fewer of the preceding character

? 0 or 1 of the preceding character

"?" is a quantifier that expresses that the preceding character appears 0 or 1 times. In other words, the preceding character is optional.

aaap
        ->aaap?
aaa

Quantifiers can be applied not only to literal characters but also to regular expression characters and character classes.

aaa
aaap
        ->aaa[pqr]?
aaaq
aaar

* 0 or more of the preceding character

"*" is a quantifier that expresses that the preceding character appears 0 or more times.

aaa
aaap->aaap*
aaapp

+ 1 or more of the preceding character

"+" is a quantifier that expresses that the preceding character appears 1 or more times.

aaap
aaapp->aaap+
aaappp

{m,n} At least m and at most n of the preceding character

"{m,n}" is a quantifier that expresses that the preceding character appears at least m and at most n times.

aaap
aaapp->aaap{1,3}
aaappp

{m,}

"{m,}" is a quantifier that expresses that the preceding character appears m or more times.

aaapp
aaappp->aaap{2,}
aaapppp

{0,n}

"{0,n}" is a quantifier that expresses that the preceding character appears n or fewer times.

aaappp
aaapppp->aaap{0,5}
aaappppp
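
The quantifiers can be compared with a small sketch. Because an unanchored pattern such as "aaap?" also matches inside a longer string like "aaapp", the sketch anchors each pattern with "\A" and "\z" so that the whole string has to match. The sample strings are illustrative.

```perl
use strict;
use warnings;

my @strings = ('aaa', 'aaap', 'aaapp', 'aaapppp');

my %matches;
for my $quantifier ('?', '*', '+', '{1,3}', '{2,}') {
    # The quantifier is interpolated into the pattern and applies to "p".
    my $regex = qr/\Aaaap$quantifier\z/;
    $matches{$quantifier} = [ grep { $_ =~ $regex } @strings ];
    print "aaap$quantifier matches: @{ $matches{$quantifier} }\n";
}
```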

A set of strings

Use "|" to represent a choice among several strings. It is usually used in combination with parentheses "()".

a123b
a456b->a(123|456|789)b
a789b

Escape regular expression characters

To make a regular expression character mean an ordinary character, add a "\" immediately before it. For example, to represent the character "." itself, use "\.".

.txt->\.txt

If you use the quotemeta function, it automatically escapes the entire string.

my $regex = quotemeta('.txt');

You can also use the special regular expression sequence "\Q" for escaping. The characters from "\Q" to "\E" are escaped. If there is no "\E", escaping continues to the end of the regular expression.

/\Q.txt\E/
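
The effect of escaping shows up when the pattern comes from a variable. In this sketch, the strings 'atxt' and 'memo.txt' are made-up examples.

```perl
use strict;
use warnings;

my $literal = '.txt';
my $escaped = quotemeta($literal);   # the "." is now preceded by "\"

# Unescaped, "." matches any character, so "atxt" is accepted by mistake.
my $unescaped_hit = ('atxt'     =~ /$literal\z/)      ? 1 : 0;
# Escaped with \Q...\E, only a literal ".txt" matches.
my $escaped_miss  = ('atxt'     =~ /\Q$literal\E\z/)  ? 1 : 0;
my $escaped_hit   = ('memo.txt' =~ /\Q$literal\E\z/)  ? 1 : 0;

print "$escaped $unescaped_hit $escaped_miss $escaped_hit\n";
```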

Regular expression technique

(?:) Non-capturing parentheses

The regular expression parentheses "()" are used both for capturing and for grouping alternatives such as "(A|B)". If you only need the grouping, you can use "(?:)" instead of "()" so that nothing is captured.

(?:A|B)
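
The difference is visible in what ends up being captured. In this sketch the sample string and variable names are illustrative.

```perl
use strict;
use warnings;

my $str = 'a456b-x';

# Capturing group: the alternation fills $1 and the suffix fills $2.
my @with_capture    = $str =~ /a(123|456|789)b-(\w)/;
# Non-capturing group: the alternation is only grouped, so the suffix is $1.
my @without_capture = $str =~ /a(?:123|456|789)b-(\w)/;

print "with capture: @with_capture\n";
print "without capture: @without_capture\n";
```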

Change the enclosing character of the regular expression

If your regular expression contains many slashes, escaping each one with "\" quickly becomes tedious. In such cases, you can put m immediately before the regular expression and change its enclosing character.

/\/aaa\/bbb/
# The following have the same meaning as the above
m#/aaa/bbb#
m{/aaa/bbb}

You can also change the enclosing character for replacement.

s#aaa#bbb#
s|aaa|bbb|

Regular expression reference

You can create a regular expression reference with the qr operator. The result, including any regular expression options, can be assigned to a variable.

my $regex = qr/(\d+)/sm;

The variable can then be used in ordinary pattern matching.

my $num = 34;
if ($num =~ /$regex/) {
  # What to do if there is a match
}
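
A regular expression created with qr can also be embedded in a larger pattern. The following is a minimal sketch; the '640x480' sample and the variable names are illustrative.

```perl
use strict;
use warnings;

my $number = qr/(\d+)/;

# Used on its own.
my ($alone) = '34' =~ $number;

# Embedded twice in a larger pattern; each copy keeps its own capture group.
my ($width, $height) = '640x480' =~ /\A${number}x${number}\z/;

print "$alone $width $height\n";
```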

Regular expression options

You can specify options for the regular expression as needed.

g  Repeat pattern matching
s  Make "." match a line break
m  Make "^" and "$" match the beginning and end of a line
i  Match case-insensitively
e  Use an expression for the replacement
x  Ignore whitespace in the regular expression
o  Perform variable expansion only once

Regular expression options are specified at the end of the regular expression. It is also possible to combine multiple options.

/regular expression/sm

g Replace all matched strings

You can replace everything that matches by specifying the g option.

$message2 =~ s/yah/yes/g;

We have already seen g, s, and m in passing. The sections below recap m and s and then cover the remaining options.

m Match "^" and "$" to the beginning and end of a line

You can use the m option to match "^" and "$" to the beginning and end of a line.

$message =~ /^i/m

s Match "." to a line break

You can use the s option to make "." match a line break as well.

$message =~ /^i.+/s

i Match case-insensitively

You can use the i option to match case-insensitively.

abc
Abc->/abc/i
ABC

e Use an expression for the replacement

The e option allows you to use an expression as the result of the replacement.

The following example doubles the matched number.

s/(\d+)/$1 * 2/e;
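
A complete sketch of the same idea, combined with the g option so that every number in the string is doubled. The sample string is made up, and the copy-then-substitute idiom keeps the original string unchanged.

```perl
use strict;
use warnings;

my $price_list = 'apple 120, pear 80';

# Copy the string, then run the substitution on the copy.
# With e, the replacement part is evaluated as Perl code for each match.
(my $doubled = $price_list) =~ s/(\d+)/$1 * 2/ge;

print "$doubled\n";
```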

x Ignore whitespace in a regular expression

You can use the x option to ignore whitespace in a regular expression. You will also be able to write comments. If space is not important, you can use the x option to write the regular expression more clearly.

my $time = '03:02:56';
my $regex = qr/
  (\d{2})   # hour
  :(\d{2})  # min
  :(\d{2})  # seconds
/x;

if ($time =~ /$regex/) {
  my $hour = $1;
  my $minute = $2;
  my $second = $3;
}

If you want to match a space itself, escape it as "\ ".

o Variable expansion only once

The regular expression option "o" allows variable expansion to be done only once. This will improve performance because it doesn't re-evaluate the regular expression, but it's also prone to bugs, so use it as needed.

my $regex = "a pen";
for (1 .. 10) {
  $message =~ /$regex/o;
}

With the o option, variable expansion is done only the first time. In other words, from the second iteration of the loop onward, /a pen/ is used as it is, without repeating the variable expansion /$regex/ → /a pen/.

Shortest match

Perl regular expressions have one behavior to be aware of: by default, a quantifier matches the longest possible string. For example, which part would match given the following regular expression and string?

# Regular expression
.+\s

# String
aaa bbb ccc

The regular expression .+\s means "one or more characters other than a line break, followed by a whitespace character". There are multiple candidates in the example above: "aaa " and "aaa bbb " (each including the trailing space). Perl's default behavior is to take the longest match, so it matches "aaa bbb ".

# Matching string
"aaa bbb "

It is also possible to match only "aaa ". This technique is called the shortest match. Add "?" after the quantifier for the shortest match. If you change the regular expression to .+?\s, it matches "aaa ".

# Regular expression with the shortest match
.+?\s

The regular expression now matches "aaa ".

"aaa "

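The longest and shortest match can be compared side by side. In this sketch the patterns are wrapped in parentheses only to capture what was matched; the variable names are illustrative.

```perl
use strict;
use warnings;

my $str = 'aaa bbb ccc';

# Default (longest) match: .+ takes as much as it can before a whitespace.
my ($longest)  = $str =~ /(.+\s)/;
# Shortest match: .+? stops at the first whitespace it can reach.
my ($shortest) = $str =~ /(.+?\s)/;

print "longest: [$longest]\n";
print "shortest: [$shortest]\n";
```

Both results include the trailing space, because "\s" is part of the match.
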
Japanese and regular expressions

To use Japanese in a regular expression, both the regular expression and the target string must be decoded strings. Please refer to How to use the Encode module for the conversion to a decoded string.
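
As a rough sketch of what this means in practice, the following assumes that the script file is saved as UTF-8 and that the input arrives as UTF-8 bytes. The sample sentence and variable names are illustrative; "use utf8" makes the literals in the script (including the regular expression) decoded strings, and decode from the Encode module does the same for the input.

```perl
use strict;
use warnings;
use utf8;                        # literals in this UTF-8 script are decoded strings
use Encode qw(decode encode);

# Simulated input: UTF-8 encoded bytes, as they might be read from a file.
my $bytes = encode('UTF-8', 'これはペンです');
my $text  = decode('UTF-8', $bytes);   # decoded string

# On the decoded string, "." counts characters.
my $seven_chars = ($text  =~ /\A.{7}\z/) ? 1 : 0;
# On the raw bytes, "." counts bytes, so the same pattern fails.
my $seven_bytes = ($bytes =~ /\A.{7}\z/) ? 1 : 0;
# A Japanese pattern matches the decoded string.
my $found_pen   = ($text  =~ /ペン/)     ? 1 : 0;

print length($text), " characters, ", length($bytes), " bytes\n";
```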

In closing

If you remember what you learned in this article, you should have few problems in practice. Regular expressions let you easily extract or replace only the lines you need. Please enjoy regular expressions, one of the real pleasures of Perl.
