Perl string processing basics

This article describes string processing in Perl, a language that is well suited to text processing. It covers string output, here documents, searching, replacing, format specification, and more in an easy-to-understand way. It also explains how to handle Japanese text.

Basics of strings

First of all, I will explain the basics of strings.

How to make a string

Let's create a string first. Use single quotes or double quotes to create the string.

# Create a string with single quotes
my $str1 = 'Hello';

# Create a string with double quotes
my $str2 = "Hello";

When enclosed in double quotes

A string can be enclosed in either single quotes or double quotes. If you enclose it in double quotes, you can use the double-quoted string escape sequences and variable expansion.

  • Double quote string escape sequences can be used
  • Variable expansion can be used

Escape sequence

Escape sequences are special character notations that can be used inside single-quoted and double-quoted strings. Single-quoted strings recognize only "\'" and "\\", while double-quoted strings support many more.

For example, you can use the line break "\n" or the tab "\t" . Line breaks and tabs are invisible characters, but they can be represented using escape sequences.

The following are escape sequences in double-quoted strings.

# new line
my $str1 = "Hello\n";

# Tab character
my $str2 = "Cat\tDog";

The most frequently used escape sequences are the line break and the tab, so be sure to remember these two.

The "\" symbol is used for escaping. If you want to express the symbol "\" itself, use "\\".

# To express "\" itself, write "\\"
my $str1 = "Hello \\World";
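
As a minimal sketch of the difference between the two quote styles, the following compares the same characters written in single quotes and in double quotes. The variable names are only for illustration.

```perl
use strict;
use warnings;

# In single quotes, \t and \n stay as a backslash followed by a letter
my $single = 'Cat\tDog\n';

# In double quotes, \t and \n each become one character
my $double = "Cat\tDog\n";

print length($single), "\n";   # 10
print length($double), "\n";   # 8
```

The two extra characters in the single-quoted string are the backslashes, which are kept literally.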

If you want to use single quotes or double quotes

If you want to use a single quote in a string enclosed in single quotes, escape it with a backslash.

# Use single quotes
my $str1 = 'Kimoto\'s cat';

If you want to use double quotes in a string enclosed in double quotes, use escape sequences.

# Use double quotes
my $str2 = "Can't open \"foo.txt\".";

Variable expansion

Variable expansion is a feature that lets you use a variable inside a string enclosed in double quotes. The variable is replaced with the string it contains.

my $str1 = "Dog";

# Variable expansion: $str1 is expanded, giving "I have Dog"
my $str2 = "I have $str1";

Variable expansion is a very useful feature of Perl, so let's use it more and more.

If variable expansion fails

Variable expansion sometimes fails and, when the strict pragma is enabled, results in a compilation error. This happens when the characters that follow the variable name are also interpreted as part of the name.

For example, this happens when a variable is followed by an underscore or by Japanese characters.

my $str1 = "Dog";

# I want to expand $str1, but the variable name is interpreted as "$str1_Foo"
my $str2 = "I have $str1_Foo";

# I want to expand $str1, but the variable name is interpreted as "$str1あいうえお"
# (Japanese letters count as identifier characters under the utf8 pragma)
my $str3 = "I have $str1あいうえお";

In such cases, you can use the notation "${variable name}" to mark where the variable name ends.

${variable name}

In the previous example, writing the following works.

my $str1 = "Dog";

my $str2 = "I have ${str1}_Foo";

my $str3 = "I have ${str1}あいうえお";
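
The brace notation can be sketched in a small runnable script as follows. The variable names and the strings are made up for illustration.

```perl
use strict;
use warnings;

my $animal = 'Dog';

# Without the braces, Perl would look for a variable named $animal_house
my $label = "${animal}_house";

# Braces are also handy when letters follow the variable directly
my $plural = "I have two ${animal}s";

print "$label\n";    # Dog_house
print "$plural\n";   # I have two Dogs
```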

Concatenation of strings

Next, let's concatenate strings. Use the string concatenation operator "." to concatenate strings.

my $str1 = 'I have a ' . 'pen';

The above example concatenates strings enclosed in quotes, but you can also use variables.

my $str3 = $str1 . $str2;

The string concatenation operator can also be used in combination with the assignment operator as follows:

$str1 .= $str2;

This is the same as writing the following. It means that "$str2" is concatenated after "$str1" and the result is assigned to "$str1". It's a common writing style, so keep it in mind.

$str1 = $str1 . $str2;
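
One place where the combined operator tends to appear is when a string is built up piece by piece. The following is a small sketch of that pattern; the word list is made up for illustration.

```perl
use strict;
use warnings;

my @words = ('I', 'have', 'a', 'pen');

my $sentence = '';
for my $word (@words) {
  # Add a separating space before every word except the first
  $sentence .= ' ' if length $sentence;
  $sentence .= $word;
}

print "$sentence\n";   # I have a pen
```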

String output

Use the print function to print a string to the screen.

my $message = 'Hello';
print $message;

String processing tricks

Here are some of the string processing tricks that are sometimes used.

String list operator

In Perl, there is a string list operator that makes it easy to create a list of strings.

You can use the string list operator to create a list of strings without quotes and commas.

# Normal method
my @strs1 = ('Cat', 'Dog', 'Mouse');

# How to use the string list operator
my @strs2 = qw(Cat Dog Mouse);

This operator is especially useful when the words you want to use are listed one per line in a text file.

Cat
Dog
Mouse

You can use them in your Perl script simply by copying and pasting them inside the string list operator.

my @strs = qw(
  Cat
  Dog
  Mouse
);

Remove a line break

You may want to remove a trailing line break when reading a text file.

Perl has the chomp function for removing a line break, but its result depends on the environment.

When running on Windows, "CR + LF", the Windows line break, is effectively removed (text-mode input has already converted it to "LF"). When running on Unix/Linux/Mac, only "LF" is removed, so a "CR" from a Windows-format file remains.

To ensure that the line break is removed in either case, it's a good idea to use the following regular expression.

$str =~ s/\x0D?\x0A?$//;

I'll talk more about regular expressions later.
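
As a quick check of the line break removal, the sketch below applies the regular expression to three sample strings: one ending in "CR + LF", one ending in "LF", and one with no line break. The sample data is made up for illustration.

```perl
use strict;
use warnings;

# Sample lines: Windows style (CR + LF), Unix style (LF), and no line break
my @lines = ("Cat\x0D\x0A", "Dog\x0A", "Mouse");

my @cleaned;
for my $line (@lines) {
  # Work on a copy so that @lines is left untouched
  (my $copy = $line) =~ s/\x0D?\x0A?$//;
  push @cleaned, $copy;
}

print join(',', @cleaned), "\n";   # Cat,Dog,Mouse
```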

Concatenate array elements with the specified characters

In string processing, CSV data (data separated by commas) and TSV data (data separated by tabs) are often created.

In such cases, prepare the data in an array and concatenate it with specific characters (commas and tabs).

You can easily do this with the join function.

my @animals = ('Cat', 'Dog', 'Mouse');

# Create CSV data "Cat,Dog,Mouse"
my $csv = join(',', @animals);

# Create TSV data in which "Cat", "Dog", and "Mouse" are separated by tabs
my $tsv = join("\t", @animals);

Split the string with the specified characters

Above, I explained how to create CSV data and TSV data. Conversely, you may want to turn CSV data or TSV data into an array.

In such cases, you can use the split function to create an array with specific characters as delimiters.

my $csv = 'Cat,Dog,Mouse';

# Create an array by splitting on commas. ('Cat', 'Dog', 'Mouse')
my @animals = split(/,/, $csv);

An example for TSV data follows.

my $tsv = "Cat\tDog\tMouse";
my @animals = split(/\t/, $tsv);

A regular expression can be specified directly as the first argument of the split function. For example, you can use the regular expression for "one or more spaces".

# Make an array of data separated by one or more spaces
my $data = "Cat Dog Mouse";
my @animals = split(/ +/, $data);

I'll talk more about regular expressions later.
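
Combining the two functions, the following sketch splits comma-separated data that has stray spaces around the commas and then joins the result with tabs. It assumes simple data without quoted fields or embedded commas; real CSV files usually need a dedicated parser.

```perl
use strict;
use warnings;

# Commas surrounded by inconsistent spaces
my $csv = 'Cat, Dog ,Mouse';

# Split on a comma together with any whitespace around it
my @animals = split(/\s*,\s*/, $csv);

# Join the cleaned-up fields with tabs
my $tsv = join("\t", @animals);

print scalar(@animals), "\n";   # 3
print "$tsv\n";
```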

Get the length of the string

Use the length function to get the length of the string.

# Get the length of the string
my $length = length "ABCDE";

String format specification

You can use the sprintf function if you want to specify the number of decimal places or specify a format that fills the left with 0s.

# Pad to 3 digits with 0s on the left. The result is "Code is 013"
my $str1 = sprintf("Code is %03d", 13);

# Round to 3 decimal places. The result is "Number is 0.146"
my $str2 = sprintf("Number is %.3f", 0.145677);
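
A few more format specifications, as a small sketch. The values are made up for illustration.

```perl
use strict;
use warnings;

# Right-align a number in a field 5 characters wide
my $right = sprintf("%5d", 42);                       # "   42"

# Left-align a string in a field 5 characters wide
my $left = sprintf("%-5s|", "ab");                    # "ab   |"

# Hexadecimal notation
my $hex = sprintf("%x", 255);                         # "ff"

# Several values in one format, each padded with 0s
my $date = sprintf("%04d-%02d-%02d", 2024, 3, 9);     # "2024-03-09"

print "$right\n$left\n$hex\n$date\n";
```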

Here document

Perl has a syntax called here-document that makes it easy to create multi-line strings.

# Here document
my $message = <<EOS;
If you want to write
multiple lines,
you can use here document
EOS

Let me explain the syntax of here documents. A here document begins with the symbol "<<", followed by a word that marks the end of the string.

This can be any string, but I use "EOS", meaning "End of String".

Note that we write a semicolon ";" after this.

The string starts on the next line.

To end the string, write "EOS" on a line by itself. Please note that you cannot put whitespace in front of the closing EOS.

If you want to write a here document inside indented code, write it as follows. The statement is indented, but the string and the closing EOS start at the left margin.

        # Here document
        my $message = <<EOS;
If you want to write
multiple lines,
you can use here document
EOS

Treating a here document like a double-quoted or single-quoted string

If you write your here document as follows, you can use escape sequences and special characters as if you were enclosing it in double quotes.

my $message = <<EOS;
...
EOS

or

my $message = <<"EOS";
...
EOS

Note that the second example encloses EOS in double quotes. If you omit the quotes, the here document is treated as if EOS were enclosed in double quotes.

To do the same as if you enclose it in single quotes, write:

my $message = <<'EOS';
...
EOS
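
The difference between the two forms can be sketched as follows. The variable names are made up for illustration.

```perl
use strict;
use warnings;

my $name = 'Dog';

# Double-quoted form: $name is expanded
my $expanded = <<"EOS";
I have a $name
EOS

# Single-quoted form: the text is kept exactly as written
my $literal = <<'EOS';
I have a $name
EOS

print $expanded;   # I have a Dog
print $literal;    # I have a $name
```

Both strings end with a line break, because the line break before the closing EOS is part of the string.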

Regular expression

Perl has regular expressions built into the language for searching and replacing strings.

Search using pattern matching

For example, if you want to know whether a string contains "dog", you can write:

my $message = "It is my dog\n";

if ($message =~ /dog/) {
  ...
}

"=~" is called the pattern matching operator. It performs regular expression pattern matching.

The regular expression is written between two slashes, as in "/dog/".

Replace

You can also use a regular expression to make substitutions. To rewrite "dog" as "cat", write as follows.

my $message = "It is my dog\n";

$message =~ s/dog/cat/;

You can replace with the syntax "s/string to replace/string after replacement/".
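
By default a substitution replaces only the first match. The sketch below contrasts this with the "g" modifier, which replaces every match. The "g" modifier is not covered in the text above and is shown here only as a supplementary example.

```perl
use strict;
use warnings;

my $message = 'dog, dog and dog';

# Without a modifier, only the first "dog" is replaced
(my $first_only = $message) =~ s/dog/cat/;

# With the g modifier, every "dog" is replaced
(my $all = $message) =~ s/dog/cat/g;

# In scalar context, s///g returns the number of replacements made
my $copy = $message;
my $count = ($copy =~ s/dog/cat/g);

print "$first_only\n";   # cat, dog and dog
print "$all\n";          # cat, cat and cat
print "$count\n";        # 3
```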

Handling Japanese correctly in Perl

So far, all the string examples have been in English. However, I think that most of the actual string processing deals with Japanese.

Beginners often wonder how to process Japanese correctly in Perl.

For example, the length function and regular expressions return correct results only when they are applied to a decoded string.

If you find that "regular expressions don't work well with Japanese", it is likely that your program is not handling Japanese correctly.

But if you remember a few things, there is nothing to worry about. Perl's string processing isn't difficult at all.

In order to process Japanese correctly in Perl, remember the following three things.

  1. Write the source code in UTF-8 and specify the utf8 pragma
  2. Be sure to decode the string received from the outside
  3. Encode the string to be output to the outside

Write the source code in UTF-8 and specify the utf8 pragma

Save the source code in UTF-8 on Windows, Mac or Linux. Then specify the utf8 pragma.

# Specify the utf8 pragma
use utf8;

# こんにちは is "Hello" in Japanese
my $message = 'こんにちは';

For now, remember this as a convention.

Now the length function and a regular expression will work correctly for the strings written in the source code.

# Specify the utf8 pragma
use utf8;

# こんにちは is "Hello" in Japanese (5 characters)
my $message = 'こんにちは';

# The length function works correctly. 5 is returned.
my $length = length $message;

# Regular expressions work correctly
if ($message =~ /にち/) {
  ...;
}

The utf8 pragma converts the "UTF-8 byte string" written in the source code to the "decoded string".

By converting to a decoded string, you will be able to perform string processing correctly.

"Byte string" and "decoded string" will be explained later.

Be sure to decode the string received from the outside

Decoding means converting a "string written in a specific character code" into a "Perl decoded string".

For convenience, "Perl's decoded string" is called a decoded string, and a "string written in a specific character code" is called a byte string.

For example, a string written in the character code "UTF-8" is called a "UTF-8 byte string", and a string written in the character code "EUC-JP" is called an "EUC-JP byte string".

If you know how to handle strings in Java, the correspondence with "character stream" and "byte stream" makes this easy to understand.

Use the decode function of the Encode module to decode.
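
The difference between a byte string and a decoded string can be sketched by comparing lengths. The example below writes the Japanese word for "Hello" with code point escapes so that it does not depend on how the file is saved, and it simulates input from the outside by encoding that string to UTF-8 first. Both choices are assumptions made for the sake of a self-contained example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# "Hello" in Japanese (5 characters), written with code point escapes
my $original = "\x{3053}\x{3093}\x{306B}\x{3061}\x{306F}";

# Simulate a UTF-8 byte string received from the outside
my $bytes = encode('UTF-8', $original);

# Decode the byte string into a decoded string
my $decoded = decode('UTF-8', $bytes);

print length($bytes), "\n";     # 15 (bytes)
print length($decoded), "\n";   # 5 (characters)
```

The length function counts bytes for the byte string and characters for the decoded string, which is why decoding has to come before any string processing.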

Read a file with the character code "cp932" written on Windows

Let's write an example to read a file with the character code "cp932" written on Windows.

# decode.pl
use strict;
use warnings;
use utf8;
use Encode 'decode';

# Read the file line by line
while (my $line = <>) {
  # Convert the cp932 byte string to a decoded string
  $line = decode('cp932', $line);

  # 田中 is the Japanese name "Tanaka"
  if ($line =~ /田中/) {
    print "OK";
  }
}

The while statement reads line by line from the file specified by the command line argument.

An example of the file "name.txt" is as follows (the names Tanaka, Suzuki, and Yamada written in Japanese). Save it as cp932.

田中
鈴木
山田

If you execute the command as follows, OK will be displayed only once and you can see that the regular expression works correctly.

perl decode.pl name.txt

Read command line argument strings on Windows

Strings received from the outside are not limited to files. Anything read from outside the program, such as command line arguments and environment variables, is external input.

Therefore, if Japanese is received as a command line argument, it must also be decoded.

use Encode 'decode';

my $name = $ARGV[0];
$name = decode('cp932', $name);

Encode the string to be output to the outside

While strings received from the outside are decoded, strings output to the outside must always be encoded.

Encoding means converting a "decoded string" to a "byte string".

To encode, use the encode function of the Encode module.

For example, if you want to output in UTF-8, write as follows.

use utf8;
use Encode 'encode';

# あいうえお is "aiueo" written in Japanese
my $message = 'あいうえお';

# Encode before output
print encode('UTF-8', $message);

When specifying a file name

Another thing to keep in mind is that when you pass a file name to the open function and similar functions, you need to encode it in the character code used by the OS.

The following opens a file with a Japanese name. If the error message contains the Japanese file name, encode that part as well.

use utf8;
use Encode 'encode';

# テスト.txt is "test.txt" with the word "test" written in Japanese
my $file = 'テスト.txt';

# Open a file with a Japanese name
open my $fh, '<', encode('cp932', $file)
  or die encode('cp932', "Can't open file \"$file\": $!");

This is a little tedious. If you do it often, wrapping it in a subroutine or a module is one way to make it easier.
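
One possible shape for such a subroutine is sketched below. The name os_bytes and the variable $os_encoding are hypothetical, and "cp932" is an assumption that fits a Japanese Windows environment; other systems would need a different value. The file name is written with code point escapes so that the sketch does not depend on how the file is saved.

```perl
use strict;
use warnings;
use Encode 'encode';

# Assumed character code of the OS (hypothetical setting for Japanese Windows)
my $os_encoding = 'cp932';

# Hypothetical helper: convert a decoded string to bytes for the OS
sub os_bytes {
  my ($text) = @_;
  return encode($os_encoding, $text);
}

# A file name with three Japanese characters followed by ".txt"
my $file = "\x{30C6}\x{30B9}\x{30C8}.txt";

my $file_bytes = os_bytes($file);

# Each Japanese character becomes 2 bytes in cp932, so 6 + 4 = 10
print length($file_bytes), "\n";   # 10
```

With a helper like this, opening the file could be written as open my $fh, '<', os_bytes($file), and the error message could be wrapped in the same way.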

Operators related to strings

Here are the operations that are sometimes used for string processing.

Functions for string processing

Here are some functions that I sometimes use for string processing.

  • substr function: Extracts or replaces the characters at the specified position
  • index function: Searches for a string
  • rindex function: Searches for a string from the end
  • reverse function: Reverses the order of the characters in a string
  • ucfirst function: Converts the first character of a string to uppercase
  • lcfirst function: Converts the first character of a string to lowercase
  • uc function: Converts lowercase letters to uppercase
  • lc function: Converts uppercase letters to lowercase
  • chr function: Converts a character code (number) to the corresponding character
  • ord function: Converts a character to its character code (number)
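
The functions above can be tried out with a short sketch like the following. The sample strings are made up for illustration.

```perl
use strict;
use warnings;

my $text = 'Hello World';

my $head     = substr($text, 0, 5);      # 'Hello' (5 characters from position 0)
my $first_o  = index($text, 'o');        # 4 (positions start at 0)
my $last_o   = rindex($text, 'o');       # 7
my $reversed = scalar reverse($text);    # 'dlroW olleH' (scalar context is needed)

my $capital = ucfirst('perl');           # 'Perl'
my $lower1  = lcfirst('Perl');           # 'perl'
my $upper   = uc('perl');                # 'PERL'
my $lower   = lc('PERL');                # 'perl'

my $char = chr(65);                      # 'A'
my $code = ord('A');                     # 65

print "$head $first_o $last_o $reversed\n";
print "$capital $lower1 $upper $lower $char $code\n";
```

Note that reverse reverses the characters of a string only in scalar context. In list context it reverses the order of a list instead.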

Summary

If you remember this much, I think you will be able to handle about 90% of string processing, including Japanese.
