Perl string processing basics
This article describes string processing in Perl, a language that is well suited to text processing. It covers string output, here documents, searching, replacing, and format specification, and also explains how to handle Japanese text.
Basics of strings
First of all, I will explain the basics of strings.
How to make a string
Let's create a string first. Use single quotes or double quotes to create the string.
# Create a string with single quotes
my $str1 = 'Hello';
# Create a string with double quotes
my $str2 = "Hello";
When enclosed in double quotes
A string can be created with either single quotes or double quotes. A double-quoted string additionally supports the following:
- Escape sequences such as "\n" can be used
- Variable expansion can be used
Escape sequence
An escape sequence is a notation for writing a special character inside a string. Double-quoted strings support many escape sequences; single-quoted strings recognize only "\'" and "\\".
For example, you can use the line break "\n" or the tab "\t". Line breaks and tabs are invisible characters, but they can be represented using escape sequences.
The following are escape sequences in double-quoted strings.
# Line break
my $str1 = "Hello\n";
# Tab character
my $str2 = "Cat\tDog";
The line break and the tab are the most frequently used escape sequences, so be sure to remember these two.
The "\" symbol is used for escaping. If you want to express the symbol "\" itself, use "\\".
# To express "\", write "\\"
my $str1 = "Hello \\World";
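As a small sketch of the difference between the two kinds of quotes (the variable names are illustrative only), the same characters typed between single quotes and double quotes can produce strings of different lengths, because only double quotes turn "\n" into a single line-break character:

```perl
use strict;
use warnings;

# Single quotes: "\n" stays as two characters, a backslash and an "n"
my $single = 'a\nb';

# Double quotes: "\n" becomes one line-break character
my $double = "a\nb";

# "\\" produces one backslash
my $backslash = "C:\\temp";

print length($single), "\n";   # 4
print length($double), "\n";   # 3
print $backslash, "\n";        # C:\temp
```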
If you want to use single quotes or double quotes
If you want to use a single quote inside a single-quoted string, escape it with a backslash.
# Use a single quote
my $str1 = 'Kimoto\'s cat';
If you want to use a double quote inside a double-quoted string, escape it in the same way.
# Use double quotes
my $str2 = "Can't open \"foo.txt\".";
Variable expansion
Variable expansion is a feature that allows a variable to be used inside a double-quoted string. The variable is replaced by the string it contains.
my $str1 = "Dog";
# Variable expansion: $str1 is expanded, giving "I have Dog"
my $str2 = "I have $str1";
Variable expansion is a very useful feature of Perl, so let's use it more and more.
If variable expansion fails
Variable expansion sometimes fails. This happens when the text that follows the variable is also interpreted as part of the variable name; with the strict pragma enabled, the result is a compilation error.
For example, this happens when a variable is directly followed by an underscore or by Japanese characters.
my $str1 = "Dog";
# I want to expand $str1, but the variable name is interpreted as "$str1_Foo"
my $str2 = "I have $str1_Foo";
# I want to expand $str1, but the variable name is interpreted as "$str1AIUEO"
my $str3 = "I have $str1AIUEO";
In such cases, you can use the notation "${variable name}" to specify the variable name.
${Variable name}
In the previous example, writing the following would work.
my $str1 = "Dog";
my $str2 = "I have ${str1}_Foo";
my $str3 = "I have ${str1}AIUEO";
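The following is a minimal runnable sketch of the brace notation. The variable names and sample text are made up for illustration; the point is only that the braces mark where the variable name ends.

```perl
use strict;
use warnings;

my $animal = 'Dog';

# Braces mark where the variable name ends, so "_Foo" is plain text
my $with_suffix = "I have ${animal}_Foo";

# The same applies when a letter follows the variable directly
my $plural = "I have two ${animal}s";

print "$with_suffix\n";   # I have Dog_Foo
print "$plural\n";        # I have two Dogs
```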
Concatenation of strings
Next, let's concatenate strings. Use the string concatenation operator "." to concatenate strings.
my $str1 = 'I have a ' . 'pen';
The above example concatenates strings enclosed in quotes, but you can also use variables.
my $str3 = $str1 . $str2;
The string concatenation operator can also be used in combination with the assignment operator as follows:
$str1 .= $str2;
This is the same as writing the following. It means that "$str2" is concatenated after "$str1" and the result is assigned to "$str1". It's a common writing style, so keep it in mind.
$str1 = $str1 . $str2;
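One place where ".=" tends to appear is when a string is built up piece by piece in a loop. The sketch below assumes a made-up word list and simply appends each word to a growing string:

```perl
use strict;
use warnings;

my @words = ('I', 'have', 'a', 'pen');

# Start from an empty string and append one word at a time
my $sentence = '';
for my $word (@words) {
    # Put a space between words, but not before the first one
    $sentence .= ' ' if length $sentence;
    $sentence .= $word;
}

print $sentence, "\n";   # I have a pen
```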
String output
Use the print function to print a string to the screen.
my $message = 'Hello';
print $message;
String processing tricks
Here are some of the string processing tricks that are sometimes used.
String list operator
In Perl, there is a string list operator that makes it easy to create a list of strings.
You can use the string list operator to create a list of strings without quotes and commas.
# Normal method
my @strs1 = ('Cat', 'Dog', 'Mouse');
# Using the string list operator
my @strs2 = qw(Cat Dog Mouse);
This operator is especially useful when the words you want to use are listed one per line in a text file.
Cat
Dog
Mouse
You can use the string list operator in your Perl script by simply copying and pasting it.
my @strs = qw(
  Cat
  Dog
  Mouse
);
Remove a line break
You may want to remove the trailing line break when reading a text file.
Perl has the chomp function for removing a line break, but its effect depends on the environment.
On Windows, the "CR + LF" line break is effectively removed, because text-mode input translates it to "LF" before chomp sees it. On Unix/Linux/Mac, only "LF" is removed, so a line read from a file with Windows line breaks keeps a trailing "CR".
To make sure that the line break is removed in either case, it's a good idea to use the following regular expression.
$str =~ s/\x0D?\x0A?$//;
I'll talk more about regular expressions later.
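To make the difference concrete, here is a small sketch that applies both approaches to the same data. The string is built by hand to imitate a line with a Windows line break as it would look when read on Unix/Linux; the variable names are illustrative.

```perl
use strict;
use warnings;

# A line ending in CR + LF, as it might arrive from a Windows-style file
my $by_chomp = "Cat\x0D\x0A";
my $by_regex = "Cat\x0D\x0A";

# chomp removes only the trailing "\n" (the default input record separator)
chomp $by_chomp;

# The regular expression removes an optional CR followed by an optional LF
$by_regex =~ s/\x0D?\x0A?$//;

print length($by_chomp), "\n";   # 4: "Cat" plus a leftover CR
print length($by_regex), "\n";   # 3: "Cat"
```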
Concatenate array elements with the specified characters
In string processing, CSV data (data separated by commas) and TSV data (data separated by tabs) are often created.
In such cases, prepare the data in an array and concatenate it with specific characters (commas and tabs).
You can easily do this with the join function.
my @animals = ('Cat', 'Dog', 'Mouse');
# Create CSV data "Cat,Dog,Mouse"
my $csv = join(',', @animals);
# Create TSV data: "Cat", "Dog" and "Mouse" separated by tabs
my $tsv = join("\t", @animals);
Split the string with the specified characters
The above explains how to create CSV and TSV data. Conversely, you may want to turn CSV or TSV data into an array.
In such cases, you can use the split function to create an array with specific characters as delimiters.
my $csv = 'Cat,Dog,Mouse';
# Create an array by splitting on commas: ('Cat', 'Dog', 'Mouse')
my @animals = split(/,/, $csv);
An example for TSV data follows.
my $tsv = "Cat\tDog\tMouse";
my @animals = split(/\t/, $tsv);
A regular expression can be specified directly as the first argument of the split function. For example, you can use the regular expression for "one or more spaces".
# Make an array of data separated by one or more spaces
my $data = "Cat Dog Mouse";
my @animals = split(/ +/, $data);
I'll talk more about regular expressions later.
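The join and split functions are natural counterparts, so a round trip is a convenient way to check both. The sketch below uses made-up sample data and, as an assumption beyond the text above, the pattern /\s+/ to split on any run of whitespace:

```perl
use strict;
use warnings;

my @animals = ('Cat', 'Dog', 'Mouse');

# Array to CSV line
my $csv = join(',', @animals);

# CSV line back to an array
my @restored = split(/,/, $csv);

# Whitespace-separated data: /\s+/ matches one or more spaces or tabs
my @fields = split(/\s+/, "Cat   Dog\tMouse");

print $csv, "\n";                 # Cat,Dog,Mouse
print scalar(@restored), "\n";    # 3
print join('|', @fields), "\n";   # Cat|Dog|Mouse
```

Splitting on a bare comma is enough for simple data like this, but it does not handle quoted fields or commas inside values; for such CSV a dedicated parser (for example the Text::CSV module from CPAN) is probably a better fit.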
Get the length of the string
Use the length function to get the length of the string.
# Get the length of the string
my $length = length "ABCDE";
String format specification
You can use the sprintf function if you want to specify the number of decimal places or specify a format that fills the left with 0s.
# Pad to 3 digits with leading zeros: "Code is 013"
my $str1 = sprintf("Code is %03d", 13);
# Display up to 3 decimal places: "Number is 0.146"
my $str2 = sprintf("Number is %.3f", 0.145677);
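A few more format specifications, shown as a sketch with arbitrary sample values. The width, alignment and hexadecimal conversions are standard sprintf features, but they go slightly beyond the two cases described above.

```perl
use strict;
use warnings;

# Zero-padded integer, 5 digits wide
my $padded = sprintf("%05d", 42);

# Right-aligned and left-aligned strings in a field of 6 characters
my $right = sprintf("[%6s]", 'Cat');
my $left  = sprintf("[%-6s]", 'Cat');

# Two decimal places
my $price = sprintf("%.2f", 3.14159);

# Hexadecimal
my $hex = sprintf("%x", 255);

print "$padded\n";   # 00042
print "$right\n";    # [   Cat]
print "$left\n";     # [Cat   ]
print "$price\n";    # 3.14
print "$hex\n";      # ff
```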
Here document
Perl has a syntax called here-document that makes it easy to create multi-line strings.
# Here document
my $message = <<EOS;
If you want to write multiple lines,
you can use here document
EOS
Here is the syntax of here documents. A here document begins with the symbol "<<", followed by a word that marks the end of the string.
This can be any word, but I use "EOS", which stands for "End of String".
Note that a semicolon ";" is written after it.
The string starts on the next line.
To end the string, write "EOS" on a line by itself. Please note that you cannot put a blank in front of EOS.
If the statement that starts the here document is indented, write it as follows. The terminator must still be written at the beginning of the line.
  # Here document
  my $message = <<EOS;
If you want to write multiple lines,
you can use here document
EOS
Make it the same as a string enclosed in double quotes or single quotes
If you write your here document as follows, you can use escape sequences and special characters as if you were enclosing it in double quotes.
my $message = <<EOS;
...
EOS
or
my $message = <<"EOS";
...
EOS
Note that the second example encloses EOS in double quotes. If you omit the quotes, the here document behaves by default as if EOS were enclosed in double quotes.
To do the same as if you enclose it in single quotes, write:
my $message = <<'EOS';
...
EOS
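The difference between the two styles is easiest to see with a variable inside the text. In this sketch the variable $name and the sample sentence are invented for illustration:

```perl
use strict;
use warnings;

my $name = 'Dog';

# Double-quoted style: $name is expanded
my $expanded = <<"EOS";
I have a $name
EOS

# Single-quoted style: the text is kept exactly as written
my $literal = <<'EOS';
I have a $name
EOS

print $expanded;   # I have a Dog
print $literal;    # I have a $name
```

Both strings end with a line break, because the line break before the terminator is part of the here document.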
Regular expression
Perl has a built-in regular expression feature in the language for searching and replacing strings.
Search using pattern matching
For example, if you want to know whether a string contains the string "dog", you can write:
my $message = "It is my dog\n";
if ($message =~ /dog/) {
    ...
}
=~ is called the pattern matching operator and is the operator for performing regular expression pattern matching.
A regular expression is written between two slashes, as in "/dog/".
Replace
You can also use a regular expression to make substitutions. To replace "dog" with "cat", write as follows.
my $message = "It is my dog\n";
$message =~ s/dog/cat/;
The syntax for replacement is "s/string to replace/string after replacement/".
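As a short sketch that goes one step beyond the text above, the following uses two standard regular expression features not described here: parentheses, which capture the matched part into $1, and the /g modifier, which replaces every match instead of only the first. The sample sentence is made up.

```perl
use strict;
use warnings;

my $message = "It is my dog. The dog is small.";

# Parentheses capture the matched part into $1
my $animal = '';
if ($message =~ /my (\w+)/) {
    $animal = $1;
}

# Without /g only the first match is replaced
(my $first_only = $message) =~ s/dog/cat/;

# With /g every match is replaced
(my $all = $message) =~ s/dog/cat/g;

print "$animal\n";       # dog
print "$first_only\n";   # It is my cat. The dog is small.
print "$all\n";          # It is my cat. The cat is small.
```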
Handling Japanese correctly in Perl
So far, all the string examples have been in English. However, I think that most of the actual string processing deals with Japanese.
If you are a beginner, you may wonder how to process Japanese correctly in Perl.
For example, the length function and regular expressions return correct results only when they are applied to a decoded string.
If you find that "regular expressions don't work well with Japanese", it is likely that your program is not handling Japanese correctly.
But don't worry. Once you remember a few rules, Perl's string processing isn't difficult at all.
In order to process Japanese correctly in Perl, remember the following three things.
- Write the source code in UTF-8 and specify the utf8 pragma
- Be sure to decode strings received from the outside
- Be sure to encode strings output to the outside
Write the source code in UTF-8 and specify the utf8 pragma
Save the source code in UTF-8 on Windows, Mac or Linux. Then specify the utf8 pragma.
# Specify the utf8 pragma
use utf8;
my $message = 'Hello';
For now, remember this as a convention.
Now the length function and a regular expression will work correctly for the strings written in the source code.
# Specify the utf8 pragma
use utf8;
my $message = 'Hello';
# The length function works correctly. 5 is returned.
my $length = length $message;
# Regular expressions work correctly
if ($message =~ /Hello/) {
    ...;
}
The utf8 pragma converts the "UTF-8 byte string" written in the source code to the "decoded string".
Once a string has been converted to a decoded string, string processing works on it correctly.
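The effect is easier to see with actual Japanese text. The sketch below assumes the script file is saved as UTF-8 and uses a five-character greeting as sample data. It compares the length of the decoded string with the length of the same text as a UTF-8 byte string; the encode function from the Encode module, used here to obtain the bytes, is covered in the sections that follow.

```perl
use strict;
use warnings;
use utf8;                 # the source file itself must be saved as UTF-8
use Encode 'encode';

# A Japanese greeting made of five characters
my $message = 'こんにちは';

# As a decoded string, length counts characters
my $chars = length($message);

# After encoding to UTF-8, length counts bytes (three per character here)
my $bytes = length(encode('UTF-8', $message));

print "$chars\n";   # 5
print "$bytes\n";   # 15
```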
"Byte string" and "decoded string" will be explained later.
Be sure to decode the string received from the outside
Decoding means converting a "string written in a specific character code" into a "Perl decoded string".
For convenience, a "Perl decoded string" is called a decoded string, and a "string written in a specific character code" is called a byte string.
For example, a string written in the character code "UTF-8" is called a "UTF-8 byte string", and a string written in the character code "EUC-JP" is called an "EUC-JP byte string".
If you know how strings are handled in Java, this is easier to understand by comparing it to the correspondence between character streams and byte streams.
Use the decode function of the Encode module to decode.
Read a file with the character code "cp932" written on Windows
Let's write an example to read a file with the character code "cp932" written on Windows.
# decode.pl
use strict;
use warnings;
use utf8;
use Encode 'decode';
# Read file
while (my $line = <>) {
    $line = decode('cp932', $line);
    if ($line =~ /Tanaka/) {
        print "OK";
    }
}
The while statement reads line by line from the file specified as a command line argument.
An example of the file "name.txt" is as follows. Save it in cp932.
Tanaka
Suzuki
Yamada
If you execute the command as follows, OK will be displayed only once and you can see that the regular expression works correctly.
perl decode.pl name.txt
Read command line argument strings on Windows
Strings received from the outside are not limited to files. Anything read from outside the program, such as command line arguments and environment variables, counts as external.
Therefore, when Japanese is received as a command line argument, it must also be decoded.
my $name = $ARGV[0];
$name = decode('cp932', $name);
Encode the string to be output to the outside
Just as strings received from the outside must be decoded, strings output to the outside must always be encoded.
Encoding means converting a "decoded string" to a "byte string".
To encode, use the encode function of the Encode module.
For example, if you want to output in UTF-8, write as follows.
use utf8;
use Encode 'encode';
my $message = 'aiueo';
# Encode before output
print encode('UTF-8', $message);
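Putting the three rules together, a typical flow is "decode on input, process, encode on output". The sketch below simulates the external input by encoding a literal to cp932 first, so that it runs without any input file; the name used as sample data and the variable names are assumptions for illustration.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Simulate input from outside: a two-character name as a cp932 byte string
my $input_bytes = encode('cp932', '田中');

# 1. Decode as soon as the data enters the program
my $name = decode('cp932', $input_bytes);

# 2. Work on the decoded string: length counts characters, not bytes
my $char_count = length($name);

# 3. Encode just before the data leaves the program
my $output_bytes = encode('UTF-8', "$name\n");
print $output_bytes;
```

Here the input is a cp932 byte string of 4 bytes, the decoded string has 2 characters, and the UTF-8 output is 7 bytes including the line break.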
When specifying a file name
Another thing to keep in mind is that if you specify a file name with the open function etc., you need to encode it to the character code used by the OS.
When opening a file with a Japanese name, encode the file name. If the error message contains the Japanese file name, encode that part as well.
use utf8;
use Encode 'encode';
my $file = 'test.txt';
# Open a file with a Japanese name
open my $fh, '<', encode('cp932', $file)
  or die encode('cp932', "Can't open file \"$file\":$!");
I think this is a little tedious. If you do this often, wrapping it in a subroutine or a module is one way to make it easier.
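One possible shape for such a subroutine is sketched below. The name open_with_encoded_name and its argument order are hypothetical, not part of Perl or of the Encode module; the encoding is passed in as an argument on the assumption that it differs between systems (for example "cp932" on Japanese Windows).

```perl
use strict;
use warnings;
use utf8;
use Encode 'encode';

# Hypothetical helper: open a file whose name is given as a decoded string.
# $encoding is the character code the OS uses for file names.
sub open_with_encoded_name {
    my ($encoding, $mode, $file) = @_;

    # Encode both the file name and the error message before they leave Perl
    open my $fh, $mode, encode($encoding, $file)
      or die encode($encoding, "Can't open file \"$file\": $!");

    return $fh;
}
```

With this in place, a call such as open_with_encoded_name('cp932', '<', $file) returns a file handle, and the encoding details stay in one place.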
Handling Japanese in Perl is explained in more detail in the article on the Encode module.
Operators related to strings
Here are some operators that are sometimes used for string processing.
- Quote operator "q" - Alternative notation for single quotes
- Double quote operator "qq" - Alternative notation for double quotes
Function for string processing
Here are some functions that I sometimes use for string processing.
substr function | Extract or replace the substring at the specified position |
index function | Search for a string |
rindex function | Search for a string from the end |
reverse function | Reverse the order of the characters in a string |
ucfirst function | Convert the first character of a string to uppercase |
lcfirst function | Convert the first character of a string to lowercase |
uc function | Convert lowercase letters to uppercase |
lc function | Convert uppercase letters to lowercase |
chr function | Convert a character code (number) to the corresponding character |
ord function | Convert a character to its character code (number) |
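The following sketch calls each of these functions once on arbitrary sample values, so that the table can be compared against actual results. Note that reverse only reverses characters in scalar context, which is why "scalar" is written explicitly.

```perl
use strict;
use warnings;

my $text = 'Hello, World';

my $part     = substr($text, 7, 5);    # 5 characters starting at offset 7
my $first    = index($text, 'o');      # position of the first "o"
my $last     = rindex($text, 'o');     # position of the last "o"
my $reversed = scalar reverse $text;   # reverse the characters
my $upper    = uc $text;
my $lower    = lc $text;
my $capital  = ucfirst 'perl';
my $small    = lcfirst 'Perl';
my $code     = ord 'A';
my $char     = chr 66;

print "$part\n";       # World
print "$first\n";      # 4
print "$last\n";       # 8
print "$reversed\n";   # dlroW ,olleH
print "$upper\n";      # HELLO, WORLD
print "$lower\n";      # hello, world
print "$capital\n";    # Perl
print "$small\n";      # perl
print "$code\n";       # 65
print "$char\n";       # B
```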
Summary
If you remember this much, I think you will be able to handle about 90% of string processing, including Japanese.