1. Perl
  2. Module
  3. here

Encode module - Properly handles multi - byte strings such as Japanese

Use the Encode module to properly handle multibyte characters such as Japanese in Perl. In many cases, you can handle Japanese properly if you remember the following three things.

  1. The string input from the outside is decoded by the decode function of the Encode module.
  2. The string to be output to the outside is encoded by the encode function of the Encode module.
  3. Save the source code in UTF-8 and enable the utf8 pragma

Terms in this commentary

In this explanation, the string input from the outside is called "byte string ". Let's call the string converted to Perl's internal representation "decoded string ". If the "byte string" is described in a specific character code, it will be called "UTF-8 byte string" or "Shift_JIS byte string".

When dealing with Japanese in programming, it is necessary to clearly distinguish whether the string is a byte string or an decoded string. The rule when dealing with multiple languages in Perl is that the program is responsible for converting between internal and byte strings. This method is a little tricky, but the mechanism is very simple as you only have to remember the above three rules.

Be sure to decode the string input from the outside

Character strings input from the outside are always decoded using the decode function of the Encode module. Decoding is the process of converting a "byte string" to an "decoded string". When dealing with multibyte strings, be sure to use the decode function to convert them to decoded strings.

Specify the character code of the byte string in the first argument of the decode function, and specify the byte string in the second argument. The return value will be an decoded string.

use Encode 'decode';

# External input (command line argument)
my $str = shift;

# Convert byte string (external input) to decoded string (when $str is UTF-8)
$str = decode('UTF-8', $str);

# Convert byte string (external input) to decoded string (when $str is Shift_JIS)
$str = decode('Shift_JIS', $str);

Character strings input from the outside are "command line arguments", "files", "standard input", "environment variable", etc. Think of all external input as byte strings.

$str = decode('UTF-8', $str);

Is expressed graphically as follows.

                Convert "UTF-8 byte string" to "decoded string"
UTF-8 byte string - ->decoded string

The meaning of the first argument of the decode function is easy to misunderstand. Specifies which character code the byte string is actually encoded in.

Be sure to encode the string to be output to the outside

The string to be output to the outside is encoded using the encode function of the Encode module. Encoding is the process of converting an "decoded string" to a "byte string".

The first argument of the encode function specifies which character code to convert. Specify the decoded string in the second argument.

use Encode 'encode';

# When converting an decoded string to a UTF-8 byte string
$str = encode('UTF-8', $str);

# When converting an decoded string to a Shift_JIS byte string
$str = encode('Shift_JIS', $str);

Whenever you use the decode function to output a string converted to an decoded string, you must use the encode function to convert it back to a byte string. When to convert using the encode function is just before the output. Try to keep the decoded string as late as possible in your program.

$str = encode('UTF-8', $str);

Is expressed graphically as follows.

                Convert "decoded string" to "UTF-8 byte string"
Internal string - ->UTF-8 byte string

Save the source code in UTF - 8 and enable the utf8 pragma

Another thing to worry about when dealing with multibyte characters is the string described in the source code. If you need to write multi-byte characters such as Japanese in the source code, save the source code in UTF-8. Then enable the utf8 pragma.

# Save source code as UTF-8
# utf8 Enable pragma
use utf8;

my $str = "Write Japanese etc.";

A string written in the source code is called a string literal. Keep in mind that the character code of a string literal will be the same as the character code of the source code. If you save the source code in UTF-8, the string literal will be a UTF-8 byte string, and if you save it in Shift_JIS, the string literal will be a Shift_JIS byte string.

How to save with a specific character code depends on the editor. For reference, the case of Windows Notepad is explained. If you understand this, you can understand it with other text editors.

Select "File"->"Save As"->"Character Code"->"UTF-8".

The utf8 pragma has the effect of converting a UTF-8 byte string written in the source code into an decoded string. So if you save the source code in UTF-8 and enable the utf8 pragma, string literals will be converted to decoded strings.

The effect of the utf8 pragma is graphically represented as follows. The effect is very similar to the decode function.

                                use utf8
UTF-8 byte string - ->decoded string

As long as you follow the following three things mentioned at the beginning, you will have less trouble with character codes. If you are worried about the character code, please go back to this principle first.

Meaning of converting to an decoded string

Let's see the effect of actually converting it to an decoded string. You can see that Perl handles strings correctly. length function returns the correct length, and regular expression It works correctly.

Save the source code in UTF-8. If it is not saved in UTF-8, you will get a warning such as "Malformed UTF-8 character".

The explanation is given in the case of UTF-8, but in the case of Windows, it is necessary to specify cp932 in the encode function and decode function. cp932 stands for "Windows-31J" which is the character code of Windows. Try passing the string "This is Japanese" as a command line argument.

# Try the effect of converting to an decoded string
use strict;
use warnings;

use utf8;
use Encode qw/encode decode/;

# Command line argument (UTF-8 byte string)
my $str1 = shift;

# UTF-8 byte string decoded to decoded string
$str1 = decode('UTF-8', $str1);

# String literal (because utf8 pragma is valid, it becomes an decoded string)
my $str2 = "Japanese";

# You can count characters correctly by converting to an decoded string
print length $str2, "\n"; # 3

# Regular expressions can be used correctly between decoded strings
if ($str1 =~ /$str2/) {
  print "Match!\n";
}

# Encode the decoded string into a byte string just before outputting
$str1 = encode('UTF-8', $str1);
$str2 = encode('UTF-8', $str2);

print "'$str1' is match'$str2'\n";

Character code conversion

In Perl, the character code conversion process is as follows by combining the decode function and encode function. You need to convert it to an decoded string once. The following example converts a UTF-8 byte string to a Shift_JIS byte string.

# Convert UTF-8 byte string to decoded string
$str = decode('UTF-8', $str);

# Convert decoded string to Shift_JIS byte string
$str = encode('Shift_JIS', $str);

The diagram is as follows.

UTF-8 byte string->decoded string->Shift_JIS byte string

This is a bit annoying, so we have a function called from_to.The first argument is a byte string, the second argument is the character code before conversion, and the third argument is the character code after conversion. Note that unlike the enocde and decode functions, the byte string itself specified in the first argument is converted.

# Character code conversion
use Encode 'from_to';

# $str itself is converted
from_to ($str, 'UTF-8', 'Shift_JIS');

Character string specified as the file name

When specifying a file name for a function such as open or unlink, it must be converted to an OS byte string.

use strict;
use warnings;
use utf8;
use Encode 'encode';

my $file = 'Ah.txt';
open my $fh, '<', encode('cp932', $file)
  or die "Can't open". encode('cp932', $file). ":$!";

The point is to encode just before passing it to the function that handles the file name. This is a little troublesome when dealing with many functions that handle file names, so it is convenient to create a function that converts the file name to the character code of the OS.

It is also convenient to specify the standard error output encoding using the binmode function to convert the error message to the character code of the automatic OS.

use warnings;
use utf8;

use Encode qw/encode decode/;
my $enc = 'cp932';
binmode STDERR, ": encoding ($enc)";
sub d ($) {decode($enc, shift)}
sub e ($) {encode($enc, shift)}

my $file = 'Ah.txt';
open my $fh, '<', e $file
  or die qq/Can't open "$file":$!/;

Now you can handle Perl character codes very simply.

Other notes about strings

Avoid concatenating internal and byte strings

Avoid concatenating internal and byte strings. In this case, the byte string is automatically converted to an decoded string, but that conversion causes garbled characters. Let's convert everything to an decoded string before concatenating.

It is not possible to accurately distinguish between an decoded string and a byte string

Perl has no way in your program to pinpoint whether a string is a byte string or an decoded string. You cannot use the utf8::is_utf8 method to identify whether it is an decoded string or a byte string. The utf8::is_utf8 method can only determine if the UTF8 flag is set, not an decoded string or a byte string.

So, while you can use it to speculate that this is probably the case, be aware that it can't be used to determine in-program whether it's an decoded string or a byte string. Whenever the UTF8 flag is set, it is an decoded string. If the UTF8 flag is not set, it is an decoded string or a byte string.

Cooperation with modules

Whether a function in a module accepts an decoded string or a byte string as an argument depends on the intention of the module creator. The same is true for the return value of a function. It is up to the author to decide whether to return a byte string or an decoded string. Be sure to read the documentation.

I think this is one of the things programmers find particularly difficult when programming in Perl. But don't just deny it, remember that you're benefiting from the other big benefits of "maintaining backwards compatibility, " "simple conversion rules, " and "correct and proper handling of strings." I think Perl handles strings pretty well compared to other languages.

As you can see from the implementation of other programming languages, there is a trade-off with the handling of character codes. It's a difficult field to deal with.

About decoded strings

You shouldn't be aware of what the decoded strings look like. You don't have to be aware of it when creating a program. If you still want to know, the following explanation is easy to understand.

Related Informatrion