Recipe 6.18. Matching Multiple-Byte Characters

Problem

You need to perform regular-expression searches against multiple-byte characters.

A character encoding is a set of mappings from characters and symbols to digital representations. ASCII is an encoding where each character is represented as exactly one byte, but complex writing systems, such as those for Chinese, Japanese, and Korean, have so many characters that their encodings need to use multiple bytes to represent characters.

Perl works on the principle that each byte represents a single character, which works well in ASCII but makes regular expression matches on strings containing multiple-byte characters tricky, to say the least. The regular expression engine does not understand the character boundaries in your string of bytes, and so can return "matches" from the middle of one character to the middle of another.

Solution

Exploit the encoding by tailoring the pattern to the sequences of bytes that constitute characters. The basic approach is to build a pattern that matches a single (multiple byte) character in the encoding, and then use that "any character" pattern in larger patterns.

Discussion

As an example, we'll examine one of the encodings for Japanese, called EUC-JP, and then show how we use this in solving a number of multiple-byte encoding issues. EUC-JP can represent thousands of characters, but it's basically a superset of ASCII. Bytes with values ranging from 0 to 127 (0x00 to 0x7F) are almost exactly their ASCII counterparts, so those bytes represent one-byte characters. Some characters are represented by a pair of bytes, the first with value 0x8E and the second with a value in the range 0xA0-0xDF. Some others are represented by three bytes, the first with the value 0x8F and the others in the range 0xA1-0xFE, while others still are represented by two bytes, each in the 0xA1-0xFE range.

We can convey this information - what bytes can make up characters in this encoding - as a regular expression. For ease of use later, here we'll define a string, $eucjp, that holds the regular expression to match a single EUC-JP character:

my $eucjp = q{                 # EUC-JP encoding subcomponents:
    [\x00-\x7F]                # ASCII/JIS-Roman (one-byte/character)
  | \x8E[\xA0-\xDF]            # half-width katakana (two bytes/char)
  | \x8F[\xA1-\xFE][\xA1-\xFE] # JIS X 0212-1990 (three bytes/char)
  | [\xA1-\xFE][\xA1-\xFE]     # JIS X 0208:1997 (two bytes/char)
};

(Because we've inserted comments and whitespace for pretty-printing, we'll have to use the /x modifier when we use this in a match or substitution.)

With this template in hand, the following sections show how to:

Perform a normal match without any "false" matches
Count, convert (to another encoding), and/or filter characters
Verify whether the target text is valid according to an encoding
Detect which encoding the target text uses

All the examples are shown using EUC-JP as the encoding of interest, but they will work with any of the many multiple-byte encodings commonly used for text processing, such as Unicode, Big-5, etc.

Avoiding false matches

A false match is where the regular expression engine finds a match that begins in the middle of a multiple-byte character sequence. We can get around the problem by carefully controlling the match, ensuring that the pattern matching engine stays synchronized with the character boundaries at all times.

This can be done by anchoring the match to the start of the string, then manually bypassing characters ourselves when the real match can't happen at the current location. With the EUC-JP example, the "bypassing characters" part is /(?: $eucjp )*?/. $eucjp is our template to match any valid character, and because it is applied via the non-greedy *?, it can match a character only when whatever follows (presumably the desired real match) can't match. Here's a real example:

/^ (?: $eucjp )*?  \xC5\xEC\xB5\xFE/ox # Trying to find Tokyo

In the EUC-JP encoding, the Japanese word for Tokyo is written with two characters, the first encoded by the two bytes \xC5\xEC, the second encoded by the two bytes \xB5\xFE. As far as Perl is concerned, we're looking merely for the four-byte sequence \xC5\xEC\xB5\xFE, but because we use (?: $eucjp )*? to move along the string only by characters of our target encoding, we know we'll stay in sync with the character boundaries.
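
To make the difference concrete, here is a minimal sketch that compares an unanchored byte search with the synchronized match. The sample strings $straddle and $real are hypothetical test data made up for this illustration; the first was chosen so that the Tokyo byte sequence appears only across character boundaries.

```perl
use strict;
use warnings;

# Same template as in the recipe text.
my $eucjp = q{
    [\x00-\x7F]                # ASCII/JIS-Roman (one byte)
  | \x8E[\xA0-\xDF]            # half-width katakana (two bytes)
  | \x8F[\xA1-\xFE][\xA1-\xFE] # JIS X 0212-1990 (three bytes)
  | [\xA1-\xFE][\xA1-\xFE]     # JIS X 0208:1997 (two bytes)
};

# Hypothetical data: three two-byte characters (A4C5, ECB5, FEA1) whose
# bytes contain the Tokyo sequence at byte offset 1, straddling boundaries.
my $straddle = "\xA4\xC5\xEC\xB5\xFE\xA1";

# Hypothetical data: one unrelated character (A4C5) followed by real Tokyo.
my $real = "\xA4\xC5\xC5\xEC\xB5\xFE";

# Unanchored search: happily matches in the middle of a character.
my $naive_hit = $straddle =~ /\xC5\xEC\xB5\xFE/ ? 1 : 0;

# Synchronized search: advances only one whole character at a time.
my $sync_hit = $straddle =~ /^ (?: $eucjp )*? \xC5\xEC\xB5\xFE/ox ? 1 : 0;
my $real_hit = $real     =~ /^ (?: $eucjp )*? \xC5\xEC\xB5\xFE/ox ? 1 : 0;

print "naive on straddle: $naive_hit\n";
print "synchronized on straddle: $sync_hit\n";
print "synchronized on real: $real_hit\n";
```

The unanchored pattern reports a match on $straddle even though no character in it belongs to the word, while the synchronized pattern rejects it and still finds the genuine occurrence in $real.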

Don't forget to use the /ox modifiers. The /x modifier is especially crucial due to the whitespace used in the encoding template $eucjp. The /o modifier is for efficiency, since we know $eucjp won't change from use to use.

Use in a substitution is similar, but since the text leading up to the real match is also part of the overall match, we must capture it with parentheses and be sure to include it in the replacement text. Assuming that $Tokyo and $Osaka have been set to the byte sequences for their respective words in the EUC-JP encoding, we could use the following to replace Tokyo with Osaka:

s/^ ( (?: $eucjp )*? ) $Tokyo/$1$Osaka/ox

If used with /g, we want to anchor the match to the end of the previous match, rather than to the start of the string. That's as simple as changing ^ to \G:

s/\G ( (?: $eucjp )*? ) $Tokyo/$1$Osaka/gox
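
As a rough illustration of the global substitution, the sketch below runs it on a hypothetical sample string and compares the result with a plain byte-level s///g. The byte values assigned to $Osaka are our assumption for that word in EUC-JP, and \Q...\E is added here only as a precaution in case a byte sequence ever contains a regex metacharacter.

```perl
use strict;
use warnings;

# Same template as in the recipe text.
my $eucjp = q{
    [\x00-\x7F]                # ASCII/JIS-Roman (one byte)
  | \x8E[\xA0-\xDF]            # half-width katakana (two bytes)
  | \x8F[\xA1-\xFE][\xA1-\xFE] # JIS X 0212-1990 (three bytes)
  | [\xA1-\xFE][\xA1-\xFE]     # JIS X 0208:1997 (two bytes)
};

my $Tokyo = "\xC5\xEC\xB5\xFE";
my $Osaka = "\xC2\xE7\xBA\xE5";   # assumed EUC-JP bytes for Osaka

# Hypothetical input: three characters that merely contain the Tokyo
# bytes across their boundaries, then two genuine occurrences.
my $decoy = "\xA4\xC5\xEC\xB5\xFE\xA1";
my $text  = $decoy . $Tokyo . ", " . $Tokyo;

# Character-synchronized replacement, anchored to the previous match.
(my $safe = $text) =~ s/\G ( (?: $eucjp )*? ) \Q$Tokyo\E/$1$Osaka/gox;

# Byte-level replacement, shown only for comparison.
(my $naive = $text) =~ s/\Q$Tokyo\E/$Osaka/g;

print "safe result keeps decoy intact: ",
      (index($safe, $decoy) == 0 ? "yes" : "no"), "\n";
print "naive result keeps decoy intact: ",
      (index($naive, $decoy) == 0 ? "yes" : "no"), "\n";
```

The synchronized version replaces only the two real occurrences; the byte-level version also rewrites the middle of the decoy characters and corrupts them.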

Splitting multiple-byte strings

Another common task is to split an input string into its individual characters. With a one-byte-per-character encoding, you can simply split //, but with a multiple-byte encoding, we need something like:

@chars = /$eucjp/gox; # One character per list element

Now, @chars contains one character per element. The following snippet shows how you might use this to write a filter of some sort:

while (<>) {
  my @chars = /$eucjp/gox; # One character per list element
  for my $char (@chars) {
    if (length($char) == 1) {
      # Do something interesting with this one-byte character
    } else {
      # Do something interesting with this multiple-byte character
    }
  }
  my $line = join("",@chars); # Glue list back together
  print $line;
}

In the two "do something interesting" parts, any change to $char will be reflected in the output when @chars is glued back together.
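
The same split also gives a simple way to count characters rather than bytes. The following is a small sketch on a hypothetical sample line; the variable names are ours.

```perl
use strict;
use warnings;

# Same template as in the recipe text.
my $eucjp = q{
    [\x00-\x7F]                # ASCII/JIS-Roman (one byte)
  | \x8E[\xA0-\xDF]            # half-width katakana (two bytes)
  | \x8F[\xA1-\xFE][\xA1-\xFE] # JIS X 0212-1990 (three bytes)
  | [\xA1-\xFE][\xA1-\xFE]     # JIS X 0208:1997 (two bytes)
};

# Hypothetical sample: seven ASCII characters, the two-character word
# for Tokyo, and a newline.
my $text = "Tokyo: \xC5\xEC\xB5\xFE\n";

my @chars      = $text =~ /$eucjp/gox;            # one character per element
my $char_count = @chars;                           # characters in the line
my $multi      = grep { length($_) > 1 } @chars;   # multiple-byte characters

printf "%d bytes, %d characters, %d of them multiple-byte\n",
    length($text), $char_count, $multi;
```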

Validating multiple-byte strings

The use of /$eucjp/gox in this kind of technique relies strongly on the input string indeed being properly formatted in our target encoding, EUC-JP. If it's not, the template /$eucjp/ won't be able to match, and bytes will be skipped.

One way to address this is to use /\G$eucjp/gox instead. This prohibits the pattern matching engine from skipping bytes in order to find a match (since the use of \G indicates that any match must immediately follow the previous match). This is still not a perfect approach, since it will simply stop matching on ill-formatted input data.
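
One way to take advantage of the point where \G stops is to report it. The sketch below uses the /c modifier so that pos() survives the failed match; the helper name first_bad_offset is hypothetical, not part of any library.

```perl
use strict;
use warnings;

# Same template as in the recipe text.
my $eucjp = q{
    [\x00-\x7F]                # ASCII/JIS-Roman (one byte)
  | \x8E[\xA0-\xDF]            # half-width katakana (two bytes)
  | \x8F[\xA1-\xFE][\xA1-\xFE] # JIS X 0212-1990 (three bytes)
  | [\xA1-\xFE][\xA1-\xFE]     # JIS X 0208:1997 (two bytes)
};

# Hypothetical helper: returns the byte offset at which the string stops
# being well-formed EUC-JP, or undef if the whole string is well-formed.
sub first_bad_offset {
    my ($bytes) = @_;
    pos($bytes) = 0;
    # Consume one character per iteration; /c keeps pos() after a failure.
    1 while $bytes =~ /\G (?: $eucjp )/gcox;
    return pos($bytes) == length($bytes) ? undef : pos($bytes);
}

my $good      = first_bad_offset("ab\xC5\xEC\xB5\xFE");  # well-formed
my $truncated = first_bad_offset("ab\xC5\xEC\xB5");      # last character cut short
my $garbage   = first_bad_offset("\x93\x8C");            # 0x93 cannot start a character

print defined $good ? "bad at $good\n" : "well-formed\n";
print "truncated input stops at byte $truncated\n";
print "garbage input stops at byte $garbage\n";
```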

A better approach to confirm that a string is valid with respect to an encoding is to use something like:

$is_eucjp = m/^(?:$eucjp)*$/xo;

If a string has only valid characters from start to end, you know the string as a whole is valid.

There is one potential for a problem, and that's due to how the end-of-string metacharacter $ works: it can be true at the end of the string (as we want), and also just before a newline at the end of the string. That means you can still match successfully even if the newline is not a valid character in the encoding. To get around this problem, you could use the more complicated (?!\n)$ instead of $, or the \z anchor, which matches only at the very end of the string.
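
Because newline is itself a valid one-byte character in EUC-JP, the effect is easiest to see with a stricter template. The sketch below uses a hypothetical template, $printable, that admits only printable ASCII, and compares the three ways of anchoring the end of the match.

```perl
use strict;
use warnings;

# Hypothetical template for illustration only: printable ASCII, so a
# newline is NOT a valid character here.
my $printable = q{ [\x20-\x7E] };

my $line = "abc\n";   # ends with a newline that the template rejects

# Plain $ also matches just before a trailing newline, so this succeeds.
my $with_dollar    = $line =~ /^ (?: $printable )* $/x        ? 1 : 0;

# Forbidding a newline at the anchor position closes the loophole.
my $with_lookahead = $line =~ /^ (?: $printable )* (?!\n) $/x ? 1 : 0;

# \A and \z match only at the true start and end of the string.
my $with_z         = $line =~ /\A (?: $printable )* \z/x      ? 1 : 0;

print "plain \$: $with_dollar\n";
print "(?!\\n)\$: $with_lookahead\n";
print "\\z: $with_z\n";
```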

You can use the basic validation technique to detect which encoding is being used. For example, Japanese is commonly encoded with either EUC-JP, or another encoding called Shift-JIS. If you've set up the templates, as with $eucjp, you can do something like:

$is_eucjp = m/^(?:$eucjp)*$/xo;
$is_sjis  = m/^(?:$sjis)*$/xo;

If both are true, the text is likely ASCII (since, essentially, ASCII is a sub-component of both encodings). (It's not quite fool-proof, though, since some strings with multi-byte characters might appear to be valid in both encodings. In such a case, automatic detection becomes impossible, although one might use character-frequency data to make an educated guess.)
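
Put together, detection might look like the following sketch. The $sjis template is our assumption about the usual Shift-JIS byte ranges (the recipe does not define it), and guess_encoding and its return labels are hypothetical names.

```perl
use strict;
use warnings;

# Same template as in the recipe text.
my $eucjp = q{
    [\x00-\x7F]                # ASCII/JIS-Roman (one byte)
  | \x8E[\xA0-\xDF]            # half-width katakana (two bytes)
  | \x8F[\xA1-\xFE][\xA1-\xFE] # JIS X 0212-1990 (three bytes)
  | [\xA1-\xFE][\xA1-\xFE]     # JIS X 0208:1997 (two bytes)
};

# Assumed Shift-JIS template; adjust the ranges for your variant.
my $sjis = q{
    [\x00-\x7F]                              # ASCII/JIS-Roman (one byte)
  | [\xA1-\xDF]                              # half-width katakana (one byte)
  | [\x81-\x9F\xE0-\xFC][\x40-\x7E\x80-\xFC] # two-byte characters
};

# Hypothetical helper: classify a byte string by which templates accept it.
sub guess_encoding {
    my ($bytes) = @_;
    my $is_eucjp = $bytes =~ /^(?:$eucjp)*$/xo;
    my $is_sjis  = $bytes =~ /^(?:$sjis)*$/xo;
    return 'ascii-or-ambiguous' if $is_eucjp && $is_sjis;
    return 'EUC-JP'             if $is_eucjp;
    return 'Shift-JIS'          if $is_sjis;
    return 'unknown';
}

print guess_encoding("Tokyo"), "\n";             # plain ASCII
print guess_encoding("\xC5\xEC\xB5\xFE"), "\n";  # Tokyo as EUC-JP bytes
print guess_encoding("\x93\x8C\x8B\x9E"), "\n";  # Tokyo as Shift-JIS bytes
print guess_encoding("\xA4\xA2"), "\n";          # well-formed under both templates
```

The last call shows the caveat from the text: the bytes \xA4\xA2 form one two-byte character in EUC-JP and two half-width katakana in Shift-JIS, so validation alone cannot decide between them.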

Converting between encodings

Converting from one encoding to another can be as simple as an extension of the process-each-character routine above. Conversions for some closely related encodings can be done by a simple mathematical computation on the bytes, while others might require huge mapping tables. In either case, you insert the code at the "do something interesting" points in the routine.

Here's an example to convert from EUC-JP to Unicode, using a %euc2uni hash as a mapping table:

while (<>) {
  my @chars = /$eucjp/gox; # One character per list element
  for my $euc (@chars) {
    my $uni = $euc2uni{$euc};   # look up the current EUC-JP character
    if (defined $uni) {
        $euc = $uni;
    } else {
        ## deal with unknown EUC->Unicode mapping here.
    }
  }
  my $line = join("",@chars);
  print $line;
}
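
For experimenting without a full mapping table, here is a self-contained sketch of the same loop. The two-entry %euc2uni table, the choice of UTF-8 as the output representation, the "?" placeholder for unmapped characters, and the helper name euc_to_utf8 are all assumptions made for this illustration; a real table would hold thousands of entries.

```perl
use strict;
use warnings;

# Same template as in the recipe text.
my $eucjp = q{
    [\x00-\x7F]                # ASCII/JIS-Roman (one byte)
  | \x8E[\xA0-\xDF]            # half-width katakana (two bytes)
  | \x8F[\xA1-\xFE][\xA1-\xFE] # JIS X 0212-1990 (three bytes)
  | [\xA1-\xFE][\xA1-\xFE]     # JIS X 0208:1997 (two bytes)
};

# Tiny hypothetical table: the two characters of Tokyo, mapped to the
# UTF-8 bytes for U+6771 and U+4EAC.
my %euc2uni = (
    "\xC5\xEC" => "\xE6\x9D\xB1",
    "\xB5\xFE" => "\xE4\xBA\xAC",
);

# Hypothetical helper: convert a well-formed EUC-JP byte string.
sub euc_to_utf8 {
    my ($bytes) = @_;
    my @chars = $bytes =~ /$eucjp/gox;   # one character per list element
    for my $euc (@chars) {
        next if length($euc) == 1;       # one-byte characters treated as ASCII
        my $uni = $euc2uni{$euc};
        $euc = defined $uni ? $uni : '?';   # placeholder for unmapped characters
    }
    return join '', @chars;
}

my $converted = euc_to_utf8("Tokyo: \xC5\xEC\xB5\xFE\n");
my $unmapped  = euc_to_utf8("\xA4\xA2");
print $converted;
```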

The topic of multiple-byte matching and processing is of particular importance when dealing with Unicode, which has a variety of possible representations. UCS-2 and UCS-4 are fixed-length encodings. UTF-8 is a variable-length encoding that uses one to four bytes per character (its original definition allowed up to six). UTF-16 is a variable-length encoding that uses one or two 16-bit units per character.