Recipe 6.10. Speeding Up Interpolated Matches

6.10. Speeding Up Interpolated Matches

Problem

You want your function or program to take one or more regular expressions as arguments, but doing so seems to run slower than using literals.

Solution

To overcome this bottleneck, if you have only one pattern whose value won't change during the entire run of a program, store it in a string and use /$pattern/o.

while ($line = <>) {
    if ($line =~ /$pattern/o) {
        # do something
    }
}

If you have more than one pattern, however, that won't work. Use one of the three techniques outlined in the Discussion for a speed-up of an order of magnitude or so.

When Perl compiles a program, it converts patterns into an internal form. This conversion occurs at compile time for patterns without variables in them, but at run time for those that do contain variables. That means that interpolating variables into patterns, as in /$pattern/, can slow your program down. This is particularly noticeable when $pattern changes often.

The /o modifier is a promise from the script's author that the values of any variables interpolated into that pattern will not change - or that if they do, Perl should disregard any such changes. Given such a promise, Perl need only interpolate the variable and compile the pattern the first time it encounters the match. But if the interpolated variable were to change, Perl wouldn't notice. Make sure to use it only on unchanging variables, or else wrong answers will result.

Using /o on patterns without interpolated variables does not speed anything up. The /o modifier is also of no help when you have an unknown number of regular expressions and need to check one or more strings against all of these patterns. Nor is it of any use when the interpolated variable is a function argument, since each call of the function gives the variable a new value.

Example 6.4 is an example of the slow but straightforward technique for matching many patterns against many lines. The array @popstates contains the standard two-letter abbreviations for some of the places in the heartland of North America where we normally refer to soft drinks as pop (soda to us means either plain soda water or else handmade delicacies from the soda fountain at the corner drugstore, preferably with ice cream). The goal is to print out any line of input that contains any of those places, matching them at word boundaries only. It doesn't use /o because the variable that holds the pattern keeps changing.

Example 6.4: popgrep1

#!/usr/bin/perl
# popgrep1 - grep for abbreviations of places that say "pop"
# version 1: slow but obvious way
@popstates = qw(CO ON MI WI MN);
LINE: while (defined($line = <>)) {
    for $state (@popstates) {
        if ($line =~ /\b$state\b/) {
            print; next LINE;
       }
    }
}

Such a direct, obvious, brute-force approach is also horribly slow because it has to recompile all patterns with each line of input. Three different ways of addressing this are described in this section. One builds a string of Perl code and evals it; one caches the internal representations of regular expressions in closures; and one uses the Regexp module from CPAN to hold compiled regular expressions.

The traditional way to get Perl to speed up a multiple match is to build up a string containing the code and eval "$code" it. Example 6.5 contains a version that uses this technique.

Example 6.5: popgrep2

#!/usr/bin/perl
# popgrep2 - grep for abbreviations of places that say "pop"
# version 2: eval strings; fast but hard to quote
@popstates = qw(CO ON MI WI MN);
$code = 'while (defined($line = <>)) {';
for $state (@popstates) {
    $code .= "\tif (\$line =~ /\\b$state\\b/) { print \$line; next; }\n";
}
$code .= '}';
print "CODE IS\n----\n$code\n----\n" if 0;  # turn on to debug
eval $code;
die if $@;

The popgrep2 program builds strings like this:

while (defined($line = <>)) {
     if ($line =~ /\bCO\b/) { print $line; next; }
     if ($line =~ /\bON\b/) { print $line; next; }
     if ($line =~ /\bMI\b/) { print $line; next; }
     if ($line =~ /\bWI\b/) { print $line; next; }
     if ($line =~ /\bMN\b/) { print $line; next; }
}

As you see, those end up looking like constant strings to eval. We put the entire loop and pattern match in the eval text, too, which makes it run faster.

The worst thing about this eval "STRING" approach is that it's difficult to get the quoting and escaping right. The dequote function from Recipe 1.11 can make it easier to read, but escaping variables whose use is delayed will still be an issue. Also, none of the strings can contain a slash, since that's what we're using as a delimiter for the m// operator.

A solution to these problems is a subtle technique first developed by Jeffrey Friedl. The key here is building an anonymous subroutine that caches the compiled patterns in the closure it creates. To do this, we eval a string containing the definition of an anonymous subroutine to match any of the supplied patterns. Perl compiles the pattern once, when the subroutine is defined. The string is evaluated to give you comparatively quick matching ability. An explanation of the algorithm can be found at the end of the section "Regex Compilation, the /o Modifier, and Efficiency" in Chapter 7 of Mastering Regular Expressions.

Example 6.6 is a version of our pop grepper that uses that technique.

Example 6.6: popgrep3

#!/usr/bin/perl
# popgrep3 - grep for abbreviations of places that say "pop"
# version 3: use build_match_func algorithm
@popstates = qw(CO ON MI WI MN);
    $expr = join('||', map { "m/\\b\$popstates[$_]\\b/o" } 0..$#popstates);
$match_any = eval "sub { $expr }";
die if $@;
while (<>) {
    print if &$match_any;
}

The string that gets evaluated ends up looking like this, modulo formatting:

sub {
      m/\b$popstates[0]\b/o || m/\b$popstates[1]\b/o ||
      m/\b$popstates[2]\b/o || m/\b$popstates[3]\b/o ||
      m/\b$popstates[4]\b/o
  }

The reference to the @popstates array is locked up inside the closure. Each one is different, so the /o is safe here.

Example 6.7 is a generalized form of this technique showing how to create functions that return true if any of the patterns match or if all match.

Example 6.7: grepauth

#!/usr/bin/perl
# grepauth - print lines that mention both Tom and Nat

$multimatch = build_match_all(q/Tom/, q/Nat/);
while (<>) {
    print if &$multimatch;
}
exit;

sub build_match_any { build_match_func('||', @_) }
sub build_match_all { build_match_func('&&', @_) }
sub build_match_func {
    my $condition = shift;
    my @pattern = @_;  # must be lexical variable, not dynamic one
    my $expr = join $condition => map { "m/\$pattern[$_]/o" } (0..$#pattern);
    my $match_func = eval "sub { local \$_ = shift if \@_; $expr }";
    die if $@;  # propagate $@; this shouldn't happen!
    return $match_func;
}

Using eval "STRING" on interpolated strings as we did in popgrep2 is a hack that happens to work. Using lexical variables that get bound up in a closure as in popgrep3 and the build_match_* functions is deep enough magic that even Perl wizards stare at it a while before they believe in it. Of course, it still works whether they believe in it or not.

What you really need is some way to get Perl to compile each pattern once and let you directly refer to the compiled form later on. Such functionality is directly supported in the 5.005 release in the form of a qr// regular-expression quoting operator. For prior releases, that's exactly what the experimental Regexp module from CPAN was designed for. Objects created by this module represent compiled regular expression patterns. Using the match method on these objects matches the pattern against the string argument. Methods in the class exist for extracting backreferences, determining where pattern matched, and passing flags corresponding to modifiers like /i.

Example 6.8 is a version of our program that demonstrates a simple use of this module.

Example 6.8: popgrep4

#!/usr/bin/perl
# popgrep4 - grep for abbreviations of places that say "pop"
# version 4: use Regexp module
use Regexp;
@popstates = qw(CO ON MI WI MN);
@poppats   = map { Regexp->new( '\b' . $_ . '\b') } @popstates;
while (defined($line = <>)) {
    for $patobj (@poppats) {
        print $line if $patobj->match($line);
    }
}

You might wonder about the comparative speeds of these approaches. When run against the 22,000 line text file (the Jargon File, to be exact), version 1 ran in 7.92 seconds, version 2 in merely 0.53 seconds, version 3 in 0.79 seconds, and version 4 in 1.74 seconds. The last technique is a lot easier to understand than the others, although it does run slightly slower than they do. It's also more flexible.