8.3 Efficiency

While most of the work of programming may be simply getting a program working properly, you may find yourself wanting more bang for the buck out of your Perl program. Perl's rich set of operators, datatypes, and control constructs is not necessarily intuitive when it comes to speed and space optimization. Many trade-offs were made during Perl's design, and such decisions are buried in the guts of the code. In general, the shorter and simpler your code is, the faster it runs, but there are exceptions. This section attempts to help you make it work just a wee bit better.

(If you want it to work a lot better, you can play with the new Perl-to-C translation modules, or rewrite your inner loop as a C extension.)

You'll note that sometimes optimizing for time may cost you in space or programmer efficiency (indicated by conflicting hints below). Them's the breaks. If programming were easy, they wouldn't need something as complicated as a human being to do it, now would they?

8.3.1 Time Efficiency

  • Use hashes instead of linear searches. For example, instead of searching through @keywords to see if $_ is a keyword, construct a hash with:

    my %keywords;
    for (@keywords) {
        $keywords{$_}++;
    }

    Then, you can quickly tell if $_ contains a keyword by testing $keywords{$_} for a non-zero value.

  • Avoid subscripting when a foreach or list operator will do. Subscripting sometimes forces conversion from floating point to integer, and there's often a better way to do it. Consider using foreach, shift, and splice operations. Consider saying use integer.

  • Avoid goto. It scans outward from your current location for the indicated label.

  • Avoid printf if print will work. Quite apart from the extra overhead of printf, some implementations have field length limitations that print gets around.

  • Avoid $&, $`, and $'. Any occurrence in your program causes all matches to save the searched string for possible future reference. (However, once you've blown it, it doesn't hurt to have more of them.)
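    A minimal sketch of the usual workaround, assuming you only need the pieces of one particular match: capture them with parentheses instead. The sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $line = "error at line 42 in main.pl";    # illustrative input

# Capture the text before, inside, and after the match explicitly,
# so that no other match in the program has to save its searched string.
my ($before, $match, $after) = $line =~ /^(.*?)(\d+)(.*)$/s;

print "matched '$match' between '$before' and '$after'\n";
```

    The three captures play the roles of the prematch, match, and postmatch variables, but the cost is paid only by this one pattern.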

  • Avoid using eval on a string. An eval of a string (not of a BLOCK) forces recompilation every time through. The Perl parser is pretty fast for a parser, but that's not saying much. Nowadays there's almost always a better way to do what you want anyway. In particular, any code that uses eval merely to construct variable names is obsolete, since you can now do the same directly using symbolic references:

    ${$pkg . '::' . $varname} = &{ "fix_" . $varname }($pkg);

  • Avoid string eval inside a loop. Put the loop into the eval instead, to avoid redundant recompilations of the code. See the study operator in Chapter 3 for an example of this.

  • Avoid run-time-compiled patterns. Use the /pattern/o (once only) pattern modifier to avoid pattern recompilation when the pattern doesn't change over the life of the process. For patterns that change occasionally, you can use the fact that a null pattern refers back to the previous pattern, like this:

    "foundstring" =~ /$currentpattern/;        # Dummy match (must succeed).
    while (<>) {
        print if //;
    }

    You can also use eval to recompile a subroutine that does the match (if you only recompile occasionally).
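    One way to sketch the eval approach: build a small matching subroutine from the pattern once, then call it as often as you like. The helper name make_matcher is hypothetical, and the sketch assumes the pattern is trusted and contains no unescaped "/".

```perl
use strict;
use warnings;

# Hypothetical helper: compile a matching subroutine for one pattern.
# Assumes $pattern is trusted and contains no unescaped "/".
sub make_matcher {
    my ($pattern) = @_;
    # The pattern is interpolated into the source text, so the regex
    # is a constant by the time the subroutine is compiled.
    my $matcher = eval "sub { \$_[0] =~ /$pattern/ }";
    die "Bad pattern '$pattern': $@" if $@;
    return $matcher;
}

my $is_hump = make_matcher('hump');          # compiled once, here
my @lines   = ("one-hump camel", "llama", "two-hump camel");
my @hits    = grep { $is_hump->($_) } @lines;

print scalar(@hits), " matching lines\n";
```

    Call make_matcher again only when the pattern actually changes; every call in between reuses the already compiled code.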

  • Short-circuit alternation is often faster than the corresponding regular expression. So:

    print if /one-hump/ || /two/;

    is likely to be faster than:

    print if /one-hump|two/;

    at least for certain values of one-hump and two. This is because the optimizer likes to hoist certain simple matching operations up into higher parts of the syntax tree and do very fast matching with a Boyer-Moore algorithm. A complicated pattern defeats this.

  • Reject common cases early with next if. As with simple regular expressions, the optimizer likes this. And it just makes sense to avoid unnecessary work. You can typically discard comment lines and blank lines even before you do a split or chomp:

    while (<>) {
        next if /^#/;        # Skip comment lines.
        next if /^$/;        # Skip blank lines.
        chomp;               # Unlike chop, removes only a trailing newline.
        @piggies = split(/,/);
        ...
    }

  • Avoid regular expressions with many quantifiers, or with big {m,n} numbers on parenthesized expressions. Such patterns can result in exponentially slow backtracking behavior unless the quantified subpatterns match on their first "pass".

  • Try to maximize the length of any non-optional literal strings in regular expressions. This is counterintuitive, but longer patterns often match faster than shorter patterns. That's because the optimizer looks for constant strings and hands them off to a Boyer-Moore search, which benefits from longer strings. Compile your pattern with the -Dr debugging switch to see what Perl thinks the longest literal string is.

  • Avoid expensive subroutine calls in tight loops. There is overhead associated with calling subroutines, especially when you pass lengthy parameter lists, or return lengthy values. In increasing order of desperation, try passing values by reference, passing values as dynamically scoped globals, inlining the subroutine, or rewriting the whole loop in C.
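    The first and mildest of those remedies, passing by reference, might look like the sketch below. The subroutine name and the data are invented for the example.

```perl
use strict;
use warnings;

# Takes a reference to an array, so only one scalar is passed
# no matter how long the list is.
sub total_by_ref {
    my ($numbers) = @_;
    my $sum = 0;
    $sum += $_ for @$numbers;
    return $sum;
}

my @data  = (1 .. 1000);
my $total = total_by_ref(\@data);    # pass \@data, not @data

print "$total\n";                    # prints 500500
```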

  • Avoid getc for anything but single-character terminal I/O. In fact, don't use it for that either. Use sysread.

  • Use readdir rather than <*>. To get all the non-dot files within a directory, say something like:

    opendir(DIR,".") or die "Can't open current directory: $!";
    @files = sort grep(!/^\./, readdir(DIR));
    closedir(DIR);

  • Avoid frequent substr on long strings.

  • Use pack and unpack instead of multiple substr invocations.
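    As an illustration, here is a made-up fixed-width record split into fields with a single unpack. The field widths (8, 10, and 4 characters) are assumptions of the example, not any standard layout.

```perl
use strict;
use warnings;

# A made-up fixed-width record: 8-character date, 10-character name
# padded with spaces, and a 4-digit count.
my $record = sprintf("%-8s%-10s%04d", "19970315", "Camel", 42);

# One unpack call extracts every field; the "A" format also strips
# the trailing spaces used as padding.
my ($date, $name, $count) = unpack("A8 A10 A4", $record);

print "$date $name $count\n";    # prints 19970315 Camel 0042
```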

  • Use substr as an lvalue rather than concatenating substrings. For example, to replace the fourth through sixth characters of $foo with the contents of the variable $bar, don't do:

    $foo = substr($foo,0,3) . $bar . substr($foo,6);

    Instead, simply identify the part of the string to be replaced, and assign into it, as in:

    substr($foo,3,3) = $bar;

    But be aware that if $foo is a huge string, and $bar isn't exactly 3 characters long, this can do a lot of copying too.

  • Use s/// rather than concatenating substrings. This is especially true if you can replace one constant with another of the same size. This results in an in-place substitution.

  • Use modifiers and equivalent and and or, instead of full-blown conditionals. Statement modifiers and logical operators avoid the overhead of entering and leaving a block. They can often be more readable too.

  • Use $foo = $a || $b || $c. This is much faster (and shorter to say) than:

    if ($a) {
        $foo = $a;
    }
    elsif ($b) {
        $foo = $b;
    }
    elsif ($c) {
        $foo = $c;
    }

    Similarly, set default values with:

    $pi ||= 3;

  • Group together any tests that want the same initial string. When testing a string for various prefixes in anything resembling a switch structure, put together all the /^a/ patterns, all the /^b/ patterns, and so on.

  • Don't test things you know won't match. Use last or elsif to avoid falling through to the next case in your switch statement.

  • Use special operators like study, logical string operations, pack 'u' and unpack '%' formats.

  • Beware of the tail wagging the dog. Misstatements resembling (<STDIN>)[0] and 0 .. 2000000 can cause Perl much unnecessary work. In accord with UNIX philosophy, Perl gives you enough rope to hang yourself.

  • Factor operations out of loops. The Perl optimizer does not attempt to remove invariant code from loops. It expects you to exercise some sense.

  • Slinging strings can be faster than slinging arrays.

  • Slinging arrays can be faster than slinging strings. It all depends on whether you're going to reuse the strings or arrays, and on which operations you're going to perform. Heavy modification of each element implies that arrays will be better, and occasional modification of some elements implies that strings will be better. But you just have to try it and see.

  • my variables are normally faster than local variables.

  • Sorting on a manufactured key array may be faster than using a fancy sort subroutine. A given array value may participate in several sort comparisons, so if the sort subroutine has to do much recalculation, it's better to factor out that calculation to a separate pass before the actual sort.
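    A minimal sketch of the two-pass idea, using invented "name:age" records: compute the keys once, sort the array indices by those keys, and then take a slice.

```perl
use strict;
use warnings;

my @records = ("carol:30", "alice:25", "bob:35");   # made-up name:age data

# Pass 1: extract each numeric key exactly once.
my @ages = map { (split /:/, $_)[1] } @records;

# Pass 2: sort the indices by the precomputed keys, then slice.
my @sorted = @records[ sort { $ages[$a] <=> $ages[$b] } 0 .. $#records ];

print "@sorted\n";    # prints alice:25 carol:30 bob:35
```

    The comparison block now does nothing but two array lookups, however expensive the key extraction was.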

  • tr/abc//d is faster than s/[abc]//g.
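    The two forms produce the same string, as this small check shows (the sample word is arbitrary):

```perl
use strict;
use warnings;

my $with_tr = "abracadabra";
my $with_s  = "abracadabra";

# tr with /d deletes the listed characters and returns how many it found.
my $deleted = ($with_tr =~ tr/abc//d);

# The substitution does the same job through the regex engine.
$with_s =~ s/[abc]//g;

print "$with_tr $with_s ($deleted characters removed)\n";
```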

  • print with a comma separator may be faster than concatenating strings. For example:

    print $fullname{$name} . " has a new home directory " .
        $home{$name} . "\n";

    has to glue together the two hash values and the two fixed strings before passing them to the low-level print routines, whereas:

    print $fullname{$name}, " has a new home directory ",
        $home{$name}, "\n";

    doesn't. On the other hand, depending on the values and the architecture, the concatenation may be faster. Try it.
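    One way to "try it" is the standard Benchmark module. The sketch below times both forms against the null device; the sample data and the iteration count are arbitrary choices, and the timings will differ from machine to machine.

```perl
use strict;
use warnings;
use Benchmark qw(timethese);
use File::Spec;

my %fullname = (camel => "Camelus dromedarius");   # illustrative data
my %home     = (camel => "/home/camel");
my $name     = "camel";

# Write to the null device so that terminal speed doesn't skew the timing.
open(my $sink, '>', File::Spec->devnull)
    or die "Can't open null device: $!";

# timethese runs each subroutine the given number of times and
# prints a timing report for each one.
my $results = timethese(200_000, {
    concat => sub {
        print {$sink} $fullname{$name} . " has a new home directory " .
            $home{$name} . "\n";
    },
    list => sub {
        print {$sink} $fullname{$name}, " has a new home directory ",
            $home{$name}, "\n";
    },
});

close($sink);
```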

  • Prefer join("", ...) to a series of concatenated strings. Multiple concatenations may cause strings to be copied back and forth multiple times. The join operator avoids this.

  • split on a fixed string is generally faster than split on a pattern. That is, use split(/ /,...) rather than split(/ +/,...) if you know there will only be one space. However, the patterns /\s+/, /^/ and / / are specially optimized, as is the split on whitespace.

  • Pre-extending an array or string can save some time. As strings and arrays grow, Perl extends them by allocating a new copy with some room for growth and copying in the old value. Pre-extending a string with the x operator or an array by setting $#array can prevent this occasional overhead, as well as minimize memory fragmentation.
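    A short sketch of both forms of pre-extension. The sizes are arbitrary, and how much time this saves depends on your Perl and your memory allocator, so measure before relying on it.

```perl
use strict;
use warnings;

my $n = 10_000;

# Array: set the last index first, then fill the slots.
my @squares;
$#squares = $n - 1;                      # reserve all $n slots up front
$squares[$_] = $_ * $_ for 0 .. $n - 1;

# String: grow the buffer once with x, empty it, then append.
my $buffer = " " x 8192;
$buffer = "";
$buffer .= "line $_\n" for 1 .. 100;

print scalar(@squares), " elements, ", length($buffer), " bytes\n";
```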

  • Don't undef long strings and arrays if they'll be reused for the same purpose. This helps prevent reallocation when the string or array must be re-extended.

  • Prefer "\0" x 8192 over pack("x8192").

  • system("mkdir...") may be faster on multiple directories if mkdir(2) isn't available.

  • Avoid using eof if return values will already indicate it.

  • Cache entries from passwd and group (and so on) that are apt to be reused. For example, to cache the return value from gethostbyaddr when you are converting numeric addresses (like 198.112.208.11) to names (like "www.ora.com"), you can use something like:

    sub numtoname {
        local($_) = @_;
        unless (defined $numtoname{$_}) {
            # 2 is the usual value of AF_INET.
            local(@a) = gethostbyaddr(pack('C4', split(/\./)),2);
            # Cache the name, or the number itself if the lookup failed.
            $numtoname{$_} = @a > 0 ? $a[0] : $_;
        }
        $numtoname{$_};
    }

  • Avoid unnecessary system calls. Operating system calls tend to be rather expensive. So for example, don't call the time operator when a cached value of $now would do. Use the special _ filehandle to avoid unnecessary stat(2) calls. On some systems, even a minimal system call may execute a thousand instructions.
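    The _ filehandle trick might be sketched as follows. The temporary file exists only to give the example something to test; File::Temp is a standard module.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file to examine.
my ($fh, $path) = tempfile(UNLINK => 1);
print {$fh} "hello\n";
close($fh) or die "Can't close $path: $!";

my ($size, $writable);
if (-f $path) {                        # the only real stat(2) call
    $size     = -s _;                  # reuses the saved stat structure
    $writable = -w _ ? "yes" : "no";   # so does this
}

print "$path: $size bytes, writable: $writable\n";
```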

  • Avoid unnecessary calls to system. The system operator has to fork a subprocess and execute the program you specify. Or worse, execute a shell to execute the program you specify. This can easily execute a million instructions.

  • Worry about starting subprocesses, but only if they're frequent. Starting a single pwd, hostname, or find process isn't going to hurt you much - after all, a shell starts subprocesses all day long. We do occasionally encourage the toolbox approach, believe it or not.

  • Keep track of your working directory yourself rather than calling pwd repeatedly. (A package is provided in the standard library for this. See the Cwd module in Chapter 7.)

  • Avoid shell metacharacters in commands - pass lists to system and exec where appropriate.
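    A minimal sketch of the list form. It runs the current Perl interpreter ($^X) as the child program purely so that the example is self-contained; the argument string is invented.

```perl
use strict;
use warnings;

my $tricky = 'my file; echo $HOME *';    # full of shell metacharacters

# List form: the program is run directly with these arguments, so
# no shell gets a chance to interpret the metacharacters.
my $status = system($^X, '-e', 'print "got: $ARGV[0]\n"', $tricky);

die "child failed with status $status" if $status != 0;
```

    Had the same text been pasted into a single command string, the shell would have seen the semicolon, the variable, and the wildcard.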

  • Set the sticky bit on the Perl interpreter on machines without demand paging.

    chmod +t /usr/bin/perl

  • Using defaults doesn't make your program faster.

8.3.2 Space Efficiency

  • Use vec for compact integer array storage.
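    For instance, one hundred small integers in the range 0 to 15 fit into fifty bytes when stored four bits apiece. The values here are arbitrary.

```perl
use strict;
use warnings;

my $nybbles = '';

# Store 100 integers (each 0..15) in 4 bits apiece.
for my $i (0 .. 99) {
    vec($nybbles, $i, 4) = $i % 16;
}

# Read one element back by index.
my $value = vec($nybbles, 37, 4);

printf "100 values in %d bytes; element 37 is %d\n",
    length($nybbles), $value;
```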

  • Prefer numeric values over string values - they require little additional space over that allocated for the scalar header structure.

  • Use substr to store constant-length strings in a longer string.

  • Use the Tie::SubstrHash module for very compact storage of a hash array, if the key and value lengths are fixed.

  • Use __END__ and the DATA filehandle to avoid storing program data as both a string and an array.

  • Prefer each to keys where order doesn't matter.

  • Delete or undef globals that are no longer in use.

  • Use some kind of DBM to store hashes.
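    A sketch using SDBM_File, the DBM flavor bundled with Perl. The file name "counts" and the temporary directory are choices made for the example; any DBM module you have installed is tied in much the same way.

```perl
use strict;
use warnings;
use Fcntl;                       # supplies O_RDWR and O_CREAT
use SDBM_File;
use File::Temp qw(tempdir);
use File::Spec;

my $dir  = tempdir(CLEANUP => 1);
my $base = File::Spec->catfile($dir, 'counts');   # hypothetical file name

# Tie the hash to a disk file: the data lives on disk, not in memory.
my %counts;
tie(%counts, 'SDBM_File', $base, O_RDWR|O_CREAT, 0666)
    or die "Can't tie $base: $!";
$counts{camel}++ for 1 .. 3;
$counts{llama} = 1;
untie %counts;

# Reopen the same file later and read a value back.
tie(%counts, 'SDBM_File', $base, O_RDWR, 0666)
    or die "Can't retie $base: $!";
my $camels = $counts{camel};
untie %counts;

print "camel seen $camels times\n";
```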

  • Use temp files to store arrays.

  • Use pipes to offload processing to other tools.

  • Avoid list operations and file slurps.

  • Avoid using tr///, each of which must store a translation table of 256 short integers (not characters, since we have to remember which characters are to be deleted).

  • Don't unroll your loops or inline your subroutines.

8.3.3 Programmer Efficiency

  • Use defaults.

  • Use funky shortcut command-line switches like -a, -n, -p, -s, -i.

  • Use for to mean foreach.

  • Sling UNIX commands around with backticks.

  • Use <*> and such.

  • Use run-time-compiled patterns.

  • Use patterns with lots of *, +, and {}.

  • Sling whole arrays and slurp entire files.

  • Use getc.

  • Use $&, $`, and $'.

  • Don't check error values on open, since <HANDLE> and print HANDLE will simply no-op when given an invalid handle.

  • Don't close your files - they'll be closed on the next open.

  • Pass subroutine arguments as globals.

  • Don't name your subroutine parameters. You can access them directly as $_[EXPR].

  • Use whatever you think of first.

8.3.4 Maintainer Efficiency

  • Don't use defaults.

  • Use foreach to mean foreach.

  • Use meaningful loop labels with next and last.

  • Use meaningful variable names.

  • Use meaningful subroutine names.

  • Put the important thing first on the line using and, or, and statement modifiers.

  • Close your files as soon as you're done with them.

  • Use packages, modules, and classes to hide your implementation details.

  • Pass arguments as subroutine parameters.

  • Name your subroutine parameters using my.

  • Parenthesize for clarity.

  • Put in lots of (useful) comments.

  • Write the script as its own POD document.
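Several of these hints fit in one short sketch: a meaningful subroutine name, a parameter named with my, foreach spelled out, a loop label, and comments. The subroutine and its data are invented for illustration.

```perl
use strict;
use warnings;

# Return the first complete row that contains a negative number,
# or undef if there is none. Rows with missing cells are skipped.
sub first_negative_row {
    my ($rows) = @_;                 # named parameter rather than $_[0]
    my $found;
    ROW: foreach my $row (@$rows) {
        foreach my $cell (@$row) {
            next ROW unless defined $cell;   # incomplete row: skip it
            if ($cell < 0) {
                $found = $row;
                last ROW;                    # leave both loops at once
            }
        }
    }
    return $found;
}

my @table = ([1, 2], [3, undef, -1], [4, -5]);
my $row   = first_negative_row(\@table);

print "@$row\n";    # prints 4 -5
```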

8.3.5 Porter Efficiency

  • Wave a handsome tip under his nose.

  • Avoid functions that aren't implemented everywhere. You can use eval tests to see what's available.

  • Don't expect native float and double to pack and unpack on foreign machines.

  • Use network byte order when sending binary data over the network.

  • Don't send binary data over the network.

  • Check $] to see if the current version supports all the features you use.

  • Don't use $]: use require with a version number.

  • Put in the eval exec hack even if you don't use it.

  • Put the #!/usr/bin/perl line in even if you don't use it.

  • Test for variants of UNIX commands. Some finds can't handle -xdev, for example.

  • Avoid variant UNIX commands if you can do it internally. UNIX commands don't work too well on MS-DOS or VMS.

  • Use the Config module or the $^O variable to find out what kind of machine you're running on.

  • Put all your scripts and manpages into a single NFS filesystem that's mounted everywhere.
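Three of the hints above, sketched together: an eval test for a function that may not be implemented, network byte order with pack, and asking Perl what kind of system it is running on. The number 1997 is just sample data.

```perl
use strict;
use warnings;
use Config;

# Probe for an optional function: where symlink is not implemented,
# the call raises a fatal error, which the eval traps.
my $has_symlink = eval { symlink("", ""); 1 } ? 1 : 0;

# The "N" format packs a 32-bit unsigned integer in network
# (big-endian) order, whatever the native byte order is.
my $wire   = pack("N", 1997);
my $number = unpack("N", $wire);

print "OS: $^O (Config says $Config{osname})\n";
print "symlink available: ", ($has_symlink ? "yes" : "no"), "\n";
printf "1997 on the wire: %s\n", unpack("H*", $wire);
```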

8.3.6 User Efficiency

  • Avoid forcing prompt order - pop users into their favorite editor with a form.

  • Better yet, use a GUI like the Perl Tk extension, where users can control the order of events.

  • Put up something for users to read while you continue doing work.

  • Use autoloading so that the program appears to run faster.

  • Give the option of helpful messages at every prompt.

  • Give a helpful usage message if users don't give correct input.

  • Display the default action at every prompt, and maybe a few alternatives.

  • Choose defaults for beginners. Allow experts to change the defaults.

  • Use single character input where it makes sense.

  • Pattern the interaction after other things the user is familiar with.

  • Make error messages clear about what needs fixing. Include all pertinent information such as filename and errno, like this:

    open(FILE, $file) or die "$0: Can't open $file for reading: $!\n";

  • Use fork and exit to detach when the rest of the script is batch processing.

  • Allow arguments to come either from the command line or via standard input.

  • Use text-oriented network protocols.

  • Don't put arbitrary limitations into your program.

  • Prefer variable-length fields over fixed-length fields.

  • Be vicariously lazy.

  • Be nice.