5. Packages, Modules, and Object Classes

Contents:
Packages
Modules
Objects
Using Tied Variables
Some Hints About Object Design

This chapter, more than any other in this book, is about Laziness, Impatience, and Hubris - because this chapter is about good software design.

We've all fallen into the trap of using cut-and-paste when we should have chosen to define a higher-level abstraction, if only just a loop or subroutine.[1] To be sure, some folks have gone to the opposite extreme of defining ever-growing mounds of higher-level abstractions when they should have used cut-and-paste.[2] Generally, though, most of us need to think about using more abstraction rather than less.

[1] This is a form of False Laziness.
[2] This is a form of False Hubris.

(Caught somewhere in the middle are the people who have a balanced view of how much abstraction is good, but who jump the gun on writing their own abstractions when they should be reusing existing code.)[3]

[3] You guessed it, this is False Impatience. But if you're determined to reinvent the wheel, at least try to invent a better one.

Whenever you're tempted to do any of these things, you need to sit back and think about what will do the most good for you and your neighbor over the long haul. If you're going to pour your creative energies into a lump of code, why not make the world a better place while you're at it? (Even if you're only aiming for the program to succeed, you need to make sure it fits its ecological niche.)

The first step toward ecologically sustainable programming is simply: don't litter in the park. When you write a chunk of code, think about giving the code its own namespace, so that your variables and functions don't clobber anyone else's, or vice versa. A namespace is a bit like your home, where you're allowed to be as messy as you like, as long as you keep your external interface to other citizens moderately civil. In Perl, a namespace is called a package. Packages provide the fundamental building block upon which the higher-level concepts of modules and classes are constructed.

Like the notion of "home", the notion of "package" is a bit nebulous. Packages are independent of files. You can have many packages in a single file, or a single package that spans several files, just as your home could be one part of a larger building, if you live in an apartment, or could comprise several buildings, if your name happens to be Queen Elizabeth. But the usual size of a home is one building, and the usual size of a package is one file. Perl has some special help for people who want to put one package in one file, as long as you're willing to name the file with the same name as the package and give your file an extension of ".pm", which is short for "perl module". The module is the unit of reusability in Perl. Indeed, the way you use a module is with the use command, which is a compiler directive that controls the importation of functions and variables from a module. Every example of use you've seen until now has been an example of module reuse.

Object classes are another concept built on the package concept. The concept of classes therefore cuts across the concepts of files and modules. But the typical class is nevertheless implemented with a module. (If you're starting to get the feeling that much of Perl culture is governed by mere convention, then you're starting to get the right feeling, civilly speaking. The trend over the last 20 years or so has been to design computer languages that enforce a state of paranoia. You're expected to program every module as if it were in a state of siege. Certainly there are some feudal cultures where this is appropriate, but not all cultures are like this. In Perl culture, by contrast, you're expected to stay out of someone's home because you weren't invited in, not because there are bars[4] on the windows.)

[4] But Perl provides some bars if you want them, too. See the Safe module in Chapter 7, The Standard Perl Library, for instance.

Anyway, back to classes. When you use a module that implements a class, you're benefiting from the direct reuse of the software that implements that module. But with object classes you can get the additional benefits of indirect software reuse when the class you're using turns around and reuses other classes that it gets some characteristics from. But this is not primarily a book about object-oriented methodology, and we're not here to convert you into a raving object-oriented zealot, even if you want to be converted. There are already plenty of books out there for that. Perl's philosophy of object-oriented design fits right in with Perl's philosophy of everything else: use object-oriented design where it makes sense, and avoid it where it doesn't. Your call.

As we mentioned in the previous chapter, object-oriented programming in Perl is accomplished through use of references that happen to refer to thingies that know which class they're associated with. In fact, now that you know about references, you know almost everything hard about objects. The rest of it just "lays under the fingers", as a violinist would say. You will need to practice a little, though.

In this chapter we will discuss creation and use of packages, modules, and classes. Then we will review some of the essentials of object-oriented programming, explain how references become objects, and illustrate how these objects are manipulated as members of one or more classes. We'll also tell you how to tie ordinary variables into object classes to turn them into magical variables.

5.1 Packages

Perl provides a mechanism to protect different sections of code from inadvertently tampering with each other's variables. In fact, apart from certain magical variables, there's really no such thing as a global variable in Perl. Code is always compiled in the current package. The initial current package is package main, but at any time you can switch the current package to another one using the package declaration. The current package determines which symbol table is used for name lookups (for names that aren't otherwise package-qualified). The notion of "current package" is both a compile-time and run-time concept. Most name lookups happen at compile-time, but run-time lookups happen when symbolic references are dereferenced, and also when new bits of code are parsed under eval. In particular, eval operations know which package they were invoked in, and propagate that package inward as the current package of the evaluated code. (You can always switch to a different package within the eval string, of course, since an eval string counts as a block, as does a file loaded in with do, require, or use.)
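Here is a minimal sketch of how eval inherits the current package, assuming a reasonably modern Perl. The package names Inner and Other are invented for the illustration.

```perl
use strict;
use warnings;

package Inner;

# The string is compiled at run time, in the package where eval was invoked.
my $from_eval = eval '__PACKAGE__';

# An eval string counts as a block, so a package switch inside it stays inside.
my $switched = eval 'package Other; __PACKAGE__';
my $after    = __PACKAGE__;

package main;

print "$from_eval $switched $after\n";
```

The second eval changes package only for the code in its own string; the surrounding code is still compiled in Inner.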

The scope of a package declaration is from the declaration itself through the end of the innermost enclosing block (or until another package declaration at the same level, which hides the earlier one). All subsequent identifiers (except those declared with my, or those qualified with a different package name) will be placed in the symbol table belonging to the package. Typically, you would put a package declaration as the first declaration in a file to be included by require or use. But again, that's by convention. You can put a package declaration anywhere you can put a statement. You could even put it at the end of a block, in which case it would have no effect whatsoever. You can switch into a package in more than one place; it merely influences which symbol table is used by the compiler for the rest of that block. (This is how a given package can span more than one file.)
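The scoping rule can be sketched as follows, assuming a Perl recent enough to have our (5.6 or later). The package Zoo and its contents are made up for the example.

```perl
use strict;
use warnings;

our $animal = "camel";          # $main::animal

{
    package Zoo;                # in effect until the closing brace
    our $animal = "llama";      # $Zoo::animal
    sub where { return __PACKAGE__ }
}

# Back in package main: the declaration ended with its block.
my $here = __PACKAGE__;

print "$here: $main::animal, ", Zoo::where(), ": $Zoo::animal\n";
```

Wrapping the package declaration in a bare block is one way to keep it from affecting the rest of the file.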

You can refer to identifiers[5] in other packages by prefixing ("qualifying") the identifier with the package name and a double colon: $Package::Variable. If the package name is null, the main package is assumed. That is, $::sail is equivalent to $main::sail.[6] (The old package delimiter was a single quote, which produced things like $main'sail and $'sail. But a double colon is now the preferred delimiter, in part because it's more readable to humans, and in part because it's more readable to emacs macros. It also gives C++ programmers a warm feeling.)

[5] By identifiers, we mean the names used as symbol table keys to access scalar variables, array variables, hash variables, functions, file or directory handles, and formats. Syntactically speaking, labels are also identifiers, but they aren't put into a particular symbol table; rather, they are attached directly to the statements in your program. Labels may not be package qualified.
[6] To clear up another bit of potential confusion, in a variable name like $main::sail, we use the term "identifier" to talk about main and sail, but not main::sail. We call that a variable name instead, because an identifier may not contain a colon. The definition of an identifier is lexical, in that an identifier is a token that matches the pattern /^[A-Za-z_][A-Za-z_0-9]*$/.
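As a small sketch of qualified names (the Boat package is hypothetical, and our assumes Perl 5.6 or later), two packages can each own a $sail without colliding, and a null package name reaches the one in main:

```perl
use strict;
use warnings;

package Boat;
our $sail = "jib";              # $Boat::sail

package main;
our $sail = "mainsail";         # $main::sail

# A null package name means main, so this assigns to $main::sail.
$::sail = "spinnaker";

print "$main::sail $Boat::sail\n";
```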

Packages may be nested inside other packages: $OUTER::INNER::var. This implies nothing about the order of name lookups, however. There are no fallback symbol tables. All undeclared symbols are either local to the current package, or must be fully qualified from the outer package name down. For instance, there is nowhere within package OUTER that $INNER::var refers to $OUTER::INNER::var. It would treat package INNER as a totally separate global package. Similarly, every package declaration must declare a complete package name. No package name ever assumes any kind of implied "prefix", even if (seemingly) declared within the scope of some other package declaration.
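The absence of fallback lookups can be seen in a short sketch like this one. It reuses the OUTER and INNER names from the text and adds a variable $seen purely for illustration.

```perl
use strict;
use warnings;

package OUTER::INNER;
our $var = "nested";            # $OUTER::INNER::var

package INNER;
our $var = "top-level";         # $INNER::var, an unrelated package

package OUTER;
# No fallback lookup: this names the top-level INNER, not OUTER::INNER.
my $seen = $INNER::var;

package main;
print "$seen\n";
```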

Only identifiers (names starting with letters or underscore) are stored in the current package's symbol table. All other symbols are kept in package main, including all the magical punctuation-only variables like $! and $_. In addition, the identifiers STDIN, STDOUT, STDERR, ARGV, ARGVOUT, ENV, INC, and SIG are forced to be in package main even when used for purposes other than their built-in ones. Furthermore, if you have a package called m, s, y, or tr, then you can't use the qualified form of an identifier as a filehandle because it will be interpreted instead as a pattern match, a substitution, or a translation. Using uppercase package names avoids this problem.
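A brief sketch of this rule, using an invented package named Elsewhere: the unqualified $_ and %ENV used there turn out to be the ones in main.

```perl
use strict;
use warnings;

package Elsewhere;

$_ = "set in Elsewhere";                    # really $main::_
my $env_is_main = \%ENV == \%main::ENV;     # %ENV is forced into main

package main;

my $same_underscore = \$_ == \$main::_;

print "$_\n" if $same_underscore and $env_is_main;
```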

Assignment of a string to %SIG assumes the signal handler specified is in the main package, if the name assigned is unqualified. Qualify the signal handler name if you want to have a signal handler in a package, or don't use a string at all: assign a typeglob or a function reference instead:

$SIG{QUIT} = "quit_catcher";     # implies "main::quit_catcher"
$SIG{QUIT} = *quit_catcher;      # forces current package's sub
$SIG{QUIT} = \&quit_catcher;     # forces current package's sub
$SIG{QUIT} = sub { print "Caught SIGQUIT\n" };  # anonymous sub

See my and local in Chapter 3, Functions, for other scoping issues. See the "Signals" section in Chapter 6, Social Engineering, for more on signal handlers.

5.1.1 Symbol Tables

The symbol table for a package happens to be stored in a hash whose name is the same as the package name with two colons appended. The main symbol table's name is thus %main::, or %:: for short, since package main is the default. Likewise, the symbol table for the nested package we mentioned earlier is named %OUTER::INNER::. As it happens, the main symbol table contains all other top-level symbol tables, including itself, so %OUTER::INNER:: is also %main::OUTER::INNER::.

When we say that a symbol table "contains" another symbol table, we mean that it contains a reference to the other symbol table. Since package main is a top-level package, it contains a reference to itself, with the result that %main:: is the same as %main::main::, and %main::main::main::, and so on, ad infinitum. It's important to check for this special case if you write code to traverse all symbol tables.
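One way to handle that special case is sketched below. The helper list_packages is hypothetical, not part of any standard module; it treats every symbol table key ending in a double colon as a nested symbol table, and skips the entry in main that refers back to main.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the names of all symbol tables reachable
# from the one named in $stash_name (for example "main::").
sub list_packages {
    my ($stash_name, $found) = @_;
    $found ||= [];
    no strict 'refs';                       # we look up a symbol table by name
    for my $key (sort keys %{$stash_name}) {
        next unless $key =~ /^\w+::$/;      # only nested symbol tables
        # %main:: contains itself; skip it or we would recurse forever.
        next if $stash_name eq 'main::' and $key eq 'main::';
        my $child = $stash_name eq 'main::' ? $key : $stash_name . $key;
        push @$found, $child;
        list_packages($child, $found);
    }
    return $found;
}

{ package OUTER::INNER; our $var = 1; }

my $packages = list_packages('main::');
print "$_\n" for grep { /^OUTER/ } @$packages;
```

Without the check for main::, the traversal would descend from main into main again and never finish.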

The keys in a symbol table hash are the identifiers of the symbols in the symbol table. The values in a symbol table hash are the corresponding typeglob values. So when you use the *name typeglob notation, you're really just accessing a value in the hash that holds the current package's symbol table. In fact, the following have the same effect, although the first is potentially more efficient because it does the symbol table lookup at compile time:

local *somesym = *main::variable;
local *somesym = $main::{"variable"};

Since a package is a hash, you can look up the keys of the package, and hence all the variables of the package. Try this:

foreach $symname (sort keys %main::) {
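    local *sym = $main::{$symname};
    print "\$$symname is defined\n" if defined $sym;
    # Recent Perls reject defined @sym and defined %sym,
    # so test whether the array and hash slots exist instead.
    print "\@$symname is defined\n" if defined *sym{ARRAY};
    print "\%$symname is defined\n" if defined *sym{HASH};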
}

Since all packages are accessible (directly or indirectly) through package main, you can visit every package variable in the program, using code written in Perl. The Perl debugger does precisely that when you ask it to dump all your variables.

Assignment to a typeglob performs an aliasing operation; that is,

*dick = *richard;

causes everything accessible via the identifier richard to also be accessible via the symbol dick. If you only want to alias a particular variable or subroutine, assign a reference instead:

*dick = \$richard;

This makes $richard and $dick the same variable, but leaves @richard and @dick as separate arrays. Tricky, eh?
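A quick sketch that makes the difference visible, assuming Perl 5.6 or later for our:

```perl
use strict;
use warnings;

our ($richard, @richard, $dick, @dick);

$richard = "scalar";
@richard = (1, 2, 3);

*dick = \$richard;      # alias only the scalar slot

$dick = "changed";      # writes through to $richard
push @dick, "own";      # @dick is still its own array

print "$richard / @richard / @dick\n";
```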

This mechanism may be used to pass and return cheap references into or from subroutines if you don't want to copy the whole thing:

%some_hash = ();
*some_hash = fn( \%another_hash );
sub fn {
    local *hashsym = shift;
    # now use %hashsym normally, and you
    # will affect the caller's %another_hash
    my %nhash = (); # populate this hash at will
    return \%nhash;
}

On return, the reference will overwrite the hash slot in the symbol table specified by the *some_hash typeglob. This is a somewhat sneaky way of passing around references cheaply when you don't want to have to remember to dereference variables explicitly. It only works on package variables though, which is why we had to use local there instead of my.

Another use of symbol tables is for making "constant" scalars:

*PI = \3.14159265358979;

Now you cannot alter $PI, which is probably a good thing, all in all.
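To see the protection at work, here is a small sketch that traps the failed assignment with eval. The variables $area, $ok, and $error are only there for the demonstration.

```perl
use strict;
use warnings;

our $PI;
*PI = \3.14159265358979;        # $PI now aliases a read-only literal

my $area = $PI * 2 ** 2;        # reading works as usual

# Writing raises a "Modification of a read-only value attempted" exception.
my $ok    = eval { $PI = 3; 1 };
my $error = $@;

print "area = $area\n";
print "assignment refused: $error" unless $ok;
```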

When you do that assignment, you're just replacing one reference within the typeglob. If you think about it sideways, the typeglob itself can be viewed as a kind of hash, with entries for the different variable types in it. In this case, the keys are fixed, since a typeglob can contain exactly one scalar, one array, one hash, and so on. But you can pull out the individual references, like this:

*pkg::sym{SCALAR}      # same as \$pkg::sym
*pkg::sym{ARRAY}       # same as \@pkg::sym
*pkg::sym{HASH}        # same as \%pkg::sym
*pkg::sym{CODE}        # same as \&pkg::sym
*pkg::sym{GLOB}        # same as \*pkg::sym
*pkg::sym{FILEHANDLE}  # internal filehandle, no direct equivalent (later Perls prefer *pkg::sym{IO})
*pkg::sym{NAME}        # "sym" (not a reference)
*pkg::sym{PACKAGE}     # "pkg" (not a reference)

This is primarily used to get at the internal filehandle reference, since the other internal references are already accessible in other ways. But we thought we'd generalize it because it looks kind of pretty. Sort of. You probably don't need to remember all this unless you're planning to write a Perl debugger. So let's get back to the topic of writing good software.
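For the curious, a short sketch of pulling references out of a typeglob. It reuses the pkg and sym names from the table above and assumes a Perl recent enough to support the NAME and PACKAGE entries.

```perl
use strict;
use warnings;

package pkg;
our $sym = "scalar";
our @sym = (1, 2);
sub sym { return "code" }

package main;

my $scalar_ref = *pkg::sym{SCALAR};
my $array_ref  = *pkg::sym{ARRAY};
my $code_ref   = *pkg::sym{CODE};

# NAME and PACKAGE yield plain strings rather than references.
my $full = join "::", *pkg::sym{PACKAGE}, *pkg::sym{NAME};

print "$$scalar_ref @$array_ref ", $code_ref->(), "\n";
print "$full\n";
```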

5.1.2 Package Constructors and Destructors: BEGIN and END

Two special subroutine definitions that function as package constructors and destructors[7] are the BEGIN and END routines. The sub is optional for these routines.

[7] Strictly speaking, these aren't constructors and destructors, but initializers and finalizers. And strictly speaking, packages aren't objects. But strictly speaking, we don't speak strictly around here too often.

A BEGIN subroutine is executed as soon as possible, that is, the moment it is completely defined, even before the rest of the containing file is parsed. You may have multiple BEGIN blocks within a file - they will execute in order of definition. Because a BEGIN block executes immediately, it can pull in definitions of subroutines and such from other files in time to be visible during compilation of the rest of the file. This is important because subroutine declarations change how the rest of the file will be parsed. At the very least, declaring a subroutine allows it to be used as a list operator, without parentheses. And if the subroutine is declared with a prototype, then calls to that subroutine may be parsed like any of several built-in functions (depending on which prototype is used).
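The following sketch shows both effects, under the assumption of Perl 5.6 or later. The array @order and the sub shout are invented for the example; installing shout through a typeglob stands in for what importing from another file would do.

```perl
use strict;
use warnings;

our @order;

push @order, "run time";                # executes only once compilation is done

BEGIN { push @order, "first BEGIN" }    # runs the moment it is parsed
BEGIN { push @order, "second BEGIN" }   # BEGIN blocks run in order of definition

# Stand-in for importing from another file: install a sub at compile time
# so the parser already knows the name when it reaches the call below.
BEGIN { *shout = sub { return uc join " ", @_ } }

my $loud = shout "red", "green";        # list operator, no parentheses needed

print join(", ", @order), "\n";
print "$loud\n";
```

Even though the push of "run time" comes first in the file, it lands last in @order, because both BEGIN blocks have already run by the time ordinary statements start executing.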

An END subroutine, by contrast, is executed as late as possible, that is, when the interpreter is being exited, even if it is exiting as a result of a die function, or from an internally generated exception such as you'd get when you try to call an undefined function. (But not if it is being blown out of the water by a signal - you have to trap that yourself, if you can.)[8] You may have multiple END blocks within a file - they will execute in reverse order of definition; that is: last in, first out (LIFO). This is so that related BEGINs and ENDs will nest the way you'd expect, if you pair them up.

[8] See the sigtrap pragmatic module described in Chapter 7 for an easy way to do this. For general information on signal handling, see "Signals" in Chapter 6.

When you use the -n and -p switches to Perl, BEGIN and END work just as they do in awk(1), as a degenerate case. As an illustration of the ordering rules, running the following program outputs the colors in the order red, green, and blue:

die "green\n";
END   { print "blue\n" }
BEGIN { print "red\n" }

Just as eval provides a way to get compilation behavior during run-time, so too BEGIN provides a way to get run-time behavior during compilation. But note that the compiler must execute BEGIN blocks even if you're just checking syntax with the -c switch. By symmetry, END blocks are also executed when syntax checking. Your END blocks should not assume that any or all of your main code ran. (They shouldn't do this in any event, since the interpreter might exit early from an exception.) This is not a bad problem in general. At worst, it means you should test the "definedness" of a variable before doing anything rash with it. In particular, before saying something like:

system "rm -rf '$dir'"

you should always check that $dir contains something meaningful, whether or not you're doing it in an END block. Caveat destructor.

5.1.3 Autoloading

Normally you can't call a subroutine that isn't defined. However, if there is a subroutine named AUTOLOAD in the undefined subroutine's package (or in the case of an object method, in the package of any of the object's base classes), then the AUTOLOAD subroutine is called with the same arguments as would have been passed to the original subroutine. The fully qualified name of the original subroutine magically appears in the package-global $AUTOLOAD variable, in the same package as the AUTOLOAD routine.

Most AUTOLOAD routines will load a definition for the undefined subroutine in question using eval or require, then execute that subroutine using a special form of goto that erases the stack frame of the AUTOLOAD routine without a trace.
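A minimal sketch of that pattern follows. The package Lazy, the counter $loads, and the %source table are all hypothetical; the table stands in for code that would normally live in another file and be pulled in with require.

```perl
use strict;
use warnings;

package Lazy;

our $AUTOLOAD;
our $loads = 0;     # counts how often AUTOLOAD had to step in

# Hypothetical stand-in for definitions kept in separate files.
my %source = (
    double => 'sub { return 2 * $_[0] }',
);

sub AUTOLOAD {
    my $name = $AUTOLOAD;
    $name =~ s/.*:://;                  # trim package name
    return if $name eq 'DESTROY';       # never autoload destructors
    my $text = $source{$name}
        or die "No such function: $AUTOLOAD";
    my $code = eval $text or die $@;
    $loads++;
    no strict 'refs';
    *{$AUTOLOAD} = $code;               # install it under the requested name
    goto &$code;                        # replace this frame; @_ is passed along
}

package main;

my @results = (Lazy::double(21), Lazy::double(4));
print "@results after $Lazy::loads load(s)\n";
```

Because the routine is installed in the symbol table on first use, the second call goes straight to the real subroutine and AUTOLOAD is never consulted again for that name.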

The standard AutoSplit module is a tool used by module writers to help split their modules into separate files (with filenames ending in .al), each holding one routine. The files are placed in the auto/ directory of the Perl library. These files can then be loaded on demand by the standard AutoLoader module. A similar approach is taken by the SelfLoader module, except that it autoloads functions from the file's own DATA area (which is less efficient in some ways and more efficient in others). Autoloading of Perl functions is analogous to dynamic loading of compiled C functions, except that autoloading (as practiced by AutoLoader and SelfLoader) is done at the granularity of the function call, whereas dynamic loading (as practiced by the DynaLoader module) is done at the granularity of the complete module, and will usually link in many C or C++ functions all at once. (See also the AutoLoader, SelfLoader, and DynaLoader modules in Chapter 7.)

But an AUTOLOAD routine can also just emulate the routine and never define it. For example, let's pretend that any function that isn't defined should just call system with its arguments. All you'd do is this:

sub AUTOLOAD {
    my $program = $AUTOLOAD;
    $program =~ s/.*:://;  # trim package name
    system($program, @_);
}
date();
who('am', 'i');
ls('-l');

In fact, if you predeclare the functions you want to call that way, you don't even need the parentheses:

use subs qw(date who ls);
date;
who "am", "i";
ls "-l";

A more complete example of this is the standard Shell module described in Chapter 7, which can treat undefined subroutine calls as calls to programs.