6.4 Cooperating with Other Languages

Just as there are many levels on which languages can compete, so too there are many levels on which languages can cooperate. Here we'll talk primarily about generation, translation and embedding (via linking).

6.4.1 Program Generation

Almost from the time people first figured out that they could write programs, they started writing programs that write other programs. These are called program generators. (If you're a history buff, you might know that RPG stood for Report Program Generator long before it stood for Role Playing Game.) Now, anyone who has written a program generator knows that it can make your eyes go crossed even when you're wide awake. The problem is simply that much of your program's data looks like real code, but isn't (at least not yet). The same text file contains both stuff that does something and similar-looking stuff that doesn't. Perl has various features that make it easier to mix it together with other languages, textually speaking.

Of course, these features also make it easier to write Perl in Perl, but it's rather expected that Perl would cooperate with itself.

6.4.1.1 Generating other languages in Perl

Perl is, of course, a text-processing language, and most computer languages are textual. Beyond that, the lack of arbitrary limits together with the various quoting and interpolation mechanisms make it pretty easy to visually isolate the code of the other language you're spitting out. For example, here is a small chunk of s2p, the sed-to-perl translator:

print &q(<<"EOT");
:       #!$bin/perl
:       eval 'exec $bin/perl -S \$0 \${1+"\$@"}'
:               if \$running_under_some_shell;
:       
EOT

Here the enclosed text happens to be legal in two languages, both Perl and shell. We've used the trick of putting a colon and a tab on the front of every line, which visually isolates the enclosed code. One variable, $bin, is interpolated in the multi-line quote in two places, and then the string is passed through a function to strip the colon and tab.
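The stripping function itself isn't shown in that excerpt. A minimal sketch of such a helper might look like the following; the name strip_margin and the value of $bin are made up for illustration, and this is not the actual routine from s2p:

```perl
use strict;
use warnings;

# Hypothetical helper (not the real s2p routine): remove a leading
# colon and an optional tab from every line of the text.
sub strip_margin {
    my ($text) = @_;
    $text =~ s/^:\t?//mg;    # /m lets ^ match at the start of each line
    return $text;
}

my $bin = '/usr/local/bin';  # assumed install location, for illustration

# The colon-and-tab margin sets the generated code apart visually.
my $header = strip_margin(<<"EOT");
:	#!$bin/perl
:	print "hello\\n";
EOT

print $header;
```

Only the first tab after the colon is removed, so any deeper indentation in the generated code survives.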

Of course, you aren't required to use multi-line quotes. One often sees CGI scripts containing millions of print statements, one per line. It seems a bit like driving to church in an F-16, but hey, if it gets you there....

When you are embedding a large, multi-line quote containing some other language (such as HTML), it's sometimes helpful to pretend you're enclosing Perl into the other language instead:

print <<"END";
stuff
blah blah blah ${ \( EXPR ) } blah blah blah
blah blah blah @{[ LIST ]} blah blah blah
nonsense
END

You can use either of those two tricks to interpolate the value of any scalar EXPR or LIST into a longer string.
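As a concrete sketch of both tricks (the variable names here are invented for the example), the first form interpolates a scalar expression and the second a list, whose elements are joined with spaces:

```perl
use strict;
use warnings;

my $name = 'world';
my @nums = (1, 2, 3);

my $text = <<"END";
Hello, ${ \( uc $name ) }!
Doubled: @{[ map { $_ * 2 } @nums ]}
END

print $text;
```

The first trick takes a reference to the value of the expression and immediately dereferences it; the second builds an anonymous array and dereferences that.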

6.4.1.2 Generating Perl in other languages

Perl can easily be generated in other languages because it's both concise and malleable. You can pick your quotes not to interfere with the other language's quoting mechanisms. You don't have to worry about indentation, or where you put your line breaks, or whether to backslash your backslashes yet again. You aren't forced to define a package as a single string in advance, since you can slide into your package's namespace repeatedly, whenever you want to evaluate more code in that package.
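To illustrate that last point, here is a small sketch in which the fragments are held in a Perl array standing in for code emitted piecemeal by some other program. The package name Generated and the subroutine names are hypothetical:

```perl
use strict;
use warnings;

# Fragments as another program might emit them, one at a time.
my @fragments = (
    'sub greet { return "hello" }',
    'sub greet_twice { return join " ", greet(), greet() }',
);

for my $fragment (@fragments) {
    # Each eval slides back into the same package namespace.
    eval "package Generated; $fragment; 1" or die $@;
}

my $result = Generated::greet_twice();
print "$result\n";
```

The second fragment can call a subroutine defined by the first, even though the two were evaluated separately.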

6.4.2 Translation from Other Languages

One of the very first Perl applications was the sed-to-perl translator, s2p. In fact, Larry delayed the initial release of Perl in order to complete s2p and awk-to-perl (a2p), because he thought they'd improve the acceptance of Perl. Hmm, maybe they did.

6.4.2.1 s2p

The s2p program takes a sed script specified on the command line (or from standard input) and produces a comparable Perl script on the standard output.

Options include:

-Dnumber: Sets debugging flags.
-n: Specifies that this sed script was always invoked as sed -n. Otherwise a switch parser is prepended to the front of the script.
-p: Specifies that this sed script was never invoked as sed -n. Otherwise a switch parser is prepended to the front of the script.

The Perl script produced looks very sed-like, and there may very well be better ways to express what you want to do in Perl. For instance, s2p does not make any use of the split operator, but you might want to.
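As a hedged illustration of the difference (this is hand-written, not actual s2p output), here is a sed-flavored substitution that extracts the third colon-separated field, next to the same extraction done with split:

```perl
use strict;
use warnings;

my $line = 'root:x:0:0:Superuser:/root:/bin/sh';

# sed style: one big substitution that keeps only the third field.
(my $sed_style = $line) =~ s/^[^:]*:[^:]*:([^:]*):.*$/$1/;

# Perl style: split on the delimiter and pick the field by index.
my $perl_style = (split /:/, $line)[2];

print "$sed_style $perl_style\n";
```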

The Perl script you end up with may be either faster or slower than the original sed script. If you're only interested in speed you'll just have to try it both ways. Of course, if you want to do something sed doesn't do, you have no choice. It's often possible to speed up the Perl script by various methods, such as deleting all references to $\ and chop.

6.4.2.2 a2p

The a2p program takes an awk script specified on the command line (or from standard input) and produces a comparable Perl script on the standard output.

Options include:

-Dnumber

Sets debugging flags.

-Fcharacter

Tells a2p that this awk script is always invoked with a -F switch specifying character.

-nfieldlist

Specifies the names of the input fields if input does not have to be split into an array for some programmatic reason. If you were translating an awk script that processes the password file, you might say:

a2p -7 -nlogin.password.uid.gid.gcos.home.shell

Any delimiter may be used to separate the field names.

-number

Causes a2p to assume that input will always have that many fields.

a2p cannot do as good a job translating as a human would, but it usually does pretty well. There are some areas where you may want to examine the Perl script produced and tweak it some. Here are some of them, in no particular order.

There is an awk idiom of putting int(...) around a string expression to force numeric interpretation, even though the argument is always an integer anyway. This is generally unneeded in Perl, but a2p can't tell if the argument is always going to be an integer, so it leaves it in. You may wish to remove it.

Perl differentiates numeric comparison from string comparison. awk has one operator for both that decides at run-time which comparison to do. a2p does not try to do a complete job of awk emulation at this point. Instead it guesses which one you want. It's almost always right, but it can be spoofed. All such guesses are marked with the comment #???. You should go through and check them. You might want to run at least once with Perl's -w switch, which warns you if you use == where you should have used eq.
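A short sketch shows why the guess matters. The same pair of operands can compare differently depending on which family of operators you pick:

```perl
use strict;
use warnings;

my ($x, $y) = ('10', '10.0');

my $numeric_equal = ($x == $y)    ? 1 : 0;   # compares 10 with 10
my $string_equal  = ($x eq $y)    ? 1 : 0;   # compares "10" with "10.0"

my $numeric_less  = (9 < 10)      ? 1 : 0;   # 9 is less than 10
my $string_less   = ('9' lt '10') ? 1 : 0;   # but "9" sorts after "10"

print "== $numeric_equal, eq $string_equal, < $numeric_less, lt $string_less\n";
```

If a line marked #??? uses the wrong family, the script will still run; it will just quietly give answers like these.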

It would be possible to emulate awk's behavior in selecting string versus numeric operations at run-time by inspection of the operands, but it would be gross and inefficient. Besides, a2p almost always guesses right.

Perl does not attempt to emulate the behavior of awk in which nonexistent array elements spring into existence simply by being referenced. If somehow you are relying on this mechanism to create null entries for a subsequent for...in, they won't be there in Perl.
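A minimal sketch of the difference, using a made-up hash: merely reading a missing element leaves the hash alone, so if you need the entry you have to store it yourself.

```perl
use strict;
use warnings;

my %seen = (apple => 1);

# Reading a missing key yields undef but does not create the key.
my $value = $seen{banana};
my $count_after_read = keys %seen;

# To get the awk-like effect, create the null entry explicitly.
$seen{banana} = '' unless exists $seen{banana};
my $count_after_store = keys %seen;

print "$count_after_read $count_after_store\n";
```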

If a2p makes a split command that assigns to a list of variables that looks like ($Fld1, $Fld2, $Fld3...) you may want to rerun a2p using the -n option mentioned above. This will let you name the fields throughout the script. If it splits to an array instead, the script is probably referring to the number of fields somewhere.

The "exit" statement in awk doesn't necessarily exit; it goes to the END block if there is one. awk scripts that do contortions within the END block to bypass the block under such circumstances can be simplified by removing the conditional in the END block and just exiting directly from the Perl script.

Perl has two kinds of arrays, numerically indexed and associative. awk arrays are usually translated to associative arrays, but if you happen to know that the index is always going to be numeric, you could change the {...} to [...]. Remember that iteration over an associative array is done using the keys function, but iteration over a numeric array isn't. You might need to modify any loop that is iterating over the array in question.
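The two loop shapes look like this. The data is invented for the example, and the array here uses Perl's default base of 0 rather than the base of 1 that a translated script would use:

```perl
use strict;
use warnings;

my %count = (apple => 3, pear => 1);          # associative array
my @list  = ('first', 'second', 'third');     # numerically indexed array

my @report;

# Associative array: iterate over the keys (sorted for a stable order).
for my $key (sort keys %count) {
    push @report, "$key=$count{$key}";
}

# Numeric array: iterate over the indices from 0 to the last one.
for my $i (0 .. $#list) {
    push @report, "$i=$list[$i]";
}

print join(', ', @report), "\n";
```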

awk starts by assuming OFMT has the value %.6g. Perl starts by assuming its equivalent, $#, to have the value %.20g. You'll want to set $# explicitly if you use the default value of OFMT. (Actually, you probably don't want to set $#, but rather put in printf formats everywhere it matters.)
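For instance, an explicit format reproduces awk's default output precision at just the spot where it matters. This is a sketch with an arbitrary value:

```perl
use strict;
use warnings;

my $x = 3.14159265358979;

# %.6g is the format awk starts out with in OFMT.
my $awk_like = sprintf '%.6g', $x;

printf "%.6g\n", $x;
```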

Near the top of the line loop will be the split operator that is implicit in the awk script. There are times when you can move this operator down past some conditionals that test the entire record, so that the split is not done as often.
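Here is a sketch of the idea with made-up input held in an array instead of read from a file. It is hand-written rather than a2p output, so it uses Perl's default array base of 0:

```perl
use strict;
use warnings;

my @lines = ("ok 1 2\n", "error 3 4\n", "ok 5 6\n", "error 7 8\n");

my $splits = 0;
my $total  = 0;

for my $line (@lines) {
    # This test looks at the whole record, so it needs no fields.
    next unless $line =~ /^error/;

    # Split only the records that survive the test.
    my @Fld = split ' ', $line;
    $splits++;
    $total += $Fld[1];
}

print "$splits splits, total $total\n";
```

Had the split stayed at the top of the loop, it would have run four times instead of two.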

For aesthetic reasons you may wish to change the array base $[ from 1 back to Perl's default of 0, but remember to change all array subscripts and all substr and index operations to match.

Cute comments that say:

# Here's a workaround because awk is so dumb.

are, of course, passed through unmodified.

awk scripts are often embedded in a shell script that pipes stuff into and out of awk. Often the shell script wrapper can be incorporated into the Perl script, since Perl can start up pipes into and out of itself, and can do other things that awk can't do by itself.
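As a sketch of a pipe opened from inside the script, the following reads the output of a child command. So that the example needs no other tools, the child is simply another copy of perl (found via $^X); in a real script it would be whatever command the shell wrapper used to run:

```perl
use strict;
use warnings;

# The child command, in list form so that no shell is involved.
my @producer = ($^X, '-e', 'print "pear\n", "apple\n"');

# '-|' means: run the command and read its standard output.
open(my $from_child, '-|', @producer)
    or die "cannot start producer: $!";
my @lines = sort <$from_child>;
close($from_child) or die "producer failed: $? $!";

print @lines;
```

The mode '|-' works the other way around, giving you a filehandle that writes to the child's standard input.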

Scripts that refer to the special variables RSTART and RLENGTH can often be simplified by referring to the variables $`, $&, and $', as long as they are within the scope of the pattern match that sets them.
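The correspondence can be sketched as follows, with an arbitrary string and pattern. The values are copied out inside the block, while the match that set them is still in scope:

```perl
use strict;
use warnings;

my $text = 'hello world';
my ($rstart, $rlength, $before, $after);

if ($text =~ /o w/) {
    $rstart  = length($`) + 1;    # awk's RSTART counts from 1
    $rlength = length($&);        # awk's RLENGTH
    $before  = $`;                # text before the match
    $after   = $';                # text after the match
}

print "RSTART=$rstart RLENGTH=$rlength before=$before after=$after\n";
```

Often you don't need the offsets at all, since what the awk script wanted was the surrounding text in the first place.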

The produced Perl script may have subroutines defined to deal with awk's semantics regarding "getline" and "print". Since a2p usually picks correctness over efficiency, it is almost always possible to rewrite such code to be more efficient by discarding the semantic sugar.

ARGV[0] translates to $0, but ARGV[n] translates to $ARGV[$n]. A loop that tries to iterate over ARGV[0] won't find it.

NOTE: Storage for the awk syntax tree is currently static, and can run out. You'll need to recompile a2p if that happens.

6.4.2.3 find2perl

The find2perl program is really easy to understand if you already understand the UNIX find(1) program. Just type find2perl instead of find, and give it the same arguments you would give to find. It will spit out an equivalent Perl script.

There are a couple of options you can use that your ordinary find(1) command probably doesn't support:

-tar tarfile: Outputs a tar file much like the -cpio switch of some versions of find.
-eval string: Evaluates the string as a Perl expression, and continues if true.
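To give a feel for this kind of script, here is a hand-written sketch using the standard File::Find module. It is not actual find2perl output; it merely does roughly what find dir -type f -name '*.pl' -print would do, on a temporary directory created for the example:

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a scratch directory with two files to search through.
my $dir = tempdir(CLEANUP => 1);
for my $name ('a.pl', 'b.txt') {
    open(my $fh, '>', "$dir/$name") or die "cannot create $name: $!";
    close($fh);
}

# The callback runs once per entry, with $_ set to the entry's name.
my @found;
find(sub { push @found, $_ if -f $_ && /\.pl$/ }, $dir);

print "@found\n";
```

Any test you can write as a Perl expression can go in the callback, which is the same freedom the -eval option gives you.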

6.4.2.4 Source filters

The notion of a source filter started with the idea that a script or module should be able to decrypt itself on the fly, like this:

#!/usr/bin/perl
use MyDecryptFilter;
@*x$]`0uN&k^Zx02jZ^X{.?s!(f;9Q/^A^@~~8H]|,%@^P:q-=
...

But the idea grew from there, and now a source filter can be defined to do any transformation on the input text you like. One can now even do things like this:

#!/usr/bin/perl
use Filter::exec "a2p";
1,30{print $1}

Put that together with the notion of the -x switch mentioned at the beginning of this chapter, and you have a general mechanism for pulling any chunk of program out of an article and executing it, regardless of whether it's written in Perl or not. Now that's cooperation.

The Filter module is available from CPAN.

6.4.3 Translation to Other Languages

Historically, the Perl interpreter has been rather self-contained. When Perl was redesigned for Version 5, however, one of the requirements was that it be possible to write extension modules that could traverse the parsed syntax tree and emit code in other languages, either low-level or high-level. This has now come to pass.

More precisely, this is now coming to pass. Malcolm Beattie has been developing a "real compiler" for Perl. As of this writing, it's in Alpha 2 state, which means it mostly works, except for the really hard bits. The compiler consists of an ordinary Perl parser and interpreter (since you need to be able to execute BEGIN blocks to compile Perl), plus a set of modules under the name of B, which is short for both "Backend" and "Beattie". You don't actually invoke the B module directly though. Instead you invoke a particular backend via the O module, which pulls in the B module for you. Typically you invoke the O module right on the command line with the -M switch, so a compilation command might look like this:

perl -MO=C foo.pl >foo.c

There are three backends at the moment. The C backend rather woodenly spits out C calls into the ordinary Perl interpreter, but it can translate almost anything except the most egregious abuses of the dynamic capabilities of the interpreter. The Bytecode module is also fairly complete, and spits out an external Perl bytecode representation, which can then be read back in and executed by a suitably clued version of Perl. Finally, the CC backend attempts to translate into more idiomatic C with a lot of optimization. Obviously, that's a bit harder to do than the other thing. Nevertheless, it already works on a majority of the Perl regression tests. It's possible with some care to get C code that runs considerably faster than Perl 5's interpreter, which is no slouch to begin with. And Malcolm hasn't put in all the optimizations he wants to yet.

This is an ongoing topic of research, but you'll want to keep track of it. You are quite likely to be using this someday soon, if you aren't already. Look for it on CPAN of course, if it's not already a part of the standard Perl distribution by the time you read this.

6.4.4 Embedding Perl in C and C++

Another part of the design of Perl 5 was that it be possible to embed a Perl interpreter in a C or C++ program. And in fact, the ordinary perl executable pretends to have an embedded interpreter in it; the main() function essentially does this:

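#include <EXTERN.h>
#include <perl.h>

/* xs_init is defined elsewhere in the real perlmain.c; it registers
   the statically linked extensions. */

PerlInterpreter *my_perl;

int main(int argc, char **argv)
{
    int exitstatus;

    /* Create and initialize an interpreter. */
    my_perl = perl_alloc();
    perl_construct( my_perl );

    /* Compile the script named on the command line. */
    exitstatus = perl_parse( my_perl, xs_init, argc, argv,
                                          (char **) NULL );
    if (exitstatus)
        exit( exitstatus );

    /* Run the compiled program. */
    exitstatus = perl_run( my_perl );

    /* Tear the interpreter down again. */
    perl_destruct( my_perl );
    perl_free( my_perl );

    exit(exitstatus);
}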

The important parts are the calls to perl_parse() and perl_run(), which respectively compile and run the program. If you were embedding Perl in your own program, you might replace the call to perl_run() with calls to the perl_call_sv() function, which calls individual subroutines rather than the program as a whole. Or you can do both, if the main script contains initialization code as well as subroutine definitions.

There are many other useful entry points into the interpreter, such as perl_eval_sv(), which evaluates a string, but this chapter is already getting pretty long, and the fact of the matter is that there is extensive online documentation for the internals of Perl. To include it here would make this book even more unwieldy than it is, and most people who would be embedding Perl aren't scared of online documentation. See the perlembed(3) manpage for more on embedding Perl interpreters in your program.

A number of programs in the real world already have Perl embedded in them; the authors know of several proprietary products shipping with embedded Perl interpreters. There are also a couple of modules for the Apache HTTP server that use an embedded Perl interpreter to avoid process startup costs on CGI-like scripting. And then there's the version of Berkeley's nvi editor with a Perl engine in it. Watch out, emacs, you've got company. :-)

6.4.5 Embedding C and C++ in Perl

If a respectable number of programs embed a Perl interpreter, then a veritable flood of extension modules embed C and C++ into Perl. Again, the Perl distribution itself does this with many of its standard extension modules, including DB_File, DynaLoader, Fcntl, FileHandle, GDBM_File, NDBM_File, ODBM_File, POSIX, Safe, SDBM_File, and Socket. And many of the modules on CPAN do this. So if you decide to do it yourself, you won't feel like you're researching a Ph.D. dissertation.

And again, we only have space to give you teasers for the online documentation, which is exhaustively extensive. We recommend you start with the perlxstut(3) manpage, which is a tutorial on the XS language, a preprocessor that spits out the glue routines you need to do the "impedance matching" between Perl and C or C++. You'll also be interested in perlxs(3), perlguts(3), and perlcall(3).

And once again, let us reiterate that your best resource is the Perl community itself. They invented a lot of this stuff, and are emotionally committed to making you like it, whether you like it or not. You'd better cooperate.