Recipe 7.7. Writing a Filter

Problem

You want to write a program that takes a list of filenames on the command line and reads from STDIN if no filenames were given. You'd like the user to be able to give the file "-" to indicate STDIN or "someprogram |" to indicate the output of another program. You might want your program to modify the files in place or to produce output based on its input.

Solution

Read lines with <>:

while (<>) {
    # do something with the line
}

Discussion

When you say:

while (<>) {
    # ...
}

Perl translates this into:[4]

[4] Except that the code written here won't work because ARGV has internal magic.

unshift(@ARGV, '-') unless @ARGV;
while ($ARGV = shift @ARGV) {
    unless (open(ARGV, $ARGV)) {
        warn "Can't open $ARGV: $!\n";
        next;
    }
    while (defined($_ = <ARGV>)) {
        # ...
    }
}

You can access ARGV and $ARGV inside the loop to read more from the filehandle or to find the filename currently being processed. Let's look at how this works.

Behavior

If the user supplies no arguments, Perl sets @ARGV to a single string, "-". This is shorthand for STDIN when opened for reading and STDOUT when opened for writing. It's also what lets the user of your program specify "-" as a filename on the command line to read from STDIN.

Next, the file processing loop removes one argument at a time from @ARGV and copies the filename into the global variable $ARGV. If the file cannot be opened, Perl goes on to the next one. Otherwise, it processes a line at a time. When the file runs out, the loop goes back and opens the next one, repeating the process until @ARGV is exhausted.
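The loop described above can be tried without a real command line by assigning to @ARGV yourself. What follows is only a minimal sketch: the helper name tag_lines and the temporary files are assumptions made for illustration and are not part of the recipe.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: feed the given names to <> and return every
# line prefixed with the name of the file it came from.
sub tag_lines {
    local @ARGV = @_;              # <> shifts names off @ARGV as it goes
    my @tagged;
    while (my $line = <>) {        # implicit defined() test, as with $_
        push @tagged, "$ARGV: $line";
    }
    return @tagged;
}

# Demonstration with two throwaway files.
my ($fh1, $name1) = tempfile(UNLINK => 1);
my ($fh2, $name2) = tempfile(UNLINK => 1);
print $fh1 "first\n";   close $fh1;
print $fh2 "second\n";  close $fh2;
print tag_lines($name1, $name2);
```

Using local on @ARGV keeps the caller's argument list intact once the helper returns.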

The open statement didn't say open(ARGV, "< $ARGV"). There's no extra less-than symbol supplied. This allows for interesting effects, like passing the string "gzip -dc file.gz |" as an argument, to make your program read the output of the command "gzip -dc file.gz". See Recipe 16.6 for more about this use of magic open.
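As a rough illustration of that magic open, the sketch below hands <> an argument ending in a pipe symbol. The helper name read_args is hypothetical, and the example assumes a Unix-like system where an echo command is available.

```perl
use strict;
use warnings;

# Hypothetical helper: collect everything <> yields for the given arguments.
sub read_args {
    local @ARGV = @_;
    my @lines;
    while (my $line = <>) {
        push @lines, $line;
    }
    return @lines;
}

# The trailing "|" makes the implicit two-argument open run the command
# and read its output. "echo" is assumed to exist on this system.
print read_args("echo hello from a pipe |");
```

Because every argument is interpreted this way, arguments that come from an untrusted source deserve some care before they reach <>.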

You can change @ARGV before or inside the loop. Let's say you don't want the default behavior of reading from STDIN if there aren't any arguments - you want it to default to all the C or C++ source and header files. Insert this line before you start processing <ARGV>:

@ARGV = glob("*.[Cch]") unless @ARGV;

Process options before the loop, either with one of the Getopt libraries described in Chapter 15, User Interfaces, or manually:

# arg demo 1: Process optional -c flag 
if (@ARGV && $ARGV[0] eq '-c') { 
    $chop_first++;
    shift;
}

# arg demo 2: Process optional -NUMBER flag    
if (@ARGV && $ARGV[0] =~ /^-(\d+)$/) { 
    $columns = $1; 
    shift;
}

# arg demo 3: Process clustering -a, -i, -n, or -u flags     
while (@ARGV && $ARGV[0] =~ /^-(.+)/ && (shift, ($_ = $1), 1)) { 
    next if /^$/; 
    s/a// && (++$append,      redo);
    s/i// && (++$ignore_ints, redo); 
    s/n// && (++$nostdout,    redo); 
    s/u// && (++$unbuffer,    redo); 
    die "usage: $0 [-ainu] [filenames] ...\n";    
}
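For comparison, here is a sketch of the third demo written with the standard Getopt::Std module instead of a hand-rolled loop. The helper name parse_flags and the sample arguments are assumptions for illustration; getopts itself works on @ARGV and leaves the non-option arguments there.

```perl
use strict;
use warnings;
use Getopt::Std qw(getopts);

# Hypothetical helper: parse clustered -a, -i, -n, -u flags from a list
# of arguments and return the flags plus the remaining filenames.
sub parse_flags {
    local @ARGV = @_;                  # getopts works on @ARGV
    my %opt;
    getopts('ainu', \%opt)
        or die "usage: $0 [-ainu] [filenames] ...\n";
    return (\%opt, @ARGV);             # what is left are the filenames
}

my ($flags, @files) = parse_flags('-ai', '-n', 'notes.txt');
print join(',', sort keys %$flags), " | @files\n";
```

In a real filter you would call getopts directly on the program's own @ARGV before entering the while (<>) loop.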

Other than its implicit looping over command-line arguments, <> is not special. The special variables controlling I/O still apply; see Chapter 8 for more on them. You can set $/ to set the line terminator, and $. contains the current line (record) number. If you undefine $/, you don't get the concatenated contents of all files at once; you get one complete file each time:

undef $/;		     
while (<>) { 	
    # $_ now has the complete contents of 	
    # the file whose name is in $ARGV     
}

If you localize $/, the old value is automatically restored when the enclosing block exits:

{     # create block for local 	
    local $/;         # record separator now undef 	
    while (<>) { 	    
        # do something; called functions still have 	    
        # undeffed version of $/ 	
    }     
}                     # $/ restored here
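The same idea can be wrapped in a subroutine so that both $/ and @ARGV are restored on return. This is a minimal sketch; the name slurp_each and the temporary files are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return a hash mapping each filename to its
# complete contents, reading one whole file per <> call.
sub slurp_each {
    local @ARGV = @_;
    local $/;                      # undef: each read returns a whole file
    my %contents;
    while (my $data = <>) {
        $contents{$ARGV} = $data;
    }
    return %contents;
}                                  # $/ is restored when the sub returns

my ($sfh, $sname) = tempfile(UNLINK => 1);
print $sfh "line one\nline two\n";
close $sfh;
my %whole = slurp_each($sname);
print length($whole{$sname}), " characters read from $sname\n";
```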

Because processing <ARGV> never explicitly closes filehandles, the record number in $. is not reset. If you don't like that, you can explicitly close the file yourself to reset $.:

while (<>) { 	
    print "$ARGV:$.:$_"; 	
    close ARGV if eof;     
}

The eof function defaults to checking the end of file status of the last file read. Since the last handle read was ARGV, eof reports whether we're at the end of the current file. If so, we close it and reset the $. variable. On the other hand, the special notation eof() with parentheses but no argument checks if we've reached the end of all files in the <ARGV> processing.
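To see the effect of closing ARGV on $., the following sketch records the line number of each line with and without the explicit close. The helper name line_numbers and its $reset parameter are assumptions made for this illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: return [name, line number] pairs for every line
# read through <>. When $reset is true, ARGV is closed at the end of each
# file so that $. starts again from 1.
sub line_numbers {
    my ($reset, @names) = @_;
    local @ARGV = @names;
    local $_;                          # don't clobber the caller's $_
    my @seen;
    while (<>) {
        push @seen, [$ARGV, $.];
        close ARGV if $reset && eof;   # eof without parens: current file
    }
    return @seen;
}

my ($lfh1, $lname1) = tempfile(UNLINK => 1);
my ($lfh2, $lname2) = tempfile(UNLINK => 1);
print $lfh1 "one\ntwo\n";  close $lfh1;
print $lfh2 "three\n";     close $lfh2;
print "with close:    ", join(' ', map { $_->[1] } line_numbers(1, $lname1, $lname2)), "\n";
print "without close: ", join(' ', map { $_->[1] } line_numbers(0, $lname1, $lname2)), "\n";
```

With the close, numbering restarts for the second file; without it, the numbers keep increasing across files.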

Command-line options

Perl has command-line options, -n, -p, and -i, to make writing filters and one-liners easier.

The -n option adds the while (<>) loop around your program text. It's normally used for filters like grep or programs that summarize the data they read. Example 7.1 shows such a filter written with an explicit loop.

Example 7.1: findlogin1

#!/usr/bin/perl   
# findlogin1 - print all lines containing the string "login"   
while (<>) {                 # loop over files on command line
    print if /login/;     
}

The program in Example 7.1 could be written as shown in Example 7.2.

Example 7.2: findlogin2

#!/usr/bin/perl -n     
# findlogin2 - print all lines containing the string "login"     
print if /login/;

You can combine the -n and -e options to run Perl code from the command line:

% perl -ne 'print if /login/'

The -p option is like -n but it adds a print at the end of the loop. It's normally used for programs that translate their input. Example 7.3 shows such a program written with an explicit loop.

Example 7.3: lowercase1

#!/usr/bin/perl    
# lowercase1 - turn all lines into lowercase

use locale;
while (<>) {                 # loop over files on command line
    s/([^\W0-9_])/\l$1/g;    # change all letters to lowercase
    print;
}

The program in Example 7.3 could be written as shown in Example 7.4.

Example 7.4: lowercase2

#!/usr/bin/perl -p     
# lowercase2 - turn all lines into lowercase
use locale;
s/([^\W0-9_])/\l$1/g;        # change all letters to lowercase

Or written from the command line as:

% perl -Mlocale -pe 's/([^\W0-9_])/\l$1/g'
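The character class in that substitution is compact enough to deserve a closer look. The sketch below applies the same substitution to a single string; the helper name lowercase_line is hypothetical, and the example leaves out use locale, so it assumes plain ASCII input.

```perl
use strict;
use warnings;

# Hypothetical helper applying the same substitution to one string.
# [^\W0-9_] means "a word character that is neither a digit nor an
# underscore", which leaves only letters.
sub lowercase_line {
    my ($text) = @_;
    $text =~ s/([^\W0-9_])/\l$1/g;     # \l lowercases the next character
    return $text;
}

print lowercase_line("Hello, World_42 AND More\n");
```

For ASCII text this gives the same result as the built-in lc function applied to the whole line.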

When you use -n or -p for implicit input looping, the special label LINE: is silently created for the whole input loop. That means that from an inner loop, you can go on to the following input record by using next LINE (this is like awk's next). Go on to the next file by closing ARGV (this is like awk's nextfile). This is shown in Example 7.5.

Example 7.5: countchunks

#!/usr/bin/perl -n
# countchunks - count how many whitespace-separated chunks are used.
# skip comments, and bail on file if __END__
# or __DATA__ seen.
for (split ' ') {                      # split /\W+/ would discard the "#"
    next LINE if /^#/;                 # rest of the line is a comment
    close ARGV if /__(DATA|END)__/;    # go on to the next file
    $chunks++;
}
END { print "Found $chunks chunks\n" }
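Written without -n, a loop along the lines of Example 7.5 needs the LINE: label spelled out. The sketch below is one way to do that. The name count_chunks is hypothetical, and the code splits on whitespace (an assumption of this sketch) so that a chunk can begin with "#".

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: count whitespace-separated chunks in the named
# files, skipping "#" comments and abandoning a file at __END__/__DATA__.
sub count_chunks {
    local @ARGV = @_;
    my $chunks = 0;
    LINE:                                  # the label -n and -p supply
    while (my $line = <>) {
        for my $chunk (split ' ', $line) {
            next LINE if $chunk =~ /^#/;   # rest of the line is a comment
            close ARGV if $chunk =~ /__(DATA|END)__/;   # move to next file
            $chunks++;
        }
    }
    return $chunks;
}

my ($cfh, $cname) = tempfile(UNLINK => 1);
print $cfh "alpha beta # a comment\n__END__\nnot counted\n";
close $cfh;
print "Found ", count_chunks($cname), " chunks\n";
```

As in the example, the __END__ marker itself is still counted, because the close only takes effect on the next read.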

The tcsh shell keeps a .history file in a format such that every other line contains a commented-out timestamp in Epoch seconds:

#+0894382237     
less /etc/motd     
#+0894382239     
vi ~/.exrc     
#+0894382242     
date     
#+0894382242     
who     
#+0894382288     
telnet home

A simple one-liner can render that legible:

% perl -pe 's/^#\+(\d+)\n/localtime($1) . " "/e' 
Tue May  5 09:30:37 1998     less /etc/motd 
Tue May  5 09:30:39 1998     vi ~/.exrc 
Tue May  5 09:30:42 1998     date
Tue May  5 09:30:42 1998     who 
Tue May  5 09:31:28 1998     telnet home
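The one-liner works because the substitution removes the newline from each timestamp line, so the command on the following line is printed right after the date. Here is a sketch of the same substitution inside a subroutine; the name legible_history is hypothetical, and the exact date text depends on the local time zone.

```perl
use strict;
use warnings;

# Hypothetical helper: apply the one-liner's substitution to each line
# and return the joined result, as -p would print it.
sub legible_history {
    my @lines = @_;
    for my $line (@lines) {
        # A timestamp line loses its newline and becomes a date plus a
        # space, so the command on the next line follows it directly.
        $line =~ s/^#\+(\d+)\n/localtime($1) . " "/e;
    }
    return join '', @lines;
}

print legible_history("#+0894382237\n", "less /etc/motd\n");
```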

The -i option changes each file on the command line. It is described in Recipe 7.9, and is normally used in conjunction with -p.

The lowercase examples say use locale so that letters are matched and case-mapped according to the current locale's character set.