Recipe 1.18. Program: psgrep

1.18. Program: psgrep

Many programs, including ps, netstat, lsof, ls -l, find -ls, and tcpdump, can produce more output than can be conveniently summarized. Logfiles also often grow too long to be easily viewed. You could send these through a filter like grep to pick out only certain lines, but regular expressions and complex logic don't mix well; just look at the hoops we jump through in Recipe 6.17.

What we'd really like is to make full queries on the program output or logfile. For example, to ask ps something like, "Show me all the processes that exceed 10K in size but which aren't running as the superuser." Or, "Which commands are running on pseudo-ttys?"

The psgrep program does this - and infinitely more - because the specified selection criteria are not mere regular expressions; they're full Perl code. Each criterion is applied in turn to every line of output. Only lines matching all arguments are output. The following is a list of things to find and how to find them.

Lines containing "sh" at the end of a word:

% psgrep '/sh\b/'

Processes whose command names end in "sh":

% psgrep 'command =~ /sh$/'

Processes running with a user ID below 10:

% psgrep 'uid < 10'

% psgrep 'command =~ /^-/' 'tty ne "?"'

Processes running on pseudo-ttys:

% psgrep 'tty =~ /^[p-t]/'

Non-superuser processes running detached:

% psgrep 'uid && tty eq "?"'

Huge processes that aren't owned by the superuser:

% psgrep 'size > 10 * 2**10' 'uid != 0'

The last call to psgrep produced the following output when run on our system. As one might expect, only netscape and its spawn qualified.

 FLAGS   UID   PID  PPID PRI  NI   SIZE   RSS WCHAN     STA TTY TIME COMMAND
     0   101  9751     1   0   0  14932  9652 do_select S   p1  0:25 netscape
100000   101  9752  9751   0   0  10636   812 do_select S   p1  0:00 (dns helper)

Example 1.6 shows the psgrep program.

Example 1.6: psgrep

#!/usr/bin/perl -w
# psgrep - print selected lines of ps output by
#          compiling user queries into code

use strict;

# each field from the PS header
my @fieldnames = qw(FLAGS UID PID PPID PRI NICE SIZE
                    RSS WCHAN STAT TTY TIME COMMAND);

# determine the unpack format needed (hard-coded for Linux ps)
my $fmt = cut2fmt(8, 14, 20, 26, 30, 34, 41, 47, 59, 63, 67, 72);

my %fields;                         # where the data will store

die <<Thanatos unless @ARGV;
usage: $0 criterion ...
    Each criterion is a Perl expression involving:
     @fieldnames
    All criteria must be met for a line to be printed.
Thanatos

# Create function aliases for uid, size, UID, SIZE, etc.
# Empty parens on closure args needed for void prototyping.
for my $name (@fieldnames) {
    no strict 'refs';
    *$name = *{lc $name} = sub () { $fields{$name} };
}

my $code = "sub is_desirable { " . join(" and ", @ARGV) . " } ";
unless (eval $code.1) {
    die "Error in code: $@\n\t$code\n";
}

open(PS, "ps wwaxl |")              || die "cannot fork: $!";
print scalar <PS>;                  # emit header line
while (<PS>) {
    @fields{@fieldnames} = trim(unpack($fmt, $_));
    print if is_desirable();        # line matches their criteria
}
close(PS)                           || die "ps failed!";

# convert cut positions to unpack format
sub cut2fmt {
    my(@positions) = @_;
    my $template  = '';
    my $lastpos   = 1;
    for my $place (@positions) {
        $template .= "A" . ($place - $lastpos) . " ";
        $lastpos   = $place;
    }
    $template .= "A*";
    return $template;
}

sub trim {
    my @strings = @_;
    for (@strings) {
        s/^\s+//;
        s/\s+$//;
    }
    return wantarray ? @strings : $strings[0];
}

# the following was used to determine column cut points.
# sample input data follows
#123456789012345678901234567890123456789012345678901234567890123456789012345
#         1         2         3         4         5         6         7
# Positioning:
#       8     14    20    26  30  34     41    47          59  63  67   72
#       |     |     |     |   |   |      |     |           |   |   |    |
__END__
 FLAGS   UID   PID  PPID PRI  NI   SIZE   RSS WCHAN       STA TTY TIME COMMAND
   100     0     1     0   0   0    760   432 do_select   S   ?   0:02 init
   140     0   187     1   0   0    784   452 do_select   S   ?   0:02 syslogd
100100   101   428     1   0   0   1436   944 do_exit     S    1  0:00 /bin/login
100140    99 30217   402   0   0   1552  1008 posix_lock_ S   ?   0:00 httpd
     0   101   593   428   0   0   1780  1260 copy_thread S    1  0:00 -tcsh
100000   101 30639  9562  17   0    924   496             R   p1  0:00 ps axl
     0   101 25145  9563   0   0   2964  2360 idetape_rea S   p2  0:06 trn
100100     0 10116  9564   0   0   1412   928 setup_frame T   p3  0:00 ssh -C www
100100     0 26560 26554   0   0   1076   572 setup_frame T   p2  0:00 less
100000   101 19058  9562   0   0   1396   900 setup_frame T   p1  0:02 nvi /tmp/a

The psgrep program integrates many techniques presented throughout this book. Stripping strings of leading and trailing whitespace is found in Recipe 1.14. Converting cut marks into an unpack format to extract fixed fields is in Recipe 1.1. Matching strings with regular expressions is the entire topic of Chapter 6.

The multiline string in the here document passed to die is discussed in Recipes Recipe 1.10 and Recipe 1.11. The assignment to @fields{@fieldnames} sets many values at once in the hash named %fields. Hash slices are discussed in Recipes Recipe 4.7 and Recipe 5.10.

The sample program input contained beneath __END__ is described in Recipe 7.6. During development, we used canned input from the DATA filehandle for testing purposes. Once the program worked properly, we changed it to read from a piped-in ps command but left a remnant of the original filter input to aid in future porting and maintenance. Launching other programs over a pipe is covered in Chapter 16, Process Management and Communication, including Recipes Recipe 16.10 and Recipe 16.13.

The real power and expressiveness in psgrep derive from Perl's use of string arguments not as mere strings but directly as Perl code. This is similar to the technique in Recipe 9.9, except that in psgrep, the user's arguments are wrapped with a routine called is_desirable. That way, the cost of compiling strings into Perl code happens only once, before the program whose output we'll process is even begun. For example, asking for UIDs under 10 creates this string to eval:

eval "sub is_desirable { uid < 10 } " . 1;

The mysterious ".1" at the end is so that if the user code compiles, the whole eval returns true. That way we don't even have to check $@ for compilation errors as we do in Recipe 10.12.

Specifying arbitrary Perl code in a filter to select records is a breathtakingly powerful approach, but it's not entirely original. Perl owes much to the awk programming language, which is often used for such filtering. One problem with awk is that it can't easily treat input as fixed-size fields instead of fields separated by something. Another is that the fields are not mnemonically named: awk uses $1, $2, etc. Plus Perl can do much that awk cannot.

The user criteria don't even have to be simple expressions. For example, this call initializes a variable $id to user nobody 's number to use later in its expression:

% psgrep 'no strict "vars";
          BEGIN { $id = getpwnam("nobody") }
          uid == $id '

How can we use unquoted words without even a dollar sign, like uid, command, and size, to represent those respective fields in each input record? We directly manipulate the symbol table by assigning closures to indirect typeglobs, which creates functions with those names. The function names are created using both uppercase and lowercase names, allowing both "UID < 10" and "uid < 10". Closures are described in Recipe 11.4, and assigning them to typeglobs to create function aliases is shown in Recipe 10.14.

One twist here not seen in those recipes is empty parentheses on the closure. These allowed us to use the function in an expression anywhere we'd use a single term, like a string or a numeric constant. It creates a void prototype so the field-accessing function named uid accepts no arguments, just like the built-in function time. If these functions weren't prototyped void, expressions like "uid < 10" or "size / 2 > rss" would confuse the parser because it would see the unterminated start of a wildcard glob and of a pattern match, respectively. Prototypes are discussed in Recipe 10.11.

The version of psgrep demonstrated here expects the output from Red Hat Linux's ps. To port to other systems, look at which columns the headers begin at. This approach isn't relevant only to ps or only to Unix systems. It's a generic technique for filtering input records using Perl expressions, easily adapted to other record layouts. The input format could be in columns, space separated, comma separated, or the result of a pattern match with capturing parentheses.

The program could even be modified to handle a user-defined database with a small change to the selection functions. If you had an array of records as described in Recipe 11.9, you could let users specify arbitrary selection criteria, such as:

sub id()         { $_->{ID}   }
sub title()      { $_->{TITLE} }
sub executive()  { title =~ /(?:vice-)?president/i }

# user search criteria go in the grep clause
@slowburners = grep { id < 10 && !executive } @employees;

For reasons of security and performance, this kind of power is seldom found in database engines like those described in Chapter 14, Database Access. SQL doesn't support this, but given Perl and small bit of ingenuity, it's easy to roll it up on your own. The search engine at http://mox.perl.com/cgi-bin/MxScreen uses such a technique, but instead of output from ps, its records are Perl hashes loaded from a database.