Recipe 6.8. Extracting a Range of Lines

Problem

You want to extract all lines from one starting pattern through an ending pattern or from a starting line number up to an ending line number.

A common example of this is extracting the first 10 lines of a file (line numbers 1 to 10) or just the body of a mail message (everything past the blank line).

Solution

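Use the operators .. or ... with patterns or line numbers. Both return true on the line where the first test succeeds. They differ in what happens next: .. also evaluates its second test on that same line, so a range can begin and end on one line, whereas ... waits until the following line before it tests for the end.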

while (<>) {
    if (/BEGIN PATTERN/ .. /END PATTERN/) {
        # line falls between BEGIN and END in the
        # text, inclusive.
    }
}

while (<>) {
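    # variables must be compared against $. explicitly;
    # only literal numbers are matched against it implicitly
    if ($. == $FIRST_LINE_NUM .. $. == $LAST_LINE_NUM) {
        # operate only between first and last line, inclusive.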
    }
}

The ... operator does not evaluate its second test on the line where the first test becomes true; it starts looking for the end on the following line.

while (<>) {
    if (/BEGIN PATTERN/ ... /END PATTERN/) {
        # line falls between BEGIN and END, inclusive; END is
        # not looked for on the line that matched BEGIN
    }
}

while (<>) {
    if ($. == $FIRST_LINE_NUM ... $. == $LAST_LINE_NUM) {
        # operate between first and last line, inclusive; the end
        # test is skipped on the line that starts the range
    }
}

Discussion

The range operators, .. and ..., are probably the least understood of Perl's myriad operators. They were designed to allow easy extraction of ranges of lines without forcing the programmer to retain explicit state information. When used in scalar context, such as in the test of if and while statements, these operators return a true or false value that's partially dependent on what they last returned. The expression left_operand .. right_operand returns false until left_operand is true. Once that test has been met, it stops evaluating left_operand and keeps returning true until right_operand becomes true (that line is still part of the range), after which it restarts the cycle. To put it another way, the first operand turns on the construct as soon as it returns a true value, whereas the second one turns it off as soon as it returns true.
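To make the hidden state concrete, here is a minimal sketch of what /BEGIN/ .. /END/ does for you, written with an explicit flag instead of the operator. The helper name between_markers and the sample data are made up for illustration; this approximates the behavior described above and is not how Perl implements the operator.

```perl
use strict;
use warnings;

# Hypothetical helper: emulate `/BEGIN/ .. /END/` with an explicit flag.
sub between_markers {
    my @lines    = @_;
    my $in_range = 0;      # the state that .. would keep for you
    my @kept;
    for my $line (@lines) {
        if (!$in_range) {
            # left operand is tested only while the range is off
            next unless $line =~ /BEGIN/;
            $in_range = 1;
        }
        push @kept, $line;                  # both endpoints are included
        # right operand is tested on every line of the range,
        # including the line that just turned it on (.. semantics)
        $in_range = 0 if $line =~ /END/;
    }
    return @kept;
}

my @sample = ("intro", "BEGIN", "payload", "END", "outro");
print "$_\n" for between_markers(@sample);
```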

The two operands can be absolutely arbitrary conditions. In fact, you could write mytestfunc1() .. mytestfunc2(), although in practice this is seldom done. Instead, the range operators are usually used either with line numbers as operands (the first example), patterns as operands (the second example), or both.

# command-line to print lines 15 through 17 inclusive (see below)
perl -ne 'print if 15 .. 17' datafile

# print out all <XMP> .. </XMP> displays from HTML doc
while (<>) {
    print if m#<XMP>#i .. m#</XMP>#i;
}
    
# same, but as shell command
% perl -ne 'print if m#<XMP>#i .. m#</XMP>#i' document.html

If either operand is a numeric literal, the range operators implicitly compare against the $. variable ($NR or $INPUT_LINE_NUMBER if you use English). Be careful with implicit line number comparisons here. You must specify literal numbers in your code, not variables containing line numbers. That means you can simply say 3 .. 5 in a conditional, but not $n .. $m where $n and $m are 3 and 5 respectively. You have to be more explicit and test the $. variable directly.

# FAILS: $top and $bottom are merely true values, so every line prints
perl -ne 'BEGIN { $top=3; $bottom=5 } print if $top .. $bottom' /etc/passwd
# works: the variables are compared against $. explicitly
perl -ne 'BEGIN { $top=3; $bottom=5 } print if $. == $top .. $. == $bottom' /etc/passwd
# also works: literal line numbers
perl -ne 'print if 3 .. 5' /etc/passwd
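Inside a program, the same explicit comparison lets you pass the bounds in as variables. The following is a sketch only: the helper name lines_between is hypothetical, and it reads from an in-memory filehandle so that it runs without any external file.

```perl
use strict;
use warnings;

# Hypothetical helper: return lines $first through $last (1-based, inclusive).
sub lines_between {
    my ($text, $first, $last) = @_;
    # in-memory filehandle; $. counts lines read from it
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my @kept;
    while (my $line = <$fh>) {
        # variables are not matched against $. implicitly, so compare by hand
        push @kept, $line if $. == $first .. $. == $last;
    }
    close $fh;
    return @kept;
}

print lines_between("one\ntwo\nthree\nfour\nfive\n", 2, 4);
```

Each .. in the source keeps a single piece of state, even across calls to the subroutine that contains it. If a range is left open (for example, when $last lies beyond the end of the input), the next call starts with the range already on.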

The difference between .. and ... is their behavior when both operands can be true on the same line. Consider these two cases:

print if /begin/ .. /end/;
print if /begin/ ... /end/;

Given the line "You may not end ere you begin", both the double- and triple-dot versions of the range operator above return true. But the code using .. will not print any further lines. That's because .. tests both conditions on the same line once the first test matches, and the second test tells it that it's reached the end of its region. On the other hand, ... will continue through the next line that matches /end/ because it never tests both operands on the same line.
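The difference is easy to check on a few in-memory lines. This sketch wraps each operator in its own small subroutine (two_dot and three_dot are names chosen here for illustration) and feeds both the same sample text.

```perl
use strict;
use warnings;

# Keep the lines for which /begin/ .. /end/ is true.
sub two_dot {
    my @kept;
    for (@_) { push @kept, $_ if /begin/ .. /end/ }
    return @kept;
}

# Same, but with the triple-dot form.
sub three_dot {
    my @kept;
    for (@_) { push @kept, $_ if /begin/ ... /end/ }
    return @kept;
}

my @text = (
    "prologue",
    "You may not end ere you begin",
    "middle",
    "the end",
    "epilogue",
);

# .. starts and ends on the second line, so it keeps 1 line;
# ... stays on through "the end", so it keeps 3 lines
printf "two-dot kept %d line(s)\n",   scalar two_dot(@text);
printf "three-dot kept %d line(s)\n", scalar three_dot(@text);
```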

You may mix and match conditions of different sorts, as in:

while (<>) {
    $in_header =   1  .. /^$/;
    $in_body   = /^$/ .. eof();
}

The first assignment sets $in_header to be true from the first input line through the blank line that ends the header, such as in a mail message, a news posting, or even an HTTP header. (Technically speaking, an HTTP header should have both linefeeds and carriage returns as network line terminators, but in practice, servers are liberal in what they accept.) The second assignment sets $in_body to be true starting as soon as the first blank line is encountered, up through end of file. On the separating blank line itself, both variables are true. Because range operators do not retest their initial condition while the range is on, any further blank lines (such as those between paragraphs) won't be noticed.
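As a rough sketch of how those two flags might be put to work, the following splits a message held in a string into header and body lines. The subroutine name split_message and the sample message are invented for this example; it assumes the message contains a blank separator line, and it uses eof($fh) because it reads from a lexical in-memory filehandle rather than from <>.

```perl
use strict;
use warnings;

# Hypothetical helper: return (\@header, \@body) for one message.
sub split_message {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory handle: $!";
    my (@header, @body);
    local $_;
    while (<$fh>) {
        my $in_header =   1  .. /^$/;        # literal 1 means $. == 1
        my $in_body   = /^$/ .. eof($fh);
        # the separator line sets both flags, so it lands in neither list
        push @header, $_ if $in_header && !$in_body;
        push @body,   $_ if $in_body   && !$in_header;
    }
    close $fh;
    return (\@header, \@body);
}

my $msg = "From: tom\@example.com\nSubject: ranges\n\nfirst paragraph\n\nsecond paragraph\n";
my ($header, $body) = split_message($msg);
print "header: $_" for @$header;
print "body:   $_" for @$body;
```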

Here's an example. It reads files containing mail messages and prints addresses it finds in headers. Each address is printed only once. The extent of the header is from a line beginning with "From:" up through the first blank line. If we're not within that range, we go on to the next line. The pattern isn't an RFC-822 notion of an address, but it's easy to write.

%seen = ();
while (<>) {
    next unless /^From:?\s/i .. /^$/;
    while (/([^<>(),;\s]+\@[^<>(),;\s]+)/g) {
        print "$1\n" unless $seen{$1}++;
    }
}

If all this range business seems mighty strange, chalk it up to trying to support the s2p and a2p translators for converting sed and awk code into Perl. Both those tools have range operators that must work in Perl.
