The two main pattern matching operators are m//,
the match operator, and s///, the substitution
operator. There is also a split
operator, which takes an ordinary match operator as its first argument
but otherwise behaves like a function, and is therefore documented in
Chapter 3. Although we write m// and s/// here, you'll recall that you can
pick your own quote characters. On the other hand, for the m//
operator only, the m may be omitted if the delimiters you pick are in
fact slashes. (You'll often see patterns written this way, for
historical reasons.) Now that we've gone to all the trouble of enumerating these weird,
quote-like operators, you might
wonder what it is we've gone to all the trouble of quoting. The answer
is that the string inside the quotes specifies a regular expression.
We'll discuss regular expressions in the next section, because there's a lot
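to discuss.

The matching operations can have various modifiers, some of which affect
the interpretation of the regular expression inside:

    i   Do case-insensitive pattern matching.
    m   Treat the string as multiple lines.
    s   Treat the string as a single line.
    x   Use extended formatting (whitespace and comments in the pattern).

These are usually written as "the /x modifier", even though the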
delimiter in question might not actually be a slash. In fact, any of
these modifiers may also be embedded within the regular expression
itself using the (?...) construct. See the section
"Regular Expression Extensions" later in this chapter. The /x modifier itself needs a little more explanation. It tells
the regular expression parser to ignore whitespace that is not
backslashed or within a character class. You can use this modifier to break up
your regular expression into (slightly) more readable parts.
The #
character is also treated as a metacharacter introducing a comment,
just as in ordinary Perl code. Taken together, these features go a
long way toward making Perl a readable language. The regular expressions used in the pattern matching and substitution
operators are syntactically similar to those used by the UNIX egrep program. When
you write a regular expression, you're actually writing a grammar for a
little language. The regular expression interpreter (which we'll call
the Engine) takes your grammar and compares it to the string you're
doing pattern matching on. If some portion of the string can be parsed
as a sentence of your little language, it says "yes". If not, it says
"no". What happens after the Engine has said "yes" depends on how you invoked
it. An ordinary pattern match is usually used as a conditional
expression, in which case you don't care where it matched, only
whether it matched. (But you can also find out where it matched if
you need to know that.) A substitution command will take the part that
matched and replace it with some other string of your choice. And the
split operator will return (as a
list) all the places your pattern didn't match. Regular expressions are powerful, packing a lot of meaning into a
short space. They can therefore be quite daunting if you try to
intuit the meaning of a large regular expression as a whole. But if you
break it up into its parts, and if you know how the Engine interprets
those parts, you can understand any regular expression. Before we dive into the rules for interpreting regular expressions,
let's take a look at some of the things you'll see in regular expressions.
First of all, you'll see literal strings. Most characters
in a regular expression simply match themselves. If you string several
characters in a row, they must match in order, just as you'd expect. So
if you write the pattern match:

    /Fred/

you can know that the pattern won't match unless the string contains
the substring "Fred" somewhere.

Other characters don't match themselves, but are metacharacters.
(Before we explain what metacharacters do, we should reassure
you that you can always match such a character literally by putting a
backslash in front of it. For example, backslash is itself a
metacharacter, so to match a literal backslash, you'd backslash the
backslash: \\.) The list of metacharacters is:

    \ | ( ) [ { ^ $ * + ? .

We said that backslash turns a metacharacter into a literal character,
but it does the opposite to an alphanumeric character: it turns
the literal character into a sort of metacharacter or sequence. So
whenever you see a two-character sequence:

    \b \D \t \3 \s

you'll know that the sequence matches something strange. A \b
matches a word boundary, for instance, while \t matches an ordinary
tab character. Notice that a word boundary is zero characters wide,
while a tab character is one character wide. Still, they're alike in
that they both assert that something is true about a particular spot
in the string. Most of the things in a regular expression fall into the
class of assertions, including the ordinary characters that simply
assert that they match themselves. (To be precise, they also assert
that the next thing will match one character later in the string, which
is why we talk about the tab character being "one character wide". Some
assertions eat up some of the string as they match, and others don't.
But we usually reserve the term "assertion" for the zero-width
assertions. We'll call these assertions with nonzero width atoms.)
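
The difference between a zero-width assertion and an atom can be checked by looking at how much of the string each one consumes. The following is only a small sketch; the sample string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample string: two words separated by a tab.
my $string = "yabba\tdabba";

# \t is an atom: when it matches, it eats one character of the string.
my $tab_width = $string =~ /\t/ ? length($&) : -1;

# \b is a zero-width assertion: it matches a position, not a character.
my $boundary_width = $string =~ /\b/ ? length($&) : -1;

print "tab: $tab_width, boundary: $boundary_width\n";
```

Both matches succeed, but only the tab takes up room in the matched text.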
You'll also see some things that aren't assertions. Alternation is indicated
with a vertical bar:

    /Fred|Wilma|Barney|Betty/

That means that any of those strings can trigger a match.
Grouping of various sorts is done with parentheses, including grouping
of alternating substrings within a longer regular expression:

    /(Fred|Wilma|Pebbles) Flintstone/

Another thing you'll see is what we call quantifiers. They say how many
of the previous thing should match in a row. Quantifiers look like:

    * + ? *? {2,5}

Quantifiers only make sense when attached to atoms, that is, assertions
that have width. Quantifiers attach only to the previous atom, which in
human terms means they only quantify one character. So if you want to
match three copies of "moo" in a row, you need to group the "moo" with
parentheses, like this:

    /(moo){3}/

That will match "moomoomoo". If you'd said /moo{3}/, it
would only have matched "moooo".

Since patterns are processed as double-quoted strings, the normal
double-quoted interpolations will work. (See "String Literals" earlier in
this chapter.) These are applied before the string is interpreted as
a regular expression. One caveat though: any $ immediately followed
by a vertical bar, closing parenthesis, or the end of the string will
be interpreted as an end-of-line assertion rather than a variable
interpolation. So if you say:

    $foo = "moo";
    /$foo$/;

it's equivalent to saying:

    /moo$/;

You should also know that interpolating variables into a pattern slows
down the pattern matcher considerably, because it feels it needs to recompile the
pattern each time through, since the variable might have changed. Now that you've seen some regular expressions, we'll lay out
the rules that the Engine uses to match your pattern against the string.
The Perl Engine uses a nondeterministic finite-state automaton (NFA) to
find a match. That just means that it keeps track of what it has tried
and what it hasn't, and when something doesn't pan out, it backs up and
tries something else. This is called
backtracking. The Perl Engine is capable of
trying a million things at one spot, then giving up on all those,
backing up to within one choice of the beginning, and trying the million
things again at a different spot. If you're cagey, you can write
efficient patterns that don't do a lot of silly backtracking. The order of the rules below specifies which order the Engine tries
things. So when someone trots out a stock phrase like "left-most,
longest match", you'll know that overall Perl prefers left-most over
longest. But the Engine doesn't realize it's preferring anything at
that level. The global preferences result from a lot of localized
choices. The Engine thinks locally and acts globally.

Rule 1. The Engine tries to match as far left in the string
as it can, such that the entire regular expression matches under Rule 2.

In order to do this, its first choice is to start just before the first
character (it could have started anywhere), and to try to match the
entire regular expression at that point. The regular expression matches
if and only if the Engine reaches the end of the regular expression before
it runs off the end of the string. If it matches, it quits
immediately - it doesn't keep looking for a "better" match, even though
the regular expression could match in many different ways. The match
only has to reach the end of the regular expression; it doesn't have to
reach the end of the string, unless there's an assertion in the regular
expression that says it must. If it exhausts all possibilities at the
first position, it realizes that its very first choice was wrong, and
proceeds to its second choice. It goes to the second position in the
string (between the first and second characters), and tries all the
possibilities again. If it succeeds, it stops. If it fails, it
continues on down the string. The pattern match as a whole doesn't fail
until it has tried to match the entire regular expression at every
position in the string, including after the last character in the
string. Note that the positions it's trying to match at are between the
characters of the string. This rule sometimes surprises people when
they write a pattern like /x*/ that can match zero or more x's.
If you try the pattern on a string like "fox", it will match the null
string before the "f" in preference to the "x" that's later in the
string. If you want it to match one or more x's, you need to tell
it that by using /x+/ instead. See the quantifiers under Rule 5.

A corollary to this rule is that any regular expression that can match
the null string is guaranteed to match at the leftmost position in the string.

Rule 2. For this rule, the whole
regular expression is regarded as a set of alternatives (where the
degenerate case is just a set with one alternative). If there are two
or more alternatives, they are syntactically separated by the
| character (usually called a vertical bar). A set
of alternatives matches a string if any of the
alternatives match under Rule 3. It tries the alternatives
left-to-right (according to their position in the regular expression),
and stops on the first match that allows successful completion of the
entire regular expression. If none of the alternatives matches, it
backtracks to the Rule that invoked this Rule, which is usually Rule 1,
but could be Rule 4 or 6. That rule will then look for a new position
at which to apply Rule 2. If there's only
one alternative, then either it matches or it doesn't, and the rule
still applies. (There's no such thing as zero alternatives, because a
null string can always match something of zero width.)

Rule 3. Any particular alternative matches if every item in the
alternative matches sequentially according to Rules 4 and 5 (such that the
entire regular expression can be satisfied). An item consists of either
an assertion, which is covered in Rule 4, or a quantified atom, which is
covered by Rule 5. Items that have choices on how to match are given
"pecking order" from left to right. If the items cannot be matched in
order, the Engine backtracks to the next alternative under Rule 2. Items that must be matched sequentially aren't separated in the regular
expression by anything
syntactic - they're merely juxtaposed in the order they must match.
When you ask to match /^foo/ , you're actually asking for four items
to be matched one after the other. The first is a zero-width assertion,
and the other three are ordinary letters that must match themselves, one
after the other.

The left-to-right pecking order means that in a pattern like:

    /x*y*/

x gets to pick one way to match, and then y tries all its ways. If that
fails, then x gets to pick its second choice, and make y try all of its
ways again. And so on. The items to the right vary faster, to borrow
a phrase from multi-dimensional arrays.
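
Rules 1 and 3 can be watched in action with a few small matches. This is an illustrative sketch rather than anything from the Engine itself; the strings and variable names are invented for the demonstration, and $-[0] is the offset at which the last successful match began.

```perl
use strict;
use warnings;

my $text = "fox";

# Rule 1: leftmost wins, even when the leftmost match is the null string.
$text =~ /x*/ or die "x* can always match";
my ($star_start, $star_text) = ($-[0], $&);

# With x+ the null string is no longer allowed, so the Engine moves right.
$text =~ /x+/ or die "x+ should find the x";
my ($plus_start, $plus_text) = ($-[0], $&);

# Rule 3: the leftmost item gets first pick, so the first x* takes everything.
my ($first, $second) = "xxx" =~ /^(x*)(x*)$/;

# If that pick makes the rest fail, the left item gives up one choice at a
# time while the right item retries all of its own choices.
my ($left, $right) = "xxx" =~ /^(x*)(x+)$/;

print "x* matched '$star_text' at $star_start\n";
print "x+ matched '$plus_text' at $plus_start\n";
print "first='$first' second='$second' left='$left' right='$right'\n";
```

In the last match the first group ends up with "xx" only because its first choice, "xxx", left nothing for x+ to match.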
Rule 4. An assertion must match according to this table. If the
assertion does not match at the current position, the Engine backtracks to
Rule 3 and retries higher-pecking-order items with different choices. The $ and \Z assertions can match not only at the end of the
string, but also one character earlier than that, if the last character
of the string happens to be a newline. The positive (?=...) and negative (?!...) lookahead assertions are
zero-width themselves, but assert that the regular expression
represented above by ... would (or would not) match at this point,
were we to attempt it. In fact, the Engine does attempt it. The Engine
goes back to Rule 2 to test the subexpression, and then wipes out any
record of how much string was eaten, returning only the success or
failure of the subexpression as the value of the assertion. We'll show
you some examples later.
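
As a brief sketch of how the lookahead assertions behave (the sample strings and variable names here are assumptions made for illustration), note that in each case the text examined by the lookahead is not included in the match:

```perl
use strict;
use warnings;

# Positive lookahead: the tab must follow, but it is not part of the match.
my $line = "name\tvalue";
my ($word) = $line =~ /(\w+)(?=\t)/;
my $matched_length = length($&);      # length of "name", tab not included

# Negative lookahead: skip any "foo" that is followed by "bar".
my $text = "foobar foobaz";
my ($suffix) = $text =~ /foo(?!bar)(\w+)/;

# The lookahead eats no string, so the match position stops right after "foo".
my $probe = "foobar";
my $found = $probe =~ /foo(?=bar)/g;
my $stopped_at = pos($probe);

print "word=$word length=$matched_length suffix=$suffix pos=$stopped_at\n";
```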
Rule 5. A quantified atom matches only if the atom itself matches
some number of times allowed by the quantifier. (The atom is matched
according to Rule 6.) Different quantifiers require different numbers of
matches, and most of them allow a range of numbers of matches. Multiple
matches must all match in a row, that is, they must be adjacent within
the string. An unquantified atom is assumed to have a quantifier
requiring exactly one match. Quantifiers constrain and control matching
according to the table below. If no match can be found at the current
position for any allowed quantity of the atom in question, the Engine
backtracks to Rule 3 and retries higher-pecking-order items with
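different choices. Quantifiers are:

    Maximal   Minimal   Allowed range
    {n,m}     {n,m}?    at least n times but no more than m times
    {n,}      {n,}?     at least n times
    {n}       {n}?      exactly n times
    *         *?        0 or more times (same as {0,})
    +         +?        1 or more times (same as {1,})
    ?         ??        0 or 1 time (same as {0,1})

If a brace occurs in any other context, it is treated as a regular
character. n and m are
limited to integral values less than 65,536. If you use the
{n} form,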
then there is no choice, and the atom must match exactly that number
of times or not at all. Otherwise, the atom can match over a range of
quantities, and the Engine keeps track of all the choices so that it
can backtrack if necessary. But then the question arises as to which
of these choices to try first. One could start with the maximal
number of matches and work down, or the minimal number of matches and
work up. The quantifiers in the left column above try the biggest quantity first.
This is often called "greedy" matching. To find the greediest match,
the Engine doesn't actually count down from the maximum value, which
after all could be infinity. What actually happens in this case is
that the Engine first counts up to find out how many atoms it's
possible to match in a row in the current string, and then it
remembers all the shorter choices and starts out from the longest one. This could fail, of course, in which case it backtracks
to a shorter choice.

If you say /.*foo/, for example, it will try to match the maximal
number of "any" characters (represented by the dot) clear out to the end
of the line before it ever tries looking for "foo", and then when the
"foo" doesn't match there (and it can't, because there's not enough room
for it at the end of the string), the Engine will back off one character
at a time until it finds a "foo". If there is more than one "foo" in
the line, it'll stop on the last one, and throw away all the shorter
choices it could have made.

Placing a question mark after any of the greedy quantifiers makes it
choose the smallest quantity for the first try. So if
you say /.*?foo/, the .*? first
tries to match 0 characters, then 1 character, then 2, and so on until
it can match the "foo". Instead of backtracking backward, it
backtracks forward, so to speak, and ends up finding the first "foo"
on the line instead of the last.

Rule 6. Each atom matches according to
its type, listed below. If the atom doesn't match (or doesn't allow a
match of the rest of the regular expression), the Engine backtracks to
Rule 5 and tries the next choice for the atom's quantity. Atoms match according to the following types: A regular expression in parentheses, (...) , matches whatever the
regular expression (represented by ... ) matches according to Rule 2.
Parentheses therefore serve as a grouping operator for quantification.
Parentheses also have the side effect of remembering the matched
substring for later use in a backreference (to be
discussed later). This side
effect can be suppressed by using (?:...) instead, which has only
the grouping semantics - it doesn't store anything in $1, $2, and so on.

A "." matches any character except \n. (It also matches
\n if you use the /s modifier.) The main use of
dot is as a vehicle for a minimal or maximal quantifier. A
.* matches a maximal number of don't-care characters, while a
.*? matches a minimal number of don't-care characters. But it's
also sometimes used within parentheses for its width:
/(..):(..):(..)/ matches three colon-separated fields, each of
which is two characters long. A list of characters in square brackets (called a character class) matches
any one of the characters in the list.
A caret at the front of the list causes it to match only characters that
are not in the list. Character ranges may be indicated using the
a-z notation. You may also use any of \d, \w,
\s, \n, \r, \t, \f, or
\nnn, as listed below. A \b means a backspace
in a character class. You may use a backslash to protect a hyphen that
would otherwise be interpreted as a range delimiter. To match a right
square bracket, either backslash it or place it first in the list. To
match a caret, don't put it first. Note that most other
metacharacters lose their meta-ness inside square brackets. In
particular, it's meaningless to specify alternation in a character
class, since the characters are interpreted individually. For example,
[fee|fie|foe] means the same thing as [feio|] . A backslashed letter matches a special character or character class: Note that \w matches a character of a word, not a whole word. Use
\w+ to match a word. A backslashed single-digit number matches whatever the corresponding
parentheses actually matched (except that \0 matches a null
character). This is called a backreference to a substring. A
backslashed multi-digit number such as \10 will be considered a
backreference if the pattern contains at least that many substrings
prior to it, and the number does not start with a 0 . Pairs of
parentheses are numbered by counting left parentheses from the left. A backslashed two- or three-digit octal number such as \033 matches the
character with the specified value, unless it would be interpreted as a
backreference. A backslashed x followed by one or two hexadecimal digits, such as
\x7f , matches the character having that hexadecimal value. A backslashed c followed by a single character, such as \cD ,
matches the corresponding control character. Any other backslashed character matches that character. Any character not mentioned above matches itself.
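
A few of the atom types above, tried on small made-up strings. This is a sketch for illustration only; the test data and variable names are assumptions, not part of the rules.

```perl
use strict;
use warnings;

# Dot used for its width: three two-character fields separated by colons.
my ($h, $m, $s) = "Time: 12:34:56" =~ /Time: (..):(..):(..)/;

# A character class matches one character from its list, and | is just
# another character in that list, so these two classes behave identically.
my @long_class  = "fee|fie" =~ /[fee|fie|foe]/g;
my @short_class = "fee|fie" =~ /[feio|]/g;

# A leading caret negates the class.
my ($consonants) = "rhythm and blues" =~ /([^aeiou ]+)/;

# Hex, octal, and control-character escapes each name a single character.
my $hex_ok     = chr(127) =~ /^\x7f$/ ? 1 : 0;
my $octal_ok   = chr(27)  =~ /^\033$/ ? 1 : 0;
my $control_ok = chr(4)   =~ /^\cD$/  ? 1 : 0;

print "$h $m $s\n";
print scalar(@long_class), " characters matched by either class\n";
print "$consonants\n";
```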
As mentioned above, \1 , \2 , \3 , and so on, are
equivalent to whatever the corresponding set of parentheses matched,
counting opening parentheses from left to right. (If the particular
pair of parentheses had a quantifier such as * after it, such
that it matched a series of substrings, then only the last match counts
as the backreference.) Note that such a backreference matches whatever
actually matched for the subpattern in the string being examined;
it's not just a shorthand for the rules of that subpattern. Therefore,
(0|0x)\d*\s\1\d* will match "0x1234 0x4321", but not "0x1234 01234",
since subpattern 1 actually matched "0x", even though the rule
0|0x could potentially match the leading 0 in the second number.

Outside of the pattern (in particular, in the replacement of a
substitution operator) you can continue to refer to backreferences by
using $ instead of \ in front of
the number. The variables $1,
$2, $3 ... are automatically localized, and their
scope (and that of $` , $&, and $' below) extends to the end of the enclosing block or eval string, or to the next successful pattern
match, whichever comes first.
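
The behavior of backreferences described so far can be sketched as follows. The hexadecimal pattern is the one from the text; the other strings and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# \1 must repeat the text that group 1 actually matched ("0x"), not its rule.
my $same_prefix  = "0x1234 0x4321" =~ /(0|0x)\d*\s\1\d*/ ? 1 : 0;
my $mixed_prefix = "0x1234 01234"  =~ /(0|0x)\d*\s\1\d*/ ? 1 : 0;

# In the replacement, the same substrings are available as $1 and $2.
(my $swapped = "hello world") =~ s/^([^ ]+) +([^ ]+)/$2 $1/;

# A quantified group remembers only its last repetition.
my ($last) = "abc" =~ /^(\w)*$/;

print "same=$same_prefix mixed=$mixed_prefix swapped='$swapped' last=$last\n";
```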
(The \1 notation sometimes works outside the current pattern, but
should not be relied upon.) $+ returns whatever the last bracket
match matched. $& returns the entire matched string. $` returns everything before the matched string.
$' returns everything after the matched string. For more explanation
of these magical variables (and for a way to write them in English), see
the section "Special Variables" at the end of this chapter. You may have as many parentheses as you wish. If you have more
than nine pairs, the variables $10, $11, ... refer to the
corresponding substring. Within the pattern, \10 , \11 , and so on, refer back
to substrings if there have been at least that many left parentheses before
the backreference. Otherwise (for backward compatibility) \10 is the
same as \010 , a backspace, and \11 the same as \011 , a tab. And so
on. (\1 through \9 are always backreferences.)

Examples:

    s/^([^ ]+) +([^ ]+)/$2 $1/;    # swap first two words
    /(\w+)\s*=\s*\1/;              # match "foo = foo"
    /.{80,}/;                      # match line of at least 80 chars
    /^(\d+\.?\d*|\.\d+)$/;         # match valid number
    if (/Time: (..):(..):(..)/) {  # pull fields out of a line
        $hours = $1;
        $minutes = $2;
        $seconds = $3;
    }

Hint: instead of writing patterns like /(...)(..)(.....)/, use the
unpack function. It's more efficient.

A word boundary (\b) is defined as a spot between two
characters that has a \w on one side of it and a
\W on the other side of it (in either order), counting the
imaginary characters off the beginning and end of the string as matching
a \W . (Within character classes \b represents
backspace rather than a word boundary.) Normally, the ^ character is guaranteed to match only at the
beginning of the string, the $ character only at the end (or
before the newline at the end), and Perl does certain optimizations with
the assumption that the string contains only one line. Embedded
newlines will not be matched by ^ or $ . However, you may
wish to treat a string as a multi-line buffer, such that the
^ will also match after any newline within the string, and $
will also match before any newline. At the cost of a little more overhead,
you can do this by using the /m modifier on the pattern match
operator. (Older programs did this by setting $*, but this
practice is now deprecated.) \A and \Z are just
like ^ and $ except that they won't match multiple times
when the /m modifier is used, while ^ and $ will
match at every internal line boundary. To match the actual end of the
string, not ignoring newline, you can use \Z(?!\n) . There's
an example of a negative lookahead assertion. To facilitate multi-line substitutions, the . character never matches a
newline unless you use the /s modifier, which tells Perl to pretend
the string is a single line - even if it isn't. (The /s modifier also
overrides the setting of $*, in case you have some (badly behaved) older
code that sets it in another module.)
In particular, the following leaves a newline on the $_ string:

    $_ = <STDIN>;
    s/.*(some_string).*/$1/;

If the newline is unwanted, use any of these:

    s/.*(some_string).*/$1/s;
    s/.*(some_string).*\n/$1/;
    s/.*(some_string)[^\0]*/$1/;
    s/.*(some_string)(.|\n)*/$1/;
    chop; s/.*(some_string).*/$1/;
    /(some_string)/ && ($_ = $1);

Note that all backslashed metacharacters in Perl are
alphanumeric, such as \b , \w ,
and \n . Unlike some regular expression languages, there are no backslashed
symbols that aren't alphanumeric. So anything that looks like
\\ , \( , \) , \< , \> ,
\{ , or \} is always interpreted as a literal
character, not a metacharacter. This makes it simple to quote a string
that you want to use for a pattern but that you are afraid might contain
metacharacters.
Just quote all the non-alphanumeric characters:

    $pattern =~ s/(\W)/\\$1/g;

You can also use the built-in quotemeta function to do this.
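
Here is a small sketch comparing the two approaches and showing why the quoting matters. The literal text and the test strings are invented for the example.

```perl
use strict;
use warnings;

# Hypothetical literal text that happens to contain metacharacters.
my $literal = 'a.b*c';

# Manual quoting, as in the substitution above.
(my $by_hand = $literal) =~ s/(\W)/\\$1/g;

# The built-in function does the same job.
my $built_in = quotemeta($literal);

# Unquoted, the dot and star are live metacharacters; quoted, they are literal.
my $unquoted_hit = "aXbbbc" =~ /^$literal$/  ? 1 : 0;
my $quoted_hit   = "aXbbbc" =~ /^$built_in$/ ? 1 : 0;
my $exact_hit    = "a.b*c"  =~ /^$built_in$/ ? 1 : 0;

print "$by_hand\n$built_in\n";
print "unquoted=$unquoted_hit quoted=$quoted_hit exact=$exact_hit\n";
```

Once quoted, the pattern matches only the exact text it was built from.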
An even easier way to quote metacharacters right in the match operator
is to say:

    /$unquoted\Q$quoted\E$unquoted/

Remember that the first and last alternatives (before the first | and
after the last one) tend to gobble up the other elements of the regular
expression on either side, out
to the ends of the expression, unless there are enclosing parentheses. A
common mistake is to ask for:

    /^fee|fie|foe$/

when you really mean:

    /^(fee|fie|foe)$/

The first matches "fee" at the beginning of the string, or
"fie" anywhere, or "foe" at the end of the string. The second
matches any string consisting solely of "fee" or "fie" or
"foe".

Perl defines a consistent extension syntax for regular expressions.
You've seen some of them already.
The syntax is a pair of parentheses with a question mark as the first thing
within the parentheses.
The character after the question mark gives the function of the extension.
Several extensions are already supported:

(?#text)

A comment. The text is ignored. If the /x switch is used to enable
whitespace formatting, a simple # will suffice.

(?:...)

This groups things like "(...)" but doesn't make backreferences like "(...)" does. So:

    split(/\b(?:a|b|c)\b/)

is like:

    split(/\b(a|b|c)\b/)

but doesn't actually save anything in $1, which means
that the first split doesn't spit out extra delimiter fields
as the second one does.

(?=...)

A zero-width positive lookahead assertion. For example, /\w+(?=\t)/
matches a word followed by a tab, without including the tab in $&.

(?!...)

A zero-width negative lookahead assertion. For example /foo(?!bar)/
matches any occurrence of "foo" that isn't followed by "bar". Note,
however, that lookahead and lookbehind are not the same thing. You cannot
use this for lookbehind: /(?!foo)bar/ will not find an occurrence of
"bar" that is preceded by something that is not "foo". That's because
the (?!foo) is just saying that the next thing cannot be "foo" - and
it's not, it's a "bar", so "foobar" will match. You would have to do
something like /(?!foo)...bar/ for that. We say "like" because there's
the case of your "bar" not having three characters before it. You could
cover that this way: /(?:(?!foo)...|^.{0,2})bar/. Sometimes it's still
easier just to say:

    if (/bar/ and $` !~ /foo$/)

(?imsx)

One or more embedded pattern-match modifiers. This is particularly
useful for patterns that are specified in a table somewhere, some of
which want to be case-sensitive, and some of which don't. The
case-insensitive ones merely need to include (?i) at the front of the
pattern. For example:

    # hardwired case insensitivity
    $pattern = "buffalo";
    if ( /$pattern/i )

    # data-driven case insensitivity
    $pattern = "(?i)buffalo";
    if ( /$pattern/ )
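
To make the table-driven case concrete, here is a minimal sketch. The pattern table and the input string are hypothetical; only the entry that carries (?i) is matched without regard to case.

```perl
use strict;
use warnings;

# Hypothetical pattern table: only some entries ask for case insensitivity.
my @patterns = ('(?i)buffalo', 'Bill');
my $input    = 'BUFFALO bill';

# The same match operator serves every entry; the pattern carries its own
# modifier, so no /i is needed on the operator.
my @hits = grep { $input =~ /$_/ } @patterns;

print "matched: @hits\n";
```

The 'Bill' entry stays case-sensitive and so fails against "bill", while the first entry matches "BUFFALO".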
We chose to use the question mark for this (and for the new minimal
matching construct) because (1) question mark is pretty rare in older
regular expressions, and (2) whenever you see one, you should stop
and question exactly what is going on. That's psychology. Now that we've got all that out of the way, here finally are the
quotelike operators (er, terms) that perform pattern matching and related
activities.

m/PATTERN/gimosx
/PATTERN/gimosx

This operator searches a string for a pattern match, and in a scalar context
returns true (1) or false (""). If no string is specified via
the =~ or !~ operator, the
$_ string is searched. (The string
specified with =~ need not be an lvalue - it
may be the result of an expression evaluation, but remember the
=~ binds rather tightly, so you may need
parentheses around your expression.)
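Modifiers are:

    g   Match globally, that is, find all occurrences.
    i   Do case-insensitive pattern matching.
    m   Treat the string as multiple lines.
    o   Compile the pattern only once.
    s   Treat the string as a single line.
    x   Use extended formatting (whitespace and comments in the pattern).

If / is the delimiter then the initial m is optional. With the m
you can use any pair of non-alphanumeric, non-whitespace characters as
delimiters. This is particularly useful for matching filenames
that contain "/", thus avoiding LTS (leaning toothpick syndrome).

PATTERN may contain variables, which will be interpolated (and the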
pattern recompiled) every time the pattern search is evaluated. (Note
that $) and $| will not be interpolated because they look
like end-of-line tests.) If you want such a pattern to be compiled only
once, add a /o after the trailing delimiter. This avoids
expensive run-time recompilations, and is useful when the value you are
interpolating won't change during execution. However,
mentioning /o constitutes a promise that you won't change the
variables in the pattern. If you do change them, Perl won't even
notice.
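
The promise that /o implies can be seen in a small sketch like the following, where the interpolated variable does change. The sample text and variable names are assumptions made for illustration.

```perl
use strict;
use warnings;

my $text = "hotdog catnip";
my (@recompiled, @compiled_once);

for my $word (qw(cat dog)) {
    # Without /o the pattern follows the variable on every pass.
    push @recompiled, $text =~ /($word)/ ? $1 : 'none';

    # With /o the first value of $word is locked in, so the second pass
    # still searches for "cat" even though $word is now "dog".
    push @compiled_once, $text =~ /($word)/o ? $1 : 'none';
}

print "recompiled:    @recompiled\n";
print "compiled once: @compiled_once\n";
```

This is why /o is best kept for values that really are fixed for the life of the program, such as a command-line argument.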
If the PATTERN evaluates to a null string, the last successfully
executed regular expression not hidden within an inner block (including
split, grep, and map) is used instead. If used in a context that requires a list value, a pattern match returns
a list consisting of the subexpressions matched by the parentheses in
the pattern - that is, ($1, $2, $3 ...). (The variables are
also set.) If the match fails, a null list is returned. If the match
succeeds, but there were no parentheses, a list value of (1) is
returned.

Examples:

    # case insensitive matching
    open(TTY, '/dev/tty');
    <TTY> =~ /^y/i and foo();    # do foo() if they want it

    # pulling a substring out of a line
    if (/Version: *([0-9.]+)/) { $version = $1; }

    # avoiding Leaning Toothpick Syndrome
    next if m#^/usr/spool/uucp#;

    # poor man's grep
    $arg = shift;
    while (<>) {
        print if /$arg/o;        # compile only once
    }

    # get first two words and remainder as a list
    if (($F1, $F2, $Etc) = ($foo =~ /^\s*(\S+)\s+(\S+)\s*(.*)/))

This last example splits $foo into the first two words and the
remainder of the line, and assigns those three fields to $F1,
$F2, and $Etc. The conditional is true if any variables
were assigned, that is, if the pattern matched. Usually, though, one would
just write the equivalent split:

    if (($F1, $F2, $Etc) = split(' ', $foo, 3))

The /g modifier specifies global pattern matching - that is, matching
as many times as possible within the string. How it behaves depends on
the context. In a list context, it returns a list of all the
substrings matched by all the parentheses in the regular expression.
If there are no parentheses, it returns a list of all the matched
strings, as if there were parentheses around the whole pattern. In a scalar context, m//g iterates through the string, returning true
each time it matches, and false when it eventually runs out of
matches. (In other words, it remembers where it left off last time and
restarts the search at that point. You can find the current
match position of a string using the pos function - see Chapter 3.)
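
The two contexts can be compared side by side in a short sketch. The string and the variable names are made up for illustration.

```perl
use strict;
use warnings;

my $text = "aXbXc";

# List context: every match comes back at once.
my @all = $text =~ /X/g;

# Scalar context: one match per evaluation; pos() reports where the
# previous match ended, which is where the next search will start.
my @stops;
while ($text =~ /X/g) {
    push @stops, pos($text);
}

print "found ", scalar(@all), " matches, ending at offsets @stops\n";
```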
If you modify the string in any way, the match position is reset to the
beginning.

Examples:

    # list context--extract three numeric fields from uptime command
    ($one,$five,$fifteen) = (`uptime` =~ /(\d+\.\d+)/g);

    # scalar context--count sentences in a document by recognizing
    # sentences ending in [.!?], perhaps with quotes or parens on
    # either side. Observe how dot in the character class is a literal
    # dot, not merely any character.
    $/ = "";  # paragraph mode
    while ($paragraph = <>) {
        while ($paragraph =~ /[a-z]['")]*[.!?]+['")]*\s/g) {
            $sentences++;
        }
    }
    print "$sentences\n";

    # find duplicate words in paragraphs, possibly spanning line boundaries.
    # Use /x for space and comments, /i to match both `is'
    # in "Is is this ok?", and use /g to find all dups.
    $/ = "";  # paragrep mode again
    while (<>) {
        while ( m{
                    \b          # start at a word boundary
                    (\w\S+)     # find a wordish chunk
                    (
                        \s+     # separated by some whitespace
                        \1      # and that chunk again
                    ) +         # repeat ad lib
                    \b          # until another word boundary
                 }xig
              )
        {
            print "dup word `$1' at paragraph $.\n";
        }
    }

?PATTERN?

This is just like the
/PATTERN/
search, except that it matches only once between calls to the
reset operator. This is a useful
optimization when you only want to see the first occurrence of
something in each file of a set of files, for instance. Only
?? patterns local to the current package are reset.
This usage is vaguely deprecated, and may be removed in some future
version of Perl. Most people just bomb out of the loop when they
get the match they want.

s/PATTERN/REPLACEMENT/egimosx

This operator searches a string for PATTERN, and if found, replaces
that match with the REPLACEMENT text and returns the number of
substitutions made, which can be more than one with the /g modifier.
Otherwise it returns false (0).
If no string is specified via the =~ or !~ operator, the
$_ variable is searched and modified. (The string specified with
=~ must be a scalar variable, an array element, a hash element,
or an assignment to one of those, that is, an lvalue.)

If the delimiter you choose happens to be a single quote, no variable
interpolation is done on either the PATTERN or the REPLACEMENT.
Otherwise, if the PATTERN contains a $ that looks like a variable rather
than an end-of-string test, the variable will be interpolated into the
PATTERN at run-time. If you want the PATTERN
compiled only once, when the
variable is first interpolated, use the /o option. If the
PATTERN evaluates to a null string, the
last successfully executed
regular expression is used instead. The REPLACEMENT pattern also
undergoes variable interpolation, but it does so each time the PATTERN
matches, unlike the PATTERN, which just gets interpolated once when
the operator is evaluated. (The PATTERN can match multiple times in one
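evaluation if you use the /g option below.) Modifiers are:

    e   Evaluate the right side as an expression.
    g   Replace globally, that is, all occurrences.
    i   Do case-insensitive pattern matching.
    m   Treat string as multiple lines.
    o   Only compile pattern once.
    s   Treat string as single line.
    x   Use extended regular expressions.

Any non-alphanumeric, non-whitespace delimiter may replace the slashes.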
If single quotes are used, no interpretation is done on the replacement
string (the /e modifier overrides this, however). If the PATTERN is contained
within naturally paired delimiters (such as parentheses), the
REPLACEMENT has its own pair of delimiters, which may or may not be
the same ones used for PATTERN - for example, s(foo)(bar) or
s<foo>/bar/. A /e will cause the replacement portion to be
interpreted as a full-fledged Perl expression instead of as a
double-quoted string. (It's kind of like an eval, but its
syntax is checked at compile-time.) Examples:

# don't change wintergreen
s/\bgreen\b/mauve/g;
# avoid LTS with different quote characters
$path =~ s(/usr/bin)(/usr/local/bin);
# interpolated pattern and replacement
s/Login: $foo/Login: $bar/;
# modifying a string "en passant"
($foo = $bar) =~ s/this/that/;
# counting the changes
$count = ($paragraph =~ s/Mister\b/Mr./g);
# using an expression for the replacement
$_ = 'abc123xyz';
s/\d+/$&*2/e; # yields 'abc246xyz'
s/\d+/sprintf("%5d",$&)/e; # yields 'abc  246xyz'
s/\w/$& x 2/eg; # yields 'aabbcc  224466xxyyzz'
# how to default things with /e
s/%(.)/$percent{$1}/g; # change percent escapes; no /e
s/%(.)/$percent{$1} || $&/ge; # expr now, so /e
s/^=(\w+)/&pod($1)/ge; # use function call
# /e's can even nest; this will expand simple embedded variables in $_
s/(\$\w+)/$1/eeg;
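The doubled /e deserves a closer look. In the sketch below, the first evaluation turns $1 into the text it captured (for example, the string '$user'), and the second evaluates that text as Perl code, yielding the variable's value. The variable names and template are hypothetical, and since the second pass is effectively an eval, this is only sensible on text you trust.

```perl
use strict;
use warnings;

# Hypothetical package variables to be expanded.
our $user  = "larry";
our $shell = "/bin/sh";

my $template = 'User $user runs $shell';   # single quotes: nothing interpolated yet

# First /e evaluates $1, giving the text '$user'; the second /e
# evaluates that text as Perl code, giving the value of $user.
(my $expanded = $template) =~ s/(\$\w+)/$1/eeg;

print "$expanded\n";   # User larry runs /bin/sh
```

Copying into $expanded first (the "en passant" idiom shown earlier) leaves the template itself unchanged.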
# delete C comments
$program =~ s {
    /\*     # Match the opening delimiter.
    .*?     # Match a minimal number of characters.
    \*/     # Match the closing delimiter.
} []gsx;
# trim white space
s/^\s*(.*?)\s*$/$1/;
# reverse 1st two fields
s/([^ ]*) *([^ ]*)/$2 $1/;

Note the use of $ instead of \ in the last example.
Some people get a little too used to writing things like:

$pattern =~ s/(\W)/\\\1/g;

This is grandfathered for the right-hand side of a substitution to avoid
shocking the sed addicts, but it's a dirty habit to get into.
That's because in PerlThink, the right-hand side of a s/// is a
double-quoted string. In an ordinary double-quoted string, \1
would mean a control-A, but for s/// the customary UNIX meaning
of \1 is kludged in. (The lexer actually translates it to
$1 on the fly.) If you start to rely on that, however, you get
yourself into trouble if you then add an /e modifier:

s/(\d+)/ \1 + 1 /eg; # a scalar reference plus one?

Or if you try to do:

s/(\d+)/\1000/; # "\100" . "0" == "@0"?

You can't disambiguate that by saying \{1}000, whereas you
can fix it with ${1}000. Basically, the operation of
interpolation should not be confused with the operation of matching a
backreference. Certainly, interpolation and matching mean two different
things on the left side of the s///.

Occasionally, you can't just use a /g to get all the changes to
occur, either because the substitutions have to happen right-to-left, or
because you need the length of $` to change between matches. In this
case you can usually do what you want by calling the substitution
repeatedly. Here are two common cases:

# put commas in the right places in an integer
1 while s/(\d)(\d\d\d)(?!\d)/$1,$2/;
# expand tabs to 8-column spacing
1 while s/\t+/' ' x (length($&)*8 - length($`)%8)/e;

tr/SEARCHLIST/REPLACEMENTLIST/cds
y/SEARCHLIST/REPLACEMENTLIST/cds

Strictly speaking, this operator doesn't belong in a section on pattern
matching because it doesn't use regular expressions. Rather, it scans
a string character by character, and replaces
all occurrences of the characters found in the SEARCHLIST
with the corresponding character in the REPLACEMENTLIST. It returns
the number of characters replaced or deleted. If no string is
specified via the =~ or !~ operator, the $_ string is translated. (The
string specified with =~ must be a scalar variable, an array element,
or an assignment to one of those, that is, an lvalue.) For sed devotees,
y is provided as a synonym for tr///. If the SEARCHLIST is
contained within naturally paired delimiters (such as parentheses), the
REPLACEMENTLIST has its own pair of delimiters, which may or may
not be naturally paired ones - for example, tr[A-Z][a-z]
or tr(+-*/)/ABCD/.
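A brief sketch of the points made so far: the return value, in-place modification, the y synonym, and paired delimiters. The sample strings and variable names are invented for illustration.

```perl
use strict;
use warnings;

my $word = "Hello, World";

# Copy first, then translate the copy; $word itself is untouched.
(my $lower = $word) =~ tr/A-Z/a-z/;

# tr modifies its target in place and returns the number of
# characters it translated (here, the eight lowercase letters).
my $changed = ($word =~ tr/a-z/A-Z/);

# y is a synonym for tr; each list gets its own pair of brackets.
(my $rot13 = "Perl") =~ y[A-Za-z][N-ZA-Mn-za-m];

print "$lower\n";    # hello, world
print "$word\n";     # HELLO, WORLD
print "$changed\n";  # 8
print "$rot13\n";    # Cric
```

Because the lists are plain character ranges rather than regular expressions, characters such as the comma and the space pass through untouched.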
Modifiers:

    c   Complement the SEARCHLIST.
    d   Delete found but unreplaced characters.
    s   Squash duplicate replaced characters.

If the /c modifier is specified, the SEARCHLIST character set is
complemented; that is, the effective search list consists of all the
characters not in SEARCHLIST. If the /d modifier is specified, any
characters specified by SEARCHLIST but not given a replacement in
REPLACEMENTLIST are deleted. (Note that this is slightly more flexible
than the behavior of some tr programs, which delete anything they
find in the SEARCHLIST, period.) If the /s modifier is specified,
sequences of characters that were translated to the same character are
squashed down to a single instance of the character.

If the /d modifier is used, the
REPLACEMENTLIST is always interpreted exactly as
specified. Otherwise, if the REPLACEMENTLIST is
shorter than the SEARCHLIST, the final character
is replicated until it is long enough. If the
REPLACEMENTLIST is null, the
SEARCHLIST is replicated. This latter is useful
for counting characters in a class or for squashing character
sequences in a class. Examples:

$ARGV[1] =~ tr/A-Z/a-z/; # canonicalize to lower case
$cnt = tr/*/*/; # count the stars in $_
$cnt = $sky =~ tr/*/*/; # count the stars in $sky
$cnt = tr/0-9//; # count the digits in $_
tr/a-zA-Z//s; # bookkeeper -> bokeper
($HOST = $host) =~ tr/a-z/A-Z/;
tr/a-zA-Z/ /cs; # change non-alphas to single space
tr [\200-\377]
   [\000-\177]; # delete 8th bit

If multiple translations are given for a character, only the first one is used:

tr/AAA/XYZ/

will translate any A to X.

Note that because the translation table is built at compile time, neither
the SEARCHLIST nor the REPLACEMENTLIST
is subject to double quote
interpolation. That means that if you want to use variables, you must use
an eval:

eval "tr/$oldlist/$newlist/";
die $@ if $@;

eval "tr/$oldlist/$newlist/, 1" or die $@;

One more note: if you want to change your text to uppercase or
lowercase, it's better to use the \U or \L sequences
in a double-quoted string, since they will pay attention to locale
information, but tr/a-z/A-Z/ won't. |