Recipe 6.16. Detecting Duplicate Words

6.16. Detecting Duplicate Words

Problem

You want to check for doubled words in a document.

Solution

Use backreferences in your regular expression.

Parentheses in a pattern make the regular expression engine remember what matched that part of the pattern. Later in your pattern, you can refer to the actual string that matched with \1 (indicating the string matched by the first set of parentheses), \2 (for the second string matched by the second set of parentheses), and so on. Don't use $1; it would be treated as a variable and interpolated before the match began. If you match /([A-Z])\1/, that says to match a capital letter followed not just by any capital letter, but by whichever one was captured by the first set of parentheses in that pattern.

This sample code reads its input files by paragraph, with the definition of paragraph following Perl's notion of two or more contiguous newlines. Within each paragraph, it finds all duplicate words. It ignores case and can match across newlines.

Here we use /x to embed whitespace and comments to make the regular expression readable. /i lets us match both instances of "is" in the sentence "Is is this ok?". We use /g in a while loop to keep finding duplicate words until we run out of text. Within the pattern, use \b (word boundary) and \s (whitespace) to help pick out whole words and avoid matching "This".

$/ = '';                      # paragrep mode
while (<>) {
    while ( m{
                \b            # start at a word boundary (begin letters)
                (\S+)         # find chunk of non-whitespace
                \b            # until another word boundary (end letters)
                (
                    \s+       # separated by some whitespace
                    \1        # and that very same chunk again
                    \b        # until another word boundary
                ) +           # one or more sets of those
             }xig
         )
    {
        print "dup word '$1' at paragraph $.\n";
    }
}

That code finds the duplicated test in the following paragraph:

This is a test
test of the duplicate word finder.

The use of a word boundary anchor surrounding \S+ is usually a bad idea because word boundaries are defined as transitions between alphanumunders (that's a \w) and either the edge of the string or a non-alphanumunder. Surrounding it by \b changes \S+ from its normal meaning of one or more non-whitespace characters to a stretch of non-whitespace whose first and last character must be an alphanumunder.

Here's another interesting demonstration of using backreferences. Imagine you had two words in which the end of the first word was the same as the start of the next one, such as "nobody" and "bodysnatcher". You'd like to find that overlapping part and come up with "nobodysnatcher". This is just a variant on the duplicate word problem.

Conventional byte-by-byte processing the way a C programmer would write it would take a great deal of tricky code to solve this problem. But with a backtracking pattern matcher, it just takes one simple pattern match.

$a = 'nobody';
$b = 'bodysnatcher';
if ("$a $b" =~ /^(\w+)(\w+) \2(\w+)$/) {
    print "$2 overlaps in $1-$2-$3\n";
}
body overlaps in no-body-snatcher

You might think that $1 would first grab up all of "nobody" due to greediness. In fact, it does - for a while. But once it's done so, there aren't any more characters to put in $2. So the engine backs up and $1 begrudgingly gives up one character to $2. The space character matches successfully, but then it sees \2, which currently holds merely "y". The next character in the string is not a "y", but a "b". This makes the engine back up all the way, eventually forcing $1 to surrender enough to $2 that the pattern can match something, a space, then that same thing.

Actually, that won't quite work out if the overlap is itself the product of a doubling, as in "rococo" and "cocoon". The preceding algorithm would have decided that the overlapping string, $2, must be just "co" rather than "coco". But we don't want a "rocococoon"; we want a "rococoon". Adding a minimal matching quantifier to the $1 part gives the much better pattern:

/^(\w+?)(\w+) \2(\w+)$/,

which solves this problem.

Backtracking is more powerful than you imagine. Example 6.11 offers another take on the prime factorization problem from Chapter 2, Numbers.

Example 6.11: prime-pattern

#!/usr/bin/perl
# prime_pattern -- find prime factors of argument using pattern matching
for ($N = ('o' x shift); $N =~ /^(oo+?)\1+$/; $N =~ s/$1/o/g) {
    print length($1), " ";
}
print length ($N), "\n";

Although not practical, this approach marvelously demonstrates the power of backtracking and is therefore very instructional.

Here's another example. Using a brilliant insight first illustrated by Doug McIlroy (or so says Andrew Hume), you can find solutions to Diophantine equations of order one with regular expressions. Consider the equation 12x + 15y + 16z = 281. Can you think of possible values for x, y, and z ? Perl can!

# solve for 12x + 15y + 16z = 281, maximizing x
if (($X, $Y, $Z)  =
   (('o' x 281)  =~ /^(o*)\1{11}(o*)\2{14}(o*)\3{15}$/))
{
    ($x, $y, $z) = (length($X), length($Y), length($Z));
    print "One solution is: x=$x; y=$y; z=$z.\n";
} else {
    print "No solution.\n";
}
One solution is: x=17; y=3; z=2.

Because the first o* was greedy, x was allowed to grow as large as it could. Changing one or more of the * quantifiers to *?, +, or +? can produce different solutions.

('o' x 281)  =~ /^(o+)\1{11}(o+)\2{14}(o+)\3{15}$/
One solution is: x=17; y=3; z=2
('o' x 281)  =~ /^(o*?)\1{11}(o*)\2{14}(o*)\3{15}$/
One solution is: x=0; y=7; z=11.
('o' x 281)  =~ /^(o+?)\1{11}(o*)\2{14}(o*)\3{15}$/
One solution is: x=1; y=3; z=14.

An important lesson to be learned from these amazing feats of mathematical prowess by a lowly pattern matcher is that a pattern matching engine, particularly a backtracking one, very much wants to give you an answer and will work phenomenally hard to do so. But solving a regular expression with backreferences can take time exponentially proportional to the length of the input to complete. For all but trivial inputs, such algorithms make continental drift seem brisk.

6.16. Detecting Duplicate Words

Problem

Solution

Discussion

Example 6.11: prime-pattern

See Also