20.3. Extracting URLs

Problem

You want to extract all URLs from an HTML file.

Solution

Use the HTML::LinkExtor module from CPAN:

use HTML::LinkExtor;
$parser = HTML::LinkExtor->new(undef, $base_url);
$parser->parse_file($filename);
@links = $parser->links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;    # element type
    # possibly test whether this is an element we're interested in
    while (@element) {
        # extract the next attribute and its value
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        # ... do something with them ...
    }
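}

Discussion

You can use HTML::LinkExtor in two different ways: either call the links method to get a list of all links once the document has been completely parsed, or pass a code reference as the first argument to new, in which case that function is called for each link as the document is parsed. The links method returns the links as array references, each holding the element's tag name followed by attribute name and value pairs. For instance, the HTML:

<A HREF="http://www.perl.com/">Home page</A>
<IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">

would return a data structure like this:

[
    [ 'a',   href   => "http://www.perl.com/" ],
    [ 'img', src    => "images/big.gif",
             lowsrc => "images/big-lowres.gif" ]
]

Here's an example of how you would use $elt_type and $attr_name to print anchors and images:

if ($elt_type eq 'a' && $attr_name eq 'href') {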
    print "ANCHOR: $attr_value\n"
        if $attr_value->scheme =~ /http|ftp/;
}
if ($elt_type eq 'img' && $attr_name eq 'src') {
    print "IMAGE: $attr_value\n";
}

Example 20.2 is a complete program that takes as its argument a URL, like file:///tmp/testing.html or http://www.ora.com/, and produces on standard output an alphabetically sorted list of unique URLs.

Example 20.2: xurl

#!/usr/bin/perl -w
# xurl - extract unique, sorted list of links from URL
use HTML::LinkExtor;
use LWP::Simple;
$base_url = shift;
$parser = HTML::LinkExtor->new(undef, $base_url);
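# get() returns undef on failure, so check before parsing
my $html = get($base_url);
die "xurl: couldn't fetch $base_url\n" unless defined $html;
$parser->parse($html)->eof;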
@links = $parser->links;
foreach $linkarray (@links) {
    my @element  = @$linkarray;
    my $elt_type = shift @element;
    while (@element) {
        my ($attr_name, $attr_value) = splice(@element, 0, 2);
        $seen{$attr_value}++;
    }
}
for (sort keys %seen) { print $_, "\n" }

This program does have a limitation: if the get of $base_url involves a redirection, the links are all resolved against the original URL rather than the URL at the end of the redirection.

Here's an example of the run:

% xurl http://www.perl.com/CPAN

Often in mail or Usenet messages, you'll see URLs written as:

<URL:http://www.perl.com>

This is supposed to make it easy to pick URLs from messages:

@URLs = ($message =~ /<URL:(.*?)>/g);

See Also

The documentation for the CPAN modules LWP::Simple, HTML::LinkExtor, and HTML::Entities; Recipe 20.1
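
HTML::LinkExtor also accepts a code reference as the first argument to new, where the Solution passes undef. The parser then calls that function for each link-bearing element as it is parsed, instead of collecting links for a later call to links. The following is a minimal sketch of that style and not part of the recipe itself: the base URL http://www.example.com/dir/, the inline sample HTML, and the record_link function are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# Hypothetical base URL and sample document, for illustration only.
my $base_url = 'http://www.example.com/dir/';
my $html = <<'HTML';
<A HREF="http://www.perl.com/">Home page</A>
<IMG SRC="images/big.gif" LOWSRC="images/big-lowres.gif">
HTML

my @found;

# Called once per link-bearing element with the lowercased tag name
# followed by attribute name/value pairs.
sub record_link {
    my ($tag, %attrs) = @_;
    for my $name (sort keys %attrs) {
        # Values are URI objects because a base URL was supplied;
        # they stringify to the absolute URL.
        push @found, "$tag $name $attrs{$name}";
    }
}

my $parser = HTML::LinkExtor->new(\&record_link, $base_url);
$parser->parse($html);
$parser->eof;

print "$_\n" for @found;
```

Because the callback runs during parsing, this form suits large documents or cases where you want to act on each link immediately. With a callback installed, the links method has nothing to return, so gather results yourself, as @found does here.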