20.5. Converting HTML to ASCII

Problem

You want to convert an HTML file into formatted plain ASCII.

Solution

If you have an external formatter like lynx, call it as an external program:

$ascii = `lynx -dump $filename`;

If you want to do it within your program and don't care about the things that the HTML::FormatText formatter doesn't yet handle (tables and frames):

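use HTML::FormatText;
use HTML::Parse;    # legacy interface that provides parse_htmlfile()

# Parse the file into an HTML syntax tree.
$html = parse_htmlfile($filename);
# Format the tree as plain text wrapped between columns 0 and 50.
$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);
$ascii = $formatter->format($html);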
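
As a possible shortcut, later releases of HTML::FormatText (those shipped in the HTML-Formatter distribution) document class methods that parse and format in one call. The sketch below assumes such a release is installed; the sample markup in `$document` is made up for illustration.

```perl
use strict;
use warnings;
use HTML::FormatText;

# Illustrative input only; any HTML string would do.
my $document = "<h1>Title</h1><p>Hello, world</p>";

# format_string() parses the markup and formats it in one step
# (assumes an HTML::FormatText release that provides this class method).
my $ascii = HTML::FormatText->format_string(
    $document,
    leftmargin  => 0,
    rightmargin => 50,
);

print $ascii;
```

The same releases also document a `format_file` class method that takes a filename in place of the string.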

Discussion

These examples both assume you have the HTML text in a file. If your HTML is in a variable, you need to write it to a file for lynx to read. If you are using HTML::FormatText, build the parse tree directly from the string with the HTML::TreeBuilder module:

use HTML::TreeBuilder;
use HTML::FormatText;

$html = HTML::TreeBuilder->new();
$html->parse($document);
$html->eof();    # signal end of input so the tree is completed

$formatter = HTML::FormatText->new(leftmargin => 0, rightmargin => 50);

$ascii = $formatter->format($html);
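
For the lynx route, the step of writing the variable to a file first could look roughly like the following. This is only a sketch: the helper name `html_to_ascii` and its optional command argument are hypothetical, and the default assumes a lynx binary on the PATH.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper: write $document to a temporary .html file and
# return whatever the formatter command prints for that file.
sub html_to_ascii {
    my ($document, @command) = @_;
    @command = ('lynx', '-dump') unless @command;   # assumes lynx is installed

    # The .html suffix lets the formatter recognize the file as HTML.
    my ($fh, $tmpname) = tempfile(SUFFIX => '.html', UNLINK => 1);
    print {$fh} $document;
    close($fh) or die "can't close $tmpname: $!";

    # List-form pipe open bypasses the shell, so unusual characters
    # in the filename are never interpreted.
    open(my $out, '-|', @command, $tmpname)
        or die "can't run $command[0]: $!";
    local $/;                       # slurp all output at once
    my $ascii = <$out>;
    close($out) or die "$command[0] failed: " . ($! || "exit status $?");
    return $ascii;
}
```

Calling `html_to_ascii($document)` then returns the formatted text; the temporary file is removed when the program exits.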

If you use Netscape, its "Save as" option with the type set to "Text" does the best job with tables.

See Also

The documentation for the CPAN modules HTML::Parse, HTML::TreeBuilder, and HTML::FormatText; your system's lynx(1) manpage; Recipe 20.6

