Anna Fong
Advisor

regexp and HTML

Does anyone know of a module that will match a string while ignoring any and all HTML within it?

For example,

$str1 = "some links to this subject";
$str2 = "some links to this subject";

The desired module would consider $str1 and $str2 to be equivalent.

If no such module is available, are there any code examples that accomplish this?

TIA,
Anna
5 REPLIES
H.Merijn Brand (procura
Honored Contributor
Solution

Re: regexp and HTML

# perl -MHTML::Parser

BTW, this module is included in my latest 5.8.3 ports :)

If you want to do it yourself, start with Parse::RecDescent

Enjoy, Have FUN! H.Merijn [ Who thinks writing such regexes is too much work ]
Anna Fong
Advisor

Re: regexp and HTML

I've located an old HTML-Parser on the system (HTML-Parser-2.22). The documentation is very sparse. Any hints on using it to do above task?
H.Merijn Brand (procura
Honored Contributor

Re: regexp and HTML

2.22 is rather old (1998-12-18). The current version is 3.35, which has *many* documentation updates (see http://search.cpan.org/src/GAAS/HTML-Parser-3.35/Changes)

I've never used it myself. I always use the brute-force way, 'lynx -dump', but that won't help in your case.
But you were explicitly asking for a module, and I understand the need.

The problem with writing it yourself with regular expressions is that

Please mail Janneman for questions

will be rendered ok by many browsers, but is hard to parse. Worse is that

Please mail Janneman for questions

would make it even harder. I don't know if that's legal, but both lines parse correctly in Opera and both work (I just tested).
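
To illustrate why the regex route is fragile, here is a minimal sketch. The markup is made up for the illustration (it is not the lines quoted above), and it assumes an attribute value that contains a literal '>'. A naive tag-stripping substitution stops at the first '>' it sees, even inside a quoted attribute value:

```perl
use strict;
use warnings;

# Hypothetical input: the title attribute contains a literal '>'
my $html = 'Please mail <a href="mailto:j@example.com" title="a > b">Janneman</a> for questions';

# Naive approach: treat everything from '<' to the next '>' as a tag
(my $naive = $html) =~ s/<[^>]*>//g;

# The first "tag" ends at the '>' inside the attribute value,
# so part of the attribute leaks into the resulting text
print "$naive\n";
```

A real parser tracks quoting inside tags, which is why a module is the safer choice here.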

The best answer is probably in perl itself:

# perldoc -q 'How do I remove HTML from a string'

will give you a pretty complete answer.

The example for HTML::Parser is also in the FAQ (perlfaq9), here's a (stripped to your needs) code snippet from there:

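use HTML::Parse;        # exports parse_html (HTML::Parser itself does not)
use HTML::FormatText;
# leftmargin => 0 drops the default indentation of the formatted text
$ascii = HTML::FormatText->new (leftmargin => 0)->format (parse_html ($str2));
$ascii =~ s/\s+/ /g;    # the formatter wraps lines; collapse to single spaces
$ascii =~ s/^ | $//g;   # trim leading/trailing space
$str1 eq $ascii and return "is equal";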
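
If HTML::FormatText is not around, the version 3 API of HTML::Parser can collect the text on its own. This is only a sketch, assuming HTML::Parser 3.x is installed (it will not work with 2.22); strip_html is a made-up helper name and the sample markup is invented for the illustration:

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: return the text content of an HTML fragment
sub strip_html {
    my ($html) = @_;
    my $text = '';

    my $p = HTML::Parser->new(
        api_version => 3,
        # 'dtext' passes text with entities decoded; tags are simply ignored
        text_h      => [ sub { $text .= shift }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    # Normalize whitespace so the comparison ignores layout differences
    $text =~ s/\s+/ /g;
    $text =~ s/^ | $//g;
    return $text;
}

my $str1 = 'some links to this subject';
my $str2 = 'some <a href="http://example.com/">links</a> to <b>this</b> subject';

print "is equal\n" if $str1 eq strip_html($str2);
```

Note that text inside script and style elements is reported as text too, so those would need extra handling if they can occur in your strings.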

Enjoy, Have FUN! H.Merijn
Anna Fong
Advisor

Re: regexp and HTML

Very cool! Reply back for 10 bonus points!
H.Merijn Brand (procura
Honored Contributor

Re: regexp and HTML

Hmmm, points :)

Thank you. These are the (technical) points I like to earn!

Enjoy, Have FUN! H.Merijn