XHTML selective tag stripper/filter via LibXML

Sometimes user data contains markup that makes it harmful to display or limits the contexts in which it can be shown safely. On this site, for instance, an article’s title may contain anchor links with href attributes, and the same title is also used as the link text that points to the piece.

For example, article 1556 is titled:

<a href="/ddx/w/1691.html">Dog</a> days dress-up

And we want it to work that way on its display page. To link to its display page from elsewhere, we need to wrap the title in another link. This causes a big problem:

<a href="/a/1556"><a href="/ddx/w/1691.html">Dog</a> days dress-up</a>

That won’t validate and will break or behave differently across browsers. So we’d like to be able to strip the anchor tag. Stripping all HTML is trivial, so we could just take everything out, but what if the title contains desirable markup, like <del></del> or <i></i>?

So we want to strip selectively, that is, filter out particular tags. Not just anchors: perhaps we’d like to filter heading-level or block-level tags too. So we accept a list of tags to strip.

use warnings;
use strict;
use XML::LibXML;

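# Use low-precedence "or": with "||", @ARGV is evaluated in scalar
# context and @strip would hold the argument count, not the tag names.
my @strip = @ARGV or die "Give me tags to strip!\n";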

my $parser = XML::LibXML->new();

my $raw = join'', <DATA>;

my $doc = $parser->parse_string($raw);

my $root = $doc->documentElement();

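for my $strip ( @strip )
{
    # Find every element with this tag name anywhere in the document.
    for my $node ( $root->findnodes("//$strip") )
    {
        # Move the element's children into a fragment, then swap the
        # fragment in for the element: the tag goes, its content stays.
        my $fragment = $doc->createDocumentFragment();
        $fragment->appendChild($_) for $node->childNodes;
        $node->replaceNode($fragment);
    }
}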

print $doc->serialize();

exit 0;

__END__
<div>
<h1>Bang!</h1>
<p>Did <i>italic</i> and
<a href="/uri">link with <b>bold</b> inside it</a>.</p>

<a href="/top-level">naked link</a>
</div>
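
To tie this back to the article-title problem, here is a minimal sketch that wraps the same loop in a function. The name strip_tags and the temporary <wrapper> root element are assumptions made for illustration; they are not part of the script above. The wrapper is there because a title such as the one for article 1556 has no single root element, so it would not parse as a document on its own.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: strip the named tags from a markup fragment but
# keep their contents. "wrapper" is an arbitrary, temporary root name.
sub strip_tags {
    my ( $markup, @strip ) = @_;

    my $doc  = XML::LibXML->new->parse_string("<wrapper>$markup</wrapper>");
    my $root = $doc->documentElement;

    for my $tag (@strip) {
        for my $node ( $root->findnodes("//$tag") ) {
            # Same technique as the script above: children go into a
            # fragment, and the fragment replaces the element.
            my $fragment = $doc->createDocumentFragment;
            $fragment->appendChild($_) for $node->childNodes;
            $node->replaceNode($fragment);
        }
    }

    # Serialize only the children so the temporary wrapper is dropped.
    return join '', map { $_->toString } $root->childNodes;
}

my $title = '<a href="/ddx/w/1691.html">Dog</a> days dress-up';
print '<a href="/a/1556">', strip_tags( $title, 'a' ), "</a>\n";
```

With the inner anchor stripped, this should print a single, un-nested link around "Dog days dress-up". Tags that are not in the list, such as <i> or <del>, pass through untouched.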

XML::LibXML also makes tag and attribute translation a snap. Say you wanted to inline/flow everything by turning heading tags into bold and giving block tags a style of display: inline.
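
As a rough sketch of that kind of translation, the following renames heading elements to <b> with setNodeName and puts an inline display style on a few block-level tags with setAttribute. The function name inline_markup and the particular lists of heading and block tags are assumptions chosen for illustration. Note that setAttribute overwrites any style attribute already present.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: flatten headings to bold and make block tags inline.
sub inline_markup {
    my ($xml) = @_;

    my $doc  = XML::LibXML->new->parse_string($xml);
    my $root = $doc->documentElement;

    # Tag translation: each heading element is renamed in place to <b>.
    $_->setNodeName('b')
        for $root->findnodes('//h1 | //h2 | //h3 | //h4 | //h5 | //h6');

    # Attribute translation: block tags keep their name but get an
    # inline display style (replacing any existing style attribute).
    $_->setAttribute( style => 'display:inline' )
        for $root->findnodes('//p | //div | //blockquote');

    return $root->toString;
}

print inline_markup('<div><h1>Bang!</h1><p>Did <i>italic</i>.</p></div>'), "\n";
```

The structure of the document is unchanged; only element names and attributes differ, so inline markup such as <i> survives as it was.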
