Enpara — adding well-formed auto-paragraph markup to mixed text + html

One of the most frustrating things in writing HTML for… [Sorry, just stubbing this out for now. The code is solid as a demo: it is not what you'd want in production, but it is pretty close to what this site has used since late 2006. To be finished it needs error handling for bad input.]

The code below takes input, such as comment form fields or blog posts, wraps double-spaced content in paragraph tags, and replaces single newlines with breaks (<br/>) when they appear in naked, auto-formatted paragraphs.
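As a rough illustration of the text-level half of that rule, here is a string-only sketch. It is a simplification, not the node-based code further down, and the function name text_to_paras is made up for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: string-only approximation of the paragraph/break rule.
sub text_to_paras {
    my ($text) = @_;
    my @paras;
    # Two or more newlines (each optionally preceded by spaces/tabs) end a paragraph.
    for my $chunk ( split /(?:[^\S\n]*\n){2,}/, $text ) {
        next unless $chunk =~ /\S/;     # skip whitespace-only chunks
        $chunk =~ s/\A\s+|\s+\z//g;     # trim leading and trailing whitespace
        $chunk =~ s/\n/<br\/>\n/g;      # remaining single newlines become breaks
        push @paras, "<p>$chunk</p>";
    }
    return join "\n", @paras;
}

print text_to_paras("also need\n\nthree in a row\nand a second line\n"), "\n";
# prints:
# <p>also need</p>
# <p>three in a row<br/>
# and a second line</p>
```

This only works on plain text. It has no idea whether a chunk already contains block-level markup, which is exactly the gap the parser-based approach is meant to close.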

This code (well, not this code but this approach) is superior to approaches that split on whitespace because it leaves alone block-level tags, paragraphs included, even when they are surrounded by double spacing. It is also smart enough to “enpara” anything naked, such as inline tags like <b></b>, regardless of spacing, because it is not splitting the raw input on whitespace in the first place. It parses the raw content as HTML and then corrects the naked nodes; whitespace only matters inside naked text, where blank lines separate paragraphs.
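To make "naked nodes" concrete, here is a minimal sketch that parses a fragment and labels each top-level node. The helper name classify and the shortened block-level list are assumptions for illustration; the real list lives in the module below. The fragment is wrapped in explicit html and body tags so the parser does not have to guess where body content starts.

```perl
use strict;
use warnings;
use XML::LibXML;

# Assumed, abbreviated block-level list for illustration only.
my %block = map { $_ => 1 } qw( p pre hr div blockquote ul ol table );

# Hypothetical helper: label each top-level node as block or naked.
sub classify {
    my ($html) = @_;
    # recover => 2 asks libxml2 to recover from sloppy markup silently.
    my $doc = XML::LibXML->new( recover => 2 )
        ->parse_html_string("<html><body>$html</body></html>");
    my ($body) = $doc->findnodes('//body');
    return () unless $body;

    my @labels;
    for my $n ( $body->childNodes ) {
        # Whitespace-only text between tags is not interesting here.
        next if $n->nodeName eq '#text' and $n->nodeValue !~ /\S/;
        push @labels,
            ( $block{ $n->nodeName } ? 'block' : 'naked' ) . ':' . $n->nodeName;
    }
    return @labels;
}

print "$_\n"
    for classify('<p>Fifth, tagged by hand.</p>Sixth, naked.<hr/><b>Seventh.</b>');
```

The naked text sits directly against the closing paragraph tag with no whitespace between them, yet it still shows up as its own node. Runs of consecutive naked nodes are what get wrapped in a new paragraph.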

Note that this is easy to extend to enpara inside arbitrary block-level elements like <blockquote></blockquote>. Also note that this is not a filter: as is, it will let XSS attacks through. The same LibXML framework is well suited to stripping malicious tags, though.
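For a sense of what that stripping could look like, here is a hedged sketch. The helper name strip_dangerous, the tag list, and the attribute rules are all assumptions made for this example, and it is nowhere near a complete XSS defense.

```perl
use strict;
use warnings;
use XML::LibXML;

# Hypothetical helper: drop a few obviously dangerous constructs.
# Illustrative only; the lists below are assumptions, not a vetted policy.
sub strip_dangerous {
    my ($html) = @_;
    my $doc = XML::LibXML->new( recover => 2 )
        ->parse_html_string("<html><body>$html</body></html>");

    # Remove whole elements that should never come from untrusted input.
    $_->unbindNode
        for $doc->findnodes('//script | //iframe | //object | //style');

    # Remove inline event handlers (onclick, onload, ...) and javascript: URLs.
    for my $el ( $doc->findnodes('//*[@*]') ) {
        for my $attr ( $el->attributes ) {
            my $name = lc $attr->nodeName;
            $el->removeAttribute( $attr->nodeName )
                if $name =~ /\Aon/
                or ( $name =~ /\A(?:href|src)\z/
                     and $attr->value =~ /\A\s*javascript:/i );
        }
    }

    my ($body) = $doc->findnodes('//body');
    return '' unless $body;
    return join '', map { $_->toString } $body->childNodes;
}

print strip_dangerous(
    '<p onclick="evil()">Hi there</p><script>alert(1)</script>'
), "\n";
```

A deny list like this is easy to get around. Keeping only an explicit list of allowed tags and attributes is generally the safer design, and the same tree walk supports it.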

#!/usr/bin/perl
use strict;
use warnings;
# use Enpara; # can't and don't need to use it if it's in the same file

my $enpara = Enpara->new();
my $raw_input = join "", <DATA>;
print $enpara->enpara($raw_input);

exit 0;

package Enpara;
use strict;
use warnings;
use XML::LibXML;
use Carp;
# use HTML::Entities;
# our %Charmap = %HTML::Entities::entity2char;
# delete @Charmap{qw( amp lt gt quot apos )}; # these are valid

=head1 NAME

Enpara - Proof of concept for correctly adding paragraph tags
to mixed text and HTML, where double spacing is expected to "enpara"
the text while raw HTML is also respected.

=head1 VERSION

0.01

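=head1 SYNOPSIS

 use Enpara;

 my $enpara = Enpara->new();
 print $enpara->enpara($raw_input);

=head1 DESCRIPTION

Parses mixed text and HTML with XML::LibXML, wraps naked (non-block)
content in paragraph tags, and turns single newlines inside those
generated paragraphs into line breaks. This is not a filter; it does
not strip malicious markup.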

=cut

sub new {
    my $class = shift;
    my $raw = shift;
    croak "Too many arguments to new" if @_;

    my $self = bless { _raw => $raw }, $class;

    $self->{_block_level} = 
    {
        map { $_ => 1 } 
        qw( blockquote fieldset address noscript iframe
            object param script table tbody thead tfoot form
            img div map pre dl di dt dd h1 h2 h3 h4 h5 h6 hr
            ol ul li td th tr p )
        };

    return $self;
}

sub _convert_to_xml {
    my ( $self, $raw ) = @_;

    $raw ||= $self->{_raw};
    my $prepared = "<html>\n" . $raw . "\n</html>";
    $self->{_parser} = XML::LibXML->new();
    $self->{_parser}->line_numbers(1);

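    # parse_html_string can die on input it cannot handle; trap that and
    # report it from the caller's perspective instead of returning undef.
    my $doc = eval { $self->{_parser}->parse_html_string($prepared) };
    croak "Could not parse input as HTML: $@" unless $doc;
    return $doc;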
}

sub enpara {
    my ( $self, $raw ) = @_;
    # $raw may be empty here, already set at object creation
    $self->{_doc} = $self->_convert_to_xml($raw);
    my ( $body ) = $self->{_doc}->documentElement->findnodes("//body");
    $self->_enpara_this_nodes_content( $body );
    my $result = $body->serialize(1);
    $result =~ s,^\s*<body>|\s*</body>$,,g;
    return $result;
}

sub _enpara_this_nodes_content {
    my ( $self, $node ) = @_;

    my $lastChild = $node->lastChild;
    my @naked_block;
    for my $n ( $node->childNodes )
    {
        if ( $self->{_block_level}{$n->nodeName}
             or
             $n->nodeName eq "a" # special case block level, so IGNORE
             and
             grep { $_->nodeName eq "img" } $n->childNodes
             )
        {
            next unless @naked_block; # nothing to enblock
            my $p = $self->{_doc}->createElement("p");
            $p->setAttribute("enpara","enpara");
            $p->appendChild($_) for @naked_block;
            $node->insertBefore( $p, $n )
                if $p->textContent =~ /\S/;
            @naked_block = ();

        }
        elsif ( $n->nodeName eq "#text"
                and
                $n->nodeValue =~ /(?:[^\S\n]*\n){2,}/
                )
        {
            my $text = $n->nodeValue;

            my @text_part = map { $self->{_doc}->createTextNode($_) }
                split /([^\S\n]*\n){2,}/, $text;

            my @new_node;
            for ( my $x = 0; $x < @text_part; $x++ )
            {
                if ( $text_part[$x]->nodeValue =~ /\S/ )
                {
                    push @naked_block, $text_part[$x];
                }
                else # it's a blank newline node so _STOP_
                {
                    next unless @naked_block;
                    my $p = $self->{_doc}->createElement("p");
                    $p->setAttribute("enpara","enpara");
                    $p->appendChild($_) for @naked_block;
                    @naked_block = ();
                    push @new_node, $p;
                }
            }
            if ( @new_node )
            {
                $node->insertAfter($new_node[0], $n);
                for ( my $x = 1; $x < @new_node; $x++ )
                {
                    $node->insertAfter($new_node[$x], $new_node[$x-1]);
                }
            }
            $n->unbindNode;
        }
        else
        {
            push @naked_block, $n;
        }

        if ( $n->isSameNode( $lastChild )
             and @naked_block )
        {
            my $p = $self->{_doc}->createElement("p");
            $p->setAttribute("enpara","enpara");
            $p->appendChild($_) for ( @naked_block );
            $node->appendChild($p) if $p->textContent =~ /\S/;
        }
    }

    my $newline = $self->{_doc}->createTextNode("\n");
    my $br = $self->{_doc}->createElement("br");

    for my $p ( $node->findnodes('//p[@enpara="enpara"]') )
    {
        $p->removeAttribute("enpara");
        $node->insertBefore( $newline->cloneNode, $p );
        $node->insertAfter( $newline->cloneNode, $p );

        my $frag = $self->{_doc}->createDocumentFragment();

        my @kids = $p->childNodes();
        for ( my $i = 0; $i < @kids; $i++ )
        {
            my $kid = $kids[$i];
            next unless $kid->nodeName eq "#text";
            my $text = $kid->nodeValue;
            $text =~ s/\A\n// if $i == 0;
            $text =~ s/\n\z// if $i == $#kids;

            my @lines = map { $self->{_doc}->createTextNode($_) }
                split /(\n)/, $text;

            # $j rather than $i, so the outer loop index is not shadowed
            for ( my $j = 0; $j < @lines; $j++ )
            {
                $frag->appendChild($lines[$j]);
                unless ( $j == $#lines
                         or
                         $lines[$j]->nodeValue eq "\n" )
                {
                    $frag->appendChild($br->cloneNode);
                }
            }
            $kid->replaceNode($frag);
        }
    }
}

1;

__END__
<p>Did it manually in the first.</p>

<b>Didn't</b> <i>do it in the second.</i>

<p>Did it manually again in the third.</p>

<pre>
This is the fourth block and has

“triple spacing in it and an &amp;”
</pre>
Didn't do it here in the fifth.<p>Did it here in
the sixth mashed up against the fifth so we
could not possibly split on whitespace.</p>

<hr/>

Have a <b>bold</b> here that needs a paragraph.

also need

three in a row

and four for that matter

<p>real para back into the mix</p>

And two in a row <a href="http://jasper.local:3000/a/12" title="Read
more of "So I kinda have a crush">[read more]</a>

<b>asdf</b>

!

?

Resulting output

<p>Did it manually in the first.</p>
<p><b>Didn't</b> <i>do it in the second.</i></p>
<p>Did it manually again in the third.</p><pre>
This is the fourth block and has

“triple spacing in it and an &amp;”
</pre>
<p>Didn't do it here in the fifth.</p>
<p>Did it here in the sixth mashed up against the fifth
so we could not possibly split on whitespace.</p><hr/>
<p>Have a <b>bold</b> here that needs a paragraph.</p>

<p>also need</p>

<p>three in a row</p>

<p>and four for that matter</p>
<p>real para back into the mix</p>
<p>And two in a row <a href="http://jasper.local:3000/a/12" title="Read
more of "So I kinda have a crush">[read more]</a></p>

<p><b>asdf</b></p>

<p>!</p>

<p>?</p>