Smart quotes with Perl

Introduction

You want your HTML to be typographically correct. But typing &#8216;this&#8217; constantly to get ‘this’ instead of 'this' is ludicrous. Perl to the rescue.

Sample text file, named “plain-quote.html”
<blockquote>Here is a parenthetical emdash--a 
simple one--to show how this might 
work. Like I've always said, "I agree 
when they say, 'Perl is the duct tape 
of the Internet'."</blockquote>

<div style="text-align:right;padding-right:22px">--Ashley</div>

Which looks like this in your browser…

Here is a parenthetical emdash--a simple one--to show how this might work. Like I've always said, "I agree when they say, 'Perl is the duct tape of the Internet'."
--Ashley
Apply our smart quoter to it
G4:jinx[786]/www/perl>quote-fixer plain-quote.html
And we get back
<blockquote>Here is a parenthetical emdash&#8212;a 
simple one&#8212;to show how this might 
work. Like I&#8217;ve always said, &#8220;I agree 
when they say, &#8216;Perl is the duct tape 
of the Internet&#8217;.&#8221;</blockquote>

<div style="text-align:right;padding-right:22px">&#8211;Ashley</div>

Which will now render in a browser…

Here is a parenthetical emdash—a simple one—to show how this might work. Like I’ve always said, “I agree when they say, ‘Perl is the duct tape of the Internet’.”
–Ashley
Here is the code to do it
use warnings;
use strict;
use HTML::TokeParser;
#==============================================================
# needs to be an array to do them in the right order
my @Entity_Transform =
        (
# reverse dummy quotes to prepare for fixing
         sub { $_[0] =~ s/&quot;/"/g;
               $_[0] =~ s/(?<!`)``(?!`)/"/g;
               $_[0] =~ s/(?<!')''(?!')/"/g;
              },
# N-dash
         sub { $_[0] =~ s/(?<=\w)--(?=\s|\z)/&#8211;/g;
               $_[0] =~ s/(?<=\d)--(?=\d)/&#8211;/g;
               $_[0] =~ s/(\s|\A)--(?=\w)/$1&#8211;/g },
# M-dash
         sub { $_[0] =~ s/(?<!-)---?(?!-)/&#8212;/g;
               $_[0] =~ s/(?<=\w)---?(?=\w)/&#8212;/g },
# left_single 
         sub { $_[0] =~ s/(\s|\A)'/$1&#8216;/g;
               $_[0] =~ s/(?<!\w)'(?=\w)/&#8216;/g;
              },
# right_single
         sub { $_[0] =~ s/(?<!\s)'/&#8217;/g;
               $_[0] =~ s/'(?=\s|\z)/&#8217;/g;
              },
# double_quotes 
         sub {
              $_[0] =~ s/"(?=\w)/&#8220;/g;
              $_[0] =~ s/(?<!\s)"/&#8221;/g;
              $_[0] =~ s/(\A|\s)"/$1&#8220;/g;
           },
# breakable hyphen: a zero-width space (U+200B) after an in-word hyphen allows a line break there
         sub { $_[0] =~ s/(?<=\w-)(?=\w)/&#8203;/g },
# ellipsis
         sub { $_[0] =~ s/(?<!\.)\.\.\.(?!\.|\z)/&#8230;/g },
# copyright
         sub { $_[0] =~ s/\([Cc]\)/&#169;/g },
# trademark
         sub { $_[0] =~ s/\(tm\)/&#8482;/g },
         );
#==============================================================
# PROGRAM PROPER
#==============================================================
my $file = shift || die "Give me a file to fix up!\n";
-e $file or die "Put the pipe away. There is no file: $file\n";

my $parser = HTML::TokeParser->new($file) || die;

my @in_tag;
my $renewed = '';

while (my $token = $parser->get_token) 
{
    if ( $token->[0] eq 'S' )
    {
        $renewed .= $token->[4];
        push @in_tag, $token->[1]
            unless grep { $token->[1] eq $_ }
        qw( area base br col embed hr img input link meta param source track wbr ); # void (non-closing) tags
    }
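    elsif ( $token->[0] eq 'E' )
    {
        $renewed .= $token->[2];
        pop @in_tag;
    }
    elsif ( $token->[0] eq 'PI' )
    {
        # processing instructions open no tag, so there is nothing to pop
        $renewed .= $token->[2];
    }
    elsif ( $token->[0] eq 'C' or $token->[0] eq 'D' )
    {
        # comments and declarations (such as a DOCTYPE) pass through untouched
        $renewed .= $token->[1];
    }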
    elsif ( $token->[0] eq 'T' )
    {
        my $txt = $token->[1]; 
        if ( grep /^(?:code|pre|script|style|textarea)$/, @in_tag ) {
            $renewed .= $txt;
        }
        else 
        {
            my $tmp = $txt;
            for my $code ( @Entity_Transform )
            {
                $code->($tmp);
            }
            $renewed .= $tmp;
        }
    }
}
print $renewed, "\n";

exit 0;
#==============================================================
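
The comment about doing the transforms "in the right order" is easy to check for yourself. Here is a minimal sketch, not part of the script, that lifts the two dash rules and runs them both ways round. The helper apply_in_order and the sample string are made up for the illustration.

```perl
use strict;
use warnings;

# The two dash rules from the script, pulled out on their own.
my $n_dash = sub { $_[0] =~ s/(?<=\w)--(?=\s|\z)/&#8211;/g;
                   $_[0] =~ s/(?<=\d)--(?=\d)/&#8211;/g;
                   $_[0] =~ s/(\s|\A)--(?=\w)/$1&#8211;/g };
my $m_dash = sub { $_[0] =~ s/(?<!-)---?(?!-)/&#8212;/g };

# apply_in_order is a hypothetical helper: it runs each rule against
# a copy of the text, in the order given, and returns the result.
sub apply_in_order {
    my ($text, @rules) = @_;
    $_->($text) for @rules;
    return $text;
}

my $sample = 'pages 10--12, a break--like this';
my $right  = apply_in_order($sample, $n_dash, $m_dash);
my $wrong  = apply_in_order($sample, $m_dash, $n_dash);

print "N-dash first: $right\n";
print "M-dash first: $wrong\n";
```

With the N-dash rule first, the page range gets &#8211; and the parenthetical break gets &#8212;. Reversed, the M-dash rule eats every double hyphen before the N-dash rule ever sees one.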

Discussion

This is a great utility, but it is a simplistic one. Still, an algorithm that works for 99% of given cases is a good one. This one will fail in certain places because it’s not looking for delimited quotes. Therefore it is not smart enough to know what to do with open contractions like ‘cause and ‘Burque, and it picks the wrong quote (an opening one where an apostrophe belongs). It also fails when we want the original symbols, as in, “I am 6'2" and change.”
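
Both failures can be reproduced in a few lines. The sketch below copies only the single- and double-quote rules from the script into one helper, smarten, which is a made-up name and not part of quote-fixer. The two sample sentences are invented for the test.

```perl
use strict;
use warnings;

# smarten is a hypothetical helper: the quote rules from the script,
# applied in the same order, to a copy of the text.
sub smarten {
    my ($text) = @_;
    # left single
    $text =~ s/(\s|\A)'/$1&#8216;/g;
    $text =~ s/(?<!\w)'(?=\w)/&#8216;/g;
    # right single
    $text =~ s/(?<!\s)'/&#8217;/g;
    $text =~ s/'(?=\s|\z)/&#8217;/g;
    # double quotes
    $text =~ s/"(?=\w)/&#8220;/g;
    $text =~ s/(?<!\s)"/&#8221;/g;
    $text =~ s/(\A|\s)"/$1&#8220;/g;
    return $text;
}

# An open contraction: the apostrophe follows a space, so the
# left-single rule claims it.
print smarten(q{Just 'cause I said so.}), "\n";

# Feet and inches: both marks follow a digit, so they are treated
# as closing quotes.
print smarten(q{I am 6'2" and change.}), "\n";
```

The first line comes back with &#8216; where an apostrophe (&#8217;) belongs, and the second turns the foot and inch marks into &#8217; and &#8221;.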

You can see in the code that we need to keep track of what tags we’ve descended into. If we fix quotes within an HTML tag, we’ll break it. Fixing them in a style declaration will break the CSS. If we fix quotes in comments, we’ll break SSIs. If we fix quotes in scripts, we’ll break them. If we fix quotes in pre, code, or textarea tags, we’ll change the literal meaning of demonstrated code. So we skip the attributes of tags altogether and avoid messing with the content of pre|code|script|textarea|style tags.
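
The bookkeeping is simple enough to sketch without a parser. The tokens below are written by hand in the same array shape HTML::TokeParser hands back for start tags, text, and end tags. The helper collect_fixable and the sample markup are assumptions for the illustration, not part of the script.

```perl
use strict;
use warnings;

# Hand-written tokens mimicking HTML::TokeParser's shapes:
#   [ 'S', tag, attr, attrseq, text ], [ 'T', text ], [ 'E', tag, text ]
my @tokens = (
    [ 'S', 'p',    {}, [], '<p>'    ],
    [ 'T', q{"prose" }              ],
    [ 'S', 'code', {}, [], '<code>' ],
    [ 'T', q{print "literal";}      ],
    [ 'E', 'code', '</code>'        ],
    [ 'E', 'p',    '</p>'           ],
);

# Tags whose text content must be left alone.
my %protected = map { $_ => 1 } qw( code pre script style textarea );

# collect_fixable is a hypothetical helper: it walks the tokens with a
# stack of open tags and returns only the text that is safe to rewrite.
sub collect_fixable {
    my (@toks) = @_;
    my (@stack, @fixable);
    for my $t (@toks) {
        if    ( $t->[0] eq 'S' ) { push @stack, $t->[1] }
        elsif ( $t->[0] eq 'E' ) { pop @stack }
        elsif ( $t->[0] eq 'T' ) {
            push @fixable, $t->[1]
                unless grep { $protected{$_} } @stack;
        }
    }
    return @fixable;
}

my @fixable = collect_fixable(@tokens);
print "would fix: $_\n" for @fixable;
```

Only the text directly inside the paragraph is offered up for fixing. The text inside code is skipped because code is still on the stack when that token arrives.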

A fully robust solution might make use of recursive regexes or Text::Balanced, but it would be “breakable” by bad user input (incorrect punctuation), and the one above is much simpler and works quite well as is.
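
To get a feel for that trade-off, here is a rough sketch of the delimited approach using a plain pairing substitution. It is a much simpler stand-in for recursive regexes or Text::Balanced, not either of them. The helper pair_double_quotes and the sample sentences are made up for the illustration.

```perl
use strict;
use warnings;

# pair_double_quotes is a hypothetical helper, not part of quote-fixer.
# It treats each "..." run as one delimited unit instead of judging
# every quote mark by its neighbors.
sub pair_double_quotes {
    my ($text) = @_;
    $text =~ s/"([^"]*)"/&#8220;$1&#8221;/g;
    return $text;
}

# Balanced input pairs up cleanly.
print pair_double_quotes(q{She said, "hello".}), "\n";

# A stray inch mark throws the pairing off for the rest of the text.
print pair_double_quotes(q{I am 6'2" and "change".}), "\n";
```

The first sentence comes out right. In the second, the inch mark is taken as an opening quote and paired with the quote before “change”, which leaves the real closing quote stranded. That is the sense in which a delimited solution is breakable by its input.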

Text, original code, fonts, and graphics ©1990-2008 Ashley Pond V.