Smart quotes with Perl
Introduction

You want your HTML to be typographically correct. But typing &#8216;this&#8217; constantly to get ‘this’ instead of "this" is ludicrous. Perl to the rescue.

Sample text file, named “plain-quote.html”:

    <blockquote>Here is a parenthetical emdash--a simple one--to show how this might work. Like I've always said, "I agree when they say, 'Perl is the duct tape of the Internet'."</blockquote>
    <div style="text-align:right;padding-right:22px">--Ashley</div>

Which looks like this in your browser:

Here is a parenthetical emdash--a simple one--to show how this might work. Like I've always said, "I agree when they say, 'Perl is the duct tape of the Internet'."
--Ashley

Apply our smart quoter to it:

    G4:jinx[786]/www/perl>quote-fixer plain-quote.html

And we get back:

    <blockquote>Here is a parenthetical emdash&#8212;a simple one&#8212;to show how this might work. Like I&#8217;ve always said, &#8220;I agree when they say, &#8216;Perl is the duct tape of the Internet&#8217;.&#8221;</blockquote>
    <div style="text-align:right;padding-right:22px">&#8211;Ashley</div>

Which will now render in a browser as:

Here is a parenthetical emdash—a simple one—to show how this might work. Like I’ve always said, “I agree when they say, ‘Perl is the duct tape of the Internet’.”
–Ashley

Here is the code to do it:

    use warnings;
    use strict;
    use HTML::TokeParser;

    #==============================================================
    # needs to be an array to do them in the right order
    my @Entity_Transform = (
        # reverse dummy quotes to prepare for fixing
        sub { $_[0] =~ s/&quot;/"/g;
              $_[0] =~ s/(?<!`)``(?!`)/"/g;
              $_[0] =~ s/(?<!')''(?!')/"/g;
          },
        # N-dash
        sub { $_[0] =~ s/(?<=\w)--(?=\s|\z)/&#8211;/g;
              $_[0] =~ s/(?<=\d)--(?=\d)/&#8211;/g;
              $_[0] =~ s/(\s|\A)--(?=\w)/$1&#8211;/g },
        # M-dash
        sub { $_[0] =~ s/(?<!-)---?(?!-)/&#8212;/g;
              $_[0] =~ s/(?<=\w)---?(?=\w)/&#8212;/g },
        # left_single
        sub { $_[0] =~ s/(\s|\A)'/$1&#8216;/g;
              $_[0] =~ s/(?<!\w)'(?=\w)/&#8216;/g;
          },
        # right_single
        sub { $_[0] =~ s/(?<!\s)'/&#8217;/g;
              $_[0] =~ s/'(?=\s|\z)/&#8217;/g;
          },
        # double_quotes
        sub { $_[0] =~ s/"(?=\w)/&#8220;/g;
              $_[0] =~ s/(?<!\s)"/&#8221;/g;
              $_[0] =~ s/(\A|\s)"/$1&#8220;/g;
          },
        # breakable hyphen
        sub { $_[0] =~ s/(?<=\w-)(?=\w)/&#8205;/g },
        # ellipsis
        sub { $_[0] =~ s/(?<!\.)\.\.\.(?!\.|\z)/&#8230;/g },
        # copyright
        sub { $_[0] =~ s/\([Cc]\)/&#169;/g },
        # trademark
        sub { $_[0] =~ s/\(tm\)/&#8482;/g },
    );
    #==============================================================
    #  PROGRAM PROPER
    #==============================================================
    my $file = shift || die "Give me a file to fix up!\n";
    -e $file or die "Put the pipe away. There is no file: $file\n";

    my $parser = HTML::TokeParser->new($file)
        || die "Couldn't open $file: $!\n";

    my @in_tag;          # stack of the elements we are currently inside
    my $renewed = '';

    while ( my $token = $parser->get_token ) {
        if ( $token->[0] eq 'S' ) {
            # start tag: emit its original text, attributes untouched
            $renewed .= $token->[4];
            push @in_tag, $token->[1]
                unless grep { $token->[1] eq $_ }
                    qw( br hr link img meta input ); # non-closing tags
        }
        elsif ( $token->[0] eq 'E' ) {
            $renewed .= $token->[2];
            pop @in_tag;
        }
        elsif ( $token->[0] eq 'PI' ) {
            # processing instructions open no element, so leave the stack alone
            $renewed .= $token->[2];
        }
        elsif ( $token->[0] eq 'C' or $token->[0] eq 'D' ) {
            # comments and declarations (e.g. the doctype) pass through untouched
            $renewed .= $token->[1];
        }
        elsif ( $token->[0] eq 'T' ) {
            my $txt = $token->[1];
            if ( grep /^(?:code|pre|script|style|textarea)$/, @in_tag ) {
                # literal content: leave exactly as written
                $renewed .= $txt;
            }
            else {
                my $tmp = $txt;
                for my $code ( @Entity_Transform ) {
                    $code->($tmp);
                }
                $renewed .= $tmp;
            }
        }
    }

    print $renewed, "\n";

    exit 0;
    #==============================================================

Discussion

This is a great utility but it is a simplistic one. Still, an algorithm that works for 99% of given cases is a good one. This one will fail in certain places because it’s not looking for delimited quotes. It is therefore not smart enough to know what to do with open contractions like 'cause and 'Burque, and it picks the wrong quote for them (an opening quote where an apostrophe belongs). It also fails when we want the original symbols, as in, “I am 6'2" and change.”

You can see in the code that we need to keep track of which tags we’ve descended into. If we fix quotes within an HTML tag, we’ll break it. Fixing them in a style declaration will break the CSS. If we fix quotes in comments, we’ll break SSIs. If we fix quotes in scripts, we’ll break them. If we fix quotes in pre, code, or textarea tags, we’ll change the literal meaning of demonstrated code. So we skip the attributes of tags altogether and avoid messing with the content of pre|code|script|textarea|style tags.

A fully robust solution might make use of recursive regexes or Text::Balanced, but it would be “breakable” by bad user input (incorrect punctuation), and the one above is much simpler and works quite well as is.
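
The failure cases described in the discussion are easy to reproduce. What follows is a minimal sketch, not the script itself: it copies only the single- and double-quote rules into a hypothetical helper named smarten() and runs them on plain strings instead of parsed HTML. It assumes the rules emit numeric character references.

```perl
use strict;
use warnings;

# smarten() is a hypothetical helper for illustration only. It applies
# the single- and double-quote rules in the same order as the script,
# but to a plain string. Numeric character references are an assumption.
sub smarten {
    my ($text) = @_;
    for ($text) {
        s/(\s|\A)'/$1&#8216;/g;        # opening single: after whitespace or at start
        s/(?<!\w)'(?=\w)/&#8216;/g;    # opening single: before a word
        s/(?<!\s)'/&#8217;/g;          # closing single or apostrophe
        s/'(?=\s|\z)/&#8217;/g;        # any single quote still left before whitespace or end
        s/"(?=\w)/&#8220;/g;           # opening double: before a word
        s/(?<!\s)"/&#8221;/g;          # closing double
        s/(\A|\s)"/$1&#8220;/g;        # opening double: after whitespace or at start
    }
    return $text;
}

my @samples = (
    q{I've always said so.},     # ordinary contraction
    q{'cause I said so.},        # open contraction
    q{I am 6'2" and change.},    # feet and inches
);

print smarten($_), "\n" for @samples;
```

The first sample comes out as I&#8217;ve, which is what we want. The second comes out as &#8216;cause, an opening quote where an apostrophe belongs, because a quote at the start of the text always looks like an opener. The third comes out as 6&#8217;2&#8221;, although the plain ' and " were wanted there.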
Text, original code, fonts, and graphics ©1990-2009 Ashley Pond V.