Catalyst Model #6: Log file model–Apache access log

Published · Monday, 20 July 2009 (Updated · 5 May 2010)

Last time, doing stock quotes, we introduced a lot. Two views. Templates. A bunch of dependencies. A model which needed a helper to handle legacy data structures which we put into the controller to make things easier.

Time to relax back into it… well, okay, no. You made it this far. You can stand more. Let’s get nuts and hit some modern Perl.

What is modern Perl? It’s a little subjective but it’s basically Perl that takes itself seriously. strict and warnings are on. You reach for seasoned, sophisticated CPAN packages before coding up something on a lark. You use Perl 5.10+ with an eye toward Perl 6, and embraces new features and higher order concepts. You test.

We’ll be forgoing that last part. It’s beyond the scope of this series. It is the single best thing you can do to keep production code on track, on budget, and shake out bad design choices quickly though so find your way there. We recommend Perl Testing: A Developer's Notebook if you want an incisive treatment of the topic.

The modern things we’ll be adopting in this example are named regular expression captures, feature (given/when), and Moose. We’re also going to allude to DBIx::Class without using it.

How’s that? DBIx::Class has a rather nice query interface. We’re building a log model we’ll need to query. We’ll adopt some of DBIC’s API to do it. Riding another implementation like this buys you code that’s already half-mastered by users and helps avoid making dead-end choices because you don’t have the experience gathered by those who already solved most of the problems.

We want to be able to read our log file backwards because we want to be able to use order_by in queries and the interesting part of logfiles tends to be the most recent entries. So we will bring in File::ReadBackwards. For the forward reading we’ll use IO::File so that we have another object oriented file interface. This will allow the two to share most of their model code with little fuss.

Moose will already be in if you have Catalyst 5.8. Maybe not if you’re on 5.7.

Install the required modules

cpan Moose
cpan File::ReadBackwards
cpan IO::File

To some degree this entire model hangs on the definition of the log parser we’ll provide. Named capture regular expressions are a new feature in Perl. It is quite powerful and, if I may, proof that Perl is actually still kicking ass, Catalyst or not, no matter what the exaggerations presented in the news.

The parser definition to drive the log model

Let’s look at the config entry in myapp.yml first. We’re doing two things. Telling the model where the log is—log—and telling it how to partition and name the variables found in each line—rx. We’re using an Apache log file just because it’s well-known and common. You can switch this model to another log format by changing the regular expression. I have a put a truncated, IP-scrambled Apache log into my own MyApp’s etc dir. You can do your own—if it’s in the same format or you could adjust the rx shown—or you can try the one that will be inside the test app when I post all the code next week. It’s here: access-log.tar.gz (1.2MB).

Let’s color code the regular expression and a line from the log it’s matching against because even when this stuff is straightforward it can be a trick to read.

Model::Log:
  log: __path_to(etc/access.log)__
  rx: !!perl/regexp |-
    (?x:\A
      (?<ip>\S+)     \s
      (?<identd>\S+) \s
      (?<htuser>\S+) \s
      \[(?<day>\d+)/(?<month>\w+)/(?<year>\d+):(?<hour>\d\d):(?<minute>\d\d):(?<sec>\d\d) \s
      (?<offset>[^]]+)\]   \s
      "(?<method>[A-Z]+)   \s
      (?<request>\S+)      \s
      (?<protocol>\S+)"    \s
      (?<status>\d+)       \s
      (?<size>\d+)         \s
      "(?<referer>[^"]+)?" \s
      "(?<agent>(?:\\"|[^"])+)"
    \s*\z)

Regular expressions can be difficult to read but I think you’ll agree that even without comments, which are possible to inline in Perl regexes, that is fairly clear what it’s looking for. Let’s take a look at a sample line from a typical, but by no means guaranteed, Apache access log.

86.59.165.48 - - [24/Jun/2009:21:17:11 -0700] "GET /ddx/l/c.html HTTP/1.1" 200 16708 "http://www.google.com/search?client=safari&rls=en&q=scene%20words%20that%20start%20with%20a%20c&ie=UTF-8&oe=UTF-8" "Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US) AppleWebKit/530.17 (KHTML, like Gecko) Version/4.0 Safari/530.17"

Now, the rx applied to the log file line gives us a capture—inside the special %+ hash—that looks like this–

{
 protocol => 'HTTP/1.1',
 hour => '21',
 identd => '-',
 status => '200',
 month => 'Jun',
 ip => '86.59.165.48',
 size => '16708',
 day => '24',
 method => 'GET',
 referer => 'http://www.google.com/search?client...',
 offset => '-0700',
 request => '/ddx/l/c.html',
 minute => '17',
 sec => '11',
 htuser => '-',
 agent => 'Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US)...',
 year => '2009'
}

If you don’t think that’s hot, not even Marvin Gaye could help you.

Note: the IP addresses in the demo log and the examples here are randomized noise. If they happen to end up as real addresses, it’s coincidence and not an actual visitor’s address.

Create the new model

./script/myapp_create.pl model Log
emacs lib/MyApp/Model/Log.pm
package MyApp::Model::Log;
use parent "Catalyst::Model";
use File::ReadBackwards;
use IO::File;
use Carp;
use Moose;
no warnings "uninitialized";
use feature "switch";

has "log" => 
    is => "rw",
    isa => "Str";

has "rx" =>
    is => "ro",
    isa => "Regexp";

my %ops = map {; $_ => 1 } qw( !~ =~ == != ne );

sub search {
    my ( $self, $params, $opts ) = @_;
    my ( $io, $read_method );
    if ( $opts->{order_by} eq "backwards" )
    {
        $io = File::ReadBackwards->new( $self->log );
        $read_method = "readline";
    }
    else
    {
        $io = IO::File->new( $self->log, "r" );
        $read_method = "getline";
    }
    $io || croak "Couldn't open '", $self->log, "' for reading: $!";

    my @match;

  LINE:
    while ( my $line = $io->$read_method )
    {
        last LINE if $opts->{rows} and @match >= $opts->{rows};
        $line =~ $self->rx;
        my $keep;
        for my $key ( keys %{$params} )
        {
            croak "No such key '$key': $line`" unless exists $+{$key};
            if ( ref $params->{$key} )
            {
                my ( $op, $target ) = %{ $params->{$key} };
                croak "Op '$op' not allowed" unless $ops{$op};
                my $keep;
                given ($op) {
                    when ('=~') {
                        $keep =  $+{$key} =~ /$target/;
                    }
                    when ('!~') {
                        $keep =  $+{$key} !~ /$target/;
                    }
                    default {
                        $keep = eval "$+{$key} $op $target";
                    }
                }
                next LINE unless $keep;
            }
            else
            {
                my $target = $params->{$key};
                next LINE unless eval "$+{$key} eq $target";
            }
        }
        push @match, { %+ };
    }
    return wantarray ? @match : \@match;
}

__PACKAGE__->meta->make_immutable;

1;

The Moose pieces that should stand out right away are the log and rx. We get easy slots in the object—accessors and setters—and we get some cheap validation too. Trying to set rx to a string or variable reference will cause an error. It can only accept a regular expression. For the sake of simplicity we’re setting the log to accept a simple string (Str). In production code we’d use custom typesPath::Class::File—and maybe coercion.

The key to mimicking a subset of DBIC’s and SQL::Abstract’s API for constructing searches is in here–

my %ops = map {; $_ => 1 } qw( !~ =~ == != ne );

Those are the operations we’ll allow applied on query values. Let’s set up the controller to examine what we’ve got open to us.

emacs lib/MyApp/Controller/Visit.pm
package MyApp::Controller::Visit;
use strict;
use warnings;
use parent 'Catalyst::Controller';

__PACKAGE__->config( rows => 30 );

sub auto :Private { $_[1]->stash( template => "visit/index.tt" ) }

sub index :Path :Args(0) {
    my ( $self, $c ) = @_;
    $c->stash( rx => $c->model("Log")->rx );
}

sub last :Local {
    my ( $self, $c, $rows ) = @_;
    $rows ||= $self->{rows};
    my @match = $c->model("Log")
        ->search({},
                 {
                   order_by => "backwards",
                   rows => $rows,
                 });
    $c->stash( matches => \@match );
}

sub not_ok :Local {
    my ( $self, $c, $rows ) = @_;
    $rows ||= $self->{rows};
    my @match = $c->model("Log")
        ->search({ status => { "!~" => qr/^[23]/ } },
                 { rows => $rows });
    $c->stash( matches => \@match );
}

sub status :Local Args(1) {
    my ( $self, $c, $status ) = @_;

    my @match = $c->model("Log")
        ->search({ status => $status });
    $c->stash( matches => \@match );
}

sub ext :Local {
    my ( $self, $c, $ext, $rows ) = @_;
    $rows ||= $self->{rows};
    my @match = $c->model("Log")
        ->search({ request => { '=~' => qr/\.\Q$ext\E\z/ } },
                 { rows => $rows ,
                   order_by => "backwards"
               });
    $c->stash( matches => \@match );
}

sub robots :Local Args(0) {
    my ( $self, $c ) = @_;
    my $match = $c->model("Log")
        ->search({ status => 200,
                   agent => { "=~" => qr/(?:ro)?bot\b/ },
                 });
    $c->stash( matches => $match );
}

1;

The API of SQL::Abstract/DBIx::Class can be summed up as two hash refs: first is columns to values to search on and the second is query modifiers like order_by and rows which are the only ones we implement above.

There is a default row count to return in the model–

__PACKAGE__->config( rows => 30 );

Which can be overridden in config (as a default) and overridden per method call by supplying it as an argument to the second hash ref.

URI dispatch map for Controller::Visit

Here’s a map of what the controller will do. This would be better as Pod by the way but in the interests of shorter code samples—and getting these together in time—that’s deferred.

  • /visit
    • Splash/index. Displays the regular expression from the config file.
  • /visit/last/[rows]
    • Pulls all records backwards. Rows argument is optional to override default.
  • /visit/not_ok/[rows]
    • Searches with a regular expression against the status “column.” Returns any record with a status which doesn’t start with a 2 or a 3. Rows is optional.
  • /visit/status/[code]
    • Finds the requests with the matching HTTP status code. Code is required. /visit/status will get a 404.
  • /visit/ext/[extention]/[rows]
    • Queries against the request “column” to match for file extensions. The extension argument is required. Rows is optional.
    • E.g., /visit/ext/html/100
    • /visit/ext/png
  • /visit/robots
    • Searches against the user agent “column” for qr/(?:ro)?bot\b/. Takes no rows argument. Returns all found records.

Template visit/index.tt

You can see in auto above that all the methods will use a common template.

emacs root/alloy/visit/index.tt
<h1>#7, Controller::Visit + Model::Log</h1>
[% IF rx %]
  <h3>This is the log parsing regex set in the <a href="[% c.uri_for("/src/myapp.yml").path %]">config file</a></h3>
  <h4>Note, names of captures are lost due to handling of compiles regular expressions</h4>
  <pre>[% rx | html %]</pre>
[% END %]

[% IF matches.size() %]

<div id="viewtrack">
  <div class="count" style="border:0">
    <h3>IP</h3>
  </div>
  <div class="path" style="border:0">
    <h3>Request</h3>
  </div>
  <div class="visit" style="border:0">
    <h3>Status &middot; agent</h3>
  </div>
<br style="clear:both"/>
[% FOR match IN matches %]
  <div class="count">
    [% match.ip %]
  </div>
  <div class="path">
    [% match.method %] [% match.request %]
  </div>
  <div class="visit">
    <b>[% match.status %]</b> &middot; [% match.agent.truncate(40, '&hellip;') %]
  </div>
<br style="clear:both"/>
[% END %]
</div>

[% END %]

Screen shot /visit/last/10

Last 10 visits from the log

Start her up if she isn’t already running and see what you’ve got

./script/myapp_server.pl -d -r -p 3000

Tune in tomorrow for #7: Page view counter/tracker.



digg stumbleupon del.icio.us reddit Fark Technorati Faves

« Catalyst Models Intermission—MyApp source code browser · Catalyst Model #7: Page view counter/tracker »
« 10 Catalyst models in 10 days1 »