Here is the problem I am trying to solve:  all the statistics for my Web
site are stored by my ISP in a directory, one file per day.  Each file is
compressed and named, for example, www.20030915.gz.

I want to write a log analyzer that will make it easy for me to collect
various statistics and still be extensible so that I can add more monitoring
objects as time goes by.  Right now, here are some examples of the numbers
I’d like to see:

  • Number of hits on my site.
  • For my weblog, number of HTML and RSS hits.
  • The list of referrers for, say, the past three days.
  • The number of EJBGen downloads each day.
  • The keywords typically used on search engines to reach my site.

Of course, it should be as easy to obtain totals per month or even per year
if needed.

The idea is the following:  when the script is run, it will go
through all the compressed files and build an object representation of each
file and of each line.  It will then invoke each listener with two pieces of
information: a Date and a LogLine.  Each listener is free to compute its
statistics and store them for the next phase.
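
For reference, here is what a LogLine might look like: a minimal sketch that
assumes the ISP logs use the standard Apache combined format (the field names
are my own; only command is used by the listeners shown later):

class LogLine
  attr_reader :host, :command, :status, :referrer, :agent

  # Sketch only: parses one line of an Apache combined-format log, e.g.
  # 1.2.3.4 - - [15/Sep/2003:10:00:00 -0700] "GET /index.html HTTP/1.1" 200 5432 "http://example.com/" "Mozilla/4.0"
  def initialize(rawLine)
    if rawLine =~ /^(\S+) \S+ \S+ \[[^\]]+\] "([^"]*)" (\d+) \S+ "([^"]*)" "([^"]*)"/
      @host, @command, @status = $1, $2, $3.to_i
      @referrer, @agent = $4, $5
    end
  end
end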

Once the data gathering is complete (the back-end), it’s time to present the
information.  There are several ways to achieve that goal, but for now I’ll
just make sure that the back-end and the front-end are decoupled.  I envision
one class, View, which will be passed all the gathered information and will
generate the appropriate HTML.

So first of all, we have the class LogDir, which encapsulates the directory
where my log files are stored.  Using the convenient "backtick" operator,
it is fairly easy to invoke gzip on each compressed file and store the result
in a LogFile object, which in turn contains a list of LogLines.
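
LogFile itself is not shown here; a plausible sketch, assuming gzip is on the
path and that the logLines accessor is what processLogFiles expects below:

class LogFile
  attr_reader :logLines

  # Sketch only: the backtick operator runs the command and returns its
  # standard output, so decompressing the file is a one-liner.
  def initialize(fileName)
    content = `gzip -dc #{fileName}`
    @logLines = content.split("\n").collect { |l| LogLine.new(l) }
  end
end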

LogDir then calls all the listeners with the following method:


def processLogFiles
  @files.each { |fileName|
    sf = LogFile.new(fileName)
    sf.logLines.each { |l|
      @lineListeners.each { |listener|
        listener.processLine(fileNameToDate(fileName), l)
      }
    }
  }
end # processLogFiles
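
The fileNameToDate helper is not shown; a minimal version, assuming the
www.YYYYMMDD.gz naming convention described at the top, could look like this:

require 'date'

# Sketch only: turns "www.20030915.gz" into a Date object.
def fileNameToDate(fileName)
  if File.basename(fileName) =~ /(\d{4})(\d{2})(\d{2})/
    Date.new($1.to_i, $2.to_i, $3.to_i)
  end
end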

The main loop is fairly simple:


ld = LogDir.new(LOG_DIR)
ld.addLineListener(ejbgenListener = EJBGenListener.new)
ld.addLineListener(weblogListener = WeblogListener.new)
ld.addLineListener(referrerListener = ReferrerListener.new)
ld.addLineListener(searchEngineListener = SearchEngineListener.new)
ld.processLogFiles

The last line is what causes LogDir to start and invoke all the listeners.
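
Incidentally, this is why the listeners are kept in local variables:  once
processLogFiles returns, each one holds its accumulated data, and handing it
to the front-end could look something like the sketch below (View’s exact
interface is a placeholder here; the HTML generation is the topic of the next
entry):

# Hypothetical hand-off, only to illustrate the back-end/front-end split.
view = View.new(ejbgenListener.stats, weblogListener.stats,
                referrerListener.stats, searchEngineListener.stats)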

For example, here is the EJBGenListener.  All it needs to do is see if
the HTTP request includes "ejbgen-dist.zip" and increment a counter if it does.
The overall result is a Hash of counts indexed by the date (as a string):


class EJBGenListener
  def initialize
    @ejbgenCounts = Hash.new(0)
  end

  def processLine(date, line)
    if line.command =~ /ejbgen-dist\.zip/
      @ejbgenCounts[date.to_s] += 1
    end
  end

  def stats
    @ejbgenCounts
  end
end # EJBGenListener

The only thing worth noticing is that
the Hash constructor can take a parameter which represents the default value of
each bucket (0 in this case).
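
To make the default value concrete, here is a small illustration (the rollup
at the end is just one way to get the per-month totals mentioned earlier,
taking advantage of the fact that Date#to_s produces keys like "2003-09-15"):

counts = Hash.new(0)
counts["2003-09-15"]        # => 0: a missing key returns the default
counts["2003-09-15"] += 1   # no need to initialize the bucket first

# Per-month totals: group the daily counts by their "YYYY-MM" prefix.
monthly = Hash.new(0)
counts.each { |day, n| monthly[day[0, 7]] += n }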

Ruby’s terseness is a real pleasure to work
with.  For example, I need to run some listeners on the three most recent
files of the directory (which obviously change every day).  Here is the
relevant Ruby code:


Dir.new(dir).entries.sort.reverse.delete_if { |x| !(x =~ /gz$/) }[0..2].each { |f|
  # do something with f
}

Compare this with the number of lines needed in Java…

So far, the code is mundane and very straightforward, not very different from
how you would program it in Java.  In the next entry, I will tackle the
front-end (HTML generation) because this is really the point I am trying to make
with this series of articles.