Hiram Software Blog


Using ANTLR to parse S3 logfiles

Hiram Software recently published a small but immensely useful (for us) tool to parse S3 logfiles. We do a lot of projects on the JVM, often in Scala, and it surprised us that there was no readily available tool to do this "at scale."

Below is the writeup of JSalParser that one of my team members presented to the team as justification for "building" the tool.

‐ Hiram

Github: https://www.github.com/hiramsoft/jsalparser

JSalParser

The Java Server Access Logs Parser (i.e. JSalParser) parses Extended Log Format files generated by Apache HTTP Server, AWS S3, or AWS CloudFront (to name a few) into Java POJOs (Plain Old Java Objects).

No bags or maps or hashes or arrays of attributes here. Dates are converted to JODA DateTime objects. Numbers are, well, numbers. You could say this library deserializes the Extended Log Format, but I think that may be too generous for a simple parser.

In short, give JSalParser a log file as input and you will get back a Java object with as many members filled in as possible as output. The rest is up to you.

Synopsis

Parse an S3 log line-by-line

    String content = "1f000000000c6c88eb9dd89c000000000b35b0000000a5 www.example.com [27/Aug/2014:20:20:05 +0000] 192.168.0.1 - BFE596E2F4D94C8F WEBSITE.GET.OBJECT media/example.jpg \"GET /media/example.jpg HTTP/1.1\" 304 - - 27553 202 - \"http://www.example.com/page.html\" \"Mozilla/5.0 (iPhone; CPU iPhone OS 7_1_2 like Mac OS X) AppleWebKit/537.51.2 (KHTML, like Gecko) Version/7.0 Mobile/11D257 Safari/9537.53\" -";
    List<S3LogEntry> entries = JSalParser.parseS3Log(content);

    long TenMegabytes = 10000000L;

    for(int i=0;i<entries.size();i++) {
        S3LogEntry entry = entries.get(i);

        // Notice how the numbers are numbers, no additional parsing needed
        if(entry.getObjectSize() > TenMegabytes)
        {
            System.out.println(entry.getTime());

            // getTime() returns a JODA DateTime object,
            // so Java prints:
            // 2014-08-27T20:20:05.000+00:00
        }
    }

Parse a CloudFront log gzip file using a visitor for streaming efficiency

    // Assume gzipFileStream is some kind of java.io.InputStream
    // You got it either from a FileInputStream (local file on disk), S3, or anywhere else that returns InputStreams

    java.util.zip.GZIPInputStream gzipInputStream = new java.util.zip.GZIPInputStream(gzipFileStream);

    // Process records inline by passing a visitor to effectively get "streaming" log processing
    // The only two things you need are an InputStream and a visitor
    // JSalParser is Thread-Safe
    JSalParser.parseCloudFrontLog(gzipInputStream, new ICloudFrontLogVisitor() {
        int count = 0;
        @Override
        public void accept(CloudFrontWebLogEntry entry) {
            System.out.print("Processing entry #" + (count++) + " from " + entry.getDateTime() + " ");
            // Date is returned as a JODA DateTime object.

            // Numbers are surfaced as Ints and Longs
            if(entry.getServerToClientStatus() == 200)
            {
                System.out.println("OK");
            }
            else
            {
                System.out.println("NOT_OK");
            }

            // You will get:
            /***********
             Processing entry #0 from 2014-08-28T04:48:38.000Z OK
             Processing entry #1 from 2014-08-28T04:48:38.000Z OK
             Processing entry #2 from 2014-08-28T04:49:23.000Z NOT_OK
             Processing entry #3 from 2014-08-28T04:48:37.000Z OK
             Processing entry #4 from 2014-08-28T04:48:38.000Z NOT_OK
             Processing entry #5 from 2014-08-28T04:48:38.000Z OK
             ***********/
        }
    });

Both S3 and CloudFront support accepting Strings and InputStream objects.

Both S3 and CloudFront support the visitor pattern for streaming efficiency.

Most Common Scenarios

These are the most common scenarios for working with S3 or CloudFront log files.

How do I read logs directly from AWS S3?

The Java AWS SDK ( http://aws.amazon.com/sdk-for-java/ ) is the best way to read and write objects on S3. The quick-and-dirty way is to construct an AmazonS3 instance and call getObject().

Some pseudo-code:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3ObjectInputStream;
import com.amazonaws.services.s3.model.S3Object;

public void quickAndDirty(){

    AmazonS3 s3 = //... left as an exercise

    S3Object obj = s3.getObject("my-bucket-name", "/the-log-file-key-name");
    S3ObjectInputStream content = obj.getObjectContent();
    // S3ObjectInputStream is a subclass of InputStream

    List<S3LogEntry> entries = JSalParser.parseS3Log(content);
    // ... continue on like normal
}

Read directly from a gzip file:

Java has a built-in GZIPInputStream class ( java.util.zip.GZIPInputStream ) that is a subclass of InputStream.

If the files are local, pass in a FileInputStream instance:

import java.io.File;
import java.io.FileInputStream;

File gzipFile = new File("path-to-your-local-file.gz");
FileInputStream gzipFileStream = new FileInputStream(gzipFile);
java.util.zip.GZIPInputStream gzipInputStream = new java.util.zip.GZIPInputStream(gzipFileStream);

List<CloudFrontWebLogEntry> entries = JSalParser.parseCloudFrontLog(gzipInputStream);
// .. continue on like normal

Or, combine with the AWS SDK to read directly from S3

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.model.S3ObjectInputStream;
import com.amazonaws.services.s3.model.S3Object;

AmazonS3 s3 = //... left as an exercise

S3Object obj = s3.getObject("my-bucket-name", "/the-log-file-key-name.gz");
S3ObjectInputStream s3Stream = obj.getObjectContent();
java.util.zip.GZIPInputStream gzipInputStream = new java.util.zip.GZIPInputStream(s3Stream);

List<CloudFrontWebLogEntry> entries = JSalParser.parseCloudFrontLog(gzipInputStream);
// .. continue on like normal

How do I process log files as a stream?

JSalParser takes as an optional, second parameter either an ICloudFrontLogVisitor or an IS3LogVisitor instance. As the parser fully reads each entry, it calls the respective "accept" methods on the visitors. Your code is free to implement any business logic as needed.

Using the visitor means the parser only has to hold a couple of log lines in memory at a time, which in theory makes it possible to process extremely large files.
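To make the memory argument concrete, here is a minimal sketch of the same callback idea using only the JDK. It is not JSalParser's code: the LineVisitor interface and the visitLines method are hypothetical stand-ins for IS3LogVisitor / ICloudFrontLogVisitor and the parse methods, and the "entry" handed to the visitor is just the raw line rather than a typed POJO.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

final class StreamingVisitorSketch {

    /** Hypothetical stand-in for IS3LogVisitor / ICloudFrontLogVisitor. */
    interface LineVisitor {
        void accept(String rawLine);
    }

    /**
     * Reads the stream one line at a time and hands each non-empty line to the
     * visitor. Apart from the reader's internal buffer, only the current line
     * is held in memory. Returns the number of lines visited.
     */
    static int visitLines(InputStream in, LineVisitor visitor) {
        int visited = 0;
        try {
            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.isEmpty()) {
                    continue; // nothing to hand over for blank lines
                }
                visitor.accept(line);
                visited++;
            }
        } catch (IOException e) {
            throw new IllegalStateException("Could not read log stream", e);
        }
        return visited;
    }

    public static void main(String[] args) {
        byte[] sample = "first entry\nsecond entry\n".getBytes(StandardCharsets.UTF_8);
        final int[] totalChars = {0};
        int visited = visitLines(new ByteArrayInputStream(sample), new LineVisitor() {
            @Override
            public void accept(String rawLine) {
                totalChars[0] += rawLine.length();
            }
        });
        System.out.println(visited + " lines, " + totalChars[0] + " characters");
    }
}
```

Because the visitor sees one entry at a time, any aggregation (counts, sums, writes to a database) has to live in the visitor itself, as the totalChars accumulator does here.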

What considerations should I know about CloudFront log files?

CloudFront log files have two header lines, and these header lines describe the "schema" of the log file. S3 log files, in contrast, have no header lines, and the schema is defined in documentation.

Therefore in order to process any CloudFront log file we must first process the header entry. If you send a CloudFront log file without headers you will get only an array of untyped values.
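As a rough illustration of why the header matters, the sketch below reads the "#Fields:" line and uses it to name the tab-separated values of every following record. It is a hand-written stand-in, not the ANTLR grammar JSalParser actually uses; the class name, the parse method, and the "columnN" fallback names are all hypothetical, and values stay as Strings rather than being converted to types.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CloudFrontHeaderSketch {

    /**
     * Maps each record's tab-separated values to the names listed in the
     * "#Fields:" header line. Values without a matching header name fall
     * back to a positional name such as "column3" (a made-up convention).
     */
    static List<Map<String, String>> parse(String content) {
        List<Map<String, String>> records = new ArrayList<Map<String, String>>();
        String[] fieldNames = null;

        for (String line : content.split("\r?\n")) {
            if (line.isEmpty()) {
                continue;
            }
            if (line.startsWith("#Fields:")) {
                // Field names in the header are separated by spaces.
                fieldNames = line.substring("#Fields:".length()).trim().split("\\s+");
                continue;
            }
            if (line.startsWith("#")) {
                continue; // other header lines, e.g. "#Version: 1.0"
            }

            // Records themselves are tab-delimited; -1 keeps trailing empty values.
            String[] values = line.split("\t", -1);
            Map<String, String> record = new LinkedHashMap<String, String>();
            for (int i = 0; i < values.length; i++) {
                boolean named = fieldNames != null && i < fieldNames.length;
                record.put(named ? fieldNames[i] : "column" + i, values[i]);
            }
            records.add(record);
        }
        return records;
    }

    public static void main(String[] args) {
        String log = "#Version: 1.0\n"
                + "#Fields: date time sc-status\n"
                + "2014-08-28\t04:48:38\t200\n";
        for (Map<String, String> record : parse(log)) {
            System.out.println(record);
        }
    }
}
```

Feeding the same records in without the two header lines leaves only positional names, which mirrors the "array of untyped values" behaviour described above.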

How do I use this with Apache HTTP Server log files?

Send the Apache HTTP Server log files to the JSalParser.parseS3Log* family of methods.

What happens if the parser encounters something it doesn't recognize?

The parser puts "extra" stuff (values it doesn't expect) into the "extras" list. Order is preserved, but all values are processed as Strings.
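A small sketch of that behaviour, under the assumption of a made-up two-field schema (bucket and status): the expected fields are converted to types, and anything beyond them is appended to an extras list in its original order, untouched. The Entry class and fromFields method are hypothetical and only illustrate the idea; they are not JSalParser's classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExtrasSketch {

    /** Hypothetical entry with two known fields plus a catch-all list. */
    static final class Entry {
        String bucket;
        int status;
        final List<String> extras = new ArrayList<String>();
    }

    /** Fills the known fields by position and keeps the remainder as Strings. */
    static Entry fromFields(List<String> fields) {
        Entry entry = new Entry();
        if (fields.size() > 0) {
            entry.bucket = fields.get(0);
        }
        if (fields.size() > 1) {
            entry.status = Integer.parseInt(fields.get(1));
        }
        // Everything past the known schema is preserved in order, as Strings.
        for (int i = 2; i < fields.size(); i++) {
            entry.extras.add(fields.get(i));
        }
        return entry;
    }

    public static void main(String[] args) {
        Entry entry = fromFields(Arrays.asList("www.example.com", "304", "new-field", "42"));
        System.out.println(entry.bucket + " " + entry.status + " " + entry.extras);
    }
}
```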

What happens if a value is missing?

The POJO has null or 0 as appropriate. I usually program in Scala and am much more used to Options, but alas this is the best choice for Java 7.
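In code, the convention looks roughly like the sketch below, assuming the log marks an empty value with "-" as in the S3 example earlier. The helper names are hypothetical and not part of JSalParser.

```java
final class MissingValueSketch {

    /** Returns null for a missing ("-") value, otherwise the text unchanged. */
    static String textOrNull(String raw) {
        return (raw == null || "-".equals(raw)) ? null : raw;
    }

    /** Returns 0 for a missing ("-") value, otherwise the parsed number. */
    static long longOrZero(String raw) {
        if (raw == null || "-".equals(raw)) {
            return 0L;
        }
        return Long.parseLong(raw);
    }

    public static void main(String[] args) {
        System.out.println(textOrNull("-"));
        System.out.println(longOrZero("-"));
        System.out.println(longOrZero("27553"));
    }
}
```

The trade-off is that a primitive 0 cannot be told apart from a value that really was zero, which is exactly the gap an Option type would close.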

What is JODA DateTime?

If you aren't familiar with JODA for DateTime, then the short answer is: The Java Date class is "broken" and JODA has the best solution. More information and eloquence is at http://www.joda.org/joda-time/

What version of Java?

Java 7. No real reason; it is simply the lowest JDK I have available to test against.

The only "advanced" feature is generics.

Maven

In Progress

Motivation

I had reason to traverse the logs generated by AWS, and I wanted to write the business logic in my favorite JVM language (Scala). Behind the scenes Hiram Software is working on popular content sites, and it's important to us to know who is viewing our content in near realtime. I searched the internet, and I could not find any parsers for the Extended Log Format written in Java. The format is nearly 20 years old, and Google couldn't find anything. Why is this? I have three theories:

  • There exists such a parser but I am unable to find it (perhaps because SLF4J and its kin dominate search results related to the Java and "logging" keywords?).
  • The parsers that do exist are not open source (either written by an engineer for a corporation or as part of a log parsing product).
  • Most people use regular expressions.

Unfortunately, I don't know how to write a regular expression to parse a server log file generically. Based on the number of StackOverflow questions (exhibit 1, exhibit 2, exhibit 3), I am not alone.

The core problem with the S3 log files is that the delimiter " " (space) can also be found in each of the values within a Quoted String. In the most general sense any field may be a - (empty value), an unquoted value such as http://www.google.com, or a quoted value such as "There may be delimiters in this string". In practice, it seems people who use regular expressions assume which fields will be quoted. Simplifying the problem with assumptions is not bad, but it trades off robustness for ease of coding. In theory the first change to the S3 server access logs will break a lot of code. And what's more, regular expressions only spit out captured Strings, which then have to be parsed into types.
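To show what "any field may be empty, unquoted, or quoted" means for a tokenizer, here is a small hand-rolled scanner. It is only an illustration of the problem, not the ANTLR grammar JSalParser uses: the class and method names are hypothetical, it treats the bracketed timestamp like a quoted value, it returns null for "-", and it does not handle escaped quotes inside a quoted string.

```java
import java.util.ArrayList;
import java.util.List;

final class LogLineTokenizerSketch {

    /** Splits one space-delimited log line into raw field values. */
    static List<String> tokenize(String line) {
        List<String> fields = new ArrayList<String>();
        int i = 0;
        int n = line.length();
        while (i < n) {
            char c = line.charAt(i);
            if (c == ' ') {
                i++; // delimiter between fields
                continue;
            }
            if (c == '"' || c == '[') {
                // Quoted or bracketed value: spaces inside belong to the value.
                char close = (c == '"') ? '"' : ']';
                int end = line.indexOf(close, i + 1);
                if (end < 0) {
                    end = n; // unterminated: take the rest of the line
                }
                fields.add(line.substring(i + 1, end));
                i = Math.min(end + 1, n);
            } else {
                // Unquoted value: runs until the next space.
                int end = line.indexOf(' ', i);
                if (end < 0) {
                    end = n;
                }
                String raw = line.substring(i, end);
                fields.add("-".equals(raw) ? null : raw); // "-" marks an empty value
                i = end;
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "www.example.com [27/Aug/2014:20:20:05 +0000] - "
                + "\"GET /media/example.jpg HTTP/1.1\" 304";
        for (String field : tokenize(line)) {
            System.out.println(field);
        }
    }
}
```

Even this version makes no assumption about which positions are quoted, but it still hands back Strings; turning them into dates and numbers is a separate step.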

The CloudFront logs improve upon S3 logs by using tabs "\t" as the delimiter. With that small change it becomes practical to split values by the single delimiter. The complication comes, instead, in the new header row. The header row "Contain[s] two header lines: one with the file-format version, and another that lists the W3C fields included in each record." Great for people. Bad for machines since the order of the fields may change from file to file. And furthermore, any time you get a file from CloudFront you have to decide if it is a "Web Distribution" file or an "RTMP Distribution" file format. There are no explicit tags in the log to indicate one or the other -- you have to parse the file to figure it out.

What about alternatives to writing my own? I'm a buy-over-build kind of guy.

I abandoned using regular expressions from the outset. I found The Buzz Media's Amazon CloudFront Log Parser as a credible alternative. It appeared to handle the CloudFront log files, but I could not use it because it did not support S3 logs and had a broken Maven repository.

I looked into using a CSV parser, but the Apache CSV parser required a header row, and the S3 log does not have such a row. There may be other CSV parsers that would have worked, but by now I was tired and felt like I was not making progress.

I fell back to what I knew: ANTLR.

Solution

JSalParser exposes a class whose static methods accept content (either a String or InputStream) and return Lists of POJOs (Plain Old Java Objects, i.e. typed bags) representing each log entry. Alternatively, you may "stream" the log files by providing a visitor that "accepts" each fully-parsed log entry.

Under the covers there is an ANTLR v4 grammar that builds up the POJOs. If you are unfamiliar with ANTLR, it is an open source parser generator that is often compared to YACC or Lex. Inside src/main/antlr are .g4 files that ANTLR compiles into Java code. This Java code handles tokenization and builds the log entries. The state tracking that The Buzz Media team had to write and maintain by hand, ANTLR handles for us in a robust manner.

Why static methods? All of the state during parsing is self-contained to the objects provided by ANTLR. So long as we instantiate new objects for each new String or InputStream, the methods are thread safe. So they are static. It feels simpler to me.
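The reasoning can be sketched without ANTLR. In the example below, a trivial LineCounter stands in for the lexer and parser objects: all mutable state lives in an instance that the static method creates fresh on every call, so concurrent callers never share anything. Every name here is hypothetical; this shows the pattern, not JSalParser's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class StatelessFacadeSketch {

    /** Stand-in for the parser objects: mutable state lives only in the instance. */
    private static final class LineCounter {
        private int lines = 0;

        int count(String content) {
            for (String line : content.split("\n")) {
                if (!line.isEmpty()) {
                    lines++;
                }
            }
            return lines;
        }
    }

    /** Static entry point: a fresh LineCounter per call, no static mutable fields. */
    static int countEntries(String content) {
        return new LineCounter().count(content);
    }

    /** Calls countEntries from several threads at once and collects the results in order. */
    static List<Integer> countConcurrently(List<String> contents, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> futures = new ArrayList<Future<Integer>>();
            for (final String content : contents) {
                futures.add(pool.submit(new Callable<Integer>() {
                    @Override
                    public Integer call() {
                        return countEntries(content);
                    }
                }));
            }
            List<Integer> results = new ArrayList<Integer>();
            for (Future<Integer> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println(countConcurrently(Arrays.asList("a\nb\n", "c\n", ""), 3));
    }
}
```

If LineCounter's counter were a static field instead, the threads would race on it and the per-call results would no longer be reliable.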