Tabulog is a flexible, powerful framework for parsing log files, specifically designed for web logs (such as the access.log files created by Apache), with the final output being in a tabular format.
Parsing logs with Tabulog requires two things: a template, and a list of “parser classes.”
Inspired by Python’s Jinja2 templates, Tabulog templates use a human-readable format that mixes literal text with code. “Code” is used loosely here: as you will see, the ‘code’ in our templates is not actually R code.
The easiest place to start is with an example. Let’s say you have a simple log file that looks like this:
10.0.0.8 - - [2019-01-01:10:58:12 -0500] "https://mysite.com/index.html"
173.28.102.33 - - [2019-01-01:10:58:25 -0500] "https://mysite.com/login"
...
We can see that each line of the log file follows the same format, specifically:
<ip address> - - [<datetime>] "<url>"
The Tabulog template to parse such a file looks like this:
{{ ip ip_address }} - - [{{ date date_time }}] "{{ url URL }}"
Each set of curly brackets represents an instance of a class, and is declared in the C style of class var_name. So in the template above, {{ ip ip_address }} is really saying “In this spot, look for an ip, and call it ip_address.”
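To make the class var_name convention concrete, the following sketch pulls the class and name out of each field of a template using base R only. It is an illustration of how the syntax reads, not Tabulog's own implementation; the variable names (fields, field_table) are made up for this example.

```r
# A template using the field syntax described above
template <- '{{ ip ip_address }} - - [{{ date date_time }}] "{{ url URL }}"'

# Find every {{ class var_name }} field in the template
field_pattern <- "\\{\\{\\s*\\w+\\s+\\w+\\s*\\}\\}"
fields <- regmatches(template, gregexpr(field_pattern, template, perl = TRUE))[[1]]

# Drop the braces, then split each field into its class and its name
inner <- trimws(gsub("[{}]", "", fields))
parts <- strsplit(inner, "\\s+")

field_table <- data.frame(
  class = vapply(parts, `[`, character(1), 1),
  name  = vapply(parts, `[`, character(1), 2),
  stringsAsFactors = FALSE
)
field_table
```

Each row of field_table pairs a class (what to look for) with the column name the matched text will receive.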
You may ask: how does Tabulog know what an ip address is? This is where parser classes come in.
In order to know what to look for in each field of our template, Tabulog must know what a given class should look like. For this we give it a parser class, which is really just a wrapper object for a regular expression.
In the current example with the ip address, we would tell Tabulog that the ip class is represented by the Perl regular expression [0-9]{1,3}(\.[0-9]{1,3}){3}. When Tabulog parses the log file, it looks for a match on that expression in that spot, and raises a warning if it doesn’t find one.
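The ip expression can be tried on its own with base R before handing it to Tabulog. This is only a sketch of the matching step: the candidate strings are invented, and the ^ and $ anchors are added here so that the whole string must match.

```r
# The ip expression from the text, written as an R string (backslashes doubled)
ip_regex <- "[0-9]{1,3}(\\.[0-9]{1,3}){3}"

candidates <- c("10.0.0.8", "173.28.102.33", "not-an-ip")

# Anchor the pattern so a partial match inside a longer string does not count
matched <- grepl(paste0("^", ip_regex, "$"), candidates, perl = TRUE)
matched
```

Note that the expression checks only the shape of the address (four dot-separated groups of one to three digits), so a string like 999.999.999.999 would also match.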
Once a field is parsed and read into R, you may want to further transform or format the text. For example, you may want to cast an integer field using the as.integer function. This is achieved using formatters.
When a parser object is created, an optional formatter can be passed. This is simply a function that takes one argument (a character vector) and returns a vector of the same length in the desired format. For example, the builtin int parser uses as.integer as its formatter, as its printed form shows:
## Parser: int
## -----------
## Matches:
## [0-9]+
## Formatter:
## .Primitive("as.integer")
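The contract for a formatter (character vector in, vector of the same length out) can be checked with plain R. In this sketch, to_kilobytes is a hypothetical custom formatter invented for illustration; it is not part of Tabulog.

```r
# Raw field values always arrive as character
raw <- c("200", "404", "500")

# A builtin-style formatter: cast to integer
as_int <- as.integer(raw)

# A hypothetical custom formatter: bytes (as text) to kilobytes (numeric)
to_kilobytes <- function(x) as.numeric(x) / 1024
kb <- to_kilobytes(c("1024", "2048"))

# Both return one value per input element
length(as_int) == length(raw)
```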
Tabulog as a framework is designed to be language-agnostic, so the ideas of templates and parser classes here will be portable to any other versions of the package made for other languages. Formatters, however, are language-specific and must be implemented in the language being used.
Let’s again say you have the example logs in the file accesslog.txt.
10.0.0.8 - - [2019-01-01:10:58:12 -0500] "https://mysite.com/index.html"
173.28.102.33 - - [2019-01-01:10:58:25 -0500] "https://mysite.com/login"
We first define the template as before.
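A minimal sketch of this step follows. The variable names template and log_file are the ones used in the parse_logs call later on; the path simply points at the accesslog.txt file mentioned above and is assumed to be relative to the working directory.

```r
# The template, stored as an ordinary R string.
# Single quotes are used so the double quotes around the url need no escaping.
template <- '{{ ip ip_address }} - - [{{ date date_time }}] "{{ url URL }}"'

# Path to the example log file
log_file <- 'accesslog.txt'
```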
We then need to define our classes. ip and url are builtins with the package, but dates come in a variety of formats so we must explicitly define ours here. Note you can see all builtins using default_classes().
date_parser <- parser(
'[0-9]{4}\\-[0-9]{2}\\-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}[ ][\\-\\+][0-9]{4}',
function(x) lubridate::as_datetime(x, format = '%Y-%m-%d:%H:%M:%S %z'),
name = 'date'
)
date_parser
## Parser: date
## ------------
## Matches:
## [0-9]{4}\-[0-9]{2}\-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}[ ][\-\+][0-9]{4}
## Formatter:
## function (x)
## lubridate::as_datetime(x, format = "%Y-%m-%d:%H:%M:%S %z")
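Before relying on a custom parser, it can help to test its expression and format string in isolation. The sketch below uses base R only: as.POSIXct stands in for lubridate::as_datetime, the ^ and $ anchors are added for the check, and the second sample string is a deliberately malformed value.

```r
# The date expression from the parser definition above
date_regex <- '[0-9]{4}\\-[0-9]{2}\\-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}[ ][\\-\\+][0-9]{4}'

samples <- c('2019-01-01:10:58:12 -0500',  # four-digit offset: matches
             '2019-01-01:10:58:12 -500')   # three-digit offset: does not match

is_date <- grepl(paste0('^', date_regex, '$'), samples, perl = TRUE)

# Convert the matching value; %z reads the signed UTC offset
parsed <- as.POSIXct(samples[is_date],
                     format = '%Y-%m-%d:%H:%M:%S %z', tz = 'UTC')
format(parsed, '%Y-%m-%d %H:%M:%S', tz = 'UTC')
```

Because the expression requires exactly four digits after the sign, the timezone offset in the log must be written as -0500 rather than -500.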
## $ip
## Parser: ip
## ----------
## Matches:
## [0-9]{1,3}(\.[0-9]{1,3}){3}
## Formatter:
## .Primitive("(")
##
## $url
## Parser: url
## -----------
## Matches:
## (-|(?:http(s)?:\/\/)?[\w.-]+(?:\.[\w\.-]+)+[\w\-\._~:/?#[\]@!\$&\'\(\)\*\+,;=.]+)
## Formatter:
## .Primitive("(")
Both ip and url require no formatting, so their formatter is effectively the identity function: the parenthesis primitive, (, which in R returns its argument unchanged.
To get our final output in tabular format, we simply make the following call to parse_logs.
# Naming the date_parser 'date' in the list tells Tabulog to use it to parse
# the field with class 'date' in the template.
parse_logs(readLines(log_file), template, classes = list(date = date_parser))
## ip_address date_time URL
## 1 10.0.0.8 2019-01-01 15:58:12 https://mysite.com/index.html
## 2 173.28.102.33 2019-01-01 15:58:25 https://mysite.com/login
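For readers curious about what such a call has to accomplish, the same table can be built by hand in base R. This is a rough sketch of the idea, not how Tabulog is implemented: the line expression is written out manually, the url class is simplified to "anything but a double quote", and as.POSIXct stands in for the lubridate formatter.

```r
log_lines <- c(
  '10.0.0.8 - - [2019-01-01:10:58:12 -0500] "https://mysite.com/index.html"',
  '173.28.102.33 - - [2019-01-01:10:58:25 -0500] "https://mysite.com/login"'
)

# One capture group per template field; literal [ ] are escaped
line_regex <- paste0(
  '^([0-9]{1,3}(?:\\.[0-9]{1,3}){3}) - - ',
  '\\[([0-9]{4}-[0-9]{2}-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2} [-+][0-9]{4})\\] ',
  '"([^"]+)"$'
)

# regexec() returns the full match followed by each capture group
matches <- regmatches(log_lines, regexec(line_regex, log_lines, perl = TRUE))
fields  <- do.call(rbind, lapply(matches, function(m) m[-1]))

# Apply a formatter to the date column; the other columns stay as text
parsed <- data.frame(
  ip_address = fields[, 1],
  date_time  = as.POSIXct(fields[, 2],
                          format = '%Y-%m-%d:%H:%M:%S %z', tz = 'UTC'),
  URL        = fields[, 3],
  stringsAsFactors = FALSE
)
parsed
```

The template and parser classes replace the hand-written line_regex, and the formatters replace the manual column conversions.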
Note that we only had to pass our custom class date. The builtin classes ip and url were included by default.
A more elegant and portable way of completing this task would be to define the template and the custom class in the same file, which can be ported to other Tabulog libraries in other languages, leaving only the formatters to be defined in the R script.
First, we define the template and the classes in a YAML file:
template: '{{ ip ip_address }} - - [{{ date date_time }}] "{{ url URL }}"'
classes:
  date: '[0-9]{4}\-[0-9]{2}\-[0-9]{2}:[0-9]{2}:[0-9]{2}:[0-9]{2}[ ][\-\+][0-9]{4}'
Next, we define the formatters for each of our classes. Here we only have one, but we still put it in a named list, with the name matching the name of the class in the template file.
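Such a list might look like the sketch below. The variable name formatters is an assumption made for this example; the function body reuses the lubridate call from the date parser defined earlier, so the lubridate package must be installed for the formatter to run.

```r
# A named list of formatters; the name 'date' matches the class
# name used in the template file
formatters <- list(
  date = function(x) lubridate::as_datetime(x, format = '%Y-%m-%d:%H:%M:%S %z')
)
```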
Finally, we make one call to parse_logs_file.
## ip_address date_time URL
## 1 10.0.0.8 2019-01-01 15:58:12 https://mysite.com/index.html
## 2 173.28.102.33 2019-01-01 15:58:25 https://mysite.com/login
The only characters that need to be escaped in templates are curly braces (even single ones). Usually a backslash, as in '\{', should be sufficient, but the HTML-style character escapes for '{' and '}' are also included as valid syntax for any edge cases that may arise.