#StackBounty: #python #python-3.x #lexer Lexer for Apache logs in Python

Bounty: 100

I recently learned Python using the book by Mark Summerfield, and I am now working through a few katas of mine, as I do whenever I learn a new language. This helps me get a feel for how the language works, and to that end I wrote a simple lexer for Apache logs.

A few example log records are

64.242.88.10 - - [07/Mar/2004:16:05:49 -0800] "GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1" 401 12846
64.242.88.10 - - [07/Mar/2004:16:06:51 -0800] "GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1" 200 4523
64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291

and the program I wrote reads all values as tokens and outputs them using | as a separator, as in

64.242.88.10|-|-|07/Mar/2004:16:05:49 -0800|GET /twiki/bin/edit/Main/Double_bounce_sender?topicparent=Main.ConfigurationVariables HTTP/1.1|401|12846
64.242.88.10|-|-|07/Mar/2004:16:06:51 -0800|GET /twiki/bin/rdiff/TWiki/NewUserTemplate?rev1=1.3&rev2=1.2 HTTP/1.1|200|4523
64.242.88.10|-|-|07/Mar/2004:16:10:02 -0800|GET /mailman/listinfo/hsdivision HTTP/1.1|200|6291

The lexer could be put to more general use, though.

Here is my code; I would love to read your comments and suggestions on how to improve it. I have also raised a few specific questions below.

import collections
import io
import sys

LEXER_SKIP_WHITESPACE = 0
LEXER_READ_QUOTED_STRING = 1
LEXER_READ_BRACKETED_STRING = 2
LEXER_READ_WORD = 3

class Location:
    def __init__(self, name=None, pos=0, line=1, col=0):
        self.name = name or "<input>"
        self.line = line
        self.col = col
        self.pos = pos
    def update(self, c):
        self.pos += 1
        if c == "n":
            self.line += 1
            self.col = 0
        else:
            self.col += 1
    def __repr__(self):
        return "Location({!r}, {!r}, {!r}, {!r})".format(self.name, self.pos, self.line, self.col)
    def __str__(self):
        return "{}: {}: line {}, column {}".format(self.name, self.pos, self.line, self.col)

def readchar(inputchannel, location):
    while True:
        maybechar = inputchannel.read(1)
        if maybechar == '':
            return None
        else:
            location.update(maybechar)
            yield maybechar

def readtoken(inputchannel, location):
    state = LEXER_SKIP_WHITESPACE
    token = ''
    for nextchar in readchar(inputchannel, location):
        if state == LEXER_SKIP_WHITESPACE:
            if nextchar == "\n":
                yield "\n"
                continue
            elif nextchar.isspace():
                continue
            elif nextchar == '"':
                state = LEXER_READ_QUOTED_STRING
                continue
            elif nextchar == '[':
                state = LEXER_READ_BRACKETED_STRING
                continue
            else:
                state = LEXER_READ_WORD
                token += nextchar
                continue
        elif state == LEXER_READ_QUOTED_STRING:
            if nextchar == '"':
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        elif state == LEXER_READ_BRACKETED_STRING:
            if nextchar == ']':
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        elif state == LEXER_READ_WORD:
            if nextchar == "\n":
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                yield "\n"
                continue
            elif nextchar.isspace():
                yield token
                token = ''
                state = LEXER_SKIP_WHITESPACE
                continue
            else:
                token += nextchar
                continue
        else:
            raise ValueError("Impossible lexer state.")
    if state == LEXER_SKIP_WHITESPACE:
        return None
    elif state == LEXER_READ_QUOTED_STRING:
        raise ValueError("End of character stream in quoted string.")
    elif state == LEXER_READ_BRACKETED_STRING:
        raise ValueError("End of character stream in bracketed string.")
    elif state == LEXER_READ_WORD:
        yield token
        return None
    else:
        raise ValueError("Impossible lexer state.")


class Lexer:
    def __init__(self, inputchannel, _location=None):
        self.location = _location or Location("<input>", 0, 1, 0)
        self.inputchannel = inputchannel

    def __iter__(self):
        return readtoken(self.inputchannel, self.location)

if __name__ == "__main__":
    sep = ''
    for token in Lexer(sys.stdin):
        if token == '\n':
            sys.stdout.write(token)
            sep = ''
        else:
            sys.stdout.write(sep)
            sys.stdout.write(token)
            sep = '|'
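
For reference, the lexer can also be exercised without stdin by wrapping a sample record in io.StringIO:

import io

record = '64.242.88.10 - - [07/Mar/2004:16:10:02 -0800] "GET /mailman/listinfo/hsdivision HTTP/1.1" 200 6291\n'
for token in Lexer(io.StringIO(record)):
    print(repr(token))
# prints the seven fields, in order, then a '\n' token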

Question 1. How to write lexers in Python?

I am well aware that there are tools similar to lex and yacc for Python that could have been used here, but that would defeat the purpose of learning how to write such a program in Python. I found it surprisingly hard.

My first misfortune is that Python does not do tail-call elimination, so it is essentially impossible to write a lexer as a set of mutually recursive functions. Mutually recursive functions are one of my favourite tools for this, because that style clearly singles out each state of the lexer (the recursive function it is currently in) and the transitions from that state to the others, and it makes it easy to test each transition individually. (A trampolined sketch of the same style in Python appears after the OCaml listing below.)

Since maybe not everybody is familiar with lexers based on mutually recursive functions, here is the equivalent of the readtoken generator in OCaml. The beginning of read_token reads like "if you see a double quote, discard it and go to read_quotedstring", and that function, defined later, does what one expects: it accumulates characters in a buffer until the next double quote and blesses the result as a token.

let rec read_token f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"') -> Stream.junk s; read_quotedstring f state s
  | Some('[') -> Stream.junk s; read_timestamp f state s
  | Some(' ')
  | Some('\t') -> Stream.junk s; read_token f state s
  | Some(c) -> read_simpleword f state s
  | None -> ax
and read_simpleword f ((ax,buffer) as state) s =
  match Stream.peek s with
  | Some('"')
  | Some('[')
  | Some(' ')
  | Some('\t') ->
    let current_token = Buffer.contents buffer in
    Buffer.clear buffer;
    read_token f (f ax current_token, buffer) s
  | Some(c) ->
    Buffer.add_char buffer c;
    Stream.junk s;
    read_simpleword f state s
  | None ->
    ax
and read_quotedstring f ((ax,buffer) as state) s =
  match Stream.peek(s) with
  | Some('"') ->
    Stream.junk s;
    let current_token = Buffer.contents buffer in
    Buffer.clear buffer;
    read_token f (f ax current_token, buffer) s
  | Some(c) ->
    Stream.junk s;
    Buffer.add_char buffer c;
    read_quotedstring f state s
  | None ->
    failwith "End of stream within a quoted string"
and read_timestamp f ((ax,buffer) as state) s =
  match Stream.peek(s) with
  | Some(']') ->
    Stream.junk s;
    let current_token = Buffer.contents buffer in
    Buffer.clear buffer;
    read_token f (f ax (timestamp_to_iso current_token), buffer) s
  | Some(c) ->
    Stream.junk s;
    Buffer.add_char buffer c;
    read_timestamp f state s
  | None ->
    failwith "End of stream within a bracketed string"

My Python version defines a few states as symbolic constants, as I would have done in C, and then builds one big loop that keeps track of the transitions. I am not pleased with that implementation, because it gives me no tool to manage the complexity of the lexer. Working with functions is very unpleasant because of the details of variable scoping in Python. So what would be the idiomatic way to break the lexer down into small, testable pieces, which would be important if I decided to write a more complicated parser? Could representing lexer states by objects be interesting?
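
As for representing lexer states by objects: one way to get small, testable pieces is to make each state a tiny class whose feed method takes one character and returns a pair (token to emit or None, next state). Every transition is then a pure call that can be asserted on in isolation. A rough sketch (bracketed strings and error handling omitted):

class SkipWhitespace:
    def feed(self, c):
        if c == '"':
            return None, ReadQuoted("")
        if c.isspace():
            return None, self
        return None, ReadWord(c)

class ReadQuoted:
    def __init__(self, text):
        self.text = text
    def feed(self, c):
        if c == '"':
            return self.text, SkipWhitespace()
        return None, ReadQuoted(self.text + c)

class ReadWord:
    def __init__(self, text):
        self.text = text
    def feed(self, c):
        if c.isspace():
            return self.text, SkipWhitespace()
        return None, ReadWord(self.text + c)

def tokens(chars):
    state = SkipWhitespace()
    for c in chars:
        token, state = state.feed(c)
        if token is not None:
            yield token
    if isinstance(state, ReadWord):   # flush a trailing word at end of input
        yield state.text

# Each transition is unit-testable on its own, for instance:
# token, state = ReadWord("GET").feed(" ")
# assert token == "GET" and isinstance(state, SkipWhitespace)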

Question 2. Is it right to treat stdin as a character stream?

Obviously I did just that, but Python has no real character type and represents characters as strings of length 1. It feels to me that reading the input in chunks into "circular expandable buffers" and making copies of substrings of those chunks to produce my tokens would fit the language much more nicely. Am I right?
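
For comparison, since Apache logs are line-oriented, one middle ground is to let the file object do the chunked buffering for you (iterating over a file yields lines read from an internal buffer) and to slice tokens out of each line with str.index; the tokens are then exactly the substring copies described above. A rough sketch:

import sys

def tokens_of_line(line):
    # slice tokens out of one buffered line instead of reading char by char
    i, n = 0, len(line)
    while i < n:
        if line[i].isspace():
            i += 1
        elif line[i] == '"':
            j = line.index('"', i + 1)   # raises ValueError if unterminated
            yield line[i + 1:j]
            i = j + 1
        elif line[i] == '[':
            j = line.index(']', i + 1)
            yield line[i + 1:j]
            i = j + 1
        else:
            j = i
            while j < n and not line[j].isspace():
                j += 1
            yield line[i:j]
            i = j

for line in sys.stdin:
    print("|".join(tokens_of_line(line)))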

Question 3. What is the all-purpose exception in Python?

This seems like a rather basic question, but I am failing to find a suitable answer in the documentation. Exception seems a poor choice, since it is very general and makes error identification and error handling rather complicated.
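
A common pattern, for reference, is to define a module-level exception class derived from Exception and subclass it per failure mode, so that callers can catch either the whole family or one specific case; the names below are illustrative only:

class LexerError(Exception):
    """Base class for all errors raised by this lexer."""

class UnterminatedString(LexerError):
    """The input ended inside a quoted or bracketed string."""

try:
    raise UnterminatedString("end of character stream in quoted string")
except LexerError as error:           # catches every lexer failure mode
    print("lexing failed:", error)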

Question 4. Do generators return None?

Is it good style to return None in generators? Would pass also do?
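
For reference, in Python 3 a return inside a generator simply raises StopIteration, with the returned value attached as StopIteration.value (PEP 380), so return None, a bare return, and falling off the end of the body are all equivalent here; a trailing pass would behave the same way:

def gen():
    yield 1
    return None        # same as a bare return, or no return at all

g = gen()
print(next(g))         # 1
try:
    next(g)
except StopIteration as stop:
    print(stop.value)  # None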

