Homework 1, Part 1: Parsers

What are parsers, and why use them?

Up until this point, most of the text files you've used for program input have been very simple. For example, the PPM format used for graphics output in this lab is easy to interpret with a few loops and if-statements. However, what if you want to read a more complicated language into your program, like C or HTML? Writing a scanner by hand for C will likely result in thousands of lines of very ugly code and many headaches.

Parsers are simple algorithms for interpreting complicated textual languages. They are commonly used in compilers, web browsers, and in our case, 3D model viewers. We'll be working with a complicated 3D model format for the three labs after this one, so it's best to learn how to write a parser for the two simple languages used in this lab.

This page provides only a high-level overview of what a parser does and how it works. You should check out Lex and YACC primer/HOWTO for details on how to actually write simple parsers for C or any compatible language.

How do parsers work?

At the highest level, a parser translates a stream of characters (in most cases from a file) into an abstract syntax tree (AST). The tree is a hierarchical data structure that represents the contents of the file. The leaves represent atoms of the language (such as numbers, identifiers, punctuation, etc.), while the higher nodes represent aggregations of these atoms such as expressions, statements, blocks, or definitions.

We'll go over each part of the parser step-by-step.

Lexers and regular expressions

The lower half of the parser is called the lexer. It translates groups of characters into tokens, also known as terminal symbols. It works using regular expressions. Essentially, you have a list of regular expressions paired with code. When an expression is matched, the corresponding code is executed, usually returning a value. For example, the regular expression 0|[1-9][0-9]* matches an integer, so we might pair it with code that translates the matched text to an integer (using sscanf or atoi) and signals the parser that an integer has been matched.

We need a regular expression and associated code to match each token in our language. For the polyline language of this lab, you'll need to match the word polyline, and positive and negative floating point numbers. For the transform4x4 language, you'll need to match translation, rotation, scaleFactor, and floating point numbers. You'll also need a do-nothing rule for whitespace. In more complicated languages, you'll need to worry about punctuation, different number formats, and comments.

Parsers and context-free grammars

The upper half of the parser uses a context-free grammar to build the AST from the token stream emitted by the lexer. We write the parser as a series of productions. For instance, a simple arithmetic language might look like this:

expr ::= INT
| expr PLUS expr
| expr MINUS expr
| expr TIMES expr
| expr DIVIDE expr
| LEFT_PAREN expr RIGHT_PAREN;

The tokens are fed into the parser one by one. For each symbol, the parser can decide to either shift (put the symbol on the stack), or reduce (pop some symbols from the top of the stack and reduce them to a non-terminal according to the grammar).

The class of parser we'll be using cannot handle all context-free grammars since we can't really work with ambiguous grammars. There are two common kinds of errors you will run into because of this when compiling your parser. A reduce-reduce conflict occurs when there is more than one production that the parser can use to reduce symbols on the stack. This is a pretty severe error, indicating that your language is incorrectly specified. By default, the parser will reduce with the top-most production in the file. A shift-reduce conflict occurs when the parser has a choice between shifting and reducing. This is almost always due to incorrectly specified precedence or associativity. This will be discussed in recitation. By default, the parser will shift.

Tools for writing lexers and parsers

Although lexers and parsers can be written as plain code, it is easier to write them in specialized languages, then use programs called lexer and parser generators to translate them into plain code. Two popular tools for doing this in C are flex (for lexers) and bison (for parsers). They are derived from older tools named lex and yacc, but the language used is almost the same. These tools are built-in for Linux and OS X, so you can use them on the cluster. Windows users will have to install them manually. A good tutorial is Lex and YACC primer/HOWTO. You should read this before recitation on Friday.

If you are programming in Python or Java, there are a variety of parsing tools for both languages, but nothing really standard. This page lists Python parsing tools. For Java, I would recommend JFlex and CUP. These tools are not on the cluster, so if you want to use them, you'll need to download them yourself.

Example parser

Bill Clark, a former TA, wrote an excellent example lexer/parser that reads instructions for animating a set of colored arrows in OpenGL. Please check it out. Feel free to use it as a template for your own parsers.

Here is a simple python parser reading in a list of colors with PLY. To run the example, execute python hw0.py < test.in . There are more examples on their website.