CS 11: advanced C track: lab 4


Material to be covered for this lab


Overview of this lab

Last week, we started to implement a small interpreted computer language by writing a lexer to recognize the tokens that constitute the building blocks of expressions in the language. In this lab we will complete the job by writing a parser, which will take the stream of tokens, figure out what kind of expressions they correspond to, and execute the code corresponding to these expressions.

Parsing

OK, so you've run your source code through flex, and have a nice set of tokens to play with. Unfortunately, there are some kinds of language analysis that are too complex for lexers. One example is matching open parentheses with their corresponding close parentheses: it turns out to be impossible for a lexer (which only handles regular expressions) to match arbitrarily nested sets of parentheses, which is something you would want in a computer language. For this we need something more powerful, and that thing is a parser. A parser takes a sequences of tokens (provided by the lexer) and either
  1. determines that a syntax error has occurred, or
  2. determines the next valid expression in the language.
If a valid expression is found, the parser may execute it immediately (for simple languages like this week's exercise) or convert it into yet another representation (called an abstract syntax tree) which is then subjected to even more processing (this is how serious computer languages work; for details, see Jason Hickey's CS 134b course on compilers).

Writing parsers by hand is hard work (much harder than writing lexers). So, by analogy with lexers, there are programs called parser generators that take an input file representing the grammar of the language to be parsed and generate C code that implements the parser. Also by analogy with lexers, parser generators are written in a special language that is in some ways similar to C, but that is specialized to the task of parsing. The standard unix parser generator program is yacc, which stands for "yet another compiler-compiler". Again, we will be using an improved version called bison. The name bison is a pun on yacc and on "GNU" (which in turn stands for "GNU's not unix"). GNU is the name of software from the Free Software Foundation.

What the bison grammar description looks like is described in great detail here. You should read as much of this as you need to. Briefly, the main section of the grammar describes productions (valid expressions in the language) in terms of simpler expressions, finally bottoming out with tokens. When an expression is matched, you can define an action to execute. This may involve evaluating the expression (in our case), or generating part of an abstract syntax tree (for a more complex language).


Program to write

You are to implement a simple calculator program/language, which will have the following features: Use the lexer you wrote in last week's lab to parse the tokens. Use bison to define the grammar. Your grammar should not have any shift-reduce or reduce-reduce conflicts (these indicate ambiguities in the grammar).

Your program should also be able to run scripts. This is done using the unix "sharp-bang" hack ("#!") as follows:

#! /usr/bin/env calculator
a = 2 + 2;
print(a);
When you put this into a file (called e.g. "print_4") and do this:
% chmod +x print_4
% print_4
it will print "4" to stdout.

To hand in


References