Modifying The NanoDB Parser

Last updated February 5, 2019 at 12:00PM.

NanoDB has used the ANTLR parsing framework to parse SQL since its inception. The current version, ANTLR v4, has many benefits over previous versions: parser rules are easier to write, and it promotes a cleaner architecture that separates the various stages of parsing.

Some basic documentation on how to use ANTLR v4 is available online:

The ANTLR v4 project documentation on Github is a good place to start
The ANTLR Mega Tutorial may also be helpful for understanding ANTLR v4

Additionally, the book "The Definitive ANTLR 4 Reference" by Terence Parr (creator of ANTLR) is a very helpful resource. Donnie has a copy of this book in his office if you need to look anything up. If you expect to write a lot of parsers and you like using ANTLR, you may want to acquire your own copy.

Parser Specification Files

The parser and its components are in the src/main/antlr4 directory of the project. In the imports subdirectory of this path are various specifications used by the main grammar:

Lexer.g4 specifies a number of lexer tokens used in the full parser
Keywords.g4.in specifies the list of SQL keywords
make_ci.py is a Python script to generate a case-insensitive version of the keyword list from Keywords.g4.in into Keywords.g4
The file Keywords.g4 is generated, so it probably shouldn't be checked into the repository, but it currently is.

The main parser grammar is in the file src/main/antlr4/edu/caltech/nanodb/sqlparse/NanoSQL.g4. It is reasonably well-documented, and you should be able to scan through this file and get a sense of its structure and general approach before making any changes to it.

During the build process, the ANTLR v4 parser generator consumes the above .g4 files and generates a substantial amount of code into the target/generated-sources/antlr4 directory of the Maven build. These sources are then built along with the rest of the NanoDB sources, so that the database server knows how to parse NanoDB's specific dialect of SQL.

You will note that all parts of the NanoDB parser itself are in the edu.caltech.nanodb.sqlparse package. Additionally, there are some other important helper classes that live under the src/main/java/edu/caltech/nanodb/sqlparse directory; these are not generated by the ANTLR v4 tools, but since they are essential to using the parser, they are grouped in the same Java package. (You can see that a Java package's contents may be drawn from multiple directories during the build process. In NanoDB, the sqlparse package is the only one for which this is the case.)

Translation of Parse-Tree into Abstract Syntax Tree

ANTLR v4 takes an input string and generates an ANTLR parse-tree from it. All of the nodes in the parse-tree are ANTLR-specific nodes that correspond to the rules in the grammar and to the specific tokens that are consumed. This means that the parse-tree is not very useful for a SQL database, which may want to analyze and manipulate query-specific components of the in-memory representation.

Therefore, the ANTLR parse-tree is translated into an Abstract Syntax Tree (AST) for use within the database. The translation is performed by the NanoSQLTranslator class (in the edu.caltech.nanodb.sqlparse package). This translator handles the translation of both expressions (in general, subclasses of the Expression class) and commands (in general, subclasses of the Command class). This translator is rather grungy and tedious in its operation, but fortunately all of the tedium is contained within this one class. The ANTLR v4 grammar itself contains no AST-specific functionality, just a specification of the language. The database command-evaluation code stays largely straightforward as well, because all of the parse-tree handling is done by this translator class.

If you look in both the NanoSQL.g4 grammar and the NanoSQLTranslator code, you will notice that for every grammar rule someRule there is a corresponding method Object visitSomeRule(NanoSQLParser.SomeRuleContext ctx). These methods are responsible for translating the ANTLR parse-tree into the corresponding NanoDB AST. There are numerous examples of how to access and use the ANTLR parse-tree information in this class, so you are encouraged to find rules that are similar to what you need to do, and then emulate how they do it.
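The shape of this rule-to-method translation can be sketched without ANTLR on the classpath. Everything in the block below is a hypothetical stand-in: the context classes imitate what ANTLR would generate inside NanoSQLParser, and SketchExpr, Literal and Add are not NanoDB's Expression classes. It is only meant to show the pattern of one visit method per rule, each returning Object that the caller casts to the AST type it expects.

```java
// A self-contained sketch of parse-tree-to-AST translation.
// All names here are illustrative assumptions, not NanoDB or ANTLR types.
final class TranslatorSketch {
    // Stand-in for an ANTLR parse-tree node (a generated XxxContext class).
    interface ParseNode {
        Object accept(Translator v);
    }

    // Stand-in for the context of a hypothetical "intLiteral" rule.
    static final class IntLiteralContext implements ParseNode {
        final String text;  // raw token text, e.g. "42"
        IntLiteralContext(String text) { this.text = text; }
        public Object accept(Translator v) { return v.visitIntLiteral(this); }
    }

    // Stand-in for the context of a hypothetical "add" rule.
    static final class AddContext implements ParseNode {
        final ParseNode left, right;
        AddContext(ParseNode left, ParseNode right) {
            this.left = left;
            this.right = right;
        }
        public Object accept(Translator v) { return v.visitAdd(this); }
    }

    // Hypothetical AST node types that the database would work with.
    interface SketchExpr {
        int evaluate();
    }

    static final class Literal implements SketchExpr {
        final int value;
        Literal(int value) { this.value = value; }
        public int evaluate() { return value; }
    }

    static final class Add implements SketchExpr {
        final SketchExpr lhs, rhs;
        Add(SketchExpr lhs, SketchExpr rhs) { this.lhs = lhs; this.rhs = rhs; }
        public int evaluate() { return lhs.evaluate() + rhs.evaluate(); }
    }

    // One visit method per rule; each returns Object, so callers must cast.
    static final class Translator {
        Object visit(ParseNode node) { return node.accept(this); }

        Object visitIntLiteral(IntLiteralContext ctx) {
            return new Literal(Integer.parseInt(ctx.text));
        }

        Object visitAdd(AddContext ctx) {
            // Translate the children first, then build the AST node from them.
            SketchExpr lhs = (SketchExpr) visit(ctx.left);
            SketchExpr rhs = (SketchExpr) visit(ctx.right);
            return new Add(lhs, rhs);
        }
    }

    public static void main(String[] args) {
        ParseNode tree = new AddContext(new IntLiteralContext("2"),
                                        new IntLiteralContext("3"));
        SketchExpr ast = (SketchExpr) new Translator().visit(tree);
        System.out.println(ast.evaluate());  // prints 5
    }
}
```

In the real translator the context objects and the base visitor class come from the ANTLR-generated sources, so only the visit methods themselves need to be written by hand.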

Make sure to study the structure of NanoSQLTranslator before making changes to it. This class is large and grungy, and really cannot be broken into smaller parts because commands can contain expressions, and expressions can contain commands. The source code is laid out in a particular sequence to make the file a bit more comprehensible. If you make changes to it, please respect the existing structure of the code.

Note that when you create new grammar rules, but before you have run the Maven build process, your IDE may not recognize some or all of the types used in the NanoSQLTranslator class. This is because these types (e.g. the context types in NanoSQLParser, or the visitXxxx() methods in the NanoSQLBaseVisitor base-class) are generated by ANTLR v4 during the build process. This type information can be very helpful though! So, make sure you have successfully completed a build before starting to edit this file, so that you have useful type information available in your IDE.

Common Tasks

This section gives an overview of how to implement common tasks in the NanoDB codebase. It assumes you have read and understood the previous explanations. Note that the descriptions below are just an overview; for example, there are many details about how to create Expression and Command subclasses that are not discussed here.

If you want to add new SQL keywords:

Make sure you really need new SQL keywords! Anything you use as a keyword can no longer be used as an identifier, e.g. a table or column identifier.
Edit Keywords.g4.in to add your new keywords. Note that all keywords are specified in lowercase, and are kept in alphabetical order to make it easy to find things. Additionally, note that type names are in a special section at the bottom of this file, with names like TYPE_XXXX.

Do not modify Keywords.g4. This file is regenerated during the build.

Once you add new keywords, the build process will use the Python make_ci.py script to generate Keywords.g4 from Keywords.g4.in, and then your new keywords will be visible in the main grammar.
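To give a rough idea of what a case-insensitive keyword rule can look like, here is a minimal sketch of one possible transformation from a lowercase keyword to an ANTLR lexer rule. It is written in Java to match the rest of the codebase and is not a port of make_ci.py: the class name, the method name and the exact output format are assumptions, and the rules the real script emits may look different.

```java
import java.util.Locale;

// Hypothetical sketch: turn a lowercase keyword into a lexer rule that
// matches the keyword in any mix of upper and lower case.
final class KeywordRuleSketch {
    static String caseInsensitiveRule(String keyword) {
        StringBuilder pattern = new StringBuilder();
        for (char c : keyword.toCharArray()) {
            if (Character.isLetter(c)) {
                // A character set such as [sS] matches either case of the letter.
                pattern.append('[')
                       .append(Character.toLowerCase(c))
                       .append(Character.toUpperCase(c))
                       .append(']');
            } else {
                // Non-letters (e.g. an underscore) are matched literally.
                pattern.append('\'').append(c).append('\'');
            }
        }
        // The token name is assumed to be the upper-cased keyword.
        return keyword.toUpperCase(Locale.ROOT) + " : " + pattern + " ;";
    }

    public static void main(String[] args) {
        // prints: SELECT : [sS][eE][lL][eE][cC][tT] ;
        System.out.println(caseInsensitiveRule("select"));
    }
}
```

Whatever form the generated rules take, the practical point is the same: keywords are written once, in lowercase, in Keywords.g4.in, and the case handling is left to the generated file.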

If you want to add a new `Expression` type:

Add any new keywords to the Keywords.g4.in file.
Add any new kinds of expression syntax as alternatives (to use the ANTLR v4 terminology) under the expression rule in NanoSQL.g4.

Note that operator precedence in ANTLR v4 is specified by the ordering of the alternatives! This is why in the expression rule, unary operators come before multiplication, and multiplication comes before addition, etc. Consider carefully where your new expression syntax should appear in the precedence hierarchy, and then add your new syntax where appropriate.

If you have any questions on this topic, please ask Donnie.
Create new subclasses of Expression in the edu.caltech.nanodb.expressions package, as appropriate for your new syntax. You will need to implement a constructor, accessors and mutators to support configuration of objects of your new type. In addition, you will need to implement the other operations declared on Expression that all subclasses must support.

There are plenty of examples in the expressions package for you to reference!
Add new Object visitXxxx(NanoSQLParser.XxxxContext ctx) methods to the NanoSQLTranslator class, in the section that handles expressions. There are many kinds of expressions, so scan through this section of the file to determine the most appropriate place to add your new methods.

You can look at existing translation methods to see how to handle numerous different scenarios.
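The earlier point that operator precedence follows the ordering of alternatives can be illustrated with a small model. The sketch below is not ANTLR's generated code and not part of NanoDB; it is a simplified precedence-climbing evaluator (the class name and the space-separated input format are assumptions) in which operators listed earlier bind more tightly, so reordering the list changes the result.

```java
import java.util.List;

// Hypothetical model: binary operators are listed in "grammar order",
// and earlier entries are given higher precedence.
final class PrecedenceSketch {
    private final List<String> alternatives;
    private List<String> tokens;
    private int pos;

    PrecedenceSketch(List<String> alternatives) {
        this.alternatives = alternatives;
    }

    // Higher number means tighter binding; 0 means "not a known operator".
    private int precedence(String op) {
        int idx = alternatives.indexOf(op);
        return idx < 0 ? 0 : alternatives.size() - idx;
    }

    // Evaluates a space-separated expression such as "1 + 2 * 3".
    int evaluate(String input) {
        tokens = List.of(input.trim().split("\\s+"));
        pos = 0;
        return parse(1);
    }

    // Precedence climbing: only consume operators at or above minPrec.
    private int parse(int minPrec) {
        int left = Integer.parseInt(tokens.get(pos++));
        while (pos < tokens.size() && precedence(tokens.get(pos)) >= minPrec) {
            String op = tokens.get(pos++);
            // Requiring strictly higher precedence on the right-hand side
            // makes every operator left-associative.
            int right = parse(precedence(op) + 1);
            left = apply(op, left, right);
        }
        return left;
    }

    private static int apply(String op, int a, int b) {
        switch (op) {
            case "*": return a * b;
            case "+": return a + b;
            default: throw new IllegalArgumentException("unknown operator: " + op);
        }
    }

    public static void main(String[] args) {
        // Multiplication listed first: 1 + (2 * 3), prints 7
        System.out.println(new PrecedenceSketch(List.of("*", "+")).evaluate("1 + 2 * 3"));
        // Addition listed first: (1 + 2) * 3, prints 9
        System.out.println(new PrecedenceSketch(List.of("+", "*")).evaluate("1 + 2 * 3"));
    }
}
```

The same effect applies when a new alternative is added to the expression rule: placing it above or below existing operators decides how it groups with them.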

If you want to add a new `Command` type:

Add any new keywords to the Keywords.g4.in file.
Add any new grammar rules as alternatives under the commandNoSemicolon rule in NanoSQL.g4. You should review the grammar rules for existing commands, as they sometimes factor out common components into rules that are shared across commands. If you can leverage existing syntax in your commands then you will achieve two benefits:
1. Your syntax will already be familiar from other commands.
2. You will have to do less work!
Create new subclasses of Command in the edu.caltech.nanodb.commands package, as appropriate for your new syntax. You will need to implement a constructor, accessors and mutators to support configuration of objects of your new type. In addition, you will need to implement the other operations declared on Command that all subclasses must support.

There are plenty of examples in the commands package for you to reference!
Add new Object visitXxxx(NanoSQLParser.XxxxContext ctx) methods to the NanoSQLTranslator class, in the section that handles commands. There are many kinds of commands, so scan through this section of the file to determine the most appropriate place to add your new methods.

You can look at existing translation methods to see how to handle numerous different scenarios.