Assignment is due Friday, February 1 at 5:00PM.

Last updated January 29, 2019 at 5:45PM.

In this assignment you have three major tasks to complete, along with describing your work in a design document:

  1. Implement a SimplePlanner class that translates parsed SELECT statements into executable query plans.
  2. Complete the nested-loop join plan-node so that NanoDB can evaluate joins.
  3. Write automated tests that exercise your inner- and outer-join support.

These tasks are described in the following sections.

There are also several extra-credit tasks listed at the end of the assignment, for the ambitious.

Before You Start

This assignment builds on top of the previous assignment. If you still have bugs in your pin/unpin and update/delete code, you will want to fix these issues as you go forward. The main focus of the upcoming assignments will be SELECT queries on tables populated by INSERT statements, so you have some time to iron out the bugs.

If your attempts to improve insert performance went very poorly, you can always revert your codebase to use the original heap-file implementation. (Don't forget to retain your pin/unpin work, and your code to support updates and deletes.) None of the remaining assignments require fast tuple insertion; your only long-term penalty will be waiting slightly longer for your tables to load. We would still encourage you to find and fix your bugs, though, since some of the challenge assignments will definitely be painful without improved insert performance.

NanoDB Query Evaluator and Plan-Node Lifecycle

NanoDB's query evaluator uses a demand-driven pipeline, and all plan-nodes must operate the same way for it to work properly. You can refer to the API documentation for specific methods on edu.caltech.nanodb.plannodes.PlanNode (the abstract base-class of all plan-nodes), but the description below gives an overview of how everything fits together.

When a PlanNode subclass is constructed, various essential details are passed to the constructor to initialize the node. All plan-nodes are allowed to have up to two child plan-nodes, available as the leftChild and rightChild fields in the PlanNode class. These fields are marked protected so that all subclasses can access them. For such plan-nodes, the child plans are usually passed to the constructor. Note that there are also plan-nodes that do not require any children, such as the file-scan node. In these cases, the constructor specifies the necessary details for the plan node.

As with relational algebra operations, every plan-node specifies the schema of its output tuples. (Every node also has an associated cost of evaluation, which we will use in the next assignment. For now you can ignore the cost information; it will likely be null for the time being.) These values must be computed once an entire query plan is assembled; this is done by calling the prepare() method on the root plan-node. This operation, as is the case with most plan-node operations, will recurse down the plan-tree so that all nodes have a chance to prepare themselves. Note that the query evaluator expects that all plans have already been prepared before handing them to the evaluator.

Before a plan-node can produce any rows, its initialize() method must be invoked. The initialize() method sets up evaluation-time resources that the plan-node requires, so that the next invocation of getTuple() will return the first tuple from the plan-node. (Generally, this method invokes super.initialize() to run parent-class initialization, then it initializes its own internal members, and then finally it recursively invokes initialize() on its left and right child-nodes, if present.)

The getTuple() method is where the plan-node implements its logic. Each call returns the next tuple generated by the plan-node; once there are no more tuples, getTuple() returns null. (It is not an error to call getTuple() again after it has returned null, but it shouldn't be necessary, and your code is strongly discouraged from depending on this behavior.)

Many plan-nodes must call getTuple() on their children multiple times before they can return the next tuple to their caller. You can look at the SelectNode class' getTuple() method for a simple example of this. Some plan-nodes must even consume all data from their children before producing any results, such as in the case of sorting or hash-based grouping.
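In outline, that pattern looks like the following sketch. It is simplified from what SelectNode actually does; exception handling is omitted, and isTupleSelected() simply stands in for evaluating the predicate against the current tuple.

// Simplified sketch of a SelectNode-style getTuple():  pull tuples from
// the child until one satisfies the predicate.
public Tuple getTuple() {
    while (true) {
        Tuple tuple = leftChild.getTuple();
        if (tuple == null)
            return null;         // Child is exhausted, so we are too.

        if (predicate == null || isTupleSelected(tuple))
            return tuple;        // This tuple satisfies the predicate.
    }
}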

Some plan-nodes must also iterate over the tuples from their children multiple times in order to produce their results. The nested-loop join node is a perfect example of this; for each tuple produced by the left subplan, the plan-node must iterate over all tuples produced by the right subplan. A subplan can be restarted by calling initialize() on it again to reset it to the beginning of its tuple-sequence. For some plan-nodes such as the file-scan node, resetting to the beginning is a simple matter of going back to the start of the tuple file. However, other plan-nodes may not store their results; such nodes must recompute their results from scratch every time the node is restarted. (NanoDB doesn't currently implement any external-memory algorithms, but if a plan-node creates any intermediate data files when computing its results, subsequent calls to initialize() may also need to truncate or delete these intermediate files.)

When the evaluator is finished executing a plan, it calls cleanUp() on the root plan-node. Each plan-node performs its own cleanup, and then calls cleanUp() on its child node(s). Most plan-nodes require no cleanup, but this step is critical for nodes that use external memory to compute their results, so that their temporary files can be deleted.
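To summarize the lifecycle, evaluating a prepared plan boils down to a pattern like the following minimal sketch. It uses only the PlanNode methods described above; the real evaluator does more (including exception handling), and remember that prepare() is the planner's responsibility.

// Minimal sketch of how a prepared plan is evaluated.
void evaluatePlan(PlanNode plan) {
    plan.initialize();                 // Set up evaluation-time resources.

    Tuple tuple;
    while ((tuple = plan.getTuple()) != null) {
        // Process the tuple, e.g. send it back to the client.
    }

    plan.cleanUp();                    // Release resources, delete temp files.
}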

NanoDB Query Planner

NanoDB is currently unable to execute anything except the simplest SQL statements, because it has no way of generating more complex execution plans from a SQL statement. This is one of your major tasks for the assignment.

NanoDB allows different planners to be plugged into the database by providing a Planner interface (edu.caltech.nanodb.queryeval package). The two most relevant methods are:

SelectNode makeSimpleSelect(String tableName, Expression predicate,
                            List<SelectClause> enclosingSelects)

This method is used by UPDATE and DELETE statements to implement the optional WHERE clause. These DML statements require that the tuples produced by the plan can be used as L-values (i.e. modified or deleted in place), and the method's current implementation satisfies this requirement.

The "simplest" planner also uses makeSimpleSelect() to handle "SELECT * FROM t WHERE ..." queries.

PlanNode makePlan(SelectClause selClause,
                  List<SelectClause> enclosingSelects)

This method produces a query plan for a standard SELECT...FROM...WHERE statement. Every SELECT statement is parsed into a SelectClause that contains all components of the statement, including the list of values in the SELECT clause, the tree of join expressions in the FROM clause, and the predicate in the WHERE clause. (SelectClause, FromClause, etc., are in the queryast - "Query AST" - package.)

Note that every query plan produced by a planner implementation must be prepared by the planner. The query evaluator expects that the planner has completed this step.

This week you will create a planner class called SimplePlanner, starting with the SimplestPlanner implementation. Copy the SimplestPlanner.java file to SimplePlanner.java, then modify the class to be called SimplePlanner. (Do not modify SimplestPlanner.)

Default Planner Class

NanoDB can be configured to use a specific planner as its default planner using the nanodb.plannerClass property. The easiest way to make sure your new planner is used is to go to the ServerProperties class (in the edu.caltech.nanodb.server.properties package) and change the class name that is specified in the DEFAULT_PLANNER_CLASS constant. Just change "Simplest" to "Simple", and your new planner class will be used once you have recompiled your project.
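The change itself is a one-liner. Assuming your SimplePlanner lives in the queryeval package alongside SimplestPlanner, the constant would end up looking something like this (verify the exact form of the constant in ServerProperties):

// In edu.caltech.nanodb.server.properties.ServerProperties:
public static final String DEFAULT_PLANNER_CLASS =
    "edu.caltech.nanodb.queryeval.SimplePlanner";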

Required Planner Features

Next, update SimplePlanner.makePlan() so that it supports all of the following features:

  * Projection, i.e. the list of values in the SELECT clause
  * Selection, i.e. the predicate in the WHERE clause
  * FROM clauses with zero, one, or multiple tables, including inner joins and left/right outer joins
  * Grouping and aggregation, i.e. the GROUP BY and HAVING clauses, and aggregate functions in the SELECT and HAVING clauses
  * Sorting, i.e. the ORDER BY clause

You are not required to support these features in this planner (they are part of a future assignment):

  * Subqueries, in any clause of a query
  * Correlated evaluation

Because we are not supporting these features, you can ignore the enclosingSelects arguments to the Planner methods. We will use them in a future assignment.

The above is quite a list of features, but they can be implemented step by step, often in parallel by different teammates, and they shouldn't generate too much instability as you complete and incorporate each feature. If your team works in parallel, make sure to keep your local repository up to date with your team's repository, so that your codebases don't diverge too widely.

The following sections give some additional notes and guidelines.

General Planner Approach

As discussed in class, SQL SELECT statements can have up to eight different clauses, depending on the functionality supported by the database server. These clauses (if present) are applied in the following order:

  1. WITH (not supported by NanoDB)
  2. FROM
  3. WHERE
  4. GROUP BY
  5. HAVING
  6. SELECT
  7. ORDER BY
  8. LIMIT / OFFSET (not supported by NanoDB)

The planner you create can follow a straightforward approach, particularly since all plan-nodes derive from the same PlanNode base-class:

PlanNode plan = null;

if (FROM-clause is present)
    plan = generate_FROM_clause_plan();

if (WHERE-clause is present)
    plan = add_WHERE_clause_to(plan);

if (GROUP-BY-clause and/or HAVING-clause is present)
    plan = handle_grouping_and_aggregation(plan);

if (ORDER-BY-clause is present)
    plan = add_ORDER_BY_clause_to(plan);

// There's always a SELECT clause of some sort!
plan = add_SELECT_clause_to(plan);

You can see that it should be straightforward to build up the full plan based on what is present. This also has the benefit that the plan will not include plan-nodes for clauses that are absent.
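In Java, that outline might take a shape like the sketch below. The SelectClause accessors and plan-node constructors shown are assumptions to verify against the Javadocs, and makeFromPlan() is a hypothetical helper for you to write.

// A sketch of SimplePlanner.makePlan() following the outline above.
public PlanNode makePlan(SelectClause selClause,
                         List<SelectClause> enclosingSelects) {
    PlanNode plan = null;

    FromClause fromClause = selClause.getFromClause();
    if (fromClause != null)
        plan = makeFromPlan(fromClause);     // File-scans, joins, etc.

    Expression whereExpr = selClause.getWhereExpr();
    if (whereExpr != null)
        plan = new SimpleFilterNode(plan, whereExpr);

    // ... handle grouping/aggregation and ORDER BY here ...

    // There's always a SELECT clause of some sort!  (You may be able to
    // skip the plan-node for a trivial "SELECT *" projection.)
    plan = new ProjectNode(plan, selClause.getSelectValues());

    plan.prepare();   // The planner must prepare the plan (see above).
    return plan;
}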

The EXPLAIN Command

Like many databases, NanoDB provides an EXPLAIN command that can be used to see what the database decides to do when you run a query. You can use EXPLAIN to display the plans generated by your planner, to see if they actually make sense. For example, if you have a table like this:

CREATE TABLE t ( a INTEGER, b VARCHAR(20) );

You can type "EXPLAIN SELECT b FROM t WHERE a < 5;", and (once your planner works) NanoDB will respond with something like this:

Explain Plan:
    Project[values:  [T.B]] cost is unknown
        FileScan[table:  T, pred:  T.A < 5] cost is unknown

(The "cost is unknown" because you haven't yet implemented plan costing or statistics collection. You will do this next time.)

Grouping and Aggregation

As discussed in class, grouping and aggregation is challenging to translate from SQL into relational algebra, because the SELECT clause mixes grouping/aggregation and projection, and the HAVING clause mixes selection and aggregation. The SELECT and HAVING expressions must be scanned for aggregate expressions so that the grouping plan-node can be initialized correctly, and also so that the aggregates may be removed and replaced with placeholder variables.

Additionally, aggregates may not appear in the WHERE clause, in ON clauses, or in GROUP BY expressions, so we must scan these as well to ensure that they contain no aggregates.

This means we need a way to traverse expression trees, and possibly to modify them as we traverse them.

Scanning and/or Transforming Expressions

NanoDB's expressions package includes a mechanism for traversing and transforming expression trees. The Expression base-class has a traverse() method that performs a traversal of an entire expression hierarchy. The method takes an ExpressionProcessor implementation, which has an enter(Expression) method and a leave(Expression) method; as each expression node is visited, the enter() method is called with that node as an argument, then the node's children are traversed (with enter() and leave() being called on each of them in turn), and finally the leave() method is called with the original node as an argument.

You will also note that leave() returns an Expression; this allows the ExpressionProcessor implementation to mutate the expression as it is traversed. If an expression processor wants to replace the current node with some other expression, it simply returns the replacement from leave(), and it will take the place of the original node.

This functionality does require that you use these methods in a specific way. For example, if you were examining select-clause expressions for aggregates, you would need to do something like this:

for (SelectValue sv : selectValues) {
    // Skip select-values that aren't expressions
    if (!sv.isExpression())
        continue;

    Expression e = sv.getExpression().traverse(processor);
    sv.setExpression(e);
}

Note that the original expression is extracted, traversed, and then replaced by whatever the traverse() call returns.

If you want to see what an ExpressionProcessor implementation might look like, you should look at the SymbolFinder nested-class inside of the Expression class. This class is used by the Expression.getAllSymbols() method. Note that the SymbolFinder is read-only, so it does not have to be used precisely in the manner described above. The one your team writes will need to be used in the above manner since it may modify the expression passed to it.

Identifying and Replacing Aggregate Functions

Function calls are certainly a kind of expression, represented by the FunctionCall class in the expressions package. You can call the FunctionCall.getFunction() method to retrieve the specific function that is being invoked, and see if it is an instance of an aggregate function. Something like this:

if (e instanceof FunctionCall) {
    FunctionCall call = (FunctionCall) e;
    Function f = call.getFunction();
    if (f instanceof AggregateFunction) {
        ...  // Do stuff
    }
}

Of course, if you wish to replace an aggregate function call with a column-reference, like the approach described in class, you can look at the ColumnValue expression class. It takes a ColumnName object that holds the column name, and optionally the table name. (This ColumnName class also represents wildcard expressions; you can read the Javadocs for the details.)

Implementation Approach

You will need to create an ExpressionProcessor implementation that identifies aggregate functions in expressions, and maps each one to an auto-generated name. Therefore the class would need some kind of mapping from auto-generated names to the corresponding aggregates they represent. This should be used to process the SELECT expressions and the HAVING expressions, for obvious reasons. But, it can also be used on the WHERE, GROUP BY and ON clauses (if present), to ensure that they don't contain aggregates. (It would be an error for these clauses to contain aggregates.)

Finally, note that an aggregate function call also has an expression as its argument. It would be an error for an aggregate's argument to also contain an aggregate function call (e.g. MAX(AVG(x)) is an invalid expression).

You should throw an InvalidSQLException if you encounter any WHERE or ON clauses containing aggregates, GROUP BY expressions containing aggregates, or aggregate function calls that contain other aggregate function calls in their arguments. Make sure the error message is suitably descriptive for the situation you encounter.
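To make this concrete, here is one possible skeleton for such a processor. It is only a sketch: the class name and generated-name scheme are illustrative, and it assumes InvalidSQLException is unchecked (add a throws clause if it is not).

import java.util.HashMap;
import java.util.Map;

// Sketch of an ExpressionProcessor that finds aggregate function calls,
// maps each one to an auto-generated name, and replaces the call with a
// ColumnValue referring to that name.  (NanoDB imports omitted.)
public class AggregateFinder implements ExpressionProcessor {
    // Maps auto-generated names to the aggregate calls they stand for.
    private Map<String, FunctionCall> aggregates = new HashMap<>();

    // True while we are traversing the arguments of an aggregate call.
    private boolean insideAggregate = false;

    public void enter(Expression e) {
        if (isAggregate(e)) {
            if (insideAggregate) {
                throw new InvalidSQLException(
                    "aggregate function calls cannot be nested");
            }
            insideAggregate = true;
        }
    }

    public Expression leave(Expression e) {
        if (isAggregate(e)) {
            insideAggregate = false;

            // Replace the aggregate with a placeholder column-reference.
            String name = "#AGG" + aggregates.size();
            aggregates.put(name, (FunctionCall) e);
            return new ColumnValue(new ColumnName(name));
        }
        return e;  // Leave all other expressions unchanged.
    }

    private boolean isAggregate(Expression e) {
        return e instanceof FunctionCall &&
            ((FunctionCall) e).getFunction() instanceof AggregateFunction;
    }
}

The same processor can be reused to validate the WHERE, ON, and GROUP BY clauses, for example by adding a flag that causes any discovered aggregate to raise an InvalidSQLException instead of being replaced.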

Nested-Loop Join

The nested-loop join algorithm is very easy to state. For two tables r and s, the nested-loop join algorithm does the following:

for tr in r:
    for ts in s:
        if pred(tr, ts):
            add join(tr, ts) to result

(Table r is frequently called the outer relation, and s is called the inner relation, for obvious reasons. In NanoDB, as in many other database implementations, the left child of a join node is the outer relation and the right child is the inner relation.)

What makes this algorithm complicated to implement is using it in a demand-driven pipeline where plan-nodes must return the next tuple anytime a call to getTuple() is made. Because of this, the plan-node must store internal state that tracks where it is in its internal processing, so that it can pick up where it left off when getTuple() is called again.

A partial implementation of this join is provided in the NestedLoopJoinNode class (in the plannodes package). The implementation currently has the innermost if-statement and associated join-tuple operation, but the loop itself is not implemented. The class has three fields that are used for keeping track of the execution state:

  * leftTuple - the current tuple from the left (outer) child
  * rightTuple - the current tuple from the right (inner) child
  * done - set to true once the node has produced all of its results

(Note that there are probably cleaner ways to implement this. If you come up with something nicer then feel free to incorporate your refinements into your solution. We may even use them in a future class.)

The getTuplesToJoin() method updates leftTuple and/or rightTuple based on the next pair of tuples that should be considered. This is where you must implement the nested-loop logic that allows the join to compare every possible pair of tuples. You will need to iterate over the rows in the left and right child nodes, advancing the state of the nested loops every time getTuplesToJoin() is called. When you reach the end of the inner relation (i.e. when rightChild.getTuple() returns null), you can use initialize() to reset it back to the start.
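As a starting point, the inner-join version of that logic might take a shape like the following sketch. It ignores outer joins, and the exact signature and exception handling in NestedLoopJoinNode may differ; treat it as one possible structure, not the required one.

// Sketch of inner-join-only logic for getTuplesToJoin().  Returns true if
// leftTuple/rightTuple now hold a new pair to consider, or false when the
// join has run out of pairs.
private boolean getTuplesToJoin() {
    if (leftTuple == null) {
        // First call:  position the outer relation on its first tuple.
        leftTuple = leftChild.getTuple();
        if (leftTuple == null)
            return false;          // Outer relation is empty.
    }

    rightTuple = rightChild.getTuple();
    if (rightTuple == null) {
        // Inner relation exhausted:  advance the outer relation, then
        // restart the inner relation from the beginning.
        leftTuple = leftChild.getTuple();
        if (leftTuple == null)
            return false;          // Outer relation exhausted; all done.

        rightChild.initialize();
        rightTuple = rightChild.getTuple();
        if (rightTuple == null)
            return false;          // Inner relation is empty.
    }

    return true;
}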

Remember that you must also support left-outer joins with this plan node. You will need to modify the logic, and you may need to add fields to the class to properly track the execution state.

Once you have completed this node, you can use it in your planner to support the various join operations specified in the previous section.

Automated Testing

Once you have completed these tasks, it would be of value to know that they are actually correct. To this end, we can employ an automated testing framework to run tests against the database code to verify that everything works correctly. There are several different kinds of tests; for this assignment you will focus on SQL-level integration tests.

In the src/test/java/edu/caltech/test/nanodb/sql directory is a simple framework for issuing queries against the database and comparing the results against a set of expected results. This allows us to do very basic integration testing against the NanoDB query processor. There are also a number of classes that exercise various aspects of the SQL queries you will support this week.

You will note that these classes' constructors specify properties defined in the test_sql.props file; each property specifies the SQL initialization code that is necessary to run the corresponding test cases. (You can also issue your own initialization commands within the class if you wish; quite a few tests do this instead.)

Create a new class TestSimpleJoins in this directory (you can copy one of the other classes to start with), which exercises your inner- and outer-join support and checks the results. Your testing should be reasonably complete; for example, it would be useful to verify the following:

  * Joins where one or both of the input tables are empty
  * Joins where neither table is empty, but no rows satisfy the join predicate
  * Joins where a row in one table matches multiple rows in the other table
  * Each of the above for both inner joins and left/right outer joins

It shouldn't be too hard to construct a few small, simple input tables for these kinds of cases, and then construct the corresponding test code. (It will be tedious, but that's the price you pay for confidence that your code works.)
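As a shape to aim for, a single test might look roughly like the sketch below. The helpers shown (server.doCommand(), checkUnorderedResults(), TupleLiteral) and the table names are assumptions based on the conventions of the existing test classes; copy whatever the class you started from actually uses.

// Rough sketch of one test case.  Helper names and table names are
// assumptions -- mirror the existing test class you copied.
@Test
public void testInnerJoinNoMatches() throws Throwable {
    // No rows satisfy the join predicate, so the result should be empty.
    CommandResult result = server.doCommand(
        "SELECT * FROM test_sj_t1 t1 JOIN test_sj_t2 t2 ON t1.a = t2.a",
        true);

    TupleLiteral[] expected = {};
    assert checkUnorderedResults(expected, result);
}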

Don't be sloppy. Update the comments in the testing code to match what your code is doing. Also, make sure your tests pass. After you run the Maven test target, the results will appear in your project's test report; if there are failures, you can browse to the details and then fix them. All tests should pass at the end of Project 2.

Suggested Approach

This assignment has many parts that can be done in parallel, which should make it easier to divide it up amongst your teammates. Following is a general description of how you might tackle the project's tasks.

It is important to start the SimplePlanner implementation right away, since most functionality will be exposed through the SimplePlanner. Fortunately, it should also be very straightforward to implement support for SELECT (projection) and WHERE (selection) clauses, for queries involving zero or one table (simple FROM clause).

A subsequent step for the SimplePlanner will be to implement the code that converts FromClauses into their corresponding plan-node trees. This can be implemented before the nested-loop join implementation is completed, but SQL queries containing joins will not execute properly until the NestedLoopJoinNode implementation is completed.

The NestedLoopJoinNode implementation is another task that should be started early, since it is the basis of how joins are evaluated. It can be done independently in parallel with the planner work. Once this node is complete, you should have support for most basic SQL queries, as long as they don't include GROUP BY or HAVING clauses.

Once the NestedLoopJoinNode is done, and the planner is able to use the node in query planning, the required unit-tests can be written. This can proceed in parallel to other tasks.

The last piece of the puzzle is supporting the GROUP BY and HAVING clauses. Once the SimplePlanner supports the most basic queries (joins don't need to work to be able to use grouping and aggregation), work on these features can progress. You should probably start these features early in your efforts, since they tend to be the most complicated features for students to get working properly.

The bulk of the work is in creating an implementation of the ExpressionProcessor interface to identify and extract aggregate functions from an expression, and to report errors like nested aggregate function calls. This task can be started before any other functionality is complete, but you won't be able to use this tool until the SimplePlanner can handle basic queries (e.g. single-table queries).

Once this task is done, the SimplePlanner must be updated to support grouping and aggregation using your ExpressionProcessor implementation. This task tends to be straightforward to complete, although you will need to think about how to manage the necessary details.

Extra Credit

If you have gotten to this point and you now find yourself heartbroken that the assignment is over, you should add support for the LIMIT and OFFSET clauses by implementing a LimitOffsetNode and then incorporating this into your execution plan when the SQL specifies these clauses. For full credit, you must also write unit tests that fully exercise this functionality. (Up to +3pts without tests; up to +10pts with tests.)
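If you attempt this, the heart of the node is small. Here is a sketch of what its getTuple() might look like; the field names (offset, limit, tuplesSkipped, tuplesReturned) are illustrative, and you would reset them in initialize().

// Sketch of the core of a LimitOffsetNode:  skip `offset` tuples, then
// pass through at most `limit` tuples from the child.
public Tuple getTuple() {
    // Skip the first `offset` tuples produced by the child.
    while (tuplesSkipped < offset) {
        if (leftChild.getTuple() == null)
            return null;           // Child ran out while skipping.
        tuplesSkipped++;
    }

    // Stop once `limit` tuples have been returned (0 meaning "no limit").
    if (limit > 0 && tuplesReturned >= limit)
        return null;

    Tuple tuple = leftChild.getTuple();
    if (tuple != null)
        tuplesReturned++;

    return tuple;
}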

Design Document and Submission

Once you are finished with the above tasks, your team needs to complete the design document for this assignment, available in your repository at "doc/lab2design.txt".

When you are ready to submit your code, and your design document includes your team's answers to the design questions, you need to tag the version of your code that you want us to grade. Only one member of your team should perform these steps. If multiple people on your team do this, it will mess up your team's repository.

The first thing to do is tag your submission so that we can retrieve the correct version of your work:

git tag hw2

This will tag the current version of your repository with the name "hw2". Once you have done this, you can push all of your work, including this Git tag, to your Gitlab repository:

git push --tags

The --tags argument will cause all tags in your local repository to be replicated into your team's Gitlab repository. Once this is done, we will be able to retrieve the version of your work marked with this tag, so that we can grade it.

Note that if you decide to resubmit your work (e.g. because you found a bug that you needed to fix), you cannot reuse the same tag name again! Therefore, use an updated tag name "hw2-2", "hw2-3", etc. so that we can still find the submission. Also, don't forget to push the new tag to your submission repository.

Finally, one team member should complete this short document and submit it on the course website. Only one teammate should submit this document so that we don't get confused! You can list the Git repository hashes of your tags with this command:

git show-ref --tags