Assignment is due Friday, February 1 at 5:00PM.
Last updated January 29, 2019 at 5:45PM.
In this assignment you have three major tasks to complete, along with describing your work in a design document:

- Implement a more capable query planner, SimplePlanner.
- Add support for grouping and aggregation.
- Complete the nested-loop join plan-node, and write tests that exercise your join support.

These tasks are described in the following sections.
There are also several extra-credit tasks listed at the end of the assignment, for the ambitious.
This assignment builds on top of the previous assignment. If you still have bugs in your pin/unpin and update/delete code, you will want to fix these issues as you go forward. The main focus of the upcoming assignments will be SELECT queries on tables populated by INSERT statements, so you have some time to iron out the bugs.
If your attempts to improve insert performance went very poorly, you can always revert your codebase to use the original heap-file implementation. (Don't forget to retain your pin/unpin work, and your code to support updates and deletes.) None of the remaining assignments require fast tuple insertion; your only long-term penalty will be waiting slightly longer for your inserts to complete. We would still encourage you to find and fix your bugs, though, since some of the challenge assignments will definitely be painful without improved insert performance.
NanoDB's query evaluator uses a demand-driven pipeline. All plan-nodes must operate in the same way for this to work properly. You can refer to the API documentation for specific methods on edu.caltech.nanodb.plannodes.PlanNode (the abstract base-class of all plan-nodes), although the description below gives an overview of how it all fits together.
When a PlanNode subclass is constructed, various essential details are passed to the constructor to initialize the node. All plan-nodes are allowed to have up to two child plan-nodes, available as the leftChild and rightChild fields in the PlanNode class. These fields are marked protected so that all subclasses can access them. For such plan-nodes, the child plans are usually passed to the constructor. Note that there are also plan-nodes that do not require any children, such as the file-scan node. In these cases, the constructor specifies the necessary details for the plan node.
As with relational algebra operations, every plan-node specifies the schema of its output tuples. (Every node also has an associated cost of evaluation, which we will use in the next assignment. For now you can ignore the cost information; it will likely be null for the time being.) These values must be computed once an entire query plan is assembled; this is done by calling the prepare() method on the root plan-node. This operation, as is the case with most plan-node operations, will recurse down the plan-tree so that all nodes have a chance to prepare themselves. Note that the query evaluator expects that all plans have already been prepared before they are handed to the evaluator.
Before a plan-node can produce any rows, its initialize() method must be invoked. The initialize() method sets up evaluation-time resources that the plan-node requires, so that the next invocation of getTuple() will return the first tuple from the plan-node. (Generally, this method invokes super.initialize() to run parent-class initialization, then it initializes its own internal members, and finally it recursively invokes initialize() on its left and right child-nodes, if present.)
The getTuple() method is where the plan-node implements its logic. This method returns tuples generated by the plan-node, one by one in sequence. As long as more tuples are available, this method will return each tuple in sequence, but once there are no more tuples, getTuple() will start returning null. (It is not an error to call getTuple() again after it has returned null, but it shouldn't be necessary, so your code is strongly discouraged from depending on this behavior.)
Many plan-nodes must call getTuple() on their children multiple times before they can return the next tuple to their caller. You can look at the SelectNode class' getTuple() method for a simple example of this; a minimal sketch appears below. Some plan-nodes must even consume all data from their children before producing any results, such as in the case of sorting or hash-based grouping.
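For example, a filtering node follows an "advance until the predicate passes" pattern. The sketch below is a simplification for illustration; the predicate field and the isTupleSelected() helper are assumptions about the node's internals, so check the real SelectNode class rather than copying this verbatim:

public Tuple getTuple() {
    while (true) {
        // Pull the next candidate tuple from the child plan.
        Tuple tup = leftChild.getTuple();
        if (tup == null)
            return null;          // Child is exhausted; so are we.

        // Return the tuple only if it satisfies the predicate.
        if (predicate == null || isTupleSelected(tup))
            return tup;

        // Otherwise, loop around and try the next tuple.
    }
}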
Some plan-nodes must also iterate over the tuples from their children multiple times in order to produce their results. The nested-loop join node is a perfect example of this; for each tuple produced by the left subplan, the plan-node must iterate over all tuples produced by the right subplan. A subplan can be restarted by calling initialize() on it again to reset it to the beginning of its tuple-sequence. For some plan-nodes such as the file-scan node, resetting to the beginning is a simple matter of going back to the start of the tuple file. However, other plan-nodes may not store their results; such nodes must recompute their results from scratch every time the node is restarted. (NanoDB doesn't currently implement any external-memory algorithms, but if a plan-node creates any intermediate data files when computing its results, subsequent calls to initialize() may also need to truncate or delete these intermediate files.)
When the evaluator is finished executing a plan, it calls cleanUp() on the root plan-node. Each plan-node performs its own cleanup, and then calls cleanUp() on its child node(s). Most plan-nodes require no cleanup, but this step is critical when a node uses external memory for computing results, so that temporary files can be deleted.
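Putting the whole lifecycle together, a consumer of a plan follows this protocol. This is only an illustrative sketch, not NanoDB's actual evaluator code; the process() call stands in for whatever the consumer does with each tuple:

plan.prepare();       // Compute output schemas (and, later, costs) for the whole tree.
plan.initialize();    // Set up evaluation-time state in every node.

Tuple tup;
while ((tup = plan.getTuple()) != null)
    process(tup);     // Hypothetical consumer of each result tuple.

plan.cleanUp();       // Release resources, e.g. delete any temporary files.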
NanoDB is currently unable to execute anything except the simplest SQL statements, because it has no way of generating more complex execution plans from a SQL statement. This is one of your major tasks for the assignment.
NanoDB allows different planners to be plugged into the database by providing a Planner interface (in the edu.caltech.nanodb.queryeval package). The two most relevant methods are:
SelectNode makeSimpleSelect(String tableName, Expression predicate,
                            List<SelectClause> enclosingSelects)

This method is used by UPDATE and DELETE statements to implement the optional WHERE clause. These DML statements require that the tuples produced by the plan can be used as L-values, which is reflected in how this method is currently implemented. The "simplest" planner also uses makeSimpleSelect() to handle "SELECT * FROM t WHERE ..." queries.
PlanNode makePlan(SelectClause selClause,
                  List<SelectClause> enclosingSelects)

This method produces a query plan for a standard SELECT ... FROM ... WHERE statement. Every SELECT statement is parsed into a SelectClause that contains all components of the statement, including the list of values in the SELECT clause, the tree of join expressions in the FROM clause, and the predicate in the WHERE clause. (SelectClause, FromClause, etc. are in the queryast - "Query AST" - package.)
Note that every query plan produced by a planner implementation must be prepared by the planner. The query evaluator expects that the planner has completed this step.
This week you will create a planner class called SimplePlanner, starting with the SimplestPlanner implementation. Copy the SimplestPlanner.java file to SimplePlanner.java, then modify the class to be called SimplePlanner. (Do not modify SimplestPlanner.)
NanoDB can be configured to use a specific planner as its default planner using the nanodb.plannerClass property. The easiest way to make sure your new planner is used is to go to the ServerProperties class (in the edu.caltech.nanodb.server.properties package) and change the class name that is specified in the DEFAULT_PLANNER_CLASS constant. Just change "Simplest" to "Simple", and your new planner class will be used once you have recompiled your project.
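The change should look something like this. The package in the class name below is an assumption based on where the Planner interface lives; keep whatever value the constant already holds, changing only "Simplest" to "Simple":

// In the ServerProperties class (edu.caltech.nanodb.server.properties):
public static final String DEFAULT_PLANNER_CLASS =
    "edu.caltech.nanodb.queryeval.SimplePlanner";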
Next, update SimplePlanner.makePlan() so that it supports all of the following features:
- SELECT statements that involve zero or more tables being joined together. (For example, "SELECT 3 + 2 AS five;" is a valid SQL query. The project plan-node is able to support situations where it has no child plan, and no expression references a column name.) Join predicates may appear in the WHERE clause, or in an ON clause.
- Inner and outer joins, written with the "SELECT ... FROM t1, t2, ..." syntax, as well as joins with an ON clause. You do not need to support NATURAL joins or joins with the USING clause in this planner (although this is available as an extra-credit task).
- Subqueries in the FROM clause.
- Grouping and aggregation. A SELECT expression may include multiple aggregate operations. You are encouraged to use the supplied HashedGroupAggregateNode class. You will need to provide some limited validation of these expressions; details are given in a subsequent section.
- ORDER BY clauses, using the supplied edu.caltech.nanodb.plannodes.SortNode class.

You are not required to support these features in this planner (they are part of a future assignment):

- Subqueries in the SELECT or WHERE clauses.

Because we are not supporting these features, you can ignore the enclosingSelects arguments to the Planner functions. We will use them in a future assignment.
The above is quite a list of features, but they can be implemented step by step, often in parallel by different teammates, and they shouldn't generate too much instability as you complete and incorporate each feature. If your team works in parallel, make sure to keep your local repository up to date with your team's repository, so that your codebases don't diverge too widely.
Here are some additional notes and guidelines:
You should try to generate the minimal plan necessary, but don't go overboard trying to achieve this. For example, only generate a Project plan-node if the SELECT clause specifies anything other than a bare * (i.e. don't add one for "SELECT * FROM ..."). Similarly, only generate a Select plan-node if the query specifies a WHERE predicate.
This is a simple query planner. Your plan doesn’t need to be efficient. It just needs to produce the correct results. We will worry about efficiency in a future assignment.
Note that some FromClause objects will include a join predicate and others will not, depending on how the SQL statement is written. For example, "SELECT * FROM t1, t2 WHERE t1.a = t2.a" will generate a different plan tree than "SELECT * FROM t1 JOIN t2 ON t1.a = t2.a", although the rows retrieved will be identical.
In your join implementation, when you need to pad a tuple with NULL values, see the TupleLiteral.ofSize(int numCols) static method (in the expressions package). You may even want to cache this "all-NULLs" tuple in your plan node, so that evaluating outer joins will be a bit more efficient.
NanoDB has "XxxxUtils" classes all over the codebase, which often contain very helpful operations for you to leverage. For example, the plannodes package contains a helper class PlanUtils, with a static method PlanNode addPredicateToPlan(PlanNode plan, Expression pred). This method takes a plan and a predicate, and applies the predicate to the plan, returning an updated plan:
- If the plan is a SelectNode, the specified predicate will be added to the node's predicate.
- If the plan is not a SelectNode, then a new SimpleFilterNode will be added to the top of the plan, with the specified predicate.

This function should make it very easy to implement certain parts of your planner.
Always be on the lookout for these useful utility classes!
As discussed in class, SQL SELECT statements can have up to eight different clauses, depending on the functionality supported by the database server:
- WITH (not supported by NanoDB)
- SELECT
- FROM
- WHERE
- GROUP BY
- HAVING
- ORDER BY
- LIMIT / OFFSET (not supported by NanoDB)

These clauses (if present) are applied in a specific order:
1. WITH (not supported by NanoDB)
2. FROM
3. WHERE
4. GROUP BY
5. HAVING
6. SELECT
7. ORDER BY
8. LIMIT / OFFSET (not supported by NanoDB)

The planner you create can follow a straightforward approach, particularly since all plan-nodes derive from the same PlanNode base-class:
PlanNode plan = null;
if (FROM-clause is present)
plan = generate_FROM_clause_plan();
if (WHERE-clause is present)
plan = add_WHERE_clause_to(plan);
if (GROUP-BY-clause and/or HAVING-clause is present)
plan = handle_grouping_and_aggregation(plan);
if (ORDER-BY-clause is present)
plan = add_ORDER_BY_clause_to(plan);
// There's always a SELECT clause of some sort!
plan = add_SELECT_clause_to(plan);
You can see that it should be straightforward to build up the full plan based on what is present. This also has the benefit that the plan will not include plan-nodes for clauses that are absent.
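Here is a hedged Java rendering of the same skeleton. The SelectClause accessor names (getFromClause(), getWhereExpr(), getGroupByExprs(), getHavingExpr(), getOrderByExprs(), getSelectValues()) and the SortNode/ProjectNode constructor shapes are assumptions to verify against your codebase, and the makeFromPlan()/handleGroupingAndAggregation() helpers are yours to write:

PlanNode plan = null;

if (selClause.getFromClause() != null)
    plan = makeFromPlan(selClause.getFromClause());

if (selClause.getWhereExpr() != null)
    plan = PlanUtils.addPredicateToPlan(plan, selClause.getWhereExpr());

// Aggregates in the SELECT list (with no GROUP BY) also need this step.
if (!selClause.getGroupByExprs().isEmpty() || selClause.getHavingExpr() != null)
    plan = handleGroupingAndAggregation(plan, selClause);

if (!selClause.getOrderByExprs().isEmpty())
    plan = new SortNode(plan, selClause.getOrderByExprs());

// There's always a SELECT clause of some sort!
plan = new ProjectNode(plan, selClause.getSelectValues());

plan.prepare();    // The evaluator expects a prepared plan.
return plan;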
The EXPLAIN Command

Like many databases, NanoDB provides an EXPLAIN command that can be used to see what the database decides to do when you run a query. You can use EXPLAIN to display the plans generated by your planner, to see if they actually make sense. For example, if you have a table like this:
CREATE TABLE t ( a INTEGER, b VARCHAR(20) );
You can type "EXPLAIN SELECT b FROM t WHERE a < 5;", and (once your planner works) NanoDB will respond with something like this:
Explain Plan:
Project[values: [T.B]] cost is unknown
FileScan[table: T, pred: T.A < 5] cost is unknown
(The "cost is unknown" because you haven't yet implemented plan costing or statistics collection. You will do this next time.)
As discussed in class, grouping and aggregation are challenging to translate from SQL into relational algebra, because the SELECT clause mixes grouping/aggregation and projection, and the HAVING clause mixes selection and aggregation. The SELECT and HAVING expressions must be scanned for aggregate expressions so that the grouping plan-node can be initialized correctly, and also so that the aggregates may be removed and replaced with placeholder variables.
Additionally, it is an error for aggregates to appear in the WHERE clause, the ON clause, or in GROUP BY expressions, so we need to scan these as well to ensure that they contain no aggregates.
This means we need a way to traverse expression trees, and possibly to modify them as we traverse them.
NanoDB's expressions package includes a mechanism for traversing and transforming expression trees. The Expression base-class has a traverse() method that performs a traversal of an entire expression hierarchy. The method takes an ExpressionProcessor implementation, with an enter(Expression) method and a leave(Expression) method; as each expression node is visited, the enter() method is called with that node as an argument, then the node's children are traversed (again with enter() and leave() being called), and then finally the leave() method is called with that node as an argument.
You will also note that leave() returns an Expression; this allows the ExpressionProcessor implementation to mutate the expression as it is traversed. If an expression processor wants to replace the current node with some other expression, it simply returns the replacement from leave(), and the original node will be replaced.
This functionality does require that you use these methods in a specific way. For example, if you were examining select-clause expressions for aggregates, you would need to do something like this:
for (SelectValue sv : selectValues) {
// Skip select-values that aren't expressions
if (!sv.isExpression())
continue;
Expression e = sv.getExpression().traverse(processor);
sv.setExpression(e);
}
Note that the original expression is extracted, traversed, and then replaced by whatever the traverse() call returns.
If you want to see what an ExpressionProcessor implementation might look like, you should look at the SymbolFinder nested-class inside of the Expression class. This class is used by the Expression.getAllSymbols() method. Note that the SymbolFinder is read-only, so it does not have to be used precisely in the manner described above. The one your team writes will need to be used in the above manner, since it may modify the expression passed to it.
Function calls are certainly a kind of expression, represented by the FunctionCall class in the expressions package. You can call the FunctionCall.getFunction() method to retrieve the specific function that is being invoked, and see if it is an instance of an aggregate function. Something like this:
if (e instanceof FunctionCall) {
FunctionCall call = (FunctionCall) e;
Function f = call.getFunction();
if (f instanceof AggregateFunction) {
... // Do stuff
}
}
Of course, if you wish to replace an aggregate function call with a column-reference, like the approach described in class, you can look at the ColumnValue expression class. It takes a ColumnName object that holds the column name, and optionally the table name. (This ColumnName class also represents wildcard expressions; you can read the Javadocs for the details.)
You will need to create an ExpressionProcessor implementation that identifies aggregate functions in expressions, and maps each one to an auto-generated name. Therefore the class will need some kind of mapping from auto-generated names to the corresponding aggregates they represent. This processor should be used on the SELECT expressions and the HAVING expressions, for obvious reasons. But it can also be used on the WHERE, GROUP BY and ON clauses (if present) to ensure that they don't contain aggregates. (It would be an error for these clauses to contain aggregates.)
Finally, note that an aggregate function call also has an expression as its argument. It would be an error for an aggregate's argument to also contain an aggregate function call (e.g. MAX(AVG(x)) is an invalid expression).
You should throw an InvalidSQLException if you encounter any WHERE or ON clauses containing aggregates, GROUP BY expressions containing aggregates, or aggregate function calls that contain other aggregate function calls in their arguments. Make sure the error message is suitably descriptive for the situation you encounter.
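To make this concrete, here is a minimal sketch of such a processor. The class name, the auto-generated naming scheme, and the single-argument ColumnName constructor are assumptions for illustration; only the ExpressionProcessor protocol and the classes named above come from the description:

public class AggregateFinder implements ExpressionProcessor {
    /** Maps each auto-generated name to the aggregate call it replaced. */
    private Map<String, FunctionCall> aggregates = new LinkedHashMap<>();

    /** True while traversing inside an aggregate call's argument. */
    private boolean insideAggregate = false;

    public void enter(Expression e) {
        if (isAggregate(e)) {
            if (insideAggregate) {
                throw new InvalidSQLException(
                    "Aggregate calls cannot be nested: " + e);
            }
            insideAggregate = true;
        }
    }

    public Expression leave(Expression e) {
        if (isAggregate(e)) {
            insideAggregate = false;

            // Record the aggregate under an auto-generated name, then
            // replace it with a placeholder column-reference.
            String name = "#agg" + aggregates.size();
            aggregates.put(name, (FunctionCall) e);
            return new ColumnValue(new ColumnName(name));
        }
        return e;
    }

    private boolean isAggregate(Expression e) {
        return e instanceof FunctionCall &&
               ((FunctionCall) e).getFunction() instanceof AggregateFunction;
    }

    public Map<String, FunctionCall> getAggregates() {
        return aggregates;
    }
}

The recorded mapping is exactly what the grouping plan-node needs in order to compute each aggregate, and the placeholder columns let the projection above it refer to the computed values.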
The nested-loop join algorithm is very easy to state. For two tables r and s, the nested-loop join algorithm does the following:
for tr in r:
for ts in s:
if pred(tr, ts):
add join(tr, ts) to result
(Table r is frequently called the outer relation, and s is called the inner relation, for obvious reasons. In NanoDB, as in many other database implementations, the left child of a join node is the outer relation and the right child is the inner relation.)
What makes this algorithm complicated to implement is using it in a demand-driven pipeline, where the plan-node must return the next tuple anytime getTuple() is called. Because of this, the plan-node must store internal state that tracks where it is in its processing, so that it can pick up where it left off when getTuple() is called again.
A partial implementation of this join is provided in the NestedLoopJoinNode class (in the plannodes package). The implementation currently has the innermost if-statement and associated join-tuple operation, but the loop itself is not implemented. The class has three fields that are used for keeping track of the execution state:
- done is a flag that indicates when the plan-node has completely consumed its inputs. The flag is initialized to false in the initialize() method. Once done is true, the getTuple() method will return null.
- leftTuple is the most recent tuple retrieved from the left child-plan.
- rightTuple is the most recent tuple retrieved from the right child-plan.

(Note that there are probably cleaner ways to implement this. If you come up with something nicer then feel free to incorporate your refinements into your solution. We may even use them in a future class.)
The getTuplesToJoin() method updates leftTuple and/or rightTuple based on the next pair of tuples that should be considered. This is where you must implement the nested-loop logic that allows the join to compare every possible pair of tuples. You will need to iterate over the rows in the left and right child nodes, advancing the state of the nested loops every time getTuplesToJoin() is called. When you reach the end of the inner relation (i.e. when rightChild.getTuple() returns null), you can call initialize() on the right child to reset it back to the start.
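As a starting point, here is a sketch of the inner-join version of this advance logic. It assumes getTuplesToJoin() returns true when a new candidate pair has been staged (check the method's actual signature in your codebase), and it does not yet handle outer joins:

private boolean getTuplesToJoin() {
    // On the very first call, prime the outer (left) tuple.
    if (leftTuple == null) {
        leftTuple = leftChild.getTuple();
        if (leftTuple == null) {
            done = true;              // The left relation is empty.
            return false;
        }
    }

    // Advance the inner (right) relation to the next tuple.
    rightTuple = rightChild.getTuple();
    if (rightTuple == null) {
        // Inner relation exhausted:  advance the outer relation and
        // restart the inner relation from its beginning.
        leftTuple = leftChild.getTuple();
        if (leftTuple == null) {
            done = true;              // Every pair has been considered.
            return false;
        }

        rightChild.initialize();
        rightTuple = rightChild.getTuple();
        if (rightTuple == null) {
            done = true;              // The right relation is empty.
            return false;
        }
    }

    return true;
}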
Remember that you must also support left-outer joins with this plan node. You will need to modify the logic, and you may need to add fields to the class to properly track the execution state.
Once you have completed this node, you can use it in your planner to support the various join operations specified in the previous section.
Once you have completed these tasks, it would be good to know that they are actually correct. To this end, we can employ an automated testing framework to run tests against the database code and verify that everything works correctly. The codebase contains several different kinds of tests; this assignment focuses on the SQL tests described below.
In the src/test/java/edu/caltech/test/nanodb/sql directory is a simple framework for issuing queries against the database and comparing the results against a set of expected results. This allows us to do very basic integration testing against the NanoDB query processor. There are also a number of classes that exercise various aspects of the SQL queries you will support this week.
You will note that these classes' constructors specify properties defined in the test_sql.props file; each property specifies the SQL initialization code that is necessary to run the corresponding test cases. (You can also issue your own initialization commands within the class if you wish; quite a few tests do this instead.)
Create a new class TestSimpleJoins in this directory (you can copy one of the other classes to start with), which exercises your inner- and outer-join support and checks the results. Your testing should be reasonably complete; for example, it would be useful to verify the following:

- Inner and outer joins between two non-empty tables, where some rows match and others do not
- Joins where the left table is empty, the right table is empty, or both tables are empty
- Outer joins where unmatched rows must be padded with NULL values

It shouldn't be too hard to construct a few small, simple input tables for these kinds of cases, and then construct the corresponding test code. (It will be tedious, but that's the price you pay for confidence that your code works.) A hypothetical skeleton appears below.
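This skeleton is modeled on the existing SQL test classes, but the base-class helpers, annotation style, table names, and property name shown here are assumptions; check one of the classes you are copying for the exact API before relying on this:

@Test(groups = {"sql"})
public class TestSimpleJoins extends SqlTestCase {
    public TestSimpleJoins() {
        // Property in test_sql.props holding the setup SQL (name assumed).
        super("setup_testSimpleJoins");
    }

    public void testInnerJoinBothNonEmpty() throws Throwable {
        CommandResult result = server.doCommand(
            "SELECT * FROM test_sj_t1 t1 JOIN test_sj_t2 t2 ON t1.a = t2.a",
            true);

        // Expected rows for the hypothetical setup data.
        TupleLiteral[] expected = {
            new TupleLiteral(1, 10, 1, 100),
            new TupleLiteral(2, 20, 2, 200)
        };
        assert checkUnorderedResults(expected, result);
    }
}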
Don't be sloppy. Update the comments in the testing code to match what your code is doing. Also, make sure your tests pass. After you run the Maven test target, the results will show up in the test results for your project. If there are test failures, you can browse to the details and then fix them. All tests should pass at the end of Project 2.
This assignment has many parts that can be done in parallel, which should make it easier to divide it up amongst your teammates. Following is a general description of how you might tackle the project's tasks.
It is important to start the SimplePlanner implementation right away, since most functionality will be exposed through the SimplePlanner. Fortunately, it should also be very straightforward to implement support for SELECT (projection) and WHERE (selection) clauses for queries involving zero or one table (i.e. a simple FROM clause).
A subsequent step for the SimplePlanner will be to implement the code that converts FromClause objects into their corresponding plan-node trees. This can be implemented before the nested-loop join implementation is completed, but SQL queries containing joins will not execute properly until the NestedLoopJoinNode implementation is finished.
The NestedLoopJoinNode implementation is another task that should be started early, since it is the basis of how joins are evaluated. It can be done independently, in parallel with the planner work. Once this node is complete, you should have support for most basic SQL queries, as long as they don't include GROUP BY or HAVING clauses.
Once the NestedLoopJoinNode is done, and the planner is able to use the node in query planning, the required unit-tests can be written. This can also proceed in parallel with other tasks.
The last piece of the puzzle is supporting the GROUP BY and HAVING clauses. Work on these features can progress once the SimplePlanner supports the most basic queries (joins don't need to work for grouping and aggregation to be usable). You should probably start these features early in your efforts, since they tend to be the most complicated features for students to get working properly.
The bulk of the work is in creating an implementation of the ExpressionProcessor interface to identify and extract aggregate functions from an expression, and to report errors like nested aggregate function calls. This task can be started before any other functionality is complete, but you won't be able to use this tool until the SimplePlanner can handle basic queries (e.g. single-table queries).
Once this task is done, the SimplePlanner must be updated to support grouping and aggregation using your ExpressionProcessor implementation. This task tends to be straightforward to complete, although you will need to think about how to manage the necessary details.
If you have gotten to this point and you now find yourself heartbroken that the assignment is over, you should add support for the LIMIT and OFFSET clauses by implementing a LimitOffsetNode, and then incorporating this node into your execution plan when the SQL specifies these clauses. For full credit, you must also write unit tests that fully exercise this functionality. (Up to +3pts without tests; up to +10pts with tests.)
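The core of such a node's getTuple() might look like the following sketch. The field names (limit, offset, and the two counters) are hypothetical; the node simply skips the first offset tuples from its child and then returns at most limit tuples:

public Tuple getTuple() {
    // Skip the first `offset` tuples produced by the child plan.
    while (tuplesSkipped < offset) {
        if (leftChild.getTuple() == null)
            return null;          // Child ran out before reaching the offset.
        tuplesSkipped++;
    }

    // Then return at most `limit` tuples (treat limit == 0 as "no limit").
    if (limit > 0 && tuplesReturned >= limit)
        return null;

    Tuple tup = leftChild.getTuple();
    if (tup != null)
        tuplesReturned++;
    return tup;
}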
Once you are finished with the above tasks, your team needs to complete the design document for this assignment, available in your repository at "doc/lab2design.txt".
When you are ready to submit your code, and your design document includes your team's answers to the design questions, you need to tag the version of your code that you want us to grade. Only one member of your team should perform these steps. If multiple people on your team do this, it will mess up your team's repository.
The first thing to do is tag your submission so that we can retrieve the correct version of your work:
git tag hw2
This will tag the current version of your repository with the name "hw2". Once you have done this, you can push all of your work, including this Git tag, to your GitLab repository:
git push --tags
The --tags argument will cause all tags in your local repository to be replicated into your team's GitLab repository. Once this is done, we will be able to retrieve the version of your work marked with this tag, so that we can grade it.
Note that if you decide to resubmit your work (e.g. because you found a bug that you needed to fix), you cannot reuse the same tag name. Instead, use an updated tag name such as "hw2-2", "hw2-3", etc., so that we can still find your submission. Also, don't forget to push the new tag to your submission repository.
Finally, one team member should complete this short document and submit it on the course website. Only one teammate should submit it, so that we don't get confused! You can list the Git repository hashes of your tags with this command:
git show-ref --tags