In this assignment you have these tasks to complete:

The above tasks are described in the following sections.

There are also some somewhat challenging but mostly grungy extra-credit items for the intrepid, described at the end of the assignment.

Advanced Subqueries

So far, our planners have ignored subqueries in the SELECT, WHERE and HAVING clauses. In this assignment we will improve our planner to support these queries as well.

It is actually quite straightforward to add basic support for non-correlated subqueries. All we need to do is to have special kinds of expressions that support subqueries, and then it should be no problem to evaluate them.

In the expressions package you will notice the SubqueryOperator class, which is how an expression can include a subquery of some type. The SubqueryOperator provides support for retrieving the SelectClause corresponding to the SQL of the subquery, as well as storing and retrieving an execution plan generated from the subquery.

The SubqueryOperator has three subclasses:

These classes are used by the parser when a query specifies the corresponding SQL expression. Their implementation is provided for you to use.

Step 1: Planning Subqueries

If you try to run a SQL query with a subquery in the SELECT, WHERE or HAVING clause, you will initially get an error that the subquery has no execution plan. This is (surprise) because we haven't planned them yet.

Therefore, the first task is to create an ExpressionPlanner class that takes care of both of these steps. This class must traverse an entire expression, planning any subqueries that appear in the expression. Since this class must traverse expressions, it will implement the ExpressionProcessor interface.

Put your class in the queryeval package. Here are some additional specifications:

You must also write complete Javadocs for your entire expression-planner class, but keep in mind that you will need to make more changes to implement correlated evaluation, so you may want to hold off writing all the docs until you are finished.

For the purposes of this assignment, "complete Javadocs" means a class-level comment describing what the class is and how it works, a comment on every method describing the purpose of the method, and a comment on every field describing the purpose of the field. You do not need to write docs for parameters, return-values or exceptions, although we will of course be filled with joy if you do.

Once you have your initial ExpressionPlanner implementation, incorporate it into your query planner so that you can execute non-correlated subqueries that appear in the SELECT, WHERE and HAVING clauses.

Step 2: Costing Expressions with Subqueries

So far, our costing model has assumed that all expressions are basically the same cost to evaluate, and that the cost only involves CPU time. We know this is not true, particularly when expressions contain subqueries.

Create an ExpressionCostCalculator class that plan-nodes can use to compute the cost of expressions. This class will also implement the ExpressionProcessor interface so that it may traverse expressions, computing a cost for them. As before, put this class in the queryeval package.

Once this class is done, update the prepare() method of these plan-nodes to use the expression-cost calculator:

These are the main plan-nodes where scalar subqueries are likely to appear, so updating them should greatly improve the cost estimates of queries.

As before, write complete Javadocs for your entire expression-cost calculator class.

Step 3: Correlated Evaluation

The concept of correlated evaluation has come up frequently in both CS121 and CS122. It is a very powerful mechanism for evaluating subqueries that are specified in terms of the enclosing query, but the cost of that power is that evaluation is slow because the inner query must be evaluated once for each row being considered by the enclosing query. Even though this mechanism is slow, we still prefer to allow users to specify correlated queries because sometimes the correlated version of a query is much clearer than the equivalent decorrelated version. If the database can decorrelate the query automatically then everybody wins! In other cases (and there are always cases that a database can't decorrelate), the database must use the slow correlated evaluation mechanism.

There are certainly various ways to implement correlated evaluation within a query engine. NanoDB takes a very basic approach; it isn't particularly elegant (some might even say "primitive"), but it gets the job done. This approach is described below.

Environments

Every plan node has its own environment in which expressions are evaluated. Every time a plan-node must operate on a tuple, the node adds the tuple to its environment and then evaluates expressions with respect to that environment. You can see this in several plan nodes, but the ProjectNode (plannodes package) is a good example. In its projectTuple() method, the input tuple is added to the node's environment, and then the various project-expressions are evaluated against that environment.

Note that every plan node has its own environment, even within the same execution plan. There is not one environment shared across the entire execution plan; each node in the plan has its own environment.

The environments themselves are implemented with the Environment class (expressions package). One neat feature of this class is that environments may be given one or more parent-environments; if a column reference doesn't appear within the environment, it will pass the reference to each of its parent environments. This environment-chaining allows one plan-node to access the tuples that are generated by some other plan-node, if the nodes' environments are connected to each other.

Environments and Correlated Evaluation

In correlated evaluation, the subquery's plan must be able to see the tuples produced by the enclosing query's plan. This means that the subquery's nodes must reference some parent environment from the enclosing query. It is probably unnecessary for every subquery-node to reference a parent environment, but it is definitely easiest to do it this way, and it will certainly not cause any harm. NanoDB's plan nodes provide this operation via the descriptively named PlanNode.addParentEnvironmentToPlanTree(); this method takes an Environment object and adds it as a parent environment to every node in the tree.

This suggests that our planner can generate a plan for each subquery, and then set that subquery's parent-environment to be the environment of some node in the outer query's plan. But which node should be used? This depends on where the subquery appears. Let's look at some examples.

-- Compute the total account balance at each branch.
SELECT branch_name,
       (SELECT SUM(balance) FROM account a
        WHERE a.branch_name = b.branch_name) total_branch_balances
FROM branch b;

In this case, the subquery is part of a project operation. Thus, in this query's plan, the subquery must use the outer query's project-node environment as its parent environment.

-- For each branch, find the account(s) with the largest balance.
SELECT acct_id, branch_name, balance
FROM account a
WHERE balance = (SELECT MAX(balance) FROM account b
                 WHERE b.branch_name = a.branch_name);

In this case, the subquery is part of the selection criteria, which will be applied as part of some select plan-node (likely a file-scan, or perhaps a simple select node). Therefore, in this query's plan, the subquery must use the outer query's select-node environment as its parent environment.

This illustrates the basic approach.

Update your ExpressionPlanner implementation to configure the environments of correlated subqueries so that correlated evaluation is performed correctly. Because NanoDB is a bit primitive, you will want to implement something like this:

Try to be efficient in your implementation. For example, the expression-planner can keep track of whether it has encountered a subquery or not; if there are no subqueries in the expression, then there is no need to set the plan node's environment.

Planning and Correlated Evaluation

The last detail you will need to deal with is making sure that the planner knows the full context when planning subqueries. You may have noticed that the Planner.makePlan() method takes two arguments: a query to plan, and a list of the enclosing queries. Make sure that when you plan subqueries, you pass an appropriate list of enclosing queries into this argument. This is not particularly difficult to construct; you can use Java's collection classes to do this very easily. (And make a copy; don't mutate the argument by accident, which is easy to do in Java.)

Suggested Approach

This assignment has two main tasks that can be done in parallel. One effort can be focused on planning subqueries and supporting correlated evaluation, and the other effort can focus on improving the expression-costing model.

Unfortunately, the subquery-planning / correlated-evaluation work cannot really be decomposed into smaller units of work that can be done independently. The reason for this is that the correlated-evaluation work builds on the earlier work of planning these subqueries, and really only makes sense in that context. The coding itself isn't very involved, but conceptually it can be a bit challenging to work out how it all hangs together.

The expression-costing improvements are conceptually simpler than the subquery work, and are easier to parallelize as well. Computing / accumulating the cost of an expression is straightforward to implement; once this is done, the three plan-nodes may be updated in parallel by one or more teammates.

All teammates can help with testing, as the provided tests are rudimentary at best, and do not cover very many usage scenarios.

Submitting Your Assignment

Once you are finished with the above tasks, your team needs to complete the design document for this assignment, available in your repository at "doc/lab5design.txt".

When you are ready to submit your code, and your design document includes your team's answers to the design questions, you need to tag the version of your code that you want us to grade. Only one member of your team should perform these steps. If multiple people on your team do this, it will mess up your team's repository.

The instructions are the same as for the previous assignment, although you will use a hw5 tag this week. Again, note that if you decide to resubmit your work (e.g. because you found a bug that you needed to fix), you cannot reuse the same tag name again! Therefore, use an updated tag name "hw5-2", "hw5-3", etc. so that we can still find the submission. Also, don't forget to push the new tag to your submission repository.

Finally, one team member should submit this short document and submit it on the course website. Only one teammate should submit this document so that we don't get confused! You can list the Git repository hashes of your tags with this command:

git show-ref --tags

Extra Credit

Here are some ideas for extra-credit tasks for this assignment.

Extra Credit II: Decorrelating Queries

Query decorrelation is a massive topic, and covering the most powerful techniques would likely be suitable for a CS123 project. If you want to explore some of the basic concepts, you can implement one or more of the following basic query decorrelations by transforming the query's Abstract Syntax Tree (AST). You can tell if the query is being properly decorrelated by looking at the explain-plan output; besides seeing a change in the generated plan, you should see a dramatic reduction in the execution cost for most simple queries. Of course, your tests should continue to pass!

You will receive +10 points for each of the below decorrelations that your code supports. If you extend your nested-loop join implementation to support both semijoin and anti-semijoin then you will receive a further +5 points.

Semijoin and Anti-Semijoin Operators

Some of the more powerful decorrelations, particularly involving IN/NOT IN and EXISTS/NOT EXISTS operators, are made very easy with a semijoin and anti-semijoin (a.k.a. antijoin) plan node. If you wish to implment some of the below decorrelations, you will need to update your NestedLoopJoinNode implementation to implement semijoins and anti-semijoins. The changes to the logic are straightforward.

Scalar Subquery in SELECT Clause

Here is an example query:

SELECT a, ..., (SELECT ... FROM t2 WHERE c = t1.d) sq FROM t1 ... WHERE ...;

The subquery sq is a scalar subquery, correlated on the expression "c = t1.d". It can be transformed into the equivalent query:

SELECT a, ..., sq
FROM (t1 ...) LEFT JOIN (SELECT c, ... sq FROM t2) ON t2.c = t1.d
WHERE ...;

(Note: This transformation is not precisely correct, particularly if the subquery includes a COUNT aggregate operation.)

The correlation expression can be used as a join condition; it must be an equijoin or else the join operation will be too complicated for our simple approach to decorrelate.

Because the original correlated subquery is a scalar subquery, we know that the decorrelated version will generate only one column of data. The hardest part of the decorrelation is making sure that the subquery's result is named in such a way that the project can easily include it. Techniques similar to those in grouping/aggregation would be useful.

Note that the above transformation can be performed against the query AST before planning is started.

Correlated IN-Subquery in WHERE Clause

Here is an example query:

SELECT ... FROM t1 ... WHERE a IN (SELECT ... FROM t2 WHERE b = t1.c);

This is an excellent example where the semijoin operator is very useful for decorrelating the query. We can transform this query into the equivalent query:

SELECT ...
FROM (t1 ...) LEFT SEMIJOIN (SELECT ... FROM t2) ON t2.b = t1.c;

This query is easier than the previous one, because we only use the subquery to filter the outer query's rows; we don't need to include any results from the subquery.

Note that a "NOT IN" predicate can also be supported easily by using an anti-join instead of a semi-join. Your implementation should support both decorrelations.

Correlated EXISTS-Subquery in WHERE Clause

Here is an example query:

SELECT ... FROM t1 ... WHERE EXISTS (SELECT ... FROM t2 WHERE a = t1.b);

This query is very similar to the previous one, transforming to:

SELECT ...
FROM (t1 ...) LEFT SEMIJOIN (SELECT ... FROM t2) ON t2.a = t1.b;

Again, a "NOT EXISTS" predicate can also be supported easily by using an anti-join instead of a semi-join. Your implementation should support both decorrelations.