In this assignment you have these tasks to complete:
SELECT
, WHERE
and HAVING
clausesThe above tasks are described in the following sections.
There are also some somewhat challenging but mostly grungy extra-credit items for the intrepid, described at the end of the assignment.
So far, our planners have ignored subqueries in the SELECT
, WHERE
and HAVING
clauses. In this assignment we will improve our planner to support these queries as well.
It is actually quite straightforward to add basic support for non-correlated subqueries. All we need to do is to have special kinds of expressions that support subqueries, and then it should be no problem to evaluate them.
In the expressions
package you will notice the SubqueryOperator
class, which is how an expression can include a subquery of some type. The SubqueryOperator
provides support for retrieving the SelectClause
corresponding to the SQL of the subquery, as well as storing and retrieving an execution plan generated from the subquery.
The SubqueryOperator
has three subclasses:
ScalarSubquery
for scalar subqueriesInSubqueryOperator
for "attr IN (SELECT ...)
" predicates (note that "attr IN (value, value, ...)
" predicates are handled by InValuesOperator
)ExistsOperator
for "EXISTS (SELECT ...)
" predicatesThese classes are used by the parser when a query specifies the corresponding SQL expression. Their implementation is provided for you to use.
If you try to run a SQL query with a subquery in the SELECT
, WHERE
or HAVING
clause, you will initially get an error that the subquery has no execution plan. This is (surprise) because we haven't planned them yet.
Therefore, the first task is to create an ExpressionPlanner
class that takes care of both of these steps. This class must traverse an entire expression, planning any subqueries that appear in the expression. Since this class must traverse expressions, it will implement the ExpressionProcessor
interface.
Put your class in the queryeval
package. Here are some additional specifications:
You will need to support subqueries in the SELECT
, WHERE
and HAVING
clauses.
Report an error if you encounter any subqueries in the GROUP BY
or ORDER BY
clauses. (You may also want to report an "unsupported feature error" if you encounter subqueries in FROM
clause conditions, but this is optional, and worth +3 bonus points.)
Your expression-planner will obviously need a planner to plan subqueries. Be efficient: don't instantiate a new planner; rather, write your ExpressionPlanner
constructor to take a Planner
object to use inside the expression-planner. Then, when your planner uses the expression-planner, it can just pass itself to the expression-planner.
You must also write complete Javadocs for your entire expression-planner class, but keep in mind that you will need to make more changes to implement correlated evaluation, so you may want to hold off writing all the docs until you are finished.
For the purposes of this assignment, "complete Javadocs" means a class-level comment describing what the class is and how it works, a comment on every method describing the purpose of the method, and a comment on every field describing the purpose of the field. You do not need to write docs for parameters, return-values or exceptions, although we will of course be filled with joy if you do.
Once you have your initial ExpressionPlanner
implementation, incorporate it into your query planner so that you can execute non-correlated subqueries that appear in the SELECT
, WHERE
and HAVING
clauses.
So far, our costing model has assumed that all expressions are basically the same cost to evaluate, and that the cost only involves CPU time. We know this is not true, particularly when expressions contain subqueries.
Create an ExpressionCostCalculator
class that plan-nodes can use to compute the cost of expressions. This class will also implement the ExpressionProcessor
interface so that it may traverse expressions, computing a cost for them. As before, put this class in the queryeval
package.
You should come up with some way to scale the cost of an expression based on its complexity. A simple mechanism is adequate; for example, you might increment the CPU cost by 1 every time you enter a new expression. This way, complex expressions will have a higher CPU cost than simple ones.
When your expression-cost calculator comes across a SubqueryOperator
node, extract and accumulate the cost of computing the expression. You can safely assume that any subquery-operators have already been planned and prepared at the point that this code runs.
Once this class is done, update the prepare()
method of these plan-nodes to use the expression-cost calculator:
ProjectNode
- you can traverse the contents of the projectionSpec
collection, accumulating each expression's cost into a "per-row cost." Then, you can multiply that cost by the child plan's estimated number of rows.
IMPORTANT NOTE: Ignore the TODOs in the ProjectNode.prepare()
method. Those notes are for Donnie.
SimpleFilterNode
- you can estimate the cost of predicate
, then multiply that cost by the child plan's estimated number of rows. (Note that this cost is incurred once per row in the child plan, since the predicate must be evaluated for each row considered.)
FileScanNode
- this will basically be identical to SimpleFilterNode
.
These are the main plan-nodes where scalar subqueries are likely to appear, so updating them should greatly improve the cost estimates of queries.
As before, write complete Javadocs for your entire expression-cost calculator class.
The concept of correlated evaluation has come up frequently in both CS121 and CS122. It is a very powerful mechanism for evaluating subqueries that are specified in terms of the enclosing query, but the cost of that power is that evaluation is slow because the inner query must be evaluated once for each row being considered by the enclosing query. Even though this mechanism is slow, we still prefer to allow users to specify correlated queries because sometimes the correlated version of a query is much clearer than the equivalent decorrelated version. If the database can decorrelate the query automatically then everybody wins! In other cases (and there are always cases that a database can't decorrelate), the database must use the slow correlated evaluation mechanism.
There are certainly various ways to implement correlated evaluation within a query engine. NanoDB takes a very basic approach; it isn't particularly elegant (some might even say "primitive"), but it gets the job done. This approach is described below.
Every plan node has its own environment in which expressions are evaluated. Every time a plan-node must operate on a tuple, the node adds the tuple to its environment and then evaluates expressions with respect to that environment. You can see this in several plan nodes, but the ProjectNode
(plannodes
package) is a good example. In its projectTuple()
method, the input tuple is added to the node's environment, and then the various project-expressions are evaluated against that environment.
Note that every plan node has its own environment, even within the same execution plan. There is not one environment shared across the entire execution plan; each node in the plan has its own environment.
The environments themselves are implemented with the Environment
class (expressions
package). One neat feature of this class is that environments may be given one or more parent-environments; if a column reference doesn't appear within the environment, it will pass the reference to each of its parent environments. This environment-chaining allows one plan-node to access the tuples that are generated by some other plan-node, if the nodes' environments are connected to each other.
In correlated evaluation, the subquery's plan must be able to see the tuples produced by the enclosing query's plan. This means that the subquery's nodes must reference some parent environment from the enclosing query. It is probably unnecessary for every subquery-node to reference a parent environment, but it is definitely easiest to do it this way, and it will certainly not cause any harm. NanoDB's plan nodes provide this operation via the descriptively named PlanNode.addParentEnvironmentToPlanTree()
; this method takes an Environment
object and adds it as a parent environment to every node in the tree.
This suggests that our planner can generate a plan for each subquery, and then set that subquery's parent-environment to be the environment of some node in the outer query's plan. But which node should be used? This depends on where the subquery appears. Let's look at some examples.
-- Compute the total account balance at each branch.
SELECT branch_name,
(SELECT SUM(balance) FROM account a
WHERE a.branch_name = b.branch_name) total_branch_balances
FROM branch b;
In this case, the subquery is part of a project operation. Thus, in this query's plan, the subquery must use the outer query's project-node environment as its parent environment.
-- For each branch, find the account(s) with the largest balance.
SELECT acct_id, branch_name, balance
FROM account a
WHERE balance = (SELECT MAX(balance) FROM account b
WHERE b.branch_name = a.branch_name);
In this case, the subquery is part of the selection criteria, which will be applied as part of some select plan-node (likely a file-scan, or perhaps a simple select node). Therefore, in this query's plan, the subquery must use the outer query's select-node environment as its parent environment.
This illustrates the basic approach.
Update your ExpressionPlanner
implementation to configure the environments of correlated subqueries so that correlated evaluation is performed correctly. Because NanoDB is a bit primitive, you will want to implement something like this:
As your planner is planning a query, it can look for subqueries in various expressions (e.g. WHERE
-clause expressions, select-values, etc.). Each time it looks for subqueries, it will be in the context of some specific plan-node it is creating for the outer query's plan.
The planner will use an expression-planner to look for and to plan the subqueries. When it is instantiated, the expression-planner can initialize a new empty Environment
object, which will be the parent of any subquery plans.
Anytime the expression-planner encounters a subquery, it can generate a plan for the subquery, and then set that subquery's parent environment to be the Environment
object it created. (All subqueries in the expression would share the same parent environment.) This will use the PlanNode.addParentEnvironmentToPlanTree()
method.
Once all planning has been completed for the expression, the planner can take the expression-planner's environment object, and make that the environment for the parent plan-node, using the PlanNode.setEnvironment()
method.
Try to be efficient in your implementation. For example, the expression-planner can keep track of whether it has encountered a subquery or not; if there are no subqueries in the expression, then there is no need to set the plan node's environment.
The last detail you will need to deal with is making sure that the planner knows the full context when planning subqueries. You may have noticed that the Planner.makePlan()
method takes two arguments: a query to plan, and a list of the enclosing queries. Make sure that when you plan subqueries, you pass an appropriate list of enclosing queries into this argument. This is not particularly difficult to construct; you can use Java's collection classes to do this very easily. (And make a copy; don't mutate the argument by accident, which is easy to do in Java.)
This assignment has two main tasks that can be done in parallel. One effort can be focused on planning subqueries and supporting correlated evaluation, and the other effort can focus on improving the expression-costing model.
Unfortunately, the subquery-planning / correlated-evaluation work cannot really be decomposed into smaller units of work that can be done independently. The reason for this is that the correlated-evaluation work builds on the earlier work of planning these subqueries, and really only makes sense in that context. The coding itself isn't very involved, but conceptually it can be a bit challenging to work out how it all hangs together.
The expression-costing improvements are conceptually simpler than the subquery work, and are easier to parallelize as well. Computing / accumulating the cost of an expression is straightforward to implement; once this is done, the three plan-nodes may be updated in parallel by one or more teammates.
The ProjectNode
is probably the most complex to update, unless you are already familiar with the NanoDB SQL AST representation.
IMPORTANT NOTE: Ignore the TODOs in the ProjectNode.prepare()
method. Those notes are for Donnie.
The SimpleFilterNode
and FileScanNode
classes should be quite easy to update. The changes in these two nodes will be very similar.
All teammates can help with testing, as the provided tests are rudimentary at best, and do not cover very many usage scenarios.
Once you are finished with the above tasks, your team needs to complete the design document for this assignment, available in your repository at "doc/lab5design.txt
".
When you are ready to submit your code, and your design document includes your team's answers to the design questions, you need to tag the version of your code that you want us to grade. Only one member of your team should perform these steps. If multiple people on your team do this, it will mess up your team's repository.
The instructions are the same as for the previous assignment, although you will use a hw5
tag this week. Again, note that if you decide to resubmit your work (e.g. because you found a bug that you needed to fix), you cannot reuse the same tag name again! Therefore, use an updated tag name "hw5-2
", "hw5-3
", etc. so that we can still find the submission. Also, don't forget to push the new tag to your submission repository.
Finally, one team member should submit this short document and submit it on the course website. Only one teammate should submit this document so that we don't get confused! You can list the Git repository hashes of your tags with this command:
git show-ref --tags
Here are some ideas for extra-credit tasks for this assignment.
As stated earlier, if you implement this assignment against your cost-based join planner, and all relevant SQL tests pass, you will get +10 points on the grade for this assignment.
Although extremely uncommon, it is still possible that there might be subqueries in join conditions. Support these strange creatures as well. For full credit you must include unit-tests that demonstrate this functionality. (+5 points)
You may notice that the various SubqueryOperator
implementations evaluate their subquery every single time the expression is evaluated. This causes NanoDB to perform a significant amount of extra work, since if the subquery is not correlated with an enclosing query then there is no need to evaluate it multiple times. Update the implementation to be more intelligent about when the subquery needs to be reevaluated. For full credit, the InSubqueryOperator
implementation should not directly hold onto the rows generated by the operator; rather, it should use a "materialize" node to cache rows. (+5 points)
Query decorrelation is a massive topic, and covering the most powerful techniques would likely be suitable for a CS123 project. If you want to explore some of the basic concepts, you can implement one or more of the following basic query decorrelations by transforming the query's Abstract Syntax Tree (AST). You can tell if the query is being properly decorrelated by looking at the explain-plan output; besides seeing a change in the generated plan, you should see a dramatic reduction in the execution cost for most simple queries. Of course, your tests should continue to pass!
You will receive +10 points for each of the below decorrelations that your code supports. If you extend your nested-loop join implementation to support both semijoin and anti-semijoin then you will receive a further +5 points.
Some of the more powerful decorrelations, particularly involving IN
/NOT IN
and EXISTS
/NOT EXISTS
operators, are made very easy with a semijoin and anti-semijoin (a.k.a. antijoin) plan node. If you wish to implment some of the below decorrelations, you will need to update your NestedLoopJoinNode
implementation to implement semijoins and anti-semijoins. The changes to the logic are straightforward.
SELECT
ClauseHere is an example query:
SELECT a, ..., (SELECT ... FROM t2 WHERE c = t1.d) sq FROM t1 ... WHERE ...;
The subquery sq
is a scalar subquery, correlated on the expression "c = t1.d
". It can be transformed into the equivalent query:
SELECT a, ..., sq
FROM (t1 ...) LEFT JOIN (SELECT c, ... sq FROM t2) ON t2.c = t1.d
WHERE ...;
(Note: This transformation is not precisely correct, particularly if the subquery includes a COUNT
aggregate operation.)
The correlation expression can be used as a join condition; it must be an equijoin or else the join operation will be too complicated for our simple approach to decorrelate.
Because the original correlated subquery is a scalar subquery, we know that the decorrelated version will generate only one column of data. The hardest part of the decorrelation is making sure that the subquery's result is named in such a way that the project can easily include it. Techniques similar to those in grouping/aggregation would be useful.
Note that the above transformation can be performed against the query AST before planning is started.
IN
-Subquery in WHERE
ClauseHere is an example query:
SELECT ... FROM t1 ... WHERE a IN (SELECT ... FROM t2 WHERE b = t1.c);
This is an excellent example where the semijoin operator is very useful for decorrelating the query. We can transform this query into the equivalent query:
SELECT ...
FROM (t1 ...) LEFT SEMIJOIN (SELECT ... FROM t2) ON t2.b = t1.c;
This query is easier than the previous one, because we only use the subquery to filter the outer query's rows; we don't need to include any results from the subquery.
Note that a "NOT IN
" predicate can also be supported easily by using an anti-join instead of a semi-join. Your implementation should support both decorrelations.
EXISTS
-Subquery in WHERE
ClauseHere is an example query:
SELECT ... FROM t1 ... WHERE EXISTS (SELECT ... FROM t2 WHERE a = t1.b);
This query is very similar to the previous one, transforming to:
SELECT ...
FROM (t1 ...) LEFT SEMIJOIN (SELECT ... FROM t2) ON t2.a = t1.b;
Again, a "NOT EXISTS
" predicate can also be supported easily by using an anti-join instead of a semi-join. Your implementation should support both decorrelations.