Assignment is due Friday, February 15 at 5:00PM.
Last updated February 9, 2019 at 10:00PM.

In this assignment you will write a planner/optimizer that chooses an optimal join ordering using a dynamic programming algorithm. Your optimizer must generate valid plans for queries involving outer joins as well as inner joins, which increases the complexity of the planner somewhat.

Your planner should be able to handle all of the same queries that your Assignment 2 planner was required to handle, including queries with ORDER BY clauses, and grouping and aggregation. You are encouraged to reuse your Assignment 2 code to achieve this goal; the Refactoring section below describes a recommended way to do so without duplicating code.

Overview

Join ordering can have a profound impact on query execution performance, so most databases devote specific effort to finding an optimal join order. Most database systems use a Selinger-style join optimizer that uses dynamic programming to choose a join order, but that also keeps costlier plans that generate results in "interesting orders." For this assignment you won't worry about result ordering (NanoDB doesn't yet have many plan nodes that can take advantage of result-ordering anyway); you simply need to implement a join planner that uses dynamic programming to find the optimal order.

Although we will identify an optimal join order based on plan costs, we will also employ a heuristic to simplify planning: we will always perform selection as early as possible in plans. We can do this because NanoDB doesn't have any indexes yet, and all tables are heap files, so this will produce optimal plans. However, in general it is a bad approach since it may rule out other more optimal plans, particularly when there are indexes or other optimizations that may affect when selection should be applied.

As before, plan optimization is performed on units of SELECT-FROM-WHERE blocks; if a FROM clause includes a subquery then the subquery will be optimized separately, and then the subquery's generated plan is treated as a black box with respect to planning and optimization.

Optimization Procedure

The general approach for an optimizer based on dynamic programming is as follows:

1. Traverse the FROM clause, collecting the leaves to be joined, along with all conjuncts from join conditions and the WHERE clause.
2. Generate an optimal plan for each leaf, applying conjuncts as early as possible.
3. Use dynamic programming to find an optimal order in which to join the leaf plans, again applying conjuncts as early as possible.
4. Apply any remaining conjuncts, along with grouping and aggregation, ordering, and projection, on top of the optimal join plan.

Although this process is conceptually straightforward, it can be somewhat involved to handle all of the details. Specifically, while join-expressions may specify the conditions to join on, we may also have join-conditions in the WHERE clause that we can take advantage of lower in the tree as well. Consider these example queries, which we have discussed all term:

SELECT * FROM t1, t2 WHERE t1.a = t2.a AND t2.b > 5;

SELECT * FROM t1 JOIN t2 ON t1.a = t2.a WHERE t2.b > 5;

SELECT * FROM t1 JOIN t2 ON t1.a = t2.a AND t2.b > 5;

All three of these queries produce the exact same results, but the conditions appear in different places in the SQL abstract syntax tree: in the first query, both conjuncts appear in the WHERE clause; in the second, the join condition appears in the ON clause and the other conjunct in the WHERE clause; and in the third, both conjuncts appear in the ON clause.

A good optimizer will be able to take any of these queries and produce the exact same execution plan.

Approach

Generally, planners restrict themselves to manipulating predicates of very specific forms. This works well in practice, because virtually all query predicates follow these forms. (Anything that doesn't follow these forms usually can't be optimized very effectively.)

We will restrict our planner to only optimize conjunctive selection predicates; anything more complex will not be optimized by our planner.

You will note that the above queries all have conjunctive selection predicates. These queries specify two conjuncts: t1.a = t2.a, and t2.b > 5. Once these conjuncts are identified in the query, we can determine the optimal place to apply each conjunct in the final execution plan.
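To make the idea concrete, here is a minimal sketch of how nested AND expressions flatten into a list of conjuncts. The Expr and AndExpr classes below are illustrative stand-ins, not NanoDB's actual Expression hierarchy (which provides its own helpers for this):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-ins for an expression hierarchy; NOT NanoDB's actual API.
class Expr {
    final String text;
    Expr(String text) { this.text = text; }
    @Override public String toString() { return text; }
}

class AndExpr extends Expr {
    final Expr left, right;
    AndExpr(Expr left, Expr right) {
        super("(" + left + " AND " + right + ")");
        this.left = left;
        this.right = right;
    }
}

public class ConjunctDemo {
    // Recursively flatten nested ANDs into a flat list of conjuncts.
    static void collectConjuncts(Expr e, List<Expr> out) {
        if (e instanceof AndExpr) {
            AndExpr a = (AndExpr) e;
            collectConjuncts(a.left, out);
            collectConjuncts(a.right, out);
        } else {
            out.add(e);
        }
    }

    public static void main(String[] args) {
        // WHERE t1.a = t2.a AND t2.b > 5
        Expr pred = new AndExpr(new Expr("t1.a = t2.a"), new Expr("t2.b > 5"));
        List<Expr> conjuncts = new ArrayList<>();
        collectConjuncts(pred, conjuncts);
        System.out.println(conjuncts.size());  // 2
        System.out.println(conjuncts);         // [t1.a = t2.a, t2.b > 5]
    }
}
```

Note that only AND nodes are traversed; an OR anywhere in the tree simply becomes a single (unsplittable) conjunct, which is why non-conjunctive predicates are left unoptimized.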

For a query "SELECT SelectVals FROM JoinExpr WHERE Pw", the SQL standard specifies that it should be translated into a relational algebra expression like this:

Π_SelectVals( σ_Pw( JoinExpr ) )

The join expression JoinExpr will include join conditions if a NATURAL join or a USING/ON clause is specified in the query. Additionally, there may be conjuncts in the WHERE-predicate Pw that can be used to constrain joins, or even to constrain the inputs to a join. In other words, we may be able to take some or all of the conjuncts in Pw, and push them down into the plan we create for the JoinExpr to make it more efficient.

Therefore, we need to collect all of these conjuncts together in order to determine the optimal placement of each conjunct in the plan. That way we can constrain each node to generate as few records as possible, and if we have indexes or equijoin conditions, we can take advantage of them as well.

Equivalence Rules

When it comes to inner joins, we are quite free to rearrange the join order and place conditions wherever we want in the plan, because we have many equivalence rules for manipulating inner joins. Here is a small sampling:

- Commutativity: E1 ⋈θ E2 ≡ E2 ⋈θ E1
- Associativity: (E1 ⋈θ1 E2) ⋈θ2 E3 ≡ E1 ⋈θ1 (E2 ⋈θ2 E3), as long as θ2 only references attributes of E2 and E3
- Combining selections with joins: σθ(E1 × E2) ≡ E1 ⋈θ E2
- Pushing selections below joins: σθ1(E1 ⋈θ E2) ≡ σθ1(E1) ⋈θ E2, as long as θ1 only references attributes of E1

As long as we are careful to only place predicates where they make sense (i.e. where the referenced columns are actually available), we won’t get into any trouble.

This suggests that our planner will need to traverse the contents of the join-expression JoinExpr, identifying all base-tables and SELECT subqueries to be manipulated - these are the leaves of the join expression, for which we need to determine an optimal join ordering. Additionally, we will also need to collect all conjuncts contained in join-conditions in JoinExpr, so that we can apply each conjunct as early as possible in the plan.

Outer joins complicate this process, because there are many equivalence rules that hold for inner joins that do not hold on outer joins. For example, these general cases apply (where θ1 references only attributes of E1, and θ2 references only attributes of E2):

- σθ1(E1 ⟕θ E2) ≡ σθ1(E1) ⟕θ E2
- σθ2(E1 ⟕θ E2) ≢ E1 ⟕θ σθ2(E2)
- σθ1(E1 ⟖θ E2) ≢ σθ1(E1) ⟖θ E2
- σθ1(E1 ⟗θ E2) ≢ σθ1(E1) ⟗θ E2

Looking at the second through fourth points, it is relatively easy to see that we can only push a conjunct down through an outer join if the "outer" side is not opposite the child that the conjunct applies to. For example, in the first point above, we can push θ1 down through the left outer join, since the right side of the join is not an "outer" side. (It should be evident that full outer joins completely disallow pushing conjuncts down through them.)

Since outer joins generally resist reorganizing, our planner will basically leave them alone. This involves two basic strategies:

- An outer join is treated as a single leaf in the join-ordering process; its children are planned recursively, but the outer join itself is never reordered with respect to other joins.
- Conjuncts are only passed down into an outer join's children when doing so preserves the query's meaning, i.e. when the "outer" side is not opposite the child that the conjunct applies to.

Thankfully, this doesn’t complicate things too much.

Implementation

Your implementation of the dynamic-programming join planner will go into the CostBasedJoinPlanner class in the queryeval package. A partial implementation of this class is provided to you, but you will need to make a number of updates to get it working. Note that several helper functions are provided to simplify your efforts; if you find yourself needing a particular operation, it may already exist. Some of this functionality is discussed in subsequent sections.

Refactoring

Recall that this week's CostBasedJoinPlanner planner should support all of the same functionality that your SimplePlanner supports. This means you can take the work you did on the SimplePlanner, and apply it, somehow, to your CostBasedJoinPlanner implementation. It might be tempting to simply copy this code from one class into the other, but this is a very bad approach.

A better approach is to come up with a code architecture that allows you to use common code in both your SimplePlanner and your CostBasedJoinPlanner classes without having to duplicate it. This is straightforward to do by creating an abstract base class that contains the common planner code, and then both of your planner implementations can extend this class. For example, you might do this:

public abstract class AbstractPlannerImpl implements Planner {
    ... // functionality common to all your planners
}

Then, you can update your SimplePlanner class like this:

public class SimplePlanner extends AbstractPlannerImpl {
    ... // code specific to the simple planner
}

And, your CostBasedJoinPlanner can use this common code as well:

public class CostBasedJoinPlanner extends AbstractPlannerImpl {
    ... // cost-based join planning goodness goes here
}

This process of rearranging code to improve the quality of your code while preserving or extending the existing functionality is called refactoring, and it is a common and very important task in most large-scale software projects. Many times you don't know what code structure will work best at the start, so over time you refine and revise it to avoid issues like code duplication.

Create an abstract base class for both of your planners to use, following the above pattern. (Feel free to use the names given here, or come up with your own names.) Then, refactor the code in your SimplePlanner so that the CostBasedJoinPlanner can also make use of as much of the shared code as is reasonable to do so.

What should go into the base class? The most likely candidate is any supporting functions you use to implement grouping and aggregation. What if you don't have any supporting functions, and your SimplePlanner.makePlan() method is just one giant chunk of code for doing everything - handling grouping and aggregation, and ordering, and all the rest? You can see that if functions are too large, it is difficult to reuse them in different contexts. Therefore, you should break any large functions into smaller ones that can be shared between both planners. Then, these functions can be migrated up into the abstract base-class. Again, this is all an example of refactoring at work.

You might also notice that the implementation of makeSimpleSelect() is identical in both SimplePlanner and CostBasedJoinPlanner. This is another candidate to migrate up into an abstract base-class.

This must be done carefully and thoughtfully, and with regular testing along the way, so that you don't end up breaking functionality that previously worked! That situation also has a name - it's called a regression, and regressions are best avoided.

What if you have a three-line snippet of code that appears in SimplePlanner.makePlan(), and you want to replicate it in your CostBasedJoinPlanner.makePlan() implementation? In cases like that, it's probably more confusing to create extra methods to handle all these tiny cases than it would be to just duplicate the code. If the snippet can be factored out very cleanly, then you can move it into its own function; otherwise, you can just duplicate the code.

Over time you will develop a sense of when it's "acceptable" to duplicate code (although it should always make you sad to do so!), and how much warrants being migrated into its own function. These planner classes give you an excellent opportunity to practice this important programming skill.

Make sure to write Javadoc comments for your planner base-class, and for all methods you add to your planner classes.

Join Components

You will notice that the CostBasedJoinPlanner class contains a nested class called JoinComponent. This class holds all details for a particular join-plan, including the set of conjuncts used within the plan, and the set of leaf-plans that are joined together by the join-plan. This allows us to easily build up the overall plan from the various components.

As before, the main entry-points into the planner are:

- makePlan(), which generates an execution plan for an entire SELECT statement, and
- makeSimpleSelect(), which generates a simple select plan against a single table.

You will note that the CostBasedJoinPlanner includes several other operations; these will be discussed in subsequent sections. There is a third entry-point you will need to be aware of: the makeJoinPlan() method, which generates an optimal join plan for a given FromClause and a collection of extra conjuncts.

The makeJoinPlan() method is private because it is not for external use, but there will definitely be situations where you need to recursively invoke it to handle certain situations. This method is implemented for you; it simply coordinates the phases of the planning outlined earlier.

Collecting Details from the FromClause

The initial detail-collection is performed by the collectDetails(FromClause, ...) method in the join-planner. You must implement this method, and also write a Javadoc header for it.

The detail-collection phase must collect both the set of conjuncts and the list of leaf FromClauses from the specified FromClause argument. Here are a few notes to help in your implementation:

- Base tables, derived tables (subqueries in the FROM clause), and outer joins are all leaves; add them to the leaf list without further traversal.
- Inner joins are not leaves; recursively collect details from both of their children.
- If a FROM clause specifies a join condition (e.g. from an ON clause or a NATURAL join), add the condition's conjuncts to the set of conjuncts being collected.

Once all leaves are identified and all conjuncts are collected, we can begin generating an optimal plan to join the leaves.

Generating Optimal Leaf-Plans

The first step in creating an optimal query plan is to devise an optimal plan for each of the leaf expressions in the FROM component of the query. The method responsible for generating all leaf plans is the generateLeafJoinComponents() method; it uses the helper makeLeafPlan() to generate the actual plan for each leaf. You must implement the makeLeafPlan() method.

Coming up with a basic plan for each leaf is very straightforward; we already know how to do this:

- For a base table, create a file-scan over the table.
- For a subquery in the FROM clause, invoke the planner recursively on the subquery to produce its plan.
- For an outer join, recursively invoke makeJoinPlan() on each child of the outer join, and then join the resulting plans using the outer join's condition.

Your code for the outer-join case should be careful to pass conjuncts down to the recursive invocation of makeJoinPlan(), but only when the resulting plan will still be equivalent to the original query. You can make use of the two methods on FromClause, hasOuterJoinOnLeft() and hasOuterJoinOnRight(), to determine when it is safe to pass conjuncts to the recursive invocation.

This plan should work, but we also want to apply our heuristic of applying selections as early as possible. To do this, we need to see if any conjuncts can be applied to each leaf plan-node we generate. Since this can become a tedious operation, and Java is not an ideal language for this kind of thing, several helper methods are provided for you to use; in particular, there is a helper that takes a collection of expressions and one or more schemas, and finds all expressions that can be evaluated against those schemas.

Note that the caller must provide a collection that the results are stored into; a newly created HashSet<Expression> object would be best for this. Also, note that this helper method can optionally remove matching expressions from the input collection; when generating leaf plans, you don't want to remove matching expressions from the input collection.

Try to call prepare() as little as possible, since it traverses an entire plan recursively!

Once you have created a leaf-plan and you have the set of conjuncts that can be applied to that leaf plan, you will want to create a new predicate from those conjuncts and apply it to that leaf-plan. (This is following the "apply selections as early as possible" heuristic.) Two helper methods are provided to make this very easy: one builds a single predicate out of a collection of conjuncts, and the other adds a predicate to an existing plan node.

Generating an Optimal Join Plan: Approach

The final step of join optimization is to combine the N leaf plans into one optimal join plan. We will use dynamic programming for this step, allowing us to reuse earlier computations as we search for the answer. The algorithm will operate in a loop, starting with a collection of join-plans that combine n leaf-plans in an optimal way, and producing a collection of join-plans that combine n+1 leaf-plans in an optimal way. Initially we start with the set of leaf plans (each of which accesses one leaf in an optimal way); the first iteration will produce plans that join 2 leaves together; the next iteration will produce plans that join 3 leaves together; and so forth.

In the final iteration of this algorithm, we should have a set of plans that combine N-1 leaf-plans together, and we should produce a plan that combines all N leaf-plans together. Since we only keep the optimal plan for each distinct set of n leaves being joined, we should end up with only one join-plan that optimally combines all N leaves.

For a specific iteration of this algorithm, we will start with a set of plans JoinPlans_n that join n leaf-plans together, and the set of leaf-plans LeafPlans. To generate plans that join n+1 leaves together, we will simply iterate over both of these sets:

# Forgive the formatting - markdown doesn't appear to
# support formatting within code blocks.

for plan_n in JoinPlans_n:    # Iterate over plans that join n leaves
    for leaf in LeafPlans:    # Iterate over the leaf plans
        if leaf already appears in plan_n:
            continue        # This leaf is already joined in by the current plan

        plan_n+1 = a new join-plan that joins together plan_n and leaf
        newCost = cost of new plan
        if JoinPlans_n+1 already contains a plan with all leaves in plan_n+1:
            if newCost is cheaper than cost of current "best plan" in JoinPlans_n+1:
                # plan_n+1 is the new "best plan" that combines this set of leaf-plans!
                replace current plan with new plan in JoinPlans_n+1
        else:
            # plan_n+1 is the first plan that combines this set of leaf-plans
            add plan_n+1 to JoinPlans_n+1
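The loop above can be sketched in runnable Java. Everything here is a toy stand-in: Plan is a drastically simplified version of the JoinComponent idea, leaves are just names with costs, and the cost model is arbitrary; only the dynamic-programming structure mirrors the pseudocode:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class JoinDP {
    // Toy join-plan: the set of leaves it joins, and an estimated cost.
    // (Illustrative stand-in for CostBasedJoinPlanner's JoinComponent.)
    static class Plan {
        final Set<String> leaves;
        final double cost;
        Plan(Set<String> leaves, double cost) { this.leaves = leaves; this.cost = cost; }
    }

    // Find the cheapest left-deep plan joining all leaves, given per-leaf costs.
    static Plan optimalJoin(Map<String, Double> leafCosts) {
        // JoinPlans_1: each leaf by itself, keyed by its leaf-set.
        Map<Set<String>, Plan> current = new HashMap<>();
        for (Map.Entry<String, Double> e : leafCosts.entrySet()) {
            Set<String> s = new HashSet<>();
            s.add(e.getKey());
            current.put(s, new Plan(s, e.getValue()));
        }

        int n = leafCosts.size();
        for (int size = 1; size < n; size++) {
            Map<Set<String>, Plan> next = new HashMap<>();  // JoinPlans_(size+1)
            for (Plan p : current.values()) {
                for (Map.Entry<String, Double> leaf : leafCosts.entrySet()) {
                    if (p.leaves.contains(leaf.getKey()))
                        continue;  // leaf already joined in by the current plan

                    Set<String> joined = new HashSet<>(p.leaves);
                    joined.add(leaf.getKey());
                    // Toy cost model: sub-plan cost + leaf cost + a join penalty.
                    double newCost = p.cost + leaf.getValue() + 0.1 * p.cost;

                    Plan best = next.get(joined);
                    if (best == null || newCost < best.cost)
                        next.put(joined, new Plan(joined, newCost));
                }
            }
            current = next;
        }
        // Only one leaf-set remains: the one containing all n leaves.
        return current.values().iterator().next();
    }

    public static void main(String[] args) {
        Map<String, Double> leaves = new HashMap<>();
        leaves.put("t1", 100.0);
        leaves.put("t2", 10.0);
        leaves.put("t3", 1.0);
        Plan best = optimalJoin(leaves);
        System.out.println(best.leaves.size());  // 3
    }
}
```

Because each iteration keeps only the cheapest plan per distinct leaf-set, at most C(n, k) plans survive each round, rather than all k! orderings.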

Reviewing the above algorithm, it is again clear that we should store additional details with each join-plan we generate:

- the set of leaf-plans joined together by the plan, so we can tell which leaves a plan covers;
- the set of conjuncts already applied within the plan; and
- the estimated cost of the plan.

Generating plan_n+1 from plan_n and leaf

This process is generally straightforward, but there are two important details about generating a new join-plan. First, we will constrain our join optimizer to only consider left-deep plans. In other words, leaf nodes should always appear on the right of a theta-join operation; a sub-plan involving other joins will always automatically end up on the left of a theta-join. (For this assignment we don't care if a nested subquery ends up as the right sub-plan of a join. Hopefully the estimated cost will make it unappealing to the planner.) This makes plan-generation simpler, because we don't need to generate two candidates from plan_n and leaf; we simply join them in that order, and that is our new plan.

The second important detail is that we must again follow our heuristic of applying selection as early as possible, so if there are conjuncts we can apply on the theta-join in the new join plan, we want to go ahead and apply them there. This requires a bit of effort:

SubplanConjuncts = LeftChildConjuncts ∪ RightChildConjuncts
UnusedConjuncts = AllConjuncts - SubplanConjuncts

This is why we want to store the conjuncts used by each plan we generate, so that we can easily determine what conjuncts remain to be applied when generating a new plan.

Of this set of UnusedConjuncts, we must determine which of those conjuncts should be applied to the theta-join. Once that is determined, we can easily compute the set of conjuncts used by the new plan, and we can store this with the new plan.
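Assuming conjuncts are kept in HashSets (as NanoDB does with Expression objects; plain Strings stand in for them here), the bookkeeping above is just a union followed by a set-difference:

```java
import java.util.HashSet;
import java.util.Set;

public class ConjunctSets {
    // UnusedConjuncts = AllConjuncts - (LeftChildConjuncts ∪ RightChildConjuncts)
    static Set<String> unusedConjuncts(Set<String> all,
                                       Set<String> leftChild,
                                       Set<String> rightChild) {
        // SubplanConjuncts = LeftChildConjuncts ∪ RightChildConjuncts
        Set<String> subplan = new HashSet<>(leftChild);   // copy, then union
        subplan.addAll(rightChild);

        // UnusedConjuncts = AllConjuncts - SubplanConjuncts
        Set<String> unused = new HashSet<>(all);          // copy, then difference
        unused.removeAll(subplan);
        return unused;
    }

    public static void main(String[] args) {
        Set<String> all = Set.of("t1.a=t2.a", "t2.b>5", "t2.c=t3.c");
        Set<String> left = Set.of("t2.b>5");
        Set<String> right = Set.of();
        System.out.println(unusedConjuncts(all, left, right));
        // prints the two conjuncts not yet applied (in unspecified order)
    }
}
```

Note the copies: addAll() and removeAll() mutate the set they are called on, so operating directly on a child plan's conjunct set would corrupt it for later iterations.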

Storing the Optimal Plan

As given in the pseudocode earlier, we must determine whether a plan combining the specified leaf-nodes has already been stored. If so, we must compare the costs of the two plans, and if the new plan is cheaper than the current "best plan" then the new plan should replace the old plan. If there is no existing plan for this combination of leaf nodes, we always store the new plan.

Generating an Optimal Join Plan: Implementation

In NanoDB, all Expression classes and all PlanNode classes implement the hashCode() and equals() methods, allowing them to be used with HashSets and HashMaps. For example, the set of leaves joined by a plan are collected within a HashSet<PlanNode> object, and it is very easy to perform set-membership tests with such a collection. Additionally, the hash-set itself can be used as a key into a mapping; specifically, we can map a set of leaf-plans to the optimal join-plan that combines those leaves.

Similarly, conjuncts can be manipulated with HashSet<Expression> objects. HashSets implement several methods for set operations: the addAll() method can be used to compute a set-union, the removeAll() method can be used to compute set-difference, and the retainAll() method can be used to compute the set-intersection. (Note that all of these methods change the object they are called on.) The HashSet constructor can also take another collection as an argument; the collection’s contents are copied into the HashSet.

The optimal join plan is generated by the generateOptimalJoin() method; an outline of the solution is provided in this method to steer you in the right direction. There isn't much else to say about this method that hasn't already been said, except that the implementation relies on two collections (one for the current iteration, and one for the next) that look like this:

HashMap<HashSet<PlanNode>, JoinComponent> joinPlans

This mapping is used to store the optimal plan for joining together n leaf-plans. The key for this mapping is the set of leaf-nodes joined together by each plan; the value is an object that holds the plan itself, the conjuncts applied by the plan, and the set of leaf-plans joined together by the plan.

Because all plan-nodes implement the hashCode() and equals() methods, two HashSets of plan-nodes will hash and compare equal if they contain the same plans. Therefore, if you generate a new plan that joins together the same leaves, you can use the set of leaves to look up any existing plan for those leaves, and compare an existing plan's cost to the new plan's cost.
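Here is a quick illustration of this property, using Strings as stand-ins for plan nodes:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Set;

public class SetKeyDemo {
    public static void main(String[] args) {
        HashMap<Set<String>, String> joinPlans = new HashMap<>();

        Set<String> key1 = new HashSet<>();
        key1.add("t1");
        key1.add("t2");
        joinPlans.put(key1, "plan joining t1 and t2");

        // A different HashSet object with the same contents finds the same
        // entry, because HashSet's equals()/hashCode() depend only on the
        // elements it contains, not on object identity or insertion order.
        Set<String> key2 = new HashSet<>();
        key2.add("t2");
        key2.add("t1");
        System.out.println(joinPlans.containsKey(key2));  // true
        System.out.println(joinPlans.get(key2));          // plan joining t1 and t2
    }
}
```

One caveat: if a HashSet is mutated after being used as a map key, its hash code changes and the entry can no longer be found; build a fresh set for each insertion rather than reusing and modifying one.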


Suggested Approach

This assignment is a bit more difficult to divide amongst teammates because it is focused on building one large feature, rather than building a number of smaller things. That said, there are still some natural ways of dividing up the work: for example, the refactoring effort, the detail-collection phase, leaf-plan generation, and the dynamic-programming join loop are all reasonably independent pieces.

Testing Your Work

Once you have completed your join planner, you should start with very simple tests to make sure everything is working properly. However, you must first tell NanoDB to use the new planner. Don't forget that the nanodb.plannerClass property controls what planner is used for query planning. The easiest way to make sure your new planner is used is to go to the ServerProperties class (in the edu.caltech.nanodb.server.properties package) and change the class name that is specified in the DEFAULT_PLANNER_CLASS constant. If you change this value, don't forget that the change won't take effect until you have rebuilt your project.

Make sure you have ANALYZEd your tables before testing your planner. You should be able to handle very simple queries such as:

SELECT * FROM states;
SELECT state_id FROM states;

If these queries work, you might want to try something more challenging, such as this three-table join from the last assignment:

SELECT store_id, property_costs
FROM stores, cities, states
WHERE stores.city_id = cities.city_id AND
      cities.state_id = states.state_id AND
      state_name = 'Oregon' AND
      property_costs > 500000;

Regardless of what order you specify the tables or conditions, or how the conditions are specified (e.g. in ON clauses, or in the WHERE clause, etc.), you should get the exact same join plan. For example, I always get this plan:

Project[values:  [STORES.STORE_ID, STORES.PROPERTY_COSTS]] cost=[tuples=22.7, ..., cpuCost=11866.2, blockIOs=26]
    NestedLoop[pred:  STORES.CITY_ID == CITIES.CITY_ID] cost=[..., cpuCost=11843.5, blockIOs=26]
        NestedLoop[pred:  CITIES.STATE_ID == STATES.STATE_ID] cost=[..., cpuCost=298.0, blockIOs=2]
            FileScan[table:  STATES, pred:  STATES.STATE_NAME == 'Oregon'] cost=[..., cpuCost=44.0, blockIOs=1]
            FileScan[table:  CITIES] cost=[tuples=254.0, tupSize=23.8, cpuCost=254.0, blockIOs=1]
        FileScan[table:  STORES, pred:  STORES.PROPERTY_COSTS > 500000] cost=[..., cpuCost=2000.0, blockIOs=4]

Submitting Your Assignment

Once you are finished with the above tasks, your team needs to complete the design document for this assignment, available in your repository at "doc/lab4design.txt".

When you are ready to submit your code, and your design document includes your team's answers to the design questions, you need to tag the version of your code that you want us to grade. Only one member of your team should perform these steps. If multiple people on your team do this, it will mess up your team's repository.

The instructions are the same as for the previous assignment, although you will use a hw4 tag this week. Again, note that if you decide to resubmit your work (e.g. because you found a bug that you needed to fix), you cannot reuse the same tag name again! Therefore, use an updated tag name "hw4-2", "hw4-3", etc. so that we can still find the submission. Also, don't forget to push the new tag to your submission repository.

Finally, one team member should complete this short document and submit it on the course website. Only one teammate should submit this document so that we don't get confused! You can list the Git repository hashes of your tags with this command:

git show-ref --tags