Assignment is due Friday, February 8 at 5:00PM.
Last updated February 6, 2019 at 6:15PM.

In this assignment you will complete the following features:

  * Statistics collection for heap tuple files (the ANALYZE command)
  * Plan costing for several kinds of plan nodes
  * Selectivity estimation for query predicates
  * Updating column statistics as predicates are applied in query plans

These tasks are described in the following sections.

There are also a few opportunities for extra credit described at the end of this assignment, if you want to supplement your grade.

Plan Costing Overview

Plan costing is a remarkably imprecise operation. Given a particular query execution plan, we must somehow make an estimate of how costly it will be to evaluate that plan, given statistics that are hopefully up to date, and a general understanding of how our plan nodes work in various cases. In addition, there can be several measures for how to represent the "cost" of a plan, such as the total number of rows produced at each plan node, the total time to produce the results, the time to produce the first result, the CPU/memory/disk usage of the plan, and so forth.

NanoDB has a very basic way of representing both plan costs and table statistics, but you will see that even with these simple representations, plan costing can be quite involved. Furthermore, we can only make costing estimates in very limited situations; in many other situations, we must simply use default guesses because there’s no way to know.

Nonetheless, as imprecise as the whole effort is, it still gives our database the ability to choose a plan that is likely to be "better" than other plans. This allows us to generate multiple equivalent plans and then select the "best" one out of the bunch.

The "Stores" Database

This week you will have a new schema to use with NanoDB: a randomly generated database of stores in various locations, along with employees at those stores. You will find this schema in the schemas/stores directory of your repository. The make-stores.sql file specifies the schema, along with the various key and NOT NULL constraints on the database. The stores-10K.sql and stores-28K.sql files contain two different data-sets for this database. For all subsequent discussion, we will assume the use of the stores-28K.sql data-set.

You can use the \source command to load these files from the NanoDB command prompt:

CMD> \source schemas/stores/make-stores.sql
Table employees already doesn't exist; skipping drop-table.
Table stores already doesn't exist; skipping drop-table.
Table cities already doesn't exist; skipping drop-table.
Table states already doesn't exist; skipping drop-table.
Created table:  states
Created table:  cities
Created table:  stores
Created table:  employees
CMD> \source schemas/stores/stores-28K.sql

The second operation may take a while because of key-constraint enforcement; if it is too slow for you, you can turn off key-constraint enforcement before loading with this command:

CMD> set property 'nanodb.enforceKeyConstraints' = false;

Statistics Collection

Plan costing is nearly impossible to perform effectively unless we have some basic table statistics describing the data that queries will run against. Therefore, we must first ensure that we have useful statistics before we even attempt to perform any plan costing.

Every tuple-file in NanoDB stores statistics describing the file's contents. These statistics are stored in the header page (page 0) of each tuple file, immediately after the table's schema. They are not updated continuously, since that would be too costly; NanoDB currently requires the user to manually invoke the statistics-update operation. Additionally, the statistics initially written when a table is created are empty, since the table itself starts out empty.

Users can update a table's statistics using the ANALYZE command. This command takes one or more table names as arguments, and performs stats collection on those tables:

CMD> ANALYZE cities, states, stores, employees;
Analyzing table CITIES
Analyzing table STATES
Analyzing table STORES
Analyzing table EMPLOYEES
Analysis complete.
CMD> 

This analysis process scans each table file, collecting useful statistics from the table and then storing the results into the header page. When a table is opened in preparation to execute a query, the table's statistics are automatically loaded from the header page.

The statistics we will collect are:

  * the number of data pages in the tuple file
  * the total number of tuples in the file
  * the average size of a tuple, in bytes

In addition, for each column, we will collect:

  * the number of distinct values in the column
  * the number of NULL values in the column
  * the minimum and maximum values in the column (when it makes sense to record them; see below)

Note that we only want to store the minimum and maximum values for a column if we can easily compute inequality-based selectivity estimates (such as T.A > 5) for the column. Generally we do this by computing the ratio (high - low) / (maxVal - minVal); depending on whether we are performing >, ≥, <, or ≤, we will use maxVal or minVal for the high or low value, and the actual comparison-value for the other value.

Although we could certainly find the max and min value for a VARCHAR column, it's not practical to perform the above computation with strings, so we will simply not gather min/max statistics for string columns. (In addition, the max or min string value may be very large, and it could use up substantial space in the header page, with no real benefit.)

You should also note that we will always find the total number of distinct values for a column, regardless of whether we will find the min and max values, because we need the number of distinct values to estimate the selectivity of = and ≠ conditions.
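
For a concrete sense of these formulas, consider the cities.population column using the reference statistics shown later in this document (minVal = 100135, maxVal = 8143197, 254 distinct values):

    population > 5000000:
        selectivity ≈ (maxVal - 5000000) / (maxVal - minVal)
                    = (8143197 - 5000000) / (8143197 - 100135)
                    ≈ 0.39

    population = 5000000:
        selectivity ≈ 1 / (number of distinct values) = 1 / 254 ≈ 0.004

Multiplying the first estimate by the 254 rows in cities gives roughly 99 tuples, which matches the estimate quoted in the testing section below.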

Completing the Stats-Collection Functionality

In NanoDB, all tuple file formats may provide statistics-collection capabilities through the TupleFile.analyze() interface method. This is the method that NanoDB ultimately calls when the user issues the ANALYZE command. The (empty) implementation for this method is in the HeapTupleFile class; this is where you will implement statistics-collection for heap files.

In a database, you should always try to do as much work as possible with as little I/O as possible. Specifically, when collecting statistics, your implementation should collect all stats in one pass over the table file. This is easy to do; statistics collection can be implemented roughly like this:

for each data block in the tuple file:
    update stats based on the data block
    (e.g. use tuple_data_end - tuple_data_start to update a "total bytes in all tuples" value)
    
    for each tuple in the current data block:
        update stats based on the tuple (e.g. increment the tuple-count)
        for each column in the current tuple:
            update column-level stats

Note: While the TupleFile interface has getFirstTuple() and getNextTuple() methods, it is faster to access the file's DBPages and their contents directly, so you should avoid using those methods in your implementation.

A handful of helper classes are provided to make statistics collection simpler (all of these classes are in the edu.caltech.nanodb.queryeval package):

  * TableStats, which holds the table-level statistics for a tuple file
  * ColumnStats, which holds the statistics for a single column
  * ColumnStatsCollector, which accumulates the statistics for one column as you scan the file's tuples

In your implementation of HeapTupleFile.analyze(), you will want to create an array of ColumnStatsCollector objects, one for each column in the tuple file. As you traverse the tuples in the file, loop over each tuple's columns, passing each column's value into the corresponding ColumnStatsCollector object. Once you have traversed the entire file, you can generate a list of ColumnStats objects from the collectors for the table statistics.

Don't forget that a tuple-slot can be empty if the corresponding tuple has been deleted! Make sure to use the HeaderPage and DataPage helper classes in implementing this method; they should make it very straightforward. Don't duplicate functionality that is already provided there, or you will lose points.
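
If it helps to see the pseudocode above in a more concrete form, here is a rough Java sketch of the collection loop. It is only a sketch: the field and helper names it uses (schema, dbFile, storageManager, DataPage.getNumSlots(), DataPage.getSlotValue(), DataPage.EMPTY_SLOT, DataPage.getTupleDataStart()/getTupleDataEnd(), HeapFilePageTuple, ColumnStatsCollector.addValue()) are assumptions based on the general shape of the codebase, so verify them against the Javadocs before relying on any of them.

// Rough sketch only -- the helper names used here are assumptions and
// should be checked against the Javadocs in your copy of the codebase.
int numDataPages = 0;
int numTuples = 0;
long totalTupleBytes = 0;

// One collector per column in the table's schema.
ColumnStatsCollector[] collectors =
    new ColumnStatsCollector[schema.numColumns()];
for (int i = 0; i < collectors.length; i++) {
    collectors[i] = new ColumnStatsCollector(
        schema.getColumnInfo(i).getType().getBaseType());
}

// Page 0 is the header page, so data pages start at page 1.
for (int pageNo = 1; pageNo < dbFile.getNumPages(); pageNo++) {
    DBPage dbPage = storageManager.loadDBPage(dbFile, pageNo);
    numDataPages++;

    // Bytes occupied by tuple data in this page.
    totalTupleBytes += DataPage.getTupleDataEnd(dbPage) -
                       DataPage.getTupleDataStart(dbPage);

    for (int slot = 0; slot < DataPage.getNumSlots(dbPage); slot++) {
        int offset = DataPage.getSlotValue(dbPage, slot);
        if (offset == DataPage.EMPTY_SLOT)
            continue;   // this tuple was deleted

        numTuples++;
        Tuple tup = new HeapFilePageTuple(schema, dbPage, slot, offset);

        // Feed each column value (possibly null) to its collector.
        for (int i = 0; i < collectors.length; i++)
            collectors[i].addValue(tup.getColumnValue(i));
    }

    dbPage.unpin();
}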

Finally, when you have completed generating the table statistics, you can save the new statistics as follows:

  1. Create a new TableStats object with the new statistics you have computed

  2. Store your TableStats object into the HeapTupleFile.stats field on the tuple-file object

  3. Call heapFileManager.saveMetadata(), passing in the tuple-file object (this). The heapFileManager field holds the HeapTupleFileManager that loaded the heap file you are analyzing.

The saveMetadata() method updates the header page with the serialized versions of the current schema and statistics. Once you have done this, the stats will be saved to the tuple file.
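
In code, those final steps might look roughly like this, continuing the sketch above; again, the TableStats constructor arguments and field names are assumptions to verify against the actual classes:

// Sketch of saving the computed statistics; names and argument order
// are assumptions -- check the TableStats Javadocs.
ArrayList<ColumnStats> columnStats = new ArrayList<>();
for (ColumnStatsCollector collector : collectors)
    columnStats.add(collector.getColumnStats());

float avgTupleSize =
    (numTuples > 0) ? ((float) totalTupleBytes / numTuples) : 0;

stats = new TableStats(numDataPages, numTuples, avgTupleSize, columnStats);

// Serialize the schema and new stats back into the header page.
heapFileManager.saveMetadata(this);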

Testing Statistics-Collection

The ANALYZE command doesn't print out any details when it is run on a table, so by itself it can't really tell you whether your implementation is working correctly. However, NanoDB also has a "SHOW TABLE t STATS" command, which outputs the statistics for the specified table. You can compare your results to the contents of the SQL files you loaded. Note that for some of the integer stats, -1 indicates NULL or "unknown", but for object values, the stats will contain a null reference if the value is unknown.

Here are some outputs from the (hopefully correct) reference solution for you to check against:

CMD> show table states stats;
Statistics for table states:
    51 tuples, 1 data pages, avg tuple size is 15.7 bytes
    Column state_id:  51 unique values, 0 null values, min = 1, max = 51
    Column state_name:  51 unique values, 0 null values, min = null, max = null
CMD> show table cities stats;
Statistics for table cities:
    254 tuples, 1 data pages, avg tuple size is 23.8 bytes
    Column city_id:  254 unique values, 0 null values, min = 1, max = 254
    Column city_name:  244 unique values, 0 null values, min = null, max = null
    Column population:  254 unique values, 0 null values, min = 100135, max = 8143197
    Column state_id:  44 unique values, 0 null values, min = 1, max = 44
CMD> show table stores stats;
Statistics for table stores:
    2000 tuples, 4 data pages, avg tuple size is 13.0 bytes
    Column store_id:  2000 unique values, 0 null values, min = 1, max = 2000
    Column city_id:  254 unique values, 0 null values, min = 1, max = 254
    Column property_costs:  882 unique values, 0 null values, min = 0, max = 999000
CMD> show table employees stats;
Statistics for table employees:
    24932 tuples, 118 data pages, avg tuple size is 36.4 bytes
    Column emp_id:  24932 unique values, 0 null values, min = 1, max = 24932
    Column last_name:  4958 unique values, 0 null values, min = null, max = null
    Column first_name:  4933 unique values, 0 null values, min = null, max = null
    Column home_loc_id:  254 unique values, 0 null values, min = 1, max = 254
    Column work_loc_id:  254 unique values, 0 null values, min = 1, max = 254
    Column salary:  46 unique values, 0 null values, min = 35000, max = 80000
    Column manager_id:  11754 unique values, 4259 null values, min = 1, max = 24924

Plan Costing and Selectivity Estimates

Plan costs in NanoDB are represented with a few simple measures. There are the standard ones you would expect, such as the number of tuples produced by each plan-node, the worst-case number of disk IOs that will be performed, and so forth. These values are managed in the PlanCost class (edu.caltech.nanodb.queryeval package).

There is also a "CPU cost" measure, which is a simple estimate of the computational cost of evaluating a plan, in some imaginary units. For example, we might say that processing one tuple in a plan-node has a CPU cost of 1.0. A select plan-node might only produce 10 tuples, but if it has to examine 1000 tuples, then the CPU cost generated by that node will be 1000.

Also, unlike row-counts, these CPU costs should accumulate up the tree: if we have two equivalent plans, and our row-estimates are so accurate that we think the two plans will produce the same number of rows, the CPU cost can still tell us which plan is likely to do more work to generate the same results. For example, if a select plan-node must consider 1000 tuples, its CPU cost will be 1000 plus the CPU cost of its sub-plan.
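
For instance, if we adopt the convention that examining one tuple costs 1.0 CPU unit (the exact convention is up to you, and belongs in your design document), a filter applied on top of a file scan of the cities table would accumulate cost like this:

    file scan of cities (254 tuples):   cpuCost = 254.0
    filter over that scan:              cpuCost = 254.0 (child) + 254.0 (tuples examined) = 508.0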

Plan-Node Details

Every plan-node has a prepare() method that computes three critical pieces of information for the node:

  * the schema of the tuples that the node produces
  * the estimated cost of evaluating the node's subplan (a PlanCost object)
  * the statistics of the node's output (a list of ColumnStats objects)

All three of these values are stored in protected fields declared in the PlanNode class, so they are available on all plan-node implementations, and subclasses can access and manipulate them directly. Subclasses are expected to provide an implementation of prepare() to compute these details.

You will need to complete the prepare() method for three of the plan-node types, computing the cost of each kind of node. (The schema is already computed for you.) The nodes whose implementations you must complete are:

  * FileScanNode
  * SimpleFilterNode
  * NestedLoopJoinNode

The cost is represented as a PlanCost object (in the edu.caltech.nanodb.queryeval package). You should look at this class to see all values that your implementation must estimate. Use "best case" estimates; for now, you can assume that NanoDB always has all the memory it needs.

You can also look at examples of the prepare() method already implemented, in these classes: RenameNode, SortNode, ProjectNode. Be aware that these implementations are written to operate properly even when a child plan's cost is not available. You do not need to mimic this; you can assume that the child plans' costs are always available, because once you are done, they always should be.
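
As one concrete illustration, a filter-style node's prepare() might combine its child's cost with a selectivity estimate roughly as shown below. This is only a sketch under one set of assumptions -- in particular, the PlanCost field names, the getCost()/getStats() accessors, and the SelectivityEstimator.estimateSelectivity() signature should all be checked against the actual classes -- and it is not the one required costing scheme.

// Rough sketch of costing a filter-style node; names are assumptions.
PlanCost childCost = leftChild.getCost();

// Fraction of the child's rows expected to satisfy the predicate.
float selectivity = SelectivityEstimator.estimateSelectivity(
    predicate, schema, leftChild.getStats());

cost = new PlanCost(
    childCost.numTuples * selectivity,        // estimated rows produced
    childCost.tupleSize,                      // filtering doesn't change tuple size
    childCost.cpuCost + childCost.numTuples,  // we examine every child tuple
    childCost.numBlockIOs);                   // no disk IO beyond the child's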

Costing Joins

You will notice in the following discussion that none of the join-costing takes the presence of keys into account. Thus, we will rely on the non-key-based approaches to computing selectivity discussed in class. If you wish to incorporate key-based join costing as well, this is an extra-credit task described at the end of the assignment.

Estimating Selectivity

Selectivity estimates are essential for guessing how many rows a plan-node will produce. Besides our table statistics, we must also be able to guess the selectivity of a predicate. A predicate's "selectivity" is simply the probability that a row from the input will satisfy the predicate; if we know the number of rows that come into a node, and we know the selectivity of a node's predicate, we simply multiply the two together to estimate the number of rows produced.
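
For example, if a filter's input is estimated at 254 rows and its predicate's selectivity is estimated at 0.39, we would estimate the filter to produce about 254 × 0.39 ≈ 99 rows.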

In NanoDB, the SelectivityEstimator class (queryeval package) is used for making all selectivity estimates. You must complete the implementation of this class to compute the selectivity of predicates that appear in query plans. Specifically, you must support these kinds of predicates:

  * a Boolean AND of predicates whose selectivities can themselves be estimated
  * COLUMN = VALUE
  * COLUMN ≠ VALUE
  * COLUMN > VALUE, and the other inequality comparisons (≥, <, ≤)
  * COLUMN = COLUMN, the form that appears in join conditions

If a predicate doesn't fall into one of these categories, use a default selectivity estimate of 25%. (This constant is defined for you in the SelectivityEstimator class.)

Some important caveats about the last four kinds of predicates:

You will see many detailed comments in the SelectivityEstimator class; you really only have to focus on the costing equations, and how to extract the statistics you will need. (Read the Javadocs for the various classes you will need to use.) Many other parts are provided for you, including the computeRatio() method that is able to take a column's min and max values, and compute the ratio (high - low) / (maxVal - minVal) mentioned earlier. This function can work with any numeric type.
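
For example, the COLUMN > VALUE case might look roughly like the sketch below. It assumes that the relevant ColumnStats object has already been looked up for the column, that its min/max accessors are named getMinValue()/getMaxValue(), that value holds the comparison constant, and that computeRatio() takes its arguments in (low, high, minVal, maxVal) order -- all of which you should verify before relying on it.

// Sketch of estimating selectivity for "COLUMN > VALUE"; accessor names
// and computeRatio() argument order are assumptions.
Object minVal = colStats.getMinValue();
Object maxVal = colStats.getMaxValue();

float selectivity = DEFAULT_SELECTIVITY;   // fall back when stats are missing
if (minVal != null && maxVal != null) {
    // Fraction of the column's range lying above the comparison value:
    // (maxVal - value) / (maxVal - minVal)
    selectivity = computeRatio(value, maxVal, minVal, maxVal);
}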

Estimating Plan-Node Column Statistics

Anytime a predicate is applied within a plan-node, the column-statistics of the node's outputs may also be affected. Therefore, it is important to update these statistics whenever we are able to do so.

NanoDB has a class responsible for this task, StatisticsUpdater (in the edu.caltech.nanodb.queryeval package). It is nearly impossible to predict the statistics for arbitrary predicates, so NanoDB handles a very limited, but still (hopefully) beneficial, set of predicates:

  * a simple comparison between a column and a constant value, e.g. T.A > 5
  * a Boolean AND expression whose conjuncts are such column-value comparisons

Note that both of these cases will eventually funnel into the static updateCompareStats() method on the StatisticsUpdater class; all of your changes will be in this section of the class.

Where it makes sense to do so, you should update the values in the ColumnStats objects. (It may not always be possible to make reasonable updates to the statistics.) You will need to explain the rationale behind your changes in your team's design document.
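
As an illustration of the kind of update that might make sense for a predicate of the form COLUMN > VALUE, here is one possible (not required) approach. It assumes local variables oldStats and newStats holding the column's ColumnStats before and after the predicate, value holding the comparison constant, and selectivity holding the predicate's estimated selectivity; the setter names are also assumptions to check against ColumnStats.

// One possible way to adjust a column's stats for "COLUMN > VALUE";
// whether these adjustments are reasonable is exactly what your design
// document should discuss.

// A NULL in this column can never satisfy the comparison.
newStats.setNumNullValues(0);

// Only values above the comparison value can survive the predicate,
// so the column's minimum rises to the comparison value.
newStats.setMinValue(value);   // the maximum is unchanged

// Scale the distinct-value count by the predicate's selectivity.
newStats.setNumUniqueValues(
    (int) Math.ceil(selectivity * oldStats.getNumUniqueValues()));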

Suggested Approach

All parts of this assignment can be worked on in parallel, because the interfaces between the components are already defined for you. That said, some parts will not work until other parts are completed, and this is important to consider when planning your approach.

The final part of the assignment (described in the next section) should be explored by all teammates, so that your code is well exercised. Your team can have a "bug bash" to find and fix any issues that your code might have.

This is particularly valuable because the next assignment builds on top of your statistics / plan-costing implementation, and you will want to know this code works properly!

Testing Plan Costing

After you have completed your plan-costing code, you will want to ensure that it works properly. You don't have to write any test classes this week (but if you do, there will be extra credit for it!). A simple schema is provided for you in the schemas/stores directory of your repository. You can load make-stores.sql, and then load stores-28K.sql. Obviously, make sure to ANALYZE these tables before trying any queries, or else you won’t have statistics for the costing computations to use.

Note: You will need to put the answers for this section into your design document, so you might as well record them.

Basic Plan Costs

As mentioned last week, NanoDB has an EXPLAIN command you can use to see what it generates for different query plans. For example, you can try this:

EXPLAIN SELECT * FROM cities;

Once you have completed the costing implementation, you should see something like this (your costs might be different, depending on your assumptions):

Explain Plan:
    FileScan[table:  CITIES] cost=[tuples=254.0, tupSize=23.8, cpuCost=254.0, blockIOs=1]

Estimated 254.0 tuples with average size 23.787401
Estimated number of block IOs:  1

If you then try:

EXPLAIN SELECT * FROM cities WHERE population > 5000;

This should print out the same plan costs, since the smallest city-population in that table is 100135.

You can see the costs change if you try queries like this:

EXPLAIN SELECT * FROM cities WHERE population > 1000000;
EXPLAIN SELECT * FROM cities WHERE population > 5000000;

Take note of how many rows the database expects this last query will produce. (My estimate is 99.3 tuples.) However, how many tuples does the query actually produce?

Now, if you run the following query, notice that you will get the exact same results, but if you EXPLAIN it, the costing estimate is very different:

SELECT * FROM cities WHERE population > 8000000;

This is the fundamental limitation of the simple statistics that we track in NanoDB. Clearly, recording a histogram for different columns would produce much more accurate estimates.

Costing Joins

You can also try your costing estimates on more complex queries involving join operations. For example:

EXPLAIN SELECT store_id FROM stores, cities
WHERE stores.city_id = cities.city_id AND cities.population > 1000000;

(My analyzer predicts 1776.2 tuples for this query, and a CPU cost of 1,019,776.3 units. Don't feel like you have to match these numbers exactly; they are just my analyzer's estimates.)

Of course, the planner isn't quite intelligent enough to position the query predicates in optimal locations in the query plan, although we can manually rewrite the query to put predicates in better positions:

EXPLAIN SELECT store_id
FROM stores JOIN
     (SELECT city_id FROM cities
      WHERE population > 1000000) AS big_cities
     ON stores.city_id = big_cities.city_id;

(The rewritten query is still estimated to produce 1776.2 tuples, but the CPU cost is now 962,940.8 units. Woo, 56000 "CPU units" less!)

Finally, here is a slower query for you to tinker with:

SELECT store_id, property_costs
FROM stores, cities, states
WHERE stores.city_id = cities.city_id AND
      cities.state_id = states.state_id AND
      state_name = 'Oregon' AND property_costs > 500000;

The estimated tuple-count on this one is around 23 tuples, and the reference implementation estimates a CPU cost of 45,214,024.0. The actual tuple-count is 7.

See how fast you can get it by rewriting it! (I was able to take it from 35 seconds down to 0.5 seconds on my laptop, with a final CPU cost of 299,724.4.)

Updating Column Statistics

The EXPLAIN command doesn't show the column-stats for each plan-node because this would make the command output even denser than it already is. You can still do some basic testing to see if your code is working by creating nested subqueries in the FROM clause. Here is an example.

This plan will report a specific cost and estimated number of tuples. The city_id column is a primary key with uniform distribution, so the estimate should be pretty accurate.

EXPLAIN SELECT * FROM cities WHERE city_id > 20 AND city_id < 200;

(The estimates for this query may be slightly different from those for the queries below, for boring, grungy reasons.)

Next, you can force your simple planner to create multiple nested select nodes like this:

EXPLAIN SELECT * FROM (
    SELECT * FROM cities WHERE city_id > 20) t
WHERE city_id < 200;

Then, you can reverse the application of the predicates and see if the results are still the same:

EXPLAIN SELECT * FROM (
    SELECT * FROM cities WHERE city_id < 200) t
WHERE city_id > 20;

If the plan-nodes are properly updating their output statistics, you should see extremely close or even identical costing estimates for these kinds of queries.

Submitting Your Assignment

Once you are finished with the above tasks, your team needs to complete the design document for this assignment, available in your repository at "doc/lab3design.txt".

When you are ready to submit your code, and your design document includes your team's answers to the design questions, you need to tag the version of your code that you want us to grade. Only one member of your team should perform these steps. If multiple people on your team do this, it will mess up your team's repository.

The instructions are the same as for the previous assignment, although you will use a hw3 tag this week. Again, note that if you decide to resubmit your work (e.g. because you found a bug that you needed to fix), you cannot reuse the same tag name again! Therefore, use an updated tag name "hw3-2", "hw3-3", etc. so that we can still find the submission. Also, don't forget to push the new tag to your submission repository.

Finally, one team member should fill out this short document and submit it on the course website. Only one teammate should submit this document so that we don't get confused! You can list the Git hashes of your tags with this command:

git show-ref --tags

Extra Credit

If you want to earn some extra credit, here are some options for you to pursue. Estimates of the amount of extra credit that will be received are given below; to obtain full credit, code must be both correct and well-written/commented. Partial completion will result in partial credit. Additional credit will be awarded for high-quality unit-testing code.

Candidate-Key Constraints and Joins

NanoDB schemas are able to report the candidate keys that hold on them. However, none of the plan-nodes propagate this information to their output schemas. Doing so would be very useful, since it would let joins produce much better estimates of the number of rows they will produce.

Update these plan-nodes to report the candidate keys on their outputs (+10pts):

Update the NestedLoopJoinNode plan-node to use candidate-key information to produce better costing estimates. Obviously, this will also require careful analysis of the predicate being used to join. (+10pts)

Update the FileScanNode and SimpleFilterNode to use candidate-key information, along with the predicate being applied, to determine when the node can stop after the first row is found and produced. (+10pts)

Disjunctive Selection Predicates

Many databases are able to update column statistics in other scenarios besides the ones given in this assignment. Try adding support for these additional kinds of predicates:

Quality Plan-Costing Unit-Tests!

You may have noticed that there is no testing for the table-analysis and plan-costing code in Assignment 3. Implement a complete and high-quality set of unit-tests to exercise and verify these components. The better and more useful tests you add, the more extra credit points you will get. (If they are very good, we will likely incorporate them into the course codebase.)

The current version of the assignment is very flexible in how students may compute estimates. You may need to require that plans are costed in very specific ways, or that selectivity is estimated in very specific ways, in order to provide useful tests. This is fine, although you will need to explain your choices as part of your submission.

Approximate bonus points:

Additional points may be awarded if your code is of a quality that it is able to be incorporated into the CS122 codebase for future years.

Support for Additional Types

The statistics-collection and selectivity-estimation code has not been updated to include support for the NUMERIC, DATE, TIME, and TIMESTAMP types. While NUMERIC should be relatively easy, the date/time types make it more challenging to support the ratio calculations outlined earlier.

To pursue this extra-credit task, you will want to load the TPC-H data provided as part of Challenge 1, since this uses some of these data types.