Assignment is due Friday, February 8 at 5:00PM.
Last updated February 6, 2019 at 6:15PM.
In this assignment you will complete the following features:
collecting statistics on tables with the ANALYZE command
costing query plans, by completing the prepare() method of several plan-nodes
estimating the selectivity of predicates in the SelectivityEstimator class
updating column statistics in the StatisticsUpdater class
These tasks are described in the following sections.
There are also a few opportunities for extra credit described at the end of this assignment, if you want to supplement your grade.
Plan costing is a remarkably imprecise operation. Given a particular query execution plan, we must somehow make an estimate of how costly it will be to evaluate that plan, given statistics that are hopefully up to date, and a general understanding of how our plan nodes work in various cases. In addition, there can be several measures for how to represent the "cost" of a plan, such as the total number of rows produced at each plan node, the total time to produce the results, the time to produce the first result, the CPU/memory/disk usage of the plan, and so forth.
NanoDB has a very basic way of representing both plan costs and table statistics, but you will see that even with these simple representations, plan costing can be quite involved. Furthermore, we can only make costing estimates in very limited situations; in many other situations, we must simply use default guesses because there’s no way to know.
Nonetheless, as imprecise as the whole effort is, it still gives our database the ability to choose a plan that is likely to be "better" than other plans. This allows us to generate multiple equivalent plans and then select the "best" one out of the bunch.
This week you will have a new schema to use with NanoDB: a randomly generated database of stores in various locations, along with employees at those stores. You will find this schema in the schemas/stores directory of your repository. The make-stores.sql file specifies the schema, along with the various key and NOT NULL constraints on the database. The stores-10K.sql and stores-28K.sql files contain two different data-sets for this database. For all subsequent discussion, we will assume the use of the stores-28K.sql data-set.
You can use the \source command to load these files from the NanoDB command prompt:
CMD> \source schemas/stores/make-stores.sql
Table employees already doesn't exist; skipping drop-table.
Table stores already doesn't exist; skipping drop-table.
Table cities already doesn't exist; skipping drop-table.
Table states already doesn't exist; skipping drop-table.
Created table: states
Created table: cities
Created table: stores
Created table: employees
CMD> \source schemas/stores/stores-28K.sql
The second operation may take a while because of key-constraint enforcement; if it is too slow for you, you can turn off key-constraint enforcement before loading with this command:
CMD> set property 'nanodb.enforceKeyConstraints' = false;
Plan costing is nearly impossible to perform effectively unless we have some basic table statistics describing the data that queries will run against. Therefore, we must first ensure that we have useful statistics before we even attempt to perform any plan costing.
Every tuple-file in NanoDB stores statistics describing the file's contents. These statistics are stored in the header page (page 0) of each tuple file, immediately after the table's schema. These statistics are not updated continuously; that would be too costly. NanoDB currently requires the user to manually invoke the statistics-update operation. Additionally, the initial statistics created when a table is empty are, obviously, empty.
Users can update a table's statistics using the ANALYZE
command. This command takes one or more table names as arguments, and performs stats collection on those tables:
CMD> ANALYZE cities, states, stores, employees;
Analyzing table CITIES
Analyzing table STATES
Analyzing table STORES
Analyzing table EMPLOYEES
Analysis complete.
CMD>
This analysis process scans each table file, collecting useful statistics from the table and then storing the results into the header page. When a table is opened in preparation to execute a query, the table's statistics are automatically loaded from the header page.
The statistics we will collect are:
the number of data pages in the tuple file
the number of tuples in the file
the average size of the tuples, in bytes
In addition, for each column, we will collect:
the number of distinct values in the column, not including NULL values
the number of NULL values in the column
the minimum and maximum values in the column, not including NULL values, and if the column's type is suitable for computing inequality-based selectivity estimates
Note that we only want to store the minimum and maximum values for a column if we can easily compute inequality-based selectivity estimates (such as T.A > 5) for the column. Generally we do this by computing the ratio (high - low) / (maxVal - minVal); depending on whether we are performing >, ≥, <, or ≤, we will use maxVal or minVal for the high or low value, and the actual comparison-value for the other value.
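Concretely, for a numeric column whose statistics include minVal and maxVal, the inequality estimates work out to something like:

selectivity(COLUMN > value) ≈ (maxVal - value) / (maxVal - minVal)
selectivity(COLUMN < value) ≈ (value - minVal) / (maxVal - minVal)

with the result clamped to the range [0, 1], and with ≥ and ≤ handled essentially the same way, since our simple statistics are not precise enough to meaningfully distinguish them. For example, with the stores-28K data the population column of cities has minVal = 100135 and maxVal = 8143197, so the selectivity of population > 5000000 would be estimated as (8143197 - 5000000) / (8143197 - 100135) ≈ 0.39, i.e. roughly 99 of the 254 rows.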
Although we could certainly find the max and min value for a VARCHAR
column, it's not practical to perform the above computation with strings, so we will simply not gather min/max statistics for string columns. (In addition, the max or min string value may be very large, and it could use up substantial space in the header page, with no real benefit.)
You should also note that we will always find the total number of distinct values for a column, regardless of whether we will find the min and max values, because we need the number of distinct values to estimate the selectivity of = and ≠ conditions.
In NanoDB, all tuple file formats may provide statistics-collection capabilities through the TupleFile.analyze()
interface method. This is the method that NanoDB ultimately calls when the user issues the ANALYZE
command. The (empty) implementation for this method is in the HeapTupleFile
class; this is where you will implement statistics-collection for heap files.
In a database, you should always try to do as much work as possible with as little I/O as possible. Specifically, when collecting statistics, your implementation should collect all stats in one pass over the table file. This is easy to do; statistics collection can be implemented roughly like this:
for each data block in the tuple file:
    update stats based on the data block
    (e.g. use tuple_data_end - tuple_data_start to update a "total bytes in all tuples" value)
    for each tuple in the current data block:
        update stats based on the tuple (e.g. increment the tuple-count)
        for each column in the current tuple:
            update column-level stats
Note: While the TupleFile interface has getFirstTuple() and getNextTuple() methods, it will be faster to access the file's DBPages and their contents directly, so you should avoid using those methods in your implementation.
A handful of helper classes are provided to make statistics collection simpler (all of these classes are in the edu.caltech.nanodb.queryeval
package):
The TableStats
class holds all the statistics for the tuple file. Among other things, it holds a list of ColumnStats
objects, each one holding the statistics for the corresponding column.
To help you compute the column statistics, there is a ColumnStatsCollector
class that can be fed the values from a given column, and it will record the minimum and maximum values, the set of distinct values, and the number of NULL
values. Then, it can be used to generate a ColumnStats
object that holds the necessary details.
In your implementation of HeapTupleFile.analyze()
, you will want to create an array of ColumnStatsCollector
objects, one for each column in the tuple file. As you traverse the tuples in the file, you can loop over each tuple's columns, passing each column's value into the corresponding ColumnStatsCollector
object. When you have traversed the entire file, you can generate a list of ColumnStats objects for the table statistics.
Don't forget that a tuple-slot can be empty if the corresponding tuple has been deleted! Make sure to use the HeaderPage and DataPage helper classes in your implementation; they should make it very straightforward. Don't duplicate functionality already provided there, or you will lose points.
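If it helps to see the collection pattern in isolation, here is a small standalone sketch of the same idea: one collector per column, fed every value during a single pass, then asked for its summary at the end. The class and data here (SketchCollector, the hard-coded tuples) are made up purely for illustration; in your actual implementation you would use ColumnStatsCollector together with the DataPage/HeaderPage helpers.

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Standalone illustration of one-pass statistics collection (not NanoDB code).
class SketchCollector {
    private final HashSet<Object> distinctValues = new HashSet<>();
    private int numNulls = 0;
    private Integer minValue = null;   // only integer min/max tracked in this sketch
    private Integer maxValue = null;

    void addValue(Object value) {
        if (value == null) {           // NULLs are counted, not added to the distinct set
            numNulls++;
            return;
        }
        distinctValues.add(value);
        if (value instanceof Integer) {
            int v = (Integer) value;
            if (minValue == null || v < minValue) minValue = v;
            if (maxValue == null || v > maxValue) maxValue = v;
        }
    }

    @Override
    public String toString() {
        return distinctValues.size() + " unique values, " + numNulls +
               " null values, min = " + minValue + ", max = " + maxValue;
    }
}

class AnalyzeSketch {
    public static void main(String[] args) {
        // Pretend these are the tuples of a two-column table.
        List<Object[]> tuples = Arrays.asList(new Object[][] {
            { 1, "Salem" }, { 2, "Eugene" }, { 3, null }, { 3, "Salem" }
        });

        // One collector per column, exactly as described above.
        SketchCollector[] collectors = { new SketchCollector(), new SketchCollector() };

        int tupleCount = 0;
        for (Object[] tuple : tuples) {              // single pass over the data
            tupleCount++;
            for (int i = 0; i < tuple.length; i++)   // feed each column's value
                collectors[i].addValue(tuple[i]);
        }

        System.out.println(tupleCount + " tuples");
        for (SketchCollector c : collectors)
            System.out.println(c);
    }
}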
Finally, when you have completed generating the table statistics, you can save the new statistics as follows:
Create a new TableStats
object with the new statistics you have computed
Store your TableStats
object into the HeapTupleFile.stats
field on the tuple-file object
Call heapFileManager.saveMetadata()
, passing in the tuple-file object (this
). The heapFileManager
field is set to the HeapTupleFileManager
that loaded the heap file that you are analyzing in your function.
The above function will update the header page with the serialized version of the current schema and statistics. Once you have done this, the stats will be saved to the tuple file.
The ANALYZE
command doesn't print out the details when it is run on a table, so you can’t really tell if it is working correctly by itself. However, NanoDB also has a "SHOW TABLE t STATS
" command, which will output the statistics for the specified table. You can compare your results to the contents of the SQL files you loaded. Note that for some of the integer stats, -1 indicates NULL
or "unknown", but for object-values, the stats will contain a null
reference if the value is unknown.
Here are some outputs from the (hopefully correct) reference solution for you to check against:
CMD> show table states stats;
Statistics for table states:
51 tuples, 1 data pages, avg tuple size is 15.7 bytes
Column state_id: 51 unique values, 0 null values, min = 1, max = 51
Column state_name: 51 unique values, 0 null values, min = null, max = null
CMD> show table cities stats;
Statistics for table cities:
254 tuples, 1 data pages, avg tuple size is 23.8 bytes
Column city_id: 254 unique values, 0 null values, min = 1, max = 254
Column city_name: 244 unique values, 0 null values, min = null, max = null
Column population: 254 unique values, 0 null values, min = 100135, max = 8143197
Column state_id: 44 unique values, 0 null values, min = 1, max = 44
CMD> show table stores stats;
Statistics for table stores:
2000 tuples, 4 data pages, avg tuple size is 13.0 bytes
Column store_id: 2000 unique values, 0 null values, min = 1, max = 2000
Column city_id: 254 unique values, 0 null values, min = 1, max = 254
Column property_costs: 882 unique values, 0 null values, min = 0, max = 999000
CMD> show table employees stats;
Statistics for table employees:
24932 tuples, 118 data pages, avg tuple size is 36.4 bytes
Column emp_id: 24932 unique values, 0 null values, min = 1, max = 24932
Column last_name: 4958 unique values, 0 null values, min = null, max = null
Column first_name: 4933 unique values, 0 null values, min = null, max = null
Column home_loc_id: 254 unique values, 0 null values, min = 1, max = 254
Column work_loc_id: 254 unique values, 0 null values, min = 1, max = 254
Column salary: 46 unique values, 0 null values, min = 35000, max = 80000
Column manager_id: 11754 unique values, 4259 null values, min = 1, max = 24924
Plan costs in NanoDB are represented with a few simple measures. There are the standard ones you would expect, such as the number of tuples produced by each plan-node, the worst-case number of disk IOs that will be performed, and so forth. These values are managed in the PlanCost
class (edu.caltech.nanodb.queryeval
package).
There is also a "CPU cost" measure, which is a simple estimate of the computational cost of evaluating a plan, using some imaginary units. For example, we might say that processing one tuple in a plan-node has a CPU cost of 1.0. A select plan-node might only produce 10 tuples, but if it has to examine 1000 tuples, then the CPU cost generated by that node will be 1000.
Also, unlike row-counts, these CPU costs should accumulate up the tree: if we have two equivalent plans, and our row-estimates are incredibly accurate so that we think the two plans will produce the same number of rows, the CPU cost will still tell us which plan is likely to do more work to generate the same results. For example, if a select plan-node must consider 1000 tuples, its cost will be 1000 plus the CPU-cost of its sub-plan.
Every plan-node has a prepare()
method that computes three critical pieces of information for the node:
schema - the schema of the tuples that the node produces
cost - the estimated cost of evaluating the plan rooted at the node
stats - statistics describing the tuples that the node produces
All three of these fields are declared with protected access in the PlanNode class, and are therefore available to all plan-node implementations, which can access and manipulate them directly. Subclasses are expected to provide an implementation of prepare() to compute these details.
You will need to complete this method for three of the plan-node types, computing the cost of each kind of node. (The schema is already computed for you.) The nodes whose implementation you must complete are:
SimpleFilterNode - a select applied to a subplan
FileScanNode - a select applied to a table file stored on disk
NestedLoopJoinNode - a theta-join applied to two subplans; the join may be an inner or an outer join
The cost is represented as a PlanCost object (in the edu.caltech.nanodb.queryeval package). You should look at this class to see all values that your implementation must estimate. Use "best case" estimates; for now, you can assume that NanoDB always has all the memory it needs.
You can also look at examples of the prepare()
method already implemented, in these classes: RenameNode
, SortNode
, ProjectNode
. Be aware that these implementations are written to operate properly even when a child plan's cost is not available. You do not need to mimic this; you can assume that the child plans' costs are always available, because once you are done, they always should be.
You will notice in the following discussion that none of the join-costing takes the presence of keys into account. Thus, we will rely on the non-key-based approaches to computing selectivity discussed in class. If you wish to incorporate key-based join costing as well, this is an extra-credit task described at the end of the assignment.
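To make the bookkeeping concrete, here is a small standalone sketch of the kind of arithmetic involved for a file scan and a nested-loop join. The names here (SimpleCost, CostingSketch, fileScan, nestedLoopJoin) are made up for illustration; this is not NanoDB's PlanCost class, and the formulas are just one reasonable choice rather than the required ones. The fields mirror what EXPLAIN prints later in this assignment, and in line with the note above, the join estimate deliberately ignores keys.

// Standalone sketch of plan-cost arithmetic (illustrative only).
class SimpleCost {
    float numTuples;   // estimated rows produced by this node
    float tupleSize;   // average size of those rows, in bytes
    float cpuCost;     // accumulated "CPU units" for the whole subtree
    long  blockIOs;    // worst-case block reads for the whole subtree

    SimpleCost(float numTuples, float tupleSize, float cpuCost, long blockIOs) {
        this.numTuples = numTuples;
        this.tupleSize = tupleSize;
        this.cpuCost = cpuCost;
        this.blockIOs = blockIOs;
    }
}

class CostingSketch {
    // A file scan reads every data page once and examines every tuple;
    // a predicate only reduces the number of rows the node emits.
    static SimpleCost fileScan(long numDataPages, long numTuples,
                               float avgTupleSize, float selectivity) {
        return new SimpleCost(selectivity * numTuples, avgTupleSize,
                              numTuples,        // one CPU unit per tuple examined
                              numDataPages);
    }

    // A nested-loop join considers every pair of left/right tuples, on top of
    // whatever work its children already did.  Assuming "best case" memory,
    // each input only needs to be read once.
    static SimpleCost nestedLoopJoin(SimpleCost left, SimpleCost right,
                                     float selectivity) {
        float pairs = left.numTuples * right.numTuples;
        return new SimpleCost(selectivity * pairs,
                              left.tupleSize + right.tupleSize,
                              left.cpuCost + right.cpuCost + pairs,
                              left.blockIOs + right.blockIOs);
    }
}

Note how the tuple-count estimate is scaled by the predicate's selectivity, while the CPU cost reflects the tuples examined and accumulates up the tree, as described earlier.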
Selectivity estimates are essential for guessing how many rows a plan-node will produce. Besides our table statistics, we must also be able to guess the selectivity of a predicate. A predicate's "selectivity" is simply the probability that a row from the input will satisfy the predicate; if we know the number of rows that come into a node, and we know the selectivity of a node's predicate, we simply multiply the two together to estimate the number of rows produced.
In NanoDB, the SelectivityEstimator
class (queryeval
package) is used for making all selectivity estimates. You must complete the implementation of this class to compute the selectivity of predicates that appear in query plans. Specifically, you must support these kinds of predicates:
If a predicate doesn't fall into one of these categories, use a default selectivity estimate of 25%. (This constant is defined for you in the SelectivityEstimator
class.)
Some important caveats about the last four kinds of predicates:
Table statistics may be missing or empty, e.g. if no one has yet run ANALYZE on a table referenced by a query. In those cases, use the default selectivity estimate.
Comparison expressions are normalized (via the normalize() method) so that when a column and a value are being compared, the column will always appear on the left. This means we don't have to support VALUE op COLUMN, only COLUMN op VALUE.
You will see many detailed comments in the SelectivityEstimator
class; you really only have to focus on the costing equations, and how to extract the statistics you will need. (Read the Javadocs for the various classes you will need to use.) Many other parts are provided for you, including the computeRatio()
method that is able to take a column's min and max values, and compute the ratio (high - low) / (maxVal - minVal) mentioned earlier. This function can work with any numeric type.
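As a standalone illustration of the two most common estimates, a sketch might look like the following. The class and method names (SelectivitySketch, estimateEquals, estimateGreaterThan) are made up for illustration and are not the SelectivityEstimator API; in the real class, computeRatio() performs the ratio calculation for you across all numeric types, and the 25% fallback constant is already defined.

// Standalone sketch of two common selectivity estimates (illustrative only).
class SelectivitySketch {
    static final float DEFAULT_SELECTIVITY = 0.25f;   // fallback when stats are unusable

    // COLUMN = VALUE: assuming values are spread evenly over the distinct
    // values, roughly 1/numUnique of the rows will match.
    static float estimateEquals(int numUniqueValues) {
        if (numUniqueValues <= 0)
            return DEFAULT_SELECTIVITY;
        return 1.0f / numUniqueValues;
    }

    // COLUMN > VALUE: the ratio (maxVal - value) / (maxVal - minVal),
    // clamped to [0, 1].
    static float estimateGreaterThan(double value, double minVal, double maxVal) {
        if (maxVal <= minVal)
            return DEFAULT_SELECTIVITY;
        double ratio = (maxVal - value) / (maxVal - minVal);
        return (float) Math.max(0.0, Math.min(1.0, ratio));
    }

    public static void main(String[] args) {
        // employees.salary in stores-28K: min = 35000, max = 80000, 46 distinct values.
        System.out.println(estimateGreaterThan(70000, 35000, 80000));  // ~0.22
        System.out.println(estimateEquals(46));                        // ~0.022
    }
}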
Anytime a predicate is applied within a plan-node, the column-statistics of the node's outputs may also be affected. Therefore, it is important to update these statistics whenever we are able to do so.
NanoDB has a class called StatisticsUpdater (in the edu.caltech.nanodb.queryeval package) that is responsible for this task. It is nearly impossible to predict the statistics for arbitrary predicates, so NanoDB handles a very limited, but still (hopefully) beneficial, set of predicates:
COLUMN op VALUE, where op is one of the following: =, ≠, >, ≥, <, ≤
P1 AND P2 AND ... - conjunctive selection predicate (the individual P1, P2, etc. are called "conjuncts"), although we will only use the conjuncts of the form COLUMN op VALUE.
Note that both of these cases will eventually funnel into the static updateCompareStats()
method on the StatisticsUpdater
class; all of your changes will be in this section of the class.
Where it makes sense to do so, you should update the values in the ColumnStats
objects. (It may not always be possible to make reasonable updates to the statistics.) You will need to explain the rationale behind your changes in your team's design document.
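For example, one reasonable policy for a COLUMN > VALUE conjunct on a numeric column is sketched below. This is standalone, illustrative code (not the StatisticsUpdater API), and the policy it encodes is just one defensible choice; whatever policy you adopt is what you must explain in the design document.

// Standalone sketch of one possible column-statistics update policy.
class ColumnStatsSketch {
    int numUnique;     // distinct non-NULL values
    int numNulls;      // NULL values
    double minValue;   // numeric min/max, when available
    double maxValue;

    // After "col > value": every surviving value lies above 'value', no NULLs
    // survive the comparison, and (assuming roughly uniform data) the number
    // of distinct values shrinks in proportion to the estimated selectivity.
    void applyGreaterThan(double value, float selectivity) {
        if (value > minValue)
            minValue = value;
        numNulls = 0;
        numUnique = Math.round(numUnique * selectivity);
    }
}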
All parts of this assignment can be worked on in parallel, because the interfaces between the components are already defined for you. That said, some parts will not work until other parts are completed, and this is important to consider when planning your approach.
Collecting table statistics should probably be done by one person (unless you enjoy pair-programming; if so, go for it). Just about every other part of the assignment depends on this part working correctly, so it should be completed and verified early on in your efforts.
Predicate selectivity estimates and column-statistics updates both require a relatively small amount of coding to complete, but are more conceptual in nature. It is important to understand the underlying concepts, as well as the provided code, before completing this work.
One person can tackle both of these components, or one person can do the selectivity calculations and another can do the column-statistics updates.
This code can't be tested until stats-collection is working, and the plan-node costing is completed. But, that will not prevent it from being implemented in parallel.
Plan-node costing is mostly straightforward. There are multiple plan-nodes to update, and again, one person can do all of this, or different people can update different plan-nodes. The most complicated plan-node to update is the nested-loop join, but even this one is pretty straightforward.
This is the point where bugs in earlier parts of the assignment may become evident, so it is important that the teammates who worked on those parts of the code are ready to help fix any bugs that are identified.
Make sure you are familiar with what code is provided for you to use, regardless of which parts of the assignment you do. Most of the grungy work of ratio-calculations, casting values, etc. is already taken care of for you. If you find any part of the implementation to be particularly grungy, you may be unaware of some tools we have provided for you to use.
The final part of the assignment (described in the next section) should be explored by all teammates, so that your code is well exercised. Your team can have a "bug bash" to find and fix any issues that your code might have.
This is particularly valuable because the next assignment builds on top of your statistics / plan-costing implementation, and you will want to know this code works properly!
After you have completed your plan-costing code, you will want to ensure that it works properly. You don't have to write any test classes this week (but if you do, there will be extra credit for it!). A simple schema is provided for you in the schemas/stores
directory of your repository. You can load make-stores.sql
, and then load stores-28K.sql
. Obviously, make sure to ANALYZE
these tables before trying any queries, or else you won’t have statistics for the costing computations to use.
Note: You will need to put the answers for this section into your design document, so you might as well record them.
As mentioned last week, NanoDB has an EXPLAIN
command you can use to see what it generates for different query plans. For example, you can try this:
EXPLAIN SELECT * FROM cities;
Once you have completed the costing implementation, you should see something like this (your costs might be different, depending on your assumptions):
Explain Plan:
FileScan[table: CITIES] cost=[tuples=254.0, tupSize=23.8, cpuCost=254.0, blockIOs=1]
Estimated 254.0 tuples with average size 23.787401
Estimated number of block IOs: 1
If you then try:
EXPLAIN SELECT * FROM cities WHERE population > 5000;
This should print out the same plan costs, since the smallest city-population in that table is 100135.
You can see the costs change if you try queries like this:
EXPLAIN SELECT * FROM cities WHERE population > 1000000;
EXPLAIN SELECT * FROM cities WHERE population > 5000000;
Take note of how many rows the database expects this last query will produce. (My estimate is 99.3 tuples.) However, how many tuples does the query actually produce?
Now, if you run the following query, notice that you will get the exact same results, but if you EXPLAIN
it, the costing estimate is very different:
SELECT * FROM cities WHERE population > 8000000;
This is the fundamental limitation of the simple statistics that we track in NanoDB. Clearly, recording a histogram for different columns would produce much more accurate estimates.
You can also try your costing estimates on more complex queries involving join operations. For example:
EXPLAIN SELECT store_id FROM stores, cities
WHERE stores.city_id = cities.city_id AND cities.population > 1000000;
(My analyzer predicts 1776.2 tuples for this query, and a CPU cost of 1,019,776.3 units. Don't feel like you have to match these numbers exactly; they are just my analyzer's estimates.)
Of course, the planner isn't quite intelligent enough to position the query predicates in optimal locations in the query plan, although we can manually rewrite the query to put predicates in better positions:
EXPLAIN SELECT store_id
FROM stores JOIN
(SELECT city_id FROM cities
WHERE population > 1000000) AS big_cities
ON stores.city_id = big_cities.city_id;
(The rewritten query is still estimated to produce 1776.2 tuples, but the CPU cost is now 962,940.8 units. Woo, 56000 "CPU units" less!)
Finally, here is a slower query for you to tinker with:
SELECT store_id, property_costs
FROM stores, cities, states
WHERE stores.city_id = cities.city_id AND
cities.state_id = states.state_id AND
state_name = 'Oregon' AND property_costs > 500000;
The estimated tuple-count on this one is around 23 tuples, and the reference implementation estimates a CPU cost of 45,214,024.0. The actual tuple-count is 7.
See how fast you can get it by rewriting it! (I was able to take it from 35 seconds down to 0.5 seconds on my laptop, with a final CPU cost of 299,724.4.)
The EXPLAIN
command doesn't show the column-stats for each plan-node because this would make the command output even denser than it already is. You can still do some basic testing to see if your code is working by creating nested subqueries in the FROM
clause. Here is an example.
The following plan will report a specific cost and estimated number of tuples. The city_id
column is a primary key with uniform distribution, so the estimate should be pretty accurate.
EXPLAIN SELECT * FROM cities WHERE city_id > 20 AND city_id < 200;
(The estimates for this query may be slightly different from the below queries, for boring, grungy reasons.)
Next, you can force your simple planner to create multiple nested select nodes like this:
EXPLAIN SELECT * FROM (
SELECT * FROM cities WHERE city_id > 20) t
WHERE city_id < 200;
Then, you can reverse the application of the predicates and see if the results are still the same:
EXPLAIN SELECT * FROM (
SELECT * FROM cities WHERE city_id < 200) t
WHERE city_id > 20;
If the plan-nodes are properly updating their output statistics, you should see extremely close or even identical costing estimates for these kinds of queries.
Once you are finished with the above tasks, your team needs to complete the design document for this assignment, available in your repository at "doc/lab3design.txt
".
When you are ready to submit your code, and your design document includes your team's answers to the design questions, you need to tag the version of your code that you want us to grade. Only one member of your team should perform these steps. If multiple people on your team do this, it will mess up your team's repository.
The instructions are the same as for the previous assignment, although you will use a hw3
tag this week. Again, note that if you decide to resubmit your work (e.g. because you found a bug that you needed to fix), you cannot reuse the same tag name again! Therefore, use an updated tag name "hw3-2
", "hw3-3
", etc. so that we can still find the submission. Also, don't forget to push the new tag to your submission repository.
Finally, one team member should complete this short document and submit it on the course website. Only one teammate should submit this document so that we don't get confused! You can list the Git repository hashes of your tags with this command:
git show-ref --tags
If you want to earn some extra credit, here are some options for you to pursue. Estimates of the amount of extra credit that will be received are given below; to obtain full credit, code must be both correct and well-written/commented. Partial completion will result in partial credit. Additional credit will be awarded for high-quality unit-testing code.
NanoDB schemas are able to report what candidate keys hold on the schema. However, none of the plan-nodes propagate this information on their output schemas. This would be very useful though, so that joins can produce much better estimates for the number of rows they will produce.
Update these plan-nodes to report the candidate keys on their outputs (+10pts):
SimpleFilterNode
FileScanNode
SortNode
RenameNode
GroupAggregateNode should also report a candidate key on the grouping attributes.
Update the NestedLoopJoinNode plan-node to use candidate-key information to produce better costing estimates. Obviously, this will also require careful analysis of the predicate being used to join. (+10pts)
Update the FileScanNode and SimpleFilterNode to take candidate-key information and the predicate being used, and determine when the node can stop after the first row is found and produced. (+10pts)
Many databases are able to update column statistics in other scenarios besides the ones given in this assignment. Try adding support for these additional kinds of predicates:
COLUMN IN (VALUE-LIST) (see the InValuesOperator
class for this kind of comparison). The COLUMN should not appear in other conjuncts. (+3pts)
COLUMN = VALUE1 OR COLUMN = VALUE2 OR ... (where all column-references are the same column name, all comparisons are "=", and the column does not appear in other parts of the predicate).
Additionally, support the case where one of the conjuncts of a conjunctive selection is a nested expression of this form. (+7pts)
You may have noticed that there is no testing for the table-analysis and plan-costing code in Assignment 3. Implement a complete and high-quality set of unit-tests to exercise and verify these components. The better and more useful tests you add, the more extra credit points you will get. (If they are very good, we will likely incorporate them into the course codebase.)
The current version of the assignment is very flexible in how students may compute estimates. You may need to require that plans are costed in very specific ways, or that selectivity is estimated in very specific ways, in order to provide useful tests. This is fine, although you will need to explain your choices as part of your submission.
Approximate bonus points:
Additional points may be awarded if your code is of a quality that it is able to be incorporated into the CS122 codebase for future years.
The statistics-collection and selectivity-estimation code has not been updated to include support for the NUMERIC, DATE, TIME, and TIMESTAMP types. While NUMERIC should be relatively easy, supporting the ratio calculations outlined earlier is more challenging for the date/time types.
To pursue this extra-credit task, you will want to load the TPC-H data provided as part of Challenge 1, since this uses some of these data types.