Challenge 2: Run the TPC-H Queries

Last updated February 9, 2019 at 10:00PM.

In this challenge you will attempt to efficiently run as many of the 22 TPC-H benchmark queries as your database can support. While this challenge also requires you to load the data-sets provided with Challenge 1, you don't have to focus on loading the data as quickly as possible. Instead, you will focus on making sure your planner handles all queries properly, and that you are able to generate reasonably optimized execution plans for the queries you are able to execute.

Loading TPC-H Data

In order to run the TPC-H queries, you must load the TPC-H data. You can review the documentation in Challenge 1 for more details. Make sure you can successfully load the make-tpch.sql file, and at least the "SF=0.01" data-set.

You will also want to run ANALYZE on the TPC-H tables, so that you can generate reasonable-quality plans.

TPC-H Queries

In all likelihood, your database implementation will already run a few of the provided queries, based on your team's work in the first few assignments. However, the queries also rely on other advanced features of SQL:

Q2, Q4, Q17, Q20, Q21 and Q22 use correlated evaluation. NanoDB can do correlated evaluation, but your planner must recognize and handle it correctly. You may not be able to run these until you have completed the subquery lab (lab 5). (up to +15pts)
Q8, Q12 and Q14 use CASE ... WHEN ... ELSE syntax, which is presently unsupported in NanoDB. You would need to extend the NanoDB parser with support for this syntax, as well as writing a new Expression subclass to evaluate such expressions. (up to +40pts)
Q15 includes a view definition. Views are presently unsupported in NanoDB. You would need to implement this functionality at some level to support this query. (up to +40pts)
Q7, Q8 and Q9 use an EXTRACT() function that must be added to NanoDB. Adding functions to NanoDB is relatively straightforward, although you would be expected to create a generalized version of EXTRACT(), not just support the specific usage in these queries. (up to +20pts)
Q22 uses a SUBSTRING() function that must be added to NanoDB. (up to +10pts)

Keep in mind that the goal is not to run all of the queries. The goal is to run as many of the queries as you can. It is not likely that any team will successfully support all of these queries, unless they pursue this as part of a CS123 project. Additionally, high-performance plans for some of the queries would require advanced optimization techniques that are not covered in CS122.

Query Results and Timings

The NanoDB Reference Solution can run all queries but Q15 because it doesn't yet support views. In the future, this section will be updated with expected results and timings for the 22 queries.

Using the S-0.01 data-set, most of the queries finish executing in under 2 minutes with the cost-based join planner in the reference solution.

Performance Showdown

Participating teams will be ranked on how many of the 22 TPC-H queries they are able to successfully parse, plan and execute. This will be done against a smallish data-set, so that plans will be likely to finish in a reasonable timeframe. The tables will be ANALYZEd before queries are run so that your planner will have useful statistics to run against.

Of course, query results must be correct. Also, if a team's implementation takes too long to execute a given query then it will be excluded from the team's results.

For each query that teams are able to execute, the teams will be further ranked on how quickly they can execute each of the queries. Again, answers must be correct, and they must be produced in a reasonable timeframe. The size of the data-set may be increased as necessary.

The rankings will only include team names, so your team can remain anonymous if you wish to.

Participating in the Challenge

If your team decides to participate in this challenge, your team should download and fill in this document, and send it to "cs122tas [AAAATTTT] caltech.edu". Anytime your team is able to run more queries, and/or is able to run them more quickly and efficiently, you should fill out a new version of the document and send it to the mailing list.

If a given query cannot execute successfully, or takes a very long time to run, we will not be able to count it for credit. The whole point is to be able to run the queries reasonably quickly. (Plus, all of this will likely run on Donnie's laptop, and he will eventually get bored and kill your program if it takes too long.)