Last updated January 25, 2019 at 5:00PM.
In this challenge you will attempt to load a rather sizable data-set into your NanoDB database implementation. You should not attempt this until you have successfully completed your storage-performance improvements, since it will be too time-consuming otherwise. Additionally, you may encounter failures that are difficult to debug if you try to do too much in one step.
The data-set is part of the TPC-H performance benchmark. The "TPC" is the Transaction Processing Performance Council (which is already slightly confusing because it is not the TPPC or the TP2C as one might expect). The TPC devises performance benchmarks for various scenarios, for companies to use on their database systems. The TPC-H benchmark is focused on some simple data-warehouse operations with a relatively straightforward schema. The nice thing about the TPC-H benchmark is that it can generate as much or as little data as you want, so you can easily use it to exercise the performance of a database server.
It is expected that your solution will involve a combination of changes and improvements to the NanoDB sources, as well as altering various configuration options to tune for the size of the data-set. For example, using indexes to check key constraints is essential for good performance in most databases; similarly, managing your transaction log effectively becomes important when a very large amount of data is being loaded into the database.
Because of this, it is expected that you will want to pursue this challenge off and on throughout the term, particularly as new features are incorporated into NanoDB. It is not intended to be a one-off task. We also hope you will enjoy the challenge of trying to improve the performance of your system, and to load a much more substantial data-set than we will work with on the regular assignments.
Believe it or not, most database vendors forbid publishing benchmark results for their database software in their End User License Agreements (EULAs). It all stemmed from a research paper written by a professor named David DeWitt, who happened to greatly annoy Oracle's Larry Ellison by showing that Oracle's database server didn't perform very well at certain basic operations. You can see the performance figures and read about the aftermath here, but the main consequence is the inclusion of these so-called "DeWitt clauses" in most vendors' database EULAs. A secondary consequence is that most database research papers that include performance benchmarks will "anonymize" the systems being compared, e.g. by referring to "Vendor A," "Vendor B," and so forth.
By default, NanoDB enforces all constraints you might specify on tables. This includes "NOT NULL" constraints, primary/unique key constraints, and foreign key constraints as well. However, until you actually have indexes to check key constraints quickly, NanoDB will do it the slow way - scan through every single row. Clearly you don't want to do this while you load a massive data-set.
Fortunately, you can turn off key-constraint enforcement using the "nanodb.enforceKeyConstraints" property. You should turn this off before attempting to load TPC-H data. (You should probably not hard-code this change in ServerProperties, since other assignments may require it to be turned on. Plus, you should eventually be able to load the entire data-set with all constraints turned on.)
Make sure that nanodb.flushAfterCmd is turned off, so that the database doesn't flush its caches unnecessarily between operations; otherwise you will incur a lot of extra write overhead.
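As a concrete illustration, here is roughly how these two properties might be turned off from the NanoDB prompt before running your load script. This is only a sketch: it assumes your NanoDB build exposes a SET PROPERTY command for these server properties, so adjust the syntax if your server is configured differently (e.g. via system properties at startup).

-- Sketch: disable key-constraint enforcement and per-command cache flushing
-- before loading TPC-H data.  (Property names are from this page; the
-- SET PROPERTY syntax is an assumption, so check your NanoDB's documentation.)
SET PROPERTY 'nanodb.enforceKeyConstraints' = false;
SET PROPERTY 'nanodb.flushAfterCmd' = false;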
All performance testing will be run with a buffer-pool size of 10MiB (10,485,760 bytes). Since the main focus of the challenge is loading data, you should not need a particularly large buffer-pool size.
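If you want to test under the same conditions, you can constrain the buffer pool similarly. The property name below is an assumption; check ServerProperties for the exact name your codebase uses, and note that the buffer-pool size may need to be set at server startup rather than at runtime.

-- Sketch: limit the buffer pool to 10MiB for testing.
-- (The property name nanodb.pagecache.size is assumed, not confirmed.)
SET PROPERTY 'nanodb.pagecache.size' = 10485760;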
The database schema for TPC-H consists of eight tables that represent customer orders from part suppliers. Here is a diagram of the schema from the TPC-H documentation:
The arrows in the diagram are a bit confusing: the arrow-heads (there may be one or more) point to the foreign keys in the referencing tables, and the single tail is at the primary key in the referenced table.
You will see that the top of each table has a number, possibly with "SF *" in front of it. A number without "SF" is the total number of rows in the table. For tables with "SF" by the number, "SF" is the "Scale Factor," which is used to scale the TPC-H data up or down; the chart indicates approximately how many rows will end up in each table for the chosen scale factor. An "SF" of 1 generates about 1GB of data to insert.
You can see from the above diagram that most of the tables are small, but customer, part, partsupp, orders and lineitem are very large, with lineitem being monstrously huge.
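To give a flavor of the schema, here is a minimal sketch of DDL for one of the small tables, with column names and sizes taken from the TPC-H specification. The schema you actually load is authoritative; its types and constraint syntax may differ from this sketch.

-- Sketch of the TPC-H nation table (column definitions per the TPC-H spec;
-- the actual DDL you load may declare types and constraints differently).
CREATE TABLE nation (
    n_nationkey INTEGER      NOT NULL PRIMARY KEY,
    n_name      CHAR(25)     NOT NULL,
    n_regionkey INTEGER      NOT NULL REFERENCES region (r_regionkey),
    n_comment   VARCHAR(152)
);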
As you might notice, all of the data files are bzip2-compressed tar files. Each one decompresses into its own directory, so you shouldn't have to worry about files overwriting each other. The files can be uncompressed at the command-line with a command like this:
tar -xjvf s-1.0.tbz
Participating teams will be ranked on whether they are able to load the entire data-set, how large the resulting table files are, and how long it takes to load the data. The rankings will only include team names, so your team can remain anonymous if you wish.
The ultimate goal is to import all of the TPC-H data with all NanoDB features turned on - transactions, indexes, key enforcement, etc. - and then to successfully analyze and generate statistics for all tables. However, we will do this incrementally, so that initially your code will be tested with key-constraint enforcement disabled, transactions disabled, and so forth. Then these features will be re-enabled as we progress through the term.
The performance test will be conducted as a sequence of steps, with the first four steps being timed; the fifth step is to verify that the data was loaded correctly. As part of the timed steps, ANALYZE will be run on all TPC-H tables to generate statistics for them. For the verification step, the queries "select count(*) from lineitem;" and "select c_nationkey, count(*) num from customer group by c_nationkey;" will be run, and the answers verified against known results for each scale factor. Your submission will only be counted if it is able to successfully complete all of these steps.
The total size of the "datafiles
" directory will also be computed at the end of the performance test.
If your team decides to participate in this challenge, your team should download and fill in this document, and send it to "cs122tas [AAAATTTT] caltech.edu". Anytime your team is able to load a larger data-set, and/or is able to turn on more database functionality and still load an existing data-set, you should fill out a new version of the document and send it to the mailing list.
If your project takes a very long time to load the TPC-H data-set, we will not be able to count it for credit. The whole point is to be able to load the data quickly. (Plus, all of this will likely run on Donnie's laptop, and he will eventually get bored and kill your program if it takes too long.)