Assignment is due Friday, January 25 at 5:00PM.
Last updated January 16, 2019 at 10:30PM.

In this assignment you will complete the following tasks:

These tasks are described in detail in the following sections.

Getting Started

Before you can begin working on this project, you must be able to check out and build your team's codebase. You are also strongly encouraged to set up IntelliJ IDEA so that you may easily navigate the code, build the project, and run the debugger. Lots of detailed information is available in the introductory documentation. If you have any questions or issues not addressed by the documentation, please don't hesitate to contact Donnie and/or the TAs.

Once you have completed all of your work, along with the design document for the project, your team will need to fill out and upload a short document containing logistical details to the course website. Part of this information includes how much time each teammate spent working on the assignment. This information is essential for us to get good feedback on the scope of assignments, so you will receive point deductions if it is not included. Future generations of CS122 will thank you!

NanoDB Tuple Updates and Deletion

As it stands, NanoDB does not yet support updating or deleting tuples. Your first tasks will be to complete the operations of updating and deleting tuples in a table. Tables and indexes are both implemented as "tuple files" in NanoDB, which allows us to use the same storage implementation to manage both tables and indexes.

NanoDB heap tuple files have a simple structure. The major aspects are described here, along with the classes that provide the associated functionality. All of these classes are in the edu.caltech.nanodb.storage.heapfile package; you should go to this package and read through the classes mentioned here.

When a tuple is to be deleted, the HeapTupleFile.deleteTuple() method is called with the specific tuple to be deleted. This method determines the slot-index of the tuple, and then uses the DataPage.deleteTuple() method to delete the tuple at this slot.
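The slot-compaction involved in deleting a tuple can be sketched with a toy slotted page. Everything below (the EMPTY_SLOT convention, keeping the slot array and lengths in plain int arrays instead of the page's own header bytes, the compaction direction) is illustrative only; consult DataPage for NanoDB's actual layout:

```java
import java.util.Arrays;

// Toy slotted page: tuple bytes are packed toward the end of a byte
// array, while a slot array records each tuple's offset.
class SlottedPageSketch {
    static final int EMPTY_SLOT = 0;   // offset 0 marks a deleted slot

    final int pageSize;
    final byte[] data;
    final int[] slots;     // slot i -> offset of tuple i's first byte
    final int[] lengths;   // slot i -> length of tuple i in bytes

    SlottedPageSketch(int pageSize, int numSlots) {
        this.pageSize = pageSize;
        this.data = new byte[pageSize];
        this.slots = new int[numSlots];
        this.lengths = new int[numSlots];
    }

    // Smallest offset of any live tuple; free space ends here.
    int dataStart() {
        int start = pageSize;
        for (int off : slots)
            if (off != EMPTY_SLOT) start = Math.min(start, off);
        return start;
    }

    // Store the tuple just before the current data area, using the
    // first free slot; returns the slot index.
    int insertTuple(byte[] tuple) {
        int off = dataStart() - tuple.length;
        System.arraycopy(tuple, 0, data, off, tuple.length);
        for (int i = 0; i < slots.length; i++) {
            if (slots[i] == EMPTY_SLOT) {
                slots[i] = off;
                lengths[i] = tuple.length;
                return i;
            }
        }
        throw new IllegalStateException("no free slot");
    }

    byte[] getTuple(int slot) {
        return Arrays.copyOfRange(data, slots[slot], slots[slot] + lengths[slot]);
    }

    // Delete the tuple in the given slot: mark the slot empty, slide
    // every tuple stored before it toward the end of the page so the
    // free space stays contiguous, and fix up the affected offsets.
    void deleteTuple(int slot) {
        int off = slots[slot], len = lengths[slot];
        if (off == EMPTY_SLOT) throw new IllegalArgumentException("slot is already empty");
        int start = dataStart();
        System.arraycopy(data, start, data, start + len, off - start);
        for (int i = 0; i < slots.length; i++)
            if (slots[i] != EMPTY_SLOT && slots[i] < off) slots[i] += len;
        slots[slot] = EMPTY_SLOT;
        lengths[slot] = 0;
    }

    public static void main(String[] args) {
        SlottedPageSketch p = new SlottedPageSketch(100, 4);
        int a = p.insertTuple(new byte[]{1, 2, 3});
        int b = p.insertTuple(new byte[]{4, 4, 4, 4});
        p.deleteTuple(b);
        System.out.println("tuple " + a + " intact; free space ends at " + p.dataStart());
    }
}
```

The key point the sketch demonstrates is that deleting a tuple shifts its neighbors, so every slot whose offset precedes the deleted tuple must be updated.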

Enable tuple deletion by completing the implementation of the DataPage.deleteTuple() method. Your implementation should conform to these guidelines:

Enable tuple updating by completing the implementation of the setNullColumnValue() and setNonNullColumnValue() methods of the PageTuple class. These methods are somewhat tricky to get right. They are called by PageTuple.setColumnValue() to modify an existing tuple’s column values.
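One small piece of this machinery, setting and clearing per-column null flags, can be sketched on a raw byte array: one bit per column, with a set bit meaning the column is NULL. The offset parameter and bit-ordering here are assumptions for illustration; PageTuple defines the real layout.

```java
// Toy null-flag manipulation on a raw byte array.  The flagsOffset
// and bit-ordering are illustrative assumptions, not PageTuple's
// actual format.
class NullFlagsSketch {
    static boolean getNullFlag(byte[] page, int flagsOffset, int col) {
        return (page[flagsOffset + col / 8] & (1 << (col % 8))) != 0;
    }

    static void setNullFlag(byte[] page, int flagsOffset, int col, boolean isNull) {
        if (isNull)
            page[flagsOffset + col / 8] |= (1 << (col % 8));
        else
            page[flagsOffset + col / 8] &= ~(1 << (col % 8));
    }

    public static void main(String[] args) {
        byte[] page = new byte[16];
        setNullFlag(page, 4, 3, true);
        System.out.println("column 3 null? " + getNullFlag(page, 4, 3));
    }
}
```

Remember that in the real methods, changing a value's size (or NULLing it out) also means shifting the tuple's remaining data within the page and updating the affected offsets, much as in the deletion case.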

NanoDB Buffer Management

The Buffer Manager has a challenging job; among other things, it must decide which file buffers to keep in memory, and which should be evicted. Further complicating this task is the fact that queries may be accessing many different tuples, and the pages containing those tuples must be kept in memory until the queries are finished using them.

A simple and common approach to solve this problem is to "pin" disk pages in memory while they are being used; the Buffer Manager cannot evict a page that is currently pinned. When a page is no longer in use, it is "unpinned," allowing the Buffer Manager to evict the page. This is a simple reference-counting mechanism, but it works very well to let the Buffer Manager know what it may safely evict.

In NanoDB, both DBPage objects and Tuple objects may be pinned and unpinned. The Buffer Manager works with pages, and queries work with tuples. When a DBPage is retrieved from the Buffer Manager, its pin-count is always incremented so that it won't immediately be evicted out from under the requester. Similarly, when a new tuple is retrieved, the tuple pins its backing DBPage to prevent the page from disappearing out from under it. When a tuple is unpinned, it will automatically unpin its backing DBPage. Of course, these objects may also be re-pinned by various components (e.g. a sort plan-node that needs to keep the tuple around for a while). However, they must eventually be unpinned, or else the Buffer Manager will not be able to reclaim the corresponding buffer space.
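The reference-counting scheme described above can be sketched in a few lines. These classes are illustrative stand-ins, not NanoDB's actual DBPage and Tuple classes:

```java
// Minimal pin-count sketch: a tuple pins its backing page when
// created, and unpins the page when the tuple itself is unpinned.
class PinSketch {
    static class Page {
        int pinCount = 0;
        void pin()   { pinCount++; }
        void unpin() {
            if (pinCount <= 0) throw new IllegalStateException("pin-count underflow");
            pinCount--;
        }
        // The Buffer Manager may only evict pages with a zero pin-count.
        boolean evictable() { return pinCount == 0; }
    }

    static class Tuple {
        final Page backingPage;
        Tuple(Page p) { backingPage = p; p.pin(); }
        void unpin()  { backingPage.unpin(); }
    }

    public static void main(String[] args) {
        Page p = new Page();
        Tuple t = new Tuple(p);
        System.out.println("evictable while tuple is live? " + p.evictable());
        t.unpin();
        System.out.println("evictable after unpin? " + p.evictable());
    }
}
```

Note that a page may be pinned several times at once (by several tuples, or by a plan-node re-pinning it); it only becomes evictable when every pin has been released.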

Make sure your implementation properly unpins tuples (and disk pages, if necessary) when they are no longer needed. (Recall that unpinning a tuple will also unpin its corresponding disk page, so you only have to unpin a disk page if you read it separately, e.g. for the header page which contains no tuples.)

We can simplify this task by recognizing that we only need to unpin a tuple when it is no longer being used by the database engine. For example, a "select" plan-node only needs to unpin tuples that don't satisfy the selection predicate, since these tuples will no longer be used in the execution plan. As another example, all tuples produced by an execution plan should be unpinned once we have finished using them (e.g. to update or print them).
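The "unpin what you won't use again" pattern for a select node might look like this sketch. The Tuple class and its pinned flag are illustrative only, not NanoDB's plan-node API:

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

// A select node unpins each tuple that fails its predicate, since
// the plan will never touch that tuple again.
class SelectUnpinSketch {
    static class Tuple {
        final int value;
        boolean pinned = true;          // tuples arrive pinned
        Tuple(int v) { value = v; }
        void unpin() { pinned = false; }
    }

    // Return the next tuple satisfying pred, unpinning every tuple
    // that is skipped.  The caller unpins the returned tuple once it
    // has finished with it (e.g. after printing or updating it).
    static Tuple getNextMatch(Iterator<Tuple> child, Predicate<Tuple> pred) {
        while (child.hasNext()) {
            Tuple t = child.next();
            if (pred.test(t))
                return t;
            t.unpin();                  // rejected: release immediately
        }
        return null;
    }

    public static void main(String[] args) {
        List<Tuple> tuples = Arrays.asList(new Tuple(1), new Tuple(2), new Tuple(3));
        Tuple match = getNextMatch(tuples.iterator(), t -> t.value % 2 == 0);
        System.out.println("matched " + match.value
            + "; rejected tuple still pinned? " + tuples.get(0).pinned);
    }
}
```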

For now, concentrate on the following areas:

The NanoDB Buffer Manager maintains a strict upper bound on the size of the buffer pool. The size is exposed as the "nanodb.pagecache.size" property, and can be read or written during operation. While you are implementing the proper pinning/unpinning steps in your database, you may want to leave the pool size very large (its max size is 1GiB), but once you have pinning/unpinning working properly then you should be able to shrink the buffer-pool size down to very small sizes (the minimum is 64KiB, the largest size of a single data page) and still see correct operation.

Fortunately, even if you don't get this exactly right, the Buffer Manager will forcibly unpin all disk pages still held at the end of each SQL command. However, it will also nag you if too many pages are still pinned at the end of your queries. Releasing pages as soon as they are no longer needed makes your database operate more efficiently; therefore, your submission will receive point deductions if your implementation leaks too many buffer pages during operation.

NanoDB Storage Performance

Initially, NanoDB has a very simplistic approach to managing its heap files. In each heap file, NanoDB devotes the first page (page 0) to schema and other high-level details, and the remaining pages are devoted to tuple storage.

When NanoDB inserts a new tuple into a heap file, it follows a very simple process: starting with page 1, it looks for a page with enough space to store the tuple. If the page being considered has enough space, the tuple is added to that page; if not, NanoDB goes on to page 2, then page 3, and so forth. If NanoDB reaches the end of the file, it creates a new page and stores the tuple there. (Note that this is a first-fit strategy, not a best-fit strategy.)
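The cost of this linear scan is easy to model: if pages fill up and stay full, insert number i must probe every already-full page before reaching the page with room, so the total number of page probes grows quadratically in the number of inserts. A small counting model, under that simplifying assumption (pages never regain space):

```java
// Counting model for the linear first-fit scan described above.
class FirstFitCost {
    static long countProbes(int n, int tuplesPerPage) {
        long probes = 0;
        int fullPages = 0, inCurrentPage = 0;
        for (int i = 0; i < n; i++) {
            probes += fullPages + 1;   // scan past the full pages, land on the open one
            inCurrentPage++;
            if (inCurrentPage == tuplesPerPage) {
                fullPages++;
                inCurrentPage = 0;
            }
        }
        return probes;
    }

    public static void main(String[] args) {
        // Doubling the number of inserts roughly quadruples the probe count.
        System.out.println(countProbes(1000, 10));   // 50500
        System.out.println(countProbes(2000, 10));   // 201000
    }
}
```

With 10 tuples per page, 1,000 inserts cost about 50,000 probes while 2,000 inserts cost about 200,000; this is the quadratic blow-up the next paragraph refers to.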

This approach is very slow (O(N^2)) for adding a large number of tuples, but it does have the benefit of using space reasonably effectively. However, there is no reason why we can't have both benefits: fast inserts as well as efficient space usage.

The implementation of this strategy is in the addTuple() method of the HeapTupleFile class, in the edu.caltech.nanodb.storage.heapfile package.

Note that two configuration properties also affect NanoDB storage performance.

Measuring Storage Performance

You can give NanoDB a try on this front by creating a simple table. After building NanoDB, start it with the nanodb script (or nanodb.bat on Windows), and create a simple table:

CREATE TABLE insert_perf (
    id INTEGER,
    str VARCHAR(200),
    num FLOAT
);
INSERT INTO insert_perf VALUES ( 1, 'hello', 3.14 );
INSERT INTO insert_perf VALUES ( 2, 'goodbye', 6.28 );
EXIT;

(A .sql file for this schema is in the schemas/insert-perf directory.)

If you run the above commands and then look in the datafiles directory, you should now see a file named insert_perf.tbl. This is the table that we just created and then added the tuples to. It will be 16KiB in size since the default page-size is 8KiB. (You can change the default page size via the nanodb.pageSize property.) If you want to get rid of this table, you can either delete the insert_perf.tbl file from this directory (but make sure NanoDB isn't running when you do this!), or you can use the command "DROP TABLE insert_perf;" in NanoDB.

How do we measure the actual performance? Wall-clock time (i.e. the actual time that the test takes to run) is not useful, because it depends heavily on the computer hardware. Therefore, NanoDB includes a "SHOW 'STORAGE' STATS" command (note that the string "storage" must be in single quotes; it is case-insensitive), which prints out some statistics that give us an approximation of the storage performance:

Ideally, as you improve your insert performance, you should see both the total number of accesses and the distance traveled decrease dramatically.

You can try some larger files, also provided in the schemas/insert-perf directory:

You can try these operations simply by piping these files into the nanodb script. For example:

rm datafiles/*
./nanodb < schemas/insert-perf/make-insperf.sql
./nanodb < schemas/insert-perf/ins20k.sql

You can also generate much larger files if you want to really exercise your work. Note that we have a large insert/delete script that we will run against your database, so it would be advisable for you to exercise your code heavily!

Here are your tasks:

  1. Modify the storage format to improve the insert performance. You can modify the HeapTupleFileManager class in any way that you see fit. You can also modify the HeaderPage and DataPage classes, which do the actual work of reading and writing table pages.

    For example, you might want to add a linked-list of blocks with available space to the table-file, anchored from block 0. You might find it easiest to store such a list at the end of each block, but the design is entirely up to you. You could subtract some space from the value that the getTupleDataEnd() method in the DataPage class returns, to open up some extra space for bookkeeping.

    You could also create some kind of bitmap that records which blocks have available space. If you do this, your design must still support the maximum number of blocks in NanoDB files, which is 65,536 (64K). Recall that page sizes may vary from 512B to 65,536B; your implementation must work properly regardless of a table’s page size.

    Don't forget to update your bookkeeping structures during tuple updates and deletes as well! Even though your primary focus is insert performance, you should make sure that selects (i.e. tuple scans), inserts, updates and deletes all continue to function properly, and that all operations properly include your storage modifications.

    Do not use the table statistics as part of your implementation. This is actually not the intended use of table statistics. Table stats are updated infrequently by the database (because this is an expensive operation requiring many disk IOs), and are therefore almost always out of date. Plus, stats-collection is not currently implemented; you will implement this yourself in a future assignment.

    When you need to add a new page to a table file, you should be aware that the DBFile class provides a method getNumPages() that returns how many pages are currently in the file. Every DBPage has a reference to the DBFile object that it came from, so it should be very easy to tell where the new page should reside.

  2. Provide detailed documentation of your approach in the class-level comments of the HeapTupleFile class.
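As a worked check on the bitmap idea from task 1 above: one bit for each of the 65,536 possible pages is 8,192 bytes, which exceeds a single page at small page sizes, so the bitmap itself may have to span several pages. A sketch of that arithmetic (where those map pages live in the file is a design decision left to you):

```java
// Sizing check for a one-bit-per-page free-space bitmap, under the
// 65,536-page file limit quoted in the assignment text.
class FreeSpaceMapSizing {
    static final int MAX_PAGES = 65536;

    static int bitmapPages(int pageSize) {
        int bitmapBytes = MAX_PAGES / 8;                  // 8,192 bytes total
        return (bitmapBytes + pageSize - 1) / pageSize;   // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(bitmapPages(512));    // 16 pages at the minimum page size
        System.out.println(bitmapPages(8192));   // 1 page at the default page size
    }
}
```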

Additional requirements:

Design Document and Submission

Once you are finished with the above tasks, your team needs to complete the design document for this assignment, available in your repository at "doc/lab1design.txt".

When you are ready to submit your code, and your design document includes your team's answers to the design questions, you can fill out this short document and submit it on the course website. Only one teammate should submit this document so that we don't get confused!