Last updated January 22, 2019 at 3:10PM.

This document describes how to get started with development in NanoDB. It will walk you through the software requirements, how to get the NanoDB sources, and some general information about the codebase.

Software Requirements

Working with the NanoDB codebase requires the following tools and libraries to be present:

Git

In order to check the code out of your team's NanoDB repository, you will need to have a recent version of Git installed on your local system. Git is frequently included on platforms like Linux and MacOS X, but you may need to install it onto Windows systems.

The command git needs to be on your path. For example, typing "git" should print out some general Git information.

Java 11

The NanoDB codebase is written using Java 11. Earlier versions of Java are unsupported.

The Java 11 development kit can be downloaded for a variety of platforms from the Oracle website. (The license should not be an issue for CS122. If you are concerned about it, you can install OpenJDK 11 instead, although the installation process is a bit more complicated.)

The commands javac, java, and javadoc all need to be on your path. For example, typing "javac -version" should report 11 or higher as the version.

Make sure to download the Java Development Kit (JDK), not just the Java Runtime Environment (JRE).

Apache Maven

Apache Maven is used to build NanoDB. Maven can be downloaded from the Apache Maven website. NanoDB has been tested with Maven 3.6.0, which is the current version as of this writing. Other versions have not been tested; they may work, but they may also cause problems.

If you are going to build from the command-line, then the command mvn must be on your path. For example, typing "mvn -v" should print out the version of Maven (and Java) that you are using.

If you are going to build from IntelliJ IDEA, then you don't necessarily need to install Maven separately, because IntelliJ includes a version of Maven packaged into the development environment.

Python

One step in the NanoDB build process runs a Python script to generate a file for the SQL parser. Therefore, you will need to have a Python interpreter on your path. For example, typing "python" should start up the Python interpreter.

IntelliJ IDEA

It is definitely possible to do all NanoDB programming entirely from the command-line, but because the codebase is large, you are strongly encouraged to use a Java IDE (Integrated Development Environment) to work on it. IntelliJ IDEA (pronounced "in-telli-jay idea") is the "officially supported" environment for the course, and is available from the JetBrains website. You can use the Community Edition for free (recommended), or you can request full access to all of the JetBrains tools as a student, which is a fantastic option.

Getting the NanoDB Sources

Gitlab

Every CS122 team has their own NanoDB repository on the server gitlab.caltech.edu. Use your IMSS (i.e. your @caltech.edu) username and password to log in to this server. When you log in, you should see that your repository is already set up for you and your teammates; all you have to do is to check it out and get to work. Each repository is given a fun team-name (this year's theme is Australian animals) which will be used to identify results in any challenges that your team participates in.

You should configure your Gitlab account with an SSH key to access your repository. This is the typical way that developers access shared Git servers. Don't use HTTPS token-based authentication in this class.

Once you have gotten SSH access configured, you can clone your team's repository to your local working environment with a command like this:

git clone git@gitlab.caltech.edu:cs122-19wi/nanodb-team-[yourteam].git

This will create a local directory named "nanodb-team-[yourteam]" in which you can do your software development. The best part is that all of your local changes will be isolated from everyone else's work, and even from the most recent code in the team repository, until you decide to push your work back to the team repository.

Git is a distributed version control system, which means that programmers can work against their own local repository, and then push changes to various remote repositories. (This is different from a centralized repository model, where developers work against a single shared repository server.) The directory you created with the "git clone" command above is your own repository for you to develop with; it is a clone of your team's repository, but it is still a separate repository. When you make changes to your local repository, they will not be included in the team repository until you also run "git push" to push changes to the team repository. Similarly, when your teammates make changes to the team repository, you won't see the changes until you "git pull" them to your local repository. It is important to do this regularly, so that your entire team will stay synchronized. It is also important that you always make sure you only push working code to the team repository. If you break something, you will prevent your teammates from making any progress until it is fixed!

Configuring Git

Before you do any work, you need to tell Git who you are! This ensures that your commits will properly be tagged with your information. This is easy to do:

git config --global user.name "Your Name"
git config --global user.email "username@caltech.edu"

Please use your real name and your IMSS email address, so that we can easily understand your Git commit logs.

You will probably also find it helpful to turn on colorful output:

git config --global color.ui true

Git Repository Details

You should be aware that your local repository actually contains two components in one. First, you will see directories and files like src, pom.xml, nanodb, etc. These are actually not part of the Git repository itself; they are a working copy that you can edit however you see fit. If you decide you don't like the changes you have made in your working copy, you can always revert back to the local repository's version with no problems.

When you are completely satisfied with your changes, then you can commit these changes to your local repository. The repository itself is stored in a subdirectory named .git, which you can see if you type "ls -al". (Feel free to look in this directory, but don’t modify anything in there unless you know exactly what you are doing.)

To tell what files have been changed, you can type "git status". Git will show you a list of all files that have been modified, along with many other details. If you decide you don't like some of your changes, you can follow the directions in the status message to revert those files.

Getting Code Updates and Bugfixes

During the term, we will periodically publish more code for teams to use. This is partly because some code is not yet ready, and partly because we invariably find and fix bugs as we go through the term. To make your life easier, you should set up your local repository to make it easy to pull down code updates.

If you run "git remote -v", you will see that your team's Gitlab repository is nicknamed "origin". When you run "git pull", changes are moved from your team's repository to your local repository. When you run "git push", your changes are moved from your local repository to your team's repository. The remote named "origin" is the default remote repository used in these cases.

You can add other remotes, like this:

git remote add upstream git@gitlab.caltech.edu:cs122-19wi/nanodb-base.git

Now when you run "git remote -v", you will see the above repository as a second remote. And, grabbing updates to the NanoDB codebase will be very easy; you just need to type:

git pull upstream

We will generally tell teams when they should fetch upstream changes. Note that only one teammate should fetch upstream changes into their local codebase, and then push the changes to their team's repository. If this is done by multiple teammates at the same time then it can lead to a lot of headaches. Also, sometimes there may be merge conflicts that must be resolved, e.g. if we make changes to the same code that you have made changes to. In these cases, if you can't figure out the right thing to do, talk to Donnie or the TAs and we will help you sort it all out.

Setting up an IntelliJ IDEA Project

We strongly encourage you to use the IntelliJ IDEA environment for working with NanoDB. It has many sophisticated features that will make your life much easier as you write and debug NanoDB code. Here are some brief steps for creating a new project for NanoDB.

In IntelliJ IDEA:

  1. Select File -> New -> "Project from existing sources..." in the drop-down menus of the user interface. When the directory-chooser dialog pops up, select the directory containing the NanoDB sources (i.e. the one containing the pom.xml file).

    (In other words, you are selecting the directory that you created when you ran the "git clone" command.)

  2. On the first page of the "Import Project" dialog, select "Import project from external model" and choose Maven. (This will likely already be selected due to the presence of the pom.xml file.)

  3. You should be able to use the default settings on the remaining pages of the "Import Project" dialog, but the fourth page requires attention! Make sure that Java 11 is selected as the project SDK.

    If Java 11 doesn't appear, and you know you have already installed Java 11, you should be able to click the "Add JVM" button to add the Java 11 development kit to IDEA.

When IntelliJ is finished importing the project, you should see an area on the left that contains your project structure. On the right will be a tab labeled "Maven" which you can use to run the various build stages of the project.

IntelliJ stores its configuration in a directory named ".idea" in the root of your project. Don't check in this directory! Its contents vary from computer to computer, so it is an example of something that should not be checked into the code repository. Git allows us to exclude files from the repository by listing them in a file called ".gitignore". You will notice that your project already includes a .gitignore file that excludes the IntelliJ .idea directory, along with the target directory created by the Maven build process, NanoDB data files and log files, and so forth. Basically, anything that is not a source file should not be checked into the repository.

Building NanoDB

As stated earlier, NanoDB is built using Apache Maven. Maven uses a file named pom.xml to configure the build process. This file contains a lot of details, including the dependencies that NanoDB requires, custom build steps, and so forth. When Maven runs, it will download all necessary requirements for the project into a ".m2" subdirectory in your home directory.

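Maven can build various targets in a project "life cycle." For example, "mvn compile" compiles the sources, "mvn test" also runs the unit tests, "mvn package" also builds the project JAR file, and "mvn clean" removes the build output in the target directory.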

Using NanoDB

NanoDB can be started using the nanodb script from the terminal on Linux or MacOS X, or the nanodb.bat script on Windows. These scripts are written to use the JAR file generated by the mvn package step above, primarily so that the tests will be run against your codebase before you attempt to use the database.

Both of the above scripts include comments as to how NanoDB can be configured in various ways. Feel free to read the script files and make changes as necessary, within the guidelines of specific assignments.

When you start up NanoDB at the console, you will be greeted with a simple prompt:

$ ./nanodb
Welcome to NanoDB.  Exit with EXIT or QUIT command.

CMD> 

This is where you can issue SQL commands against the database. Commands are generally case-insensitive. Note that all commands must end with a semicolon ";" character.

As the program says, you can type "exit;" or "quit;" to exit the database. You can also exit with Ctrl-D (end-of-file) or Ctrl-C if you wish. Sometimes if NanoDB is being cranky, Ctrl-C may be your only option!

If you ever run into behavior that you think is a bug, please let Donnie and/or the TAs know so that we can fix it for the class, and/or give you a workaround until a fix is available. NanoDB has many bugs, and you will likely encounter at least a few of them during the term.

Readline and rlwrap

There is a very helpful utility available on Linux and MacOS X called rlwrap. This utility can be used to provide more user-friendly text-editing and command-scrollback support when it is present. You can type "which rlwrap" to see if it is available on your computer, and then install it if you don't have it.

You don't need to do anything special to get NanoDB to use rlwrap; the nanodb shell script will automatically check for rlwrap and use it if it is present.

NanoDB Configuration Properties

Some of the database configuration is exposed as properties. You can see all available properties by typing:

CMD> show properties;
+-----------------------------+----------------------------------------------+
| PROPERTY NAME               |                                        VALUE |
+-----------------------------+----------------------------------------------+
| nanodb.baseDirectory        |                                  ./datafiles |
| nanodb.createIndexesOnKeys  |                                        false |
| nanodb.enableIndexes        |                                        false |
| nanodb.enableKeyConstraints |                                         true |
| nanodb.enableTransactions   |                                        false |
| nanodb.flushAfterCmd        |                                         true |
| nanodb.pagecache.policy     |                                          LRU |
| nanodb.pagecache.size       |                                      1048576 |
| nanodb.pagesize             |                                         8192 |
| nanodb.plannerClass         | edu.caltech.nanodb.queryeval.SimplestPlanner |
+-----------------------------+----------------------------------------------+

Some of these properties can be modified while NanoDB is running. Note that property names must be enclosed in single-quotes, and property names are most definitely case-sensitive.

For example, to change the default page-size used by NanoDB, you can type:

CMD> set property 'nanodb.pagesize' = 4096;
Set property "nanodb.pagesize" to value 4096

Recall that a page size must be a power of 2 between 512 and 65536; if you type an invalid value, NanoDB will tell you:

CMD> set property 'nanodb.pagesize' = 4000;
ERROR:  Specified page-size 4000 is invalid.
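
The page-size rule is simple to express in code. The following is only a minimal sketch of such a check, not NanoDB's actual implementation; the class name PageSizeCheck and the method isValidPageSize are made up for illustration.

```java
// Hypothetical sketch; NanoDB's real validation code may look different.
final class PageSizeCheck {
    static final int MIN_PAGESIZE = 512;
    static final int MAX_PAGESIZE = 65536;

    /** Returns true if pageSize is a power of 2 between 512 and 65536. */
    static boolean isValidPageSize(int pageSize) {
        if (pageSize < MIN_PAGESIZE || pageSize > MAX_PAGESIZE)
            return false;

        // A power of 2 has exactly one bit set, so clearing the lowest
        // set bit with x & (x - 1) leaves zero.
        return (pageSize & (pageSize - 1)) == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidPageSize(4096));  // prints "true"
        System.out.println(isValidPageSize(4000));  // prints "false"
    }
}
```

The range check comes first so that zero and negative values never reach the bit test.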

Other properties are read-only during normal execution:

CMD> set property 'nanodb.baseDirectory' = './datafiles2';
ERROR:  Property "nanodb.baseDirectory" is read-only during normal operation, and should only be set at start-up.

Properties like these can be set either by editing the NanoDB startup script (nanodb and/or nanodb.bat), or by specifying configuration at startup:

$ ./nanodb -Dnanodb.baseDirectory=./datafiles2
Welcome to NanoDB.  Exit with EXIT or QUIT command.

CMD> show properties;
+-----------------------------+----------------------------------------------+
| PROPERTY NAME               |                                        VALUE |
+-----------------------------+----------------------------------------------+
| nanodb.baseDirectory        |                                 ./datafiles2 |
| nanodb.createIndexesOnKeys  |                                        false |
| nanodb.enableIndexes        |                                        false |
| nanodb.enableKeyConstraints |                                         true |
| nanodb.enableTransactions   |                                        false |
| nanodb.flushAfterCmd        |                                         true |
| nanodb.pagecache.policy     |                                          LRU |
| nanodb.pagecache.size       |                                      1048576 |
| nanodb.pagesize             |                                         8192 |
| nanodb.plannerClass         | edu.caltech.nanodb.queryeval.SimplestPlanner |
+-----------------------------+----------------------------------------------+
CMD> 
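
The -D syntax used at startup is the same one that the java launcher uses to set JVM system properties. Assuming the startup script passes such options through to the JVM (this is an assumption; check the comments in the script), a Java program can read them with System.getProperty. The class and method names below are hypothetical and are not part of NanoDB.

```java
// Hypothetical sketch of reading a start-up property; not NanoDB code.
final class StartupPropertySketch {
    /**
     * Returns the value given with -Dnanodb.baseDirectory=... on the
     * command line, or "./datafiles" if the property was not set.
     */
    static String baseDirectory() {
        return System.getProperty("nanodb.baseDirectory", "./datafiles");
    }

    public static void main(String[] args) {
        System.out.println("Base directory: " + baseDirectory());
    }
}
```

Running this sketch with "java -Dnanodb.baseDirectory=./datafiles2 StartupPropertySketch" would report ./datafiles2 instead of the default.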

Redirecting SQL Scripts into NanoDB

You can also redirect SQL files into NanoDB. When this is done, NanoDB will suppress its human-friendly prompts so that you don't see an annoying sequence of "CMD> CMD> CMD> CMD> CMD> CMD> CMD>" prompts on the console. Examples of how to do this are given later in the lab write-up, but a simple example might be:

$ ./nanodb < somedata.sql

Some Notes on the NanoDB Codebase

As you explore NanoDB in the upcoming weeks, there are a few important details that you should be aware of. This section will describe some of those details.

TODO Comments

NanoDB is definitely still a work in progress, and as such, there are numerous places where the code has comments like "// TODO: Implement some thing", or "// BUGBUG: Fix some thing". You only have to implement the features specified by each assignment. We usually also provide "TODO" comments to show you where to put your code, but it should be easy to tell what parts are your responsibility, and what comments are for Donnie to fix "when he gets around to it."

If you ever have any questions about what you are responsible for, please do not hesitate to ask Donnie or a TA. We want to make sure you only do the work that you are expected to do!

Exceptions

If you are familiar with Java then you are probably also familiar with the difference between checked exceptions and unchecked exceptions. Checked exceptions are checked by the compiler; if a piece of code can throw a given exception, then the code must either handle that exception, or declare that the exception may be thrown. Unchecked exceptions are not checked by the compiler, and generally indicate abnormal issues not expected to occur in normal operation (e.g. due to programming bugs). In Java, any exception that is derived from java.lang.RuntimeException is an unchecked (runtime) exception. Any exception that is derived from java.lang.Exception but is not derived from java.lang.RuntimeException is a checked exception.

Databases are an unusual kind of software. Many things can fail, for many different reasons. For example, the simple act of allocating a buffer can cause an IO error if the Buffer Manager must evict data pages to free up space, and the write of one of those data pages fails for some reason. This means that basically all of the code can fail.

Fortunately, this is not a problem, because if a failure occurs, we simply try to roll back the current transaction. This means that all failures are handled at the very top level of command execution (this code is in the NanoDBServer.doCommand(Command) method, by the way). So, since most of the code can fail, and most failures are handled at the top level of the code, we simply use runtime exceptions for most parts of NanoDB. It's an easy solution, and doesn't require us to write "throws Exception" everywhere in the code.
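
To make the pattern concrete, here is a minimal sketch of the idea, not the actual NanoDB code. Every name in it (TopLevelHandlingSketch, PageWriteFailure, handleCommand, and so on) is hypothetical, and the "rollback" is just a string, standing in for whatever the real top-level handler does.

```java
// Hypothetical sketch of "unchecked exceptions, handled at the top level".
final class TopLevelHandlingSketch {
    /** Unchecked: it extends RuntimeException, so callers need no "throws". */
    static class PageWriteFailure extends RuntimeException {
        PageWriteFailure(String message) {
            super(message);
        }
    }

    // Low-level code can fail without declaring anything.
    static void allocateBuffer(boolean writeFails) {
        if (writeFails)
            throw new PageWriteFailure("could not write evicted page");
    }

    // Intermediate layers need neither try/catch nor a "throws" clause.
    static void executeCommand(boolean writeFails) {
        allocateBuffer(writeFails);
    }

    // The single top-level handler: any failure ends in a rollback.
    static String handleCommand(boolean writeFails) {
        try {
            executeCommand(writeFails);
            return "committed";
        } catch (RuntimeException e) {
            return "rolled back: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(handleCommand(false));
        System.out.println(handleCommand(true));
    }
}
```

If PageWriteFailure were a checked exception instead, both allocateBuffer and executeCommand would need a "throws" clause, which is exactly the clutter the paragraph above describes.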

Logging

NanoDB uses the Apache Log4j 2 framework extensively, to provide numerous details of database execution. You will see the result of this logging in the files named "nanodb*.log", with "nanodb.log" being the most recent logs. The config file that controls what components and logging levels are included is called "log4j2.properties". Again, you are encouraged to edit this file to tune the specific information that is included. If you have a confusing failure, the logs are a good option for understanding what was going on along the way.

Java API Documentation

NanoDB includes extensive Javadoc documentation in the codebase, which can be a very handy reference for understanding what different classes do. You can generate the documentation on your local system by running "mvn site" (or building the site step through IntelliJ IDEA). Additionally, the API documentation of the base system (not including changes you make) is available on the CS122 website.

Committing Work on NanoDB

As you work on your assignment, you may want to commit your changes as you get various parts of the project working. In fact, you are encouraged to do this! Nothing is more frustrating than finishing a complicated feature, then immediately mangling it as you start working on the next task. Commit your work anytime you finish anything you don’t feel like doing again. At any point in your work, you can run the command "git status" to see what files have been modified in your working directory.

The command to commit changes to your local repository is "git commit". However, it is important to understand Git's workflow for committing changes to the repository. Changes you make in your working directory will not immediately be included when you commit to your repository; rather, Git maintains a "staging area" of changes to be included in the next commit. In other words, you can make some changes that will be included in the commit, and other changes that will not be included in the commit. A file whose changes will be included in the next commit is described as being "staged" (i.e. its changes are in the staging area). A file whose changes will not be included in the next commit is "unstaged," or "modified but not staged."

To complicate this somewhat, files also fall into two categories: "tracked" files, which have been added to the repository and are managed by Git; and "untracked" files, which have not yet been added to the repository.

The upshot of all this is that if you want to add a new file to your repository, or you want to include changes to an existing file in your repository, you must run "git add filename" to include the file in the staging area. Then, these changes will be included in the next commit.

There is a shortcut for when you haven't added any new files: you can run "git commit -a", which will perform the staging step as well as the commit step. However, if you create a brand new file, you still need to run "git add filename" on that new file before it will be committed.

Git Commit Messages

Every time you make a commit to your repository, you will need to enter a "commit message" describing the changes you made. Make sure you write complete, concise, descriptive commit messages! This is one of the most important skills to have in a team setting, because other people will need to know what you have changed, and worthless commit messages will make their lives much more confusing. Therefore, we associate a significant number of points with your commit logs.

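Read this article about how to write Git commit messages. Follow all of its advice! Make sure to follow the 50/72 format guideline: a summary line of no more than 50 characters, then a blank line, then body text wrapped at 72 characters.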
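
If you want to sanity-check a message against the 50/72 limits (a summary of at most 50 characters, a blank second line, and body lines of at most 72 characters), the rule is easy to sketch in code. This is only an illustration; the class CommitMessageCheck and its method are hypothetical, and nothing in the course tooling is implied to work this way.

```java
// Hypothetical sketch of a 50/72 commit-message check.
final class CommitMessageCheck {
    static final int SUMMARY_LIMIT = 50;
    static final int BODY_LIMIT = 72;

    /** Returns true if the message follows the 50/72 layout. */
    static boolean follows5072(String message) {
        // The -1 limit keeps trailing empty lines instead of dropping them.
        String[] lines = message.split("\n", -1);

        // Line 1: a non-empty summary of at most 50 characters.
        if (lines[0].isEmpty() || lines[0].length() > SUMMARY_LIMIT)
            return false;

        // Line 2, if present, must be blank to separate summary and body.
        if (lines.length > 1 && !lines[1].isEmpty())
            return false;

        // Remaining lines: body text wrapped at 72 characters.
        for (int i = 2; i < lines.length; i++) {
            if (lines[i].length() > BODY_LIMIT)
                return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String message = "Validate page size when setting property\n"
            + "\n"
            + "Reject values that are not powers of 2 between 512 and 65536,\n"
            + "and report an error instead of changing the property.";
        System.out.println(follows5072(message));  // prints "true"
    }
}
```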

Pushing to the Team Repository

Once you have made commits to your local repository, you can push them to your team repository by typing "git push". As stated before, make sure you never break the team repository unless your teammates are expecting it.

If you want to protect yourself from system crashes, you can also push your committed changes to your team's shared repository by running "git push" after completing significant tasks. This is strongly encouraged, since every year at least one or two students struggle with a crashed machine. If you regularly push to the team repository, it will be relatively easy to get back online if your local system goes down in flames.

If you worry about destabilizing the team repository, you can always do your development on a branch. If you want to know more about branches, ask Donnie and/or the TAs!

Submitting Your Assignments

Submitting assignments in this class is pretty straightforward. Every Git commit has a unique commit hash associated with it. You can list commits and their hashes with the "git log" command. Additionally, any commit can be retrieved simply by specifying the commit hash. Therefore, when you have completed your work on a given assignment, one teammate should submit a short document (which will be provided with each assignment) containing a few logistical details, including the commit hash to be graded. This will make it very easy for us to review your work.