CS 11: Python Track: Assignment 1: Libraries and Logging

Goals

In this assignment you will learn some generally useful libraries, and how to set up Jupyter.

Prerequisites

The Unix environment

You are expected to have an account on the CS cluster and to understand the basics of using Unix, the filesystem, logging in and out, etc. You should also be familiar with the man command to access on-line manual pages.

Text editing

You are free to use whatever text editor you like. Good ones include emacs, vi/vim, sublime, atom and WingIDE. We've also heard good things about pycharm.

Familiarity with Python

This course is *NOT* intended to be an introduction to Python. You can and should take CS1 or look through its material if you are shaky with Python syntax. By the end of this course, we will be looking at Python bytecode, so some understanding of the stack/assembly may prove helpful.

We will be using Python 3 for this course, not Python 2, but the differences are fairly small, and you can learn about them by doing a web search on "Python 2 and 3 differences" or something similar. (We expect that you are willing to do this.)

Getting set up with Anaconda and a VM

This course was written for Python 3, which is a cross-platform language. However, this does not mean that every library is guaranteed to work on every operating system. For example, there appear to be issues with exiting plots generated in matplotlib on Mac OS X. For this reason, we are providing you with a VM. You are free to use your own local environment, but be warned that there might be some minor issues. Below, we show you how to install Anaconda, a souped-up Python distribution that comes with all the libraries we will be using pre-installed.

The easiest way to get set up is to follow the instructions on this page. Those instructions are for the CS 11 C track, but they will work for this track too. Once this is done you will have a fully-functional Linux system ready to use. Log in to it and continue with the rest of these instructions.

Open up a web browser in your VM (there's a button on the bottom left). Go to the Anaconda download page. Click the link to download the 64-bit installer because the VM is 64-bit. If for some reason you are installing on your own machine and it is particularly old (mid-2000s or earlier) you might need 32-bit (this is highly unlikely).

Open a terminal. You can do this with ctrl + alt + t, or from the Menu in the bottom left. Change directory to Downloads. Run bash Anaconda3-4.3.1-Linux-x86_64.sh. Agree to the license agreement and follow the defaults. Anaconda will now be installed to /home/student/anaconda3. This will also install a large number of libraries. Many of them we will not use in this track, but we will use a number of the more important ones. Note that this installation should add Anaconda3 to your path. If it doesn't, you will have to add it yourself.

You can check that Anaconda has been installed by running ipython in a new terminal. IPython is a package that will have been installed by Anaconda. It provides a superior interactive Python interpreter for serious users. When it starts, it will print out (among other things):

Python 3.6.0 |Anaconda custom (64-bit)

as one of the lines of output. Another good check is to run which conda, which will tell you where conda is installed. It should be /home/student/anaconda3/bin/conda. Since conda is part of the core of Anaconda, this confirms that the installation succeeded.

You are also going to need the basemap package for a later assignment, so you may as well install it now. In the terminal, type conda install basemap. You are now ready for the rest of this class. Note that the username for this VM is student and its password is spring2017. This is mostly useful when updating software or installing new software from the VM's package manager (apt).

Covered this week

Jupyter Notebooks
The logging module
The collections module

Instructions

Jupyter

This part is not going to be turned in. If you've ever used Mathematica, you might wish that there was an interactive Python terminal in which you could go back and modify previous lines. This exists. Both Sage and Jupyter serve this need. Jupyter is more popular at the moment. Jupyter comes installed with Anaconda, so go to your shell and type:

% jupyter-notebook

(We'll use % as the Unix shell prompt; don't type it.) This will open a web page on your default web browser. Note that the URL is localhost:8888. This means that you are using your browser to interact with this "site", but you are not using any resources from the internet, just your browser. To open a new notebook, click New and select Python [default]. This will bring you to a new page that will look a bit like a Mathematica notebook. Go ahead and type some Python code. Hit Shift-Enter to run the code in a single cell. Variables stored in one cell will be available in another. This course won't explicitly require Jupyter notebooks, but they are a good resource for helping you to write your code. You can close Jupyter from your browser, or by hitting Control-C in the shell.

Both Jupyter notebooks and IPython in the terminal have access to special commands prefixed by %. Some that are particularly useful are listed below.

%magic or %quickref gives a list of all inline commands.
%notebook -e foo.ipynb in a terminal session exports the current session history to the notebook file foo.ipynb.
%pastebin uploads code from the session to a pastebin site as plain text.
%time and %cd let us access regular command utilities.
%prun runs a statement through the profiler, with no separate import required.
%psearch lets you run a wildcard search on objects in the session.
%store and %store -r let variables be stored between sessions.

lab1a: Learn to log

As you might recall from CS1 (or any other programming you may have done), your code will invariably have bugs in it. Rereading the code might help, but it is typically faster to find out which line is doing something unexpected by printing out some relevant variables and comparing them to your expectations. In languages like Python this is often good and fine. However, with code that is (for instance) more vulnerable to race conditions the print statements can sometimes impact the resulting code execution. This stems from the fact that printing to the screen is slow compared to other processes. Consider:

  def foo():
      s = 0
      for i in range(10):
          print(i)
          s += i
      print('done')

compared to

  def bar():
      s = 0
      for i in range(10):
          s += i
      print('done')
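
A rough way to see the cost of printing is to time both functions. The sketch below is an illustration only and is not part of the assignment. It assumes variants of foo and bar that return s (an addition, so the two results can be compared) and a hypothetical helper called elapsed. Exact timings depend on your terminal and machine, so no particular numbers are promised.

```python
import time


def foo():
    s = 0
    for i in range(10):
        print(i)  # the extra work we want to measure
        s += i
    return s


def bar():
    s = 0
    for i in range(10):
        s += i
    return s


def elapsed(func):
    """Hypothetical helper: return (result, seconds taken) for one call."""
    start = time.perf_counter()
    result = func()
    return result, time.perf_counter() - start


foo_sum, foo_time = elapsed(foo)
bar_sum, bar_time = elapsed(bar)
print('foo: {:.6f}s, bar: {:.6f}s'.format(foo_time, bar_time))
```

Both functions compute the same sum; only the time spent differs.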

A somewhat related issue is dealing with large code bases. It's fine if your hundred-line script prints out an error message every once in a while, but every time something goes wrong for an operating system or a web server, it can't just be printed to the screen. However, the information as to what went wrong might still be necessary if some user wants to diagnose and fix their machine.

Thus, we introduce the logging library. Available in both Python 2 and 3, the built-in logging module is versatile and useful. If you need more information than what is contained in this set, look at the documentation here. Let's take a look at how the logging module works. Logging can be configured using logging.basicConfig(). It is a function that takes named arguments to determine what sorts of logging should be done. You can send log messages to a file using filename='mylog.txt', or set the format of the logged messages using format='%(message)s'. Then you can log messages using logging.debug(), .info(), .warning(), .error() and .critical().
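
As a minimal sketch of the filename and format arguments, the following sends messages to a file and reads them back. The temporary directory and the message texts are assumptions made for illustration; in your own code you would simply pass a path such as 'mylog.txt'.

```python
import logging
import os
import tempfile

# basicConfig() does nothing if the root logger already has handlers
# (for example, when re-running a notebook cell), so clear them first.
root = logging.getLogger()
for handler in list(root.handlers):
    root.removeHandler(handler)

# Assumed location: a throwaway directory, so the demo leaves no clutter.
log_path = os.path.join(tempfile.mkdtemp(), 'mylog.txt')

logging.basicConfig(filename=log_path,
                    format='%(levelname)s: %(message)s')

logging.warning('Disk is almost full.')
logging.error('Could not save the file.')

logging.shutdown()  # flush and close the log file

with open(log_path) as f:
    contents = f.read()
print(contents)
```

Nothing appears on the console while the messages are logged; they only show up when the file is read back at the end.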

Let's try some examples:

  import logging
  logging.warning('This is a test warning.')
  logging.critical('It\'s important to understand logging levels.')
  logging.info('The five different logging levels were mentioned above.')

[5] What happened? Write your answer in a comment. (Numbers in brackets are time estimates in minutes, not mark counts.)

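Logging handles messages differently depending on their logging level. A critical message is on a higher level than an info or a warning. We can change the default settings so that messages of every level are handled. (Run the following in a fresh interpreter session: basicConfig() has no effect once the root logger has a handler, and the logging calls above will already have created one.)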

  logging.basicConfig(level=logging.DEBUG)
  logging.debug('This message should now display.')
  logging.error('And so should this one')

Because the level was set to DEBUG, all levels DEBUG and up will be handled by the basic handler. We'll get to handlers in just a minute. Note that we said "DEBUG and up." As should be fairly clear, critical messages are more important than debug or info messages. The importance of the tiers is as follows: debug, info, warning, error, critical. A debug is like a print statement that you want for developing or bug testing your code, but that an end-user shouldn't see.

  logging.debug('The velocity is {}'.format(velociraptor1.velocity))

An info alerts the end user as to what the code is doing.

  logging.info('Purging all files on C:\\.')

A warning warns that something might be going wrong, but that the program is trying to handle it.

  logging.warning('No initial item database provided. Generating default table.')

An error implies that something is wrong, but can perhaps be dealt with. It won't necessarily cause a crash, but likely some issue.

  logging.error('The sum of these positive numbers is negative.')

A critical implies a catastrophic failure. The program will crash, possibly even the system.

  logging.critical('Insufficient resources for this many requests.')

Adjacent logging levels can be fairly similar, and there is often some overlap. The reason to keep in mind which levels are which is that you can customize how each type of message is handled. A given kind of message can be ignored, written to a file or printed to the console.
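
The ordering of the levels can be checked directly, since each level is just an integer constant in the logging module. The sketch below is illustrative; the logger name level_demo is a made-up name, not something used elsewhere in this assignment.

```python
import logging

levels = [logging.DEBUG, logging.INFO, logging.WARNING,
          logging.ERROR, logging.CRITICAL]

# Each level is an integer; higher numbers are more severe.
for level in levels:
    print(logging.getLevelName(level), level)

# A logger set to WARNING ignores everything below WARNING.
demo = logging.getLogger('level_demo')  # hypothetical logger name
demo.setLevel(logging.WARNING)
enabled = [logging.getLevelName(level)
           for level in levels if demo.isEnabledFor(level)]
print(enabled)  # ['WARNING', 'ERROR', 'CRITICAL']
```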

Now we are going to talk about handlers, and then you're going to get to log what happens in some code.

There are three main parts to how logging actually works: the logger, the handler and the formatter.

The logger contains the information about logging itself. It knows what level of logging to pass to the handler(s) and has some assigned handler(s). Instead of using logging.info(), we could create a custom logger called mylogger (using logging.getLogger('mylogger')) and use mylogger.info().

The handler tells us what to do with the message. It has an assigned formatter, and can also have a logging level. The real beauty here is in the number of handler classes available. The two most common ones are the StreamHandler and the FileHandler. However, more exotic ones exist, like the SMTPHandler, which emails messages, or the RotatingFileHandler, which rolls over to a fresh file when the log reaches a particular maximum size and keeps only a limited number of old files, to avoid wasting disk space.

The formatter is the simplest part. It takes the message from logger.info('Test') and formats it. It can do things like add a timestamp (useful if tasks are taking longer than they should), inform you as to what module the message came from (useful if many modules have similar messages) and more.
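
To make the three parts concrete, here is a minimal sketch that wires one logger to one handler with one formatter. The format string and the use of an in-memory stream in place of the console are choices made for this illustration only.

```python
import io
import logging

# Stands in for the console so the output can be inspected afterwards.
stream = io.StringIO()

# The logger: decides which levels get passed on to its handlers.
mylogger = logging.getLogger('mylogger')
mylogger.setLevel(logging.INFO)
mylogger.propagate = False  # keep these messages away from the root logger

# The handler: decides where the message goes.
handler = logging.StreamHandler(stream)

# The formatter: decides what the final text looks like.
formatter = logging.Formatter('%(name)s [%(levelname)s] %(message)s')
handler.setFormatter(formatter)
mylogger.addHandler(handler)

mylogger.debug('Not shown: below the logger level.')
mylogger.info('Test')

output = stream.getvalue()
print(output)  # mylogger [INFO] Test
```

Adding %(asctime)s to the format string would prepend a timestamp; it is left out here so that the output is the same on every run.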

[5] Why is it a good idea to let the logger set the level (rather than the handler(s)) if possible? Answer in a comment.

More information can be found in this tutorial. Please read it now and refer to it in what follows.

lab1b: Use the log

[10] Ben Bitfiddle has written the code in lab1b.py. Help him debug it by using logging. Do not actually fix the code; only change the debug messages. That is, remove the print statements and convert them to logging debug statements. Use the basicConfig tool.

lab1c: Love the log

[40] Ben Bitfiddle has also written the code in lab1c.py. Help him by using logging. Use good judgment as to what levels of logging to use. This code is production code, so debug information should be suppressed, info information should be saved to lab1c.log, and warnings (and higher) should be sent to the console. Hint: This will require multiple handlers. Don't worry about the format you choose.

lab1d: The `collections` Library

Trainer Codestar is working on making his own Pokedex. He wants to count how many Pokemon of each species he has encountered. Unfortunately, he's really busy getting ready for the Elite Four, so you'll have to help. Write the following function so that it takes a string, and adds it to the dictionary. If it is already in the dictionary, increment the counter instead.

  GLOBAL_POKEMON_DICT = {}
  def poke_sort(pokemon_name):
      # TODO

Professor Oak looks over your code and explains to you what a defaultdict is. The collections library contains a number of useful data types, one of which is a defaultdict. A defaultdict is initialized as follows:

  a = defaultdict(arg)

where the argument arg is a callable that takes no arguments and returns the default value, typically a built-in type such as int or list, or a lambda function. When a piece of code looks up a key that the defaultdict does not contain, arg is called to produce the default value, which is stored under that key and returned. For instance:

  from collections import defaultdict
  a = defaultdict(int)
  a['hi'] = 9
  print(a['hi'])    # 9
  print(a['fish'])  # 0: default value of type int

Since we used int as our default type, we get back 0 in the second case.
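
The same mechanism works with other factories. The sketch below is an illustration with made-up data: it shows a lambda that supplies a custom default, and list used as the factory to group items.

```python
from collections import defaultdict

# A lambda supplies a custom default value.
pokemon_type = defaultdict(lambda: 'unknown')
pokemon_type['Pikachu'] = 'electric'
print(pokemon_type['Pikachu'])    # electric
print(pokemon_type['Missingno'])  # unknown

# The failed lookup stored the default under the new key.
print('Missingno' in pokemon_type)  # True

# With list as the factory, each new key starts with its own empty list.
by_type = defaultdict(list)
sightings = [('Pikachu', 'electric'), ('Squirtle', 'water'),
             ('Raichu', 'electric')]
for name, kind in sightings:
    by_type[kind].append(name)
print(dict(by_type))
```

Note that merely reading a missing key adds it to the dictionary, which can be surprising if you later iterate over the keys.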

[10] Rewrite the poke_sort function with a defaultdict. Note the increased cleanliness of the code. If you need the documentation, it's here.

  GLOBAL_POKEMON_DEF_DICT = ????
  def poke_sort2(pokemon_name):
      # TODO

Named tuples

Sometimes you want some benefits of a class structure, but without the overhead of making an entire class. The namedtuple satisfies some of these needs. Suppose you want a rectangle object, but there aren't any methods you can think of. It seems like a bit of a pain to make a new class just for that. So maybe a tuple is in order. However, should the tuple be (x1,x2,y1,y2) or should it be (x1,y1,x2,y2)? Or maybe (x1,y1,x_side_length,y_side_length)? It doesn't really matter, but if you come back to this project in a few weeks, you might not remember the convention. And anyone who looks at this code in the future is going to spend a few minutes thinking about it.

  t[0] + t[1]

doesn't mean anything by itself. However,

  rect1.x1 + rect1.x2

does mean something. Thus, we have the namedtuple.

  a = namedtuple(class_name, lst)

where class_name is the name (a string) of the class you would have made, and lst is a list of the field names of the proposed class, also given as strings.

Thus, we would create the Rect class by doing

  from collections import namedtuple
  Rect = namedtuple('Rect', ['x1', 'x2', 'y1', 'y2'])

Now we can do either of:

  t = Rect(0, 0, 1, 1)
  t = Rect(x1=0, x2=0, y1=1, y2=1)

And we can access pieces using t.x1 or t[0].
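
Putting the pieces together, a short sketch of the Rect type in use might look like this. The particular coordinates are arbitrary values chosen for illustration.

```python
from collections import namedtuple

Rect = namedtuple('Rect', ['x1', 'x2', 'y1', 'y2'])

r = Rect(x1=0, x2=3, y1=0, y2=2)

# Named access makes the arithmetic self-explanatory.
width = r.x2 - r.x1
height = r.y2 - r.y1
print(width * height)  # 6

# It is still a tuple, so indexing works too.
print(r[1] == r.x2)    # True

# Like any tuple, it is immutable: fields cannot be reassigned.
try:
    r.x1 = 10
    mutated = True
except AttributeError:
    mutated = False
print(mutated)  # False

# _replace builds a modified copy instead.
wider = r._replace(x2=5)
print(wider)  # Rect(x1=0, x2=5, y1=0, y2=2)
```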

NOTE: There is nothing to hand in for this section (on namedtuple). It's for informational purposes only.

To hand in

The lab1a.py, lab1b.py, lab1c.py and lab1d.py programs. Part A should be only comments. Please follow the same style guidelines as CS1, which follow PEP 8.