Buddhabrot Renderer, Part 2

After completing Part 1, you should have a program that can render images using the Buddhabrot rendering technique. However, you probably noticed that this computation is incredibly slow.

Modern computers often have many processors to use for computations. However, unless you take steps to leverage these processors, your program will only run on one processor, not taking full advantage of your computer's capabilities.

This week we will update the Buddhabrot program to be multithreaded, so that it can take advantage of multiple processors. At the end of this week, you should have a program that can run much faster on a multi-core system.

Approach

There are many ways to take a sequential computation and parallelize it. The best approach to use often depends on the nature of the problem being solved. We need to split the sequential computation into multiple parallel computations that can be executed on separate threads. If two threads must interact with shared data, we need to synchronize their access to the data to ensure that our program doesn't generate spurious results. Of course, synchronization will generally slow down your program, so it's important to break down the problem such that threads interact with each other as little as possible to complete their task.

The Buddhabrot computation seems straightforward to parallelize: each sample point's trajectory is computed independently of every other point's, so the per-point work can be divided among threads. The only shared state is the image that accumulates all of the trajectories.

We could have multiple threads coordinate access to the image by guarding the image with a mutex, but having multiple threads contend over a single image-lock could become a bottleneck. Additionally, updating a single image from multiple threads can generate performance issues with the hardware caches. (If you are curious about this, you should look at cache coherence protocols that multi-core processors must implement. These protocols keep multiple independent caches of a single shared memory synchronized, so that programs don't see invalid or outdated results.)

To avoid this potential bottleneck - and to give us a more interesting and typical concurrency problem to solve - we will follow a producer-consumer pattern in our implementation. We will have N producer threads that are looking for points that escape from the Mandelbrot set, recording their escape trajectories. These N producers will store their results into a thread-safe queue. A "thread-safe queue" is simply a queue that can be used by multiple threads without generating spurious results, crashing, misplacing values, or having other such concurrency bugs. It is a very common way to allow different components of a concurrent/parallel system to interact with each other.

Finally, one consumer thread will be responsible for reading the results from the queue, and updating the image appropriately. Since only one thread will be accessing the image, we don't need to synchronize these accesses. In our program, the consumer will simply be the main thread (i.e. the one that is started by the OS when we run our program); it will initialize the producer threads, consume their input until they are done, and then output the resulting image data.

Pictorially, it will look like this:

[Figure: Multithreaded Buddhabrot Solver. N producer threads feed a thread-safe queue, and a single consumer thread drains the queue to update the image.]

Thread-Safe Queues

The simplest implementation of a thread-safe queue uses a single mutex to guard the queue's contents. Both producers and consumers must acquire the mutex before interacting with the queue's internal state. With C++11 and later, the C++ Standard Library includes a number of thread synchronization primitives, including std::mutex and std::condition_variable, which make implementing simple thread-safe data structures straightforward.

The queue's contents can be managed with an existing C++ collection class like the std::deque ("deque" is short for "double-ended queue"), which is particularly well suited to this usage pattern: elements can be added to the back of the deque in constant time, and removed from the front in constant time. (In this way, the deque is better for building queues than the std::vector, since vector takes a linear amount of time to remove elements from the front - it must copy all elements over by one location, every time the first element is removed.)

Note that there are other ways to construct thread-safe queues. The above approach is one of the simplest, and also the slowest, since threads will frequently contend for the single mutex guarding the queue. More advanced thread-safe queues use a lock per element to achieve much lower lock contention. Finally, there are also lock-free and wait-free queues which can be extremely fast - but these approaches are also extremely difficult to implement correctly, and can be subject to extremely subtle bugs caused by unexpected hardware and compiler behaviors. Because of this, many projects stick to lock-based approaches because it is simply easier to ensure their correctness.

Differences in Speed

A very important issue with producer-consumer systems is how they behave when producers and consumers run at different speeds. This becomes particularly pronounced if there are many more producers than consumers, or many more consumers than producers.

What happens if the consumer is faster than the producer? In this case, the consumer will frequently drain the queue and find itself with no work to do. This is fine, as long as the consumer can wait passively - that is, sleep until more work arrives, instead of spinning in a busy-wait loop that burns CPU time.

Passive waiting is very easy to implement using mutexes and condition variables. A condition variable lets a thread sleep until some condition becomes true, and lets other threads wake it up when they make that condition true. Together, these primitives allow threads to coordinate access to shared resources without wasting CPU time.

OK, easy enough. But, what happens if the producer is faster than the consumer?

If a producer is producing data faster than a consumer can consume it, then the queue will simply grow, buffering the extra data. The problem is, if the producer is always faster than the consumer, then the queue will grow larger and larger, until all available memory is consumed by the queue. Then your program crashes!

Because of this, we need a way for slow consumers to force faster producers to slow down. This mechanism is called back-pressure, and it is a critical feature for producer-consumer systems to provide.

Our queue can force fast producers to slow down by maintaining an upper bound on how many elements may be stored in the queue. If a producer tries to put data into a full queue, then it must passively wait until the queue has some space. Again, this can be implemented very easily using a condition variable.

Buddhabrot Updates

As stated earlier, you will update your Buddhabrot program to support multiple threads of execution. To do so, you will create a bounded thread-safe queue, and make a few other changes to your program. Read on for details...

Passing MandelbrotPointInfo Objects

Last week we kept things simple and had our compute_mandelbrot() function return MandelbrotPointInfo objects. This was fine for how we were using the function, but once we start passing these objects into C++ STL collections, we need to think more carefully about how much we may be copying data.

Update your compute_mandelbrot() function to allocate the MandelbrotPointInfo struct on the heap. Don't just use a raw new to allocate this object; follow modern C++ best practices and wrap the allocation in a std::shared_ptr that will free the allocation when no one is using it any longer.

Concurrent Bounded Queue Implementation

In a file cbqueue.h (and optionally cbqueue.cpp, if you prefer separating class declarations and definitions), implement a ConcurrentBoundedQueue class that has the characteristics described earlier:

The queue should have these public member functions:

Note that the queue should not need a destructor, because we are using smart pointers to manage all heap-allocated memory. If your queue has a destructor, you are doing it wrong...

Finally, you might notice that it doesn't make sense for this queue to support copy-initialization or copy-assignment, so delete these operations in your class declaration.

Don't forget to properly document your class, including all data members and member functions.

Main Program Updates

Once you have a concurrent bounded queue implemented, you can start to restructure your program to spin up N producer threads to generate data, and then have the main thread use their results to update the image. This is mostly straightforward, but there are a few nuances to be aware of.

Producer Thread-Function

For the Buddhabrot renderer to start multiple threads, your program must provide a function for the std::thread instances to run. In your bbrot.cpp file, implement a function like this:

void generate_bbrot_trajectories(int num_points, int max_iters,
                                 ConcurrentBoundedQueue &queue)

This should be a straightforward refactoring of your work from last week. However, this function should only store points that are not in the Mandelbrot set into the concurrent queue. (Why store points that will just be discarded by the main thread?)

Finally, the function needs a way to say when it is finished producing; otherwise, the consumer won't know when to stop asking for more data. An easy way to do this is to put some kind of "finished" value into the queue. The simplest "finished" value would be a smart-pointer holding the nullptr value. (This is what the shared_ptr default constructor creates.) When the consumer sees this value, it will know that another producer has finished.

The main() Function

In our program, the main() function will run through this sequence of steps:

  1. Parse command-line arguments.
  2. Initialize the concurrent bounded queue with some upper bound (e.g. 100 items).
  3. Initialize the image that will store the final results.
  4. Start up N producer threads.
  5. Consume results from the concurrent bounded queue, until all N producers are finished. The results are used to update the image contents.
  6. When all N producers are finished, the main() function must call thread::join() on all producer threads. Failure to do this will cause the C++ runtime to terminate your program abnormally.
  7. Of course, don't forget to output the image to standard output!

The subsequent sections describe the details for these steps.

1. Command-Line Arguments

Previously, you would invoke the Buddhabrot program like this:

./bbrot 600 50000000 5000 > image.pgm

Add a fourth argument, the number of producer threads to start:

# Start 4 producer threads!
./bbrot 600 50000000 5000 4 > image.pgm

Update the code that checks the number of arguments, and that complains to the user if the arguments aren't correct. Make sure to parse the number of threads that the user specifies on the command-line.

2-3. Initializing the Queue and Image

Initializing the concurrent bounded queue and the image should be straightforward, so we won't say anything else about it here.

4. Initializing the N Producer Threads

The main function will need to create the number of threads requested by the user. You can use your generate_bbrot_trajectories() function as the thread function. Note that the queue must be passed by reference (you want all threads operating on the same queue object), so you may need to employ the std::ref() function. (The std::ref() function is very simple; it just wraps a reference with an object so that the reference can be passed by-value. Slightly weird, but not very exciting.)

Note that sometimes the number of points may not be evenly divisible by the number of producers. Make sure that your program generates exactly the number of points that the user requests.

5. Consuming Results from the Queue

Consuming results from the queue and using them to update the image contents should be very similar to the previous week's version of the program. There is really only one nuance - you will need to keep track of when producers are finished generating data. If you use a smart pointer to nullptr then this will be very easy; you can just keep a count of how many times you have seen a nullptr come through, and when all producers have finished, you know the program is done.

6. Call thread::join() on All Producers

Once each producer thread is done, the main thread must call join() on the thread. Failure to do so will terminate your program abnormally: destroying a std::thread that is still joinable calls std::terminate(). The easiest way to handle this is to wait until all producer threads have finished running, and then call join() on each of them.

This means you will need to keep track of the threads that you started in an earlier step. Hint: A vector<thread> might be a good option for doing this.

Outputting the Image

This should be identical to last time, so we won't say anything about this.

Testing

Once you have completed the above steps, you can start exercising your program to see if it's any faster. You may want to start out with only one producer thread at first, to make sure your program works properly:

./bbrot 200 1000000 1000 1 > img_1000.pgm

200x200 image, 1M points, 1000-iteration limit, 1 producer thread

If the output looks correct, and your program terminates when it's supposed to, then try it with two or more threads:

./bbrot 200 1000000 1000 2 > img_1000.pgm

Your program should continue to generate correct-looking images, but it should be faster (if you run on a computer with multiple CPUs).

Multithreaded Performance

As long as multiple CPUs are available, using multiple threads should make our program faster. There are limits, of course: if you have 4 CPUs and you run with 6 producer threads (and one consumer thread), some CPUs will have to run more than one thread, and this will likely be slower than just running with 3 or 4 producers. In this program the CPU is the bottleneck, so you want roughly as many threads as you have processors; using more threads than that simply adds scheduling overhead without adding useful parallelism.

To explore this, you can measure the performance of your program with the UNIX time command:

# Redirect the program output to /dev/null so we can ignore it
time ./bbrot 1000 10000000 1000 1 > /dev/null

real    1m47.544s  <-- This is the wall-clock time
user    1m55.108s
sys     0m14.927s

If you increase the number of threads to two, and there are two CPUs to use, you should see the program run approximately twice as fast:

time ./bbrot 1000 10000000 1000 2 > /dev/null

real    0m56.543s  <-- 1.9x faster than 1 thread
user    1m59.812s
sys     0m16.470s

And with three threads:

time ./bbrot 1000 10000000 1000 3 > /dev/null

real    0m40.740s  <-- 2.6x faster than 1 thread
user    2m2.805s
sys     0m15.734s

These tests were run on a computer with 32 CPUs. Why is using three producer threads not three times faster than using one producer thread? The answer lies in how we have structured the computation: the single consumer must still process all of the data from the producers, and this is a sequential task. Additionally, our simple queue with a single mutex reduces the amount of time the producers can operate in parallel.

Most parallel programs will still include some work that must be performed sequentially. Because of this, there is a limit to the speedup that may be achieved through parallelizing the task. This is known as Amdahl's Law, and is a very important principle. In general, when parallelizing computations, you want to have as little sequential work as possible, so that each additional CPU/thread will produce a big benefit.

There are many things we could do to make our Buddhabrot renderer use threads and CPUs more efficiently - but that is beyond the scope of this CS11 track!

All Done!

Once you have finished testing and debugging your program, submit these files to codePost.io:

We will use a fresh copy of image.h, so you don't need to submit this file.