CS 179: GPU Computing
Assignment 4

Due: Friday, May 1, 2015 - 11:59 PM


>>> CLARIFICATION/CHANGE LOG <<<

	- All of the code for this assignment is in one file, compiled entirely by nvcc. As of last check, the Nvidia compiler doesn't support the full C++ standard and libraries, so if you want to add code (error checking, etc), stick to C. 
		- If you would really like to use C++, you can modify the makefile from Assignment 3, and use a similar code structure to that given in the previous assignments.

	- Update: We had some problems with image support on haru, so only PNG files are known to work in preprocess.py. I've updated the instructions below.


Submission: 
------------------

By e-mail to kyuh@caltech.edu. Package your files in a standard archive format (e.g. zip, tar.gz, tar.bz2). Please also include your name in the title of your archive.


Resource usage:
------------------

The coding question (Question 3, X-ray CT reconstruction) is a reasonably classic problem. Please do not look up the solution code directly. (Feel free to look up general documentation, as well as general theoretical resources on CT reconstruction.)


Other notes:
------------------

The material involved in this set is likely more difficult than in some previous sets. If you feel that any concepts are unclear, please tell us. 


Question 1: Parallel Breadth-First Search (BFS) (30 pts)
--------------------------------------------------------
--------------------------------------------------------

On Monday, we discussed a variation of the breadth-first-search (BFS) problem, and showed how it can easily parallelized on the GPU by replacing the queue with multiple arrays of state (the "frontier", "visited", and "cost" arrays, along with the compact adjacency list representation of our graph).


1.1 	(10 pts)
---------------------

We've spent much of the course discussing the different types of memory available (shared, constant, texture, etc), as well as good memory practices (coalesced reads/locality, avoiding bank conflicts, etc).

Suppose we wanted to avoid the read-penalties of global memory, and wanted to use per-block shared memory to store the data we need, before we run our computations. Would doing this increase our performance? 

If so, how would you do it, and why would it increase performance?  
If not, why not?

(Be reasonably thorough; a few sentences is sufficient.)


1.2 	(10 pts)
---------------------

As mentioned in class, we call our GPU-side kernel within a loop. Each kernel call solves BFS for the next "layer" of vertices within our graph (depicted in Lecture 10, slides 16-17). 


From Lecture 10 (slides 22-24), this loop's pseudocode is:

while F is not all false:
	call GPU kernel( F, X, C, Va, Ea )

(Here, F was the "frontier" array, which showed every vertex about to be processed in the next iteration. This pseudocode essentially says we continually iterate over layers of the graph, until there are no more vertices to check.)


What is a good, parallelizable way to check whether "F is not all false" at each iteration? 

(Hint: If we store "false" as 0, and "true" as 1 within the frontier array, what happens to the array's sum?)


1.3 	(10 pts)
---------------------

Can you think of another way to check whether "F is not all false"? What are the advantages and disadvantages of this approach? Under what circumstances (e.g. graph density vs sparsity) would this approach perform better (and worse) than your suggestion in (1.2)? 

(Hint: In the GPU pseudocode, what part of the code has to be executed to set an element of F to true? Can you add some code there?)


Question 2: Algorithm compare/contrast: PET Reconstruction (10 pts)
--------------------------------------------------------
--------------------------------------------------------

(This problem is based on the claim made in "Medical Image Processing on the GPU: Past, Present and Future", by Eklund, et al.)


On Friday, we discussed how to reconstruct a "slice" of a 3-D object, using X-ray computed tomography (X-ray CT). 

Recall that, to get this data, one sends X-ray radiation along an angle, and measures the radiation at the other side. After doing this for multiple angles, we then reconstruct the image using the method described in class (filtered backprojection).


Another common medical imaging technique is positron emission tomography (PET). As with an X-ray CT, a detector measures radiation around the slice. However, here the radiation comes from within the patient's body (usually by injection of a radioactive sugar into the circulatory system, followed by positron-electron annihalation). Effectively, we now measure *emission* of radiation, instead of purely *transmission*.

Since radioactive decay is a stochastic process, we gain our radiation measurements bit-by-bit, instead of all at once. Instead of accumulating all measurements into a sinogram, a common method is to store a list of detected emissions over time ("list-mode format"), something like:


	(0.025 ms, location_measurement_0)
	(0.026 ms, location_measurement_1)
	(0.026 ms, location_measurement_2)
	(0.030 ms, location_measurement_3)
	...


Suppose we attempt to reconstruct our image using this data, utilizing the same physical principles as X-ray CT*, where each radiation measurement corresponds to measurement of the density function along all of the pixels within that line. 

Notice that now there is no easy way to determine where the relevant measurement data lies for each pixel that we're trying to reconstruct. So instead of parallelizing over pixels (as we did in class for X-ray CT), we now parallelize over the entries in our list - each measurement will add itself to all pixels along its line.


How do you expect our GPU-accelerated PET reconstruction performance to compare to that of X-ray CT?


(Hint 1: Are there any atomically unsafe operations to be aware of?)

(Hint 2: In X-ray CT reconstruction, we were able to use texture memory to cache sinogram reads. Does that change here?)


(* In practice, people have found better algorithms, namely an expectation-maximization approach to forward and back-projection - details in "List Mode PET Reconstruction" (Kovacs). )


Question 3: X-ray CT Reconstruction (coding+theory) (60 pts, +10 extra credit)
--------------------------------------------------------
--------------------------------------------------------

On Friday, we discussed the GPU-accelerated filtered backprojection algorithm for X-ray CT reconstruction. Here, we implement this algorithm.


3.1 	(25 pts)
---------------------

We saw in class that ordinary backprojection would result in a blurry image, due to the averaging effect of measurements over multiple angles. Our solution was to use a high-pass filter over the sinogram data.

For this assignment, we'll use a basic ramp filter, where the lowest frequency's amplitude is scaled by 0, the highest is scaled by 1 (so preserved), and amplitudes in between are scaled linearly with the frequency.

The DFT of a sinogram measurement on a given angle is structured such that the highest frequency data is in the middle, and the amplitudes are symmetric around this point. *


*** To do: Implement the high-pass filter. (See the sections marked "TODO 1".)
(You'll also need to do the "TODO" sections for any part of Question 3 to work.)


Note: If you don't use the R2C and C2R cuFFT modes (described below), you'll probably need a second kernel to convert the complex number array to an array of floats.


(* One can gain some efficiency with this symmetry by e.g. considering only the first half of the DFT, zeroing the second half, then doubling the result after inverting, or by using cuFFT in R2C and C2R mode. But this is optional.)


3.2 	(35 pts)
---------------------

The "core" of our reconstruction is the backprojection algorithm, described in Lecture 12, slides 19-40. 


*** To do: Implement the backprojection algorithm. (See the sections marked "TODO 2".)
(You'll also need to do the "TODO" sections for any part of Question 3 to work.)


While we discussed texture caching in class, you don't need to implement it (you can do it for extra credit, see below).


Cautionary notes:
	- Pay attention to the edge cases and y-coordinate flip (see slide 40).
	- Also note that (0,0) geometrically is in the middle of the image, so center your coordinates appropriately.


3.3 	(Extra credit: 10 pts)
---------------------

Texture caching of the sinogram will get us a slight performance increase, as discussed near the end of Lecture 4. 

*** To do: Rewrite the setup and reconstruction code to texture cache the sinogram.


Assignment Notes
--------------------------------------------------------
--------------------------------------------------------


The CUDA code itself will work on any of the class machines. 

However, the pre-processing and post-processing Python scripts (below) only work on haru (not mx or minuteman). You can alternatively run them on your local machine after installing the following Python dependencies:

	numpy
	scipy
	matplotlib
	scikit-image

(You'll probably also need pip, which is obtainable with apt-get on Debian-based systems. Then, use pip to install the above four packages.)


To compile: 
	nvcc xray_ct_recon.cu -o xray_ct_recon -lcufft
To run:
	./xray_ct_recon
		( Sinogram filename ) 
		( Width or height of original image, whichever is larger ) 
		( Number of angles in sinogram )
		( threads per block )
		( number of blocks )
		( output filename )


The program takes in a space-delimited array of values corresponding to the sinogram, and outputs another space-delimited array. 

To convert the program output to an image, run:

	python postprocess.py <program output filename>


To produce the data for CT reconstruction, we use a "simulated CT scanner" that does the forward Radon transform on an image - this works on images in the PNG format. To do this, run:

	python preprocess.py <input image filename> <number of angles>


Running this should produce multiple output files. The "4_sinogram" file is your input to your program. The output should resemble the "5_recon_output" image.