CS 179: GPU Computing
Assignment 1

Due: Wednesday, April 6, 2016 - 3:00 PM

Submission:
Include your written answers in a README file. Submit these answers, as well as your code, by e-mail to cs179.ta@gmail.com. Package your files in a standard archive format (e.g. zip, tar.gz, tar.bz2).


Question 1: Common Errors (20 points)
--------------------------------------------------------
--------------------------------------------------------

This class will make heavy use of low-level C constructs and concepts, especially pointers and memory management. 

As a "warm-up", here are a few quick samples of code and their intended specifications. Each such piece of code is incorrect. Identify what is wrong with the code, and how it should be fixed.

(Many of these problems allude to common errors encountered while writing both GPU and CPU code.)


1.1
---------------------
Creates an integer pointer, sets the value to which it points to 3, adds 2 to this value, and prints said value.

void test1() {
    int *a = 3;
    *a = *a + 2;
    printf("%d\n", *a);
}


1.2
---------------------
Creates two integer pointers and sets the values to which they point to 2 and 3, respectively.

void test2() {
    int *a, b;
    a = (int *) malloc(sizeof (int));
    b = (int *) malloc(sizeof (int));

    if (!(a && b)) {
        printf("Out of memory\n");
        exit(-1);
    }
    *a = 2;
    *b = 3;
}


1.3 
---------------------
Allocates an array of 1000 integers, and for i = 0, ..., 999, sets the i-th element to i.

void test3() {
    int i, *a = (int *) malloc(1000);

    if (!a) {
        printf("Out of memory\n");
        exit(-1);
    }
    for (i = 0; i < 1000; i++)
        *(i + a) = i;
}


1.4 
---------------------
Creates a two-dimensional array of size 3x100, and sets element (1,1) (counting from 0) to 5.

void test4() {
    int **a = (int **) malloc(3 * sizeof (int *));
    a[1][1] = 5;
}


1.5
---------------------
Sets the value pointed to by a to an input, checks if the value pointed to by a is 0, and prints a message if it is.

void test5() {
    int *a = (int *) malloc(sizeof (int));
    scanf("%d", a);
    if (!a)
        printf("Value is 0\n");
}


Question 2: Parallelization (30 points)
--------------------------------------------------------
--------------------------------------------------------


2.1
---------------------
Given an input signal x[n], suppose we have two output signals y_1[n] and y_2[n], given by the difference equations: 
		y_1[n] = x[n - 1] + x[n] + x[n + 1]
		y_2[n] = y_2[n - 2] + y_2[n - 1] + x[n]

Which calculation do you expect will have an easier and faster implementation on the GPU, and why?


2.2
---------------------
In class, we discussed how the exponential moving average (EMA), in comparison to the simple moving average (SMA), is much less suited for parallelization on the GPU. 

Recall that the EMA is given by:
	y[n] = c * x[n] + (1 - c) * y[n - 1]

Suppose that c is close to 1, and we only require an approximation to y[n]. How can we get this approximation in a way that is parallelizable? (Explain in words, optionally along with pseudocode or equations.)

Hint: If c is close to 1, then 1 - c is close to 0. If you expand the recurrence relation a bit, what happens to the contribution (to y[n]) of the terms y[n - k] as k increases?


Question 3: Small-Kernel Convolution (50 points)
--------------------------------------------------------
--------------------------------------------------------


Introduction:
------------------
On Friday, we saw that the effect of a linear time-invariant system on an input signal x[n] (to produce an output y[n]) can be summarized by the system's impulse response h[n]. This is the output of the system in response to a unit impulse as input.

We can then find y[n] by computing the convolution, which we denote (*):

	y[n] = (x (*) h)[n]

(See Friday's lecture slides for an expanded definition.)

The goal is to GPU-accelerate this computation. Similar to how we handled the addition problem, we allocate and copy memory as appropriate, and we can use the strategies in Lecture 2 to divide indicies among our many threads.


To do:
------------------
Complete the GPU-accelerated convolution by filling in the parts marked TODO in blur.cc and blur.cu.


Assignment notes:
------------------
The code given to you will run the ordinary CPU version of the convolution, and compare the GPU/CPU speedup and the correctness of the GPU output. The default is currently set to convolve the starting signal with a Gaussian kernel.

There are two modes of operation:

	Normal mode: Generate the input signal x[n] randomly, with a size specified in the arguments.

	Audio mode: Read the input signal x[n] from an input audio file, and write the signal y[n] as an output audio file.

To toggle between these two modes, set AUDIO_ON accordingly, and use the appropriate makefile.

Because convolving with the Gaussian kernel acts as an imperfect low-pass filter, the output file (in audio mode) will have its higher frequencies attenuated. Try it out!

>> (revised 4/5/2015)
On haru (haru.caltech.edu), you should get a speedup of ~6-8x, using a reasonable choice of block size and #blocks (e.g. 512, 200). 


Hints:
------------------
- The CPU code exists already; use it as a guide! Recall that we often accelerate CPU code by replacing it with "similar-looking" GPU code!


Technical details:
------------------
- For UNIX development: to select which makefile to use for the make program, use the -f command-line argument.
- For UNIX development: to use audio mode, you'll need to install libsndfile.
	- On Ubuntu, you can do this with "sudo apt-get install libsndfile1-dev".
	- On OS X, we highly recommend installing the Homebrew package manager from http://brew.sh/, then using the command "brew install libsndfile". Otherwise, you'll have to build it from source code.
	
- For Windows development: no need to install anything, just run the appropriate .cmd file (pseudo makefile) after redefining the AUDIO_ON variable.
- Windows development tip: the Windows analog of the UNIX "cd" command is "dir".