CS 179: GPU Computing
Assignment 1

Due: Wednesday, April 8, 2015 - 3:00 PM


Submission: 

Include your written answers in a readme file. Submit these answers, as well as your code, by e-mail to kyuh@caltech.edu. Package your files in a standard archive format (e.g. zip, tar.gz, tar.bz2).


Question 1: Common errors (20pts)
--------------------------------------------------------
--------------------------------------------------------


This class will make heavy use of low-level C constructs and concepts, especially pointers and memory management. 

As a "warm-up", here are a few quick samples of code and their intended specifications. Each such piece of code is incorrect. Identify what is wrong with the code, and how it should be fixed.

(Many of these problems allude to common errors encountered while writing both GPU and CPU code.)


1.1
---------------------

Creates an integer pointer, sets the value to which it points to 3, adds 2 to this value, and prints said value.


void test1(){
    int *a = 3;
    *a = *a + 2;
    printf("%d",*a);
}


1.2
---------------------

Creates two integer pointers and sets the values to which they point to 2 and 3, respectively.


void test2(){
    int* a,b;
    a = (int*) malloc(sizeof(int));
    b = (int*) malloc(sizeof(int));

    if (!(a && b)){
        printf("Out of memory");
        exit(-1);
    }
    *a = 2;
    *b = 3;
}


1.3 
---------------------

Allocates an array of 1000 integers, and for i = 0,...,999, sets the i-th element to i.


void test3(){
    int i, *a = (int*) malloc(1000);

    if (!a){
        printf("Out of memory");
        exit(-1);
    }
    for (i = 0; i < 1000; i++)
        *(i+a)=i;
}


1.4 
---------------------

Creates a two-dimensional array of size 3x100, and sets element (1,1) (counting from 0) to 5.


void test4(){
    int **a = (int**) malloc(3*sizeof(int*));
    a[1][1] = 5;
}


1.5
---------------------

Sets the value pointed to by a to an input, checks if the value pointed to by a is 0, and prints a message if it is.


void test5(){
    int *a = (int*) malloc(sizeof(int));
    scanf("%d",a);
    if (!a)
        printf("Value is 0\n");
}


Question 2: Parallelization (30pts)
--------------------------------------------------------
--------------------------------------------------------


2.1
---------------------

Given an input signal x[n], suppose we have two output signals y_1[n] and y_2[n], given by the difference equations: 
		y_1[n] = x[n-1] + x[n] + x[n+1]
		y_2[n] = y_2[n-2] + y_2[n-1] + x[n]

Which calculation do you expect will have an easier and faster implementation on the GPU, and why?


2.2
---------------------

In class, we discussed how the exponential moving average (EMA), in comparison to the simple moving average (SMA), is much less suited for parallelization on the GPU. 

Recall that the EMA is given by:
	y[n] = c x[n] + (1-c) y[n-1]

Suppose that c is close to 1, and we only require an approximation to y[n]. How can we get this approximation in a way that is parallelizable? (Explain in words, optionally along with pseudocode or equations.)

Hint: If c is close to 1, then 1-c is close to 0. If you expand the recurrence relation a bit, what happens to the contribution (to y[n]) of the terms y[n-k] as k increases?

	
Question 3: Small-kernel convolution (50pts)
--------------------------------------------------------
--------------------------------------------------------


Introduction:
------------------

On Friday, we saw that the effect of a linear time-invariant system on an input signal x[n] (to produce an output y[n]) can be summarized by the system's impulse response h[n]. This is the output of the system in response to a unit impulse as input.

We can then find y[n] by computing the convolution, which we denote (*):

	y[n] = (x (*) h)[n]

(See Friday's lecture slides for an expanded definition.)


The goal is to GPU-accelerate this computation. Similar to how we handled the addition problem, we allocate and copy memory as appropriate, and we can use the strategies in Lecture 2 to divide indicies among our many threads.


To do:
------------------
Complete the GPU-accelerated convolution by filling in the parts marked TODO in Blur.cc and Blur_cuda.cu .


Notes:
------------------
The code given to you will run the ordinary CPU version of the convolution, and compare the GPU/CPU speedup and the correctness of the GPU output. The default is currently set to convolve the starting signal with a Gaussian kernel.

There are two modes of operation:

	Normal mode: Generate the input signal x[n] randomly, with a size specified in the arguments.

	Audio mode: Read the input signal x[n] from an input audio file, and write the signal y[n] as an output audio file.

To toggle between these two modes, set AUDIO_ON accordingly, and use the appropriate makefile.

Normal mode works on the servers haru, mx, and minuteman. 
Audio mode works only on haru.

Because convolving with the Gaussian kernel acts as an imperfect low-pass filter, the output file (in audio mode) will have its higher frequencies attenuated. Try it out!

>> (revised 4/5/2015)
On haru (haru.caltech.edu), you should get a speedup of ~6-8x, using a reasonable choice of block size and #blocks (e.g. 512, 200). 


Hints:
------------------

- The CPU code exists already; use it as a guide! Recall that we often accelerate CPU code by replacing it with "similar-looking" GPU code!