CS 179: GPU Computing
Lab 1: Introduction to CUDA
Updated by: Loko Kung, 2018
Due: Wednesday, April 10, 2019 - 3:00 PM

Submission:
------------------

Include your written answers in the supplied README file. Submit these
answers and your code by creating a zip file in your home directory on
Titan, in the format:

lab[N]_2019_submission.zip

Example: lab1_2019_submission.zip

For help on using zip on Linux, see:
https://www.geeksforgeeks.org/zip-command-in-linux-with-examples/
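For instance, if your answers and code for this lab live in a directory
named lab1 (the directory name here is just illustrative), an invocation
along these lines produces a suitable archive:

zip -r lab1_2019_submission.zip lab1

(The -r flag recurses into the directory so all files inside are included.)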
================================================================================
Question 1: Common Errors (20 points)
================================================================================

This class will make heavy use of low-level C constructs and concepts,
especially pointers and memory management.

As a "warm-up", here are a few quick samples of code and their intended
specifications. Each such piece of code is incorrect. Identify what is wrong
with the code, and how it should be fixed. (Many of these problems allude to
common errors encountered while writing both GPU and CPU code.)

--------------------------------------------------------------------------------
1.1
--------------------------------------------------------------------------------

Creates an integer pointer, sets the value to which it points to 3, adds 2 to
this value, and prints said value.

void test1() {
    int *a = 3;
    *a = *a + 2;
    printf("%d\n", *a);
}

--------------------------------------------------------------------------------
1.2
--------------------------------------------------------------------------------

Creates two integer pointers and sets the values to which they point to 2 and
3, respectively.

void test2() {
    int *a, b;
    a = (int *) malloc(sizeof (int));
    b = (int *) malloc(sizeof (int));

    if (!(a && b)) {
        printf("Out of memory\n");
        exit(-1);
    }

    *a = 2;
    *b = 3;
}

--------------------------------------------------------------------------------
1.3
--------------------------------------------------------------------------------

Allocates an array of 1000 integers, and for i = 0, ..., 999, sets the i-th
element to i.

void test3() {
    int i, *a = (int *) malloc(1000);

    if (!a) {
        printf("Out of memory\n");
        exit(-1);
    }

    for (i = 0; i < 1000; i++)
        *(i + a) = i;
}

--------------------------------------------------------------------------------
1.4
--------------------------------------------------------------------------------

Creates a two-dimensional array of size 3x100, and sets element (1,1)
(counting from 0) to 5.

void test4() {
    int **a = (int **) malloc(3 * sizeof (int *));
    a[1][1] = 5;
}

--------------------------------------------------------------------------------
1.5
--------------------------------------------------------------------------------

Sets the value pointed to by a to an input, checks if the value pointed to by
a is 0, and prints a message if it is.

void test5() {
    int *a = (int *) malloc(sizeof (int));
    scanf("%d", a);

    if (!a)
        printf("Value is 0\n");
}

================================================================================
Question 2: Parallelization (30 points)
================================================================================

--------------------------------------------------------------------------------
2.1
--------------------------------------------------------------------------------

Given an input signal x[n], suppose we have two output signals y_1[n] and
y_2[n], given by the difference equations:

y_1[n] = x[n - 1] + x[n] + x[n + 1]
y_2[n] = y_2[n - 2] + y_2[n - 1] + x[n]

Which calculation do you expect will have an easier and faster implementation
on the GPU, and why?

--------------------------------------------------------------------------------
2.2
--------------------------------------------------------------------------------

In class, we discussed how the exponential moving average (EMA), in comparison
to the simple moving average (SMA), is much less suited for parallelization on
the GPU. Recall that the EMA is given by:

y[n] = c * x[n] + (1 - c) * y[n - 1]

Suppose that c is close to 1, and we only require an approximation to y[n].
How can we get this approximation in a way that is parallelizable? (Explain in
words, optionally along with pseudocode or equations.)

Hint: If c is close to 1, then 1 - c is close to 0. If you expand the
recurrence relation a bit, what happens to the contribution (to y[n]) of the
terms y[n - k] as k increases?

================================================================================
Question 3: Small-Kernel Convolution (50 points)
================================================================================

Introduction:
------------------

On Friday, we saw that the effect of a linear time-invariant system on an
input signal x[n] (to produce an output y[n]) can be summarized by the
system's impulse response h[n]. This is the output of the system in response
to a unit impulse as input.

We can then find y[n] by computing the convolution, which we denote (*):

y[n] = (x (*) h)[n] = sum over all k of ( x[k] * h[n - k] )

(See Friday's lecture slides for an expanded definition.)

The goal is to GPU-accelerate this computation. Similar to how we handled the
addition problem, we allocate and copy memory as appropriate, and we can use
the strategies in Lecture 2 to divide indices among our many threads (a
generic sketch of this indexing pattern appears at the end of this handout).

To do:
------------------

Complete the GPU-accelerated convolution by filling in the parts marked TODO
in blur.cu.

Assignment notes:
------------------

The code given to you will run the ordinary CPU version of the convolution,
and compare the GPU/CPU speedup and the correctness of the GPU output. The
default is currently set to convolve the starting signal with a Gaussian
kernel.

There are two output binaries:

noaudio-blur: Generate the input signal x[n] randomly, with a size specified
              in the arguments.

audio-blur:   Read the input signal x[n] from an input audio file, and write
              the signal y[n] as an output audio file.

The default 'make all' will make both binaries. Depending on where you are
building the binaries, you may need to change the CUDA_PATH variable in the
Makefile.

Because convolving with the Gaussian kernel acts as an imperfect low-pass
filter, the output file (in audio mode) will have its higher frequencies
attenuated. Try it out!

As of 4/5/2015: On haru (haru.caltech.edu), you should get a speedup of
~6-8x, using a reasonable choice of block size and #blocks (e.g. 512, 200).
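One common adjustment: if 'make all' fails because it cannot find the CUDA
toolkit, point the Makefile's CUDA_PATH variable at your install. The
conventional Linux location (an assumption; your machine may differ) is:

CUDA_PATH = /usr/local/cuda

The exact command-line arguments each binary expects (signal size, block
size, #blocks, and the audio file names) can be read off the argument
parsing in the provided code.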
Hints:
------------------

- The CPU code exists already (in blur.cpp somewhere in the mess); use it as
  a guide! Recall that we often accelerate CPU code by replacing it with
  "similar-looking" GPU code!

Technical details:
------------------

- For UNIX development: To use audio mode, you'll need to install libsndfile.
- On Ubuntu, you can use "sudo apt-get install libsndfile1-dev".
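Indexing sketch:
------------------

As referenced in the Question 3 introduction, here is a minimal sketch of the
allocate/copy/launch pattern and the index-partitioning (grid-stride) loop
from Lecture 2. It is NOT the blur.cu solution: every name here is invented
for illustration, and the kernel just scales a signal rather than convolving
it. Your TODOs in blur.cu should end up structurally similar, with the
convolution arithmetic in place of the scaling.

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Illustrative kernel: each thread starts at its unique global index and
// strides by the total number of threads in the grid, so every index in
// 0..n-1 is covered even when n exceeds the number of threads launched.
__global__ void scaleKernel(const float *in, float *out, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    while (idx < n) {
        out[idx] = factor * in[idx];
        idx += blockDim.x * gridDim.x;
    }
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host-side input and output buffers.
    float *hostIn = (float *) malloc(bytes);
    float *hostOut = (float *) malloc(bytes);
    for (int i = 0; i < n; i++)
        hostIn[i] = 1.0f;

    // Allocate device memory and copy the input signal over.
    float *devIn, *devOut;
    cudaMalloc((void **) &devIn, bytes);
    cudaMalloc((void **) &devOut, bytes);
    cudaMemcpy(devIn, hostIn, bytes, cudaMemcpyHostToDevice);

    // Launch with the #blocks and block size suggested above (200, 512).
    scaleKernel<<<200, 512>>>(devIn, devOut, 0.5f, n);

    // Copy the result back to the host and spot-check it.
    cudaMemcpy(hostOut, devOut, bytes, cudaMemcpyDeviceToHost);
    printf("out[0] = %f\n", hostOut[0]);

    cudaFree(devIn);
    cudaFree(devOut);
    free(hostIn);
    free(hostOut);
    return 0;
}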