CS 179
Assignment 2
Due: Wednesday, April 17, 2019 - 3:00 PM.

Put all answers in a file called README.txt. After answering all of the
questions, list how long part 1 and part 2 took. Feel free to leave any other
feedback.

========================================
NOTE: New submission method!

Instead of emailing us the solution, put a zip file in your home directory on
Titan, in the format:

lab2_2019_submission.zip

Your submission should be a single archive file (.zip) with your README file
and all code.
========================================


PART 1

Question 1.1: Latency Hiding (5 points)
---------------------------------------

Approximately how many arithmetic instructions does it take to hide the latency
of a single arithmetic instruction on a GK110?

Assume all of the arithmetic instructions are independent (i.e. they have no
instruction dependencies).

You do not need to consider the number of execution cores on the chip.

Hint: What is the latency of an arithmetic instruction? How many instructions
can a GK110 begin issuing in 1 clock cycle (assuming no dependencies)?


Question 1.2: Thread Divergence (6 points)
------------------------------------------

Let the block shape be (32, 32, 1).

(a)
int idx = threadIdx.y + blockSize.y * threadIdx.x;
if (idx % 32 < 16)
    foo();
else
    bar();

Does this code diverge? Why or why not?

(b)
const float pi = 3.14;
float result = 1.0;
for (int i = 0; i < threadIdx.x; i++)
    result *= pi;

Does this code diverge? Why or why not? (This is a bit of a trick question;
either "yes" or "no" can be a correct answer with an appropriate explanation.)


Question 1.3: Coalesced Memory Access (9 points)
------------------------------------------------

Let the block shape be (32, 32, 1). Let data be a (float *) pointing to global
memory, and let data be 128 byte aligned (so data % 128 == 0).

Consider each of the following access patterns.

(a)
data[threadIdx.x + blockSize.x * threadIdx.y] = 1.0;

Is this write coalesced? How many 128 byte cache lines does this write to?

(b)
data[threadIdx.y + blockSize.y * threadIdx.x] = 1.0;

Is this write coalesced? How many 128 byte cache lines does this write to?

(c)
data[1 + threadIdx.x + blockSize.x * threadIdx.y] = 1.0;

Is this write coalesced? How many 128 byte cache lines does this write to?


Question 1.4: Bank Conflicts and Instruction Dependencies (15 points)
---------------------------------------------------------------------

Let's consider multiplying a 32 x 128 matrix with a 128 x 32 element matrix.
This outputs a 32 x 32 matrix. We'll use 32 ** 2 = 1024 threads and each thread
will compute 1 output element. Although it's not optimal, for the sake of
simplicity let's use a single block, so grid shape = (1, 1, 1) and
block shape = (32, 32, 1).

For the sake of this problem, let's assume both the left and right matrices
have already been stored in shared memory and are in column major format. This
means the element in the ith row and jth column is accessible at
lhs[i + 32 * j] for the left hand side and rhs[i + 128 * j] for the right hand
side.

This kernel will write to a variable called output stored in shared memory.

Consider the following kernel code:

int i = threadIdx.x;
int j = threadIdx.y;
for (int k = 0; k < 128; k += 2) {
    output[i + 32 * j] += lhs[i + 32 * k] * rhs[k + 128 * j];
    output[i + 32 * j] += lhs[i + 32 * (k + 1)] * rhs[(k + 1) + 128 * j];
}
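(For concreteness only: below is a minimal sketch of how this loop might sit
inside a complete kernel. Nothing in it is needed for your answers. The kernel
name, the parameters, and the way the shared arrays are filled are hypothetical
additions; the problem simply assumes lhs, rhs, and output already live in
shared memory.)

__global__ void smallMatMulSketch(const float *lhs_global,
                                  const float *rhs_global,
                                  float *out_global) {
    // Column major shared storage, matching the indexing described above.
    __shared__ float lhs[32 * 128];    // 32 x 128 left-hand matrix
    __shared__ float rhs[128 * 32];    // 128 x 32 right-hand matrix
    __shared__ float output[32 * 32];  // 32 x 32 result

    // Hypothetical fill: each of the 1024 threads stages 4 of the 4096
    // elements of each input (assumes the global copies are column major too).
    int t = threadIdx.x + 32 * threadIdx.y;
    for (int idx = t; idx < 32 * 128; idx += 1024) {
        lhs[idx] = lhs_global[idx];
        rhs[idx] = rhs_global[idx];
    }
    output[t] = 0.0f;
    __syncthreads();

    // The loop from the problem statement, verbatim.
    int i = threadIdx.x;
    int j = threadIdx.y;
    for (int k = 0; k < 128; k += 2) {
        output[i + 32 * j] += lhs[i + 32 * k] * rhs[k + 128 * j];
        output[i + 32 * j] += lhs[i + 32 * (k + 1)] * rhs[(k + 1) + 128 * j];
    }

    // Each thread writes back only the element it computed.
    out_global[i + 32 * j] = output[i + 32 * j];
}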
(a) Are there bank conflicts in this code? If so, how many ways is the bank
conflict (2-way, 4-way, etc)?

(b) Expand the inner part of the loop (below)

output[i + 32 * j] += lhs[i + 32 * k] * rhs[k + 128 * j];
output[i + 32 * j] += lhs[i + 32 * (k + 1)] * rhs[(k + 1) + 128 * j];

into "pseudo-assembly" as was done in the coordinate addition example in
lecture 4.

There's no need to expand the indexing math, only to expand the loads, stores,
and math. Notably, the operation a += b * c can be computed by a single
instruction called a fused multiply add (FMA), so this can be a single
instruction in your "pseudo-assembly".

Hint: Each line should expand to 5 instructions.

(c) Identify pairs of dependent instructions in your answer to part b.

(d) Rewrite the code given at the beginning of this problem to minimize
instruction dependencies. You can add or delete instructions (deleting an
instruction is a valid way to get rid of a dependency!), but each iteration of
the loop must still process 2 values of k.

(e) Can you think of anything else you could do that might make this code run
faster?


PART 2 - Matrix transpose optimization (65 points)
--------------------------------------------------

Optimize the CUDA matrix transpose implementations in transpose_cuda.cu. Read
ALL of the TODO comments. Matrix transpose is a common exercise in GPU
optimization, so do not search for existing GPU matrix transpose code on the
internet.

Your transpose code only needs to be able to transpose square matrices where
the side length is a multiple of 64. The initial implementation has each block
of 1024 threads handle a 64x64 block of the matrix (a rough sketch of such a
kernel appears at the end of this part), but you can change anything about the
kernel if it helps obtain better performance.

The main method of transpose.cc already checks for correctness for all
transpose results, so there should be an assertion failure if your kernel
produces incorrect output.

The purpose of the shmemTransposeKernel is to demonstrate proper usage of
global and shared memory. The optimalTransposeKernel should be built on top of
shmemTransposeKernel and should incorporate any "tricks" such as ILP, loop
unrolling, vectorized IO, etc that have been discussed in class.

You can compile and run the code by running

make transpose
./transpose

and the build process was tested on minuteman. If this does not work on haru
for you, be sure to add the lines

export PATH=/usr/local/cuda-6.5/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-6.5/lib64:$LD_LIBRARY_PATH

to your ~/.profile file (and then exit and ssh back in to restart your shell).

On OS X, you may have to run or add to your .bash_profile the command

export DYLD_LIBRARY_PATH=$DYLD_LIBRARY_PATH:/usr/local/cuda/lib/

in order to get dynamic library linkage to work correctly.

The transpose program takes 2 optional arguments: input size and method. Input
size must be one of -1, 512, 1024, 2048, 4096, and method must be one of all,
cpu, gpu_memcpy, naive, shmem, optimal. Input size is the first argument and
defaults to -1. Method is the second argument and defaults to all. You can
pass input size without passing method, but you cannot pass method without
passing an input size. Examples:

./transpose
./transpose 512
./transpose 4096 naive
./transpose -1 optimal

Copy and paste the output of ./transpose into README.txt once you are done.

Describe the strategies used for performance in either block comments over the
kernel (as done for naiveTransposeKernel) or in README.txt.
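(For orientation only: here is a hedged sketch of the kind of naive,
global-memory-only transpose described above. This is NOT the actual starter
code in transpose_cuda.cu; the real kernel's name, signature, and
thread-to-element mapping may differ, and the (64, 16, 1) block shape below is
just one way to cover a 64 x 64 tile with 1024 threads.)

__global__ void naiveTransposeSketch(const float *input, float *output, int n) {
    // Each thread handles 4 elements of its block's 64 x 64 tile.
    int i = threadIdx.x + 64 * blockIdx.x;      // column index in input
    int j = 4 * threadIdx.y + 64 * blockIdx.y;  // first of 4 row indices
    for (int end_j = j + 4; j < end_j; j++)
        output[j + n * i] = input[i + n * j];   // coalesced read, strided write
}

In a sketch like this the reads are coalesced but the writes are strided by n,
which is the sort of access pattern the shmem and optimal kernels are meant to
improve on.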
BONUS (+5 points, maximum set score is 100 even with bonus)
--------------------------------------------------------------------------------

Mathematical scripting environments such as Matlab or Python + Numpy often
encourage expressing algorithms in terms of vector operations because they
offer a convenient and performant interface. For instance, one can add 2
n-component vectors (a and b) in Numpy with c = a + b.

This is often implemented with something like the following code:

void vec_add(float *left, float *right, float *out, int size) {
    for (int i = 0; i < size; i++)
        out[i] = left[i] + right[i];
}

Consider the code

a = x + y + z

where x, y, z are n-component vectors. One way this could be computed would be

vec_add(x, y, a, n);
vec_add(a, z, a, n);

In what ways is this code (2 calls to vec_add) worse than the following?

for (int i = 0; i < n; i++)
    a[i] = x[i] + y[i] + z[i];

List at least 2 ways (you don't need more than a sentence or two for each way).
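(To make the two code paths concrete, here is a small self-contained harness
that runs both versions. The names, vector length, and fill values are
arbitrary; it is only meant to illustrate the comparison, not to answer the
question.)

#include <stdio.h>
#include <stdlib.h>

void vec_add(float *left, float *right, float *out, int size) {
    for (int i = 0; i < size; i++)
        out[i] = left[i] + right[i];
}

int main(void) {
    const int n = 1 << 20;   // arbitrary vector length
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));
    float *z = (float *) malloc(n * sizeof(float));
    float *a = (float *) malloc(n * sizeof(float));
    float *b = (float *) malloc(n * sizeof(float));
    for (int i = 0; i < n; i++) {
        x[i] = 1.0f;
        y[i] = 2.0f;
        z[i] = 3.0f;
    }

    // Version 1: two vec_add calls, with the intermediate x + y written to a.
    vec_add(x, y, a, n);
    vec_add(a, z, a, n);

    // Version 2: a single fused loop.
    for (int i = 0; i < n; i++)
        b[i] = x[i] + y[i] + z[i];

    printf("a[0] = %f, b[0] = %f\n", a[0], b[0]);
    free(x); free(y); free(z); free(a); free(b);
    return 0;
}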