CS 179: GPU Computing
Assignment 6
Modified by Jordan Bonilla 5/11/17

Due: Wednesday, May 17, 2017 - 11:59 PM


Submission: 
------------------

By e-mail to cs179.ta@gmail.com. Make sure you get a confirmation email.
Package your files in a standard archive format (e.g. zip, tar.gz, tar.bz2). 
Please also include your name in the title of your archive.


Question 1: 1-D Wave PDE Solver	(20 pts)
---------------------------------------------
---------------------------------------------

Many scientific problems exhibit a wave-like nature, the behavior of which can be determined by solving various forms of the wave equation. In this problem, we'll numerically solve the 1-dimensional wave equation, a second-order PDE.

In class, we showed that we can get a discrete solution by iteratively calculating values of the function at time t+1 in terms of data at time t, and t-1. (We also discussed how to keep things efficient memory-transfer wise.)


1.1 	Implement 1-D Wave PDE (20 pts)
-------------------------------
Implement the single-GPU wave equation solver by filling out the TODOs in the CUDA1DFDWave_* files.


Note
--------------------------

You can visualize the numerical data by either using the attached script (python makePlots.py), or by using Excel, Matplotlib, or similar software. The output should look like a 2D wave.You can also write
verification to check that the GPU output matches the CPU output.


Question 2: Gillespie Simulation 	(60 pts)
---------------------------------------------
---------------------------------------------
The boilerplate code in the Problem2 directory is given as suggestion. You will have to add to the given function signatures.


2.1 	Gillespie timestep implementation (20 pts)
-------------------------------

The Gillespie simulation simulates a stochastic system with memoryless transition events.  The process considers that at any point in time, all possible transitions are separate Poisson processes.  This allows us to easily calculate the probability distribution of the next event, and the probabilities of that event being each of the possible transitions.

To perform the simulation, we assign a propensity to each possible transition, which corresponds to the rate parameter of the probability distribution of that item.  The probability density function of the minimum of these random variables is just an exponential distribution with a rate parameter equal to the sum of the propensities.  The probability of any transition being the next one is proportional to its propensity.

So, at each iteration of the algorithm, we sample a timestep from an exponential distribution with rate parameter equal to the sum of the propensities, and choose a transition randomly, where the probability of any transition being chosen is proportional to its propensity.  An exponential distribution can be created by taking -ln(x)/lambda, where x is uniformly distributed from 0 to 1, and lambda is the desired rate constant of the exponential distribution.

In this problem we will simulate a reaction where a chemical is produced by a system that toggles on and off, and decays with a rate proportional to its concentration
     Kon    b     g
OFF <--> ON --> X --> 0
     Koff

We can write the propensities of each possible transition

While production is inactive
OFF -> ON   Kon
[X]--       [X] * g

While production is active
ON -> OFF   Koff
[X]++       b
[X]--       [X] * g

In this analysis we will initialize the concentration to 0, start the system off, and use the values
b=10
g=1
Kon = 0.1
Koff = 0.9


In order to examine the macroscale behavior of this system, we will use the Monte Carlo method, and simulate the random dynamics of the system enough times to obtain reasonable numeric results.

** To do: Implement a cuda kernel to perform a single iteration of the Gillespie algorithm on the given system using an array of random numbers given as an argument.  Generate random numbers using cuRand before calling the kernel, and pass the random numbers as an argument.


2.2 	Data resampling and stopping condition (20 pts)
-------------------------------

A downside to the Gillespie algorithm is that samples are not given in evenly spaced timesteps, but rather at varying intervals based on the distribution of times until a transition.  The data must be resampled at a uniform interval in order to perform more calculations.  A simple way to approximate this is, for each time point we want to sample, the concentration at the first known point afterwards is stored.

For this problem, a reasonable set of timesteps would be to have 1000 points evenly spaced from 0 to 100.

The iterations must be continued until each of the seperate simulations have reached a point greater than the final time point that must be sampled.  A reduction can be used to find the progress of the slowest simulation, and stop the program once it has progressed passed the region of interest
** To Do: Implement a cuda kernel to, after each iteration of the Gillespie algorithm, update the values in an array of uniformly spaced samples.  Implement a reduction to detect the progress of the slowest simulation, and stop running iterations once it has passed the final timepoint.


2.3 	Calculation of system behavior (20 pts)
--------------------------------

We can now use these values at the sample points to calculate the expected behavior of the system.  We can use reductions to combine the results from each separate simulation into more useful values.

Two important measures of behavior are mean and variance.  The mean is the equal to the expected value of a random variable, and is estimated in our Monte Carlo simulation by summing the values for each different simulation and dividing by the number of simulations.  The variance is the expected squared difference between the variable and its mean.  This can be calculated by summing the squared differences, and dividing by the total number of simulations.

** To Do: Implement a cuda kernel(s) to calculate the expected concentration and variance of the concentration at each timepoint.  


Question 3: Cardiac Simulation 			(20 pts)
---------------------------------------------
---------------------------------------------

Many scientific problems, especially in computational biology, are researched via MCMC methods.

The actin thin filament's behavior can be modeled with Markov Chains. In the CardiacTwitch.cu file, we model the cardiac thin filament (muscle tissue) as a series of N = 26 biological regulatory units where each regulatroy unit can take one of 6 states which represent a different stage in the contraction process of the filament. Each of the M=6 states is represented with by an integer {0,5}. The RU chain is thus modeled as
as array of 26 integers.

In class, we showed that we can solve such Markoc Chain problems by solving for t+1 in terms of
the state at t using the computed probabilities for all possible state transitions. These
probabilities can be encoded in a stochastic matrix that can be computed beforehand to hold all
possible transitions from all possible states. For this problem, we will be using a given stoachastic
matrix structure that is performance optimized.


In the provided code you will see the variable h_TM (host transition matrix) which is a linearized
[6 x 6 x 6 x 2 x 6] Matrix of coefficients that the model uses to advance in state.
It is not necessary to understand the science behind how this stochastic matrix was created. 

Whenever a possible transition states for an RU is required, a stochastic amtrix lookup is performed.

The first coordinate of the stocastic matrix represents the left neighbor of a state in a RU {0,5}.

The second coordinate represents the right neighbor of a state in a RU, {0,5}.

The third coordinate represents current state of the RU we are getting transition 
probabilities for {0,5}.

The fourth coordinate represents the current RU type (mutation or wildtype). This is a binary assignment {0,1}. For this problem, this coordinate will always be 0.

The fifth coordinate is used to specify what information to pull from the matrix {0,6}.
0 will request probability P1, 1 will request state change M1, ... 4 will request P3, 5 will request M3.

This is consistent with the following transiton model for any any RU elment being updated:
   

|-------|---------|---------|--------------------------------|

0.0     P1        P2        P3                               1.0


Each P represents the probability of making a transition to a differnt RU state.
For each probability P, there is an associated change in state, M. The value of M simply an integer number that when added to the current RU state (an integer), gives the new RU state as an integer.

In the CUDA kernel, a uniform pseudorandom number, r, is generated and is comapred to the P values above. I.e if r < P1, the P1 transition is taken. If P1 < r < P2, the P2 transition is taken. If P2 < r < P3, the P3 transition is taken. Should the random number exceed the value of P3, no transition is taken.

This works because a RU has, at most, 3 possible transitions and at least 2 possible transntions. In the case where only 2 transitions are possible, P3, M3 = 0. 

Note that you do not need to understand how the stocastic matrix logic because all the transiiton logic is
already coded for you.


3.1 	Multi-GPU implementation (20 pts)
--------------------------
Note that the code is given as working in the single-GPU case. Your job is to get it working for the multiple GPU case. You will thus require a system with multiple GPUs like Haru to complete this problem.

Expand the code given in cardiac_twitch.cu to use multiple GPUs in the general case for a computer
with >1 GPUs. Split the 4096 repititons among all available GPUs and then aggregate results on the CPU after
all GPUs have synchronized. Comment your run-time for the single and multiple GPU cases and explain the differences in runtime. 

The cardiac_twitch.cu file outputs a csv file (results.csv) for you to visualize and validate your results.
Plot these results by using the attached script (python makePlot.py). 

3.2	Bonus (Extra 10 pts)
--------------------------
Desribe two missing optimizations that would speed up runtime dramatically for the single and multiple GPU 
implementations.