CS 179 Problem Set 5
Due Wednesday 5/4 @ 3PM
Submit to cs179.ta@gmail.com in an archive format (zip, tgz, tar)

Brief Summary
--------------------------------------------------------------------------------
For this set, we are going to train a logistic regression model to classify
Yelp reviews as either a review of a restaurant or a review of a non-restaurant
business. Out of all 1.5 million reviews, about 1 million reviews are of
restaurants.

Details about given code and data
--------------------------------------------------------------------------------
In this assignment, you will be working with Latent Semantic Analysis (LSA)
data on a dataset of Yelp reviews. How the LSA data is produced from the Yelp
reviews is outside the scope of the course; a simplified explanation of LSA is
included at the end of this file (Appendix A) for those of you who are
interested. For this assignment, all you need to understand is that the LSA
data is a numerical representation of the reviews: each review is "summarized"
into a vector of floating point values so that we can analyze a dataset that
is otherwise not numerically analyzable. Ordinarily an LSA dataset contains
only these floating point values, but in this set the LSA data contains an
extra binary column indicating whether or not the review was of a restaurant.

The LSA file (1.1 GB) with the labels is available at

Haru: /srv/cs179_set5_data/shuffled_lsa_labelled.txt
CMS cluster: /cs/courses/cs179/public_html/2015_labs/set5_data/shuffled_lsa_labelled.txt

Also in these directories, you can find the raw data that was used to generate
the LSA file. Each row of the LSA file contains 51 floating point numbers
(50 LSA components + 1 label at the end). The file has 1,569,264 rows. A label
of 1 indicates that the review is of a restaurant and -1 indicates that the
review is not of a restaurant.

Compiling the given code base will generate an executable called classify.
Run the classify program on an LSA file by passing the file name as an
argument to the classify executable.

What you need to do on the set
--------------------------------------------------------------------------------
You are going to train a logistic regression model using gradient descent on
mini-batches of reviews. The loss function to minimize is

loss = (1 / N) * sum_{n=1}^N log(1 + exp(-y_n w^T x_n))

where N is the total number of data points, y_n is the label of point n, x_n
is the feature vector of point n, and w is the weight vector. In our case,
y_n is -1 or 1, x_n is a 50 component vector, and the weights are therefore
also a 50 component vector. When operating on mini-batches (which is what
we're going to do), N = mini-batch size.

One way to minimize this loss function is gradient descent. The gradient of
the loss with respect to w is

grad = (-1 / N) * sum_{n=1}^N (y_n * x_n) / (1 + exp(y_n w^T x_n))

The update rule for gradient descent is

w := w - step_size * grad

where step_size is a scalar. If step_size is too large, the weights will
oscillate, and if step_size is too small the procedure will converge too
slowly, so you'll want to experiment with different values of step_size (try
different orders of magnitude; somewhere in 0.01 to 10 is a good guess).

Given a set of weights and an input, the prediction is sign(w^T x). Note that
the loss and the gradient both contain y w^T x expressions. We want y and
w^T x to have the same sign, and if they do have the same sign then
y w^T x > 0.
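To make the math above concrete, here is a minimal CPU sketch of one
mini-batch update. The function name and data layout are illustrative, not
part of the given code, and your actual implementation must of course run on
the GPU:

    #include <cmath>
    #include <cstddef>

    // One mini-batch gradient descent step on the CPU (illustrative only).
    // x is row-major (N x D), y holds the -1/+1 labels, w holds the D
    // weights. Assumes D <= 50. The prediction for point n is
    // (dot > 0) ? 1 : -1, i.e. sign(w^T x_n).
    void gradientDescentStep(const float *x, const float *y, float *w,
                             std::size_t N, std::size_t D, float step_size) {
        float grad[50] = {0.0f};
        // Accumulate the full gradient before touching w, so that every
        // point in the batch is evaluated against the same weights.
        for (std::size_t n = 0; n < N; ++n) {
            float dot = 0.0f;                           // w^T x_n
            for (std::size_t d = 0; d < D; ++d)
                dot += w[d] * x[n * D + d];
            // d(loss_n)/dw = -y_n * x_n / (1 + exp(y_n * w^T x_n))
            float coeff = -y[n] / (1.0f + std::exp(y[n] * dot));
            for (std::size_t d = 0; d < D; ++d)
                grad[d] += coeff * x[n * D + d];
        }
        for (std::size_t d = 0; d < D; ++d)
            w[d] -= step_size * grad[d] / static_cast<float>(N);
    }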
Beyond using logistic regression and gradient descent, it is generally up to
you how you want to format your code. You can change function signatures if
necessary. The weights should be randomly initialized with small Gaussian
noise (see the gaussianFill function). gaussianFill is actually deterministic
for this set due to the seeding of the generator. This will make it easier to
check results for correctness with other students.

You must fill in the trainLogRegKernel function in classify_cuda.cu. One
natural approach to this kernel is to have each thread classify one point and
compute that point's contribution to the gradient. Be sure to consider that
all points in a batch are meant to be evaluated using the same weights, so the
weight update shouldn't occur until all predictions and gradients are
calculated. A sketch of this approach appears at the end of this section.

The "classify" function in classify.cc should be the main driver of the
program. After processing a batch, print the batch number and either the value
of the loss function over the batch or the error rate of the batch
(misclassifications / batch_size). These are both training set metrics and do
not determine how the model would perform on data outside of the training set,
but this problem set is focused on training, so these numbers are appropriate
to examine.

Also, you need to use cudaMemcpyAsync and streams in order to better utilize
the GPU. Each stream should follow the pattern

H->D => kernel => D->H

This pattern is what would usually happen with a single synchronous kernel
launch, but with streams you will have multiple instances of this pattern in
flight at once. See the lecture slides for more info. Use at least 2 sets of
buffers so that you can perform multiple operations at once. This means you
should have at least 2 device input and output buffers, 2 host buffers to copy
from (perhaps pinned for extra performance), and 2 streams. At least 2 sets of
everything is needed to run kernels and transfers in parallel. If you can, try
benchmarking this code on a Kepler or Maxwell GPU, because a technology called
"Hyper-Q" on those GPUs facilitates overlapping data transfer and computation.
Haru has older Fermi GPUs, so it unfortunately does not have Hyper-Q.
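For reference, here is one possible shape for the kernel and the weight
update. This is a sketch under assumptions: the actual signature of
trainLogRegKernel in classify_cuda.cu may differ, and LSA_DIM,
applyGradientKernel, and the errors counter are illustrative, not part of the
given code:

    #define LSA_DIM 50

    // Each thread handles points in a grid-stride loop, accumulating its
    // gradient contribution (and a misclassification count) with atomicAdd.
    // grad and errors must be zeroed before each batch.
    __global__ void trainLogRegKernel(const float *data,  // batch_size rows of
                                                          // LSA_DIM floats + label
                                      const float *weights,
                                      float *grad,        // LSA_DIM accumulators
                                      int *errors,
                                      int batch_size) {
        for (int n = blockIdx.x * blockDim.x + threadIdx.x; n < batch_size;
             n += gridDim.x * blockDim.x) {
            const float *x = data + n * (LSA_DIM + 1);
            float y = x[LSA_DIM];
            float dot = 0.0f;                             // w^T x_n
            for (int d = 0; d < LSA_DIM; ++d)
                dot += weights[d] * x[d];
            if (y * dot <= 0.0f)                          // sign(w^T x) != y
                atomicAdd(errors, 1);
            float coeff = -y / (1.0f + expf(y * dot));
            for (int d = 0; d < LSA_DIM; ++d)
                atomicAdd(&grad[d], coeff * x[d]);
        }
    }

    // Run after trainLogRegKernel has finished for the batch, so that the
    // update uses gradients computed against a single consistent w.
    __global__ void applyGradientKernel(float *weights, const float *grad,
                                        int batch_size, float step_size) {
        int d = threadIdx.x;                          // launch with LSA_DIM threads
        if (d < LSA_DIM)
            weights[d] -= step_size * grad[d] / batch_size;
    }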
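And here is a sketch of the double-buffered stream pattern on the host side,
reusing the kernels above. The buffer names and the fillBatch helper are
hypothetical; your classify.cc may organize this differently:

    // Assumes device buffers dev_data[2], dev_grad[2], dev_errors[2] and
    // pinned host buffers host_data[2], host_errors[2] (cudaMallocHost)
    // have been allocated, and that fillBatch() is a hypothetical helper
    // that parses the next batch_size reviews into a host buffer
    // (returning false at end of file).
    cudaStream_t stream[2];
    for (int i = 0; i < 2; ++i)
        cudaStreamCreate(&stream[i]);

    for (int batch = 0; ; ++batch) {
        int buf = batch % 2;
        // Wait until the previous batch using this buffer set is done, both
        // so we can overwrite host_data[buf] and so host_errors[buf] is valid.
        cudaStreamSynchronize(stream[buf]);
        if (batch >= 2)
            printf("batch %d: error rate %f\n", batch - 2,
                   (float)*host_errors[buf] / batch_size);
        if (!fillBatch(host_data[buf], batch_size))
            break;

        cudaMemcpyAsync(dev_data[buf], host_data[buf],              // H->D
                        batch_size * (LSA_DIM + 1) * sizeof(float),
                        cudaMemcpyHostToDevice, stream[buf]);
        cudaMemsetAsync(dev_grad[buf], 0, LSA_DIM * sizeof(float),
                        stream[buf]);
        cudaMemsetAsync(dev_errors[buf], 0, sizeof(int), stream[buf]);
        trainLogRegKernel<<<blocks, threads, 0, stream[buf]>>>(     // kernel
            dev_data[buf], dev_weights, dev_grad[buf], dev_errors[buf],
            batch_size);
        applyGradientKernel<<<1, LSA_DIM, 0, stream[buf]>>>(
            dev_weights, dev_grad[buf], batch_size, step_size);
        cudaMemcpyAsync(host_errors[buf], dev_errors[buf], sizeof(int),
                        cudaMemcpyDeviceToHost, stream[buf]);       // D->H
    }
    // Note: the two in-flight batches read and update the shared
    // dev_weights without mutual ordering, so a batch may see slightly
    // stale weights. This "hogwild"-style staleness is generally tolerated
    // by SGD. (A final drain of the last two results is omitted here.)
    cudaDeviceSynchronize();
    for (int i = 0; i < 2; ++i)
        cudaStreamDestroy(stream[i]);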
Performance numbers you should expect
--------------------------------------------------------------------------------
The solution code takes about 30s to run on a normal Kepler GPU. Notably, it
also takes about 30s just to read and decode the full CSV file; that is about
30 MB/s of throughput. With the runtime so dominated by IO and parsing,
writing a highly efficient CUDA kernel cannot have much of an impact on
runtime. This task of training a logistic regression model on streaming data
is a good example of a task without enough computational density to be an
excellent fit for GPUs.

With a mini-batch size of 2048, the reference code gets about 250
misclassifications per mini-batch. This is about 12% error. The loss function
is ~0.35 over each mini-batch (but has relatively high variance).

Analysis you should perform
--------------------------------------------------------------------------------
Write all analysis in README.txt.

How much do IO and parsing dominate your runtime? Compute this by commenting
out all of the CUDA code and just running the loop over all of the data. You
might have to use the data somehow (like writing it to a buffer with
readLSAReview) to prevent the compiler from optimizing out your loop (see the
first sketch below). Once you have the program runtime with logistic
regression and the program runtime from IO and parsing alone, you can compute
the fraction of the time spent doing IO and parsing.

What is the latency and throughput of your kernel for different batch sizes?
For batch sizes 1, 32, 1024, 2048, 16384, and 65536, record the amount of time
it takes your kernel to run (the latency) and compute the throughput
(batch_size / latency). You should only measure the latency of the kernel
itself, not the copying of data onto and off of the device.
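A minimal sketch of the IO-only timing run follows. The signature of
readLSAReview is assumed here; check the given code for the real one:

    #include <chrono>
    #include <cstdio>
    #include <fstream>
    #include <string>

    // Timing IO and parsing alone (all CUDA code commented out).
    int main(int argc, char **argv) {
        std::ifstream in(argv[1]);
        float buffer[51];
        double checksum = 0.0;  // consume the data so the loop isn't optimized out
        auto start = std::chrono::high_resolution_clock::now();
        std::string line;
        while (std::getline(in, line)) {
            readLSAReview(line, buffer);  // assumed: parses one row into buffer
            checksum += buffer[0];
        }
        auto stop = std::chrono::high_resolution_clock::now();
        std::printf("IO + parsing took %.2f s (checksum %g)\n",
                    std::chrono::duration<double>(stop - start).count(),
                    checksum);
        return 0;
    }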
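Kernel latency is easiest to measure with CUDA events, which timestamp on the
device and therefore exclude the host-side copies. A minimal sketch, reusing
the illustrative kernel from above:

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    trainLogRegKernel<<<blocks, threads>>>(dev_data, dev_weights, dev_grad,
                                           dev_errors, batch_size);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);   // kernel latency in milliseconds
    printf("batch size %d: latency %.3f ms, throughput %.1f points/s\n",
           batch_size, ms, batch_size / (ms / 1000.0f));

    cudaEventDestroy(start);
    cudaEventDestroy(stop);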
Appendix A. Yelp, Clustering and Latent Semantic Analysis (LSA)
--------------------------------------------------------------------------------
Yelp is a website where users can leave reviews of restaurants, doctors, car
shops, and all kinds of other businesses. Yelp has released a great dataset
that we're going to explore on this set:
http://www.yelp.com/dataset_challenge . This dataset contains ~1.5 million
reviews of ~61,000 businesses.

Clustering is the process of dividing data into groups with similar
characteristics. As the idea of clustering is best presented visually, see
http://en.wikipedia.org/wiki/Cluster_analysis for many great diagrams. The
most common clustering algorithm is called k-means clustering; it randomly
initializes k "cluster centers" and then iteratively maps each point to the
closest cluster center and updates each cluster center's location to be the
average of that cluster's points. This requires many iterations to converge
and does not find a global optimum. Additionally, the user must specify k,
the number of clusters.

Streaming algorithms are algorithms that see each data point only once and
use only a fixed amount of memory. Streaming algorithms are very useful when
processing large datasets or when processing high throughput streams of data
(such as high frequency sensor data or the Twitter firehose). Streaming
algorithms are also very important for low latency applications (because
otherwise you would need to do some batch processing for each incoming data
point). k-means clustering is not a streaming algorithm because it requires
many iterations to converge.

There are quite a few different schemes to approximate the k-means algorithm
over streaming datasets. One popular option is to perform "sloppy k-means"
with only a single pass over the dataset and with a greater value of k than
is actually wanted. Sloppy k-means consists of mapping each point to a
cluster and then updating the cluster location for that point. This can be
performed on a single point or a batch of points at a time. Let q be the
number of clusters for the sloppy online algorithm and k be the actual
desired number of clusters. For big data applications where the total number
of data points is known (and equal to n), q ~= k log(n). One can compute the
final desired clustering by clustering the q "sloppy clusters", and the
cluster of each point can be determined by maintaining a mapping from data
point -> sloppy cluster -> real cluster.

Analyzing textual content (such as Yelp reviews) is difficult because there
is not an obvious way to compute the similarity between two snippets of text
(called documents). Many methods rely on mapping documents to vectors. One
common and easy way to map text to vectors is called "bag of words" (BOW).
For a given document, the BOW is a vector where each component contains the
number of occurrences of a fixed word. For instance, if the first component
refers to the word "cat" and a document contains "cat" 10 times, then the
first component of the BOW vector is 10. The number of words in a collection
of documents is often huge (~300,000 for the Yelp reviews), and a single
document generally contains only a very small fraction of the full set of
words. This means BOW vectors are often sparse (have very few non-zero
components). BOW leads to the "term-document" matrix, which is a matrix where
element (i, j) contains the number of times the ith document contains the
jth word.

There are more useful vector representations of documents than BOW. One of
these representations is called latent semantic analysis (LSA). LSA works by
computing a low-rank approximation of the term-document matrix through SVD.
We can compute an approximation of any fixed rank, so each LSA vector has a
fixed size of our choosing. This vector corresponds to the coefficients that
go with the corresponding left and right singular vectors.
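For the curious, here is a sketch of the SVD formulation (the exact
preprocessing used to generate the provided file may differ; for example,
tf-idf weighting of the raw counts is common). If A is the term-document
matrix defined above, with one row per document, the rank-k truncated SVD is

A ~= U_k S_k V_k^T

where S_k is the diagonal matrix of the k largest singular values, U_k holds
the corresponding left singular vectors, and V_k the right singular vectors.
The LSA vector for document i is then row i of U_k S_k: taking k = 50 gives
the 50 components per review found in the data file.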