Overview

This page lists possible data sets, project ideas, and code that you can use for your project. Since this is a research-oriented class, you are strongly encouraged to pick a project related to your own research; you are not limited to the ideas on this page.

Ideally, groups should have 2-3 people; exceptions are possible with the instructor's permission. Please feel free to contact Andreas, Daniel, or Deb about project ideas.

Data sets

NLP & Text data

Twenty Newsgroups (1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles). Useful for a variety of text classification and/or clustering projects.

http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

WebKB: this data set contains webpages from 4 universities, labeled as professor, student, project, or other pages.

http://www-2.cs.cmu.edu/~webkb/

Enron e-mail: ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.

10K Corpus: 10-K reports from thousands of publicly traded U.S. companies, published in 1996-2006, together with stock return volatility measurements for the twelve-month periods before and after each report.

http://www.ark.cs.cmu.edu/10K/

Image & Video data

ImageNet (WordNet semantic network annotated with images)

http://www.image-net.org/

Corel Image Data

A simple bee dance data set from Georgia Tech (find number of discrete modes, and learn dynamics of each mode).

TRECVID (Competition for multimedia information retrieval. Fairly large archive of video data sets, along with featurizations).

http://www-nlpir.nist.gov/projects/trecvid/trecvid.data.html

Character recognition (Optical character recognition, and the simpler digit recognition task)

http://ai.stanford.edu/~btaskar/ocr/

Caltech 101 Vision data set for image classification

http://www.vision.caltech.edu/Image_Datasets/Caltech101/

INRIA Pedestrian Dataset

http://pascal.inrialpes.fr/data/human/

PASCAL Visual Object Classes (VOC) Challenge

http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/

Ground-truthed pedestrian data from ETH

http://www.vision.ee.ethz.ch/datasets/index.en.html

Neuroscience & Physiology data

ICDM Brain Connectivity competition

http://pbc.lrdc.pitt.edu/?q=2009b-home

fMRI data (want to predict cognitive state given brain activation data)

http://multivac.ml.cmu.edu/10708(work?)

Collaborative Research in Neuroscience data sets

http://crcns.org/

Collaborative prediction data

Netflix / MovieLens data

Predict movie ratings based on training data.

Sensor network data

Data from a 54-node sensor network deployment (Berkeley): temperature, humidity, and light data, along with the voltage level of the batteries at each node. Useful, e.g., for fitting spatio-temporal GP models, outlier detection, active learning, Gaussian process optimization, ...

http://www-2.cs.cmu.edu/~guestrin/Research/Data/

Data from a wireless sensor network for traffic surveillance, including acoustic signal data, magnetic signal data, etc. (Berkeley)

http://path.berkeley.edu/~singyiu/vehicledetection/research/research.htm

Light sensor network data, including link quality information between pairs of sensors

http://www.cs.cmu.edu/~guestrin/Class/10708-F08/projects/lightsensor.zip

SLAM repository at rawseeds.org

www.rawseeds.org

Network data

Large social network data sets (for link prediction, etc.)

http://snap.stanford.edu/data/

Other sources of data

UC Irvine has a repository that could be useful. Many of these data sets have been used extensively in machine learning research.

http://www.ics.uci.edu/~mlearn/MLRepository.html

Sam Roweis also has a link to several datasets (most ready for use in Matlab)

http://www.cs.toronto.edu/~roweis/data.html

KDD Cup data sets

http://www.kdnuggets.com/datasets/kddcup.html

Project ideas

Online learning, bandit optimization, dimension reduction

Compare online classification / regression with offline algorithms on data sets
Experiment with parallel online learning:

Martin Zinkevich, Alexander Smola, John Langford. Slow learners are fast. NIPS 2009

Online multitask learning, e.g., for link prediction / collaborative filtering (e.g., on Netflix data)

Weinberger, K., Dasgupta, A., Attenberg, J., Langford, J., and Smola, A., Feature Hashing for Large Scale Multitask Learning, International Conference on Machine Learning, 2009
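
As a starting point for the multitask idea, here is a minimal sketch of the hashing trick with an optional task-specific copy of each feature, in the spirit of Weinberger et al. The function names, the use of MD5 as the hash, and the "task::token" key format are illustrative assumptions, not taken from the paper.

```python
import hashlib


def _hash(key, salt):
    """Stable integer hash of a string (MD5 is an assumption made for
    reproducibility; any well-mixed hash function would do)."""
    digest = hashlib.md5((salt + key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_features(tokens, n_buckets, task=None):
    """Map a bag of tokens to a fixed-length signed count vector.

    If `task` is given, every token is hashed twice: once as a shared
    (global) feature and once as a task-specific feature, both into the
    same n_buckets-dimensional space.
    """
    vec = [0.0] * n_buckets
    for tok in tokens:
        keys = [tok] if task is None else [tok, f"{task}::{tok}"]
        for key in keys:
            idx = _hash(key, "index:") % n_buckets
            # A second hash picks the sign, so collisions cancel on average.
            sign = 1.0 if _hash(key, "sign:") % 2 == 0 else -1.0
            vec[idx] += sign
    return vec
```

A linear model trained on these vectors with any online learner then shares the global weights across tasks, while the task-specific copies let each task deviate from them.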

Implement and evaluate some nonstandard bandit algorithm (X-armed bandit [A.2], contextual bandits [A.4], [A.1] ...)

S. Bubeck, R. Munos, G. Stoltz, C. Szepesvari. "Online Optimization in X-Armed Bandits" NIPS 2008.
J. Langford and T. Zhang. "The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits" NIPS 2007.
S. Pandey, D. Chakrabarti, D. Agrawal. "Multi-armed Bandit Problems with Dependent Arms" ICML 2006

Compare different bandit algorithms (e.g., low regret algorithms like UCB with optimal solution using MDPs / Gittins indices)

J. C. Gittins, D. M. Jones, A Dynamic Allocation Index for the Discounted Multiarmed Bandit Problem, Biometrika, Vol 66, No. 3. (1979), pp. 561-565.
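
For the low-regret side of this comparison, a minimal sketch of UCB1 on Bernoulli arms could look as follows. The function name, the Bernoulli reward model, and the exploration constant sqrt(2 ln t / n) are assumptions chosen for illustration; a Gittins index baseline would have to be implemented separately.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms.

    arm_means holds the (hidden) success probability of each arm.
    Returns (number of pulls per arm, total reward collected).
    """
    rng = random.Random(seed)
    k = len(arm_means)
    counts = [0] * k
    sums = [0.0] * k
    total = 0.0

    def index(i, t):
        # Empirical mean plus an upper confidence bonus.
        return sums[i] / counts[i] + math.sqrt(2.0 * math.log(t) / counts[i])

    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull every arm once to initialize
        else:
            arm = max(range(k), key=lambda i: index(i, t))
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total
```

Regret can then be estimated as horizon * max(arm_means) minus the returned total reward, averaged over several seeds.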

Experiment with Gaussian process optimization, compare different selection heuristics

D. Lizotte, T. Wang, M. Bowling, D. Schuurmans. "Automatic Gait Optimization with Gaussian Process Regression" IJCAI 2007
Niranjan Srinivas, Andreas Krause, Sham M. Kakade, Matthias Seeger. "Gaussian Process Bandits without Regret: An Experimental Design Approach" arXiv

Random features -- an "explicit" kernel trick. Compare against standard, kernelized SVMs in runtime / accuracy

A. Rahimi, B. Recht. "Random Features for Large-Scale Kernel Machines" NIPS 2007
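
To make the "explicit" kernel trick concrete, the sketch below builds random Fourier features whose inner product approximates the RBF kernel exp(-gamma * ||x - y||^2). The function names and the gamma parameterization are assumptions for illustration; a project would feed these features into a linear SVM and compare against a kernelized one.

```python
import math
import random


def make_rff(dim, n_features, gamma, seed=0):
    """Return a feature map z with z(x) . z(y) ~= exp(-gamma * ||x - y||^2)."""
    rng = random.Random(seed)
    # Frequencies are sampled from the kernel's Fourier transform,
    # which for this kernel is N(0, 2 * gamma * I).
    std = math.sqrt(2.0 * gamma)
    W = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]
    b = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def features(x):
        return [
            scale * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + bj)
            for w, bj in zip(W, b)
        ]

    return features


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel value, for comparison with the approximation."""
    return math.exp(-gamma * sum((a - c) ** 2 for a, c in zip(x, y)))
```

The approximation error shrinks roughly as 1 / sqrt(n_features), so the number of features is the knob that trades runtime against accuracy.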

Compare different dimension reduction algorithms (PCA, LLE, ISOMAP, Maximum Variance Unfolding, ...)

S. Roweis, L. Saul. "Nonlinear dimensionality reduction by locally linear embedding." Science v.290 no.5500, Dec. 22, 2000, pp. 2323-2326.
http://waldron.stanford.edu/~isomap/
K. Q. Weinberger, L. K. Saul. "Unsupervised learning of image manifolds by semidefinite programming." In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), Washington D.C., 2004.

Compare online and offline algorithms for dimension reduction

Random projections
Online PCA

Implement online clustering algorithm, and apply it to large (e.g., image) data set

Online k-Means
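
One simple variant of online k-means is the sequential update sketched below: each arriving point moves its nearest center by a step of 1/n, where n is the number of points that center has absorbed. Seeding the centers with the first k points of the stream is an assumption made here for brevity; better seeding is part of what a project could explore.

```python
def online_kmeans(stream, k):
    """Sequential k-means over an iterable of points (tuples or lists).

    Returns (centers, counts). Each center is the running mean of the
    points assigned to it so far.
    """
    centers, counts = [], []
    for x in stream:
        if len(centers) < k:
            # Assumption: the first k points serve as initial centers.
            centers.append(list(x))
            counts.append(1)
            continue
        # Assign the point to its nearest center (squared Euclidean distance).
        j = min(
            range(k),
            key=lambda i: sum((a - c) ** 2 for a, c in zip(x, centers[i])),
        )
        counts[j] += 1
        eta = 1.0 / counts[j]  # step size that yields a running mean
        centers[j] = [c + eta * (a - c) for a, c in zip(x, centers[j])]
    return centers, counts
```

Because each point is touched once and only k centers are stored, memory use is independent of the data set size, which is what makes this attractive for large image collections.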

Implement and evaluate algorithms for online linear/convex optimization (e.g., online shortest paths)

A. Kalai, S. Vempala. "Efficient Algorithms for Online Decision Problems" Journal of Computer and System Sciences 2005
B. McMahan, A. Blum. "Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary", COLT 2004.
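
As a warm-up before tackling online shortest paths, the following sketch applies the follow-the-perturbed-leader idea from Kalai and Vempala to the simplest case, a finite set of experts with full-information losses. The function name, the uniform perturbation on [0, 1/epsilon], and the list-of-lists loss format are assumptions for illustration.

```python
import random


def follow_the_perturbed_leader(loss_rounds, epsilon, seed=0):
    """Play one expert per round against a sequence of loss vectors.

    loss_rounds[t][i] is the loss of expert i in round t, revealed only
    after the choice for round t has been made.
    Returns (total loss incurred, list of chosen expert indices).
    """
    rng = random.Random(seed)
    n = len(loss_rounds[0])
    cumulative = [0.0] * n
    total = 0.0
    choices = []
    for losses in loss_rounds:
        # Add fresh random noise to the past cumulative losses, then
        # follow the expert that looks best under the perturbed totals.
        perturbed = [c + rng.uniform(0.0, 1.0 / epsilon) for c in cumulative]
        choice = min(range(n), key=lambda i: perturbed[i])
        choices.append(choice)
        total += losses[choice]
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return total, choices
```

For structured problems such as shortest paths, the same scheme perturbs the edge costs and calls an offline solver in place of the explicit minimum over experts.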

Active learning / experimental design

Experiment with heuristics for active learning (e.g., SVMs / logistic regression) on some interesting data set (e.g., image classification). How much does active learning help?

S. Tong, D. Koller. "Support Vector Machine Active Learning with Applications to Text Classification." JMLR 2001.
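
To get a feel for how much active learning can help, here is a deliberately simplified stand-in for the SVM heuristics: a 1D threshold classifier on a pool of points, where querying the middle of the still-uncertain interval halves the set of consistent thresholds with every label. The function name and the assumption that the leftmost pool point is negative and the rightmost positive are illustrative choices, not part of the cited paper.

```python
def active_threshold(pool, oracle):
    """Learn a 1D threshold by bisecting the uncertain interval.

    pool is a list of numbers; oracle(x) returns 0 left of the unknown
    threshold and 1 right of it.
    Returns (estimated threshold, number of label queries used).
    """
    xs = sorted(pool)
    lo, hi = 0, len(xs) - 1
    # Assumption: the extreme points carry opposite labels.
    if oracle(xs[lo]) != 0 or oracle(xs[hi]) != 1:
        raise ValueError("expected label 0 at the left end and 1 at the right end")
    queries = 2
    while hi - lo > 1:
        mid = (lo + hi) // 2  # the most informative remaining point
        queries += 1
        if oracle(xs[mid]) == 1:
            hi = mid
        else:
            lo = mid
    # Place the boundary halfway between the closest opposite labels.
    return (xs[lo] + xs[hi]) / 2.0, queries
```

On a pool of n points this uses about log2(n) labels, whereas labeling points in random order typically needs a number of labels proportional to n to locate the boundary equally well. Measuring that gap for SVMs on real data is the actual project.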

Implement and evaluate some nonstandard active learning algorithm (e.g., [B.3], [B.4], [B.8])

S. Dasgupta, D.J. Hsu. "Hierarchical sampling for active learning", ICML 2008.
R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, X. Zhu. "Human Active Learning", NIPS 2008.
S. Dasgupta, D. Hsu, C. Monteleoni. "A General Agnostic Active Learning Algorithm", NIPS 2007.

Use submodular function optimization for Bayesian experimental design on some data set (e.g.,

A. Krause, A. Singh, C. Guestrin. "Near-optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies". Journal of Machine Learning Research (JMLR) 2008
M. Streeter, D. Golovin "An Online Algorithm for Maximizing Submodular Functions" NIPS 2008

Implement batch mode active learning

S. Hoi, R. Jin, J. Zhu, M. Lyu "Batch mode active learning and its application to medical image classification", ICML 2006.

Nonparametric learning

Compare different active set methods for fast inference in Gaussian Process Regression

N. Lawrence, M. Seeger, and R. Herbrich. Fast sparse Gaussian process methods: The informative vector machine. In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 609-616, Cambridge, MA, 2003. The MIT Press.
Snelson, E. and Ghahramani, Z. (2006) Sparse Gaussian Processes using Pseudo-Inputs. In Advances in Neural Information Processing Systems 2005

Use GP regression / SVMs for some interesting, nonstandard kernel (e.g., graph kernels, diffusion kernel)

Nino Shervashidze, Karsten M. Borgwardt: Fast subtree kernels on graphs, NIPS 2009
R. Kondor and J. Lafferty (2002). Diffusion Kernels on Graphs and Other Discrete Input Spaces. ICML 2002

Experiment with kernelized algorithms (Kernelized PCA, LDA, ...)
Compare Gaussian Process classification with Support Vector Machines on some interesting data set
Use Gaussian Processes to predict slip in JPL Mars rover data (contact instructors): use orbital remote sensing imagery of Mars to predict areas of high danger to rovers. Some of the data is ground-truthed, that is, it records how much slippage actually occurred during actual rover trajectories. Compare the predictions to parametric models estimated from the slip data.
Heteroscedastic GP regression (varying noise)

K. Kersting, C. Plagemann, P. Pfaff, W. Burgard. Most-Likely Heteroscedastic Gaussian Process Regression. In Z. Ghahramani, editor(s), Proceedings of the 24th Annual International Conference on Machine Learning (ICML-07)

Gaussian process mixture modeling

C. E. Rasmussen and Z. Ghahramani. Infinite mixtures of Gaussian process experts. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. The MIT Press, 2002.