Overview

This page lists possible data sets, ideas for questions, and code that you can use for your project. Since this is a research-oriented class, you are strongly encouraged to pick a project related to your own research; you are not limited to the ideas on this page.

Ideally, we prefer group sizes of 2-3 people. Exceptions are possible with the instructor's permission. Please feel free to contact Andreas, Hongchao, or Pete about project ideas.

Data sets

Caltech data sets

Urban Challenge Datasets (contact Pete Trautman):

High-fidelity GPS vehicle trajectories.
Ladar scans.
Possibly also stereo and video (but probably a little difficult to recover).

Exercise physiology data (contact Pete Trautman)

Generated by John Doyle's group. Athletes are asked to ride a stationary bike under various conditions, while heart rate, wattage output, breathing rate, and gas exchange data are recorded.

Fly data (contact Pete Trautman)

High-resolution data of fly activities. Using background subtraction, fly positions are recorded over a fixed time interval. Various positive and negative attractions are placed in the fly arena to encourage certain types of behavior.

JPL data sets (contact Pete Trautman)

Orbital remote sensing imagery of Mars, for predicting areas of high danger to rovers. Some of the data is ground-truthed: it records how much slippage actually occurred during actual rover trajectories.
Use rover slip data to estimate parameters of soil mechanics models
Video truthed people tracks
UAV fly over data, with annotated lakes and buildings.
Data for visual SLAM
Person segmentation: a data set of people walking, annotated with bounding boxes around them.

LDPC data sets (contact Hongchao Zhou)

Parity-check matrix for an LDPC code.
Received signal.
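
As a possible first step with this kind of data, the sketch below hard-decides a received signal and checks it against a parity-check matrix by computing the syndrome. It assumes BPSK signaling with bit 0 sent as +1 and bit 1 sent as -1, which may not match the actual data set. The matrix H is a toy (7,4) Hamming matrix chosen for brevity, not an LDPC matrix from the data, and the function names are illustrative.

```python
def hard_decision(received):
    """Map real-valued samples to bits, ASSUMING BPSK with 0 -> +1 and 1 -> -1."""
    return [0 if r >= 0 else 1 for r in received]


def syndrome(H, bits):
    """Compute H * bits over GF(2); all zeros means every parity check holds."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]


# Toy (7,4) Hamming parity-check matrix, for illustration only.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

# Made-up samples: the all-zero codeword was sent and noise flipped the fifth bit.
received = [0.9, 1.1, 0.8, 1.2, -0.3, 1.0, 0.7]

bits = hard_decision(received)
s = syndrome(H, bits)
print(bits)  # [0, 0, 0, 0, 1, 0, 0]
print(s)     # [1, 0, 1], which equals the fifth column of H
```

A nonzero syndrome only says that some parity check fails. An iterative decoder such as loopy BP would use the soft values in the received signal instead of the hard decisions taken here.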

Image & Video data

ImageNet (WordNet semantic network annotated with images)

http://www.image-net.org/

Corel Image Data

A simple bee dance data set from Georgia Tech (find number of discrete modes, and learn dynamics of each mode).
TRECVID (Competition for multimedia information retrieval. Fairly large archive of video data sets, along with featurizations).

http://www-nlpir.nist.gov/projects/trecvid/trecvid.data.html

Character recognition (Optical character recognition, and the simpler digit recognition task)

http://ai.stanford.edu/~btaskar/ocr/

Caltech 101 Vision data set for image classification

http://www.vision.caltech.edu/Image_Datasets/Caltech101/

INRIA Pedestrian Dataset

http://pascal.inrialpes.fr/data/human/

Visual Object Challenge

http://pascallin.ecs.soton.ac.uk/challenges/VOC/voc2008/

Ground truthed Pedestrian ETH data

http://www.vision.ee.ethz.ch/datasets/index.en.html

Neuroscience & Physiology data

ICDM Brain Connectivity competition

http://pbc.lrdc.pitt.edu/?q=2009b-home

fMRI data (want to predict cognitive state given brain activation data)

http://multivac.ml.cmu.edu/10708 (work?)

Collaborative prediction data

Netflix / MovieLens data

Predict movie ratings based on training data.
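
One common baseline for this task is low-rank matrix factorization trained by stochastic gradient descent. The sketch below is a minimal version on made-up ratings. The `(user, item, rating)` triple format, the hyperparameters, and the function names are all assumptions for illustration and are not taken from the Netflix or MovieLens files.

```python
import random


def factorize(ratings, n_users, n_items, rank=2, lr=0.05, reg=0.02, epochs=200, seed=0):
    """Fit user factors P and item factors Q so that P[u] . Q[i] approximates r."""
    rng = random.Random(seed)
    P = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_users)]
    Q = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(P[u][k] * Q[i][k] for k in range(rank))
            for k in range(rank):
                pu, qi = P[u][k], Q[i][k]
                # Gradient step on squared error with L2 regularization.
                P[u][k] += lr * (err * qi - reg * pu)
                Q[i][k] += lr * (err * pu - reg * qi)
    return P, Q


def predict(P, Q, u, i):
    return sum(a * b for a, b in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    se = sum((r - predict(P, Q, u, i)) ** 2 for u, i, r in ratings)
    return (se / len(ratings)) ** 0.5


# Hypothetical toy data: (user index, item index, rating on a 1-5 scale).
train = [(0, 0, 5), (0, 1, 4), (1, 0, 4), (1, 2, 1), (2, 1, 5),
         (2, 3, 2), (3, 2, 2), (3, 3, 1), (0, 3, 1), (1, 1, 5)]

P0, Q0 = factorize(train, n_users=4, n_items=4, epochs=0)  # untrained baseline
P, Q = factorize(train, n_users=4, n_items=4)
print(rmse(train, P0, Q0), rmse(train, P, Q))
```

For a project, the error should be reported on held-out ratings instead of the training set used here.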

Sensor network data

Data from a 54-node sensor network deployment: temperature, humidity, and light data, along with the voltage level of the batteries at each node. (Berkeley)

http://www-2.cs.cmu.edu/~guestrin/Research/Data/

Data from a wireless sensor network for traffic surveillance, including acoustic signal data, magnetic signal data, etc. (Berkeley)

http://path.berkeley.edu/~singyiu/vehicledetection/research/research.htm

Light sensor network data, including link quality information between pairs of sensors

http://www.cs.cmu.edu/~guestrin/Class/10708-F08/projects/lightsensor.zip

SLAM repository at rawseeds.org

http://www.rawseeds.org

NLP & Text data

Twenty Newsgroups (1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles). Useful for a variety of text classification and/or clustering projects.

http://www-2.cs.cmu.edu/afs/cs/project/theo-11/www/naive-bayes.html

WebKB: webpages from 4 universities, labeled as professor, student, project, or other pages.

http://www-2.cs.cmu.edu/~webkb/

Enron e-mail: about 500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.

10K Corpus: 10-K reports from thousands of publicly traded U.S. companies, published in 1996–2006, and stock return volatility measurements in the twelve-month period before and the twelve-month period after each report.

http://www.ark.cs.cmu.edu/10K/

Network data

Large social network data sets (for link prediction, etc.)

http://snap.stanford.edu/data/

Other sources of data

UC Irvine has a repository that could be useful. Many of these data sets have been used extensively in graphical models research.

http://www.ics.uci.edu/~mlearn/MLRepository.html

Sam Roweis also has a link to several datasets (most ready for use in Matlab)

http://www.cs.toronto.edu/~roweis/data.html

KDD Cup data sets

http://www.kdnuggets.com/datasets/kddcup.html

Project ideas

Caltech data related ideas

Use structured prediction (e.g., conditional random fields) to predict slip in the rover data. Compare the predictions to parametric models estimated from the slip data.
Activity recognition of fly data (e.g., using hierarchical conditional random fields). E.g.:

Location-Based Activity Recognition. L. Liao, D. Fox, and H. Kautz. NIPS-05.
Learning and Inferring Transportation Routines. L. Liao, D. Fox, and H. Kautz. AAAI-04.

Clustering/segmentation of Ladar scans, video, GPS trajectories using graphical methods.
Compare graphical model methods with classical model ID methods to analyze the physiology data.
Apply approximate inference methods to fly data, SLAM data, or visual SLAM data to do tracking, data association, multitarget tracking, etc.
Compare different approximate inference techniques (loopy BP, variational inference, ...) for coding theory (LDPC codes)

Learning and Modeling

Compare constraint-based (e.g., using independence tests) and score based algorithms for structure learning.
Implement algorithms for structure learning of undirected graphical models. E.g., based on L1 regularization (e.g., Ravikumar et al, NIPS '07, NIPS '08)
Experiment with Bayesian model averaging (e.g., using sampling)
Compare conditional random fields with generative models (directed or undirected) on some learning task
Compare Max-margin Markov Nets [Taskar et al NIPS '03] with Conditional Random Fields [Lafferty et al ICML '01]

Inference

Compare different techniques for exact inference (in terms of complexity, ...)

Junction tree inference
Bucket elimination
Recursive conditioning
Arithmetic circuits

Compare different techniques for approximate inference

Variational inference (structured mean field, etc.)
Generalized belief propagation
Sampling (MCMC / Gibbs /...)
Preconditioning based inference (Ravikumar et al NIPS '05)
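
A comparison like this usually needs a ground truth, which on small models can come from brute-force enumeration. The sketch below compares exact marginals with Gibbs-sampling estimates on a tiny pairwise binary model with p(x) proportional to exp(sum_i theta_i x_i + sum_ij w_ij x_i x_j). The model, its parameters, and the function names are made up for illustration.

```python
import itertools
import math
import random


def score(x, theta, edges):
    """Unnormalized log-probability of a 0/1 assignment x."""
    unary = sum(t * xi for t, xi in zip(theta, x))
    pairwise = sum(w * x[i] * x[j] for (i, j), w in edges.items())
    return unary + pairwise


def exact_marginals(theta, edges):
    """P(x_i = 1) by enumerating all 2^n assignments (small n only)."""
    n = len(theta)
    Z, marg = 0.0, [0.0] * n
    for x in itertools.product([0, 1], repeat=n):
        p = math.exp(score(x, theta, edges))
        Z += p
        for i in range(n):
            marg[i] += p * x[i]
    return [m / Z for m in marg]


def gibbs_marginals(theta, edges, sweeps=20000, burn_in=1000, seed=0):
    """Estimate P(x_i = 1) by resampling one variable at a time."""
    rng = random.Random(seed)
    n = len(theta)
    nbrs = [[] for _ in range(n)]
    for (i, j), w in edges.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    x, counts = [0] * n, [0] * n
    for s in range(sweeps + burn_in):
        for i in range(n):
            # Log-odds of x_i = 1 given its neighbors.
            a = theta[i] + sum(w * x[j] for j, w in nbrs[i])
            x[i] = 1 if rng.random() < 1.0 / (1.0 + math.exp(-a)) else 0
        if s >= burn_in:
            for i in range(n):
                counts[i] += x[i]
    return [c / sweeps for c in counts]


theta = [0.5, -0.3, 0.2]               # hypothetical unary parameters
edges = {(0, 1): 1.0, (1, 2): -0.8}    # hypothetical pairwise weights on a chain

exact = exact_marginals(theta, edges)
approx = gibbs_marginals(theta, edges)
print(max(abs(e - a) for e, a in zip(exact, approx)))
```

The same harness can score other approximate methods by swapping out `gibbs_marginals`, as long as the model stays small enough to enumerate.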

MAP inference

Compare exact techniques (e.g., using graph cuts; junction trees with low-treewidth models) with approximate techniques (Max-product, LP relaxations...)

Compare different algorithms for Bayesian filtering in dynamical models.

Assumed density filtering
Particle filtering. Rao-Blackwellization for data association, for the "two streams" hypothesis, for SLAM benchmarks
Ensemble Kalman filters
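
As a reference point for such a comparison, here is a minimal bootstrap particle filter for a one-dimensional random-walk state observed with Gaussian noise. The model, the noise levels, the initial belief, and the function name are assumptions chosen to keep the sketch short. On a linear-Gaussian model like this one, a Kalman filter gives the exact answer to compare against.

```python
import math
import random


def particle_filter(observations, n_particles=500, process_std=0.5, obs_std=0.5, seed=0):
    """Return the filtered posterior-mean estimate after each observation."""
    rng = random.Random(seed)
    # Broad initial belief over the hidden state (arbitrary choice for this sketch).
    particles = [rng.gauss(0.0, 5.0) for _ in range(n_particles)]
    estimates = []
    for y in observations:
        # Predict: propagate each particle through the random-walk dynamics.
        particles = [x + rng.gauss(0.0, process_std) for x in particles]
        # Update: weight each particle by the Gaussian likelihood of y.
        weights = [math.exp(-0.5 * ((y - x) / obs_std) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # All weights underflowed; fall back to uniform weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        estimates.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # Resample with replacement in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return estimates


# Made-up observations: a sensor that reads 5.0 at every step.
estimates = particle_filter([5.0] * 20)
print(estimates[-1])
```

Resampling at every step keeps the code short; resampling only when the effective sample size drops is a common refinement.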

Compare arithmetic circuits and Bayes nets in their ability to represent different data sets (e.g., how much compression do we get by representing a Bayes net as an arithmetic circuit)
Compare Gaussian graphical models with inferred sparse precision matrix with Gaussian processes for spatial data.

Applications

Fault detection in sensor networks (e.g., automatic data cleaning based on detecting outliers exploiting correlation)
Experiment with different models for image segmentation / foreground-background classification
Compare probabilistic context free grammars with Hidden Markov models for parsing
Model-identification for physiological data (e.g., using Gaussian Processes)
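
For the fault-detection idea in the list above, a very simple baseline is to predict one sensor from a correlated neighbor and flag readings with unusually large residuals. The sketch below uses a least-squares line and a fixed residual threshold. The readings are synthetic, and the threshold and function names are assumptions; a graphical-model approach would replace the single regression with a joint model over all sensors.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx


def flag_faults(xs, ys, threshold=3.0):
    """Indices where y deviates from its prediction by more than threshold * RMS residual."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    scale = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    if scale < 1e-12:
        return []  # readings fit the line essentially exactly
    return [t for t, r in enumerate(residuals) if abs(r) > threshold * scale]


# Synthetic readings: sensor B tracks sensor A linearly, with one corrupted value.
sensor_a = [20.0 + 0.1 * t for t in range(30)]
sensor_b = [2.0 * a + 1.0 for a in sensor_a]
sensor_b[15] += 10.0  # injected fault

faulty = flag_faults(sensor_a, sensor_b)
print(faulty)  # [15]
```

Because the fault itself pulls on the fitted line and inflates the residual scale, a robust fit or a median-based scale is a natural next step on real data.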

Structured models

Learn a simplified class of relational models (e.g., no existence uncertainty)
Link prediction / collaborative filtering. (1) E.g., compare matrix factorization techniques with loopy BP in factor graph. (2) Given data about part of a graph, predict presence of edges
Experiment with topic models (e.g., LDA) on some interesting data set relevant to your research
Apply non-parametric Bayesian clustering (e.g., using Hierarchical Dirichlet Processes) on some data set

In particular on the fly data, the GPS Alice data, or the pedestrian data
For data association
For the motion segmentation problem / layered generative modeling

Other

Value of information and experimental design

Compare different experimental design criteria / different heuristics for optimizing VOI
Implement a simple automatic troubleshooting system (e.g., why does my car not start?)

Experiment with algorithms for solving the maximum a posteriori (MAP) assignment problem [Park & Darwiche UAI '01]
Experiment with polygonal random fields for SLAM. http://ai.stanford.edu/~paskin/prf/
Use Kirchhoff's Matrix Tree theorem for exact inference (e.g., as in http://people.csail.mit.edu/mcollins/papers/matrix-tree.pdf)
Experiment with affinity propagation as exemplar based clustering algorithm (http://www.psi.toronto.edu/affinitypropagation/)