Overview
This page lists possible data sets, project questions, and code that you can use for your project. Since this is a research-oriented class, you are strongly encouraged to pick a project related to your own research; you are not limited to what is listed on this page. Ideally, we prefer group sizes of 2-3 people. Exceptions are possible with the instructor's permission. Please feel free to contact Andreas, Hongchao, or Pete about project ideas.
Data sets
Caltech data sets
- Urban Challenge Datasets (contact Pete Trautman):
- High fidelity GPS vehicle trajectories.
- Ladar scans.
- Possible (but probably a little difficult to recover): stereo,
video.
- Exercise physiology data (contact Pete Trautman)
- Generated by John Doyle's group: athletes are asked to ride a
stationary bike under various conditions. Heart rate, wattage output,
breathing rate, and gas exchange data are recorded.
- Fly data (contact Pete Trautman)
- High resolution data of fly activities. Using background
subtraction, fly positions are recorded over a fixed time interval.
Various positive and negative attractions are placed in the fly arena,
to encourage certain types of behavior.
- JPL data sets (contact Pete Trautman)
- Orbital remote sensing imagery of Mars to predict areas of high
danger to rovers; some of the data is truthed, that is, it records how much
slippage actually occurred during actual rover trajectories.
- Use rover slip data to estimate parameters of soil mechanics
models
- Video truthed people tracks
- UAV flyover data, with annotated lakes and buildings.
- Data for visual SLAM
- Person segmentation: there is a data set of people walking,
who have been annotated with bounding boxes around them.
- LDPC data sets (contact Hongchao Zhou)
- Parity-check matrix for an LDPC code.
- Received signal.
Image & Video data
- ImageNet (WordNet semantic network annotated with images)
- Corel Image Data
- http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.html
- http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
- A simple bee dance data set from Georgia Tech (find number of discrete modes, and learn dynamics of each mode).
- TRECVID (Competition for multimedia information retrieval. Fairly large archive of video data sets, along with featurizations).
- Character recognition (Optical character recognition, and the simpler digit recognition task)
- Caltech 101 Vision data set for image classification
- INRIA Pedestrian Dataset
- Visual Object Challenge
- Ground truthed Pedestrian ETH data
Neuroscience & Physiology data
- ICDM Brain Connectivity competition
- fMRI data (want to predict cognitive state given brain activation data)
Collaborative prediction data
- Netflix / MovieLens data
- Predict movie ratings based on training data.
Sensor network data
- Data from a 54-node sensor network deployment: temperature, humidity, and light data, along with the voltage level of the batteries at each node. (Berkeley)
- Data from a wireless sensor network for traffic surveillance, including acoustic signal data, magnetic signal data, etc. (Berkeley)
- Light sensor network data, including link quality information between pairs of sensors
- SLAM repository at rawseeds.org
NLP & Text data
- Twenty Newsgroups (1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles). Useful for a variety of text classification and/or clustering projects.
- WebKB: this data set contains web pages from 4 universities, labeled as professor, student, project, or other pages.
- Enron e-mail: ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.
- 10K Corpus: 10-K reports from thousands of publicly traded U.S. companies, published in 1996–2006, along with stock return volatility measurements for the twelve-month period before and the twelve-month period after each report.
Network data
- Large social network data sets (for link prediction, etc.)
Other sources of data
- UC Irvine has a repository that could be useful. Many of these data sets have been used extensively in graphical models research.
- Sam Roweis also has a link to several datasets (most ready for use in Matlab)
- KDD Cup data sets
Project ideas
Caltech data related ideas
- Use structured prediction to predict slip in rover data (e.g., using conditional random fields).
Compare the predictions to actual parametric models estimated from slip data.
- Activity recognition of fly data (e.g., using hierarchical conditional random fields). E.g.:
- Location-Based Activity Recognition. L. Liao, D. Fox, and H. Kautz. NIPS-05.
- Learning and Inferring Transportation Routines. L. Liao, D. Fox, and H. Kautz. AAAI-04.
- Clustering/segmentation of Ladar scans, video, GPS trajectories
using graphical methods.
- Compare graphical model methods with classical model ID methods
to analyze the physiology data.
- Apply approximate inference methods to fly data, SLAM data, or
visual SLAM data to do tracking,
data association, multitarget tracking, etc.
- Compare different approximate inference techniques (loopy BP,
variational inference, ...) for coding theory (LDPC codes)
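For the LDPC idea above, the parity-check matrix defines the factor graph on which loopy BP or a variational method would run: each row is one parity constraint over the code bits. As a starting point, here is a minimal sketch of the syndrome check that tells you which constraints a hard-decision word violates. The matrix below is a hypothetical toy example (the (7,4) Hamming code, which is not a low-density code and not taken from the course data), and the function names are made up for illustration.

```python
def syndrome(H, x):
    """Return H*x mod 2 for a binary parity-check matrix H (list of rows) and bit list x."""
    if any(len(row) != len(x) for row in H):
        raise ValueError("each row of H must have one entry per code bit")
    return [sum(h * b for h, b in zip(row, x)) % 2 for row in H]


def unsatisfied_checks(H, x):
    """Indices of the parity checks (rows of H) that the word x violates."""
    return [i for i, s in enumerate(syndrome(H, x)) if s == 1]


# Hypothetical toy parity-check matrix (Hamming (7,4)); a real LDPC matrix
# would be much larger and much sparser.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

codeword = [1, 1, 1, 0, 0, 0, 0]   # satisfies every check in H
received = codeword[:]
received[4] ^= 1                   # simulate a single bit error at position 4

print(syndrome(H, codeword))            # all checks satisfied
print(unsatisfied_checks(H, received))  # checks that touch the flipped bit
```

An iterative decoder would pass messages between bits and the checks listed here until the syndrome is all zeros (or an iteration limit is reached).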
Learning and Modeling
- Compare constraint-based (e.g., using independence tests) and score-based algorithms for structure
learning.
- Implement algorithms for structure learning of undirected
graphical models. E.g., based on L1 regularization (e.g., Ravikumar et
al, NIPS '07, NIPS '08)
- Experiment with Bayesian model averaging (e.g., using sampling)
- Compare conditional random fields with generative models
(directed or undirected) on some learning task
- Compare Max-margin Markov Nets [Taskar et al NIPS '03] with
Conditional Random Fields [ICML '01]
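Constraint-based structure learning, mentioned in the list above, rests on deciding from data whether two variables are (conditionally) independent. As a rough illustration only, the sketch below computes the empirical mutual information between two discrete variables, which is zero exactly when the empirical joint factorizes. It assumes the data arrive as a list of (x, y) tuples; a real algorithm would also need conditional tests and a significance threshold, neither of which is shown here.

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Empirical mutual information (in nats) between two discrete variables.

    `pairs` is a list of (x, y) observations with hashable values.
    """
    n = len(pairs)
    if n == 0:
        raise ValueError("need at least one observation")
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log( p(x,y) / (p(x) p(y)) ), written in terms of counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi


# Made-up samples: the first joint factorizes, the second is perfectly coupled.
independent = [(0, 0), (0, 1), (1, 0), (1, 1)]
dependent = [(0, 0), (1, 1), (0, 0), (1, 1)]

print(mutual_information(independent))
print(mutual_information(dependent))
```

A score-based method would instead search over graphs and compare a penalized likelihood, so the two approaches can disagree on small samples.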
Inference
- Compare different techniques for exact inference (in terms of
complexity, ...)
- Junction tree inference
- Bucket elimination
- Recursive conditioning
- Arithmetic circuits
- Compare different techniques for approximate inference
- Variational inference (structured mean field, etc.)
- Generalized belief propagation
- Sampling (MCMC / Gibbs /...)
- Preconditioning based inference (Ravikumar et al NIPS '05)
- MAP inference
- Compare exact techniques (e.g., using graph cuts; junction
trees with low-treewidth models) with approximate techniques
(Max-product, LP relaxations...)
- Compare different algorithms for Bayesian filtering in dynamical
models.
- Assumed Density filtering
- Particle filtering. Rao-Blackwellization for data association,
for the "two streams" hypothesis, for SLAM benchmarks
- Ensemble Kalman filters
- Compare arithmetic circuits and Bayes nets in their ability to
represent different data sets (e.g., how much compression do we get by
representing a Bayes net as an arithmetic circuit)
- Compare Gaussian graphical models with inferred sparse precision
matrix with Gaussian processes for spatial data.
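To make the "compare exact inference techniques in terms of complexity" idea concrete, here is a small sketch that computes the same marginal in two ways on a three-variable chain A -> B -> C: by summing the full joint, and by eliminating one variable at a time. The conditional probability tables are invented for illustration and the function names are hypothetical.

```python
from itertools import product

# Hypothetical CPTs for a binary chain A -> B -> C (numbers are illustrative only).
p_a = [0.6, 0.4]                          # P(A)
p_b_given_a = [[0.7, 0.3], [0.2, 0.8]]    # P(B | A), rows indexed by A
p_c_given_b = [[0.9, 0.1], [0.5, 0.5]]    # P(C | B), rows indexed by B


def marginal_c_enumeration():
    """P(C) by summing the full joint: the work grows exponentially with the number of variables."""
    p_c = [0.0, 0.0]
    for a, b, c in product(range(2), repeat=3):
        p_c[c] += p_a[a] * p_b_given_a[a][b] * p_c_given_b[b][c]
    return p_c


def marginal_c_elimination():
    """P(C) by eliminating A, then B: the work grows linearly with the chain length."""
    p_b = [sum(p_a[a] * p_b_given_a[a][b] for a in range(2)) for b in range(2)]
    return [sum(p_b[b] * p_c_given_b[b][c] for b in range(2)) for c in range(2)]


print(marginal_c_enumeration())
print(marginal_c_elimination())
```

Both functions return the same distribution; a project could time the two approaches on longer chains or on loopy graphs, where the elimination order starts to matter.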
Applications
- Fault detection in sensor networks (e.g., automatic data cleaning
based on detecting outliers exploiting correlation)
- Experiment with different models for image segmentation /
foreground-background classification
- Compare probabilistic context free grammars with Hidden Markov
models for parsing
- Model-identification for physiological data (e.g., using Gaussian
Processes)
Structured models
- Learn a simplified class of relational models (e.g., no existence
uncertainty)
- Link prediction / collaborative filtering. (1) E.g., compare
matrix factorization techniques with loopy BP in factor graph. (2)
Given data about part of a graph, predict presence of edges
- Experiment with topic models (e.g., LDA) on some interesting data
set relevant to your research
- Apply non-parametric Bayesian clustering (e.g., using
Hierarchical Dirichlet Processes) on some data set
- in particular on the fly data, on the GPS Alice data, on the
pedestrian data
- For data association
- For the motion segmentation problem / layered generative modeling
Other
- Value of information and experimental design
- Compare different experimental design criteria / different
heuristics for optimizing VOI
- Implement a simple automatic troubleshooting system (e.g., why
does my car not start?)
- Experiment with algorithms for solving the maximum a posteriori
(MAP) assignment problem [Park & Darwiche UAI '01]
- Experiment with polygonal random fields for SLAM.
http://ai.stanford.edu/~paskin/prf/
- Use Kirchhoff's Matrix Tree theorem for exact inference (e.g., as
in http://people.csail.mit.edu/mcollins/papers/matrix-tree.pdf)
- Experiment with affinity propagation as an exemplar-based clustering
algorithm (http://www.psi.toronto.edu/affinitypropagation/)
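For the Matrix Tree theorem idea above: the theorem says that the sum, over all spanning trees of a graph, of the product of edge weights equals the determinant of the weighted graph Laplacian with one row and the matching column removed. That sum is exactly the partition function of a distribution over spanning trees. The sketch below covers only the undirected case and uses exact rational arithmetic; applications to directed structures need the directed variant of the theorem, which is not shown. The function names are hypothetical.

```python
from fractions import Fraction


def determinant(matrix):
    """Exact determinant by Gaussian elimination over Fractions."""
    m = [[Fraction(v) for v in row] for row in matrix]
    n = len(m)
    det = Fraction(1)
    for col in range(n):
        # Find a row at or below the diagonal with a nonzero entry in this column.
        pivot = next((r for r in range(col, n) if m[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            det = -det  # a row swap flips the sign
        det *= m[col][col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= factor * m[col][c]
    return det


def spanning_tree_weight_sum(n, weighted_edges):
    """Sum over spanning trees of the product of edge weights (undirected graph on nodes 0..n-1)."""
    lap = [[Fraction(0)] * n for _ in range(n)]
    for u, v, w in weighted_edges:
        w = Fraction(w)
        lap[u][u] += w
        lap[v][v] += w
        lap[u][v] -= w
        lap[v][u] -= w
    minor = [row[1:] for row in lap[1:]]  # delete row 0 and column 0
    return determinant(minor)


# Complete graph on 4 nodes with unit weights: the result counts spanning trees.
k4 = [(u, v, 1) for u in range(4) for v in range(u + 1, 4)]
# Triangle with weights 2, 3, 5: the trees are the three pairs of edges.
triangle = [(0, 1, 2), (1, 2, 3), (0, 2, 5)]

print(spanning_tree_weight_sum(4, k4))
print(spanning_tree_weight_sum(3, triangle))
```

Edge marginals follow from the same quantity, for example by comparing the sum with and without a given edge.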
Code
- Gaussian Process Code repository
- Kevin Murphy's Bayes net toolbox in Matlab
- Infer.NET -- Microsoft Research UK's graphical model library
- PNL -- Intel's Probabilistic Network Library
- Inference with Gaussian processes
- Andreas Krause's submodular function optimization toolbox
- OpenCV (open source library for computer vision)
- Some code for image classification (using a "bag-of-words" representation)
- Mark Steyvers and Tom Griffiths Matlab Topic Modelling Toolbox.
- David Blei: Latent Dirichlet allocation (LDA) for topic modeling in C.
- Y.W. Teh. Nonparametric Bayesian Mixture Models - release 2.1.
- Hal Daume III. Fast search for Dirichlet process mixture models
- Kenichi Kurihara. Variational Dirichlet Process Gaussian Mixture Model
- Adnan Darwiche's software for compiling Bayes nets into arithmetic circuits
- JavaBayes (contains example Bayes nets)
- WinBUGS (Bayesian inference using Gibbs sampling)