# Overview

This page lists possible data sets, ideas for questions, and code that you can use for your project. Since this is a research-oriented class, you are strongly encouraged to pick a project related to your own research; you are not limited to what is listed on this page. Ideally, groups should have 2-3 people; exceptions are possible with the instructor's permission. Feel free to contact Andreas, Daniel, or Deb about project ideas.

# Data sets

## NLP & Text data

- Twenty Newsgroups: 1000 text articles posted to each of 20 online newsgroups, for a total of 20,000 articles. Useful for a variety of text classification and/or clustering projects.
- WebKB: webpages from 4 universities, labeled as professor, student, project, or other pages.
- Enron e-mail: ~500K e-mails collected from Enron employees. It has been used for research into information extraction, social network analysis, and topic modeling.
- 10K Corpus: 10-K reports from thousands of publicly traded U.S. companies, published in 1996-2006, together with stock return volatility measurements for the twelve-month periods before and after each report.

## Image & Video data

- ImageNet (WordNet semantic network annotated with images)
- Corel Image Data: http://kdd.ics.uci.edu/databases/CorelFeatures/CorelFeatures.html
- Berkeley Segmentation Dataset: http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds/
- A simple bee dance data set from Georgia Tech (find number of discrete modes, and learn dynamics of each mode).
- TRECVID (Competition for multimedia information retrieval. Fairly large archive of video data sets, along with featurizations).
- Character recognition (Optical character recognition, and the simpler digit recognition task)
- Caltech 101 Vision data set for image classification
- INRIA Pedestrian Dataset
- Visual Object Challenge
- Ground truthed Pedestrian ETH data

## Neuroscience & Physiology data

- ICDM Brain Connectivity competition
- fMRI data (the goal is to predict cognitive state from brain activation data)
- Collaborative Research in Neuroscience data sets

## Collaborative prediction data

- Netflix / MovieLens data: predict movie ratings based on training data.

## Sensor network data

- Data from a 54-node sensor network deployment (Berkeley): temperature, humidity, and light data, along with the voltage level of the batteries at each node. Useful, e.g., for fitting spatio-temporal GP models, outlier detection, active learning, Gaussian process optimization, ...

- Data from a wireless sensor network for traffic surveillance, including acoustic signal data, magnetic signal data, etc. (Berkeley)
- Light sensor network data, including link quality information between pairs of sensors
- SLAM repository at rawseeds.org

## Network data

- Large social network data sets (for link prediction, etc.)

## Other sources of data

- UC Irvine has a repository that could be useful. Many of these data sets have been used extensively in machine learning research.
- Sam Roweis also has a link to several datasets (most ready for use in Matlab)
- KDD Cup data sets

# Project ideas

## Online learning, bandit optimization, dimension reduction

- Compare online classification / regression with offline algorithms on data sets
- Experiment with parallel online learning:
  - M. Zinkevich, A. Smola, J. Langford. "Slow Learners are Fast", NIPS 2009.

- Online multitask learning, e.g., for link prediction / collaborative filtering (e.g., on Netflix data)
  - K. Weinberger, A. Dasgupta, J. Attenberg, J. Langford, A. Smola. "Feature Hashing for Large Scale Multitask Learning", ICML 2009.
- Implement and evaluate some nonstandard bandit algorithm (X-armed bandit [A.2], contextual bandits [A.4], [A.1] ...)
  - S. Bubeck, R. Munos, G. Stoltz, C. Szepesvari. "Online Optimization in X-Armed Bandits", NIPS 2008.
  - J. Langford, T. Zhang. "The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits", NIPS 2007.
  - S. Pandey, D. Chakrabarti, D. Agrawal. "Multi-armed Bandit Problems with Dependent Arms", ICML 2006.
- Compare different bandit algorithms (e.g., low-regret algorithms like UCB with the optimal solution using MDPs / Gittins indices)
  - J. C. Gittins, D. M. Jones. "A Dynamic Allocation Index for the Discounted Multiarmed Bandit Problem", Biometrika, Vol. 66, No. 3 (1979), pp. 561-565.
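
As a possible starting point for the bandit comparisons above, the following is a minimal sketch of UCB1 on simulated Bernoulli arms. The arm means, horizon, seed, and the function name `ucb1` are illustrative assumptions and do not come from the cited papers; the index is the standard empirical mean plus a `sqrt(2 ln t / n)` exploration bonus.

```python
import math
import random


def ucb1(arm_means, horizon, seed=0):
    """Run UCB1 on simulated Bernoulli arms; return (pull counts, total reward)."""
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of pulls per arm
    sums = [0.0] * n_arms   # cumulative reward per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1     # pull every arm once to initialize the estimates
        else:
            # pick the arm with the largest empirical mean plus exploration bonus
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return counts, total


# Hypothetical three-armed problem; the last arm is the best one.
counts, total = ucb1([0.2, 0.5, 0.8], horizon=2000)
print(counts, total)
```

A project could replace the simulated arms with rewards derived from one of the data sets above and compare the pull counts against a Gittins-index policy.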

- Experiment with Gaussian process optimization, compare different selection heuristics
  - D. Lizotte, T. Wang, M. Bowling, D. Schuurmans. "Automatic Gait Optimization with Gaussian Process Regression", IJCAI 2007.
  - N. Srinivas, A. Krause, S. M. Kakade, M. Seeger. "Gaussian Process Bandits without Regret: An Experimental Design Approach", arXiv.
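
One selection heuristic to try for Gaussian process optimization is an upper confidence bound rule. The sketch below assumes NumPy, a 1-D input, a squared-exponential kernel with a hand-picked length scale, and a fixed exploration weight `beta`; the cited GP bandit paper uses a schedule for this weight, so treat the constant here as a simplification. All function names and the toy objective are hypothetical.

```python
import numpy as np


def rbf_kernel(a, b, length_scale=0.2):
    """Squared-exponential kernel between two 1-D arrays of inputs."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)


def gp_posterior(x_obs, y_obs, x_query, noise=1e-2):
    """Posterior mean and standard deviation of a zero-mean GP."""
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_qo = rbf_kernel(x_query, x_obs)
    mean = k_qo @ np.linalg.solve(k_oo, y_obs)
    # the prior variance is 1 for this kernel; subtract the explained part
    var = 1.0 - np.sum(k_qo * np.linalg.solve(k_oo, k_qo.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))


def gp_ucb(objective, candidates, n_rounds=25, beta=4.0, seed=0):
    """Query the candidate maximizing mean + sqrt(beta) * std in each round."""
    rng = np.random.default_rng(seed)
    x_obs = np.array([rng.choice(candidates)])   # random first query
    y_obs = np.array([objective(x_obs[0])])
    for _ in range(n_rounds - 1):
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        x_next = candidates[np.argmax(mean + np.sqrt(beta) * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, objective(x_next))
    return x_obs, y_obs


# Toy objective with its maximum at x = 0.7 (an assumption for illustration).
candidates = np.linspace(0.0, 1.0, 101)
x_obs, y_obs = gp_ucb(lambda x: -(x - 0.7) ** 2, candidates)
print(x_obs[np.argmax(y_obs)])
```

Swapping the `argmax` criterion for another rule (for example, highest posterior mean or highest posterior variance) gives a simple way to compare selection heuristics on the same data.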

- Random features -- an "explicit" kernel trick. Compare against standard, kernelized SVMs in runtime / accuracy
  - A. Rahimi, B. Recht. "Random Features for Large-Scale Kernel Machines", NIPS 2007.
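
To make the "explicit" kernel trick concrete, here is a minimal sketch of random Fourier features for the Gaussian kernel `exp(-gamma * ||a - b||^2)`, assuming NumPy. The function name, the value of `gamma`, and the number of features are illustrative choices; the check at the end only compares the approximate and exact kernel matrices on a few random points.

```python
import numpy as np


def random_fourier_features(x, n_features, gamma, seed=0):
    """Map rows of x to features whose inner products approximate
    the Gaussian kernel exp(-gamma * ||a - b||^2)."""
    rng = np.random.default_rng(seed)
    # frequencies are drawn from the kernel's Fourier transform: N(0, 2 * gamma * I)
    weights = rng.normal(scale=np.sqrt(2.0 * gamma), size=(x.shape[1], n_features))
    offsets = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
    return np.sqrt(2.0 / n_features) * np.cos(x @ weights + offsets)


rng = np.random.default_rng(1)
x = rng.normal(size=(5, 3))
gamma = 0.5
z = random_fourier_features(x, n_features=4000, gamma=gamma)

approx = z @ z.T                                   # kernel via explicit features
sq_dists = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
exact = np.exp(-gamma * sq_dists)                  # kernel computed directly
print(np.abs(approx - exact).max())
```

A linear SVM trained on `z` can then be timed against a kernelized SVM trained on `x`, which is the comparison suggested above.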

- Compare different dimension reduction algorithms (PCA, LLE, ISOMAP, Maximum Variance Unfolding, ...)
  - S. Roweis, L. Saul. "Nonlinear Dimensionality Reduction by Locally Linear Embedding", Science, Vol. 290, No. 5500 (Dec. 22, 2000), pp. 2323-2326.
  - ISOMAP: http://waldron.stanford.edu/~isomap/
  - K. Q. Weinberger, L. K. Saul. "Unsupervised Learning of Image Manifolds by Semidefinite Programming", Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR-04), Washington D.C., 2004.

- Compare online and offline algorithms for dimension reduction
  - Random projections
  - Online PCA
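
Random projections are probably the simplest of these methods to prototype. The sketch below, assuming NumPy, projects synthetic data with a scaled Gaussian matrix and checks how well pairwise distances are preserved; the dimensions, seed, and function name are hypothetical.

```python
import numpy as np


def random_projection(x, target_dim, seed=0):
    """Project rows of x to target_dim dimensions with a Gaussian matrix."""
    rng = np.random.default_rng(seed)
    # scaling by 1 / sqrt(target_dim) keeps squared distances unbiased
    proj = rng.normal(size=(x.shape[1], target_dim)) / np.sqrt(target_dim)
    return x @ proj


rng = np.random.default_rng(1)
x = rng.normal(size=(20, 1000))          # synthetic high-dimensional data
y = random_projection(x, target_dim=400)

# ratio of projected to original pairwise distances (off-diagonal entries only)
orig_dist = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
proj_dist = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
mask = ~np.eye(len(x), dtype=bool)
ratio = proj_dist[mask] / orig_dist[mask]
print(ratio.min(), ratio.max())
```

The projection matrix does not depend on the data, so it can be applied to a stream one point at a time, which makes it a natural baseline for online PCA.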

- Implement an online clustering algorithm, and apply it to a large (e.g., image) data set
  - Online k-Means
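
A minimal sketch of sequential (online) k-means follows. It assumes the first `k` points of the stream seed the centers and that each later point moves its nearest center by a `1 / count` step, so every center is the running mean of the points assigned to it. The synthetic two-cluster stream and the function name are illustrative only.

```python
import random


def online_kmeans(stream, k):
    """Sequential k-means over an iterable of equal-length numeric tuples."""
    centers, counts = [], []
    for point in stream:
        if len(centers) < k:
            # the first k points become the initial centers
            centers.append(list(point))
            counts.append(1)
            continue
        # find the nearest center by squared Euclidean distance
        nearest = min(
            range(k),
            key=lambda j: sum((c - p) ** 2 for c, p in zip(centers[j], point)),
        )
        counts[nearest] += 1
        step = 1.0 / counts[nearest]
        # running-mean update of the chosen center
        centers[nearest] = [
            c + step * (p - c) for c, p in zip(centers[nearest], point)
        ]
    return centers


# Synthetic stream alternating between clusters around (0, 0) and (10, 10).
rng = random.Random(0)
stream = [
    (rng.gauss(10.0 * (i % 2), 0.5), rng.gauss(10.0 * (i % 2), 0.5))
    for i in range(400)
]
centers = online_kmeans(stream, k=2)
print(centers)
```

Seeding with the first `k` points is sensitive to the order of the stream, so a project could compare it with other initializations on a real data set.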

- Implement and evaluate algorithms for online linear/convex optimization (e.g., online shortest paths)
  - A. Kalai, S. Vempala. "Efficient Algorithms for Online Decision Problems", Journal of Computer and System Sciences, 2005.
  - B. McMahan, A. Blum. "Online Geometric Optimization in the Bandit Setting Against an Adaptive Adversary", COLT 2004.
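
For online decision problems, a follow-the-perturbed-leader rule is easy to prototype in the simplest setting of choosing among a few experts. The sketch below is an assumption-laden simplification, not the exact algorithm from the papers above: it subtracts fresh one-sided exponential noise with rate `epsilon` from each expert's cumulative loss and plays the minimizer. The loss sequence and the function name are hypothetical.

```python
import random


def follow_the_perturbed_leader(loss_rows, epsilon, seed=0):
    """Play, in each round, the expert with the smallest perturbed cumulative loss.

    Returns (total loss of the learner, loss of the best fixed expert)."""
    rng = random.Random(seed)
    n_experts = len(loss_rows[0])
    cumulative = [0.0] * n_experts
    total_loss = 0.0
    for losses in loss_rows:
        # fresh exponential perturbation in every round
        perturbed = [c - rng.expovariate(epsilon) for c in cumulative]
        choice = min(range(n_experts), key=perturbed.__getitem__)
        total_loss += losses[choice]
        # full-information feedback: every expert's loss is revealed
        cumulative = [c + l for c, l in zip(cumulative, losses)]
    return total_loss, min(cumulative)


# Toy sequence in which expert 0 is always the best (illustrative only).
rounds = [[0.1, 0.9, 0.9] for _ in range(500)]
total_loss, best_loss = follow_the_perturbed_leader(rounds, epsilon=0.1)
print(total_loss - best_loss)  # regret against the best fixed expert
```

For online shortest paths, the experts would be replaced by paths and the `min` over experts by a shortest-path computation on perturbed edge costs.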

## Active learning / experimental design

- Experiment with heuristics for active learning (e.g., SVMs / logistic regression) on some interesting data set (e.g., image classification). How much does active learning help?
  - S. Tong, D. Koller. "Support Vector Machine Active Learning with Applications to Text Classification", JMLR 2001.
- Implement and evaluate some nonstandard active learning algorithm (e.g., [B.3], [B.4], [B.8])
  - S. Dasgupta, D. J. Hsu. "Hierarchical Sampling for Active Learning", ICML 2008.
  - R. Castro, C. Kalish, R. Nowak, R. Qian, T. Rogers, X. Zhu. "Human Active Learning", NIPS 2008.
  - S. Dasgupta, D. Hsu, C. Monteleoni. "A General Agnostic Active Learning Algorithm", NIPS 2007.
- Use submodular function optimization for Bayesian experimental design on some data set (e.g.,
  - A. Krause, A. Singh, C. Guestrin. "Near-optimal Sensor Placements in Gaussian Processes: Theory, Efficient Algorithms and Empirical Studies", Journal of Machine Learning Research (JMLR) 2008.
  - M. Streeter, D. Golovin. "An Online Algorithm for Maximizing Submodular Functions", NIPS 2008.
- Implement batch mode active learning
  - S. Hoi, R. Jin, J. Zhu, M. Lyu. "Batch Mode Active Learning and its Application to Medical Image Classification", ICML 2006.
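
For the submodular experimental design idea above, the usual baseline is the greedy algorithm, which repeatedly adds the element with the largest marginal gain and is within a factor `1 - 1/e` of optimal for monotone submodular objectives under a cardinality constraint. The sketch below uses set coverage as a stand-in objective; the cited sensor placement work uses mutual information instead, and the sensor names and coverage sets here are hypothetical.

```python
def greedy_max_coverage(candidate_sets, budget):
    """Greedily pick up to `budget` sets, each time taking the largest marginal gain."""
    chosen, covered = [], set()
    for _ in range(budget):
        best = max(
            (name for name in candidate_sets if name not in chosen),
            key=lambda name: len(candidate_sets[name] - covered),
            default=None,
        )
        # stop early if nothing is left or no candidate adds new coverage
        if best is None or not (candidate_sets[best] - covered):
            break
        chosen.append(best)
        covered |= candidate_sets[best]
    return chosen, covered


# Hypothetical sensors, each covering a set of locations.
sensors = {
    "a": {1, 2, 3, 4},
    "b": {3, 4, 5},
    "c": {5, 6},
    "d": {1, 6},
}
chosen, covered = greedy_max_coverage(sensors, budget=2)
print(chosen, sorted(covered))
```

Replacing `len(candidate_sets[name] - covered)` with the marginal gain of another submodular objective keeps the rest of the loop unchanged.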

## Nonparametric learning

- Compare different active set methods for fast inference in Gaussian Process Regression
  - N. Lawrence, M. Seeger, R. Herbrich. "Fast Sparse Gaussian Process Methods: The Informative Vector Machine". In S. Becker, S. Thrun, and K. Obermayer, editors, Advances in Neural Information Processing Systems 15, pages 609-616, Cambridge, MA, 2003. The MIT Press.
  - E. Snelson, Z. Ghahramani (2006). "Sparse Gaussian Processes using Pseudo-Inputs". In Advances in Neural Information Processing Systems 2005.
- Use GP regression / SVMs for some interesting, nonstandard kernel (e.g., graph kernels, diffusion kernel)
  - N. Shervashidze, K. M. Borgwardt. "Fast Subtree Kernels on Graphs", NIPS 2009.
  - R. Kondor, J. Lafferty. "Diffusion Kernels on Graphs and Other Discrete Input Spaces", ICML 2002.
- Experiment with kernelized algorithms (Kernelized PCA, LDA, ...)

- Compare Gaussian Process classification with Support Vector Machines on some interesting data set

- Use Gaussian Processes to predict slip in JPL Mars rover data (contact instructors): orbital remote sensing imagery of Mars is used to predict areas of high danger to rovers. Some of the data is ground-truthed, that is, it records how much slippage actually occurred during real rover trajectories. Compare the predictions to parametric models estimated from slip data.
- Heteroscedastic GP regression (varying noise)
  - K. Kersting, C. Plagemann, P. Pfaff, W. Burgard. "Most-Likely Heteroscedastic Gaussian Process Regression". In Z. Ghahramani, editor, Proceedings of the 24th Annual International Conference on Machine Learning (ICML-07).
- Gaussian process mixture modeling
  - C. E. Rasmussen, Z. Ghahramani. "Infinite Mixtures of Gaussian Process Experts". In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14. The MIT Press, 2002.

# Code

- Gaussian Process Code repository
- Inference with Gaussian processes
- Lehel Csato's online GP toolbox
- Andreas Krause's submodular function optimization toolbox
- Kevin Murphy's Bayes net toolbox in Matlab
- Infer.net -- Microsoft Research UK's graphical model library
- PNL -- Intel's Probabilistic Network Library
- OpenCV (open source library for computer vision)
- Some code for image classification (using a "bag-of-words" representation)
- Y.W. Teh. Nonparametric Bayesian Mixture Models - release 2.1.
- Kenichi Kurihara. Variational Dirichlet Process Gaussian Mixture Model