PhD Course on Processing Big Data Feb 2018

Instructor: Claudia Soares, PhD Program in ECE @ Tecnico

Announcements

Slides covering up to dimensionality reduction and homework were posted. Homework due date is now May 8.
New slides posted.
Due to health issues, today's class (Mar 9) is postponed. We will meet as scheduled on Tue and then arrange an alternative date for the missed class.
I have posted the slides for lectures 1 and 2.
On Fri, March 9, our classroom will be occupied, so we will exceptionally meet at 14:00 in room 5.09, North Tower.
Classes will be held Tue and Fri at 11:00 in room 4.12, North Tower.

Why Processing Big Data?

Nowadays, data are generated in many ways, all the time and everywhere: our online activity, medical records, purchase and travel records, financial data. The data flows are now larger than the world’s storage capacity; they are heterogeneous, noisy, and incomplete, yet very useful.

This course provides frameworks and tools to find the stories behind this deluge of data.

Data are big.

And so learning depends critically on algorithms whose running time grows linearly with the size of the data.
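
As a small illustration of what linear scaling can look like in practice, the sketch below computes the mean and variance of a data stream in a single pass, touching each sample once and keeping only three numbers in memory. It uses Welford's update rule; the function name `streaming_mean_variance` is a hypothetical choice for this example and is not taken from the course material.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(stream: Iterable[float]) -> Tuple[int, float, float]:
    """Single pass over the data: O(n) time and O(1) memory (Welford's update).

    Hypothetical helper for illustration only.
    Returns (number of samples, mean, population variance).
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in stream:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)  # uses the already updated mean
    variance = m2 / count if count > 0 else 0.0
    return count, mean, variance


if __name__ == "__main__":
    # The input can be any iterable, including a generator that never
    # holds the full dataset in memory.
    n, mu, var = streaming_mean_variance(float(i) for i in range(1_000_000))
    print(n, mu, var)
```

Because the function only iterates over its input, the same code works on a list, a file read line by line, or a network stream, which is the kind of property that matters once the data no longer fit in memory.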

Data are messy.

They are heterogeneous and noisy, and they have missing entries. And so learning needs to be robust. We will go through tools like

supervised and
unsupervised learning,
online learning,
low rank models, and
graph signal processing, all with an eye on scalability.

But more importantly, we will put the algorithms to work during the course, building up to the final Big Data project.

Prerequisites

The course is open to all PhD students familiar with Linear Algebra, matrix notation, and some high-level programming (like Python, R, Julia, or MATLAB). Familiarity with probability theory will also make your journey smoother through the semester.

Lecture slides

Introduction to learning
Exploratory Data Analysis
Generalization theory
Principal Component Analysis: A Linear Algebra approach
Probabilistic PCA
Compressed Sensing
Matrix Sketching
Kernel PCA
ISOMAP
Hierarchical Clustering
Assignment Clustering
Spectral Clustering (I and II)
Processing large scale heterogeneous data: Generalized low rank models
Graphical models
Graphical models for heterogeneous data streams

(Tentative) Syllabus

Introduction
1. Exploratory Data Analysis
2. Generalization of a learned hypothesis
3. Limitations of predictive modeling

Big data: machine learning for massive datasets
1. Dimensionality reduction;
2. Compressed sensing and sparse recovery;
3. Clustering;
4. Regression;
5. Online learning.
Big messy data: Generalized low rank models
1. Missing data problem
  1. PCA;
  2. Regularized PCA and solution methods;
  3. Generalized regularization and solution methods;
  4. Matrix completion for big data;
  5. Choosing low rank models;
  6. Fitting low rank models;
  7. Applications.
2. Heterogeneous nature of data
  1. Generalized loss functions;
  2. Loss functions for abstract data types;
  3. Multidimensional loss functions;
  4. Applications.
Big data flow processing: graph signal processing
1. Introduction to graphs and their matrices;
2. Signal variation on a graph and frequency;
3. Graph filtering;
4. IIR, FIR filtering on a graph;
5. Applications.

Grading

Homework (30%), project (45%), final take-home exam (20%), participation (5%)

References

No single text covers the entirety of the course. The following books will be partially used and also complemented with recent papers.

K. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press
T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, 2nd edition, 2009
M. Udell, C. Horn, R. Zadeh, and S. Boyd, “Generalized low rank models,” Foundations and Trends in Machine Learning, 2016.
Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, “Learning from Data”, AMLBook, 2012.