PhD Course on Processing Big Data Feb 2018

Instructor: Claudia Soares, PhD Program in ECE @ Tecnico


Announcements

  • Slides covering up to dimensionality reduction and homework were posted. Homework due date is now May 8.

  • New slides posted.

  • Due to health issues, today's class (Mar 9) is postponed. We will meet as scheduled on Tue and then arrange an alternative date for the missed class.

  • I have posted the slides for lectures 1 and 2.

  • On Fri, March 9, our classroom will be occupied, so we will exceptionally meet at 14:00 in room 5.09, North Tower.

  • Classes will be held Tue and Fri at 11:00 in room 4.12, North Tower.

Why Processing Big Data?

Nowadays, data are generated in many ways, all the time and everywhere: our online activity, medical records, purchase and travel records, financial data. The data flows are now larger than the world’s storage capacity; they are heterogeneous, noisy, and incomplete, yet very useful.

This course provides frameworks and tools to find the stories behind this deluge of data.

Data are big.

And so learning depends critically on algorithms whose running time scales linearly with the size of the data.

Data are messy.

They are heterogeneous, noisy, and have missing entries. And so learning needs to be robust. We will go through tools like

  • supervised and unsupervised learning,

  • online learning,

  • low rank models, and

  • graph signal processing,

all with an eye on scalability.

But more importantly, we will put the algorithms to work during the course, building up to the final Big Data project.



Prerequisites

The course is open to all PhD students familiar with linear algebra, matrix notation, and some high-level programming language (such as Python, R, Julia, or MATLAB). Familiarity with probability theory will also make your journey through the semester smoother.

Lecture slides

  1. Introduction to learning

  2. Exploratory Data Analysis

  3. Generalization theory

  4. Principal Component Analysis: A Linear Algebra approach

  5. Probabilistic PCA

  6. Compressed Sensing

  7. Matrix Sketching

  8. Kernel PCA

  9. ISOMAP

  10. Hierarchical Clustering

  11. Assignment Clustering

  12. Spectral Clustering (I and II)

  13. Processing large scale heterogeneous data: Generalized low rank models

  14. Graphical models

  15. Graphical models for heterogeneous data streams

(Tentative) Syllabus

  1. Introduction

    1. Exploratory Data Analysis

    2. Generalization of a learned hypothesis

    3. Limitations of predictive modeling

  2. Big data: machine learning for massive datasets

    1. Dimensionality reduction;

    2. Compressed sensing and sparse recovery;

    3. Clustering;

    4. Regression;

    5. Online learning.

  3. Big messy data: Generalized low rank models

    1. Missing data problem

      1. PCA;

      2. Regularized PCA and solution methods;

      3. Generalized regularization and solution methods;

      4. Matrix completion for big data;

      5. Choosing low rank models;

      6. Fitting low rank models;

      7. Applications.

    2. Heterogeneous nature of data

      1. Generalized loss functions;

      2. Loss functions for abstract data types;

      3. Multidimensional loss functions;

      4. Applications.

  4. Big data flow processing: graph signal processing

    1. Introduction to graphs and their matrices;

    2. Signal variation on a graph and frequency;

    3. Graph filtering;

    4. IIR, FIR filtering on a graph;

    5. Applications.

Grading

Homework (30%), project (45%), final take-home exam (20%), participation (5%)

References

No single text covers the entirety of the course. The following books will be partially used and also complemented with recent papers.

  • K. Murphy, “Machine Learning: A Probabilistic Perspective”, MIT Press

  • T. Hastie, R. Tibshirani, J. Friedman, “The Elements of Statistical Learning”, 2nd edition, 2009

  • M. Udell, C. Horn, R. Zadeh, and S. Boyd, “Generalized low rank models,” Foundations and Trends in Machine Learning, 2016.

  • Y. S. Abu-Mostafa, M. Magdon-Ismail, and H.-T. Lin, “Learning from Data”, Vol. 4, Singapore: AMLBook, 2012.