TTIC 31120: Computational and Statistical Learning Theory
This is a webpage for the Fall 2012 course "Computational and Statistical Learning Theory", taught at TTIC, and also open to all University of Chicago students.
MWF 9:10-10:30AM in TTIC 530.
Instructor: Nati Srebro.
TA: Behnam Neyshabur.
Course Description
We will discuss classic results and recent advances in statistical
learning theory (mostly under the agnostic PAC model), touch on
computational learning theory, and also explore the relationship with
stochastic optimization and online regret analysis. Our emphasis will
be on developing a rigorous quantitative understanding of machine
learning and acquiring techniques for analyzing and proving
performance guarantees for learning methods.
Pre-Requisites
- Mathematical maturity, as obtained, e.g., in a rigorous analysis course.
- Discrete Math (specifically combinatorics and asymptotic notation)
- Probability Theory
- Introduction to Machine Learning
- Algorithms; Basic Complexity Theory (NP-Hardness)
Familiarity with Convex Optimization, Computational Complexity and
background in Statistics can be helpful, but is not required.
Specific Topics:
We will try to touch on:
- The Statistical Model (Learning Based on an IID Sample):
- The PAC (Probably Approximately Correct) and Agnostic PAC models.
- Stochastic Optimization
- Cardinality Bounds
- Description Length Bounds
- PAC-Bayes
- Compression Bounds
- The Growth Function and VC Dimension
- VC Subgraph Dimension and Fat Shattering Dimension
- Tight Characterization of Learning in terms of the VC and Fat Shattering Dimensions
- Covering Numbers
- Rademacher Averages, including Local Rademacher Analysis
- Uniform Learning and No-Free Lunch Theorems
- Online Learning, Online Optimization and Online Regret
- The Perceptron Rule and Online Gradient Descent
- Experts and the Winnow Rule
- Bregman Divergence and Online Mirror Descent
- Online to Batch Conversion
- Computational Lower Bounds:
- Computational Hardness of Proper Learning
- Cryptographic Hardness of Learning
- Additional Topics
- Stability Based Analysis
- Boosting: Weak Learning and the Margin Interpretation of Boosting.
Requirements and Grading:
- Required Reading: Some material will be covered through assigned
reading, rather than in class.
- Problem Sets: about 5-6 problem sets, some of which introduce additional material beyond what is covered in class. Students are required to turn in the problem sets, but they will not be graded.
- Scribing: each student is required to scribe (prepare lecture notes)
for two lectures. These notes will be graded.
- Challenge Problems: These are harder problems that might be a challenge to solve and will be included in the problem sets. Successfully completing at least 1-2 of these is required for receiving an "A" in the class. Students will also be asked to present and discuss (with the instructor) the solutions to 1-2 of these.
- Research Problems: I don't know how to solve these (though I might have an idea). Solving one of these will likely earn an "A+" in the class and might lead to a publication.
- Final Exam
In order to receive a "B" a student is expected to complete
the exercises, do a serious job at scribing a lecture and
display reasonable understanding on the exam.
In order to receive an "A" a student is expected to complete the
exercises, do an excellent job scribing a lecture,
successfully answer at least 1-2 Challenge Problems, and display a good understanding on the exam.
Recommended Texts
We will not be following a specific text, but much of the material is covered in the readings linked from the individual lectures below, e.g., Kearns and Vazirani, An Introduction to Computational Learning Theory, and Bousquet, Boucheron and Lugosi, Introduction to Statistical Learning Theory.
Lectures and Required Reading:
- Lecture 1: Wednesday October 3rd
- Course Overview
- Basic Setup: Generalization Error, Learning Rule, Learning Algorithm
- Warmup: the "Minimum Consistent Rectangle" learning rule (see the sketch below).
Required Reading: Section 1 of S. Boucheron, O. Bousquet, and G. Lugosi, "Concentration Inequalities", Advanced Lectures in Machine Learning, Springer, pp. 208-240, 2004. Alternative reading: Lecture notes.
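A minimal illustrative sketch of the "Minimum Consistent Rectangle" rule (assuming two-dimensional real inputs and +/-1 labels; the function and variable names are ours, not from the lecture):

    import numpy as np

    def min_consistent_rectangle(X, y):
        """Tightest axis-aligned rectangle containing all positive points.

        X: (m, 2) array of inputs; y: (m,) array of +/-1 labels.
        The returned predictor labels a point +1 iff it lies in the rectangle.
        """
        pos = X[y == 1]
        if len(pos) == 0:
            return lambda x: -1  # no positive examples: the empty rectangle
        lo, hi = pos.min(axis=0), pos.max(axis=0)
        return lambda x: 1 if np.all(x >= lo) and np.all(x <= hi) else -1

On a sample consistent with some rectangle, this returns the smallest consistent rectangle, which is the rule analyzed in Lecture 2.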
Problem Set 1 (due October 10th)
- Lecture 2: Friday October 5th
- Analysis of the "Minimum Consistent Rectangle" learning rule in
the realizable case. (Kearns and Vazirani Section 1.1)
- No Free Lunch Theorem.
- Empirical Risk Minimization
- Finite cardinality generalization bound and learning guarantee (the bound is recalled below).
Bousquet et al, Introduction to statistical learning theory, Sections 1-3.
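For reference, the finite-cardinality bound covered here has the standard form (Hoeffding's inequality plus a union bound): for a finite class H and an i.i.d. sample of size m, with probability at least 1 - \delta, for every h \in H,

    L(h) \le \hat{L}(h) + \sqrt{ (\log|H| + \log(2/\delta)) / (2m) },

and hence the ERM \hat{h} satisfies L(\hat{h}) \le \inf_{h \in H} L(h) + 2 \sqrt{ (\log|H| + \log(2/\delta)) / (2m) }.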
- Lecture 3: Wednesday October 10th
- Realizable, Agnostic and Optimistic Rates (details in HW)
- VC Learning Guarantee (the form of the resulting bound is recalled below):
- Growth Function for Binary Classification
- Symmetrization (proof deferred)
- The VC-dimension and Shatter Lemma (proof in HW)
Bousquet et al, Introduction to statistical learning theory, Section 4
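For orientation, the resulting VC learning guarantee has the standard form: with probability at least 1 - \delta, uniformly over h \in H,

    L(h) \le \hat{L}(h) + O( \sqrt{ (VCdim(H) \log m + \log(1/\delta)) / m } ),

obtained by bounding the growth function through the Shatter (Sauer) Lemma and applying symmetrization.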
Problem Set 2 (due October 17th)
- Lecture 4: Friday October 12th
- The VC-dimension as a tight characterization of PAC learnability
- Compression Schemes and Compression-Based Learning Guarantee
- Lecture 5: Monday October 15th
- Description Length Bounds, the Minimum Description Length and Structural Risk Minimization learning rules.
- PAC-Bayes Bounds (proof in HW)
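One standard form of the PAC-Bayes bound (in the style of McAllester): fix a prior P over hypotheses; with probability at least 1 - \delta, for every posterior Q,

    L(Q) \le \hat{L}(Q) + \sqrt{ (KL(Q||P) + \log(m/\delta)) / (2(m-1)) },

where L(Q) and \hat{L}(Q) are the population and empirical errors of a random hypothesis drawn from Q.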
- Lecture 6: Wednesday October 17th
- Universal Learning
- Introduction of the computational model; definitions of proper and improper efficient PAC learnability.
- Reducing the consistency problem to Proper PAC Learnability.
Kearns and Vazirani Chapter 1
Problem Set 3 (due October 24th)
- Lecture 7: Friday October 19th
- 3-TERM-DNF is not efficiently properly PAC learnable, but is efficiently (improperly) PAC learnable.
- On the efficient PAC learnability of poly-time functions.
Kearns and Vazirani Chapter 1
- Lecture 8: Monday October 22nd
- Learning, Cryptography, and One-Way Functions
- Hardness of Discrete Cube Root as an assumption, and the Discrete Cube Root as a learning problem.
- Cryptographic hardness of PAC Learning poly-time functions and log-depth networks.
- For further reading on hardness of learning depth-two networks: Cryptographic Hardness for Learning Intersections of Halfspaces, Klivans and Sherstov, FOCS 2006.
- Efficient Agnostic Learning:
- Reducing AGREEMENT to agnostic proper learning, and the hardness of agnostic proper learning half-spaces. (proofs in HW)
- Cryptographic hardness of agnostic improper learning of half-spaces. Further reading: Learning Kernel Based Halfspaces with the 0-1 Loss, Shai Shalev-Shwartz, Karthik Sridharan and Ohad Shamir, SIAM Journal on Computing, 2011.
- Non-convexity of 0/1-loss and introduction of convex surrogate loss functions.
Kearns and Vazirani Chapter 6
- Lecture 9: Wednesday October 24th
- Introduction to statistical learning (optimization) of real-valued objectives.
- Empirical Covering Numbers as generalization of the Growth Function
- Rademacher Complexity and Symmetrization (the definition is recalled below)
- McDiarmid's Inequality
- Massart's Finite Class Lemma (proof in HW)
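For reference, the empirical Rademacher complexity used throughout is

    R_S(F) = E_\sigma [ \sup_{f \in F} (1/m) \sum_{i=1}^m \sigma_i f(z_i) ],

where S = (z_1, ..., z_m) is the sample and \sigma_1, ..., \sigma_m are i.i.d. uniform \pm 1 signs; symmetrization bounds the expected uniform deviation between empirical and population means by twice E_S[R_S(F)].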
Problem Set 4 (due October 31st)
- Lecture 10: Friday October 26th
- Pollard's and Dudley's bounds on the Rademacher complexity in terms of covering numbers (proof of Dudley's refined bound in HW).
- Learning guarantee in terms of the VC-subgraph dimension:
- Pseudo-shattering and the VC-subgraph dimension.
- Bounding covering numbers in terms of the VC-subgraph dimension (proof deferred).
- The class of linear predictors.
- Scale Sensitive Learning
- The fat shattering dimension.
- Bounding covering numbers in terms of the fat shattering dimension---the fat shatter lemma (proof in HW).
- Lipschitz Composition Lemma.
- Conclusion: Learning guarantee in terms of fat-shattering dimension.
- Lecture 11: Monday October 29th
- The class of l_2-bounded linear predictors
- Bounding the fat-shattering dimension
- Bounding the Rademacher complexity directly
- Fat shattering as necessary and sufficient for learnability.
- Margin-based learning guarantees: the margin loss, ramp loss and hinge-loss.
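In standard form, for a real-valued predictor f, a label y \in {\pm 1} and margin \gamma > 0:

    margin loss: 1[ y f(x) < \gamma ]
    ramp loss:   min(1, max(0, 1 - y f(x)/\gamma))
    hinge loss:  max(0, 1 - y f(x)/\gamma)

so that the zero-one loss 1[y f(x) \le 0] is at most the ramp loss, which in turn is at most both the margin loss and the hinge loss; this sandwiching is what drives margin-based guarantees.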
- Lecture 12: Wednesday October 31st
- Aggregate/voting predictors
- Rademacher complexity of the convex hull.
- l_1-bounded linear predictors.
- Introduction to Online Learning
- Online learning rule, regret, mistake bound.
- Mistake bound for finite cardinality classes: HALVING
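A minimal sketch of HALVING (assuming a finite hypothesis class given explicitly as a list; the names here are illustrative):

    def halving(hypotheses, stream):
        """Predict with the majority vote of the current version space.

        hypotheses: list of functions h(x) -> {0, 1}.
        stream: iterable of (x, y) pairs; yields one prediction per round.
        Every mistake eliminates at least half of the version space, so at
        most log2(len(hypotheses)) mistakes are made in the realizable case.
        """
        version_space = list(hypotheses)
        for x, y in stream:
            votes = sum(h(x) for h in version_space)
            yield 1 if 2 * votes >= len(version_space) else 0
            # Keep only the hypotheses consistent with the revealed label.
            version_space = [h for h in version_space if h(x) == y]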
Problem Set 5&6 (due November 11th)
- Lecture 13, Friday November 2nd
- Halfspaces are not online learnable
- The Online Perceptron and its mistake bound (see the sketch below).
- Online Gradient Descent
- Relating regret and linearized regret
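A minimal sketch of the Online Perceptron (a conservative, mistake-driven update; the names here are illustrative):

    import numpy as np

    def perceptron(stream, dim):
        """Online Perceptron: update only on mistakes.

        stream: iterable of (x, y) with x a length-dim array and y in {-1, +1}.
        On a sequence separable with margin gamma by a unit-norm w*, with
        ||x|| <= R, it makes at most (R/gamma)^2 mistakes.
        """
        w = np.zeros(dim)
        for x, y in stream:
            pred = 1 if w @ x >= 0 else -1
            yield pred
            if pred != y:
                w = w + y * x  # mistake-driven update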
- Lecture 14: Monday November 5th
- Analysis of Online Gradient Descent (see the sketch below)
- Interpreting Perceptron as Online Gradient Descent
- Conservative vs aggressive updates.
- Non-separable perceptron.
- Hypothesis classes and function classes; Linear and Lipschitz classes.
- Online vs. Statistical Learning; Online to Batch Conversion.
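A minimal sketch of projected Online Gradient Descent with iterate averaging, the simplest online-to-batch conversion (grad, project and the fixed step size eta are our illustrative interface, not a prescribed one):

    import numpy as np

    def ogd_average(grad, project, x0, eta, T):
        """Online (sub)gradient descent, returning the average iterate.

        grad(t, x): a subgradient of the round-t loss at x.
        project(x): Euclidean projection onto the feasible set.
        When the losses are i.i.d., the regret bound for OGD transfers to
        an expected-risk bound for the returned average (online-to-batch).
        """
        x = np.array(x0, dtype=float)
        avg = np.zeros_like(x)
        for t in range(T):
            avg += x / T
            x = project(x - eta * grad(t, x))
        return avg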
- Lecture 15: Wednesday November 7th
- Online to Batch Conversions and the Stochastic Approximation approach.
- Stochastic Gradient Descent as computationally optimal for PAC Learning.
- Understanding Online Gradient Descent in terms of regularization.
- Beyond the l_2 norm: Strongly convex distance generating functions, Bregman divergence, strong convexity and dual norms.
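For reference, the Bregman divergence of a differentiable, strictly convex R is

    D_R(x, y) = R(x) - R(y) - <\nabla R(y), x - y>;

e.g., R(x) = (1/2)||x||_2^2 gives D_R(x, y) = (1/2)||x - y||_2^2, while the negative entropy R(p) = \sum_i p_i \log p_i on the simplex gives the KL divergence.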
- Lecture 16: Monday November 12th
- Background Reading: Sections 5.4.1-5.4.2 of Nemirovski's "Lectures on Modern Convex Optimization"
- Online Mirror Descent
- l_p norm bounded classes; using a squared l_p norm for an l_1 constrained problem.
- The entropy regularized, exponentiated gradient, and the multiplicative weights algorithm.
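A minimal sketch of multiplicative weights / exponentiated gradient over the simplex (assuming loss vectors in [0,1]^n; the names here are illustrative):

    import numpy as np

    def multiplicative_weights(loss_stream, n, eta):
        """Maintain a distribution over n experts via exponentiated updates.

        loss_stream: iterable of length-n loss vectors with entries in [0, 1].
        For a suitable eta this achieves regret O(sqrt(T log n)), matching
        the entropy-regularized mirror descent analysis from class.
        """
        w = np.ones(n)
        for loss in loss_stream:
            p = w / w.sum()
            yield p  # play the current distribution over experts
            w = w * np.exp(-eta * np.asarray(loss))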
- Lecture 17: Wednesday November 14th
- Plan:
- Mirror Descent, Dual Averaging, Passive-Aggressive and Follow-The-Regularized-Leader
- Using a strongly convex regularizer to bound the Rademacher complexity.
- Stability based learning guarantees
- Bounding stability in terms of strong convexity
Problem Set 7 (due November 21st)
- Lecture 18: Monday November 19th
- Plan:
- Lecture 19: Wednesday November 21st
- Plan:
Problem Set 8 (due December 1st)
- Lecture 20: Monday November 26th
- Plan:
Assignments:
Problem Set 1 (due October 10th)
Problem Set 2 (due October 17th)
Problem Set 3 (due October 24th)
Problem Set 4 (due October 31st)
Problem Set 5&6 (due November 11th)
Problem Set 7 (due November 21st)
Problem Set 8 (due December 1st)
Last modified: Mon Nov 12 22:39:06 Central Standard Time 2012