TTIC 31120: Computational and Statistical Learning Theory
This is the webpage for the Fall 2012 course "Computational and Statistical Learning Theory", taught at TTIC and also open to all University of Chicago students.
MWF 9:10-10:30 AM in TTIC 530.
Instructor: Nati Srebro.
TA: Behnam Neyshabur.
Course Description
We will discuss classic results and recent advances in statistical
learning theory (mostly under the agnostic PAC model), touch on
computational learning theory, and also explore the relationship with
stochastic optimization and online regret analysis. Our emphasis will
be on developing a rigorous quantitative understanding of machine
learning and acquiring techniques for analyzing and proving
performance guarantees for learning methods.
Prerequisites
 Mathematical maturity, as obtained, e.g., in a rigorous analysis course.
 Discrete Math (specifically combinatorics and asymptotic notation)
 Probability Theory
 Introduction to Machine Learning
 Algorithms; Basic Complexity Theory (NP-Hardness)
Familiarity with Convex Optimization, Computational Complexity and
background in Statistics can be helpful, but is not required.
Specific Topics:
We will try to cover:
 The Statistical Model (Learning Based on an IID Sample):
 The PAC (Probably Approximately Correct) and Agnostic PAC models.
 Stochastic Optimization
 Cardinality Bounds
 Description Length Bounds
 PAC-Bayes
 Compression Bounds
 The Growth Function and VC Dimension
 VC Subgraph Dimension and Fat Shattering Dimension
 Tight Characterization of Learning in terms of the VC and Fat Shattering Dimensions
 Covering Numbers
 Rademacher Averages, including Local Rademacher Analysis
 Uniform Learning and No-Free-Lunch Theorems
 Online Learning, Online Optimization and Online Regret
 The Perceptron Rule and Online Gradient Descent
 Experts and the Winnow Rule
 Bregman Divergence and Online Mirror Descent
 Online to Batch Conversion
 Computational Lower Bounds:
 Computational Hardness of Proper Learning
 Cryptographic Hardness of Learning
 Additional Topics
 Stability Based Analysis
 Boosting: Weak Learning and the Margin Interpretation of Boosting.
Requirements and Grading:
 Required Reading: Some material will be covered through assigned reading, rather than in class.
 Problem Sets: about 5-6 problem sets, some of which introduce additional material beyond what is covered in class. Students are required to turn in the problem sets, but they will not be graded.
 Scribing: each student is required to scribe (prepare lecture notes)
for two lectures. These notes will be graded.
 Challenge Problems: These are harder problems that might be a challenge to solve and will be included in the problem sets. Successfully completing at least 1-2 of these is required for receiving an "A" in the class. Students will also be asked to present and discuss (with the instructor) the solutions to 1-2 of these.
 Research Problems: I don't know how to solve these (though I might
have an idea). Solving one of these will
likely earn an "A+" in the class and might
lead to a publication.
 Final Exam
In order to receive a "B" a student is expected to complete
the exercises, do a serious job at scribing a lecture and
display reasonable understanding on the exam.
In order to receive an "A" a student is expected to complete the
exercises, do an excellent job scribing a lecture,
successfully answer at least 1-2 Challenge Problems, and display a good understanding on the exam.
Recommended Texts
We will not be following a specific text, but some of the material is covered in the readings listed with each lecture below (in particular Kearns and Vazirani, and Bousquet et al.).
Lectures and Required Reading:
 Lecture 1: Wednesday October 3rd

 Course Overview
 Basic Setup: Generalization Error, Learning Rule, Learning Algorithm
 Warmup: "Minimum Consistent Rectangle" learning rule.
Required Reading: Section 1 of: S. Boucheron, O. Bousquet, and G. Lugosi, Concentration Inequalities, Advanced Lectures in Machine Learning, Springer, pp. 208-240, 2004. Alternative reading: Lecture notes
Problem Set 1 (due October 10th)
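The "Minimum Consistent Rectangle" rule from Lecture 1 can be sketched in a few lines of Python (the function name and interface here are illustrative, not from the course notes): output the tightest axis-aligned rectangle containing all positive examples, predict +1 inside it and -1 outside.

```python
def min_consistent_rectangle(examples):
    """examples: list of ((x, y), label) pairs with label in {+1, -1}."""
    pos = [p for p, label in examples if label == +1]
    if not pos:                        # no positive examples: predict -1 everywhere
        return lambda p: -1
    xs, ys = zip(*pos)
    x_lo, x_hi = min(xs), max(xs)      # tightest bounds along each axis
    y_lo, y_hi = min(ys), max(ys)
    def predict(p):
        x, y = p
        return +1 if (x_lo <= x <= x_hi and y_lo <= y <= y_hi) else -1
    return predict
```

By construction the returned predictor is consistent with the training sample whenever the data is realizable by an axis-aligned rectangle.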
 Lecture 2: Friday October 5th

 Analysis of the "Minimum Consistent Rectangle" learning rule in
the realizable case. (Kearns and Vazirani Section 1.1)
 No Free Lunch Theorem.
 Empirical Risk Minimization
 Finite cardinality generalization bound and learning guarantee.
Bousquet et al, Introduction to statistical learning theory, Sections 1-3.
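The finite-cardinality guarantee from Lecture 2 yields, via Hoeffding's inequality and a union bound, a sufficient sample size of m >= ln(2|H|/delta) / (2 eps^2) for uniform convergence to within eps with probability at least 1 - delta. A minimal sketch of this computation (the function name is illustrative):

```python
import math

def finite_class_sample_size(num_hypotheses, eps, delta):
    """Sample size sufficient so that, with probability >= 1 - delta,
    every h in a class of the given cardinality satisfies
    |L(h) - L_hat(h)| <= eps (Hoeffding + union bound)."""
    return math.ceil(math.log(2 * num_hypotheses / delta) / (2 * eps ** 2))
```

Note the logarithmic dependence on |H|: doubling the class size adds only a constant number of samples.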
 Lecture 3: Wednesday October 10th

 Realizable, Agnostic and Optimistic Rates (details in HW)
 VC Learning Guarantee:
 Growth Function for Binary Classification
 Symmetrization (proof deferred)
 The VC-dimension and Shatter Lemma (proof in HW)
Bousquet et al, Introduction to statistical learning theory, Section 4
Problem Set 2 (due October 17th)
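Shattering, as used in the definition of the VC-dimension in Lecture 3, can be checked by brute force on small examples. The sketch below (illustrative code, not from the course) tests whether a finite list of hypotheses realizes all 2^k labelings of a point set, using threshold functions h_t(x) = 1[x >= t] on a small integer domain; thresholds shatter any single point but no pair, so their VC-dimension is 1.

```python
def shatters(hypotheses, points):
    """True iff every one of the 2^k labelings of `points` is realized
    by some hypothesis in the list."""
    realized = {tuple(h(x) for x in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# threshold classifiers on a small domain (default-argument trick binds t)
thresholds = [lambda x, t=t: int(x >= t) for t in range(-1, 6)]
```

For two points x1 < x2, the labeling (1, 0) is never realized by a threshold, which is exactly why the class fails to shatter pairs.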
 Lecture 4: Friday October 12th

 The VC-dimension as a tight characterization of PAC learnability
 Compression Schemes and Compression-Based Learning Guarantee
 Lecture 5: Monday October 15th

 Description Length Bounds, the Minimum Description Length and Structural Risk Minimization learning rules.
 PACBayes Bounds (proof in HW)
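One common form of the description-length bound from Lecture 5 states that, with probability at least 1 - delta, L(h) <= L_hat(h) + sqrt((|d(h)| ln 2 + ln(1/delta)) / (2m)), where |d(h)| is the description length of h in bits. The constants below follow the Hoeffding-based derivation and are an assumption about the exact version used in class:

```python
import math

def mdl_bound(empirical_loss, description_bits, m, delta):
    """Description-length generalization bound (Hoeffding-based form):
    holds simultaneously for all h with probability >= 1 - delta."""
    slack = math.sqrt((description_bits * math.log(2) + math.log(1 / delta))
                      / (2 * m))
    return empirical_loss + slack
```

The bound makes the Structural Risk Minimization trade-off explicit: shorter descriptions shrink the slack term, possibly at the cost of a larger empirical loss.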
 Lecture 6: Wednesday October 17th

 Universal Learning
 Introduction of the computational model; definitions of proper and improper efficient PAC learnability.
 Reducing the consistency problem to Proper PAC Learnability.
Kearns and Vazirani Chapter 1
Problem Set 3 (due October 24th)
 Lecture 7: Friday October 19th

 3-TERM-DNF is not efficiently properly PAC learnable, but is efficiently (improperly) PAC learnable.
 On the efficient PAC learnability of poly-time functions.
Kearns and Vazirani Chapter 1
 Lecture 8: Monday October 22nd

 Learning, Cryptography, and One-Way Functions
 Hardness of Discrete Cube Root as an assumption, and the Discrete Cube Root as a learning problem.
 Cryptographic hardness of PAC Learning poly-time functions and log-depth networks.
 For further reading on hardness of learning depth-two networks: Cryptographic Hardness for Learning Intersections of Halfspaces, Klivans and Sherstov, FOCS 2006.
 Efficient Agnostic Learning:
 Reducing AGREEMENT to agnostic proper learning, and the hardness of agnostic proper learning halfspaces. (proofs in HW)
 Cryptographic hardness of agnostic improper learning of halfspaces. Further reading: Learning Kernel-Based Halfspaces with the 0-1 Loss, Shai Shalev-Shwartz, Karthik Sridharan and Ohad Shamir, SIAM J. Computing 2011.
 Non-convexity of the 0/1 loss and introduction of convex surrogate loss functions.
Kearns and Vazirani Chapter 6
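The surrogate-loss idea closing Lecture 8 is easy to state in code (illustrative definitions, written as functions of the margin y*f(x)): the 0/1 loss is non-convex and discontinuous in the margin, while the hinge loss is a convex upper bound on it.

```python
def zero_one_loss(margin):
    """0/1 loss as a function of the margin y * f(x)."""
    return 1.0 if margin <= 0 else 0.0

def hinge_loss(margin):
    """Convex surrogate: upper-bounds the 0/1 loss at every margin."""
    return max(0.0, 1.0 - margin)
```

Because the hinge loss dominates the 0/1 loss pointwise, any guarantee on the hinge risk immediately implies one on the classification error.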
 Lecture 9: Wednesday October 24th

 Introduction to statistical learning (optimization) of real-valued objectives.
 Empirical Covering Numbers as generalization of the Growth Function
 Rademacher Complexity and Symmetrization
 McDiarmid's Inequality
 Massart's Finite Class Lemma (proof in HW)
Problem Set 4 (due October 31st)
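The empirical Rademacher complexity defined in Lecture 9, R_hat(F) = E_sigma[sup_{f in F} (1/m) sum_i sigma_i f(x_i)], can be estimated by Monte Carlo for a finite class, averaging the supremum over random sign vectors (a sketch with illustrative names; exact values require the full expectation over all 2^m sign patterns):

```python
import random

def empirical_rademacher(function_values, trials=2000, seed=0):
    """Monte Carlo estimate of the empirical Rademacher complexity.
    function_values: one vector (f(x_1), ..., f(x_m)) per function f."""
    rng = random.Random(seed)
    m = len(function_values[0])
    total = 0.0
    for _ in range(trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]   # random signs
        total += max(sum(s * v for s, v in zip(sigma, vals)) / m
                     for vals in function_values)
    return total / trials
```

A single constant function has complexity zero, while the two-function class {+1, -1} on m points has complexity E|sum_i sigma_i|/m, which shrinks like 1/sqrt(m).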
 Lecture 10: Friday October 26th

 Pollard's and Dudley's bounds on the Rademacher complexity in terms of covering numbers (proof of Dudley's refined bound in HW).
 Learning guarantee in terms of the VC-subgraph dimension:
 Pseudo-shattering and the VC-subgraph dimension.
 Bounding covering numbers in terms of the VC-subgraph dimension (proof deferred).
 The class of linear predictors.
 Scale Sensitive Learning
 The fat shattering dimension.
 Bounding covering numbers in terms of the fat shattering dimension: the fat shattering lemma (proof in HW).
 Lipschitz Composition Lemma.
 Conclusion: Learning guarantee in terms of the fat-shattering dimension.
 Lecture 11: Monday October 29th

 The class of l_2-bounded linear predictors
 Bounding the fat-shattering dimension
 Bounding the Rademacher complexity directly
 Fat shattering as necessary and sufficient for learnability.
 Margin-based learning guarantees: the margin loss, ramp loss and hinge loss.
 Lecture 12: Wednesday October 31st

 Aggregate/voting predictors
 Rademacher complexity of the convex hull.
 l_1-bounded linear predictors.
 Introduction to Online Learning
 Online learning rule, regret, mistake bound.
 Mistake bound for finite cardinality classes: HALVING
Problem Set 5&6 (due November 11th)
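The HALVING rule from Lecture 12 can be sketched directly (illustrative code): maintain the version space of hypotheses consistent with all revealed labels and predict its majority vote. When the class contains a perfect hypothesis, each mistake eliminates at least half of the version space, so the number of mistakes is at most log2 |H|.

```python
def halving_run(hypotheses, stream):
    """hypotheses: list of callables x -> {0, 1};
    stream: iterable of (x, true_label) pairs with the target in the class.
    Returns the number of mistakes made by the majority-vote predictor."""
    version_space = list(hypotheses)
    mistakes = 0
    for x, y in stream:
        votes = [h(x) for h in version_space]
        prediction = 1 if votes.count(1) >= votes.count(0) else 0
        if prediction != y:
            mistakes += 1
        # keep only hypotheses consistent with the revealed label
        version_space = [h for h in version_space if h(x) == y]
    return mistakes
```

With 8 threshold hypotheses the mistake bound is log2(8) = 3, regardless of how long or adversarial the (realizable) sequence is.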
 Lecture 13, Friday November 2nd

 Halfspaces are not online learnable
 The Online Perceptron and its mistake bound.
 Online Gradient Descent
 Relating regret and linear regret
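The online Perceptron rule from Lecture 13 admits a very short sketch (illustrative code, not the course's notation): predict with the sign of the current inner product and, on a mistake, add y_t * x_t to the weights. For a sequence separable with margin gamma by a unit-norm vector and with ||x_t|| <= R, the classic mistake bound is (R/gamma)^2.

```python
def perceptron_run(stream, dim):
    """stream: iterable of (x, y) with x a list of floats, y in {+1, -1}.
    Returns the final weight vector and the number of mistakes."""
    w = [0.0] * dim
    mistakes = 0
    for x, y in stream:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin <= 0:                       # mistake (or zero margin): update
            mistakes += 1
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, mistakes
```

Note the update is conservative: the weights change only on rounds where a mistake occurs, which is what makes the mistake-bound analysis go through.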
 Lecture 14: Monday November 5th

 Analysis of Online Gradient Descent
 Interpreting Perceptron as Online Gradient Descent
 Conservative vs aggressive updates.
 Nonseparable perceptron.
 Hypothesis classes and function classes; Linear and Lipschitz classes.
 Online vs. Statistical Learning; Online to Batch Conversion.
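Online Gradient Descent, analyzed in Lecture 14, is just w_{t+1} = w_t - eta * g_t with g_t a (sub)gradient of the round-t loss at w_t. The sketch below runs it on 1-D quadratic losses f_t(w) = (w - z_t)^2 (an illustrative choice) and measures regret against the best fixed point in hindsight:

```python
def ogd_1d(zs, eta):
    """Online Gradient Descent on losses f_t(w) = (w - z_t)^2, from w = 0."""
    w, iterates = 0.0, []
    for z in zs:
        iterates.append(w)            # play w_t, then observe f_t
        grad = 2.0 * (w - z)          # gradient of (w - z)^2 at w
        w = w - eta * grad
    return iterates

def regret(zs, iterates):
    """Online loss minus the loss of the best fixed w in hindsight."""
    best = sum(zs) / len(zs)          # minimizer of sum_t (w - z_t)^2
    online = sum((w - z) ** 2 for w, z in zip(iterates, zs))
    offline = sum((best - z) ** 2 for z in zs)
    return online - offline
```

On a constant sequence z_t = 1 with eta = 1/4 the iterates satisfy 1 - w_t = 0.5^t, so the total regret is the geometric sum 1 + 1/4 + 1/16 + ... = 4/3: bounded regret, as the analysis predicts for this easy sequence.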
 Lecture 15: Wednesday November 7th

 Online to Batch Conversions and the Stochastic Approximation approach.
 Stochastic Gradient Descent as computationally optimal for PAC Learning.
 Understanding Online Gradient Descent in terms of regularization.
 Beyond the l_2 norm: Strongly convex distance generating functions, Bregman divergence, strong convexity and dual norms.
 Lecture 16: Monday November 12th
 Background Reading: Sections 5.4.1-5.4.2 of Nemirovski's "Lectures on Modern Convex Optimization"
 Online Mirror Descent
 l_p norm bounded classes; using a squared l_p norm for an l_1 constrained problem.
 The entropy regularized, exponentiated gradient, and the multiplicative weights algorithm.
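The multiplicative weights / exponentiated gradient update from Lecture 16 (Online Mirror Descent with the entropy regularizer) can be sketched as follows for the experts setting (illustrative code): scale each expert's weight by exp(-eta * loss) and renormalize, so the iterate stays on the probability simplex.

```python
import math

def multiplicative_weights(loss_rounds, eta):
    """loss_rounds: list of per-expert loss vectors, one per round.
    Returns the final weight vector on the probability simplex."""
    n = len(loss_rounds[0])
    w = [1.0 / n] * n                          # uniform start
    for losses in loss_rounds:
        w = [wi * math.exp(-eta * li) for wi, li in zip(w, losses)]
        total = sum(w)
        w = [wi / total for wi in w]           # renormalize to the simplex
    return w
```

With losses in [0, 1] and eta ~ sqrt(ln(n)/T), this algorithm achieves regret O(sqrt(T ln n)) against the best expert; in the run below virtually all weight concentrates on the expert that never incurs loss.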
 Lecture 17: Wednesday November 14th
 Plan:
 Mirror Descent, Dual Averaging, Passive-Aggressive and Follow-The-Regularized-Leader
 Using a strongly convex regularizer to bound the Rademacher complexity.
 Stability based learning guarantees
 Bounding stability in terms of strong convexity
Problem Set 7 (due November 21st)
 Lecture 18: Monday November 19th
 Plan:
 Lecture 19: Wednesday November 21st
 Plan:
Problem Set 8 (due December 1st)
 Lecture 20: Monday November 26th
 Plan:
Assignments:
Problem Set 1 (due October 10th)
Problem Set 2 (due October 17th)
Problem Set 3 (due October 24th)
Problem Set 4 (due October 31st)
Problem Set 5&6 (due November 11th)
Problem Set 7 (due November 21st)
Problem Set 8 (due December 1st)
Last modified: Mon Nov 12 22:39:06 Central Standard Time 2012