Time & Place. MW 3:30-4:45, Siebel 1214.
Instructor. Matus Telgarsky (Office hours: Siebel 3212, W 4:45-6:00).
TA. Yucheng Chen (Office hours: Siebel 2107, M 5-6; extended to M 5-7 when homework is due).
Evaluation. Homework is 80% of your grade and the project is 20%. All handed-in work must be \(\LaTeX\)-compiled, on time, and submitted through Gradescope (self-enrollment code 9GEERY). Homework and project details appear below.
Academic integrity. Cheating in this class wastes everyone’s time; just take something else instead. Please see the full information below.
Discussion and announcements. Piazza; here’s the signup link.
Feedback. Please reach out with any comments/concerns!
Notes are typed when possible; otherwise, handwritten notes are scanned.
The schedule for future lectures is imprecise.
Date. | Topics. | Notes. | Coursework. |
8/28 | Administrivia; perceptron. | pdf. | hw0 out: tex, pdf. |
8/30 | Perceptron; decomposition of learning problems. | pdf. | |
Representation. | |||
9/6 | Failure of linear; box apx (linear over boxes, decision trees). | pdf. | hw0 due! |
9/11 | End of box apx: boosted decision trees, branching programs, 3-layer ReLU nets. Start of poly-fit: Stone-Weierstrass! | pdf. | |
9/13 | Polynomial fit via Stone-Weierstrass: sums of exponentials, RBF kernels, 2-layer networks. | pdf. | |
9/18 | RKHS interlude. | pdf. | |
9/20 | Succinct deep networks; multiplication with networks; networks and polynomials, smooth functions; Wasserstein distance, probability modeling, and GANs. | | |
Optimization. | |||
Convexity bootcamp; gradient descent in the smooth case; subgradient descent in the bounded+Lipschitz case; mirror descent and geometry; GLORIOUS MAUREY SPARSIFICATION; consistency of convex risk minimization; something non-convex; clustering. | |||
Generalization. | |||
Concentration bootcamp; finite classes and primitive covering; symmetrization and Rademacher complexity; Lipschitz losses, margin losses, finite classes; full covering, Dudley, and Sudakov; linear predictors via covers and Rademacher; VC dimension and VC for neural networks; Rademacher and covering bounds for neural networks. |||
Miscellaneous. | |||
Depends on how much time remains! I hope: heavy tails; online learning; reinforcement learning; spectral methods. | |||
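The perceptron from the opening lectures admits a very short implementation. The following is a minimal sketch of the classical mistake-driven update (the toy data and the `max_passes` cap are illustrative assumptions, not part of the course materials):

```python
import numpy as np

def perceptron(X, y, max_passes=100):
    """Classical perceptron: on a mistake, add y_i * x_i to the weights.
    X: (n, d) array of examples; y: (n,) labels in {-1, +1}.
    If the data are linearly separable and max_passes is large enough,
    the returned w satisfies sign(X @ w) == y."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(max_passes):
        mistakes = 0
        for i in range(n):
            if y[i] * (X[i] @ w) <= 0:  # mistake (or on the boundary)
                w += y[i] * X[i]
                mistakes += 1
        if mistakes == 0:  # a full pass with no mistakes: done
            break
    return w

# Toy separable data (an assumption for illustration).
X = np.array([[2.0, 1.0], [1.0, -1.0], [-1.0, 0.5], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = perceptron(X, y)
print(np.sign(X @ w))  # → [ 1.  1. -1. -1.], matching y
```

On separable data the number of mistakes is bounded by the usual margin argument, which is where the first lectures pick up.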
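Gradient descent in the smooth case (from the optimization block above) can likewise be sketched in a few lines; the standard step size for an \(L\)-smooth function is \(1/L\). The quadratic objective below is an assumption chosen only so the exact minimizer is available for comparison:

```python
import numpy as np

def gradient_descent(grad, x0, L, steps=500):
    """Gradient descent with the standard 1/L step size for an L-smooth objective."""
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        x -= grad(x) / L
    return x

# Illustration: f(x) = 0.5 x^T A x - b^T x, whose gradient is A x - b,
# and whose smoothness constant L is the largest eigenvalue of A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
L = np.linalg.eigvalsh(A).max()
x_star = np.linalg.solve(A, b)  # exact minimizer, for comparison
x_hat = gradient_descent(lambda x: A @ x - b, np.zeros(2), L)
print(np.allclose(x_hat, x_star))  # → True
```

Since this quadratic is also strongly convex, the iterates converge linearly, which is why 500 steps recover the minimizer to machine precision here; the course treats the general smooth case, where only a \(1/t\) rate on function values is available.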
Groups. You may work individually, or in pairs.
Content. The project must contain a theoretical component; whether you also include something else is up to you, and it will not fundamentally affect the grade. Please focus on quality: if you can make something cleaner and shorter without removing information, that is preferred.
Milestones. We’ll schedule meetings sometime in October so that I can sanity-check all projects.
Other learning theory-ish classes. These courses are all different, all have good material, and there are many more I neglected to include!
Lieven Vandenberghe @ UCLA. This is not a learning theory course; it’s part 3 of a long optimization course, covering material not in the standard Boyd-Vandenberghe book. The lecture links point to slides; the proofs there are incredibly clean, and indeed this is my favorite resource for many of these methods.
Textbooks and surveys. Again, there are many others, but here are a key few.
Boucheron, Bousquet, and Lugosi have two excellent surveys: one is a gentler start, whereas the other even gets to some advanced topics. I read the easier one immediately upon entering grad school (despite having taken a good learning theory course in undergrad) and really liked it.
Shai Shalev-Shwartz and Shai Ben-David have an excellent textbook from 2014: “Understanding Machine Learning: From Theory to Algorithms”.
Shai Shalev-Shwartz also has an excellent monograph covering online learning, which will sadly receive little or no coverage in this learning theory course.
Devroye, Györfi, and Lugosi have a wonderful book from 1996, “A Probabilistic Theory of Pattern Recognition”, which in addition to covering standard topics in statistical learning theory, has many nice results which rarely enter courses, for instance Stone’s original argument for the consistency of \(k\)-nn in finite-dimensional Euclidean space.
Martin Anthony and Peter Bartlett have a wonderful 1999 book, “Neural Network Learning: Theoretical Foundations”, which is again in the setting of statistical learning theory, and is the only listed reference with extensive VC dimension bounds for neural networks. (Pro-tip: the cover image is not a biological image, it is a proof…)