Introduction to Machine Learning


Introduction to Machine Learning 236756 Prof. Nir Ailon Lecture 4: Computational Complexity of Learning & Surrogate Losses

Efficient PAC Learning
- Until now we were mostly worried about sample complexity: how many examples do we need in order to Probably Approximately Correctly learn a specific concept class?
- This is a CS course, not a stats course. We can't ignore the computational price!
- How much computational effort is needed for PAC learning?

Definition of Efficient PAC Learning
- Of course we want "polynomial". But polynomial in what?
- 1st attempt: in the number of training examples 𝑚.
- 2nd attempt: in 1/𝛿, 1/𝜀. The number of necessary examples 𝑚 is a function of the complexity of ℋ and of the parameters 𝛿, 𝜀 (which are what we really care about). If 𝑚 is larger than necessary, we can throw away the excess.
- Learning: procedure one-step-ERM(𝑆 = (𝑥_1, 𝑦_1), …, (𝑥_𝑚, 𝑦_𝑚)): return ℎ_𝑆.
- Predicting: ℎ_𝑆(𝑥) := (argmin_{ℎ′∈ℋ} 𝐿_𝑆(ℎ′))(𝑥). Must output an efficient prediction rule.

…Polynomial In What?
- We will define the complexity of a learning algorithm 𝒜 with respect to 𝛿, 𝜀 and another parameter 𝑛, which is related to the "size" of ℋ, 𝒳.
- The parameter 𝑛 can be the embedding dimension. Example: if we decide to use 𝑛 features to describe objects, how will that increase the runtime?
- The standard way to do this is to define a sequence of pairs (𝒳_𝑛, ℋ_𝑛)_{𝑛=1}^∞ and study the asymptotic complexity of learning (𝒳_𝑛, ℋ_𝑛) as 𝑛 grows.
- Important to remember: 𝒜 does not get the distribution as part of its input.

Formal Definition: Efficient PAC Learning [Valiant 1984]
A sequence (𝒳_𝑛, ℋ_𝑛)_{𝑛=1}^∞ is efficiently PAC learnable if there exist an algorithm 𝒜(𝑆, 𝜀, 𝛿) and a polynomial 𝑝(𝑛, 1/𝜀, 1/𝛿) such that for all 𝑛, for every distribution 𝒟 over 𝒳_𝑛 × 𝒴 with ∃ℎ ∈ ℋ_𝑛: 𝐿_𝒟(ℎ) = 0 (realizable case), and for all 𝜀, 𝛿:
- 𝒜 receives as input a sample 𝑆 ∼ 𝒟^𝑚 with 𝑚 ≤ 𝑝(𝑛, 1/𝜀, 1/𝛿),
- runs in time at most 𝑝(𝑛, 1/𝜀, 1/𝛿),
- outputs a predictor ℎ: 𝒳_𝑛 ↦ 𝒴 that can be evaluated in time 𝑝(𝑛, 1/𝜀, 1/𝛿),
- and with probability ≥ 1 − 𝛿 (over the sample and/or the algorithm's randomization): 𝐿_𝒟(ℎ) ≤ 𝜀.

Efficient PAC Learning Using CONSISTENT
- Reminder: CONSISTENT_ℋ(𝑆 = (𝑥_1, 𝑦_1), …, (𝑥_𝑚, 𝑦_𝑚)): if ∃ℎ ∈ ℋ s.t. ℎ(𝑥_𝑖) = 𝑦_𝑖 for all 𝑖, output such an ℎ; otherwise, output "doesn't exist".
- If VCdim(ℋ) ≤ 𝐷, we can learn (in the realizable case) using CONSISTENT on 𝑂((𝐷 + log(1/𝛿))/𝜀) samples.
- Conclusion: if for the sequence (𝒳_𝑛, ℋ_𝑛)_{𝑛=1}^∞ we have (1) VCdim(ℋ_𝑛) ≤ poly(𝑛) and (2) CONSISTENT is computable in time polynomial in the sample size, then the sequence is efficiently PAC learnable.

Example: If 𝑛 Encodes |ℋ_𝑛|
- Assume ∀𝑛: |ℋ_𝑛| = 𝑛, and for every (𝑥, ℎ) ∈ 𝒳_𝑛 × ℋ_𝑛, ℎ(𝑥) is computable in polynomial time 𝑞(𝑛).
- Then CONSISTENT is computable in time 𝑚·𝑛·𝑞(𝑛) (try every hypothesis on every sample point).
- The problem is efficiently PAC learnable in time 𝑂(𝑞(𝑛) · 𝑛 · 𝜀^{−1} · log(𝑛/𝛿)).
- More generally: if VCdim(ℋ_𝑛) grows polynomially with 𝑛 and CONSISTENT is computable in polynomial time, then the problem is efficiently PAC learnable.
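As a concrete illustration (my own sketch, not from the slides), here is brute-force CONSISTENT by enumeration for a small, explicitly listed class; the data format and the toy class of thresholds are assumptions made for the example:

def consistent(hypotheses, sample):
    """Brute-force CONSISTENT for a finite, explicitly listed class.
    hypotheses: list of callables h(x) -> label (|H| = n of them).
    sample: list of (x, y) pairs.
    Runs in time m * n * (cost of one evaluation), as on the slide."""
    for h in hypotheses:
        if all(h(x) == y for x, y in sample):
            return h
    return None  # "doesn't exist"

# Toy usage: thresholds on the real line as a finite class of size n = 10.
hypotheses = [lambda x, t=t: int(x >= t) for t in range(10)]
sample = [(1.0, 0), (7.5, 1), (3.2, 0)]
h = consistent(hypotheses, sample)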

Exponential Size (or Infinite) Classes
- Axis-aligned rectangles in 𝑛 dimensions?
- Halfspaces in 𝑛 dimensions?
- Boolean functions on 𝑛 variables?

Axis-Aligned Rectangles in 𝑛 Dimensions
(Figure: an axis-aligned rectangle for 𝑛 = 2.)
- VCdim(ℋ_𝑛) = 𝑂(𝑛)
- CONSISTENT solvable in time 𝑂(𝑛𝑚)
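A minimal sketch (mine, not from the slides) of the 𝑂(𝑛𝑚) CONSISTENT routine for axis-aligned rectangles: take the tightest bounding box of the positive points and check that it rejects every negative point.

import numpy as np

def consistent_rectangle(X, y):
    """Tightest-box CONSISTENT for axis-aligned rectangles, O(nm) time.
    X: (m, n) array of points; y: array of 0/1 labels.
    Returns (lo, hi) describing a consistent rectangle, or None if none exists."""
    pos = X[y == 1]
    if len(pos) == 0:
        # Degenerate empty rectangle labels everything 0 -- always consistent here.
        n = X.shape[1]
        return np.full(n, np.inf), np.full(n, -np.inf)
    lo, hi = pos.min(axis=0), pos.max(axis=0)       # smallest box containing all positives
    inside = np.all((X >= lo) & (X <= hi), axis=1)
    if np.any(inside & (y == 0)):                   # some negative point lands inside the box
        return None                                 # "doesn't exist"
    return lo, hi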

Halfspaces in 𝑛 Dimensions
(Figure: a separating line for 𝑛 = 2.)
- Reminder: ℎ_𝑤(𝑥) = sign(⟨𝑤, 𝑥⟩)
- VCdim(ℋ_𝑛) = 𝑂(𝑛)
- CONSISTENT solvable in time poly(𝑛, 𝑚) (see the LP sketch below)
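To make the poly(𝑛, 𝑚) claim concrete, here is a hedged sketch (mine, not from the lecture) of CONSISTENT for homogeneous halfspaces as an LP feasibility problem using scipy; by rescaling 𝑤, a strictly separating 𝑤 exists iff some 𝑤 satisfies 𝑦_𝑖⟨𝑤, 𝑥_𝑖⟩ ≥ 1 for all 𝑖.

import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """LP-feasibility CONSISTENT for h_w(x) = sign(<w, x>).
    X: (m, n) array; y: array of +/-1 labels."""
    m, n = X.shape
    # linprog solves min c^T w s.t. A_ub @ w <= b_ub; we only need feasibility.
    A_ub = -(y[:, None] * X)          # -y_i x_i . w <= -1  <=>  y_i <w, x_i> >= 1
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n, method="highs")
    return res.x if res.success else None   # None plays the role of "doesn't exist"

# Toy usage (hypothetical data): two linearly separable clouds in the plane.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w = consistent_halfspace(X, y)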

Boolean Conjunctions
- 𝒳_𝑛 = {0,1}^𝑛, 𝒴 = {0,1}. Note: |ℋ_𝑛| = 3^𝑛 + 1.
- ℋ_𝑛 = { ℎ_{𝑖_1..𝑖_𝑘, 𝑗_1..𝑗_𝑟} : 𝑖_1, …, 𝑖_𝑘, 𝑗_1, …, 𝑗_𝑟 ∈ [𝑛] }, where ℎ_{𝑖_1..𝑖_𝑘, 𝑗_1..𝑗_𝑟}(𝑥_1, …, 𝑥_𝑛) = 𝑥_{𝑖_1} ∧ ⋯ ∧ 𝑥_{𝑖_𝑘} ∧ ¬𝑥_{𝑗_1} ∧ ⋯ ∧ ¬𝑥_{𝑗_𝑟} (a conjunction of literals).
- Can we solve CONSISTENT efficiently? Yes!
  - Start with ℎ = ℎ_{1..𝑛, 1..𝑛} (the conjunction of all 2𝑛 literals).
  - Scan the samples in any order; ignore samples (𝑥, 0).
  - Given a sample (𝑥, 1): fix ℎ by removing the violating literals.
- What is the running time? (A sketch follows below.)
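A sketch of the elimination algorithm just described (my own rendering; names are illustrative). A conjunction is kept as a set of literals, each literal being (index, sign); positive examples delete the literals they falsify, so the total running time is 𝑂(𝑛𝑚).

def consistent_conjunction(sample, n):
    """Elimination algorithm for Boolean conjunctions over {0,1}^n.
    sample: list of (x, y) with x a tuple of n bits and y in {0,1}.
    Returns a consistent set of literals {(i, sign)}, sign=True meaning x_i and
    sign=False meaning not x_i, or None if no conjunction is consistent."""
    # Start with all 2n literals (h_{1..n,1..n} on the slide).
    h = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}

    def satisfies(literals, x):
        return all(x[i] == 1 if sign else x[i] == 0 for i, sign in literals)

    for x, y in sample:
        if y == 1:
            # Keep only the literals this positive example satisfies.
            h = {(i, sign) for i, sign in h if (x[i] == 1) == sign}

    # Final check against negative examples (the slide ignores them during the
    # scan; in the realizable case they never end up satisfied).
    for x, y in sample:
        if y == 0 and satisfies(h, x):
            return None  # "doesn't exist"
    return h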

3-Term DNFs
- 𝒳_𝑛 = {0,1}^𝑛, 𝒴 = {0,1}. Note: |ℋ_𝑛| = 3^{3𝑛}.
- ℋ_𝑛 = { ℎ_{𝐴_1,𝐴_2,𝐴_3} : 𝐴_1, 𝐴_2, 𝐴_3 conjunctions }, where ℎ_{𝐴_1,𝐴_2,𝐴_3}(𝑥) = 𝐴_1(𝑥) ∨ 𝐴_2(𝑥) ∨ 𝐴_3(𝑥).
- Can we still solve CONSISTENT efficiently? Probably not! It's NP-Hard.

Exponential Size (or Infinite) Classes
- CONSISTENT poly-time (what does this imply?):
  - Axis-aligned rectangles in 𝑛 dimensions
  - Halfspaces in 𝑛 dimensions
  - Conjunctions on 𝑛 variables
- CONSISTENT NP-Hard (what does THIS imply???):
  - 3-term DNFs
  - Python programs of size at most 𝑛
  - Python programs that run in poly(𝑛) time
  - Decision trees of size at most 𝑛
  - Circuits of size at most 𝑛
  - Even circuits of depth at most log 𝑛

Implication Of "CONSISTENT is Hard"
- CONSISTENT computes some ℎ ∈ ℋ_𝑛 consistent with the sample, but efficient PAC learnability allows outputting any function as the prediction rule (as long as it is efficiently computable).
- So "CONSISTENT is hard" does NOT imply "learning is hard"; it only implies "proper learning is hard".
- Def: a proper learning algorithm is a learning algorithm that must output ℎ ∈ ℋ_𝑛.
- We already saw improper learning… when? The halving algorithm!
- Is there a problem that is not efficiently PAC properly learnable, but is still efficiently PAC (improperly) learnable?

Improper Learning of 3-Term DNFs
- Distribution rule: (𝑎∧𝑏) ∨ (𝑐∧𝑑) = (𝑎∨𝑐) ∧ (𝑎∨𝑑) ∧ (𝑏∨𝑐) ∧ (𝑏∨𝑑).
- More generally: 𝐴_1 ∨ 𝐴_2 ∨ 𝐴_3 = ⋀_{𝑢∈𝐴_1, 𝑣∈𝐴_2, 𝑤∈𝐴_3} (𝑢 ∨ 𝑣 ∨ 𝑤), where each clause (𝑢 ∨ 𝑣 ∨ 𝑤) is one of at most (2𝑛)^3 possibilities (literals over 𝑛 variables).
- So a 3-term DNF can be viewed as a conjunction over (2𝑛)^3 new variables; there are at most 3^{(2𝑛)^3} + 1 possible such conjunctions.
- We can solve CONSISTENT for conjunctions in dimension (2𝑛)^3, hence efficiently PAC learn 3-term DNFs (improperly).
- We pay polynomially in sample complexity and gain exponentially in computational complexity. (See the reduction sketch below.)
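A hedged sketch of the reduction (mine, not from the slides): map 𝑥 ∈ {0,1}^𝑛 to a vector of (2𝑛)^3 bits, one per clause (𝑢 ∨ 𝑣 ∨ 𝑤) over the 2𝑛 literals, and then run the conjunction CONSISTENT routine sketched earlier on the expanded sample.

from itertools import product

def literal(x, j):
    """Literal j over x in {0,1}^n: j < n is x_j, j >= n is (not x_{j-n})."""
    n = len(x)
    return x[j] if j < n else 1 - x[j - n]

def expand(x):
    """Map x in {0,1}^n to the (2n)^3 'clause' features used for improper
    learning of 3-term DNFs: feature (u, v, w) = literal_u OR literal_v OR literal_w."""
    n = len(x)
    return tuple(max(literal(x, u), literal(x, v), literal(x, w))
                 for u, v, w in product(range(2 * n), repeat=3))

# A consistent conjunction over the expanded sample can then be found with
# consistent_conjunction([(expand(x), y) for x, y in sample], (2 * n) ** 3).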

Cost Comparison: Proper vs. Improper Learning of 3-Term DNFs
- 3-term DNF over {0,1}^𝑛 (proper): sample complexity 𝑂((𝑛 + log(1/𝛿))/𝜀); CONSISTENT computational complexity: NP-Hard.
- Conjunctions over {0,1}^{(2𝑛)^3} (improper): sample complexity 𝑂((𝑛^3 + log(1/𝛿))/𝜀); CONSISTENT computational complexity: 𝑂(𝑛^3 · (𝑛^3 + log(1/𝛿))/𝜀).

(Figure: the class of 3-term DNFs over {0,1}^𝑛 (smaller concept class, hard CONSISTENT) drawn inside the class of conjunctions over {0,1}^{(2𝑛)^3} (larger concept class, easy CONSISTENT); improper learning searches the larger class.)

So How Do We Prove Hardness of Learning?
- Hardness of learning is reminiscent of cryptography: the business of cryptography is preventing an adversary from uncovering a secret (key), even from partial observations.
- Much of cryptography is based on the existence of trapdoor one-way functions 𝑓: {0,1}^𝑛 ↦ {0,1}^𝑛 such that (1) 𝑓 is easy to compute, (2) 𝑓^{−1} is hard to compute, even only with high probability, [(3) 𝑓^{−1} is easy to compute given the trapdoor 𝑠_𝑓].
- Given a family ℱ_𝑛 of trapdoor one-way functions 𝑓, each with a distinct trapdoor 𝑠_𝑓 ∈ {0,1}^{poly(𝑛)}, define ℋ_𝑛 = { ℎ = 𝑓^{−1} : 𝑓 ∈ ℱ_𝑛 }.
- Efficiently PAC learning ℋ_𝑛 would violate (2), because thanks to (1) we could simulate a sample (𝑥_1 = 𝑓(𝑦_1), 𝑦_1), …, (𝑥_𝑚 = 𝑓(𝑦_𝑚), 𝑦_𝑚).

Breaking a Crypto System Given Efficient PAC Learnability
break-crypto-system(𝑥 ∈ {0,1}^𝑛, 𝑓 ∈ ℱ_𝑛):
    draw 𝑦_1, …, 𝑦_𝑚 ∈ {0,1}^𝑛 randomly            // 𝑚 = poly(𝑛)
    compute 𝑥_1 = 𝑓(𝑦_1), …, 𝑥_𝑚 = 𝑓(𝑦_𝑚)          // easy
    send (𝑥_1, 𝑦_1), …, (𝑥_𝑚, 𝑦_𝑚) to the efficient PAC learner, obtain ℎ   // ℎ efficiently computable
    return ℎ(𝑥)
By efficient PAC learnability, this succeeds for most 𝑥's with high probability.
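A minimal Python rendering of the same adversary (my own sketch; pac_learner is a hypothetical stand-in for an efficient PAC learner for ℋ_𝑛, f for a member of ℱ_𝑛, and drawing bits uniformly is an illustrative choice):

import random

def break_crypto_system(x, f, pac_learner, n, m):
    """Invert f on x using a (hypothetical) efficient PAC learner for {f^{-1}}.
    f: callable on n-bit tuples; pac_learner: callable sample -> hypothesis h."""
    ys = [tuple(random.randint(0, 1) for _ in range(n)) for _ in range(m)]  # m = poly(n)
    sample = [(f(y), y) for y in ys]     # easy direction: compute f forward
    h = pac_learner(sample)              # learner sees (f(y), y) pairs; labels are preimages
    return h(x)                          # w.h.p. equals f^{-1}(x) for most x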

The Cubic Root Problem
- Fix an integer 𝑁 of 𝑛 ≈ log 𝑁 bits s.t. 𝑁 = 𝑝𝑞 for two primes 𝑝, 𝑞. Let 𝑓(𝑥) = 𝑥^3 mod 𝑁.
- Under a mild number-theoretic assumption, 𝑓^{−1} is well defined.
- Given 𝑝, 𝑞, it is easy to compute 𝑓^{−1}: using a Python program of polynomial size running in poly(𝑛) time, or even using a circuit of 𝑂(log 𝑛) depth.
- Without 𝑝, 𝑞, it is believed to be hard to compute 𝑓^{−1}.
- ⇒ No efficient PAC learning of: short Python programs, efficient Python programs, logarithmic-depth circuits.
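To make the "easy given 𝑝, 𝑞" claim concrete, here is a sketch (mine; the requirement gcd(3, (𝑝−1)(𝑞−1)) = 1 is my reading of the "mild number-theoretic assumption") of inverting 𝑓(𝑥) = 𝑥^3 mod 𝑁, which is just RSA decryption with exponent 3:

def cube_mod(x, N):
    """The one-way direction: f(x) = x^3 mod N."""
    return pow(x, 3, N)

def cube_root_mod(y, p, q):
    """Invert f given the trapdoor (p, q), assuming gcd(3, (p-1)(q-1)) = 1."""
    N = p * q
    phi = (p - 1) * (q - 1)
    d = pow(3, -1, phi)      # modular inverse of 3 mod phi(N); needs Python 3.8+
    return pow(y, d, N)

# Toy usage with small (insecure) primes, for illustration only.
p, q = 11, 17                # gcd(3, 10 * 16) = 1, so cube roots mod 187 are unique
y = cube_mod(42, p * q)
assert cube_root_mod(y, p, q) == 42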

The Agnostic Case
- If you thought the realizable case was hard... the agnostic case is even harder.
- ERM for halfspaces is NP-Hard; improperly learning halfspaces is crypto-hard.
- Everything "interesting" we know of is hard in the agnostic case (except maybe for intervals…). So what to do?

What is the Source of Hardness?
- Is it the fact that the concept classes are large? No: we saw that with squared losses (over the reals) we can efficiently optimize linear classifiers.
- Working with discrete-valued losses is what makes a hard problem hard.
- (Figure: plots of ℓ_sqr(−1, ℎ(𝑥)) and ℓ_{0−1}(−1, ℎ(𝑥)) as functions of ℎ(𝑥) ∈ ℝ.)
- Why is the squared loss easier? (a) It's continuous. (b) It's convex. (c) All of the above.

Convexity
Definition (convex set): A set 𝐶 in a vector space is convex if ∀𝑢, 𝑣 ∈ 𝐶 and ∀𝛼 ∈ [0,1]: 𝛼𝑢 + (1−𝛼)𝑣 ∈ 𝐶.

Convexity
Definition (convex function): A function 𝑓: 𝐶 ↦ ℝ on a convex domain 𝐶 is convex if ∀𝑢, 𝑣 ∈ 𝐶, 𝛼 ∈ [0,1]: 𝑓(𝛼𝑢 + (1−𝛼)𝑣) ≤ 𝛼𝑓(𝑢) + (1−𝛼)𝑓(𝑣).
(Figure: the chord between (𝑢, 𝑓(𝑢)) and (𝑣, 𝑓(𝑣)) lies above the graph of 𝑓 at 𝛼𝑢 + (1−𝛼)𝑣.)

Minimizing Convex Functions
- Any local minimum is also a global minimum (see the proof in the book!).
- So we can greedily search for increasingly better solutions in small neighborhoods; when we're stuck, we're done. (A gradient-descent sketch follows.)
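A hedged sketch (mine, not the course's algorithm) of this local-search idea, as plain gradient descent on a convex differentiable function; the step size and stopping rule are illustrative choices:

import numpy as np

def gradient_descent(grad, w0, step=0.1, tol=1e-8, max_iters=10_000):
    """Greedy local improvement for a convex, differentiable objective.
    For convex f, stopping at a (near-)stationary point suffices: it is a global minimum."""
    w = np.asarray(w0, dtype=float)
    for _ in range(max_iters):
        g = grad(w)
        if np.linalg.norm(g) < tol:   # "when we're stuck, we're done"
            break
        w = w - step * g              # move to a better solution in a small neighborhood
    return w

# Toy usage: minimize the convex function f(w) = ||w - 3||^2, whose gradient is 2(w - 3).
w_star = gradient_descent(lambda w: 2 * (w - 3.0), w0=np.zeros(2))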

Surrogate Loss Functions
- Instead of minimizing ℓ_{0−1}, minimize something that is (i) convex (for any fixed 𝑦) and (ii) an upper bound of ℓ_{0−1}.
- For example:
  - ℓ_sqr(𝑦, ℎ(𝑥)) = (𝑦 − ℎ(𝑥))^2
  - ℓ_logistic(𝑦, ℎ(𝑥)) = log(1 + exp(−𝑦·ℎ(𝑥)))
  - ℓ_hinge(𝑦, ℎ(𝑥)) = max{0, 1 − 𝑦·ℎ(𝑥)}
- Easier to minimize; if the surrogate is small, so is ℓ_{0−1}; the bigger the violation, the more you pay; no payment if the sign is correct by a margin.
- (Figure: ℓ_{0−1} vs. ℓ_hinge as functions of ℎ(𝑥) ∈ ℝ, for 𝑦 = 1 and 𝑦 = −1.)
- Hinge loss for linear predictors is minimizable using LP (in poly time); an LP sketch appears after the summary slide below.
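A small illustration (mine) of the three surrogate losses; the check at the end verifies numerically, on a grid of predictions with labels in {−1, +1}, that the hinge and squared losses upper-bound the 0-1 loss (with the convention that 𝑦·ℎ(𝑥) ≤ 0 counts as an error):

import numpy as np

def zero_one(y, hx):  return (y * hx <= 0).astype(float)   # 1 if the sign is wrong
def sqr(y, hx):       return (y - hx) ** 2
def logistic(y, hx):  return np.log(1 + np.exp(-y * hx))
def hinge(y, hx):     return np.maximum(0.0, 1 - y * hx)

# Numerical check that hinge and squared losses upper-bound the 0-1 loss.
hx = np.linspace(-3, 3, 601)
for y in (-1.0, 1.0):
    assert np.all(hinge(y, hx) >= zero_one(y, hx))
    assert np.all(sqr(y, hx) >= zero_one(y, hx))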

Summary: Learning Using Convex Surrogates
- If your problem is hard to learn w.r.t. ℓ: 𝒴 × 𝒴 ↦ ℝ, predict using ℎ: 𝒳 ↦ 𝒴̂ and define a surrogate ℓ_sur: 𝒴 × 𝒴̂ ↦ ℝ, where:
  - 𝒴̂ is a convex set,
  - ℓ_sur(𝑦, 𝑦̂) is a convex function of 𝑦̂ for every 𝑦,
  - ℓ ≤ ℓ_sur.
- We can now naturally define
  𝐿^sur_𝒟(ℎ) = E_{(𝑥,𝑦)∼𝒟}[ℓ_sur(𝑦, ℎ(𝑥))] ≥ 𝐿_𝒟(ℎ),
  𝐿^sur_𝑆(ℎ) = (1/𝑚) Σ_{𝑖=1}^{𝑚} ℓ_sur(𝑦_𝑖, ℎ(𝑥_𝑖)) ≥ 𝐿_𝑆(ℎ).
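As an illustration of minimizing the empirical surrogate loss (and of the earlier remark that the hinge loss for linear predictors is LP-minimizable), here is a hedged sketch using slack variables ξ_𝑖 ≥ max{0, 1 − 𝑦_𝑖⟨𝑤, 𝑥_𝑖⟩}; the formulation and scipy usage are mine, not from the lecture.

import numpy as np
from scipy.optimize import linprog

def min_empirical_hinge(X, y):
    """Minimize L_S^sur(w) = (1/m) * sum_i max{0, 1 - y_i <w, x_i>} over linear
    predictors via an LP: variables (w, xi), minimize (1/m) sum_i xi subject to
    xi_i >= 0 and xi_i >= 1 - y_i <w, x_i>."""
    m, n = X.shape
    c = np.concatenate([np.zeros(n), np.ones(m) / m])        # objective: average slack
    # Constraint xi_i >= 1 - y_i <w, x_i>  <=>  -y_i x_i . w - xi_i <= -1
    A_ub = np.hstack([-(y[:, None] * X), -np.eye(m)])
    b_ub = -np.ones(m)
    bounds = [(None, None)] * n + [(0, None)] * m            # w free, xi >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    w = res.x[:n]
    return w, res.fun                                        # predictor and its empirical hinge risk

# Toy usage on the separable example from the halfspace sketch above.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, risk = min_empirical_hinge(X, y)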

Sample Bounds for Probably Approximately Optimizing over Surrogates
- Can we probably, approximately minimize 𝐿^sur_𝒟 by minimizing 𝐿^sur_𝑆 over some class of functions {ℎ: 𝒳 ↦ 𝒴̂}?
- If 𝒴̂ = [−𝑀, 𝑀] (bounded functions), then the Hoeffding bound can be used (as in the binary case) to say that, for any ℎ:
  Pr_{𝑆∼𝒟^𝑚}[ |𝐿^sur_𝒟(ℎ) − 𝐿^sur_𝑆(ℎ)| > 𝑡 ] ≤ 2e^{−𝑚𝑡²/(2𝑀²)}.
- A union bound won't work here (the class of functions is typically uncountable). VC-subgraph is a method for extending VC theory to real-valued functions.
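A quick hedged check (mine) of the stated Hoeffding bound by simulation: model the surrogate losses of a fixed ℎ as i.i.d. values bounded in [−𝑀, 𝑀], estimate the deviation probability, and compare it with 2·exp(−𝑚𝑡²/(2𝑀²)); all numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)
M, m, t, trials = 1.0, 200, 0.2, 20_000

# Surrogate losses of a fixed h, modelled here as i.i.d. Uniform(-M, M) draws.
losses = rng.uniform(-M, M, size=(trials, m))
true_mean = 0.0                                   # E of Uniform(-M, M)
deviation = np.abs(losses.mean(axis=1) - true_mean) > t

empirical = deviation.mean()                      # estimate of Pr[|L_D^sur - L_S^sur| > t]
hoeffding = 2 * np.exp(-m * t**2 / (2 * M**2))    # the bound from the slide
assert empirical <= hoeffding + 3 / np.sqrt(trials)   # crude slack for simulation noise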