Introduction to Machine Learning


1 Introduction to Machine Learning 236756
Prof. Nir Ailon Lecture 4: Computational Complexity of Learning & Surrogate Losses

2 Efficient PAC Learning
Until now we were mostly worried about sample complexity: how many examples do we need in order to Probably Approximately Correctly learn from a specific concept class? But this is a CS course, not a stats course, and we can't ignore the computational price! How much computational effort is needed for PAC learning?

3 Definition of Efficient PAC Learning
Of course we want "polynomial". But in what?
1st attempt: in the number of training examples m.
2nd attempt: in 1/δ, 1/ε. The number of necessary examples m is a function of the complexity of ℋ and of the parameters δ, ε (which we really care about); if m is large, we can throw away the excess.
Learning: procedure one-step-ERM(S = ((x_1, y_1), ..., (x_m, y_m))): return h_S.
Predicting: h_S(x) := (argmin_{h' ∈ ℋ} L_S(h'))(x).
Must output an efficient prediction rule. (A minimal sketch of one-step-ERM follows below.)
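As a concrete illustration (not taken from the slides), here is a minimal sketch of one-step-ERM, assuming a finite hypothesis class given explicitly as a list of Python callables; all names are illustrative.

```python
def empirical_loss(h, S):
    """Fraction of labeled examples (x, y) in S that h misclassifies."""
    return sum(1 for x, y in S if h(x) != y) / len(S)

def one_step_erm(S, H):
    """Return some h in H minimizing the empirical 0-1 loss on the sample S."""
    return min(H, key=lambda h: empirical_loss(h, S))
```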

…Polynomial In What? We will define the complexity of a learning algorithm 𝒜 with respect to δ, ε, and another parameter n which is related to the "size" of ℋ, 𝒳. The parameter n can be the embedding dimension. Example: if we decide to use n features to describe objects, how will that increase the runtime? The standard way to do this is to define a sequence of pairs (𝒳_n, ℋ_n), n = 1, 2, ..., and study the asymptotic complexity of learning (𝒳_n, ℋ_n) as n grows. Important to remember: 𝒜 does not get the distribution as part of its input.

5 Formal Definition: Efficient PAC Learning [Valiant 1984]
A sequence (𝒳_n, ℋ_n), n = 1, 2, ..., is efficiently PAC learnable if there exist an algorithm 𝒜(S, ε, δ) and a polynomial p(n, 1/ε, 1/δ) such that for all n, for every distribution 𝒟 over 𝒳_n × 𝒴 satisfying ∃h: L_𝒟(h) = 0 (realizability), and for all ε, δ: 𝒜 receives as input S ∼ 𝒟^m with m ≤ p(n, 1/ε, 1/δ), runs in time at most p(n, 1/ε, 1/δ), and outputs a predictor h: 𝒳_n → 𝒴 that can be evaluated in time p(n, 1/ε, 1/δ) and that, with probability ≥ 1 − δ (over the sample and/or the algorithm's randomization), satisfies L_𝒟(h) ≤ ε.

6 Efficient PAC Learning Using CONSISTENT
Reminder: CONSISTENT_ℋ(S = ((x_1, y_1), ..., (x_m, y_m))): if ∃h ∈ ℋ such that h(x_i) = y_i for all i, output such an h; otherwise, output "doesn't exist". If VCdim(ℋ) ≤ D, we can learn (in the realizable case) using CONSISTENT on O((D + log(1/δ))/ε) samples. Conclusion: if for a sequence (𝒳_n, ℋ_n), n = 1, 2, ..., we have (1) VCdim(ℋ_n) ≤ poly(n) and (2) CONSISTENT is poly-time computable (in the sample size), then the sequence is efficiently PAC learnable. (A brute-force sketch of CONSISTENT for a finite class follows below.)
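A minimal sketch of CONSISTENT for a finite class, assuming the class is given explicitly as a list of callables (illustrative names, not from the lecture); with |ℋ| hypotheses and m examples it does at most |ℋ|·m hypothesis evaluations, which matches the next slide's m·n·q(n) bound when |ℋ_n| = n.

```python
def consistent(S, H):
    """Return some h in H that fits every labeled example in S,
    or None if no such hypothesis exists ("doesn't exist")."""
    for h in H:                                  # |H| candidates
        if all(h(x) == y for x, y in S):         # m checks per candidate
            return h
    return None
```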

7 Example: If n encodes |ℋ_n|
Assume |ℋ_n| = n for all n, and that h(x) is computable in time q(n) for every (x, h) ∈ 𝒳_n × ℋ_n. Then CONSISTENT is computable in time m·n·q(n), so the problem is efficiently PAC learnable in O(q(n) × n × ε^{-1} log(n/δ)) time. More generally: if VCdim(ℋ_n) grows polynomially with n, and CONSISTENT is computable in polynomial time, then the problem is efficiently PAC learnable.

8 Exponential Size (or Infinite) Classes
Axis-aligned rectangles in n dimensions? Halfspaces in n dimensions? Boolean functions on n variables?

9 Axis-Aligned Rectangles in 𝑛 Dimensions
[Figure: example for n = 2.] VCdim(ℋ_n) = O(n). CONSISTENT is solvable in time O(nm) (a sketch follows below).
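A hedged sketch of one way CONSISTENT runs in O(nm) time here, using the standard tightest-box approach (helper names are illustrative): take the smallest axis-aligned box containing the positive points and verify that it excludes the negative ones.

```python
import numpy as np

def consistent_rectangle(X, y):
    """X: (m, n) array of points, y: (m,) array of +/-1 labels.
    Returns (lo, hi) describing the box {x : lo <= x <= hi}, or None."""
    pos = X[y == 1]
    if len(pos) == 0:
        # No positive examples: a degenerate (empty) rectangle is consistent.
        lo, hi = np.ones(X.shape[1]), np.zeros(X.shape[1])
    else:
        lo, hi = pos.min(axis=0), pos.max(axis=0)      # tightest box: O(nm)
    def h(x):
        return 1 if np.all((lo <= x) & (x <= hi)) else -1
    # The tightest box is the minimal candidate: if it misclassifies some
    # negative example, no axis-aligned rectangle is consistent with S.
    return (lo, hi) if all(h(x) == label for x, label in zip(X, y)) else None
```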

10 Halfspaces in 𝑛 Dimensions
[Figure: example for n = 2.] Reminder: h_w(x) = sign(⟨w, x⟩). VCdim(ℋ_n) = O(n). CONSISTENT is solvable in time poly(n, m), e.g. via linear programming (a sketch follows below).
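A hedged sketch of CONSISTENT for halfspaces as a feasibility linear program, using scipy: in the realizable case some w separates the sample strictly, and rescaling w lets us ask for y_i⟨w, x_i⟩ ≥ 1. This particular formulation is my illustration, not taken from the slides.

```python
import numpy as np
from scipy.optimize import linprog

def consistent_halfspace(X, y):
    """X: (m, n) array, y: (m,) array of +/-1 labels. Returns w or None."""
    m, n = X.shape
    # Feasibility LP: find w with y_i * <w, x_i> >= 1 for all i (objective 0).
    A_ub = -(y[:, None] * X)                 # -y_i x_i^T w <= -1
    b_ub = -np.ones(m)
    res = linprog(c=np.zeros(n), A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * n)
    return res.x if res.success else None    # None: no consistent halfspace found
```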

11 Boolean Conjunctions
𝒳_n = {0,1}^n, 𝒴 = {0,1}. ℋ_n = { h_{i_1..i_k, j_1..j_r} : i_1, .., i_k, j_1, .., j_r ∈ [n] }, where h_{i_1..i_k, j_1..j_r}(x_1, ..., x_n) = x_{i_1} ∧ ⋯ ∧ x_{i_k} ∧ ¬x_{j_1} ∧ ⋯ ∧ ¬x_{j_r} (a conjunction of literals). Note: |ℋ_n| = 3^n + 1. Can we solve CONSISTENT efficiently? Yes! Start with h = h_{1..n, 1..n} (all 2n literals), scan the samples in any order, ignore samples (x, 0), and for each sample (x, 1) fix h by removing the violated literals. What is the running time? (A sketch follows below.)
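A minimal sketch of the elimination algorithm just described (illustrative names); it makes one pass over the sample and touches each of the 2n literals at most m times, so it runs in O(mn) time.

```python
def learn_conjunction(S, n):
    """S: list of (x, y) with x a 0/1 sequence of length n and y in {0, 1}.
    Returns (pos, neg, h): indices of kept literals x_i / not(x_i), and the predictor."""
    pos = set(range(n))                # literals x_i still in the conjunction
    neg = set(range(n))                # literals not(x_i) still in the conjunction
    for x, y in S:
        if y == 0:
            continue                   # negative examples are ignored
        pos -= {i for i in pos if x[i] == 0}   # these literals would force h(x) = 0
        neg -= {i for i in neg if x[i] == 1}
    def h(x):
        return int(all(x[i] == 1 for i in pos) and all(x[i] == 0 for i in neg))
    return pos, neg, h
```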

12 3-term DNFs
𝒳_n = {0,1}^n, 𝒴 = {0,1}. ℋ_n = { h_{A_1, A_2, A_3} : A_1, A_2, A_3 conjunctions }, where h_{A_1, A_2, A_3}(x) = A_1(x) ∨ A_2(x) ∨ A_3(x). Note: |ℋ_n| = 3^(3n). Can we still solve CONSISTENT efficiently? Probably not! It's NP-hard.

13 Exponential Size (or Infinite) Classes
CONSISTENT is poly-time for: axis-aligned rectangles in n dimensions, halfspaces in n dimensions, conjunctions on n variables. What does this imply?
CONSISTENT is NP-hard for: 3-term DNFs, Python programs of size at most n, Python programs that run in poly(n) time, decision trees of size at most n, circuits of size at most n, even circuits of depth at most log n. What does THIS imply???

14 Implication Of “CONSISTENT is Hard”
CONSISTENT computes an h ∈ ℋ_n consistent with the sample, but efficient PAC learnability allows outputting any function as the prediction rule (as long as it is efficiently computable). So "CONSISTENT is hard" does NOT imply "learning is hard"; it only implies "proper learning is hard". Def: a proper learning algorithm is a learning algorithm that must output h ∈ ℋ_n. We already saw improper learning… When? The halving algorithm! Is there a problem that is not efficiently PAC learnable properly, but is still efficiently PAC learnable (improperly)?

15 Improper Learning of 3-term DNFs
Distribution rule: (a ∧ b) ∨ (c ∧ d) = (a ∨ c) ∧ (a ∨ d) ∧ (b ∨ c) ∧ (b ∨ d). More generally, A_1 ∨ A_2 ∨ A_3 = ∧_{u ∈ A_1, v ∈ A_2, w ∈ A_3} (u ∨ v ∨ w). Each clause (u ∨ v ∨ w) is built from three of the 2n literals, so there are at most (2n)^3 possible clauses; we can therefore view a 3-term DNF as a conjunction over (2n)^3 new variables, and there are at most 3^((2n)^3) + 1 possible such conjunctions. Can solve using CONSISTENT for conjunctions in dimension (2n)^3, so we can efficiently PAC learn 3-term DNFs: pay polynomially in sample complexity, gain exponentially in computational complexity. (A sketch of the variable expansion follows below.)
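A hedged sketch of the reduction (function names are mine): map each x ∈ {0,1}^n to the values of all (2n)^3 clause variables (u ∨ v ∨ w), then run the conjunction learner from the earlier sketch on the expanded sample.

```python
from itertools import product

def literals(x):
    """Values of the 2n literals x_1, .., x_n, not(x_1), .., not(x_n) on x."""
    return list(x) + [1 - xi for xi in x]

def expand(x):
    """x in {0,1}^n  ->  {0,1}^((2n)^3): one coordinate per clause (u or v or w)."""
    lits = literals(x)
    return tuple(int(u or v or w) for u, v, w in product(lits, repeat=3))

# A 3-term DNF over x equals some conjunction over expand(x), so running
# learn_conjunction on [(expand(x), y) for (x, y) in S] learns 3-term DNFs
# improperly, with time and sample complexity polynomial in n^3.
```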

16 Improper Learning
3-term DNF over {0,1}^n:
Sample complexity: O((n + log(1/δ)) / ε)
CONSISTENT computational complexity: NP-hard
Conjunctions over {0,1}^((2n)^3):
Sample complexity: O((n^3 + log(1/δ)) / ε)
CONSISTENT computational complexity: O(n^3 × (n^3 + log(1/δ)) / ε)

17 Improper Learning
[Diagram: 3-term DNFs over {0,1}^n (smaller concept class, hard CONSISTENT) contained in conjunctions over {0,1}^((2n)^3) (larger concept class, easy CONSISTENT).]

18 So How Do We Prove Hardness of Learning?
Hardness of learning is reminiscent of cryptography: the business of cryptography is preventing an adversary from uncovering a secret (key), even using partial observations. Much of cryptography is based on the existence of trapdoor one-way functions f: {0,1}^n → {0,1}^n such that (1) it is easy to compute f, (2) it is hard to compute f^{-1}, even only with high probability, and (3) it is easy to compute f^{-1} given a trapdoor s_f. Given a family ℱ_n of trapdoor one-way functions f, each with a distinct trapdoor s_f ∈ {0,1}^{poly(n)}, define ℋ_n = { h = f^{-1} : f ∈ ℱ_n }. Efficiently PAC learning ℋ_n would violate (2), because thanks to (1) we could simulate a sample (x_1 = f(y_1), y_1), ..., (x_m = f(y_m), y_m).

19 Breaking a Crypto System Given Efficient PAC Learnability
break-crypto-system(x ∈ {0,1}^n, f ∈ ℱ_n):
draw y_1, ..., y_m ∈ {0,1}^n uniformly at random    // m = poly(n)
compute x_1 = f(y_1), ..., x_m = f(y_m)             // easy, by property (1)
send (x_1, y_1), ..., (x_m, y_m) to the efficient PAC learner, obtain h    // h efficiently computable
return h(x)
By efficient PAC learnability, this succeeds for most x's with high probability.
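A hedged Python rendering of the same procedure; pac_learner stands in for a hypothetical efficient PAC learner for ℋ_n = {f^{-1}}, and nothing here refers to a real cryptographic library.

```python
import random

def break_crypto_system(x, f, n, m, pac_learner):
    """x: challenge input, f: the one-way function, m = poly(n)."""
    ys = [tuple(random.getrandbits(1) for _ in range(n)) for _ in range(m)]
    sample = [(f(y), y) for y in ys]   # easy: only forward evaluations of f
    h = pac_learner(sample)            # h is an efficiently computable predictor
    return h(x)                        # correct for most x, with high probability
```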

20 The Cubic Root Problem
Fix an integer N of n ≈ log N bits such that N = pq for two primes p, q, and let f(x) = x^3 mod N. Under a mild number-theoretic assumption (3 coprime to (p−1)(q−1)), f^{-1} is well defined. Given p, q, it is easy to compute f^{-1}: using a Python program of polynomial size running in poly(n) time, or even using a circuit of O(log n) depth. Without p, q, it is believed to be hard to compute f^{-1}. Therefore there is no efficient PAC learning of: short Python programs, efficient Python programs, logarithmic-depth circuits.
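A toy numeric illustration (my own, with tiny primes chosen only so that 3 is invertible modulo (p−1)(q−1); real instances use enormous primes): computing f is one modular exponentiation, and the trapdoor p, q makes f^{-1} easy too.

```python
p, q = 11, 17                    # toy primes; 3 is coprime to (p-1)*(q-1) = 160
N = p * q                        # public modulus
phi = (p - 1) * (q - 1)          # known only to whoever holds the trapdoor p, q

def f(x):
    """Easy direction: cube modulo N."""
    return pow(x, 3, N)

d = pow(3, -1, phi)              # 3^{-1} mod phi(N); Python 3.8+ modular inverse

def f_inverse(y):
    """Easy only given the trapdoor: y^d mod N recovers the cube root."""
    return pow(y, d, N)

assert f_inverse(f(5)) == 5      # the cube root of 5^3 mod 187 is recovered
```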

21 The Agnostic Case
If you thought the realizable case was hard, the agnostic case is even harder: ERM for halfspaces is NP-hard, and improperly learning halfspaces is crypto-hard. Everything "interesting" we know of is hard in the agnostic case (except maybe for intervals…). So what to do?

22 What is the Source of Hardness?
Is it the fact that the concept classes are large? No: we saw that with the squared loss (over the reals) we can efficiently optimize linear classifiers. Working with discrete-valued losses is what makes a hard problem hard.
[Plots: ℓ_sqr(−1, h(x)) and ℓ_0−1(−1, h(x)) as functions of h(x) ∈ ℝ.]
Why is the squared loss easier? It's continuous / It's convex / All of the above.

23 Convexity
Definition (convex set): a set C in a vector space is convex if for all u, v ∈ C and all α ∈ [0,1]: αu + (1 − α)v ∈ C.

24 Convexity
Definition (convex function): a function f: C → ℝ on a convex domain C is convex if for all u, v ∈ C and α ∈ [0,1]: f(αu + (1 − α)v) ≤ αf(u) + (1 − α)f(v).
[Figure: the chord from (u, f(u)) to (v, f(v)) lies above the graph of f on the segment between u and v.]

25 Minimizing Convex Functions
Any local minimum is also a global minimum (see the proof in the book!). So we can greedily search for increasingly better solutions in small neighborhoods; when we're stuck, we're done. (A small gradient-descent sketch follows below.)
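A minimal sketch of that greedy local search, assuming we pick gradient descent on a convex least-squares objective; the matrix, vector, and step size here are arbitrary illustrations.

```python
import numpy as np

def gradient_descent(grad, w0, step, iters=2000):
    """Repeatedly move to a nearby point with smaller objective value."""
    w = w0.copy()
    for _ in range(iters):
        w = w - step * grad(w)
    return w

# Convex objective f(w) = ||Aw - b||^2; any local minimum is global.
A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = np.array([1.0, 2.0, 3.0])
grad = lambda w: 2.0 * A.T @ (A @ w - b)      # gradient of the objective
w_star = gradient_descent(grad, np.zeros(2), step=0.005)
```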

26 Surrogate Loss Functions
Instead of minimizing ℓ_0−1, minimize something that is (i) convex (for any fixed y) and (ii) an upper bound of ℓ_0−1. For example:
ℓ_sqr(y, h(x)) = (y − h(x))^2
ℓ_logistic(y, h(x)) = log(1 + exp(−y·h(x)))
ℓ_hinge(y, h(x)) = max{0, 1 − y·h(x)}
Such a surrogate is easier to minimize, and if the surrogate is small, so is ℓ_0−1. The bigger the violation, the more you pay; there is no payment if the sign is correct by a margin.
[Plots: ℓ_hinge and ℓ_0−1 as functions of h(x) ∈ ℝ, for y = 1 and y = −1.]
Hinge loss for linear predictors is minimizable using an LP (in poly time).
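A small numeric check of these losses as a function of the margin z = y·h(x) (my own illustration; the logistic loss is written here with a base-2 log so that the upper-bound property holds exactly, while the slide's natural-log version differs only by a constant factor):

```python
import numpy as np

def zero_one(z):  return (z <= 0).astype(float)
def hinge(z):     return np.maximum(0.0, 1.0 - z)
def squared(z):   return (1.0 - z) ** 2                # equals (y - h(x))^2 for y in {-1, +1}
def logistic(z):  return np.log2(1.0 + np.exp(-z))     # base-2 log: >= 0-1 loss

z = np.linspace(-3, 3, 601)                            # range of margins y * h(x)
for surrogate in (hinge, squared, logistic):
    assert np.all(surrogate(z) >= zero_one(z))         # each surrogate upper bounds 0-1
```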

27 Summary: Learning Using Convex Surrogates
If your problem is hard to learn w.r.t. ℓ: 𝒴 × 𝒴 → ℝ:
Predict using h: 𝒳 → Ŷ, where Ŷ is a convex set.
Define a surrogate ℓ_sur: 𝒴 × Ŷ → ℝ such that ℓ_sur(y, ŷ) is a convex function of ŷ for every y, and ℓ ≤ ℓ_sur.
Can now naturally define
L_𝒟^sur(h) = E_{(x,y)∼𝒟}[ℓ_sur(y, h(x))], and L_𝒟(h) ≤ L_𝒟^sur(h)
L_S^sur(h) = (1/m) Σ_{i=1}^m ℓ_sur(y_i, h(x_i)), and L_S(h) ≤ L_S^sur(h)
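As one concrete instance (my own sketch, not the lecture's prescribed algorithm): minimize the empirical hinge surrogate of a linear predictor h(x) = ⟨w, x⟩ by subgradient descent; as noted on the previous slide, this particular surrogate can also be minimized exactly with an LP.

```python
import numpy as np

def hinge_erm(X, y, steps=2000, eta=0.01):
    """Minimize (1/m) * sum_i max(0, 1 - y_i * <w, x_i>) by subgradient descent.
    X: (m, n) array, y: (m,) array of +/-1 labels."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(steps):
        margins = y * (X @ w)
        active = margins < 1                                   # examples with positive hinge loss
        subgrad = -(y[active, None] * X[active]).sum(axis=0) / m
        w = w - eta * subgrad
    return w
```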

28 Sample Bounds for Probably Approximately Optimizing over Surrogates
Can we probably approximately minimize L_𝒟^sur by minimizing L_S^sur over some class of functions {h: 𝒳 → Ŷ}? If Ŷ = [−M, M] (bounded functions), then the Hoeffding bound can be used (as in the binary case) to say that, for any fixed h:
Pr_{S∼𝒟^m}[ |L_𝒟^sur(h) − L_S^sur(h)| > t ] ≤ 2 exp(−t^2 m / (2M^2))
A union bound won't work here (the class of functions is typically uncountable). VC-subgraph is a method for extending VC theory to real-valued functions.
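A quick Monte Carlo sanity check of that single-hypothesis bound (my own illustration, using a uniform distribution as a stand-in for the bounded surrogate losses of one fixed h):

```python
import numpy as np

rng = np.random.default_rng(0)
M, m, t, trials = 1.0, 500, 0.1, 5000
losses = rng.uniform(-M, M, size=(trials, m))     # stand-in for l_sur(y_i, h(x_i)), true mean 0
deviations = np.abs(losses.mean(axis=1))          # |L_S^sur(h) - L_D^sur(h)| in each trial
print((deviations > t).mean(), "<=", 2 * np.exp(-t**2 * m / (2 * M**2)))
```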

