Introduction to Statistical Learning Theory
J. Saketha Nath (IIT Bombay)
What is SLT?
"The goal of statistical learning theory is to study, in a statistical framework, the properties of learning algorithms." [Bousquet et al., 2004]
Supervised Learning Setting

Given:
- Training data: $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$
- Model: a set $\mathcal{G}$ of candidate predictors of the form $g: \mathcal{X} \mapsto \mathcal{Y}$
- Loss function: $l: \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}_+$

Goal: pick a candidate that does well on new data, i.e.,
$$g^* = \operatorname*{argmin}_{g \in \mathcal{G}} E\left[l(Y, g(X))\right],$$
that is, minimize the expected loss (a.k.a. risk $R_l[g]$ minimization).

Assumptions: there exists a distribution $F_{XY}$ that generates $D$ as well as the "new data" (the stochastic framework); the samples are i.i.d.; the loss is bounded and Lipschitz.

This goal is well-defined, but unrealizable as stated: $F_{XY}$ is unknown, so the risk cannot be computed. How well can we approximate it?
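To make the setting concrete, here is a minimal Python sketch. The joint distribution, the threshold-classifier model class, and all names are illustrative assumptions, not part of the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stochastic framework F_XY (assumed for illustration only):
# X ~ Uniform[-1, 1]; Y = sign(X), flipped with probability 0.1.
def sample(m):
    X = rng.uniform(-1.0, 1.0, size=m)
    flip = rng.random(m) < 0.1
    Y = np.where(flip, -np.sign(X), np.sign(X))
    return X, Y

# Model G: a handful of threshold classifiers g_t(x) = sign(x - t).
thresholds = np.linspace(-0.9, 0.9, 7)

def predict(t, X):
    return np.sign(X - t)

# 0-1 loss l(y, g(x)), bounded in [0, 1]; R_m[g] is its average on D.
def emp_risk(t, X, Y):
    return np.mean(predict(t, X) != Y)

X, Y = sample(m=200)
for t in thresholds:
    print(f"t={t:+.2f}  empirical risk R_m = {emp_risk(t, X, Y):.3f}")
# In this toy setup the risk minimizer g* is t = 0, with risk 0.1 (the
# label-noise rate); the empirical minimizer need not coincide with it.
```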
Skyline? Case of $|\mathcal{G}| = 1$ (estimate the error rate)

Law of large numbers: $\frac{1}{m}\sum_{i=1}^{m} l(Y_i, g(X_i)) \xrightarrow{p} E\left[l(Y, g(X))\right]$ as $m \to \infty$.

With high probability, the average loss (a.k.a. empirical risk) on a (large) training set is a good approximation of the risk. More precisely: for a given (but arbitrary) $F_{XY}$ and any $\delta > 0$, $\epsilon > 0$, there exists $m_0(\delta, \epsilon) \in \mathbb{N}$ such that
$$P\left(\left|\frac{1}{m}\sum_{i=1}^{m} l(Y_i, g(X_i)) - E\left[l(Y, g(X))\right]\right| > \epsilon\right) \le \delta \quad \text{for all } m \ge m_0(\delta, \epsilon).$$
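A quick Monte Carlo illustration of this statement, reusing the assumed toy $F_{XY}$ from the sketch above with a single fixed predictor: estimate $P(|R_m[g] - R[g]| > \epsilon)$ and watch it shrink as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

# Fixed predictor g (the |G| = 1 case): g(x) = sign(x - 0.2), under the
# toy F_XY above. Its true risk is R[g] = 0.18: an error occurs when
# exactly one of {label flip, x in (0, 0.2)} holds, 0.9*0.1 + 0.1*0.9.
R_TRUE = 0.18

def deviation_prob(m, eps, trials=1000):
    """Monte Carlo estimate of P(|R_m[g] - R[g]| > eps)."""
    X = rng.uniform(-1.0, 1.0, size=(trials, m))
    flip = rng.random((trials, m)) < 0.1
    Y = np.where(flip, -np.sign(X), np.sign(X))
    R_m = (np.sign(X - 0.2) != Y).mean(axis=1)   # empirical risk per trial
    return np.mean(np.abs(R_m - R_TRUE) > eps)

for m in [50, 200, 1000, 5000]:
    print(f"m={m:5d}  P(|R_m - R| > 0.05) ~ {deviation_prob(m, 0.05):.3f}")
# The deviation probability decays with m, matching the LLN statement.
```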
Some Definitions

A problem $(\mathcal{G}, l)$ is learnable iff there exists an algorithm that selects $g_m \in \mathcal{G}$ such that for any $F_{XY}$ and any $\delta > 0$, $\epsilon > 0$, there exists $m_0(\delta, \epsilon) \in \mathbb{N}$ with
$$P\left(R_l[g_m] - R_l[g^*] > \epsilon\right) \le \delta \quad \text{for all } m \ge m_0(\delta, \epsilon),$$
where $g^*$ is the (true) risk minimizer.

- Such an algorithm is called universally consistent.
- $m_0(\delta, \epsilon)$ may depend on $F_{XY}$.
- The (smallest) $m_0$ is called the sample complexity of the problem; analogously, one defines the sample complexity of an algorithm.
Some Algorithms

SAMPLE AVERAGE APPROXIMATION (a.k.a. Empirical Risk Minimization) [Vapnik, 92]: minimize the error on the training set, $R_m[g]$:
$$\min_{g \in \mathcal{G}} E\left[l(Y, g(X))\right] \approx \min_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} l(y_i, g(x_i))$$
(a consistent-estimator approximation).
- Bounds based on concentration of the mean.
- Indirect bounds (depend on the choice of optimization algorithm).

SAMPLE APPROXIMATION (a.k.a. Stochastic Gradient Descent) [Robbins & Monro, 51]: update $g^{(k)}$ using $l(y_k, g^{(k)}(x_k))$ and output $\bar{g} \equiv \frac{1}{m}\sum_{k=1}^{m} g^{(k)}$ (a weak-estimator approximation).
- Studied in the online learning literature.
- Direct bounds on the risk.

Focus of this talk: a summary of results.
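A toy sketch contrasting the two approximations on a one-dimensional linear model; the data, the squared loss, and the step sizes are illustrative assumptions. ERM solves the sample-average problem in closed form, while Robbins-Monro SGD takes one noisy gradient step per sample and reports the averaged iterate $\bar{g}$:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic linear-model data (assumed for illustration): y = 2x + noise.
m = 1000
x = rng.normal(size=m)
y = 2.0 * x + 0.5 * rng.normal(size=m)

# SAMPLE AVERAGE APPROXIMATION (ERM): minimize the empirical squared
# loss over g_w(x) = w*x; for this model the minimizer is closed-form.
w_erm = np.dot(x, y) / np.dot(x, x)

# SAMPLE APPROXIMATION (SGD, Robbins-Monro): one pass, one sample per
# step, step sizes eta_k = 0.1 / sqrt(k); output the averaged iterate.
w, w_sum = 0.0, 0.0
for k in range(1, m + 1):
    grad = 2.0 * (w * x[k - 1] - y[k - 1]) * x[k - 1]  # d/dw of (wx - y)^2
    w -= (0.1 / np.sqrt(k)) * grad
    w_sum += w
w_sgd = w_sum / m

print(f"ERM: w = {w_erm:.3f}   averaged SGD: w = {w_sgd:.3f}   (true w = 2)")
```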
ERM consistency: Sufficient conditions

$$0 \le R[g_m] - R[g^*] = \left(R[g_m] - R_m[g_m]\right) + \left(R_m[g_m] - R_m[g^*]\right) + \left(R_m[g^*] - R[g^*]\right)$$
$$\le \max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right) + \left(R_m[g^*] - R[g^*]\right),$$
where the middle term is $\le 0$ because $g_m$ minimizes $R_m$, and the last term $\xrightarrow{p} 0$ by the LLN.

Hence one-sided uniform convergence, i.e.,
$$\max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right) \xrightarrow{p} 0 \quad \text{as } m \to \infty,$$
is a sufficient condition for ERM consistency. Vapnik proved it is also necessary for "non-trivial" consistency (of ERM).
Story so far ...

- Two algorithms: Sample Average Approximation, Sample Approximation.
- One-sided uniform convergence of the mean is sufficient for SAA consistency.
- Pending: define Rademacher complexity; concentration around the mean for the max term; show $\{\mathcal{R}_m(\mathcal{G})\}_{m=1}^{\infty} \to 0 \Rightarrow$ a learnable problem.
Candidate for Problem Complexity

Introduce a "ghost sample": an independent copy $D' = \{(X_1', Y_1'), \ldots, (X_m', Y_m')\}$ drawn from the same $F_{XY}$, with empirical risk $R_m'$. Since $R[g] = E\left[R_m'[g]\right]$, Jensen's inequality gives
$$E\left[\max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right)\right] \le E\left[\max_{g \in \mathcal{G}}\left(R_m'[g] - R_m[g]\right)\right].$$

The right-hand side is the MAXIMUM DISCREPANCY. Two things remain:
- Ensure it (asymptotically) goes to zero.
- Show concentration around the mean for the maximal deviation.
Towards Rademacher Complexity

Swapping any pair $(X_i, Y_i) \leftrightarrow (X_i', Y_i')$ leaves the distribution of the maximum discrepancy unchanged, so each summand may be multiplied by an independent random sign:
$$E\left[\max_{g \in \mathcal{G}}\left(R_m'[g] - R_m[g]\right)\right] = E_{\sigma}\, E\left[\max_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i \left(l(Y_i', g(X_i')) - l(Y_i, g(X_i))\right)\right],$$
where the $\sigma_i$ are i.i.d. Rademacher random variables: $P(\sigma_i = 1) = P(\sigma_i = -1) = 0.5$.
Rademacher Complexity

Splitting the two summands and using the symmetry of the $\sigma_i$,
$$\le 2\, E\, E_{\sigma}\left[\max_{g \in \mathcal{G}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i\, l(Y_i, g(X_i))\right]$$
(an empirical term inside the inner expectation; a distribution-dependent term overall). Writing $f(Z_i) \equiv l(Y_i, g(X_i))$ with $Z_i = (X_i, Y_i)$ and $\mathcal{F} \equiv \{\, z = (x, y) \mapsto l(y, g(x)) \mid g \in \mathcal{G} \,\}$, this equals
$$2\, E\left[\hat{\mathcal{R}}_m(\mathcal{F})\right] = 2\, \mathcal{R}_m(\mathcal{F}), \qquad \hat{\mathcal{R}}_m(\mathcal{F}) \equiv E_{\sigma}\left[\max_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(Z_i)\right].$$

$\mathcal{R}_m(\mathcal{F})$ is the Rademacher complexity; $\hat{\mathcal{R}}_m(\mathcal{F})$ is the empirical Rademacher complexity.
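A Monte Carlo sketch of the empirical Rademacher complexity for a finite class; the sample and the threshold classifiers are the assumed toy objects from earlier, not part of the talk. Draw many $\sigma$ vectors, take the max correlation over $\mathcal{F}$, and average:

```python
import numpy as np

rng = np.random.default_rng(3)

def empirical_rademacher(F_values, n_sigma=2000):
    """Monte Carlo estimate of hat-R_m(F) for a finite class.

    F_values: array of shape (|F|, m); row j holds (f_j(z_1), ..., f_j(z_m))
    for one function f_j evaluated on the fixed sample.
    """
    n_f, m = F_values.shape
    sigma = rng.choice([-1.0, 1.0], size=(n_sigma, m))   # Rademacher draws
    # For each sigma: max over f of (1/m) sum_i sigma_i f(z_i); then average.
    corr = sigma @ F_values.T / m                        # (n_sigma, |F|)
    return corr.max(axis=1).mean()

# Example: 0-1 losses of 7 threshold classifiers on a sample of size m.
m = 100
X = rng.uniform(-1.0, 1.0, size=m)
flip = rng.random(m) < 0.1
Y = np.where(flip, -np.sign(X), np.sign(X))
thresholds = np.linspace(-0.9, 0.9, 7)
F_values = np.array([(np.sign(X - t) != Y).astype(float) for t in thresholds])

print(f"hat R_m(F) ~ {empirical_rademacher(F_values):.4f}")
# A richer class (more thresholds) gives a larger estimate, as argued next.
```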
Story so far ...

- Two algorithms: Sample Average Approximation, Sample Approximation.
- One-sided uniform convergence of the mean is sufficient for SAA consistency.
- Defined Rademacher complexity.
- Pending: concentration around the mean for the max term; show $\{\mathcal{R}_m(\mathcal{G})\}_{m=1}^{\infty} \to 0 \Rightarrow$ a learnable problem.
Closer look at $\mathcal{R}_m(\mathcal{F}) = E\left[\max_{f \in \mathcal{F}} \frac{1}{m}\sum_{i=1}^{m} \sigma_i f(Z_i)\right]$

- It is high if $\mathcal{F}$ correlates with random noise; in classification problems, this means $\mathcal{F}$ can assign arbitrary labels. The higher $\mathcal{R}_m(\mathcal{F})$, the lower the confidence in the prediction.
- $\mathcal{F}_1 \subseteq \mathcal{F}_2 \Rightarrow \mathcal{R}_m(\mathcal{F}_1) \le \mathcal{R}_m(\mathcal{F}_2)$; but the lower $\mathcal{R}_m(\mathcal{F})$, the higher the chance of missing the Bayes-optimal predictor.
- Choose a model with the right trade-off using domain knowledge.
Relation with classical measures

Growth function: $\Pi_m(\mathcal{F}) \equiv \max_{\{x_1, \ldots, x_m\} \subset \mathcal{X}} \left|\left\{\left(f(x_1), \ldots, f(x_m)\right) \mid f \in \mathcal{F}\right\}\right|$. In the classification case, $\Pi_m(\mathcal{F})$ is the maximum number of distinct classifiers induced on $m$ points.

Massart's lemma: $\mathcal{R}_m(\mathcal{F}) \le \sqrt{\dfrac{2 \log \Pi_m(\mathcal{F})}{m}}$.

VC-dimension: $VCdim(\mathcal{F}) \equiv \max\left\{m : \Pi_m(\mathcal{F}) = 2^m\right\}$.

Sauer's lemma (combined with Massart's, for $d = VCdim(\mathcal{F})$): $\mathcal{R}_m(\mathcal{F}) \le \sqrt{\dfrac{2d \log\frac{em}{d}}{m}}$.
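A few numbers from the Sauer/Massart bound; the choice $d = 10$ is illustrative:

```python
import numpy as np

def sauer_massart_bound(d, m):
    """R_m(F) <= sqrt(2 d log(e m / d) / m), valid for m >= d."""
    return np.sqrt(2 * d * np.log(np.e * m / d) / m)

for m in [100, 1000, 10000, 100000]:
    print(f"m={m:6d}  bound = {sauer_massart_bound(10, m):.4f}")
# With VC-dimension d = 10 the bound decays like sqrt(log(m) / m).
```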
Mean concentration: Observation

Define $h(X_1, Y_1, \ldots, X_m, Y_m) \equiv \max_{g \in \mathcal{G}}\left(R[g] - R_m[g]\right)$. Then $h$ is a function of i.i.d. random variables and satisfies the bounded-difference property: when one $(X_i, Y_i)$ changes, $\Delta h \le \frac{\Delta l}{m}$ (because the loss is bounded). Concentration around the mean then follows from McDiarmid's inequality.
McDiarmid's Inequality

Let $X_1, \ldots, X_m \in \mathcal{X}^m$ be i.i.d. random variables and $h: \mathcal{X}^m \mapsto \mathbb{R}$ satisfy
$$\left|h(x_1, \ldots, x_i, \ldots, x_m) - h(x_1, \ldots, x_i', \ldots, x_m)\right| \le c_i.$$
Then the following hold for any $\epsilon > 0$:
$$P\left(h - E[h] \ge \epsilon\right) \le e^{-2\epsilon^2 / \sum_{i=1}^{m} c_i^2}, \qquad P\left(h - E[h] \le -\epsilon\right) \le e^{-2\epsilon^2 / \sum_{i=1}^{m} c_i^2}.$$

For our $h$, with $c_i = \frac{\Delta l}{m}$, the tail bound is $e^{-2m\epsilon^2 / \Delta l^2} \to 0$, which (together with $\mathcal{R}_m \to 0$) implies learnability.
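A small sketch evaluating this tail for the uniform-deviation function; the values of $\epsilon$ and $\Delta l$ are illustrative:

```python
import numpy as np

def mcdiarmid_tail(eps, m, delta_l=1.0):
    """Upper bound on P(h - E[h] >= eps) when each c_i = delta_l / m:
    sum c_i^2 = delta_l^2 / m, so the bound is exp(-2 m eps^2 / delta_l^2)."""
    return np.exp(-2 * m * eps**2 / delta_l**2)

for m in [100, 1000, 10000]:
    print(f"m={m:5d}  P(h - E[h] >= 0.05) <= {mcdiarmid_tail(0.05, m):.2e}")
```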
Learning Bounds

Let $\delta \equiv e^{-2m\epsilon^2 / \Delta l^2}$, i.e., $\epsilon = \Delta l \sqrt{\frac{\log\frac{1}{\delta}}{2m}}$. Then $P\left(h - E[h] \ge \epsilon\right) \le \delta$ is the same as: with probability at least $1 - \delta$,
$$R[g] \le R_m[g] + 2\,\mathcal{R}_m(\mathcal{F}) + \Delta l \sqrt{\frac{\log\frac{1}{\delta}}{2m}} \qquad \forall\, g \in \mathcal{G}.$$
Every term here is computable except $\mathcal{R}_m(\mathcal{F})$. Applying McDiarmid again to $\hat{\mathcal{R}}_m(\mathcal{F})$ gives: with probability at least $1 - \delta$,
$$R[g] \le R_m[g] + 2\,\hat{\mathcal{R}}_m(\mathcal{F}) + 3\,\Delta l \sqrt{\frac{\log\frac{2}{\delta}}{2m}} \qquad \forall\, g \in \mathcal{G}.$$
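A sketch evaluating the final bound; the empirical risk and the estimated $\hat{\mathcal{R}}_m(\mathcal{F})$ below are hypothetical numbers chosen only for illustration:

```python
import numpy as np

def risk_bound(emp_risk, emp_rademacher, m, delta=0.05, delta_l=1.0):
    """R[g] <= R_m[g] + 2 hat-R_m(F) + 3 delta_l sqrt(log(2/delta) / (2m)),
    holding uniformly over g in G with probability at least 1 - delta."""
    slack = 3 * delta_l * np.sqrt(np.log(2 / delta) / (2 * m))
    return emp_risk + 2 * emp_rademacher + slack

# Hypothetical inputs: training error 0.10, estimated hat-R_m(F) = 0.03.
print(f"R[g] <= {risk_bound(0.10, 0.03, m=10000):.4f} w.p. >= 0.95")
```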
Story so far ...

- Two algorithms: Sample Average Approximation, Sample Approximation.
- One-sided uniform convergence of the mean is sufficient for SAA consistency.
- Defined Rademacher complexity; showed concentration around the mean for the max term.
- $\{\mathcal{R}_m(\mathcal{G})\}_{m=1}^{\infty} \to 0 \Rightarrow$ a learnable problem.
- Next: examples of usable learnable problems (shows the sufficiency condition is not loose).
Linear model with Lipschitz loss

Consider $\mathcal{G} \equiv \{\, g \mid \exists\, w \ni g(x) = \langle w, \phi(x)\rangle,\ \|w\| \le W \,\}$ with $\phi: \mathcal{X} \mapsto \mathcal{H}$ (a linear model). The contraction lemma gives $\mathcal{R}_m(\mathcal{F}) \le \mathcal{R}_m(\mathcal{G})$ (for a 1-Lipschitz loss; in general, up to the Lipschitz constant). Then
$$\hat{\mathcal{R}}_m(\mathcal{G}) = E_{\sigma}\left[\max_{\|w\| \le W} \frac{1}{m}\sum_{i=1}^{m} \sigma_i \langle w, \phi(x_i)\rangle\right] = E_{\sigma}\left[\max_{\|w\| \le W} \left\langle w, \frac{1}{m}\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\rangle\right] = \frac{W}{m}\, E_{\sigma}\left\|\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\|$$
$$\le \frac{W}{m}\sqrt{E_{\sigma}\left\|\sum_{i=1}^{m} \sigma_i \phi(x_i)\right\|^2} \;(\because \text{Jensen's inequality}) = \frac{W}{m}\sqrt{\sum_{i=1}^{m} \|\phi(x_i)\|^2} \le \frac{WR}{\sqrt{m}} \;(\text{if } \|\phi(x)\| \le R) \to 0.$$
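A Monte Carlo check of this derivation under assumed features (random unit vectors, so $R = 1$): estimate $\hat{\mathcal{R}}_m(\mathcal{G}) = \frac{W}{m} E_{\sigma}\|\sum_i \sigma_i \phi(x_i)\|$ directly and compare with $WR/\sqrt{m}$.

```python
import numpy as np

rng = np.random.default_rng(4)

# phi(x) in R^d with ||phi(x)|| <= R; here random points on the unit
# sphere, so R = 1 (an illustrative assumption).
m, d, W = 500, 20, 2.0
phi = rng.normal(size=(m, d))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)      # ||phi(x_i)|| = 1

# hat-R_m(G) = (W/m) E_sigma || sum_i sigma_i phi(x_i) ||, as derived above.
sigma = rng.choice([-1.0, 1.0], size=(5000, m))
rad = W / m * np.linalg.norm(sigma @ phi, axis=1).mean()

print(f"hat R_m(G) ~ {rad:.4f}   bound W*R/sqrt(m) = {W / np.sqrt(m):.4f}")
```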
Learnable Problems

See Shalev-Shwartz et al., 2009.
THANK YOU