Vapnik–Chervonenkis Dimension

Example
Consider a database consisting of the salary and age for a random sample of the adult population in the United States. We are interested in using the database to answer the question: what fraction of the adult population in the US has age between 35 and 45 and salary between $50,000 and $70,000?
The query region is a rectangle with axis-parallel edges (ages 35-45, salaries $50,000-$70,000). We can scan the database and compute the fraction of records that fall inside it. What we want to know is how large the database must be so that, with high probability, this empirical fraction is close to the true population fraction.

How large does our database need to be?
Theorem (Growth function sample bound, from last week): For any class $H$ and distribution $D$, if a training sample $S$ is drawn from $D$ of size
$$n \ge \frac{1}{\epsilon}\left[\ln|H| + \ln\frac{1}{\delta}\right],$$
then with probability $\ge 1-\delta$, every $h\in H$ with $\mathrm{err}_D(h)\ge\epsilon$ has $\mathrm{err}_S(h)>0$; equivalently, every $h\in H$ with $\mathrm{err}_S(h)=0$ has $\mathrm{err}_D(h)<\epsilon$.

There are $N$ adults in the US, so there are at most $N^4$ distinct rectangles (each side of a rectangle can be taken to pass through one of the data points), hence $|H|\le N^4$. The bound becomes
$$n \ge \frac{1}{\epsilon}\left[\ln N^4 + \ln\frac{1}{\delta}\right],$$
which means $n\to\infty$ as $N\to\infty$. Using the VC dimension, we will instead be able to achieve
$$n = O\!\left(\frac{1}{\epsilon}\left[\mathrm{VCdim}(H)\log\frac{d}{\epsilon} + \log\frac{1}{\delta}\right]\right),$$
where $d=\mathrm{VCdim}(H)$, independent of $N$.
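As a quick numeric illustration (a Python sketch, not from the slides; the hidden constant in the $O(\cdot)$ is taken to be 1 and natural logarithms are used throughout), compare the $\ln N^4$ bound with the VC-based bound for axis-parallel rectangles ($d=4$):

import math

eps, delta = 0.05, 0.05
d = 4  # VC dimension of axis-parallel rectangles

def naive_bound(N):
    # n >= (1/eps) * (ln N^4 + ln 1/delta); grows without bound as N grows
    return (1 / eps) * (4 * math.log(N) + math.log(1 / delta))

def vc_bound():
    # n = O((1/eps) * (d*log(d/eps) + log(1/delta))), constant taken as 1 here
    return (1 / eps) * (d * math.log(d / eps) + math.log(1 / delta))

for N in (10**6, 10**8, 10**10):
    print(f"N = {N:>12}: naive bound ~ {naive_bound(N):8.0f}")
print(f"VC-based bound (independent of N) ~ {vc_bound():8.0f}")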

Definitions
Given a set $S$ of examples and a concept class $H$, we say $S$ is shattered by $H$ if for every $A\subseteq S$ there exists some $h\in H$ that labels all examples in $A$ as positive and all examples in $S\setminus A$ as negative.
The VC-dimension of $H$ is the size of the largest set $S$ shattered by $H$, i.e., the maximal number $d$ such that there exists a set of $d$ points that is shattered by $H$.
Example: intervals of the real axis.
Board example: take a set of 4 points that can be split in every possible way by axis-parallel rectangles. Not every set of 4 points allows this, and no set of 5 points does, so the VC dimension of axis-parallel rectangles is 4 (verified in the sketch below).
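The board example can be checked by brute force. The sketch below is our own illustration (the helper name shattered_by_rectangles is not from the slides); it relies on the fact that the bounding box of a subset $A$ is the smallest axis-parallel rectangle containing $A$, so $A$ is realizable exactly when its bounding box excludes the remaining points:

from itertools import combinations

# Diamond configuration: shattered by axis-parallel rectangles (VC dim >= 4)
points = [(0, 1), (1, 0), (0, -1), (-1, 0)]

def shattered_by_rectangles(S):
    """Brute-force check: the bounding box of A is the minimal axis-parallel
    rectangle containing A, so A is realizable iff that box excludes S \\ A."""
    for k in range(len(S) + 1):
        for A in combinations(S, k):
            if not A:
                continue  # the empty labeling is realized by a degenerate rectangle
            xs = [p[0] for p in A]; ys = [p[1] for p in A]
            lo_x, hi_x, lo_y, hi_y = min(xs), max(xs), min(ys), max(ys)
            others = [p for p in S if p not in A]
            if any(lo_x <= x <= hi_x and lo_y <= y <= hi_y for (x, y) in others):
                return False
    return True

print(shattered_by_rectangles(points))             # True: this 4-point set is shattered
print(shattered_by_rectangles(points + [(0, 0)]))  # False: adding the center breaks shattering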

Growth Function / Shatter Function
Given a set $S$ of examples and a concept class $H$, define $H[S] = \{h\cap S : h\in H\}$.
For an integer $n$ and class $H$, define $H[n] = \max_{|S|=n} |H[S]|$.

Examples
Intervals of the real axis: $\mathrm{VCdim}=2$, $H[n]=O(n^2)$.
Rectangles with axis-parallel edges: $\mathrm{VCdim}=4$, $H[n]=O(n^4)$.
Union of 2 intervals of the real axis (splitting an ordered set of numbers into two separate intervals): $\mathrm{VCdim}=4$, $H[n]=O(n^4)$.
Convex polygons: $\mathrm{VCdim}=\infty$, $H[n]=2^n$.
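The intervals example is small enough to verify exhaustively. A sketch (our own illustration, assuming closed intervals; endpoints can be restricted to the sample points without losing any labelings):

from itertools import combinations

def interval_labelings(S):
    """All labelings of S achievable by closed intervals [a, b] on the line."""
    S = sorted(S)
    labelings = {tuple(False for _ in S)}  # empty interval
    for a, b in combinations(S, 2):
        labelings.add(tuple(a <= p <= b for p in S))
    for a in S:
        labelings.add(tuple(p == a for p in S))
    return labelings

for n in range(1, 7):
    L = interval_labelings(list(range(n)))
    shattered = len(L) == 2 ** n
    print(f"n={n}: |H[S]| = {len(L):2d} (n(n+1)/2 + 1 = {n*(n+1)//2 + 1}), shattered: {shattered}")
# The set is shattered for n <= 2 but not for n >= 3, so VCdim(intervals) = 2,
# and |H[S]| grows quadratically, matching H[n] = O(n^2).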

Half-spaces in d dimensions: $\mathrm{VCdim}=d+1$.
Proof that $\mathrm{VCdim}\ge d+1$: Take $S$ to be the $d$ unit coordinate vectors together with the origin. Given a subset $A\subseteq S$ (assume for now that $A$ contains the origin), let $w$ have 1's in the coordinates corresponding to the unit vectors not in $A$ and 0's elsewhere. Then for each $x\in A$, $w^T x\le 0$, and for each $x\notin A$, $w^T x>0$, so a half-space separates $A$ from $S\setminus A$. (The case where $A$ does not contain the origin is handled similarly, using a half-space with a suitable threshold.)
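The construction can be checked mechanically. The sketch below uses half-spaces with a threshold, $\{x : w\cdot x + b > 0\}$, a slight variant of the slide's construction (the bias term is our way of handling the origin), and verifies that all $2^{d+1}$ labelings of the $d$ unit vectors plus the origin are realized:

from itertools import product

d = 4  # works for any d >= 1
S = [tuple(1 if j == i else 0 for j in range(d)) for i in range(d)] + [tuple([0] * d)]

def halfspace_label(w, b, x):
    # half-space {x : w.x + b > 0}
    return sum(wi * xi for wi, xi in zip(w, x)) + b > 0

shattered = True
for labels in product([False, True], repeat=len(S)):
    # In the spirit of the slide: +1 on coordinates that should be positive,
    # -1 otherwise; the bias b handles the origin (an assumption we add here).
    w = [1 if labels[i] else -1 for i in range(d)]
    b = 0.5 if labels[d] else -0.5
    if any(halfspace_label(w, b, x) != want for x, want in zip(S, labels)):
        shattered = False
        break

print(f"{len(S)} points (unit vectors + origin) shattered by halfspaces in R^{d}: {shattered}")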

Half-spaces in d dimensions: $\mathrm{VCdim}=d+1$.
Proof that $\mathrm{VCdim}<d+2$:
Theorem (Radon): Any set $S\subseteq \mathbb{R}^d$ with $|S|\ge d+2$ can be partitioned into two disjoint subsets $A$ and $B$ such that $\mathrm{convex}(A)\cap\mathrm{convex}(B)\ne\emptyset$.
Consequently, no set of $d+2$ points can be shattered: a half-space containing all of $A$ and none of $B$ would separate $\mathrm{convex}(A)$ from $\mathrm{convex}(B)$, contradicting the fact that they intersect.
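Radon's theorem is constructive: any $d+2$ points in $\mathbb{R}^d$ have a nontrivial affine dependence, and splitting its coefficients by sign gives the partition. A numpy-based sketch (our own illustration) for four points in the plane:

import numpy as np

# Radon partition for d+2 points in R^d (here d = 2, four points in the plane).
X = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0], [1.0, 1.0]])

# Find a nontrivial affine dependence: sum(lam_i * x_i) = 0 and sum(lam_i) = 0.
A = np.vstack([X.T, np.ones(len(X))])   # (d+1) x (d+2) matrix
_, _, Vt = np.linalg.svd(A)
lam = Vt[-1]                            # null-space vector of A

P = [i for i in range(len(X)) if lam[i] > 1e-12]
N = [i for i in range(len(X)) if lam[i] < -1e-12]
# Normalizing the positive part gives a point that is a convex combination
# of X[P] and, by the dependence, also of X[N].
common = (lam[P] @ X[P]) / lam[P].sum()

print("partition:", P, N)
print("common point of both convex hulls:", common)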

Growth function sample bound
For any class $H$ and distribution $D$, if a training sample $S$ is drawn from $D$ of size
$$n \ge \frac{1}{\epsilon}\left[\ln|H| + \ln\frac{1}{\delta}\right],$$
then with probability $\ge 1-\delta$, every $h\in H$ with $\mathrm{err}_D(h)\ge\epsilon$ has $\mathrm{err}_S(h)>0$; equivalently, every $h\in H$ with $\mathrm{err}_S(h)=0$ has $\mathrm{err}_D(h)<\epsilon$.
Using the VC dimension (via the growth function), the same guarantee holds with
$$n \ge \frac{2}{\epsilon}\left[\log_2\big(2H[2n]\big) + \log_2\frac{1}{\delta}\right],$$
and later we will see that
$$n = O\!\left(\frac{1}{\epsilon}\left[d\log\frac{d}{\epsilon} + \log\frac{1}{\delta}\right]\right),$$
where $d=\mathrm{VCdim}(H)$.

Sauer's Lemma
If $\mathrm{VCdim}(H)=d$, then
$$H[n] \le \sum_{i=0}^{d}\binom{n}{i} \le \left(\frac{en}{d}\right)^{d}.$$
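A small numeric check (a sketch; the interval growth function $n(n+1)/2+1$ comes from the earlier example, with $d=2$): for intervals the bound $\sum_{i\le d}\binom{n}{i}$ happens to be attained exactly, and both sit below $(en/d)^d$:

import math

def sauer_bound(n, d):
    # Sum_{i=0}^{d} C(n, i)
    return sum(math.comb(n, i) for i in range(d + 1))

d = 2  # VC dimension of intervals on the real line
for n in (5, 10, 50):
    exact = n * (n + 1) // 2 + 1  # growth function of intervals, from the earlier example
    print(f"n={n:3d}: H[n]={exact:5d}  Sauer bound={sauer_bound(n, d):5d}  (en/d)^d={(math.e*n/d)**d:9.1f}")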

Proof. Instead of proving $H[n]\le\sum_{i=0}^{d}\binom{n}{i}$ directly, it suffices to prove the stronger claim: for any set $S$ with $|S|=n$,
$$|H[S]| \le |\{B\subseteq S : H \text{ shatters } B\}|.$$
(This suffices because every shattered subset has size at most $d=\mathrm{VCdim}(H)$, so the right-hand side is at most $\sum_{i=0}^{d}\binom{n}{i}$.)

Proof. Assume $\mathrm{VCdim}(H[S])=d$. The proof is by induction on $n$.
If $n=1$: the empty set is always shattered by $H$. Either $|H[S]|=1$, in which case only the empty set is shattered and both sides equal 1, or $|H[S]|=2$, in which case the single point is shattered as well and both sides equal 2. So the claim holds.
Assume the inequality holds for sets of size $n-1$ and prove it for sets of size $n$.

Proof. Let $S=\{s_1,\dots,s_n\}$ and identify each element of $H[S]$ with its label vector $(y_1,\dots,y_n)\in\{0,1\}^n$. Define:
$$Y_0=\{(y_2,\dots,y_n) : (0,y_2,\dots,y_n)\in H[S]\ \textbf{or}\ (1,y_2,\dots,y_n)\in H[S]\}$$
$$Y_1=\{(y_2,\dots,y_n) : (0,y_2,\dots,y_n)\in H[S]\ \textbf{and}\ (1,y_2,\dots,y_n)\in H[S]\}$$
Then $|Y_0|+|Y_1|=|H[S]|$: each label vector in $H[S]$ is counted once in $Y_0$, and a second time in $Y_1$ exactly when both completions of its tail appear in $H[S]$.

Proof. Define $S'=\{s_2,\dots,s_n\}$. Notice that $Y_0=H[S']$. Using the induction hypothesis:
$$|Y_0| = |H[S']| \le |\{B\subseteq S' : H \text{ shatters } B\}| = |\{B\subseteq S : s_1\notin B \text{ and } H \text{ shatters } B\}|.$$
Define $H'\subseteq H$ by
$$H' = \{h\in H : \exists\, h'\in H \text{ s.t. } (1-h'(s_1), h'(s_2),\dots,h'(s_n)) = (h(s_1), h(s_2),\dots,h(s_n))\},$$
i.e., $H'$ contains the pairs of hypotheses that agree on $S'=(s_2,\dots,s_n)$ and differ on $s_1$. It can be seen that $H'$ shatters $B\subseteq S'$ if and only if $H'$ shatters the set $B\cup\{s_1\}$.

Proof. Notice that $Y_1=H'[S']$. Using the induction hypothesis:
$$|Y_1| = |H'[S']| \le |\{B\subseteq S' : H' \text{ shatters } B\}| = |\{B\subseteq S' : H' \text{ shatters } B\cup\{s_1\}\}|$$
$$= |\{B\subseteq S : s_1\in B \text{ and } H' \text{ shatters } B\}| \le |\{B\subseteq S : s_1\in B \text{ and } H \text{ shatters } B\}|.$$
Finally:
$$|H[S]| = |Y_0|+|Y_1| \le |\{B\subseteq S : s_1\notin B \text{ and } H \text{ shatters } B\}| + |\{B\subseteq S : s_1\in B \text{ and } H \text{ shatters } B\}| = |\{B\subseteq S : H \text{ shatters } B\}|.$$

Lemma. Let $H$ be a concept class over some domain $\mathcal{X}$, and let $S$ and $S'$ be sets of $n$ elements drawn from some distribution $D$ on $\mathcal{X}$, where $n\ge\frac{8}{\epsilon}$. Define:
$A$ – the event that there exists $h\in H$ with $\mathrm{err}_S(h)=0$ but $\mathrm{err}_D(h)\ge\epsilon$;
$B$ – the event that there exists $h\in H$ with $\mathrm{err}_S(h)=0$ but $\mathrm{err}_{S'}(h)\ge\frac{\epsilon}{2}$.
Then $P(B)\ge\frac{1}{2}P(A)$.

Proof. Clearly $P(B)\ge P(A\cap B)=P(A)\cdot P(B\mid A)$, so it suffices to show $P(B\mid A)\ge\frac{1}{2}$.
Condition on $A$ and fix some $h\in H$ with $\mathrm{err}_S(h)=0$ and $\mathrm{err}_D(h)\ge\epsilon$. Now draw the set $S'$: then $E[\mathrm{err}_{S'}(h)]=\mathrm{err}_D(h)\ge\epsilon$, and since $n\ge\frac{8}{\epsilon}$, a Chernoff (or Chebyshev) bound shows that $\mathrm{err}_{S'}(h)\ge\frac{\epsilon}{2}$ with probability at least $\frac{1}{2}$. Hence $P(B\mid A)\ge\frac{1}{2}$, and the lemma follows.
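The concentration step can be sanity-checked by simulation. The sketch below (our own illustration) takes the worst case $\mathrm{err}_D(h)=\epsilon$ and models each of the $n$ points of $S'$ as an independent mistake with probability $\epsilon$:

import random

def prob_half_error(eps, n, trials=100_000, seed=0):
    """Monte Carlo estimate of P(err_{S'}(h) >= eps/2) when err_D(h) = eps:
    each of the n fresh points is a mistake independently with probability eps."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        mistakes = sum(rng.random() < eps for _ in range(n))
        hits += mistakes >= eps * n / 2
    return hits / trials

eps = 0.1
n = int(8 / eps)                 # the n >= 8/eps condition from the lemma
print(prob_half_error(eps, n))   # comfortably above 1/2, as the proof needs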

Growth function sample bound
For any class $H$ and distribution $D$, if a training sample $S$ is drawn from $D$ of size
$$n \ge \frac{2}{\epsilon}\left[\log_2\big(2H[2n]\big) + \log_2\frac{1}{\delta}\right],$$
then with probability $\ge 1-\delta$, every $h\in H$ with $\mathrm{err}_D(h)\ge\epsilon$ has $\mathrm{err}_S(h)>0$; equivalently, every $h\in H$ with $\mathrm{err}_S(h)=0$ has $\mathrm{err}_D(h)<\epsilon$.

Proof. Consider the sample $S$ of size $n$ drawn from distribution $D$.
Let $A$ denote the event that there exists $h\in H$ with $\mathrm{err}_D(h)\ge\epsilon$ but $\mathrm{err}_S(h)=0$. We will prove $P(A)\le\delta$.
Let $B$ denote the event that there exists $h\in H$ with $\mathrm{err}_{S'}(h)\ge\frac{\epsilon}{2}$ but $\mathrm{err}_S(h)=0$, where $S'$ is a second sample of size $n$.
By the previous lemma, it is enough to prove that $P(B)\le\frac{\delta}{2}$.

Proof. We will draw a set $S''$ of $2n$ points and partition it into two sets $S$ and $S'$: randomly group the points of $S''$ into pairs $(a_1,b_1),\dots,(a_n,b_n)$, and for each index $i$ flip a fair coin; if heads, put $a_i$ in $S$ and $b_i$ in $S'$, otherwise put $b_i$ in $S$ and $a_i$ in $S'$. The probability of $B$ over the $S,S'$ produced this way is identical to $P(B)$ when $S$ and $S'$ are drawn directly from $D$.

Proof. Fix some classifier $h\in H[S'']$ and consider the probability, over these $n$ fair coin flips, that $h$ makes zero mistakes on $S$ yet more than $\frac{\epsilon n}{2}$ mistakes on $S'$. There are three cases:
If for some index $i$, $h$ makes a mistake on both $a_i$ and $b_i$, the probability is 0 (one of the two mistakes must land in $S$).
If there are fewer than $\frac{\epsilon n}{2}$ indices $i$ such that $h$ makes a mistake on either $a_i$ or $b_i$, the probability is 0 ($S'$ cannot receive more than $\frac{\epsilon n}{2}$ mistakes).
Otherwise there are $r\ge\frac{\epsilon n}{2}$ indices $i$ such that $h$ makes a mistake on exactly one of $a_i$ or $b_i$, and the chance that all of those mistakes land in $S'$ is $\frac{1}{2^r}\le\frac{1}{2^{\epsilon n/2}}$.
Union bounding over the at most $H[2n]$ classifiers in $H[S'']$ gives $P(B)\le H[2n]\cdot 2^{-\epsilon n/2}$, which is at most $\frac{\delta}{2}$ by the choice of $n$.
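The third case is where the $2^{-r}$ factor comes from, and it is easy to simulate (a small sketch, our own illustration): for each of the $r$ pairs with exactly one mistake, a fair coin decides whether that mistake lands in $S$ or in $S'$:

import random

def all_mistakes_in_S_prime(r, trials=200_000, seed=1):
    """For r pairs where h errs on exactly one of (a_i, b_i), each fair coin flip
    sends the erring point to S or S'; estimate P(all r mistakes land in S')."""
    rng = random.Random(seed)
    good = sum(all(rng.random() < 0.5 for _ in range(r)) for _ in range(trials))
    return good / trials

for r in (3, 6, 10):
    print(r, all_mistakes_in_S_prime(r), 0.5 ** r)  # empirical estimate vs exact 2^{-r}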

Growth function uniform convergence
For any class $H$ and distribution $D$, if a training sample $S$ is drawn from $D$ of size
$$n \ge \frac{8}{\epsilon^2}\left[\ln\big(2H[2n]\big) + \ln\frac{1}{\delta}\right],$$
then with probability $\ge 1-\delta$, every $h\in H$ satisfies $|\mathrm{err}_S(h)-\mathrm{err}_D(h)|\le\epsilon$.

When we combine Sauer's lemma with the theorems, the requirement
$$n \ge \frac{2}{\epsilon}\left[\log_2\big(2H[2n]\big)+\log_2\frac{1}{\delta}\right], \qquad H[2n]\le\left(\frac{2en}{d}\right)^{d},$$
becomes
$$n \ge \frac{2}{\epsilon}\left[1+d\log_2 n + d\log_2\frac{2e}{d} + \log_2\frac{1}{\delta}\right],$$
and this inequality is satisfied by
$$n = O\!\left(\frac{1}{\epsilon}\left[d\log\frac{d}{\epsilon}+\log\frac{1}{\delta}\right]\right),$$
where $d=\mathrm{VCdim}(H)$.
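To see concrete numbers, one can solve the implicit inequality by fixed-point iteration. A sketch (our own illustration; the constants follow the displayed bound, not the asymptotic $O(\cdot)$ form):

import math

def smallest_n(eps, delta, d):
    """Iterate n -> (2/eps)[1 + d*log2(2*e*n/d) + log2(1/delta)] until it stabilizes;
    the result is approximately the smallest n satisfying the displayed inequality."""
    n = 2.0
    for _ in range(200):  # the fixed point converges quickly; clamp keeps log2 valid
        n = max(2.0, (2 / eps) * (1 + d * math.log2(2 * math.e * n / d) + math.log2(1 / delta)))
    return math.ceil(n)

for d in (2, 4, 10):
    print(f"d={d:2d}, eps=0.1, delta=0.05  ->  n ~ {smallest_n(0.1, 0.05, d)}")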

VC-dimension of Combinations of Concepts
Define $\mathrm{comb}_f(h_1,\dots,h_k) = \{x\in X : f(h_1(x),\dots,h_k(x))=1\}$, where $h_i(x)$ denotes the indicator for whether or not $x\in h_i$.
Given a concept class $H$, a Boolean function $f$, and an integer $k$, define
$$\mathrm{COMB}_{f,k}(H) = \{\mathrm{comb}_f(h_1,\dots,h_k) : h_i\in H\}.$$

Corollary. If $\mathrm{VCdim}(H)=d$, then for any combination function $f$,
$$\mathrm{VCdim}\big(\mathrm{COMB}_{f,k}(H)\big) = O\big(kd\log(kd)\big).$$
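As a concrete instance (our own example, building on the "union of 2 intervals" class mentioned earlier): take $H$ to be intervals ($d=2$), $f=\mathrm{OR}$, and $k=2$, so $\mathrm{COMB}_{f,k}(H)$ is the class of unions of two intervals. A labeling of collinear points is realizable exactly when the positive points form at most two contiguous blocks, which gives VC dimension 4, comfortably within the $O(kd\log(kd))$ bound:

from itertools import combinations

def union2_shatters(n):
    """Can a union of two intervals (OR of k=2 concepts from the intervals class)
    shatter n collinear points? A labeling is realizable iff the positive points
    form at most two contiguous blocks in sorted order."""
    points = list(range(n))
    for k in range(n + 1):
        for A in combinations(points, k):
            blocks = sum(1 for i, p in enumerate(A) if i == 0 or p != A[i - 1] + 1)
            if blocks > 2:
                return False
    return True

print([union2_shatters(n) for n in range(1, 7)])
# [True, True, True, True, False, False] -> VC dimension 4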