
1 Feasibility of learning: the issues; solution for infinite hypothesis sets; VC generalization bound (mostly lecture 5 on AMLbook.com)

2 The issues in learning feasibility: In the absence of noise in the training data, we can be confident that the answer to issue #2 is yes. Chapter 2 of the text (theory of generalization) is about issue #1.

3 Learning diagram with noisy input: Input noise sets a lower bound on E_in(g) that is independent of both the hypothesis set and the learning algorithm.

4 When noise is present, the issues are coupled: Since we don't know the lower bound on E_in(g) due to noise, reducing E_in(g) eventually becomes fitting the noise. When we fit noise in the data, E_out(g) diverges from E_in(g). Chapter 4 of the text discusses this "over-fitting" problem.

5 The Hoeffding inequality says that for any selected confidence level 1-δ, we can make E_out close to E_test by increasing N. By application of the "union bound" we showed that the same is true for E_train if the hypothesis set is finite.
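For reference, the two bounds this slide refers to, in the notation of the AMLbook text (a restatement, not taken from the slides themselves):

```latex
% Hoeffding bound for a single, fixed hypothesis h (e.g., evaluated on a test set):
P\bigl[\,|E_{\text{in}}(h) - E_{\text{out}}(h)| > \epsilon\,\bigr] \;\le\; 2\,e^{-2\epsilon^2 N}

% Union bound over a finite hypothesis set of size M, where g is the hypothesis picked by learning:
P\bigl[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\bigr] \;\le\; 2M\,e^{-2\epsilon^2 N}
```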

6 Infinite hypothesis sets operate on finite datasets: Let h be a member of an infinite hypothesis set used to train a dichotomizer. For every example in the training set (size N), h predicts +1 or -1. The collection of these predictions is a dichotomy (i.e., a division of the training set into 2 classes). Define |H(x_1, x_2, ..., x_N)| as the number of dichotomies that members of the hypothesis set H can produce on a training set of size N. Even though |H| is infinite, |H(x_1, x_2, ..., x_N)| <= 2^N, the total number of distinct ways to assign predictions of +1 or -1 to N examples. Hence there must be redundancy in the predictions made by members of H.
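A minimal sketch of this redundancy (not from the slides): sample many random 2D perceptrons, an infinite hypothesis set, and count how many distinct dichotomies they actually produce on N fixed points. The sampling scheme and the choice of N here are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5                                     # number of training points
X = rng.normal(size=(N, 2))               # fixed training inputs

dichotomies = set()
for _ in range(100_000):                  # many hypotheses drawn from an infinite set
    w = rng.normal(size=2)                # random weight vector
    b = rng.normal()                      # random bias
    labels = tuple(np.sign(X @ w + b).astype(int))
    dichotomies.add(labels)               # identical prediction patterns collapse to one entry

print(f"distinct dichotomies on {N} points: {len(dichotomies)} (<= 2^{N} = {2**N})")
```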

7 The dichotomizer example illustrates how poor the union bound can be: it adds probabilities as though the bad events were disjoint. Since the effect of different hypotheses on a training set can be nearly the same, even identical, the events overlap heavily and an alternative to the union bound is needed.

8 The growth function of hypothesis set H, applied to training sets of size N to learn dichotomizers, is the maximum number of dichotomies that H can generate on any set of N points in the attribute space: m_H(N) = max over x_1, x_2, ..., x_N of |H(x_1, x_2, ..., x_N)| <= 2^N. Replace the union bound by a bound on the growth function.

9 (Source: Lecture Notes for E. Alpaydın, Introduction to Machine Learning 2e, © 2010 The MIT Press.) The growth function m_H(N) is closely related to the VC dimension of H, defined by: "If there exists h ∈ H consistent with each of the 2^N ways that N points can be labeled with {+1, -1}, then VC(H) >= N." We say "H can shatter N points." In terms of m_H(N), d_vc(H) can be defined as the largest value of N for which m_H(N) = 2^N.

10 Break points: If no dataset of size k can be shattered by H, then k is a "break point" of H. If k is a break point of H, then m_H(k) < 2^k. Since d_vc(H) can be defined as the largest value of N for which m_H(N) = 2^N, k = d_vc(H) + 1 is a break point.

11 Review: H = 2D perceptron. m_H(3) = 8: no break point at 3. Even though 3 collinear points cannot be shattered, m_H(3) is defined as the maximum number of dichotomies over any set of 3 points. Every set of 4 points has at least 2 labelings that the 2D perceptron cannot produce, so k = 4 is a break point and d_vc(H) = 3. For the d-dimensional perceptron, d_vc(H) = d + 1 (see lecture 7, amlbook.com).

12 H = positive rays: h(x) = sign(x - a). The hypothesis set is infinite (unlimited choices for a). N points divide the line into at most N + 1 intervals (if some points coincide, the number is smaller). Each choice of the interval containing a gives a different dichotomy; therefore m_H(N) = N + 1.

13 H = positive rays: h(x) = sign(x - a). m_H(N) = N + 1. m_H(2) = 3 < 2^2 = 4, so k = 2 is a break point and d_vc(H) = 1.
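A quick brute-force check of m_H(N) = N + 1 for positive rays (a sketch, not from the slides; sweeping a threshold through every gap between sorted points is just one way to enumerate all distinct dichotomies):

```python
import numpy as np

def ray_dichotomies(x):
    """Count the distinct dichotomies h(x) = sign(x - a) can produce on points x."""
    x = np.sort(np.asarray(x, dtype=float))
    # Candidate thresholds: one below all points, one between each adjacent pair, one above all.
    candidates = np.concatenate(([x[0] - 1.0], (x[:-1] + x[1:]) / 2.0, [x[-1] + 1.0]))
    dichotomies = {tuple(np.where(x > a, 1, -1)) for a in candidates}
    return len(dichotomies)

for N in range(1, 7):
    pts = np.random.default_rng(N).uniform(size=N)   # distinct random points
    print(N, ray_dichotomies(pts), "expected", N + 1)
```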

14 H = positive intervals: h(x) = +1 for all x in the interval, otherwise h(x) = -1. The dichotomy depends on which of the N + 1 gaps between the sorted points the two endpoints fall in; if both endpoints fall in the same gap, all examples are labeled -1. So m_H(N) = 1 + (number of ways to choose 2 of the N + 1 gaps), i.e., m_H(N) = 0.5(N^2 + N) + 1.

15 H = positive intervals: h(x) = +1 for all x in the interval, otherwise h(x) = -1. m_H(N) = 0.5(N^2 + N) + 1. m_H(3) = 7 < 2^3 = 8, so k = 3 is a break point and d_vc(H) = 2.
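The same kind of brute-force check for positive intervals (again a sketch assuming distinct points; the two endpoints are swept over representatives of the N + 1 gaps):

```python
import numpy as np
from itertools import combinations_with_replacement
from math import comb

def interval_dichotomies(x):
    """Count distinct dichotomies of h = +1 on (l, r), -1 elsewhere, on points x."""
    x = np.sort(np.asarray(x, dtype=float))
    # One representative endpoint in each of the N + 1 gaps around the sorted points.
    gaps = np.concatenate(([x[0] - 1.0], (x[:-1] + x[1:]) / 2.0, [x[-1] + 1.0]))
    dichotomies = set()
    for l, r in combinations_with_replacement(gaps, 2):   # l <= r
        dichotomies.add(tuple(np.where((x > l) & (x < r), 1, -1)))
    return len(dichotomies)

for N in range(1, 7):
    pts = np.random.default_rng(N).uniform(size=N)
    print(N, interval_dichotomies(pts), "expected", comb(N + 1, 2) + 1)
```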

16 Bottom line on bounding the growth function (Theorem 2.4, text p. 49): If H has a break point k, then for any data set size N, m_H(N) <= sum_{i=0}^{k-1} C(N, i), which is a polynomial in N of degree k - 1 = d_vc(H). This can be shown by induction. Since k = d_vc(H) + 1 is a break point, a finite VC dimension guarantees a polynomial growth function.
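A small numerical check of this bound (a sketch; B(N, k) = sum_{i=0}^{k-1} C(N, i) is the polynomial bound named above, compared against the exponential 2^N):

```python
from math import comb

def sauer_bound(N, k):
    """Upper bound on m_H(N) when H has break point k: sum_{i=0}^{k-1} C(N, i)."""
    return sum(comb(N, i) for i in range(k))

k = 4                      # break point of the 2D perceptron (d_vc = 3)
for N in (5, 10, 20, 40):
    print(N, sauer_bound(N, k), "vs 2^N =", 2 ** N)
```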

17 Main result on learning feasibility: Existence of any break point ensures learning feasibility.

18 From Hoeffding inequality to VC inequality
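The VC inequality, in the form used in the AMLbook text (Theorem 2.5); the constants 4, 2N, and 1/8 are the price of replacing the finite union bound with the growth function:

```latex
P\bigl[\,|E_{\text{in}}(g) - E_{\text{out}}(g)| > \epsilon\,\bigr]
  \;\le\; 4\, m_H(2N)\, e^{-\frac{1}{8}\epsilon^2 N}
```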

19 VC generalization bound: If m_H(2N) is a polynomial of degree d_vc, then for large N, ε behaves like sqrt(c·ln(N)/N), where c is proportional to d_vc.
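Rearranged as a generalization bound, again following the text: with probability at least 1 - δ,

```latex
E_{\text{out}}(g) \;\le\; E_{\text{in}}(g)
  + \underbrace{\sqrt{\tfrac{8}{N}\,\ln\tfrac{4\, m_H(2N)}{\delta}}}_{\epsilon(\delta,\, N)}
```

When m_H(2N) grows polynomially in N, the logarithm grows like d_vc·ln(N), which gives the sqrt(c·ln(N)/N) behavior quoted on the slide.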

20 Review: For a specific hypothesis h, E_out(h) is analogous to the population mean; E_in(h) is analogous to a sample mean. Why is this analogy not sufficient to show the feasibility of learning?

21 Review: By analogy with estimation of population means, when E_in = E_test, |E_test - E_out| <= ε(δ, N) = sqrt(ln(2/δ)/(2N)). In this expression, what is the bound and what is the confidence level?

22 Review: When E_in = E_test, ε(δ, N) = sqrt(ln(2/δ)/(2N)) is a relationship between test-set size N, confidence level 1 - δ, and a bound on |E_test - E_out|. How do we use it to find the bound for a given confidence level and test-set size? How do we use it to find the test-set size needed for a given bound at a given confidence level?
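A minimal sketch of both uses of the test-set bound (the formula is the one on the slide; the numbers passed in below are illustrative, not the assignment's answers):

```python
from math import log, sqrt, ceil

def eps_bound(delta, N):
    """Bound on |E_test - E_out| at confidence 1 - delta for a test set of size N."""
    return sqrt(log(2.0 / delta) / (2.0 * N))

def test_set_size(delta, eps):
    """Smallest N guaranteeing |E_test - E_out| <= eps at confidence 1 - delta."""
    return ceil(log(2.0 / delta) / (2.0 * eps ** 2))

print(eps_bound(0.1, 1000))        # bound for a given confidence level and test-set size
print(test_set_size(0.02, 0.1))    # test-set size needed for a given bound and confidence
```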

23 Review: How does a finite VC dimension of hypothesis set H ensure the feasibility of learning with H? 1) Replace the union bound (valid for finite H-sets) with the growth function (valid for infinite H-sets).

24 Review: How does a finite VC dimension of hypothesis set H ensure the feasibility of learning with H? 1) Replace the union bound (valid for finite H-sets) with the growth function (valid for infinite H-sets). 2) If the H-set applied to a training set of size N has any break point, then the growth function is a polynomial in N.

25 Review: How does a finite VC dimension of hypothesis set H ensure the feasibility of learning with H? 1) Replace the union bound (valid for finite H-sets) with the growth function (valid for infinite H-sets). 2) If the H-set applied to a training set of size N has any break point, then the growth function is a polynomial in N. 3) k = d_vc(H) + 1 is a break point.

26 4) Since polynomials are dominated by exponentials, ε can be made arbitrarily small, regardless of the size of δ, by increasing the size of N.
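A quick numerical illustration of point 4 (illustrative constants only: the polynomial bound (2N)^{d_vc} + 1 with d_vc = 3 against the exponential factor of the VC inequality with ε = 0.1):

```python
from math import exp

eps, d_vc = 0.1, 3
for N in (1_000, 10_000, 100_000, 1_000_000):
    poly = (2 * N) ** d_vc + 1                  # polynomial bound on m_H(2N)
    bound = 4 * poly * exp(-eps ** 2 * N / 8)   # right-hand side of the VC inequality
    print(N, bound)                             # exponential decay eventually wins
```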

27 Feasible, yes. Practical, maybe not. Suppose we want ε < 0.1 with 90% confidence (i.e., δ = 0.1). Solving the VC generalization bound for N, with m_H(2N) replaced by its polynomial bound, gives an implicit relationship between N, ε, δ, and d_vc. Part 3 of assignment 3: Use a non-linear root-finding code to solve this implicit relationship for N with d_vc = 3 and 6. Compare with results in text p. 57.
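A hedged sketch of that root-finding step. It assumes the implicit relationship is N = (8/ε²)·ln(4·((2N)^{d_vc} + 1)/δ), obtained by solving the VC bound above for N with m_H(2N) bounded by (2N)^{d_vc} + 1; verify the exact form against p. 57 of the text before relying on it.

```python
from math import log
from scipy.optimize import brentq

def sample_complexity(eps, delta, d_vc):
    """Solve N = (8/eps^2) * ln(4*((2N)^d_vc + 1)/delta) for N by root finding."""
    f = lambda N: N - (8.0 / eps ** 2) * log(4.0 * ((2.0 * N) ** d_vc + 1.0) / delta)
    return brentq(f, 1.0, 1e9)   # bracket assumed wide enough for these parameters

for d_vc in (3, 6):
    print(d_vc, sample_complexity(0.1, 0.1, d_vc))
```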

28 Assignment 3, due 9-30-14. 1) Make a table of the bounds that can be placed on the relative error in estimates of a population mean μ = 1, based on a sample with N = 100, at confidence levels 90%, 95%, and 99%. 2) |E_test - E_out| < ε(δ, N) = sqrt(ln(2/δ)/(2N)). The sponsor requires 98% confidence that ε(δ, N) = 0.1. How large does N have to be to achieve this? 3) Use a non-linear root-finding code to solve this implicit relationship for N when ε = 0.1 and δ = 0.1, with d_vc = 3 and 6. Compare with results in text p. 57. Hint:

29 While the Hoeffding inequality for a test set (M = 1) is useful in practice, the VC inequality is too conservative for practical use; nevertheless, it is important for showing the feasibility of learning.

