1 Statistical Learning Theory
In the Name of God
Statistical Learning Theory
Bounds on the Rate of Convergence of Learning Processes (Chapter 3)
Supervisor: Dr. Shiry
Presenter: L. Pour Mohammad Bagher
Author: Vladimir N. Vapnik

2 Introduction
In this chapter we consider upper bounds on the rate of uniform convergence. (Lower bounds are not as important for controlling learning processes as upper bounds are.)

3 Introduction
Bounds on the rate of convergence come in two forms:
- Distribution-dependent bounds (based on the annealed entropy function)
- Distribution-independent bounds (based on the growth function)
Both kinds of bounds are nonconstructive. Using the VC dimension of a set of functions (a scalar value that can be evaluated for any set of functions), they can be turned into constructive distribution-independent bounds.

4 THE BASIC INEQUALITIES
Let $Q(z,\alpha),\ \alpha\in\Lambda$, be a set of indicator functions, and let $N^\Lambda(z_1,\ldots,z_l)$ be the number of different separations of the sample $z_1,\ldots,z_l$ by functions of this set. Define:
$H^\Lambda(l) = E \ln N^\Lambda(z_1,\ldots,z_l)$, the corresponding VC entropy;
$H_{\mathrm{ann}}^\Lambda(l) = \ln E\, N^\Lambda(z_1,\ldots,z_l)$, the annealed entropy;
$G^\Lambda(l) = \ln \max_{z_1,\ldots,z_l} N^\Lambda(z_1,\ldots,z_l)$, the growth function.
Theorem (the basic inequalities in the theory of bounds):
Theorem 3.1:
$$P\left\{\sup_{\alpha\in\Lambda}\left|\int Q(z,\alpha)\,dF(z)-\frac{1}{l}\sum_{i=1}^{l}Q(z_i,\alpha)\right|>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\mathrm{ann}}^\Lambda(2l)}{l}-\varepsilon^2\right)l\right\}.$$
Theorem 3.2:
$$P\left\{\sup_{\alpha\in\Lambda}\frac{\int Q(z,\alpha)\,dF(z)-\frac{1}{l}\sum_{i=1}^{l}Q(z_i,\alpha)}{\sqrt{\int Q(z,\alpha)\,dF(z)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\mathrm{ann}}^\Lambda(2l)}{l}-\frac{\varepsilon^2}{4}\right)l\right\}.$$

5 THE BASIC INEQUALITIES
The bounds are nontrivial if
$$\lim_{l\to\infty}\frac{H_{\mathrm{ann}}^\Lambda(l)}{l}=0.$$
(In Chapter 2 we called this condition the second milestone of learning theory.) Under this condition the exponent $\left(H_{\mathrm{ann}}^\Lambda(2l)/l-\varepsilon^2\right)l$ tends to $-\infty$, so the right-hand sides of both bounds go to zero for any fixed $\varepsilon>0$.

6 THE BASIC INEQUALITIES
Theorem 3.1 estimates the rate of uniform convergence with respect to the norm of the deviation between probability and frequency. maximal difference occurs for the events with maximal variance. (Bernoulli case ) variance : therefore the maximum of the variance is achieved for the events with probahility: the largest deviations are associated with functions that possess large risk.
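
Spelled out, the maximization behind that claim is one line:
$$\frac{d}{dp}\,p(1-p)=1-2p=0\ \Rightarrow\ p=\tfrac{1}{2},\qquad \sigma^2_{\max}=\tfrac{1}{4},$$
so the worst-case deviations in Theorem 3.1 come from events with probability near $1/2$.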

7 THE BASIC INEQUALITIES
Theorem 3.2 considered relative uniform convergence. (we will obtain a bound on the risk where the confidence interval is determined by the rate of uniform convergence) the uniform relative convergence: upper bound on the risk obtained using Theorem 3.2 is much better than the upper bound on the risk obtained on the basis of Theorem 3.1.

8 THE BASIC INEQUALITIES
The bounds obtained in Theorems 3.1 and 3-2 are distribution-dependent To construct distribution independent bounds it is sufficient to note that for any distribution function F(z) the growth function is not less than the annealed entropy. for any distribution function F(z):

9 THE BASIC INEQUALITIES
These inequalities are nontrivial if (necessary and sufficient conditions for distribution-free uniform convergence) if condition 3.5 is violated, then there exist probability measures F(z) on Z for which uniform convergence does not take place.

10 Generalization for the set of real functions
Let $Q(z,\alpha),\ \alpha\in\Lambda$, be a set of real functions with $A\le Q(z,\alpha)\le B$. Let us construct the set of indicator functions
$$I(z,\alpha,\beta)=\theta\{Q(z,\alpha)-\beta\},\qquad\alpha\in\Lambda,\ \beta\in(A,B),$$
where $\theta(u)$ is the step function. For a set of indicator functions $Q(z,\alpha),\ \alpha\in\Lambda$, the constructed set of indicators coincides with this set of functions.
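
A minimal sketch of this construction in Python (the names `make_indicator` and the example loss `Q` are illustrative, not from the text):

```python
def make_indicator(Q, beta):
    """Build the indicator I(z) = theta(Q(z) - beta) from a real-valued loss Q.

    theta is the step function: theta(u) = 1 if u >= 0, else 0.
    """
    return lambda z: 1 if Q(z) - beta >= 0 else 0

# Example: a bounded loss Q(z) = z**2 on z in [-1, 1], so A = 0 and B = 1.
Q = lambda z: z ** 2

# Sweeping beta over (A, B) produces the family of indicators used to
# carry the indicator-function bounds over to real-valued functions.
for beta in (0.1, 0.3, 0.5, 0.7, 0.9):
    I = make_indicator(Q, beta)
    print(beta, [I(z) for z in (-1.0, -0.5, 0.0, 0.5, 1.0)])
```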

11 Generalization
In the generalization we distinguish three cases:
- Totally bounded functions
- Totally bounded nonnegative functions
- Nonnegative (not necessarily bounded) functions
The following bounds are nontrivial if
$$\lim_{l\to\infty}\frac{H_{\mathrm{ann}}^{\Lambda,B}(l)}{l}=0,$$
where $H_{\mathrm{ann}}^{\Lambda,B}(l)$ is the annealed entropy of the set of indicators constructed above.

12 Generalization
For totally bounded functions $A\le Q(z,\alpha)\le B$:
$$P\left\{\sup_{\alpha\in\Lambda}\left(R(\alpha)-R_{\mathrm{emp}}(\alpha)\right)>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\mathrm{ann}}^{\Lambda,B}(2l)}{l}-\frac{\varepsilon^2}{(B-A)^2}\right)l\right\}.$$
For totally bounded nonnegative functions $0\le Q(z,\alpha)\le B$:
$$P\left\{\sup_{\alpha\in\Lambda}\frac{R(\alpha)-R_{\mathrm{emp}}(\alpha)}{\sqrt{R(\alpha)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{H_{\mathrm{ann}}^{\Lambda,B}(2l)}{l}-\frac{\varepsilon^2}{4B}\right)l\right\}.$$

13 Generalization Nonnegative functions
Let be a set of functions such that for some p > 2 the pth normalized moments of the random variables exist:the Therefore: Where:
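
A quick numeric check of the constant $a(p)$ (a throwaway helper written for this transcript, assuming the formula above):

```python
import math

def a(p: float) -> float:
    """Vapnik's constant a(p) = (0.5 * ((p-1)/(p-2))**(p-1)) ** (1/p), for p > 2."""
    return (0.5 * ((p - 1) / (p - 2)) ** (p - 1)) ** (1.0 / p)

for p in (3.0, 4.0, 10.0):
    print(p, round(a(p), 4))
# a(p) stays moderate for p well above 2 but blows up as p -> 2+,
# reflecting that heavier-tailed losses make the bound looser.
```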

14 Distribution-independent bounds
The above bounds were distribution-dependent. To obtain distribution-independent bounds one replaces the annealed entropy $H_{\mathrm{ann}}^{\Lambda,B}(2l)$ with the growth function $G^{\Lambda,B}(2l)$. The following inequalities are nontrivial if
$$\lim_{l\to\infty}\frac{G^{\Lambda,B}(l)}{l}=0.$$

15 Distribution-independent bounds
For the set of totally bounded functions:
$$P\left\{\sup_{\alpha\in\Lambda}\left(R(\alpha)-R_{\mathrm{emp}}(\alpha)\right)>\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda,B}(2l)}{l}-\frac{\varepsilon^2}{(B-A)^2}\right)l\right\}.$$
For the set of nonnegative totally bounded functions:
$$P\left\{\sup_{\alpha\in\Lambda}\frac{R(\alpha)-R_{\mathrm{emp}}(\alpha)}{\sqrt{R(\alpha)}}>\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda,B}(2l)}{l}-\frac{\varepsilon^2}{4B}\right)l\right\}.$$
For the set of nonnegative real functions whose $p$th normalized moment exists for some $p>2$:
$$P\left\{\sup_{\alpha\in\Lambda}\frac{R(\alpha)-R_{\mathrm{emp}}(\alpha)}{m_p(\alpha)}>a(p)\,\varepsilon\right\}\le 4\exp\left\{\left(\frac{G^{\Lambda,B}(2l)}{l}-\frac{\varepsilon^2}{4}\right)l\right\}.$$

16 Bounds on the generalization ability of learning machines
What actual risk $R(\alpha_l)$ is provided by the function $Q(z,\alpha_l)$ that achieves minimal empirical risk $R_{\mathrm{emp}}(\alpha_l)$? How close is this risk to the minimal possible $\inf_{\alpha\in\Lambda}R(\alpha)$ for the given set of functions? In what follows we use the notation
$$\varepsilon=4\,\frac{G^{\Lambda,B}(2l)-\ln(\eta/4)}{l};$$
the bounds below hold with probability at least $1-\eta$ and are nontrivial when $\varepsilon<1$.

17 Distribution-independent bounds (another form)
For the set of totally bounded functions $A\le Q(z,\alpha)\le B$: with probability at least $1-\eta$, simultaneously for all functions of $\Lambda$,
$$R(\alpha)\le R_{\mathrm{emp}}(\alpha)+\frac{B-A}{2}\sqrt{\varepsilon};$$
with probability at least $1-2\eta$, for the function $Q(z,\alpha_l)$ that minimizes the empirical risk,
$$R(\alpha_l)-\inf_{\alpha\in\Lambda}R(\alpha)\le(B-A)\sqrt{\frac{-\ln\eta}{2l}}+\frac{B-A}{2}\sqrt{\varepsilon}.$$

18 Distribution-independent bounds (another form)
For the set of totally bounded nonnegative functions $0\le Q(z,\alpha)\le B$: with probability at least $1-\eta$, simultaneously for all functions,
$$R(\alpha)\le R_{\mathrm{emp}}(\alpha)+\frac{B\varepsilon}{2}\left(1+\sqrt{1+\frac{4R_{\mathrm{emp}}(\alpha)}{B\varepsilon}}\right);$$
with probability at least $1-2\eta$, for the function that minimizes the empirical risk,
$$R(\alpha_l)-\inf_{\alpha\in\Lambda}R(\alpha)\le B\sqrt{\frac{-\ln\eta}{2l}}+\frac{B\varepsilon}{2}\left(1+\sqrt{1+\frac{4R_{\mathrm{emp}}(\alpha_l)}{B\varepsilon}}\right).$$

19 Distribution-independent bounds (another form)
For the set of unbounded nonnegative functions we are given a pair $(p,\tau)$ such that
$$\sup_{\alpha\in\Lambda}\frac{\left(\int Q^p(z,\alpha)\,dF(z)\right)^{1/p}}{\int Q(z,\alpha)\,dF(z)}\le\tau.$$
With probability at least $1-\eta$, simultaneously for all functions,
$$R(\alpha)\le\frac{R_{\mathrm{emp}}(\alpha)}{\left(1-a(p)\,\tau\sqrt{\varepsilon}\right)_{+}},$$
where $(u)_+=\max(u,0)$. With probability at least $1-2\eta$ the analogous bound holds for the function that minimizes the empirical risk.

20 The structure of the growth function
To make the above bounds constructive, one has to find a way to evaluate the annealed entropy and/or the growth function for a given set of functions. We will obtain constructive bounds by using the concept of the VC dimension of a set of functions. There is a remarkable connection between the VC dimension and the growth function.

21 The structure of the growth function
theorem Any growth function either satisfies the equality Or is bounded by the inequality Definition We will say that the VC dimension of the set of indicator functions is infinite if the growth function for this set of functions is linear. the corresponding growth function is bounded by a logarithmic function with coefficient h.
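
To see the two regimes side by side, a small sketch (the sample sizes and the value h = 10 are arbitrary choices for illustration):

```python
import math

h = 10  # assumed VC dimension, for illustration only
for l in (10, 20, 50, 100, 1000):
    linear = l * math.log(2)            # growth if the set shatters every sample
    capped = h * (math.log(l / h) + 1)  # the VC bound, valid for l > h
    print(f"l={l:5d}  l*ln2={linear:8.1f}  h*(ln(l/h)+1)={capped:7.1f}")
# Past l = h the logarithmic bound grows far more slowly than l*ln2,
# which is what makes the exponential bounds above nontrivial.
```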

22 VC dimension the finiteness of the VC dimension of the set of indicator functions is a sufficient condition for consistency of the ERM method independent of the probability measure and implies a fast rate of convergence. It is a necessary and sufficient condition for distribution-independent consistency of ERM learning machines. The VC dimension of a set of indicator functions It is the maximum number h of vectors z1,...,zh that can be separated into two classes in all possible ways using functions of the set. If for any n there exists a set of n vectors that can be shattered by the set of functions, then the VC dimension is equal to infinity.

23 VC dimension
The VC dimension of a set of real functions: let $A\le Q(z,\alpha)\le B,\ \alpha\in\Lambda$, be a set of real functions bounded by constants $A$ and $B$. Consider the set of indicators
$$I(z,\alpha,\beta)=\theta\{Q(z,\alpha)-\beta\},$$
where $\theta(u)$ is the step function. The VC dimension of the set of real functions is defined to be the VC dimension of the set of corresponding indicators with parameters $\alpha\in\Lambda$ and $\beta\in(A,B)$.

24 VC dimension: Example
The VC dimension of the set of linear indicator functions in n-dimensional coordinate space is equal to $h=n+1$, since by using functions of this set one can shatter at most $n+1$ vectors. The VC dimension of the set of linear functions in n-dimensional coordinate space is also equal to $h=n+1$, because the VC dimension of the set of corresponding linear indicator functions is equal to $n+1$.

25 VC dimension: Example
The VC dimension of the set of functions
$$Q(z,\alpha)=\theta(\sin\alpha z),\qquad\alpha\in\mathbb{R}^1,$$
is infinite. The points on the line
$$z_i=10^{-i},\qquad i=1,\ldots,l,$$
can be shattered by functions from this set: to separate these data into two classes determined by the sequence $\delta_1,\ldots,\delta_l$ (with $\delta_i\in\{0,1\}$), it is sufficient to choose the value of the parameter $\alpha$ to be
$$\alpha=\pi\left(\sum_{i=1}^{l}(1-\delta_i)\,10^{i}+1\right).$$
In other words, by choosing an appropriate coefficient $\alpha$ one can, for any number of appropriately chosen points, approximate the values of any function bounded by $(-1,+1)$ using $\sin\alpha z$.
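
A quick numerical check of that choice of $\alpha$ (l = 4 points; the labelings are arbitrary):

```python
import math

l = 4
z = [10.0 ** -(i + 1) for i in range(l)]  # z_i = 10^-i, i = 1..l
for delta in ([0, 1, 0, 1], [1, 1, 0, 0], [0, 0, 0, 1]):
    alpha = math.pi * (sum((1 - d) * 10 ** (i + 1) for i, d in enumerate(delta)) + 1)
    realized = [1 if math.sin(alpha * zi) > 0 else 0 for zi in z]
    print(delta, realized)  # the realized labels match delta in every case
```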

26 VC dimension
The VC dimension of a set of functions does not necessarily coincide with the number of parameters: it can be either larger (as in the $\sin\alpha z$ example above) or smaller than the number of parameters. In the following we present the bounds on the risk functional that in Chapter 4 we use for constructing methods for controlling the generalization ability of learning machines.

27 Constructive distribution – independent bounds
Considering sets of functions that possess a finite VC dimension h Therefore, in all inequalities of the above Section the following constructive expression can be used (in the case of the finite VC dimension) We also will consider the case where the set of loss functions contains a finite number of elements
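
A sketch of this constructive ε as code (the function name is mine; h, l, and η are the inputs named in the formula):

```python
import math

def epsilon(h: int, l: int, eta: float) -> float:
    """Constructive eps = 4 * (h*(ln(2l/h) + 1) - ln(eta/4)) / l.

    h: VC dimension, l: sample size, eta: the bound holds with
    probability at least 1 - eta.
    """
    return 4.0 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l

# The bound only becomes useful once l is large relative to h:
for l in (100, 1_000, 10_000):
    print(l, round(epsilon(h=10, l=l, eta=0.05), 3))
```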

28 Constructive distribution – independent bounds
For the set of totally bounded functions with probability at least for all functions with probability at least for the function that minimizes the empirical risk:
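
Plugging the constructive ε into the first inequality (a sketch; the empirical risk value, the span B − A = 1, and h are made-up illustrations):

```python
import math

def risk_bound_bounded(r_emp: float, span: float, h: int, l: int, eta: float) -> float:
    """Upper bound R <= R_emp + (B - A)/2 * sqrt(eps), with span = B - A."""
    eps = 4.0 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l
    return r_emp + 0.5 * span * math.sqrt(eps)

print(risk_bound_bounded(r_emp=0.10, span=1.0, h=10, l=10_000, eta=0.05))
# ~0.10 + 0.5*sqrt(0.036) ~ 0.195: the guaranteed risk at 95% confidence
```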

29 Constructive distribution – independent bounds
The set of totally bounded nonnegative functions with probability at least for all functions with probability at least for the function that minimizes the empirical risk:
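
The same exercise for the nonnegative case (again h, l, η, and the empirical risks are illustrative); note how the confidence term shrinks when the empirical risk is small:

```python
import math

def risk_bound_nonneg(r_emp: float, B: float, h: int, l: int, eta: float) -> float:
    """Upper bound R <= R_emp + (B*eps/2) * (1 + sqrt(1 + 4*R_emp/(B*eps)))."""
    eps = 4.0 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l
    return r_emp + 0.5 * B * eps * (1 + math.sqrt(1 + 4 * r_emp / (B * eps)))

for r_emp in (0.0, 0.01, 0.10):
    print(r_emp, round(risk_bound_nonneg(r_emp, B=1.0, h=10, l=10_000, eta=0.05), 4))
# At R_emp = 0 the bound collapses to B*eps, i.e. an O(h log l / l) rate rather
# than the O(sqrt(h log l / l)) rate of the totally bounded case.
```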

30 Constructive distribution – independent bounds
The set of unbounded nonnegative functions with probability at least for all functions with probability at least for the function that minimizes the empirical risk:
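
And a sketch for the unbounded case (τ, p, and the other inputs are made-up values; the bound degenerates once $a(p)\tau\sqrt{\varepsilon}$ reaches 1):

```python
import math

def risk_bound_unbounded(r_emp: float, p: float, tau: float,
                         h: int, l: int, eta: float) -> float:
    """Upper bound R <= R_emp / (1 - a(p)*tau*sqrt(eps))_+ for unbounded losses."""
    eps = 4.0 * (h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l
    a_p = (0.5 * ((p - 1) / (p - 2)) ** (p - 1)) ** (1.0 / p)
    denom = 1.0 - a_p * tau * math.sqrt(eps)
    return r_emp / denom if denom > 0 else math.inf  # (u)_+ = 0 gives a vacuous bound

print(risk_bound_unbounded(r_emp=0.10, p=4.0, tau=2.0, h=10, l=10_000, eta=0.05))
```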

31 References
Vapnik, Vladimir N., The Nature of Statistical Learning Theory, 2nd ed., Springer, 2000.
