The Nature of Statistical Learning Theory by V. Vapnik


1 The Nature of Statistical Learning Theory by V. Vapnik
Statistical Learning Theory & Classifications Based on Support Vector Machines. 2014: Anders Melen; 2015: Rachel Temple. Turn off all the lights. Introduce myself and my major. Introduce the topic.

2 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session. We are going to start with empirical data modeling, which we have talked about many times in this course, so it should feel very familiar; if you are unsure, you should pick up the idea pretty quickly in a few minutes. Then comes a quick, simplified explanation of what statistical learning theory is and what it's all about, an example of what supervised learning is, risk minimization, and so on; there is a lot to talk about here. Every time I reach a new section I'm going to jump back to the table of contents so you know we're moving on to a new topic. I generally talk really fast during presentations, so if I do, please feel free to yell at me so you can keep up. I also tried to strike a good balance between covering high-level concepts and some important low-level details.

3 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session. Let's start off with something we should all be familiar with: empirical data modeling.

4 Empirical Data Modeling
Observations of a system are collected. Induction on those observations is used to build up a model of the system. The model is then used to deduce responses of the unobserved system. Sampling is typically non-uniform, and high-dimensional problems will form a sparse distribution in the input space. This basically refers to any kind of computer modeling that uses empirical observations rather than mathematical relationships, empirical data being data that has been gathered by observation.

5 Modeling Error
Approximation error is the consequence of the hypothesis space not fitting the target space. (Diagram: Globally Optimal Model, Best Reachable Model, Selected Model.) The underlying function may lie outside the hypothesis space, and a poor choice of the model space will result in a large approximation error (model mismatch).

6 Modeling Error
Approximation error is the consequence of the hypothesis space not fitting the target space. (Diagram: Globally Optimal Model, Best Reachable Model, Selected Model.) Goal: choose a model from the hypothesis space which is closest (with respect to some error measure) to the target function.

7 Generalization Error
Estimation error is the error between the best model in our hypothesis space and the model within our hypothesis space that we selected. (Diagram: Globally Optimal Model, Approximation Error, Best Reachable Model, Estimation Error, Selected Model.) Together, the approximation error and the estimation error, i.e. the gap between the globally optimal model and the selected model, form the generalization error.

8 The globally optimal model and the selected model together define the generalization error, which measures how well our data model adapts to new and unobserved data.

9 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session

10 Statistical Learning Theory
Definition: “Consider the learning problem as a problem of finding a desired dependence using a limited number of observations.” (Vapnik 17) Learning itself falls into several categories, the first two being the most important: unsupervised learning, supervised learning, online learning, and reinforcement learning.

11 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session

12 Model of Supervised Learning
Training: the supervisor takes each generated x value and returns an output value y, according to the joint distribution F(x, y) = F(x) F(y|x). Each (x, y) pair is part of the training set: (x1, y1), (x2, y2), … , (xl, yl). Many of the algorithms we have already discussed in this course are based on supervised learning. The factored form above uses the conditional probability function most of us are familiar with; F(y|x) reads “the probability of y given x”. What the diagram is essentially saying is that we pass a training set to both S and LM, where each training row has a solution y. Thus we supervise the building of a model that can accurately predict future values outside of our training set.
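This generator/supervisor setup can be sketched in a few lines of Python. The uniform distribution for the generator G and the noisy labeling rule for the supervisor S are illustrative assumptions on my part, not taken from the book:

```python
import random

random.seed(0)  # reproducible sampling

def generator():
    # G: draws x from a fixed distribution F(x); here uniform on [-1, 1]
    return random.uniform(-1, 1)

def supervisor(x):
    # S: returns y for each x according to F(y|x); here a noisy sign rule
    return 1 if x + random.gauss(0, 0.1) > 0 else 0

# The training set: l = 5 pairs (x_i, y_i) handed to the learning machine LM
training_set = [(x, supervisor(x)) for x in (generator() for _ in range(5))]
print(training_set)
```

Each run of the generator and supervisor yields one (x, y) training pair; the learning machine only ever sees these pairs, never F(y|x) itself.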

13 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session

14 Risk Minimization
To find the best function, we need to measure loss. L(y, F(x, 𝛂)) is the discrepancy function, based on the y values generated by the supervisor and the ŷ values generated by the estimating functions F(x, 𝛂); the goal is a predictor F such that the expected loss is minimized. This is a very common setup in machine learning: we look at training data drawn from an unknown distribution, and from there we determine a predictor F such that the expected loss is minimized, giving us better predictive accuracy.
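As a minimal numeric sketch of this setup (the names `squared_loss` and `empirical_loss` are mine, and squared loss is just one common choice of L):

```python
def squared_loss(y, y_hat):
    # L(y, F(x, a)): discrepancy between the supervisor's y and the estimate
    return (y - y_hat) ** 2

def empirical_loss(pairs, predictor):
    # Average of L over the training pairs (x, y): a data-based stand-in
    # for the expected loss we actually want to minimize
    return sum(squared_loss(y, predictor(x)) for x, y in pairs) / len(pairs)

# Toy data generated by y = 2x; the predictor F(x, a) = a*x with a = 2
# matches the data exactly, so its empirical loss is zero
data = [(1, 2), (2, 4), (3, 6)]
print(empirical_loss(data, lambda x: 2 * x))  # 0.0
print(empirical_loss(data, lambda x: x))      # positive: a worse predictor
```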

15 Risk Minimization: Pattern Recognition
With pattern recognition, the supervisor's output y can only take on two values, y ∈ {0, 1}, and the loss takes the values shown. The risk function then determines the probability of different answers being given by the supervisor and the estimation function. The main difference in the pattern recognition approach to risk minimization is that the output y is a boolean value.

16 Some Simplifications From Here On
Training set: {(x1, y1), … , (xl, yl)} → {z1, … , zl}. Loss function: L(y, F(x, 𝛂)) → Q(z, 𝛂). From here on I want to declare a few simplifications, namely for the training set and the loss function. Keep these in mind, as they are used throughout the rest of the presentation.

17 Empirical Risk Minimization (ERM)
We want to measure the risk over the training set rather than over the set of all possible data: Remp(𝛂) = (1/l) Σ Q(zi, 𝛂), averaging the loss over the l training examples.
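A minimal ERM sketch in Python, using the 0-1 loss from the pattern recognition slide and a finite grid of candidate parameters 𝛂 (the threshold classifier and the grid are illustrative assumptions, not from the book):

```python
def zero_one_loss(a, z):
    # Q(z, a): 1 if the threshold classifier "x > a" mislabels z = (x, y)
    x, y = z
    return int((1 if x > a else 0) != y)

def empirical_risk(a, zs):
    # R_emp(a): average loss of parameter a over the training set z_1..z_l
    return sum(zero_one_loss(a, z) for z in zs) / len(zs)

# ERM: pick the alpha with the smallest empirical risk among the candidates
zs = [(0.5, 0), (1.0, 0), (2.0, 1), (3.0, 1)]
best = min([0.0, 1.5, 2.5], key=lambda a: empirical_risk(a, zs))
print(best, empirical_risk(best, zs))  # 1.5 0.0
```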

18 Empirical Risk Minimization (ERM)
The empirical risk must converge to the actual risk over the set of loss functions. One really important fact is that this convergence, denoted by the limit shown here, is required.

19 Empirical Risk Minimization (ERM)
In both directions! This indeed has to be the case in both directions; again, this limit over the minimum denotes the required convergence criterion.

20 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session

21 Vapnik-Chervonenkis Dimensions
Let's just call them VC dimensions. They were developed by Alexey Yakovlevich Chervonenkis and Vladimir Vapnik. The VC dimension is a scalar value that measures the capacity of a set of functions.

22 Vapnik-Chervonenkis Dimensions
The VC dimension of a set of functions is responsible for the generalization ability of learning machines. The VC dimension of a set of indicator functions Q(z, 𝛂), 𝛂 ∈ 𝞚, is the maximum number h of vectors z1, …, zh that can be separated into two classes in all 2^h possible ways using functions of the set.
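The shattering definition can be checked mechanically. This sketch enumerates the labelings that one-dimensional threshold indicators can realize; it confirms that this set shatters any single point but no pair of points, so its VC dimension is 1. The classifier family and the test points are my own illustration:

```python
from itertools import product

def shatters(points, classifiers):
    # A set of h points is shattered if every one of the 2^h labelings
    # is realized by some function in the set
    achieved = {tuple(c(x) for x in points) for c in classifiers}
    return all(tuple(lab) in achieved for lab in product([0, 1], repeat=len(points)))

# Indicator functions Q(x, a) = 1 if x > a else 0, for a grid of thresholds
thresholds = [lambda x, a=a: int(x > a) for a in (-1.0, 0.25, 0.75, 2.0)]

print(shatters([0.5], thresholds))       # True: a single point gets both labels
print(shatters([0.5, 1.5], thresholds))  # False: the labeling (1, 0) is unreachable
```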

23 Upper Bound For Risk
It can be shown that R(𝛂) ≤ Remp(𝛂) + ε, where ε is the confidence interval and h is the VC dimension on which ε depends.

24 Upper Bound For Risk
ERM only minimizes Remp(𝛂), while ε, the confidence interval, is fixed, based on the VC dimension of the set of functions determined a priori. The confidence interval must be tuned to the problem to avoid overfitting and underfitting.

25 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session

26 Structural Risk Minimization (SRM)
SRM attempts to minimize the right-hand side of the inequality over both terms simultaneously.

27 Structural Risk Minimization (SRM)
The Remp(𝛂) term depends on a specific function's error, while the ε term depends on the VC dimension of the space that the function lives in. The VC dimension is the controlling variable.

28 Structural Risk Minimization (SRM)
We define the hypothesis space S to be the set of functions Q(z, 𝛂), 𝛂 ∈ 𝞚. We say that Sk = {Q(z, 𝛂)}, 𝛂 ∈ 𝞚k, is the hypothesis space of VC dimension k, such that S1 ⊂ S2 ⊂ … ⊂ Sn. For a set of observations {z1, …, zn}, SRM will choose the loss function Q(z, 𝛂) minimizing the empirical risk in the subset Sk for which the guaranteed risk is minimal. There are a few take-away messages for SRM. SRM defines a trade-off between the quality of the approximation of the given data and the complexity of the approximating function. As the VC dimension increases, the minima of the empirical risks decrease but the confidence interval increases. SRM is more general than ERM alone because it uses the subset Sk for which minimizing Remp(𝛂) (the empirical risk of 𝛂) yields the best bound on R(𝛂) (the risk of 𝛂).
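The SRM selection rule can be sketched as follows. The specific form of the confidence term is an assumption on my part (the book derives several variants), and the candidate numbers are purely illustrative:

```python
import math

def confidence_term(h, l, eta=0.05):
    # One common form of the VC confidence interval (an assumed variant):
    # sqrt((h * (ln(2l/h) + 1) - ln(eta/4)) / l)
    return math.sqrt((h * (math.log(2 * l / h) + 1) - math.log(eta / 4)) / l)

def srm_choose(candidates, l):
    # candidates: (k, empirical risk, VC dimension h) for nested subsets
    # S_1, S_2, ...; SRM minimizes the guaranteed risk bound, not R_emp alone
    return min(candidates, key=lambda c: c[1] + confidence_term(c[2], l))

# Richer subsets fit the data better (lower R_emp) but pay a larger penalty
cands = [(1, 0.20, 2), (2, 0.10, 10), (3, 0.05, 50)]
print(srm_choose(cands, l=100))
```

With only 100 observations the penalty dominates and the smallest subset wins; with far more data the confidence term shrinks and the richer, better-fitting subsets become preferable.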

29 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session

30 Support Vector Machines (SVM)
Map input vectors x into a high-dimensional feature space of vectors z using a kernel function: (zi · z) = K(xi, x). In this feature space the optimal separating hyperplane is constructed.
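The point of the kernel is that it computes the feature-space inner product without ever constructing the feature vectors. A small sketch with a degree-2 polynomial kernel (the concrete kernel and points are my illustration):

```python
import math

def poly_kernel(x, y):
    # Degree-2 homogeneous polynomial kernel: K(x, y) = (x . y)^2
    return (x[0] * y[0] + x[1] * y[1]) ** 2

def phi(x):
    # Explicit feature map matching that kernel: (x1^2, sqrt(2)*x1*x2, x2^2)
    return (x[0] ** 2, math.sqrt(2) * x[0] * x[1], x[1] ** 2)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

x, y = (1.0, 2.0), (3.0, -1.0)
# Same quantity computed two ways: implicitly via K, explicitly via phi
print(poly_kernel(x, y), dot(phi(x), phi(y)))  # both approximately 1.0
```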

31 Support Vector Machines (SVM)
Feature space… optimal hyperplane… what am I talking about? I really can't stand this diagram, so I found a little animation that better illustrates the concept of an SVM.

32 Support Vector Machines (SVM)
As you watch, notice that there exists no linear solution to this classification problem, so what are we going to do? We are going to use a polynomial kernel so that we can find a plane that correctly classifies the data points. In other words, we project the data into a higher-dimensional space in which a hyperplane that accurately classifies the data can be found.

33 Support Vector Machines (SVM)
Let's start off with a really basic one-dimensional example! Can we find a plane that accurately classifies the data points?

34 Support Vector Machines (SVM)
Aw snap, that was easy!

35 Support Vector Machines (SVM)
Ok, what about a harder one-dimensional example? Can anyone figure out what we need to do?

36 Support Vector Machines (SVM)
Just like in the animation, all we need to do is project our lower-dimensional data into a higher-dimensional space.
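The one-dimensional trick on this slide can be written out directly: lift each point x to (x, x²), after which a horizontal line separates classes that were not separable on the original line (the sample points here are my own):

```python
def lift(x):
    # Project a 1-D point into 2-D: x -> (x, x^2)
    return (x, x * x)

# On the line these are not linearly separable: class 1 surrounds class 0
pos = [-2.0, 2.0]   # class 1
neg = [-0.5, 0.5]   # class 0

# After lifting, the line x2 = 1 in the new space separates the classes
classify = lambda p: int(lift(p)[1] > 1.0)
print([classify(p) for p in pos], [classify(p) for p in neg])  # [1, 1] [0, 0]
```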

37 Support Vector Machines (SVM)
There are several ways to implement an SVM, each using a different kernel function: Polynomial Learning Machines (like the animation), Radial Basis Function Machines, and Two-Layer Neural Networks. We already saw a quick polynomial learning example in the animation, so let's take a quick look at the two-layer neural network implementation.

38 Simple Neural Network
Neural networks are computer science models inspired by nature! The brain is a massive natural neural network consisting of neurons and synapses, and neural networks can be modeled using a graphical model. If anyone has taken Josh Bongard's Evolutionary Robotics course you should be well versed in neural networks; for those of you who haven't, I'll try to explain quickly what they are so you'll have some context for the following section.

39 Neurons → Nodes, Synapses → Edges
As a computer scientist you should be familiar with graphs as a data structure. To model a natural neural network like a brain we need to define a few things: neurons can be thought of as nodes, and synapses as edges, with a weight associated with each synapse. When values are passed into an input node they propagate through the neural network, affected by the synapse weights, until they reach the output nodes. (Diagram: molecular form vs. neural network model.)

40 Two-Layer Neural Network
The kernel is a sigmoid function implementing the rules shown on the slide.
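A sketch of the forward pass of such a machine: the decision function sums sigmoid-kernel units centered on the support vectors and thresholds the result. The particular v, c, support vectors, and weights below are hypothetical values chosen only to show the wiring:

```python
import math

def sigmoid_kernel(x, xi, v=1.0, c=-0.5):
    # Sigmoid kernel K(x, xi) = tanh(v * (x . xi) + c)
    return math.tanh(v * sum(a * b for a, b in zip(x, xi)) + c)

def two_layer_machine(x, support_vectors, alphas, b=0.0):
    # First layer: N kernel units, one per support vector (weights w_i = x_i)
    # Second layer: weighted sum using the alpha values, then a sign threshold
    s = sum(a * sigmoid_kernel(x, xi) for a, xi in zip(alphas, support_vectors)) + b
    return 1 if s > 0 else -1

# Hypothetical support vectors and second-layer weights
svs = [(1.0, 0.0), (0.0, 1.0)]
alphas = [0.7, -0.3]
print(two_layer_machine((1.0, 0.0), svs, alphas))  # 1
```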

41 Two-Layer Neural Network
Using this technique, the following are found automatically: the architecture of the two-layer machine; the number N of units in the first layer (the number of support vectors); the vectors of the weights wi = xi in the first layer; and the vector of weights for the second layer (the values of 𝛂).

42 Conclusion
The quality of a learning machine is characterized by three main components: How rich and universal is the set of functions that the LM can approximate? How well can the machine generalize? How fast does the learning process for this machine converge?

43 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Machines (SVM); Exam Questions; Q & A Session

44 Exam Question #1
What is the main difference between polynomial learning machines, radial basis function machines, and neural network learning machines? Answer: the kernel function.

45 Exam Question #2
What is empirical data modeling? Give a summary of the main concept and its components. Answer: empirical data modeling is the induction of observations to build up a model; the model is then used to deduce responses of an unobserved system.

46 Exam Question #3
What must Remp(𝛂) do over the set of loss functions? Answer: it must converge to R(𝛂), the general risk function, over the set of loss functions.

47 Table of Contents
Empirical Data Modeling; What is Statistical Learning Theory; Model of Supervised Learning; Risk Minimization; Vapnik-Chervonenkis Dimensions; Structural Risk Minimization (SRM); Support Vector Classification; Optimal Separating Hyperplane & Quadratic Programming; Support Vector Machines (SVM); Exam Questions; Q & A Session

48 End
Any questions?

