
1 Introduction to Predictive Learning. Electrical and Computer Engineering. LECTURE SET 4: Statistical Learning Theory

2 OUTLINE of Set 4: Objectives and Overview; Inductive Learning Problem Setting; Keep-It-Direct Principle; Analysis of ERM; VC-dimension; Generalization Bounds; Structural Risk Minimization (SRM); Summary and Discussion

3 Objectives. Problems with philosophical approaches: they lack a quantitative description/characterization of ideas; they have no real predictive power (as in the natural sciences); and there is no agreement on basic definitions/concepts (as in the natural sciences). Goal: to introduce Predictive Learning as a scientific discipline.

4 Characteristics of a Scientific Theory: problem setting; solution approach; math proofs (technical analysis); constructive methods; applications. Note: problem setting and solution approach are independent of each other.

5 History and Overview. SLT, aka VC-theory (Vapnik-Chervonenkis): a theory for estimating dependencies from finite samples (the predictive learning setting), based on the risk minimization approach. All main results were originally developed in the 1970s for classification (pattern recognition) - why? - but remained largely unknown. Recent renewed interest is due to the practical success of Support Vector Machines (SVM).

6 History and Overview (cont'd). MAIN CONCEPTUAL CONTRIBUTIONS: distinction between problem setting, inductive principle and learning algorithms; direct approach to estimation with finite data (KID principle); math analysis of ERM (standard inductive setting); two factors responsible for generalization: - empirical risk (fitting error) - complexity (capacity) of approximating functions

7 Importance of VC-theory. Math results addressing the main question: under what general conditions does the ERM approach lead to (good) generalization? New approach to induction: predictive vs. generative modeling (as in classical statistics). Connection to philosophy of science: - VC-theory was developed for binary classification (pattern recognition), the simplest generalization problem - natural sciences: from observations to scientific law → VC-theoretical results can be interpreted using general philosophical principles of induction, and vice versa.

8 Inductive Learning Setting. The learning machine observes samples (x, y) and returns an estimated response f(x, w). Two modes of inference: identification vs. imitation. Risk = expected loss: R(w) = ∫ L(f(x, w), y) dP(x, y).

9 The Problem of Inductive Learning. Given: finite training samples Z = {(x_i, y_i), i = 1, 2, …, n}, choose from a given set of functions f(x, w) the one that best approximates the true output (in the sense of risk minimization). Concepts and terminology: approximating functions f(x, w); (non-negative) loss function L(f(x, w), y); expected risk functional R(w). Goal: find the function f(x, w_0) minimizing R(w) when the joint distribution P(x, y) is unknown.

10 Empirical Risk Minimization. ERM principle in model-based learning: - Model parameterization: f(x, w) - Loss function: L(f(x, w), y) - Estimate risk from data: R_emp(w) = (1/n) Σ_{i=1}^{n} L(f(x_i, w), y_i) - Choose w* that minimizes R_emp. Statistical Learning Theory developed from the theoretical analysis of the ERM principle under finite-sample settings.
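As a minimal sketch (not from the slides), ERM over a finite set of candidate models can be written in a few lines; the 1-D linear model f(x, w) = w·x, the toy data set, and the grid of candidate w values are all illustrative assumptions:

```python
# ERM sketch: hypothetical 1-D linear model f(x, w) = w * x with squared loss.

def empirical_risk(w, data):
    """R_emp(w) = (1/n) * sum of squared losses over the training set."""
    return sum((w * x - y) ** 2 for x, y in data) / len(data)

def erm(data, candidates):
    """Return the candidate w* minimizing the empirical risk."""
    return min(candidates, key=lambda w: empirical_risk(w, data))

data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]            # roughly y = 2x (toy data)
w_star = erm(data, [w / 10 for w in range(-50, 51)])   # grid of candidate w values
```

With this toy data the grid search settles near slope 2, illustrating how w* is chosen purely from the training sample, without knowledge of P(x, y).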

11 Probabilistic Modeling vs ERM

12 Probabilistic Modeling vs ERM: Example. Known class distribution → optimal decision boundary

13 Probabilistic Approach. Estimate parameters of Gaussian class distributions, and plug them into the quadratic decision boundary

14 ERM Approach. Quadratic and linear decision boundaries estimated via minimization of squared loss

15 Estimation of multivariate functions. Is it possible to estimate a function from finite data? Simplified problem: estimation of an unknown continuous function from noise-free samples. Many results from function approximation theory: - To estimate a d-dimensional function accurately, one needs O(n^d) data points - For example, if 3 points are needed to estimate a second-order polynomial for d=1, then 3^10 points are needed to estimate a second-order polynomial in 10-dimensional space - Similar results hold in signal processing. There are never enough data points to estimate multivariate functions in most practical applications (image recognition, genomics, etc.). For multivariate function estimation, the number of free parameters increases exponentially with problem dimensionality (the Curse of Dimensionality).
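The growth rate above is easy to make concrete. The helper below (an illustrative name, not from the slides) reproduces the slide's arithmetic: if 3 points per dimension suffice, a 10-dimensional problem needs 3^10 points:

```python
# Back-of-the-envelope for the O(n^d) sample requirement: if n_per_dim points
# suffice in one dimension, a comparable d-dimensional grid needs n_per_dim**d.
def required_samples(n_per_dim, d):
    return n_per_dim ** d

one_dim = required_samples(3, 1)    # 3 points: second-order polynomial, d = 1
ten_dim = required_samples(3, 10)   # same accuracy in 10 dimensions: 59,049 points
```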

16 Properties of high-dimensional data. Sparse data looks like a porcupine: the volume of a unit sphere inscribed in a d-dimensional cube shrinks toward zero even as the volume of the d-cube grows exponentially. A point is closer to an edge than to another point. Pairwise distances between points become nearly the same. → The intuition behind kernel (local) methods no longer holds. How is generalization possible, in spite of the curse of dimensionality?
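The shrinking-sphere claim can be checked numerically using the closed-form volume of a d-ball, V_d(r) = π^(d/2)·r^d / Γ(d/2 + 1); the helper names are illustrative:

```python
import math

def ball_volume(d, r=1.0):
    """Volume of a d-dimensional ball: pi^(d/2) * r^d / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

def inscribed_fraction(d):
    """Fraction of the cube [-1, 1]^d occupied by the inscribed unit ball."""
    return ball_volume(d) / 2.0 ** d

# The fraction collapses toward zero as d grows: almost all of the cube's
# volume sits in its corners, outside the inscribed sphere.
fractions = [inscribed_fraction(d) for d in (2, 5, 10, 20)]
```

For d = 2 the ball fills about 78.5% of the cube; by d = 10 it fills less than 0.3%.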

17 OUTLINE of Set 4: Objectives and Overview; Inductive Learning Problem Setting; Keep-It-Direct Principle; Analysis of ERM; VC-dimension; Generalization Bounds; Structural Risk Minimization (SRM); Summary and Discussion

18 Keep-It-Direct Principle. The goal of learning is generalization rather than estimation of the true function (system identification). Keep-It-Direct Principle (Vapnik, 1995): do not solve an estimation problem of interest by solving a more general (harder) problem as an intermediate step. A good predictive model reflects some properties of the unknown distribution P(x,y). Since model estimation with finite data is ill-posed, one should never try to solve a more general problem than required by the given application. → Importance of formalizing application requirements via an appropriate learning formulation.

19 Learning vs System Identification. Consider a regression problem y = t(x) + noise with unknown target function t(x). Goal 1: Prediction. Goal 2: Function approximation (system identification), i.e., accurate estimation of t(x) everywhere in the input space. Admissible models: algebraic polynomials. Purpose of comparison: contrast goals (1) and (2). NOTE: most applications assume Goal 2, i.e., noisy data ~ true signal + noise.

20 Empirical Comparison. Target function: sine-squared. Input distribution: non-uniform Gaussian pdf. Additive Gaussian noise with standard deviation 0.1.

21 Empirical Comparison (cont'd). Model selection: use separate data sets - training: for parameter estimation - validation: for selecting polynomial degree - test: for estimating prediction risk (MSE). The validation set is generated differently to contrast (1) & (2): Predictive Learning (1) ~ Gaussian; Function Approximation (2) ~ uniform fixed sampling. Training + test data ~ Gaussian. Training set size: 30. Validation set size: 30.
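A sketch of this data-generation set-up. The exact target and input density are not given on the slides, so the choices below (sine-squared taken as sin(2πx)², inputs from a Gaussian N(0.5, 0.2) truncated to [0, 1]) are illustrative assumptions:

```python
import math
import random

random.seed(0)

def target(x):
    """Assumed sine-squared target: sin(2*pi*x)^2 (illustrative form)."""
    return math.sin(2 * math.pi * x) ** 2

def sample(n, sigma=0.1):
    """n training pairs: x ~ Gaussian(0.5, 0.2) truncated to [0, 1] (assumed
    non-uniform input density); y = target(x) + Gaussian noise, st. dev. 0.1."""
    data = []
    while len(data) < n:
        x = random.gauss(0.5, 0.2)
        if 0.0 <= x <= 1.0:
            data.append((x, target(x) + random.gauss(0.0, sigma)))
    return data

train = sample(30)   # the slide's training set size
```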

22 Regression estimates (2 typical realizations of data): dotted line ~ estimate obtained using the predictive learning setting; dashed line ~ estimate via the function approximation setting. → Estimated models are too smooth (under function approximation).

23 Conclusion. The goal of prediction (1) is different from (and less demanding than) the goal (2) of estimating the true target function everywhere in the input space. The curse of dimensionality applies to the system identification setting (2), but may not hold under the predictive setting (1). Both settings coincide if the input distribution is uniform (e.g., in signal and image denoising applications).

24 Philosophical Interpretation of KID. Interpretation of predictive models: Realism ~ objective truth (hidden in Nature); Instrumentalism ~ creation of the human mind (imposed on the data) - favored by KID. Objective evaluation is still possible (via prediction risk reflecting application needs) → Natural Science. Methodological implications: importance of good learning formulations (asking the 'right question'); this accounts for 80% of success in applications.

25 OUTLINE of Set 4: Objectives and Overview; Inductive Learning Problem Setting; Keep-It-Direct Principle; Analysis of ERM; VC-dimension; Generalization Bounds; Structural Risk Minimization (SRM); Summary and Discussion

26 VC-theory has 4 parts: 1. Analysis of consistency/convergence of ERM 2. Generalization bounds 3. Inductive principles (for finite samples) 4. Constructive methods (learning algorithms) for implementing (3). NOTE: (1) → (2) → (3) → (4)

27 Consistency/Convergence of ERM. The empirical risk is known, but the expected risk is unknown. Asymptotic consistency requirement: under what (general) conditions will models providing minimal empirical risk also provide minimal prediction risk, as the number of samples grows large? Why is asymptotic analysis needed? - it helps to develop useful concepts - necessary and sufficient conditions ensure that VC-theory is general and cannot be improved.

28 Consistency of ERM. Convergence of the empirical risk to the expected risk does not imply consistency of ERM. Models estimated via ERM (w*) are always biased estimates of the functions minimizing the true risk.

29 Conditions for Consistency of ERM. Main insight: consistency is not possible without restricting the set of possible models. Example: the 1-nearest-neighbor classification method - is it consistent? Consider binary decision functions (~ classification). How to measure their flexibility, i.e., their ability to 'explain'/fit the available data (for binary classification)? This complexity index for indicator functions: - is independent of the unknown data distribution; - measures the capacity of a set of possible models, rather than characteristics of the 'true model'.

30 OUTLINE of Set 4: Objectives and Overview; Inductive Learning Problem Setting; Keep-It-Direct Principle; Analysis of ERM; VC-dimension; Generalization Bounds; Structural Risk Minimization (SRM); Summary and Discussion

31 SHATTERING. Linear indicator functions can split 3 data points in 2D into all 2^3 = 8 possible binary partitions. If a set of n samples can be separated by a set of functions in all 2^n possible ways, this sample is said to be shattered (by the set of functions). Shattering ~ a set of models can explain a given sample of size n (for all possible labelings).
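Shattering can be verified by brute force: enumerate all 2^3 labelings of 3 points in general position and check linear separability, here with a simple perceptron (which converges whenever a separating line exists); the point coordinates are illustrative:

```python
import itertools

def perceptron(points, labels, max_steps=1000):
    """Return (w1, w2, b) with sign(w.x + b) matching the +1/-1 labels,
    or None if no separating line is found within max_steps passes."""
    w1 = w2 = b = 0.0
    for _ in range(max_steps):
        clean_pass = True
        for (x1, x2), y in zip(points, labels):
            if y * (w1 * x1 + w2 * x2 + b) <= 0:   # misclassified (or on the line)
                w1 += y * x1
                w2 += y * x2
                b += y
                clean_pass = False
        if clean_pass:
            return (w1, w2, b)
    return None

pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]           # 3 points in general position
shattered = all(perceptron(pts, lab) is not None     # all 2^3 = 8 labelings work
                for lab in itertools.product((-1, 1), repeat=3))

# 4 points cannot always be shattered by lines: the XOR labeling fails.
xor_pts = [(0.0, 0.0), (1.0, 1.0), (1.0, 0.0), (0.0, 1.0)]
xor_split = perceptron(xor_pts, (1, 1, -1, -1), max_steps=200)
```

The XOR check previews the next slide: linear indicator functions in 2D shatter 3 points but not 4, so h = 3.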

32 VC DIMENSION. Definition: a set of functions has VC-dimension h if there exist h samples that can be shattered by this set of functions, but there are no h+1 samples that can be shattered. → VC-dimension h = 3 for linear indicator functions in 2D (h = d+1 for linear functions in d dimensions). The VC-dimension is a positive integer (a combinatorial index). What is the VC-dimension of the 1-nearest-neighbor classifier?

33 VC-dimension and Consistency of ERM. The VC-dimension is infinite if, for any n, some sample of size n can be split in all 2^n possible ways (in this case, no valid generalization is possible). Finite VC-dimension gives necessary and sufficient conditions for: (1) consistency of ERM-based learning (2) fast rate of convergence (these conditions are distribution-independent). Interpretation of the VC-dimension via falsifiability: functions with small VC-dimension can be easily falsified.

34 VC-dimension and Falsifiability. A set of functions has VC-dimension h if (a) it can explain (shatter) a set of h samples ~ there exist h samples that cannot falsify it, and (b) it cannot shatter h+1 samples ~ any h+1 samples falsify this set. Finiteness of the VC-dimension is a necessary and sufficient condition for generalization (for any learning method based on ERM).

35 Recall Occam's Razor. Main problem in predictive learning: complexity control (model selection) - how to measure complexity? Interpretation of Occam's razor (in statistics): entities ~ model parameters; complexity ~ degrees of freedom; necessity ~ explaining (fitting) available data. → Model complexity = number of parameters (DoF). Consistent with the classical statistical view: learning = function approximation / density estimation.

36 Philosophical Principle of VC-falsifiability. Occam's Razor: select the model that explains the available data and has the smallest number of free parameters (entities). VC-theory: select the model that explains the available data and has low VC-dimension (i.e., can be easily falsified). → A new principle of VC-falsifiability.

37 Calculating the VC-dimension. How to estimate the VC-dimension for a given set of functions? Apply the definition (via shattering) to derive analytic estimates - this works for 'simple' sets of functions. Generally, such analytic estimates are not possible for complex nonlinear parameterizations (i.e., for practical machine learning and statistical methods).

38 Example 1: VC-dimension of spherical indicator functions. Consider spherical decision surfaces in a d-dimensional x-space, parameterized by a center c and a radius r: f(x, c, r) = I(||x - c|| ≤ r). In a 2-dimensional space (d=2) there exist 3 points that can be shattered, but 4 points cannot be shattered → h = 3.

39 Example 2: VC-dimension of a linear combination of fixed basis functions (e.g., polynomials, Fourier expansion, etc.). Assuming that the basis functions are linearly independent, the VC-dimension equals the number of basis functions (free parameters). Example 3: a single parameter but infinite VC-dimension, e.g., f(x, w) = I(sin(wx) > 0).

40 Example 4: Wide linear decision boundaries. Consider linear functions D(x) such that the distance between D(x) and the closest data sample is larger than a given value Δ (the margin). Then the VC-dimension depends on the margin Δ rather than on the dimensionality d (as in ordinary linear models): h ≤ min(R²/Δ², d) + 1, where R is the radius of the smallest sphere containing the data.

41 Example 5: A linear combination of m fixed basis functions is equivalent to linear functions in an m-dimensional space → VC-dimension = m + 1 (this assumes linear independence of the basis functions). In general, analytic estimation of the VC-dimension is hard. The VC-dimension can be equal to, larger than, or smaller than the DoF.

42 VC-dimension vs number of parameters. The VC-dimension can be equal to the DoF (number of parameters) - example: linear estimators. The VC-dimension can be smaller than the DoF - example: penalized estimators. The VC-dimension can be larger than the DoF - examples: feature selection; sin(wx).

43 VC-dimension for Regression Problems. The VC-dimension was defined for indicator functions. It can be extended to real-valued functions; e.g., a third-order polynomial for univariate regression is a linear parameterization → VC-dimension = 4. Qualitatively, the VC-dimension ~ the ability to fit (or explain) finite training data, also for regression.

44 Example: what is the VC-dimension of kNN regression? Ten training samples from a noisy target function. Using kNN regression with k=1 and k=4:

45 OUTLINE of Set 4: Objectives and Overview; Inductive Learning Problem Setting; Keep-It-Direct Principle; Analysis of ERM; VC-dimension; Generalization Bounds; Structural Risk Minimization (SRM); Summary and Discussion

46 Recall consistency of ERM. Two types of VC-bounds: (1) How close is the empirical risk to the true risk? (2) How close is the empirical risk to the minimal possible risk?

47 Generalization Bounds. Bounds for learning machines (implementing ERM) evaluate the difference between the (unknown) true risk and the known empirical risk, as a function of the sample size n and general properties of the admissible models (their VC-dimension h). Classification: with probability 1 - η, for all approximating functions, R(w) ≤ R_emp(w) + Φ(h, n, η), where Φ is called the confidence interval. Regression: with probability 1 - η, for all approximating functions, R(w) ≤ R_emp(w) / (1 - c·√ε)₊, where ε = ε(h, n, η).

48 Practical VC Bound for regression. A practical regression bound can be obtained by setting the confidence level and the theoretical constants: R(h) ≤ R_emp(h) · (1 - √(p - p·ln p + (ln n)/(2n)))⁻¹₊, where p = h/n. It can be used for model selection (examples given later). Compare to the analytic bounds (SC, FPE) in Lecture Set 2. Analysis of the denominator shows that h < 0.8·n for any estimator; in practice, h < 0.5·n for any estimator.
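The practical bound can be sketched as a small helper (assuming the penalization-factor form with p = h/n stated on the slide; the constant choices are baked into that form and should be treated as illustrative):

```python
import math

def vc_penalty(h, n):
    """Practical VC penalization factor with p = h/n (assumed form:
    r(p, n) = 1 / (1 - sqrt(p - p*ln(p) + ln(n)/(2n)))_+ ).
    Returns +inf when the denominator is non-positive (bound is vacuous)."""
    p = h / n
    denom = 1.0 - math.sqrt(p - p * math.log(p) + math.log(n) / (2 * n))
    return float('inf') if denom <= 0 else 1.0 / denom

def vc_bound(emp_risk, h, n):
    """Estimated prediction risk = empirical risk * penalization factor."""
    return emp_risk * vc_penalty(h, n)
```

For model selection one fits models of increasing complexity and keeps the h minimizing vc_bound; note the penalty blows up near h ≈ 0.8·n, matching the slide's remark about the denominator.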

49 VC Regression Bound for model selection. The VC-bound can be used for analytic model selection (if the VC-dimension is known). Example: polynomial regression for estimating the sine-squared target function from 25 noisy samples. Optimal model found: sixth-degree polynomial (no resampling needed).

50 Modeling pure noise (with x in [0,1]) via polynomial regression; sample size n=30, Gaussian noise. Comparison of different model selection methods: - prediction risk (MSE) - selected DoF (~ h)

51 OUTLINE of Set 4: Objectives and Overview; Inductive Learning Problem Setting; Keep-It-Direct Principle; Analysis of ERM; VC-dimension; Generalization Bounds; Structural Risk Minimization (SRM); Summary and Discussion

52 Structural Risk Minimization. Analysis of the generalization bounds suggests that when n/h is large, the confidence-interval term is small → this leads to the parametric modeling approach (ERM). When n/h is not large (say, less than 20), both terms on the right-hand side of the VC-bound need to be minimized → make the VC-dimension a controlling variable. SRM = a formal mechanism for controlling model complexity. The set of admissible models has a nested structure S_1 ⊂ S_2 ⊂ … ⊂ S_k ⊂ … → the structure formally defines a complexity ordering.

53 Structural Risk Minimization (SRM). An upper bound on the true risk and the empirical risk, as a function of VC-dimension h (for fixed sample size n)

54 SRM vs ERM modeling

55 SRM Approach. Use the VC-dimension as a controlling parameter for minimizing the VC bound. Two general strategies for implementing SRM: 1. Keep the confidence interval fixed and minimize the empirical risk (most statistical and neural network methods). 2. Keep the empirical risk fixed (small) and minimize the confidence interval (Support Vector Machines).
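Strategy 1 can be sketched end-to-end on a toy nested structure S_1 ⊂ S_2 (constants, then linear functions), fitting each element by ERM and keeping the element with the smallest bound; the bound form (the practical penalty with p = h/n) and the toy data are illustrative assumptions:

```python
import math

def fit_constant(data):
    """S_1: f(x) = w0 (sample mean), VC-dimension h = 1."""
    w0 = sum(y for _, y in data) / len(data)
    return (lambda x, w0=w0: w0), 1

def fit_linear(data):
    """S_2: f(x) = w0 + w1*x (least squares), VC-dimension h = 2."""
    n = len(data)
    sx = sum(x for x, _ in data)
    sy = sum(y for _, y in data)
    sxx = sum(x * x for x, _ in data)
    sxy = sum(x * y for x, y in data)
    w1 = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    w0 = (sy - w1 * sx) / n
    return (lambda x, w0=w0, w1=w1: w0 + w1 * x), 2

def srm(data, structure):
    """Fit each nested element by ERM; return (bound, model, h) minimizing
    R_emp * penalty (assumed practical VC penalty with p = h/n)."""
    n = len(data)
    best = None
    for fit in structure:                  # structure is ordered: S_1, S_2, ...
        f, h = fit(data)
        remp = sum((f(x) - y) ** 2 for x, y in data) / n
        p = h / n
        denom = 1.0 - math.sqrt(p - p * math.log(p) + math.log(n) / (2 * n))
        bound = float('inf') if denom <= 0 else remp / denom
        if best is None or bound < best[0]:
            best = (bound, f, h)
    return best

data = [(x / 10, 0.5 + x / 10) for x in range(10)]   # noiseless linear trend
bound, model, h = srm(data, [fit_constant, fit_linear])
```

On this linear-trend data SRM selects the linear element (h = 2), since its near-zero empirical risk outweighs the larger penalty.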

56 Common SRM structures (1): Dictionary structure. A set of algebraic polynomials of increasing degree is a structure, since S_1 ⊂ S_2 ⊂ … More generally, f_m(x, w) = Σ_{k=1}^{m} w_k·g_k(x), where {g_k} is a set of basis functions (a dictionary). The number of terms (basis functions) m specifies an element of the structure. For fixed basis functions, VC-dimension ~ number of parameters.
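A minimal sketch of a dictionary element using monomial basis functions (the names are illustrative):

```python
def dictionary(m):
    """Element S_m of a dictionary structure: the first m monomial basis
    functions g_k(x) = x^k, k = 0..m-1.  Nested by construction:
    dictionary(m) is a prefix of dictionary(m+1), so S_1 < S_2 < ..."""
    return [lambda x, k=k: x ** k for k in range(m)]

def f(x, w, basis):
    """Linear combination f(x, w) = sum_k w_k * g_k(x)."""
    return sum(wk * g(x) for wk, g in zip(w, basis))

value = f(2.0, [1.0, 0.0, 3.0], dictionary(3))   # 1 + 0*x + 3*x^2 at x = 2
```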

57 Common SRM structures (2): Feature selection (aka subset selection). Consider sparse polynomials with m terms (e.g., a single monomial for m=1, two monomials for m=2, etc.). Each monomial is a feature. The goal is to select the set of m features providing minimal empirical risk (MSE). This is a structure, since S_1 ⊂ S_2 ⊂ … More generally, m basis functions are selected from a (large) set of M functions. Note: this requires nonlinear optimization, and the VC-dimension is unknown.
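A toy sketch of the feature-selection structure for m = 1: pick the single monomial x^k (out of M candidates) whose closed-form least-squares fit gives the smallest empirical MSE; the data set is illustrative:

```python
def best_single_feature(data, M):
    """Feature-selection element with m = 1: among monomials x^k, k = 1..M,
    fit w * x^k by closed-form least squares and keep the feature with the
    smallest empirical MSE.  Returns (mse, k, w)."""
    best = None
    for k in range(1, M + 1):
        sxy = sum((x ** k) * y for x, y in data)
        sxx = sum((x ** k) ** 2 for x, _ in data)
        w = sxy / sxx
        mse = sum((w * x ** k - y) ** 2 for x, y in data) / len(data)
        if best is None or mse < best[0]:
            best = (mse, k, w)
    return best

data = [(i / 4, 2.0 * (i / 4) ** 3) for i in range(1, 6)]   # y = 2*x^3, no noise
mse, k, w = best_single_feature(data, 5)
```

Even this m = 1 case shows why the slide calls the search nonlinear: the chosen basis function itself depends on the data, so the effective VC-dimension exceeds the parameter count.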

58 Common SRM structures (3): Penalization. Consider algebraic polynomials of fixed degree, subject to the constraint ||w||² ≤ c. Each (positive) value of c specifies an element of a structure. Minimization of the empirical risk (MSE) on each element of the structure is a constrained minimization problem. This optimization problem can be equivalently stated as minimization of the penalized empirical risk functional R_pen(w) = R_emp(w) + λ·||w||², where the choice of λ corresponds to the constraint value c. Note: the VC-dimension is unknown.
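A one-parameter sketch of the penalization structure: for f(x, w) = w·x with squared loss, the penalized risk (1/n)·Σ(w·x_i - y_i)² + λ·w² has a closed-form minimizer; larger λ shrinks w, i.e., selects a 'smaller' element of the structure (the data values are illustrative):

```python
def ridge_fit(data, lam):
    """Closed-form minimizer of (1/n) * sum (w*x - y)^2 + lam * w^2
    for the one-parameter model f(x, w) = w * x."""
    n = len(data)
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / (sxx + n * lam)   # from setting the derivative in w to zero

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # y = 2x exactly (toy data)
w_unpenalized = ridge_fit(data, 0.0)   # ordinary least squares: w = 2
w_penalized = ridge_fit(data, 1.0)     # the penalty shrinks w toward 0
```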

59 Example: SRM structures for regression. Regression data set: x-values uniformly sampled in [0,1]; y-values ~ target function + additive Gaussian noise with standard deviation 0.05. Experimental set-up: training set ~ 40 samples; validation set ~ 40 samples (for model selection). SRM structures defined on algebraic polynomials: - dictionary (polynomial degrees 1 to 10) - penalization (fixed degree-10 polynomial) - sparse polynomials (degree 1 to 5).

60 Estimated models using different SRM structures: - dictionary - penalization (lambda = 1.013e-5) - sparse polynomial. Visual results: target function ~ red line; feature selection ~ black solid; dictionary ~ green; penalization ~ yellow line.

61 SRM Summary. SRM structure ~ a complexity ordering on a set of admissible models (approximating functions). Many different structures can be defined on the same set of approximating functions (possible models). How to choose the 'best' structure? - it depends on the application data - VC-theory cannot provide the answer. SRM = a mechanism for complexity control: - selecting optimal complexity for a given data set - a new measure of complexity: the VC-dimension - model selection using analytic VC-bounds.

62 Real-Life Application: signal denoising. Univariate signal ~ function of time. True signal (unknown) vs. noisy signal (given).

63 Signal denoising: problem statement. Regression formulation ~ real-valued function estimation (with squared loss). Signal representation: linear combination of orthogonal basis functions (harmonics, wavelets). Differences from the standard formulation: - fixed sampling rate - training-data x-values = test-data x-values → computationally efficient orthogonal estimators: Discrete Fourier/Wavelet Transform (DFT/DWT).

64 Examples of wavelets (see http://en.wikipedia.org/wiki/Wavelet): Haar wavelet; Symmlet

65 Meyer; Mexican Hat

66 Wavelets (cont'd). Example of translated and dilated wavelet basis functions.

67 Issues for signal denoising. Denoising via (wavelet) thresholding: - wavelet thresholding = sparse feature selection - a nonlinear estimator suitable for ERM. Main factors for signal denoising: representation (choice of basis functions); ordering (of basis functions) ~ SRM structure; thresholding (model selection). Large-sample setting: representation. Finite-sample setting: thresholding + ordering.

68 Framework for signal denoising. Ordering of (wavelet) basis functions = a structure on orthogonal basis functions. Traditional ordering: by frequency (scale). Better ordering: by decreasing magnitude of the wavelet coefficients. VC-thresholding: the optimal number of wavelets ~ minimum of the (analytic) VC-bound. Usually take VC-dimension h = m (the number of wavelets, or DoF).
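A self-contained sketch of wavelet thresholding using a hand-rolled orthonormal Haar transform (the slides use richer wavelets; Haar keeps the example dependency-free, and the threshold value is an illustrative stand-in for the VC-based choice):

```python
import math

def haar_dwt(signal):
    """Full orthonormal Haar DWT of a length-2^k signal.  Returns a list of
    detail-coefficient levels (finest first) with the coarsest
    approximation appended last."""
    coeffs, s = [], list(signal)
    while len(s) > 1:
        avg = [(s[i] + s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
        det = [(s[i] - s[i + 1]) / math.sqrt(2) for i in range(0, len(s), 2)]
        coeffs.append(det)
        s = avg
    coeffs.append(s)
    return coeffs

def haar_idwt(coeffs):
    """Exact inverse of haar_dwt."""
    coeffs = [list(c) for c in coeffs]
    s = coeffs.pop()                      # coarsest approximation
    while coeffs:
        det = coeffs.pop()                # coarsest remaining detail level
        nxt = []
        for a, d in zip(s, det):
            nxt.append((a + d) / math.sqrt(2))
            nxt.append((a - d) / math.sqrt(2))
        s = nxt
    return s

def denoise(signal, threshold):
    """Hard-threshold the detail coefficients (wavelet thresholding as
    sparse feature selection), then reconstruct the signal."""
    coeffs = haar_dwt(signal)
    kept = [[d if abs(d) > threshold else 0.0 for d in level]
            for level in coeffs[:-1]] + [coeffs[-1]]
    return haar_idwt(kept)
```

Ordering the detail coefficients by magnitude and keeping the top m, with m chosen to minimize the VC-bound, would give the VC-thresholding scheme described above.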

69 Empirical Results: signal denoising. Two target functions: Blocks and Heavisine. Data set: 128 noisy samples, SNR = 2.5

70 Empirical Results: Blocks signal estimated by VC-based denoising

71 Empirical Results: Heavisine signal estimated by VC-based denoising

72 Application Study: ECG Denoising

73 A closer look at a noisy segment

74 Denoised ECG signal. VC denoising applied to 4,096 noisy samples. The final model (below) has only 76 wavelets!

75 Discussion. Application of VC-theory to signal denoising: - orthogonal basis functions - nonlinear estimator: sparse feature selection. Finite-sample setting: importance of - ordering (of basis functions) ~ SRM structure - model selection (~ thresholding). Large-sample setting: - type of basis functions (representation)

76 OUTLINE of Set 4: Objectives and Overview; Inductive Learning Problem Setting; Keep-It-Direct Principle; Analysis of ERM; VC-dimension; Generalization Bounds; Structural Risk Minimization (SRM); Summary and Discussion

77 Summary and Discussion: VC-theory. Methodology: - learning problem setting (KID principle) - concepts (risk minimization, VC-dimension, structure). Interpretation/evaluation of existing methods. Model selection using VC-bounds. New types of inference (TBD later). What the theory cannot do: - provide a formalization (for a given application) - select a 'good' structure - fully bridge the gap between theory and applications.

78 References / more reading. Original references on VC-theory: V. Vapnik, The Nature of Statistical Learning Theory, Springer, 1995; V. Vapnik, Statistical Learning Theory, Wiley, 1998. Model selection using VC-bounds: V. Cherkassky, X. Shao, F. Mulier and V. Vapnik, "Model complexity control for regression using VC generalization bounds," IEEE Trans. on Neural Networks, 10(5), 1075-1089, 1999; V. Cherkassky and Y. Ma, "Comparison of model selection for regression," Neural Computation, 15(7), 1691-1714, 2003. Signal denoising using VC-theory: V. Cherkassky and X. Shao, "Signal estimation and denoising using VC-theory," Neural Networks, 14, 37-52, 2001; V. Cherkassky and S. Kilts, "Myopotential denoising of ECG signals using wavelet thresholding methods," Neural Networks, 14, 1129-1137, 2001.

