1 Introduction to Radial Basis Function Networks

2 Content
Overview
The Models of Function Approximator
The Radial Basis Function Networks
RBFN's for Function Approximation
The Projection Matrix
Learning the Kernels
Bias-Variance Dilemma
The Effective Number of Parameters
Model Selection
Incremental Operations

3 Introduction to Radial Basis Function Networks Overview

4 Typical Applications of NN Pattern Classification Function Approximation Time-Series Forecasting

5 Function Approximation: an unknown function f is approximated by an estimator f̂.

6 Neural Network Supervised Learning (block diagram): the unknown function and the neural network receive the same inputs; the difference between their outputs is the error signal used to adjust the network.

7 Neural Networks as Universal Approximators Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy. – Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366. Like feedforward neural networks with a single hidden layer of sigmoidal units, it can be shown that RBF networks are universal approximators. – Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257. – Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.

8 Statistics vs. Neural Networks
Statistics / Neural Networks
model / network
estimation / learning
regression / supervised learning
interpolation / generalization
observations / training set
parameters / (synaptic) weights
independent variables / inputs
dependent variables / outputs
ridge regression / weight decay

9 Introduction to Radial Basis Function Networks The Model of Function Approximator

10 Linear Models: y(x) = Σ_j w_j φ_j(x), a linear combination of fixed basis functions φ_j with adjustable weights w_j.

11 Linear Models (network diagram): the inputs x_1, …, x_n form the feature vector x; the hidden units φ_1, …, φ_m decompose it (feature extraction / transformation); the output y is the linearly weighted sum of the hidden-unit outputs with weights w_1, …, w_m.

12 Linear Models (same network diagram as above). Can you name some bases?

13 Example Linear Models: the polynomial basis and the Fourier series. Are they orthogonal bases?

14 y 11 11 22 22 mm mm x1x1 x2x2 xnxn w1w1 w2w2 wmwm x = Single-Layer Perceptrons as Universal Aproximators Hidden Units With sufficient number of sigmoidal units, it can be a universal approximator.

15 y 11 11 22 22 mm mm x1x1 x2x2 xnxn w1w1 w2w2 wmwm x = Radial Basis Function Networks as Universal Aproximators Hidden Units With sufficient number of radial-basis-function units, it can also be a universal approximator.

16 Non-Linear Models: y(x) = Σ_j w_j φ_j(x), where both the weights w_j and the basis functions φ_j themselves are adjusted by the learning process.

17 Introduction to Radial Basis Function Networks The Radial Basis Function Networks

18 Radial Basis Functions. A radial function φ_i(x) = φ(||x − x_i||) is specified by three things: a center x_i, a distance measure r = ||x − x_i||, and a shape φ.

19 Typical Radial Functions. Gaussian: φ(r) = exp(−r²/(2σ²)). Hardy multiquadratic: φ(r) = √(r² + c²). Inverse multiquadratic: φ(r) = 1/√(r² + c²).
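
The sketch below (Python with NumPy, which the slides themselves do not use) is a minimal illustration of these three radial functions; the width σ and shape parameter c are arbitrary illustrative choices.

```python
# Minimal sketch of the radial functions named above; sigma and c are
# illustrative width/shape parameters, not values taken from the slides.
import numpy as np

def gaussian(r, sigma=1.0):
    """Gaussian: phi(r) = exp(-r^2 / (2 sigma^2))."""
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadratic(r, c=1.0):
    """Hardy multiquadratic: phi(r) = sqrt(r^2 + c^2)."""
    return np.sqrt(r**2 + c**2)

def inverse_multiquadratic(r, c=1.0):
    """Inverse multiquadratic: phi(r) = 1 / sqrt(r^2 + c^2)."""
    return 1.0 / np.sqrt(r**2 + c**2)

r = np.linspace(0.0, 3.0, 7)             # radial distances r = ||x - x_i||
print(gaussian(r, sigma=0.5))            # localized: decays quickly away from the center
print(inverse_multiquadratic(r, c=2.0))  # also localized, with heavier tails
```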

20 Gaussian Basis Function (plot for σ = 0.5, 1.0, 1.5).

21 Inverse Multiquadratic (plot for c = 1, 2, 3, 4, 5).

22 Most General RBF (plot): a basis {φ_i : i = 1, 2, …} that is 'nearly' orthogonal.

23 Properties of RBFs: on-center, off-surround. Analogies with localized receptive fields found in several biological structures, e.g., the visual cortex and ganglion cells.

24 The Topology of RBF as a function approximator (network diagram): the inputs x_1, …, x_n (feature vectors) are projected onto the hidden units, and the outputs y_1, …, y_m are obtained by interpolation from the hidden-unit responses.

25 The Topology of RBF as a pattern classifier (network diagram): the hidden units correspond to subclasses and the output units to classes.

26 Introduction to Radial Basis Function Networks RBFN’s for Function Approximation

27 The idea (plot of y versus x): training data sampled from an unknown function that we want to approximate.

28 The idea: basis functions (kernels) are placed across the input range of the training data.

29 The idea: the learned function is a weighted combination of the basis functions (kernels).

30 The idea: the learned function, the basis functions (kernels), and a non-training sample.

31 The idea: the learned function evaluated at the non-training sample.

32 Radial Basis Function Networks as Universal Approximators (network diagram). Given a training set {(x_k, d_k) : k = 1, …, p}, the goal is y(x_k) = d_k for all k.

33 Learn the Optimal Weight Vector. Given the training set {(x_k, d_k)}, find the weights w_1, …, w_m such that y(x_k) = d_k for all k.

34 Regularization. Minimize the cost C(w) = Σ_k (d_k − y(x_k))² + Σ_j λ_j w_j² over the training set. If regularization is not needed, set the λ_j = 0.

35 Learn the Optimal Weight Vector. Minimize C(w) = Σ_{k=1}^{p} (d_k − y(x_k))² + Σ_{j=1}^{m} λ_j w_j².

36 Learn the Optimal Weight Vector. Define the target vector d = (d_1, …, d_p)ᵀ and, for each basis function, the vector φ_j = (φ_j(x_1), …, φ_j(x_p))ᵀ of its values on the training inputs.

37 Learn the Optimal Weight Vector. Define the design matrix Φ = [φ_1 … φ_m], with one row per training pattern and one column per basis function, so that the vector of network outputs is y = Φw.

38 Learn the Optimal Weight Vector. Setting the gradient of C(w) to zero gives ŵ = (ΦᵀΦ + Λ)⁻¹Φᵀd, where Λ = diag(λ_1, …, λ_m).

39 The matrix A⁻¹ = (ΦᵀΦ + Λ)⁻¹ is the variance matrix and Φ is the design matrix, so ŵ = A⁻¹Φᵀd.

40 Summary. Given the training set {(x_k, d_k)}, form the design matrix Φ and compute the optimal weights ŵ = A⁻¹Φᵀd, with A = ΦᵀΦ + Λ.
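
A minimal NumPy sketch of this step, assuming Gaussian kernels, a single global ridge parameter (Λ = λI), and synthetic data; the centers, width, and λ below are illustrative choices, not values from the slides.

```python
# Build a Gaussian design matrix for synthetic data and solve for the weights
# w = (Phi^T Phi + Lambda)^(-1) Phi^T d.  All numerical choices are illustrative.
import numpy as np

def design_matrix(X, centers, sigma):
    """Phi[k, j] = exp(-||x_k - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))             # training inputs x_k
d = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)  # noisy targets d_k
centers = X[::5]                                 # every 5th input as a center (simple subset selection)

Phi = design_matrix(X, centers, sigma=1.0)
lam = 1e-3                                       # Lambda = lam * I (single ridge parameter)
A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
w_hat = np.linalg.solve(A, Phi.T @ d)
print(w_hat)                                     # the learned weight vector
```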

41 Introduction to Radial Basis Function Networks The Projection Matrix

42 The Empirical-Error Vector (diagram): the unknown function generates the targets d; the network produces the outputs ŷ = Φŵ.

43 The Empirical-Error Vector (diagram, continued): the error vector is e = d − ŷ = d − Φŵ.

44 Sum-Squared-Error. SSE = eᵀe = ||d − Φŵ||². If λ = 0, the RBFN's learning algorithm minimizes the SSE (equivalently, the MSE).

45 The Projection Matrix. P = I − ΦA⁻¹Φᵀ = I − Φ(ΦᵀΦ + Λ)⁻¹Φᵀ, and the error vector is e = Pd.
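
A short NumPy sketch of this relation, using a random design matrix and target vector as stand-ins; it checks numerically that the residual d − Φŵ equals Pd.

```python
# Projection matrix P = I - Phi (Phi^T Phi + Lambda)^(-1) Phi^T and the
# identity e = d - Phi w = P d, verified on random stand-in data.
import numpy as np

rng = np.random.default_rng(1)
p, m, lam = 30, 6, 0.1
Phi = rng.normal(size=(p, m))        # design matrix (p patterns, m basis functions)
d = rng.normal(size=p)               # target vector

A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
w_hat = A_inv @ Phi.T @ d
P = np.eye(p) - Phi @ A_inv @ Phi.T  # projection matrix

e = d - Phi @ w_hat                  # empirical-error vector
print(np.allclose(e, P @ d))         # True: the error vector is P d
```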

46 Introduction to Radial Basis Function Networks Learning the Kernels

47 RBFNs as Universal Approximators (network diagram): inputs x_1, …, x_n, kernels φ_1, …, φ_m, outputs y_1, …, y_l, and weights w_ij from kernel j to output i, trained on the training set.

48 What to Learn? The weights w_ij, the centers μ_j of the φ_j, the widths σ_j of the φ_j, and the number of φ_j (model selection).

49 One-Stage Learning: all three sets of parameters (weights, centers, widths) are updated together.

50 The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or online settings.

51 Two-Stage Training (network diagram). Step 1 determines the centers μ_j of the φ_j, the widths σ_j of the φ_j, and the number of φ_j. Step 2 determines the weights w_ij, e.g., using batch learning.

52 Train the Kernels

53 Unsupervised Training (plot): the kernel centers, marked '+', are placed over the input data without using the target values.

54 Methods
Subset Selection: Random Subset Selection, Forward Selection, Backward Elimination
Clustering Algorithms: K-means, LVQ
Mixture Models: GMM

55 Subset Selection

56 Random Subset Selection. Randomly choose a subset of points from the training set as centers. The result is sensitive to the initially chosen points. Adaptive techniques can then be used to tune the centers, the widths, and the number of points.

57 Clustering Algorithms Partition the data points into K clusters.

58 Clustering Algorithms Is such a partition satisfactory?

59 Clustering Algorithms How about this?

60 Clustering Algorithms (plot): the cluster centers μ_1, μ_2, μ_3, μ_4, marked '+', one inside each cluster.
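
A minimal NumPy sketch of one such clustering algorithm, Lloyd's k-means, used here to place candidate kernel centers; the number of clusters, the synthetic data, and the random initialization are assumptions of the sketch.

```python
# Lloyd's k-means: alternate between assigning points to the nearest center
# and moving each center to the mean of its cluster.
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # initialize from the data
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                           # nearest-center assignment
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 0.3, size=(40, 2))
               for loc in ((0, 0), (2, 2), (0, 2), (2, 0))])  # four synthetic clusters
centers, labels = kmeans(X, K=4)
print(centers)                                                # candidate RBF centers mu_1..mu_4
```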

61 Introduction to Radial Basis Function Networks Bias-Variance Dilemma

62 Goal Revisited. Ultimate goal: generalization, i.e., minimize the prediction error. Goal of our learning procedure: minimize the empirical error.

63 Badness of Fit Underfitting – A model (e.g., network) that is not sufficiently complex can fail to detect fully the signal in a complicated data set, leading to underfitting. – Produces excessive bias in the outputs. Overfitting – A model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting. – Produces excessive variance in the outputs.

64 Underfitting/Overfitting Avoidance
Model selection
Jittering
Early stopping
Weight decay (regularization, ridge regression)
Bayesian learning
Combining networks

65 Best Way to Avoid Overfitting Use lots of training data, e.g., – 30 times as many training cases as there are weights in the network. – for noise-free data, 5 times as many training cases as weights may be sufficient. Don’t arbitrarily reduce the number of weights for fear of underfitting.

66 Badness of Fit (plots): an underfit model and an overfit model.

67 Badness of Fit (plots, continued): underfit versus overfit.

68 Bias-Variance Dilemma. Underfit: large bias, small variance. Overfit: small bias, large variance. However, it's not really a dilemma.

69 Bias-Variance Dilemma. Underfit: large bias, small variance. Overfit: small bias, large variance. More on overfitting: it easily leads to predictions far beyond the range of the training data, and it produces wild predictions in multilayer perceptrons even with noise-free data.

70 Bias-Variance Dilemma. It's not really a dilemma: fitting more (toward overfitting) decreases bias and increases variance, while fitting less (toward underfitting) does the opposite.

71 Bias-Variance Dilemma (diagram). A set of candidate functions (e.g., determined by the number of hidden nodes used) contains the solutions obtained with training sets 1, 2, 3, …; the true model is observed through noise, and the bias is the systematic offset between the candidate solutions and the true model. What are the mean and the variance of the bias?

72 Bias-Variance Dilemma (diagram, continued). The spread of the solutions obtained from different training sets is the variance; their systematic offset from the true model is the bias.

73 Model Selection (diagram). To reduce the variance, reduce the effective number of parameters, e.g., reduce the number of hidden nodes.

74 Bias-Variance Dilemma (diagram). Goal: minimize the expected squared prediction error E[(d − y(x))²] between the noisy target and the model output.

75 Bias-Variance Dilemma. Goal: minimize E[(d − y(x))²]. Expanding this expectation, the cross term is 0 and the noise term is a constant that does not depend on the model.

76 Bias-Variance Dilemma. Expanding the remaining term, the cross term is again 0, which yields the bias-variance decomposition.

77 Goal: E[(d − y(x))²] = noise + bias² + variance. The noise term cannot be minimized; we must minimize both bias² and variance.

78 Model Complexity vs. Bias-Variance (plot): noise, bias², and variance as functions of model complexity (capacity).

79 Bias-Variance Dilemma (plot): the same noise, bias², and variance curves against model complexity (capacity).

80 Example (Polynomial Fits)

81

82 Polynomial fits of degree 1, 5, 10, and 15 (plots).
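
A small NumPy sketch in the same spirit: fit polynomials of degree 1, 5, 10, and 15 to noisy samples and compare training and test error; the target function sin(2πx), the sample sizes, and the noise level are illustrative assumptions.

```python
# Low degrees underfit (large training error); high degrees overfit
# (training error keeps shrinking while test error grows).
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(-1, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 5, 10, 15):
    coeffs = np.polyfit(x_train, y_train, degree)   # NumPy may warn that high degrees are ill-conditioned
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```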

83 Introduction to Radial Basis Function Networks The Effective Number of Parameters

84 Variance Estimation. Mean: μ = E[d]. Variance: σ² = E[(d − μ)²], estimated by (1/p) Σ_k (d_k − μ)² when μ is known. In general, μ is not available.

85 Variance Estimation. Estimating the mean by the sample mean d̄ loses 1 degree of freedom: σ² is estimated by (1/(p − 1)) Σ_k (d_k − d̄)².

86 Simple Linear Regression: fit the model y(x) = a + bx to the data.

87 Minimize the sum of squared errors Σ_k (d_k − a − b x_k)² with respect to a and b.

88 Mean Squared Error (MSE). Minimizing with two fitted parameters loses 2 degrees of freedom: σ² is estimated by (1/(p − 2)) Σ_k (d_k − y(x_k))².

89 Variance Estimation. In general, fitting a model with m parameters loses m degrees of freedom: σ² is estimated by (1/(p − m)) Σ_k (d_k − y(x_k))², where m is the number of parameters of the model.

90 The Number of Parameters (network diagram): a linear model with weights w_1, …, w_m has m parameters; the number of degrees of freedom is m.

91 The Effective Number of Parameters (γ). For the regularized network, γ is defined through the projection matrix: γ = p − trace(P).

92 The Effective Number of Parameters (γ). Facts (with proof on the slide): γ = p − trace(P) = trace(ΦA⁻¹Φᵀ), and without regularization γ = m.

93 Regularization. Penalize models with large weights: Cost = SSE + Σ_j λ_j w_j². The effective number of parameters: γ = p − trace(P).

94 Regularization. Cost = SSE + Σ_j λ_j w_j². Without penalty (λ_j = 0), there are m degrees of freedom to minimize the SSE (cost), and the effective number of parameters is γ = m.

95 Regularization. Cost = SSE + Σ_j λ_j w_j². With penalty (λ_j > 0), the freedom to minimize the SSE is reduced, and the effective number of parameters is γ < m.

96 Variance Estimation. Losing γ degrees of freedom: σ² is estimated by SSE/(p − γ), where γ is the effective number of parameters.

97 Variance Estimation. With the effective number of parameters γ = p − trace(P), the variance estimate can be written as dᵀP²d/(p − γ).
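
A short NumPy sketch of these two quantities, the effective number of parameters γ and the variance estimate dᵀP²d/(p − γ), computed on random stand-in data (Φ, d, and λ are not from the slides).

```python
# Effective number of parameters gamma = p - trace(P) and the variance
# estimate d^T P^2 d / (p - gamma) for a ridge-regularized linear model.
import numpy as np

rng = np.random.default_rng(4)
p, m, lam = 40, 8, 0.5
Phi = rng.normal(size=(p, m))
d = rng.normal(size=p)

P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
gamma = p - np.trace(P)                  # effective number of parameters (-> m as lam -> 0)
sigma2 = d @ P @ P @ d / (p - gamma)     # variance estimate
print(gamma, sigma2)
```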

98 Introduction to Radial Basis Function Networks Model Selection

99 Goal – Choose the fittest model Criteria – Least prediction error Main Tools (Estimate Model Fitness) – Cross validation – Projection matrix Methods – Weight decay (Ridge regression) – Pruning and Growing RBFN’s

100 Empirical Error vs. Model Fitness. Ultimate goal: generalization, i.e., minimize the prediction error (MSE). Goal of our learning procedure: minimize the empirical error.

101 Estimating Prediction Error. When you have plenty of data, use independent test sets; e.g., use the same training set to train different models and choose the best model by comparing on the test set. When data is scarce, use cross-validation or the bootstrap.

102 Cross-Validation. The simplest and most widely used method for estimating prediction error. Partition the original set in several different ways and compute an average score over the different partitions, e.g., k-fold cross-validation, leave-one-out cross-validation, generalized cross-validation.

103 K-Fold CV. Split the set D of available input-output patterns into k mutually exclusive subsets D_1, D_2, …, D_k. Train and test the learning algorithm k times; each time it is trained on D \ D_i and tested on D_i.

104 K-Fold CV (diagram): the available data.

105 K-Fold CV (diagram): the available data split into D_1, D_2, D_3, …, D_k; in each round one D_i serves as the test set while the rest form the training set, and the results are combined to estimate σ².
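
A minimal NumPy sketch of this procedure for a linear (RBF-style) model, using ridge regression as the learning algorithm; the fold count, ridge parameter, and synthetic data are assumptions of the sketch.

```python
# k-fold cross-validation: train on D \ D_i, test on D_i, average the test MSE.
import numpy as np

def kfold_mse(Phi, d, k=5, lam=1e-3, seed=0):
    p, m = Phi.shape
    idx = np.random.default_rng(seed).permutation(p)
    folds = np.array_split(idx, k)                    # D_1, ..., D_k
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])   # D \ D_i
        A = Phi[train].T @ Phi[train] + lam * np.eye(m)
        w = np.linalg.solve(A, Phi[train].T @ d[train])
        errors.append(np.mean((d[test] - Phi[test] @ w) ** 2))
    return np.mean(errors)                            # averaged test MSE (estimate of sigma^2)

rng = np.random.default_rng(5)
Phi = rng.normal(size=(60, 10))
d = Phi @ rng.normal(size=10) + 0.3 * rng.normal(size=60)
print(kfold_mse(Phi, d, k=5))
```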

106 Leave-One-Out CV. Split the p available input-output patterns into a training set of size p − 1 and a test set of size 1. Average the squared error on the left-out pattern over the p possible ways of partitioning. A special case of k-fold CV (k = p).

107 Error Variance Predicted by LOO (a special case of k-fold CV). Let D be the available input-output patterns, D_i the LOO training sets, and ŷ_i the function learned using D_i as training set. The LOO estimate of the variance of the prediction error averages the squared error on each left-out element: (1/p) Σ_i (d_i − ŷ_i(x_i))².

108 Error Variance Predicted by LOO. Each ŷ_i is, for the given model, the function with least empirical error on D_i. The LOO estimate serves as an index of the model's fitness: we want to find a model that also minimizes it.

109 Error Variance Predicted by LOO. How can the error on each left-out element be estimated? Are there any efficient ways?

110 Error Variance Predicted by LOO. For a linear model the error on each left-out element can be obtained from a single fit: the left-out residual is (Pd)_i / P_ii, so the LOO estimate is (1/p) Σ_i ((Pd)_i / P_ii)² = (1/p) dᵀP (diag P)⁻² P d.
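
A NumPy sketch of this shortcut. The identity used, left-out residual = (Pd)_i / P_ii, is the standard leave-one-out result for ridge-type linear models; the brute-force loop below confirms it on random stand-in data.

```python
# Efficient LOO estimate from one fit versus brute-force retraining p times.
import numpy as np

rng = np.random.default_rng(6)
p, m, lam = 25, 5, 0.2
Phi = rng.normal(size=(p, m))
d = Phi @ rng.normal(size=m) + 0.1 * rng.normal(size=p)

P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
loo_fast = np.mean((P @ d / np.diag(P)) ** 2)         # (1/p) sum_i ((Pd)_i / P_ii)^2

errs = []                                             # brute force: leave each pattern out in turn
for i in range(p):
    keep = np.arange(p) != i
    A = Phi[keep].T @ Phi[keep] + lam * np.eye(m)
    w = np.linalg.solve(A, Phi[keep].T @ d[keep])
    errs.append((d[i] - Phi[i] @ w) ** 2)

print(np.isclose(loo_fast, np.mean(errs)))            # True: the shortcut matches retraining
```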

111 Generalized Cross-Validation. Replacing each diagonal element P_ii by its average value trace(P)/p gives the GCV estimate p dᵀP²d / (trace P)².

112 More Criteria Based on CV: GCV (generalized cross-validation), UEV (unbiased estimate of variance), FPE (final prediction error), Akaike's information criterion (AIC), and BIC (Bayesian information criterion).

113 More Criteria Based on CV. Each criterion scales the quantity dᵀP²d by a factor that depends on p and the effective number of parameters γ; for example, UEV = dᵀP²d/(p − γ) and GCV = p dᵀP²d/(p − γ)² (using trace P = p − γ).
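
A short NumPy sketch computing two of these criteria, UEV and GCV, from the projection matrix; the exact factors for FPE, AIC, and BIC are not reproduced in this transcript and are therefore omitted. Φ, d, and the λ values are stand-ins.

```python
# UEV = d^T P^2 d / (p - gamma) and GCV = p * d^T P^2 d / (trace P)^2,
# both computed from the projection matrix of a ridge-regularized model.
import numpy as np

def uev_gcv(Phi, d, lam):
    p, m = Phi.shape
    P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
    sse_like = d @ P @ P @ d
    gamma = p - np.trace(P)
    return sse_like / (p - gamma), p * sse_like / np.trace(P) ** 2

rng = np.random.default_rng(7)
Phi = rng.normal(size=(50, 8))
d = Phi @ rng.normal(size=8) + 0.2 * rng.normal(size=50)
for lam in (1e-4, 1e-2, 1.0):
    print(lam, uev_gcv(Phi, d, lam))   # compare the criteria across regularization strengths
```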

114

115 Regularization. Penalize models with large weights: Cost = SSE + λ Σ_j w_j² (standard ridge regression, with a single global λ).

116 Regularization. For standard ridge regression the solution is ŵ = (ΦᵀΦ + λI)⁻¹Φᵀd.

117 Solution Review. ŵ = A⁻¹Φᵀd with A = ΦᵀΦ + λI; the projection matrix P = I − ΦA⁻¹Φᵀ is used to compute the model selection criteria.

118 Example (plots): width of RBF r = 0.5.

119 Example (plots): width of RBF r = 0.5.

120 Example (plots): width of RBF r = 0.5. How do we determine the optimal regularization parameter effectively?

121 Optimizing the Regularization Parameter: a re-estimation formula for λ, iterated until it converges.

122 Local Ridge Regression: a re-estimation formula for the individual λ_j.

123 Example — Width of RBF

124 Example — Width of RBF

125 Example — Width of RBF. There are two local minima. Using the above re-estimation formula, the iteration gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.

126 Example — Width of RBF. There are two local minima.

127 Example — Width of RBF. There are two local minima.

128 Example — Width of RBF. RMSE: root mean squared error; in a real case it is not available.

129 Example — Width of RBF. RMSE: root mean squared error; in a real case it is not available.

130 Local Ridge Regression. Standard ridge regression: Cost = SSE + λ Σ_j w_j² (one global λ). Local ridge regression: Cost = SSE + Σ_j λ_j w_j² (a separate λ_j for each basis function).

131 Standard vs. Local Ridge Regression. λ_j → ∞ implies that φ_j(·) can be removed from the model.

132 The Solutions (used to compute the model selection criteria). Linear regression: ŵ = (ΦᵀΦ)⁻¹Φᵀd. Standard ridge regression: ŵ = (ΦᵀΦ + λI)⁻¹Φᵀd. Local ridge regression: ŵ = (ΦᵀΦ + Λ)⁻¹Φᵀd, with Λ = diag(λ_1, …, λ_m). In each case the corresponding projection matrix is used to compute the model selection criteria.

133 Optimizing the Regularization Parameters (incremental operation). P: the current projection matrix. P_j: the projection matrix obtained by removing φ_j(·).

134 Optimizing the Regularization Parameters. For each j, solve for the λ_j that optimizes the model selection criterion, subject to λ_j ≥ 0.

135 Optimizing the Regularization Parameters. The optimization is carried out one λ_j at a time, with the other regularization parameters held fixed.

136 Optimizing the Regularization Parameters. If the optimum is λ_j → ∞, remove φ_j(·).

137 The Algorithm. Initialize the λ_j's, e.g., by performing standard ridge regression. Repeat the following until GCV converges: randomly select j and compute its updated regularization parameter; perform local ridge regression; if GCV is reduced and λ_j → ∞, remove φ_j(·).
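
The closed-form re-estimation formula is not reproduced in this transcript, so the NumPy sketch below substitutes a crude grid search over each λ_j (with λ_j = ∞ meaning that φ_j is removed) while keeping the same loop structure: pick a random j, try a local update, and accept it only if GCV decreases. All data and grid values are illustrative stand-ins.

```python
# Coordinate-wise local ridge regression driven by GCV; lambda_j = inf drops phi_j.
import numpy as np

def gcv(Phi, d, lambdas):
    p = Phi.shape[0]
    keep = np.isfinite(lambdas)                     # lambda_j = inf removes phi_j
    Phi_k, lam_k = Phi[:, keep], lambdas[keep]
    P = np.eye(p) - Phi_k @ np.linalg.inv(Phi_k.T @ Phi_k + np.diag(lam_k)) @ Phi_k.T
    return p * (d @ P @ P @ d) / np.trace(P) ** 2

def local_ridge(Phi, d, init_lam=0.1, sweeps=50, seed=0):
    rng = np.random.default_rng(seed)
    m = Phi.shape[1]
    lambdas = np.full(m, init_lam)                  # initialize, e.g., from standard ridge regression
    best = gcv(Phi, d, lambdas)
    grid = np.array([0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0, np.inf])
    for _ in range(sweeps):                         # repeat until GCV (roughly) converges
        j = rng.integers(m)                         # randomly select j
        for lam_j in grid:
            trial = lambdas.copy()
            trial[j] = lam_j
            score = gcv(Phi, d, trial)
            if score < best:                        # keep the change only if GCV is reduced
                best, lambdas = score, trial
    return lambdas, best

rng = np.random.default_rng(8)
Phi = rng.normal(size=(40, 6))
d = Phi[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
lambdas, score = local_ridge(Phi, d)
print(lambdas, score)   # lambda_j = inf marks basis functions that were removed
```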

138 References
Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," International Joint Conference on Artificial Intelligence (IJCAI).
Orr, M. J. L. (April 1996). Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html

139 Introduction to Radial Basis Function Networks Incremental Operations

