1
Introduction to Radial Basis Function Networks
2
Content Overview
– The Models of Function Approximator
– The Radial Basis Function Networks
– RBFN's for Function Approximation
– The Projection Matrix
– Learning the Kernels
– Bias-Variance Dilemma
– The Effective Number of Parameters
– Model Selection
– Incremental Operations
3
Introduction to Radial Basis Function Networks Overview
4
Typical Applications of NN
– Pattern Classification
– Function Approximation
– Time-Series Forecasting
5
Function Approximation: an unknown function f is approximated by a learned approximator f̂ (diagram).
6
Supervised Learning with a Neural Network: the network is trained to match the output of an unknown function; the difference between the target output and the network output drives the weight updates (diagram).
7
Neural Networks as Universal Approximators
Feedforward neural networks with a single hidden layer of sigmoidal units can approximate any continuous multivariate function uniformly, to any desired degree of accuracy.
– Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366.
Like feedforward networks with a single hidden layer of sigmoidal units, RBF networks can be shown to be universal approximators.
– Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257.
– Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.
8
Statistics vs. Neural Networks
Statistics | Neural Networks
model | network
estimation | learning
regression | supervised learning
interpolation | generalization
observations | training set
parameters | (synaptic) weights
independent variables | inputs
dependent variables | outputs
ridge regression | weight decay
9
Introduction to Radial Basis Function Networks The Model of Function Approximator
10
Linear Models: f(x) = Σ_{j=1..m} w_j φ_j(x), a linearly weighted sum of fixed basis functions φ_j with adjustable weights w_j.
11
Linear Models as a network (diagram): the inputs x1, …, xn form the feature vector; the hidden units compute the features φ1(x), …, φm(x) (decomposition, feature extraction, transformation); the output unit forms the linearly weighted output y = Σ_j w_j φ_j(x).
12
Linear Models, continued (same diagram): inputs x1, …, xn, hidden feature units φ1, …, φm, weights w1, …, wm, and linearly weighted output y. Can you name some example bases?
13
Example Linear Models: the polynomial basis, f(x) = Σ_j w_j x^j, and the Fourier series, f(x) = Σ_j [a_j cos(jωx) + b_j sin(jωx)]. Are they orthogonal bases?
14
Single-Layer Perceptrons as Universal Approximators (same network diagram, with sigmoidal hidden units): with a sufficient number of sigmoidal units, the network can be a universal approximator.
15
Radial Basis Function Networks as Universal Approximators (same network diagram, with radial-basis-function hidden units): with a sufficient number of radial-basis-function units, the network can also be a universal approximator.
16
Non-Linear Models: f(x) = Σ_{j=1..m} w_j φ_j(x), where not only the weights w_j but also the basis functions φ_j themselves are adjusted by the learning process.
17
Introduction to Radial Basis Function Networks The Radial Basis Function Networks
18
Radial Basis Functions. Three ingredients define a radial function φ_i(x) = φ(‖x − x_i‖): the center x_i, the distance measure r = ‖x − x_i‖, and the shape φ.
19
Typical Radial Functions (sketched in code below):
– Gaussian: φ(r) = exp(−r² / (2σ²))
– Hardy's multiquadric: φ(r) = √(r² + c²)
– Inverse multiquadric: φ(r) = 1 / √(r² + c²)
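For concreteness, here is a minimal NumPy sketch of these radial functions; the width σ and offset c are the shape parameters, and the parameterizations used are the common textbook forms, which may differ slightly from the exact expressions on the original slide.

```python
import numpy as np

def gaussian(r, sigma=1.0):
    # phi(r) = exp(-r^2 / (2 sigma^2))
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadric(r, c=1.0):
    # Hardy's multiquadric: phi(r) = sqrt(r^2 + c^2)
    return np.sqrt(r**2 + c**2)

def inverse_multiquadric(r, c=1.0):
    # phi(r) = 1 / sqrt(r^2 + c^2)
    return 1.0 / np.sqrt(r**2 + c**2)

# r is the distance to the centre, r = ||x - x_i||.
r = np.linspace(0.0, 3.0, 7)
print(gaussian(r, sigma=0.5))
print(inverse_multiquadric(r, c=2.0))
```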
20
Gaussian Basis Function: plots for σ = 0.5, 1.0, 1.5 (figure).
21
Inverse Multiquadric: plots for c = 1, 2, 3, 4, 5 (figure).
22
Most General RBF (figure: several kernels with their centers marked '+'). The basis {φ_i : i = 1, 2, …} is 'nearly' orthogonal.
23
Properties of RBF's
– On-center, off-surround.
– Analogies with localized receptive fields found in several biological structures, e.g., the visual cortex and ganglion cells.
24
The Topology of RBF as a function approximator (diagram): inputs x1, …, xn (the feature vector), hidden units performing a projection, and output units y1, …, ym performing interpolation.
25
The Topology of RBF as a pattern classifier (diagram): inputs x1, …, xn (the feature vector), hidden units representing subclasses, and output units y1, …, ym representing classes.
26
Introduction to Radial Basis Function Networks RBFN’s for Function Approximation
27
The idea (plot): training data sampled, with noise, from an unknown function y = f(x) that we want to approximate.
28
The idea (plot): basis functions (kernels) are placed along the input axis underneath the training data.
29
The idea (plot): the learned function is a weighted sum of the basis functions (kernels).
30
The idea (plot): the learned function evaluated at a non-training sample, shown together with the basis functions (kernels).
31
The idea (plot): the learned function and its prediction at a non-training sample.
32
Radial Basis Function Networks as Universal Approximators (network diagram: inputs x1, …, xn, radial-basis hidden units, weights w1, …, wm). Training set: {(x_k, d_k)}, k = 1, …, p. Goal: f(x_k) ≈ d_k for all k.
33
Learn the Optimal Weight Vector. With the kernels fixed, f(x) = Σ_j w_j φ_j(x); given the training set {(x_k, d_k)}, the goal f(x_k) ≈ d_k for all k is pursued by choosing the weight vector w.
34
Regularization. Given the training set {(x_k, d_k)} and the goal f(x_k) ≈ d_k for all k, minimize the regularized cost C(w) = Σ_k (d_k − f(x_k))² + Σ_j λ_j w_j². If regularization is not needed, set λ_j = 0.
35
Learn the Optimal Weight Vector. Minimize C(w) = Σ_{k=1..p} (d_k − Σ_{j=1..m} w_j φ_j(x_k))² + Σ_{j=1..m} λ_j w_j².
36
Learn the Optimal Weight Vector. Define d = (d_1, …, d_p)ᵀ, w = (w_1, …, w_m)ᵀ, and the design matrix Φ with entries Φ_{kj} = φ_j(x_k).
37
Learn the Optimal Weight Vector. Define Λ = diag(λ_1, …, λ_m), so the cost becomes C(w) = ‖d − Φw‖² + wᵀΛw.
38
Learn the Optimal Weight Vector. Setting ∂C/∂w = 0 gives the optimal weight vector ŵ = (ΦᵀΦ + Λ)⁻¹Φᵀd.
39
Design matrix Φ (Φ_{kj} = φ_j(x_k)) and variance matrix A⁻¹ = (ΦᵀΦ + Λ)⁻¹; the optimal weights are ŵ = A⁻¹Φᵀd.
40
Summary. Training set {(x_k, d_k)}, k = 1, …, p; model f(x) = Σ_j w_j φ_j(x); optimal weights ŵ = (ΦᵀΦ + Λ)⁻¹Φᵀd.
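As an illustration of this batch solution, here is a minimal NumPy sketch; the toy training data (a noisy sine) and the Gaussian kernels with hand-picked centers and width are invented for the example, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: p noisy samples of an assumed unknown function.
p = 50
x = np.linspace(0.0, 1.0, p)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(p)

# Gaussian kernels: m centres spread over the input range, common width sigma.
m, sigma = 10, 0.1
centers = np.linspace(0.0, 1.0, m)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))  # p x m design matrix

# Regularised least squares: w_hat = (Phi^T Phi + Lambda)^-1 Phi^T d,
# with Lambda = lam * I here (set lam = 0 for plain least squares).
lam = 1e-3
A = Phi.T @ Phi + lam * np.eye(m)
w_hat = np.linalg.solve(A, Phi.T @ d)

f_hat = Phi @ w_hat                  # fitted values on the training inputs
sse = np.sum((d - f_hat)**2)         # sum-squared error
print(w_hat.shape, sse)
```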
41
Introduction to Radial Basis Function Networks The Projection Matrix
42
The Empirical-Error Vector (diagram): targets d produced by the unknown function vs. the model's fitted values Φŵ.
43
The Empirical-Error Vector (diagram): the error vector is e = d − Φŵ, the difference between the targets and the fitted values.
44
Sum-Squared-Error. SSE = ‖e‖² = Σ_k (d_k − f(x_k))². If λ = 0, the RBFN's learning algorithm simply minimizes the SSE (equivalently, the MSE).
45
The Projection Matrix. The error vector can be written as e = Pd, where P = I − Φ(ΦᵀΦ + Λ)⁻¹Φᵀ is the projection matrix.
46
Introduction to Radial Basis Function Networks Learning the Kernels
47
RBFN’s as Universal Approximators x1x1 x2x2 xnxn y1y1 ylyl 11 22 mm w 11 w 12 w1mw1m wl1wl1 wl2wl2 w lml Training set Kernels
48
What to Learn? (network diagram as above)
– Weights w_ij
– Centers μ_j of the kernels φ_j
– Widths σ_j of the kernels φ_j
– Number of kernels φ_j (model selection)
49
One-Stage Learning
50
Simultaneously updating all three sets of parameters may be suitable for non-stationary environments or an on-line setting.
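A hedged sketch of what such one-stage (simultaneous) learning can look like for a single-output Gaussian RBFN: stochastic gradient descent on the squared error of one pattern at a time, updating weights, centers, and widths together. The data stream, learning rates, and initialization below are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy online data stream: noisy samples of an assumed target function.
def sample_pattern():
    x = rng.uniform(0.0, 1.0)
    d = np.sin(2 * np.pi * x) + 0.05 * rng.standard_normal()
    return x, d

# One output, m Gaussian kernels; all three parameter sets are adapted
# simultaneously by stochastic gradient descent on 0.5 * e^2.
m = 8
w = 0.1 * rng.standard_normal(m)        # output weights
c = np.linspace(0.0, 1.0, m)            # kernel centres
s = np.full(m, 0.2)                     # kernel widths
eta_w, eta_c, eta_s = 0.05, 0.02, 0.02  # illustrative learning rates

for step in range(5000):
    x, d = sample_pattern()
    phi = np.exp(-(x - c)**2 / (2 * s**2))
    e = d - phi @ w                      # instantaneous error
    # Chain rule through phi: d(phi_j)/dc_j = phi_j * (x - c_j) / s_j^2, etc.
    w += eta_w * e * phi
    c += eta_c * e * w * phi * (x - c) / s**2
    s += eta_s * e * w * phi * (x - c)**2 / s**3

print("final weights:", np.round(w, 3))
```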
51
Two-Stage Training (network diagram as above). Step 1: determine the centers μ_j, the widths σ_j, and the number of kernels φ_j. Step 2: determine the weights w_ij, e.g., using batch learning.
52
Train the Kernels
53
Unsupervised Training (figure): kernel centers, marked '+', placed among the data points.
54
Methods
Subset Selection
– Random Subset Selection
– Forward Selection
– Backward Elimination
Clustering Algorithms
– K-means
– LVQ
Mixture Models
– GMM
55
Subset Selection
56
Random Subset Selection. Randomly choose a subset of points from the training set as centers. This is sensitive to the initially chosen points, so adaptive techniques may be used to tune the centers, the widths, and the number of points.
57
Clustering Algorithms Partition the data points into K clusters.
58
Clustering Algorithms Is such a partition satisfactory?
59
Clustering Algorithms How about this?
60
Clustering Algorithms (figure): the data partitioned into four clusters with centers μ1, μ2, μ3, μ4 marked '+'.
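For concreteness, here is a minimal K-means (Lloyd's algorithm) sketch for placing kernel centers; the 2-D blob data and K = 4 are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy 2-D data: four blobs, roughly as in the picture above.
data = np.concatenate([rng.normal(mu, 0.3, size=(50, 2))
                       for mu in [(0, 0), (3, 0), (0, 3), (3, 3)]])

K = 4
centers = data[rng.choice(len(data), K, replace=False)]  # random initial centres

for _ in range(20):                       # Lloyd iterations
    # Assign each point to its nearest centre.
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Move each centre to the mean of its cluster (keep it if the cluster is empty).
    centers = np.array([data[labels == k].mean(axis=0) if np.any(labels == k)
                        else centers[k] for k in range(K)])

# The resulting centres mu_1..mu_K can serve as RBF kernel centres; a common
# heuristic (an assumption here, not from the slides) sets each width from the
# average distance between a centre and the points assigned to it.
print(np.round(centers, 2))
```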
61
Introduction to Radial Basis Function Networks Bias-Variance Dilemma
62
Goal Revisited. Ultimate goal: generalization, i.e., minimize the prediction error. Goal of our learning procedure: minimize the empirical error.
63
Badness of Fit
Underfitting
– A model (e.g., a network) that is not sufficiently complex can fail to detect the signal fully in a complicated data set, leading to underfitting.
– It produces excessive bias in the outputs.
Overfitting
– A model (e.g., a network) that is too complex may fit the noise, not just the signal, leading to overfitting.
– It produces excessive variance in the outputs.
64
Underfitting/Overfitting Avoidance
– Model selection
– Jittering
– Early stopping
– Weight decay (regularization, ridge regression)
– Bayesian learning
– Combining networks
65
Best Way to Avoid Overfitting
Use lots of training data, e.g.,
– 30 times as many training cases as there are weights in the network;
– for noise-free data, 5 times as many training cases as weights may be sufficient.
Don't arbitrarily reduce the number of weights for fear of underfitting.
66
Badness of Fit (figure): example fits illustrating an underfit (left) and an overfit (right).
67
Badness of Fit (figure): another pair of fits, underfit vs. overfit.
68
Bias-Variance Dilemma. Underfit: large bias, small variance. Overfit: small bias, large variance. However, it is not really a dilemma.
69
Bias-Variance Dilemma. Underfit: large bias, small variance. Overfit: small bias, large variance. More on overfitting: it can easily lead to predictions far beyond the range of the training data, and it can produce wild predictions in multilayer perceptrons even with noise-free data.
70
Bias-Variance Dilemma: it is not really a dilemma (figure). As the model fits the data more closely, it moves from underfitting toward overfitting: the bias decreases while the variance increases.
71
Bias-Variance Dilemma (figure): the set of attainable functions (depending, e.g., on the number of hidden nodes used), the true model, the noise, and the solutions obtained with training sets 1, 2, and 3; the bias is the gap between the true model and the average solution. The mean of the bias = ? The variance of the bias = ?
72
Bias-Variance Dilemma (figure): the same picture, with the bias and the variance of the solutions marked. The mean of the bias = ? The variance of the bias = ?
73
Model Selection (figure): the set of attainable functions (e.g., depending on the number of hidden nodes used), the true model, the noise, the bias, and the variance. To reduce the variance, reduce the effective number of parameters, i.e., reduce the number of hidden nodes.
74
Bias-Variance Dilemma (figure as before). Goal: minimize the expected squared prediction error E[(d − f(x; D))²], where d = g(x) + ε is a noisy observation of the true model g and f(x; D) is the function learned from training set D.
75
Bias-Variance Dilemma. Expanding the expectation, the cross term involving the noise vanishes (= 0), and the noise contributes a constant term σ².
76
Bias-Variance Dilemma. Similarly, E_D[(g(x) − f(x; D))²] splits into (E_D[f(x; D)] − g(x))² plus E_D[(f(x; D) − E_D[f(x; D)])²]; the remaining cross term is again 0.
77
Goal: expected squared error = noise (σ²) + bias² + variance. The noise term cannot be minimized; we therefore minimize both bias² and variance.
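A small Monte Carlo sketch of this decomposition: many training sets are drawn from an assumed true model plus noise, a learner is fit to each, and the predictions at one fixed test input are used to estimate bias² and variance. The sine target, the noise level, and the degree-5 polynomial learner are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)

g = lambda x: np.sin(2 * np.pi * x)     # assumed true model
noise_sd, p, degree = 0.2, 30, 5
x_star = 0.3                            # test input where the error is decomposed

preds = []
for trial in range(2000):               # many independent training sets D
    x = rng.uniform(0.0, 1.0, p)
    d = g(x) + noise_sd * rng.standard_normal(p)
    coeffs = np.polyfit(x, d, degree)   # the learner (illustrative choice)
    preds.append(np.polyval(coeffs, x_star))
preds = np.array(preds)

bias2 = (preds.mean() - g(x_star))**2   # (E_D[f(x;D)] - g(x))^2
variance = preds.var()                  # E_D[(f(x;D) - E_D[f(x;D)])^2]
print(f"noise = {noise_sd**2:.4f}, bias^2 = {bias2:.5f}, variance = {variance:.5f}")
```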
78
Model Complexity vs. Bias-Variance (figure): as model complexity (capacity) grows, bias² falls while variance rises; the goal is to minimize noise + bias² + variance.
79
Bias-Variance Dilemma (figure): the total error, noise + bias² + variance, is minimized at an intermediate model complexity (capacity).
80
Example (Polynomial Fits)
82
Polynomial fits of degree 1, 5, 10, and 15 to the same data (four panels).
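The following sketch reproduces this kind of experiment with NumPy's polyfit; the data-generating function and noise level are invented for illustration. Low degrees underfit, while high degrees chase the noise.

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy samples of an assumed smooth target on [0, 1].
x = np.linspace(0.0, 1.0, 20)
y = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(x.size)

x_test = np.linspace(0.0, 1.0, 200)
y_true = np.sin(2 * np.pi * x_test)

for degree in (1, 5, 10, 15):
    # Least-squares polynomial fit (high degrees on 20 points are
    # ill-conditioned; NumPy may emit a RankWarning but still runs).
    coeffs = np.polyfit(x, y, degree)
    y_hat = np.polyval(coeffs, x_test)
    test_rmse = np.sqrt(np.mean((y_hat - y_true)**2))
    print(f"degree {degree:2d}: test RMSE = {test_rmse:.3f}")
```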
83
Introduction to Radial Basis Function Networks The Effective Number of Parameters
84
Variance Estimation. With a known mean μ, the variance of samples y_1, …, y_p is estimated by σ̂² = (1/p) Σ_i (y_i − μ)². In general, μ is not available.
85
Variance Estimation. Using the sample mean ȳ instead, σ̂² = (1/(p − 1)) Σ_i (y_i − ȳ)²: estimating the mean costs one degree of freedom.
86
Simple Linear Regression
87
Minimize the sum of squared residuals, SSE = Σ_i (y_i − (a x_i + b))².
88
Mean Squared Error (MSE). Minimizing the SSE and estimating the two regression coefficients costs two degrees of freedom: σ̂² = SSE / (p − 2).
89
Variance Estimation. In general, fitting a model with m parameters costs m degrees of freedom: σ̂² = SSE / (p − m).
90
The Number of Parameters. For the linear RBFN with inputs x1, …, xn and weights w1, …, wm, the number of parameters, and hence the number of degrees of freedom, is m.
91
The Effective Number of Parameters (γ). For the regularized network, γ = p − trace(P), where P is the projection matrix.
92
The Effective Number of Parameters (γ). Fact: when Λ = 0, trace(P) = p − m, so γ = m. Proof: trace(Φ(ΦᵀΦ)⁻¹Φᵀ) = trace((ΦᵀΦ)⁻¹ΦᵀΦ) = trace(I_m) = m, hence trace(P) = p − m.
93
Regularization penalizes models with large weights: Cost = SSE + Σ_j λ_j w_j². The effective number of parameters is γ = p − trace(P).
94
Regularization penalizes models with large weights: Cost = SSE + Σ_j λ_j w_j². Without a penalty (λ_j = 0), there are m degrees of freedom available to minimize the SSE (the cost), and the effective number of parameters equals m.
95
Regularization penalizes models with large weights: Cost = SSE + Σ_j λ_j w_j². With a penalty (λ_j > 0), the freedom to minimize the SSE is reduced, and the effective number of parameters is less than m.
96
Variance Estimation. Fitting the regularized model costs γ degrees of freedom, where γ is the effective number of parameters: σ̂² = SSE / (p − γ).
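A short sketch of this estimate on an invented ridge-RBF problem: γ = p − trace(P) with P = I − Φ(ΦᵀΦ + λI)⁻¹Φᵀ, and σ̂² = SSE / (p − γ). The data, kernels, and λ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy regression problem with Gaussian kernels (invented data).
p, m, sigma, lam = 40, 12, 0.1, 1e-2
x = np.linspace(0.0, 1.0, p)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(p)
centers = np.linspace(0.0, 1.0, m)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))   # "variance matrix"
P = np.eye(p) - Phi @ A_inv @ Phi.T                    # projection matrix

gamma = p - np.trace(P)                 # effective number of parameters
e = P @ d                               # error vector (= d - Phi w_hat)
sse = e @ e
sigma2_hat = sse / (p - gamma)          # variance estimate with gamma dof lost
print(f"gamma = {gamma:.2f}, sigma^2 estimate = {sigma2_hat:.4f}")
```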
97
Variance Estimation, written with the effective number of parameters: σ̂² = SSE / (p − γ) = dᵀP²d / (p − γ).
98
Introduction to Radial Basis Function Networks Model Selection
99
Goal: choose the fittest model.
Criterion: least prediction error.
Main tools (to estimate model fitness):
– Cross-validation
– Projection matrix
Methods:
– Weight decay (ridge regression)
– Pruning and growing RBFN's
100
Empirical Error vs. Model Fitness. Ultimate goal: generalization, i.e., minimize the prediction error (MSE). Goal of our learning procedure: minimize the empirical error.
101
Estimating Prediction Error. When you have plenty of data, use independent test sets: e.g., train different models on the same training set and choose the best model by comparing them on the test set. When data are scarce, use cross-validation or the bootstrap.
102
Cross-Validation. The simplest and most widely used method for estimating prediction error. Partition the original set in several different ways and compute an average score over the different partitions, e.g., k-fold cross-validation, leave-one-out cross-validation, and generalized cross-validation.
103
K-Fold CV. Split the set D of available input-output patterns into k mutually exclusive subsets D_1, D_2, …, D_k. Train and test the learning algorithm k times; each time it is trained on D \ D_i and tested on D_i.
104
K-Fold CV Available Data
105
K-Fold CV (figure): the available data are split into D_1, D_2, D_3, …, D_k; in each of the k rounds one subset serves as the test set and the remaining subsets form the training set, and the results are combined to estimate σ².
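A minimal k-fold CV sketch in plain NumPy (no library helpers); the data set, the ridge-RBF learner, and k = 5 are illustrative assumptions. Each fold serves once as the test set while the remaining folds train the model, and the fold scores are averaged.

```python
import numpy as np

rng = np.random.default_rng(5)

# Invented data set D and a simple ridge-RBF trainer used as the learner.
p = 60
x = np.linspace(0.0, 1.0, p)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(p)

def design(x, centers, sigma=0.1):
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

def train(x_tr, d_tr, m=10, lam=1e-3):
    centers = np.linspace(0.0, 1.0, m)
    Phi = design(x_tr, centers)
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ d_tr)
    return centers, w

k = 5
indices = rng.permutation(p)
folds = np.array_split(indices, k)       # D_1, ..., D_k

fold_mse = []
for i in range(k):
    test_idx = folds[i]
    train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
    centers, w = train(x[train_idx], d[train_idx])      # train on D \ D_i
    pred = design(x[test_idx], centers) @ w             # test on D_i
    fold_mse.append(np.mean((d[test_idx] - pred)**2))

print("CV estimate of the prediction-error variance:", np.mean(fold_mse))
```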
106
Leave-One-Out CV. Split the p available input-output patterns into a training set of size p − 1 and a test set of size 1. Average the squared error on the left-out pattern over the p possible ways of partitioning. This is a special case of k-fold CV.
107
Error Variance Predicted by LOO (a special case of k-fold CV). D = {(x_i, d_i)}: the available input-output patterns. D_i = D \ {(x_i, d_i)}: the LOO training sets. f_i: the function learned using D_i as training set. The LOO estimate of the prediction-error variance averages the squared error on each left-out element: σ̂²_LOO = (1/p) Σ_i (d_i − f_i(x_i))².
108
Error Variance Predicted by LOO, continued. For a given model, f_i is the function with least empirical error on D_i, and σ̂²_LOO serves as an index of the model's fitness: we want to find the model that also minimizes it.
109
Error Variance Predicted by LOO, continued. How can the squared error on each left-out element be estimated? Are there efficient ways to compute it?
110
Error Variance Predicted by LOO. For a linear model the left-out squared error has a closed form: the LOO residual for pattern i is e_i / P_ii, so σ̂²_LOO = (1/p) Σ_i (e_i / P_ii)² can be computed from the full-data fit, without retraining the network p times.
111
Generalized Cross-Validation (GCV) replaces the individual diagonal entries P_ii by their average trace(P)/p, giving σ̂²_GCV = p · dᵀP²d / (trace P)².
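A short sketch computing GCV directly from the projection matrix on an invented toy problem, using the form σ̂²_GCV = p · dᵀP²d / (trace P)² given above.

```python
import numpy as np

rng = np.random.default_rng(6)

# Invented ridge-RBF problem; P is the projection matrix from the earlier slides.
p, m, sigma, lam = 40, 12, 0.1, 1e-2
x = np.linspace(0.0, 1.0, p)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(p)
centers = np.linspace(0.0, 1.0, m)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))
P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T

Pd = P @ d
gcv = p * (Pd @ Pd) / np.trace(P)**2    # GCV estimate of the error variance
print(f"GCV = {gcv:.4f}")
```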
112
More Criteria Based on CV
– GCV (Generalized Cross-Validation)
– UEV (Unbiased Estimate of Variance)
– FPE (Final Prediction Error)
– AIC (Akaike's Information Criterion)
– BIC (Bayesian Information Criterion)
113
More Criteria Based on CV
115
Regularization penalizes models with large weights: Cost = SSE + Σ_j λ_j w_j². Standard ridge regression sets λ_1 = λ_2 = … = λ_m = λ, i.e., Cost = SSE + λ Σ_j w_j².
116
Standard Ridge Regression: the cost SSE + λ Σ_j w_j² is minimized by ŵ = (ΦᵀΦ + λI)⁻¹Φᵀd.
117
Solution Review: ŵ = A⁻¹Φᵀd with A = ΦᵀΦ + λI, and the projection matrix P = I − ΦA⁻¹Φᵀ is used to compute the model selection criteria.
118
Example Width of RBF r = 0.5
119
Example Width of RBF r = 0.5
120
Example Width of RBF r = 0.5. How can the optimal regularization parameter be determined effectively?
121
Optimizing the Regularization Parameter Re-Estimation Formula
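The slide's re-estimation formula itself is not reproduced here; as a simpler stand-in, the sketch below selects the global regularization parameter by minimizing GCV over a log-spaced grid. The toy data and the grid are illustrative assumptions, not the slide's method.

```python
import numpy as np

rng = np.random.default_rng(7)

# Invented ridge-RBF problem (same setup as the earlier sketches).
p, m, sigma = 40, 12, 0.1
x = np.linspace(0.0, 1.0, p)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(p)
centers = np.linspace(0.0, 1.0, m)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

def gcv(lam):
    # GCV = p * d^T P^2 d / trace(P)^2 for the given global lambda.
    P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
    Pd = P @ d
    return p * (Pd @ Pd) / np.trace(P)**2

# Grid search over lambda: a simple substitute for the re-estimation formula.
lams = np.logspace(-8, 2, 50)
best = min(lams, key=gcv)
print(f"lambda minimizing GCV: {best:.2e}")
```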
122
Local Ridge Regression Re-Estimation Formula
123
Example — Width of RBF
124
Example — Width of RBF
125
Example — Width of RBF. There are two local minima. Using the above re-estimation formula, the search gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.
126
Example — Width of RBF. There are two local minima.
127
Example — Width of RBF. There are two local minima.
128
Example — Width of RBF. RMSE: root mean squared error. In real cases, it is not available.
129
Example — Width of RBF. RMSE: root mean squared error. In real cases, it is not available.
130
Local Ridge Regression. Standard ridge regression: Cost = SSE + λ Σ_j w_j², with one global λ. Local ridge regression: Cost = SSE + Σ_j λ_j w_j², with a separate λ_j for each kernel.
131
Standard vs. Local Ridge Regression: in local ridge regression, λ_j → ∞ implies that the kernel φ_j(·) can be removed.
132
The Solutions (used to compute model selection criteria):
– Linear regression: ŵ = (ΦᵀΦ)⁻¹Φᵀd
– Standard ridge regression: ŵ = (ΦᵀΦ + λI)⁻¹Φᵀd
– Local ridge regression: ŵ = (ΦᵀΦ + Λ)⁻¹Φᵀd, with Λ = diag(λ_1, …, λ_m)
133
Optimizing the Regularization Parameters (incremental operation). P: the current projection matrix. P_j: the projection matrix obtained by removing φ_j(·).
134
Optimizing the Regularization Parameters. Solve for the λ_j that minimizes the model selection criterion, subject to λ_j ≥ 0.
135
Optimizing the Regularization Parameters. The constrained optimum for λ_j can be computed from the current fit, subject to λ_j ≥ 0.
136
Optimizing the Regularization Parameters. Solve for the optimal λ_j subject to λ_j ≥ 0; if the optimum is λ_j → ∞, remove φ_j(·).
137
The Algorithm (a rough code sketch follows below). Initialize the λ_i's, e.g., by performing standard ridge regression. Repeat the following until GCV converges:
– randomly select j and compute the optimal λ_j;
– perform local ridge regression with the updated λ_j;
– if GCV is reduced, keep the change, and if λ_j → ∞, remove φ_j(·).
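A rough sketch of this loop's control flow, with one deliberate simplification: instead of the closed-form optimum for each λ_j, the selected λ_j is tuned by a small 1-D grid search on GCV (with λ_j = ∞ meaning the kernel is dropped). The toy problem, the grid, and the number of sweeps are illustrative assumptions, so this shows the structure of the algorithm rather than the slide's exact formula.

```python
import numpy as np

rng = np.random.default_rng(8)

# Invented problem: Gaussian design matrix Phi (p x m) and targets d.
p, m, sigma = 40, 15, 0.1
x = np.linspace(0.0, 1.0, p)
d = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(p)
centers = np.linspace(0.0, 1.0, m)
Phi = np.exp(-(x[:, None] - centers[None, :])**2 / (2 * sigma**2))

def gcv(lams, active):
    # GCV of the local-ridge model restricted to the currently active kernels.
    Phi_a = Phi[:, active]
    A = Phi_a.T @ Phi_a + np.diag(lams[active])
    P = np.eye(p) - Phi_a @ np.linalg.inv(A) @ Phi_a.T
    Pd = P @ d
    return p * (Pd @ Pd) / np.trace(P)**2

lams = np.full(m, 1e-3)              # initialise, e.g. as in standard ridge regression
active = np.ones(m, dtype=bool)
grid = np.concatenate([np.logspace(-8, 4, 25), [np.inf]])   # inf == drop the kernel

best = gcv(lams, active)
for sweep in range(5):                       # a few sweeps; stop once GCV settles
    for j in rng.permutation(m):
        if not active[j]:
            continue
        for lam_j in grid:
            trial_lams, trial_active = lams.copy(), active.copy()
            if np.isinf(lam_j):
                trial_active[j] = False      # removing phi_j == lambda_j -> infinity
            else:
                trial_lams[j] = lam_j
            score = gcv(trial_lams, trial_active)
            if score < best:                 # keep only changes that reduce GCV
                best, lams, active = score, trial_lams, trial_active

print(f"final GCV = {best:.4f}, kernels kept = {int(active.sum())} / {m}")
```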
138
References
Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," International Joint Conference on Artificial Intelligence (IJCAI).
Orr, M. J. L. (April 1996). Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html
139
Introduction to Radial Basis Function Networks Incremental Operations