1 Introduction to Radial Basis Function Networks

2 Content
Overview
The Models of Function Approximator
The Radial Basis Function Networks
RBFN's for Function Approximation
The Projection Matrix
Learning the Kernels
Bias-Variance Dilemma
The Effective Number of Parameters
Model Selection
Incremental Operations

3 Introduction to Radial Basis Function Networks Overview

4 Typical Applications of NN Pattern Classification Function Approximation Time-Series Forecasting

5 Function Approximation: an unknown function f is approximated by an estimator f̂.

6 Neural Network Supervised Learning (block diagram): the unknown function and the neural network receive the same inputs; the difference between their outputs is the error signal used to adjust the network.

7 Neural Networks as Universal Approximators Feedforward neural networks with a single hidden layer of sigmoidal units are capable of approximating uniformly any continuous multivariate function, to any desired degree of accuracy. – Hornik, K., Stinchcombe, M., and White, H. (1989). "Multilayer Feedforward Networks are Universal Approximators," Neural Networks, 2(5), 359-366. Like feedforward neural networks with a single hidden layer of sigmoidal units, it can be shown that RBF networks are universal approximators. – Park, J. and Sandberg, I. W. (1991). "Universal Approximation Using Radial-Basis-Function Networks," Neural Computation, 3(2), 246-257. – Park, J. and Sandberg, I. W. (1993). "Approximation and Radial-Basis-Function Networks," Neural Computation, 5(2), 305-316.

8 Statistics vs. Neural Networks
Statistics / Neural Networks
model / network
estimation / learning
regression / supervised learning
interpolation / generalization
observations / training set
parameters / (synaptic) weights
independent variables / inputs
dependent variables / outputs
ridge regression / weight decay

9 Introduction to Radial Basis Function Networks The Model of Function Approximator

10 Linear Models: y(x) = Σ_j w_j φ_j(x), a linear combination of fixed basis functions φ_j with adjustable weights w_j.

11 Linear Models (network diagram): the inputs x_1, …, x_n form the feature vector x; the hidden units φ_1, …, φ_m decompose it (feature extraction / transformation); the output y is the linearly weighted sum of the hidden-unit outputs with weights w_1, …, w_m.

12 Linear Models (same network diagram as above). Can you name some bases?

13 Example Linear Models: the polynomial basis and the Fourier series. Are they orthogonal bases?

14 y 11 11 22 22 mm mm x1x1 x2x2 xnxn w1w1 w2w2 wmwm x = Single-Layer Perceptrons as Universal Aproximators Hidden Units With sufficient number of sigmoidal units, it can be a universal approximator.

15 y 11 11 22 22 mm mm x1x1 x2x2 xnxn w1w1 w2w2 wmwm x = Radial Basis Function Networks as Universal Aproximators Hidden Units With sufficient number of radial-basis-function units, it can also be a universal approximator.

16 Non-Linear Models: y(x) = Σ_j w_j φ_j(x), where both the weights w_j and the basis functions φ_j themselves are adjusted by the learning process.

17 Introduction to Radial Basis Function Networks The Radial Basis Function Networks

18 Radial Basis Functions. A radial function φ_i(x) = φ(||x − x_i||) is specified by three things: a center x_i, a distance measure r = ||x − x_i||, and a shape φ.

19 Typical Radial Functions. Gaussian: φ(r) = exp(−r²/(2σ²)). Hardy multiquadratic: φ(r) = √(r² + c²). Inverse multiquadratic: φ(r) = 1/√(r² + c²).
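
The sketch below (Python with NumPy, which the slides themselves do not use) is a minimal illustration of these three radial functions; the width σ and shape parameter c are arbitrary illustrative choices.

```python
# Minimal sketch of the radial functions named above; sigma and c are
# illustrative width/shape parameters, not values taken from the slides.
import numpy as np

def gaussian(r, sigma=1.0):
    """Gaussian: phi(r) = exp(-r^2 / (2 sigma^2))."""
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadratic(r, c=1.0):
    """Hardy multiquadratic: phi(r) = sqrt(r^2 + c^2)."""
    return np.sqrt(r**2 + c**2)

def inverse_multiquadratic(r, c=1.0):
    """Inverse multiquadratic: phi(r) = 1 / sqrt(r^2 + c^2)."""
    return 1.0 / np.sqrt(r**2 + c**2)

r = np.linspace(0.0, 3.0, 7)             # radial distances r = ||x - x_i||
print(gaussian(r, sigma=0.5))            # localized: decays quickly away from the center
print(inverse_multiquadratic(r, c=2.0))  # also localized, with heavier tails
```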

20 Gaussian Basis Function (plot for σ = 0.5, 1.0, 1.5).

21 Inverse Multiquadratic (plot for c = 1, 2, 3, 4, 5).

22 Most General RBF (plot): a basis {φ_i : i = 1, 2, …} that is 'nearly' orthogonal.

23 Properties of RBFs: on-center, off-surround. Analogies with localized receptive fields found in several biological structures, e.g., the visual cortex and ganglion cells.

24 The Topology of RBF as a function approximator (network diagram): the inputs x_1, …, x_n (feature vectors) are projected onto the hidden units, and the outputs y_1, …, y_m are obtained by interpolation from the hidden-unit responses.

25 The Topology of RBF as a pattern classifier (network diagram): the hidden units correspond to subclasses and the output units to classes.

26 Introduction to Radial Basis Function Networks RBFN’s for Function Approximation

27 The idea (plot of y versus x): training data sampled from an unknown function that we want to approximate.

28 The idea: basis functions (kernels) are placed across the input range of the training data.

29 The idea: the learned function is a weighted combination of the basis functions (kernels).

30 The idea: the learned function, the basis functions (kernels), and a non-training sample.

31 The idea: the learned function evaluated at the non-training sample.

32 Radial Basis Function Networks as Universal Approximators (network diagram). Given a training set {(x_k, d_k) : k = 1, …, p}, the goal is y(x_k) = d_k for all k.

33 Learn the Optimal Weight Vector. Given the training set {(x_k, d_k)}, find the weights w_1, …, w_m such that y(x_k) = d_k for all k.

34 Regularization. Minimize the cost C(w) = Σ_k (d_k − y(x_k))² + Σ_j λ_j w_j² over the training set. If regularization is not needed, set the λ_j = 0.

35 Learn the Optimal Weight Vector. Minimize C(w) = Σ_{k=1}^{p} (d_k − y(x_k))² + Σ_{j=1}^{m} λ_j w_j².

36 Learn the Optimal Weight Vector. Define the target vector d = (d_1, …, d_p)ᵀ and, for each basis function, the vector φ_j = (φ_j(x_1), …, φ_j(x_p))ᵀ of its values on the training inputs.

37 Learn the Optimal Weight Vector. Define the design matrix Φ = [φ_1 … φ_m], with one row per training pattern and one column per basis function, so that the vector of network outputs is y = Φw.

38 Learn the Optimal Weight Vector. Setting the gradient of C(w) to zero gives ŵ = (ΦᵀΦ + Λ)⁻¹Φᵀd, where Λ = diag(λ_1, …, λ_m).

39 The matrix A⁻¹ = (ΦᵀΦ + Λ)⁻¹ is the variance matrix and Φ is the design matrix, so ŵ = A⁻¹Φᵀd.

40 Summary. Given the training set {(x_k, d_k)}, form the design matrix Φ and compute the optimal weights ŵ = A⁻¹Φᵀd, with A = ΦᵀΦ + Λ.
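
A minimal NumPy sketch of this step, assuming Gaussian kernels, a single global ridge parameter (Λ = λI), and synthetic data; the centers, width, and λ below are illustrative choices, not values from the slides.

```python
# Build a Gaussian design matrix for synthetic data and solve for the weights
# w = (Phi^T Phi + Lambda)^(-1) Phi^T d.  All numerical choices are illustrative.
import numpy as np

def design_matrix(X, centers, sigma):
    """Phi[k, j] = exp(-||x_k - c_j||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))             # training inputs x_k
d = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)  # noisy targets d_k
centers = X[::5]                                 # every 5th input as a center (simple subset selection)

Phi = design_matrix(X, centers, sigma=1.0)
lam = 1e-3                                       # Lambda = lam * I (single ridge parameter)
A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
w_hat = np.linalg.solve(A, Phi.T @ d)
print(w_hat)                                     # the learned weight vector
```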

41 Introduction to Radial Basis Function Networks The Projection Matrix

42 The Empirical-Error Vector (diagram): the unknown function generates the targets d; the network produces the outputs ŷ = Φŵ.

43 The Empirical-Error Vector (diagram, continued): the error vector is e = d − ŷ = d − Φŵ.

44 Sum-Squared-Error. SSE = eᵀe = ||d − Φŵ||². If λ = 0, the RBFN's learning algorithm minimizes the SSE (equivalently, the MSE).

45 The Projection Matrix. P = I − ΦA⁻¹Φᵀ = I − Φ(ΦᵀΦ + Λ)⁻¹Φᵀ, and the error vector is e = Pd.
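
A short NumPy sketch of this relation, using a random design matrix and target vector as stand-ins; it checks numerically that the residual d − Φŵ equals Pd.

```python
# Projection matrix P = I - Phi (Phi^T Phi + Lambda)^(-1) Phi^T and the
# identity e = d - Phi w = P d, verified on random stand-in data.
import numpy as np

rng = np.random.default_rng(1)
p, m, lam = 30, 6, 0.1
Phi = rng.normal(size=(p, m))        # design matrix (p patterns, m basis functions)
d = rng.normal(size=p)               # target vector

A_inv = np.linalg.inv(Phi.T @ Phi + lam * np.eye(m))
w_hat = A_inv @ Phi.T @ d
P = np.eye(p) - Phi @ A_inv @ Phi.T  # projection matrix

e = d - Phi @ w_hat                  # empirical-error vector
print(np.allclose(e, P @ d))         # True: the error vector is P d
```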

46 Introduction to Radial Basis Function Networks Learning the Kernels

47 RBFNs as Universal Approximators (network diagram): inputs x_1, …, x_n, kernels φ_1, …, φ_m, outputs y_1, …, y_l, and weights w_ij from kernel j to output i, trained on the training set.

48 What to Learn? The weights w_ij, the centers μ_j of the φ_j, the widths σ_j of the φ_j, and the number of φ_j (model selection).

49 One-Stage Learning: all three sets of parameters (weights, centers, widths) are updated together.

50 The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or online settings.

51 Two-Stage Training (network diagram). Step 1 determines the centers μ_j of the φ_j, the widths σ_j of the φ_j, and the number of φ_j. Step 2 determines the weights w_ij, e.g., using batch learning.

52 Train the Kernels

53 Unsupervised Training (plot): the kernel centers, marked '+', are placed over the input data without using the target values.

54 Methods
Subset Selection: Random Subset Selection, Forward Selection, Backward Elimination
Clustering Algorithms: K-means, LVQ
Mixture Models: GMM

55 Subset Selection

56 Random Subset Selection. Randomly choose a subset of points from the training set as centers. The result is sensitive to the initially chosen points. Adaptive techniques can then be used to tune the centers, the widths, and the number of points.

57 Clustering Algorithms Partition the data points into K clusters.

58 Clustering Algorithms Is such a partition satisfactory?

59 Clustering Algorithms How about this?

60 Clustering Algorithms (plot): the cluster centers μ_1, μ_2, μ_3, μ_4, marked '+', one inside each cluster.
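
A minimal NumPy sketch of one such clustering algorithm, Lloyd's k-means, used here to place candidate kernel centers; the number of clusters, the synthetic data, and the random initialization are assumptions of the sketch.

```python
# Lloyd's k-means: alternate between assigning points to the nearest center
# and moving each center to the mean of its cluster.
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=K, replace=False)]   # initialize from the data
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)                           # nearest-center assignment
        new_centers = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                                else centers[k] for k in range(K)])
        if np.allclose(new_centers, centers):                # converged
            break
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(2)
X = np.vstack([rng.normal(loc, 0.3, size=(40, 2))
               for loc in ((0, 0), (2, 2), (0, 2), (2, 0))])  # four synthetic clusters
centers, labels = kmeans(X, K=4)
print(centers)                                                # candidate RBF centers mu_1..mu_4
```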

61 Introduction to Radial Basis Function Networks Bias-Variance Dilemma

62 Goal Revisited. Ultimate goal: generalization, i.e., minimize the prediction error. Goal of our learning procedure: minimize the empirical error.

63 Badness of Fit Underfitting – A model (e.g., network) that is not sufficiently complex can fail to detect fully the signal in a complicated data set, leading to underfitting. – Produces excessive bias in the outputs. Overfitting – A model (e.g., network) that is too complex may fit the noise, not just the signal, leading to overfitting. – Produces excessive variance in the outputs.

64 Underfitting/Overfitting Avoidance
Model selection
Jittering
Early stopping
Weight decay (regularization, ridge regression)
Bayesian learning
Combining networks

65 Best Way to Avoid Overfitting Use lots of training data, e.g., – 30 times as many training cases as there are weights in the network. – for noise-free data, 5 times as many training cases as weights may be sufficient. Don’t arbitrarily reduce the number of weights for fear of underfitting.

66 Badness of Fit (plots): an underfit model and an overfit model.

67 Badness of Fit (plots, continued): underfit versus overfit.

68 Bias-Variance Dilemma. Underfit: large bias, small variance. Overfit: small bias, large variance. However, it's not really a dilemma.

69 Bias-Variance Dilemma. Underfit: large bias, small variance. Overfit: small bias, large variance. More on overfitting: it easily leads to predictions far beyond the range of the training data, and it produces wild predictions in multilayer perceptrons even with noise-free data.

70 Bias-Variance Dilemma. It's not really a dilemma: fitting more (toward overfitting) decreases bias and increases variance, while fitting less (toward underfitting) does the opposite.

71 Bias-Variance Dilemma (diagram). A set of candidate functions (e.g., determined by the number of hidden nodes used) contains the solutions obtained with training sets 1, 2, 3, …; the true model is observed through noise, and the bias is the systematic offset between the candidate solutions and the true model. What are the mean and the variance of the bias?

72 Bias-Variance Dilemma (diagram, continued). The spread of the solutions obtained from different training sets is the variance; their systematic offset from the true model is the bias.

73 Model Selection (diagram). To reduce the variance, reduce the effective number of parameters, e.g., reduce the number of hidden nodes.

74 Bias-Variance Dilemma (diagram). Goal: minimize the expected squared prediction error E[(d − y(x))²] between the noisy target and the model output.

75 Bias-Variance Dilemma. Goal: minimize E[(d − y(x))²]. Expanding this expectation, the cross term is 0 and the noise term is a constant that does not depend on the model.

76 Bias-Variance Dilemma. Expanding the remaining term, the cross term is again 0, which yields the bias-variance decomposition.

77 Goal: E[(d − y(x))²] = noise + bias² + variance. The noise term cannot be minimized; we must minimize both bias² and variance.

78 Model Complexity vs. Bias-Variance (plot): noise, bias², and variance as functions of model complexity (capacity).

79 Bias-Variance Dilemma (plot): the same noise, bias², and variance curves against model complexity (capacity).

80 Example (Polynomial Fits)

81

82 Polynomial fits of degree 1, 5, 10, and 15 (plots).
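
A small NumPy sketch in the same spirit: fit polynomials of degree 1, 5, 10, and 15 to noisy samples and compare training and test error; the target function sin(2πx), the sample sizes, and the noise level are illustrative assumptions.

```python
# Low degrees underfit (large training error); high degrees overfit
# (training error keeps shrinking while test error grows).
import numpy as np

rng = np.random.default_rng(3)
x_train = np.linspace(-1, 1, 20)
y_train = np.sin(2 * np.pi * x_train) + 0.2 * rng.normal(size=x_train.size)
x_test = np.linspace(-1, 1, 200)
y_test = np.sin(2 * np.pi * x_test)

for degree in (1, 5, 10, 15):
    coeffs = np.polyfit(x_train, y_train, degree)   # NumPy may warn that high degrees are ill-conditioned
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.4f}, test MSE {test_mse:.4f}")
```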

83 Introduction to Radial Basis Function Networks The Effective Number of Parameters

84 Variance Estimation. Mean: μ = E[d]. Variance: σ² = E[(d − μ)²], estimated by (1/p) Σ_k (d_k − μ)² when μ is known. In general, μ is not available.

85 Variance Estimation. Estimating the mean by the sample mean d̄ loses 1 degree of freedom: σ² is estimated by (1/(p − 1)) Σ_k (d_k − d̄)².

86 Simple Linear Regression: fit the model y(x) = a + bx to the data.

87 Minimize the sum of squared errors Σ_k (d_k − a − b x_k)² with respect to a and b.

88 Mean Squared Error (MSE). Minimizing with two fitted parameters loses 2 degrees of freedom: σ² is estimated by (1/(p − 2)) Σ_k (d_k − y(x_k))².

89 Variance Estimation. In general, fitting a model with m parameters loses m degrees of freedom: σ² is estimated by (1/(p − m)) Σ_k (d_k − y(x_k))², where m is the number of parameters of the model.

90 The Number of Parameters (network diagram): a linear model with weights w_1, …, w_m has m parameters; the number of degrees of freedom is m.

91 The Effective Number of Parameters (γ). For the regularized network, γ is defined through the projection matrix: γ = p − trace(P).

92 The Effective Number of Parameters (γ). Facts (with proof on the slide): γ = p − trace(P) = trace(ΦA⁻¹Φᵀ), and without regularization γ = m.

93 Regularization. Penalize models with large weights: Cost = SSE + Σ_j λ_j w_j². The effective number of parameters: γ = p − trace(P).

94 Regularization. Cost = SSE + Σ_j λ_j w_j². Without penalty (λ_j = 0), there are m degrees of freedom to minimize the SSE (cost), and the effective number of parameters is γ = m.

95 Regularization. Cost = SSE + Σ_j λ_j w_j². With penalty (λ_j > 0), the freedom to minimize the SSE is reduced, and the effective number of parameters is γ < m.

96 Variance Estimation. Losing γ degrees of freedom: σ² is estimated by SSE/(p − γ), where γ is the effective number of parameters.

97 Variance Estimation. With the effective number of parameters γ = p − trace(P), the variance estimate can be written as dᵀP²d/(p − γ).
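
A short NumPy sketch of these two quantities, the effective number of parameters γ and the variance estimate dᵀP²d/(p − γ), computed on random stand-in data (Φ, d, and λ are not from the slides).

```python
# Effective number of parameters gamma = p - trace(P) and the variance
# estimate d^T P^2 d / (p - gamma) for a ridge-regularized linear model.
import numpy as np

rng = np.random.default_rng(4)
p, m, lam = 40, 8, 0.5
Phi = rng.normal(size=(p, m))
d = rng.normal(size=p)

P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
gamma = p - np.trace(P)                  # effective number of parameters (-> m as lam -> 0)
sigma2 = d @ P @ P @ d / (p - gamma)     # variance estimate
print(gamma, sigma2)
```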

98 Introduction to Radial Basis Function Networks Model Selection

99 Goal – Choose the fittest model Criteria – Least prediction error Main Tools (Estimate Model Fitness) – Cross validation – Projection matrix Methods – Weight decay (Ridge regression) – Pruning and Growing RBFN’s

100 Empirical Error vs. Model Fitness. Ultimate goal: generalization, i.e., minimize the prediction error (MSE). Goal of our learning procedure: minimize the empirical error.

101 Estimating Prediction Error. When you have plenty of data, use independent test sets; e.g., use the same training set to train different models and choose the best model by comparing on the test set. When data is scarce, use cross-validation or the bootstrap.

102 Cross-Validation. The simplest and most widely used method for estimating prediction error. Partition the original set in several different ways and compute an average score over the different partitions, e.g., k-fold cross-validation, leave-one-out cross-validation, generalized cross-validation.

103 K-Fold CV. Split the set D of available input-output patterns into k mutually exclusive subsets D_1, D_2, …, D_k. Train and test the learning algorithm k times; each time it is trained on D \ D_i and tested on D_i.

104 K-Fold CV (diagram): the available data.

105 K-Fold CV (diagram): the available data split into D_1, D_2, D_3, …, D_k; in each round one D_i serves as the test set while the rest form the training set, and the results are combined to estimate σ².
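
A minimal NumPy sketch of this procedure for a linear (RBF-style) model, using ridge regression as the learning algorithm; the fold count, ridge parameter, and synthetic data are assumptions of the sketch.

```python
# k-fold cross-validation: train on D \ D_i, test on D_i, average the test MSE.
import numpy as np

def kfold_mse(Phi, d, k=5, lam=1e-3, seed=0):
    p, m = Phi.shape
    idx = np.random.default_rng(seed).permutation(p)
    folds = np.array_split(idx, k)                    # D_1, ..., D_k
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])   # D \ D_i
        A = Phi[train].T @ Phi[train] + lam * np.eye(m)
        w = np.linalg.solve(A, Phi[train].T @ d[train])
        errors.append(np.mean((d[test] - Phi[test] @ w) ** 2))
    return np.mean(errors)                            # averaged test MSE (estimate of sigma^2)

rng = np.random.default_rng(5)
Phi = rng.normal(size=(60, 10))
d = Phi @ rng.normal(size=10) + 0.3 * rng.normal(size=60)
print(kfold_mse(Phi, d, k=5))
```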

106 Leave-One-Out CV. Split the p available input-output patterns into a training set of size p − 1 and a test set of size 1. Average the squared error on the left-out pattern over the p possible ways of partitioning. A special case of k-fold CV (k = p).

107 Error Variance Predicted by LOO (a special case of k-fold CV). Let D be the available input-output patterns, D_i the LOO training sets, and ŷ_i the function learned using D_i as training set. The LOO estimate of the variance of the prediction error averages the squared error on each left-out element: (1/p) Σ_i (d_i − ŷ_i(x_i))².

108 Error Variance Predicted by LOO. Each ŷ_i is, for the given model, the function with least empirical error on D_i. The LOO estimate serves as an index of the model's fitness: we want to find a model that also minimizes it.

109 Error Variance Predicted by LOO. How can the error on each left-out element be estimated? Are there any efficient ways?

110 Error Variance Predicted by LOO. For a linear model the error on each left-out element can be obtained from a single fit: the left-out residual is (Pd)_i / P_ii, so the LOO estimate is (1/p) Σ_i ((Pd)_i / P_ii)² = (1/p) dᵀP (diag P)⁻² P d.
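
A NumPy sketch of this shortcut. The identity used, left-out residual = (Pd)_i / P_ii, is the standard leave-one-out result for ridge-type linear models; the brute-force loop below confirms it on random stand-in data.

```python
# Efficient LOO estimate from one fit versus brute-force retraining p times.
import numpy as np

rng = np.random.default_rng(6)
p, m, lam = 25, 5, 0.2
Phi = rng.normal(size=(p, m))
d = Phi @ rng.normal(size=m) + 0.1 * rng.normal(size=p)

P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
loo_fast = np.mean((P @ d / np.diag(P)) ** 2)         # (1/p) sum_i ((Pd)_i / P_ii)^2

errs = []                                             # brute force: leave each pattern out in turn
for i in range(p):
    keep = np.arange(p) != i
    A = Phi[keep].T @ Phi[keep] + lam * np.eye(m)
    w = np.linalg.solve(A, Phi[keep].T @ d[keep])
    errs.append((d[i] - Phi[i] @ w) ** 2)

print(np.isclose(loo_fast, np.mean(errs)))            # True: the shortcut matches retraining
```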

111 Generalized Cross-Validation. Replacing each diagonal element P_ii by its average value trace(P)/p gives the GCV estimate p dᵀP²d / (trace P)².

112 More Criteria Based on CV: GCV (generalized cross-validation), UEV (unbiased estimate of variance), FPE (final prediction error), Akaike's information criterion (AIC), and BIC (Bayesian information criterion).

113 More Criteria Based on CV. Each criterion scales the quantity dᵀP²d by a factor that depends on p and the effective number of parameters γ; for example, UEV = dᵀP²d/(p − γ) and GCV = p dᵀP²d/(p − γ)² (using trace P = p − γ).
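
A short NumPy sketch computing two of these criteria, UEV and GCV, from the projection matrix; the exact factors for FPE, AIC, and BIC are not reproduced in this transcript and are therefore omitted. Φ, d, and the λ values are stand-ins.

```python
# UEV = d^T P^2 d / (p - gamma) and GCV = p * d^T P^2 d / (trace P)^2,
# both computed from the projection matrix of a ridge-regularized model.
import numpy as np

def uev_gcv(Phi, d, lam):
    p, m = Phi.shape
    P = np.eye(p) - Phi @ np.linalg.inv(Phi.T @ Phi + lam * np.eye(m)) @ Phi.T
    sse_like = d @ P @ P @ d
    gamma = p - np.trace(P)
    return sse_like / (p - gamma), p * sse_like / np.trace(P) ** 2

rng = np.random.default_rng(7)
Phi = rng.normal(size=(50, 8))
d = Phi @ rng.normal(size=8) + 0.2 * rng.normal(size=50)
for lam in (1e-4, 1e-2, 1.0):
    print(lam, uev_gcv(Phi, d, lam))   # compare the criteria across regularization strengths
```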

114

115 Regularization. Penalize models with large weights: Cost = SSE + λ Σ_j w_j² (standard ridge regression, with a single global λ).

116 Regularization. For standard ridge regression the solution is ŵ = (ΦᵀΦ + λI)⁻¹Φᵀd.

117 Solution Review. ŵ = A⁻¹Φᵀd with A = ΦᵀΦ + λI; the projection matrix P = I − ΦA⁻¹Φᵀ is used to compute the model selection criteria.

118 Example (plots): width of RBF r = 0.5.

119 Example (plots): width of RBF r = 0.5.

120 Example (plots): width of RBF r = 0.5. How do we determine the optimal regularization parameter effectively?

121 Optimizing the Regularization Parameter: a re-estimation formula for λ, iterated until it converges.

122 Local Ridge Regression: a re-estimation formula for the individual λ_j.

123 Example — Width of RBF

124 Example — Width of RBF

125 Example — Width of RBF. There are two local minima. Using the above re-estimation formula, the iteration gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.

126 Example — Width of RBF. There are two local minima.

127 Example — Width of RBF. There are two local minima.

128 Example — Width of RBF. RMSE: root mean squared error; in a real case it is not available.

129 Example — Width of RBF. RMSE: root mean squared error; in a real case it is not available.

130 Local Ridge Regression. Standard ridge regression: Cost = SSE + λ Σ_j w_j² (one global λ). Local ridge regression: Cost = SSE + Σ_j λ_j w_j² (a separate λ_j for each basis function).

131 Standard vs. Local Ridge Regression. λ_j → ∞ implies that φ_j(·) can be removed from the model.

132 The Solutions (used to compute the model selection criteria). Linear regression: ŵ = (ΦᵀΦ)⁻¹Φᵀd. Standard ridge regression: ŵ = (ΦᵀΦ + λI)⁻¹Φᵀd. Local ridge regression: ŵ = (ΦᵀΦ + Λ)⁻¹Φᵀd, with Λ = diag(λ_1, …, λ_m). In each case the corresponding projection matrix is used to compute the model selection criteria.

133 Optimizing the Regularization Parameters (incremental operation). P: the current projection matrix. P_j: the projection matrix obtained by removing φ_j(·).

134 Optimizing the Regularization Parameters. For each j, solve for the λ_j that optimizes the model selection criterion, subject to λ_j ≥ 0.

135 Optimizing the Regularization Parameters. The optimization is carried out one λ_j at a time, with the other regularization parameters held fixed.

136 Optimizing the Regularization Parameters. If the optimum is λ_j → ∞, remove φ_j(·).

137 The Algorithm. Initialize the λ_j's, e.g., by performing standard ridge regression. Repeat the following until GCV converges: randomly select j and compute its updated regularization parameter; perform local ridge regression; if GCV is reduced and λ_j → ∞, remove φ_j(·).
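
The closed-form re-estimation formula is not reproduced in this transcript, so the NumPy sketch below substitutes a crude grid search over each λ_j (with λ_j = ∞ meaning that φ_j is removed) while keeping the same loop structure: pick a random j, try a local update, and accept it only if GCV decreases. All data and grid values are illustrative stand-ins.

```python
# Coordinate-wise local ridge regression driven by GCV; lambda_j = inf drops phi_j.
import numpy as np

def gcv(Phi, d, lambdas):
    p = Phi.shape[0]
    keep = np.isfinite(lambdas)                     # lambda_j = inf removes phi_j
    Phi_k, lam_k = Phi[:, keep], lambdas[keep]
    P = np.eye(p) - Phi_k @ np.linalg.inv(Phi_k.T @ Phi_k + np.diag(lam_k)) @ Phi_k.T
    return p * (d @ P @ P @ d) / np.trace(P) ** 2

def local_ridge(Phi, d, init_lam=0.1, sweeps=50, seed=0):
    rng = np.random.default_rng(seed)
    m = Phi.shape[1]
    lambdas = np.full(m, init_lam)                  # initialize, e.g., from standard ridge regression
    best = gcv(Phi, d, lambdas)
    grid = np.array([0.0, 1e-3, 1e-2, 1e-1, 1.0, 10.0, np.inf])
    for _ in range(sweeps):                         # repeat until GCV (roughly) converges
        j = rng.integers(m)                         # randomly select j
        for lam_j in grid:
            trial = lambdas.copy()
            trial[j] = lam_j
            score = gcv(Phi, d, trial)
            if score < best:                        # keep the change only if GCV is reduced
                best, lambdas = score, trial
    return lambdas, best

rng = np.random.default_rng(8)
Phi = rng.normal(size=(40, 6))
d = Phi[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=40)
lambdas, score = local_ridge(Phi, d)
print(lambdas, score)   # lambda_j = inf marks basis functions that were removed
```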

138 References
Kohavi, R. (1995). "A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection," International Joint Conference on Artificial Intelligence (IJCAI).
Orr, M. J. L. (April 1996). Introduction to Radial Basis Function Networks, http://www.anc.ed.ac.uk/~mjo/intro/intro.html

139 Introduction to Radial Basis Function Networks Incremental Operations

