Introduction to Radial Basis Function Networks

1 Introduction to Radial Basis Function Networks

2 Contents
Overview
The Models of Function Approximator
The Radial Basis Function Networks
RBFN's for Function Approximation
The Projection Matrix
Learning the Kernels
Bias-Variance Dilemma
The Effective Number of Parameters
Model Selection

3 RBF. Linear models have been studied in statistics for about 200 years, and the theory is applicable to RBF networks, which are just one particular type of linear model. However, the fashion for neural networks, which started in the mid-1980s, has given rise to new names for concepts already familiar to statisticians.

4 Typical Applications of NN
Pattern classification
Function approximation
Time-series forecasting

5 Function Approximation
Figure: an unknown function f and its approximator f̂. Function approximation methods fall into two general groups: (1) approximating a function that is known over an interval by another function, and (2) approximating a function whose values are known only at certain points. The best-known approaches are polynomial approximation and least-squares approximation; in addition, the relationship between the given points is, in general, not a linear one.

6 Introduction to Radial Basis Function Networks
The Model of Function Approximator

7 Linear Models. A linear model has the form f(x) = Σj wj φj(x), with adjustable weights wj and fixed basis functions φj.

8 Linear Models (network view). Figure: the input feature vector x = (x1, …, xn) feeds hidden units φ1, …, φm (decomposition / feature extraction / transformation); the output unit forms the linearly weighted output y = Σj wj φj(x).

9 Linear Models (same network view). Can you name some bases?

10 Example Linear Models: the polynomial basis and the Fourier series. Are they orthogonal bases?
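To make the "linear model with fixed basis functions" concrete, here is a small illustrative Python/NumPy sketch (not part of the original slides): it fixes a polynomial basis, builds the matrix of basis-function values, and learns only the weights by least squares. The data, the degree, and all names are invented for the example.

import numpy as np

# Toy data: noisy samples of an unknown 1-D function (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 50)
d = np.sin(2.5 * x) + 0.1 * rng.standard_normal(x.size)

# Fixed polynomial basis: phi_j(x) = x**j, j = 0..degree.
def design_matrix(x, degree=5):
    return np.vstack([x**j for j in range(degree + 1)]).T   # shape (p, m)

Phi = design_matrix(x)

# Linear model f(x) = sum_j w_j phi_j(x); only the weights are learned.
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)

print("learned weights:", np.round(w, 3))
print("training MSE:", np.mean((Phi @ w - d) ** 2))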

11 Single-Layer Perceptrons as Universal Approximators. Figure: inputs x1, …, xn feed hidden units 1, …, m with weights w1, …, wm. With a sufficient number of sigmoidal units, such a network can be a universal approximator.

12 Radial Basis Function Networks as Universal Approximators. Figure: the same network, now with radial-basis-function hidden units. With a sufficient number of radial-basis-function units, it can also be a universal approximator.

13 Non-Linear Models. In a non-linear model the basis functions themselves (not only the weights) are adjusted by the learning process.

14 Introduction to Radial Basis Function Networks
The Radial Basis Function Networks

15 Radial Basis Functions
Three parameters characterize a radial function φi(x) = φ(||x − xi||): the center xi, the distance measure r = ||x − xi||, and the shape of φ.

16 Typical Radial Functions
Gaussian: φ(r) = exp(−r²/(2σ²)), σ > 0. Hardy multiquadratic (1971): φ(r) = √(r² + c²), c > 0. Inverse multiquadratic: φ(r) = 1/√(r² + c²), c > 0. Here r is the distance of the data point from the chosen center, σ is the standard deviation (σ² is the variance), and c is a shape parameter that affects the shape of the surface. A shape parameter is any parameter of a probability distribution that is neither a location parameter nor a scale parameter: it affects the shape of the distribution rather than merely shifting it (as a location parameter does) or stretching/shrinking it (as a scale parameter does).
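For reference, the three radial functions named above can be written as code. This is an illustrative NumPy sketch; the parameter names sigma and c are my own, chosen to match the width and shape parameters described on the slide.

import numpy as np

def gaussian(r, sigma=1.0):
    """Gaussian: phi(r) = exp(-r^2 / (2*sigma^2)); sigma controls the width."""
    return np.exp(-r**2 / (2.0 * sigma**2))

def multiquadratic(r, c=1.0):
    """Hardy multiquadratic (1971): phi(r) = sqrt(r^2 + c^2); c is the shape parameter."""
    return np.sqrt(r**2 + c**2)

def inverse_multiquadratic(r, c=1.0):
    """Inverse multiquadratic: phi(r) = 1 / sqrt(r^2 + c^2)."""
    return 1.0 / np.sqrt(r**2 + c**2)

# r is the distance of a data point x from a chosen center x_i.
r = np.linalg.norm(np.array([0.5, 0.2]) - np.array([0.0, 0.0]))
print(gaussian(r, sigma=0.5), multiquadratic(r), inverse_multiquadratic(r))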

17 Gaussian Basis Function (σ = 0.5, 1.0, 1.5)

18 Inverse Multiquadratic

19 Most General RBF. The basis {φi : i = 1, 2, …} is 'nearly' orthogonal. Figure: several radial bumps centered at the points marked +.

20 Properties of RBFs: on-center, off-surround.
Analogous to the localized receptive fields found in several biological structures, e.g., the visual cortex and ganglion cells.

21 The Topology of RBF as a function approximator. Figure: inputs (feature vectors x1, …, xn) are projected onto the hidden units, which interpolate to produce the outputs y1, …, ym. (Interpolation: estimating values between known data points.)

22 The Topology of RBF as a pattern classifier. Figure: inputs (feature vectors x1, …, xn) feed hidden units representing subclasses; the output units y1, …, ym represent classes.

23 Introduction to Radial Basis Function Networks
RBFN’s for Function Approximation

24 Radial Basis Function Networks
Radial basis function (RBF) networks are feed-forward networks trained with a supervised training algorithm. The activation function is selected from a class of functions called basis functions. They usually train much faster than backpropagation (BP) networks, and they are less susceptible to problems with non-stationary inputs.

25 Radial Basis Function Networks
Popularized by Broomhead and Lowe (1988) and Moody and Darken (1989), RBF networks have proven to be a useful neural network architecture. The major difference between RBF and BP networks is the behavior of the single hidden layer: rather than the sigmoidal (S-shaped) activation function used in BP, the hidden units in RBF networks use a Gaussian or some other basis kernel function (cf. kernels in statistics: polynomial, radial basis function, Laplacian).

26 The idea. Figure: training data sampled from an unknown function to approximate (y against x).

27 The idea. Figure: basis functions (kernels) placed along the x-axis beneath the training data and the unknown function.

28 The idea. Figure: the function learned as a weighted sum of the basis functions (kernels).

29 The idea. Figure: the learned function evaluated at a non-training sample, with the basis functions (kernels) shown.

30 The idea. Figure: the learned function and the non-training sample, with the kernels omitted.

31 Radial Basis Function Networks as Universal Approximators. Given a training set T = {(x(k), d(k)) : k = 1, …, p}, the goal is f(x(k)) ≈ d(k) for all k, where f(x) = Σj wj φj(x).

32 Learn the Optimal Weight Vector. Given the training set, find the weights w1, …, wm such that f(x(k)) ≈ d(k) for all k.

33 Regularization. A penalty term with regularization parameters λj ≥ 0 is added to the cost; if regularization is unneeded, set λj = 0. The goal is still f(x(k)) ≈ d(k) for all k.

34 Learn the Optimal Weight Vector
Minimize the regularized cost C(w) = Σk (d(k) − f(x(k)))² + Σj λj wj².

35 Learn the Optimal Weight Vector
Define the target vector d = (d(1), …, d(p))ᵀ and the design matrix Φ with entries Φkj = φj(x(k)).

36 Learn the Optimal Weight Vector
Define Λ = diag(λ1, …, λm), so that the model outputs at the training points are Φw and the penalty is wᵀΛw.

37 Learn the Optimal Weight Vector

38 Learn the Optimal Weight Vector
Hence ŵ = A⁻¹Φᵀd with A = ΦᵀΦ + Λ; Φ is the design matrix and A⁻¹ is the variance matrix.
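A minimal NumPy sketch of the computation these slides describe, under the assumption of Gaussian kernels with given centers and a common width (the data, centers, and values of λ are invented): build the design matrix Φ, form A = ΦᵀΦ + Λ (whose inverse is the variance matrix), and solve A ŵ = Φᵀd for the optimal weight vector.

import numpy as np

def gaussian_design_matrix(X, centers, sigma):
    """Phi[k, j] = exp(-||x_k - c_j||^2 / (2*sigma^2))."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma**2))

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(40, 1))                      # training inputs
d = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(40)   # targets

centers = X[rng.choice(40, size=8, replace=False)]        # random subset as centers
Phi = gaussian_design_matrix(X, centers, sigma=0.4)       # design matrix (p x m)

lam = np.full(Phi.shape[1], 1e-3)                         # regularization parameters
A = Phi.T @ Phi + np.diag(lam)                            # A; its inverse is the variance matrix
w_hat = np.linalg.solve(A, Phi.T @ d)                     # optimal weight vector

print("fitted weights:", np.round(w_hat, 3))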

39 Introduction to Radial Basis Function Networks
The Projection Matrix

40 The Empirical-Error Vector
The training targets come from an unknown function plus noise: d(k) = g(x(k)) + ε(k).

41 The Empirical-Error Vector
The empirical-error vector collects the residuals at the training points: e = d − Φŵ (the unknown function enters only through the targets d).

42 Sum-Squared-Error Error Vector
If λ = 0, the RBFN's learning algorithm simply minimizes the sum-squared-error SSE = eᵀe = Σk (d(k) − f(x(k)))², equivalently the MSE.

43 The Projection Matrix. The error vector can be written as e = P d, where P = I − Φ A⁻¹ Φᵀ (with A = ΦᵀΦ + Λ) is the projection matrix, and SSE = dᵀP²d.
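Continuing the same sketch (same assumptions as above), the projection matrix can be formed explicitly; it gives the empirical-error vector and the sum-squared-error directly from the targets.

import numpy as np

def projection_matrix(Phi, A):
    """P = I - Phi A^{-1} Phi^T; the error vector is e = P d and SSE = d^T P^2 d."""
    return np.eye(Phi.shape[0]) - Phi @ np.linalg.solve(A, Phi.T)

# Tiny self-contained check with an arbitrary design matrix (illustrative numbers).
rng = np.random.default_rng(2)
Phi = rng.standard_normal((10, 3))
d = rng.standard_normal(10)
A = Phi.T @ Phi + 1e-3 * np.eye(3)

P = projection_matrix(Phi, A)
e = P @ d
w_hat = np.linalg.solve(A, Phi.T @ d)
assert np.allclose(e, d - Phi @ w_hat)      # P d equals the residual of the fit
print("SSE =", float(d @ P @ P @ d))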

44 Introduction to Radial Basis Function Networks
Learning the Kernels. Learning the kernel is a fundamental topic in kernel methods and machine learning in general. The question of how to choose an appropriate kernel has been raised since the beginning of kernel methods, particularly for SVMs. Progress in this area both reduces what is required of users applying machine-learning techniques and helps achieve better performance.

45 RBFN’s as Universal Approximators
xn y1 yl 1 2 m w11 w12 w1m wl1 wl2 wlml Training set Kernels

46 What to Learn?
The weights wij.
The centers μj of the φj.
The widths σj of the φj.
The number of φj (model selection).

47 One-Stage Learning

48 One-Stage Learning. The simultaneous update of all three sets of parameters may be suitable for non-stationary environments or an on-line setting.
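A rough sketch of what one-stage learning could look like for a single-output Gaussian RBFN: all three parameter sets (weights, centers, widths) are updated together by stochastic gradient descent on the squared error. The update rules follow from differentiating the Gaussian model; the learning rate, initialization, and clipping are arbitrary illustrative choices, not values taken from the slides.

import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 1))
d = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(200)

m = 8
centers = X[rng.choice(len(X), m, replace=False)].copy()   # mu_j
widths = np.full(m, 0.5)                                   # sigma_j
weights = np.zeros(m)                                      # w_j
eta = 0.05                                                 # learning rate (arbitrary)

def predict(x):
    r2 = ((x - centers) ** 2).sum(axis=1)
    return weights @ np.exp(-r2 / (2 * widths**2))

for epoch in range(100):
    for k in rng.permutation(len(X)):
        diff = X[k] - centers
        r2 = (diff ** 2).sum(axis=1)
        phi = np.exp(-r2 / (2 * widths**2))
        err = d[k] - weights @ phi
        # Simultaneous gradient-descent updates of all three parameter sets.
        centers += eta * err * (weights * phi / widths**2)[:, None] * diff
        widths = np.clip(widths + eta * err * weights * phi * r2 / widths**3, 0.05, None)
        weights += eta * err * phi

print("training MSE:", np.mean([(d[k] - predict(X[k]))**2 for k in range(len(X))]))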

49 Two-Stage Training. Step 1: determine the centers μj, the widths σj, and the number of the φj. Step 2: determine the weights wij, e.g., using batch learning.
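A minimal sketch of the two-stage recipe, with assumed details that are not prescribed by the slides: k-means (scikit-learn, assumed available) for Step 1, a common width set from the average center spacing, and regularized batch least squares for Step 2.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 1))
d = np.sin(3 * X[:, 0]) + 0.05 * rng.standard_normal(200)

# Step 1: determine the centers (and a width) without using the targets.
m = 10
km = KMeans(n_clusters=m, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_
# One common width heuristic: a multiple of the average center spacing (assumption).
sigma = 2.0 * np.mean(np.abs(np.diff(np.sort(centers[:, 0]))))

# Step 2: determine the weights by (regularized) batch least squares.
d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
Phi = np.exp(-d2 / (2 * sigma**2))
lam = 1e-3
w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T @ d)

print("training MSE:", np.mean((Phi @ w - d) ** 2))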

50 Train the Kernels

51 Unsupervised Training
Figure: candidate kernel centers (marked +) placed among the data points.

52 Methods
Subset selection: random subset selection, forward selection, backward elimination.
Clustering algorithms: k-means, LVQ.
Mixture models: GMM.

53 Subset Selection

54 Random Subset Selection
Randomly choose a subset of points from the training set as centers. This is sensitive to the initially chosen points, so adaptive techniques can be used to tune the centers, the widths, and the number of points.

55 Clustering Algorithms
Partition the data points into K clusters.

56 Clustering Algorithms
Is such a partition satisfactory?

57 Clustering Algorithms
How about this?

58 Clustering Algorithms
1 + + 2 + 4 + 3

59 Introduction to Radial Basis Function Networks
Bias-Variance Dilemma

60 Questions. How should the user choose the kernel?
The problem is similar to that of selecting features for other learning algorithms. A poor choice makes learning very difficult; a good choice lets even poor learners succeed. The requirement on the user is thus critical: can it be lessened? Is a more automatic selection of features possible?

61 Goal Revisited. Ultimate goal (generalization): minimize the prediction error. Goal of our learning procedure: minimize the empirical error.

62 Badness of Fit
Underfitting: a model (e.g., a network) that is not sufficiently complex can fail to detect fully the signal in a complicated data set, producing excessive bias in the outputs.
Overfitting: a model that is too complex may fit the noise, not just the signal, producing excessive variance in the outputs.

63 Underfitting/Overfitting Avoidance: model selection, jittering, early stopping, weight decay (regularization, ridge regression), Bayesian learning, combining networks.

64 Best Way to Avoid Overfitting
Use lots of training data, e.g., 30 times as many training cases as there are weights in the network. For noise-free data, 5 times as many training cases as weights may be sufficient. Don't arbitrarily reduce the number of weights for fear of underfitting.

65 Badness of Fit Underfit Overfit

66 Badness of Fit Underfit Overfit

67 Bias-Variance Dilemma
However, it is not really a dilemma: underfitting gives large bias and small variance; overfitting gives small bias and large variance.

68 Bias-Variance Dilemma
More on overfitting: overfit models easily lead to predictions that are far beyond the range of the training data, and they produce wild predictions in multilayer perceptrons even with noise-free data. (Underfit: large bias, small variance. Overfit: small bias, large variance.)

69 Bias-Variance Dilemma
It is not really a dilemma. Figure: as the model fits more, bias decreases and variance increases, moving from underfitting to overfitting.

70 Bias-Variance Dilemma
Figure: the true model (observed with noise) and the solutions obtained with training sets 1, 2, and 3, each with its own bias; the set of candidate functions depends, e.g., on the number of hidden nodes used. The mean of the bias = ? The variance of the bias = ?

71 Bias-Variance Dilemma
Figure: the same picture, now also marking the variance of the solutions around their mean, alongside the bias relative to the true model. The mean of the bias = ? The variance of the bias = ?

72 Model Selection. To reduce the variance, reduce the effective number of parameters, e.g., by reducing the number of hidden nodes. Figure: as before (true model, noise, bias, variance, set of candidate functions).

73 Bias-Variance Dilemma
Goal: minimize the expected squared prediction error. Figure: the true model, noise, and bias over the set of candidate functions (which depends, e.g., on the number of hidden nodes used).

74 Bias-Variance Dilemma
Goal: minimize the expected squared prediction error; the noise term in it is constant.

75 Bias-Variance Dilemma

76 Bias-Variance Dilemma
Goal: minimize the expected squared prediction error, which decomposes into noise + bias² + variance. The noise term cannot be minimized; we can only minimize bias² and variance.
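For completeness, the decomposition behind these slides can be written out. This is the standard bias-variance decomposition of the expected squared error (reconstructed here, not copied from the slide images), with D denoting the training set:

E_{D}\big[(d - f(x;D))^2\big]
  = \underbrace{E\big[(d - \bar d(x))^2\big]}_{\text{noise (cannot be minimized)}}
  + \underbrace{\big(E_D[f(x;D)] - \bar d(x)\big)^2}_{\text{bias}^2}
  + \underbrace{E_D\big[(f(x;D) - E_D[f(x;D)])^2\big]}_{\text{variance}},
\qquad \bar d(x) = E[d \mid x].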

77 Model Complexity vs. Bias-Variance
Figure: as model complexity (capacity) grows, bias² decreases while variance increases; the noise level stays constant.

78 Bias-Variance Dilemma
Figure: the total expected error (noise + bias² + variance) as a function of model complexity (capacity); the best capacity balances bias² against variance.

79 Example (Polynomial Fits)

80 Example (Polynomial Fits)

81 Example (Polynomial Fits)
Figure: polynomial fits of degree 1, 5, 10, and 15.
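A small NumPy sketch of this kind of experiment (the data, noise level, and degrees are illustrative): fit polynomials of increasing degree and compare training and test error, which shows underfitting at low degree and overfitting at high degree.

import numpy as np

rng = np.random.default_rng(5)
x_train = np.sort(rng.uniform(-1, 1, 20))
x_test = np.linspace(-1, 1, 200)
f = lambda x: np.sin(np.pi * x)
y_train = f(x_train) + 0.2 * rng.standard_normal(x_train.size)
y_test = f(x_test)

for degree in (1, 5, 10, 15):
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")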

82 Introduction to Radial Basis Function Networks
The Effective Number of Parameters

83 Variance Estimation. For samples y1, …, yp drawn from a distribution whose mean μ is known (in general, it is not available), the variance is estimated by σ̂² = (1/p) Σk (yk − μ)².

84 Variance Estimation. When the mean is itself estimated from the data as ȳ, one degree of freedom is lost: σ̂² = (1/(p − 1)) Σk (yk − ȳ)².

85 Simple Linear Regression

86 Simple Linear Regression
Fit y = a + b x by minimizing the sum of squared residuals Σk (yk − a − b xk)².

87 Mean Squared Error (MSE)
With the two fitted parameters a and b, two degrees of freedom are lost, so the error variance is estimated by σ̂² = SSE/(p − 2).

88 Variance Estimation. For a model with m parameters, m degrees of freedom are lost: σ̂² = SSE/(p − m).

89 The Number of Parameters
For a linear model f(x) = Σj wj φj(x) with m weights w1, …, wm, the number of degrees of freedom is m.

90 The Effective Number of Parameters (γ). With regularization, the effective number of parameters of the model is γ = p − trace(P), where P is the projection matrix.

91 The Effective Number of Parameters ()
Facts: γ = p − trace(P) = trace(Φ A⁻¹ Φᵀ). Proof sketch: apply the definition of the projection matrix P = I − Φ A⁻¹ Φᵀ and the linearity of the trace.

92 Regularization. The cost penalizes models with large weights: Cost = SSE + Σj λj wj². The effective number of parameters is γ = p − trace(P).

93 Regularization. Without the penalty (λi = 0), there are m degrees of freedom available to minimize the SSE (the cost), and the effective number of parameters is γ = m.

94 Regularization. With the penalty (λi > 0), the freedom to minimize the SSE is reduced, and the effective number of parameters is γ < m.

95 Variance Estimation. With the effective number of parameters γ, γ degrees of freedom are lost: σ̂² = SSE/(p − γ).

96 Variance Estimation. Written with the projection matrix: since SSE = dᵀP²d and γ = p − trace(P), the estimate is σ̂² = dᵀP²d / trace(P).
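Putting the last few slides together in code: a sketch that computes γ = p − trace(P) and the variance estimate σ̂² = dᵀP²d / (p − γ) for a few values of a common regularization parameter. The design matrix and targets here are random placeholders, used only to exercise the formulas.

import numpy as np

def effective_params_and_variance(Phi, d, lam):
    """gamma = p - trace(P) and the variance estimate sigma^2 = d^T P^2 d / (p - gamma)."""
    p, m = Phi.shape
    A = Phi.T @ Phi + np.diag(np.full(m, lam))
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    gamma = p - np.trace(P)
    sigma2 = float(d @ P @ P @ d) / (p - gamma)
    return gamma, sigma2

rng = np.random.default_rng(6)
Phi = rng.standard_normal((50, 10))
d = rng.standard_normal(50)
for lam in (0.0, 0.1, 10.0):
    gamma, sigma2 = effective_params_and_variance(Phi, d, lam)
    print(f"lambda={lam:5.1f}: gamma={gamma:5.2f}, estimated variance={sigma2:.3f}")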

97 Introduction to Radial Basis Function Networks
Model Selection

98 Model Selection
Goal: choose the fittest model.
Criterion: least prediction error.
Main tools (estimating model fitness): cross-validation, the projection matrix.
Methods: weight decay (ridge regression), pruning and growing RBFNs.

99 Empirical Error vs. Model Fitness
Ultimate goal (generalization): minimize the prediction error. Goal of our learning procedure: minimize the empirical error (MSE).

100 Estimating Prediction Error
When you have plenty of data, use independent test sets: e.g., train different models on the same training set and choose the best model by comparing them on the test set. When data is scarce, use cross-validation or the bootstrap. Bootstrap sampling: repeating the data collection many times is usually infeasible because of time, effort, cost, or outright impossibility, so only a single data set (the observations) is available. If we want to estimate the mean of the population the data came from, the sample mean is our estimate; but with a small sample and no assumption about the population distribution, we cannot give a confidence interval for the population mean. To get around such problems, resamples of the same size are repeatedly drawn from the available data, the statistic of interest is observed many times, and an idea of its distribution is obtained.

101 Cross Validation. The simplest and most widely used method for estimating prediction error: partition the original set in several different ways and compute an average score over the different partitions, e.g., k-fold cross-validation, leave-one-out cross-validation, generalized cross-validation. In data-mining studies, the data set is split into training and test sets to assess the performance of a method. The split can be made in various ways; for example, 66% of the data can be used for training and 33% for testing, with the system trained on the training set and then evaluated on the test set, and the assignment to the two sets can also be made at random. The point here, however, is how the training and test sets are rotated ("folded"): in k-fold cross-validation a value of k is chosen first (k = 3 is convenient for explanation, but k = 10 is the value most often preferred in the literature).

102 K-Fold CV. Split the set D of available input-output patterns into k mutually exclusive subsets D1, D2, …, Dk. Train and test the learning algorithm k times; in the i-th run it is trained on D \ Di and tested on Di.
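A generic k-fold cross-validation sketch in NumPy. The model plugged in below (Gaussian RBF with a fixed width, roughly every tenth training point as a center, ridge-regularized weights) is an illustrative assumption, not the procedure mandated by the slides.

import numpy as np

def kfold_cv_mse(X, d, k, fit, predict):
    """Split the data into k folds; train on D minus Di, test on Di, average the test MSE."""
    p = len(X)
    idx = np.random.default_rng(7).permutation(p)
    folds = np.array_split(idx, k)
    scores = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train], d[train])
        scores.append(np.mean((predict(model, X[test]) - d[test]) ** 2))
    return float(np.mean(scores))

# Example model: Gaussian RBF with fixed centers/width, ridge-regularized weights.
def fit(Xtr, dtr, sigma=0.4, lam=1e-3):
    centers = Xtr[:: max(1, len(Xtr) // 10)]               # every ~10th point as a center
    Phi = np.exp(-((Xtr[:, None, :] - centers[None, :, :]) ** 2).sum(2) / (2 * sigma**2))
    w = np.linalg.solve(Phi.T @ Phi + lam * np.eye(len(centers)), Phi.T @ dtr)
    return centers, sigma, w

def predict(model, Xte):
    centers, sigma, w = model
    return np.exp(-((Xte[:, None, :] - centers[None, :, :]) ** 2).sum(2) / (2 * sigma**2)) @ w

rng = np.random.default_rng(8)
X = rng.uniform(-1, 1, size=(120, 1))
d = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(120)
print("5-fold CV MSE:", kfold_cv_mse(X, d, 5, fit, predict))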

103 K-Fold CV Available Data

104 K-Fold CV. Figure: the available data split into D1, D2, D3, …, Dk; in each of the k runs one Di serves as the test set and the remaining subsets form the training set, giving an estimate σ̂²i.

105 Leave-One-Out CV. A special case of k-fold CV: split the p available input-output patterns into a training set of size p − 1 and a test set of size 1, and average the squared error on the left-out pattern over the p possible ways of partitioning.

106 Error Variance Predicted by LOO
A special case of k-fold CV. Let the available input-output patterns be (x1, d1), …, (xp, dp), let Di denote the i-th LOO training set (all patterns except the i-th), and let f_i be the function learned using Di as the training set. The LOO estimate of the prediction-error variance averages the error-square for the left-out element: σ̂²_LOO = (1/p) Σi (di − f_i(xi))².

107 Error Variance Predicted by LOO
Given a model, f_i is the function with least empirical error on Di. The LOO estimate σ̂²_LOO serves as an index of the model's fitness: we want a model that also minimizes it.

108 Error Variance Predicted by LOO
How can σ̂²_LOO be estimated? Are there efficient ways to do it without retraining the model p times?

109 Error Variance Predicted by LOO
For a linear model the error-square for the left-out element can be obtained from a single fit: the LOO residual is (Pd)i / Pii, so σ̂²_LOO = (1/p) dᵀ P (diag P)⁻² P d.

110 Generalized Cross-Validation. GCV replaces the individual diagonal elements Pii by their average trace(P)/p, giving σ̂²_GCV = p dᵀP²d / (trace P)².
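Both LOO and GCV can be computed from a single fit via the projection matrix. The sketch below uses the closed forms stated above, σ̂²_LOO = dᵀP(diag P)⁻²Pd / p and σ̂²_GCV = p dᵀP²d / (trace P)²; treat it as an assumption-laden illustration rather than a definitive reference, since the kernel width and λ values are arbitrary.

import numpy as np

def loo_and_gcv(Phi, d, lam):
    """Leave-one-out and generalized cross-validation error variances from the projection matrix."""
    p, m = Phi.shape
    A = Phi.T @ Phi + lam * np.eye(m)
    P = np.eye(p) - Phi @ np.linalg.solve(A, Phi.T)
    Pd = P @ d
    loo = float(d @ P @ np.diag(1.0 / np.diag(P) ** 2) @ Pd) / p
    gcv = p * float(d @ P @ Pd) / np.trace(P) ** 2
    return loo, gcv

rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, size=(60, 1))
d = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(60)
centers = X[::6]
Phi = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(2) / (2 * 0.4**2))
for lam in (1e-4, 1e-2, 1.0):
    print(lam, loo_and_gcv(Phi, d, lam))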

111 More Criteria Based on CV
GCV (generalized cross-validation), Akaike's information criterion (AIC), UEV (unbiased estimate of variance), FPE (final prediction error), BIC (Bayesian information criterion).

112 More Criteria Based on CV

113 More Criteria Based on CV

114 Regularization. Standard ridge regression penalizes models with large weights using a single global parameter: Cost = SSE + λ Σj wj².

115 Regularization. Standard ridge regression (continued): the resulting weight vector is ŵ = (ΦᵀΦ + λI)⁻¹Φᵀd.

116 Solution Review. The solution ŵ and the projection matrix P are the quantities used to compute the model-selection criteria.

117 Example Width of RBF r = 0.5

118 Example Width of RBF r = 0.5

119 Example. Width of RBF r = 0.5. How can the optimal regularization parameter be determined effectively?

120 Optimizing the Regularization Parameter
Re-Estimation Formula
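The re-estimation formula itself is not reproduced here. As a simple, hedged alternative, the global regularization parameter can also be chosen by scanning a grid of λ values and keeping the one with the smallest GCV; the brute-force version below is self-contained, and its grid range, kernel width, and data are illustrative choices only.

import numpy as np

def gcv(Phi, d, lam):
    """GCV score: p * d^T P^2 d / (trace P)^2."""
    p, m = Phi.shape
    P = np.eye(p) - Phi @ np.linalg.solve(Phi.T @ Phi + lam * np.eye(m), Phi.T)
    return p * float(d @ P @ P @ d) / np.trace(P) ** 2

def best_lambda_by_gcv(Phi, d, grid=np.logspace(-6, 2, 50)):
    """Scan a lambda grid and keep the value with the smallest GCV; a fine enough grid
    also sidesteps the nearest-local-minimum behaviour discussed on the following slides."""
    scores = [gcv(Phi, d, lam) for lam in grid]
    return float(grid[int(np.argmin(scores))])

rng = np.random.default_rng(10)
X = rng.uniform(-1, 1, size=(60, 1))
d = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(60)
centers = X[::6]
Phi = np.exp(-((X[:, None, :] - centers[None, :, :]) ** 2).sum(2) / (2 * 0.4**2))
print("lambda minimizing GCV on the grid:", best_lambda_by_gcv(Phi, d))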

121 Local Ridge Regression
Re-Estimation Formula

122 Example — Width of RBF

123 Example — Width of RBF

124 Example — Width of RBF. There are two local minima.
Using the above re-estimation formula, the search gets stuck at the nearest local minimum; that is, the solution depends on the initial setting.

125 Example — Width of RBF. There are two local minima.

126 Example — Width of RBF. There are two local minima.

127 Example — Width of RBF RMSE: Root Mean Squared Error
In a real application it is not available.

128 Example — Width of RBF RMSE: Root Mean Squared Error
In a real application it is not available.

129 Local Ridge Regression
Standard ridge regression uses a single global parameter λ; local ridge regression uses a separate parameter λj for each basis function: Cost = SSE + Σj λj wj².

130 Local Ridge Regression
Compared with standard ridge regression, λj → ∞ implies that φj(·) can be removed from the model.

131 The Solutions
Linear regression: ŵ = (ΦᵀΦ)⁻¹Φᵀd.
Standard ridge regression: ŵ = (ΦᵀΦ + λI)⁻¹Φᵀd.
Local ridge regression: ŵ = (ΦᵀΦ + Λ)⁻¹Φᵀd, with Λ = diag(λ1, …, λm).
The corresponding projection matrices are used to compute the model-selection criteria.

132 Optimizing the Regularization Parameters
Incremental operation. P: the current projection matrix. Pj: the projection matrix obtained by removing φj(·).

133 Optimizing the Regularization Parameters
Solve for the λj that minimizes the selection criterion (e.g., GCV), subject to λj ≥ 0.

134 Optimizing the Regularization Parameters
Solve for λj, subject to λj ≥ 0, with the remaining λi held fixed.

135 Optimizing the Regularization Parameters
Solve for λj subject to λj ≥ 0; if the solution is λj → ∞, remove φj(·).

136 The Algorithm
Initialize the λi's, e.g., by performing standard ridge regression.
Repeat the following until GCV converges:
Randomly select j and compute the optimal λj.
Perform local ridge regression with that λj.
If GCV is reduced, keep the update; if λj → ∞, remove φj(·).

