6. Radial-basis function (RBF) networks

RBF = radial-basis function: a function which depends only on the radial distance from a point. (Motivating example: the XOR problem, which is not linearly separable, is quadratically separable.)

So RBFs are functions taking the form φ_i(x) = φ(||x − x_i||), where φ is a nonlinear activation function, x is the input and x_i is the i-th position, prototype, basis or centre vector. The idea is that points near the centres will have similar outputs, i.e. if x ≈ x_i then φ(x) ≈ φ(x_i), since they should have similar properties. Therefore, instead of looking at the data points themselves, we characterise the data by their distances from the prototype vectors (similar to kernel density estimation).

For example, the simplest form of φ is the identity function φ(r) = r. With two centres x1 = (0,1) and x2 = (1,0.5), each input x is replaced by its distances d1, d2 to the two centres:

x      d1   d2
(0,0)  1    1.1
(1,1)  1    0.5
(0,1)  0    1.1
(1,0)  1.4  0.5

Now use the distances as the inputs to a network and form a weighted sum of these.

This can be viewed as a two-layer network: the input vector y feeds a hidden layer of N basis functions φ_i(y) = φ(||y − x_i||), one per prototype vector, and the output is the weighted sum Σ_i w_i φ_i(y). The adjustable parameters are the weights w_j; the number of hidden units equals the number of prototype vectors; the form of the basis functions is decided in advance.
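To make the two-layer view concrete, here is a minimal NumPy sketch of the forward pass; the Gaussian basis and the weight values are illustrative assumptions, not values from the slides (the two centres are the ones from the distance table above).

```python
import numpy as np

def rbf_forward(y, centres, weights, sigma=1.0):
    """Hidden layer: one basis function per prototype; output: weighted sum."""
    dists = np.linalg.norm(centres - y, axis=1)     # radial distances ||y - x_i||
    phi = np.exp(-dists**2 / (2 * sigma**2))        # Gaussian basis (one possible choice)
    return phi @ weights                            # output = sum_i w_i * phi_i(y)

centres = np.array([[0.0, 1.0], [1.0, 0.5]])        # the two prototypes from the table above
weights = np.array([0.7, -0.3])                     # hypothetical weights
print(rbf_forward(np.array([0.0, 0.0]), centres, weights))
```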

We use a weighted sum of the outputs from the basis functions for e.g. classification, density estimation etc. The theory can be motivated in many ways (regularisation, Bayesian classification, kernel density estimation, noisy interpolation etc.), but all suggest that the basis functions are set so as to represent the data. Thus the centres can be thought of as prototypes of the input data. (MLP vs RBF: an MLP uses a distributed representation, an RBF network a local one.)

E.g. Bayesian interpretation: if we choose to model the class-conditional densities with the basis functions, φ_k(x) = p(x|C_k), and choose appropriate weights, w_k = P(C_k), then we can interpret the outputs as the posterior probabilities:

O_k = P(C_k|x) ∝ p(x|C_k) P(C_k)

Starting point: exact interpolation. Each input pattern x must be mapped exactly onto its target value d.

That is, given a set of N vectors x_i and a corresponding set of N real numbers d_i (the targets), find a function F that satisfies the interpolation condition F(x_i) = d_i for i = 1,...,N; or, more exactly, find

F(x) = Σ_{i=1..N} w_i φ(||x − x_i||)

satisfying F(x_j) = d_j for j = 1,...,N.

Example: XOR problem.

x      d
(0,0)  0
(1,1)  0
(0,1)  1
(1,0)  1

Exact interpolation: an RBF is placed at the position of each pattern vector, using 1) the linear RBF φ(r) = r.

Network structure: 4 hidden units in the network, one centred on each pattern vector.

Results: with the patterns ordered (0,0), (1,0), (0,1), (1,1) and √2 ≈ 1.4, the interpolation conditions Φ W = D read

[ 0    1    1    1.4 ] [w1]   [0]
[ 1    0    1.4  1   ] [w2] = [1]
[ 1    1.4  0    1   ] [w3]   [1]
[ 1.4  1    1    0   ] [w4]   [0]

with solution w1 = w4 = 1, w2 = w3 = −0.71.

And the general solution is:

F(x1,x2) = sqrt(x1²+x2²) − 0.71 sqrt((x1−1)²+x2²) − 0.71 sqrt(x1²+(x2−1)²) + sqrt((x1−1)²+(x2−1)²)

Here φ(||x_i − x_j||) is a scalar function of the distance between vectors x_i and x_j. For N vectors we get

[ φ(||x1 − x1||)  ...  φ(||x1 − xN||) ] [w1]   [d1]
[      ...        ...       ...       ] [..] = [..]
[ φ(||xN − x1||)  ...  φ(||xN − xN||) ] [wN]   [dN]

i.e. interpolation matrix times weight vector equals target vector, written equivalently as Φ W = D.

If Φ is invertible we have a unique solution of the above equation.

Micchelli's theorem: let x_i, i = 1,...,N, be a set of distinct points in R^d; then the N-by-N interpolation matrix Φ, whose ji-th element is φ(||x_j − x_i||), is nonsingular.

So provided Φ is nonsingular, the interpolation matrix will have an inverse and we can find weights that achieve exact interpolation.

It is easy to see that there is always a solution. For instance, if we take φ(x − y) = 1 if x = y and 0 otherwise (e.g. a Gaussian with very small σ), then setting w_i = d_i solves the interpolation problem. However, this is a bit trivial, as the only general conclusion about the input space is that the training data points are different.

To summarize: For a given data set containing N points (xi,di), i=1,…,N Choose a RBF function f Calculate f(xj - xi ) Obtain the matrix F Solve the linear equation F W = D Get the unique solution Done! Like MLP’s, RBFNs can be shown to be able to approximate any function to arbitrary accuracy (using an arbitrarily large numbers of basis functions). Unlike MLP’s, however, they have the property of ‘best approximation’ i.e. that there exists an RBFN with minimum approximation error.
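A minimal sketch of the recipe in NumPy, assuming Euclidean distances and a user-supplied basis function φ:

```python
import numpy as np

def exact_interpolation(X, d, phi):
    """X: (N, dim) training inputs, d: (N,) targets, phi: scalar function of distance."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)  # (N, N) pairwise distances
    Phi = phi(dists)                                                # interpolation matrix
    w = np.linalg.solve(Phi, d)                                     # unique if Phi nonsingular (Micchelli)
    return w

def rbf_predict(x, X, w, phi):
    """Evaluate F(x) = sum_i w_i phi(||x - x_i||)."""
    return phi(np.linalg.norm(X - x, axis=1)) @ w
```

For example, `exact_interpolation(X, d, lambda r: r)` with the XOR data above reproduces the linear-RBF weights quoted earlier.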

Other types of RBFs include:
(a) multiquadrics: φ(r) = sqrt(r² + c²) for some c > 0
(b) inverse multiquadrics: φ(r) = 1 / sqrt(r² + c²) for some c > 0
(c) Gaussian: φ(r) = exp(−r²/(2σ²)) for some σ > 0
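Written out as vectorised helpers (a sketch; c and σ are free shape parameters):

```python
import numpy as np

def multiquadric(r, c=1.0):
    return np.sqrt(r**2 + c**2)            # grows with r: 'nonlocalized'

def inverse_multiquadric(r, c=1.0):
    return 1.0 / np.sqrt(r**2 + c**2)      # decays with r: 'localized'

def gaussian(r, sigma=1.0):
    return np.exp(-r**2 / (2 * sigma**2))  # decays with r: 'localized'
```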

The linear activation function has some undesirable properties, e.g. the basis function is zero at its own centre, φ(||x_i − x_i||) = 0. (NB the resulting F is still a non-linear function, as it is only piecewise linear in x.) Inverse multiquadric and Gaussian RBFs are both examples of 'localized' functions; multiquadric RBFs are 'nonlocalized' functions.

‘Localized’: as distance from the centre increases the output of the RBF decreases

‘Nonlocalized’: as distance from the centre increases the output of the RBF increases

Example: XOR problem.

x      d
(0,0)  0
(1,1)  0
(0,1)  1
(1,0)  1

Exact interpolation: an RBF is placed at the position of each pattern vector, using 2) the Gaussian RBF with σ = 1.

Network structure: 4 hidden units in the network, one centred on each pattern vector.

Results: with the patterns ordered (0,0), (1,0), (0,1), (1,1), Φ W = D reads

[ exp(0)    exp(−.5)  exp(−.5)  exp(−1)  ] [w1]   [0]
[ exp(−.5)  exp(0)    exp(−1)   exp(−.5) ] [w2] = [1]
[ exp(−.5)  exp(−1)   exp(0)    exp(−.5) ] [w3]   [1]
[ exp(−1)   exp(−.5)  exp(−.5)  exp(0)   ] [w4]   [0]

with solution w1 = w4 = −3.0359, w2 = w3 = 3.4233.

1) F(x1,x2) = sqrt(x1²+x2²) − 0.71 sqrt((x1−1)²+x2²) − 0.71 sqrt(x1²+(x2−1)²) + sqrt((x1−1)²+(x2−1)²)

2) F(x1,x2) = −3.0359 exp(−(x1²+x2²)/2) + 3.4233 exp(−((x1−1)²+x2²)/2) + 3.4233 exp(−(x1²+(x2−1)²)/2) − 3.0359 exp(−((x1−1)²+(x2−1)²)/2)
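A self-contained check of both XOR solutions, assuming the point ordering (0,0), (1,0), (0,1), (1,1) used above:

```python
import numpy as np

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]], dtype=float)
d = np.array([0.0, 1.0, 1.0, 0.0])
r = np.linalg.norm(X[:, None] - X[None, :], axis=-1)     # pairwise distance matrix

for name, phi in [("linear", lambda r: r),
                  ("gaussian, sigma=1", lambda r: np.exp(-r**2 / 2))]:
    w = np.linalg.solve(phi(r), d)                        # exact interpolation weights
    print(name, "weights:", np.round(w, 4), "outputs:", np.round(phi(r) @ w, 4))
# Expected weights: (1, -0.7071, -0.7071, 1) and (-3.0359, 3.4233, 3.4233, -3.0359);
# the outputs should reproduce the targets (0, 1, 1, 0) in both cases.
```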

[Figure: interpolant with a large width, σ = 1]

[Figure: interpolant with a small width, σ = 0.2]

Problems with exact interpolation: it can produce poor generalisation performance, as only the data points constrain the mapping (the overfitting problem).

Bishop (1995) example: the underlying function f(x) = 0.5 + 0.4 sin(2πx) is sampled randomly at 30 points and Gaussian noise is added to each data point. With 30 data points and 30 hidden RBF units, the network fits all data points but creates oscillations, due to the added noise and to being unconstrained between data points.

[Figures: exact fit through all data points vs. fit using only 5 basis functions]

Fitting an RBF to every data point is very inefficient, due to the computational cost of matrix inversion, and is very bad for generalisation. So:
-- use fewer RBFs than data points, i.e. M < N
-- therefore RBFs are not necessarily centred at data points
-- can include bias terms
-- can have Gaussians with general covariance matrices, but there is a trade-off between complexity and the number of parameters to be found

For d-dimensional inputs: a single shared width needs 1 parameter, a diagonal covariance needs d parameters, and a full covariance matrix needs d(d+1)/2 parameters per RBF.

6. Radial-basis function (RBF) networks II

Generalised radial basis function networks: exact interpolation is expensive due to the cost of matrix inversion, so we prefer
-- fewer centres (hidden RBF units)
-- centres not necessarily at data points
-- biases can be included
-- general covariance matrices can be used
There is now no longer exact interpolation, since M (the number of hidden units) < N (the number of training data).

Three-layer networks: the input is an n-dimensional vector x; the hidden layer has M basis functions φ_i(x) = φ(||x − x_i||), i = 1,...,M, plus φ_0 = 1 whose weight w_0 acts as a bias; the output is y = Σ_i w_i φ_i(x). The adjustable parameters are the weights w_j and the number of hidden units M (< N); the form of the basis functions is decided in advance.

[Figure: comparison of hidden units. An MLP hidden unit computes sig(wᵀx), which is constant along hyperplanes wᵀx = k; an RBF hidden unit computes φ(r), which is constant on circles of radius r around its centre.]

Comparison of MLP to RBFN.

MLP:
-- hidden unit outputs are monotonic functions of a weighted linear sum of the inputs => constant on (d−1)-dimensional hyperplanes
-- distributed representation: many hidden units contribute to the network output => interference between units => non-linear training => slow convergence

RBF:
-- hidden unit outputs are functions of the distance from a prototype vector (centre) => constant on concentric (d−1)-dimensional hyperellipsoids
-- localised hidden units mean that few contribute to the output => lack of interference => faster convergence

Comparison of MLP to RBFN.

MLP:
-- more than one hidden layer
-- global supervised learning of all weights
-- global approximations to nonlinear mappings

RBF:
-- one hidden layer
-- hybrid learning, with supervised learning in only one set of weights
-- localised approximations to nonlinear mappings

Hybrid training of RBFN: a two-stage 'hybrid' learning process.

Stage 1: parameterise the hidden layer of RBFs
-- number of hidden units (M)
-- centres/positions (t_i)
-- widths (σ)
Use unsupervised methods (see below), as they are quick and unlabelled data is plentiful. The idea is to estimate the density of the data.

Stage 2: find the weight values between the hidden and output units by minimising the sum-of-squares error between the actual outputs and the desired responses
-- invert the matrix Φ if M = N
-- use the pseudoinverse of Φ if M < N

Stage 2 comes later; for now, concentrate on stage 1.

Random subset approach: randomly select the centres of the M RBF hidden units from the N data points. The widths of the RBFs are usually common and fixed, chosen to ensure a degree of overlap, based on an average or maximum distance between RBFs, e.g. σ = d_max / sqrt(2M), where d_max is the maximum distance between the set of M RBF units. The method is efficient and fast, but suboptimal, and it is important to get σ right, as shown below …
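A sketch of the random-subset recipe; the width heuristic σ = d_max / sqrt(2M) is the one quoted above, the random seed is an arbitrary choice:

```python
import numpy as np

def random_centres(X, M, seed=0):
    """Pick M centres from the data and a common width from the max inter-centre distance."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=M, replace=False)]
    d_max = np.max(np.linalg.norm(centres[:, None] - centres[None, :], axis=-1))
    sigma = d_max / np.sqrt(2 * M)
    return centres, sigma
```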

[Figures: fits with σ = 10, σ = 0.08 and σ = 0.4]

Clustering methods: the K-means algorithm divides the data points into K subgroups based on similarity.

Batch version:
1. Randomly assign each pattern vector x to one of K subsets
2. Compute the mean vector of each subset
3. Reassign each point to the subset with the closest mean vector
4. Loop back to 2 until there are no further reassignments

On-line version:
1. Randomly choose K data points to be the basis centres m_i
2. As each vector x_n is presented, update the nearest m_i using: Δm_i = η (x_n − m_i)
3. Repeat until no further changes
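A minimal sketch of the batch version; K and the seed are the user's choices, and empty subsets are not handled (which is exactly the failure mode noted on the next slide):

```python
import numpy as np

def kmeans_batch(X, K, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(X))              # 1. random assignment
    for _ in range(n_iter):
        means = np.array([X[labels == k].mean(axis=0) for k in range(K)])           # 2. subset means
        new = np.argmin(np.linalg.norm(X[:, None] - means[None], axis=-1), axis=1)  # 3. reassign
        if np.array_equal(new, labels):                    # 4. stop when stable
            break
        labels = new
    return means, labels
```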

The covariance matrices (widths) can now be set to the covariance of the data points in each subset. However, note that:
-- K must be decided at the start
-- the algorithm can be sensitive to initial conditions
-- there can be problems with no/few points falling in a subset (see the competitive learning lecture)
-- the centres might not cover the space accurately

Other unsupervised techniques, such as self-organising maps and Gaussian mixture models, can also be used. Another approach is to use supervised techniques, where the parameters of the basis functions are adaptive and can be optimised; however, this negates the speed and simplicity advantages of the first stage of training.

Relationship with probability density function estimation: radial basis functions can be related to the kernel density functions (Parzen windows) used to estimate probability density functions. E.g. in 2 dimensions, the pdf at a point x can be estimated from the fraction of training points which fall within a square of side h centred on x. Here (with 6 training points),

p(x) = (1/6) × 1/(h×h) × Σ_n H(x − x_n, h), where H = 1 if |x_n − x| < h

i.e. we estimate the density by the fraction of points within each square. Alternatively, H(|x_n − x|) could be a Gaussian, giving a smoother estimate of the pdf.
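A sketch of the square-window estimate in 2-D; here the window is taken as a square of side h (per-coordinate difference less than h/2), which is the convention that matches the 1/h² normalisation above:

```python
import numpy as np

def parzen_square(x, X_train, h):
    """Fraction of training points inside an h-by-h square at x, divided by the square's area."""
    inside = np.all(np.abs(X_train - x) < h / 2, axis=1)   # the indicator H
    return inside.mean() / h**2                            # (count / N) / (h * h)
```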

In radial basis networks, the first stage of training is an attempt to model the density of the data in an unsupervised way. As in kernel density estimation, we try to get an idea of the underlying density by picking some prototypical points, and then use the distribution of the data to approximate a prior distribution.

Back to stage 2, for a network with M < N basis vectors. Now for each training data vector t_i and corresponding target d_i we want F(t_i) = d_i; that is, we must find a function F that satisfies the interpolation condition F(t_i) = d_i for i = 1,...,N, or more exactly find

F(x) = w_0 + Σ_{j=1..M} w_j φ(||x − x_j||)

satisfying F(t_i) = d_i for i = 1,...,N.

So the interpolation matrix becomes

[ 1  φ(||t1 − x1||)  ...  φ(||t1 − xM||) ] [w0]   [d1]
[ .        .         ...        .        ] [..] = [..]
[ 1  φ(||tN − x1||)  ...  φ(||tN − xM||) ] [wM]   [dN]

which can be written as Φ W = D, where Φ is now an N × (M+1) matrix (not square).

To solve this we need to define an error function, such as the least-squares error

E = 1/2 Σ_i ( d_i − F(t_i) )²

and minimise it. As the derivative of the least-squares error is a linear function of the weights, it can be solved using linear matrix-inversion techniques (usually singular value decomposition; see Press et al., Numerical Recipes). Other error functions can be used, but minimising the error then becomes a non-linear optimisation problem.

However, note that the problem is overdetermined: using N training vectors and only M centres, we have M unknowns (the weights) but N pieces of information. E.g. with training vectors (−2, 0) and (1, 0), targets 1 and 2, a single centre at (0, 0) and a linear RBF, Φ W = D gives w = 0.5 from the first pattern but w = 2 from the second; which do we take? Unless N = M and there are no degeneracies (parallel or nearly parallel data vectors), we cannot simply invert the matrix and must use the pseudoinverse (via singular value decomposition).
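A sketch of the M < N case: build the N × (M+1) design matrix (a bias column plus M Gaussian basis outputs, both assumptions here) and take the least-squares / pseudoinverse solution, which NumPy computes via SVD:

```python
import numpy as np

def fit_weights(X, d, centres, sigma):
    """Least-squares weights for an RBF network with a bias and M centres."""
    r = np.linalg.norm(X[:, None] - centres[None, :], axis=-1)               # (N, M) distances
    Phi = np.hstack([np.ones((len(X), 1)), np.exp(-r**2 / (2 * sigma**2))])  # N x (M+1) design matrix
    w, *_ = np.linalg.lstsq(Phi, d, rcond=None)                              # pseudoinverse solution
    return w
```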

Alternatively, we can view this as an ill-posed problem. Ill-posed problems (Tikhonov): how do we infer the function F which maps X onto y from a finite data set? This can be done if the problem is well-posed:
-- existence: each input pattern has an output
-- uniqueness: each input pattern maps onto only one output
-- continuity: small changes in input pattern space imply small changes in y

In RBFs, however:
-- noise can violate the continuity condition
-- different output values for the same input pattern violate uniqueness
-- insufficient information in the training data may violate the existence condition

Ill-posed problem: the finite data set does not yield a unique solution

Regularization theory (Tikhonov, 1963). To solve ill-posed problems we need to supplement the finite data set with prior knowledge about the nature of the mapping: regularization theory. It is common to place the constraint that the mapping is smooth (since smoothness implies continuity), by adding a penalty term for non-smooth mappings to the standard sum-of-squares error:

E(F) = E_S(F) + λ E_C(F)

where e.g.

E_S(F) = 1/2 Σ_i ( d_i − F(x_i) )²  and  E_C(F) = 1/2 ||DF||²

and D could be, say, a first- or second-order differential operator applied to F.

λ is called the regularization parameter:
-- λ = 0: unconstrained (smoothness not enforced)
-- λ → infinity: the smoothness constraint dominates and less account is taken of the training-data error
λ controls the balance (trade-off) between a smooth mapping and fitting the data points exactly.

[Figures: E_C penalises curvature; fits with λ = 0 and λ = 40]

Regularization networks: Poggio & Girosi (1990) applied regularization theory to RBF networks. By minimizing the new error function E(F) we obtain (using results from functional analysis)

(Φ + λ I) W = D, i.e. W = (Φ + λ I)^(−1) D

where I is the unit matrix. Provided E_C is chosen to be quadratic in F, this equation can be solved using the same techniques as for the non-regularised network.
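A sketch of the regularised solution for the case of one basis function per data point, assuming the form (Φ + λI)W = D above; the Gaussian basis and the value of λ are illustrative:

```python
import numpy as np

def regularised_weights(X, d, sigma=1.0, lam=0.1):
    r = np.linalg.norm(X[:, None] - X[None, :], axis=-1)     # N x N distance matrix
    Phi = np.exp(-r**2 / (2 * sigma**2))                     # a basis function at every data point
    return np.linalg.solve(Phi + lam * np.eye(len(X)), d)    # solve (Phi + lambda I) w = d
```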

Problems of RBFs (1):
1. Need to choose the number of basis functions.
2. Due to the local nature of the basis functions, RBF networks have problems ignoring 'noisy' input dimensions, unlike MLPs (it helps to use dimensionality reduction such as PCA).

[Figures: 1D data fitted with M RBFs; the same data with an added dimension of uncorrelated noise needs M² RBFs]

Problems of RBFs (2):
3. The optimal choice of basis function parameters may not be optimal for the output task: for data from a function h, unsupervised training may place an RBF at a point a, which gives a bad representation of h; in contrast, one centred at b would be perfect.

Problems of RBFs (3):
4. Because of the dependence on distance, if the variation in one input variable is small with respect to the others, it will contribute very little to the outcome, since (l + ε)² ≈ l². Therefore, preprocess the data to give zero mean and unit variance via the simple transformation x* = (x − μ)/σ. (We could achieve the same using general covariance matrices, but this is simpler.)

However, this does not take into account correlations in the data. Better to use whitening (Bishop, 1995, pp 299-300)

x* = Λ^(−1/2) Uᵀ (x − μ)

where U is a matrix whose columns are the eigenvectors u_i of Σ, the covariance matrix of the data, and Λ is a matrix with the corresponding eigenvalues λ_i on the diagonal, i.e.

U = (u_1, ..., u_n)  and  Λ = diag(λ_1, ..., λ_n)

[Figure: data ellipse with principal axes along λ_1 u_1 and λ_2 u_2]
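A sketch of the whitening transform, using the eigendecomposition of the sample covariance matrix (assumes the eigenvalues are strictly positive):

```python
import numpy as np

def whiten(X):
    """Return x* = Lambda^(-1/2) U^T (x - mu) for every row of X."""
    mu = X.mean(axis=0)
    eigvals, U = np.linalg.eigh(np.cov(X, rowvar=False))   # columns of U are eigenvectors of the covariance
    return (X - mu) @ U / np.sqrt(eigvals)                 # whitened data has identity covariance
```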

Using RBF nets in practice (a sketch of this recipe appears below):
-- Choose a functional form (generally Gaussian, but prior knowledge/experience may suggest others)
-- Select the type of pre-processing: reduce dimensionality (techniques to follow in the next few lectures)? normalise (whiten) the data? (there is no way of knowing in advance whether these will help; you may need to try a few combinations)
-- Select a clustering method (e.g. k-means)
-- Select the number of basis functions, cluster, and find the basis centres
-- Find the weights (via matrix inversion)
-- Calculate the performance measure.
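An end-to-end sketch of this recipe on synthetic data from the Bishop example mentioned earlier; the random-subset centre selection stands in for clustering, and the Gaussian basis and all specific values are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(30, 1))                                    # 30 random sample points
d = 0.5 + 0.4 * np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.05, 30)  # noisy targets

M = 5                                                                  # number of basis functions
centres = X[rng.choice(30, M, replace=False)]                          # crude stand-in for k-means
sigma = np.max(np.linalg.norm(centres[:, None] - centres[None, :], axis=-1)) / np.sqrt(2 * M)
r = np.linalg.norm(X[:, None] - centres[None, :], axis=-1)             # (30, M) distances
Phi = np.hstack([np.ones((30, 1)), np.exp(-r**2 / (2 * sigma**2))])    # design matrix with bias
w, *_ = np.linalg.lstsq(Phi, d, rcond=None)                            # weights via pseudoinverse
print("training RMSE:", np.sqrt(np.mean((Phi @ w - d)**2)))            # a performance measure
```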

If only life were so simple… How do we choose k? This is similar to the problem of selecting the number of hidden nodes for an MLP. What type of pre-processing is best? Does the clustering method work for the data? E.g. it might be better to fix σ and try again. There is NO general answer: each choice will be problem-specific. The only information you have is your performance measure.

Idea: try e.g. increasing k until the performance measure stops improving (or reaches a minimum, or something more adventurous). [Figure: performance measure plotted against k, with the optimal k at the minimum.] Note the dependence on the performance measure (make sure it is a good one). A good thing about RBF nets is that the training procedure is relatively quick, so lots of combinations can be tried.