Presentation on theme: "6. Radial-basis function (RBF) networks"— Presentation transcript:
Slide 1: 6. Radial-basis function (RBF) networks. RBF = radial-basis function: a function which depends only on the radial distance from a point. Example: the XOR problem (quadratically separable).
Slide 2: So RBFs are functions taking the form fi(x) = f(||x - xi||), where f is a nonlinear activation function, x is the input and xi is the i'th position, prototype, basis or centre vector. The idea is that points near the centres will have similar outputs, i.e. if x ~ xi then f(x) ~ f(xi), since they should have similar properties. Therefore, instead of looking at the data points themselves, characterise the data by their distances from the prototype vectors (similar to kernel density estimation).
Slide 3: For example, the simplest form of f is the identity function, f(x) = x. [Figure: the four XOR input points (0,0), (1,1), (0,1), (1,0) with their distances d1, d2 to two centre vectors x1 = (0,1) and x2 = (1,0.5).] Now use the distances as the inputs to a network and form a weighted sum of these.
Slide 4: This can be viewed as a two-layer network. [Figure: network diagram with input vector y = (y1, ..., yM), a hidden layer of units fi(y) = f(||y - xi||), and output = Σ wi fi(y).] The adjustable parameters are the weights wj; the number of hidden units equals the number of prototype vectors; the form of the basis functions is decided in advance.
Slide 5: Use a weighted sum of the outputs from the basis functions for, e.g., classification, density estimation, etc. The theory can be motivated in many ways (regularisation, Bayesian classification, kernel density estimation, noisy interpolation, etc.), but all suggest that the basis functions are set so as to represent the data. Thus the centres can be thought of as prototypes of the input data. [Figure: MLP vs RBF, distributed vs local representations.]
Slide 6: E.g. Bayesian interpretation: if we choose the basis functions to model p(x|Ck) and choose appropriate weights (the priors), then we can interpret the outputs as the posterior probabilities: Ok = P(Ck|x) ∝ p(x|Ck) P(Ck). [Figure: network with outputs O1, O2, O3, weights P(C1), ..., P(C3), and basis functions F1(x) = p(x|C1), ..., F3(x) = p(x|C3).]
Slide 7: Starting point: exact interpolation. Each input pattern x must be mapped onto a target value d. [Figure: target values d plotted against inputs x.]
Slide 8: That is, given a set of N vectors xi and a corresponding set of N real numbers di (the targets), find a function F that satisfies the interpolation condition F(xi) = di for i = 1, ..., N. Or, more exactly, find F(x) = Σi wi f(||x - xi||) satisfying F(xj) = dj for j = 1, ..., N.
Slide 9: Example: the XOR problem. [Table: the four XOR patterns x = (0,0), (1,1), (0,1), (1,0) with their targets d.] Exact interpolation: an RBF is placed at the position of each pattern vector, using 1) the linear RBF, f(r) = r.
Slide 10: Network structure: i.e. 4 hidden units in the network, one per pattern vector.
Slide 12: And the general solution is of the form F(x1,x2) = w1 sqrt(x1² + x2²) + w2 sqrt((x1-1)² + x2²) + w3 sqrt(x1² + (x2-1)²) + w4 sqrt((x1-1)² + (x2-1)²), i.e. a weighted sum of the distances from (x1,x2) to each of the four pattern vectors, with the weights found from the interpolation conditions.
Slide 13: f(||xi - xj||) is a scalar function of the distance between the vectors xi and xj. For N vectors we get an N×N system of linear equations: the interpolation matrix, whose ji-th element is f(||xi - xj||), multiplied by the weight vector W = (w1, ..., wN)ᵀ equals the target vector D = (d1, ..., dN)ᵀ. Equivalently, F W = D.
Slide 14: If F is invertible we have a unique solution of the above equation, W = F⁻¹D. Micchelli's Theorem: let xi, i = 1, ..., N, be a set of distinct points in Rᵈ; then the N-by-N interpolation matrix, whose ji-th element is f(||xi - xj||), is nonsingular (this holds for a large class of basis functions, including the Gaussian and the multiquadrics). So provided F is nonsingular, the interpolation matrix has an inverse and weights can be found to achieve exact interpolation.
Slide 15: It is easy to see that there is always a solution. For instance, if we take f(||x - y||) = 1 if x = y and 0 otherwise (e.g. a Gaussian with very small s), setting wi = di solves the interpolation problem. However, this is a bit trivial, as the only general conclusion about the input space is that the training data points are different.
Slide 16: To summarize, for a given data set containing N points (xi, di), i = 1, ..., N: choose an RBF function f; calculate f(||xj - xi||) and obtain the matrix F; solve the linear equation F W = D to get the unique weight vector W. Done! Like MLPs, RBFNs can be shown to be able to approximate any function to arbitrary accuracy (using an arbitrarily large number of basis functions). Unlike MLPs, however, they have the 'best approximation' property, i.e. there exists an RBFN with minimum approximation error. A minimal sketch of this exact-interpolation procedure follows.
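The following is a minimal sketch of exact interpolation, assuming a Gaussian basis function with a hand-picked width (sigma = 1) and the usual XOR targets; neither value is prescribed by the slides.

```python
import numpy as np

def gaussian_rbf(r, sigma=1.0):
    """Gaussian basis function of the radial distance r (an assumed choice)."""
    return np.exp(-r**2 / (2 * sigma**2))

def exact_interpolation_weights(X, d, rbf=gaussian_rbf):
    """Solve F W = D with one basis centre per training point."""
    # Interpolation matrix: F[i, j] = f(||x_i - x_j||)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    F = rbf(dists)
    return np.linalg.solve(F, d)

def rbf_predict(x_new, centres, W, rbf=gaussian_rbf):
    """F(x) = sum_i w_i f(||x - x_i||)."""
    r = np.linalg.norm(x_new - centres, axis=-1)
    return rbf(r) @ W

# XOR example from the slides (targets assumed to be the usual XOR labels)
X = np.array([[0., 0.], [1., 1.], [0., 1.], [1., 0.]])
d = np.array([0., 0., 1., 1.])
W = exact_interpolation_weights(X, d)
print([round(float(rbf_predict(x, X, W)), 3) for x in X])  # reproduces d exactly
```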
Slide 17: Other types of RBFs include: (a) multiquadrics, f(r) = sqrt(r² + c²) for some c > 0; (b) inverse multiquadrics, f(r) = 1 / sqrt(r² + c²) for some c > 0; (c) the Gaussian, f(r) = exp(-r² / (2s²)) for some s > 0. These three forms are written out in code below.
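A compact restatement of the three forms as functions of the radial distance r; the default values of c and s below are arbitrary placeholders, not values from the slides.

```python
import numpy as np

def multiquadric(r, c=1.0):
    return np.sqrt(r**2 + c**2)          # grows with r: 'nonlocalized'

def inverse_multiquadric(r, c=1.0):
    return 1.0 / np.sqrt(r**2 + c**2)    # decays with r: 'localized'

def gaussian(r, s=1.0):
    return np.exp(-r**2 / (2 * s**2))    # decays with r: 'localized'
```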
Slide 18: The linear activation function has some undesirable properties, e.g. f(||xi - xi||) = 0 at the centres. (NB the network output is still a non-linear function, as it is only piecewise linear in x.) Inverse multiquadric and Gaussian RBFs are both examples of 'localized' functions; multiquadric RBFs are 'nonlocalized' functions.
Slide 19: 'Localized': as the distance from the centre increases, the output of the RBF decreases.
Slide 20: 'Nonlocalized': as the distance from the centre increases, the output of the RBF increases.
Slide 21: Example: the XOR problem again. [Table: the four XOR patterns x = (0,0), (1,1), (0,1), (1,0) with their targets d.] Exact interpolation: an RBF is placed at the position of each pattern vector, this time using 2) a Gaussian RBF with s = 1.
Slide 22: Network structure: again 4 hidden units in the network, one per pattern vector.
Slide 27: Problems with exact interpolation: it can produce poor generalisation performance, as only the data points constrain the mapping (the overfitting problem). Bishop (1995) example: the underlying function f(x) = sin(2πx) is sampled randomly at 30 points, Gaussian noise is added to each data point, and a network with 30 hidden RBF units (one per data point) is fitted. The fit passes through all the data points but creates oscillations, due to the added noise and to being unconstrained between the data points. A sketch reproducing this setup is given below.
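A sketch of the Bishop (1995) setup; the Gaussian width (0.067, roughly the spacing of 30 points in [0, 1]) and the noise level (0.2) are assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0.0, 1.0, 30))              # 30 random sample positions
d = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)  # noisy targets

sigma = 0.067
F = np.exp(-(x[:, None] - x[None, :])**2 / (2 * sigma**2))  # 30x30 interpolation matrix
w = np.linalg.solve(F, d)                            # exact interpolation weights

x_test = np.linspace(0, 1, 200)
F_test = np.exp(-(x_test[:, None] - x[None, :])**2 / (2 * sigma**2))
y_test = F_test @ w   # oscillates between the data points, unlike sin(2*pi*x)
```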
Slide 29: Fitting an RBF to every data point is very inefficient, due to the computational cost of matrix inversion, and is very bad for generalisation. So: use fewer RBFs than data points, i.e. M < N; therefore the RBFs are not necessarily centred at data points; bias terms can be included; and Gaussians with general covariance matrices can be used, though there is a trade-off between complexity and the number of parameters to be found.
Slide 30: Number of width parameters per basis function: a spherical Gaussian has 1 parameter, a diagonal covariance has d parameters, and a general covariance matrix in d dimensions has d(d+1)/2 parameters.
Slide 31: 6. Radial-basis function (RBF) networks II: Generalised radial basis function networks. Exact interpolation is expensive due to the cost of matrix inversion, so: prefer fewer centres (hidden RBF units); centres are not necessarily at data points; biases can be included; general covariance matrices can be used. This is now no longer exact interpolation, so F(x) = w0 + Σ_{j=1..M} wj f(||x - xj||), where M (the number of hidden units) < N (the number of training data).
Slide 32: Three-layer networks. [Figure: network diagram with an n-dimensional input vector x, a hidden layer of units fj(x) = f(||x - xj||) plus a bias unit f0 = 1 with weight w0, and output y = Σ wj fj(x).] The adjustable parameters are the weights wj and the number of hidden units M (< N); the form of the basis functions is decided in advance.
Slide 33: [Figure: side-by-side comparison of an MLP hidden unit, which computes sig(wᵀx) and is constant along hyperplanes wᵀx = k, with an RBF hidden unit, which computes f(r), a function of the distance r from its centre.]
Slide 34: Comparison of MLP to RBFN. MLP: hidden unit outputs are monotonic functions of a weighted linear sum of the inputs => constant on (d-1)-dimensional hyperplanes; distributed representation, as many hidden units contribute to the network output => interference between units => non-linear training => slow convergence. RBF: hidden unit outputs are functions of the distance from a prototype vector (centre) => constant on concentric (d-1)-dimensional hyperellipsoids; localised hidden units mean that few contribute to the output => lack of interference => faster convergence.
Slide 35: Comparison of MLP to RBFN (continued). MLP: can have more than one hidden layer; global supervised learning of all weights; global approximations to nonlinear mappings. RBF: one hidden layer; hybrid learning, with supervised learning of only one set of weights; localised approximations to nonlinear mappings.
Slide 36: [Recap of slide 32: the three-layer network with M (< N) hidden RBF units, a bias unit f0 = 1 with weight w0, and output Σ wj fj(x).]
Slide 37: Hybrid training of RBFNs: a two-stage 'hybrid' learning process. Stage 1: parameterise the hidden layer of RBFs: the number of hidden units (M), the centre positions, and the widths (s). Use unsupervised methods (see below), as they are quick and unlabelled data is plentiful; the idea is to estimate the density of the data. Stage 2: find the weight values between the hidden and output units by minimising the sum-of-squares error between the actual outputs and the desired responses: invert the matrix F if M = N, or use the pseudoinverse of F if M < N. Stage 2 comes later; for now concentrate on stage 1.
Slide 38: Random subset approach: randomly select the centres of the M RBF hidden units from the N data points. The widths of the RBFs are usually common and fixed to ensure a degree of overlap, based on an average or maximum distance between the RBFs, e.g. s = dmax / sqrt(2M), where dmax is the maximum distance between the set of M RBF centres. The method is efficient and fast, but suboptimal, and it is important to get s correct; a sketch is given below.
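A sketch of the random-subset approach with the dmax / sqrt(2M) width heuristic from the slide.

```python
import numpy as np

def random_subset_centres(X, M, rng=None):
    """Pick M centres at random from the data and set a common width s."""
    rng = np.random.default_rng(rng)
    idx = rng.choice(len(X), size=M, replace=False)
    centres = X[idx]
    # Maximum pairwise distance between the chosen centres
    d_max = max(np.linalg.norm(a - b) for a in centres for b in centres)
    s = d_max / np.sqrt(2 * M)
    return centres, s
```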
Slide 40: Clustering methods: the K-means algorithm divides the data points into K subgroups based on similarity. Batch version: 1. randomly assign each pattern vector x to one of K subsets; 2. compute the mean vector of each subset; 3. reassign each point to the subset with the closest mean vector; 4. loop back to 2 until there are no further reassignments. On-line version: 1. randomly choose K data points to be the basis centres mi; 2. as each vector xn is presented, update the nearest mi using Δmi = η(xn - mi); 3. repeat until no further changes. A sketch of the batch version follows.
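A minimal sketch of batch K-means for choosing RBF centres; the initialisation (random assignment to K subsets) follows the slide, while the handling of empty subsets (re-seeding from a random data point) is an added assumption.

```python
import numpy as np

def kmeans_batch(X, K, max_iter=100, rng=None):
    rng = np.random.default_rng(rng)
    assign = rng.integers(0, K, size=len(X))        # 1. random assignment
    for _ in range(max_iter):
        # 2. mean vector of each subset (re-seed from a random point if a subset is empty)
        centres = np.array([X[assign == k].mean(axis=0) if np.any(assign == k)
                            else X[rng.integers(len(X))] for k in range(K)])
        # 3. reassign each point to the closest mean
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=-1)
        new_assign = dists.argmin(axis=1)
        if np.array_equal(new_assign, assign):      # 4. stop when stable
            break
        assign = new_assign
    return centres, assign
```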
Slide 41: The covariance matrices (s) can now be set to the covariance of the data points in each subset. However, note that K must be decided at the start; the algorithm can be sensitive to initial conditions; there can be problems with no or few points landing in a subset (see the competitive learning lecture); and the centres might not cover the space accurately. Other unsupervised techniques, such as self-organising maps and Gaussian mixture models, can also be used. Another approach is to use supervised techniques, where the parameters of the basis functions are adaptive and can be optimised; however, this negates the speed and simplicity advantages of the first stage of training.
Slide 42: Relationship with probability density function estimation. Radial basis functions can be related to kernel density functions (Parzen windows) used to estimate probability density functions. E.g. in 2 dimensions, the pdf at a point x can be estimated from the fraction of training points which fall within a square of side h centred on x. [Figure: a square of side h centred on x, containing some of the training points.] Here p(x) = (1/N) × (1/h²) × Σn H(x - xn, h), where H = 1 if ||xn - x|| < h, i.e. estimate the density by the fraction of points within each square (in the figure N = 6, giving the 1/6 factor). Alternatively, H(||xn - x||) could be a Gaussian, giving a smoother estimate of the pdf; a sketch of such an estimate is given below.
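A sketch of a Parzen-window density estimate using the smoother Gaussian kernel mentioned on the slide; the bandwidth h = 0.5 is a placeholder.

```python
import numpy as np

def parzen_density(x, X_train, h=0.5):
    """p(x) ~ (1/N) * sum_n K(x - x_n), with a Gaussian kernel of bandwidth h."""
    N, d = X_train.shape
    sq_dists = np.sum((X_train - x)**2, axis=1)
    kernel = np.exp(-sq_dists / (2 * h**2)) / ((2 * np.pi * h**2) ** (d / 2))
    return kernel.sum() / N
```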
Slide 43: In radial basis networks, the first stage of training is an attempt to model the density of the data in an unsupervised way. As in kernel density estimation, we try to get an idea of the underlying density by picking some prototypical points, then use the distribution of the data to approximate a prior distribution.
Slide 44: Back to stage 2, for a network with M < N basis vectors. Now for each training data vector ti and corresponding target di we want F(ti) = di, that is, we must find a function F that satisfies the interpolation condition F(ti) = di for i = 1, ..., N. Or, more exactly, find F(x) = w0 + Σ_{j=1..M} wj f(||x - xj||) satisfying F(ti) = di for i = 1, ..., N.
Slide 45: So the interpolation matrix becomes the matrix whose i-th row is (1, f(||ti - x1||), ..., f(||ti - xM||)); with W = (w0, w1, ..., wM)ᵀ and D = (d1, ..., dN)ᵀ this can again be written as F W = D, where F is now an N×(M+1) matrix (not square).
Slide 46: To solve this, we need to define an error function, such as the least-squares error E = ½ Σi (di - F(ti))², and minimise it. As the derivative of the least-squares error is a linear function of the weights, it can be minimised using linear matrix inversion techniques (usually singular value decomposition; see Press et al., Numerical Recipes). Other error functions can be used, but minimising the error then becomes a non-linear optimisation problem.
Slide 47: However, note that the problem is overdetermined: by using N training vectors and only M centres we have M unknowns (the weights) and N pieces of information. E.g. training vectors (-2, 0) and (1, 0) with targets 1 and 2, a single centre at (0, 0), and a linear RBF give F W = D with the two inconsistent equations 2w = 1 and w = 2, i.e. w = 0.5 or w = 2? Unless N = M and there are no degeneracies (parallel or nearly parallel data vectors), we cannot simply invert the matrix and must use the pseudoinverse (computed via singular value decomposition), as sketched below.
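A minimal sketch of the stage-2 least-squares weight computation, assuming NumPy's pinv as the SVD-based pseudoinverse; the design-matrix layout follows slide 45.

```python
import numpy as np

def rbf_design_matrix(T, centres, rbf):
    """Rows: training vectors t_i; columns: 1, f(||t_i - x_1||), ..., f(||t_i - x_M||)."""
    dists = np.linalg.norm(T[:, None, :] - centres[None, :, :], axis=-1)
    return np.hstack([np.ones((len(T), 1)), rbf(dists)])

def least_squares_weights(T, d, centres, rbf):
    F = rbf_design_matrix(T, centres, rbf)
    return np.linalg.pinv(F) @ d      # pseudoinverse, computed via SVD

# The slide's toy example, written without the bias column so the conflict
# between w = 0.5 and w = 2 is visible: one centre at the origin, linear RBF.
F = np.array([[2.0], [1.0]])          # distances of (-2,0) and (1,0) from (0,0)
d = np.array([1.0, 2.0])
print(np.linalg.pinv(F) @ d)          # least-squares compromise: w = 0.8
```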
Slide 48: Alternatively, we can view this as an ill-posed problem (Tikhonov). How do we infer the function F which maps X onto y from a finite data set? This can be done if the problem is well-posed: existence (each input pattern has an output), uniqueness (each input pattern maps onto only one output), and continuity (small changes in input pattern space imply small changes in y). In RBFs, however: noise can violate the continuity condition; different output values for the same input pattern violate uniqueness; and insufficient information in the training data may violate the existence condition.
Slide 49: Ill-posed problem: the finite data set does not yield a unique solution.
Slide 50: Regularization theory (Tikhonov, 1963). To solve ill-posed problems we need to supplement the finite data set with prior knowledge about the nature of the mapping: regularization theory. It is common to place the constraint that the mapping is smooth (since smoothness implies continuity), adding a penalty term to the standard sum-of-squares error for non-smooth mappings: E(F) = ES(F) + λ EC(F), where e.g. ES(F) = ½ Σi (di - F(xi))² and EC(F) = ½ ||DF||², and DF could be, say, the first- or second-order derivative of F, etc.
Slide 51: λ is called the regularization parameter. λ = 0: unconstrained (smoothness not enforced). λ = infinity: the smoothness constraint dominates and less account is taken of the training data error. λ controls the balance (trade-off) between a smooth mapping and fitting the data points exactly.
Slide 53: Regularization networks. Poggio & Girosi (1990) applied regularization theory to RBF networks. By minimizing the new error function E(F) we obtain (using results from functional analysis) a modified linear system of the form (F + λI) W = D, where I is the unit matrix. Provided EC is chosen to be quadratic in y, this equation can be solved using the same techniques as for the non-regularised network; a sketch is given below.
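A sketch of regularised weight solves. The first function is the (F + λI) W = D form for the exact-interpolation case (one centre per data point); the ridge-regression variant for M < N centres is an added assumption, not something stated on the slide.

```python
import numpy as np

def regularised_weights_exact(F, d, lam=1e-2):
    """One centre per data point: solve (F + lam*I) W = D."""
    return np.linalg.solve(F + lam * np.eye(len(F)), d)

def regularised_weights_ridge(F, d, lam=1e-2):
    """M < N centres (assumed variant): solve (F^T F + lam*I) W = F^T D."""
    return np.linalg.solve(F.T @ F + lam * np.eye(F.shape[1]), F.T @ d)
```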
Slide 54: Problems of RBFs. 1. The number of basis functions must be chosen. 2. Due to the local nature of the basis functions, RBFs have problems ignoring 'noisy' input dimensions, unlike MLPs (it helps to use dimensionality reduction such as PCA). [Figure: 1-D data covered with M RBFs vs. the same data with an added dimension of uncorrelated noise, now needing M² RBFs.]
Slide 55: Problems of RBFs 2. 3. The choice of basis-function parameters that is optimal for representing the inputs may not be optimal for the output task. [Figure: data drawn from a function h leads to an RBF placed at a, which gives a bad representation of h; in contrast, one centred at b would be perfect.]
Slide 56: Problems of RBFs 3. 4. Because of the dependence on distance, if the variation in one input dimension is small with respect to the others, it will contribute very little to the outcome, since (l + ε)² ≈ l². Therefore, preprocess the data to give zero mean and unit variance via the simple transformation x* = (x - m) / s. (The same could be achieved using general covariance matrices, but this is simpler.)
Slide 57: However, this does not take into account correlations in the data. It is better to use whitening (Bishop, 1995).
Slide 58: x* = Λ^(-1/2) Uᵀ (x - m), where U is a matrix whose columns are the eigenvectors ui of Σ, the covariance matrix of the data, and Λ is a matrix with the corresponding eigenvalues λi on the diagonal, i.e. U = (u1, ..., un) and Λ = diag(λ1, ..., λn). [Figure: the principal axes λ1 u1 and λ2 u2 of the data.] A sketch of the whitening transform follows.
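A sketch of the whitening transform via the eigendecomposition of the sample covariance matrix; the small eps added to the eigenvalues is an assumption for numerical safety, not part of the slide's formula.

```python
import numpy as np

def whiten(X, eps=1e-10):
    """x* = Lambda^(-1/2) U^T (x - m), applied row-wise to the data matrix X."""
    m = X.mean(axis=0)
    S = np.cov(X, rowvar=False)                 # covariance matrix of the data
    eigvals, U = np.linalg.eigh(S)              # columns of U are eigenvectors u_i
    Lam_inv_sqrt = np.diag(1.0 / np.sqrt(eigvals + eps))
    X_star = (X - m) @ U @ Lam_inv_sqrt         # whitened data: zero mean, unit covariance
    return X_star, m, U, eigvals
```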
Slide 59: Using RBF nets in practice: choose a functional form (generally Gaussian, but prior knowledge or experience may suggest others); select the type of pre-processing: reduce dimensionality (techniques to follow in the next few lectures)? normalise (whiten) the data? (there is no way of knowing in advance whether these will help: you may need to try a few combinations); select a clustering method (k-means); select the number of basis functions, cluster, and find the basis centres; find the weights (via matrix inversion); calculate the performance measure. An end-to-end sketch of this recipe is given below.
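An end-to-end sketch of this recipe on a toy 1-D regression task; the Gaussian basis functions, random-subset centres, M = 8, and the dmax / sqrt(2M) width heuristic are placeholder choices, not prescriptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data
X = rng.uniform(0, 1, (60, 1))
d = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.1, 60)

# Pre-processing: zero mean, unit variance
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Choose M centres from the data and a common width
M = 8
centres = Xn[rng.choice(len(Xn), M, replace=False)]
d_max = max(np.linalg.norm(a - b) for a in centres for b in centres)
sigma = d_max / np.sqrt(2 * M)

# Design matrix with bias column, Gaussian basis functions
r = np.linalg.norm(Xn[:, None, :] - centres[None, :, :], axis=-1)
F = np.hstack([np.ones((len(Xn), 1)), np.exp(-r**2 / (2 * sigma**2))])

# Weights via the pseudoinverse, then a performance measure
w = np.linalg.pinv(F) @ d
mse = np.mean((F @ w - d)**2)
print(f"training MSE: {mse:.4f}")
```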
Slide 60: If only life were so simple... How do we choose k? (Similar to the problem of selecting the number of hidden nodes for an MLP.) What type of pre-processing is best? Does the clustering method work for the data? E.g. it might be better to fix s and try again. There is NO general answer: each choice will be problem-specific. The only information you have is your performance measure.
Slide 61: Idea: try, e.g., increasing k until the performance measure decreases (or reaches a minimum, or something more adventurous). [Figure: performance measure plotted against k, with the optimal k marked.] Note the dependence on the performance measure (make sure it's a good one). The good thing about RBF nets is that the training procedure is relatively quick, so lots of combinations can be tried.