Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set

Several slides are adapted from the Lecture Notes for E. Alpaydın, Introduction to Machine Learning, 2nd ed., © 2010 The MIT Press (V1.0).


1 Radial Basis Function ANN, an alternative to back propagation, uses clustering of examples in the training set

2 Example of a Radial Basis Function (RBF) network
- Input vector with d dimensions
- K radial basis functions
- Single output
- Structure used for multivariate regression or binary classification

3 Review: the RBF network provides an alternative to back propagation
- Each hidden node is associated with a cluster of input instances
- The hidden layer is connected to the output by linear least squares
- Gaussians are the most frequently used radial basis function: φ_j(x) = exp(−½ (‖x − μ_j‖ / σ_j)²)
- Clusters of input instances are parameterized by a mean and a variance
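A minimal NumPy sketch of this hidden layer; the function name rbf_features and the array layout are illustrative, not taken from the slides.

```python
import numpy as np

def rbf_features(X, centers, sigmas):
    """Gaussian RBF activations phi_j(x) = exp(-0.5 * (||x - mu_j|| / sigma_j)^2).

    X:       (N, d) input vectors
    centers: (K, d) cluster means mu_j
    sigmas:  (K,)   cluster widths sigma_j
    Returns an (N, K) matrix of hidden-node activations.
    """
    diff = X[:, None, :] - centers[None, :, :]      # (N, K, d) differences
    sq_dist = np.sum(diff ** 2, axis=2)             # squared Euclidean distances (N, K)
    return np.exp(-0.5 * sq_dist / sigmas[None, :] ** 2)
```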

4 Linear least squares with basis functions
- Given the training set and the mean and variance of K clusters of input data, construct the N×K matrix D of basis-function activations and the column vector r of targets
- Add a column of ones to D to include a bias node
- Solve the normal equations DᵀD w = Dᵀr for the vector w of K+1 weights (K hidden nodes plus bias) connecting the hidden layer to the output node
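A short sketch of this least-squares step, assuming an activation matrix like the one built above. Using np.linalg.lstsq solves the same problem as the normal equations DᵀD w = Dᵀr, just more stably than forming DᵀD explicitly; the helper names are illustrative.

```python
import numpy as np

def solve_output_weights(Phi, r):
    """Least-squares weights from hidden activations Phi (N x K) to targets r (N,)."""
    N = Phi.shape[0]
    D = np.hstack([Phi, np.ones((N, 1))])        # append a column of ones: bias node
    # Equivalent to solving D^T D w = D^T r, but numerically safer.
    w, *_ = np.linalg.lstsq(D, r, rcond=None)
    return w                                      # K weights followed by the bias
```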

5 RBF networks perform best with large datasets
- With a large dataset we expect redundancy (i.e., multiple examples expressing the same general pattern)
- In an RBF network, the hidden layer is a feature-space representation of the data in which redundancy has been used to reduce noise
- A validation set may be helpful for determining K, the best number of clusters of input data (see the sketch below)
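One way to act on the validation-set suggestion is sketched below. It assumes scikit-learn's KMeans for the clustering step and the width heuristic from slide 13; the helper names (fit_rbf, predict_rbf, choose_K) and the candidate list are hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_rbf(X, r, K):
    """Fit a toy RBF regressor with K Gaussian units (hypothetical helper)."""
    centers = KMeans(n_clusters=K, n_init=10, random_state=0).fit(X).cluster_centers_
    d_max = np.linalg.norm(centers[:, None] - centers[None, :], axis=2).max()
    sigma = d_max / np.sqrt(2 * K)                     # width heuristic from slide 13
    Phi = np.exp(-0.5 * np.sum((X[:, None] - centers[None, :]) ** 2, axis=2) / sigma ** 2)
    D = np.hstack([Phi, np.ones((len(X), 1))])         # bias column
    w, *_ = np.linalg.lstsq(D, r, rcond=None)
    return centers, sigma, w

def predict_rbf(X, centers, sigma, w):
    Phi = np.exp(-0.5 * np.sum((X[:, None] - centers[None, :]) ** 2, axis=2) / sigma ** 2)
    return np.hstack([Phi, np.ones((len(X), 1))]) @ w

def choose_K(X_tr, r_tr, X_val, r_val, candidates=(2, 4, 8, 16)):
    """Pick the K with the lowest validation mean-squared error."""
    errs = {K: np.mean((predict_rbf(X_val, *fit_rbf(X_tr, r_tr, K)) - r_val) ** 2)
            for K in candidates}
    return min(errs, key=errs.get), errs
```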

6 Background on clustering
- Supervised learning: mapping input to output
- Unsupervised learning: find regularities in the input
- The regularities reflect some probability distribution of attribute vectors, p(x^t); discovering p(x^t) is called "density estimation"
- A parametric method uses MLE to find θ in p(x^t | θ)
- In clustering, we look for regularities as group membership
- Assume we know the number of clusters, K
- Given K and the dataset X, we want to find the size of each group, P(G_i), and its component density, p(x | G_i)

7 K-means clustering: hard labels
- Find group labels using the geometric interpretation of a cluster: points in attribute space that are closer to a "center" than they are to data points not in the cluster
- Define trial centers by reference vectors m_j, j = 1…K
- Define group labels based on the nearest center
- Get new trial centers from the labeled groups
- Judge convergence by the change in the reference vectors between iterations

8 K-means clustering pseudo code [pseudo code figure not captured in the transcript; see the sketch below]
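Since the pseudo code itself is an image, the following NumPy sketch is an assumption about what the slide shows: the usual formulation that assigns each instance to its nearest center and then moves each center to the mean of its instances.

```python
import numpy as np

def k_means(X, K, n_iter=100, rng=np.random.default_rng(0)):
    """Minimal K-means: alternate hard labelling (E-step) and center update (M-step)."""
    # Initialize the trial centers m_j with K randomly chosen training instances.
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iter):
        # E-step: label each instance with its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = np.argmin(dists, axis=1)
        # M-step: move each center to the mean of the instances it covers.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(K)])
        # Judge convergence by how much the centers moved.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```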

9 Example of pseudo code application [worked example figure not captured in the transcript]

10 Example of K-means with arbitrary starting centers and convergence plot [figure not captured in the transcript]

11 K-means is an example of the Expectation-Maximization (EM) approach to MLE
- The log likelihood of a mixture model cannot be solved analytically for the parameters Φ
- Use a 2-step iterative method:
- E-step: estimate the labels of x^t given current knowledge of the mixture components
- M-step: update the component knowledge using the labels from the E-step

12 K-means clustering pseudo code with the E-step and M-step labeled [figure not captured in the transcript]

13 Application of K-means clustering to RBF-ANN
- Given converged K-means centers, estimate the variance for the RBFs by σ² = d²_max / (2K), where d_max is the largest distance between cluster centers
- Gaussian mixture theory is another approach to obtaining RBFs
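A small sketch of this width heuristic, assuming converged centers such as those returned by the K-means sketch above; the function name is illustrative.

```python
import numpy as np

def rbf_width(centers):
    """Width heuristic sigma^2 = d_max^2 / (2K) from converged K-means centers."""
    K = len(centers)
    # d_max: largest distance between any pair of cluster centers.
    pairwise = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=2)
    d_max = pairwise.max()
    return d_max / np.sqrt(2 * K)   # returns sigma, shared by every RBF unit
```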

14 Gaussian mixture densities
- X = {x^t} is made up of K groups (clusters)
- P(G_i): proportion of X in group i
- Attributes in each group are Gaussian distributed: p(x^t | G_i) = N_d(μ_i, Σ_i)
- μ_i: mean of the x^t in group i
- Σ_i: covariance matrix of the x^t in group i
- The distribution of attributes is a mixture of Gaussians

15 Estimators
- Given a group label r_i^t for each data point, MLE provides estimates of the parameters of the Gaussian mixture, where p(x | G_i) ~ N(μ_i, Σ_i) and Φ = {P(G_i), μ_i, Σ_i}, i = 1…K
- P(G_i) = Σ_t r_i^t / N
- m_i = Σ_t r_i^t x^t / Σ_t r_i^t
- S_i = Σ_t r_i^t (x^t − m_i)(x^t − m_i)ᵀ / Σ_t r_i^t
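A sketch of these estimators for hard 0/1 labels, assuming every group has at least one member; the function name mixture_mle_hard is illustrative.

```python
import numpy as np

def mixture_mle_hard(X, labels, K):
    """MLE of mixture parameters from hard labels (group index per instance)."""
    N, d = X.shape
    priors = np.zeros(K)
    means = np.zeros((K, d))
    covs = np.zeros((K, d, d))
    for i in range(K):
        members = X[labels == i]                 # assumes each group is non-empty
        priors[i] = len(members) / N             # P(G_i) = sum_t r_i^t / N
        means[i] = members.mean(axis=0)          # m_i
        diff = members - means[i]
        covs[i] = diff.T @ diff / len(members)   # S_i (MLE, divides by group size)
    return priors, means, covs
```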

16 1D Gaussian distribution [figure: Gaussian curve with mean μ and width σ, not captured in the transcript]
- p(x) = N(μ, σ²)
- MLE for μ and σ²: m = (1/N) Σ_t x^t and s² = (1/N) Σ_t (x^t − m)²
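A tiny numerical illustration of these two estimators; the synthetic sample below is invented for the example.

```python
import numpy as np

x = np.random.default_rng(0).normal(loc=2.0, scale=1.5, size=1000)  # invented sample
m = x.mean()                   # MLE of mu:      m  = (1/N) sum_t x^t
s2 = np.mean((x - m) ** 2)     # MLE of sigma^2: s^2 = (1/N) sum_t (x^t - m)^2 (divides by N)
```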

17 d-dimensional Gaussian distribution
- Mahalanobis distance: (x − μ)ᵀ Σ⁻¹ (x − μ), analogous to (x − μ)²/σ² in one dimension
- x − μ is a d×1 column vector and Σ is a d×d matrix, so the M-distance is a scalar
- It measures the distance of x from the mean in units of Σ
- d denotes the number of attributes
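A short sketch of the Mahalanobis distance; solving the linear system instead of inverting Σ explicitly is a design choice for numerical stability, not something from the slides.

```python
import numpy as np

def mahalanobis_sq(x, mu, Sigma):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu); returns a scalar."""
    diff = x - mu
    return float(diff @ np.linalg.solve(Sigma, diff))  # avoids forming Sigma^{-1}

# With Sigma = sigma^2 * I this reduces to the 1D form (x - mu)^2 / sigma^2 per attribute.
```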

18 If the components x_i are independent, the off-diagonal elements of Σ are 0 and p(x) is the product of the probabilities of the individual components of x

19 Gaussian mixture model by EM: soft labels
- Replace the hard labels r_i^t by soft labels h_i^t, the probability that x^t belongs to cluster i
- Assume the cluster densities p(x^t | Φ) are Gaussian; then the mixture proportions, means, and covariance matrices are estimated by weighted averages in which the h_i^t from the previous E-step play the role of the hard labels (formulas on the next slide)

20 Gaussian mixture model by EM: soft labels (continued)
- Initialize by k-means clustering: after a few iterations, use the centers m_i and the instances covered by each center to estimate the covariance matrices S_i and mixture proportions π_i
- From m_i, S_i, and π_i, calculate the soft labels: h_i^t = π_i N(x^t | m_i, S_i) / Σ_j π_j N(x^t | m_j, S_j)
- Calculate new proportions, centers, and covariances: π_i = (1/N) Σ_t h_i^t, m_i = Σ_t h_i^t x^t / Σ_t h_i^t, S_i = Σ_t h_i^t (x^t − m_i)(x^t − m_i)ᵀ / Σ_t h_i^t
- Use these to calculate new soft labels, and iterate (see the sketch below)
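A compact EM sketch with soft labels. Unlike the slide, it initializes the means with random instances rather than k-means output (an assumption made to keep the example self-contained), and the regularization term reg is an addition for numerical safety; the function name gmm_em is illustrative.

```python
import numpy as np

def gmm_em(X, K, n_iter=50, reg=1e-6, rng=np.random.default_rng(0)):
    """EM for a Gaussian mixture with soft labels h_i^t (sketch)."""
    N, d = X.shape
    means = X[rng.choice(N, size=K, replace=False)].copy()   # random init (slides use k-means)
    covs = np.array([np.cov(X.T) + reg * np.eye(d)] * K)
    priors = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: soft labels h_i^t proportional to pi_i * N(x^t | m_i, S_i).
        resp = np.zeros((N, K))
        for i in range(K):
            diff = X - means[i]
            inv = np.linalg.inv(covs[i])
            expo = -0.5 * np.sum(diff @ inv * diff, axis=1)
            norm = np.sqrt(((2 * np.pi) ** d) * np.linalg.det(covs[i]))
            resp[:, i] = priors[i] * np.exp(expo) / norm
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate proportions, means, and covariances with the soft labels.
        Nk = resp.sum(axis=0)
        priors = Nk / N
        means = (resp.T @ X) / Nk[:, None]
        for i in range(K):
            diff = X - means[i]
            covs[i] = (resp[:, i, None] * diff).T @ diff / Nk[i] + reg * np.eye(d)
    return priors, means, covs, resp
```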

21 K-means vs. EM Gaussian mixtures [comparison figure not captured in the transcript]
- K-means: hard labels; centers marked
- EM Gaussian mixtures with soft labels: contours show 1 standard deviation; colors show mixture proportions

22 k-means hard labels [figure not captured in the transcript]

23 Gaussian mixtures: soft labels [figure not captured in the transcript]
- Data points are color-coded by their greater soft label; x marks each cluster mean
- Contours show the mean ± one standard deviation of the Gaussian densities
- The dashed contour is the "separating" curve where P(G_1 | x) = 0.5
- Outliers?

24 In applications of Gaussian mixtures to RBFs, correlation between attributes is ignored and the diagonal elements of the covariance matrix are equal. In this approximation the Mahalanobis distance reduces to Euclidean distance, and the variance parameter of the radial basis function becomes a scalar.

25 Hierarchical clustering
- Cluster based on similarities (distances)
- Distance measures between instances x^r and x^s:
- Minkowski (L_p) distance: d(x^r, x^s) = (Σ_j |x_j^r − x_j^s|^p)^(1/p) (Euclidean for p = 2)
- City-block distance: d(x^r, x^s) = Σ_j |x_j^r − x_j^s|
- (see the sketch below)
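The two distance measures written as plain NumPy functions; the function names are illustrative.

```python
import numpy as np

def minkowski(x_r, x_s, p=2):
    """L_p distance between two instances; p = 2 gives the Euclidean distance."""
    return np.sum(np.abs(x_r - x_s) ** p) ** (1.0 / p)

def city_block(x_r, x_s):
    """City-block (L_1) distance: sum of absolute attribute differences."""
    return np.sum(np.abs(x_r - x_s))
```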

26 Agglomerative clustering
- Start with N groups, each containing one instance, and merge the two closest groups at each iteration
- Distance between two groups G_i and G_j:
- Single-link: smallest distance between all possible pairs of instances
- Complete-link: largest distance between all possible pairs of instances
- Average-link: average distance over all possible pairs of instances (the distance between group centroids is a related alternative)
- (see the sketch below)
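A sketch using SciPy's hierarchical-clustering routines. The toy data points are invented for the example, and note that SciPy's 'average' method is the average-pairwise-distance variant rather than the centroid distance.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Invented toy data: two visually separated groups of three points each.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.5],
              [3.0, 3.0], [3.0, 4.0], [4.0, 3.5]])

# method='single' uses the smallest pairwise distance between groups;
# 'complete' uses the largest, 'average' the average over all pairs.
Z = linkage(X, method='single', metric='euclidean')

# Cut the merge tree at height h to read off flat clusters.
labels = fcluster(Z, t=2.0, criterion='distance')
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) would draw the merge tree shown on the next slide.
```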

27 Example: single-linked clusters
- Dendrogram [figure not captured in the transcript]
- At heights √2 < h < 2, the dendrogram has the 3 clusters shown on the data graph
- At h > 2 the dendrogram shows 2 clusters; c, d, and f are one cluster at this distance

28 Choosing K (how many clusters?)
- Application specific
- Plot the data (after PCA, for example) and check for clusters
- Add one cluster at a time using a validation set

