1 Lecture Slides for Introduction to Machine Learning 2e
ETHEM ALPAYDIN © The MIT Press, 2010
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml2e

2 CHAPTER 12: Local Models

3 Introduction
Function approximation: divide the input space into local regions and learn a simple (constant or linear) model in each patch.
Unsupervised (competitive models for online clustering): Adaptive Resonance Theory (ART), Self-Organizing Map (SOM).
Supervised: Radial Basis Function network (RBF), Mixture of Experts (MoE).

4 Competitive Learning: Online k-means

5 Online k-means
Advantages of an online method:
- No extra memory is needed to store the whole training set.
- The update at each step is simple to implement.
- The input distribution may change in time, and the model can track it.
The update is Hebbian-like: the winning center is moved toward the input. The batch version is given in Figure 7.3 of the text.
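As a hedged illustration of the online k-means update described above, here is a minimal Python sketch; the function name and the learning rate eta are placeholders, not part of the original slides.

```python
import numpy as np

def online_kmeans_step(x, centers, eta=0.1):
    """One online k-means update: move the closest center toward the input x."""
    i = int(np.argmin(np.linalg.norm(centers - x, axis=1)))  # index of the winning center
    centers[i] += eta * (x - centers[i])                      # delta-rule (Hebbian-like) update
    return i
```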

6 Winner-take-all
Only the winning unit is updated; the others are not updated at all.
The winner-take-all competitive neural network is a network of k perceptrons with recurrent connections at the output. In the figure, dashed lines are recurrent connections: the ones ending with an arrow are excitatory and the ones ending with a circle are inhibitory. Each output unit reinforces its own value and tries to suppress the other outputs. Under a suitable assignment of these recurrent weights, the maximum suppresses all the others.
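A minimal sketch of such recurrent winner-take-all dynamics; the self-excitation and inhibition coefficients below are illustrative choices, not values from the slides.

```python
import numpy as np

def winner_take_all(y0, self_excite=1.2, inhibit=0.2, steps=50):
    """Iterate recurrent dynamics: each unit reinforces itself and suppresses the others
    until only the maximum remains active."""
    y = np.array(y0, dtype=float)
    for _ in range(steps):
        y = self_excite * y - inhibit * (y.sum() - y)  # excitatory self-loop, inhibitory cross-links
        y = np.maximum(y, 0.0)                          # activations cannot go negative
        if y.max() > 0:
            y /= y.max()                                # keep the winner bounded at 1
    return y
```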

7 Adaptive Resonance Theory (ART)
Incremental: the vigilance parameter ρ defines the size of a cluster (Carpenter and Grossberg, 1988).
If the distance from the input to the closest center m_i is within the vigilance, the input belongs to m_i and that center is updated; otherwise a new cluster (class) is created at the input.
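A hedged Python sketch of this incremental scheme; centers is assumed to be a Python list of numpy arrays, and eta is an illustrative learning rate.

```python
import numpy as np

def art_update(x, centers, rho, eta=0.1):
    """Vigilance test: assign x to the nearest center if within rho, else start a new cluster."""
    if not centers:
        centers.append(np.array(x, dtype=float))
        return len(centers) - 1
    dists = [np.linalg.norm(x - m) for m in centers]
    i = int(np.argmin(dists))
    if dists[i] > rho:                        # not covered: create a new cluster at x
        centers.append(np.array(x, dtype=float))
        return len(centers) - 1
    centers[i] += eta * (x - centers[i])      # covered: move the winner toward x
    return i
```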

8 Self-Organizing Maps (SOM)
Units have a neighborhood defined on the map: m_i is "between" m_(i-1) and m_(i+1), and they are all updated together.
One-dimensional map (Kohonen, 1990): the winner and its neighbors move toward the input, with the neighbors' updates weighted by a neighborhood function that decays with distance on the map.
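A minimal sketch of one SOM update on a one-dimensional map, assuming a Gaussian neighborhood window over unit indices; eta and sigma are illustrative.

```python
import numpy as np

def som_step(x, centers, eta=0.1, sigma=1.0):
    """One SOM update: the winner and its map neighbors all move toward x,
    weighted by a Gaussian window over index distance on the 1-D map."""
    i = int(np.argmin(np.linalg.norm(centers - x, axis=1)))   # winning unit
    idx = np.arange(len(centers))
    e = np.exp(-0.5 * ((idx - i) / sigma) ** 2)               # neighborhood weights e(j, i)
    centers += eta * e[:, None] * (x - centers)               # neighbors updated together
    return i
```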

9 Radial Basis Functions (RBF)
Locally tuned units with a bell-shaped (Gaussian) basis function (Figure 12.8 of the text):
p_h = exp( -||x - m_h||^2 / (2 s_h^2) )
The output is a linear combination of these local unit activations.
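A hedged numpy sketch of the RBF forward pass just described; the function name and argument layout are assumptions.

```python
import numpy as np

def rbf_forward(X, centers, spreads, w, w0):
    """RBF network output: Gaussian local units followed by a linear second layer."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # squared distances to centers
    P = np.exp(-d2 / (2.0 * spreads ** 2))                          # locally tuned unit activations p_h
    return P @ w + w0, P
```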

10 Local vs Distributed Representation
Distributed representation: the input is encoded by the simultaneous activation of many hidden units, as in the multilayer perceptron (MLP).
Local representation: for a given input, only one or a few units are active, as in radial basis function networks.

11 Local vs Distributed Representation (figure)

12 Training RBF
Hybrid learning (Figure 12.8):
- First layer (centers and spreads): unsupervised k-means.
- Second layer (weights): supervised gradient descent.
Fully supervised training is also possible (Broomhead and Lowe, 1988; Moody and Darken, 1989).
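As a hedged sketch of this hybrid scheme (single-output regression, scikit-learn's KMeans assumed available for the first layer; hyperparameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans  # unsupervised first layer

def train_rbf_hybrid(X, y, H=10, lr=0.01, epochs=100):
    """Hybrid RBF training sketch: k-means for centers/spreads, gradient descent for weights."""
    km = KMeans(n_clusters=H, n_init=10).fit(X)
    centers = km.cluster_centers_
    spread = np.mean(np.linalg.norm(X - centers[km.labels_], axis=1)) + 1e-8
    spreads = np.full(H, spread)                       # one common spread, for simplicity
    w, w0 = np.zeros(H), 0.0
    for _ in range(epochs):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        P = np.exp(-d2 / (2 * spreads ** 2))
        err = y - (P @ w + w0)                         # residual of the linear second layer
        w += lr * P.T @ err / len(X)                   # gradient descent on second-layer weights
        w0 += lr * err.mean()
    return centers, spreads, w, w0
```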

13 Regression
Batch error (sum of squared errors over outputs and instances):
E = (1/2) Σ_t Σ_i ( r_i^t - y_i^t )^2,  with  y_i^t = Σ_h w_ih p_h^t + w_i0
Gradient descent on E gives update rules for the second-layer weights and, in the fully supervised case, also for the centers and spreads.
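A hedged single-output sketch of one fully supervised batch gradient step on this error, updating weights, centers, and spreads together; the learning rate and function name are placeholders.

```python
import numpy as np

def rbf_supervised_step(X, r, centers, spreads, w, w0, lr=0.01):
    """One batch gradient step on the regression error for a single-output RBF network."""
    diff = X[:, None, :] - centers[None, :, :]                  # (N, H, d)
    d2 = (diff ** 2).sum(axis=2)
    P = np.exp(-d2 / (2 * spreads ** 2))                        # unit activations p_h^t
    err = r - (P @ w + w0)                                      # r^t - y^t
    g = (err[:, None] * w[None, :]) * P                         # error backpropagated to each unit
    w += lr * P.T @ err
    w0 += lr * err.sum()
    centers += lr * (g[:, :, None] * diff / spreads[None, :, None] ** 2).sum(axis=0)
    spreads += lr * (g * d2 / spreads[None, :] ** 3).sum(axis=0)
    return w, w0, centers, spreads
```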

14 Classification
Outputs are obtained by a softmax over linear combinations of the basis functions, and the error is the cross-entropy:
E = - Σ_t Σ_i r_i^t log y_i^t
Update rules can similarly be derived using gradient descent (Exercise 2).
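A minimal sketch of such a softmax output layer over the RBF unit activations P (an N x H matrix, as returned by the earlier rbf_forward sketch); names are assumptions.

```python
import numpy as np

def rbf_classify(P, W, w0):
    """Class posteriors: linear combination of basis functions followed by a softmax."""
    o = P @ W + w0                              # (N, K) linear outputs o_i
    o -= o.max(axis=1, keepdims=True)           # subtract the max for numerical stability
    e = np.exp(o)
    return e / e.sum(axis=1, keepdims=True)     # y_i = exp(o_i) / sum_k exp(o_k)
```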

15 Rules and Exceptions
A default rule (e.g., a linear model) covers the bulk of the input space, and localized Gaussian units handle the exceptions to it.

16 Rule-Based Knowledge
Prior knowledge can be incorporated before training: in the RBF framework, a rule of the form "IF x_1 is approximately a AND x_2 is approximately b, OR x_3 is approximately c, THEN y is approximately some given value" is encoded by two Gaussian units, one centered at (a, b) and one centered at c.
The better our prior knowledge of a problem, the faster we can achieve good performance, and the less training is required.
This is closely related to fuzzy membership functions and fuzzy rules.
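A hedged sketch of how the conjunctive part of such a rule could be encoded as a single Gaussian unit; the constants a, b, the spread, and the output weight below are illustrative placeholders, not values from the slides.

```python
import numpy as np

# Placeholder rule: "IF x1 is roughly a AND x2 is roughly b THEN contribute `weight` to y"
a, b, spread, weight = 2.0, -1.0, 0.5, 0.1

def rule_unit(x):
    """Contribution of the Gaussian unit encoding the conjunctive rule above."""
    p = np.exp(-np.sum((x[:2] - np.array([a, b])) ** 2) / (2 * spread ** 2))
    return weight * p
```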

17 Normalized Basis Functions
Why normalize?
- To make sure that for any input there is at least one nonzero unit.
- To make sure that the values of the local units sum up to 1:
g_h^t = p_h^t / Σ_l p_l^t
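A minimal numpy sketch of this normalization; names are assumptions.

```python
import numpy as np

def normalized_units(X, centers, spreads):
    """Normalized basis functions g_h: each unit's activation divided by the sum over units,
    so that the local units sum to 1 for every input."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    P = np.exp(-d2 / (2 * spreads ** 2))
    return P / P.sum(axis=1, keepdims=True)
```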

18 Competitive Basis Functions
In a (cooperative) RBF network the final output is a weighted sum of the contributions of all local units; in regression we minimize the error of equation 12.15.
With competitive basis functions we instead assume that the output is drawn from a mixture model:
p(r^t | x^t) = Σ_h p(h | x^t) p(r^t | h, x^t)
where p(h | x^t) are the mixture proportions (the gating values) and p(r^t | h, x^t) are the mixture components generating the output if hidden unit h is chosen.

19 Regression
In regression, the components are taken to be Gaussian, so the log likelihood is
L = Σ_t log Σ_h g_h^t exp( -||r^t - y_h^t||^2 / (2σ^2) )
(up to additive constants), and the parameters are found by gradient ascent on L.

20 Classification
In classification, each component by itself is a multinomial, and the log likelihood is
L = Σ_t log Σ_h g_h^t Π_i ( y_ih^t )^( r_i^t )
which is again maximized by gradient ascent.

21 EM for RBF (Supervised EM)
E-step: compute the posterior responsibility f_h^t of each hidden unit h for each instance, using both the input and the desired output.
M-step: re-estimate the centers, spreads, and second-layer parameters as responsibility-weighted averages.
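A hedged sketch of such a supervised EM loop for the regression case with constant local fits; the fixed noise variance sigma2, the iteration count, and all names are assumptions, and the M-step reuses distances computed with the previous centers for simplicity.

```python
import numpy as np

def em_rbf_regression(X, r, centers, spreads, w, sigma2=1.0, iters=20):
    """Supervised EM sketch for competitive RBF regression with constant local fits w_h."""
    for _ in range(iters):
        # E-step: responsibility of unit h for instance t, given input and output
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        P = np.exp(-d2 / (2 * spreads ** 2))
        G = P / P.sum(axis=1, keepdims=True)                   # gating (mixture proportions)
        lik = G * np.exp(-(r[:, None] - w[None, :]) ** 2 / (2 * sigma2))
        F = lik / lik.sum(axis=1, keepdims=True)               # responsibilities f_h^t
        # M-step: responsibility-weighted re-estimation
        Fsum = F.sum(axis=0)
        centers = (F.T @ X) / Fsum[:, None]
        spreads = np.sqrt((F * d2).sum(axis=0) / (Fsum * X.shape[1])) + 1e-8
        w = (F * r[:, None]).sum(axis=0) / Fsum
    return centers, spreads, w
```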

22 Learning Vector Quantization (LVQ)
There are H units per class, prelabeled with class labels (Kohonen, 1990). Given x, let m_i be the closest center:
- If m_i has the correct label, it is moved toward the input to better represent it.
- If it belongs to the wrong class, it is moved away from the input, in the expectation that if it is moved sufficiently far away, a center of the correct class will be the closest in a future iteration.
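A minimal sketch of this update rule; the learning rate and names are assumptions.

```python
import numpy as np

def lvq_step(x, label, centers, center_labels, eta=0.05):
    """LVQ update: move the closest center toward x if its label matches, away otherwise."""
    i = int(np.argmin(np.linalg.norm(centers - x, axis=1)))
    sign = 1.0 if center_labels[i] == label else -1.0
    centers[i] += sign * eta * (x - centers[i])
    return i
```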

23 Mixture of Experts (MoE)
In RBF, each local fit is a constant, the second-layer weight w_ih. In MoE, each local fit is a linear function of x, a local expert (Jacobs et al., 1991):
w_ih^t = v_ih^T x^t
The mixture of experts can thus be seen as an RBF network where the second-layer weights are the outputs of linear models, and MoE is a piecewise linear approximation. (In the figure, only one linear model is shown for clarity.)
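A hedged single-output sketch of the MoE forward pass with radial gating; V holds one linear expert per column, and M and S are the gating centers and spreads (all names are assumptions).

```python
import numpy as np

def moe_forward(X, V, M, S):
    """Mixture of experts: linear experts weighted by normalized radial gating values."""
    experts = X @ V                                             # (N, H): one column per local expert
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    P = np.exp(-d2 / (2 * S ** 2))
    G = P / P.sum(axis=1, keepdims=True)                        # gating values g_h, summing to 1
    return (G * experts).sum(axis=1)                            # gated combination of the experts
```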

24 MoE as Models Combined
The mixture of experts can be seen as a model for combining multiple models: the local experts w_h are the models, and the gating network is another model that determines the weight g_h of each expert.
Radial gating: g_h^t = exp( -||x^t - m_h||^2 / (2 s_h^2) ) / Σ_l exp( -||x^t - m_l||^2 / (2 s_l^2) )
Softmax gating: g_h^t = exp( m_h^T x^t ) / Σ_l exp( m_l^T x^t )
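A short numpy sketch of the softmax gating network just defined; names are assumptions.

```python
import numpy as np

def softmax_gating(X, M):
    """Softmax gating: gating weights are a log-linear function of the input."""
    o = X @ M                                      # (N, H) gating scores m_h^T x
    o -= o.max(axis=1, keepdims=True)              # numerical stability
    e = np.exp(o)
    return e / e.sum(axis=1, keepdims=True)
```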

25 Cooperative MoE: Regression
In the cooperative case, the error is the usual regression error of the weighted sum, and there is no force on the units to be localized.
In the competitive case, to increase the likelihood, units are forced to be localized, with more separation between them and smaller spreads.

26 Competitive MoE: Regression and Classification
In the competitive case, the log likelihoods have the same form as for competitive basis functions, with the constant fits w_ih replaced by the linear expert outputs v_ih^T x^t; gradient ascent gives update rules for both the experts and the gating network.

27 Hierarchical Mixture of Experts (HME)
A tree of MoEs, where each MoE is an expert in a higher-level MoE.
A soft decision tree: it takes a weighted (gating) average of all leaves (experts), as opposed to using a single path and a single leaf.
Can be trained using EM (Jordan and Jacobs, 1994).
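A hedged sketch of such a soft decision tree for a single output: internal nodes gate their children with a softmax, and leaves are linear experts; the class names and structure are illustrative.

```python
import numpy as np

class Expert:
    """Leaf: a linear local expert."""
    def __init__(self, v):
        self.v = np.asarray(v, dtype=float)
    def predict(self, x):
        return float(self.v @ x)

class GatedNode:
    """Internal node: a softmax gate over its children; the prediction is the
    gated average over all leaves reachable from this node."""
    def __init__(self, m, children):
        self.m = np.asarray(m, dtype=float)     # one gating weight vector per child
        self.children = children
    def predict(self, x):
        o = self.m @ x
        g = np.exp(o - o.max()); g /= g.sum()   # softmax gating
        return float(sum(gh * c.predict(x) for gh, c in zip(g, self.children)))
```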

28 Discussion
Cooperative model vs competitive model: the cooperative model is generally more accurate, while the competitive model learns faster. Why?
In the cooperative case, models overlap more and implement a smoother approximation, which is preferable in regression problems.
The competitive model makes a harder split; generally only one expert is active for an input, and therefore learning is faster.

