
1 Lecture Slides for INTRODUCTION TO Machine Learning, Ethem Alpaydın © The MIT Press, 2004
alpaydin@boun.edu.tr
http://www.cmpe.boun.edu.tr/~ethem/i2ml

2 CHAPTER 11: Multilayer Perceptrons

3 Neural Networks
Networks of processing units (neurons) with connections (synapses) between them
Large number of neurons: ~10^10
Large connectivity: ~10^5 connections per neuron
Parallel processing
Distributed computation/memory
Robust to noise, failures

4 Understanding the Brain
Levels of analysis (Marr, 1982):
1. Computational theory
2. Representation and algorithm
3. Hardware implementation
Reverse engineering: from hardware to theory
Parallel processing: SIMD vs MIMD
Neural net: SIMD with modifiable local memory
Learning: update by training/experience

5 Perceptron (Rosenblatt, 1962)

6 What a Perceptron Does
Regression: y = wx + w_0
Classification: y = 1(wx + w_0 > 0)
(Figure: a perceptron with input x, bias unit x_0 = +1, and weights w, w_0; the regression unit outputs y directly, while the classification unit passes the sum through a threshold/sigmoid s.)
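As a minimal sketch of both uses (the function name and the NumPy interface are illustrative assumptions, not from the slides):

    import numpy as np

    def perceptron(w, w0, x, classify=False):
        # Weighted sum with bias: y = w^T x + w_0
        y = float(np.dot(w, x)) + w0
        if classify:
            return 1 if y > 0 else 0   # classification: y = 1(w^T x + w_0 > 0)
        return y                       # regression: linear output

For example, perceptron(np.array([2.0]), -1.0, np.array([1.0]), classify=True) returns 1, since 2·1 − 1 > 0.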

7 K Outputs
Regression (linear outputs): y_i = Σ_j w_ij x_j + w_i0, i.e., y = Wx
Classification: choose C_i if y_i = max_k y_k

8 Training
Online (instances seen one by one) vs batch (whole sample) learning:
- No need to store the whole sample
- Problem may change in time
- Wear and degradation in system components
Stochastic gradient descent: update after a single pattern
Generic update rule (LMS rule): Update = LearningFactor · (DesiredOutput − ActualOutput) · Input

9 Training a Perceptron: Regression
Regression (linear output):
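The slide's equations were images; for a single pattern (x^t, r^t) with linear output y^t = w^T x^t, the squared error and its stochastic gradient update are, reconstructed from the text: E^t(w | x^t, r^t) = ½ (r^t − y^t)² and Δw_j^t = η (r^t − y^t) x_j^t. A minimal NumPy sketch of this online update (learning rate, epoch count, and initialization are assumptions):

    import numpy as np

    def train_linear_perceptron(X, r, eta=0.01, epochs=100):
        # X: N x (d+1) inputs with a leading bias column of ones; r: N targets
        w = np.random.uniform(-0.01, 0.01, X.shape[1])
        for _ in range(epochs):
            for x_t, r_t in zip(X, r):
                y_t = w @ x_t                  # linear output y = w^T x
                w += eta * (r_t - y_t) * x_t   # dw_j = eta (r - y) x_j
        return w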

10 Classification
Single sigmoid output
K > 2 softmax outputs
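The update equations on this slide were images; reconstructed from the textbook's notation (a sketch, not verbatim): with a single sigmoid output y^t = sigmoid(w^T x^t) and cross-entropy error E = −Σ_t [ r^t log y^t + (1 − r^t) log(1 − y^t) ], the update is Δw_j = η Σ_t (r^t − y^t) x_j^t. For K > 2 softmax outputs y_i^t = exp(w_i^T x^t) / Σ_k exp(w_k^T x^t), the update takes the same form: Δw_ij = η Σ_t (r_i^t − y_i^t) x_j^t.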

11 Sigmoid Unit
(Figure: inputs x_1, …, x_n with weights w_1, …, w_n and bias weight w_0, where x_0 = 1.)
net = Σ_{i=0..n} w_i x_i
o = σ(net) = 1 / (1 + e^−net)
σ(x) is the sigmoid function: 1 / (1 + e^−x)
dσ(x)/dx = σ(x)(1 − σ(x))
Deriving the gradient descent rule to train one sigmoid unit:
∂E/∂w_i = −Σ_d (t_d − o_d) o_d (1 − o_d) x_{i,d}
Multilayer networks of sigmoid units are trained by backpropagation.


13 Learning Boolean AND
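The slide itself shows training curves; as an illustrative sketch of the idea (learning rate, epoch count, and threshold convention are assumptions), a perceptron learning AND with the classic perceptron update rule:

    import numpy as np

    # AND training data; the first column is the bias input x_0 = 1
    X = np.array([[1, 0, 0], [1, 0, 1], [1, 1, 0], [1, 1, 1]], dtype=float)
    t = np.array([0, 0, 0, 1], dtype=float)

    w = np.zeros(3)
    eta = 0.1
    for _ in range(20):                       # a few epochs suffice for AND
        for x_i, t_i in zip(X, t):
            o = 1.0 if w @ x_i > 0 else 0.0   # thresholded output
            w += eta * (t_i - o) * x_i        # perceptron update rule
    print(w)  # a separating weight vector, e.g. [-0.2, 0.2, 0.1]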

14 XOR
No w_0, w_1, w_2 satisfy: (Minsky and Papert, 1969)
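The inequalities were an image; with y = 1(w_1 x_1 + w_2 x_2 + w_0 > 0), the XOR truth table requires (a reconstruction of the standard argument):
(0,0) → 0: w_0 ≤ 0
(0,1) → 1: w_2 + w_0 > 0
(1,0) → 1: w_1 + w_0 > 0
(1,1) → 0: w_1 + w_2 + w_0 ≤ 0
Adding the middle two gives w_1 + w_2 + 2w_0 > 0, while the first and last give w_1 + w_2 + 2w_0 ≤ 0, a contradiction, so no single perceptron computes XOR.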

15 Multi-Layer Networks
(Figure: input layer, hidden layer, output layer.)

16 Multilayer Perceptrons (Rumelhart et al., 1986)

17 x_1 XOR x_2 = (x_1 AND ~x_2) OR (~x_1 AND x_2)
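A hand-wired 2-2-1 MLP realizing exactly this decomposition (the specific weight values are one illustrative choice, not from the slide):

    import numpy as np

    def step(a):
        return (a > 0).astype(float)

    def xor_mlp(x1, x2):
        x = np.array([1.0, x1, x2])           # leading bias input
        W = np.array([[-0.5,  1.0, -1.0],     # hidden 1: x1 AND NOT x2
                      [-0.5, -1.0,  1.0]])    # hidden 2: NOT x1 AND x2
        z = step(W @ x)                       # hidden activations
        v = np.array([-0.5, 1.0, 1.0])        # output unit: h1 OR h2
        return step(v @ np.concatenate(([1.0], z)))

    for a in (0, 1):
        for b in (0, 1):
            print(a, b, int(xor_mlp(a, b)))   # prints the XOR truth table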

18 Backpropagation

19 Backpropagation Algorithm
Initialize each w_{i,j} to some small random value
Until the termination condition is met, do:
- For each training example, do:
  Input the instance (x_1, …, x_n) to the network and compute the network outputs o_k
  For each output unit k: δ_k = o_k (1 − o_k)(t_k − o_k)
  For each hidden unit h: δ_h = o_h (1 − o_h) Σ_k w_{h,k} δ_k
  For each network weight w_{i,j}: w_{i,j} ← w_{i,j} + Δw_{i,j}, where Δw_{i,j} = η δ_j x_{i,j}
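A compact NumPy sketch of this algorithm for one hidden layer of sigmoid units (the network shape, learning rate, and single-epoch interface are assumptions, not from the slide):

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def backprop_epoch(X, T, W1, W2, eta=0.1):
        # X: inputs with a leading bias column; T: target rows
        # W1: hidden weights (H x (d+1)); W2: output weights (K x (H+1))
        for x, t in zip(X, T):
            h = sigmoid(W1 @ x)                      # forward: hidden outputs
            h_b = np.concatenate(([1.0], h))         # add hidden bias unit
            o = sigmoid(W2 @ h_b)                    # forward: network outputs
            delta_o = o * (1 - o) * (t - o)          # output-unit deltas
            delta_h = h * (1 - h) * (W2[:, 1:].T @ delta_o)  # hidden deltas
            W2 += eta * np.outer(delta_o, h_b)       # dw = eta * delta_j * x_ij
            W1 += eta * np.outer(delta_h, x)
        return W1, W2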

20 Backpropagation
Gradient descent over the entire network weight vector
Easily generalized to arbitrary directed graphs
Will find a local, not necessarily global, error minimum; in practice it often works well (can be invoked multiple times with different initial weights)
Often include a weight momentum term: Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n−1)
Minimizes error over the training examples
- Will it generalize well to unseen instances (over-fitting)?
Training can be slow: typically 1,000-10,000 iterations (use Levenberg-Marquardt instead of gradient descent)
Using the network after training is fast

21 Regression
(Figure: the forward computation of the output from input x, and the backward propagation of the error.)

22 Regression with Multiple Outputs
(Figure: inputs x_j connect to hidden units z_h through weights w_{hj}; hidden units connect to outputs y_i through weights v_{ih}.)
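The equations were images; in the book's notation the model and its updates are, as a reconstruction: z_h = sigmoid(w_h^T x) = 1 / (1 + exp(−(Σ_j w_{hj} x_j + w_{h0}))) and y_i = v_i^T z = Σ_h v_{ih} z_h + v_{i0}; with error E = ½ Σ_t Σ_i (r_i^t − y_i^t)², the backpropagation updates are Δv_{ih} = η Σ_t (r_i^t − y_i^t) z_h^t and Δw_{hj} = η Σ_t [Σ_i (r_i^t − y_i^t) v_{ih}] z_h^t (1 − z_h^t) x_j^t.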


24 8-3-8 Binary Encoder-Decoder
8 inputs, 3 hidden units, 8 outputs
Hidden values (one row per input pattern):
 .89 .04 .08
 .01 .11 .88
 .01 .97 .27
 .99 .97 .71
 .03 .05 .02
 .22 .99 .99
 .80 .01 .98
 .60 .94 .01

25 Sum of Squared Errors for the Output Units

26 Hidden Unit Encoding for Input 01000000

27 Convergence of Backprop
Gradient descent to some local minimum
- Perhaps not the global minimum
- Add momentum
- Stochastic gradient descent
- Train multiple nets with different initial weights
Nature of convergence
- Initialize weights near zero
- Therefore, initial networks are near-linear
- Increasingly non-linear functions become possible as training progresses

28 Expressive Capabilities of ANNs
Boolean functions
- Every Boolean function can be represented by a network with a single hidden layer
- But it might require a number of hidden units exponential in the number of inputs
Continuous functions
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]


30 (Figure: an MLP approximating a function; hidden-unit inputs w_h x + w_0, hidden outputs z_h, and weighted hidden outputs v_h z_h.)

31 Two-Class Discrimination
One sigmoid output y^t for P(C_1 | x^t), with P(C_2 | x^t) ≡ 1 − y^t
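The error function was an image; the cross-entropy form the book uses here is, as a reconstruction: E(W, v | X) = −Σ_t [ r^t log y^t + (1 − r^t) log(1 − y^t) ]; gradient descent on this error yields updates of the same (r − y) form as in regression.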

32 K > 2 Classes
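The equations were images; the softmax formulation the book uses is, as a reconstruction: o_i^t = v_i^T z^t and y_i^t = exp(o_i^t) / Σ_k exp(o_k^t), with y_i^t estimating P(C_i | x^t), trained by minimizing the cross-entropy E(W, v | X) = −Σ_t Σ_i r_i^t log y_i^t.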

33 Multiple Hidden Layers
An MLP with one hidden layer is a universal approximator (Hornik et al., 1989), but using multiple layers may lead to simpler networks

34 Improving Convergence
Momentum
Adaptive learning rate
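Both formulas were images on the slide; reconstructed from the text (treat as a sketch): momentum, Δw_i^t = −η ∂E^t/∂w_i + α Δw_i^{t−1}; adaptive learning rate, Δη = +a if E^{t+τ} < E^t (error keeps decreasing), and Δη = −b η otherwise.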

35 Overfitting/Overtraining
Number of weights: H(d + 1) + (H + 1)K
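As a worked check (the instance is illustrative, not from the slide): an MLP with d = 8 inputs, H = 3 hidden units, and K = 8 outputs, like the 8-3-8 encoder-decoder above, has 3·(8 + 1) + (3 + 1)·8 = 27 + 32 = 59 weights.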


37 Structured MLP (Le Cun et al., 1989)

38 Weight Sharing

39 Hints (Abu-Mostafa, 1995)
Invariance to translation, rotation, size
Virtual examples
Augmented error: E' = E + λ_h E_h
If x' and x are the "same": E_h = [g(x | θ) − g(x' | θ)]²
Approximation hint:

40 Tuning the Network Size
Destructive
- Weight decay:
Constructive
- Growing networks (Ash, 1989) (Fahlman and Lebiere, 1989)
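The weight-decay equations were images; as a reconstruction from the text: Δw_i = −η ∂E/∂w_i − λ w_i, which corresponds to gradient descent on the augmented error E' = E + (λ/2) Σ_i w_i².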

41 Bayesian Learning
Consider the weights w_i as random variables with prior p(w_i)
Weight decay, ridge regression, and regularization all fit the pattern: cost = data-misfit + λ · complexity
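The posterior equations were images; the standard MAP formulation the slide refers to is, as a sketch: p(w | X) = p(X | w) p(w) / p(X), with the MAP estimate w_MAP = argmax_w [ log p(X | w) + log p(w) ]; a zero-mean Gaussian prior on the weights makes log p(w) a −λ Σ_i w_i² penalty, recovering weight decay.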

42 Dimensionality Reduction


44 Learning Time
Applications:
- Sequence recognition: speech recognition
- Sequence reproduction: time-series prediction
- Sequence association
Network architectures:
- Time-delay networks (Waibel et al., 1989)
- Recurrent networks (Rumelhart et al., 1986)

45 Time-Delay Neural Networks

46 Recurrent Networks

47 Unfolding in Time

