
1 Artificial Neural Networks Dr. Lahouari Ghouti Information & Computer Science Department

2 Single-Layer Perceptron (SLP)

3 Architecture We consider the following architecture: a feed-forward neural network with one layer. It is sufficient to study single-layer perceptrons with just one neuron.

4 Perceptron: Neuron Model Uses a non-linear (McCulloch-Pitts) model of the neuron: the inputs x1, x2, …, xm are weighted by w1, w2, …, wm and summed together with a bias b to give z; the output is y = g(z), where g is the sign function: g(z) = +1 if z >= 0, and g(z) = -1 if z < 0.
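To make the model concrete, here is a minimal Python sketch of this neuron (the function names and the example weights are illustrative, not from the slides):

```python
def sign(z):
    """Sign activation used by the perceptron: +1 if z >= 0, otherwise -1."""
    return 1 if z >= 0 else -1

def perceptron_output(x, w, b):
    """Compute y = sign(w . x + b) for an input vector x."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sign(z)

# Example: two inputs, weights [2, -1], bias 0
print(perceptron_output([1, 1], [2, -1], 0))   # -> 1
```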

5 Perceptron: Applications The perceptron is used for classification: it classifies a set of examples into one of the two classes C1 and C2. If the output of the perceptron is +1, the input is assigned to class C1; if the output is -1, the input is assigned to C2.

6 Perceptron: Classification The equation w1 x1 + w2 x2 + b = 0 describes a hyperplane (a line in two dimensions) in the input space; it is the decision boundary used to separate the two classes C1 and C2. The decision region for C1 is w1 x1 + w2 x2 + b > 0, and the decision region for C2 is w1 x1 + w2 x2 + b <= 0.

7 Perceptron: Limitations The perceptron can only model linearly-separable functions. It can be used to model the Boolean functions AND, OR, and COMPLEMENT, but it cannot model XOR. Why?

8 Perceptron: Limitations (Cont’d) XOR is not a linearly-separable problem: it is impossible to separate the classes C1 and C2 with only one line.

9 Perceptron: Learning Algorithm Variables and parameters: x(n) = input vector = [+1, x1(n), x2(n), …, xm(n)]^T; w(n) = weight vector = [b(n), w1(n), w2(n), …, wm(n)]^T; b(n) = bias; y(n) = actual response; d(n) = desired response; η = learning rate parameter (more elaboration later).

10 The Fixed-Increment Learning Algorithm Initialization: set w(0) = 0. Activation: activate the perceptron by applying an input example (vector x(n) and desired response d(n)). Compute the actual response of the perceptron: y(n) = sgn[w^T(n)x(n)]. Adapt the weight vector: if d(n) and y(n) are different, then w(n + 1) = w(n) + η[d(n) - y(n)]x(n), where d(n) = +1 if x(n) ∈ C1 and d(n) = -1 if x(n) ∈ C2. Continuation: increment the time index n by 1 and go to the Activation step.
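A runnable sketch of this algorithm in Python, assuming numpy and an input vector augmented with a leading +1 so the bias is learned as the first weight (the function name and the stopping criterion are my additions):

```python
import numpy as np

def sgn(z):
    """Signum used on the slides: +1 if z >= 0, otherwise -1."""
    return 1.0 if z >= 0 else -1.0

def train_perceptron(X, d, eta=1.0, max_epochs=100):
    """Fixed-increment perceptron learning.

    X : array of shape (N, m) with the raw input vectors.
    d : array of N desired responses, each +1 (class C1) or -1 (class C2).
    Returns the weight vector [b, w1, ..., wm].
    """
    # Augment every input with a leading +1 so the bias is learned as w[0].
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])
    w = np.zeros(Xa.shape[1])              # Initialization: w(0) = 0
    for _ in range(max_epochs):
        errors = 0
        for x_n, d_n in zip(Xa, d):
            y_n = sgn(w @ x_n)             # y(n) = sgn[w^T(n) x(n)]
            if y_n != d_n:                 # adapt only on a misclassification
                w += eta * (d_n - y_n) * x_n
                errors += 1
        if errors == 0:                    # converged: every sample classified correctly
            break
    return w
```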

11 A Learning Example Consider a training set C1 ∪ C2, where C1 = {(1,1), (1,-1), (0,-1)} are the elements of class +1 and C2 = {(-1,-1), (-1,1), (0,1)} are the elements of class -1. Use the perceptron learning algorithm to classify these examples, with w(0) = [1, 0, 0]^T and η = 1.
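For illustration only, this training set can be fed to the fixed-increment rule as follows; the loop mirrors the algorithm on the previous slide and starts from the slide's w(0) = [1, 0, 0]^T with η = 1 (the printout and the loop structure are mine):

```python
import numpy as np

C1 = [(1, 1), (1, -1), (0, -1)]   # desired response +1
C2 = [(-1, -1), (-1, 1), (0, 1)]  # desired response -1

X = np.array(C1 + C2, dtype=float)
d = np.array([+1] * len(C1) + [-1] * len(C2), dtype=float)

Xa = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend +1 for the bias
w = np.array([1.0, 0.0, 0.0])                   # w(0) = [1, 0, 0]^T
eta = 1.0

changed = True
while changed:                                   # classes are linearly separable, so this stops
    changed = False
    for x_n, d_n in zip(Xa, d):
        y_n = 1.0 if w @ x_n >= 0 else -1.0
        if y_n != d_n:
            w += eta * (d_n - y_n) * x_n
            changed = True

print(w)   # final [b, w1, w2]; the line b + w1*x1 + w2*x2 = 0 separates C1 from C2
```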

12 A Learning Example (Cont’d) [Figure: the training points of classes C1 and C2 in the (x1, x2) plane, separated by the decision boundary 2x1 - x2 = 0.]

13 The Learning Algorithm: Convergence Let n = number of training samples (set X); X1 = set of training samples belonging to class C1; X2 = set of training samples belonging to class C2. For a given sample n: x(n) = [+1, x1(n), …, xp(n)]^T = input vector and w(n) = [b(n), w1(n), …, wp(n)]^T = weight vector. Net activity level: v(n) = w^T(n)x(n). Output: y(n) = +1 if v(n) >= 0, and y(n) = -1 if v(n) < 0.

14 The Learning Algorithm: Convergence (Cont’d) The decision hyperplane separates classes C1 and C2. If the two classes C1 and C2 are linearly separable, then there exists a weight vector w such that w^T x ≥ 0 for all x belonging to class C1 and w^T x < 0 for all x belonging to class C2.

15 Error-Correction Learning Update rule: w(n + 1) = w(n) + Δw(n). Learning process: if x(n) is correctly classified by w(n), then w(n + 1) = w(n); otherwise the weight vector is updated as w(n + 1) = w(n) - η(n)x(n) if w(n)^T x(n) ≥ 0 and x(n) belongs to C2, or w(n + 1) = w(n) + η(n)x(n) if w(n)^T x(n) < 0 and x(n) belongs to C1.

16 Perceptron Convergence Algorithm
Variables and parameters:
– x(n) = [+1, x1(n), …, xp(n)]; w(n) = [b(n), w1(n), …, wp(n)]
– y(n) = actual response (output); d(n) = desired response
– η = learning rate, a positive number less than 1
Step 1: Initialization – set w(0) = 0, then do the following for n = 1, 2, 3, …
Step 2: Activation – activate the perceptron by applying the input vector x(n) and desired output d(n)

17 Perceptron Convergence Algorithm (Cont’d)
Step 3: Computation of actual response – y(n) = sgn[w^T(n)x(n)], where sgn(.) is the signum function
Step 4: Adaptation of the weight vector – w(n+1) = w(n) + η[d(n) - y(n)]x(n), where d(n) = +1 if x(n) belongs to C1 and d(n) = -1 if x(n) belongs to C2
Step 5: Increment n by 1 and go back to Step 2

18 Learning: Performance Measure A learning rule is designed to optimize a performance measure. However, in the development of the perceptron convergence algorithm we did not mention a performance measure. Intuitively, what would be an appropriate performance measure for a classification neural network? Define the performance measure: J = -E[e(n)v(n)].

19 Learning: Performance Measure Or, as an instantaneous estimate: J’(n) = -e(n)v(n), where the error at iteration n is e(n) = d(n) - y(n), v(n) is the linear combiner output at iteration n, and E[.] is the expectation operator.

20 Learning: Performance Measure (Cont’d) Can we derive our learning rule by minimizing this performance function [Haykin’s textbook]? Since v(n) = w^T(n)x(n), differentiating the instantaneous estimate J’(n) with respect to w(n) and taking a gradient-descent step yields the learning rule.
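The formulas on this slide were images and are missing from the transcript; what follows is a reconstruction of the standard derivation, treating e(n) as fixed while differentiating (the usual heuristic, since the signum is not differentiable). It is a sketch rather than a verbatim copy of the slide:

```latex
\begin{aligned}
J'(n) &= -e(n)\,v(n) = -e(n)\,\mathbf{w}^{T}(n)\,\mathbf{x}(n)\\
\frac{\partial J'(n)}{\partial \mathbf{w}(n)} &= -e(n)\,\mathbf{x}(n)\\
\mathbf{w}(n+1) &= \mathbf{w}(n) - \eta\,\frac{\partial J'(n)}{\partial \mathbf{w}(n)}
 = \mathbf{w}(n) + \eta\,e(n)\,\mathbf{x}(n)
 = \mathbf{w}(n) + \eta\,[d(n)-y(n)]\,\mathbf{x}(n)
\end{aligned}
```

This coincides with the adaptation step (Step 4) of the perceptron convergence algorithm above.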

21 Presentation of Training Examples Presenting all training examples once to the ANN is called an epoch. In incremental (stochastic gradient descent) training, examples can be presented in: fixed order (1, 2, 3, …, M); randomly permuted order (5, 2, 7, …, 3); or completely random order (4, 1, 7, 1, 5, 4, …).

22 Concluding Remarks A single-layer perceptron can perform pattern classification only on linearly separable patterns, regardless of the type of nonlinearity (hard limiter, sigmoidal). Papert and Minsky in 1969 elucidated the limitations of Rosenblatt’s single-layer perceptron (e.g. the requirement of linear separability, the inability to solve the XOR problem) and cast doubt on the viability of neural networks. However, the multilayer perceptron and the back-propagation algorithm overcome many of the shortcomings of the single-layer perceptron.

23 Adaline: Adaptive Linear Element The output y is a linear combination of the inputs x: y = w1 x1 + w2 x2 + … + wm xm.

24 Adaline: Adaptive Linear Element (Cont’d) Adaline uses a linear neuron model and the Least-Mean-Square (LMS) learning algorithm. The idea: try to minimize the squared error, which is a function of the weights. We can find the minimum of the error function E by means of the steepest-descent method (an optimization procedure).

25 Steepest Descent Method: Basics Start with an arbitrary point, find the direction in which E is decreasing most rapidly, and make a small step in that direction.

26 Steepest Descent Method: Basics (Cont’d) [Figure: one steepest-descent step in weight space, from (w1, w2) to (w1 + Δw1, w2 + Δw2).]

27 Steepest Descent Method: Basics (Cont’d) [Figure: an error surface showing a global minimum, a local minimum, and the gradient direction.]

28 Least-Mean-Square Algorithm (Widrow-Hoff Algorithm) Approximation of the gradient of E, and the resulting update rule for the weights.
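The equations on this slide were also images; as a reconstruction (the standard Widrow-Hoff result, not a verbatim copy of the slide), with instantaneous error energy E(n) = ½e²(n):

```latex
\begin{aligned}
e(n) &= d(n) - \mathbf{w}^{T}(n)\,\mathbf{x}(n), \qquad E(n) = \tfrac{1}{2}\,e^{2}(n)\\
\hat{\nabla}_{\mathbf{w}} E(n) &= -\,e(n)\,\mathbf{x}(n)\\
\mathbf{w}(n+1) &= \mathbf{w}(n) + \eta\,e(n)\,\mathbf{x}(n)
\end{aligned}
```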

29 Summary of LMS Algorithm Training sample: input signal vector x(n) and desired response d(n). User-selected parameter η > 0. Initialization: set ŵ(1) = 0. Computation: for n = 1, 2, … compute e(n) = d(n) - ŵ^T(n)x(n) and ŵ(n+1) = ŵ(n) + η x(n)e(n).
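A minimal Python sketch of this summary, assuming numpy and an input augmented with a leading +1 so the bias sits inside ŵ (the function name, learning rate, and toy data are mine):

```python
import numpy as np

def lms_train(X, d, eta=0.01, epochs=50):
    """Least-Mean-Square (Widrow-Hoff) training.

    X : array of shape (N, m) of input vectors.
    d : array of N desired (real-valued) responses.
    Returns the estimated weight vector [b, w1, ..., wm].
    """
    Xa = np.hstack([np.ones((X.shape[0], 1)), X])    # +1 input for the bias
    w = np.zeros(Xa.shape[1])                        # initialization: w-hat = 0
    for _ in range(epochs):
        for x_n, d_n in zip(Xa, d):
            e_n = d_n - w @ x_n                      # e(n) = d(n) - w^T(n) x(n)
            w += eta * e_n * x_n                     # w(n+1) = w(n) + eta * x(n) e(n)
    return w

# Usage sketch: recover y = 2*x1 - x2 + 1 from noisy samples
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 2))
d = 2 * X[:, 0] - X[:, 1] + 1 + 0.01 * rng.standard_normal(200)
print(lms_train(X, d, eta=0.1, epochs=100))          # approximately [1, 2, -1]
```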

30 Neuron with Sigmoid Function The inputs x1, …, xm are multiplied by the weights w1, …, wm and summed to give the activation; the output y is obtained by passing the activation through a sigmoid function.
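For illustration (the slide shows only the diagram), a sigmoid neuron in Python, assuming the logistic function as the activation:

```python
import math

def sigmoid(z):
    """Logistic activation: maps any real z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(x, w, b):
    """Output y = sigmoid(w . x + b); smooth, hence differentiable for backpropagation."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

print(sigmoid_neuron([1.0, -2.0], [0.5, 0.25], 0.1))   # a value strictly between 0 and 1
```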

31 Multi-Layer Neural Networks [Figure: a feed-forward network with an input layer, a hidden layer, and an output layer.]

32 Backpropagation Principle Forward step: propagate activations from the input layer to the output layer. Backward step: propagate errors from the output layer back to the hidden layer. [Figure: input xi feeds hidden unit k through weight wki, hidden unit k feeds output unit j through weight wjk, and the error terms δk and δj flow backward.]

33 Backpropagation Algorithm
Initialize each wi,j to some small random value.
Until the termination condition is met, do:
– For each training example, do:
» Input the instance (x1, …, xn) to the network and compute the network outputs yk
» For each output unit k: δk = yk(1 - yk)(tk - yk)
» For each hidden unit h: δh = yh(1 - yh) Σk wh,k δk
» For each network weight wi,j, do: wi,j = wi,j + Δwi,j, where Δwi,j = η δj xi,j
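A compact Python sketch of these updates for one hidden layer of sigmoid units; the network size, learning rate, initialization range, and the XOR example are illustrative assumptions rather than details taken from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=4, eta=0.5, epochs=1000, seed=0):
    """Stochastic backpropagation for one hidden layer of sigmoid units.

    X : (N, n_in) inputs, T : (N, n_out) targets in (0, 1).
    Returns (W_hidden, W_output); a trailing +1 input to each layer plays the role of the bias.
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    W_h = rng.uniform(-0.5, 0.5, size=(n_hidden, n_in + 1))   # small random initialization
    W_o = rng.uniform(-0.5, 0.5, size=(n_out, n_hidden + 1))
    for _ in range(epochs):
        for x, t in zip(X, T):
            # Forward step: propagate activations input -> hidden -> output
            xa = np.append(x, 1.0)
            y_h = sigmoid(W_h @ xa)
            ya = np.append(y_h, 1.0)
            y_o = sigmoid(W_o @ ya)
            # Backward step: delta_k for output units, delta_h for hidden units
            d_o = y_o * (1 - y_o) * (t - y_o)
            d_h = y_h * (1 - y_h) * (W_o[:, :-1].T @ d_o)
            # Weight updates: w <- w + eta * delta_j * x_ij
            W_o += eta * np.outer(d_o, ya)
            W_h += eta * np.outer(d_h, xa)
    return W_h, W_o

# Usage sketch: learn XOR, which a single-layer perceptron cannot represent
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = backprop_train(X, T, epochs=5000)
```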

34 Backpropagation Algorithm (Cont’d) Gradient descent over the entire network weight vector. Easily generalized to arbitrary directed graphs. Will find a local, not necessarily global, error minimum; in practice it often works well (it can be invoked multiple times with different initial weights). Often includes a weight momentum term: Δwi,j(n) = η δj xi,j + α Δwi,j(n-1). Minimizes error on the training examples; will it generalize well to unseen instances (over-fitting)? Training can be slow, typically requiring many iterations (Levenberg-Marquardt can be used instead of gradient descent). Using the network after training is fast.

35 Convergence of Backpropagation Gradient descent converges to some local minimum, perhaps not the global minimum. Remedies: add a momentum term, Δwki(n) = η δk(n) xi(n) + α Δwki(n-1) with α ∈ [0, 1]; use stochastic gradient descent; train multiple nets with different initial weights. Nature of convergence: initialize the weights near zero, so the initial network is near-linear; increasingly non-linear functions become possible as training progresses.
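As a small illustration of the momentum idea (the 1-D quadratic, the generic gradient form, and all names here are my own, not from the slides):

```python
def momentum_update(w, grad, prev_delta, eta=0.1, alpha=0.9):
    """One momentum-augmented step: delta(n) = -eta*grad + alpha*delta(n-1)."""
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta   # return delta so the caller can reuse it next step

# Usage sketch on the 1-D quadratic E(w) = w^2, whose gradient is 2w
w, delta = 5.0, 0.0
for _ in range(200):
    w, delta = momentum_update(w, 2 * w, delta)
print(w)   # converges toward the minimum at w = 0
```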

36 Optimization Methods There are other optimization methods with faster convergence than gradient descent: Newton’s method, which uses a quadratic approximation (2nd-order Taylor expansion) F(x + Δx) = F(x) + ∇F(x)^T Δx + ½ Δx^T ∇²F(x) Δx + …; conjugate gradients; and the Levenberg-Marquardt algorithm.
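The slide stops at the Taylor expansion; minimizing that quadratic model gives the Newton step (a standard result, added here for completeness rather than taken from the slide):

```latex
\Delta \mathbf{x} = -\left[\nabla^{2} F(\mathbf{x})\right]^{-1} \nabla F(\mathbf{x}),
\qquad
\mathbf{x}_{k+1} = \mathbf{x}_{k} - \left[\nabla^{2} F(\mathbf{x}_{k})\right]^{-1} \nabla F(\mathbf{x}_{k})
```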

37 Universal Approximation Property of ANN Boolean functions: every Boolean function can be represented by a network with a single hidden layer, but this might require a number of hidden units exponential in the number of inputs. Continuous functions: every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989, Hornik 1989]; any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988].

38 Using Weight Derivatives How often to update: after each training case, or after a full sweep through the training data? How much to update: use a fixed learning rate, adapt the learning rate, add momentum, or avoid steepest descent altogether?

39 What Next? Bias effect; batch vs. continuous learning; variable learning rate (update rule?); effect of neurons per layer; effect of hidden layers.

