1 Mehran University of Engineering and Technology, Jamshoro Department of Electronic Engineering Neural Networks Feedforward Networks By Dr. Mukhtiar Ali Unar
2 Multilayer Feedforward Neural Networks Most popular ANN architecture. Easy to implement. It allows supervised learning. Simple, layered structure: a layer consists of an arbitrary number of artificial neurons (nodes). In most cases, all the neurons in a particular layer contain the same activation function, while neurons in different layers may have different activation functions. When a signal passes through the network, it always propagates in the forward direction from one layer to the next; it never feeds back to neurons in the same or a previous layer.
3 Multilayer Feedforward Neural Networks [Fig. 1: A simple three-layer (one hidden layer) feedforward ANN, showing an input layer, a hidden layer, and an output layer, with weights w11, w21, ..., wpn on the connections.]
4 Multilayer Feedforward Neural Networks In a feedforward neural network, the first layer is called the input layer; this is where the input patterns based on the data sample are fed in. This layer does not perform any processing and consists of fan-out units only. Then follow one or more hidden layers; as the name indicates, these layers cannot be accessed from outside the network. The hidden layers enable the network to learn complex tasks by extracting progressively more meaningful features from the input patterns. The final layer is the output layer, which may contain one or more output neurons. This is the layer where the network's decision, for a given input pattern, can be read out.
5 Types of Feedforward Neural Networks There are two main categories of feedforward neural networks: Multilayer Perceptron Networks and Radial Basis Function Networks.
6 Multilayer Perceptron (MLP) Networks It is the best known type of feedforward ANN. The structure of an MLP network is similar to that shown in Figure 1 (slide 3). It consists of an input layer, one or more hidden layers, and an output layer. The number of hidden layers and the number of neurons in each layer are not fixed; each layer may have a different number of neurons, depending on the application. The developer has to determine how many layers, and how many neurons per layer, should be selected for a particular application.
7 Multilayer Perceptron (MLP) Networks All neurons in the hidden layers have a smooth sigmoidal nonlinearity (activation function), such as the logistic function

y_i = f(u_i) = \frac{1}{1 + e^{-a u_i}}

or the hyperbolic tangent function

y_i = f(u_i) = a \tanh(b u_i)

where u_i is the net internal activity level of neuron i, y_i is the output of the same neuron, and a, b are positive constants. Generally, an MLP network learns faster with the hyperbolic tangent function than with the logistic function. The important point to emphasize here is that the nonlinearity is smooth, i.e. differentiable everywhere.
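The two activation functions above, and the derivatives needed later for back-propagation, can be sketched as follows. The constants a = 1.716 and b = 0.667 in the tanh version are conventional choices from the literature, not values given in these slides:

```python
import math

def logistic(u, a=1.0):
    """Logistic sigmoid: f(u) = 1 / (1 + exp(-a*u)); output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a * u))

def logistic_deriv(u, a=1.0):
    """Derivative via the output: f'(u) = a * f(u) * (1 - f(u))."""
    y = logistic(u, a)
    return a * y * (1.0 - y)

def tanh_act(u, a=1.716, b=0.667):
    """Hyperbolic tangent activation: f(u) = a * tanh(b*u); output in (-a, a)."""
    return a * math.tanh(b * u)

def tanh_deriv(u, a=1.716, b=0.667):
    """f'(u) = a * b * (1 - tanh(b*u)^2)."""
    return a * b * (1.0 - math.tanh(b * u) ** 2)
```

Expressing the derivative through the already-computed output (as in `logistic_deriv`) is the trick that makes back-propagation cheap for these functions.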
8 Multilayer Perceptron (MLP) Networks A number of researchers have proved mathematically that a single hidden layer feedforward network is capable of approximating any continuous multivariable function to any desired degree of accuracy, provided that sufficiently many hidden layer neurons are available. However, the issue of choosing an appropriate number of hidden neurons for an MLP network remains largely unresolved.
9 Multilayer Perceptron (MLP) Networks With too few hidden neurons, the network may not produce outputs reasonably close to the targets. This effect is called underfitting. On the other hand, an excessive number of hidden layer neurons will increase the training time and may cause a problem called overfitting. The network will have so much information-processing capability that it will learn insignificant aspects of the training set, aspects that are irrelevant to the general population. If the performance of the network is evaluated with the training set, it will be excellent. However, when the network is called upon to work with the general population, it will do poorly, because it will consider trivial features unique to training set members alongside the important general features, and become confused. Thus it is very important to choose an appropriate number of hidden layer neurons for satisfactory performance of the network.
10 Multilayer Perceptron (MLP) Networks How to choose a suitable number of hidden layer neurons? Lippmann rule: Lippmann has shown that the number of hidden layer neurons in a one-hidden-layer MLP network need be no more than Q(P+1), where Q is the number of output units and P is the number of inputs. Geometric pyramid rule: in a single hidden layer network with P inputs and Q outputs, the number of neurons in the hidden layer should be approximately \sqrt{PQ}.
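The two rules of thumb above can be sketched as simple helper functions (the rounding in the geometric pyramid rule is an arbitrary choice; any nearby integer is equally defensible):

```python
import math

def lippmann_upper_bound(p, q):
    """Lippmann rule: Q*(P+1) hidden neurons suffice in one hidden layer,
    for P inputs and Q output units."""
    return q * (p + 1)

def geometric_pyramid(p, q):
    """Geometric pyramid rule: hidden layer size near sqrt(P*Q)."""
    return round(math.sqrt(p * q))

# For a network with 10 inputs and 2 outputs, the two rules disagree widely,
# which is exactly why the next slide warns that they are rough guides.
print(lippmann_upper_bound(10, 2))  # 22
print(geometric_pyramid(10, 2))     # 4
```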
11 Multilayer Perceptron (MLP) Networks Important: the above formulas are only rough approximations to the ideal hidden layer size and may be far from optimal in certain applications. For example, if the problem is complex but there are only a few inputs and outputs, we may need many more hidden neurons than suggested by the above formulae. On the other hand, if the problem is simple but has many inputs and outputs, fewer neurons will often suffice.
12 Multilayer Perceptron (MLP) Networks A common approach is to start with a small number of hidden neurons (e.g. just two), train and test the network, then slightly increase the number of hidden neurons and train and test again. Continue this procedure until satisfactory performance is achieved. This procedure is time consuming, but usually results in success.
13 Error Back-Propagation Algorithm: Case I: Neuron j is an output node. Figure 2 shows a neuron j located in the output layer of an MLP network. [Fig. 2: signal-flow graph of output neuron j at iteration n, with inputs y_0 = -1 (bias), y_1[n], ..., y_i[n], ..., y_p[n], weights w_j0[n], w_j1[n], ..., w_ji[n], ..., w_jp[n], net input u_j[n], activation function f(.), output y_j[n], desired response d_j[n], and error signal e_j[n].]
14 The input to neuron j at iteration n is

u_j[n] = \sum_{i=0}^{p} w_{ji}[n] \, y_i[n]    (1)

where p is the total number of inputs and w_{j0} is the threshold (applied to the fixed input y_0 = -1). The output of neuron j will be

y_j[n] = f(u_j[n])    (2)

where f(.) is the activation function of the neuron. Let d_j[n] be the desired output at iteration n. The error signal will therefore be given by

e_j[n] = d_j[n] - y_j[n]    (3)

The instantaneous value of the squared error corresponding to neuron j will be

E_j[n] = \tfrac{1}{2} e_j^2[n]    (4)
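Equations (1)-(4) for a single neuron can be sketched directly. The function names and the choice of plain Python lists are illustrative, not from the slides:

```python
def neuron_forward(weights, inputs, f):
    """Eq. (1)-(2): u_j = sum_i w_ji * y_i (with y_0 = -1 carrying the
    threshold w_j0), then y_j = f(u_j)."""
    u = sum(w * y for w, y in zip(weights, inputs))
    return f(u)

def output_error(d, y):
    """Eq. (3)-(4): e_j = d_j - y_j and the instantaneous squared
    error 0.5 * e_j^2."""
    e = d - y
    return e, 0.5 * e * e

# Example with an identity activation for easy hand-checking:
y = neuron_forward([0.2, 0.5], [-1.0, 1.0], lambda u: u)   # u = 0.3
e, E = output_error(1.0, y)                                 # e = 0.7, E = 0.245
```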
15 and hence the instantaneous sum of squared errors of the network will be

E[n] = \tfrac{1}{2} \sum_{j \in C} e_j^2[n]    (5)

where C is the set containing all neurons of the output layer. If the total number of patterns contained in the training set is N, then the average squared error of the network will be

E_{av} = \frac{1}{N} \sum_{n=1}^{N} E[n]    (6)

This is the cost function of the network, which is to be minimized.
16 Differentiating E[n] with respect to w_{ji}[n] and making use of the chain rule, we get

\frac{\partial E[n]}{\partial w_{ji}[n]} = \frac{\partial E[n]}{\partial e_j[n]} \frac{\partial e_j[n]}{\partial y_j[n]} \frac{\partial y_j[n]}{\partial u_j[n]} \frac{\partial u_j[n]}{\partial w_{ji}[n]}    (7)

The first term on the right-hand side (RHS) of the above equation can be found by differentiating both sides of Equation (5) with respect to e_j[n]:

\frac{\partial E[n]}{\partial e_j[n]} = e_j[n]    (8)

The next term, i.e. \partial e_j[n] / \partial y_j[n], on the RHS of (7) can be obtained by differentiating (3) with respect to y_j[n]:
17

\frac{\partial e_j[n]}{\partial y_j[n]} = -1    (9)

To find the term \partial y_j[n] / \partial u_j[n], we differentiate (2) with respect to u_j[n]. That is,

\frac{\partial y_j[n]}{\partial u_j[n]} = f_j'(u_j[n])    (10)

Finally, the last term, i.e. \partial u_j[n] / \partial w_{ji}[n], on the RHS of (7) can be computed by differentiating (1) with respect to w_{ji}[n], and is given by

\frac{\partial u_j[n]}{\partial w_{ji}[n]} = y_i[n]    (11)

Now equation (7) becomes
18

\frac{\partial E[n]}{\partial w_{ji}[n]} = -e_j[n] \, f_j'(u_j[n]) \, y_i[n]    (12)

The correction \Delta w_{ji}[n] applied to w_{ji}[n] can now be defined as

\Delta w_{ji}[n] = -\eta \frac{\partial E[n]}{\partial w_{ji}[n]}    (13)

where \eta is the learning rate, a factor deciding how fast the weights are allowed to change at each time step. The minus sign indicates that the weights are to be changed in such a way that the error decreases. Substituting (12) into (13) yields

\Delta w_{ji}[n] = \eta \, \delta_j[n] \, y_i[n]    (14)

where the local gradient \delta_j[n] is defined by
19

\delta_j[n] = e_j[n] \, f_j'(u_j[n])    (15)

which shows that the local gradient \delta_j[n] is the product of the corresponding error signal e_j[n] and the derivative f_j'(u_j[n]) of the associated activation function. The above derivation is based on the assumption that neuron j is located in the output layer of the network. Of course, this is the simplest case: since neuron j is in the output layer, where the desired signal is always available, it is quite straightforward to compute the error e_j[n] and the local gradient \delta_j[n] by using (3) and (15) respectively.
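For an output node, equations (3), (14) and (15) amount to a one-line computation; a minimal sketch (function names are illustrative):

```python
def output_delta(d, y, f_prime_u):
    """Eq. (3) and (15): e_j = d_j - y_j, then delta_j = e_j * f'(u_j)."""
    return (d - y) * f_prime_u

def weight_correction(eta, delta, y_i):
    """Eq. (14): Delta w_ji = eta * delta_j * y_i."""
    return eta * delta * y_i

# Desired output 1.0, actual 0.6, and f'(u_j) = 0.24 (a logistic unit
# near its steepest region) give delta_j = 0.096.
delta = output_delta(1.0, 0.6, 0.24)
dw = weight_correction(0.1, delta, 0.5)
```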
20 Case II: Neuron j is a hidden node. Let us consider the case in which neuron j is not in the output layer of the network but is located in the hidden layer immediately to the left of the output layer, as shown in the figure on the next slide. Note that now the index j refers to the hidden layer and the index k refers to the output layer. Also note that the desired response d_k[n] is not directly available to hidden layer neurons.
21 [Figure: signal-flow graph of hidden neuron j connected to output neuron k. Neuron j receives the bias input y_0 = -1 and inputs y_i[n] through weights w_ji[n] (including w_j0), forms the net input u_j[n], and produces the output y_j[n] via f(.). This output feeds neuron k, which forms u_k[n], produces y_k[n] via f(.), and generates the error signal e_k[n] from the desired response d_k[n].]
22 In this new situation, the local gradient will take the following form:

\delta_j[n] = -\frac{\partial E[n]}{\partial y_j[n]} \, f_j'(u_j[n])    (16)

As neuron k is located in the output layer,

E[n] = \tfrac{1}{2} \sum_{k \in C} e_k^2[n]    (17)

which is simply (5) with the index j replaced by the index k. Differentiating this equation with respect to y_j[n] and using the chain rule, we obtain

\frac{\partial E[n]}{\partial y_j[n]} = \sum_{k} e_k[n] \frac{\partial e_k[n]}{\partial u_k[n]} \frac{\partial u_k[n]}{\partial y_j[n]}    (18)
23 Since

e_k[n] = d_k[n] - y_k[n] = d_k[n] - f_k(u_k[n])    (19)

therefore

\frac{\partial e_k[n]}{\partial u_k[n]} = -f_k'(u_k[n])    (20)

The net input of neuron k is given by

u_k[n] = \sum_{j=0}^{q} w_{kj}[n] \, y_j[n]    (21)

where q is the total number of inputs applied to neuron k. Differentiating u_k[n] with respect to y_j[n], we have

\frac{\partial u_k[n]}{\partial y_j[n]} = w_{kj}[n]    (22)

Substituting (20) and (22) in (18) yields
24

\frac{\partial E[n]}{\partial y_j[n]} = -\sum_{k} \delta_k[n] \, w_{kj}[n]    (23)

The local gradient \delta_j[n] for the hidden neuron j can now be obtained by using (23) in (16):

\delta_j[n] = f_j'(u_j[n]) \sum_{k} \delta_k[n] \, w_{kj}[n]    (24)
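Equation (24) says that a hidden neuron's local gradient is its own derivative times the back-propagated, weight-scaled sum of the next layer's gradients. A minimal sketch (names illustrative):

```python
def hidden_delta(f_prime_u_j, deltas_next, w_next_j):
    """Eq. (24): delta_j = f'(u_j) * sum_k delta_k * w_kj, where w_next_j
    holds the weights w_kj from hidden neuron j into each neuron k of the
    next layer, and deltas_next holds the corresponding delta_k values."""
    return f_prime_u_j * sum(dk * wkj for dk, wkj in zip(deltas_next, w_next_j))

# Two output neurons with deltas 0.2 and 0.1, connected to hidden neuron j
# through weights 1.0 and 2.0; f'(u_j) = 0.25:
d_j = hidden_delta(0.25, [0.2, 0.1], [1.0, 2.0])   # 0.25 * 0.4 = 0.1
```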
25 Summary: 1. The correction \Delta w_{ji}[n] applied to the synaptic weight connecting neuron i to neuron j is defined by the delta rule:

\Delta w_{ji}[n] = \eta \, \delta_j[n] \, y_i[n]

2. The local gradient \delta_j[n] depends on whether neuron j is an output node or a hidden node: (a) If neuron j is an output node, \delta_j[n] equals the product of the derivative f_j'(u_j[n]) and the error signal e_j[n], both of which are associated with neuron j. (b) If neuron j is a hidden node, \delta_j[n] equals the product of the associated derivative f_j'(u_j[n]) and the weighted sum of the \delta's computed for the neurons in the next hidden or output layer that are connected to neuron j.
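Putting Case I and Case II together, one complete back-propagation step for a one-hidden-layer MLP with logistic units can be sketched as below. The weight layout (rows of Python lists, bias handled by prepending y_0 = -1) and the learning rate are illustrative assumptions, not prescribed by the slides:

```python
import math

def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))

def train_step(x, d, W_h, W_o, eta=0.5):
    """One back-propagation step. W_h[j][i] are hidden-layer weights,
    W_o[k][j] are output-layer weights; column 0 of each row is the
    threshold, paired with a constant bias input y_0 = -1.
    Returns the instantaneous error E[n] of eq. (5) before the update."""
    # Forward pass, eq. (1)-(2)
    x = [-1.0] + list(x)
    u_h = [sum(w * xi for w, xi in zip(row, x)) for row in W_h]
    y_h = [-1.0] + [logistic(u) for u in u_h]
    y_o = [logistic(sum(w * yj for w, yj in zip(row, y_h))) for row in W_o]
    # Case I, eq. (15): delta_k = e_k * f'(u_k), with f'(u) = y(1-y) for logistic
    delta_o = [(dk - yk) * yk * (1.0 - yk) for dk, yk in zip(d, y_o)]
    # Case II, eq. (24): delta_j = f'(u_j) * sum_k delta_k * w_kj
    delta_h = [y_h[j + 1] * (1.0 - y_h[j + 1]) *
               sum(delta_o[k] * W_o[k][j + 1] for k in range(len(W_o)))
               for j in range(len(W_h))]
    # Delta rule, eq. (14): Delta w = eta * delta * y
    for k, row in enumerate(W_o):
        for j in range(len(row)):
            row[j] += eta * delta_o[k] * y_h[j]
    for j, row in enumerate(W_h):
        for i in range(len(row)):
            row[i] += eta * delta_h[j] * x[i]
    return 0.5 * sum((dk - yk) ** 2 for dk, yk in zip(d, y_o))
```

Repeating `train_step` on a pattern should drive the error of eq. (5) down, which is a quick sanity check that the signs in (14), (15) and (24) are implemented consistently.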
26 Improved Back-Propagation: The back-propagation algorithm derived above has some drawbacks. First of all, the learning rate \eta should be chosen small to provide minimization of the total error signal. However, for a small \eta the learning process becomes very slow. On the other hand, a large value of \eta corresponds to rapid learning, but leads to parasitic oscillations which prevent the algorithm from converging to the desired solution. Moreover, if the error function contains many local minima, the network might get trapped in some local minimum, or get stuck on a very flat plateau. One simple way to improve the standard back-propagation algorithm is to use an adaptive learning rate and momentum, as described below.
27 Momentum: Here the idea is to give the weights and biases some momentum so that they do not get stuck in local minima, but have enough energy to pass over them. Mathematically, adding momentum is expressed as:

\Delta w_{ji}[n] = \alpha \, \Delta w_{ji}[n-1] + \eta \, \delta_j[n] \, y_i[n]    (25)

where \alpha is the momentum constant and must have a value between 0 and 1. If \alpha is zero, the algorithm is the same as the basic back-propagation rule, i.e. no momentum. \alpha equal to 1 means that the weights change exactly as they did in the preceding time step. A typical value of \alpha is 0.9.
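Equation (25) is a one-line change to the weight-correction step; a minimal sketch (default values of eta and alpha are illustrative):

```python
def momentum_update(prev_dw, delta, y_i, eta=0.1, alpha=0.9):
    """Eq. (25): Delta w_ji[n] = alpha * Delta w_ji[n-1] + eta * delta_j[n] * y_i[n].
    With alpha = 0 this reduces to the basic delta rule of eq. (14)."""
    return alpha * prev_dw + eta * delta * y_i
```

In practice the caller stores the returned value and feeds it back as `prev_dw` at the next iteration, so consecutive corrections in the same direction accumulate speed.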
28 Adaptive Learning Rate: As mentioned earlier, it is difficult to choose an appropriate value of the learning rate \eta for a particular application, and the optimal value can change during training. Thus, this parameter should be updated as the training phase progresses; that is, the learning rate should be adaptive. One way of doing this is to change the learning rate according to the way in which the error function responded to the last change in weights. If a weight update decreased the error function, the weights probably were changed in the right direction, and \eta is increased. On the other hand, if the error function increased, we reduce the value of \eta.
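The rule above can be sketched as a single update function; the growth and shrink factors (1.05 and 0.7) are illustrative choices, not values given in the slides:

```python
def adapt_learning_rate(eta, prev_error, new_error, grow=1.05, shrink=0.7):
    """If the last weight change reduced the error, increase eta slightly;
    if the error increased (or stayed the same), reduce eta more sharply.
    The asymmetry is a common design choice: overshooting is penalized
    harder than cautious progress is rewarded."""
    return eta * grow if new_error < prev_error else eta * shrink
```

A training loop would call this once per epoch, passing the error from eq. (6) before and after the last weight update.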