Presentation on theme: "Mehran University of Engineering and Technology, Jamshoro Department of Electronic Engineering Neural Networks Feedforward Networks By Dr. Mukhtiar Ali."— Presentation transcript:
1 Mehran University of Engineering and Technology, Jamshoro, Department of Electronic Engineering. Neural Networks: Feedforward Networks. By Dr. Mukhtiar Ali Unar
2 Multilayer Feedforward Neural Networks
The most popular ANN architecture. Easy to implement. Allows supervised learning. Simple, because it is layered. A layer consists of an arbitrary number of artificial neurons (nodes). In most cases, all the neurons in a particular layer have the same activation function. The neurons in different layers may have different activation functions. When a signal passes through the network, it always propagates in the forward direction from one layer to the next, never to other neurons in the same layer or in a previous layer.
3 Multilayer Feedforward Neural Networks
[Figure 1: A simple three-layer (one hidden layer) feedforward ANN, showing the input layer, the hidden layer, and the output layer, with connection weights w11, w21, ..., wpn.]
4 Multilayer Feedforward Neural Networks
In a feedforward neural network, the first layer is called the input layer, and is the layer where the input patterns based on the data sample are fed in. This layer does not perform any processing and consists of fan-out units only. Then follow one or more hidden layers; as the name indicates, these layers cannot be accessed from outside the network. The hidden layers enable the network to learn complex tasks by extracting progressively more meaningful features from the input patterns. The final layer is the output layer, which may contain one or more output neurons. This is the layer where the network's decision for a given input pattern can be read out.
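As a rough illustration of the layered structure described above, the following Python sketch propagates one input pattern through a single hidden layer and an output layer. The function names (`logistic`, `layer_forward`, `mlp_forward`), the weight values, the choice of a logistic activation in both layers, and the convention of storing each neuron's threshold as its first weight (paired with a fixed input of -1) are assumptions made for this example.

```python
import math


def logistic(u):
    """Logistic activation: one possible smooth nonlinearity."""
    return 1.0 / (1.0 + math.exp(-u))


def layer_forward(weights, inputs, activation):
    """Compute the outputs of one layer.

    weights[j] holds the weights of neuron j. weights[j][0] is the
    threshold weight, paired with a fixed input of -1 (assumed convention).
    """
    extended = [-1.0] + list(inputs)
    outputs = []
    for w_j in weights:
        u_j = sum(w * y for w, y in zip(w_j, extended))  # net input
        outputs.append(activation(u_j))
    return outputs


def mlp_forward(hidden_weights, output_weights, pattern):
    """The input layer only fans the pattern out; signals move forward only."""
    hidden = layer_forward(hidden_weights, pattern, logistic)
    return layer_forward(output_weights, hidden, logistic)


# Hypothetical network: 2 inputs, 2 hidden neurons, 1 output, made-up weights.
W_HIDDEN = [[0.0, 1.0, -1.0], [0.0, -1.0, 1.0]]
W_OUTPUT = [[0.0, 1.0, 1.0]]
y = mlp_forward(W_HIDDEN, W_OUTPUT, [0.5, 0.5])
```

Note that no value ever flows back to an earlier layer or sideways within a layer: each layer's outputs depend only on the outputs of the layer before it.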
5 Types of Feedforward Neural Networks
There are two main categories of feedforward neural networks: Multilayer Perceptron (MLP) networks and Radial Basis Function (RBF) networks.
6 Multilayer Perceptron (MLP) Networks
It is the best-known type of feedforward ANN. The structure of an MLP network is similar to that shown in Figure 1 (slide 3). It consists of an input layer, one or more hidden layers, and an output layer. The number of hidden layers and the number of neurons in each layer are not fixed. Each layer may have a different number of neurons, depending on the application. The developer has to determine how many layers and how many neurons per layer should be selected for a particular application.
7 Multilayer Perceptron (MLP) Networks
All neurons in the hidden layers have a sigmoidal nonlinearity (activation function), such as a logistic function
$$y_i = \frac{1}{1 + \exp(-u_i)}$$
or a hyperbolic tangent function
$$y_i = a \tanh(b\, u_i)$$
where $u_i$ is the net internal activity level of neuron i, $y_i$ is the output of the same neuron, and a, b are constants. Generally, an MLP network learns faster with the hyperbolic tangent function than with the logistic function. The important point to emphasize here is that the nonlinearity is smooth, i.e. differentiable everywhere.
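The two sigmoidal nonlinearities can be sketched in Python as follows. The exact placement of the constants a and b is an assumption here (the logistic is taken without constants and the hyperbolic tangent as a·tanh(b·u)), as are the function names and the default values a = b = 1. The derivatives are included because smoothness is what makes gradient-based training possible.

```python
import math


def logistic(u):
    """Logistic function; output lies in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-u))


def logistic_derivative(u):
    """Derivative of the logistic: y * (1 - y)."""
    y = logistic(u)
    return y * (1.0 - y)


def scaled_tanh(u, a=1.0, b=1.0):
    """Hyperbolic tangent in the assumed form a * tanh(b * u); output in (-a, a)."""
    return a * math.tanh(b * u)


def scaled_tanh_derivative(u, a=1.0, b=1.0):
    """Derivative of a * tanh(b * u): a * b * (1 - tanh(b * u)^2)."""
    t = math.tanh(b * u)
    return a * b * (1.0 - t * t)
```

One difference worth noticing is that the logistic output is always positive, whereas the hyperbolic tangent is antisymmetric about zero.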
8 Multilayer Perceptron (MLP) Networks
A number of researchers have proved mathematically that a feedforward network with a single hidden layer is capable of approximating any continuous multivariable function to any desired degree of accuracy, provided that sufficiently many hidden layer neurons are available. The issue of choosing an appropriate number of neurons for an MLP network remains largely unresolved.
9 Multilayer Perceptron (MLP) Networks
With too few hidden neurons, the network may not produce outputs reasonably close to the targets. This effect is called underfitting. On the other hand, an excessive number of hidden layer neurons will increase the training time and may cause a problem called overfitting. The network will have so much information processing capability that it will learn insignificant aspects of the training set, aspects that are irrelevant to the general population. If the performance of the network is evaluated with the training set, it will be excellent. However, when the network is called upon to work with the general population, it will do poorly. This is because it will consider trivial features unique to training set members, as well as important general features, and become confused. Thus it is very important to choose an appropriate number of hidden layer neurons for satisfactory performance of the network.
10 Multilayer Perceptron (MLP) Networks
How to choose a suitable number of hidden layer neurons?
Lippmann rule: Lippmann has proved that the number of hidden layer neurons in a one-hidden-layer MLP network should be Q(P+1), where Q is the number of output units and P is the number of inputs.
Geometric pyramid rule: In a single-hidden-layer network with P inputs and Q outputs, the number of neurons in the hidden layer should be $\sqrt{PQ}$.
11 Multilayer Perceptron (MLP) Networks
Important: The above formulas are only rough approximations to the ideal hidden layer size and may be far from optimal in certain applications. For example, if the problem is complex but there are only a few inputs and outputs, we may need many more hidden neurons than suggested by the above formulas. On the other hand, if the problem is simple with many inputs and outputs, fewer neurons will often suffice.
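To get a feel for how far apart the two rules of thumb can be, the sketch below evaluates both for a given number of inputs P and outputs Q. It takes the geometric pyramid rule in its commonly quoted form, the square root of P·Q; the function names, the rounding to the nearest integer, and the floor of one neuron are assumptions of this example.

```python
import math


def lippmann_rule(p, q):
    """Hidden-layer size Q(P + 1) for P inputs and Q outputs."""
    return q * (p + 1)


def geometric_pyramid_rule(p, q):
    """Hidden-layer size sqrt(P * Q), rounded, with at least one neuron (assumed)."""
    return max(1, round(math.sqrt(p * q)))


# Hypothetical problem with 8 inputs and 2 outputs.
suggestions = (lippmann_rule(8, 2), geometric_pyramid_rule(8, 2))
```

For this hypothetical case the two rules suggest 18 and 4 hidden neurons respectively, which illustrates why they can only serve as starting points.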
12 Multilayer Perceptron (MLP) Networks
A common approach is to start with a small number of hidden neurons (e.g. just two). Then increase the number of hidden neurons slightly, and train and test the network again. Continue this procedure until satisfactory performance is achieved. This procedure is time consuming, but usually results in success.
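The trial-and-error procedure can be sketched as a simple loop. Here `train_and_test` is a hypothetical, user-supplied callable that trains a network with the given number of hidden neurons and returns its test error; the stopping threshold `target_error`, the upper limit `max_hidden`, and the fallback of returning the best size seen are all assumptions of this example.

```python
def grow_hidden_layer(train_and_test, start=2, step=1,
                      max_hidden=50, target_error=0.05):
    """Increase the hidden-layer size until the test error is acceptable.

    train_and_test(h) is a hypothetical callable: it trains a network with
    h hidden neurons and returns the error measured on a test set.
    Returns (hidden_neurons, test_error).
    """
    best = None
    h = start
    while h <= max_hidden:
        error = train_and_test(h)
        if best is None or error < best[1]:
            best = (h, error)
        if error <= target_error:
            return h, error          # satisfactory performance reached
        h += step
    return best                      # limit reached: report the best size tried


# Stand-in for real training, used only to exercise the loop.
result = grow_hidden_layer(lambda h: 1.0 / h, target_error=0.21)
```

Measuring the error on data not used for training matters here, because the training error alone keeps improving as neurons are added and would hide overfitting.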
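13 Error Back-Propagation Algorithm
Case I: Neuron j is an output node. Figure 2 shows a neuron j located in the output layer of an MLP network.
[Figure 2: Signal-flow graph of output neuron j. The inputs y0 = -1, y1[n], ..., yi[n], ..., yp[n] are weighted by wj0[n], wj1[n], ..., wji[n], ..., wjp[n] and summed to give uj[n]; the activation function f(.) produces yj[n], which is subtracted from the desired output dj[n] to give the error ej[n].]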
14 The input to neuron j at iteration n is
$$u_j[n] = \sum_{i=0}^{p} w_{ji}[n]\, y_i[n] \tag{1}$$
where p is the total number of inputs and $w_{j0}$ is the threshold (applied to the fixed input $y_0 = -1$). The output of neuron j will be
$$y_j[n] = f_j(u_j[n]) \tag{2}$$
where $f_j(\cdot)$ is the activation function of the neuron. Let $d_j[n]$ be the desired output at iteration n. The error signal will therefore be given by
$$e_j[n] = d_j[n] - y_j[n] \tag{3}$$
The instantaneous value of the squared error corresponding to neuron j will be
15
$$\tfrac{1}{2}\, e_j^2[n] \tag{4}$$
and hence the instantaneous sum of squared errors will be
$$E[n] = \tfrac{1}{2} \sum_{j \in C} e_j^2[n] \tag{5}$$
where C is the set containing all neurons of the output layer. If the total number of patterns contained in the training set is N, then the average squared error of the network will be
$$E_{av} = \frac{1}{N} \sum_{n=1}^{N} E[n] \tag{6}$$
This is the cost function of the network, which is to be minimized.
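The error measures above can be computed directly. This sketch assumes the usual forms, with the instantaneous error taken as half the sum of squared output errors and the cost as its mean over the N training patterns; the function names and the sample values are made up for illustration.

```python
def instantaneous_error(desired, actual):
    """E[n]: half the sum of squared errors e_j = d_j - y_j over the output neurons."""
    return 0.5 * sum((d - y) ** 2 for d, y in zip(desired, actual))


def average_error(desired_set, actual_set):
    """E_av: mean of E[n] over the N patterns of the training set."""
    n_patterns = len(desired_set)
    total = sum(instantaneous_error(d, y)
                for d, y in zip(desired_set, actual_set))
    return total / n_patterns


# Hypothetical targets and network outputs for N = 2 patterns, 2 output neurons.
targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[0.5, 0.5], [0.0, 1.0]]
cost = average_error(targets, outputs)
```

In this example the first pattern contributes an error of 0.25 and the second contributes nothing, so the average is 0.125.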
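16 Differentiating E[n] with respect to $w_{ji}[n]$ and making use of the chain rule, we get
$$\frac{\partial E[n]}{\partial w_{ji}[n]} = \frac{\partial E[n]}{\partial e_j[n]} \cdot \frac{\partial e_j[n]}{\partial y_j[n]} \cdot \frac{\partial y_j[n]}{\partial u_j[n]} \cdot \frac{\partial u_j[n]}{\partial w_{ji}[n]} \tag{7}$$
The first term, $\partial E[n]/\partial e_j[n]$, on the right-hand side (RHS) of the above equation can be found by differentiating both sides of Equation (5) with respect to $e_j[n]$:
$$\frac{\partial E[n]}{\partial e_j[n]} = e_j[n] \tag{8}$$
The next term, i.e. $\partial e_j[n]/\partial y_j[n]$, on the RHS of (7) can be obtained by differentiating (3) with respect to $y_j[n]$: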
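17
$$\frac{\partial e_j[n]}{\partial y_j[n]} = -1 \tag{9}$$
To find the term $\partial y_j[n]/\partial u_j[n]$, we have to differentiate (2) with respect to $u_j[n]$. That is,
$$\frac{\partial y_j[n]}{\partial u_j[n]} = f_j'(u_j[n]) \tag{10}$$
Finally, the last term (i.e. $\partial u_j[n]/\partial w_{ji}[n]$) on the RHS of (7) can be computed by differentiating (1) with respect to $w_{ji}[n]$ and is given by
$$\frac{\partial u_j[n]}{\partial w_{ji}[n]} = y_i[n] \tag{11}$$
Now equation (7) becomes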
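18
$$\frac{\partial E[n]}{\partial w_{ji}[n]} = -e_j[n]\, f_j'(u_j[n])\, y_i[n] \tag{12}$$
The correction $\Delta w_{ji}[n]$ applied to $w_{ji}[n]$ can now be defined as
$$\Delta w_{ji}[n] = -\eta\, \frac{\partial E[n]}{\partial w_{ji}[n]} \tag{13}$$
where $\eta$ is the learning rate, a factor deciding how fast the weights are allowed to change at each time step. The minus sign indicates that the weights are to be changed in such a way that the error decreases. Substituting (12) into (13) yields
$$\Delta w_{ji}[n] = \eta\, \delta_j[n]\, y_i[n] \tag{14}$$
where the local gradient $\delta_j[n]$ is defined by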
19
$$\delta_j[n] = -\frac{\partial E[n]}{\partial u_j[n]} = e_j[n]\, f_j'(u_j[n]) \tag{15}$$
which shows that the local gradient $\delta_j[n]$ is the product of the corresponding error signal $e_j[n]$ and the derivative $f_j'(u_j[n])$ of the associated activation function. The above derivation is based on the assumption that neuron j is located in the output layer of the network. Of course, this is the simplest case. Since neuron j is in the output layer, where the desired signal is always available, it is quite straightforward to compute the error $e_j[n]$ and the local gradient $\delta_j[n]$ by using (3) and (15) respectively.
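The output-node case can be sketched as a single weight update for one neuron. This is a minimal sketch that assumes a logistic activation (the derivation above holds for any smooth activation) and stores the threshold as the first weight, paired with the fixed input -1; the function names and the numbers in the example are hypothetical.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def logistic_derivative(u):
    y = logistic(u)
    return y * (1.0 - y)


def output_neuron_update(weights, inputs, desired, eta):
    """One delta-rule step for a single output neuron j.

    weights[0] is the threshold weight, paired with the fixed input y_0 = -1.
    Returns the updated weights and the error signal e_j before the update.
    """
    y_in = [-1.0] + list(inputs)
    u_j = sum(w * y for w, y in zip(weights, y_in))    # net input, Eq. (1)
    y_j = logistic(u_j)                                # output, Eq. (2)
    e_j = desired - y_j                                # error signal, Eq. (3)
    delta_j = e_j * logistic_derivative(u_j)           # local gradient, Eq. (15)
    new_weights = [w + eta * delta_j * y               # correction, Eq. (14)
                   for w, y in zip(weights, y_in)]
    return new_weights, e_j


# Hypothetical example: zero initial weights, two inputs, target output 1.
w1, e1 = output_neuron_update([0.0, 0.0, 0.0], [1.0, 1.0], desired=1.0, eta=1.0)
w2, e2 = output_neuron_update(w1, [1.0, 1.0], desired=1.0, eta=1.0)
```

With zero weights the output is 0.5, so the error is 0.5 and the local gradient is 0.5 × 0.25 = 0.125; after the update the error for the same pattern is smaller.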
20 Case II: Neuron j is a hidden node
Let us consider the case in which neuron j is not in the output layer of the network but is located in the hidden layer immediately to the left of the output layer, as shown in the figure on the next slide. Note that the index j now refers to the hidden layer and the index k refers to the output layer. Also note that the desired response $d_k[n]$ is not directly available to hidden layer neurons.
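22 In this new situation, the local gradient will take the following form:
$$\delta_j[n] = -\frac{\partial E[n]}{\partial y_j[n]} \cdot \frac{\partial y_j[n]}{\partial u_j[n]} = -\frac{\partial E[n]}{\partial y_j[n]}\, f_j'(u_j[n]) \tag{16}$$
As neuron k is located in the output layer,
$$E[n] = \tfrac{1}{2} \sum_{k \in C} e_k^2[n] \tag{17}$$
which is simply (5) in which the index j has been replaced by the index k. Differentiating this equation with respect to $y_j[n]$ and using the chain rule, we obtain
$$\frac{\partial E[n]}{\partial y_j[n]} = \sum_{k} e_k[n]\, \frac{\partial e_k[n]}{\partial u_k[n]} \cdot \frac{\partial u_k[n]}{\partial y_j[n]} \tag{18}$$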
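23 Since
$$e_k[n] = d_k[n] - y_k[n] = d_k[n] - f_k(u_k[n]) \tag{19}$$
therefore,
$$\frac{\partial e_k[n]}{\partial u_k[n]} = -f_k'(u_k[n]) \tag{20}$$
The net input of neuron k is given by
$$u_k[n] = \sum_{j=0}^{q} w_{kj}[n]\, y_j[n] \tag{21}$$
where q is the total number of inputs applied to neuron k. Differentiating $u_k[n]$ with respect to $y_j[n]$, we have
$$\frac{\partial u_k[n]}{\partial y_j[n]} = w_{kj}[n] \tag{22}$$
Substituting (20) and (22) in (18) yields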
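24
$$\frac{\partial E[n]}{\partial y_j[n]} = -\sum_{k} e_k[n]\, f_k'(u_k[n])\, w_{kj}[n] = -\sum_{k} \delta_k[n]\, w_{kj}[n] \tag{23}$$
The local gradient $\delta_j[n]$ for the hidden neuron j can now be obtained by using (23) in (16):
$$\delta_j[n] = f_j'(u_j[n]) \sum_{k} \delta_k[n]\, w_{kj}[n] \tag{24}$$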
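The hidden-node result can be sketched for a network with one hidden layer. The code below assumes logistic activations throughout (so the derivative can be written as y(1 - y)), a threshold stored as the first weight of every neuron with fixed input -1, and hypothetical function names and weight values. The local gradients of the hidden neurons are formed from the weighted sum of the output-layer local gradients.

```python
import math


def logistic(u):
    return 1.0 / (1.0 + math.exp(-u))


def forward(w_hidden, w_out, x):
    """Return hidden-layer outputs and output-layer outputs."""
    x_ext = [-1.0] + list(x)
    y_hidden = [logistic(sum(w * v for w, v in zip(w_j, x_ext)))
                for w_j in w_hidden]
    h_ext = [-1.0] + y_hidden
    y_out = [logistic(sum(w * v for w, v in zip(w_k, h_ext)))
             for w_k in w_out]
    return y_hidden, y_out


def local_gradients(w_hidden, w_out, x, d):
    """Local gradients of the hidden neurons (delta_j) and output neurons (delta_k)."""
    y_hidden, y_out = forward(w_hidden, w_out, x)
    # Output node: error signal times activation derivative.
    delta_out = [(d_k - y_k) * y_k * (1.0 - y_k) for d_k, y_k in zip(d, y_out)]
    # Hidden node: activation derivative times the weighted sum of the
    # output-layer deltas. w_out[k][j + 1] links hidden neuron j to output k
    # (index 0 holds the threshold weight).
    delta_hidden = []
    for j, y_j in enumerate(y_hidden):
        back = sum(delta_out[k] * w_out[k][j + 1] for k in range(len(w_out)))
        delta_hidden.append(y_j * (1.0 - y_j) * back)
    return delta_hidden, delta_out


def error(w_hidden, w_out, x, d):
    """Instantaneous error: half the sum of squared output errors."""
    _, y_out = forward(w_hidden, w_out, x)
    return 0.5 * sum((d_k - y_k) ** 2 for d_k, y_k in zip(d, y_out))
```

A useful sanity check for any such implementation is to compare the analytic gradient, minus the local gradient times the input signal, against a finite-difference estimate of the change in `error` when one weight is perturbed slightly.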
25 Summary
1. The correction $\Delta w_{ji}[n]$ applied to the synaptic weight connecting neuron i to neuron j is defined as:
$$\Delta w_{ji}[n] = \eta\, \delta_j[n]\, y_i[n]$$
2. The local gradient $\delta_j[n]$ depends on whether neuron j is an output node or a hidden node:
(a) If neuron j is an output node, $\delta_j[n]$ equals the product of the derivative $f_j'(u_j[n])$ and the error signal $e_j[n]$, both of which are associated with neuron j.
(b) If neuron j is a hidden node, $\delta_j[n]$ equals the product of the associated derivative $f_j'(u_j[n])$ and the weighted sum of the $\delta$'s computed for the neurons in the next hidden or output layer that are connected to neuron j.
26 Improved Back-Propagation
The back-propagation algorithm derived above has some drawbacks. First of all, the learning parameter $\eta$ should be chosen to be small to provide minimization of the total error signal. However, for a small $\eta$ the learning process becomes very slow. On the other hand, large values of $\eta$ correspond to rapid learning, but lead to parasitic oscillations which prevent the algorithm from converging to the desired solution. Moreover, if the error function contains many local minima, the network might get trapped in some local minimum, or get stuck on a very flat plateau. One simple way to improve the standard back-propagation algorithm is to use an adaptive learning rate and momentum, as described below:
27 Momentum
Here the idea is to give the weights and biases some momentum so that they will not get stuck in local minima, but have enough energy to pass them. Mathematically, adding momentum is expressed as:
$$\Delta w_{ji}[n] = \alpha\, \Delta w_{ji}[n-1] + (1 - \alpha)\, \eta\, \delta_j[n]\, y_i[n] \tag{25}$$
where $\alpha$ is the momentum constant and must have a value between 0 and 1. If $\alpha$ is zero, the algorithm is the same as the basic back-propagation rule, which means no momentum. $\alpha$ equal to 1 means that the weights change exactly as they did in the preceding time step. A typical value of $\alpha$ is 0.9 to 0.95.
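The effect of the momentum constant can be sketched for a single weight. This example assumes the form in which the gradient term is scaled by one minus the momentum constant, since that form reproduces both limiting cases described above; many texts instead add the unscaled gradient term to the momentum term. The function name and the numbers are hypothetical.

```python
def momentum_step(prev_delta_w, gradient_term, alpha):
    """Weight correction with momentum for a single weight.

    prev_delta_w:  correction applied at the previous time step
    gradient_term: plain back-propagation correction (learning rate times
                   local gradient times input signal)
    alpha:         momentum constant, between 0 and 1
    Assumed form: alpha * previous + (1 - alpha) * gradient_term.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie between 0 and 1")
    return alpha * prev_delta_w + (1.0 - alpha) * gradient_term


no_momentum = momentum_step(0.3, 0.1, alpha=0.0)    # plain back-propagation
full_momentum = momentum_step(0.3, 0.1, alpha=1.0)  # repeats the previous change
typical = momentum_step(0.3, 0.1, alpha=0.9)        # mostly the previous change
```

With a momentum constant near 0.9, each correction is dominated by the previous one, which smooths out oscillations in the gradient from step to step.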
28 Adaptive Learning Rate
As mentioned earlier, it is difficult to choose an appropriate value of the learning rate $\eta$ for a particular application. The optimal value can change during training. Thus, this parameter should be updated as the training phase progresses. That is, the learning rate should be adaptive. One way of doing this is to change the learning rate according to the way in which the error function responded to the last change in weights. If a weight update decreased the error function, the weights were probably changed in the right direction, and $\eta$ is increased. On the other hand, if the error function increased, we reduce the value of $\eta$.
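One simple way to realise this rule is sketched below. The multiplicative factors 1.05 and 0.7, the function name, and the decision to leave the rate unchanged when the error does not move are illustrative assumptions, not values prescribed in these slides.

```python
def adapt_learning_rate(eta, previous_error, current_error,
                        increase=1.05, decrease=0.7):
    """Adjust the learning rate from the effect of the last weight update.

    If the error fell, the step was probably in the right direction, so the
    rate is increased; if the error rose, the rate is reduced.
    The factors `increase` and `decrease` are assumed, illustrative values.
    """
    if current_error < previous_error:
        return eta * increase
    if current_error > previous_error:
        return eta * decrease
    return eta


eta_after_good_step = adapt_learning_rate(0.1, previous_error=1.0, current_error=0.8)
eta_after_bad_step = adapt_learning_rate(0.1, previous_error=0.8, current_error=1.0)
```

Making the decrease factor more aggressive than the increase factor means that a single bad step undoes several good ones, which keeps the rate from growing into the oscillating regime.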