NEURAL NETWORKS Perceptron

Presentation on theme: "NEURAL NETWORKS Perceptron"— Presentation transcript:

NEURAL NETWORKS Perceptron
Dear students, todays topic is the perceptron. The perceptron is the simplest form of a neural network. It consists of a single neuron with adjustable synaptic weights and a hard limiter. The perceptron was introduced by McCulloch and Pitts in 1943 as an artificial neuron with a hard-limiting activation function, . Recently the term multilayer perceptron has often been used as a synonym for the term multilayer feedforward neural network. In this section we will be referring to the former meaning. We will talk about the capabilities of perceptron. Also the learning algorithm known as perveptron learning rule will be explained and illustrated by examples. PROF. DR. YUSUF OYSAL

NEURAL NETWORKS – Perceptron
The Perceptron The operation of Rosenblatt’s perceptron is based on the McCulloch and Pitts neuron model. The model consists of a linear combiner followed by a hard limiter. The weighted sum of the inputs is applied to the hard limiter. (Step or sign function) Linear Combiner Hard Limiter w1 Output Inputs Y The operation of Rosenblatt’s perceptron is based on the McCulloch and Pitts neuron model. The model consists of a linear combiner followed by a hard limiter. The weighted sum of the inputs is applied to the hard limiter. (Step or sign function) The neuron computes the weighted sum of the input signals and compares the result with a threshold value, . If the net input is less than the threshold, the neuron output is –1. But if the net input is greater than or equal to the threshold, the neuron becomes activated and its output attains a value +1. This type of activation function is called a sign function. w2 x2 Threshold

NEURAL NETWORKS – Perceptron
Exercise x1 x2 Y 1 A neuron uses a step function as its activation function, has threshold -1.5 and has w1 = 0.1, w2 = 0.1. What is the output with the following values of x1 and x2? Can you find weights that will give an OR function? An XOR function? A neuron uses a step function as its activation function, has threshold 0.2 and has w1 = 0.1, w2 = 0.1. What is the output with the following values of x1 and x2? The output is 0 for the first three input pairs, because magnitude of threshold is bigger than the summation, and input of the hard limit function will be negative for these three input pairs. For the last input pair (1, 1) , the output of the summation node will be positive, and then the output will be 1. This single layer perceptron network implements the AND operatör, as you see. Can you find weights that will give an or function? An XOR function? We will now talk about a systematic way to talk about these questions.

Can a single neuron learn a task?
NEURAL NETWORKS – Perceptron Can a single neuron learn a task? In 1958, Frank Rosenblatt introduced a training algorithm that provided the first procedure for training a simple ANN: a perceptron. The perceptron is the simplest form of a neural network. It consists of a single neuron with adjustable synaptic weights and a hard limiter. Linear Threshold Units (LTUs) : Perceptron Learning Rule Linearly Graded Units (LGUs) : Widrow-Hoff learning Rule How does the perceptron learn its classification tasks? This is done by making small adjustments in the weights to reduce the difference between the actual and desired outputs of the perceptron. The initial weights are randomly assigned, usually in the range [0.5, 0.5], and then updated to obtain the output consistent with the training examples. In 1958, Frank Rosenblatt introduced a training algorithm that provided the first procedure for training a simple ANN: a perceptron. The perceptron is the simplest form of a neural network. It consists of a single neuron with adjustable synaptic weights and a hard limiter. Linear Threshold Units (LTUs) : Perceptron Learning Rule Linearly Graded Units (LGUs) : Widrow-Hoff learning Rule How does the perceptron learn its classification tasks? This is done by making small adjustments in the weights to reduce the difference between the actual and desired outputs of the perceptron. The initial weights are randomly assigned, usually in the range [0.5, 0.5], and then updated to obtain the output consistent with the training examples.

Perceptron for Classification
NEURAL NETWORKS – Perceptron Perceptron for Classification We try to find suitable values for the weights in such a way that the training examples are correctly classified. A single perceptron classifies input patterns, x, into two classes. The input patterns that can be classified by a single perceptron into two distinct classes are called linearly separable patterns. Geometrically, we try to find a hyper-plane that separates the examples of the two classes. In order for the perceptron to perform a desired task, its weights must be properly selected. In general two basic methods can be employed to select a suitable weight vector: By off-line calculation of weights. If the problem is relatively simple it is often possible to calculate the weight vector from the specification of the problem. By learning procedure. The weight vector is determined from a given (training) set of input-output vectors (exemplars) in such a way to achieve the best classification of the training vectors. Once the weights are selected (the encoding process), the perceptron performs its desired task (e.g., pattern classification) (the decoding process). We try to find suitable values for the weights in such a way that the training examples are correctly classified. A single perceptron classifies input patterns, x, into two classes. The input patterns that can be classified by a single perceptron into two distinct classes are called linearly separable patterns. Geometrically, we try to find a hyper-plane that separates the examples of the two classes. In order for the perceptron to perform a desired task, its weights must be properly selected. In general two basic methods can be employed to select a suitable weight vector: By off-line calculation of weights. If the problem is relatively simple it is often possible to calculate the weight vector from the specification of the problem. By learning procedure. The weight vector is determined from a given (training) set of input-output vectors (exemplars) in such a way to achieve the best classification of the training vectors. Once the weights are selected (the encoding process), the perceptron performs its desired task (e.g., pattern classification) (the decoding process).

NEURAL NETWORKS – Perceptron
Learning Rules Linear Threshold Units (LTUs) : Perceptron Learning Rule Linearly Graded Units (LGUs) : Widrow-Hoff learning Rule Training Set x1 x2 xm= 1 y1 y2 yn xm-1 w11 w12 w1m w21 w22 w2m wn1 wnm wn2 Goal: d1 d2 dn The learning rules in general have a goal which makes the neural network output converges to the actual output values. We have a desired input output pairs that are used for traininig. The learning process continues when the network produces outputs very close to actual outputs with small errors.

Perceptron Geometric View
NEURAL NETWORKS – Perceptron Perceptron Geometric View The equation below describes a (hyper-)plane in the input space consisting of real valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class. decision region for C1 x2 decision boundary C1 x1 The equation below describes a (hyper-)plane in the input space consisting of real valued m-dimensional vectors. The plane splits the input space into two regions, each of them describing one class. A linear combination of signals and weights for which the augmented activation potential is zero, in otherwords total electrical signals going through activation function is zero,, describes a decision surface which partitions the input space into two regions. C2 decision region for C2 w1x1 + w2x2 + w0 = 0

Implementing Logic Gates with Single Layer Perceptron
NEURAL NETWORKS – Perceptron Implementing Logic Gates with Single Layer Perceptron In each case we have inputs x and outputs y, and need to determine the weights and thresholds. It is easy to find solutions by inspection: NOT X y 0 1 1 0 AND x1 x2 y OR NOT AND OR 1 1 1 We can use McCulloch-Pitts neurons to implement the basic logic gates. All we need to do is find the appropriate connection weights and neuron thresholds to produce the right outputs for each set of inputs. We shall see explicitly how one can construct simple networks that perform NOT, AND, and OR. It is then a well known result from logic that we can construct any logical function from these three operations. The resulting networks, however, will usually have a much more complex architecture than a simple Perceptron. We generally want to avoid decomposing complex problems into simple logic gates, by finding the weights and thresholds that work directly in a Perceptron architecture. It is easy to see that there are an infinite number of solutions. Similarly, there are an infinite number of solutions for the NOT, AND and OR networks. w = -1  = 0.5 w1 = 1 w2 = 1  = -1.5 w1 = 1 w2 = 1  = -0.5

Limitations of Simple Perceptrons
NEURAL NETWORKS – Perceptron Limitations of Simple Perceptrons We can follow the same procedure for the XOR network: w1 0 + w2 0 +  < 0  > 0 w1 0 + w2 1 +   0 w2   w1 1 + w2 0 +   0 w1   w1 1 + w2 1 +  < 0 w1 + w2 <  Clearly the second and third inequalities are incompatible with the fourth, so there is in fact no solution. We need more complex networks, e.g. that combine together many simple networks, or use different activation/thresholding/transfer functions. It then becomes much more difficult to determine all the weights and thresholds by hand. In the next slides we will see how a neural network can learn these parameters. First, we need to consider what these more complex networks might involve. XOR XOR x1 x2 y 1 We can follow the same procedure for the XOR network: w1 times 0 + w2 times 0 + teta is maller than 0 which means teta is bigger than 0. w1 times 0 + w2 times 1 + teta is greater than or equal to 0 which means w2 is greater than or equal to teta. w1 times 1 + w2 times 0 + teta is greater than or equal to 0 which means w1 is greater than or equal to teta. And finally w1 times 1 + w2 times 1 + teta is smaller than 0 which means w1 + w2 is smaller than teta. Clearly the second and third inequalities are incompatible with the fourth, so there is in fact no solution. We need more complex networks, e.g. that combine together many simple networks, or use different activation/thresholding/transfer functions. It then becomes much more difficult to determine all the weights and thresholds by hand. In the next slides we will see how a neural network can learn these parameters. First, we need to consider what these more complex networks might involve. Linearly Non-Separable

Decision Hyperplanes and Linear Separability
NEURAL NETWORKS – Perceptron Decision Hyperplanes and Linear Separability If we have two inputs, then the weights define a decision boundary that is a one dimensional straight line in the two dimensional input space of possible input values. If we have n inputs, the weights define a decision boundary that is an n-1 dimensional hyperplane in the n dimensional input space: w1x1 + w2x wnxn +  = 0 This hyperplane is clearly still linear (i.e. straight/flat) and can still only divide the space into two regions. We still need more complex transfer functions, or more complex networks, to deal with XOR type problems. Problems with input patterns which can be classified using a single hyperplane are said to be linearly separable. Problems (such as XOR) which cannot be classified in this way are said to be non-linearly separable. If we have two inputs, then the weights define a decision boundary that is a one dimensional straight line in the two dimensional input space of possible input values. If we have n inputs, the weights define a decision boundary that is an n-1 dimensional hyperplane in the n dimensional input space: w1x1 + w2x wnxn +  = 0 This hyperplane is clearly still linear (i.e. straight/flat) and can still only divide the space into two regions. We still need more complex transfer functions, or more complex networks, to deal with XOR type problems. Problems with input patterns which can be classified using a single hyperplane are said to be linearly separable. Problems (such as XOR) which cannot be classified in this way are said to be non-linearly separable.

Learning and Generalization
NEURAL NETWORKS – Perceptron Learning and Generalization A network will also produce outputs for input patterns that it was not originally set up to classify (shown with question marks). There are two important aspects of the network’s operation to consider: Learning The network must learn decision surfaces from a set of training patterns so that these training patterns are classified correctly. Generalization After training, the network must also be able to generalize, i.e. correctly classify test patterns it has never seen before. Often we wish the network to learn perfectly, and then generalize well. Sometimes, the training data may contain errors (e.g. noise in the experimental determination of the input values, or incorrect classifications). In this case, learning the training data perfectly may make the generalization worse. There is an important tradeoff between learning and generalization that arises quite generally. A network will also produce outputs for input patterns that it was not originally set up to classify (shown with question marks). There are two important aspects of the network’s operation to consider: Learning The network must learn decision surfaces from a set of training patterns so that these training patterns are classified correctly. Generalization After training, the network must also be able to generalize, i.e. correctly classify test patterns it has never seen before. Often we wish the network to learn perfectly, and then generalize well. Sometimes, the training data may contain errors (e.g. noise in the experimental determination of the input values, or incorrect classifications). In this case, learning the training data perfectly may make the generalization worse. There is an important tradeoff between learning and generalization that arises quite generally.

Generalization in Classification
NEURAL NETWORKS – Perceptron Generalization in Classification Suppose the task of our network is to learn a classification decision boundary: Our aim is for the network to generalize to classify new inputs appropriately. If we know that the training data contains noise, we don’t necessarily want the training data to be classified totally accurately as that is likely to reduce the generalization ability. Suppose the task of our network is to learn a classification decision boundary: Our aim is for the network to generalize to classify new inputs appropriately. If we know that the training data contains noise, we don’t necessarily want the training data to be classified totally accurately as that is likely to reduce the generalization ability. We can expect the network output to give a better representation of the underlying function if its output curve does not pass through all the data points. Again, allowing a larger error on the training data is likely to lead to better generalization.

Training a Neural Network
NEURAL NETWORKS – Perceptron Training a Neural Network Whether our neural network is a simple Perceptron, or a much more complicated multilayer network with special activation functions, we need to develop a systematic procedure for determining appropriate connection weights. The general procedure is to have the network learn the appropriate weights from a representative set of training data. In all but the simplest cases, however, direct computation of the weights is intractable. Instead, we usually start off with random initial weights and adjust them in small steps until the required outputs are produced. We will now look at a brute force derivation of such an iterative learning algorithm for simple Perceptrons. Then, in later lectures, we shall see how more powerful and general techniques can easily lead to learning algorithms which will work for neural networks of any specification we could possibly dream up. Whether our neural network is a simple Perceptron, or a much more complicated multilayer network with special activation functions, we need to develop a systematic procedure for determining appropriate connection weights. The general procedure is to have the network learn the appropriate weights from a representative set of training data. In all but the simplest cases, however, direct computation of the weights is intractable. Instead, we usually start off with random initial weights and adjust them in small steps until the required outputs are produced. We will now look at a brute force derivation of such an iterative learning algorithm for simple Perceptrons. Then, in later lectures, we shall see how more powerful and general techniques can easily lead to learning algorithms which will work for neural networks of any specification we could possibly dream up.

Wnew = Wold + w Perceptron Learning Learning Rate Error (d  y) Input
NEURAL NETWORKS – Perceptron Perceptron Learning Wnew = Wold + w Input Error (d  y) Learning Rate For simple Perceptrons performing classification, we have seen that the decision boundaries are hyperplanes, and we can think of learning as the process of shifting around the hyperplanes until each training pattern is classified correctly. Somehow, we need to formalise that process of “shifting around” into a systematic algorithm that can easily be implemented on a computer. The “shifting around” can conveniently be split up into a number of small steps. If the network weights at time t are wij(t) , then the shifting process corresponds to moving them by an amount Delta wij(t) so that at time t+1 we have weights wij (t +1) = wij (t) + Delta wij (t) It is convenient to treat the thresholds as weights as discussed previously, so we don’t need separate equations for them. It has become clear that each case can be written in the form wij new is equal to wij old +  times (target (d) – output (y)) times input This weight update equation is called the Perceptron Learning Rule. The positive parameter  is called the learning rate or step size – it determines how smoothly we shift the decision boundaries.

Convergence of Perceptron Learning
NEURAL NETWORKS – Perceptron Converge? Convergence of Perceptron Learning x y . d + The weight changes wij need to be applied repeatedly – for each weight wij in the network, and for each training pattern in the training set. One pass through all the weights for the whole training set is called one epoch of training. Eventually, when all the network outputs match the targets for all the training patterns, all the wij will be zero and the process of training will cease. We then say that the training process has converged to a solution. It can be shown that if there does exist a possible set of weights for a Perceptron which solves the given problem correctly, then the Perceptron Learning Rule will find them in a finite number of iterations Moreover, it can be shown that if a problem is linearly separable, then the Perceptron Learning Rule will find a set of weights in a finite number of iterations that solves the problem correctly. If the given training set is linearly separable, the learning process will converge in a finite number of steps.

The Learning Scenario NEURAL NETWORKS – Perceptron
Learning is a recursive procedure of modifying weights from a given set of input-output patterns. For a single perceptron, the objective of the learning (encoding) procedure is to find the decision plane, (that is, the related weight vector), which separates two classes of given input-output training vectors. Once the learning is finished, every input vector will be classified into an appropriate class. A single perceptron can classify only the linearly separable patterns. The perceptron learning procedure is an example of a supervised error-correcting learning law.

The Learning Scenario NEURAL NETWORKS – Perceptron
Like this Figure, we can identify a current weight vector, w(n), the next weight vector, w(n + 1), and the correct weight vector, w. Related decision planes are orthogonal to these vectors and are depicted as straight lines. During the learning process the current weight vector w(n) is modified in the direction of the current input vector x(n), if the input pattern is misclassified, that is, if the error is non-zero.

The Learning Scenario NEURAL NETWORKS – Perceptron
You see the changes in weight vectors in each iterations.

NEURAL NETWORKS – Perceptron
The Learning Scenario

NEURAL NETWORKS – Perceptron
The Learning Scenario

The Learning Scenario w4 = w3

The Learning Scenario Presenting the perceptron with enough training vectors, the weight vector w(n) will tend to the correct value w. The demonstration is in augmented space. Conceptually, in augmented space, we adjust the weight vector to fit the data. Rosenblatt proved that if input patterns are linearly separable, then the perceptron learning law converges, and the hyperplane separating two classes of input patterns can be determined. This example demonstrated the perceptron learning law. In the example we demonstrated three main steps typical to most of the neural network applications, namely: Preparation of the training patterns. Normally, the training patterns are specific to the particular application. In our example we generate a set of vectors and partition this set with a plane into two linearly separable classes. The set of vectors is split into two halves, one used as a training set, the other as a validation set. Learning. Using the training set, we apply the perceptron learning law as described. Validation. Using the validation set we verify the quality of the learning process.