# September 21, 2010: Neural Networks Lecture 5: The Perceptron


Slide 1: Supervised Function Approximation

In supervised learning, we train an ANN with a set of vector pairs, so-called exemplars. Each pair (x, y) consists of an input vector x and a corresponding output vector y. Whenever the network receives input x, we would like it to provide output y. The exemplars thus describe the function that we want to “teach” our network. Besides learning the exemplars, we would like our network to generalize, that is, to give plausible output for inputs it was not trained on.

Slide 2: Supervised Function Approximation

There is a tradeoff between a network’s ability to learn the given exemplars precisely and its ability to generalize (i.e., to interpolate and extrapolate). This problem is similar to fitting a function to a given set of data points. Let us assume that you want to find a fitting function $f: \mathbb{R} \to \mathbb{R}$ for a set of three data points. You try to do this with polynomials of degree one (a straight line), two, and nine.

Slide 3: Supervised Function Approximation

Obviously, the polynomial of degree 2 provides the most plausible fit.

[Figure: plot of f(x) against x showing the fitted polynomials of degree 1, degree 2, and degree 9.]
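The fitting analogy can be tried out numerically. The following is a minimal sketch, not the lecture's own example: it assumes ten hypothetical noisy samples of $y = x^2$ (so that a degree-9 polynomial passes through every point), fits degrees 1 and 2 by least squares, and builds the degree-9 fit by Lagrange interpolation. All names (`fit_poly`, `interpolate`, `train_error`, `off_sample_error`) are illustrative.

```python
# Hypothetical data: ten noisy samples of y = x^2 on [-1, 1] (not from the slides).
xs = [-1 + 2 * n / 9 for n in range(10)]
noise = [0.05, -0.04, 0.03, -0.05, 0.02, -0.02, 0.04, -0.03, 0.05, -0.04]
ys = [x * x + e for x, e in zip(xs, noise)]


def fit_poly(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (fine for small degrees)."""
    n = degree + 1
    # Augmented matrix [A^T A | A^T y] for the monomial basis 1, x, x^2, ...
    m = [[sum(x ** (r + c) for x in xs) for c in range(n)]
         + [sum(y * x ** r for x, y in zip(xs, ys))] for r in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    coeffs = [0.0] * n
    for r in reversed(range(n)):  # back substitution
        s = sum(m[r][c] * coeffs[c] for c in range(r + 1, n))
        coeffs[r] = (m[r][n] - s) / m[r][r]
    return coeffs  # coeffs[p] multiplies x**p


def eval_poly(coeffs, x):
    return sum(c * x ** p for p, c in enumerate(coeffs))


def interpolate(xs, ys, x):
    """Value at x of the degree-9 polynomial through all ten points (Lagrange form)."""
    total = 0.0
    for j, (xj, yj) in enumerate(zip(xs, ys)):
        term = yj
        for k, xk in enumerate(xs):
            if k != j:
                term *= (x - xk) / (xj - xk)
        total += term
    return total


def mse(predict):
    """Mean squared error of a fit on the training samples."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


line = fit_poly(xs, ys, 1)
parabola = fit_poly(xs, ys, 2)
fits = {
    1: lambda x: eval_poly(line, x),
    2: lambda x: eval_poly(parabola, x),
    9: lambda x: interpolate(xs, ys, x),
}
train_error = {deg: mse(f) for deg, f in fits.items()}

# Off-sample check: distance from the noise-free curve at a point between samples.
x_new = 0.95
off_sample_error = {deg: abs(f(x_new) - x_new ** 2) for deg, f in fits.items()}

print("training MSE:", train_error)
print("error at x = 0.95:", off_sample_error)
```

The degree-9 fit reproduces the training samples exactly, while the straight line cannot follow the curvature; how far each fit strays between the samples depends on the data, which is why the off-sample error is printed rather than assumed.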

Slide 4: Supervised Function Approximation

The same principle applies to ANNs:

- If an ANN has too few neurons, it may not have enough degrees of freedom to precisely approximate the desired function.
- If an ANN has too many neurons, it will learn the exemplars perfectly, but its additional degrees of freedom may cause it to show implausible behavior for untrained inputs; it then generalizes poorly.

Unfortunately, there are no known equations that could tell you the optimal size of your network for a given application; there are only heuristics.

Slide 5: Evaluation of Networks

- Basic idea: define an error function and measure the error for untrained data (the testing set).
- Typical: [error formula not preserved in the transcript], where $d$ is the desired output and $o$ is the actual output.
- For classification: $E$ = number of misclassified samples / total number of samples.
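As a rough illustration of these two error measures: the slide's "typical" formula did not survive in the transcript, so the sum of squared differences used below is an assumption (a common choice), not necessarily the lecture's definition. The classification error follows the ratio given on the slide. Function names are illustrative.

```python
def sum_squared_error(desired, actual):
    """Assumed 'typical' error: sum over outputs of (d - o)^2."""
    return sum((d - o) ** 2 for d, o in zip(desired, actual))


def misclassification_rate(desired_labels, predicted_labels):
    """E = number of misclassified samples / total number of samples."""
    wrong = sum(1 for d, o in zip(desired_labels, predicted_labels) if d != o)
    return wrong / len(desired_labels)


# Both measures should be computed on a testing set the network was not trained on.
print(sum_squared_error([1.0, 0.0], [0.5, 0.5]))               # 0.25 + 0.25
print(misclassification_rate([1, -1, 1, -1], [1, 1, 1, -1]))   # one of four is wrong
```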

Slide 6: The Perceptron

[Figure: a perceptron (unit i) with inputs $x_1, x_2, \dots, x_n$, weights $w_1, w_2, \dots, w_n$, a net input signal, a threshold $\theta$, and output $f(x_1, x_2, \dots, x_n)$.]

Slide 7: The Perceptron

[Figure: the same perceptron (unit i) with an additional input $x_0 \equiv 1$ weighted by $w_0$, and the threshold set to 0.]

$w_0$ corresponds to $-\theta$. Here, only the weight vector is adaptable, but not the threshold.

Slide 8: Perceptron Computation

Similar to a TLU, a perceptron divides its n-dimensional input space by an (n-1)-dimensional hyperplane defined by the equation

$$w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n = 0.$$

For $w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n > 0$, its output is 1, and for $w_0 + w_1 x_1 + w_2 x_2 + \dots + w_n x_n \le 0$, its output is -1.

With the right weight vector $(w_0, \dots, w_n)^T$, a single perceptron can compute any linearly separable function. We are now going to look at an algorithm that determines such a weight vector for a given function.
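The output rule can be written down directly. This is a small sketch; the function name and the AND weights are illustrative choices, not taken from the lecture.

```python
def perceptron_output(weights, inputs):
    """weights = [w0, w1, ..., wn]; inputs = [x1, ..., xn] (x0 = 1 is implied)."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1


# One weight vector that computes logical AND on inputs in {0, 1}:
# the hyperplane -1.5 + x1 + x2 = 0 separates (1, 1) from the other three points.
and_weights = [-1.5, 1.0, 1.0]
for point in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(point, perceptron_output(and_weights, point))
```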

Slide 9: Perceptron Training Algorithm

Algorithm Perceptron:

1. Start with a randomly chosen weight vector $w_0$.
2. Let $k = 1$.
3. While there exist input vectors that are misclassified by $w_{k-1}$, do:
   - Let $i_j$ be a misclassified input vector.
   - Let $x_k = \text{class}(i_j)\, i_j$, implying that $w_{k-1} \cdot x_k < 0$.
   - Update the weight vector to $w_k = w_{k-1} + \eta x_k$.
   - Increment $k$.
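The algorithm translates almost line by line into code. The following is a minimal sketch under a few assumptions that are not on the slide: the first misclassified sample is always the one picked, a `max_updates` cap guards against non-separable data, and the random start uses a fixed seed for repeatability. The AND data set is only an example.

```python
import random


def perceptron_output(weights, inputs):
    """Output 1 if w0 + w1*x1 + ... + wn*xn > 0, else -1."""
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1


def train_perceptron(samples, w0, eta=1.0, max_updates=10_000):
    """samples: list of (input_vector, class) pairs with class in {-1, +1}."""
    w = list(w0)
    for updates in range(max_updates + 1):
        misclassified = [(i, c) for i, c in samples if perceptron_output(w, i) != c]
        if not misclassified:
            return w, updates              # every sample is classified correctly
        if updates == max_updates:
            break
        i, c = misclassified[0]            # assumption: take the first one found
        x = [c] + [c * v for v in i]       # x_k = class(i_j) * i_j, with x0 = 1
        w = [wj + eta * xj for wj, xj in zip(w, x)]   # w_k = w_{k-1} + eta * x_k
    raise RuntimeError("no separating weight vector found within max_updates")


# Example: logical AND on inputs in {0, 1}, which is linearly separable.
samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
rng = random.Random(0)                     # fixed seed, for repeatability only
w0 = [rng.uniform(-1, 1) for _ in range(3)]
w, updates = train_perceptron(samples, w0)
print("weights:", w, "updates:", updates)
```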

Slide 10: Perceptron Training Algorithm

For example, for some input $i$ with $\text{class}(i) = -1$: if $w \cdot i > 0$, then we have a misclassification. The weight vector then needs to be modified to $w + \Delta w$ with $(w + \Delta w) \cdot i < w \cdot i$ to possibly improve classification.

We can choose $\Delta w = -\eta i$, because

$$(w + \Delta w) \cdot i = (w - \eta i) \cdot i = w \cdot i - \eta\, i \cdot i < w \cdot i,$$

and $i \cdot i$ is the square of the length of vector $i$ and is thus positive.

If $\text{class}(i) = 1$, things are the same but with opposite signs; we introduce $x$ to unify these two cases.
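A small numeric check of this argument, with made-up values for $w$, $i$, and $\eta$ (none of them from the lecture):

```python
def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


eta = 0.5
w = [1.0, 2.0]
i = [2.0, 1.0]                      # class(i) = -1, but w . i = 4 > 0: misclassified

before = dot(w, i)                  # 4.0
delta_w = [-eta * v for v in i]     # delta w = -eta * i
w_new = [a + b for a, b in zip(w, delta_w)]
after = dot(w_new, i)               # 4.0 - eta * (i . i) = 4.0 - 2.5 = 1.5

print(before, after)
```

Here the dot product drops by exactly $\eta\, i \cdot i$ but is still positive, so the sample remains misclassified after one step; this is why the slide says the update only "possibly" improves classification.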

Slide 11: Learning Rate and Termination

- Terminate when all samples are correctly classified.
- If the number of misclassified samples has not changed in a large number of steps, the problem could be the choice of learning rate $\eta$:
  - If $\eta$ is too large, classification may just be swinging back and forth and take a long time to reach the solution.
  - On the other hand, if $\eta$ is too small, changes in classification can be extremely slow.
- If changing $\eta$ does not help, the samples may not be linearly separable, and training should terminate.
- If it is known that there will be a minimum number of misclassifications, train until that number is reached.
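One way to turn these termination rules into code is a "patience" counter. This is a sketch of a variant, not the lecture's procedure: it stops when the number of misclassified samples has not improved on its best value for `patience` consecutive steps (a plain "has not changed" test could loop forever if the count oscillates). The names, the zero start vector, and the parameter values are assumptions.

```python
def perceptron_output(weights, inputs):
    net = weights[0] + sum(w * x for w, x in zip(weights[1:], inputs))
    return 1 if net > 0 else -1


def train_with_patience(samples, w0, eta=0.5, patience=50):
    """Return (weights, status), where status is 'converged' or 'stalled'."""
    w = list(w0)
    best = len(samples) + 1          # best (lowest) misclassification count so far
    since_improvement = 0
    while True:
        wrong = [(i, c) for i, c in samples if perceptron_output(w, i) != c]
        if not wrong:
            return w, "converged"    # all samples correctly classified
        if len(wrong) < best:
            best, since_improvement = len(wrong), 0
        else:
            since_improvement += 1
            if since_improvement >= patience:
                return w, "stalled"  # try another eta, or suspect non-separability
        i, c = wrong[0]
        w = [w[0] + eta * c] + [wj + eta * c * v for wj, v in zip(w[1:], i)]


and_samples = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
xor_samples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]

_, and_status = train_with_patience(and_samples, [0.0, 0.0, 0.0])
_, xor_status = train_with_patience(xor_samples, [0.0, 0.0, 0.0])
print("AND:", and_status)   # separable
print("XOR:", xor_status)   # not linearly separable, so it can never converge
```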

Slide 12: Guarantee of Success: Novikoff (1963)

Theorem 2.1: Given training samples from two linearly separable classes, the perceptron training algorithm terminates after a finite number of steps and correctly classifies all elements of the training set, irrespective of the initial random non-zero weight vector $w_0$.

Let $w_k$ be the current weight vector. We need to prove that there is an upper bound on $k$.

Slide 13: Guarantee of Success: Novikoff (1963)

Proof: Assume $\eta = 1$, without loss of generality. After $k$ steps of the learning algorithm, the current weight vector is

$$w_k = w_0 + x_1 + x_2 + \dots + x_k. \tag{2.1}$$

Since the two classes are linearly separable, there must be a vector of weights $w^*$ that correctly classifies them, that is, $\text{sgn}(w^* \cdot i_k) = \text{class}(i_k)$.

Taking the dot product of each side of eq. 2.1 with $w^*$, we get:

$$w^* \cdot w_k = w^* \cdot w_0 + w^* \cdot x_1 + w^* \cdot x_2 + \dots + w^* \cdot x_k.$$

Slide 14: Guarantee of Success: Novikoff (1963)

$$w^* \cdot w_k = w^* \cdot w_0 + w^* \cdot x_1 + w^* \cdot x_2 + \dots + w^* \cdot x_k.$$

For each input vector $i_j$, the dot product $w^* \cdot i_j$ has the same sign as $\text{class}(i_j)$. Since the corresponding element of the training sequence is $x = \text{class}(i_j)\, i_j$, we can be assured that $w^* \cdot x = w^* \cdot (\text{class}(i_j)\, i_j) > 0$.

Therefore, there exists a $\delta > 0$ such that $w^* \cdot x_i > \delta$ for every member $x_i$ of the training sequence. Hence:

$$w^* \cdot w_k > w^* \cdot w_0 + k\delta. \tag{2.2}$$

Slide 15: Guarantee of Success: Novikoff (1963)

$$w^* \cdot w_k > w^* \cdot w_0 + k\delta. \tag{2.2}$$

By the Cauchy-Schwarz inequality:

$$|w^* \cdot w_k|^2 \le \|w^*\|^2 \, \|w_k\|^2. \tag{2.3}$$

We may assume that $\|w^*\| = 1$, since the unit-length vector $w^*/\|w^*\|$ also correctly classifies the same samples. Using this assumption and eqs. 2.2 and 2.3, we obtain a lower bound for the square of the length of $w_k$ (for $k$ large enough that $w^* \cdot w_0 + k\delta > 0$, so that squaring preserves the inequality):

$$\|w_k\|^2 > (w^* \cdot w_0 + k\delta)^2. \tag{2.4}$$

Slide 16: Guarantee of Success: Novikoff (1963)

Since $w_j = w_{j-1} + x_j$, the following upper bound can be obtained for this vector’s squared length:

$$\|w_j\|^2 = w_j \cdot w_j = w_{j-1} \cdot w_{j-1} + 2\, w_{j-1} \cdot x_j + x_j \cdot x_j = \|w_{j-1}\|^2 + 2\, w_{j-1} \cdot x_j + \|x_j\|^2.$$

Since $w_{j-1} \cdot x_j < 0$ whenever a weight change is required by the algorithm, we have:

$$\|w_j\|^2 - \|w_{j-1}\|^2 < \|x_j\|^2.$$

Summation of the above inequalities over $j = 1, \dots, k$ gives an upper bound:

$$\|w_k\|^2 - \|w_0\|^2 < k \max_j \|x_j\|^2.$$

Slide 17: Guarantee of Success: Novikoff (1963)

$$\|w_k\|^2 - \|w_0\|^2 < k \max_j \|x_j\|^2$$

Combining this with inequality 2.4,

$$\|w_k\|^2 > (w^* \cdot w_0 + k\delta)^2, \tag{2.4}$$

gives us:

$$(w^* \cdot w_0 + k\delta)^2 < \|w_k\|^2 < \|w_0\|^2 + k \max_j \|x_j\|^2.$$

Now the lower bound of $\|w_k\|^2$ increases at the rate of $k^2$, and its upper bound increases at the rate of $k$. Therefore, there must be a finite value of $k$ such that:

$$(w^* \cdot w_0 + k\delta)^2 > \|w_0\|^2 + k \max_j \|x_j\|^2.$$

Such a $k$ would contradict the combined inequality above, so the algorithm can never reach it. This means that $k$ cannot increase without bound, so the algorithm must eventually terminate.
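The bound can be checked on a toy problem. The sketch below makes assumptions that go beyond the slides: it starts from $w_0 = 0$ (the theorem is stated for a non-zero start, but the zero vector makes the inequality collapse to $k \le \max_j \|x_j\|^2 / \delta^2$), it uses $\eta = 1$, it writes $\delta$ for the smallest value of $w^* \cdot x$ over the training sequence, and it uses a hand-picked separating vector $w^*$ for logical AND.

```python
import math


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


# Logical AND; each input is augmented with x0 = 1 so that w0 acts as the bias weight.
samples = [((1, 0, 0), -1), ((1, 0, 1), -1), ((1, 1, 0), -1), ((1, 1, 1), 1)]
xs = [tuple(c * v for v in i) for i, c in samples]     # x = class(i) * i

w = [0.0, 0.0, 0.0]        # assumption: zero start, so that w* . w0 = ||w0|| = 0
k = 0                      # number of weight updates
while True:
    wrong = [x for (i, c), x in zip(samples, xs)
             if (1 if dot(w, i) > 0 else -1) != c]
    if not wrong:
        break
    w = [wj + xj for wj, xj in zip(w, wrong[0])]       # eta = 1
    k += 1

# A hand-picked separating vector, normalised to unit length as in the proof.
w_star = [-1.5, 1.0, 1.0]
norm = math.sqrt(dot(w_star, w_star))
w_star = [v / norm for v in w_star]

delta = min(dot(w_star, x) for x in xs)    # smallest margin of w* on the samples
r_squared = max(dot(x, x) for x in xs)     # max ||x_j||^2
bound = r_squared / delta ** 2

print("updates k =", k, " bound =", bound)
```

The bound depends on which $w^*$ is chosen; a separating vector with a larger margin $\delta$ gives a tighter limit on the number of updates.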
