The Perceptron.


Prehistory
W.S. McCulloch & W. Pitts (1943). “A logical calculus of the ideas immanent in nervous activity”, Bulletin of Mathematical Biophysics, 5, 115-137.
This seminal paper pointed out that simple artificial “neurons” could be made to perform basic logical operations such as AND, OR and NOT.

Truth Table for Logical AND
x  y  x & y
0  0  0
0  1  0
1  0  0
1  1  1

A unit computing AND: the inputs x, y and a constant 1 are multiplied by the weights +1, +1 and -2, giving sum = x + y - 2. The output is 0 if sum < 0, else 1.

Nervous Systems as Logical Circuits
Groups of these “neuronal” logic gates could carry out any computation, even though each neuron was very limited.
Could computers built from these simple units reproduce the computational power of biological brains?
Were biological neurons performing logical operations?

Truth Table for Logical OR
x  y  x | y
0  0  0
0  1  1
1  0  1
1  1  1

A unit computing OR: the inputs x, y and a constant 1 are multiplied by the weights +1, +1 and -1, giving sum = x + y - 1. The output is 0 if sum < 0, else 1.
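The two threshold units described on these slides can be sketched in a few lines of Python. This is only an illustration: the function names (`threshold_unit`, `logical_and`, `logical_or`) are made up for this example, while the weights and the "0 if sum < 0, else 1" rule are the ones given above.

```python
def threshold_unit(inputs, weights, bias_weight):
    """Weighted sum of the inputs plus a constant-1 input times bias_weight.

    Returns 0 if the sum is negative, else 1 (the rule from the slides).
    """
    total = sum(x * w for x, w in zip(inputs, weights)) + 1 * bias_weight
    return 0 if total < 0 else 1


def logical_and(x, y):
    # Weights +1, +1 and bias weight -2: sum = x + y - 2
    return threshold_unit((x, y), (1, 1), -2)


def logical_or(x, y):
    # Weights +1, +1 and bias weight -1: sum = x + y - 1
    return threshold_unit((x, y), (1, 1), -1)


if __name__ == "__main__":
    print("x y AND OR")
    for x in (0, 1):
        for y in (0, 1):
            print(x, y, logical_and(x, y), logical_or(x, y))
```

Only the bias weight differs between the two gates: it sets how many inputs must be on before the sum reaches 0.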

The Perceptron
Frank Rosenblatt (1962). Principles of Neurodynamics, Spartan, New York, NY.
Subsequent progress was driven by the invention of learning rules inspired by ideas from neuroscience… Rosenblatt’s Perceptron could automatically learn to categorise or classify input vectors into types.
[Figure: inputs $x_i$ are multiplied by weights $w_i$ and summed, $\sum_i x_i w_i$, to produce the output.]
It obeyed the following rule: if the sum of the weighted inputs exceeds a threshold, output 1, else output -1.
output = 1 if $\sum_i \mathrm{input}_i \cdot \mathrm{weight}_i > \mathrm{threshold}$
output = -1 if $\sum_i \mathrm{input}_i \cdot \mathrm{weight}_i < \mathrm{threshold}$

Classifier
Consider a network as a classifier.
Network parameters are adapted so that it discriminates between classes.
For m classes, the classifier partitions the feature space into m decision regions.
The line or curve separating the classes is the decision boundary. In more than 2 dimensions this is a surface (e.g., a hyperplane).
For 2 classes we can view the net output as a discriminant function y(x, w), where:
y(x, w) = 1 if x is in C1
y(x, w) = -1 if x is in C2
We need some training data with known classes to generate an error function for the network.
We need a (supervised) learning algorithm to adjust the weights.

Linear discriminant functions
A linear discriminant function is a mapping which partitions feature space using a linear function (a straight line, or a hyperplane).
Thus in 2 dimensions the decision boundary is a straight line.
Simple form of classifier: “separate the two classes using a straight line in feature space”.

The Perceptron as a Classifier
For d-dimensional data the perceptron consists of d weights, a bias and a thresholding activation function. For 2D data we have:
1. Weighted sum of the inputs: $a = w_0 + w_1 x_1 + w_2 x_2$
2. Pass through a Heaviside-style threshold function: $g(a) = -1$ if $a < 0$, $g(a) = +1$ if $a \ge 0$
Output = class decision: $y = g(a) \in \{-1, +1\}$
[Figure: inputs x1 and x2 with weights w1 and w2, plus a constant input 1 with weight w0, feed a single unit.]
View the bias as another weight from an input which is constantly on.
If we group the weights as a vector w, the net output y is given by: $y = g(w \cdot x + w_0)$

Interpretation of weights
Since the threshold function switches at 0, the decision boundary is at:
$a = w \cdot x + w_0 = w_0 + w_1 x_1 + w_2 x_2 = 0$
Rearranging we get: $x_2 = -(w_0 + w_1 x_1)/w_2$
unless $w_2 = 0$, when we have $x_1 = -w_0/w_1$,
or $w_2 = w_1 = 0$, when the classification depends only on the sign of $w_0$.
So the perceptron functions as a linear discriminant function.
[Figure: the decision boundary and the weight vector; labels w, ||w||, w0.]
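As a small sketch of the two slides above, the code below computes the perceptron output y = g(w0 + w . x) and the boundary line x2 = -(w0 + w1 x1)/w2. The function names and the example weights (w0 = -1, w1 = w2 = 1) are assumptions chosen for illustration, not values from the slides.

```python
def perceptron_output(w0, w, x):
    """Return +1 if a = w0 + w . x is >= 0, else -1."""
    a = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return -1 if a < 0 else 1


def boundary_x2(w0, w1, w2, x1):
    """x2 coordinate of the decision boundary at a given x1 (requires w2 != 0)."""
    if w2 == 0:
        raise ValueError("w2 = 0: the boundary is the vertical line x1 = -w0/w1")
    return -(w0 + w1 * x1) / w2


if __name__ == "__main__":
    w0, w1, w2 = -1.0, 1.0, 1.0               # hypothetical weights
    print(perceptron_output(w0, (w1, w2), (1.0, 1.0)))   # point on the positive side
    print(perceptron_output(w0, (w1, w2), (0.0, 0.0)))   # point on the negative side
    print(boundary_x2(w0, w1, w2, 0.25))      # boundary point for x1 = 0.25
```

With these weights the boundary is the line x1 + x2 = 1, and points exactly on it are assigned to the +1 class because the threshold uses a >= 0.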

The perceptron can be extended to discriminate between K classes by having K output nodes:
$y_j = g\left(\sum_k w_{jk} x_k + w_{j0}\right)$
where $w_{jk}$ is the weight to output j from input k and $w_{j0}$ is the bias of output j.
x is in class $C_j$ if $y_j(x) \ge y_k(x)$ for all k.
[Figure: inputs x1 … xd and a constant input 1 connected to the output nodes through weights w11 … wkd and biases wk0.]
The resulting decision boundaries divide the feature space into convex decision regions (C1, C2, C3 in the figure).
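The multi-class rule can be sketched as an argmax over the output nodes. The code below assumes a monotonic activation g, so it compares the weighted sums directly. The weight matrix `W_EXAMPLE` and the function name `classify` are invented for this illustration.

```python
def classify(weight_rows, x):
    """Return (index of the winning output, list of scores).

    Each row is [w_j0, w_j1, ..., w_jd]: the bias followed by the input weights
    of output j. Because g is assumed monotonic, comparing the weighted sums
    gives the same winner as comparing y_j = g(sum).
    """
    scores = [row[0] + sum(w * xi for w, xi in zip(row[1:], x))
              for row in weight_rows]
    winner = max(range(len(scores)), key=scores.__getitem__)
    return winner, scores


# Hypothetical weights for three classes over 2D inputs.
W_EXAMPLE = [
    [0.0, 1.0, 0.0],     # class 0 responds to large x1
    [0.0, 0.0, 1.0],     # class 1 responds to large x2
    [0.5, -1.0, -1.0],   # class 2 responds to small x1 and x2
]

if __name__ == "__main__":
    for point in [(2.0, 0.0), (0.0, 2.0), (-1.0, -1.0)]:
        print(point, classify(W_EXAMPLE, point))
```

Ties go to the lowest class index here, which is one arbitrary way of honouring the ">=" in the decision rule.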

Generalised linear discriminants
Other activation functions can also be used (usually chosen to be monotonic). NB the discriminant is still linear.
Use of the sigmoidal logistic activation function
$g(a) = \frac{1}{1 + e^{-a}}$
together with data drawn from Gaussian or Bernoulli class-conditional distributions $P(x \mid C_k)$ means that the network outputs can be interpreted as the posterior probabilities $P(C_k \mid x)$.
Linear discriminants can be made more general by including non-linear functions (basis functions) $\phi_k$ which transform the input data. Thus the outputs become:
$y_j = g\left(\sum_k w_{jk} \phi_k(x) + w_{j0}\right)$
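A minimal sketch of a generalised linear discriminant with a logistic activation follows. The choice of basis functions (x1, x2 and the product x1*x2) and the weights are assumptions picked for illustration. They are not from the slides, and the outputs are not calibrated probabilities.

```python
import math


def sigmoid(a):
    """Logistic activation g(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))


def basis(x1, x2):
    """Hypothetical basis functions: the raw inputs plus their product."""
    return (x1, x2, x1 * x2)


def output(x1, x2, w=(1.0, 1.0, -2.0), w0=-0.5):
    """y = g(sum_k w_k * phi_k(x) + w0), with illustrative weights."""
    a = w0 + sum(wk * fk for wk, fk in zip(w, basis(x1, x2)))
    return sigmoid(a)


if __name__ == "__main__":
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, round(output(x1, x2), 3))
```

The discriminant is still linear in the basis-function values, but the product term lets it separate input patterns that no straight line in (x1, x2) could. With these weights the output exceeds 0.5 only when exactly one input is 1.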

Network Learning
The standard procedure for training the weights is gradient descent.
For this process we have a set of training data from known classes, used in conjunction with an error function E(w) (e.g. sum-of-squares error) to specify an error for each instantiation of the network.
Then do: $w^{new} = w^{old} - \eta \nabla E(w)$
So, for each weight: $w_i \leftarrow w_i - \eta\, \partial E(w)/\partial w_i$
where $\nabla E(w)$ is a vector representing the gradient and $\eta$ is the learning rate (small, positive).
1. This moves us downhill, in direction $-\nabla E(w)$ (steepest downhill, since $\nabla E(w)$ is the direction of steepest increase).
2. How far we go is determined by the value of $\eta$.

Moving Downhill: move in the direction of the negative derivative
[Figure: E(w) plotted against w1, with the direction of decreasing E(w) marked.]
If $dE(w)/dw_1 > 0$, the rule $w_1 \leftarrow w_1 - \eta\, dE(w)/dw_1$ (learning rate $\eta$) decreases $w_1$.
If $dE(w)/dw_1 < 0$, the same rule increases $w_1$.

Illustration of Gradient Descent
[Figure sequence: the error surface E(w) over the weights w0 and w1. The direction of steepest descent is the direction of the negative gradient; one step moves from the original point in weight space to a new point in weight space.]

General Gradient Descent Algorithm
Define an objective function E(w).
Algorithm:
1. Pick an initial set of weights w, e.g. randomly.
2. Evaluate $\nabla E(w)$ at w (note: this can be done numerically or in closed form).
3. Update all the weights: $w^{new} = w^{old} - \eta \nabla E(w)$, where $\eta$ is the learning rate.
4. Check if $\nabla E(w)$ is approximately 0. If so, we have converged to a “flat minimum”; if not, we move again in weight space.
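The algorithm on this slide can be sketched as a short loop. The objective used here, E(w) = (w0 - 1)^2 + (w1 + 2)^2, is an assumption chosen only because its gradient and its minimum at (1, -2) are easy to check by hand. The step size, tolerance and step limit are likewise illustrative.

```python
import math


def grad_E(w):
    """Closed-form gradient of the toy objective E(w) = (w0 - 1)^2 + (w1 + 2)^2."""
    return [2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)]


def gradient_descent(grad, w, eta=0.1, tol=1e-6, max_steps=10000):
    """Repeat w <- w - eta * grad(w) until the gradient norm falls below tol."""
    for _ in range(max_steps):
        g = grad(w)
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            break  # gradient approximately 0: a flat point has been reached
        w = [wi - eta * gi for wi, gi in zip(w, g)]
    return w


if __name__ == "__main__":
    print(gradient_descent(grad_E, [0.0, 0.0]))  # approaches [1, -2]
```

The stopping test uses the gradient norm rather than the value of E(w), because a minimum is flat but need not have zero error.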

Gradient descent is equivalent to hill-climbing.
There can be problems knowing when to stop.
Local minima: E(w) can have multiple local minima (note: for the perceptron, E(w) only has a single global minimum, so this is not a problem).
Gradient descent goes to the closest local minimum. Solution: random restarts from multiple places in weight space.

Sequential Gradient Descent
In standard gradient descent (batch version) we get the network output for all data points and estimate the error gradient from the difference between outputs and targets (for the current weights).
Sequential gradient descent: get an approximation to the full gradient based on the ith training vector $x_i$ only, i.e. use:
$w^{new} = w^{old} - \eta \nabla E_i(w)$
where $E_i$ is the error due to $x_i$ and $\eta$ is the learning rate.
This allows us to update the weights as we cycle through each input:
- tends to be faster in practice
- don’t have to store all outputs and vectors
- can be used to adapt weights on-line
- can track slow moving changes in the data
- stochasticity can help to escape from local minima

Error function
We need to define an error function to start the training procedure.
We also need to define target values $t_i$ for each input pattern $x_i$ in the training data set X: $t_i = 1$ if pattern $x_i$ is in C1 and $t_i = -1$ if $x_i$ is in C2.
An obvious starting point is to use the number of training patterns that are currently misclassified.
Since $y(x_i)$ and $t_i$ are both ±1, this is equivalent to a sum-of-squares error function:
$E(w) = \frac{1}{2} \sum_i |y(x_i) - t_i| = \frac{1}{4} \sum_i (y(x_i) - t_i)^2$
However, thinking about the resulting error surface highlights some bad properties of this error for gradient descent.
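A quick numerical check of the claim that, for outputs and targets in {-1, +1}, the misclassification count equals one quarter of the sum of squared differences. The example outputs and targets are made up for illustration.

```python
def misclassified_count(outputs, targets):
    """Number of patterns whose output differs from its target."""
    return sum(1 for y, t in zip(outputs, targets) if y != t)


def quarter_sum_squares(outputs, targets):
    """(1/4) * sum of (y - t)^2; each error contributes (+-2)^2 / 4 = 1."""
    return 0.25 * sum((y - t) ** 2 for y, t in zip(outputs, targets))


# Hypothetical network outputs and targets, all in {-1, +1}.
EXAMPLE_OUTPUTS = [1, -1, -1, 1, 1]
EXAMPLE_TARGETS = [1, 1, -1, -1, 1]

if __name__ == "__main__":
    print(misclassified_count(EXAMPLE_OUTPUTS, EXAMPLE_TARGETS))
    print(quarter_sum_squares(EXAMPLE_OUTPUTS, EXAMPLE_TARGETS))
```

Both quantities only change when a pattern flips class, which is why this error is piecewise constant in the weights.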

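In particular, a smooth change in the weights Δw will not result in a smooth change in the error:
[Figure: error E plotted against w, a step function between the values 5 and 4, with a weight change Δw marked.]
Either the weight change has no effect on the error,
or a pattern is reclassified, causing a discontinuity in the error surface.
This means we get no information from the error gradient (not great for a gradient descent procedure …).
I.e. we cannot distinguish between two configurations with the same number of misclassified patterns.
[Figure: two arrangements of patterns (x and o) around a decision boundary that give the same error.]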

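Use the Perceptron Criterion
Therefore we want an error function which takes into account the distance of misclassified patterns from the boundary.
Use the Perceptron Criterion:
$E_{perc}(w) = -\sum_{i \in M} (w \cdot x_i)\, t_i$, where M is the set of misclassified patterns $x_i$.
Each misclassified pattern contributes a non-negative amount to $E_{perc}$, since:
if $x_i$ is in C1 but classified in C2, then $t_i = +1$ and $w \cdot x_i < 0$, so $(w \cdot x_i)\, t_i = w \cdot x_i < 0$;
if $x_i$ is in C2 but classified in C1, then $t_i = -1$ and $w \cdot x_i \ge 0$, so $(w \cdot x_i)\, t_i = -w \cdot x_i \le 0$.
Also, from a geometrical interpretation of the weights, we can show that $|w \cdot x_i|$ is proportional to the distance of $x_i$ from the decision boundary.
[Figure: a pattern xi, the weight vector w and the distance d from xi to the boundary.]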

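Applying the sequential gradient descent algorithm to this error function, with learning rate $\eta$, we get:
$w(t+1) = w(t) + \eta\, x_j t_j$ for all misclassified $x_j$
Equivalently (up to a factor of 2 absorbed into $\eta$, since $t_j - y_j = 2 t_j$ for a misclassified pattern and 0 otherwise) we can use:
$w(t+1) = w(t) + \eta\, x_j (t_j - y_j)$
which is a form of the adaline learning algorithm.
The Perceptron Convergence Theorem (Rosenblatt, 1962) states that this algorithm is guaranteed to converge to a solution for linearly separable data.
The idea of the proof is to consider $\|w(t+1) - \hat{w}\| - \|w(t) - \hat{w}\|$, where $\hat{w}$ is a weight vector that solves the problem, and to show that the distance to $\hat{w}$ decreases with t (see e.g. Bishop (1995)).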
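The update rule above can be sketched as a short training loop. This is an illustrative implementation, not code from the lecture. The training set (logical AND with targets recoded to ±1), the zero initial weights, η = 1 and the epoch limit are all assumptions, and the bias is handled by prepending a constant input of 1 to each pattern.

```python
def predict(w, x):
    """Threshold output: +1 if w . x >= 0, else -1 (x includes the constant 1)."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if a >= 0 else -1


def train_perceptron(patterns, targets, eta=1.0, max_epochs=100):
    """Sequential perceptron learning: w <- w + eta * x_j * t_j on each mistake.

    Returns (weights, converged), where converged is True once a full pass
    over the data produces no misclassifications.
    """
    w = [0.0] * len(patterns[0])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(patterns, targets):
            if predict(w, x) != t:
                w = [wi + eta * xi * t for wi, xi in zip(w, x)]
                errors += 1
        if errors == 0:
            return w, True
    return w, False


# Logical AND as a linearly separable toy problem; first component is the bias input.
X_AND = [(1, 0, 0), (1, 0, 1), (1, 1, 0), (1, 1, 1)]
T_AND = [-1, -1, -1, 1]

if __name__ == "__main__":
    weights, converged = train_perceptron(X_AND, T_AND)
    print("converged:", converged, "weights:", weights)
    print([predict(weights, x) for x in X_AND])
```

Because AND is linearly separable, the loop stops after a handful of passes. On data that is not separable it would simply run until `max_epochs` and report `converged = False`.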

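[Figure sequence: the weight vector w(t) splits the plane into the regions w.x > 0 and w.x < 0, with patterns of two classes drawn as x and o. The misclassified pattern x3 has (t3 - y3) = -2. Adding η x3 (t3 - y3) to w(t), where η is the learning rate, gives the new weight vector w(t+1) = w(t) + η x3 (t3 - y3).]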

The Fall of the Perceptron Marvin Minsky & Seymour Papert (1969). Perceptrons, MIT Press, Cambridge, MA. Before long researchers had begun to discover the Perceptron’s limitations. Unless input categories were “linearly separable”, a perceptron could not learn to discriminate between them. Unfortunately, it appeared that many important categories were not linearly separable. E.g., those inputs to an XOR gate that give an output of 1 (namely 10 & 01) are not linearly separable from those that do not (00 & 11).
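The XOR limitation can be illustrated with a brute-force search. The sketch below tries every integer weight triple in a small range for a single threshold unit. The range -4..4 and the function names are assumptions, and a finite search is an illustration rather than a proof. It finds weights for AND but none for XOR.

```python
from itertools import product


def unit(w0, w1, w2, x, y):
    """Single threshold unit: 1 if w0 + w1*x + w2*y >= 0, else 0."""
    return 1 if w0 + w1 * x + w2 * y >= 0 else 0


def find_weights(truth_table, values=range(-4, 5)):
    """Return the first (w0, w1, w2) reproducing the truth table, or None."""
    for w0, w1, w2 in product(values, repeat=3):
        if all(unit(w0, w1, w2, x, y) == out
               for (x, y), out in truth_table.items()):
            return (w0, w1, w2)
    return None


AND_TABLE = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}
XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

if __name__ == "__main__":
    print("AND:", find_weights(AND_TABLE))
    print("XOR:", find_weights(XOR_TABLE))
```

For XOR the search comes back empty whatever range is used: the unit would need w0 < 0, w0 + w1 >= 0 and w0 + w2 >= 0, which force w0 + w1 + w2 >= -w0 > 0, yet the (1, 1) input requires w0 + w1 + w2 < 0.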

The Fall of the Perceptron
[Figure: a 2×2 grid with columns Successful / Unsuccessful and rows Many Hours in the Gym per Week / Few Hours in the Gym per Week; the cells are labelled Footballers and Academics.]
In this example, a perceptron would not be able to discriminate between the footballers and the academics…
…despite the simplicity of their relationship: Academics = Successful XOR Gym
This failure caused the majority of researchers to walk away.

The simple XOR example masks a deeper problem ...
[Figure: four shapes, numbered 1 to 4; dashed circles in 1 mark the regions the perceptron takes its inputs from.]
Consider a perceptron classifying shapes as connected or disconnected, taking inputs from the dashed circles in 1.
In going from 1 to 2, a change of the right-hand end only must be sufficient to change the classification (raise/lower the linear sum through 0).
Similarly, a change in the left-hand end only must be sufficient to change the classification.
Therefore changing both ends must take the sum even further across the threshold.
The problem arises because, with a single layer of processing, local knowledge cannot be combined into global knowledge. So add more layers ...

THE PERCEPTRON CONTROVERSY
There is no doubt that Minsky and Papert's book was a block to the funding of research in neural networks for more than ten years. The book was widely interpreted as showing that neural networks are basically limited and fatally flawed.
What IS controversial is whether Minsky and Papert shared and/or promoted this belief.
Following the rebirth of interest in artificial neural networks, Minsky and Papert claimed that they had not intended such a broad interpretation of the conclusions they reached in the book Perceptrons.
However, Jianfeng was present at MIT in 1974, and reached a different conclusion on the basis of the internal reports circulating at MIT. What were Minsky and Papert actually saying to their colleagues in the period after the publication of their book?

Minsky and Papert describe a neural network with a hidden layer as follows:
GAMBA PERCEPTRON: A number of linear threshold systems have their outputs connected to the inputs of a linear threshold system. Thus we have a linear threshold function of many linear threshold functions.
Minsky and Papert then state:
Virtually nothing is known about the computational capabilities of this latter kind of machine. We believe that it can do little more than can a low order perceptron. (This, in turn, would mean, roughly, that although they could recognize (sp) some relations between the points of a picture, they could not handle relations between such relations to any significant extent.) That we cannot understand mathematically the Gamba perceptron very well is, we feel, symptomatic of the early state of development of elementary computational theories.

In summary, Minsky and Papert, with intellectual honesty, confessed that they were not able to prove that, even with hidden layers, feed-forward neural nets were useless, but they expressed strong confidence that they were quite inadequate computational learning devices.
NB Minsky and Papert restrict discussion to "linear threshold" rather than the sigmoid threshold functions prevalent in ANN.
Conclusion? Don’t believe everything you hear …