 Navneet Goyal, BITS-Pilani Perceptrons. Labeled data is called Linearly Separable Data (LSD) if there is a linear decision boundary separating the classes.


Navneet Goyal, BITS-Pilani Perceptrons

Perceptrons
Labeled data is called linearly separable data (LSD) if there is a linear decision boundary separating the classes.
The perceptron achieves perfect separation on LSD.
The perceptron iterates over the training set, adjusting its parameters every time it encounters an incorrectly classified example.
Convergence is guaranteed for LSD.
Learning rate

Perceptrons
Most basic of all pattern recognition algorithms.
Proposed by Rosenblatt in 1958 as a simple model for learning in the brain.
A pattern x is fed into a neuron in the form of n real numbers x_i (i = 1, 2, …, n).
Synaptic strengths w are used as weights w_i (i = 1, 2, …, n).
Neuron excitation = $\sum_{i=1}^{n} w_i x_i$

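Perceptrons
If the excitation exceeds the threshold, $\sum_{i=1}^{n} w_i x_i > \theta$, the neuron fires.
Excitation is a linear combination of inputs.
The perceptron is therefore a linear classifier.
The excitation is represented as an inner product, $\langle w, x \rangle$.
Vector spaces with an inner product provide:
–Length or norm
–Distance
–Angle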


Perceptrons
Which side of the hyperplane does a pattern vector x fall on?
The weight vector w, together with the threshold θ, defines a hyperplane whose normal vector is w.
Checking whether the inner product is greater than the threshold is equivalent to finding which side of the hyperplane the pattern x falls on.
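A minimal Python sketch of the hyperplane test described above: it computes the inner product of w and x and compares it with the threshold. The function name `fires` and the example weights and threshold are illustrative assumptions, not values from the slides.

```python
def fires(w, x, theta):
    """Return True if pattern x lies on the positive side of the hyperplane w.x = theta."""
    excitation = sum(wi * xi for wi, xi in zip(w, x))  # inner product <w, x>
    return excitation > theta


# Hypothetical weight vector and threshold, chosen only for illustration.
w = [1.0, -1.0]
theta = 0.5

print(fires(w, [2.0, 0.0], theta))  # excitation 2.0 exceeds theta
print(fires(w, [0.0, 2.0], theta))  # excitation -2.0 does not
```

Changing θ while keeping w fixed shifts the hyperplane parallel to itself; changing w rotates it.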

Convergence of the Perceptron Learning Algorithm. Figure taken from Christopher Bishop's book, Pattern Recognition and Machine Learning.

Perceptrons
Finding a suitable value for w and θ.
Recall our discussion on model flexibility & complexity.

Flexibility & Model Complexity  Different choices of w!!!

Flexibility & Model Complexity  Different choices of θ for a given w!!!

Perceptrons
The learning problem for the perceptron is to find a vector w and a threshold θ that separate two classes of patterns as well as possible.
Similarities/differences with SVM???
It is very common to view learning as adapting weights in a neural network.

Artificial Neural Networks
ANNs attempt to simulate biological neural systems.
The human brain consists of nerve cells called neurons.
Neurons connect with other neurons via strands called axons.
Axons transmit nerve impulses from one neuron to another when a neuron is stimulated, or fires.
The perceptron is the simplest ANN model.

Source: http://cw.psypress.com/

Perceptron Output Y is 1 if at least two of the three inputs are equal to 1. Figure taken from: Introduction to Data Mining By Tan, Steinbach, Kumar

Perceptron Figure taken from: Introduction to Data Mining By Tan, Steinbach, Kumar

Perceptron Test points (1,1,0) & (0,1,0)
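The rule "Y is 1 if at least two of the three inputs are equal to 1" can be checked with a small Python sketch. The weights (0.3 each), the threshold (0.4), and the use of -1 as the "off" output are assumptions made for illustration; the transcript does not preserve the figure's values.

```python
from itertools import product

# Assumed parameters (not given in the transcript): equal weights and a
# threshold that together realize "at least two of three inputs are 1".
WEIGHTS = (0.3, 0.3, 0.3)
THRESHOLD = 0.4


def perceptron_output(x):
    """Return 1 if the weighted sum of inputs exceeds the threshold, else -1."""
    excitation = sum(w * xi for w, xi in zip(WEIGHTS, x))
    return 1 if excitation > THRESHOLD else -1


# The two test points from the slide.
print(perceptron_output((1, 1, 0)))  # two inputs are 1
print(perceptron_output((0, 1, 0)))  # only one input is 1

# Check the rule on all eight binary input combinations.
for x in product((0, 1), repeat=3):
    expected = 1 if sum(x) >= 2 else -1
    assert perceptron_output(x) == expected
```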

Perceptron
Training a perceptron model amounts to adapting the weights of the links until they fit the input-output relationship of the underlying data.
Iterative process.
Convergence is guaranteed for linearly separable data (provided the learning rate is sufficiently small).
What about non-linearly separable data?

Figure taken from: Introduction to Data Mining By Tan, Steinbach, Kumar

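Learning Perceptron Model
Weight update formula: $w_j^{(k+1)} = w_j^{(k)} + \lambda \, (y_i - \hat{y}_i^{(k)}) \, x_{ij}$
$w_j^{(k)}$ is the weight parameter associated with the j-th input link after the k-th iteration.
$\hat{y}_i^{(k)}$ is the output predicted for training example $x_i$ at the k-th iteration.
λ is the learning rate, between 0 and 1.
$x_{ij}$ is the value of the j-th attribute of the training example $x_i$.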

Perceptron Learning Algorithm
1. D = {(x_i, y_i) | i = 1, 2, …, N} is the training data
2. Initialize the weight vector w^(0)
3. Repeat
4.   for each (x_i, y_i) ∈ D
5.     compute the predicted output
6.     for each weight w_j do
7.       update the weight
8.     end for
9.   end for
10. until stopping condition is met
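The steps of the algorithm can be sketched in Python as follows. This is a minimal illustration, assuming the standard perceptron update w_j ← w_j + λ (y_i − ŷ_i) x_ij, outputs in {-1, +1}, a constant bias input of 1, zero initial weights, and "one error-free pass" as the stopping condition. The toy data set, the function names, and the default parameter values are hypothetical.

```python
from itertools import product


def predict(w, x):
    """Thresholded output: +1 if w.x > 0, else -1 (x includes a constant bias input)."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


def train_perceptron(data, learning_rate=0.5, max_epochs=1000):
    """Perceptron learning: w_j <- w_j + lr * (y - y_hat) * x_j for each example."""
    n_inputs = len(data[0][0])
    w = [0.0] * n_inputs  # step 2: initialize the weight vector
    for _ in range(max_epochs):  # step 3: repeat
        mistakes = 0
        for x, y in data:  # step 4: for each training example
            y_hat = predict(w, x)  # step 5: compute the predicted output
            if y_hat != y:
                mistakes += 1
            for j in range(n_inputs):  # steps 6-8: update every weight
                w[j] += learning_rate * (y - y_hat) * x[j]
        if mistakes == 0:  # step 10: stop after an error-free pass
            break
    return w


# Assumed toy data: y = +1 when at least two of three binary inputs are 1.
# A leading constant 1 is prepended so that w[0] acts as the bias term.
data = [((1,) + bits, 1 if sum(bits) >= 2 else -1)
        for bits in product((0, 1), repeat=3)]

w = train_perceptron(data)
print(w)
print(all(predict(w, x) == y for x, y in data))
```

When the prediction is correct, (y − ŷ) is zero and the weights stay unchanged, so only misclassified examples move the decision boundary.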

Perceptron Learning Algorithm
Interpret different values of λ!!
Adaptive λ
Try to solve the XOR problem using the perceptron model.
Figure taken from: Introduction to Data Mining By Tan, Steinbach, Kumar
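As a hedged illustration of the XOR exercise, the sketch below runs the perceptron update on the four XOR points and records the number of misclassified points after each pass. The {-1, +1} labels, the bias input, the learning rate, and the number of passes are assumptions. Because XOR is not linearly separable, no weight vector classifies all four points correctly, so the error count never reaches zero.

```python
def predict(w, x):
    """Thresholded output: +1 if w.x > 0, else -1."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > 0 else -1


def count_errors(w, data):
    """Number of training examples misclassified by weight vector w."""
    return sum(1 for x, y in data if predict(w, x) != y)


# XOR with a leading constant bias input of 1; labels in {-1, +1} (assumed).
xor_data = [((1, 0, 0), -1), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), -1)]

w = [0.0, 0.0, 0.0]
learning_rate = 0.5  # assumed value
errors_per_epoch = []
for epoch in range(50):
    for x, y in xor_data:
        y_hat = predict(w, x)
        for j in range(len(w)):
            w[j] += learning_rate * (y - y_hat) * x[j]
    errors_per_epoch.append(count_errors(w, xor_data))

print(errors_per_epoch)
print(min(errors_per_epoch))  # stays at least 1: the weights never settle
```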

Gradient Descent & Delta Rule
The perceptron can fail to converge if the examples are not linearly separable.
A second training rule, the delta rule, is designed to overcome this difficulty.
It converges towards a best-fit approximation to the target concept.
It uses gradient descent to find weights that best fit the training examples.
The delta rule provides the basis for the backpropagation algorithm.
Slide material adapted from the textbook (Machine Learning by Tom Mitchell)

Gradient Descent & Delta Rule
The perceptron can fail to converge if the examples are not linearly separable.
A second training rule, the delta rule, is designed to overcome this difficulty.
It converges towards a best-fit approximation to the target concept.
It uses gradient descent to search the hypothesis space for weights that best fit the training examples.
Gradient descent provides the basis for the backpropagation algorithm.
Gradient descent also serves all those learning algorithms that search a hypothesis space containing many different kinds of continuously parametrized hypotheses.
Slide material adapted from the textbook (Machine Learning by Tom Mitchell)

Gradient Descent & Delta Rule
Gradient descent and batch gradient descent were already discussed in class.
Slide material adapted from the textbook (Machine Learning by Tom Mitchell)

Perceptron Learning Rule & Delta Rule
The perceptron learning rule (PLR) updates weights based on the error in the thresholded perceptron output.
The delta rule updates weights based on the error in the unthresholded linear combination of inputs.
For linearly separable (LS) problems, the PLR converges after a finite number of iterations.
The delta rule converges only asymptotically towards the minimum-error hypothesis, possibly requiring unbounded time, but it converges regardless of whether the data is LS or not.*
Slide material adapted from the textbook (Machine Learning by Tom Mitchell)
* Proof can be found in Hertz et al., 1991, Introduction to the Theory of Neural Computation, MA: Addison-Wesley.
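The contrast can be made concrete with a small Python sketch of the delta rule using batch gradient descent on the unthresholded output o = w·x, minimizing E = ½ Σ (t − o)². The XOR data with 0/1 targets, the learning rate, and the iteration count are assumptions chosen for illustration. Even though XOR is not linearly separable, the error settles at the least-squares minimum instead of oscillating.

```python
def linear_output(w, x):
    """Unthresholded linear combination of inputs."""
    return sum(wj * xj for wj, xj in zip(w, x))


def half_sse(w, data):
    """Error E = 1/2 * sum of squared differences between target and output."""
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in data)


def delta_rule(data, learning_rate=0.05, epochs=1000):
    """Batch gradient descent: w_j <- w_j + lr * sum over examples of (t - o) * x_j."""
    w = [0.0] * len(data[0][0])
    for _ in range(epochs):
        step = [0.0] * len(w)
        for x, t in data:
            error = t - linear_output(w, x)  # error in the unthresholded output
            for j, xj in enumerate(x):
                step[j] += learning_rate * error * xj
        w = [wj + sj for wj, sj in zip(w, step)]
    return w


# XOR with a leading constant bias input of 1 and 0/1 targets (assumed encoding).
xor_data = [((1, 0, 0), 0), ((1, 0, 1), 1), ((1, 1, 0), 1), ((1, 1, 1), 0)]

w = delta_rule(xor_data)
print(half_sse([0.0, 0.0, 0.0], xor_data))  # error at the initial weights
print(half_sse(w, xor_data))                # error after training
```

For this data the best linear fit outputs 0.5 for every point, which is the minimum-error hypothesis even though it does not separate the classes.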

Another Learning Rule!
Linear Programming (LP)
A general and efficient method for solving sets of linear inequalities.
Each training example corresponds to an inequality of the form w·x > 0 or w·x <= 0.
Works only when the training examples are LS.
Duda & Hart (1973)* used LP to handle the non-LS case.
LP does not scale to training MLPs.
In contrast, GD works quite well!
Slide material adapted from the textbook (Machine Learning by Tom Mitchell)
* Duda & Hart, 1973, Pattern Classification and Scene Analysis, NY: John Wiley & Sons.

Multi-Layer Perceptron
Human brain – 10^11 neurons!!!
Neuron – a simple, slow processor with clock speed in milliseconds.
Massive network structure is the key!
Most common NN – feed-forward NN
–Only forward movement of signals from input to output layers through intermediate layers, without any feedback mechanism
–Example: MLP

Multi-Layer Perceptron
Model complications can arise due to:
–Hidden layers and hidden nodes
–More complex activation functions
For example, when the output is not binary we use the logistic sigmoid function (LSF).
The shape of the LSF resembles the activity of a neuron.
Additional complexities allow a multi-layer neural network (MLNN) to model more complex relationships between input & output.
The goal is to determine a set of weights that minimize the sum of squared errors (SSE).
Gradient descent update formula?
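A rough Python sketch of the quantities named on this slide: a forward pass through one hidden layer with logistic sigmoid activations, and the SSE that training would try to minimize. The 2-2-1 layout, all weight values, and the XOR-style data are hypothetical; no training is performed here.

```python
import math


def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))


def forward(x, hidden_weights, output_weights):
    """Forward pass; each weight row is [bias, w_1, w_2, ...]."""
    hidden = [sigmoid(row[0] + sum(w * xi for w, xi in zip(row[1:], x)))
              for row in hidden_weights]
    return sigmoid(output_weights[0]
                   + sum(w * h for w, h in zip(output_weights[1:], hidden)))


def sse(data, hidden_weights, output_weights):
    """Sum of squared errors between targets and network outputs."""
    return sum((y - forward(x, hidden_weights, output_weights)) ** 2
               for x, y in data)


# Hypothetical weights for two hidden nodes and one output node.
hidden_weights = [[-0.5, 1.0, 1.0], [1.5, -1.0, -1.0]]
output_weights = [-1.0, 1.0, 1.0]

data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
print(forward((0, 1), hidden_weights, output_weights))
print(sse(data, hidden_weights, output_weights))
```

Gradient descent would repeatedly adjust each weight in the direction that lowers this SSE; because the sigmoid is differentiable, the required derivatives exist everywhere.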

Figure taken from: Introduction to Data Mining By Tan, Steinbach, Kumar

Multi-Layer Perceptron
Design issues:
–Number of nodes in the input layer
Assign an input node to each numerical or binary variable.
For a categorical variable, create an input node for each distinct value, or encode the k-ary variable using ⌈log2 k⌉ input nodes.
–Number of nodes in the output layer
2-class problem – single output node
k-class problem – k output nodes
–Network topology
–Weights & biases initialization (random assignments are usually acceptable)
–Handling missing values
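The two input encodings for a categorical variable can be sketched in Python as below. The function names and the convention of numbering category values from 0 are assumptions made for illustration.

```python
def one_hot(index, k):
    """One input node per distinct value: k nodes, exactly one of them set to 1."""
    return [1 if i == index else 0 for i in range(k)]


def binary_code(index, k):
    """Encode a k-ary value with ceil(log2 k) input nodes, most significant bit first."""
    # (k - 1).bit_length() equals ceil(log2 k) for k >= 2 and avoids float rounding.
    n_nodes = max(1, (k - 1).bit_length())
    return [(index >> bit) & 1 for bit in reversed(range(n_nodes))]


# A categorical variable with k = 5 distinct values, numbered 0..4.
print(one_hot(4, 5))      # 5 input nodes
print(binary_code(4, 5))  # 3 input nodes
```

One node per value keeps the categories unordered at the cost of more inputs; the compact binary code uses fewer nodes but forces unrelated values to share bits.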

