EEE502 Pattern Recognition


1 EEE502 Pattern Recognition
Assoc. Prof. Devrim Ünay Office: A211

2 MULTILAYER PERCEPTRON

3 XOR and Linear Separability - Revisited
If we use the same one-neuron model to learn the XOR (exclusive-or) function, the model will fail: the first three cases produce correct results, but the last case produces '1', which is incorrect.

Table: XOR function
I1  I2  Output
0   0   0
0   1   1
1   0   1
1   1   0

4 XOR and Linear Separability (cont.)
The solution is to add a middle (hidden, in ANN terminology) layer between the inputs and the output neuron. Choose the weights w11=w12=w21=w22=1, and use a sigmoid function with a certain threshold for each neuron. Confirm, by calculating the neuron outputs for each possible input combination, that this neural network indeed functions like an XOR. (Hint: an output equal to or below 0.5 is considered '0', otherwise '1'.)

5 Neuron calculation
w11=w12=w21=w22=1; X denotes the net input I1+I2.

I1  I2  XOR  X  H1  H2  O  Out
0   0   0    0
0   1   1    1
1   0   1    1
1   1   0    2

6 Neuron calculation (2)
w11=w12=w21=w22=1

I1  I2  XOR  X  H1      H2  O  Out
0   0   0    0  0.3775
0   1   1    1  0.6225
1   0   1    1  0.6225
1   1   0    2  0.8176

7 Neuron calculation (3)
w11=w12=w21=w22=1

I1  I2  XOR  X  H1      H2      O  Out
0   0   0    0  0.3775  0.1824
0   1   1    1  0.6225  0.3775
1   0   1    1  0.6225  0.3775
1   1   0    2  0.8176  0.6225

8 Neuron calculation (4)
w11=w12=w21=w22=1

I1  I2  XOR  X  H1      H2      O       Out
0   0   0    0  0.3775  0.1824  0.4988  0
0   1   1    1  0.6225  0.3775  0.5112  1
1   0   1    1  0.6225  0.3775  0.5112  1
1   1   0    2  0.8176  0.6225  0.4988  0

Assuming that '0.5' and below are considered as '0' and above as '1'.
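As a check, the whole table can be recomputed with a short script. The hidden-neuron thresholds (0.5 and 1.5) and the output-neuron weights of +1 and -1 with threshold 0.2 are inferred from the tabulated values rather than stated explicitly on the slides:

```python
import math

def sigmoid(x, threshold):
    # Sigmoid shifted by a per-neuron threshold, as described on slide 4
    return 1.0 / (1.0 + math.exp(-(x - threshold)))

for i1, i2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    x = i1 + i2                    # net input X (all input weights are 1)
    h1 = sigmoid(x, 0.5)           # hidden neuron 1, assumed threshold 0.5
    h2 = sigmoid(x, 1.5)           # hidden neuron 2, assumed threshold 1.5
    o = sigmoid(h1 - h2, 0.2)      # output neuron: assumed weights +1/-1, threshold 0.2
    out = 1 if o > 0.5 else 0      # 0.5 and below -> '0', otherwise '1'
    print(i1, i2, round(h1, 4), round(h2, 4), round(o, 4), out)
```

Running this reproduces every number in the table, including the XOR output pattern 0, 1, 1, 0.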

9 Visual Analysis

10 Multilayer perceptron
A Multilayer Perceptron (MLP) consists of an input layer, hidden layer(s), and an output layer.

11 What does each of the layers do?
1st layer draws linear boundaries.
2nd layer combines the boundaries.
3rd layer can generate arbitrarily complex boundaries.

12 The input signal propagates through the network in forward direction
The input signal propagates through the network in forward direction. Synaptic weights are updated by propagating the error signal backwards.

13 Basic Features of MLPs
The model of each neuron includes a nonlinear activation function that is differentiable.
The network contains one or more hidden layers.
The network exhibits a high degree of connectivity.

14 Deficiencies of MLPs
The distributed form of nonlinearity and the high connectivity of the network make theoretical analysis of MLPs difficult.
The use of hidden neurons makes the learning process harder to visualize.

15 Function of hidden neurons
They act as feature detectors. As the learning process progresses, the hidden neurons gradually 'discover' the salient features that characterize the training data. They nonlinearly transform the input data into a new space called the feature space, where the classes may be more easily separated.

16 Notion of Credit Assignment
In a distributed learning system, internal decisions are responsible for the overall outcomes. Therefore, we assign credit or blame for overall outcomes to each of the internal decisions made by the hidden computational units of the distributed learning system.

17 MLPs
MLPs are typically trained in a supervised manner using the well-known error back-propagation algorithm, which is based on the error-correction learning rule. It has two passes:
Forward pass: the input propagates through the network and an output is produced. During this pass, the weights are fixed.
Backward pass: the weights are adjusted according to the error-correction rule. The error signal, generated as the difference between the desired and actual outputs, is propagated backward through the network, hence the name of the algorithm. It is referred to as the back-propagation algorithm, or simply back-prop, in the literature.

18 Learning by Error Minimization
Recall that the Perceptron Learning Rule adjusts the network weights to minimize the difference between the actual and desired outputs. We can quantify this difference by the Sum-Squared-Error function:
E = (1/2) Σ_j (d_j − y_j)²
The aim of learning is to minimize this error by adjusting the weights: make small adjustments w_ij → w_ij + Δw_ij until E(w_ij) is small enough. This requires knowledge of how the error varies when we change the weights, i.e. the gradient of E w.r.t. w_ij.
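A minimal sketch of the idea, using a single linear neuron y = w*x (the one-neuron setup is an illustrative assumption, not from the slides): compute the sum-squared error, then repeatedly apply a small update in the direction that reduces it.

```python
def sse(desired, actual):
    # Sum-Squared-Error: E = 1/2 * sum_j (d_j - y_j)^2
    return 0.5 * sum((d - y) ** 2 for d, y in zip(desired, actual))

def delta_w(w, x, d, eta=0.1):
    # For y = w*x, dE/dw = -(d - y)*x, so the adjustment -eta*dE/dw is:
    y = w * x
    return eta * (d - y) * x

w = 0.0
for _ in range(50):
    w += delta_w(w, x=1.0, d=2.0)   # small adjustments w -> w + dw
print(round(w, 2))   # w climbs toward 2.0, where the error is minimal
```

Each step shrinks the remaining error by a constant factor, which is exactly the gradient-based behaviour the following slides derive.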

19 Computing Gradients and Derivatives
Consider a function y = f(x). The gradient, or rate of change, of f(x) at value x is
df/dx = lim_{Δx→0} [f(x + Δx) − f(x)] / Δx
Also known as the partial derivative of f(x) w.r.t. x.

20 Examples of Computing Derivatives Analytically
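As a concrete instance (f(x) = x² is an assumed example, since the slide's own examples did not survive): the analytic derivative f'(x) = 2x can be checked against the finite-difference definition from the previous slide.

```python
def f(x):
    return x ** 2

def analytic_df(x):
    # Computed analytically: d/dx (x^2) = 2x
    return 2 * x

def numeric_df(x, h=1e-6):
    # Finite-difference approximation of the limit definition
    return (f(x + h) - f(x)) / h

print(analytic_df(3.0), numeric_df(3.0))  # both close to 6.0
```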

21 Gradient Descent Minimization
Suppose we want to change x to minimize f(x). Then,
Δx = −η (df/dx)
where η is a small positive constant specifying the size of the change, and df/dx specifies the direction to go. With repeated use of this procedure, f(x) will keep descending towards its minimum (gradient descent minimization).
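The procedure above can be sketched in a few lines; the function f(x) = (x − 3)² and the step size η = 0.1 are illustrative choices, not from the slides.

```python
def df(x):
    # Derivative of f(x) = (x - 3)**2
    return 2 * (x - 3)

x, eta = 0.0, 0.1
for _ in range(100):
    x -= eta * df(x)   # x -> x + dx, with dx = -eta * df/dx
print(x)   # descends toward the minimum at x = 3
```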

22 2D Gradient Descent - Example
A 2D function shown as a contour plot, with the minimum inside the smallest ellipse. Gradients are perpendicular to the contours; the closer the contours, the larger the vectors. We should take the relative magnitudes of the components of the gradient vectors into account if we are to head towards the minimum efficiently.
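The same descent works componentwise in 2D; the elongated quadratic f(x, y) = x² + 10y² is an assumed stand-in for the plotted function, chosen because its ellipse contours are much closer together along y, so the y-component of the gradient is much larger.

```python
def grad(x, y):
    # Gradient of f(x, y) = x**2 + 10*y**2
    return 2 * x, 20 * y

x, y, eta = 4.0, 1.0, 0.05
for _ in range(200):
    gx, gy = grad(x, y)
    x, y = x - eta * gx, y - eta * gy   # step against each gradient component
print(x, y)   # both approach the minimum at (0, 0)
```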

23 Back-propagation learning algorithm
[Figure: a simplified network with input i, hidden unit h_j, weights w, output z/y, and the error E.]

24 Derivation
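A standard chain-rule derivation of this gradient, written here in the same w_ji(n), η, δ notation as the following slides (the slide's own derivation did not survive, so this is the textbook form consistent with slides 25-28):

```latex
% Definitions: E(n) = \tfrac{1}{2}\sum_j e_j^2(n), \quad e_j(n) = d_j(n) - y_j(n),
%              y_j(n) = \varphi(v_j(n)), \quad v_j(n) = \sum_i w_{ji}(n)\, y_i(n)
\frac{\partial E(n)}{\partial w_{ji}(n)}
  = \frac{\partial E(n)}{\partial e_j(n)}
    \frac{\partial e_j(n)}{\partial y_j(n)}
    \frac{\partial y_j(n)}{\partial v_j(n)}
    \frac{\partial v_j(n)}{\partial w_{ji}(n)}
  = -\,e_j(n)\,\varphi'(v_j(n))\,y_i(n)
\quad\Rightarrow\quad
\Delta w_{ji}(n) = \eta\,\delta_j(n)\,y_i(n),
\qquad \delta_j(n) = e_j(n)\,\varphi'(v_j(n))
```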

25 The back-propagation algorithm applies a correction Δw_ji(n) to the synaptic weight w_ji(n), which is proportional to the partial derivative ∂E(n)/∂w_ji(n). We want to minimize the total instantaneous error by changing (tuning) the weights!

26 The correction applied is defined by the delta rule:
Δw_ji(n) = −η ∂E(n)/∂w_ji(n)
where η is the learning-rate parameter. This is gradient descent in weight space (seek a direction for the weight change that reduces the value of E(n)).

27 Derivation

28 Back-propagation learning algorithm
Forward pass: do not change the weights. Compute y_j(n) = φ(v_j(n)) with v_j(n) = Σ_i w_ji(n) y_i(n). For the 1st hidden layer, y_i(n) = x_i(n).
Backward pass: start from the output layer and recursively compute δ for each neuron, layer by layer, moving towards the input layer. φ′(v_j(n)) can be computed because the activation function is differentiable.


30 Back-propagation learning algorithm
For iteration i = 1:M
  Shuffle the samples
  For each training sample:
    sample 1: forward & backward
    sample 2: forward & backward
    ...
    sample N: forward & backward
  Compute the classification error OR the sum-squared error
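The loop above can be sketched as follows; forward() and backward() are placeholders for the two passes defined on slide 28, passed in as functions.

```python
import random

def train(samples, epochs, forward, backward):
    # Epoch loop: shuffle, then one forward and one backward pass per sample
    for _ in range(epochs):
        random.shuffle(samples)      # shuffle the samples each iteration
        for x, d in samples:
            y = forward(x)           # forward pass: weights fixed
            backward(x, y, d)        # backward pass: update weights
```

The per-iteration error computation would go at the end of the outer loop.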

31 Back-propagation learning algorithm
Generalized for a network with L−1 hidden layers:
1. Initialize the weights to small random values.
2. Choose an input pattern from the training set.
3. Propagate the signal forward through the network.
4. Compute the deltas in the output layer: δ_i^L = φ′(u_i^L) (d_i − y_i^L), where u_i^l represents the net input to the ith unit in the lth layer and φ′ is the derivative of the activation function φ.
5. Compute the deltas (gradients) for the preceding layers by propagating the errors backwards: δ_i^l = φ′(u_i^l) Σ_j w_ij^{l+1} δ_j^{l+1}, for l = (L−1), …, 1.

32 Back-propagation learning algorithm
Generalized for a network with L−1 hidden layers (cont.):
6. Update the weights using Δw_ij^l = η δ_j^l y_i^{l−1}.
7. Go to step 2 and repeat until the error is below a certain threshold or a maximum number of iterations is reached.
Note that w_ij^l is the weight on the connection from the ith unit in layer (l−1) to the jth unit in layer l.
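A minimal sketch of the generalized algorithm for one hidden layer (L − 1 = 1), with sigmoid activations; the 2-2-1 network size, the XOR training task, and η = 0.5 are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi(u):
    # Sigmoid activation; note phi'(u) = phi(u) * (1 - phi(u))
    return 1.0 / (1.0 + np.exp(-u))

# Step 1: initialize the weights to small random values (2-2-1 network)
W1 = rng.uniform(-0.5, 0.5, (2, 2)); b1 = rng.uniform(-0.5, 0.5, 2)
W2 = rng.uniform(-0.5, 0.5, (2, 1)); b2 = rng.uniform(-0.5, 0.5, 1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)
eta = 0.5

def total_error():
    H = phi(X @ W1 + b1)
    Y = phi(H @ W2 + b2)
    return 0.5 * np.sum((D - Y) ** 2)

e0 = total_error()
for _ in range(2000):
    for x, d in zip(X, D):
        # Steps 2-3: choose a pattern, propagate forward
        h = phi(x @ W1 + b1)
        y = phi(h @ W2 + b2)
        # Step 4: deltas in the output layer
        delta2 = (d - y) * y * (1 - y)
        # Step 5: propagate the deltas backwards to the hidden layer
        delta1 = (W2 @ delta2) * h * (1 - h)
        # Step 6: update the weights with eta * delta * layer input
        W2 += eta * np.outer(h, delta2); b2 += eta * delta2
        W1 += eta * np.outer(x, delta1); b1 += eta * delta1
print(total_error() < e0)   # the error has decreased
```

Steps 2-6 repeat until the error falls below a threshold; here a fixed iteration budget stands in for the stopping test of step 7.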

33 A Simple Back-Prop Learning Example
All biases are set to 1 (not shown for clarity). Learning rate η = 0.1.
Weights: v11 = −1, v12 = 0, v21 = 0, v22 = 1 (input to hidden); w11 = 1, w12 = 0, w21 = −1, w22 = 1 (hidden to output).
We have an input [0 1] with target [1 0].

34 A Simple Back-Prop Learning Example
Forward pass. Calculate the 1st-layer activations:
u1 = −1×0 + 0×1 + 1 = 1
u2 = 0×0 + 1×1 + 1 = 2

35 A Simple Back-Prop Learning Example
Calculate the first-layer outputs by passing the activations through the activation functions:
y1 = φ(u1) = 1
y2 = φ(u2) = 2

36 A Simple Back-Prop Learning Example
Calculate the 2nd-layer outputs (weighted sums through the activation functions):
y1 = 1×1 + 0×2 + 1 = 2
y2 = −1×1 + 1×2 + 1 = 2

37 A Simple Back-Prop Learning Example
Backward pass. The target is [1, 0], so d1 = 1 and d2 = 0. Thus:
δ1 = (d1 − y1) = 1 − 2 = −1
δ2 = (d2 − y2) = 0 − 2 = −2

38 A Simple Back-Prop Learning Example
Calculate the weight changes for the output-layer weights w (cf. perceptron learning), using y1 = 1 and y2 = 2:
δ1 y1 = −1
δ1 y2 = −2
δ2 y1 = −2
δ2 y2 = −4

39 A Simple Back-Prop Learning Example
The new weights will be:
w11 = 1 + 0.1×(−1) = 0.9
w12 = 0 + 0.1×(−2) = −0.2
w21 = −1 + 0.1×(−2) = −1.2
w22 = 1 + 0.1×(−4) = 0.6
(The v weights are unchanged so far: v11 = −1, v12 = 0, v21 = 0, v22 = 1.)

40 A Simple Back-Prop Learning Example
To compute the weight changes in the preceding layer, we must first calculate the δ's, using the old output-layer weights:
δ1 w11 = −1
δ2 w21 = 2
δ1 w12 = 0
δ2 w22 = −2

41 A Simple Back-Prop Learning Example
The δ's propagate back to the hidden units:
δ1 = −1 + 2 = 1
δ2 = 0 − 2 = −2

42 A Simple Back-Prop Learning Example
And are multiplied by the inputs (x1 = 0, x2 = 1):
δ1 x1 = 0
δ1 x2 = 1
δ2 x1 = 0
δ2 x2 = −2

43 A Simple Back-Prop Learning Example
Finally, change the weights:
v11 = −1, v21 = 0 (unchanged)
v12 = 0 + 0.1×1 = 0.1
v22 = 1 + 0.1×(−2) = 0.8
Note that the weights multiplied by the zero input are unchanged, as they do not contribute to the error. We have also changed the biases (not shown).

44 A Simple Back-Prop Learning Example
Now go forward again (we would normally use a new input vector). With v11 = −1, v12 = 0.1, v21 = 0, v22 = 0.8 and input [0 1], the hidden outputs are:
y1 = 1.1
y2 = 1.8

45 A Simple Back-Prop Learning Example
Continuing the forward pass through the output layer (w11 = 0.9, w12 = −0.2, w21 = −1.2, w22 = 0.6):
y1 = 1.63
y2 = 0.76
The outputs are now closer to the target value [1, 0].
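The whole worked example (slides 33-45) can be reproduced in a few lines. Two assumptions are implied by the slide arithmetic and made explicit here: the activations are linear (identity), and the biases stay fixed at 1 in the repeated forward pass.

```python
eta = 0.1
x, d = [0.0, 1.0], [1.0, 0.0]
v = [[-1.0, 0.0], [0.0, 1.0]]   # v[j][i]: input i -> hidden j
w = [[1.0, 0.0], [-1.0, 1.0]]   # w[k][j]: hidden j -> output k

def forward():
    # Identity activations, biases fixed at 1 (assumed from the slide numbers)
    h = [sum(v[j][i] * x[i] for i in range(2)) + 1 for j in range(2)]
    y = [sum(w[k][j] * h[j] for j in range(2)) + 1 for k in range(2)]
    return h, y

h, y = forward()                                    # h = [1, 2], y = [2, 2]
delta = [d[k] - y[k] for k in range(2)]             # output deltas: [-1, -2]
# Hidden deltas use the *old* output weights (slide 40)
dh = [sum(delta[k] * w[k][j] for k in range(2)) for j in range(2)]  # [1, -2]
for k in range(2):
    for j in range(2):
        w[k][j] += eta * delta[k] * h[j]            # slide 39
for j in range(2):
    for i in range(2):
        v[j][i] += eta * dh[j] * x[i]               # slide 43

h, y = forward()
print(h, y)   # h near [1.1, 1.8], y near [1.63, 0.76]: closer to [1, 0]
```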

