# CS 416 Artificial Intelligence Lecture 22 Statistical Learning Chapter 20.5 Lecture 22 Statistical Learning Chapter 20.5.

## Presentation on theme: "CS 416 Artificial Intelligence Lecture 22 Statistical Learning Chapter 20.5 Lecture 22 Statistical Learning Chapter 20.5."— Presentation transcript:

CS 416 Artificial Intelligence Lecture 22 Statistical Learning Chapter 20.5 Lecture 22 Statistical Learning Chapter 20.5

Perceptrons Each input is binary and has associated with it a weightEach input is binary and has associated with it a weight The sum of the inner product of the input and weights is calculatedThe sum of the inner product of the input and weights is calculated If this sum exceeds a threshold, the perceptron firesIf this sum exceeds a threshold, the perceptron fires Each input is binary and has associated with it a weightEach input is binary and has associated with it a weight The sum of the inner product of the input and weights is calculatedThe sum of the inner product of the input and weights is calculated If this sum exceeds a threshold, the perceptron firesIf this sum exceeds a threshold, the perceptron fires

Perceptrons are linear classifiers Consider a two-input neuron Two weights are “tuned” to fit the dataTwo weights are “tuned” to fit the data The neuron uses the equation w 1 *x 1 + w 2 *x 2 to fire or notThe neuron uses the equation w 1 *x 1 + w 2 *x 2 to fire or not –This is like the equation of a line mx + b - y Consider a two-input neuron Two weights are “tuned” to fit the dataTwo weights are “tuned” to fit the data The neuron uses the equation w 1 *x 1 + w 2 *x 2 to fire or notThe neuron uses the equation w 1 *x 1 + w 2 *x 2 to fire or not –This is like the equation of a line mx + b - y http://www.compapp.dcu.ie/~humphrys/Notes/Neural/single.neural.html

Linearly separable These single-layer perceptron networks can classify linearly separable systems Consider a system like ANDConsider a system like AND x 1 x 2 x 1 AND x 2 111 010 100 000 These single-layer perceptron networks can classify linearly separable systems Consider a system like ANDConsider a system like AND x 1 x 2 x 1 AND x 2 111 010 100 000 1- 1

Linearly separable - AND Consider a system like ANDConsider a system like AND x 1 x 2 x 1 AND x 2 111 010 100 000 Consider a system like ANDConsider a system like AND x 1 x 2 x 1 AND x 2 111 010 100 000 1- 1  x1x1 x2x2 w1w1 w2w2  (x●w)

Linearly separable - XOR Consider a system like XORConsider a system like XOR x 1 x 2 x 1 XOR x 2 110 011 101 000 Consider a system like XORConsider a system like XOR x 1 x 2 x 1 XOR x 2 110 011 101 000 1- 1  x1x1 x2x2 w1w1 w2w2  (x●w)

Linearly separable - XOR IMPOSSIBLE!

2 nd Class Exercise x3 = ~x1, x4 = ~x2x3 = ~x1, x4 = ~x2 Find w1, w2, w3, w4, and theta such that Theta(x1*w1+x2* w2)= x1 xor x2Find w1, w2, w3, w4, and theta such that Theta(x1*w1+x2* w2)= x1 xor x2 Or, prove that it can’t be doneOr, prove that it can’t be done x3 = ~x1, x4 = ~x2x3 = ~x1, x4 = ~x2 Find w1, w2, w3, w4, and theta such that Theta(x1*w1+x2* w2)= x1 xor x2Find w1, w2, w3, w4, and theta such that Theta(x1*w1+x2* w2)= x1 xor x2 Or, prove that it can’t be doneOr, prove that it can’t be done

3 rd Class Exercise Find w1, w2, and f() such that f(x1*w1+x2*w2) = x1 xor x2Find w1, w2, and f() such that f(x1*w1+x2*w2) = x1 xor x2 Or, prove that it can’t be doneOr, prove that it can’t be done Find w1, w2, and f() such that f(x1*w1+x2*w2) = x1 xor x2Find w1, w2, and f() such that f(x1*w1+x2*w2) = x1 xor x2 Or, prove that it can’t be doneOr, prove that it can’t be done

Limitations of Perceptrons Minsky & Papert published (1969) “Perceptrons” stressing the limitations of perceptronsMinsky & Papert published (1969) “Perceptrons” stressing the limitations of perceptrons Single-layer perceptrons cannot solve problems that are linearly inseparable (e.g., xor)Single-layer perceptrons cannot solve problems that are linearly inseparable (e.g., xor) Most interesting problems are linearly inseparableMost interesting problems are linearly inseparable Kills funding for neural nets for 12-15 yearsKills funding for neural nets for 12-15 years Minsky & Papert published (1969) “Perceptrons” stressing the limitations of perceptronsMinsky & Papert published (1969) “Perceptrons” stressing the limitations of perceptrons Single-layer perceptrons cannot solve problems that are linearly inseparable (e.g., xor)Single-layer perceptrons cannot solve problems that are linearly inseparable (e.g., xor) Most interesting problems are linearly inseparableMost interesting problems are linearly inseparable Kills funding for neural nets for 12-15 yearsKills funding for neural nets for 12-15 years

A brief aside about Marvin Minsky Attended Bronx H.S. of ScienceAttended Bronx H.S. of Science Served in U.S. Navy during WW IIServed in U.S. Navy during WW II B.A. Harvard and Ph.D. PrincetonB.A. Harvard and Ph.D. Princeton MIT faculty since 1958MIT faculty since 1958 First graphical head-mounted display (1963)First graphical head-mounted display (1963) Co-inventor of Logo (1968)Co-inventor of Logo (1968) Nearly killed during 2001: A Space Odyssey but survived to write paper critical of neural networksNearly killed during 2001: A Space Odyssey but survived to write paper critical of neural networks Turing Award 1970Turing Award 1970 Attended Bronx H.S. of ScienceAttended Bronx H.S. of Science Served in U.S. Navy during WW IIServed in U.S. Navy during WW II B.A. Harvard and Ph.D. PrincetonB.A. Harvard and Ph.D. Princeton MIT faculty since 1958MIT faculty since 1958 First graphical head-mounted display (1963)First graphical head-mounted display (1963) Co-inventor of Logo (1968)Co-inventor of Logo (1968) Nearly killed during 2001: A Space Odyssey but survived to write paper critical of neural networksNearly killed during 2001: A Space Odyssey but survived to write paper critical of neural networks Turing Award 1970Turing Award 1970 From wikipedia.org

Single-layer networks for classification Single output with 0.5 as dividing line for binary classificationSingle output with 0.5 as dividing line for binary classification Single output with n-1 dividing lines for n-ary classificationSingle output with n-1 dividing lines for n-ary classification n outputs with 0.5 dividing line for n-ary classificationn outputs with 0.5 dividing line for n-ary classification Single output with 0.5 as dividing line for binary classificationSingle output with 0.5 as dividing line for binary classification Single output with n-1 dividing lines for n-ary classificationSingle output with n-1 dividing lines for n-ary classification n outputs with 0.5 dividing line for n-ary classificationn outputs with 0.5 dividing line for n-ary classification

Recent History of Neural Nets 1969 Minsky & Papert “kill” neural nets1969 Minsky & Papert “kill” neural nets 1974 Werbos describes back- propagation1974 Werbos describes back- propagation 1982 Hopfield reinvigorates neural nets1982 Hopfield reinvigorates neural nets 1986 Parallel Distributed Processing1986 Parallel Distributed Processing 1969 Minsky & Papert “kill” neural nets1969 Minsky & Papert “kill” neural nets 1974 Werbos describes back- propagation1974 Werbos describes back- propagation 1982 Hopfield reinvigorates neural nets1982 Hopfield reinvigorates neural nets 1986 Parallel Distributed Processing1986 Parallel Distributed Processing

Multi-layered Perceptrons Input layer, output layer, and “hidden” layersInput layer, output layer, and “hidden” layers Eliminates some concerns of Minsky and PapertEliminates some concerns of Minsky and Papert Modification rules are more complicated!Modification rules are more complicated! Input layer, output layer, and “hidden” layersInput layer, output layer, and “hidden” layers Eliminates some concerns of Minsky and PapertEliminates some concerns of Minsky and Papert Modification rules are more complicated!Modification rules are more complicated!

Why are modification rules more complicated? We can calculate the error of the output neuron by comparing to training data We could use previous update rule to adjust W 3,5 and W 4,5 to correct that errorWe could use previous update rule to adjust W 3,5 and W 4,5 to correct that error But how do W 1,3 W 1,4 W 2,3 W 2,4 adjust?But how do W 1,3 W 1,4 W 2,3 W 2,4 adjust? We can calculate the error of the output neuron by comparing to training data We could use previous update rule to adjust W 3,5 and W 4,5 to correct that errorWe could use previous update rule to adjust W 3,5 and W 4,5 to correct that error But how do W 1,3 W 1,4 W 2,3 W 2,4 adjust?But how do W 1,3 W 1,4 W 2,3 W 2,4 adjust?

First consider error in single-layer neural networks Sum of squared errors (across training data) For one sample: How can we minimize the error? Set derivative equal to zero and solve for weightsSet derivative equal to zero and solve for weights is that error affected by each of the weights in the weight vector?is that error affected by each of the weights in the weight vector? Sum of squared errors (across training data) For one sample: How can we minimize the error? Set derivative equal to zero and solve for weightsSet derivative equal to zero and solve for weights is that error affected by each of the weights in the weight vector?is that error affected by each of the weights in the weight vector?

Computing the partial By the Chain Rule: g ( ) = the activation function

Computing the partial g’( in ) = derivative of the activation function = g(1-g) in the case of the sigmoid

What changes in multilayer? Output is not one value, y Output is a vectorOutput is a vector We do not know the correct outputs for the hidden layers We will have to propagate errors backwardsWe will have to propagate errors backwards Back propagation (backprop) Output is not one value, y Output is a vectorOutput is a vector We do not know the correct outputs for the hidden layers We will have to propagate errors backwardsWe will have to propagate errors backwards Back propagation (backprop)

Multilayer

Backprop at the output layer Output layer error is computed as in single- layer and weights are updated in same fashion Let Err i be the i th component of the error vector y – h WLet Err i be the i th component of the error vector y – h W –Let Output layer error is computed as in single- layer and weights are updated in same fashion Let Err i be the i th component of the error vector y – h WLet Err i be the i th component of the error vector y – h W –Let

Backprop in the hidden layer Each hidden node is responsible for some fraction of the error  i in each of the output nodes to which it is connected  i is divided among all hidden nodes that connect to output i according to their strengths  i is divided among all hidden nodes that connect to output i according to their strengths Error at hidden node j: Each hidden node is responsible for some fraction of the error  i in each of the output nodes to which it is connected  i is divided among all hidden nodes that connect to output i according to their strengths  i is divided among all hidden nodes that connect to output i according to their strengths Error at hidden node j:

Backprop in the hidden layer Error is: Correction is: Error is: Correction is:

Summary of backprop 1.Compute the  value for the output units using the observed error 2.Starting with the output layer, repeat the following for each layer until done Propagate  value back to previous layerPropagate  value back to previous layer Update the weights between the two layersUpdate the weights between the two layers 1.Compute the  value for the output units using the observed error 2.Starting with the output layer, repeat the following for each layer until done Propagate  value back to previous layerPropagate  value back to previous layer Update the weights between the two layersUpdate the weights between the two layers

4 th Class Exercise Find w1, w2, w3, w4, w5, theta1, and theta2 such that output is x1 xor x2Find w1, w2, w3, w4, w5, theta1, and theta2 such that output is x1 xor x2 Or, prove that it can’t be doneOr, prove that it can’t be done Find w1, w2, w3, w4, w5, theta1, and theta2 such that output is x1 xor x2Find w1, w2, w3, w4, w5, theta1, and theta2 such that output is x1 xor x2 Or, prove that it can’t be doneOr, prove that it can’t be done

Back-Propagation (xor) Initial weights are randomInitial weights are random Threshold is now sigmoidal (function should have derivatives)Threshold is now sigmoidal (function should have derivatives) Initial weights are randomInitial weights are random Threshold is now sigmoidal (function should have derivatives)Threshold is now sigmoidal (function should have derivatives) Initial weights: w1=0.90, w2=-0.54 w3=0.21, w4=-0.03 w5 = 0.78

Back-Propagation (xor) Input layer – two unitInput layer – two unit Hidden layer – one unitHidden layer – one unit Output layer – one unitOutput layer – one unit Output is related to input byOutput is related to input by Performance is defined asPerformance is defined as Input layer – two unitInput layer – two unit Hidden layer – one unitHidden layer – one unit Output layer – one unitOutput layer – one unit Output is related to input byOutput is related to input by Performance is defined asPerformance is defined as For all samples in training set T

Back-Propagation (xor) Error at last layer (hidden  output) is defined as:Error at last layer (hidden  output) is defined as: Error at previous layer (input  hidden) is defined as:Error at previous layer (input  hidden) is defined as: Change in weight:Change in weight: Where:Where: Error at last layer (hidden  output) is defined as:Error at last layer (hidden  output) is defined as: Error at previous layer (input  hidden) is defined as:Error at previous layer (input  hidden) is defined as: Change in weight:Change in weight: Where:Where:

Back-Propagation (xor) (0,0)  0 – 1st example(0,0)  0 – 1st example Input to hidden unit is 0, sigmoid(0)=0.5Input to hidden unit is 0, sigmoid(0)=0.5 Input to output unit is (0.5)(-0.03)=-0.015Input to output unit is (0.5)(-0.03)=-0.015 Sigmoid(-0.015)=0.4963  error=-0.4963Sigmoid(-0.015)=0.4963  error=-0.4963 So,So, Example’s contribution to is –0.0062Example’s contribution to is –0.0062 (0,0)  0 – 1st example(0,0)  0 – 1st example Input to hidden unit is 0, sigmoid(0)=0.5Input to hidden unit is 0, sigmoid(0)=0.5 Input to output unit is (0.5)(-0.03)=-0.015Input to output unit is (0.5)(-0.03)=-0.015 Sigmoid(-0.015)=0.4963  error=-0.4963Sigmoid(-0.015)=0.4963  error=-0.4963 So,So, Example’s contribution to is –0.0062Example’s contribution to is –0.0062 Why are we ignoring the other weight changes?

Back-Propagation (xor) (0,1)  1 – 2nd example(0,1)  1 – 2nd example i h =-0.54  o h =0.3862i h =-0.54  o h =0.3862 i o =(0.3862)(-.03)+0.78=0.769  o o =0.6683i o =(0.3862)(-.03)+0.78=0.769  o o =0.6683 (0,1)  1 – 2nd example(0,1)  1 – 2nd example i h =-0.54  o h =0.3862i h =-0.54  o h =0.3862 i o =(0.3862)(-.03)+0.78=0.769  o o =0.6683i o =(0.3862)(-.03)+0.78=0.769  o o =0.6683 &c…

Back-Propagation (xor) Initial performance = -0.2696Initial performance = -0.2696 After 100 iterations we have:After 100 iterations we have: w=(0.913, -0.521, 0.036, -0.232, 0.288)w=(0.913, -0.521, 0.036, -0.232, 0.288) Performance = -0.2515Performance = -0.2515 After 100K iterations we have:After 100K iterations we have: w=(15.75, -7.671, 7.146, -7.149, 0.0022)w=(15.75, -7.671, 7.146, -7.149, 0.0022) Performance = -0.1880Performance = -0.1880 After 1M iterations we have:After 1M iterations we have: w=(21.38, -10.49, 9.798, -9.798, 0.0002)w=(21.38, -10.49, 9.798, -9.798, 0.0002) Performance = -0.1875Performance = -0.1875 Initial performance = -0.2696Initial performance = -0.2696 After 100 iterations we have:After 100 iterations we have: w=(0.913, -0.521, 0.036, -0.232, 0.288)w=(0.913, -0.521, 0.036, -0.232, 0.288) Performance = -0.2515Performance = -0.2515 After 100K iterations we have:After 100K iterations we have: w=(15.75, -7.671, 7.146, -7.149, 0.0022)w=(15.75, -7.671, 7.146, -7.149, 0.0022) Performance = -0.1880Performance = -0.1880 After 1M iterations we have:After 1M iterations we have: w=(21.38, -10.49, 9.798, -9.798, 0.0002)w=(21.38, -10.49, 9.798, -9.798, 0.0002) Performance = -0.1875Performance = -0.1875

Some general artificial neural network (ANN) info The entire network is a function g( inputs ) = outputsThe entire network is a function g( inputs ) = outputs –These functions frequently have sigmoids in them –These functions are frequently differentiable –These functions have coefficients (weights) Backpropagation networks are simply ways to tune the coefficients of a function so it produces desired outputBackpropagation networks are simply ways to tune the coefficients of a function so it produces desired output The entire network is a function g( inputs ) = outputsThe entire network is a function g( inputs ) = outputs –These functions frequently have sigmoids in them –These functions are frequently differentiable –These functions have coefficients (weights) Backpropagation networks are simply ways to tune the coefficients of a function so it produces desired outputBackpropagation networks are simply ways to tune the coefficients of a function so it produces desired output

Function approximation Consider fitting a line to data Coefficients: slope and y-interceptCoefficients: slope and y-intercept Training data: some samplesTraining data: some samples Use least-squares fitUse least-squares fit This is what an ANN does Consider fitting a line to data Coefficients: slope and y-interceptCoefficients: slope and y-intercept Training data: some samplesTraining data: some samples Use least-squares fitUse least-squares fit This is what an ANN does x y

Function approximation A function of two inputs… Fit a smooth curve to the available dataFit a smooth curve to the available data –Quadratic –Cubic –n th -order –ANN! A function of two inputs… Fit a smooth curve to the available dataFit a smooth curve to the available data –Quadratic –Cubic –n th -order –ANN!

Curve fitting A neural network should be able to generate the input/output pairs from the training dataA neural network should be able to generate the input/output pairs from the training data You’d like for it to be smooth (and well-behaved) in the voids between the training dataYou’d like for it to be smooth (and well-behaved) in the voids between the training data There are risks of over fitting the dataThere are risks of over fitting the data A neural network should be able to generate the input/output pairs from the training dataA neural network should be able to generate the input/output pairs from the training data You’d like for it to be smooth (and well-behaved) in the voids between the training dataYou’d like for it to be smooth (and well-behaved) in the voids between the training data There are risks of over fitting the dataThere are risks of over fitting the data

When using ANNs Sometimes the output layer feeds back into the input layer – recurrent neural networksSometimes the output layer feeds back into the input layer – recurrent neural networks The backpropagation will tune the weightsThe backpropagation will tune the weights You determine the topologyYou determine the topology –Different topologies have different training outcomes (consider overfitting) –Sometimes a genetic algorithm is used to explore the space of neural network topologies Sometimes the output layer feeds back into the input layer – recurrent neural networksSometimes the output layer feeds back into the input layer – recurrent neural networks The backpropagation will tune the weightsThe backpropagation will tune the weights You determine the topologyYou determine the topology –Different topologies have different training outcomes (consider overfitting) –Sometimes a genetic algorithm is used to explore the space of neural network topologies

What is the Purpose of NN? To create an Artificial Intelligence, or Although not an invalid purpose, many people in the AI community think neural networks do not provide anything that cannot be obtained through other techniquesAlthough not an invalid purpose, many people in the AI community think neural networks do not provide anything that cannot be obtained through other techniques –It is hard to unravel the “intelligence” behind why the ANN works To study how the human brain works? Ironically, those studying neural networks with this in mind are more likely to contribute to the previous purposeIronically, those studying neural networks with this in mind are more likely to contribute to the previous purpose To create an Artificial Intelligence, or Although not an invalid purpose, many people in the AI community think neural networks do not provide anything that cannot be obtained through other techniquesAlthough not an invalid purpose, many people in the AI community think neural networks do not provide anything that cannot be obtained through other techniques –It is hard to unravel the “intelligence” behind why the ANN works To study how the human brain works? Ironically, those studying neural networks with this in mind are more likely to contribute to the previous purposeIronically, those studying neural networks with this in mind are more likely to contribute to the previous purpose

Some Brain Facts Contains ~100,000,000,000 neuronsContains ~100,000,000,000 neurons Hippocampus CA3 region contains ~3,000,000 neuronsHippocampus CA3 region contains ~3,000,000 neurons Each neuron is connected to ~10,000 other neuronsEach neuron is connected to ~10,000 other neurons ~ (10 15 )(10 15 ) connections!~ (10 15 )(10 15 ) connections! Consumes ~20-30% of the body’s energyConsumes ~20-30% of the body’s energy Contains about 2% of the body’s massContains about 2% of the body’s mass Contains ~100,000,000,000 neuronsContains ~100,000,000,000 neurons Hippocampus CA3 region contains ~3,000,000 neuronsHippocampus CA3 region contains ~3,000,000 neurons Each neuron is connected to ~10,000 other neuronsEach neuron is connected to ~10,000 other neurons ~ (10 15 )(10 15 ) connections!~ (10 15 )(10 15 ) connections! Consumes ~20-30% of the body’s energyConsumes ~20-30% of the body’s energy Contains about 2% of the body’s massContains about 2% of the body’s mass

Similar presentations