Connectionist Models: Backprop Jerome Feldman CS182/CogSci110/Ling109 Spring 2008.



Recruiting connections: Given that LTP involves synaptic strength changes and Hebb’s rule involves coincident-activation-based strengthening of connections, how can connections between two nodes be recruited using Hebb’s rule?

[Figure: nodes X and Y, shown in two successive diagrams]

Finding a Connection
P = probability of NO link between X and Y
N = number of units in a “layer”
B = number of randomly outgoing units per unit
F = B/N, the branching factor
K = number of intermediate layers, 2 in the example
[Table: P for particular values of N and K; the values did not survive extraction]
# Paths = $(1 - P_{K-1}) \cdot (N \cdot F) = (1 - P_{K-1}) \cdot B$
$P = (1 - F)^{B^K}$
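As a rough numerical check of the last formula, the Python sketch below evaluates P = (1 - F) ** (B ** K) for K = 0, 1, 2. The values N = 10^6 and B = 1000 are illustrative assumptions, not figures recovered from the slide, and `prob_no_link` is a hypothetical helper name.

```python
def prob_no_link(n_units: int, fan_out: int, k_layers: int) -> float:
    """Probability of NO path between X and Y: P = (1 - F) ** (B ** K), F = B / N."""
    branching = fan_out / n_units          # F = B / N
    return (1.0 - branching) ** (fan_out ** k_layers)


# Assumed example values: N = 10**6 units per layer, B = 1000 outgoing links.
for k in range(3):
    print(f"K = {k}: P(no link) = {prob_no_link(10**6, 1000, k):.6g}")
```

Under these assumed values the chance of finding no link drops sharply as intermediate layers are added, which is the point the slide makes.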

Finding a Connection in Random Networks For Networks with N nodes and branching factor, there is a high probability of finding good links. (Valiant 1995)

Recruiting a Connection in Random Networks
Informal Algorithm
1. Activate the two nodes to be linked.
2. Have nodes with double activation strengthen their active synapses (Hebb).
3. There is evidence for a “now print” signal based on LTP (episodic memory).

Has-color Green Has-shape Round

Has-color Has-shape GREEN ROUND

Hebb’s rule is not sufficient What happens if the neural circuit fires perfectly, but the result is very bad for the animal, like eating something sickening?  A pure invocation of Hebb’s rule would strengthen all participating connections, which can’t be good.  On the other hand, it isn’t right to weaken all the active connections involved; much of the activity was just recognizing the situation – we would like to change only those connections that led to the wrong decision. No one knows how to specify a learning rule that will change exactly the offending connections when an error occurs.  Computer systems, and presumably nature as well, rely upon statistical learning rules that tend to make the right changes over time. More in later lectures.

Hebb’s rule is insufficient: should you “punish” all the connections? [Figure: chain of events: taste bud, tastes rotten, eats food, gets sick, drinks water]

Models of Learning Hebbian – coincidence Supervised – correction (backprop) Recruitment – one-trial Reinforcement Learning- delayed reward Unsupervised – similarity

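Abstract Neuron: inputs $i_1, i_2, \ldots, i_n$ with weights $w_1, w_2, \ldots, w_n$, a bias input $i_0 = 1$ with weight $w_0$, and a single output $y$. Threshold Activation Function: $y = 1$ if $net > 0$ and $y = 0$ otherwise, where $net = \sum_{j=0}^{n} w_j i_j$.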
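A minimal Python sketch of such a threshold unit follows. The function name `threshold_unit` and the AND/OR weight settings are illustrative assumptions, not values taken from the slide.

```python
def threshold_unit(inputs, weights, bias_weight):
    """Return 1 if the weighted sum (including the bias input i0 = 1) is > 0, else 0."""
    net = bias_weight * 1.0 + sum(w * i for w, i in zip(weights, inputs))
    return 1 if net > 0 else 0


# Assumed weights: the same unit computes AND or OR depending on the bias weight.
for x1 in (0, 1):
    for x2 in (0, 1):
        and_out = threshold_unit((x1, x2), (1.0, 1.0), -1.5)
        or_out = threshold_unit((x1, x2), (1.0, 1.0), -0.5)
        print(x1, x2, "AND:", and_out, "OR:", or_out)
```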

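Boolean XOR [Figure: a network with inputs $x_1$, $x_2$, hidden units $h_1$ and $h_2$ (labeled AND and OR), and output o computing XOR, next to a truth table with columns input x1, input x2, output; the weight and table values did not survive extraction]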
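One way to realize XOR with threshold units is to combine an OR unit and an AND unit in a hidden layer. The weights below are an assumed choice that works, not the values from the slide's figure.

```python
def step(net):
    """Threshold activation: 1 if net > 0, else 0."""
    return 1 if net > 0 else 0


def xor_net(x1, x2):
    """XOR = OR and not AND, built from three threshold units (assumed weights)."""
    h_or = step(x1 + x2 - 0.5)       # fires if at least one input is on
    h_and = step(x1 + x2 - 1.5)      # fires only if both inputs are on
    return step(h_or - h_and - 0.5)  # on when OR fires but AND does not


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))
```

No single threshold unit can do this, because XOR is not linearly separable; the hidden layer is what makes it possible.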

Supervised Learning - Backprop How do we train the weights of the network  Basic Concepts Use a continuous, differentiable activation function (Sigmoid) Use the idea of gradient descent on the error surface Extend to multiple layers

Backprop To learn on data which is not linearly separable:  Build multiple layer networks (hidden layer)  Use a sigmoid squashing function instead of a step function.

Tasks Unconstrained pattern classification Credit assessment Digit Classification Speech Recognition Function approximation Learning control Stock prediction

Sigmoid Squashing Function [Figure: a unit with inputs $y_1, y_2, \ldots, y_n$ weighted by $w_1, w_2, \ldots, w_n$, a bias input $y_0 = 1$ weighted by $w_0$, and a single output]

The Sigmoid Function [Plot: x = net, y = a (the activation). The curve flattens toward Output = 0 at the far left and Output = 1 at the far right; sensitivity to input is highest in the middle region.]

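Nice Property of Sigmoids: the derivative can be computed from the output alone. If $y = 1/(1 + e^{-x})$, then $dy/dx = y(1 - y)$.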
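A quick numerical check of that property (the derivative of the logistic sigmoid equals y(1 - y), where y is its output) can be sketched in Python. The helper names are hypothetical, and the comparison uses a central finite difference.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sigmoid_prime(x):
    """Derivative written purely in terms of the output y."""
    y = sigmoid(x)
    return y * (1.0 - y)


def numeric_derivative(f, x, h=1e-6):
    """Central finite-difference estimate of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


for x in (-2.0, 0.0, 0.5, 3.0):
    print(x, sigmoid_prime(x), numeric_derivative(sigmoid, x))
```

Because the derivative needs only the output already computed in the forward pass, the backward pass gets it almost for free.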

Gradient Descent

Gradient Descent on an error

Learning as Gradient Descent [Figures: error surface for a 2-weight linear network; complex error surface for a hypothetical network training problem]

Learning Rule – Gradient Descent on a Root Mean Square (RMS) error. Learn $w_i$’s that minimize the squared error $E = \frac{1}{2}\sum_{i \in O}(t_i - y_i)^2$, where O = output layer.

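Gradient Descent. Gradient: $\nabla E = \left[\frac{\partial E}{\partial w_0}, \frac{\partial E}{\partial w_1}, \ldots, \frac{\partial E}{\partial w_n}\right]$. Training rule: $\Delta w_i = -\eta \frac{\partial E}{\partial w_i}$, where $\eta$ is the learning rate.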

Gradient Descent [Figure: error surface drawn over two axes labeled $i_1$ and $i_2$] Global minimum: this is your goal. It should be 4-D (3 weights), but you get the idea.
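To make the descent concrete, here is a small Python sketch of batch gradient descent for a two-weight linear unit. The training examples (generated from the assumed target t = 2·i1 − i2), the learning rate 0.1, and the step count are all illustrative assumptions.

```python
def predict(weights, xs):
    """Output of a linear unit: weighted sum of its inputs."""
    return sum(w * x for w, x in zip(weights, xs))


def gradient(weights, examples):
    """Gradient of E = 1/2 * sum (t - y)^2 with respect to each weight."""
    grad = [0.0] * len(weights)
    for xs, t in examples:
        y = predict(weights, xs)
        for j, x in enumerate(xs):
            grad[j] += -(t - y) * x
    return grad


def gradient_descent(examples, eta=0.1, steps=200):
    weights = [0.0, 0.0]
    for _ in range(steps):
        grad = gradient(weights, examples)
        # Training rule: step against the gradient, scaled by the learning rate.
        weights = [w - eta * g for w, g in zip(weights, grad)]
    return weights


# Assumed data: targets follow t = 2*i1 - 1*i2 exactly.
EXAMPLES = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0), ((2.0, 1.0), 3.0)]
print(gradient_descent(EXAMPLES))
```

For a linear unit the error surface is a single bowl, so the weights settle at the global minimum; the multi-layer case discussed next has no such guarantee.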

Backpropagation Algorithm Generalization to multiple layers and multiple output units

Back-Propagation Algorithm: We define the error term for a single node to be $t_i - y_i$, where $x_i = \sum_j w_{ij} y_j$ is the net input to node i, $y_i = f(x_i)$ is its output, and $t_i$ is the target. Sigmoid: $f(x) = 1/(1 + e^{-x})$.

Backprop Details Here we go…

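The output layer. [Figure: three layers of units indexed k, j, i from input to output, with weights $w_{jk}$ into unit j and $w_{ij}$ into output unit i] $E = \text{Error} = \frac{1}{2}\sum_i (t_i - y_i)^2$, where $y_i$ is the output and $t_i$ the target. $\Delta w_{ij} = -\eta \frac{\partial E}{\partial w_{ij}}$, with $\frac{\partial E}{\partial w_{ij}} = \frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial x_i}\frac{\partial x_i}{\partial w_{ij}} = -(t_i - y_i)\, f'(x_i)\, y_j$. The derivative of the sigmoid is just $f'(x_i) = y_i(1 - y_i)$, so $\Delta w_{ij} = \eta\,(t_i - y_i)\, y_i(1 - y_i)\, y_j = \eta\,\delta_i\, y_j$ with $\delta_i = (t_i - y_i)\, y_i(1 - y_i)$, where $\eta$ is the learning rate.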

Nice Property of Sigmoids

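The hidden layer. [Figure: three layers of units indexed k, j, i from input to output, with weights $w_{jk}$ into hidden unit j and $w_{ij}$ into output unit i] $E = \text{Error} = \frac{1}{2}\sum_i (t_i - y_i)^2$, where $y_i$ is the output and $t_i$ the target. $\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}}$, with $\frac{\partial E}{\partial w_{jk}} = \frac{\partial E}{\partial y_j}\frac{\partial y_j}{\partial x_j}\frac{\partial x_j}{\partial w_{jk}}$ and $\frac{\partial E}{\partial y_j} = \sum_i \frac{\partial E}{\partial y_i}\frac{\partial y_i}{\partial x_i}\frac{\partial x_i}{\partial y_j} = -\sum_i (t_i - y_i)\, f'(x_i)\, w_{ij}$. Hence $\Delta w_{jk} = \eta \left[\sum_i (t_i - y_i)\, y_i(1 - y_i)\, w_{ij}\right] y_j(1 - y_j)\, y_k = \eta\,\delta_j\, y_k$, with $\delta_j = y_j(1 - y_j)\sum_i \delta_i w_{ij}$ and $\delta_i = (t_i - y_i)\, y_i(1 - y_i)$.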
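One way to gain confidence in the output-layer and hidden-layer expressions is to compare them with finite-difference gradients. The Python sketch below uses the standard delta-rule forms for sigmoid units and the error E = ½ Σ (t − y)²; the tiny 2-2-1 network, its weights, and the omission of bias terms are assumptions made to keep the example short.

```python
import math


def f(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_jk, w_ij, y_k):
    """Layers run k (input) -> j (hidden) -> i (output); no bias terms here."""
    y_j = [f(sum(w * y for w, y in zip(row, y_k))) for row in w_jk]
    y_i = [f(sum(w * y for w, y in zip(row, y_j))) for row in w_ij]
    return y_j, y_i


def error(w_jk, w_ij, y_k, t):
    _, y_i = forward(w_jk, w_ij, y_k)
    return 0.5 * sum((ti - yi) ** 2 for ti, yi in zip(t, y_i))


def backprop_gradients(w_jk, w_ij, y_k, t):
    """Return dE/dw for both weight layers using the delta terms."""
    y_j, y_i = forward(w_jk, w_ij, y_k)
    delta_i = [(ti - yi) * yi * (1 - yi) for ti, yi in zip(t, y_i)]
    delta_j = [yj * (1 - yj) * sum(delta_i[i] * w_ij[i][j] for i in range(len(w_ij)))
               for j, yj in enumerate(y_j)]
    grad_ij = [[-di * yj for yj in y_j] for di in delta_i]   # dE/dw_ij = -delta_i * y_j
    grad_jk = [[-dj * yk for yk in y_k] for dj in delta_j]   # dE/dw_jk = -delta_j * y_k
    return grad_jk, grad_ij


def numeric_grad_jk(w_jk, w_ij, y_k, t, j, k, h=1e-6):
    """Central finite difference for one hidden-layer weight."""
    plus = [row[:] for row in w_jk]
    minus = [row[:] for row in w_jk]
    plus[j][k] += h
    minus[j][k] -= h
    return (error(plus, w_ij, y_k, t) - error(minus, w_ij, y_k, t)) / (2 * h)


# Assumed example network and pattern.
W_JK = [[0.3, -0.2], [0.5, 0.1]]
W_IJ = [[0.7, -0.4]]
Y_K, T = [1.0, 0.5], [1.0]
g_jk, g_ij = backprop_gradients(W_JK, W_IJ, Y_K, T)
print(g_jk[0][1], numeric_grad_jk(W_JK, W_IJ, Y_K, T, 0, 1))
```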

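Let’s just do an example. $E = \text{Error} = \frac{1}{2}\sum_i (t_i - y_i)^2$. [Figure: a single sigmoid unit with inputs $i_1$ and $i_2$, a bias input $b = 1$, weights $w_{01}$, $w_{02}$, $w_{0b}$, net input $x_0$, and output $y_0 = f(x_0)$; the table of $i_1$, $i_2$, $y_0$ values and the weight values did not survive extraction] For the single output, $E = \frac{1}{2}(t_0 - y_0)^2$. With $y_0 = 1/(1 + e^{-0.5})$ and target 0, $E = \frac{1}{2}(0 - y_0)^2$. The slide then supposes a value for the learning rate (not recoverable here).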
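The pieces that survive in this example (a net input of 0.5 and a target of 0) are enough to replay one update in Python. In this sketch the inputs i1 = i2 = 0, the bias weight 0.5, the other two weights (0.8 and 0.6), and the learning rate 0.5 are all assumptions chosen for illustration, not values recovered from the slide.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Assumed values (hypothetical, chosen so that the net input is 0.5).
inputs = [0.0, 0.0, 1.0]     # i1, i2, and the bias input b = 1
weights = [0.8, 0.6, 0.5]    # w01, w02, w0b
target = 0.0
eta = 0.5                    # learning rate

x0 = sum(w * i for w, i in zip(weights, inputs))    # net input
y0 = sigmoid(x0)                                    # 1 / (1 + e^-0.5)
error = 0.5 * (target - y0) ** 2
delta = (target - y0) * y0 * (1.0 - y0)             # output-unit error term
new_weights = [w + eta * delta * i for w, i in zip(weights, inputs)]

print("y0 =", round(y0, 4), "E =", round(error, 4))
print("updated weights:", [round(w, 4) for w in new_weights])
```

With both inputs at zero only the bias weight moves, and it moves down because the output was too high for a target of 0.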

An informal account of BackProp
For each pattern in the training set:
- Compute the error at the output nodes
- Compute Δw for each wt in 2nd layer
- Compute delta (generalized error expression) for hidden units
- Compute Δw for each wt in 1st layer
After amassing Δw for all weights, change each wt a little bit, as determined by the learning rate

Backprop learning algorithm (incremental mode)
n = 1;
initialize w(n) randomly;
while (stopping criterion not satisfied and n < max_iterations)
    for each example (x, d)
        - run the network with input x and compute the output y
        - update the weights in backward order, starting from those of the output layer: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, with $\Delta w_{ij}$ computed using the (generalized) Delta rule
    end-for
    n = n + 1;
end-while;

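Backpropagation Algorithm
Initialize all weights to small random numbers.
For each training example do:
- For each hidden unit h: compute its output $y_h$.
- For each output unit k: compute its output $y_k$ and its error term $\delta_k = (t_k - y_k)\, y_k(1 - y_k)$.
- For each hidden unit h: compute its error term $\delta_h = y_h(1 - y_h)\sum_k w_{kh}\,\delta_k$.
- Update each network weight $w_{ij}$: $w_{ij} \leftarrow w_{ij} + \Delta w_{ij}$, with $\Delta w_{ij} = \eta\,\delta_i\, y_j$.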
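The algorithm can be sketched end to end in plain Python as incremental (per-example) backprop on XOR. Everything numeric here is an assumption for illustration: four hidden units, weights drawn uniformly from [-1, 1], learning rate 0.5, 10000 epochs, and a fixed random seed; `train_xor` is a hypothetical name.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train_xor(hidden=4, eta=0.5, epochs=10000, seed=0):
    rng = random.Random(seed)
    # The last weight in every row is the bias weight (its input is fixed at 1).
    w_h = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    w_o = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

    def forward(x):
        xs = list(x) + [1.0]
        hs = [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_h] + [1.0]
        y = sigmoid(sum(w * v for w, v in zip(w_o, hs)))
        return xs, hs, y

    def total_error():
        return 0.5 * sum((t - forward(x)[2]) ** 2 for x, t in data)

    initial = total_error()
    for _ in range(epochs):
        for x, t in data:
            xs, hs, y = forward(x)                         # "activations" flow forward
            delta_o = (t - y) * y * (1 - y)                # output error term
            delta_h = [hs[j] * (1 - hs[j]) * delta_o * w_o[j]
                       for j in range(hidden)]             # "errors" flow backward
            for j in range(hidden + 1):
                w_o[j] += eta * delta_o * hs[j]
            for j in range(hidden):
                for k in range(3):
                    w_h[j][k] += eta * delta_h[j] * xs[k]
    return initial, total_error(), forward


initial_error, final_error, net = train_xor()
print("error before:", initial_error, "after:", final_error)
for pattern in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(pattern, round(net(pattern)[2], 3))
```

Note that the hidden error terms are computed before the output weights are changed, so both layers are updated from the same forward pass.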

Backpropagation Algorithm “activations” “errors”

What if all the input-to-hidden node weights are initially equal?

Momentum term: The speed of learning is governed by the learning rate. If the rate is low, convergence is slow. If the rate is too high, the error oscillates without reaching the minimum. Momentum tends to smooth small weight error fluctuations: it accelerates the descent in steady downhill directions and has a stabilizing effect in directions that oscillate in time.
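The slide does not give the update formula, so the sketch below assumes the common form in which each step adds a fraction α of the previous step: Δw(n) = −η ∇E + α Δw(n−1). The quadratic error surface, η = 0.01, and α = 0.9 are illustrative assumptions.

```python
def descend(eta, alpha, steps=100):
    """Minimize E(w) = 0.5 * (1*w1^2 + 100*w2^2) with optional momentum alpha."""
    curvatures = (1.0, 100.0)     # a shallow direction and a steep direction
    w = [1.0, 1.0]
    step = [0.0, 0.0]             # previous weight change
    for _ in range(steps):
        grad = [c * wi for c, wi in zip(curvatures, w)]
        step = [alpha * s - eta * g for s, g in zip(step, grad)]
        w = [wi + s for wi, s in zip(w, step)]
    return 0.5 * sum(c * wi * wi for c, wi in zip(curvatures, w))


print("no momentum:  ", descend(eta=0.01, alpha=0.0))
print("with momentum:", descend(eta=0.01, alpha=0.9))
```

The learning rate has to stay small because of the steep direction, which makes progress along the shallow direction slow; the momentum term speeds up that steady downhill movement.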

Convergence May get stuck in local minima Weights may diverge …but often works well in practice Representation power:  2 layer networks : any continuous function  3 layer networks : any function

Pattern Separation and NN architecture

Local Minimum USE A RANDOM COMPONENT SIMULATED ANNEALING

Adjusting Learning Rate and the Hessian: The Hessian H is the matrix of second derivatives of E with respect to w. The Hessian tells you about the shape of the cost surface: the eigenvalues of H measure the steepness of the surface along the curvature directions. A large eigenvalue => steep curvature => need a small learning rate. The learning rate should be proportional to 1/eigenvalue.
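The link between curvature and a usable learning rate is easy to see in one dimension. In this Python sketch the error is E(w) = ½ λ w², so the Hessian is just λ; the value λ = 100 and the three learning rates are assumptions picked to sit on either side of the stability limit 2/λ.

```python
def final_distance(eta, eigenvalue, steps=50):
    """Run gradient descent on E(w) = 0.5 * eigenvalue * w^2 from w = 1; return |w|."""
    w = 1.0
    for _ in range(steps):
        w -= eta * eigenvalue * w      # gradient of E is eigenvalue * w
    return abs(w)


for eta in (0.01, 0.019, 0.021):
    print("eta =", eta, "-> |w| after 50 steps:", final_distance(eta, 100.0))
```

With η = 1/λ the minimum is reached in one step, just below 2/λ the weight oscillates but shrinks, and just above 2/λ it diverges.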

Overfitting and generalization TOO MANY HIDDEN NODES TENDS TO OVERFIT

Stopping criteria Sensible stopping criteria:  total mean squared error change: Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).  generalization based criterion: After each epoch the NN is tested for generalization. If the generalization performance is adequate then stop. If this stopping criterion is used then the part of the training set used for testing the network generalization will not be used for updating the weights.

Overfitting in ANNs

Early Stopping (Important!!!) Stop training when error goes up on validation set
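A minimal sketch of this rule in Python is shown below. The helper `early_stopping_epoch`, its `patience` parameter (how many non-improving epochs to tolerate), and the validation-error values are assumptions for illustration, not part of the slides.

```python
def early_stopping_epoch(validation_errors, patience=1):
    """Return the epoch whose weights should be kept.

    Training stops once the validation error has failed to improve for
    `patience` consecutive epochs; the best epoch seen so far is returned.
    """
    best_error = float("inf")
    best_epoch = -1
    bad_epochs = 0
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_error, best_epoch, bad_epochs = err, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break
    return best_epoch


# Assumed validation-error curve: improves, then starts to rise.
errors = [0.9, 0.6, 0.4, 0.35, 0.37, 0.45, 0.3]
print(early_stopping_epoch(errors, patience=1))
print(early_stopping_epoch(errors, patience=3))
```

A larger patience tolerates brief rises in validation error at the cost of extra training epochs.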


Architectural Considerations: What is the right size network for a given job? How many hidden units? Too many: no generalization. Too few: no solution. Possible answer: a constructive algorithm, e.g. Cascade Correlation (Fahlman & Lebiere 1990), etc.

Network Topology: The number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error. Two types of adaptive algorithms can be used: start from a large network and successively remove some nodes and links until network performance degrades; or begin with a small network and introduce new neurons until performance is satisfactory.

Cascade Correlation It starts with a minimal network, consisting only of an input and an output layer. Minimizing the overall error of a net, it adds step by step new hidden units to the hidden layer. Cascade-Correlation is a supervised learning architecture which builds a near minimal multi-layer network topology. The two advantages of this architecture are that  there is no need for a user to worry about the topology of the network, and that  Cascade-Correlation learns much faster than the usual learning algorithms.

Supervised vs Unsupervised Learning: Backprop requires a 'target'; how realistic is that? Hebbian learning is unsupervised, but limited in power. How can we combine the power of backprop (and friends) with the ideal of unsupervised learning?

Autoassociative Networks input copy of input as target Network trained to reproduce the input at the output layer Non-trivial if number of hidden units is smaller than inputs/outputs Forced to develop compressed representations of the patterns Hidden unit representations may reveal natural kinds (e.g. Vowels vs Consonants) Problem of explicit teacher is circumvented

Problems and Networks Some problems have natural "good" solutions Solving a problem may be possible by providing the right armory of general-purpose tools, and recruiting them as needed Networks are general purpose tools. Choice of network type, training, architecture, etc greatly influences the chances of successfully solving a problem Tension: Tailoring tools for a specific job Vs Exploiting general purpose learning mechanism

Summary Multiple layer feed-forward networks  Replace Step with Sigmoid (differentiable) function  Learn weights by gradient descent on error function  Backpropagation algorithm for learning  Avoid overfitting by early stopping

ALVINN drives 70mph on highways

Use MLP Neural Networks when … (vectored) Real inputs, (vectored) real outputs You’re not interested in understanding how it works Long training times acceptable Short execution (prediction) times required Robust to noise in the dataset

Applications of FFNN. Classification, pattern recognition: FFNN can be applied to tackle non-linearly separable learning problems: recognizing printed or handwritten characters; face recognition; classification of loan applications into credit-worthy and non-credit-worthy groups; analysis of sonar and radar to determine the nature of the source of a signal. Regression and forecasting: FFNN can be applied to learn non-linear functions (regression), in particular functions whose input is a sequence of measurements over time (time series).

Extensions of Backprop Nets Recurrent Architectures Backprop through time

Elman Nets & Jordan Nets: Updating the context as we receive input. In Jordan nets we model “forgetting” as well. The recurrent connections have fixed weights. You can train these networks using good ol’ backprop. [Figures: two networks, each with Input, Context, Hidden, and Output layers; the recurrent connections are labeled with fixed weights 1 and α]
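The context updates can be sketched in a few lines of Python. This follows the usual textbook definitions, which are an assumption about what the figures show: an Elman context is a copy of the previous hidden activations (fixed weight 1), and a Jordan context adds the previous output to a decayed copy of itself (fixed self-weight α, the “forgetting” factor). Function names are hypothetical.

```python
def elman_context(previous_hidden):
    """Elman net: context units copy the previous hidden activations (weight 1)."""
    return list(previous_hidden)


def jordan_context(previous_context, previous_output, alpha):
    """Jordan net: context = alpha * old context + previous output (weight 1)."""
    return [alpha * c + o for c, o in zip(previous_context, previous_output)]


# Assumed example: a constant output of [1, 0] fed back for three steps.
context = [0.0, 0.0]
for _ in range(3):
    context = jordan_context(context, [1.0, 0.0], alpha=0.5)
    print(context)
print(elman_context([0.2, 0.7]))
```

Because these recurrent weights are fixed, only the ordinary feed-forward weights are trained, which is why plain backprop suffices.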

Recurrent Backprop: We’ll pretend to step through the network one iteration at a time. Backprop as usual, but average equivalent weights (e.g. all 3 highlighted edges on the right are equivalent). [Figure: a recurrent network over units a, b, c with weights w1, w2, w3, w4, unrolled for 3 iterations into stacked copies of a, b, c]