Connectionist Models: Lecture 3 Srini Narayanan CS182/CogSci110/Ling109 Spring 2006.


2 Connectionist Models: Lecture 3 Srini Narayanan CS182/CogSci110/Ling109 Spring 2006

3 Gradient descent for the output layer. Network: layer k feeds layer j, which feeds the output layer i, with weights w_jk and w_ij; output y_i, target t_i. Error: E = ½ ∑_i (t_i − y_i)². The derivative of the sigmoid is just f'(x) = f(x)(1 − f(x)), so the output-layer update (with learning rate η) is Δw_ij = η δ_i y_j, where δ_i = (t_i − y_i) y_i (1 − y_i).

4 Nice Property of Sigmoids
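The "nice property" is that the sigmoid's derivative can be computed from its output alone: f'(x) = f(x)(1 − f(x)). A minimal Python sketch (mine, not from the slides) checking this against a numerical derivative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x = 0.5
y = sigmoid(x)

# derivative expressed purely in terms of the output y = f(x)
analytic = y * (1.0 - y)

# central-difference numerical derivative for comparison
eps = 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)

print(round(analytic, 4), round(numeric, 4))   # both ~0.2350
```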

5 Gradient descent for the hidden layer. Same network and error E = ½ ∑_i (t_i − y_i)². For a hidden unit j, the generalized error term is δ_j = y_j (1 − y_j) ∑_i δ_i w_ij, and the update to its incoming weight is Δw_jk = η δ_j y_k.

6 Let's just do an example. One sigmoid output unit y_0 with inputs i_1, i_2 and a bias input b = 1; weights w_01 = 0.8, w_02 = 0.6, w_0b = 0.5; error E = ½ (t_0 − y_0)². Training data (i_1, i_2 → target): (0,0) → 0, (0,1) → 1, (1,0) → 1, (1,1) → 1 (i.e. OR). For the input (0,0): the net input is 0.5, so y_0 = 1/(1 + e^−0.5) = 0.6224 and E = ½ (0 − 0.6224)² = 0.1937. Suppose the learning rate is η = 0.5: only the bias weight changes (the other inputs are 0), giving w_0b ≈ 0.4268.
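A small Python check of the numbers on this slide (my own sketch; the variable names are mine): one sigmoid output unit with inputs i1 = i2 = 0, a bias input of 1, weights 0.8, 0.6, 0.5, target 0, and learning rate 0.5.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

i1, i2, bias = 0.0, 0.0, 1.0
w01, w02, w0b = 0.8, 0.6, 0.5
t0, lr = 0.0, 0.5

net = w01 * i1 + w02 * i2 + w0b * bias   # = 0.5
y0 = sigmoid(net)                        # ~0.6224
E = 0.5 * (t0 - y0) ** 2                 # ~0.1937

# delta rule for the output unit: delta = (t - y) * y * (1 - y)
delta0 = (t0 - y0) * y0 * (1.0 - y0)

# weight updates: dw = lr * delta * (input feeding that weight)
w01 += lr * delta0 * i1      # unchanged, input is 0
w02 += lr * delta0 * i2      # unchanged, input is 0
w0b += lr * delta0 * bias    # 0.5 -> ~0.427 (the slide rounds to 0.4268)

print(round(y0, 4), round(E, 4), round(w0b, 4))
```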

7 An informal account of BackProp. For each pattern in the training set: compute the error at the output nodes; compute Δw for each weight in the 2nd layer; compute δ (the generalized error expression) for the hidden units; compute Δw for each weight in the 1st layer. After amassing Δw for all weights, change each weight a little bit, as determined by the learning rate.

8 Backprop learning algorithm (incremental mode):
n = 1; initialize w(n) randomly;
while (stopping criterion not satisfied and n < max_iterations)
    for each example (x, d)
        run the network with input x and compute the output y
        update the weights in backward order, starting from those of the output layer:
            w_ij(n+1) = w_ij(n) + Δw_ij(n), with Δw_ij computed using the (generalized) Delta rule
    end-for
    n = n + 1;
end-while;

9 Backpropagation Algorithm. Initialize all weights to small random numbers. For each training example do: feed the input forward and compute the activation of every hidden unit h and every output unit k; for each output unit k compute its error term δ_k = o_k (1 − o_k)(t_k − o_k); for each hidden unit h compute δ_h = o_h (1 − o_h) ∑_k w_hk δ_k; update each network weight w_ij with w_ij ← w_ij + Δw_ij, where Δw_ij = η δ_j x_ij.
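A compact Python sketch of slides 7-9 (my own code, not the course's): incremental backprop for one hidden layer of sigmoid units. The network sizes, learning rate, and the XOR data at the end are illustration choices, not anything specified in the lecture.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(examples, n_in, n_hid, n_out, lr=0.5, epochs=5000):
    # initialize all weights (the last index of each row is a bias weight) to small random numbers
    w_hid = [[random.uniform(-0.1, 0.1) for _ in range(n_in + 1)] for _ in range(n_hid)]
    w_out = [[random.uniform(-0.1, 0.1) for _ in range(n_hid + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:                 # incremental (per-example) updates
            xb = list(x) + [1.0]              # append the bias input
            h = [sigmoid(sum(w * v for w, v in zip(ws, xb))) for ws in w_hid]
            hb = h + [1.0]
            y = [sigmoid(sum(w * v for w, v in zip(ws, hb))) for ws in w_out]
            # output-unit error terms: delta_k = (t_k - y_k) * y_k * (1 - y_k)
            d_out = [(tk - yk) * yk * (1 - yk) for tk, yk in zip(t, y)]
            # hidden-unit error terms: delta_j = h_j * (1 - h_j) * sum_k w_kj * delta_k
            d_hid = [h[j] * (1 - h[j]) * sum(d_out[k] * w_out[k][j] for k in range(n_out))
                     for j in range(n_hid)]
            # weight updates, output layer first: w_ij += lr * delta_i * input_j
            for k in range(n_out):
                for j in range(n_hid + 1):
                    w_out[k][j] += lr * d_out[k] * hb[j]
            for j in range(n_hid):
                for i in range(n_in + 1):
                    w_hid[j][i] += lr * d_hid[j] * xb[i]
    return w_hid, w_out

# toy usage: learn XOR, a classic non-linearly-separable problem
data = [([0, 0], [0]), ([0, 1], [1]), ([1, 0], [1]), ([1, 1], [0])]
train(data, n_in=2, n_hid=2, n_out=1)
```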

10 Backpropagation Algorithm. [Figure: activations flow forward through the network, while the errors (δ's) propagate backward.]

11 What if all the input-to-hidden-node weights are initially equal?

12 Momentum term. The speed of learning is governed by the learning rate: if the rate is low, convergence is slow; if the rate is too high, the error oscillates without reaching the minimum. Momentum tends to smooth out small weight-error fluctuations: it accelerates the descent in steady downhill directions and has a stabilizing effect in directions that oscillate in time.
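One common way to write the momentum update (a sketch in my own notation, with made-up numbers): the new step mixes the current gradient step with a fraction α of the previous step, Δw(n) = −η ∂E/∂w + α Δw(n−1).

```python
# minimal momentum update for a single weight (names and values are mine)
lr, alpha = 0.1, 0.9        # learning rate and momentum coefficient
w, prev_dw = 0.0, 0.0

def gradient(w):
    # gradient of a toy error E(w) = (w - 3)^2 / 2, whose minimum is at w = 3
    return w - 3.0

for _ in range(200):
    dw = -lr * gradient(w) + alpha * prev_dw   # reuse part of the previous step
    w += dw
    prev_dw = dw

print(round(w, 3))   # converges to the minimum at w = 3
```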

13 Convergence. Gradient descent may get stuck in local minima, and the weights may diverge, but it works well in practice. Representation power: 2-layer networks can approximate any continuous function; 3-layer networks can approximate any function.

14 Pattern Separation and NN architecture

15 Local minima: use a random component (e.g. simulated annealing).

16 Adjusting the Learning Rate: the Hessian. The Hessian H is the matrix of second derivatives of E with respect to the weights w. The Hessian tells you about the shape of the cost surface: the eigenvalues of H measure the steepness of the surface along its curvature directions. A large eigenvalue means steep curvature and calls for a small learning rate; the learning rate should be proportional to 1/eigenvalue.
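A small NumPy illustration of the 1/eigenvalue heuristic (my own example, not from the lecture): on a quadratic cost E(w) = ½ wᵀHw, plain gradient descent is stable only when the learning rate is below 2 divided by the largest eigenvalue of H.

```python
import numpy as np

# a made-up quadratic cost E(w) = 0.5 * w^T H w with very different curvatures
H = np.array([[10.0, 0.0],
              [0.0,  0.5]])

eigvals = np.linalg.eigvalsh(H)
print(eigvals)                      # [0.5, 10.0]

# gradient descent on E: w <- w - lr * H @ w; stable only if lr < 2 / max eigenvalue
safe_lr = 1.0 / eigvals.max()       # heuristic: learning rate ~ 1 / largest eigenvalue
w = np.array([1.0, 1.0])
for _ in range(100):
    w = w - safe_lr * H @ w
print(w)                            # decays toward the minimum at the origin
```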

17 Overfitting and generalization: too many hidden nodes tend to overfit.

18 Stopping criteria. Sensible stopping criteria: (1) Total mean squared error change: backprop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (typically in the range [0.01, 0.1]). (2) Generalization-based criterion: after each epoch the network is tested for generalization, and training stops once the generalization performance is adequate. With this criterion, the data held out for testing generalization is not used for updating the weights.
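A sketch of the generalization-based criterion as an early-stopping loop; `train_one_epoch` and `validate` are hypothetical callables standing in for whatever training and evaluation code you already have.

```python
def early_stopping_train(train_one_epoch, validate, max_epochs=1000, patience=10):
    """Generalization-based stopping: train until validation error stops improving.

    train_one_epoch() runs one pass of backprop over the training data;
    validate() returns the current error on a held-out set that is never
    used to update the weights. Both are supplied by the caller.
    """
    best_val, best_epoch, waited = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_one_epoch()
        val_err = validate()
        if val_err < best_val:
            best_val, best_epoch, waited = val_err, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break      # validation error hasn't improved for `patience` epochs
    return best_epoch, best_val
```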

19 Overfitting in ANNs

20 Summary. Multiple-layer feed-forward networks: replace the step function with the differentiable sigmoid; learn the weights by gradient descent on the error function, using the backpropagation algorithm; avoid overfitting by early stopping.

21 ALVINN drives at 70 mph on highways

22 Use MLP neural networks when: you have (vector-valued) real inputs and (vector-valued) real outputs; you are not interested in understanding how the model works; long training times are acceptable; short execution (prediction) times are required; you need robustness to noise in the dataset.

23 Applications of FFNN. Classification and pattern recognition: FFNNs can be applied to non-linearly separable learning problems, e.g. recognizing printed or handwritten characters, face recognition, classifying loan applications into credit-worthy and non-credit-worthy groups, and analyzing sonar or radar signals to determine the nature of their source. Regression and forecasting: FFNNs can learn non-linear functions (regression), in particular functions whose input is a sequence of measurements over time (time series).

24 Extensions of backprop nets: recurrent architectures; backprop through time.

25 Elman Nets & Jordan Nets. Both update a context layer as input is received; in Jordan nets we model "forgetting" as well. The recurrent (copy) connections have fixed weights, so you can train these networks using good ol' backprop. [Diagrams: Elman net, where the context is a copy of the hidden layer (copy weight 1); Jordan net, where the context is fed from the output, with a decay factor α modeling forgetting.]
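A minimal sketch of one Elman-style time step (my own Python, with assumed layer sizes): the context is simply a copy of the previous hidden state, fed back through ordinary trainable weights, while the copy connection itself is fixed at weight 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2                 # made-up sizes

W_xh = rng.normal(0, 0.1, (n_hid, n_in))     # input   -> hidden (trained)
W_ch = rng.normal(0, 0.1, (n_hid, n_hid))    # context -> hidden (trained)
W_hy = rng.normal(0, 0.1, (n_out, n_hid))    # hidden  -> output (trained)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(x, context):
    h = sigmoid(W_xh @ x + W_ch @ context)   # hidden sees the input plus the previous hidden state
    y = sigmoid(W_hy @ h)
    return y, h                              # new context = copy of hidden (fixed weight 1)

context = np.zeros(n_hid)
for x in rng.normal(0, 1, (4, n_in)):        # a toy 4-step input sequence
    y, context = elman_step(x, context)
print(y)
```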

26 Recurrent Backprop. We'll pretend to step through the network one iteration at a time, unrolling it (here, 3 iterations), then backprop as usual, but average the updates for equivalent (tied) weights, e.g. the three unrolled copies of the same edge are equivalent. [Diagram: a three-node network a, b, c with weights w1 through w4, unrolled for 3 iterations.]

27 Models of Learning: Hebbian ~ coincidence; Supervised ~ correction (backprop); Recruitment ~ one trial.

28 Recruiting connections. Given that LTP involves synaptic strength changes and Hebb's rule involves coincident-activation-based strengthening of connections, how can connections between two nodes be recruited using Hebb's rule?

29 The Idea of Recruitment Learning. Suppose we want to link up node X to node Y; the idea is to pick the two nodes in the middle to link them up. Can we be sure that we can find a path to get from X to Y? The point is, with a fan-out of 1000, if we allow 2 intermediate layers, we can almost always find a path. [Diagram: X connected to Y through intermediate layers; N nodes per layer, B outgoing links per node, K intermediate layers, F = B/N.]

30 [Diagram: X and Y linked through randomly connected intermediate nodes.]

31 [Diagram: X and Y linked through randomly connected intermediate nodes (continued).]

32 Finding a Connection. P = probability of NO link between X and Y; N = number of units in a "layer"; B = number of randomly chosen outgoing links per unit; F = B/N, the branching factor; K = number of intermediate layers (2 in the example). P = (1 − F)^(B^K); number of paths ≈ (1 − P_(K−1)) · B.
          N = 10^6    N = 10^7    N = 10^8
  K = 0   0.999       0.9999      0.99999
  K = 1   0.367       0.905       0.989
  K = 2   10^-440     10^-44      10^-5
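The table can be reproduced in a few lines of Python (my sketch; B = 1000 is the fan-out assumed on slide 29). P is evaluated in log space because (1 − F)^(B^K) underflows a double when K = 2.

```python
import math

B = 1000                          # fan-out per unit ("fan-out of 1000", slide 29)

for N in (10**6, 10**7, 10**8):
    F = B / N                     # branching factor
    for K in (0, 1, 2):           # number of intermediate layers
        # P = (1 - F) ** (B ** K); work in log10 so the K = 2 case doesn't underflow
        log10_p = (B ** K) * math.log10(1.0 - F)
        p = 10 ** log10_p if log10_p > -300 else 0.0
        print(f"N = 1e{round(math.log10(N))}, K = {K}: P ≈ {p:.4g} (log10 P ≈ {log10_p:.1f})")
```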

33 Finding a Connection in Random Networks. For networks with N nodes and a sqrt(N) branching factor, there is a high probability of finding good links.

34 Recruiting a Connection in Random Networks: informal algorithm. (1) Activate the two nodes to be linked. (2) Have nodes with double activation strengthen their active synapses (Hebb). (3) There is evidence for a "now print" signal.

35 Triangle nodes and feature structures. [Diagram: a triangle node binding three nodes A, B, C.]

36 "They all rose". Triangle nodes: when two of the nodes fire, the third also fires; a model of spreading activation.

37 Representing concepts using triangle nodes

38 Feature Structures in Four Domains.
Makin: dept ~ EE, sid ~ 001, emp ~ GSI. Bryant: dept ~ CS, sid ~ 002, emp ~ GSI.
Ham: Color ~ pink, Taste ~ salty. Pea: Color ~ green, Taste ~ sweet.
Container: Inside ~ region, Outside ~ region, Bdy. ~ curve. Purchase: Buyer ~ person, Seller ~ person, Cost ~ money, Goods ~ thing.
Push: Schema ~ slide, Posture ~ palm, Dir. ~ away. Stroll: Schema ~ walk, Speed ~ slow, Dir. ~ ANY.


40 Recruiting triangle nodes. Let's say we are trying to remember a green circle; currently there are only weak connections between the concepts (dotted lines). [Diagram: has-color weakly linked to blue and green; has-shape weakly linked to round and oval.]

41 Strengthen these connections and you end up with this picture: "green circle". [Diagram: has-color now strongly linked to green and has-shape to round.]

42 Distributed vs Localist Representation.
Distributed: John 1100, Paul 0110, George 0011, Ringo 1001.
Localist: John 1000, Paul 0100, George 0010, Ringo 0001.
What are the drawbacks of each representation?

43 Distributed vs Localist Representation (continued).
Distributed: John 1100, Paul 0110, George 0011, Ringo 1001. Localist: John 1000, Paul 0100, George 0010, Ringo 0001.
What happens if you want to represent a group? What happens if one neuron dies? How many persons can you represent with n bits? 2^n for a distributed code, n for a localist code.
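A tiny Python illustration of the capacity question (my own sketch): n units give n one-hot localist patterns but up to 2^n distributed binary patterns.

```python
from itertools import product

n = 4
localist = [tuple(1 if i == j else 0 for i in range(n)) for j in range(n)]
distributed = list(product([0, 1], repeat=n))

print(len(localist))      # n = 4 patterns (one dedicated unit per entity)
print(len(distributed))   # 2**n = 16 possible patterns
# Losing one unit wipes out exactly one localist entity,
# but only degrades (rather than deletes) a distributed pattern.
```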

44 Connectionist Models in Cognitive Science. [Diagram: a classification of connectionist models (Structured, PDP (Elman), Hybrid) along dimensions such as Neural vs. Conceptual and Existence vs. Data Fitting.]

45 Link to Vision: The Necker Cube

46 Spreading activation and feature structures. Parallel activation streams; top-down and bottom-up activation combine to determine the best matching structure. Triangle nodes bind features of objects to values. Mutual inhibition and competition between structures. Mental connections are active neural connections.

47 Can we formalize/model these intuitions? What is a neurally plausible computational model of spreading activation that captures these features? What does semantics mean in neurally embodied terms? What are the neural substrates of the concepts that underlie verbs, nouns, and spatial predicates?

48 Five levels of the Neural Theory of Language: Cognition and Language, Computation, Structured Connectionism, Computational Neurobiology, Biology. [Diagram maps course topics onto these levels: metaphor, grammar, abstraction, psycholinguistic experiments, SHRUTI, triangle nodes, spatial relations, motor control, neural nets and learning, and neural development, with markers for the quiz, midterm, and finals.]

49 The Color Story: A Bridge between Levels of NTL (http://www.ritsumei.ac.jp/~akitaoka/color-e.html)

50 A Tour of the Visual System. Two regions of interest: the retina and the LGN.

51 Rods and Cones in the Retina (http://www.iit.edu/~npr/DrJennifer/visual/retina.html)

52 Physiology of Color Vision (© Stephen E. Palmer, 2002). Two types of light-sensitive receptors: cones (cone-shaped, less sensitive, operate in high light, color vision) and rods (rod-shaped, highly sensitive, operate at night, gray-scale vision).

53 The Microscopic View

54 What Rods and Cones Detect. Notice that they aren't distributed evenly, and that rods are more sensitive to shorter wavelengths.

55 Center / Surround. Strong activation in the center, inhibition in the surround. The effect you get from these center/surround cells is enhanced edges. [Figure: top, the stimulus itself; middle, the brightness of the stimulus; bottom, the response of the retina.] You'll see this idea used again in Regier's model. http://www-psych.stanford.edu/~lera/psych115s/notes/lecture3/figures1.html

56 Center/Surround cells. No stimulus: both types fire at their base rate. Stimulus in the center: ON-center-OFF-surround cells fire rapidly; OFF-center-ON-surround cells don't fire. Stimulus in the surround: OFF-center-ON-surround cells fire rapidly; ON-center-OFF-surround cells don't fire. Stimulus in both regions: both fire slowly.
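A rough 1-D sketch of the edge-enhancement idea (mine, not from the slides): model an ON-center-OFF-surround receptive field as a difference of Gaussians and convolve it with a step edge in brightness; the response is near zero in uniform regions and swings sharply at the edge.

```python
import numpy as np

x = np.arange(-20, 21)
center = np.exp(-x**2 / (2 * 1.5**2))
center /= center.sum()
surround = np.exp(-x**2 / (2 * 6.0**2))
surround /= surround.sum()
dog = center - surround                       # ON-center, OFF-surround receptive field

stimulus = np.r_[np.zeros(50), np.ones(50)]   # a step edge in brightness
response = np.convolve(stimulus, dog, mode="same")

print(response[:5].round(3), response[48:53].round(3))
# ~0 far from the edge (uniform dark or light), a clear swing right at the edge
```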

57 Color Opponent Cells. These cells are found in the LGN. Four color channels: red, green, blue, yellow, arranged in R/G and B/Y pairs much like the center/surround cells. We can use these to determine the visual system's fundamental hue responses. [Plots (monkey brain): mean spikes/sec vs. wavelength (mμ) for +R−G, +G−R, +Y−B, and +B−Y cells.]

