
1 Secrets of Neural Network Models Ken Norman Princeton University July 24, 2003 Note: These slides have been provided online for the convenience of students attending the 2003 Merck summer school, and for individuals who have explicitly been given permission by Ken Norman. Please do not distribute these slides to third parties without permission from Ken (which is easy to get… just email Ken at knorman@princeton.edu).

2 The Plan, and Acknowledgements The Plan: I will teach you all of the secrets of neural network models in 2.5 hours. Lecture for the first half; hands-on workshop for the second half. Acknowledgements: Randy O’Reilly; my lab: Greg Detre, Ehren Newman, Adler Perotte, and Sean Polyn

3 The Big Question How does the gray glop in your head give rise to cognition? We know a lot about the brain, and we also know a lot about cognition The real challenge is to bridge between these two levels

4 Complexity and Levels of Analysis The brain is very complex: billions of neurons, trillions of synapses, all changing every nanosecond Each neuron is a very complex entity unto itself We need to abstract away from this complexity! Is there some simpler, higher level for describing what the brain does during cognition?

5 We want to draw on neurobiology for ideas about how the brain performs a particular kind of task Our models should be consistent with what we know about how the brain performs the task But at the same time, we want to include only aspects of neurobiology that are essential for explaining task performance

6 Learning and Development Neural network models provide an explicit, mechanistic account of how the brain changes as a function of experience Goals of learning: To acquire an internal representation (a model) of the world that allows you to predict what will happen next, and to make inferences about “unseen” aspects of the environment The system must be robust to noise/degradation/damage Focus of workshop: Use neural networks to explore how the brain meets these goals

7 Outline of Lecture What is a neural network? Principles of learning in neural networks: Hebbian learning: Simple learning rules that are very good at extracting the statistical structure of the environment (i.e., what things are there in the world, and how are they related to one another) Shortcomings of Hebbian learning: It’s good at acquiring coarse category structure (prototypes) but it’s less good at learning about atypical stimuli and arbitrary associations Error-driven learning: Very powerful rules that allow networks to learn from their mistakes

8 Outline, Continued The problem of interference in neocortical networks, and how the hippocampus can help alleviate this problem Brief discussion of PFC and how networks can support active maintenance in the face of distracting information Background information for the “hands-on” portion of the workshop

9 Overall Philosophy The goal is to give you a good set of intuitions for how neural networks function I will simplify and gloss over lots of things. Please ask questions if you don’t understand what I’m saying...

10 What is a neural network? Neurons measure how much input they receive from other neurons; they “fire” (send a signal) if input exceeds a threshold value. Input is a function of firing rate and connection strength. Learning in neural networks involves adjusting connection strength.

11 What is a neural network? Key simplifications: We reduce all of the complexity of neuronal firing to a single number, the activity of the neuron, that reflects how often the neuron is spiking. We reduce all of the complexity of synaptic connections between neurons to a single number, the synaptic weight, that reflects how strong the connection is.
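
To make these two simplifications concrete, here is a minimal sketch in Python (my own illustration, not code from the workshop); the sigmoid shape, gain, and threshold values are assumptions:

```python
import numpy as np

def unit_activity(sending_acts, weights, threshold=0.5, gain=6.0):
    """Net input = sum over senders of (activity * weight); the unit's
    activity rises smoothly from 0 to 1 as net input passes the threshold."""
    net = np.dot(sending_acts, weights)
    return 1.0 / (1.0 + np.exp(-gain * (net - threshold)))

acts = np.array([0.9, 0.0, 0.8])   # activities of three sending neurons
wts  = np.array([0.5, 0.5, 0.3])   # synaptic weights onto the receiving neuron
print(round(unit_activity(acts, wts), 3))   # above-threshold input -> high activity
```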

12 Neurons are Detectors Each neuron is detecting some set of conditions (e.g., smoke detector). Representation is what is detected.

13 Understanding Neural Components in Terms of the Detector Model

14 Detector Model Neurons feed on each other’s outputs; layers of ever more complicated detectors. Things can get very complex in terms of content, but each neuron is still carrying out the basic detector function.

15 Two-layer Attractor Networks Input/Output Layer Hidden Layer (Internal Representation) Model of processing in neocortex. Circles = units (neurons); lines = connections (synapses). Unit brightness = activity; line thickness = synaptic weight. Connections are symmetric.

16 Two-layer Attractor Networks Input/Output Layer Hidden Layer (Internal Representation) Units within a layer compete to become active. Competition is enforced by inhibitory interneurons that sample the amount of activity in the layer and send back a proportional amount of inhibition. Inhibitory interneurons prevent epilepsy in the network. Inhibitory interneurons are not pictured in subsequent diagrams.
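
One common way to implement this kind of competition in simulations is a k-winners-take-all rule, where the inhibition level is set so that only the k most-excited units stay active. The sketch below is my own illustration of that idea, not the workshop's actual implementation:

```python
import numpy as np

def kwta(net_inputs, k):
    """Crude stand-in for the inhibitory interneurons: place the inhibition
    level between the k-th and (k+1)-th strongest net inputs, so only the
    k most-excited units remain above zero."""
    sorted_net = np.sort(net_inputs)[::-1]
    inhib = (sorted_net[k - 1] + sorted_net[k]) / 2.0
    return np.clip(net_inputs - inhib, 0.0, 1.0)

net = np.array([0.9, 0.2, 0.7, 0.4, 0.1])
print(kwta(net, k=2))   # only the two strongest units stay active
```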

17 Two-layer Attractor Networks Input/Output Layer Hidden Layer (Internal Representation) These networks are capable of sustaining a stable pattern of activity on their own. “Attractor” = a fancy word for “stable pattern of activity”. Real networks are much larger than this; also, more than one unit is active in the hidden layer...

18 Properties of Two-Layer Attractor Networks I will show that these networks are capable of meeting the “learning goals” outlined earlier. Given partial information (e.g., seeing something that has wings and feathers), the networks can make a “guess” about other properties of that thing (e.g., it probably flies). Networks show graceful degradation.

19–22 “Pattern Completion” in two-layer networks [animation frames: network diagram with feature units wings, beak, feathers, flies; a partial pattern is presented and the network fills in the missing features]

23–27 Networks are Robust to Damage, Noise [animation frames: the “beak” unit is removed, yet the network still completes the pattern from wings and feathers to flies]
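
A toy sketch of pattern completion and graceful degradation (my own construction, using binary threshold units rather than the graded units in the diagrams): a single hidden “bird” unit fills in missing features, and still does so when one input unit is damaged:

```python
import numpy as np

# feature units: wings, beak, feathers, flies; one hidden "bird" unit
W = np.array([1.0, 1.0, 1.0, 1.0])      # symmetric weights, feature <-> hidden

def complete(features, steps=5):
    x = features.copy()
    for _ in range(steps):
        h = float(x @ W > 1.5)          # hidden unit fires given enough input
        x = np.maximum(x, (W * h > 0.5).astype(float))  # hidden fills in features
    return x

partial = np.array([1.0, 1.0, 0.0, 0.0])   # wings + beak only
print(complete(partial))                   # -> [1. 1. 1. 1.]: "flies" filled in

damaged = np.array([1.0, 0.0, 1.0, 0.0])   # "beak" unit damaged/missing
print(complete(damaged))                   # still completes: graceful degradation
```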

28 Learning: Overview Learning = changing connection weights. Learning rules: How to adjust weights based on local information (presynaptic and postsynaptic activity) to produce appropriate network behavior. Hebbian learning: building a statistical model of the world, without an explicit teacher... Error-driven learning: rules that detect undesirable states and change weights to eliminate these undesirable states...

29 Building a Statistical Model of the World The world is inhabited by things with relatively stable sets of features. We want to wire detectors in our brains to detect these things. How can we do this? Answer: Leverage correlation. The features of a particular thing tend to appear together, and to disappear together; a thing is nothing more than a correlated cluster of features. Learning mechanisms that are sensitive to correlation will end up representing useful things.

30 Hebbian Learning How does the brain learn about correlations? Donald Hebb proposed the following mechanism: When the pre-synaptic neuron and post-synaptic neuron are active at the same time, strengthen the connection between them (“neurons that fire together, wire together”).

31–33 Hebbian Learning [diagram slides: pre- and post-synaptic neurons active together strengthen their connection]

34 Proposed by Donald Hebb: When the pre-synaptic (sending) neuron and post-synaptic (receiving) neuron are active at the same time, strengthen the connection between them (“neurons that fire together, wire together”). When two neurons are connected and one is active but the other is not, reduce the connection between them (“neurons that fire apart, unwire”).
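
In code, both halves of the rule can be captured by a single local update. The form below is, as far as I know, the CPCA-style Hebbian rule used in O'Reilly's framework; treat the exact form and learning rate as assumptions:

```python
import numpy as np

def hebb_update(w, pre, post, lrate=0.1):
    """When the receiver is active, the weight moves toward the sender's
    activity: fire together -> w grows toward 1 ("wire together");
    receiver active but sender silent -> w shrinks toward 0 ("unwire")."""
    return w + lrate * post * (pre - w)

w = 0.3
print(hebb_update(w, pre=1.0, post=1.0))  # 0.37: fire together, wire together
print(hebb_update(w, pre=0.0, post=1.0))  # 0.27: fire apart, unwire
print(hebb_update(w, pre=1.0, post=0.0))  # 0.30: receiver silent, no change
```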

35–36 Hebbian Learning [diagram slides]

37 Biology of Hebbian Learning: NMDA-Mediated Long-Term Potentiation

38 Biology of Hebbian Learning: Long-Term Depression When the postsynaptic neuron is depolarized, but presynaptic activity is relatively weak, you get weakening of the synapse

39 What Does Hebbian Learning Do? Hebbian learning tunes units to represent correlated sets of input features. Here is why: Say that a unit has 1,000 inputs. In this case, turning a single input feature on and off won’t have a big effect on the unit’s activity. In contrast, turning a large cluster of 900 input features on and off will have a big effect on the unit’s activity.
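
A quick numeric check of this claim (the weight values are arbitrary assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.uniform(0.2, 0.4, size=1000)              # weights from 1,000 inputs
x = rng.integers(0, 2, size=1000).astype(float)   # a random input pattern

base = x @ w                                      # net input to the receiving unit
x_one = x.copy();  x_one[0]     = 1 - x_one[0]    # toggle a single feature
x_gang = x.copy(); x_gang[:900] = 0.0             # turn off a 900-feature cluster

print(abs(x_one @ w - base))    # ~0.3: barely moves the unit
print(abs(x_gang @ w - base))   # ~130: dominates the unit's activity
```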

40–41 Hebbian Learning [diagram slides: a small cluster of active inputs fails to drive the receiving unit]

42 Because small clusters of inputs do not reliably activate the receiving unit, the receiving unit does not learn much about these inputs.

43–47 Hebbian Learning [diagram slides] Big clusters of inputs reliably activate the receiving unit, so the network learns more about big (vs. small) clusters (the “gang effect”).

48 What Does Hebbian Learning Do? Hebbian learning finds the thing in the world that most reliably activates the unit, and tunes the unit to like that thing even more!

49–57 Hebbian Learning [animation frames: two feature clusters, scaly/slithers and wings/beak/feathers/flies, are presented; the unit becomes tuned to whichever cluster most reliably activates it]

58 What Does Hebbian Learning Do? Hebbian learning finds the thing in the world that most reliably activates the unit, and tunes the unit to like that thing even more! The outcome of Hebbian learning is a function of how well different inputs activate the unit, and how frequently they are presented

59 Self-Organizing Learning One detector can only represent one thing (i.e., pattern of correlated features). Goal: We want to present input patterns to the network and have different units in the network “specialize” for different things, such that each thing is represented by at least one unit. Random weights (different initial receptive fields) and competition are important for achieving this goal. What happens without competition...

60–64 No Competition [animation frames: feature units lives under water, scaly, slithers, wings, beak, feathers, flies; without inhibition, every unit drifts toward the same large feature cluster]

65 No Competition: Without competition, all units end up representing the same “gang” of features (wings, beak, feathers, flies); other, smaller correlations (scaly, slithers, lives under water) get ignored.

66–72 Competition is important [animation frames: with inhibition, the unit that best fits the current input wins and learns, leaving other units free to specialize on the remaining feature clusters]

73 Competition is important: When units have different initial “receptive fields” and they compete to represent input patterns, units end up representing different things (e.g., striped/orange/sharp teeth; furry/yellow/chirps; lives under water).

74 Hebbian Learning: Summary Hebbian learning finds the thing in the world that most reliably activates the unit, and tunes the unit to like that thing even more. When: There are multiple hidden units competing to represent input patterns, and each hidden unit starts out with a distinct receptive field. Then: Hebbian learning will tune these units so that each thing in the world (i.e., each cluster of correlated features) is represented by at least one unit.
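
A minimal sketch of this self-organizing recipe (my own toy version: two hidden units, two feature clusters, and winner-take-all competition standing in for the inhibitory dynamics described earlier):

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_hid = 8, 2
W = rng.uniform(0.3, 0.7, size=(n_hid, n_in))   # random initial receptive fields

bird  = np.array([1, 1, 1, 1, 0, 0, 0, 0], float)  # wings, beak, feathers, flies
snake = np.array([0, 0, 0, 0, 1, 1, 1, 1], float)  # scaly, slithers, ...

for _ in range(100):
    x = bird if rng.random() < 0.5 else snake
    winner = np.argmax(W @ x)               # competition: best-matching unit wins
    W[winner] += 0.1 * (x - W[winner])      # only the winner learns (CPCA-style)

print(np.round(W, 2))   # each row has specialized for one feature cluster
```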

75–84 Problems with Penguins [animation frames: feature units include wings, beak, feathers, flies, waddles, slithers, lives in Antarctica; the penguin pattern activates the same hidden unit as typical birds, so the network wrongly predicts that penguins fly]

85 Problems with Hebb, and Possible Solutions Self-organizing Hebbian learning is capable of discovering the “high-level” (coarse) categorical structure of the inputs. However, it sometimes collapses across more subtle (but important) distinctions, and the learning rule does not have any provisions for fixing these errors once they happen.

86 Problems with Hebb, and Possible Solutions In the penguin problem, if we want the network to remember that typical birds fly but penguins don’t, then penguins and typical birds need to have distinct (non-identical) hidden representations. Hebbian learning assigns the same hidden unit to penguins and typical birds. We need to supplement Hebbian learning with another learning rule that is sensitive to when the network makes an error (e.g., saying that penguins fly) and corrects the error by pulling apart the hidden representations of penguins vs. typical birds.

87 What is an error, exactly? One common way of conceptualizing error is in terms of predictions and outcomes. If you give the network a partial version of a studied pattern, the network will make a prediction as to the missing features of that pattern (e.g., given something that has “feathers”, the network will guess that it probably flies). Later, you learn what the missing features are (the outcome). If the network’s guess about the missing features is wrong, we want the network to be able to change its weights based on the difference between the prediction and the outcome. Today, I will present the GeneRec error-driven learning rule developed by Randy O’Reilly.

88–93 Error-Driven Learning [animation frames] Prediction phase: Present a partial pattern; the network makes a guess about the missing features.

94–97 Error-Driven Learning [animation frames] Outcome phase: Present the full pattern and let the network settle.

98 Error-Driven Learning: We now need to compare these two activity patterns (prediction vs. outcome) and figure out which weights to change.

99 Motivating the Learning Rule The goal of error-driven learning is to discover an internal representation for the item that activates the correct answer. Basically, we want to find hidden units that are associated with the correct answer (in this case, “waddles”). The best way to do this is to examine how activity changes when “waddles” is clamped on during the “outcome” phase. Hidden units that are associated with “waddles” should show an increase in activity in the outcome (vs. prediction) phase. Hidden units that are not associated with “waddles” should show a decrease in activity in the outcome phase (because of increased competition from other units that are associated with “waddles”).

100 Motivating the Learning Rule Hidden units that are associated with “waddles” should show an increase in activity in the outcome (vs. prediction) phase. Hidden units that are not associated with “waddles” should show a decrease in activity in the outcome phase. Here is the learning rule: If a hidden unit shows increased activity (i.e., it’s associated with the correct answer), increase its weights to the input pattern. If a hidden unit shows decreased activity (i.e., it’s not associated with the correct answer), reduce its weights to the input pattern.
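
Expressed as a weight update, this is (to the best of my understanding) the core of the GeneRec rule: the weight change is the presynaptic input activity times the change in postsynaptic hidden activity between the two phases. The sketch below is illustrative; the activity values are made up:

```python
import numpy as np

def generec_update(W, pre_minus, post_minus, post_plus, lrate=0.05):
    """dW_ij = lrate * pre_i(minus) * (post_j(plus) - post_j(minus)):
    hidden units whose activity rose when the outcome was clamped get
    stronger weights from the input; units whose activity fell get weaker ones."""
    return W + lrate * np.outer(pre_minus, post_plus - post_minus)

pre        = np.array([1.0, 0.0, 1.0])   # input activities (prediction phase)
post_minus = np.array([0.8, 0.1])        # hidden activities, prediction phase
post_plus  = np.array([0.2, 0.9])        # hidden activities, outcome phase
W = np.zeros((3, 2))
print(generec_update(W, pre, post_minus, post_plus))
```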

101–104 Error-Driven Learning [animation frames: weights from the penguin input to the hidden units are adjusted based on the prediction/outcome difference]

105 Error-Driven Learning [diagram] Hebb and error have opposite effects on weights here! Error increases the extent to which penguin is linked to the right-hand unit, whereas Hebb reinforced penguin’s tendency to activate the left-hand unit.

106–116 Error-Driven Learning [animation frames: the penguin pattern is presented again; its hidden representation has been pulled apart from the typical-bird representation, so the network now predicts “waddles” rather than “flies”]

117 Catastrophic Interference If you change the weights too strongly in response to “penguin”, then the network starts to behave as if all birds waddle. New learning interferes with stored knowledge... The best way to avoid this problem is to make small weight changes, and to interleave “penguin” learning trials with “typical bird” trials. The “typical bird” trials serve to remind the network to retain the association between wings/feathers/beak and “flies”...
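
A toy demonstration of the trade-off (my own construction; the simple running-average update stands in for the error-driven rule):

```python
import numpy as np

rng = np.random.default_rng(2)
typical_bird = np.array([1, 1, 1, 1, 0], float)  # wings, beak, feathers, flies, -
penguin      = np.array([1, 1, 1, 0, 1], float)  # wings, beak, feathers, -, waddles

def train(schedule, lrate):
    w = typical_bird.copy()        # start with "typical bird" knowledge stored
    for x in schedule:
        w += lrate * (x - w)       # stand-in for an error-driven weight update
    return np.round(w, 2)

# massed penguin trials with a big learning rate: "flies" gets wiped out
print(train([penguin] * 20, lrate=0.5))

# small weight changes, penguins interleaved among typical birds
mixed = [penguin if rng.random() < 0.1 else typical_bird for _ in range(500)]
print(train(mixed, lrate=0.01))    # "flies" stays strong; "waddles" creeps up
```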

118–128 Interleaved Training [animation frames: “penguin” trials alternate with “typical bird” trials; with small weight changes, the network learns that penguins waddle while still remembering that typical birds fly]

129 Gradual vs. One-Trial Learning Problem: It appears that the solution to the catastrophic interference problem is to learn slowly. But we also need to be able to learn quickly!

130 Gradual vs. One-Trial Learning Put another way: There appears to be a trade-off between learning rate and interference in the cortical network Our claim is that the brain avoids this trade-off by having two separate networks: A slow-learning cortical network that gradually develops internal representations that support generalization, prediction, categorization, etc. A fast-learning hippocampal network that is specialized for rapid memorization (but does not support generalization, categorization, etc.)

131 [Diagram: hippocampal anatomy — Dentate Gyrus, CA3, CA1 — sitting atop the Entorhinal Cortex (input and output layers), which connects the hippocampus to lower-level neocortex]

132 Interactions Between Hippo and Cortex According to the Complementary Learning Systems theory (McClelland et al., 1995), the hippocampus rapidly memorizes patterns of cortical activity. The hippocampus manages to learn rapidly without suffering catastrophic interference because it has a built-in tendency to assign distinct, minimally overlapping representations to input patterns, even when they are very similar. Of course, this hurts its ability to categorize.

133 Interactions Between Hippo and Cortex The theory states that, when you are asleep, the hippocampus “plays back” stored patterns in an interleaved fashion, thereby allowing cortex to weave new facts and experiences into existing knowledge structures. Even if something just happens once in the real world, hippocampus can keep re-playing it to cortex, interleaved with other events, until it sinks in... Detailed theory: slow-wave sleep = hippocampal playback to cortex; REM sleep = cortex randomly activates stored representations, which strengthens pre-existing knowledge and protects it against interference.
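
A toy sketch of the replay idea (entirely my own construction; real hippocampal replay is far richer than this): a one-shot memory is replayed to a slow cortical learner, interleaved with old patterns, so the new fact sinks in without erasing old knowledge:

```python
import numpy as np

rng = np.random.default_rng(3)
old_patterns = [rng.integers(0, 2, 10).astype(float) for _ in range(3)]
new_event = rng.integers(0, 2, 10).astype(float)   # experienced only once

hippocampus = [new_event]                  # fast, one-shot storage
cortex_w = np.mean(old_patterns, axis=0)   # stand-in for slow cortical knowledge

for night in range(200):                   # "sleep": interleaved replay
    if rng.random() < 0.25:
        replay = hippocampus[0]                    # replay the new memory...
    else:
        replay = old_patterns[rng.integers(3)]     # ...mixed with old patterns
    cortex_w += 0.01 * (replay - cortex_w)         # small cortical weight changes

print(np.round(cortex_w, 2))   # the new event has been woven in gradually
```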

134–141 Role of the Hippocampus [animation frames: the hippocampus memorizes the penguin pattern (wings, beak, feathers, waddles, lives in Antarctica) after one exposure and replays it to the cortical network, interleaved with typical-bird patterns]

142 Error-Driven Learning: Summary Error-driven learning algorithms are very powerful: So long as the learning rate is small, and training patterns are presented in an interleaved fashion, algorithms like GeneRec can learn internal representations that support good “pattern completion” of missing features. Error-driven learning is not meant to be a replacement for Hebbian learning: The two algorithms can co-exist! Hebbian learning actually improves the performance of GeneRec by ensuring that hidden units represent meaningful clusters of features

143 Error-Driven Learning: Summary Theoretical issues to resolve with error-driven learning: The algorithm requires that the network “know” whether it is in a “prediction” phase or an “outcome” phase; how does the network know this? For that matter, the whole “phases” idea is sketchy. GeneRec based on “prediction/outcome” differences is not the only way to do error-driven learning: alternatives include backpropagation, learning by reconstruction, and Adaptive Resonance Theory (Grossberg & Carpenter).

144 Learning by Reconstruction Instead of doing error-driven learning by comparing predictions and outcomes, you can also do error-driven learning as follows: First, you clamp the correct, full pattern onto the network and let it settle. Then, you erase the input pattern and see whether the network can reconstruct the input pattern based on its internal representation. The algorithm is basically the same; you are still comparing two phases...

145 Learning by Reconstruction [animation] Clamp the to-be-learned pattern onto the input and let the network settle.

146–149 Learning by Reconstruction [animation frames] Next, wipe the input layer clean (but not the hidden layer) and let the network settle.

150–151 Learning by Reconstruction [animation frames] Compare hidden activity in the two phases and adjust the weights accordingly (i.e., if activation was higher with the correct answer clamped, increase the weights; if activation was lower, decrease them).
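
Here is a compact sketch of the two reconstruction phases and the GeneRec-style weight comparison (my own illustration; a single feedforward pass stands in for full settling):

```python
import numpy as np

rng = np.random.default_rng(4)
n_in, n_hid = 6, 3
W = rng.normal(0.0, 0.5, size=(n_in, n_hid))   # symmetric input <-> hidden weights

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

pattern = np.array([1, 1, 1, 0, 0, 1], float)

# Phase 1: clamp the full pattern and "settle" (one feedforward pass here)
h_clamped = sigmoid(pattern @ W)

# Phase 2: wipe the input layer (keep the hidden layer) and let the network
# reconstruct the input from its internal representation
x_recon = sigmoid(W @ h_clamped)
h_recon = sigmoid(x_recon @ W)

# Compare hidden activity across the two phases: raise weights where activity
# was higher with the correct answer clamped, lower them where it was lower
W += 0.05 * np.outer(x_recon, h_clamped - h_recon)
print(np.round(W, 2))
```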

152–160 Adaptive Resonance Theory [animation frames: the penguin input (wings, beak, feathers, waddles, lives in Antarctica) activates a stored category; the predicted features are compared against the input; a MISMATCH! is detected, signaling that the stored category should not simply be overwritten]

161 Spreading Activation vs. Active Maintenance Spreading activation is generally very useful... it lets us make predictions/inferences/etc. But sometimes you just want to hold on to a pattern of activation without letting activation spread (e.g., a phone number, or a person’s name). How do we maintain specific patterns of activity in the face of distraction?

162 Spreading Activation vs. Active Maintenance As you will see in the “hands-on” part of the workshop, the networks we have been discussing are not very robust to noise/distraction. Thus, there appears to be another tradeoff: Networks that are good at generalization/prediction are lousy at holding on to phone numbers/plans/ideas in the face of distraction

163 Spreading Activation vs. Active Maintenance Solution: We have evolved a network that is optimized for active maintenance: Prefrontal cortex! This complements the rest of cortex, which is good at generalization but not so good at active maintenance. PFC uses isolated representations to prevent spread of activity. There is evidence for such isolated “stripes” in PFC.
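
A toy contrast between the two regimes (my own construction): in a densely interconnected network, activity spreads until the specific pattern is lost, whereas isolated, self-connected units hold a pattern indefinitely:

```python
import numpy as np

def run(W, x0, steps=10):
    x = x0.copy()
    for _ in range(steps):
        x = np.clip(W @ x, 0.0, 1.0)   # recurrent spread of activity
    return x

n = 4
dense    = np.full((n, n), 0.5) + 0.5 * np.eye(n)   # interconnected posterior cortex
isolated = np.eye(n)                                # isolated "stripes" (PFC-like)

x0 = np.array([1.0, 0.0, 0.0, 0.0])
print(run(dense, x0))     # [1. 1. 1. 1.]: activity spreads, the pattern is lost
print(run(isolated, x0))  # [1. 0. 0. 0.]: the pattern is held
```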

164 Tripartite Functional Organization PC = posterior perceptual & motor cortex; FC = prefrontal cortex; HC = hippocampus and related structures

165 Tripartite Functional Organization PC = incremental learning about the structure of the environment; FC = active maintenance, cognitive control; HC = rapid memorization. Roles are defined by functional tradeoffs…

166 Key Trade-offs Extracting what is generally true (across events) vs. memorizing specific events Inference (spreading activation) vs. robust active maintenance

167 Hands-On Exercises The goal of the hands-on part of the workshop is to get a feel for the kinds of representations that are acquired by Hebbian vs. error-driven learning, and for network dynamics more generally.

168 Here is the network that we will be using: Activity constraints: Only 10% of hidden units can be strongly active at once; in the input layer, only one unit per row. Think of each row in the input as a feature dimension (e.g., shape); the units in that row are mutually exclusive features along that dimension (square, circle, etc.).

169 This diagram illustrates the connectivity of the network: Each hidden unit is connected to 50% of the input units; there are also recurrent connections from each hidden unit to all of the other hidden units. Weights are symmetric. Initial weight values were set randomly.
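
A sketch of how one might construct a network with this connectivity (layer sizes and weight ranges are my assumptions; the workshop uses its own pre-built network):

```python
import numpy as np

rng = np.random.default_rng(5)
n_input, n_hidden = 24, 40                      # layer sizes: my assumption

# each hidden unit connects to a random 50% of the input units
mask = rng.random((n_hidden, n_input)) < 0.5
W_in = np.where(mask, rng.uniform(0.25, 0.75, (n_hidden, n_input)), 0.0)

# recurrent connections from each hidden unit to all the others, kept symmetric
W_rec = rng.uniform(0.25, 0.75, (n_hidden, n_hidden))
W_rec = (W_rec + W_rec.T) / 2.0                 # enforce symmetric weights
np.fill_diagonal(W_rec, 0.0)                    # no self-connections (assumption)

k_active = max(1, int(0.10 * n_hidden))         # at most 10% strongly active
print(W_in.shape, W_rec.shape, k_active)
```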

170 I trained up the network on the following 8 patterns: In each pattern, the bottom 16 rows encode prototypical features that tend to be shared across patterns within a category; the top 8 rows encode item-specific features that are unique to each pattern. Each category has 3 “typical” items and one “atypical” item. During training, the network studied typical patterns 90% of the time and atypical patterns 10% of the time.
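
A sketch of a training set with this structure (the 16 + 8 row split, the 2 × 4 item design, and the 90/10 presentation mix follow the slide; the four-units-per-row coding and the way the atypical item deviates from the prototype are my assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
UNITS_PER_ROW = 4   # "one unit per row"; a row width of 4 is my assumption

def make_category():
    prototype = rng.integers(0, UNITS_PER_ROW, size=16)   # 16 shared-feature rows
    items = []
    for i in range(4):                                    # 3 typical + 1 atypical
        proto = prototype.copy()
        if i == 3:                                        # atypical item deviates
            proto[:4] = rng.integers(0, UNITS_PER_ROW, 4) # on a few rows (assumed)
        specific = rng.integers(0, UNITS_PER_ROW, size=8) # 8 item-specific rows
        items.append(np.concatenate([proto, specific]))
    return items

patterns = make_category() + make_category()      # 8 patterns, 2 categories
typical  = patterns[0:3] + patterns[4:7]
atypical = [patterns[3], patterns[7]]

def sample_pattern():
    """Draw a study item: typical 90% of the time, atypical 10%."""
    pool = typical if rng.random() < 0.9 else atypical
    return pool[rng.integers(len(pool))]

print(sample_pattern())
```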

171 To save time, the networks you will be using have been pre-trained on the 8 patterns (by presenting them repeatedly, in an interleaved fashion). For some of the simulations, you will be using a network that was trained with (purely) Hebbian learning.

172 For other simulations, you will be using a network that was trained with a combination of error-driven (GeneRec) and Hebbian learning. Training of this network used a three-phase design: First, there was a “prediction” (minus) phase where a partial pattern was presented. Second, there was an “outcome” (plus) phase where the full version of the pattern was presented. Finally, there was a “nothing” phase where the input pattern was erased (but not the hidden pattern). Error-driven learning occurred based on the difference in activity between the minus and plus patterns, and based on the difference in activity between the plus and nothing patterns.
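
A sketch of one such three-phase training step (my own reconstruction: a single feedforward pass stands in for settling, and the error/Hebbian mixing coefficients are assumptions):

```python
import numpy as np

rng = np.random.default_rng(7)
n_in, n_hid = 6, 4
W = rng.uniform(0.25, 0.75, size=(n_in, n_hid))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

full    = np.array([1, 1, 1, 0, 0, 1], float)   # the complete training pattern
partial = np.array([1, 1, 1, 0, 0, 0], float)   # the pattern with features removed

h_minus = sigmoid(partial @ W)          # "prediction" (minus) phase
h_plus  = sigmoid(full @ W)             # "outcome" (plus) phase
x_recon = sigmoid(W @ h_plus)           # "nothing" phase: input erased; the
h_none  = sigmoid(x_recon @ W)          #   hidden pattern reconstructs and resettles

lrate = 0.01
W += lrate * np.outer(full, h_plus - h_minus)            # minus-vs-plus error term
W += lrate * np.outer(full, h_plus - h_none)             # plus-vs-nothing error term
W += 0.1 * lrate * np.outer(full, h_plus) * (1.0 - W)    # small Hebbian (CPCA-ish) term
print(np.round(W, 2))
```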

173 When you get to the computer room, the simulation should already be open on the computer (some of you may have to double up; I think there are slightly fewer computers than students), and there will be a handout on the desk explaining what to do. You can proceed at your own pace. I will be there to answer questions (about the lecture and about the computer exercises), and my two grad students, Ehren Newman and Sean Polyn, will also be there to answer questions.

174 Your Helpers Ehren Sean me

