Alan Pickering Autumn Term 2005

Methods and Techniques in Neuroscience An introduction to neural network models Alan Pickering Autumn Term 2005

Outline: Part 1 Neural learning mechanisms
Learning through connections Hebbian learning What is connectionism (some definitions / distinctions) ? Connectionism Computational modelling Neural net modelling Building a simple model Terminology & structure How information passes through a model - exercise 1 How a model learns (supervised & unsupervised learning) - exercise 2

Neural learning
Originally, it was thought that new associations/memories were formed by the growth of new nerve cells in the brain.
Santiago Ramon y Cajal proposed that learning might occur through the strengthening of existing connections between nerve cells.
Donald O. Hebb formulated Cajal’s ideas into a hypothetical biological mechanism dubbed ‘Hebbian learning’.

Hebbian Learning Hebbian learning:
When two joining cells fire simultaneously, the connection between them strengthens (Hebb, 1949).
Discovered at a biomolecular level by Lomo (1966) (long-term potentiation).
Figure (CS, US, UR): learned associations through the strengthening of connections.

What is connectionism? Starting definition:
“Connectionism is the study of how learning can occur through the strengthening and/or weakening of connections between representations of pieces of information and/or behavioural responses.”
We can relate this definition straight back to classical and operant conditioning (connections strengthened between stimulus and response, or between CS and CR).
Connectionist modelling is usually concerned with more complex associations and larger numbers of connections. Computers are used to store and process all of this information.
A connectionist model is a computer simulation of learning.

What is connectionism? Biological plausibility:
Some (but not all) connectionists appeal to the physical similarity of connectionist models to networks of neurons in the brain (neural net modelling) Neurons: Nodes Synapses: Connections (weights)

Structure and terminology of connectionist models
Output layer Output nodes (units) Hidden nodes (units) Input layer Input nodes (units)

How a connectionist model works
w = weights, a = activation, o = output
Input nodes j1, j2 and j3 feed the output node u.

Node   Activation (aj)   Output (outj)   Weight (wju)
j1     .8                1               +0.8
j2     .7                1               -0.5
j3     .1                0               +0.2

What input arrives at the output node (node u)?
Iu = ∑j outj * wju
From j1: 1 * 0.8 = 0.8
From j2: 1 * -0.5 = -0.5
From j3: 0 * 0.2 = 0.0
Sum of inputs = 0.3
Activation (au)?
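The summation in the worked example can be sketched in a few lines of Python. The function name `net_input` is an illustrative choice, not something taken from the slides. The numbers are the outputs and weights from the example above.

```python
def net_input(outputs, weights):
    """Return I_u = sum over j of out_j * w_ju for one receiving node u."""
    return sum(out_j * w_ju for out_j, w_ju in zip(outputs, weights))


# Outputs of j1, j2, j3 and their weights onto node u (from the worked example).
outputs = [1, 1, 0]
weights = [0.8, -0.5, 0.2]

total = net_input(outputs, weights)
print(round(total, 2))  # 0.3
```

The result is rounded only to hide floating-point noise in the sum.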

Exercise 1 What input arrives at the output node (node u)?
w = weights, a = activation, o = output
Input nodes j1, j2 and j3 feed the output node u.
Figure values: activations (aj) .8, .7, .6; outputs (outj) 1; weights (wju) +0.8, -0.7, +0.4.
What input arrives at the output node (node u)?
Iu = ∑j outj * wju
From j1: 1*0.8 = ???
From j2: 1*-0.5 = ???
From j3: 1*0.2 = ???
Sum of inputs = ???
Activation (au)?

Activations and Outputs 1
Activation = membrane potential of cell
Activation = 0 = resting state (-70mV)
+ve activation = cell being depolarised
-ve activation = cell being hyperpolarised (inhibited)
Cells fire (send an action potential) when sufficiently depolarised
Output = mean firing rate of cell
Output of node j is a function of its activation: outj = f(aj)
Figure: nodes j1, j2, j3 feeding node u, with activations (aj) .8, .7, .1

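How to convert activations into outputs?
Two parameters shape the output function: the threshold and the gain (degree of nonlinearity).
Figure: nodes j1, j2, j3 (activations .8, .7, .6) and node k1 (activation .1) project to node u; weights (wji) shown are +0.2, +0.8, +0.5 and -0.7; the inputs to u are labelled Excu and Inhu.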

An example output function
Figure: output curves for high, mid and low gain, with the threshold marked.
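The slides do not give the formula for the output function, so the following is only a sketch under an assumption: a logistic (sigmoid) curve, a common choice that has both a threshold and a gain parameter. The function name `output` and the default values are illustrative.

```python
import math


def output(activation, threshold=0.5, gain=10.0):
    """Logistic output function (an assumed form, not taken from the slides).

    The output is 0.5 when activation equals the threshold; a larger gain
    makes the curve steeper (more step-like) around the threshold.
    """
    return 1.0 / (1.0 + math.exp(-gain * (activation - threshold)))


# Compare low, mid and high gain at a few activation levels.
for gain in (2.0, 10.0, 50.0):
    row = [round(output(a, gain=gain), 3) for a in (0.1, 0.5, 0.8)]
    print(gain, row)
```

With high gain the node behaves almost like an all-or-none unit; with low gain the output changes gradually with activation.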

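Creating an activation equation
The equation is built up in steps, starting from the excitatory input alone:
dau(t)/dt = ce * Excu(t)
The full equation limits activation between Max and Min and adds inhibition and decay:
dau(t)/dt = ce * Excu(t) * (Max - au(t)) + ci * Inhu(t) * (au(t) - Min) - cd * au(t)
Figure: nodes j1, j2, j3 (activations .8, .7, .6) and node k1 (activation .1) project to node u (activation au); weights (wji) shown are +0.2, +0.8, +0.5 and -0.7; the inputs to u are labelled Excu and Inhu.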
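One way to see what the activation equation does is to step it forward in time with simple Euler integration. This is a sketch only: the parameter values, the step size `dt`, and the convention that the inhibitory input `inh` is zero or negative are all assumptions made for illustration, not values from the slides.

```python
def simulate_activation(exc, inh, steps=200, dt=0.05,
                        ce=1.0, ci=1.0, cd=0.5,
                        a_max=1.0, a_min=-1.0, a0=0.0):
    """Euler integration of
    da/dt = ce*Exc*(Max - a) + ci*Inh*(a - Min) - cd*a
    with constant inputs. All parameter values are illustrative assumptions;
    inh is assumed to be zero or negative."""
    a = a0
    trace = [a]
    for _ in range(steps):
        da = ce * exc * (a_max - a) + ci * inh * (a - a_min) - cd * a
        a += dt * da
        trace.append(a)
    return trace


excited = simulate_activation(exc=1.0, inh=0.0)
inhibited = simulate_activation(exc=0.0, inh=-0.7)
print(round(excited[-1], 3), round(inhibited[-1], 3))
```

Under these assumptions the activation settles at a level between Min and Max: the (Max - a) and (a - Min) terms shrink the drive as the activation approaches either bound, and the decay term pulls it back towards rest.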

Simulated Activations and Outputs

How does a connectionist model learn?
Following Cajal and Hebb, connectionist models learn through changes in the strength of connections: weight changes.
There are two types of learning:
Unsupervised learning: the weight changes are made automatically and in relation to the degrees of association between incoming activations (e.g. classical conditioning, where associations strengthen through temporal contiguity).
Supervised learning: the weight changes are made in proportion to the error at the output. In order to calculate the error at the output a teaching pattern is required for comparison, hence “supervised” (e.g. learning to talk or spell).

Unsupervised learning: 1
The most commonly used unsupervised learning rule is the “Hebb rule” (cf. Hebbian learning):
∆w = k au aj
∆w = weight change
k = a constant (e.g. 0.6)
au = post-synaptic activation
aj = pre-synaptic activation

Unsupervised learning: 2
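Example: Classical conditioning
∆w = k au aj
∆w = weight change
k = a constant (e.g. 0.6)
au = post-synaptic activation
aj = pre-synaptic activation
Before conditioning: the food input has a weight of 1.0 onto the output node; the tone input has a weight of 0.0.
During conditioning: food and tone are presented together (each with activation 1) and the output node is active (1.0), so the tone weight changes by ∆w = 0.6 * 1 * 1 = 0.6.
After conditioning: the tone weight is 0.6, so the tone presented alone (activation 1) produces a response of 0.6.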
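The conditioning example can be traced in code. This is a minimal sketch with two assumptions beyond the slide: the output activation is taken to be the weighted sum of the inputs, and only the tone weight is allowed to change (the food weight stays fixed at 1.0, as in the figure). The function name `hebb_update` is illustrative.

```python
def hebb_update(weight, pre, post, k=0.6):
    """Hebb rule: new weight = old weight + k * post-synaptic * pre-synaptic."""
    return weight + k * post * pre


# Before conditioning: food drives the response, tone does not.
w_food, w_tone = 1.0, 0.0

# During conditioning: food and tone are presented together.
food, tone = 1.0, 1.0
post = food * w_food + tone * w_tone   # output activation (assumed weighted sum)
w_tone = hebb_update(w_tone, pre=tone, post=post)

# After conditioning: tone alone now produces a response.
response_to_tone = 1.0 * w_tone
print(w_tone, response_to_tone)  # 0.6 0.6
```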

Supervised learning: 1
In supervised learning the weight changes are made in proportion to the error at the output.
In order to calculate the error at the output a teaching pattern is required for comparison (hence supervised) (e.g. learning to read or spell).
The most commonly used supervised learning rule is the “delta rule” (Rosenblatt, 1966; Rumelhart & McClelland, 1986):
∆wju = k up aj
∆wju = weight change
k = a constant
up = error at output for input pattern p
aj = pre-synaptic activation

Example: Learning to read
In order to read we need to learn how orthography (how a word looks) maps onto phonology (how it sounds). Imagine a network is learning how to pronounce ‘hit’. Say that hit has the orthography; and the phonology; ………

Supervised learning: 2
Example: Learning to read
‘Hit’ has an orthography (the input pattern) and a phonology (the teaching pattern); the patterns are shown in the figure.
∆wju = k up aj
∆wju = weight change
k = a constant
up = error at output for input pattern p
aj = pre-synaptic activation
First, the error is calculated at each output.
Second, the blame is apportioned (to active units contributing to the error).
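The two steps (find the error, then apportion blame to active inputs) can be sketched as a single delta-rule update. The input and teaching patterns below are hypothetical, chosen only for illustration; they are not the slide's patterns for ‘hit’. The sketch also assumes linear output nodes and takes the error to be the teaching value minus the output.

```python
def delta_rule_step(weights, inputs, teaching, k=0.5):
    """One delta-rule update. weights[u][j] is the weight from input j to output u.

    Assumes linear output nodes and error = teaching value - output.
    """
    outputs = [sum(w * a for w, a in zip(row, inputs)) for row in weights]
    # Step 1: error at each output node.
    errors = [t - o for t, o in zip(teaching, outputs)]
    # Step 2: blame goes only to active inputs, because the change is k * error * a_j.
    new_weights = [[w + k * err * a for w, a in zip(row, inputs)]
                   for row, err in zip(weights, errors)]
    return new_weights, errors


# Hypothetical patterns (not from the slides): 3 input nodes, 3 output nodes.
weights = [[0.0, 0.0, 0.0] for _ in range(3)]
inputs = [1, 0, 1]
teaching = [0, 1, 1]

weights, errors = delta_rule_step(weights, inputs, teaching)
for row in weights:
    print(row)
```

Weights from the inactive input node never change, and weights onto an output node with zero error never change; only connections from active inputs to errorful outputs are altered.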

Exercise 2
Example: Learning to read
‘Cat’ has: Orthography 0 1 0. The phonology and teaching pattern are shown in the figure (input nodes j1, j2, j3; output nodes u1, u2, u3).
∆wju = k up aj
∆wju = weight change
k = a constant
up = error at output
aj = pre-synaptic activation
Which connection/s will be altered?
Remember:
1st find the error
2nd apportion blame
3rd alter weights coming from blamed input node/s to errorful output node/s

Neural network models: Part 1 summary
Recap on neural learning mechanisms
Learning through connections: Cajal & Hebb both suggested that learning in the human brain may occur through changes in the strength of connections between neurons.
Hebbian learning: Hebb formulated a mechanism by which this associative learning might occur; synchronous pre- and post-synaptic firing increases the strength of the connection.
What is connectionism (some definitions / distinctions)?
Connectionism deals with the ways in which statistical structure in the environment can be learned by the strengthening and/or weakening of connections between representations of items of information.
Connectionism involves computer simulations of learning.
Neural net modelling: connectionist models gain credibility because they resemble networks of neurons in the brain.

Neural network models: Part 1 Summary
Building a connectionist model
Structure: they are a network of ‘nodes’ and ‘weighted connections’. Information passes from node to node through the ‘weighted connections’, usually in one direction.
Terminology: nodes can be thought of as neurons, and weighted connections as synapses between neurons.
The flow of information through the model: the weighted connections control whether or not information passes from one layer of nodes to the next (∑j outj * wju).
Unsupervised learning: as in Hebbian learning, unsupervised models use correlations of pre- and post-synaptic firing to change strengths of connections (∆wju = k au aj).
Supervised learning: supervised learning relies on the error at the output (provided by a teaching pattern) to determine the changes in connection strength (∆wju = k up aj).

Outline: Part 2 Why connectionism? (the case for parallel distributed processing) Parallel processing Distributed processing (representations) An example of how connectionist models can help us understand learning: Learning of inflectional morphology in early childhood Pinker & Prince (1988) Rumelhart & McClelland (1986)

Parallel distributed processing (PDP)
Rumelhart & McClelland (1986) made a very strong case for the use of connectionist models, highlighting the qualities of parallel and distributed processing.
Parallel processing: can be contrasted with serial processing. Several different pieces of information are processed at the same time, rather than one after the other. Example: face processing.
Distributed processing: can be contrasted with localist processing. Representations of information are distributed across the whole neural network, rather than occupying specific locations. Example: Karl Lashley and the ‘Engram’ (location of memory).

Parallel processing Example: Face processing
When looking at this face we recognise it not by looking at individual features one at a time (the eyes, the nose, the smile, the grimace), but by processing these features and their spatial configuration in parallel

Parallel processing Example: Face processing
Figure: a network with eye nodes, nose nodes, smile nodes and a grimace node; the faces shown are Margaret Thatcher and Tony Blair.

Distributed processing
Example: Lashley and the search for the ‘Engram’
Karl Spencer Lashley was a pioneering researcher in the biological foundations of memory in the rat.
Lashley and colleagues lesioned the brains of rats in order to test whether or not they had removed the part responsible for memory.
They found no single locus (engram) which appeared to be solely responsible for memory.
Lashley concluded that memory was distributed throughout the cortex, rather than localised in one specific place.

A connectionist model of the acquisition of inflectional morphology
Inflectional morphology is the way in which we change words to convey plurality or past tense.
Examples:
Cat + -s --> Cats (plural)
Play + -ed --> Played (past tense) - *we focus on this*
In English, 90% of verbs have a regular morphology and 10% an irregular morphology.

Acquisition of inflectional morphology
Past tense morphology
Regular morphology (90% of verbs):
talk => talked
ram => rammed
pit => pitted
Irregular morphology (10% of verbs):
hit => hit (‘no change’)
come => came (‘vowel change’)
sleep => slept (‘vowel change’)
go => went (‘arbitrary’)
How do children learn which words require a regular ending and which are irregular?

U-shaped Development
Initially, children’s early inflections are correct.
Later they start making over-regularisation errors: hitted, sleeped, goed.
Later still, children recover from these errors.
Phase 1: Rote learning -> initial error-free performance
Phase 2: Rule extraction -> over-regularisation errors
Phase 3: Rule + rote -> recovery from errors

Dual-route model (Pinker & Prince, 1988)
The ‘rule’ route deals with regular verbs.
The exceptions route deals with irregulars.
Figure: the input stem feeds both the exceptions route and the rule route, which produce the output inflection.
Errors in the middle of development occur due to overuse of the ‘rule’ system.
But do we really need two routes to explain this pattern of learning?

Connectionist model (Rumelhart & McClelland, 1986)
Figure: a Wickelfeature representation of the stem is mapped onto a Wickelfeature representation of the past tense.
A single-route connectionist model for learning the past tense: one single network learns to produce the past tense for all the verbs it is taught.
There is no rule route for producing the regular ending.
The network ‘learned’ to associate regular and irregular English verb stems with their past tense forms.

Connectionist model (Rumelhart & McClelland, 1986)
Like the children, the network:
made over-regularisation errors;
demonstrated U-shaped development in its performance on irregular verbs.

Connectionist model of the acquisition of inflectional morphology (Rumelhart & McClelland, 1986)
This model demonstrates that behaviour which looks as if it is driven by a knowledge of rules can in fact be driven by a distributed representation of the statistical structure of the input (in this case a distributed representation of language).
This explanation of how ‘rule-like behaviour’ can occur in the absence of any actual representation of the rule is an important contribution of connectionist models.

Neural network models: Part 2 Summary
Why connectionism? (the case for parallel distributed processing)
Parallel processing: connectionist models process information in parallel, rather than serially; this has intuitive appeal when we consider how we take information in.
Distributed processing (representations): connectionist models represent learned information in the distributed connections across the network, rather than in single locations or ‘rules’; this helps explain why it is difficult to find a single location for memory in the brain.
An example of how connectionist models can help us understand learning: Rumelhart & McClelland’s (1986) model. This models the acquisition of inflectional morphology and demonstrates how networks can show ‘rule-like’ behaviour in the absence of any representation of a rule.

Outline: Part 3 A history of neural network models
Single layer networks Rosenblatt’s perceptron (1958) Minsky & Papert’s (1969) criticism of perceptrons Multi layer networks McClelland & Rumelhart’s (1986) ‘Backpropagation of error’ learning rule

Structure and terminology of connectionist models
Output layer: output nodes (units)
Hidden nodes (units)
Input layer: input nodes (units)
A connectionist model transforms material at the input into something else at the output; this is done through the connection weights between input and output nodes.

One-layer vs multi-layer networks
One-layer networks were the first connectionist models to emerge in the 1950s: Frank Rosenblatt’s (1958) ‘Perceptron’.
In these networks learning occurs through changes in the weights of only one layer.
However, networks which only change one layer of weights have some important limitations.
These were pointed out by Minsky & Papert (1969).

Minsky & Papert (1969)
Minsky & Papert (1969) made the point that single-layer networks cannot solve ‘non-linearly separable problems’.
Maths example: the XOR problem
Task: to learn to solve XOR

inputs   sum   output
1 1      2     0
1 0      1     1
0 1      1     1
0 0      0     0

If you think about the sums of the inputs then we can see why this isn’t linearly separable: when the sum of the inputs increases to 2, the desired output goes back down to 0.
Single-layer networks cannot solve this kind of problem.
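A small search can illustrate the point. The sketch below tries every combination of two weights and a bias from a coarse grid for a single threshold unit, and checks whether any of them reproduces a truth table. The grid, the threshold unit and the AND comparison are illustrative choices, not from the slides, and a grid search is a demonstration rather than a proof.

```python
from itertools import product

XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
AND = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 1}


def threshold_unit(x1, x2, w1, w2, bias):
    """Single-layer unit: fires (1) if the weighted sum plus bias is positive."""
    return 1 if w1 * x1 + w2 * x2 + bias > 0 else 0


def solvable(table, grid):
    """True if some (w1, w2, bias) on the grid reproduces the whole truth table."""
    for w1, w2, bias in product(grid, repeat=3):
        if all(threshold_unit(x1, x2, w1, w2, bias) == target
               for (x1, x2), target in table.items()):
            return True
    return False


grid = [i / 4 for i in range(-8, 9)]   # -2.0 to 2.0 in steps of 0.25
print(solvable(AND, grid))  # True
print(solvable(XOR, grid))  # False
```

AND is linearly separable, so a suitable setting is found quickly (for example weights of 1 and 1 with a bias of -1.5). For XOR no setting works, on this grid or any other.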

Minsky & Papert (1969)
The inability to solve non-linear problems is a problem for any model of human learning, because humans can solve non-linear problems.
Psychological Example 1: Learning to eat the right amount
We all have to learn to eat enough to stay fit, but not so much as to make us sick.
This is like solving the XOR problem (we can learn to eat some of the food but not all of it).
Psychological Example 2: Connected and unconnected figures (figures a, b, c, d)

Minsky & Papert (1969) Psychological Example 2: Connected and unconnected figures a b c d How might a single layer network try to solve this? (1 = connected, 0 = unconnected) All figures have three horizontal lines so have to work this out on the basis of the presence of vertical lines (at particular locations) The net might start by discriminating between two connected and unconnected figures (e.g. c and d) by locating a vertical line (e.g. in the bottom left) But this also leads the net to discriminate between figures which we want to group together (e.g. c and b)

A solution: Multi-layered networks
Rosenblatt and Minsky & Papert agreed that this problem would be solved if you could train networks with more than one layer.
Networks with ‘hidden units’ can redescribe the input into a format that can be separated linearly.
“Hidden units allow the network to treat physically similar inputs as different, as the need arises”

Multi-layered networks solve XOR
A multilayered network can solve the XOR problem.
Task: to learn to solve XOR
Figure: a network with input, hidden and output layers; the weights shown include +1 and -1.
You can set up a two-layered network like the example in the figure (with appropriate weights) to solve the XOR problem.
But the real problem envisaged by Minsky & Papert was how to train the weights on a network with two layers.
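A hand-wired network of this kind can be written out directly. The particular weights and biases below are one possible setting chosen for illustration (one hidden unit acting as OR, one as AND); they are assumptions, since the figure's full set of weights is not recoverable from the transcript.

```python
def step(x):
    """Threshold unit: 1 if the net input is positive, otherwise 0."""
    return 1 if x > 0 else 0


def xor_net(x1, x2):
    """Two-layer network with hand-set weights (illustrative values)."""
    h_or = step(1 * x1 + 1 * x2 - 0.5)    # hidden unit 1: on if either input is on
    h_and = step(1 * x1 + 1 * x2 - 1.5)   # hidden unit 2: on only if both are on
    # Output: +1 from the OR unit, -1 from the AND unit.
    return step(1 * h_or - 1 * h_and - 0.5)


for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))
```

The hidden layer redescribes the four input patterns so that the output node only has to separate "either but not both", which is linearly separable in terms of the hidden activations.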

Training multi-layered networks
When a network learns through the delta rule, a teaching pattern is there to correct the weights leading into the output, by measuring the difference between the teaching pattern and the output.
BUT there is no teaching pattern for the hidden layer which can be used to change the weights from the input.
The inability to find a learning rule which would change all weights in a network stifled connectionist research until a solution was found in 1986.

Back-propagation of error
McClelland & Rumelhart (1986) came up with a learning rule to solve this: ‘back-prop’.
Backprop learning:
Backprop is an extension of the delta (supervised learning) rule.
Error at the output is used to assign blame to particular hidden units.
This blame is then converted to error, which is used to calculate the changes to the weights from input to hidden units.
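The steps above can be sketched as a small back-propagation network trained on XOR. Everything specific here is an assumption for illustration: the sigmoid units, three hidden units, the learning constant, the number of epochs and the random seed. Depending on the starting weights, a small network like this may or may not reach a perfect XOR solution, so the sketch only reports the squared error before and after training.

```python
import math
import random

XOR_PATTERNS = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hidden, w_out):
    """Each hidden row is [w_x1, w_x2, bias]; the last entry of w_out is the output bias."""
    hidden = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    out = sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + w_out[-1])
    return hidden, out


def total_error(patterns, w_hidden, w_out):
    return sum((t - forward(x, w_hidden, w_out)[1]) ** 2 for x, t in patterns)


def train(patterns, n_hidden=3, k=0.5, epochs=5000, seed=0):
    rng = random.Random(seed)
    w_hidden = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(n_hidden)]
    w_out = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    error_before = total_error(patterns, w_hidden, w_out)
    for _ in range(epochs):
        for x, t in patterns:
            hidden, out = forward(x, w_hidden, w_out)
            # Error at the output (delta rule with the sigmoid slope).
            delta_out = (t - out) * out * (1 - out)
            # Blame for each hidden unit: output error passed back through its weight.
            delta_hidden = [delta_out * w_out[i] * h * (1 - h)
                            for i, h in enumerate(hidden)]
            # Change hidden-to-output weights, then input-to-hidden weights.
            for i, h in enumerate(hidden):
                w_out[i] += k * delta_out * h
            w_out[-1] += k * delta_out
            for i, w in enumerate(w_hidden):
                w[0] += k * delta_hidden[i] * x[0]
                w[1] += k * delta_hidden[i] * x[1]
                w[2] += k * delta_hidden[i]
    error_after = total_error(patterns, w_hidden, w_out)
    return w_hidden, w_out, error_before, error_after


w_hidden, w_out, before, after = train(XOR_PATTERNS)
print("squared error before:", round(before, 3), "after:", round(after, 3))
```

The key line is the computation of `delta_hidden`: the hidden units have no teaching pattern of their own, so their error is derived from the output error and the weight connecting them to the output.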

Summary: Part 3 Single layer networks Rosenblatt’s perceptron (1958)
The first connectionist models were known as ‘perceptrons’, and learned through changes to a single layer of weights Minsky & Papert’s (1969) criticism of perceptrons Single layer networks cannot be set up to solve non-linearly separable problems (e.g. the XOR problem) This is a problem because humans can solve non-linearly separable problems (e.g. M & P’s connected figures discrimination)

Summary: Part 3 A history of connectionist models Multi-layer networks
Multi-layered networks can solve non-linearly separable problems by redescribing the input in a linearly separable way at a set of intermediary nodes called ‘hidden units’.
Minsky & Papert (1969) knew that multilayered networks could provide a solution to non-linearly separable problems, but were not very optimistic about finding a way of training both layers of weights, until McClelland & Rumelhart (1986) came up with a learning rule which could change the weights in multiple layers of weights: ‘Back-propagation of error’ (BP).
BP works as an extension of normal ‘delta rule’ supervised learning, but changes the weights arriving at the hidden units in relation to the blame assigned to these units.

Outline: Part 4 Some applications Modelling human memory
McClelland’s (1981) Jets and Sharks model Modelling double dissociations in acquired dyslexia Plaut & Shallice (1993)

Parallel distributed processing in memory
McClelland’s (1981) ‘Jets and Sharks’ model of memory
A simulation of how humans might store information about people.
Imagine the Jets and Sharks are two rival gangs in your town. You know a lot about the gang members:
How old they are (20s, 30s, 40s)
How well educated they are (Junior high, High school, College)
What their marital status is (single, married)
What their job is (pusher, bookie, burglar)
How do you access this information?

Jets and Sharks
A computer (or conventional database) might store the information indexed to name. The name links all the information about that person together:

Name    Education   Marital status   Gang    Job
John    HS          single           jet     burglar
Terry   JH          married          shark   pusher
Fred    JH          married          shark   burglar

Name indexing is good for answering questions like “is Fred a burglar?” but bad at answering “who is a burglar?”
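The contrast can be made concrete with a name-indexed store. In this sketch (the dictionary layout and function names are illustrative), the three records from the slide are keyed by name. Asking about a named person is a direct lookup, whereas asking "who is a burglar?" means checking every record in turn.

```python
# Name-indexed store: the name is the only key.
people = {
    "John":  {"education": "HS", "marital": "single",  "gang": "jet",   "job": "burglar"},
    "Terry": {"education": "JH", "marital": "married", "gang": "shark", "job": "pusher"},
    "Fred":  {"education": "JH", "marital": "married", "gang": "shark", "job": "burglar"},
}


def has_job(name, job):
    """Direct lookup by name: one step, however many people are stored."""
    return people[name]["job"] == job


def who_has(attribute, value):
    """Content-based question: every record has to be searched serially."""
    return [name for name, record in people.items() if record[attribute] == value]


print(has_job("Fred", "burglar"))    # True
print(who_has("job", "burglar"))     # ['John', 'Fred']
```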

Jets and Sharks McClelland set up the following connectionist database: Links between names and professions /education /marital status etc.. are made by excitatory connections (green) Within areas of knowledge, categorical items have inhibitory connections (red) - these inhibitory connections help the network give a concrete answer (I.e. jet or shark, but not both)

Jets and Sharks Content addressability:
By putting activation into the network at the burglar node, we get information about who is a burglar (Al, Jim, John, Doug, Lance, George); this is known as ‘content addressability’.
There is also information about what age the burglars mostly are, whether the burglars are mostly Jets or Sharks, etc.

Jets and Sharks Typicality effects:
McClelland’s network also nicely models an aspect of human memory called ‘typicality’.
If we ask the net to tell us the name of a pusher, it is more likely to retrieve some pushers than others (Fred & Nick, but not Ol).
This is because Ol is not a typical pusher (and does not benefit from the excitation coming from the activated typical pusher nodes).

Parallel processing in memory
McClelland’s (1981) ‘Jets and Sharks’ model of memory shows how a database of information can be set up from which information about several different attributes (e.g. marital status, name, gang) can be retrieved in parallel.
Furthermore, more than one address in memory (e.g. several names) can be accessed at once (in parallel) by activation of an attribute (e.g. Jets). This is an aspect of human memory called ‘content addressability’, and contrasts with a memory system in which items are searched one by one (serially).
The memory in the network is distributed across all of the connection weights (a distributed database).

Connectionist models of double dissociations
Double dissociations are situations in neuropsychology in which you find that one brain damaged patient has a deficit in cognitive function A, but not B, whereas another patient has a deficit in B but not A This has been traditionally interpreted as indicating that A and B are cognitive functions which are independent (and located in different parts of the brain) Example: Dissociation between conditioned and expected fear

A double dissociation in acquired dyslexia
People can occasionally acquire dyslexia (reading difficulty) after brain injury.
Different types of acquired dyslexia have been identified: difficulty reading concrete words (e.g. tack) vs. difficulty reading abstract words (e.g. tact).
While most patients show a superiority for concrete words, some demonstrate better performance with abstract words (Warrington, 1981).
This has been described as a double dissociation, and researchers have suggested separable semantic memory stores for concrete and abstract words (at different locations in the brain).
Plaut & Shallice (1993) constructed a connectionist simulation to determine whether we do need to posit two separate stores on the basis of this double dissociation.

After training, Plaut & Shallice’s (1993) model was able to correctly read concrete and abstract words.
The next step was to lesion the model in several different ways (cut some connections), to determine what deficits would occur.
Figure labels: orthography, semantics, phonology.
Plaut & Shallice found that if you lesion several models in different locations you can model the double dissociation with a single system.

Plaut & Shallice’s (1993) finding is very significant as it shows that double dissociations do not necessarily mean that there are two separate systems involved. In this case both concrete and abstract words are represented in a distributed fashion across the entire system rather than in separate localised stores

Summary: Part 4 What connectionist models can do
Modelling human memory: McClelland’s (1981) Jets and Sharks model is a distributed memory database (rather than a serially accessed database). It neatly models content addressability and typicality, two aspects of human memory which a serially accessed memory store cannot model.
Modelling double dissociations in acquired dyslexia: Plaut & Shallice (1993) show that clinical double dissociations of ability to read concrete and abstract words can be modelled in a single-route network, in which information about both is processed in parallel and distributed across the net.
