Methods and Techniques in Neuroscience An introduction to neural network models Alan Pickering Autumn Term 2005
Outline: Part 1 Neural learning mechanisms Learning through connections Hebbian learning What is connectionism (some definitions / distinctions) ? Connectionism Computational modelling Neural net modelling Building a simple model Terminology & structure How information passes through a model - exercise 1 How a model learns (supervised & unsupervised learning) - exercise 2
Neural learning: Originally, it was thought that new associations/memories were formed by the growth of new nerve cells in the brain. Santiago Ramon y Cajal proposed that learning might occur through the strengthening of existing connections between nerve cells. Donald O. Hebb formulated Cajal's ideas into a hypothetical biological mechanism dubbed Hebbian learning.
Hebbian Learning: Learned associations through the strengthening of connections. [Diagram: nodes labelled UR, US and CS] Hebbian learning: when two connected cells fire simultaneously, the connection between them strengthens (Hebb, 1949). Discovered at a biomolecular level by Lomo (1966) (long-term potentiation).
What is connectionism? Starting definition: Connectionism is the study of how learning can occur through the strengthening and/or weakening of connections between representations of pieces of information and/or behavioural responses. We can relate this definition straight back to classical and operant conditioning (connections strengthened between stimulus and response, or between CS and CR). Connectionist modelling is usually concerned with more complex associations and larger numbers of connections. Computers are used to store and process all of this information. A connectionist model is a computer simulation of learning.
What is connectionism? Biological plausibility: Some (but not all) connectionists appeal to the physical similarity of connectionist models to networks of neurons in the brain (neural net modelling) Neurons: Nodes Synapses: Connections (weights)
Structure and terminology of connectionist models [Diagram: an input layer of input nodes (units), a layer of hidden nodes (units), and an output layer of output nodes (units)]
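How a connectionist model works: What input arrives at the output node (node u)? The input is the weighted sum of the outputs of the sending nodes: $I_u = \sum_j \mathrm{out}_j \, w_{ju}$. From j1: 1 * 0.8 = 0.8. From j2: 1 * -0.5 = -0.5. From j3: 0 * 0.2 = 0.0. Sum of inputs = 0.3. [Diagram: nodes j1, j2 and j3, with activations ($a_j$) and outputs ($\mathrm{out}_j$), connect through weights ($w_{ju}$) to node u, with activation ($a_u$). Key: w = weights, a = activation, o = output]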
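The weighted sum on this slide can be checked with a few lines of Python. This is only a minimal sketch: the function name `net_input` is an illustrative choice, and the outputs and weights are the values quoted on the slide.

```python
def net_input(outputs, weights):
    """Return I_u, the sum over sending nodes j of out_j * w_ju."""
    return sum(out_j * w_ju for out_j, w_ju in zip(outputs, weights))


# Values from the slide: outputs of j1, j2, j3 and their weights to node u.
outputs = [1, 1, 0]
weights = [0.8, -0.5, 0.2]

total = net_input(outputs, weights)
print(round(total, 2))  # rounded to hide floating-point noise
```

The same function can be reused for Exercise 1 by changing the list of outputs.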
Exercise 1: What input arrives at the output node (node u)? $I_u = \sum_j \mathrm{out}_j \, w_{ju}$. From j1: 1 * 0.8 = ??? From j2: 1 * -0.5 = ??? From j3: 1 * 0.2 = ??? Sum of inputs = ??? [Diagram: nodes j1, j2 and j3, with activations ($a_j$) and outputs ($\mathrm{out}_j$), connect through weights ($w_{ju}$) to node u, with activation ($a_u$). Key: w = weights, a = activation, o = output]
Activations and Outputs 1: Activation ($a_j$) = membrane potential of the cell. Activation = 0 is the resting state (-70 mV); +ve activation = cell being depolarised; -ve activation = cell being hyperpolarised (inhibited). Cells fire (send an action potential) when sufficiently depolarised. Output = mean firing rate of the cell. The output of node j is a function of its activation: $\mathrm{out}_j = f(a_j)$. [Diagram: nodes j1, j2 and j3 connected to node u]
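Activations and Outputs 2: How to convert activations into outputs? Two parameters matter: the threshold and the gain (degree of nonlinearity). [Diagram: nodes j1, j2 and j3, with activations ($a_j$) and outputs ($\mathrm{out}_j$), send weighted input to node u, which receives excitatory ($Exc_u$) and inhibitory ($Inh_u$) input]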
An example output function [Figure: output plotted against activation for high, mid and low gain, with the threshold marked]
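The slide does not give the equation of the plotted function, so the sketch below assumes a logistic (sigmoid) form, which is one common choice with a threshold and a gain. The function name `output` and the default parameter values are illustrative assumptions, not values from the lecture.

```python
import math


def output(activation, threshold=0.5, gain=5.0):
    """Logistic output function (an assumed form).

    The output is 0.5 when activation equals the threshold, and the
    curve becomes steeper (more step-like) as the gain increases.
    """
    return 1.0 / (1.0 + math.exp(-gain * (activation - threshold)))


# Compare low, mid and high gain at three activation levels.
for gain in (1.0, 5.0, 20.0):
    print(gain, [round(output(a, gain=gain), 3) for a in (0.0, 0.5, 1.0)])
```

With low gain the output changes almost linearly with activation; with high gain the node behaves nearly like an on/off switch at the threshold.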
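Activations and Outputs 3: Creating an activation equation, built up one term at a time.
Excitation drives the activation up: $\frac{da_u(t)}{dt} = c_e \, Exc_u(t)$
Excitation and inhibition are scaled by the distance from the ceiling (Max) and the floor (Min): $\frac{da_u(t)}{dt} = c_e \, Exc_u(t) \, (Max - a_u(t)) + c_i \, Inh_u(t) \, (a_u(t) - Min)$
A decay term returns the activation to rest: $\frac{da_u(t)}{dt} = c_e \, Exc_u(t) \, (Max - a_u(t)) + c_i \, Inh_u(t) \, (a_u(t) - Min) - c_d \, a_u(t)$
[Diagram: nodes j1, j2 and j3, with activations ($a_j$) and outputs ($\mathrm{out}_j$), send weighted input to node u, which receives excitatory ($Exc_u$) and inhibitory ($Inh_u$) input and has activation $a_u$]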
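One way to see what the full activation equation does is to integrate it numerically. The sketch below uses simple Euler steps; the step size, the constants and the function name `simulate_activation` are all assumptions made for illustration. It also assumes that inhibitory input is zero or negative (as with the negative weight in the earlier example), so that the inhibition term pushes activation down towards Min.

```python
def simulate_activation(exc, inh, steps=2000, dt=0.01,
                        c_e=1.0, c_i=1.0, c_d=0.2,
                        a_max=1.0, a_min=-1.0, a0=0.0):
    """Euler integration of
    da/dt = c_e*Exc*(Max - a) + c_i*Inh*(a - Min) - c_d*a
    with constant excitatory (exc >= 0) and inhibitory (inh <= 0) input.
    Returns the list of activations over time.
    """
    a = a0
    history = [a]
    for _ in range(steps):
        da_dt = (c_e * exc * (a_max - a)
                 + c_i * inh * (a - a_min)
                 - c_d * a)
        a += dt * da_dt
        history.append(a)
    return history


excited = simulate_activation(exc=0.8, inh=0.0)
inhibited = simulate_activation(exc=0.0, inh=-0.5)
print(round(excited[-1], 3), round(inhibited[-1], 3))
```

Under these assumptions the activation settles at a level between Min and Max rather than growing without limit, which is the point of the (Max - a) and (a - Min) terms.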
Simulated Activations and Outputs
How does a connectionist model learn? Following Cajal and Hebb, connectionist models learn through changes in the strength of connections: Weight changes There are two types of learning: Unsupervised learning In unsupervised learning the weight changes are made automatically and in relation to the degrees of association between incoming activations (E.g. Classical conditioning -- associations strengthen through temporal contiguity) Supervised learning In supervised learning the weight changes are made in proportion to the error at the output. In order to calculate the error at the output a teaching pattern is required for comparison =>hence supervised (E.g. Learning to talk or spell)
Unsupervised learning: 1. The most commonly used unsupervised learning rule is the Hebb rule (cf. Hebbian learning): $\Delta w = k \, a_u \, a_j$, where $\Delta w$ = weight change, k = a constant (e.g. 0.6), $a_u$ = post-synaptic activation, $a_j$ = pre-synaptic activation.
Unsupervised learning: 2. Example: Classical conditioning, using $\Delta w = k \, a_u \, a_j$, where $\Delta w$ = weight change, k = a constant (e.g. 0.6), $a_u$ = post-synaptic activation, $a_j$ = pre-synaptic activation. [Diagram: a network with "food" and "tone" input nodes, shown before conditioning, during conditioning and after conditioning]
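The conditioning example can be sketched with the Hebb rule in Python. The weight values on the original diagram are not recoverable, so the starting weights below (food already drives the response, tone does not) are assumptions, as are the names `hebb_update`, `w_food` and `w_tone`. For simplicity only the tone weight is updated.

```python
def hebb_update(weight, a_post, a_pre, k=0.6):
    """Hebb rule: delta_w = k * a_u * a_j, added to the current weight."""
    return weight + k * a_post * a_pre


# Assumed starting weights: food (US) drives the response, tone (CS) does not.
w_food, w_tone = 1.0, 0.0

# Before conditioning: the tone alone produces no response.
response_before = 1.0 * w_tone

# During conditioning: food and tone are presented together, so the
# response node is active while the tone node is also active.
a_response = 1.0 * w_food + 1.0 * w_tone
w_tone = hebb_update(w_tone, a_post=a_response, a_pre=1.0)

# After conditioning: the tone alone now produces a response.
response_after = 1.0 * w_tone
print(response_before, response_after)
```

Because the weight change depends only on the two activations being present together, no teaching signal is needed, which is what makes the rule unsupervised.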
Supervised learning: 1. In supervised learning the weight changes are made in proportion to the error at the output. In order to calculate the error at the output, a teaching pattern is required for comparison (hence "supervised"; e.g. learning to read or spell). The most commonly used supervised learning rule is the delta rule (Rosenblatt, 1966; Rumelhart & McClelland, 1986): $\Delta w_{ju} = k \, \delta_{up} \, a_j$, where $\Delta w_{ju}$ = weight change, k = a constant, $\delta_{up}$ = error at output u for input pattern p, $a_j$ = pre-synaptic activation.
Example: Learning to read. In order to read we need to learn how orthography (how a word looks) maps onto phonology (how it sounds). Imagine a network is learning how to pronounce "hit". Say that "hit" has the orthography; and the phonology; ………
Supervised learning: 2. Example: Learning to read, using $\Delta w_{ju} = k \, \delta_{up} \, a_j$, where $\Delta w_{ju}$ = weight change, k = a constant, $\delta_{up}$ = error at output u for input pattern p, $a_j$ = pre-synaptic activation. [Diagram: the orthography (input), phonology (output) and teaching pattern for "hit"] First, the error is calculated at each output. Second, the blame is apportioned (to active units contributing to the error).
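The two steps on this slide (find the error, then apportion blame to active inputs) can be sketched as follows. The input and teaching patterns are hypothetical, since the patterns on the original slide are not recoverable, and the sketch assumes each output is simply the weighted sum of the inputs. The function name `delta_rule_update` is also an illustrative choice.

```python
def delta_rule_update(weights, inputs, targets, k=0.5):
    """Apply delta_w_ju = k * error_u * a_j to every weight.

    weights[j][u] is the weight from input node j to output node u.
    Returns the new weights and the list of output errors.
    """
    n_in, n_out = len(inputs), len(targets)
    # Step 1: compute each output and its error against the teaching pattern.
    outputs = [sum(inputs[j] * weights[j][u] for j in range(n_in))
               for u in range(n_out)]
    errors = [t - o for t, o in zip(targets, outputs)]
    # Step 2: blame falls only on active inputs (a_j != 0) feeding
    # outputs that are in error (error_u != 0).
    new_weights = [[weights[j][u] + k * errors[u] * inputs[j]
                    for u in range(n_out)]
                   for j in range(n_in)]
    return new_weights, errors


# Hypothetical patterns: three input units, three output units.
weights = [[0.0] * 3 for _ in range(3)]
inputs = [1, 0, 1]     # made-up orthographic pattern
targets = [1, 0, 0]    # made-up teaching pattern
weights, errors = delta_rule_update(weights, inputs, targets)
print(errors)
print(weights)
```

Only the weights from active input nodes to the output node that was in error change; every other weight is left alone.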
Exercise 2. Example: Learning to read, using $\Delta w_{ju} = k \, \delta_{up} \, a_j$, where $\Delta w_{ju}$ = weight change, k = a constant, $\delta_{up}$ = error at output, $a_j$ = pre-synaptic activation. [Diagram: the orthography (input nodes j1, j2, j3), phonology (output nodes u1, u2, u3) and teaching pattern for "cat"] Which connection/s will be altered? Remember: 1st, find the error; 2nd, apportion blame; 3rd, alter weights coming from blamed input node/s to errorful output node/s.
Neural network models: Part 1 summary. Recap on neural learning mechanisms. Learning through connections: Cajal & Hebb both suggested that learning in the human brain may occur through changes in the strength of connections between neurons. Hebbian learning: Hebb formulated a mechanism by which this associative learning might occur; synchronous pre- and post-synaptic firing increases the strength of the connection. What is connectionism (some definitions / distinctions)? Connectionism deals with the ways in which statistical structure in the environment can be learned by the strengthening and/or weakening of connections between representations of items of information. Connectionism involves computer simulations of learning. Neural net modelling: connectionist models gain credibility because they resemble networks of neurons in the brain.
Neural network models: Part 1 Summary. Building a connectionist model. Structure: they are a network of nodes and weighted connections. Information passes through the network from node to node through the weighted connections, usually in one direction. Terminology: nodes can be thought of as neurons, and weighted connections as synapses between neurons. The flow of information through the model: the weighted connections control whether or not information passes from one layer of nodes to the next ($\sum_j \mathrm{out}_j \, w_{ju}$). Unsupervised learning: as in Hebbian learning, unsupervised models use correlations of pre- and post-synaptic firing to change strengths of connections ($\Delta w_{ju} = k \, a_u \, a_j$). Supervised learning: supervised learning relies on the error at the output (provided by a teaching pattern) to determine the changes in connection strength ($\Delta w_{ju} = k \, \delta_{up} \, a_j$).
Outline: Part 2 Why connectionism? (the case for parallel distributed processing) Parallel processing Distributed processing (representations) An example of how connectionist models can help us understand learning: Learning of inflectional morphology in early childhood Pinker & Prince (1988) Rumelhart & McClelland (1986)
Parallel distributed processing (PDP): Rumelhart & McClelland (1986) made a very strong case for the use of connectionist models, highlighting the qualities of parallel and distributed processing. Parallel processing can be contrasted with serial processing: processing several different pieces of information at the same time, rather than one after the other. Example: face processing. Distributed processing can be contrasted with localist processing: representations of information are distributed across the whole neural network, rather than occupying specific locations. Example: Karl Lashley and the engram (location of memory).
Parallel processing Example: Face processing When looking at this face we recognise it not by looking at individual features one at a time (the eyes, the nose, the smile, the grimace), but by processing these features and their spatial configuration in parallel
Parallel processing. Example: Face processing. [Diagram: eye nodes, nose nodes, smile nodes and a grimace node connected to identity nodes for Margaret Thatcher and Tony Blair]
Distributed processing. Example: Lashley and the search for the engram. Karl Spencer Lashley was a pioneering researcher in the biological foundations of memory in the rat. Lashley and colleagues lesioned the brains of rats in order to test whether or not they had removed the part responsible for memory. They found no single locus (engram) which appeared to be solely responsible for memory. Lashley concluded that memory was distributed throughout the cortex, rather than localised in one specific place.
A connectionist model of the acquisition of inflectional morphology Inflectional morphology The way in which we change words to convey: Plurality Past tense Examples: Cat + -s --> Cats (Plural) Play + -ed --> Played (Past tense) - *we focus on this* In English: 90% of verbs have a regular morphology 10% an irregular morphology
Acquisition of inflectional morphology: Past tense morphology. Regular morphology (90% of verbs): talk => talked; ram => rammed; pit => pitted. Irregular morphology (10% of verbs): hit => hit (no change); come => came (vowel change); sleep => slept (vowel change); go => went (arbitrary). How do children learn which words require a regular ending and which are irregular?
Acquisition of inflectional morphology: U-shaped development. Initially, children's early inflections are correct. Later they start making over-regularisation errors: "hitted", "sleeped", "goed". Later still, children recover from these errors. Phase 1: Rote learning -> initial error-free performance. Phase 2: Rule extraction -> over-regularisation errors. Phase 3: Rule + rote -> recovery from errors.
Acquisition of inflectional morphology: Dual-route model (Pinker & Prince, 1988)? The rule route deals with regular verbs. The exceptions route deals with irregulars. [Diagram: an input stem passes through either the exceptions route or the rule route to produce the output inflection] Errors in the middle of development occur due to overuse of the rule system. But do we really need two routes to explain this pattern of learning?
Acquisition of inflectional morphology: Connectionist model (Rumelhart & McClelland, 1986)? A single-route connectionist model for learning the past tense. No rule route for producing the regular ending. The network learned to associate regular & irregular English verb stems with their past tense forms. [Diagram: a Wickelfeature representation of the stem is mapped onto a Wickelfeature representation of the past tense] One single network learns to produce the past tense for all the verbs it is taught.
Acquisition of inflectional morphology: Connectionist model (Rumelhart & McClelland, 1986)? Like the children, the network made over-regularisation errors and demonstrated U-shaped development in its performance on irregular verbs.
Acquisition of inflectional morphology: Connectionist model of the acquisition of inflectional morphology (Rumelhart & McClelland, 1986). This model demonstrates that behaviour which looks as if it is driven by a knowledge of rules can in fact be driven by a distributed representation of the statistical structure of the input (in this case a distributed representation of language). This explanation of how rule-like behaviour can occur in the absence of any actual representation of the rule is an important contribution of connectionist models.
Neural network models: Part 2 Summary. Why connectionism? (the case for parallel distributed processing) Parallel processing: connectionist models process information in parallel, rather than serially; this has intuitive appeal when we consider how we take information in. Distributed processing (representations): connectionist models represent learned information in the distributed connections across the network, rather than in single locations or rules; this helps explain why it is difficult to find a single location for memory in the brain. An example of how connectionist models can help us understand learning: Rumelhart & McClelland's (1986) model. This models the acquisition of inflectional morphology and demonstrates how networks can show rule-like behaviour in the absence of any representation of a rule.
Outline: Part 3. A history of neural network models. Single-layer networks: Rosenblatt's perceptron (1958); Minsky & Papert's (1969) criticism of perceptrons. Multi-layer networks: McClelland & Rumelhart's (1986) back-propagation of error learning rule.
Structure and terminology of connectionist models [Diagram: an input layer of input nodes (units), a layer of hidden nodes (units), and an output layer of output nodes (units)]
One-layer vs multi-layer networks: One-layer networks were the first connectionist models to emerge in the 1950s, e.g. Frank Rosenblatt's (1958) perceptron. In these networks learning occurs through changes in the weights of only one layer. However, networks which only change one layer of weights have some important limitations. These were pointed out by Minsky & Papert (1969)…
Minsky & Papert (1969): Minsky & Papert (1969) made the point that single-layer networks cannot solve non-linearly separable problems. Maths example: the XOR problem. Task: to learn to solve XOR (inputs -> output): 0 0 -> 0; 0 1 -> 1; 1 0 -> 1; 1 1 -> 0. If we think about the sums of the inputs (0, 1, 1 and 2), we can see why this isn't linearly separable: when the sum of the inputs increases to 2, the desired output goes back down to 0. Single-layer networks cannot solve this kind of problem.
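The limitation can be illustrated with a brute-force search. The sketch below tries every combination of two weights and a bias on a coarse grid for a single threshold unit; the grid, the threshold-at-zero output function and the function names are assumptions chosen for illustration. A grid search is not a proof, but it shows the contrast with a linearly separable task such as OR.

```python
from itertools import product


def step(x):
    """Threshold output: fire (1) only if the net input is above zero."""
    return 1 if x > 0 else 0


def find_single_layer_solution(truth_table, grid):
    """Search the grid for (w1, w2, bias) that reproduces the truth table.

    Returns the first solution found, or None if no grid point works.
    """
    for w1, w2, bias in product(grid, repeat=3):
        if all(step(w1 * x1 + w2 * x2 + bias) == target
               for (x1, x2), target in truth_table.items()):
            return (w1, w2, bias)
    return None


XOR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}
OR = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 1}
grid = [i / 4 for i in range(-8, 9)]  # -2.0 to 2.0 in steps of 0.25

print("OR solution:", find_single_layer_solution(OR, grid))
print("XOR solution:", find_single_layer_solution(XOR, grid))
```

The search finds weights for OR but none for XOR, because no single straight line through the input space separates the XOR cases that need a 1 from those that need a 0.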
Minsky & Papert (1969): The inability to solve non-linear problems is a problem for any model of human learning, because humans can solve non-linear problems. Psychological example 1: Learning to eat the right amount. We all have to learn to eat enough to stay fit, but not so much as to make us sick. This is like solving the XOR problem (we can learn to eat some of the food but not all of it). Psychological example 2: Connected and unconnected figures. [Figure: four line figures labelled a, b, c and d]
Minsky & Papert (1969): Psychological example 2: Connected and unconnected figures. [Figure: four line figures labelled a, b, c and d] How might a single-layer network try to solve this? (1 = connected, 0 = unconnected) All figures have three horizontal lines, so the network has to work this out on the basis of the presence of vertical lines (at particular locations). The net might start by discriminating between a connected and an unconnected figure (e.g. c and d) by locating a vertical line (e.g. in the bottom left). But this also leads the net to discriminate between figures which we want to group together (e.g. c and b).
A solution: Multi-layered networks. Rosenblatt and Minsky & Papert agreed that this problem would be solved if you could train networks with more than one layer. Networks with hidden units can redescribe the input into a format that can be separated linearly. Hidden units allow the network to treat physically similar inputs as different, as the need arises.
Multi-layered networks solve XOR: A multilayered network can solve the XOR problem. You can set up a two-layered network (with appropriate weights) to solve it. [Diagram: example network with input, hidden and output nodes, shown alongside the XOR inputs and outputs] But the real problem envisaged by Minsky & Papert was how to train the weights on a network with two layers.
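A network of this kind can be written out by hand. The weights on the original diagram are not recoverable, so the ones below are one assumed solution among many: one hidden unit detects "at least one input on", the other detects "both inputs on", and the output fires for the first but not the second. The function names are illustrative.

```python
def step(net, threshold):
    """Threshold unit: output 1 if the net input exceeds the threshold."""
    return 1 if net > threshold else 0


def xor_network(x1, x2):
    """Two-layer network with hand-set (assumed) weights that computes XOR."""
    # Hidden layer: both units receive weight 1 from each input,
    # but have different thresholds.
    h_or = step(1 * x1 + 1 * x2, threshold=0.5)    # on if either input is on
    h_and = step(1 * x1 + 1 * x2, threshold=1.5)   # on only if both are on
    # Output layer: excited by h_or (+1), inhibited by h_and (-1).
    return step(1 * h_or - 1 * h_and, threshold=0.5)


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_network(x1, x2))
```

The hidden units redescribe the four input patterns so that the output unit faces a linearly separable problem.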
Training multi-layered networks: When a network learns through the delta rule, a teaching pattern is there to correct the weights leading into the output, by measuring the difference between the teaching pattern and the output. [Diagram: a network's output pattern compared with a teaching pattern] BUT there is no teaching pattern for the hidden layer which can be used to change the weights from the input. The inability to find a learning rule which would change all weights in a network stifled connectionist research until a solution was found in 1986.
Back-propagation of error: McClelland & Rumelhart (1986) came up with a learning rule to solve this: back-prop. [Diagram: a network's output pattern compared with a teaching pattern] Backprop learning: backprop is an extension of the delta (supervised learning) rule. Error at the output is used to assign blame to particular hidden units. This blame is then converted to error, which is used to calculate changes to the weights from input to hidden units.
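The blame-assignment idea can be sketched for a tiny network with two inputs, two hidden units and one output. This is a minimal illustration under stated assumptions, not the lecture's own implementation: it assumes logistic units, a squared-error measure, bias weights stored as the last entry of each weight list, and made-up starting weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(x, w_hid, w_out):
    """Return (hidden activations, output). The last weight in each list is a bias."""
    hidden = [sigmoid(sum(w * a for w, a in zip(row, x + [1.0])))
              for row in w_hid]
    out = sigmoid(sum(w * a for w, a in zip(w_out, hidden + [1.0])))
    return hidden, out


def squared_error(target, out):
    return 0.5 * (target - out) ** 2


def backprop_step(x, target, w_hid, w_out, k=0.5):
    """One back-propagation update for a single input pattern."""
    hidden, out = forward(x, w_hid, w_out)
    # Error at the output, scaled by the slope of the output function.
    delta_out = (target - out) * out * (1.0 - out)
    # Blame for each hidden unit: the output error passed back through
    # the (old) weight from that hidden unit to the output.
    delta_hid = [delta_out * w_out[h] * hidden[h] * (1.0 - hidden[h])
                 for h in range(len(hidden))]
    # Delta-rule style updates for both layers of weights.
    new_w_out = [w + k * delta_out * a
                 for w, a in zip(w_out, hidden + [1.0])]
    new_w_hid = [[w + k * delta_hid[h] * a
                  for w, a in zip(w_hid[h], x + [1.0])]
                 for h in range(len(hidden))]
    return new_w_hid, new_w_out


# Made-up starting weights and a single training pattern.
w_hid = [[0.5, -0.4, 0.1], [0.3, 0.8, -0.2]]
w_out = [0.7, -0.6, 0.05]
x, target = [1.0, 0.0], 1.0

before = squared_error(target, forward(x, w_hid, w_out)[1])
w_hid2, w_out2 = backprop_step(x, target, w_hid, w_out)
after = squared_error(target, forward(x, w_hid2, w_out2)[1])
print(before, after)
```

The hidden-layer update has the same form as the delta rule; the only new ingredient is that the hidden unit's "error" is computed from the output error rather than from a teaching pattern.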
Summary: Part 3. Single-layer networks. Rosenblatt's perceptron (1958): the first connectionist models were known as perceptrons, and learned through changes to a single layer of weights. Minsky & Papert's (1969) criticism of perceptrons: single-layer networks cannot be set up to solve non-linearly separable problems (e.g. the XOR problem). This is a problem because humans can solve non-linearly separable problems (e.g. Minsky & Papert's connected figures discrimination).
Summary: Part 3. A history of connectionist models. Multi-layer networks: multi-layered networks can solve non-linearly separable problems by redescribing the input in a linearly separable way at a set of intermediary nodes called hidden units. Minsky & Papert (1969) knew that multilayered networks could provide a solution to non-linearly separable problems, but were not very optimistic about finding a way of training both layers of weights, until McClelland & Rumelhart (1986) came up with a learning rule which could change the weights in multiple layers of weights: back-propagation of error (BP). BP works as an extension of normal delta rule supervised learning, but changes the weights arriving at the hidden units in relation to the blame assigned to these units.
Outline: Part 4. Some applications. Modelling human memory: McClelland's (1981) Jets and Sharks model. Modelling double dissociations in acquired dyslexia: Plaut & Shallice (1993).
Parallel distributed processing in memory: McClelland's (1981) Jets and Sharks model of memory. A simulation of how humans might store information about people. Imagine the Jets and Sharks are two rival gangs in your town. You know a lot about the gang members: how old they are (20s, 30s, 40s); how well educated they are (junior high, high school, college); what their marital status is (single, married); what their job is (pusher, bookie, burglar). How do you access this information?
Jets and Sharks: A computer (or conventional database) might store the information indexed to name. The name links all the information about that person together. Records (name: education, marital status, gang, job): John: HS, single, Jet, burglar. Terry: JH, married, Shark, pusher. Fred: JH, married, Shark, burglar. Name indexing is good for answering questions like "Is Fred a burglar?" but bad at answering "Who is a burglar?"
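The contrast between the two kinds of question can be sketched with a name-indexed Python dictionary holding the three records from this slide. The field names and the helper functions `is_job` and `who_has` are illustrative assumptions.

```python
# Name-indexed store: each name points at one complete record.
people = {
    "John":  {"education": "HS", "marital": "single",  "gang": "Jet",   "job": "burglar"},
    "Terry": {"education": "JH", "marital": "married", "gang": "Shark", "job": "pusher"},
    "Fred":  {"education": "JH", "marital": "married", "gang": "Shark", "job": "burglar"},
}


def is_job(name, job):
    """'Is Fred a burglar?' One direct lookup by name."""
    return people[name]["job"] == job


def who_has(field, value):
    """'Who is a burglar?' Every record must be checked one by one."""
    return [name for name, record in people.items() if record[field] == value]


print(is_job("Fred", "burglar"))
print(who_has("job", "burglar"))
```

The first question costs a single lookup, while the second requires a serial scan of every record, which is the weakness of name indexing that the slide points out.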
Jets and Sharks McClelland set up the following connectionist database: Links between names and professions /education /marital status etc.. are made by excitatory connections (green) Within areas of knowledge, categorical items have inhibitory connections (red) - these inhibitory connections help the network give a concrete answer (I.e. jet or shark, but not both)
Jets and Sharks: Content addressability. By putting activation into the network at the burglar node, we get information about who is a burglar (Al, Jim, John, Doug, Lance, George); this is known as content addressability. There is also information about what age the burglars mostly are, whether the burglars are mostly Jets or Sharks, etc.
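Content addressability can be illustrated with a much-simplified spreading-activation sketch. It uses only the three example records from the name-indexed table (John, Terry, Fred), spreads activation for just two steps, and leaves out the inhibitory connections of the real model; the function name `retrieve_by_content` and the data layout are assumptions made for illustration.

```python
# Each person node is linked (excitatory connections) to its attribute nodes.
RECORDS = {
    "John":  {"HS", "single", "Jet", "burglar"},
    "Terry": {"JH", "married", "Shark", "pusher"},
    "Fred":  {"JH", "married", "Shark", "burglar"},
}


def retrieve_by_content(cue, records=RECORDS):
    """Put activation in at one attribute node and let it spread.

    Step 1: the cue node excites every person node linked to it.
    Step 2: active person nodes excite their other attribute nodes.
    Returns (person activations, attribute activations).
    """
    person_act = {name: (1.0 if cue in attrs else 0.0)
                  for name, attrs in records.items()}
    attr_act = {}
    for name, attrs in records.items():
        for attr in attrs:
            if attr != cue:
                attr_act[attr] = attr_act.get(attr, 0.0) + person_act[name]
    return person_act, attr_act


people_active, attributes_active = retrieve_by_content("burglar")
print(people_active)
print(attributes_active)
```

Activating the "burglar" node turns on all the burglars at once, and their shared attributes then receive activation in proportion to how many burglars have them, without any record-by-record search.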
Jets and Sharks: Typicality effects. McClelland's network also nicely models an aspect of human memory called typicality. If we ask the net to tell us the name of a pusher, it is more likely to retrieve some pushers than others (Fred & Nick, but not Ol). This is because Ol is not a typical pusher (and does not benefit from the excitation coming from the activated typical pusher nodes).
Parallel processing in memory: McClelland's (1981) Jets and Sharks model of memory shows how a database of information can be set up, from which information about several different attributes (e.g. marital status, name, gang, etc.) can be retrieved in parallel. Furthermore, more than one address in memory (e.g. several names) can be accessed at once (in parallel) by activation of an attribute (e.g. Jets). This is an aspect of human memory called content addressability, and contrasts with a memory system in which items are searched one-by-one (serially). The memory in the network is distributed across all of the connection weights (a distributed database).
Connectionist models of double dissociations Double dissociations are situations in neuropsychology in which you find that one brain damaged patient has a deficit in cognitive function A, but not B, whereas another patient has a deficit in B but not A This has been traditionally interpreted as indicating that A and B are cognitive functions which are independent (and located in different parts of the brain) Example: Dissociation between conditioned and expected fear
A double dissociation in acquired dyslexia People can occasionally acquire dyslexia (reading difficulty) after brain injury Different types of acquired dyslexia have been identified: Difficulty reading concrete words (e.g. tack) vs. difficulty reading abstract words (e.g. tact) While most patients show a superiority for concrete words, some demonstrate better performance with abstract words (Warrington, 1981) This has been described as a double dissociation, and researchers have suggested separable semantic memory stores for concrete and abstract words (at different locations in the brain) Plaut & Shallice (1993) constructed a connectionist simulation to determine whether we do need to posit two separate stores on the basis of this double dissociation…..
A double dissociation in acquired dyslexia: After training, Plaut & Shallice's (1993) model was able to correctly read concrete and abstract words. The next step was to lesion the model in several different ways (cut some connections), to determine what deficits would occur. [Diagram: a network linking orthography, semantics and phonology] Plaut & Shallice found that if you lesion several models in different locations you can model the double dissociation with a single system.
A double dissociation in acquired dyslexia: Plaut & Shallice's (1993) finding is very significant, as it shows that double dissociations do not necessarily mean that there are two separate systems involved. In this case both concrete and abstract words are represented in a distributed fashion across the entire system, rather than in separate localised stores.
Summary: Part 4. What connectionist models can do. Modelling human memory: McClelland's (1981) Jets and Sharks model is a distributed memory database (rather than a serially accessed database). It neatly models content addressability and typicality, two aspects of human memory which a serially accessed memory store cannot model. Modelling double dissociations in acquired dyslexia: Plaut & Shallice (1993) show that clinical double dissociations of the ability to read concrete and abstract words can be modelled in a single-route network, in which information about both is processed in parallel and distributed across the net.