Slide 1: Methods and Techniques in Neuroscience: An introduction to neural network models
Alan Pickering, Autumn Term 2005
Slide 2: Outline: Part 1
Neural learning mechanisms
- Learning through connections
- Hebbian learning
What is connectionism (some definitions / distinctions)?
- Connectionism
- Computational modelling
- Neural net modelling
Building a simple model
- Terminology & structure
- How information passes through a model - exercise 1
- How a model learns (supervised & unsupervised learning) - exercise 2
Slide 3: Neural learning
Originally, it was thought that new associations/memories were formed by the growth of new nerve cells in the brain.
Santiago Ramon y Cajal (1852-1934) proposed that learning might occur through the strengthening of existing connections between nerve cells.
Donald O. Hebb (1904-1985) formulated Cajal's ideas into a hypothetical biological mechanism dubbed 'Hebbian learning'.
Slide 4: Hebbian learning
Hebbian learning: when two adjoining cells fire simultaneously, the connection between them strengthens (Hebb, 1949).
A corresponding biomolecular mechanism, long-term potentiation, was discovered by Lomo (1966).
[Figure: a conditioning diagram (US, UR, CS) illustrating learned associations through the strengthening of connections.]
Slide 5: What is connectionism?
Starting definition: "Connectionism is the study of how learning can occur through the strengthening and/or weakening of connections between representations of pieces of information and/or behavioural responses."
We can relate this definition straight back to classical and operant conditioning (connections strengthened between stimulus and response, or between CS and CR).
Connectionist modelling is usually concerned with more complex associations and larger numbers of connections; computers are used to store and process all of this information.
A connectionist model is a computer simulation of learning.
Slide 6: What is connectionism?
Biological plausibility: some (but not all) connectionists appeal to the physical similarity of connectionist models to networks of neurons in the brain (neural net modelling).
Neurons: nodes
Synapses: connections (weights)
Slide 7: Structure and terminology of connectionist models
[Figure: a layered network with an input layer of input nodes (units), a layer of hidden nodes (units), and an output layer of output nodes (units).]
Slide 8: How a connectionist model works
w = weights, a = activation, o = output.
What input arrives at the output node (node u)?
I_u = Σ_j out_j * w_ju
From j1: 1 * 0.8 = 0.8
From j2: 1 * (-0.5) = -0.5
From j3: 0 * 0.2 = 0.0
Sum of inputs = 0.3
[Figure: sending nodes j1, j2, j3 with activations a_j = 0.8, 0.7, 0.1, outputs out_j = 1, 1, 0, and weights w_ju = +0.8, -0.5, +0.2 onto output node u, whose activation a_u is to be found.]
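To make the arithmetic on slide 8 concrete, here is a minimal Python sketch (not part of the original lecture) of the net-input computation I_u = Σ_j out_j * w_ju; the function and variable names are illustrative.

def net_input(outputs, weights):
    """Summed input arriving at node u: I_u = sum_j out_j * w_ju."""
    return sum(out_j * w_ju for out_j, w_ju in zip(outputs, weights))

outputs = [1, 1, 0]          # out_j for j1, j2, j3 (slide 8 example)
weights = [0.8, -0.5, 0.2]   # w_ju from j1, j2, j3 to node u

print(net_input(outputs, weights))   # approximately 0.3 (0.8 - 0.5 + 0.0)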
Slide 9: Exercise 1
What input arrives at the output node (node u)?
I_u = Σ_j out_j * w_ju
From j1: 1 * 0.8 = ???
From j2: 1 * (-0.7) = ???
From j3: 1 * 0.4 = ???
Sum of inputs = ???
[Figure: sending nodes j1, j2, j3 with activations a_j = 0.8, 0.7, 0.6, outputs out_j = 1, 1, 1, and weights w_ju = +0.8, -0.7, +0.4 onto output node u; w = weights, a = activation, o = output.]
Slide 10: Activations and outputs 1
Activation = membrane potential of the cell.
Activation = 0: resting state (-70 mV).
Positive activation: the cell is being depolarised; negative activation: the cell is being hyperpolarised (inhibited).
Cells fire (send an action potential) when sufficiently depolarised.
Output = mean firing rate of the cell.
The output of node j is a function of its activation: out_j = f(a_j).
[Figure: nodes j1, j2, j3 with activations a_j = 0.8, 0.7, 0.1 feeding node u.]
Slide 11: Activations and outputs 2
How do we convert activations into outputs?
- Threshold
- Gain (degree of nonlinearity)
[Figure: nodes j1, j2, j3 (activations 0.8, 0.7, 0.6; weights +0.2, +0.8, +0.5) and node k1 (activation 0.1; weight -0.7) provide excitatory input Exc_u and inhibitory input Inh_u to node u.]
Slide 12: An example output function
[Figure: output plotted against activation for high, mid, and low gain, with the threshold marked.]
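The slide does not give the formula behind the plotted curves; a logistic (sigmoid) output function is one common choice, sketched below with gain and threshold parameters (names and default values are illustrative).

import math

def node_output(activation, gain=1.0, threshold=0.0):
    """Logistic output function: higher gain gives a steeper, more step-like
    curve; the threshold shifts the curve's midpoint along the activation axis."""
    return 1.0 / (1.0 + math.exp(-gain * (activation - threshold)))

for gain in (0.5, 2.0, 10.0):   # lo, mid, hi gain
    print(gain, [round(node_output(a, gain, threshold=0.5), 2)
                 for a in (0.1, 0.5, 0.9)])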
Slide 13: Activations and outputs 3
Creating an activation equation (built up in steps on the slide); the full form is:
da_u(t)/dt = c_e * Exc_u(t) * (Max - a_u(t)) + c_i * Inh_u(t) * (a_u(t) - Min) - c_d * a_u(t)
[Figure: as on slide 11, nodes j1, j2, j3 and k1 send excitatory input Exc_u and inhibitory input Inh_u to node u, whose activation is a_u.]
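A minimal sketch of how the activation equation on slide 13 could be simulated with a simple Euler step; the constants c_e, c_i, c_d and the Max/Min bounds are illustrative values, not ones given in the lecture.

def activation_step(a_u, exc_u, inh_u, dt=0.1,
                    c_e=1.0, c_i=1.0, c_d=0.1, a_max=1.0, a_min=-0.2):
    """One Euler step of:
    da_u/dt = c_e*Exc_u*(Max - a_u) + c_i*Inh_u*(a_u - Min) - c_d*a_u
    Exc_u is the summed excitatory input and Inh_u the summed inhibitory input
    (a negative quantity), so activation is pushed towards Max or Min and
    decays towards 0 when there is no input."""
    da_dt = (c_e * exc_u * (a_max - a_u)
             + c_i * inh_u * (a_u - a_min)
             - c_d * a_u)
    return a_u + dt * da_dt

a = 0.0
for _ in range(20):                      # constant excitatory input
    a = activation_step(a, exc_u=0.5, inh_u=0.0)
print(round(a, 3))                       # activation has climbed towards Max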
Slide 15: How does a connectionist model learn?
Following Cajal and Hebb, connectionist models learn through changes in the strength of connections: weight changes.
There are two types of learning:
- Unsupervised learning: the weight changes are made automatically, in relation to the degree of association between incoming activations (e.g. classical conditioning, where associations strengthen through temporal contiguity).
- Supervised learning: the weight changes are made in proportion to the error at the output. In order to calculate the error at the output, a teaching pattern is required for comparison, hence "supervised" (e.g. learning to talk or spell).
Slide 16: Unsupervised learning 1
The most commonly used unsupervised learning rule is the "Hebb rule" (cf. Hebbian learning):
Δw = k * a_u * a_j
where Δw = weight change, k = a constant (e.g. 0.6), a_u = post-synaptic activation, a_j = pre-synaptic activation.
Slide 17: Unsupervised learning 2
Example: classical conditioning, using Δw = k * a_u * a_j
(Δw = weight change, k = a constant, e.g. 0.6, a_u = post-synaptic activation, a_j = pre-synaptic activation).
[Figure: before conditioning, the food-to-response weight is 1.0 and the tone-to-response weight is 0.0; during conditioning, food and tone are presented together, so both nodes are active (activation 1); after conditioning, the tone-to-response weight has increased by Δw = 0.6 * 1 * 1 = 0.6, so that the summed input to the response node is now 1.0 + 0.6 = 1.6.]
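A small Python sketch (not from the lecture) of the Hebb rule and the conditioning example above; the node names and the constant k = 0.6 follow the slide, everything else is illustrative.

def hebb_update(w, a_u, a_j, k=0.6):
    """Hebb rule: delta_w = k * a_u * a_j (post- times pre-synaptic activation)."""
    return w + k * a_u * a_j

# Before conditioning: food drives the response (weight 1.0), tone does not (0.0).
w_food, w_tone = 1.0, 0.0

# During conditioning, food and tone are presented together, so both the tone
# node and the (food-driven) response node are active at the same time.
a_response, a_tone = 1.0, 1.0
w_tone = hebb_update(w_tone, a_response, a_tone)   # 0.0 + 0.6 * 1 * 1

print(w_tone)   # 0.6: after conditioning the tone alone partly drives the response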
Slide 18: Supervised learning 1
In supervised learning the weight changes are made in proportion to the error at the output.
In order to calculate the error at the output, a teaching pattern is required for comparison (hence "supervised"), e.g. learning to read or spell.
The most commonly used supervised learning rule is the "delta rule" (Rosenblatt, 1958; Rumelhart & McClelland, 1986):
Δw_ju = k * δ_up * a_j
where Δw_ju = weight change, k = a constant, δ_up = error at output node u for input pattern p, a_j = pre-synaptic activation.
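A minimal sketch of a single delta-rule weight update, with an illustrative learning-rate value; only the rule itself comes from the slide.

def delta_update(w_ju, target_u, out_u, a_j, k=0.5):
    """Delta rule: delta_w_ju = k * error_u * a_j, where error_u is the
    difference between the teaching pattern and the actual output at node u."""
    error_u = target_u - out_u
    return w_ju + k * error_u * a_j

# The teaching pattern says node u should output 1, it actually output 0,
# and the sending node j was active, so the weight from j to u is increased.
print(delta_update(w_ju=0.0, target_u=1.0, out_u=0.0, a_j=1.0))   # 0.5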
Slide 19: Example: Learning to read
In order to read, we need to learn how orthography (how a word looks) maps onto phonology (how it sounds).
Imagine a network is learning how to pronounce 'hit'. Say that 'hit' has the orthography and the phonology shown on the slide (as binary input and teaching patterns).
Slide 20: Supervised learning 2
Example: learning to read. 'Hit' has an orthography (the input pattern) and a phonology (the teaching pattern), as shown on the slide.
Δw_ju = k * δ_up * a_j
(Δw_ju = weight change, k = a constant, δ_up = error at the output for input pattern p, a_j = pre-synaptic activation.)
First, the error is calculated at each output.
Second, the blame is apportioned (to active units contributing to the error).
Slide 21: Exercise 2
Example: learning to read. 'Cat' has the orthography 0 1 0 and the phonology (teaching pattern) shown on the slide.
Δw_ju = k * δ_up * a_j
(Δw = weight change, k = a constant, δ_up = error at the output, a_j = pre-synaptic activation.)
Which connection(s) will be altered? Remember:
1st, find the error
2nd, apportion blame
3rd, alter the weights coming from the blamed input node(s) to the errorful output node(s)
[Figure: input nodes j1, j2, j3 connected to output nodes u1, u2, u3.]
Slide 22: Neural network models: Part 1 summary
Recap on neural learning mechanisms
- Learning through connections: Cajal and Hebb both suggested that learning in the human brain may occur through changes in the strength of connections between neurons.
- Hebbian learning: Hebb formulated a mechanism by which this associative learning might occur; synchronous pre- and post-synaptic firing increases the strength of the connection.
What is connectionism (some definitions / distinctions)?
- Connectionism deals with the ways in which statistical structure in the environment can be learned by the strengthening and/or weakening of connections between representations of items of information.
- Connectionism involves computer simulations of learning.
- Neural net modelling: connectionist models gain credibility because they resemble networks of neurons in the brain.
Slide 23: Neural network models: Part 1 summary
Building a connectionist model
- Structure: a network of 'nodes' and 'weighted connections'. Information passes from node to node through the weighted connections, usually in one direction.
- Terminology: nodes can be thought of as neurons, and weighted connections as synapses between neurons.
- The flow of information through the model: the weighted connections control whether or not information passes from one layer of nodes to the next (I_u = Σ_j out_j * w_ju).
- Unsupervised learning: as in Hebbian learning, unsupervised models use correlations of pre- and post-synaptic firing to change the strengths of connections (Δw_ju = k * a_u * a_j).
- Supervised learning: supervised learning relies on the error at the output (provided by a teaching pattern) to determine the changes in connection strength (Δw_ju = k * δ_up * a_j).
Slide 24: Outline: Part 2
Why connectionism? (the case for parallel distributed processing)
- Parallel processing
- Distributed processing (representations)
An example of how connectionist models can help us understand learning: the learning of inflectional morphology in early childhood
- Pinker & Prince (1988)
- Rumelhart & McClelland (1986)
Slide 25: Parallel distributed processing (PDP)
Rumelhart & McClelland (1986) made a very strong case for the use of connectionist models, highlighting the qualities of parallel and distributed processing.
Parallel processing
- Parallel processing can be contrasted with serial processing: several different pieces of information are processed at the same time, rather than one after the other.
- Example: face processing.
Distributed processing
- Distributed processing can be contrasted with localist processing: representations of information are distributed across the whole neural network, rather than occupying specific locations.
- Example: Karl Lashley and the 'engram' (the location of memory).
Slide 26: Parallel processing
Example: face processing. When looking at a face, we recognise it not by looking at individual features one at a time (the eyes, the nose, the smile, the grimace), but by processing these features and their spatial configuration in parallel.
Slide 27: Parallel processing
Example: face processing.
[Figure: faces of Margaret Thatcher and Tony Blair feeding eye nodes, nose nodes, smile nodes, and a grimace node in parallel.]
Slide 28: Distributed processing
Example: Lashley and the search for the 'engram'.
Karl Spencer Lashley (1890-1958) was a pioneering researcher into the biological foundations of memory in the rat.
Lashley and colleagues lesioned the brains of rats in order to test whether or not they had removed the part responsible for memory. They found no single locus (engram) which appeared to be solely responsible for memory.
Lashley concluded that memory was distributed throughout the cortex, rather than localised in one specific place.
Slide 29: A connectionist model of the acquisition of inflectional morphology
Inflectional morphology is the way in which we change words to convey, for example, plurality or past tense:
- Cat + -s --> Cats (plural)
- Play + -ed --> Played (past tense) - *we focus on this*
In English, 90% of verbs have a regular morphology and 10% an irregular morphology.
Slide 30: Acquisition of inflectional morphology
Past tense morphology
Regular morphology (90% of verbs):
- talk => talked
- ram => rammed
- pit => pitted
Irregular morphology (10% of verbs):
- hit => hit ('no change')
- come => came ('vowel change')
- sleep => slept ('vowel change')
- go => went ('arbitrary')
How do children learn which words require a regular ending and which are irregular?
Slide 31: Acquisition of inflectional morphology
U-shaped development
- Initially, children's early inflections are correct.
- Later they start making errors: 'hitted', 'sleeped', 'goed' (over-regularisation errors).
- Later still, children recover from these errors.
Phase 1: Rote learning -> initial error-free performance
Phase 2: Rule extraction -> over-regularisation errors
Phase 3: Rule + rote -> recovery from errors
Slide 32: Acquisition of inflectional morphology
Dual-route model (Pinker & Prince, 1988)
- A 'rule' route deals with regular verbs; an exceptions route deals with irregulars.
[Figure: an input stem feeds both a Rule route and an Exceptions route, which converge on the output inflection.]
- Errors in the middle of development occur due to overuse of the 'rule' system.
But do we really need two routes to explain this pattern of learning?
Slide 33: Acquisition of inflectional morphology
Connectionist model (Rumelhart & McClelland, 1986)
- A single-route connectionist model for learning the past tense: one single network learns to produce the past tense for all the verbs it is taught, with no rule route for producing the regular ending.
- The network 'learned' to associate regular and irregular English verb stems with their past tense forms.
[Figure: a Wickelfeature representation of the stem maps onto a Wickelfeature representation of the past tense.]
Slide 34: Acquisition of inflectional morphology
Connectionist model (Rumelhart & McClelland, 1986)
Like the children, the network:
- made over-regularisation errors
- demonstrated U-shaped development in its performance on irregular verbs
Slide 35: Acquisition of inflectional morphology
Connectionist model of the acquisition of inflectional morphology (Rumelhart & McClelland, 1986)
This model demonstrates that behaviour which looks as if it is driven by knowledge of rules can in fact be driven by a distributed representation of the statistical structure of the input (in this case, a distributed representation of language).
This explanation of how 'rule-like behaviour' can occur in the absence of any actual representation of the rule is an important contribution of connectionist models.
Slide 36: Neural network models: Part 2 summary
Why connectionism? (the case for parallel distributed processing)
- Parallel processing: connectionist models process information in parallel, rather than serially; this has intuitive appeal when we consider how we take information in.
- Distributed processing (representations): connectionist models represent learned information in the distributed connections across the network, rather than in single locations or 'rules'; this helps explain why it is difficult to find a single location for memory in the brain.
An example of how connectionist models can help us understand learning:
- Rumelhart & McClelland's (1986) model of the acquisition of inflectional morphology demonstrates how networks can show 'rule-like' behaviour in the absence of any representation of a rule.
Slide 37: Outline: Part 3: A history of neural network models
Single-layer networks
- Rosenblatt's perceptron (1958)
- Minsky & Papert's (1969) criticism of perceptrons
Multi-layer networks
- McClelland & Rumelhart's (1986) 'back-propagation of error' learning rule
Slide 38: Structure and terminology of connectionist models
[Figure: input layer of input nodes (units), hidden nodes (units), and output layer of output nodes (units).]
A connectionist model transforms material at the input into something else at the output; this happens through the connection weights between input and output nodes.
Slide 39: One-layer vs multi-layer networks
One-layer networks were the first connectionist models to emerge, in the 1950s: Frank Rosenblatt's (1958) 'perceptron'.
In these networks learning occurs through changes in the weights of only one layer.
However, networks which only change one layer of weights have some important limitations. These were pointed out by Minsky & Papert (1969).
Slide 40: Minsky & Papert (1969)
Minsky & Papert (1969) made the point that single-layer networks cannot solve 'non-linearly separable' problems.
Maths example: the XOR problem. The task is to learn to solve XOR:
input 1  input 2  output
   1        1        0
   1        0        1
   0        1        1
   0        0        0
If you think about the sums of the inputs, we can see why this isn't linearly separable: as the sum of the inputs increases from 1 to 2, the desired output goes back down to 0.
Single-layer networks cannot solve this kind of problem.
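As a rough illustration of Minsky & Papert's point (not part of the lecture), the sketch below searches a coarse grid of single-layer weights and thresholds and finds none that reproduce XOR; the grid and step function are illustrative choices.

import itertools

patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]   # XOR
grid = [x / 10 for x in range(-20, 21)]                           # -2.0 ... 2.0

def solves_xor(w1, w2, theta):
    """A single unit: out = 1 if w1*x1 + w2*x2 exceeds the threshold theta."""
    return all((1 if w1 * x1 + w2 * x2 - theta > 0 else 0) == target
               for (x1, x2), target in patterns)

print(any(solves_xor(w1, w2, theta)
          for w1, w2, theta in itertools.product(grid, repeat=3)))   # False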
Slide 41: Minsky & Papert (1969)
The inability to solve non-linearly separable problems is a problem for any model of human learning, because humans can solve such problems.
Psychological example 1: learning to eat the right amount. We all have to learn to eat enough to stay fit, but not so much as to make us sick. This is like solving the XOR problem (we can learn to eat some of the food but not all of it).
Psychological example 2: connected and unconnected figures.
[Figure: four line figures, a, b, c, d, some connected and some unconnected.]
Slide 42: Minsky & Papert (1969)
Psychological example 2: connected and unconnected figures (a, b, c, d).
How might a single-layer network try to solve this? (1 = connected, 0 = unconnected)
- All the figures have three horizontal lines, so the network has to work this out on the basis of the presence of vertical lines (at particular locations).
- The net might start by discriminating between a connected and an unconnected figure (e.g. c and d) by locating a vertical line (e.g. in the bottom left).
- But this also leads the net to discriminate between figures which we want to group together (e.g. c and b).
Slide 43: A solution: multi-layered networks
Rosenblatt and Minsky & Papert agreed that this problem would be solved if you could train networks with more than one layer:
- Networks with 'hidden units' can redescribe the input into a format that can be separated linearly.
- "Hidden units allow the network to treat physically similar inputs as different, as the need arises."
Slide 44: Multi-layered networks solve XOR
A multi-layered network can solve the XOR problem. You can set up a two-layered network (input, hidden, and output units, with appropriate weights, e.g. +1 and -1) to solve XOR.
But the real problem envisaged by Minsky & Papert was how to train the weights on a network with two layers.
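A hand-wired two-layer network that solves XOR; the specific weight values below are illustrative choices, not the ones drawn on the slide.

def step(x):
    return 1 if x > 0 else 0

def xor_net(x1, x2):
    """h1 fires if at least one input is on, h2 only if both are on;
    the output fires for 'h1 and not h2', i.e. exactly one input on."""
    h1 = step(x1 + x2 - 0.5)
    h2 = step(x1 + x2 - 1.5)
    return step(h1 - h2 - 0.5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))   # 0 0 0 / 0 1 1 / 1 0 1 / 1 1 0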
Slide 45: Training multi-layered networks
When a network learns through the delta rule, a teaching pattern is there to correct the weights leading into the output, by measuring the difference between the teaching pattern and the output.
BUT there is no teaching pattern for the hidden layer which can be used to change the weights from the input!
The inability to find a learning rule which would change all the weights in a network stifled connectionist research until a solution was found in 1986.
Slide 46: Back-propagation of error
McClelland & Rumelhart (1986) came up with a learning rule to solve this: 'back-prop'.
Backprop learning:
- Backprop is an extension of the delta (supervised learning) rule.
- The error at the output is used to assign blame to particular hidden units.
- This blame is then converted to error, which is used to calculate the changes to the weights from the input to the hidden units.
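A minimal back-propagation sketch (not the lecture's own code): a 2-2-1 network with sigmoid units learning XOR. The network size, learning rate, number of epochs, and random initialisation are all illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)   # inputs
T = np.array([[0], [1], [1], [0]], dtype=float)               # teaching patterns

W1, b1 = rng.normal(0, 1, (2, 2)), np.zeros(2)   # input -> hidden weights
W2, b2 = rng.normal(0, 1, (2, 1)), np.zeros(1)   # hidden -> output weights
lr = 0.5

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(10000):
    h = sigmoid(X @ W1 + b1)                      # forward pass
    y = sigmoid(h @ W2 + b2)
    delta_out = (T - y) * y * (1 - y)             # error at the output
    delta_hid = (delta_out @ W2.T) * h * (1 - h)  # blame assigned to hidden units
    W2 += lr * h.T @ delta_out                    # weight changes: error times
    b2 += lr * delta_out.sum(axis=0)              #   sending activation, as in
    W1 += lr * X.T @ delta_hid                    #   the delta rule
    b1 += lr * delta_hid.sum(axis=0)

print(np.round(y.ravel(), 2))   # should end up close to [0, 1, 1, 0]
# (with an unlucky random initialisation, training can stall in a local minimum)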
Slide 47: Summary: Part 3
Single-layer networks
- Rosenblatt's perceptron (1958): the first connectionist models were known as 'perceptrons', and learned through changes to a single layer of weights.
- Minsky & Papert's (1969) criticism of perceptrons: single-layer networks cannot be set up to solve non-linearly separable problems (e.g. the XOR problem). This is a problem because humans can solve non-linearly separable problems (e.g. Minsky & Papert's connected-figures discrimination).
Slide 48: Summary: Part 3: A history of connectionist models
Multi-layer networks
- Multi-layered networks can solve non-linearly separable problems by redescribing the input in a linearly separable way at a set of intermediary nodes called 'hidden units'.
- Minsky & Papert (1969) knew that multi-layered networks could provide a solution to non-linearly separable problems, but were not very optimistic about finding a way of training both layers of weights, until...
McClelland & Rumelhart (1986)
- ...came up with a learning rule which could change the weights in multiple layers: 'back-propagation of error' (BP).
- BP works as an extension of normal 'delta rule' supervised learning, but changes the weights arriving at the hidden units in relation to the blame assigned to these units.
Slide 49: Outline: Part 4: Some applications
Modelling human memory
- McClelland's (1981) Jets and Sharks model
Modelling double dissociations in acquired dyslexia
- Plaut & Shallice (1993)
Slide 50: Parallel distributed processing in memory
McClelland's (1981) 'Jets and Sharks' model of memory: a simulation of how humans might store information about people.
Imagine the Jets and the Sharks are two rival gangs in your town. You know a lot about the gang members:
- how old they are (20s, 30s, 40s)
- how well educated they are (junior high, high school, college)
- what their marital status is (single, married)
- what their job is (pusher, bookie, burglar)
How do you access this information?
Slide 51: Jets and Sharks
A computer (or conventional database) might store the information indexed by name. The name links all the information about that person together:
John   HS  single   jet    burglar
Terry  JH  married  shark  pusher
Fred   JH  married  shark  burglar
Name indexing is good for answering questions like "Is Fred a burglar?", but bad at answering "Who is a burglar?"
Slide 52: Jets and Sharks
McClelland set up the following connectionist database:
- Links between names and professions / education / marital status etc. are made by excitatory connections (shown in green on the slide).
- Within areas of knowledge, categorical items have inhibitory connections (shown in red); these inhibitory connections help the network give a concrete answer (i.e. jet or shark, but not both).
Slide 53: Jets and Sharks
Content addressability: by putting activation into the network at the 'burglar' node, we get information about who is a burglar (Al, Jim, John, Doug, Lance, George); this is known as 'content addressability'.
There is also information about what age the burglars mostly are, whether the burglars are mostly Jets or Sharks, etc.
Slide 54: Jets and Sharks
Typicality effects: McClelland's network also nicely models an aspect of human memory called 'typicality'.
If we ask the net to tell us the name of a pusher, it is more likely to retrieve some pushers than others (Fred and Nick, but not Ol).
This is because Ol is not a typical pusher (and does not benefit from the excitation coming from the activated typical-pusher nodes).
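A toy sketch of content-addressable retrieval in a Jets-and-Sharks-style network. The three people and their attributes below are invented for illustration (they are not McClelland's actual database), and the within-pool inhibitory connections are omitted for brevity.

import itertools

people = {
    "John": {"jet", "burglar", "single"},
    "Fred": {"shark", "pusher", "married"},
    "Nick": {"shark", "pusher", "single"},
}

nodes = set(people) | set(itertools.chain.from_iterable(people.values()))
act = {n: 0.0 for n in nodes}
act["pusher"] = 1.0                  # probe the net by activating an attribute node

for _ in range(5):                   # let activation spread for a few cycles
    new = dict(act)
    for name, attrs in people.items():
        new[name] += 0.2 * sum(act[a] for a in attrs)   # attribute -> name links
        for a in attrs:
            new[a] += 0.2 * act[name]                    # name -> attribute links
    act = {n: min(v, 1.0) for n, v in new.items()}

print(sorted(((round(act[n], 2), n) for n in people), reverse=True))
# The pusher names (Fred, Nick) end up more active than John.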
Slide 55: Parallel processing in memory
McClelland's (1981) 'Jets and Sharks' model of memory shows how a database of information can be set up from which information about several different attributes (e.g. marital status, name, gang) can be retrieved in parallel.
Furthermore, more than one address in memory (e.g. several names) can be accessed at once (in parallel) by activating an attribute (e.g. Jets); this is an aspect of human memory called 'content addressability', and contrasts with a memory system in which items are searched one by one (serially).
The memory in the network is distributed across all of the connection weights (a distributed database).
Slide 56: Connectionist models of double dissociations
Double dissociations are situations in neuropsychology in which one brain-damaged patient has a deficit in cognitive function A but not B, whereas another patient has a deficit in B but not A.
This has traditionally been interpreted as indicating that A and B are independent cognitive functions (located in different parts of the brain).
Example: the dissociation between conditioned and expected fear.
Slide 57: A double dissociation in acquired dyslexia
People can occasionally acquire dyslexia (reading difficulty) after brain injury, and different types of acquired dyslexia have been identified: difficulty reading concrete words (e.g. 'tack') vs. difficulty reading abstract words (e.g. 'tact').
While most patients show a superiority for concrete words, some demonstrate better performance with abstract words (Warrington, 1981).
This has been described as a double dissociation, and researchers have suggested separable semantic memory stores for concrete and abstract words (at different locations in the brain).
Plaut & Shallice (1993) constructed a connectionist simulation to determine whether we really need to posit two separate stores on the basis of this double dissociation.
Slide 58: A double dissociation in acquired dyslexia
After training, Plaut & Shallice's (1993) model was able to correctly read concrete and abstract words.
The next step was to lesion the model in several different ways (cut some connections), to determine what deficits would occur.
[Figure: a network mapping orthography through semantics to phonology.]
Plaut & Shallice found that if you lesion several copies of the model in different locations, you can model the double dissociation with a single system.
Slide 59: A double dissociation in acquired dyslexia
Plaut & Shallice's (1993) finding is very significant, as it shows that double dissociations do not necessarily mean that there are two separate systems involved.
In this case, both concrete and abstract words are represented in a distributed fashion across the entire system, rather than in separate localised stores.
Slide 60: Summary: Part 4: What connectionist models can do
Modelling human memory
- McClelland's (1981) Jets and Sharks model is a distributed memory database (rather than a serially accessed database).
- It neatly models content addressability and typicality, two aspects of human memory which a serially accessed memory store cannot model.
Modelling double dissociations in acquired dyslexia
- Plaut & Shallice (1993) show that clinical double dissociations in the ability to read concrete and abstract words can be modelled in a single-route network, in which information about both is processed in parallel and distributed across the net.