Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language. The Brain's Capability for Language

1 The Brain's Capability for Language. Language-readiness: the capacity to acquire and use language. Biological evolution: the processes of genetic selection that yielded a language-ready brain. Cultural evolution: the processes of non-biological, social selection whereby a variety of languages arose and “cross-pollinated”. “Resetting the null hypothesis”, we claim that being “language-ready” does not imply “having language”, and that many of the features of language are the product of cultural evolution. The Mirror System Hypothesis: an account of how and why the human brain differs from that of other primates to make humans “language-ready”.

2 Social Evolution. Ease of acquisition of a skill does not imply genetic encoding of the skill per se. Example: surfing the Web and playing video games, where computer technology has evolved to match the preadaptations of the human brain and body. Social evolution can result from both biological and cultural (non-genetic) evolution. Clearly, language enables an immense amplification of the second factor. Human evolution saw the co-evolution of increasingly complex social structures and of increasingly complex patterns of behavior and communication to serve those social interactions. Gamble, in Timewalkers, views human evolution in terms of preadaptation for global colonization, with language one of many relevant traits in that evolution. He emphasizes the relation of humans to other species in the same environment.

3 Basic Questions. What were those features of the human brain that pre-adapted us for human language, i.e., made the human brain “language-ready”? What aspects of human language were already present in the earliest humans of 200,000 years ago? How can this perspective on language evolution help explain why language looks the way it does today? What has been the interplay of biological inheritance and cultural evolution since the emergence of Homo sapiens? Subsidiary debate: How can we best describe cultural evolution in a way which reflects both its dependence on that biological inheritance and the vast variety that behavior exhibits across different cultures? Dynamics of language on multiple time-scales: How can the study of language acquisition (a minor focus of the course) and of historical linguistics (at most a side topic) help tease apart biological and cultural contributions to the mastery of language by present-day humans?

4 Biological Basis of Human Evolution. Background factors for our work: Homo sapiens has bipedality, manual dexterity, and a larynx well suited for vocal production. We will stress (the Mirror System Hypothesis): the ability to relate the actions of others to one’s own actions. Key issue beyond the mirror: Imitation: the ability to generate and comprehend hierarchical structures “on the fly”. The ability to rapidly acquire a vast array of flexible strategies for pragmatic and communicative action.

5 The Ancestral Communication System. For want of better data, we assume that our common human-monkey ancestors shared with monkeys the following. Primate call system: a limited set of species-specific calls. Oro-facial gesture system: a limited set of gestures expressive of emotion and related social indicators. Note the linkage between the two systems: communication is inherently multi-modal. Combinatorial properties for the openness of communication are virtually absent in primate calls and oro-facial communication.

6 A Challenge for Our Account of Brain Evolution. The neural substrate for primate calls is in a region of monkey cingulate cortex. For most humans, language is heavily intertwined with speech. Why then is the cingulate area, already involved in monkey vocalization, not the homologue of the Broca's area substrate for language?

7 Evolution of Perceptual Systems. Primates have evolved general-purpose sensory mechanisms which are not restricted to providing innate releasing mechanisms for specific fixed action patterns. We do not require that all sensory processes be linked to or preceded by motor activity. Yet our need to understand the integrated role of speaker/signer and hearer/comprehender links us back to the motor system. This does not require a strict motor theory of perception, but we still need to specify how the mirror system is reflected into a multi-level language system. Contrast sorting phonemes without meaning (tuning the perceptual system) with recognizing the meaning of "milk", which integrates the action of drinking with the sensory experience (sight, taste) and reinforcement (reduction of thirst and hunger) that go with it.

8 Selection and Binding. Whether we are planning an action or creating a sentence, we are choosing a subset of objects, states of action, and relationships from a complex (perceived, remembered, or imagined) scene. How is the selection of those objects, actions and relationships, and the binding of distributed representations of each of these, neurally represented as an integrated subset of a greater integrated whole?

9 The Minimal Subscene as a Meeting Ground for Action, Action Recognition and Language. Explore the interaction between focal visual attention and the recognition of objects and actions to better understand how humans perceive and act upon complex dynamic scenes and how such perception and action are linked to language. Key concept: “minimal subscenes” in which an object is linked to another object or two via some action. Hypothesis: these ground the basic sentences containing a verb together with noun phrases which express such a subscene. Modeling component: integrated neural models for action recognition, visual attention, and minimal subscene representation to provide dynamic scene understanding integrated with scene description and question answering. Experimental component: visual psychophysics experiments to test human performance on minimal subscene description and recognition, question answering, and related attentional processes; fMRI experiments to constrain our analysis of how the interactions among our model components should be developed.

10 Broca’s and Wernicke’s Aphasias. Paul Broca (1865): Broca's aphasia is characterized by nonfluent speech, few words, short sentences, and many pauses. The words that the patient can produce come with great effort and often sound distorted. The melodic intonation is flat and monopitched. This gives the speech a telegraphic character, because of the deletion of functor words and disturbances in word order. However, aural comprehension for conversational speech is relatively intact. There is often an accompanying right hemiparesis involving face, arm, and leg. Carl Wernicke (1874): Wernicke’s aphasia is known as a fluent aphasia because the patient does not appear to have any difficulty articulating speech, but may be paraphasic. However, comprehension of speech is impaired and sometimes even single words are not comprehended. The patient may even speak in a meaningless “neologistic” jargon, devoid of any content but with free use of verb tenses, clauses, and subordinates.

11 Broca’s and Wernicke’s Aphasias. Warning: localization of aphasias is HIGHLY variable. [Figure: Wernicke’s original drawing (wrong hemisphere!), with labels Broca: “Production” and Wernicke: “Perception”, alongside MRI scans marking Broca’s area (negative image) and Wernicke’s area. The slice is viewed from below, so “right” is left.] MRI scans from Keith A. Johnson, M.D. and J. Alex Becker, The Whole Brain Atlas, http://www.med.harvard.edu/AANLIB/home.html

12 The Brain is Language-Ready Versus Language is Innate in the Brain. The central role of language-readiness is a key tenet of our approach to the brain mechanisms of language.

13 Brain Mechanisms for Language. We say: Broca’s and Wernicke’s areas are key components of the human brain’s mechanisms for language. But we suggest this is shorthand for: Broca’s and Wernicke’s areas evolved biologically to support a variety of mechanisms linking the production and perception of complex behaviors (e.g., those involved in imitation); they were so pre-adapted that when humans evolved language culturally, these areas “self-organized” in response to a language-rich environment to support language production and perception. Rather than: they evolved to encode the syntax of language per se.

14 Imaging the Human Brain. Measuring rCBF (regional cerebral blood flow) to get a measure of how activity in a brain region differs from task to task. PET: positron emission tomography. fMRI: functional magnetic resonance imaging.

15 A High Level View of Human Brain Activity. Left: a recent positron emission tomography (PET) study found that regional cerebral blood flow (rCBF) in the cerebellum is correlated with the accuracy of sensory prediction. Right: higher cortical regions, in particular the right dorsolateral prefrontal cortex (DLPFC; Brodmann area 9/46), come into play when there is a conscious conflict between intentions and their consequences. From the review “From the Perception of Action to the Understanding of Intention” by Sarah-Jayne Blakemore and Jean Decety, Nature Reviews Neuroscience 2001, 2:561 et seq.

16 Beyond Boxology. Not: looking at the brain one large region at a time. But: working down through the levels. How do brain regions compete and cooperate to yield overall behavior (including changes of internal states)? How does the neural circuitry within each region mediate its contribution to overall “information processing”? How does synaptic plasticity allow experience to “self-organize” the brain in both development and learning so that the brain at any time is a dynamic blend of “nature and nurture”?

17 Bickerton on Protolanguage. To keep the argument clear: a prelanguage is the system of utterances used by a particular hominid species (including Homo sapiens) which we may recognize as a precursor to human language in the modern sense. Warning: we have no traces of any hominid prelanguage!! Bickerton: infant language, pidgins, and the “language” taught to apes are all protolanguages made up of utterances comprising a few words in the current sense without syntactic structure. Bickerton’s hypothesis: the prelanguage of Homo erectus was a protolanguage in his sense. Language just “added syntax” through the evolution of Universal Grammar. My counter-proposal: the prelanguage of Homo erectus and early Homo sapiens was composed mainly of “unitary utterances” (e.g., “grufluk”). Words co-evolved culturally with syntax through fractionation.

18 Protolanguage in Historical Linguistics. For Dixon (1997), studying historical changes in languages, a protolanguage is a human language ancestral to a specific family of human languages. Hypothesis: the rate of historical language change supports the view that language in its full modern sense may not have been within the repertoire of early Homo sapiens, and that the subsequent development of languages rested more on cultural than on biological evolution. Deep time: the divergence of the Romance languages took about one thousand years. The divergence of the Indo-European languages with their immense diversity (Hindi, German, Italian, English, ...) took about 6,000 years. How can we imagine what has changed since the emergence of Homo sapiens some 200,000 years ago? Or in 5,000,000 years of prior hominid evolution?

19 Criteria for Language-Readiness. A hypothesis on which human brain mechanisms underlie language. Properties supporting prelanguage communication: Symbolization: the ability to associate an arbitrary symbol with a class of episodes, objects or actions. At first, these symbols may not have been word-like in the modern sense (e.g., “grufluk”), and they may have been manual gestures rather than being vocalized. Intentionality: extension of communication to be intended by the utterer to have a particular effect on the recipient. Parity: what counts for the speaker must count for the listener (mirror property). More general properties: Hierarchical structuring: perception and action involving components with sub-parts (action-oriented perception), but the units of these structures may not map to symbols. Temporal ordering: coding hierarchical structures “of the mind”. Beyond the here-and-now: recalling past events, imagining future ones. Paedomorphy and sociality: conditions for complex social learning.

20 Paedomorphy. One key feature of humans is paedomorphy: the infant is helpless for 18 months (or 5 years or …) whereas a guinea pig can fend for itself from birth. Therefore, human infants have time to acquire a wide range of culturally determined behavior: humans broke the bounds of a limited ecological niche and could reinvent themselves culturally to master more and more new environments, even to the point of adapting the environments to their needs. This in turn required the appropriate co-evolution of the biology of social relations in general, and of extended child care and mother-child relations. Caution: look for differences of degree rather than kind in contrasting humans with other species.

21 Criteria for Language. A hypothesis on what culture and learning add to the brain’s capabilities, extending the parity, hierarchical structuring, and temporal ordering of language-readiness. Symbolization: the symbols become words in the modern sense, interchangeable and composable units for the expression of meaning. Syntax and semantics: the matching of syntactic to semantic structures co-evolves with the fractionation of utterances. Recursivity is a byproduct. Beyond the here-and-now: verb tenses or other circumlocutions express the ability to recall past events or imagine future ones. Learnability: to qualify as a human language, it must contain a significant subset of symbolic structures learnable by most human children. [But: it is not true that children master a language by 5 or 7 years of age.]

22 A Very Quick Tour of Evolution. [Figure with groupings labeled Mammals, Primates, Monkeys, Apes, Hominids, Homo sapiens.]

23 Primate Evolution: Two Key Branch Points. 20 million years ago: monkey/human. 5 million years ago: chimp/human. Adapted from Clive Gamble, Timewalkers, Figure 4.2.

24 5 Million Years of Hominid Evolution. [Figure adapted from Clive Gamble, Timewalkers, Figure 4.6.] What were the biological changes supporting language-readiness? What were the cultural changes extending the utility of language as a socially transmitted vehicle for communication and representation? How did biological and cultural change interact “in a spiral” prior to the emergence of Homo sapiens?

25 Out of Africa - Twice. [Figure with two panels labeled “NO” and “YES”, adapted from Clive Gamble, Timewalkers, Figure 8.1.]

26 Gamble, Timewalkers, Table 8.2. Life at the fireside: the technology of survival. Life on the land: regional exchange. Expansion into new habitats and the rise of complex behavior: spoken language; increased forward planning.

27 And this is where the story really starts … The Mirror System Hypothesis

28 The Grasp System of the Monkey Brain. F5: grasp commands in premotor cortex (Giacomo Rizzolatti). AIP: grasp affordances in parietal cortex (Hideo Sakata).

29 Grasp Specificity in an F5 Neuron. [Figure: precision pinch (top), power grasp (bottom). Data from Rizzolatti et al.]

30 FARS (Fagg-Arbib-Rizzolatti-Sakata) Model 1: The Role of Prefrontal Cortex. [Figure labels: AIP, dorsal stream: affordances (“ways to grab this ‘thing’”); IT, ventral stream: recognition (“It’s a mug”); PFC; task constraints (F6); working memory (46?); instruction stimuli (F2); F5.] AIP extracts a set of affordances, but IT and PFC are crucial to F5’s selection of the affordance to execute. Fagg and Arbib, 1998.
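
The selection step on this slide can be pictured with a toy sketch. This is not the FARS model itself, which is a neural network model; it is only a minimal illustration, assuming that each affordance carries a bottom-up score (standing in for AIP) and that task context adds a top-down bias (standing in for IT/PFC input). The function name, grasp names and numbers are all hypothetical.

```python
def select_affordance(affordances, bias):
    """Pick the affordance with the highest combined score.

    affordances: dict of grasp name -> bottom-up score (stand-in for AIP output)
    bias: dict of grasp name -> top-down bias (stand-in for IT/PFC input)
    Both dictionaries and their values are illustrative assumptions.
    """
    return max(affordances, key=lambda g: affordances[g] + bias.get(g, 0.0))


# Hypothetical affordances offered by a mug.
mug = {"power_grasp_body": 0.6, "precision_pinch_handle": 0.5}

# Hypothetical task context ("drink from it") favoring the handle.
drink_task = {"precision_pinch_handle": 0.4}

no_context_choice = select_affordance(mug, {})
drink_choice = select_affordance(mug, drink_task)
```

Without context the strongest bottom-up affordance wins; with the task bias the same set of affordances yields a different selection, which is the point the slide makes about IT and PFC.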

31 Mirror Neurons. Mirror neurons: a subset of the grasp-related premotor neurons of F5, discharging when the monkey observes meaningful hand movements made by the experimenter. The effective observed movement corresponds to the effective executed movement. Rizzolatti, Fadiga, Gallese, and Fogassi, 1996: Premotor cortex and the recognition of motor actions. F5 is endowed with an observation/execution matching system.

32 The Mirror Neuron System (MNS) Model

33 Direct and Inverse Models. From Arbib and Rizzolatti (1997). The vertical path is the execution system. The loop on the left provides mechanisms for imitating observed gestures in such a way as to create expectations. The observation matching system (inverse model) goes from "view of gesture" via gesture description (STS) and gesture recognition (PF) to a representation of the "command" for such a gesture. The expectation system (direct model) goes from an F5 command via the expectation neural network (ENN) to MP, the motor program for generating a given gesture. The latter path may mediate a comparison between "expected gesture" and "observed gesture" for the monkey’s self-generated movements.
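
The two directions described on this slide can be sketched in a few lines. This is a toy illustration only, not the neural architecture of Arbib and Rizzolatti (1997): it assumes a gesture can be summarized by a small feature tuple, and the command names, feature values and function names are all hypothetical.

```python
# Hypothetical repertoire: each command maps to the gesture description it
# is expected to produce. Features are made-up (aperture, opposition) values.
EXPECTED_GESTURE = {
    "precision_pinch": (0.2, 0.9),
    "power_grasp": (0.8, 0.3),
}


def direct_model(command):
    """Expectation system: command -> expected gesture description."""
    return EXPECTED_GESTURE[command]


def distance(a, b):
    """Squared Euclidean distance between two gesture descriptions."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def inverse_model(observed):
    """Observation matching: observed gesture -> best-matching command."""
    return min(EXPECTED_GESTURE, key=lambda c: distance(direct_model(c), observed))


def self_monitor(command, observed):
    """Mismatch between expected and observed gesture for one's own movement."""
    return distance(direct_model(command), observed)


recognized = inverse_model((0.25, 0.85))
mismatch = self_monitor("power_grasp", (0.8, 0.3))
```

Here `inverse_model` plays the role of the observation matching path (gesture to command), `direct_model` the expectation path (command to expected gesture), and `self_monitor` the comparison of "expected gesture" with "observed gesture" mentioned at the end of the slide.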

34 Roles of Mirror Neurons. (1) Self-correction, based on the discrepancy between intended and observed self-action. (2) Learning by imitation: at the level of the single element; and, beyond the mirror system (narrowly conceived), based on "parsing" into familiar elements and then repeating the observed structure composed from those elements. (3) Social interaction. A monkey seems at most able to parse some specific classes of "ecological" sequences/behaviors, but humans can parse "abstractly". This progression in behavior may be crucial for sentence understanding, exploiting the general ability for hierarchical extraction of constituents.

35 Studying Monkeys. Above: for their intrinsic interest. Caution: many different species. Below: to help us understand ourselves. Monkeys (and chimpanzees and bonobos …) are our cousins, not our ancestors, but we hope that their study will help us infer more about our ancestors and how we came to be human.

36 An Observation/Execution Matching System in Humans. Rizzolatti, Fadiga, Matelli, Bettinardi, Perani, and Fazio: Broca's region is activated by observation of hand gestures: a PET study. PET study of the human brain with 3 experimental conditions: object observation (control condition), grasping an object, and grasping observation. The most striking result was that the highly significant frontal activation for action and action recognition versus control was in the rostral part of Broca's area. But Broca’s area is a key language area!!! Other PET data, by Petrides et al., showed that during execution of a sequence of self-ordered hand movements there was a highly significant activation of Broca's area.

37 F5 is Homologous to Area 45 of Broca’s Area. Massimo Matelli (in Rizzolatti and Arbib 1998) provides the key to relating F5 in the monkey to area 45 in the human. Broca's area: areas 44 + 45. [Figure: monkey and human brains.]

38 A New Approach to the Evolution of Human Language. Rizzolatti, G., Fadiga, L., Gallese, V., and Fogassi, L., 1996, Premotor cortex and the recognition of motor actions, Cogn. Brain Res., 3:131-141. Rizzolatti, G., and Arbib, M.A., 1998, Language Within Our Grasp, Trends in Neurosciences, 21(5):188-194. The functional specialization of human Broca's area to contribute to language-readiness derives from an ancient mechanism related to production and understanding of motor acts. The "generativity" which some see as the hallmark of language is present in manual behavior, which can thus supply the evolutionary substrate for its appearance in language. Kimura argues that the left hemisphere is specialized not for language, but for complex motor programming functions which are, in particular, essential for language production. Language may require its own "copy" of motor sequencing mechanisms, adjacent to the "old" mechanisms. This adjacency makes lesions which dissociate the two very rare.

39 The Mirror System Hypothesis. The parity requirement for language in humans (what counts for the speaker must, approximately!!, count for the hearer) is met because language evolved from the mirror system for grasping in the common ancestor of monkey and human, with its capacity to generate and recognize a set of actions. This adds a neural “missing link” to the tradition that roots speech in a prior system for communication based on manual gesture. [See most recently: William C. Stokoe (2001) Language in Hand: Why Sign Came Before Speech.] Beyond the mirror: we then have to understand that language (readiness) rests on far more than a mirror system, seeing F5 as part of a larger mirror system, then extending our understanding via imitation to language-readiness.

40 Four stages of evolution were hypothesized in “Language Within Our Grasp”: (1) grasping; (2) a mirror system for grasping (i.e., a system that matches observation and execution); (3) a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire; (4) speech as a result of the "invasion" of the vocal apparatus by collaterals from the manual/oro-facial communication system based on F5/Broca's area.

41 What the Mirror System Hypothesis Explains. Why language is multi-modal. Why Broca’s area is the homologue of F5 rather than the cingulate area devoted to monkey vocalizations. To achieve these implications we must go beyond the core data on the mirror system to stress that manipulation inherently involves hierarchical motor structures which are unavailable for the closed call system of primates. Note: these are not the property of premotor cortex in isolation but involve (at least) its integration with SMA-proper (one division of the Supplementary Motor Area) and the basal ganglia.

42 What the Mirror System Hypothesis Does Not Say. (i) It does not say that having a mirror system is equivalent to having language. Monkeys have mirror systems but do not have language, and we expect that many species have mirror systems for varied socially relevant behaviors. (ii) It does not say that the ability to match the perception and production of single gestures is sufficient for language. (iii) It does not say that language evolution can be studied in isolation from cognitive evolution more generally. In using language, we make use of, for example, negation, counterfactuals, and verb tenses. But each of these linguistic structures is of no value unless we can understand that the facts contradict an utterance, and can recall past events and imagine future possibilities.

43 How Vertebrate Brains Evolve (Butler and Hodos 1996). The course of brain evolution among vertebrates has been determined by: formation of multiple new nuclei through elaboration or duplication; regionally specific increases in cell proliferation in different parts of the brain; gain of some new connections and loss of some established connections. These phenomena can be influenced by relatively simple mutational events that can thus become established in a population as the result of random variation. Selective pressures determine whether the behavioral phenotypic expressions of central nervous system organization produced by these random mutations increase their proportional representation within the population and eventually become established as the normal condition.

44 An Integrated State of Knowledge. Big question: how did evolution couple the separate parietal-frontal subsystems into an “integrated state of knowledge”? Fuster sees prefrontal cortex (PFC) as evolving to increase working memory capacity. Petrides argues that we need PFC to go beyond single items to keeping multiple objects or events in order. Note the challenge of embedding the mirror system in a system for handling sequential structure, and hierarchical structure more generally. How does this relate to the role of the hippocampus (HC) in episodic memory? Note the parallel problem of keeping multiple objects in spatial relation in scene perception, and the related syndrome of simultanagnosia. Note that events, not objects, are primary in our story, keeping action at the center.

45 Basic ingredients for linking the motor system to data on aphasia in the context of extending the mirror system hypothesis. Enlargement of the prefrontal lobe (which uses motivation to evaluate future courses of action) to provide sophisticated memory structures (coupled, e.g., to hippocampus) to extend the reach in space and time. Extending the number, sophistication and coordination of parietal-frontal perceptuo-motor systems: extending the “reach” of mirror systems (F5/Broca's area); sequential/hierarchical behaviors (bring in SMA/BG: adding prefrontal circuitry with refinements of the basal ganglia and cerebellum keeping pace, while the ratio of premotor cortex to motor cortex increases drastically from monkey to human); POT (parieto-occipito-temporal cortex) is a semantic storehouse, and its enlargement is a parallel development to the mirror system story (Wernicke's area); past and future (bring in PFC and HC); motor control for vocalization (cingulate cortex, etc.).

46 Beyond the Mirror. Imitation is the Key. [Illustrations by John Tenniel from Lewis Carroll, Through the Looking-Glass and What Alice Found There.]

47 From Grasp to Language: Seven Hypothesized Stages of Evolution. Pre-hominid: (1) grasping; (2) a mirror system for grasping, i.e., a system that matches observation and execution [shared with common ancestor of human and monkey]; (3) a simple imitation system for grasping [shared with common ancestor of human and chimpanzee]. Hominid evolution: (4) a complex imitation system for grasping; (5) a manual-based communication system, breaking through the fixed repertoire of primate vocalizations to yield an open repertoire; (6) proto-speech resting on the "invasion" of the vocal apparatus by collaterals from the communication system based on F5/Broca's area. Cultural evolution in Homo sapiens: (7) language: the change from action-object frames to verb-argument structures to syntax and semantics; co-evolution of cognitive and linguistic complexity.

48 From Hominids to Homo sapiens. My current hypothesis is that stages (4) and (5) and a rudimentary (pre-syntactic) form of (6) were present in pre-human hominids, but the "explosive" development of (6) that we know as language (7) depended on "cultural evolution" well after biological evolution had formed modern Homo sapiens. This remains speculative, and one should note that biological evolution may have continued to reshape the human genome for the brain even after the skeletal form of Homo sapiens was essentially stabilized, as it certainly has done for skin pigmentation and other physical characteristics. However, the fact that people can master any language equally well, irrespective of their genetic community, shows that these changes are not causal with respect to the structure of language.

49 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 49 Chimps are not Monkeys Apes imitate; monkeys do not. What does this say for our evolutionary hypothesis? Whiten study: chimps can learn a sequence quickly, whereas the monkey cannot. Speech recognition in Kanzi (a bonobo)  How does this square with the view that the "phonetic module" is a speech-specific human module? Hypothesis: Extension of the mirror system from single actions to compound actions was the key innovation in the brains of human, chimp and the common ancestor (as compared to the monkey-human common ancestor) relevant to language-readiness.

50 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 50 The Monkey Cannot Imitate? What is the evidence that the monkey "understands", or recognizes the "goal" of, an action? What is involved in linking a goal to an action? To a sequence of actions? Do the monkey’s social interactions require "understanding" rather than a complex of fixed action patterns with releasers varying from innate to learned? On our principle that the motor system can do more than just produce overt behavior, we can see the latter behavior as more important for its parsing of a sequence than for its production of behavior. This "parsing" -- present in the chimp but not in the monkey -- may be a crucial transition towards mechanisms for language-readiness. Judy Cameron (personal communication) offers the following observation from the Oregon Regional Primate Research Center: "Researchers at the Center had laboriously taught monkeys to run on a treadmill as a basis for tests they wished to conduct. It took five months to train the first batch of monkeys in this task. But they then found that if they allowed other monkeys to observe the trained monkeys running on a treadmill, then the naïve monkeys would run successfully the first time they were placed on a treadmill." This is not evidence that the monkey mirror system for grasping is part of a system for imitation of hand movements, but does render this likely.

51 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 51 Imitation Imitation involves, in part, seeing the other's performance as a set of familiar movements. But:  One must not only observe actions and their composition, but also novelties in the constituents and their variations.  One must also perceive the overlapping and sequencing of all these moves and then remember the “coordinated control program” so constructed.  Each approximation provides the framework in which attention can be shifted to specific components which can then be tuned and/or fractionated appropriately, or better coordinated with other components of the skill.  This process is recursive, yielding both the mastery of ever finer details, and the increasing grace and accuracy of the overall performance.

52 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 52 Stage 3: Simple Imitation Masako Myowa-Yamakoshi: ò the form of “imitation” employed by chimpanzees is a long and laborious process compared to the rapidity with which humans can acquire novel sequences; ò the focus is on moving objects to objects rather than on the structure of movements per se. Monkeys less so and chimpanzees more so (and, presumably, the common ancestor of human and chimpanzees) have Simple imitation: imitating simple novel behaviors but only through repeated exposure.

53 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 53 Chimpanzees use and make tools Different tool traditions in isolated groups of chimpanzees:  Different types of tools used for “termite fishing” at the Gombe in Tanzania and at sites in Senegal.  Chimpanzees use stones and other objects as projectiles to do harm.  Chimpanzees in Tai National Park, Ivory Coast, use stone tools to crack nuts open, but chimps in the Gombe have not been seen to do this.  The nut-cracking technique is not mastered until adulthood. Mothers overtly correct and instruct their infants from the time they first attempt to crack nuts, at age three years, and at least four years of practice are necessary before any benefits are obtained. Note: the form of imitation reported here for chimpanzees is a long and laborious process compared to the rapidity with which humans can acquire novel sequences.

54 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 54 Stage 4: Complex Imitation Humans have complex imitation: they can acquire (longer) novel sequences in a single trial if the sequences are not too long and the components are relatively familiar.  The very structure of these sequences can serve as the basis for immediate imitation or for the immediate construction of an appropriate response, as well as contributing to the longer-term enrichment of experience. Extension of the mirror system from single actions to compound actions adequate to support complex imitation was an evolutionary change of key relevance to language-readiness.  Hypothesis: This emerged on the hominid line after the divergence from the common ancestor of humans and chimpanzees.

55 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 55 Action = Movement + Goal/Expectation What makes a movement into an action is that (i) it is associated with a goal, and (ii) initiation of the movement is accompanied by the creation of an expectation that the goal will be met. To the extent that the unfolding of the movement departs from that expectation, to that extent will an error be detected and the movement modified. An individual performing an action is able to predict its consequences and, therefore, the action representation and its consequences are associated.

56 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 56 Understanding and Awareness Many authors have suggested that language and understanding are inseparable, but our experience of scenery and sunsets and songs and seductions makes clear that we humans understand more than we can express in words. k Some aspects of such awareness and understanding are available to animals who do not possess language. k But our development, as "modern" humans, i.e., as individuals within a language-based society, greatly extends our understanding beyond that possible for non-humans or for humans raised apart from a language community. k Conversely other species are aware of aspects of their environment and society that we humans can at best dimly comprehend.

57 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 57 Hierarchical progress beyond the mirror system Classic mirror system: just a fixed set of actions at one level. Need to understand multi- level representations and interactions.  Distinguish "imitating" a familiar action from imitating a complex behavior. Shadowing experiments show effects of the highest applicable level in repeating a sequence of phonemes - whether the changes were at the phoneme, word, or syntax- of-the utterance levels.  One may also see filling in of missing elements.  Higher levels may dominate but do not do so completely -- compare Magritte. In aphasia, note the interaction of perception and production (this is bidirectional, not just unidirectional).

58 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 58 Communication and Representation The specific communication system based on primate calling was not the precursor of language. However, co-evolution of communication and representation was essential for the emergence of human language. Both  representation within the individual and  communication between individuals  could provide selection pressures for the biological evolution of language-readiness and the biological and cultural evolution of language, with advances in the one triggering advances in the other.

59 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 59 From Unit Actions to Complex Behaviors

60 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 60 From Unit Actions to Complex Behaviors We hypothesize that the plan of an action (whether observed or "intended") is encoded in the brain. Three cases:  a whole set of actions is overlearned and encoded in stable neural connectivity.  the whole set of actions is planned in advance based on knowledge of the current situation.  dynamic planning is involved, with the plan being updated and extended as new observations become available. An automaton formalism is broad enough to encompass the above range from overlearned to dynamic plans, but it is still an open question as to how best to distribute the encoding of the various components of the automaton between stable synapses, rapidly changing synapses, and neural firing patterns. In general, this "automaton" will be event-driven, rather than operating on a fixed clock – different sub-behaviors take different lengths of time, and may be terminated either because of an external stimulus, or by some internal encoding of completion. At a basic level, then, we might characterize imitation in terms of the ability to "infer automata". However, complex behaviors may be expressed as coordinated control programs, which are built up from assemblages of simpler schemas.
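The event-driven "automaton" described on this slide can be pictured with a minimal sketch. Everything named here is a hypothetical illustration rather than part of the lecture: the class `EventDrivenAutomaton`, the reach/grasp/lift states, and the event names are assumptions chosen only to show a plan that advances when a terminating event arrives instead of on a fixed clock.

```python
class EventDrivenAutomaton:
    """A plan as a finite automaton that changes state only on events."""

    def __init__(self, transitions, start, final):
        self.transitions = transitions  # {(state, event): next_state}
        self.state = start
        self.final = final              # set of states that end the plan

    def handle(self, event):
        # Events that do not terminate the current sub-behavior are ignored,
        # so a sub-behavior may last for any number of clock ticks.
        self.state = self.transitions.get((self.state, event), self.state)
        return self.state

    def done(self):
        return self.state in self.final


# Hypothetical reach-grasp-lift plan (state and event names are assumptions).
plan = {
    ("reach", "hand_near_object"): "grasp",
    ("grasp", "contact"): "lift",
    ("lift", "height_reached"): "done",
}

automaton = EventDrivenAutomaton(plan, start="reach", final={"done"})
for event in ["tick", "hand_near_object", "tick", "tick", "contact", "height_reached"]:
    automaton.handle(event)
```

In this sketch the "tick" events leave the state unchanged, which is one way to express that different sub-behaviors take different lengths of time. How such transitions would be distributed across synapses and firing patterns is exactly the open question the slide raises; the code takes no position on it.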

61 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 61 The Problem of Serial Order in Behavior (Karl Lashley) If we tried to learn a sequence like A  B  A  C by reflex chaining, what is to stop A triggering B every time, to yield the performance A  B  A  B  A  ….. (or we might get A  B+C  A  B+C  A  …..)? A solution: Store the “action codes” (motor schemas) A, B, C, … in one part of the brain (F5 in FARS) and have another area (pre-SMA in FARS) hold “abstract sequences” and learn to pair the right action with each element: (pre-SMA): x1  x2  x3  x4 abstract sequence (F5): A B C action codes/motor schemas We further posit that Basal Ganglia (BG) manage priming and inhibition.
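A small sketch may help contrast the two schemes on this slide. It is an illustration under stated assumptions, not the FARS implementation: the function names are hypothetical, and the reflex chain is assumed to keep the first link it learns for each action, which is what makes A trigger B every time.

```python
def reflex_chain(training, start, steps):
    """Learn 'previous action -> next action' links, then replay them."""
    links = {}
    for prev, nxt in zip(training, training[1:]):
        # Assumption: the first-learned link wins, so A stays tied to B.
        links.setdefault(prev, nxt)
    out = [start]
    while len(out) < steps and out[-1] in links:
        out.append(links[out[-1]])
    return out


def abstract_sequence(training):
    """Pair each abstract slot x1, x2, ... with the right action code."""
    slots = [f"x{i + 1}" for i in range(len(training))]
    pairing = dict(zip(slots, training))      # slot -> action code
    return [pairing[slot] for slot in slots]  # replay by stepping through slots


sequence = list("ABAC")
chained = reflex_chain(sequence, start="A", steps=6)
ordered = abstract_sequence(sequence)
```

Here `chained` loops A, B, A, B, and never reaches C, while `ordered` reproduces A, B, A, C because the repeated action A is disambiguated by its slot (x1 versus x3) rather than by the action that preceded it.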

62 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 62 FARS (Fagg-Arbib-Rizzolatti-Sakata) Model 2: Sequential Behavior in the Sakata Task The five F5 units participate in a common program (in this case, a precision grasp), but each cell fires during a different phase of the program. But what controls the sequencing? Fagg and Arbib,1998

63 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 63 Beyond the Mirror System F5 alone is not the “full” mirror system  We want not only the “unit actions” but also sequences and more general patterns The FARS model sketched how to generate a sequence positing roles for SMA and BG. Our proposed mirror model must match this with a model of how  the units of a sequence and  their order/interweaving can be recognized. This new model requires recognition of a complex behavior on multiple occasions with increasing success in recognizing component actions and in linking them together.  [cf. scene analysis in vision.]

64 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 64 Sequential Learning: The Hominid Difference Further, we need to study how the extension of the mirror system from single actions to compound actions was further refined along the hominoid evolutionary track. We distinguish sequential learning at two levels:  (i) the abstraction of regularities in many sentences to come up with “syntax”;  (ii) the ability, given syntactic and semantic knowledge, to extract the sequential or semantic structure of an utterance (parsing) to reflect meaning upward from basic units via constituent structures to larger units. Extending the Mirror System Hypothesis, we must show how the ability to comprehend and create utterances via their underlying syntactico-semantic hierarchical structure can build upon the observation/ execution of single actions.  Distinguish between learning individual sequences via conditioning and the ability to infer and use sequential (and more general hierarchical structures) “at sight”.

65 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 65 From Vocalization to Manual Gesture and back to Vocalization again: The path to proto-speech is indirect

66 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 66 The Ancestral Communication System As before, we assume that our common human-monkey ancestors shared with monkeys the following: Primate Call System a limited set of species-specific calls Oro-Facial Gesture System a limited set of gestures expressive of emotion and related social indicators Note the linkage between the two systems: communication is inherently multi-modal. We have argued that the Mirror System Hypothesis explains why F5, rather than the cingulate area already involved in monkey vocalization, is homologous to the substrate for language in Broca's area.

67 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 67 Starting with Homo habilis The Fossil Record Imprints in the cranial cavity of endocasts indicate that "speech areas" were already present in early hominids such as H. habilis long before the larynx reached the modern “speech-optimal” configuration but  there is a debate over whether such areas were already present in australopithecines  were they for speech or proto-speech or proto-sign-language? A Related Hypothesis The transition from australopithecines to early Homo coincided with the transition from a mirror system, enlarged but only for action recognition to a human-like mirror system for intentional communication.

68 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 68 Did Homo habilis have language? Was Homo habilis language-ready? Homo habilis has a small brain and poorly developed larynx - but  Homo habilis has the motor dexterity for sign language. Is there any way to prove its brain is “less powerful” than that of a 7-year-old human? Homo habilis left few traces of technology that show evidence of a complex culture that would exploit the use of language - but  new data (Nature 1999) push sophisticated tool making back to 2 MYr BP  language can develop to serve highly complex cultural interactions even in a low-technology society  Homo sapiens has only recently invented towns and high- technology society, so a language-ready brain does not guarantee high technology.

69 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 69 From Praxis to Communication Our hypothetical sequence for manual gesture: pragmatic action directed towards a goal object pantomime in which similar actions are produced away from the goal object  Imitation is the generic attempt to reproduce movements performed by another, whether to master a skill or simply as part of a social interaction. By contrast, pantomime is performed with the intention of getting the observer to think of a specific action or event. It is essentially communicative in its nature. The imitator observes; the panto-mimic intends to be observed abstract gestures divorced from their pragmatic origins (if such existed) and available as elements for the formation of compounds which can be paired with meanings in more or less arbitrary fashion.  In pantomime it might be hard to distinguish a grasping movement signifying “grasping” from one meaning “a [graspable] raisin”, thus providing an “incentive” for coming up with an arbitrary gesture to distinguish the two meanings.

70 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 70 Two Roles for Imitation in the Evolution of Manual-Based Communication 1. Extending imitation from imitation of hand movements by hand movements to pantomime which uses the degrees of freedom of the hand (and arm and body) to imitate degrees of freedom of objects and actions other than hand movements.  this extends the repertoire of recognizable and describable actions well beyond those that can be performed by the speaker/hearer (cf. “the bird is flying”). 2. Extending these pantomime movements to provide ad hoc gestures that may convey to the observer information which is hard to pantomime in an “obvious” manner. This requires extending the mirror system from the grasping repertoire to mediate imitation of gestures to support the transition from ad hoc gestures to conventional signs which can reduce ambiguity and extend semantic range.  Question: We start with a clear distinction between the representation of the grasp (the action/proto-verb) and the raisin (the thing/proto-noun) in the brain, but in the spoken language both noun and verb are uttered by actions. How does this relate to neural correlates of the distinction between verb and noun?  Perhaps the answer lies in the perceptual/semantic processing rather than the symbolic/linguistic processing:

71 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 71 Here the noun is characterized by short, repeated movements, while the verb is characterized by a single, prolonged movement Noun/Verb pairs differentiated by movement A change in the speed of movement will change the meaning of a sign A change in the extent of movement will change the meaning of a sign Stokoe Language in Hand Figure 1 Figure 3 Figure 6

72 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 72 Stage 5: Gestural Communication Emerges A distinct manuo-brachial communication system evolved to complement the primate calls/oro-facial communication system On this view, the "speech" area of early hominids  i.e., the area somewhat homologous to monkey F5 and human Broca’s is not yet even a proto-speech area! Instead, it primarily mediated orofacial and manuo-brachial communication Question: Did “protosign” precede the initiation of “protospeech” or (currently my preferred hypothesis) is the better hypothesis that: “Protosign” reached sufficient sophistication to provide a basic but effective form of communication at a time when there were few arbitrary vocal gestures (as distinct from species-specific primate calls)

73 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 73 The Evolving Larynx and the Breath Group P. Lieberman views the descent of the larynx first seen in Homo sapiens as being crucial in enabling the wide articulatory range exploited in human speech.  Caution: This view is still controversial. Clearly, some level of language-readiness and vocal communication preceded this:  a core of proto-speech (but not necessarily language) was needed to provide pressures for larynx evolution. Lieberman also suggests that the primate call made by an infant separated from its mother not only survives in the human infant, but in humans develops into the breath group that provides the contour for each continuous sequence of an utterance.

74 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 74 Stage 6: From Manual Gesture to Proto-Speech The "generativity" which some see as the hallmark of language is present in manual behavior. Combinatorial properties are inherent in the manuo-brachial system. This provided the evolutionary opportunity for: Stage 6. The manual-orofacial symbolic system then “recruited” vocalization. Association of vocalization with manual gestures allowed them to assume a more open referential character. This explains why F5, rather than the primate call area, provides the evolutionary substrate for speech. This yields our explanation for the evolutionary prevalence of the lateral motor system over the medial (emotion-related) primate call system in becoming the main communication channel in humans.

75 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 75 Collateral Control of Vocalization TINS, May 98: “This new use of vocalization necessitated its skillful control, a requirement that could not be fulfilled by the ancient emotional vocalization centers. This new situation was most likely the ‘cause’ of the emergence of human Broca’s area.” I would now rather say: Homo habilis and even more so Homo erectus had a “proto-Broca’s area” based on an F5-like precursor mediating communication by manual and oro-facial gesture. This made possible a process of collateralization whereby this “proto” Broca’s area gained primitive control of the vocal machinery, thus yielding increased skill and openness in vocalization. Larynx and brain regions could then co-evolve to yield the configuration seen in Homo sapiens. Kojima on onomatopoeia: The above hypothesis would see onomatopoeia as a secondary mechanism for extending the “vocabulary” of proto-speech, but perhaps we need to respond with a “multiple factors” account of the evolution of the overall system.

76 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 76 From Primate to Human (A Production Viewpoint) Primate Call System a limited set of species-specific calls Oro-Facial Gesture System a limited set of gestures expressive of emotion and related social indicators Larynx and Vocal Cords Facial Muscles Arm and Hand Manual Gesture System an open set of communicative gestures Proto-Speech System an open set of communicative gestures A primitive system plus an advanced system? Or one multi-modal controller? Perception systems are not shown. The mirror system is thus implicit.

77 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 77 Linking the “F5-Broca” and Vocalization Systems Our original aim was to show why speech did not evolve “simply” by extending the classic primate vocalization system. We now note the co-evolution of the two systems:  Lesions centered in the anterior cingulate cortex and supplementary motor areas of the brain can cause mutism in humans, similar to the effects produced in muting monkey vocalizations I hypothesize cooperative computation between cingulate cortex and Broca’s area,  with cingulate cortex involved in breath groups and emotional shading (and imprecations!), and  Broca’s area providing the motor control for rapid production and interweaving of elements of an utterance.

78 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 78 Gesture Remains TINS, May 98: “ Manual gestures progressively lost their dominance, while in contrast, vocalization acquired autonomy, until the relation between gestural and vocal communication inverted and speech took off.” Our use of writing as a record of speech has long since created the mistaken impression that language is a speech-based system. But:  McNeill has used videotape analysis to show the crucial use that people make of gestures synchronized with speech  Even blind people use manual gestures when speaking  Sign languages are full human languages rich in lexicon, syntax, and semantics.  Moreover: not only deaf people use sign language, so do some aboriginal Australian tribes, and some native populations in North America

79 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 79 Language acquisition We locate phonology in a speech-manual-orofacial gesture complex: a hearing person shifts the major information load of language -- but by no means all of it -- into the speech domain, whereas  for a deaf person the major information load is removed from speech and taken over by hand and orofacial gestures  and note that blind children accompany speech with hand movements

80 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 80 Not three separate systems but a single system operating in multiple motor and sensory modalities Caution: One system but many brain regions, each with its own evolutionary story. Primate Call System a limited set of species-specific calls Larynx and Vocal Cords Facial Muscles Arm and Hand Oro-Facial Gesture System a limited set of gestures expressive of emotion and related social indicators Manual Gesture System an open set of communicative gestures Speech System an open set of communicative gestures Genuine Cooperation

81 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 81 Our Descriptions are different from Neural Representations We may classify the specific structure Hit (John, Mary, his hand) as an instance of a more general structure Hit (Agent, Recipient, Instrument) but the brain-representations of the constituent entities may or may not entail their recognition as belonging to these structures. … we must ensure that descriptive categories are not automatically ascribed to the “neural strategies” of the subject. We must be careful: Hit (John, Mary, his hand) is a symbolic string which represents two different representations in the brain, neither of which looks like this structure.  The action-object frame represents (whether as a result of perception or motor planning or both) the relation between an action, two agents and an “object” without demanding that any names or words or explicit symbols be attached to any of these entities.  The verb-argument structure is an abstraction from the action-object frame in that it lacks any graded representation of the specific event, but is enriched by the linkage of each entity to a specific name or symbol.

82 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 82 The Biological Basis of Language-Readiness “Knowing there are things and events”: The ability for perception of Action-Object Frames in which an actor, an action, and related role players can be perceived in relationship – was well established in the primate line: Hypothesis: The ability to communicate a fair number of such frames was established in the hominid line prior to the emergence of Homo sapiens.  Recognizing an object and acting on it; Recognizing a conspecific and interacting with it  Recognizing action-object frames  Extending the mirror system beyond single actions to a repertoire of action-object frames which is unbounded a priori.  Naming action-object frames  creation of a “symbol toolkit” of meaningless elements from which an open ended class of symbols can be generated  abstract symbols are grounded in action-oriented perception Note that such naming does not imply separate names for the actions and objects or their attributes; i.e., it does not entail that utterances of protolanguage were compounded from words akin to those we see in, e.g., the Indo-European languages.

83 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 83 The Transition to Language Hypothesis: The ability to communicate a fair number of action-object frames using gesture and proto-speech was established in the hominid line prior to the emergence of Homo sapiens. The Transition to Homo sapiens may have involved “language amplification” through increased speech ability and  Fractionation of symbols to yield symbols for actions and objects, yielding the ability to create an unlimited set of verb-argument structures linked to action-object frames  The one word ripe halves the number of fruit names to be learned  Separating verbs from nouns lets one learn only m+n+p words to be able to form m*n*p of the most basic utterances. Consideration of the spatial basis for “prepositions” may further show how visuomotor coordination underlies some aspects of language. However, the basic semantic-syntactic correspondences have been overlaid by a multitude of later innovations and borrowings.
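The m+n+p versus m*n*p point can be checked with a short worked example. The slide does not say what m, n and p count, so this sketch assumes they are the numbers of agent words, verb words and object words; the function name and the sample vocabulary are likewise hypothetical.

```python
from itertools import product


def vocabulary_cost(agents, verbs, objects):
    """Return (words to learn, basic utterances that can be formed)."""
    words_to_learn = len(agents) + len(verbs) + len(objects)   # m + n + p
    utterances = [" ".join(t) for t in product(agents, verbs, objects)]
    return words_to_learn, len(utterances)                     # m * n * p


# Hypothetical vocabulary: m = 3 agents, n = 2 verbs, p = 4 objects.
learned, expressible = vocabulary_cost(
    ["mother", "child", "stranger"],
    ["grasps", "drops"],
    ["raisin", "nut", "stone", "stick"],
)
```

With these sizes, 3 + 2 + 4 = 9 words suffice for 3 * 2 * 4 = 24 distinct utterances, whereas holistic naming would need a separate symbol for each of the 24. The gap widens quickly as the vocabulary grows.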

84 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 84 The spatial basis for “prepositions” Consideration of the spatial basis for “prepositions” may help show how visuomotor coordination underlies some aspects of language and makes clear the “naturalness” of sign. Stokoe Language in Hand Figure 10 However, the basic semantic-syntactic correspondences have been overlaid by a multitude of later innovations and borrowings. The addition of movement transforms IN to INTO and exemplifies the differences in meaning between the two signs

85 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 85 Co-Evolution of Words and Syntax  The ability to compound those structures in diverse ways, with abstraction and compounding of more generic verb-argument structure  Recognition of hierarchical structure rather than mere sequencing could provide the bridge to constituent analysis in language – relating particular subactions (themselves further decomposable) to achievement of certain subgoals in a complex manipulation.  Syntax and semantics: compounding utterances, “going recursive”. The result: A spiraling co-evolution of communication and representation, extending the repertoire of achievable, recognizable and describable actions.

86 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 86 Abstraction, Negation, and Hierarchicalization Claim: Many ways of expressing relationships were the discovery of Homo sapiens. I.e., adjectives, conjunctions such as but, and, or or and that, unless, or because, etc., might well have been “post-biological” in their origin. Extending the repertoire of recognizable and describable actions: Recognition of hierarchical structure rather than mere sequencing could provide the bridge to constituent analysis in language – relating particular subactions (themselves further decomposable) to achievement of certain subgoals in a complex manipulation. But the power of language comes from breaking away from the here- and-now, not just by hierarchicalization but also by negation and abstraction. Need to analyze how the brain can support counterfactual cognitive representations and relate them to language.

87 Arbib and Itti: CS 664 (University of Southern California, Spring 2002) Integrating Vision, Action and Language 87 (Schema Assemblages) Semantic Structures (Hierarchical Constituents expressing objects, actions and relationships) “Phonological” Structures (Ordered Expressive Gestures) P r o d u c t i o n P e r c e p t i o n Cognitive Structures The Action-Object Frame The action-object frame is non-linguistic k the representation of an action involving one or more objects and agents. (Composing them yields “schema assemblages”) Verb-argument structure is an overt linguistic representation k in modern human languages, generally the action is named by a verb and the objects are named by nouns (or noun phrases). (Composing them yields semantic structures.) A grammar for a language is then a specific mechanism (whether explicit or implicit) for converting verb- argument structures in particular, and more complex structures based on hierarchical compounds of verb- argument structures more generally into strings of words, and vice versa. Cautionary Note: In the brain there is probably no single grammar, but rather k a “direct model/grammar” for production k an “inverse model/grammar” for perception John hit Mary with his hand is an English sentence for the structure we may encode (arbitrarily!) as: Hit (John, Mary, his hand)
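The idea of a grammar as a pair of mappings, a "direct" one for production and an "inverse" one for perception, can be sketched for the single example structure on this slide. This is a toy under loud assumptions: the names `Hit`, `produce` and `perceive` are hypothetical, the encoding is as arbitrary as the slide says, and the code handles only the one English pattern "X hit Y with Z". It says nothing about how the brain represents either structure.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    """Verb-argument structure Hit(Agent, Recipient, Instrument)."""
    agent: str
    recipient: str
    instrument: str


def produce(structure: Hit) -> str:
    """Direct mapping: verb-argument structure -> string of words."""
    return f"{structure.agent} hit {structure.recipient} with {structure.instrument}"


def perceive(sentence: str) -> Optional[Hit]:
    """Inverse mapping: string of words -> verb-argument structure, if it fits."""
    head, with_sep, instrument = sentence.partition(" with ")
    agent, hit_sep, recipient = head.partition(" hit ")
    if not (with_sep and hit_sep):
        return None  # the sentence does not match this one toy pattern
    return Hit(agent, recipient, instrument)


sentence = produce(Hit("John", "Mary", "his hand"))
recovered = perceive(sentence)
```

Keeping `produce` and `perceive` as two separate functions, rather than one shared rule table, mirrors the cautionary note that there is probably no single grammar in the brain.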

88 Neural Representations? The mirror neuron analysis must be extended to address these questions: How are action-object frames and verb-argument structures represented in the brain? How are action-object frames mapped to and from verb-argument structures, and how are the latter mapped to and from the utterances of (spoken, written, signed) language?

89 Tentative steps towards a “Mirror Neurolinguistics” • Cooperative computation in the brain: to make sense of data relating different brain regions to different aspects of language. • Do these data reflect the brain's genetic prespecification and/or the results of the self-organization of the infant brain when the infant develops within a particular language community?

90 [Figure: block diagram from Arbib and Caplan (1979), based on Luria (1973). Box labels and letters as extracted: Visual Perception A; Selective Naming B; Articulatory System D; Switching Control C; Plan Formation F; of the Linear Scheme G; Lexical Analysis H; Speech Memory I; Analysis of Significant Elements F'; Logical Scheme J; Phonemic Analysis E; Updating the Plan of the Expr’n F"...; Auditory Input; Visual Input.] Functions traced through the diagram: • Naming of objects • Verbal expression of motives • Speech understanding • Speech repetition

91 Monkey and Human: A Comparative Approach to Neurolinguistics My goals: • a fully articulated model of the monkey mirror system (grounded in the neurophysiology of macaque [and other?] monkeys); • a cooperative computation model of interacting brain regions for human neurolinguistics (language-readiness versus language) as well as human mirror systems and imitation; and • a coherent evolutionary framework which links them, both by synthetic brain imaging and by brain imaging across monkeys, chimps, and other primates. [Figure labels: F5 Homologue; Not AIP Homologue: Let’s discuss this!]

92 [Figure: Visual Cortex projects to Parietal Cortex along the “How” (dorsal) path, which supports reach programming and grasp programming, and to Inferotemporal Cortex along the “What” (ventral) path.] DF (Goodale and Milner), lesion in the ventral path: inability to verbalize or pantomime size or orientation. AT (Jeannerod et al.), lesion in the dorsal path: inability to preshape (except for objects with size “in the semantics”). “What” versus “Where” (Mishkin and Ungerleider) → “What” versus “How”

93 Goodale and Milner 1 Our evolutionary theory suggests a progression from action to pantomime to (pre)language: • object → AIP → F5 canonical: pragmatics • action → PF → F5 mirror: action understanding • scene → Wernicke’s → Broca’s: utterance The “zero order” model of AT and DF data is: • Parietal “affordances” → preshape • IT “perception of object” → pantomime or verbally describe size

94 Goodale and Milner 2 That IT “perception of object” is needed to pantomime or verbally describe size seems to imply that one cannot pantomime or verbalize an affordance; one needs a “unified view of the object” (IT) to express attributes. The problem with this is that the “language” path as shown in the zero-order model is completely independent of the parietal → F5 system, and so the data seem to contradict our view in the pathway: scene → Wernicke’s → Broca’s: utterance

95 Necessary Background: FARS (Fagg-Arbib-Rizzolatti-Sakata) Model Overview AIP extracts a set of affordances, but IT and PFC are crucial to F5’s selection of the affordance to execute. [Figure labels: IT; PFC; Task Constraints (F6); Working Memory (46?); Instruction Stimuli (F2); AIP; Dorsal Stream: Affordances; Ventral Stream: Recognition; Ways to grab this “thing”; “It’s a mug”; F5 canonical]
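As a rough illustration of that division of labor, the Python sketch below treats AIP as supplying several candidate grasps with salience scores and treats the IT/PFC contribution as a task-dependent bias added before F5 picks one. This is not the FARS model itself, which is a neural network simulation; the function name `select_grasp`, the grasp labels, and all numbers are hypothetical values chosen for illustration.

```python
from typing import Dict


def select_grasp(affordances: Dict[str, float], bias: Dict[str, float]) -> str:
    """Pick the affordance with the highest score after adding the bias.

    affordances: candidate grasps and their salience (the "AIP" role).
    bias: task- or identity-dependent modulation (the "IT/PFC" role).
    The return value stands in for the grasp "F5" commits to executing.
    """
    scores = {grasp: s + bias.get(grasp, 0.0) for grasp, s in affordances.items()}
    return max(scores, key=scores.get)


# Hypothetical affordances extracted for a mug-shaped object.
aip_output = {
    "precision pinch on rim": 0.5,
    "power grasp on body": 0.6,
    "hook grip on handle": 0.4,
}

# With no modulation, the most salient affordance wins.
print(select_grasp(aip_output, {}))  # power grasp on body

# Hypothetical bias: the object is recognized as a mug and the task is drinking.
drink_bias = {"hook grip on handle": 0.5}
print(select_grasp(aip_output, drink_bias))  # hook grip on handle
```

The point of the sketch is only that the dorsal stream can offer "ways to grab this thing" without knowing what the thing is, while the final choice depends on information arriving from elsewhere.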

96 Some Crucial Psychophysical Data Bridgeman, Peery & Anand, 1997: An observer sees a target in one of several possible positions, and a frame either centered before the observer or deviated left or right. • Verbal judgments of target position are altered by the background frame's position. • Pointing at the target never misses, regardless of the frame's position. The data demonstrate independent representations of visual space in the two systems, with the observer aware only of the spatial values in the IT system. They have also shown that a symbolic message about which of two targets to jab can be communicated from the cognitive (inferotemporal) to the sensorimotor (parietal) system without communicating the cognitive system's spatial bias as well. Hypothesis: Just as F5 mirror receives its parietal input from PF rather than AIP, so Broca's area receives its size data as well as object identity data from IT via PFC, rather than via a side path from AIP.

97 Enhancing the Pathways We thus enhance each of the pathways • object → AIP → F5 canonical: pragmatics • action → PF → F5 mirror: action understanding • scene → Wernicke’s → Broca’s: utterance by having PFC modulate the activity of each premotor component.

98 An Early Pass on the AT/DF Challenge Is PF a homologue of Wernicke’s area? How does the role of PFC in the FARS model relate to its roles in the mirror system of monkey and in language? [Figure annotations: Do these link the right boxes? What is the relationship?] [Figure labels: Recognizing an Action; Recognizing an Object or an Action; Visual Input; Choosing an Action; Describing an Episode, Object or Action; Prefrontal (PFC); F5 canonical; F5 mirror; Broca’s Area; Wernicke’s Area; AIP; STS; IT; Memory; PF]

99 Going Further... • Extend our existing model of the mirror system in monkey to chart the human brain mechanisms for recognizing interactions of actors and objects. • Use this to ground a theory of brain representations underlying the capacity to symbolize episodes, actions, objects and actors. • Use these representations to ground a functional/cognitive account integrating syntax and semantics, and link this to a new approach to neurolinguistics that goes “Beyond the Mirror” to test and refine the Mirror System Hypothesis. • Develop new models of the evolutionary linkage from the primitive mirror system to modern human brain mechanisms: e.g., for the evolution of increasingly subtle forms of imitation, and the transition from manual to vocal gestures.

100 A Sample of Further Research Topics Modeling of monkey brain mechanisms for: • visually guided behavior & mirror neurons • vocalization, communication & multi-modal integration • compound behaviors and social interactions. Comparative modeling of primate (including human) brain mechanisms: • extending the monkey model to chimp and human • comparative/evolutionary model of different types of imitation. The minimal subscene as a meeting ground for action, action recognition and language. Neurolinguistics and a Functional/Cognitive Integration of Syntax and Semantics: • extending the basic sentence frames for descriptions and questions with minimal subscenes • Cognitive Form: linking the action-frame to semantic and syntactic structures. Evolution: From Grasp to Imitation to Language: • includes a study of the linkage between sign language and vocalization as a basis for evolutionary theorizing.

101 Evolution: From Grasp to Imitation to Language We will build the bridge that links the basic mirror system for grasping to language via the stages of • a complex imitation system for grasping, • a manual-based communication system, and • proto-speech, characterized as the open-ended production and perception of sequences of vocal gestures. Major Goals: • to extend the work on modeling the mirror system to the study of the imitation of actions in monkey, chimpanzee and human; • to explain how, in the course of hominid evolution, the manual-orofacial symbolic system may have “recruited” vocalization. This approach will help us understand language as more a cultural product than a biologically innate one, with culture shaping a brain which is language-ready in a multi-modal way, so that language performance may integrate speech with manual and oro-facial gesture, or may reduce to sign language as readily as to spoken language.

102 Beyond the Mirror to Neurolinguistics If the monkey needs so many brain regions for the mirror system for grasping, how many more brain regions will we need for an account of language-readiness that goes beyond the mirror to develop a full neurolinguistic model far beyond the F5-Broca’s area homology?

