Slide 1: The Mirror Neuron System for Grasping. Michael Arbib, with Erhan Oztop (Modeling) and Giacomo Rizzolatti (Neurophysiology). CS 664: Integrating Vision, Action and Language (University of Southern California, Spring 2002).

Slide 2: Part 1. Classic Concepts of Grasp Control

Slide 3: Preshaping While Reaching to Grasp (Jeannerod & Biguer 1979)

Slide 4: A Classic Coordinated Control Program for Reaching and Grasping (Arbib 1981). [Diagram: driven by visual, kinesthetic, and tactile input, perceptual schemas for target location, object size recognition, and object orientation recognition activate motor schemas for reaching (fast phase and slow phase movements of the hand) and grasping (hand preshape, hand rotation, and the actual grasp).]

Slide 5: Opposition Spaces and Virtual Fingers. The goal of a successful preshape, reach and grasp is to match the opposition axis defined by the virtual fingers of the hand with the opposition axis defined by an affordance of the object (Iberall and Arbib 1990).

Slide 6: Hand State. Our current representation of hand state defines a 7-dimensional trajectory F(t) = (d(t), v(t), a(t), o1(t), o2(t), o3(t), o4(t)) with the following components: d(t), the distance to the target at time t; v(t), the tangential velocity of the wrist; a(t), the aperture of the virtual fingers involved in grasping at time t; o1(t), the angle between the object axis and the (index fingertip - thumb tip) vector [relevant for pad and palm oppositions]; o2(t), the angle between the object axis and the (index finger knuckle - thumb tip) vector [relevant for side oppositions]; o3(t) and o4(t), the two angles defining how close the thumb is to the hand, measured relative to the side of the hand and to the inner surface of the palm.
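For concreteness, here is a minimal sketch (not part of the original model code) of the hand-state record in Python; the class and field names are mine, and the components simply mirror the definition above:

    from dataclasses import dataclass

    @dataclass
    class HandState:
        """One sample F(t) of the 7-dimensional hand-state trajectory."""
        d: float    # distance from the wrist to the target
        v: float    # tangential velocity of the wrist
        a: float    # aperture of the virtual fingers used in the grasp
        o1: float   # angle: object axis vs. (index fingertip - thumb tip) vector
        o2: float   # angle: object axis vs. (index knuckle - thumb tip) vector
        o3: float   # thumb angle relative to the side of the hand
        o4: float   # thumb angle relative to the inner surface of the palm

        def as_vector(self):
            return [self.d, self.v, self.a, self.o1, self.o2, self.o3, self.o4]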

Slide 7: Part 2. Basic Neural Mechanisms of Grasp Control

Slide 8: Visual Control of Grasping in the Macaque Monkey. F5: grasp commands in premotor cortex (Giacomo Rizzolatti). AIP: grasp affordances in parietal cortex (Hideo Sakata). A key theme of visuomotor coordination: parietal affordances (AIP) drive frontal motor schemas (F5).

Slide 9: Grasp Specificity in an F5 Neuron. Precision pinch (top); power grasp (bottom). (Data from Rizzolatti et al.)

Slide 10: A Broad Perspective on F5-AIP Interactions. [Diagram: PP (posterior parietal cortex) provides affordances for PM and precise motor coordinates for motor cortex; PM (premotor cortex) codes the overall action fairly abstractly ("the possible categories", "here's my choice"); motor cortex and the MPGs handle the details, i.e. the action parameters; the cerebellum tunes and coordinates the MPGs.]

Slide 11: FARS (Fagg-Arbib-Rizzolatti-Sakata) Model Overview. [Diagram: the dorsal stream through AIP extracts affordances ("ways to grab this 'thing'"), while the ventral stream through IT provides recognition ("it's a mug") to prefrontal cortex (PFC); task constraints (F6), working memory (area 46?), and instruction stimuli (F2) also influence F5.] AIP extracts affordances - features of the object relevant to physical interaction with it. Prefrontal cortex provides "context" so that F5 may select an appropriate affordance.

Slide 12: Part 3. Neural Mechanisms of Grasp Control that Support a "Mirror System"

Slide 13: Mirror Neurons. Rizzolatti, Fadiga, Gallese, and Fogassi, 1995: Premotor cortex and the recognition of motor actions. Mirror neurons form the subset of grasp-related premotor neurons of F5 which discharge when the monkey observes meaningful hand movements made by the experimenter or another monkey. F5 is endowed with an observation/execution matching system. [The non-mirror grasp neurons of F5 are called F5 canonical neurons.]

Slide 14: What is the mirror system (for grasping) for? Mirror neurons: the cells that selectively discharge when the monkey executes particular actions as well as when the monkey observes another individual executing the same action. Mirror neuron system (MNS): the mirror neurons and the brain regions involved in eliciting mirror behavior. Interpretations: action recognition; understanding (assigning meaning to others' actions); associative memory for actions.

Slide 15: Computing the Mirror System Response. The FARS Model: recognize object affordances and determine the appropriate grasp. The Mirror Neuron System (MNS) Model: we must add recognition of the trajectory and of the hand preshape to recognition of object affordances, and ensure that all three are congruent. There are parietal systems other than AIP adapted to this task.

Slide 16: Further Brain Regions Involved. 7a (PG), caudal part of the posterior parietal lobule: spatial coding for objects, and analysis of motion during interaction of objects and self-motion. 7b (PF), rostral part of the posterior parietal lobule: mainly somatosensory, with mirror-like responses. STS (superior temporal sulcus): detection of biologically meaningful stimuli (e.g. hand actions), and motion-related activity (the MT/MST part). cIPS (caudal intraparietal sulcus): axis orientation and surface orientation.

Slide 17: cIPS cell response. Surface orientation selectivity of a cIPS cell (Sakata et al. 1997).

Slide 18: Key Criteria for Mirror Neuron Activation When Observing a Grasp. a) Does the preshape of the hand correspond to the grasp encoded by the mirror neuron? b) Does this preshape match an affordance of the target object? c) Do samples of the hand state indicate a trajectory that will bring the hand to grasp the object? Modeling challenges: i) to have mirror neurons self-organize to learn to recognize grasps in the monkey's motor repertoire; ii) to learn to activate mirror neurons from smaller and smaller samples of a trajectory.

Slide 19: Hypothesis on Mirror Neuron Development. The development of the (grasp) mirror neuron system in a healthy infant is driven by the visual stimuli generated by the actions (grasps) performed by the infant himself. The infant (with maturation of visual acuity) gains the ability to map other individuals' actions into his internal motor representation. [In the MNS model, the hand state provides the key representation for this transfer.] The infant then acquires the ability to create (internal) representations for novel actions observed. In parallel with these achievements, the infant develops an action prediction capability (the recognition of an action given the prefix of the action and the target object).

Slide 20: The Mirror Neuron System (MNS) Model

Slide 21: Part 4. Implementing the Basic Schemas of the Mirror Neuron System (MNS) Model using Artificial Neural Networks (The Work of Erhan Oztop)

Slide 22: Curve recognition. The general problem: associate N-dimensional space curves with object affordances. A special case: the recognition of two- (or three-) dimensional trajectories in physical space. Simplest solution: map the temporal information into the spatial domain, then apply known pattern recognition techniques. Problem with the simplest solution: the speed of the moving point - the spatial representation may change drastically with speed. Scaling can overcome this, but the scaling must preserve the generalization ability of the pattern recognition engine. Solution: fit a cubic spline to the sampled values, then normalize and re-sample from the spline curve. Result: very good generalization, with better performance than using Fourier coefficients to recognize curves.
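A minimal sketch of the spline normalization step in Python, assuming SciPy is available; the function name and the choice of 30 resampled points (taken from the next slide) are illustrative, not the original implementation:

    import numpy as np
    from scipy.interpolate import CubicSpline

    def normalize_trajectory(samples, n_points=30):
        """Fit a cubic spline to an irregularly sampled 2D/3D trajectory and
        resample it at n_points equally spaced parameter values, so that the
        spatial pattern no longer depends on movement speed or sample count."""
        samples = np.asarray(samples, dtype=float)   # shape (T, dims)
        t = np.linspace(0.0, 1.0, len(samples))      # original sample parameter
        spline = CubicSpline(t, samples, axis=0)     # one spline per dimension
        t_new = np.linspace(0.0, 1.0, n_points)
        return spline(t_new)                         # shape (n_points, dims)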

Slide 23: Curve recognition. Spatial resolution: 30. Network input size: 30. Hidden layer size: 15. Output size: 5. Training: back-propagation with momentum and an adaptive learning rate. [Figure: sampled points, the points used for spline interpolation, and the fitted spline.] The curve recognition system was demonstrated on hand-drawn numeral recognition (successful recognition examples for 2, 8 and 3).
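A rough sketch of a network with these sizes, in Python (PyTorch); SGD with momentum and a plateau-based scheduler stand in for the back-propagation-with-momentum and adaptive-learning-rate training named on the slide, and the learning rate and activation are assumptions:

    import torch
    import torch.nn as nn

    # 30 resampled trajectory points in, 5 curve classes out (sizes from the slide).
    net = nn.Sequential(nn.Linear(30, 15), nn.Sigmoid(), nn.Linear(15, 5))
    loss_fn = nn.CrossEntropyLoss()
    opt = torch.optim.SGD(net.parameters(), lr=0.1, momentum=0.9)
    sched = torch.optim.lr_scheduler.ReduceLROnPlateau(opt)   # crude "adaptive" rate

    def train_step(x, y):
        """x: (batch, 30) float tensor; y: (batch,) class-index tensor."""
        opt.zero_grad()
        loss = loss_fn(net(x), y)
        loss.backward()
        opt.step()
        sched.step(loss.item())
        return loss.item()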

Slide 24: STS hand shape recognition. Step 1: the system processes the color-coded hand image and generates a set of features (color-coded hand feature extraction) for the second step. Step 2: the feature vector generated by the first step is used by the model matching module to fit a 3D kinematics model of the hand; the resulting hand configuration is sent to the classification module (output: e.g. precision grasp).

Slide 25: STS hand shape recognition 1: Color Segmentation and Feature Extraction. Training phase: a color expert is generated by training a feed-forward network to approximate human perception of color. Actual processing: the hand image is fed to the NN-augmented segmentation system; the color decision during segmentation is made by consulting the color expert, and the output is a set of features.

Slide 26: STS hand shape recognition 2: 3D Hand Model Matching. The hand is modelled with 14 degrees of freedom (illustrated alongside a realistic drawing of the hand bones). The model matching algorithm minimizes the error between the extracted features (the feature vector) and the model hand; the resulting configuration is passed to classification, which outputs the grasp type.
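Viewed generically, the model-matching step is a weighted least-squares fit of the joint angles. A hedged sketch follows, where forward_model (a hypothetical function mapping joint angles to predicted marker features) and the confidence weighting stand in for the actual implementation:

    import numpy as np
    from scipy.optimize import minimize

    def fit_hand_pose(features, confidences, forward_model, n_dof=14):
        """Find the 14 joint angles whose predicted marker features best match
        the extracted image features, weighting each feature by its confidence."""
        def error(angles):
            predicted = forward_model(angles)
            return np.sum(confidences * (predicted - features) ** 2)

        result = minimize(error, x0=np.zeros(n_dof), method="Nelder-Mead")
        return result.x   # estimated hand configuration (joint angles)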

Slide 27: Virtual Hand/Arm and Reach/Grasp Simulator. [Figures: a power grasp and a side grasp; a precision pinch.]

Slide 28: Power grasp time series data. [Plot legend: +: aperture; *: angle 1; x: angle 2; the remaining markers show axis displacement 1, axis displacement 2, speed, and distance.]

Slide 29: Core Mirror Circuit. [Schema diagram: hand shape recognition and hand motion detection, together with hand-object spatial relation analysis, produce the hand state; the association (7b) neurons associate object affordances with the hand state; their output drives the mirror neurons (F5 mirror), which perform action recognition, integrate the temporal association, and provide mirror feedback; the motor program (F5 canonical) drives motor execution.]

Slide 30: Connectivity pattern. [Diagram of connections among object affordance (AIP), 7a, STS, 7b, the motor program (F5 canonical), and F5 mirror, including a mirror feedback pathway.]

Slide 31: A single grasp trajectory viewed from three different angles. The wrist trajectory during the grasp is shown by square traces, with the distance between any two consecutive trace marks traveled in equal time intervals. The network classifies the action as a power grasp (empty squares: power grasp output; filled squares: precision grasp output; crosses: side grasp output).

Slide 32: Power and precision grasp resolution

Slide 33: Part 5. Quo Vadis?

Slide 34: Future Directions. 1) Technology: increasing the robustness and learning rates of the schemas using improved learning algorithms for artificial neural nets. 2) Neuroscience: implementing the schemas with biologically plausible neural nets to model neurophysiological data. 3) Learning to recognize variations on, and assemblages of, familiar actions. 4) Imitation. 5) Language!

Slide 35: Michael Arbib: CS564 - Brain Theory and Artificial Intelligence (University of Southern California, Fall 2001). Lecture 10: The Mirror Neuron System Model (MNS) 1. Reading assignment: Erhan Oztop and Michael A. Arbib, Schema Design and Implementation of the Grasp-Related Mirror Neuron System.

Slide 36: Visual Control of Grasping in the Macaque Monkey. F5: grasp commands in premotor cortex (Giacomo Rizzolatti). AIP: grasp affordances in parietal cortex (Hideo Sakata). A key theme of visuomotor coordination: parietal affordances (AIP) drive frontal motor schemas (F5).

Slide 37: Mirror Neurons. Rizzolatti, Fadiga, Gallese, and Fogassi, 1995: Premotor cortex and the recognition of motor actions. Mirror neurons form the subset of grasp-related premotor neurons of F5 which discharge when the monkey observes meaningful hand movements made by the experimenter or another monkey. F5 is endowed with an observation/execution matching system.

Slide 38: F5 Motor Neurons. F5 motor neurons include all F5 neurons whose firing is related to motor activity; we focus on grasp-related behavior (other F5 motor neurons are related to oro-facial movements). F5 mirror neurons form the subset of grasp-related F5 motor neurons which discharge when the monkey observes meaningful hand movements. F5 canonical neurons form the subset of grasp-related F5 motor neurons which fire when the monkey sees an object with related affordances.

Slide 39: What is the mirror system (for grasping) for? Mirror neurons: the cells that selectively discharge when the monkey executes particular actions as well as when the monkey observes another individual executing the same action. Mirror neuron system (MNS): the mirror neurons and the brain regions involved in eliciting mirror behavior. Interpretations: action recognition; understanding (assigning meaning to others' actions); associative memory for actions.

Slide 40: Computing the Mirror System Response. The FARS Model: recognize object affordances and determine the appropriate grasp. The Mirror Neuron System (MNS) Model: we must add recognition of the trajectory and of the hand preshape to recognition of object affordances, and ensure that all three are congruent. There are parietal systems other than AIP adapted to this task.

Slide 41: Further Brain Regions Involved. 7a (PG), caudal part of the posterior parietal lobule: spatial coding for objects, and analysis of motion during interaction of objects and self-motion. 7b (PF), rostral part of the posterior parietal lobule: mainly somatosensory, with mirror-like responses. STS (superior temporal sulcus): detection of biologically meaningful stimuli (e.g. hand actions), and motion-related activity (the MT/MST part). cIPS (caudal intraparietal sulcus): axis and surface orientation.

Slide 42: cIPS cell response. Surface orientation selectivity of a cIPS cell (Sakata et al. 1997).

Slide 43: Key Criteria for Mirror Neuron Activation When Observing a Grasp. a) Does the preshape of the hand correspond to the grasp encoded by the mirror neuron? b) Does this preshape match an affordance of the target object? c) Do samples of the hand state indicate a trajectory that will bring the hand to grasp the object? Modeling challenges: i) to have mirror neurons self-organize to learn to recognize grasps in the monkey's motor repertoire; ii) to learn to activate mirror neurons from smaller and smaller samples of a trajectory.

Slide 44: Initial Hypothesis on Mirror Neuron Development. The development of the (grasp) mirror neuron system in a healthy infant is driven by the visual stimuli generated by the actions (grasps) performed by the infant himself. The infant (with maturation of visual acuity) gains the ability to map other individuals' actions into his internal motor representation. [In the MNS model, the hand state provides the key representation for this transfer.] The infant then acquires the ability to create (internal) representations for novel actions observed. In parallel with these achievements, the infant develops an action prediction capability (the recognition of an action given the prefix of the action and the target object).

Slide 45: The Mirror Neuron System (MNS) Model

Slide 46: Implementing the Basic Schemas of the Mirror Neuron System (MNS) Model using Artificial Neural Networks (Work of Erhan Oztop)

Slide 47: Opposition Spaces and Virtual Fingers. The goal of a successful preshape, reach and grasp is to match the opposition axis defined by the virtual fingers of the hand with the opposition axis defined by an affordance of the object (Iberall and Arbib 1990).

Slide 48: Hand State. Our current representation of hand state defines a 7-dimensional trajectory F(t) = (d(t), v(t), a(t), o1(t), o2(t), o3(t), o4(t)) with the following components: d(t), the distance to the target at time t; v(t), the tangential velocity of the wrist; a(t), the aperture of the virtual fingers involved in grasping at time t; o1(t), the angle between the object axis and the (index fingertip - thumb tip) vector [relevant for pad and palm oppositions]; o2(t), the angle between the object axis and the (index finger knuckle - thumb tip) vector [relevant for side oppositions]; o3(t) and o4(t), the two angles defining how close the thumb is to the hand, measured relative to the side of the hand and to the inner surface of the palm.

Slide 49: Curve recognition. The general problem: associate N-dimensional space curves with object affordances. A special case: the recognition of two- (or three-) dimensional trajectories in physical space. Simplest solution: map the temporal information into the spatial domain, then apply known pattern recognition techniques. Problem with the simplest solution: the speed of the moving point - the spatial representation may change drastically with speed. Scaling can overcome this, but the scaling must preserve the generalization ability of the pattern recognition engine. Solution: fit a cubic spline to the sampled values, then normalize and re-sample from the spline curve. Result: very good generalization, with better performance than using Fourier coefficients to recognize curves.

Slide 50: Curve recognition. Spatial resolution: 30. Network input size: 30. Hidden layer size: 15. Output size: 5. Training: back-propagation with momentum and an adaptive learning rate. [Figure: sampled points, the points used for spline interpolation, and the fitted spline.] The curve recognition system was demonstrated on hand-drawn numeral recognition (successful recognition examples for 2, 8 and 3).

Slide 51: STS hand shape recognition. Step 1: the system processes the color-coded hand image and generates a set of features (color-coded hand feature extraction) for the second step. Step 2: the feature vector generated by the first step is used by the model matching module to fit a 3D kinematics model of the hand; the resulting hand configuration is sent to the classification module (output: e.g. precision grasp).

Slide 52: STS hand shape recognition 1: Color Segmentation and Feature Extraction. Training phase: a color expert is generated by training a feed-forward network to approximate human perception of color. Actual processing: the hand image is fed to the NN-augmented segmentation system; the color decision during segmentation is made by consulting the color expert, and the output is a set of features.

Slide 53: STS hand shape recognition 2: 3D Hand Model Matching. The hand is modelled with 14 degrees of freedom (illustrated alongside a realistic drawing of the hand bones). The model matching algorithm minimizes the error between the extracted features (the feature vector) and the model hand; the resulting configuration is passed to classification, which outputs the grasp type.

Slide 54: Virtual Hand/Arm and Reach/Grasp Simulator. [Figures: a power grasp and a side grasp; a precision pinch.]

Slide 55: Power grasp time series data. [Plot legend: +: aperture; *: angle 1; x: angle 2; the remaining markers show axis displacement 1, axis displacement 2, speed, and distance.]

Slide 56: Core Mirror Circuit. [Schema diagram: hand shape recognition and hand motion detection, together with hand-object spatial relation analysis, produce the hand state; the association (7b) neurons associate object affordances with the hand state; their output drives the mirror neurons (F5 mirror), which perform action recognition, integrate the temporal association, and provide mirror feedback; the motor program (F5 canonical) drives motor execution.]

Slide 57: Connectivity pattern. [Diagram of connections among object affordance (AIP), 7a, STS, 7b, the motor program (F5 canonical), and F5 mirror, including a mirror feedback pathway.]

Slide 58: A single grasp trajectory viewed from three different angles. The wrist trajectory during the grasp is shown by square traces, with the distance between any two consecutive trace marks traveled in equal time intervals. The network classifies the action as a power grasp (empty squares: power grasp output; filled squares: precision grasp output; crosses: side grasp output).

Slide 59: Power and precision grasp resolution. [Panels (a) and (b): responses of a power grasp mirror neuron and a precision pinch mirror neuron.] Note that the modeling yields novel predictions for the time course of activity across a population of mirror neurons.

Slide 60: Research Plan. Development of the mirror system: development of grasp specificity in F5 motor and canonical neurons; visual feedback for grasping as a possible precursor of the mirror property. Recognition of novel and compound actions and their context: the pliers experiment (extending the visual vocabulary); recognition of compounds of known movements; from action recognition to understanding (context and expectation).

Slide 61: Lecture 23: MNS Model 2. Michael Arbib and Erhan Oztop. The Mirror Neuron System for Grasping: visual processing for the MNS model; the virtual arm; the core mirror neuron circuit. Michael Arbib: CS564 - Brain Theory and Artificial Intelligence (University of Southern California, Fall 2001).

Slide 62: The Mirror Neuron System (MNS) Model

Slide 63: Visual Processing for the MNS model. How much should we attempt to solve? Even though computers are getting more powerful every day, the vision problem in its general form is an unsolved problem in engineering. There exist gesture recognition systems for human-computer interaction and sign language interpretation (Holden). At a minimum, our vision system must recognize 1) the hand and its configuration, and 2) the object features.

Slide 64: Simplifying the problem. We simplify the recognition of the hand and its configuration by placing color markers on the articulation points of the hand. If we can extract the marker positions reliably, we can then try to extract some of the features that make up the hand state by estimating the 3D pose of the hand from the 2D marker positions. Thus we have two steps: extract the color marker positions, then estimate the 3D pose.

Slide 65: Reminder: Hand State components. For most components we need to know the (3D) configuration of the hand.

Slide 66: The vision task is simplified by placing colored tapes on the joints and articulation points of the hand (a simplified video input). The first step of hand configuration analysis is to locate the color patches unambiguously (not easy!) using color segmentation. But we have to compensate for lighting, reflection, shading and wrinkling problems: robust color detection.

Slide 67: Robust Detection of the Colors - RGB space. A color image in a computer is composed of a matrix of pixel triplets (Red, Green, Blue) that define the color of each pixel. We want to label a given pixel color as belonging to one of the color patches we used to mark the hand, or as not belonging to any class. A straightforward way to detect whether a given target color (R',G',B') matches the pixel color (R,G,B) is to threshold the squared distance ((R-R')^2 + (G-G')^2 + (B-B')^2). This does not work well, because shading and different lighting conditions affect the R,G,B values a lot, so this simple nearest-neighbor method fails; for example, an orange patch under shadow is very close to red in RGB space. Solution: there are better color spaces for color classification which are more robust to shading and lighting effects.
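A direct transcription of this simple (and brittle) RGB nearest-color scheme in Python; the marker color templates and the threshold are placeholder values:

    def classify_pixel_rgb(pixel, marker_colors, threshold=2000):
        """Label a pixel with the nearest marker color in RGB space, or None if
        even the nearest marker is farther than the threshold. This is the naive
        scheme described above, which fails under shading and lighting changes."""
        r, g, b = pixel
        best_label, best_dist = None, threshold
        for label, (r2, g2, b2) in marker_colors.items():
            dist = (r - r2) ** 2 + (g - g2) ** 2 + (b - b2) ** 2
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label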

Slide 68: Robust Detection of the Colors - HSV space. HSV stands for (Hue, Saturation, Value), and the Hue component carries most of the information we want: the HSV color model is more suitable for classifying colors in terms of their perceived color. Thus, in labeling the pixels, we can simply compare a pixel's Hue component to a representative (template) Hue for each marker and assign the pixel to the marker with the closest value, unless the difference exceeds a threshold, in which case we label it as a non-marker color. But we can do better: train a neural network that can do the labeling for us.
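An illustrative Python sketch of the template comparison in HSV space using the standard-library colorsys module; the hue templates and the threshold are placeholders, and this is not claimed to be the exact comparison used in the original system:

    import colorsys

    def classify_pixel_hsv(pixel_rgb, marker_hues, threshold=0.05):
        """Compare the pixel's hue (0..1) with a template hue for each marker and
        return the closest marker, or None if no hue is within the threshold."""
        r, g, b = (c / 255.0 for c in pixel_rgb)
        h, s, v = colorsys.rgb_to_hsv(r, g, b)
        best_label, best_diff = None, threshold
        for label, template_hue in marker_hues.items():
            diff = min(abs(h - template_hue), 1.0 - abs(h - template_hue))  # hue wraps
            if diff < best_diff:
                best_label, best_diff = label, diff
        return best_label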

Slide 69: Robust Detection of Colors - the Color Expert. Create a training set from a test image by manually picking colors from the image and specifying their labels. Create a neural network - in our case a one-hidden-layer feed-forward network - that accepts the R,G,B values as input and outputs the marker label, or 0 for a non-marker color. Make sure that the network is not too "powerful", so that it does not memorize the training set (as distinct from generalizing). Train it, then use it: when given a pixel to classify, apply the pixel's RGB values to the trained network and take the output as the marker that the pixel belongs to. One then needs a segmentation system to aggregate the pixels into patches with a single color label each.
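A minimal sketch of such a color expert in Python using scikit-learn; the hidden-layer size and training settings are assumptions, not the published configuration:

    from sklearn.neural_network import MLPClassifier

    # X: (n_pixels, 3) RGB values picked by hand from a test image,
    # y: (n_pixels,) marker labels, with 0 reserved for "not a marker".
    # A single small hidden layer keeps the network from memorizing the set.
    color_expert = MLPClassifier(hidden_layer_sizes=(10,), max_iter=2000)

    def train_color_expert(X, y):
        color_expert.fit(X, y)

    def label_pixel(rgb):
        return int(color_expert.predict([rgb])[0])   # marker id, or 0 for non-marker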

Slide 70: STS hand shape recognition 1: Color Segmentation and Feature Extraction. Training phase: a color expert is generated by training a feed-forward network to approximate human perception of color. Actual processing: the hand image is fed to an NN-augmented segmentation system; the color decision during segmentation is made by consulting the color expert, and the output is a set of features.

Slide 71: Color Coded Hand Feature Extraction and Hand Configuration Estimation. Given a color-coded hand image, the first step of hand configuration extraction is to find the position of the center of each color marker. The marker center positions are then converted into a feature vector with a corresponding confidence vector: the wrist position is simply subtracted from all the centers found, and the relative x,y coordinates for each marker are placed in the feature vector. The second step of hand configuration extraction is to find a pose of a 3D hand model such that the features of the given hand image and of the 3D hand model are as close as possible, with the contribution of each component to the distance weighted by its confidence. This is an optimization problem which can be solved using gradient descent or hill climbing (simulated gradient descent).
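The wrist-relative feature construction can be sketched as follows; the marker ordering (wrist first) and the array shapes are assumptions made for illustration:

    import numpy as np

    def marker_features(marker_centers, wrist_index=0):
        """Convert detected marker centers (shape (n_markers, 2), image x,y) into
        a wrist-relative feature vector: subtract the wrist position from every
        center and flatten the relative coordinates, as described above."""
        centers = np.asarray(marker_centers, dtype=float)
        relative = centers - centers[wrist_index]
        return relative.ravel()        # (2 * n_markers,) feature vector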

Slide 72: STS hand shape recognition. Step 1: the system processes the color-coded hand image and generates a set of features (color-coded hand feature extraction) for the second step. Step 2: the feature vector generated by the first step is used by the model matching module to fit a 3D kinematics model of the hand; the resulting hand configuration is sent to the classification module (output: e.g. precision grasp).

Slide 73: STS hand shape recognition 2: 3D Hand Model Matching. The hand is modelled with 14 degrees of freedom (illustrated alongside a realistic drawing of the hand bones). The model matching algorithm minimizes the error between the extracted features (the feature vector) and the model hand; the resulting configuration is passed to classification, which outputs the grasp type.

Slide 74: Virtual Hand/Arm and Reach/Grasp Simulator. [Figures: a power grasp and a side grasp; a precision pinch.]

Slide 75: Virtual Arm. A kinematic model of an arm and hand with 19 DOF: shoulder (3), elbow (1), wrist (3), fingers (4x2), thumb (3). Implementation requirements: Rendering - given the 3D positions of the links' start and end points, generate a 2D representation of the arm/hand (easy). Forward kinematics - given the 19 joint angles, compute the position of each link (easy). Inverse kinematics - given a desired position in space for a particular link, find the joint angles that achieve it (semi-hard). Reach and grasp execution - harder than simple inverse kinematics, since there are more constraints to be satisfied, e.g. multiple target positions to be achieved at the same time (hard).

Slide 76: A 2D, 3-DOF arm example. Forward kinematics: given joint angles A, B, C (the orientation of each of the three links, of lengths a, b, c), compute the end effector position P(x,y): X = a*cos(A) + b*cos(B) + c*cos(C); Y = a*sin(A) + b*sin(B) + c*sin(C). Inverse kinematics: given a desired end effector position P(x,y), there are infinitely many joint angle triplets that achieve P! [Figure: circles of radius a (about the shoulder) and c (about P), connected by segments that all have equal length b, illustrating the many solutions.]
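The forward kinematics above translates directly into code. Here is a small Python version, reading A, B, C as the absolute orientations of the three links (which is how the slide's formula is written), with unit link lengths as illustrative defaults:

    import math

    def forward_kinematics_2d(A, B, C, a=1.0, b=1.0, c=1.0):
        """End-effector position P(x, y) of the planar 3-DOF arm on the slide.
        A, B, C are link orientations in radians; a, b, c are link lengths."""
        x = a * math.cos(A) + b * math.cos(B) + c * math.cos(C)
        y = a * math.sin(A) + b * math.sin(B) + c * math.sin(C)
        return x, y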

Slide 77: A Simple Inverse Kinematics Solution. Let us constrain ourselves to the arm. The forward kinematics of the arm can be represented as a vector function that maps the joint angles of the arm to the wrist position: (x,y,z) = F(s1,s2,s3,e), where s1, s2, s3 are the shoulder angles and e is the elbow angle. We can formulate the inverse kinematics problem as an optimization problem: given the desired position P' = (x',y',z') to be achieved, introduce the error function J = ||P' - F(s1,s2,s3,e)||. Then compute the gradient of J with respect to s1, s2, s3, e and follow the negative gradient to reach the minimum of J. This method is called the Jacobian transpose method, as the partial derivatives of F encountered in the process can be arranged into the transpose of a special derivative matrix, the Jacobian (of F).
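A numerical sketch of this idea in Python, using a finite-difference estimate of the gradient in place of the analytic Jacobian-transpose computation; F stands for any forward-kinematics function returning the wrist position, and the step size and iteration count are arbitrary choices:

    import numpy as np

    def inverse_kinematics(F, target, q0, lr=0.05, steps=2000, eps=1e-5):
        """Minimize J(q) = ||target - F(q)||^2 by following the negative gradient.
        F maps joint angles (array) to an end-effector position (array); the
        gradient of J is estimated here by finite differences."""
        q = np.array(q0, dtype=float)
        for _ in range(steps):
            err = target - F(q)
            J0 = np.sum(err ** 2)
            grad = np.zeros_like(q)
            for i in range(len(q)):
                dq = np.zeros_like(q)
                dq[i] = eps
                grad[i] = (np.sum((target - F(q + dq)) ** 2) - J0) / eps
            q -= lr * grad                      # follow the negative gradient
        return q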

Slide 78: Other Issues. Other inverse kinematics solution methods? Inverse kinematics for multiple targets. Precision grasp planning: determination of finger contact points, etc.

Slide 79: Power grasp time series data. [Plot legend: +: aperture; *: angle 1; x: angle 2; the remaining markers show axis displacement 1, axis displacement 2, speed, and distance.]

Slide 80: Curve recognition. The general problem: associate N-dimensional space curves with object affordances. A special case: the recognition of two- (or three-) dimensional trajectories in physical space. Simplest solution: map the temporal information into the spatial domain, then apply known pattern recognition techniques. Problem with the simplest solution: the speed of the moving point - the spatial representation may change drastically with speed. Scaling can overcome this, but the scaling must preserve the generalization ability of the pattern recognition engine. Solution: fit a cubic spline to the sampled values, then normalize and re-sample from the spline curve. Result: very good generalization, with better performance than using Fourier coefficients to recognize curves.

Slide 81: Curve recognition. Spatial resolution: 30. Network input size: 30. Hidden layer size: 15. Output size: 5. Training: back-propagation with momentum and an adaptive learning rate. [Figure: sampled points, the points used for spline interpolation, and the fitted spline.] The curve recognition system was demonstrated on hand-drawn numeral recognition (successful recognition examples for 2, 8 and 3).

Slide 82: Core Mirror Circuit. [Schema diagram: hand shape recognition and hand motion detection, together with hand-object spatial relation analysis, produce the hand state; the association (7b) neurons associate object affordances with the hand state; their output drives the mirror neurons (F5 mirror), which perform action recognition, integrate the temporal association, and provide mirror feedback; the motor program (F5 canonical) drives motor execution.]

Slide 83: Connectivity pattern. [Diagram of connections among object affordance (AIP), 7a, STS, 7b, the motor program (F5 canonical), and F5 mirror, including a mirror feedback pathway.]

Slide 84: A single grasp trajectory viewed from three different angles. The wrist trajectory during the grasp is shown by square traces, with the distance between any two consecutive trace marks traveled in equal time intervals. The network classifies the action as a power grasp (empty squares: power grasp output; filled squares: precision grasp output; crosses: side grasp output).

Slide 85: Power and precision grasp resolution

Slide 86: Future Directions. 1) Technology: increasing the robustness and learning rates of the schemas using improved learning algorithms for artificial neural nets. 2) Neuroscience: implementing the schemas with biologically plausible neural nets to model neurophysiological data. 3) Learning to recognize variations on, and assemblages of, familiar actions. 4) Imitation. 5) Language!

Slide 87: Related Research Issues: Grasping and the Mirror System in Monkey. Modeling of monkey brain mechanisms for: visually guided behavior; mirror neurons; vocalization and communication; multi-modal integration; compound behaviors and social interactions. This will build, for example, on our earlier modeling of interactions of pre-SMA and basal ganglia, and of the role of dopamine in planning, based on behavior, neuroanatomy and neurophysiology.

Slide 88: Some Specific Subgoals for Mirror System Modeling. Development of the mirror system: development of grasp specificity in F5 motor and canonical neurons; visual feedback for grasping as a possible precursor of the mirror property. Recognition of novel and compound actions and their context: the pliers experiment (extending the visual vocabulary); recognition of compounds of known movements; from action recognition to understanding (context and expectation).

Slide 89: Visual Feedback for Grasping: A Possible Precursor of the Mirror Property. Hypothesis: the F5 mirror neurons develop by selecting, via re-afferent connections, patterns of visual input describing those relations of hand shapes and motions to objects that are effective in visual guidance of a successful grasp. The validation here is computational: if the hypothesis is correct, we will be able to show that such a hand control system indeed exhibits most of the properties needed for a mirror system for grasping. For a reaching task, the simplest visual feedback is some form of signal of the distance between object and hand. This may suffice for grabbing bananas, but for peeling a banana, feedback on the shape of the hand relative to the banana, as well as force feedback, becomes crucial. We predict that the parameters needed for such visual feedback for grasp will look very much like those we specified explicitly for our MNS1 hand state.

Slide 90: Related Research Issues 2. A simple imitation system for grasping [shared with the common ancestor of human and chimpanzee]; a complex imitation system for grasping. Research: comparative modeling of primate brain mechanisms based on data from primate behavior, neuroanatomy and neurophysiology, and human brain imaging: extending the monkey model to chimp and human; a comparative/evolutionary model of different types of imitation.

Slide 91: Learning in the Mirror System. For both oro-facial and grasp mirror neurons we may have a limited "hard-wired" repertoire that can then be built on through learning: 1) developing a set of basic grasps that are effective; 2) learning to associate the view of one's own hand with grasp and object; 3) matching this to views of others grasping; 4) learning new grasps by imitation of others. We do not know if the necessary learning is in F5 or elsewhere.

Slide 92: Beyond Simple Mirroring. Tuning mirror neurons in terms of self-movement: possession of a movement precedes its perception. But imitation: perception of a novel movement precedes the ability to perform (some approximation to) that movement. Eventually, we perceive many things we cannot do: variation (= a known movement y plus a change Δy) and assemblage. A key question for our empirical study of imitation: how can we objectively compare "imitation capability" across species?

