1 Multimodal Expressive Embodied Conversational Agents
Université Paris 8
Catherine Pelachaud, Elisabetta Bevacqua, Nicolas Ech Chafai (FT), Maurizio Mancini, Magalie Ochs (FT), Christopher Peters, Radek Niewiadomski

2 ECAs Capabilities
– Anthropomorphic autonomous figures
– A new form of human-machine interaction
– Grounded in the study of human communication and human-human interaction
– ECAs ought to be endowed with dialogic and expressive capabilities
– Perception: an ECA must be able to pay attention to, and perceive, the user and the context she is placed in

3 ECAs Capabilities
– Interaction:
  – speaker and addressee emit signals
  – speaker perceives feedback from the addressee
  – speaker may decide to adapt to the addressee's feedback
  – the social context must be considered
– Generation: expressive, synchronized visual and acoustic behaviors
  – produce expressive behaviours: words, voice, intonation; gaze, facial expression, gesture; body movements, body posture

4 Synchrony Tool - BEAT
– Cassell et al., Media Lab MIT
– Decomposition of text into theme and rheme
– Linked to WordNet
– Computation of: intonation, gaze, gesture

5 Virtual Training Environments: MRE (J. Gratch, L. Johnson, S. Marsella…, USC)

6 Interactive System
– Real estate agent
– Gesture synchronized with speech and intonation
– Small talk
– Dialog partner

7 MAX (S. Kopp, U. of Bielefeld): gesture understanding and imitation

8 Gilbert and George at the Bank (UPenn, 1994)


10 Greta

11 Problem to Be Solved
– Human communication is endowed with three devices to express communicative intention:
  – words and formulas
  – intonation and paralinguistics
  – facial expression, gaze, gesture, body movement, posture…
– Problem: for any communicative act, the Speaker has to decide:
  – which nonverbal behaviors to show
  – how to execute them

12 Verbal and Nonverbal Communication
– Suppose I want to advise a friend to take her umbrella because it is raining.
– Which signals do I use?
– Verbal signal only: a syntactically complex sentence:
  "Take your umbrella because it is raining"
– Verbal + nonverbal signals:
  "Take your umbrella" + pointing to the window to indicate the rain by gesture or gaze

13 Multimodal Signals
– The whole body communicates by using:
  – verbal acts (words and sentences)
  – prosody, intonation (nonverbal vocal signals)
  – gesture (hand and arm movements)
  – facial action (smile, frown)
  – gaze (eye and head movements)
  – body orientation and posture (trunk and leg movements)
– All these systems of signals have to cooperate in expressing the overall meaning of the communicative act.

14 Multimodal Signals
– Accompany the flow of speech
– Synchronized at the verbal level
– Punctuate accented phonemic segments and pauses
– Substitute for word(s)
– Emphasize what is being said
– Regulate the exchange of speaking turns

15 Synchronization
– There exists an isomorphism between patterns of speech, intonation and facial actions
– Different levels of synchrony:
  – phoneme level (blink)
  – word level (eyebrow)
  – phrase level (hand gesture)
– Interactional synchrony: synchrony between speaker and addressee

16 Taxonomy of Communicative Functions (I. Poggi)
The speaker may provide three broad types of information:
– Information about the world: deictic, iconic (adjectival), …
– Information about the speaker's mind:
  – belief (certainty, adjectival)
  – goal (performative, rheme/theme, turn-system, belief relation)
  – emotion
  – meta-cognitive
– Information about the speaker's identity (sex, culture, age…)

17 Multimodal Signals (Isabella Poggi)
– Multimodal signals are characterized by their placement with respect to the linguistic utterance and their significance in transmitting information. E.g.:
  – a raised eyebrow may signal surprise, emphasis, a question mark, a suggestion…
  – a smile may express happiness, be a polite greeting, be a backchannel signal…
– Two pieces of information are needed to characterize a multimodal signal:
  – its meaning
  – its visual action

18 Lexicon = (meaning, signal)
Expression meaning:
– deictic: this, that, here, there
– adjectival: small, difficult
– certainty: certain, uncertain…
– performative: greet, request
– topic comment: emphasis
– belief relation: contrast, …
– turn allocation: take/give turn
– affective: anger, fear, happy-for, sorry-for, envy, relief, …
Expression signal:
– deictic: gaze direction
– certainty: certain: palm-up open hand; uncertain: raised eyebrow
– adjectival: small eye aperture
– belief relation: contrast: raised eyebrow
– performative: suggest: small raised eyebrow, head aside; assert: horizontal ring
– emotion: sorry-for: head aside, inner eyebrow up; joy: raising fist up
– emphasis: raised eyebrows, head nod, beat

19 Representation Language
– Affective Presentation Markup Language (APML)
  – describes the communicative functions
  – works at the meaning level, not the signal level
– Example (shown on the original slide with APML tags):
  "Good morning, Angela. It is so wonderful to see you again. I was sure we would do so, one day!"
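The transcript has stripped the XML markup from the example sentence. As a rough illustration only, an APML annotation of that greeting might look like the following; the tag and attribute names are plausible reconstructions from published APML descriptions, not the original slide's markup:

    <apml>
      <performative type="greet">Good morning, Angela.</performative>
      <affective type="joy">It is <emphasis>so wonderful</emphasis> to see you again.</affective>
      <certainty type="certain">I was sure we would do so, one day!</certainty>
    </apml>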

20 Facial Description Language
– Facial expressions defined as (meaning, signal) pairs stored in a library
– Hierarchical set of classes:
  – Facial basis (FB) class: basic facial movement
  – An FB may be represented as a set of MPEG-4 compliant FAPs or, recursively, as a combination of other FBs using the '+' and '*' operators:
    FB = {fap3=v1, …, fap69=vk};
    FB' = c1*FB1 + c2*FB2;
    where c1 and c2 are constants and FB1 and FB2 can be:
    – previously defined FBs
    – FBs of the form {fap3=v1, …, fap69=vk}
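As a minimal sketch, this FB algebra can be implemented with operator overloading, holding the FAP values in a dictionary keyed by FAP number (the class and method names are illustrative, not the actual Greta code):

    class FB:
        """A facial basis: a set of MPEG-4 FAP values."""
        def __init__(self, faps):
            self.faps = dict(faps)              # e.g. {3: 120, 69: -40}

        def __add__(self, other):
            # FB1 + FB2: merge the two bases, summing values on shared FAPs
            merged = dict(self.faps)
            for fap, value in other.faps.items():
                merged[fap] = merged.get(fap, 0) + value
            return FB(merged)

        def __mul__(self, c):
            # c * FB: scale every FAP value by a constant
            return FB({fap: c * value for fap, value in self.faps.items()})
        __rmul__ = __mul__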

21 Facial Basis Class
– Examples of facial basis classes:
  – eyebrow: small_frown, left_raise, right_raise
  – eyelid: upper_lid_raise
  – mouth: left_corner_stretch, left_corner_raise
[Images: two facial bases combined ('+') into a compound expression ('=')]

22 Facial Displays
– Every facial display (FD) is made up of one or more FBs:
  – FD = FB1 + FB2 + FB3 + … + FBn;
  – surprise = raise_eyebrow + raise_lid + open_mouth;
  – worried = (surprise*0.7) + sadness;
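Using the FB sketch from slide 20, these displays compose directly; the FAP numbers below are placeholders, not the real MPEG-4 assignments:

    raise_eyebrow = FB({31: 80, 32: 80})
    raise_lid     = FB({19: 60, 20: 60})
    open_mouth    = FB({3: 200})
    sadness       = FB({31: -40, 32: -40, 12: -30})

    surprise = raise_eyebrow + raise_lid + open_mouth
    worried  = (surprise * 0.7) + sadness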

23 Facial Displays
– Probabilistic mapping between tags and signals, e.g.:
  – happy_for = (smile*0.5, 0.3) + (smile*0.25, 0.25) + (smile*2 + raised_eyebrow, 0.35) + (nothing, 0.1)
    (each alternative signal carries the probability of being chosen)
– Definition of a function class associating (meaning, signal) for the addressee
– Communicative function classes:
  – certainty
  – adjectival
  – performative
  – affective
  – …
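One way to realize this probabilistic mapping is a weighted random draw over the alternative signals. A sketch, assuming smile and raised_eyebrow are FBs defined as on slide 20 (the 0.25 weight is reconstructed so the probabilities sum to 1):

    import random

    nothing = FB({})
    happy_for = [
        (smile * 0.5,                0.30),
        (smile * 0.25,               0.25),
        (smile * 2 + raised_eyebrow, 0.35),
        (nothing,                    0.10),
    ]

    def choose_signal(mapping):
        """Pick one display according to the attached probabilities."""
        displays, weights = zip(*mapping)
        return random.choices(displays, weights=weights, k=1)[0]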

24 Facial Temporal Course

25 Gestural Lexicon
– Certainty:
  – certain: palm-up open hand
  – uncertain: showing empty hands while lowering the forearms
– Belief relation:
  – list of items of the same class: numbering on fingers
  – temporal relation: fist with extended thumb moves back and forth behind one's shoulder
– Turn-taking:
  – hold the floor: raise hand, palm toward hearer
– Performative:
  – assert: horizontal ring
  – reproach: extended index finger, palm to the left, rotating up and down on the wrist
– Emphasis: beat

26 Gesture Specification Language
– Scripting language for hand-arm gestures, based on formational parameters [Stokoe]:
  – hand shape specified using HamNoSys [Prillwitz et al.]
  – arm position: concentric squares in front of the agent [McNeill]
  – wrist orientation: palm and finger-base orientation
– Gestures are defined by a sequence of timed key poses: gesture frames
– Gestures are broken down temporally into distinct (optional) phases:
  – gesture phases: preparation, stroke, hold, retraction
  – change of formational components over time
A sketch of such a specification is shown below.
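The phase vocabulary below follows the slide, while the field names and the key-pose values for "certain" are illustrative assumptions:

    from dataclasses import dataclass
    from typing import List

    @dataclass
    class GestureFrame:
        time: float        # key-pose time offset in seconds
        hand_shape: str    # HamNoSys-style hand shape label
        arm_position: str  # sector of McNeill's gesture space
        palm: str          # wrist orientation: palm direction
        fingers: str       # wrist orientation: finger-base direction

    @dataclass
    class GesturePhase:
        name: str                   # "preparation" | "stroke" | "hold" | "retraction"
        frames: List[GestureFrame]  # formational components changing over time

    certain = [
        GesturePhase("preparation", [GestureFrame(0.0, "flat", "center", "up", "forward")]),
        GesturePhase("stroke",      [GestureFrame(0.3, "open_flat", "center", "up", "forward")]),
        GesturePhase("retraction",  [GestureFrame(0.8, "relaxed", "rest", "down", "down")]),
    ]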

27 Gesture specification example: Certain

28 Gesture Temporal Course: rest position → preparation → stroke start – stroke end → retraction → rest position

29 ECA Architecture

30 ECA Architecture
– Input to the system: APML-annotated text
– Output of the system: animation files and a WAV file for the audio
– The system:
  – interprets APML-tagged dialogs, i.e. all communicative functions
  – looks up in a library the mapping between meaning (specified by the XML tag) and signals
  – decides which signals to convey on which modalities
  – synchronizes the signals with speech at different levels (word, phoneme or utterance)

31 Behavioral Engine

32 Modules
– APML Parser: XML parser
– TTS Festival: manages the speech synthesis and returns the list of phonemes and their durations
– Expr2Signal Converter: given a communicative function and its meaning, returns the list of facial signals
– Conflicts Resolver: resolves the conflicts that may happen when more than one facial signal should be activated on the same facial parts
– Face Generator: converts the facial signals into MPEG-4 FAP values
– Viseme Generator: converts each phoneme, given by Festival, into a set of FAPs
– MPEG4 FAP Decoder: an MPEG-4 compliant facial animation engine
A schematic of this data flow is sketched below.
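Schematically, the modules chain into a linear pipeline from tagged text to animation. The sketch shows the data flow only, with each module passed in as a callable; the function names are illustrative, not the actual module APIs:

    def run_engine(apml_text, parser, tts, expr2signal, resolver, face_gen, viseme_gen):
        """Data flow of the behavioral engine; one argument per module above."""
        functions, plain_text = parser(apml_text)   # APML Parser: tags -> communicative functions
        phonemes, wav = tts(plain_text)             # TTS Festival: phonemes + durations + audio
        signals = resolver(expr2signal(functions))  # meaning -> signals, then conflict resolution
        faps = face_gen(signals) + viseme_gen(phonemes)  # expression FAPs + lip FAPs
        return faps, wav                            # consumed by the MPEG4 FAP Decoder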

33 TTS Festival
– Drives the synchronization of facial expressions
– Synchronization implemented at the word level:
  – the timing of a facial expression is tied to the text embedded between its markers
– Festival's tree structure is used to compute expression durations

34 Expr2Signal Converter
– Instantiates APML tags: the meaning of a given communicative function
– Converts markers into facial signals
– Uses a library containing a lexicon of (meaning, facial expression) pairs

35 Gaze Model
– Based on Isabella Poggi's model of communicative functions
– The model predicts what the value of gaze should be in order to convey a given meaning in a given conversational context
– For example: if the agent wants to emphasize a given word, the model outputs that the agent should gaze at her conversant

36 Gaze Model
– Very deterministic behavior model: every communicative function associated with a meaning maps to the same signal (up to probabilistic variation)
– Event-driven model: the associated signals are computed only when a communicative function is specified, and only then may the behavior vary

37 Gaze Model
– Several drawbacks, as there is no temporal consideration:
  – no consideration of past and current gaze behavior when computing the new one
  – no consideration of how long the current gaze state of speaker (S) and listener (L) has lasted

38 Gaze Algorithm
Two steps:
1. Communicative prediction: apply the communicative function model to compute the gaze behavior that conveys a given meaning for S and L
2. Statistical prediction: the communicative gaze prediction is probabilistically modified by a statistical model constrained by:
   – the communicative gaze behavior of S and L
   – the gaze behavior S and L were in
   – the duration of the current state of S and L

39 Temporal Gaze Parameters
– Gaze behavior depends on the communicative functions, the general purpose of the conversation (persuasion discourse, teaching…), personality, cultural background, social relations…
– A complete model would be far too complex, so we propose parameters that control the overall gaze behavior:
  – T_max(S=1,L=1): maximum duration the mutual gaze state may remain active
  – T_max(S=1): maximum duration of gaze state S=1
  – T_max(L=1): maximum duration of gaze state L=1
  – T_max(S=0): maximum duration of gaze state S=0
  – T_max(L=0): maximum duration of gaze state L=0
A sketch of how these caps can drive the statistical step follows.
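In this sketch, the concrete durations, the 0.8 adoption probability and the flip rule are illustrative assumptions, not the published model:

    import random

    # Maximum state durations in seconds (values made up for illustration).
    T_MAX = {("S", 1): 4.0, ("S", 0): 2.5, ("L", 1): 6.0, ("L", 0): 2.0, "mutual": 3.0}

    def next_gaze_state(role, predicted, current, time_in_state, mutual_time):
        """role is "S" or "L"; a gaze state is 1 (looking at partner) or 0 (looking away)."""
        # Step 1 (communicative prediction) already produced `predicted`.
        # Step 2 (statistical prediction): break states that have lasted too long.
        if mutual_time > T_MAX["mutual"] or time_in_state > T_MAX[(role, current)]:
            return 1 - current                       # force a change of gaze state
        if predicted != current and random.random() < 0.8:
            return predicted                         # usually adopt the predicted state
        return current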

40 Mutual Gaze

41 Gaze Aversion

42 Gesture Planner
– Adaptive instantiation:
  – preparation and retraction phase adjustments
  – transition key and rest gesture insertion
  – joint-chain follow-through: forward time shifting of child joints
– Stroke of gesture falls on the stressed word
– Stroke expansion:
  – during the planning phase, identify rheme clauses with closely repeated emphases/pitch accents
  – indicate secondary accents by repeating the stroke of the primary gesture with decreasing amplitude

43 Gesture Planner
– Determination of gesture: look up in the dictionary
– Selection of gesture: gestures associated with the most deeply embedded tags have priority (except beats): adjectival, deictic
– Duration of gesture:
  – coarticulation between successive gestures close in time
  – hold for gestures belonging to tags higher up the hierarchy (e.g. performative, belief-relation)
  – otherwise go to rest position

44 Behavior Expressivity
– Behavior is related to (Wallbott, 1998):
  – the quality of the mental state (e.g. emotion) it refers to
  – its quantity (somehow linked to the intensity of the mental state)
– Behaviors encode:
  – content information (the 'what is communicated')
  – expressive information (the 'how it is communicated')
– Behavior expressivity refers to the manner of execution of the behavior

45 Expressivity Dimensions
– Spatial: amplitude of movement
– Temporal: duration of movement
– Power: dynamic properties of movement
– Fluidity: smoothness and continuity of movement
– Repetitiveness: tendency to rhythmic repeats
– Overall activation: quantity of movement across modalities
A compact representation is sketched below.
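These six dimensions fit naturally into a small parameter record; a sketch, assuming each value is normalized to [-1, 1] with 0 as neutral behavior (the normalization is our assumption):

    from dataclasses import dataclass

    @dataclass
    class Expressivity:
        overall_activation: float = 0.0  # quantity of movement across modalities
        spatial: float = 0.0             # amplitude of movement
        temporal: float = 0.0            # duration/speed of movement
        fluidity: float = 0.0            # smoothness and continuity
        power: float = 0.0               # dynamic properties of movement
        repetitiveness: float = 0.0      # tendency to rhythmic repeats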

46 Overall Activation
– Threshold filter on atomic behaviors during APML tag matching
– Determines the number of nonverbal signals to be executed

47 Spatial Parameter
– Amplitude of movement, controlled through asymmetric scaling of the reach space used to find IK goal positions
– Expands or condenses the entire space in front of the agent

48 Temporal Parameter
– Stroke shift / velocity control of a beat gesture
– Determines the speed of the arm movement of a gesture's meaning-carrying stroke phase
– Modifies the speed of the stroke
[Plot: Y position of wrist w.r.t. shoulder [cm] vs. frame number]

49 Fluidity
– Continuity control of TCB interpolation splines and gesture-to-gesture coarticulation
– Continuity of the arms' trajectory paths
– Controls the velocity profile of an action
[Plot: X position of wrist w.r.t. shoulder [cm] vs. frame number]

50 Power
– Tension and bias control of TCB splines; overshoot reduction
– Acceleration and deceleration of limbs
– Hand shape control for gestures that do not need a hand configuration to convey their meaning (beats)
A sketch of the TCB tangent computation follows.
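Both fluidity and power act on Kochanek-Bartels (TCB) splines. Below is the standard TCB tangent computation, plus an assumed mapping from the expressivity record of slide 45 to spline parameters; the mapping signs are our guess, only the tangent formula is standard:

    def tcb_tangents(p_prev, p, p_next, tension, continuity, bias):
        """Kochanek-Bartels incoming/outgoing tangents at key point p (floats or vectors)."""
        d1, d2 = p - p_prev, p_next - p
        out_tan = (0.5 * (1 - tension) * (1 + bias) * (1 + continuity) * d1 +
                   0.5 * (1 - tension) * (1 - bias) * (1 - continuity) * d2)
        in_tan  = (0.5 * (1 - tension) * (1 + bias) * (1 - continuity) * d1 +
                   0.5 * (1 - tension) * (1 - bias) * (1 + continuity) * d2)
        return in_tan, out_tan

    def tcb_from_expressivity(e):
        # Assumed mapping: fluid movement -> smoother arcs (higher continuity);
        # powerful movement -> tighter, more accelerated strokes (higher tension).
        return {"tension": e.power, "continuity": e.fluidity, "bias": 0.0}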

51 Repetitiveness
– Technique of stroke expansion: consecutive emphases are realized gesturally by repeating the stroke of the first gesture

52 Multiple Modalities, Ex.: Abrupt
– Overall activation = 0.6
– Spatial = 0
– Temporal = 1
– Fluidity = -1
– Power = 1
– Repetition = -1

53 Multiple Modalities, Ex.: Vigorous
– Overall activation = 1
– Spatial = 1
– Temporal = 1
– Fluidity = 1
– Power = 0
– Repetition = 1
In code, these two presets could read as follows.
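With the Expressivity record sketched at slide 45, the presets of slides 52 and 53 become:

    abrupt = Expressivity(overall_activation=0.6, spatial=0, temporal=1,
                          fluidity=-1, power=1, repetitiveness=-1)
    vigorous = Expressivity(overall_activation=1, spatial=1, temporal=1,
                            fluidity=1, power=0, repetitiveness=1)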

54 Evaluation of Expressive Gesture
– (H1) The chosen implementation for mapping single dimensions of expressivity onto animation parameters is appropriate: a change in a single dimension can be recognized and correctly attributed by users.
– (H2) Combining parameters so that they reflect a given communicative intent results in a more believable overall impression of the agent.
– 106 subjects, from 17 to 26 years old

55 Perceptual Test Studies
– Evaluation of the adequacy of the implementation of each parameter:
  – check whether subjects could perceive and distinguish the six expressivity parameters and indicate their direction of change
  – result: good recognition of the spatial and temporal parameters; lower recognition of the fluidity and power parameters, as they are inter-dependent
– Evaluation task: does setting appropriate values for the expressivity parameters create behaviors that are judged as exhibiting the corresponding expressivity?
  – 3 different types of behavior: abrupt, sluggish, vigorous
  – users prefer the coherent performance for vigorous and abrupt

56 Interaction
– Interaction: two or more parties exchange messages
– Interaction is by no means a one-way communication channel between parties
– Within an interaction, parties take turns playing the roles of speaker and addressee

57 Interaction
– Speaker and addressee adapt their behaviors to each other:
  – the speaker monitors the addressee's attention and interest in what he has to say
  – the addressee selects feedback behaviors to show the speaker that he is paying attention

58 Interaction
– Speaker:
  – it is pointless for a speaker to engage in an act of communication if the addressee does not pay, or intend to pay, attention
  – it is important for the speaker to assess the addressee's engagement:
    – when starting an interaction: assess the possibility of engagement (establish phase)
    – while the interaction is going on: check that engagement lasts and the conversation is sustained (maintain phase)

59 Interaction
– Addressee:
  – attention: pay attention to the signals produced by the speaker in order to perceive, process and memorize them
  – perception: of the signals
  – comprehension: understand the meaning attached to the signals
  – internal reaction: comprehension of the meaning may create a cognitive and emotional reaction
  – decision: whether or not to communicate the internal reaction
  – generation: display behaviors

60 Backchannel
– Types of backchannels (I. Poggi):
  – attention
  – comprehension
  – belief
  – interest
  – agreement
– each positive or negative
– any combination of the above: pay attention but not understand; understand but not believe, etc.

61 Backchannel
– Depending on the type of speech act it responds to, a signal will or will not be interpreted as a backchannel:
  – backchannel: a signal of agreement/disagreement that follows the expression of opinions, evaluations, plans
  – not a backchannel: a signal of comprehension/incomprehension after an explicit question: "Did you understand?"

62 Backchannel
– Polysemy of backchannel signals: a signal may provide different types of information
– A frown: negative feedback on understanding, believing and agreeing

63 Backchannel Signals: Gaze
– Gaze:
  – shows the direction of attention
  – informs on the level of engagement, or on the intention to maintain engagement
  – indicates the degree of intimacy
  – but also: monitors the gaze behavior of others to establish their intention to engage or remain engaged
– A shared attention situation involves mutual gaze between partners, or mutual gaze at the same object

64 Backchannel Modelling
– Reactive model:
  – generates instinctive feedback without reasoning
  – simple backchannel or mimicry
  – spontaneous, sincere
– Cognitive model:
  – conscious decision to provide backchannel, to provoke a particular effect on the speaker or to reach a specific goal
  – deliberate, possibly pretended
  – it can shift to automatic (e.g., when listening to a bore)

65 Backchannel Demo

66 A Reactive Backchannel
– Currently, our model is reactive in nature:
  – dependent on perception: the speaker interprets the addressee's behavior, and generates or alters its own behavior
  – our focus: interest and attention at the signal level (not at the cognitive level)

67 Organization of the Communication: Attraction of Attention
– Communicative agents: the agents provide information to the user, and should guarantee that the user pays attention
– Animation expressivity: the principle of "staging", so that a single idea is clearly expressed at each instant of time
– Animation specificity: animators' creativity, no realism constraints on animators
– What types of gesture properties could guarantee the user's attention?

68 Organization of the Communication: Attraction of Attention
– Corpus: videos from traditional animation that illustrate different types of conversational interaction
– The modulations of gesture expressivity over time play a role in managing communication, thus serving as a pragmatic tool

69 Emotion
– Elicited by the evaluation of events, objects, actions
– Integration of emotions in a dialog system (Artimis, FT)
– Identify under which circumstances a dialog agent should express emotions

70 Emotion: BDI Representation
– Based on the OCC model and its appraisal variables [Ortony et al. 1988]:
  – desirability/undesirability: achievement of, or threat to, the agent's choice
  – degree of realization: degree of certainty of the choice's achievement
  – probability of an event: probability that the event is feasible
  – agency: the agent who is the actor of the event
– An emotional mental state is a set of appraisal variables, represented as a configuration of mental attitudes
A sketch of such an appraisal record follows.
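The appraisal variables above can be packaged and mapped to an emotion label; the threshold rules below are purely illustrative, not the Artimis formalization:

    from dataclasses import dataclass

    @dataclass
    class Appraisal:
        desirability: float  # < 0 undesirable, > 0 desirable for the agent
        realization: float   # certainty that the event is achieved, in [0, 1]
        probability: float   # probability that the event is feasible, in [0, 1]
        agency: str          # actor of the event: "self", "other", ...

    def appraise(a: Appraisal) -> str:
        """Toy OCC-style rules for a desirable event."""
        if a.desirability > 0 and a.realization > 0.5:
            return "joy"
        if a.desirability > 0 and a.probability > 0.5:
            return "hope"
        if a.desirability > 0:
            return "disappointment"
        return "neutral"

This makes the masking example on the next slide concrete: a felt disappointment can be masked by a displayed joy when the social context demands it.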

71 Emotion
– Complex emotions:
  – superposition of two emotions: the evaluation of an event can happen from different angles
  – masking of one emotion by another: taking the social context into account
– joy + disappointment = masking

72 Video: Masking of Disappointment by Joy

73 Conclusion
– Creation of a virtual agent able to:
  – communicate nonverbally
  – show emotions
  – use expressive gestures
  – perceive and be attentive
  – maintain attention
– Two studies on expressivity:
  – from manual annotation of a video corpus
  – from mimicry of movement analysis

