1 multimodality, universals, natural interaction… and some other stories… Kostas Karpouzis & Stefanos Kollias ICCS/NTUA HUMAINE WP4

2 going multimodal
‘multimodal’ is this decade’s main ‘affective interaction’ aspect
a plethora of modalities is available to capture and process
– visual, aural, haptic…
– ‘visual’ can be broken down into ‘facial expressivity’, ‘hand gesturing’, ‘body language’, etc.
– ‘aural’ into ‘prosody’, ‘linguistic content’, etc.

3 why multimodal?
extending unimodality…
– recognition from traditional unimodal inputs had serious limitations
– multimodal corpora are becoming available
what is there to gain?
– have recognition rates improved?
– or have we just introduced more uncertain features?

4 essential reading
S. Oviatt, “Ten Myths of Multimodal Interaction”, Communications of the ACM, Nov. 1999, Vol. 42, No. 11, pp. 74-81

5 putting it all together
myth #6: multimodal integration involves redundancy of content between modes
you have features from a person’s
– facial expressions and body language
– speech prosody and linguistic content
– even their heart rate
so, what do you do when their face tells you something different than their …heart?
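One common answer (our illustration, not something prescribed in the slides) is decision-level fusion: let each modality produce its own distribution over labels and combine the distributions with trust weights. A minimal Python sketch; the modality names, weights and label set are invented for this example.

```python
# Minimal decision-level (late) fusion sketch: each modality outputs a
# probability distribution over emotion labels, and we combine them with
# per-modality trust weights. Labels and weights are illustrative only.

LABELS = ["angry", "happy", "sad", "neutral"]

def fuse(distributions, weights):
    """Weighted average of per-modality label distributions.

    distributions: dict modality -> {label: probability}
    weights:       dict modality -> trust in that modality (>= 0)
    """
    total = sum(weights.values())
    fused = {label: 0.0 for label in LABELS}
    for modality, dist in distributions.items():
        w = weights[modality] / total
        for label in LABELS:
            fused[label] += w * dist.get(label, 0.0)
    return fused

# The face "says" happy, the physiological channel "says" angry:
per_modality = {
    "face":      {"angry": 0.05, "happy": 0.80, "sad": 0.05, "neutral": 0.10},
    "prosody":   {"angry": 0.30, "happy": 0.30, "sad": 0.20, "neutral": 0.20},
    "heartbeat": {"angry": 0.60, "happy": 0.10, "sad": 0.10, "neutral": 0.20},
}
trust = {"face": 0.5, "prosody": 0.3, "heartbeat": 0.2}

fused = fuse(per_modality, trust)
print(max(fused, key=fused.get), fused)
```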

6 first, look at this video

7 and now, listen!

8 but it can be good
what happens when one of the available modalities is not robust?
– better yet, when the ‘weak’ modality changes over time?
consider the ‘bartender problem’
– very little linguistic content reaches its target
– mouth shape is available (visemes)
– limited vocabulary
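For the ‘bartender problem’, one plausible strategy (an assumption of ours, not something stated in the slides) is to make the fusion weights track an estimate of each channel’s current reliability, e.g. the audio signal-to-noise ratio, so that the viseme channel takes over when speech becomes unusable. The SNR-to-weight mapping and the tiny vocabulary below are purely illustrative.

```python
# Reliability-driven weighting sketch: shift trust from audio to visemes
# as the estimated audio SNR drops (e.g. in a noisy bar). The SNR->weight
# mapping and the limited vocabulary are illustrative assumptions.

VOCABULARY = ["beer", "wine", "water", "whisky"]

def audio_weight(snr_db, lo=0.0, hi=20.0):
    """Map an SNR estimate to a trust weight in [0, 1]."""
    return min(1.0, max(0.0, (snr_db - lo) / (hi - lo)))

def fuse_word_scores(audio_scores, viseme_scores, snr_db):
    """Combine per-word scores from the two channels, weighted by audio SNR."""
    w_audio = audio_weight(snr_db)
    w_vis = 1.0 - w_audio
    return {
        word: w_audio * audio_scores.get(word, 0.0)
              + w_vis * viseme_scores.get(word, 0.0)
        for word in VOCABULARY
    }

audio = {"beer": 0.20, "wine": 0.32, "water": 0.26, "whisky": 0.22}    # speech recogniser output
visemes = {"beer": 0.55, "wine": 0.15, "water": 0.20, "whisky": 0.10}  # lip/viseme reader output

for snr in (18.0, 3.0):   # quiet room vs. loud bar: trust shifts from audio to visemes
    scores = fuse_word_scores(audio, visemes, snr)
    print(snr, max(scores, key=scores.get))
```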

9 but it can be good

10 again, why multimodal?
holy grail: assigning labels to different parts of human-human or human-computer interaction
yes, labels can be nice!
– humans do it all the time
– and so do computers (e.g., classification)
– OK, but what kind of label?

11 In the beginning…
based on the claim that ‘there are six facial expressions recognized universally across cultures’…
– all video databases used to contain images of sad, angry, happy or fearful people…
– thus, more sad, angry, happy or fearful people appear, even when the data involve HCI, and subtle emotions/additional labels are out of the picture
– can you really be afraid that often when using your computer?

12 the Humaine approach
so where is Humaine in all that?
– subtle emotions
– natural expressivity
– alternative emotion representations
– discussing dynamics
– classification of emotional episodes from life-like HCI and reality TV
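To make ‘what kind of label?’ and ‘alternative emotion representations’ concrete: annotations can be categorical (one of a few emotion words) or dimensional (continuous valence/activation traces, as produced by Feeltrace). The sketch below contrasts the two; the field names are illustrative assumptions, not a HUMAINE or Feeltrace schema.

```python
# Two ways of labelling the same emotional episode: a categorical tag
# versus a continuous, Feeltrace-style valence/activation trace.
# Field names are illustrative, not an official HUMAINE/Feeltrace schema.

from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class CategoricalLabel:
    start: float          # seconds into the clip
    end: float
    category: str         # e.g. "angry", "amused", "bored"

@dataclass
class DimensionalTrace:
    # (time, valence, activation) samples, each dimension in [-1, 1]
    samples: List[Tuple[float, float, float]] = field(default_factory=list)

    def mean_valence(self) -> float:
        return sum(v for _, v, _ in self.samples) / len(self.samples)

episode_cat = CategoricalLabel(start=12.0, end=17.5, category="amused")
episode_dim = DimensionalTrace(samples=[(12.0, 0.4, 0.2), (14.0, 0.6, 0.5),
                                        (17.5, 0.3, 0.1)])
print(episode_cat.category, round(episode_dim.mean_valence(), 2))
```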

13 Humaine WP4 results

ERMIS SAL (QUB-ICCS)
– frames/users/length: four subjects, ~2 hr of audio/video annotated with Feeltrace
– modalities present: facial expressions, speech prosody, head pose
– features extracted: FAPs per frame, acoustic features per tune, phonemes/visemes
– until now: one subject analyzed (~34,000 frames, ~800 tunes)
– plans for 2007: extract facial and prosody features from the three remaining subjects; analyze head pose
– recognition rates: recurrent NNs: 87%, rule-based: 78.4%, possibilistic: 65.1%

EmoTV (LIMSI)
– frames/users/length: 28 clips, ~5 minutes total
– modalities present: subtle facial expressions, restricted gesturing
– features extracted: overall activation (FAPs or prosody not possible)
– until now: all clips
– plans for 2007: extract remaining expressivity features (where possible)
– recognition rates: correlation with manual annotator: κ = 0.83

Emo Taboo (LIMSI)
– frames/users/length: 2 clips, ~5 minutes
– modalities present: facial expressions, speech prosody
– features extracted: FAPs
– until now: all clips
– plans for 2007: head pose, prosody features
– recognition rates: annotation not yet available

CEICES (FAU)
– frames/users/length: 51 children, ~9 hrs recorded and annotated
– modalities present: speech prosody
– features extracted: acoustic features per turn/word
– until now: all clips
– plans for 2007: completed analysis, pending comparison of recognition schemes
– recognition rates: mean recognition rate: 55.8%

Genoa 06 corpus (Genoa)
– frames/users/length: 10 subjects, ~50 gesture repetitions each, ~1 hour
– modalities present: facial expressions (FAPs), gesturing, pseudolanguage
– features extracted: FAPs, gestures, speech
– until now: all clips
– plans for 2007: expressivity features from hand movement
– recognition rates: facial: 59.6%, gestures: 67.1%, speech: 70.8%, multimodal: 78.3%

GEMEP (GERG)
– frames/users/length: 1200 clips total
– modalities present: facial expressions (FAPs), gesturing, pseudolanguage
– features extracted: expressivity, gestures, FAPs, speech
– until now: 8 body clips, 30 face clips
– plans for 2007: analyze the remaining 1200 clips
– recognition rates: few clips analyzed

14 HUMAINE 2010 three years from now in a galaxy (not) far, far away…

15 a fundamental question

16 OK, people may be angry or sad, or express positive/active emotions
face recognition provides a response to the ‘who?’ question
‘when?’ and ‘where?’ are usually known or irrelevant
but does anyone know ‘why?’
– context information
– semantics

17 a fundamental question (2)

18 is it me, or…?

19 some modalities may display no clues or, worse, contradictory clues
the same expression may mean different things coming from different people
can we ‘bridge’ what we know about someone or about the interaction with what we sense?
– and can we adapt what we know based on that?
– or can we align what we sense with other sources?
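One simple way (our illustration, not a HUMAINE recipe) to let what we know about a person recalibrate what we sense is per-person baseline normalization: a raw feature is interpreted relative to that person’s own neutral statistics, so the same measured value can mean different things for different people. The numbers below are invented.

```python
# Per-person baseline normalization sketch: interpret a raw feature
# (e.g. speech loudness or an expressivity measure) relative to the
# person's own neutral mean/std, so the same value can mean different
# things for different speakers. Values are invented for illustration.

from statistics import mean, stdev

def personal_zscore(value, neutral_samples):
    """Z-score of a new observation against a person's neutral baseline."""
    mu = mean(neutral_samples)
    sigma = stdev(neutral_samples) or 1.0
    return (value - mu) / sigma

quiet_speaker_baseline = [55.0, 57.0, 56.0, 54.0]   # dB, habitually quiet
loud_speaker_baseline  = [70.0, 72.0, 69.0, 71.0]   # dB, habitually loud

observation = 68.0   # the same measured loudness...
print(personal_zscore(observation, quiet_speaker_baseline))  # clearly raised voice
print(personal_zscore(observation, loud_speaker_baseline))   # below this person's baseline
```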

20 another kind of language

21 sign language analysis poses a number of interesting problems
– image processing and understanding tasks
– syntactic analysis
– context (e.g. when referring to a third person)
– natural language processing
– vocabulary limitations

22 want answers? Let us try to extend some of the issues already raised!

23 Semantic Analysis
Semantics – Context (a peek at the future)
[architecture diagram: visual data → segmentation → feature extraction → classifiers C1, C2, …, Cn → context fusion → adaptation → labelling, supported by an ontology infrastructure, context analysis, visual analysis, the Fuzzy Reasoning Engine (FiRE) and a centralised/decentralised knowledge repository]
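A skeletal rendering of that pipeline, just to show how the stages chain together; every function body below is a stand-in stub we made up, not the actual HUMAINE implementation (FiRE, the ontologies and the knowledge repository are not modelled).

```python
# Skeleton of the semantic-analysis pipeline sketched on the slide:
# segmentation -> feature extraction -> classifiers C1..Cn -> context
# fusion -> labelling. Every stage is a stand-in stub, not the actual
# HUMAINE implementation (FiRE, ontologies, knowledge repository, ...).

def segment(visual_data):
    """Split the input into analysable segments (stub)."""
    return [visual_data]

def extract_features(seg):
    """Compute e.g. FAPs / expressivity features (stub)."""
    return {"smile": 0.7, "head_tilt": 0.1}

def classifiers(features):
    """Run the individual classifiers C1..Cn (stubs)."""
    return [
        {"happy": 0.6, "neutral": 0.4},   # C1
        {"happy": 0.5, "neutral": 0.5},   # C2
    ]

def context_fusion(outputs, context):
    """Average classifier outputs, then bias them with context info (stub)."""
    fused = {}
    for dist in outputs:
        for lbl, p in dist.items():
            fused[lbl] = fused.get(lbl, 0.0) + p / len(outputs)
    if context.get("situation") == "game":
        fused["happy"] = fused.get("happy", 0.0) * 1.1   # toy context bias
    return fused

def label(visual_data, context):
    scores = {}
    for seg in segment(visual_data):
        scores = context_fusion(classifiers(extract_features(seg)), context)
    return max(scores, key=scores.get)

print(label("frame_sequence.avi", {"situation": "game"}))
```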

24 Standardisation Activities
W3C Multimedia Semantics Incubator Group
W3C Emotion Incubator Group
– provide machine-understandable representations of available emotion modelling, analysis and synthesis theory, cues and results, to be accessed through the Web and used in all types of affective interaction
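As a purely illustrative sketch of what such a machine-understandable representation might look like when exchanged over the Web (the element and attribute names below are invented for this example and are not the W3C groups’ actual vocabulary):

```python
# Illustrative serialisation of a recognition result into XML so that it
# can be exchanged over the Web. Element/attribute names are invented for
# this sketch; they are NOT the W3C Emotion Incubator Group's vocabulary.

import xml.etree.ElementTree as ET

def emotion_to_xml(category, intensity, modality, start_s, end_s):
    root = ET.Element("emotion-annotation")
    emo = ET.SubElement(root, "emotion",
                        category=category, intensity=f"{intensity:.2f}")
    ET.SubElement(emo, "modality").text = modality
    ET.SubElement(emo, "timespan", start=str(start_s), end=str(end_s))
    return ET.tostring(root, encoding="unicode")

print(emotion_to_xml("amused", 0.7, "face+prosody", 12.0, 17.5))
```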

