
Slide 1: Emotion and Speech
Techniques, models and results. Facts, fiction and opinions. Past, present and future. Acted, spontaneous, recollected. In Asia, Europe, America and the Middle East.
HUMAINE Workshop on Signals and Signs (WP4), Santorini, September 2004

Slide 2: Overview
- A short introduction to speech science... and speech analysis tools
- Speech and emotion: models, problems... and results
- A review of open issues
- Deliverables within the HUMAINE framework

Slide 3: Part 1 - Speech science in a nutshell

Slide 4: A short introduction to SPEECH
- Most of those present here are familiar with various aspects of signal processing
- For the benefit of those who aren't acquainted with the speech signal in particular, we'll start with an overview of speech production models and analysis techniques
- The rest of you can sleep for a few minutes

Slide 5: The speech signal
- A 1-D signal. Does that make it a simple one? No...
- There are many analysis techniques; as with many types of systems, parametric models are very useful here
- A simple and very useful speech production model: the source/filter model
- (In case you're worried, we'll see that this is directly related to emotions too)

Slide 6: The source/filter model
Components:
- The lungs (create air pressure)
- Two elements that turn this into a "raw" signal:
  - The vocal folds (periodic signals)
  - Constrictions that make the airflow turbulent (noise)
- The vocal tract:
  - Partly immobile: upper jaw, teeth
  - Partly mobile: soft palate, tongue, lips, lower jaw (also called "articulators")
  - Its influence on the raw signal can be modeled very well with a low-order (~10) digital filter
(Diagram: source -> filter. A minimal sketch of the model follows.)
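The source/filter idea maps directly onto a few lines of DSP. Below is a minimal sketch of our own (not from the slides): an impulse train stands in for the vocal folds, and an all-pole filter with made-up formant-like pole frequencies stands in for the vocal tract. The slide's ~10th-order filter would simply use more pole pairs.

```python
import numpy as np
from scipy.signal import lfilter

fs = 16000                              # sampling rate (Hz)
f0 = 120                                # vocal-fold frequency (Hz)
n = int(0.5 * fs)                       # half a second of samples

# Source: the vocal folds, approximated as an impulse train at f0
source = np.zeros(n)
source[::fs // f0] = 1.0

# Filter: the vocal tract as a low-order all-pole filter; the pole
# frequencies below are illustrative values, not measurements
pole_freqs = (500.0, 1500.0, 2500.0)
poles = [0.97 * np.exp(2j * np.pi * f / fs) for f in pole_freqs]
a = np.poly(poles + [p.conjugate() for p in poles]).real
speech = lfilter([1.0], a, source)      # a crude voiced-speech-like signal
```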

Slide 7: The net result
A complex signal that changes its properties constantly:
- Sometimes periodic, sometimes colored noise
- Approximately stationary over time windows of ~20 milliseconds (a framing sketch follows)
And of course, it contains a great deal of information:
- Text: linguistic information
- Other stuff: paralinguistic information - speaker identity, gender, socioeconomic background, stress, accent, emotional state, etc.
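Because the signal is only quasi-stationary, virtually all speech analysis starts by slicing it into short overlapping frames of roughly that length. A minimal sketch (function name and defaults are our own):

```python
import numpy as np

def frame_signal(x, fs, frame_ms=20, hop_ms=10):
    """Slice x into overlapping analysis frames of ~frame_ms milliseconds."""
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n = 1 + max(0, (len(x) - frame) // hop)
    return np.stack([x[i * hop : i * hop + frame] for i in range(n)])
```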

Slide 8: How is this information coded?
- Textual information: mainly in the filter and the way it changes its properties over time
  - Filter "snapshots" are called segments
- Paralinguistic information: mainly in the source parameters
  - Lung pressure determines the intensity
  - Vocal fold periodicity determines the instantaneous frequency, or "pitch"
  - The configuration of the glottis determines the overall spectral tilt: "voice quality"

Slide 9: Prosody
Prosody is another name for part of the paralinguistic information, composed of:
- Intonation: the way pitch changes over time
- Intensity: changes in intensity over time (problem: some segments are inherently weaker than others)
- Rhythm: segment durations vs. time
Prosody does not include voice quality, but voice quality is also part of the paralinguistic information.

Slide 10: To summarize
- Speech science is at a mature stage
- The source/filter model is very useful in understanding speech production
- Many applications (speech recognition, speaker verification, emotion recognition, etc.) require extraction of the model parameters from the speech signal (an inverse problem)
- This is the domain of speech analysis techniques

Slide 11: Part 2 - Speech analysis and classification

Slide 12: The large picture: speech analysis in the HUMAINE framework
Speech analysis is just one component in the context of speech and emotion:
[Diagram: theory of emotion and training data feed the speech analysis engine, which processes real data for a high-level application]
Its overall objectives:
- Calculate raw speech parameters
- Extract features salient to emotional content
- Discard irrelevant features
- Use them to characterize and maybe classify emotional speech

Slide 13: Signals to Signs - the process
[Diagram: databases/files -> data cleaning and integration -> data warehouse -> selection and transformation -> data representation -> data mining -> patterns -> evaluation and presentation -> knowledge]

Slide 14: S2S (SOS...?) - the tools
A combination of techniques that belong to different disciplines:
- Data warehouse technologies (data storage, information retrieval, query answering, etc.)
- Data preprocessing and handling
- Data modeling / visualization
- Machine learning (statistical data analysis, pattern recognition, information retrieval, etc.)

Slide 15: The objective of speech analysis techniques
1. To extract the raw model parameters from the speech signal
   - Interfering factors: reality never exactly fits the model; background noise; speaker overlap
2. To extract features
3. To interpret them in meaningful ways (pattern recognition) - really hard!

Slide 16: It remains that -
- Useful models and techniques exist for extracting the various information types from the speech signal
- Yet... many applications such as speech recognition, speaker identification, speech synthesis, etc., are far from being perfected
- ... So what about emotion?

Slide 17: For the moment, let's focus on the small picture
The consensus is that emotions are coded in:
- Prosody
- Voice quality
- And sometimes in the textual information
Let's discuss the purely technical aspects of evaluating all of these...

Slide 18: Extracting features from the speech signal
Stage 1 - extracting raw features:
- Pitch
- Intensity
- Voice quality
- Pauses
- Segmental information: phones and their durations
- Text
(By the way... who extracts them: man, machine, or both?)

Slide 19: Pitch
- Pitch: the instantaneous frequency
- Sounds deceptively simple to find, but it isn't! Lots of research has been devoted to pitch detection
- Composed of two sub-problems:
  - For a given signal, is there periodicity at all?
  - If so, what's the fundamental frequency?
- Complicating factors:
  - Speaker-related: hoarseness, diplophony, etc.
  - Background-related: noise, overlapping speakers, filters (as in telephony)
- In the context of emotions:
  - Small errors are acceptable
  - Large errors (octave jumps, false positives) are catastrophic
(A naive detector illustrating the two sub-problems is sketched below.)
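As a concrete illustration, here is a deliberately naive autocorrelation detector for a single frame. Real pitch trackers are far more elaborate, and the voicing threshold below is an arbitrary assumption:

```python
import numpy as np

def detect_pitch(frame, fs, fmin=60.0, fmax=400.0, voicing_threshold=0.4):
    """Return f0 in Hz for one analysis frame, or None if it looks unvoiced."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return None                              # no energy at all: silence
    ac = ac / ac[0]                              # normalize so ac[0] == 1
    lo = int(fs / fmax)                          # shortest plausible period
    hi = min(int(fs / fmin), len(frame) - 1)     # longest plausible period
    lag = lo + int(np.argmax(ac[lo:hi]))
    if ac[lag] < voicing_threshold:
        return None                              # sub-problem 1: not periodic
    return fs / lag                              # sub-problem 2: fundamental frequency
```

Note that exactly the failure modes the slide warns about live here: picking a peak at half the true lag gives an octave jump, and a noisy frame that sneaks past the threshold is a false positive.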

Slide 20: An example
[Figure: the raw pitch contour in PRAAT, with errors marked]

Slide 21: Intensity
- Appears to be even simpler than pitch! Intensity is quite easy to measure...
- Yet it is most influenced by unrelated factors. Aside from the speaker, intensity is gravely affected by:
  - Distance from the microphone
  - Gain settings in the recording equipment
  - Clipping
  - AGC (automatic gain control)
  - Background noise
  - Recording environment
- Without normalization, intensity is almost useless! (see the sketch below)
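A sketch of frame-wise intensity plus one plausible normalization, subtracting the per-recording median so that gain settings and microphone distance drop out. The choice of median-based normalization is our assumption, not the slides':

```python
import numpy as np

def intensity_db(frames, eps=1e-10):
    """Frame-wise RMS intensity in dB (relative, not calibrated SPL)."""
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return 20.0 * np.log10(rms + eps)

def normalize_intensity(db_contour):
    """Remove the recording-level offset so only relative dynamics remain."""
    return db_contour - np.median(db_contour)
```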

Slide 22: Voice quality
- Several measures are used to quantify it:
  - Local irregularity in pitch and intensity
  - The ratio between harmonic components and noise components
  - The distribution of energy in the spectrum
- Affected by a multitude of factors other than emotion
- Some standardized measures are often used in clinical applications
- A large factor in emotional speech! (two common irregularity measures are sketched below)
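Two of the irregularity measures above can be sketched in a few lines, given per-cycle period and peak-amplitude estimates from an upstream pitch tracker. These follow the usual "local jitter/shimmer" definitions; this is an illustration, not a clinical tool:

```python
import numpy as np

def local_jitter(periods):
    """Mean cycle-to-cycle period difference relative to the mean period."""
    p = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(p))) / np.mean(p)

def local_shimmer(peak_amplitudes):
    """Mean cycle-to-cycle amplitude difference relative to the mean amplitude."""
    a = np.asarray(peak_amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(a))) / np.mean(a)
```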

Slide 23: Segments
- There are different ways of defining precisely what these are
- Automatic segmentation is difficult, though not as difficult as speech recognition
- Even the segment boundaries can give important timing information related to rhythm, an important component of prosody

Slide 24: Text
- Is this "raw" data or not? Is it data... at all?
- Some studies on emotion specifically eliminated this factor (filtered speech, uniform texts); other studies are interested mainly in text
- If we want to deal with text, we must keep in mind that automated speech recognition is HARD:
  - Especially with strong background noise
  - Especially when strong emotions are present, modifying the speakers' normal voices and mannerisms
  - Especially when dealing with multiple speakers

Slide 25: Some complicating factors in raw feature extraction
- Background noise
- Speaker overlap
- Speaker variability
- Variability in recording equipment

Slide 26: In the general context of speech analysis
- The raw features we discussed are not specific to the study of emotion
- Yet issues related to calculating them reliably crop up again and again in emotion-related studies
- Some standard and reliable tools would be very helpful

Slide 27: Two opposing approaches to computing raw features
Ideal (but error-prone):
- Assume we have perfect algorithms for extracting all this information; if we don't, help out manually
- This can be carried out only over small databases
- Useful in purely theoretical studies
Real life:
- Acknowledge we only have imperfect algorithms
- Find how to deal automatically with imperfect data
- Very important for large databases

Slide 28: Next - what do we do with it all?
- Reminder: we have large amounts of raw data
- Now we have to make some meaning from it

Slide 29: Feature extraction...
Stage 2 - data reduction:
- Take a sea of numbers
- Reduce it to a small number of meaningful measures
- Prove they're meaningful
An interesting way to look at it: separating the "signal" (e.g. emotion) from the "noise" (anything else).

Slide 30: An example of "noise"
Here pitch and intensity have totally unemotional (but important) roles [Deller et al.]

Slide 31: Examples of high-level features
Pitch fitting:
- Stylization
- MoMel
- Parametric modeling
- Statistics

Slide 32: An example
[Figure: the raw pitch contour in PRAAT, with errors marked]

Slide 33: Patching it up a bit
[Figure: the patched pitch contour]

Slide 34: One way to extract the essential information: pitch stylization (the IPO method)
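The actual IPO procedure involves perceptual verification by listeners, which code alone cannot reproduce. The sketch below only captures the core idea: approximate the contour with as few straight-line segments as possible while staying within a tolerance, here via a recursive split (Douglas-Peucker style) in semitones. All names and the tolerance value are our assumptions:

```python
import numpy as np

def stylize(times, pitch, tol_semitones=1.5):
    """Piecewise-linear approximation of a voiced pitch contour (pitch > 0)."""
    st = 12.0 * np.log2(pitch)           # work in semitones, like the ear

    def split(i, j):
        # error of the straight line from point i to point j
        line = np.interp(times[i:j + 1], [times[i], times[j]], [st[i], st[j]])
        err = np.abs(st[i:j + 1] - line)
        k = int(np.argmax(err))
        if err[k] <= tol_semitones or j - i < 2:
            return [i, j]                # good enough: keep one segment
        left = split(i, i + k)           # otherwise split at the worst point
        right = split(i + k, j)
        return left[:-1] + right

    idx = np.array(split(0, len(st) - 1))
    return times[idx], pitch[idx]        # the stylized breakpoints
```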

Slide 35: Another way to extract the essential information: MoMel

Slide 36: Yet another way to extract the essential information: MoMel

Slide 37: Some observations
- Different parameterizations give different curves and different features
- Yet perceptually they are all very similar

Slide 38: Questions
- We can ask: what is the minimal or most representative information needed to capture the pitch contour?
- More importantly, though: what aspects of the pitch contour are most relevant to emotion?

Slide 39: Several answers appear in the literature
- Statistical features taken from the raw contour: mean, variance, max, min, range, etc.
- Features taken from parameterized contours: slopes, "main" peaks and dips, etc.
(The first family is sketched below.)
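The first family of answers is trivially computable. A sketch over a raw contour in which unvoiced frames are NaN (our convention for this illustration):

```python
import numpy as np

def pitch_statistics(pitch_contour):
    """Global statistics over the voiced frames of a raw pitch contour."""
    p = np.asarray(pitch_contour, dtype=float)
    p = p[~np.isnan(p)]                  # keep voiced frames only
    return {
        "mean": p.mean(),
        "variance": p.var(),
        "min": p.min(),
        "max": p.max(),
        "range": p.max() - p.min(),
    }
```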

Slide 40: There's not much time to go into
- Intensity contours
- Spectra
- Duration
But the problems are very similar.

Slide 41: The importance of time frames
- We have several measures that vary over time: over what time frame should we consider them?
- The meaning we attribute to speech parameters depends on the time frame over which they're considered:
  - Fixed-length windows
  - Phones
  - Words
  - "Intonation units"
  - "Tunes"

Slide 42: Which time frame is best?
- Fixed time frames of several seconds: simple to implement, but naive and very arbitrary
- Words: need a recognizer to be marked; probably the shortest meaningful frame
- "Intonation units": nobody knows exactly what they are (one "idea" per unit?); hard to measure; correlate best with coherent stretches of speech
- "Tunes", from one pause to the next: feasible to implement; correlate to some extent with coherent stretches of speech (a pause-based segmenter is sketched below)
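"Tunes" are attractive partly because pause detection needs only simple tools. Below is a sketch that cuts a frame-wise intensity contour at runs of low-intensity frames; the silence threshold and minimum pause length are pure assumptions:

```python
import numpy as np

def split_into_tunes(intensity_db, hop_ms=10, silence_db=-40.0, min_pause_ms=200):
    """Return (start_frame, end_frame) pairs for stretches between long pauses."""
    loud = np.asarray(intensity_db) >= silence_db
    min_gap = int(min_pause_ms / hop_ms)
    tunes, start, last_loud = [], None, None
    for i, is_loud in enumerate(loud):
        if is_loud:
            if start is None:
                start = i                      # the first tune begins
            elif i - last_loud > min_gap:      # the gap was a real pause
                tunes.append((start, last_loud + 1))
                start = i                      # a new tune begins
            last_loud = i
    if start is not None:
        tunes.append((start, last_loud + 1))   # close the final tune
    return tunes
```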

Slide 43: Why is this such an important decision?
It might help us interpret our data correctly!

Slide 44: Therefore... the problem of feature extraction
- It is NOT a general one: we want features that are specifically relevant to emotional content...
- But before we get to that, we have:

Slide 45: The Data Mining part
Stage 3: to extract knowledge, i.e. previously unknown information (rules, constraints, regularities, patterns, etc.), from the features database

Slide 46: What are we mining?
We look for patterns that either describe the stored data or infer from it (predictions):
- Summarization and characterization (of the class of data that interests us)
- Discrimination and comparison of features of different classes

Slide 47: Types of analysis
- Association analysis: rules of the form X => Y (DB tuples that satisfy X are likely to satisfy Y), where X and Y are pairs of an attribute and a value or set of values
- Classification and class prediction: find a set of functions to describe and distinguish data classes/concepts, which can then be used to predict the class of unlabeled data
- Cluster analysis (unsupervised clustering): analyze the data when there are no class labels, to deal with new types of data and help group similar events together

Slide 48: Association rules
We search for interesting relationships among items in the data. Interestingness measures:
- Support = (# tuples that contain both A and B) / (# tuples); support measures usefulness
- Confidence = (# tuples that contain both A and B) / (# tuples that contain A); confidence measures certainty
(Both are computed directly in the sketch below.)
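The two measures translate directly into code. Here each transaction is a set of attribute-value items, a toy representation chosen just for the sketch:

```python
def support(transactions, a, b):
    """Fraction of all transactions containing both items a and b."""
    both = sum(1 for t in transactions if a in t and b in t)
    return both / len(transactions)

def confidence(transactions, a, b):
    """Fraction of transactions containing a that also contain b."""
    has_a = [t for t in transactions if a in t]
    both = sum(1 for t in has_a if b in t)
    return both / len(has_a) if has_a else 0.0
```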

Slide 49: Classification
A two-step process:
1. Use data tuples with known labels to construct a model
2. Use the learned model to classify (assign labels to) new data
Since the class label of each training sample is known, this is supervised learning. The data is divided into two groups, training data and test data; the test data is used to estimate the predictive accuracy of the learned model. (A minimal example follows.)
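In scikit-learn terms, the two steps plus the train/test division look roughly like this. The data here is synthetic; in our setting X would hold speech features and y the emotion labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))             # 200 samples, 8 features (synthetic)
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # stand-in class labels

# Divide into training and test data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # step 1: learn
y_pred = model.predict(X_test)                           # step 2: classify
print(accuracy_score(y_test, y_pred))                    # estimate accuracy
```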

Slide 50: Assets
- No need to know the rules in advance; some rules are not easily formulated as mathematical or logical expressions
- Similar to one of the ways humans learn
- Could be more robust to noise and incomplete data
- May require a lot of samples
- Learning depends on the existing data only!

Slide 51: Algorithms and dangers
Algorithms:
- Machine learning (statistical learning)
- Expert systems
- Computational neuroscience
Dangers:
- The model might not be able to learn
- There might not be enough data
- Over-fitting the model to the training data

Slide 52: Prediction
- Classification predicts categorical labels; prediction models a continuous-valued function
- It is usually used to predict the value, or a range of values, of an attribute of a given sample
- Techniques: regression, neural networks

Slide 53: Clustering
- Constructing models for assigning class labels to unlabeled data: unsupervised learning
- Clustering is an ill-defined task
- Once clusters are discovered, the clustering model can be used for predicting labels of new data
- Alternatively, the clusters can be used as labels to train a supervised classification algorithm (sketched below)
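The last point, reusing discovered clusters as training labels, is a few lines in scikit-learn. The data is synthetic again, and the cluster count is an assumption:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))                 # unlabeled feature vectors

# Discover clusters without labels, then treat the assignments as labels
pseudo_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
classifier = DecisionTreeClassifier().fit(X, pseudo_labels)

# The supervised model can now label new, unseen data
new_labels = classifier.predict(rng.normal(size=(10, 5)))
```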

Slide 54: So how does this technical mumbo jumbo tie into -

Slide 55: Part 3 - Speech and emotion

Slide 56: Speech and emotion
Emotion can affect speech in many ways: consciously, unconsciously, and through the autonomic nervous system. Examples:
- Textual content is usually consciously chosen, except perhaps sudden interjections, which may stem from sudden or strong emotions
- Many speech patterns related to emotion are strongly ingrained; although they can be controlled by the speaker, most often they are not, unless the speaker consciously tries to modify them
- Certain speech characteristics are affected by the degree of arousal and are therefore nearly impossible to inhibit (e.g. vocal tremor due to grief)

Slide 57: Speech analysis: the big picture, again
Speech analysis is just one component in the context of speech and emotion:
[Diagram: theories of emotion, databases, and real data feed speech analysis, which feeds the application]

Slide 58: Is this just another way to spread the blame?
- We speech analysis guys are just poor little engineers
- The methods we can supply can be no better than the theory and the data that drive them
- ... and unfortunately, the jury is still out on both of those points... or not? Ask the WP3 and WP5 people; they're here somewhere
- Actually, one of the difficulties HUMAINE is intended to ease is that researchers in the field often find themselves having to address all of the above! (guilty)

Slide 59: The most fundamental problem
What are the features that signify emotion? To paraphrase: what signals are signs of emotion?

Slide 60: The most common solutions
- Calculate as many as you can think of
- Intuition
- Theory-based answers
- Data-driven answers
Ha! Once more, it's not our fault!

Slide 61: What seems to be the most plausible approach
The data-driven approach, requiring:
- Emotional speech databases ("corpora")
- Perceptual evaluation of these databases
- Correlation of the evaluations with speech features
Which takes us back to a previous square...

Slide 62: So tell us already - how does emotion influence speech?
- ... It seems that the answer depends on how you look for it
- As hinted before, the answer cannot really be separated from:
  - The theories of emotion
  - The databases we have of emotional speech: who the subjects are, and how emotion was elicited

Slide 63: A short digression
Will all the speech clinicians in the audience please stand up? Hmm... we don't seem to have so many. Let's look at what one of them has to say.

Slide 64: Emotions in the speech clinic
Some speakers have speech/voice problems that modify their "signal", thus misleading the listener.
VOICE:
- People with vocal instability (high jitter/shimmer/tremor) are clinically perceived as nervous, although the problems reflect irregularity in the vocal folds
- Breathy voice (in women) is sometimes perceived as "sexy", while it actually reflects incomplete adduction of the vocal folds
- A higher excitation level leads to vocal instability (high jitter/shimmer/tremor)

Slide 65: Clinical examples
- STUTTERING: listeners judge people who stutter as nervous, tense, and less confident (identification of stuttering depends on pause duration within the "repetition units" and on the rate of repetitions)
- CLUTTERING: listeners judge people who clutter as nervous and less intelligent

Slide 66: So, though this is a WP4 meeting...
- It's impossible to avoid talking about WP3 (theory of emotion) and WP5 (databases) issues
- The signs we're looking for can never be separated from the questions: signs of what (emotions)? Signs in what (data)?
- May God and Phillipe Gelin forgive me...

Slide 67: A not-so-old example (Murray and Arnott, 1993)
- Very qualitative
- Presupposes dealing with primary emotions

Slide 68: BUT...
- If you expect more recent results to give more detailed descriptive outlines, then you're wrong
- The data-driven approaches use a large number of features and let the computer sort them out:
  - 32 significant features found by ASSESS, from the initial 375 used
  - 5 emotions, acted
  - 55% recognition

Slide 69: Some remarks
- Some features are indicative even though we probably don't use them perceptually
  - e.g. pitch mean: usually this is raised with higher activation, yet we don't have to know the speaker's neutral mean to perceive heightened activation
  - My guess: voice quality is what we perceive in such cases
- How "simple" can the characterization of emotions become? How many features do we listen for? Can this be verified?

Slide 70: Time intervals
- This issue becomes more and more important as we move towards "natural" data
- Emotion production: how long do emotions last?
  - Full-blown emotions are usually short (but not always! Look at Peguy in the LIMSI interview database)
  - Moods, or pervasive emotions, are subtle but long-lasting
- Emotion analysis: over what span of speech are emotions easiest to detect?

Slide 71: From the analysis viewpoint
- Current efforts seem to focus on methods that use time spans with some inherent meaning:
  - Acoustically (ASSESS - Cowie et al.)
  - Linguistically (Batliner et al.)
- We mentioned that prosody carries emotional information (our "signal") as well as other information (our "noise"): phrasing, various types of prominence
- BUT...

Slide 72: Why I like intonation units
- Spontaneous speech is organized differently from written language: "sentences" and "paragraphs" don't really exist there
- Phrasing is a loose phrase for... "intonation units"
  - Theoretical linguists love to discuss what they are
  - An exact definition is as hard to find as it is to parse spontaneous speech
- Prosodic markers help replace various written markers
- Maybe emotion is not an "orthogonal" bit of information on top of these (the signal+noise model). If emotion modifies these, it would be very useful if we could identify the prosodic markers we use and the ways we modify them when we're emotional
- Problem: engineers don't like ill-defined concepts! But emotion is one of them too, isn't it?

Slide 73: Just to provoke some thought
From a paper on animation (think of it: these guys have to integrate speech and image to make them fit naturally):
"... speech consists of a sequence of intonation phrases. Each intonation phrase is realized with fluid, continuous articulation and a single point of maximum emphasis. Boundaries between successive phrases are associated with perceived disjuncture and are marked in English with cues such as pitch movements... Gestures are performed in units that coincide with these intonation phrases, and points of prominence in gestures also coincide with the emphasis in the concurrent speech..." [Stone et al., SIGGRAPH 2004]

Slide 74: We haven't even discussed WP3 issues
What are the scales/categories?
- Possibility 1: emotional labeling
- Possibility 2: psychological scales (such as valence/activation, e.g. Feeltrace)
QUESTION: which is more directly related to speech features? Hopefully we'll hammer out a tentative answer by Tuesday.

Slide 75: Part 4 - Current results

Slide 76: Evaluating results
- Results often demonstrate how elusive the solution is...
- Consider a similar problem, speech recognition. To evaluate results: make recordings, submit them to an algorithm, measure the recognition rate!
- Emotion recognition results are far more difficult to quantify: they are heavily dependent on induction techniques and labeling methods

Slide 77: Several popular contexts
- Acted prototypical emotions
- Call center data: real, or WoZ (Wizard of Oz) type
- Media (radio, TV) based data
- Narrative speech (event recollection)
- Synthesized speech (Montero, Gobl)
Most of these methods can be placed on the spectrum between:
- Acted, full-blown bursts of stereotypical emotions
- Fully natural mixtures of mood, affect, and bursts of difficult-to-label emotions, recorded in noisy environments

Slide 78: Call centers
- A real-life scenario (with commercial interests...)!
- Sparse emotional content: usually controlled, usually negative
- Lends itself easily to WoZ scenarios

Slide 79: Ang et al., 2002
- Standardized call-center data from 3 different sources
- Uninvolved users, true HMI interaction
- Detects neutral/annoyance/frustration
- Mostly automatic extraction, with some additional human labeling
- Defines human "accuracy" as 75%, but this is actually the percentage of human consensus; machine accuracy is comparable
- A possible measure: maybe "accuracy" is where users wanted human intervention

Slide 80: Batliner et al.
- Professional acting, amateur acting, and a WoZ scenario, the latter with uninvolved users and true HMI interaction
- Detects trouble in communication (much thought was given to this definition!)
- Combines prosodic features with others: POS labels, syntactic boundaries
- Overall, shows a typical result: the closer we get to "real" scenarios, the more difficult the problem becomes!
  - Up to 95% on acted speech
  - Up to 79% on read speech
  - Up to 73% on WoZ data

Slide 81: Devillers et al.
- Real call center data; also contains fear (of losing money!)
- Human-human interaction, involved users
- Human accuracy of 75% is reported; is this, as in Ang, the degree of human agreement?
- Uses a small number of intonation features; treats pauses and filled pauses separately
- Some results: different behavior between clients and agents, males and females
- Was classification attempted as well?

Slide 82: Games and simulators
- These provide an extremely interesting setting
- Participants can often be found to experience real emotions
- The experimenter can sometimes control these to a certain extent, e.g. via driving conditions or additional tasks in a driving simulator

Slide 83: Fernandez & Picard (2000)
- Subjects did math problems while driving a simulator; this was supposed to induce stress
- Spectral features were used; no prosody at all!
- Advanced classifiers were applied
- Results were inconsistent across users, raising a familiar question: is it the classifier, or is it the data?

Slide 84: Kehrein (2002)
- 2 subjects in 2 separate rooms: one had instructions, the other a set of Lego building blocks; the first had to explain to the other what to construct
- A wide range of "natural" emotions was reported
- His thesis is in German :-)
- No classification was attempted

Slide 85: Acted speech
- Widely used
- An ever-recurring question: does it reflect the way emotions are expressed in spontaneous speech?

Slide 86: McGilloway et al.
- ASSESS used for feature extraction
- Speech read by non-professionals, from emotion-evoking texts
- Categories: sadness, happiness, fear, anger, neutral
- Up to 55% recognition

Slide 87: Recalled emotions
- Subjects are asked to recall emotional episodes and describe them; the data consists of long narratives
- It isn't clear whether subjects actually re-experience these emotions or just recount them as "observers"
- Can contain good instances of low-key pervasive emotions

Slide 88: Ron and Amir
Ongoing work

Slide 89: Part 5 - Open issues

Slide 90: Robust raw feature extraction
- Pitch and VAD (voice activity detection)
- Intensity (normalization)
- Vocal quality
- Duration: is this still an open problem?

Slide 91: Determination of time intervals
This might have to be addressed on a theoretical vs. practical level:
- Phones?
- Words?
- Tunes?
- Intonation units?
- Fixed-length intervals?

Slide 92: Feature extraction
- Which features are most relevant to emotion?
- How do we separate the noise (speaker mannerisms, culture, language, etc.) from the signals of emotion?

Slide 93: Part 6 - HUMAINE deliverables

Slide 94: Tangible results we are expected to deliver
- Tools
- Exemplars

Slide 95: Tools
Something along the lines of: solutions to parts of the problem that people can actually download and use right away

Slide 96: Exemplars
These should cover a wide scope:
- Concepts
- Methodologies
- Knowledge pools: tutorials, reviews, etc.
- Complete solutions to "reduced" problems
- Test-bed systems
- Designs for future systems/applications

Slide 97: Tools - suggestions
- Useful feature extractors:
  - Robust pitch detection and smoothing methods
  - Public-domain segment/speech recognizers
- Synthesis engines or parts thereof, e.g. emotional prosody generators
- Classifying engines

Slide 98: Exemplars - suggestions
- Knowledge bases:
  - A taxonomy of speech features
  - Papers (especially short ones) say what we used; what about why? And what we didn't use? What about what we wished we had?
- Test-bed systems:
  - A working modular SAL (credit to Marc Schroeder)
  - Embodies analysis, classification, synthesis, and emotion induction/data collection... like a breeder nuclear reactor!
  - Parts of it already exist; human parts can be replaced by automated ones as they develop

Slide 99: Exemplars - suggestions (cont.)
- More focused systems, e.g. call center systems:
  - Deal with sparse emotional content; emotions vary over a relatively small range
  - Standardized (provocative?) data
  - Exemplify difficulties on different levels: feature extraction, emotion classification
  - Maybe in conjunction with WP5
- Integration: demonstrations of how different modalities can complement/enhance each other

Slide 100: How do we get useful info from WP3 and WP5?
- Categories
- Scales
- Models (pervasive, burst, etc.)

Slide 101: What is it realistic to expect?
Useful info from the other workgroups:
- WP3: models of emotional behavior in different contexts; definite scales and categories for measuring it
- WP5: databases embodying the above, with data spanning the scale from clearly identifiable... to... difficult to identify

Slide 102: What is it realistic to expect? (cont.)
Exemplars that show:
- Some of the problems that are easier to solve
- The many problems that are difficult to solve
- Directions for useful further research
- How not to repeat previous errors

Slide 103: Some personal thoughts
- Oversimplification is a common pitfall to be avoided
- Looking at real data, one finds that emotion is often:
  - Difficult to describe in simple terms
  - Jumping between modalities (text might be considered a separate modality)
  - Extremely dependent on context, character, settings, and personality
- A task so complex for humans cannot be easy for machines!

Slide 104: Summary
- Speech is a major channel for signaling emotional information (and lots of other information too)
- HUMAINE will not solve all the issues involved; we should focus on those that can benefit most from the expertise and collaboration of its members
- Examining multiple modalities can prove extremely interesting

