
1 HUMAINE Workshop on Signals and signs (WP4), Santorini, September 2004
Emotion and Speech
Techniques, models and results
Facts, fiction and opinions
Past, present and future
Acted, spontaneous, recollected
In Asia, Europe and America, and the Middle East
HUMAINE Workshop on Signals and signs (WP4), Santorini, September 2004

2 Overview A short introduction to speech science
… and speech analysis tools
Speech and emotion: models, problems ... and results
A review of open issues
Deliverables within the HUMAINE framework

3 Speech science in a nutshell
Part 1: Speech science in a nutshell

4 A short introduction to SPEECH:
Most of those present here are familiar with various aspects of signal processing. For the benefit of those who aren’t acquainted with the speech signal in particular, we’ll start with an overview of speech production models and analysis techniques. The rest of you can sleep for a few minutes.

5 The speech signal A 1-D signal
Does that make it a simple one? NO… There are many analysis techniques. As with many types of systems, parametric models are a very useful one here… A simple and very useful speech production model is the source/filter model (in case you’re worried, we’ll see that this is directly related to emotions also).

6 The source/filter model
Components:
The lungs (create air pressure)
Two elements that turn this into a “raw” signal: the vocal folds (periodic signals), and constrictions that make the airflow turbulent (noise)
The vocal tract: partly immobile (upper jaw, teeth), partly mobile (soft palate, tongue, lips, lower jaw – also called “articulators”)
Its influence on the raw signal can be modeled very well with a low-order (~10) digital filter
[Diagram: source → filter]
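To make the source/filter picture concrete, here is a minimal Python sketch (assuming numpy and scipy are available): an impulse train or white noise stands in for the source, and a low-order all-pole filter built from two made-up resonances stands in for the vocal tract. The formant frequencies, bandwidth and pitch value are illustrative placeholders, not measured data.

```python
import numpy as np
from scipy.signal import lfilter

fs, f0, dur = 16000, 120, 0.5          # sampling rate, "vocal fold" rate (Hz), seconds

# Source 1: an impulse train approximating periodic glottal pulses
n = int(fs * dur)
voiced_source = np.zeros(n)
voiced_source[::fs // f0] = 1.0

# Source 2: turbulent airflow approximated by white noise
noise_source = np.random.randn(n)

# Filter: a stable low-order all-pole filter standing in for the vocal tract.
# Two illustrative resonances ("formants"); real analysis estimates ~10
# coefficients per frame, e.g. by linear prediction.
formants, bw = [500, 1500], 100        # Hz, made-up values
poles = []
for f in formants:
    r = np.exp(-np.pi * bw / fs)
    poles += [r * np.exp(2j * np.pi * f / fs), r * np.exp(-2j * np.pi * f / fs)]
a = np.real(np.poly(poles))            # denominator coefficients of the filter

vowel_like = lfilter([1.0], a, voiced_source)      # periodic, vowel-like output
fricative_like = lfilter([1.0], a, noise_source)   # noisy, fricative-like output
```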

7 The net result: A complex signal that changes its properties constantly:
Sometimes periodic, sometimes colored noise
Approximately stationary over time windows of ~20 milliseconds
And of course – contains a great deal of information:
Text – linguistic information
Other stuff – paralinguistic information: speaker identity, gender, socioeconomic background, stress, accent, emotional state, etc. …

8 How is this information coded?
Textual information - mainly in the filter and the way it changes its properties over time Filter “snapshots” are called segments Paralinguistic information – mainly in the source parameters Lung pressure – determines the intensity Vocal fold periodicity – determines instantaneous frequency or “pitch” Configuration of the glottis determines overall spectral tilt – “voice quality”

9 Prosody: Prosody is another name for part of the paralinguistic information, composed of: Intonation – the way in which pitch changes over time Intensity – changes in intensity over time Problem: some segments are inherently weaker than others Rhythm – segment durations vs. time Prosody does not include voice quality, but voice quality is also part of the paralinguistic information

10 To summarize: Speech science is at a mature stage
The source/filter model is very useful in understanding speech production Many applications (speech recognition, speaker verification, emotion recognition, etc.) require extraction of the model parameters from the speech signal (an inverse problem) This is the domain of: speech analysis techniques

11 Speech analysis and classification
Part 2: Speech analysis and classification

12 The large picture: speech analysis in the HUMAINE framework
Speech analysis is just one component in the context of speech and emotion.
Its overall objectives:
Calculate raw speech parameters
Extract features salient to emotional content
Discard irrelevant features
Use them to characterize and maybe classify emotional speech
[Diagram blocks: real data, theory of emotion, speech analysis engine, training data, high-level application]

13 Signals to Signs - The process
[Diagram of the signals-to-signs pipeline, bottom to top: Databases / Files → Data Cleaning and Integration → Data Warehouse → Selection and Transformation / Data Representation → Data Mining → Patterns → Evaluation and Presentation → Knowledge]

14 S2S (SOS…?) - The tools: a combination of techniques that belong to different disciplines:
Data warehouse technologies (data storage, information retrieval, query answering, etc.)
Data preprocessing and handling
Data modeling / visualization
Machine learning (statistical data analysis, pattern recognition, information retrieval, etc.)

15 The objective of speech analysis techniques
To extract the raw model parameters from the speech signal
Interfering factors: reality never exactly fits the model; background noise; speaker overlap
To extract features
To interpret them in meaningful ways (pattern recognition) – really hard!

16 It remains that - Useful models and techniques exist for extracting the various information types from the speech signal Yet … Many applications such as speech recognition, speaker identification, speech synthesis, etc., are far from being perfected … So what about emotion?

17 For the moment – let’s focus on the small picture
The consensus is that emotions are coded in Prosody Voice quality And sometimes in the textual information Let’s discuss the purely technical aspects of evaluating all of these …

18 Extracting features from the speech signal
Stage 1 – Extracting raw features: Pitch Intensity Voice quality Pauses Segmental information – phones and their duration Text (by the way …who extracts them – man, machine or both? )

19 Pitch Pitch: The instantaneous frequency
Sounds deceptively simple to find – but it isn’t! Lots of research has been devoted to pitch detection Composed of two sub-problems: For a given signal – is there periodicity at all? If so – what’s the fundamental frequency? Complicating factors: Speaker related factors – hoarseness, diplophony, etc. Background related factors – noise, overlapping speakers, filters (as in telephony) In the context of emotions: Small errors are acceptable Large errors (octave jumps, false positives) are catastrophic
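As a concrete illustration of the two sub-problems, here is a minimal autocorrelation-based pitch estimator for a single frame, in Python with numpy. The voicing threshold and pitch range are illustrative defaults; real detectors (PRAAT’s included) add many refinements precisely to avoid the octave jumps and false positives mentioned above.

```python
import numpy as np

def estimate_pitch(frame, fs, fmin=75.0, fmax=500.0, voicing_threshold=0.3):
    """Return (is_voiced, f0_hz) for one windowed frame (>= two periods long)."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return False, 0.0
    ac = ac / ac[0]                          # normalize so lag 0 == 1
    lag_min = int(fs / fmax)                 # shortest plausible period
    lag_max = min(int(fs / fmin), len(ac) - 1)
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    peak = ac[lag]
    # Sub-problem 1: is there periodicity at all?
    if peak < voicing_threshold:
        return False, 0.0
    # Sub-problem 2: if so, what is the fundamental frequency?
    return True, fs / lag
```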

20 An example: the raw pitch contour in PRAAT [figure; errors marked]

21 Intensity Appears to be even simpler than pitch!
Intensity is quite easy to measure … yet it is the measure most influenced by unrelated factors! Aside from the speaker, intensity is strongly affected by:
Distance from the microphone
Gain settings in the recording equipment
Clipping
AGC
Background noise
Recording environment
Without normalization – intensity is almost useless!
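A minimal sketch of what such normalization can look like, assuming numpy: frame-wise RMS intensity in dB, then a z-score within each recording so that constant offsets from gain or microphone distance drop out. The frame length is an illustrative choice.

```python
import numpy as np

def intensity_db(signal, fs, frame_ms=20.0):
    """Frame-wise RMS energy in dB (relative, not calibrated SPL)."""
    hop = int(fs * frame_ms / 1000.0)
    frames = [signal[i:i + hop] for i in range(0, len(signal) - hop, hop)]
    rms = np.array([np.sqrt(np.mean(f ** 2)) + 1e-12 for f in frames])
    return 20.0 * np.log10(rms)

def normalize_per_recording(db_contour):
    """Z-score within one recording: removes gain/distance offsets,
    keeps the relative ups and downs that may carry emotional cues."""
    return (db_contour - np.mean(db_contour)) / (np.std(db_contour) + 1e-12)
```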

22 Voice quality Several measures are used to quantify it:
Local irregularity in pitch and intensity
Ratio between harmonic components and noise components
Distribution of energy in the spectrum
Voice quality is affected by a multitude of factors other than emotions
Some standardized measures are often used in clinical applications
A large factor in emotional speech!
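For illustration, here are two of the commonly used irregularity measures (local jitter and shimmer) as they are often defined, sketched in Python. They assume that cycle lengths and cycle peak amplitudes have already been extracted reliably, which is itself a hard problem; this is not a clinical-grade implementation.

```python
import numpy as np

def jitter_local(periods_s):
    """Mean absolute difference of consecutive periods / mean period (in %)."""
    periods_s = np.asarray(periods_s, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(periods_s))) / np.mean(periods_s)

def shimmer_local(amplitudes):
    """Mean absolute difference of consecutive cycle amplitudes / mean amplitude (in %)."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return 100.0 * np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
```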

23 Segments There are different ways of defining precisely what these are
Automatic segmentation is difficult, though not as difficult as speech recognition Even the segment boundaries can give important timing information, related to rhythm – an important component of prosody

24 Text Is this “raw” data or not? Is it data … at all?
Some studies on emotion specifically eliminated this factor (filtered speech, uniform texts); other studies are interested mainly in text. If we want to deal with text, we must keep in mind: automated speech recognition is HARD!
Especially with strong background noise
Especially when strong emotions are present, modifying the speakers’ normal voices and mannerisms
Especially when dealing with multiple speakers

25 Some complicating factors in raw feature extraction:
Background noise Speaker overlap Speaker variability Variability in recording equipment

26 In the general context of speech analysis -
The raw features we discussed are not specific only to the study of emotion Yet – issues related to calculating them reliably crop up again and again in emotion related studies Some standard and reliable tools would be very helpful

27 Two opposing approaches to computing raw features:
The “ideal” approach: assume we have perfect algorithms for extracting all this information; if we don’t – help out manually. This can be carried out only over small databases; useful in purely theoretical studies.
The “real life” (error-prone) approach: acknowledge we only have imperfect algorithms, and find how to deal automatically with imperfect data. Very important for large databases.

28 Next - what do we do with it all?
Reminder: we have large amounts of raw data. Now we have to extract some meaning from it.

29 Feature extraction … Stage 2 – data reduction:
Take a sea of numbers
Reduce it to a small number of meaningful measures
Prove they’re meaningful
An interesting way to look at it: separating the “signal” (e.g. emotion) from the “noise” (anything else)

30 An example of “Noise”: Here pitch and intensity have totally unemotional (but important) roles: [Deller et al]

31 Examples of high level features
Pitch fitting – stylization
MoMel
Parametric modeling
Statistics

32 An example: the raw pitch contour in PRAAT [figure; errors marked]

33 Patching it up a bit:

34 One way to extract the essential information:
Pitch stylization – IPO method
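To give a flavour of stylization, here is a generic piecewise-linear simplification sketch in Python: keep only the anchor points needed so that straight-line interpolation of the pitch contour stays within a tolerance. This is not the actual IPO procedure, and the tolerance value is an arbitrary placeholder.

```python
import numpy as np

def stylize(times, pitch, tol_hz=5.0):
    """Return sorted indices of anchor points for a piecewise-linear approximation."""
    def recurse(i, j):
        if j <= i + 1:
            return []
        # deviation of interior points from the straight chord between i and j
        chord = np.interp(times[i + 1:j], [times[i], times[j]], [pitch[i], pitch[j]])
        dev = np.abs(pitch[i + 1:j] - chord)
        k = np.argmax(dev)
        if dev[k] < tol_hz:
            return []                       # chord is good enough, keep no interior point
        split = i + 1 + k
        return recurse(i, split) + [split] + recurse(split, j)

    anchors = [0] + recurse(0, len(pitch) - 1) + [len(pitch) - 1]
    return sorted(anchors)
```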

35 Another way to extract the essential information:
MoMel

36 Yet another way to extract the essential information:
MoMel

37 Some observations: Different parameterizations give different curves and different features
Yet: perceptually – they are all very similar

38 Questions: We can ask: what is the minimal or most representative information needed to capture the pitch contour? More importantly, though: what aspects of the pitch contour are most relevant to emotion?

39 Several answers appear in the literature:
Statistical features taken from the raw contour: Mean, variance, max, min, range etc. Features taken from parameterized contours: Slopes, “main” peaks and dips etc.
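As an illustration of the first family of answers, a short Python sketch that computes those statistics over the voiced frames of one unit of speech; the added overall slope is just an example of a simple parameterized feature, not something prescribed by the literature cited here.

```python
import numpy as np

def pitch_statistics(f0, voiced):
    """f0: frame-wise pitch in Hz; voiced: boolean mask of voiced frames."""
    v = np.asarray(f0)[np.asarray(voiced, dtype=bool)]
    if len(v) == 0:
        return None                          # nothing voiced in this unit
    return {
        "mean": float(np.mean(v)),
        "var": float(np.var(v)),
        "max": float(np.max(v)),
        "min": float(np.min(v)),
        "range": float(np.max(v) - np.min(v)),
        # a crude overall slope in Hz per frame (illustrative extra feature)
        "slope": float(np.polyfit(np.arange(len(v)), v, 1)[0]),
    }
```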

40 There’s not much time to go into:
Intensity contours Spectra Duration But the problems are very similar

41 The importance of time frames
We have several measures that vary over time Over what time frame should we consider them? The meaning we attribute to speech parameters is dependent on the time frame over which they’re considered: Fixed length windows Phones Words “Intonation units” “Tunes”

42 Which time frame is best?
Fixed time frames of several seconds – simple to implement, but naïve; very arbitrary
Words – need a recognizer to be marked; probably the shortest meaningful frame
“Intonation units” – nobody knows exactly what they are (one “idea” per unit?); hard to measure; correlate best with coherent stretches of speech
“Tunes” – from one pause to the next; feasible to implement; correlate to some extent with coherent stretches of speech (see the sketch below)
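Since “tunes” are described above as feasible to implement, here is a minimal pause-based segmentation sketch in Python, operating on a frame-wise intensity contour in dB (for instance the one from the intensity sketch earlier). The silence threshold and minimum pause length are illustrative values that would need tuning per recording.

```python
import numpy as np

def tunes_from_pauses(db_contour, frame_ms=20.0, silence_db=-40.0, min_pause_ms=250.0):
    """Return (start_frame, end_frame) pairs for pause-delimited stretches of speech."""
    silent = np.asarray(db_contour) < silence_db
    min_pause = int(min_pause_ms / frame_ms)
    segments, start, run = [], None, 0
    for i, is_silent in enumerate(silent):
        if not is_silent:
            if start is None:
                start = i                    # open a new tune at the first non-silent frame
            run = 0
        else:
            run += 1
            if start is not None and run >= min_pause:
                segments.append((start, i - run + 1))   # close the tune at the pause onset
                start = None
    if start is not None:
        segments.append((start, len(silent)))
    return segments
```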

43 Why is this such an important decision?
It might help us interpret our data correctly!

44 Therefore … the problem of feature extraction:
Is NOT a general one We want features that are specifically relevant to emotional content … But before we get to that - we have:

45 The Data Mining part Stage 3: To extract knowledge = previously unknown information (rules, constraints, regularities, patterns, etc.) from the features database

46 What are we mining? We look for patterns that either describe the stored data or infer from it (predictions) Summarization and characterization (of the class of data that interests us) Discrimination and comparison of features of different classes

47 Types of Analysis Association analysis of rules of the form X => Y (DB tuples that satisfy X are likely to satisfy Y) where X and Y are pairs of attribute and value/set of values Classification and class prediction – find a set of functions to describe and distinguish data classes/concepts that can be used predict the class of unlabeled data. Cluster analysis (unsupervised clustering) – analyze the data when there are no class labels to deal with new types of data and help group similar events together

48 Association Rules We search for interesting relationships among items in the data. Interestingness measures:
Support = (# tuples that contain both A and B) / (# tuples)
Confidence = (# tuples that contain both A and B) / (# tuples that contain A)
Support measures usefulness; confidence measures certainty.
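A tiny Python illustration of the two measures, over transactions represented as sets of items; the item names are hypothetical and only meant to suggest how a speech feature and an emotion label could be paired.

```python
def support(transactions, a, b):
    # fraction of all transactions containing both items
    both = sum(1 for t in transactions if a in t and b in t)
    return both / len(transactions)

def confidence(transactions, a, b):
    # fraction of transactions containing a that also contain b
    both = sum(1 for t in transactions if a in t and b in t)
    has_a = sum(1 for t in transactions if a in t)
    return both / has_a if has_a else 0.0

# Hypothetical example: does "high pitch range" co-occur with the label "angry"?
data = [{"high_pitch_range", "angry"}, {"high_pitch_range", "neutral"},
        {"low_pitch_range", "neutral"}, {"high_pitch_range", "angry"}]
print(support(data, "high_pitch_range", "angry"))     # 0.5
print(confidence(data, "high_pitch_range", "angry"))  # 0.666...
```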

49 Classification A two step process:
Use data tuples with known labels to construct a model
Use the learned model to classify (assign labels to) new data
Data is divided into two groups: training data and test data. Test data is used to estimate the predictive accuracy of the learned model.
Since the class label of each training sample is known, this is supervised learning.
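A minimal scikit-learn sketch of the two-step process, under the assumption that a feature matrix X and label vector y already exist (random placeholders are used here); the SVM is just one possible choice of learner.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X = np.random.rand(200, 12)                            # placeholder prosodic features
y = np.random.choice(["neutral", "angry"], size=200)   # placeholder labels

# Step 1: learn a model from labeled training tuples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = SVC(kernel="rbf").fit(X_train, y_train)

# Step 2: classify held-out data and estimate predictive accuracy
print(accuracy_score(y_test, model.predict(X_test)))
```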

50 Assets No need to know the rules in advance
Some rules are not easily formulated as mathematical or logical expressions
Similar to one of the ways humans learn
Could be more robust to noise and incomplete data
May require a lot of samples
Learning depends on existing data only!

51 Dangers: The model might not be able to learn
There might not be enough data
Over-fitting the model to the training data
Algorithms: machine learning (statistical learning), expert systems, computational neuroscience

52 Prediction Classification predicts categorical labels
Prediction models a continuous-valued function. It is usually used to predict the value, or a range of values, of an attribute of a given sample. Typical methods: regression, neural networks.
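For contrast with classification, a minimal regression sketch (again with placeholder data), predicting a continuous rating such as activation rather than a categorical label:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(200, 12)          # placeholder prosodic features
activation = np.random.rand(200)     # placeholder continuous ratings

reg = LinearRegression().fit(X, activation)
print(reg.predict(X[:5]))            # continuous values, not class labels
```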

53 Clustering Constructing models for assigning class labels to data that is unlabeled: unsupervised learning
Clustering is an ill-defined task
Once clusters are discovered, the clustering model can be used for predicting labels of new data
Alternatively, the clusters can be used as labels to train a supervised classification algorithm (see the sketch below)
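A minimal sketch of that last point, using scikit-learn with placeholder data: discover clusters without labels, then reuse the cluster indices as labels for a supervised classifier. The number of clusters is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

X = np.random.rand(300, 12)                                      # placeholder features
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Reuse the discovered cluster indices as labels to train a supervised model
classifier = SVC().fit(X, clusters)
new_sample = np.random.rand(1, 12)
print(classifier.predict(new_sample))                            # predicted cluster for new data
```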

54 So how does this technical Mumbo Jumbo tie into -

55 Part 3: Speech and emotion

56 Speech and emotion Emotion can affect speech in many ways:
Consciously
Unconsciously
Through the autonomic nervous system
Examples:
Textual content is usually consciously chosen, except maybe sudden interjections, which may stem from sudden or strong emotions
Many speech patterns related to emotions are strongly ingrained – therefore, though they can be controlled by the speaker, most often they are not, unless the speaker tries to modify them consciously
Certain speech characteristics are affected by the degree of arousal, and are therefore nearly impossible to inhibit (e.g. vocal tremor due to grief)

57 Speech analysis: the big picture - again
Speech analysis is just one component in the context of speech and emotion. [Diagram blocks: real data, databases, theories of emotion, speech analysis, application]

58 Is this just another way to spread the blame?
We speech analysis guys are just poor little engineers. The methods we can supply can be no better than the theory and the data that drive them … and unfortunately, the jury is still out on both of those points … or not? Ask the WP3 and WP5 people – they’re here somewhere  Actually, one of the difficulties HUMAINE is intended to ease is that researchers in the field often find themselves having to address all of the above! (guilty)

59 The most fundamental problem:
What are the features that signify emotion? To paraphrase – what signals are signs of emotion?

60 The most common solutions:
Calculate as many as you can think of Intuition Theory based answers Data-driven answers Ha! Once more – it’s not our fault!

61 What seems to be the most plausible approach -
The data driven approach Requiring: Emotional speech databases (“corpora”) Perceptual evaluation of these databases This is then correlated with speech features Which takes us back to a previous square

62 So tell us already – how does emotion influence speech?
… It seems that the answer depends on how you look for it As hinted before – the answer cannot really be separated from: The theories of emotion The databases we have of emotional speech - Who the subjects are How emotion was elicited

63 A short digression - Will all the speech clinicians in the audience please stand up? Hmm…. We don’t seem to have so many Let’s look at what one of them has to say

64 Emotions in the speech Clinic
Some speakers have speech/voice problems that modify their “signal”, thus misleading the listener.
VOICE – People with vocal instability (high jitter/shimmer/tremor) are clinically perceived as nervous (although the problems reflect irregularity in the vocal folds).
- Breathy voice (in women) is sometimes perceived as “sexy” (while it actually reflects incomplete adduction of the vocal folds).
- Higher excitation level leads to vocal instability (high jitter/shimmer/tremor).

65 Clinical Examples: STUTTERING – listeners judge people who stutter as nervous, tense, and less confident (identification of stuttering depends on pause duration within the “repetition units”, and on the rate of repetitions). CLUTTERING – listeners judge people who clutter as nervous and less intelligent

66 So- though this is a WP4 meeting …
It’s impossible to avoid talking about WP3 (theory of emotion) and WP5 (databases) issues The signs we’re looking for can never be separated from the questions: Signs of what (emotions)? Signs in what (data)? May God and Phillipe Gelin forgive me …

67 A not-so-old example: (Murray and Arnott, 1993)
Very qualitative Presupposes dealing with primary emotions

68 BUT … If you expect more recent results to give more detailed descriptive outlines, then you’re wrong. The data-driven approaches use a large number of features, and let the computer sort them out: 32 significant features found by ASSESS, out of the initial 375 used; 5 emotions, acted; 55% recognition.

69 Some remarks: Some features are indicative, even though we probably don’t use them perceptually e.g. pitch mean: usually this is raised with higher activation But we don’t have to know the speaker’s neutral mean to perceive heightened activation My guess: voice quality is what we perceive in such cases How “simple” can characterization of emotions become? How many features do we listen for? Can this be verified?

70 Time intervals This issue becomes more and more important as we go towards “natural” data Emotion production: How long do emotions last? Full blown emotions are usually short (but not always! Look at Peguy in the LIMSI interview database) Moods, or pervasive emotions are subtle but long lasting Emotion Analysis: Over what span of speech are they easiest to detect?

71 From the analysis viewpoint:
Current efforts seem to be focusing on methods that aim to use time spans that have some inherent meaning: acoustically (ASSESS – Cowie et al.), or linguistically (Batliner et al.). We mentioned that prosody carries emotional information (our “signal”) as well as other information (“noise”): phrasing, various types of prominence. BUT …

72 Why I like intonation units
Spontaneous speech is organized differently from written language – “sentences” and “paragraphs” don’t really exist there
Phrasing is a loose term for … “intonation units”
Theoretical linguists love to discuss what they are; an exact definition is as hard to find as it is to parse spontaneous speech
Prosodic markers help replace various written markers
Maybe emotion is not an “orthogonal” bit of information on top of these (the signal+noise model)
If emotion modifies these, it would be very useful if we could identify the prosodic markers we use and the ways we modify them when we’re emotional
Problem: engineers don’t like ill-defined concepts! But emotion is one of them too, isn’t it?

73 Just to provoke some thought:
From a paper on animation (think of it – these guys have to integrate speech and image to make them fit naturally): “… speech consists of a sequence of intonation phrases. Each intonation phrase is realized with fluid, continuous articulation and a single point of maximum emphasis. Boundaries between successive phrases are associated with perceived disjuncture and are marked in English with cues such as pitch movements … Gestures are performed in units that coincide with these intonation phrases, and points of prominence in gestures also coincide with the emphasis in the concurrent speech…” [Stone et al., SIGGRAPH 2004]

74 We haven’t even discussed WP3 issues -
What are the scales/categories? Possibility 1: emotional labeling Possibility 2: psychological scales (such as valence/activation – e.g. Feeltrace) QUESTION: Which is more directly related to speech features? Hopefully we’ll hammer out a tentative answer by Tuesday..

75 Part 4: Current results

76 Evaluating results Results often demonstrate how elusive the solution is … Consider a similar problem: Speech Recognition To evaluate results – Make recordings Submit them to an algorithm Measure the recognition rate! Emotion recognition results are far more difficult to quantify Heavily dependent on induction techniques and labeling methods

77 Several popular contexts:
Acted prototypical emotions
Call center data: real, or WoZ type
Media (radio, TV) based data
Narrative speech (event recollection)
Synthesized speech (Monterro, Gobl)
Most of these methods can be placed on the spectrum between: acted, full-blown bursts of stereotypical emotions, and fully natural mixtures of mood, affect and bursts of difficult-to-label emotions recorded in noisy environments

78 Call centers A real life scenario! (with commercial interests…)!
Sparse emotional content: Controlled (usually) Negative (usually) Lends itself easily to WOZ scenarios

79 Ang et al., 2002 Standardized call-center data from 3 different sources Uninvolved users, true HMI interaction Detects neutral/annoyance/frustration Mostly automatic extraction, with some additional human labeling Defines human “accuracy” as 75% But this is actually the percentage of human consensus Machine accuracy is comparable A possible measure: maybe “accuracy” is where users wanted human intervention

80 Batliner et al. Professional acting, amateur acting, WOZ scenario
the latter with uninvolved users, true HMI interaction Detects trouble in communication Much thought was given to this definition! Combines prosodic features with others: POS labels Syntactic boundaries Overall – shows a typical result: The closer we get to “real” scenarios, the more difficult the problem becomes! Up to 95% on acted speech Up to 79% on read speech Up to 73% on WOZ data

81 Devillers et al. Real call center data
Contains also fear (of losing money!) Human – human interaction, involved users Human accuracy of 75% is reported Is this, as in Ang, the degree of human agreement? Use a small number of intonation features Treat pauses and filled pauses separately Some results: Different behavior between clients and agents, males and females Was classification attempted also?

82 Games and simulators These provide an extremely interesting setting
Participants can often be found to experience real emotions The experimenter can sometimes control these to a certain extent Such as driving conditions or additional tasks in a driving simulator

83 Fernandez & Picard (2000) Subjects did math problems while driving a simulator This was supposed to induce stress Spectral features were used No prosody at all! Advanced classifiers were applied Results were inconsistent across users, raising a familiar question: Is it the classifier, or is it the data?

84 Kehrein (2002) 2 subjects in 2 separate rooms:
One had instructions One had a set of Lego building blocks The first had to explain to the other what to construct A wide range of “natural” emotions was reported His thesis is in German  No classification was attempted

85 Acted speech Widely used An ever-recurring question:
Does it reflect the way emotions are expressed in spontaneous speech?

86 McGilloway et al. ASSESS used for feature extraction
Speech read by non-professionals Emotion evoking texts Categories: sadness, happiness, fear, anger, neutral Up to 55% recognition

87 Recalled emotions Subjects are asked to recall emotional episodes and describe them Data is composed of long narratives It isn’t clear if subjects actually re-experience these emotions or just recount them as “observers” Can contain good instances of low-key pervasive emotions

88 Ron and Amir Ongoing work 

89 Part 5: Open issues

90 Robust raw feature extraction
Pitch and VAD (voice activity detection) Intensity (normalization) Vocal quality Duration – is this still an open problem?

91 Determination of time intervals
This might have to be addressed on a theoretical vs. practical level – Phones? Words? Tunes? Intonation units? Fixed length intervals?

92 Feature extraction Which features are most relevant to emotion?
How do we separate noise (speaker mannerisms, culture, language, etc) from the signals of emotion?

93 Part 6: HUMAINE Deliverables

94 Tangible results we are expected to deliver:
Tools Exemplars

95 Tools: Something along the lines of: solutions to parts of the problem that people can actually download and use right off

96 Exemplars: These should cover a wide scope - Concepts Methodologies
Knowledge pools – tutorials, reviews, etc. Complete solutions to “reduced” problems Test-bed systems Designs for future systems/applications

97 Tools - suggestions:
Useful feature extractors: robust pitch detection and smoothing methods, public domain segment/speech recognizers
Synthesis engines or parts thereof, e.g. emotional prosody generators
Classifying engines

98 Exemplars - suggestions:
Knowledge bases – a taxonomy of speech features. Papers (especially short ones) say what we used; what about why? And what we didn’t use? What about what we wished we had?
Test-bed systems – a working modular SAL (credit to Marc Schroeder). Embodies analysis, classification, synthesis, emotion induction/data collection … like a breeder nuclear reactor! Parts of it already exist; human parts can be replaced by automated ones as they develop.

99 Exemplars – suggestions (cont):
More focused systems – Call center systems Deal with sparse emotional content emotions vary over a relatively small range Standardized (provocative?) data Exemplifying difficulties on different levels: feature extraction, emotion classification Maybe in conjunction with WP5 Integration Demonstrations of how different modalities can complement/enhance each other

100 How do we get useful info from WP3 and WP5?
Categories Scales Models (pervasive, burst etc)

101 What is it realistic to expect?
Useful info from other workgroups
WP3: models of emotional behavior in different contexts; definite scales and categories for measuring it
WP5: databases embodying the above; data spanning the scale from clearly identifiable … to … difficult to identify

102 What is it realistic to expect?
Exemplars that show Some of the problems that are easier to solve The many problems that are difficult to solve Directions for useful further research How not to repeat previous errors

103 Some personal thoughts
Oversimplification is a common pitfall to be avoided Looking at real data, one finds that emotion is often Difficult to describe in simple terms Jumps between modalities (text might be considered a separate modality) Extremely dependent on context, character, settings, personality A task so complex for humans cannot be easy for machines!

104 Summary Speech is a major channel for signaling emotional information
And lots of other information too HUMAINE will not solve all the issues involved We should focus on those that can benefit most from the expertise and collaboration of its members Examining multiple modalities can prove extremely interesting

