Presentation on theme: "HUMAINE Workshop on Signals and signs (WP4), Santorini, September 2004"— Presentation transcript:
1 HUMAINE Workshop on Signals and Signs (WP4), Santorini, September 2004
Emotion and Speech
Techniques, models and results
Facts, fiction and opinions
Past, present and future
Acted, spontaneous, recollected
In Asia, Europe and America, and the Middle East
2 Overview
- A short introduction to speech science ... and speech analysis tools
- Speech and emotion: models, problems ... and results
- A review of open issues
- Deliverables within the HUMAINE framework
3 Part 1: Speech science in a nutshell
4 A short introduction to SPEECH
Most of those present here are familiar with various aspects of signal processing.
For the benefit of those who aren't acquainted with the speech signal in particular, we'll start with an overview of speech production models and analysis techniques.
The rest of you can sleep for a few minutes.
5 The speech signal
A 1-D signal. Does that make it a simple one? NO ...
There are many analysis techniques. As with many types of systems, parametric models are very useful here.
A simple and very useful speech production model: the source/filter model.
(In case you're worried, we'll see that this is directly related to emotions also.)
6 The source/filter model
Components:
- The lungs (create air pressure)
- Two elements that turn this into a "raw" signal:
  - The vocal folds (periodic signals)
  - Constrictions that make the airflow turbulent (noise)
- The vocal tract:
  - Partly immobile: upper jaw, teeth
  - Partly mobile: soft palate, tongue, lips, lower jaw (also called "articulators")
  - Its influence on the raw signal can be modeled very well with a low-order (~10) digital filter
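A minimal sketch of the source/filter idea in code, under illustrative assumptions: the source is either an impulse train (voiced) or white noise (unvoiced), and the vocal tract is a low-order all-pole filter built from a few made-up formant resonances. The formant values and signal lengths are hypothetical, not measurements.

```python
# Source/filter sketch: a periodic or noisy source shaped by a low-order
# all-pole "vocal tract" filter. Formant values here are illustrative.
import numpy as np
from scipy.signal import lfilter

fs = 16000                       # sampling rate (Hz)
f0 = 120                         # fundamental frequency of the source (Hz)
n = int(fs * 0.5)                # half a second of signal

# Source: impulse train for voiced speech, white noise for unvoiced
voiced_source = np.zeros(n)
voiced_source[::fs // f0] = 1.0
unvoiced_source = np.random.randn(n)

# Filter: all-pole (AR) filter of order ~6 here, built from hypothetical
# formant resonances given as (center frequency, bandwidth) in Hz
formants = [(500, 80), (1500, 120), (2500, 200)]
a = np.array([1.0])
for fc, bw in formants:
    r = np.exp(-np.pi * bw / fs)              # pole radius from bandwidth
    theta = 2 * np.pi * fc / fs               # pole angle from frequency
    a = np.convolve(a, [1.0, -2 * r * np.cos(theta), r * r])

vowel_like = lfilter([1.0], a, voiced_source)      # "filter" applied to "source"
fricative_like = lfilter([1.0], a, unvoiced_source)
```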
7 The net result
A complex signal that changes its properties constantly:
- Sometimes periodic
- Sometimes colored noise
- Approximately stationary over time windows of ~20 milliseconds
And of course, it contains a great deal of information:
- Text: linguistic information
- Other stuff: paralinguistic information
  - Speaker identity
  - Gender
  - Socioeconomic background
  - Stress, accent
  - Emotional state
  - Etc.
8 How is this information coded?
- Textual information: mainly in the filter and the way it changes its properties over time. Filter "snapshots" are called segments.
- Paralinguistic information: mainly in the source parameters.
  - Lung pressure determines the intensity
  - Vocal fold periodicity determines instantaneous frequency, or "pitch"
  - Configuration of the glottis determines overall spectral tilt: "voice quality"
9 Prosody
Prosody is another name for part of the paralinguistic information, composed of:
- Intonation: the way in which pitch changes over time
- Intensity: changes in intensity over time (problem: some segments are inherently weaker than others)
- Rhythm: segment durations vs. time
Prosody does not include voice quality, but voice quality is also part of the paralinguistic information.
10 To summarize
- Speech science is at a mature stage.
- The source/filter model is very useful in understanding speech production.
- Many applications (speech recognition, speaker verification, emotion recognition, etc.) require extraction of the model parameters from the speech signal (an inverse problem).
- This is the domain of speech analysis techniques.
11 Part 2: Speech analysis and classification
12 The large picture: speech analysis in the HUMAINE framework
Speech analysis is just one component in the context of speech and emotion.
Its overall objectives:
- Calculate raw speech parameters
- Extract features salient to emotional content
- Discard irrelevant features
- Use them to characterize and maybe classify emotional speech
[Diagram: real data, training data and a theory of emotion feed the speech analysis engine, which feeds a high-level application]
13 Signals to Signs: the process
[Diagram of the knowledge-discovery pipeline: databases and files → data cleaning and integration → data warehouse → selection and transformation → data representation → data mining → evaluation and presentation → knowledge patterns]
14 S2S (SOS ...?): the tools
A combination of techniques that belong to different disciplines:
- Data warehouse technologies (data storage, information retrieval, query answering, etc.)
- Data preprocessing and handling
- Data modeling / visualization
- Machine learning (statistical data analysis, pattern recognition, information retrieval, etc.)
15 The objective of speech analysis techniques
- To extract the raw model parameters from the speech signal. Interfering factors:
  - Reality never exactly fits the model
  - Background noise
  - Speaker overlap
- To extract features
- To interpret them in meaningful ways (pattern recognition). Really hard!
16 It remains that ...
Useful models and techniques exist for extracting the various information types from the speech signal.
Yet ... many applications such as speech recognition, speaker identification, speech synthesis, etc., are far from being perfected.
So what about emotion?
17 For the moment, let's focus on the small picture
The consensus is that emotions are coded in:
- Prosody
- Voice quality
- And sometimes in the textual information
Let's discuss the purely technical aspects of evaluating all of these ...
18 Extracting features from the speech signal
Stage 1: extracting raw features:
- Pitch
- Intensity
- Voice quality
- Pauses
- Segmental information: phones and their durations
- Text
(By the way ... who extracts them: man, machine or both?)
19 Pitch
Pitch: the instantaneous frequency. Sounds deceptively simple to find, but it isn't!
Lots of research has been devoted to pitch detection. It is composed of two sub-problems:
- For a given signal, is there periodicity at all?
- If so, what's the fundamental frequency?
Complicating factors:
- Speaker-related factors: hoarseness, diplophony, etc.
- Background-related factors: noise, overlapping speakers, filters (as in telephony)
In the context of emotions:
- Small errors are acceptable
- Large errors (octave jumps, false positives) are catastrophic
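A minimal sketch of one classic approach, autocorrelation-based pitch detection, covering the two sub-problems above: a voicing decision and an F0 estimate. The thresholds and search range are illustrative assumptions, not tuned values, and real detectors add much more machinery to avoid octave jumps.

```python
# Autocorrelation pitch detection sketch for a single short frame.
import numpy as np

def detect_pitch(frame, fs, fmin=60.0, fmax=400.0, voicing_threshold=0.4):
    """Return (is_voiced, f0_hz) for one ~20-40 ms frame."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0:
        return False, 0.0                    # silent frame
    ac = ac / ac[0]                          # normalize so ac[0] == 1
    lag_min = int(fs / fmax)                 # shortest period of interest
    lag_max = min(int(fs / fmin), len(ac) - 1)
    lag = lag_min + int(np.argmax(ac[lag_min:lag_max]))
    # Sub-problem 1: is there periodicity at all?
    if ac[lag] < voicing_threshold:
        return False, 0.0
    # Sub-problem 2: what is the fundamental frequency?
    return True, fs / lag
```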
20 An example
[Figure: a raw pitch contour in PRAAT, with pitch-detection errors marked]
21 Intensity
Appears to be even simpler than pitch! Intensity is quite easy to measure ...
Yet it is the measure most influenced by unrelated factors! Aside from the speaker, intensity is gravely affected by:
- Distance from the microphone
- Gain settings in the recording equipment
- Clipping
- AGC
- Background noise
- Recording environment
Without normalization, intensity is almost useless!
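A minimal sketch of frame-wise intensity with one simple normalization: subtracting the utterance median in dB cancels any constant gain factor (mic distance, recording level), which is exactly why unnormalized intensity is nearly useless across recordings. The frame length is an illustrative choice.

```python
# Frame-wise RMS intensity in dB, normalized to the utterance median.
import numpy as np

def intensity_db(signal, fs, frame_ms=20.0):
    """Per-frame intensity contour, median-normalized within the utterance."""
    frame_len = int(fs * frame_ms / 1000.0)
    n_frames = len(signal) // frame_len
    frames = signal[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1) + 1e-12)
    db = 20.0 * np.log10(rms)
    return db - np.median(db)    # constant gain/distance factors cancel out
```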
22 Voice quality
Several measures are used to quantify it:
- Local irregularity in pitch and intensity
- Ratio between harmonic components and noise components
- Distribution of energy in the spectrum
It is affected by a multitude of factors other than emotions.
Some standardized measures are often used in clinical applications.
A large factor in emotional speech!
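A minimal sketch of the first family of measures above: jitter (cycle-to-cycle pitch-period irregularity) and shimmer (cycle-to-cycle amplitude irregularity). It assumes some upstream pitch tracker has already produced per-cycle periods and peak amplitudes; these are the common "local" definitions, but clinical tools offer several variants.

```python
# Local jitter and shimmer from per-cycle measurements.
import numpy as np

def local_jitter(periods):
    """Mean absolute difference of consecutive periods / mean period."""
    periods = np.asarray(periods, dtype=float)
    return np.mean(np.abs(np.diff(periods))) / np.mean(periods)

def local_shimmer(amplitudes):
    """Mean absolute difference of consecutive peak amplitudes / mean amplitude."""
    amplitudes = np.asarray(amplitudes, dtype=float)
    return np.mean(np.abs(np.diff(amplitudes))) / np.mean(amplitudes)
```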
23 Segments
There are different ways of defining precisely what these are.
Automatic segmentation is difficult, though not as difficult as speech recognition.
Even the segment boundaries can give important timing information, related to rhythm, an important component of prosody.
24 Text
Is this "raw" data or not? Is it data ... at all?
Some studies on emotion specifically eliminated this factor (filtered speech, uniform texts).
Other studies are interested mainly in text.
If we want to deal with text, we must keep in mind: automated speech recognition is HARD!
- Especially with strong background noise
- Especially when strong emotions are present, modifying the speakers' normal voices and mannerisms
- Especially when dealing with multiple speakers
25 Some complicating factors in raw feature extraction
- Background noise
- Speaker overlap
- Speaker variability
- Variability in recording equipment
26 In the general context of speech analysis
The raw features we discussed are not specific only to the study of emotion.
Yet issues related to calculating them reliably crop up again and again in emotion-related studies.
Some standard and reliable tools would be very helpful.
27 Two opposing approaches to computing raw features
The ideal approach:
- Assume we have perfect algorithms for extracting all this information
- If we don't, help out manually
- This can be carried out only over small databases
- Useful in purely theoretical studies
The real-life (error-prone) approach:
- Acknowledge we only have imperfect algorithms
- Find out how to deal automatically with imperfect data
- Very important for large databases
28 Next: what do we do with it all?
Reminder: we have large amounts of raw data.
Now we have to make some meaning out of it.
29 Feature extraction ...
Stage 2: data reduction:
- Take a sea of numbers
- Reduce it to a small number of meaningful measures
- Prove they're meaningful
An interesting way to look at it: separating the "signal" (e.g. emotion) from the "noise" (anything else).
30 An example of "noise"
[Figure: here pitch and intensity have totally unemotional (but important) roles; from Deller et al.]
31 Examples of high-level features
Pitch fitting:
- Stylization
- MoMel
- Parametric modeling
- Statistics
32 An example
[Figure: the raw pitch contour in PRAAT again, with pitch-detection errors marked]
34 One way to extract the essential information: pitch stylization (the IPO method)
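A much-simplified sketch of straight-line stylization in the spirit of the IPO approach: approximate the voiced pitch contour with as few line segments as possible while staying within a tolerance. The real IPO method uses perceptual equivalence judgments rather than a fixed Hz tolerance; the greedy strategy and tolerance here are illustrative assumptions.

```python
# Greedy piecewise-linear approximation of a pitch contour.
import numpy as np

def stylize(times, f0, tol_hz=8.0):
    """Return breakpoints (time, f0) of a piecewise-linear stylization."""
    knots = [0]
    start = 0
    while start < len(f0) - 1:
        end = len(f0) - 1
        while end > start + 1:
            # Fit a straight line from start to end and test the deviation
            line = np.interp(times[start:end + 1],
                             [times[start], times[end]],
                             [f0[start], f0[end]])
            if np.max(np.abs(f0[start:end + 1] - line)) <= tol_hz:
                break
            end -= 1                        # shrink until the line fits
        knots.append(end)
        start = end
    return [(times[k], f0[k]) for k in knots]
```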
35 Another way to extract the essential information: MoMel
36 Yet another way to extract the essential information: MoMel
37 Some observations
Different parameterizations give different curves, and hence different features.
Yet perceptually, they are all very similar.
38 Questions
We can ask: what is the minimal or most representative information needed to capture the pitch contour?
More importantly, though: what aspects of the pitch contour are most relevant to emotion?
39 Several answers appear in the literature
- Statistical features taken from the raw contour: mean, variance, max, min, range, etc.
- Features taken from parameterized contours: slopes, "main" peaks and dips, etc.
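A minimal sketch of the first family: global statistics over the voiced part of a raw pitch contour. The convention that unvoiced frames are coded as 0 is an assumption; real studies typically compute many more features than these five.

```python
# Global pitch-contour statistics (unvoiced frames coded as 0 are excluded).
import numpy as np

def contour_statistics(f0):
    """Mean/variance/max/min/range of the voiced pitch values."""
    voiced = np.asarray(f0, dtype=float)
    voiced = voiced[voiced > 0]
    if len(voiced) == 0:
        return {}                      # no voiced speech in this contour
    return {
        "mean": float(np.mean(voiced)),
        "variance": float(np.var(voiced)),
        "max": float(np.max(voiced)),
        "min": float(np.min(voiced)),
        "range": float(np.max(voiced) - np.min(voiced)),
    }
```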
40There’s not much time to go into: Intensity contoursSpectraDurationBut the problems are very similar
41 The importance of time frames
We have several measures that vary over time. Over what time frame should we consider them?
The meaning we attribute to speech parameters depends on the time frame over which they're considered:
- Fixed-length windows
- Phones
- Words
- "Intonation units"
- "Tunes"
42 Which time frame is best?
- Fixed time frames of several seconds: simple to implement, but naive and very arbitrary (a sketch of this option follows below)
- Words: need a recognizer to be marked; probably the shortest meaningful frame
- "Intonation units": nobody knows exactly what they are (one "idea" per unit?); hard to measure; correlate best with coherent stretches of speech
- "Tunes" (from one pause to the next): feasible to implement; correlate to some extent with coherent stretches of speech
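A minimal sketch of the simplest (and most naive) option above: chop a parameter track into fixed-length windows of a few seconds and compute per-window statistics. The frame rate and window length are illustrative assumptions.

```python
# Per-window pitch statistics over fixed-length windows of a few seconds.
import numpy as np

def windowed_features(f0_track, frames_per_sec=100, window_sec=3.0):
    """List of per-window feature dicts from a frame-rate pitch track."""
    win = int(frames_per_sec * window_sec)
    features = []
    for start in range(0, len(f0_track) - win + 1, win):
        chunk = np.asarray(f0_track[start:start + win], dtype=float)
        voiced = chunk[chunk > 0]
        if len(voiced) == 0:
            continue                   # skip windows with no voiced speech
        features.append({
            "t_start_sec": start / frames_per_sec,
            "f0_mean": float(np.mean(voiced)),
            "f0_range": float(np.max(voiced) - np.min(voiced)),
        })
    return features
```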
43 Why is this such an important decision?
It might help us interpret our data correctly!
44 Therefore ... the problem of feature extraction:
- Is NOT a general one
- We want features that are specifically relevant to emotional content ...
But before we get to that, we have:
45 The data mining part
Stage 3: to extract knowledge, i.e. previously unknown information (rules, constraints, regularities, patterns, etc.), from the features database.
46 What are we mining?
We look for patterns that either describe the stored data or infer from it (predictions):
- Summarization and characterization (of the class of data that interests us)
- Discrimination and comparison of features of different classes
47 Types of analysis
- Association analysis: rules of the form X => Y (DB tuples that satisfy X are likely to satisfy Y), where X and Y are pairs of an attribute and a value/set of values
- Classification and class prediction: find a set of functions that describe and distinguish data classes/concepts and can be used to predict the class of unlabeled data
- Cluster analysis (unsupervised clustering): analyze the data when there are no class labels, to deal with new types of data and help group similar events together
48 Association rules
We search for interesting relationships among items in the data.
Interestingness measures:
- Support = (# tuples that contain both A and B) / (# tuples)
- Confidence = (# tuples that contain both A and B) / (# tuples that contain A)
Support measures usefulness; confidence measures certainty.
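A minimal sketch of these two measures, computed for a candidate rule A => B over a list of transactions (each a set of items). The toy data, with speech features discretized into items like "high pitch range", is a made-up example.

```python
# Support and confidence of the rule {a} => {b}.
def support_confidence(transactions, a, b):
    n = len(transactions)
    n_a = sum(1 for t in transactions if a in t)
    n_ab = sum(1 for t in transactions if a in t and b in t)
    support = n_ab / n
    confidence = n_ab / n_a if n_a else 0.0
    return support, confidence

# Hypothetical example: does "high pitch range" predict "angry"?
clips = [{"high_range", "angry"}, {"high_range", "happy"},
         {"low_range", "sad"}, {"high_range", "angry"}]
print(support_confidence(clips, "high_range", "angry"))  # (0.5, 0.666...)
```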
49 Classification
A two-step process:
- Use data tuples with known labels to construct a model
- Use the learned model to classify (assign labels to) new data
Data is divided into two groups: training data and test data. Test data is used to estimate the predictive accuracy of the learned model.
Since the class label of each training sample is known, this is supervised learning.
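A minimal sketch of the two-step process using scikit-learn (a tool choice of this sketch, not of the presentation): fit a model on labeled training tuples, then estimate accuracy on held-out test data. The random feature matrix and labels are placeholders for features from the extraction stages discussed earlier.

```python
# Train/test classification sketch with placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(200, 5)                       # placeholder feature vectors
y = np.random.choice(["neutral", "angry"], 200)  # placeholder labels

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25)
model = RandomForestClassifier().fit(X_train, y_train)    # step 1: learn
print("estimated accuracy:", model.score(X_test, y_test)) # step 2: evaluate
```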
50 Assets
- No need to know the rules in advance; some rules are not easily formulated as mathematical or logical expressions
- Similar to one of the ways humans learn
- Could be more robust to noise and incomplete data
- May require a lot of samples
- Learning depends on existing data only!
51 Dangers
- The model might not be able to learn
- There might not be enough data
- Over-fitting the model to the training data
Algorithms:
- Machine learning (statistical learning)
- Expert systems
- Computational neuroscience
52 Prediction
Classification predicts categorical labels; prediction models a continuous-valued function.
It is usually used to predict the value, or a range of values, of an attribute of a given sample.
Techniques: regression, neural networks.
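A minimal sketch of prediction as regression: instead of a categorical label, the model outputs a continuous value, for instance an activation rating on a psychological scale. The features and the linear relationship are toy assumptions.

```python
# Regression sketch: predict a continuous "activation" value from features.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.rand(100, 3)             # placeholder prosodic feature vectors
activation = X @ [0.5, 1.0, -0.3] + 0.1 * np.random.randn(100)  # toy target

reg = LinearRegression().fit(X, activation)
print("predicted activation:", reg.predict(X[:2]))
```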
53 Clustering
Constructing models for assigning class labels to data that is unlabeled: unsupervised learning.
Clustering is an ill-defined task.
Once clusters are discovered, the clustering model can be used for predicting labels of new data.
Alternatively, the clusters can be used as labels to train a supervised classification algorithm.
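A minimal sketch of the last two points: discover clusters in unlabeled feature vectors, then reuse the assignments either to label new data directly or as targets for a supervised classifier. K-means is one choice among many; the data is a placeholder.

```python
# Clustering sketch: discover clusters, then reuse them two ways.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier

X = np.random.rand(300, 5)                  # placeholder unlabeled features

kmeans = KMeans(n_clusters=4, n_init=10).fit(X)
new_labels = kmeans.predict(np.random.rand(10, 5))   # label new data directly

# Or: treat cluster assignments as labels for supervised training
clf = RandomForestClassifier().fit(X, kmeans.labels_)
```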
54 So how does this technical mumbo jumbo tie into ...
56 Speech and emotion
Emotion can affect speech in many ways:
- Consciously
- Unconsciously
- Through the autonomic nervous system
Examples:
- Textual content is usually consciously chosen, except maybe sudden interjections, which may stem from sudden or strong emotions
- Many speech patterns related to emotions are strongly ingrained; therefore, though they can be controlled by the speaker, most often they are not, unless the speaker tries to modify them consciously
- Certain speech characteristics are affected by the degree of arousal, and are therefore nearly impossible to inhibit (e.g. vocal tremor due to grief)
57 Speech analysis: the big picture, again
Speech analysis is just one component in the context of speech and emotion.
[Diagram: databases, real data and theories of emotion feed speech analysis, which feeds the application]
58 Is this just another way to spread the blame?
Us speech analysis guys are just poor little engineers. The methods we can supply can be no better than the theory and the data that drive them.
... And unfortunately, the jury is still out on both of those points. Or is it? Ask the WP3 and WP5 people; they're here somewhere.
Actually, one of the difficulties HUMAINE is intended to ease is that researchers in the field often find themselves having to address all of the above! (Guilty.)
59 The most fundamental problem
What are the features that signify emotion? To paraphrase: what signals are signs of emotion?
60 The most common solutions
- Calculate as many as you can think of
- Intuition
- Theory-based answers
- Data-driven answers
Ha! Once more, it's not our fault!
61 What seems to be the most plausible approach?
The data-driven approach, requiring:
- Emotional speech databases ("corpora")
- Perceptual evaluation of these databases
- This is then correlated with speech features
Which takes us back to a previous square.
62 So tell us already: how does emotion influence speech?
... It seems that the answer depends on how you look for it.
As hinted before, the answer cannot really be separated from:
- The theories of emotion
- The databases we have of emotional speech:
  - Who the subjects are
  - How emotion was elicited
63 A short digression
Will all the speech clinicians in the audience please stand up?
Hmm ... we don't seem to have that many. Let's look at what one of them has to say.
64 Emotions in the speech clinic
Some speakers have speech/voice problems that modify their "signal", thus misleading the listener.
VOICE:
- People with vocal instability (high jitter/shimmer/tremor) are clinically perceived as nervous (although the problems reflect irregularity in the vocal folds)
- Breathy voice (in women) is sometimes perceived as "sexy" (while it actually reflects incomplete adduction of the vocal folds)
- A higher excitation level leads to vocal instability (high jitter/shimmer/tremor)
65 Clinical examples
- STUTTERING: listeners judge people who stutter as nervous, tense, and less confident (identification of stuttering depends on pause duration within the "repetition units", and on the rate of repetitions)
- CLUTTERING: listeners judge people who clutter as nervous and less intelligent
66 So, though this is a WP4 meeting ...
It's impossible to avoid talking about WP3 (theory of emotion) and WP5 (databases) issues.
The signs we're looking for can never be separated from the questions:
- Signs of what (emotions)?
- Signs in what (data)?
May God and Phillipe Gelin forgive me ...
67 A not-so-old example (Murray and Arnott, 1993)
- Very qualitative
- Presupposes dealing with primary emotions
68 BUT ...
If you expect more recent results to give more detailed descriptive outlines, then you're wrong.
The data-driven approaches use a large number of features and let the computer sort them out:
- 32 significant features found by ASSESS, from the initial 375 used
- 5 emotions, acted
- 55% recognition
69 Some remarks
Some features are indicative, even though we probably don't use them perceptually.
E.g. pitch mean: usually this is raised with higher activation, but we don't have to know the speaker's neutral mean to perceive heightened activation. My guess: voice quality is what we perceive in such cases.
How "simple" can the characterization of emotions become? How many features do we listen for? Can this be verified?
70 Time intervals
This issue becomes more and more important as we go towards "natural" data.
Emotion production:
- How long do emotions last?
- Full-blown emotions are usually short (but not always! Look at Peguy in the LIMSI interview database)
- Moods, or pervasive emotions, are subtle but long-lasting
Emotion analysis:
- Over what span of speech are they easiest to detect?
71 From the analysis viewpoint
Current efforts seem to be focusing on methods that aim to use time spans that have some inherent meaning:
- Acoustically (ASSESS; Cowie et al.)
- Linguistically (Batliner et al.)
We mentioned that prosody carries:
- Emotional information (our "signal")
- Other information ("noise"): phrasing, various types of prominence
BUT ...
72 Why I like intonation units
Spontaneous speech is organized differently from written language: "sentences" and "paragraphs" don't really exist there.
Phrasing is a loose phrase for ... "intonation units".
Theoretical linguists love to discuss what these are; an exact definition is as hard to find as it is to parse spontaneous speech.
Prosodic markers help replace various written markers.
Maybe emotion is not an "orthogonal" bit of information on top of these (the signal+noise model). If emotion modifies these, it would be very useful if we could identify the prosodic markers we use and the ways we modify them when we're emotional.
Problem: engineers don't like ill-defined concepts! But emotion is one of them too, isn't it?
73 Just to provoke some thought
From a paper on animation (think of it: these guys have to integrate speech and image to make them fit naturally):
"... speech consists of a sequence of intonation phrases. Each intonation phrase is realized with fluid, continuous articulation and a single point of maximum emphasis. Boundaries between successive phrases are associated with perceived disjuncture and are marked in English with cues such as pitch movements ... Gestures are performed in units that coincide with these intonation phrases, and points of prominence in gestures also coincide with the emphasis in the concurrent speech ..." [Stone et al., SIGGRAPH 2004]
74We haven’t even discussed WP3 issues - What are the scales/categories?Possibility 1: emotional labelingPossibility 2: psychological scales (such as valence/activation – e.g. Feeltrace)QUESTION:Which is more directly related to speech features?Hopefully we’ll hammer out a tentative answer by Tuesday..
76 Evaluating results
Results often demonstrate how elusive the solution is ...
Consider a similar problem: speech recognition. To evaluate results:
- Make recordings
- Submit them to an algorithm
- Measure the recognition rate!
Emotion recognition results are far more difficult to quantify: they are heavily dependent on induction techniques and labeling methods.
77 Several popular contexts
- Acted prototypical emotions
- Call center data:
  - Real
  - WoZ type
- Media (radio, TV) based data
- Narrative speech (event recollection)
- Synthesized speech (Montero, Gobl)
Most of these methods can be placed on the spectrum between:
- Acted, full-blown bursts of stereotypical emotions
- Fully natural mixtures of mood, affect and bursts of difficult-to-label emotions, recorded in noisy environments
78 Call centers
A real-life scenario (with commercial interests ...)!
Sparse emotional content:
- Controlled (usually)
- Negative (usually)
Lends itself easily to WOZ scenarios.
79 Ang et al., 2002
- Standardized call-center data from 3 different sources
- Uninvolved users, true HMI interaction
- Detects neutral/annoyance/frustration
- Mostly automatic extraction, with some additional human labeling
- Defines human "accuracy" as 75%, but this is actually the percentage of human consensus
- Machine accuracy is comparable
- A possible measure: maybe "accuracy" is where users wanted human intervention
80 Batliner et al.
- Professional acting, amateur acting, and a WOZ scenario, the latter with uninvolved users and true HMI interaction
- Detects trouble in communication (much thought was given to this definition!)
- Combines prosodic features with others: POS labels, syntactic boundaries
Overall, shows a typical result: the closer we get to "real" scenarios, the more difficult the problem becomes!
- Up to 95% on acted speech
- Up to 79% on read speech
- Up to 73% on WOZ data
81 Devillers et al.
- Real call center data; also contains fear (of losing money!)
- Human-human interaction, involved users
- Human accuracy of 75% is reported. Is this, as in Ang, the degree of human agreement?
- Uses a small number of intonation features; treats pauses and filled pauses separately
Some results: different behavior between clients and agents, males and females.
Was classification attempted also?
82 Games and simulators
These provide an extremely interesting setting:
- Participants can often be found to experience real emotions
- The experimenter can sometimes control these to a certain extent, e.g. through driving conditions or additional tasks in a driving simulator
83 Fernandez & Picard (2000)
- Subjects did math problems while driving a simulator; this was supposed to induce stress
- Spectral features were used; no prosody at all!
- Advanced classifiers were applied
- Results were inconsistent across users, raising a familiar question: is it the classifier, or is it the data?
84 Kehrein (2002)
- 2 subjects in 2 separate rooms: one had instructions, the other had a set of Lego building blocks
- The first had to explain to the other what to construct
- A wide range of "natural" emotions was reported
- His thesis is in German
- No classification was attempted
85 Acted speech
Widely used. An ever-recurring question: does it reflect the way emotions are expressed in spontaneous speech?
86 McGilloway et al.
- ASSESS used for feature extraction
- Speech read by non-professionals
- Emotion-evoking texts
- Categories: sadness, happiness, fear, anger, neutral
- Up to 55% recognition
87 Recalled emotions
- Subjects are asked to recall emotional episodes and describe them
- Data is composed of long narratives
- It isn't clear if subjects actually re-experience these emotions or just recount them as "observers"
- Can contain good instances of low-key pervasive emotions
94 Tangible results we are expected to deliver
- Tools
- Exemplars
95 Tools
Something along the lines of solutions to parts of the problem that people can actually download and use right away.
96 Exemplars
These should cover a wide scope:
- Concepts
- Methodologies
- Knowledge pools: tutorials, reviews, etc.
- Complete solutions to "reduced" problems
- Test-bed systems
- Designs for future systems/applications
97 Tools: suggestions
- Useful feature extractors:
  - Robust pitch detection and smoothing methods
  - Public domain segment/speech recognizers
- Synthesis engines or parts thereof, e.g. emotional prosody generators
- Classifying engines
98 Exemplars: suggestions
Knowledge bases:
- A taxonomy of speech features
- Papers (especially short ones) say what we used. What about why? And what we didn't use? What about what we wished we had?
Test-bed systems:
- A working modular SAL (credit to Marc Schroeder)
- Embodies analysis, classification, synthesis, emotion induction/data collection ... like a breeder nuclear reactor!
- Parts of it already exist; human parts can be replaced by automated ones as they develop
99 Exemplars: suggestions (cont.)
More focused systems:
- Call center systems:
  - Deal with sparse emotional content; emotions vary over a relatively small range
- Standardized (provocative?) data:
  - Exemplifying difficulties on different levels: feature extraction, emotion classification
  - Maybe in conjunction with WP5
- Integration:
  - Demonstrations of how different modalities can complement/enhance each other
100 How do we get useful info from WP3 and WP5?
- Categories
- Scales
- Models (pervasive, burst, etc.)
101 What is it realistic to expect?
Useful info from other workgroups:
- WP3:
  - Models of emotional behavior in different contexts
  - Definite scales and categories for measuring it
- WP5:
  - Databases embodying the above
  - Data spanning the scale from clearly identifiable ... to ... difficult to identify
102 What is it realistic to expect?
Exemplars that show:
- Some of the problems that are easier to solve
- The many problems that are difficult to solve
- Directions for useful further research
- How not to repeat previous errors
103 Some personal thoughts
Oversimplification is a common pitfall to be avoided.
Looking at real data, one finds that emotion is often:
- Difficult to describe in simple terms
- Jumping between modalities (text might be considered a separate modality)
- Extremely dependent on context, character, setting, personality
A task so complex for humans cannot be easy for machines!
104 Summary
- Speech is a major channel for signaling emotional information (and lots of other information too)
- HUMAINE will not solve all the issues involved; we should focus on those that can benefit most from the expertise and collaboration of its members
- Examining multiple modalities can prove extremely interesting