
Automatic Classification of Married Couples’ Behavior using Audio Features Matthew Black, Athanasios Katsamanis, Chi-Chun Lee, Adam C. Lammert, Brian R. Baucom, Andrew Christensen, Panayiotis G. Georgiou, and Shrikanth Narayanan September 29, 2010


Overview of Study
[Flow diagram] Married couples discussing a problem in their relationship → available data (e.g., audio, video, text) → data coding by trained human evaluators → judgments (e.g., how much is one spouse blaming the other spouse?); in parallel, signal processing (e.g., feature extraction) → computational modeling (e.g., machine learning), with a feedback loop.

Motivation
Psychology research depends on perceptual judgments
– Interactions recorded for offline hand coding
– Rely on a variety of established coding standards [Margolin et al. 1998]
Manual coding process is expensive and time-consuming
– Creating the coding manual
– Training coders
– Establishing coder reliability
Technology can help code audio-visual data
– Certain measurements are difficult for humans to make
– Computers can extract these low-level descriptors (LLDs) [Schuller et al. 2007]
– Consistent way to quantify human behavior from objective signals

Corpus
Real couples in 10-minute problem-solving dyadic interactions
– Longitudinal study at UCLA and U. of Washington [Christensen et al. 2004]
– 134 distressed couples received couples therapy for 1 year
– 574 sessions (96 hours)
– Split-screen video (704x480 pixels, 30 fps)
– Single channel of far-field audio
Data originally intended only for manual coding
– Recording conditions not ideal: video angle, microphone placement, and background noise varied
– Word transcriptions available with speakers explicitly labeled, but with no indication of timing or speech-overlap regions

Sample Transcript
Husband: what did I tell you about you can spend uh everything that we uh earn
Wife: then why did you ask
Husband: and spend more and get us into debt
Wife: yeah why did you ask see my question is
Husband: mm hmmm
Wife: if if you told me this and I agree I would keep the books and all my expenses and everything

Manual Coding
Each spouse evaluated by 3-4 trained coders
– 33 session-level codes (all on a 1 to 9 scale)
– Utterance- and turn-level ratings were not obtained
– Coding systems: Social Support Interaction Rating System, Couples Interaction Rating System
– All evaluators underwent a training period to standardize the coding process
We analyzed 6 codes for this study
– Level of acceptance (“acc”)
– Level of blame (“bla”)
– Global positive affect (“pos”)
– Global negative affect (“neg”)
– Level of sadness (“sad”)
– Use of humor (“hum”)

Speaker Segmentation
Segment the sessions into meaningful regions
– Exploit the known lexical content (transcriptions with speaker labels)
– Recursive automatic speech-text alignment technique [Moreno 1998]
– Session split into regions: wife / husband / unknown
– Aligned >60% of the session’s words for 293 of the 574 sessions
Abbreviations in the alignment diagram: AM = Acoustic Model, LM = Language Model, Dict = Dictionary, MFCC = Mel-Frequency Cepstral Coefficients, ASR = Automatic Speech Recognition, HYP = ASR Hypothesized Transcript

Goals
Separate extreme cases of session-level perceptual judgments using speaker-dependent features derived from the (noisy) audio signal
– Extraction and analysis of relevant acoustic (prosodic/spectral) features
Relevance
– Automatic coding of real data using objective features
– Extraction of high-level speaker information from complex interactions

Feature Extraction (1/3)
Explore the use of 10 acoustic low-level descriptors (LLDs), broadly useful in psychology and engineering research
1) Speaking rate
– Extracted for each aligned word [words/sec, letters/sec]
2) Voice Activity Detection (VAD) [Ghosh et al. 2010]
– Trained on a 30-second clip from a held-out session
– Separated speech from non-speech regions
– Extracted 2 session-level first-order Markov chain features: Pr(x_i = speech | x_{i-1} = speech) and Pr(x_i = non-speech | x_{i-1} = speech) (see the sketch below)
– Extracted durations of each speech and non-speech segment to use as an LLD for later feature extraction
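A minimal sketch (not the authors’ code) of how the two session-level Markov chain features could be computed from a frame-level binary VAD output; the toy input and function name are illustrative assumptions.

```python
import numpy as np

def markov_vad_features(vad):
    """Session-level first-order Markov chain features from a frame-level
    binary VAD sequence (1 = speech, 0 = non-speech):
    Pr(x_i = speech | x_{i-1} = speech) and
    Pr(x_i = non-speech | x_{i-1} = speech)."""
    vad = np.asarray(vad, dtype=int)
    prev, curr = vad[:-1], vad[1:]
    n_prev_speech = np.sum(prev == 1)
    if n_prev_speech == 0:
        return 0.0, 0.0
    p_speech_speech = np.sum((prev == 1) & (curr == 1)) / n_prev_speech
    p_nonspeech_speech = np.sum((prev == 1) & (curr == 0)) / n_prev_speech
    return float(p_speech_speech), float(p_nonspeech_speech)

# Toy example: 10 VAD frames
p_ss, p_ns = markov_vad_features([1, 1, 1, 0, 0, 1, 1, 0, 1, 1])
```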

Feature Extraction (2/3)
Extracted over each voiced region (every 10 ms with a 25 ms window):
3) Pitch
4) Root-Mean-Square (RMS) Energy
5) Harmonics-to-Noise Ratio (HNR)
6) “Voice Quality” (zero-crossing rate of the autocorrelation function)
7) 13 MFCCs
8) Magnitudes of 26 Mel-frequency bands (MFBs)
9) Magnitude of the spectral centroid
10) Magnitude of the spectral flux
LLDs 4-10 extracted with openSMILE [Eyben et al. 2009]; pitch extracted with Praat [Boersma 2001]
– Pitch median filtered (N=5) and linearly interpolated (see the sketch below)
– No interpolation across speaker-change points (using the automatic alignment)
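A minimal sketch of the described pitch post-processing (median filtering with N=5 and linear interpolation over unvoiced gaps), assuming the pitch track is a NumPy array with NaN for unvoiced frames; it must be applied per speaker-homogeneous segment to respect the no-interpolation-across-speaker-changes constraint.

```python
import numpy as np
from scipy.signal import medfilt

def smooth_pitch(f0, kernel=5):
    """Median-filter (N = 5) and linearly interpolate one pitch contour.
    f0: 1-D array with NaN for unvoiced frames. Apply this per
    speaker-homogeneous segment so that interpolation never crosses
    a speaker-change point."""
    f0 = np.asarray(f0, dtype=float)
    out = f0.copy()
    voiced = ~np.isnan(f0)
    if voiced.sum() < 2:
        return out                       # nothing to smooth or interpolate
    if voiced.sum() >= kernel:
        out[voiced] = medfilt(f0[voiced], kernel_size=kernel)
    idx = np.arange(len(f0))
    out[~voiced] = np.interp(idx[~voiced], idx[voiced], out[voiced])
    return out
```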

Pitch Example & Normalization
Normalized the pitch stream in 2 ways
– Mean pitch value, μ_F0, computed across the session using the automatic alignment (an illustrative sketch follows below)
– Unknown regions treated as coming from one “unknown” speaker
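The slide does not spell out the two normalizations; as one hedged illustration only, session-level mean normalization could look like the following, where dividing by μ_F0 is an assumption rather than the authors’ exact formula.

```python
import numpy as np

def normalize_pitch(f0, mu_f0=None):
    """Illustrative session-level normalization: divide each pitch value
    by the session mean mu_F0 (computed over voiced frames). Division
    (rather than, e.g., semitone conversion) is an assumption here."""
    f0 = np.asarray(f0, dtype=float)
    if mu_f0 is None:
        mu_f0 = np.nanmean(f0)           # session-level mean pitch
    return f0 / mu_f0
```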

Feature Extraction (3/3)
Each session split into 3 “domains”
– Wife (when the wife was the speaker)
– Husband (when the husband was the speaker)
– Speaker-independent (full session)
Extracted 13 functionals across each domain for each LLD (see the sketch below)
– Mean, standard deviation, skewness, kurtosis, range, minimum, minimum location, maximum, maximum location, lower quartile, median, upper quartile, interquartile range
Final set of 2007 features
– Captures global acoustic properties of the spouses and the interaction
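A minimal sketch of the 13 functionals computed over one LLD stream within one domain; the function name and the convention of returning locations as relative positions are illustrative, not the authors’ code.

```python
import numpy as np
from scipy.stats import skew, kurtosis

def functionals(x):
    """The 13 statistical functionals over one LLD stream (1-D array);
    'location' functionals are returned as relative positions in [0, 1]."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    q1, med, q3 = np.percentile(x, [25, 50, 75])
    return {
        "mean": np.mean(x), "std": np.std(x),
        "skewness": skew(x), "kurtosis": kurtosis(x),
        "range": np.ptp(x),
        "min": np.min(x), "min_location": np.argmin(x) / n,
        "max": np.max(x), "max_location": np.argmax(x) / n,
        "lower_quartile": q1, "median": med, "upper_quartile": q3,
        "iqr": q3 - q1,
    }
```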

Classification Experiment
Binary classification task
– Only analyzed sessions whose mean evaluator scores were in the top/bottom 20% of the code range
– Goal: separate the 2 extremes automatically
– Leave-one-couple-out cross-validation
– Trained wife and husband models separately [Christensen 1990]
– Error metric: % of misclassified sessions
Classifier: Fisher’s linear discriminant analysis (LDA)
– Forward feature selection to choose which features to train the LDA (see the sketch below)
– Empirically better than other common classifiers (SVM, logistic regression)
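A minimal sketch of the evaluation setup described above (LDA with greedy forward feature selection inside leave-one-couple-out cross-validation), using scikit-learn; the function name, data arrays, and the cap on selected features are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.pipeline import make_pipeline

def leave_one_couple_out_error(X, y, couple_ids, n_select=5):
    """Misclassification rate of an LDA trained on greedily (forward)
    selected features, evaluated with leave-one-couple-out CV.
    X: (n_sessions, n_features); y: binary extreme-low/high labels;
    couple_ids: one couple id per session."""
    errors = []
    for tr, te in LeaveOneGroupOut().split(X, y, groups=couple_ids):
        model = make_pipeline(
            SequentialFeatureSelector(LinearDiscriminantAnalysis(),
                                      n_features_to_select=n_select,
                                      direction="forward"),
            LinearDiscriminantAnalysis(),
        )
        model.fit(X[tr], y[tr])
        errors.append(np.mean(model.predict(X[te]) != y[te]))
    return float(np.mean(errors))
```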

Classification Results (1/2)
Separated extreme behaviors better than chance (50% error) for 3 of the 6 codes
– Acceptance, global positive affect, global negative affect
– Global speaker-dependent cues captured evaluators’ perception well

Classification Results (2/2)
Other codes
– Need to be modeled with more dynamic methods
– Depend more crucially on non-acoustic cues
– Inherently less separable

Results: Feature Selection
LDA classifier feature selection
– Mean = 3.4 features selected, Std. Dev. = 0.81 features
– Domain breakdown
  – Wife models: Independent = 56%, Wife = 32%, Husband = 12%
  – Husband models: Independent = 41%, Husband = 37%, Wife = 22%
– Feature breakdown
  – Pitch = 30%, MFCC = 26%, MFB = 26%, VAD = 12%, Speaking Rate = 4%, Other = 2%
Also explored using features from a single domain
– As expected, performance decreases in mismatched conditions
– Good performance using speaker-independent features: the session is inherently interactive (need to use more dynamic methods)

Conclusions
This work represents an initial analysis of a novel and challenging corpus consisting of real couples interacting about problems in their relationship
We showed we could train binary classifiers, using only audio features, that separated spouses’ behavior significantly better than chance for 3 of the 6 codes we analyzed
This provides a partial explanation of coders’ subjective judgments in terms of objective acoustic signals

Future Work
More automation
– Front-end: speaker segmentation using only the audio signal
Multimodal fusion
– Acoustic features + lexical features
Saliency detection
– Using supervised methods (manual coding at a finer temporal scale)
– Using unsupervised “signal-driven” methods

References
P. Boersma, “Praat, a system for doing phonetics by computer,” Glot International, vol. 5, no. 9/10, pp. 341–345, 2001.
A. Christensen, D. C. Atkins, S. Berns, J. Wheeler, D. H. Baucom, and L. E. Simpson, “Traditional versus integrative behavioral couple therapy for significantly and chronically distressed married couples,” J. of Consulting and Clinical Psychology, vol. 72, 2004.
F. Eyben, M. Wöllmer, and B. Schuller, “openEAR – Introducing the Munich open-source emotion and affect recognition toolkit,” in Proc. IEEE ACII, 2009.
P. K. Ghosh, A. Tsiartas, and S. S. Narayanan, “Robust voice activity detection using long-term signal variability,” IEEE Trans. Audio, Speech, and Language Processing, 2010 (accepted).
C. Heavey, D. Gill, and A. Christensen, Couples Interaction Rating System 2 (CIRS2), University of California, Los Angeles, 2002.
J. Jones and A. Christensen, Couples Interaction Study: Social Support Interaction Rating System, University of California, Los Angeles, 1998.
G. Margolin, P. H. Oliver, E. B. Gordis, H. G. O’Hearn, A. M. Medina, C. M. Ghosh, and L. Morland, “The nuts and bolts of behavioral observation of marital and family interaction,” Clinical Child and Family Psychology Review, vol. 1, no. 4, 1998.
P. J. Moreno, C. Joerg, J.-M. van Thong, and O. Glickman, “A recursive algorithm for the forced alignment of very long audio segments,” in Proc. ICSLP, 1998.
V. Rozgić, B. Xiao, A. Katsamanis, B. Baucom, P. G. Georgiou, and S. Narayanan, “A new multichannel multimodal dyadic interaction database,” in Proc. Interspeech, 2010.
B. Schuller, A. Batliner, D. Seppi, S. Steidl, T. Vogt, J. Wagner, L. Devillers, L. Vidrascu, N. Amir, and L. Kessous, “The relevance of feature type for automatic classification of emotional user states: Low level descriptors and functionals,” in Proc. Interspeech, 2007.
A. Vinciarelli, M. Pantic, and H. Bourlard, “Social signal processing: Survey of an emerging domain,” Image and Vision Computing, vol. 27, 2009.

Thank you! Questions?

Recruited Couples (hidden slide)
Couples married an average of 10.0 years (SD = 7.7 years)
Ages ranged from 22 to 72 years
– Men: median age = 43 years (SD = 8.8 years)
– Women: median age = 42 years (SD = 8.7 years)
College-educated on average
– Men’s and women’s median level of education = 17 years (SD = 3.2 years)
Ethnicity breakdown
– 77% Caucasian, 8% African American, 5% Asian or Pacific Islander, 5% Latino/Latina, 5% Other

Social Support Interaction Rating System (hidden slide)
Designed in [Jones et al. 1998]
18 codes measuring the emotional content of the interaction and the topic of conversation
4 categories: Affectivity, Dominance/Submission, Features of Interaction, Topic Definition
Example: Global positive affect
– “An overall rating of the positive affect the target spouse showed during the interaction. Examples of positive behavior include overt expressions of warmth, support, acceptance, affection, positive negotiation, and compromise. Positivity can also be expressed through facial and bodily expressions, such as smiling and looking happy, talking easily, looking comfortable and relaxed, and showing interest in the conversation.”

Couples Interaction Rating System (hidden slide)
Designed in [Heavey et al. 2002]
13 codes, specifically designed to rate an individual spouse as he/she interacts with his/her partner about a problem
Example: Blame
– “Blames, accuses, or criticizes the partner, uses critical sarcasm; makes character assassinations such as … ‘all you do is eat,’ or ‘why are you such a jerk about it?’ Explicit blaming statements (e.g. ‘you made me do it’ or ‘you prevent me from doing it’), in which the spouse is the causal agent for the problem or the subject’s reactions, warrant a high score.”

Automatic Alignment Details (hidden slide)
Align reference and hypothesized transcripts
– Dynamic programming string alignment to minimize the edit distance
– Insertion/deletion/substitution errors assigned penalty scores of 3/3/4 (see the sketch below)
– Example of alignment (aligned regions underlined in the original slide):
REF Trans: YOU ARE INSISTING ON THE ANGER THAT YOU HAVE
REF Align: you ARE insisting on ***** the anger THAT YOU HAVE
HYP Align: you insisting on the anger **** TO NOT
HYP Trans: (has start and stop times for each recognized word)
Reject “low-confidence” aligned regions:
– Audio duration < 0.25 sec (too short)
– Text string length < 6 letters (too short)
– Speaking rate < 8 letters/sec (too slow)
– Speaking rate > 40 letters/sec (too fast)
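A minimal sketch of a dynamic-programming word alignment with the stated insertion/deletion/substitution penalties of 3/3/4; this is a generic Levenshtein-style implementation for illustration, not the tool the authors actually used.

```python
def align_cost(ref, hyp, ins=3, dele=3, sub=4):
    """Minimum edit-distance cost between reference and hypothesized word
    sequences with insertion/deletion/substitution penalties of 3/3/4.
    A full aligner would also keep backpointers to recover the alignment."""
    n, m = len(ref), len(hyp)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + dele
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if ref[i - 1].lower() == hyp[j - 1].lower() else sub
            D[i][j] = min(D[i - 1][j - 1] + match,  # match / substitution
                          D[i - 1][j] + dele,       # word in REF missed by ASR
                          D[i][j - 1] + ins)        # extra word in HYP
    return D[n][m]

cost = align_cost("you are insisting on the anger".split(),
                  "you insisting on the anger".split())
```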

Speaker Segmentation Analysis (hidden slide)
Aligned >60% of the session’s words for 293 of the 574 sessions
– Performance largely dependent on the audio signal-to-noise ratio (R² = 0.63)
– Only these 293 sessions were considered for this study

Future Work: Multimodal Fusion (hidden slide)
Fusion of audio and lexical features for multimodal prediction
– Feature-level fusion and score-level fusion (block diagrams in the original slide: feature selection → classifier per modality, decisions combined, see the sketch below)
– Experiment 1: Make one decision for the entire session
– Experiment 2: Make a decision for each turn/utterance (Turn 1 … Turn N), then combine over turns
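As a hedged illustration of the score-level fusion and turn-to-session combination sketched in the diagram, per-turn scores could be weighted, summed, and thresholded roughly as follows; the score convention (sign encodes the class) and the equal default weights are assumptions.

```python
import numpy as np

def fuse_turn_scores(acoustic_scores, lexical_scores, w_acoustic=0.5):
    """Score-level fusion of per-turn acoustic and lexical classifier
    scores, summed over turns (the summation node in the diagram)
    to give one session-level decision."""
    fused = (w_acoustic * np.asarray(acoustic_scores, dtype=float)
             + (1.0 - w_acoustic) * np.asarray(lexical_scores, dtype=float))
    return 1 if fused.sum() > 0 else 0

session_label = fuse_turn_scores([0.4, -0.1, 0.7], [0.2, 0.3, -0.5])
```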

Future Work: Saliency Detection (hidden slide)
Incorporate saliency detection into the computational framework
– Certain regions are more relevant than others for human subjective judgment
– Can be viewed as a data selection or data weighting problem
Proposal: “Supervised” method
– Mimic the saliency trends of trained human evaluators
– Manually code a subset of the data at a finer-grained temporal scale
– Can inform which feature streams are more useful (“active” features during regions specified as relevant/salient by human evaluators)
– Can be used to train classifiers at the turn/utterance level
– Turn/utterance-level classifiers can then be mapped to session-level ones

Ambiguous Displays (hidden slide)
Saliency detection could lead to a more accurate representation of ambiguous displays of human behavior
– Move beyond just separating extreme behaviors
With the possibility of detecting codes at a shorter temporal scale
– Can specify how much a code is occurring in a session
– Provides a framework for a more “regression-like” continuous output

Ongoing Work
This work is part of a larger problem our lab is working on
– Behavioral Signal Processing (BSP)
New data collection [Rozgić et al. 2010]
– Designed a smart room with 10 HD cameras, 15 microphones, and motion capture
– Allows for multimodal analysis of interpersonal interactions
– Show video clip … point out the specific behavior we are targeting: approach/avoidance