Predicting Voice Elicited Emotions

Presentation transcript:

Predicting Voice Elicited Emotions Nishant Pandey

Synopsis
- Problem statement and motivation
- Previous work and background
- System intuition and overview
- Pre-processing of audio signals
- Building the feature space
- Finding patterns in unlabelled data and labelling of samples
- Regression
- Results
- Deployed system
- Market research

Problem Statement: To analyse voice and predict the listener emotions elicited by the paralinguistic elements of a voice.
Motivation:
- Automate the screening process in service-based industries.
- Hourly job workers make up two-thirds of the U.S. labour force, or ~50 million job seekers every year.
- Paralinguistic feature – tone: "how things are said is just as important as what is being said."
- This can be used in service-based industries, where a customer's emotional response to a worker's voice may affect the service outcome.

Previous Work
Two sets of goals, which include recognizing:
- the types of personality traits intrinsically possessed by the speaker, e.g. speaker traits and speaker state
- the types of emotions carried within the speech clip, e.g. acoustic affect (cheerful, trustworthy, deceitful, etc.)
Traits possessed by the speaker: age, gender, pronunciation, fluency, personality
Speaker state: affection, interest, stress
Emotions carried within the speech: sounds pleasant, cheerful, trustworthy
The current work focuses on predicting the emotions elicited by voice clips.

Background – Emotion Taxonomy
The framework articulated by FEELTRACE includes all the emotion responses we want to predict and describes emotions along finite, quantifiable dimensions.
- Finite dimensions such as active–passive and positive–negative help us map emotion characteristics to, and measure them by, known paralinguistic voice features.

Features – Paralinguistic Features of Voice (Concept | Definition | Data Representation)
- Amplitude | measurement of the variations over time of the acoustic signal | quantified values of a sound wave's oscillation
- Energy | acoustic signal energy | representation in decibels: 20*log10(abs(FFT))
- Formants | the resonance frequencies of the vocal tract | maxima detected using Linear Prediction on audio windows with high tonal content
- Perceived pitch | perceived fundamental frequency and harmonics
- Fundamental frequency | the reciprocal of the time duration of one glottal cycle – a strict definition of "pitch" | first formant
- Generally, frequency and energy variation are the major cues used to analyse emotions in voice samples.
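As a concrete illustration of the formant row above, the following is a minimal sketch of formant estimation via Linear Prediction; it assumes librosa is available, and the frame handling and LPC order are illustrative choices, not the authors' settings.

```python
# Hypothetical sketch: estimate formant candidates for one audio frame via LPC.
import numpy as np
import librosa

def estimate_formants(frame, sr, order=12):
    """Return candidate formant frequencies (Hz) for one audio frame."""
    a = librosa.lpc(frame, order=order)          # LPC polynomial coefficients
    roots = np.roots(a)                          # poles of the vocal-tract model
    roots = roots[np.imag(roots) > 0]            # keep one root per conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # root angle -> frequency in Hz
    return np.sort(freqs)                        # lowest values approximate F1, F2, ...
```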

System – Intuition
Our common everyday experience is that we can listen to speech and tell the emotions it elicits. As an example, consider the energy-level difference between two clips: one clip would leave the listener less engaged, and the other more engaged.
Spectrogram of two job applicants responding to "Greet me as if I am a customer".
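A minimal sketch (assuming librosa and matplotlib) of the spectrogram comparison behind this intuition: two clips answering the same prompt, plotted side by side so the energy difference is visible. File paths and titles are placeholders.

```python
# Plot two clips' spectrograms side by side to compare their energy patterns.
import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

def compare_spectrograms(path_engaging, path_flat):
    fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
    for ax, path, title in zip(axes, (path_engaging, path_flat),
                               ("More engaging clip", "Less engaging clip")):
        y, sr = librosa.load(path, sr=None)
        S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
        librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
        ax.set_title(title)
    plt.tight_layout()
    plt.show()
```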

System – Overview
- Record and sample raw voice clips
- Extract audio features that represent voice cues
- Construct a data feature space suitable for data mining and ML algorithms
- Build models using supervised/unsupervised learning
- Engineer scalable data processing pipelines which process clips and generate prediction scores

System – Pre-Processing of Audio Signals
Pre-processing tasks involve:
- Removing voice clips shorter than 2 seconds or containing noise
- Converting the audio signal to data in the time and frequency domains
- Short-term Fast Fourier Transform per frame
- Energy measures in the frequency domain per frame
- Linear prediction coefficients in the frequency domain per frame
Read more about FFT, energy measures, and LPC.
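A minimal sketch of this per-frame pre-processing under stated assumptions: librosa for loading, STFT, framing, and LPC; the frame length, hop size, and LPC order are illustrative defaults rather than the deployed values.

```python
# Sketch: drop short clips, then compute per-frame energy (dB) and LPC coefficients.
import numpy as np
import librosa

def preprocess(path, frame_len=1024, hop=512, lpc_order=12, min_dur=2.0):
    y, sr = librosa.load(path, sr=None, mono=True)
    if len(y) / sr < min_dur:                     # remove clips shorter than 2 seconds
        return None
    spec = librosa.stft(y, n_fft=frame_len, hop_length=hop)   # short-term FFT per frame
    energy_db = 20.0 * np.log10(np.abs(spec) + 1e-10)         # energy in dB: 20*log10(|FFT|)
    frames = librosa.util.frame(y, frame_length=frame_len, hop_length=hop)
    lpc = np.array([librosa.lpc(frames[:, i], order=lpc_order)  # LPC per frame
                    for i in range(frames.shape[1])])
    return energy_db, lpc
```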

System – Feature Space Construction
We experimented with feature construction based on the following dimensions and their combinations:
- Signal measurements such as energy and amplitude
- Statistics such as min, max, mean, and standard deviation on signal measurements
- Measurement window in the time domain: different window sizes and the entire time window
- Measurement window in the frequency domain: all frequencies, optimal audible frequencies, and selected frequency ranges
Based on existing research, we focused on voice energy features and constructed the feature space using statistical measures of energy attributes.
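An illustrative sketch (not the authors' exact code) of building a feature vector from frame-level energy: summary statistics over the whole clip plus the same statistics over fixed-size time windows. The window count and feature names are assumptions.

```python
# Build min/max/mean/std energy features globally and per time window.
import numpy as np

def energy_features(energy_db, n_windows=4):
    """energy_db: 2-D array (freq_bins x frames) of per-frame energy in dB."""
    per_frame = energy_db.mean(axis=0)                    # collapse the frequency axis
    feats = {"min": per_frame.min(), "max": per_frame.max(),
             "mean": per_frame.mean(), "std": per_frame.std()}
    # Same statistics on shorter time windows to capture local dynamics
    for i, chunk in enumerate(np.array_split(per_frame, n_windows)):
        feats.update({f"w{i}_mean": chunk.mean(), f"w{i}_std": chunk.std()})
    return feats
```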

System – Labels and the Right Set of Features?
Conventional approach: getting voice samples rated by experts.
Unsupervised learning: analyse features and their effectiveness.
Process:
- Unsupervised learning is used to find patterns in unlabelled data.
- Training data sets are then constructed based on the clustering results and manual labelling.
- Feature selection is analysed against clustering algorithms to determine the effectiveness of features.

System – How Do We Get the Labels? (Contd.)
- Clustering on paralinguistic features of voice; experimented with clustering algorithms and distance metrics
- Cost-function parameters: connectivity, Dunn index, silhouette
- Clustering results – technique: hierarchical clustering; number of clusters: 5
- Results evaluated on cluster quality measurements (compactness, good separation, connectedness, and stability)
- Manual validation (aural inspection) of samples within a cluster: are they meaningful or not?
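A sketch of this clustering step under stated assumptions: hierarchical clustering on the feature matrix with silhouette scoring via scipy/scikit-learn (the Dunn index and connectivity measures named above are not computed here). The linkage method and metric are placeholders.

```python
# Hierarchical clustering of voice-feature vectors, scored with the silhouette index.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

def cluster_clips(X, n_clusters=5, metric="euclidean", method="ward"):
    """X: (n_clips x n_features) feature matrix; returns cluster labels."""
    D = pdist(X, metric=metric)                     # pairwise distances
    Z = linkage(D, method=method)                   # hierarchical clustering
    labels = fcluster(Z, t=n_clusters, criterion="maxclust")
    print("silhouette:", silhouette_score(X, labels, metric=metric))
    return labels
```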

System – Visualization of clusters

System – Modelling
Supervised learning algorithms: Logistic Regression, Support Vector Machine, Random Forest
Semi-supervised learning algorithm: KODAMA
Output: binary outcome (positive or negative) and numerical scores
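A minimal sketch of the supervised stage with scikit-learn, using two of the model families named on the slide; the split, hyper-parameters, and function name are placeholders, not the deployed configuration.

```python
# Train SVM and Random Forest classifiers that output both labels and numerical scores.
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

def train_models(X, y):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    svm = SVC(probability=True).fit(X_tr, y_tr)
    rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    for name, model in (("SVM", svm), ("Random Forest", rf)):
        scores = model.predict_proba(X_te)[:, 1]   # numerical "engaging" score
        preds = model.predict(X_te)                # binary positive/negative outcome
        print(name, "accuracy:", (preds == y_te).mean())
    return svm, rf
```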

Case Study – Modelling
Prediction: positive vs. negative response. A positive response could be one or multiple perceptions of a "pleasant voice", "makes me feel good", "cares about me", "makes me feel comfortable", or "makes me feel engaged".
System V1 uses SVM; V2 uses Random Forest.
Interview prompt: "Greet me as if I am a customer."
Given a voice clip, our model predicts the degree to which a listener will find the voice 'engaging'.

System – Prediction Results
Accuracy: 0.86
95% CI: (0.76, 0.92)
P-Value [Acc > NIR]: 5.76e-07
Sensitivity: 0.81
Specificity: 0.88
Pos Pred Value: 0.81
Neg Pred Value: 0.88
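For reference, figures of this kind follow directly from a confusion matrix; below is a generic scikit-learn sketch of how they are derived, not the original evaluation script (which reports them in a caret-style format).

```python
# Derive accuracy, sensitivity, specificity, and predictive values from predictions.
from sklearn.metrics import confusion_matrix

def report(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    print("Accuracy       :", (tp + tn) / (tp + tn + fp + fn))
    print("Sensitivity    :", tp / (tp + fn))    # true positive rate
    print("Specificity    :", tn / (tn + fp))    # true negative rate
    print("Pos Pred Value :", tp / (tp + fp))
    print("Neg Pred Value :", tn / (tn + fn))
```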

System – Prediction Results (KODAMA)
KODAMA performs feature extraction from noisy and high-dimensional data. Its output includes a dissimilarity matrix from which we can perform clustering and classification.
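A sketch of one way to use a precomputed dissimilarity matrix (such as KODAMA's output) for clustering; it relies on scipy only, and the matrix itself is assumed to have been produced elsewhere.

```python
# Cluster samples directly from a symmetric dissimilarity matrix.
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_from_dissimilarity(D, n_clusters=2):
    """D: symmetric (n x n) dissimilarity matrix with zero diagonal."""
    Z = linkage(squareform(D, checks=False), method="average")
    return fcluster(Z, t=n_clusters, criterion="maxclust")
```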

Deployed System

Market Research
Demographics matter: young listeners (18–29 years old) and listeners with income less than $29,000/year have stricter criteria for what they find engaging.
No correlation between the emotion elicited and age, ethnicity, or education level.
Bias towards female voices.

Thanks

Time and Frequency Domain
Time domain: https://en.wikipedia.org/wiki/Time_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif
Frequency domain: https://en.wikipedia.org/wiki/Frequency_domain#/media/File:Fourier_transform_time_and_frequency_domains_(small).gif

Learnings – Differences in Voice Characteristics
Results improve by 10% when a decision tree built on voice-characteristic features is layered on top of the Random Forest.
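A hedged sketch of one possible reading of this "layering" idea: a second-stage decision tree that combines the Random Forest's score with voice-characteristic features. The exact scheme and the 10% figure come from the slide, not from this code, and all names here are illustrative.

```python
# Two-stage model: Random Forest score fed, with voice-characteristic features,
# into a shallow decision tree that makes the final call.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

def layered_model(X_train, voice_char_train, y_train):
    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_train, y_train)
    rf_score = rf.predict_proba(X_train)[:, [1]]             # first-stage score
    second_stage_inputs = np.hstack([rf_score, voice_char_train])
    tree = DecisionTreeClassifier(max_depth=3, random_state=0)
    tree.fit(second_stage_inputs, y_train)                   # second-stage arbiter
    return rf, tree
```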

Prediction Results – SVM vs Random Forest