Recommendations Based on Speech Classification


Recommendations Based on Speech Classification (and examples of what recommender systems can learn from signal processing) Christian Müller German Research Center for Artificial Intelligence International Computer Science Institute, Berkeley, CA

Overview
Speech as a source of information for non-intrusive user modeling; speech/signal processing; take-away messages:
- Vocal aging -> features for speaker age recognition (knowledge-driven feature selection)
- GMM/SVM supervector approach for acoustic speech features (classification methods for independent "bag of observations" features)
- Detection task and pseudo-NIST evaluation procedure (valid application-independent evaluation)
- Rank and polynomial rank normalization (feature space warping normalization)

Speech as a Source for Non-Intrusive User Modeling
Example: the system tells the user "Now it's time to get to gate 38." Information about the user feeds a user model; the adaptive speech dialog system provides recommendations (e.g. a different route to the gate) and adapts its dialog behavior (e.g. detailed map with shops vs. arrows).
(A) Speaker classification: speech acts as a sensor, and information about the user is inferred from it (not intrusive). (B) Explicit statement by the user (intrusive).

Speaker Classification Systems (input: audio segment, telephone quality)
- Cognitive load: Best Research Paper Award, UM 2001
- Age and gender: Voice Award 2007; Telekom live operation 2009
- Language: 14 languages + dialects; NIST evaluation 2007
- Identity: project with BKA 2009; NIST evaluation 2008
- Acoustic events: project with VW 2008; Interspeech 2008

Recommendations Based on Speech Classification
Speaker attributes (age, gender, emotions, language, dialect, accent, identity, acoustic events) inform recommendations of products, media, services, actions, and strategies.

Product Recommendations Based on Age and Gender

Product Recommendations Based on Age and Gender
Michael Feld and Christian Müller. Speaker Classification for Mobile Devices. In Proceedings of the 2nd IEEE International Interdisciplinary Conference on Portable Information Devices (Portable 2008), 2008.

How can you find features for building your models by explicitly studying the underlying phenomena? Proposing knowledge-driven feature selection, on the example of features for speaker age recognition.

Speaker Classification as an Interdisciplinary Area of Research
How can the age (and gender) of a speaker be recognized automatically? What are the manifestations of age (and gender) in the speaker's voice and speaking style? What are the requirements of a speaker classification system, and how can they be met at the implementation layer? Contributing fields: speech technology / artificial intelligence, phonetics, voice pathology, software technology.

Impact of Aging on the Human Speech Production: Speech Breathing
Changes: thorax stiffer; lungs lighter, less elastic, in a lower position.
Effects: lower expirational volume, more speech pauses, lower amplitude.
(Demo utterance on the slide: "Wir kommen nun zu Gliederungspunkt zwei: den empirischen Studien." ["We now come to outline point two: the empirical studies."])

Impact of Aging on the Human Speech Production: Laryngeal Area
Changes: calcification and ossification of the larynx; loss of tissue and stiffening of the vocal folds.
Effects: rise of fundamental frequency (in men), reduced voice quality.

Impact of Aging on the Human Speech Production: Supralaryngeal Area
Changes: degeneration of facial bones and muscles, reduced elasticity.
Effects: imprecise articulation, for example vowel centralization.

Impact of Aging on the Human Speech Production: Neurological Effects
Changes: loss of tissue in the cortex, reduced performance of the neuronal transmitters.
Effects: reduced articulation rate, defective coordination between the articulators, vowel centralization.

Development of F0 in Men / Women
[Figure: F0 (Hz, roughly 90-170) over age in years (20-80); men (non-smokers only) vs. women (smokers and non-smokers); after Linville (2001).]

Age Classes (F = female, M = male)
Children (CF / CM): <= 13 years
Youth (YF / YM): 14-19 years
Adults (AF / AM): 20-64 years
Seniors (SF / SM): >= 65 years

Features
Fundamental frequency (pitch): mean (pitch_mean), standard deviation (pitch_stddev), min, max and their difference (pitch_min / pitch_max / pitch_diff)
Voice quality: shimmer (shim_l / shim_ldb / shim_apq3 / shim_apq11 / shim_ddp), jitter (jitt_l / jitt_la / jitt_rap / jitt_ppq / jitt_ddp), harmonics-to-noise ratio (harm_mean / harm_stddev)
Articulation rate: ar_rate
Speech pauses: pause_num / pause_dur
[Speaker notes, translated from German:] The voice and speaking-style features examined were: first, the fundamental frequency (pitch) with the statistical derivatives mean, standard deviation, minimum, maximum, and the difference between the two. The standard deviation captures global frequency fluctuations, usually referred to as tremor; minimum and maximum indicate the speaker's frequency range. Further, various voice-quality measures were examined, such as shimmer (micro-variations of the F0 amplitude) and jitter (micro-variations of the F0 frequency), in several variants that differ mainly in whether they measure relative or absolute perturbations. As a further voice-quality measure, the harmonics-to-noise ratio was examined, i.e. the ratio of voiced to unvoiced components in the signal. Further features were intensity as well as the articulation rate and the number and duration of speech pauses. Articulation rate and speech pauses are called speaking-style features, while the remaining ones are called voice features.

Features (grouped)
Voice: fundamental frequency (pitch) with mean, standard deviation, min, max and difference; voice quality with shimmer, jitter, and harmonics-to-noise ratio.
Speaking style: articulation rate; speech pauses.

Example Results
[Figure panels: jitter per class (a high jitter value = low voice quality), fundamental frequency (F0) per class, and speech pauses per class, for the classes CF, CM, YF, YM, AF, AM, SF, SM and the groupings C_YF, AF, SF, YM_AM_SM.]
Christian Müller. Zweistufige kontextsensitive Sprecherklassifikation am Beispiel von Alter und Geschlecht [Two-Layered Context-Sensitive Speaker Classification on the Example of Age and Gender]. AKA, Berlin, 2006.

Hierarchical Feature Model
High-level features (learned characteristics): semantics, dialog, idiolect (e.g. "<s> how shall I say this <c> <s> yeah I know ..."), phonetics (/S/ /oU/ /m/ /i:/ /D/ /&/ /m/ /n/ /i:/ ...).
Low-level features (physical characteristics): prosody, spectrum.

How can your features be modeled, assuming that they are multi-dimensional, represent repeated observations of the same kind, and can be assumed to be independent (a "bag" of observations)? Proposing the GMM/SVM supervector approach, on the example of frame-by-frame acoustic features.

General Classification Scheme
Pipeline: Preprocessing (e.g. channel compensation; not addressed in this talk) -> Feature Extraction -> Classification (e.g. support vector machines, multilayer perceptron networks) -> Fusion, guided by top-down knowledge.

Modeling Acoustics and Prosodics
In the feature hierarchy, the high-level features (semantics, dialog, idiolect, phonetics) would require ASR; here, no ASR is used, and the low-level features prosody and spectrum are modeled directly.

Generative Approach: Gaussian Mixture Model (GMM)
Training: feature extraction on frames of speech labeled "emergency vehicle" yields a probability density, the "emergency vehicle" model.
Test: feature extraction on each frame of speech; the score is the average likelihood over all frames under the "emergency vehicle" model.

Generative Approach: Gaussian Mixture Model (GMM), with Background Model
Test: feature extraction on each frame of speech; the score is the average log-likelihood ratio over all frames between the "emergency vehicle" model and a background model.

A Mixture of Gaussians
Means, variances, and mixture weights are optimized in training. [Figure: the black line is a mixture of 3 Gaussians.]
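The two GMM slides above can be sketched in a few lines. This is a minimal, pure-Python illustration with invented one-dimensional toy parameters (a 3-component "emergency vehicle" mixture and a broad single-Gaussian background); real systems work on multi-dimensional MFCC frames.

```python
import math

def gauss_pdf(x, mean, var):
    """Density of a 1-D Gaussian at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Mixture density: weighted sum of the component Gaussians."""
    return sum(w * gauss_pdf(x, m, v)
               for w, m, v in zip(weights, means, variances))

def avg_log_likelihood_ratio(frames, target_gmm, background_gmm):
    """Detection score: mean per-frame log-likelihood ratio of the
    target model against the background model (as on the slide)."""
    return sum(math.log(gmm_pdf(x, *target_gmm)) -
               math.log(gmm_pdf(x, *background_gmm))
               for x in frames) / len(frames)

# Toy 3-component target mixture and a broad background model (invented numbers).
target = ([0.5, 0.3, 0.2], [0.0, 2.0, 4.0], [1.0, 0.5, 1.5])
background = ([1.0], [2.0], [9.0])

frames = [0.1, -0.2, 0.4, 0.0]   # toy "frames" near the target's first mode
score = avg_log_likelihood_ratio(frames, target, background)
```

A positive score means the frames are better explained by the target model than by the background model; the decision threshold is then an application choice (see the decision-error tradeoff below in the talk).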

Discriminative Method: Support Vector Machine (SVM)
Training: feature extraction on segments labeled "emergency vehicle" (+1) and "not emergency vehicle" (-1).
Features are transformed into a higher-dimensional space where the problem is linear. The discriminating hyperplane is learned by solving a constrained optimization problem that trades off training error against the width of the margin. The model is stored in the form of "support vectors" (the data points on the margin).

Discriminative Method: Support Vector Machine (SVM), Test
Test: feature extraction; the score is the distance to the hyperplane.
Discriminative methods have been shown to be superior to generative methods on similar tasks. Feature vectors have to be of the same length (the approach is sensitive to variable segment lengths). Solutions: feature statistics calculated over the entire utterance; a fixed portion of the segment; sequential kernels.

GMM/SVM Supervector Approach
Feature extraction -> a GMM is trained using MAP adaptation -> the MAP-adapted Gaussian means are stacked into a "supervector" that serves as SVM input.
Combines the discriminative power of SVMs with the length independence of GMMs; very successful on similar tasks such as speaker recognition.
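The core of the supervector idea is relevance-MAP adaptation of the universal background model (UBM) means, after which the adapted means form one fixed-length vector per utterance regardless of how many frames it has. A minimal one-dimensional sketch, with an invented 2-component UBM and toy frames (real systems adapt multivariate means and a relevance factor tuned on data):

```python
import math

def map_adapt_means(frames, ubm_means, ubm_vars, ubm_weights, relevance=16.0):
    """Relevance-MAP adaptation of the UBM component means (means only,
    as in the supervector approach). frames: list of 1-D toy observations."""
    K = len(ubm_means)
    n = [0.0] * K    # soft counts per component
    ex = [0.0] * K   # first-order statistics per component
    for x in frames:
        lik = [w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
               for w, m, v in zip(ubm_weights, ubm_means, ubm_vars)]
        total = sum(lik)
        for k in range(K):
            gamma = lik[k] / total      # posterior responsibility
            n[k] += gamma
            ex[k] += gamma * x
    # Interpolate between the utterance's data mean and the UBM mean:
    # components that saw little data stay close to the UBM.
    adapted = []
    for k in range(K):
        alpha = n[k] / (n[k] + relevance)
        data_mean = ex[k] / n[k] if n[k] > 0 else ubm_means[k]
        adapted.append(alpha * data_mean + (1 - alpha) * ubm_means[k])
    return adapted

# Toy UBM (weights, means, variances) and one short utterance.
ubm = ([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
utterance = [1.8, 2.2, 2.4, 1.9, 2.1]
# The supervector: stacked adapted means, fixed length for any utterance length.
supervector = map_adapt_means(utterance, ubm[1], ubm[2], ubm[0])
```

Here all frames sit near the second component, so its mean shifts toward the utterance while the first component's mean stays near the UBM value; the resulting fixed-length supervector is what the SVM classifies.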

Evaluation Results
Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, "Speech-overlapped Acoustic Event Detection for Automotive Applications," in Proceedings of Interspeech 2008, Brisbane, Australia, 2008.

How can you evaluate your multi-class models independently of the given application? How can you establish an appropriate evaluation procedure in order to obtain valid results? Proposing the detection task and the "pseudo-NIST" evaluation procedure, on the example of acoustic event detection and speaker age recognition.

Background
With multi-class recognition problems, many test/analysis methods are very application-specific, e.g. confusion matrices. We want a method that allows results to be generalized across a large set of applications.
With home-grown databases, parameter tuning on the evaluation set often compromises the validity of the results/inferences. We want a fair "one-shot" evaluation.

The Detection Task
Given a speech segment s and an acoustic event to be detected (the target event, ET), the task is to decide whether ET is present in s (yes or no). The system's output shall also contain a score indicating its confidence, with more positive scores indicating greater confidence. Example output for the target "emergency vehicle": "yes, 1.324326".

Terminology
Segment class: the ground truth (not known to the system), e.g. segment event or segment age class.
Target: the hypothesized class.
Trial: a combination of segment and target.

Evaluation
The system performance is evaluated by presenting it with a set of trials. Each test segment is used for multiple trials (targets such as emergency vehicle, music, talking, laughing, phone), and the absence of all targets ("no event") is explicitly included as a trial. Example outputs: yes 1.32432; no -0.3212; no -2.5773; yes 0.00132.

Type of Errors
MISS: the target is present but the system says no (e.g. segment "emergency vehicle", target "emergency vehicle", system output "no").
FALSE ALARM: the target is absent but the system says yes (e.g. segment "emergency vehicle", target "phone", system output "yes").

Decision-Error Tradeoff
Selecting an operating point (decision threshold) along the DET curve trades misses off against false alarms; the point where both rates coincide is the "equal error rate" (EER). The optimal operating point is application-dependent; low false-alarm rates are desirable for most applications.
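The miss/false-alarm tradeoff and the EER can be computed directly from the trial scores by sweeping the decision threshold. A small self-contained sketch with invented toy scores (a real evaluation would sweep all operating points to draw the full DET curve):

```python
def miss_fa_rates(target_scores, nontarget_scores, threshold):
    """Miss rate and false-alarm rate at a given decision threshold
    (decide "yes" when score >= threshold)."""
    miss = sum(s < threshold for s in target_scores) / len(target_scores)
    fa = sum(s >= threshold for s in nontarget_scores) / len(nontarget_scores)
    return miss, fa

def equal_error_rate(target_scores, nontarget_scores):
    """Sweep candidate thresholds; the EER is where miss ~= false alarm."""
    best = (1.0, 0.0)   # (|miss - fa|, eer candidate)
    for t in sorted(target_scores + nontarget_scores):
        miss, fa = miss_fa_rates(target_scores, nontarget_scores, t)
        if abs(miss - fa) < best[0]:
            best = (abs(miss - fa), (miss + fa) / 2)
    return best[1]

# Toy trial scores: targets tend to score high, non-targets low, with overlap.
targets = [2.1, 1.7, 0.9, 1.4, -0.2, 1.1, 0.8, 1.9]
nontargets = [-1.5, -0.9, 0.3, -1.1, 1.0, -0.4, -1.8, -0.7]
eer = equal_error_rate(targets, nontargets)
```

With these toy scores one target and one non-target fall on the wrong side of the best threshold, giving an EER of 1/8.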

Decision Cost Function
C(ET, EN) = CMiss · PTarget · PMiss(ET) + CFA · (1 − PTarget) · PFA(ET, EN)
where ET and EN are the target and non-target events, and CMiss, CFA and PTarget are application model parameters: a weighted sum of misses and false alarms using variable costs and priors, with the parameters selected according to the application. The application parameters for the EER are CMiss = CFA = 1 and PTarget = 0.5.
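The cost function above is a one-liner; the example application parameters below (expensive false alarms, rare targets) are invented for illustration.

```python
def detection_cost(p_miss, p_fa, c_miss=1.0, c_fa=1.0, p_target=0.5):
    """C(ET, EN) = CMiss * PTarget * PMiss(ET) + CFA * (1 - PTarget) * PFA(ET, EN)."""
    return c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa

# With CMiss = CFA = 1 and PTarget = 0.5 (the EER parameterization), the
# cost at the equal-error operating point equals the EER itself.
eer_cost = detection_cost(0.125, 0.125)
# A hypothetical application where false alarms are 10x as costly and
# targets occur in only 10% of trials.
app_cost = detection_cost(0.125, 0.02, c_fa=10.0, p_target=0.1)
```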

Example DET Plot
[Figure: miss probability vs. false-alarm probability.] Christian Müller, Joan-Isaac Biel, Edward Kim, and Daniel Rosario, "Speech-overlapped Acoustic Event Detection for Automotive Applications," in Proceedings of Interspeech 2008, Brisbane, Australia, 2008.

Example Cost Chart
[Table: pairwise detection costs C(At, An) of the acoustic GMM/SVM supervector system on the 7-class age task (classes C, YF, YM, AF, AM, SF, SM), with individual costs roughly between 0.06 and 0.26 and per-target average costs such as 0.146, 0.164, 0.147, and 0.122.]

Pseudo-NIST Evaluation Procedure
ERL provided development and evaluation data as representative as possible of the application. Three months before the evaluation, ICSI was provided with the development data. At a pre-determined date, the blind evaluation data was provided to ICSI for processing. The system's output was submitted to ERL in NIST format. ERL downloaded the scoring software from NIST's website, made the necessary modifications due to the changes in the labels, and ran the software on the submitted system output. The results were then disclosed to ICSI along with the keys (truth) for further analysis.
-> A fair "one-shot" evaluation: no parameter tuning on the evaluation set.

How can you normalize your features in order to obtain a uniform scale and a uniform distribution? Proposing rank normalization and polynomial rank normalization.

Background
Features live on very different scales, e.g. fundamental frequency (pitch): 75-200 Hz vs. jitter: 0.001324 PPQ -> leaving them unnormalized causes an implicit feature weighting.

Min-Max Normalization
ai = (vi − min(v)) / (max(v) − min(v))
-> uniform scale, but non-uniform distribution.
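A minimal sketch of the slide's formula, with invented toy jitter values; it shows why the slide calls the result non-uniform: the scale becomes uniform, but a single outlier compresses all other values into a narrow band.

```python
def min_max_normalize(values):
    """Map each value to [0, 1] by a_i = (v_i - min(v)) / (max(v) - min(v))."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

# Toy jitter values with one outlier: after scaling, the first four values
# are squeezed near 0 while the outlier alone occupies the rest of [0, 1].
jitter = [0.0010, 0.0012, 0.0011, 0.0013, 0.0500]
scaled = min_max_normalize(jitter)
```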

Rank Normalization
Create an ordered list of feature values using background data; the rank of a value is its position in the list divided by the number of values, and a value with no occurrence is mapped to 0. Example (background model values -> ranks): 0 -> 0, 0.01 -> 0.25, 0.06 -> 0.5, 0.13 -> 0.75, 0.29 -> 1.
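The lookup-table construction above can be sketched with a sorted background list and binary search. This follows the slide's example mapping (smallest background value -> 0, largest -> 1); it is a simplified stand-in that snaps unseen values to the nearest rank rather than interpolating, which the next slide notes real systems must handle.

```python
import bisect

def make_rank_normalizer(background_values):
    """Build a rank-normalization function from background data:
    rank = position in the sorted background list, scaled to [0, 1]
    (matching the slide's example: 0 -> 0, 0.01 -> 0.25, ..., 0.29 -> 1)."""
    ordered = sorted(background_values)
    n = len(ordered)
    def normalize(v):
        # bisect_left counts background values strictly below v;
        # out-of-range values are clamped to [0, 1].
        return min(1.0, bisect.bisect_left(ordered, v) / (n - 1))
    return normalize

background = [0.0, 0.01, 0.06, 0.13, 0.29]   # background values from the slide
rank = make_rank_normalizer(background)
```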

Rank Normalization: Properties
(+) uniform distribution
(−) large three-dimensional lookup tables
(−) requires linear interpolation for unseen values, and the handling of values larger or smaller than anything in the background data is unclear.

Polynomial Rank Normalization
Use the ranks to train a polynomial, then apply the polynomial instead of the lookup tables: better interpolation, and no need to store the lookup tables.
Christian Müller and Joan-Isaac Biel. The ICSI 2007 Language Recognition System. In Proceedings of the Odyssey 2008 Workshop on Speaker and Language Recognition, Stellenbosch, South Africa, 2008.
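A pure-Python sketch of the idea: fit a low-degree polynomial to the (background value, rank) pairs and apply it in place of the lookup table. The least-squares solver via normal equations is a stand-in for a library `polyfit`, the degree is an arbitrary choice, and the background values are the toy numbers from the rank-normalization slide.

```python
def polyfit(xs, ys, degree=3):
    """Least-squares polynomial fit via the normal equations
    (Gaussian elimination with pivoting; fine for low degrees on toy data)."""
    m = degree + 1
    # Build X^T X and X^T y for the Vandermonde system.
    a = [[sum(x ** (i + j) for x in xs) for j in range(m)] for i in range(m)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(m)]
    for col in range(m):                       # forward elimination
        piv = max(range(col, m), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = a[r][col] / a[col][col]
            a[r] = [arj - f * acj for arj, acj in zip(a[r], a[col])]
            b[r] -= f * b[col]
    coeffs = [0.0] * m
    for r in range(m - 1, -1, -1):             # back substitution
        coeffs[r] = (b[r] - sum(a[r][j] * coeffs[j]
                                for j in range(r + 1, m))) / a[r][r]
    return coeffs

def polyval(coeffs, x):
    return sum(c * x ** i for i, c in enumerate(coeffs))

# Train: map the background values to their ranks and fit the polynomial;
# apply: evaluate the polynomial (clamped to [0, 1]) instead of a lookup table.
background = sorted([0.0, 0.01, 0.06, 0.13, 0.29])
ranks = [i / (len(background) - 1) for i in range(len(background))]
coeffs = polyfit(background, ranks)
norm = min(1.0, max(0.0, polyval(coeffs, 0.10)))   # smooth interpolation
```

The polynomial interpolates smoothly between seen values and extrapolates gracefully near the edges, which is exactly the advantage claimed over the lookup tables.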

Conclusions
Speech is a source of information for non-intrusive user modeling. Take-away messages from the speech/signal processing side: vocal aging -> features for speaker age recognition (knowledge-driven feature selection); GMM/SVM supervector approach for acoustic speech features (classification methods for independent "bag of observations" features); detection task and pseudo-NIST evaluation procedure (valid application-independent evaluation); rank and polynomial rank normalization (feature space warping normalization).

Thank you!