Toshiba Update 04/09/2006 Data-Driven Prosody and Voice Quality Generation for Emotional Speech Zeynep Inanoglu & Steve Young Machine Intelligence Lab.

Presentation transcript:

Toshiba Update 04/09/2006 Data-Driven Prosody and Voice Quality Generation for Emotional Speech Zeynep Inanoglu & Steve Young Machine Intelligence Lab CUED

Toshiba Update 04/09/2006 Project Goal Restated
Given a neutral utterance and all the known features that can be extracted from a TTS framework, convert the utterance into a specified emotion with zero degradation in quality.
– Particularly useful for unit-selection synthesizers with good-quality neutral output.
What can be modified?
– Prosody (F0, duration, prominence structure)
– Voice quality (spectral features relating to the voice source and filter)
Method: data-driven learning and generation (decision trees, HMMs).
This presentation addresses both prosody generation and voice quality modification.

Toshiba Update 04/09/2006 Data Specifics – Female Speaker
Parallel data used: 595 utterances of each emotion; 545 utterances were used for training the different modules, and 50 were set aside as the test set.
Features extracted:
– Text-based features: phone identity, lexical stress, syllable position, word length, word position, part of speech.
– Perceptual features (intonation units): ToBI-based syllable labels {alh, ah, al, c}, representing three accent types and one symbol for unaccented syllables; extracted automatically by Prosodizer.
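A minimal sketch of how one syllable's context features and intonation-unit label could be represented; the field names and encoding below are illustrative assumptions, not the corpus's actual format.

```python
# Hypothetical per-syllable record for the parallel corpus.
# Field names are illustrative, not the original encoding.
syllable_features = {
    "phone_identity": "ae",    # central vowel of the syllable
    "lexical_stress": 1,       # 1 = stressed, 0 = unstressed
    "syllable_position": 2,    # position of the syllable within its word
    "word_length": 3,          # number of syllables in the word
    "word_position": 5,        # position of the word in the utterance
    "part_of_speech": "NN",
    "intonation_unit": "alh",  # ToBI-based label: alh / ah / al / c
}
```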

Toshiba Update 04/09/2006 Step 1: Convert Intonation Unit Sequence
ASSUMPTION: Each emotion has an intrinsic pattern in its intonation unit sequence (e.g. surprised utterances use alh much more frequently than neutral).
For each input neutral unit and its context, we want to find the corresponding unit in a target emotion.
Training: use decision trees trained on parallel data.
– This is similar to sliding a decision tree along the utterance and generating an output at each syllable.
[Diagram: the neutral sequence { alh c c c ah c al c } plus text-based features are passed through sequence conversion to give the target-emotion sequence { alh c alh c ah c alh c }.]
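A minimal sketch of this per-syllable conversion, assuming a toy parallel corpus and scikit-learn in place of the original tooling; the context window and tree settings are assumptions for illustration only.

```python
# Sketch of Step 1: slide a decision tree along the utterance, emitting one
# target-emotion intonation unit per syllable from its neutral-unit context.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

def context_windows(units, window=1, pad="pad"):
    """For each syllable, collect the neutral units in a +/- window context."""
    return [[units[i + k] if 0 <= i + k < len(units) else pad
             for k in range(-window, window + 1)]
            for i in range(len(units))]

# Toy parallel data: one neutral utterance and its surprised counterpart.
neutral   = ["alh", "c", "c", "c", "ah", "c", "al", "c"]
surprised = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]

enc = OneHotEncoder(handle_unknown="ignore")
X = enc.fit_transform(context_windows(neutral))
tree = DecisionTreeClassifier(max_depth=5).fit(X, surprised)

# Conversion: one predicted unit per syllable of a new neutral sequence.
new_neutral = ["c", "alh", "c", "c", "ah", "c"]
converted = tree.predict(enc.transform(context_windows(new_neutral)))
print(list(converted))
```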

Toshiba Update 04/09/2006 Step 1: Sequence Conversion Results
Sequence prediction accuracy is computed on the test data by counting the units that match between the converted and target sequences (substitution error).
As a benchmark, the unchanged neutral sequence is also compared to the target sequence.
Sequence conversion improves accuracy for happy, angry and surprised, and does not change the results for sad.
[Table: Sequence Prediction Accuracy (%) for the neutral sequence vs. the converted sequence, per emotion (happy, sad, angry, surprised); the numeric values were not preserved in the transcript.]
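A short sketch of the per-unit accuracy measure described above (an illustrative helper, not the original evaluation script):

```python
# Fraction of syllables whose predicted unit matches the target, used both for
# the converted sequence and for the unchanged-neutral benchmark.
def unit_accuracy(predicted, target):
    assert len(predicted) == len(target)
    matches = sum(p == t for p, t in zip(predicted, target))
    return 100.0 * matches / len(target)

target    = ["alh", "c", "alh", "c", "ah", "c", "alh", "c"]
neutral   = ["alh", "c", "c",   "c", "ah", "c", "al",  "c"]
converted = ["alh", "c", "alh", "c", "ah", "c", "ah",  "c"]

print(unit_accuracy(neutral, target))    # benchmark: unchanged neutral sequence
print(unit_accuracy(converted, target))  # converted sequence
```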

Toshiba Update 04/09/2006 Step 2: Intonation Model Training
Each syllable's intonation is modelled as a three-state left-to-right HMM. Each model is highly context sensitive.
– An example full-context model was shown on the slide.
Syllable models are trained on interpolated F0 and energy values derived from the laryngograph signal, together with first- and second-order differentials: (F0, E, ΔF0, ΔΔF0, ΔE, ΔΔE).
Decision-tree-based parameter tying was performed.
[Diagram: the intonation-unit sequence { alh c alh c … } plus text-based features are fed to the intonation models, which output an F0 contour.]
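A minimal sketch of one such syllable model, assuming hmmlearn in place of the HTK/HTS tooling actually used, and omitting context clustering and state tying; the toy data and feature ordering are assumptions.

```python
# One context-dependent syllable model: a 3-state left-to-right Gaussian HMM
# over frames of static F0/energy plus their differentials.
import numpy as np
from hmmlearn import hmm

def add_deltas(static):
    """static: (T, 2) array of interpolated F0 and energy per frame."""
    d1 = np.gradient(static, axis=0)     # first-order differentials
    d2 = np.gradient(d1, axis=0)         # second-order differentials
    return np.hstack([static, d1, d2])   # (T, 6): F0, E, dF0, dE, ddF0, ddE

# Toy training data: a few realisations of one syllable-in-context.
rng = np.random.default_rng(0)
examples = [add_deltas(rng.normal([120.0, 0.6], 5.0, size=(t, 2)))
            for t in (12, 15, 10)]
X = np.vstack(examples)
lengths = [len(e) for e in examples]

model = hmm.GaussianHMM(n_components=3, covariance_type="diag",
                        init_params="mc", params="mc")  # keep topology fixed
model.startprob_ = np.array([1.0, 0.0, 0.0])            # always start in state 0
model.transmat_ = np.array([[0.5, 0.5, 0.0],            # left-to-right topology
                            [0.0, 0.5, 0.5],
                            [0.0, 0.0, 1.0]])
model.fit(X, lengths)
print(model.means_)   # per-state mean vectors used later for F0 generation
```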

Toshiba Update 04/09/2006 Step 2: Model-Based Intonation Generation
The goal is to generate an optimal sequence of F0 values directly from the context-sensitive syllable HMMs, given the intonation unit sequence.
Without constraints this simply results in a sequence of mean state values.
The cepstral parameter generation algorithm of the HTS system (Tokuda et al., 1995) is therefore used for interpolated F0 generation.
Differential F0 features are used as constraints in contour generation, resulting in smoother contours.
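A hedged sketch of the underlying maximisation, stated in the standard form of HMM parameter generation (Tokuda et al., 1995); the notation here is assumed rather than taken from the slide:

\[
\hat{\mathbf{c}} \;=\; \arg\max_{\mathbf{c}} \; P(\mathbf{o} \mid \mathbf{q}, \lambda)
\quad \text{subject to} \quad \mathbf{o} = \mathbf{W}\mathbf{c},
\]

where \(\mathbf{c}\) stacks the static F0 values, \(\mathbf{W}\) is the window matrix that appends the delta and delta-delta features, and \(\mathbf{q}\) is the state sequence. Under single-Gaussian state output distributions the solution is

\[
\hat{\mathbf{c}} \;=\; \left(\mathbf{W}^{\top}\mathbf{U}^{-1}\mathbf{W}\right)^{-1}\mathbf{W}^{\top}\mathbf{U}^{-1}\boldsymbol{\mu},
\]

with \(\boldsymbol{\mu}\) and \(\mathbf{U}\) the stacked state means and covariances; without \(\mathbf{W}\) this collapses to the sequence of state means mentioned above.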

Toshiba Update 04/09/2006 Step 2: Model-Based Intonation Generation
It is difficult to obtain an objective measure of F0 similarity that is perceptually relevant.
A simple approach is to align the target and the model-generated contours so that they both have N pitch points, and to measure the RMS error per utterance.
The average RMS error over all generated test contours was reported; the error between the neutral contour and the target contour was given as a benchmark.
[Table: RMSE in Hz for surprised, angry, happy and sad; rows RMSE(neutral, target) and RMSE(gen, target); numeric values were not preserved in the transcript.]
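A sketch of this alignment-and-RMSE measure; the value of N and the use of linear resampling are assumptions for illustration.

```python
# Resample both contours to n_points before comparing, then report RMSE in Hz.
import numpy as np

def rmse_aligned(f0_a, f0_b, n_points=100):
    """Linearly resample two voiced-F0 contours to n_points and return RMSE."""
    xa = np.linspace(0.0, 1.0, len(f0_a))
    xb = np.linspace(0.0, 1.0, len(f0_b))
    xn = np.linspace(0.0, 1.0, n_points)
    a = np.interp(xn, xa, f0_a)
    b = np.interp(xn, xb, f0_b)
    return float(np.sqrt(np.mean((a - b) ** 2)))

target_f0    = np.linspace(180, 260, 57)   # toy target contour
generated_f0 = np.linspace(175, 250, 63)   # toy model-generated contour
print(rmse_aligned(generated_f0, target_f0))
```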

Toshiba Update 04/09/2006 Step 3: Duration Tree Training
A decision tree was built for each voiced broad class: vowels, nasals, glides and fricatives.
All text-based features and intonation units were used to build the trees; the features that were most significant varied with emotion and phone class.
For each test utterance a duration tier was constructed by taking the ratio of predicted duration to neutral duration.
[Diagram: text-based features and the intonation-unit sequence { alh c alh c … } are fed to the duration trees, which output a duration tier.]
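A small sketch of building the duration tier as per-phone ratios; the tree prediction is stubbed out and the (phone, duration) format is an assumption for illustration.

```python
# Duration tier = predicted duration / neutral duration, one factor per phone.
def duration_tier(phones, neutral_durations, predict_duration):
    """Return per-phone scaling factors from a duration predictor."""
    tier = []
    for phone, neutral_dur in zip(phones, neutral_durations):
        predicted = predict_duration(phone)   # e.g. a per-class decision tree
        tier.append((phone, predicted / neutral_dur))
    return tier

# Toy example: pretend the "sad" trees stretch vowels and nasals slightly.
fake_tree = {"ae": 0.14, "n": 0.09, "l": 0.07}.get
print(duration_tier(["ae", "n", "l"], [0.11, 0.08, 0.07], fake_tree))
```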

Toshiba Update 04/09/2006 Step 3: Duration Trees – Evaluation
Assume a Poisson distribution at the leaves of the decision tree,
P(d | λ) = λ^d e^(−λ) / d!,
where λ is the mean of the leaf node (the predicted duration).
The performance of the duration trees is measured by using them as a classifier:
– How likely is it for happy durations to be generated by the neutral/happy/sad/… trees?
We want the diagonal entries to be the minimum (most likely) of each row.
[Table: mean log likelihood of the test data under each emotion's decision tree (columns: neutral, happy, sad, surprised, angry) for each set of actual durations (rows: neutral, happy, sad, surprised, angry); numeric values were not preserved in the transcript.]
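A sketch of this Poisson log-likelihood check; the toy numbers below and the assumption that durations are expressed as integer frame counts are mine, not the slide's.

```python
# Score observed durations against each emotion's predicted leaf means under
# a Poisson distribution with mean lambda.
import numpy as np
from scipy.stats import poisson

def mean_log_likelihood(observed_frames, predicted_means):
    """Average log P(d | lambda) over the test phones for one emotion's trees."""
    return float(np.mean(poisson.logpmf(observed_frames, predicted_means)))

observed = np.array([9, 14, 7, 11])            # actual "happy" durations (frames)
trees = {"neutral": np.array([7, 10, 6, 9]),   # leaf means predicted by each tree
         "happy":   np.array([9, 13, 7, 11]),
         "sad":     np.array([12, 17, 9, 14])}

for emotion, lam in trees.items():
    print(emotion, mean_log_likelihood(observed, lam))
```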

Toshiba Update 04/09/2006 Overview of Run-Time System
[Diagram: the neutral intonation-unit sequence { alh c c c ah c al c } and text-based features enter the sequence conversion module, which outputs the target-emotion sequence { alh c alh c ah c alh c }; this sequence and the text-based features drive the intonation models (producing the F0 contour) and the duration trees (producing the duration tier); TD-PSOLA (Praat) then applies both.]

Toshiba Update 04/09/2006 Prosodic Conversion – Samples
[Audio samples were played for the neutral, happy, sad, surprised and angry versions.]

Toshiba Update 04/09/2006 Experiments With Voice Quality
Analysis of long-term average spectra (LTAS) for vowels:
– Pitch-synchronous analysis (single-pitch-period frames).
– Total power in each frame normalized to a constant value.
– Anger has significantly more energy in the … band and less in … (the specific bands were not preserved in the transcript).
– Sadness has a sharper spectral tilt and more low-frequency energy.
– Happy and surprised follow similar spectral patterns.
[Plot: long-term average spectra for the vowel /ae/.]
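A rough sketch of a pitch-synchronous LTAS as described above: single-pitch-period frames, each normalized to constant total power, averaged in the magnitude-spectrum domain. Frame extraction from pitch marks is assumed; this is not the original analysis code.

```python
import numpy as np

def ltas(frames, n_fft=1024, target_power=1.0):
    """frames: list of 1-D arrays, each one pitch period of a vowel."""
    spectra = []
    for frame in frames:
        frame = frame * np.sqrt(target_power / np.sum(frame ** 2))  # power normalization
        spectra.append(np.abs(np.fft.rfft(frame, n=n_fft)))         # magnitude spectrum
    return np.mean(spectra, axis=0)   # average across pitch-period frames

# Toy usage with synthetic pitch-period-length frames.
rng = np.random.default_rng(1)
toy_frames = [rng.standard_normal(n) for n in (180, 176, 185)]
print(ltas(toy_frames).shape)   # (n_fft // 2 + 1,)
```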

Toshiba Update 04/09/2006 Upcoming Work
More experiments with voice quality modification:
– Decision-tree-based filter bank generation approach.
Combine voice quality processing with prosody generation.
Apply the techniques to the MMJ (male) corpus and compare performance across gender.
Perceptual study:
– Acquire recognition scores across emotions, gender and feature set.
Miscellaneous: application of the FSP and MMJ models to a new speaker/language.