IBM Labs in Haifa © 2007 IBM Corporation
SSW-6, Bonn, August 23rd, 2007
Maximum-Likelihood Dynamic Intonation Model for Concatenative Text-to-Speech System
Slava Shechtman, IBM Haifa Research Laboratory
Slide 2: Outline
- CART intonation modeling
- Maximum-Likelihood dynamic intonation model
  - Dynamic observations
  - Maximum-likelihood solution
- Microprosody preservation technique
- Implementation and preliminary results
- Future research directions
Slide 3: CART prosody modeling
[Diagram: semantic data, syntactic data, phonetic context, and syllable location are extracted from language data; together with a speech corpus with pitch data, they are used to grow a pitch tree and a duration tree.]
Slide 4: Basic CART intonation model
- Rough, but simple and automatic
- Extract semantic, syntactic and phonetic features from the TTS front-end (per syllable):
  - POS, word stress, syllable stress
  - Sentence type, phrase type
  - Syllable location
  - Phonetic context
- 3 log-pitch observations per syllable (in the sonorant part of the syllable), at points Q1, Q2, Q3
- Mean pitch values are associated with tree leaves to represent the target intonation (implicit i.i.d. assumption)
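The per-leaf mean targets described above can be sketched as follows. This is a minimal pure-Python illustration of the implicit i.i.d. assumption; the function name and the use of a plain leaf id as the grouping key are assumptions, standing in for the actual CART machinery:

```python
from collections import defaultdict

def leaf_pitch_targets(samples):
    """Average the 3 log-pitch observations of all training syllables that
    fall into the same CART leaf (the implicit i.i.d. assumption).

    samples: iterable of (leaf_id, (q1, q2, q3)) pairs, where q1..q3 are
    log-pitch values taken in the sonorant part of the syllable.
    Returns {leaf_id: [mean_q1, mean_q2, mean_q3]}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for leaf, obs in samples:
        for i, v in enumerate(obs):
            sums[leaf][i] += v
        counts[leaf] += 1
    return {leaf: [s / counts[leaf] for s in vec] for leaf, vec in sums.items()}
```

At synthesis time, the front-end features of each syllable select a leaf, and the leaf's mean vector serves as that syllable's target intonation.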
Slide 5: Basic application of the CART intonation model
- Use the mean log-pitch values to estimate the target pitch for concatenated segments
- Use the distance from the target pitch as an additive cost factor in the overall segment-selection cost
- Optionally, use the target pitch curve for speech synthesis (after smoothing and/or combination with the actual pitch of the selected segments)
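The additive target-pitch cost can be sketched as follows. The Euclidean distance and the weighting parameter are illustrative assumptions; the slides do not specify the exact distance measure used:

```python
import math

def target_pitch_cost(candidate_logf0, target_logf0, weight=1.0):
    """One additive term of the unit-selection cost: the distance between a
    candidate segment's log-pitch observations and the CART target values.
    Euclidean distance and the scalar weight are illustrative assumptions."""
    d = math.sqrt(sum((c - t) ** 2 for c, t in zip(candidate_logf0, target_logf0)))
    return weight * d
```

In the overall search, this term would be summed with the other target and concatenation costs of each candidate segment.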
Slide 6: Maximum-Likelihood dynamic intonation model
- Model cross-syllable dynamic observations as well as intra-syllable observations
- Maximum-likelihood solution, based on the HMM synthesis approach (Tokuda et al.): a convenient framework for combining both instantaneous and differential observations to obtain the most-likely smooth parameter contour for a given clustering
- May be applied over the regular CART trees
[Diagram: syllable sequence S1, S2, S3]
Slide 7: Dynamic features for CART intonation modeling
- Extend the static observation vectors for the n-th syllable
- Add four time-normalized differences of static observations
- Guarantee a non-zero time interval between the observation instants
- New observation vector
[Figure: pairs of observation points used for the difference calculation]
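The construction of the extended observation vector can be sketched as follows. The particular pairing of observation points (last point of the previous syllable to the first of the current one, the two intra-syllable pairs, and the last point of the current syllable to the first of the next) is an assumption, since the figure showing the pairs did not survive; the rest follows the slide:

```python
def dynamic_observation(prev_obs, cur_obs, next_obs,
                        prev_t, cur_t, next_t, min_dt=1e-3):
    """Extend the 3 static log-pitch observations of the n-th syllable with
    4 time-normalized differences of static observations.

    prev_obs/cur_obs/next_obs: 3-tuples of log-pitch values per syllable;
    prev_t/cur_t/next_t: the corresponding observation instants.
    min_dt guarantees a non-zero time interval between observation instants.
    The choice of point pairs is an assumption.
    """
    pairs = [
        (prev_obs[2], prev_t[2], cur_obs[0], cur_t[0]),  # cross-syllable (in)
        (cur_obs[0], cur_t[0], cur_obs[1], cur_t[1]),    # intra-syllable
        (cur_obs[1], cur_t[1], cur_obs[2], cur_t[2]),    # intra-syllable
        (cur_obs[2], cur_t[2], next_obs[0], next_t[0]),  # cross-syllable (out)
    ]
    deltas = [(b - a) / max(tb - ta, min_dt) for a, ta, b, tb in pairs]
    return list(cur_obs) + deltas  # 7-dimensional observation vector
```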
Slide 8: Maximum-Likelihood dynamic intonation model
- Assume the cluster sequence Q is predetermined by CART
- Each cluster is modeled by a single 7-dimensional Gaussian
- Concatenated observations O
- Concatenated static observations C
- Sparse (block-diagonal) linear transformation relating the two
Slide 9: Maximum-Likelihood dynamic intonation model
- The log-likelihood of the observation sequence O follows from the per-cluster single-Gaussian models
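In the HMM-based parameter generation framework the slides build on, the likelihood takes the following standard form. This is a hedged reconstruction from the surrounding definitions; the notation on the original slide may differ:

```latex
% Q is the CART-determined cluster sequence, with concatenated mean
% \mu_Q and covariance \Sigma_Q; C stacks the static observations,
% O the full (static + dynamic) observations, and W is the sparse
% block-diagonal transformation with O = W C.
\log P(\mathbf{O} \mid Q)
  = -\tfrac{1}{2}\,(\mathbf{O}-\boldsymbol{\mu}_Q)^{\mathsf T}
      \boldsymbol{\Sigma}_Q^{-1}\,(\mathbf{O}-\boldsymbol{\mu}_Q)
    \;-\; \tfrac{1}{2}\log\bigl\lvert 2\pi\boldsymbol{\Sigma}_Q \bigr\rvert,
\qquad \mathbf{O} = \mathbf{W}\,\mathbf{C}.
```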
Slide 10: Maximum-Likelihood dynamic intonation model
- Likelihood maximization with respect to the static observations C
- An efficient time-recursive solution exists (Tokuda et al., 1996)
- Jointly determines the full-utterance pitch curve: the solution depends both on the individual CART cluster models and on their sequence in the synthesized sentence
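The closed-form solution can be sketched as follows. This is a hedged NumPy illustration of the normal equations, not the time-recursive algorithm of Tokuda et al., which avoids forming and inverting the full system:

```python
import numpy as np

def ml_static_contour(mu, Sigma_inv, W):
    """Maximize the log-likelihood of O = W C over the static contour C.
    Setting the gradient with respect to C to zero gives the normal equations
        (W^T Sigma^{-1} W) C = W^T Sigma^{-1} mu.
    mu: concatenated cluster means; Sigma_inv: concatenated precisions;
    W: sparse mapping from static observations to full observations."""
    A = W.T @ Sigma_inv @ W
    b = W.T @ Sigma_inv @ mu
    return np.linalg.solve(A, b)
```

The direct solve above is cubic in the utterance length; the time-recursive solution cited on the slide exploits the band structure of `W.T @ Sigma_inv @ W` induced by the local dynamic features.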
Slide 11: Maximum-Likelihood dynamic intonation model
- Smooths the abrupt changes present in the mean solution
- Controlled by the scaling factor inside the dynamic observations
- Allows larger CART trees to be used for fine clustering
Slide 12: Microprosody preservation
- Improve the resolution of the rough pitch curve
- Keep the original fine pitch structure inside each contiguous portion of speech to increase naturalness, while staying aligned with the target intonation curve
- Compensate for imperfections of the CART model and feature extraction
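One simple way to realize this combination can be sketched as follows. Mean-level alignment is an illustrative assumption, not necessarily the paper's exact scheme: remove the local mean from the original pitch of a contiguous speech portion and ride the residual on top of the target contour:

```python
def preserve_microprosody(orig_logf0, target_logf0):
    """Keep the fine (microprosodic) structure of the original contour of a
    contiguous speech portion while aligning it to the target intonation.
    Sketch under an assumed mean-level alignment: subtract the original's
    mean, then add the residual onto the target curve."""
    n = len(orig_logf0)
    mean = sum(orig_logf0) / n
    residual = [v - mean for v in orig_logf0]  # microprosody detail
    return [t + r for t, r in zip(target_logf0, residual)]
```

Because only the level is replaced, the segment-internal pitch movements that carry naturalness survive, while the utterance-level intonation follows the ML target.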
Slide 13: Mean solution vs. ML dynamic intonation model
Subjective preference: static, smoothed mean solution (A) vs. dynamic ML solution (B).
Columns: (A) all pref. | (A) strong pref. | (B) all pref. | (B) strong pref. | No pref.
% values: 22.5126.96.36.199.2
Slide 14: Incorporation within the CTTS system
- Applied to the embedded version of the IBM CTTS system with a sub-phoneme basic concatenation unit (typically one third of a phoneme)
- (A): CART mean solution as the target pitch; smoothed original pitch curve as the synthesis pitch
- (B): dynamic ML CART solution as the target pitch; the microprosody preservation technique combines the original and target pitches
Subjective results from TTS experts and native speakers.
Columns: (A) strong or weak pref. | (A) strong pref. | (B) strong or weak pref. | (B) strong pref. | No pref.
% values: 37.9188.8.131.52.7
Slide 15: Summary and further research directions
- A dynamic ML CART intonation model was proposed and shown to perform better than the baseline CART intonation model
- It was successfully combined with the original pitch curve using the microprosody preservation technique
- Further research:
  - Alternative dynamic features
  - Statistical microprosody modeling for very-small-footprint voices
  - Adaptive microprosody incorporation
Slide 16: Exact formulation of dynamic features
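The equations on this slide did not survive; the following is a hedged reconstruction of what a time-normalized difference of static observations looks like, consistent with the description on slide 7. The notation is an assumption:

```latex
% A time-normalized difference between two static observation
% instants t_i < t_j of the log-pitch contour c(\cdot):
\Delta c_{i,j} \;=\; \frac{c(t_j) - c(t_i)}{t_j - t_i},
\qquad t_j - t_i \;\ge\; \varepsilon \;>\; 0,
% where \varepsilon enforces the guaranteed non-zero time interval
% between observation instants.
```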