LREC 2008, Marrakech, Morocco. Automatic phone segmentation of expressive speech. L. Charonnat, G. Vidal, O. Boëffard, IRISA/Cordial, Université de Rennes.


1 Automatic phone segmentation of expressive speech
L. Charonnat, G. Vidal, O. Boëffard
IRISA/Cordial, Université de Rennes 1, Lannion, France
LREC 2008, Marrakech, Morocco
VIVOS project, funded by the French National Agency for Research (ANR)

2 Outline
►Introduction
►Corpus description
►Experimentation
  ■text verification
  ■phonetisation
  ■HMM modeling
►A new mixed model
►Results
►Conclusion and perspectives

3 Introduction
►Objectives
  ■To develop an automatic segmentation system adapted to expressive speech taken from movie dubbing.
  ■To investigate a new modelling methodology using mixed HMM models that combine context-dependent (CD) and context-independent (CI) models.
►Motivations
  ■Voices for TTS applications are created from constrained recordings, whereas unconstrained recordings are available, notably in the post-production industry.
  ■Context-independent phoneme models are usually used to perform label alignment but, in some cases, context-dependent phoneme models can improve the alignment precision for co-articulated sounds.

4 The speech corpus
►Voice-over recordings of short fantastic stories
  ■recorded in a dubbing studio
  ■speech expressing suspense
►Male native French speaker
►Database content
  ■5 hours and 20 minutes
  ■1633 speech turns
  ■average of 32 words/turn
  ■4995 sentences
►Effects of expressivity
  ■large variability in prosody, long pauses, fillers
  ■the speaker takes liberties with his pronunciation (unusual liaisons, approximate pronunciation of some words)

5 Experimentation
►3 corpora
  ■learning: 70% of the corpus -> to train the models
  ■validation: 12% of the corpus -> to set modeling parameters
  ■test: 18% of the corpus -> to evaluate the overall performance
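The 70/12/18 partition can be sketched as follows. The slides do not say how the corpus was divided, so splitting by speech turn, shuffling at random, and the helper name `split_corpus` are all assumptions made for illustration.

```python
import random

def split_corpus(turn_ids, fractions=(0.70, 0.12, 0.18), seed=0):
    """Shuffle speech-turn ids and cut them into learning/validation/test sets.

    Assumption: the split is done at the speech-turn level and at random;
    the slides only give the three proportions.
    """
    ids = list(turn_ids)
    random.Random(seed).shuffle(ids)
    n_learn = round(fractions[0] * len(ids))
    n_valid = round(fractions[1] * len(ids))
    return {
        "learning": ids[:n_learn],
        "validation": ids[n_learn:n_learn + n_valid],
        "test": ids[n_learn + n_valid:],  # whatever remains
    }

# 1633 is the number of speech turns reported for the corpus.
parts = split_corpus(range(1633))
sizes = {name: len(ids) for name, ids in parts.items()}
```

Giving the remainder to the test set guarantees that every turn lands in exactly one subset, even when the proportions do not divide the corpus evenly.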

6 Text verification
►Manual checking
  ■spelling
  ■pronunciation
►Insertion of tags in the text
  ■indicating deep breathing and long pauses
  ■not synchronized with the signal
►Exception dictionary (~600 words) for
  ■some acronyms
  ■foreign words
►Speech turn synchronization

7 Phonetisation
►Rule-based grapheme-to-phoneme conversion
►Variants: liaisons, schwas, pauses
►Production of a graph including optional variants
►HTK phonological words: ils sont amenés => i l / s õ / a m ø n e
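A graph with optional variants can be pictured as a list of alternatives per word, where every path through the alternatives is one candidate pronunciation. The sketch below is only an illustration: the liaison /t/ after "sont" and the schwa elision in "amenés" are plausible variants chosen for the example, not ones taken from the slides, and the real system builds its graph for HTK rather than enumerating paths.

```python
from itertools import product

# One entry per word; each entry lists alternative phone sequences.
# The variants are illustrative assumptions, not the authors' rule output.
lattice = [
    [["i", "l"]],                                        # ils
    [["s", "õ"], ["s", "õ", "t"]],                       # sont (+ optional liaison /t/)
    [["a", "m", "ø", "n", "e"], ["a", "m", "n", "e"]],   # amenés (+ optional schwa elision)
]

def pronunciations(lattice):
    """Yield every phone sequence allowed by the per-word alternatives."""
    for choice in product(*lattice):
        yield [phone for word in choice for phone in word]

paths = list(pronunciations(lattice))
```

During alignment, the recogniser would pick the path that best matches the audio, which is how optional liaisons and schwas end up labelled.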

8 HMM methodology
►1 phoneme ↔ 1 HMM
►12 MFCC + energy + derivatives (39 coefficients)
►3 emitting states
►Context-independent models:
  ■initialised on the learning corpus (70% of the corpus)
  ■mixture of 3 Gaussian components
►Context-dependent models:
  ■initialised on the context-independent models
  ■mixture of 4 Gaussian components
  ■estimation of missing contextual models using a classification tree
►Mixed models
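The count of 39 coefficients follows from 13 static values per frame (12 MFCC plus energy), each with a first and a second derivative: 13 × 3 = 39. The sketch below shows that stacking on dummy frames. It uses a plain central difference for the derivatives, which is a simplification; the regression window actually used by the authors is not given in the slides.

```python
def deltas(frames):
    """Central differences between neighbouring frames (edges are clamped).

    Simplified stand-in for the regression-based deltas used in practice.
    """
    n = len(frames)
    out = []
    for t in range(n):
        prev, nxt = frames[max(t - 1, 0)], frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def add_derivatives(static):
    """Append first and second derivatives to each static feature vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Dummy input: 5 frames of 13 static values (12 MFCC + energy).
static = [[float(t + k) for k in range(13)] for t in range(5)]
features = add_derivatives(static)
```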

9 Mixed models
►Mixing context-dependent models and context-independent models according to their performance on a validation set
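One way to read this selection rule is sketched below: for each class of phone boundary, keep whichever model type scored better on the validation set. The granularity (per boundary class), the class names, and the accuracy figures are all hypothetical; the slides only state that the choice depends on validation performance.

```python
def select_models(ci_accuracy, cd_accuracy):
    """Pick "CD" or "CI" per boundary class from validation accuracies.

    Falls back to CI when no CD score is available or on a tie.
    """
    selection = {}
    for boundary_class, ci in ci_accuracy.items():
        cd = cd_accuracy.get(boundary_class)
        selection[boundary_class] = "CD" if cd is not None and cd > ci else "CI"
    return selection

# Made-up validation accuracies (% of boundaries within tolerance).
ci_accuracy = {"semi-vowel-*": 80.0, "*-nasal": 85.0, "plosive-vowel": 93.0}
cd_accuracy = {"semi-vowel-*": 88.0, "*-nasal": 89.0, "plosive-vowel": 91.0}

mixed = select_models(ci_accuracy, cd_accuracy)
```

Defaulting to CI on ties or missing scores keeps the simpler model unless the validation data shows a clear benefit from context.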

10 Comparing CD vs CI models
►Difference in the percentage of correct alignments (error < 20 ms) between context-dependent and context-independent models
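The comparison metric can be sketched as the share of predicted boundaries that fall within 20 ms of the reference boundary, computed once per model type and then subtracted. The boundary times below are invented for the example, and the function name is hypothetical.

```python
def correct_alignment_rate(reference_ms, predicted_ms, tolerance_ms=20):
    """Percentage of boundaries whose absolute error is below the tolerance."""
    if len(reference_ms) != len(predicted_ms):
        raise ValueError("boundary lists must have the same length")
    hits = sum(abs(r - p) < tolerance_ms for r, p in zip(reference_ms, predicted_ms))
    return 100.0 * hits / len(reference_ms)

# Invented boundary times in milliseconds.
reference = [120, 250, 400, 610]
ci_bounds = [135, 275, 405, 640]   # errors: 15, 25, 5, 30
cd_bounds = [125, 262, 430, 615]   # errors: 5, 12, 30, 5

ci_rate = correct_alignment_rate(reference, ci_bounds)
cd_rate = correct_alignment_rate(reference, cd_bounds)
difference = cd_rate - ci_rate     # positive when CD models align better
```

Times are kept as integer milliseconds so that the threshold comparison is not affected by floating-point rounding.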

11 Results: phonetic decoding
►Disagreement (elisions + insertions + substitutions) between 5.11% and 5.55%
►Good labelling of liaisons, elisions and insertions of pauses and schwas
►Substitutions: inversion between open and closed vowels
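A disagreement rate of this kind is commonly obtained from an edit-distance alignment between the reference phone string and the decoded one. The sketch below counts elisions, insertions and substitutions together and normalises by the reference length; that normalisation and the example strings are assumptions, since the slides do not spell out the formula.

```python
def edit_distance(ref, hyp):
    """Minimum number of elisions, insertions and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # elision: reference phone missing in hyp
                cur[j - 1] + 1,          # insertion: extra phone in hyp
                prev[j - 1] + (r != h),  # substitution, or free match
            ))
        prev = cur
    return prev[-1]

def disagreement_rate(ref, hyp):
    """Edit distance as a percentage of the reference length (assumed normalisation)."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

# Invented example: one inserted liaison /t/ and one open/closed vowel swap.
ref = "i l s õ a m ø n e".split()
hyp = "i l s õ t a m œ n e".split()
errors = edit_distance(ref, hyp)
rate = disagreement_rate(ref, hyp)
```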

12 Results: label alignments
►Computed on correctly recognised phonetic labels
►Mixed models take advantage of context-dependent models (semi-vowels, voiced fricatives, *-nasal consonants)
►About +8 percentage points for semi-vowels-*: 90.54% (mixed) vs 82.58% (CI)

13 Conclusion and perspectives
►Good segmentation scores on expressive speech are due to
  ■an accurate text verification (...but only at a text level)
  ■an automatically generated graph of phonemes including variants
  ■an automatic HMM segmentation
►Experimentation with a new segmentation methodology that mixes CI and CD models
►Perspectives
  ■to improve automatic grapheme-to-phoneme conversion of acronyms and proper names
  ■to apply post-processing for open/closed vowels and pauses
  ■to include new filler models

