Automatic phone segmentation of expressive speech
L. Charonnat, G. Vidal, O. Boëffard
IRISA/Cordial, Université de Rennes 1, Lannion, France
LREC 2008, Marrakech, Morocco


VIVOS project, funded by the French National Agency for Research (ANR)

OUTLINE
►Introduction
►Corpus description
►Experimentation
■text verification
■phonetisation
■HMM modeling
►A new mixed model
►Results
►Conclusion and perspectives

Introduction
►Objectives
■To develop an automatic segmentation system adapted to expressive speech taken from movie dubbing.
■To investigate a new modelling methodology using mixed HMMs based on both context-dependent and context-independent models.
►Motivations
■Voices for TTS applications are created from constrained recordings, whereas unconstrained recordings are widely available, notably in the post-production industry.
■Context-independent phoneme models are usually used to perform label alignment, but in some cases context-dependent phoneme models can improve the alignment precision for co-articulated sounds.

The speech corpus
►Voice-over recordings of short fantastic stories
■recorded in a dubbing studio
■speech expressing suspense
►French-native male speaker
►Database content
■5 hours and 20 minutes
■1633 speech turns
■average of 32 words per turn
■4995 sentences
►Effects of expressivity
■large variability in prosody, long pauses, fillers
■the speaker takes liberties with his pronunciation (unusual liaisons, approximate pronunciation of some words)

Experimentation
►3 corpora
■learning: 70% of the corpus -> to train the models
■validation: 12% of the corpus -> to set modelling parameters
■test: 18% of the corpus -> to evaluate the overall performance
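The three-way split above can be sketched as a simple random partition of the speech turns (illustrative Python only; the utterance IDs, the random seed, and the use of random rather than stratified selection are assumptions, not details from the paper):

```python
import random

def split_corpus(utterances, seed=0):
    """Partition utterances into learning/validation/test sets
    with the 70%/12%/18% proportions used in the experiments."""
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_learn = int(0.70 * n)   # learning set: 70%
    n_valid = int(0.12 * n)   # validation set: 12%
    return {
        "learning": shuffled[:n_learn],
        "validation": shuffled[n_learn:n_learn + n_valid],
        "test": shuffled[n_learn + n_valid:],   # remaining ~18%
    }

# 1633 speech turns, as reported for the corpus
splits = split_corpus([f"utt{i:04d}" for i in range(1633)])
print(len(splits["learning"]), len(splits["validation"]), len(splits["test"]))
```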

Text verification
►Manual checking
■spelling
■pronunciation
►Insertion of tags in the text
■indicating deep breathing and long pauses
■not synchronized with the signal
►Exception dictionary for
■some acronyms
■foreign words
■~600 words
►Speech turn synchronization

Phonetisation
►Rule-based grapheme-to-phoneme conversion
►Variants: liaisons, schwas, pauses
►Production of a graph including optional variants
►HTK phonological words: ils sont amenés => i l / s õ / a m ø n e
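The pronunciation graph with optional variants can be illustrated by expanding optional phones into all alternative pronunciations. This is a minimal sketch: the tuple notation for marking an optional liaison consonant or schwa is an assumed representation, not the HTK lattice format the system actually uses, and the example phone strings are illustrative.

```python
from itertools import product

def expand_variants(phones):
    """Expand a phone sequence containing optional variants into all
    alternative pronunciations. Optional phones (e.g. a liaison consonant
    or a schwa) are written as (phone, None) pairs; mandatory phones are
    plain strings."""
    slots = []
    for p in phones:
        if isinstance(p, tuple):      # optional phone: may or may not be realized
            slots.append(list(p))
        else:                         # mandatory phone
            slots.append([p])
    return [" ".join(x for x in combo if x is not None)
            for combo in product(*slots)]

# Illustrative: an optional liaison /z/ and an optional schwa /ø/
variants = expand_variants(
    ["i", "l", ("z", None), "s", "õ", "a", "m", ("ø", None), "n", "e"])
for v in variants:
    print(v)
```

Each combination of realized/elided optional phones yields one path of the graph, so two optional slots give four alternative pronunciations.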

HMM methodology
►1 phoneme ↔ 1 HMM
►12 MFCCs + energy + derivatives (39 coefficients)
►3 emitting states
►Context-independent models:
■initialised on the learning corpus (70% of the corpus)
■3-component Gaussian mixtures
►Context-dependent models:
■initialised from the context-independent models
■4-component Gaussian mixtures
■estimation of missing contextual models using a classification tree
►Mixed models
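The 39-coefficient observation vectors (12 MFCCs + energy, plus first- and second-order derivatives) can be sketched as follows, assuming standard HTK-style regression deltas; the regression window size and the MFCC extraction step itself are assumptions not specified on the slide.

```python
import numpy as np

def add_deltas(static, window=2):
    """Compute regression ("delta") coefficients for a (T, D) matrix of
    static features, using an HTK-style regression over +/- `window` frames
    with edge padding at the utterance boundaries."""
    T, _ = static.shape
    denom = 2 * sum(t * t for t in range(1, window + 1))
    padded = np.pad(static, ((window, window), (0, 0)), mode="edge")
    delta = np.zeros_like(static, dtype=float)
    for t in range(T):
        for w in range(1, window + 1):
            delta[t] += w * (padded[t + window + w] - padded[t + window - w])
    return delta / denom

def make_observations(static):
    """Stack 13 static coefficients (12 MFCCs + energy) with their deltas
    and delta-deltas, giving 39-dimensional observation vectors."""
    d1 = add_deltas(static)
    d2 = add_deltas(d1)
    return np.hstack([static, d1, d2])

obs = make_observations(np.random.randn(100, 13))
print(obs.shape)  # (100, 39)
```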

Mixed models
►Mixing context-dependent and context-independent models according to their performance on a validation set
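The mixing rule can be sketched as a per-phoneme comparison on the validation set: keep the context-dependent model where it aligns better, otherwise fall back to the context-independent one. The accuracy figures below are illustrative, and the paper's exact selection criterion may differ.

```python
def build_mixed_model(ci_acc, cd_acc):
    """For each phoneme, pick the context-dependent (CD) model when its
    validation accuracy (fraction of boundaries within 20 ms) exceeds that
    of the context-independent (CI) model; otherwise keep the CI model."""
    return {ph: ("CD" if cd_acc.get(ph, 0.0) > acc else "CI")
            for ph, acc in ci_acc.items()}

choice = build_mixed_model(
    {"j": 0.826, "a": 0.93, "z": 0.88},   # CI validation accuracy (illustrative)
    {"j": 0.905, "a": 0.92, "z": 0.91},   # CD validation accuracy (illustrative)
)
print(choice)
```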

Comparing CD vs CI models
►Difference in the percentage of correct alignments (<20 ms) between context-dependent and context-independent models
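The <20 ms alignment score behind this comparison can be sketched as the fraction of automatic boundaries falling within 20 ms of the manual reference boundaries (boundary times in seconds; the example values are illustrative):

```python
def boundary_accuracy(predicted, reference, tol=0.020):
    """Fraction of predicted phone boundaries within `tol` seconds
    (here 20 ms) of the corresponding reference boundaries."""
    assert len(predicted) == len(reference)
    hits = sum(abs(p - r) <= tol for p, r in zip(predicted, reference))
    return hits / len(reference)

ref  = [0.10, 0.25, 0.40, 0.62]   # manual boundaries (illustrative)
pred = [0.11, 0.24, 0.45, 0.63]   # automatic boundaries (illustrative)
print(f"{100 * boundary_accuracy(pred, ref):.1f}%")  # 75.0%
```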

Results: phonetic decoding
►Disagreement (elisions + insertions + substitutions) between 5.11% and 5.55%
►Good labelling of liaisons, elisions, and insertions of pauses and schwas
►Substitutions: confusions between open and closed vowels
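A disagreement rate of this kind (elisions + insertions + substitutions, relative to the reference phone string) is the standard Levenshtein-style measure; a minimal sketch, with illustrative phone strings:

```python
def phone_disagreement(decoded, reference):
    """Disagreement rate between decoded and reference phone sequences:
    minimum number of edits (deletions, insertions, substitutions) divided
    by the reference length, via dynamic-programming edit distance."""
    m, n = len(decoded), len(reference)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if decoded[i - 1] == reference[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution or match
    return d[m][n] / n

decoded   = ["i", "l", "s", "õ", "a", "m", "ø", "n", "e"]  # schwa realized
reference = ["i", "l", "s", "õ", "a", "m", "n", "e"]
rate = phone_disagreement(decoded, reference)
print(f"{100 * rate:.2f}% disagreement")
```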

Results: label alignments
►computed on correctly recognised phonetic labels
►mixed models take advantage of context-dependent models (semi-vowels, voiced fricatives, *-nasal consonants)
►+8% for semi-vowels-*: 90.54% (mixed) vs 82.58% (CI)

Conclusion and perspectives
►Good segmentation scores for expressive speech are due to
■accurate text verification (...but only at the text level)
■an automatically generated graph of phonemes including variants
■automatic HMM segmentation
►Experimentation with a new segmentation methodology mixing CI and CD models
►Perspectives
■to improve automatic grapheme-to-phoneme conversion of acronyms and proper names
■to apply post-processing for open/closed vowels and pauses
■to include new filler models