Bootstrapping a Language-Independent Synthesizer
Craig Olinsky
Media Lab Europe / University College Dublin
15 January 2002


Introducing the Problem
Given a set of recordings and transcriptions in an arbitrary language, can we quickly and easily build a speech synthesizer? YES, if we already know something about the language. However, for the majority of languages, such resources don’t exist…

PROS  The existing synthesizer provides a store of “linguistic” knowledge we can start from.  Analogue to speaker adaptation in Speech Recognition systems.  Overall, quality should be better. CONS  Difficulty related to degree of different between sample and target language.  Best as a gradual process: accent/dialect, not language Starting from Sample

PROS  Difficulty directly proportional to complexity of the language.  Common (machine-learning) procedure based upon machine learning from recordings and transcript. CONS  Don’t have a great deal of relevant knowledge to apply to the task.  If not using principled phone set, necessary to segment / label recordings cleanly Starting from Scratch

The Obvious Compromise
Take what we do know about building speech synthesizers, and generalize it to an existing framework:
-- we’re not strictly learning from “scratch”;
-- at the same time, we’re not relying on linguistic assumptions pre-coded into the source voices.

“Generic” Synthesis Framework/Toolkit
- A set of scripts, utilities, and definition files to help automate the creation of reasonable speech synthesis voices for an arbitrary language, without the need for linguistic or language-specific information.
- Built on top of the Festival Speech Synthesis System and the FestVox toolkit (for waveform synthesis; most text processing and pronunciation handling is externalized to locally developed tools).

Language-Dependent Synthesis Components
- Phone set
- Word pronunciation (lexicon and/or letter-to-sound rules)
- Token processing rules (numbers, etc.)
- Durations
- Intonation (accents and F0 contour)
- Prosodic phrasing method
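As a rough illustration, these per-language components can be thought of as one bundle of data plus fallbacks that the rest of the toolkit consumes. The sketch below is illustrative only (a hypothetical structure, not the Festival/FestVox API):

```python
# Minimal sketch of a per-language definition bundle (hypothetical, not
# the Festival/FestVox API).  Each field corresponds to one item in the
# list of language-dependent components above.
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class LanguageDefinition:
    # Inventory of phones the voice can produce.
    phone_set: List[str]
    # Explicit word pronunciations, e.g. {"cat": ["k", "ae", "t"]}.
    lexicon: Dict[str, List[str]] = field(default_factory=dict)
    # Fallback letter-to-sound rule for out-of-lexicon words
    # (default: use the spelling itself).
    letter_to_sound: Callable[[str], List[str]] = lambda word: list(word)
    # Token processing: expansions for digits, abbreviations, symbols.
    token_rules: Dict[str, str] = field(default_factory=dict)
    # Placeholders for duration, intonation and phrasing models.
    duration_model: Optional[object] = None
    intonation_model: Optional[object] = None
    phrasing_model: Optional[object] = None

    def pronounce(self, word: str) -> List[str]:
        """Look the word up in the lexicon, falling back to letter-to-sound."""
        return self.lexicon.get(word.lower(), self.letter_to_sound(word.lower()))


# A toy definition in which spelling is used directly as pronunciation.
toy = LanguageDefinition(phone_set=list("abcdefghijklmnopqrstuvwxyz"))
print(toy.pronounce("cat"))  # ['c', 'a', 't']
```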

Phoneme Sets
- If we rely on a pre-existing set of pronunciation rules, lexicon, etc., we are automatically limited to the phone set used in those resources (or something they can be mapped to); most likely something language-dependent.
- IPA, SAMPA: something language-universal?
- We need to generate pronunciations: how do we establish the relationship between our training database, phonetic representation, and orthography?

“Multilingual” Phoneme Sets: IPA, SAMPA
We don’t want to be stuck with a set of phonemes targeted at a specific language, so we instead use a phoneme inventory designed to be inclusive of all languages. But… this still assumes we know the relationship between the phone set and the orthography of the language, i.e. that for any given text we can generate a pronunciation. This approach still assumes linguistic knowledge!
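For example, even with SAMPA as the universal target inventory, someone must still supply a language-specific spelling-to-SAMPA mapping. The toy sketch below (a hand-simplified, Spanish-like mapping invented for illustration, not a real resource) is exactly the kind of linguistic knowledge being assumed:

```python
# Illustrative only: a hand-written grapheme-to-SAMPA mapping for a
# simplified, Spanish-like orthography.  Writing this table requires
# exactly the language-specific knowledge the slide says we lack.
GRAPHEME_TO_SAMPA = {
    "ch": ["tS"],   # multi-character graphemes must be matched first
    "ll": ["L"],
    "a": ["a"], "e": ["e"], "i": ["i"], "o": ["o"], "u": ["u"],
    "b": ["b"], "c": ["k"], "d": ["d"], "g": ["g"], "j": ["x"],
    "l": ["l"], "m": ["m"], "n": ["n"], "p": ["p"], "r": ["r"],
    "s": ["s"], "t": ["t"],
}


def to_sampa(word: str) -> list:
    """Greedy longest-match conversion of spelling to SAMPA symbols."""
    phones, i = [], 0
    while i < len(word):
        for length in (2, 1):            # try digraphs before single letters
            chunk = word[i:i + length]
            if chunk in GRAPHEME_TO_SAMPA:
                phones += GRAPHEME_TO_SAMPA[chunk]
                i += length
                break
        else:                            # unknown character: skip it
            i += 1
    return phones


print(to_sampa("chile"))  # ['tS', 'i', 'l', 'e']
```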

Orthography as Pronunciation
cf. R. Singh, B. Raj and R. M. Stern, “Automatic Generation of Phone Sets and Lexical Transcriptions.”
Suppose we begin with the orthography of the written language, e.g.
CAT = [c] [a] [t]    DOG = [d] [o] [g]
This implies:
- a relation between the number of characters in a spelling and the length of the pronunciation;
- that the orthography of the language is consistent / efficient.
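A minimal sketch of this starting point (illustrative only, not the full method of the cited paper): the spelling is used directly as the phone sequence, and the phone set is simply the character inventory of the transcript.

```python
# Minimal sketch: treat each orthographic character as a pseudo-phone.
# The phone "set" is the character inventory of the training transcript.

def orthographic_pronunciations(words):
    """Map each word to its spelling, used directly as its pronunciation."""
    return {w: list(w.lower()) for w in words}


words = ["CAT", "DOG"]
lexicon = orthographic_pronunciations(words)
phone_set = sorted({p for phones in lexicon.values() for p in phones})

print(lexicon)    # {'CAT': ['c', 'a', 't'], 'DOG': ['d', 'o', 'g']}
print(phone_set)  # ['a', 'c', 'd', 'g', 'o', 't']
```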

Orthography as Pronunciation

Implications for Data Labeling and Training

Non-Roman Orthography: Questions of Transcription

Difficulties in Machine Learning of Pronunciation
“But there is a much more fundamental problem … in that it crucially assumes that letter-to-phoneme correspondences can in general be determined on the basis of information local to a particular portion of the letter string. While this is clearly true in some languages (e.g. Spanish), it is simply false for others….
“…It is unreasonable to expect that good results will be obtained from a system trained with no guidance of this kind, or … with data that is simply insufficient to the task.”
– Sproat et al., Multilingual Text-to-Speech Synthesis: The Bell Labs Approach, pp. 76-77
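To make the “local information” point concrete: letter-to-sound rules are typically learned from a fixed window of letters around the target letter, roughly as in the toy sketch below (an illustration of the assumption, not Sproat et al.’s system). The deeper the orthography, the less such a window can capture.

```python
# Toy illustration of "local" letter-to-phoneme context: each letter is
# described only by a fixed window of neighbouring letters.  This is the
# assumption that breaks down for deep orthographies.

def letter_contexts(word, window=2, pad="#"):
    """Yield (letter, left-context, right-context) triples for a word."""
    padded = pad * window + word.lower() + pad * window
    for i, letter in enumerate(word.lower(), start=window):
        left = padded[i - window:i]
        right = padded[i + 1:i + 1 + window]
        yield letter, left, right


for letter, left, right in letter_contexts("cat"):
    print(f"{left} [{letter}] {right}")
# ## [c] at
# #c [a] t#
# ca [t] ##
```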

Lexicon / Letter-to-Sound Rules

Token Processing

Duration and Stress Modeling

Intonation and Phrasing

Unit Selection and Waveform Synthesis

Overview: Adaptation for Accent and Dialect

Final Points