

DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT
S. P. Kishore*, Rohit Kumar** and Rajeev Sangal*
* Language Technologies Research Center, International Institute of Information Technology, Hyderabad
** Punjab Engineering College, Chandigarh

ORGANIZATION OF THE TALK
1. Text to Speech (TTS) Systems: Introduction
2. Components of TTS system
3. Concatenative Synthesis: Data Driven Approach
4. Issues in Data Driven Synthesis Approaches
5. Unit Selection Algorithm based on Prosodic Features
6. Development of Indian Language Text to Speech Systems
7. Demonstrations
8. Conclusions

TEXT TO SPEECH SYSTEMS: INTRODUCTION
A Text to Speech system converts arbitrary text into a speech waveform.

LANGUAGE PROCESSING MODULES
PREPROCESSOR:
• Text is processed to expand abbreviations, addresses, numbers, etc., and to detect sentence ends from punctuation such as . , ? ! : ) ;
CONTEXTUAL ANALYZER:
• Analyze the part of speech of words. In English, words like record, permit and present are stressed on the first syllable as nouns and on the second as verbs.
• Analyze the structure of the sentence (declarative, interrogative). This helps in obtaining the appropriate prosodic structure.
PHRASE ANALYSIS:
• Decompose the text into phrases and clauses. This gives the synthesis engine useful hints for placing prosodic pauses.

LANGUAGE PROCESSING MODULES
LETTER TO SOUND (LTS) RULES:
• The pronunciation of a word may differ from its spelling, for example in English: known, write, wrong, etc.
• LTS rules provide the phonetic transcription of the input text.
• Pronunciation dictionaries are used for languages like English.
COARTICULATION AND PROSODIC GENERATOR:
• Incorporate co-articulation and prosodic knowledge such as intonation and duration

SYNTHESIS STRATEGIES
ARTICULATORY BASED SYNTHESIS:
• Simplified models of the articulators, or of the observed shapes of the vocal tract, are devised
• Rules are specified to control the position of the articulators
• The difficulty lies in modeling the motion of the articulators
PARAMETER BASED SYNTHESIS:
• Speech segments are parameterized in terms of formant frequencies or linear prediction coefficients
• Rules are formed to manipulate the parameters of the speech unit to manifest co-articulation, intonation and duration
• Several hundred precisely crafted rules may be required

SYNTHESIS STRATEGIES
CONCATENATIVE BASED SYNTHESIS:
• Stored speech segments are concatenated for synthesis
• Speech segments are also referred to as units
• A unit can be a phrase, word, syllable, diphone or phone.
• A longer unit such as a phrase or word may have a wide range of prosodic variations depending on the context. For example: 1) He cleaned it. 2) It has to be cleaned.
• A smaller unit such as a phone may not have the required co-articulation. For example, a mere concatenation of the isolated sounds “c” “l” “e” “a” “n” does not result in “clean” as spoken in continuous speech.

SYNTHESIS STRATEGIES
CONCATENATIVE BASED SYNTHESIS:
• Sub-word units such as the diphone and the syllable are considered suitable for concatenation.
• Prosodic variations:
  - Intonation and duration could be acquired and incorporated in the form of rules
  - Store multiple realizations of units with differing prosody

DATA DRIVEN SYNTHESIS APPROACH

ISSUES IN DATA DRIVEN APPROACHES
• Choice of unit
  - The basic unit could be a sub-word unit
  - It should be possible to select variable-length units
• Characteristics of the speech database
  - Balanced in terms of coverage and prosodic diversity of all the units
  - Duration of the speech database?
• Criteria to select a unit
  - A unit is selected depending on how well it matches the input specification and how well it matches the other units in the sequence

EARLIER WORKS ON UNIT SELECTION ALGORITHM
Hunt and Black (1996)
UNIT FEATURES
• Phonemic information such as the labels of neighboring units and the position in phrases
• Prosodic features such as energy, duration and pitch
FOR SYNTHESIS
• Prosodic features are PREDICTED for the target units
• A particular realization of a unit is selected depending on how well it minimizes a cost function combining unit distortion and a discontinuity measure
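The selection criterion on this slide can be illustrated with a dynamic-programming search over candidate units. This is a minimal sketch and not Hunt and Black's implementation: `target_cost` and `join_cost` are hypothetical caller-supplied functions standing in for the unit distortion and the discontinuity measure, and every target is assumed to have at least one candidate.

```python
from typing import Any, Callable, List, Sequence


def select_units(
    targets: Sequence[Any],
    candidates: Sequence[Sequence[Any]],
    target_cost: Callable[[Any, Any], float],
    join_cost: Callable[[Any, Any], float],
) -> List[Any]:
    """Return one candidate per target, minimizing total target + join cost."""
    if not targets:
        return []
    # best[i][j]: cheapest total cost of any path ending at candidates[i][j]
    best = [[target_cost(targets[0], c) for c in candidates[0]]]
    back = [[-1] * len(candidates[0])]
    for i in range(1, len(targets)):
        row, ptr = [], []
        for cand in candidates[i]:
            # Cost of reaching this candidate from each previous candidate
            costs = [
                best[i - 1][k] + join_cost(prev, cand)
                for k, prev in enumerate(candidates[i - 1])
            ]
            k_min = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_min] + target_cost(targets[i], cand))
            ptr.append(k_min)
        best.append(row)
        back.append(ptr)
    # Trace back from the cheapest final candidate
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = back[i][j]
    return path[::-1]
```

In this sketch a unit can be any object; for instance, a pair (database index, pitch) with a join cost of zero for units that were adjacent in the recording.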

EARLIER WORKS ON UNIT SELECTION ALGORITHM
Black and Taylor (1997)
UNIT FEATURES
• Decision trees are built for each unit based on questions concerning the phonemic and prosodic context
• Units of the same type are further clustered into different groups based on their acoustic similarity
FOR SYNTHESIS
• For each target unit, an appropriate decision tree is selected to find the best cluster of candidate units
• A search is then made to find the best path through the candidate units
• A particular realization of a unit is selected taking into account the distance of the unit from its cluster center and the cost of joining two adjacent units

EARLIER APPROACHES FOR UNIT SELECTION
• Need a prediction component to get the prosodic features of the target units
• Build decision trees and define a measure of acoustic similarity, based on which units of the same type are further clustered
Questions:
• Can we avoid predicting the prosodic features of target units from prior knowledge (which may not be available)?
• How do you define a measure of acoustic similarity based on the cepstrum for longer units such as syllables and words? (For two units with different durations, we need more than a simple Euclidean distance between the cepstral coefficients of the units.)

UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES
Unit selection is based mainly on pitch, duration and energy
• No need for prior prosodic knowledge
• Does not cluster the units based on the Euclidean distance between cepstral coefficients
Hypothesis: A unit is best perceived in synthesized speech if its neighboring units have the same, or at least similar, prosodic and phonetic features as its neighbors in the recorded speech from which the unit is selected.

UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES
UNIT FEATURES
Phonetic context:
• Position of the unit (syllable)
• Last phone of the previous unit
• First phone of the succeeding unit
Prosodic context:
• Pitch, duration and energy
• Previous unit’s pitch, duration and energy
• Succeeding unit’s pitch, duration and energy

UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES
FOR SYNTHESIS
Having selected the (k-1)th unit, a realization of the kth unit is selected if it satisfies the following conditions:
• Its position in the word is the same as required
• It has the coarticulation effect of the last phone of the previous unit
• Its prosodic features such as pitch, duration and energy match the expectations of the (k-1)th unit
WHAT ARE THESE EXPECTATIONS, AND HOW DO YOU MEASURE PROSODIC SIMILARITY?
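The first two conditions are simple filters over the stored realizations of a syllable. The sketch below shows one way to express them; the `Unit` record, its field names and the position labels ("initial", "medial", "final") are hypothetical and are not taken from the slides.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Unit:
    """One stored realization of a syllable (hypothetical record layout)."""
    syllable: str
    position: str    # position in the word, e.g. "initial", "medial", "final"
    prev_phone: str  # last phone of the unit that preceded it in the corpus
    next_phone: str  # first phone of the unit that followed it in the corpus


def phonetic_candidates(
    units: List[Unit], syllable: str, position: str, prev_last_phone: str
) -> List[Unit]:
    """Keep realizations with the required word position and left context."""
    return [
        u
        for u in units
        if u.syllable == syllable
        and u.position == position
        and u.prev_phone == prev_last_phone
    ]
```

The realizations that survive this filter would then be ranked on the third condition, the prosodic match.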

UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES
EXPECTATIONS OF THE (k-1)th UNIT
• The actual values of pitch, duration and energy of the unit that follows the (k-1)th unit in the speech corpus
If E_a, D_a and P_a are the energy, duration and pitch of the kth syllable, and E_e, D_e and P_e are the expectations of the (k-1)th syllable, then the Prosodic Matching Function m(.) measures how closely (E_a, D_a, P_a) match (E_e, D_e, P_e).

SIGNIFICANCE OF PROSODIC MATCHING FUNCTION
• A lower value of the function m(.) indicates a better prosodic match
• The function m(.) attains its least possible value, 0, if E_e = E_a, D_e = D_a and P_e = P_a (the kth unit happens to be the successor of the (k-1)th unit in the speech corpus)
• Selecting that kth unit leads to the selection of a longer sequence consisting of two units
• Thus the function m(.) implicitly selects the longest available sequences, such as words, phrases and even sentences
HOW WELL DOES IT WORK?
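The slides describe m(.) only through its properties: lower is better, and it is 0 when expected and actual values coincide. The sketch below assumes one form with those properties, a sum of relative absolute differences; this form, the `Prosody` record and the function names are illustrative assumptions and not the authors' definition. Energy, duration and pitch are assumed to be positive.

```python
from typing import NamedTuple, Sequence


class Prosody(NamedTuple):
    energy: float
    duration: float
    pitch: float


def prosodic_match(expected: Prosody, actual: Prosody) -> float:
    """Assumed form of m(.): sum of relative absolute differences (0 = identical)."""
    return sum(abs(e - a) / e for e, a in zip(expected, actual))


def best_successor(
    corpus: Sequence[Prosody], prev_index: int, candidate_indices: Sequence[int]
) -> int:
    """Pick the candidate whose prosody best matches the expectations of the
    previously selected unit, i.e. the prosody of the unit that actually
    followed it in the corpus (prev_index must not be the last unit)."""
    expected = corpus[prev_index + 1]
    return min(candidate_indices, key=lambda i: prosodic_match(expected, corpus[i]))
```

When the unit that followed the previous one in the recording is itself among the candidates, its score is 0 and it is chosen, which is how contiguous stretches of the corpus end up being reused.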

DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES
Nature of Indian Scripts
a. Common phonetic base
b. About 35 consonants and 18 vowels
c. Phonetic nature of the languages: we almost speak what we write
Choice of Unit: Syllable
a. The basic units of the Indian writing systems are characters
b. These characters are close to syllables and are typically of the form C, V, CV, CVC, CCV

DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES
Letter to Sound (LTS) Rules
Almost one-to-one correspondence between character and sound
Inherent Vowel Suppression (IVS)
1. No two successive characters undergo IVS
2. The last character of the word always has its vowel suppressed if that vowel is /a/, e.g. rAma -> rAm
3. For a character in word-middle position, IVS occurs if the next character in the word
   a. is either not the last character
   b. or has a vowel other than /a/
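The three IVS rules can be sketched as a single pass over the characters of a word. Several details below are assumptions, since the slide does not state them: each character is represented as a hypothetical (consonants, vowel) pair, only characters whose vowel is the inherent /a/ can be suppressed, the word-middle rule is applied from right to left so that rule 1 can be checked against the character on the right, and single-character words are left unchanged.

```python
from typing import List, Tuple

Char = Tuple[str, str]  # (consonant part, vowel); hypothetical representation


def suppress_inherent_vowels(word: List[Char]) -> List[Char]:
    """Apply the three IVS rules; a suppressed vowel becomes the empty string."""
    n = len(word)
    if n < 2:
        return list(word)  # assumption: single-character words are untouched
    suppressed = [False] * n

    # Rule 2: the last character loses its vowel if that vowel is /a/
    cons, vowel = word[-1]
    if cons and vowel == "a":
        suppressed[-1] = True

    # Rule 3 for word-middle characters, scanned right to left (assumption)
    for i in range(n - 2, 0, -1):
        cons, vowel = word[i]
        if not cons or vowel != "a":
            continue  # only an inherent /a/ after a consonant can be dropped
        next_is_last = i + 1 == n - 1
        next_vowel = word[i + 1][1]
        rule3 = (not next_is_last) or next_vowel != "a"
        # Rule 1: never suppress two successive characters
        if rule3 and not suppressed[i + 1]:
            suppressed[i] = True

    return [(c, "" if s else v) for (c, v), s in zip(word, suppressed)]
```

With the slide's example, the characters of rAma, `[("r", "A"), ("m", "a")]`, come back as `[("r", "A"), ("m", "")]`, i.e. rAm.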

DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES
Syllabification Rules
1. When a nasal (/mz/ or /mzz/) follows a vowel, it is treated as part of the vowel, e.g. samzskrit
2. When there are 3 or more consonants between consecutive vowels:
   1st consonant -> coda of the previous syllable
   Other consonants -> onset of the next syllable
3. When there are exactly 2 consonants between consecutive vowels:
   1st consonant -> coda of the previous syllable
   2nd consonant -> onset of the next syllable
   Exception: if the second consonant is from {/r/, /s/, /sh/, /shz/}, the 1st and 2nd consonants -> onset of the next syllable
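A minimal sketch of these rules over a list of phone labels follows. The vowel inventory is an illustrative placeholder, and three behaviours are assumptions because the slide does not cover them: a single consonant between vowels goes to the onset of the next syllable, adjacent vowels are split into separate syllables, and word-initial or word-final consonants attach to the first or last syllable.

```python
from typing import List, Set

VOWELS: Set[str] = {"a", "A", "i", "I", "u", "U", "e", "E", "o", "O", "ai", "au"}  # illustrative
NASALS: Set[str] = {"mz", "mzz"}
ONSET_SECOND: Set[str] = {"r", "s", "sh", "shz"}  # exception set of rule 3


def syllabify(phones: List[str], vowels: Set[str] = VOWELS) -> List[List[str]]:
    """Split a phone sequence into syllables using the three rules."""
    # Rule 1: fold a nasal into the vowel it follows
    merged = []  # list of (phone labels, is_vowel)
    for p in phones:
        if p in NASALS and merged and merged[-1][1]:
            merged[-1][0].append(p)
        else:
            merged.append(([p], p in vowels))

    nuclei = [i for i, (_, is_vowel) in enumerate(merged) if is_vowel]
    if not nuclei:
        return [list(phones)]

    # starts[j] is the index in `merged` where syllable j begins
    starts = [0]
    for prev, nxt in zip(nuclei, nuclei[1:]):
        n_cons = nxt - prev - 1
        if n_cons <= 1:
            start = prev + 1  # assumption: lone consonant joins the next syllable
        elif n_cons == 2 and merged[nxt - 1][0][0] in ONSET_SECOND:
            start = prev + 1  # rule 3 exception: both consonants form the onset
        else:
            start = prev + 2  # rules 2 and 3: first consonant is the coda
        starts.append(start)

    ends = starts[1:] + [len(merged)]
    return [[p for toks, _ in merged[s:e] for p in toks] for s, e in zip(starts, ends)]
```

For the slide's example, the phones s a mz s k r i t are grouped as [s a mz s] [k r i t]: the nasal stays with the vowel, the first of the three consonants becomes the coda, and the rest become the onset.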

DEVELOPMENT OF SPEECH DATABASES
GENERATION OF SPEECH CORPUS
• Sentences are selected from a large text corpus (available with LTRC), taking high frequency syllables into account
• A sentence is selected if it has at least one high frequency syllable not present in the previously selected sentences
• Recording is done in a lab environment
LABELLING THE CORPUS
• Mark the boundaries of speech segments at the phone level
• Emulabel
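The sentence selection rule can be sketched as a single greedy pass over the text corpus. The input layout (each sentence paired with its already syllabified form) and the function name are assumptions made for illustration.

```python
from typing import Iterable, List, Sequence, Set, Tuple


def select_sentences(
    sentences: Iterable[Tuple[str, Sequence[str]]], high_freq: Set[str]
) -> Tuple[List[str], Set[str]]:
    """Keep a sentence if it adds at least one high frequency syllable
    that no previously selected sentence contains."""
    covered: Set[str] = set()
    selected: List[str] = []
    for text, syllables in sentences:
        new = (set(syllables) & high_freq) - covered
        if new:
            selected.append(text)
            covered |= new
    return selected, covered
```

The second return value is the set of high frequency syllables covered so far, which makes it easy to see which syllables still lack a carrier sentence.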

DEVELOPMENT OF SPEECH DATABASES
CORPUS DETAILS
• Hindi
  - 96 minutes
  - 2391 high frequency syllables (23096 total realizations)
• Telugu
  - 110 minutes
  - 2291 high frequency syllables (33417 total realizations)

DEMOS AND COMPARISON
PMF based Speech Synthesizer vs. Festvox based Speech Synthesizer
Samples: Hindi Sample 1, Hindi Sample 2, Telugu Sample 1, Telugu Sample 2

CONCLUSION
• Data driven synthesis methods are capable of producing high quality synthesis
• The syllable is a more suitable unit than the diphone for Indian languages
• A unit selection algorithm based on simple prosodic features performs as well as, or sometimes better than, other methods
• Selecting a unit that satisfies local phonetic constraints, and prosodic constraints through the prosodic matching function, produces high quality speech output