DATA DRIVEN SYNTHESIS APPROACH FOR INDIAN LANGUAGES USING SYLLABLE AS BASIC UNIT

S. P. Kishore*, Rohit Kumar** and Rajeev Sangal*
* Language Technologies Research Center, International Institute of Information Technology, Hyderabad
** Punjab Engineering College, Chandigarh

ORGANIZATION OF THE TALK
1. Text to Speech (TTS) Systems: Introduction
2. Components of TTS system
3. Concatenative Synthesis: Data Driven Approach
4. Issues in Data Driven Synthesis Approaches
5. Unit Selection Algorithm based on Prosodic Features
6. Development of Indian Language Text to Speech Systems
7. Demonstrations
8. Conclusions

TEXT TO SPEECH SYSTEMS: INTRODUCTION
A text to speech system converts arbitrary text into a speech waveform.

LANGUAGE PROCESSING MODULES
PREPROCESSOR:
- Text is processed to expand abbreviations, addresses, numbers, etc., and to detect sentence ends from punctuation such as . , ? ! : ;
CONTEXTUAL ANALYZER:
- Analyze the part of speech of words. In English, words like record, permit and present are stressed on the first syllable when used as a noun and on the second when used as a verb.
- Analyze the structure of the sentence (declarative, interrogative). This is useful for obtaining an appropriate prosodic structure.
PHRASE ANALYSIS:
- Decompose the text into phrases and clauses. This gives the synthesis engine useful hints about where to place prosodic pauses.
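The preprocessor step can be illustrated with a minimal Python sketch. It is not taken from the talk: the abbreviation table, the digit-by-digit number reading and the function names `expand` and `split_sentences` are all assumptions chosen to keep the example small.

```python
import re

# Toy abbreviation table (assumed for illustration only).
ABBREVIATIONS = {"Dr.": "Doctor", "St.": "Street", "No.": "Number"}
DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def expand(text):
    """Expand known abbreviations and read digit strings out digit by digit."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    # Assumption: numbers are spoken one digit at a time ("42" -> "four two").
    return re.sub(r"\d+",
                  lambda m: " ".join(DIGIT_WORDS[int(d)] for d in m.group()),
                  text)


def split_sentences(text):
    """Split after '.', '?' or '!' when followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]


sentences = split_sentences(expand("Dr. Rao lives at No. 42. Is he home?"))
print(sentences)
```

Abbreviations are expanded before sentence splitting so that the period in "Dr." is not mistaken for a sentence end.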

LANGUAGE PROCESSING MODULES
LETTER TO SOUND (LTS) RULES:
- Pronunciation of words may differ from spelling, for example in English: known, write, wrong, etc.
- LTS rules provide a phonetic transcription of the input text.
- Use pronunciation dictionaries for languages like English.
COARTICULATION AND PROSODIC GENERATOR:
- Incorporate coarticulation and prosodic knowledge such as intonation and duration

SYNTHESIS STRATEGIES
ARTICULATORY BASED SYNTHESIS:
- Simplified models of the articulators, or of observed vocal tract shapes, are devised
- Rules are specified to control the positions of the articulators
- The difficulty lies in modeling the motion of the articulators
PARAMETER BASED SYNTHESIS:
- Speech segments are parameterized in terms of formant frequencies or linear prediction coefficients
- Rules are formed to manipulate the parameters of the speech unit so that they manifest coarticulation, intonation and duration
- Several hundred precisely crafted rules may be required

SYNTHESIS STRATEGIES
CONCATENATIVE BASED SYNTHESIS:
- Stored speech segments are concatenated for synthesis
- Speech segments are also referred to as units
- A unit can be a phrase, word, syllable, diphone or phone.
- A longer unit such as a phrase or word may have a wide range of prosodic variations depending on the context. For example: 1) He cleaned it. 2) It has to be cleaned.
- A smaller unit such as a phone may not carry the required coarticulation. For example, merely concatenating the isolated sounds "c" "l" "e" "a" "n" does not result in "clean" as spoken in continuous speech.

SYNTHESIS STRATEGIES
CONCATENATIVE BASED SYNTHESIS:
- Sub-word units such as diphones and syllables are considered suitable for concatenation.
- Prosodic variations:
  - Intonation and duration could be acquired and incorporated in the form of rules
  - Store multiple realizations of units with differing prosody

DATA DRIVEN SYNTHESIS APPROACH

ISSUES IN DATA DRIVEN APPROACHES
- Choice of unit
  - The basic unit could be a sub-word unit
  - It should be possible to select variable length units
- Characteristics of the speech database
  - Balanced in terms of coverage and prosodic diversity of all the units
  - Duration of the speech database?
- Criteria to select a unit
  - A unit is selected depending on how well it matches the input specification and how well it matches the other units in the sequence

EARLIER WORKS ON UNIT SELECTION ALGORITHM
Hunt and Black (1996)
UNIT FEATURES
- Phonemic information such as labels of neighboring units and position in phrases
- Prosodic features such as energy, duration and pitch
FOR SYNTHESIS
- Prosodic features are PREDICTED for target units
- A particular realization of a unit is selected depending on how well it minimizes a cost function combining unit distortion and a discontinuity measure
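A cost-minimizing selection of this kind is commonly implemented as a dynamic programming (Viterbi) search over candidate units. The sketch below is illustrative only: the function name `viterbi_select` and the two cost callbacks are assumptions, and the toy integer costs in the usage lines stand in for real unit distortion and discontinuity measures.

```python
def viterbi_select(targets, candidates, target_cost, join_cost):
    """Pick one candidate per target so that the summed cost is minimal.

    candidates[i] lists the stored realizations available for targets[i].
    target_cost(t, c) stands in for the unit distortion and
    join_cost(p, c) for the discontinuity between adjacent units.
    """
    best = [target_cost(targets[0], c) for c in candidates[0]]
    back = []  # back[i-1][j]: best predecessor index for candidate j at step i
    for i in range(1, len(targets)):
        new_best, pointers = [], []
        for c in candidates[i]:
            costs = [best[j] + join_cost(p, c)
                     for j, p in enumerate(candidates[i - 1])]
            j = min(range(len(costs)), key=costs.__getitem__)
            new_best.append(costs[j] + target_cost(targets[i], c))
            pointers.append(j)
        best = new_best
        back.append(pointers)

    # Trace the cheapest path backwards.
    j = min(range(len(best)), key=best.__getitem__)
    total = best[j]
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)], total


# Toy example: units are plain integers, costs are made up.
units, cost = viterbi_select(
    targets=[10, 20],
    candidates=[[10, 14], [24]],
    target_cost=lambda t, c: abs(t - c),
    join_cost=lambda p, c: 2 * abs(c - p - 10),
)
print(units, cost)
```

In the toy example the candidate 14 is chosen over the perfect target match 10 because it joins more smoothly with the only available successor, which is the trade-off the combined cost is meant to capture.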

EARLIER WORKS ON UNIT SELECTION ALGORITHM
Black and Taylor (1997)
UNIT FEATURES
- Decision trees are built for each unit type, based on questions concerning phonemic and prosodic context
- Units of the same type are further clustered into different groups based on the acoustic similarity within the group
FOR SYNTHESIS
- For each target unit, the appropriate decision tree is used to find the best cluster of candidate units
- A search is then made to find the best path through the candidate units
- A particular realization of a unit is selected taking into account the distance of the unit from its cluster center and the cost of joining two adjacent units

EARLIER APPROACHES FOR UNIT SELECTION
- Need a prediction component to get the prosodic features of the target units
- Build decision trees and define a measure of acoustic similarity, based on which units of the same type are further clustered
Questions:
- Can we avoid predicting the prosodic features of target units from a priori knowledge (which may not even be available)?
- How do you define a measure of acoustic similarity based on the cepstrum for longer units such as syllables and words? (For two units with different durations, a simple Euclidean distance between the cepstral coefficients of the units is not enough.)

UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES
Unit selection is based mainly on pitch, duration and energy
- No need for prior prosodic knowledge
- Does not cluster the units based on Euclidean distance between cepstral coefficients
Hypothesis: A unit is best perceived in synthesized speech if its neighboring units have the same, or at least similar, prosodic and phonetic features as its neighbors in the recorded speech from which the unit is selected.

UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES
UNIT FEATURES
Phonetic context:
- Position of the unit (syllable)
- Last phone of the previous unit
- First phone of the succeeding unit
Prosodic context:
- Pitch, duration and energy
- Previous unit's pitch, duration and energy
- Succeeding unit's pitch, duration and energy

UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES
FOR SYNTHESIS
Having selected the (k-1)th unit, a realization of the kth unit is selected if it satisfies the following conditions:
- Its position in the word is the same as required
- It has the coarticulation effect of the last phone of the previous unit
- Its prosodic features (pitch, duration and energy) match the expectations of the (k-1)th unit
WHAT ARE THESE EXPECTATIONS, AND HOW DO YOU MEASURE PROSODIC SIMILARITY?

UNIT SELECTION BASED ON PROSODIC AND PHONETIC FEATURES
EXPECTATIONS OF THE (k-1)th UNIT
The actual values of pitch, duration and energy of the unit that follows the (k-1)th unit in the speech corpus.
If E_a, D_a and P_a are the energy, duration and pitch of the kth syllable, and E_e, D_e and P_e are the expectations of the (k-1)th syllable, then the prosodic matching function m(.) measures how closely (E_a, D_a, P_a) matches (E_e, D_e, P_e).

SIGNIFICANCE OF PROSODIC MATCHING FUNCTION
- A lower value of the function m(.) indicates a better prosodic match
- The function m(.) attains its least possible value, 0, if
  - E_e = E_a, D_e = D_a and P_e = P_a (the kth unit happens to be the successor of the (k-1)th unit in the speech corpus)
  - Selecting this kth unit leads to the selection of a longer sequence consisting of two units
- Thus the function m(.) implicitly selects longer available sequences such as words, phrases and even sentences
HOW WELL DOES IT WORK?
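The selection conditions and the zero-valued match for corpus successors can be sketched in Python. The exact form of m(.) is not given in this transcript, so the sum of absolute differences used here is an assumption (it only shares the property that m = 0 when expectations and actual values coincide). The `Unit` fields, the function names and the assumption that the three features are already scaled to comparable ranges are likewise illustrative.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Prosody = Tuple[float, float, float]  # (energy, duration, pitch)


@dataclass(frozen=True)
class Unit:
    syllable: str
    position: str               # position in the word, e.g. "begin", "end"
    prev_phone: Optional[str]   # last phone of the unit before it in the corpus
    last_phone: str             # last phone of this unit
    prosody: Prosody            # (E_a, D_a, P_a) of this realization
    expects: Optional[Prosody]  # prosody of its successor in the corpus


def prosodic_match(expected: Prosody, actual: Prosody) -> float:
    """Assumed stand-in for m(.): 0 for identical values, larger when further apart."""
    return sum(abs(e - a) for e, a in zip(expected, actual))


def select_next(prev: Unit, candidates: List[Unit],
                syllable: str, position: str) -> Optional[Unit]:
    """Choose a realization of the k-th unit given the selected (k-1)-th unit."""
    fits = [c for c in candidates
            if c.syllable == syllable
            and c.position == position            # same position in the word
            and c.prev_phone == prev.last_phone]  # matching coarticulation context
    if not fits:
        return None          # fallback behaviour is not specified in the talk
    if prev.expects is None:
        return fits[0]       # prev ended its utterance: no expectations to match
    return min(fits, key=lambda c: prosodic_match(prev.expects, c.prosody))


prev = Unit("rA", "begin", None, "A", (1.0, 0.2, 120.0), (0.8, 0.15, 110.0))
candidates = [
    Unit("ma", "end", "A", "a", (0.5, 0.30, 140.0), None),
    Unit("ma", "end", "A", "a", (0.8, 0.15, 110.0), None),  # corpus successor
    Unit("ma", "end", "i", "a", (0.8, 0.15, 110.0), None),  # wrong context
]
chosen = select_next(prev, candidates, "ma", "end")
print(chosen, prosodic_match(prev.expects, chosen.prosody))
```

Because the second candidate carries exactly the prosody that `prev` expects, its score is 0 and it is chosen, which is how consecutive corpus units end up being picked together.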

DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES
Nature of Indian Scripts
a. Common phonetic base
b. About 35 consonants and 18 vowels
c. Phonetic nature of the languages: we almost speak what we write
Choice of Unit: Syllable
a. The basic units of the Indian writing system are characters
b. These characters are close to syllables and are typically of the form C, V, CV, CVC, CCV

DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES
Letter to Sound (LTS) Rules
Almost one to one correspondence between character and sound
Inherent Vowel Suppression (IVS)
1. No two successive characters undergo IVS
2. The last character of a word has its vowel suppressed if that vowel is /a/, e.g. rAma -> rAm
3. For a character in word-middle position, IVS occurs if the next character in the word
   a. either is not the last character
   b. or has a vowel other than /a/

DEVELOPMENT OF TEXT TO SPEECH SYSTEMS FOR INDIAN LANGUAGES
Syllabification Rules
1. When a nasal (/mz/ or /mzz/) follows a vowel, it is treated as part of the vowel, e.g. samzskrit
2. When there are 3 or more consonants between consecutive vowels:
   1st consonant -> coda of the previous syllable
   Other consonants -> onset of the next syllable
3. When there are exactly 2 consonants between consecutive vowels:
   1st consonant -> coda of the previous syllable
   2nd consonant -> onset of the next syllable
   Exception: if the second consonant is from {/r/, /s/, /sh/, /shz/}, both the 1st and 2nd consonants -> onset of the next syllable
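The three rules above can be sketched as a small Python function. Several details are assumptions rather than part of the talk: the toy vowel inventory, the input being a list of phone labels, and the treatment of cases the rules do not mention (a single consonant between vowels goes to the onset of the next syllable, word-initial consonants join the first syllable, word-final consonants join the last).

```python
VOWELS = {"a", "A", "i", "I", "u", "U", "e", "o"}  # assumed toy vowel inventory
NASALS = {"mz", "mzz"}                             # nasal marks from rule 1
ONSET_SECOND = {"r", "s", "sh", "shz"}             # exception set from rule 3


def syllabify(phones):
    """Split a list of phone labels into syllables (each a list of labels)."""
    # Rule 1: a nasal mark right after a vowel is merged into that vowel.
    units = []  # (label, is_vowel)
    for p in phones:
        if p in NASALS and units and units[-1][1]:
            units[-1] = (units[-1][0] + p, True)
        else:
            units.append((p, p in VOWELS))

    nuclei = [i for i, (_, is_vowel) in enumerate(units) if is_vowel]
    if not nuclei:
        return [[label for label, _ in units]] if units else []

    starts = [0]  # index at which each syllable begins
    for prev, nxt in zip(nuclei, nuclei[1:]):
        n_cons = nxt - prev - 1  # consonants between two consecutive vowels
        if n_cons <= 1:
            start = prev + 1     # assumption: a lone consonant is an onset
        elif n_cons == 2 and units[prev + 2][0] in ONSET_SECOND:
            start = prev + 1     # rule 3 exception: both consonants are onset
        else:
            start = prev + 2     # rules 2 and 3: first consonant is the coda
        starts.append(start)

    ends = starts[1:] + [len(units)]
    return [[label for label, _ in units[s:e]] for s, e in zip(starts, ends)]


for word in (["s", "a", "mz", "s", "k", "r", "i", "t"],
             ["k", "a", "r", "m", "a"],
             ["p", "a", "t", "r", "a"]):
    print(["".join(s) for s in syllabify(word)])
```

The phone sequences in the usage loop are made-up transliterations chosen to exercise rule 2, rule 3 and the rule 3 exception in turn.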

DEVELOPMENT OF SPEECH DATABASES
GENERATION OF SPEECH CORPUS
- Sentences are selected from a large text corpus (available with LTRC), taking high frequency syllables into account
- A sentence is selected if it has at least one high frequency syllable not present in the previously selected sentences
- Recording is done in a lab environment
LABELLING THE CORPUS
- Mark the boundaries of speech segments at phone level
- Emulabel (www.festvox.org/emu)
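The sentence selection criterion amounts to a single greedy pass over the text corpus. A minimal Python sketch follows, assuming each sentence has already been syllabified into a list of syllable labels; the function name `select_sentences` and the toy data are illustrative.

```python
def select_sentences(sentences, high_freq):
    """Keep a sentence if it adds a high frequency syllable not yet covered.

    sentences: list of sentences, each given as a list of syllable labels.
    high_freq: set of syllables regarded as high frequency.
    """
    covered = set()
    selected = []
    for sentence in sentences:
        new = (set(sentence) & high_freq) - covered
        if new:
            selected.append(sentence)
            covered |= new
    return selected, covered


corpus = [["ra", "ma"], ["ma", "ra"], ["ka", "ma", "la"], ["xa"]]
selected, covered = select_sentences(corpus, {"ra", "ma", "ka", "la"})
print(selected, sorted(covered))
```

The result depends on the order in which sentences are visited; the talk does not say how the corpus was ordered, so the sketch simply uses the given order.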

DEVELOPMENT OF SPEECH DATABASES
CORPUS DETAILS
- Hindi
  - 96 minutes
  - 2391 high frequency syllables (23096 total realizations)
- Telugu
  - 110 minutes
  - 2291 high frequency syllables (33417 total realizations)

DEMOS AND COMPARISON
Systems compared: PMF based speech synthesizer and Festvox based speech synthesizer
Samples: Hindi Sample 1, Hindi Sample 2, Telugu Sample 1, Telugu Sample 2

CONCLUSION
- Data driven synthesis methods are capable of producing high quality synthesis
- The syllable is a more suitable unit than the diphone for Indian languages
- A unit selection algorithm based on simple prosodic features performs as well as, or sometimes better than, other methods
- Selecting units that satisfy local phonetic constraints, and prosodic constraints through the prosodic matching function, produces high quality speech output
