Shyamal Kumar Das Mandal, Asoke Kumar Datta
Epoch Synchronous Non Overlap Add (ESNOLA) Method based Concatenative Speech Synthesis System for Bangla
Centre for Development of Advanced Computing (C-DAC), Kolkata, India
firstname.lastname@example.org
[Block diagram: system overview] Modules: Preprocessing Module, Text Analysis Module, Synthesizer Module. Components shown: Bengali text (input); Text Analyzer; Phonological Rules and Exceptional word list; Prosodic & Intonation information; Segmentation of the speech signal; Signal Dictionary; Partneme; Phonetic Synthesizer; ESNOLA Scheme; Synthesized speech output.
[Block diagram: Linguistic Analysis of text] Components shown: Text input in Nepali script; Grapheme to Phoneme Conversion; NLP (phrase/clause, parts-of-speech, number and abbreviation); Phonological Rules; Prosodic & Intonation Rules; Generated information for synthesizer.
Preprocessing module
Creation of nonsense word set: This set of words must contain the acoustic-phonetic characteristics of all phonemes. A set of tetra-syllabic nonsense words of the forms CVCVCVCV and CVVCVVCVVCVVC is used for normal consonants and vowels. However, as the /n/-/ŋ/ distinction in Bangla is not ascertained except in conjunction with an appropriate consonant, an additional eight-syllable form CVNVCVNVCVNVCVNV is included for the two nasals /n/ and /ŋ/. The choice of tetra-syllabic words is necessary because Bangla is a bound-stress language, with stress normally occurring on the first syllable. Another set of words has to be collected from the normal lexicon of the language, covering the different vowel-vowel combinations. Not all possible combinations may be readily available in the lexicon. For the unusual combinations, appropriate sentences are created in which such combinations occur at word junctures.
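The three word forms named above can be enumerated mechanically. The following is a minimal sketch only: it assumes that the same consonant and the same vowel fill every C and V slot of a word, and it uses romanized placeholder symbols ("n", "ng") for the two nasals. Neither assumption is stated in the slides, and the function name is hypothetical.

```python
from itertools import product

def nonsense_words(consonants, vowels, nasals=("n", "ng")):
    """Enumerate nonsense words of the three forms named in the slides.

    Assumption (not from the source): one consonant and one vowel are
    repeated in every slot of a given word.
    """
    words = []
    for c, v in product(consonants, vowels):
        words.append((c + v) * 4)              # CVCVCVCV
        words.append((c + v + v) * 4 + c)      # CVVCVVCVVCVVC
        for n in nasals:
            words.append((c + v + n + v) * 4)  # CVNVCVNVCVNVCVNV
    return words

if __name__ == "__main__":
    for word in nonsense_words(["k"], ["a"]):
        print(word)
```

A real recording script would be built from the full Bangla phoneme inventory and would probably vary the vowels within a word; the sketch only illustrates the slot patterns.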
Required signal processing aspects: pitch normalization, amplitude normalization, segmentation process.
[Figure: segmented waveform with C, V, CV and VC unit labels]
Token generation
1. CVCV: C + CV + V + VC + C + CV + V + Vo
2. VCV: Vi + V + VC + C + CV + V + Vo
3. CVYV: C + CV + V + VY + YV + Vo
Examples:
/kaakaa/: k + kaa + aa + aak + k + kaa + aa + aao
/aami/: aa + aa + aam + m + mi + i + i
/paay/: p + paa + aay
/bhaarat/: bh + bhaa + aa + aar + r + ra + a + at + t
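The token patterns above alternate steady-state units with the transitions between neighbouring phonemes, and add a vowel onset (Vi) or offset (Vo) at the word edges. A minimal sketch of that expansion follows. The vowel set, the function name, and the "(i)" and "(o)" suffixes used to mark onset and offset tokens are all assumptions made for illustration; the input is assumed to be already split into phonemes.

```python
# Hypothetical romanized vowel set, for illustration only.
VOWELS = {"a", "aa", "i", "u", "e", "o"}

def partneme_tokens(phonemes, vowels=VOWELS):
    """Expand a phoneme sequence into a partneme-style token sequence.

    Every phoneme contributes its own unit, every adjacent pair
    contributes a transition unit, a word-initial vowel gets an onset
    token and a word-final vowel gets an offset token.
    """
    if not phonemes:
        return []
    tokens = []
    if phonemes[0] in vowels:
        tokens.append(phonemes[0] + "(i)")      # vowel onset, Vi
    for idx, ph in enumerate(phonemes):
        tokens.append(ph)                       # C or steady-state V
        if idx + 1 < len(phonemes):
            tokens.append(ph + phonemes[idx + 1])  # CV, VC, VV transition
    if phonemes[-1] in vowels:
        tokens.append(phonemes[-1] + "(o)")     # vowel offset, Vo
    return tokens

if __name__ == "__main__":
    print(" + ".join(partneme_tokens(["bh", "aa", "r", "a", "t"])))
    print(" + ".join(partneme_tokens(["k", "aa", "k", "aa"])))
```

The sketch treats the glide in /paay/ like any other consonant, so its output for that word is longer than the three tokens listed in the slide.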
Concatenative Approach
Concatenative synthesis is based on putting together pieces (acoustic units) of natural (recorded) speech to synthesize an arbitrary utterance. We are following the ESNOLA* method.
Example: /bhaarat/ → bh + bhaa + aa + aar + r + ra + a + at + t
* Epoch Synchronous Non-OverLapping Add
Intonation and prosodic parameters
Aim for naturalness. Parameters: pitch contour, duration, amplitude, pause.
1. Intonation and prosodic rule generation
2. Modification of the synthesized speech signal to introduce intonation and prosody
Implementation of naturalness in synthesizer
Intensity modification (amplitude modification): This is done by multiplying each sample value of the segment by the value specified by the amplitude parameter of the corresponding token.
Duration modification: In the present system this operation is performed on the steady-state vowel segment. The length of the steady state of a vowel segment depends on the syllable duration. Note that the durations of consonants and of the CV and VC transitions are pre-specified.
F0 modification: Pitch (F0) modification of the synthesized signal is one of the important aspects of introducing intonation into the synthesized speech. In the segment dictionary, the signals whose pitch has to be modified are the CV, VC, VV, nasal murmurs and laterals. Time-scale pitch modification is done by changing the length of the period of the original signal. In ESNOLA, pitch (F0) modification involves three steps: (1) generation of short-time signals from the original speech waveform, (2) epoch-synchronous modification of the short-time signals, and (3) synthesis by concatenation of the modified signals.
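The three modifications can be illustrated with a short sketch. It is a simplification under stated assumptions, not the authors' implementation: epochs are given as sample indices, each short-time signal is taken to be one pitch period running from one epoch to the next, a shortened or lengthened period is faded out with a raised-cosine tail, and steady-state duration is changed by repeating or dropping whole periods. The function names and the `taper_fraction` parameter are hypothetical, and `taper_fraction` is not the α of the slides.

```python
import math

def scale_amplitude(segment, gain):
    """Intensity modification: multiply every sample by the token's gain."""
    return [s * gain for s in segment]

def split_at_epochs(x, epochs):
    """Step 1: cut the waveform into one short-time signal per pitch period."""
    return [x[a:b] for a, b in zip(epochs, epochs[1:])]

def resize_period(period, new_len, taper_fraction=0.2):
    """Step 2: truncate or zero-pad one period to new_len samples.

    The tail of the retained samples is faded with a raised cosine so the
    period ends at zero (an assumed window, not the one in the source).
    """
    kept = min(len(period), new_len)
    out = list(period[:kept]) + [0.0] * (new_len - kept)
    n_taper = max(1, int(taper_fraction * kept))
    for i in range(n_taper):
        w = 0.5 * (1.0 + math.cos(math.pi * (i + 1) / n_taper))
        out[kept - n_taper + i] *= w
    return out

def modify_pitch(x, epochs, factor, taper_fraction=0.2):
    """Step 3: concatenate the resized periods without overlap.

    factor > 1 shortens each period (higher F0); factor < 1 lengthens it.
    """
    out = []
    for period in split_at_epochs(x, epochs):
        new_len = max(1, round(len(period) / factor))
        out.extend(resize_period(period, new_len, taper_fraction))
    return out

def stretch_steady_state(periods, target_count):
    """Duration modification: repeat or drop whole steady-state periods."""
    return [periods[i * len(periods) // target_count] for i in range(target_count)]

if __name__ == "__main__":
    x = [math.sin(2 * math.pi * n / 10) for n in range(40)]
    epochs = [0, 10, 20, 30, 40]
    print(len(modify_pitch(x, epochs, 2.0)), len(modify_pitch(x, epochs, 0.5)))
```

Because each period is used exactly once, changing the period length here also changes the segment duration; the period-repetition helper can be used to compensate on steady-state vowels.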
Generation of Short-Time (ST) Signals
Let x(t) be the digitized speech waveform and let e_m, m = 1, 2, …, represent the successive epoch positions in the signal. The intermediate representation of x(t) is a sequence of short-time (ST) signals, defined by for 0
[Figure: ST signal for e1 in Figure 2.16, for n = 3 and α = 4]
If the value of α is much higher, the effect on the synthesized signal is as if each glottal pulse were generated long after the previous glottal pulse had died down. This condition creates a creaky voice. Similarly, if the value of α is much lower, the effect is as if each glottal pulse were generated well before the previous one had died down, which creates a breathy voice. Empirically, α = 0.25 was found to produce good synthesized output.
Conclusions
The ESNOLA framework and the partneme inventory together give a simple approach for producing high-quality synthesized speech, and are particularly useful for an intonated concatenative synthesis system. Using only the epoch information of the voiced speech signal, pitch and prosody can be manipulated while keeping the quality intact. The attraction of the present approach is its computational simplicity for pitch and duration manipulation. For prosody modification, it is also necessary to manipulate pitch and duration in the CV, VC, murmur and lateral portions of the stored signals. An epoch detection algorithm is needed for manipulating pitch and duration in these cases, but running it at synthesis time can be avoided by detecting the epochs offline and storing them in files. Implementing natural prosody and intonation needs a comprehensive set of rules for the spoken dialect. Unfortunately, this is not yet available for Standard Colloquial Bangla (SCB). Therefore a system for flat speech using this technique has been developed for use. Recently this system was used by the Election Commission to announce the results of elections held in West Bengal.