Fundamental Frequency Contour Synthesis for Turkish Text to Speech


1 Fundamental Frequency Contour Synthesis for Turkish Text to Speech
Erkan Abdullahbeşe

2 Content: TTS systems and prosody Turkish Intonation, Stress
Observations on Collected Data Methodology Improvements on Methodology Discussion Conclusion

3 Introduction to Text to Speech (TTS) Systems
Text -> speech signal Widespread applications Message to speech generation Man-machine dialogue Multimedia applications Talking aids for handicapped CHALLENGE: Machine Accent -> Natural Speech SOLUTION: Prosody Generation Modules

4 What is Prosody? Properties of speech that cannot be derived from the phoneme sequence Modulation of voice pitch Rhythm, changes in durations Fluctuations of loudness Related to domains larger than one phoneme (supra-segmental properties)

5 Basic Acoustic Parameters
Fundamental Frequency F0 (pitch) Duration Intensity Prosodic Phenomena Modulate the basic acoustic parameters Modulation of fundamental frequency Intonation Stress (accent)

6 Intonation Stress Ensemble of pitch variations
Perceived as speech melody Stress Modulate all the basic acoustic parameters Increase in F0 and intensity (loudness) Lengthening in duration Three types: Word stress Phrase stress Sentence stress Stress on a single syllable Phrase and sentence stress coincide with word stress

7 Prosody Generation Modules in TTS
Prosodic description Prosodic phrasing -> phrase boundaries Accent labeling -> accents on syllables Prosodic labels -> F0 contour PROBLEMS Complex linguistic processing units (morphology, syntax, semantics) Speaker-dependence Articulation-related problems: microprosody vs. macroprosody

8 Basic Intonation Models
Tone sequence models: the pitch contour is a sequence of fluctuations generated by local accents. Pierrehumbert: a sequence of independent H and L tones (orthography). Pitch accent -> pitch movements on stressed syllables; boundary tone -> at phrase boundaries; phrase accent -> between the stressed syllable and the phrase boundary.
Superposition models: the pitch contour is the superposition of several components with different domains: syllables, words, phrases, sentences, paragraphs, whole text. Fujisaki: purely mathematical, parametric model with a base F0, a phrase component (response of a critically damped second-order system to an impulse) and an accent component (response of a critically damped second-order system to a rectangular command). Parameter values are optimized with respect to the measured F0 (analysis by synthesis). Möbius -> Fujisaki + linguistics -> German.
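The superposition idea behind the Fujisaki model can be sketched as follows: log F0 is a base value plus phrase components (impulse responses) plus accent components (responses to rectangular commands, written as the difference of two step responses). The function names, the default constants `alpha`, `beta`, `gamma` and the command timings are illustrative assumptions, not values from this presentation.

```python
import math

def phrase_response(t, alpha=2.0):
    """Impulse response of a critically damped second-order system (alpha is assumed)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_step(t, beta=20.0, gamma=0.9):
    """Step response of a critically damped second-order system, capped at gamma (assumed)."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def fujisaki_f0(t, base_f0, phrase_cmds, accent_cmds):
    """F0 in Hz at time t (seconds).

    phrase_cmds: list of (onset_time, amplitude) impulses
    accent_cmds: list of (onset_time, offset_time, amplitude) rectangular commands
    """
    log_f0 = math.log(base_f0)
    for t0, amp in phrase_cmds:
        log_f0 += amp * phrase_response(t - t0)
    for t1, t2, amp in accent_cmds:
        # A rectangular command is a step switched on at t1 and off at t2.
        log_f0 += amp * (accent_step(t - t1) - accent_step(t - t2))
    return math.exp(log_f0)
```

Before any command starts the contour sits at the base F0; one phrase command and one accent command raise it afterwards, for example `fujisaki_f0(0.3, 120.0, [(0.0, 0.5)], [(0.2, 0.4, 0.3)])`.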

9 Approaches Perform an analysis on a speech corpus
Transcribe the corpus Define F0 labels(rise, fall, peak etc.) and boundary labels (minor, major etc.) Labeling By hand Examination -> rules -> automatic Automatic learning of : labels -> F0 values (or parametrized) Neural Networks Stochastic methods Intonation pattern dictionary (from natural speech) Store pitch values in ST and key information (labels) for each pattern For the patterns in input sentence -> compare key info -> find closest pattern from dictionary -> apply pitch

10 Approaches For integration into TTS (labeling input sentence from text) Complex linguistic processing units Morphology Syntax Semantics Stochastic methods Syntax -> most probable label sequence

11 Sentence Intonation Types
Terminal intonation pitch decreases at the end -> message completed Interrogative intonation pitch slightly increases on the last syllable -> waiting for response Progressive intonation pitch either increases slightly or does not show any lowering at the end -> message not completed yet

12 Turkish Intonation Classification of sentences Type: Declaratives(↓)
wh-questions(↑) yes-no questions(↓) Structure: Simple Compound: (↑) at the end of subordinate Meşgul olduğundan(↑) bizimle sinemaya gelemedi(↓).

13 Turkish Intonation Tone groups (phrase or segment)
Division into tone groups / Oraya varınca beni arayın. / / Oraya varınca / beni arayın. / Focus (new information) in each tone group Pitch variations on focus

14 Turkish Intonation Four levels of pitch: low(1), mid(2), high(3), extra high(4) gi2di3yoru1m sa2hi4 mi1 Speech melody <–> musical melody (Nash) Hierarchy of intonation units(phrase -> text) Each intonation unit -> melody Successive intonation units related by motifs -> melody of the upper level Music: reiteration of motifs -> musical melody

15 Turkish Stress Word Stress
Fixed (bound) stress vs. free stress (Turkish). Stress falls on a single syllable of a word in Turkish. Effect of suffixes on stress: root stressed on its final syllable + stressable suffix: yolcu + -lar → yolcular; root stressed on its final syllable, unstressable suffix involved: oku + -yor → okuyor + -lar → okuyorlar; root stressed on a non-final syllable: karınca + -lar → karıncalar. Word stress may disappear in the sentence.

16 Turkish Stress Sentence Stress
Signals the prominence of the most information-bearing element in a sentence. Types: unmarked (preverbal position): Yarın İstanbul’a gidiyorlar. Marked (any position). Focusing elements: precede focus: sadece, daha: Mehmet daha bugün ödevine başlayabildi. Follow focus: -mi, da, bile: Ayla mı bugün Ankara’dan dönüyor?

17 Turkish Stress Phrase Stress Phrase: modifier or complement and head
Phrase stress on modifier in Turkish Types Phrases used as nouns telefon ahizesi güzel çiçekler Phrases used as verbs hızlı koş severek yaşa Others senin için yarından sonra Preserved in the sentence

18 Motivation Nevin bugün menemen yemeli. (template) N Z F V
Nevin menemen yemeli. N F V Bizim Nevin domatesli menemen yemeli. P N A F V Nalan yarın ayna alıyor. N Z F V Nalan ayna alıyor. N F V Kardeşim Nalan yeni ayna alıyor. N N A F V

19 Nevin bugün menemen yemeli.
Nevin menemen yemeli.

20 Nevin bugün menemen yemeli.
Bizim Nevin domatesli menemen yemeli.

21 Nevin bugün menemen yemeli.
Nalan yarın ayna alıyor.

22 Nevin bugün menemen yemeli.
Nalan ayna alıyor.

23 Nevin bugün menemen yemeli.
Kardeşim Nalan yeni ayna alıyor.

24 Sentences 100 database sentences
Sentence type       Positive   Negative
Declaratives        25         15
Wh-questions        10         5
Yes-no questions    10         5
Conditionals        6          4
Imperatives         6          4
Exclamations        6          4
19 close test sentences (add/remove categories)
18 random test sentences
Syllable-based hand labeling
Pitch extraction

25 Observations: Declaratives
Pitch decrease at the end (terminal intonation). Division into phrases. Pitch increase on the phrase-final syllable (progressive intonation). Example: Nevin / bugün / menemen yemeli.

26 Observations: Declaratives
Pitch decrease at the end (terminal intonation). Division into phrases. Pitch increase on the phrase-final syllable (progressive intonation). Example: Evvelki gün / ikimiz de / kuyumcu Ali’ye uğradık.

27 Observations: Wh-questions
Pitch increase on the last syllable (interrogative intonation). Evident pitch increase on the stressed syllable of the wh-word. No division into phrases. Word stress often disappears. Example: Dün neden zamanımı aldın?

28 Observations: Wh-questions
Pitch increase on the last syllable (interrogative intonation). Evident pitch increase on the stressed syllable of the wh-word. No division into phrases. Word stress often disappears. Example: Kimler yarın sınıf gezisine katılacaklar?

29 Observations: Yes-no questions
Pitch decrease at the end. Evident pitch increase on the stressed syllable of the word before -mi. No division into phrases. Word stress often disappears. Example: Oraları yine eskisi gibi güzel mi?

30 Observations: Yes-no questions
Pitch decrease at the end. Evident pitch increase on the stressed syllable of the word before -mi. No division into phrases. Word stress often disappears. Example: Mudanya’da bu sene de çok yağmur yağıyor mu?

31 Observations: Conditionals
Pitch decrease at the end (terminal intonation). Division into phrases. Pitch increase on the phrase-final syllable (progressive intonation). -se is always a phrase-final syllable. Example: İnsan azimliyse herşeyi başarabilir.

32 Observations: Conditionals
Pitch decrease at the end (terminal intonation). Division into phrases. Pitch increase on the phrase-final syllable (progressive intonation). -se is always a phrase-final syllable. Example: Babam keyifsizse ona konuyu bu akşam anlatamam.

33 Observations: Imperatives
Pitch decrease at the end (terminal intonation). Division into phrases. Pitch increase on the phrase-final syllable (progressive intonation). Example: Akşam yemeği için çarşıdan birşeyler alsınlar.

34 Observations: Imperatives
Pitch decrease at the end (terminal intonation). Division into phrases. Pitch increase on the phrase-final syllable (progressive intonation). Example: Sevgiyi ve mutluluğu yarınlara erteleme.

35 Observations: Exclamations
Diverse. Pitch decrease at the end (terminal intonation). Evident pitch increase on the stressed syllable of the interjection or of another word. Example: Aman büyüklerine bir saygısızlık yapma!

36 Observations: Exclamations
Diverse. Pitch decrease at the end (terminal intonation). Evident pitch increase on the stressed syllable of the interjection or of another word. Example: Haydi bugün hep birlikte pikniğe gidelim!

37 Local Observations
At most a single stressed syllable, excluding the phrase-final increase. Stress within the sentence coincides with the word stress. Phrase stress is preserved. Example: Ekonomik kriz / her kesimden insanı / olumsuz etkiledi.

38 Local Observations
At most a single stressed syllable, excluding the phrase-final increase. Stress within the sentence coincides with the word stress. Phrase stress is preserved. Example: Evvelki gün / ikimiz de / kuyumcu Ali’ye uğradık.

39 Local Observations Word stress may disappear
Beden sağlığımız için akşamları erken yatmalıyız. Mehmet daha bugün ödevine başlayabildi.

40 Local Observations Word stress disappears at the end of positives (terminal intonation) Nevin bugün menemen yemeli. Merve evine zamanında dönemez.

41 Local Observations Sentence stress (stress on focus)
Nevin bugün menemen yemeli. Mehmet daha bugün ödevine başlayabildi.

42 Local Observations: Effects on neighbour syllables
Unstressed + stressed (ne+vin). Stressed + stressed (nevin+bu+gün). Example: Nevin bugün menemen yemeli.

43 Local Observations: Effects on neighbour syllables
Stressed + stressed (partiye+gelmeyeceğim). Example: Ben akşam partiye gelmeyeceğim.

44 Local Observations: Effects on neighbour syllables
Stressed + unstressed (gece+rüyasında). Example: Kardeşim beni dün gece rüyasında görmüş.

45 Local Observations: Effects on neighbour syllables
Stressed + unstressed (ney+le). Example: Bu geç vakitte sizin eve neyle döneceğiz?

46 Local Observations: Effects on neighbour syllables
Stressed + unstressed (last syllable, terminal intonation) (değil+di). Example: Akşamki yemek pek güzel değildi.

47 Local Observations: Effects on neighbour syllables
Stressed + unstressed (last syllable, terminal intonation) (güzel+mi). Example: Oraları yine eskisi gibi güzel mi?

48 Methodology Overview
Choose the best sentence from a sentence database. Apply its pitch to the matching regions of the input sentence (compression / stretching by interpolation). Fit data to the remaining regions using interpolation. Block diagram: Read Files -> Choose Best Sentence -> Generate Regional Durations -> Apply Pitch

49 Methodology Read Files Input information used for sentences
Sentence type (declarative, wh-question, yes-no question, conditional, imperative, exclamation) Sentence state (positive or negative) Categories of each word Number of syllables of each word The index of the syllable bearing word stress, for each word (stress in sentence coincides with word stress)

50 Methodology Read Files
Word categories rely mainly on part-of-speech (POS) categories:
Category               Example            English
noun                   elma               apple
adjective              güzel              beautiful
pronoun                biz                we
verb                   geliyorum          I’m coming
adverb                 akşamleyin         in the evening
postposition           kadar              as…as
conjunction            fakat              but
interjection           aman
wh-word                hangi              which
question suffix word   almış mı           did he take
conditional            iyiyse             if good
number                 beş                five
auxiliary              şikayet (etti)     (he complained)
component              Ali’nin            Ali’s
focus                  kitap (okuyor)     (he reads) book
comma                  (,)

51 Methodology Choose Best Sentence
Search in database to find the best sentence Search the template sentences with the same Type State as the input sentence Two different approaches for Sentences other than questions Question sentences
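The input information and the first filtering step (same type and state as the input sentence) could be represented as in the sketch below. The class names, field names and the function `candidate_templates` are hypothetical and only illustrate the data the slides list; they are not taken from the original implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Word:
    category: str       # POS-based word category, e.g. "noun"
    n_syllables: int    # number of syllables in the word
    stress_index: int   # 0-based index of the syllable bearing word stress

@dataclass
class Sentence:
    sentence_type: str  # "declarative", "wh-question", "yes-no question", ...
    positive: bool      # sentence state: positive (True) or negative (False)
    words: List[Word]

def candidate_templates(database: List[Sentence], query: Sentence) -> List[Sentence]:
    """Keep only template sentences with the same type and state as the input."""
    return [s for s in database
            if s.sentence_type == query.sentence_type and s.positive == query.positive]
```

Only the sentences returned by this filter would then be scored against the input sentence.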

52 Sentences other than Questions
Calculate sentence resemblance scores based on word resemblance scores (WRS) Choose the template sentence having the maximum sentence resemblance score Word Resemblance Score (WRS) Measure of resemblance of two words Consists of Regional resemblance score (RRS) -> word stress information Category match score (CMS) -> word categories WRS = RRS + CMS

53 Regional Resemblance Score (RRS)
Makes use of the four regions defined for every word: (1) region before the stressed syllable, (2) stressed syllable, (3) region after the stressed syllable, (4) phrase-final syllable. Measures the resemblance of any two words in terms of these regions, based on the number of syllables in each region. Consists of the score of existing regions (ERS) and the score of lacking regions (LRS): RRS = 0.9 x ERS + LRS

54 Calculation of ERS and LRS
score = ERS = LRS = 0 (initialization)
for all regions
    if the region exists in both words
        score = min(1, NSRW1 / NSRW2)
        ERS = ERS + score
    else if the region is lacking in both words
        LRS = LRS + 1
    else
        LRS = LRS - 1
    endif
endfor
ERS: score of existing regions
LRS: score of lacking regions
NSRW1: number of syllables in the related region for the first word
NSRW2: number of syllables in the related region for the second word

55 Category Match Score (CMS)
Category match -> CMS = 3.7 (the maximum possible value of RRS)
Example: calculation of WRS for the words İstanbul and Ankara
Word       Region 1   Region 2   Region 3   Region 4
Ankara     -          An         kara       -
İstanbul   İs         tan        bul        -
ERS = 1/1 + 1/2 = 3/2
LRS = -1 + 1 = 0
RRS = 0.9 x 3/2 + 0 = 1.35
CMS = 3.7
WRS = 1.35 + 3.7 = 5.05
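The word resemblance score can be sketched in a few lines. Each word is given as a tuple of four syllable counts, one per region, with 0 meaning the region is lacking. Two assumptions are made here: the regional score is combined as 0.9 x ERS + LRS, which is the form that reproduces both the worked value 1.35 and the stated maximum of 3.7, and a category mismatch contributes 0. The function names are illustrative.

```python
CMS = 3.7  # category match score from the slides

def rrs(word1, word2):
    """Regional resemblance score for two words given as 4 syllable counts each."""
    ers = 0.0  # score of existing regions
    lrs = 0.0  # score of lacking regions
    for n1, n2 in zip(word1, word2):
        if n1 > 0 and n2 > 0:
            ers += min(1.0, n1 / n2)   # region exists in both words
        elif n1 == 0 and n2 == 0:
            lrs += 1.0                 # region lacking in both words
        else:
            lrs -= 1.0                 # region present in only one word
    return 0.9 * ers + lrs             # assumed additive combination

def wrs(word1, cat1, word2, cat2):
    """Word resemblance score: RRS plus CMS when the categories match (else 0, assumed)."""
    return rrs(word1, word2) + (CMS if cat1 == cat2 else 0.0)

# İstanbul: İs | tan | bul | -    Ankara: - | An | kara | -
istanbul = (1, 1, 1, 0)
ankara = (0, 1, 2, 0)
print(round(wrs(istanbul, "noun", ankara, "noun"), 2))
```

With İstanbul as the first word, region 3 scores min(1, 1/2) = 1/2, matching the ERS of 3/2 in the example.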

56 Sentence Resemblance Score
I1, I2, …, IN: words of the input sentence
D1, D2, …, DM: words of the template sentence
S: M x N score matrix with entries Si,j, where Si,j = WRS of the pair (Di, Ij)
Path: (Da, Ib), (Dc, Id), …, (De, If) with 1 ≤ a < c < … < e ≤ M and 1 ≤ b < d < … < f ≤ N
Score of a path: sum of the WRSs of its pairs
TASK: find the path with the maximum score (maximum score path)
Score of the maximum score path = sentence resemblance score: the optimum combination of word pairings that preserves word order

57 EXAMPLE
TEMPLATE: Geçen akşam hepimiz müziğin büyüsüne kapılmıştık.
INPUT: Büyük dayımız Kadıköy’deki evinde senelerdir yalnız oturuyor.
(akşam, Büyük), (müziğin, dayımız), (kapılmıştık, evinde): valid
(hepimiz, dayımız), (geçen, evinde), (büyüsüne, yalnız): invalid (template order broken: geçen precedes hepimiz)
(akşam, evinde), (müziğin, dayımız), (kapılmıştık, oturuyor): invalid (input order broken: dayımız precedes evinde)
(geçen, dayımız), (hepimiz, dayımız), (kapılmıştık, oturuyor): invalid (dayımız is paired twice)

58 Procedure MxN MPS : maximum path scores matrix
MxNx2 CMPS : maximum path scores coordinates matrix MPSi,j : contains the score of the maximum score path beginning with the pair (Di, Ij) CMPSi,j,k : contains the indices of the next pair in the same path ( for example if the max score path of (Di, Ij) is (Di, Ij), (Dm, In), …, (Dp, Iq), then CMPSi,j,1 = m and CMPSi,j,2 = n ) Recursive generation of MPS from itself and S CMPS generated from MPS

59 Procedure
for i = M, M-1, …, 1
    for j = N, N-1, …, 1
        if (i = M) or (j = N)
            MPSi,j = Si,j
            CMPSi,j,1 = CMPSi,j,2 = EMPTY
        else
            MPSi,j = Si,j + value of the max element of { MPSp,q | i+1 ≤ p ≤ M and j+1 ≤ q ≤ N }
            CMPSi,j,1 = first index (p) of that max element
            CMPSi,j,2 = second index (q) of that max element
        endif
    endfor
endfor

60

61 Finding the maximum score path from MPS and CMPS
Sentence resemblance score = maxi,j(MPSi,j) = MPSa,b for ex. MPSa,b -> max score path begins with (Da, Ib) Apply to CMPSa,b,1 and CMPSa,b,2 to obtain the second pair of the path If for ex. CMPSa,b,1 = c and CMPSa,b,2 = d -> (Dc, Id) is the second pair Similarly, apply to CMPSc,d,1 and CMPSc,d,2 to obtain the third pair of the path etc. Entire path is obtained
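The whole procedure (building MPS and CMPS backwards, then following the stored coordinates) can be sketched as below. Indices are 0-based here, `nxt` plays the role of CMPS, and the function name `max_score_path` is an illustrative choice. The sketch follows the recursion literally, so a continuation is always appended when one exists, as in the slides.

```python
def max_score_path(S):
    """S[i][j] = WRS of template word i and input word j. Returns (score, path)."""
    M, N = len(S), len(S[0])
    mps = [[0.0] * N for _ in range(M)]   # best path score starting at (i, j)
    nxt = [[None] * N for _ in range(M)]  # next pair on that path (CMPS)
    for i in range(M - 1, -1, -1):
        for j in range(N - 1, -1, -1):
            mps[i][j] = S[i][j]
            if i < M - 1 and j < N - 1:
                # Best continuation strictly after (i, j) in both sentences.
                p, q = max(((p, q) for p in range(i + 1, M) for q in range(j + 1, N)),
                           key=lambda pq: mps[pq[0]][pq[1]])
                mps[i][j] += mps[p][q]
                nxt[i][j] = (p, q)
    # The sentence resemblance score is the largest entry of MPS.
    start = max(((i, j) for i in range(M) for j in range(N)),
                key=lambda ij: mps[ij[0]][ij[1]])
    score = mps[start[0]][start[1]]
    path, cell = [], start
    while cell is not None:
        path.append(cell)
        cell = nxt[cell[0]][cell[1]]
    return score, path

S = [[5, 1, 0],
     [0, 1, 4],
     [2, 0, 3]]
score, path = max_score_path(S)
print(score, path)
```

The returned path lists (template index, input index) pairs in increasing order, which is exactly the word matching used later when the template pitch is applied.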

62 We obtained answers to the following questions:
What is the maximum resemblance capacity of the template sentence to the input sentence? Answer: the sentence resemblance score (score of the max score path). How is this maximum capacity reached, i.e. how are the words matched and the pairs chosen? Answer: as in the max score path.

63 Question Sentences Pitch curve of a question < - > Pitch curve of a word Whole question regarded as a word Use the same regions defined for words Region before the stressed syllable Stressed syllable (stressed syllable of the wh-word or question suffix word) Region after the stressed syllable Phrase-final syllable (exists for wh-questions) Use the same procedure assigning RRS to words to assign sentence resemblance score to the questions

64 EXAMPLE
Sentences: Ayşe bugün evde hangi yemeği yaptı? Bu su sesi yukarıdan mı geliyor?
Regions:
Sentence             Region 1         Region 2   Region 3      Region 4
Wh-question          Ayşebugünevde    han        giyemeğiyap   tı
Yes-no question      Bususesiyukarı   dan        mıgeliyor     -
Number of syllables per region:
Wh-question          6                1          5             1
Yes-no question      7                1          4             -

65 Methodology: Generate Regional Durations
Region -> one or more syllables. Inputs (for both the input and template sentences): the label files; the number of syllables for each word; the index of the syllable bearing word stress, for each word; whether the last syllable shows a pitch rise or not, for each word (conditional, wh-question). Assumes a perfect duration analysis for the input sentence (label file of the input sentence). Determines the duration of each region (its onset and end) for each word in both sentences.

66 Methodology: Apply Pitch
Inputs: regional durations generated by the previous block; pitch contour of the template sentence; the max score path pertaining to the input and template sentences. For all pairs of the path, the pitch of the template sentence is applied to the input sentence, for the regions existing in both elements of a pair. Usage of spline interpolation: stretching / compression in time; data fitting for nonexisting regions.
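Stretching or compressing a template region to the duration of the matching input region amounts to resampling its pitch samples. The sketch below uses plain linear interpolation as a simple stand-in for the spline interpolation named on the slide; the function name `resample` and the sample-count interface are assumptions made for illustration.

```python
def resample(values, n_out):
    """Stretch or compress a pitch segment to n_out samples (linear interpolation)."""
    if n_out <= 0 or not values:
        return []
    if len(values) == 1 or n_out == 1:
        return [values[0]] * n_out
    scale = (len(values) - 1) / (n_out - 1)  # input step per output sample
    out = []
    for k in range(n_out):
        x = k * scale
        i = min(int(x), len(values) - 2)     # left neighbour, kept inside the data
        frac = x - i
        out.append(values[i] * (1 - frac) + values[i + 1] * frac)
    return out

# A two-sample region of the template stretched to three samples.
print(resample([100, 200], 3))
```

The end points of the region are preserved, so adjacent regions that share a boundary value stay continuous after resampling.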

67 Improvements: Discarding Unvoiced Regions
Problem: unvoiced regions of the template sentence + spline -> distortions
Example:
Input: Yıldızlar dünyadan gündüz görülmez
Template: Zamanımı televizyonun karşısında boş yere harcayamam
Path: (zamanımı, yıldızlar), (karşısında, dünyadan), (yere, gündüz), (harcayamam, görülmez)
Problematic pairs: (karşısında, dünyadan) and (yere, gündüz); unvoiced regions in karşısında (/k/, /ş/ and /s/) and yere
Solution: discard zero samples (unvoiced) and then apply

68 Yıldızlar dünyadan gündüz görülmez.

69 Improvements
Problem: poor performance of spline outside the borders of the data points to be interpolated
Example:
Input: Didem her akşam odasında günlük gazeteleri okur
Template: Annem bize her zaman çok lezzetli yemekler pişirir
Problematic pairs: (annem, didem) and (pişirir, okur)
Word Region 1 Region 2 Region 3 Region 4
didem di dem -
annem an nem
okur o kur
pişirir pişi rir
Solution: apply the value of the outermost data point to the whole region if the region goes beyond this data point
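Holding the outermost data point outside the data range can be sketched as an interpolation routine that never extrapolates. Linear interpolation is used inside the range here instead of a spline, purely to keep the example short; the function name `interp_hold` is hypothetical, and `xs` is assumed to be strictly increasing.

```python
def interp_hold(xs, ys, x):
    """Interpolate inside [xs[0], xs[-1]]; hold the outermost value outside it."""
    if x <= xs[0]:
        return ys[0]    # region starts before the first data point
    if x >= xs[-1]:
        return ys[-1]   # region extends past the last data point
    points = list(zip(xs, ys))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 <= x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

times = [1.0, 2.0]
pitch = [100.0, 120.0]
print(interp_hold(times, pitch, 0.0), interp_hold(times, pitch, 1.5), interp_hold(times, pitch, 3.0))
```

This avoids the large swings that a spline can produce when it is evaluated beyond its first or last knot.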

70 Didem her akşam odasında günlük gazeteleri okur.

71 Improvements
Problem: spline sometimes yields unsatisfactory results within the data points. Example: Input: Çocuklar yazın güneşin altında fazla kalmamalı. Problematic region: /zın/ of yazın, generated by spline.

72 Improvements
Solution: check the spline; spline -> linear interpolation when necessary. Spline check: linear regression line, upper threshold and lower threshold lines for the pitch of the template sentence. If the spline exceeds the threshold lines: spline -> linear. Figure: linear regression and the two threshold lines.
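The spline check can be sketched as a least-squares line fitted to the template pitch, with two parallel threshold lines at a fixed distance above and below it. The slides do not state how far the thresholds lie from the regression line, so `margin` is an assumed parameter; all function names are illustrative, and the template is assumed to have at least two distinct time points.

```python
def regression_line(xs, ys):
    """Least-squares slope and intercept of ys against xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx

def exceeds_thresholds(xs, ys, slope, intercept, margin):
    """True if any point lies farther than margin from the regression line."""
    return any(abs(y - (slope * x + intercept)) > margin for x, y in zip(xs, ys))

def choose_curve(curve_xs, spline_ys, linear_ys, template_xs, template_ys, margin):
    """Return the spline result unless it leaves the threshold band; then fall back to linear."""
    slope, intercept = regression_line(template_xs, template_ys)
    if exceeds_thresholds(curve_xs, spline_ys, slope, intercept, margin):
        return linear_ys
    return spline_ys

template_t = [0, 1, 2, 3]
template_f0 = [200, 190, 180, 170]
overshoot = [190, 260, 180]   # spline output with an implausible bump
fallback = [190, 185, 180]    # linear interpolation over the same region
print(choose_curve([1, 1.5, 2], overshoot, fallback, template_t, template_f0, margin=30))
```

A spline that stays inside the band is returned unchanged, so the fallback only affects regions where the spline misbehaves.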

73 Çocuklar yazın güneşin altında fazla kalmamalı.

74 Discussion: Performance at sentence ends
Good, as expected, because the template is chosen from the same type and state. Microprosody degrades performance (unvoiced regions of the input sentence are unknown). Example: Kuzenim Nalan Oya’ya yarın alıyor.

75 Discussion: Performance at sentence ends
Good, as expected, because the template is chosen from the same type and state. Microprosody degrades performance (unvoiced regions of the input sentence are unknown). Example: Mars’ta hayat var mıdır?

76 Discussion: Performance at sentence ends
Erroneous endings (increase instead of decrease) due to the template pitch.

77 Discussion: Performance at sentence ends
Erroneous endings (increase instead of decrease) due to the template pitch.

78 Discussion Performance at movements (rises and falls) limited since
the method is confined to the capacity of the database (same type, state) the capacity of the template sentence prosodic boundaries (yazın) and accented syllables unknown Çocuklar yazın güneşin altında fazla kalmamalı.

79 Discussion Performance at movements (rises and falls) limited since
the slope of the rise or fall may differ in input and template sentences (bizim) Bizim Nevin domatesli menemen yemeli.

80 Discussion Performance at movements (rises and falls) limited since
there may be an absolute difference between pitch values of both sentences (gündüz) Yıldızlar genellikle gündüz görülmez.

81 Discussion Performance at movements (rises and falls) limited since
microprosodic effects (kardeşim) Kardeşim Nalan yeni ayna alıyor.

82 Discussion: Performance at movements (rises and falls)
Limited since the effects of rises and falls on neighbouring syllables are handled only partially (only within words).
Example: Input: Merve bu sefer zamanında dönemez. Template: Akşamki yemek pek güzel değildi.
Merve takes its pitch from yemek (/ye/ of yemek is affected by /ki/ of akşamki).
Word Region 1 Region 2 Region 3 Region 4
Merve Mer ve -
yemek ye mek

83 Akşamki yemek pek güzel değildi.
Merve bu sefer zamanında dönemez.

84 Discussion Performance at questions
High success due to their simple nature: Niçin sorularıma cevap vermiyorsun?

85 Discussion Performance at questions
High success due to their simple nature: Önce nereye bilgi verilmeli?

86 Discussion Performance at questions
High success due to their simple nature: Ona bu güzel kolyeyi satın almayacak mısın?

87 Discussion Objective Evaluation
Pitch -> speech melody, human perception -> ST (semitone) scale
The distance d in ST between two frequencies f1 and f2 is: d = 12 x log2(f1 / f2)
Metrics: mean squared distance between original and synthesized pitch in ST; proportion of distances < 2 ST
Compared with a baseline constructed as follows: 6 types x 2 states -> 12 groups of DB sentences; for each sentence -> median of nonzero pitch; average of the medians of the sentences in each group -> 12 baselines
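The two metrics and the baseline construction can be sketched as follows. Skipping frames where either contour is zero (unvoiced) is an assumption of this sketch, as are the function names; the slides only give the distance formula and the definition of the baseline.

```python
import math
import statistics

def st_distance(f1, f2):
    """Distance in semitones between two frequencies: d = 12 * log2(f1 / f2)."""
    return 12.0 * math.log2(f1 / f2)

def evaluate(original, synthesized, limit_st=2.0):
    """Return (mean squared distance in ST, proportion of frames closer than limit_st)."""
    d = [st_distance(s, o) for o, s in zip(original, synthesized) if o > 0 and s > 0]
    if not d:
        raise ValueError("no frames voiced in both contours")
    msd = sum(x * x for x in d) / len(d)
    proportion = sum(1 for x in d if abs(x) < limit_st) / len(d)
    return msd, proportion

def group_baseline(sentences):
    """Average over sentences of the median nonzero pitch (one baseline per group)."""
    medians = [statistics.median([f for f in s if f > 0]) for s in sentences]
    return sum(medians) / len(medians)

print(st_distance(200.0, 100.0))                 # one octave = 12 semitones
print(evaluate([100.0, 100.0], [100.0, 200.0]))  # one exact frame, one an octave off
```

A constant contour at the value returned by `group_baseline` can be scored with `evaluate` in the same way as the synthesized contour, which gives the comparison reported on the following slides.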

88 Discussion Objective Evaluation Sentence Domain
Columns: average mean square distance in ST (Method, Baseline, p), then average proportion of distance < 2 ST (Method, Baseline, p)
Close test sentences: 4.6514 x 10^-5 0.6573 0.4682 0.0043
Random test sentences: 6.9741 8.7683 0.2128 0.6016 0.5928 0.8616
All sentences: 5.7814 9.6920 x 10^-5 0.6302 0.5288 0.0160
All questions: 4.3090 9.9181 0.0026 0.7084 0.4547 0.0081

89 Discussion Objective Evaluation Sentence Domain Number of sentences
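Number of sentences for which the method or the baseline is better, given as method / baseline:
Close test sentences: mean square distance in ST 15 / 4; proportion of distance < 2 ST 14 / 5
Random test sentences: mean square distance in ST 14 / 4; proportion of distance < 2 ST 11 / 7
All sentences: mean square distance in ST 29 / 8; proportion of distance < 2 ST 25 / 12
All questions: 10 / 1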

90 Discussion Objective Evaluation Results ANOVA (analysis of variance)
p = the probability of observing such a difference if the means of the two methods were equal; p < 0.10 or 0.05 -> the difference between the averages is statistically significant. The method is better than the baseline in general. Performance on close test sentences > performance on random test sentences. Best results in questions. Similar results in both metrics.

91 Conclusion Intonation and stress -> fundamental frequency
Analysis of pitch contours. Method based on syntactic structure in terms of word categories and word stress information; automatic generation of these inputs from text is relatively easy. Makes use of a sentence database (corpus of natural speech) and interpolation. Recordings of a single speaker.

92 Future Work Inclusion of other speakers
A further categorization of words instead of POS categories -> subcategories -> more complex syntactic structures -> larger database for efficiency Other inputs: prosodic boundaries accented syllables and their automatic generation from input text (prosodic description) Handling microprosody

