Fundamental Frequency Contour Synthesis for Turkish Text to Speech Erkan Abdullahbeşe.


1 Fundamental Frequency Contour Synthesis for Turkish Text to Speech Erkan Abdullahbeşe

2 Content : TTS systems and prosody Turkish Intonation, Stress Observations on Collected Data Methodology Improvements on Methodology Discussion Conclusion

3 Introduction to Text to Speech (TTS) Systems Text -> speech signal Widespread applications –Message to speech generation –Man-machine dialogue –Multimedia applications –Talking aids for handicapped CHALLENGE: Machine Accent -> Natural Speech SOLUTION: Prosody Generation Modules

4 What is Prosody? Properties of speech that cannot be derived from the phoneme sequence –Modulation of voice pitch –Rhythm, changes in durations –Fluctuations of loudness Related to domains larger than one phoneme (supra-segmental properties)

5 Basic Acoustic Parameters: fundamental frequency F0 (pitch), duration, intensity
Prosodic Phenomena (modulate the basic acoustic parameters)
–Intonation: modulation of fundamental frequency
–Stress (accent)

6 Intonation
–Ensemble of pitch variations
–Perceived as speech melody
Stress
–Modulates all the basic acoustic parameters: increase in F0 and intensity (loudness), lengthening in duration
–Three types: word stress, phrase stress, sentence stress
–Stress falls on a single syllable
–Phrase and sentence stress coincide with word stress

7 Prosody Generation Modules in TTS
Prosodic description
–Prosodic phrasing -> phrase boundaries
–Accent labeling -> accents on syllables
Prosodic labels -> F0 contour
PROBLEMS
–Complex linguistic processing units (morphology, syntax, semantics)
–Speaker dependence
–Articulation-related problems: microprosody vs. macroprosody

8 Basic Intonation Models
Tone Sequence Models: pitch contour as a sequence of fluctuations generated by local accents
–Pierrehumbert: a sequence of independent H and L tones (orthography)
Pitch accent -> pitch movements on stressed syllables
Boundary tone -> at phrase boundaries
Phrase accent -> between stressed syllable and phrase boundary
Superposition Models: pitch contour as the superposition of several components with different domains: syllables, words, phrases, sentences, paragraphs, whole text
–Fujisaki: purely mathematical model -> parametric
A basic F0
A phrase component (response of a critically damped second-order system to an impulse)
An accent component (response of a critically damped second-order system to a rectangular input)
Optimization of parameter values with respect to F0 (analysis by synthesis)
–Möbius -> Fujisaki + linguistics -> German

9 Approaches
Perform an analysis on a speech corpus
Transcribe the corpus
–Define F0 labels (rise, fall, peak etc.) and boundary labels (minor, major etc.)
–Labeling: by hand; examination -> rules -> automatic
Automatic learning of the mapping labels -> F0 values (or parametrized F0)
–Neural networks
–Stochastic methods
Intonation pattern dictionary (from natural speech)
–Store pitch values in ST and key information (labels) for each pattern
–For the patterns in the input sentence -> compare key info -> find the closest pattern in the dictionary -> apply its pitch

10 Approaches For integration into TTS (labeling input sentence from text) –Complex linguistic processing units Morphology Syntax Semantics –Stochastic methods Syntax -> most probable label sequence

11 Sentence Intonation Types Terminal intonation –pitch decreases at the end -> message completed Interrogative intonation –pitch slightly increases on the last syllable -> waiting for response Progressive intonation –pitch either increases slightly or does not show any lowering at the end -> message not completed yet

12 Turkish Intonation Classification of sentences –Type: Declaratives(↓) wh-questions(↑) yes-no questions(↓) –Structure: Simple Compound: (↑) at the end of subordinate –Meşgul olduğundan(↑) bizimle sinemaya gelemedi(↓).

13 Turkish Intonation Tone groups (phrase or segment) –Division into tone groups / Oraya varınca beni arayın. / / Oraya varınca / beni arayın. / –Focus (new information) in each tone group / Oraya varınca beni arayın. / –Pitch variations on focus

14 Turkish Intonation Four levels of pitch: low(1), mid(2), high(3), extra high(4) –gi 2 di 3 yoru 1 m –sa 2 hi 4 mi 1 Speech melody musical melody (Nash) –Hierarchy of intonation units(phrase -> text) –Each intonation unit -> melody –Successive intonation units related by motifs -> melody of the upper level –Music: reiteration of motifs -> musical melody

15 Turkish Stress Fixed(bound) stress vs. Free stress(Turkish) Stress on a single syllable of a word in Turkish Effect of suffixes on stress –Stress on final syllable of root + stressable suffix yolcu + -lar → yolcular –Stress on final syllable of root, unstressable suffix involves oku + -yor → okuyor + -lar → okuyorlar –Stress on non-final syllable of root karınca + -lar → karıncalar May disappear in sentence Word Stress

16 Turkish Stress: Sentence Stress
Signals the prominence of the most information-bearing element in a sentence
Types
–Unmarked (preverbal position): Yarın İstanbul’a gidiyorlar.
–Marked (any position): Yarın İstanbul’a gidiyorlar.
Focusing elements
–Precede focus: sadece, daha (Mehmet daha bugün ödevine başlayabildi.)
–Follow focus: -mi, da, bile (Ayla mı bugün Ankara’dan dönüyor?)

17 Turkish Stress Phrase: modifier or complement and head Phrase stress on modifier in Turkish Types –Phrases used as nouns telefon ahizesi güzel çiçekler –Phrases used as verbs hızlı koş severek yaşa –Others senin için yarından sonra Preserved in the sentence Phrase Stress

18 Motivation Nevin bugün menemen yemeli. (template) N Z F V Nevin menemen yemeli. N F V Bizim Nevin domatesli menemen yemeli. P N A F V Nalan yarın ayna alıyor. N Z F V Nalan ayna alıyor. N F V Kardeşim Nalan yeni ayna alıyor. N N A F V

19 Nevin bugün menemen yemeli. Nevin menemen yemeli.

20 Nevin bugün menemen yemeli. Bizim Nevin domatesli menemen yemeli.

21 Nevin bugün menemen yemeli. Nalan yarın ayna alıyor.

22 Nevin bugün menemen yemeli. Nalan ayna alıyor.

23 Nevin bugün menemen yemeli. Kardeşim Nalan yeni ayna alıyor.

24 Sentences
Sentence Type      Positive  Negative
Declaratives       25        15
Wh-questions       10        5
Yes-no questions   10        5
Conditionals       6         4
Imperatives        6         4
Exclamations       6         4
100 database sentences
19 close test sentences (add/remove categories)
18 random test sentences
Syllable-based hand labeling
Pitch extraction

25 Observations Pitch decrease at the end (terminal intonation) Division into phrases Pitch increase on the phrase-final syllable (progressive intonation) Declaratives Nevin/bugün/menemen yemeli.

26 Observations Pitch decrease at the end (terminal intonation) Division into phrases Pitch increase on the phrase-final syllable (progressive intonation) Declaratives Evvelki gün/ikimiz de/kuyumcu Ali’ye uğradık.

27 Observations Pitch increase on the last syllable (interrogative intonation) Evident pitch increase on the stressed syllable of the wh-word No division into phrases Word stress often disappears Wh-questions Dün neden zamanımı aldın?

28 Observations Pitch increase on the last syllable (interrogative intonation) Evident pitch increase on the stressed syllable of the wh-word No division into phrases Word stress often disappears Wh-questions Kimler yarın sınıf gezisine katılacaklar?

29 Observations Pitch decrease at the end Evident pitch increase on the stressed syllable of the word before -mi No division into phrases Word stress often disappears Yes-no questions Oraları yine eskisi gibi güzel mi?

30 Observations Pitch decrease at the end Evident pitch increase on the stressed syllable of the word before -mi No division into phrases Word stress often disappears Yes-no questions Mudanya’da bu sene de çok yağmur yağıyor mu?

31 Observations Pitch decrease at the end (terminal intonation) Division into phrases Pitch increase on the phrase-final syllable (progressive intonation) -se always a phrase-final syllable Conditionals İnsan azimliyse herşeyi başarabilir.

32 Observations Pitch decrease at the end (terminal intonation) Division into phrases Pitch increase on the phrase-final syllable (progressive intonation) -se always a phrase-final syllable Conditionals Babam keyifsizse ona konuyu bu akşam anlatamam.

33 Observations Pitch decrease at the end (terminal intonation) Division into phrases Pitch increase on the phrase-final syllable (progressive intonation) Imperatives Akşam yemeği için çarşıdan birşeyler alsınlar.

34 Observations Pitch decrease at the end (terminal intonation) Division into phrases Pitch increase on the phrase-final syllable (progressive intonation) Imperatives Sevgiyi ve mutluluğu yarınlara erteleme.

35 Observations Diverse Pitch decrease at the end (terminal intonation) Evident pitch increase on the stressed syllable of interjection or of another word Exclamations Aman büyüklerine bir saygısızlık yapma!

36 Observations Diverse Pitch decrease at the end (terminal intonation) Evident pitch increase on the stressed syllable of interjection or of another word Exclamations Haydi bugün hep birlikte pikniğe gidelim!

37 Local Observations At most single stressed syllable excluding phrase-final increase Stress within the sentence coincides with the word stress Phrase stress preserved Ekonomik kriz / her kesimden insanı / olumsuz etkiledi.

38 Local Observations At most single stressed syllable excluding phrase-final increase Stress within the sentence coincides with the word stress Phrase stress preserved Evvelki gün / ikimiz de / kuyumcu Ali’ye uğradık.

39 Local Observations Word stress may disappear Beden sağlığımız için akşamları erken yatmalıyız. Mehmet daha bugün ödevine başlayabildi.

40 Local Observations Word stress disappears at the end of positives (terminal intonation) Nevin bugün menemen yemeli. Merve evine zamanında dönemez.

41 Local Observations Sentence stress (stress on focus) Nevin bugün menemen yemeli. Mehmet daha bugün ödevine başlayabildi.

42 Local Observations Effects on neighbour syllables Unstressed + stressed (ne+vin) Stressed + stressed nevin+bu+gün Nevin bugün menemen yemeli.

43 Local Observations Effects on neighbour syllables Stressed + stressed (Partiye+gelmeyeceğim) Ben akşam partiye gelmeyeceğim.

44 Local Observations Effects on neighbour syllables Stressed + unstressed (Gece+rüyasında) Kardeşim beni dün gece rüyasında görmüş.

45 Local Observations Effects on neighbour syllables Stressed + unstressed (ney+le) Bu geç vakitte sizin eve neyle döneceğiz?

46 Local Observations Effects on neighbour syllables Stressed + unstressed (last syllable, terminal intonation) (değil+di) Akşamki yemek pek güzel değildi.

47 Local Observations Effects on neighbour syllables Stressed + unstressed (last syllable, terminal intonation) (güzel+mi) Oraları yine eskisi gibi güzel mi?

48 Methodology: Overview
Choose the best sentence from a sentence database
Apply its pitch to the matching regions of the input sentence
–Compression / stretching
–Interpolation
Fit data to the remaining regions using interpolation
Blocks: Read Files -> Choose Best Sentence -> Generate Regional Durations -> Apply Pitch

49 Methodology Input information used for sentences –Sentence type (declarative, wh-question, yes-no question, conditional, imperative, exclamation) –Sentence state (positive or negative) –Categories of each word –Number of syllables of each word –The index of the syllable bearing word stress, for each word (stress in sentence coincides with word stress) Read Files

50 Methodology: Read Files
Word categories rely mainly on part-of-speech (POS) categories:
Category               Example            Gloss
noun                   elma               apple
adjective              güzel              beautiful
pronoun                biz                we
verb                   geliyorum          I’m coming
adverb                 akşamleyin         in the evening
postposition           kadar              as…as
conjunction            fakat              but
interjection           aman
wh-word                hangi              which
question suffix word   almış mı           did he take
conditional            iyiyse             if good
number                 beş                five
auxiliary              şikayet (etti)     (he complained)
component              Ali’nin            Ali’s
focus                  kitap (okuyor)     (he reads) book
comma                  (,)

51 Methodology Search in database to find the best sentence Search the template sentences with the same –Type –State as the input sentence Two different approaches for –Sentences other than questions –Question sentences Choose Best Sentence
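The first step of the search, restricting the candidates to templates of the same type and state, can be sketched as follows. This is a minimal illustration, not the author's implementation: the `Word` and `Sentence` records and the function name `candidate_templates` are hypothetical, chosen to mirror the input information listed on the "Read Files" slide.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    category: str      # e.g. "noun", "verb", "focus" (categories from the slides)
    n_syllables: int   # number of syllables in the word
    stress_index: int  # 1-based index of the syllable bearing word stress


@dataclass
class Sentence:
    text: str
    sentence_type: str  # "declarative", "wh-question", "yes-no question", ...
    state: str          # "positive" or "negative"
    words: List[Word]


def candidate_templates(input_sentence: Sentence, database: List[Sentence]) -> List[Sentence]:
    """Keep only the database sentences with the same type and state as the input."""
    return [
        template
        for template in database
        if template.sentence_type == input_sentence.sentence_type
        and template.state == input_sentence.state
    ]


# Hypothetical toy database; the word lists are left empty for brevity.
database = [
    Sentence("Nevin bugün menemen yemeli.", "declarative", "positive", []),
    Sentence("Merve evine zamanında dönemez.", "declarative", "negative", []),
    Sentence("Dün neden zamanımı aldın?", "wh-question", "positive", []),
]
query = Sentence("Nalan yarın ayna alıyor.", "declarative", "positive", [])
print([t.text for t in candidate_templates(query, database)])
```

Only the surviving candidates are then scored, with the word-based procedure for non-questions and the region-based procedure for questions.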

52 Calculate sentence resemblance scores based on word resemblance scores (WRS) Choose the template sentence having the maximum sentence resemblance score Sentences other than Questions Word Resemblance Score (WRS) Measure of resemblance of two words Consists of –Regional resemblance score (RRS) -> word stress information –Category match score (CMS) -> word categories WRS = RRS + CMS

53 Makes use of the four regions defined for every word –Region before the stressed syllable –Stressed syllable –Region after the stressed syllable –Phrase-final syllable Measure of resemblance of any two words in terms of these regions Based on number of syllables in each region Consists of –Score of existing regions –Score of lacking regions RRS = 0.9 x ERS x LRS Regional Resemblance Score (RRS)

54 Calculation of ERS and LRS
score = ERS = LRS = 0 (initialization)
for all regions
    if the region exists in both words
        score = min( 1, (NSRW1 / NSRW2) )
        ERS = ERS + score
    else if the region is lacking in both words
        LRS = LRS + 1
    else
        LRS = LRS - 1
    endif
endfor
ERS: score of existing regions
LRS: score of lacking regions
NSRW1: number of syllables in the related region for the first word
NSRW2: number of syllables in the related region for the second word
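The ERS/LRS loop above translates almost line by line into Python. The sketch below assumes each word is represented by a 4-tuple of syllable counts, one per region, with 0 meaning that the region does not exist; that representation and the function name `ers_lrs` are illustrative choices, not taken from the slides. The region counts used for İstanbul and Ankara are an assumed reading (İs-tan-bul stressed on "tan", An-ka-ra stressed on "An") that reproduces the ERS = 3/2 and LRS = 0 quoted in the worked example.

```python
def ers_lrs(regions_word1, regions_word2):
    """Return (ERS, LRS) for two words given as per-region syllable counts.

    Each argument is a sequence of four integers: syllables before the
    stressed syllable, the stressed syllable, syllables after it, and the
    phrase-final syllable. A count of 0 means the region is lacking.
    """
    ers = 0.0
    lrs = 0
    for nsrw1, nsrw2 in zip(regions_word1, regions_word2):
        if nsrw1 > 0 and nsrw2 > 0:
            # Region exists in both words.
            ers += min(1.0, nsrw1 / nsrw2)
        elif nsrw1 == 0 and nsrw2 == 0:
            # Region is lacking in both words.
            lrs += 1
        else:
            # Region exists in only one of the two words.
            lrs -= 1
    return ers, lrs


istanbul = (1, 1, 1, 0)  # assumed split: İs | tan | bul | (no phrase-final region)
ankara = (0, 1, 2, 0)    # assumed split: (none) | An | kara | (no phrase-final region)
print(ers_lrs(istanbul, ankara))  # (1.5, 0)
```

Note that the score is not symmetric: because of the ratio NSRW1 / NSRW2, swapping the two words can change ERS.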

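55 Example: calculation of WRS for the words İstanbul and Ankara
ERS = 1/1 + 1/2 = 3/2 (Regions 2 and 3 exist in both words)
LRS = 1 - 1 = 0 (Region 4 is lacking in both words: +1; Region 1 exists only in İstanbul: -1)
RRS = 0.9 x 3/2 = 1.35 (the LRS term contributes nothing, since LRS = 0)
CMS = 3.7
WRS = 1.35 + 3.7 = 5.05
Category Match Score (CMS)
Category match -> CMS = 3.7 (maximum possible value of RRS)
Word      Region 1  Region 2  Region 3  Region 4
Ankara    -         An        kara      -
İstanbul  İs        tan       bul       -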

56 Sentence Resemblance Score
I_1, I_2, …, I_N: words of the input sentence
D_1, D_2, …, D_M: words of the template sentence
S: M x N score matrix with entries S_i,j = WRS of the pair (D_i, I_j)
Path: (D_a, I_b), (D_c, I_d), …, (D_e, I_f) with 1 ≤ a < c < … < e ≤ M and 1 ≤ b < d < … < f ≤ N
Score of the path: sum of the WRSs of its pairs
TASK: find the path with the maximum score (maximum score path)
–Score of the maximum score path = sentence resemblance score
–Optimum combination of word pairings preserving order

57 EXAMPLE: TEMPLATE: Geçen akşam hepimiz müziğin büyüsüne kapılmıştık. INPUT: Büyük dayımız Kadıköy’deki evinde senelerdir yalnız oturuyor. (akşam, Büyük), (müziğin, dayımız), (kapılmıştık, evinde): valid (hepimiz, dayımız), (geçen, evinde), (büyüsüne, yalnız): invalid (akşam, evinde), (müziğin, dayımız), (kapılmıştık, oturuyor): invalid (geçen, dayımız), (hepimiz, dayımız), (kapılmıştık, oturuyor): invalid
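The validity condition on paths (both the template indices and the input indices must be strictly increasing) can be checked mechanically. The following sketch reproduces the four cases of the example; the helper names `is_valid_path` and `to_indices` are hypothetical, and words are written in lower case only to simplify the lookup.

```python
def is_valid_path(pairs):
    """True if template and input indices both strictly increase along the path."""
    return all(a < c and b < d for (a, b), (c, d) in zip(pairs, pairs[1:]))


template = ["geçen", "akşam", "hepimiz", "müziğin", "büyüsüne", "kapılmıştık"]
inputs = ["büyük", "dayımız", "kadıköy’deki", "evinde", "senelerdir", "yalnız", "oturuyor"]


def to_indices(word_pairs):
    """Map (template word, input word) pairs to 1-based (i, j) index pairs."""
    return [(template.index(d) + 1, inputs.index(w) + 1) for d, w in word_pairs]


paths = [
    [("akşam", "büyük"), ("müziğin", "dayımız"), ("kapılmıştık", "evinde")],
    [("hepimiz", "dayımız"), ("geçen", "evinde"), ("büyüsüne", "yalnız")],
    [("akşam", "evinde"), ("müziğin", "dayımız"), ("kapılmıştık", "oturuyor")],
    [("geçen", "dayımız"), ("hepimiz", "dayımız"), ("kapılmıştık", "oturuyor")],
]
for path in paths:
    print(to_indices(path), "valid" if is_valid_path(to_indices(path)) else "invalid")
```

The second path fails because the template order is broken, the third because the input order is broken, and the fourth because an input word is used twice.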

58 Procedure
MPS: M x N matrix of maximum path scores
CMPS: M x N x 2 matrix of maximum path score coordinates
MPS_i,j: contains the score of the maximum score path beginning with the pair (D_i, I_j)
CMPS_i,j,k: contains the indices of the next pair in the same path (for example, if the max score path of (D_i, I_j) is (D_i, I_j), (D_m, I_n), …, (D_p, I_q), then CMPS_i,j,1 = m and CMPS_i,j,2 = n)
Recursive generation of MPS from itself and S
CMPS generated from MPS

59 Procedure
for i = M, M-1, …, 1
    for j = N, N-1, …, 1
        if (i = M) or (j = N)
            MPS_i,j = S_i,j
            CMPS_i,j,1 = CMPS_i,j,2 = EMPTY
        else
            MPS_i,j = S_i,j + value of the max element of { MPS_p,q | i+1 ≤ p ≤ M and j+1 ≤ q ≤ N }
            CMPS_i,j,1 = first index of the max element of { MPS_p,q | i+1 ≤ p ≤ M and j+1 ≤ q ≤ N }
            CMPS_i,j,2 = second index of the max element of { MPS_p,q | i+1 ≤ p ≤ M and j+1 ≤ q ≤ N }
        endif
    endfor
endfor


61 Finding the maximum score path from MPS and CMPS
Sentence resemblance score = max over i,j of MPS_i,j = MPS_a,b (for example) -> the max score path begins with (D_a, I_b)
Look up CMPS_a,b,1 and CMPS_a,b,2 to obtain the second pair of the path: if, for example, CMPS_a,b,1 = c and CMPS_a,b,2 = d, then (D_c, I_d) is the second pair
Similarly, look up CMPS_c,d,1 and CMPS_c,d,2 to obtain the third pair, and so on until an EMPTY entry is reached
The entire path is obtained
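The two steps, filling MPS and CMPS backwards and then following CMPS from the best starting pair, can be sketched in Python as below. It follows the procedure as stated on the slides, with two differences that are choices of this sketch: indices are 0-based, and `None` plays the role of EMPTY. The function name `max_score_path` and the toy score matrix are illustrative only.

```python
def max_score_path(S):
    """Return (sentence resemblance score, path) for an M x N score matrix S.

    S[i][j] is the WRS of template word i and input word j (0-based).
    The path is a list of (i, j) pairs with strictly increasing i and j.
    """
    M, N = len(S), len(S[0])
    MPS = [[0.0] * N for _ in range(M)]    # best score of a path starting at (i, j)
    CMPS = [[None] * N for _ in range(M)]  # next pair on that path, None = EMPTY

    for i in range(M - 1, -1, -1):
        for j in range(N - 1, -1, -1):
            MPS[i][j] = S[i][j]
            if i == M - 1 or j == N - 1:
                continue  # last row or column: the path cannot be extended
            # Best continuation among the pairs strictly below and to the right.
            best_score, p, q = max(
                ((MPS[p][q], p, q) for p in range(i + 1, M) for q in range(j + 1, N)),
                key=lambda entry: entry[0],
            )
            MPS[i][j] += best_score
            CMPS[i][j] = (p, q)

    # The sentence resemblance score is the largest entry of MPS.
    score, a, b = max(
        ((MPS[i][j], i, j) for i in range(M) for j in range(N)),
        key=lambda entry: entry[0],
    )
    path, node = [], (a, b)
    while node is not None:
        path.append(node)
        node = CMPS[node[0]][node[1]]
    return score, path


S = [[5, 1, 1],
     [1, 4, 1],
     [1, 1, 3]]
print(max_score_path(S))  # (12, [(0, 0), (1, 1), (2, 2)])
```

As in the slide procedure, a pair that is not in the last row or column always takes on its best continuation, even when that continuation has a negative score. The nested maximum makes this sketch O(M²N²), which is unproblematic for sentence-length inputs.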

62 We obtained answers to the following questions:
What is the maximum resemblance the template sentence can reach with respect to the input sentence?
–Answer: the sentence resemblance score (score of the max score path)
How is this maximum reached, i.e. how should the words be matched and the pairs chosen?
–Answer: as in the max score path

63 Pitch curve of a question Pitch curve of a word Whole question regarded as a word Use the same regions defined for words –Region before the stressed syllable –Stressed syllable (stressed syllable of the wh-word or question suffix word) –Region after the stressed syllable –Phrase-final syllable (exists for wh-questions) Use the same procedure assigning RRS to words to assign sentence resemblance score to the questions Question Sentences

64 EXAMPLE
Sentences: Ayşe bugün evde hangi yemeği yaptı? Bu su sesi yukarıdan mı geliyor?
Regions (Region 1, Region 2, Region 3, Region 4):
Ayşebugünevdehangiyemeğiyaptıtı
Bususesiyukarıdanmıgeliyor-

65 Methodology Region -> one or more syllables Inputs:(related to input and template sentences) –The label files –The number of syllables for each word –The index of the syllable bearing word stress, for each word –The information whether the last syllable shows a pitch rise or not, for each word (conditional, wh-question) Assumes a perfect duration analysis for the input sentence (label file of input sentence) Determines the durations of each region: the onset and end, for each word in both sentences Generate Regional Durations

66 Methodology: Apply Pitch
Inputs:
–Regional durations generated by the previous block
–Pitch contour of the template sentence
–The max score path pertaining to the input and template sentences
For all pairs of the path, the pitch of the template sentence is applied to the input sentence, for the regions existing in both elements of a pair
Usage of spline interpolation:
–Stretching / compression in time
–Data fitting for nonexisting regions

67 Improvements Problem: unvoiced regions of template sentence + spline -> distortions Example: –Input: Yıldızlar dünyadan gündüz görülmez –Template: Zamanımı televizyonun karşısında boş yere harcayamam Path: (zamanımı, yıldızlar), (karşısında, dünyadan), (yere, gündüz), (harcayamam, görülmez) Problematic pairs: (karşısında, dünyadan) and (yere, gündüz) –unvoiced regions in karşısında (/k/, /ş/ and /s/) and yere Solution: discard zero samples (unvoiced) and then apply Discarding Unvoiced Regions
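The fix described here, dropping the zero-valued (unvoiced) samples of a template region before mapping it onto the input region, can be illustrated with a short sketch. To stay dependency-free it uses linear interpolation for the time stretching or compression, whereas the slides use spline interpolation; the function names and the sample values are hypothetical.

```python
def voiced_only(pitch):
    """Drop unvoiced frames, which the pitch extractor reports as zeros."""
    return [f for f in pitch if f > 0]


def resample_linear(values, n_out):
    """Stretch or compress `values` to `n_out` samples by linear interpolation.

    Linear interpolation is a stand-in for the spline used on the slides.
    """
    if not values or n_out < 1:
        raise ValueError("need at least one input sample and one output sample")
    if len(values) == 1 or n_out == 1:
        return [float(values[0])] * n_out
    step = (len(values) - 1) / (n_out - 1)
    out = []
    for k in range(n_out):
        x = k * step
        i = min(int(x), len(values) - 2)  # left neighbour of position x
        frac = x - i
        out.append(values[i] * (1 - frac) + values[i + 1] * frac)
    return out


# A template region with unvoiced gaps, applied to an input region of 5 frames.
template_region = [0, 0, 210, 205, 0, 0, 190, 180]
print(resample_linear(voiced_only(template_region), 5))
```

Without the filtering step, the zeros would act as data points and pull the interpolated curve towards 0 Hz, which is the distortion the slide refers to.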

68 Yıldızlar dünyadan gündüz görülmez.

69 Improvements Problem: poor performance of spline outside the borders of data points to be interpolated Example: –Input: Didem her akşam odasında günlük gazeteleri okur –Template: Annem bize her zaman çok lezzetli yemekler pişirir Problematic pairs: (annem, didem) and (pişirir, okur) Solution: applying the value of the outermost data point to the whole region, if the region goes beyond this data point WordRegion 1Region 2Region 3Region 4 didemdididem-- annem-annem- okurokur-- pişirirpişirir--
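The remedy for poor behaviour outside the data range is to hold the outermost value constant instead of extrapolating. A minimal sketch, again with linear interpolation inside the range as a stand-in for the spline and with hypothetical names and values:

```python
def interpolate_clamped(xs, ys, x):
    """Interpolate (xs, ys) at x, holding the outermost value outside the range.

    xs must be strictly increasing. Inside [xs[0], xs[-1]] the result is a
    linear interpolation; outside, the nearest end value is repeated.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x0 <= x <= x1:
            t = (x - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise ValueError("xs must be strictly increasing")


# Pitch samples (time in seconds, F0 in Hz) known only between 0.10 s and 0.30 s.
times = [0.10, 0.20, 0.30]
pitch = [180.0, 200.0, 190.0]
for t in (0.00, 0.15, 0.40):
    print(t, interpolate_clamped(times, pitch, t))
```

In the (annem, didem) pair, for instance, the input word has a region before the stressed syllable that the template word lacks, so that region falls outside the template's data points and receives the value of the nearest one.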

70 Didem her akşam odasında günlük gazeteleri okur.

71 Improvements Problem: spline sometimes yields unsatisfactory results within the data points Example: –Input: Çocuklar yazın güneşin altında fazla kalmamalı. Problematic region: /zın/ of yazın generated by spline Çocuklar yazın güneşin altında fazla kalmamalı.

72 Improvements Solution: check spline; spline -> linear interpolation when necessary –Spline check: linear regression line, upper threshold and lower threshold lines for the pitch of template sentence If spline exceeds the threshold lines: spline -> linear Linear regression and the two threshold lines.
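The spline check can be sketched as follows: fit a regression line to the template pitch, place an upper and a lower threshold line around it, and replace the interpolated segment by a straight line whenever it leaves that band. The slides do not state how far the threshold lines lie from the regression line, so the `margin` parameter below is an assumption, as are the function names and the choice of joining the segment's two end values when falling back.

```python
def fit_line(ts, fs):
    """Least-squares regression line f = slope * t + intercept."""
    n = len(ts)
    mean_t = sum(ts) / n
    mean_f = sum(fs) / n
    sxx = sum((t - mean_t) ** 2 for t in ts)
    sxy = sum((t - mean_t) * (f - mean_f) for t, f in zip(ts, fs))
    slope = sxy / sxx
    return slope, mean_f - slope * mean_t


def check_segment(ts, spline_values, slope, intercept, margin):
    """Keep the spline segment if it stays within the band, else go linear.

    The band is the regression line plus or minus `margin` (hypothetical;
    the actual threshold definition is not given on the slides).
    """
    exceeds = any(
        abs(f - (slope * t + intercept)) > margin
        for t, f in zip(ts, spline_values)
    )
    if not exceeds:
        return list(spline_values)
    # Fallback: straight line between the first and last values of the segment.
    t0, t1 = ts[0], ts[-1]
    f0, f1 = spline_values[0], spline_values[-1]
    return [f0 + (t - t0) / (t1 - t0) * (f1 - f0) for t in ts]


slope, intercept = fit_line([0, 1, 2, 3], [200, 190, 180, 170])
print(check_segment([1, 1.5, 2], [190, 260, 180], slope, intercept, margin=30))
```

In the example call, the middle value overshoots the band, so the whole segment is replaced by the straight line between its end points.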

73 Çocuklar yazın güneşin altında fazla kalmamalı.

74 Discussion good -> choosing from same type and state -> expected microprosody degrades performance (unvoiced regions of input sentence unknown) Performance at sentence ends Kuzenim Nalan Oya’ya yarın alıyor.

75 Discussion Performance at sentence ends good -> choosing from same type and state -> expected microprosody degrades performance (unvoiced regions of input sentence unknown) Mars’ta hayat var mıdır?

76 Discussion erroneous endings (increase instead of decrease) due to template pitch Performance at sentence ends

77 Discussion erroneous endings (increase instead of decrease) due to template pitch Performance at sentence ends

78 Discussion limited since the method is confined to –the capacity of the database (same type, state) –the capacity of the template sentence prosodic boundaries (yazın) and accented syllables unknown Performance at movements (rises and falls) Çocuklar yazın güneşin altında fazla kalmamalı.

79 Discussion limited since the slope of the rise or fall may differ in input and template sentences (bizim) Performance at movements (rises and falls) Bizim Nevin domatesli menemen yemeli.

80 Discussion limited since there may be an absolute difference between pitch values of both sentences (gündüz) Performance at movements (rises and falls) Yıldızlar genellikle gündüz görülmez.

81 Discussion limited since microprosodic effects (kardeşim) Performance at movements (rises and falls) Kardeşim Nalan yeni ayna alıyor.

82 Discussion limited since effects of rises and falls on neighbouring syllables are handled partially (only within words) Example: Input: Merve bu sefer zamanında dönemez Template: Akşamki yemek pek güzel değildi Merve from yemek (/ye/ of yemek affected by /ki/ of akşamki) Performance at movements (rises and falls) WordRegion 1Region 2Region 3Region 4 MerveMerve-- yemekyemek--

83 Akşamki yemek pek güzel değildi. Merve bu sefer zamanında dönemez.

84 Discussion High success due to their simple nature: Performance at questions Niçin sorularıma cevap vermiyorsun?

85 Discussion High success due to their simple nature: Performance at questions Önce nereye bilgi verilmeli?

86 Discussion High success due to their simple nature: Performance at questions Ona bu güzel kolyeyi satın almayacak mısın?

87 Discussion: Objective Evaluation
Pitch -> speech melody, human perception -> ST (semitone) scale
The distance d in ST between two frequencies f1 and f2 is given as: d = 12 x log2(f1 / f2)
Metrics
–mean squared distance between original and synthesized pitch in ST
–proportion of distances < 2 ST
Compare with a baseline solution constructed as:
–6 types x 2 states -> 12 groups of DB sentences
–for each sentence -> median of nonzero pitch
–average of the sentence medians in each group -> 12 baselines
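The semitone distance, the two metrics and the baseline can be written down compactly. The sketch below is an illustration under stated assumptions rather than the evaluation code behind the reported results: it compares only frames that are voiced (nonzero) in both contours, which the slides do not specify, and all function names are hypothetical.

```python
import math
from statistics import mean, median


def semitone_distance(f1, f2):
    """Distance in semitones between two frequencies: d = 12 * log2(f1 / f2)."""
    return 12.0 * math.log2(f1 / f2)


def evaluate(original, synthesized):
    """Return (mean squared ST distance, proportion of frames closer than 2 ST).

    Assumption: frames that are unvoiced (0) in either contour are skipped.
    """
    distances = [
        semitone_distance(s, o)
        for o, s in zip(original, synthesized)
        if o > 0 and s > 0
    ]
    msd = mean([d * d for d in distances])
    proportion = sum(1 for d in distances if abs(d) < 2.0) / len(distances)
    return msd, proportion


def group_baseline(pitch_contours):
    """Baseline for one (type, state) group: mean of per-sentence median F0."""
    return mean([median([f for f in contour if f > 0]) for contour in pitch_contours])


print(semitone_distance(440.0, 220.0))         # 12.0, i.e. one octave
print(evaluate([220.0, 220.0, 0.0], [440.0, 220.0, 200.0]))
print(group_baseline([[0, 100, 200, 300], [0, 0, 150]]))
```

A constant baseline contour at the group value can be passed to `evaluate` in place of the synthesized contour to obtain the baseline's scores on the same two metrics.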

88 Discussion: Objective Evaluation
Table (numeric values not recoverable): average mean square distance in ST (Method, Baseline, p) and average proportion of distance < 2 ST (Method, Baseline, p), for four sentence domains: close test sentences, random test sentences, all sentences, all questions.

89 Discussion: Objective Evaluation
Table (numeric values not recoverable): for each sentence domain (close test sentences, random test sentences, all sentences, all questions), the number of sentences for which the method is better and for which the baseline is better, under each metric (mean square distance in ST; proportion of distance < 2 ST).

90 Discussion Objective Evaluation Results Method better than baseline in general Performance at close test sentences > Performance at random test sentences best results in questions similar results in both metrics ANOVA (analysis of variance) –p = the probability of the means belonging to each method to be equal –p averages statistically significant

91 Conclusion Intonation and stress -> fundamental frequency Analysis of pitch contours Method based on syntactic structure in terms of word categories and word stress information Automatic generation of these inputs from text is relatively easy. Makes use of –a sentence database (corpus of natural speech) –interpolation Recordings of a single speaker

92 Future Work Inclusion of other speakers A further categorization of words instead of POS categories -> subcategories -> more complex syntactic structures -> larger database for efficiency Other inputs: –prosodic boundaries –accented syllables and their automatic generation from input text (prosodic description) Handling microprosody

