Presentation is loading. Please wait.

Presentation is loading. Please wait.

Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers Géza Németh, Géza Kiss, Bálint Tóth Laboratory of Speech Technology,

Similar presentations


Presentation on theme: "Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers Géza Németh, Géza Kiss, Bálint Tóth Laboratory of Speech Technology,"— Presentation transcript:

1

2 Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers Géza Németh, Géza Kiss, Bálint Tóth Laboratory of Speech Technology, Department of Telecommunications and Media Informatics Budapest University of Technology and Economics, Budapest, Hungary

3 Budapest University of Technology & Economics (BME) Dept. of Telecommunications & Media Informatics (TMIT) Speech activities: Speech activities: Coordinator: Gordos Géza D.Sc. Coordinator: Gordos Géza D.Sc. Speech Technology Lab (STL) Németh Géza and Olaszy Gábor PhDD.Sc. Telecommunications & Signal Processing Lab (TSP) Tatai Péter MSc Laboratory of Speech Acoustics Vicsi Klára Vicsi Klára (LSA) D.Sc. In each lab 4-6 PhD students Graduate students 306 in Speech Information Systems subject (2005)

4 Basic research Multi-lingual artificial speech generation (synthesis, STL) Multi-lingual artificial speech generation (synthesis, STL) limited vocabulary (e.g., numbers, date, address) limited vocabulary (e.g., numbers, date, address) multi-lingual TTS (Hungarian, German, Polish, Spanish) multi-lingual TTS (Hungarian, German, Polish, Spanish) speech profiles (variability, individual features) speech profiles (variability, individual features) expression/emotion presentation (users manual news) expression/emotion presentation (users manual news) Speech recognition (TSP, LSA) Speech recognition (TSP, LSA) noise handling (telephone, in-car,..., TSP) noise handling (telephone, in-car,..., TSP) dictation (good quality, continouos, LSA) dictation (good quality, continouos, LSA) audio indexing (e.g. radio archives, broadcast news, TSP) audio indexing (e.g. radio archives, broadcast news, TSP) speech segmentation (TSP, LSA) speech segmentation (TSP, LSA) emotion detection (TSP) emotion detection (TSP) Speech understanding (TSP) Speech understanding (TSP) Speech databases (LSA, TSP) Speech databases (LSA, TSP)

5 Applied Research Fully proprietary components and solutions: Fully proprietary components and solutions: All parameters controlled, systems are tailor-made for the end-user, Integration of original research results, unique products All parameters controlled, systems are tailor-made for the end-user, Integration of original research results, unique products T-Mobile Hungary services: reader 1999-, name- and address reader in reverse directory, 2003 (Motto: Why is the human operator speaking, not the machine?!), Symbian SMS-reader (STL) T-Mobile Hungary services: reader 1999-, name- and address reader in reverse directory, 2003 (Motto: Why is the human operator speaking, not the machine?!), Symbian SMS-reader (STL) Others: SMS reader 2001-, bookreader 2002-, (STL) Others: SMS reader 2001-, bookreader 2002-, (STL) Voice portals (Generali Hungary name dial-in 2004, Hungarian VoiceXML browser, 2003, TSP+STL) Voice portals (Generali Hungary name dial-in 2004, Hungarian VoiceXML browser, 2003, TSP+STL) Industrial information systems (STL, TSP) Industrial information systems (STL, TSP) U nified Messaging (STL) U nified Messaging (STL) Call Center (STL, TSP) Call Center (STL, TSP) Audio user interfaces (especially portable/mobile devices, car information systems, wearable devices, STL, TSP) Audio user interfaces (especially portable/mobile devices, car information systems, wearable devices, STL, TSP) Disability (1986-, speech, vision, Hungarian version of Jaws for Windows, notetaker for blind people, STL, TSP, LSA) Disability (1986-, speech, vision, Hungarian version of Jaws for Windows, notetaker for blind people, STL, TSP, LSA)

6 Contact information Tel: (+36 1) Fax: (+36 1)

7 Overview Text-to-phoneme Text structure Prosody Summary Text Prosody normalization conversion prediction prescription

8 Overview Text-to-phoneme Text structure Prosody Summary Text Prosody normalization conversion prediction prescription

9 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction

10 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Text structure elements already contained in SSML 1.0: paragraph paragraph sentence sentence Suggested further structuring: word word syllables syllables

11 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction This can be used to help to help text-to-phoneme conversion text-to-phoneme conversion prosody prediction and prescription prosody prediction and prescription … by giving higher level information, namely syllable structure syllable structure part-of-speech information part-of-speech information (Examples given later) to indicate words in languages that do not use space to separate words to indicate words in languages that do not use space to separate words

12 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Reasons to use text structure elements instead of e.g. phoneme, prosody, break, emphasis Easier for human editor to add Easier for human editor to add Replacing synthesis processor may necessitate rewriting Replacing synthesis processor may necessitate rewriting phoneme specification phoneme specification prosody prescription prosody prescription

13 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Suggested word element … … E.g. hosszú hosszú halászsasokat halászsasokat

14 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Suggestion extended from other proposals … … When not a word, but an expression is labeled: … … E.g. three kilos 3 k. 3 k.

15 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction

16 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction When pronunciation cannot be determined, you can 1. Add a lexicon element BUT hard to add all 2. Specify using phoneme : BUT hard to write & read for human 3. Add a textual replacement using sub 4. Provide higher level information Currently this is only say-as

17 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Other types of higher level information (easier, more natural) Syllable structure Syllable structure Part-of-speech information Part-of-speech information Language of included foreign text Language of included foreign text We are going to give you some examples.

18 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody predictionHungarian: highly agglutinative highly agglutinative pronunciation inference rules are used pronunciation inference rules are used rules can be tricked by some words rules can be tricked by some words E.g. egészség (health) Letter combinations might bes+zs [ S ]+[ Z ][ Z ] but they are in factsz+s [ s ]+[ S ][ S ] Syllable structure

19 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Enough to know syllable structure. Instead of egészség egészség you can write egészség egészség (Note: here you could also write egészség ) egészség ) Syllable structure

20 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Word forms may have several meanings/pronunciations Word forms may have several meanings/pronunciations Specifying part-of-speech may help Specifying part-of-speech may helpE.g. I will read the book I will read the book I have read the book I have read the book Part-of-speech

21 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Foreign parts often occur in texts Foreign parts often occur in texts Using same voice, currently you can Using same voice, currently you can Do nothing Do nothing Specify using phoneme Specify using phoneme Another desirable approach Another desirable approach Specify lexicon for language and specify language of text Specify lexicon for language and specify language of text Language

22 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Instead of … The title of the movie is: La vita è bella (Life is beautiful). Language

23 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction you could write … The title of the movie is: La vita è bella (Life is beautiful). Language

24 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Suggested language attribute … … If both lang and ph is given, lang has priority If language is x-unknown, LID (language identification) is used. We suggest that x-unknown can be used with xml:lang also. Language

25 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction

26 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Text normalization effectively assisted by say-as element. Text normalization effectively assisted by say-as element. The constructs we found appropriate in our practice include: date, time (including time intervals like opening hours), number, currency, name, address. The constructs we found appropriate in our practice include: date, time (including time intervals like opening hours), number, currency, name, address. Additionally suggest as standard values: acronym/abbreviation, web, , phone, program-code, table, equation. Additionally suggest as standard values: acronym/abbreviation, web, , phone, program-code, table, equation.

27 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction

28 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction We speak differently in different situations (e.g. speaking with friends, giving a talk at a conference, reading news, reading stories to children) – speaking style We speak differently in different situations (e.g. speaking with friends, giving a talk at a conference, reading news, reading stories to children) – speaking style Differences in prosody can be quantified Differences in prosody can be quantified Emotional speech also in the focus of research Emotional speech also in the focus of research Modern TTS systems are likely to be able to imitate these to some extent Modern TTS systems are likely to be able to imitate these to some extent

29 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Suggested speaking-style attribute Can be used where the xml:lang element, i.e. voice, speak, p, s, w Can be used where the xml:lang element, i.e. voice, speak, p, s, w Synthesis processors can define their own set of supported speaking-styles Synthesis processors can define their own set of supported speaking-styles They should support: "spelling" – can be viewed a special reading style They should support: "spelling" – can be viewed a special reading style They may support e.g. "syllabification", "causal", "news reading", "story telling" They may support e.g. "syllabification", "causal", "news reading", "story telling" Speaking style

30 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Suggested emotion attribute Mentioned here, although prosody is only one of its aspects Mentioned here, although prosody is only one of its aspects Complementary to speaking-style, therefore separate attribute is suggested Complementary to speaking-style, therefore separate attribute is suggested Can be used where the xml:lang element, i.e. voice, speak, p, s, w Can be used where the xml:lang element, i.e. voice, speak, p, s, w Possible values: " happiness ", " sadness ", " anger ", " surprise ", " disgust ", " fear ". Possible values: " happiness ", " sadness ", " anger ", " surprise ", " disgust ", " fear ". Emotion

31 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Part-of-speech (POS) of word may affect emphasis and other aspects of prosody Part-of-speech (POS) of word may affect emphasis and other aspects of prosody Not always possible to automatically determine Not always possible to automatically determine More desirable to specify POS than to prescribe prosody (higher level, speaking style can override it) More desirable to specify POS than to prescribe prosody (higher level, speaking style can override it) Example in Hungarian: Mondd, hogy vagy? (Tell me, how are you?) – interrogative adverb,strong (focus) emphasisMondd, hogy vagy? (Tell me, how are you?) – interrogative adverb,strong (focus) emphasis Igaz, hogy jól vagy? (Is it true that you are alright?) – conjunction,reduced emphasisIgaz, hogy jól vagy? (Is it true that you are alright?) – conjunction,reduced emphasis Part-of-speech

32 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction

33 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Analytic languages (e.g. English, Chinese) Analytic languages (e.g. English, Chinese) Words are usually short Words are usually short They convey only one portion of the meaning They convey only one portion of the meaning Individual words can be stressed Individual words can be stressed Synthetic languages (e.g. Hungarian, Korean) Synthetic languages (e.g. Hungarian, Korean) Words are often long Words are often long Made up of several morphemes and have very complex meanings Made up of several morphemes and have very complex meanings Stress, pitch changes, etc. may need to be realized on certain morphemes (~syllables) Stress, pitch changes, etc. may need to be realized on certain morphemes (~syllables)

34 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Example 1: contrastive sentences English: The book is not in the box, but on the box. English: The book is not in the box, but on the box. Speaker can emphasize one word. Speaker can emphasize one word. Hungarian:Nem a dobozon, hanem a dobozban van a könyv. Hungarian:Nem a dobozon, hanem a dobozban van a könyv. Speaker sometimes has to emphasize one syllable. Speaker sometimes has to emphasize one syllable. Stress expressed mainly by pitch; may be aided by short pause, slower rate, higher volume. Stress expressed mainly by pitch; may be aided by short pause, slower rate, higher volume.

35 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Example 2: pitch change on syllable 1.Elmentek. – They are gone. Pitch is continuously falling 2.Elmentek? – Are they gone? Pitch rises at the beginning of the second syllable and falls down on the third syllable 1.2.

36 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Suggestion for extensions to prosody: Stress and prosody can be described on a per-syllable basis Stress and prosody can be described on a per-syllable basis Extension to prosody: time can be syllable position Extension to prosody: time can be syllable position decimal fractions can also be used decimal fractions can also be used negative values indicate n th position from end negative values indicate n th position from end special symbol syl_end indicates end of expression special symbol syl_end indicates end of expressionE.g.:

37 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Suggestion for optional extensions: some synthesis processors may process pitch-contour (= contour ), rate-contour, volume-contour time positions: the same as in contour rate / volume: described as in rate / volume pitch-contour (= contour ), rate-contour, volume-contour time positions: the same as in contour rate / volume: described as in rate / volume emphasis and break extended with a position attribute; value can be syllable position. In this case break will not be an empty element. emphasis and break extended with a position attribute; value can be syllable position. In this case break will not be an empty element.

38 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction

39 OverviewText-to-phoneme conversion Text structureProsody prescription SummaryText normalization Prosody prediction Suggested extensions … 2. … 3. ] 3. ] 4. optionally: pitch-contour (=contour), rate-contour, volume-contour; break, emphasis

40 Prosody prescription Prosody prediction Text normalization OverviewText-to-phonemeText structureSummary conversion

41 Thank you for your attention!


Download ppt "Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers Géza Németh, Géza Kiss, Bálint Tóth Laboratory of Speech Technology,"

Similar presentations


Ads by Google