
Proposals for Extending SSML 1.0 from the Point-of-View of Hungarian TTS Developers
Géza Németh, Géza Kiss, Bálint Tóth
Laboratory of Speech Technology, Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary

Budapest University of Technology & Economics (BME), Dept. of Telecommunications & Media Informatics (TMIT)
Speech activities:
- Coordinator: Gordos Géza, D.Sc.
- Speech Technology Lab (STL): Németh Géza (PhD) and Olaszy Gábor (D.Sc.)
- Telecommunications & Signal Processing Lab (TSP): Tatai Péter, MSc
- Laboratory of Speech Acoustics (LSA): Vicsi Klára, D.Sc.
- In each lab: 4-6 PhD students
- Graduate students: 306 in the Speech Information Systems subject (2005)

Basic research
- Multi-lingual artificial speech generation (synthesis, STL)
  - limited vocabulary (e.g., numbers, date, address)
  - multi-lingual TTS (Hungarian, German, Polish, Spanish)
  - speech profiles (variability, individual features)
  - expression/emotion presentation (user manuals, news)
- Speech recognition (TSP, LSA)
  - noise handling (telephone, in-car, ..., TSP)
  - dictation (good quality, continuous, LSA)
  - audio indexing (e.g. radio archives, broadcast news, TSP)
  - speech segmentation (TSP, LSA)
  - emotion detection (TSP)
- Speech understanding (TSP)
- Speech databases (LSA, TSP)

Applied Research
- Fully proprietary components and solutions: all parameters controlled, systems tailor-made for the end user, integration of original research results, unique products
- T-Mobile Hungary services: reader 1999-, name and address reader in reverse directory, 2003 (Motto: Why is the human operator speaking, not the machine?!), Symbian SMS-reader (STL)
- Others: SMS reader 2001-, book reader 2002- (STL)
- Voice portals (Generali Hungary name dial-in, 2004; Hungarian VoiceXML browser, 2003; TSP+STL)
- Industrial information systems (STL, TSP)
- Unified Messaging (STL)
- Call Center (STL, TSP)
- Audio user interfaces (especially portable/mobile devices, car information systems, wearable devices; STL, TSP)
- Disability (1986-; speech, vision, Hungarian version of Jaws for Windows, notetaker for blind people; STL, TSP, LSA)

Contact information Tel: (+36 1) Fax: (+36 1)

Overview
- Text structure
- Text-to-phoneme conversion
- Text normalization
- Prosody prediction
- Prosody prescription
- Summary

Text structure

Text structure elements already contained in SSML 1.0:
- paragraph
- sentence
Suggested further structuring:
- word
- syllables

This can be used
- to help
  - text-to-phoneme conversion
  - prosody prediction and prescription
  by giving higher-level information, namely
  - syllable structure
  - part-of-speech information
  (examples are given later)
- to indicate words in languages that do not use spaces to separate words

Reasons to use text structure elements instead of e.g. phoneme, prosody, break, emphasis:
- Easier for a human editor to add
- Replacing the synthesis processor may necessitate rewriting
  - the phoneme specification
  - the prosody prescription

Suggested word element: …
E.g. hosszú halászsasokat [SSML markup not preserved]

Suggestion extended from other proposals: …
When not a word but an expression is labeled: …
E.g. three kilos: 3 k. [SSML markup not preserved]

Text-to-phoneme conversion

When the pronunciation cannot be determined, you can
1. Add a lexicon element, but it is hard to add every entry
2. Specify it using phoneme, but this is hard for a human to write and read
3. Add a textual replacement using sub
4. Provide higher-level information; currently this is only say-as
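The four options correspond to elements that SSML 1.0 already defines. A minimal sketch in Python of a document that uses all of them follows. The lexicon URI, the IPA string and the interpret-as value are assumptions made for illustration and are not taken from the slides; SSML 1.0 itself does not standardize interpret-as values.

```python
import xml.etree.ElementTree as ET

SSML_NS = "http://www.w3.org/2001/10/synthesis"

# Illustrative SSML 1.0 document. The lexicon URI, the IPA string and the
# interpret-as value are assumptions for this sketch, not from the slides.
ssml = f"""<speak version="1.0" xmlns="{SSML_NS}" xml:lang="en-US">
  <lexicon uri="http://www.example.com/lexicon.pls"/>
  <s>I have <phoneme alphabet="ipa" ph="rɛd">read</phoneme> the book.</s>
  <s><sub alias="World Wide Web Consortium">W3C</sub></s>
  <s><say-as interpret-as="date">2004-09-07</say-as></s>
</speak>"""

# Parsing confirms the document is well-formed XML.
root = ET.fromstring(ssml)

# Collect the pronunciation-related elements in document order.
WANTED = {"lexicon", "phoneme", "sub", "say-as"}
used = [el.tag.split("}")[-1] for el in root.iter()
        if el.tag.split("}")[-1] in WANTED]
print(used)
```

Only the last option, say-as, passes higher-level information to the synthesis processor; the other three fix the pronunciation directly.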

Other types of higher-level information (easier, more natural):
- Syllable structure
- Part-of-speech information
- Language of included foreign text
We are going to give you some examples.

Syllable structure
Hungarian:
- highly agglutinative
- pronunciation inference rules are used
- the rules can be tricked by some words
E.g. egészség (health). The letter combinations
- might be s+zs: [S]+[Z], [Z]
- but they are in fact sz+s: [s]+[S], [S]

Syllable structure
It is enough to know the syllable structure. Instead of egészség you can write egészség (note: here you could also write egészség). [SSML markup not preserved]
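Why syllable boundaries are enough can be sketched in a few lines of Python. The greedy longest-match rule below is an assumption made for illustration, not the authors' rule set. It splits text into Hungarian graphemes and, when syllables are given, never lets a multi-letter grapheme span a boundary. With this particular rule egészség happens to come out right even without boundaries, so the failure is shown on község (köz + ség), where the rule wrongly reads z+s as zs.

```python
# Hungarian multi-letter graphemes, longest first so "dzs" wins over "dz".
MULTIGRAPHS = ("dzs", "cs", "dz", "gy", "ly", "ny", "sz", "ty", "zs")

def graphemes(chunk):
    """Greedy longest-match split of one chunk into Hungarian graphemes."""
    out, i = [], 0
    while i < len(chunk):
        for g in MULTIGRAPHS:
            if chunk.startswith(g, i):
                out.append(g)
                i += len(g)
                break
        else:
            # No multi-letter grapheme starts here: take a single letter.
            out.append(chunk[i])
            i += 1
    return out

def graphemes_by_syllable(syllables):
    """Split each syllable separately, so no grapheme crosses a boundary."""
    out = []
    for syl in syllables:
        out.extend(graphemes(syl))
    return out

print(graphemes("község"))                          # boundary unknown
print(graphemes_by_syllable(["köz", "ség"]))        # boundary known
print(graphemes_by_syllable(["e", "gész", "ség"]))  # the slide's example
```

The point of the design is that the editor supplies only the boundaries; the processor's own letter-to-sound rules still do the rest.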

Part-of-speech
- Word forms may have several meanings/pronunciations
- Specifying the part of speech may help
E.g.
- I will read the book
- I have read the book
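The idea can be sketched as a lookup keyed by word form and part of speech. The tag names and IPA strings below are hypothetical, chosen only to illustrate the two readings of "read"; the slides do not specify a tag set.

```python
# Hypothetical POS tags and IPA strings, for illustration only.
PRONUNCIATIONS = {
    ("read", "verb-base"): "riːd",             # "I will read the book"
    ("read", "verb-past-participle"): "rɛd",   # "I have read the book"
}

def pronounce(word, pos):
    """Return the pronunciation for (word, POS), or None if unknown."""
    return PRONUNCIATIONS.get((word.lower(), pos))

print(pronounce("read", "verb-base"))
print(pronounce("read", "verb-past-participle"))
```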

Language
- Foreign parts often occur in texts
- Using the same voice, currently you can
  - do nothing
  - specify the pronunciation using phoneme
- Another desirable approach
  - specify a lexicon for the language and specify the language of the text

Language
Instead of
… The title of the movie is: La vita è bella (Life is beautiful). [SSML markup not preserved]
you could write
… The title of the movie is: La vita è bella (Life is beautiful). [SSML markup not preserved]

Language
Suggested language attribute: … [SSML markup not preserved]
- If both lang and ph are given, lang has priority.
- If the language is x-unknown, LID (language identification) is used.
- We suggest that x-unknown can also be used with xml:lang.
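The priority rule can be sketched as a small resolver. This is an illustration under assumptions: the function name, the return convention and the `identify_language` callback (standing in for a real LID component) are all hypothetical, and the slides do not say what happens when neither attribute is present.

```python
def resolve(attrs, text, identify_language):
    """Decide how an element carrying `lang` and/or `ph` is pronounced.

    attrs: dict of attributes, may contain "lang" and/or "ph".
    identify_language: callable(text) -> language tag (stand-in for LID).
    Returns ("lang", tag), ("ph", phoneme_string) or ("default", None).
    """
    lang = attrs.get("lang")
    if lang is not None:
        # lang has priority over ph; x-unknown triggers language identification.
        if lang == "x-unknown":
            lang = identify_language(text)
        return ("lang", lang)
    if "ph" in attrs:
        return ("ph", attrs["ph"])
    return ("default", None)

# Stub LID that always answers Italian, only to make the sketch runnable.
lid = lambda text: "it"
print(resolve({"lang": "x-unknown"}, "La vita è bella", lid))
```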

Text normalization

- Text normalization is effectively assisted by the say-as element.
- The constructs we found appropriate in our practice include: date, time (including time intervals such as opening hours), number, currency, name, address.
- Additionally, we suggest as standard values: acronym/abbreviation, web, […], phone, program-code, table, equation.

Prosody prediction

- We speak differently in different situations (e.g. speaking with friends, giving a talk at a conference, reading the news, reading stories to children): this is speaking style.
- Differences in prosody can be quantified.
- Emotional speech is also in the focus of research.
- Modern TTS systems are likely to be able to imitate these to some extent.

Speaking style
Suggested speaking-style attribute
- Can be used wherever xml:lang can, i.e. on voice, speak, p, s, w
- Synthesis processors can define their own set of supported speaking styles
- They should support "spelling", which can be viewed as a special reading style
- They may support e.g. "syllabification", "casual", "news reading", "story telling"

Emotion
Suggested emotion attribute
- Mentioned here, although prosody is only one of its aspects
- Complementary to speaking-style, therefore a separate attribute is suggested
- Can be used wherever xml:lang can, i.e. on voice, speak, p, s, w
- Possible values: "happiness", "sadness", "anger", "surprise", "disgust", "fear"
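How the proposed speaking-style and emotion attributes might look on a sentence can be sketched in Python. The attribute names and the emotion values come from the slides; the helper function, its validation behaviour and the example text are assumptions, and neither attribute is part of SSML 1.0.

```python
import xml.etree.ElementTree as ET

# Emotion values listed on the slide.
EMOTIONS = {"happiness", "sadness", "anger", "surprise", "disgust", "fear"}

def styled_sentence(text, speaking_style=None, emotion=None):
    """Build an <s> element carrying the proposed (non-standard) attributes."""
    s = ET.Element("s")
    s.text = text
    if speaking_style is not None:
        # Processors may define their own styles, so no validation here.
        s.set("speaking-style", speaking_style)
    if emotion is not None:
        if emotion not in EMOTIONS:
            raise ValueError(f"unsupported emotion: {emotion!r}")
        s.set("emotion", emotion)
    return ET.tostring(s, encoding="unicode")

print(styled_sentence("Elmentek?", speaking_style="news reading",
                      emotion="surprise"))
```

Keeping the two attributes separate lets a document combine, say, a news-reading style with surprise, which matches the slide's point that they are complementary.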

Part-of-speech
- The part of speech (POS) of a word may affect emphasis and other aspects of prosody
- It is not always possible to determine it automatically
- It is more desirable to specify POS than to prescribe prosody (higher level; speaking style can override it)
Example in Hungarian:
- Mondd, hogy vagy? (Tell me, how are you?): interrogative adverb, strong (focus) emphasis
- Igaz, hogy jól vagy? (Is it true that you are alright?): conjunction, reduced emphasis

Prosody prescription

- Analytic languages (e.g. English, Chinese)
  - Words are usually short
  - They convey only one portion of the meaning
  - Individual words can be stressed
- Synthetic languages (e.g. Hungarian, Korean)
  - Words are often long
  - They are made up of several morphemes and have very complex meanings
  - Stress, pitch changes, etc. may need to be realized on certain morphemes (~syllables)

Example 1: contrastive sentences
- English: The book is not in the box, but on the box.
  - The speaker can emphasize one word.
- Hungarian: Nem a dobozon, hanem a dobozban van a könyv.
  - The speaker sometimes has to emphasize one syllable.
- Stress is expressed mainly by pitch; it may be aided by a short pause, slower rate, higher volume.

Example 2: pitch change on syllable
1. Elmentek. (They are gone.) Pitch is continuously falling.
2. Elmentek? (Are they gone?) Pitch rises at the beginning of the second syllable and falls on the third syllable.

Suggestion for extensions to prosody:
- Stress and prosody can be described on a per-syllable basis
- Extension to prosody: time can be a syllable position
  - decimal fractions can also be used
  - negative values indicate the nth position from the end
  - the special symbol syl_end indicates the end of the expression
E.g.: [SSML markup not preserved]
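One possible reading of these position rules is sketched below in Python. The exact semantics are not recoverable from the slide, so the convention used here is an assumption: a non-negative position is an offset in syllables from the start of the expression (0 is the start of the first syllable), a negative position counts back from the end, and syl_end is the end of the last syllable. The function name is hypothetical.

```python
def resolve_position(pos, n_syllables):
    """Map a syllable-position string to an offset from the start.

    Assumed convention (not from the slides): 0 <= offset <= n_syllables,
    with fractions allowed, e.g. 1.5 is halfway through the second syllable.
    """
    if pos == "syl_end":
        return float(n_syllables)
    value = float(pos)
    if value < 0:
        # Negative values count from the end of the expression.
        value = n_syllables + value
    if not 0 <= value <= n_syllables:
        raise ValueError(f"position {pos!r} is outside the expression")
    return value

# "Elmentek" has three syllables: El-men-tek.
for p in ("1", "1.5", "-1", "syl_end"):
    print(p, resolve_position(p, 3))
```

Under this convention the rise "at the beginning of the second syllable" in Elmentek? would sit at position 1.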

Suggestion for optional extensions (some synthesis processors may process them):
- pitch-contour (= contour), rate-contour, volume-contour
  - time positions: the same as in contour
  - rate / volume: described as in rate / volume
- emphasis and break extended with a position attribute; its value can be a syllable position. In this case break will not be an empty element.

Summary

Suggested extensions
1. …
2. …
3. …
4. optionally: pitch-contour (= contour), rate-contour, volume-contour; break, emphasis


Thank you for your attention!