Modelling Polish Intonation for Speech Synthesis Dominika Oliver 23 May 2002.


Plan
- Aims & Objectives
- Reasons
- Methodology

Building TTS systems
- Basic building blocks:
  - Pre-processing: analysis of raw and labelled text into identifiable words.
    - Text normalisation (abbreviations, dates, money, time indications, addresses, telephone numbers, bank accounts, etc.)
    - Tokenization, mapping tokens to words, resolving mark-up languages
  - Linguistic module: from words to segments.
    - Orthographic to phonetic conversion of words (morphological analysis, grapheme-to-phoneme (g2p) conversion, syllabification, stress assignment)
    - Sentence analysis (resolving pronunciation ambiguities; syntactic, lexical and semantic analysis)
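The pre-processing step above can be illustrated with a minimal sketch. The abbreviation table, the digit-by-digit reading of numbers and the function names are assumptions made for illustration only; they are not taken from Festival or any other system named in these slides.

```python
import re

# Hypothetical abbreviation table; a real system would use a far larger lexicon.
ABBREVIATIONS = {"dr": "doctor", "st": "street", "tel": "telephone"}

DIGIT_NAMES = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def tokenize(text):
    """Split raw text into number, word and punctuation tokens."""
    return re.findall(r"\d+|[^\W\d_]+|[^\w\s]", text)


def normalise(text):
    """Map tokens to speakable words (assumption: numbers are read digit by digit)."""
    words = []
    for token in tokenize(text):
        lower = token.lower()
        if token.isdigit():
            words.extend(DIGIT_NAMES[int(d)] for d in token)
        elif lower in ABBREVIATIONS:
            words.append(ABBREVIATIONS[lower])
        elif token.isalpha():
            words.append(lower)
        # Punctuation is dropped here; a fuller system would keep it
        # as a cue for phrase-break prediction.
    return words


print(normalise("Dr. Smith, tel. 180"))
```

Reading numbers digit by digit is only one possible choice; dates, money amounts and telephone numbers each need their own expansion rules.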

Building TTS systems (cd)
- Phonetic module
  - F0 and durations (and anything else appropriate for waveform synthesis)
  - Prosodic modelling (generation of the intonation contour by an intonation model; prosodic phrase, accent and F0 prediction)
- Acoustic module (waveform synthesis)
  - Conversion into a digital speech signal
  - From segments, F0 and duration to a waveform.
  - There are many techniques to do this: concatenative synthesis (diphone, unit selection), formant synthesis and articulatory synthesis.

Terminology
- Stress: lexically specified distinction between strong and weak syllables; a stressed syllable is louder and longer than an unstressed one
- Tone: lexically specified pitch movement, property of a syllable
- Accent: post-lexical pitch movement, linked to a stressed syllable
- Pitch accent: lexical pitch movement, property of a word

Intonation in TTS
- Intonation prediction can be split into two tasks:
  - Prediction of accents (and/or tones): done on a per-syllable basis, identifying which syllables are to be accented and what type of accent is required (if appropriate for the theory).
  - Realization of the F0 contour: given the accents/tones, generate an F0 contour.

Why is it important?
- In rendering natural sounding speech from raw text, one of the many tasks is generating natural sounding intonation.
- A number of intonation theories have been utilised in various systems to try to do this task.
- As the quality of speech synthesis improves, a greater demand is put on the intonation system to produce more varied intonation tunes.

Models of Intonation
- Linear or tone-sequence models: generate values from left to right as a sequence of values or movements.
  - British school: based on auditory analysis
  - Pierrehumbert: predominantly acoustic analysis
  - Dutch school ('t Hart, Collier and Cohen): perceptual data
  - Tilt (Taylor): phonetic
- Superpositional or hierarchical models: generate a contour by modelling factors separately (phone, syllable, word, phrase, sentence) and then combining the partial models. Examples: Fujisaki 1983, Grønnum 1992, Möbius et al. 1993.
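The superpositional idea can be sketched as a sum of independently modelled components. The sketch below is a generic illustration, not the Fujisaki, Grønnum or Möbius model: the linear declination, the Gaussian accent shape, the log-Hz domain and all default values are assumptions chosen only to show how partial models are combined.

```python
import math


def phrase_component(t, start_hz=130.0, end_hz=90.0, duration=2.0):
    """Phrase-level declination: a straight line in log-Hz (assumed shape)."""
    frac = min(max(t / duration, 0.0), 1.0)
    return math.log(start_hz) + frac * (math.log(end_hz) - math.log(start_hz))


def accent_component(t, accents, width=0.1):
    """Sum of Gaussian bumps, one per accent given as (centre_time, height in log-Hz)."""
    return sum(height * math.exp(-0.5 * ((t - centre) / width) ** 2)
               for centre, height in accents)


def superposed_f0(t, accents):
    """Combine the partial models by addition in log-Hz, then convert back to Hz."""
    return math.exp(phrase_component(t) + accent_component(t, accents))


accents = [(0.5, 0.25), (1.5, 0.15)]  # hypothetical accent positions and sizes
for t in (0.0, 0.5, 1.0, 1.5, 2.0):
    print(f"{t:.1f} s: {superposed_f0(t, accents):.1f} Hz")
```

A tone-sequence model would instead emit targets or movements one after another from left to right, with no separate phrase-level curve.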

Techniques of intonation modelling: using Tilt & ToBI
- Tilt and ToBI typify two major classes of intonation systems.
- Tilt comes from a data-driven approach, attempting to form an abstraction of the natural contour directly from the waveform.
- ToBI takes a more linguistic or phonological approach, specifying a small set of discrete labels which identify the intonational space of accents and tones.
- Both are also prosodic labelling systems.

ToBI (Pierrehumbert, 1980)
- Autosegmental-metrical approach: pitch movements are decomposed into pitch levels.
- Intonation phrases are modelled as sequences of high (H) and low (L) pitch levels.
- ToBI offers a well-defined intonation phonology for labelled speech. It is the most widely available standard labelling system.
- The ToBI labelling system itself does not define a mechanism to go from the labels to an F0 contour, or the reverse. However, there are both hand-written rule systems (e.g. M. Anderson, J. Pierrehumbert and M. Liberman 1984) and statistically trained methods (e.g. A. Black and A. Hunt 1996) which do this task.
- Machine readable.
- Increase in descriptive power: transcriptions can be compared across dialects and languages (ToBI for English, GToBI for German, SCToBI for Serbo-Croatian, ToDI for Dutch, etc.).

Tilt (Taylor 1998)
- Tilt is a phonetic model of intonation that represents intonation as a sequence of continuously parameterised events (pitch accents or boundary tones).
- These parameters are called tilt parameters and are determined directly from the F0 contour.
- They are: duration, amplitude and tilt.
- Imposes no categorial classification on events.

Tilt (cd)
- Duration is the sum of the rise and fall durations.
- Amplitude is the sum of the magnitudes of the rise and fall amplitudes.
- The tilt parameter expresses the overall shape of the event: the difference of the rise and fall amplitude magnitudes divided by their sum.
- The tilt parameter ranges from -1 to 1: -1 is a pure fall, 1 is a pure rise, and 0 means equal portions of rise and fall.
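The three definitions above translate directly into a few lines of arithmetic. This is a minimal sketch that follows the amplitude-only definition given on this slide; the function name, the argument names and the handling of a flat event (zero amplitude) are assumptions, and published formulations of Tilt may differ in detail.

```python
def tilt_parameters(rise_amp, fall_amp, rise_dur, fall_dur):
    """Compute duration, amplitude and tilt for one intonational event.

    rise_amp / fall_amp: F0 excursions in Hz (the sign is ignored, magnitudes are used).
    rise_dur / fall_dur: durations of the rise and fall portions in seconds.
    """
    a_rise = abs(rise_amp)
    a_fall = abs(fall_amp)
    amplitude = a_rise + a_fall
    duration = rise_dur + fall_dur
    # Assumption: a completely flat event gets tilt 0 to avoid dividing by zero.
    tilt = (a_rise - a_fall) / amplitude if amplitude > 0 else 0.0
    return {"duration": duration, "amplitude": amplitude, "tilt": tilt}


print(tilt_parameters(20.0, 0.0, 0.25, 0.0))     # pure rise
print(tilt_parameters(0.0, -20.0, 0.0, 0.25))    # pure fall
print(tilt_parameters(20.0, -20.0, 0.25, 0.25))  # equal rise and fall
```

Because the result is continuous, two events with tilt 0.8 and 0.9 are simply slightly different rises; no category boundary is imposed between them.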

Examples of intonation control
- Information provided by intonation:
  - Focus or given/new information
  - Emotions, word emphasis, syntactic disambiguation
- Examples from Mary TTS (DFKI):
  - Gehen wir nach Hause !/?
  - Der Zug fährt nach Frankfurt, oder?
  - Ist die Nummer 180? Nein, die Nummer ist

Prosodic Labelling Systems
- ToBI (Tones and Break Indices)
  - ToBI is an intonational labelling standard for speech databases that is in some way based on Janet Pierrehumbert's thesis.
  - Labels are made on the basis of a speech wave and an F0 trace.
  - The labelling scheme consists of four tiers:
    (1) Orthographic tier: words spoken
    (2) Break-index tier: the degree of juncture between words
    (3) Tone tier: intonation
    (4) Miscellaneous tier: comments

Prosodic Labelling Systems
- ToBI (cd): discrete intonation labels
  - pitch accent types: H*, H+!H*, L*, L*+H and L+H*
  - phrase accent types: H- and L-
  - boundary tones: L-L%, L-H%, H-L% and H-H%
  - break levels: 0, 1, 3 and 4 (2 reserved for special cases)
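Because the inventory is small and discrete, it is easy to represent and check by machine. The sketch below stores the label sets listed on this slide (with the downstepped accent written in its standard form, H+!H*) and classifies a label; the function and constant names are hypothetical.

```python
PITCH_ACCENTS = {"H*", "H+!H*", "L*", "L*+H", "L+H*"}
PHRASE_ACCENTS = {"H-", "L-"}
BOUNDARY_TONES = {"L-L%", "L-H%", "H-L%", "H-H%"}
BREAK_INDICES = {0, 1, 2, 3, 4}  # 2 is reserved for special cases


def classify_tone_label(label):
    """Return the category of a tone-tier label, or None if it is not in the inventory."""
    if label in PITCH_ACCENTS:
        return "pitch accent"
    if label in PHRASE_ACCENTS:
        return "phrase accent"
    if label in BOUNDARY_TONES:
        return "boundary tone"
    return None


for label in ("L+H*", "L-", "H-H%", "X*"):
    print(label, "->", classify_tone_label(label))
```

A check like this is useful when validating hand-labelled databases before training accent or F0 prediction models on them.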

Prosodic Labelling Systems (cd)
- Tilt: a Tilt labelling for an utterance consists of an assignment of one of four basic intonational events (labelled a, b, c, sil):
  - pitch accents,
  - boundary tones,
  - connections,
  - silence.

Prosodic Labelling Systems (cd)

Polish synthesis (examples)
- What is available:
  - Festival (University of Edinburgh, CSTR)
  - Realspeak (Scansoft)
  - Spiker (IVO Software)
  - SynTalk (Neurosoft)

Polish intonation model
- British school (Jassem 1984, Demenko 1999)
- The description of accent and intonation at the linguistic level is based on the main features of a British-English system developed essentially by O'Connor and Arnold (1973) and Jassem (1984).
- An intonational phrase is defined in terms of a sequence of (optional) pre-nuclear, (constitutive) nuclear, and (optional) post-nuclear accents:
  - [prehead [ head [[ nucleus ] tail]]] (O'Connor & Arnold)
  - [anacrusis][[prenuclear intonation[nuclear intonation]]] (Jassem)
- e.g.
  - To jest naj'lepsza 'pora "dnia.
  - To jest naj'lepsza po"radnia.
  - "Co mówiłeś?

Intro - Polish intonation structure
- A Polish phrase includes only one ictic accent, which is also referred to as the nuclear accent.
- Pre-ictic accents are referred to as pre-nuclear accents and post-ictic accents as post-nuclear accents.
- The pre-nuclear and the nuclear accents are mainly determined by specific pitch relations, whilst the post-nuclear accent (if any) is essentially durational.

Intro - Polish intonation structure (cd)
- 2 classes of pre-nuclear accents: H (high) and L (low)
- 9 classes of nuclear accents have been distinguished: HL, ML, xL, HM, LM, LH, MH, MM and LHL, where H is High, M Medium, L Low and xL extra-Low relative to the particular speaker's average and mean-Low pitch; e.g., LH means "rising from Low to High".
- e.g. Znowu ten wariat. (HL); Znowu ten wariat? (LH)
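The accent labels are compact codes for sequences of pitch levels, so they can be decoded mechanically. The sketch below parses a label into its levels and describes each movement; the numeric ranks are only an ordering (xL < L < M < H) assumed for comparison, not measured pitch values, and the function names are hypothetical.

```python
# Ordinal ranks used only to compare levels; not pitch values in Hz.
LEVEL_RANK = {"xL": 0, "L": 1, "M": 2, "H": 3}


def parse_accent(label):
    """Split an accent label such as 'LHL' or 'xL' into its pitch levels."""
    levels = []
    i = 0
    while i < len(label):
        if label.startswith("xL", i):
            levels.append("xL")
            i += 2
        elif label[i] in ("H", "M", "L"):
            levels.append(label[i])
            i += 1
        else:
            raise ValueError(f"unknown pitch level in {label!r}")
    return levels


def describe_movements(label):
    """Describe each level-to-level step as 'rise', 'fall' or 'level'."""
    levels = parse_accent(label)
    moves = []
    for a, b in zip(levels, levels[1:]):
        if LEVEL_RANK[b] > LEVEL_RANK[a]:
            moves.append("rise")
        elif LEVEL_RANK[b] < LEVEL_RANK[a]:
            moves.append("fall")
        else:
            moves.append("level")
    return moves


for label in ("HL", "LH", "LHL", "MM", "xL"):
    print(label, parse_accent(label), describe_movements(label))
```

A single-level label such as xL yields no movement, which matches its reading as a target level rather than a contour.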

Platform
- Festival is a speech synthesis application developed at the Centre for Speech Technology Research (CSTR) at the University of Edinburgh.
- Multilingual text to speech
  - English, Spanish, German, Welsh, Catalan, Polish
  - Allows addition of new languages
- Synthesis research and development environment
  - Tools for development: support for extracting information from speech databases in a way suitable for building models (models for accent prediction, F0 generation, duration, vowel reduction, homograph disambiguation, phrase break assignment and unit selection).
- Free software

Platform (cd) - direct route from research to use
- Multi-lingual text to speech: for those who have little interest in the internal workings of the system and just want speech output.
- Synthesis for language systems: for applications that generate text from known forms. In this type of system, telephone numbers, addresses, etc. can be explicitly marked, and language type and even intonational forms can be specified. This form of access requires more knowledge about the synthesis internals, but still not its low-level details.
- Synthesis development environment: in this mode, new synthesis modules (intonation, waveform synthesizers, etc.) can be developed and compared in a software environment that provides the right basic tools, so that development may concentrate on the theory, not the implementation.

Intonation in Festival
- Task: prediction of accents & realisation of F0 contour
- Method: statistical and rule based
  - Tilt
  - ToBI

Intonation in Festival (cd)
- ToBI: accents and boundary types are predicted by a CART (classification and regression tree), and the F0 contour is then generated by a separate, statistically trained method.
- Three F0 values are predicted for each syllable: at the start, mid vowel and end. They are predicted using linear regression based on a number of features, including ToBI accent type, phrase position and syllable position, with contexts.
- Although a three-point prediction system cannot capture all the variability in natural intonation, it has been shown by experiment to be sufficient to produce reasonable F0 contours (Black 1998).
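The three-point prediction can be pictured as three small linear models, one per target point. The sketch below is illustrative only: the feature names and every weight are invented for the example, whereas a trained system would learn its weights from a labelled speech database and use many more features. It does not reproduce Festival's implementation.

```python
# Hypothetical weights (Hz) for three target points; NOT trained values.
WEIGHTS = {
    "start": {"intercept": 110.0, "accented": 5.0, "phrase_final": -10.0},
    "mid":   {"intercept": 115.0, "accented": 25.0, "phrase_final": -15.0},
    "end":   {"intercept": 110.0, "accented": 10.0, "phrase_final": -25.0},
}


def predict_f0_targets(features):
    """Predict F0 (Hz) at syllable start, mid vowel and end by linear regression.

    features: dict mapping feature name to a numeric value (1.0 / 0.0 for binary ones).
    """
    targets = {}
    for point, weights in WEIGHTS.items():
        value = weights["intercept"]
        for name, weight in weights.items():
            if name != "intercept":
                value += weight * features.get(name, 0.0)
        targets[point] = value
    return targets


print(predict_f0_targets({"accented": 1.0, "phrase_final": 0.0}))
print(predict_f0_targets({"accented": 0.0, "phrase_final": 1.0}))
```

The full contour is then obtained by interpolating between the predicted targets of successive syllables.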

Intonation in Festival (cd)
- The Tilt intonation theory takes a bottom-up approach. Its intention is to build a parameterization of the F0 contour that is abstract enough to be predictable in a text-to-speech system.
- It has been shown that a good representation of a natural F0 contour can be made automatically from the raw signal (though it is better if the accents and boundaries are hand labelled). Dusterhoff 1997 further shows how that parameterization can be predicted from text.

Future work: pilot study
- Immediate plans
  - ToBI description of the Polish intonation phrase (Polish Intonation Database, Karpiński 2000)
- Future work
  - Synthesis assessment: visually impaired
- Potential applications
  - Free Polish-English talking dictionary (EU project)