Connecting Acoustics to Linguistics in Chinese Intonation Greg Kochanski (Oxford Phonetics) Chilin Shih (University of Illinois) Tan Lee (CUHK) with Hongyan.

Presentation transcript:

Connecting Acoustics to Linguistics in Chinese Intonation Greg Kochanski (Oxford Phonetics) Chilin Shih (University of Illinois) Tan Lee (CUHK) with Hongyan Jing (IBM) Jiahong Yuan (Cornell)

Questions
Can we usefully include biomechanics in a phonetics model?
Can we objectively assign an importance to a syllable?
Can we write a unified description of F0 for both tone and accent languages?
Goal
Build a mathematical model that takes a sequence of discrete symbols as input and produces a quantitative prediction for F0.

The Challenge

Existing work Rising?

Basic assumptions used in modeling
People plan their utterances several syllables in advance.
People produce speech optimized to communicate with minimal effort.
A realistic model for the muscles that control F0.

Realistic model of muscle control for F0
We’d like a model of prosody that can apply beyond F0.

People talk nearly as fast as possible.

Speech could be optimal
Most of what we say is made from bits and pieces we’ve said before.
There are only 4 (Mandarin) or 6 (Cantonese) tones to combine.
A speaker has the chance to practice and optimize all the common 3- and 4-tone sequences.

Optimize what?
People want to minimize effort and/or talk faster.
– Chairs, cars
People want to minimize the chance that they will be misunderstood.
– Risk = P(misinterpreted) * cost(misinterpreted)
Minimize: Effort + cost * Error
– We allow each syllable to have a different weight, so error is a sum over syllables or words.
– Perhaps cost matches importance.
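The trade-off on this slide can be written as a small function. The sketch below is only an illustration: the function names and all numbers are hypothetical, and the slide does not say how Effort or the per-syllable errors are measured.

```python
def risk(p_misinterpreted, cost_misinterpreted):
    """Risk = P(misinterpreted) * cost(misinterpreted), as on the slide."""
    return p_misinterpreted * cost_misinterpreted


def objective(effort, errors, costs):
    """Effort + sum over syllables of cost * error (one weight per syllable)."""
    if len(errors) != len(costs):
        raise ValueError("need one cost per syllable error")
    return effort + sum(c * e for c, e in zip(costs, errors))


# Hypothetical numbers: three syllables, the second one weighted most heavily.
total = objective(effort=2.0, errors=[1.0, 0.5, 2.0], costs=[1.0, 4.0, 0.5])
print(total)
```

Raising the cost of one syllable makes its error count for more in the total, which is the sense in which cost could match importance.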

Modeling math
“Effort”: [equation not preserved in this transcript], written in terms of the muscle tension (~frequency) at time t.
“Error”: [equation not preserved in this transcript]. Each target encodes some linguistic information; r_i is the error of the i-th target, and s_i is its importance. y_i is the i-th pitch target, and a bar denotes an average over a target.

Effort and Error
How does Effort depend on the form of the pitch curve?
Error = mean-squared deviation between the F0 and the templates.

Model behavior
For cost >> 1, Error dominates, and pitch matches target.
For cost << 1, Effort dominates, both speaker and listener accept large deviations, and pitch smoothly interpolates.
For cost ~ 1, everything compromises.
Cost plays the role of a prosodic strength.
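The three regimes can be reproduced with a toy version of the optimization. This is a minimal sketch under stated assumptions, not the authors' model: Effort is assumed to be the sum of squared frame-to-frame pitch changes, Error is the squared deviation from one target value per frame, and the name `optimal_pitch` and the target values are hypothetical.

```python
def optimal_pitch(targets, strengths):
    """Minimise sum_t (x[t+1]-x[t])**2 + sum_t s[t]*(x[t]-y[t])**2.

    Setting the gradient to zero gives a tridiagonal linear system,
    (neighbours + s[t]) * x[t] - x[t-1] - x[t+1] = s[t] * y[t],
    solved here with the Thomas algorithm. Strengths must be positive.
    """
    n = len(targets)
    diag = []
    for t in range(n):
        neighbours = (1 if t > 0 else 0) + (1 if t < n - 1 else 0)
        diag.append(neighbours + strengths[t])
    off = -1.0  # coupling between adjacent frames
    rhs = [s * y for s, y in zip(strengths, targets)]

    # Forward elimination.
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for t in range(1, n):
        denom = diag[t] - off * c[t - 1]
        c[t] = off / denom
        d[t] = (rhs[t] - off * d[t - 1]) / denom

    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for t in range(n - 2, -1, -1):
        x[t] = d[t] - c[t] * x[t + 1]
    return x


targets_hz = [100.0, 200.0, 100.0, 200.0]  # hypothetical alternating targets
for cost in (1000.0, 1.0, 0.001):
    curve = optimal_pitch(targets_hz, [cost] * len(targets_hz))
    print(cost, [round(v, 1) for v in curve])
```

With a large cost the curve follows the targets, with a small cost it flattens toward a smooth compromise, and with a cost near 1 it lands in between.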

Another Challenge
[Figure: tone shapes; x-axis: time (10 ms intervals), y-axis: F0 (Hz)]

The rest of the model.
A model is a sequence of targets (used to compute the Error terms).
Each target has a strength (i.e. the cost of misinterpretation).
One target per tone.
Targets are stretched to fit syllable duration.
Only one phonological rule: 33 → 23 (a third tone before another third tone becomes a second tone).
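The two mechanical steps on this slide, the 33 → 23 rule and stretching a target to the syllable duration, can be sketched as follows. The template values are hypothetical normalized shapes, not the fitted targets from the talk, and the rule is applied naively to every adjacent pair, which is a simplification for longer runs of third tones.

```python
# Hypothetical normalized pitch templates (0 = low, 1 = high), one per Mandarin tone.
TEMPLATES = {
    1: [1.0, 1.0],        # high level
    2: [0.4, 1.0],        # rising
    3: [0.3, 0.0, 0.4],   # low / dipping
    4: [1.0, 0.0],        # falling
}


def apply_third_tone_sandhi(tones):
    """Rewrite 3 3 as 2 3: a tone 3 directly before another tone 3 becomes tone 2."""
    out = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            out[i] = 2
    return out


def stretch_target(template, n_frames):
    """Linearly resample a template so it spans n_frames frames."""
    m = len(template)
    if n_frames == 1 or m == 1:
        return [template[0]] * n_frames
    out = []
    for k in range(n_frames):
        pos = k * (m - 1) / (n_frames - 1)   # position in template coordinates
        i = min(int(pos), m - 2)             # left neighbour index
        frac = pos - i
        out.append(template[i] * (1 - frac) + template[i + 1] * frac)
    return out


tones = apply_third_tone_sandhi([1, 3, 3, 4])
durations = [12, 15, 20, 10]  # hypothetical syllable lengths in 10 ms frames
targets = [stretch_target(TEMPLATES[t], n) for t, n in zip(tones, durations)]
print(tones, [len(t) for t in targets])
```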

Model fits for Mandarin Chinese
[Figure labels: tone class (input); strength (result)]
Inside a word, strength is distributed by the metrical pattern.

What’s the procedure?
Compute the pitch curve as a function of phonological inputs and prosodic strength.
[Diagram labels: sequence of tones (phonology); prosodic strengths; predicted F0; data; nonlinear least-squares fitting algorithm]
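The fitting loop can be illustrated with a deliberately small example. Everything here is an assumption made for illustration: `predict` is a toy forward model standing in for the real pitch computation, there is a single strength parameter, and a golden-section search is swapped in for whichever nonlinear least-squares algorithm was actually used.

```python
import math


def predict(targets, strength):
    """Toy forward model (an assumption): stronger targets are matched more closely."""
    mean = sum(targets) / len(targets)
    w = strength / (1.0 + strength)  # 0 = fully smoothed, 1 = exactly on target
    return [mean + w * (y - mean) for y in targets]


def sum_squared_error(predicted, observed):
    """Least-squares criterion: sum of squared differences between model and data."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


def fit_strength(targets, observed, lo=0.01, hi=50.0, tol=1e-6):
    """Fit one strength by golden-section search on the squared error."""
    def loss(s):
        return sum_squared_error(predict(targets, s), observed)

    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if loss(c) < loss(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


targets_hz = [100.0, 200.0, 100.0, 200.0]   # hypothetical phonological targets
observed_hz = predict(targets_hz, 3.0)      # synthetic "data" with a known strength
print(fit_strength(targets_hz, observed_hz))
```

Because the synthetic data are generated from the same toy model, the search should recover a strength close to the value used to generate them; real data would need one strength per word or syllable and a multi-parameter optimizer.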

Model fits to Mandarin Chinese
0.61 free parameters per syllable, 13 Hz RMS error.

Strengths are stable under small changes in the model.
The two models have words defined by different labelers.
This model allows extra freedom: different tones are allowed to define their targets differently.
This model allows less freedom: all tones have the same type of target.
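For reference, the two figures of merit quoted here can be computed as below. This is a generic sketch: the function names and the sample numbers are hypothetical and are not the corpus values behind the 0.61 and 13 Hz figures.

```python
import math


def rms_error(predicted_hz, observed_hz):
    """Root-mean-square difference between predicted and observed F0, in Hz."""
    if len(predicted_hz) != len(observed_hz):
        raise ValueError("curves must have the same number of frames")
    squared = [(p - o) ** 2 for p, o in zip(predicted_hz, observed_hz)]
    return math.sqrt(sum(squared) / len(squared))


def parameters_per_syllable(n_free_parameters, n_syllables):
    """Model size normalised by the amount of speech it describes."""
    return n_free_parameters / n_syllables


# Hypothetical two-frame example.
print(rms_error([100.0, 110.0], [103.0, 106.0]))
print(parameters_per_syllable(5, 10))
```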

Model parameters
[Figure panels: Mandarin, Cantonese]
Phrasing is marked in speech.
Cantonese data courtesy of Prof. Tan Lee.

Model parameters
[Figure panels: Cantonese, Mandarin]
Nouns are relatively important.

Model parameters
[Figure panels: Cantonese, Mandarin]
Longer words tend to be spoken more carefully.

Metrical patterns inside words
[Figure (Mandarin): “normal” segmentation of characters into words vs. random segmentation of characters into words]
Lexical acquisition

Other nice properties
Strengths are correlated with duration (duration is a proxy for prominence):
– r = 0.40 (sentence final)
– r = 0.27 (non-final)
– >95% confidence
Strength is correlated with mutual information of neighboring syllables:
– r = … (>95% confidence)
Sloppy when generating unsurprising syllables, and precise for surprising syllables.
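The r values on this slide are correlation coefficients. Assuming they are ordinary Pearson correlations (the slide does not say), a value of this kind could be computed as in the sketch below; the strength and duration numbers are made up for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical fitted strengths and syllable durations (ms).
strengths = [0.8, 1.5, 0.6, 2.1, 1.0]
durations = [180.0, 210.0, 190.0, 240.0, 170.0]
print(pearson_r(strengths, durations))
```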

Local Conclusion
Intonation can be represented as:
– a small set of discrete symbols, in sequence, with
– a per-person or per-style shape for each symbol;
– modulated by a variable prosodic strength.
One symbol per syllable seems enough.
The strength parameter seems real.
– Similar across languages
– Matches language structure

Q: But does it work for English? A: Yes, under circumstances where the intonational phonology is simple enough to be obvious.

Reminder: Limitations of F0 and complexity of prosody.
To show the range of information that can be carried by prosody, observe an elegant experiment by Stan Freberg (1950):
The text has virtually no lexical information, but it still tells a story.
Even so, it is very hard to label individual words.

English
Sentences in the form “ ?”
Speaker is trying to confirm a single digit.
Models have just 1.1 parameters per sentence.

The model for English
There are identical boundary tones on every utterance.
All target shapes are identical, except the focus.
%X B B B | B A B | B B B B Y%
%X B B B | A B B | B B B B Y%
%X B A B | B B B | B B B B Y%
Rather simple phonology.
Accent prominence depends on position in phrase and in utterance.
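The symbol strings on this slide are simple enough to parse mechanically. The sketch below assumes a reading the slide only implies: `%X` and `Y%` are the boundary tones, `|` separates phrases, `B` is an ordinary target and `A` is the focused one. The function name and the returned dictionary layout are hypothetical.

```python
def parse_pattern(pattern):
    """Split a string like '%X B B B | B A B | B B B B Y%' into its parts."""
    tokens = pattern.split()
    if len(tokens) < 2 or not tokens[0].startswith("%") or not tokens[-1].endswith("%"):
        raise ValueError("expected boundary tones at both ends")
    phrases = [[]]
    for tok in tokens[1:-1]:
        if tok == "|":
            phrases.append([])       # start a new phrase
        else:
            phrases[-1].append(tok)  # one target symbol
    # Locate the focus as (phrase index, position within phrase).
    focus = [
        (p, i)
        for p, phrase in enumerate(phrases)
        for i, sym in enumerate(phrase)
        if sym == "A"
    ]
    return {
        "initial_boundary": tokens[0],
        "final_boundary": tokens[-1],
        "phrases": phrases,
        "focus": focus,
    }


parsed = parse_pattern("%X B B B | B A B | B B B B Y%")
print(parsed["focus"], [len(p) for p in parsed["phrases"]])
```

Under this reading the three example strings differ only in where the single `A` falls, which is what makes the phonology simple.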

Model details
[Figure: strength vs. time]
Decline over utterance
Decline over phrase
Local effect around accent
Compress range after accent

The rest of the model.
Where do you put the targets?
What are the targets?
– Pitch values?
– Slopes?
Do the targets change in F0 range with changes in strength?

Model fits well over a range of speeds.
[Figure labels: suppressed phrasing; low speed; high speed; merger of accent with boundary tone]

Model reproduces nontrivial features of the data and fits well over a range of speeds.
[Figure labels: suppressed phrasing; low speed; high speed; merger of accent with boundary tone]

Conclusion
Physiologically-based models can capture important aspects of speech.
A very compact representation of behavior.
It can be applied broadly:
– Two dialects of Chinese
– Some aspects of English
It raises questions about where the phonetics/phonology boundary actually sits.
Introduces an objective acoustic measure of prosodic prominence.
Suggests that the speaker may help the listener segment the speech stream.