Motor Control Strategies for Chinese Intonation Greg Kochanski (University of Oxford, UK) Chilin Shih (University of Illinois, Urbana-Champaign) Tan Lee.

1 Motor Control Strategies for Chinese Intonation. Greg Kochanski (University of Oxford, UK), Chilin Shih (University of Illinois, Urbana-Champaign), Tan Lee (Chinese University of Hong Kong), Hongyan Jing (IBM)

2 http://kochanski.org/gpk

3 The Goal:
– Explain intonation in a way that is consistent with linguistic assumptions and with known physiology and neuroscience.
The Method:
– Motion planning over a phrase.
– Minimize the sum of (a) the error between the actual pitch and the linguistic target and (b) an “effort” cost term that penalizes rapid, jerky motions.
The Result:
– Intonation in tone languages can be represented by a lexically specified tone template (i.e., you use a dictionary to look up which tone a syllable has) and a continuous cost-of-error parameter, one per word.
– Evidence that the cost-of-misinterpretation values we measure are real: cross-language similarities, metrical patterns, and other evidence.

4 The Challenge

5 Tone languages provide the ideal test case for motor control strategies:
1. tone is important, and
2. you can be sure what the speaker is trying to accomplish.
The meaning of each syllable is determined by the pitch contour over the syllable:
1. Ma (high tone) = “mother”
2. Ma (rising tone) = “hemp”
3. Ma (low falling tone) = “horse”
4. Ma (high falling tone) = “to scold”
You can look up the tone in the dictionary. Pitch contour is determined primarily by muscle tension in the vocal folds.

6 Another Challenge [Figure: F0 (Hz) vs. time (10 ms intervals) for tones 1 to 4, with typical tone shapes shown in green.]

7 People talk nearly as fast as possible, therefore dynamics must be important. [Figure: pitch (f0) for a maximum-rate warble, compared with conversational Mandarin on the same time scale.]

8 The Data
Male speaker of Mandarin (Chinese). Female speaker of Cantonese (Chinese). Text from newspaper news stories. 737 syllables for Mandarin:
– 4 ± 1.4 syllables per second
– 1.2 ± 0.7 seconds per phrase (between pauses).
Segmented into words by three independent native speakers (Mandarin). Tracks of fundamental frequency vs. time (pitch) extracted by get_f0 from the ESPS/Waves package.

9 Basic assumptions used in modeling
People plan their utterances several syllables in advance.
People produce optimal, highly practiced speech.
– Most of what we say is made from bits and pieces we’ve said before.
– There are only 4 (Mandarin) or 6 (Cantonese) tones to combine.
– A speaker has the chance to practice and optimize all the common 3- and 4-tone sequences.
A simple model for f0 (pitch): f0 is linearly related to muscle tensions.
A simple model of the muscle control strategy.
– No reason to believe pitch is controlled differently from other muscle motions.

10 Optimize what? People want to minimize the chance that they will be significantly misunderstood. Some words will be more important than others: –Risk = P(misinterpreted) * cost-of-misinterpretation –Perhaps weight matches importance. People want to minimize effort and/or talk faster –Chairs, Cars How to combine the two? –A weighted sum. –Cost-of-misinterpretation plays the role of the weight.

11 What is the unit of motion planning? Probably a phrase or a sentence. People start at a higher pitch when they begin longer sentences. Also planning of inhaled air volume. Therefore, there is some plan ~300 ms before start of speech. (Data courtesy Chilin Shih)

12 Modeling math. “We’re optimizing something”: the sum of the effort G and the total risk R for the utterance. In R, r_i is the error of the i-th target, and s_i is the cost if this particular word is misinterpreted. In r_i, p is the realized pitch and y(t) is the pitch of a point in the i-th target. p is implicitly a function of time; the time dependence is suppressed for clarity. (This is an approximation; see elsewhere for the correct, more detailed equation.)

13 Modeling math – more detail. R is the total risk for the utterance; s_i is the cost of a misinterpretation of the i-th syllable, and r_i is the error of the i-th target. y is the pitch of a point in the i-th target, and a bar denotes an average over a target. Alpha (α) controls how much the shape of the pitch contour matters. Beta (β) controls how much the average pitch of the syllable matters.

14 “Effort” How does G depend on the form of the pitch curve? Large effort implies a curve with larger slopes and sharper corners: wigglier.
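The slide equations did not survive in this transcript. The following is only a sketch of a form consistent with the verbal definitions on slides 12 to 14, not the authors' exact equations. The symbol E for the total cost, the constant c, the mean-subtracted form of the shape term, and the power of s_i are all assumptions made here for illustration.

```latex
\begin{aligned}
E   &= G + R, \\
R   &= \sum_i s_i \, r_i, \\
r_i &= \alpha \sum_{t \in i} \bigl[(p_t - \bar{p}_i) - (y_t - \bar{y}_i)\bigr]^2
       + \beta \, (\bar{p}_i - \bar{y}_i)^2, \\
G   &= \sum_t \bigl[\dot{p}_t^{\,2} + c \, \ddot{p}_t^{\,2}\bigr].
\end{aligned}
```

In this reading, the α term penalizes a mismatch in contour shape, the β term penalizes a mismatch in average pitch, and G grows with larger slopes (first derivative) and sharper corners (second derivative).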

15 Model behavior For s>>1, Error (R) dominates, and pitch matches target. For s<<1, Effort (G) dominates, both speaker and listener accept large deviations, and pitch smoothly interpolates. For s~1, everything compromises.
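The three regimes can be illustrated with a small numerical sketch. It assumes a simplified quadratic cost (squared pitch error weighted by s, plus squared first and second differences as the effort), a single strength shared by all frames, and a made-up high-then-low target on a normalized scale. The function names and the target are hypothetical, and this is not the authors' implementation.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def plan_pitch(targets, strengths):
    """Minimize sum_t s_t (p_t - y_t)^2 + sum (1st diff)^2 + sum (2nd diff)^2.

    The cost is quadratic, so the optimum solves (S + D^T D) p = S y.
    """
    n = len(targets)
    a = [[0.0] * n for _ in range(n)]
    b = [s * y for s, y in zip(strengths, targets)]
    for t in range(n):
        a[t][t] += strengths[t]
    # Accumulate D^T D for the first- and second-difference operators.
    for stencil in ([-1.0, 1.0], [1.0, -2.0, 1.0]):
        k = len(stencil)
        for start in range(n - k + 1):
            for i in range(k):
                for j in range(k):
                    a[start + i][start + j] += stencil[i] * stencil[j]
    return solve_linear(a, b)


def error(pitch, targets):
    """Unweighted squared deviation of the pitch curve from the targets."""
    return sum((p - y) ** 2 for p, y in zip(pitch, targets))


def effort(pitch):
    """Squared first plus squared second differences of the pitch curve."""
    d1 = [q - p for p, q in zip(pitch, pitch[1:])]
    d2 = [q - p for p, q in zip(d1, d1[1:])]
    return sum(x * x for x in d1) + sum(x * x for x in d2)


# Hypothetical target: a high syllable followed by a low one (normalized units).
targets = [1.0] * 10 + [-1.0] * 10
for s in (100.0, 1.0, 0.01):
    pitch = plan_pitch(targets, [s] * len(targets))
    print(f"s={s:g}: error={error(pitch, targets):.3f}, effort={effort(pitch):.3f}")
```

As s grows, the error shrinks and the effort rises; as s falls, the curve smooths out at the price of a larger deviation from the target.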

16 The rest of the model:
A model is a sequence of targets.
– The type of the target (tone1, tone2, …) is looked up in a dictionary.
Each target has a cost-of-misinterpretation.
– The cost is adjustable for each word.
– Syllables within a word are derived from word cost via the metrical pattern for words of a certain length.
One target per tone. Targets are stretched to fit syllable duration. Only one phonological rule: 33 → 23
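A minimal sketch of how a target sequence might be assembled from these ingredients. The template shapes in TONE_TEMPLATES are invented for illustration (normalized pitch, three points each), linear interpolation is an assumed way to stretch a template, and the left-to-right handling of runs of third tones is a simplification. None of these details come from the slides.

```python
# Hypothetical normalized tone templates; the real target shapes are not given here.
TONE_TEMPLATES = {
    1: [1.0, 1.0, 1.0],     # high
    2: [0.0, 0.5, 1.0],     # rising
    3: [-0.5, -1.0, -1.0],  # low falling
    4: [1.0, 0.0, -1.0],    # high falling
}


def apply_sandhi(tones):
    """Apply the single phonological rule 33 -> 23 to a tone sequence."""
    result = list(tones)
    for i in range(len(tones) - 1):
        if tones[i] == 3 and tones[i + 1] == 3:
            result[i] = 2
    return result


def stretch_template(template, n_frames):
    """Linearly resample a template so that it spans n_frames frames."""
    if n_frames == 1:
        return [template[len(template) // 2]]
    last = len(template) - 1
    out = []
    for k in range(n_frames):
        pos = k * last / (n_frames - 1)
        lo = int(pos)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append((1 - frac) * template[lo] + frac * template[hi])
    return out


def build_targets(tones, durations):
    """Concatenate one stretched target per syllable (durations in frames)."""
    contour = []
    for tone, n_frames in zip(apply_sandhi(tones), durations):
        contour.extend(stretch_template(TONE_TEMPLATES[tone], n_frames))
    return contour


# Two third-tone syllables of different lengths: the first surfaces as tone 2.
print(apply_sandhi([3, 3]))
print(build_targets([3, 3], [4, 6]))
```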

17 What’s the procedure? Compute the pitch curve as a function of phonological inputs and the cost of a misinterpretation. [Diagram labels: sequence of tones (phonology); costs of misinterpretations; predicted F0; data; nonlinear least-squares fitting algorithm.]
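The fitting loop can be sketched on a toy problem: generate a pitch curve from a known cost, then recover that cost by minimizing the squared difference between predicted and observed pitch. Everything here is an assumption for illustration: the effort term uses first differences only, a single strength s is fitted, the data are noise-free and synthetic, and a plain grid search over log(s) stands in for the nonlinear least-squares algorithm.

```python
import math


def plan_pitch(targets, s):
    """Minimize s * sum (p - y)^2 + sum (p[t+1] - p[t])^2.

    The normal equations are tridiagonal and are solved with the Thomas algorithm.
    """
    n = len(targets)
    diag = [s + (1.0 if t in (0, n - 1) else 2.0) for t in range(n)]
    off = -1.0
    rhs = [s * y for y in targets]
    c = [0.0] * n
    d = [0.0] * n
    c[0] = off / diag[0]
    d[0] = rhs[0] / diag[0]
    for t in range(1, n):
        denom = diag[t] - off * c[t - 1]
        c[t] = off / denom
        d[t] = (rhs[t] - off * d[t - 1]) / denom
    p = [0.0] * n
    p[-1] = d[-1]
    for t in range(n - 2, -1, -1):
        p[t] = d[t] - c[t] * p[t + 1]
    return p


def fit_strength(targets, observed, lo=-6.0, hi=6.0, step=0.01):
    """Grid search over log(s) for the least-squares match to observed pitch."""
    best_s, best_sse = None, float("inf")
    n_steps = int(round((hi - lo) / step))
    for k in range(n_steps + 1):
        s = math.exp(lo + k * step)
        predicted = plan_pitch(targets, s)
        sse = sum((a - b) ** 2 for a, b in zip(predicted, observed))
        if sse < best_sse:
            best_s, best_sse = s, sse
    return best_s


# Synthetic "data": a high-then-low target rendered with a known strength.
targets = [1.0] * 10 + [-1.0] * 10
true_s = 2.0
observed = plan_pitch(targets, true_s)
fitted_s = fit_strength(targets, observed)
print(f"true s = {true_s}, fitted s = {fitted_s:.3f}")
```

A real fit would adjust many costs at once, and possibly other parameters, against measured F0 tracks, which is why a dedicated nonlinear least-squares routine is the natural tool there.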

18 Model fits for Mandarin Chinese Tone class (input) Cost-of-misinterpretation (result) Inside a word, the cost of a misinterpretation is distributed by the metrical pattern

19 Model fits to Mandarin Chinese 0.61 free parameters per syllable, 13 Hz RMS error.

20 Results are stable under small changes in the model. The two models have words defined by different labelers This model allows extra freedom: different tones are allowed to define their targets differently This model allows less freedom: all tones have the same type of target. Costs for misinterpreting different syllables.

21 Model parameters [Figure panels: Mandarin; Cantonese.] Phrasing is marked in speech. Cantonese data courtesy of Prof. Tan Lee.

22 Metrical patterns inside words (Mandarin) “Normal” segmentation of characters into words. Random segmentation of characters into words – Note that the metrical pattern disappears, showing that we are measuring something real that is tied to words. The metrical pattern controls how the cost-of-misinterpretation is split up inside a word. Syllables are marked with . The vertical position is proportional to log(s) for each syllable, so higher syllables have larger s, and will be executed more carefully. For 4-syllable words, the error bars are shown by the pairs of arrows.

23 Another nice property: The cost-of-misinterpretation parameter for a syllable is correlated with the mutual information with the preceding syllable: r = -0.175 (>95% confidence). Pitch patterns are implemented sloppily for syllables that are unsurprising, and precisely for surprising ones. (Mutual information values were computed from a database of 15000 newspaper sentences. Syllable identity was defined by phoneme content and tone.)
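A sketch of the kind of computation involved, under stated assumptions: "mutual information with the preceding syllable" is read here as pointwise mutual information over adjacent syllable pairs, probabilities are plain relative frequencies, and the tiny corpus is invented. The slides do not give the actual estimator.

```python
import math
from collections import Counter


def pointwise_mutual_information(syllables):
    """Return log2 P(prev, cur) / (P(prev) P(cur)) for each adjacent pair."""
    unigrams = Counter(syllables)
    bigrams = Counter(zip(syllables, syllables[1:]))
    n_uni = len(syllables)
    n_bi = len(syllables) - 1
    pmi = {}
    for (prev, cur), count in bigrams.items():
        p_joint = count / n_bi
        p_prev = unigrams[prev] / n_uni
        p_cur = unigrams[cur] / n_uni
        pmi[(prev, cur)] = math.log2(p_joint / (p_prev * p_cur))
    return pmi


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented toy corpus; syllable identity = phoneme content plus tone number.
corpus = ["ma1", "ma3", "ma1", "ma3", "ma1", "ma4"]
pmi = pointwise_mutual_information(corpus)
for pair, value in sorted(pmi.items()):
    print(pair, round(value, 3))
```

With fitted costs in hand, `pearson_r` could be applied to the per-syllable log costs and the corresponding mutual information values; a negative r would correspond to predictable syllables being produced less carefully.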

24 Conclusion
Models with motor planning capture important aspects of speech. They allow a very compact representation of complex behaviors.
Intonation is represented as:
– a small set of discrete symbols, in sequence,
– modulated by a cost-of-misinterpretation.
The cost-of-misinterpretation parameter seems real:
– Similar across languages
– Matches language structure
This model can be applied broadly:
– Two dialects of Chinese
– Some aspects of English
– Separating different singing and speaking styles from the content
See http://kochanski.org/papers.

