CENTER FOR SPOKEN LANGUAGE UNDERSTANDING
PREDICTION AND SYNTHESIS OF PROSODIC EFFECTS ON SPECTRAL BALANCE OF VOWELS
Jan P.H. van Santen and Xiaochuan Niu
Center for Spoken Language Understanding
OGI School of Science & Technology at OHSU

OVERVIEW
1. IMPORTANCE OF SPECTRAL BALANCE
2. MEASUREMENT OF SPECTRAL BALANCE
3. ANALYSIS METHODS
4. RESULTS
5. SYNTHESIS
6. CONCLUSIONS

1. IMPORTANCE OF SPECTRAL BALANCE
Linguistic Control Factors
–Stress-like factors
–Positional factors
–Phonemic factors
Acoustic Correlates
–Traditionally TTS-controlled: Pitch, timing, amplitude
–Demonstrated in natural speech, but usually not TTS-controlled:
  Spectral tilt, balance
  Formant dynamics
  …

2. MEASUREMENT OF SPECTRAL BALANCE
Data:
–472 greedily selected sentences
  Genre: newspaper
  Greedy features: linguistic control factors
–One female speaker
–Manual segmentation
–Accent: independent ratings by 3 judges, 0-3 score

2. MEASUREMENT OF SPECTRAL BALANCE
Energy in 5 formant-range frequency bands
–B0: Hz [~F0]
–B1: Hz [~F1]
–B2: Hz [~F2]
–B3: Hz [~F3]
–B4: 3500-max Hz [~fricative noise]
In other words, a multidimensional measure
Filter bank → Square → Average [1 ms rect.] → 20 log10(Bi) → Subtract estimated per-utterance means
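
The measurement chain on this slide (filter bank, squaring, 1 ms rectangular averaging, 20 log10, mean subtraction) could be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the band edges for B0 to B3 are hypothetical placeholders (only the 3500 Hz lower edge of B4 is given above), the brick-wall FFT filter stands in for an unspecified filter bank, and a plain per-band mean stands in for the "estimated per-utterance means". The function names are likewise invented for the example.

```python
import numpy as np

# Band edges in Hz. Only the B4 lower edge (3500 Hz) comes from the slide;
# the B0-B3 edges below are hypothetical placeholders.
EDGES_HZ = [(100, 300), (300, 800), (800, 2500), (2500, 3500), (3500, None)]

def band_levels_db(x, fs, edges_hz=EDGES_HZ, win_ms=1.0):
    """Return an array (bands x samples) of band levels, 20*log10 of smoothed energy."""
    x = np.asarray(x, dtype=float)
    spectrum = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    win = max(1, int(round(fs * win_ms / 1000.0)))   # 1 ms rectangular window
    kernel = np.ones(win) / win
    levels = []
    for lo, hi in edges_hz:
        hi = fs / 2.0 + 1.0 if hi is None else hi    # "max": up to and including Nyquist
        mask = (freqs >= lo) & (freqs < hi)
        band = np.fft.irfft(spectrum * mask, n=len(x))         # brick-wall band-pass (assumption)
        energy = np.convolve(band ** 2, kernel, mode="same")   # square, then average
        levels.append(20.0 * np.log10(np.maximum(energy, 1e-12)))  # floor avoids log(0)
    return np.array(levels)

def subtract_utterance_mean(levels_db):
    """Remove each band's mean over the utterance (a simple stand-in for the estimated means)."""
    return levels_db - levels_db.mean(axis=1, keepdims=True)
```

Calling `band_levels_db(x, fs)` on one utterance and then `subtract_utterance_mean` yields one five-dimensional contour per sample, which is the "multidimensional measure" referred to above.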

2. MEASUREMENT OF SPECTRAL BALANCE
Details:
–Confounding with F0
  Measure pitch-corrected and raw
  –For certain wave shapes, pitch directly related to fixed-frame energy
  –Why do both: wave shapes may change in unknown ways
  F0 not confined to B0 [female speech]
–Vowel formants not quite confined to bands [e.g., F1 for /EE/ and F3 for /ER/]

2. MEASUREMENT OF SPECTRAL BALANCE
Why not more or different bands?
–Multiple interacting Linguistic Control Factors
  Need measurements that minimize interactions
–5 bands
  Different vowels “behave similarly”
  Can model vowels as a class
Why not simply spectral tilt?
–5 bands carry more information than a single measure
–Supply more information for synthesis

3. ANALYSIS METHODS
Measures likely to behave like segmental duration:
–Multiple interacting, confounded factors:
  Interaction: Magnitude of the effect of one factor may depend on other factors
  Confounding: Unequal frequencies of control factor combinations
–“Directional Invariance”
  Direction of the effect of one factor is independent of other factors

3. ANALYSIS METHODS
Need method that
–can handle multiple interacting, confounded factors and
–takes advantage of Directional Invariance
Used: Sums of Products Model:
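
The model equation itself does not appear in the text. A form consistent with the index sets K and I_i used in the special cases that follow is the usual sums-of-products expression, given here as an assumption rather than as the slide's exact formula. The symbols y (predicted band value), f_j (level of factor j) and S_{i,j} (factor scale) are names chosen for this illustration.

```latex
y(f_0, \dots, f_n) = \sum_{i \in K} \prod_{j \in I_i} S_{i,j}(f_j)
```

Under this reading, a single product term over all factors gives the multiplicative model, and one single-factor term per factor gives the additive model.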

3. ANALYSIS METHODS
Special cases:
–Multiplicative model: $K = \{1\}$, $I_1 = \{0, \dots, n\}$
–Additive model: $K = \{0, \dots, n\}$, $I_i = \{i\}$

3. ANALYSIS METHODS
Used additive model
Note: Parameter estimates are:
–Estimates of marginal means …
–… in balanced design:
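
To illustrate why additive-model parameters can be read as marginal means of a balanced design, the sketch below fits a two-factor additive model by least squares to unbalanced data and reports "corrected" means per factor level. It is an assumption-laden toy example: the function name, the two-factor restriction and the least-squares fitting procedure are not taken from the slides.

```python
import numpy as np

def fit_additive(levels_a, levels_b, y, n_a, n_b):
    """Fit y ~ a[level_a] + b[level_b] by least squares.

    Returns corrected means for each level of factor A and of factor B,
    i.e. the marginal means the data would show if every factor
    combination occurred equally often.
    """
    levels_a = np.asarray(levels_a)
    levels_b = np.asarray(levels_b)
    y = np.asarray(y, dtype=float)
    rows = np.arange(len(y))
    X = np.zeros((len(y), n_a + n_b))       # one-hot design matrix
    X[rows, levels_a] = 1.0
    X[rows, n_a + levels_b] = 1.0
    # The design is rank-deficient; lstsq returns the minimum-norm solution,
    # and the sums a_i + b_j used below do not depend on that choice.
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    a, b = coef[:n_a], coef[n_a:]
    corrected_a = a + b.mean()              # average over B levels with equal weight
    corrected_b = b + a.mean()              # average over A levels with equal weight
    return corrected_a, corrected_b
```

With confounded data the raw per-level means and the corrected means can differ substantially, which is presumably why the results plots use a "corrected mean" axis.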

3. ANALYSIS METHODS
Pitch correction:
Confounding with F0: Show both and:

4. RESULTS: (A) POSITIONAL EFFECTS
[Figure] 5 Bands, not pitch-corrected. Solid: right position, dashed: left position. Y-axis: corrected mean
[Figure] 5 Bands, pitch-corrected
[Figure] 4 Bands, not pitch-corrected
[Figure] 4 Bands, pitch-corrected

4. RESULTS: (B) STRESS/ACCENT EFFECTS
[Figure] 5 Bands, not pitch-corrected. Solid: stressed syllable, dashed: unstressed. Y-axis: corrected mean
[Figure] 5 Bands, pitch-corrected
[Figure] 4 Bands, not pitch-corrected
[Figure] 4 Bands, pitch-corrected

4. RESULTS: (C) TILT EFFECTS

5. SYNTHESIS
Use ABS/OLA sinusoidal model:
–$s[n]$ = sum of overlapped short-time signal frames $s_k[n]$
–$s_k[n]$ = sum of quasi-harmonic sinusoidal components: $s_k[n] = \sum_l A_{k,l} \cos(\omega_{k,l} n + \phi_{k,l})$
Each frame of a unit is represented by a set of quasi-harmonic sinusoidal parameters
Given the desired F0 contour, a pitch shift is applied to the sinusoidal parameters of the unit to obtain the target parameters $A_{k,l}$
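
The frame model above can be sketched in a few lines. This is only an illustration of summing sinusoids per frame and overlap-adding the frames. The triangular window, the fixed hop size and the function names are assumptions for the example and are not specified on the slide.

```python
import numpy as np

def synth_frame(amps, omegas, phases, length):
    """One frame: s_k[n] = sum_l A_{k,l} cos(omega_{k,l} n + phi_{k,l}), n = 0..length-1."""
    amps = np.asarray(amps, dtype=float)[:, None]
    omegas = np.asarray(omegas, dtype=float)[:, None]   # radians per sample
    phases = np.asarray(phases, dtype=float)[:, None]
    n = np.arange(length)[None, :]
    return np.sum(amps * np.cos(omegas * n + phases), axis=0)

def overlap_add(frames, hop):
    """s[n]: windowed frames summed at a fixed hop (window choice is an assumption)."""
    frames = [np.asarray(f, dtype=float) for f in frames]
    length = len(frames[0])
    window = np.bartlett(length)                         # triangular synthesis window
    out = np.zeros(hop * (len(frames) - 1) + length)
    for k, frame in enumerate(frames):
        out[k * hop:k * hop + length] += window * frame
    return out
```

In this sketch a pitch shift would amount to rescaling the `omegas` of each frame toward the desired F0 before calling `synth_frame`; how the amplitudes are resampled at the new frequencies is left open here.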

5. SYNTHESIS
Given the differences in prosodic factors between the original and target units, compute the band differences
Transform the band differences into weights applied to the sinusoidal parameters; the weight for the j'th harmonic is determined by the i'th band in which that harmonic is located
Spectral smoothing across unit boundaries
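
One way the band-to-harmonic weighting could look is sketched below. The exact formulas are not given in the text, so this assumes that each band difference is expressed in dB and converted to a linear amplitude gain with 10^(d/20). The function name and the example band edges are hypothetical.

```python
import numpy as np

def harmonic_weights(harmonic_hz, edges_hz, delta_db):
    """Return one amplitude weight per harmonic.

    harmonic_hz: frequencies of the harmonics in one frame
    edges_hz:    list of (low, high) band edges in Hz (hypothetical values)
    delta_db:    predicted band difference, target minus original, per band
    """
    harmonic_hz = np.asarray(harmonic_hz, dtype=float)
    weights = np.ones_like(harmonic_hz)          # harmonics outside all bands are left unchanged
    for (lo, hi), d in zip(edges_hz, delta_db):
        in_band = (harmonic_hz >= lo) & (harmonic_hz < hi)
        weights[in_band] = 10.0 ** (d / 20.0)    # dB difference to linear gain (assumption)
    return weights
```

The weights would then multiply the amplitudes $A_{k,l}$ of the corresponding harmonics. Because the gain jumps at band edges and unit boundaries, some smoothing step such as the one named on the slide is still needed afterwards.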

5. SYNTHESIS
[Figure] 5 Bands modification example [i:]

6. CONCLUSIONS
Described simple methods for predicting and synthesizing spectral balance
But: Spectral balance is only one “non-standard acoustic correlate”
Others that remain to be addressed:
–Spectral dynamics
–Phase