Understanding Variation of VOT in spontaneous speech


Understanding Variation of VOT in spontaneous speech Yao Yao UC Berkeley yaoyao@berkeley.edu

Overview
- Background
- Methodology: Data, Preliminary analysis, Regression model
- Results
- Discussion

Background
Keywords
- VOT (Voice Onset Time)
- VOT variation
- Spontaneous speech
VOT (Voice Onset Time)
- The time between the release of a stop consonant and the onset of voicing in the following vowel
- Sensitive to the speaker and the speaking environment
[Figure labels: close, release, vowel onset]
Note: another talk at this conference covers VOT variation for speaker identification.
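As a small illustration of the definition above, VOT can be computed from two time stamps on a token. This is only a sketch: the function name and the example times are hypothetical and are not taken from the study.

```python
def voice_onset_time_ms(release_s: float, voicing_onset_s: float) -> float:
    """Return VOT in milliseconds from two time stamps given in seconds."""
    return (voicing_onset_s - release_s) * 1000.0


# Hypothetical token: burst release at 1.204 s, voicing onset at 1.259 s.
vot = voice_onset_time_ms(release_s=1.204, voicing_onset_s=1.259)
print(f"VOT = {vot:.1f} ms")
```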

Background: What conditions the length of VOT?
- Place of articulation (POA): VOT increases as POA moves backward, i.e. [p] < [t] < [k]
- Following vowel
- Speaking rate
- Age, gender
- Dialectal background
- Speech disorders
- Lung volume
- Hormone level
- …

Background: Why use spontaneous speech data?
- Previous results are mostly based on experimental data or read speech.
- Large-scale transcribed speech corpora now make it possible to study patterns in “naturalistic” data (cf. Bell et al. 1999, Gahl in press, Raymond et al. 2006, etc.).

Background: Experimental data vs. spontaneous data
Experimental data
- Controlled content
- Easy to investigate individual factors
- Hard to see the general pattern of variation
- Not necessarily natural speech
Spontaneous data
- Uncontrolled content
- Need to statistically control for irrelevant factors
- Provides a general picture of variation
- More naturalistic; includes factors such as disfluency

Background
Purpose of this study
- To investigate some of the factors that have been shown to affect VOT in experiments, as well as those that have been proposed to influence spontaneous speech production
Main statistical tool
- Linear regression
- Variables are added step by step
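The step-by-step procedure can be sketched as follows: fit an ordinary least squares model, add one predictor at a time, and record how R^2 changes. This is an assumed reading of "adding variables step by step", not the author's actual script. The data are made-up toy values, the helper names are hypothetical, and categorical predictors such as POA would first need dummy coding.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def r_squared(columns, y):
    """R^2 of a least squares fit of y on an intercept plus the given columns."""
    n = len(y)
    design = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(design[0])
    xtx = [[sum(row[a] * row[b] for row in design) for b in range(k)]
           for a in range(k)]
    xty = [sum(row[a] * y[i] for i, row in enumerate(design)) for a in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in design]
    mean_y = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Made-up toy data, not taken from the Buckeye corpus.
vot_ms = [70.0, 52.0, 41.0, 66.0, 58.0, 35.0, 75.0, 45.0]
predictors = {
    "log_frequency": [1.0, 2.5, 3.0, 0.5, 2.0, 3.5, 1.5, 4.0],
    "speech_rate": [4.0, 5.5, 6.0, 4.5, 5.0, 6.5, 3.5, 5.0],
}

steps = []        # (variable added, R^2 before, R^2 after)
included = []
previous = 0.0    # an intercept-only model explains no variance
for name, values in predictors.items():
    included.append(values)
    current = r_squared(included, vot_ms)
    steps.append((name, previous, current))
    print(f"{name}: R^2 {previous:.3f} -> {current:.3f}")
    previous = current
```

For nested least squares models fitted to the same tokens, R^2 cannot decrease when a variable is added, so each step reports a gain of zero or more.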

Data
Buckeye corpus (Pitt et al. 2005)
- 40 speakers, all residents of Columbus, Ohio
- Balanced in age and gender
- 1-hour interviews
- Transcribed at word and phone level
- At the time of this study, 19 speakers’ data had been transcribed completely.

Data
Data from 2 speakers are used for this study
- F07: older, female, low speaking rate (4.022 syllables/sec)
- M08: younger, male, high speaking rate (6.434 syllables/sec)
Target tokens
- Word-initial transcribed voiceless stops (i.e., [p], [t], [k])

Data: Finding the point of burst
- An automatic algorithm is used first (cf. Yao 2007).
- >70% of the tokens are checked manually. Error < 3.5 ms.
- Some tokens are rejected by the algorithm for not having a significant burst point: 3.03% of F07’s tokens and 15.85% of M08’s tokens.
Number of tokens
                                       F07   M08
Target tokens                          231   618
Target tokens with burst point found   210   466

VOT by speaker F07: Mean = 57.41ms, SD = 26.00ms M08: Mean = 34.86ms, SD = 19.82ms

Preliminary analysis: POA
- Canonical rule: VOT increases as POA moves backward, from the lips to the palate.
- This trend is shown in M08’s data (F(2, 463) = 14.061, p = 1.77e-6), but not in F07’s (F(2, 207) = 3.9925, p = 0.01989).
[Figures: VOT by POA (p, t, k) in F07; VOT by POA (p, t, k) in M08]
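The F statistics reported for POA are consistent with a one-way ANOVA over the three stop categories: 466 tokens in 3 groups give the degrees of freedom (2, 463) shown for M08. The sketch below computes F and its degrees of freedom by hand on made-up toy values, assuming a plain one-way design. Obtaining the p-value additionally requires the F distribution, for example via `scipy.stats.f_oneway`.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a dict of label -> list of values."""
    all_values = [v for values in groups.values() for v in values]
    grand_mean = sum(all_values) / len(all_values)
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        group_mean = sum(values) / len(values)
        ss_between += len(values) * (group_mean - grand_mean) ** 2
        ss_within += sum((v - group_mean) ** 2 for v in values)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up VOT values in ms, grouped by place of articulation.
vot_by_poa = {
    "p": [30.0, 35.0, 40.0],
    "t": [45.0, 50.0, 55.0],
    "k": [60.0, 65.0, 70.0],
}
f_stat, df_b, df_w = one_way_anova(vot_by_poa)
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```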

Preliminary analysis: Word class
Split the data set into three subsets
- Content words
- Function words
- Other (e.g. proper names)
Number of words of different classes
       Content   Function   Other
F07    155       47         8
M08    346       104        16

Preliminary analysis: Word class
- Previous literature shows that words of different classes are processed differently. In particular, function words are processed differently from content words. (XXX; YYY)
- Content words and function words differ greatly in usage frequency: in general, function words are used much more often than content words. There could therefore be a confounding frequency effect.
- In both speakers, function words have on average shorter VOT than content words, but the variation is large.
- F07: F(2, 206) = 7.5593, p = 0.000679
- M08: F(2, 457) = 14.693, p = 6.541e-07
[Figures: VOT by word class (function, content, other) in F07; VOT by word class (function, content, other) in M08]

Preliminary analysis: word class Word class distinction or general effect of frequency? Obviously word frequency and word class are two related measures, since function words are in general much more frequently used than content words.

Preliminary analysis: word frequency
- The effect of word class suggests that there might be a more general effect of word frequency: word forms that are used more often tend to be shorter.
- Two frequency measures: log of Celex frequency; log of Buckeye frequency (speaker-specific)
- The two measures are highly correlated (r = 0.826).
- Effect: more frequent words have shorter VOT.
- R^2 indicates how much variance is explained in the regression model.
Frequency effect
       p        R^2 (%), Celex frequency   R^2 (%), Buckeye frequency
F07    <0.001   5.1                        4.8
M08    -        4.9                        5.9
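For a regression with a single predictor, R^2 equals the squared Pearson correlation between the predictor and VOT, so the frequency effect can be sketched in a few lines. The counts and VOT values below are made up for illustration. The natural logarithm is an assumption, since the slides only say "log".

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Made-up word counts and VOT values (ms); not corpus data.
word_counts = [5, 20, 150, 900, 4000]
vot_ms = [68.0, 61.0, 55.0, 47.0, 40.0]

log_freq = [math.log(c) for c in word_counts]
r = pearson_r(log_freq, vot_ms)
r_squared_pct = 100 * r ** 2  # variance explained, in percent
print(f"r = {r:.3f}, R^2 = {r_squared_pct:.1f}%")
```

A negative r corresponds to the direction reported on the slide: more frequent words have shorter VOT.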

Word class vs. frequency
- After factoring out the effect of word class, frequency is no longer significant in F07’s data (p = 0.277), but remains significant in M08’s data (p = 0.003).
- This suggests that the frequency effect observed in F07 is mainly due to the effect of word class. In other words, the effect of word class must be factored out before the effect of frequency can be studied.
- Previous literature also suggests that content words and function words are processed differently, so a homogeneous effect is hard to find in the overall dataset and the two classes must be analyzed separately.

Linear regression model
We model the variation in the content word set only
- F07: 155 tokens
- M08: 346 tokens
Factors investigated
- POA
- Word frequency
- Phonetic context
- Speech rate
- Utterance position

Regression: POA
- The canonical rule of [p] < [t] < [k] is only shown in M08’s data, not in F07’s data.
- POA is not significant in F07’s content word set, but remains significant in M08’s content word set.
                 F07     M08
p                0.216   <0.001
R-squared (%)    -       9.2

Regression: word frequency
- In both speakers’ data, more frequent words tend to have shorter VOT, but the trends do not reach statistical significance.
- For both speakers, the Buckeye frequency measure is slightly better than the Celex frequency measure.

Regression: word frequency
                                          p       R^2 (%)   R^2 change of the model (%)
F07   Log Celex freq                      0.391   0.2       0 → 1.3
F07   Buckeye freq (speaker-specific)     0.577   -         0 → 1.7
M08   Log Celex freq                      0.169   0.3       9.2 → 9.4
M08   Buckeye freq (speaker-specific)     0.067   0.7       9.2 → 9.6

Regression: phonetic context
Two measures
- Category of the previous phone, coded as C(onsonant), V(owel), O(ther sound), and N(on-linguistic)
- Category of the next phone
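The four-way coding can be sketched as a lookup followed by dummy coding, which is how a categorical predictor usually enters a linear regression. Everything below is illustrative: the phone label sets are hypothetical and incomplete, and treating any unlisted label as O is an assumption, since the slides do not define the "other sound" category.

```python
# Hypothetical, incomplete label sets; the corpus's own phone inventory may differ.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "ow", "uw"}
CONSONANTS = {"p", "t", "k", "b", "d", "g", "s", "z", "m", "n", "l", "r"}
NON_LINGUISTIC = {"SIL", "NOISE", "LAUGH"}

LEVELS = ["C", "V", "O", "N"]


def phone_category(label):
    """Map a phone label to C, V, O, or N (unlisted labels fall into O)."""
    if label in NON_LINGUISTIC:
        return "N"
    if label in VOWELS:
        return "V"
    if label in CONSONANTS:
        return "C"
    return "O"


def dummy_code(category, reference="C"):
    """Return 0/1 indicators for every level except the reference level."""
    return [1 if category == level else 0 for level in LEVELS if level != reference]


for label in ["iy", "s", "SIL", "xx"]:
    cat = phone_category(label)
    print(label, cat, dummy_code(cat))
```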

Regression: phonetic context
                                  p       R^2 (%)   R^2 change of the model (%)
F07   Previous phone category     0.141   0.7       1.7 → 1.9
F07   Next phone category         0.563   0.4       1.7 → 0.6
M08   Previous phone category     0.127   0.4       9.6 → 9.27
M08   Next phone category         0.036   0.9       9.6 → 10.08

Regression: phonetic context
[Figures: VOT by previous phone category in F07; VOT by next phone category in M08]

Regression: speech rate
Three speed measures
- Duration of the next phone, in ms
- Average speed of a 3-word stretch centered on the target word, in syllables/sec
- Average speed of the pause-bounded stretch that contains the target word, in syllables/sec
All speed measures predict that words in faster speech tend to have shorter VOT.
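The two window-based measures can be sketched from a word-level alignment. This is a minimal illustration under stated assumptions: the `Word` record, the treatment of pauses as zero-syllable entries, and the handling of windows at a pause boundary are all hypothetical choices, not the procedure used in the study.

```python
from dataclasses import dataclass


@dataclass
class Word:
    start: float            # seconds
    end: float              # seconds
    syllables: int          # 0 for a pause
    is_pause: bool = False


def rate(words):
    """Syllables per second over a contiguous stretch of words."""
    total = sum(w.syllables for w in words)
    return total / (words[-1].end - words[0].start)


def three_word_rate(words, i):
    """Rate over the target word and its two neighbours, skipping pauses."""
    window = words[max(0, i - 1):i + 2]
    return rate([w for w in window if not w.is_pause])


def pause_bounded_rate(words, i):
    """Rate over the stretch around word i delimited by the nearest pauses."""
    lo = i
    while lo > 0 and not words[lo - 1].is_pause:
        lo -= 1
    hi = i
    while hi < len(words) - 1 and not words[hi + 1].is_pause:
        hi += 1
    return rate(words[lo:hi + 1])


# Made-up alignment: pause, four words, pause. The target word is at index 2.
words = [
    Word(0.0, 0.5, 0, is_pause=True),
    Word(0.5, 0.8, 1),
    Word(0.8, 1.3, 2),
    Word(1.3, 1.5, 1),
    Word(1.5, 2.0, 3),
    Word(2.0, 2.6, 0, is_pause=True),
]
local_rate = three_word_rate(words, 2)
stretch_rate = pause_bounded_rate(words, 2)
print(f"3-word: {local_rate:.2f} syll/s, pause-bounded: {stretch_rate:.2f} syll/s")
```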

Regression: speech rate
                                      p        R^2 (%)   R^2 change of the model (%)
F07   Duration of next phone          0.014    3.2       1.9 → 5.1
F07   Average of the 3-wd stretch     <0.001   10.93     1.9 → 11.8
F07   Average of the local stretch    <0.001   6         1.9 → 7.1
M08   Duration of next phone          0.342    -         10.08 → 16.62
M08   Average of the 3-wd stretch     <0.001   4.1       10.08 → 12.85
M08   Average of the local stretch    0.014    1.4       10.08 → 15.07

Regression: utterance position
- Utterance-final lengthening has been documented extensively in the literature.
- We code tokens for whether they are followed by silence.
Number of tokens
             F07   M08
Non-final    146   312
Final        9     34

Regression: utterance position
[Figures: VOT by utterance position (non-final, final) in F07; VOT by utterance position (non-final, final) in M08]

Regression: utterance position
                              p       R^2 (%)   R^2 change of the model (%)
F07   Utterance position      0.021   2.8       11.8 → 19.11
M08   Utterance position      0.652   0.2       16.62 → 13.31
- F07: utterance position contributes to the variation in VOT.
- M08: utterance position doesn’t contribute to the variation in VOT.

Regression: complete model
Model performance, F07
Variable added                          R^2 (%)
POA                                     -
Buckeye frequency                       1.7
Previous phone category                 1.9
Average speed of the 3-word stretch     11.8
Utterance position                      19.11
Model performance, M08
Variable added                          R^2 (%)
POA                                     9.2
Buckeye frequency                       9.6
Next phone category                     10.08
Duration of the next phone              16.62

Regression: trends observed
- POA: [p] < [t] < [k]
- Word class: function words < content words
- Word frequency: ?? higher frequency → shorter VOT
Here the words “shorter” and “longer” are used loosely, not referring to the specific effects of lengthening or shortening.

Regression: trends observed
- Phonetic category: ?? preceded by vowel → shorter VOT; ?? followed by vowel → longer VOT
- Speaking rate: faster speech → shorter VOT
- Utterance position: utterance-final → longer VOT

Regression: trends observed
Missing from the picture
- Contextual predictability
- Stress
- Disfluency
- Emotion

Discussion
- Individual differences
- Other between-subject factors: age, gender, average speaking rate
- Measurements

Discussion
- Relatively little variation is explained by the full model (19.11% in F07 and 16.62% in M08).
- Factors missing from the picture: contextual predictability, stress, disfluency, etc.
- Limitations of the linear regression model: non-linear effects, non-homogeneous effects, mixture of categorical and continuous variables

Discussion: Echoing and challenging previous findings
VOT and POA
- Canonical rule is observed in M08, but not in F07
Word frequency effect
- Overshadowed by the word class distinction
Utterance-final lengthening
- Significant in F07, but not in M08
- Speaking style? Content words vs. function words? Speed measures?
- Given each speed measure, the p-value of the partial correlation between VOT and utterance position in F07 is:
  (1) p = 0.0724 if spd = next_dur
  (2) p = 0.4529 if spd = spd_3wd
  (3) p = 0.0743 if spd = str_spd
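A first-order partial correlation of the kind mentioned above can be computed from the three pairwise Pearson correlations. The sketch below uses the standard formula on made-up values, with utterance position coded as 0/1. It returns only the coefficient and does not reproduce the p-values on the slide. The variable names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y after controlling for z."""
    r_xy = pearson_r(x, y)
    r_xz = pearson_r(x, z)
    r_yz = pearson_r(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Made-up tokens: VOT (ms), utterance-final flag (0/1), local speech rate (syll/s).
vot_ms = [62.0, 55.0, 48.0, 70.0, 41.0, 66.0, 80.0, 52.0]
is_final = [0, 0, 0, 1, 0, 0, 1, 0]
speech_rate = [4.5, 5.0, 6.0, 3.8, 6.5, 4.2, 3.5, 5.5]

r_partial = partial_correlation(vot_ms, is_final, speech_rate)
print(f"partial r(VOT, final | rate) = {r_partial:.3f}")
```

If the partial coefficient is much smaller than the raw correlation, the apparent position effect is largely shared with speech rate, which is the question the three p-values above address.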

Conclusion Still a long way to go to model VOT variation in spontaneous speech… Thanks! Any comments are welcome!

Thanks to
- Anonymous subjects
- Contributors to the Buckeye corpus
- Prof. Keith Johnson
- Members of the phonology lab at UC Berkeley

Selected references
- Bell, A. et al. (1999) Forms of English function words - Effects of disfluencies, turn position, age and sex, and predictability. Proceedings of ICPhS-99.
- Gahl, S. (in press) "Time" and "thyme" are not homophones: The effect of lemma frequency on word durations in a corpus of spontaneous speech. To appear in Language.
- Pitt, M. et al. (2005) The Buckeye Corpus of conversational speech: labeling conventions and a test of transcriber reliability. Speech Communication, Vol. 45, pp. 90-95.
- Raymond et al. (2006) Word-internal /t,d/ deletion in spontaneous speech: Modeling the effects of extra-linguistic, lexical, and phonological factors.
- Yao, Y. (2007) Closure duration and VOT of word-initial voiceless plosives in English in spontaneous connected speech. UC Berkeley PhonLab report.