Parsing acoustic variability as a mechanism for feature abstraction
Jennifer Cole, Bob McMurray, Gary Linebaugh, Cheyenne Munson
University of Illinois / University of Iowa

Phonetic precursors to phonological sound patterns
Many phonological sound patterns are claimed to have precursors in systematic phonetic variation that arises due to coarticulation.
Assimilation
– Vowel harmony from V-to-V coarticulation (Ohala 1994; Beddor et al. 2001)
– Palatalization from V-to-C coarticulation (Ohala 1994)
– Nasal place assimilation (-mb, -nd, -ŋg) from C-to-C coarticulation (Browman & Goldstein 1991)
Epenthesis
– Epenthetic stops from C-C coarticulation: sen[t]se (Ohala 1998)
Deletion
– Consonant cluster simplification via deletion from C-C coarticulation: perfec(t) memory (Browman & Goldstein 1991)

The role of the listener
Phonologization: acoustic properties that arise due to coarticulation are interpreted by the listener as primary phonological properties of the target sound. This is a generalization over variable acoustic input that results in a new constraint on sound patterning.

The role of the listener
From V-to-V coarticulation…
[vowel space diagram: /ɛ/, /ʌ/, /ɑ/, /i/]

The role of the listener
From V-to-V coarticulation, a target /ɛ/ surfaces as raised/fronted [ɛⁱ] in […ɛⁱ…i…] and as lowered/backed [ɛᵅ] in […ɛᵅ…ɑ…].
[vowel space diagram: /ɛ/, /ʌ/, /ɑ/, /i/]

The role of the listener
Perception may yield vowel assimilation: the raised [ɛⁱ] in […ɛⁱ…i…] is reinterpreted toward /i/, and the backed [ɛᵅ] in […ɛᵅ…ɑ…] toward /ɑ/.
[vowel space diagram: /ɛ/, /ʌ/, /ɑ/, /i/]

The role of the listener
But distinct factors can produce similar variants: a raised [ɛⁱ] can arise before a following […i…] or before a velar nasal, as in […ɛŋ…].
[vowel space diagram: /ɛ/, /ʌ/, /ɑ/, /i/]

From perception to phonology
What is the mechanism for mapping from continuous perceptual features to phonological categories?
[ɛⁱ]: mid and high; central and front-peripheral
[ɛᵅ]: mid and low; central and back
The problem: the perceptual system is confronted with uncertainty due to variation arising from multiple sources. Yet patterns of variation must get associated with individual features of the context vowel (e.g., high, front) if coarticulation serves as a precursor to phonological assimilation. How do lawful, categorical patterns emerge from ambiguous, variable input? …the lack of invariance problem!

Our claims
What is the mechanism for mapping from continuous perceptual features to phonological categories? Our claims:
– Variability is retained.
– Acoustic variability is parsed into components related to the target segment and the local context.
– Feature abstraction through parsing: acoustic parsing provides a mechanism for the emergence of phonological features from patterned variation in fine phonetic detail.

Variability is retained
Listeners are sensitive to fine-grained acoustic variation (Goldinger 2000; Hay 2000; Pierrehumbert 2003) → variability is retained, not discarded. Consistent with exemplar models of the lexicon, phonetic detail is encoded and stored, and can inform subsequent categorization of new sound tokens.

Variability is retained
Variability due to coarticulation is subtracted to identify the “underlying” target sound (Fowler 1984; Beddor et al. 2001, 2002; Gow 2003).
Variability is useful for the identification of sounds in contexts of coarticulation: the perceptual system uses information about variability to identify a sound and its context, in parallel.
Variability due to coarticulation is exploited to facilitate perception: listeners benefit from the presence of anticipatory coarticulation in predicting the identity of the upcoming sound (Martin & Bunnell 1982; Fowler 1981, 1984; Gow 2001, 2003; Munson, this conference).

Variability and perceptual facilitation
Perceptual facilitation from V-to-V coarticulation is expected to occur only if:
– The effects of coarticulation are systematic: an influencing vowel conditions a consistent acoustic effect on target vowels;
– The listener can recognize coarticulatory effects on the target vowel;
– The listener can isolate the effects of the context vowel from other sources of variation, and attribute those effects to the context vowel.

Feature abstraction through parsing
More specifically, under coarticulation of vowel height and backness:
– The listener must parse out the portion of the variance in F1 and F2 that is due to coarticulation, and base their perception of the target vowel on the residual values.
– Acoustic parsing isolates the effects of the context vowel on F1 and F2.

Feature abstraction through parsing
The parsed acoustic variance defines features of the context vowel, over which new generalizations can be formed → phonologization
[ɛ] + [i] → [ɛ] + [high]   (the coarticulated token: [ɛⁱ])

Feature abstraction through parsing
The parsed acoustic variance defines features of the context vowel, over which new generalizations can be formed → phonologization
[ɛ] + [i] → [ɛ] + [high], with [ɛⁱ] phonologized to [i]

Feature abstraction through parsing
The parsed acoustic variance defines features of the context vowel, over which new generalizations can be formed → phonologization
Question: why phonologization? If target and context vowels can both be identified from the fine phonetic detail, what's the force driving phonologization?

Testing the model
The acoustic parsing model of speech perception requires a robust and systematic pattern of acoustic variation from V-to-V coarticulation. This paper presents supporting evidence from an acoustic study of coarticulation. We examine a range of V-to-V coarticulatory effects in VCV contexts that cross a word boundary, where coarticulation cannot be attributed to lexicalized phonetic patterns.

Key Questions
Extent of phenomenon
– Does V-to-V coarticulation cross word boundaries?
– Does V-to-V coarticulation affect both F1 and F2?
– What is the relative strength of V-to-V effects vs. other forms of coarticulation?
Usefulness of phenomenon
– How could V-to-V effects translate to perceptual inferences?
– Is the information carried by V-to-V coarticulation different when other sources of variation are explained?

Methods
Target vowels (where coarticulation is measured): /ɛ/, /ʌ/
Context vowels (inducing coarticulation): /i/, /æ/, /ɑ/
– /u/ excluded from contexts (rounded + fronted)
Intervening consonant varied in:
– place (labial, coronal, velar)
– voicing
– /ɛg/ excluded (tends to be raised)

Methods
Stimulus pairs: each target word was followed by a context word whose initial vowel was /æ/, /i/, the same vowel as the target, or /ɑ/:
bed: actor, eagle, evergreen, ostrich
tech: afternoon, evening, elevator, oxygen
web: addict, ecologist, educator, offer
wet: Afro, Easter Bunny, Eskimo, oxen
deck: alligator, easter basket, elephant, octopus
step: Admiral, east, exit, obstacle
mud: apple, eater, umpire, observation
bug: astronaut, evil, underwear, optician
pub: advertisement, easel, undergrad, operator
cut: abdomen, evenly, onion, olive
duck: athlete, eating, usher, officer
cup: appetizer, eavesdropping, oven, occupant

Methods
10 University of Illinois students. 48 phrases × 3 repetitions. Phrases embedded in neutral carrier sentences:
/ɛ/: He said ‘_______’ all the time
/ʌ/: I love ‘_______’ as a title
Coding: F1, F2, F3 measured by LPC (Burg method) and converted to Bark for analysis (see the sketch below). Outliers / misproductions inspected by hand.
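The slides report converting formant values to Bark but do not name the formula. A minimal sketch in Python, assuming Traunmüller's (1990) approximation, one common choice (the original analysis may have used a different one):

    def hz_to_bark(f_hz):
        # Traunmueller (1990) Hz-to-Bark approximation; an assumption,
        # since the slides do not specify the formula actually used.
        return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

    print(hz_to_bark(1800.0))  # an F2 of 1800 Hz is about 12.3 Bark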

Analysis
Target × Voicing × Context:
              F1       F2
Voicing       p=.033   p=.001
Target        p=.005   p=.001
Context       p=.001   p=.001
Interactions  n.s.     n.s.

Target × Place × Context:
              F1       F2
Place         n.s.     p=.001
Target        p=.01    p=.001
Context       p=.001   p=.001
Interactions  some     some

V-to-V coarticulation crosses word boundaries. Clear effects of coarticulatory context on both F1 and F2.

Analysis
[scatter plot of raw formant data, male vs. female speakers]
A lot of unexplained variance… How does the perceptual system “get to” the V-to-V coarticulation? How useful is V-to-V coarticulation? Does accounting for other sources of variance in the signal improve the usefulness of V-to-V?

Strategy
Need to systematically account for sources of variance prior to evaluating V-to-V coarticulation.
[F2 axis: the /ɛ/ and /ʌ/ category means at 1431 Hz and 1801 Hz, with an ambiguous token ‘?’ between them]
Is the token an ɑ-coarticulated /ɛ/, or an i-coarticulated /ʌ/?

Strategy
Need to systematically account for sources of variance prior to evaluating V-to-V coarticulation.
[F2 axis: the same ambiguous token ‘?’ relative to the /ɛ/ and /ʌ/ means]
A slightly i-coarticulated /ɛ/? Or a really i-coarticulated /ʌ/?

Strategy
Need to systematically account for sources of variance prior to evaluating V-to-V coarticulation.
If you knew the category: if /ʌ/, then expect i-like coarticulation; if /ɛ/, then expect ɑ-like coarticulation.
F2(?) − F2(category mean) = coarticulation direction:
? − /ʌ/: positive (more i-like)
? − /ɛ/: negative (more ɑ-like)

Strategy
F2(token) − F2(category mean) = coarticulation direction.
Strategy:
1) Compute the mean of a source of variance.
2) Subtract that mean from F1/F2.
3) The residual is the coarticulation direction.
4) Repeat for each source of variance (speaker, target vowel, place, voicing).
(A sketch of this loop follows below.)
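As a minimal sketch of steps 1–4, assuming a pandas DataFrame of the coded productions; the column names (speaker, target, place, voicing, F2_bark) are illustrative, not from the original study:

    import pandas as pd

    def parse_variance(df, formant, sources):
        # Steps 1-3, repeated for each source of variance (step 4): subtract
        # the mean of the current residual within each level of the source.
        resid = df[formant].astype(float)
        for src in sources:
            resid = resid - resid.groupby(df[src]).transform("mean")
        return resid

    # df["F2_resid"] = parse_variance(df, "F2_bark",
    #                                 ["speaker", "target", "place", "voicing"])

Subtracting group means this way is the simple, no-interaction version of the hierarchical regression described next.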

Strategy
Hierarchical regression can do exactly these things.
1) Compute the mean of a source of variance:
F1_predicted = β1 · target + β0
If target = 0 for /ʌ/ and 1 for /ɛ/:
/ʌ/: F1_predicted = β1 · 0 + β0, so mean /ʌ/ = β0
/ɛ/: F1_predicted = β1 · 1 + β0, so mean /ɛ/ = β0 + β1

Strategy
Hierarchical regression can do exactly these things.
1) Compute the mean of a source of variance.
2) Subtract that mean from F1/F2.
3) The residual is the coarticulation direction:
Residual = F1_actual − F1_predicted = F1_actual − (β1 · target + β0)
/ʌ/: Resid_target = F1_actual − β0
/ɛ/: Resid_target = F1_actual − (β0 + β1)

Strategy
Hierarchical regression can do exactly these things.
1) Compute the mean of a source of variance.
2) Subtract that mean from F1/F2.
3) The residual is the coarticulation direction.
4) Repeat for each source of variance (speaker, target vowel, place, voicing):
F1 = β1 · Target + β0
Resid_target = β2 · Place + β0
Resid_place = β3 · Voicing + β0
Resid_voicing = β4 · V-to-V + β0

Strategy
Construct a hierarchical regression to systematically account for known sources of variance in F1 and F2:
– Speaker
– Target vowel
– Place (intervening C)
– Voicing (intervening C)
– Interactions between target, place & voicing
After partialing out these factors, how much variance does vowel context (V-to-V) account for? (A sketch of the stepwise computation follows below.)
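A minimal sketch of the stepwise R²-change computation, assuming statsmodels and categorical (dummy-codable) columns with the same illustrative names as above:

    import pandas as pd
    import statsmodels.api as sm

    def hierarchical_r2(df, dv, blocks):
        # Fit nested OLS models, adding one block of dummy-coded predictors
        # per step, and record the R-squared change at each step.
        steps, cols, prev_r2 = [], [], 0.0
        for name, block in blocks:
            cols = cols + block
            X = sm.add_constant(pd.get_dummies(df[cols], drop_first=True).astype(float))
            r2 = sm.OLS(df[dv].astype(float), X).fit().rsquared
            steps.append((name, r2 - prev_r2))
            prev_r2 = r2
        return steps

    # blocks = [("Subjects", ["speaker"]), ("Vowel", ["target"]),
    #           ("Voicing", ["voicing"]), ("Place", ["place"]),
    #           ("Context vowel", ["context_vowel"])]
    # for name, dr2 in hierarchical_r2(df, "F1_bark", blocks):
    #     print(name, round(dr2, 3))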

Regression F2
[figure build: 1) raw F2 data (male vs. female), then residuals after successively partialing out 2) subject, 3) target vowel (/ʌ/ vs. /ɛ/), 4) consonant, and 5) interactions]

Regression F1
Step  Variables                          R² change  p
1     Subjects (10)                      .824       ***
2     Vowel                              .009       ***
3     Voicing                            .018       ***
4     Place (2)                          .003       **
5     Vowel × Voicing                    –          –
6     Vowel × Place (2)                  .002       *
7     Voicing × Place (2)                .012       ***
8     Context vowel (3)                  .012       ***
9     Context vowel interactions (12)    .003       –
Total R² = .884
Post-hoc analysis: height only.

Regression F2
Step  Variables                          R² change  p
1     Subjects (10)                      .409       ***
2     Vowel                              .412       ***
3     Voicing                            .034       ***
4     Place (2)                          .050       ***
5     Vowel × Voicing                    .008       ***
6     Vowel × Place (2)                  .015       ***
7     Voicing × Place (2)                .004       ***
8     Context vowel (3)                  .008       ***
9     Context vowel interactions (12)    .001       –
Total R² = .940
Post-hoc analysis: height + backness.

Regression Summary
Progressively accounting for variance is powerful, using only known sources of variance:
– F1: 88% of variance explained
– F2: 94% of variance explained
V-to-V coarticulation is readily apparent when other sources of variance are explained. How useful would this be? The effect of V-to-V coarticulation is similar in size to the place/voicing effects.

Predicting Vowel Identity
Multinomial Logistic Regression (MLR)
– Classification algorithm
– Predicts category membership from multiple variables
– Categories do not have to be binary
[diagram: F1/F2 residuals → context-vowel categories i, ɑ, æ, Same]

Predicting Vowel Identity
Multinomial Logistic Regression (MLR)
– Classification algorithm
– Predicts category membership from multiple variables
– Categories do not have to be binary
Assumes an optimal listener. Computes % correct: how well a listener could do under ideal circumstances with the information provided. (A sketch follows below.)
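A minimal sketch of such a classifier, assuming scikit-learn and the illustrative residual columns from the parsing sketches above (not the authors' actual code):

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    # Predict the context-vowel category (i, ɑ, æ, or same-as-target)
    # from the parsed F1/F2 residuals; df is the coded data from above.
    X = df[["F1_resid", "F2_resid"]].values
    y = df["context_vowel"].values

    clf = LogisticRegression(max_iter=1000)  # multinomial for >2 classes
    print(cross_val_score(clf, X, y, cv=5).mean())  # proportion correct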

Predicting Vowel Identity
[bar chart: % correct by context vowel (i, ɑ, æ, Same), after partialing out subject, vowel, place, voicing, and interactions]
The model does quite well at predicting all context vowels except the identity (same-vowel) condition.

Predicting Vowel Identity
[two plots in F1 (z) × F2 (z) space: /ʌ/ tokens by context (ʌ-i, ʌ-æ, ʌ-ɑ) and /ɛ/ tokens by context (ɛ-i, ɛ-æ, ɛ-ɑ), with the context vowels i, æ, ɑ marked]

Predicting Vowel Identity
Does partialing out other sources of variance improve the utility of V-to-V coarticulation?
– Use linear regression to partial out variance.
– Use the F1, F2 residuals to predict vowels.
FULL: partial out everything.
RAW: no parsing.
SPEAKER: partial out speaker variation only. Assumes speaker normalization, but no interactions of consonant or vowel with V-to-V.
VOWEL: partial out effects of everything heard at the target vowel (speaker + target).
NO-SPKR: assumes no normalization, but interactions between consonants.
(A sketch comparing these schemes follows below.)
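A minimal sketch of the comparison, reusing the hypothetical parse_variance helper from above; the scheme-to-predictor mappings are my reading of the slide, not the authors' specification:

    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    schemes = {
        "RAW":     [],                                       # no parsing
        "SPEAKER": ["speaker"],                              # normalization only
        "VOWEL":   ["speaker", "target"],                    # everything at the target
        "NO-SPKR": ["target", "place", "voicing"],           # no normalization
        "FULL":    ["speaker", "target", "place", "voicing"],
    }
    for name, sources in schemes.items():
        for f in ("F1_bark", "F2_bark"):
            df[f + "_r"] = parse_variance(df, f, sources) if sources else df[f]
        acc = cross_val_score(LogisticRegression(max_iter=1000),
                              df[["F1_bark_r", "F2_bark_r"]],
                              df["context_vowel"], cv=5).mean()
        print(name, round(acc, 3))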

Predicting Vowel Identity
[bar chart: % correct by parsing scheme (FULL, VOWEL, SPEAKER, NO-SPKR, RAW)]
FULL: about 4% better than the others.
VOWEL: parsing out the consonant may not be necessary.
SPEAKER: effects of speaker and phonetic cues are similar.
RAW: V-to-V is not useful without some parsing.

Predicting Vowel Identity
Suggests a 3-stage parsing process to maximally use V-to-V modifications:
1) Parse out speaker effects on the target.
2) Regressively compensate for consonant coarticulation.
3) Use the residuals to predict the context vowel.
[diagram: preceding context → target vowel → consonant → context vowel]
(A sketch of the pipeline follows below.)
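Putting the stages together, a minimal end-to-end sketch under the same assumptions (pandas + scikit-learn, illustrative column names):

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    def three_stage_parse(df):
        # 1) Parse out speaker effects on the target vowel's formants.
        for f in ("F1_bark", "F2_bark"):
            df[f + "_r"] = df[f] - df.groupby("speaker")[f].transform("mean")
        # 2) Regressively compensate for target-vowel and consonant
        #    (place/voicing) effects on the residuals.
        for f in ("F1_bark_r", "F2_bark_r"):
            df[f] = df[f] - df.groupby(["target", "place", "voicing"])[f].transform("mean")
        # 3) Use what remains to predict the upcoming context vowel.
        clf = LogisticRegression(max_iter=1000)
        return cross_val_score(clf, df[["F1_bark_r", "F2_bark_r"]],
                               df["context_vowel"], cv=5).mean()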

Key Questions
Extent of phenomenon
– Word boundaries?
– Both F1 and F2?
– Relative strength of V-to-V effects?
Usefulness of phenomenon
– Perceptual inferences?
– Parsing out variability?

Summary: Extent
Clear evidence for V-to-V coarticulation across word boundaries; the effect is not lexicalized.
V-to-V effects appear in both formants (height + backness), with strength similar to that of place and voicing.
Known sources of variance (speaker, vowel, consonant, V-to-V) can account for most of the variability in vowel production.
– Problem of lack of invariance?
– Identifying multiple categories at once may be easier than identifying one.

Summary: Usefulness
An idealized listener (+ parsing) could identify the upcoming vowel at 40% correct given only V-to-V coarticulation, and near 50% correct for /i/ and /ɑ/.
Parsing dramatically improves the predictive power of V-to-V coarticulation.
Do you need perfect categorization of variance sources (e.g., speaker, target vowel, voicing)?
– Imperfect categorization enhances the need for multiple cues.
– Simultaneously evaluating multiple features (e.g., V1, C, V2) yields the correct parse.
How do you determine the order of parsing?
– By the temporal order of information arrival?

Future Directions
How do you identify the components to be parsed? See the Toscano poster.
Does the model actually describe perception? Parsing is a temporal process; the visual world paradigm can probe the time-course of processing (e.g., McMurray, Clayards, Tanenhaus, in prep; McMurray, Tanenhaus & Aslin, 2002; McMurray, Munson & Gow, submitted).
Parsing as part of word recognition: lexical structure can contribute to inferences. Interactive activation models (McClelland & Elman, 1986) could implement this.

Conclusions
Where do features come from? They emerge out of progressively accounting for sources of variance in the signal. Any “chunk” (segment) of the input can provide multiple features. Speaker normalization may work by the same process.
Why phonologize? It eliminates one step of parsing.
How does the system balance the need for features with the utility of fine-grained detail? Features provide a tag to parse out variance while utilizing continuous detail.