When do which sounds tell you who says what? A phonetic investigation of the familiar talker advantage in word recognition. University of Calgary Linguistics Brown Bag presentation May 25, 2011 Steve Winters
What’s the Big Idea? The “Familiar Talker Advantage” = Speech is more intelligible when produced by familiar talkers than by unfamiliar talkers. First elaborated by Nygaard et al. (1994) and Nygaard and Pisoni (1998): 1. Trained listeners to identify 10 voices (5 male, 5 female) over 9 days. 2. Tested trained listeners’ ability to identify words produced by: trained “familiar” voices; novel “unfamiliar” voices. Word recognition scores: familiar > unfamiliar.
Conditions Nygaard and Pisoni (1998) recognized that there were conditions on the emergence of the Familiar Talker Advantage: Only exhibited by listeners who had performed well (> 70%) on the talker identification training task. Nonetheless, similar interactions between talker identity and speech perception have been observed in: infants, who prefer to listen to their mother’s voice (DeCasper & Fifer, 1980); sinewave speech, which supports word recognition and talker identification (Remez, Fellowes & Rubin, 1997).
Why Do We Care? The Nygaard et al. studies emphasized the intersection of indexical and linguistic information in the signal. Abercrombie (1967): The linguistic properties of speech support the identification of linguistic (phonemic, etc.) contrasts. The indexical properties of speech support the identification of “extralinguistic” aspects of the speaker: Physical characteristics, dialect/group membership, gender, emotional/mental state. An old idea: speech perception must filter out indexical properties to extract the linguistic message.
Normalization A speech signal stripped of its indexical properties is highly abstract. = “Normalized” “...when we learn a new word we practically never remember most of the salient acoustic properties that must have been present in the signal that struck our ears; for example, we do not remember the voice quality, speed of utterance, and other properties directly linked to the unique circumstances surrounding every utterance.” -- Morris Halle (1985)
Unfiltered In contrast, exemplar theories of speech perception (Johnson, 2007) emphasize the utility of not breaking down the signal into separate components. Conjecture: listeners store unanalyzed representations of speech that are “rich” with informative detail: linguistic representations might include indexical (talker-specific) information; and indexical representations include linguistic information. Generalizations emerge on the fly, from summed activations of similar exemplars. The “Familiar Talker Advantage” effect seemingly supports this view.
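A minimal sketch of what "summed activations of similar exemplars" could look like computationally. It assumes an exponentially decaying similarity over Euclidean distance, which is a common choice in exemplar models but is not specified in the slides. The feature vectors, the sensitivity parameter `c`, and all names are hypothetical.

```python
import math
from collections import defaultdict

def similarity(a, b, c=1.0):
    """Similarity decays exponentially with Euclidean distance (assumed form)."""
    return math.exp(-c * math.dist(a, b))

def classify(probe, exemplars, c=1.0):
    """Sum the probe's similarity to every stored exemplar, per label,
    and return the label with the largest summed activation."""
    activation = defaultdict(float)
    for features, label in exemplars:
        activation[label] += similarity(probe, features, c)
    best = max(activation, key=activation.get)
    return best, dict(activation)

# Hypothetical stored traces: two acoustic dimensions in arbitrary units.
exemplars = [
    ((1.0, 1.0), "word_a"),
    ((1.2, 0.9), "word_a"),
    ((3.0, 3.1), "word_b"),
    ((2.9, 3.3), "word_b"),
]

label, activation = classify((1.1, 1.0), exemplars)
print(label, activation)
```

No category prototype is stored anywhere: the category decision falls out of the stored traces at the moment of the query. In this kind of model, a talker-specific dimension could simply be added to each trace.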
Adaptation Contemporary versions of the “normalization” theory emphasize the active role of the listener in speech processing… Rather than the (abstract) content of representations. “The lack of invariance in the mapping of acoustic patterns onto phonetic categories is computationally non-deterministic…the nondeterministic mapping must be solved by mechanisms incorporating active control structures.” (Magnuson & Nusbaum, 2007) Q: Is speech perception even possible with the unanalytical approach of exemplar theory? Roughly: rules vs. representations
Which One? Experiment 1 attempts to adjudicate between these competing theories by exploiting a known asymmetry in the processing of indexical information in speech. Winters et al. (2008): tested identification of bilingual talkers across languages. 1. English-speaking listeners trained to identify: English-speaking bilinguals; German-speaking bilinguals. 2. Tested on same talkers speaking other language: English → German: loss of ID accuracy; German → English: no loss in ID accuracy.
Implications Winters et al. (2008) concluded: Talker representations from a known language are language-specific; talker representations from an unknown language are language-independent. Exemplar-style representations of voices (= integrated linguistic and indexical information) only emerged within a known language. Q: The Familiar Talker Advantage emerges (for good listeners) within a known language; will it emerge for the same talkers across languages, as well?
Predictions Known: listeners show complete generalization of talker knowledge from German to English. These listeners identify talkers based on language-independent information in speech. Exemplar-based prediction: Learning to identify talkers in German will not facilitate word recognition in English. (Listeners do not develop integrated representations.) Normalization-based prediction: Listeners filter the same talker properties in both languages, so the Familiar Talker Advantage should transfer across languages.
Experiment 1 Q: Does knowledge of a talker in one language facilitate linguistic processing of that talker in another? Training task: talker identification English-speaking listeners (monolingual) Bilingual talkers, speaking in either English or German. Testing task: English word recognition in noise Three talker groups: Familiar bilinguals Unfamiliar bilinguals Native English talkers
Experiment 1: Materials 10 L1 German / L2 English talkers All female These talkers produced 360 CVC English words (e.g., buzz, cheek) 360 CVC German words (e.g., hoch, Rahm) 5 talkers were designated as “Group 1”; The other 5 were “Group 2” Groups were balanced for intelligibility Also: 5 female monolingual English talkers produced the same set of 360 CVC English words
Experiment 1: Training 3 days of training 2 sessions per day (~30 min each) Each session involved: Familiarization: same 5 words from each talker Re-familiarization: same word from each talker Recognition: 5 words/talker, heard twice with feedback Testing: 10 words/speaker no feedback Half trained in German; half trained in English x2
Experiment 1: Word Recognition Trained listeners identified 24 words each from 15 different talkers: Group 1: 5 unfamiliar English talkers Group 2: 5 familiar German-English bilinguals Group 3: 5 unfamiliar German-English bilinguals Words were presented in four levels of white noise: Clear, +10 dB SNR, +5 dB SNR, 0 dB SNR Responses scored in terms of words, phonemes, features correct…
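A rough sketch of how a stimulus might be mixed with white noise at a target signal-to-noise ratio. The slides do not say how the noise levels were set, so the RMS-based definition of SNR used here is an assumption. The stand-in "word" (a short tone) and the function names are hypothetical.

```python
import math
import random

def rms(samples):
    """Root-mean-square amplitude of a list of samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def mix_at_snr(signal, snr_db, rng):
    """Add Gaussian white noise scaled so that
    20 * log10(rms(signal) / rms(noise)) equals snr_db."""
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    target_noise_rms = rms(signal) / (10 ** (snr_db / 20))
    gain = target_noise_rms / rms(noise)
    scaled = [n * gain for n in noise]
    mixed = [s + n for s, n in zip(signal, scaled)]
    return mixed, scaled

rng = random.Random(0)
# Stand-in for a recorded word: 100 ms of a 200 Hz tone at 16 kHz (hypothetical).
signal = [math.sin(2 * math.pi * 200 * t / 16000) for t in range(1600)]

mixed, noise = mix_at_snr(signal, 5.0, rng)
achieved = 20 * math.log10(rms(signal) / rms(noise))
print(round(achieved, 3))
```

Under this definition, 0 dB SNR means the noise has the same RMS level as the speech, and the "Clear" condition corresponds to skipping the mixing step altogether.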
Experiment 1: Training “Good” learners reached 70% accuracy at some point in training No difference in learning rate between language groups (again)
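One way to read the "70% at some point in training" criterion is sketched below. The per-session accuracy values and listener IDs are invented for illustration, and the criterion is applied to session-level proportion correct as an assumption.

```python
def is_good_learner(session_accuracies, criterion=0.70):
    """True if accuracy reached the criterion in at least one training session."""
    return any(acc >= criterion for acc in session_accuracies)

# Hypothetical proportion-correct scores over six training sessions.
listeners = {
    "L01": [0.35, 0.48, 0.55, 0.62, 0.71, 0.68],
    "L02": [0.30, 0.41, 0.44, 0.52, 0.58, 0.61],
}

groups = {
    name: ("good" if is_good_learner(scores) else "poor")
    for name, scores in listeners.items()
}
print(groups)
```

Note that under this reading a listener counts as "good" even if accuracy later drops back below the criterion, as L01 does in the final session.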
Results: Phoneme Recognition Familiar Talker Advantage for Good English listeners only (p = .008). No effect for German listeners (good or bad). Modest trend in poor learners towards an Unfamiliar Talker Advantage!
Discussion (Good) English-trained listeners exhibited better word recognition scores for familiar talkers. (Good) German-trained listeners did not. Familiar Talker Advantage was not supported by language-independent talker representations. Familiar talker effect is based on rich, talker-specific linguistic representations… rather than a filtering of “extra-linguistic” talker information. Caveat: some listeners develop these representations better than others.
Patterns 1.English-trained listeners displayed: Interactions between linguistic and talker categories in both experiments. 2.German-trained listeners: No interactions between linguistic and talker categories in either experiment. Implications: English-trained listeners develop richly detailed, exemplar-like representations of voices. German-trained listeners develop sparser, language-independent representations of voices.
Life’s Persistent Questions, Part 2 When learning to identify voices within a known language, listeners develop rich, exemplar-style representations. Will the Familiar Talker Advantage (within a known language) be affected by a change in the phonetic quality of the familiar voices? Note: listeners attend closely to phonetic cues that are consistently associated with a talker’s voice. E.g., for the German bilingual voices, F0 patterns were an (unintentionally) consistent cue.
Talker Identity Cues Winters (submitted) trained listeners to identify Thai-speaking voices consistently associated with particular phonetic cues: 1. Lexical tones 2. VOT (voiced, unaspirated, aspirated) 3. Vowel categories (front, central, back) Trained listeners were then tested on stimuli without these talker-cue associations. Associated cue salience hierarchy: Tones > VOT, Vowel
However… Voice quality seemed to be an even more distinctive cue to talker identity.
Background: Voice Quality Note that there are three primary types of vocal fold vibration: 1. modal: vocal folds lightly adducted; flow of air causes periodic opening and closing of folds (“trilling”). 2. breathy: vocal folds slightly apart; flow of air makes folds “wave” in the wind. 3. creaky: vocal folds tensely adducted; low airflow causes irregular, low frequency voicing.
Experiment 2 Two groups of English Listeners were trained to identify English-speaking talkers: 1.Each talker produced stimuli only in a particular voice quality (Quality-dependent training) 2.Each talker produced stimuli in a variety of voice qualities (Quality-neutral training) After training, listeners completed a generalization task: 1.Talkers only produced (novel) words in voice qualities not presented in training 2.Talkers only produced novel words not presented in training--still in a variety of voice qualities
Experiment 2 Listeners also completed a word recognition task: Words produced by familiar and unfamiliar talkers 1.Familiar-talker words in both trained and untrained voice qualities 2.Familiar-talker words in a variety of voice qualities. Exemplar-based Predictions: Voice quality will form part of the representation of the talkers’ voice Familiar talkers will be more intelligible than unfamiliar talkers Familiar voice qualities will be more intelligible than unfamiliar voice qualities (for the same talker)
Experiment 2: Materials Phonetically trained talkers produced the same list of 360 English CVC words in three different voice qualities: modal, breathy, creaky. In all, I recorded 6 female talkers and 6 male talkers. Only the female talkers were presented in the experiment. Two recording sessions, lasting about an hour each; talkers were paid $40 for their time and effort.
Experiment 2: Materials The “unfamiliar” talkers consisted of six female talkers recorded for the database used in Experiment #1. Note: these talkers were from a different dialect region of the United States (Indiana). Also note: the breathy and creaky tokens tended to be longer in duration. (Longer duration reflects less fluency with the articulation.) 443 ms, 562 ms, 654 ms
Experiment 2: Methods Training methods were identical to those used in Experiment 1. Listeners learned to identify six different (female) voices over the course of three days Two training sessions on each day 1.Quality-dependent: each talker only produced words in a particular voice quality 2 modal talkers, 2 creaky talkers, 2 breathy talkers 2.Quality-neutral: all talkers produced words in a variety of voice qualities The relationship between talker and voice quality was randomized in each group.
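The quality-dependent assignment (two talkers per voice quality, with the talker-quality pairing randomized) could be generated along these lines. This is only a sketch: the talker labels and the function name are hypothetical, and the slides do not describe the actual randomization procedure.

```python
import random

QUALITIES = ["modal", "creaky", "breathy"]

def assign_quality_dependent(talkers, rng):
    """Randomly pair talkers with voice qualities so that each quality
    is used by the same number of talkers (two each for six talkers)."""
    slots = QUALITIES * (len(talkers) // len(QUALITIES))
    rng.shuffle(slots)
    return dict(zip(talkers, slots))

talkers = [f"T{i}" for i in range(1, 7)]  # hypothetical talker labels
assignment = assign_quality_dependent(talkers, random.Random(1))
print(assignment)
```

Shuffling a balanced list of slots, rather than drawing a quality independently for each talker, guarantees the 2/2/2 split on every run.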
Experiment 2: Participants 16 participants in each group Listeners were recruited from introductory linguistics classes (so they had some, but not a lot, of phonetics knowledge) On the fourth day of the experiment, listeners completed two tasks: 1.Generalization 2.Word Recognition Order of tasks was counterbalanced across listeners Listeners were paid $60 for their time and trouble.
Experiment 2: Generalization Task: talker identification Quality-dependent listeners: All talkers produced words in the two voice qualities that they did not produce in training (5 in each) Quality-neutral listeners: The relationship between talker and voice quality was still random (10 words/talker) Both groups identified talkers from words that were not presented in training.
Experiment 2: Word Recognition Listeners identified words, presented in pink noise (0 dB SNR), as produced by two sets of talkers: 1.Familiar (6 voices; 12 words each) 2.Unfamiliar (6 voices; 12 words each) For the quality-dependent listeners, words were evenly split between the voice quality associated with each talker in training (6) and the two voice qualities the talker did not produce in training (3 each). For example: Analysis: responses were scored in terms of words correct and phonemes correct (onset, nucleus, coda)
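Scoring a CVC response by word and by phoneme position (onset, nucleus, coda) can be sketched as follows. This assumes responses have already been transcribed into three phoneme symbols. The ARPAbet-like symbols and the function name are illustrative, not taken from the study.

```python
POSITIONS = ("onset", "nucleus", "coda")

def score_cvc(target, response):
    """Compare a three-phoneme response to its target, position by position."""
    by_position = {
        pos: t == r for pos, t, r in zip(POSITIONS, target, response)
    }
    return {
        "word_correct": all(by_position.values()),
        "phonemes_correct": sum(by_position.values()),
        **by_position,
    }

# Hypothetical trial: target "buzz", heard as "bus".
result = score_cvc(("B", "AH", "Z"), ("B", "AH", "S"))
print(result)
```

Here the word counts as incorrect, but two of its three phonemes are credited, which is why phoneme-level scoring can pick up effects that word-level scoring misses.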
Results: Training QN listeners learned consistently over the six training sessions QD listeners effectively performed at ceiling, right from the start.
Results: Generalization No change for the QN group Catastrophic collapse for the QD group Voice quality was a highly salient cue for talker identity (and one that listeners relied on heavily in training)
Results: Word Recognition Strong effect of voice quality: modal > creaky > breathy Familiar (modal) voices more intelligible than unfamiliar voices No effect of training condition, however. Also: no poor vs. good learner problem
Results: Word Recognition Tendency for word recognition accuracy to be higher for trained voice qualities …but it was nowhere near significant. Familiar Talker Advantage does not depend on salient cues to talker identity. (?!)
Discussion: What? Combined results: what listeners are attending to most closely in the talker identification task is not useful to the word recognition task. …and yet the Familiar Talker Advantage emerges anyway. Basic idea: the Familiar Talker Advantage is supported by that which is meaningful in the signal (i.e., that which supports word recognition) to the listeners Crucial: hearing how a talker produces particular sequences of segments. Non-contrastive phonetic details may contribute to talker identification--and may even affect word recognition--but do not support talker-based word recognition.
Discussion: Huh? Perhaps Abercrombie (1967) was right: voice quality is “extralinguistic” in a language like English. = noise in the linguistic signal Note: listening to “meaningless” words in another language also does not induce the Familiar Talker Advantage. Two possible interpretations: Exemplar-based representations may depend on the meaningfulness of particular phonetic details. Word recognition may be an automatic process, whereas talker identification is not.
Where to Next? Q: Did the same Familiar Talker Advantage emerge in this study? Dialect issues. Make sure that both groups of talkers are equivalent in intelligibility Computational modeling of unanalyzed vs. analyzed (source-filter) spectral similarity matching. STRAIGHT Also: How do listeners learn from “pre-filtered” stimuli? Alison Harding (2011): F0 vs. segmental contributions to tone perception.
Results: Word Recognition Go over the general voice quality results first: Modal > Creaky > Breathy. Why? I guess because there is an inherently higher noise-to-signal ratio in breathy voice. And creaky voice? Not entirely sure about an explanation, other than that it’s more unusual than modal voice.
Results: Word Recognition Training interactions: there were none. Show the training interaction graph regardless.
Results: Word Recognition Word recognition scores for familiar vs. unfamiliar voices.
Future Directions Test raw word recognition scores to make sure that both groups of talkers are equivalent in intelligibility (for modal voice) Experiment idea: determine whether it’s more difficult to distinguish male from female voices in creaky and/or breathy voice. Computational modeling of unanalyzed vs. analyzed (source-filter) spectral similarity matching. Oh also: maybe mention Alison’s thesis “Analysis” of F0 vs. segmental contributions to tone perception.
Results: Generalization No change for the QN group; catastrophic collapse for the QD group Voice quality was a highly salient cue for talker identity (and one that listeners relied on heavily) Note the two or three QD listeners who didn’t bomb out completely in generalization. One (Hamish) told me that he noticed over time that there were other differences between speakers than just voice quality.
The Familiar Talker Advantage Describe the Nygaard et al. series of studies. Also mention the stuff that Suzanne has found on babies’ tendency to demonstrate the same ability. Other stuff to think about: The same finding in sinewave speech The Remez business about looking for the phonetic locus of the facilitation.
An Alternative View Present the basics of the “analytical” model of speech perception. Which should no longer be considered a “normalization” model, apparently. Rules vs. Representations: Exemplar models focus more on the details in the signal; they assume that generalizations can emerge from those details, working in concert with one another. Operations on the signal are minimal and de-emphasized. Analytical models focus more on the operations of the listener. Perceptually salient structures emerge through the active analysis of the speech signal by the listener.
What I/we have found The Familiar Talker Advantage is fragile. It does not transfer across languages. It does not encompass all phonetic aspects of the speech signal. The objective here: change the “voice” in two different ways: Its linguistic content Its acoustic (phonetic?) content Q: do either of these changes affect the emergence of the familiar talker advantage? A: Yes, the linguistic change does. This suggests that the FTA is a product of higher-level speech processing-- I.e., the connection with semantic content--rather than a by-product of the lower level processing of the phonetic content of the signal. I guess.
Experiment 1: Motivation Basically: perhaps the familiar talker advantage can help us adjudicate between the exemplar and analytical models of speech perception. What we’re trying to find out is--does the familiar talker advantage emerge because: 1.Talker properties are stripped away from the signal, thereby making the linguistic properties clearer? 2.More robust representations are formed of the interaction of the linguistic and indexical properties in the signal?
Experiment 1: Theoretical Predictions Walk through the exemplar story in detail (i.e., see if you can figure it out for yourself). The analytical story: the familiar talker advantage emerges from a perceptual clarification of which aspects of the signal are speaker-based (indexical), and which are segment-based (linguistic). The earlier data suggest that, in a familiar language, indexical processing is language-dependent. But in an unfamiliar language, talker identification is language-independent. Presumption: when learning to identify voices in another language,
Experiment 1: Predictions Identification of voices transfers completely from an unfamiliar language to a familiar one: whatever “filtering” methods are used in one language apply (without loss) to the other familiarity with a voice in one (unknown) language should lead to a word recognition advantage for that voice in a known language.
Persistent Questions, part 2 The Familiar Talker Advantage emerges (for good listeners) within a known language; Will it emerge for the same talkers across languages, as well? Experiment 2: Does the Familiar Talker Advantage depend on particular qualities of a talker’s voice?
Experiment 2: Motivation Known: ability to identify a talker’s voice facilitates recognition of words spoken by that talker. (Nygaard et al., 1994) 1.Exemplar-based account: linguistic representations include talker-specific information. Processing is facilitated by similarity to traces in memory. 2.Property-based account: listeners learn how to filter indexical properties of particular talkers. …thereby becoming more adept at revealing the linguistic core of the spoken word.
Experiment 2: Predictions Known: listeners show complete generalization of talker knowledge from German to English. (Experiment 1) These listeners identify talkers based on language-independent information in speech. Exemplar-based prediction: Learning to identify talkers in German will not facilitate word recognition in English. (Listeners do not develop integrated representations.) Property-based prediction: Listeners filter the same talker properties in both languages, so facilitation should occur across languages.
Experiment 1: Training Listeners were trained to identify voices of either: Group 1 (five German L1 female talkers) Group 2 (five German L1 female talkers) Half trained in German; half trained in English Three days of training Two sessions per day
Listener Split Some listeners performed better on the talker identification task than others.
Experiment 2: Testing Trained listeners identified 24 words each from 15 different talkers: Group 1: 5 unfamiliar English talkers Group 2: 5 familiar German-English bilinguals Group 3: 5 unfamiliar German-English bilinguals Words were presented in four levels of white noise: Clear, +10 dB SNR, +5 dB SNR, 0 dB SNR Responses scored in terms of words, phonemes, features correct…
Results: Word Recognition, all listeners [Figure panels: English Learners; German Learners] Interaction between listener and talker groups is not significant.
Goats and Sheep Review of literature revealed that Nygaard et al. (1994) split listeners up into “good” and “poor” listeners. Good listeners = 70% correct or better in training. Poor listeners = < 70% correct in training. Splitting listeners in the same way yielded significant interactions in Experiment 2 data.
Results: Word Recognition, English Listeners [Figure panels: Good Learners; Poor Learners] Interaction (Good learners): p = .008; Interaction (Poor learners): p = .025.
Results: Word Recognition, German Listeners [Figure panels: Good Learners; Poor Learners] Interaction between listener and talker groups is not significant.
Some More Implications Certain properties of the signal are only informative for talker identification, and not word recognition (i.e., Abercrombie was right). Maybe mention the “Who” and the “What” streams in the brain. It doesn’t seem like exemplar theory can get the job done with comparisons of unpacked spectral slices. Minimally, I might suggest a perceptual unraveling of the signal into source + filter characteristics. Maximally, I might suggest that exemplar-based perception starts with an articulatory model of what gestures produced a particular acoustic sequence. And if I’m wrong, hopefully people will go on disagreeing with me for the rest of linguistic eternity.
Voice Quality Description? Examples and explanation of the three different voice qualities? The laryngeal settings necessary to produce these three different qualities are largely under a speaker’s control; however, female voices tend to be a bit breathier (all other things being equal) due to the relative thinness of their vocal folds (which makes complete closure more difficult to attain). That being said, it is quite common these days to hear young, female speakers of American/Canadian English use creaky voice. Point: voice quality is sub-phonemic in English; it does not signal meaningful segmental contrasts in any way.
Experiment 2: Stimuli Some discussion, perhaps, of the difficulties in recording the stimuli, and the acoustic differences that resulted. Specifically: the breathy and creaky tokens tended to be longer in duration. (Longer duration reflects less fluency with the articulation) Note that it is possible that the extended durations of these articulations made the particular segments in them easier to understand.
Experiment 2: Methods These methods will effectively be the same as in Experiment 1. Listeners learned to identify six different (female) voices over the course of four days Two training sessions on each day Each training session consisted of: 1.Familiarization/re-familiarization (5 words/talker, presented once--same words for everybody) 2.Recognition (5 words/talker, presented twice, with feedback) 3.Test (10 words/talker, presented once, without feedback)
Experiment 2: Conditions On the three days of training, listeners were split into one of two groups: 1.Quality-dependent: each talker only produced words in a particular voice quality 2 modal talkers, 2 creaky talkers, 2 breathy talkers 2.Quality-neutral: all talkers produced words in a variety of voice qualities The relationship between talker and voice quality was strictly randomized.
What? Why did we not have to split up the learners into good and poor learners in order to get the Familiar Talker Advantage in this second study? Is the Familiar Talker Advantage the same thing in this study? Maybe it only appears to be so, because the talkers from a different dialect area are not actually as intelligible to the listeners as the Canadian talkers.