CS 551/651: Structure of Spoken Language Lecture 12: Tests of Human Speech Perception John-Paul Hosom Fall 2008
Recommended Reading: Chapter 5: Strong/Weak Forms, Intonation, and Stress; Chapter 11, pp. 267-275: Balance between Phonetic Forces and “Physical Phonetics”. The Final Exam will be a take-home exam with 10 questions (same style as the midterm, but it may require use of a calculator) and a number of spectrograms to be deciphered. It will be handed out at the end of class on Wednesday, December 3, and will be due back to me by Friday, December 12. It is worth about 30% of your grade. The final will cover material from Lecture 7 (“Syllable Structure…”) until the end of the term. Material covered on the midterm will probably not be covered on the final. The spectrogram reading exercises will be similar to those on the midterm, but will include the other classes of speech that we’ve been studying (nasals, approximants, and affricates) as well as the usual vowels (and diphthongs), fricatives, and stops.
The Perceptual Second Formant: F2' Most vowels can be simulated using two resonances. In one study, the lower resonance was fixed at the frequency of a vowel formant, and the subject was asked to vary the higher resonance (F2') until the perceived sound most closely matched the target vowel. For back vowels and central vowels, subjects adjusted F2' to a frequency near the vowel’s F2. For front vowels except /iy/, F2' was between the vowel’s F2 and F3; for /iy/, F2' was at or above the vowel’s F3. [Figure: example two-resonance stimulus for target /ih/, with frequencies labeled 400 Hz and 2200 Hz]
The Perceptual Second Formant: F2' These findings suggest that when formants are close in frequency, they are integrated so that there is a single “effective” formant equivalent to an average of the two peaks. It has also been shown that when two or more formants occur within 3 to 3.5 Barks, the perceived vowel quality is equivalent to a resonance pattern with a single formant at the center of gravity of those formants. So, for two formants within about 3 Barks, the formant positions determine the center of gravity of a single perceived resonance; beyond about 3 Barks, the two formants are heard as perceptually distinct. These results suggest that for steady vowels, there is an internal representation that has fairly low resolution.
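The 3-Bark integration rule can be sketched in a few lines. This is only an illustration under stated assumptions: the Hz-to-Bark conversion uses Traunmüller's (1990) approximation, the center of gravity is computed on the Bark scale, and the weights `w_low` and `w_high` (standing in for formant amplitudes) are hypothetical parameters that default to equal weighting. None of these choices comes from the slides.

```python
def hz_to_bark(f_hz):
    """Traunmueller (1990) approximation of the Bark scale (an assumed choice)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53


def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def effective_formants(f_low_hz, f_high_hz, w_low=1.0, w_high=1.0, limit_bark=3.0):
    """Merge two formants into one 'effective' formant if they lie within
    limit_bark of each other; otherwise return both unchanged.

    The weighted mean on the Bark scale is an assumed definition of
    'center of gravity'; the weights are hypothetical amplitude stand-ins.
    """
    z_low, z_high = hz_to_bark(f_low_hz), hz_to_bark(f_high_hz)
    if abs(z_high - z_low) <= limit_bark:
        z_cog = (w_low * z_low + w_high * z_high) / (w_low + w_high)
        return [bark_to_hz(z_cog)]
    return [f_low_hz, f_high_hz]
```

For example, `effective_formants(2200, 2900)` returns a single value between the two inputs (they are less than 2 Barks apart), while `effective_formants(400, 2200)` returns both frequencies unchanged.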
Perception of Coarticulation In most cases, vowels are affected by coarticulation. In some cases, the vowel does not reach its “target” formant pattern. How does the brain deal with this variation in the signal? The acoustic effects of coarticulation are referred to by Lindblom as “target undershoot”; the amount of undershoot depends on syllable duration, as well as on speaking style, and varies both across and within speakers. For vowel perception, Lindblom hypothesized that people compensate for target undershoot and attempt to recover the canonical vowel targets. In an experiment, synthetic speech stimuli in wVw and yVy contexts were presented to listeners, with the F2 of V varying from high (for an /ih/ vowel) to low (for an /uh/ vowel).
Perception of Coarticulation The boundary for perception of /ih/ and /uh/ (given the varying F2 values) was different in the wVw context and the yVy context. In yVy contexts, mid-level values of F2 were heard as /uh/, and in wVw contexts, mid-level values of F2 were heard as /ih/. [Figure: /w ih w/ and /y uh y/ stimuli]
Perception of Coarticulation This demonstrates perceptual overshoot; subjects rely on the direction and slope of formant transitions to classify vowels. Lindblom proposed a Perceptual Compensation model, which “normalizes” formant frequencies based on the formants of the surrounding consonants, canonical vowel targets, and syllable duration. However, many factors may account for target undershoot, and so a simple model is not effective in this case. Also, if the model is applied to automatic speech recognition, determining the locations of consonants and vowels is a non-trivial problem.
Are Formant Targets Important?? Strange et al. conducted an experiment in which target information, dynamic information (in formant transitions), and duration information were manipulated independently in CVC syllables. Given a CVC, the middle region of the V was removed, or the transition regions were removed, or the duration was normalized, or some combination of these was applied. The CVCs were presented to subjects, who were asked to identify the vowel. Stimuli with no target information are referred to as “Silent-Center”, stimuli with no transitions as “Center-Alone”, and time-normalized versions as “Neutral-Duration”.
Identification of Silent-Center vowels was “remarkably accurate”; in some cases, it was as good as identification of the unmodified CVC. Neutral-Duration Silent-Center vowels were not correctly identified as often as Silent-Center vowels. However, Neutral-Duration Silent-Center vowels were still correctly identified more often than Neutral-Duration Center-Alone vowels. Conclusions: (1) when vowel transition and duration information is present, recognition is highly accurate; (2) with no duration information, transition information is more useful than nucleus information for vowel identification; (3) vowel targets alone are neither sufficient nor necessary.
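The Silent-Center and Center-Alone manipulations described above can be sketched as simple waveform edits. This is a hypothetical illustration, not the procedure used by Strange et al.: the function names, the `keep_frac` parameter (the fraction of the vowel treated as transition at each edge), and the choice to leave samples outside the vowel untouched are all assumptions. Duration normalization is not covered.

```python
def _vowel_center(v_start, v_end, keep_frac):
    """Sample range [lo, hi) treated as the vowel center (assumed split)."""
    edge = int((v_end - v_start) * keep_frac)
    return v_start + edge, v_end - edge


def silent_center(samples, v_start, v_end, keep_frac=0.25):
    """Silence the vowel's middle region; keep its initial and final transitions."""
    lo, hi = _vowel_center(v_start, v_end, keep_frac)
    return [0.0 if lo <= i < hi else s for i, s in enumerate(samples)]


def center_alone(samples, v_start, v_end, keep_frac=0.25):
    """Silence the vowel's transition regions; keep its middle region.

    Samples outside [v_start, v_end) are left unchanged (an assumption).
    """
    lo, hi = _vowel_center(v_start, v_end, keep_frac)
    return [
        0.0 if (v_start <= i < lo or hi <= i < v_end) else s
        for i, s in enumerate(samples)
    ]
```

Here `v_start` and `v_end` are sample indices marking the vowel within the CVC; in practice these boundaries would come from hand labeling or forced alignment.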
Are Formant Targets Important?? In another study, by Furui, CV syllables were truncated either from the beginning or from the end, and perception of the truncated syllable was measured. In another experiment, both the initial and final sections of the syllable were truncated, with a minimum duration of 40 msec. The “perceptual critical point” was defined as the truncation position at which there was 80% correct recognition. Furui found: (a) the 10 msec at the point of greatest spectral transition is most important for identification of CV syllables, and (b) the crucial information for both vowels and consonants is in this 10-msec region; consonants can be perceived mainly through the spectral transition into the following vowel.
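As a rough sketch of how a perceptual critical point could be read off measured data, the function below finds where percent-correct crosses the 80% criterion. Linear interpolation between adjacent truncation positions is an assumption made here for illustration; the slides do not say how the crossing was located, and the function name and arguments are hypothetical.

```python
def perceptual_critical_point(positions_ms, pct_correct, criterion=80.0):
    """Return the truncation position (msec) at which recognition crosses
    the criterion, interpolating linearly between neighboring measurements.
    Returns None if the scores never reach or cross the criterion.
    """
    points = list(zip(positions_ms, pct_correct))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 == criterion:
            return x0
        if (y0 - criterion) * (y1 - criterion) < 0:
            # Criterion lies strictly between the two scores.
            return x0 + (criterion - y0) * (x1 - x0) / (y1 - y0)
    if points and points[-1][1] == criterion:
        return points[-1][0]
    return None
```

With made-up scores of 100, 90, 70, and 40 percent at truncation positions of 0, 10, 20, and 30 msec, the crossing falls halfway between the second and third measurements.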
Are Formant Targets Important?? Tekieli and Cullinan showed that (a) given the first 10 msec of an isolated vowel, Place and Height can be distinguished at levels above chance, while the tense-lax feature requires 30 msec; (b) place of articulation in a CV can be identified based on the 10 msec after release, but the voicing feature requires 20-30 msec. In short, timing information is critical for tense-lax and voiced-unvoiced distinctions, and making these distinctions requires about 30 msec of speech; other features can be identified in 10 msec. Finally, DiBenedetto demonstrated that the F1 trajectory influences perception of front vowels; synthetic syllables in which F1 targets are reached earlier than normal are perceived as lower in Height (/iy/ → /ih/, /ih/ → /eh/, /eh/ → /ae/).
Perception of Place of Articulation Acoustic cues to perception of place of articulation reside primarily in spectral transitions between phonemes (with some exceptions, notably weak /f, th/ vs. strong /s, sh/). In perceptual experiments with two synthetic formants, different plosives can be heard by changing the slope of the initial part of F2; a locus of 720 Hz causes perception of /b/, a locus of 1800 Hz causes perception of /d/, and a locus of 3000 Hz often causes perception of /g/. Different plosives can also be perceived based on the shape of the burst (see next slide).
Categorical Perception In labeling speech, we use a fixed symbol set (e.g. Worldbet, IPA) to record what is spoken. But what do we hear? Do we hear discrete symbols, or a continuum of sounds? In other words, is perception categorical or continuous? If it is categorical, then there will be a range of stimuli that yields no perceptual difference, a boundary at which the perception changes, and another range of stimuli with no perceptual difference. One example of a categorically perceived feature is voice-onset time (VOT); if VOT is long, people hear unvoiced plosives, and if VOT is short, people hear voiced plosives. But people don’t hear ambiguous plosives at the boundary between short and long VOT (30 msec).
Categorical Perception In another experiment, the F2 transition was varied along a continuous scale, but what was heard were “essentially quantal jumps from one perceptual category to another” (namely /b/, /d/, and /g/) (Moore, p. 283). On the other hand, small changes in the formants of vowels are easily perceived, leading to perception of “blended” vowels. However, for continuous-speech vowels, perception may be more categorical (Stevens, 1968), and there is evidence that vowels are encoded in memory using distinctive features (when vowels are forgotten, other vowels with similar features are more likely to be remembered; Cole, 1968). Other evidence for categorical perception comes from second-language learning, e.g. Japanese speakers distinguishing /r/ and /l/ (by the age of 6, perception of speech is altered).
Categorical Perception However, another study presented subjects with a range of stimuli between /b/, /d/, and /g/, but subjects were asked to respond with either /b/ or /g/. If perception were completely categorical, the responses in the /d/ region should have been random, but in fact there were systematic responses (Barclay, 1972). Perception may be continuous but have sharp category boundaries (e.g. Massaro, 1998).
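The contrast between strictly categorical labeling and continuous perception with a sharp boundary can be illustrated with two toy identification functions over VOT. The 30 msec boundary is taken from the slides; the logistic shape and the `slope_per_ms` value are assumptions chosen only to show a steep but smooth transition, and both function names are hypothetical.

```python
import math


def label_plosive(vot_ms, boundary_ms=30.0):
    """Strictly categorical labeling: one label on each side of the boundary."""
    return "unvoiced" if vot_ms >= boundary_ms else "voiced"


def p_unvoiced(vot_ms, boundary_ms=30.0, slope_per_ms=0.5):
    """Continuous identification function with a sharp boundary.

    A logistic curve is an assumed functional form; slope_per_ms is a
    made-up steepness, not a fitted value.
    """
    return 1.0 / (1.0 + math.exp(-slope_per_ms * (vot_ms - boundary_ms)))
```

Far from the boundary the two functions agree almost perfectly; they differ only near 30 msec, where the logistic version still carries graded information about the stimulus.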
Cue Trading One experiment tested perception of “slit” vs. “split”, with the duration of silence between /s/ and /l/ varied, and the formant transitions of /l/ varied from flat to more /p/-like. Long silence durations yield “split”; however, words with formant transitions closer to /p/ required less silence to be heard as “split”. Conclusion: both acoustic cues are integrated by the listener into a single phonemic perception; cues can be “traded”, so that more of one cue requires less of another for one type of perception (e.g. “split”).
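One simple way to picture cue trading is a model in which the two cues add together before a decision is made, so that a stronger /p/-like transition lowers the amount of silence needed. The sketch below assumes an additive logistic model; the model form, the weights, the bias, and the 0-to-1 `p_transition` scale (0 = flat, 1 = fully /p/-like) are all hypothetical values chosen for illustration, not results from the experiment.

```python
import math


def p_split(silence_ms, p_transition, w_silence=0.1, w_transition=4.0, bias=-8.0):
    """Probability of hearing "split" when both cues add in the logit domain."""
    score = w_silence * silence_ms + w_transition * p_transition + bias
    return 1.0 / (1.0 + math.exp(-score))


def silence_needed_ms(p_transition, w_silence=0.1, w_transition=4.0, bias=-8.0):
    """Silence duration at which "slit" and "split" are equally likely."""
    return -(bias + w_transition * p_transition) / w_silence
```

With these made-up weights, a flat transition needs about 80 msec of silence to reach the 50% point, while a fully /p/-like transition needs about 40 msec, which is the trading relation in miniature.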
Cue Trading As Moore stated, “within limits, a change in the setting or value of one cue, which leads to a change in the phonetic percept, can be offset by an opposed setting of a change in another cue so as to maintain the original phonetic percept.” (p. 291) McGurk Effect: (1) the audio signal contains /ba/, the video signal contains /ga/, and the perceived sound is /da/; (2) the audio signal contains /ma/, the video signal contains /ta/, and the perceived sound is /na/. Subjects are not aware of the conflicting cues.
Fuzzy-Logic Model of Perception (FLMP) Massaro has proposed the FLMP, in which cues are: (a) evaluated according to their degree of presence; this evaluation returns a high number (up to 1.0) if the feature is present, and a low number (as low as 0.0) if the feature is absent; (b) matched to a prototype higher-level feature, such as a high degree of lip rounding matching a bilabial sound; (c) incorporated in a pattern-classification step, to determine which higher-level feature best matches the available cues. The “best” high-level feature is selected as the actual feature. For example, given the following prototypes: phn(labial, voiced) = /b/; phn(labial, not_voiced) = /p/; phn(alveolar, voiced) = /d/; phn(alveolar, not_voiced) = /t/
Fuzzy-Logic Model of Perception (FLMP) And then given measurements of place of articulation along a scale of 0.0 = bilabial to 1.0 = alveolar, and of voicing along a scale of 0.0 = not_voiced to 1.0 = voiced, the probability of identifying the sound as /b/ is: P(/b/) = (1 − A)·V / [(1 − A)·V + (1 − A)·(1 − V) + A·V + A·(1 − V)], where A is the evidence for alveolar, and V is the evidence for voicing. This assumes that all of the pieces of evidence (cues) are independent. This is equivalent to Bayes’ rule if the “fuzzy” scales are interpreted as probabilities.
Fuzzy-Logic Model of Perception (FLMP) With exponential weights on the pieces of evidence, the predicted probabilities of identification agree well with actual probabilities of identification when the place of articulation and voice-onset time of synthetic speech sounds are varied.
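The FLMP decision for the four prototypes above can be sketched as follows: each prototype's support is the product of how well the place and voicing evidence match it, and the supports are normalized to give identification probabilities. Reading "exponential weights" as exponents on each piece of evidence is an assumption; `w_place`, `w_voice`, and the function name are hypothetical, and with both weights at 1.0 the function reduces to the unweighted product rule.

```python
# Prototypes from the slides: (place, voicing) -> phoneme.
PROTOTYPES = {
    "b": ("labial", "voiced"),
    "p": ("labial", "not_voiced"),
    "d": ("alveolar", "voiced"),
    "t": ("alveolar", "not_voiced"),
}


def flmp_probabilities(A, V, w_place=1.0, w_voice=1.0):
    """Identification probabilities for /b p d t/.

    A: evidence for alveolar (0.0 = bilabial, 1.0 = alveolar).
    V: evidence for voicing (0.0 = not_voiced, 1.0 = voiced).
    w_place, w_voice: assumed exponents on each cue (1.0 = unweighted).
    """
    support = {}
    for phone, (place, voicing) in PROTOTYPES.items():
        place_match = A if place == "alveolar" else 1.0 - A
        voice_match = V if voicing == "voiced" else 1.0 - V
        support[phone] = (place_match ** w_place) * (voice_match ** w_voice)
    total = sum(support.values())
    return {phone: s / total for phone, s in support.items()}
```

For instance, `flmp_probabilities(0.2, 0.9)` makes /b/ the most probable response, since the evidence is mostly bilabial and mostly voiced; raising `w_place` above 1.0 makes the place cue count for more and pushes the /b/ probability higher still.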