From speech signal acoustics to perception. Louis C.W. Pols, Institute of Phonetic Sciences (IFA), Amsterdam Center for Language and Communication (ACLC)


1 From speech signal acoustics to perception
Louis C.W. Pols, Institute of Phonetic Sciences (IFA), Amsterdam Center for Language and Communication (ACLC)
NATO-ASI “Dynamics of Speech Production and Perception”, Il Ciocco, Tuscany, Italy, July 4, 2002

2 Overview
- how do we perceive (speech) dynamics?
- The Intelligent Ear. On the Nature of Sound Perception, by Reinier Plomp (2002)
- from psychoacoustics to speech perception
- (lack of) context; robustness; continuity
- V and C reduction; coarticulation
- perceptual compensation for artic. undershoot?
- speech efficiency
- conclusions
Slide footer: July 4, 2002. From speech signal acoustics to perception, Il Ciocco

3 Various scientific preferences
Several biases have affected the history of (speech &) hearing research (Plomp, 2002):
- dominance of sinusoidal tones as stimuli
- preference for microscopic approach (e.g., phoneme discrimination rather than intelligibility)
- emphasis on psychophysical (rather than cognitive) aspects of hearing
- clean stimuli in the lab rather than the acoustic reality of the outside world (disruptive sounds)

4 Psychoacoustics - speech perc.
- duration, pitch, loudness, timbre, direction
- absolute and masked threshold, jnd, discrim.
- continuity
- complexity (pure - complex tone, voicing)
- effect of context, meaning (intell.), freq. occ.
- phoneme: more text-guided than perceived speech perceptual tasks: phoneme -> sent.
- identif.; discrim.; matching

5 Detection thresholds and jnd
- multi-harmonic, simple, stationary signals
- single-formant-like periodic signals
[Table residue: jnd values “3 - 5%”, “1.5 Hz”, “%” for frequency, F2, BW; the pairing of values with quantities is not recoverable from the transcript]

6 Perceiving speech-like trans.
- Ph.D thesis A. van Wieringen (1995), “Perceiving dynamic speechlike sounds. Psychoacoustics and speech perception”
- see also vWie & Pols, Acustica 84 (1998)
- stimulus characteristics: (segmented and/or reversed) natural or synthetic
- tone glide; single- or multi-formant transition
- isolated trans.; initial or final trans. with steady st.
- converg. or diverg. trans. (var. duration or slope)
- task: jnd/DL; matching; abs. ident.; classif.

7 DL for short speech-like transitions
Adopted from van Wieringen & Pols (1998), Acta Acustica 84, “Discrimination of short and rapid speechlike transitions”
[Figure: DL results; panel labels: complex, simple; short, longer trans.; initial, final]

8 Perceiving (speech) dynamics
- vowel perception with or without transitions?
- our claims (vSon, IFA Proc. 17 (1993)):
- evidence for compensatory processes, i.e. perceptual-overshoot and dynamic-specification, only when in an appropriate context
- synthetic isolated dynamic formant tracks lead to perceptual undershoot (=averaging)
- silent center studies are ambiguous
- concl.: info in formant dynamics is only used when V’s are heard in appropriate context


10 Vowel identification
- compare V responses for dynamic stimuli with those for static stimuli
- calculate net shift in V responses per onglide (CV), complete (CVC), or offglide (VC)
- result: responses average over the trailing part of the formant track

11 Perceptual undershoot
[Figure: Net shift in vowel responses to tokens with curved formant tracks vs. stationary tokens. All values significant, except small open triangles.]

12 Effect of local context
- “Perisegmental speech improves consonant and vowel identification”, vSon & Pols, Speech Comm. 29, 1-22 (1999)
- also “Phoneme recognition as a function of task and context”, IFA Proc. 24 (2001) and Proc. SPRAAC (2001)
- also Pols & vSon (1993), “Acoustics and perception of dynamic vowel segments”, Speech Comm. 13

13 V and C identification
- gated tokens from 120 CVC speech fragments taken from a long text reading
- 50 ms V kernel, + V trans., + C part (L/R)
- stimuli randomized; V identification (17 Ss) and C_i and C_f identification (15 Ss)
- results:
- phoneme identification benefits from extra speech
- left context more beneficial than right context
- better identification when the other member of the pair was also identified correctly (context effect)


15 Error rates of vowel identification for the individual stimulus token types. Long-short vowel errors (/ɑ-a:, ɔ-o:/) are ignored.

16 V and C in CV tokens were identified better when the other member of the pair was identified correctly

17 Effect of (lack of) context
- 100 Dutch listeners identifying V segments
- “Vowel contrast reduction”, K-vBeinum (1980)
[Table, cell values lost in the transcript: columns M1, M2, F1, F2, Av.; a % row and an ASC row for each of 3 conditions: isolated V (3); words (5); unstr., free conv. (10)]
ASC = $\frac{1}{n}\sum_{i=1}^{n} |LF_i - \overline{LF}|^2$ (total variance), $LF_i = \log F_i$
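The ASC measure on this slide can be illustrated with a minimal sketch. It assumes one reading of the flattened formula: each vowel is a point of log formant frequencies, and ASC is the mean squared distance of those points from their centroid (the "total variance"). The function name, the base-10 logarithm, the restriction to F1 and F2, and the formant values are all hypothetical choices for illustration, not taken from Koopmans-van Beinum (1980).

```python
import math

def acoustic_system_contrast(vowels_hz):
    """Mean squared distance of log-formant points from their centroid.

    vowels_hz: list of formant tuples in Hz, e.g. (F1, F2) per vowel.
    Assumption: base-10 logs and a centroid-based 'total variance'.
    """
    log_points = [[math.log10(f) for f in vowel] for vowel in vowels_hz]
    n = len(log_points)
    dims = len(log_points[0])
    # Centroid of the vowel system in log-formant space
    centroid = [sum(p[d] for p in log_points) / n for d in range(dims)]
    # Average of squared Euclidean distances to the centroid
    return sum(
        sum((p[d] - centroid[d]) ** 2 for d in range(dims)) for p in log_points
    ) / n

# Hypothetical (F1, F2) values: a spread-out system and a reduced one
isolated = [(300, 2300), (700, 1300), (300, 800)]
reduced = [(400, 1800), (550, 1400), (400, 1100)]
asc_isolated = acoustic_system_contrast(isolated)
asc_reduced = acoustic_system_contrast(reduced)
```

Under this reading, a vowel system whose members move toward each other in the formant plane yields a smaller ASC, which is the sense in which the measure tracks vowel contrast reduction across the three speaking conditions.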

18 Human word intelligibility vs. noise
From Ph.D thesis H. Steeneken (1992), ‘On measuring and predicting speech intelligibility’

19 Robustness to degraded speech
- speech = time-modulated signal in frequency bands
- relatively insensitive to (spectral) distortions
- prerequisite for digital hearing aid
- modulating spectral slope: -5 to +5 dB/oct, [...] Hz
- temporal smearing of envelope modulation
- ca. 4 Hz max. in modulation spectrum ⇔ syllable
- LP>4 Hz and HP<8 Hz little effect on intelligibility
- spectral envelope smearing
- for BW>1/3 oct masked SRT starts to degrade
- (for references, see keynote paper Pols in Proc. ICPhS’99)

20 Some examples
- partly reversed speech (Saberi & Perrott, Nature, 4/99)
- fixed duration segments time reversed or shifted in time
- perfect sentence intelligibility up to 50 ms (demo: every 50 ms reversed; original)
- low frequency modulation envelope (3-8 Hz) vs. acoustic spectrum
- syllable as information unit? (S. Greenberg)
- gap and click restoration (Warren)
- gating experiments
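The "fixed duration segments time reversed" manipulation can be sketched in a few lines. This is only an illustration of the idea as stated on the slide, assuming a mono signal held as a plain sequence of samples; the function name and parameters are hypothetical, and it is not claimed to reproduce the original stimulus-generation procedure.

```python
def locally_reverse(samples, sample_rate, segment_ms=50):
    """Reverse a signal within consecutive fixed-duration segments.

    samples: sequence of audio samples (mono).
    sample_rate: samples per second.
    segment_ms: segment duration in milliseconds (50 ms as in the demo).
    """
    seg_len = max(1, int(round(sample_rate * segment_ms / 1000)))
    out = []
    for start in range(0, len(samples), seg_len):
        # Each segment keeps its position in time but is played backwards
        out.extend(reversed(samples[start:start + seg_len]))
    return out

# Toy example: 10 samples at 1000 Hz with 4 ms segments (4 samples each)
toy = locally_reverse(list(range(10)), sample_rate=1000, segment_ms=4)
# toy == [3, 2, 1, 0, 7, 6, 5, 4, 9, 8]
```

The order of the segments is untouched, so the slow envelope of the signal (the 3-8 Hz modulations mentioned on the slide) is largely preserved while the fine structure inside each segment is scrambled.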

21 Continuity, especially while masked
- continuity effect (Miller & Licklider), auditory induction (Warren), pulsation threshold (Houtgast)
- also for gliding tones
- also for complex tones
- also for pitch
- fission, fusion
- segregation, streaming
- phonemic restoration
[Figure axes: Hz vs. time]

22 V and C reduction, coarticulation
- spectral variability is not random but, at least partly, speaker-, style-, and context-specific
- read - spontaneous; stressed - unstressed
- not just for vowels, but also for consonants
- duration; spectral balance
- intervocalic sound energy difference
- F2 slope difference; locus equation

23 [Figure panels: Mean consonant duration (C-duration); Mean error rate for C identification (C error rate)]
Adopted from van Son & Pols (Eurospeech’97)
791 VCV pairs (read & spontan.; stressed & unstr. segments; one male); C-identification by 22 Dutch subjects

24 Perception of ac. V reduction
- Ph.D thesis Dick van Bergem (1995), “Acoustic and lexical vowel reduction”
- lexical V reduction: Fr /betõ/ vs. Du [...]
- acoustic V reduction: Du ‘miljoen’ as /mIljun/ or as [...]
- identify the unstressed vowels (as V [...]) by 20 listeners (8 M, 12 F) in 47 words (cond. W and S) or 20 words (cond. P), like ‘milJOEN’ or ‘biosCOOP’, spoken by 20 male speakers (2280 stimuli)

25 [Figure, adapted from vBergem (1995): 4 reduction stages for 20 speakers; % schwa responses on /I/ by 20 listeners; value labels 5%, 36%, 60%, 69%; model prediction for schwa in this m-l context]
Conclusion: Vowel reduction is not centralization but contextual assimilation

26 Speech efficiency
- speech is most efficient if it contains only the information needed to understand it: “Speech is the missing information” (Lindblom, JASA ’96)
- less information needed for more predictable things:
- shorter duration and more spectral reduction for high-frequency syllables and words
- C-confusion correlates with acoustic factors (duration, CoG) and with information content (syll./word freq.)
- $I(x) = -\log_2(\mathrm{Prob}(x))$ in bits
- (see van Son, Koopmans-van Beinum, and Pols (ICSLP’98))
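The information measure on this slide, I(x) = -log2(Prob(x)), can be worked through on a toy frequency table. The sketch below assumes that Prob(x) is estimated as the relative frequency of a syllable or word in a corpus; the function name and the counts are hypothetical and only serve to show the arithmetic.

```python
import math

def information_content(counts):
    """Return I(x) = -log2(Prob(x)) in bits for each item in a count table.

    Assumption: Prob(x) is the relative frequency count(x) / total.
    """
    total = sum(counts.values())
    return {x: -math.log2(n / total) for x, n in counts.items()}

# Hypothetical syllable counts (16 tokens in total)
syllable_counts = {"de": 8, "en": 4, "van": 2, "ster": 2}
bits = information_content(syllable_counts)
# "de" covers half the tokens, so it carries 1 bit;
# "en" (a quarter) carries 2 bits; "van" and "ster" (an eighth each) carry 3 bits
```

Frequent items get low values and rare items get high values, which is the sense in which high-frequency syllables and words are "more predictable" and can tolerate shorter durations and more reduction.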

27 Correlation between consonant confusion and 4 measures indicated
Adopted from van Son et al. (Proc. ICSLP’98)
[Figure labels: Dutch male sp.; 20 min. R/S; 12k syll.; 8k words; 791 VCV; R/S; lex. str./unstr.; C ident.; 22 Ss]

28 Conclusions
- perceiving speech (segments) very much depends on speech quality and context
- presenting segments in isolation is also a kind of context
- ‘proper’ interpretation of formant transitions (perceptual compensation for spectro-temporal undershoot) only when presented in an appropriate context
- reduced V are best perceived as schwa if transitions are contextually assimilated

