
1 Temporal Properties of Spoken Language Steven Greenberg The Speech Institute http://www.icsi.berkeley.edu/~steveng steveng@cogsci.berkeley.edu

2 Acknowledgements and Thanks Research Funding U.S. Department of Defense U.S. National Science Foundation Research Collaborators Hannah Carvey, Shawn Chang, Ken Grant, Leah Hitchcock, Joy Hollenback, Rosaria Silipo

3 For Further Information Consult the web site: www.icsi.berkeley.edu/~steveng

4 This presentation examines WHY the temporal properties of speech are the way they are Some General Questions

5 Specifically, we ask …. WHY is the average duration of a syllable (in spontaneous speech) ca. 200 ms? Some General Questions

6 Specifically, we ask …. WHY are some syllables significantly longer than others? Some General Questions

7 Specifically, we ask …. WHY are some phonetic segments (usually vowels) longer than others (typically consonants)? Some General Questions

8 And …. WHAT can the temporal properties of speech tell us about spoken language? Some General Questions

9 The temporal properties of spoken language reflect INFORMATION contained in the speech signal Conclusions

10 PROSODY is the most sensitive LINGUISTIC reflection of INFORMATION (PROSODY refers to the RHYTHM and TEMPO of syllables in an utterance) Conclusions

11 Much of the temporal variation in spoken language reflects prosodic factors Conclusions

12 Hence, prosody is the key to understanding much of the temporal (and phonetic) variation observed in spoken language Conclusions

13 Prosody also shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation) Conclusions

14 Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal (acoustic and otherwise) This is the focus for today’s presentation But, before considering prosody per se, let’s first examine an important acoustic property of the speech signal …. Conclusions

15 SLOW modulation of acoustic energy, reflecting movement of the speech articulators, is crucial for understanding spoken language. The fine spectral detail is FAR less important: 80% of the spectrum can be discarded without much impact on intelligibility (ca. 90% intelligibility is retained). WHY should this be so? Importance of Slow Modulations

16 Quantifying Modulation Patterns in Speech The modulation spectrum provides a convenient quantitative method for computing the amount of modulation in the speech signal The technique is illustrated for a paradigmatic, simple signal The computation is performed for each spectral channel separately
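
A minimal sketch of how such a per-channel modulation spectrum might be computed (an illustrative reconstruction, not the analysis pipeline behind these slides): band-pass the waveform into one spectral channel, extract that channel's amplitude envelope, and take the power spectrum of the downsampled envelope.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert, welch

def modulation_spectrum(x, fs, band=(300.0, 600.0), env_fs=100.0):
    """Modulation spectrum of one spectral channel (illustrative sketch).

    x      : 1-D speech waveform
    fs     : sampling rate in Hz
    band   : (low, high) edges of the acoustic channel in Hz
    env_fs : rate at which the envelope is resampled before analysis
    """
    # 1. Isolate one spectral channel with a band-pass filter.
    sos = butter(4, band, btype="bandpass", fs=fs, output="sos")
    chan = sosfiltfilt(sos, x)

    # 2. Extract the slow amplitude envelope of that channel.
    env = np.abs(hilbert(chan))

    # 3. Downsample the envelope (modulations of interest are below ~30 Hz).
    step = int(fs // env_fs)
    env = env[::step]

    # 4. The spectrum of the envelope is the modulation spectrum of this channel.
    mod_freqs, mod_power = welch(env - env.mean(), fs=fs / step, nperseg=256)
    return mod_freqs, mod_power
```

Repeating this over a bank of channels and averaging (or plotting) the per-channel spectra would be the natural way to obtain the modulation spectra discussed on the following slides.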

17 The low-frequency modulation patterns can thus be quantified using the modulation spectrum, which looks like this for spontaneous speech …. Modulation Spectrum of Spoken Language The modulation spectrum has a broad peak of energy between 3 and 10 Hz

18 Linguistically, the modulation spectrum reflects SYLLABLES The distribution of syllable duration is similar to the modulation spectrum Modulation Spectrum of Spoken Language [Figure: syllable-duration distribution alongside the modulation spectrum; 15 minutes of spontaneous material from a single Japanese speaker]
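
As a rough back-of-the-envelope link between the two plots, a syllable of duration d seconds corresponds to roughly one modulation cycle at 1/d Hz. The sketch below uses made-up placeholder durations (not the Japanese-speaker data) to convert durations into equivalent modulation frequencies; a typical 200 ms syllable maps to ca. 5 Hz, inside the 3-10 Hz peak.

```python
import numpy as np

# Hypothetical syllable durations in seconds (placeholders, not the slide's data).
durations = np.array([0.12, 0.18, 0.20, 0.22, 0.25, 0.30, 0.35, 0.15, 0.28, 0.40])

# A syllable lasting d seconds corresponds roughly to one modulation cycle at 1/d Hz.
equiv_mod_freq = 1.0 / durations

# Crude text histogram of the equivalent modulation frequencies.
hist, edges = np.histogram(equiv_mod_freq, bins=np.arange(2, 12, 1))
for lo, hi, n in zip(edges[:-1], edges[1:], hist):
    print(f"{lo:4.1f}-{hi:4.1f} Hz : {'#' * n}")
```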

19 Questions: Why do syllables vary so much in duration? And why is the peak of the modulation spectrum so broad? Variation in Syllable Duration [Figure: syllable-duration distribution alongside the modulation spectrum; 15 minutes of spontaneous material from a single Japanese speaker]

20 Why do syllables vary so much in duration? In large part, it is because syllables carry differential amounts of information Longer syllables tend to contain more information than shorter syllables Below, the vowels in “ride” and “bikes” are longer than in other words (as well as more intense) Variation in Syllable Duration

21 Duration is one of the most important correlates of syllable accent (prosody) We know this because of studies SIMULATING syllable prominence (accent) labeling by highly trained linguistic transcribers In one study, it was shown that duration is the single most important acoustic property related to syllable prominence (in Am. English) Duration Correlates with Syllable Stress [Figure: relative importance of duration, amplitude, and pitch for prominence labeling; Silipo and Greenberg (1999)]
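
The kind of analysis alluded to here could, in principle, be sketched as a simple classifier that predicts transcriber-assigned prominence labels from per-syllable duration, amplitude, and pitch, and then compares the standardized feature weights. The feature values below are hypothetical placeholders, and this is not the Silipo and Greenberg (1999) procedure itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Hypothetical per-syllable features: [duration_ms, rms_amplitude_db, mean_f0_hz]
X = np.array([
    [280, 68, 140], [310, 70, 155], [260, 66, 150], [295, 69, 145],  # accented
    [140, 60, 132], [120, 58, 128], [160, 61, 135], [150, 59, 130],  # unaccented
], dtype=float)
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])  # 1 = accented, 0 = unaccented

# Standardize so the coefficient magnitudes are comparable across features.
Xz = StandardScaler().fit_transform(X)
clf = LogisticRegression().fit(Xz, y)

# With real data, the relative weights indicate which acoustic property
# contributes most to predicting prominence; with this toy data the numbers
# merely illustrate the comparison.
for name, coef in zip(["duration", "amplitude", "pitch"], clf.coef_[0]):
    print(f"{name:9s} weight: {coef:+.2f}")
```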

22 Word Duration and Syllabic Accent Level Words that contain an accented syllable tend to be considerably longer than unaccented words What are the implications of this insight? [Figure: word-duration distributions for heavily accented, lightly accented, unaccented, and all words]

23 Word Duration and Stress Accent Level The broad distribution of word duration (and, in turn, syllable duration) largely reflects the co-existence of accented and unaccented words (and syllables), often within the same utterance This interleaving of long and short syllables reflects the DIFFERENTIAL DISTRIBUTION of ENTROPY across an utterance [Figure: word-duration distributions for heavily accented, lightly accented, unaccented, and all words]

24 Breadth of the Modulation Spectrum The broad bandwidth of the modulation spectrum, as it reflects syllable duration, encapsulates the heterogeneity in syllabic and lexical duration associated with variation in syllable prominence Does this insight have implications for spoken language? [Figure: modulation spectrum of 40 TIMIT sentences (computed across a 6-kHz bandwidth), shown for unaccented syllables, heavily accented syllables, and all accents (convergence)]

25 Modulation Spectrum Breadth & Intelligibility Long ago, Houtgast and Steeneken demonstrated that the modulation spectrum is highly predictive of speech intelligibility In highly reverberant environments, the modulation spectrum's peak is severely attenuated, shifted down to ca. 2 Hz, and the signal becomes largely unintelligible What does this imply with respect to prosody? [Figure: modulation spectrum under clean and reverberant conditions, based on an illustration by Hynek Hermansky]
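
For reverberation specifically, Houtgast and Steeneken's framework uses a modulation transfer function; for a room with exponential reverberant decay the standard approximation is m(F) = [1 + (2*pi*F*T60/13.8)^2]^(-1/2), where T60 is the reverberation time. The short sketch below evaluates this attenuation to show how a long T60 suppresses the 3-10 Hz region that carries syllabic information while leaving ca. 1-2 Hz modulations relatively intact.

```python
import numpy as np

def reverberation_mtf(mod_freq_hz, t60_s):
    """Modulation transfer function for exponential reverberant decay
    (classic Houtgast & Steeneken approximation)."""
    return 1.0 / np.sqrt(1.0 + (2.0 * np.pi * mod_freq_hz * t60_s / 13.8) ** 2)

freqs = np.array([1, 2, 4, 8, 16], dtype=float)   # modulation frequencies (Hz)
for t60 in (0.3, 1.0, 2.5):                       # mild to highly reverberant rooms
    atten = reverberation_mtf(freqs, t60)
    pairs = " ".join(f"{f:.0f} Hz -> {m:.2f}" for f, m in zip(freqs, atten))
    print(f"T60 = {t60:.1f} s : {pairs}")
```

With T60 = 2.5 s, modulations above ca. 4 Hz are attenuated by a factor of five or more while 1-2 Hz modulations largely survive, consistent with the downward shift of the peak described above.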

26 As the modulation spectrum is progressively low-pass filtered, intelligibility declines Suggesting that intelligibility requires both long and short (i.e., accented and unaccented) syllables (However, some syllables - the accented ones - are “more equal” than others) Intelligibility and Modulation Frequency [Figure: intelligibility as a function of modulation-frequency cutoff for unaccented, heavily accented, and all accents (convergence); Silipo et al. (1999)]

27 Syllable Duration and Accent Heavily accented syllables are generally 60-100% longer than their unaccented counterparts The disparity in duration is most pronounced for syllable forms with one or no consonants (i.e., V, VC, CV) This pattern implies that accent has its greatest impact on vocalic duration [Figure: duration of canonical syllable forms by accent level; V = Vowel, C = Consonant]

28 Vowel Duration - Accent Level/Syllable Form Vowels in accented syllables are at least twice as long as their unaccented counterparts This pattern implies that the syllabic nucleus absorbs much of accent’s impact (at least as far as duration is concerned) [Figure: vowel duration by accent level and canonical syllable form]

29 Syllable Onset/Coda Duration and Accent ONSETS of accented syllables are generally 50-60% longer than their unaccented counterparts and are thus somewhat sensitive to stress accent, while there is little difference in duration between accented and unaccented CODA constituents: CODAS are relatively insensitive to prosody and carry less information than onsets [Figure: onset and coda durations for canonical syllable forms, by accent level]

30 Sensitivity of Syllable Constituents to Accent Thus, the duration of syllabic nuclei (usually vowels) is most sensitive to syllable accent Syllable CODAS are LEAST sensitive to prosodic accent This differential sensitivity to prosodic accent reflects some fundamental principles of information encoding within the syllable, as well as principles of auditory function (e.g., onsets are more important than offsets for evoking neural discharge – hence much of the neural entropy is embedded in the onset)

31 Syllable Prominence (Accent) Illustrated [Figure: full-spectrum view of the word “Seven” ([s] [eh] [vx] [en]) from OGI Numbers95, with mean durations for the accented and unaccented syllables and the onset, nucleus, and ambi-syllabic “pure” juncture constituents labeled]

32 Robustness Based on Temporal Properties Reflections from walls and other surfaces routinely modify the spectro-temporal structure of the speech signal under everyday conditions Yet, the intelligibility of speech is remarkably stable This implies that intelligibility is NOT based on the spectro-temporal DETAILS but rather on some more basic, TEMPORAL parameter(s)

33 Temporal Basis of Intelligibility Four narrow channels, presented synchronously, yield ca. 90% intelligibility Intelligibility for two channels ranges between 10 and 60%

34 Desynchronizing Slits Affects Intelligibility When the center slits lead or lag the lateral slits by more than 25 ms, intelligibility suffers significantly Intelligibility plummets to ca. 55% for leads/lags of 50 ms and declines to 40% for leads/lags of 75 ms

35 Asynchrony greater than 50 ms results in intelligibility lower than baseline A trough in performance occurs at ca. 200-250 ms asynchrony, roughly the interval associated with the syllable What does this mean? Perhaps that there is a syllable-length time window of integration Slit Asynchrony Affects Intelligibility

36 Importance of Visual Cues Visual cues often supplement the acoustic signal, and are particularly important in adverse acoustic environments (i.e., noise & reverberation) What is the basis of visual supplementation to understanding speech? One possibility is the common modulatory properties of the visual and acoustic components of the speech signal [Figure: amplitude fluctuation in different spectral regions (wideband, F1, F2, F3; RMS amplitude in dB) and lip-aperture variation (lip area in in²) over time for the sentence “Watch the log float in the wide river”; data courtesy of Ken Grant]
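
One hedged way to illustrate the “common modulatory properties” idea is to correlate a smoothed acoustic amplitude envelope with a lip-area trajectory sampled at the same frame rate. The two series below are synthetic placeholders (not Ken Grant’s data), generated to share a ca. 4 Hz syllabic-rate modulation.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(0, 3.5, 0.01)                      # 3.5 s at a 100 Hz frame rate

# Synthetic ~4 Hz "syllabic" modulation shared by both streams (placeholder data).
shared = 0.5 * (1 + np.sin(2 * np.pi * 4 * t))
audio_env = shared + 0.2 * rng.standard_normal(t.size)        # acoustic RMS envelope
lip_area = 0.8 * shared + 0.2 * rng.standard_normal(t.size)   # lip aperture (arbitrary units)

# Zero-lag correlation between the two modulation patterns.
r = np.corrcoef(audio_env, lip_area)[0, 1]
print(f"audio-envelope / lip-area correlation: r = {r:.2f}")
```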

37 Combining Audio and Visual Cues Visual cues (a.k.a. speechreading) also provide important information about consonantal place of articulation and the nature of both prosodic and vocalic properties One can desynchronize the audio and visual streams and measure its impact on intelligibility [Figure: baseline condition is SYNCHRONOUS A/V; in the test conditions the video leads the audio by 40 – 400 ms or the audio leads the video by 40 – 400 ms]

38 Focus on Audio-Leading-Video Conditions When the AUDIO signal LEADS the VIDEO, there is a progressive decline in intelligibility, similar to that observed for audio-alone signals These data are next compared with data from the audio-alone study to illustrate the similarity in the slope of the function

39 Comparison of A/V and Audio-Alone Data The decline in intelligibility for the audio-alone condition is similar to that of the audio-leading-video condition Such similarity in the slopes associated with intelligibility for both experiments suggests that the underlying mechanisms may be similar The intelligibility of the audio-alone signals is higher than that of the A/V signals due to slits 2+3 being highly intelligible by themselves

40 When the VIDEO signal LEADS the AUDIO, intelligibility is preserved for asynchrony intervals as large as 200 ms These data are rather strange, implying some form of “immunity” against intelligibility degradation when the video channel leads the audio Focus on Video-Leading-Audio Conditions

41 The slope of intelligibility-decline associated with the video-leading-audio conditions is rather different from that for the audio-leading-video conditions WHY? Auditory-Visual Integration - the Full Monty

42 Time Constants of Audio-Visual Integration The temporal limits of combining visual and acoustic information are SYLLABLE length, particularly when the video precedes the audio signal Suggesting that visual speech cues are syllabically organized

43 WHY are the temporal properties of speech the way they are? Because …. The brain requires such intervals to combine information across sensory modalities and to associate the sensory streams with meaning Some General Answers

44 WHY is the average duration of a syllable (in spontaneous speech) ca. 200 ms? The syllable’s duration reflects a basic sensori-motor integration time constant and can be considered to represent the sampling rate of consciousness Some General Answers

45 WHY are some syllables significantly longer than others? The heterogeneity in duration reflects the unequal distribution of entropy across an utterance and is a basic requirement for decoding the speech signal Some General Answers

46 WHY are some phonetic segments (usually vowels) longer than others (typically consonants)? Vowels reflect the influence of prosodic factors far more than consonants, and therefore convey more information concerning a syllable’s intrinsic entropy than their consonantal counterparts Some General Answers

47 WHAT can the temporal properties of speech tell us about spoken language in general? They provide a general theoretical framework for understanding the organization of spoken language and how the brain decodes the speech signal Some General Questions

48 The temporal properties of spoken language reflect INFORMATION contained in the speech signal. PROSODY is the most sensitive LINGUISTIC reflection of INFORMATION. Much of the temporal variation in spoken language reflects prosodic factors. Hence, prosody is the key to understanding much of the temporal (and phonetic) variation observed in spoken language. Prosody shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation). Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal. Conclusions and Summary

49 That’s All Many Thanks for Your Time and Attention

50 Language - A Syllable-Centric Perspective An empirically grounded perspective of spoken language focuses on the SYLLABLE and Syllabic ACCENT as the interface between “sound” and “meaning” (or at least lexical form) [Figure: linguistic tiers for the word “Seven”: modes of analysis (energy, time-frequency), prosodic accent, phonetic interpretation, manner segmentation (Fric, Voc, V, Nas, J), and word]

51 Syllable as Interface between Sound & Meaning The syllable serves as a key organizational unit that binds the lower and higher tiers of linguistic organization There is a systematic relationship between the syllable and the articulatory-acoustic features comprising phonetic constituents Moreover, the syllable is the primary carrier of prosodic information and is linked to morphology and the lexicon as well

52 These slow modulation patterns are DIFFERENTIALLY distributed across the acoustic frequency spectrum The modulation spectra are similar (in certain respects) across frequency But vary in certain important ways …. Modulation Spectra Across Frequency

53 Modulation Spectrum Varies Across Frequency In Houtgast and Steeneken’s original formulation of the STI, the modulation spectrum was assumed to be similar across the acoustic frequency axis An analysis of spoken English (in this instance, TIMIT sentences) suggests that their formulation was not quite accurate for the high-frequency channels, as shown below: the highest channels have considerable energy between 10 and 30 Hz

54 Summary of the Presentation Low-frequency modulation patterns reflect SYLLABLES, as well as their specific content and structure

55 Syllable as Interface between Sound & Meaning The syllable serves as a key organizational unit that binds the lower and higher tiers of linguistic organization There is a systematic relationship between the syllable and the articulatory-acoustic features comprising phonetic constituents Moreover, the syllable is the primary carrier of prosodic information and is linked to morphology and the lexicon as well

56 Summary of the Presentation Such temporal properties reflect a basic sensory-motor time constant of ca. 200 ms – the SAMPLING RATE of CONSCIOUSNESS

57 Modulation Spectrum as Predictor of Intelligibility In the 1970s, Houtgast and Steeneken demonstrated that the magnitude of the modulation spectrum could be used to predict speech intelligibility over a wide range of acoustic environments In optimum listening conditions, the modulation spectrum has a peak between 4 and 5 Hz, as shown below In highly reverberant environments, the modulation spectrum’s peak is attenuated and shifted down to ca. 2 Hz, and the signal becomes increasingly unintelligible [Figure: modulation spectrum under clean and reverberant conditions, based on an illustration by Hynek Hermansky]

58 In face-to-face interaction the visual component of the speech signal can be extremely important for understanding spoken language (particularly in noisy and/or reverberant conditions) It is therefore of interest to ascertain the brain’s tolerance for asynchrony between the audio and visual components of the speech signal This exercise can also provide potentially illuminating insights into the nature of the neural mechanisms underlying speech comprehension Specifically, the contribution of speechreading cues can provide clues about what REALLY is IMPORTANT in the speech signal for INTELLIGIBILITY that is independent of the sensory modality involved Audio-Visual Integration of Speech

59 In Conclusion ….

60 Language - A Syllable-Centric Perspective A more empirically grounded perspective of spoken language focuses on the SYLLABLE as the interface between “sound,” “vision,” and “meaning” Important linguistic information is embedded in the TEMPORAL DYNAMICS of the speech signal (irrespective of the modality)

61 Germane Publications
Arai, T. and Greenberg, S. (1998) Speech intelligibility in the presence of cross-channel spectral asynchrony. IEEE International Conference on Acoustics, Speech and Signal Processing, Seattle, pp. 933-936.
Grant, K. and Greenberg, S. (2001) Speech intelligibility derived from asynchronous processing of auditory-visual information. Proceedings of the ISCA Workshop on Audio-Visual Speech Processing (AVSP-2001), pp. 132-137.
Greenberg, S. and Arai, T. (1998) Speech intelligibility is highly tolerant of cross-channel spectral asynchrony. Proceedings of the Joint Meeting of the Acoustical Society of America and the International Congress on Acoustics, Seattle, pp. 2677-2678.
Greenberg, S., Arai, T. and Silipo, R. (1998) Speech intelligibility derived from exceedingly sparse spectral information. Proceedings of the International Conference on Spoken Language Processing, Sydney, pp. 74-77.
Greenberg, S. (1996) Understanding speech understanding - towards a unified theory of speech perception. Proceedings of the ESCA Tutorial and Advanced Research Workshop on the Auditory Basis of Speech Perception, Keele, England, pp. 1-8.
Silipo, R., Greenberg, S. and Arai, T. (1999) Temporal constraints on speech intelligibility as deduced from exceedingly sparse spectral representations. 6th European Conference on Speech Communication and Technology (Eurospeech-99), pp. 2687-2690.
http://www.icsi.berkeley.edu/~steveng

62 Syllables rise and fall in energy over the course of their duration Vocalic nuclei are highest in amplitude Onset consonants gradually rise in energy, arching towards the peak Coda consonants decline in amplitude, usually more abruptly than onsets The Energy Arc Illustrated [Figure: spectrogram + waveform and spectro-temporal profile (STeP) of “seven”]

63 Spectrally sparse audio and speech-reading information provide minimal intelligibility when presented alone, in the absence of the other modality. This same information can, when combined across modalities, provide good intelligibility (63% average accuracy). When the audio signal leads the video, intelligibility falls off rapidly as a function of modality asynchrony. When the video signal leads the audio, intelligibility is maintained for asynchronies as long as 200 ms. For eight out of nine subjects, the highest intelligibility is associated with conditions in which the video signal leads the audio (often by 80-120 ms). There are many potential interpretations of the data. The interpretation currently favored (by the speaker) posits a relatively long (200 ms) integration buffer for audio-visual integration when the brain is confronted exclusively (even for short intervals) with speech-reading information (as occurs when the video signal leads the audio). The data further suggest that place-of-articulation cues evolve over syllabic intervals of ca. 200 ms in length and could therefore potentially apply to models of speech processing in general. Speechreading also appears to provide important prosodic information that is extremely useful for decoding the speech signal. Audio-Video Integration – Summary

64 The temporal properties of spoken language reflect INFORMATION contained in the speech signal. PROSODY is the most sensitive LINGUISTIC reflection of INFORMATION. Much of the temporal variation in spoken language reflects prosodic factors. Hence, prosody is the key to understanding much of the temporal (and phonetic) variation observed in spoken language. Prosody is what likely shields the information contained in the speech signal against the deleterious forces of nature (a.k.a. background noise and reverberation). Therefore, to understand spoken language, it is also necessary to understand how prosody is encapsulated in the speech signal. This is the focus for today’s presentation. But, before considering prosody, let’s first examine an important acoustic property of the speech signal …. Take Home Messages

