Stress-Accent and Vowel Quality in The Switchboard Corpus Steven Greenberg and Leah Hitchcock International Computer Science Institute 1947 Center Street,

Presentation transcript:

Stress-Accent and Vowel Quality in The Switchboard Corpus
Steven Greenberg and Leah Hitchcock
International Computer Science Institute, 1947 Center Street, Berkeley, CA
NIST Workshop on Large Vocabulary Continuous Speech Recognition, Maritime Institute of Technology, May 4, 2001

Take Home Messages
There is an intimate relationship between vocalic identity, nucleic duration and stress accent in spontaneous dialogue (at least in the Switchboard corpus)
Stressed syllables tend to have significantly longer nuclei than their unstressed counterparts, consistent with the findings reported by Silipo and Greenberg in previous years’ meetings regarding the OGI Stories corpus (telephone monologues)
Certain vocalic classes exhibit a far greater dynamic range in duration than others
– Diphthongs tend to be longer than monophthongs, BUT ...
– The low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of duration and dynamic range under stress (accent) similar to diphthongs
The statistical patterns are consistent with the hypothesis that duration serves under many conditions as either a primary or secondary cue for vowel height (normally associated with the frequency of the first formant)

Take Home Messages (continued)
Moreover, the stress-accent system in spontaneous (American) English appears to be closely associated with vocalic identity
– Low vowels are far more likely to be fully stressed than high vowels (with the mid vowels exhibiting an intermediate probability of being stressed)
– Thus, the identity of a vowel cannot be considered independently of stress-accent
– The two parameters are likely to be flip sides of the same Koine
Although English is not generally considered to be a vowel-quantity language (as is Finnish), given the close relationship between stress-accent and duration, and between duration and vowel quality, there is some sense in which English (and perhaps other stress-accent languages) manifests certain properties of a “quantity” system
Thus, vowel duration may be an important factor in disambiguating spoken language and therefore should be of interest to the speech recognition community

What is (usually) Meant by Prosodic Stress?
Prosody is supposed to pertain to extra-phonetic cues in the acoustic signal
The pattern of variation over a sequence of SYLLABLES pertaining to:
– syllabic DURATION, AMPLITUDE and PITCH (f0) variation over time (but the plot thickens, as we shall see)

Why is Prosodic Stress Important?
It supposedly provides important information about:
– Focus of the speaker’s attention and emphasis for the listener
– What is “new” and “important” information
– Emotional context of the utterance - surprise, sarcasm, shock, delight, anger, impatience, etc.
– Syntactic disambiguation, particularly at the clausal/sentential level, e.g., interrogative, declarative forms
– Perceptual processing - parsing the utterance into “chunks” for reliable understanding
Prosody provides a window onto the higher levels of language
– Can be useful for developing semantic-oriented models for speech understanding (“Information spotting”)
Prosody affects pronunciation (and vice versa)
– Can be useful for modeling pronunciation variation in ASR
Phonetic properties may be correlated with prosodic stress - THIS IS THE TOPIC FOR TODAY’S PRESENTATION

The Nitty Gritty (a.k.a. the Corpus Material)
SWITCHBOARD PHONETIC TRANSCRIPTION CORPUS (same as Phoneval-2000)
– Switchboard contains informal telephone dialogues
– 54 minutes of material that had previously been phonetically transcribed (by highly trained phonetics students from UC-Berkeley)
– 45.5 minutes of “pure” speech (filled pauses, junctures filtered out), consisting of: 9,991 words, 13,446 syllables, 33,370 phonetic segments
– All of this material had been hand-segmented at either the phonetic-segment or syllabic level by the transcribers
– The syllabic-segmented material was subsequently segmented at the phonetic-segment level by a special-purpose neural network trained on 72 minutes of hand-segmented Switchboard material. This automatic segmentation was manually verified
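The totals on this slide imply a few average rates that help calibrate the duration figures discussed later. The following is a minimal arithmetic sketch using only the numbers quoted above; the function and key names are illustrative, not part of the original presentation.

```python
# Totals quoted on the corpus slide ("pure" speech only).
PURE_SPEECH_MINUTES = 45.5
WORDS = 9_991
SYLLABLES = 13_446
SEGMENTS = 33_370


def corpus_rates(minutes, words, syllables, segments):
    """Return simple averages derived from the corpus totals."""
    seconds = minutes * 60.0
    return {
        "syllables_per_word": syllables / words,
        "segments_per_syllable": segments / syllables,
        "syllables_per_second": syllables / seconds,
        "mean_syllable_ms": 1000.0 * seconds / syllables,
    }


rates = corpus_rates(PURE_SPEECH_MINUTES, WORDS, SYLLABLES, SEGMENTS)
for name, value in rates.items():
    print(f"{name}: {value:.2f}")
```

These are averages over the whole sample and say nothing about how duration is distributed across stressed and unstressed syllables.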

Evaluation Material Details
[Figures: Number of Utterances By Subjective Difficulty; Number of Utterances By Dialect Region]
AN EQUAL BALANCE OF MALE AND FEMALE SPEAKERS
BROAD DISTRIBUTION OF UTTERANCE DURATIONS
– 2-4 sec - 40%, 4-8 sec - 50%, 8-17 sec - 10% (mean = 4.75 s)
COVERAGE OF ALL (7) U.S. DIALECT REGIONS IN SWITCHBOARD
A WIDE RANGE OF DISCUSSION TOPICS
VARIABILITY IN DIFFICULTY (VERY EASY TO VERY HARD)

Manual Transcription of Stress Accent
Two UC-Berkeley Linguistics students each transcribed the full 45 minutes of material (i.e., there is 100% overlap between the two)
Three levels of stress-accent were marked for each syllabic nucleus
– Fully stressed (78% concordance between transcribers)
– Completely unstressed (85% interlabeler agreement)
– An intermediate level of accent (neither fully stressed, nor completely unstressed) (ca. 60% concordance)
– Hence, 95% concordance in terms of some level of stress
The labels of the two transcribers were averaged
– In those instances where there was disagreement, the magnitude of disparity was almost always (ca. 90%) one step. Usually, disagreement signaled a genuine ambiguity in stress accent
The illustrations in this presentation are based solely on those data in which both transcribers concurred (i.e., fully stressed or completely unstressed)
A table containing the complete set of data is in a paper submitted to Eurospeech (in the workshop notebook)
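The averaging and agreement bookkeeping described on this slide can be sketched as follows. The numeric coding (0, 0.5, 1), the label names and the toy label lists are assumptions made for illustration; the slides do not say how the three levels were coded or exactly how concordance was computed.

```python
# Assumed numeric coding of the three stress-accent levels (hypothetical).
LEVELS = {"unstressed": 0.0, "intermediate": 0.5, "full": 1.0}


def combine_labels(labels_a, labels_b):
    """Average two transcribers' labels and summarise their agreement."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both transcribers must label the same nuclei")
    pairs = list(zip(labels_a, labels_b))
    averaged = [(LEVELS[a] + LEVELS[b]) / 2 for a, b in pairs]
    disagreements = [(a, b) for a, b in pairs if a != b]
    # A "one step" disparity is a difference of one level (0.5 in this coding).
    one_step = sum(abs(LEVELS[a] - LEVELS[b]) == 0.5 for a, b in disagreements)
    return {
        "averaged": averaged,
        "agreement": 1 - len(disagreements) / len(pairs),
        "one_step_fraction": one_step / len(disagreements) if disagreements else None,
        # Nuclei usable for the plots: both transcribers concur on an extreme.
        "concurring": [i for i, (a, b) in enumerate(pairs)
                       if a == b and a != "intermediate"],
    }


# Toy labels, invented purely to exercise the function.
transcriber_1 = ["full", "unstressed", "intermediate", "full", "unstressed"]
transcriber_2 = ["full", "unstressed", "full", "unstressed", "unstressed"]
summary = combine_labels(transcriber_1, transcriber_2)
print(summary)
```

The `concurring` list mirrors the slide's restriction of the illustrations to nuclei that both transcribers marked as fully stressed or completely unstressed.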

The “Conventional Wisdom” on Stress-Accent
"Pitch is widely regarded, at least in English, as the most salient determinant of prominence. In other words, when a syllable or word is perceived as 'stressed' or 'emphasized,' it is pitch height or a change in pitch, more than length or loudness that is likely to be mainly responsible (see, for example, Fry 1958, Gimson 1980, pp , Lehiste 1976, Fudge, 1984, ch. 1)" Clark, J. and Yallop, C. (1990) An Introduction to Phonetics and Phonology. Oxford, Blackwell, p
"In fact, although it is clear that stressed syllables often have greater overall acoustic intensity than weakly stressed ones, loudness seems to be the least salient and least consistent of the three parameters of pitch, duration and loudness - at least for purposes such as signaling stress" (ibid, p. 282)
Thus, according to the “general consensus” the important parameters are (in order) - PITCH, DURATION, LOUDNESS (the latter most closely correlated with TOTAL ENERGY, i.e., duration x amplitude, cf. further on)

OGI Stories - Pitch Doesn’t Cut the Mustard
Although pitch range is the most important of the f0-related cues, it is not as good a predictor of stress as DURATION
[Figure panels: Duration, Amplitude, Pitch Range, Av. Pitch]

Total Energy is the Best Predictor of Stress
Duration x Amplitude is superior to all other combination pairs of acoustic parameters. Pitch appears redundant with duration.
[Figure labels: Duration x Amplitude, Dur x Pitch Range, Duration, Dur x Pitch Av, Pitch Range x Average, Pitch Av x Amp, Pitch Range x Amp]
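To make the comparison concrete, here is a minimal sketch of how one might compare a single cue (duration) with the product cue (duration x amplitude) as predictors of stress. The threshold-based scoring and every number in the toy data are assumptions invented for illustration; they are not the OGI Stories measurements or the method used in the study, and they demonstrate only the mechanics.

```python
def integrated_energy(duration_ms, amplitude):
    """Total (integrated) energy as described on the slide: duration x amplitude."""
    return duration_ms * amplitude


def best_threshold_accuracy(values, stressed):
    """Best accuracy obtainable by calling a nucleus stressed when value >= threshold."""
    best = 0.0
    for threshold in sorted(set(values)):
        correct = sum((v >= threshold) == s for v, s in zip(values, stressed))
        best = max(best, correct / len(values))
    return best


# Invented toy nuclei: duration in ms, amplitude in arbitrary units, stress label.
durations = [150, 60, 120, 50, 90, 100]
amplitudes = [0.8, 0.5, 0.6, 0.6, 0.9, 0.4]
stressed = [True, False, True, False, True, False]

energies = [integrated_energy(d, a) for d, a in zip(durations, amplitudes)]
acc_duration = best_threshold_accuracy(durations, stressed)
acc_energy = best_threshold_accuracy(energies, stressed)
print(f"duration alone: {acc_duration:.2f}, duration x amplitude: {acc_energy:.2f}")
```

On real data one would score each cue and cue pair on held-out material rather than on the data used to pick the threshold.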

A Brief Primer on Vocalic Acoustics
Vowel quality is generally thought to be a function primarily of two articulatory properties - both related to the motion of the tongue
– The front-back plane is most closely associated with the second formant frequency (or more precisely F2 - F1) and the volume of the front-cavity resonance
– The height parameter is closely linked to the frequency of F1
In the classic vowel “triangle” segments are positioned in terms of the tongue positions associated with their production, as follows:
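The two acoustic correlates named above can be turned into chart coordinates directly. The sketch below assumes rough, textbook-style formant values for three corner vowels; those values and the function name are illustrative assumptions, not data from the presentation.

```python
# Rough illustrative formant values in Hz (assumed, not taken from the slides).
FORMANTS_HZ = {
    "iy": (270, 2290),   # high front
    "aa": (730, 1090),   # low
    "uw": (300, 870),    # high back
}


def chart_position(f1_hz, f2_hz):
    """Return (frontness, height_cue): frontness = F2 - F1, height cue = F1.

    A larger F2 - F1 corresponds to a fronter vowel; a larger F1 to a lower vowel.
    """
    return (f2_hz - f1_hz, f1_hz)


positions = {vowel: chart_position(f1, f2) for vowel, (f1, f2) in FORMANTS_HZ.items()}
most_front = max(positions, key=lambda v: positions[v][0])
lowest = max(positions, key=lambda v: positions[v][1])
print(positions, most_front, lowest)
```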

Duration/Amplitude/Int. Energy - Which?
There are supposed to be large differences in the “intrinsic” amplitude and duration of vowels
Could such differences be compensated for in terms of stress?
Let’s take a closer look!

Amplitude Differences - Stressed/Unstressed
There are very small differences in amplitude between stressed and unstressed nuclei
The lax monophthongs tend to have a slightly larger dynamic range than diphthongs

Durational Differences - Stressed/Unstressed
There is a large dynamic range in duration between stressed and unstressed nuclei
Diphthongs and tense, low monophthongs tend to have a larger range than the lax monophthongs

Int. Energy Differences - Stressed/Unstressed
There is a large dynamic range in integrated energy between stressed and unstressed nuclei
Diphthongs and tense, low monophthongs tend to have a larger range than the lax monophthongs
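The stressed/unstressed comparisons on the last three slides amount to a per-vowel contrast between two group means. Here is a minimal sketch, assuming "dynamic range" is taken as mean stressed value minus mean unstressed value (the slides do not define it precisely); the record layout and the toy durations are invented for illustration.

```python
from collections import defaultdict
from statistics import mean


def dynamic_range(nuclei):
    """Per-vowel difference between mean stressed and mean unstressed values.

    `nuclei` is an iterable of (vowel, is_stressed, value) records, where the
    value may be duration, amplitude or integrated energy.
    """
    groups = defaultdict(lambda: {True: [], False: []})
    for vowel, is_stressed, value in nuclei:
        groups[vowel][is_stressed].append(value)
    return {
        vowel: mean(g[True]) - mean(g[False])
        for vowel, g in groups.items()
        if g[True] and g[False]  # need both conditions to form a contrast
    }


# Invented toy durations in ms, only to exercise the function.
toy_nuclei = [
    ("ay", True, 180), ("ay", True, 200), ("ay", False, 90),
    ("ih", True, 70), ("ih", False, 50),
    ("uw", True, 110),  # no unstressed token, so no range is reported
]
ranges = dynamic_range(toy_nuclei)
print(ranges)
```

A ratio of the two means would be an equally reasonable definition; the difference is used here only to keep the example short.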

Spatial Patterning of Duration and Amplitude
Let’s return to the vowel triangle and see if it can shed light on certain patterns in the vocalic data
The duration, amplitude (and their product, integrated energy) will be plotted on a 2-D grid, where the x-axis will always be in terms of hypothetical front-back tongue position (and hence remain a constant throughout the plots to follow)
The y-axis will serve as the dependent measure, sometimes expressed in terms of duration, or amplitude, or their product

[Figure] Diphthongal Amplitude and Vowel Height: all nuclei

[Figure] Monophthongal Amplitude and Vowel Height: all nuclei

[Figure] Amplitude - Monophthongs vs. Diphthongs: all nuclei (panels: Diphthongs, Monophthongs)

[Figure] Diphthongal Duration and Vowel Height: all nuclei

[Figure] Monophthongal Duration and Vowel Height: all nuclei

[Figure] Duration - Monophthongs vs. Diphthongs: all nuclei (panels: Diphthongs, Monophthongs)

[Figure] Diphthongal Int. Energy and Vowel Height: all nuclei

[Figure] Monophthongal Int. Energy and Vowel Height: all nuclei

[Figure] Int. Energy - Monophthongs vs. Diphthongs: all nuclei (panels: Diphthongs, Monophthongs)

[Figure] Diphthongal Amplitude and Vowel Height: stressed nuclei

[Figure] Diphthongal Amplitude and Vowel Height: unstressed nuclei

[Figure] Monophthongal Amplitude and Vowel Height: stressed nuclei

[Figure] Monophthongal Amplitude and Vowel Height: unstressed nuclei

[Figure] Amplitude - Monophthongs vs. Diphthongs (panels: Stressed, Unstressed; Diphthongs, Monophthongs)

[Figure] Diphthongal Duration and Vowel Height: stressed nuclei

[Figure] Diphthongal Duration and Vowel Height: unstressed nuclei

[Figure] Monophthongal Duration and Vowel Height: stressed nuclei

[Figure] Monophthongal Duration and Vowel Height: unstressed nuclei

[Figure] Duration - Monophthongs vs. Diphthongs (panels: Stressed, Unstressed; Diphthongs, Monophthongs)

[Figure] Diphthongal Int. Energy and Vowel Height: stressed nuclei

[Figure] Diphthongal Int. Energy and Vowel Height: unstressed nuclei

[Figure] Monophthongal Int. Energy and Vowel Height: stressed nuclei

[Figure] Monophthongal Int. Energy and Vowel Height: unstressed nuclei

[Figure] Int. Energy - Monophthongs vs. Diphthongs (panels: Stressed, Unstressed; Diphthongs, Monophthongs)

Mystery Parameter
There is one other parameter which, when plotted in a vowel triangle plot, shows an interesting pattern
This is the proportion of stressed and unstressed nuclei
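The "mystery parameter" is a simple per-vowel proportion. A minimal sketch of the computation, assuming each nucleus is available as a (vowel, is_stressed) pair restricted to tokens on which both transcribers concurred; the toy tokens are invented and carry no information about the real distribution.

```python
from collections import Counter


def stressed_proportion(nuclei):
    """Fraction of each vowel's nuclei that are fully stressed."""
    totals, stressed = Counter(), Counter()
    for vowel, is_stressed in nuclei:
        totals[vowel] += 1
        if is_stressed:
            stressed[vowel] += 1
    return {vowel: stressed[vowel] / totals[vowel] for vowel in totals}


# Invented toy tokens, only to exercise the function.
toy_tokens = [
    ("aa", True), ("aa", True), ("aa", False),
    ("ih", True), ("ih", False), ("ih", False), ("ih", False),
]
proportions = stressed_proportion(toy_tokens)
print(proportions)
```

Plotting these proportions on the y-axis against front-back position gives the kind of vowel-triangle view used on the next slide.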

Proportion of Stress Accent and Vowel Height

[Figure] Amplitude - Monophthongs vs. Diphthongs: all nuclei (panels: Diphthongs, Monophthongs)

[Figure] Duration - Monophthongs vs. Diphthongs: all nuclei (panels: Diphthongs, Monophthongs)

[Figure] Int. Energy - Monophthongs vs. Diphthongs: all nuclei (panels: Diphthongs, Monophthongs)

Summary and Conclusions
There is an intimate relationship between vocalic identity, nucleic duration and stress accent in spontaneous dialogue (at least in the Switchboard corpus)
Stressed syllables tend to have significantly longer nuclei than their unstressed counterparts, consistent with the findings reported by Silipo and Greenberg in previous years’ meetings regarding the OGI Stories corpus (telephone monologues)
Certain vocalic classes exhibit a far greater dynamic range in duration than others
– Diphthongs tend to be longer than monophthongs, BUT ...
– The low monophthongs ([ae], [aa], [ay], [aw], [ao]) exhibit patterns of duration and dynamic range under stress (accent) similar to diphthongs
The statistical patterns are consistent with the hypothesis that duration serves under many conditions as either a primary or secondary cue for vowel height (normally associated with the frequency of the first formant)

Summary and Conclusions (continued)
Moreover, the stress-accent system in spontaneous (American) English appears to be closely associated with vocalic identity
– Low vowels are far more likely to be fully stressed than high vowels (with the mid vowels exhibiting an intermediate probability of being stressed)
– Thus, the identity of a vowel cannot be considered independently of stress-accent
Thus, vowel duration may be an important factor in disambiguating spoken language and therefore should be of interest to the speech recognition community