
© 2005, IT - Instituto de Telecomunicações. All rights reserved.
CHARACTERIZATION OF HESITATIONS USING ACOUSTIC MODELS
Arlindo Veiga 1,2, Sara Candeias 1, Carla Lopes 1,2, Fernando Perdigão 1,2
1 Instituto de Telecomunicações, Polo de Coimbra, Portugal
2 Universidade de Coimbra, DEEC, Portugal
ICPhS XVII - 17th International Congress of Phonetic Sciences, Aug. 2011, Hong Kong, China

2 SUMMARY
Introduction
Problem Statement
Goal
Filled Pauses (FPs) and Extensions (EXs) Corpus
Data Analysis and Results
Conclusions

3 INTRODUCTION
Spontaneous speech is full of hesitations: hhh… er… erm… well… ah… you know…
The speaker wants to continue 'speaking'.

4 INTRODUCTION
Spontaneous speech is full of hesitations: yes, yes… yes, I do
To reinforce the message.

5 INTRODUCTION
Spontaneous speech is full of hesitations: I spea… will speak Chinese one day
To correct the message.

6 INTRODUCTION
Spontaneous speech is full of hesitations: What can I+++ … say?
Time-related (to gain time).

7 PROBLEM STATEMENT
Hesitation events can be used (among other things) to:
identify the idiosyncrasies of speakers
improve the performance of automatic speech recognition (ASR) systems
However, the presence of hesitations in speech signals negatively affects the performance of ASR systems.
Solution?

8 GOAL
Solution? Identify and annotate hesitation phenomena.
How? By studying their acoustic-phonetic properties.

9 GOAL
How? By studying their acoustic-phonetic properties.
What properties?
pitch
energy
spectral characteristics
durational characteristics

10 FILLED PAUSES AND EXTENSION CORPUS
The study concentrated on both:
filled pauses (FPs)
extensions (EXs)
FPs comprise all sounds that phonetically belong to the Portuguese language but do not occur in the context of a complete word (e.g., uum, aaa, eee).
EXs relate to phonetic prolongations in both functional and lexical words (e.g., the [ɐ] in … or the [u] in …).

11 FILLED PAUSES AND EXTENSION CORPUS
The study concentrated on both:
filled pauses (FPs)
extensions (EXs)
A large number of examples is necessary, but there is no public European Portuguese database with this kind of event annotated!
Solution? Create one.

12 FILLED PAUSES AND EXTENSION CORPUS
We collected podcasted television news: around 22 hours of non-annotated speech.
Annotating FPs and EXs by an expert is a time-consuming task!
An automatic speech recognition system was used to help locate the filled pauses and extensions.
The detected events were then manually validated.

13 AUTOMATIC HESITATION DETECTOR
The semi-automatic procedure reduced the duration of the annotation task by a factor of at least 4 compared with a completely manual annotation process.

14 AUTOMATIC HESITATION DETECTOR
Detection steps (input: multimedia from podcast):
Extract audio stream
Convert to PCM, 16 kHz, 16 bits per sample, mono
Compute acoustic features
Silence segmentation based on energy
Phone decoding of non-silence segments (decoding task grammar: loop over phones 1–39)
Select hesitation candidates based on patterns and duration
Confidence measure to reduce candidates
Manual confirmation
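The energy-based silence segmentation step above can be sketched as follows. This is a minimal illustration in NumPy, not the authors' implementation; the frame length (25 ms), hop (10 ms) and the -40 dB relative threshold are assumed values:

```python
import numpy as np

def silence_segmentation(signal, sr=16000, frame_ms=25, hop_ms=10, threshold_db=-40.0):
    """Mark each frame as speech (True) or silence (False) by log energy."""
    frame_len = int(sr * frame_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)           # 160 samples -> 100 frames/s
    n_frames = 1 + max(0, (len(signal) - frame_len) // hop)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * hop: i * hop + frame_len]
        energies[i] = 10.0 * np.log10(np.sum(frame ** 2) + 1e-12)
    # Threshold relative to the loudest frame in the recording.
    return energies > (energies.max() + threshold_db)

# Example: 1 s of very quiet noise with a louder "speech" burst in the middle.
rng = np.random.default_rng(0)
x = 0.001 * rng.standard_normal(16000)
x[6000:10000] += np.sin(2 * np.pi * 200 * np.arange(4000) / 16000)
mask = silence_segmentation(x)
```

Only the frames overlapping the burst are marked as speech; in the real pipeline these speech regions would then be passed on to the phone decoder.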

15 AUTOMATIC HESITATION DETECTOR
Phone acoustic models based on Hidden Markov Models (HMMs):
3-state, left-to-right topology
PDFs with 96-component Gaussian mixtures
Features: 12 Mel-frequency cepstral coefficients (MFCCs) + log energy
First- and second-order regression coefficients (deltas and delta-deltas)
Frame rate: 100 Hz
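The first- and second-order regression coefficients mentioned above follow the standard delta formula d_t = Σ_n n(c_{t+n} − c_{t−n}) / (2 Σ_n n²); a minimal NumPy sketch (the ±2-frame window is an assumption, matching the common HTK default):

```python
import numpy as np

def deltas(features, N=2):
    """Regression (delta) coefficients over a +/-N frame window.
    features: (n_frames, n_coeffs) array, e.g. MFCCs + log energy.
    Applying it twice yields the delta-deltas."""
    T = features.shape[0]
    denom = 2 * sum(n * n for n in range(1, N + 1))
    # Repeat the edge frames so every frame has a full window.
    padded = np.pad(features, ((N, N), (0, 0)), mode="edge")
    out = np.zeros_like(features, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n: N + n + T] - padded[N - n: N - n + T])
    return out / denom

# A linearly rising "coefficient": interior frames get slope 1.0 per frame.
feats = np.arange(10, dtype=float).reshape(-1, 1)
d = deltas(feats)
dd = deltas(d)   # delta-deltas
```

The delta-deltas of a linear ramp are near zero away from the edges, as expected for a constant slope.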

16 FILLED PAUSES AND EXTENSION CORPUS
The obtained FP and EX corpus includes about 800 event annotations.
EXs are more frequent than FPs (62% vs. 38%).
There are 15 different labels for FPs, mainly [ə] and [ɐ] (17.8%, 66.4%).

17 FILLED PAUSES AND EXTENSION CORPUS
The obtained FP and EX corpus includes about 800 event annotations.
The most frequent EX is [ə], followed by [ɐ] and [u]. Extension of [i] is also common in spontaneous Portuguese.
The open and open-mid vowels, such as [a], [ɛ] and [o], were not so frequent.

18 FILLED PAUSES AND EXTENSION CORPUS
An interesting fact is the lengthening of diphthongs (both oral and nasal); the most frequent is [ɐ̃w̃].
We verified that EXs occur mainly in prepositions and on the last syllable.
Sometimes the difference between an FP and an EX is not obvious and can be distinguished only from the phonetic context.

19 DATA ANALYSIS AND RESULTS
From the hesitation events we computed several acoustic parameters to characterize hesitation segments, namely F0 (pitch), energy and spectrum.

20 DATA ANALYSIS AND RESULTS
The gradients of F0 and energy are, most of the time, negative, which means that they decay smoothly during hesitations.

21 DATA ANALYSIS AND RESULTS
However, these values vary little: the standard deviation of F0 is on average around 15 Hz (mean 128 Hz); the standard deviation of energy is around 2.7 dB (mean 16 dB).
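The gradient and standard-deviation statistics above can be obtained per hesitation segment, for instance with a least-squares line fit to the contour. An illustrative NumPy sketch (the slides do not specify the exact estimator, and the example contour values are invented):

```python
import numpy as np

def contour_stats(contour, frame_rate=100.0):
    """Slope (units per second, via least-squares line fit),
    mean and standard deviation of an F0 or energy contour."""
    t = np.arange(len(contour)) / frame_rate
    slope = float(np.polyfit(t, contour, 1)[0])
    return slope, float(np.mean(contour)), float(np.std(contour))

# A gently falling F0 contour over 0.5 s at 100 frames/s,
# loosely in the range reported for hesitations.
t = np.arange(50) / 100.0
f0 = 140.0 - 30.0 * t   # falls at 30 Hz/s
slope, mean, std = contour_stats(f0)
```

A negative slope here corresponds to the smoothly decaying F0 reported for the hesitation segments.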

22 DATA ANALYSIS AND RESULTS
The parameter based on the standard deviation of spectral band energies shows a similar behavior.
Mean duration is around 0.52 seconds with a standard deviation of 0.16.

23 DATA ANALYSIS AND RESULTS
We also observed that these characteristics do not separate FP from EX hesitations well, which agrees with the fact that, perceptually, their distinction is also ambiguous without context.

24 CONCLUSIONS
Automatic detection of filled pauses and extensions in spontaneous speech can be done using a phone recognizer. Although it is not an optimal method, it proved useful for semi-automatic annotation.
The detected events were characterized phonetically and acoustically.
In the near future we intend to explore hesitations within utterances using the three-region surface structure.
再见 (Goodbye)