
1 CHARACTERIZATION OF HESITATIONS USING ACOUSTIC MODELS. Arlindo Veiga (1,2), Sara Cadeias (1), Carla Lopes (1,2), Fernando Perdigão (1,2). (1) Instituto de Telecomunicações, Polo de Coimbra, Portugal; (2) Universidade de Coimbra, DEEC, Portugal. ICPhS XVII, 17th International Congress of Phonetic Sciences, Aug. 17-21, 2011, Hong Kong, China. © 2005, it - instituto de telecomunicações. All rights reserved.

2 SUMMARY Introduction; Problem Statement; Goal; Filled Pauses (FPs) and Extensions (EXs) Corpus; Data Analysis and Results; Conclusions.

3 INTRODUCTION Spontaneous speech is full of hesitations. Example: hhh… er… erm… well… ah… you know... Function: the speaker wants to continue ‘speaking’.

4 INTRODUCTION Spontaneous speech is full of hesitations. Example: yes, yes… yes, I do. Function: to reinforce the message.

5 INTRODUCTION Spontaneous speech is full of hesitations. Example: I spea… will speak Chinese one day. Function: to correct the message.

6 INTRODUCTION Spontaneous speech is full of hesitations. Example: What can I+++ … say? Function: time related.

7 PROBLEM STATEMENT Hesitation events can be used, among other things, to identify the idiosyncrasies of speakers and to improve the performance of automatic speech recognition (ASR) systems. The presence of hesitations in speech signals negatively affects the performance of ASR systems. Solution?

8 GOAL Solution? Identify and annotate hesitation phenomena. How? By studying their acoustic-phonetic properties.

9 GOAL How? By studying their acoustic-phonetic properties. What properties? Pitch, energy, spectral characteristics, and durational characteristics.

10 FILLED PAUSES AND EXTENSION CORPUS The study concentrated on both filled pauses (FPs) and extensions (EXs). FPs comprise all sounds that phonetically belong to the Portuguese language but do not occur in the context of a complete word (e.g., uum, aaa, eee). EXs are phonetic prolongations in both function and lexical words (e.g. [ɐ] in or the [u] in ).

11 FILLED PAUSES AND EXTENSION CORPUS The study concentrated on both filled pauses (FPs) and extensions (EXs). A large number of examples is necessary, but there is no public European Portuguese database with this kind of annotated event. Solution? Create one.

12 FILLED PAUSES AND EXTENSION CORPUS We collected podcasted television news: around 22 hours of non-annotated speech. Annotation of FPs and EXs by an expert is a time-consuming task, so an automatic speech recognition system was used to help locate the filled pauses and extensions. The detected events were then manually validated.

13 AUTOMATIC HESITATION DETECTOR The semi-automatic procedure reduced the duration of the annotation task by a factor of at least 4 compared with the completely manual annotation process.

14 AUTOMATIC HESITATION DETECTOR Input: multimedia from podcast. Detection steps: (1) extract the audio stream; (2) convert to PCM, 16 kHz, 16 bits per sample, mono; (3) compute acoustic features; (4) silence segmentation based on energy; (5) phone decoding of non-silence segments; (6) select hesitation candidates based on patterns and duration; (7) apply a confidence measure to reduce the candidates; (8) manual confirmation. Figure: decoding task grammar with branches Phone 1, Phone 2, Phone 3, ..., Phone 38, Phone 39.
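Steps (4) and (6) above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Segment` type, the fixed energy threshold, the 0.2 s minimum duration, and the set of vowel-like labels are all hypothetical choices made for illustration, and the phone decoder itself (step 5) is assumed to exist elsewhere and to return labelled, timed segments.

```python
import math
from dataclasses import dataclass


@dataclass
class Segment:
    label: str    # "speech", "sil", or a phone symbol
    start: float  # seconds
    end: float    # seconds

    @property
    def duration(self) -> float:
        return self.end - self.start


def frame_log_energy(samples, frame_len=160):
    """Log energy in dB of consecutive non-overlapping frames.

    160 samples correspond to 10 ms at 16 kHz (100 frames per second).
    """
    energies = []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        power = sum(x * x for x in frame) / frame_len
        energies.append(10.0 * math.log10(power + 1e-10))  # floor avoids log(0)
    return energies


def segment_by_energy(energies, threshold_db, frame_dur=0.01):
    """Merge consecutive frames into 'speech' / 'sil' segments (assumed fixed threshold)."""
    segments = []
    for i, e in enumerate(energies):
        label = "speech" if e >= threshold_db else "sil"
        start = i * frame_dur
        if segments and segments[-1].label == label:
            segments[-1].end = start + frame_dur  # extend the current segment
        else:
            segments.append(Segment(label, start, start + frame_dur))
    return segments


def select_candidates(phones, vowel_like, min_dur=0.2):
    """Keep decoded phones that are vowel-like and at least `min_dur` seconds long."""
    return [p for p in phones if p.label in vowel_like and p.duration >= min_dur]


# Hypothetical decoder output for one non-silence segment.
decoded = [Segment("k", 0.00, 0.06), Segment("ɐ", 0.06, 0.58), Segment("s", 0.58, 0.66)]
candidates = select_candidates(decoded, {"ɐ", "ə", "u", "i"})
print([(c.label, round(c.duration, 2)) for c in candidates])
```

In a pipeline of this kind, the surviving candidates would then go through the confidence measure and the manual confirmation step, neither of which is modelled here.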

15 AUTOMATIC HESITATION DETECTOR Phone acoustic models based on hidden Markov models (HMMs). HMMs: 3 states, left-to-right topology, one PDF per state (PDF 1, PDF 2, PDF 3). PDFs with 96 Gaussian mixtures. Features: 12 Mel-frequency cepstral coefficients (MFCCs) + log energy, with first- and second-order regression coefficients (deltas and delta-deltas). Frame rate: 100 Hz.
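The feature layout on this slide can be sketched in Python as follows. The slide does not give the regression window or the total dimension. The sketch assumes the common regression formula d_t = Σ n·(c_{t+n} − c_{t−n}) / (2·Σ n²) with a window of 2 frames, edge frames replicated, and deltas applied to all 13 static features, which yields 13 × 3 = 39 values per frame. The function names are hypothetical.

```python
def deltas(frames, window=2):
    """Regression coefficients over +/- `window` frames; edge frames are replicated."""
    n_frames = len(frames)
    dim = len(frames[0])
    denom = 2.0 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        row = []
        for k in range(dim):
            num = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                num += n * (ahead - behind)
            row.append(num / denom)
        out.append(row)
    return out


def add_dynamic_features(static):
    """Append deltas and delta-deltas to each static feature vector."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]


# Synthetic input: 10 frames of 13 static features (12 MFCCs + log energy),
# each rising by 1.0 per frame.
static = [[float(t)] * 13 for t in range(10)]
features = add_dynamic_features(static)
print(len(features), len(features[0]))
```

At a frame rate of 100 Hz, each row of `features` would describe 10 ms of signal.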

16 FILLED PAUSES AND EXTENSION CORPUS The obtained FP and EX corpus includes about 800 event annotations. EXs are more frequent than FPs (62% vs. 38%). There are 15 different labels for FPs, mainly [ə] and [ɐ] (17.8%, 66.4%).

17 FILLED PAUSES AND EXTENSION CORPUS The obtained FP and EX corpus includes about 800 event annotations. The most frequent EX is [ə], followed by [ɐ] and [u]. The extension of [i] is also common in spontaneous Portuguese. The open and open-mid vowels, such as [a], [ɛ] and [o], were not so frequent.

18 FILLED PAUSES AND EXTENSION CORPUS The obtained FP and EX corpus includes about 800 event annotations. An interesting fact is the lengthening of diphthongs (both oral and nasal); the most frequent is [ɐ̃w̃]. We have verified that EXs occur mainly in prepositions and on the last syllable. Sometimes the difference between FP and EX is not obvious and can be resolved only from the phonetic context.

19 DATA ANALYSIS AND RESULTS From the hesitation events we computed several acoustic parameters to characterize hesitation segments, namely F0 (pitch), energy and spectrum.

20 DATA ANALYSIS AND RESULTS The gradients of F0 and energy are negative most of the time, which means that both decay smoothly during hesitations.
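One way to obtain such a gradient is the least-squares slope of the per-frame track over the hesitation segment. The slides do not say how the gradient was computed, so the Python sketch below is only an assumption. The function name `slope` is hypothetical, the 100 Hz frame rate is taken from the feature description, and the F0 values are synthetic.

```python
def slope(values, frame_rate=100.0):
    """Least-squares slope of a per-frame track, in units per second."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two frames")
    times = [i / frame_rate for i in range(n)]
    t_mean = sum(times) / n
    v_mean = sum(values) / n
    num = sum((t - t_mean) * (v - v_mean) for t, v in zip(times, values))
    den = sum((t - t_mean) ** 2 for t in times)
    return num / den


# Synthetic F0 track: 0.5 s of frames falling by 0.2 Hz per frame.
f0_track = [130.0 - 0.2 * i for i in range(50)]
f0_gradient = slope(f0_track)
print(f0_gradient < 0)  # a negative slope indicates a decaying track
```

The same function applies to a per-frame energy track in dB, giving a gradient in dB per second.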

21 DATA ANALYSIS AND RESULTS However, these values vary little: the standard deviation of F0 is on average around 15 Hz (mean 128 Hz), and the standard deviation of energy is around 2.7 dB (mean 16 dB).

22 DATA ANALYSIS AND RESULTS The parameter based on the standard deviation of spectral band energies shows similar behavior. The mean duration is around 0.52 s, with a standard deviation of 0.16 s.

23 DATA ANALYSIS AND RESULTS We also observed that these characteristics do not separate FP and EX hesitations well, which agrees with the fact that their perceptual distinction is also ambiguous without context.

24 CONCLUSIONS Automatic detection of filled pauses and extensions in spontaneous speech can be done using a phone recognizer. Although it is not an optimal method, it proved useful for semi-automatic annotation. The detected events were characterized phonetically and acoustically. In the near future we intend to explore hesitations within utterances using the three-region surface structure. 再见 (Goodbye)

