Published by Rudolf Cummings. Modified about 1 year ago.

1 Robust Recognition of Digits and Natural Numbers
Frederico Rodrigues and Isabel Trancoso
INESC/IST, 2000

2 Summary
Problem overview
Baseline system
Extensions to the baseline system
Conclusions and future work

3 The Problem
Speaker
–Gender
–Age
–Vocal tract characteristics
–Pronunciation
–Rate of speech
–Stress
–Lombard reflex
Microphone
–Position
–Distortion
Channel
–Distortion
–Noise
Environment
–Background noises
–Intermittent noises
–Cocktail party noises
–Reverberation

4 Corpus Description
Multilingual telephone speech corpus
–SPEECHDAT(M): 1000 speakers
–SPEECHDAT(II): 4000 speakers
Orthographically transcribed, including noise events

5 Noise events
[spk]: Speaker-related noises
[sta]: Stationary noises
[int]: Intermittent noises
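
The three noise tags above can be detected mechanically, which is what later makes it possible to pick out utterances "with no noise marks". A minimal sketch, assuming the tags appear inline in the orthographic transcription exactly as written on the slide; the function names and the sample transcriptions are hypothetical, not part of the corpus tooling.

```python
import re
from typing import List

# Tags as listed on the slide; their inline placement is an assumption.
NOISE_TAG = re.compile(r"\[(spk|sta|int)\]")

def noise_events(transcription: str) -> List[str]:
    """Return the noise tags found in a transcription, in order of appearance."""
    return NOISE_TAG.findall(transcription)

def strip_noise(transcription: str) -> str:
    """Remove noise tags and collapse the remaining whitespace."""
    return " ".join(NOISE_TAG.sub(" ", transcription).split())

def is_clean(transcription: str) -> bool:
    """True when the transcription carries no noise mark at all."""
    return NOISE_TAG.search(transcription) is None

if __name__ == "__main__":
    utterances = ["[sta] três [spk] um", "dois"]  # hypothetical examples
    for text in utterances:
        print(text, "->", noise_events(text), is_clean(text))
```

Filtering on `is_clean` gives a noise-free subset, and `strip_noise` recovers the word sequence when the noisy files are used.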

7 Train and Test Set Definition
Selection procedure
–Age, gender and region distributions are approximately equal in both train and test sets
SPEECHDAT(II)
–Fixed 500-speaker evaluation set
–Additional 300-speaker development set
SPEECHDAT(M)
–200-speaker evaluation set
Overall ratio of 80% train / 20% test
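
One way to keep age, gender and region distributions roughly equal across sets is to split speakers within each stratum. The sketch below is an illustration of that idea only: the slides do not describe the actual selection tool, and the function name, the stratum encoding and the toy speaker list are all assumptions.

```python
import random
from collections import defaultdict

def split_speakers(speakers, test_fraction=0.2, seed=0):
    """Speaker-disjoint split that preserves stratum proportions.

    speakers: list of (speaker_id, stratum) pairs, where stratum is any
    sortable key such as (gender, age_band, region). Hypothetical layout.
    """
    by_stratum = defaultdict(list)
    for speaker_id, stratum in speakers:
        by_stratum[stratum].append(speaker_id)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, test = [], []
    for stratum in sorted(by_stratum):
        ids = sorted(by_stratum[stratum])
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test

if __name__ == "__main__":
    # Toy population: 100 speakers over 10 strata (gender x region).
    population = [("s%03d" % i, ("F" if i % 2 else "M", i % 5)) for i in range(100)]
    train, test = split_speakers(population)
    print(len(train), len(test))
```

Because the split is made per speaker rather than per utterance, no speaker appears in both sets.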

8 Sub-corpus Used
I1 - Isolated digit strings
B1 - Sequences of 10 digits
N* - Natural numbers

9 Feature Extraction
MFCC (Mel Frequency Cepstral Coefficients)
–14 Cepstra + 14 ΔCepstra + Energy + ΔEnergy
–Speech signal band-limited between 200 and 3800 Hz
–Hamming window: 25 ms every 10 ms
Cepstral Mean Subtraction
–Simple but effective technique for channel and speaker normalization
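
The windowing and cepstral mean subtraction steps above can be sketched in a few lines. This is an illustration rather than the authors' front end: the 8 kHz sampling rate is an assumption (typical for telephone speech and consistent with the 200 to 3800 Hz band), the mel filterbank and cepstrum computation are omitted, and the function names are hypothetical.

```python
import math

def hamming(n):
    """Standard Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]

def frames(signal, sample_rate=8000, win_ms=25, shift_ms=10):
    """Cut the signal into Hamming-weighted 25 ms frames taken every 10 ms.

    sample_rate=8000 is an assumption; the slide does not state it.
    """
    win = sample_rate * win_ms // 1000      # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000  # 80 samples at 8 kHz
    window = hamming(win)
    return [
        [s * w for s, w in zip(signal[start:start + win], window)]
        for start in range(0, len(signal) - win + 1, shift)
    ]

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of every cepstral coefficient."""
    n, dim = len(cepstra), len(cepstra[0])
    mean = [sum(frame[d] for frame in cepstra) / n for d in range(dim)]
    return [[frame[d] - mean[d] for d in range(dim)] for frame in cepstra]

if __name__ == "__main__":
    one_second = [1.0] * 8000
    print(len(frames(one_second)))  # number of 25 ms frames in one second
    print(cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]]))
```

A fixed channel acts as a constant offset in the cepstral domain, so removing the utterance mean removes most of it; this is why the technique is cheap yet effective.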

10 Acoustic Modeling
Left-right continuous density HMMs
–Word models for each digit. No skips.
–Silence and filler models with forward and backward skips
Gender dependent models
HMM: Hidden Markov Model
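
To make the "left-right, no skips" constraint concrete, the sketch below builds the transition matrix of such a model: each state may only loop on itself or advance to the next one. The number of states per digit and the initial self-loop probability are not given on the slides, so both are hypothetical parameters here.

```python
def left_right_transitions(n_states, self_loop=0.6):
    """Transition matrix of a left-to-right HMM without skips.

    Returns n_states rows of n_states + 1 columns; the extra last column
    is the exit transition out of the model. Values are illustrative
    starting points that re-estimation would later update.
    """
    if not 0.0 < self_loop < 1.0:
        raise ValueError("self_loop must lie strictly between 0 and 1")
    matrix = []
    for i in range(n_states):
        row = [0.0] * (n_states + 1)
        row[i] = self_loop            # stay in the same state
        row[i + 1] = 1.0 - self_loop  # move to the next state (or exit)
        matrix.append(row)
    return matrix

if __name__ == "__main__":
    for row in left_right_transitions(3):
        print(row)
```

Entries that start at zero stay at zero under Baum-Welch re-estimation, so the topology chosen here is preserved throughout training.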

11 Model Topology
Fillers and silence models topology

12 Baseline System - Isolated Digits
Choose isolated digits with no noise marks
–HMM parameters initialized with the global mean and variance of the training data
Embedded Baum-Welch re-estimation
Evaluate performance with Viterbi decoding
–Grammar allowing one digit and initial and final silence
–Grammar allowing one digit and any number of fillers or silence
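
The initialization step above (often called a flat start) can be sketched as follows: every state of every model begins with the same Gaussian, whose mean and variance are those of the whole training set. The data layout and function names are assumptions made for illustration; diagonal covariances are assumed as well.

```python
def global_mean_var(frames):
    """Per-dimension mean and variance over all training frames."""
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    var = [sum((f[d] - mean[d]) ** 2 for f in frames) / n for d in range(dim)]
    return mean, var

def flat_start(n_states, frames):
    """Give every state its own copy of the global Gaussian."""
    mean, var = global_mean_var(frames)
    return [{"mean": list(mean), "var": list(var)} for _ in range(n_states)]

if __name__ == "__main__":
    toy_frames = [[0.0, 2.0], [2.0, 6.0]]  # hypothetical 2-dimensional features
    for state in flat_start(3, toy_frames):
        print(state)
```

Since all states start identical, it is the embedded re-estimation passes that differentiate them; no hand-made segmentation of the training data is needed.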

13 Baseline System - Isolated Digits

14 Baseline System - Isolated Digits
Increment Gaussian mixtures per state up to 3 for the digit models
Introduce files with noise marks
Repeat re-estimation/evaluation process
Increment Gaussian mixtures per state up to 3 for the filler and digit models
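
The slides do not say how the number of Gaussians per state is incremented. One common heuristic, shown here purely as an assumption, is to split the component with the largest weight: halve its weight and move the two copies apart by a fraction of a standard deviation. The 0.2 offset, the dictionary layout and the function names are hypothetical.

```python
import math

def split_heaviest(components, offset=0.2):
    """Split the highest-weight diagonal Gaussian into two perturbed copies."""
    idx = max(range(len(components)), key=lambda i: components[i]["weight"])
    comp = components[idx]
    std = [math.sqrt(v) for v in comp["var"]]
    half = comp["weight"] / 2.0
    lower = {"weight": half,
             "mean": [m - offset * s for m, s in zip(comp["mean"], std)],
             "var": list(comp["var"])}
    upper = {"weight": half,
             "mean": [m + offset * s for m, s in zip(comp["mean"], std)],
             "var": list(comp["var"])}
    return components[:idx] + [lower, upper] + components[idx + 1:]

def mix_up(components, target):
    """Split repeatedly until the state holds `target` components."""
    while len(components) < target:
        components = split_heaviest(components)
    return components

if __name__ == "__main__":
    state = [{"weight": 1.0, "mean": [0.0], "var": [4.0]}]
    for comp in mix_up(state, 3):
        print(comp)
```

After each increment the models would be re-estimated and re-evaluated, matching the "repeat re-estimation/evaluation process" step on the slide.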

15 Connected vs Isolated Digits
Example: Number said as:
Isolated Digits: t r e S u~ d o j S s 6 j S
Connected Digits: t r e z u~ d o j S _ 6 j S

16 Baseline System - Connected Digits
Use best isolated digit models as bootstrap models
Repeat re-estimation/evaluation process
Gradually increment Gaussian mixtures per state up to 5 for the digit models

17 Baseline System - Results

18 Extension to the Baseline System
New way of modelling the filler models
Same training/evaluation process
Train the 9 filler and silence models with no skips
Build a single filler model by concatenating all filler and silence models

19 New Filler Model Architecture

20 Results With New Filler Model

21 Natural Numbers
Phone models with 3 states and no skips
–Larger vocabulary size
–May be adapted to other tasks
Phones initialized from models already trained for a directory assistance task
Digits are still modeled by word models
Grammar for natural numbers ranging from zero to hundreds of millions

22 Natural Numbers Example
Number 25:
–Hypothesis 1: vinte e cinco (Twenty and five)
–Hypothesis 2: vinte cinco (Twenty five)
But “vinte cinco” could also be the sequence of natural numbers: 20 5
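
The ambiguity in this example can be reproduced with a toy grammar that accepts both "tens e unit" and "tens unit". The sketch below enumerates every reading of a word sequence as a sequence of numbers; it covers only units and tens (no teens, hundreds or larger), and the vocabulary tables and function name are illustrative assumptions, not the grammar used in the system.

```python
UNITS = {"um": 1, "dois": 2, "três": 3, "quatro": 4, "cinco": 5,
         "seis": 6, "sete": 7, "oito": 8, "nove": 9}
TENS = {"vinte": 20, "trinta": 30, "quarenta": 40, "cinquenta": 50,
        "sessenta": 60, "setenta": 70, "oitenta": 80, "noventa": 90}

def readings(words):
    """Return every way to read `words` as a sequence of numbers."""
    if not words:
        return [[]]
    results = []
    first = words[0]
    if first in UNITS:
        results += [[UNITS[first]] + rest for rest in readings(words[1:])]
    if first in TENS:
        # The tens word on its own, followed by whatever comes next.
        results += [[TENS[first]] + rest for rest in readings(words[1:])]
        # Hypothesis 1: tens + "e" + unit.
        if len(words) >= 3 and words[1] == "e" and words[2] in UNITS:
            value = TENS[first] + UNITS[words[2]]
            results += [[value] + rest for rest in readings(words[3:])]
        # Hypothesis 2: tens + unit, with the "e" dropped.
        if len(words) >= 2 and words[1] in UNITS:
            value = TENS[first] + UNITS[words[1]]
            results += [[value] + rest for rest in readings(words[2:])]
    return results

if __name__ == "__main__":
    print(readings("vinte e cinco".split()))
    print(readings("vinte cinco".split()))
```

With the conjunction present there is a single reading; without it the same words yield both the number 25 and the sequence 20, 5, which is the confusion described on the slide.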

23 Natural Numbers - Results

24 Sample Application
[Diagram; labels: User, Client, Server, State Control, Speech Recording, Feature Extraction, Speech Recognition, Speech Synthesis, DIXI - SVIT, Speech Prompts, Speech / Commands, Synthesised answer / Commands, Answer]

25 Conclusions and Future Work
Explicitly modeling fillers is a difficult task
–The improved filler model decreases the error rate by up to 50%
Develop context dependent models
–Solve vowel reduction and co-articulation problems
Results may be improved through the use of discriminative training techniques
