
1 Informing Multisource Decoding for Robust Speech Recognition
Ning Ma and Phil Green
Speech and Hearing Research Group, The University of Sheffield
22/04/2005

2 Overview of the Talk
- Introduction to Multisource Decoding
- Context-Dependent Word Duration Modelling
- Measuring the "Speechiness" of Fragments
- Summary and Plans

3 Multisource Decoding
- A framework that integrates bottom-up and top-down processes in sound understanding
- It is easier to find a spectro-temporal region that belongs to a single source (a fragment) than to find a speech fragment ("missing data" techniques)

4 Modelling Durations, Why?
- Unrealistic duration information is encoded in HMMs
- No hard limits on word durations
  - The decoder may produce word matches with unusual durations
- Worse with multisource decoding
  - Segregation hypotheses are decided on incomplete data
  - More temporal constraints are needed
(Figure: an HMM state with self-transition probability a_ii and exit probability 1 - a_ii; the implied state-duration distribution is geometric, P(d) = a_ii^(d-1) (1 - a_ii).)
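As a brief illustration (a minimal sketch with a hypothetical a_ii; the talk gives no numbers), the geometric prior always peaks at one frame, which is what makes it unrealistic for word durations:

```python
import numpy as np

# An HMM state with self-transition probability a_ii implies a
# geometric duration distribution: P(d) = a_ii**(d - 1) * (1 - a_ii).
a_ii = 0.9                        # hypothetical value for illustration
d = np.arange(1, 51)              # durations in frames
p = a_ii ** (d - 1) * (1 - a_ii)  # the implied duration prior

# The mode is always at d = 1: the model prefers the shortest possible
# stay, whereas real word durations peak well above one frame and are
# skewed to the right.
print(int(d[np.argmax(p)]))       # -> 1
```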

5 Factors that Determine Word Durations
- Lexical stress: humans tend to lengthen a word when emphasising it
- Surrounding words: the neighbouring words can affect the duration of a word
- Speaking rate: fast speech vs. slow speech
- Pause context: words followed by a long pause have relatively longer durations, the "pre-pausal lengthening" effect [1]

[1] T. Crystal, "Segmental durations in connected-speech signals: Syllabic stress," JASA, 1988.

6 Word Duration Model Investigation
- Different words have different durational statistics
- The distribution shape is skewed
- A discrete distribution is more attractive
(Figure: word duration histograms for the digits 'oh' and 'six'.)

7 Context-Dependent Duration Modelling
- In a connected-digits domain:
  - High-level linguistic cues are minimised
  - The effect of lexical stress is not obvious
  - Surrounding words do not affect duration statistics
- This work therefore models only the 'pre-pausal' lengthening effect

8 The "Pre-Pausal Lengthening" Effect
- Word duration histograms obtained by forced alignment
  - The distributions (solid lines) have a wide variance
  - There is a clear second peak around 600 ms for 'six'
- Word duration examples divided into two parts
  - Non-terminating vs. pre-pausal word duration examples
  - Histograms are determined for each part separately
(Figure: smoothed word duration histograms for the digits 'oh' and 'six'.)

9 Computing the Word Duration Penalty
- Estimate P(d|w,u), the probability of word w having duration d when followed by u
  - Word duration histograms (bin width 10 ms) obtained by forced alignment
  - Histograms are smoothed and normalised to evaluate P(d|w,u)
  - u can only be pause or non-pause in our case, so there are two histograms per digit
  - Scaling factors control the impact of the word duration penalties
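A minimal sketch of this estimation, with invented smoothing and scaling choices where the slide does not specify them:

```python
import numpy as np

def duration_penalty_table(durations_ms, max_ms=1500, bin_ms=10,
                           smooth_win=5, scale=1.0, floor=1e-6):
    """Estimate a scaled log duration penalty for one (word, context)
    pair from forced-alignment durations, as on the slide: 10 ms
    histogram bins, smoothing, normalisation.

    `smooth_win`, `scale` and `floor` are illustrative choices, not
    values from the talk."""
    bins = np.arange(0, max_ms + bin_ms, bin_ms)
    counts, _ = np.histogram(durations_ms, bins=bins)

    # Smooth with a simple moving average so unseen durations near
    # observed ones are not assigned zero probability.
    kernel = np.ones(smooth_win) / smooth_win
    smoothed = np.convolve(counts, kernel, mode="same")

    # Normalise to a probability mass function P(d | w, u).
    pmf = np.maximum(smoothed, floor)
    pmf /= pmf.sum()

    # The penalty applied to a hypothesis score is a scaled log prob.
    return scale * np.log(pmf)

# Hypothetical durations (ms) of one digit followed by non-pause tokens.
dur = np.random.default_rng(0).gamma(shape=9.0, scale=40.0, size=500)
penalty = duration_penalty_table(dur)
print(penalty[30])   # penalty for a duration of ~300 ms
```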

10 Decoding with Word Duration Modelling
- In Viterbi decoding:
  - A time-synchronous algorithm
  - Word duration penalties are applied as paths leave the final state
  - But within a template, paths with different histories have different durations!
- Multi-stack decoding:
  - Idea from the NOWAY decoder [1]
  - Time-asynchronous, but start-synchronous
  - The future of each hypothesis is known when it is extended

[1] S. Renals and M. Hochberg (1999), "Start-synchronous search for large vocabulary continuous speech recognition."

11 Multi-Stack Decoding
- Partial word-sequence hypotheses H(t, W(t), P(t)) are stored on each stack:
  - The reference time t at which the hypothesis ends
  - The word sequence W(t) = w(1) w(2) ... w(n) covering the time from 1 to t
  - Its overall likelihood P(t)
- The most likely hypothesis on each stack is extended further
- The Viterbi algorithm is used to find one-word extensions
- The final result is the best hypothesis on the stack at time T
(Figure: Viterbi searches extending stacks at times t1 ... t6 along the time axis, with the final result at time T.)
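A schematic sketch of the stack search described above, under stated assumptions: `viterbi_extend` is a hypothetical placeholder for the one-word Viterbi extension, and only the most likely hypothesis per stack is extended, as on the slide. This is not the NOWAY implementation:

```python
import heapq

def stack_decode(T, viterbi_extend):
    """Time-asynchronous, start-synchronous stack search sketch.

    `viterbi_extend(words, t)` is assumed to yield (new_t, word,
    delta_logp) triples for one-word extensions of a hypothesis
    ending at time t."""
    # One stack per reference time t; entries are (-log P, word tuple).
    stacks = {t: [] for t in range(T + 1)}
    heapq.heappush(stacks[0], (0.0, ()))            # empty hypothesis at t = 0

    for t in range(T):                              # visit stacks in time order
        if not stacks[t]:
            continue
        neg_logp, words = heapq.heappop(stacks[t])  # most likely hypothesis
        for new_t, word, delta in viterbi_extend(words, t):
            if new_t <= T:
                # delta is the log-likelihood gain of the extension.
                heapq.heappush(stacks[new_t],
                               (neg_logp - delta, words + (word,)))

    # The final result is the best hypothesis on the stack at time T.
    return min(stacks[T], default=None)
```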

12 Applying Word Duration Penalties
- When placing a hypothesis onto stacks:
  - Compute the WD penalty based on the one-word extension
  - Apply the penalty to the hypothesis's likelihood score
- Setting a search range [WD_min, WD_max]:
  - Reduces computational cost
  - A typical duration range for a digit is 150-900 ms
(Figure: one-word extensions placed on stacks at times t1 ... t5.)
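A possible shape for the penalty application, again hypothetical: `penalties` is a table like the one sketched after slide 9, `stacks` maps end times to plain lists, and only durations inside [WD_min, WD_max] are searched:

```python
WD_MIN, WD_MAX = 150, 900   # typical digit duration range (ms), per slide

def place_on_stacks(hyp_logp, word, start_ms, penalties, stacks,
                    frame_ms=10):
    """Place one-word extensions of a hypothesis on the destination
    stacks, applying the word duration (WD) penalty as each is placed.

    `penalties[word]` is a scaled log P(d | w, u) table indexed by
    duration bin; `stacks` maps end times to lists of entries."""
    for dur in range(WD_MIN, WD_MAX + frame_ms, frame_ms):
        end_ms = start_ms + dur
        if end_ms not in stacks:
            continue
        wd_penalty = penalties[word][dur // frame_ms]
        # Durations outside [WD_MIN, WD_MAX] are never searched at
        # all, which is what reduces the computational cost.
        stacks[end_ms].append((hyp_logp + wd_penalty, end_ms, word))
```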

13 Recognition Experiments
- "Soft mask" missing-data system with spectral-domain features
- 16 states per HMM, 7 Gaussians per state
- Silence and short-pause models in use
- Aurora 2 connected-digits recognition task, clean training

14 Experimental Results
Four recognition systems:
1. Baseline system, no duration model
2. + uniform duration model
3. + context-independent duration model
4. + context-dependent duration model

15 Discussion
- The context-dependent word duration model offers a significant improvement
- With duration constraints the decoder produces more reasonable duration patterns
- It assumes the duration patterns in clean conditions are the same as in noise
- Normalisation by speaking rate is needed

16 Overview of the Talk
- Introduction to Multisource Decoding
- Context-Dependent Word Duration Modelling
- "Speechiness" Measures of Fragments
- Discussion

17 Motivation for Measuring "Speechiness"
- The multisource decoder assumes each fragment has an equal probability of being speech or not
- We can measure the "speechiness" of each fragment
- These measures can be used to bias the decoder towards including the fragments that are more likely to be speech

18 A Noisy Speech Corpus
- Aurora 2 connected digits mixed with either violins or drums
- A set of a priori fragments has been generated, but left unlabelled
- This allows us to study the integration problem in isolation from the problems of fragment construction

19 A Priori Fragments

20 Recognition Results
- "Correct": a priori fragments with correct labels
- "Fragments": a priori fragments with no labels
- The results demonstrate that the top-down information in our HMMs is insufficient

                        Acc     DEL  SUB  INS
  Violins, Correct     93.04%    24   44    2
  Violins, Fragments   50.75%   322  171    3
  Drums, Correct       91.36%    38   48    1
  Drums, Fragments     33.76%   221  381   65

21 Approach to Measuring "Speechiness"
- Extract features that represent speech characteristics
- Use statistical models such as GMMs to fit the features
- A background model that fits everything is needed
- Take the speech-model / background-model likelihood ratio as the confidence measure
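A minimal sketch of the likelihood-ratio confidence measure using scikit-learn GMMs; the feature values and model sizes here are invented for illustration, not the talk's settings:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical training data: rows are per-frame feature vectors
# (e.g. F0 and delta F0) pooled over fragments of each class.
rng = np.random.default_rng(0)
speech_feats = rng.normal([120.0, 0.0], [30.0, 5.0], size=(2000, 2))
backgr_feats = rng.normal([440.0, 0.0], [80.0, 2.0], size=(2000, 2))

speech_gmm = GaussianMixture(n_components=2, covariance_type="full",
                             random_state=0).fit(speech_feats)
backgr_gmm = GaussianMixture(n_components=2, covariance_type="full",
                             random_state=0).fit(backgr_feats)

def speechiness(fragment_feats):
    """Average log-likelihood ratio of a fragment's frames under the
    speech and background models; positive values favour speech."""
    return (speech_gmm.score(fragment_feats)
            - backgr_gmm.score(fragment_feats))

test = rng.normal([130.0, 0.0], [25.0, 4.0], size=(40, 2))
print(speechiness(test) > 0)   # expected: True for speech-like F0
```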

22 Preliminary Experiment 1: F0 Estimation
- Speech and other sounds differ in F0, and also in delta F0
- Measure the F0 of each fragment rather than of the full-band signal:
  - Compute the correlogram of all the frequency channels
  - Sum only those channels within the fragment
  - For each frame, find the peak to estimate its F0
  - Smooth F0 across the fragment
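A sketch of the fragment-based F0 estimation under stated assumptions: the correlogram and fragment mask are precomputed, and a median filter stands in for the unspecified smoother:

```python
import numpy as np
from scipy.signal import medfilt

def fragment_f0(correlogram, mask, fs=8000, lag_min=20, lag_max=160):
    """Estimate a per-frame F0 track for one fragment.

    correlogram: (frames, channels, lags) autocorrelations from an
    auditory filterbank; mask: (frames, channels) boolean fragment
    membership. The lag range 20..160 samples covers roughly
    50-400 Hz at 8 kHz; all of these choices are illustrative."""
    frames = correlogram.shape[0]
    f0 = np.zeros(frames)
    for t in range(frames):
        chans = mask[t]
        if not chans.any():
            continue
        # Sum autocorrelations over the fragment's channels only,
        # rather than over the full band.
        summary = correlogram[t, chans].sum(axis=0)
        lag = lag_min + np.argmax(summary[lag_min:lag_max])
        f0[t] = fs / lag
    # Smooth the F0 track across the fragment.
    return medfilt(f0, kernel_size=5)
```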

23 Preliminary Experiment 1: F0 Estimation
Accuracies using different features:

  Pitch   Delta pitch   Both
  74.3%   77.4%         88.8%

- GMMs with full covariance and two Gaussians
- Speech fragments vs. violin fragments
- Background model trained on violin fragments
- Log-likelihood ratio threshold is 0

24 Preliminary Experiment 2: Energy Ratios
- Speech has more energy around the formants
- Divide the spectral features into frequency bands
- Compute the energy of a fragment within each band, normalised by the full-band energy
- Two-band case: channel centre frequencies (CF) = 50 - 1000 - 3750 Hz
- Four-band case: CF = 50 - 282 - 707 - 1214 - 3850 Hz
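A sketch of the band energy-ratio feature; the mapping from the centre frequencies above to filterbank channel indices is assumed here, since it depends on the filterbank:

```python
import numpy as np

def band_energy_ratios(spec, mask, band_edges):
    """Energy of a fragment inside each frequency band, normalised by
    its full-band energy.

    spec: (frames, channels) spectral energies; mask: boolean fragment
    membership of the same shape; band_edges: channel indices
    delimiting the bands (hypothetical, filterbank-dependent)."""
    frag = np.where(mask, spec, 0.0)   # zero out cells outside the fragment
    total = frag.sum()
    if total == 0:
        return np.zeros(len(band_edges) - 1)
    return np.array([
        frag[:, lo:hi].sum() / total   # per-band energy over full-band energy
        for lo, hi in zip(band_edges[:-1], band_edges[1:])
    ])

# Two-band example: channels split at the index nearest 1000 Hz.
# ratios = band_energy_ratios(spec, mask, band_edges=[0, 12, 32])
```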

25 Preliminary Experiment 2: Energy Ratios
- Speech fragments vs. music fragments (violins and drums)
- Full-covariance GMMs with 4 Gaussians
- Background model trained on all types of fragments

Accuracies using different features:

  Two bands   Four bands
  79.7%       93.2%

26 Summary and Plans
- No classification is needed; the confidence measures are left to the multisource decoder
- This assumes the background model is accessible; in practice a garbage model is needed
- Combine different features together
- Add more speech features, e.g. syllabic rate

27 Thanks! Any questions?

