Hierarchical Approach for Spotting Keywords from an Acoustic Stream
Supervisor: Professor Raimo Kantola
Instructor: Professor Hynek Hermansky, IDIAP Research Institute


1 Hierarchical Approach for Spotting Keywords from an Acoustic Stream
Supervisor: Professor Raimo Kantola
Instructor: Professor Hynek Hermansky, IDIAP Research Institute

2 Introduction to the thesis
8.11.2005, Hierarchical Approach for Spotting Keywords
• Existing keyword spotting approaches are usually based on speech recognition techniques
• Growing apart from the original problem can lead to drawbacks, such as lack of generality
• Another approach is presented and studied, in which only the target sounds of the keyword are looked for
• Studying and formulating this approach was my work at IDIAP Research Institute, 3/2005 - 8/2005
Objective of the thesis: to see how far we can go without using hidden Markov models and dynamic programming techniques

3 Outline
• Introduction to keyword spotting (slides 4 - 7)
• Motivation for this work (slide 8)
• Steps of hierarchical processing (slides 9 - 14)
• Experiments (slides 15 - 20)
• Conclusions (slide 21)

4 Keyword Spotting
• Keyword spotting (KWS) aims at finding only certain words while rejecting the rest (a hypothesis test)
• Finding only certain rare, high-information words is a feasible approach in, for example, voice-command-driven applications or multimedia indexing
(Picture from [Jun96])

5 Performance measures for keyword spotting
• The possible events in keyword spotting are hit, false alarm and miss
• The performance is evaluated by presenting the detection rate as a function of the false alarm rate
• This yields the receiver operating characteristics (ROC) curve
• The average detection rate over the range of 0-10 false alarms per hour is called the figure of merit (FOM) [Roh89]
[Figure: ROC curve; x-axis: false alarms per hour, y-axis: keywords detected (%)]
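The figure of merit described on this slide can be sketched in a few lines. This is a minimal illustration, not the scoring tool used in the thesis: the function name `figure_of_merit`, the input format (a list of scored putative detections) and the choice to sample the ROC curve at 1, 2, ..., 10 false alarms per hour are all assumptions made for the example.

```python
def figure_of_merit(detections, n_true, hours, max_fa_per_hour=10):
    """Average detection rate (%) over 1..max_fa_per_hour false alarms per hour.

    detections: list of (score, is_hit) pairs for every putative keyword
                occurrence (hypothetical input format).
    n_true:     number of true keyword occurrences in the test data.
    hours:      duration of the test data in hours.
    """
    # Lowering the threshold corresponds to walking down the ranked list.
    ranked = sorted(detections, key=lambda d: d[0], reverse=True)
    rates = []
    for fa_per_hour in range(1, max_fa_per_hour + 1):
        allowed_fa = fa_per_hour * hours
        hits = 0
        false_alarms = 0
        for _score, is_hit in ranked:
            if is_hit:
                hits += 1
            else:
                false_alarms += 1
                if false_alarms > allowed_fa:
                    break  # this operating point allows no more false alarms
        rates.append(hits / n_true)
    return 100.0 * sum(rates) / len(rates)


# Toy data: five putative detections in one hour of speech, four true keywords.
toy = [(0.9, True), (0.8, False), (0.7, True), (0.6, False), (0.5, True)]
print(figure_of_merit(toy, n_true=4, hours=1, max_fa_per_hour=2))  # 62.5
```

Sampling the curve at integer false alarm rates keeps the sketch short; an evaluation tool would typically interpolate between operating points.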

6 LVCSR / HMM based approaches
• Typical large vocabulary continuous speech recognition (LVCSR) / hidden Markov model (HMM) based KWS approaches model both keywords and non-keywords (background or garbage)
• Keywords are searched for by using dynamic programming techniques
[Figure: keyword spotting network from [Roh89]]
[Figure: an example of dynamic programming, the optimal alignment between sequences X = x1 ... xN and Y = y1 ... yM]
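To make the dynamic programming example concrete, the following sketch computes the cost of the optimal alignment between two sequences X and Y with the classic dynamic time warping recursion. It is a generic illustration only: it is not the Viterbi search over the keyword/garbage network of [Roh89], and the function name `dtw_distance` and the absolute-difference local cost are assumptions.

```python
def dtw_distance(x, y):
    """Cost of the optimal monotonic alignment between sequences x and y."""
    inf = float("inf")
    n, m = len(x), len(y)
    # cost[i][j] = best cost of aligning x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(cost[i - 1][j],      # x advances, y repeats
                                     cost[i][j - 1],      # y advances, x repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# A time-stretched copy of the same pattern aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

The table has N x M cells, which is the source of the computational load that the hierarchical approach tries to avoid.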

7 LVCSR / HMM based approaches vs. hypothesis test approach
LABEL: um... okay, uh... please open the, uh... window
Spot1: ---------------------------1111-------------------
Spot2: --------------------------------------------111111
Recog: garbage garbage garbage - OPEN - garbage - WINDOW
[Figure: hypothesis test for "word 1", a yes/no decision over time]

8 Motivation for this work
• Typical LVCSR / HMM based approaches require a garbage model for Viterbi dynamic programming
• The better the garbage model, the better the keyword spotting performance [Ros90] ... and the closer the system is to LVCSR
• Use of LVCSR techniques can introduce
  - task dependency, lack of generality
  - computational load, complexity
  - need for training data
  - off-line operating mode
  - complexity to add keywords
How far can we go by looking only at the keysounds?

9 Hierarchical approach for spotting keywords
• Key sounds (words) are spotted by looking for the target sounds (phonemes) that form the key sound.
STEP 1: Estimate equally sampled phoneme posteriors
STEP 2: Derive phoneme-spaced posterior estimates
STEP 3: Search for the right sequences of high-confidence phonemes
ALARM

10 Step 1: From acoustic stream to phoneme posteriors
• TRAP-NN system:
  - Feature extraction from 2-D filtering of the critical band spectrogram, using 1010 ms long temporal patterns (TRAPs)
  - Features are fed to a trained neural net (NN) vector classifier that returns estimates of phoneme posterior probabilities every 10 ms
• TRAP-NN was successfully used in [Szö05] for phoneme-based keyword spotting
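As a rough illustration of what a 1010 ms temporal pattern is, the sketch below cuts a window of 101 frames around a center frame from every critical band. It assumes a 10 ms frame step for the spectrogram (so 1010 ms corresponds to 101 frames) and covers only the slicing; the 2-D filtering and the neural net classifier of the TRAP-NN system are not reproduced. The function name `temporal_patterns` and the list-of-bands layout are hypothetical.

```python
def temporal_patterns(spectrogram, center, context=50):
    """Return one temporal pattern per critical band around a center frame.

    spectrogram: list of bands, each a list of per-frame energies
                 (assumed 10 ms frame step).
    center:      index of the frame to be classified.
    context:     frames on each side; 50 gives 101 frames = 1010 ms.
    """
    lo, hi = center - context, center + context + 1
    if lo < 0 or hi > len(spectrogram[0]):
        raise ValueError("not enough context around the center frame")
    return [band[lo:hi] for band in spectrogram]


# Two synthetic bands of 200 frames each.
spec = [list(range(200)), list(range(200, 400))]
traps = temporal_patterns(spec, center=100)
print(len(traps), len(traps[0]))  # 2 101
```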

11 Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors
• Phonemes are found by filtering the posteriogram with a bank of matched filters
• Matched filters are obtained by averaging 0.5 s long segments of phoneme trajectories
• The purpose of filtering is to have one peak per phoneme
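The matched filtering of Step 2 can be sketched as follows for a single phoneme. This is a minimal sketch under stated assumptions, not the thesis implementation: the function names, the zero padding at the edges and the normalization by the template sum are choices made for the example, and `half_len` is a free parameter (20 would give the 41-sample filters reported later in the talk).

```python
def estimate_matched_filter(trajectory, centers, half_len=20):
    """Average segments of one phoneme's posterior trajectory.

    trajectory: posterior of the phoneme per frame (training data).
    centers:    frame indices of labeled centers of that phoneme.
    """
    length = 2 * half_len + 1
    total = [0.0] * length
    count = 0
    for c in centers:
        if c - half_len < 0 or c + half_len >= len(trajectory):
            continue  # skip segments that run off the edges
        for k in range(length):
            total[k] += trajectory[c - half_len + k]
        count += 1
    if count == 0:
        raise ValueError("no usable training segments")
    return [v / count for v in total]


def apply_filter(trajectory, template):
    """Correlate a posterior trajectory with the template (same-length output)."""
    half = len(template) // 2
    norm = sum(template) or 1.0
    filtered = []
    for t in range(len(trajectory)):
        acc = 0.0
        for k, w in enumerate(template):
            idx = t - half + k
            if 0 <= idx < len(trajectory):  # zero padding outside the signal
                acc += w * trajectory[idx]
        filtered.append(acc / norm)
    return filtered


# Toy trajectory with two occurrences of the phoneme, centered at frames 3 and 9.
traj = [0, 0, 0.5, 1, 0.5, 0, 0, 0, 0.5, 1, 0.5, 0, 0]
template = estimate_matched_filter(traj, centers=[3, 9], half_len=2)
smoothed = apply_filter(traj, template)
```

Because the template is centered, the filtered trajectory peaks at the center frame of each occurrence, which is what the peak picking in the next step relies on.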

12 Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors (2)
• The local maxima (peaks) of the filtered posteriogram are extracted and taken as estimates of underlying phonemes being present
• The locations of the peaks correspond to the center frames of the underlying phonemes
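Peak extraction from a filtered trajectory is simple enough to show directly. The sketch below is one possible way to do it; the function name `find_peaks` and the handling of flat plateaus (the first frame of a plateau is reported) are assumptions.

```python
def find_peaks(filtered):
    """Return (frame, value) for every local maximum of a filtered trajectory.

    A frame counts as a peak when it is higher than its left neighbor and
    not lower than its right neighbor, so a plateau yields its first frame.
    """
    peaks = []
    for t in range(1, len(filtered) - 1):
        if filtered[t - 1] < filtered[t] >= filtered[t + 1]:
            peaks.append((t, filtered[t]))
    return peaks


print(find_peaks([0, 0.2, 0.9, 0.3, 0.1, 0.4, 0.2]))  # [(2, 0.9), (5, 0.4)]
```

Each returned pair is a phoneme-spaced estimate: the frame gives the assumed phoneme center and the value serves as its confidence.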

13 Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors (3)
• Matched filter bank, estimated from 30,000 phonemes of the training data (English numbers)
• Filter lengths are 41 samples (210 ms processing delay)

14 Step 3: From phoneme estimates to words
• Method 1:
  - A posterior threshold is applied to the phoneme estimates
  - An alarm is set for a correct stream of phonemes
  - Minimum and maximum intervals between phonemes are defined from the training data
  - Only the primary lexical form of each word is searched for
[Figure: phoneme estimates with the threshold]
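Method 1 can be sketched as a search over thresholded peaks. The code below is an illustration under assumptions rather than the thesis implementation: the function name `spot_keyword`, the dictionary of peaks per phoneme, and the interval values in the toy example are all hypothetical (in the thesis the intervals are derived from training data).

```python
def spot_keyword(peaks, phonemes, intervals, threshold):
    """Return frames at which an alarm is raised for one keyword.

    peaks:     dict mapping phoneme -> list of (frame, posterior) estimates.
    phonemes:  primary lexical form of the keyword, e.g. ["w", "ah", "n"].
    intervals: (min_gap, max_gap) in frames for each consecutive phoneme pair.
    threshold: posterior threshold applied to every phoneme estimate.
    """
    # Keep only the frames of high-confidence estimates.
    confident = {
        p: [f for f, v in peaks.get(p, []) if v >= threshold]
        for p in set(phonemes)
    }

    def extend(pos, frame):
        """Try to continue a partial match that ended at `frame`."""
        if pos == len(phonemes):
            return [frame]  # whole keyword matched; alarm at its last phoneme
        lo, hi = intervals[pos - 1]
        ends = []
        for f in confident[phonemes[pos]]:
            if lo <= f - frame <= hi:
                ends.extend(extend(pos + 1, f))
        return ends

    alarms = set()
    for start in confident[phonemes[0]]:
        alarms.update(extend(1, start))
    return sorted(alarms)


# Toy estimates for the word "one" (w ah n); frames are 10 ms apart.
toy_peaks = {
    "w": [(10, 0.9), (60, 0.2)],
    "ah": [(18, 0.8), (70, 0.9)],
    "n": [(27, 0.7), (78, 0.95)],
}
print(spot_keyword(toy_peaks, ["w", "ah", "n"], [(3, 15), (3, 15)], 0.5))  # [27]
```

In the toy data the second occurrence is missed at threshold 0.5 because its "w" estimate is weak, which mirrors the dependence on reliable phoneme estimates noted in the conclusions.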

15 Experiments
• Two telephone corpora were used [Col94, Col95]
• The MLP was trained to estimate the posterior probabilities of 28 English phonemes + silence (numbers from zero to ninety-nine)
• A separate keyword spotter was implemented for each digit from zero to nine, with only the primary lexical forms
• Results were compared to time-aligned phonemic labeling, and all legal pronunciations were treated as true alarms

16 Results - Experiment 1 (phoneme estimates only)
[Table: keyword spotting results (FOM) from spotting digits in the stream of other digits (OGI-Numbers95), experiment 1 (only phoneme estimates)]
Two main reasons for differences in performance:
1. Some phonemes are more prone to classification errors
2. The probability that a keyword is mixed with another word is not constant

17 Introduction of phoneme transition probability
• A confidence measure is introduced that tells whether there are extraneous phonemes between two phonemes: the phoneme transition probability
• The phoneme transition probability is estimated using:
  - Strategy 1: the height of the crossing point of posterior trajectories of the corresponding phonemes
  - Strategy 2: the height of the crossing point of filtered posterior trajectories
  - Strategy 3: one minus the minimum of the sum of the posteriors of the corresponding phonemes, between the phoneme estimates
• New method for Step 3 (with transition probabilities):
  - The posterior threshold is applied to the product of phoneme and transition estimates
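The sketch below illustrates Strategy 1 and the product rule of the new Step 3 method. It is a discrete approximation made for this example, not the thesis formula: the crossing height is taken as the largest value of min(posterior A, posterior B) over the frames between the two phoneme estimates, which coincides with the crossing point when one trajectory falls while the other rises. The function names are hypothetical.

```python
import math


def crossing_height(post_a, post_b, frame_a, frame_b):
    """Approximate height at which two posterior trajectories cross.

    post_a, post_b:   per-frame posteriors of two consecutive keyword phonemes.
    frame_a, frame_b: frames of their phoneme estimates (frame_a < frame_b).
    """
    return max(min(post_a[t], post_b[t]) for t in range(frame_a, frame_b + 1))


def word_confidence(phoneme_estimates, transition_estimates):
    """Product of phoneme and transition estimates, to be thresholded."""
    return math.prod(phoneme_estimates) * math.prod(transition_estimates)


# Adjacent phonemes: the trajectories cross while both are still fairly high.
w = [0.9, 0.8, 0.6, 0.3, 0.1, 0.0]
ah = [0.0, 0.1, 0.3, 0.6, 0.8, 0.9]
print(crossing_height(w, ah, 0, 5))  # 0.3

# An extraneous phoneme in between: both posteriors are low where they meet.
a = [0.9, 0.5, 0.1, 0.0, 0.0, 0.0]
b = [0.0, 0.0, 0.0, 0.1, 0.5, 0.9]
print(crossing_height(a, b, 0, 5))  # 0.0
```

A low transition estimate pulls the product below the threshold, so a phoneme sequence with something else in between no longer raises an alarm.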

18 Results - Experiment 2 (phoneme and transition estimates)
[Table: keyword spotting results (FOM) from spotting digits in the stream of other digits (OGI-Numbers95), experiment 2 (with phoneme transition probability estimates)]
• The average increase in FOM compared to the first experiment is 5.6%
• Only small differences between the different strategies of deriving the phoneme transition estimates

19 ROC curve - 'zero'

20 ROC curve - 'eight'

21 Conclusions
• A theoretical framework for keysound spotting was introduced and used to spot digits. Besides keyword spotting, the proposed processing can be applied in:
  - Phoneme detection (experimented in the thesis)
  - Event spotting in general
• This approach has no garbage model, and no dynamic programming techniques or HMMs are used
• Benefits from looking only at the target sounds:
  - Independence from vocabulary
  - Some independence from language
  - Less need for training the models
  - Simple and fast
• Relies on reliable phoneme estimates
  - Quite robust to the choice of matched filter and phoneme sequence search technique
• High variance in results between different words
  - Short phonemes yield weaker estimates
• Room to improve the performance
  - Treat closure forms of plosive phonemes
  - Look for all the possible pronunciation forms
  - Use the non-keyword phoneme estimates to extract complementary information
  - Introduce prior lexical knowledge

22 Questions?
[Jun96] Junqua, J.C., Haton, J.-P.: Robustness in Automatic Speech Recognition, Fundamentals and Applications. Dordrecht, The Netherlands, Kluwer Academic Publishers, 1996.
[Roh89] Rohlicek, J., Russell, W., Roukos, S., Gish, H.: Continuous Hidden Markov Modeling For Speaker-Independent Word-Spotting. In ICASSP 89, pp. 627-630, 1989.
[Ros90] Rose, R., Paul, D.: A Hidden Markov Model Based Keyword Recognition System. In Proceedings of ICASSP 90, pp. 129-132, Albuquerque, New Mexico, United States, 1990.
[Szö05] Szöke, I., Schwarz, P., Matejka, P., Burget, L., Fapso, M., Karafiát, M., Cernocký, J.: Comparison of Keyword Spotting Approaches for Informal Continuous Speech. In MLMI 05, Edinburgh, United Kingdom, July 2005.
[Col94] Cole, R. et al.: Telephone Speech Corpus Development at CSLU. In Proceedings of ICSLP '94, pp. 1815-1818, Yokohama, Japan, 1994.
[Col95] Cole, R. et al.: New Telephone Speech Corpora at CSLU. In Proceedings of Eurospeech '95, pp. 821-824, Madrid, Spain, 1995.
Lehtonen, M., Fousek, P., Hermansky, H.: A Hierarchical Approach for Spotting Keywords. In 2nd Workshop on Multimodal Interaction and Related Machine Learning Algorithms - MLMI 05, Edinburgh, United Kingdom, July 2005.

23 Appendix: Application to phoneme detection
• The phoneme estimates of Step 2 were used in phoneme detection
• The phoneme stream was estimated by collecting all the phoneme estimates over a threshold, with different threshold values
• Results were evaluated in terms of substitutions (S), insertions (I) and deletions (D)
• For example (N = number of phonemes in labeling):
  Labeled:         s   eh  v  ah  n  f  ay  v
  Recognized:  sil n   eh  v      n  f  ay  v
  Operation:   I   S          D
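The S/I/D counts in the example can be reproduced with a standard edit-distance alignment. The sketch below then applies the commonly used accuracy definition, 100 * (N - S - I - D) / N; that this is the exact formula used in the thesis is an assumption, as is the tokenization of the example strings into phoneme symbols. The function names are hypothetical.

```python
def count_errors(labeled, recognized):
    """Return (substitutions, insertions, deletions) of a minimum-cost alignment."""
    n, m = len(labeled), len(recognized)
    # Each cell holds (total_errors, S, I, D) for labeled[:i] vs recognized[:j].
    cell = [[None] * (m + 1) for _ in range(n + 1)]
    cell[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cell[i][0] = (i, 0, 0, i)  # everything deleted
    for j in range(1, m + 1):
        cell[0][j] = (j, 0, j, 0)  # everything inserted
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s, ins, d = cell[i - 1][j - 1]
            if labeled[i - 1] == recognized[j - 1]:
                diag = (t, s, ins, d)          # match, no error
            else:
                diag = (t + 1, s + 1, ins, d)  # substitution
            t, s, ins, d = cell[i][j - 1]
            insertion = (t + 1, s, ins + 1, d)
            t, s, ins, d = cell[i - 1][j]
            deletion = (t + 1, s, ins, d + 1)
            cell[i][j] = min(diag, insertion, deletion, key=lambda c: c[0])
    _, s, ins, d = cell[n][m]
    return s, ins, d


def phoneme_accuracy(labeled, recognized):
    """Accuracy in percent; negative when errors outnumber labeled phonemes."""
    s, ins, d = count_errors(labeled, recognized)
    return 100.0 * (len(labeled) - s - ins - d) / len(labeled)


labeled = "s eh v ah n f ay v".split()
recognized = "sil n eh v n f ay v".split()
print(count_errors(labeled, recognized))      # (1, 1, 1)
print(phoneme_accuracy(labeled, recognized))  # 62.5
```

Because insertions are counted, a very low threshold that lets many spurious estimates through can push the accuracy below zero.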

24 Appendix: Application to phoneme detection (cont)
Results from phoneme detection:
Threshold   Accuracy
0.01        -93.21 %
0.05         28.57 %
0.10         54.35 %
0.15         64.37 %
0.20         69.44 %
0.25         71.24 %
0.30         70.50 %
0.35         68.12 %
0.40         64.57 %
Taking the transition probabilities into account as well yielded 73.15 % accuracy.
State-of-the-art phoneme recognition accuracy for unrestricted speech is 67 % - 77 %.

25 Appendix: System diagram

26 Appendix: Conclusions (table)
Step 1 (from acoustic stream to phoneme posteriors)
  What affects/determines the performance:
  - Phoneme's proneness to classification errors
  - Phoneme's duration (longer phonemes yield stronger posteriors)
  Places for improvement:
  - To treat the closure form phonemes
Step 2 (from frame-based posteriors to phoneme-spaced posteriors)
  What affects/determines the performance:
  - How the matched filter models the duration of the phoneme
  Places for improvement:
  - To adapt the filter lengths more precisely to the phoneme durations (e.g. through speech rate)
Step 3 (from phoneme estimates to words)
  What affects/determines the performance:
  - How well the keyword's phonemes differentiate the keyword from the background
  - How the single phoneme estimates are combined into a word estimate
  - The length of the keyword
  Places for improvement:
  - To extract complementary information from the non-keyword phonemes to avoid false alarms

27 Appendix: false alarms from similar phoneme streams
• The approach (method 1 in step 3) does not verify that the detected phoneme stream is the complete underlying stream
• Problem: false alarms
  - Example (keyword "nine"), label: .. s eh v ah n w ah n ..
  - Example (keyword "two"), label: .. t r uw th ..
• Solution: make sure there are no extra phonemes between two keyword phonemes, by looking only at the target sounds
[Figure: detected keyword phonemes with the question "Extraneous phoneme?" between them]

28 Appendix: Phoneme intervals
Histograms of distances (in 10 ms frames) between phonemes of the word "one" (w - ah, ah - n and w - n).

29 Appendix: Average and variance filters

30 Appendix: Hard case - weak posteriors and classification error

