Hierarchical Approach for Spotting Keywords from an Acoustic Stream. Supervisor: Professor Raimo Kantola. Instructor: Professor Hynek Hermansky, IDIAP Research Institute.



Slide 2: Introduction to the thesis
- Existing keyword spotting approaches are usually based on speech recognition techniques
- Growing apart from the original problem can lead to drawbacks, like lack of generality
- Another approach is presented and studied, where only the target sounds of the keyword are looked for
- To study and formulate this approach was my work at IDIAP Research Institute, 3/ /2005
Objective of the thesis: to see how far we can go without using hidden Markov models and dynamic programming techniques

Slide 3: Outline
- Introduction to keyword spotting
- Motivation for this work
- Steps of hierarchical processing
- Experiments
- Conclusions

Slide 4: Keyword Spotting
- Keyword Spotting (KWS) aims at finding only certain words while rejecting the rest (hypothesis – test)
- Finding only certain, rare and high-information-valued words is a feasible approach in, for example, voice-command-driven applications or multimedia indexing
Picture from [Jun96]

Slide 5: Performance measures for keyword spotting
- The possible events in keyword spotting are hit, false alarm and miss
- The performance is evaluated by presenting the detection rate as a function of the false alarm rate
- This yields the receiver operating characteristics (ROC) curve
- The average detection rate in the range of 0-10 false alarms per hour is called the figure of merit (FOM) [Roh89]
[Figure: ROC curve, False Alarms / Hour vs. Keywords detected / %]
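The FOM described above can be computed from a list of scored putative hits. A minimal sketch, assuming we have per-detection confidence scores, hit/false-alarm labels, the number of true keyword occurrences, and the corpus duration in hours (all names are hypothetical, not from the thesis):

```python
import numpy as np

def figure_of_merit(scores, is_hit, n_keywords, audio_hours):
    """Average detection rate over the 0-10 false-alarms-per-hour
    range of the ROC curve (illustrative sketch, after [Roh89])."""
    order = np.argsort(scores)[::-1]            # sweep the threshold from high to low
    hits = np.cumsum(np.asarray(is_hit)[order])
    fas = np.cumsum(~np.asarray(is_hit)[order])
    det_rate = hits / n_keywords                # detection rate at each operating point
    fa_per_hour = fas / audio_hours
    # detection rate reached at 1..10 false alarms per hour, averaged
    points = [det_rate[fa_per_hour <= k][-1] for k in range(1, 11)]
    return float(np.mean(points))
```

Each prefix of the score-sorted detection list is one operating point of the ROC curve; the FOM simply averages ten of these points.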

Slide 6: LVCSR / HMM based approaches
- Typical large vocabulary continuous speech recognition (LVCSR) / hidden Markov model (HMM) based KWS approaches model both keywords and non-keywords (background or garbage)
- Keywords are searched for by using dynamic programming techniques
Keyword spotting network from [Roh89].
[Figure: an example of dynamic programming, showing the optimal alignment between sequences X = x1…xN and Y = y1…yM]
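The dynamic programming alignment in the figure can be illustrated with a bare-bones dynamic time warping (DTW) routine; this is a generic sketch, not the implementation used in any of the cited systems:

```python
import numpy as np

def dtw(X, Y, dist=lambda a, b: abs(a - b)):
    """Cost of the optimal alignment between sequences X and Y
    under the classic three-move DTW recursion (illustrative only)."""
    N, M = len(X), len(Y)
    D = np.full((N + 1, M + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            d = dist(X[i - 1], Y[j - 1])
            # extend the cheapest of: deletion, insertion, diagonal match
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[N, M]
```

Backtracking through `D` would recover the alignment path itself; only the cost is returned here for brevity.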

Slide 7: LVCSR / HMM based approaches vs. hypothesis test approach
[Figure: the utterance "um... okay, uh... please open the, uh... window" processed by two word spotters (Spot1, Spot2: yes/no detections over time) and by a recognizer (Recog: garbage garbage garbage - OPEN - garbage - WINDOW)]

Slide 8: Motivation for this work
- Typical LVCSR / HMM based approaches require a garbage model for Viterbi dynamic programming
- The better the garbage model, the better the keyword spotting performance [Ros90], and the closer the system is to LVCSR
- Use of LVCSR techniques can introduce:
  - task dependency, lack of generality
  - computational load, complexity
  - need for training data
  - off-line operating mode
  - complexity to add keywords
How far can we go by looking only at the keysounds?

Slide 9: Hierarchical approach for spotting keywords
- Key sounds (words) are spotted by looking for the target sounds (phonemes) that form the key sound.
- STEP 1: Estimate equally sampled phoneme posteriors
- STEP 2: Derive phoneme-spaced posterior estimates
- STEP 3: Search for the right sequences of high-confidence phonemes → ALARM

Slide 10: Step 1: From acoustic stream to phoneme posteriors
- TRAP-NN system:
  - Feature extraction from 2-D filtering of the critical band spectrogram, using 1010 ms long temporal patterns (TRAPs)
  - Features are fed to a trained neural net (NN) vector classifier that returns estimates of phoneme posterior probabilities every 10 ms
- TRAP-NN was successfully used in [Szö05] for phoneme based keyword spotting

Slide 11: Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors
- Phonemes are found by filtering the posteriogram with a bank of matched filters
- Matched filters are obtained by averaging 0.5 s long segments of phoneme trajectories
- The purpose of the filtering is to have one peak per phoneme
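Averaging trajectory segments into a matched filter can be sketched as follows; the segment extraction and the unit-sum normalization are assumptions for illustration, not details taken from the thesis:

```python
import numpy as np

def matched_filter(trajectories):
    """Build one phoneme's matched filter by averaging fixed-length
    (e.g. 0.5 s = 50-frame) posterior-trajectory segments centered on
    labeled instances of that phoneme (illustrative sketch)."""
    f = np.mean(np.stack(trajectories), axis=0)  # average segment shape
    return f / np.sum(f)                         # unit sum, so filtering preserves scale
```

One such filter would be estimated per phoneme, yielding the filter bank shown on the next slides.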

Slide 12: Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors (2)
- The local maxima (peaks) of the filtered posteriogram are extracted and taken as estimates of underlying phonemes being present
- The places of the peaks correspond to the center frames of the underlying phonemes
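Filtering one phoneme's posterior trajectory and picking the local maxima can be sketched like this (function and parameter names are illustrative, not from the thesis):

```python
import numpy as np

def phoneme_peaks(posteriors, matched_filter, threshold=0.0):
    """Filter a frame-level posterior trajectory with the phoneme's
    matched filter and return its local maxima as (frame, value) pairs.
    Sketch of Step 2 of the hierarchy."""
    filtered = np.convolve(posteriors, matched_filter, mode="same")
    peaks = []
    for t in range(1, len(filtered) - 1):
        # a peak: not smaller than the left neighbor, larger than the right one
        if filtered[t] >= filtered[t - 1] and filtered[t] > filtered[t + 1] \
                and filtered[t] > threshold:
            peaks.append((t, float(filtered[t])))
    return peaks
```

With a well-shaped matched filter, a phoneme occurrence produces a single peak near its center frame, as the slide states.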

Slide 13: Step 2: From frame-based phoneme posteriors to phoneme-spaced posteriors (3)
- Matched filter bank, estimated from 30,000 phonemes of the training data (English numbers)
- Filter lengths are 41 samples (210 ms processing delay)

Slide 14: Step 3: From phoneme estimates to words
- Method 1:
  - A posterior threshold is applied to the phoneme estimates
  - An alarm is set for a correct stream of phonemes
  - Minimum and maximum intervals between phonemes are defined from the training data
  - Only the primary lexical form of each word is searched for
[Figure: phoneme estimates against the threshold]
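Method 1 can be sketched as a greedy search over the per-phoneme peak lists; this is a simplification under stated assumptions (earliest qualifying peak is taken, a single shared interval bound is used), not the thesis' exact procedure:

```python
def spot_keyword(peaks_by_phoneme, threshold, min_gap, max_gap):
    """Raise an alarm when the keyword's phonemes appear in order,
    each above the posterior threshold and within the allowed
    inter-phoneme interval (in frames). Illustrative sketch of Step 3."""
    alarms = []
    for t0, p0 in peaks_by_phoneme[0]:
        if p0 < threshold:
            continue
        t_prev, ok = t0, True
        for peaks in peaks_by_phoneme[1:]:
            # candidate peaks for the next phoneme, in the allowed window
            nxt = [t for t, p in peaks
                   if p >= threshold and min_gap <= t - t_prev <= max_gap]
            if not nxt:
                ok = False
                break
            t_prev = nxt[0]          # greedily take the earliest candidate
        if ok:
            alarms.append((t0, t_prev))   # (start frame, end frame)
    return alarms
```

In the thesis the interval bounds are estimated per phoneme pair from the training data; a single (`min_gap`, `max_gap`) pair is used here only to keep the sketch short.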

Slide 15: Experiments
- Two telephone corpora were used [Col94, Col95]
- The MLP was trained to estimate the posterior probabilities of 28 English phonemes + silence (numbers from zero to ninety-nine)
- A separate keyword spotter was implemented for all digits from zero to nine, with only the primary lexical forms
- Results were compared to time-aligned phonemic labeling, and all legal pronunciations were treated as true alarms

Slide 16: Results – Experiment 1 (phoneme estimates only)
[Table: keyword spotting results (FOM) from spotting digits in a stream of other digits (OGI-Numbers95), experiment 1 (only phoneme estimates)]
Two main reasons for differences in performance:
1. Some phonemes are more prone to classification errors
2. The probability that a keyword is mixed with another word is not constant

Slide 17: Introduction of the phoneme transition probability
- Introduction of a confidence measure that tells whether there are extraneous phonemes between two phonemes: the phoneme transition probability
- The phoneme transition probability is estimated using:
  - Strategy 1: the height of the crossing point of the posterior trajectories of the corresponding phonemes
  - Strategy 2: the height of the crossing point of the filtered posterior trajectories
  - Strategy 3: one minus the minimum of the sum of the posteriors of the corresponding phonemes, between the phoneme estimates
- New method for Step 3 (with transition probabilities): the posterior threshold is applied to the product of the phoneme and transition estimates
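Strategy 1 can be sketched as follows; since the slide gives no numerical recipe, the crossing height is approximated here by the frame where the two trajectories are closest (an assumption for illustration only):

```python
import numpy as np

def transition_prob_strategy1(post_a, post_b, t_a, t_b):
    """Approximate the height of the crossing point of two phonemes'
    frame-level posterior trajectories between their peak frames
    t_a < t_b (illustrative sketch of Strategy 1)."""
    a = np.asarray(post_a[t_a:t_b + 1], dtype=float)
    b = np.asarray(post_b[t_a:t_b + 1], dtype=float)
    i = int(np.argmin(np.abs(a - b)))   # frame where the trajectories meet
    return float((a[i] + b[i]) / 2.0)   # height at (approximately) the crossing
```

If an extraneous phoneme intervenes, both trajectories dip before crossing, so the crossing height, and hence the transition confidence, is low.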

Slide 18: Results – Experiment 2 (phoneme and transition estimates)
[Table: keyword spotting results (FOM) from spotting digits in a stream of other digits (OGI-Numbers95), experiment 2 (with phoneme transition probability estimates)]
The average increase in FOM compared to the first experiment is 5.6%
Only small differences between the different strategies of deriving the phoneme transition estimates.

Slide 19: ROC curve – 'zero'

Slide 20: ROC curve – 'eight'

Slide 21: Conclusions
- A theoretical framework for keysound spotting was introduced and used to spot digits. Besides keyword spotting, the proposed processing can be applied in:
  - Phoneme detection (experimented with in the thesis)
  - Event spotting in general
- This approach has no garbage model, and no dynamic programming techniques or HMMs are used
- Benefits from looking only at the target sounds:
  - Independence from vocabulary
  - Some independence from language
  - Less need for training the models
  - Simple and fast
- Relies on reliable phoneme estimates
  - Quite robust to the choice of matched filter and phoneme sequence search technique
- High variance in results between different words
  - Short phonemes yield weaker estimates
- Room to improve the performance:
  - Treat closure forms of plosive phonemes
  - Look for all the possible pronunciation forms
  - Use the non-keyword phoneme estimates to extract complementary information
  - Introduce prior lexical knowledge

Slide 22: Questions?
[Jun96] Junqua, J.C., Haton, J.-P.: Robustness in Automatic Speech Recognition, Fundamentals and Applications. Dordrecht, The Netherlands, Kluwer Academic Publishers.
[Roh89] Rohlicek, J., Russell, W., Roukos, S., Gish, H.: Continuous Hidden Markov Modeling for Speaker-Independent Word-Spotting. In ICASSP 89.
[Ros90] Rose, R., Paul, D.: A Hidden Markov Model Based Keyword Recognition System. In Proceedings of ICASSP 90, Albuquerque, New Mexico, United States.
[Szö05] Szöke, I., Schwarz, P., Matejka, P., Burget, L., Fapso, M., Karafiát, M., Cernocký, J.: Comparison of Keyword Spotting Approaches for Informal Continuous Speech. In MLMI 05, Edinburgh, United Kingdom, July 2005.
[Col94] Cole, R. et al.: Telephone Speech Corpus Development at CSLU. In Proceedings of ICSLP '94, Yokohama, Japan.
[Col95] Cole, R. et al.: New Telephone Speech Corpora at CSLU. In Proceedings of Eurospeech '95, Madrid, Spain.
Lehtonen, M., Fousek, P., Hermansky, H.: A Hierarchical Approach for Spotting Keywords. In 2nd Workshop on Multimodal Interaction and Related Machine Learning Algorithms – MLMI 05, Edinburgh, United Kingdom, July 2005.

Slide 23: Appendix: Application to phoneme detection
- The phoneme estimates of Step 2 were used in phoneme detection
- The phoneme stream was estimated by counting all the phoneme estimates over a threshold, with different threshold values
- Results were estimated in terms of substitutions (S), insertions (I) and deletions (D)
- For example (N = number of phonemes in the labeling):
Labeled: s eh v ah n f ay v
Recognized: sil n eh v n f ay v
Operations: I S D
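The S/I/D counts in the example come from a minimum-edit-distance alignment of the labeled and recognized phoneme strings. A standard Levenshtein sketch (not the thesis' scoring code) that returns the three counts:

```python
def align_errors(ref, hyp):
    """Levenshtein alignment of reference vs. recognized phoneme
    sequences, returning (substitutions, insertions, deletions)."""
    n, m = len(ref), len(hyp)
    # D[i][j] = (cost, S, I, D) for aligning ref[:i] with hyp[:j]
    D = [[None] * (m + 1) for _ in range(n + 1)]
    D[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):                      # only deletions
        c = D[i - 1][0]
        D[i][0] = (c[0] + 1, c[1], c[2], c[3] + 1)
    for j in range(1, m + 1):                      # only insertions
        c = D[0][j - 1]
        D[0][j] = (c[0] + 1, c[1], c[2] + 1, c[3])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 0 if ref[i - 1] == hyp[j - 1] else 1
            cand = [
                (D[i - 1][j - 1][0] + match, 'S' if match else 'M', D[i - 1][j - 1]),
                (D[i - 1][j][0] + 1, 'D', D[i - 1][j]),
                (D[i][j - 1][0] + 1, 'I', D[i][j - 1]),
            ]
            cost, op, prev = min(cand, key=lambda x: x[0])
            s, ins, dele = prev[1], prev[2], prev[3]
            if op == 'S':
                s += 1
            elif op == 'I':
                ins += 1
            elif op == 'D':
                dele += 1
            D[i][j] = (cost, s, ins, dele)
    return D[n][m][1:]   # (S, I, D)
```

Running it on the slide's example yields one insertion, one substitution and one deletion, matching the "I S D" annotation.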

Slide 24: Appendix: Application to phoneme detection (cont.)
[Table: phoneme detection accuracy (%) as a function of the threshold; numeric values not recovered in this transcript]
Taking into account also the transition probabilities yielded % accuracy.
State-of-the-art phoneme recognition accuracy for unrestricted speech is 67% - 77%.

Slide 25: Appendix: System diagram

Slide 26: Appendix: Conclusions (table)
Step 1 (from acoustic stream to phoneme posteriors)
- What affects/determines the performance: the phoneme's proneness to classification errors; the phoneme's duration (longer phonemes yield stronger posteriors)
- Places for improvement: treat the closure form phonemes
Step 2 (from frame-based posteriors to phoneme-spaced posteriors)
- What affects/determines the performance: how the matched filter models the duration of the phoneme
- Places for improvement: adapt the filter lengths more precisely to the phoneme durations (e.g. through speech rate)
Step 3 (from phoneme estimates to words)
- What affects/determines the performance: how well the keyword's phonemes differentiate the keyword from the background; how the single phoneme estimates are combined into a word estimate; the length of the keyword
- Places for improvement: extract complementary information from the non-keyword phonemes to avoid false alarms

Slide 27: Appendix: False alarms from similar phoneme streams
- The approach (method 1 in Step 3) does not ensure that the detected phoneme stream is the complete underlying stream
- Problem: false alarms
  - Example label: .. s eh v ah n w ah n ..
  - Example label: .. t r uw th ..
[Figure: the keywords 'nine' and 'two' falsely detected; extraneous phoneme?]
- Solution: make sure there are no extraneous phonemes between two keyword phonemes, by looking only at the target sounds

Slide 28: Appendix: Phoneme intervals
Histograms of distances (in 10 ms frames) between the phonemes of the word one (w – ah, ah – n and w – n).

Slide 29: Appendix: Average and variance filters

Slide 30: Appendix: Hard case: weak posteriors and classification error