Presentation transcript:

Speaker Identification: Speaker Modeling Techniques and Feature Sets
Alex Park
SLS Affiliates Meeting, December 16, 2002
Advisor: T.J. Hazen

Overview
- Modeling Techniques
  - Baseline GMM (ASR Independent)
  - Speaker Adaptive (ASR Dependent)
  - Score Combination
  - Multiple Utterance Results in Mercury
- Feature Set Experiments
  - Comparison of Formants and F0 vs. MFCCs in TIMIT
- Current and Future Work

Modeling - Baseline (GMM)

Training:
- Input waveforms for speaker "i" are split into fixed-length frames
- Feature vectors are computed from each frame of speech
- GMMs are trained from the set of feature vectors, one GMM per speaker

Testing:
- Input feature vectors are scored against each speaker's GMM
- Frame scores for each speaker are summed over the entire utterance
- The highest total score gives the hypothesized speaker

Speaker notes:
- Feature vectors are composed of MFCC measurements from points in the frame
- Data for all of a speaker's speech is lumped into a single probability model for that speaker
- At test time, the score for a particular speaker is found by scoring the feature vectors from the test segment against that speaker's GMM
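
Below is a minimal sketch of this baseline pass, assuming per-speaker MFCC frame matrices and using scikit-learn's GaussianMixture; the mixture size and variable names are illustrative, not taken from the original system.

    # Baseline GMM speaker ID: one GMM per speaker, frame log-likelihoods
    # summed over the utterance, argmax over speakers.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_speaker_gmms(train_frames, n_components=64):
        """train_frames: {speaker: (n_frames, n_mfcc) array of MFCC vectors}."""
        gmms = {}
        for spk, frames in train_frames.items():
            gmm = GaussianMixture(n_components=n_components,
                                  covariance_type="diag", max_iter=200)
            gmm.fit(frames)   # one probability model for all of spk's speech
            gmms[spk] = gmm
        return gmms

    def identify(gmms, test_frames):
        """Sum per-frame log-likelihoods; the highest total is the hypothesis."""
        scores = {spk: gmm.score_samples(test_frames).sum()
                  for spk, gmm in gmms.items()}
        return max(scores, key=scores.get), scores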

Modeling - Speaker Adaptive Scoring

Training:
- Build speaker-dependent (SD) recognition models for each speaker

Testing:
- Get the best hypothesis from the recognizer using speaker-independent (SI) models
- Rescore the hypothesis with the SD models
- Compute the total speaker adapted score by interpolating the SD score with the SI score

[Diagram: the test utterance "fifty-five" is recognized by SUMMIT with SI models into a phone sequence (f ih t tcl iy ... f ay v); each phone receives an SI score (x1, x2, ...) and an SD score (y1, y2, ...) from speaker "i"'s models, and the two are interpolated into the speaker adapted score for speaker "i".]

Speaker notes:
- The interpolation factor is proportional to how much data there is for a particular phone model for speaker "i"
- The speaker adapted score is used rather than the raw speaker dependent score so that scores remain robust for phones that don't have much speaker dependent training data
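
As a sketch of the interpolation step, assume per-phone SI scores (x1..xn), SD scores (y1..yn), and SD training counts per phone. The smoothing rule lambda = n / (n + tau) is an assumed, commonly used choice; the talk only states that the factor grows with the amount of SD data for each phone.

    def speaker_adapted_score(si_scores, sd_scores, phone_counts, tau=10.0):
        """Interpolate SD and SI phone scores, trusting the SD model more
        for phones with more speaker-specific training data."""
        total = 0.0
        for x, y, n in zip(si_scores, sd_scores, phone_counts):
            lam = n / (n + tau)   # near 1 with plentiful SD data, near 0 with none
            total += lam * y + (1.0 - lam) * x
        return total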

Modeling - Score Combination and Specifics

- Speaker ID is done in two passes:
  1. An initial n-best list of speakers is computed using the GMM speaker models
  2. The n-best list is rescored using second-stage models
- N-best pruning reduces computation for the refined models
- Refined models can use recognition results
- Multiple speaker ID techniques can be combined in the 2nd stage, e.g. Multigrained + Speaker Adapted

[Diagram: the test utterance ("fifty") is fed in parallel to the GMM SID models and the ASR in the 1st stage, producing a speaker n-best list (1. speaker "i", 2. speaker "j", ...) and a word hypothesis; the 2nd-stage classifiers then rescore the n-best list (1. speaker "k", 2. speaker "j", ...) to give the refined speaker ID.]

Speaker notes:
- This is an overview of what is actually done in the system
- GMM speaker ID and the speech recognizer both run in parallel, fairly quickly
- Two reasons for the design: (1) the later models need the recognition output of SUMMIT, and (2) n-best list pruning reduces the space of speakers that the more refined models need to search
- An added advantage is that it is easy to combine the scores of multiple 2nd-stage scoring techniques
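
A minimal sketch of this two-pass flow, assuming a simple scorer interface and an equal-weight sum for combining second-stage classifiers (both illustrative assumptions):

    def two_pass_speaker_id(test_utt, gmm_scorer, second_stage, n_best=5):
        """gmm_scorer(utt) -> {speaker: score};
        each model in second_stage is a callable (utt, speaker) -> score."""
        # Pass 1: fast GMM scoring prunes the speaker set to an n-best list.
        first = gmm_scorer(test_utt)
        shortlist = sorted(first, key=first.get, reverse=True)[:n_best]
        # Pass 2: only the shortlist is rescored by the refined models;
        # combining several techniques is just a sum of their scores here.
        rescored = {spk: sum(m(test_utt, spk) for m in second_stage)
                    for spk in shortlist}
        return sorted(rescored, key=rescored.get, reverse=True)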

Modeling - Single Utterance Results

- Compared modeling techniques on single utterances in YOHO and Mercury
- For YOHO, a standard speaker verification corpus, identification error rates were very low: GMM 0.83%, Speaker Adaptive 0.31%
- For Mercury, an in-house corpus, identification error rates were much higher: GMM 22.4%, Speaker Adaptive 27.8%
- The higher error rates are likely due to the effects of varied recording conditions and spontaneous speech
- Combining classifiers lowered error rates in both domains:
  - YOHO: GMM + Phonetically Structured GMM, 0.25%
  - Mercury: GMM + Phonetically Structured GMM, 18.3%

Modeling - Multiple Utterance Results

- On multiple utterances, speaker adaptive scoring achieves lower error rates than the next best individual method
- Relative error rate reductions of 28%, 39%, and 53% on 3, 5, and 10 utterances compared to the baseline

[Table residue, pairing not fully recoverable: 11.6%, 5.5%, 14.3%, 10.3%, 13.1%, 7.4%]

Speaker notes:
- In YOHO, the best reported results on the identification task are around a 0.7% error rate; most of these techniques beat that error rate, but the significance of the numbers has to be examined
- The large difference in error rates between YOHO and Mercury is attributed to the relative difficulty of the two tasks (see earlier)
- The speaker adapted method not performing as well in Mercury can be due to word recognition errors, to which it is more prone than the other methods
- Score combination is good
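
For reference, relative error rate reduction is (E_baseline - E_new) / E_baseline; for example, a drop from 14.3% to 10.3% (two of the residue values above, plausibly the 3-utterance pair) gives 4.0 / 14.3 ≈ 28%.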

Feature Set - Outline and Experiments

- Compared the performance of non-MFCC features in mismatched conditions
- Used formants and F0, computed offline using ESPS
- Global speaker GMMs trained using formant and F0 values and trajectories in voiced regions
- Evaluation performed using a closed-set speaker ID task on TIMIT and NTIMIT
- Mismatched conditions used to evaluate the robustness of feature extraction in telephone conditions
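
The formant and F0 values were computed offline with ESPS; as an illustrative stand-in (not the ESPS algorithm), here is a minimal autocorrelation F0 estimate for one voiced frame, with the sample rate and search range as assumptions:

    import numpy as np

    def estimate_f0(frame, sr=16000, fmin=60.0, fmax=400.0):
        """Crude F0 estimate (Hz) for a windowed voiced speech frame."""
        frame = frame - frame.mean()
        ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
        lo, hi = int(sr / fmax), int(sr / fmin)   # candidate pitch lags
        lag = lo + int(np.argmax(ac[lo:hi]))      # strongest periodicity
        return sr / lag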

Feature Set - Results in Mismatched Conditions

- Compared the performance of non-MFCC features in mismatched conditions:

  Identification Accuracy (%)

  Trained / Tested   MFCC    F1,F2,F3,F4   F1,F2   F1     F2     F3    F4    F0
  TIMIT / TIMIT      100.0   64.6          24.7    10.7   9.2    18.8  26.8  45.5
  TIMIT / NTIMIT     n/a     1.2           4.8     9.9    9.0    6.0   3.0   39.1

- MFCCs are not well estimated in mismatched conditions; 53% accuracy when trained and tested on NTIMIT
- F3 and F4 perform better in matched conditions, but show greater degradation in accuracy than F1 and F2
- The F3 and F4 performance degradations are likely due to band-limiting in NTIMIT
- F0 has the best individual performance with the least degradation

Extensions and Future Work

- Explored additional scoring strategies for verification
- Currently incorporating speaker recognition into existing applications:
  - Using speaker verification with Orion (Hazen)
  - Combining speaker ID with face ID on the iPAQ for Oxygen (Hazen and Weinstein)
- Use phone-specific models for formant and F0 features
- Incorporate duration as an additional feature for the speaker adaptive approach