HIWIRE Progress Report, Chania, May 2007. Presenter: Prof. Alex Potamianos, Technical University of Crete.


 Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2) Outline

 Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2) Outline

 Combining several sources of information to improve performance.  Unfortunately, for different environments and noise conditions, not all sources of information are equally reliable.  Mismatch between training and test conditions. Goal:  Propose estimators of optimal stream weights s_i that can be computed in an unsupervised manner. Motivation

 Equal error rate in single-stream classifiers  Equal estimation error variance in each stream Optimal Stream Weights
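To make the stream-weighting idea concrete, here is a minimal sketch assuming exponent-style stream weights applied to per-class log-likelihoods; the function name, example scores, and weight values are illustrative, not taken from the experiments above.

```python
import numpy as np

def combined_log_likelihood(logp_audio, logp_video, s_audio, s_video):
    """Weighted multi-stream score: each stream's per-class log-likelihoods
    are scaled by its stream weight and summed."""
    return s_audio * np.asarray(logp_audio) + s_video * np.asarray(logp_video)

# Hypothetical per-class log-likelihoods from the audio and visual streams.
logp_a = np.array([-12.3, -10.1, -15.7])
logp_v = np.array([-9.8, -11.2, -10.5])

# Weights typically sum to a constant (here 1.0); the decision is the argmax.
scores = combined_log_likelihood(logp_a, logp_v, s_audio=0.7, s_video=0.3)
best_class = int(np.argmax(scores))
```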

 Subset of the CUAVE database used: 36 speakers (30 training, 6 testing); 5 sequences of 10 connected digits per speaker; training set: 1500 digits (30x5x10); test set: 300 digits (6x5x10).  Features: audio: 39 features (MFCC_D_A); visual: 39 features (ROIDCT_D_A, odd columns).  Multi-stream HMM models: 8-state, left-to-right, whole-digit HMMs; single Gaussian mixture; the AV-HMM uses separate audio and video feature streams. Experimental Results

 Two classes → anti-models.  Class membership → inter- and intra-class distances. Results (classification)

 Generalization of the inter- and intra-distance measure → inter-distance among all the classes. Results (recognition)

 Stream weight computation for a multi-class classification task, based on theoretical results for two-class classification and the use of an anti-model technique.  We use only the test utterance and the information contained in the trained models.  Generalization towards unsupervised estimation of stream weights for multi-stream classification and recognition problems. Conclusions

 Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2) Outline

Vocal Tract Length Normalization.  Dependence between warping and phonemes.  Frame Segmentation into Regions.  Warping Factor and Function Estimation.  VTLN in Recognition.  Evaluation.

Dependence between warping and phonemes [1]. Examining the similarity between two frames before and after warping:  For each phoneme and speaker, and for the middle frame of the utterance, the average spectral envelope is computed.  An optimal warping factor is computed (for each phoneme's utterance) so that the MSE between the warped spectrum and the corresponding unwarped spectrum is minimized. Optimization is achieved by a full search over warping factors ranging from 0.8 to 1.2, where 1 corresponds to no warping.  The mapped spectrum is warped according to this optimal warping factor.
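As an illustration of this full search, the sketch below grid-searches a simple linear frequency warp for the factor that minimizes the MSE against a reference envelope; the linear warping function, helper names, and grid resolution are assumptions made for the example, not the exact procedure of the report.

```python
import numpy as np

def warp_spectrum(spectrum, freqs, alpha):
    """Linearly warp a spectral envelope by factor alpha (alpha = 1: no warping).
    S_warped(f) = S(f / alpha), evaluated by interpolation on the original grid."""
    return np.interp(freqs / alpha, freqs, spectrum)

def optimal_warping_factor(spectrum, reference, freqs,
                           alphas=np.linspace(0.8, 1.2, 41)):
    """Full search over candidate factors, keeping the one that minimizes the
    MSE between the warped spectrum and the reference (unwarped) spectrum."""
    errors = [np.mean((warp_spectrum(spectrum, freqs, a) - reference) ** 2)
              for a in alphas]
    return float(alphas[int(np.argmin(errors))])
```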

Dependence between warping and phonemes [2]. Bi-parametric warping function (2 pts).  Different warping factors are evaluated for the low (f < 3 kHz) and high (f ≥ 3 kHz) frequencies.  Constraints: …, and step ….  A full search over the 25 (5²) candidate warping functions provides the optimal pair of warping factors. Four-parametric warping function (4 pts).  Different warping factors are evaluated for the frequency ranges 0–1.5, 1.5–3, and … kHz.  The constraints and step remain the same as in the bi-parametric case.  Full search over the 625 (5⁴) different candidate warping functions. Bias addition before the warping process.  Based on the ML algorithm, we evaluate a linear bias that minimizes the spectral distance between the reference and mapped spectra.  The extracted linear bias is added to the unwarped mapped spectrum.

Results (over all speakers) after bias addition.

Frame Segmentation into Regions.  Based on the unsupervised K-means algorithm, the sequence of a test utterance's frames, of length M, is divided into a number of regions that we specify.  The algorithm's output is a function F between each frame m and the corresponding region index c.  As an additional constraint, median filtering is applied to the sequence of region indices. This smooths the sequence of indices so that it reflects a more physiologically plausible pattern of region transitions between successive frames.
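A minimal sketch of this segmentation step, assuming scikit-learn's KMeans and SciPy's median filter as stand-ins for the clustering and smoothing described above; the function name, default number of regions, and filter length are illustrative.

```python
import numpy as np
from scipy.signal import medfilt
from sklearn.cluster import KMeans

def segment_frames_into_regions(frames, n_regions=2, filter_len=5):
    """Cluster the utterance's feature frames (shape M x D) into regions with
    K-means, then median-filter the region-index sequence so that transitions
    between successive frames are smoother and more physiologically plausible."""
    labels = KMeans(n_clusters=n_regions, n_init=10, random_state=0).fit_predict(frames)
    smoothed = medfilt(labels.astype(float), kernel_size=filter_len)
    return smoothed.astype(int)   # region index c for each frame m
```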

Warping Factor and Function Estimation.  After the division of frames into regions, an optimal factor and function for each region is obtained by maximizing the likelihood of the warped vectors with respect to the transcription from the first pass and the un-normalized hidden Markov model, where the warped utterance is the test utterance in which every frame, after its categorization into region c, is warped according to one of the R candidate factors and one of the N candidate functions. The optimal warping factor for each region is obtained by searching over values between 0.88 and 1.12 with a fixed step.  λ is the hidden Markov model trained with unnormalized training vectors.  W is the transcription obtained from the first pass.

VTLN in Recognition. During recognition, since a preliminary transcription of the test utterance is not given, a multiple-pass strategy is introduced:  A preliminary transcription W is obtained through a first recognition pass using the unwarped sequence of cepstral vectors X and the unnormalized model λ.  The utterance's frames are categorized into c regions.  For each region c, an optimal warping factor and function are evaluated through a multi-dimensional grid search.  After computing the vectors associated with the optimal per-region factor and function, the optimally warped sequence is decoded to obtain the final recognition result. A sketch of this multi-pass loop follows below.
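The sketch below strings the passes together, assuming hypothetical decode, log_likelihood, and warp_features helpers standing in for the recognizer with the unnormalized model λ, and reusing the segment_frames_into_regions sketch above; it illustrates the control flow only, not the actual implementation.

```python
import itertools

def vtln_recognize(X, model, candidate_factors, candidate_functions, n_regions=2):
    # Pass 1: preliminary transcription W from the unwarped cepstral vectors X.
    W = decode(model, X)
    # Segment the utterance's frames into regions (e.g. K-means + median filter).
    regions = segment_frames_into_regions(X, n_regions=n_regions)
    # Per-region grid search over (factor, function) pairs, maximizing the
    # likelihood of the warped vectors given W and the unnormalized model.
    best = {}
    for c in range(n_regions):
        candidates = itertools.product(candidate_factors, candidate_functions)
        best[c] = max(candidates,
                      key=lambda fc: log_likelihood(model, W,
                                                    warp_features(X, regions, c, *fc)))
    # Final pass: decode the optimally warped feature sequence.
    X_warped = X
    for c, (factor, func) in best.items():
        X_warped = warp_features(X_warped, regions, c, factor, func)
    return decode(model, X_warped)
```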

Results, WER (%):
  # of Utters:          15               …
  Baseline:             50.83
  Li & Rose (2 pass):   …
  2 regions:            41.73 (+4.7%)    42.79 (+1.60)
  3 regions:            43.11 (+1.56)    43.66 (-0.46)

 Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2) Outline

The Linear Dynamic Model (LDM)  Discrete-time linear dynamical system: x_k = F x_{k-1} + w_k, y_k = H x_k + v_k.  Efficiently models the evolution of spectral dynamics.  An observation y_k is produced at each time step.  The state process is first-order Markov; the initial state is Gaussian.  The state and observation noises w_k, v_k are: uncorrelated, temporally white, zero-mean, Gaussian distributed.
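A minimal sketch of the LDM's generative process, with small example matrices chosen only for illustration (they are not parameters estimated in the experiments below).

```python
import numpy as np

def simulate_ldm(F, H, Q, R, x0, n_steps, seed=0):
    """Generate observations from a discrete-time linear dynamic model:
        x_k = F x_{k-1} + w_k,  w_k ~ N(0, Q)
        y_k = H x_k     + v_k,  v_k ~ N(0, R)"""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    ys = []
    for _ in range(n_steps):
        x = F @ x + rng.multivariate_normal(np.zeros(len(x)), Q)
        y = H @ x + rng.multivariate_normal(np.zeros(H.shape[0]), R)
        ys.append(y)
    return np.array(ys)

# Hypothetical 2-dimensional state, 1-dimensional observation.
F = np.array([[0.9, 0.1], [0.0, 0.8]])
H = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.1]])
observations = simulate_ldm(F, H, Q, R, x0=[1.0, 0.0], n_steps=50)
```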

 Noise covariances are not constrained  Matrices F,H have canonical forms  Canonical form is identifiable if it is also controllable (Ljung) Generalized canonical form of LDM

Experimental Setup  Training Set Aurora 2 Clean Database 3800 training sentences  Test set: Aurora 2, test A, subway sentences 1000 test sentences Different levels of noise ( Clean, SNR: 20, 15, 10, 5 dB )  Front-End extracts 14-dimensional features (static features): HTK standard front-end 2 feature configurations –12 Cepstral Coefficients + C0 + Energy –+ first and second order derivatives (δ, δδ)

Model Training on Speech Data  Word models with different numbers of segments, based on the phonetic transcription.  Segment alignments produced using HTK.
  Segments   Models
  2          oh
  4          two, eight
  6          one, three, four, five, six, nine, zero
  8          seven

Classification process  Keep the true word boundaries fixed; digit-level alignments produced by an HMM.  Apply a suboptimal search and pruning algorithm, keeping the 11 most probable word histories for each word in the sentence.  Classification is based on maximizing the likelihood.

Classification results  Comparison of LDM Segment-Models and HTK HMM classification (% Accuracy) Same Front-End configuration, same alignments Both Models trained on clean training data AURORA Subway HMM (HTK)LDMs MFCC, E+δ +δδMFCC, E+δ +δδ Clean97,19%97,57%97,53% 97,61% SNR2090,91%95,71%93,23%95,12% SNR1580,09%91,76%87,91%91,13% SNR1057,68%81,93%76,29%82,69% SNR536,01%64,24%54,87%63,56%

Classification results  Performance Comparison (MFCCs)

Classification results  Performance Comparison (MFCCs + δ + δδ)

Sub-optimal Viterbi decoding (SOVD)  We use a Viterbi-like decoding algorithm for speech classification.  The HMM-state equivalent in LDMs is the pair [x_k, s_i].  It is applied among the segments of each word model and provides segment alignments based on the likelihood of the LDM, estimated with a Kalman filter.  It allows decoding at each time k using possible histories leading to a different [x_k, s_i] combination at several depth levels (see the sketch below).
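A minimal sketch of such a beam-pruned, Viterbi-like search over LDM segments, assuming hypothetical initial_state and kalman_step helpers that run one Kalman-filter step for a segment and return the updated state plus the frame log-likelihood; the left-to-right transition rule and the beam width are illustrative choices, not the report's exact algorithm.

```python
import heapq

def sovd(frames, segments, beam=11):
    """Keep the `beam` most likely [x_k, s_i] hypotheses at each time step."""
    # Each hypothesis: (total_log_likelihood, segment_index, kalman_state)
    hyps = [(0.0, 0, initial_state(segments[0]))]
    for y in frames:
        expanded = []
        for logp, s, state in hyps:
            # Left-to-right segment topology: stay in segment s or move to s + 1.
            for s_next in {s, min(s + 1, len(segments) - 1)}:
                new_state, frame_logp = kalman_step(segments[s_next], state, y)
                expanded.append((logp + frame_logp, s_next, new_state))
        hyps = heapq.nlargest(beam, expanded, key=lambda h: h[0])  # prune
    return max(hyps, key=lambda h: h[0])
```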

SOVD Steps

Sub-Optimal Viterbi-like Search (trellis diagram: segments S1–S4 expanded over time frames t1–t5, with a state prediction F_i x_j at each node).

Visualization of Model Predictions  Trajectories of true and predicted observations for c_1, c_3.

Classification results  Comparison of Segment-Models and HTK HMM classification (% Accuracy) Same fixed Word-boundaries based on the HMM alignments Same Front-End configuration Both Models trained on clean training data AURORA Subway HMM-alignmentsSegment Models HMMLDMd=1d=2 Clean97,19%97,85%97,73% 97,76% SNR2090,91%92,53%93,52% SNR1580,09%85,93%89,68%89,77% SNR1057,68%71,30%77,21%77,33% SNR536,01%46,72%53,66%53,98%

Classification results (larger state dimension)  Comparison of SOVD-LDM for LDMs with several state dimensions. Same front-end configuration (MFCCs+E0+c0), same word alignments. AURORA 2, Subway:
            HMM        Segment Models (state dimension increasing left to right)
  Clean     97.19%     97.73%     98.22%     98.28%
  SNR20     90.91%     93.52%     92.22%     91.58%
  SNR15     80.09%     89.68%     84.98%     84.70%
  SNR10     57.68%     77.21%     73.27%     73.08%
  SNR5      36.01%     53.66%     55.12%     52.99%

Conclusions  We investigated generalized canonical forms for LDM  We proposed an element-wise ML estimation process  When alignments from an equivalent HMM Without derivatives LDMs significantly outperform HMMs particularly under highly noisy conditions When derivatives are used for both models their performance is similar

Conclusions  With segment alignments based on LDM HMM alignments hurt recognition performance Viterbi-like search for LDM  Larger-dimension Beneficial on clean data Performance degrades on noisy data  Future Lower-dimension, articulatory-based features Non-linear state-to-observation mappings

 Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2) Outline

Noise removal formulated as a BSS problem  I mutually uncorrelated speaker signals s_i(t).  J microphones.  Each microphone signal: x_j(t) = Σ_i a_ji s_i(t).  Compact form: x(t) = A s(t).  If A is invertible (W = A^-1): s(t) = W x(t).
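As a minimal sketch of estimating the unmixing matrix W, the snippet below applies FastICA from scikit-learn to instantaneously mixed microphone channels; this only illustrates the x = A s, s = W x formulation — real room recordings like the simulated ones below are convolutive mixtures and need a more elaborate (e.g. frequency-domain) BSS method.

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate_sources(mic_signals, n_sources):
    """mic_signals: array of shape (n_samples, n_mics) with the mixed channels.
    Returns the estimated source signals and the estimated unmixing matrix W."""
    ica = FastICA(n_components=n_sources, random_state=0)
    sources_est = ica.fit_transform(mic_signals)  # estimated s(t), up to scale/order
    W = ica.components_                           # estimated unmixing matrix
    return sources_est, W
```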

The simulated room  Used Douglas Campbell's "Roomsim".  Depicts the positions of the speakers and microphones.  The mixed file was the one received by the first microphone (top left).

Database  We considered Aurora 4 and TIMIT.  BSS shows better separability for speech signals longer than 30 s (Aurora 4 average utterance length ~7 s; TIMIT average utterance length ~3 s), so we concatenated sentences from the same speakers.  When there was no overlap over the whole duration, we replicated the shorter sentence with samples from its beginning.  We normalized the sources to ensure equal energy.

Experimental Setup  Test Set: 330 Utterances (AURORA4) 16KHz – 16bits  Performance of the clean test-set: 11.13%

Results

Conclusions  Baseline model (with Spectral Subtraction) fails to separate the signals  Retraining the recognizer with mixed signals can significantly improve performance for small noise levels  BSS test data and Baseline model Significantly reduces WER when the speaker’s signal is at the same level Performance highly degrades as the energy of the second speaker decreases.  BSS test data + Retrained models with BSS data Best performance for noise levels 10dB or lower  For smaller noise levels (>10dB) use the recognizer retrained on mixed signals rather than BSS

Combined Results

 Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2) Outline

 We want to determine the adapted model parameters as a weighted average of many estimators.  In our approach, θ denotes a Gaussian component and Θ is a subset of Gaussians. Optimal Bayes Adaptation
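A minimal numerical sketch of such a weighted average of estimators, where each candidate estimate is weighted by the posterior probability of its hypothesis; the function name, the example mean estimates, and the posterior values are hypothetical, not taken from the adaptation experiments below.

```python
import numpy as np

def bayes_optimal_estimate(estimates, posteriors):
    """Posterior-weighted average of candidate estimators: sum_i p_i * estimate_i."""
    posteriors = np.asarray(posteriors, dtype=float)
    posteriors = posteriors / posteriors.sum()          # normalize to sum to 1
    return np.tensordot(posteriors, np.asarray(estimates, dtype=float), axes=1)

# Hypothetical example: three candidate mean vectors for one Gaussian component.
means = [[1.0, 0.5], [1.2, 0.4], [0.9, 0.6]]
post = [0.5, 0.3, 0.2]
adapted_mean = bayes_optimal_estimate(means, post)      # -> [1.04, 0.49]
```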

Phone-Based Clustering (diagram: genone 1 and genone 2, each with mixture components 1, 2, …, M). Cluster the output distributions based on a common central phone. For example, based on the entropy-based distance between the Gaussians, the less distant Gaussians (shown in gray) are clustered together.

Likelihoods Collection  We compute the likelihoods using the trained models.  For each voice frame we track which triphones are used and calculate the probability for each θ.  We apply delta smoothing to the distributions of θ (a generic sketch follows below).
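The exact smoothing formula is not reproduced above, so the snippet below shows generic additive (delta) smoothing of a count-based distribution as one plausible reading; the delta value and counts are illustrative.

```python
import numpy as np

def delta_smooth(counts, delta=0.5):
    """Additive (delta) smoothing:
        p(theta) = (count(theta) + delta) / (total_count + delta * n_outcomes)"""
    counts = np.asarray(counts, dtype=float)
    return (counts + delta) / (counts.sum() + delta * len(counts))

# Hypothetical occupancy counts for four Gaussian components theta.
print(delta_smooth([10, 0, 3, 7], delta=0.5))  # unseen component gets nonzero mass
```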

 Baseline trained on the WSJ database  Adaptation data: spoke3 WSJ task non-native speakers 5 male and 5 female 40 adaptation sentences per speaker 40 test sentences per speaker Adaptation Configuration

Results

Gender-dependent Results

Conclusions  A small improvement compared to the baseline case  Recent experiments have shown that dynamic associations of distributions have better results  Increasing the number of adaptation data improves the recognition results as recent experiments have shown.