
1 HIWIRE Progress Report, Chania, May 2007. Presenter: Prof. Alex Potamianos, Technical University of Crete

2 Outline:  Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2)

3 Outline:  Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2)

4 Motivation:  Combining several sources of information improves performance.  Unfortunately, for different environments and noise conditions, not all sources of information are equally reliable.  There is a mismatch between training and test conditions. Goal:  Propose estimators of optimal stream weights s_i that can be computed in an unsupervised manner.

5 Optimal Stream Weights:  Equal error rate in single-stream classifiers  Equal estimation error variance in each stream
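The slides leave the weighting scheme implicit. As a minimal sketch, the Python snippet below combines per-stream log-likelihoods with a single weight s and derives s from inverse error variances; all names are hypothetical, and the variance-based heuristic is only one estimator consistent with the equal-variance condition above, not necessarily the one used in this work.

```python
import numpy as np

def combined_log_likelihood(ll_audio, ll_video, s):
    """Weighted multi-stream score: s * audio + (1 - s) * video.

    ll_audio, ll_video: per-class log-likelihoods from the two
    single-stream classifiers (1-D arrays of equal length).
    s: audio stream weight in [0, 1].
    """
    return s * np.asarray(ll_audio) + (1.0 - s) * np.asarray(ll_video)

def weight_from_error_variance(var_audio, var_video):
    """One heuristic consistent with the 'equal estimation error
    variance' condition: weight each stream inversely to its
    estimation error variance (an assumption, not necessarily the
    paper's exact estimator)."""
    inv = np.array([1.0 / var_audio, 1.0 / var_video])
    return inv[0] / inv.sum()  # audio weight s; video gets 1 - s

# Example: pick the class maximizing the combined score.
s = weight_from_error_variance(var_audio=0.4, var_video=1.6)
scores = combined_log_likelihood([-12.1, -9.8], [-11.0, -13.4], s)
print(int(np.argmax(scores)))  # index of the winning class
```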

6 Experimental Results:  Subset of the CUAVE database used: 36 speakers (30 training, 6 testing); 5 sequences of 10 connected digits per speaker; training set: 1500 digits (30x5x10); test set: 300 digits (6x5x10)  Features: audio: 39 features (MFCC_D_A); visual: 39 features (ROIDCT_D_A, odd columns)  Multi-stream HMM models: 8-state, left-to-right, whole-digit HMMs; single Gaussian mixture; the AV-HMM uses separate audio and video feature streams

7 Results (classification):  Two classes → anti-models  Class membership → inter- and intra-class distances

8 Results (recognition):  Generalization of the inter- and intra-distance measure → inter-distance among all the classes.

9 Conclusions:  Stream-weight computation for a multi-class classification task, based on theoretical results for two-class classification, using an anti-model technique.  We use only the test utterance and the information contained in the trained models.  Generalization towards unsupervised estimation of stream weights for multi-stream classification and recognition problems.

10 Outline:  Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2)

11 Vocal Tract Length Normalization.  Dependence between warping and phonemes.  Frame Segmentation into Regions.  Warping Factor and Function Estimation.  VTLN in Recognition.  Evaluation.

12 Dependence between warping and phonemes [1]. Examining the similarity between two frames before and after warping:  For each phoneme and speaker, the average spectral envelope is computed for the middle frame of the utterance.  An optimal warping factor is computed for each phoneme's utterance, so that the MSE between the warped spectrum and the corresponding unwarped reference spectrum is minimized. Optimization is a full search over warping factors in the interval 0.8 to 1.2, where 1 corresponds to no warping.  The mapped spectrum is warped according to this optimal warping factor.
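A minimal numpy sketch of this full search, assuming a simple linear warping of the frequency axis implemented by resampling (the actual warping function used in the work may differ; edges are held constant outside the warped range):

```python
import numpy as np

def warp_spectrum(spectrum, alpha):
    """Linearly warp the frequency axis of a spectral envelope by
    factor alpha and resample onto the original bins (a simple
    stand-in for the warping used on the slide)."""
    n = len(spectrum)
    bins = np.arange(n)
    return np.interp(bins, alpha * bins, spectrum)

def optimal_warping_factor(mapped, reference, lo=0.8, hi=1.2, step=0.02):
    """Full search over alpha in [lo, hi], minimizing the MSE between
    the warped mapped spectrum and the reference spectrum."""
    alphas = np.arange(lo, hi + 1e-9, step)
    errors = [np.mean((warp_spectrum(mapped, a) - reference) ** 2)
              for a in alphas]
    return alphas[int(np.argmin(errors))]
```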

13 Dependence between warping and phonemes [2]. Bi-parametric warping function (2 pts).  Different warping factors are evaluated for the low (f < 3 kHz) and high (f ≥ 3 kHz) frequencies.  Constraints: 5 candidate factors per band, with step 0.02.  A full search over the 25 (= 5²) candidate warping functions provides the optimal pair of warping factors. Four-parametric warping function (4 pts).  Different warping factors are evaluated for the frequency ranges 0-1.5, 1.5-3, 3-4.5 and 4.5-8 kHz.  The constraints and step remain the same as in the bi-parametric case.  Full search over the 625 (= 5⁴) different candidate warping functions. Bias addition before the warping process.  Based on the ML algorithm, we evaluate a linear bias that minimizes the spectral distance between the reference and mapped spectra.  The extracted linear bias is added to the unwarped mapped spectrum.
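The 25-point bi-parametric search could look like the sketch below, where each band is warped independently with its own scalar factor; that independence is an assumption of the sketch, and the slide's warping function likely handles continuity across the 3 kHz boundary differently.

```python
import itertools
import numpy as np

def piecewise_warp(spectrum, freqs, a_low, a_high, split_hz=3000.0):
    """Warp the two bands independently with scalar factors (a crude
    stand-in for the slide's bi-parametric warping function)."""
    out = np.empty_like(spectrum)
    low = freqs < split_hz
    for a, mask in ((a_low, low), (a_high, ~low)):
        out[mask] = np.interp(freqs[mask], a * freqs, spectrum)
    return out

def best_pair(mapped, reference, freqs, candidates):
    """Full search over the 25 (a_low, a_high) pairs, minimizing MSE."""
    return min(itertools.product(candidates, repeat=2),
               key=lambda p: np.mean(
                   (piecewise_warp(mapped, freqs, *p) - reference) ** 2))
```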

14 Results (over all speakers) after bias addition.

15 Frame Segmentation into Regions.  Based on the unsupervised K-means algorithm, the sequence of a test utterance's frames (length M) is divided into a user-specified number of regions.  The algorithm's output is a function F between the frames m and the corresponding region indices c.  As an additional constraint, a median filter is applied to the sequence of region indices. This smooths the sequence of indices so that it reflects a more physiologically plausible pattern of region transitions between successive frames.
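A possible implementation of this step with scikit-learn and SciPy; the function and parameter names are ours, and the number of regions and the filter kernel size are illustrative:

```python
import numpy as np
from scipy.signal import medfilt
from sklearn.cluster import KMeans

def segment_frames(features, n_regions=2, kernel=9):
    """Cluster the M frames of a test utterance into regions with
    unsupervised K-means, then median-filter the region-index
    sequence so transitions between successive frames are smooth.

    features: (M, D) array of frame feature vectors.
    Returns F: (M,) array mapping frame m -> region index c.
    """
    labels = KMeans(n_clusters=n_regions, n_init=10,
                    random_state=0).fit_predict(features)
    # Median filtering removes isolated single-frame region flips.
    return medfilt(labels.astype(float), kernel_size=kernel).astype(int)
```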

16 Warping Factor and Function Estimation.  After the division of frames into regions, an optimal warping factor and function for each region are obtained by maximizing the likelihood of the warped vectors with respect to the transcription from the first pass and the un-normalized Hidden Markov Model. Every frame of the test utterance, after its categorization into region c, is warped according to one of the R candidate factors and one of the N candidate functions. The optimal warping factor for each region is obtained by searching over values from 0.88 to 1.12 with step 0.02.  λ is the Hidden Markov Model trained with unnormalized training vectors.  W is the transcription obtained from the first pass.

17 VTLN in Recognition. During recognition, since a preliminary transcription of the test utterances is not given, a multiple-pass strategy is introduced:  A preliminary transcription W is obtained through a first recognition pass using the unwarped sequence of cepstral vectors X and the unnormalized model λ.  The utterance's frames are categorized into c regions.  For each region c, an optimal warping factor and function are evaluated through a multi-dimensional grid search.  After computing the vectors for the optimal per-region factor and function, the optimally warped sequence is decoded to obtain the final recognition result.
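Putting the passes together, a schematic version of the strategy might look as follows; `decode`, `warp_frames`, `likelihood`, and `segment_frames` are hypothetical helpers supplied by the caller, standing in for the real recognizer, warping front-end, and forced-alignment scorer:

```python
def vtln_multipass(frames, hmm, factors, functions, decode,
                   warp_frames, likelihood, segment_frames,
                   n_regions=2):
    """Schematic multi-pass VTLN decoding (slide 17). The helper
    callables stand in for the recognizer and the warping front-end."""
    W = decode(frames, hmm)                      # pass 1: preliminary transcription
    regions = segment_frames(frames, n_regions)  # frame -> region index
    best = {}
    for c in range(n_regions):
        # Grid search over (factor, function) for this region,
        # maximizing the likelihood against W and the unnormalized HMM.
        best[c] = max(((a, f) for a in factors for f in functions),
                      key=lambda af: likelihood(
                          warp_frames(frames, regions, c, af[0], af[1]),
                          W, hmm))
    warped = frames
    for c, (a, f) in best.items():
        warped = warp_frames(warped, regions, c, a, f)
    return decode(warped, hmm)                   # final pass
```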

18 Results, WER (%):
                      # of Utters: 1     # of Utters: 5
Baseline              50.83              50.83
Li & Rose (2 pass)    43.79              43.48
2 regions             41.73 (+4.7%)      42.79 (+1.60%)
3 regions             43.11 (+1.56%)     43.66 (-0.46%)

19 Outline:  Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2)

20 The Linear Dynamic Model (LDM)  Discrete-time linear dynamical system: x_{k+1} = F x_k + w_k, y_k = H x_k + v_k  Efficiently models the evolution of spectral dynamics  An observation y_k is produced at each time step  The state process is first-order Markov; the initial state is Gaussian  The state and observation noises w_k, v_k are: uncorrelated; temporally white; zero-mean; Gaussian distributed
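For concreteness, a short simulation of this generative model in its standard state-space form (the canonical parameterization of F and H discussed on the next slide is not enforced here):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_ldm(F, H, Q, R, x0, steps):
    """Draw a trajectory from the LDM on this slide:
        x[k+1] = F x[k] + w[k],   w ~ N(0, Q)
        y[k]   = H x[k] + v[k],   v ~ N(0, R)
    with uncorrelated, temporally white, zero-mean Gaussian noises."""
    x = np.array(x0, dtype=float)
    ys = []
    for _ in range(steps):
        ys.append(H @ x + rng.multivariate_normal(np.zeros(R.shape[0]), R))
        x = F @ x + rng.multivariate_normal(np.zeros(Q.shape[0]), Q)
    return np.array(ys)
```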

21 Generalized canonical form of LDM  Noise covariances are not constrained  Matrices F, H have canonical forms  A canonical form is identifiable if it is also controllable (Ljung)

22 Experimental Setup  Training set: Aurora 2 clean database; 3800 training sentences  Test set: Aurora 2, test A, subway sentences; 1000 test sentences; different levels of noise (clean, SNR: 20, 15, 10, 5 dB)  Front-end extracts 14-dimensional static features (HTK standard front-end)  2 feature configurations: 12 cepstral coefficients + C0 + Energy; plus first- and second-order derivatives (δ, δδ)

23 Model Training on Speech Data  Word models with different numbers of segments, based on the phonetic transcription  Segment alignments produced using HTK
Segments   Models
2          oh
4          two, eight
6          one, three, four, five, six, nine, zero
8          seven

24 Classification process  Keep true word boundaries fixed (digit-level alignments produced by an HMM)  Apply a suboptimal search-and-pruning algorithm: keep the 11 most probable word histories for each word in the sentence (see the sketch below)  Classification is based on maximizing the likelihood
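The pruning step can be as simple as keeping the n-best hypotheses; a tiny sketch, where the (score, history) data layout is an assumption:

```python
import heapq

def prune_histories(histories, beam=11):
    """Keep the `beam` most probable word histories; `histories` is a
    list of (log_likelihood, history) pairs (layout assumed here)."""
    return heapq.nlargest(beam, histories, key=lambda h: h[0])
```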

25 Classification results  Comparison of LDM segment models and HTK HMM classification (% accuracy). Same front-end configuration, same alignments; both models trained on clean training data.
AURORA Subway   HMM (HTK)              LDMs
                MFCC, E    +δ +δδ      MFCC, E    +δ +δδ
Clean           97.19%     97.57%      97.53%     97.61%
SNR20           90.91%     95.71%      93.23%     95.12%
SNR15           80.09%     91.76%      87.91%     91.13%
SNR10           57.68%     81.93%      76.29%     82.69%
SNR5            36.01%     64.24%      54.87%     63.56%

26 Classification results  Performance Comparison (MFCCs)

27 Classification results  Performance Comparison (MFCCs + δ + δδ)

28 Sub-optimal Viterbi decoding (SOVD)  We use a Viterbi-like decoding algorithm for speech classification  The HMM-state equivalent in LDMs is the pair [x_k, s_i]  It is applied among the segments of each word model: it provides segment alignments based on the likelihood of the LDM, estimated with a Kalman filter (see the sketch below), and allows decoding at each time k using possible histories leading to a different [x_k, s_i] combination at several depth levels
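The per-segment likelihood can be computed with a Kalman filter via the prediction-error decomposition; a sketch under standard LDM assumptions (initial state x0, P0 supplied by the caller):

```python
import numpy as np
from scipy.stats import multivariate_normal

def segment_log_likelihood(Y, F, H, Q, R, x0, P0):
    """Kalman-filter log-likelihood of an observation segment Y under
    one LDM, via the prediction-error decomposition. This is the kind
    of per-segment score used when ranking [x_k, s_i] hypotheses."""
    x, P, ll = np.array(x0, float), np.array(P0, float), 0.0
    for y in Y:
        # Innovation and its covariance.
        e = y - H @ x
        S = H @ P @ H.T + R
        ll += multivariate_normal.logpdf(e, mean=np.zeros(len(e)), cov=S)
        # Measurement update.
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ e
        P = (np.eye(len(x)) - K @ H) @ P
        # Time update.
        x = F @ x
        P = F @ P @ F.T + Q
    return ll
```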

29 SOVD Steps

30 Sub-Optimal Viterbi-like Search [Figure: search trellis over time frames t1-t5, with segments S1-S4 and state hypotheses F_i x_k expanded at several depth levels]

31 Visualization of Model Predictions  Trajectories of true and predicted observations for c1, c3

32 Classification results  Comparison of segment models and HTK HMM classification (% accuracy). Same fixed word boundaries based on the HMM alignments; same front-end configuration; both models trained on clean training data.
AURORA Subway   HMM-alignments         Segment Models
                HMM        LDM         d=1        d=2
Clean           97.19%     97.85%      97.73%     97.76%
SNR20           90.91%     92.53%      93.52%     93.52%
SNR15           80.09%     85.93%      89.68%     89.77%
SNR10           57.68%     71.30%      77.21%     77.33%
SNR5            36.01%     46.72%      53.66%     53.98%

33 Classification results (larger state dimension)  Comparison of SOVD-LDM for several state dimensions. Same front-end configuration (MFCC + E + C0), same word alignments.
AURORA Subway   HMM        d=14       d=20       d=22
Clean           97.19%     97.73%     98.22%     98.28%
SNR20           90.91%     93.52%     92.22%     91.58%
SNR15           80.09%     89.68%     84.98%     84.70%
SNR10           57.68%     77.21%     73.27%     73.08%
SNR5            36.01%     53.66%     55.12%     52.99%

34 Conclusions  We investigated generalized canonical forms for LDMs  We proposed an element-wise ML estimation process  When alignments from an equivalent HMM are used: without derivatives, LDMs significantly outperform HMMs, particularly under highly noisy conditions; when derivatives are used for both models, their performance is similar

35 Conclusions  With segment alignments based on the LDM Viterbi-like search, HMM alignments hurt recognition performance  Larger state dimension: beneficial on clean data, but performance degrades on noisy data  Future work: lower-dimensional, articulatory-based features; non-linear state-to-observation mappings

36 Outline:  Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2)

37 Noise removal formulated as a BSS problem  I mutually uncorrelated speaker signals  J microphones  Each microphone signal: x_j(t) = Σ_i a_{ji} s_i(t)  Compact form: x(t) = A s(t)  If A is invertible (W = A^{-1}): ŝ(t) = W x(t)
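If the mixture really were instantaneous and A square, W could be estimated with ICA; a sketch using scikit-learn's FastICA (real rooms are convolutive, so this illustrates the model only, not necessarily the system's actual separation stage):

```python
import numpy as np
from sklearn.decomposition import FastICA

def separate(mixtures, n_sources):
    """Estimate the sources from microphone signals.
    mixtures: (n_samples, n_mics) array, one column per microphone.
    Note: FastICA assumes an instantaneous mixture x = A s."""
    ica = FastICA(n_components=n_sources, random_state=0)
    s_hat = ica.fit_transform(mixtures)   # estimated sources, up to
    return s_hat, ica.mixing_             # scale/permutation; A estimate
```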

38 The simulated room  Used Douglas Campbell's "Roomsim"  The figure depicts the positions of the speakers and microphones  The mixed file was the one received by the first microphone (top left)

39 Database  We considered Aurora 4 and TIMIT  BSS shows better separability for speech signals longer than 30 s (Aurora 4 average utterance length ~7 s; TIMIT average utterance length ~3 s), so we concatenated sentences of the same speakers  When the signals did not overlap for the whole duration, we padded the shorter sentence with samples replicated from its beginning  We normalized the sources to ensure equal energy

40 Experimental Setup  Test set: 330 utterances (AURORA 4), 16 kHz, 16 bits  WER on the clean test set: 11.13%

41 Results

42 Conclusions  The baseline model (with spectral subtraction) fails to separate the signals  Retraining the recognizer with mixed signals can significantly improve performance for small noise levels  BSS test data and baseline model: WER is significantly reduced when the interfering speaker's signal is at the same level, but performance degrades sharply as the energy of the second speaker decreases  BSS test data + models retrained on BSS data: best performance for noise levels of 10 dB or lower  For lower noise levels (> 10 dB), use the recognizer retrained on mixed signals rather than BSS

43 Combined Results

44 Outline:  Audio-Visual Processing (WP1)  VTLN (WP2)  Segment Models (WP1)  Recognition on BSS (WP1)  Bayes’ Optimal Adaptation (WP2)

45 Optimal Bayes Adaptation  We want to determine the adapted model as a weighted average of many estimators  In our approach, θ denotes a Gaussian component and Θ is a subset of Gaussians

46 Phone-Based Clustering [Figure: genone 1 and genone 2 with mixture components 1..M; axes: Gaussian size vs. number of mixture components]  Cluster the output distributions based on a common central phone  For example, based on the entropy-based distance between the Gaussians, the less distant Gaussians (shown in gray) are clustered together
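One concrete candidate for an entropy-based distance between Gaussians is the symmetric KL divergence; the slide does not spell out the exact measure, so the sketch below is an assumption:

```python
import numpy as np

def sym_kl_gauss(m1, S1, m2, S2):
    """Symmetric KL divergence between two Gaussians N(m1, S1) and
    N(m2, S2), one possible 'entropy-based distance' for deciding
    which output distributions share a cluster."""
    d = len(m1)
    iS1, iS2 = np.linalg.inv(S1), np.linalg.inv(S2)
    dm = np.asarray(m1) - np.asarray(m2)
    kl12 = 0.5 * (np.trace(iS2 @ S1) + dm @ iS2 @ dm - d
                  + np.log(np.linalg.det(S2) / np.linalg.det(S1)))
    kl21 = 0.5 * (np.trace(iS1 @ S2) + dm @ iS1 @ dm - d
                  + np.log(np.linalg.det(S1) / np.linalg.det(S2)))
    return kl12 + kl21
```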

47 Likelihoods Collection  We compute the likelihoods as follows: for each speech frame, we track which triphones are used and calculate the probability for each θ  We apply delta smoothing to the distributions of θ
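The transcript omits the delta-smoothing formula; the standard additive-smoothing variant reads p(θ) = (n_θ + δ) / (N + δ|Θ|), sketched below (whether this matches the slide's exact formula is an assumption):

```python
import numpy as np

def delta_smooth(counts, delta=0.5):
    """Additive (delta) smoothing of the empirical distribution over
    the components theta: every theta keeps non-zero mass even if it
    was never observed in the test utterance."""
    counts = np.asarray(counts, dtype=float)
    return (counts + delta) / (counts.sum() + delta * len(counts))
```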

48 Adaptation Configuration  Baseline trained on the WSJ database  Adaptation data: Spoke 3 WSJ task (non-native speakers); 5 male and 5 female speakers; 40 adaptation sentences per speaker; 40 test sentences per speaker

49 Results

50 Gender-dependent Results

51 Conclusions  A small improvement compared to the baseline case  Recent experiments have shown that dynamic associations of distributions give better results, and that increasing the amount of adaptation data improves recognition results

