Published by Camron Lester, modified over 2 years ago

1
Bayesian Adaptation in HMM Training and Decoding Using a Mixture of Feature Transforms
Stavros Tsakalidis and Spyros Matsoukas

2
Motivation
Goal: Develop a training procedure that overcomes the limitations of speaker adaptive training (SAT)
SAT limitations:
– A point estimate of the transform for each speaker is found
– Speaker clusters remain fixed throughout training
  – SAT modeling accuracy depends on the selection of the clusters
– Transforms are not integrated in the training/decoding procedure
  – The transforms estimated on the training set are not used in decoding
  – A new set of ML transforms is estimated for the test-set speakers
  – Potential mismatch when SAT is trained discriminatively, since the test-set transforms are ML estimates

3
Bayesian Speaker Adaptive Training (BSAT)
Use a discrete distribution over transforms rather than a point estimate
Decoding criterion: a sum of the likelihoods under each transform in the mixture, weighted by the transform priors
Acoustic model training: estimate the set of transforms, the transform priors, and the HMM parameters
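The decoding criterion can be sketched as an acoustic score of log sum_k P(T_k) p(O | W, T_k) for a hypothesis W. This is a minimal sketch under assumptions not stated on this slide: the transforms are CMLLR transforms (as in the estimation procedure), one transform applies to a whole utterance, and the HMM for W is replaced by a single diagonal Gaussian for brevity. The function names are hypothetical.

```python
import numpy as np

def cmllr_loglik(X, A, b, mean, var):
    """log p(X | transform) for frames X (n_frames x d) under one diagonal Gaussian
    after the CMLLR feature transform y = A x + b, including the log|det A| Jacobian."""
    Y = X @ A.T + b
    _, logabsdet = np.linalg.slogdet(A)
    frame_ll = -0.5 * np.sum(np.log(2 * np.pi * var) + (Y - mean) ** 2 / var, axis=1)
    return float(frame_ll.sum() + X.shape[0] * logabsdet)

def logsumexp(v):
    """Numerically stable log(sum(exp(v)))."""
    m = np.max(v)
    return float(m + np.log(np.sum(np.exp(v - m))))

def bsat_acoustic_score(X, transforms, priors, mean, var):
    """log sum_k P(T_k) p(X | T_k): prior-weighted sum of the likelihoods under each transform."""
    per_transform = np.array([cmllr_loglik(X, A, b, mean, var) for A, b in transforms])
    return logsumexp(np.log(priors) + per_transform)
```

Working in the log domain with logsumexp keeps the weighted sum from underflowing, since utterance-level likelihoods are products over many frames.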

4
BSAT Details
Transforms are shared across all utterances
– Transforms are not speaker-dependent as in SAT
– No need to group the utterances into speaker clusters
– Avoids locally optimal solutions related to speaker clustering
BSAT can be trained under maximum likelihood or discriminative criteria
– Discriminatively trained transforms can be used directly in decoding
BSAT treats the mixture of transforms similarly to Gaussian mixture training
– Transforms and Gaussians are built incrementally
– Transform splitting is an open question; it is not as easy as Gaussian splitting

5
Transform Splitting
Challenge: Find meaningful perturbations of a transform to obtain initial estimates
Workaround: Cluster the utterances and estimate initial transforms for each cluster
Bias clustering
– Idea: Group utterances that have similar transforms
– Challenge: Estimating a full matrix for each utterance is impractical
– Solution: Estimate a bias term for each utterance
– Cluster the biases and estimate an initial transform for each cluster
– Issue: Biases have almost zero variance due to mean normalization of the features
Feature clustering
– Use a K-Means procedure to cluster the utterances
– Each object in K-Means corresponds to an utterance
– Estimate an initial transform for each cluster
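Feature clustering needs a fixed-length vector per utterance, which the slides do not specify. As an assumption, this sketch summarizes each utterance by its per-dimension mean followed by its per-dimension log standard deviation, and runs plain K-Means (Lloyd's algorithm) with a deterministic farthest-first initialization. Both function names are hypothetical, and the follow-on step of estimating an initial transform per cluster is not shown.

```python
import numpy as np

def utterance_vector(X):
    """Hypothetical fixed-length summary of one utterance (n_frames x d):
    per-dimension mean followed by per-dimension log standard deviation."""
    return np.concatenate([X.mean(axis=0), np.log(X.std(axis=0) + 1e-8)])

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's K-Means; each row of `points` is one utterance."""
    # Deterministic farthest-first initialization
    centroids = [points[0]]
    for _ in range(1, k):
        d2 = np.min([((points - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(points[d2.argmax()])
    centroids = np.array(centroids)
    for _ in range(n_iter):
        # Assign each utterance to its nearest centroid
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Move each non-empty centroid to the mean of its utterances
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids
```

On mean-normalized features the mean half of each vector is zero for every utterance, which is the same reason the bias clustering above has almost no variance to work with; here only the scale half separates the clusters.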

6
BSAT Estimation Procedure
Start with a single identity CMLLR transform
– Split transform(s)
– Update transforms and transform priors
– Update Gaussian parameters
Run the update steps for 3 iterations, then split again
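The estimation loop can be sketched on a deliberately tiny model. Several assumptions here are not from the slides: one-dimensional features, a single Gaussian in place of the HMM, bias-only CMLLR transforms (y = x + b, so the log-determinant term vanishes), the number of transforms doubling at each split, and a crude stand-in for the clustering-based split (sorting utterances by their mean and cutting them into equal groups). All function names are hypothetical. Each utterance is softly assigned to every shared transform, much as frames are assigned to components in Gaussian mixture training.

```python
import numpy as np

def utt_loglik(x, b, mu, var):
    """log p(x | b) for 1-D frames x under the bias-only transform y = x + b."""
    y = x + b
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (y - mu) ** 2 / var))

def transform_posteriors(utts, biases, priors, mu, var):
    """Soft assignment of each utterance to each shared transform."""
    ll = np.array([[utt_loglik(x, b, mu, var) for b in biases] for x in utts])
    a = ll + np.log(priors)
    a -= a.max(axis=1, keepdims=True)
    g = np.exp(a)
    return g / g.sum(axis=1, keepdims=True)

def split_by_clustering(utts, n_clusters, mu):
    """Stand-in split: group utterances by their mean, start each transform at mu - group mean."""
    means = np.array([x.mean() for x in utts])
    groups = np.array_split(np.argsort(means), n_clusters)
    biases = np.array([mu - means[g].mean() for g in groups])
    priors = np.array([len(g) for g in groups], dtype=float) / len(utts)
    return biases, priors

def train_bsat(utts, n_splits=1, n_iter=3):
    frames = np.concatenate(utts)
    mu, var = float(frames.mean()), float(frames.var())  # seed Gaussian from all data
    biases, priors = np.zeros(1), np.ones(1)             # single identity transform
    T = np.array([len(x) for x in utts], dtype=float)    # frames per utterance
    S = np.array([x.sum() for x in utts])                # frame sum per utterance
    for stage in range(n_splits + 1):
        if stage > 0:
            biases, priors = split_by_clustering(utts, 2 * len(biases), mu)
        for _ in range(n_iter):
            g = transform_posteriors(utts, biases, priors, mu, var)
            # Update transforms and transform priors
            biases = mu - (g.T @ S) / (g.T @ T)
            priors = g.mean(axis=0)
            # Update Gaussian parameters given the new transforms
            mu = float(np.sum(g * (S[:, None] + T[:, None] * biases[None, :])) / T.sum())
            var = float(sum(gk * np.sum((x + b - mu) ** 2)
                            for x, row in zip(utts, g) for gk, b in zip(row, biases)) / T.sum())
    return biases, priors, mu, var
```

The toy features are deliberately not mean-normalized; with per-utterance mean normalization, sorting by utterance mean would fail for the reason given on the transform-splitting slide.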

7
Experiment Setup
Training/test set
– Training: 150 hrs of Arabic broadcast news (BN) data
– Test: bnat05 test set
Acoustic model
– Seed BSAT estimation from a well-trained speaker-independent (SI) model
  – 12 mixture components per state, 1762 states, 24K total Gaussians
Decoding procedure
– Find the 1-best hypothesis from the baseline unadapted decoding
– Select the transform that gives the highest likelihood on the 1-best hypothesis
– Rescore the lattice created by the baseline unadapted decoding
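The three decoding steps can be sketched on an N-best list standing in for the lattice (an assumption; the slides rescore a lattice). The inputs are hypothetical precomputed scores: base_ac[h] is the unadapted acoustic log-likelihood of hypothesis h, ac[h, k] its log-likelihood under transform k, lm[h] its language model log-probability, and lm_weight a hypothetical language model scale. Following the slide, the transform is chosen by likelihood alone, without its prior.

```python
import numpy as np

def decode_with_selected_transform(base_ac, ac, lm, lm_weight=1.0):
    """Return (baseline 1-best index, selected transform index, rescored best index)."""
    base_ac, ac, lm = (np.asarray(v, dtype=float) for v in (base_ac, ac, lm))
    # 1-best hypothesis from the baseline unadapted decoding
    one_best = int(np.argmax(base_ac + lm_weight * lm))
    # Transform with the highest likelihood on that 1-best hypothesis
    k_star = int(np.argmax(ac[one_best]))
    # Rescore every hypothesis using the selected transform
    rescored = ac[:, k_star] + lm_weight * lm
    return one_best, k_star, int(np.argmax(rescored))
```

For example, `decode_with_selected_transform([-10, -12, -11], [[-10, -9], [-12, -8], [-11, -10]], [-1, -1, -1])` returns `(0, 1, 1)`: transform 1 fits the baseline 1-best best, and rescoring under it promotes hypothesis 1.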

8
BSAT Training
[Figure: BSAT training results; numbers in boxes indicate the number of transforms]
Both clustering procedures yield comparable training likelihood

9
BSAT Decoding
[Figure: BSAT decoding results; numbers in boxes indicate the number of transforms]
Both BSAT systems yield comparable WER
– BSAT: 1% absolute WER gain using only 16 transforms
– SI: 0.9% absolute WER gain by doubling the number of parameters
WER reaches a plateau as the number of transforms in the mixture increases
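To put "only 16 transforms" against "doubling the number of parameters" in rough numbers, the sketch below counts parameters. It uses the 24K Gaussians from the experiment setup but assumes diagonal covariances and a hypothetical feature dimension of 60, since the slides give neither; the 16 transform priors are ignored as negligible.

```python
def cmllr_param_count(n_transforms, dim):
    """Full d x d CMLLR matrix plus a d-dimensional bias per transform."""
    return n_transforms * dim * (dim + 1)

def diag_gmm_param_count(n_gaussians, dim):
    """Mean and diagonal variance per Gaussian, plus one mixture weight."""
    return n_gaussians * (2 * dim + 1)

DIM = 60  # hypothetical feature dimension; not stated on the slides
si = diag_gmm_param_count(24_000, DIM)
bsat_extra = cmllr_param_count(16, DIM)
print(si, bsat_extra, round(bsat_extra / si, 3))  # 2904000 58560 0.02
```

Under these assumptions, doubling the SI model adds about 2.9M parameters, while 16 full CMLLR transforms add about 59K, roughly 2% as many.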

10
Conclusions
Integrated the transforms into the training/decoding procedure
– Discriminatively trained transforms can be used in decoding
Preliminary results show that BSAT improves SI model performance with as few as 16 transforms
Future work
– Improve transform splitting
– Apply transform splitting concurrently with Gaussian splitting
– Use the top-N transforms in decoding
– Use MLLR transforms rather than CMLLR transforms
– Use discriminative estimation criteria rather than ML
