1
**Speaker Adaptation for Vowel Classification**

Xiao Li, Electrical Engineering Dept.

2
**Outline**

- Introduction
- Background on statistical classifiers
- Proposed adaptation strategies
- Experiments and results
- Conclusion

3
**Application**

- “Vocal Joystick” (VJ)
  - Human-computer interaction for people with motor impairments
  - Acoustic parameters: energy, pitch, vowel quality, discrete sound
- Vowel classification
  - Vowels: /ae/ (bat), /aa/ (bought), /uh/ (boot), /iy/ (beat)
  - The vowels control the motion direction

4
**Features**

- Formants
  - Peaks in the spectrum
  - Low dimension (F1, F2, F3, F4 + dynamics)
  - Hard to estimate
- Mel-frequency cepstral coefficients (MFCC)
  - Cosine transform of the log spectrum
  - High dimension (26 including deltas)
  - Easy to compute
- Our choice: MFCCs
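The "cosine transform of the log spectrum" step can be sketched in a few lines. This is a minimal illustration, not the front end used in the talk: it assumes the mel filterbank energies are already available, assumes 13 static coefficients plus 13 deltas to reach the 26 dimensions mentioned above, and uses a simple two-frame difference for the deltas. The function names are hypothetical.

```python
import math

def dct_ii(values, n_coeffs):
    """Unnormalized DCT-II: c_k = sum_n x_n * cos(pi * k * (n + 0.5) / N)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]

def simple_deltas(frames):
    """Half the difference between the next and previous frame (edges clamped)."""
    last = len(frames) - 1
    deltas = []
    for t in range(len(frames)):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, last)]
        deltas.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return deltas

def cepstral_features(filterbank_energies, n_coeffs=13):
    """Log -> DCT per frame, then append deltas (assumed layout: static + delta)."""
    static = [
        dct_ii([math.log(e) for e in frame], n_coeffs)
        for frame in filterbank_energies
    ]
    return [s + d for s, d in zip(static, simple_deltas(static))]

# Toy input: 3 frames of 20 flat filterbank energies (assumed band count).
toy_energies = [[2.0] * 20 for _ in range(3)]
toy_features = cepstral_features(toy_energies)
```

With a flat spectrum, only the zeroth coefficient is non-zero, which is a quick sanity check on the transform.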

5
**User-Independent vs. User-Dependent**

- User-independent models
  - NOT optimized for a specific speaker
  - Easy to get a large training set
- User-dependent models
  - Optimized for a specific speaker
  - Difficult to get a large training set

6
**Adaptation**

- What is adaptation?
  - Adapting user-independent models to a specific user, using a small set of user-dependent data
- Adaptation methodology for vowel classification
  - Train speaker-independent vowel models
  - Ask a speaker to articulate a few seconds of vowels for each class
  - Adapt the classifier on this small amount of speaker-dependent data

8
**Gaussian mixture models (GMM)**

- Generative models
- Training objective: maximum likelihood (EM), for training samples $O_{1:T}$
- Classification
  - Compute the likelihood score for each class and choose the class with the highest likelihood
- Limitations
  - A class model is trained using only the data in that class
  - Constraints on the discriminant functions
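The classification rule above (score every class model, pick the highest likelihood) can be sketched as follows. This is a toy illustration under assumptions not stated on the slide: diagonal covariances, frames treated as independent, and hand-written parameters instead of EM-trained ones. All names are hypothetical.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at x."""
    return -0.5 * sum(
        math.log(2 * math.pi * v) + (xi - m) ** 2 / v
        for xi, m, v in zip(x, mean, var)
    )

def gmm_log_likelihood(frames, gmm):
    """Sum of per-frame log-likelihoods; gmm is a list of (weight, mean, var)."""
    total = 0.0
    for x in frames:
        comp = [math.log(w) + log_gaussian_diag(x, m, v) for w, m, v in gmm]
        peak = max(comp)  # log-sum-exp for numerical stability
        total += peak + math.log(sum(math.exp(c - peak) for c in comp))
    return total

def classify(frames, class_gmms):
    """Return the label whose GMM gives the highest likelihood."""
    return max(class_gmms, key=lambda label: gmm_log_likelihood(frames, class_gmms[label]))

# Toy 2-D models for two vowel classes (parameters are made up).
toy_gmms = {
    "iy": [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])],
    "aa": [(1.0, [5.0, 5.0], [1.0, 1.0])],
}
toy_label = classify([[0.2, 0.1], [0.9, 1.1]], toy_gmms)
```

Each class model is scored in isolation, which is the limitation the slide points out: no class sees the other classes' data during training.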

9
**Neural Networks (NN)**

- Three-layer perceptrons
  - Number of input nodes: feature dimension × window size
  - Number of hidden nodes: empirically chosen
  - Number of output nodes: number of classes
- Training objective
  - Minimum relative entropy with respect to the targets $y_k$
- Classification
  - Compare the output values
- Advantages
  - Discriminative training
  - Nonlinearity
  - Features taken from multiple frames
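The input size rule (feature dimension × window size) follows from stacking neighbouring frames into one vector. The sketch below shows one way to do that. The centred window and the edge handling (repeating the first and last frame) are assumptions, and `stack_context` is a hypothetical helper name.

```python
def stack_context(frames, window):
    """Concatenate `window` consecutive frames centred on each frame."""
    if window % 2 == 0:
        raise ValueError("window must be odd so that it can be centred")
    half = window // 2
    last = len(frames) - 1
    stacked = []
    for t in range(len(frames)):
        vec = []
        for offset in range(-half, half + 1):
            idx = min(max(t + offset, 0), last)  # clamp at utterance edges
            vec.extend(frames[idx])
        stacked.append(vec)
    return stacked

# 10 toy frames of 26-dimensional features, stacked with a window of 7.
toy_frames = [[float(t)] * 26 for t in range(10)]
toy_inputs = stack_context(toy_frames, 7)
```

With 26-dimensional features and a window of 7 frames, each network input has 26 × 7 = 182 values.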

10
**NN-SVM Hybrid Classifier**

- Idea: replace the hidden-to-output layer of the NN with linear-kernel SVMs
- Training objective
  - Maximum margin, with a theoretical guarantee on the test error bound
- Classification
  - Compare the output values of the binary classifiers
- Advantages
  - Compared to a pure NN: optimal solution in the last layer
  - Compared to a pure SVM: efficient handling of features from multiple frames; no need to choose a kernel

12
**MLLR for GMM Adaptation**

- Maximum Likelihood Linear Regression (MLLR)
  - Apply a linear transformation to the Gaussian means
  - Use the same transformation for all Gaussians in the mixture of a given class
- The covariance matrices can be adapted in a similar fashion, but this is less effective
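Applying the transform is simple once it is known. The sketch below assumes the usual affine form, new mean = A · mean + b, shared by every Gaussian in a class; the matrix `A`, offset `b`, and the function name are illustrative values, not estimates from the talk.

```python
def apply_mllr(means, A, b):
    """Apply one affine transform (A, b) to every Gaussian mean in a class."""
    adapted = []
    for mu in means:
        adapted.append([
            sum(a * m for a, m in zip(row, mu)) + offset
            for row, offset in zip(A, b)
        ])
    return adapted

# Two toy 2-D component means sharing the same made-up transform.
toy_means = [[1.0, 2.0], [3.0, 4.0]]
toy_A = [[1.0, 0.0], [0.0, 2.0]]
toy_b = [0.5, -1.0]
toy_adapted = apply_mllr(toy_means, toy_A, toy_b)
```

Sharing one transform across the mixture keeps the number of free parameters small, which matters when only a few seconds of adaptation data are available.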

13
**MLLR Formulas**

- Objective: maximum likelihood, for adaptation samples $O_{1:T}$
- The first-order derivative is set to zero
- The transform W is obtained by solving a linear equation
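The slide's own formulas did not survive in this transcript. For orientation, the steps above correspond to the standard MLLR mean-adaptation derivation, written here in assumed notation: $\xi_m = [1, \mu_m^\top]^\top$ is the extended mean of Gaussian $m$, $\Sigma_m$ its covariance, and $\gamma_m(t)$ the posterior probability of component $m$ at frame $t$. The auxiliary-function form is the usual EM formulation and may differ in detail from what was presented.

```latex
\begin{aligned}
\hat{\mu}_m &= W \xi_m, \qquad \xi_m = \begin{bmatrix} 1 \\ \mu_m \end{bmatrix} \\[4pt]
Q(W) &= -\tfrac{1}{2} \sum_{t=1}^{T} \sum_{m} \gamma_m(t)\,
        (o_t - W \xi_m)^\top \Sigma_m^{-1} (o_t - W \xi_m) + \text{const} \\[4pt]
\frac{\partial Q}{\partial W} = 0
  \;&\Longrightarrow\;
  \sum_{t=1}^{T} \sum_{m} \gamma_m(t)\, \Sigma_m^{-1} o_t\, \xi_m^\top
  = \sum_{t=1}^{T} \sum_{m} \gamma_m(t)\, \Sigma_m^{-1} W\, \xi_m \xi_m^\top
\end{aligned}
```

The last line is linear in $W$, which is why the transform can be found by solving a linear system; with diagonal covariances it decouples into one small system per row of $W$.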

14
**NN Adaptation**

- Idea: fix the nonlinear mapping and adapt the last layer (a linear classifier)
- Adaptation objective: minimum relative entropy
- Start from the original weights
- Update the weights by gradient descent
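The gradient-descent formulas are not preserved in this transcript, so the following is only a sketch of the idea under stated assumptions: a softmax output layer, one-hot targets (for which minimizing relative entropy is the same as minimizing cross-entropy), fixed hidden activations as inputs, and full-batch updates. Function names, the learning rate, and the toy data are hypothetical.

```python
import math

def softmax(z):
    peak = max(z)
    exps = [math.exp(v - peak) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def outputs(weights, biases, h):
    """Last-layer outputs for one hidden-activation vector h."""
    return softmax([sum(w * x for w, x in zip(row, h)) + b
                    for row, b in zip(weights, biases)])

def cross_entropy(weights, biases, hidden, labels):
    return -sum(math.log(outputs(weights, biases, h)[y])
                for h, y in zip(hidden, labels)) / len(hidden)

def adapt_last_layer(weights, biases, hidden, labels, lr=0.1, epochs=50):
    """Gradient descent on the last layer only, starting from the given weights."""
    w = [row[:] for row in weights]  # copy: the original model is left untouched
    b = biases[:]
    n = len(hidden)
    for _ in range(epochs):
        grad_w = [[0.0] * len(w[0]) for _ in w]
        grad_b = [0.0] * len(b)
        for h, y in zip(hidden, labels):
            p = outputs(w, b, h)
            for k in range(len(w)):
                err = p[k] - (1.0 if k == y else 0.0)  # d(loss)/d(pre-activation k)
                grad_b[k] += err / n
                for j, x in enumerate(h):
                    grad_w[k][j] += err * x / n
        w = [[wi - lr * g for wi, g in zip(row, grow)] for row, grow in zip(w, grad_w)]
        b = [bi - lr * g for bi, g in zip(b, grad_b)]
    return w, b

# Toy speaker-independent last layer and a few adaptation samples.
si_weights = [[1.0, -1.0], [-1.0, 1.0]]
si_biases = [0.0, 0.0]
adapt_hidden = [[0.9, 0.6], [0.8, 0.7], [0.4, 0.5], [0.3, 0.6]]
adapt_labels = [0, 0, 1, 1]
new_weights, new_biases = adapt_last_layer(si_weights, si_biases, adapt_hidden, adapt_labels)
```

Because the hidden layer is frozen, the problem reduces to fitting a linear classifier on fixed features, which is convex and cheap to solve on a small adaptation set.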

15
**NN-SVM Classifier Adaptation**

- Idea: *again* fix the nonlinear mapping and adapt the last layer
- Adaptation objective: maximum margin
- Adaptation procedure
  - Keep the support vectors of the training data
  - Combine these support vectors with the adaptation data
  - Retrain the linear-kernel SVMs for the last layer
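The three-step procedure above can be sketched for a single binary classifier. To stay dependency-free, a plain subgradient-descent trainer on the regularized hinge loss stands in for a real SVM solver, and "support vectors" are approximated as the training points whose margin is at most 1 plus a tolerance. Both are assumptions of this sketch, as are all names, hyperparameters, and the toy 2-D "hidden activations".

```python
def decision(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Full-batch subgradient descent on lam/2*|w|^2 + mean hinge loss."""
    w, b, n = [0.0] * len(X[0]), 0.0, len(X)
    for _ in range(epochs):
        grad_w = [lam * wi for wi in w]
        grad_b = 0.0
        for xi, yi in zip(X, y):
            if yi * decision(w, b, xi) < 1.0:  # point violates the margin
                for j, v in enumerate(xi):
                    grad_w[j] -= yi * v / n
                grad_b -= yi / n
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def support_vectors(X, y, w, b, tol=0.1):
    """Approximate support vectors: points on or inside the margin."""
    return [(xi, yi) for xi, yi in zip(X, y) if yi * decision(w, b, xi) <= 1.0 + tol]

def adapt_svm(train_X, train_y, adapt_X, adapt_y):
    w, b = train_linear_svm(train_X, train_y)
    kept = support_vectors(train_X, train_y, w, b)       # step 1: keep SVs
    X = [x for x, _ in kept] + adapt_X                   # step 2: combine
    y = [label for _, label in kept] + adapt_y
    return kept, train_linear_svm(X, y)                  # step 3: retrain

# Toy data: labels are -1/+1 for one binary (one-vs-rest) classifier.
train_X = [[-2.0, 0.0], [-1.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
train_y = [-1, -1, 1, 1]
adapt_X = [[-0.5, 0.0], [0.5, 0.0]]
adapt_y = [-1, 1]
kept_svs, (w_adapted, b_adapted) = adapt_svm(train_X, train_y, adapt_X, adapt_y)
```

Keeping only the support vectors preserves the part of the training set that defines the original decision boundary, so the retraining set stays small while still anchoring the adapted classifier to the speaker-independent solution.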

17
**Database**

- Pure vowel recordings with different energy and pitch
  - Duration: long, short
  - Energy: loud, normal, quiet
  - Pitch: rising, level, falling
- Statistics
  - Train set: (number of speakers missing)
  - Test set: 5 speakers
  - 4, 8, or 9 vowel classes
  - 18 utterances (2000 samples) for each vowel and each speaker

18
**Adaptation and Evaluation Set**

- 6-fold cross-validation for each speaker
  - The 18 utterances are divided into 6 subsets
  - We adapt on each subset and evaluate on the remaining five
  - This gives 6 accuracy scores for each vowel, from which we compute the mean and deviation
- The results are averaged over the 5 speakers
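Note that the roles are reversed relative to ordinary cross-validation: the single held-out subset is used for adaptation and the other five for evaluation. A minimal sketch of the split follows. Contiguous subsets and the population standard deviation are assumptions, since the slide does not say how utterances were grouped or which deviation was reported; the function names are hypothetical.

```python
import statistics

def adaptation_folds(utterances, n_folds=6):
    """Yield (adapt, evaluate) pairs: adapt on one subset, evaluate on the rest."""
    if len(utterances) % n_folds != 0:
        raise ValueError("utterances must divide evenly into folds")
    size = len(utterances) // n_folds
    folds = []
    for k in range(n_folds):
        adapt = utterances[k * size:(k + 1) * size]
        evaluate = utterances[:k * size] + utterances[(k + 1) * size:]
        folds.append((adapt, evaluate))
    return folds

def summarize(scores):
    """Mean and (population) standard deviation of the per-fold accuracies."""
    return statistics.mean(scores), statistics.pstdev(scores)

# 18 utterance IDs for one vowel and one speaker.
toy_folds = adaptation_folds(list(range(18)))
```

Each fold therefore adapts on 3 utterances and evaluates on 15, which mirrors the intended use case of adapting from only a few seconds of speech.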

19
**Speaker-Independent Classifiers**

| % Accuracy | 4-class | 8-class | 9-class |
|---|---|---|---|
| GMM (16 mixtures) | 85.13±0.67 | 55.88±0.64 | 51.21±0.54 |
| NN (window = 7, hidden = 50) | 89.19±0.65 | 60.05±0.72 | 53.75±0.61 |
| NN-SVM | 89.89±0.55 | -- | -- |

- The individual scores for different speakers vary a lot
- If the NN window = 1, the performance is similar to that of the GMM

20
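**Adapted Classifiers**

| % Accuracy | Model | 4-class | 8-class | 9-class |
|---|---|---|---|---|
| MLLR for GMM | Speaker-independent | 85.13±0.67 | 55.88±0.64 | 51.21±0.54 |
| | Adapted | 90.73±0.82 | 67.52±1.27 | 62.94±1.37 |
| Gradient descent for NN | Speaker-independent | 89.19±0.65 | 60.05±0.72 | 53.75±0.61 |
| | Adapted | 91.85±1.30 | 74.33±1.41 | 71.06±1.62 |
| Maximum margin for NN-SVM | Speaker-independent | 89.89±0.55 | -- | -- |
| | Adapted | 94.70±0.30 | -- | -- |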

21
**Conclusion**

- For speaker-independent models, the NN classifier (with multiple-frame input) works well
- For speaker-adapted models, the NN classifier is effective, and NN-SVM gets the best performance so far
