Speaker Adaptation for Vowel Classification


1 Speaker Adaptation for Vowel Classification
Xiao Li, Electrical Engineering Dept.

2 Outline
Introduction
Background on statistical classifiers
Proposed adaptation strategies
Experiments and results
Conclusion

3 Application
“Vocal Joystick” (VJ)
  Human-computer interaction for people with motor impairments
  Acoustic parameters – energy, pitch, vowel quality, discrete sound
Vowel classification
  Vowels – /ae/ (bat); /aa/ (bought); /uh/ (boot); /iy/ (beat)
  Control motion direction – /ae/ /aa/ /uh/ /iy/

4 Features
Formants
  Peaks in the spectrum
  Low dimension (F1, F2, F3, F4 + dynamics)
  Hard to estimate
Mel-frequency cepstral coefficients (MFCCs)
  Cosine transform of the log spectrum
  High dimension (26 including deltas)
  Easy to compute
Our choice – MFCCs
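
A minimal sketch of the MFCC front end the slide describes, using librosa; the library choice, frame settings, and function names are assumptions not stated in the deck, but 13 cepstral coefficients plus their deltas match the 26-dimensional feature mentioned above.

```python
# Hypothetical MFCC extraction for the vowel classifier: 13 cepstral
# coefficients plus their deltas = 26 dimensions per frame, as on the slide.
# librosa and all parameter choices here are illustrative assumptions.
import numpy as np
import librosa

def mfcc_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)              # keep the native sample rate
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, T) cepstral coefficients
    delta = librosa.feature.delta(mfcc)                  # (13, T) delta features
    return np.vstack([mfcc, delta]).T                    # (T, 26) feature frames
```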

5 User-Independent vs. User-Dependent
User-independent models
  NOT optimized for a specific speaker
  Easy to get a large training set
User-dependent models
  Optimized for a specific speaker
  Difficult to get a large training set

6 Adaptation
What is adaptation?
  Adapting user-independent models to a specific user, using a small set of user-dependent data
Adaptation methodology for vowel classification
  Train speaker-independent vowel models
  Ask a speaker to articulate a few seconds of vowels for each class
  Adapt the classifier on this small amount of speaker-dependent data

7 Outline
Introduction
Background on statistical classifiers
Proposed adaptation strategies
Experiments and results
Conclusion

8 Gaussian Mixture Models (GMM)
Generative models
Training objective – maximum likelihood (via EM) for training samples O1:T
Classification
  Compute the likelihood score for each class, and choose the one with the highest likelihood
Limitations
  A class model is trained using only the data in that class
  Constraints on the discriminant functions
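
A minimal sketch of the GMM classifier described above, using scikit-learn: one GMM per vowel class, and classification by the highest total log-likelihood. The 16-mixture setting comes from the results slide later in the deck; the data layout, variable names, and covariance type are assumptions.

```python
# Sketch of the per-class GMM classifier; features_by_class maps each vowel
# label to a (T, 26) array of MFCC frames (a hypothetical data layout).
from sklearn.mixture import GaussianMixture

def train_gmm_classifier(features_by_class, n_mixtures=16):
    """Fit one GMM per vowel class via EM (maximum likelihood)."""
    return {label: GaussianMixture(n_components=n_mixtures,
                                   covariance_type="diag").fit(X)
            for label, X in features_by_class.items()}

def classify(models, X):
    """Score an utterance (T, 26) under each class GMM; pick the highest log-likelihood."""
    scores = {label: gmm.score_samples(X).sum() for label, gmm in models.items()}
    return max(scores, key=scores.get)
```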

9 Neural Networks (NN)
Three-layer perceptron
  # input nodes – feature dimension x window size
  # hidden nodes – empirically chosen
  # output nodes – # of classes
Training objective
  Minimum relative entropy to the target y_k
Classification
  Compare the output values
Advantages
  Discriminative training
  Nonlinearity
  Features taken from multiple frames
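
A minimal sketch of the three-layer perceptron, written in PyTorch as an assumption about the implementation; the 7-frame window, 50 hidden nodes, and cross-entropy loss (minimum relative entropy against one-hot targets) follow the slides, while the optimizer, learning rate, and class names are illustrative.

```python
# Hypothetical PyTorch version of the three-layer perceptron on the slide.
import torch
import torch.nn as nn

class VowelMLP(nn.Module):
    def __init__(self, n_features=26, window=7, n_hidden=50, n_classes=4):
        super().__init__()
        self.hidden = nn.Linear(n_features * window, n_hidden)   # input -> hidden
        self.output = nn.Linear(n_hidden, n_classes)              # hidden -> output

    def forward(self, x):
        return self.output(torch.sigmoid(self.hidden(x)))         # class logits

def train(model, loader, epochs=20, lr=1e-3):
    # Cross-entropy against the target class equals minimum relative entropy
    # for one-hot targets (the two differ only by a constant).
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:                 # x: windowed MFCC features, y: class index
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
```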

10 NN-SVM Hybrid Classifier
Idea – replace the hidden-to-output layer of the NN with linear-kernel SVMs
Training objective
  Maximum margin – a theoretically guaranteed bound on the test error
Classification
  Compare the output values of the binary classifiers
Advantages
  Compared to a pure NN: an optimal solution in the last layer
  Compared to a pure SVM: efficiently handles features from multiple frames; no need to choose a kernel
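
A minimal sketch of the hybrid: the trained NN's input-to-hidden mapping is kept fixed and linear-kernel SVMs are fit on the hidden activations. VowelMLP refers to the PyTorch sketch above; scikit-learn's SVC with a linear kernel is an assumed stand-in for the per-class binary classifiers on the slide.

```python
# Sketch: freeze the NN's nonlinear mapping, retrain only the last layer as SVMs.
import torch
from sklearn.svm import SVC

def hidden_activations(model, X):
    """Map windowed MFCC features through the fixed input-to-hidden layer."""
    with torch.no_grad():
        h = torch.sigmoid(model.hidden(torch.as_tensor(X, dtype=torch.float32)))
    return h.numpy()

def train_nn_svm(model, X_train, y_train):
    """Replace the hidden-to-output layer with linear-kernel SVMs."""
    H = hidden_activations(model, X_train)
    return SVC(kernel="linear").fit(H, y_train)

def classify(model, svm, X):
    """Classify by comparing the binary classifiers' outputs (via predict)."""
    return svm.predict(hidden_activations(model, X))
```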

11 Outline
Introduction
Background on statistical classifiers
Proposed adaptation strategies
Experiments and results
Conclusion

12 MLLR for GMM Adaptation
Maximum Likelihood Linear Regression
  Apply a linear transformation to the Gaussian means
  The same transformation is shared by the mixture of Gaussians in a class
  The covariance matrices can be adapted in a similar fashion, but this is less effective

13 MLLR Formulas
Objective – maximum likelihood over the adaptation samples O1:T
At the optimum, the first-order derivative of the objective with respect to the transform vanishes
The transform W is then obtained by solving a linear equation
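
The formulas themselves did not survive in the transcript; the block below is a standard statement of MLLR mean adaptation consistent with the slide's wording, with the extended mean vector ξ_m, component posteriors γ_m(t), and covariances Σ_m introduced here for illustration.

```latex
% Adapted mean: a shared linear transform W applied to the extended Gaussian mean
\hat{\mu}_m = W \xi_m, \qquad \xi_m = [\,1,\ \mu_m^{\top}\,]^{\top}

% Maximum-likelihood objective over the adaptation samples O_{1:T}
W^{\ast} = \arg\max_{W} \sum_{t=1}^{T} \sum_{m} \gamma_m(t)\,
           \log \mathcal{N}\!\left(o_t;\, W\xi_m,\, \Sigma_m\right)

% Setting the first-order derivative w.r.t. W to zero yields a linear equation in W
\sum_{t,m} \gamma_m(t)\, \Sigma_m^{-1}\, o_t\, \xi_m^{\top}
  \;=\; \sum_{t,m} \gamma_m(t)\, \Sigma_m^{-1}\, W\, \xi_m \xi_m^{\top}
```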

14 NN Adaptation
Idea – fix the nonlinear mapping and adapt the last layer (a linear classifier)
Adaptation objective – minimum relative entropy
Start from the original weights
Gradient descent formulas
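
A minimal sketch of this adaptation step, reusing the hypothetical VowelMLP from the NN sketch earlier: freeze the input-to-hidden weights and run gradient descent on the output layer only, starting from the speaker-independent weights. The optimizer, learning rate, and epoch count are assumptions.

```python
# Sketch of last-layer NN adaptation on the small speaker-dependent set.
import torch
import torch.nn as nn

def adapt_last_layer(model, adapt_loader, epochs=10, lr=1e-2):
    for p in model.hidden.parameters():       # fix the nonlinear mapping
        p.requires_grad = False
    opt = torch.optim.SGD(model.output.parameters(), lr=lr)  # adapt last layer only
    loss_fn = nn.CrossEntropyLoss()           # minimum relative entropy criterion
    for _ in range(epochs):
        for x, y in adapt_loader:             # a few seconds of vowels per class
            opt.zero_grad()
            loss_fn(model(x), y).backward()
            opt.step()
    return model
```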

15 NN-SVM Classifier Adaptation
Idea – again, fix the nonlinear mapping and adapt the last layer
Adaptation objective – maximum margin
Adaptation procedure
  Keep the support vectors of the training data
  Combine these support vectors with the adaptation data
  Retrain the linear-kernel SVMs for the last layer
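
A minimal sketch of the procedure above, reusing hidden_activations and the SVC choice from the slide 10 sketch: keep only the support vectors of the speaker-independent training data, pool them with the adaptation data in hidden-activation space, and retrain the linear-kernel SVMs. Variable names and array layout are assumptions.

```python
# Sketch of NN-SVM adaptation: support vectors + adaptation data -> retrain SVMs.
import numpy as np
from sklearn.svm import SVC

def adapt_nn_svm(model, svm, X_train, y_train, X_adapt, y_adapt):
    H_train = hidden_activations(model, X_train)
    H_adapt = hidden_activations(model, X_adapt)
    sv = svm.support_                               # indices of retained support vectors
    H = np.vstack([H_train[sv], H_adapt])
    y = np.concatenate([np.asarray(y_train)[sv], y_adapt])
    return SVC(kernel="linear").fit(H, y)           # retrain the last layer
```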

16 Outline
Introduction
Background on statistical classifiers
Proposed adaptation strategies
Experiments and results
Conclusion

17 Database
Pure vowel recordings with different duration, energy, and pitch
  Duration – long, short
  Energy – loud, normal, quiet
  Pitch – rising, level, falling
Statistics
  Train set speakers
  Test set – 5 speakers
  4, 8, or 9 vowel classes
  18 utterances (2000 samples) for each vowel and each speaker

18 Adaptation and Evaluation Set
6-fold cross-validation for each speaker
  The 18 utterances are divided into 6 subsets
  We adapt on each subset and evaluate on the rest
  We get 6 accuracy scores for each vowel, and compute the mean and deviation
Average over the 5 speakers

19 Speaker-Independent Classifiers

% Accuracy                      4-class        8-class        9-class
GMM (mixture # = 16)            85.13±0.67     55.88±0.64     51.21±0.54
NN (window = 7, hidden = 50)    89.19±0.65     60.05±0.72     53.75±0.61
NN-SVM                          89.89±0.55     --             --

The individual scores for different speakers vary a lot
If the NN window = 1, the performance is similar to the GMM

20 Adapted Classifiers

% Accuracy (unadapted → adapted)   4-class                     8-class                     9-class
MLLR for GMM                       85.13±0.67 → 90.73±0.82     55.88±0.64 → 67.52±1.27     51.21±0.54 → 62.94±1.37
Gradient descent for NN            89.19±0.65 → 91.85±1.30     60.05±0.72 → 74.33±1.41     53.75±0.61 → 71.06±1.62
Maximum margin for NN-SVM          89.89±0.55 → 94.70±0.30     --                          --

21 Conclusion
For speaker-independent models, the NN classifier (with multiple-frame input) works well
For speaker-adapted models, the NN classifier is effective, and the NN-SVM achieves the best performance so far

