1
Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA
2
Lecture 5: Generalization Error; Support Vector Machines
Observation Vector; Summary Statistics; Principal Components Analysis (PCA)
Risk Minimization
– If the Posterior Probability is known: MAP is optimal
– Example: Linear Discriminant Analysis (LDA)
– When the true Posterior is unknown: Generalization Error
– VC Dimension, and bounds on Generalization Error
Lagrangian Optimization
Linear Support Vector Machines
– The SVM Optimality Metric
– Lagrangian Optimization of the SVM Metric
– Hyper-parameters & Over-training
Kernel-Based Support Vector Machines
– Kernel-based classification & optimization formulas
– Hyperparameters & Over-training
– The Entire Regularization Path of the SVM
High-Dimensional Linear SVMs
– Text classification using indicator functions
– Speech acoustic classification using redundant features
3
What is an Observation? An observation can be:
– a vector created by “vectorizing” many consecutive MFCC or mel-spectral frames
– a vector including MFCCs, formants, pitch, PLP, auditory-model features, …
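The first option above can be sketched in a few lines of numpy. This is an illustrative sketch only: the function name, the 39-dimensional MFCC frames, and the 17-frame window are assumptions, not taken from the slides.

```python
import numpy as np

# Illustrative sketch: build one observation vector by "vectorizing"
# 2p+1 consecutive D-dimensional feature frames centered on frame t.
def stack_frames(feats, t, p):
    """feats: (T, D) matrix of frames; returns a ((2p+1)*D,) observation."""
    T, D = feats.shape
    # Clip indices so frames near the edges repeat the boundary frame.
    idx = np.clip(np.arange(t - p, t + p + 1), 0, T - 1)
    return feats[idx].reshape(-1)

feats = np.zeros((100, 39))          # e.g., 100 frames of 39-dim MFCCs
x = stack_frames(feats, t=50, p=8)   # 17 frames -> 663-dimensional vector
print(x.shape)                       # (663,)
```

The resulting high-dimensional vector is exactly the kind of observation whose structure the following slides analyze.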
4
Normalized Observations
5
Plotting the Observations, Part I: Scatter Plots and Histograms
6
Problem: Where is the Information in a 1000-Dimensional Vector?
7
Statistics that Summarize a Training Corpus
8
Summary Statistics: Matrix Notation (scatter plot: examples of y = −1 and examples of y = +1)
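The matrix-notation statistics can be sketched directly in numpy. The variable names X, y, and R follow the slides; the synthetic data and everything else are assumptions for illustration.

```python
import numpy as np

# Sketch: per-class means and the covariance matrix R of a training corpus,
# computed in matrix notation from an M x K data matrix (synthetic data).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))            # M=200 examples, K=5 dimensions
y = np.where(rng.random(200) < 0.5, -1, 1)   # labels in {-1, +1}

mu_pos = X[y == 1].mean(axis=0)              # mean of the examples of y=+1
mu_neg = X[y == -1].mean(axis=0)             # mean of the examples of y=-1
Xc = X - X.mean(axis=0)                      # center the data
R = (Xc.T @ Xc) / len(X)                     # K x K covariance matrix R
```

R is the matrix whose eigenvectors and eigenvalues the next slide examines.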
9
Eigenvectors and Eigenvalues of R
10
Plotting the Observations, Part 2: Principal Components Analysis
11
What Does PCA Extract from the Spectrogram? Plot: “PCAGram,” a 1024-dimensional principal component reshaped to a 32×32 spectrogram and plotted as an image.
– 1st principal component (not shown): measures the total energy of the spectrogram
– 2nd principal component: E(after landmark) − E(before landmark)
– 3rd principal component: E(at the landmark) − E(surrounding syllables)
12
Minimum-Risk Classifier Design
13
True Risk, Empirical Risk, and Generalization
14
When PDF is Known: Maximum A Posteriori (MAP) is Optimal
15
Another Way to Write the MAP Classifier: Test the Sign of the Log Likelihood Ratio
16
MAP Example: Gaussians with Equal Covariance
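For equal-covariance Gaussians the log likelihood ratio is linear in x, so the MAP rule is sign(v·x + b) with v = R⁻¹(μ₁ − μ₀). A minimal sketch, assuming equal priors and synthetic two-dimensional Gaussians:

```python
import numpy as np

# MAP classifier for two Gaussians with equal covariance R:
# decide y=+1 when v.x + b > 0, where v = R^{-1}(mu1 - mu0).
rng = np.random.default_rng(0)
mu0, mu1 = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
R = np.eye(2)                                   # shared covariance
X0 = rng.multivariate_normal(mu0, R, 200)       # class y=-1
X1 = rng.multivariate_normal(mu1, R, 200)       # class y=+1

Rinv = np.linalg.inv(R)
v = Rinv @ (mu1 - mu0)
b = -0.5 * (mu1 @ Rinv @ mu1 - mu0 @ Rinv @ mu0)   # equal priors assumed

X = np.vstack([X0, X1])
y = np.concatenate([-np.ones(200), np.ones(200)])
yhat = np.where(X @ v + b > 0, 1.0, -1.0)
acc = (yhat == y).mean()
```

Note that v here is exactly the linear discriminant projection of the next slide: MAP with equal covariances and LDA give the same direction.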
17
Linear Discriminant Projection of the Data
18
Other Linear Classifiers: Empirical Risk Minimization (Choose v, b to Minimize R_emp(v, b))
19
A Serious Problem: Over-Training. The minimum-error projection of the training data, and the same projection applied to new test data.
20
When the True PDF is Unknown: Upper Bounds on True Risk
21
The VC Dimension of a Hyperplane Classifier
22
Schematic Depiction: |w| Controls the Expressiveness of the Classifier (and a less expressive classifier is less prone to overtrain)
23
The SVM = An Optimality Criterion
24
Lagrangian Optimization: Inequality Constraint. Consider minimizing f(v), subject to the constraint g(v) ≥ 0. Two solution types exist:
– g(v*) = 0: the constraint is active; the g(v) = 0 curve is tangent to the f(v) = f_min contour at v*.
– g(v*) > 0: the constraint is inactive; v* is the unconstrained minimum of f(v).
(Diagram from Osborne, 2004, showing the regions g(v) > 0 and g(v) < 0, the unconstrained minimum, and the optimum v*.)
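These two solution types are an instance of the Karush–Kuhn–Tucker (KKT) conditions. A compact standard-form statement (standard optimization notation, not copied from the slides):

```latex
\min_{v} f(v) \;\text{ s.t. }\; g_m(v) \ge 0
\quad\Longrightarrow\quad
\nabla f(v^*) = \sum_m \lambda_m \nabla g_m(v^*),
\qquad \lambda_m \ge 0,
\qquad \lambda_m\, g_m(v^*) = 0 .
```

The complementary-slackness condition λ_m g_m(v*) = 0 encodes exactly the two cases that follow: either the constraint is active, g_m(v*) = 0 (and λ_m may be positive), or it is inactive, g_m(v*) > 0 (and λ_m must be zero).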
25
Case 1: g_m(v*) = 0
26
Case 2: g_m(v*) > 0
27
Training an SVM
28
Differentiate the Lagrangian
29
… now Simplify the Lagrangian…
30
… and impose Kuhn-Tucker…
31
Three Types of Vectors
– Interior vector: α_m = 0
– Margin support vector: 0 < α_m < C
– Error: α_m = C
– Partial error: α_m = C
(From Hastie et al., NIPS 2004)
32
… and finally, Solve the SVM
33
Quadratic Programming (plot: the box 0 ≤ α_i1, α_i2 ≤ C, with the constrained optimum α*): α_i2 is off the margin; truncate to α_i2 = 0. α_i1 is still a margin candidate; solve for it again in iteration i+1.
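The truncation step can be seen end-to-end in a toy dual solver. This is a sketch, not the slides' QP algorithm: it uses plain projected gradient ascent on the dual (ascend, then clip each α_m into [0, C]) in place of a real QP solver such as libsvm, and all data and settings are assumptions.

```python
import numpy as np

# Toy linear SVM via projected gradient ascent on the dual:
# maximize sum(a) - 0.5 a'Qa, Q = (y y') * (X X'), subject to 0 <= a_m <= C.
# The np.clip call is the "truncate" step from the slide.
def fit_linear_svm(X, y, C=1.0, lr=1e-3, iters=5000):
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)   # ascend, then truncate
    v = (a * y) @ X                    # v = sum_m alpha_m y_m x_m
    on_margin = (a > 1e-6) & (a < C - 1e-6)
    b = (y[on_margin] - X[on_margin] @ v).mean() if on_margin.any() else 0.0
    return v, b, a

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (40, 2)), rng.normal(-2.0, 0.5, (40, 2))])
y = np.concatenate([np.ones(40), -np.ones(40)])
v, b, a = fit_linear_svm(X, y)
acc = (np.sign(X @ v + b) == y).mean()
```

After convergence the three vector types are visible in a: interior vectors at 0, margin support vectors strictly inside (0, C), and errors pinned at C.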
34
Linear SVM Example
36
Choosing the Hyper-Parameter to Avoid Over-Training (Wang, presentation at CLSP workshop WS04): SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels.
37
Choosing the Hyper-Parameter to Avoid Over-Training
Recall that v = Σ_m α_m y_m x_m.
Therefore |v| < (C Σ_m |x_m|)^1/2 < (C M max|x_m|)^1/2.
Therefore the width of the margin is constrained to 1/|v| > (C M max|x_m|)^−1/2, and therefore the SVM is not allowed to make the margin very small in its quest to fix individual errors.
Recommended solution:
– Normalize the x_m so that max|x_m| ≈ 1 (e.g., using libsvm)
– Set C ≈ 1/M
– If desired, adjust C up or down by a factor of 2, to see whether the error rate on independent development test data decreases
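The recommended recipe above can be written out as a short sketch. The data is synthetic and the candidate schedule is an assumption; only the scaling rule, the C ≈ 1/M starting point, and the factor-of-2 search come from the slide.

```python
import numpy as np

# Recipe sketch: scale so max |x_m| is about 1, start at C = 1/M, and
# prepare factor-of-2 adjustments to test on a development set.
rng = np.random.default_rng(0)
X = 10.0 * rng.standard_normal((500, 40))      # M=500 raw training vectors
M = len(X)

X = X / np.linalg.norm(X, axis=1).max()        # now max |x_m| = 1
C0 = 1.0 / M                                   # recommended starting point
candidates = [C0 * 2.0 ** k for k in (-2, -1, 0, 1, 2)]
# ...train an SVM at each candidate C and keep the value that gives the
# lowest error on an independent development test set.
```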
38
From Linear to Nonlinear SVM
39
Example: RBF Classifier
40
An RBF Classification Boundary
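An RBF boundary of this kind can be produced by the same toy projected-gradient dual solver, swapping the inner product for the RBF kernel K(x, x′) = exp(−γ|x − x′|²). The XOR-style data, which no linear boundary separates, and all settings are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Kernel SVM via projected gradient ascent on the dual, 0 <= a_m <= C.
def fit_rbf_svm(X, y, C=10.0, gamma=0.5, lr=1e-2, iters=4000):
    K = rbf_kernel(X, X, gamma)
    Q = (y[:, None] * y[None, :]) * K
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)
    on_margin = (a > 1e-6) & (a < C - 1e-6)
    b = (y[on_margin] - K[on_margin] @ (a * y)).mean() if on_margin.any() else 0.0
    # f(x) = sum_m a_m y_m K(x, x_m) + b
    return lambda Xt: np.sign(rbf_kernel(Xt, X, gamma) @ (a * y) + b)

rng = np.random.default_rng(0)
centers = np.array([[2, 2], [-2, -2], [2, -2], [-2, 2]], float)
X = np.vstack([c + 0.4 * rng.standard_normal((20, 2)) for c in centers])
y = np.concatenate([np.ones(40), -np.ones(40)])   # quadrants I,III vs II,IV
predict = fit_rbf_svm(X, y)
acc = (predict(X) == y).mean()
```

The extra hyperparameter γ is exactly what makes model selection harder on the next slides: C and γ must now be chosen jointly.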
41
Two Hyperparameters (C and γ): Choosing Hyperparameters is Much Harder (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
42
Optimum Value of C Depends on γ (from Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
43
SVM is a “Regularized Learner” (λ = 1/C)
44
SVM Coefficients are a Piece-Wise Linear Function of λ = 1/C (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
45
The Entire Regularization Path of the SVM: Algorithm (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
Start with λ large enough (C small enough) that all training tokens are partial errors (α_m = C). Compute the solution of the quadratic programming problem in this case, including inversion of X^T X or X X^T.
Reduce λ (increase C) until the initial event occurs: two partial-error points enter the margin, i.e., in the QP problem, α_m = C becomes the unconstrained solution rather than just the constrained solution. This is the first breakpoint.
The slopes dα_m/dλ change, but only for the two training vectors on the margin; all other training vectors continue to have α_m = C. Calculate the new values of dα_m/dλ for these two training vectors.
Iteratively find the next breakpoint. The next breakpoint occurs when one of the following happens:
– A value of α_m that was on the margin leaves the margin, i.e., the piece-wise-linear function α_m(λ) hits α_m = 0 or α_m = C.
– One or more interior points enter the margin, i.e., in the QP problem, α_m = 0 becomes the unconstrained solution rather than just the constrained solution.
– One or more partial-error points enter the margin, i.e., in the QP problem, α_m = C becomes the unconstrained solution rather than just the constrained solution.
46
One Method for Using SVMPath (WS04, Johns Hopkins, 2004)
Download the SVMPath code from Trevor Hastie’s web page.
Test several values of γ, including values within a few orders of magnitude of γ = 1/K.
For each candidate value of γ, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further testing, and write out the corresponding values of α_m.
Test the SVMs on a separate development test database: for each combination (C, γ), find the development test error. Choose the combination that gives the least development test error.
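The selection loop in this method can be sketched with a plain (C, γ) grid standing in for SVMPath's exact C-breakpoints: train at each pair, score on an independent development set, keep the best pair. The toy dual RBF solver, the data, and the grid values are all assumptions for illustration.

```python
import numpy as np

def rbf(A, B, g):
    return np.exp(-g * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

# Toy dual RBF SVM (projected gradient ascent, bias omitted for brevity).
def train(X, y, C, g, lr=1e-2, iters=3000):
    Q = (y[:, None] * y[None, :]) * rbf(X, X, g)
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)
    return lambda Xt: np.sign(rbf(Xt, X, g) @ (a * y))

rng = np.random.default_rng(1)
def make_data(n):
    X = np.vstack([c + 0.4 * rng.standard_normal((n, 2))
                   for c in ([2, 2], [-2, -2], [2, -2], [-2, 2])])
    return X, np.concatenate([np.ones(2 * n), -np.ones(2 * n)])

Xtr, ytr = make_data(15)       # training set
Xdev, ydev = make_data(15)     # independent development test set

K = Xtr.shape[1]               # try gamma near 1/K, plus neighbors
grid = [(C, g) for C in (1.0, 10.0) for g in (0.1, 1.0 / K, 2.0)]
scores = {(C, g): (train(Xtr, ytr, C, g)(Xdev) == ydev).mean()
          for C, g in grid}
best = max(scores, key=scores.get)   # combination with least dev error
```

SVMPath replaces the coarse C grid with the exact breakpoints of the piecewise-linear path, so no interesting value of C is skipped.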
47
Results, RBF SVM: SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels. (Wang, WS04 Student Presentation, 2004)
48
High-Dimensional Linear SVMs
49
Motivation: “Project it Yourself”
The purpose of a nonlinear SVM:
– Φ(x) contains higher-order polynomial terms in the elements of x.
– By combining these higher-order polynomial terms, Σ_m y_m α_m K(x, x_m) can create a more flexible boundary than Σ_m y_m α_m x^T x_m can.
– The flexibility of the boundary does not lead to generalization error: the regularization term |v|² avoids generalization error.
A different approach:
– Augment x with higher-order terms, up to a very large dimension. These terms can include: polynomial terms, e.g., x_i x_j; N-gram terms, e.g., (x_i at time t AND x_j at time t−τ); other features suggested by knowledge-based analysis of the problem.
– Then apply a linear SVM to the higher-dimensional problem.
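The "project it yourself" idea can be shown on XOR-style data: the raw features are not linearly separable, but after augmenting with the polynomial terms x_i x_j, a plain linear rule succeeds (the x1·x2 term separates the classes). As a sketch, least squares stands in for the linear SVM; everything here is illustrative.

```python
import numpy as np

# Augment x with all second-order terms x_i * x_j.
def augment(X):
    quad = (X[:, :, None] * X[:, None, :]).reshape(len(X), -1)
    return np.hstack([X, quad])

rng = np.random.default_rng(0)
X = np.vstack([c + 0.4 * rng.standard_normal((25, 2))
               for c in ([2, 2], [-2, -2], [2, -2], [-2, 2])])
y = np.concatenate([np.ones(50), -np.ones(50)])   # XOR labeling

# Linear sign classifier fit by least squares (toy stand-in for a linear SVM).
def lin_acc(F, y):
    F1 = np.hstack([F, np.ones((len(F), 1))])     # append a bias column
    w, *_ = np.linalg.lstsq(F1, y, rcond=None)
    return (np.sign(F1 @ w) == y).mean()

raw_acc = lin_acc(X, y)           # poor: XOR is not linearly separable
aug_acc = lin_acc(augment(X), y)  # good: the x1*x2 feature separates it
```

This is the same flexibility an RBF or polynomial kernel buys implicitly, but here the human chooses which higher-order terms to add.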
50
Example #1: Acoustic Classification of Stop Place of Articulation
Feature dimension: K = 483/10ms
– MFCCs + d + dd, 25 ms window: K = 39/10ms
– Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond: K = 40/10ms
– Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths: K = 10/10ms
– Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures): K = 42/10ms
– Rate-place model of neural response fields in the cat auditory cortex: K = 352/10ms
Observation = concatenation of up to 17 frames, for a total of K = 17 × 483 = 8211 dimensions.
Results: accuracy improves as you add more features, up to 7 frames (one per 10 ms; 3381-dimensional x). Adding more frames didn’t help. The RBF SVM still outperforms the linear SVM, but only by 1%.
51
Example #2: Text Classification
Goal:
– Utterances were recorded by physical therapy patients, specifying their physical activity once per half hour for seven days.
– Example utterance: “I ate breakfast for twenty minutes, then I walked to school for ten minutes.”
– Goal: for each time period, determine the type of physical activity, from among 2000 possible type categories.
Indicator features:
– 50000 features: one per word, in a 50000-word dictionary
– x = [δ_1, δ_2, δ_3, …, δ_50000]^T
– δ_i = 1 if the ith dictionary word was contained in the utterance, zero otherwise
– x is very sparse: most sentences contain only a few words
– A linear SVM is very efficient
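The indicator features can be sketched with a tiny vocabulary. Because few entries of x are nonzero, the linear score v·x reduces to summing a handful of weights, which is why the linear SVM stays efficient at 50000 dimensions. The vocabulary and weights below are toy assumptions.

```python
import numpy as np

# Indicator ("bag of words") features: one dimension per dictionary word,
# delta_i = 1 if word i occurs anywhere in the utterance.
vocab = {w: i for i, w in enumerate(
    ["i", "ate", "breakfast", "walked", "to", "school", "for", "minutes"])}

def indicator(utterance):
    x = np.zeros(len(vocab))
    for w in utterance.lower().split():
        if w in vocab:
            x[vocab[w]] = 1.0          # indicator, not a count
    return x

x = indicator("I walked to school for ten minutes")   # "ten" is out of vocab
v = np.zeros(len(vocab))
v[vocab["walked"]] = 2.0               # toy linear-SVM weight vector
score = v @ x                          # only nonzero entries of x contribute
```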
52
Example #2: Text Classification
Result:
– 85% classification accuracy
– Most incorrect classifications were reasonable to a human: is “I played hopscotch with my daughter” “playing a game,” or “light physical exercise”?
– Some categories were never observed in the training data; therefore no test data were assigned to those categories.
Conclusion: the SVM is learning keywords & keyword combinations.
53
Summary
– Plotting the data: use PCA, LDA, or any other discriminant.
– If the PDF is known: use the MAP classifier.
– If the PDF is unknown: structural risk minimization.
– “SVM” is a training criterion: a particular upper bound on the structural risk of a hyperplane.
– Choosing hyperparameters: easy for a linear classifier; for a nonlinear classifier, use the Entire Regularization Path algorithm.
– High-dimensional linear SVMs: the human user acts as an “intelligent kernel.”