1
Landmark-Based Speech Recognition: Spectrogram Reading, Support Vector Machines, Dynamic Bayesian Networks, and Phonology Mark Hasegawa-Johnson jhasegaw@uiuc.edu University of Illinois at Urbana-Champaign, USA
2
Lecture 5: Generalization Error; Support Vector Machines
Observation Vector; Summary Statistics; Principal Components Analysis (PCA)
Risk Minimization
– If the Posterior Probability is known: MAP is optimal
– Example: Linear Discriminant Analysis (LDA)
– When the true Posterior is unknown: Generalization Error
– VC Dimension, and bounds on Generalization Error
Lagrangian Optimization
Linear Support Vector Machines
– The SVM Optimality Metric
– Lagrangian Optimization of the SVM Metric
– Hyper-parameters & Over-training
Kernel-Based Support Vector Machines
– Kernel-based classification & optimization formulas
– Hyperparameters & Over-training
– The Entire Regularization Path of the SVM
High-Dimensional Linear SVMs
– Text classification using indicator functions
– Speech acoustic classification using redundant features
3
What is an Observation? An observation can be:
– a vector created by “vectorizing” many consecutive MFCC or mel-spectral frames
– a vector including MFCCs, formants, pitch, PLP, auditory-model features, …
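The first option above can be sketched in a few lines of numpy. This is an illustrative sketch only: the function name, the 39-dimensional MFCC frames, and the 17-frame window are assumptions, not taken from the slides.

```python
import numpy as np

# Illustrative sketch: build one observation vector by "vectorizing"
# 2p+1 consecutive D-dimensional feature frames centered on frame t.
def stack_frames(feats, t, p):
    """feats: (T, D) matrix of frames; returns a ((2p+1)*D,) observation."""
    T, D = feats.shape
    # Clip indices so frames near the edges repeat the boundary frame.
    idx = np.clip(np.arange(t - p, t + p + 1), 0, T - 1)
    return feats[idx].reshape(-1)

feats = np.zeros((100, 39))          # e.g., 100 frames of 39-dim MFCCs
x = stack_frames(feats, t=50, p=8)   # 17 frames -> 663-dimensional vector
print(x.shape)                       # (663,)
```

The resulting high-dimensional vector is exactly the kind of observation whose structure the following slides analyze.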
4
Normalized Observations
5
Plotting the Observations, Part I: Scatter Plots and Histograms
6
Problem: Where is the Information in a 1000-Dimensional Vector?
7
Statistics that Summarize a Training Corpus
8
Summary Statistics: Matrix Notation (scatter plot: examples of y = −1 and examples of y = +1)
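The matrix-notation statistics can be sketched directly in numpy. The variable names X, y, and R follow the slides; the synthetic data and everything else are assumptions for illustration.

```python
import numpy as np

# Sketch: per-class means and the covariance matrix R of a training corpus,
# computed in matrix notation from an M x K data matrix (synthetic data).
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 5))            # M=200 examples, K=5 dimensions
y = np.where(rng.random(200) < 0.5, -1, 1)   # labels in {-1, +1}

mu_pos = X[y == 1].mean(axis=0)              # mean of the examples of y=+1
mu_neg = X[y == -1].mean(axis=0)             # mean of the examples of y=-1
Xc = X - X.mean(axis=0)                      # center the data
R = (Xc.T @ Xc) / len(X)                     # K x K covariance matrix R
```

R is the matrix whose eigenvectors and eigenvalues the next slide examines.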
9
Eigenvectors and Eigenvalues of R
10
Plotting the Observations, Part 2: Principal Components Analysis
11
What Does PCA Extract from the Spectrogram? Plot: “PCAGram,” a 1024-dimensional principal component reshaped to a 32×32 spectrogram and plotted as an image.
– 1st principal component (not shown): measures the total energy of the spectrogram
– 2nd principal component: E(after landmark) − E(before landmark)
– 3rd principal component: E(at the landmark) − E(surrounding syllables)
12
Minimum-Risk Classifier Design
13
True Risk, Empirical Risk, and Generalization
14
When PDF is Known: Maximum A Posteriori (MAP) is Optimal
15
Another Way to Write the MAP Classifier: Test the Sign of the Log Likelihood Ratio
16
MAP Example: Gaussians with Equal Covariance
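For equal-covariance Gaussians the log likelihood ratio is linear in x, so the MAP rule is sign(v·x + b) with v = R⁻¹(μ₁ − μ₀). A minimal sketch, assuming equal priors and synthetic two-dimensional Gaussians:

```python
import numpy as np

# MAP classifier for two Gaussians with equal covariance R:
# decide y=+1 when v.x + b > 0, where v = R^{-1}(mu1 - mu0).
rng = np.random.default_rng(0)
mu0, mu1 = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
R = np.eye(2)                                   # shared covariance
X0 = rng.multivariate_normal(mu0, R, 200)       # class y=-1
X1 = rng.multivariate_normal(mu1, R, 200)       # class y=+1

Rinv = np.linalg.inv(R)
v = Rinv @ (mu1 - mu0)
b = -0.5 * (mu1 @ Rinv @ mu1 - mu0 @ Rinv @ mu0)   # equal priors assumed

X = np.vstack([X0, X1])
y = np.concatenate([-np.ones(200), np.ones(200)])
yhat = np.where(X @ v + b > 0, 1.0, -1.0)
acc = (yhat == y).mean()
```

Note that v here is exactly the linear discriminant projection of the next slide: MAP with equal covariances and LDA give the same direction.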
17
Linear Discriminant Projection of the Data
18
Other Linear Classifiers: Empirical Risk Minimization (Choose v, b to Minimize R_emp(v, b))
19
A Serious Problem: Over-Training. The minimum-error projection of the training data, and the same projection applied to new test data.
20
When the True PDF is Unknown: Upper Bounds on True Risk
21
The VC Dimension of a Hyperplane Classifier
22
Schematic Depiction: |w| Controls the Expressiveness of the Classifier (and a less expressive classifier is less prone to overtrain)
23
The SVM = An Optimality Criterion
24
Lagrangian Optimization: Inequality Constraint. Consider minimizing f(v), subject to the constraint g(v) ≥ 0. Two solution types exist:
– g(v*) = 0: the constraint is active; the g(v) = 0 curve is tangent to the f(v) = f_min contour at v*.
– g(v*) > 0: the constraint is inactive; v* is the unconstrained minimum of f(v).
(Diagram from Osborne, 2004, showing the regions g(v) > 0 and g(v) < 0, the unconstrained minimum, and the optimum v*.)
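These two solution types are an instance of the Karush–Kuhn–Tucker (KKT) conditions. A compact standard-form statement (standard optimization notation, not copied from the slides):

```latex
\min_{v} f(v) \;\text{ s.t. }\; g_m(v) \ge 0
\quad\Longrightarrow\quad
\nabla f(v^*) = \sum_m \lambda_m \nabla g_m(v^*),
\qquad \lambda_m \ge 0,
\qquad \lambda_m\, g_m(v^*) = 0 .
```

The complementary-slackness condition λ_m g_m(v*) = 0 encodes exactly the two cases that follow: either the constraint is active, g_m(v*) = 0 (and λ_m may be positive), or it is inactive, g_m(v*) > 0 (and λ_m must be zero).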
25
Case 1: g_m(v*) = 0
26
Case 2: g_m(v*) > 0
27
Training an SVM
28
Differentiate the Lagrangian
29
… now Simplify the Lagrangian…
30
… and impose Kuhn-Tucker…
31
Three Types of Vectors
– Interior vector: α_m = 0
– Margin support vector: 0 < α_m < C
– Error: α_m = C
– Partial error: α_m = C
(From Hastie et al., NIPS 2004)
32
… and finally, Solve the SVM
33
Quadratic Programming (plot: the box 0 ≤ α_i1, α_i2 ≤ C, with the constrained optimum α*): α_i2 is off the margin; truncate to α_i2 = 0. α_i1 is still a margin candidate; solve for it again in iteration i+1.
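The truncation step can be seen end-to-end in a toy dual solver. This is a sketch, not the slides' QP algorithm: it uses plain projected gradient ascent on the dual (ascend, then clip each α_m into [0, C]) in place of a real QP solver such as libsvm, and all data and settings are assumptions.

```python
import numpy as np

# Toy linear SVM via projected gradient ascent on the dual:
# maximize sum(a) - 0.5 a'Qa, Q = (y y') * (X X'), subject to 0 <= a_m <= C.
# The np.clip call is the "truncate" step from the slide.
def fit_linear_svm(X, y, C=1.0, lr=1e-3, iters=5000):
    Q = (y[:, None] * y[None, :]) * (X @ X.T)
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)   # ascend, then truncate
    v = (a * y) @ X                    # v = sum_m alpha_m y_m x_m
    on_margin = (a > 1e-6) & (a < C - 1e-6)
    b = (y[on_margin] - X[on_margin] @ v).mean() if on_margin.any() else 0.0
    return v, b, a

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(2.0, 0.5, (40, 2)), rng.normal(-2.0, 0.5, (40, 2))])
y = np.concatenate([np.ones(40), -np.ones(40)])
v, b, a = fit_linear_svm(X, y)
acc = (np.sign(X @ v + b) == y).mean()
```

After convergence the three vector types are visible in a: interior vectors at 0, margin support vectors strictly inside (0, C), and errors pinned at C.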
34
Linear SVM Example
36
Choosing the Hyper-Parameter to Avoid Over-Training (Wang, presentation at CLSP workshop WS04): SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels.
37
Choosing the Hyper-Parameter to Avoid Over-Training
Recall that v = Σ_m α_m y_m x_m.
Therefore |v| < (C Σ_m |x_m|)^1/2 < (C M max|x_m|)^1/2.
Therefore the width of the margin is constrained to 1/|v| > (C M max|x_m|)^−1/2, and therefore the SVM is not allowed to make the margin very small in its quest to fix individual errors.
Recommended solution:
– Normalize the x_m so that max|x_m| ≈ 1 (e.g., using libsvm)
– Set C ≈ 1/M
– If desired, adjust C up or down by a factor of 2, to see whether the error rate on independent development test data decreases
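The recommended recipe above can be written out as a short sketch. The data is synthetic and the candidate schedule is an assumption; only the scaling rule, the C ≈ 1/M starting point, and the factor-of-2 search come from the slide.

```python
import numpy as np

# Recipe sketch: scale so max |x_m| is about 1, start at C = 1/M, and
# prepare factor-of-2 adjustments to test on a development set.
rng = np.random.default_rng(0)
X = 10.0 * rng.standard_normal((500, 40))      # M=500 raw training vectors
M = len(X)

X = X / np.linalg.norm(X, axis=1).max()        # now max |x_m| = 1
C0 = 1.0 / M                                   # recommended starting point
candidates = [C0 * 2.0 ** k for k in (-2, -1, 0, 1, 2)]
# ...train an SVM at each candidate C and keep the value that gives the
# lowest error on an independent development test set.
```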
38
From Linear to Nonlinear SVM
39
Example: RBF Classifier
40
An RBF Classification Boundary
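An RBF boundary of this kind can be produced by the same toy projected-gradient dual solver, swapping the inner product for the RBF kernel K(x, x′) = exp(−γ|x − x′|²). The XOR-style data, which no linear boundary separates, and all settings are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# Kernel SVM via projected gradient ascent on the dual, 0 <= a_m <= C.
def fit_rbf_svm(X, y, C=10.0, gamma=0.5, lr=1e-2, iters=4000):
    K = rbf_kernel(X, X, gamma)
    Q = (y[:, None] * y[None, :]) * K
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)
    on_margin = (a > 1e-6) & (a < C - 1e-6)
    b = (y[on_margin] - K[on_margin] @ (a * y)).mean() if on_margin.any() else 0.0
    # f(x) = sum_m a_m y_m K(x, x_m) + b
    return lambda Xt: np.sign(rbf_kernel(Xt, X, gamma) @ (a * y) + b)

rng = np.random.default_rng(0)
centers = np.array([[2, 2], [-2, -2], [2, -2], [-2, 2]], float)
X = np.vstack([c + 0.4 * rng.standard_normal((20, 2)) for c in centers])
y = np.concatenate([np.ones(40), -np.ones(40)])   # quadrants I,III vs II,IV
predict = fit_rbf_svm(X, y)
acc = (predict(X) == y).mean()
```

The extra hyperparameter γ is exactly what makes model selection harder on the next slides: C and γ must now be chosen jointly.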
41
Two Hyperparameters (C and γ): Choosing Hyperparameters is Much Harder (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
42
Optimum Value of C Depends on γ (from Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
43
SVM is a “Regularized Learner” (λ = 1/C)
44
SVM Coefficients are a Piece-Wise Linear Function of λ = 1/C (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
45
The Entire Regularization Path of the SVM: Algorithm (Hastie, Rosset, Tibshirani, and Zhu, NIPS 2004)
Start with λ large enough (C small enough) that all training tokens are partial errors (α_m = C). Compute the solution of the quadratic programming problem in this case, including inversion of X^T X or X X^T.
Reduce λ (increase C) until the initial event occurs: two partial-error points enter the margin, i.e., in the QP problem, α_m = C becomes the unconstrained solution rather than just the constrained solution. This is the first breakpoint.
The slopes dα_m/dλ change, but only for the two training vectors on the margin; all other training vectors continue to have α_m = C. Calculate the new values of dα_m/dλ for these two training vectors.
Iteratively find the next breakpoint. The next breakpoint occurs when one of the following happens:
– A value of α_m that was on the margin leaves the margin, i.e., the piece-wise-linear function α_m(λ) hits α_m = 0 or α_m = C.
– One or more interior points enter the margin, i.e., in the QP problem, α_m = 0 becomes the unconstrained solution rather than just the constrained solution.
– One or more partial-error points enter the margin, i.e., in the QP problem, α_m = C becomes the unconstrained solution rather than just the constrained solution.
46
One Method for Using SVMPath (WS04, Johns Hopkins, 2004)
Download the SVMPath code from Trevor Hastie’s web page.
Test several values of γ, including values within a few orders of magnitude of γ = 1/K.
For each candidate value of γ, use SVMPath to find the C-breakpoints. Choose a few dozen C-breakpoints for further testing, and write out the corresponding values of α_m.
Test the SVMs on a separate development test database: for each combination (C, γ), find the development test error. Choose the combination that gives the least development test error.
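The selection loop in this method can be sketched with a plain (C, γ) grid standing in for SVMPath's exact C-breakpoints: train at each pair, score on an independent development set, keep the best pair. The toy dual RBF solver, the data, and the grid values are all assumptions for illustration.

```python
import numpy as np

def rbf(A, B, g):
    return np.exp(-g * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))

# Toy dual RBF SVM (projected gradient ascent, bias omitted for brevity).
def train(X, y, C, g, lr=1e-2, iters=3000):
    Q = (y[:, None] * y[None, :]) * rbf(X, X, g)
    a = np.zeros(len(y))
    for _ in range(iters):
        a = np.clip(a + lr * (1.0 - Q @ a), 0.0, C)
    return lambda Xt: np.sign(rbf(Xt, X, g) @ (a * y))

rng = np.random.default_rng(1)
def make_data(n):
    X = np.vstack([c + 0.4 * rng.standard_normal((n, 2))
                   for c in ([2, 2], [-2, -2], [2, -2], [-2, 2])])
    return X, np.concatenate([np.ones(2 * n), -np.ones(2 * n)])

Xtr, ytr = make_data(15)       # training set
Xdev, ydev = make_data(15)     # independent development test set

K = Xtr.shape[1]               # try gamma near 1/K, plus neighbors
grid = [(C, g) for C in (1.0, 10.0) for g in (0.1, 1.0 / K, 2.0)]
scores = {(C, g): (train(Xtr, ytr, C, g)(Xdev) == ydev).mean()
          for C, g in grid}
best = max(scores, key=scores.get)   # combination with least dev error
```

SVMPath replaces the coarse C grid with the exact breakpoints of the piecewise-linear path, so no interesting value of C is skipped.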
47
Results, RBF SVM: SVM test-corpus error vs. λ = 1/C, classification of nasal vs. non-nasal vowels. (Wang, WS04 Student Presentation, 2004)
48
High-Dimensional Linear SVMs
49
Motivation: “Project it Yourself”
The purpose of a nonlinear SVM:
– Φ(x) contains higher-order polynomial terms in the elements of x.
– By combining these higher-order polynomial terms, Σ_m y_m α_m K(x, x_m) can create a more flexible boundary than Σ_m y_m α_m x^T x_m can.
– The flexibility of the boundary does not lead to generalization error: the regularization term |v|² avoids generalization error.
A different approach:
– Augment x with higher-order terms, up to a very large dimension. These terms can include: polynomial terms, e.g., x_i x_j; N-gram terms, e.g., (x_i at time t AND x_j at time t−τ); other features suggested by knowledge-based analysis of the problem.
– Then apply a linear SVM to the higher-dimensional problem.
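The "project it yourself" idea can be shown on XOR-style data: the raw features are not linearly separable, but after augmenting with the polynomial terms x_i x_j, a plain linear rule succeeds (the x1·x2 term separates the classes). As a sketch, least squares stands in for the linear SVM; everything here is illustrative.

```python
import numpy as np

# Augment x with all second-order terms x_i * x_j.
def augment(X):
    quad = (X[:, :, None] * X[:, None, :]).reshape(len(X), -1)
    return np.hstack([X, quad])

rng = np.random.default_rng(0)
X = np.vstack([c + 0.4 * rng.standard_normal((25, 2))
               for c in ([2, 2], [-2, -2], [2, -2], [-2, 2])])
y = np.concatenate([np.ones(50), -np.ones(50)])   # XOR labeling

# Linear sign classifier fit by least squares (toy stand-in for a linear SVM).
def lin_acc(F, y):
    F1 = np.hstack([F, np.ones((len(F), 1))])     # append a bias column
    w, *_ = np.linalg.lstsq(F1, y, rcond=None)
    return (np.sign(F1 @ w) == y).mean()

raw_acc = lin_acc(X, y)           # poor: XOR is not linearly separable
aug_acc = lin_acc(augment(X), y)  # good: the x1*x2 feature separates it
```

This is the same flexibility an RBF or polynomial kernel buys implicitly, but here the human chooses which higher-order terms to add.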
50
Example #1: Acoustic Classification of Stop Place of Articulation
Feature dimension: K = 483/10ms
– MFCCs + d + dd, 25 ms window: K = 39/10ms
– Spectral shape: energy, spectral tilt, and spectral compactness, once per millisecond: K = 40/10ms
– Noise-robust MUSIC-based formant frequencies, amplitudes, and bandwidths: K = 10/10ms
– Acoustic-phonetic parameters (formant-based relative spectral measures and time-domain measures): K = 42/10ms
– Rate-place model of neural response fields in the cat auditory cortex: K = 352/10ms
Observation = concatenation of up to 17 frames, for a total of K = 17 × 483 = 8211 dimensions.
Results: accuracy improves as you add more features, up to 7 frames (one per 10 ms; 3381-dimensional x). Adding more frames didn’t help. The RBF SVM still outperforms the linear SVM, but only by 1%.
51
Example #2: Text Classification
Goal:
– Utterances were recorded by physical therapy patients, specifying their physical activity once per half hour for seven days.
– Example utterance: “I ate breakfast for twenty minutes, then I walked to school for ten minutes.”
– Goal: for each time period, determine the type of physical activity, from among 2000 possible type categories.
Indicator features:
– 50000 features: one per word, in a 50000-word dictionary
– x = [δ_1, δ_2, δ_3, …, δ_50000]^T
– δ_i = 1 if the ith dictionary word was contained in the utterance, zero otherwise
– x is very sparse: most sentences contain only a few words
– A linear SVM is very efficient
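The indicator features can be sketched with a tiny vocabulary. Because few entries of x are nonzero, the linear score v·x reduces to summing a handful of weights, which is why the linear SVM stays efficient at 50000 dimensions. The vocabulary and weights below are toy assumptions.

```python
import numpy as np

# Indicator ("bag of words") features: one dimension per dictionary word,
# delta_i = 1 if word i occurs anywhere in the utterance.
vocab = {w: i for i, w in enumerate(
    ["i", "ate", "breakfast", "walked", "to", "school", "for", "minutes"])}

def indicator(utterance):
    x = np.zeros(len(vocab))
    for w in utterance.lower().split():
        if w in vocab:
            x[vocab[w]] = 1.0          # indicator, not a count
    return x

x = indicator("I walked to school for ten minutes")   # "ten" is out of vocab
v = np.zeros(len(vocab))
v[vocab["walked"]] = 2.0               # toy linear-SVM weight vector
score = v @ x                          # only nonzero entries of x contribute
```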
52
Example #2: Text Classification
Result:
– 85% classification accuracy
– Most incorrect classifications were reasonable to a human: is “I played hopscotch with my daughter” “playing a game,” or “light physical exercise”?
– Some categories were never observed in the training data; therefore no test data were assigned to those categories.
Conclusion: the SVM is learning keywords & keyword combinations.
53
Summary
– Plotting the data: use PCA, LDA, or any other discriminant.
– If the PDF is known: use the MAP classifier.
– If the PDF is unknown: structural risk minimization.
– “SVM” is a training criterion: a particular upper bound on the structural risk of a hyperplane.
– Choosing hyperparameters: easy for a linear classifier; for a nonlinear classifier, use the Entire Regularization Path algorithm.
– High-dimensional linear SVMs: the human user acts as an “intelligent kernel.”