1
Sequence Scoring Experiments Using the TIMIT Corpus and the HTK Recognition Framework Author: Arthur Gerald Kunkle Committee Chair: Dr. Veton Z. Këpuska
2
ASR Defined Automatic Speech Recognition (ASR) – mapping an acoustic signal into a string of words. ASR systems play a major role in Human-Machine Interaction (HMI). Speech is potentially far more intuitive for commanding a machine than existing input methods such as the keyboard and mouse. This thesis deals with creating a specific type of ASR (presented shortly).
3
Early ASR Systems The earliest ASR systems modeled the natural resonances that occur as air flows over the vocal tract, creating sounds; these natural resonant frequencies are called formants. Example: to recognize the digit “five,” the system would determine that the vowel sound “eye” matched the correct digit. Limitation: the utterance could contain only a single digit and no other word or non-speech event that would confuse the system. So let’s take a minute and discuss the evolution of ASR systems.
4
ASR Improvements ASR system development in the 1980s and 1990s introduced the use of Hidden Markov Models (HMMs), which have remained in wide use over the past two decades, with improvements made on a continual basis. ASR received interest from DARPA, leading to new and notable systems such as CMU Sphinx (Carnegie Mellon University), and DARPA formalized the tasks and evaluation criteria used to measure ASR system performance. HMM-based recognition is a major theme in this thesis.
5
Major Tasks in ASR History
Resource Management – Rigid military expressions with a vocabulary of 1,000 words. ATIS – Simple spontaneous speech conversation with an automated air-travel information retrieval system. WSJ – Transcribed audio from read Wall Street Journal paragraphs, with a notably large vocabulary of 60,000 words. Switchboard – Conversational and spontaneous speech with many disfluencies, such as partial words, hesitations, and noise. The trend is increasingly complex tasks coupled with better performance. Note the relationship between vocabulary size, speaker context, speaker-dependent variations, and channel characteristics (such as noise) and the WER that can be achieved; these are all key parameters that characterize the way an ASR system is designed and trained.
6
Timeline of ASR Achievements
This slide reinforces the concepts in the previous viewgraph (over time, tasks become more complex). In the 1960s, it was possible to recognize small-vocabulary tasks on the order of 10 to 100 words, based solely on knowledge of the acoustic properties of speech sounds. The 1970s introduced the ability to recognize medium-vocabulary tasks using new spectral representations of speech signals and template-based pattern recognition techniques. The 1980s paved the way into the modern era of speech recognition with the addition of HMM technology.
7
Characteristics of ASR Systems
ASR systems are defined by the tasks they are designed to solve; we have already discussed some example tasks. Tasks involve the following parameters: Vocabulary Size, Fluency, Environmental Effects, and Speaker Characteristics.
8
Vocabulary Size Milestones in ASR systems are often related to how large a vocabulary a system can handle while keeping the error rate at a minimum. Simple task vocabulary: recognizing digits: “zero,” “one,” “two,” …, and “oh.” These eleven words are the in-vocabulary (INV) words. Any words the system encounters outside of this set are known as out-of-vocabulary (OOV) words.
9
ASR Tasks and Vocabulary Sizes
Task Name | Vocabulary Size | Word Error Rate (%)
Texas Instruments (TI) Digits | 11 (zero-nine, “oh”) | 0.5
Wall Street Journal 1 | 5,000 | 3
Wall Street Journal 2 | 20,000 | —
Broadcast News | 64,000+ | 10
Conversational Telephone Speech | — | 20

As the vocabulary size of a task increases, so does the Word Error Rate (WER). WER is the standard evaluation metric for speech recognition.
10
Example WER Calculation
This example compares an output hypothesis of a digit string from an ASR system with the true sentence string. The bottom line marks the types of errors as they occur in the transcription.

Reference:  ONE  TWO  THREE  FOUR  FIVE  SIX  SEVEN  *****
Hypothesis: ***  TWO  *****  FIVE  FIVE  SIX  SEVEN  ONE
Evaluation:  D         D      S                       I

We will use a variation of this as the “task” in this thesis.
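A quick worked evaluation of this example using the standard WER formula (counting from the evaluation line: S = 1 substitution, D = 2 deletions, I = 1 insertion, against N = 7 reference words):

$\mathrm{WER} = \frac{S + D + I}{N} = \frac{1 + 2 + 1}{7} \approx 57\%$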
11
ASR System Fluency Fluency measures the rigidity of input speech.
In isolated-word recognition, the speech to be processed is surrounded by known silence or pause. Examples include digit recognition or command-and-control tasks. Continuous-speech systems must take non-speech events and segmentation of real words into account. This is much harder to accomplish!
12
Other ASR System Parameters
Environmental noise and channel characteristics: recording instruments may be located at different distances from each speaker and may pick up other noises in addition to speech. Speaker-dependent characteristics: speaker dialect and accent.
13
Wake-up-Word Paradigm
The Wake-up-Word (WUW) ASR problem: detect a single word or phrase when spoken in an alerting context, while rejecting all other words, phrases, sounds, noises, and other acoustic events with virtually 100% accuracy, including the same word or phrase of interest spoken in a non-alerting (i.e., referential) context. WUW is a specific and unique facet of ASR.
14
WUW Example Application
User utters the WUW “Computer” to alert a machine to perform various commands. When the user utters the command phrase “Computer, begin presentation,” WUW technology should detect that “Computer” was spoken in the alerting context and perform the requested command. If the user utters the phrase “I want to buy a new computer,” WUW technology must detect that “Computer” was used in a non-alerting context and avoid parsing the command.
15
WUW Problem Areas Detecting WUW Context – The WUW system must be able to notify the host system that attention is required, and do so with high accuracy. Unlike keyword spotting, WUW dictates that these occurrences be reported only during an alerting context. This context can be determined using features such as leading and trailing silence, differences in the long-term average of speech features, and prosodic information (pitch, intonation, rhythm, etc.). Identifying WUW – After identifying the correct context for a spoken utterance, the WUW paradigm shall be responsible for determining whether the utterance contains the pre-defined Wake-up-Word to be used for command (e.g., “Computer”) with a high degree of accuracy, e.g., > 99%. Correct Rejection of Non-WUW – Similar to identification of the WUW, the system shall also be capable of filtering speech tokens that are not WUWs with practically 100% accuracy, to guarantee 0% false acceptances.
16
Current WUW System These applications are in development by Dr. Kepuska and his graduate students, and many of the system components will be reused in the system developed in this work. The current WUW system is being used for practical applications such as the PowerPoint Commander, Elevator Simulator, Car Inspection System, and Nursing Call Center.
17
Motivations for External Scoring Toolkit
Support for standard speech recognition testing data sets – Provide support for evaluating the TIMIT data set in order to evaluate novel scoring methods against a broader class of words.
Integration of standard toolkits – Utilize the Hidden Markov Model Toolkit (HTK) and the SVM library (LIBSVM) to build and evaluate HMM and SVM models. Using industry-standard frameworks has the benefit of a well-documented environment and previous results.
Integration of novel scoring techniques with standard toolkits – The novel method used in the WUW system must be integrated with the existing HTK workflow in order to augment the technique and evaluate its effectiveness against additional data sources.
MATLAB-based analysis and experimentation tools – Once results are obtained using the SeqRec tools for HTK and LIBSVM, MATLAB scripts provide visualization of the results.
Support for One-Class SVM modeling – A technique that allows a recognition model to be built on INV data scores only. This SVM type will be applied to WUW, and its benefits and disadvantages explored.
External toolkits allow us to reduce the scope of the scoring methods and hold the *multitude* of other ASR factors constant.
18
SeqRec System Overview
In order to further explore and refine the unique speech recognition elements of the WUW system, the Sequence Recognizer (SeqRec) Toolkit was developed. SeqRec is the actual system component developed for the thesis.
19
Speech Recognition Goals
Speech recognition systems often assume speech is a realization of some message encoded as a sequence of one or more discrete symbols. Speech is normally converted into a sequence of equally spaced discrete parameter vectors (typically every 10 ms). This makes the assumption that a speech waveform can be regarded as a stationary process over the short sampling window. The fundamentals of HMM-based ASR are now discussed as an introduction to SeqRec.
20
Speech Recognition Goals, contd.
The speech recognizer’s job is to create a mapping between the sequences of speech frames and the underlying speech symbols that constitute the utterance.
21
Probability Theory of ASR
“What is the most likely discrete symbol sequence out of all valid sequences in the language L, given some acoustic input O?” The acoustic input is a sequence of discrete observations, $O = o_1, o_2, \ldots, o_T$. The symbol sequence is defined as $W = w_1, w_2, \ldots, w_n$. Fundamental ASR system goal:

$\hat{W} = \underset{W \in L}{\arg\max}\ P(W \mid O)$

The argmax expression maximizes the probability over all valid sequences $W$ in the language; $\hat{W}$ is the hypothesized word sequence, $W$ the actual one.
22
Probability Theory of ASR, contd.
Applying Bayes’ Theorem:

$P(W \mid O) = \frac{P(O \mid W)\,P(W)}{P(O)}$

The new quantities are easier to compute than $P(W \mid O)$ directly. $P(W)$ is defined as the prior probability of the sequence itself, calculated using prior knowledge of occurrences of the sequence $W$. $P(O)$ is the prior probability of the acoustic input occurring, and would be quite hard to calculate.
23
Probability Theory of ASR, contd.
$P(O)$ is not needed, because the argmax is taken over all possible sequences and $P(O)$ is constant with respect to $W$:

$\hat{W} = \underset{W \in L}{\arg\max}\ P(O \mid W)\,P(W)$

The probability $P(O \mid W)$, the likelihood of the acoustic input $O$ given the sequence $W$, is defined as the observation likelihood (often referred to as the acoustic score). This quantity can be determined using the Hidden Markov Model.
24
Elements of HMMs The set of $N$ states constituting the model. Although the states themselves are “hidden” from the perspective of the state assignment of each observation vector, the exact number of states often carries a physical significance.
25
Elements of HMMs, contd. The transition probability matrix $A = \{a_{ij}\}$. Each element $a_{ij}$ represents the probability of transitioning from state $i$ to state $j$. Each row of this matrix must sum to 1 to be valid.
26
Elements of HMMs, contd. The emission probabilities $B = \{b_i(o_t)\}$. Each expresses the probability of an observation being generated while in state $i$. Note that the beginning and end states of an HMM do not have an associated emission probability.
27
Elements of HMMs, contd. The probability distribution $\pi = \{\pi_i\}$ of starting in each state. Note that for “left-to-right” speech HMMs, state 1 usually has $\pi_1 = 1$ and all other states have $\pi_i = 0$.
28
Elements of HMMs, contd. The following expresses all the parameters of an HMM in a compact form: $\lambda = (A, B, \pi)$.
29
ASR HMMs An ASR HMM is normally used to model a phoneme.
The phoneme is the smallest distinguishable sound unit in a language. Phoneme HMMs generally have three emitting states in order to model the transition-in, steady-state, and transition-out regions of the phoneme. A whole-word HMM is created by simply concatenating the phonemes used to spell the word in question. Note that the non-emitting entry and exit states are convenient because they can be dropped when concatenating models.
30
Acoustic Scores Using HMMs
So how do we use HMMs to calculate the probability of an observation sequence, given a specific model? Restated: score how well a given model matches an input observation sequence. For HMMs, each hidden state produces only a single observation, so the length of the traversed state sequence equals the length of the observation sequence. This of course assumes statistical independence of each observation vector in the sequence $O$, which is generally acceptable.
31
Acoustic Scores Using HMMs, contd.
The actual state sequence that an observation sequence will take is hidden. Assuming independence, we have to calculate the joint probability of a particular observation sequence $O$ and a particular state sequence $Q$:

$P(O, Q \mid \lambda) = \pi_{q_1}\, b_{q_1}(o_1) \prod_{t=2}^{T} a_{q_{t-1} q_t}\, b_{q_t}(o_t)$

This probability must then be calculated across all valid state sequences in the model:

$P(O \mid \lambda) = \sum_{\text{all } Q} P(O, Q \mid \lambda)$

Each sequence probability is thus *weighted* by the probability of the state sequence itself.
32
Acoustic Scores Using HMMs, contd.
While this solution is valid, it requires on the order of $O(N^T)$ computations, and for speech applications of HMMs these parameters can become quite large. To reduce the amount of computation, the forward algorithm is used, which brings the cost down to $O(N^2 T)$.
33
Forward Algorithm The forward algorithm is a dynamic programming technique that uses a table to store intermediate values as it builds up the final probability of the observation sequence. Each cell is calculated by summing over the extensions of all paths that lead to the current cell.
34
Forward Algorithm, contd.
The forward algorithm is a three-step process:

Initialization: $\alpha_1(j) = \pi_j\, b_j(o_1), \quad 1 \le j \le N$

Induction: $\alpha_t(j) = \Big[\sum_{i=1}^{N} \alpha_{t-1}(i)\, a_{ij}\Big] b_j(o_t)$

Termination: $P(O \mid \lambda) = \sum_{j=1}^{N} \alpha_T(j)$

The termination result is the probability we are after.
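As a concrete illustration, here is a minimal numpy sketch of the three steps for a discrete-observation HMM (the model and all numbers are illustrative, not from the thesis; ASR HMMs use GMM emission densities rather than a discrete B matrix):

    import numpy as np

    def forward(pi, A, B, obs):
        """Forward algorithm: P(O | lambda) for a discrete-observation HMM.

        pi  -- (N,) initial state probabilities
        A   -- (N, N) transition matrix, A[i, j] = P(state j | state i)
        B   -- (N, M) emission matrix, B[j, k] = P(symbol k | state j)
        obs -- sequence of observation symbol indices, length T
        """
        T, N = len(obs), len(pi)
        alpha = np.zeros((T, N))
        alpha[0] = pi * B[:, obs[0]]                # initialization
        for t in range(1, T):                       # induction
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        return alpha[-1].sum()                      # termination

    # Toy 2-state, 2-symbol example:
    pi = np.array([1.0, 0.0])
    A = np.array([[0.6, 0.4],
                  [0.0, 1.0]])
    B = np.array([[0.7, 0.3],
                  [0.2, 0.8]])
    print(forward(pi, A, B, [0, 1, 1]))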
35
Forward Algorithm, contd.
36
HMM Parameter Re-estimation
HMM parameter re-estimation asks how we should adjust the model parameters in order to maximize the acoustic score. This problem is addressed using the Baum-Welch algorithm. We are going to be *brief* with this topic!
37
HMM Parameter Re-estimation, contd.
Goal for re-estimating the transition probability matrix $A$: $\bar{a}_{ij} = \dfrac{\text{expected number of transitions from state } i \text{ to state } j}{\text{expected number of transitions from state } i}$. Goal for re-estimating the emission probability distributions: $\bar{b}_j(k) = \dfrac{\text{expected number of times in state } j \text{ observing symbol } k}{\text{expected number of times in state } j}$.
38
HMM Parameter Re-estimation, contd.
These calculations lead to the standard re-estimation equations; see Rabiner (1989) for details and derivations. Here $\beta$ is the backward probability, computed by essentially iterating backward through the trellis from before:

$\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}{P(O \mid \lambda)}, \qquad \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} \xi_t(i,j)}{\sum_{t=1}^{T-1} \sum_j \xi_t(i,j)}$

These calculations are made iteratively to create new, refined models.
39
HMM Parameter Re-estimation, contd.
If a current model $\lambda$ is re-estimated using the EM algorithm to create a new, refined model $\bar{\lambda}$, then either: the initial model defines a critical point of the likelihood function, in which case $\bar{\lambda} = \lambda$ (no HMM parameter updates were made); or a new model has been discovered under which the observation sequence $O$ is more likely to have been produced. The final model produced by EM is called the maximum-likelihood HMM.
40
Speech-Specific HMM Recognition
The previous section presented the fundamentals associated with using HMMs to perform general sequence recognition. There are some additional concepts associated specifically with the speech recognition task domain: Feature Representation of Speech Gaussian Mixture Model Distributions
41
Feature Representation of Acoustic Speech Signals
The input to an ASR system is normally a continuous speech waveform. This input must be transformed into a sequence of acoustic feature vectors, each of which captures a small amount of information within the original waveform.
42
Feature Representation of Acoustic Speech Signals, contd.
Pre-emphasis – This stage is used to amplify energy in the high frequencies of the input speech signal. This allows information in these regions to be more recognizable during HMM model training and recognition.
43
Feature Representation of Acoustic Speech Signals, contd.
Windowing – This stage slices the input signal into discrete time segments. A Hamming window is commonly used to prevent the edge effects associated with the sharp transitions of a rectangular window.
44
Feature Representation of Acoustic Speech Signals, contd.
Discrete Fourier Transform – DFT is applied to the windowed speech signal, resulting in the magnitude and phase representation of the signal.
45
Feature Representation of Acoustic Speech Signals, contd.
Mel Filter Bank – Human hearing is less sensitive at frequencies above 1000 Hz, so the spectrum is warped using a logarithmic Mel scale, $\text{mel}(f) = 2595 \log_{10}(1 + f/700)$. A bank of filters is constructed, with filters distributed equally below 1000 Hz and spaced logarithmically above 1000 Hz. We take the log of each Mel filter-bank output value, which makes the output less sensitive to variations.
46
Feature Representation of Acoustic Speech Signals, contd.
Inverse DFT – The IDFT of the Mel log spectrum is computed, resulting in the cepstrum. This representation is valuable because it separates the characteristics of the source and filter of the speech waveform. The first 12 values of the resulting cepstrum are recorded. Delta MFCC Features – To capture frame-to-frame changes in speech, the first and second derivatives of the MFCC coefficients are also calculated and included.
47
Feature Representation of Acoustic Speech Signals, contd.
Energy Feature – This step is performed in parallel with the MFCC feature extraction and involves calculating the total energy of the input frame.
48
Feature Representation of Acoustic Speech Signals, contd.
This results in a 39-element observation vector for each frame of speech:

Feature Type | Count
Cepstral Coefficients | 12
Delta Cepstral Coefficients | 12
Double Delta Cepstral Coefficients | 12
Energy Coefficient | 1
Delta Energy Coefficient | 1
Double Delta Energy Coefficient | 1
Total | 39
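The pipeline above can be sketched compactly in Python with numpy/scipy. The window size, step, filter count, and pre-emphasis coefficient below are common defaults, not necessarily the thesis's exact HTK settings, and the delta, double-delta, and energy terms that complete the 39-dimensional vector are omitted for brevity:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_frames(signal, fs=16000, frame_ms=25, step_ms=10,
                    n_filters=26, n_ceps=12, pre=0.97, nfft=512):
        """Sketch of the MFCC front-end described above (static
        coefficients only; deltas and energy would be appended)."""
        # 1. Pre-emphasis: amplify the high-frequency energy.
        sig = np.append(signal[0], signal[1:] - pre * signal[:-1])
        # 2. Slice into overlapping frames, apply a Hamming window.
        flen, step = fs * frame_ms // 1000, fs * step_ms // 1000
        n_frames = 1 + (len(sig) - flen) // step
        idx = np.arange(flen) + step * np.arange(n_frames)[:, None]
        frames = sig[idx] * np.hamming(flen)
        # 3. Power spectrum via the DFT.
        power = np.abs(np.fft.rfft(frames, nfft)) ** 2 / nfft
        # 4. Triangular mel filter bank: linear below ~1000 Hz, log above.
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        pts = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filters + 2))
        bins = np.floor((nfft + 1) * pts / fs).astype(int)
        fbank = np.zeros((n_filters, nfft // 2 + 1))
        for j in range(n_filters):
            l, c, r = bins[j], bins[j + 1], bins[j + 2]
            fbank[j, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
            fbank[j, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
        # 5. Log filter-bank outputs, then an inverse transform (DCT)
        #    to get the cepstrum; keep the first 12 values.
        logmel = np.log(power @ fbank.T + 1e-10)
        return dct(logmel, type=2, axis=1, norm='ortho')[:, 1:n_ceps + 1]

    feats = mfcc_frames(np.random.randn(16000))  # one second of noise
    print(feats.shape)                           # (frames, 12)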
49
Gaussian Mixture Models
Until now, the emission probability associated with an HMM state was left as a general probability distribution. In most ASR systems, these output probabilities are continuous-density multivariate output distributions. The most common form of this distribution used in speech recognition is the Gaussian Mixture Model (GMM).
50
Gaussian Mixture Models, contd.
A simple Gaussian distribution describing a one-dimensional random variable $X$ is described by its mean and variance:

$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$
51
Gaussian Mixture Models, contd.
Assume a simple (though impractical) ASR system exists where a single-variable Gaussian is used. Each HMM state would have an emission probability that assumes the values of each observation vector are normally distributed: $b_j(o_t) = \mathcal{N}(o_t;\, \mu_j, \sigma_j^2)$. Recall that $b_j$ is one of the three parameter sets of the HMM $\lambda = (A, B, \pi)$.
52
Gaussian Mixture Models, contd.
Recall that each observation is actually a $D$-element vector (we found $D = 39$ for common MFCC representations). The distribution extends to a multivariate Gaussian, where the mean is a vector of length $D$ and the covariance is a $D \times D$ matrix:

$b_j(o_t) = \frac{1}{\sqrt{(2\pi)^D \lvert\Sigma_j\rvert}} \exp\!\left(-\tfrac{1}{2}(o_t - \mu_j)^\top \Sigma_j^{-1} (o_t - \mu_j)\right)$
53
Gaussian Mixture Models, contd.
What if some of the features do not follow a strict normal distribution? This is actually quite common. To account for complex, non-normal distributions, the Gaussian Mixture Model is used:

$b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t;\, \mu_{jm}, \Sigma_{jm})$

This is the result of combining $M$ Gaussian mixture components, where the contribution of each is given by a scalar weight $c_{jm}$.
54
GMM Example Example of a non-normal, one-dimensional probability distribution that is more effectively modeled using a GMM with 3 mixtures
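For instance, a 3-mixture GMM fit to an illustrative non-normal 1-D sample; scikit-learn is shown here purely as a convenient stand-in for the thesis's tooling, and all data values are made up:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    # Illustrative non-normal 1-D data: three overlapping clusters.
    rng = np.random.default_rng(0)
    x = np.concatenate([rng.normal(-4, 1.0, 500),
                        rng.normal(0, 0.5, 300),
                        rng.normal(3, 1.5, 400)]).reshape(-1, 1)

    gmm = GaussianMixture(n_components=3).fit(x)
    print(gmm.weights_)        # the scalar mixture weights c_m
    print(gmm.means_.ravel())  # one mean per mixture component
    # Log-density of new points under the 3-mixture model:
    print(gmm.score_samples(np.array([[-4.0], [0.0], [3.0]])))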
55
The Hidden Markov Modeling Toolkit (HTK)
HTK is a well-established framework, primarily designed for building HMM-based systems for speech processing and speech recognition.
56
HTK Data Preparation Tools
Provide mechanisms to convert arbitrarily formatted speech sound files and textual transcriptions into a uniform format suitable for HMM model training. The raw waveform audio must also be converted to MFCCs. Supporting data such as the phonetic dictionary must be properly formatted to ensure all pronunciations are available prior to training.
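A plausible HCopy-style configuration for this step might look as follows; the parameter values are typical HTK defaults rather than SeqRec's actual settings:

    # Possible HCopy configuration for TIMIT-style MFCC extraction
    SOURCEKIND   = WAVEFORM
    SOURCEFORMAT = NIST        # TIMIT ships in NIST SPHERE format
    TARGETKIND   = MFCC_E_D_A  # 12 MFCCs + energy + deltas + accels = 39
    TARGETRATE   = 100000.0    # 10 ms frame step (HTK units of 100 ns)
    WINDOWSIZE   = 250000.0    # 25 ms analysis window
    USEHAMMING   = T
    PREEMCOEF    = 0.97
    NUMCHANS     = 26          # mel filter-bank channels
    NUMCEPS      = 12
    CEPLIFTER    = 22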
57
HTK Training Tools Uses the HTK-formatted data from the previous stage to define, initialize, and re-estimate the set of HMM models.
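A hypothetical flat-start training pass with the standard HTK tools (file names are placeholders, not the thesis's actual scripts):

    # Compute a global mean/variance to initialize the prototype HMM:
    HCompV -C config -f 0.01 -m -S train.scp -M hmm0 proto
    # One Baum-Welch re-estimation pass over the training data:
    HERest -C config -I phones.mlf -t 250.0 150.0 1000.0 \
           -S train.scp -H hmm0/macros -H hmm0/hmmdefs -M hmm1 monophones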
58
HTK Testing Tools Tools for generating text hypotheses given a set of unknown speech data. HTK provides features for full speech recognition; SeqRec only needs the tools that generate acoustic scores.
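A hypothetical forced-alignment invocation; run this way, the output MLF includes per-word acoustic log-likelihood scores of the kind SeqRec consumes (file names are placeholders):

    HVite -a -C config -H hmm1/macros -H hmm1/hmmdefs \
          -S test.scp -i scores.mlf -I words.mlf dict monophones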
59
TIMIT Corpus Experiments will use the Texas Instruments and Massachusetts Institute of Technology (TIMIT) corpus. It contains recordings of 630 speakers in 8 dialects of U.S. English. Each speaker is assigned 10 sentences to read that are carefully designed to contain a wide range of phonetic variability. Each utterance is recorded as a 16-bit waveform file sampled at 16 kHz. Two partitions of TIMIT: TRAIN – used to generate HMM models; TEST – unseen by the SeqRec system until the final evaluation. The grammar and semantics of the sentences are quite meaningless, but the phonetic content is rich and varied, creating a database of evenly distributed vocal sounds.
60
TIMIT Experiment Data Set
The 24 words with the highest count of occurrences in the database, with length varying from ~7 frames for “a” to ~39 frames for “greasy.” Highlighted words are another subset that will be used to show detailed experiment results.

Word | TRAIN Count | TEST Count | Phonemic Pronunciation | Phonemic Length | Frame Length
the | 1603 | 599 | dh ah | 2 | 8.26
to | 1018 | 352 | t uw | 2 | 10.69
in | 947 | 313 | ih n | 2 | 13.14
a | 867 | 301 | ah | 1 | 6.69
all | 545 | 223 | ao l | 2 | 20.67
that | 612 | 215 | dh ae t | 3 | 31.02
she | 572 | 208 | sh iy | 2 | 19.86
an | 571 | 207 | ae n | 2 | 9.83
your | 565 | 202 | y ao r | 3 | 12.31
me | 517 | 193 | m iy | 2 | 12.11
of | 455 | 185 | ah v | 2 | 11.99
had | 526 | 183 | hh ae d | 3 | 24.54
like | 518 | 179 | l ay k | 3 | 23.65
year | 473 | 177 | y ih r | 3 | 30.75
and | 492 | 175 | ah n d | 3 | 14.13
dark | — | 171 | d aa r k | 4 | 33.2
water | 479 | 170 | w ao t er | 4 | 28.35
ask | 464 | 169 | ae s k | 3 | 28.12
carry | 463 | — | k ae r iy | 4 | 36.51
suit | 462 | 168 | s uw t | 3 | 34.99
greasy | — | — | g r iy s iy | 5 | 39.05
wash | 469 | — | w aa sh | 3 | 35.07
oily | 470 | — | oy l iy | 3 | 33.38
rag | — | — | r ae g | 3 | 34.23
61
The HTK Recipe The versatility of the HTK Toolkit presents a steep learning curve. The HTK Recipe is used by SeqRec to provide a known-good starting point for creating a well-trained set of monophone HMM models based on TIMIT.
62
Isolated Word Recognition Result Format
The experiment word-set presented in Table 8-1 can now be evaluated against the set of trained HMM models. This is done by calculating the acoustic score of all TEST occurrences of the word of interest (INV) against the whole-word HMM model corresponding to the INV; recall from Section 2 that the whole-word model is a simple concatenation of phoneme HMMs. All other non-INV words (i.e., OOV) from TEST are also evaluated against the INV model. The score-count histograms are normalized. Red – normalized acoustic scores for the INV evaluated against the INV HMM. Blue – normalized acoustic scores for the OOV evaluated against the INV HMM.
63
Isolated Word Recognition Result Format, contd.
CDFs are plotted for each score distribution, with the OOV CDF reversed. The point where the two CDFs intersect is the operating threshold.
64
Isolated Word Recognition Result Format, contd.
FA Rate – false acceptances of OOV words as INV. FR Rate – false rejections of INV words as OOV. Total Error Rate = FA Rate + FR Rate.
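These rates reduce to simple counting once the INV and OOV score sets and a threshold are fixed. A small illustrative sketch (the score values are made up, and the sign convention, higher score = more INV-like, is an assumption):

    import numpy as np

    def error_rates(inv_scores, oov_scores, threshold):
        """FA/FR rates for a score-threshold classifier: tokens scoring
        at or above the threshold are accepted as INV."""
        fa = np.mean(oov_scores >= threshold)   # OOV wrongly accepted
        fr = np.mean(inv_scores < threshold)    # INV wrongly rejected
        return fa, fr, fa + fr                  # Total Error Rate

    # Hypothetical scores; the operating threshold sits where the two
    # (reversed-OOV) CDFs intersect, i.e. where FA rate equals FR rate.
    rng = np.random.default_rng(1)
    inv = rng.normal(-50, 5, 200)    # hypothetical INV acoustic scores
    oov = rng.normal(-70, 8, 2000)   # hypothetical OOV acoustic scores
    print(error_rates(inv, oov, threshold=-60.0))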
65
HMM Biasing Prior to scoring, the monophone HMM models constituting the INV word are re-estimated against only the INV training data. This allows SeqRec to simulate the performance improvement of context-dependent models. It was found experimentally that performing two re-estimations yielded the optimal increase in performance.
66
HMM Biasing and Increased Recognizer Performance
For these viewgraphs, the increased performance can be interpreted graphically as “pushing” the INV and OOV score distributions further apart, resulting in less confusion and misclassification.
67
Baseline Monophone HMM Results
The TIMIT single-word recognizer performance baseline was established using monophone HMMs with 1, 8, and 16 Gaussian mixture components. Baseline results are shown against just a subset of the TIMIT TEST set.
68
Validation of Results The same TIMIT data set was evaluated against third-party “WSJ” models from the author of the HTK Recipe procedure, and the average Total Error Rate was compared to that of the SeqRec models. Averages were computed over *all* TIMIT TEST words.
69
Baseline Results Observations
In general, a higher number of mixture components in the GMMs yields lower error rates. This is expected, due to the complex distributions of many of the MFCC features used to represent the speech data. The HMM models generated by SeqRec perform slightly better than the WSJ models; the WSJ models are re-estimated many times using data from a much broader set than just TIMIT. Overall, the baseline experiments have shown that 16-mixture TIMIT monophone HMMs yield the lowest average Total Error Rate, 20.07%; this is the model set carried forward in the forthcoming experiments. Notice that a WUW recognizer with this level of performance would fail one out of five times and would not be suitable for any practical purpose.
70
Incorporating Additional Scores
A key feature of the existing WUW system is the application of an additional scoring method. “Score 2” can be computed using the same HTK tools that were used to determine the acoustic score. Note that Score 2 on its own is useless for recognition.
71
Distribution of Multiple Scores
When combined, Score 1 and Score 2 each contribute unique information to the recognition task. Score 2 “shifts” the INV score distribution below the OOV results.
72
Introduction to SVM With two scores, the simple one-dimensional binary classifier can no longer be used. Support Vector Machines (SVMs) are a set of learning methods that can be used to build a complex classification model for data with multiple features.
73
Fundamentals of SVM Classifiers
Consider a task requiring the binary classification of m data points, each carrying a classification label of ±1. Each data point is represented by a d-dimensional collection of attributes (also known as a feature vector).
74
Discriminant Plane The vector w describes the orientation, and b the offset, of a discriminant plane that can be used to classify the data. There is an infinite number of planes that can be applied to a given set of points.
75
Maximal Margin The plane given by the solid line provides the best solution because it would be more robust to additional data that exhibit perturbations from the training set. This plane is said to provide the maximal margin between the two classes of data points.
76
Maximal Margin, contd. For linearly separable data, a method for determining the maximum margin between the two classes is to maximize the margin between two parallel supporting planes. The distance between these planes is maximized to determine the optimal classification plane.
77
Maximal Margin, contd. Maximizing the margin is equivalent to maximizing the distance between the two supporting planes. A plane supports a class if all of that class's data lies on one side of it; the distance between the two supporting planes is $2/\lVert w \rVert$. The problem is solved using the following Quadratic Programming formulation:

$\min_{w,\,b}\ \tfrac{1}{2}\lVert w \rVert^2 \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1,\ i = 1, \ldots, m$
78
Linearly Inseparable Data

For this type of data, a “slack” variable $\xi_i$ is introduced into each constraint and then added as a weighted penalty term:

$\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{m} \xi_i \quad \text{s.t.}\quad y_i (w \cdot x_i + b) \ge 1 - \xi_i,\ \xi_i \ge 0$

Practically, the C parameter represents a trade-off between classification error and maximal margin. Note that this permits classification errors, but the problem remains solvable (it is no longer a hard-margin problem).
79
Alternate Form of the QP
Writing the classification rule in its dual form reveals that the maximum-margin hyperplane is a function only of the support vectors, the training data that lie on the margin (the orange data points in the previous slides):

$\max_{\alpha}\ \sum_{i=1}^{m} \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, (x_i \cdot x_j) \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0,\ 0 \le \alpha_i \le C$
80
Non-Linear Classification
For many data distributions, a simple linear plane cannot be effectively applied to classify points. This data distribution would be best classified using an elliptical classification surface.
81
Non-Linear Classification, contd.
Consider 2-dimensional training data with attributes [r, s]. To construct a quadratic discriminant function, the 2-dimensional input can be mapped into a 5-dimensional feature space described by $[r,\ s,\ rs,\ r^2,\ s^2]$. A linear discriminant can then be computed in this new feature space and substituted into the original linear discriminant function, taking the mapping function into account.
82
Non-Linear Classification, contd.
The existing Quadratic Programming problem can be modified to use the mapping function $\Phi$. For practical use of SVMs, it is not feasible to calculate the mapping function explicitly. SVMs work around this issue by using kernel functions, $K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$, which allow the inner product to be evaluated without explicitly knowing the mapping function.
83
Non-Linear Classification, contd.
Final form of the Quadratic Programming problem:

$\max_{\alpha}\ \sum_i \alpha_i - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j y_i y_j\, K(x_i, x_j) \quad \text{s.t.}\quad \sum_i \alpha_i y_i = 0,\ 0 \le \alpha_i \le C$

The popular kernel functions used in SVMs:

Kernel | $K(u, v)$
Linear | $u \cdot v$
Polynomial | $(\gamma\, u \cdot v + r)^d$
Radial Basis Function (RBF) | $\exp(-\gamma \lVert u - v \rVert^2)$
Sigmoid | $\tanh(\gamma\, u \cdot v + r)$
84
Summary of SVM Procedure
Select the C parameter (recall this is the trade-off between classification error and margin maximization). Select the kernel function and any kernel-specific parameter values. Solve the Quadratic Problem to determine the set of support vectors and multipliers. Recover the threshold variable b using the set of support vectors. Apply the SVM to classify a new data point x using the final classification function.
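These steps map directly onto, for example, scikit-learn's SVC (a wrapper around LIBSVM); the data points, C, and γ below are illustrative, not values from the thesis:

    import numpy as np
    from sklearn.svm import SVC

    # Toy 2-D data with labels in {-1, +1}:
    X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
    y = np.array([-1, -1, 1, 1])

    # Steps 1-3: choose C and the kernel, then solve the QP via fit().
    clf = SVC(C=8.0, kernel='rbf', gamma=0.5).fit(X, y)
    print(clf.support_vectors_)                  # the points on the margin
    # Steps 4-5: the recovered b is folded into the decision function.
    print(clf.decision_function([[0.5, 0.5]]))   # signed margin distance
    print(clf.predict([[0.1, 0.0], [1.0, 0.9]])) # class labels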
85
Example of SVM – Polynomial Kernel
86
Applying SVM to SeqRec LIBSVM is a software library that provides tools allowing users to easily and quickly implement SVM-based classifiers. svm-scale – scales the features of the input data. svm-train – trains an SVM model using a set of labeled training data; it supports the popular kernel functions and specification of the C parameter. svm-predict – takes unlabeled data and a previously generated SVM model and outputs the classification label hypotheses determined by applying the decision function.
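A typical invocation of the three tools might look like this (file names and parameter values are placeholders, not SeqRec's actual scripts):

    svm-scale -l -1 -u 1 -s scale.params train.txt > train.scaled
    svm-scale -r scale.params test.txt > test.scaled      # reuse train scaling
    svm-train -t 2 -c 8 -g 0.5 train.scaled greasy.model  # -t 2 = RBF kernel
    svm-predict test.scaled greasy.model predictions.txt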
87
SVM Parameter Search The RBF kernel will be used for the experiments with TIMIT; it is recommended by the LIBSVM authors and by previous work done by Dr. Kepuska, is capable of non-linear classification, and has only one kernel parameter. The two parameters that must be selected when applying the RBF kernel are C and γ. A common method of parameter searching is cross-validation, and LIBSVM provides an implementation known as v-fold cross-validation: the training data set is first subdivided into v subsets, and each subset is then tested using a classifier trained on the other v−1 subsets. This is repeated for every subset, allowing each instance of the whole training set to be predicted once. The cross-validation accuracy is the percentage of data correctly classified by this procedure. We use v = 5 for our experiments.
88
SVM Parameter Search, contd.
Cross-validation has the property of avoiding the problem of overtraining. If parameters were chosen that yielded the best classification accuracy on the entire training data set, the SVM might be too specific and would falsely reject unseen data. A cross-validated SVM may show worse accuracy during the model-building stage but will in general perform better on unseen data. Cross-validation accuracy is computed across a grid of (C, γ) parameter ranges.
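A sketch of the grid search driven by svm-train's -v flag, which runs v-fold cross-validation and prints the accuracy instead of saving a model (the C and γ grids here are illustrative, not the thesis's exact ranges):

    for c in 0.5 2 8 32 128 512 2048 8192; do
      for g in 0.03125 0.125 0.5 2 8; do
        echo -n "C=$c gamma=$g -> "
        svm-train -v 5 -t 2 -c $c -g $g train.scaled | grep Accuracy
      done
    done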
89
Applying SVM to Multiple Scores
The grid search on the TIMIT “greasy” TRAIN score data yields C = 8, with a corresponding best γ, as the parameters. To evaluate the model on unseen data, the SVM model is then applied to the TEST Scores 1 & 2 for TIMIT “greasy.” LIBSVM can output decision values in addition to the binary class labels; a greater magnitude of the decision value means greater confidence that the point belongs to the chosen class. These values can then be treated as a single-dimensional input to the original binary classifier.
90
Two-Class SVM – TIMIT “greasy”
The Total Error Rate is reduced from 2.45% to 0.97%; the recognition error rate is 2.55 times better!
91
Incorporating the Word Duration Feature
As opposed to the TIMIT “greasy” example considered so far, the score distributions for some words are highly correlated and do not yield good performance using just Scores 1 & 2. One possible cause: the shorter the time duration of a word, the more apparent any errors in the hand-labeled durations become. “and” is generally the worst-performing word in the entire data set.
92
Incorporating the Word Duration Feature, contd.
SVMs are capable of handling data with many features, so it makes sense to treat the length of the scored word as a feature itself. If two phonetically similar words such as “a” and “and” produce very similar acoustic scores, duration could intuitively be used to increase the reliability of the decision.
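Mechanically, the duration simply becomes a third column of the SVM feature vectors. A tiny sketch with made-up numbers:

    import numpy as np

    # Hypothetical per-token values: two scores plus frame-length duration.
    score1 = np.array([-48.2, -52.7, -61.0])
    score2 = np.array([-50.1, -49.3, -58.8])
    duration = np.array([39, 37, 14])   # frames, e.g. ~39 for "greasy"

    # Duration is appended as an extra feature column:
    X = np.column_stack([score1, score2, duration])
    print(X.shape)   # (3, 3): three tokens, three features each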
93
SVM with Duration - TIMIT “and”
The classifier is able to lower the original monophone error rate from 61.83% to 32.95%, a relative improvement of 88%, i.e., 1.88 times. Notice that the SVM applied without the duration feature is essentially useless for this particular word. Although the original motivation for adding the duration feature was to improve classification results for otherwise indistinguishable words such as “and,” it turns out that using duration improves classification performance for all words in the TIMIT test set.
94
One-Class SVM The SVMs considered thus far operate by classifying data vectors into one of two classes, which requires a database of acoustic scores for the INV word as well as scores for all other words. One-Class SVM is a class of SVM models that depends on having only a single class of data available for classification.
95
One-Class SVM, contd. Problem Statement - Suppose that some data set has a probability distribution P in feature space. Find a “simple” subset S of the feature space such that the probability that a test point from P lies outside of S is bounded by some a priori value.
96
One-Class SVM, contd. The strategy is to map the data into kernel feature space (same as regular SVM), and then separate the data from the origin with maximum margin. The origin in feature space is the only original member of the “negative class.” Results in a modification to the Two-Class SVM Quadratic Programming problem: The classification function is the same as Two-Class SVM:
97
One-Class SVM – The ν Parameter

The modified QP introduces the ν parameter. As ν approaches 0, the upper bound in the second inequality becomes very large and has decreasing impact on the expression; this leads to a hard-margin problem, because the penalty for errors becomes infinite. As ν is increased, the misclassification penalty is relaxed and errors are allowed. Notice the effect of ν on outliers when the penalty for errors is low; a low ν will “suck in” all the outliers, regardless of how far from the actual distribution they seem to be.
98
One-Class SVM Parameter Search
The cross-validation grid-search strategy is applied to One-Class SVM parameter optimization, with these changes: (1) the ν parameter is searched instead of the cost parameter C; (2) the input SVM training data now includes only INV TRAIN data; (3) the One-Class SVM model is evaluated against INV and OOV TRAIN data, and this accuracy is recorded in order to evaluate the effect of ν on the overall error rate; (4) the parameters that yield the highest accuracy from (3) are selected.
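For illustration, scikit-learn's OneClassSVM (also LIBSVM-backed) trained on INV data only; all values here are made up:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Train on INV score vectors only (illustrative 2-D data):
    rng = np.random.default_rng(2)
    inv_train = rng.normal([-50, -50], [4, 4], size=(300, 2))

    ocsvm = OneClassSVM(kernel='rbf', nu=0.05, gamma=0.01).fit(inv_train)
    # +1 = accepted as INV, -1 = rejected as an outlier (OOV):
    print(ocsvm.predict(np.array([[-50.0, -49.0], [-90.0, -20.0]])))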
99
One-Class SVM – TIMIT “greasy”
This results in a Total Error Rate of 1.10%, compared with the 0.88% achieved by the Two-Class SVM classifier.
100
One-Class SVM Observations
The number of support vectors required for a competitive One-Class SVM model is much lower than the number required for Two-Class SVM (19 versus 54 for TIMIT “greasy”). The time required to train the One-Class SVM model is also much lower, because only the INV data has to be considered in the Quadratic Programming optimization that determines the maximum-margin classifier (0.001 s versus 2.330 s for TIMIT “greasy”). The overall performance is generally lower for One-Class SVM models: the absence of negative information entails a price, and one should not expect results as good as when this information is available.
101
Final SeqRec Experiment Configurations
The following techniques will be evaluated against the 25-word TIMIT test subset: Score 1 classification (code: “Score 1”); (Score 1 + Score 2) classification with Two-Class SVM (code: “CSVMND”); (Score 1 + Score 2 + Duration) classification with Two-Class SVM (code: “CSVM”); (Score 1 + Score 2 + Duration) classification with One-Class SVM (code: “OSVM”). All acoustic scores are generated using the 16-mixture monophone HMMs produced by SeqRec. The TIMIT test set is divided into two groups to increase graph readability.
102
Evaluation Metrics The Total Error Rate metric will be the primary criterion of performance for each method. The Relative Error Rate Reduction (RERR) and Error Rate Reduction (ERRR) are calculated to compare performance between two methods, where B is the baseline Total Error Rate and N is the new Total Error Rate:

$\mathrm{RERR} = \frac{B - N}{N} \times 100\%, \qquad \mathrm{ERRR} = \frac{B}{N}$
103
Manual Parameter Selections
Experimentation has revealed that the grid-search method does not always yield the most appropriate parameters for Two-Class SVM. The following words perform considerably better when using the manually selected parameters in the right columns, as opposed to the grid-search parameters in the left columns. Using these values for the “problem” words in the TIMIT test data set demonstrates the actual capabilities of the SeqRec classifier; future work may optimize the grid search based on these results.

Word | C (Grid) | γ (Grid) | C (Select) | γ (Select)
ask | 2048 | 8 | — | —
all | 8192 | 2 | 32 | —
water | 0.5 | — | — | —
year | — | — | — | —
in | — | — | — | —
that | 512 | — | — | —
104
SeqRec Results – TIMIT Test Set 1
105
SeqRec Results – TIMIT Test Set 1
Word | RERR CSVMND (%) | ERRR CSVMND | RERR CSVM (%) | ERRR CSVM | RERR OSVM (%) | ERRR OSVM
suit | 391 | 4.91 | 723 | 8.23 | 90 | 1.90
greasy | 146 | 2.46 | 172 | 2.72 | 117 | 2.17
year | 23 | 1.23 | 126 | 2.26 | 8 | 1.08
rag | 518 | 6.18 | 147 | 2.47 | 19 | 1.19
wash | 1432 | 15.32 | 4303 | 44.03 | 2249 | 23.49
carry | 407 | 5.07 | 1152 | 12.52 | — | —
water | 156 | 2.56 | 180 | 2.80 | 197 | 2.97
she | 124 | 2.24 | 344 | 4.44 | 245 | 3.45
all | 22 | 1.22 | 82 | 1.82 | 59 | 1.59
dark | 590 | 6.90 | 737 | 8.37 | 465 | 5.65
had | 205 | 3.05 | 253 | 3.53 | 575 | 6.75
ask | 158 | 2.58 | 324 | 4.24 | 169 | 2.69
oily | 89 | 1.89 | 228 | 3.28 | 111 | 2.11
me | 175 | 2.75 | 368 | 4.68 | 211 | 3.11
like | 389 | 4.89 | 538 | 6.38 | 522 | 6.22
106
SeqRec Results – TIMIT Test Set 2
107
SeqRec Results – TIMIT Test Set 2
Word | RERR CSVMND (%) | ERRR CSVMND | RERR CSVM (%) | ERRR CSVM | RERR OSVM (%) | ERRR OSVM
your | 251 | 3.51 | 360 | 4.60 | 137 | 2.37
that | 106 | 2.06 | 133 | 2.33 | 10 | 1.10
an | 64 | 1.64 | 301 | 4.01 | 79 | 1.79
in | -15 | 0.85 | 116 | 2.16 | 53 | 1.53
of | 65 | 1.65 | 198 | 2.98 | 55 | 1.55
to | 261 | 3.61 | 725 | 8.25 | 278 | 3.78
and | -36 | 0.64 | 88 | 1.88 | -8 | 0.92
the | 275 | 3.75 | 673 | 7.73 | 258 | 3.58
Average | 245 | 3.45 | 532 | 6.32 | — | 4.00
108
Concluding Remarks The SeqRec system successfully integrated off-the-shelf speech recognition and SVM frameworks to create a working single-word classification system that shows remarkable error-rate improvements on the well-known TIMIT data set. Two-Class SVM scoring with the Duration feature achieved an average RERR of 532%, leading to a single-word recognition system capable of an overall average Total Error Rate of 5.4%, compared to the baseline of 20.6%. The highest gain was for the TIMIT word “wash”: the baseline Total Error Rate was 2.51%, and the Two-Class SVM with Duration Total Error Rate was 0.06%, an RERR of 4303%. One-Class SVM is indeed a viable method for significantly reducing recognizer error, with an average RERR of 301%; it outperforms Two-Class SVM without the Duration feature.
109
Acknowledgements A very special thank you to Dr. Kepuska for his dedication to the field of Speech Recognition and allowing me to participate in a very exciting part of it! Thanks to FIT’s ECE Department for the support provided to this field of study.
110
Wrap-up (Time Permitting)
Show Individual TIMIT Word Results in MATLAB. Future Work Topics. Questions from the audience.