Discriminative Automatic Speech Recognition Using Conditional Random Fields
Jeremy Morris – Department of Computer Science and Engineering, OSU

Speech Attributes

This work focuses on integrating various speech attributes extracted from the speech waveform to perform recognition. A speech attribute is any feature extracted from the signal that could be useful for recognition. This work concentrates on two kinds of speech attributes: phonetic attributes and phone classes.

Phonetic attributes are defined via linguistic properties per the International Phonetic Association (IPA) phonetic chart. Consonants are defined by their sonority, voicing, manner, and place of articulation; vowels are defined by their sonority, voicing, height, frontness, roundness, and tenseness. Additional features cover silence.

Phone classes are the phone labels associated with each frame of speech. Each frame can belong to one of 61 possible phone classes in this work.

TABLE 1: PHONETIC ATTRIBUTES

Attribute   Possible output values
SONORITY    vowel, obstruent, sonorant, syllabic, silence
VOICE       voiced, unvoiced, n/a
MANNER      fricative, stop, closure, flap, nasal, approximant, nasalflap, n/a
PLACE       labial, dental, alveolar, palatal, velar, glottal, lateral, rhotic, n/a
HEIGHT      high, mid, low, lowhigh, midhigh, n/a
FRONT       front, back, central, backfront, n/a
ROUND       round, nonround, roundnonround, nonroundround, n/a
TENSE       tense, lax, n/a

Speech attributes are extracted by multi-layer perceptron (MLP) discriminative classifiers trained on 12th-order PLP cepstral and delta coefficients derived from the speech data. The speech is broken into frames of 25 ms, with overlapping frames every 10 ms, and the input to each classifier is a vector of the PLP and delta coefficients for a nine-frame window (a sketch of this input stacking appears below, after Figure 1). Each classifier outputs a series of posterior probabilities representing the probability of its attribute values given the data.

The classifiers were trained using a phonetically transcribed corpus. Phonetic attribute labels were derived from the attributes given by the IPA description of each transcribed phone (see Figure 1). All phones are assumed to take their canonical phonetic attribute values for training, and attribute boundaries occur at transcribed phone boundaries. Phone class labels were taken directly from the transcripts.

FIGURE 1: Spectrogram of the phrase “that experience”, shown with phonetic labels and corresponding neural network posterior distributions over each phonetic attribute for each frame of speech. Attributes are listed in descending order, in the same order that they appear in Table 1.
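To make the input representation concrete, the following is a minimal sketch of stacking nine-frame windows of PLP and delta coefficients into MLP input vectors. The function name, the edge-padding policy, and the example dimensions (13 PLP cepstra plus 13 deltas) are illustrative assumptions, not the authors' actual pipeline.

```python
import numpy as np

def stack_frames(feats, context=4):
    """Stack a nine-frame window (4 left + current + 4 right) of
    per-frame PLP + delta coefficients into one MLP input vector
    per 10 ms frame.

    feats: (num_frames, num_coeffs) array, e.g. 13 PLP cepstra
           plus 13 delta coefficients per frame.
    Returns: (num_frames, 9 * num_coeffs) array.
    """
    # Repeat the first/last frame at the edges so every frame
    # gets a full nine-frame window (an assumed boundary policy).
    padded = np.pad(feats, ((context, context), (0, 0)), mode="edge")
    return np.stack([
        padded[i : i + 2 * context + 1].ravel()
        for i in range(len(feats))
    ])

# Example: 300 frames (about 3 s at a 10 ms frame shift), 26 coefficients.
mlp_inputs = stack_frames(np.random.randn(300, 26))  # shape (300, 234)
```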
Conditional Random Fields

A conditional random field (CRF) is a discriminative model of a sequence that models the posterior probability of a label sequence given a set of observed data (Lafferty et al., 2001). A CRF can be described by the following equation:

P(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_i \left[ \sum_j \lambda_j \, s_j(y_i, \mathbf{x}, i) + \sum_k \mu_k \, t_k(y_{i-1}, y_i, \mathbf{x}, i) \right] \right)

where each s is a state feature function, each t is a transition feature function, λ and μ are their associated weights, and Z(x) is a normalizing constant computed over all label sequences.

State feature functions associate observations in the data at a particular time segment with the label at that time segment. A state feature function is written s(y, x, i), where y is the label, x is the observed data, and i is the time frame. It takes a non-zero value when the current label at frame i is y and some observation in x holds for frame i; otherwise its value is zero. Prior work using CRFs in speech recognition has used Gaussian attributes to build state feature functions (Gunawardana et al., 2005).

Transition feature functions associate observations in the data at a particular time segment with the transition from the previous label to the current label. A transition feature function is written t(y, y', x, i), where y is the label, y' is the previous label, x is the observed data, and i is the time frame. It takes a non-zero value when the current label at frame i is y, the previous label is y', and some observation in x holds for frame i.

For our model, a state feature function is a single output from our MLP speech attribute classifiers associated with a single label. For example,

s_j(y, x, i) = MLP_{stop}(x_i) \cdot \delta(y_i = /t/)

takes the value of the MLP classifier's output for the STOP attribute when the label at time i is /t/, and the value zero otherwise. In this work, transition feature functions do not use the outputs of the MLP neural networks: the value of a transition function is 1 if the label pair matches the pair defined for that function, and 0 if it does not. (Both kinds of feature function are sketched in code after Figure 2.)

Each feature function has an associated weight. This weight is high when a non-zero feature function value is strongly associated with a particular label, giving a high value to the computation of the probability for that label. The weights are trained by maximizing the log likelihood of the training set with respect to the model.

The strength of the CRF model is its ability to use arbitrary features as input. In traditional HMMs, dependencies among features can lead to computationally difficult models: features are usually required to be independent, or the parameter space must be made large. A CRF makes no independence assumption on the features, which can therefore have arbitrary dependencies.

FIGURE 2: Graphical model for a CRF phone labeling of the word “that” (/dh/ /ae/ /dx/). Vectors containing the neural network outputs of phonetic attribute posteriors for each time segment, as described in Figure 1, are used as observations for the state feature functions to determine the identity of the phone in that timeslice. Arcs between the phone labels indicate transition feature functions determined by the CRF.
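Here is a minimal sketch of the two kinds of feature function described above and the weighted, unnormalized score they contribute for a label sequence; exponentiating this score and dividing by Z(x) gives the CRF posterior. The dictionary layouts and names are illustrative assumptions, not the authors' implementation.

```python
def state_feature(mlp_posterior, label, target_label):
    """s_j(y, x, i): the MLP posterior for one attribute (e.g. STOP)
    at frame i when the label there equals the target (e.g. /t/),
    and zero otherwise."""
    return mlp_posterior if label == target_label else 0.0

def transition_feature(prev_label, label, pair):
    """t_k(y, y', x, i): 1 if the (previous, current) label pair
    matches the pair this function is defined for, else 0. No MLP
    outputs are used, matching the model described above."""
    return 1.0 if (prev_label, label) == pair else 0.0

def unnormalized_log_score(labels, mlp_posts, lam, mu):
    """Weighted sum of all feature function values over a sequence.

    labels:    one phone label per frame
    mlp_posts: mlp_posts[i][attr] = MLP posterior of attribute attr
               at frame i
    lam:       lam[(attr, target_label)] = state feature weight
    mu:        mu[(prev_label, label)]   = transition feature weight
    """
    score = 0.0
    for i, y in enumerate(labels):
        for (attr, target), weight in lam.items():
            score += weight * state_feature(mlp_posts[i][attr], y, target)
        if i > 0:
            for pair, weight in mu.items():
                score += weight * transition_feature(labels[i - 1], y, pair)
    return score

# Tiny example: two frames, one state feature, one transition feature.
labels = ["/t/", "/t/"]
mlp_posts = [{"stop": 0.9}, {"stop": 0.8}]
lam = {("stop", "/t/"): 1.5}
mu = {("/t/", "/t/"): 0.7}
print(unnormalized_log_score(labels, mlp_posts, lam, mu))  # 1.5*0.9 + 1.5*0.8 + 0.7
```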
Results

Phone-recognition accuracies of the CRF were compared to those of a Tandem system (Hermansky et al., 2000). A Tandem system uses the outputs of the neural networks as inputs to a hidden Markov model system. The Tandem system was trained with both triphone and monophone label contexts. Triphone labels give a single left and right context phone to each label, allowing a finer level of context to be used when labels are assigned: the context for the phone /ae/ in the string of phones /k ae t/ differs from that in the string /k ae p/, since the right context differs. Monophone labels are single phonetic labels. CRF system results are reported for monophone labels only. The Tandem and CRF systems were examined using only phone class attributes, using only phonetic attributes, and using both combined.

TABLE 2: PHONE RECOGNITION COMPARISONS

Model           Attributes           Accuracy   Parameters
Tandem (mono)   Phone Class          67.88%     ~500,000
Tandem (tri)    Phone Class          70.21%     ~1.7 million
CRF (mono)      Phone Class          70.40%     5,280
Tandem (mono)   Phonetic Attribute   68.55%     ~400,000
Tandem (tri)    Phonetic Attribute   69.27%     ~1.3 million
CRF (mono)      Phonetic Attribute   69.81%     4,464
Tandem (tri)    Both                 70.19%     ~2.8 million
CRF (mono)      Both                 71.49%     7,392

Discussion

The CRF system trained only on monophones achieves accuracy superior to a monophone HMM and comparable to a triphone HMM, with far fewer parameters. The Tandem HMM systems also required a decorrelation step on the input features prior to use, whereas the CRF system is able to make better use of correlated features to improve results. The Tandem system degrades when all features are combined, while the CRF system improves its result substantially with only a small increase in the overall number of parameters.

Using CRFs to compute frame-level local phone class posteriors for Tandem systems has shown promise (Fosler-Lussier & Morris, 2008). The CRF system integrates the speech attributes into a single set of phone class outputs, and the frame-level CRF outputs are fed into a Tandem system as inputs, yielding a "Crandem" system. Results show a significant improvement over either the Tandem system or the CRF system alone for the task of phone recognition. (A sketch of the frame-level posterior computation follows this section.)

Current directions include integrating CRF local posterior training into full word speech recognition, examining methods for using a CRF model itself for word recognition, and exploring useful ways of integrating speech attributes as transition features in the CRF.
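Since the Crandem system relies on frame-level phone class posteriors from the CRF, the sketch below shows how such posteriors can be computed for a linear-chain CRF using the standard forward-backward recursions in log space. The score matrices are placeholders for the trained model's weighted feature-function sums; this is an illustrative sketch, not the authors' code.

```python
import numpy as np
from scipy.special import logsumexp

def frame_posteriors(state_scores, trans_scores):
    """Frame-level label posteriors P(y_i = k | x) for a linear-chain
    CRF, computed with forward-backward in log space for stability.

    state_scores: (T, K) log-domain state scores per frame and label
                  (weighted sums of state feature functions)
    trans_scores: (K, K) log-domain transition scores
    Returns a (T, K) matrix whose rows each sum to 1.
    """
    T, K = state_scores.shape
    fwd = np.zeros((T, K))
    bwd = np.zeros((T, K))
    fwd[0] = state_scores[0]
    for t in range(1, T):
        # Log-sum over all previous labels j of fwd[t-1, j] + trans[j, k].
        fwd[t] = state_scores[t] + logsumexp(
            fwd[t - 1][:, None] + trans_scores, axis=0)
    for t in range(T - 2, -1, -1):
        # Log-sum over all next labels k of trans[j, k] + state + bwd.
        bwd[t] = logsumexp(
            trans_scores + (state_scores[t + 1] + bwd[t + 1])[None, :],
            axis=1)
    log_z = logsumexp(fwd[-1])          # log of the partition function Z(x)
    return np.exp(fwd + bwd - log_z)    # per-frame posterior matrix

# Example: 300 frames, 61 phone classes (the label set used in this work).
posteriors = frame_posteriors(np.random.randn(300, 61), np.random.randn(61, 61))
```

In a Crandem system, each row of this posterior matrix would then play the role that the MLP outputs play in a standard Tandem system, serving as the per-frame input features to the HMM.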


References

J. Lafferty, A. McCallum, and F. Pereira, “Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data”, in Proceedings of the 18th International Conference on Machine Learning, 2001.
H. Hermansky, D. Ellis, and S. Sharma, “Tandem connectionist feature stream extraction for conventional HMM systems”, in Proceedings of the IEEE Intl. Conf. on Acoustics, Speech, and Signal Processing, 2000.
A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, “Hidden Conditional Random Fields for Phone Classification”, in Proceedings of Interspeech, 2005.
J. Morris and E. Fosler-Lussier, “CRFs for Integrating Local Discriminative Classifiers”, IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 3, pp. 617-628, March 2008.
E. Fosler-Lussier and J. Morris, “Crandem Systems: Conditional Random Field Acoustic Models for Hidden Markov Models”, in Proceedings of ICASSP, 2008.
S. Sarawagi, “CRF package for Java”, http://crf.sourceforge.net
D. Johnson et al., “ICSI QuickNet software”, http://www.icsi.berkely.edu/Speech/qn.html
S. Young et al., “HTK HMM software”, http://htk.eng.cam.ac.uk/

This work was supported by NSF ITR grant IIS-0427413 and by NSF CAREER grant IIS-0643901, and in part by a Student-Faculty fellowship from the Dayton Area Graduate Studies Institute/AFRL. The opinions and conclusions expressed in this work are those of the authors and not of any funding agency.
