Speech Communication Lab, State University of New York at Binghamton. Dimensionality Reduction Methods for HMM Phonetic Recognition. Hongbing Hu, Stephen A. Zahorian.


Hongbing Hu, Stephen A. Zahorian. Department of Electrical and Computer Engineering, Binghamton University, Binghamton, NY 13902, USA

Introduction
 Accurate Automatic Speech Recognition (ASR) requires:
» Highly discriminative features, incorporating nonlinear frequency scales and time dependency, in low-dimensionality feature spaces
» Efficient recognition models (HMMs and neural networks)
 Neural network based dimensionality reduction
» Neural networks (NNs) are used to represent complex data while preserving the variability and discriminability of the original data
» Combined with an HMM recognizer to form a hybrid NN/HMM recognition model

NLDA Reduction Overview
 Nonlinear Discriminant Analysis (NLDA)
» A multilayer neural network performs a nonlinear feature transformation of the input speech features
» Phone models are built for the transformed features using HMMs, with each state modeled by a GMM (Gaussian Mixture Model)
» PCA performs a Karhunen-Loeve (KL) transform to reduce the correlation of the network outputs

NLDA1 Method
 Dimensionality reduced features
» Obtained at the output layer of the neural network (Original features → Network outputs → PCA → Dimensionality reduced features)
» Feature dimensionality is further reduced by PCA
 Node nonlinearity (activation function)
» In the feature transformation, a linear function is used for the output layer and a sigmoid nonlinearity for the other layers
» In NLDA1 training, all layers are nonlinear

NLDA2 Method
 Reduced features
» Use the outputs of the network's middle hidden layer (Original features → Middle-layer outputs → PCA → Dimensionality reduced features)
» The reduced dimensionality is determined by the number of middle-layer nodes, giving flexibility in the reduced feature dimensionality
» The linear PCA is used only for feature decorrelation
 Nonlinearity
» All layers are nonlinear in both the feature transformation and network training

Experimental Setup
 TIMIT database ('SI' and 'SX' sentences only)
» 48-phoneme set mapped down from the 62-phoneme set
» Training data: 3696 sentences (460 speakers)
» Testing data: 1344 sentences (168 speakers)
 DCTC/DCSC features
» A total of 78 features (13 DCTCs × 6 DCSCs) were computed
» 10 ms frames with 2 ms frame spacing, and 8 ms block spacing with 1 s block length
 HMM
» 3-state left-to-right Markov models with no skip
» 48 monophone HMMs were created using HTK (ver. 3.4)
» Language model: phone bigram statistics of the training data
 Neural networks in NLDA
» 3 hidden layers
» Input layer: 78 nodes, matching the feature dimensionality
» Output layer: 48 nodes for the phoneme targets, or 144 nodes for the state-level targets

Training Neural Networks
 Phone-level targets
» Each NN output corresponds to a specific phone
» Straightforward to implement using a phonetically labeled training database
» But why should an NN output be forced to the same value for the entire phone?
 State-level targets
» Each NN output corresponds to a single state of a phone HMM
» But how are state boundaries determined?
o Estimate using a percentage of the total phone length
o Use an initial training iteration, then Viterbi alignment
 For 3-state models, train using "don't cares"
o For the 1st portion, the target is "1" for state 1 and "don't care" for states 2 and 3
o For the 2nd portion, the target is "1" for state 2 and "don't care" for states 1 and 3
o For the 3rd portion, the target is "1" for state 3 and "don't care" for states 1 and 2

Experimental Results
 Control experiment
» Compare the original DCTC/DCSC features with the PCA- and LDA-reduced features (20 and 36 dimensions), using various numbers of mixtures in the HMMs
» The original 78-dimensional features yield the highest accuracy, 73.2%, using 64-mixture HMMs
 NLDA experiment
» Evaluate NLDA1 and NLDA2 with and without PCA, using 48-dimensional phoneme-level targets and features reduced to 36 dimensions
» The middle-layer outputs of the network result in more effective features in a reduced space
» Accuracies improved by about 2% with PCA
 State-level target experiment 1
» Compare state-level targets with and without "don't cares", using 144-dimensional state-level targets
» State boundaries obtained using the fixed state length method (3 states, ratio 1:4:1)
» Network training: 8×10⁶ weight updates
» The state-level targets with "don't cares" result in higher accuracies
» The NLDA2-reduced features achieved a substantial improvement over the original features
 State-level target experiment 2
» Use a fixed length ratio and Viterbi alignment for the state targets; state-level targets with "don't cares" used
» Targets obtained using a fixed length ratio (3 states, 1:4:1) and Viterbi alignment
» Network training: 4×10⁷ weight updates
» In the accuracy tables, "(R)" indicates a fixed length ratio and "(A)" Viterbi forced alignment

Literature Comparison
 Phonetic recognition accuracy on TIMIT:

  Feature     Recognizer   Acc. (%)   Study
  MFCC        HMM          68.5       Somervuo (2003)
  PLP         MLP-GMM      71.5       Ketabdar et al. (2008)
  LPC         HMM-MLP      74.6       Pinto et al. (2008)
  MFCC        Tandem NN    78.5       Schwarz et al. (2006)
  DCTC/DCSC   HMM          73.9       Zahorian et al. (2009)
  DCTC/DCSC   NN-HMM       74.9       This study

Conclusions
 Very high recognition accuracies were obtained using the outputs of the network middle layer, as in NLDA2
 The NLDA methods are able to produce a low-dimensional, effective representation of speech features
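The PCA decorrelation step described above (a KL transform of the network outputs) can be sketched in NumPy. This is a minimal illustration, not the authors' code: the 48-dimensional "network outputs" here are simulated random data, and the function name is hypothetical.

```python
import numpy as np

def pca_decorrelate(feats, out_dim=None):
    """KL transform: project zero-mean features onto the eigenvectors
    of their covariance matrix, optionally truncating dimensions."""
    centered = feats - feats.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # sort eigenvectors by descending eigenvalue (largest variance first)
    order = np.argsort(eigvals)[::-1]
    eigvecs = eigvecs[:, order]
    if out_dim is not None:
        eigvecs = eigvecs[:, :out_dim]
    return centered @ eigvecs

# usage: decorrelate simulated 48-dim network outputs, keep 36 dims
rng = np.random.default_rng(0)
outputs = rng.normal(size=(1000, 48))
reduced = pca_decorrelate(outputs, out_dim=36)
```

Projection onto the covariance eigenvectors leaves the retained dimensions mutually uncorrelated, which is the decorrelation role PCA plays in both NLDA1 and NLDA2.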
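The fixed-ratio state targets with "don't cares" can also be sketched in code. This is an assumed implementation of the 1:4:1 splitting described above, not the authors' code; the function name and the boolean "care" mask representation are illustrative.

```python
import numpy as np

def state_targets(num_frames, ratio=(1, 4, 1)):
    """Split a phone's frames into 3 HMM states by a fixed length
    ratio; return per-frame one-hot targets plus a mask where False
    marks a "don't care" output that is excluded from training."""
    total = sum(ratio)
    lengths = [round(num_frames * r / total) for r in ratio]
    bounds = np.cumsum(lengths)
    bounds[-1] = num_frames  # guard against rounding drift
    targets = np.zeros((num_frames, 3))
    care = np.zeros((num_frames, 3), dtype=bool)
    start = 0
    for state, end in enumerate(bounds):
        targets[start:end, state] = 1.0  # target "1" for the active state
        care[start:end, state] = True    # other outputs are "don't cares"
        start = end
    return targets, care

# usage: a 12-frame phone split 1:4:1 across 3 states
targets, care = state_targets(12)
```

During backpropagation, error terms would only be computed where the mask is True, so the network is never forced toward a value for the inactive states.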
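The NLDA2 bottleneck idea, taking reduced features from the middle hidden layer rather than the output layer, can be illustrated with a forward pass through a small sigmoid MLP. The weights here are random (an untrained network) and the layer sizes beyond the 78 inputs, 36-node middle layer, and 48 outputs are assumptions, so this shows only the data flow, not trained feature quality.

```python
import numpy as np

rng = np.random.default_rng(1)

def layer(n_in, n_out):
    """Random weight matrix and zero bias for one layer (illustrative)."""
    return rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# 78-dim input -> hidden -> 36-node bottleneck -> hidden -> 48 targets
W1, b1 = layer(78, 100)
W2, b2 = layer(100, 36)   # middle layer: sets the reduced dimensionality
W3, b3 = layer(36, 100)
W4, b4 = layer(100, 48)

def nlda2_features(x):
    """Return the middle-layer activations used as reduced features;
    the layers after the bottleneck are only used during training."""
    h1 = sigmoid(x @ W1 + b1)
    return sigmoid(h1 @ W2 + b2)

feats = nlda2_features(rng.normal(size=(5, 78)))
```

Because the reduced dimensionality equals the middle-layer width (36 here), changing that one layer size changes the feature dimensionality, which is the flexibility the poster attributes to NLDA2.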