The Acoustic/Lexical model: Exploring the phonetic units; Triphones/Senones in action. Ofer M. Shir Speech Recognition Seminar, 15/10/2003 Leiden Institute of Advanced Computer Science.


Theoretical Background – Unit Selection When selecting the basic unit of acoustic information, we want it to be accurate, trainable and generalizable. Words are good units for small-vocabulary SR, but not a good choice for large-vocabulary continuous SR: each word is treated individually with no data sharing, which implies a large amount of training data and storage; the recognition vocabulary may contain words that never appeared in the training data; and interword coarticulation effects are expensive to model.

Theoretical Background - Phonemes The alternative unit is the phoneme. Phonemes are more trainable (there are only about 50 phonemes in English, for example) and generalizable (vocabulary independent). However, a word is not a sequence of independent phonemes! Our articulators move continuously from one position to another, so the realization of a particular phoneme is affected by its phonetic neighbourhood, as well as by local stress effects. Different realizations of a phoneme are called allophones.

Theoretical Background - Triphones The Triphone model is a phonetic model which takes into consideration both the left and the right neighbouring phonemes; triphones are an example of allophones. This model captures the most important coarticulatory effects, which makes it a very powerful model. The cost: as context-dependent models generally increase the number of parameters, training becomes much harder. Notice that in English there are more than 100,000 triphones! So far, however, we have assumed that every triphone context is different. We are motivated to find instances of similar contexts and merge them.
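To see where a number like 100,000 comes from, here is a toy sketch of the combinatorics. The six-phone inventory and the "L-C+R" naming convention below are illustrative only; real English phone sets have roughly 40-50 phones.

```python
# Hypothetical illustration of why the triphone inventory explodes:
# a triphone is a base phone plus its left and right neighbours.
phones = ["AH", "N", "R", "S", "T", "W"]  # made-up toy subset

triphones = {
    f"{left}-{center}+{right}"        # e.g. "T-AA+R": AA with T on the left, R on the right
    for center in phones
    for left in phones
    for right in phones
}

print(len(triphones))   # 6**3 = 216 contexts for just 6 phones
print(50 ** 3)          # 125000 potential contexts for a ~50-phone set
```

The cubic growth is the point: each added base phone multiplies the number of possible contexts, which is why merging similar contexts becomes necessary.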

Theoretical Background - Senones Recall that each allophone model is an HMM, made of states, transitions and probability distributions; the bottom line is that some distributions can be tied. The basic idea is clustering, but rather than clustering the HMM models themselves, we shall cluster only the HMM states. Each cluster represents a set of similar Markov states and is called a Senone. Senones provide not only improved recognition accuracy but also a pronunciation-optimization capability.

Theoretical Background – Senonic Trees Reminder: a decision tree is a binary tree which classifies target objects by asking Yes/No questions in a hierarchical manner. The senonic decision tree classifies the Markov states of the triphones represented in the training data by asking linguistic questions. The leaves of the senonic trees are the possible senones.
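A minimal sketch of how such a tree maps a triphone state to a senone. The questions, phone classes and senone ids below are all made up for illustration; real systems derive the question set from phonetics and pick splits by likelihood gain during training.

```python
# Toy phone classes for the yes/no linguistic questions (illustrative only).
NASALS = {"M", "N", "NG"}
VOWELS = {"AA", "AH", "IY", "UW"}

def senone_for(left, right):
    """Walk a hand-built tree of Yes/No linguistic questions and return
    a senone (leaf) id for one Markov state of a triphone."""
    if left in NASALS:          # Q1: is the left context a nasal?
        return "senone_0"
    if right in VOWELS:         # Q2: is the right context a vowel?
        return "senone_1"
    return "senone_2"

# Distinct triphone contexts can land in the same leaf, i.e. share a senone:
print(senone_for("M", "AA"))   # senone_0
print(senone_for("N", "IY"))   # senone_0  (tied with the state above)
print(senone_for("T", "AA"))   # senone_1
```

This is exactly the merging motivated above: two different triphone contexts reaching the same leaf share one output distribution.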

Sphinx III, A Short Review – Front End The feature-extraction front end computes, for each current frame (using a 7-frame speech window around it), a 39-element feature vector: 12 cepstrum coefficients, 12 time-derivative cepstrum coefficients, 12 second-time-derivative cepstrum coefficients, and 3 power elements. These feature vectors are the input to the Gaussian-mixture fitting process (mean, variance and determinant per mixture); phonetic data (Senones!) are then fetched from the well-trained Gaussian mixtures in the form of a scoring table.
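A sketch of assembling the 39-element vector (12 + 12 + 12 + 3) from a 7-frame window. The simple central/second differences below are illustrative stand-ins; real front ends typically use a regression formula over the window.

```python
def feature_vector(cep_window, pow_window):
    """cep_window: 7 lists of 12 cepstra for frames t-3..t+3;
    pow_window: 7 log-power values for the same frames.
    Returns the 39-element vector described on the slide."""
    c = cep_window[3]                                   # 12 cepstra, current frame
    d = [(a - b) / 2                                    # 12 first time-derivatives
         for a, b in zip(cep_window[4], cep_window[2])]
    dd = [a - 2 * m + b                                 # 12 second time-derivatives
          for a, m, b in zip(cep_window[4], cep_window[3], cep_window[2])]
    p = [pow_window[3],                                 # 3 power elements
         (pow_window[4] - pow_window[2]) / 2,
         pow_window[4] - 2 * pow_window[3] + pow_window[2]]
    return c + d + dd + p

vec = feature_vector([[0.0] * 12 for _ in range(7)], [0.0] * 7)
print(len(vec))   # 39
```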

Sphinx III – the implementation Handling a single word: evaluate each HMM against the input, using the Viterbi Search. Each word is a chain of phone HMMs (e.g. ONE = W AH N, TWO = T UW, THREE = TH R IY); each phone model is a 5-state HMM whose states are mapped to senones.

The Viterbi Search - basics
Instantaneous score: how well a given HMM state matches the feature vector.
Path: a sequence of HMM states traversed during a given segment of feature vectors.
Path-score: the product of the instantaneous scores and state-transition probabilities along a given path.
The Viterbi search: an efficient lattice structure and algorithm for computing the best path-score for a given segment of feature vectors.
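The definitions above can be sketched directly in code. The transition matrix and instantaneous scores below are made-up toy numbers; `b[j][t]` stands for the instantaneous score of state j against the feature vector at time t, and `a[i][j]` is the state-transition probability.

```python
def viterbi(a, b, T):
    """Best path-score over T frames for an n-state HMM.
    a[i][j]: transition probability from state i to state j.
    b[j][t]: instantaneous score of state j at time t."""
    n = len(a)
    path_score = [[0.0] * T for _ in range(n)]
    path_score[0][0] = b[0][0]   # initial state seeded with path-score 1.0
    for t in range(1, T):
        for j in range(n):
            best = max(path_score[i][t - 1] * a[i][j] for i in range(n))
            path_score[j][t] = best * b[j][t]
    return max(path_score[j][T - 1] for j in range(n))

# Toy 2-state left-to-right example (no backward transitions):
a = [[0.6, 0.4],
     [0.0, 1.0]]
b = [[0.9, 0.2, 0.1],
     [0.1, 0.7, 0.8]]
print(viterbi(a, b, 3))   # ≈ 0.2016
```

Real recognizers work with log-scores to avoid underflow, turning the products into sums; the lattice structure is the same.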

The Viterbi Search - demo The lattice is laid out over time; the initial state is initialized with path-score = 1.0.

The Viterbi Search (demo-contd.) At each time step the lattice distinguishes the state with the best path-score, states with a path-score below the best, and states without a valid path-score. The path-score recurrence is:
P_j(t) = max_i [ P_i(t-1) · a_ij · b_j(t) ]
where P_j(t) is the total path-score ending up at state j at time t, a_ij is the state-transition probability from i to j, and b_j(t) is the score for state j given the input at time t.


Continuous Speech Recognition Take the word HMMs (ONE = W AH N, TWO = T UW, THREE = TH R IY), add transitions from word ends to word beginnings, and run the Viterbi Search.

Cross-Word Triphone Modeling Sphinx III uses “triphone” or “phoneme-in-context” HMMs; remember to inject the left context into the entry state. For example, the AH in ONE (W AH N) is a context-dependent AH HMM, and separate N HMM instances are kept for each possible right context. The inherited left context is propagated along with the path-scores, and dynamically modifies the state model.

Sphinx-III - Lexical Tree Structure Nodes are shared if their triphone Senone-Sequence-IDs (SSIDs) are identical. For example, the pronunciations
START = S T AA R TD
STARTING = S T AA R DX IX NG
STARTED = S T AA R DX IX DD
STARTUP = S T AA R T AX PD
START-UP = S T AA R T AX PD
share the common prefix S T AA R in the tree, which then branches into TD (start), DX IX NG / DX IX DD (starting / started), and T AX PD (startup / start-up).
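The prefix sharing can be sketched as a simple trie. The pronunciations are the ones from the slide; note that this toy version keys node sharing on the phone alone, whereas Sphinx-III shares nodes only when the full senone-sequence ID (SSID) matches.

```python
# Pronunciations copied from the slide.
LEXICON = {
    "START":    ["S", "T", "AA", "R", "TD"],
    "STARTING": ["S", "T", "AA", "R", "DX", "IX", "NG"],
    "STARTED":  ["S", "T", "AA", "R", "DX", "IX", "DD"],
    "STARTUP":  ["S", "T", "AA", "R", "T", "AX", "PD"],
    "START-UP": ["S", "T", "AA", "R", "T", "AX", "PD"],
}

def build_lextree(lexicon):
    """Build a phone-level prefix tree; words sharing a prefix share nodes."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.setdefault(ph, {})      # reuse the node if it exists
        node.setdefault("#words", []).append(word)  # mark word ends at the leaf
    return root

tree = build_lextree(LEXICON)
# All five words share the S -> T -> AA -> R prefix, then branch three ways:
print(list(tree))                          # ['S']
print(sorted(tree["S"]["T"]["AA"]["R"]))   # ['DX', 'T', 'TD']
```

Sharing the common prefix means the expensive per-frame HMM evaluations for S T AA R are done once for all five words instead of five times.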

Cross-Word Triphones (left context) Root nodes are replicated for the left context: the S models at the root of the lextree (leading to start, starting, started, startup, start-up) are instantiated once per possible left context, and nodes are shared if their SSIDs are identical.

Cross-Word Triphones (right context) At a leaf node the right context is not yet known, so the HMM states of the triphones for all possible right contexts are combined: picking states across these triphones, each composite state is the average of its component states, yielding a composite SSID model.
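A toy sketch of the composite-state idea. The "distributions" here are just short score vectors invented for illustration, not real Gaussian mixtures.

```python
def composite_state(states):
    """Average the component state distributions element-wise to form
    one composite state covering all possible right contexts."""
    n = len(states)
    return [sum(vals) / n for vals in zip(*states)]

# The same HMM state of a leaf triphone under three hypothetical
# right contexts, each with a toy 2-element distribution:
print(composite_state([[0.2, 0.8],
                       [0.4, 0.6],
                       [0.6, 0.4]]))   # ≈ [0.4, 0.6]
```

Once the following word (and hence the true right context) is known during the search, the composite model can be replaced by the exact context-dependent one.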

Sphinx III, the Acoustic Model – File List Summary
mdef.c – definition of the basic phone and triphone HMMs, and the mapping of each HMM state to a senone and to its transition matrix.
dict.c – the pronunciation dictionary structure.
hmm.c – HMM evaluation using the Viterbi Search, i.e. fetching the best senone score. Note that the HMM data structures, defined in hmm.h, are hardwired to 2 possible HMM topologies: 3- or 5-state left-to-right HMMs.
lextree.c – the lexical tree search.

Presentation Resources:
Spoken Language Processing: A Guide to Theory, Algorithm and System Development by Xuedong Huang, Alex Acero, Hsiao-Wuen Hon, Raj Reddy (hardcover, 980 pages; Prentice Hall PTR; 1st edition, April 25, 2001), Chapters 9, 13.
Hwang, M., Huang, X., Alleva, F.: “Predicting Unseen Triphones with Senones”.
Hwang et al.: “Shared Distribution Hidden Markov Models for Speech Recognition”.
Hwang et al.: “Subphonetic Modeling with Markov States – Senones”.
Sphinx-III documentation – a presentation by Mosur Ravishankar, found in the /doc/ folder of the Sphinx-III package.
“Sphinx-III bible” – a presentation by Edward Lin.

“I shall never believe that God plays dice with the world, but maybe machines should play dice with human capabilities…” John Doe