Introduction to Automatic Speech Recognition

Outline
- Define the problem
- What is speech?
- Feature selection
- Models: early methods, modern statistical models
- Current state of ASR
- Future work

The ASR Problem
There is no single ASR problem; the problem depends on many factors:
- Microphone: close-talking mic, throat mic, microphone array, audio-visual
- Source: band-limited, background noise, reverberation
- Speaker: speaker-dependent vs. speaker-independent
- Language: open/closed vocabulary, vocabulary size, read vs. spontaneous speech
- Output: transcription, speaker ID, keywords

Performance Evaluation
- Accuracy: the percentage of tokens correctly recognized
- Error rate: the complement of accuracy (100% minus accuracy)
- Token types: phones, words*, sentences, semantics?
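On word tokens, the standard metric is word error rate (WER): the edit distance between the reference and the hypothesis transcript, divided by the number of reference words. A minimal Python sketch (the function name is illustrative):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    r, h = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i  # deletions only
    for j in range(len(h) + 1):
        d[0][j] = j  # insertions only
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitution = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(substitution, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is one reason "inverse of accuracy" is an informal description at best.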

What is Speech?
- An analog signal produced by humans
- The speech signal can be decomposed into a source and a filter (the source-filter model)
- The source is the vibration of the vocal folds in voiced speech
- The filter is the vocal tract and articulators

Speech Production

Speech Visualization

Feature Selection
- As in any data-driven task, the data must be represented in some format
- Cepstral features have been found to perform well
- The cepstrum is, roughly, the spectrum of the log spectrum of the signal
- Mel-frequency cepstral coefficients (MFCCs) are the most common variety
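As a rough illustration of the MFCC pipeline (framing, windowing, FFT, mel filterbank, log, DCT), here is a simplified NumPy sketch. All parameter values are common defaults rather than requirements, and a real system would use a tested implementation such as HTK's rather than this toy:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, sr=16000, n_fft=512, n_mels=26, n_ceps=13,
         frame_len=400, hop=160):
    # 1. Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2. Magnitude spectrum of each frame.
    spectrum = np.abs(np.fft.rfft(frames, n_fft))
    # 3. Build a triangular, mel-spaced filterbank.
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        left, center, right = bins[m - 1], bins[m], bins[m + 1]
        for k in range(left, center):
            fbank[m - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):
            fbank[m - 1, k] = (right - k) / max(right - center, 1)
    # 4. Log filterbank energies, then a DCT to decorrelate them.
    energies = np.log(spectrum @ fbank.T + 1e-10)
    n = np.arange(n_mels)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps),
                                  (2 * n + 1) / (2.0 * n_mels)))
    return energies @ dct.T

# One second of a 440 Hz tone as stand-in audio.
audio = np.sin(2 * np.pi * 440.0 * np.arange(16000) / 16000.0)
features = mfcc(audio)  # one 13-coefficient vector per 10 ms frame
```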

Where do we stand?
- Defined the multiple problems associated with ASR
- Described how speech is produced
- Illustrated how speech can be represented in an ASR system
- Now that we have the data, how do we recognize the speech?

Radio Rex
- The first known attempt at speech recognition: a toy from 1922
- Worked by detecting the signal strength at around 500 Hz

Actual speech recognition systems
- Originally thought to be a relatively simple task requiring a few years of concerted effort
- In 1969, Pierce's letter "Whither speech recognition?" was published, criticizing the field's optimism
- A DARPA speech understanding project ran from 1971 to 1976, in response to the statements in the Pierce article
- We can examine a few general systems

Template-Based ASR
- Originally only worked for isolated words
- Performs best when training and testing conditions match
- For each word we want to recognize, we store a template or example based on actual data
- Each test utterance is checked against the templates to find the best match
- Uses the Dynamic Time Warping (DTW) algorithm

Dynamic Time Warping
- Create a distance (dissimilarity) matrix for the two utterances
- Use dynamic programming to find the lowest-cost path through the matrix
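The two steps above can be sketched directly. This is a minimal illustration without the path constraints and slope weights that practical DTW recognizers add:

```python
import numpy as np

def dtw(a, b):
    """Minimum-cost alignment between two feature sequences."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local distance between frame i of a and frame j of b.
            d = np.linalg.norm(np.atleast_1d(a[i - 1]) - np.atleast_1d(b[j - 1]))
            # Allowed steps: diagonal match, insertion, deletion.
            cost[i, j] = d + min(cost[i - 1, j - 1],
                                 cost[i - 1, j],
                                 cost[i, j - 1])
    return cost[n, m]
```

In a template-based recognizer, the test utterance is scored against every stored template with `dtw`, and the template with the lowest alignment cost wins.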

Hearsay-II
- One of the systems developed during the DARPA program
- A blackboard-based system utilizing symbolic problem solvers
- Each problem solver was called a knowledge source (KS)
- A complex scheduler was used to decide when each KS should be invoked

DARPA Results
- The Hearsay-II system performed much better than the two other similar competing systems
- However, only one system met the performance goals of the project: Harpy
- The Harpy system was also built at CMU
- In many ways Harpy was a predecessor of the modern statistical systems

Modern Statistical ASR

Acoustic Model
- For each frame of data, we need some way of describing the likelihood of it belonging to any of our classes
- Two methods are commonly used:
  - A multilayer perceptron (MLP) gives the likelihood of a class given the data
  - A Gaussian mixture model (GMM) gives the likelihood of the data given a class
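The GMM score for one frame can be sketched as follows; this is toy code for a diagonal-covariance mixture, not an optimized implementation, and the example parameters in the test are made up:

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """log p(x | class) under a diagonal-covariance Gaussian mixture model."""
    x = np.asarray(x, dtype=float)
    log_components = []
    for w, mu, var in zip(weights, means, variances):
        mu = np.asarray(mu, dtype=float)
        var = np.asarray(var, dtype=float)
        # Log density of one diagonal Gaussian component, evaluated at x.
        log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * var))
        log_exp = -0.5 * np.sum((x - mu) ** 2 / var)
        log_components.append(np.log(w) + log_norm + log_exp)
    # Log-sum-exp over mixture components for numerical stability.
    m = max(log_components)
    return m + np.log(sum(np.exp(c - m) for c in log_components))
```

Working in log space matters in practice: per-frame likelihoods are tiny, and a full utterance multiplies hundreds of them together.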

Gaussian Distribution

Pronunciation Model
- While the pronunciation model can be very complex, it is typically just a dictionary
- The dictionary contains the valid pronunciations for each word
- Examples (ARPAbet):
  cat: k ae t
  dog: d ao g
  fox: f aa k s

Language Model
- We need some way of representing the likelihood of any given word sequence
- Many methods exist, but n-grams are the most common
- N-gram models are trained by simply counting the occurrences of word sequences in a training set

N-grams
- A unigram is the probability of a word in isolation
- A bigram is the probability of a word given the previous word
- Higher-order n-grams continue in a similar fashion
- A backoff probability is used for any unseen sequences
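Training by counting can be illustrated in a few lines. Note the hedge in the code: real systems use a proper backoff or smoothing scheme (e.g. Katz or Kneser-Ney); the add-alpha smoothing below is only a simple stand-in for handling unseen bigrams:

```python
from collections import Counter

def train_bigram(corpus, alpha=0.01):
    """Train a bigram model by counting; add-alpha smoothing stands in
    for a real backoff scheme (illustration only)."""
    unigrams, bigrams, vocab = Counter(), Counter(), set()
    for sentence in corpus:
        words = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(words)
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))

    def prob(word, prev):
        # P(word | prev), smoothed over the vocabulary size.
        return (bigrams[(prev, word)] + alpha) / (unigrams[prev] + alpha * len(vocab))

    return prob

bigram_prob = train_bigram(["the cat sat", "the dog sat"])
```

An observed bigram such as "the cat" receives a much higher probability than an unseen one such as "cat dog", but the unseen one never scores exactly zero.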

How do we put it together?
- We now have models to represent the three parts of our equation
- We need a framework to join these models together
- The standard framework is the Hidden Markov Model (HMM)

Markov Model
- A state model built on the Markov property
- The Markov property: the future depends only on the present state, not on the earlier history
- Models the likelihood of transitions between states
- Given the model, we can determine the likelihood of any sequence of states
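For example, with an assumed two-state weather chain (the probabilities below are made up for illustration), the likelihood of a state sequence is just the initial probability times a product of transition probabilities:

```python
# Toy two-state weather chain; the probabilities are assumed for illustration.
initial = {"sunny": 0.5, "rainy": 0.5}
transition = {"sunny": {"sunny": 0.8, "rainy": 0.2},
              "rainy": {"sunny": 0.4, "rainy": 0.6}}

def sequence_probability(states):
    """P(s1, ..., sn) = P(s1) * product over t of P(s_t | s_{t-1})."""
    p = initial[states[0]]
    for prev, cur in zip(states, states[1:]):
        p *= transition[prev][cur]
    return p
```

Here P(sunny, sunny, rainy) = 0.5 * 0.8 * 0.2 = 0.08: only the immediately preceding state ever matters, which is exactly the Markov property.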

Hidden Markov Model
- Similar to a Markov model, except the states are hidden
- Observations are tied to the individual states
- We no longer know the exact state sequence given the data
- Allows modeling of an underlying, unobservable process
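Because the states are hidden, computing the likelihood of an observation sequence means summing over every possible state path, which the forward algorithm does efficiently. A toy sketch (all model numbers below are invented for illustration):

```python
# Toy two-state HMM; every probability here is assumed for illustration.
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}

def forward(obs, states, start_p, trans_p, emit_p):
    """Likelihood of an observation sequence, summed over all hidden paths."""
    # alpha[s] = P(observations so far, current state = s)
    alpha = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for o in obs[1:]:
        alpha = {s: emit_p[s][o] * sum(alpha[p] * trans_p[p][s] for p in states)
                 for s in states}
    return sum(alpha.values())
```

As a sanity check, the likelihoods of all possible length-one observation sequences ("x" and "y") sum to exactly 1.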

HMMs for ASR
- First we build an HMM for each phone
- Next we combine the phone models, based on the pronunciation model, to create word-level models
- Finally, the word-level models are combined based on the language model
- The result is a giant network with potentially thousands or even millions of states
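The phone-to-word composition step can be illustrated with a toy expansion. The 3-states-per-phone layout and the "phone_index" naming scheme below are assumptions for illustration, not a standard:

```python
# Hypothetical lexicon; each phone is modeled with an assumed 3-state
# left-to-right HMM whose states are named "<phone>_<index>".
lexicon = {"cat": ["k", "ae", "t"], "dog": ["d", "ao", "g"]}

def word_hmm_states(word, lexicon, states_per_phone=3):
    """Expand a word into its phone-level HMM state sequence."""
    return [f"{phone}_{i}" for phone in lexicon[word]
            for i in range(states_per_phone)]
```

Chaining these word-level state sequences together according to the language model is what produces the very large decoding network described above.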

Decoding
- Decoding proceeds as in the earlier example: find the best path through the network
- For each time frame we need to maintain two pieces of information:
  the likelihood of the best path into each state, and
  a backpointer to the previous state for every state
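This best-path-plus-backpointer search is the Viterbi algorithm. A compact sketch over a toy two-state HMM (all model numbers are invented for illustration; here the backpointers are kept implicitly by carrying the best partial path with each state):

```python
# Toy two-state HMM; every probability here is assumed for illustration.
states = ["A", "B"]
start_p = {"A": 0.6, "B": 0.4}
trans_p = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
emit_p = {"A": {"x": 0.9, "y": 0.1}, "B": {"x": 0.2, "y": 0.8}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Most likely hidden state sequence for an observation sequence."""
    # best[s] = (probability of the best path ending in s, that path)
    best = {s: (start_p[s] * emit_p[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        new_best = {}
        for s in states:
            # Pick the best predecessor state (the "backpointer" step).
            p, path = max((best[prev][0] * trans_p[prev][s], best[prev][1])
                          for prev in states)
            new_best[s] = (p * emit_p[s][o], path + [s])
        best = new_best
    prob, path = max(best.values())
    return path, prob
```

Real decoders work in log space and prune unlikely states, but the per-frame bookkeeping is exactly the two items listed above.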

State of the Art
What works well:
- Constrained-vocabulary systems
- Systems adapted to a given speaker
- Systems in anechoic environments without background noise
- Systems expecting read speech
What doesn't work:
- Large unconstrained vocabularies
- Noisy environments
- Conversational speech

Future Work
- Better representations of audio based on human auditory processing
- Better representations of acoustic elements based on articulatory phonology
- Segmental models that do not rely on the simple frame-based approach

Resources
- Hidden Markov Model Toolkit (HTK): http://htk.eng.cam.ac.uk/
- CHiME (a freely available dataset): http://spandh.dcs.shef.ac.uk/projects/chime/PCC/datasets.html
- Machine learning lectures:
  http://www.stanford.edu/class/cs229/
  http://www.youtube.com/watch?v=UzxYlbK2c7E