LML Speech Recognition 2008
Speech Recognition Introduction I
E.M. Bakker

Speech Recognition
Some Applications
An Overview
General Architecture
Speech Production
Speech Perception

Speech Recognition
Goal: automatically extract the string of words spoken from the speech signal.
Speech Signal => [Speech Recognition] => Words ("How are you?")

How is SPEECH produced? => characteristics of the acoustic signal
How is SPEECH perceived? => important features
What LANGUAGE is spoken? => language model
What is in the BOX?
Input Speech => Acoustic Front-end => Search (using Acoustic Models P(A|W) and Language Model P(W)) => Recognized Utterance

Important Components of a General SR Architecture
Speech Signals
Signal Processing Functions
Parameterization
Acoustic Modeling (Learning Phase)
Language Modeling (Learning Phase)
Search Algorithms and Data Structures
Evaluation

Recognition Architectures: A Communication-Theoretic Approach
Message Source => Linguistic Channel => Articulatory Channel => Acoustic Channel
Observable: Message => Words => Sounds => Features
Speech Recognition Problem: find P(W|A), where A is the acoustic signal and W the words spoken.
Bayesian formulation for speech recognition: P(W|A) = P(A|W) P(W) / P(A)
Objective: minimize the word error rate.
Approach: maximize P(W|A) during training.
Components:
P(A|W): acoustic model (hidden Markov models, mixtures)
P(W): language model (statistical, finite-state networks, etc.)
The language model typically predicts a small set of next words based on knowledge of a finite number of previous words (N-grams).
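Since P(A) does not depend on W, the recognizer can decode by maximizing P(A|W)·P(W) alone. A minimal sketch of this decision rule over a toy hypothesis set (the candidate strings and log-probabilities are illustrative, not from the slides):

```python
# Toy hypothesis set: candidate word strings with illustrative
# acoustic-model and language-model log-probabilities.
hypotheses = {
    "how are you": {"log_p_a_given_w": -12.0, "log_p_w": -3.0},
    "how or you":  {"log_p_a_given_w": -11.5, "log_p_w": -7.0},
    "cow are you": {"log_p_a_given_w": -13.0, "log_p_w": -9.0},
}

def decode(hyps):
    # Bayes decision rule: argmax_W P(A|W) P(W); P(A) is constant over W,
    # so it can be dropped. Work in log space to avoid underflow.
    return max(hyps, key=lambda w: hyps[w]["log_p_a_given_w"] + hyps[w]["log_p_w"])

print(decode(hypotheses))  # how are you
```

Note that the acoustically best hypothesis ("how or you", -11.5) loses to "how are you" once the language-model score is added, which is exactly the trade-off the formulation captures.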

Recognition Architectures
Input Speech => Acoustic Front-end: the signal is converted to a sequence of feature vectors based on spectral and temporal measurements.
Acoustic Models P(A|W): acoustic models represent sub-word units, such as phonemes, as finite-state machines in which states model spectral structure and transitions model temporal structure.
Search: search is crucial to the system, since many combinations of words must be investigated to find the most probable word sequence. The language model predicts the next set of words and controls which models are hypothesized.
Language Model P(W)
=> Recognized Utterance

ASR Architecture
Common Base Classes
Configuration and Specification
Speech Database, I/O
Feature Extraction
Recognition: Searching Strategies
Evaluators
Language Models
HMM Initialisation and Training

Signal Processing Functionality
Acoustic Transducers
Sampling and Resampling
Temporal Analysis
Frequency Domain Analysis
Cepstral Analysis
Linear Prediction and LP-Based Representations
Spectral Normalization

Acoustic Modeling: Feature Extraction
Input Speech => Fourier Transform => Cepstral Analysis => Perceptual Weighting => Time Derivatives
Outputs: Energy + Mel-Spaced Cepstrum; Delta Energy + Delta Cepstrum; Delta-Delta Energy + Delta-Delta Cepstrum
Incorporate knowledge of the nature of speech sounds in the measurement of the features.
Utilize rudimentary models of human perception.
Measure features 100 times per second.
Use a 25 msec window for frequency-domain analysis.
Include absolute energy and 12 spectral measurements.
Time derivatives to model spectral change.
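The framing numbers above (25 ms window, 100 frames per second, energy plus 12 coefficients) can be sketched as a toy cepstral front-end. This is a simplified sketch, not the slides' exact pipeline: pre-emphasis, the mel filterbank, perceptual weighting, and the delta features are omitted.

```python
import numpy as np

def frontend(signal, sr=16000, win_ms=25, hop_ms=10, n_ceps=12):
    """Toy cepstral front-end: 25 ms windows every 10 ms (100 frames/sec),
    log energy + 12 cepstral coefficients per frame. A real MFCC front-end
    would add pre-emphasis, a mel filterbank, and delta features."""
    win = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples -> 100 frames/sec
    feats = []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * np.hamming(win)
        energy = np.log(np.sum(frame ** 2) + 1e-10)       # absolute energy
        spectrum = np.abs(np.fft.rfft(frame)) + 1e-10     # magnitude spectrum
        cepstrum = np.fft.irfft(np.log(spectrum))         # real cepstrum
        feats.append(np.concatenate(([energy], cepstrum[1:n_ceps + 1])))
    return np.array(feats)  # shape: (n_frames, 13)

# One second of a 440 Hz tone yields about 100 feature vectors of dimension 13.
t = np.arange(16000) / 16000.0
feats = frontend(np.sin(2 * np.pi * 440 * t))
print(feats.shape)  # (98, 13)
```

The last few frames of the second are lost because a full 25 ms window no longer fits, which is why 98 rather than 100 vectors come out.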

Acoustic Modeling
Dynamic Programming
Markov Models
Parameter Estimation
HMM Training
Continuous Mixtures
Decision Trees
Limitations and Practical Issues of HMMs

Acoustic Modeling: Hidden Markov Models
Acoustic models encode the temporal evolution of the features (spectrum).
Gaussian mixture distributions are used to account for variations in speaker, accent, and pronunciation.
Phonetic model topologies are simple left-to-right structures.
Skip states (time-warping) and multiple paths (alternate pronunciations) are also common features of models.
Sharing model parameters is a common strategy to reduce complexity.
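A left-to-right HMM of the kind described above can be scored with a Viterbi-style dynamic-programming pass. The three-state topology, single-Gaussian emissions, and all numeric values below are illustrative, not taken from the slides.

```python
import math

# Left-to-right 3-state HMM with 1-D Gaussian emissions (toy values).
# Transitions allow self-loops and forward moves only, as in phonetic models.
log_trans = {  # log P(next | current); missing pairs are disallowed
    (0, 0): math.log(0.6), (0, 1): math.log(0.4),
    (1, 1): math.log(0.6), (1, 2): math.log(0.4),
    (2, 2): math.log(1.0),
}
means, variances = [0.0, 2.0, 4.0], [1.0, 1.0, 1.0]

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi_score(obs, n_states=3):
    """Log-probability of the best state path that starts in state 0,
    emits obs, and ends in the final state."""
    delta = [-math.inf] * n_states
    delta[0] = log_gauss(obs[0], means[0], variances[0])
    for x in obs[1:]:
        delta = [
            max(delta[i] + log_trans.get((i, j), -math.inf) for i in range(n_states))
            + log_gauss(x, means[j], variances[j])
            for j in range(n_states)
        ]
    return delta[-1]

# A sequence matching the state means scores far better than its reverse.
print(viterbi_score([0.0, 2.0, 4.0]) > viterbi_score([4.0, 2.0, 0.0]))  # True
```

Replacing the `max` with a log-sum over predecessors turns this into the forward algorithm used during training.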

Acoustic Modeling: Parameter Estimation
Closed-loop, data-driven modeling supervised by a word-level transcription.
The expectation-maximization (EM) algorithm is used to improve the parameter estimates.
Computationally efficient training algorithms (Forward-Backward) have been crucial.
Batch-mode parameter updates are typically preferred.
Decision trees are used to optimize parameter sharing, system complexity, and the use of additional linguistic knowledge.
Training schedule: Initialization => Single Gaussian Estimation => 2-Way Split => Mixture Distribution Reestimation => 4-Way Split => Reestimation
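The split-and-reestimate schedule can be sketched as repeated perturbation of Gaussian means; the perturbation of ±0.2 standard deviations is an illustrative convention, not a value given in the slides, and the EM reestimation that follows each split is left out.

```python
def split_mixture(components):
    """2-way split: replace each (weight, mean, var) Gaussian with two
    copies whose means are perturbed by +/- 0.2 standard deviations and
    whose weights are halved. EM reestimation would follow each split."""
    out = []
    for weight, mean, var in components:
        offset = 0.2 * var ** 0.5
        out.append((weight / 2, mean - offset, var))
        out.append((weight / 2, mean + offset, var))
    return out

mixture = [(1.0, 0.0, 1.0)]          # initialization: a single Gaussian
mixture = split_mixture(mixture)     # 2-way split -> 2 components
mixture = split_mixture(mixture)     # 4-way stage -> 4 components
print(len(mixture))  # 4
```

Splitting preserves the total mixture weight, so the model stays a valid distribution between reestimation passes.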

Language Modeling
Formal Language Theory
Context-Free Grammars
N-Gram Models and Complexity
Smoothing

Language Modeling

Language Modeling: N-Grams
Unigrams (SWB):
Most common: "I", "and", "the", "you", "a"
Rank 100: "she", "an", "going"
Least common: "Abraham", "Alastair", "Acura"
Bigrams (SWB):
Most common: "you know", "yeah SENT!", "!SENT um-hum", "I think"
Rank 100: "do it", "that we", "don't think"
Least common: "raw fish", "moisture content", "Reagan Bush"
Trigrams (SWB):
Most common: "!SENT um-hum SENT!", "a lot of", "I don't know"
Rank 100: "it was a", "you know that"
Least common: "you have parents", "you seen Brooklyn"
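A maximum-likelihood bigram model over a toy corpus can be sketched as follows; the three-sentence corpus is illustrative, whereas the counts above come from Switchboard (SWB), and a real model would also be smoothed.

```python
from collections import Counter

corpus = [
    "you know I think".split(),
    "you know that".split(),
    "I think you know".split(),
]

bigrams = Counter()
unigrams = Counter()
for sentence in corpus:
    tokens = ["<s>"] + sentence  # sentence-start marker
    for prev, word in zip(tokens, tokens[1:]):
        bigrams[(prev, word)] += 1
        unigrams[prev] += 1

def p_bigram(word, prev):
    """Maximum-likelihood estimate P(word | prev); unseen pairs get
    probability 0, which is exactly why smoothing is needed in practice."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]

print(p_bigram("know", "you"))  # 1.0 -- every "you" here is followed by "know"
```

During decoding such conditional probabilities are what lets the language model propose a small set of likely next words given the recent history.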

LM: Integration of Natural Language
Natural language constraints can be easily incorporated.
Lack of punctuation and the size of the search space pose problems.
Speech recognition typically produces a word-level, time-aligned annotation.
Time alignments for other levels of information are also available.

Search Algorithms and Data Structures
Basic Search Algorithms
Time-Synchronous Search
Stack Decoding
Lexical Trees
Efficient Trees

Dynamic Programming-Based Search
Search is time-synchronous and left-to-right.
Arbitrary amounts of silence must be permitted between each word.
Words are hypothesized many times with different start/stop times, which significantly increases search complexity.
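Time-synchronous search advances all hypotheses in lockstep, one frame at a time, and prunes those that fall too far behind the best (beam pruning). A minimal sketch with word-level steps; the beam width and toy log-scores are illustrative, and a real decoder works at the HMM-state level.

```python
def beam_search_step(hypotheses, extensions, beam=5.0):
    """One time-synchronous step: extend every active hypothesis by every
    candidate label with its log-score, then prune hypotheses scoring more
    than `beam` below the current best."""
    new_hyps = {}
    for path, score in hypotheses.items():
        for label, log_score in extensions.items():
            candidate = path + (label,)
            total = score + log_score
            if candidate not in new_hyps or total > new_hyps[candidate]:
                new_hyps[candidate] = total
    best = max(new_hyps.values())
    return {p: s for p, s in new_hyps.items() if s >= best - beam}

hyps = {(): 0.0}  # the empty hypothesis, log-probability 0
for step_scores in [{"how": -1.0, "cow": -4.0}, {"are": -0.5, "or": -3.5}]:
    hyps = beam_search_step(hyps, step_scores, beam=4.0)
print(max(hyps, key=hyps.get))  # ('how', 'are')
```

After the second step the hypothesis ('cow', 'or') falls outside the beam and is dropped, illustrating how pruning keeps the exponentially growing hypothesis set tractable.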
