Listen, Attend and Spell – a brief introduction


Listen, Attend and Spell – a brief introduction
Dr Ning Ma, Speech and Hearing Group, University of Sheffield

Classical speech recognition architecture
Pipeline: Speech → Front-end → Acoustic Models → Pronunciation Models → Language Models → W = "there is a cat"
- Front-end: classical signal processing, producing features
- Acoustic Models: Gaussian mixture models
- Pronunciation Models: pronunciation tables, e.g. there: /ðɛː/, is: /ɪz/, a: /ə/, cat: /kat/
- Language Models: N-gram models

The neural network revolution
Each stage of the classical pipeline gained a neural replacement:
- Front-end: classical signal processing → CNNs, auto-encoders
- Acoustic Models: Gaussian mixture models → DNN-HMMs, LSTM-HMMs
- Pronunciation Models: pronunciation tables → RNN-based pronunciation models
- Language Models: N-gram models → neural language models

End-to-end speech recognition
X is the audio (feature vectors), and Y is a text sequence (the transcript). Perform speech recognition by learning a single probabilistic model p(Y|X).
- Classical: X → features → Acoustic Models → Pronunciation Models → Language Models → Y
- End-to-end: X → features → Probabilistic Model → Y
Two main approaches:
- Connectionist Temporal Classification (CTC)
- Sequence-to-sequence models with attention (seq2seq)

Connectionist Temporal Classification (CTC)
A bi-directional RNN reads the input frames x1…x8 and produces log probabilities for the token classes at each time frame, via a softmax over the vocabulary plus an extra blank token "_".
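A minimal sketch of the per-frame output described above, with random scores standing in for the bi-directional RNN (the vocabulary and frame count here are illustrative):

```python
import numpy as np

# Hypothetical vocabulary plus the extra CTC blank token "_"
vocab = ["a", "c", "t", "_"]
T = 8  # number of time frames (x1..x8 on the slide)
rng = np.random.default_rng(0)

# Stand-in for the bi-directional RNN output: one score vector per frame
logits = rng.normal(size=(T, len(vocab)))

# Softmax per frame gives p(token | frame); take the log for log probabilities
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
log_probs = np.log(probs)

assert probs.shape == (T, len(vocab))
assert np.allclose(probs.sum(axis=1), 1.0)  # each frame is a distribution
```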

Connectionist Temporal Classification (CTC)
Transitions are only allowed from a symbol to itself or to the blank "_":
- cc_aa_t_ maps to cat
- ccc__a_t_ maps to cat
- cccc_aaa_ttt_ maps to cat
Dynamic programming allows efficient calculation of the log probability p(Y|X) and its gradient, which can be back-propagated to learn the RNN parameters.
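The collapsing rule on this slide (merge repeats, then drop blanks) can be sketched directly:

```python
def ctc_collapse(path, blank="_"):
    """Map a frame-level CTC path to its output string:
    first merge repeated symbols, then drop the blank token."""
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:  # keep only symbol changes, no blanks
            out.append(sym)
        prev = sym
    return "".join(out)

# All three frame-level paths from the slide collapse to "cat":
for path in ["cc_aa_t_", "ccc__a_t_", "cccc_aaa_ttt_"]:
    print(ctc_collapse(path))  # -> "cat" each time
```

Note that the blank also separates genuine repeats: `ctc_collapse("c_c")` gives "cc", while `ctc_collapse("cc")` gives "c".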

Limitations of CTC
CTC outputs often lack correct spelling and grammar:
- "A cat sat on the desk" → "A Kat sat on the desk"
- "Kat said hello" → "Cat said hello"
A language model is therefore required for rescoring.
CTC makes a label prediction for each frame based only on the audio data, p(Y|X): it assumes the label predictions are conditionally independent of each other.

Sequence-to-sequence models (seq2seq)
An encoder maps the input frames x1…x8 to a representation f(X); a decoder/transducer then predicts the next token of the transcript from the tokens emitted so far: p(yt+1 | y1…t, x).
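The decoding loop implied by p(yt+1 | y1…t, x) can be sketched as a greedy autoregressive loop. Here `next_token` is a hypothetical stand-in for the real transducer network, which would run an RNN over the prefix and the encoded input:

```python
def next_token(prefix, encoded):
    """Toy stand-in for the decoder network: returns the most likely
    next token given the transcript prefix y1..t (a real model would
    condition on `encoded` = f(X) as well)."""
    table = {"": "c", "c": "a", "ca": "t", "cat": "<eos>"}
    return table.get(prefix, "<eos>")

def greedy_decode(encoded, max_len=10):
    """Emit one token at a time until the end-of-sequence symbol."""
    y = ""
    for _ in range(max_len):
        nxt = next_token(y, encoded)
        if nxt == "<eos>":
            break
        y += nxt
    return y

print(greedy_decode(encoded=None))  # -> "cat"
```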

Attention models

Attention example
Each prediction is derived from "attending" to a segment of the input. The attention vector indicates where the model thinks the relevant information is to be found.
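One common way to compute such an attention vector (a dot-product formulation, used here for illustration; the slides later show LAS's own variant) is to score every encoder frame against the decoder state, softmax the scores, and take the weighted sum:

```python
import numpy as np

def attend(state, hiddens):
    """Dot-product attention sketch: score each encoder frame against the
    decoder state, softmax into an attention vector, and return the
    weighted sum of frames (the context)."""
    scores = hiddens @ state                  # one score per frame
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                  # attention vector, sums to 1
    context = weights @ hiddens               # weighted sum of frames
    return weights, context

rng = np.random.default_rng(1)
hiddens = rng.normal(size=(8, 4))  # 8 encoder frames, 4-dim features
state = rng.normal(size=4)         # current decoder state
weights, context = attend(state, hiddens)
assert np.isclose(weights.sum(), 1.0) and context.shape == (4,)
```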


Listen, Attend and Spell (LAS)
LAS is a seq2seq model built from two RNNs:
- the encoder (RNN), named the listener, turns the low-level signals x1…x8 into high-level features f(X)
- the decoder/transducer (RNN), named the speller, produces the transcript y1…t, predicting yt+1 at each step

Listen, Attend and Spell (LAS)
Let h be the hidden state sequence from the encoder and s the state vector from the decoder. The attention vector is computed as softmax{ f([ht, s]) }. A hierarchical (pyramidal) encoder reduces the time resolution.
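The time-resolution reduction in the hierarchical listener can be sketched as follows: each pyramid layer concatenates adjacent pairs of frames, halving the sequence length (in the real model a BLSTM then runs over the shorter sequence; that part is omitted here):

```python
import numpy as np

def pyramid_step(h):
    """One layer of a pyramidal encoder: concatenate each pair of
    adjacent frames, halving the time resolution and doubling the
    feature dimension."""
    T, d = h.shape
    h = h[: T - T % 2]           # drop a trailing odd frame if present
    return h.reshape(-1, 2 * d)  # shape (T // 2, 2 * d)

x = np.zeros((8, 4))             # 8 frames of 4-dim features
h1 = pyramid_step(x)             # -> shape (4, 8)
h2 = pyramid_step(h1)            # -> shape (2, 16)
assert h1.shape == (4, 8) and h2.shape == (2, 16)
```

Two such steps reduce 8 frames to 2, so the speller's attention has far fewer positions to score.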


Limitations of LAS (seq2seq)
- Not an online model: all input must be received before it can produce transcripts
- Attention is a computational bottleneck
- The length of the input has a large impact on accuracy