Listen Attend and Spell – a brief introduction


1 Listen Attend and Spell – a brief introduction
Dr Ning Ma, Speech and Hearing Group, University of Sheffield

2 Classical speech recognition architecture
Pipeline: Speech → Front-end (classical signal processing) → features → Acoustic Models (Gaussian mixture models) → Pronunciation Models (pronunciation tables) → Language Models (N-gram models) → W. Example pronunciation table entries: there: /ðɛː/, is: /ɪz/, a: /ə/, cat: /kat/; output W = "there is a cat".

3 The neural network revolution
Each classical component is replaced by a neural network: the signal-processing front-end by CNNs and auto-encoders, the Gaussian mixture acoustic models by DNN-HMMs and LSTM-HMMs, the pronunciation tables by RNN-based pronunciation models, and the N-gram language models by neural language models.

4 End-to-end speech recognition
X is the audio (a sequence of feature vectors) and Y is a text sequence (the transcript). Speech recognition is performed by learning a probabilistic model p(Y|X). Classical: X → features → acoustic, pronunciation and language models → Y. End-to-end: X → features → a single probabilistic model → Y.

5 End-to-end speech recognition
As above, a single probabilistic model p(Y|X) maps the audio features X to the transcript Y, replacing the separate acoustic, pronunciation and language models. There are two main approaches: Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention (seq2seq).

6 Connectionist Temporal Classification (CTC)
A bi-directional RNN produces log probabilities for the different token classes at each time frame x1 … x8, via a softmax over the vocabulary plus an extra blank token _.
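To make this concrete, here is a minimal sketch of such a model in PyTorch (an assumption; the slides name no framework). The layer sizes and the 29-token vocabulary are illustrative only.

    import torch
    import torch.nn as nn

    class CTCAcousticModel(nn.Module):
        def __init__(self, num_features=40, hidden=256, vocab_size=29):
            super().__init__()
            self.rnn = nn.LSTM(num_features, hidden,
                               bidirectional=True, batch_first=True)
            self.proj = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the blank "_"

        def forward(self, x):                    # x: (batch, time, num_features)
            h, _ = self.rnn(x)                   # h: (batch, time, 2*hidden)
            return self.proj(h).log_softmax(-1)  # per-frame log probs, incl. blank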

7 Connectionist Temporal Classification (CTC)
Only transitions from a symbol to itself or to the blank _ are allowed: cc_aa_t_ maps to cat, ccc__a_t_ maps to cat, and cccc_aaa_ttt_ maps to cat. Dynamic programming allows efficient calculation of the log probability p(Y|X) and its gradient, which can be back-propagated to learn the RNN parameters.
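The mapping from frame-level paths to output strings is simple to implement; a minimal sketch in plain Python, with _ as the blank:

    def ctc_collapse(path, blank="_"):
        """Merge repeated symbols, then drop blanks (the CTC collapsing rule)."""
        out, prev = [], None
        for sym in path:
            if sym != prev and sym != blank:
                out.append(sym)
            prev = sym
        return "".join(out)

    assert ctc_collapse("cc_aa_t_") == "cat"
    assert ctc_collapse("ccc__a_t_") == "cat"
    assert ctc_collapse("cccc_aaa_ttt_") == "cat"

Repeats are merged only when no blank separates them, which is how CTC can still emit genuinely doubled letters such as the ll in hello.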

8 Limitations of CTC CTC outputs often lack correct spelling and grammar
A cat sat on the desk → A Kat sat on the desk; Kat said hello → Cat said hello. A language model is required for rescoring.

9 Limitations of CTC CTC outputs often lack correct spelling and grammar
As on the previous slide, a language model is required for rescoring. In addition, CTC predicts a label at each frame based only on the audio data, assuming the per-frame label predictions are conditionally independent of each other: the probability of a frame-level alignment factorises as ∏t p(yt|X).
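The dynamic-programming loss from slide 7 is available ready-made, for example as PyTorch's nn.CTCLoss; a sketch with illustrative shapes (100 frames, batch of 8, 29 tokens plus blank index 0):

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0)
    # (time, batch, classes) per-frame log probs, as produced by the RNN above
    log_probs = torch.randn(100, 8, 30, requires_grad=True).log_softmax(-1)
    targets = torch.randint(1, 30, (8, 20))    # label indices; 0 is reserved for blank
    input_lengths = torch.full((8,), 100, dtype=torch.long)
    target_lengths = torch.randint(10, 21, (8,))
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # gradient of -log p(Y|X) flows back to the RNN parameters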

10 Sequence-to-sequence models (seq2seq)
The encoder maps the input frames x1 x2 … x8 to a representation f(X); the decoder/transducer predicts the next token of the transcript from the previous tokens and the encoded input: p(yt+1 | y1…t, X).
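A minimal greedy-decoding sketch of this factorisation; encoder, decoder and the sos/eos token indices are hypothetical stand-ins, not part of the slides:

    import torch

    def greedy_decode(encoder, decoder, x, sos=1, eos=2, max_len=100):
        enc = encoder(x)                   # f(X): encoded input sequence
        y, state = [sos], None
        for _ in range(max_len):
            # one decoder step: p(y_{t+1} | y_{1..t}, X)
            logits, state = decoder(torch.tensor([y[-1]]), enc, state)
            tok = int(logits.argmax(-1))
            if tok == eos:
                break
            y.append(tok)
        return y[1:]                       # transcript without the start marker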

11 Attention models

12 Attention example Each prediction is derived by "attending" to a segment of the input. The attention vector indicates where in the input the model thinks the relevant information is to be found.
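One attention step might look as follows; dot-product scoring is an assumption here, since the slides do not fix the score function:

    import torch

    def attend(h, s):
        # h: (time, dim) encoder states; s: (dim,) current decoder state
        scores = h @ s                         # one relevance score per input position
        alpha = torch.softmax(scores, dim=0)   # the attention vector
        context = alpha @ h                    # weighted summary of the input
        return context, alpha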

13-16 Attention example (figure-only slides continuing the attention example)

17 Listen Attend and Spell (LAS)
Low-level signals x1 x2 … x8 are converted by an encoder RNN, named the listener, into high-level features f(X); a decoder/transducer RNN, named the speller, generates the transcript, predicting yt+1 from y1…t at each step.

18 Listen Attend and Spell (LAS)
The attention vector is computed as softmax{ f([ht, s]) }, where s is the state vector from the decoder and h is the hidden state sequence from the encoder. A hierarchical encoder reduces the time resolution.
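A sketch of both ideas, with illustrative sizes: f is taken to be a small MLP (an assumption; the slides leave f unspecified), and the pyramidal step halves the time resolution by concatenating consecutive frame pairs:

    import torch
    import torch.nn as nn

    def pyramid_step(h):
        # (batch, time, dim) -> (batch, time//2, 2*dim): halve time resolution
        b, t, d = h.shape
        return h[:, : t - t % 2].reshape(b, t // 2, 2 * d)

    class MLPAttention(nn.Module):
        def __init__(self, enc_dim, dec_dim, hidden=128):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(enc_dim + dec_dim, hidden),
                                   nn.Tanh(), nn.Linear(hidden, 1))

        def forward(self, h, s):
            # h: (batch, time, enc_dim) encoder states; s: (batch, dec_dim) decoder state
            s_rep = s.unsqueeze(1).expand(-1, h.size(1), -1)
            scores = self.f(torch.cat([h, s_rep], dim=-1)).squeeze(-1)
            return torch.softmax(scores, dim=-1)  # attention vector over time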

19 Listen Attend and Spell (LAS)
(diagram repeated from the previous slide)

20 Limitations of LAS (seq2seq)
Not an online model: all input must be received before transcripts can be produced. Attention is a computational bottleneck. The length of the input has a large impact on accuracy.

