Listen Attend and Spell – a brief introduction


1 Listen Attend and Spell – a brief introduction
Dr Ning Ma, Speech and Hearing Group, University of Sheffield

2 Classical speech recognition architecture
Pipeline: Speech → Front-end (classical signal processing) → features → Acoustic Models (Gaussian mixture models) → Pronunciation Models (pronunciation tables) → Language Models (N-gram models) → W. Example pronunciation table entries: there: /ðɛː/, is: /ɪz/, a: /ə/, cat: /kat/; output W = "there is a cat".

3 The neural network revolution
Each classical component is replaced by a neural network: the signal-processing front-end by CNNs and auto-encoders, the Gaussian mixture acoustic models by DNN-HMMs and LSTM-HMMs, the pronunciation tables by RNN-based pronunciation models, and the N-gram language models by neural language models.

4 End-to-end speech recognition
X is the audio (a sequence of feature vectors) and Y is a text sequence (the transcript). Speech recognition is performed by learning a probabilistic model p(Y|X). Classical: X → features → acoustic, pronunciation and language models → Y. End-to-end: X → features → a single probabilistic model → Y.

5 End-to-end speech recognition
As above, a single probabilistic model p(Y|X) maps the audio features X to the transcript Y, replacing the separate acoustic, pronunciation and language models. There are two main approaches: Connectionist Temporal Classification (CTC) and sequence-to-sequence models with attention (seq2seq).

6 Connectionist Temporal Classification (CTC)
A bi-directional RNN produces log probabilities for the different token classes at each time frame x1 … x8, via a softmax over the vocabulary plus an extra blank token _.
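To make this concrete, here is a minimal sketch of such a model in PyTorch (an assumption; the slides name no framework). The layer sizes and the 29-token vocabulary are illustrative only.

    import torch
    import torch.nn as nn

    class CTCAcousticModel(nn.Module):
        def __init__(self, num_features=40, hidden=256, vocab_size=29):
            super().__init__()
            self.rnn = nn.LSTM(num_features, hidden,
                               bidirectional=True, batch_first=True)
            self.proj = nn.Linear(2 * hidden, vocab_size + 1)  # +1 for the blank "_"

        def forward(self, x):                    # x: (batch, time, num_features)
            h, _ = self.rnn(x)                   # h: (batch, time, 2*hidden)
            return self.proj(h).log_softmax(-1)  # per-frame log probs, incl. blank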

7 Connectionist Temporal Classification (CTC)
Only transitions from a symbol to itself or to the blank _ are allowed: cc_aa_t_ maps to cat, ccc__a_t_ maps to cat, and cccc_aaa_ttt_ maps to cat. Dynamic programming allows efficient calculation of the log probability p(Y|X) and its gradient, which can be back-propagated to learn the RNN parameters.
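The mapping from frame-level paths to output strings is simple to implement; a minimal sketch in plain Python, with _ as the blank:

    def ctc_collapse(path, blank="_"):
        """Merge repeated symbols, then drop blanks (the CTC collapsing rule)."""
        out, prev = [], None
        for sym in path:
            if sym != prev and sym != blank:
                out.append(sym)
            prev = sym
        return "".join(out)

    assert ctc_collapse("cc_aa_t_") == "cat"
    assert ctc_collapse("ccc__a_t_") == "cat"
    assert ctc_collapse("cccc_aaa_ttt_") == "cat"

Repeats are merged only when no blank separates them, which is how CTC can still emit genuinely doubled letters such as the ll in hello.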

8 Limitations of CTC CTC outputs often lack correct spelling and grammar
A cat sat on the desk → A Kat sat on the desk; Kat said hello → Cat said hello. A language model is required for rescoring.

9 Limitations of CTC CTC outputs often lack correct spelling and grammar
As on the previous slide, a language model is required for rescoring. In addition, CTC predicts a label at each frame based only on the audio data, assuming the per-frame label predictions are conditionally independent of each other: the probability of a frame-level alignment factorises as ∏t p(yt|X).
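The dynamic-programming loss from slide 7 is available ready-made, for example as PyTorch's nn.CTCLoss; a sketch with illustrative shapes (100 frames, batch of 8, 29 tokens plus blank index 0):

    import torch
    import torch.nn as nn

    ctc = nn.CTCLoss(blank=0)
    # (time, batch, classes) per-frame log probs, as produced by the RNN above
    log_probs = torch.randn(100, 8, 30, requires_grad=True).log_softmax(-1)
    targets = torch.randint(1, 30, (8, 20))    # label indices; 0 is reserved for blank
    input_lengths = torch.full((8,), 100, dtype=torch.long)
    target_lengths = torch.randint(10, 21, (8,))
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # gradient of -log p(Y|X) flows back to the RNN parameters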

10 Sequence-to-sequence models (seq2seq)
The encoder maps the input frames x1 x2 … x8 to a representation f(X); the decoder/transducer predicts the next token of the transcript from the previous tokens and the encoded input: p(yt+1 | y1…t, X).
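A minimal greedy-decoding sketch of this factorisation; encoder, decoder and the sos/eos token indices are hypothetical stand-ins, not part of the slides:

    import torch

    def greedy_decode(encoder, decoder, x, sos=1, eos=2, max_len=100):
        enc = encoder(x)                   # f(X): encoded input sequence
        y, state = [sos], None
        for _ in range(max_len):
            # one decoder step: p(y_{t+1} | y_{1..t}, X)
            logits, state = decoder(torch.tensor([y[-1]]), enc, state)
            tok = int(logits.argmax(-1))
            if tok == eos:
                break
            y.append(tok)
        return y[1:]                       # transcript without the start marker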

11 Attention models

12 Attention example Each prediction is derived by "attending" to a segment of the input. The attention vector indicates where in the input the model thinks the relevant information is to be found.
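One attention step might look as follows; dot-product scoring is an assumption here, since the slides do not fix the score function:

    import torch

    def attend(h, s):
        # h: (time, dim) encoder states; s: (dim,) current decoder state
        scores = h @ s                         # one relevance score per input position
        alpha = torch.softmax(scores, dim=0)   # the attention vector
        context = alpha @ h                    # weighted summary of the input
        return context, alpha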

13-16 Attention example (figure-only slides continuing the attention example)

17 Listen Attend and Spell (LAS)
Low-level signals x1 x2 … x8 are converted by an encoder RNN, named the listener, into high-level features f(X); a decoder/transducer RNN, named the speller, generates the transcript, predicting yt+1 from y1…t at each step.

18 Listen Attend and Spell (LAS)
The attention vector is computed as softmax{ f([ht, s]) }, where s is the state vector from the decoder and h is the hidden state sequence from the encoder. A hierarchical encoder reduces the time resolution.
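A sketch of both ideas, with illustrative sizes: f is taken to be a small MLP (an assumption; the slides leave f unspecified), and the pyramidal step halves the time resolution by concatenating consecutive frame pairs:

    import torch
    import torch.nn as nn

    def pyramid_step(h):
        # (batch, time, dim) -> (batch, time//2, 2*dim): halve time resolution
        b, t, d = h.shape
        return h[:, : t - t % 2].reshape(b, t // 2, 2 * d)

    class MLPAttention(nn.Module):
        def __init__(self, enc_dim, dec_dim, hidden=128):
            super().__init__()
            self.f = nn.Sequential(nn.Linear(enc_dim + dec_dim, hidden),
                                   nn.Tanh(), nn.Linear(hidden, 1))

        def forward(self, h, s):
            # h: (batch, time, enc_dim) encoder states; s: (batch, dec_dim) decoder state
            s_rep = s.unsqueeze(1).expand(-1, h.size(1), -1)
            scores = self.f(torch.cat([h, s_rep], dim=-1)).squeeze(-1)
            return torch.softmax(scores, dim=-1)  # attention vector over time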

19 Listen Attend and Spell (LAS)
(diagram repeated from the previous slide)

20 Limitations of LAS (seq2seq)
Not an online model: all input must be received before transcripts can be produced. Attention is a computational bottleneck. The length of the input has a large impact on accuracy.

