1 End-to-end speech recognition system using RNNs and the CTC loss function
Yannis Flet-Berliac, Tengyu Zhou, Maciej Korzepa, Gandalf Saxe

2 Presentation overview
Problem and history
Model
Implementation
Results

3 Why Speech Recognition?
Corti's goal: extract critical information about a patient's condition (e.g., a heart attack) from calls to emergency numbers.
Many other applications: home automation, court reporting, live translation, etc.

4 1950s and 1960s: The dream starts
In 1952, Bell Laboratories designed the "Audrey" system, which recognized digits spoken by a single voice. Ten years later, at the 1962 World's Fair, IBM demonstrated its "Shoebox" machine, which could understand 16 words spoken in English.

5 1970s: The dream takes off
First speech recognition company: Threshold Technology. The U.S. Department of Defense's DARPA Speech Understanding Research program (1971–1976). Bell Laboratories introduced a system that could interpret multiple people's voices. Carnegie Mellon's "Harpy" system could recognize 1,011 words.

6 1980s and 1990s: Wilder dreams
Statistical methods (HMM) took vocabularies from a few hundred words to several thousand words, with the potential of an unlimited number of words.
Other new improvements: highly natural concatenative speech synthesis systems; machine learning; mixed-initiative dialog systems.

7 Improvements over the years
Huang X., Baker J., Reddy R. A historical perspective of speech recognition. Communications of the ACM, 2014, 57(1): 94–103.

8 Deep Learning: The Future Trend?

9 Speech Recognition: The New Fashion
Before 2000: HMM, GMM, and feed-forward neural networks, with steady incremental improvements.
Deep learning (LSTM): decreased the word error rate by 30%. Around 2007, CTC-trained LSTMs started to outperform traditional systems.

10 First Trend: Larger Vocabulary
[Chart: vocabulary sizes growing over the decades, from the 100–1000-word range to 1000+ words]

11 Second Trend: From Isolated to Continuous
1960s: isolated words. 1970s: isolated words, connected digits. 1980s: connected words. 1990s: continuous speech.

12 ASR New Fashion (DNN) and the Importance of Data

13 Deep Speech 2 Model
1. Input and output. 2. Accuracy measurement. 3. Model structure. 4. CTC.

14 1. Input and output: end-to-end transcription
INPUT X: raw audio (WAV, 16 kHz, 16-bit), a 1-D vector
OUTPUT Y: sequence of words

15 2. Accuracy Measurement: Word Error Rate (WER)
Align the recognized and reference texts (using dynamic string alignment) and count the edits that transform the recognized text into the reference text:
WER = (S + D + I) / N
S = number of substitutions, D = number of deletions, I = number of insertions, N = number of words in the reference
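As a minimal sketch of this metric (a helper written by us, not code from the Deep Speech 2 paper): the word-level edit distance directly gives S + D + I for the optimal alignment.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate = (S + D + I) / N, via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                              # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                              # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat", "the cat sat down"))    # 1 insertion / 3 words ≈ 0.33
```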

16 3. Model Structure: layers in the Deep Speech 2 model
Raw audio → spectrogram → CNN → RNN → fully connected → CTC cost function
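A rough PyTorch sketch of this stack, just to make the data flow concrete; all layer sizes and counts are illustrative placeholders, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DeepSpeechLike(nn.Module):
    """Sketch: spectrogram -> CNN -> bidirectional RNN -> fully connected ->
    per-frame log-probabilities over characters (fed to the CTC loss)."""
    def __init__(self, n_freq=161, n_hidden=256, n_chars=28):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=(11, 11), stride=(2, 2), padding=(5, 5)),
            nn.ReLU(),
        )
        # Input frequency dimension of the spectrogram must equal n_freq
        self.rnn = nn.GRU(32 * ((n_freq + 1) // 2), n_hidden,
                          num_layers=3, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, spec):                       # spec: (batch, 1, time, freq)
        x = self.conv(spec)                        # (batch, 32, time', freq')
        b, c, t, f = x.shape
        x = x.permute(0, 2, 1, 3).reshape(b, t, c * f)
        x, _ = self.rnn(x)                         # (batch, time', 2 * n_hidden)
        return self.fc(x).log_softmax(dim=-1)      # (batch, time', n_chars)
```

The output is one distribution over the 26 letters, space, and blank per frame, which is exactly what the CTC loss described later consumes.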

17 3. Model Structure: spectrogram (pre-processing)
Apply an FFT within a sliding time window (20 ms).

18 3. Model Structure: spectrogram (pre-processing)
Concatenate the windows from adjacent frames to get the spectrogram.
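A minimal NumPy sketch of this preprocessing; the 20 ms window and 10 ms hop are common defaults, not necessarily the exact parameters used in the talk.

```python
import numpy as np

def spectrogram(audio, sample_rate=16000, win_ms=20, hop_ms=10):
    """Magnitude spectrogram: |FFT| over 20 ms windows, hopped every 10 ms."""
    win = int(sample_rate * win_ms / 1000)          # 320 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)
    frames = [audio[i:i + win] * np.hanning(win)
              for i in range(0, len(audio) - win + 1, hop)]
    # One column of FFT magnitudes per window, concatenated along time
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T  # (freq, time)

audio = np.random.randn(16000)                       # 1 s of fake 16 kHz audio
print(spectrogram(audio).shape)                      # (161, 99)
```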

19–22 3. Model Structure: speech engine
Train from labeled pairs (x, y*); intermediate output: c; extract the transcription from c.
Main issue: segmentation: length(x) != length(y). (Phonemes are the perceptually distinct units of sound.)
Solution: align phonemes with audio manually? Instead: CTC.

23–24 4. Connectionist Temporal Classification (CTC)
The term was coined in Graves et al., "Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks" (2006).
Temporal classification: labelling unsegmented data sequences.
Connectionist temporal classification: the use of RNNs for this purpose.

25–29 CTC step 1 of 3
1. The RNN output neurons c encode a distribution over symbols: c ∈ {A, B, C, …, Z, blank, space}
Note: length(c) == length(x)
Note: c is a (length(x) × 28) matrix
The output neurons define a distribution over whole character sequences c, assuming per-frame independence:
P(c|x) = ∏_t P(c_t|x)
That is, the probability of a character sequence that is the same length as the audio.
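To make the independence assumption concrete, a small sketch (variable names and shapes are our own; the softmax matrix is random stand-in data):

```python
import numpy as np

T, n_symbols = 100, 28                    # frames x {A..Z, space, blank}
probs = np.random.dirichlet(np.ones(n_symbols), size=T)  # fake per-frame softmax

def sequence_prob(probs, c):
    """P(c|x) = prod_t P(c_t|x) under the per-frame independence assumption."""
    return float(np.prod(probs[np.arange(len(c)), c]))

c = np.random.randint(0, n_symbols, size=T)   # one candidate alignment, length T
print(sequence_prob(probs, c))                # tiny; real code works in log space
```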

30–32 CTC step 2 of 3
2. Define a mapping β(c) → y: delete duplicates, then blanks.
y = β(c) = β(HHH_E__LL_LO___) = "HELLO"
The mapping implies a distribution over all possible transcriptions y. The probability of a specific transcription sums / marginalizes over all the alignments that collapse to it:
P(y|x) = Σ_{c : β(c) = y} P(c|x)
Likelihood function: P(y|x,θ) = L(θ|x,y)
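A sketch of β and of the marginalization, with the sum over alignments done by brute-force enumeration; that is exponential in the number of frames and only feasible for toy inputs, whereas the real CTC loss computes the same sum with a forward-backward dynamic program.

```python
import itertools

BLANK = "_"

def beta(c: str) -> str:
    """Collapse an alignment: remove repeated characters, then blanks."""
    deduped = "".join(ch for ch, _ in itertools.groupby(c))
    return deduped.replace(BLANK, "")

assert beta("HHH_E__LL_LO___") == "HELLO"

def transcription_prob(probs, y, symbols):
    """P(y|x) = sum of P(c|x) over all alignments c with beta(c) == y."""
    total = 0.0
    for c in itertools.product(range(len(symbols)), repeat=len(probs)):
        if beta("".join(symbols[i] for i in c)) == y:
            p = 1.0
            for t, i in enumerate(c):
                p *= probs[t][i]
            total += p
    return total

symbols = "AB_"                        # toy alphabet with blank
probs = [[0.6, 0.3, 0.1]] * 3          # 3 frames of toy per-frame probabilities
print(transcription_prob(probs, "AB", symbols))  # sums AAB, ABB, AB_, A_B, _AB
```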

33 CTC step 3 of 3
3. Update the network parameters θ to maximize the likelihood of the correct label y*:
θ* = argmax_θ Σ_(x,y*) log P(y*|x; θ)
That is, maximize the probability of the correct transcription, given the audio.

34–35 CTC training
Audio spectrogram → neural network → output bank of softmax neurons → compute CTC(c, y*) + gradient → gradient descent
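Baidu used their own Warp-CTC implementation; as a stand-in, here is a minimal training-step sketch with PyTorch's built-in nn.CTCLoss and random placeholder tensors:

```python
import torch
import torch.nn as nn

ctc = nn.CTCLoss(blank=0)            # symbol index 0 is reserved for blank

T, N, C, S = 100, 4, 28, 20          # frames, batch, symbols, target length
log_probs = torch.randn(T, N, C, requires_grad=True).log_softmax(dim=-1)
targets = torch.randint(1, C, (N, S))           # y*: character indices, no blanks
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = ctc(log_probs, targets, input_lengths, target_lengths)  # -log P(y*|x)
loss.backward()                      # gradient for the gradient-descent step
```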

36–38 Decoding
The network outputs P(c|x). How do we find the most likely transcription under P(y|x)?
Simple (approximate) solution: take the most likely symbol at each frame and collapse: y ≈ β(argmax_c P(c|x))
Optimal solution: find the y that maximizes P(y|x) = Σ_{c : β(c) = y} P(c|x). Hard problem!
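A sketch of the approximate (best-path) decoder: pick the most likely symbol at every frame, then collapse with β. Names and the random input are ours.

```python
import numpy as np

SYMBOLS = list("ABCDEFGHIJKLMNOPQRSTUVWXYZ") + [" ", "_"]   # "_" = blank

def best_path_decode(probs):
    """Most likely symbol per frame, then remove repeats and blanks."""
    chars = [SYMBOLS[i] for i in probs.argmax(axis=1)]
    collapsed = [c for i, c in enumerate(chars) if i == 0 or c != chars[i - 1]]
    return "".join(c for c in collapsed if c != "_")

probs = np.random.dirichlet(np.ones(28), size=50)  # fake (T=50, 28) softmax output
print(best_path_decode(probs))
```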

39 Deep Speech 2 Implementation
1. Tips and tricks. 2. System optimizations. 3. Training data used by Baidu. 4. Their results.

40 Tips and Tricks: BatchNorm
Why? To efficiently scale the model as the training set is scaled: increasing the depth of the network leads to optimization issues, and BatchNorm accelerates the training of DNNs.
Basic formulation: B(x) = γ (x − E[x]) / √(Var[x] + ε) + β

41–42 Tips and Tricks: BatchNorm
The special case of bidirectional RNNs. Naively, batch norm wraps the standard recurrent operation:
h_t^l = f(B(W^l h_t^(l−1) + U^l h_(t−1)^l))

43 Tips and Tricks: BatchNorm
Instead, batch normalization is applied only to the vertical connections (i.e., from one layer to another) and not to the horizontal connections (i.e., within the recurrent layer):
h_t^l = f(B(W^l h_t^(l−1)) + U^l h_(t−1)^l)
+12% performance difference for the deepest network.
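A simplified, unidirectional PyTorch sketch of this idea (the paper's layers are bidirectional and use a clipped ReLU; class and variable names are ours): batch norm is applied to the input-to-hidden term only, with statistics computed sequence-wise over batch and time.

```python
import torch
import torch.nn as nn

class BNRecurrentLayer(nn.Module):
    """h_t = f(B(W x_t) + U h_{t-1}): batch norm on the vertical connection only."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.W = nn.Linear(n_in, n_hidden, bias=False)
        self.U = nn.Linear(n_hidden, n_hidden, bias=False)
        self.bn = nn.BatchNorm1d(n_hidden)       # normalizes W x_t, not U h_{t-1}

    def forward(self, x):                        # x: (batch, time, n_in)
        b, t, _ = x.shape
        # Sequence-wise batch norm: statistics over batch and time together
        wx = self.bn(self.W(x).reshape(b * t, -1)).reshape(b, t, -1)
        h = torch.zeros(b, self.U.in_features, device=x.device)
        out = []
        for step in range(t):
            h = torch.relu(wx[:, step] + self.U(h))  # no BN on the recurrent term
            out.append(h)
        return torch.stack(out, dim=1)           # (batch, time, n_hidden)
```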

44 Tips and Tricks: SortaGrad
Training on examples of varying length poses some algorithmic challenges. Think of how a child learns: longer examples tend to be more challenging.
1st epoch: iterate through the training set in increasing order of the length of the longest utterance in each minibatch.
Other epochs: the training reverts back to a random order over minibatches.
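A minimal sketch of this curriculum (function and variable names are ours; `dataset` is assumed to be a list of (audio, transcript) pairs):

```python
def sortagrad_batches(dataset, batch_size, epoch, rng):
    """Epoch 0: minibatches in increasing order of utterance length.
    Later epochs: the same minibatches in random order."""
    order = sorted(range(len(dataset)), key=lambda i: len(dataset[i][0]))
    batches = [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
    if epoch > 0:
        rng.shuffle(batches)                # keep batches, randomize their order
    return batches
```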

45 Tips and Tricks: Language Model
5-gram models, with parameters tuned on a development set.
Additionally, they used beam search to find the optimal transcription.
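The Deep Speech papers rank beam candidates by combining the acoustic score with the language-model score and a word-count bonus. A sketch of that scoring function; the weights are made up, and `lm_logprob` is an assumed callable backed by the 5-gram model (e.g., KenLM):

```python
ALPHA, BETA = 1.5, 0.3    # LM weight and word bonus, tuned on a dev set (made up)

def beam_score(acoustic_logprob, transcript, lm_logprob):
    """log P_ctc(y|x) + alpha * log P_lm(y) + beta * word_count(y)."""
    return (acoustic_logprob
            + ALPHA * lm_logprob(transcript)
            + BETA * len(transcript.split()))
```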

46 2. System Optimizations: GPU implementations
Memory allocation issue: most of the memory holds the activations of each layer, kept for use by backpropagation. With 70M parameters the weights take "only" 280 MB of memory, but the activations for a batch of 64 seven-second utterances take 1.5 GB...
This leads to parallel SGD; they also allocate to CPU memory accessible by the GPU.

47 3. Training data used by Baidu
English: 11,940 hours of labeled speech. Mandarin: 9,400 hours of labeled speech.
Dataset augmentation by adding noise: increases the effective size of the training data and improves the robustness of the model to noisy speech.

48 4. Their results: training
20 epochs; 9-layer model (2 CNN, 7 RNN) with 68M parameters
maxNorm = 400
Learning rate = 10^(-4), annealed by 1.2 after each epoch
Beam size = 500
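The annealing schedule in code form (a sketch; `train_one_epoch` is a hypothetical training loop, not a real API):

```python
lr = 1e-4                            # initial learning rate
for epoch in range(20):
    train_one_epoch(model, lr)       # hypothetical one-epoch training loop
    lr /= 1.2                        # anneal by 1.2 after each epoch
```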

49 4. Their results
[Charts: compared results; "Data Drives Accuracy"]

50 Initial research
CTC loss implementations: TensorFlow; Theano; Warp-CTC in C++ by Baidu (outperforms the rest).
Deep Speech implementations: Theano/Python (DS1); Torch/Lua (DS2).

51 Datasets
AN4: personal information (names, addresses, telephone numbers, birthdates, etc.)
948 training utterances, 130 test utterances, ~50 minutes of recordings

52 Datasets
LibriSpeech: dev-clean 5.4 h, dev-other 5.3 h
train-clean-100: 100.6 h; train-clean-360: 363.6 h; train-other-500: 496.7 h
Total: ~960 h of training speech
test-clean and test-other: 10.5 h in total

53 Initial tests and setup
AWS: g2.8xlarge instance; AN4 dataset, 9 hours of training. Moved to a p2 instance, but we were already out of credits.
DTU HPC: setting up Linux dependencies; sorting out CUDA errors, memory errors, etc.
2 × Nvidia Tesla K80 (48 GB of GPU memory in total)

54 Scaling up: LibriSpeech, clean 100 h
Large variance in WER depending on batch size:
batch size = 75 (max): WER = 58%
batch size = 40: WER = 52%
batch size = 12: WER = 42%
Tradeoff: faster training vs. lower WER. Explanation: noisy gradients.

55 Scaling up: LibriSpeech, clean ~1000 h (full training dataset)
Audio files: 60 GB, almost 300k utterances; input data to the network: 212 GB
Batch size 64; one epoch took ~8 h; the whole training took over a week
WER ~12% (Baidu got 5.33%)

56 Training on small amounts of data
dev-clean + dev-other = ~11 hours of speech. Deep learning does not work without big data:
Batch size | WER, %
10 | 90
6 | 87.5
4 | 83.1
2 | 81.2

57 Example transcription (.flac)
Reference: I THINK HE WAS PERHAPS MORE APPRECIATIVE THAN I WAS OF THE DISCIPLINE OF THE EDISON CONSTRUCTION DEPARTMENT AND THOUGHT IT WOULD BE WELL FOR US TO WAIT UNTIL THE MORNING OF THE FOURTH BEFORE WE STARTED UP
Recognized: i think he was perhaps more appreciative that i was of the discipline of the edison construction department and thought it would be well for us to wait until the morning of the fourth before we started up

58 Example transcription (.flac)
Reference: SHE WAS THE MOST AGREEABLE WOMAN IVE EVER KNOWN IN HER POSITION SHE WOULD HAVE BEEN WORTHY OF ANY WHATEVER
Recognized: she was the most agreeable woman i have ever known in her position she would have been worth of any whatever

59 Example transcription (.flac)
Reference: STEPHANOS DEDALOS
Recognized: stephanos der loss

60 Thank you for your attention!

61 Development of the Technology
Filter-bank analysis; time normalization; dynamic programming; pattern recognition; LPC analysis; clustering algorithms; level building; hidden Markov models; stochastic language modeling; finite-state machines; statistical learning; concatenative synthesis; machine learning; mixed-initiative dialog.

