Speaker Classification through Deep Learning


1 Speaker Classification through Deep Learning
Jacob Morris, Alex Douglass, Luke Woodbury

2 Overview
Goals: Learn more about deep learning! Create a neural network that will classify voice recordings based on gender, age, natural language, etc.
Potential Applications: Research, Security

3 Software Dependencies
Python 2.7, Keras 1.2.2, Theano, Matplotlib

4 Hardware
GeForce TitanX (Pascal), 12 GB memory

5 Speech Accent Archive
WAV files: 2300+ different speakers, all recorded speaking the same paragraph
Categorizations: Age, Gender, English Residence, Natural Language, Country, Learning Style, etc.

6 The Essence of Deep Learning

7 Artificial Neural Networks (ANN)

8 Recurrent Networks
Layer "remembers" data
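To make the idea of a layer that "remembers" data concrete, here is a minimal sketch of a single recurrent unit. The weights `W_INPUT` and `W_STATE`, the `tanh` activation, and the function names are assumptions chosen for illustration; the slides do not describe the actual layer used.

```python
import math

# Hypothetical weights for a single recurrent unit (not from the slides).
W_INPUT = 0.5
W_STATE = 0.8


def recurrent_step(x, previous_state):
    """Combine the current input with the state carried over from earlier steps."""
    return math.tanh(W_INPUT * x + W_STATE * previous_state)


def run_sequence(inputs):
    """Feed a sequence through the unit and return the final state."""
    state = 0.0
    for x in inputs:
        state = recurrent_step(x, state)
    return state


# Both sequences end with the same input (0.0), but the first one started
# with a pulse, and the final state still reflects it.
with_pulse = run_sequence([1.0, 0.0])
without_pulse = run_sequence([0.0, 0.0])
```

The two final states differ even though the last input is identical, which is the "memory" the slide refers to.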

9 LSTM
Long Short-Term Memory

10 Problem Type
Sequence Classification: Assign classification label(s) to input sequences
Supervised Learning: Each training sample includes the correct output for that sample
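As a small illustration of a supervised training sample, the sketch below pairs an input sequence with its correct output encoded as a one-hot vector. The class list `GENDERS`, the helper `one_hot`, and the amplitude values are hypothetical; the slides do not show how labels were encoded.

```python
def one_hot(label, classes):
    """Return a vector with 1.0 at the position of `label` and 0.0 elsewhere."""
    return [1.0 if c == label else 0.0 for c in classes]


# Hypothetical class list for one category.
GENDERS = ["female", "male"]

# A training sample carries both the input sequence and the expected output.
training_sample = {
    "input": [0.01, -0.02, 0.03],           # made-up amplitudes
    "expected": one_hot("male", GENDERS),   # correct output for this sample
}
```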

11 Variations of Model Topologies
Inputs: Sequence of amplitudes; Discrete Fourier transform of the segment
Hidden Layers: Variable
Outputs: Any subset of data categories
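The two input types can be related with a short sketch that turns one segment of amplitudes into discrete Fourier transform magnitudes. The function name `dft_magnitudes`, the 8-point test segment, and the choice to keep only magnitudes are assumptions for illustration; the slides do not say how the transform was computed or which part of it was fed to the network.

```python
import cmath
import math


def dft_magnitudes(segment):
    """Return the magnitude of each DFT bin for a list of amplitudes."""
    n = len(segment)
    magnitudes = []
    for k in range(n):
        # Bin k is the sum of x[t] * exp(-2*pi*i*k*t/n) over the segment.
        total = sum(
            x * cmath.exp(-2j * math.pi * k * t / n)
            for t, x in enumerate(segment)
        )
        magnitudes.append(abs(total))
    return magnitudes


# Hypothetical segment: one cycle of a sine wave sampled at 8 points.
segment = [math.sin(2 * math.pi * t / 8) for t in range(8)]
spectrum = dft_magnitudes(segment)
```

For this test segment the energy lands in bin 1 and its mirror, bin 7, and every other bin is close to zero. A real pipeline would more likely use a fast Fourier transform routine than this direct sum.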

12 Training Challenges
Process of exploration
Many parameters to tune
Results vague, must be interpreted
Days required to train a new model

13 Terminology
Sample: Base unit of training data; 1/100 of a second of audio
Batch: Group of samples; 4 seconds of consecutive samples
Epoch: Number of batches required to train on entire training data set; in our case, 2310 batches
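The arithmetic behind these terms can be sketched as follows. The three constants come from the slide; the helper `split_into_batches` and its choice to drop a trailing partial batch are assumptions, since the slides do not say how leftover audio was handled.

```python
SAMPLES_PER_SECOND = 100   # one sample is 1/100 of a second of audio
BATCH_SECONDS = 4          # one batch is 4 seconds of consecutive samples
BATCHES_PER_EPOCH = 2310   # batches needed to cover the training set

samples_per_batch = SAMPLES_PER_SECOND * BATCH_SECONDS
samples_per_epoch = samples_per_batch * BATCHES_PER_EPOCH
audio_seconds_per_epoch = BATCH_SECONDS * BATCHES_PER_EPOCH


def split_into_batches(samples, batch_size):
    """Group consecutive samples into full batches (assumption: drop the remainder)."""
    full_batches = len(samples) // batch_size
    return [
        samples[i * batch_size:(i + 1) * batch_size]
        for i in range(full_batches)
    ]


# Ten seconds of placeholder samples yields two full 4-second batches.
batches = split_into_batches(list(range(1000)), samples_per_batch)
```

Under these definitions a batch holds 400 samples, and one epoch covers 9240 seconds (154 minutes) of audio.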


15 Loss
Measure of how close an output signal is to its expected value
Categorical Cross Entropy: Emphasizes correct answer
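A minimal sketch of categorical cross entropy shows why it emphasizes the correct answer: with a one-hot expected vector, only the probability assigned to the correct class contributes to the loss. The function name, the `eps` guard against `log(0)`, and the example probabilities are assumptions for illustration.

```python
import math


def categorical_cross_entropy(expected, predicted, eps=1e-12):
    """Cross entropy between a one-hot target and predicted probabilities."""
    return -sum(
        e * math.log(max(p, eps))
        for e, p in zip(expected, predicted)
    )


expected = [0.0, 1.0, 0.0]      # the second class is the correct answer
confident = [0.05, 0.90, 0.05]  # hypothetical network outputs
unsure = [0.30, 0.40, 0.30]

loss_confident = categorical_cross_entropy(expected, confident)
loss_unsure = categorical_cross_entropy(expected, unsure)
```

The confident prediction gives a loss of about 0.105, while the unsure one gives about 0.916, so the loss falls as more probability is placed on the correct class.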

16 Learning Rate
Determines how large an adjustment to make for a given loss value
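The effect of the learning rate can be sketched with a single plain gradient descent step on a toy loss, `(w - target)^2`. Plain gradient descent, the toy loss, and the two rates are assumptions for illustration; the slides do not name the optimizer that was used.

```python
def loss_gradient(weight, target=3.0):
    """Derivative of the toy loss (weight - target)^2 with respect to weight."""
    return 2.0 * (weight - target)


def gradient_step(weight, gradient, learning_rate):
    """Move the weight against the gradient, scaled by the learning rate."""
    return weight - learning_rate * gradient


start = 0.0
small_step = gradient_step(start, loss_gradient(start), learning_rate=0.01)
large_step = gradient_step(start, loss_gradient(start), learning_rate=0.1)
```

From the same starting weight and the same gradient, the tenfold larger learning rate moves the weight ten times as far (0.6 versus 0.06).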

17 Accuracy
An output is considered correct if the expected output neuron’s activation value is the greatest among all neurons for that category
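This accuracy rule can be sketched for a single category as below. The function names and activation values are hypothetical, and the tie-breaking behavior (the first maximum wins) is an assumption, since the slide does not cover ties.

```python
def is_correct(activations, expected_index):
    """True if the expected neuron has the greatest activation in its category."""
    best = max(range(len(activations)), key=lambda i: activations[i])
    return best == expected_index


def accuracy(predictions, expected_indices):
    """Fraction of outputs whose expected neuron had the greatest activation."""
    hits = sum(
        1 for a, e in zip(predictions, expected_indices) if is_correct(a, e)
    )
    return hits / float(len(predictions))


# Hypothetical activations for a two-class category; class 1 is always expected.
predictions = [[0.2, 0.8], [0.6, 0.4], [0.1, 0.9]]
expected_indices = [1, 1, 1]
score = accuracy(predictions, expected_indices)
```

Two of the three outputs have their greatest activation on the expected neuron, so the score is two thirds.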

18 Initial Attempts
Features: Short sample lengths; WAV inputs only; Trained on training set of only 2 speakers

19 Results

20 False Hope
Features: Short sample lengths; Trained on training set of only 2 speakers
Changes: Both input types

21 Results

22 Hope
Features: Short sample lengths; Both input types; Trained on training set of only 2 speakers
Changes: Train on single batch per speaker per pass through training set; Reduced learning rate

23 Results

24 Confirmation
Features: Short sample lengths; Both input types; Train on single batch per speaker per pass through training set
Changes: Trained on full training set of speakers

25 Results

26 Refinement
Features: Short sample lengths; Both input types; Train on single batch per speaker per pass through training set
Changes: True validation; Decaying learning rate; Epoch duration increased
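One common way to decay a learning rate is a time-based schedule, sketched below. The formula, the initial rate, and the decay constant are assumptions for illustration; the slides do not state which schedule or values were used.

```python
def decayed_learning_rate(initial_rate, decay, step):
    """Time-based decay: the rate shrinks as the step count grows."""
    return initial_rate / (1.0 + decay * step)


# Hypothetical settings: start at 0.01 and decay by 0.5 per step.
rates = [decayed_learning_rate(0.01, 0.5, step) for step in range(4)]
```

With these made-up values the rate starts at 0.01 and has halved to 0.005 by step 2, so later updates make smaller adjustments than early ones.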

27 Results

28 UI

29 Conclusion

30 Works Cited
Weinberger, Steven. (2015). Speech Accent Archive. George Mason University. Retrieved from

