Slide 1: A speech recognition system for Swedish running on Android
Simon Lindholm, LTH, May 7, 2010

Slide 2: Signal Processing - Windowing
● Split raw PCM data into windows
● 25 ms long
● 10 ms shift
● Apply a window function
● Hamming: w(n) = 0.54 - 0.46 * cos(2πn / N)
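A minimal C++ sketch of the framing and windowing step above. The 16 kHz sample rate used to convert 25 ms and 10 ms into sample counts is an assumption, not something stated on the slide.

    #include <cmath>
    #include <vector>

    // Split a PCM signal into overlapping frames and apply a Hamming window.
    // At an assumed 16 kHz sample rate, 25 ms and 10 ms correspond to
    // frame_len = 400 and frame_shift = 160 samples.
    std::vector<std::vector<double>> window_frames(const std::vector<double>& pcm,
                                                   std::size_t frame_len,
                                                   std::size_t frame_shift)
    {
        const double pi = 3.14159265358979323846;
        std::vector<std::vector<double>> frames;
        for (std::size_t start = 0; start + frame_len <= pcm.size(); start += frame_shift) {
            std::vector<double> frame(frame_len);
            for (std::size_t n = 0; n < frame_len; ++n) {
                // Hamming window as on the slide: w(n) = 0.54 - 0.46 * cos(2*pi*n / N)
                double w = 0.54 - 0.46 * std::cos(2.0 * pi * n / frame_len);
                frame[n] = pcm[start + n] * w;
            }
            frames.push_back(frame);
        }
        return frames;
    }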

Slide 3: Signal Processing - MFCC
● Mel scale: a non-linear scale modelling human perception of pitch
● X = DFT(x)
● Apply triangular filters to |X|²
● DCT
● Output is a 13-element feature vector
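A condensed C++ sketch of the MFCC chain listed above, assuming the power spectrum |X|² of one window has already been computed by a DFT. The filter and coefficient counts (29 and 13) follow the configuration file shown later; the exact mel-scale constants are a common convention, not taken from the thesis.

    #include <cmath>
    #include <vector>

    // Common mel-scale conversion formulas (an assumption; the exact constants
    // used in the project are not given on the slide).
    static double hz_to_mel(double hz)  { return 2595.0 * std::log10(1.0 + hz / 700.0); }
    static double mel_to_hz(double mel) { return 700.0 * (std::pow(10.0, mel / 2595.0) - 1.0); }

    // Apply triangular mel filters to the power spectrum of one window, take the
    // log of each filter energy, and decorrelate with a DCT-II.
    std::vector<double> mfcc_from_power_spectrum(const std::vector<double>& power,
                                                 double sample_rate,
                                                 int num_filters /* 29 */,
                                                 int num_coeff   /* 13 */)
    {
        const double pi = 3.14159265358979323846;
        const int nbins = static_cast<int>(power.size());  // bins span 0..Nyquist

        // Filter centre frequencies, equally spaced on the mel scale.
        std::vector<double> centres(num_filters + 2);
        double mel_max = hz_to_mel(sample_rate / 2.0);
        for (int m = 0; m < num_filters + 2; ++m)
            centres[m] = mel_to_hz(mel_max * m / (num_filters + 1));

        // Triangular filters over |X|^2, followed by log.
        std::vector<double> log_energy(num_filters);
        for (int m = 1; m <= num_filters; ++m) {
            double sum = 0.0;
            for (int k = 0; k < nbins; ++k) {
                double f = k * (sample_rate / 2.0) / (nbins - 1);
                double w = 0.0;
                if (f >= centres[m - 1] && f <= centres[m])
                    w = (f - centres[m - 1]) / (centres[m] - centres[m - 1]);
                else if (f > centres[m] && f <= centres[m + 1])
                    w = (centres[m + 1] - f) / (centres[m + 1] - centres[m]);
                sum += w * power[k];
            }
            log_energy[m - 1] = std::log(sum + 1e-10);
        }

        // DCT of the log filter energies gives the cepstral coefficients.
        std::vector<double> mfcc(num_coeff);
        for (int i = 0; i < num_coeff; ++i) {
            double acc = 0.0;
            for (int m = 0; m < num_filters; ++m)
                acc += log_energy[m] * std::cos(pi * i * (m + 0.5) / num_filters);
            mfcc[i] = acc;
        }
        return mfcc;
    }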

Slide 4: Signal Processing
(Diagram: PCM data is transformed into a sequence of feature vectors.)

Slide 5: Hidden Markov Models
● N: number of states
● Π: initial state distribution
● A: transition probabilities
● Ω: termination probabilities
● B: probability density functions
(Figure: a 3-state left-right HMM with transitions a12, a13, a23, initial probability π1 and termination probability Ω3.)
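A sketch of how the HMM parameters listed above could be laid out as a dense C++ structure; the matrix layout matches the later slide on dense phoneme HMMs, but the names are illustrative and not taken from the thesis code.

    #include <vector>

    // Placeholder for the per-state observation pdf (a Gaussian mixture,
    // detailed on the next slide).
    struct GaussianMixture {};

    // Dense HMM holding the parameters listed on the slide.
    struct Hmm {
        int N = 0;                               // number of states
        std::vector<double> pi;                  // Π: initial state distribution, size N
        std::vector<std::vector<double>> A;      // A: transition probabilities, N x N
        std::vector<double> omega;               // Ω: termination probabilities, size N
        std::vector<GaussianMixture> B;          // B: one observation pdf per state
    };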

Slide 6: Hidden Markov Models - Observation Probability Density Functions
● Multivariate Gaussian mixtures
● C: mixture weights
● μ: mean vectors
● Σ: covariance matrices
● Gives probabilities for the feature vectors
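A sketch of the observation probability b_j(o) = Σ_k c_k · N(o; μ_k, Σ_k) for one feature vector. It assumes diagonal covariance matrices (the configuration shown later has use_covariance = false, which this sketch reads as "diagonal only"; that reading is an assumption).

    #include <cmath>
    #include <vector>

    // One component of a diagonal-covariance Gaussian mixture.
    struct GaussianComponent {
        double weight;               // c_k: mixture weight
        std::vector<double> mean;    // mu_k
        std::vector<double> var;     // diagonal of Sigma_k
    };

    // Evaluate b_j(o) for one feature vector o (e.g. a 13-element MFCC vector).
    double observation_probability(const std::vector<GaussianComponent>& mixture,
                                   const std::vector<double>& o)
    {
        const double pi = 3.14159265358979323846;
        double total = 0.0;
        for (const GaussianComponent& g : mixture) {
            // Log-density of a diagonal Gaussian, summed dimension by dimension.
            double log_p = 0.0;
            for (std::size_t d = 0; d < o.size(); ++d) {
                double diff = o[d] - g.mean[d];
                log_p += -0.5 * (std::log(2.0 * pi * g.var[d]) + diff * diff / g.var[d]);
            }
            total += g.weight * std::exp(log_p);
        }
        return total;
    }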

Slide 7: Hidden Markov Models - Modelling Phonemes
● Phonemes are trained as left-right HMMs with 3 states
● Bees clustering for the initial approximation
● Baum-Welch re-estimation for refinement
● 46 phonemes
● Potentially 46³ ≈ 97,000 triphones
● Only about 300 triphones actually used for a dictionary of ~100 words
(Figure: the same 3-state left-right HMM as on slide 5.)

Slide 8: Hidden Markov Models - Building Word Graphs
● Triphone HMMs are combined into a larger HMM representing the words in a dictionary
● Phoneme HMMs are dense: implemented using matrices
● Word graphs are sparse: implemented using graph, node and edge objects
● Trie: words share common prefixes
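A sketch of the prefix-sharing idea: a trie keyed on phoneme identifiers, so that words with a common phoneme prefix share the corresponding graph nodes. The node layout is illustrative, not the thesis' actual graph/node/edge classes.

    #include <map>
    #include <memory>
    #include <string>
    #include <vector>

    // Trie over phoneme sequences: words that share a phoneme prefix share nodes,
    // which keeps the word graph sparse.
    struct TrieNode {
        std::map<std::string, std::unique_ptr<TrieNode>> children;  // next phoneme -> child
        std::string word;  // non-empty if a complete word ends at this node
    };

    void insert_word(TrieNode& root, const std::vector<std::string>& phonemes,
                     const std::string& word)
    {
        TrieNode* node = &root;
        for (const std::string& ph : phonemes) {
            auto& child = node->children[ph];          // reused if the prefix already exists
            if (!child) child = std::make_unique<TrieNode>();
            node = child.get();
        }
        node->word = word;
    }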

Slide 9: Word Recognition - Viterbi Algorithm
● Finds the most probable path through an HMM for a sequence of observations
● N^T possible paths
● Dynamic programming: an optimal path must consist of optimal subpaths
● For every node i at time t, discards every path leading up to it except the one with the highest probability
● O(TN²) instead of O(N^T)
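A sketch of the Viterbi recursion in C++. It works in log probabilities to avoid underflow (an assumption; the slide only describes the algorithm) and returns just the score of the best path, with back-pointers omitted for brevity.

    #include <algorithm>
    #include <limits>
    #include <vector>

    // obs_logprob(j, t) is assumed to return log b_j(o_t) for state j at time t.
    // prev[j] holds the log probability of the best path ending in state j at the
    // previous time step, so each frame costs O(N^2) work: O(TN^2) in total.
    template <typename ObsLogProb>
    double viterbi(int N, int T,
                   const std::vector<double>& log_pi,              // size N
                   const std::vector<std::vector<double>>& log_A,  // N x N
                   ObsLogProb obs_logprob)
    {
        const double neg_inf = -std::numeric_limits<double>::infinity();
        std::vector<double> prev(N), cur(N);

        for (int j = 0; j < N; ++j)
            prev[j] = log_pi[j] + obs_logprob(j, 0);

        for (int t = 1; t < T; ++t) {
            for (int j = 0; j < N; ++j) {
                double best = neg_inf;
                for (int i = 0; i < N; ++i)             // keep only the best predecessor
                    best = std::max(best, prev[i] + log_A[i][j]);
                cur[j] = best + obs_logprob(j, t);
            }
            prev.swap(cur);
        }

        double best = neg_inf;
        for (int j = 0; j < N; ++j)
            best = std::max(best, prev[j]);
        return best;                                    // log probability of the best path
    }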

Slide 10: Word Recognition - Viterbi Beam Search
● For large HMMs (word graphs) the N² term in O(TN²) may become too large
● Exactly the same algorithm as Viterbi, except that only the K best hypotheses are explored
● Implemented using lists of state objects instead of matrices
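A sketch of the pruning step that distinguishes beam search from plain Viterbi: after each frame only the K best hypotheses survive. The explicit "state object" layout follows the slide; everything else (names, fields) is illustrative.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // One active hypothesis: a node in the word graph plus the log score of the
    // best path reaching it at the current time step.
    struct Hypothesis {
        int node_id;
        double log_score;
    };

    // Keep only the K highest-scoring hypotheses for the next frame.
    void prune_to_beam(std::vector<Hypothesis>& active, std::size_t K)
    {
        if (active.size() <= K) return;
        std::partial_sort(active.begin(), active.begin() + K, active.end(),
                          [](const Hypothesis& a, const Hypothesis& b) {
                              return a.log_score > b.log_score;
                          });
        active.resize(K);
    }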

Slide 11: Implementation
● ~22,000 lines of C++, 4,500 lines of Perl
● C++ mostly for the computational heavy lifting
● Perl mostly for text manipulation
● Unix style: one program does one thing
● Common configuration file

Slide 12: Implementation - Waxholm Corpus
● Corpus from the Waxholm dialog project at KTH
● 3 hours 30 minutes of spoken sentences
● Mostly Swedish, some sentences in English
● Sentence-, word- and phoneme-level annotations
● Very complex and irregular annotation file format

Slide 13: Implementation - Configuration File

    math {
        random_seed = 326
    }
    efx {
        type = mfcc
        delta1 = (2,-2)
        delta2 = (1,-1)
        train_ratio = 0.7
        window {
            function = cosine
            length_ms = 25
            shift_ms = 10
            min_filled = 0.8
        }
        mfcc {
            num_coeff = 13
            num_filters = 29
        }
    }
    hmm {
        num_states = 3
        num_mixtures = 2
        use_covariance = false
        statemodel = bakis-x
        train_order = bees, baumwelch
        ...

Slide 14: Implementation - Overview
(Data-flow diagram of the toolchain: parse_corpus.pl reads the Waxholm corpus (waxholm.txt, waxholm.wav); efx_gen and efx_split produce ph_all.efx, ph_train.efx and ph_test.efx; hmm_train produces phonemes.hmm; hmm_test produces the phoneme test results, which mk_cfmatrix turns into a confusion matrix; mk_phlist.pl and mk_gx.pl build ph_list.txt and word_graph.gx from dictionary.txt; word_test takes file1.wav, file2.wav, file3.wav, ... and produces the word test results.)

Slide 15: Implementation - Results
Tests run with:
● 25 ms window, 10 ms shift
● Cosine window function
● 13 MFCC coefficients, 29 filters
● Level-1 and level-2 MFCC deltas
● Triphones modelled as left-right HMMs
● 2 Gaussian mixture components
● 3 Baum-Welch iterations
97 * 5 = 485 words recorded with 3 different speakers:
● 51.8% correct matches
● 85% within the top 10

Slide 16: Implementation - Results: Varying Parameters
Varying parameters from the template configuration:
● Number of Gaussian mixture components
● Number of MFCC coefficients
● MFCC delta levels
● Window functions: rectangular, cosine, Hamming, Hann

Slide 17: Implementation - Results: Varying Parameters
(Results chart.)

Slide 18: Implementation - Results: Varying Parameters
Varying parameters from the template configuration:
● Best percentage achieved: 57.3%
● The more mixtures, MFCC coefficients and MFCC deltas used, the better
● Hamming window slightly better than the other window functions
● With 4 mixtures, 20 MFCC coefficients, level-2 deltas and a Hamming window: 56.5%
● Most likely crossed a threshold with too many parameters for too little training data

Slide 19: Android Port
● Simple recognition algorithm ported to Android as a proof of concept
● ~1,700 lines of Java
● Minimal interaction with the Android environment
● Does not run in real time
● Only tested on the emulator, not on an actual phone
● Signal processing is currently the biggest bottleneck: a very slow DFT is implemented and could be drastically improved
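The slow DFT mentioned above costs O(N²) per window; a radix-2 FFT brings that down to O(N log N). A minimal recursive sketch follows, shown in C++ like the other examples even though the port itself is Java, and assuming the window is zero-padded to a power-of-two length (an assumption not stated on the slide).

    #include <complex>
    #include <vector>

    // Recursive radix-2 Cooley-Tukey FFT, in place; input length must be a
    // power of two. O(N log N) instead of the O(N^2) of a direct DFT.
    void fft(std::vector<std::complex<double>>& a)
    {
        const std::size_t n = a.size();
        if (n <= 1) return;

        // Split into even- and odd-indexed samples and transform each half.
        std::vector<std::complex<double>> even(n / 2), odd(n / 2);
        for (std::size_t k = 0; k < n / 2; ++k) {
            even[k] = a[2 * k];
            odd[k]  = a[2 * k + 1];
        }
        fft(even);
        fft(odd);

        // Combine with the twiddle factors e^{-2*pi*i*k/n}.
        const double pi = 3.14159265358979323846;
        for (std::size_t k = 0; k < n / 2; ++k) {
            std::complex<double> t = std::polar(1.0, -2.0 * pi * k / n) * odd[k];
            a[k]         = even[k] + t;
            a[k + n / 2] = even[k] - t;
        }
    }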

Slide 20: Android Port - Screenshot

Slide 21: Thank you for your time!

