
1
BUT SWS: Massive Parallel Approach
Brno University of Technology, Faculty of Information Technology
Igor Szöke, Lukáš Burget, František Grézl, Lucas Ondel
MediaEval SWS 2013 workshop, October 2013, Barcelona

2
Outline
Systems overview & underlying technologies
AKWS
DTW
Calibration
Fusion
Results and discussion

3
What is it about?
Search audio within audio – low- or zero-resource setting
9 languages:
Czech (MV recordings from broadcast telephone calls)
Non-native English (SuperLectures.com data)
Albanian – prompted
Romanian – prompted
Basque – prompted
4 South African languages – prompted
Altogether 20 hours of dev data, dev terms
Eval data – utterances; eval = dev, only 500 new eval terms
Primary metric – TWV with "satanic" Beta
Extended run – more examples per query
Registered 16, submitted 13
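The primary metric above, Term-Weighted Value (TWV), can be sketched as follows. This is a minimal illustration of the standard definition (1 minus miss probability minus beta times false-alarm probability, averaged over terms); the beta value used in the evaluation is not stated here, so it is left as a parameter.

```python
def twv(n_miss, n_true, n_fa, n_nontarget_trials, beta):
    """TWV for a single term: 1 - P_miss - beta * P_fa."""
    p_miss = n_miss / n_true
    p_fa = n_fa / n_nontarget_trials
    return 1.0 - p_miss - beta * p_fa

def actual_twv(per_term_stats, beta):
    """Average TWV over all terms (ATWV).

    per_term_stats: iterable of (n_miss, n_true, n_fa, n_nontarget_trials).
    """
    values = [twv(*stats, beta) for stats in per_term_stats]
    return sum(values) / len(values)
```

MTWV is then the maximum of this value over the detection threshold.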

4
System overview
Our internal task was to reuse as many atomic systems as we had and to fuse them on the detection level.
We ended up with: 13 atomic systems, 26 QbE subsystems, 19 languages (16 unique), one zero-resource system.
Ingredients: phoneme recognizer, acoustic keyword spotting (AKWS), DTW, calibration, fusion

5
System overview Igor’s Greeting

6
Subsystem
Sentence mean normalization
Neural-network-based features: three-state phone posteriors
Query detector: AKWS, DTW

System          | Posteriors | Adapt. | #outputs
SpeechDat CZ    | LCRC       | O      | 129
SpeechDat HU    | LCRC       | O      | 177
SpeechDat RU    | LCRC       | O      | 150
BABEL CA        | St. BN     | A      | (1045) 660
BABEL PA        |            |        |
BABEL TA        |            |        |
BABEL TU        |            |        |
SWS lang.       | St. BN     | O      | 150
GlobalPhone CZ  | St. BN     | A      | 120
GlobalPhone EN  | St. BN     | A      | 120
GlobalPhone GE  | St. BN     | A      | 126
GlobalPhone PO  | St. BN     | A      | 102
GlobalPhone RU  | St. BN     | A      | 156
GlobalPhone SP  | St. BN     | A      | 102
GlobalPhone TU  | St. BN     | A      | 90
GlobalPhone VI  | St. BN     | A      | 102
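The sentence mean normalization step above can be sketched in a few lines: subtract the per-utterance mean from every feature dimension. A minimal sketch, assuming features are stored as a (frames × dims) array.

```python
import numpy as np

def sentence_mean_norm(features):
    """Subtract the per-utterance (sentence) mean from each feature
    dimension, so every dimension has zero mean within the utterance.

    features: ndarray of shape (n_frames, n_dims).
    """
    return features - features.mean(axis=0, keepdims=True)
```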

7
Atomic system
Adaptation on the target data (GP and BABEL NNs):
Original NN used to label the target data (at the state level)
Then a universal-context, bottleneck neural-network-based classifier is trained
LCRC, SWS2012: used without any adaptation

8
AKWS QbE subsystem
Query -> example-to-text transcription using a phoneme recognizer
Omit initial and final silence
Omit queries having fewer than 3 non-silence phonemes
No LM constraints
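The query filtering described above (trim boundary silence, drop short queries) can be sketched as below. The silence label set is an assumption; the actual recognizer's silence symbols are not given in the slides.

```python
SILENCE = {"sil", "sp", "pau"}  # assumed silence labels (hypothetical)

def usable_query(phoneme_seq):
    """Trim initial/final silence from a recognized query transcription;
    return the trimmed sequence, or None if it has fewer than 3
    non-silence phonemes (such queries are omitted)."""
    seq = list(phoneme_seq)
    while seq and seq[0] in SILENCE:
        seq.pop(0)
    while seq and seq[-1] in SILENCE:
        seq.pop()
    non_silence = [p for p in seq if p not in SILENCE]
    return seq if len(non_silence) >= 3 else None
```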

9
DTW QbE subsystem
Segmental DTW (the query can start at any frame of the utterance)
Log dot product over phoneme-state posteriors
Path costs: 1, 1, 1
On-line normalization of the path: while filling a cell of the distance matrix, the value already accounts for the length of the preceding path
We added VAD as a late submission -> really huge impact
Initial and final silence frames were removed from the examples
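A minimal sketch of the segmental DTW described above, assuming posteriors stored as (frames × states) arrays. The local distance is the negative log of the posterior dot product, the query may start at any utterance frame, and each cell's predecessor is chosen by path-length-normalized cost (one reading of the "on-line normalization" bullet); details of the authors' exact transition scheme are not given in the slides.

```python
import numpy as np

def segmental_dtw(query_post, utt_post):
    """Best path-length-normalized cost of aligning the query against
    any segment of the utterance.

    query_post, utt_post: phone-state posteriors, shape (frames, states).
    """
    Q, U = len(query_post), len(utt_post)
    eps = 1e-10
    # Local distance: -log dot product of posteriors.
    d = -np.log(np.maximum(query_post @ utt_post.T, eps))
    cost = np.full((Q, U), np.inf)   # cumulative path cost
    length = np.zeros((Q, U))        # path length (steps so far)
    cost[0, :] = d[0, :]             # query can start at any utterance frame
    length[0, :] = 1
    for i in range(1, Q):
        for j in range(U):
            # Predecessors (up, diagonal, left), each with step cost 1.
            cands = [(cost[i - 1, j], length[i - 1, j])]
            if j > 0:
                cands.append((cost[i - 1, j - 1], length[i - 1, j - 1]))
                cands.append((cost[i, j - 1], length[i, j - 1]))
            # Pick the predecessor minimizing the normalized cost.
            c, l = min(cands, key=lambda cl: (cl[0] + d[i, j]) / (cl[1] + 1))
            cost[i, j] = c + d[i, j]
            length[i, j] = l + 1
    # Query may also end at any utterance frame; report the best end cost.
    return float((cost[-1, :] / length[-1, :]).min())
```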

10
Calibration
Really important!
No-norm, z-norm, z-norm_sideinfo, m-norm (the best)
Experiments with adding side info [log(#term_occ), #phn, log(#non-silence frames)]: a linear model was trained (using logistic regression); good improvement
M-norm:
Find the peak in the histogram of term scores
Calculate the variance of the data
Apply variance normalization to the whole data set
Subtract the peak (shift the peak to 0)
Even better than z-norm
Side info did not help! (meaning m-norm is calibrated well enough)
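The m-norm steps above can be sketched as follows: variance-normalize the scores, then locate the histogram peak (mode) and shift it to zero. This is one plausible reading of the slide; the bin count and the exact order of peak-finding versus normalization are assumptions.

```python
import numpy as np

def m_norm(scores, bins=50):
    """M-norm sketch: variance-normalize the term scores, then subtract
    the histogram peak so the mode of the distribution sits at 0.

    bins is an assumed parameter; the slides do not specify it.
    """
    scores = np.asarray(scores, dtype=float)
    normed = scores / scores.std()              # variance normalization
    hist, edges = np.histogram(normed, bins=bins)
    k = np.argmax(hist)                         # index of the peak bin
    peak = 0.5 * (edges[k] + edges[k + 1])      # peak bin center
    return normed - peak                        # shift the peak to 0
```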

11
[Figure: score histograms for DTW and AKWS – original, z-norm, m-norm]

12
Calibration of 1 AKWS subsystem, MTWV (UBTWV):
orig (0.1012)
z-norm (0.1434)
z-norm_side (0.1436)
m-norm (0.1611)

13
Fusion
Linear combination of subsystem scores (plus one bias)
Trained to minimize cross entropy (binary logistic regression)
Detections are clustered
A subsystem not producing any score at a given time gets a default score
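The fusion step for one clustered detection can be sketched as below: a weighted sum of subsystem scores plus a bias, with a default score substituted for subsystems that fired no detection at that time. The weights and bias would come from the logistic-regression training mentioned above; the default value here is a placeholder assumption.

```python
import numpy as np

def fuse(detection_scores, weights, bias, default=-10.0):
    """Linear fusion of subsystem scores for one clustered detection.

    detection_scores: dict subsystem name -> score at this detection.
    weights: dict subsystem name -> weight (from logistic regression).
    Subsystems with no detection here receive the assumed default score.
    """
    x = np.array([detection_scores.get(name, default) for name in weights])
    w = np.array(list(weights.values()))
    return float(w @ x + bias)
```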

14
Fusion

15
Results, MTWV (UBTWV)
UBTWV – non-pooled TWV with ideal (oracle) calibration
DTW is superior to AKWS... but the speed...
Still some gaps in calibration (the difference between DEV and EVAL TWV)
NN unsupervised adaptation helped; 1 AKWS subsystem: (0.1154) -> (0.1630) with m-norm!
Lots of directions for research

16
Results – confidential
Table of the other participants' results

17
Conclusion
