2 The 3rd CHiME Speech Separation and Recognition Challenge
Jon Barker, Ricard Marxer (University of Sheffield); Emmanuel Vincent (Inria); Shinji Watanabe (MERL)
ASRU 2015, Scottsdale

3 The 3rd CHiME Speech Separation and Recognition Challenge: Overview
– Background: the 1st and 2nd CHiME Challenges
– The CHiME-3 scenario and task design
– The baseline enhancement and ASR systems
– The challenge results
– Some findings and points for discussion
15th Dec 2015

4 CHiME-3 Background
– The 1st CHiME challenge (2011) was supported by the EU PASCAL network.
– CHiME: Computational Hearing in Multisource Environments.
– Speech from the Grid corpus, i.e. simple command sentences.
– Noise from binaural recordings of domestic environments.
– Speech and noise mixtures simulated using impulse responses recorded 2 m from the microphones.

5 CHiME-3 Background
Top systems came very close to human performance, but the ASR task was too narrow and too artificial.

6 CHiME-3 Background
The 2nd CHiME challenge was held after ICASSP 2013 and tried to address the biggest limitations of the 1st challenge:
– Artificial mixing, but with time-varying impulse responses to simulate small talker movements.
– A move from the Grid corpus to the WSJ 5k task.
A step in the right direction, but doubts remained over the validity of using artificially mixed test data.

7 CHiME-3 Objectives
Feedback from the 1st and 2nd CHiME challenges led to the following objectives for CHiME-3:
– A commercially relevant scenario, e.g. a move from binaural recordings to a conventional microphone array.
– A larger variety of noise environments.
– Increased realism of the data, i.e. from artificial mixing to speech spoken and recorded live in noise.
We also wanted to explicitly examine the role of simulated data (i.e. artificially mixed speech + noise):
– Is simulated data useful for augmenting training data?
– Can we trust evaluations that use simulated test data?
– How can noisy microphone-array speech data best be simulated?

8 The CHiME-3 Scenario
“ASR running on a mobile tablet device being used in noisy everyday settings, e.g. in cafés, on the street, etc.”

9 The CHiME-3 Hardware
An Android tablet with a custom-built surround holding 6 microphones: 5 facing forward and 1 facing backward.

10 The CHiME-3 Recording Set-up
A portable, battery-powered recording set-up that records the 6 tablet microphones and a close-talking headset microphone onto a pair of external digital recorders.

11 CHiME-3 Speech Data
– CHiME-3 is based on the WSJ 5k task.
– 12 native US speakers (6 male, 6 female), divided into 4 for training, 4 for development and 4 for test.
– 4 recording environments (café, street junction, pedestrian area, bus).
– Dev and test sets are the same as WSJ0 (i.e. 410 and 330 utterances), recorded in each of the 4 environments.
– Training data:
  – a 1600-utterance subset of the WSJ training data, recorded in the real environments;
  – 7138 simulated mixtures (WSJ + CHiME background noise).

12 The CHiME-3 Noise Environments
Sitting in a café; standing at a street junction; travelling on a bus; in a pedestrian area.

13 CHiME-3 Baseline System
Three components:
– Baseline simulation (signals + MATLAB code)
– Baseline enhancement (signals + MATLAB code)
– Baseline ASR (Kaldi recipe)

14 CHiME-3: Baseline Systems
Baseline simulation: can simulated data be used to augment the limited amount of real training data?
Technique (a sketch of the final mixing step follows this slide):
1. Estimate the SNR at each tablet microphone using the close-talking microphone.
2. Track the speaker location using SRP-PHAT and calculate the time-varying delays to each tablet microphone.
3. Convolve the WSJ utterances with the resulting time-varying filters.
4. Apply a filter to match each microphone's frequency response.
5. Mix with background noise collected in the CHiME-3 environments.
Issues: no Lombard-like effects or reverberation; speaker tracking can be poor; the true SNR is hard to estimate.
(Audio examples on the slide: original clean vs. simulated noisy.)
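
The CHiME-3 baseline simulation itself is distributed as MATLAB code; the fragment below is only an illustrative Python sketch of the SNR-controlled mixing in step 5, assuming the target SNR from step 1 is already known and simplifying to a single channel. The function name mix_at_snr and its interface are assumptions for illustration, not part of the released baseline.

    import numpy as np

    def mix_at_snr(speech, noise, snr_db):
        """Mix clean speech with background noise at a target SNR (in dB).

        Single-channel simplification: the real baseline additionally tracks
        the speaker (SRP-PHAT), convolves the speech with time-varying filters
        per microphone and matches each microphone's frequency response.
        """
        noise = noise[:len(speech)]            # assume the noise excerpt is long enough
        p_speech = np.mean(speech ** 2)
        p_noise = np.mean(noise ** 2) + 1e-12  # guard against silence
        # Scale the noise so that 10*log10(p_speech / p_noise_scaled) equals snr_db.
        gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
        return speech + gain * noise

For example, mix_at_snr(wsj_utterance, cafe_noise, snr_db=5.0) would produce a simulated noisy utterance at 5 dB SNR for one channel.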

15 CHiME-3: Baseline Systems
Baseline enhancement: MVDR beamforming (see the sketch after this slide)
– The multichannel noise covariance matrix is estimated using up to 800 ms of context prior to the utterance.
Baseline ASR: two baseline Kaldi systems
– GMM system: triphone models, LDA, MLLT, fMLLR, SAT.
– DNN system: RBM pre-training, cross-entropy training, sequence-discriminative training.
(Audio examples on the slide: simulated mixture and its enhanced version; real mixture and its enhanced version.)
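
The distributed enhancement baseline is MATLAB code; the sketch below is not that code but a minimal MVDR beamformer in the same spirit, written with numpy/scipy under two assumptions: the noise spatial covariance is estimated from a leading noise-only context (800 ms, matching the slide), and the per-frequency steering vectors toward the speaker are provided from elsewhere (in the baseline they come from the tracked speaker position). The function and argument names are illustrative.

    import numpy as np
    from scipy.signal import stft, istft

    def mvdr_enhance(x, fs, steering, context_s=0.8, nfft=512, hop=128):
        """Apply a simple MVDR beamformer to a multichannel recording.

        x        : (n_channels, n_samples) time-domain noisy signal
        steering : (n_freqs, n_channels) steering vectors toward the speaker,
                   assumed to be estimated elsewhere (e.g. from speaker tracking)
        The noise spatial covariance is estimated from the first `context_s`
        seconds, assumed to contain noise only.
        """
        f, t, X = stft(x, fs=fs, nperseg=nfft, noverlap=nfft - hop)  # (C, F, T)
        X = X.transpose(1, 2, 0)                                     # (F, T, C)
        n_freqs, n_frames, n_chan = X.shape
        n_ctx = max(1, int(context_s * fs / hop))                    # noise-only frames

        Y = np.zeros((n_freqs, n_frames), dtype=complex)
        for k in range(n_freqs):
            N = X[k, :n_ctx, :]                                      # (frames, channels)
            # Noise spatial covariance with a little diagonal loading.
            Rn = N.T @ N.conj() / N.shape[0] + 1e-6 * np.eye(n_chan)
            d = steering[k][:, None]                                 # (C, 1)
            w = np.linalg.solve(Rn, d)
            w = w / (d.conj().T @ w)                 # w = Rn^-1 d / (d^H Rn^-1 d)
            Y[k] = (w.conj().T @ X[k].T).ravel()     # beamformer output per frame

        _, y = istft(Y, fs=fs, nperseg=nfft, noverlap=nfft - hop)
        return y

This is one standard MVDR formulation; the actual baseline differs in implementation details (e.g. how the steering vectors and the noise context are obtained), so read it as an illustration of the approach rather than the released code.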

16 CHiME-3: Baseline Word Error Rates (WER, %)

Development data:
  System  Training    Testing     Simulated  Real
  GMM     Clean data  Noisy data  50.3       55.7
  GMM     Noisy data  Noisy data             18.7
  GMM     Enhanced    Enhanced    9.8        20.6
  DNN     Noisy data  Noisy data  14.3       16.1
  DNN     Enhanced    Enhanced    8.2        17.7

Final test data:
  System  Training    Testing     Simulated  Real
  DNN     Noisy data  Noisy data  21.5       33.4
  DNN     Enhanced    Enhanced    8.1        33.8

17 CHiME-3: Submissions
– The full LDC-licensed dataset was distributed to 65 sites.
– 26 official submissions were received.
– Large teams: an average of 5 authors per paper.
– Involvement from 36 institutions, with an even split between the US, Europe and Asia.
– A mix of academic and industrial participants, e.g. Hitachi, NTT, MERL.
– A bias towards signal-processing researchers; the challenge failed to attract participation from many big speech groups.

19 CHiME-3: General Conclusions
– The best WER, 5.8%, is very close to noise-free speech performance. A solved problem?
– Performance on simulated data is often a poor predictor of performance on real data, which highlights the need for caution when considering challenges that use artificial mixing.
– Simulated training data is a valuable tool when real data is in short supply, but care is needed to avoid mismatch. In particular, the baseline simulated data responded differently to microphone-array processing, leading to mismatched enhanced signals.
– The biggest gains with respect to the baseline came from improved multichannel signal processing, feature normalisation and language modelling.
– It is important to have a strong and accessible baseline, but one is difficult to prepare when data-challenge timescales are short. Kaldi was invaluable.
– We have released a new Kaldi baseline with BeamformIt array processing, fMLLR DNN features and 5-gram and RNN language-model rescoring; it scores 12.8% (cf. 33.4% for the initially distributed baseline).
– What next?

23 CHiME-3: Future Directions
– Fewer microphones?
– Mismatched noise conditions.
– A more challenging, bigger-scale task:
  – large talker-microphone distances;
  – more complex speech;
  – a greater number of noise backgrounds and speakers.

24 Thank you for listening

