1 Update on WordWave Fisher Transcription
Owen Kimball, Chia-lin Kao, Jeff Ma, Rukmini Iyer, Rich Schwartz, John Makhoul

2 Outline
- Schedule update
- Investigating WordWave + auto segmentation quality
  - Updated evaluation method
  - Separating the effects of transcripts and segmentation
  - Improved segmentation algorithm
- Plans
- Update on using Fisher data in training

3 Data Schedule
- BBN has received 925 hours from WordWave (WWave)
- Processed and released 478 hours via LDC:
  - 91 hrs on 8/1/03
  - 300 hrs on 9/24/03
  - 87 hrs on 10/21/03
- WWave is currently running more slowly than planned
  - Reason: CTS transcription is hard!
  - They will complete 1600 hrs by the end of Jan 04, with the remaining 200 hrs to follow as quickly as possible

4 Segmentation Quality as of Sept 03
- Auto segmentation goal: given audio and a transcript with no timing information, break the audio into fairly short segments and align the correct text to each segment
- In September, we compared transcription and segmentation approaches on a 20-hour Swbd set:
  - LDC/MSU careful transcription and manual segmentation, vs.
  - LDC fast transcription and manual segmentation, vs.
  - WWave transcripts + BBN automatic segmentation
- Compared 2 different segmentation algorithms (Alg I is sketched below):
  - Alg I: run the recognizer and segment at "reliable" silences; decode using that segmentation and reject segments based on sclite alignment errors
  - Alg II: use the recognizer to get a coarse initial segmentation, then run forced alignment within the coarse segments to find finer segments; final rejection pass as before
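For illustration only, a minimal Python sketch of the Alg I idea: cut the recognizer's time-aligned output at long silences, then compare each segment's decoded text against the transcript words assigned to it and reject segments that disagree too much. The Word type, thresholds, and function names are assumptions for this sketch, not BBN's actual tools.

    from dataclasses import dataclass

    @dataclass
    class Word:
        text: str
        start: float  # start time in seconds
        end: float    # end time in seconds

    MIN_SIL = 0.5       # a gap this long counts as a "reliable" silence (assumed value)
    MAX_ERR_RATE = 0.3  # reject segments whose text disagrees more than this (assumed)

    def edit_distance(a, b):
        """Levenshtein distance between two word lists (rolling array)."""
        d = list(range(len(b) + 1))
        for i, wa in enumerate(a, 1):
            prev, d[0] = d[0], i
            for j, wb in enumerate(b, 1):
                prev, d[j] = d[j], min(d[j] + 1,           # deletion
                                       d[j - 1] + 1,       # insertion
                                       prev + (wa != wb))  # substitution
        return d[-1]

    def segment_at_silences(words):
        """Cut the recognized word stream at every inter-word gap of
        at least MIN_SIL seconds."""
        if not words:
            return []
        segs, cur = [], [words[0]]
        for prev, w in zip(words, words[1:]):
            if w.start - prev.end >= MIN_SIL:
                segs.append(cur)
                cur = []
            cur.append(w)
        segs.append(cur)
        return segs

    def keep_segment(hyp_words, ref_words):
        """Accept a segment only if its decoded text and the transcript
        text assigned to it mostly agree (sclite-style rejection)."""
        errs = edit_distance([w.text for w in hyp_words],
                             [w.text for w in ref_words])
        return errs <= MAX_ERR_RATE * max(len(ref_words), 1)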

5 Performance Comparison in Sept
- Unadapted recognition; acoustic models trained on the 20-hour Swbd1 set, LM trained on full Switchboard
- ML, GI, VTL, HLDA-trained models

  Transcripts / Segmentation      Training hours   Eval01 WER
  Manual LDC+MSU CTRAN / manual   …                …
  Fast LDC / manual               …                …
  WWave / Alg I                   …                …
  WWave / Alg II                  …                …
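For reference, the WER reported throughout these tables is the standard word error rate:

    WER = (S + D + I) / N

where S, D, and I are the substitutions, deletions, and insertions in the best alignment of the hypothesis to the reference, and N is the number of reference words.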

6 Improving the Evaluation Method
- A number of issues and shortcuts in the training and test setup clouded the earlier comparisons
- We therefore:
  - Adopted an improved training sequence, including new binaries
  - Reduced pruning errors in decoding
  - Converted from fast, approximate VTL estimation to a more careful approach
  - Adopted more stable VTL models
- VTL models trained on 20 hours differed dramatically for small changes in segmentation
  - This is a bug in our VTL model estimation that we need to fix
  - For the following experiments we used stable VTL models from the RT03 eval
- Switched from our historic LDC+MSU baseline to all-MSU for simplicity

7 Comparison with Better Train and Test

  Transcripts / Segmentation   Training hours   Eval01 WER
  LDC+MSU                      …                …
  MSU                          …                …
  Fast LDC                     …                …
  WWave / Alg I                …                …
  WWave / Alg II               …                …

8 Separating the Effect of Segmentation
- Compare segmentations using identical (MSU) transcripts
- Alg I WER is the same for WWave vs. MSU transcripts
- Segmentation may be the biggest, or only, problem

  Transcripts / Segmentation   Training hours   Eval01 WER
  MSU / MSU                    …                …
  MSU / Alg I                  …                …

9 Segmentation Algorithm III
- Algorithm II used forced alignment within coarse segments provided by an initial recognition pass, but examination revealed unrecoverable errors (words placed in the wrong segment) caused by the coarse initial segmentation
- Tried forced alignment of complete conversation sides
- Overcame initial problems with failed alignments by:
  - Pre-chopping out long silences, where our system tends to get confused
    - Used the auto-segmenter developed for the RT03 CTS eval for this
  - Changing the forced alignment program to do much less pruning at the beginning and end of each conversation
    - This accommodated things like beeps, line noise, and words cut off by recording start and stop
- Forced alignment is followed by a script that breaks segments at silences, then a rejection pass (the overall pipeline shape is sketched below)
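Again purely as an illustration, the overall shape of Alg III might look like the following. Every callable here is a hypothetical stand-in for a real component named on the slide (the RT03 auto-segmenter, the modified forced aligner, the silence-breaking script, the rejection pass), injected rather than implemented, since none of those interfaces are given in the source.

    def alg3_segment(audio_side, transcript, *,
                     chop_long_silences,   # RT03 CTS auto-segmenter stand-in
                     forced_align,         # aligner with relaxed edge pruning
                     break_at_silences,    # script that cuts at pauses
                     accept_segment):      # sclite-style rejection test
        """Hypothetical outline of the Alg III pipeline."""
        # 1. Pre-chop long silences, where whole-side alignment gets confused.
        chunks = chop_long_silences(audio_side)
        # 2. Force-align the full transcript to the remaining audio; relaxed
        #    pruning at the side's edges absorbs beeps, line noise, and
        #    words cut off by recording start/stop.
        aligned_words = forced_align(chunks, transcript)
        # 3. Break the aligned word stream into segments at silences.
        segments = break_at_silences(aligned_words)
        # 4. Final rejection pass, as in Algs I and II.
        return [seg for seg in segments if accept_segment(seg)]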

10 Algorithm III with MSU transcripts

  Transcripts / Segmentation   Training hours   Eval01 WER
  MSU / MSU                    …                …
  MSU / Alg I                  …                …
  MSU / Alg III                …                …

- Manually comparing MSU and Alg III segmentations showed that Alg III:
  - had more, shorter segments
  - had less silence padding around utterances
  - allowed utterances longer than 15 seconds when the speaker did not pause
- Modified Alg III to approximate MSU's statistics (one way to do this is sketched below)
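One plausible way to impose MSU-like segment statistics when breaking the aligned word stream, reusing the Word type from the first sketch. The numeric targets are assumptions for illustration, not values reported in the experiments.

    MAX_UTT = 15.0  # cap utterance duration (sec), as the manual segments do (assumed)
    PAD_SIL = 0.2   # silence padding to keep around each utterance (sec, assumed)
    MIN_CUT = 0.3   # minimum pause treated as a segment boundary (sec, assumed)

    def break_with_constraints(words):
        """Single pass over time-aligned words: cut at any pause of at
        least MIN_CUT seconds, and force a cut if a segment would
        otherwise exceed MAX_UTT seconds."""
        segs, cur = [], []
        for w in words:
            if cur and (w.start - cur[-1].end >= MIN_CUT
                        or w.end - cur[0].start > MAX_UTT):
                segs.append(cur)
                cur = []
            cur.append(w)
        if cur:
            segs.append(cur)
        return segs

    def padded_bounds(seg):
        """Segment time boundaries with a small silence pad on each side."""
        return max(0.0, seg[0].start - PAD_SIL), seg[-1].end + PAD_SIL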

11 Improved Algorithm III

  Transcripts / Segmentation   Training hours   Eval01 WER
  MSU / MSU                    …                …
  MSU / Alg I                  …                …
  MSU / Original Alg III       …                …
  MSU / Improved Alg III       …                …

- Matching MSU's utterance lengths and silence padding improves WER slightly
- Alg III seems good enough, at least for this task

12 Results with WordWave Transcripts
- WWave transcripts seem fine, given the improved segmentation

  Transcripts / Segmentation   Training hours   Eval01 WER
  MSU / MSU                    …                …
  Fast LDC                     …                …
  WWave / Alg I                …                …
  WWave / Original Alg III     …                …

13 Plans
- Confirm quality of WWave transcripts with Alg III segmentation:
  - On the Swbd 20-hour set, train MMI models to compare all-MSU vs. WWave/Alg III
  - On the Swbd + Fisher experiment where we got gains using Alg I segmented data; performance should not degrade
- Improve the speed of Alg III
- Resegment and redistribute all data that has been released so far
- Catch up with, and continue segmenting, the latest WWave transcript deliveries

14 Update on Adding Fisher Data
- In Martigny, showed a 1.4% gain from adding 150 hrs of Fisher data (Alg I segmented) to the RT03 training set
- Hoped to have results with 350 hours, but we had bugs in our initial runs
- Did train MMI on RT03 (sw370) vs. RT03 + Fisher150
- Results are on the 2nd adaptation pass with POS LM rescoring
- CAVEAT: non-rigorous comparison! The Fisher150 system was optimized (gains …%); it used a different phone set and faster training (which degrades results by 0.2% in other comparisons)

  Training        Eval03 WER
  RT03: SW370     …
  + Fisher150     …