Slide 1: Effects of Explicitly Modeling Noise Words
Chia-lin Kao, Owen Kimball, Spyros Matsoukas

Slide 2: Outline
• Motivation
• BBN's standard training procedure without noise words
• Effect of noise words in ML training
• Effect of noise words in discriminative training
• Conclusions

Slide 3: Motivation
• BBN's English CTS system does not train with noise words in transcripts
• For RT04 non-English CTS systems, we found that using noise words helped
  – [LAUGH], [NOISE], [SIGH], etc., appear in the transcripts used to train non-English ML models
  – Levantine system: 1.6% gain on the unadapted LevAr.Dev04 test
  – Mandarin system: 1.0% gain on the unadapted Man.CTS.Dev04 test
• Do these results hold for English? For discriminative training?
  – Success would simplify the preparation of Fisher training transcripts: no need to change transcripts and re-segment

Slide 4: Noise Words in English Transcripts
• The MSU Jan 2000 Switchboard I transcripts include: [laughter], [noise], [vocalized-noise]
• For RT02, BBN switched to the CU-HTK training transcripts, in which explicit noise words were removed from the MSU transcripts
  – Found no significant difference in performance compared with the previous BBN transcripts
  – Assumed noise words were a no-op, but there were other differences and we did not test which ones helped or hurt
• The WordWave Fisher transcripts include [LAUGH], [NOISE], [MN], [COUGH], [LIPSMACK], [SIGH]
• BBN's RT04 CTS English system removes noise words from the transcripts and relies on the silence HMM to model them (a sketch of this stripping step follows)
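As a minimal illustration of the stripping step above, here is a hedged sketch in Python; the bracketed-tag regex and the helper name are assumptions about the transcript format, not BBN's actual tooling.

```python
import re

# Assumed format: noise tokens are bracketed tags such as [LAUGH],
# [NOISE], or [vocalized-noise]. The pattern and helper are illustrative.
NOISE_TOKEN = re.compile(r"\[[A-Za-z-]+\]")

def strip_noise_words(line: str) -> str:
    """Remove bracketed noise tokens and collapse leftover whitespace."""
    return " ".join(NOISE_TOKEN.sub(" ", line).split())

print(strip_noise_words("yeah [LAUGH] i think so [LIPSMACK]"))
# -> "yeah i think so"
```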

Slide 5: Training Procedure without Noise Words
• Process training transcripts (see the sketch after this slide)
  – Drop utterances containing only noise words
  – Map noise words to silence
• Train initial ML models and generate word alignments
• Remove long silences
  – Using the alignment information, chop utterances containing silences longer than two seconds
• Train final ML models using the processed transcripts and segmentation
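A minimal sketch of the data side of this pipeline, assuming utterances are word lists and alignments are (word, start, end) tuples; the silence symbol and record layout are assumptions, not the actual BBN formats.

```python
NOISE_WORDS = {"[LAUGH]", "[NOISE]", "[MN]", "[COUGH]", "[LIPSMACK]", "[SIGH]"}
SILENCE = "<sil>"

def preprocess(utterances):
    """Drop noise-only utterances; map remaining noise words to silence."""
    processed = []
    for words in utterances:
        if all(w in NOISE_WORDS for w in words):
            continue  # the utterance contains nothing but noise words
        processed.append([SILENCE if w in NOISE_WORDS else w for w in words])
    return processed

def chop_long_silences(alignment, max_sil=2.0):
    """Split one aligned utterance at silences longer than max_sil seconds.

    `alignment` is a list of (word, start_sec, end_sec) tuples from the
    initial ML alignment pass; returns a list of shorter segments.
    """
    segments, current = [], []
    for word, start, end in alignment:
        if word == SILENCE and end - start > max_sil:
            if current:
                segments.append(current)  # close the segment at the gap
            current = []
        else:
            current.append((word, start, end))
    if current:
        segments.append(current)
    return segments
```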

Slide 6: Effect of Noise Words in ML Training
• Comparison experiments, Fisher training
  – Train ML models using 330 hours of automatically segmented Fisher data, with and without noise words in the transcripts
• Validation experiments, Switchboard training
  – Train ML models using 180 hours of Switchboard data
  – With noise words: MSU's original transcripts
  – Without noise words: CU's processed transcripts
• Test models on the combined Eval03 and Dev04 test set

Slide 7: Fisher Training Experiments
• Without noise words: train as described two slides back (slide 5)
• With noise words: use four phonemes to model six noise words; transcripts and segmentation unaltered (see the lexicon sketch below)

  Noise word     Phonetic spelling
  [COUGH]        COF-COF
  [LAUGHTER]     LAF-LAF
  [NOISE]        AMN-AMN
  [MN]           BRN-BRN
  [SIGH]         BRN-BRN
  [LIPSMACK]     BRN-BRN
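The table above, written out as a pronunciation dictionary; reading "COF-COF" as a two-phoneme spelling (COF COF) is an assumption about the notation.

```python
# Noise-word lexicon from the table; each spelling is read as two phones.
NOISE_LEXICON = {
    "[COUGH]":    ["COF", "COF"],
    "[LAUGHTER]": ["LAF", "LAF"],
    "[NOISE]":    ["AMN", "AMN"],
    "[MN]":       ["BRN", "BRN"],
    "[SIGH]":     ["BRN", "BRN"],
    "[LIPSMACK]": ["BRN", "BRN"],
}

# Four distinct phonemes (COF, LAF, AMN, BRN) cover the six noise words.
assert len({p for phones in NOISE_LEXICON.values() for p in phones}) == 4
```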

Slide 8: Fisher Experiment Results

  Noise words in transcripts   Unadapted WER (Eval03+Dev04)
  No                           26.2
  Yes                          25.4

• Noise words in the acoustic modeling (AM) and language modeling (LM) transcripts give a 0.8% WER gain

Slide 9: Diagnostic Experiments
• Is the gain with noise words due to better acoustic modeling or better language modeling?
• Expt I: keep explicit noise words in the transcripts but model them as silence, i.e., spell all noise words using the silence phoneme (a sketch of this lexicon follows the table)
• Expt II: test the acoustic models from Expt I using LMs trained on transcripts without noise words

  Expt   Noise words in AM transcripts?   Noise phones   Noise words in LM transcripts?   Unadapted WER (Eval03+Dev04)
  –      No                               –              No                               26.2
  –      Yes                              Noise          Yes                              25.4
  I      Yes                              Silence        Yes                              25.5
  II     Yes                              Silence        No                               25.6
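A sketch of the Expt I lexicon: the noise words stay in the transcripts, but each one is spelled with the silence phoneme. The phone symbol "SIL" is an assumption.

```python
# Expt I: same transcripts, but every noise word is spelled as silence,
# so any difference from the baseline comes only from where model
# initialization sees the (silence-spelled) noise words.
NOISE_WORDS = ["[COUGH]", "[LAUGHTER]", "[NOISE]", "[MN]", "[SIGH]", "[LIPSMACK]"]
SILENCE_PHONE = "SIL"  # assumed symbol for the silence model

expt1_lexicon = {word: [SILENCE_PHONE] for word in NOISE_WORDS}
# e.g. expt1_lexicon["[COUGH]"] == ["SIL"]
```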

Slide 10: Diagnostic Experiments, cont'd
• Including or excluding noise words from the LM training has no significant effect on performance
• Noise words in the transcripts improve performance whether they are trained as noise models or as silence
  ==> Acoustic model initialization improves when noise words are explicitly marked in the transcripts

Slide 11: ML Training on the Switchboard Corpus
1. Use 2385 Switchboard I conversations from the Eval03 training set, processed and segmented by CU (160 hours)
2. Use the same 2385 conversations from the original MSU Switchboard I Jan 2000 release (180 hours)
3. Apply the auto-segmentation process to the MSU version of the conversations, producing 180 hours

  Noise words in training?   Segmentation   Unadapted WER (Eval03+Dev04)
  No                         CU             28.4
  Yes                        MSU manual     27.7
  Yes                        BBN auto       27.6

Slide 12: Effects with Discriminative Training
• Trained SI-MPE models using the baseline 330-hour Fisher ML models as the seed models

  Noise words in transcripts   Unadapted WER (Eval03+Dev04)
  No                           23.6
  Yes                          23.4

• Noise words still yield better models, but the gain is just 0.2%

Slide 13: Conclusions
• Including noise words in the transcripts results in better model initialization in acoustic training
• The discriminative training procedure overcomes most of the poor initial estimates that result when noise words are not explicitly marked in the transcripts
• We can directly use the Fisher transcripts output by BBN / WordWave, i.e., there is no need to map noise words and re-segment