Lightly Supervised and Unsupervised Acoustic Model Training Lori Lamel, Jean-Luc Gauvain and Gilles Adda Spoken Language Processing Group, LIMSI, France.


Lightly Supervised and Unsupervised Acoustic Model Training Lori Lamel, Jean-Luc Gauvain and Gilles Adda Spoken Language Processing Group, LIMSI, France CSL 2002 Reporter: Shih-Hung Liu 2007/03/05

2 Outline Abstract Introduction Lightly supervised acoustic model training System description Impact of the amount of acoustic training data Impact of the language model training material Unsupervised acoustic model training Conclusions

3 Abstract
This paper describes some recent experiments using lightly supervised and unsupervised techniques for acoustic model training in order to reduce the system development cost.
The approach uses a speech recognizer to transcribe unannotated broadcast news.
The hypothesized transcription is optionally aligned with closed-captions to create labels for the training data.

4 Introduction
Despite the rapid progress made in LVCSR, many outstanding challenges remain.
One of the main challenges is to reduce the development cost required to adapt a recognition system to a new task.
With today's technology, adapting a recognition system to a new task requires large amounts of transcribed acoustic training data.
One of the most often cited development costs is that of obtaining this necessary transcribed acoustic training data, which is an expensive process in terms of both manpower and time.

5 Introduction
There are certain audio sources, such as radio and television news broadcasts, that can provide an essentially unlimited supply of acoustic training data.
However, for the vast majority of audio data sources there are no corresponding accurate word transcriptions.
Some of these sources also broadcast manually derived closed-captions.
There may also exist other sources of information with different levels of completeness, such as approximate transcriptions, summaries or keywords, which can be used to provide some supervision.

6 Introduction
The basic idea is to use a speech recognizer to automatically transcribe raw audio data, thus generating approximate transcriptions for the training data.
Training on all of the automatically annotated data is compared with using the closed-captions to filter the hypothesized transcriptions, thus removing words that are potentially incorrect and training only on the words on which the two agree.

7 Lightly supervised acoustic model training
The following training procedure, which can be used with all of the different levels of supervision, is used in this work:
1. Normalize the available text materials (e.g. newspaper and newswire, commercially produced transcripts, closed-captions, detailed transcripts of acoustic training data) and train an n-gram language model
2. Partition each show into homogeneous segments, labelling the acoustic attributes (speaker, gender, bandwidth)
3. Train acoustic models on a small amount of manually annotated data (1 h or less)
4. Automatically transcribe a large amount of raw training data
5. Optionally align the closed-captions with the automatic transcriptions (using a dynamic programming algorithm), removing speech segments where the two transcripts disagree
6. Run the standard acoustic model training procedure on the speech segments using the automatic transcripts
7. Reiterate from step 4
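Step 5 above, aligning the closed-captions with the recognizer output and keeping only the material on which they agree, can be sketched at the word level. This is a minimal sketch, not the paper's implementation: difflib's matching-block alignment stands in for the dynamic-programming alignment, segment-level removal is simplified to word-level filtering, and the caption/hypothesis strings are invented.

```python
import difflib

def filter_by_captions(hyp_words, caption_words):
    # Keep only hypothesis words that agree with the closed captions.
    # SequenceMatcher finds the longest matching blocks between the two
    # word sequences, a stand-in for the DP alignment of step 5.
    matcher = difflib.SequenceMatcher(a=hyp_words, b=caption_words,
                                      autojunk=False)
    kept = []
    for block in matcher.get_matching_blocks():
        kept.extend(hyp_words[block.a:block.a + block.size])
    return kept

hyp = "the president said he will veto the bill".split()
cap = "president said she will veto this bill".split()
print(filter_by_captions(hyp, cap))
# → ['president', 'said', 'will', 'veto', 'bill']
```

Words where the hypothesis and the captions disagree ("he" vs. "she", "the" vs. "this") are dropped rather than trusted for training.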

8 Lightly supervised acoustic model training

9 System description
The LIMSI broadcast news transcription system has two components:
–an audio partitioner, which divides the continuous stream of acoustic data into homogeneous segments and associates appropriate labels with the segments
–a word recognizer, which runs in multiple passes: initial hypothesis generation (used for MLLR adaptation), word graph generation (trigram LM), and final hypothesis generation (4-gram LM)
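The partitioner-plus-multi-pass structure can be sketched as a small pipeline. This is only an illustrative skeleton: `partition`, `decode` and `rescore` are hypothetical stand-ins for the real LIMSI components, not their actual interfaces.

```python
def transcribe_show(audio, partition, decode, rescore):
    # Sketch of the two-component system: an audio partitioner followed
    # by a multi-pass word recognizer (initial pass used for MLLR
    # adaptation, trigram word-graph generation, 4-gram rescoring).
    results = []
    for segment in partition(audio):    # homogeneous, labelled segments
        first = decode(segment, lm="trigram", adapt_to=None)
        graph = decode(segment, lm="trigram", adapt_to=first)  # after MLLR
        results.append(rescore(graph, lm="4gram"))             # final pass
    return results
```

Each segment is decoded once to obtain adaptation statistics, re-decoded with adapted models to produce a word graph, and the graph is rescored with the larger language model.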

10 Impact of the amount of acoustic training data As expected, when more training data is used, the word error rate decreases

11 Impact of the language model training material

12 Impact of the language model training material
LMa (baseline Hub4 LM): newspaper and newswire (News), commercially produced transcripts (Com) pre-dating June 1998, and acoustic transcripts
News.Com.Cap: newspaper and newswire, commercially produced transcripts, and closed-captions (Cap) during May 1998
News.Com: newspaper and newswire, and commercially produced transcripts during May 1998
News.Cap: newspaper and newswire and closed-captions during May 1998
News: newspaper and newswire during May 1998
News.Com97: newspaper and newswire during May 1998, commercially produced transcripts during December 1997
News.Com97.Cap: newspaper and newswire and closed-captions during May 1998, commercially produced transcripts during December 1997
News97: newspaper and newswire during December 1997

13 Unsupervised acoustic model training

14 Unsupervised acoustic model training

15 Conclusions
In this work, we have investigated the use of low-cost data to train acoustic models for broadcast news transcription.
This method requires substantial computation time, but little manual effort.
A question that remains unanswered is:
–Can better performance be obtained using large amounts of automatically annotated data than with a smaller, but still large, amount of manually annotated data? And if so, how much data is needed?

Unsupervised Training of Acoustic Models for Large Vocabulary Continuous Speech Recognition Frank Wessel and Hermann Ney RWTH Aachen, Germany IEEE SAP January 2005 Reporter: Shih-Hung Liu 2007/03/05

17 Outline Abstract Introduction Description of the training procedure Bootstrapping with an optimized system Bootstrapping with a low-cost system Iterative application of the unsupervised training Unsupervised training of an across-word system Conclusions and outlook

18 Abstract
For LVCSR systems, the amount of acoustic training data is of crucial importance.
Since untranscribed speech is available in various forms nowadays, unsupervised training is studied in this paper.
A low-cost recognizer is used to recognize a large amount of untranscribed acoustic data.
These transcriptions are then filtered with a confidence measure to detect possible recognition errors.
Finally, the unsupervised training is applied iteratively.
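The confidence-based filtering step can be sketched as follows. The word/score pairs and the 0.7 threshold are illustrative assumptions, not values taken from the paper.

```python
def select_confident_words(hypothesis, threshold=0.7):
    # Keep only automatically transcribed words whose confidence score
    # reaches the threshold; the rest are excluded from acoustic model
    # training. The threshold value is illustrative only.
    return [word for word, conf in hypothesis if conf >= threshold]

# Hypothesis words paired with made-up confidence scores.
hyp = [("the", 0.95), ("quick", 0.40), ("brown", 0.85), ("fox", 0.90)]
print(select_confident_words(hyp))
# → ['the', 'brown', 'fox']
```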

19 Introduction
Building a recognizer for a new language, a new domain, or different acoustic conditions usually requires the recording and transcription of large amounts of speech data.
In contrast to the early days of speech recognition, large collections of speech data are available these days.
Unfortunately, most of the acoustic material comes without a detailed transcription and has to be transcribed manually.
One possible way to reduce the manual effort is to use an already existing speech recognizer to transcribe new data automatically.

20 Description of the training procedure

21 Description of the training procedure

22 Description of the training procedure

23 Appendix – confidence measure example

24 Bootstrapping with an optimized system

25 Bootstrapping with an optimized system
These results can be attributed to two opposing effects:
–If the recognizer used to transcribe the data is trained on large amounts of material, as in the experiments above, most of the incorrectly recognized words in the transcription will be acoustically very similar to the words originally spoken. The negative impact of these errors is thus only small, since the acoustic models are defined on a phonetic level
–Confidence measures cannot improve the performance, since they not only exclude words from the training which might be erroneous, but also reduce the amount of training material for the acoustic models
The trade-off between these two effects is an obvious explanation for the above results
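This trade-off can be illustrated numerically. The (word, confidence, is_correct) triples below are made up purely for illustration; they are not data from the paper.

```python
# Raising the confidence threshold removes more recognition errors
# but also shrinks the usable training set.
data = [("w1", 0.95, True), ("w2", 0.60, False), ("w3", 0.80, True),
        ("w4", 0.55, True), ("w5", 0.90, True), ("w6", 0.40, False)]

for threshold in (0.0, 0.5, 0.7):
    kept = [item for item in data if item[1] >= threshold]
    errors = sum(1 for _, _, correct in kept if not correct)
    print(f"threshold={threshold:.1f}: {len(kept)} words kept, "
          f"{errors} errors among them")
```

At threshold 0.0 all six words (including both errors) are trained on; at 0.7 the errors are gone but only three words remain.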

26 Bootstrapping with an optimized system
As the experiment clearly shows, the automatically transcribed training corpus can be used successfully to augment an already existing training corpus and to reduce the WERs on the testing corpus

27 Bootstrapping with a low-cost system
The scenario for the following experiments is as follows:
–It is assumed that 72 h of the Broadcast News 97 training corpus are not transcribed, but chopped into suitable audio segments
–It is also assumed that no initial acoustic models, no initial phonetic CART, and no initial LDA matrix are available
In such a scenario, it appears to be straightforward to transcribe a small amount of the training corpus manually, to train a recognizer, and to generate transcriptions of the rest of the training data
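The bootstrap-then-iterate procedure described in these slides can be expressed as a small loop. This is a structural sketch only: `train`, `recognize` and `confidence_filter` are hypothetical callables standing in for full ASR components.

```python
def unsupervised_training(seed_audio, seed_transcripts, untranscribed_audio,
                          train, recognize, confidence_filter, iterations=3):
    # Bootstrap on a small manually transcribed set, then repeatedly
    # transcribe the untranscribed data, filter the hypotheses with a
    # confidence measure, and retrain on the combined material.
    model = train(seed_audio, seed_transcripts)
    for _ in range(iterations):
        hypotheses = [recognize(model, utt) for utt in untranscribed_audio]
        filtered = [confidence_filter(hyp) for hyp in hypotheses]
        model = train(seed_audio + untranscribed_audio,
                      seed_transcripts + filtered)
    return model
```

Each pass should produce better models, and therefore better automatic transcriptions for the next pass, which is the motivation for the iterative application studied below.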

28 Bootstrapping with a low-cost system

29 Bootstrapping with a low-cost system

30 Iterative application of the unsupervised training

31 Iterative application of the unsupervised training

32 Iterative application of the unsupervised training

33 Unsupervised training of an across-word system

34 Conclusions and outlook
The experiments show that confidence measures can be used successfully to restrict the unsupervised training to those portions of the transcriptions where the words are most probably correct.
With the unsupervised training procedure, the manual effort of transcribing speech data can be reduced drastically for new application scenarios.