Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English: The CUHK Experience
Helen Meng, Wai-Kit Lo, Alissa M. Harrison, Pauline Lee, Ka-Ho Wong, Wai-Kim Leung and Fanbo Meng
Presented by Chun-Yu Chen

Design and collection of the Chinese Learners of English corpus
Speech data for deriving salient mispronunciations made by Chinese learners of English. The corpus includes data from 100 Cantonese subjects and 111 Mandarin subjects.

Capturing language transfer effects through contrastive phonological analyses
Sounds similar to those in the learner's first language are easy for the learner to acquire, while different sounds present difficulty. We summarize these errors by deriving phonological rules for the observed errors.

Capturing language transfer effects through contrastive phonological analyses
To model such mispronunciations, we make use of context-sensitive phonological rules in the standard format φ → ψ / λ _ ρ, i.e., phone φ is realized as ψ when preceded by λ and followed by ρ.

Mispronunciation prediction with manually and automatically derived phonological rules
Manually written phonological rules: We first developed a list of 43 context-insensitive rules and generated hypothesized pronunciation variants that may appear in the learners' speech. The dictionary grows exponentially, and many of the generated pronunciations are rare or implausible in the learners' speech. To reduce the number of implausible pronunciations, context-sensitive rules were compiled.
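The exponential growth under context-insensitive rules can be sketched as follows. This is a minimal illustration; the rule set and the phone inventory below are invented examples, not the actual 43 rules from the corpus study.

```python
from itertools import product

# Hypothetical context-insensitive rules: each canonical phone maps to the
# set of surface forms it may be realized as (including the correct one).
rules = {
    "th": {"th", "f", "d"},  # e.g. /th/ substituted by /f/ or /d/
    "r": {"r", "l"},         # /r/-/l/ confusion
}

def expand_pronunciation(phones):
    """Enumerate every variant licensed by the rules (cartesian product),
    which is why the dictionary grows exponentially with word length."""
    options = [sorted(rules.get(p, {p})) for p in phones]
    return [" ".join(v) for v in product(*options)]

variants = expand_pronunciation(["th", "r", "iy"])  # canonical "three"
```

With just these two rules, a three-phone word already yields 3 x 2 x 1 = 6 dictionary entries, which motivates pruning via context-sensitive rules.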

Mispronunciation prediction with manually and automatically derived phonological rules
Context-sensitive rules are developed using the immediate neighboring segments and symbols for various linguistic classes, such as consonants and vowels. An extended pronunciation dictionary (EPD) with 51 context-sensitive rules was developed.

Mispronunciation prediction with manually and automatically derived phonological rules
Automatically derived phonological rules: Manually authoring phonological rules requires expertise in both the mother language and the L2 being learned, so the feasible language pairs will be limited. Our approach is based on a few assumptions:
1. Differences between the phonetic transcriptions and the canonical pronunciations are due to negative language transfer.
2. Interferences such as misread prompts, unknown words, and transcription errors are excluded.

Mispronunciation prediction with manually and automatically derived phonological rules
The automatic rule derivation:
1. Align the canonical pronunciations with the manual transcriptions.
2. Obtain a set of all phonetic substitutions, insertions, and deletions.
3. Perform the rule selection process by keeping the top-N rules in the basic rule set, and evaluate the coverage of the top-N rules by computing the F1-score.
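The selection step might be sketched as follows. This is a hypothetical illustration: the substitution counts are invented, and the particular F1 definition (rule-type precision combined with token-level recall) is an assumption, since the slides do not spell out how the score is computed.

```python
from collections import Counter

# Invented example counts: aligned (canonical, observed) substitution
# pairs from the training transcriptions and a held-out dev set.
train_pairs = (
    [("th", "f")] * 40 + [("r", "l")] * 30 +
    [("v", "w")] * 20 + [("z", "s")] * 5 + [("th", "d")] * 3
)
dev_pairs = [("th", "f")] * 10 + [("r", "l")] * 8 + [("ih", "iy")] * 4

def select_top_n(pairs, n):
    """Keep the N most frequent substitution rules as the basic rule set."""
    return {rule for rule, _ in Counter(pairs).most_common(n)}

def coverage_f1(selected, dev_pairs):
    """One simple coverage score: precision over selected rule types that
    actually occur in the dev data, recall over covered dev error tokens."""
    observed = set(dev_pairs)
    precision = len(selected & observed) / len(selected)
    recall = sum(1 for p in dev_pairs if p in selected) / len(dev_pairs)
    return 2 * precision * recall / (precision + recall)
```

Sweeping N and keeping the rule set with the best F1 trades off coverage of real errors against over-generating implausible variants.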

Mispronunciation detection and diagnosis
In this system, ASR is used to detect mispronunciations with the extended pronunciation dictionary, which encodes the predicted mispronunciations for each given word. The steps in this system:
1. Each rule is applied to the canonical pronunciations, and the process is repeated for all rules to generate the extended pronunciation dictionary.
2. The recognized phone sequences are then aligned with the canonical phone sequences. Phones that cannot be aligned properly can then be easily identified as deletions, insertions, and substitutions.
3. Provide diagnostic feedback.
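The alignment in step 2 is a standard edit-distance alignment; a minimal sketch, assuming unit costs for all edit operations:

```python
def align(canonical, recognized):
    """Edit-distance alignment of two phone sequences; labels each
    aligned position as correct, substitution, deletion, or insertion."""
    n, m = len(canonical), len(recognized)
    # dp[i][j] = minimum edit cost between canonical[:i] and recognized[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i
    for j in range(1, m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = dp[i - 1][j - 1] + (canonical[i - 1] != recognized[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    ops, i, j = [], n, m  # backtrace to recover the edit operations
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + (
                canonical[i - 1] != recognized[j - 1]):
            kind = "correct" if canonical[i - 1] == recognized[j - 1] else "substitution"
            ops.append((kind, canonical[i - 1], recognized[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            ops.append(("deletion", canonical[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, recognized[j - 1]))
            j -= 1
    return ops[::-1]
```

For example, aligning canonical "think" /th ih ng k/ against a recognized /f ih k/ labels a th-to-f substitution and a deleted /ng/, which is exactly the information used for diagnostic feedback.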

Mispronunciation detection and diagnosis
Representation of extended pronunciations: We devise the Extended Recognition Network (ERN) as a compact representation of the same information, using a finite state transducer as a vehicle to represent the rules.
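The compactness can be sketched as follows. This is a plain linear network with parallel arcs, not a full weighted FST as in the actual system, and the rules and phones are invented examples.

```python
def build_ern(canonical, rules):
    """Build a linear network with one state per phone boundary; each
    rule-licensed variant adds a parallel arc, so alternatives are shared
    rather than enumerated as whole pronunciations in a dictionary."""
    arcs = []
    for i, phone in enumerate(canonical):
        for label in sorted(rules.get(phone, {phone})):
            arcs.append((i, i + 1, label))  # (source state, dest state, phone)
    return arcs

def paths(arcs, start, final):
    """Enumerate all label sequences from start to final (illustration only)."""
    if start == final:
        return [[]]
    out = []
    for s, d, label in arcs:
        if s == start:
            out += [[label] + rest for rest in paths(arcs, d, final)]
    return out
```

For canonical /th r iy/ with rules {th -> {th, f}, r -> {r, l}}, the network needs only 5 arcs while still admitting all 4 pronunciation variants as paths.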

Enhancing mispronunciation detection by fusion with pronunciation scoring
Detection of salient mispronunciations refers to the linguistically motivated approach above. Not all possible mispronunciations are predicted by this approach:
an expansion rule may be absent due to pruning or a lack of relevant language-transfer knowledge
the quality of the acoustic models may be poor, which hinders recognition accuracy
the mispronunciations may be caused by factors other than language transfer

Enhancing mispronunciation detection by fusion with pronunciation scoring
Conventional pronunciation scoring is based on the posterior probability of a speech unit being produced by the speaker. To minimize the total detection error, we combine the two techniques: we first optimize individual thresholds for every English phone, and define a backoff list.

Enhancing mispronunciation detection by fusion with pronunciation scoring
The backoff list is a list of phones that are better handled by the pronunciation scoring approach.
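The fusion logic can be sketched as follows; the threshold values, the phone names, and the function interface are hypothetical stand-ins for the tuned system.

```python
# Hypothetical per-phone thresholds on a log-posterior pronunciation
# score, individually tuned on development data to minimize detection error.
thresholds = {"th": -4.0, "r": -3.5, "ih": -5.0}

# Hypothetical backoff list: phones better handled by pronunciation scoring.
backoff = {"ih"}

def detect_mispronunciation(phone, flagged_by_ern, log_posterior):
    """Fuse the two detectors: for backoff-list phones, decide by
    thresholding the pronunciation score; otherwise trust the ERN result."""
    if phone in backoff:
        return log_posterior < thresholds[phone]
    return flagged_by_ern
```

The backoff thus lets score-based detection override the rule-based recognizer exactly where the rules are known to be unreliable.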

Utterance rejection for pre-filtering
Grossly erroneous recordings, such as false starts or pressing the button too early, should be appropriately handled by a pre-filtering mechanism. We use a statistical phone duration model to pre-filter for intact utterances.

Utterance rejection for pre-filtering
If forced alignment produces phone durations that are overly long or short compared with their inherent values, it may suggest that the input utterance is not intact.

Utterance rejection for pre-filtering
In phone duration scoring, we incorporate an anti-model to increase the discriminative power of the phone duration model. The "catch-all" anti-model:
1. We first shuffle the utterances in the corpus so that the recordings no longer match the prompting texts.
2. A forced alignment is then performed using these intentionally shuffled prompts.
3. A Gamma distribution is then trained using all aligned phone durations in the shuffled corpus.
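The steps above can be sketched as follows. The duration data are invented, and the method-of-moments Gamma fit is an illustrative assumption, since the slides do not specify how the Gamma parameters are estimated.

```python
import math

def fit_gamma(durations):
    """Method-of-moments Gamma fit: shape k = mean^2/var, scale = var/mean."""
    n = len(durations)
    mean = sum(durations) / n
    var = sum((x - mean) ** 2 for x in durations) / n
    return mean * mean / var, var / mean

def gamma_logpdf(x, k, theta):
    """Log-density of Gamma(shape k, scale theta) at x."""
    return (k - 1) * math.log(x) - x / theta - k * math.log(theta) - math.lgamma(k)

# Invented frame counts: well-aligned durations for one phone, vs. pooled
# durations from the intentionally shuffled ("catch-all") alignment.
model = fit_gamma([7, 8, 9, 8, 7, 9, 8, 8])
anti_model = fit_gamma([2, 20, 5, 15, 3, 18, 30, 7])

def duration_score(duration):
    """Log-likelihood ratio: positive supports an intact utterance,
    negative supports rejection by the pre-filter."""
    return gamma_logpdf(duration, *model) - gamma_logpdf(duration, *anti_model)
```

A typical 8-frame duration scores positive under the matched model, while an implausible 30-frame duration scores negative, i.e. closer to the shuffled anti-model, triggering rejection.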

Ongoing work and future directions
Our approach described above uses the ERN to provide explicitly modeled mispronunciations to capture the errors. We are also exploring the use of discriminatively trained acoustic models, with reference to predicted mispronunciations.