Published by Willis Warren. Modified over 9 years ago.
1
Development of Automatic Speech Recognition and Synthesis Technologies to Support Chinese Learners of English: The CUHK Experience
Helen Meng, Wai-Kit Lo, Alissa M. Harrison, Pauline Lee, Ka-Ho Wong, Wai-Kim Leung and Fanbo Meng
Presented by Chun-Yu Chen
2
Design and collection of the Chinese Learners of English corpus
The speech data are used to derive the salient mispronunciations made by Chinese learners of English.
The corpus includes data from 100 Cantonese-speaking subjects and 111 Mandarin-speaking subjects.
3
Capturing language transfer effects through contrastive phonological analyses
Sounds similar to those of the learner’s first language will be easy for the learner to acquire, while different sounds will present difficulty.
We summarize the observed errors by deriving phonological rules for them.
4
Capturing language transfer effects through contrastive phonological analyses
To model such mispronunciations, we make use of context-sensitive phonological rules, of the format:
[rule format: shown as an image on the original slide]
5
Mispronunciation prediction with manually and automatically derived phonological rules
Manually written phonological rules
We first developed a list of 43 context-insensitive rules and used them to generate hypothesized pronunciation variants that may appear in the learners’ speech.
With these rules the dictionary grows exponentially, and many of the generated pronunciations are rare or implausible in the learners’ speech.
To reduce the number of implausible pronunciations, context-sensitive rules were compiled.
6
Mispronunciation prediction with manually and automatically derived phonological rules
The context-sensitive rules were developed using the immediate neighboring segments and symbols for various linguistic classes, such as consonants and vowels.
The extended pronunciation dictionary (EPD) was developed with 51 context-sensitive rules.
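The rule expansion described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the rule tuple layout, the class symbols (`*`, `#`, `V`, `C`), the partial vowel inventory and the two example rules are all hypothetical and are not the 51 rules from the slides.

```python
from itertools import product

# Hypothetical, partial vowel inventory used only to define the class symbols.
VOWELS = {"aa", "ae", "ah", "eh", "ih", "iy", "uw"}

# Illustrative rules: (target, replacement, left context, right context).
# Context symbols: "*" = any, "#" = word boundary, "V" = vowel, "C" = consonant.
# A replacement of "" would stand for a deletion.
RULES = [
    ("th", "f", "*", "*"),  # substitute th with f in any context
    ("n", "l", "*", "V"),   # substitute n with l only before a vowel
]


def matches(phone, cls):
    """Check a neighboring phone (None = word boundary) against a context symbol."""
    if cls == "*":
        return True
    if cls == "#":
        return phone is None
    if phone is None:
        return False
    if cls == "V":
        return phone in VOWELS
    if cls == "C":
        return phone not in VOWELS
    return phone == cls  # a literal phone as context


def expand(canonical, rules):
    """Return the canonical pronunciation plus every variant the rules allow."""
    options = []
    for i, phone in enumerate(canonical):
        left = canonical[i - 1] if i > 0 else None
        right = canonical[i + 1] if i + 1 < len(canonical) else None
        alternatives = [phone]
        for target, replacement, lctx, rctx in rules:
            applies = phone == target and matches(left, lctx) and matches(right, rctx)
            if applies and replacement not in alternatives:
                alternatives.append(replacement)
        options.append(alternatives)
    # Each rule application is optional, so take every combination of alternatives.
    variants = (tuple(p for p in combo if p != "") for combo in product(*options))
    return list(dict.fromkeys(variants))  # drop duplicates, keep order
```

The number of variants is the product of the alternatives at each position, which is why an unconstrained rule set makes the dictionary grow so quickly. In this sketch the context check is what prunes it: the n-to-l rule fires in `["n", "iy"]` but not in `["th", "ih", "n"]`, where n is word-final.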
7
Mispronunciation prediction with manually and automatically derived phonological rules
Automatically derived phonological rules
Manually authoring phonological rules requires expertise in both the learner’s mother language and the L2 being learned, so the feasible language pairs will be limited.
Our approach is based on a few assumptions:
1. Differences between the phonetic transcriptions and the canonical pronunciations are due to negative language transfer
2. Interferences such as misread prompts, unknown words, transcription errors
8
Mispronunciation prediction with manually and automatically derived phonological rules
The automatic rule derivation:
1. Align the canonical pronunciations with the manual transcriptions
2. Obtain the set of all phonetic substitutions, insertions, and deletions
3. Perform rule selection by keeping the top-N rules in the basic rule set, and evaluate the coverage of the top-N rules by computing the F1-score
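Steps 1 to 3 can be sketched as below. This is a minimal illustration, not the CUHK implementation: it assumes a plain Levenshtein alignment with unit costs, ranks candidate rules by raw frequency, and uses a made-up three-word corpus. The F1-based coverage evaluation is left out because the slides do not say how precision and recall are defined.

```python
from collections import Counter

# Hypothetical toy corpus: (canonical phones, manually transcribed phones).
CORPUS = [
    (["th", "ih", "n"], ["f", "ih", "n"]),
    (["th", "r", "iy"], ["f", "r", "iy"]),
    (["k", "ow", "l", "d"], ["k", "ow", "l"]),
]


def align(canonical, observed):
    """Levenshtein alignment with unit costs.

    Returns (canonical_phone, observed_phone) pairs; None on the observed side
    marks a deletion, None on the canonical side marks an insertion.
    """
    n, m = len(canonical), len(observed)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = cost[i - 1][j - 1] + (canonical[i - 1] != observed[j - 1])
            cost[i][j] = min(diag, cost[i - 1][j] + 1, cost[i][j - 1] + 1)
    # Trace back from the bottom-right corner to recover the aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (
            canonical[i - 1] != observed[j - 1]
        ):
            pairs.append((canonical[i - 1], observed[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            pairs.append((canonical[i - 1], None))
            i -= 1
        else:
            pairs.append((None, observed[j - 1]))
            j -= 1
    pairs.reverse()
    return pairs


def derive_rules(corpus, top_n):
    """Count every substitution, insertion and deletion; keep the top-N."""
    counts = Counter()
    for canonical, transcribed in corpus:
        for c, t in align(canonical, transcribed):
            if c != t:
                counts[(c, t)] += 1
    return counts.most_common(top_n)
```

On the toy corpus, `derive_rules(CORPUS, 1)` keeps only the th-to-f substitution, which occurs twice, and drops the single final-d deletion.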
9
Mispronunciation detection and diagnosis
In this system, ASR is used to detect mispronunciations, using the extended pronunciation dictionary and the predicted mispronunciations for the given word.
The steps in this system:
1. The rule application process is repeated for all rules to generate the extended pronunciation dictionary
2. The recognized phone sequences are then aligned with the canonical phone sequences. Phones that cannot be aligned properly can then be easily identified as deletions, insertions and substitutions
3. Diagnostic feedback is provided
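Steps 2 and 3 can be illustrated with a short sketch. It assumes the recognizer output is already available as a phone list and uses Python's standard `difflib.SequenceMatcher` as a stand-in aligner, which is not necessarily the alignment method used in the system. The function name and the feedback format are hypothetical.

```python
from difflib import SequenceMatcher


def diagnose(canonical, recognized):
    """Compare recognized phones with canonical phones and list the mismatches.

    Returns (error_type, expected_phones, produced_phones) tuples.
    """
    matcher = SequenceMatcher(None, canonical, recognized, autojunk=False)
    errors = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace":
            errors.append(("substitution", canonical[i1:i2], recognized[j1:j2]))
        elif tag == "delete":
            errors.append(("deletion", canonical[i1:i2], []))
        elif tag == "insert":
            errors.append(("insertion", [], recognized[j1:j2]))
    return errors


def feedback(errors):
    """Turn the error tuples into simple diagnostic messages."""
    messages = []
    for kind, expected, produced in errors:
        if kind == "substitution":
            messages.append(f"{' '.join(expected)} was pronounced as {' '.join(produced)}")
        elif kind == "deletion":
            messages.append(f"{' '.join(expected)} was omitted")
        else:
            messages.append(f"{' '.join(produced)} was inserted")
    return messages
```

For example, `diagnose(["th", "ih", "n"], ["f", "ih", "n"])` reports a single substitution of th by f, and an utterance that matches the canonical sequence yields an empty list.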
10
Mispronunciation detection and diagnosis
Representation of Extended Pronunciations
We devise the Extended Recognition Network (ERN) as a compact representation of the same information.
We use the finite state transducer as a vehicle to represent the rules.
11
Enhancing mispronunciation detection by fusion with pronunciation scoring
The detection of salient mispronunciations is referred to as the linguistically-motivated approach.
Not all possible mispronunciations are predicted by this approach:
The expansion rule may be absent due to pruning or a lack of relevant language transfer knowledge.
The quality of the acoustic models may be poor, which hinders recognition accuracy.
The mispronunciations may be caused by factors other than language transfer.
12
Enhancing mispronunciation detection by fusion with pronunciation scoring
Conventional pronunciation scoring is based on the posterior probability of a speech unit being produced by the speaker.
To minimize the total detection error, we combine the two techniques.
We first optimize individual thresholds for every English phone, and define a backoff list.
13
Enhancing mispronunciation detection by fusion with pronunciation scoring
The backoff list is a list of phones that are better handled by the pronunciation scoring approach.
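One way to read the fusion is sketched below, under assumptions that the slides do not spell out: a phone on the backoff list is judged by its posterior score against a per-phone threshold, every other phone is judged by whether the recognized phone differs from the canonical one, and a threshold is chosen to minimize false acceptances plus false rejections on labeled data. The function names, the example threshold and the example backoff list are hypothetical.

```python
def best_threshold(scores, is_correct):
    """Pick the threshold with the fewest detection errors for one phone.

    A phone is flagged as mispronounced when its score is below the threshold.
    """
    candidates = sorted(set(scores)) + [float("inf")]

    def errors(threshold):
        # Correct pronunciations that would be flagged as mispronounced.
        false_rejects = sum(1 for s, ok in zip(scores, is_correct) if ok and s < threshold)
        # Mispronunciations that would be accepted as correct.
        false_accepts = sum(1 for s, ok in zip(scores, is_correct) if not ok and s >= threshold)
        return false_rejects + false_accepts

    return min(candidates, key=errors)


def is_mispronounced(canonical, recognized, posterior, thresholds, backoff):
    """Fuse the two detectors for a single phone."""
    if canonical in backoff:
        # Phones on the backoff list are handled by pronunciation scoring.
        return posterior < thresholds[canonical]
    # All other phones rely on the recognition result over predicted variants.
    return recognized != canonical


# Hypothetical example values, not taken from the slides.
THRESHOLDS = {"r": best_threshold([0.9, 0.8, 0.3, 0.2], [True, True, False, False])}
BACKOFF = {"r"}
```

With these example values the threshold for r comes out at 0.8, so an r recognized as r with posterior 0.3 is still flagged, while a th recognized as f is flagged regardless of its score.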
14
Utterance rejection for pre-filtering
Grossly erroneous utterances, such as false starts or recordings where the button is pressed too early, should be appropriately handled by a pre-filtering mechanism.
We use a statistical phone duration model to pre-filter for intact utterances.
15
Utterance rejection for pre-filtering
If forced alignment produces phone durations that are overly long or short compared with their inherent values, it may suggest that the input utterance is not intact.
16
Utterance rejection for pre-filtering
In phone duration scoring, we incorporate an anti-model to increase the discriminative power of the phone duration model.
The “catch-all” anti-model:
1. We first shuffle the utterances in the corpus so that the recordings do not match the prompting texts
2. Forced alignment is then performed using these intentionally shuffled prompts
3. A Gamma distribution is then trained using all aligned phone durations in the shuffled corpus
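The duration scoring with an anti-model can be sketched as a log-likelihood ratio. The sketch makes several assumptions not stated in the slides: the Gamma parameters are fitted by the method of moments, each phone has its own Gamma duration model, the utterance score is the average per-phone ratio, and forced alignment has already produced the durations. All duration values are made up for illustration.

```python
import math
from statistics import mean, pvariance


def fit_gamma(durations):
    """Method-of-moments Gamma fit; returns (shape, scale)."""
    m = mean(durations)
    v = pvariance(durations)
    return m * m / v, v / m


def gamma_logpdf(x, shape, scale):
    """Log density of a Gamma(shape, scale) distribution at x > 0."""
    return (
        (shape - 1) * math.log(x)
        - x / scale
        - shape * math.log(scale)
        - math.lgamma(shape)
    )


def duration_score(aligned, phone_models, anti_model):
    """Average log-likelihood ratio of phone model versus anti-model.

    aligned: list of (phone, duration_in_seconds) from forced alignment.
    """
    ratios = [
        gamma_logpdf(d, *phone_models[p]) - gamma_logpdf(d, *anti_model)
        for p, d in aligned
    ]
    return sum(ratios) / len(ratios)


# Hypothetical durations in seconds: matched alignments for one phone, and
# pooled durations from alignments against shuffled prompts (the anti-model).
PHONE_MODELS = {"ih": fit_gamma([0.08, 0.10, 0.12])}
ANTI_MODEL = fit_gamma([0.02, 0.10, 0.40])
```

A typical duration such as 0.10 s scores above zero, because the phone model explains it better than the anti-model, while an implausibly long 0.60 s scores well below zero. An utterance could then be rejected when its score falls under a tuned threshold.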
17
Ongoing work and future directions
The approach described above uses the ERN to provide explicitly modeled mispronunciations to capture the errors.
We are exploring the use of discriminatively trained acoustic models, with reference to predicted mispronunciations.