1 Phone-Level Pronunciation Scoring and Assessment for Interactive Language Learning
Speech Communication, 2000
Authors: S. M. Witt, S. J. Young
Presenter: Davidson
Date: 2009/07/08, 2009/07/15

2 Contents
- Introduction
- Goodness of Pronunciation (GoP) algorithm
  - Basic GoP algorithm
  - Phone-dependent thresholds
  - Explicit error modeling
- Collection of a non-native database
- Performance measures
- The labeling consistency of the human judges
- Experimental results
- Conclusions and future work

3 Introduction (1/3)
- CAPT systems (Computer-Assisted Pronunciation Training)
- Word- and phrase-level scoring ('93, '94, '97)
  - Intonation, stress, and rhythm
  - Requires several recordings of native utterances for each word
  - Difficult to add new teaching material
- Selected phonemic error teaching (1997)
  - Uses duration information or models trained on non-native speech

4 Introduction (2/3)
- HMMs have been used to produce sentence-level scores (1990, 1996)
- Eskenazi's system (1996) produces phone-level scores but makes no attempt to relate them to human judgement
- The authors' proposed system: measures pronunciation quality for non-native speech at the phone level

5 Introduction (3/3)
- Other issues covered:
  - GoP algorithms with refinements
  - Performance measures for both GoP scores and scores by human judges
  - A non-native database
  - Experiments on these performance measures

6 Goodness of Pronunciation (GoP) algorithm: Basic GoP algorithm
- A score is computed for each phone p
- p(O^(p) | p) = likelihood of the acoustic segment O^(p) corresponding to each phone
- GoP = duration-normalized log of the posterior probability for a phone given the corresponding acoustic segment:

  GoP(p) = |log P(p | O^(p))| / NF(p)

7 Basic GoP algorithm (2/5)
- The posterior expands as

  P(p | O^(p)) = p(O^(p) | p) P(p) / sum_{q in Q} p(O^(p) | q) P(q)

  - Q = the set of all phone models
  - NF(p) = the number of frames in O^(p)
- By assuming equal phone priors and approximating the sum by its maximum:

  GoP(p) ≈ |log( p(O^(p) | p) / max_{q in Q} p(O^(p) | q) )| / NF(p)

8 Basic GoP algorithm (3/5)
- The numerator term is computed using forced alignment with the known transcription
- The denominator term is determined using an unconstrained phone loop
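
A minimal sketch of how the two terms combine into the score once a decoder has produced them. The function name and inputs are illustrative, not the paper's code; in practice the log likelihoods would come from an HMM toolkit such as HTK.

```python
# Sketch of the basic GoP computation from the two decoder outputs.

def gop_score(forced_loglik, phone_loop_loglik, num_frames):
    """Duration-normalized GoP for one phone segment O^(p).

    forced_loglik:     log p(O^(p) | p) from forced alignment
    phone_loop_loglik: best log likelihood over the same frames
                       from the unconstrained phone loop
    num_frames:        NF(p), number of frames in O^(p)
    """
    return abs(forced_loglik - phone_loop_loglik) / num_frames

# A poorly matching phone yields a large score.
print(gop_score(-420.0, -395.0, 50))  # 0.5
```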

9 Basic GoP algorithm (4/5)
- If a mispronunciation has occurred, it is not reasonable to constrain the acoustic segment used to compute the maximum-likelihood phone to be identical to the assumed phone
- Hence, the denominator score is computed by summing the log likelihood per frame over the duration of O^(p)
- In practice, this will often mean that more than one phone in the unconstrained phone sequence has contributed to the computation of the denominator score

10 Basic GoP algorithm (5/5)
- It is intuitive to use speech data from native speakers to train the acoustic models
- However, non-native speech is characterized by different formant structures compared to those of a native speaker for the same phone
- Solution: adapt the Gaussian means by MLLR
  - Use only a single global transform of the HMM Gaussian component means, to avoid adapting to specific phone error patterns
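
A sketch of what a single global MLLR mean transform does: every Gaussian mean in the model set passes through the same affine map, so no phone-specific error pattern can be learned. Estimating the transform from adaptation data is omitted; the names below are illustrative.

```python
import numpy as np

def apply_global_mllr(means, A, b):
    """Apply one shared affine transform mu' = A @ mu + b to all means.

    means: (num_gaussians, dim) array of Gaussian component means.
    """
    return means @ A.T + b

means = np.zeros((4, 3))          # 4 toy Gaussians, 3-dim features
A, b = np.eye(3), np.ones(3)      # identity rotation, unit shift
print(apply_global_mllr(means, A, b))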

11 Phone-dependent thresholds
- The acoustic fit of phone-based HMMs differs from phone to phone
  - E.g. fricatives tend to have lower log likelihoods than vowels
- 2 ways to determine phone-specific thresholds (a sketch of the first follows below):
  - By using the mean and variance of the scores for each phone p
  - By approximating human labeling behavior
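
A sketch of the first option: a rejection threshold per phone from the mean and spread of that phone's GoP scores. The exact functional form and the factor alpha are assumptions for illustration, not taken from the paper.

```python
import numpy as np

def phone_thresholds(gop_by_phone, alpha=1.0):
    """gop_by_phone: dict mapping phone -> list of GoP scores.

    Returns an assumed threshold T(p) = mean + alpha * std per phone.
    """
    return {p: float(np.mean(s) + alpha * np.std(s))
            for p, s in gop_by_phone.items()}

print(phone_thresholds({"f": [2.1, 2.7, 3.0], "iy": [0.8, 1.1, 0.9]}))
```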

12 Explicit error modeling (1/3)
- 2 types of pronunciation errors:
  - Individual mispronunciations
  - Systematic mispronunciations
    - Consist of substitutions of native sounds for sounds of the target language which do not exist in the native language
- Knowledge of the learner's native language is included in order to detect systematic mispronunciations

13 Explicit error modeling (2/3)
- Solution: a recognition network incorporating both the correct pronunciation and common pronunciation errors, in the form of error sublattices for each phone
- E.g. "but" (see the illustrative sketch below)
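
An illustrative error sublattice for "but" (/b ah t/): each target phone expands to itself plus plausible substitutions. The specific alternatives below are invented for illustration; the paper derives them from knowledge of the learner's native language.

```python
ERROR_SUBLATTICE = {
    "b":  ["b", "p"],           # e.g. devoicing
    "ah": ["ah", "aa", "uh"],   # vowel substitutions
    "t":  ["t", "d"],
}

def expand_word(phones):
    """Alternatives the recognizer may align against at each position."""
    return [ERROR_SUBLATTICE.get(p, [p]) for p in phones]

print(expand_word(["b", "ah", "t"]))
# [['b', 'p'], ['ah', 'aa', 'uh'], ['t', 'd']]
```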

14 Explicit error modeling (3/3)
- Target phone posterior probability
- Scores for systematic mispronunciations
- GoP that includes an additional penalty for systematic mispronunciations
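
A hedged sketch of the last point: when the best path through the error network selects an error-model phone rather than the target phone, an extra penalty is added to the basic GoP score. The penalty value and the additive combination rule are assumptions; the slide's original equations are not reproduced in this transcript.

```python
def gop_with_error_penalty(basic_gop, recognized_as_error, penalty=1.0):
    """Assumed combination: larger score = worse pronunciation."""
    return basic_gop + (penalty if recognized_as_error else 0.0)
```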

15 Collection of a non-native database (1/2)
- Based on the procedures used for the WSJCAM0 corpus
- Texts are composed of a limited vocabulary of 1500 words
- 6 females and 4 males whose mother tongues are Korean (3), Japanese (3), Latin-American Spanish (3), and Italian (1)
- Each speaker reads 120 sentences:
  - 40 from a common set of phonetically balanced sentences
  - 80 sentences varied from session to session

16 Collection of a non-native database (2/2)
- 6 human judges who are native speakers of British English
  - Each speaker was labeled by 1 judge
- 20 sentences from a female Spanish speaker are used as calibration sentences
  - Annotated by all 6 judges
- Transcriptions reflect the actual sounds uttered by the speakers
  - Including phonemes from other languages

17 Performance measures (1/3)
- Compares 2 transcriptions of the same sentence
  - Transcriptions are either transcribed by human judges or generated automatically
- 4 types of performance measures:
  - Strictness
  - Agreement
  - Cross-correlation
  - Overall phone correlation

18 Performance measures (2/3)
- Transcriptions are compared on a frame-by-frame basis
- Each error frame is marked as 1, and 0 otherwise
  - This yields an error vector with one entry per frame
- A Hamming window of fixed length is applied to smooth the vector (see the sketch below)
  - The transition between 0 and 1 is too abrupt, whereas in practice the boundary is often uncertain
  - The forced alignment might be erroneous due to poor acoustic modeling of non-native speech
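
A sketch of the smoothing step: the binary error vector is convolved with a normalized Hamming window so the hard 0/1 edges become gradual. The 9-frame window length is an arbitrary illustrative choice; the paper's actual length is not reproduced in this transcript.

```python
import numpy as np

def smooth_errors(error_vector, window_len=9):
    w = np.hamming(window_len)
    w /= w.sum()                      # keep smoothed values in [0, 1]
    return np.convolve(error_vector, w, mode="same")

e = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0], dtype=float)
print(np.round(smooth_errors(e), 2))
```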

19 Performance measures (3/3)

20 Strictness (S)
- Measures how strict the judge was in marking pronunciation errors
- Relative strictness: the strictness of one transcription relative to another
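
A sketch assuming strictness is the fraction of (smoothed) frames marked as mispronounced, with relative strictness comparing two transcriptions; this reading is inferred from the slide text, as the original equations are missing.

```python
import numpy as np

def strictness(e):
    """e: frame-level error vector for one transcription."""
    return float(np.sum(e)) / len(e)

def relative_strictness(e1, e2):
    return abs(strictness(e1) - strictness(e2))
```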

21 Overall Agreement (A)
- Measures the agreement over all frames between 2 transcriptions
- Defined in terms of the cityblock distance between the 2 transcription vectors
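
A sketch of agreement as one minus the normalized cityblock (L1) distance between the two frame-level vectors, so identical transcriptions give A = 1. The normalization by vector length is an assumption.

```python
import numpy as np

def agreement(e1, e2):
    """e1, e2: frame-level error vectors of equal length."""
    return 1.0 - float(np.sum(np.abs(e1 - e2))) / len(e1)
```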

22 Cross-correlation (CC)
- Measures the agreement between the error frames in either or both transcriptions
- The normalizing term is the Euclidean norm of each transcription vector
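
A sketch of the measure: the inner product of the two error vectors normalized by their Euclidean norms, as the slide's reference to Euclidean distance suggests.

```python
import numpy as np

def cross_correlation(e1, e2):
    denom = np.linalg.norm(e1) * np.linalg.norm(e2)
    return float(np.dot(e1, e2) / denom) if denom > 0 else 0.0
```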

23 Phoneme Correlation (PC)
- Measures the agreement of the overall rejection statistics for each phone between 2 judges/systems
- PC is defined over per-phone rejection counts:
  - r is a vector of rejection counts, one entry per phone
  - r-bar denotes the mean rejection count
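
A sketch of PC as a Pearson-style correlation between the per-phone rejection-count vectors of two judges/systems, consistent with the slide's mention of mean rejection counts.

```python
import numpy as np

def phone_correlation(r1, r2):
    """r1, r2: rejection counts per phone, in the same phone order."""
    d1, d2 = r1 - np.mean(r1), r2 - np.mean(r2)
    return float(np.sum(d1 * d2) /
                 np.sqrt(np.sum(d1**2) * np.sum(d2**2)))
```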

24 Labeling consistency of the human judges (1/4)

25 Labeling consistency of the human judges (2/4)
- All results are within an acceptable range:
  - 0.85 < A < 0.95, mean = 0.91
  - 0.40 < CC < 0.65, mean = 0.47
  - 0.70 < PC < 0.85, mean = 0.78
  - 0.03 < relative strictness < 0.14, mean = 0.06
- These mean values can be used as benchmark values

26 Labeling consistency of the human judges (3/4)

27 Labeling consistency of the human judges (4/4)

28 Experimental results (1/7)
- Multiple-mixture monophone models
- Corpus: WSJCAM0
- The range of the rejection threshold was restricted to lie within one standard deviation of the judges' strictness
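
A sketch of that restriction: keep only candidate thresholds whose resulting system strictness lies within one standard deviation of the judges' mean strictness. All names and the exact selection rule are illustrative assumptions.

```python
import numpy as np

def admissible_thresholds(candidates, strictness_at, judge_strictness):
    """candidates: array of thresholds; strictness_at: the system's
    strictness at each candidate; judge_strictness: judges' values."""
    mu = np.mean(judge_strictness)
    sigma = np.std(judge_strictness)
    mask = np.abs(strictness_at - mu) <= sigma
    return candidates[mask]
```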

29 Experimental results (2/7)

30 Experimental results (3/7)

31 Experimental results (4/7)

32 Experimental results (5/7)

33 Experimental results (6/7)
- Add error modeling with Latin-American Spanish models to detect systematic mispronunciations

34 Experimental results (7/7)
- Comparison of transcriptions between the human judges and the system with the error network

35 Conclusions and future work
- 2 GoP scoring mechanisms:
  - Basic GoP
  - GoP with a systematic-mispronunciation penalty
- Refinement methods:
  - MLLR adaptation
  - Phone-dependent thresholds trained from human judgement
  - Error network
- Future work:
  - Provide information about the type of mistake

