AGA 4/28/2003 2003 NIST LID Evaluation On Use of Temporal Dynamics of Speech for Language Identification Andre Adami Pavel Matejka Petr Schwarz Hynek Hermansky.


1 AGA 4/28/2003 2003 NIST LID Evaluation
On Use of Temporal Dynamics of Speech for Language Identification
Andre Adami, Pavel Matejka, Petr Schwarz, Hynek Hermansky
Anthropic Signal Processing Group
http://www.asp.ogi.edu

2 OGI-4 – ASP System
Goal
  – Convert the speech signal into a sequence of discrete sub-word units that can characterize the language
Approach
  – Use temporal trajectories of speech parameters to obtain the sequence of units
  – Model the sequence of discrete sub-word units using an N-gram language model
Sub-word units
  – TRAP-derived American English phonemes
  – Symbols derived from the dynamics of prosodic cues
  – Phonemes from OGI-LID
[Diagram: speech signal -> segmentation -> units -> target language model (+) and background model (-) -> score]
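The scoring scheme in the diagram (target language model minus background model) can be sketched as a log-likelihood ratio over a sequence of units. This is a minimal illustration, not the authors' implementation: the add-one smoothing, the per-unit normalization, the class name TrigramModel, and the toy unit sequences are all assumptions made for the example.

```python
import math
from collections import Counter


class TrigramModel:
    """Add-one smoothed trigram model over discrete sub-word units (assumed smoothing)."""

    def __init__(self, sequences, vocab):
        self.vocab_size = len(vocab) + 1  # +1 for the end-of-sequence marker
        self.trigrams = Counter()
        self.contexts = Counter()
        for seq in sequences:
            padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
            for i in range(2, len(padded)):
                context = (padded[i - 2], padded[i - 1])
                self.trigrams[context + (padded[i],)] += 1
                self.contexts[context] += 1

    def log_prob(self, seq):
        """Log-probability of a unit sequence, including the end marker."""
        padded = ["<s>", "<s>"] + list(seq) + ["</s>"]
        total = 0.0
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            num = self.trigrams[context + (padded[i],)] + 1
            den = self.contexts[context] + self.vocab_size
            total += math.log(num / den)
        return total


def llr_score(units, target_model, background_model):
    """Per-unit log-likelihood ratio: target model (+) minus background model (-)."""
    diff = target_model.log_prob(units) - background_model.log_prob(units)
    return diff / max(len(units), 1)


# Toy data: "a", "b", "c" stand in for sub-word units.
vocab = ["a", "b", "c"]
target = TrigramModel([list("abcabcabc"), list("abcabc")], vocab)
background = TrigramModel([list("abcabc"), list("cbacba"), list("bbaacc")], vocab)

score_match = llr_score(list("abcabc"), target, background)
score_other = llr_score(list("cbacba"), target, background)
print(score_match > 0, score_other < 0)
```

A positive score means the unit sequence is better explained by the target language model than by the background model.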

3 American English Phoneme Recognition
Phoneme set
  – 39 American English phonemes (CMU-like)
Phoneme recognizer
  – Trained on NTIMIT
  – TRAP (Temporal Patterns) based
  – Speech segments for training obtained from energy-based speech/nonspeech segmentation
Modeling
  – 3-gram language model
[Figure: short-term analysis compared with the temporal patterns paradigm; axes: frequency, time; classifier output: phone]

4 English Phoneme System
Temporal trajectories
  – 23 mel-scale frequency bands
  – 1 s segments of log energy trajectory
Band classifiers
  – MLP (101x300x39)
  – Hidden unit nonlinearities: sigmoids
  – Output nonlinearities: softmax
Merger
  – MLP (897x300x39)
Viterbi search
  – Penalty factor tuning: deletions = insertions
Training
  – NTIMIT
[Diagram: frequency-time plane -> Band Classifier 1, Band Classifier 2, ..., Band Classifier N -> Merger -> Viterbi search]
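The layer sizes on this slide can be checked with a little arithmetic. The sketch below assumes a 10 ms frame step, which the slide does not state; under that assumption a 1 s trajectory centred on the current frame has 101 points, and 23 band classifiers with 39 outputs each feed 897 values to the merger. The function names are illustrative only.

```python
def trajectory_frames(duration_s, frame_step_s=0.010):
    """Frames in a trajectory spanning duration_s, counting both end points.

    The 10 ms frame step is an assumption, not a value from the slides.
    """
    return int(round(duration_s / frame_step_s)) + 1


def merger_inputs(num_bands, num_classes):
    """The merger sees one posterior vector per band classifier."""
    return num_bands * num_classes


band_mlp = (trajectory_frames(1.0), 300, 39)
merger_mlp = (merger_inputs(23, 39), 300, 39)
print(band_mlp)    # (101, 300, 39)
print(merger_mlp)  # (897, 300, 39)
```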

5 Prosodic Cues Dynamics
Technique
  – Use prosodic cues (intensity and pitch trajectories) to derive the sub-word units
Approach
  – Segment the speech signal at the inflection points of the trajectories (zero-crossings of the derivative) and at the onsets and offsets of voicing
  – Label each segment by the direction of change of the parameter within the segment

Class  Temporal Trajectory Description
1      rising f0 and rising energy
2      rising f0 and falling energy
3      falling f0 and rising energy
4      falling f0 and falling energy
5      unvoiced segment
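The segmentation and labeling rule can be sketched on per-frame f0 and energy values. This is a simplified illustration under several assumptions: unvoiced frames are marked by f0 <= 0, slopes are taken as plain frame-to-frame differences with no smoothing, and a flat slope counts as falling. None of these details come from the slides.

```python
from itertools import groupby


def frame_classes(f0, energy):
    """Assign each frame one of the five classes from the table above."""
    classes = []
    n = len(f0)
    for t in range(n):
        if f0[t] <= 0:  # assumed convention: f0 <= 0 marks an unvoiced frame
            classes.append(5)
            continue
        # Forward difference inside a voiced run, backward difference on its last frame.
        if t + 1 < n and f0[t + 1] > 0:
            a, b = t, t + 1
        elif t > 0 and f0[t - 1] > 0:
            a, b = t - 1, t
        else:
            a, b = t, t  # isolated voiced frame: no slope, treated as falling
        f0_rising = f0[b] > f0[a]
        energy_rising = energy[b] > energy[a]
        if f0_rising:
            classes.append(1 if energy_rising else 2)
        else:
            classes.append(3 if energy_rising else 4)
    return classes


def segment(f0, energy):
    """Merge runs of identical frame classes into (class, start, length) segments.

    A new segment starts wherever a slope changes sign or voicing switches.
    """
    segments = []
    start = 0
    for cls, run in groupby(frame_classes(f0, energy)):
        length = len(list(run))
        segments.append((cls, start, length))
        start += length
    return segments


f0 = [0, 0, 100, 110, 120, 115, 105, 0, 0]
energy = [1, 1, 5, 6, 7, 8, 6, 2, 1]
segments = segment(f0, energy)
print(segments)
# [(5, 0, 2), (1, 2, 2), (3, 4, 1), (4, 5, 2), (5, 7, 2)]
```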

6 Prosodic Cues Dynamics
Duration
  – The duration of each segment is characterized as “short” (less than 8 frames) or “long”
  – 10 symbols
Broad-phonetic-category (BFC)
  – Finer labeling is achieved by estimating the broad phonetic category (vowel+diphthong+glide, schwa, stop, fricative, flap, nasal, and silence) coinciding with each prosodic segment
  – BFC TRAPs trained on NTIMIT are used for deriving the broad phonetic categories
  – 61 symbols
3-gram language model
BFC TRAPS setup
  – Input temporal vectors
      15 bark-scale frequency band energies
      1 s segments of log energy trajectory
      Mean and variance normalized
      Dimension reduction: DCT
  – Band classifiers
      MLP (15x100x7)
      Hidden units: sigmoid
      Output units: softmax
  – Merger
      MLP (105x100x7)
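Combining the five trajectory classes with the two duration tags gives 5 x 2 = 10 distinct symbols, which matches the count on the slide. The sketch below shows one way to form such tokens; the "S"/"L" suffix and the token spelling are hypothetical naming choices made for this example.

```python
SHORT_LIMIT_FRAMES = 8  # "short" means fewer than 8 frames, as stated on the slide


def duration_symbol(cls, length_frames):
    """Combine a trajectory class (1-5) with a short/long tag (hypothetical spelling)."""
    tag = "S" if length_frames < SHORT_LIMIT_FRAMES else "L"
    return f"{cls}{tag}"


# Every class/duration combination: 5 classes x 2 durations.
inventory = sorted({duration_symbol(c, n) for c in range(1, 6) for n in (1, 8)})

# Segments given as (class, length in frames).
tokens = [duration_symbol(c, n) for c, n in [(5, 20), (1, 7), (4, 8), (5, 3)]]
print(len(inventory))  # 10
print(tokens)          # ['5L', '1S', '4L', '5S']
```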

7 OGI-4 – ASP System
EER 30s = 41.4%
EER 30s = 19.3%
EER 30s = 32.1%
EER 30s = 17.8%
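The results are reported as equal error rate (EER), the operating point where the miss rate equals the false-alarm rate. A minimal sketch of how an EER can be estimated from detection scores follows; the threshold sweep, the averaging at the closest point, and the toy scores are assumptions for illustration and not the evaluation's scoring tool.

```python
def equal_error_rate(target_scores, nontarget_scores):
    """Estimate the EER by sweeping the threshold over all observed scores."""
    best_gap, eer = None, None
    thresholds = sorted(set(target_scores) | set(nontarget_scores))
    for th in thresholds:
        # Miss: a target trial scored below the threshold.
        miss = sum(s < th for s in target_scores) / len(target_scores)
        # False alarm: a non-target trial scored at or above the threshold.
        fa = sum(s >= th for s in nontarget_scores) / len(nontarget_scores)
        gap = abs(miss - fa)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (miss + fa) / 2
    return eer


targets = [2.0, 1.5, 1.0, 0.2, -0.5]
nontargets = [1.2, 0.1, -0.3, -1.0, -2.0]
eer = equal_error_rate(targets, nontargets)
print(f"EER = {100 * eer:.1f}%")  # EER = 20.0%
```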

8 OGI-4 – ASP System
EER 30s = 17.8%

9 Post-Evaluation – Phoneme System
Speech-nonspeech segmentation using silence classes from TRAP-based classification
TRAPs classifier
  – Temporal trajectory duration: 400 ms
  – 3 bands as the input trajectory for each band classifier, to explore the correlation between adjacent bands
      The trajectories of the 3 bands are projected onto a DCT basis (20 coefficients)
  – Viterbi search tuned for language identification
Training data
  – CallFriend training and development sets
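The DCT projection of band trajectories can be sketched as follows. Several details here are assumptions: a 10 ms frame step (so 400 ms gives 41 frames), an unnormalized DCT-II, and 20 coefficients kept per band, which is one possible reading of the slide. The trajectories are synthetic.

```python
import math

NUM_FRAMES = 41   # 400 ms at an assumed 10 ms frame step, both end points included
NUM_COEFFS = 20   # assumed to be per band


def dct_ii(x, num_coeffs):
    """First num_coeffs coefficients of the unnormalized DCT-II of x."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * k * (t + 0.5) / n) for t in range(n))
        for k in range(num_coeffs)
    ]


def project_bands(bands, num_coeffs=NUM_COEFFS):
    """Concatenate the truncated DCT of each band trajectory into one feature vector."""
    features = []
    for trajectory in bands:
        features.extend(dct_ii(trajectory, num_coeffs))
    return features


# Three synthetic log-energy trajectories standing in for adjacent bands.
bands = [[math.sin(0.1 * t + b) for t in range(NUM_FRAMES)] for b in range(3)]
features = project_bands(bands)
print(len(features))  # 60

# Sanity check: a constant trajectory only excites the k = 0 coefficient.
flat = dct_ii([1.0] * NUM_FRAMES, NUM_COEFFS)
```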

10 Post-Evaluation – Phoneme System
EER 30s = 12.7% (34% relative improvement)
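The relative improvement figure can be reproduced with a short calculation. The sketch assumes that the baseline is the 19.3% EER listed on the earlier OGI-4 results slide; the slides do not label that value explicitly, so treat the pairing as an inference.

```python
def relative_improvement(baseline_eer, new_eer):
    """Fractional reduction in EER relative to the baseline."""
    return (baseline_eer - new_eer) / baseline_eer


# Assumed baseline: 19.3% (evaluation system); post-evaluation result: 12.7%.
gain = relative_improvement(19.3, 12.7)
print(f"{100 * gain:.0f}% relative improvement")  # 34% relative improvement
```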

11 Post-Evaluation – Prosodic Cues System
No energy-based segmentation
  – Unvoiced segments longer than 2 seconds are considered non-speech
No broad-phonetic category labeling applied
  – Rate of change plus the quantized duration (10 tokens)
Training data
  – CallFriend training and development sets

12 Post-Evaluation – Prosodic Cues System
EER 30s = 22.2% (30% relative improvement)

13 Fusion - 30 sec condition
Fusing the scores from the prosodic cues system
  – with TRAP-derived phonemes: EER 30s = 10.5% (17% relative improvement)
  – with OGI-LID derived phonemes: EER 30s = 6.6% (14% relative improvement)
TRAP-derived phoneme system fused with OGI-LID:
  – EER 30s = 6.2% (19% relative improvement)
EER 30s = 5.7% (26% relative improvement)
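The slides do not say how the system scores were fused. One common and simple option, shown here purely as an assumed illustration, is to z-normalize each system's scores and take a weighted sum. The weights and the toy scores are made up for the example.

```python
from statistics import mean, pstdev


def z_normalize(scores):
    """Shift and scale one system's scores to zero mean and unit variance."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def fuse(system_scores, weights):
    """Weighted sum of normalized scores, one fused score per trial (assumed method)."""
    normalized = [z_normalize(scores) for scores in system_scores]
    num_trials = len(normalized[0])
    return [
        sum(w * scores[i] for w, scores in zip(weights, normalized))
        for i in range(num_trials)
    ]


# Toy per-trial scores from two systems for the same four trials.
prosodic = [0.2, -0.1, 0.4, -0.3]
phoneme = [2.0, -1.0, 1.0, -2.0]
fused = fuse([prosodic, phoneme], weights=[0.3, 0.7])
print(fused)
```

Normalizing first keeps a system with a wide score range from dominating the sum; the weights would normally be tuned on held-out data.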

14 Conclusions
Sequences of discrete symbols derived from speech dynamics provide useful information for characterizing the language
Two techniques for deriving the sequences of symbols were investigated
  – segmentation and labeling based on prosodic cues
  – segmentation and labeling based on TRAP-derived phonetic labels
The introduced techniques combine well with each other as well as with more conventional language ID techniques

