Institute of Information Science, Academia Sinica 12 July, IIS, Academia Sinica Automatic Detection-based Phone Recognition on TIMIT Hung-Shin Lee.

Presentation transcript:

Page-1 Automatic Detection-based Phone Recognition on TIMIT
Hung-Shin Lee (李鴻欣)
Institute of Information Science (IIS), Academia Sinica, 12 July
Based on Chen and Wang in ISCSLP’08 and Interspeech’09

Page-2 Detection-Based ASR
Human SR: Knowledge Detection → Integration → Knowledge (Higher Level)
DB ASR: Detectors → Integrator → Results
– Detected knowledge: phonological attr., prosodic attr., acoustic attr., …
– Integrator: HMM, CRF, …
– Results / higher-level knowledge: phone, syllable, word, sentence, semantic info, …

Page-3 Phonological Systems
Systems: SPE (Sound Pattern of English); MV (Multi-valued Feature); GP (Government Phonology)
Literature
– SPE: (N. Chomsky & M. Halle, 1968)
– MV: (S. King, 2000)?
– GP: (J. Harris, 1994)
Feature Types
– SPE: production-based, binary
– MV: production-based, 2-10 values
– GP: sound structure primes, binary
Feature Number
Examples
– SPE: anterior, nasal, round
– MV: centrality, front-back, manner, phonation, place, roundness

Page-4 Phonological Feature Detection (1)
– Input layer: 9 frames (i-4 to i+4, centered on frame i) of 13 MFCCs
– MLP (detectors): input layer and hidden layer; recurrent, time-delay
– Output: posterior probability, followed by quantization
– Feature sets: SPE_, GP_
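The 9-frame input window above can be sketched as follows. This is a minimal illustration, not the speaker's implementation: the function name `stack_frames` and the edge handling (repeating the first or last frame at utterance boundaries) are assumptions, and the toy data stands in for real MFCCs.

```python
def stack_frames(mfccs, context=4):
    """Stack each frame with its +/-context neighbours into one input vector.

    mfccs: list of frames, each a list of 13 MFCC values.
    Returns one vector of (2 * context + 1) * 13 values per frame.
    """
    num_frames = len(mfccs)
    stacked = []
    for i in range(num_frames):
        window = []
        for offset in range(-context, context + 1):
            # Assumption: clamp at utterance edges by repeating the first/last frame.
            j = min(max(i + offset, 0), num_frames - 1)
            window.extend(mfccs[j])
        stacked.append(window)
    return stacked


# Toy data: 20 frames, each frame filled with its own index (not real MFCCs).
toy_mfccs = [[float(t)] * 13 for t in range(20)]
mlp_inputs = stack_frames(toy_mfccs)
```

With 13 MFCCs and frames i-4 to i+4, each MLP input vector has 9 × 13 = 117 values.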

Page-5 Phonological Feature Detection (2)
– Input: 9 frames (i-4 to i+4, centered on frame i) of 13 MFCCs; time-delay
– One MLP per MV feature: MLP (Centrality), MLP (Front-Back), MLP (Roundness); 6 MV Features
– Output: MV_29

Page-6 Conditional Random Field (CRF) Integrator
General Chain CRF
– State feature function and transition feature function
– λ_j, μ_k: feature function weight parameters
– Output (phone): Y = …, y_{i-1}, y_i, …
– Input (phonological features): X = …, x_{i-1}, x_i, x_{i+1}, …
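The equation from this slide did not survive in the transcript. For reference, the standard linear-chain CRF is usually written as below. The symbols s_j (state feature function) and t_k (transition feature function), and the pairing of λ_j with state features and μ_k with transition features, are assumptions based on the order of the labels on the slide.

```latex
P(Y \mid X) = \frac{1}{Z(X)}
  \exp\!\Big( \sum_{i} \Big[ \sum_{j} \lambda_j \, s_j(y_i, X, i)
  + \sum_{k} \mu_k \, t_k(y_{i-1}, y_i, X, i) \Big] \Big),
\qquad
Z(X) = \sum_{Y'} \exp\!\Big( \sum_{i} \Big[ \sum_{j} \lambda_j \, s_j(y'_i, X, i)
  + \sum_{k} \mu_k \, t_k(y'_{i-1}, y'_i, X, i) \Big] \Big)
```

Here Z(X) normalizes over all candidate phone sequences Y' so that the probabilities sum to one.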

Page-7 CRF Integrator – Training Issues
Required labels for CRF training
– Phone: y
– Phonological features: x
Oracle-data trained CRF (OT CRF)
– Training data: phone labels → mapping phones → phonological features → phonological features → OT CRF
Detected-data trained CRF (DT CRF)
– Speech → MLP detectors → phonological features (with errors) → DT CRF
– Phone labels come from the training data
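The oracle-data path (phone labels mapped to phonological features) can be sketched as a table lookup. The table below is hypothetical: it covers only a few phones and three SPE-style binary features chosen for illustration, and it is not the mapping table used in the talk.

```python
# Hypothetical mapping table: phone -> binary values for three SPE-style features.
# The entries are illustrative only, not the table used in the talk.
FEATURE_NAMES = ["anterior", "nasal", "round"]
PHONE_TO_FEATURES = {
    "m": {"anterior": 1, "nasal": 1, "round": 0},
    "n": {"anterior": 1, "nasal": 1, "round": 0},
    "k": {"anterior": 0, "nasal": 0, "round": 0},
    "uw": {"anterior": 0, "nasal": 0, "round": 1},
}


def oracle_features(phone_labels):
    """Map each phone label y to an error-free feature vector x."""
    vectors = []
    for phone in phone_labels:
        values = PHONE_TO_FEATURES[phone]  # raises KeyError for phones not in the table
        vectors.append([values[name] for name in FEATURE_NAMES])
    return vectors


oracle_x = oracle_features(["m", "uw", "k"])
```

Features produced this way contain no detection errors, which is the difference from the detected-data path, where the same x comes from the MLP detectors.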

Page-8 Experiments
Corpus: TIMIT
– No SA1, SA2
– Training set (3296 utts), Dev set (400 utts)
– Test set (1344 utts)
Phone set: TIMIT61
– Evaluation: CMU/MIT 39
Baseline
– CI-HMM
Toolkits
– Nico Toolkit (for MLP), CRF++ (for CRF)

Page-9 Results (1)
Model: OT CRF; Test: OD Features
– Columns: Phone Corr. %, Phone Acc. %
– Rows: SPE, GP, MV
Model: OT/DT CRF; Test: DD Features
– Columns: Phone Corr. %, Phone Acc. %
– Rows: HMM-baseline; OT CRF (SPE, GP, MV); DT CRF (SPE, GP, MV)
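The two reported metrics are conventionally defined from an alignment of the reference and recognized phone strings: Corr = (N - S - D) / N and Acc = (N - S - D - I) / N, where N is the number of reference phones and S, D, I are substitutions, deletions and insertions. The sketch below assumes these definitions and a uniform-cost edit distance; scoring tools may weight the error types differently, so counts can differ in tied cases. The phone strings are toy data.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimum-error alignment."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (total_errors, S, D, I) for ref[:i] against hyp[:j]
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)  # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)  # only insertions
    for i in range(1, rows):
        for j in range(1, cols):
            e, s, d, ins = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (e, s, d, ins)
            else:
                diagonal = (e + 1, s + 1, d, ins)
            e, s, d, ins = table[i - 1][j]
            deletion = (e + 1, s, d + 1, ins)
            e, s, d, ins = table[i][j - 1]
            insertion = (e + 1, s, d, ins + 1)
            # Tuples compare by total error count first.
            table[i][j] = min(diagonal, deletion, insertion)
    _, s, d, ins = table[-1][-1]
    return s, d, ins


def phone_corr_acc(ref, hyp):
    """Return (percent correct, percent accuracy) for one pair of phone strings."""
    s, d, ins = align_counts(ref, hyp)
    n = len(ref)
    return 100.0 * (n - s - d) / n, 100.0 * (n - s - d - ins) / n


# Toy example: one substitution (ae -> eh) and one inserted "t".
corr, acc = phone_corr_acc(["sil", "dh", "ae", "t", "sil"],
                           ["sil", "dh", "eh", "t", "t", "sil"])
```

Because insertions are penalized only in Acc, a system can show a high Corr and a noticeably lower Acc.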

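Page-10 Results (2)
Columns: Methods, # System, Phone Corr. (%), Phone Acc. (%)
Rows
– HMM baseline
– System Fusion: OT: SPE+GP+MV
– System Fusion: DT: SPE+GP+MV
– System Fusion: OT+DT: SPE+GP+MV
– System Fusion: OT: SPE+GP+MV +HMM
– System Fusion: DT: SPE+GP+MV +HMM
– System Fusion: OT+DT: SPE+GP+MV +HMM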

Page-11 System Fusion with CRF
– Output Y (…, y_{i-1}, y_i, …): Combined Results (Phone)
– Input X (…, x_{i-1}, x_i, x_{i+1}, …): Phone Sequence from each of SPE Sys., MV Sys., GP Sys., HMM Sys.
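One way to prepare fusion data for CRF++ is sketched below. CRF++ reads one token per line with whitespace-separated columns, the answer label in the last column, and an empty line between sequences. Everything else here is an assumption: the function name `fusion_rows`, the choice of one row per frame, the column order, and the toy phone labels.

```python
# Assumed column order for the four systems being fused.
SYSTEMS = ["SPE", "MV", "GP", "HMM"]


def fusion_rows(outputs, labels):
    """Format one utterance as a CRF++ training sequence.

    outputs: dict mapping system name -> list of phone labels, one per frame.
    labels:  reference phone labels, one per frame (last column in CRF++).
    """
    lengths = {len(outputs[name]) for name in SYSTEMS} | {len(labels)}
    if len(lengths) != 1:
        raise ValueError("all systems and the reference must have the same length")
    lines = []
    for i, label in enumerate(labels):
        columns = [outputs[name][i] for name in SYSTEMS] + [label]
        lines.append("\t".join(columns))
    # An empty line terminates the sequence.
    return "\n".join(lines) + "\n\n"


toy_outputs = {
    "SPE": ["s", "s", "ih"],
    "MV": ["s", "z", "ih"],
    "GP": ["s", "s", "ix"],
    "HMM": ["s", "s", "ih"],
}
crfpp_text = fusion_rows(toy_outputs, ["s", "s", "ih"])
```

A CRF++ template file would then decide which columns and neighbouring rows feed the state and transition features.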

Page-12 Two Types of AFDT Imperfection
– AF asynchrony
– AFDT errors
[Figure: phone tier (h# n eh ow kcl k w eh ae eh s tcl t ix n) shown with tiers AF(A) and AF(A’)]

Page-13 CRF Training (1)
Oracle Data Training
– Phone y → Mapping Table (Phone → AFs) → AFs x, over time t
Detected Data Training
– AFDT → AFs x (with detected errors), paired with Phone y over time t

Page-14 CRF Training (2)
Aligned Data Training
– AFDT → AF Sequence → AFs x, paired with Phone y over time t

Page-15 Results (3)
Columns: System, Phone Corr. (%), Phone Acc. (%)
Upper Bound
– OT CRF
– AT CRF
Real Case
– OT CRF
– DT CRF
– AT CRF
Observations
– Accuracy drops on the introduction of AF asynchrony
– Detection error causes a further 7.99% accuracy drop

Page-16 AF Asynchrony Compensation
– AF asynchrony is caused by context variation
– We can reduce AF asynchrony by letting our systems learn context variation directly: long-term information
[Figure: 23-dim Mel input over 310 ms, split into Left Context and Right Context; each passes through Windows + DCTs and an MLP; labeled 144 dim and 72 dim]
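The "Windows + DCTs" step can be sketched as follows. This is a rough illustration under several assumptions that the slide does not confirm: a 10 ms frame shift (so 310 ms is 31 frames), left and right contexts of 16 frames that share the center frame, a Hamming window over each per-band trajectory, and 3 DCT coefficients per band. The coefficient count is a placeholder and is not chosen to reproduce the 72 and 144 dimensions shown on the slide.

```python
import math


def hamming(values):
    """Apply a Hamming window to one per-band trajectory."""
    n = len(values)
    return [x * (0.54 - 0.46 * math.cos(2.0 * math.pi * t / (n - 1)))
            for t, x in enumerate(values)]


def dct_ii(values, num_coeffs):
    """First num_coeffs coefficients of an unnormalized DCT-II."""
    n = len(values)
    return [sum(x * math.cos(math.pi * (t + 0.5) * k / n)
                for t, x in enumerate(values))
            for k in range(num_coeffs)]


def long_term_features(mel, center, half_span=15, num_coeffs=3):
    """Return (left_context, right_context) vectors around frame `center`.

    mel: list of frames, each a list of Mel filter-bank energies (23 on the slide).
    """
    num_frames, num_bands = len(mel), len(mel[0])

    def trajectory(band, start, stop):
        # Assumption: clamp at utterance edges by repeating the first/last frame.
        return [mel[min(max(t, 0), num_frames - 1)][band] for t in range(start, stop)]

    left, right = [], []
    for band in range(num_bands):
        left_traj = trajectory(band, center - half_span, center + 1)
        right_traj = trajectory(band, center, center + half_span + 1)
        left.extend(dct_ii(hamming(left_traj), num_coeffs))
        right.extend(dct_ii(hamming(right_traj), num_coeffs))
    return left, right


# Toy input: 40 frames of 23 constant Mel energies (not real speech).
toy_mel = [[1.0] * 23 for _ in range(40)]
left_vec, right_vec = long_term_features(toy_mel, center=20)
```

Each context vector would then feed its own MLP, as in the figure; the DCT keeps the slowly varying shape of each band's trajectory while reducing its dimensionality.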

Page-17 Results (4)
Columns: Test Data Type, System, Corr, Acc
Test data type: -
– CI-HMM
– CD-HMM
Test data type: Detected (real case)
– OT CRF (±3)
– Long Term AFDT + DT CRF (±3)
Test data type: Ideal (upper bound)
– Long Term AFDT + AT CRF
– MFCC AFDT + AT CRF (±3)
– Long Term AFDT + AT CRF (±3)
Test data type: Detected (real case)
– Long Term AFDT + AT CRF
– MFCC AFDT + AT CRF (±3)
– Long Term AFDT + AT CRF (±3)

Page-18 Conclusions
A well-designed phonological feature system is important
– AF asynchrony minimization training and AF-phone synchronization could also be investigated
Oracle-trained CRF is able to retrieve more phonological information from speech
– High phone correct rate (but sensitive to detection error)
– Helpful for combination
Detection-based ASR is promising
– The front-end detector is a major issue

Page-19 AF and Phone Alignment Using AFDT
[Figure: phone sequence and AF sequence aligned along time t]