1 Voicing Features
Horacio Franco, Martin Graciarena, Andreas Stolcke, Dimitra Vergyri, Jing Zheng
STAR Lab, SRI International

2 Phonetically Motivated Features
Problem:
– Cepstral coefficients fail to capture many discriminative cues.
– The front end is optimized for traditional Mel cepstral features.
– Front-end parameters are a compromise solution across all phones.

3 Phonetically Motivated Features
Proposal:
– Enrich the Mel cepstral feature representation with phonetically motivated features from independent front ends.
– Optimize each specific front end to improve discrimination.
– Robust broad-class phonetic features provide "anchor points" in acoustic-phonetic decoding.
– General framework for multiple phonetic features. First approach: voicing features.

4 Voicing Features
Voicing feature algorithms:
1. Normalized peak autocorrelation (PA). For time frame $x$,
   $PA = \max_{\tau} \, R_x(\tau) / R_x(0)$,
   with the max computed over lags in the pitch region 80 Hz to 450 Hz.
2. Entropy of the high-order cepstrum (EC) and of the linear spectrum (ES). If $Y$ is the magnitude of the high-order cepstrum (or linear spectrum) normalized to sum to one, and $H$ is the entropy of $Y$, then
   $EC = H(Y) = -\sum_i Y_i \log Y_i$,
   with the entropy computed over the pitch region 80 Hz to 450 Hz.
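A minimal NumPy sketch of these two measures as reconstructed above; the frame length, zero padding, and exact normalizations are assumptions for illustration, not SRI's implementation:

```python
import numpy as np

def voicing_features(frame, sample_rate=8000, f_lo=80.0, f_hi=450.0):
    """Normalized peak autocorrelation (PA) and entropy of the
    high-order cepstrum (EC) for one windowed speech frame, both
    restricted to lags/quefrencies in the 80-450 Hz pitch region."""
    n = len(frame)
    nfft = 2 * n                      # zero-pad: circular -> linear autocorrelation
    spec = np.fft.rfft(frame, nfft)

    # Autocorrelation via inverse FFT of the power spectrum.
    autocorr = np.fft.irfft(np.abs(spec) ** 2, nfft)[:n]

    # Lag range for the pitch region: 450 Hz -> short lag, 80 Hz -> long lag.
    lag_lo = int(sample_rate / f_hi)
    lag_hi = min(int(sample_rate / f_lo), n - 1)
    pa = autocorr[lag_lo:lag_hi + 1].max() / (autocorr[0] + 1e-10)

    # Real cepstrum; the "high-order" coefficients sit at the pitch lags.
    cep = np.fft.irfft(np.log(np.abs(spec) + 1e-10), nfft)[:n]
    y = np.abs(cep[lag_lo:lag_hi + 1])
    y /= y.sum() + 1e-10              # normalize to a distribution
    ec = -np.sum(y * np.log(y + 1e-10))   # low entropy => strongly voiced
    return pa, ec
```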

5 Voicing Features
3. Correlation with a template and DP alignment [Arcienega, ICSLP'02]. Compute the Discrete Logarithm Fourier Transform (DLFT) of the speech signal over the pitch frequency band. If IT is an impulse train, the template is the DLFT of IT; the voicing measure for frame j is the correlation of that frame's DLFT with the template, and the DP-optimal correlation takes the max computed in the pitch region 80 Hz to 450 Hz.
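A rough sketch of the template-correlation idea, assuming the correlation slides an impulse-train spectrum along a logarithmic frequency axis (where a pitch change becomes a shift); this is an illustrative reading rather than Arcienega's exact formulation, and the DP alignment across frames is omitted:

```python
import numpy as np

def dlft_template_correlation(frame, sr=8000, nfft=1024, nbins=128,
                              f_lo=80.0, f_hi=450.0):
    """Per-frame voicing score: max correlation of the frame's
    log-frequency spectrum with the log-frequency spectrum of an
    impulse train, over shifts spanning the 80-450 Hz pitch region."""
    mag = np.abs(np.fft.rfft(frame, nfft))
    lin_freqs = np.fft.rfftfreq(nfft, 1.0 / sr)
    grid = np.geomspace(f_lo, sr / 2.0, nbins)      # log-frequency axis
    spec = np.interp(grid, lin_freqs, mag)
    spec -= spec.mean()

    # Template: harmonics of an impulse train at f0 = f_lo form a
    # fixed comb on the log-frequency axis.
    template = np.zeros(nbins)
    for k in range(1, int((sr / 2.0) / f_lo) + 1):
        template[np.argmin(np.abs(grid - k * f_lo))] = 1.0
    template -= template.mean()

    # Shifting the template by s bins hypothesizes a higher pitch;
    # stop once the hypothesized pitch passes f_hi.
    max_shift = int(np.argmin(np.abs(grid - f_hi)))
    scores = []
    for s in range(max_shift + 1):
        a, b = spec[s:], template[:nbins - s]
        scores.append(np.dot(a, b) /
                      (np.linalg.norm(a) * np.linalg.norm(b) + 1e-10))
    return max(scores)   # the slide adds DP smoothing across frames
```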

6 Voicing Features
Preliminary exploration of voicing features:
– Best feature combination: peak autocorrelation + entropy of cepstrum.
– Complementary behavior of the autocorrelation and entropy features for high and low pitch:
  Low pitch: time periods are well separated, so the correlation is well defined.
  High pitch: harmonics are well separated, so the cepstrum is well defined.

7 Voicing Features
[Figure: voicing feature trajectories over an utterance, aligned with the phone sequence "w er k ay n d ax f s: aw th ax v dh ey ax r".]

8 Voicing Features
Integration of voicing features:
1 – Juxtaposing voicing features:
– Juxtapose the two voicing features to the traditional Mel cepstral feature vector (MFCC) plus delta and delta-delta features (MFCC+D+DD).
– Voicing feature front end: use the same MFCC frame rate and optimize the temporal window duration.
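A minimal sketch of the juxtaposition, assuming a 39-dimensional MFCC+D+DD stream and the two per-frame voicing features computed at the same 10 msec frame rate:

```python
import numpy as np

def juxtapose(mfcc_d_dd, voicing):
    """Append per-frame voicing features to the MFCC+D+DD vectors.
    mfcc_d_dd: (num_frames, 39); voicing: (num_frames, 2) with the
    PA and EC values from the voicing front end."""
    assert len(mfcc_d_dd) == len(voicing), "front ends must share a frame rate"
    return np.hstack([mfcc_d_dd, voicing])   # (num_frames, 41)
```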

9 Voicing Features
Setup: trained on the small Switchboard database (64 hours); tested on dev, with WER reported for both sexes. Features: MFCC+D+DD, 25.6 msec frames every 10 msec, with VTL normalization and speaker mean/variance normalization. Genone acoustic model: non-crossword, MLE-trained, gender-dependent. Bigram LM.

Window length optimization                WER
Baseline                                  41.4%
Baseline + 2 voicing (25.6 msec)          41.2%
Baseline + 2 voicing (75 msec)            40.7%
Baseline + 2 voicing (87.5 msec)          40.5%
Baseline + 2 voicing (100 msec)           40.4%
Baseline + 2 voicing (112.5 msec)         41.2%

10 Voicing Features
2 – Voiced/unvoiced posterior features:
– Use a posterior voicing probability as a feature, computed from a 2-state HMM. The juxtaposed feature dimension is 40.
– Similar setup as before; males-only results.
– Soft V/UV transitions may not be captured, because the posterior feature behaves like a binary feature.

Recognition system                        WER
Baseline                                  39.2%
Baseline + voicing posterior              39.7%
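A minimal sketch of computing a per-frame voicing posterior from a 2-state HMM via the forward-backward algorithm; the observation models (e.g., Gaussians over the voicing features supplying obs_loglik), transitions, and priors are assumptions left to the caller, not values from the experiments:

```python
import numpy as np
from scipy.special import logsumexp

def voicing_posteriors(obs_loglik, log_trans, log_prior):
    """Per-frame P(voiced | all frames) from a 2-state HMM
    (state 0 = unvoiced, state 1 = voiced) via forward-backward.
    obs_loglik: (T, 2) log-likelihoods of each frame's voicing
    features under each state; log_trans: (2, 2); log_prior: (2,)."""
    T = obs_loglik.shape[0]
    fwd = np.empty((T, 2))
    bwd = np.zeros((T, 2))                 # log(1) at the last frame
    fwd[0] = log_prior + obs_loglik[0]
    for t in range(1, T):                  # forward recursion (log domain)
        fwd[t] = obs_loglik[t] + logsumexp(fwd[t - 1][:, None] + log_trans, axis=0)
    for t in range(T - 2, -1, -1):         # backward recursion
        bwd[t] = logsumexp(log_trans + obs_loglik[t + 1] + bwd[t + 1], axis=1)
    log_post = fwd + bwd
    log_post -= logsumexp(log_post, axis=1, keepdims=True)
    return np.exp(log_post[:, 1])          # posterior of the voiced state
```

Such posteriors tend to saturate near 0 or 1, which is consistent with the slide's observation that the feature behaves almost like a binary one.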

11 Voicing Features
3 – Window of voicing features + HLDA:
– Juxtapose the MFCC features with a window of voicing features around the current frame; apply dimensionality reduction with HLDA. The final feature had 39 dimensions.
– Same setup as before, with MFCC+D+DD+3rd diffs; both sexes.
– This baseline is 1.5% abs. better; voicing improves it by 1% more.

Recognition system                            WER %
Baseline + HLDA                               39.9
Baseline + 1 frame, 2 voicing + HLDA
Baseline + 5 frames, 2 voicing + HLDA         38.9
Baseline + 9 frames, 2 voicing + HLDA         39.5
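A sketch of the frame stacking and projection, with scikit-learn's LDA standing in for HLDA (HLDA is a maximum-likelihood generalization of this kind of class-discriminative linear projection); the 5-frame window matches the best row in the table, and the phone labels and dimensionalities are illustrative assumptions:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def stack_voicing(voicing, context=2):
    """Stack +/-context frames of the 2 voicing features around each
    frame (5 frames total for context=2), padding at the utterance edges."""
    T = len(voicing)
    padded = np.pad(voicing, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[k:k + T] for k in range(2 * context + 1)])

def project_features(mfcc, voicing, labels, out_dim=39):
    """mfcc: (T, 52) MFCC+D+DD+3rd diffs; voicing: (T, 2);
    labels: (T,) phone ids (needs >= 40 classes for 39 components)."""
    full = np.hstack([mfcc, stack_voicing(voicing)])       # (T, 62)
    lda = LinearDiscriminantAnalysis(n_components=out_dim).fit(full, labels)
    return lda.transform(full)                             # (T, 39) final feature
```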

12 Voicing Features
4 – Delta of voicing features + HLDA:
– Use delta and delta-delta features instead of a window of voicing features; apply HLDA to the juxtaposed feature.
– Same setup as before, with MFCC+D+DD+3rd diffs; males only.
– The reason may be that variability in the voicing features produces noisy deltas, while the HLDA weighting of the "window of voicing features" is similar to an average.
=> The best overall configuration was MFCC+D+DD+3rd diffs and 10 voicing features + HLDA.

Recognition system                            WER
Baseline + HLDA                               37.5%
Baseline + voicing + delta voicing + HLDA     37.6%
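For reference, a minimal sketch of the delta and delta-delta streams using the standard regression formula; the +/-2 frame window is an assumption:

```python
import numpy as np

def deltas(feat, width=2):
    """Regression deltas over +/-width frames:
    d_t = sum_k k * (x_{t+k} - x_{t-k}) / (2 * sum_k k^2)."""
    T = len(feat)
    padded = np.pad(feat, ((width, width), (0, 0)), mode="edge")
    num = sum(k * (padded[width + k:width + k + T] - padded[width - k:width - k + T])
              for k in range(1, width + 1))
    return num / (2 * sum(k * k for k in range(1, width + 1)))

# voicing: (T, 2) -> [voicing, delta, delta-delta]: (T, 6)
# combined = np.hstack([voicing, deltas(voicing), deltas(deltas(voicing))])
```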

13 Voicing Features
Voicing features in the SRI CTS Eval Sept '03 system:
– Adaptation of MMIE cross-word models with/without voicing features, using the best configuration of voicing features.
– Trained on the full SWBD+CTRANS data; tested on EVAL'02.
– Feature: MFCC+D+DD+3rd diffs + HLDA.
– Adaptation: 9 full-matrix MLLR transforms. Adaptation hypotheses from an MLE non-cross-word model with a PLP front end plus voicing features.

Recognition system                        WER
Baseline EVAL                             25.6%
Baseline EVAL + voicing                   25.1%

14 Voicing Features
Hypothesis examples:

REF:          OH REALLY WHAT WHAT KIND OF PAPER
HYP BASELINE: OH REALLY WHICH WAS KIND OF PAPER
HYP VOICING:  OH REALLY WHAT WHAT KIND OF PAPER

REF:          YOU KNOW HE S JUST SO UNHAPPY
HYP BASELINE: YOU KNOW YOU JUST I WANT HAPPY
HYP VOICING:  YOU KNOW HE S JUST SO I WANT HAPPY

15 Voicing Features
Error analysis:
– In one experiment, 54% of speakers got a WER reduction (some up to 4% abs.); the remaining 46% showed a small WER increase.
– A more detailed study of speaker-dependent performance is still needed.
Implementation:
– Implemented a voicing feature engine in the DECIPHER system.
– Fast computation, using one FFT and two IFFTs per frame for both voicing features.
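That cost breakdown is consistent with both features sharing a single spectrum, as in this sketch: one FFT of the frame, one inverse FFT of the power spectrum for the autocorrelation, and one inverse FFT of the log-magnitude spectrum for the cepstrum:

```python
import numpy as np

def shared_fft_voicing(frame, nfft):
    """Both voicing features from one FFT and two IFFTs per frame."""
    spec = np.fft.rfft(frame, nfft)                              # 1 FFT
    autocorr = np.fft.irfft(np.abs(spec) ** 2, nfft)             # IFFT #1 -> peak autocorrelation
    cepstrum = np.fft.irfft(np.log(np.abs(spec) + 1e-10), nfft)  # IFFT #2 -> entropy of cepstrum
    return autocorr, cepstrum
```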

16 Voicing Features
Conclusions:
– Explored how to represent and integrate the voicing features for best performance.
– Achieved a 1% abs. (~2% rel.) gain in the first pass (using the small training set), and a >0.5% abs. (2% rel.) gain (using the full training set) in higher rescoring passes of the DECIPHER LVCSR system.
Future work:
– Further explore feature combination/selection.
– Develop more reliable voicing features; the features do not always reflect actual voicing activity.
– Develop other phonetically derived features (vowels/consonants, occlusion, nasality, etc.).
