Statistical and Signal Processing Approaches for Voicing Detection


Alex Park
July 25, 2003

Overview

- Motivation and background for voicing detection
- Overview of recent methods
  - Signal processing approaches
  - Statistical approaches
- Performance comparison of voicing detection methods
  - Detection error rates on a small task
  - Example outputs
- Conclusions and future work

Motivation

Voicing is not necessary for speech understanding:
- Whispered speech: excitation is provided by aspiration
- Sinewave speech: no periodic excitation; resonances are produced directly

What, then, is the value of adding voicing to the speech signal?
- Separability: pitch is useful for distinguishing concurrent speakers from the background
- Redundancy: harmonics provide a regular structure from which speech can be detected in multiple bands
- Robustness: unvoiced speech has lower SNR than voiced speech
  - Whispering is intended to prevent unwanted listeners from hearing
  - Shouting and singing are not possible without voicing
  - Low frequencies are less attenuated over distance

Current speech recognition systems typically discard voicing information in the front end because:
- Energy is environment dependent and pitch is speaker dependent
- The vocal tract configuration carries most of the phonetic information

Background

Voicing is produced by periodic vibration of the vocal folds:
- In time, voiced speech consists of repeated segments
- In frequency, the spectrum has a harmonic structure shaped by the formant resonances

The pitch estimate and voicing decision can be made:
- In time, using the repetition rate and the similarity of pitch periods
- In frequency, using the spacing and relative heights of harmonic peaks

In theory the problem seems easy; in practice, many issues crop up that make the decision difficult on real speech signals:
- Irregular pitch periods, so repetition is not exact (hampers temporal approaches)
- A missing fundamental and resonant shaping of the spectrum (hampers spectral approaches)

[Figures: a time-domain waveform with irregular pitch periods; a frequency-domain spectrum with a missing fundamental]

Signal Processing Approaches

- Marked by the lack of a training phase
- Voicing detection is typically paired with pitch extraction
- A well-known approach is peak-picking (spectral or temporal), usually followed by smoothing of gross errors via dynamic programming
- Many solutions have been proposed:
  - Spectral: cepstral pitch tracking, Harmonic Product Spectrum (see the sketch below), logarithmic DFT pitch tracker (Wang)
  - Temporal: autocorrelation, sinusoid matching (Saul), synchrony (Seneff)
  - Exotic methods: image-based pitch tracking (Quatieri)
- Post-processing of the voicing decisions generally reduces false alarms
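The Harmonic Product Spectrum listed above is easy to illustrate. Below is a minimal Python sketch (not from the original talk; the frame, sample rate, harmonic count, and search range are assumed inputs): the magnitude spectrum is multiplied by downsampled copies of itself, so that energy at integer multiples of F0 reinforces at the fundamental.

```python
import numpy as np

def hps_pitch(frame, sample_rate, n_harmonics=4, fmin=50.0, fmax=400.0):
    """Estimate F0 of a speech frame with the Harmonic Product Spectrum.

    Downsampling the magnitude spectrum by h aligns the h-th harmonic
    with the fundamental, so the product peaks at F0 for voiced frames.
    """
    windowed = frame * np.hanning(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))
    hps = spectrum.copy()
    for h in range(2, n_harmonics + 1):
        decimated = spectrum[::h]
        hps[:len(decimated)] *= decimated
    # Search only a plausible pitch range
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    band = (freqs >= fmin) & (freqs <= fmax)
    return freqs[band][np.argmax(hps[band])]
```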

I. Autocorrelation

- A temporal-domain approach, used in the ESPS tool 'get_f0'
- Compute the inner product of the signal with a shifted version of itself
- If $s[n]$ is a speech frame of length $N$, its short-time autocorrelation is
  $R[k] = \sum_{n=0}^{N-1-k} s[n]\, s[n+k]$
- Peaks of $R[k]$ occur at multiples of the fundamental period

[Figure: a speech frame and its short-time autocorrelation, with peaks at multiples of the fundamental period]
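As an illustration, here is a minimal sketch of an autocorrelation-based voicing decision (a simplification, not the get_f0 implementation; the 0.3 peak threshold and 50-400 Hz pitch range are assumptions):

```python
import numpy as np

def autocorr_voicing(frame, sample_rate, fmin=50.0, fmax=400.0, threshold=0.3):
    """Voicing decision and F0 estimate from the normalized autocorrelation.

    Returns (is_voiced, f0). A frame is declared voiced when the largest
    autocorrelation peak in the candidate lag range exceeds `threshold`
    times the zero-lag energy.
    """
    frame = frame - np.mean(frame)
    energy = np.dot(frame, frame)
    if energy == 0.0:
        return False, 0.0
    # Full autocorrelation; keep non-negative lags only
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / fmax)
    lag_max = min(int(sample_rate / fmin), len(r) - 1)
    lag = lag_min + np.argmax(r[lag_min:lag_max + 1])
    peak = r[lag] / energy
    return peak > threshold, sample_rate / lag
```

In practice a system like get_f0 also smooths the resulting track with dynamic programming, as noted on the previous slide.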

II. Band-limited Sinusoid Fitting (Saul 2002)

- Signal preconditioning: low-pass filter, then half-wave rectification (max(x, 0))
- An octave filterbank of 8 bands (e.g., 25-75 Hz, ..., 134-407 Hz, ..., 264-800 Hz); the filter bandwidths ensure that at least one filter resolves a single harmonic
- Within a sliding temporal window, frames of each filtered signal are fit with a sinusoid, yielding a frequency w_i* and a fitting error u_i* per band
- At each step, the band with the lowest error u* gives the voicing probability, p(V) = f(u*), and its frequency w* gives the pitch estimate F0
- The algorithm is fast and gives accurate pitch tracks
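Saul's estimator fits the sinusoid in closed form per band; the brute-force sketch below only illustrates the underlying idea on one band's filtered output `x` (the candidate-frequency grid and search range are assumptions):

```python
import numpy as np

def fit_sinusoid(x, sample_rate, fmin=50.0, fmax=400.0, n_candidates=200):
    """Least-squares fit of a single sinusoid a*cos(wt) + b*sin(wt) to a frame,
    scanning candidate frequencies; returns (best_f0, relative_error).

    A low relative error means the band is well explained by one sinusoid,
    which is the cue this family of methods uses for voicing.
    """
    t = np.arange(len(x)) / sample_rate
    best_f, best_err = 0.0, np.inf
    norm = np.dot(x, x) + 1e-12
    for f in np.linspace(fmin, fmax, n_candidates):
        basis = np.column_stack([np.cos(2 * np.pi * f * t),
                                 np.sin(2 * np.pi * f * t)])
        coef, _, _, _ = np.linalg.lstsq(basis, x, rcond=None)
        err = np.sum((x - basis @ coef) ** 2) / norm
        if err < best_err:
            best_f, best_err = f, err
    return best_f, best_err
```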

Statistical Approaches

- Statistical voicing detectors are not strictly dependent on spectral features (though these are the features most widely used)
- Training data is useful for capturing acoustic cues of voicing not explicitly specified in signal processing approaches
- Possible classifiers suitable for voicing detection:
  - GMM classifier (with MFCC features)
  - Structured Bayesian network (alternative features)
  - Neural network classifier
  - Support vector machines

I. GMM Classifier

- Formulate voicing as a classic detection problem
- Train two GMMs, p(x|V) and p(x|UV), on frame-level feature vectors from transcribed speech: MFCCs plus surrounding frames (for deltas and double-deltas), i.e., 14 cepstral coefficients from 7 frames, reduced to 50 dimensions via PCA; 50 mixtures each
- Using Bayes' rule, the voicing score for an unknown frame x is the likelihood ratio; decide "voiced" when
  $L(x) = \frac{p(x \mid V)}{p(x \mid UV)} > \frac{p(UV)}{p(V)}$
  (equivalently, when p(V|x) > p(UV|x)), and "unvoiced" otherwise
- The discriminative framework is useful because it brings knowledge of unvoiced speech characteristics into the decision
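A minimal sketch of this recipe using scikit-learn (an assumption; the original system was not necessarily built this way, and the diagonal covariances are a choice made here for compactness):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_voicing_gmms(features, labels, n_components=50, n_dims=50):
    """Train voiced/unvoiced GMMs on frame-level features.

    `features`: (n_frames, n_raw_dims) array, e.g. 14 MFCCs stacked over a
    7-frame context window (98 raw dims); `labels`: True for voiced frames.
    """
    pca = PCA(n_components=n_dims).fit(features)
    reduced = pca.transform(features)
    gmm_v = GaussianMixture(n_components=n_components,
                            covariance_type="diag").fit(reduced[labels])
    gmm_uv = GaussianMixture(n_components=n_components,
                             covariance_type="diag").fit(reduced[~labels])
    return pca, gmm_v, gmm_uv

def voicing_score(pca, gmm_v, gmm_uv, frames):
    """Per-frame log-likelihood ratio log p(x|V) - log p(x|UV); compare
    against log(p(UV)/p(V)) to implement the decision rule on the slide."""
    x = pca.transform(frames)
    return gmm_v.score_samples(x) - gmm_uv.score_samples(x)
```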

II. Bayesian Network (Saul/Rahim/Allen 1999)

- Signal preconditioning: half-wave rectification (max(x, 0)) followed by a 24-channel auditory (gammatone) filterbank
- A feature vector is constructed for each frame of each narrowband signal: autocorrelation peaks and valleys plus an SNR estimate, giving 5 dimensions per band per frame
- Individual voicing decisions are made on each channel; the per-channel sigmoid decision weights (θ's) are trained via the EM algorithm
- The overall voicing decision is triggered by a positive example in individual channels, through an AND layer feeding an OR layer

What is notable about this approach:
- The front-end features
- The structure of the network (it avoids false positives via the AND layer and false negatives via the OR layer)
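The exact parameterization in Saul/Rahim/Allen (2001) is more involved (per-channel sigmoids with EM-trained weights), but the AND/OR combination itself can be illustrated with probabilistic soft gates:

```python
import numpy as np

def combine_channels(channel_probs):
    """Combine per-channel, per-cue voicing probabilities with a soft AND
    within each channel and a soft OR across channels.

    `channel_probs`: (n_channels, n_cues) array of probabilities in [0, 1].
    Soft AND: product of cue probabilities (all cues must agree), which
    suppresses false positives; soft OR (noisy-OR): 1 - prod(1 - p) across
    channels (any one channel suffices), which suppresses false negatives.
    """
    per_channel = np.prod(channel_probs, axis=1)   # AND within channel
    return 1.0 - np.prod(1.0 - per_channel)        # OR across channels
```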

Comparison: Matched Conditions

- Trained on 410 TIMIT sentences from 40 speakers (126k frames)
- Evaluated on 100 TIMIT sentences from 10 speakers (28k frames)
- Speech was resampled to 8 kHz; phone labels were used as the voicing reference
- Also evaluated on the Keele database (laryngograph reference)
- The Bayes net performed poorly and, to save processing time, was not included in subsequent evaluations; note that published results are much better (the reported operating point from Saul is plotted for reference), and some newer features incorporated into the Bayesian network approach have not yet been added here
- The GMM classifier seems to do quite well, as might be expected given its discriminative nature

Sample Outputs: Matched Conditions

Example voicing tracks output by the individual methods: GMM with MFCCs, Bayesian network, sinusoid uncertainty, and autocorrelation.

[Figure: voiced/unvoiced tracks for each of the four methods]

The voicing tracks output by the third and fourth methods (sinusoid uncertainty and autocorrelation) appear smoother and more satisfactory.

Comparison: Mismatched Conditions

- Evaluated with several kinds of signal corruption
- The condition is not known a priori, so the same threshold as before is used; the threshold could be made adaptive to the environment, which is equivalent to (and more cleanly accomplished by) making the output probability adaptive
- Overall error rates are unsatisfactory
- The GMM classifier has the best performance on clean data but unpredictable results in varied conditions
- To do: also run the statistical methods on their own training data

[Figure: error rates for the GMM, sinusoid fit, and autocorrelation methods under each corruption]
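On the adaptive-threshold point: shifting a likelihood-ratio threshold is the same as adapting the class prior behind the output probability, as this small sketch shows (the prior value is an assumption, to be set per environment):

```python
import numpy as np

def voiced_decision(log_likelihood_ratio, p_voiced_prior=0.5):
    """Decide voiced/unvoiced from log p(x|V) - log p(x|UV).

    Shifting the decision threshold by log(p(UV)/p(V)) is equivalent to
    adapting the class prior: raising p_voiced_prior in clean conditions
    lowers the effective threshold, and vice versa in noise.
    """
    threshold = np.log((1.0 - p_voiced_prior) / p_voiced_prior)
    return log_likelihood_ratio > threshold
```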

Sample Outputs: Mismatched Conditions

Voicing tracks on an NTIMIT utterance (TIMIT speech passed through a telephone channel): GMM with MFCCs, Bayesian network, sinusoid uncertainty, and autocorrelation.

[Figure: TIMIT vs. NTIMIT voiced/unvoiced tracks for each method]

- Most of the degradation is in the 3-4 kHz range and below 500 Hz
- False alarms go up with the autocorrelation method

Conclusions and Future Work

- Error rates are still high compared with the literature
  - Post-processing could remove stray frames
  - There may be a problem with the scoring procedure
- Next step: a statistical framework with knowledge-based features
  - Weight the contributions of multiple detectors using an SNR-based variable
  - Apply the same approach to phonetic detectors for voiced speech:
    - Nasality: broad F1 bandwidth, low spectral slope in the F1-F2 region, stable low-frequency energy
    - Rounding: low F1 and F2
    - Retroflexion: low F3, rising formants
  - Combine the feature streams, with SNR-based weights, as input to an HMM
- Processing frames independently seems misguided
  - Want to try detecting voicing onsets and offsets, and classifying voiced segments rather than frames; this allows measurements to be combined across channels

References

- L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). "Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch," in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press.
- L. K. Saul, M. G. Rahim, and J. B. Allen (2001). "A statistical model for robust integration of narrowband cues in speech," Computer Speech and Language 15(2): 175-194.
- C. Wang and S. Seneff (2000). "Robust pitch tracking for prosodic modeling in telephone speech," in Proc. ICASSP '00, Istanbul, Turkey.
- S. Seneff (1985). "Pitch and spectral analysis of speech based on an auditory synchrony model," Ph.D. thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, MA.
- T. F. Quatieri (2002). "2-D processing of speech with application to pitch estimation," in Proc. ICSLP '02, Denver, Colorado.