Statistical and Signal Processing Approaches for Voicing Detection


1 Statistical and Signal Processing Approaches for Voicing Detection
Alex Park July 25th, 2003

2 Overview
- Motivation and background for voicing detection
- Overview of recent methods
  - Signal processing approaches
  - Statistical approaches
- Performance comparison of voicing detection methods
  - Detection error rates on a small task
  - Example outputs
- Conclusions and future work

3 Motivation
- Voicing is not necessary for speech understanding
  - E.g. whispered speech: excitation is provided by aspiration
  - E.g. sinewave speech: no periodic excitation; resonances are produced directly
- What is the value of adding voicing to the speech signal?
  - Separability? Pitch is useful for distinguishing between concurrent speakers and background
  - Redundancy? Harmonics provide regular structure from which speech can be detected in multiple bands
  - Robustness? Unvoiced speech has lower SNR than voiced speech
    - Whispering is intended to prevent unwanted listeners from hearing
    - Shouting and singing are not possible without voicing
    - Low frequencies are less attenuated over distance
- Current speech recognition systems typically discard voicing information in the front end because
  - Energy is environment dependent; pitch is speaker dependent
  - Vocal tract configuration carries most of the phonetic information

4 Background
- Voicing is produced by periodic vibration of the vocal folds
  - In time, voiced speech consists of repeated segments
  - In frequency, the spectrum has harmonic structure shaped by formant resonances
- Pitch estimation and the voicing decision can be made
  - In time, using the repetition rate and similarity of pitch periods
  - In frequency, using the spacing and relative height of harmonic peaks
- In theory the problem seems easy; in practice many issues make the decision difficult on real speech signals
  - Irregular pitch periods mean repetition is not exact (hampers temporal approaches)
  - Missing fundamental and resonant shaping (hampers spectral approaches)
[Figures: time-domain waveform showing irregular pitch periods; spectrum showing a missing fundamental]

5 Signal Processing Approaches
- Signal processing approaches are marked by the lack of a training phase
- Voicing detection is typically paired with pitch extraction
- Well-known approach: peak-picking (spectral or temporal)
  - Usually followed by smoothing of gross errors via dynamic programming
- Many proposed solutions:
  - Spectral: cepstral pitch tracking; harmonic product sum; logarithmic DFT pitch tracker (Wang)
  - Temporal: autocorrelation; sinusoid matching (Saul); synchrony (Seneff)
  - Exotic methods: image-based pitch tracking (Quatieri)
- Post-processing of voicing decisions generally reduces false alarms

6 I. Autocorrelation
- Temporal domain approach, used in the ESPS tool 'get_f0'
- Compute the inner product of the signal with a shifted version of itself
- If $s[n]$ is a speech frame of length $N$, the short-time autocorrelation is $R[k] = \sum_{n=0}^{N-1-k} s[n]\, s[n+k]$
- Peaks occur at multiples of the fundamental period (see the sketch below)
[Figures: a speech frame and its short-time autocorrelation]
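A minimal sketch of this idea in Python (not the actual get_f0 implementation): the sample rate, pitch search band, and the 0.3 peak threshold are illustrative assumptions, and the frame should span at least two pitch periods.

```python
import numpy as np

def autocorr_voicing(frame, fs=8000, fmin=60.0, fmax=400.0, peak_thresh=0.3):
    """Return (is_voiced, f0) for one speech frame."""
    frame = frame - np.mean(frame)                      # remove DC offset
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if r[0] <= 0.0:
        return False, 0.0                               # silent frame
    r = r / r[0]                                        # normalize so r[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)             # plausible pitch lags
    lag = lo + int(np.argmax(r[lo:hi]))
    # A strong autocorrelation peak at a non-zero lag indicates periodicity.
    return r[lag] > peak_thresh, fs / lag
```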

7 II. Band-limited Sinusoid Fitting (Saul 2002)
- Signal preconditioning: low-pass filter, then half-wave rectify (max(x, 0))
- Octave filterbank with 8 bands (lowest band 25-75 Hz); the filter bandwidths ensure at least one filter resolves a single harmonic
- A sliding temporal window frames each filtered signal; each frame is fit with a sinusoid of frequency $\omega_i^*$ and error $u_i^*$
- At each step, the lowest $u_i^*$ gives the voicing probability, $p(V) = f(u^*)$, and the corresponding $\omega^*$ gives the pitch estimate ($F_0 = \omega^*/2\pi$)
- The algorithm is fast and gives accurate pitch tracks (a least-squares version of the fitting step is sketched below)
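A hedged sketch of the per-band fitting step: Saul's algorithm solves for the best frequency in closed form, whereas this illustration simply scans a frequency grid and measures the normalized least-squares residual u*; the grid size and frequency range are assumptions.

```python
import numpy as np

def fit_sinusoid(x, fs=8000, fmin=50.0, fmax=400.0, n_grid=200):
    """Fit A*cos(wn) + B*sin(wn) to one band-limited frame.

    Returns (w_star, u_star): the best angular frequency (rad/sample)
    and the fraction of frame energy the sinusoid fails to explain
    (low u_star -> strongly periodic -> likely voiced).
    """
    n = np.arange(len(x), dtype=float)
    energy = float(np.dot(x, x)) + 1e-12
    best_w, best_u = 0.0, 1.0
    for f in np.linspace(fmin, fmax, n_grid):
        w = 2.0 * np.pi * f / fs
        M = np.column_stack([np.cos(w * n), np.sin(w * n)])  # design matrix
        coef, *_ = np.linalg.lstsq(M, x, rcond=None)
        resid = x - M @ coef
        u = float(np.dot(resid, resid)) / energy
        if u < best_u:
            best_w, best_u = w, u
    return best_w, best_u
```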

8 Statistical Approaches
- Statistical voicing detectors are not strictly dependent on spectral features (though those are the features most widely used)
- Training data is useful for capturing acoustic cues of voicing not explicitly specified in signal processing approaches
- Possible classifiers suitable for voicing detection include:
  - GMM classifier (with MFCC features)
  - Structured Bayesian network (alternative features)
  - Neural network classifier
  - Support vector machines

9 I. GMM Classifier
- Formulate voicing as a classic detection problem
- Train two GMMs, p(x|V) and p(x|UV), on frame-level feature vectors from transcribed speech
  - Features: 14 cepstral coefficients from each of 7 surrounding frames (capturing delta and delta-delta context), reduced to 50 dimensions via PCA
  - 50 mixtures per model
- Using Bayes' rule, the voicing score is a likelihood ratio: decide "voiced" when $\frac{p(x|V)\,p(V)}{p(x|UV)\,p(UV)} > 1$, i.e. when $p(V|x) > p(UV|x)$ (see the sketch below)
- The discriminative framework is useful because it uses knowledge of unvoiced speech characteristics in making the decision
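A minimal sketch of this detector with scikit-learn; the arrays X_voiced and X_unvoiced stand in for the stacked-MFCC training frames described above, and the diagonal covariance type is an assumption (the slide does not specify it).

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

def train_voicing_gmms(X_voiced, X_unvoiced, n_mix=50, n_dims=50):
    """Fit PCA on all frames, then one GMM per class in the reduced space."""
    pca = PCA(n_components=n_dims).fit(np.vstack([X_voiced, X_unvoiced]))
    gmm_v = GaussianMixture(n_components=n_mix, covariance_type="diag")
    gmm_uv = GaussianMixture(n_components=n_mix, covariance_type="diag")
    gmm_v.fit(pca.transform(X_voiced))
    gmm_uv.fit(pca.transform(X_unvoiced))
    return pca, gmm_v, gmm_uv

def voicing_llr(X, pca, gmm_v, gmm_uv):
    """Per-frame log-likelihood ratio log p(x|V) - log p(x|UV).

    Positive values mean 'voiced' under equal priors; shifting the
    threshold by log(p(UV)/p(V)) incorporates unequal priors.
    """
    Z = pca.transform(X)
    return gmm_v.score_samples(Z) - gmm_uv.score_samples(Z)
```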

10 II. Bayesian Network (Saul/Rahim/Allen 1999)
- Signal preconditioning: half-wave rectify (max(x, 0)), then a 24-channel auditory (gammatone) filterbank
- A feature vector is constructed for each frame of each narrowband signal: autocorrelation peaks and valleys plus an SNR estimate = 5 dims/band/frame
- Individual voicing decisions are made on each channel; the sigmoid decision weights (the theta's) are trained via the EM algorithm
- The overall voicing decision is triggered by positive detections in individual channels, combined through AND and OR layers (sketched below)
- What is notable about this approach:
  - The front-end features
  - The structure of the network (avoids false positives via AND, false negatives via OR)
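A hedged sketch of the decision structure only: sigmoid channel tests followed by a soft AND (product of probabilities) and a soft OR (noisy-or). The weights and the channel grouping here are placeholders; in the paper the weights are learned with EM, and the actual network topology may differ.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_probs(feats, theta, bias):
    """Per-channel voicing probabilities.

    feats: (n_channels, 5) features (autocorr peaks/valleys + SNR estimate);
    theta: (n_channels, 5) and bias: (n_channels,) would be trained via EM.
    """
    return sigmoid(np.sum(feats * theta, axis=1) + bias)

def combine(p, groups):
    """Soft AND within each channel group, soft OR across groups.

    AND = product of channel probabilities; OR = 1 - product of complements.
    'groups' (lists of channel indices) is an illustrative assumption.
    """
    and_probs = [float(np.prod(p[g])) for g in groups]
    return 1.0 - float(np.prod([1.0 - q for q in and_probs]))
```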

11 Comparison: Matched Conditions
- Trained on 410 TIMIT sentences from 40 speakers (126k frames)
- Evaluated on 100 TIMIT sentences from 10 speakers (28k frames); speech was resampled to 8 kHz, with phone labels used as the voicing reference
- Also evaluated on the Keele database (laryngograph reference)
- The Bayes net performed poorly here and was dropped from subsequent evaluations to save processing time
  - Note that published results are much better; some newer features from the Bayesian network approach have not yet been incorporated
  - The reported operating point (Saul) is shown for comparison
- The GMM classifier does quite well, perhaps owing to its discriminative nature
- Frame-level scoring is sketched below
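A minimal sketch of the frame-level scoring implied by this comparison; the default threshold of 0 corresponds to a log-likelihood-ratio detector with equal priors and is an assumption.

```python
import numpy as np

def voicing_error_rates(scores, labels, threshold=0.0):
    """Frame-level miss and false-alarm rates for a voicing detector.

    scores: per-frame detector outputs (e.g. log-likelihood ratios);
    labels: reference voicing per frame (True = voiced), e.g. derived
    from TIMIT phone labels or a laryngograph trace.
    """
    scores = np.asarray(scores)
    labels = np.asarray(labels, dtype=bool)
    decisions = scores > threshold
    miss = float(np.mean(~decisions[labels])) if labels.any() else 0.0
    false_alarm = float(np.mean(decisions[~labels])) if (~labels).any() else 0.0
    return miss, false_alarm
```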

12 Sample Outputs: Matched Conditions
- Example voicing tracks output by the individual methods: GMM w/ MFCCs, Bayesian network, sinusoid uncertainty, and autocorrelation
- The tracks from the third and fourth methods (sinusoid uncertainty and autocorrelation) appear smoother and more satisfactory
[Figure: voiced/unvoiced tracks for the four methods]

13 Comparison: Mismatched Conditions
- Evaluated under different kinds of signal corruption (GMM, sinusoid fit, and autocorrelation methods)
- The corruption condition is not known a priori, so the same threshold as before is used
  - The threshold could be made adaptive to the environment, but this is equivalent to, and more desirably accomplished by, making the output probability adaptive
- Overall error rates are unsatisfactory
- The GMM classifier has the best performance on clean data, but unpredictable results in varied conditions
- To do: also score the statistical methods on their training data

14 Sample Outputs: Mismatched Conditions
- Voicing tracks on an NTIMIT utterance, compared against the clean TIMIT version: GMM w/ MFCCs, Bayesian network, sinusoid uncertainty, and autocorrelation
- Most of the degradation is in the 3-4 kHz range and below 500 Hz
- False alarms increase for the autocorrelation method
[Figure: voiced/unvoiced tracks on TIMIT vs. NTIMIT]

15 Conclusions and Future Work
- Error rates are still high compared with the literature
  - Post-processing to remove stray frames may help
  - Possible problem with the scoring procedure?
- Statistical framework with knowledge-based features
  - Weight the contributions of multiple detectors using an SNR-based variable
  - Using the same approach, apply it to phonetic detectors for voiced speech:
    - Nasality: broad F1 bandwidth, low spectral slope in the F1-F2 region, stable low-frequency energy
    - Rounding: low F1 and F2
    - Retroflexion: low F3, rising formants
  - Combine feature streams with SNR-based weights as input to an HMM
- Processing frames independently seems misguided
  - Instead, try detecting voicing onsets and offsets, and classify voiced segments rather than frames
  - This allows combining measurements across channels

16 References
- L. K. Saul, D. D. Lee, C. L. Isbell, and Y. LeCun (2003). "Real time voice processing with audiovisual feedback: Toward autonomous agents with perfect pitch," in S. Becker, S. Thrun, and K. Obermayer (eds.), Advances in Neural Information Processing Systems. MIT Press: Cambridge, MA.
- L. K. Saul, M. G. Rahim, and J. B. Allen (2001). "A statistical model for robust integration of narrowband cues in speech," Computer Speech and Language 15(2).
- C. Wang and S. Seneff (2000). "Robust Pitch Tracking for Prosodic Modeling in Telephone Speech," in Proc. ICASSP '00, Istanbul, Turkey.
- S. Seneff (1985). "Pitch and spectral analysis of speech based on an auditory synchrony model," Ph.D. thesis, Dept. of Electrical Engineering, M.I.T., Cambridge, MA.
- T. F. Quatieri (2002). "2-D Processing of Speech with Application to Pitch Estimation," in Proc. ICSLP '02, Denver, Colorado.

