Auditory Segmentation and Unvoiced Speech Segregation DeLiang Wang & Guoning Hu Perception & Neurodynamics Lab The Ohio State University.

Presentation transcript:

Auditory Segmentation and Unvoiced Speech Segregation DeLiang Wang & Guoning Hu Perception & Neurodynamics Lab The Ohio State University

2 Outline of presentation
- Introduction
- Auditory scene analysis
- Unvoiced speech problem
- Auditory segmentation based on event detection
- Unvoiced speech segregation
- Summary

3 Speech segregation
- In a natural environment, speech is usually corrupted by acoustic interference. Speech segregation is critical for many applications, such as automatic speech recognition and hearing prosthesis.
- Most speech separation techniques, e.g. beamforming and blind source separation via independent component analysis, require multiple sensors. However, such techniques have clear limits:
  - They suffer from the assumption of configuration stationarity
  - They cannot deal with single-microphone mixtures
- Most speech enhancement techniques developed for the monaural situation can deal only with stationary acoustic interference

4 Auditory scene analysis (ASA)
- The auditory system shows a remarkable capacity for monaural segregation of sound sources in the perceptual process of auditory scene analysis (ASA)
- ASA takes place in two conceptual stages (Bregman'90):
  - Segmentation: decompose the acoustic signal into 'sensory elements' (segments)
  - Grouping: combine segments into streams, so that the segments of the same stream likely originate from the same source

5 Computational auditory scene analysis
- Computational ASA (CASA) approaches sound separation based on ASA principles
- CASA successes: monaural segregation of voiced speech
- A main challenge is segregation of unvoiced speech, which lacks the periodicity cue

6 Unvoiced speech
- Speech sounds consist of vowels and consonants; the latter are further divided into voiced and unvoiced consonants
- For English, the relative frequencies of different phoneme categories are (Dewey'23):
  - Vowels: 37.9%
  - Voiced consonants: 40.3%
  - Unvoiced consonants: 21.8%
- In terms of time duration, unvoiced consonants account for about 1/5 of American English speech
- Consonants are crucial for speech recognition

7 Ideal binary mask as CASA goal
- The key idea is to retain the parts of a target sound that are stronger than the acoustic background, or to mask interference by the target
- Broadly consistent with auditory masking and speech intelligibility results
- Within a local time-frequency (T-F) unit, the ideal binary mask is 1 if the target energy is stronger than the interference energy, and 0 otherwise
- Local 0 dB SNR criterion for mask generation
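As a minimal illustration of this criterion (not code from the talk), the ideal binary mask can be computed directly when the premixed target and interference energies are available; the function and array names below are assumptions.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    """Compute an ideal binary mask over a time-frequency grid.

    target_energy, interference_energy: arrays of per-unit energy
    (channels x frames), obtained from the premixed signals.
    lc_db: local SNR criterion in dB (0 dB in the talk).
    """
    eps = np.finfo(float).tiny
    local_snr_db = 10.0 * np.log10((target_energy + eps) /
                                   (interference_energy + eps))
    # A unit is kept (1) when the target dominates the interference.
    return (local_snr_db > lc_db).astype(np.uint8)
```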

8 Ideal binary masking illustration
Utterance: "That noise problem grows more annoying each day"
Interference: crowd noise with music (0 dB SNR)

9 Outline of presentation
- Introduction
- Auditory scene analysis
- Unvoiced speech problem
- Auditory segmentation based on event detection
- Unvoiced speech segregation
- Summary

10 Auditory segmentation
- Our approach to unvoiced speech segregation breaks the problem into two stages: segmentation and grouping. This presentation is mainly about segmentation
- The task of segmentation is to decompose an auditory scene into contiguous T-F regions, each of which should contain signal from the same event
  - It should work for both voiced and unvoiced sounds
- This is equivalent to identifying onsets and offsets of individual T-F regions, which generally correspond to sudden changes of acoustic energy
- Our segmentation strategy is based on onset and offset analysis of auditory events

11 What is an auditory event?
- To define an auditory event, two perceptual effects need to be considered:
  - Audibility
  - Auditory masking
- We define an auditory event as a collection of the audible T-F regions from the same sound source that are stronger than the combined intrusions
- Hence the computational goal of segmentation is to produce the segments, or contiguous T-F regions, of an auditory event
- For speech, a segment corresponds to a phone

12 Cochleogram as a peripheral representation
- We decompose the acoustic input using a gammatone filterbank
  - 128 filters with center frequencies from 50 Hz to 8 kHz
  - Filter outputs are processed in 20-ms time frames with a 10-ms frame shift
- The intensity output forms what we call a cochleogram
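A small sketch of the framing step, assuming the gammatone filter responses have already been computed; the function below and its arguments are illustrative, not the authors' code.

```python
import numpy as np

def cochleogram(filter_outputs, fs, frame_ms=20.0, shift_ms=10.0):
    """Log-intensity cochleogram from gammatone filter outputs.

    filter_outputs: array of shape (n_channels, n_samples), e.g. the
    responses of 128 gammatone filters spanning 50 Hz to 8 kHz.
    """
    frame_len = int(fs * frame_ms / 1000.0)
    shift = int(fs * shift_ms / 1000.0)
    n_ch, n_samp = filter_outputs.shape
    n_frames = 1 + max(0, (n_samp - frame_len) // shift)
    coch = np.empty((n_ch, n_frames))
    for m in range(n_frames):
        frame = filter_outputs[:, m * shift: m * shift + frame_len]
        # Energy within each 20-ms frame, on a log scale.
        coch[:, m] = 10.0 * np.log10(np.sum(frame ** 2, axis=1) + 1e-12)
    return coch
```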

13 Cochleogram and ideal segments

14 Scale-space analysis for auditory segmentation
- From a computational standpoint, auditory segmentation is similar to image segmentation
  - Image segmentation: finding bounding contours of visual objects
  - Auditory segmentation: finding onset and offset fronts of segments
- Our onset/offset analysis employs scale-space theory, a multiscale analysis commonly used in image segmentation
- The proposed system performs the following computations:
  - Smoothing
  - Onset/offset detection and matching
  - Multiscale integration

15 Smoothing
- For each filter channel, the intensity is smoothed over time to reduce intensity fluctuations
- An event tends to have onset and offset synchrony in the frequency domain. Consequently, the intensity is further smoothed over frequency to enhance common onsets and offsets in adjacent frequency channels
- Smoothing is done via diffusion

16 Smoothing via diffusion
- A one-dimensional diffusion of a quantity v across the spatial dimension x is governed by a diffusion equation in which D is a function controlling the diffusion process
- As t increases, v gradually smooths over x
- The diffusion time t is called the scale parameter, and the smoothed values of v at different times compose a scale space
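The equation on this slide does not survive in the transcript; the standard form of one-dimensional diffusion with diffusivity D, which the description implies, is:

```latex
\frac{\partial v}{\partial t} \;=\; \frac{\partial}{\partial x}\!\left( D \,\frac{\partial v}{\partial x} \right)
```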

17 Diffusion
- Let the input intensity be the initial value of v, and let v diffuse across time frames m and filter channels c following the diffusion equation above
- I(c, m) is the logarithmic intensity in channel c at frame m

18 Diffusion, continued
- Two forms of D_m(v) are employed in the time domain:
  - D_m(v) = 1, which reduces to Gaussian smoothing
  - Perona-Malik ('90) anisotropic diffusion
- Compared with Gaussian smoothing, the Perona-Malik model may identify onset and offset positions better
- In the frequency domain, D_c(v) = 1
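As a rough illustration (not the authors' implementation), the time-domain smoothing can be realized as an explicit finite-difference diffusion along frames. The diffusivity 1 / (1 + (gradient/K)^2) is one of the two forms proposed by Perona and Malik ('90); the step size and iteration count below are assumed values.

```python
import numpy as np

def diffuse_over_time(intensity, n_steps=32, dt=0.2, K=None):
    """Smooth each channel's log-intensity over time frames via diffusion.

    intensity: array (n_channels, n_frames) of log intensities I(c, m).
    If K is None, D_m = 1 (Gaussian-like smoothing); otherwise the
    Perona-Malik diffusivity 1 / (1 + (grad/K)^2) is used.
    """
    v = intensity.astype(float).copy()
    for _ in range(n_steps):
        # Forward differences along the frame axis (zero-flux boundaries).
        grad = np.diff(v, axis=1)
        if K is None:
            flux = grad                                   # D_m(v) = 1
        else:
            flux = grad / (1.0 + (grad / K) ** 2)         # Perona-Malik
        div = np.zeros_like(v)
        div[:, :-1] += flux
        div[:, 1:] -= flux
        v += dt * div
    return v
```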

19 Diffusion results
Top: initial intensity. Middle and bottom: two scales for Gaussian smoothing (dashed line) and anisotropic diffusion (solid line)

20 Onset/offset detection and matching
- At each scale, onset and offset candidates are detected by identifying peaks and valleys of the first-order time derivative of v
- Detected candidates are combined into onset and offset fronts, which form vertical curves
- Individual onset and offset fronts are matched to yield segments
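A minimal per-channel sketch of the candidate detection step; the derivative-peak threshold `min_rate` is an assumed parameter, as the talk does not specify one.

```python
import numpy as np
from scipy.signal import find_peaks

def onset_offset_candidates(v_smoothed, min_rate=0.05):
    """Detect onset/offset candidates in one channel's smoothed intensity.

    v_smoothed: 1-D array of smoothed log intensity over time frames.
    Returns frame indices of onset candidates (derivative peaks) and
    offset candidates (derivative valleys).
    """
    dv = np.gradient(v_smoothed)
    onsets, _ = find_peaks(dv, height=min_rate)     # rising-intensity peaks
    offsets, _ = find_peaks(-dv, height=min_rate)   # falling-intensity valleys
    return onsets, offsets
```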

21 Multiscale integration
- The system integrates segments generated at different scales iteratively:
  - First, it produces segments at a coarse scale (more smoothing)
  - Then, at a finer scale, it locates more accurate onset and offset positions for these segments. In addition, new segments may be produced
- The advantage of multiscale integration is that it analyzes an auditory scene at different levels of detail, so as to detect and localize auditory segments at appropriate scales

22 Segmentation at different scales
Input: mixture of speech and crowd noise with music
Scales (t_c, t_m) are: (a) (32, 200); (b) (18, 200); (c) (32, 100); (d) (18, 100)

23 Evaluation
- How to quantitatively evaluate segmentation results is a complex issue, since one has to consider various types of mismatch between a collection of ideal segments and that of computed segments
- Here we adapt a region-based definition by Hoover et al. ('96), originally proposed for evaluating image segmentation systems
- Based on the degree of overlap (defined by a threshold), we label a T-F region as belonging to one of five classes:
  - Correct
  - Under-segmented. Under-segmentation is not really an error, because it produces larger segments, which is good for subsequent grouping
  - Over-segmented
  - Missing
  - Mismatching

24 Illustration of different classes
Ovals (Arabic numerals) indicate ideal segments and rectangles (Roman numerals) indicate computed segments. Different colors indicate different classes

25 Quantitative measures
- Let E_C, E_U, E_O, E_M, and E_I be the summed energy in all the regions labeled as correct, under-segmented, over-segmented, missing, and mismatching, respectively. Let E_GT be the total energy of all ideal segments and E_S that of all estimated segments
- The percentage of correctness: P_C = E_C / E_GT × 100%
- The percentage of under-segmentation: P_U = E_U / E_GT × 100%
- The percentage of over-segmentation: P_O = E_O / E_GT × 100%
- The percentage of mismatch: P_I = E_I / E_S × 100%
- The percentage of missing: P_M = (1 − P_C − P_U − P_O) × 100%
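A direct transcription of these measures into code (names are illustrative):

```python
def segmentation_scores(E_C, E_U, E_O, E_I, E_GT, E_S):
    """Percentage scores from summed region energies, as defined above."""
    P_C = 100.0 * E_C / E_GT        # correct
    P_U = 100.0 * E_U / E_GT        # under-segmented
    P_O = 100.0 * E_O / E_GT        # over-segmented
    P_I = 100.0 * E_I / E_S         # mismatching (relative to estimated segments)
    P_M = 100.0 - P_C - P_U - P_O   # missing
    return dict(P_C=P_C, P_U=P_U, P_O=P_O, P_I=P_I, P_M=P_M)
```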

26 Evaluation corpus
- 20 utterances from the TIMIT database
- 10 types of intrusion: white noise, electrical fan, rooster crowing and clock alarm, traffic noise, crowd in playground, crowd with music, crowd clapping, bird chirping and water flow, wind, and rain

27 Results on all phonemes
Results are shown as a function of the overlap threshold, with 0 dB mixtures and anisotropic diffusion

28 Results on stops, fricatives, and affricates

29 Results with different mixture SNRs
P_C and P_U are combined here, since P_U is not really an error

30 Comparisons
Comparisons are made between anisotropic diffusion and Gaussian smoothing, as well as with the Wang-Brown model (1999), which deals mainly with voiced segments using cross-channel correlation. Mixtures are at 0 dB SNR

31 Outline of presentation
- Introduction
- Auditory scene analysis
- Unvoiced speech problem
- Auditory segmentation based on event detection
- Unvoiced speech segregation
- Summary

32 Speech segregation
- The general strategy for speech segregation is to first segregate voiced speech using the pitch cue, and then deal with unvoiced speech
- Voiced speech segregation is performed using our recent model (Hu & Wang'04):
  - The model generates segments for voiced speech using cross-channel correlation and temporal continuity
  - It groups segments according to periodicity and amplitude modulation
- To segregate unvoiced speech, we perform auditory segmentation and then group the segments that correspond to unvoiced speech

33 Segment classification
- For nonspeech interference, grouping is in fact a classification task: to classify segments as either speech or non-speech
- The following features are used for classification:
  - Spectral envelope
  - Segment duration
  - Segment intensity
- Training data
  - Speech: training part of the TIMIT database
  - Interference: 90 natural intrusions including street noise, crowd noise, wind, etc.
- A Gaussian mixture model is trained for each phoneme, and for interference as well, which provides the basis for a likelihood ratio test
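A hedged sketch of such a classifier using scikit-learn; feature extraction is assumed to be done elsewhere, and the number of mixture components and the decision threshold are illustrative values, not taken from the talk.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(phoneme_features, interference_features, n_components=8):
    """Fit one GMM per phoneme class and one GMM for interference.

    phoneme_features: dict mapping phoneme label -> feature matrix
    (observations x feature dims); interference_features: feature matrix.
    """
    phone_gmms = {p: GaussianMixture(n_components).fit(X)
                  for p, X in phoneme_features.items()}
    noise_gmm = GaussianMixture(n_components).fit(interference_features)
    return phone_gmms, noise_gmm

def is_speech_segment(x, phone_gmms, noise_gmm, threshold=0.0):
    """Likelihood ratio test: label a segment as speech when the best
    phoneme model beats the interference model by `threshold` in log-likelihood."""
    x = np.atleast_2d(x)
    best_phone_ll = max(g.score_samples(x).sum() for g in phone_gmms.values())
    noise_ll = noise_gmm.score_samples(x).sum()
    return best_phone_ll - noise_ll > threshold
```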

34 Demo for fricatives and affricates
Utterance: "That noise problem grows more annoying each day"
Interference: crowd noise with music (IBM: ideal binary mask)

35 Demo for stops
Utterance: "A good morrow to you, my boy"
Interference: rain

36 Summary
- We have proposed a model for auditory segmentation based on a multiscale analysis of onsets and offsets
- The model segments both voiced and unvoiced speech sounds
- The general strategy for unvoiced (and voiced) speech segregation is to first perform segmentation and then group segments using various ASA cues
- Sequential organization of segments into streams is not addressed
- How well can people organize unvoiced speech?