An Auditory Scene Analysis Approach to Speech Segregation DeLiang Wang Perception and Neurodynamics Lab The Ohio State University

Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA

Real-world audition
- What? Source type: speech (message; speaker age, gender, linguistic origin, mood, ...), music, car passing by
- Where? Left, right, up, down; how close?
- Channel characteristics
- Environment characteristics: room configuration, ambient noise

Humans versus machines
(Table: human vs. machine recognition performance at 10 dB and 0 dB SNR; source: Lippmann, 1997)
- Human word error rate at 0 dB SNR is around 1%, as opposed to 100% for unmodified recognisers (around 40% with noise adaptation)
- Additionally: car noise is not a very effective speech masker

Speech segregation problem
- In a natural environment, speech is usually corrupted by acoustic interference. Speech segregation is critical for many applications, such as automatic speech recognition and hearing prosthesis
- Most speech separation techniques, e.g. beamforming and blind source separation via independent component analysis, require multiple sensors. However, such techniques have clear limits:
  - They assume a stationary spatial configuration of sources and sensors
  - They cannot deal with single-microphone mixtures or situations where multiple sounds arrive from similar directions
- Most speech enhancement methods developed for the monaural situation can deal only with stationary acoustic interference

Auditory scene analysis (Bregman'90)
- Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source
  - Ball-room problem, Helmholtz, 1863 ("complicated beyond conception")
  - Cocktail-party problem, Cherry'53
- Two conceptual processes of auditory scene analysis (ASA):
  - Segmentation: decompose the acoustic mixture into sensory elements (segments)
  - Grouping: combine segments into groups, so that segments in the same group are likely to have originated from the same environmental source

Computational auditory scene analysis
- Computational ASA (CASA) systems approach sound separation based on ASA principles
  - Weintraub'85, Cooke'93, Brown & Cooke'94, Ellis'96, Wang & Brown'99
- CASA progress: monaural segregation with minimal assumptions
- CASA challenges:
  - Broadband high-frequency mixtures
  - Reliable pitch tracking of noisy speech
  - Unvoiced speech

Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA

Resolved and unresolved harmonics
- For voiced speech, lower harmonics are resolved while higher harmonics are not
- For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech
- Our model (Hu & Wang'04) applies different grouping mechanisms for low-frequency and high-frequency signals:
  - Low-frequency signals are grouped based on periodicity and temporal continuity
  - High-frequency signals are grouped based on amplitude modulation (AM) and temporal continuity

Diagram of the Hu-Wang model

Cochleogram: Auditory peripheral model
- Spectrogram: plot of log energy across time and frequency (linear frequency scale)
- Cochleogram: cochlear filtering by the gammatone filterbank (or other models of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction by either a hair cell model or simple compression operations (log and cube root); a minimal sketch follows
  - Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
  - Previous work suggests better resilience to noise than the spectrogram
(Figure: spectrogram vs. cochleogram of the same utterance)
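A minimal cochleogram sketch, not the Hu-Wang implementation: a 4th-order gammatone filterbank applied by direct convolution with impulse responses, followed by half-wave rectification and cube-root compression of per-frame energy. The channel count, frame sizes, and the Glasberg-Moore ERB formula are illustrative choices.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Hz) of a filter centred at fc (Hz)."""
    return 24.7 + 0.108 * fc

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    """Gammatone impulse response: t^(n-1) exp(-2 pi b ERB(fc) t) cos(2 pi fc t)."""
    t = np.arange(int(duration * fs)) / fs
    return t**(order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def cochleogram(x, fs, n_chan=64, fmin=80.0, fmax=5000.0, frame=0.020, hop=0.010):
    # Centre frequencies equally spaced on the ERB-rate scale.
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    inv_erb_rate = lambda e: (10**(e / 21.4) - 1.0) / 4.37e-3
    cfs = inv_erb_rate(np.linspace(erb_rate(fmin), erb_rate(fmax), n_chan))
    flen, fhop = int(frame * fs), int(hop * fs)
    n_frames = 1 + (len(x) - flen) // fhop
    cg = np.zeros((n_chan, n_frames))
    for c, fc in enumerate(cfs):
        y = np.convolve(x, gammatone_ir(fc, fs), mode="same")
        y = np.maximum(y, 0.0)                        # crude hair-cell stage: half-wave rectification
        for m in range(n_frames):
            seg = y[m * fhop: m * fhop + flen]
            cg[c, m] = np.sum(seg**2) ** (1.0 / 3.0)  # cube-root compression of frame energy
    return cfs, cg
```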

Mid-level auditory representations
- Mid-level representations form the basis for segment formation and subsequent grouping
- Correlogram extracts periodicity and AM from simulated auditory nerve firing patterns
- Summary correlogram is used to identify global pitch
- Cross-channel correlation between adjacent correlogram channels identifies regions that are excited by the same harmonic or formant

Correlogram
- Short-term autocorrelation of the output of each frequency channel of the cochleogram
- Peaks in the summary correlogram indicate pitch periods (F0)
- A standard model of pitch perception
(Figure: correlogram and summary correlogram of a double vowel, showing the two F0s)
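A sketch of the correlogram stage, assuming `y` is a list of per-channel auditory-nerve responses (for instance, the rectified gammatone outputs of the sketch above). The frame length and the 80-320 Hz pitch search range are illustrative.

```python
import numpy as np

def frame_acf(sig, max_lag):
    """Short-term autocorrelation of one frame, for lags 0..max_lag-1."""
    return np.array([np.sum(sig[:len(sig) - k] * sig[k:]) for k in range(max_lag)])

def correlogram_frame(y, fs, t0, frame=0.020, max_lag=None):
    flen = int(frame * fs)
    max_lag = max_lag or flen
    start = int(t0 * fs)
    acfs = np.array([frame_acf(ch[start:start + flen], max_lag) for ch in y])
    summary = acfs.sum(axis=0)                 # summary correlogram across channels
    lo, hi = int(fs / 320), int(fs / 80)       # plausible pitch period range (80-320 Hz)
    period = lo + np.argmax(summary[lo:hi])    # global pitch period, in samples
    return acfs, summary, period
```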

Cross-channel correlation (a) Correlogram and cross-channel correlation of hair cell response to clean speech (b) Corresponding representations for response envelopes
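A sketch of cross-channel correlation: the normalised correlation between the autocorrelation functions of adjacent channels, used to identify regions driven by the same harmonic or formant. `acfs` is the per-channel ACF matrix from the correlogram sketch; the 0.985 grouping threshold is illustrative.

```python
import numpy as np

def cross_channel_correlation(acfs):
    z = acfs - acfs.mean(axis=1, keepdims=True)
    z /= (np.linalg.norm(z, axis=1, keepdims=True) + 1e-12)
    return np.sum(z[:-1] * z[1:], axis=1)      # correlation of channel c with channel c+1

# Example: corr = cross_channel_correlation(acfs); same_source = corr > 0.985
```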

Initial segregation
- Segments are formed based on temporal continuity and cross-channel correlation
- Segments generated in this stage tend to reflect resolved harmonics, but not unresolved ones
- Initial grouping into a foreground (target) stream and a background stream according to global pitch, using the oscillatory correlation model of Wang and Brown (1999)

Pitch tracking
- Pitch periods of target speech are estimated from the segregated speech stream
- Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints:
  - Target pitch should agree with the periodicity of the time-frequency units in the initial speech stream
  - Pitch periods change smoothly, thus allowing for verification and interpolation (a sketch of this check follows)
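A minimal sketch of the smoothness constraint: frames whose pitch period jumps by more than a relative threshold from the previous frame are treated as unreliable and re-filled by linear interpolation. The 20% threshold is an illustrative value, not the one used in the Hu-Wang system.

```python
import numpy as np

def smooth_pitch_track(periods, max_rel_jump=0.2):
    """periods: per-frame pitch period estimates (in samples) for voiced frames."""
    p = np.asarray(periods, dtype=float)
    ok = np.ones(len(p), dtype=bool)
    for m in range(1, len(p)):
        if abs(p[m] - p[m - 1]) > max_rel_jump * p[m - 1]:
            ok[m] = False                          # violates smooth pitch change
    idx = np.arange(len(p))
    p[~ok] = np.interp(idx[~ok], idx[ok], p[ok])   # interpolate unreliable frames
    return p
```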

Pitch tracking example (a) Global pitch (Line: pitch track of clean speech) for a mixture of target speech and ‘cocktail-party’ intrusion (b) Estimated target pitch

T-F unit labeling
- In the low-frequency range:
  - A time-frequency (T-F) unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch
- In the high-frequency range:
  - Due to their wide bandwidths, high-frequency filters respond to multiple harmonics. These responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863)
  - A T-F unit in the high-frequency range is labeled by comparing its AM repetition rate with the estimated target pitch
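A sketch of low-frequency unit labelling, assuming `acf` is the autocorrelation of one T-F unit and `period` is the estimated target pitch period in samples. The normalisation by the zero-lag value and the 0.85 threshold are illustrative; the original system compares against the ACF maximum over plausible pitch lags.

```python
def label_unit(acf, period, theta=0.85):
    """Label a unit as target (1) if its response is periodic at the target pitch."""
    return int(acf[period] / (acf[0] + 1e-12) > theta)
```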

AM example (a) The output of a gammatone filter (center frequency: 2.6 kHz) in response to clean speech (b) The corresponding autocorrelation function

AM repetition rates
- To obtain AM repetition rates, a filter response is half-wave rectified and bandpass filtered
- The resulting signal within a T-F unit is modeled by a single sinusoid using the gradient descent method. The frequency of the sinusoid indicates the AM repetition rate of the corresponding response
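A minimal sketch of estimating the AM repetition rate within one T-F unit by fitting a single sinusoid to the half-wave rectified response. Unlike the original model, amplitude and phase are solved in closed form and the frequency is found by a coarse search rather than gradient descent, and the bandpass-filtering stage is omitted; the 80-320 Hz range and grid size are illustrative.

```python
import numpy as np

def am_rate(r, fs, f_lo=80.0, f_hi=320.0, n_grid=200):
    """r: filter response within one T-F unit; returns estimated AM rate in Hz."""
    r = np.maximum(r, 0.0)                     # half-wave rectification
    r = r - r.mean()
    t = np.arange(len(r)) / fs
    best_f, best_err = f_lo, np.inf
    for f in np.linspace(f_lo, f_hi, n_grid):  # coarse search over candidate AM rates
        X = np.column_stack([np.cos(2 * np.pi * f * t), np.sin(2 * np.pi * f * t)])
        coef, _, _, _ = np.linalg.lstsq(X, r, rcond=None)
        err = np.sum((r - X @ coef) ** 2)      # residual of the single-sinusoid fit
        if err < best_err:
            best_f, best_err = f, err
    return best_f
```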

Final segregation
- New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e. common AM). They are then grouped into the foreground stream according to AM repetition rates
- Other units are grouped according to temporal and spectral continuity

Ideal binary mask for performance evaluation
- Within a T-F unit, the ideal binary mask is 1 if the target energy is stronger than the interference energy, and 0 otherwise
- Motivation: auditory masking, in which a stronger signal masks a weaker one within a critical band
- We have suggested using ideal binary masks as ground truth for CASA performance evaluation
- Consistent with recent speech intelligibility results (Roman et al.'03; Brungart et al.'05)
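A sketch of ideal binary mask construction, assuming the target and interference signals are available separately before mixing (the mask is an oracle construct used as ground truth, not something computable from the mixture). `cochleogram` refers to the sketch above; comparing its compressed energies is equivalent to comparing the energies themselves.

```python
import numpy as np

def ideal_binary_mask(target, noise, fs):
    _, e_target = cochleogram(target, fs)
    _, e_noise = cochleogram(noise, fs)
    return (e_target > e_noise).astype(int)  # 1 where target energy exceeds interference energy
```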

Ideal binary mask illustration

Voiced speech segregation example

Systematic SNR results
- Evaluation on a corpus of 100 mixtures (Cooke, 1993): 10 voiced utterances x 10 noise intrusions (see next slide)
- Average SNR gain: 12.3 dB; 5.2 dB better than the Wang-Brown model (1999), and 6.4 dB better than the spectral subtraction method
(Figure: SNR in dB of the Hu-Wang model across intrusion types)
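A sketch of how an SNR gain of this kind is typically computed: output SNR, measured between the clean target and the resynthesized (segregated) signal, minus the SNR of the unprocessed mixture. The resynthesis step itself is omitted here.

```python
import numpy as np

def snr_db(signal, estimate):
    """SNR in dB, treating the deviation from `signal` as noise."""
    return 10 * np.log10(np.sum(signal**2) / (np.sum((signal - estimate)**2) + 1e-12))

def snr_gain(target, mixture, segregated):
    return snr_db(target, segregated) - snr_db(target, mixture)
```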

CASA progress on voiced speech segregation
- 100-mixture set used by Cooke (1993): 10 voiced utterances mixed with 10 noise intrusions (N0: tone, N1: white noise, N2: noise bursts, N3: 'cocktail party', N4: rock music, N5: siren, N6: telephone, N7: female utterance, N8: male utterance, N9: female utterance)
(Audio demos: original mixtures of voiced speech with telephone, male, and female intrusions, segregated by Cooke (1993), Ellis (1996), Wang & Brown (1999), and Hu & Wang (2004))

Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA

Segmentation and unvoiced speech segregation
- To deal with unvoiced speech segregation, we (Hu & Wang'04) proposed a model of auditory segmentation that applies to both voiced and unvoiced speech
- The task of segmentation is to decompose an auditory scene into contiguous T-F regions, each of which should contain signal from the same sound source
- The definition of segmentation does not distinguish between voiced and unvoiced sounds
- This is equivalent to identifying onsets and offsets of individual T-F regions, which generally correspond to sudden changes of acoustic energy
- The segmentation strategy is based on onset and offset analysis

Scale-space analysis for auditory segmentation
- From a computational standpoint, auditory segmentation is similar to image (visual) segmentation
  - Visual segmentation: finding bounding contours of visual objects
  - Auditory segmentation: finding onset and offset fronts of segments
- Onset/offset analysis employs scale-space theory, a multiscale analysis commonly used in image segmentation, in three stages (the first two are sketched below):
  - Smoothing
  - Onset/offset detection and onset/offset front matching
  - Multiscale integration
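A sketch of the smoothing and onset/offset detection idea: smooth each channel's intensity envelope with Gaussians of increasing width, then mark onsets (offsets) at peaks (valleys) of the time derivative. The scales and the peak threshold are illustrative; front matching and multiscale integration are omitted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter1d

def onsets_offsets(intensity, scales=(2.0, 4.0, 8.0), thresh=0.05):
    """intensity: 1-D log-energy envelope of one channel, one value per frame."""
    results = []
    for sigma in scales:
        d = np.gradient(gaussian_filter1d(intensity, sigma))   # smoothed derivative
        onsets = [m for m in range(1, len(d) - 1)
                  if d[m] > thresh and d[m] >= d[m - 1] and d[m] >= d[m + 1]]
        offsets = [m for m in range(1, len(d) - 1)
                   if d[m] < -thresh and d[m] <= d[m - 1] and d[m] <= d[m + 1]]
        results.append((sigma, onsets, offsets))
    return results   # candidate onset/offset frames at each scale
```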

Example of auditory segmentation

Speech segregation
- The general strategy for speech segregation is to first segregate voiced speech using the pitch cue, and then deal with unvoiced speech
- To segregate unvoiced speech, we perform auditory segmentation, and then group the segments that correspond to unvoiced speech

Segment classification
- For nonspeech interference, grouping is in fact a classification task: classify segments as either speech or non-speech
- The following features are used for classification: spectral envelope, segment duration, segment intensity
- Training data:
  - Speech: training part of the TIMIT database
  - Interference: 90 natural intrusions including street noise, crowd noise, wind, etc.
- A Gaussian mixture model is trained for each phoneme, and for interference as well, which provides the basis for a likelihood ratio test
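A simplified sketch of the likelihood-ratio decision, assuming per-frame feature vectors have already been extracted from each segment. The per-phoneme models of the original system are collapsed here into a single speech GMM and a single interference GMM, and the component count is illustrative.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_models(speech_feats, noise_feats, n_components=8):
    """speech_feats, noise_feats: 2-D arrays of training feature vectors (frames x dims)."""
    gmm_speech = GaussianMixture(n_components).fit(speech_feats)
    gmm_noise = GaussianMixture(n_components).fit(noise_feats)
    return gmm_speech, gmm_noise

def is_speech_segment(segment_feats, gmm_speech, gmm_noise, prior_log_odds=0.0):
    # Likelihood ratio test: accept the segment as speech if its summed
    # log-likelihood under the speech model exceeds that under the noise model.
    llr = (gmm_speech.score_samples(segment_feats).sum()
           - gmm_noise.score_samples(segment_feats).sum())
    return llr + prior_log_odds > 0.0
```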

Example of segregating fricatives/affricates Utterance: “That noise problem grows more annoying each day” Interference: Crowd noise with music (IBM: Ideal binary mask)

Example of segregating stops Utterance: “A good morrow to you, my boy” Interference: Rain

Outline of presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- Voiced speech segregation based on pitch tracking and amplitude modulation analysis
- Ideal binary mask as CASA goal
- Unvoiced speech segregation
- Auditory segmentation
- Neurobiological basis of ASA

How does the auditory system perform ASA?
- Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system
- Binding problem: how are these features combined to form a perceptual whole (stream)?
- Hierarchies of feature-detecting cells exist, but do not seem to constitute a solution to the binding problem

Oscillatory correlation theory for ASA
- Neural oscillators are used to represent auditory features
- Oscillators representing features of the same source are synchronized, and are desynchronized from those representing different sources
- Originally proposed by von der Malsburg & Schneider (1986), and further developed by Wang (1996)
- Supported by growing experimental evidence

Oscillatory correlation representation FD: Feature Detector

Oscillatory correlation for ASA
- LEGION dynamics (Terman & Wang'95) provides a computational foundation for the oscillatory correlation theory (a single-oscillator sketch follows)
- The utility of oscillatory correlation has been demonstrated for speech segregation (Wang-Brown'99), modeling auditory attention (Wrigley-Brown'04), etc.
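A minimal sketch of a single Terman-Wang relaxation oscillator, the building block of LEGION dynamics: dx/dt = 3x - x^3 + 2 - y + I for the fast variable and dy/dt = eps * (gamma * (1 + tanh(x / beta)) - y) for the slow variable. The parameter values and forward-Euler integration are illustrative, and excitatory coupling and the global inhibitor are omitted.

```python
import numpy as np

def simulate_oscillator(I=0.8, eps=0.02, gamma=6.0, beta=0.1, dt=0.02, steps=20000):
    x, y = -2.0, 0.0                      # start on the silent (left) branch of the x-nullcline
    trace = np.empty(steps)
    for k in range(steps):
        dx = 3 * x - x**3 + 2 - y + I                       # fast "membrane potential" variable
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)    # slow recovery variable
        x, y = x + dt * dx, y + dt * dy
        trace[k] = x
    return trace   # with I > 0 the oscillator alternates between active and silent phases
```

Synchrony within a stream and desynchrony across streams arise when such oscillators are coupled through local excitation and a shared global inhibitor, which this single-unit sketch does not include.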

Summary
- CASA approach to monaural speech segregation
  - Performs substantially better than previous CASA systems for voiced speech segregation
  - AM cue and target pitch tracking are important for performance improvement
- Early steps for unvoiced speech segregation
  - Auditory segmentation based on onset/offset analysis
  - Segregation using speech classification
- Oscillatory correlation theory for ASA

Acknowledgment
- Joint work with Guoning Hu
- Funded by AFOSR/AFRL and NSF