Monaural Speech Segregation: Representation, Pitch, and Amplitude Modulation DeLiang Wang The Ohio State University.

Presentation transcript:

Monaural Speech Segregation: Representation, Pitch, and Amplitude Modulation DeLiang Wang The Ohio State University

Outline of Presentation
- Introduction
- Speech segregation problem
- Auditory scene analysis (ASA) approach
- A multistage model for computational ASA
- On amplitude modulation and pitch tracking
- Oscillatory correlation theory for ASA

Speech Segregation Problem
- In a natural environment, target speech is usually corrupted by acoustic interference. An effective system for speech segregation has many applications, such as automatic speech recognition, audio retrieval, and hearing aid design
- Most speech separation techniques require multiple sensors
- Speech enhancement methods developed for the monaural situation can deal only with specific types of acoustic interference

Auditory Scene Analysis (Bregman’90)
- Listeners are able to parse the complex mixture of sounds arriving at the ears in order to retrieve a mental representation of each sound source
- ASA is thought to take place in two conceptual processes:
  - Segmentation: decompose the acoustic mixture into sensory elements (segments)
  - Grouping: combine segments into groups, so that segments in the same group are likely to have originated from the same environmental source

Auditory Scene Analysis - continued
- The grouping process involves two aspects:
  - Primitive grouping: innate, data-driven mechanisms, consistent with those described by Gestalt psychologists for visual perception (proximity, similarity, common fate, good continuation, etc.)
  - Schema-driven grouping: application of learned knowledge about speech, music, and other environmental sounds

Computational Auditory Scene Analysis
- Computational ASA (CASA) systems approach sound separation based on ASA principles
  - Weintraub’85, Cooke’93, Brown & Cooke’94, Ellis’96, Wang’96
- Previous CASA work suggests that:
  - Representation of the auditory scene is a key issue
  - Temporal continuity is important (although it is ignored in most frame-based sound processing algorithms)
  - Fundamental frequency (F0) is a strong cue for grouping

A Multi-stage Model (Wang & Brown’99)

Auditory Periphery Model
- A bank of fourth-order gammatone filters (Patterson et al.’88); see the sketch below
- The Meddis hair cell model converts gammatone output to neural firing
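A minimal sketch of a gammatone filterbank of this kind. The ERB formulas and the bandwidth constant follow common implementations (e.g. Slaney's); the function names and channel range are illustrative, not the code used in this work.

```python
import numpy as np

def erb(fc):
    """Equivalent rectangular bandwidth (Glasberg & Moore) in Hz."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone_ir(fc, fs, duration=0.05, order=4, b=1.019):
    """Impulse response of a fourth-order gammatone filter centred at fc."""
    t = np.arange(0, duration, 1.0 / fs)
    return t ** (order - 1) * np.exp(-2 * np.pi * b * erb(fc) * t) * np.cos(2 * np.pi * fc * t)

def gammatone_filterbank(x, fs, n_channels=128, f_lo=80.0, f_hi=5000.0):
    """Filter x through n_channels gammatone filters spaced on the ERB-rate scale."""
    erb_lo, erb_hi = 21.4 * np.log10(4.37e-3 * np.array([f_lo, f_hi]) + 1)
    fcs = (10 ** (np.linspace(erb_lo, erb_hi, n_channels) / 21.4) - 1) / 4.37e-3
    outputs = np.stack([np.convolve(x, gammatone_ir(fc, fs), mode='same') for fc in fcs])
    return outputs, fcs
```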

Auditory Periphery - Example
- Hair cell response to the utterance "Why were you all weary?" mixed with a telephone ringing
- 128 filter channels spaced on the ERB scale

Mid-level Auditory Representations
- Mid-level representations form the basis for segment formation and subsequent grouping
- The correlogram extracts periodicity information from simulated auditory nerve firing patterns
- The summary correlogram is used to identify F0
- Cross-correlation between adjacent correlogram channels identifies regions that are excited by the same frequency component or formant (see the sketch below)
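A minimal sketch of these mid-level representations, assuming `cochleagram` is a (channels x samples) array of simulated auditory-nerve firing rates. Frame and lag settings are illustrative assumptions.

```python
import numpy as np

def correlogram(cochleagram, fs, frame_len=0.020, hop=0.010, max_lag=0.0125):
    """Running autocorrelation of each channel's firing pattern (frames x channels x lags)."""
    n_ch, n_samp = cochleagram.shape
    L, H, K = int(frame_len * fs), int(hop * fs), int(max_lag * fs)
    frames = range(0, n_samp - L, H)
    acg = np.zeros((len(frames), n_ch, K))
    for m, start in enumerate(frames):
        for c in range(n_ch):
            seg = cochleagram[c, start:start + L]
            ac = np.correlate(seg, seg, mode='full')[L - 1:L - 1 + K]
            acg[m, c] = ac / (ac[0] + 1e-12)        # normalise at zero lag
    return acg

def summary_and_cross_channel(acg):
    """Summary correlogram (F0 ~ its peak lag) and adjacent-channel correlation."""
    summary = acg.sum(axis=1)
    a = acg - acg.mean(axis=2, keepdims=True)
    a /= (np.linalg.norm(a, axis=2, keepdims=True) + 1e-12)
    cross = (a[:, :-1] * a[:, 1:]).sum(axis=2)      # (frames x channels-1)
    return summary, cross
```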

Mid-level Representations - Example
- Correlogram and cross-channel correlation for the speech/telephone mixture

Oscillator Network: Segmentation Layer
- Horizontal weights are unity, reflecting temporal continuity; vertical weights are unity if the cross-channel correlation exceeds a threshold, and 0 otherwise
- A global inhibitor ensures that different segments have different phases
- A segment thus formed corresponds to acoustic energy in a local time-frequency region that is treated as an atomic component of an auditory scene (a functional sketch follows below)
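The slide describes an oscillator network; the sketch below reproduces only its functional outcome, linking T-F units along time unconditionally and across adjacent channels when their cross-channel correlation exceeds a threshold, using plain connected-component labelling (union-find) instead of oscillator dynamics. The threshold value and variable names are assumptions.

```python
import numpy as np

def form_segments(active, cross, theta=0.985):
    """active: (frames x channels) bool energy mask; cross: (frames x channels-1) correlations.
    Returns a (frames x channels) map of integer segment labels (0 = no segment)."""
    n_t, n_c = active.shape
    parent = {u: u for u in zip(*np.nonzero(active))}

    def find(u):
        while parent[u] != u:
            parent[u] = parent[parent[u]]           # path halving
            u = parent[u]
        return u

    def union(u, v):
        parent[find(u)] = find(v)

    for t, c in parent:
        if (t + 1, c) in parent:                    # temporal continuity: always link
            union((t, c), (t + 1, c))
        if (t, c + 1) in parent and cross[t, c] > theta:
            union((t, c), (t, c + 1))               # link adjacent channels only if correlated

    seg, labels = np.zeros((n_t, n_c), dtype=int), {}
    for u in parent:
        seg[u] = labels.setdefault(find(u), len(labels) + 1)
    return seg
```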

Segmentation Layer - Example
- Output of the segmentation layer in response to the speech/telephone mixture

Oscillator Network: Grouping Layer
- At each time frame, an F0 estimate from the summary correlogram is used to classify channels into two categories: those that are consistent with the F0 and those that are not
- Connections are formed between pairs of channels: mutual excitation if the channels belong to the same F0 category, otherwise mutual inhibition
- Strong excitation within each segment
- The second layer embodies the grouping stage of ASA

Grouping Layer - Example
- Two streams emerge from the grouping layer at different times or with different phases
- Left: foreground (original mixture)
- Right: background

Challenges Facing CASA
- Previous systems, including the Wang-Brown model, have difficulty in:
  - Dealing with broadband high-frequency mixtures
  - Performing reliable pitch tracking for noisy speech
  - Retaining high-frequency energy of the target speaker
- Our next step considers the perceptual resolvability of various harmonics

Resolved and Unresolved Harmonics
- For voiced speech, lower harmonics are resolved while higher harmonics are not
- For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of speech
- Hence we apply different grouping mechanisms for low-frequency and high-frequency signals:
  - Low-frequency signals are grouped based on periodicity and temporal continuity
  - High-frequency signals are grouped based on amplitude modulation (AM) and temporal continuity

Proposed System (Hu & Wang'02)

Envelope Representations - Example
(a) Correlogram and cross-channel correlation of the hair cell response to clean speech
(b) Corresponding representations for response envelopes

Initial Segregation
- The Wang-Brown model is used in this stage to generate segments and select the target speech stream
- Segments generated in this stage tend to reflect resolved harmonics, but not unresolved ones

Pitch Tracking
- Pitch periods of target speech are estimated from the segregated speech stream
- Estimated pitch periods are checked and re-estimated using two psychoacoustically motivated constraints:
  - Target pitch should agree with the periodicity of the time-frequency (T-F) units in the initial speech stream
  - Pitch periods change smoothly, thus allowing for verification and interpolation (see the sketch below)
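A minimal sketch of the smoothness constraint: pitch estimates that jump abruptly between neighbouring frames are flagged as unreliable and re-filled by linear interpolation. The 20% jump threshold is an illustrative assumption, not the value used in the paper.

```python
import numpy as np

def smooth_pitch_track(pitch_hz, max_ratio=1.2):
    """pitch_hz: per-frame pitch estimates (0 = unvoiced). Returns a smoothed track."""
    p = np.asarray(pitch_hz, dtype=float)
    ok = np.ones(len(p), dtype=bool)
    for t in range(1, len(p)):
        if p[t] <= 0 or p[t - 1] <= 0:
            continue
        if max(p[t], p[t - 1]) / min(p[t], p[t - 1]) > max_ratio:
            ok[t] = False                            # abrupt jump: treat as unreliable
    good = np.flatnonzero(ok & (p > 0))
    bad = np.flatnonzero(~ok)
    p_out = p.copy()
    if len(good) >= 2 and len(bad):
        p_out[bad] = np.interp(bad, good, p[good])   # interpolate across unreliable frames
    return p_out
```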

Pitch Tracking - Example
(a) Global pitch (line: pitch track of clean speech) for a mixture of target speech and a ‘cocktail-party’ intrusion
(b) Estimated target pitch

T-F Unit Labeling
- In the low-frequency range:
  - A T-F unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch (see the sketch below)
- In the high-frequency range:
  - Due to their wide bandwidths, high-frequency filters generally respond to multiple harmonics. These responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863)
  - A T-F unit in the high-frequency range is labeled by comparing its AM repetition rate with the estimated target pitch
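A minimal sketch of low-frequency unit labeling, assuming `acg` is the per-unit normalised autocorrelation from the correlogram sketch above and `pitch_lag` gives the estimated target pitch period (in samples) per frame. The 0.85 threshold is an illustrative assumption.

```python
import numpy as np

def label_low_freq_units(acg, pitch_lag, theta=0.85):
    """Return a (frames x channels) binary mask: 1 = consistent with the target pitch."""
    n_t, n_c, n_lag = acg.shape
    mask = np.zeros((n_t, n_c), dtype=int)
    for t in range(n_t):
        lag = int(pitch_lag[t])
        if lag <= 0 or lag >= n_lag:
            continue                                 # unvoiced frame: leave unlabeled
        # a unit is consistent with the target if its autocorrelation at the pitch
        # period is close to its own maximum over non-zero lags
        resp = acg[t, :, lag]
        peak = acg[t, :, 1:].max(axis=1) + 1e-12
        mask[t] = (resp / peak > theta).astype(int)
    return mask
```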

AM - Example
(a) The output of a gammatone filter (center frequency: 2.6 kHz) in response to clean speech
(b) The corresponding autocorrelation function

AM Repetition Rates
- To obtain AM repetition rates, a filter response is half-wave rectified and bandpass filtered
- The resulting signal within a T-F unit is modeled by a single sinusoid using the gradient descent method; the frequency of the sinusoid indicates the AM repetition rate of the corresponding response (a simplified sketch follows below)
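A simplified sketch of the single-sinusoid fit. For robustness it replaces the gradient descent mentioned on the slide with a grid search over candidate frequencies, solving for the sine/cosine amplitudes in closed form at each candidate; this is a stand-in for, not a reproduction of, the paper's procedure.

```python
import numpy as np

def am_repetition_rate(env, fs, f_candidates):
    """env: half-wave rectified, bandpassed filter response within one T-F unit.
    Returns the best-fitting sinusoid frequency in Hz (the AM repetition rate)."""
    env = env - env.mean()
    t = np.arange(len(env)) / fs
    best_f, best_err = None, np.inf
    for f in f_candidates:
        # least-squares fit of a*sin(2*pi*f*t) + b*cos(2*pi*f*t) to the envelope
        X = np.stack([np.sin(2 * np.pi * f * t), np.cos(2 * np.pi * f * t)], axis=1)
        coef, *_ = np.linalg.lstsq(X, env, rcond=None)
        err = np.sum((X @ coef - env) ** 2)
        if err < best_err:
            best_f, best_err = f, err
    return best_f
```

In use, `f_candidates` would typically span the plausible pitch range (e.g. 80 to 400 Hz), and the returned rate is compared with the estimated target pitch to label the unit.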

Final Segregation
- New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e., common AM); they are then grouped into the foreground stream according to AM repetition rates
- The foreground stream is adjusted to remove segments that do not agree with the estimated target pitch
- Other units are grouped according to temporal and spectral continuity

Ideal Binary Mask for Performance Evaluation
- Within a T-F unit, the ideal binary mask is 1 if the target energy is stronger than the interference energy, and 0 otherwise (see the sketch below)
- Motivation: auditory masking - a stronger signal masks a weaker one within a critical band
- Further motivation: ideal binary masks yield excellent listening experience and automatic speech recognition performance
- We therefore suggest using ideal binary masks as ground truth for CASA performance evaluation
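A minimal sketch of the ideal binary mask, assuming the target and interference are available separately as (frames x channels) energy maps, e.g. computed from the filterbank outputs above. The 0 dB criterion follows the slide; the function name is illustrative.

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy):
    """1 where target energy exceeds interference energy within a T-F unit, else 0."""
    return (target_energy > interference_energy).astype(int)
```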

Monaural Speech Segregation Example
- Left: segregated speech stream (original mixture)
- Right: ideal binary mask

Systematic Evaluation
- Evaluated on a corpus of 100 mixtures (Cooke’93): 10 voiced utterances x 10 noise intrusions
- The noise intrusions cover a wide variety of interference types
- A resynthesis stage allows estimation of the target speech waveform
- Evaluation is based on ideal binary masks

Signal-to-Noise Ratio (SNR) Results
- Average SNR gain: 12.1 dB; average improvement over Wang-Brown: 5 dB
- The major improvement occurs in target energy retention, particularly in the high-frequency range (a sketch of the SNR measure follows below)
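A minimal sketch of such an SNR measure, assuming the waveform resynthesised from the ideal binary mask serves as ground truth (as the previous slides suggest) and `estimate` is the waveform resynthesised from the system's output; the exact evaluation code used in the paper may differ.

```python
import numpy as np

def snr_db(ground_truth, estimate):
    """SNR of the estimate relative to the ground-truth waveform, in dB."""
    noise = ground_truth - estimate
    return 10 * np.log10(np.sum(ground_truth ** 2) / (np.sum(noise ** 2) + 1e-12))
```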

Segregation Examples
- Mixture
- Ideal binary mask
- Wang-Brown
- New system

How Does the Auditory System Perform ASA?
- Information about acoustic features (pitch, spectral shape, interaural differences, AM, FM) is extracted in distributed areas of the auditory system
- Binding problem: how are these features combined to form a perceptual whole (stream)?
- Hierarchies of feature-detecting cells exist, but do not seem to constitute a solution to the binding problem

Oscillatory Correlation Theory (von der Malsburg & Schneider’86; Wang’96)
- Neural oscillators are used to represent auditory features
- Oscillators representing features of the same source are synchronized (phase-locked with zero phase lag), and are desynchronized from oscillators representing different sources
- Supported by growing experimental evidence, e.g. oscillations in auditory cortex measured by EEG, MEG, and local field potentials

Oscillatory Correlation Representation
(FD: Feature Detector)

Oscillatory Correlation for ASA
- LEGION dynamics (Terman & Wang’95) provides a computational foundation for the oscillatory correlation theory (a single-oscillator sketch follows below)
- The utility of oscillatory correlation has been demonstrated for speech separation (Wang & Brown’99), modeling auditory attention (Wrigley & Brown’01), etc.
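A minimal sketch of a single relaxation oscillator of the Terman-Wang type that LEGION builds on, integrated with the Euler method. The parameter values and the exact form of the nullclines are illustrative assumptions; network coupling and the global inhibitor are omitted.

```python
import numpy as np

def relaxation_oscillator(I=0.8, eps=0.02, gamma=6.0, beta=0.1, dt=0.01, n_steps=20000):
    """Simulate one uncoupled relaxation oscillator; returns the fast variable over time."""
    x, y = -2.0, 0.0
    xs = np.empty(n_steps)
    for k in range(n_steps):
        dx = 3 * x - x ** 3 + 2 - y + I                      # fast excitatory variable
        dy = eps * (gamma * (1 + np.tanh(x / beta)) - y)     # slow recovery variable
        x, y = x + dt * dx, y + dt * dy
        xs[k] = x
    return xs                                                # oscillates for sufficiently large I
```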

Issues
- Grouping is entirely pitch-based, and hence limited to segregating voiced speech
  - How to group unvoiced speech?
- Target pitch tracking in the presence of multiple voiced sources
- Role of segmentation
  - We found increased robustness with segments as an intermediate representation between streams and T-F units

Summary
- Multistage ASA approach to monaural speech segregation
  - Performs substantially better than previous CASA systems
- Oscillatory correlation theory for ASA
- Key issue is the integration of various grouping cues

Collaborators
- Recent work with Guoning Hu (The Ohio State University)
- Earlier work with Guy Brown (University of Sheffield)