Computational Auditory Scene Analysis and Its Potential Application to Hearing Aids DeLiang Wang Perception & Neurodynamics Lab Ohio State University.


Outline of presentation
- Auditory scene analysis
- Fundamentals of computational auditory scene analysis (CASA)
- CASA for speech segregation
- Subject tests
- Assessment

Real-world audition
What?
- Speech: the message; speaker age, gender, linguistic origin, mood, …
- Music
- A car passing by
Where?
- Left, right, up, down; how close?
Channel characteristics
- Environment characteristics: room reverberation, ambient noise

Sources of intrusion and distortion
- Additive noise from other sound sources
- Reverberation from surface reflections
- Channel distortion

Cocktail party problem
- Term coined by Cherry: "One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it 'the cocktail party problem'…" (Cherry, 1957)
- "For 'cocktail party'-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers" (Bronkhorst & Plomp, 1992)
- Called the ball-room problem by Helmholtz: "complicated beyond conception" (Helmholtz, 1863)

Auditory scene analysis
- Listeners are capable of parsing an acoustic scene (a sound mixture) to form a mental representation of each sound source (a stream) in the perceptual process of auditory scene analysis (Bregman, 1990)
- From acoustic events to perceptual streams
- Two conceptual processes of ASA:
  - Segmentation: decompose the acoustic mixture into sensory elements (segments)
  - Grouping: combine segments into streams, so that segments in the same stream originate from the same source

Simultaneous organization
Simultaneous organization groups sound components that overlap in time. ASA cues for simultaneous organization:
- Proximity in frequency (spectral proximity)
- Common periodicity
  - Harmonicity
  - Fine temporal structure
- Common spatial location
- Common onset (and, to a lesser degree, common offset)
- Common temporal modulation
  - Amplitude modulation (AM)
  - Frequency modulation (FM)

Sequential organization
Sequential organization groups sound components across time. ASA cues for sequential organization:
- Proximity in time and frequency
- Temporal and spectral continuity
- Common spatial location; more generally, spatial continuity
- Smooth pitch contour
- Smooth formant transitions?
- Rhythmic structure

Organisation in speech
(Figure: spectrogram of the utterance "… pure pleasure …", annotated with onset synchrony, offset synchrony, continuity, and harmonicity)

Outline of presentation
- Auditory scene analysis
- Fundamentals of computational auditory scene analysis (CASA)
- CASA for speech segregation
- Subject tests
- Assessment

Cochleagram: Auditory spectrogram
Spectrogram
- Plot of log energy across time and frequency (linear frequency scale)
Cochleagram
- Cochlear filtering by the gammatone filterbank (or another model of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, modeled by either a hair cell model or a simple compression operation (log or cube root)
- Quasi-logarithmic frequency scale, with frequency-dependent filter bandwidths
- A waveform signal can be reconstructed (inverted) from a cochleagram
(Figure: spectrogram and cochleagram of the same utterance)
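
The cochleagram stages above (gammatone filtering, rectification, cube-root compression, framing) can be sketched as follows. This is a minimal illustration assuming NumPy; the function names, channel count, and frame sizes are illustrative choices, not parameters of any particular CASA system:

```python
import numpy as np

def gammatone_ir(fc, fs, duration=0.025, order=4):
    """Impulse response of a 4th-order gammatone filter centred at fc (Hz)."""
    t = np.arange(int(duration * fs)) / fs
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)    # Glasberg-Moore ERB at fc
    b = 1.019 * erb                            # gammatone bandwidth parameter
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.max(np.abs(g))

def cochleagram(x, fs, n_channels=32, fmin=80.0, fmax=5000.0,
                frame_len=320, hop=160):
    """Gammatone filtering, half-wave rectification, cube-root compression,
    and framing: a simple cochleagram of signal x."""
    # Centre frequencies equally spaced on the ERB-rate scale
    erb_rate = lambda f: 21.4 * np.log10(4.37e-3 * f + 1.0)
    e = np.linspace(erb_rate(fmin), erb_rate(fmax), n_channels)
    cfs = (10.0 ** (e / 21.4) - 1.0) / 4.37e-3
    n_frames = (len(x) - frame_len) // hop + 1
    cg = np.zeros((n_channels, n_frames))
    for c, fc in enumerate(cfs):
        y = np.convolve(x, gammatone_ir(fc, fs), mode='same')
        y = np.maximum(y, 0.0) ** (1.0 / 3.0)  # hair-cell stage: rectify + compress
        for m in range(n_frames):
            cg[c, m] = y[m * hop: m * hop + frame_len].mean()
    return cg, cfs
```

Feeding in a pure tone concentrates energy in the channel whose centre frequency lies closest to the tone, reflecting the frequency-dependent bandwidths of the filterbank.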

Correlogram
- Short-term autocorrelation of the output of each frequency channel of the cochleagram
- Peaks in the summary correlogram indicate pitch periods (F0)
- A standard model of pitch perception
(Figure: correlogram and summary correlogram of a double vowel, showing the two F0s)
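
As a sketch of the idea (assuming NumPy; names and window sizes are illustrative): autocorrelate each channel within a frame, sum across channels to get the summary correlogram, and read the pitch period off its dominant peak within a plausible pitch range:

```python
import numpy as np

def correlogram(channels, frame_len=320, max_lag=200):
    """Running autocorrelation of each cochlear channel over one frame,
    plus the summary correlogram (sum across channels)."""
    n_ch = channels.shape[0]
    ac = np.zeros((n_ch, max_lag))
    for c in range(n_ch):
        seg = channels[c, :frame_len]
        for lag in range(max_lag):
            ac[c, lag] = np.dot(seg[:frame_len - lag], seg[lag:frame_len])
    return ac, ac.sum(axis=0)

def pitch_period(summary, fs, fmin=80.0, fmax=400.0):
    """Dominant peak of the summary correlogram within the plausible
    pitch range, returned as a lag in samples."""
    lo, hi = int(fs / fmax), int(fs / fmin)
    return lo + int(np.argmax(summary[lo:hi]))
```

For example, a 100 Hz pulse train at a 16 kHz sampling rate yields a summary-correlogram peak at a lag of 160 samples.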

Cross-correlogram
- Cross-correlogram (within one frame) in response to two speech sources presented at 0º and 20º
- The skeleton cross-correlogram sharpens the cross-correlogram, making peaks along the azimuth axis more pronounced
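
The core computation behind each channel of a cross-correlogram is the cross-correlation between the left- and right-ear signals, whose peak lag gives the interaural time difference (ITD). A minimal sketch, assuming NumPy (function name and lag range are illustrative):

```python
import numpy as np

def itd_from_crosscorr(left, right, max_lag=32):
    """Estimate the interaural time difference (in samples) as the lag of
    the peak of the cross-correlation of left- and right-ear signals."""
    n = len(right)
    lags = np.arange(-max_lag, max_lag + 1)
    cc = [np.dot(left[max_lag + l: n - max_lag + l],
                 right[max_lag: n - max_lag]) for l in lags]
    return int(lags[int(np.argmax(cc))])
```

For a broadband source, delaying one ear's signal by a few samples produces a sharp cross-correlation peak at exactly that lag; the azimuth is then inferred from the ITD.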

Ideal binary mask
- A main CASA goal is to retain the parts of a target sound that are stronger than the acoustic background and to mask the parts where interference dominates the target
- What the target is depends on intention, attention, etc.
- Within a local time-frequency (T-F) unit, the ideal binary mask is 1 if the SNR within the unit exceeds a local criterion (LC), or threshold, and 0 otherwise (Hu & Wang, 2001)
- Consistent with the auditory masking phenomenon: a stronger signal masks a weaker one within a critical band
- Optimality: under certain conditions, the ideal binary mask with 0 dB LC is the optimal binary mask for SNR gain
- Note that it does not actually separate the mixture!
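
The definition above translates directly into code. A minimal sketch, assuming NumPy, with premixed target and interference energies given in some T-F representation (e.g. a cochleagram or spectrogram); the function name and epsilon guard are illustrative:

```python
import numpy as np

def ideal_binary_mask(target, interference, lc_db=0.0):
    """IBM over a T-F representation: 1 where the local SNR (in dB)
    exceeds the local criterion LC, 0 otherwise."""
    eps = 1e-12  # guard against division by zero / log of zero
    snr_db = 10.0 * np.log10((target + eps) / (interference + eps))
    return (snr_db > lc_db).astype(float)
```

Note that computing the IBM requires the premixed signals; it is an oracle used as a computational goal and performance ceiling, which is precisely why it "does not actually separate the mixture".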

Ideal binary mask illustration

Outline of presentation
- Auditory scene analysis
- Fundamentals of computational auditory scene analysis (CASA)
- CASA for speech segregation
  - Voiced speech segregation
  - Unvoiced speech segregation
- Subject tests
- Assessment

CASA systems for speech segregation
- A substantial literature that can be broadly divided into monaural and binaural systems
- Monaural CASA systems for speech segregation are based on harmonicity, onset/offset, AM/FM, and trained models (Weintraub, 1985; Brown & Cooke, 1994; Ellis, 1996; Hu & Wang, 2004)
- Binaural CASA systems for speech segregation are based on sound localization and location-based grouping (Lyon, 1983; Bodden, 1993; Liu et al., 2001; Roman et al., 2003)

CASA system architecture Typical architecture of CASA systems

Voiced speech segregation
- For voiced speech, lower harmonics are resolved by the cochlear filterbank while higher harmonics are not
- For unresolved harmonics, the envelopes of filter responses fluctuate at the fundamental frequency of the speech
- The voiced-speech segregation model of Hu and Wang (2004) applies different grouping mechanisms to low-frequency and high-frequency signals:
  - Low-frequency signals are grouped based on periodicity and temporal continuity
  - High-frequency signals are grouped based on amplitude modulation and temporal continuity

Pitch tracking
- Pitch periods of the target speech are estimated from an initially segregated speech stream, based on the dominant pitch within each frame
- Estimated pitch periods are then checked and re-estimated using two psychoacoustically motivated constraints:
  - The target pitch should agree with the periodicity of the T-F units in the initial speech stream
  - Pitch periods change smoothly over time, allowing for verification and interpolation
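
The smoothness constraint can be illustrated with a simple sketch (assuming NumPy; this is a hypothetical simplification of the verification step, not the actual Hu-Wang procedure, and the `max_jump` threshold is an illustrative value): flag frames whose pitch period jumps sharply away from both neighbours, and fill them by interpolation from reliable frames:

```python
import numpy as np

def smooth_pitch_track(periods, max_jump=0.2):
    """Treat frames whose pitch period jumps by more than max_jump
    (relative) from both neighbours as estimation errors, and replace
    them by linear interpolation from the reliable frames."""
    p = np.asarray(periods, dtype=float)
    good = np.ones(len(p), dtype=bool)
    for i in range(1, len(p) - 1):
        if (abs(p[i] - p[i - 1]) > max_jump * p[i - 1]
                and abs(p[i] - p[i + 1]) > max_jump * p[i + 1]):
            good[i] = False  # violates the smoothness constraint
    idx = np.arange(len(p))
    p[~good] = np.interp(idx[~good], idx[good], p[good])
    return p
```

For instance, an octave error of 80 samples in the middle of a track hovering around 160 samples is detected and interpolated away.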

Pitch tracking example
(a) Dominant pitch (line: pitch track of clean speech) for a mixture of target speech and 'cocktail-party' intrusion
(b) Estimated target pitch

T-F unit labeling & final segregation
- In the low-frequency range:
  - A T-F unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch
- In the high-frequency range:
  - Due to their wide bandwidths, high-frequency filters respond to multiple harmonics; these responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863)
  - A T-F unit is labeled by comparing its AM rate with the estimated target pitch
- Finally, the remaining units are grouped according to temporal and spectral continuity
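
The low-frequency labeling rule can be sketched as follows (assuming NumPy; the lag range and the threshold `theta` are hypothetical illustrative values, not the published model's parameters): a unit is assigned to the target if its autocorrelation at the estimated target pitch lag is close to its maximum over the plausible pitch range:

```python
import numpy as np

def label_low_freq_unit(x, pitch_lag, lag_range=(32, 200), theta=0.85):
    """Label a low-frequency T-F unit as target (1) if its autocorrelation
    at the estimated target pitch lag is close to the maximum over the
    plausible pitch range, else interference (0)."""
    n = len(x)
    ac = np.correlate(x, x, mode='full')[n - 1:]  # lags 0 .. n-1
    lo, hi = lag_range
    return int(ac[pitch_lag] >= theta * ac[lo:hi].max())
```

A unit driven by a 100 Hz periodic signal (period 160 samples at 16 kHz) is accepted when tested against the matching pitch lag and rejected against a mismatched one.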

Voiced speech segregation example

Unvoiced speech segregation
- Unvoiced speech constitutes about 20-25% of all speech sounds
- Unvoiced speech is more difficult to segregate than voiced speech:
  - Voiced speech is highly structured, whereas unvoiced speech lacks harmonicity and is often noise-like
  - Unvoiced speech is usually much weaker than voiced speech and therefore more susceptible to interference
- The model of Hu and Wang (2008) performs unvoiced speech segregation in two stages: auditory segmentation and segment classification
  - Segmentation is based on multiscale onset/offset analysis
  - Each segment is classified using Bayesian classification of acoustic-phonetic features

Example of segregation
Utterance: "That noise problem grows more annoying each day"
Interference: crowd noise in a playground
(IBM: ideal binary mask)

Outline of presentation l Auditory scene analysis l Fundamentals of computational auditory scene analysis (CASA) l CASA for speech segregation l Subject tests l Assessment

Subject tests of ideal binary masking
- Recent studies have found large speech intelligibility improvements from ideal binary masking for both normal-hearing (Brungart et al., 2006; Anzalone et al., 2006; Li & Loizou, 2008; Wang et al., 2008) and hearing-impaired (Anzalone et al., 2006; Wang et al., 2008) listeners
- The improvement for stationary noise is above 7 dB for NH listeners and above 9 dB for HI listeners
- The improvement for modulated noise is significantly larger than for stationary noise
- See our poster today on tests with both NH and HI listeners

Speech perception of noise with binary gains
- Is there an optimal LC that is independent of the input SNR?
- Wang et al. (2008) found that, when LC is chosen to equal the input SNR, nearly perfect intelligibility is obtained even when the input SNR is -∞ dB (i.e. the mixture contains noise only, with no target speech)

Wang et al. (2008) results
- Despite a great reduction of spectrotemporal information, a pattern of binary gains is apparently sufficient for human speech recognition
- Our results extend the observation of intelligible vocoded noise in significant ways:
  - Only binary gains (envelopes) are used
  - Masks are computed from local comparisons between target and interference, not from the target itself
- Mean scores for the four conditions: 97.1%, 92.9%, 54.3%, 7.6%

Outline of presentation
- Auditory scene analysis
- Fundamentals of computational auditory scene analysis (CASA)
- CASA for speech segregation
- Subject tests
- Assessment

Assessment of CASA for hearing prosthesis
- Few CASA systems have been developed for the hearing aid application
- Hearing aid processing poses a number of constraints:
  - Real-time processing with delays of just a few milliseconds
  - The amount of online training, if needed, has to be small
  - A limited number of frequency bands

Assessment of monaural CASA systems
- Monaural algorithms involve complex operations for feature extraction, segmentation, and grouping, or significant amounts of training
- They are either too complex or too limited in performance to be directly applicable to hearing aid design
- Certain aspects could still be useful, e.g. environment classification and voice detection
- In the longer term, monaural CASA research is promising:
  - It is based on principles of auditory perception
  - It is not subject to the fundamental limitations of spatial filtering (beamforming): configuration stationarity and room reverberation

Assessment of binaural CASA systems
- Many binaural (two-microphone) systems produce a T-F mask based on classification or clustering
  - Good performance after seconds of training data
  - Unfortunately, retraining is needed whenever the spatial configuration changes, limiting their prospects for hearing aids
  - Room reverberation likely poses further difficulties for such algorithms
- T-F masking algorithms based on beamforming hold promise for hearing aid design (e.g. Roman et al., 2006)
  - Both fixed and adaptive beamformers have been implemented in hearing aids
  - Beamforming in combination with T-F masking is likely effective for improving speech intelligibility

Conclusion
- CASA approaches the problem of sound separation using perceptual principles, and represents a new paradigm for solving the cocktail party problem
- Recent intelligibility tests show that ideal binary masking provides large benefits to both NH and HI listeners
- Current CASA systems pay little attention to the processing constraints of hearing aids, making them doubtful for direct application to hearing aid design
- In the longer term, CASA research (particularly on monaural systems) promises to deliver intelligibility benefits

Further information on CASA
- The 2006 CASA book, edited by D.L. Wang & G.J. Brown and published by Wiley-IEEE Press
- A 10-chapter book with a coherent and comprehensive treatment of CASA