Computational Auditory Scene Analysis
DeLiang Wang
Perception & Neurodynamics Lab
Department of Computer Science and Engineering, The Ohio State University
http://www.cse.ohio-state.edu/~dwang

Outline of presentation
I. Sound separation problem
II. Human auditory scene analysis (ASA)
III. Computational auditory scene analysis (CASA)
  1. Fundamentals
  2. Monaural segregation
  3. Binaural segregation
IV. Discussion and conclusion
Foci: speech segregation and recent advances

I. Sound separation problem
- Problem definition
- Listener's performance
- Some applications of automatic sound separation
- Current approaches to sound separation

Real-world audition
- What? Source type: speech (message; speaker age, gender, linguistic origin, mood, …), music, a car passing by
- Where? Left, right, up, down; how close?
- Channel characteristics
- Environment characteristics: room configuration, ambient noise

Sources of intrusion and distortion
- Additive noise from other sound sources
- Reverberation from surface reflections
- Channel distortion

Cocktail party problem
Term coined by Cherry: "One of our most important faculties is our ability to listen to, and follow, one speaker in the presence of others. This is such a common experience that we may take it for granted; we may call it 'the cocktail party problem'…" (Cherry, 1957)
"For 'cocktail party'-like situations… when all voices are equally loud, speech remains intelligible for normal-hearing listeners even when there are as many as six interfering talkers" (Bronkhorst & Plomp, 1992)
Called the ball-room problem by Helmholtz: "complicated beyond conception" (Helmholtz, 1863)

A modern acoustic perspective
Physical attributes of sound useful for segregation (Yost, 1997, p. 331):
- Spectral separation
- Spectral profile
- Harmonicity
- Spatial separation
- Temporal separation
- Temporal onsets/offsets
- Temporal modulations

Listener's performance
- Speech Reception Threshold (SRT): the speech-to-noise ratio needed for 50% intelligibility
- Each 1 dB gain in SRT corresponds to a 5-10% increase in intelligibility (Miller et al., 1951), depending on the materials
[Figure source: Steeneken (1992)]

Competing speakers
[Figure: SRT gain with competing speakers. Source: Assmann & Summerfield (2004), redrawn from Miller (1947)]

Location
[Figure: SRT gain as a function of source location. Source: Bronkhorst & Plomp (1992)]

Binaural versus 3D presentation
[Figure source: Drullman & Bronkhorst (2000)]

Humans versus machines
[Figure: word error rates of humans versus machines at 10 dB and at 0 dB SNR. Source: Lippmann (1997)]
- Human word error rate at 0 dB SNR is around 1%, as opposed to 100% for unmodified recognisers (around 40% with noise adaptation)
- Additionally, car noise is not a very effective speech masker

Some applications of automatic sound separation
- Automatic speech and speaker recognition
- Processors for hearing impairment
- Music transcription
- Audio information retrieval
- Audio display for human-computer interaction

Machine approaches to sound separation
- Speech enhancement
- Spatial filtering (beamforming)
- Blind source separation via independent component analysis
- Computational auditory scene analysis (CASA): the focus of this tutorial

Speech enhancement
- Enhance SNR or speech quality by attenuating interference
- Spectral subtraction is a standard enhancement technique
- Advantage: simple and easy to apply; it works for single-microphone recordings (the monaural condition)
- Limitation: needs prior knowledge of the interference and generally assumes that the interference is stationary
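As a concrete illustration, here is a minimal magnitude spectral-subtraction sketch in Python (not from the tutorial itself); the noise-only lead-in duration, the 512-sample window, the oversubtraction factor, and the spectral floor are illustrative assumptions:

```python
import numpy as np
from scipy.signal import stft, istft

def spectral_subtract(x, fs, noise_dur=0.25, alpha=2.0, floor=0.02):
    # STFT with a 512-sample window (half-window hop by default).
    f, t, X = stft(x, fs=fs, nperseg=512)
    mag, phase = np.abs(X), np.angle(X)
    # Estimate the noise magnitude spectrum from leading noise-only frames
    # (assumes the recording starts with noise alone).
    n_noise = max(1, int(noise_dur * fs / 256))
    noise_mag = mag[:, :n_noise].mean(axis=1, keepdims=True)
    # Oversubtract (factor alpha) and apply a spectral floor to limit musical noise.
    clean_mag = np.maximum(mag - alpha * noise_mag, floor * mag)
    _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
    return y
```

The stationarity limitation is visible in the code: one fixed noise_mag estimate is applied to every frame, so a nonstationary interferer is subtracted incorrectly.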

Spatial filtering
- Spatial filtering (beamforming): extract the target sound from a specific spatial direction with a sensor array
- Advantage: high fidelity with a large array of microphones, and robustness to reverberation, because much reverberation energy comes from nontarget directions
- Challenge: configuration stationarity. What if the target sound switches between different sound sources, or the target changes its location and orientation?

Blind source separation
- Independent component analysis (ICA) is a popular approach to blind source separation
- Assumes statistical independence between sound sources
- Formulates separation as estimating a demixing matrix that inverts the mixing matrix
- Mathematically similar to adaptive beamforming
- Applies machine learning techniques to estimate the demixing matrix
- Advantage: high fidelity when the assumptions are met
- Limitation: the assumptions are difficult to satisfy. Chief among them is stationarity of the mixing matrix, similar to the configuration-stationarity limitation of spatial filtering
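A minimal ICA demixing sketch (assuming scikit-learn is available; the synthetic sources and the stationary mixing matrix are illustrative). Note the classic ICA ambiguities: the recovered sources come back in arbitrary order and scale:

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
s1 = np.sin(2 * np.pi * 5 * np.linspace(0, 1, 8000))  # source 1: sinusoid
s2 = rng.laplace(size=8000)                           # source 2: non-Gaussian noise
S = np.c_[s1, s2]                                     # samples x sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])                # stationary mixing matrix
X = S @ A.T                                           # two-microphone mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)   # estimated sources (up to permutation and scale)
```

If the mixing matrix A changed over time (a moving talker), this single demixing estimate would no longer separate the sources, which is exactly the stationarity limitation noted above.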

Interim summary
- Everyday audition has to contend with additive noise, reverberation, and channel distortion
- Listeners use a variety of cues to solve the cocktail party problem
- Current computational approaches suffer from nonstationarity of the sound sources or the configuration

Part II. Auditory scene analysis
- A tour of the auditory periphery
- Human auditory scene analysis (ASA)

A whirlwind tour of the auditory periphery
A complex mechanism for transducing pressure variations in the air into neural impulses in auditory nerve fibers

Traveling wave
- Different frequencies of sound give rise to maximum vibrations at different places along the basilar membrane
- The frequency of vibration at a given place equals that of the nearest stimulus component (resonance)
- Hence, the cochlea performs a frequency analysis

Cochlear filtering model
The gammatone function approximates physiologically recorded impulse responses:
g(t) = t^{n-1} e^{-2\pi b t} \cos(2\pi f_0 t + \phi), for t >= 0
where n = filter order (typically 4), b = bandwidth, f_0 = centre frequency, and \phi = phase
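A direct transcription of this impulse response in Python; the ERB-based bandwidth (Glasberg & Moore) with the conventional 1.019 factor is a common choice rather than something specified on this slide:

```python
import numpy as np

def gammatone_ir(f0, fs, n=4, dur=0.05, phase=0.0):
    # Gammatone impulse response: g(t) = t^(n-1) exp(-2*pi*b*t) cos(2*pi*f0*t + phase)
    t = np.arange(int(dur * fs)) / fs
    erb = 24.7 * (4.37 * f0 / 1000.0 + 1.0)  # equivalent rectangular bandwidth in Hz
    b = 1.019 * erb                           # common bandwidth scaling (assumption)
    g = t ** (n - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * f0 * t + phase)
    return g / np.max(np.abs(g))              # normalize peak amplitude
```

For example, gammatone_ir(1000.0, 16000) gives the impulse response of a 1 kHz channel at a 16 kHz sampling rate; filtering is then a convolution with the input signal.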

Gammatone filterbank
- Each position on the basilar membrane is simulated by a single gammatone filter with the appropriate centre frequency and bandwidth
- A small number of filters (e.g. 32) is generally sufficient to cover the range 50 Hz to 8 kHz
- Note the variation in bandwidth with frequency (unlike Fourier analysis)

Response to a pure tone
- Many channels respond, but those closest to the tone frequency respond most strongly (place coding)
- The interval between successive peaks also encodes the tone frequency (temporal coding)
- Note the propagation delay along the membrane model

Beyond the periphery
- The auditory system is complex, with four relay stations between the periphery and the cortex rather than the one in the visual system
- Compared with the auditory periphery, the central parts of the auditory system are less well understood
- The number of neurons in the primary auditory cortex is comparable to that in the primary visual cortex, despite the fact that the auditory nerve has far fewer fibers than the optic nerve (thousands vs. millions)
[Figures: the auditory system (Source: Arbib, 1989) and the auditory nerve]

Some psychoacoustic phenomena
- Critical bands (sound demo)
- Beating and combination tones (sound demo)
- Separation results depend on the sound types (overall SNR is 0 dB); demos pair noise with noise (pink, white, pink+white), speech with speech, noise with tone, noise with speech, and tone with speech

Auditory scene analysis
Listeners are capable of parsing an acoustic scene (a sound mixture) to form a mental representation of each sound source (a stream) in the perceptual process of auditory scene analysis (Bregman, 1990): from events to streams
Two conceptual processes of ASA:
- Segmentation: decompose the acoustic mixture into sensory elements (segments)
- Grouping: combine segments into streams, so that segments in the same stream originate from the same source

Simultaneous organization
Simultaneous organization groups sound components that overlap in time. ASA cues for simultaneous organization:
- Proximity in frequency (spectral proximity)
- Common periodicity: harmonicity and fine temporal structure
- Common spatial location
- Common onset (and, to a lesser degree, common offset)
- Common temporal modulation: amplitude modulation (AM) and frequency modulation (demo)

Sequential organization
Sequential organization groups sound components across time. ASA cues for sequential organization:
- Proximity in time and frequency
- Temporal and spectral continuity (streaming demo: a cycle of six tones)
- Common spatial location; more generally, spatial continuity
- Smooth pitch contour
- Smooth formant transition?
- Rhythmic structure: rhythmic attention theory (Large & Jones, 1999)

Streaming in African xylophone music
Notes are chosen from a pentatonic scale. Source: Bregman & Ahad (1995)

Primitive versus schema-based organization
The grouping process involves two aspects:
- Primitive grouping: innate, data-driven mechanisms, consistent with those described by the Gestalt psychologists for visual perception (proximity, similarity, common fate, good continuation, etc.). It is domain-general and exploits the intrinsic structure of environmental sound. The grouping cues described earlier are primitive in nature
- Schema-driven grouping: learned knowledge about speech, music, and other environmental sounds; model-based or top-down. It is domain-specific, e.g. the organization of speech sounds into syllables

Organisation in speech: Broadband spectrogram
[Broadband spectrogram of "… pure pleasure …", annotated with onset synchrony, offset synchrony, common AM, continuity, and harmonicity]

Organisation in speech: Narrowband spectrogram
[Narrowband spectrogram of "… pure pleasure …", annotated with onset synchrony, offset synchrony, continuity, and harmonicity]

Interim summary
- Auditory peripheral processing amounts to a decomposition of the acoustic signal
- ASA cues essentially reflect the structural coherence of a sound source
- A subset of cues is believed to be strongly involved in ASA:
  - Simultaneous organization: periodicity, temporal modulation, onset
  - Sequential organization: location, pitch contour, and other source characteristics (e.g. vocal tract)

Part III. Computational auditory scene analysis
- Fundamentals
- Monaural segregation
- Binaural segregation

III.1 Fundamentals of CASA
- Cochleogram
- Correlogram
- Cross-correlogram
- Continuity in time and frequency
- Division of segmentation and grouping
- Time-frequency masks
- Resynthesis
- Missing-data recognition

Cochleogram: Auditory spectrogram
- Spectrogram: plot of log energy across time and frequency (linear frequency scale)
- Cochleogram: cochlear filtering by the gammatone filterbank (or another model of cochlear filtering), followed by a stage of nonlinear rectification; the latter corresponds to hair cell transduction, via either a hair cell model or a simple compression operation (log or cube root)
  - Quasi-logarithmic frequency scale, and filter bandwidth is frequency-dependent
  - Previous work suggests better resilience to noise than the spectrogram
  - Let's call it the 'cochleogram'
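A sketch of the whole pipeline, reusing gammatone_ir() from the earlier sketch; the 32 ERB-spaced channels from 50 Hz to 8 kHz, cube-root compression, and 10 ms frames follow choices mentioned in this tutorial, while the remaining implementation details are assumptions:

```python
import numpy as np

def erb_space(f_lo, f_hi, n_ch):
    # Centre frequencies equally spaced on the ERB-rate scale:
    # E(f) = 21.4 * log10(0.00437*f + 1)  (Glasberg & Moore)
    e = lambda f: 21.4 * np.log10(0.00437 * f + 1.0)
    f_of = lambda e_: (10 ** (e_ / 21.4) - 1.0) / 0.00437
    return f_of(np.linspace(e(f_lo), e(f_hi), n_ch))

def cochleogram(x, fs, n_ch=32, frame=0.010):
    hop = int(frame * fs)
    cgram = []
    for f0 in erb_space(50.0, 8000.0, n_ch):
        y = np.convolve(x, gammatone_ir(f0, fs), mode="same")  # cochlear filtering
        y = np.maximum(y, 0.0) ** (1.0 / 3.0)                  # rectify + compress
        frames = y[: len(y) // hop * hop].reshape(-1, hop)
        cgram.append(frames.mean(axis=1))                      # 10 ms frame energy
    return np.array(cgram)                                     # channels x frames
```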

Neural autocorrelation for pitch perception
[Figure: the neural autocorrelation model of Licklider (1951)]

Correlogram
- Short-term autocorrelation of the output of each frequency channel of the cochleogram
- Peaks in the summary correlogram indicate pitch periods (F0)
- A standard model of pitch perception
[Figure: correlogram and summary correlogram of a double vowel, showing the two F0s]
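A correlogram sketch in Python; the 20 ms window, 12.5 ms lag range, and the 500 Hz upper F0 bound (to skip the zero-lag peak) are illustrative assumptions. Here channels holds the unframed per-channel filter outputs:

```python
import numpy as np

def correlogram(channels, fs, t0, win=0.020, max_lag=0.0125):
    # Short-term autocorrelation of each channel at frame start time t0.
    n0, w, L = int(t0 * fs), int(win * fs), int(max_lag * fs)
    acg = []
    for y in channels:
        seg = y[n0: n0 + w + L]
        ac = [np.dot(seg[:w], seg[lag: lag + w]) for lag in range(L)]
        acg.append(ac)
    acg = np.array(acg)                  # channels x lags
    summary = acg.sum(axis=0)            # summary correlogram
    lo = int(fs / 500)                   # ignore lags below 2 ms (F0 > 500 Hz)
    f0 = fs / (lo + np.argmax(summary[lo:]))
    return acg, summary, f0
```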

Neural cross-correlation
- Cross-correlogram: cross-correlation (or coincidence) between the left-ear signal and the right-ear signal
- Strong physiological evidence supports this neural mechanism for sound localization, more specifically azimuth localization (Jeffress, 1948)
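A per-channel sketch of the underlying computation (the ±1 ms lag range is the usual physiological assumption); a full cross-correlogram just repeats this across channels and time frames:

```python
import numpy as np

def itd_estimate(left, right, fs, max_itd=0.001):
    # Full cross-correlation, keeping only physiologically plausible lags.
    c = np.correlate(left, right, mode="full")
    lags = np.arange(-(len(right) - 1), len(left))
    keep = np.abs(lags) <= int(max_itd * fs)
    best = lags[keep][np.argmax(c[keep])]
    return -best / fs   # sign chosen so positive ITD means the left-ear signal leads
```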

Azimuth localization example (target: 0°, noise: 20°)
[Figure: cross-correlogram within one frame. The skeleton cross-correlogram sharpens the cross-correlogram, making peaks along the azimuth axis more pronounced]

Dichotomy of segmentation and grouping
Mirroring Bregman's two-stage conceptual model, a CASA model generally consists of a segmentation stage and a subsequent grouping stage:
- The segmentation stage decomposes an acoustic scene into a collection of segments, each of which is a contiguous region in the cochleogram, using:
  - Temporal continuity
  - Cross-channel correlation, which encodes correlated responses (fine temporal structure) of adjacent filter channels
- Grouping aggregates segments into streams based on various ASA cues

Ideal binary time-frequency mask
- A main CASA goal is to retain the parts of a target sound that are stronger than the acoustic background, or to mask interference by the target. What counts as the target depends on intention, attention, etc.
- Within a local time-frequency (T-F) unit, the ideal binary mask is 1 if the target energy is stronger than the interference energy, and 0 otherwise (Hu & Wang, 2001)
  - A local 0 dB SNR criterion for mask generation; other local SNR criteria are possible
- Consistent with the auditory masking phenomenon: a stronger signal masks a weaker one within a critical band
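The definition translates directly into code. Here the target and interference energies are assumed to be known before mixing, which is exactly how the ideal mask is constructed:

```python
import numpy as np

def ideal_binary_mask(target_energy, interference_energy, lc_db=0.0):
    # target_energy, interference_energy: T-F energy arrays of the same shape.
    # lc_db = 0.0 is the local 0 dB SNR criterion; other thresholds just change lc_db.
    snr_db = 10.0 * np.log10((target_energy + 1e-12) / (interference_energy + 1e-12))
    return (snr_db > lc_db).astype(np.uint8)   # 1 where the target dominates the unit
```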

Ideal binary mask illustration [figure]

Properties of the ideal binary mask
- Flexibility: with the same mixture, the definition leads to different masks depending on what the target is
- Well-definedness: the ideal mask is well defined no matter how many intrusions are in the scene or how many targets need to be segregated
- The ideal binary mask is very effective for human speech intelligibility; the local 0 dB SNR criterion appears to be optimal for human listeners (Brungart et al., in preparation)
- The ideal binary mask provides an excellent front end for robust automatic speech recognition (Cooke et al., 2001; Roman et al., 2003)

Resynthesis from a binary T-F mask
With a cochleogram, a waveform signal can be resynthesized from a binary T-F mask (Weintraub, 1985; Brown & Cooke, 1994). The binary T-F mask is used as a matrix of binary weights on the gammatone filterbank:
- The output of a gammatone filter at a particular time frame is time-reversed and passed through the filter again, and the response is time-reversed a second time; this compensates for across-channel phase shifts
- The output is either retained or removed according to the corresponding value of the mask
- A raised-cosine window is applied to each frame's output
- The outputs from all filter channels at all time frames are summed; the result is the resynthesized signal
The sound demo two slides earlier uses this resynthesis technique
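A sketch of these steps, reusing gammatone_ir() and erb_space() from the earlier sketches; the 10 ms half-overlapped raised-cosine frames are an assumption consistent with the frame size used elsewhere in this tutorial:

```python
import numpy as np

def resynthesize(x, mask, fs, frame=0.010):
    # x: mixture waveform; mask: binary T-F mask (channels x frames).
    n_ch, n_fr = mask.shape
    hop = int(frame * fs)
    # Raised-cosine (Hann) window; 50%-overlapped copies sum to one.
    win = 0.5 - 0.5 * np.cos(2 * np.pi * np.arange(2 * hop) / (2 * hop))
    out = np.zeros(len(x))
    for c, f0 in enumerate(erb_space(50.0, 8000.0, n_ch)):
        g = gammatone_ir(f0, fs)
        y = np.convolve(x, g, mode="same")
        y = np.convolve(y[::-1], g, mode="same")[::-1]  # reverse-filter-reverse:
        for m in range(n_fr):                            # cancels the phase shift
            if mask[c, m]:                               # keep only retained units
                lo = m * hop
                seg = y[lo: lo + 2 * hop]
                out[lo: lo + len(seg)] += win[: len(seg)] * seg
    return out
```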

Missing data recognition
- The aim of ASR is to assign an acoustic vector X to a class C so that the posterior probability P(C|X) is maximized: P(C|X) ∝ P(X|C) P(C)
- If components of X are unreliable or missing, one cannot compute P(X|C) as usual
- The missing data technique adapts a hidden Markov model (HMM) classifier to cope with missing or unreliable features (Cooke et al., 2001): partition X into reliable parts X_r and unreliable parts X_u, and use the marginal distribution P(X_r|C) in recognition
- It requires a T-F mask to indicate the reliable regions, which can be supplied by a CASA system; this provides a natural bridge between a binary T-F mask generated by CASA and recognition
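For a diagonal-covariance Gaussian state, marginalizing out the unreliable components X_u simply drops their dimensions from the likelihood, as in this sketch (the diagonal-Gaussian form is an assumption for illustration):

```python
import numpy as np

def marginal_log_likelihood(x, reliable, mean, var):
    # x, mean, var: feature vectors of one state's diagonal Gaussian;
    # reliable: boolean mask marking the reliable components X_r.
    r = np.asarray(reliable, dtype=bool)
    d = x[r] - mean[r]
    # log N(x_r; mean_r, var_r): sum only over the reliable dimensions.
    return -0.5 * np.sum(d * d / var[r] + np.log(2 * np.pi * var[r]))
```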

III.2 Monaural segregation
- Primitive segregation: segregation based on primitive ASA cues
  - Brown and Cooke model (1994)
  - Hu and Wang model (2004)
- Model-based segregation: segregation based on speech models
  - Barker, Cooke, and Ellis model (2004)

A list of representative CASA models
- Weintraub's 1985 dissertation at Stanford
  - First systematic CASA model
  - ASA cues explored: pitch and onset (only pitch used later)
  - Uses an HMM for source organization
  - Evaluated on speech recognition, but results are ambiguous
- Cooke's 1991 dissertation at Sheffield (published as a book in 1993)
  - Segments as synchrony strands
  - Two grouping stages: stage 1 based on harmonicity and common AM, stage 2 based on pitch contours
  - Systematic evaluation using 100 mixtures (10 voiced utterances mixed with 10 noise types)
- Brown and Cooke's 1994 Computer Speech & Language paper (detailed here)
- Ellis's 1996 dissertation at MIT
  - A prediction-driven model, where prediction ranges from simple temporal continuity to complex inference based on remembered sound patterns
  - Organization is done on a blackboard architecture that maintains multiple hypotheses
  - Incomplete implementation
- Wang and Brown's 1999 IEEE Transactions on Neural Networks paper
  - An oscillatory correlation model with emphasis on a plausible neural substrate
  - Clear separation of segmentation from grouping, where the former is based on cross-channel correlation and temporal continuity
- Hu and Wang's 2004 IEEE Transactions on Neural Networks paper (detailed here)
- Barker, Cooke, and Ellis's 2004 Speech Communication paper (detailed here)

Brown and Cooke model
- A primitive CASA model with emphasis on representations and physiological motivation
- Computes a collection of auditory map representations
- Computes segments from the auditory maps
- Groups segments into streams by pitch and by common onset and offset
- Systematically evaluated using a normalized SNR (signal-to-noise ratio) metric

Model diagram [figure]

Rate map
- A form of cochleogram
- Uses a gammatone filterbank and the Meddis hair cell model to form a neural firing-rate representation
[Figures: rate maps in response to the utterance "Don't ask me to carry an oily rag like that" and to a telephone trill]

Frequency transition map
The objective is to track spectral peaks across time
[Figures: frequency transition maps in response to the utterance and to the telephone trill]

Onset and offset maps
The objective is to incorporate the common onset and offset cues
[Figures: onset maps in response to the utterance and to the telephone trill]

Autocorrelation and cross-channel correlation
- The autocorrelation map is a correlogram
- The cross-channel correlation map is formed with respect to a similarity metric; later simplified by Wang and Brown (1999) to just the cross-channel correlation of the normalized correlogram
[Figure: one time frame for the vowel /æ/]

Segment formation
- This stage combines the information from the different auditory maps
- Segment formation is based on 3 rules that specify the birth and death of each segment
- The key maps are the frequency transition map and the cross-channel correlation map

Grouping as symbolic search
- A local F0 contour is extracted from each segment
- Grouping combines segments that overlap in time and have similar local F0 contours; further adjustments are made using common onset and offset
- The onset and offset cues did not contribute much to model performance

Hu and Wang model
Also a primitive CASA model, with the following features:
- Deals with resolved and unresolved harmonics differently
- For unresolved harmonics, grouping is based on AM analysis
- Computes a target pitch contour
- Formulates the CASA problem as estimating the ideal binary mask
The model produces a substantial performance improvement over previous models

Resolved and unresolved harmonics
- For voiced sound, the lower harmonics are resolved while the higher harmonics are not
- For unresolved harmonics, the envelopes of the filter responses fluctuate at the fundamental frequency of the speech
- The model uses different grouping mechanisms for low-frequency and high-frequency components:
  - Low-frequency components are grouped based on periodicity and temporal continuity
  - High-frequency components are grouped based on amplitude modulation and temporal continuity

Model diagram [figure]

Initial segregation and target pitch tracking
- Segments are formed based on temporal continuity and cross-channel correlation
- Initial grouping into a foreground (target) stream and a background stream according to the dominant pitch, using the oscillatory correlation model of Wang and Brown (1999)
- Target speech periods are estimated from the initial target stream, subject to two psychoacoustically motivated constraints:
  - The target pitch should agree with the periodicity of the T-F units in the initial speech stream
  - Pitch periods change smoothly, which allows for verification and interpolation

Pitch tracking example
[Figure: (a) dominant pitch (line: pitch track of clean speech) for a mixture of target speech and 'cocktail party' intrusion; (b) estimated target pitch]

T-F unit labeling
- In the low-frequency range, a T-F unit is labeled by comparing the periodicity of its autocorrelation with the estimated target pitch
- In the high-frequency range, wide bandwidths mean that each filter responds to multiple harmonics; these responses are amplitude modulated due to beats and combination tones (Helmholtz, 1863). A T-F unit in the high-frequency range is therefore labeled by comparing its AM rate with the estimated target pitch

Amplitude modulation
[Figure: (a) the output of a gammatone filter (centre frequency: 2.6 kHz) in response to clean speech; (b) the corresponding autocorrelation function]
To obtain an AM rate, a filter response is half-wave rectified and bandpass filtered. The resulting signal within a T-F unit is modeled by a single sinusoid fitted with the gradient descent method; the frequency of the sinusoid indicates the AM rate
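A sketch of the sinusoidal fit. The learning rates, iteration count, and the initialization f_init near the expected pitch are illustrative assumptions (gradient descent on a sinusoid's frequency only converges locally, so a sensible f_init matters):

```python
import numpy as np

def am_rate(y, fs, f_init, iters=2000, lr_a=0.05, lr_p=0.05, lr_f=5.0):
    # y: half-wave rectified, bandpass-filtered response within one T-F unit.
    t = np.arange(len(y)) / fs
    y = y - y.mean()
    y = y / (np.std(y) + 1e-12)          # normalize so the step sizes behave
    A, f, phi = np.sqrt(2.0), float(f_init), 0.0
    for _ in range(iters):
        theta = 2.0 * np.pi * f * t + phi
        r = A * np.sin(theta) - y         # residual of the sinusoidal model
        A   -= lr_a * np.mean(r * np.sin(theta))            # dE/dA
        phi -= lr_p * np.mean(r * A * np.cos(theta))        # dE/dphi
        f   -= lr_f * np.mean(r * A * np.cos(theta) * 2.0 * np.pi * t)  # dE/df
    return f                              # estimated AM rate in Hz
```

Labeling then compares the returned AM rate against the estimated target pitch for that frame.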

Final segregation
- New segments corresponding to unresolved harmonics are formed based on temporal continuity and cross-channel correlation of response envelopes (i.e. common AM), and are then grouped into the foreground stream according to their AM rates
- The foreground stream is adjusted to remove segments that do not agree with the estimated target pitch
- Other units are grouped according to temporal and spectral continuity

Segregation example [figure]

Systematic SNR results
- Evaluation on a corpus of 100 mixtures (Cooke, 1993): 10 voiced utterances x 10 noise intrusions (see the next slide)
- Average SNR gain: 12.3 dB; 5.2 dB better than the Wang-Brown model (1999), and 6.4 dB better than the spectral subtraction method
[Figure: SNR (in dB) of the Hu-Wang model across intrusions]

Monaural CASA progress via sound demo
The 100-mixture set used by Cooke (1993): 10 voiced utterances mixed with 10 noise intrusions (N0: tone, N1: white noise, N2: noise bursts, N3: 'cocktail party', N4: rock music, N5: siren, N6: telephone, N7: female utterance, N8: male utterance, N9: female utterance)
Demos compare Cooke (1993), Ellis (1996), Wang & Brown (1999), and Hu & Wang (2004) on mixtures of voiced speech with the telephone, male, and female intrusions

Barker, Cooke, and Ellis model
Basic observations:
- Pure primitive CASA is not capable enough; listeners also use top-down information
- Pure schema-based CASA needs models for all sources and ignores the organization present in the input
Barker, Cooke, and Ellis (2004) proposed a model-based approach to integrate primitive and schema-based processes in CASA. The key idea is to use recognition to group the segments (fragments) generated by primitive organization

CASA recognition
Extending traditional speech recognition theory, the goal of CASA recognition is to find the word sequence W and the speech/background segregation S that jointly have the maximum a posteriori probability given the mixture:
(Ŵ, Ŝ) = argmax_{W,S} P(W, S | Y)
Here Y represents the mixture acoustics and X represents the (unobserved) speech acoustics

Reintroducing the speech acoustics
Since W is independent of S and Y given X:
P(W, S | Y) = P(S | Y) ∫ P(W | X) P(X | S, Y) dX = P(W) P(S | Y) ∫ [P(X | W) / P(X)] P(X | S, Y) dX
Here, P(X) cannot be dropped from the integral

CASA recognition with HMM modeling
For HMM recognition, a hidden state sequence is introduced. The components of the resulting decoder:
- Language model: bigrams, dictionary
- Acoustic model: schemas
- Segregation model: primitive grouping
- Search algorithm: a modified decoder
- Segregation weight: the connection to the observations

Segregation model
- The objective of segregation is to produce a binary mask indicating which T-F units belong to each source
- For a 32-channel filterbank and a frame rate of 100 frames/second, there are 2^3200 binary masks for one second of speech!
- The task of primitive organization is to reduce this figure to a manageable level: with N segments there are 2^N possible organizations, a large reduction; inter-segment grouping helps further

Efficient search
1. Share computations up to the point where two alternative groupings differ
2. Split decoders to deal with the differing segment
3. Merge decoders when the segment terminates, since their futures are identical

Illustration
- Implemented by a token-passing Viterbi algorithm, modified to maintain multiple tokens in each state
- Not the same as search pruning: the same result is returned as for an exhaustive search

Acoustic model and segregation weighting
- It is impractical to evaluate the acoustic model and the segregation weighting term over the entire search space of X; simplifying assumptions must be made
- The segregation weighting is determined by analyzing speech and background dominance

Experimental evaluation
- No real CASA: simple noise estimation is used to split the signal into foreground and background masks
- The spectrum is divided into 4 subbands, and contiguous regions in each band provide the segments

Results
- Factory noise: a stationary background plus hammer blows, machine noise, etc.
- MFCC + CMN is an effective approach for clean speech
- Missing data recognition fares well in the AURORA evaluation
- The model-based approach provides about a 1.5 dB improvement over missing data recognition at low SNR levels, even though no significant CASA processing is done

III.2 Monaural segregation: Interim summary
- The main progress has occurred in voiced speech segregation, accomplished with minimal assumptions
- Reliable pitch tracking is important for CASA; Wu et al. (2003) recently published a continuous-HMM-based model for multipitch tracking and obtained very good results for noisy speech
- Segregation of unvoiced speech is a major challenge
- Primitive and model-based CASA need to be truly integrated

III.3 Binaural segregation
- The Bodden model (1993), which estimates a time-varying Wiener filter
- The cancellation-based model of Liu et al. (2001)
- The classification-based model of Roman et al. (2003)

Bodden model
- The binaural processing extends the classical cross-correlation mechanism and is applied independently in 24 critical bands
- The input is a binaural signal obtained by filtering the monaural signals with a set of HRTFs (head-related transfer functions)
- The goal of the model is to estimate a Wiener filter

Contralateral inhibition mechanism
- Activity running down a delay line in the cross-correlation processor inhibits the activity at the corresponding tap on the opposite line; the amount of inhibition can be adjusted
- Inhibition may be instantaneous or may persist for a while, based on a specified memory function; the latter can produce a precedence effect (the phenomenon that the first wavefront suppresses later ones)
- The mechanism sharpens the peaks and suppresses secondary peaks in the cross-correlation function
- The resulting pattern becomes sensitive to IID due to asymmetrical contralateral inhibition

Adaptation to HRTFs
- Key idea: the binaural processor has to adapt to the individual interaural parameters of the HRTF set
- Additional weighting of the signals moving along the delay lines is introduced so that the contralateral inhibition becomes symmetrical
- The weighting coefficients are determined by supervised learning from natural combinations of ITD and IID

Output patterns of the binaural processor
[Figure: output patterns in response to white noise presented at various azimuth angles in the horizontal plane; all 24 critical bands are shown]

Segregation method
To estimate the time-varying Wiener filter in the i-th critical band:
- The running power of the mixture is estimated at each time frame by integrating the cross-correlation output along the azimuth axis
- A window centered at the target location is used to estimate the power of the target signal
- The estimated filter becomes a time-varying weighting factor g_i(t) that weights the i-th critical band to resynthesize the segregated signal

Results and discussion
- Sound demos: two speakers at 0 dB, at (-45°, +45°) and (0°, 20°); three speakers at 0 dB, at (-45°, 0°, 30°)
- The Bodden model has been evaluated with hearing-impaired listeners: it produces a 20% increase (in absolute terms) in word intelligibility for a two-source condition and 40% for a three-source condition
- The intelligibility improvement for consonants is comparable to that for vowels
- The model, particularly the binaural processor, is very complex, involving a lot of parameter adjustment and curve fitting; as a result, it has proven difficult to re-implement

Equalization-cancellation theory
Two-stage signal processing for noise removal, following the equalization-cancellation (EC) theory of Durlach (1963):
1. Equalization, which makes the noise components identical in the two channels
2. Cancellation (subtraction) of the noise components in one channel from those in the other
- Most two-microphone noise cancellation techniques are variants of the EC theory
- A different viewpoint on sound localization than the cross-correlation mechanism of Jeffress
- Limitation: satisfactory noise reduction only in two-source situations with one target and one intrusion

Liu et al. model
A two-microphone system designed for speech segregation in the presence of multiple interfering sources. A two-stage cancellation-based model:
1. Sound localization using the 'dual delay-line' network proposed by the authors and the cross-correlation mechanism of Jeffress (1948)
2. A broadband EC scheme that capitalizes on knowledge of the spatial directions of the sound sources in order to cancel multiple noise sources

System overview
The input is preprocessed using short-time Fourier analysis to obtain the left/right spectral representations X_Ln(m) and X_Rn(m), where m indexes frequency and n the time frame

Dual delay-line structure [figure]

Equalization-cancellation at each frequency
- Equalization: at each frequency bin, the amplitude difference between the left signal and the right is equalized at a particular time lag, corresponding to a particular azimuth
- Subtraction: the right equalized signal is then subtracted from the left one
- In the case of only one interference, the output at the time lag corresponding to the noise location in the dual delay-line equals the spectral value of the target
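A per-bin sketch under an anechoic, single-interferer model: if the interferer appears in the right channel delayed by tau_n and scaled by a_n relative to the left, equalizing and subtracting nulls it. Here tau_n and a_n would come from the localization stage; the variable names are mine:

```python
import numpy as np

def ec_cancel(X_L, X_R, freqs, tau_n, a_n=1.0):
    # X_L, X_R: complex left/right spectra for one frame; freqs: bin frequencies (Hz).
    # Interferer model per bin: N_R = a_n * N_L * exp(-2j*pi*f*tau_n).
    # Equalize the right channel back to the left and subtract; the interferer
    # cancels exactly while a (filtered) target component remains.
    eq = X_R * np.exp(2j * np.pi * freqs * tau_n) / a_n
    return X_L - eq
```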

Broadband localization
- Compute coincidence locations in the delay line at each frequency and time frame from the equalized spectral components, indexed by time lag i
- Integrate the instantaneous 2-D coincidence patterns across time, with activity decay for earlier time frames
- For each potential location, integrate across frequency along the primary/secondary traces
- Peaks in the resulting pattern are the estimates of the source locations

Coincidence patterns for a two-source configuration
[Figure: theoretical primary (vertical) and secondary (curved) traces for two sources, one at -60° (solid), the other at +45° (dotted); model output 60 ms after onset, showing the effect of temporal integration]

Localization results for four sources
[Figure: (A) direct method, integrating across only the primary traces (actual: filled arrow; estimated: open arrow; phantom: double-open arrow); (B) stencil method, integrating across both primary and secondary traces]

Noise cancellation
- The localization stage provides the positions in the dual delay-line for a number of noise sources; the output at each position contains the target component plus components of the other interfering sources
- The key idea is to cancel the strongest noise component at each frequency bin; hence it is a subband cancellation approach

Speech segregation results
- Experimental results with four source locations in an anechoic chamber: values in parentheses indicate target cancellation (ideally 0 dB) and the other values indicate the cancellation of each noise source
- The average SNR gain is about 8 dB
- Sound demo with the target at 0° and two jammers at -20° and 40°

Discussion
- Results are not significantly different using either the 'direct' or the 'stencil' method for localization
- The focus is on cancellation of the intrusion rather than extraction of the target: an alternative approach to sound separation. Target degradation is small
- The EC theory is applied to individual frequency bins; this works well as long as there is no more than one dominant intrusion within an individual frequency bin

Roman, Wang, and Brown model
- Attend to one talker while filtering out acoustic interference: the focus is on the target rather than the interference
- The computational goal is to estimate the ideal binary mask
- Ideal mask estimation is based on classification

Model architecture [figure]

Azimuth localization
- Cross-correlogram for ITD detection
- Frequency-dependent nonlinear transformation from the time-delay axis to the azimuth axis
- Locations are identified as peaks in the skeleton cross-correlogram

Binaural cue extraction
- Interaural time difference: cross-correlation mechanism. To resolve the multiple-peak problem at high frequencies, ITD is estimated as the peak of the cross-correlation pattern within one period centered at the target ITD
- Interaural intensity difference: the ratio of right-ear energy to left-ear energy

Ideal binary mask estimation
- For narrowband stimuli, systematic changes in the extracted ITD and IID values occur as the relative strength of the original signals changes; this interaction produces characteristic clustering in the joint ITD-IID space
- The core of the model lies in deriving the statistical relationship between the relative strength and the binaural cues

3-source configuration example
[Figure: data histograms for one channel (centre frequency: 1.5 kHz) from speech sources with the target at 0° and two intrusions at -30° and 30° (R: relative strength); clustering in the joint ITD-IID space]

Pattern classification
- Independent supervised learning for different spatial configurations and different frequency bands in the joint ITD-IID feature space
- Decision rule: maximum a posteriori (MAP) classification of each T-F unit as target- or interference-dominant
- The probability densities are estimated nonparametrically by kernel density estimation
- Utterances from the TIMIT corpus are used for training
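A sketch of the MAP labeling for one channel and one spatial configuration, using scipy's Gaussian kernel density estimator in place of whatever kernel the paper used (an assumption), with training samples split by which source dominates:

```python
import numpy as np
from scipy.stats import gaussian_kde

def train_kdes(feats_target, feats_noise):
    # feats_*: 2 x N arrays of (ITD, IID) training samples, one set per
    # dominance class, gathered from premixed signals for this channel.
    return gaussian_kde(feats_target), gaussian_kde(feats_noise)

def classify(kde_t, kde_n, itd, iid, prior_t=0.5):
    # MAP decision per T-F unit in the joint ITD-IID space.
    x = np.vstack([itd, iid])                   # 2 x M units to label
    post_t = prior_t * kde_t(x)
    post_n = (1.0 - prior_t) * kde_n(x)
    return (post_t > post_n).astype(np.uint8)   # 1 = target-dominant unit
```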

Example (target: 0°, noise: 30°)
[Figure: target, noise, mixture, ideal binary mask, and segregation result]

Sound demo
- 2 sound sources (target: 0°, noise: 30°): target, noise, mixture, and segregated target for 'cocktail party', siren, and female speech interference
- 3 sound sources (target: 0°, noise1: -30°, noise2: 30°): target, noise1, noise2, mixture, and segregated target for 'cocktail party' and female speech interference

SNR results
The model yields large SNR improvements:
- For 2-source configurations, the average SNR gain (at the better ear) ranges from 13.7 dB down to 5 dB, depending on the azimuth separation and the deviation from the median plane
- For 3 sources, the average SNR gain is 11.3 dB in good configurations
- The model performs 3.5 dB better than the Bodden model under the conditions favorable to the Bodden model; the improvement is significantly larger in other conditions

ASR evaluation
The missing-data technique for robust speech recognition is employed; the task domain is recognition of connected digits
[Figure: results for a target at 0° with a male-speech intrusion at 30°, and for a target at 0° with two intrusions at 30° and -30°]

Speech intelligibility evaluation
The Bamford-Kowal-Bench sentence database, which contains short, semantically predictable sentences, is used as the target material
- Two-source (0°, 5°) condition; interference: babble noise (mixture and segregated demos)
- Three-source (0°, 30°, -30°) condition; interference: a male utterance and a female utterance (mixture and segregated demos)

Discussion
- The estimation of the ideal binary mask is based on pattern classification in the joint ITD-IID feature space
- Training is configuration-specific and frequency-specific
- The estimated masks are very similar to the ideal ones
- High-quality estimation of the ideal binary mask translates into high ASR and speech intelligibility scores

III.3 Binaural segregation: Interim summary
- It pays to have an additional microphone: binaural segregation produces better results than monaural segregation, and it works equally well for voiced and unvoiced speech
- Binaural segregation employs spatial cues, whereas monaural segregation exploits intrinsic sound characteristics
- A main challenge in binaural segregation is room reverberation; see, for example, Roman and Wang (2004)
- Can one achieve robustness to reverberation without analyzing sound characteristics?

IV. Overall discussion: CASA challenges
- Event tracking/prediction at multiple time scales: sequential organization, sound motion, rhythm
- Role of attention: what to attend to? How to attend, i.e. mechanisms of auditory attention?
- Rapid adaptation to the acoustic environment: echo and reverberation
- Analysis and classification of natural sound sources: non-speech, non-music environmental sounds
- Monaural segregation of unvoiced speech
- Integration: multicue integration; primitive and schema-based CASA

Concluding remarks
- Auditory perception is robust to a variety of intrusions and distortions that characterize real-world environments
- The primary feature of CASA is that it is perceptually motivated, with emphasis on perceptual characteristics and on sound properties; in other words, it is content-based analysis
- Despite a modest start and a period of slow progress, recent advances have made CASA a viable approach to real-world sound separation
- It is an emerging field of study with major importance and a host of research issues

Bibliography
Arbib (1989). The Metaphorical Brain 2. Wiley & Sons.
Assmann & Summerfield (2004). In: Speech Processing in the Auditory System (eds. Greenberg, Ainsworth, Popper & Fay). Springer.
Barker, Cooke & Ellis (2004). Speech Communication, in press.
Bodden (1993). Acta Acustica 1: 43-55.
Bregman (1990). Auditory Scene Analysis. MIT Press.
Bronkhorst & Plomp (1992). JASA 92: 3132-3139.
Brown & Cooke (1994). Computer Speech & Language 8: 297-336.
Brungart et al. (in preparation).
Cherry (1957). On Human Communication. Wiley & Sons.
Cooke (1993). Modelling Auditory Processing and Organisation. Cambridge University Press.
Cooke et al. (2001). Speech Communication 34: 267-285.
Drullman & Bronkhorst (2000). JASA 107: 2224-2235.
Durlach (1963). JASA 35: 1206-1218.
Ellis (1996). PhD dissertation, MIT.
Helmholtz (1863). On the Sensations of Tone (A. J. Ellis, trans.). Dover.
Hu & Wang (2001). WASPAA.
Hu & Wang (2004). IEEE Transactions on Neural Networks, in press.
Jeffress (1948). J. Comp. Physiol. Psychol. 61: 468-486.
Large & Jones (1999). Psychological Review 106: 119-159.
Licklider (1951). Experientia 7: 128-134.
Lippmann (1997). Speech Communication 22: 1-15.
Liu et al. (2001). JASA 110: 3218-3231.
Miller (1947). Psychological Bulletin 44: 105-129.
Miller, Heise & Lichten (1951). JEP 41: 329-335.
Roman & Wang (2004). ICASSP.
Roman, Wang & Brown (2003). JASA 114: 2236-2252.
Steeneken (1992). PhD dissertation, University of Amsterdam.
Wang & Brown (1999). IEEE Transactions on Neural Networks 10: 684-697.
Weintraub (1985). PhD dissertation, Stanford University.
Wu, Wang & Brown (2003). IEEE Transactions on Speech & Audio Processing 11: 229-241.
Yost (1997). In: Binaural and Spatial Hearing in Real and Virtual Environments (eds. Gilkey & Anderson). LEA.

Resources
- Source programs for several studies referred to in this tutorial are available from http://www.cse.ohio-state.edu/pnl
- Sheffield University hosts the Matlab Auditory Demonstrations at http://www.dcs.shef.ac.uk/~martin/MAD/docs/mad.htm
- The 100-mixture set collected by Cooke (1993) can be downloaded from http://www.dcs.shef.ac.uk/~martin/corpora/cookephd.tar.gz
- ShATR is a corpus for CASA research consisting of up to 5 simultaneous speakers, recorded by 8 microphones; it is fully labelled and downloadable from http://www.dcs.shef.ac.uk/research/groups/spandh/projects/shatrweb
- Bregman & Ahad (1995) produced a CD that accompanies Bregman's 1990 book; it can be ordered from MIT Press
- Houtsma, Rossing & Wagenaars (1987) produced a CD demonstrating basic auditory perception phenomena; it can be ordered from the Acoustical Society of America
- Acustica, vol. 82, 1996, has an accompanying CD that contains, among others, demonstrations of binaural processing

Acknowledgements
I thank Martin Cooke for making publicly available his tutorial at NIPS 2002, which forms the basis for many slides used in this tutorial, and Guoning Hu, Niki Roman, and Soundar Srinivasan for their assistance in the preparation. Research supported in part by AFOSR and NSF.

