Musical Source Separation


1 Musical Source Separation
Auditory Scene Analysis Approach. Good afternoon. Today I'd like to talk about musical source separation, with an auditory scene analysis approach. What is musical source separation? Roughly speaking, given a mixture of musical audio, say a clip of a pop song, the task is to tell which part of the audio belongs to which acoustic source. And what about auditory scene analysis? I'll explain that later.

2 Why do this at all? Theoretically, to gain a deep algorithmic understanding of how we demix music (the cocktail party problem). Practically, to automate the music demixing process (a cocktail party processor). The first thing to ask is why do this at all; I can think of these two reasons. It is an open problem, far from perfectly solved.

3 What to classify or what to separate?
Musical Timbre – What to classify or what to separate? The second thing to ask is what to separate. As mentioned at the beginning, musical sources, which means different instruments. We are not interested in identifying chords; we are interested in partitioning every note of every chord into different groups. But wait, what is timbre in the first place?

4 With respect to time. Temporal Envelope (ADSR)?
Transient? (Prefix? Attack?) Because timbre is not something as clear-cut as time or frequency, different people may have different views of timbre. What is timbre? Some prefer to look at timbre as a kind of feature with respect to time. (Bregman, 481)

5 With respect to frequency
Harmonic Series? (How many coincide? The behavior of the series?) Spectral Envelope? Formant transposition? Others tend to treat it as a spectral characteristic. (Bregman, )

6 What is timbre? No one knows exactly. Is it a dimension? Is it a label in our brain? Is it a connection to vision? Is it a complicated dynamics of the harmonic series?… What is timbre from a computer's point of view? Of course, many people also want to group temporal and spectral features together, or even assign a new dimension to timbre. But still, we cannot come up with an exact answer to what timbre is, and therefore we also cannot deal with timbre easily in a computer.

7 Auditory scene analysis (ASA) – A Gestalt Approach to the Auditory System
Scene Analysis – "The allocation of regions to objects." Then researchers came up with this theory, called ASA. Rather than looking at an acoustic event as a series of segmented physical events, it takes the rules of human perception into account and essentially tries to answer the question "what is timbre?"

8 Auditory scene analysis
Segmentation → Grouping Sound Units → Auditory Streaming. Now we've come to the interesting part. (Basically, we can distinguish a source from a mixture of sources because it makes sense as a unit: it is an auditory stream.) A stream is a cluster of related qualities, the same as the word "object" in visual descriptions. "Sound" merely denotes the physical sense of an acoustic event, while "stream" means its psychological sense. When listeners listen to acoustic input, they "parse" it into streams; the sound is heard as "belonging to an object." You can see the guys playing there, and when you close your eyes, you can also "see" the guys playing there. (Gestalt grouping explanation.) "Sound" is something sensational, while "stream" is something perceptual (the difference between sensation and perception); sensation turns into perception. (Bregman, Ch. 1; Wang, D.L., and G.J. Brown, 2)

9 You may not realize how remarkable a feat it is for us to differentiate one stream from a complicated mixture of streams. Please imagine this…

10 Segregation (Audio Stream Segregation)
Principle of Grouping. Principle of Closure. After we realize that, the first thing to ask is "Why can we do that?" Gestalt psychologists provide us with some principles to explain it. These principles are about how we process information. Segregation ≠ sensation: segregation is the way we recognize the acoustic input; it is effectively feature extraction and classification. Foreground and background: we can only focus on one thing at a time. (Bregman, Ch. 2)

11 Segregation (Audio Stream Segregation)
Principle of Figure-Ground

12 Segregation (Audio Stream Segregation)
Principle of Figure-Ground (the illustration labels the Figure and the Ground)

13 Segregation (Audio Stream Segregation)
Principle of Figure-Ground. You may argue that this is not really an explanation, but rather a phenomenon. Right! But only by looking at things in such a different way can we come up with a better solution than just analyzing what timbre is. Because timbre is something meaningful to human beings, we should probably look at the problem in a more human-like way.

14 Integration Sequential Integration (melody, rhythm)
Group via similarity with respect to spectrum and transient. Spectral Integration. "Old-plus-new heuristic" – the timbre is registered in our auditory system (we know it beforehand!). Based on those principles, this slide actually tells us some of the reasons why we can differentiate those auditory streams. It depends on how you listen to music: it is the way you segregate the audio input, giving different representations of the same physical input. Sequential unit: the smallest unit that can be separated in a melody. To recognize chords is trivial; to partition the spectrum is non-trivial. How much work does it take to partition the spectrum into different groups, i.e., to partition the total harmonic series into subsets that belong to different fundamentals and hence to different physical objects? (A toy sketch of this grouping follows below.) Pitch system → grouping system? We can confidently distinguish a single signal from a mixture of signals because we know it beforehand, so our attention can be directed to that particular group of spectral components. The mechanism depends on our prior knowledge and the continuity of the auditory stream. Masking effect: lower harmonics mask higher harmonics, and higher intensity masks lower intensity. If the frequencies belong to the same source, they change in the same way (common fate). (Bregman, Ch. 3)
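To make the harmonic-grouping question concrete, here is a minimal sketch (my own illustration, not from the presentation) that assigns spectral peaks to candidate fundamentals by harmonicity; the peak list, candidate F0s, and tolerance are all hypothetical.

```python
def group_by_harmonicity(peak_freqs, candidate_f0s, tol=0.03):
    """Assign each spectral peak to the candidate fundamental whose harmonic
    series it fits best, within a relative tolerance (hypothetical values)."""
    groups = {f0: [] for f0 in candidate_f0s}
    for f in peak_freqs:
        best_f0, best_err = None, tol
        for f0 in candidate_f0s:
            n = max(1, round(f / f0))            # nearest harmonic number
            err = abs(f - n * f0) / (n * f0)     # relative deviation from that harmonic
            if err < best_err:
                best_f0, best_err = f0, err
        if best_f0 is not None:
            groups[best_f0].append(f)
    return groups

# Peaks from a toy mixture of a 220 Hz source and a 330 Hz source.
# Note that 660 Hz fits both series -- exactly why the partition is non-trivial.
peaks = [220, 330, 440, 660, 880, 990, 1100]
print(group_by_harmonicity(peaks, candidate_f0s=[220, 330]))
```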

15 Integration Principle of Common Fate
This slide provides us with a very important insight into "timbre" from the view of ASA (or Gestalt psychology): the principle of common fate. A static picture by itself gives no cue for clustering, while dynamic change does (time is involved). There are two kinds of common fate in acoustic events, namely FM and AM. (FM: the whole harmonic series of partials goes up or down proportionally and simultaneously; AM: the amplitudes of those spectral components go up or down proportionally and simultaneously.) FM can be further divided into CD and CR (constant difference and constant ratio), and CD does not work well because it destroys the natural harmonic relationship (the harmonic series is no longer a set of integer multiples). The micromodulation experiment substantiated the FM theory. AM: onset-offset synchronization (a retroactive effect in the auditory system), amplitude pattern, phase pattern (or you can say the common fate of amplitude and the common fate of phase). You can understand this picture as several flocks of birds, or a certain fighting scene in a movie. A small sketch of amplitude-based common fate follows below.
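As a hedged illustration of the AM side of common fate (not taken from Bregman), the sketch below groups partials whose amplitude envelopes rise and fall together; the envelopes and the correlation threshold are invented for the example.

```python
import numpy as np

def common_fate_groups(envelopes, threshold=0.9):
    """Greedily group partials whose amplitude envelopes are strongly
    correlated over time, i.e. partials that share a 'common fate' (AM)."""
    corr = np.corrcoef(envelopes)            # pairwise envelope correlations
    assigned, groups = set(), []
    for i in range(len(envelopes)):
        if i in assigned:
            continue
        group = [i]
        assigned.add(i)
        for j in range(i + 1, len(envelopes)):
            if j not in assigned and corr[i, j] > threshold:
                group.append(j)
                assigned.add(j)
        groups.append(group)
    return groups

# Toy example: partials 0 and 1 share one tremolo pattern, partial 2 another.
t = np.linspace(0.0, 1.0, 1000)
env_a = 1.0 + 0.5 * np.sin(2 * np.pi * 4 * t)        # 4 Hz amplitude modulation
env_b = 1.0 + 0.5 * np.sin(2 * np.pi * 7 * t)        # 7 Hz amplitude modulation
print(common_fate_groups([env_a, 0.8 * env_a, env_b]))   # -> [[0, 1], [2]]
```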

16 Schema-based (controlled)
Primitive vs. schema-driven ≡ bottom-up vs. top-down ≡ unconscious vs. conscious; prior learning is involved (not prior knowledge of the input, but the prior knowledge the listener has learned). Then more and more interesting things follow from the discussion. Can prior learning affect our separation? Does the way we separate streams involve only innate abilities? Maybe not! And actually, no! Therefore, maybe we can use schemas to accelerate the computer-automated process. This is actually something very important! Every moment of perception could be schema-driven. The primitive approach is innate (natural), while the schema approach is acquired (learned). The audio input activates our schemas (or patterns), and small lower-level schemas may form higher-level schemas, and so on. Of course, if there were no innate abilities, how could there be schemas? This schema approach is closely related to pattern recognition (but I have my doubts). We first apply the primitive approach to acquire the schemas, and then we use the schemas to recognize the acoustic input! If something new appears, we again use the primitive approach to analyze it, and then turn it into a schema. (Bregman, Ch. 4)

17 And here I briefly explain what schema-based integration is.
Schema-driven or hypothesis-driven: "presumed to involve the activation of stored knowledge of familiar patterns or schemas in the acoustic environment and of a search for confirming stimulation in the auditory input." Search for regular patterns to form a stream.

18

19 “Timbre influences scene analysis, but scene analysis creates timbre.”
(Bregman, 488) To sum up this part, I would like to quote a sentence from Bregman. I think this is related to schemas. It is a kind of dialectic, but that's it. Still not very clear; still an open problem.

20 Computational auditory scene analysis (CASA)
Of course, our goal is not just to know what ASA is, but to use ASA in the automated process of musical source separation! Fortunately, there is a newly emerged field dealing with exactly this problem, called CASA: the "reverse engineering of human auditory scene analysis." Here is the last part I want to show you. It should use only one or two microphones (monaural, binaural) rather than more.

21 CASA typical system
Here is the system architecture of a typical CASA system: Acoustic Mixture → Peripheral Analysis → Feature Extraction → Mid-level Representations → Scene Organization, guided by Grouping Cues and Source/Background Models. It looks like a typical pattern recognition system except for the special techniques it uses: 1. the cochleagram (the transducer/preprocessing stage: a gammatone filter bank modelling the basilar membrane, plus the Meddis hair cell model giving the firing rate of auditory nerve fibers, yielding a time-frequency representation comparable to the spectrogram); 2. the correlogram; 3. the cross-correlogram; 4. time-frequency masks; 5. resynthesis. (Wang, D.L., and G.J. Brown, 14)

22 Cochleagram. Gammatone filter bank:
$g_{f_c}(t) = t^{N-1} \exp\!\big(-2\pi b(f_c)\, t\big) \cos\!\big(2\pi f_c t + \phi\big)\, u(t)$
In the preprocessing stage, to mimic the cochlea of the human auditory system, a gammatone filter bank is applied to the input, and the magnitude variations of each band are further transformed into a "firing rate" representation to mimic the behavior of auditory nerve fibers. Compared to the spectrogram, this time-frequency representation has the low frequencies expanded, the first formant resolved into individual harmonics, and onsets emphasized. (Wang, D.L., and G.J. Brown, 15-19)
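As a minimal sketch of the formula above (my own illustration, not code from the presentation or from Wang and Brown), the snippet below builds the gammatone impulse response and filters a test tone with a small bank; the ERB-style bandwidth rule, channel spacing, and normalisation are common choices I am assuming, not prescribed by the slide.

```python
import numpy as np

def gammatone_ir(fc, fs, order=4, duration=0.05, phase=0.0):
    """g_fc(t) = t^(N-1) * exp(-2*pi*b(fc)*t) * cos(2*pi*fc*t + phi) * u(t).
    The bandwidth b(fc) follows the usual ERB rule; this is an assumption,
    not something specified on the slide."""
    t = np.arange(0.0, duration, 1.0 / fs)
    erb = 24.7 * (4.37 * fc / 1000.0 + 1.0)
    b = 1.019 * erb
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t + phase)
    return g / np.max(np.abs(g))              # crude per-channel normalisation

def gammatone_filterbank(x, fs, center_freqs):
    """Filter the signal with each channel; rows of the result are channels.
    (A full cochleagram would also apply a hair cell model and framing.)"""
    return np.array([np.convolve(x, gammatone_ir(fc, fs), mode="same")
                     for fc in center_freqs])

fs = 16000
x = np.sin(2 * np.pi * 100 * np.arange(fs) / fs)   # 1 s of a 100 Hz tone
freqs = np.geomspace(80, 4000, 32)                 # 32 log-spaced channels (stand-in for ERB spacing)
coch = gammatone_filterbank(x, fs, freqs)
print(coch.shape)                                  # (32, 16000)
```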

23 Cochleagram (Wang, D.L., and G.J. Brown, 15-19)
The transformation result looks like this: a periodic audio input with a fundamental at 100 Hz. The interesting thing is that it contains both the time information and the spectral information within the same space, but it differs from the spectrogram in that it is a "human-sensed" version of the spectrogram. (Wang, D.L., and G.J. Brown, 15-19)

24 Cochleagram (Shao, Yang, et al., 2010)
This is a real cochleagram; you can see that the onsets are emphasized and that the low-frequency region is resolved into individual harmonics. (Shao, Yang, et al., 2010)

25 Correlogram
$\mathrm{acf}(n, c, \tau) = \sum_{k=0}^{K-1} a(n-k,\, c)\; a(n-k-\tau,\, c)\; h(k)$
The correlogram is computed by autocorrelating the cochleagram within each channel, giving a very clear view of the fundamental frequencies; this process mimics our pitch perception. It is autocorrelation-based: the channels line up on the fundamental (F0) period, looking like a spine, and a summary is taken over all frequency channels (filter bank outputs). (Wang, D.L., and G.J. Brown, 20-21)
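The sketch below (a toy illustration of the formula, not the book's implementation) computes one correlogram frame with a rectangular window h and a crude rectified two-channel "cochleagram"; the summary over channels peaks at the period of the 100 Hz fundamental. All signal parameters are hypothetical.

```python
import numpy as np

def correlogram_frame(a, n, max_lag, win_len):
    """acf(n, c, tau) = sum_{k=0}^{K-1} a(n-k, c) * a(n-k-tau, c) * h(k),
    with h a rectangular window of length win_len and a a
    (channels, time) array of firing-rate-like activity."""
    acf = np.zeros((a.shape[0], max_lag))
    for c in range(a.shape[0]):
        seg = a[c, n - win_len + 1 : n + 1]
        for tau in range(max_lag):
            lagged = a[c, n - win_len + 1 - tau : n + 1 - tau]
            acf[c, tau] = np.sum(seg * lagged)
    return acf

# Toy "cochleagram": two channels driven by a 100 Hz periodic source,
# half-wave rectified as a crude stand-in for auditory-nerve firing rate.
fs = 16000
t = np.arange(fs) / fs
act = np.maximum(np.stack([np.sin(2 * np.pi * 100 * t),
                           0.5 * np.sin(2 * np.pi * 200 * t)]), 0)
frame = correlogram_frame(act, n=8000, max_lag=320, win_len=320)
summary = frame.sum(axis=0)               # summary correlogram over channels
print(80 + summary[80:].argmax())         # ~160 samples = one period of 100 Hz
```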

26 Cross-Correlogram
$\mathrm{ccf}(n, c, \tau) = \sum_{k=0}^{M-1} a_L(n-k,\, c)\; a_R(n-k-\tau,\, c)\; h(k)$
This is related to spatial location perception via the interaural time difference (ITD). It is the interaural cross-correlation between the left (L) and right (R) channels: the spine is centered on the ITD, with a peak at the ITD (a cue for deriving the object's location), and a summary is taken over all frequency channels (filter bank outputs). (Wang, D.L., and G.J. Brown, 21-22)
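Analogously, this toy sketch (again my own illustration) cross-correlates left- and right-ear activity per channel and reads the interaural time difference off the peak lag; the delay, window length, and signals are hypothetical.

```python
import numpy as np

def cross_correlogram_frame(a_l, a_r, n, max_lag, win_len):
    """ccf(n, c, tau) = sum_{k=0}^{M-1} a_L(n-k, c) * a_R(n-k-tau, c) * h(k).
    The lag of the peak in each channel is the ITD cue for that channel."""
    n_chan = a_l.shape[0]
    ccf = np.zeros((n_chan, 2 * max_lag + 1))
    for c in range(n_chan):
        seg_l = a_l[c, n - win_len + 1 : n + 1]
        for i, tau in enumerate(range(-max_lag, max_lag + 1)):
            seg_r = a_r[c, n - win_len + 1 - tau : n + 1 - tau]
            ccf[c, i] = np.sum(seg_l * seg_r)
    return ccf

# Toy binaural input: the left ear receives the source 8 samples later.
fs, delay = 16000, 8
t = np.arange(fs) / fs
src = np.maximum(np.sin(2 * np.pi * 300 * t), 0)[np.newaxis, :]
left, right = np.roll(src, delay, axis=1), src
ccf = cross_correlogram_frame(left, right, n=8000, max_lag=20, win_len=400)
print(ccf[0].argmax() - 20)               # ~8: the interaural lag in samples
```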

27 Time-frequency Masks
Group according to principles of ASA (such as closure, common fate, schemas, whatever you like…):
$m(t, f) = \begin{cases} 1 & \text{if } s(t,f) - n(t,f) > \theta \\ 0 & \text{otherwise} \end{cases}$
This is the crucial part for source separation. Before, we only talked about different representations of the same thing; now we can apply the knowledge of auditory scene analysis to actually separate them. Is there any common fate? Similarity of onsets? Any harmonic series patterns? Once we have a time-frequency representation of the original mixed signal, we can make our judgment by masking. Time-frequency weighting: give a weight (binary or real-valued) to emphasize the spectro-temporal regions contributed by the target source, e.g. 1) a speaker against background, or 2) grouping two speakers. (Wang, D.L., and G.J. Brown, 22-23)
28 Time-frequency Masks (Shao, Yang, et al., 2010)
The mask weights (binary or real-valued) emphasize the spectro-temporal regions contributed by the target source, here for a speaker against background and for grouping two speakers. (Shao, Yang, et al., 2010)

29 Resynthesis (Cosi, P., and E. Zovato, 1996)
Inverting the time-frequency representation (specifically, the T-F mask) back into a waveform. (Cosi, P., and E. Zovato, 1996)

30 Resynthesis (Slaney, M., D. Naar, and R. F. Lyon, 1994)
Inverting the time-frequency representation (specifically, the masked cochleagram) back into a waveform. (Slaney, M., D. Naar, and R. F. Lyon, 1994)
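As a rough sketch of how a masked filterbank representation can be turned back into a waveform, the snippet below follows the filterbank-summation idea used in CASA systems; it is my own simplification, not the Slaney/Lyon model inversion, it reuses the hypothetical gammatone_ir helper from the cochleagram sketch above, and the forward-backward filtering is just one way to cancel channel phase shifts.

```python
import numpy as np

def resynthesize(x, fs, center_freqs, mask, frame_len):
    """Filter the mixture with each gammatone channel (forward and time-reversed,
    so the channel's phase shift is cancelled), zero the frames the binary mask
    rejects, and sum across channels. `mask` has shape (channels, frames).
    Assumes gammatone_ir() from the cochleagram sketch above."""
    out = np.zeros_like(x, dtype=float)
    for c, fc in enumerate(center_freqs):
        ir = gammatone_ir(fc, fs)
        y = np.convolve(x, ir, mode="same")
        y = np.convolve(y[::-1], ir, mode="same")[::-1]   # forward-backward filtering
        for m in range(mask.shape[1]):
            if mask[c, m] == 0:
                y[m * frame_len : (m + 1) * frame_len] = 0.0
        out += y
    return out
```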

31 Evaluation
Clean target signal (quantitative comparison); human listening.
Of course, we have to evaluate whether the system's performance is good or not. (Wang, D.L., and G.J. Brown, 25-26)
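When the clean target signal is available, one simple quantitative check (a common convention, not something prescribed by the slide) is the SNR of the separated output measured against the clean target, alongside human listening tests.

```python
import numpy as np

def snr_db(clean, estimate):
    """SNR of a separated estimate measured against the clean target signal."""
    clean = np.asarray(clean, dtype=float)
    err = clean - np.asarray(estimate, dtype=float)
    return 10 * np.log10(np.sum(clean ** 2) / (np.sum(err ** 2) + 1e-12))

# e.g. snr_db(target, separated) for the signals in the masking sketch above
```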

32 Other approaches: Beamforming – spatial filtering that boosts the signal from one particular direction. Blind source separation – e.g. independent component analysis (ICA): model the mixture as $x(t) = A s(t)$ and estimate the inverse of $A$. Nothing is perfect yet! There are some other approaches to musical source separation, but I should point out that none of these methods, including CASA, can claim to have successfully solved the problem. In other words, the problem remains unsolved. A small ICA sketch follows below. (Wang, D.L., and G.J. Brown, 28-29)
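As a concrete, hedged illustration of the x(t) = A s(t) model, the sketch below separates a toy two-source, two-microphone instantaneous mixture with FastICA from scikit-learn; the sources, mixing matrix, and noise level are invented for the example, and real music mixtures are convolutive and far harder.

```python
import numpy as np
from sklearn.decomposition import FastICA

# Toy instantaneous mixture x(t) = A s(t): two sources, two "microphones".
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 8000)
s = np.stack([np.sin(2 * np.pi * 220 * t),              # source 1: a tone
              np.sign(np.sin(2 * np.pi * 3 * t))])      # source 2: a slow square wave
A = np.array([[1.0, 0.6],
              [0.4, 1.0]])                               # unknown mixing matrix
x = A @ s + 0.01 * rng.standard_normal(s.shape)          # observed mixtures

ica = FastICA(n_components=2, random_state=0)
s_hat = ica.fit_transform(x.T).T   # recovered sources, up to scaling and permutation
print(s_hat.shape)                                       # (2, 8000)
```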

33 Citation
Figure: "Auditory scene analysis." Retrieved Nov. 23rd, 2012. Figure: "Quintet." Retrieved Nov. 23rd, 2012, from Cherrybam.Com.

34 Citation Bibliography
Bregman, Albert S. Auditory Scene Analysis: The Perceptual Organization of Sound. MIT Press, 1990.
"Gestalt Psychology."
Wang, D.L., and G.J. Brown. Computational Auditory Scene Analysis: Principles, Algorithms, and Applications. IEEE Press, 2006.
Shao, Yang, Soundararajan Srinivasan, Zhaozhang Jin, and DeLiang Wang. "A Computational Auditory Scene Analysis System for Speech Segregation and Robust Speech Recognition." Computer Speech & Language 24, no. 1 (2010).
Slaney, M., D. Naar, and R. F. Lyon. "Auditory Model Inversion for Sound Separation." Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-94), 1994.
Cosi, P., and E. Zovato. "Lyon's Auditory Model Inversion: A Tool for Sound Separation and Speech Enhancement." Proc. ESCA Workshop on 'The Auditory Basis of Speech Perception', Keele University, Keele (UK), 1996.

35 Thank you! DENG JUNQI
Dept. of Electrical and Electronic Engineering, The University of Hong Kong

