Machine Listening in Silicon
Part of: Accelerated Perception & Machine Learning in Stochastic Silicon project
Who?
UIUC:
– Students: M. Kim, J. Choi, A. Guzman-Rivera, G. Ko, S. Tsai, E. Kim
– Faculty: Paris Smaragdis, Rob Rutenbar, Naresh Shanbhag
Intel:
– Jeff Parkhurst
– Ryszard Dyrga, Tomasz Szmelczynski – Intel Technology Poland
– Georg Stemmer – Intel, Germany
– Dan Wartski, Ohad Falik – Intel Audio Voice and Speech (AVS), Israel
Project overview
Motivating ideas:
– Make machines that can perceive
– Use stochastic hardware for stochastic software
– Discover new modes of computation
Machine Listening component:
– Perceive == Listen
– Escape the local optimum of Gaussian/MSE/ℓ2
Machine Listening?
Making systems that understand sound
– Think computer vision, but for sound
Broad range of fundamentals and applications
– Machine learning, DSP, psychoacoustics, music, …
– Speech, media analysis, surveying, monitoring, …
What can we gather from this?
Machine listening in the wild
Highlight discovery in videos
Incident discovery in streets
Surveillance for emergencies
Some of this work is already in place
– Mostly projects on recognition and detection
– More apps in medical, mechanical, geological, architectural, …
And there's more to come
The CrowdMic project
– PhotoSynth for audio: construct audio recordings from crowdsourced audio snippets
Collaborative audio devices
– Harnessing the power of untethered open mics
– E.g. a conference call using all phones and laptops in the room
The Challenge
Today is all about small form factors
– We all carry a couple of mics in our pockets, but we don't carry the vector processors they need!
Can we come up with new, better systems?
– Which run on more efficient hardware?
– And perform just as well, or better?
The Testbed: Sound Mixtures
Sound has a pesky property: additivity
– We almost always observe sound mixtures
Models for sound analysis are monophonic
– Designed for isolated, clean sounds
– So we like to first extract and then process
[Figure: individual sounds added together (+, +, =) to form a mixture]
Focusing on a single sound
There's no shortage of methods (they all suck, by the way)
– But these are computationally some of the most demanding algorithms in audio processing
So we instead opted for a different approach that would be a good fit for hardware
– i.e. Rob told me that he can do MRFs fast
A bit of background
We like to visualize sounds as spectrograms
– 2D representations of energy over time and frequency
For multiple mics we observe level differences
– These are known as ILDs (Interaural Level Differences)
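As a rough illustration of the level difference described on this slide, the sketch below computes a per-pixel ILD in decibels from two magnitude spectrograms. The function name `ild_db`, the nested-list layout (rows are frequency bins, columns are time frames) and the `eps` floor are assumptions made for the example, not the project's code.

```python
import math

def ild_db(left_mag, right_mag, eps=1e-12):
    """Per-pixel level difference in dB between two magnitude spectrograms.

    Positive values mean the left channel is louder at that pixel.
    `eps` (an assumed safeguard) avoids taking the log of zero.
    """
    return [
        [20.0 * math.log10((l + eps) / (r + eps)) for l, r in zip(lrow, rrow)]
        for lrow, rrow in zip(left_mag, right_mag)
    ]

# Toy 2x2 spectrograms: one pixel twice as loud on the left, one on the right.
left = [[2.0, 1.0], [1.0, 1.0]]
right = [[1.0, 2.0], [1.0, 1.0]]
ild = ild_db(left, right)
```

A factor of two in magnitude corresponds to roughly 6 dB, so the first row comes out near +6 dB and -6 dB, and the second row near 0 dB.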
Finding sources
For each spectrogram pixel we take an ILD
– And plot their histogram
– Each sound/location will produce a mode
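The histogram step can be sketched as follows. This is a minimal example under assumed choices: the bin range of -20 to 20 dB, the bin count, and the rule "a mode is a strict local maximum" are all illustrative and not taken from the slides. The helper names `ild_histogram` and `find_modes` are hypothetical.

```python
def ild_histogram(ild_values, lo=-20.0, hi=20.0, bins=8):
    """Count ILD values (dB) in equal-width bins over [lo, hi)."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in ild_values:
        if lo <= v < hi:
            counts[int((v - lo) / width)] += 1
    centers = [lo + (i + 0.5) * width for i in range(bins)]
    return counts, centers

def find_modes(counts, centers):
    """Return the centers of bins that are strict local maxima."""
    modes = []
    for i, c in enumerate(counts):
        before = counts[i - 1] if i > 0 else 0
        after = counts[i + 1] if i < len(counts) - 1 else 0
        if c > before and c > after:
            modes.append(centers[i])
    return modes

# Toy ILDs clustered around -6 dB and +6 dB, plus one stray value.
values = [-6.5, -6.0, -5.5, -7.0, 5.5, 6.0, 6.5, 7.0, 0.5]
counts, centers = ild_histogram(values)
modes = find_modes(counts, centers)
```

With these toy values the two clusters land in separate bins and show up as two modes. A strict-maximum rule ignores flat peaks, so real data would probably need smoothing or a proper mixture fit.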
And we use these as labels
Assign each pixel to a source, et voilà
– But it looks a little ragged
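One simple way to turn modes into per-pixel labels is to pick the nearest mode for each ILD value. The slides do not say how the assignment is done, so the nearest-mode rule and the function name `assign_to_sources` are assumptions for illustration.

```python
def assign_to_sources(ild, modes):
    """Label each pixel with the index of the nearest ILD mode (in dB)."""
    return [
        [min(range(len(modes)), key=lambda k: abs(v - modes[k])) for v in row]
        for row in ild
    ]

# Toy 2x3 ILD map and two assumed modes at -6 dB and +6 dB.
ild = [[-6.0, -5.0, 7.0],
       [-7.0, 6.5, 1.0]]
labels = assign_to_sources(ild, modes=[-6.0, 6.0])
```

Because every pixel is decided on its own, an ambiguous value such as the 1.0 dB pixel can flip label independently of its neighbors, which is one way the mask ends up looking ragged.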
Thus a Markov Random Field
Each pixel is a node that influences its neighbors
– Incorporates ILDs and smoothness constraints
– Makes my hardware friends happy
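To make the "data cost plus smoothness" idea concrete, here is a small sketch that smooths a label grid with iterated conditional modes (ICM) on a 4-connected grid with a Potts penalty. ICM is swapped in only because it is short; the slides compare against belief propagation implementations and do not describe this procedure. The cost values, the `smooth` weight and the function name `icm_smooth` are all hypothetical.

```python
def icm_smooth(data_cost, labels, smooth=1.0, sweeps=5):
    """Iterated conditional modes on a 4-connected grid with a Potts prior.

    data_cost[i][j][k] is the cost of giving pixel (i, j) label k.
    Each update picks the label minimizing the data cost plus `smooth`
    times the number of neighbors that carry a different label.
    """
    rows, cols = len(labels), len(labels[0])
    n_labels = len(data_cost[0][0])
    labels = [row[:] for row in labels]  # work on a copy
    for _ in range(sweeps):
        changed = False
        for i in range(rows):
            for j in range(cols):
                neighbors = [
                    labels[a][b]
                    for a, b in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= a < rows and 0 <= b < cols
                ]
                best = min(
                    range(n_labels),
                    key=lambda k: data_cost[i][j][k]
                    + smooth * sum(1 for n in neighbors if n != k),
                )
                if best != labels[i][j]:
                    labels[i][j] = best
                    changed = True
        if not changed:
            break  # no label moved in this sweep, so we have converged
    return labels

# 3x3 toy mask: source 0 everywhere except one weakly supported outlier.
init = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
cost = [[[0.0, 1.0] for _ in range(3)] for _ in range(3)]
cost[1][1] = [0.6, 0.4]  # center pixel slightly prefers label 1
smoothed = icm_smooth(cost, init, smooth=0.5)
```

The center pixel's small preference for label 1 is outweighed by four disagreeing neighbors, so the smoothed mask is uniform. ICM only finds a local minimum of the energy, which is one reason stronger inference methods are usually preferred.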
The whole pipeline
Spectrograms: LEFT and RIGHT (axes: time, freq)
Binary, pairwise MRF
– Observe: ILDs
Inference
Binary mask: which freqs belong to which source at each time point? (source0, source1)
~15 dB SIR boost
Oh, and we use this for stereo vision too
Reusing the same core
Markov Random Field
– Nodes: data cost
– Edges: smoothness cost
3D depth map by MRF MAP inference
[Figure labels: Iteration, Per pixel depth info, Obj.]
It's also pretty fast
Our work outperforms up-to-date GPU implementations
Performance result: single frame, Tsukuba (384x288, 16)
Real-time BP [Yang 2006]
– GPU: NVIDIA GeForce 7900 GTX
– # Iterations: (4 scales) = (5,5,10,2)
– Min. energy: N/A
Tile-based BP [Liang 2011]
– GPU: NVIDIA GeForce 8800 GTS
– # Iterations: (B, T_I, T_O) = (12, 20, 5)
– Min. energy: 396,953
Fast BP [Xiang 2012]
– GPU: NVIDIA GeForce GTX 260
– # Iterations: (3 scales) = (9,6,2)
– Min. energy: N/A
Our work
– GPU: N/A
– # Iterations: T_O = 5
– Min. energy: 393,434
Time (msec): [values missing in transcript]
And we made it error resilient
Error Resilient MRF Inference via ANT (Algorithmic Noise Tolerance)
Power saving by ANT
– Complexity overhead = 45%
– Estimated: 42% at Vdd = 0.75 V
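As commonly described, algorithmic noise tolerance pairs an error-prone main block with a cheap, coarse estimator and falls back to the estimate whenever the two disagree by more than a threshold. The sketch below shows only that decision rule. The numbers, the threshold and the function name `ant_correct` are hypothetical, and this is not the architecture from the project's hardware.

```python
def ant_correct(main_out, est_out, threshold):
    """Keep the main block's output unless it disagrees with the estimator
    by more than `threshold`, in which case fall back to the estimate."""
    return main_out if abs(main_out - est_out) <= threshold else est_out

# Hypothetical values: exact results, a main block with one large error,
# and a coarse estimator that is slightly off everywhere.
exact = [10, 12, 9, 11]
main = [10, 12, 9 + 64, 11]   # third value hit by a large hardware error
estimate = [9, 13, 8, 12]     # cheap, always roughly right
corrected = [ant_correct(m, e, threshold=4) for m, e in zip(main, estimate)]
```

Here the one large error is replaced by the estimator's value, so the worst-case deviation from the exact result drops from 64 to 1. The threshold trades missed errors against needlessly discarding good outputs.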
Back to source separation again
ILDs suffer front-back confusion and require some distance between the microphones
– So we also added Interaural Phase Differences (IPD)
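A per-pixel phase difference can be read off the complex spectrogram values of the two channels. The sketch below is a minimal illustration assuming complex STFT coefficients stored as nested lists; the function name `ipd` and the toy values are hypothetical.

```python
import cmath

def ipd(left_stft, right_stft):
    """Per-pixel phase difference (left minus right) in radians.

    Multiplying by the complex conjugate subtracts the phases, and
    cmath.phase returns the result wrapped to [-pi, pi].
    """
    return [
        [cmath.phase(l * r.conjugate()) for l, r in zip(lrow, rrow)]
        for lrow, rrow in zip(left_stft, right_stft)
    ]

# Toy complex STFT values: the channels differ by 90 degrees in the first
# pixel and are in phase (with different magnitudes) in the second.
left = [[cmath.rect(1.0, cmath.pi / 2), cmath.rect(2.0, 0.3)]]
right = [[cmath.rect(1.0, 0.0), cmath.rect(0.5, 0.3)]]
phase_diff = ipd(left, right)
```

The second pixel shows that the phase difference ignores magnitude, so it carries information the level difference does not.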
Why add IPDs?
They work best when ILDs fail
– E.g. when sensors are far apart
[Figure: panels Input, ILD, IPD, Joint; labels 30 cm, 1 cm, 15 cm]
Adding one more element
Incorporated NMF-based denoisers
– Systems that learn by example what to separate
So what's next?
Porting the whole system to hardware
– We haven't ported the front-end yet
Evaluating the results with speech recognition
Extending this model to multiple devices
– As opposed to one device with multiple mics
Relevant publications
– Kim, Smaragdis, Ko, Rutenbar. Stereophonic Spectrogram Segmentation Using Markov Random Fields, in IEEE Workshop for Machine Learning in Signal Processing, 2012
– Kim & Smaragdis. Manifold Preserving Hierarchical Topic Models for Quantization and Approximation, in International Conference on Machine Learning, 2013
– Kim & Smaragdis. Single Channel Source Separation Using Smooth Nonnegative Matrix Factorization with Markov Random Fields, in IEEE Workshop for Machine Learning in Signal Processing, 2013
– Kim & Smaragdis. Non-Negative Matrix Factorization for Irregularly-Spaced Transforms, in IEEE Workshop for Applications of Signal Processing in Audio and Acoustics, 2013
– Traa & Smaragdis. Blind Multi-Channel Source Separation by Circular-Linear Statistical Modeling of Phase Differences, in IEEE International Conference on Acoustics, Speech and Signal Processing, 2013
– Choi, Kim, Rutenbar, Shanbhag. Error Resilient MRF Message Passing Hardware for Stereo Matching via Algorithmic Noise Tolerance, in IEEE Workshop on Signal Processing Systems, 2013
– Zhang, Ko, Choi, Tsai, Kim, Rivera, Rutenbar, Smaragdis, Park, Narayanan, Xin, Mutlu, Li, Zhao, Chen, Iyer. EMERALD: Characterization of Emerging Applications and Algorithms for Low-power Devices, in IEEE International Symposium on Performance Analysis of Systems and Software, 2013