Machine Listening in Silicon


Machine Listening in Silicon
Part of the "Accelerated Perception & Machine Learning in Stochastic Silicon" project

Who?
UIUC students: M. Kim, J. Choi, A. Guzman-Rivera, G. Ko, S. Tsai, E. Kim
UIUC faculty: Paris Smaragdis, Rob Rutenbar, Naresh Shanbhag
Intel contacts:
- Jeff Parkhurst
- Ryszard Dyrga, Tomasz Szmelczynski (Intel Technology Poland)
- Georg Stemmer (Intel, Germany)
- Dan Wartski, Ohad Falik (Intel Audio Voice and Speech (AVS), Israel)

Project overview
Motivating ideas:
- Make machines that can perceive
- Use stochastic hardware for stochastic software
- Discover new modes of computation
Machine Listening component:
- Perceive == Listen
- Escape the local optimum of Gaussian/MSE/ℓ2 models

Machine Listening?
- Making systems that understand sound
- Think computer vision, but for sound
- Broad range of fundamentals: machine learning, DSP, psychoacoustics, music, …
- Broad range of applications: speech, media analysis, surveying, monitoring, …
- What can we gather from this?

Machine listening in the wild
- Some of this work is already in place
- Mostly projects on recognition and detection
- More apps in medical, mechanical, geological, architectural, … domains
- Examples: highlight discovery in videos, incident discovery in streets, surveillance for emergencies

And there's more to come
- The CrowdMic project: "PhotoSynth for audio", constructing audio recordings from crowdsourced audio snippets
- Collaborative audio devices: harnessing the power of untethered open mics, e.g. a conf-call using all the phones and laptops in the room

The Challenge
- Today is all about small form factors
- We all carry a couple of mics in our pockets, but we don't carry the vector processors they need!
- Can we come up with new, better systems?
- Which run on more efficient hardware?
- And perform just as well, or better?

The Testbed: Sound Mixtures
- Sound has a pesky property: additivity
- We almost always observe sound mixtures
- Models for sound analysis are "monophonic": designed for isolated, clean sounds
- So we like to first extract, and then process

Focusing on a single sound
- There's no shortage of methods (they all suck, by the way)
- But these are computationally some of the most demanding algorithms in audio processing
- So we instead opted for a different approach that would be a good fit for hardware
- i.e., Rob told me that he can do MRFs fast

A bit of background
- We like to visualize sounds as spectrograms: 2D representations of energy over time and frequency
- For multiple mics we observe level differences between channels
- These are known as ILDs (Interaural Level Differences)
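To make this concrete, here is a minimal NumPy/SciPy sketch (not the project's actual code) of a per-bin ILD map; `left` and `right` are assumed to be two time-aligned microphone signals at sample rate `fs`:

```python
import numpy as np
from scipy.signal import stft

def ild_map(left, right, fs, nperseg=1024, eps=1e-12):
    """Interaural Level Difference in dB for each time-frequency bin."""
    _, _, L = stft(left, fs=fs, nperseg=nperseg)   # complex spectrogram (freq, time)
    _, _, R = stft(right, fs=fs, nperseg=nperseg)
    # Level ratio between channels, per bin, in dB
    return 20.0 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))
```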

Finding sources
- For each spectrogram pixel we take an ILD
- And plot their histogram
- Each sound/location will produce a mode
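A quick illustration of the mode-finding idea, continuing the sketch above (the peak picking here is a crude stand-in, not the estimator used in the papers; `threshold` is an illustrative name):

```python
import numpy as np
from scipy.signal import find_peaks

# Pool the per-bin ILDs into a histogram; each source shows up as a mode
counts, edges = np.histogram(ild.ravel(), bins=100)
centers = 0.5 * (edges[:-1] + edges[1:])

# For two sources: take the two tallest peaks as the modes and split
# the label space at their midpoint
peaks, _ = find_peaks(counts)
top2 = peaks[np.argsort(counts[peaks])[-2:]]
threshold = centers[top2].mean()
```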

And we use these as labels
- Assign each pixel to a source, et voilà
- But it looks a little ragged
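For reference, the raw per-bin labeling is just a threshold and a pair of binary masks; this is the "ragged" result the MRF will clean up (continuing the sketches above):

```python
from scipy.signal import istft

mask0 = ild < threshold                 # bins attributed to source 0
mask1 = ~mask0                          # remaining bins go to source 1
_, source0 = istft(L * mask0, fs=fs)    # resynthesize each source from
_, source1 = istft(L * mask1, fs=fs)    # the masked (left) spectrogram
```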

Thus a Markov Random Field
- Each pixel is a node that influences its neighbors
- Incorporates ILDs and smoothness constraints
- Makes my hardware friends happy
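In symbols, this is the standard binary pairwise MRF energy (a generic formulation; the exact potentials in the papers may differ): each time-frequency bin s gets a label x_s ∈ {0, 1}, d_s is the ILD-based data cost, and the Potts term penalizes disagreeing neighbors on the spectrogram grid:

```latex
E(\mathbf{x}) \;=\; \sum_{s} d_s(x_s) \;+\; \lambda \sum_{(s,t) \in \mathcal{N}} \mathbf{1}\!\left[x_s \neq x_t\right]
```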

The whole pipeline
- Observe: left and right spectrograms (time x freq) and per-bin ILDs
- Model: binary, pairwise MRF over the time-frequency grid
- Inference: sequential tree-reweighted message passing (TRW-S)
- Output: a binary mask saying which frequencies belong to which source (source0 / source1) at each time point
- Result: ~15 dB SIR boost
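TRW-S itself is too involved to sketch here, but a few ICM (iterated conditional modes) sweeps minimize the same energy and show the shape of the inference loop; this is an illustrative stand-in, not the system's solver. `data_cost` is assumed to hold d_s(x_s) for both labels, shape (2, F, T), and `lam` weighs the Potts term:

```python
import numpy as np

def icm(data_cost, lam=1.0, sweeps=5):
    labels = np.argmin(data_cost, axis=0)        # init: per-bin best label
    F, T = labels.shape
    for _ in range(sweeps):
        for i in range(F):
            for j in range(T):
                cost = data_cost[:, i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < F and 0 <= nj < T:
                        # Potts penalty for disagreeing with this neighbor
                        cost += lam * (np.arange(2) != labels[ni, nj])
                labels[i, j] = np.argmin(cost)
    return labels                                # binary mask, shape (F, T)
```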

Reusing the same core
- Oh, and we use this for stereo vision too
- Same Markov Random Field formulation: nodes carry a data cost d_s(x_s), edges a smoothness cost between labels x_s and x_t
- Same sequential tree-reweighted message passing, now producing a 3D depth map (per-pixel depth) by MRF MAP inference

Performance Result: Single Frame
It's also pretty fast: our TRW-S work outperforms up-to-date GPU implementations.

Tsukuba (384x288, 16 labels):
  Method                      GPU                       Iterations                 Time (msec)
  Real-time BP [Yang 2006]    NVIDIA GeForce 7900 GTX   (4 scales) = (5,5,10,2)    80.8
  Tile-based BP [Liang 2011]  NVIDIA GeForce 8800 GTS   (B, TI, TO) = (12, 20, 5)  97.3
  Fast BP [Xiang 2012]        NVIDIA GeForce GTX 260    (3 scales) = (9,6,2)       61.4
  Our work (TRW-S)            N/A                       TO = 5                     26.10

Min. energy: 396,953 vs. 393,434

Error Resilient MRF Inference via ANT
- And we made it error resilient, via Algorithmic Noise Tolerance (ANT)
- Complexity overhead = 45%
- Estimated power saving by ANT: 42% at Vdd = 0.75 V

Back to source separation again
- ILDs suffer from front-back confusion and require some distance between the microphones
- So we also added Interaural Phase Differences (IPDs)
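The IPD falls out of the same two STFTs; a one-line sketch, reusing `L` and `R` from the earlier ILD example:

```python
# Per-bin Interaural Phase Difference, in radians in (-pi, pi].
# Taking the phase of the cross-spectrum avoids differencing two
# separately wrapped phase maps.
ipd = np.angle(L * np.conj(R))
```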

Why add IPDs?
- They work best where ILDs fail, e.g. when the sensors are too close together to produce usable level differences
- [Figure: separation results at 30 cm, 15 cm, and 1 cm microphone spacings, comparing the input mixture, ILD-only, IPD-only, and the joint model]

Adding one more element
- Incorporated NMF-based denoisers
- Systems that learn by example what to separate
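As a rough sketch of what "learning by example" means here: learn a dictionary W_s from clean examples of the target (and W_n from noise), then explain the mixture magnitude spectrogram V with the fixed, concatenated dictionary and keep the target's share. This uses Euclidean-cost NMF with multiplicative updates; the papers use more refined models, and all names below are illustrative:

```python
import numpy as np

def nmf_activations(V, W, iters=200, eps=1e-9):
    """Solve V ~= W @ H for H >= 0, with the dictionary W held fixed."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + eps)     # multiplicative update
    return H

def nmf_denoise(V, W_s, W_n):
    W = np.hstack([W_s, W_n])                    # target + noise bases
    H = nmf_activations(V, W)
    k = W_s.shape[1]
    target = W_s @ H[:k]                         # target-only reconstruction
    return V * target / (W @ H + 1e-9)           # soft (Wiener-style) mask
```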

So what's next?
- Porting the whole system to hardware (we haven't ported the front-end yet)
- Evaluating the results with speech recognition
- Extending this model to multiple devices, as opposed to one device with multiple mics

Relevant publications
- Kim, Smaragdis, Ko, Rutenbar. Stereophonic Spectrogram Segmentation Using Markov Random Fields, IEEE Workshop for Machine Learning in Signal Processing, 2012.
- Kim & Smaragdis. Manifold Preserving Hierarchical Topic Models for Quantization and Approximation, International Conference on Machine Learning, 2013.
- Kim & Smaragdis. Single Channel Source Separation Using Smooth Nonnegative Matrix Factorization with Markov Random Fields, IEEE Workshop for Machine Learning in Signal Processing, 2013.
- Kim & Smaragdis. Non-Negative Matrix Factorization for Irregularly-Spaced Transforms, IEEE Workshop for Applications of Signal Processing in Audio and Acoustics, 2013.
- Traa & Smaragdis. Blind Multi-Channel Source Separation by Circular-Linear Statistical Modeling of Phase Differences, IEEE International Conference on Acoustics, Speech and Signal Processing, 2013.
- Choi, Kim, Rutenbar, Shanbhag. Error Resilient MRF Message Passing Hardware for Stereo Matching via Algorithmic Noise Tolerance, IEEE Workshop on Signal Processing Systems, 2013.
- Zhang, Ko, Choi, Tsai, Kim, Rivera, Rutenbar, Smaragdis, Park, Narayanan, Xin, Mutlu, Li, Zhao, Chen, Iyer. EMERALD: Characterization of Emerging Applications and Algorithms for Low-power Devices, IEEE International Symposium on Performance Analysis of Systems and Software, 2013.