Audio-Visual Graphical Models. Matthew Beal, Gatsby Unit, University College London; Nebojsa Jojic, Microsoft Research, Redmond, Washington; Hagai Attias, Microsoft Research, Redmond, Washington.

Beal, Jojic and Attias, ICASSP’02

Overview
– Some background to the problem
– A simple video model
– A simple audio model
– Combining these in a principled manner
– Results of tracking experiments
– Further work and thoughts

Motivation – applications
Teleconferencing
– We need the speaker’s identity, position, and individual speech.
– The case of multiple speakers.
Denoising
– Speech enhancement using video cues (at different scales).
– Video enhancement using audio cues.
Multimedia editing
– Isolating/removing/adding objects, visually and aurally.
Multimedia retrieval
– Efficient multimedia searching.

Motivation – current state of the art
Video models and audio models
– Abundance of work on object tracking, image stabilization, …
– A large amount in speech recognition, ICA (blind source separation), microphone array processing, …
Very little work on combining these
– We desire a principled combination.
– Robust learning of environments using multiple modalities.
– Various past approaches:
  – Information theory: Hershey & Movellan (NIPS 12)
  – SVD-esque (FaceSync): Slaney & Covell (NIPS 13)
  – Subspace statistics: Fisher et al. (NIPS 13)
  – Periodicity analysis: Ross Cutler
  – Particle filters: Vermaak, Blake et al. (ICASSP 2001)
  – System engineering: Yong Rui (CVPR 2001)
Our approach: graphical models (Bayes nets).

Generative density modeling
Probability models that
– reflect the desired structure,
– randomly generate plausible images and sounds,
– represent the data by parameters (via ML estimation).
p(image | class) is used for recognition, detection, …
Examples: mixture of Gaussians, PCA/FA/ICA, Kalman filter, HMM.
All parameters can be learned from data!
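As a minimal illustration of the "generate, then recognize" idea, here is a two-class mixture of Gaussians; the class priors, means, and noise level are invented for this sketch, not taken from the talk.

```python
import numpy as np

# Minimal generative density model: a two-class mixture of Gaussians.
# All numbers (priors, means, noise) are illustrative assumptions.
rng = np.random.default_rng(0)

priors = np.array([0.5, 0.5])                  # p(class s)
means = np.array([[0.0, 0.0], [5.0, 5.0]])     # per-class mean "template"
std = 1.0                                      # shared isotropic noise

def sample(n):
    """Randomly generate n plausible data points: pick a class, add noise."""
    s = rng.choice(len(priors), size=n, p=priors)
    return means[s] + std * rng.normal(size=(n, 2)), s

x, s = sample(1000)

def posterior(x):
    """Recognition via Bayes' rule: p(s | x) proportional to p(x | s) p(s)."""
    d2 = ((x[None, :] - means) ** 2).sum(axis=1)   # squared distance to each mean
    log_p = -0.5 * d2 / std**2 + np.log(priors)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()
```

Sampling plays the "generate plausible data" role; `posterior` shows how the same parameters are reused for recognition.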

Speaker detection & tracking problem
[Figure: the video scenario (camera observing shifts lx, ly) and the audio scenario (microphones 1 and 2, source at lx, time delay τ).]

Bayes Nets for Multimedia
Video models
– Models such as Jojic & Frey (NIPS’99, CVPR’99, ’00, ’01).
Audio models
– Work of Attias (Neural Comp. ’98); Attias, Platt, Deng & Acero (NIPS’00, EuroSpeech’01).

A generative video model for scenes (see Frey & Jojic, CVPR’99, NIPS’01)
[Diagram: class s, with mean μs, generates the latent image z; a shift (lx, ly) produces the transformed image, which generates the observed image y.]
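The generative process for one frame can be sketched in a few lines; the image size, number of classes, and noise levels below are made up, and cyclic shifts (`np.roll`) stand in for the model's discrete translations.

```python
import numpy as np

# Hypothetical sketch of the slide's generative process for one frame.
# Shapes and noise levels are invented for illustration.
rng = np.random.default_rng(1)

H, W, n_classes = 8, 8, 2
templates = rng.normal(size=(n_classes, H, W))   # class means
template_std = 0.1                               # per-class variability
camera_std = 0.05                                # camera noise

def generate_frame():
    s = rng.integers(n_classes)                                 # class s
    z = templates[s] + template_std * rng.normal(size=(H, W))   # latent image z
    lx, ly = rng.integers(W), rng.integers(H)                   # shift (lx, ly)
    shifted = np.roll(np.roll(z, ly, axis=0), lx, axis=1)       # transformed image
    y = shifted + camera_std * rng.normal(size=(H, W))          # observed image y
    return y, (s, lx, ly)

frame, latents = generate_frame()
```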

Example
[Figure: data from a hand-held camera, with a moving subject and cluttered background; learned mean and variance for a one-class summary and for 5 classes.]

A failure mode of this model

Modeling scenes – the audio part
[Figure: the camera with microphones 1 and 2; a source at lx induces a time delay τ between the microphones.]

Unaided audio model
[Figure: the audio waveform and video frames over time, with the posterior probability over the time delay τ.]
– Periods of quiet cause uncertainty in τ (grey blurring).
– Occasionally reverberations/noise corrupt inference on τ, and we become certain of a false time delay.
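A sketch of the audio half's core computation: scoring candidate time delays between the two microphone signals by correlation and normalizing the scores into a posterior-like distribution over τ. The signals, noise level, and delay here are synthetic, and the soft-max normalization is only a stand-in for the model's exact posterior.

```python
import numpy as np

# Synthetic two-microphone setup: mic2 hears a delayed copy of mic1's signal.
rng = np.random.default_rng(2)

n, true_delay = 512, 5
source = rng.normal(size=n)
mic1 = source + 0.01 * rng.normal(size=n)
mic2 = np.roll(source, true_delay) + 0.01 * rng.normal(size=n)

# Score each candidate delay by correlation; a soft-max of the scores plays
# the role of the posterior over tau shown on the slide.
delays = np.arange(-16, 17)
scores = np.array([np.dot(np.roll(mic1, d), mic2) for d in delays])
posterior = np.exp(scores - scores.max())
posterior /= posterior.sum()
estimated = delays[np.argmax(posterior)]
```

With a quiet stretch (near-zero `source`), the scores flatten and the posterior spreads out, which is exactly the "grey blurring" the slide describes.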

Limit of this simple audio model

Multimodal localization
– The time delay τ is approximately linear in the horizontal position lx.
– Define a stochastic mapping from spatial location to temporal shift.
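One plausible form for this stochastic mapping is a linear-Gaussian link, τ ≈ α·lx + β plus noise. The parameter names and values below are illustrative assumptions, not the talk's learned calibration.

```python
import numpy as np

# Assumed linear-Gaussian link from horizontal position lx to delay tau.
# alpha, beta, and noise_std would be learned by the model; these values
# are made up for illustration.
alpha, beta, noise_std = 0.1, -3.0, 0.5
rng = np.random.default_rng(3)

def sample_tau(lx):
    """Draw a delay tau given position lx under the Gaussian link."""
    return alpha * lx + beta + noise_std * rng.normal()

def tau_log_likelihood(tau, lx):
    """log p(tau | lx): a Gaussian centered at alpha * lx + beta."""
    return (-0.5 * ((tau - alpha * lx - beta) / noise_std) ** 2
            - np.log(noise_std * np.sqrt(2 * np.pi)))
```

The log-likelihood is what lets video evidence about lx sharpen the audio posterior over τ, and vice versa.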

The combined model

The combined model
– The two halves are connected by the link between τ and lx.
– Maximize Σt [ na log p(xt) + nv log p(yt) ].

Learning using EM: E-step
The distribution Q over the hidden variables is inferred, given the current setting of all model parameters.

Learning using EM: M-step
Audio: the relative microphone attenuations and noise levels.
AV: the calibration parameters between the modalities.
Video: the object templates and precisions, and the camera noise.
Given the distribution over the hidden variables, the parameters are set to maximize the data likelihood.

Efficient inference and integration over all shifts (Frey and Jojic, NIPS’01)
– E-step: estimating the posterior Q(lx, ly, τ) involves computing Mahalanobis distances for all possible shifts in the image.
– M-step: estimating the model parameters involves integrating over all possible shifts, taking into account the probability map Q(lx, ly, τ).
– The E-step reduces to correlation and the M-step to convolution; both are done efficiently using FFTs.
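The FFT trick can be demonstrated directly: cross-correlating an observed image against a template over all cyclic shifts at once costs O(N log N), instead of a separate O(N) distance computation per shift. The sizes and data here are synthetic.

```python
import numpy as np

def correlation_map(image, template):
    """Circular cross-correlation of image with template over all 2-D shifts,
    computed with FFTs (the E-step's per-shift matching score)."""
    F = np.fft.fft2(image) * np.conj(np.fft.fft2(template))
    return np.real(np.fft.ifft2(F))

rng = np.random.default_rng(4)
template = rng.normal(size=(16, 16))
# Observe the template under a known cyclic shift of (3, 7).
shifted = np.roll(np.roll(template, 3, axis=0), 7, axis=1)

scores = correlation_map(shifted, template)       # one score per candidate shift
ly, lx = np.unravel_index(np.argmax(scores), scores.shape)
```

In the full model these scores are Mahalanobis distances under the class template and precisions, but the correlation structure, and hence the FFT speedup, is the same.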

Demonstration of tracking
[Video: audio-only (A), audio-visual (AV), and video-only (V) tracking, for different ratios na/nv.]

Inside the EM iterations
[Figure: the posteriors Q(τ | x1, x2, y) and Q(lx | x1, x2, y) over the EM iterations.]

Tracking and stabilization
[Video: tracking and stabilization results.]

Work in progress: models
Incorporating a more sophisticated speech model
– Layers of sound; reverberation filters.
– Extension to y-localization is trivial.
– Temporal models of speech.
Incorporating a more sophisticated video model
– Layered templates (sprites), each with their own audio (circumvents dimensionality issues).
– Fine-scale correlations between pixel intensities and speech.
– Hierarchical models? (factor analyser trees)
Tractability issues:
– Variational approximations in both audio and video.

Basic flexible layer model (CVPR’01)

Future work: applications
Multimedia editing
– Removing/adding objects’ appearances and associated sounds.
– With layers in both audio and video (cocktail party / danceclub).
Video-assisted speech enhancement
– Improved denoising with knowledge of the source location.
– Exploit fine-scale correlations of video with audio (e.g. lips).
Multimedia retrieval
– Given a short clip as a query, search for similar matches in a database.

Summary
– A generative model of audio-visual data.
– All parameters are learned from the data, including the camera/microphone calibration, in a few iterations of EM.
– Extensions to multi-object models.
– The real issue: the other curse of dimensionality.

Pixel–audio correlation analysis
– SVD; factor analysis (probabilistic PCA).
[Figure: the original video sequence and the inferred activation of the latent variables (factors, subspace vectors).]
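A non-probabilistic stand-in for this analysis: stack per-frame pixel intensities and audio features into one matrix and read the shared latent activation off the top SVD component. The data here are synthetic rank-1 observations, not the talk's video.

```python
import numpy as np

# Synthetic audio-visual data driven by one shared latent activation.
rng = np.random.default_rng(5)

T, n_pix, n_aud = 100, 50, 10
activation = rng.normal(size=T)            # shared latent factor per frame
pix_load = rng.normal(size=n_pix)          # how each pixel follows the factor
aud_load = rng.normal(size=n_aud)          # how each audio feature follows it

X = np.concatenate([np.outer(activation, pix_load),
                    np.outer(activation, aud_load)], axis=1)
X += 0.01 * rng.normal(size=X.shape)       # observation noise

# SVD of the centered frame-by-feature matrix; the top left singular vector
# recovers the activation up to sign and scale (the FA/PPCA view adds an
# explicit noise model on top of this).
U, S, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)
inferred = U[:, 0] * S[0]
```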