What's Making That Sound?

Presentation transcript:

What's Making That Sound?
Kai Li, Department of Electrical Engineering and Computer Science, University of Central Florida

Audiovisual Correlation Problem
Find the visual object whose motion generates the audio.
- The video can be made using a single microphone.
- The object can be a musical instrument, a speaker, etc.
- Assume a primary audio source dominates the audio signal.
- A special case of the general cross-modality correspondence problem.
[Figure: a video frame containing the audio source and a distracting moving object, alongside the audio track (guitar music).]

The Challenge
Significantly different resolutions.
- Temporal resolution: audio sampled at tens of kHz vs. video at 20-30 fps.
- Spatial resolution: video at about 1 million pixels per frame vs. audio with 1 numerical value per sample.
Semantic gap between modalities.
- Audio and visual signals are captured by different sensors, so their numerical values carry essentially different semantic meanings.
Prevalent noise and distractions.
- Both modalities contain noise, and multiple distractions may exist in each.

Existing Solutions
Pixel-level correlation methods.
- Objective: identify the image pixels that are most correlated with the audio signal.
- Methods: CCA and its variants, mutual information, etc.
- Limitation: pixel-level localization is noisy and carries little high-level semantic meaning.
Object-level correlation methods.
- Objective: identify the objects (i.e. image structures) that are most correlated with the audio signal.
- Methods: correlation measures are first obtained at a fine level (e.g. pixels); pixels are then clustered based on the fine-level correlation.
- Advantage: the results are segmented visual objects, which are more semantically meaningful.

Existing Approach
Existing object-level solutions also have problems. The segmentation step is sensitive to the preceding correlation analysis: because fine-level correlations are noisy, the extracted object rarely matches the true object. How can this be addressed?

An Overview of Our Approach
The general idea: first apply video segmentation, then analyze correlation afterwards.
- Audio signal strength is correlated with the object's motion intensity.
- Find audio features that represent audio signal strength.
- Find visual features that represent the object's motion intensity.
[Pipeline: Video Input -> Audio Feature Computing and Visual Feature Computing -> Audiovisual Correlation.]

Audio Representation
Audio energy features, based on the Short-Term Fourier Transform (STFT).
The window function:
$W(t) = \begin{cases} 1, & |t| < h/2 \\ 0, & \text{otherwise} \end{cases}$
The audio energy:
$a(t) = \int_0^{\infty} \left| \int_0^{T} f(t')\, W(t'-t)\, e^{-i 2\pi f t'}\, dt' \right| df$
The audio signal is framed according to the video frame rate, and the audio energy of each audio frame is computed using the above equation.
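A minimal sketch of this framing step, assuming a mono audio array and a known video frame rate; the function and variable names are illustrative, not from the original slides.

```python
import numpy as np

def frame_audio_energy(audio, sample_rate, video_fps):
    """Split a mono audio signal into frames aligned with the video frame
    rate and return one energy value per video frame (the sum over
    frequency of the STFT magnitude, as in the slide's a(t))."""
    hop = int(round(sample_rate / video_fps))    # audio samples per video frame
    n_frames = len(audio) // hop
    energies = np.empty(n_frames)
    for k in range(n_frames):
        frame = audio[k * hop:(k + 1) * hop]     # rectangular window W(t)
        energies[k] = np.abs(np.fft.rfft(frame)).sum()
    return energies
```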

Video Representation
Block diagram of the spatio-temporal video segmentation:
- Intra-frame processing: for each new frame, compute optical flow and mean shift color segmentation, followed by motion clustering.
- Inter-frame processing: distance computation and thresholding against existing region tracks, region similarity computation, region track updates, and image relabeling; unmatched regions become new regions that start new tracks.

Video Representation
Intra-frame processing (2-step segmentation):
- Step 1: mean shift color segmentation.
- Step 2: motion-based K-means clustering.
Compute the average optical flow image:
$\mathbf{F}(x, y, t) = \frac{1}{2}\left(\mathbf{F}^{+}(x,y,t) - \mathbf{F}^{-}(x,y,t)\right)$
Each region is represented as a 5-dimensional feature vector $(x, y, l, u, v)$, where $(x, y)$ is the spatial centroid of the image segment and $(l, u, v)$ are the segment's average LUV color values in the color-coded average optical flow image.
[Figure: input color image with forward and backward optical flow; output segmentation.]
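A sketch of the second step, assuming the mean shift segment labels and a color-coded average flow image are already available; scikit-learn's KMeans stands in for the clustering, and the choice of cluster count and all names are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def motion_cluster(segment_labels, flow_luv, n_clusters):
    """Cluster mean-shift segments using the 5-D feature (x, y, l, u, v):
    spatial centroid plus average LUV color taken from the color-coded
    average optical flow image."""
    ids = np.unique(segment_labels)
    feats = []
    for sid in ids:
        ys, xs = np.nonzero(segment_labels == sid)
        l, u, v = flow_luv[ys, xs].mean(axis=0)    # average LUV of the segment
        feats.append([xs.mean(), ys.mean(), l, u, v])
    assign = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.array(feats))
    return dict(zip(ids, assign))                  # segment id -> motion cluster
```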

Video Representation
Inter-frame processing: region representation. A region (image segment) is represented by its location attribute and its color attribute:
$Region: \{Location: (x, y),\; Color: \mathbf{h}\}$
- Location: the spatial centroid of the region.
- Color histogram $\mathbf{h} \in \mathbb{Z}^{N}$: evenly quantize the LUV color space into $N$ bins and count the number of pixels falling into each bin.
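A minimal sketch of this descriptor, assuming pixels are already converted to 8-bit LUV; the bin count and helper names are assumptions for illustration.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Region:
    centroid: tuple        # (x, y) spatial centroid
    hist: np.ndarray       # integer LUV color histogram of length N

def make_region(xs, ys, luv_pixels, bins_per_channel=4):
    """Build the {Location, Color} region descriptor from member pixels.
    luv_pixels has shape (n_pixels, 3) with 8-bit LUV values (assumed)."""
    edges = [np.linspace(0, 256, bins_per_channel + 1)] * 3
    hist, _ = np.histogramdd(luv_pixels, bins=edges)   # even quantization
    return Region(centroid=(xs.mean(), ys.mean()),
                  hist=hist.ravel().astype(int))
```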

Video Representation
Inter-frame region tracking.
Input: a set of frames $I_1, \dots, I_T$, a spatial distance threshold $D_{th}$, and a color similarity threshold $C_{th}$.
Initialization: initialize the region tracks $R_i$, $i = 1, \dots, K$, with the regions of the segmentation of frame $I_1$.
Iteration: for $t = 2, \dots, T$:
  Segment $I_t$ into regions $r_{i,t}$, $i = 1, \dots, n_t$.
  For each $r_{i,t}$:
    Set CandidateTracks = {} and add every $R_j$ with $distance(R_j, r_{i,t}) < D_{th}$ to CandidateTracks.
    If CandidateTracks is nonempty, find $k = \arg\max_j similarity(R_j, r_{i,t})$; if $similarity(R_k, r_{i,t}) > C_{th}$, add $r_{i,t}$ to $R_k$.
    Else, create a new region track and add $r_{i,t}$ to it.
Output: a number of region tracks, where each region track is a temporal sequence of regions.
- The distance is the Euclidean distance between the current region's spatial centroid and that of the region track's most recently added region.
- The similarity is the cosine of the angle between the current region's color histogram and the average color histogram of all regions in the region track.
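A sketch of this tracking loop under the stated thresholds, reusing the Region descriptor from the sketch above; segment_frame is a hypothetical stand-in for the intra-frame segmentation.

```python
import numpy as np

def track_regions(frames, segment_frame, d_th, c_th):
    """Greedy inter-frame region tracking with a spatial gate (d_th)
    and a color-histogram cosine-similarity gate (c_th)."""
    tracks = [[r] for r in segment_frame(frames[0])]   # one track per region of I_1
    for frame in frames[1:]:
        for r in segment_frame(frame):
            # Spatial gate: tracks whose most recent region is close enough.
            cands = [T for T in tracks
                     if np.hypot(T[-1].centroid[0] - r.centroid[0],
                                 T[-1].centroid[1] - r.centroid[1]) < d_th]
            if cands:
                # Cosine similarity against each track's average histogram.
                def sim(T):
                    h = np.mean([reg.hist for reg in T], axis=0)
                    return h @ r.hist / (np.linalg.norm(h) *
                                         np.linalg.norm(r.hist) + 1e-9)
                best = max(cands, key=sim)
                if sim(best) > c_th:
                    best.append(r)
                    continue
            tracks.append([r])                         # start a new region track
    return tracks
```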

Video Representation
Visual feature extraction.
- Compute the acceleration of each pixel as $\mathbf{M}(x, y, t) = \mathbf{F}^{+}(x,y,t) - (-\mathbf{F}^{-}(x,y,t))$.
- Compute the motion feature of a region $r_t^k$ as its average acceleration $m_t^k$.
- Represent a region track as a motion vector $V_k = [m_1^k, m_2^k, \cdots, m_T^k]^T$, $k = 1, 2, \cdots, K$.
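A hedged sketch of this step, assuming per-frame forward/backward flow fields and a boolean region mask per frame; zero-filling frames where the track is absent is an assumption not stated in the slides.

```python
import numpy as np

def motion_vector(track_masks, flow_fwd, flow_bwd):
    """Per-frame average acceleration magnitude for one region track.
    track_masks[t] is the region's boolean mask in frame t (or None);
    flow arrays have shape (H, W, 2)."""
    v = []
    for t, mask in enumerate(track_masks):
        if mask is None:                       # track absent in this frame
            v.append(0.0)
            continue
        accel = flow_fwd[t] + flow_bwd[t]      # M = F+ - (-F-) = F+ + F-
        mag = np.linalg.norm(accel, axis=-1)   # per-pixel acceleration magnitude
        v.append(mag[mask].mean())
    return np.array(v)                         # one entry of V_k per frame
```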

Audiovisual Correlation
Some interesting observations on discrete sound (i.e. with clear intervals of silence) versus continuous sound.
[Figure: video frames, audio signal, and visual features for a discrete-sound example and a continuous-sound example.]
We need a feature embedding technique to encode such similarity of multimodal features.

Audiovisual Correlation
Winner-Take-All (WTA) Hash: a nonlinear transformation with two parameters:
- $N$: number of random permutations.
- $S$: window size.

Audiovisual Correlation
How does WTA work? Consider $X = [A, B, C]$ with $A < C < B$, and $X' = [A', B', C']$ with $A' < C' < B'$. Then $X = X'$ in the ordinal space, which is not the case in metric spaces with distances based on numerical values.
- We use the same WTA function to embed multimodal features into the same ordinal space.
- Similarity can be computed efficiently (e.g. Hamming distance).
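An illustrative WTA hash sketch under the two parameters above; this is a generic implementation of the technique, not the authors' code, and all names are assumptions.

```python
import numpy as np

def wta_hash(x, perms, S):
    """Winner-Take-All hash: for each stored permutation, look at the
    first S permuted elements of x and record the index of the largest.
    Only ordinal (rank) information survives, so any monotonic scaling
    of x yields the same code."""
    return np.array([np.argmax(x[p[:S]]) for p in perms])

def hamming_similarity(c1, c2):
    """Fraction of hash positions that agree (1 - normalized Hamming distance)."""
    return np.mean(c1 == c2)

# Shared permutations, so audio and visual vectors of length T
# are embedded into the same ordinal space.
rng = np.random.default_rng(0)
T, N, S = 100, 64, 4                  # feature length, permutations, window size
perms = [rng.permutation(T) for _ in range(N)]
```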

Audiovisual Correlation
Audiovisual correlations: each motion vector $V_k = [m_1^k, m_2^k, \cdots, m_T^k]^T$ and the audio energy vector $A = [a_1, a_2, \cdots, a_T]^T$ are passed through the same Winner-Take-All hash function $HashFunc(\cdot)$, and the correlation $\chi_k$ is derived from $HammingDist(\cdot, \cdot)$ between the two hash codes.
The audio source object is identified by choosing the maximum $\chi_k$.
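Putting the pieces together, a hedged sketch of the final selection step, reusing wta_hash, hamming_similarity, and perms from the previous sketch.

```python
import numpy as np

def find_sounding_track(motion_vectors, audio_energy, perms, S):
    """Hash each track's motion vector V_k and the audio energy vector A
    with the same WTA function, then pick the track whose code is closest
    (maximum correlation chi_k)."""
    code_a = wta_hash(audio_energy, perms, S)
    scores = [hamming_similarity(wta_hash(v, perms, S), code_a)
              for v in motion_vectors]
    return int(np.argmax(scores)), scores    # index of the audio source track
```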

Experiments
Dataset: 5 challenging videos from Youtube and previous research (cells left blank where the source table's values are not recoverable).

Video Name      Frame rate (fps)   Resolution   Audio Spl. Freq. (kHz)   Source
Basketball      29.97              540 x 360    44.1                     Made
Student News                       640 x 360                             Youtube
Wooden Horse    24.87              480 x 384                             [1][2]
Guitar Street   25.00
Violin Yanni                       320 x 240                             [1]

[1] H. Izadinia, I. Saleemi, and M. Shah, "Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects," IEEE Transactions on Multimedia, vol. 15, no. 2, pp. 378-390, Feb. 2013.
[2] E. Kidron, Y. Y. Schechner, and M. Elad, "Cross-modal localization via sparsity," IEEE Transactions on Signal Processing, vol. 55, no. 4, pp. 1390-1404, 2007.

Experiments
Baseline method [1]:
- Spatio-temporal segmentation with K-means.
- Video features: optical flows and their 1st-order derivatives.
- Audio features: MFCCs and their 1st-order derivatives.
- CCA is used to find the maximum projection base for video.
[1] H. Izadinia, I. Saleemi, and M. Shah, "Multimodal Analysis for Identification and Segmentation of Moving-Sounding Objects," IEEE Transactions on Multimedia, vol. 15, no. 2, pp. 378-390, Feb. 2013.

Qualitative Results
Short demo on video clips.
[Video: side-by-side comparison of Ground Truth, Baseline [1], and the Proposed Method.]

Quantitative Experiments
Performance metrics.
Spatial localization:
$precision = \frac{|P \cap T|}{|P|}, \quad recall = \frac{|P \cap T|}{|T|}$
- $P$: pixels detected by the algorithm; $T$: ground-truth pixels.
Temporal localization:
$Detection\ rate = \frac{\#\ of\ successful\ detections}{Total\ number\ of\ frames}, \quad Hit\ ratio = \frac{\#\ of\ accurate\ localizations}{Total\ number\ of\ frames}$
- Successful detection: $recall > 0.5$. Accurate detection: $precision > 0.5$.
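A small sketch of these metrics over per-frame boolean pixel masks; the 0.5 thresholds follow the slide, everything else is illustrative.

```python
import numpy as np

def frame_precision_recall(pred_mask, gt_mask):
    """Pixel-level precision and recall for one frame (boolean masks)."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    precision = inter / max(pred_mask.sum(), 1)
    recall = inter / max(gt_mask.sum(), 1)
    return precision, recall

def temporal_metrics(pred_masks, gt_masks):
    """Detection rate (frames with recall > 0.5) and hit ratio
    (frames with precision > 0.5), both over the total frame count."""
    pr = [frame_precision_recall(p, g) for p, g in zip(pred_masks, gt_masks)]
    detection_rate = np.mean([r > 0.5 for _, r in pr])
    hit_ratio = np.mean([p > 0.5 for p, _ in pr])
    return detection_rate, hit_ratio
```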

Quantitative Results
Precision & Recall.
[Charts: per-video precision and recall for the baseline and the proposed method.]

Quantitative Results Precision & Recall (another view)

Quantitative Results
Hit ratio & Detection rate.
[Charts: per-video hit ratio and detection rate.]

Thank You!