Patch to the Future: Unsupervised Visual Prediction


Patch to the Future: Unsupervised Visual Prediction (CVPR 2014 Oral)

Outline
1. Introduction
2. Methodology
3. Experimental Evaluation
4. Conclusion and Future Work

1. Introduction
* Motivation: when we humans look at an image, we can not only infer what is happening at that instant but also predict what can happen next.

1. Introduction
* Visual prediction is important for two main reasons:
(1) For intelligent agents and systems, prediction is vital for decision making.
(2) More importantly, prediction requires a deep understanding of the visual world and of the complex interplay between the different elements of the scene.
* Goals of generalized visual prediction:
- What is active in the scene, i.e., what do we predict?
- How should the activity unfold?
- What does the output space of visual prediction look like?

1. Introduction
* We humans can predict not only the motion but also how appearances will change with that movement or transition. Visual prediction should therefore be richer, and even include prediction of visual appearances.
* Having a richer output space requires a richer representation and lots of data to learn the priors; the approach builds upon the recent success of mid-level discriminative elements [28].

1. Introduction
* Advantages:
- No assumption is made about what can act as an agent; a data-driven approach identifies the possible agents and their activities.
- The patch-based representation allows the models of visual prediction to be learned in a completely unsupervised manner.

2. Methodology
* Central idea: scenes are represented as a collection of mid-level elements (detected using a sliding window), where agents can either move in space or change visual appearance.
* Step 1: Model the distribution over the space of possible actions using a transition matrix, which represents how mid-level elements can move and transition into one another, and with what probability. Given the mid-level elements and their possible actions, first determine the most likely agent and the most likely action given the scene.
* Step 2: Model the interaction between the active element (agent) and its surroundings.
* Step 3: Plan toward a given goal; when no goal is given, generate candidate goals and select among them.

2. Methodology: Learning the Transitions
* We first apply the work of [28] to extract visually meaningful mid-level elements; each element can act as an agent which can move.
* We then learn a temporal model over these elements, represented as a transition matrix: element i can move, or transition into another element, with some probability.
* How do we learn these transitions? We extract pairs of frames and obtain the correspondences between the detections in the two frames using the KLT tracker [22].

2. Methodology
* We interpret each mapping as either an appearance transition or a spatial transition.
* To compensate for camera motion, these movements are computed on a stitched panorama obtained via SIFT matching [21].
* For each transition, we also normalize by the total number of observed patches, which gives the probability of transition for each mid-level patch.
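As a rough sketch of this step (not the authors' code; the (i, j) correspondence representation is an assumption), the transition matrix can be accumulated from tracked correspondences and normalized into probabilities as follows:

```python
import numpy as np

def estimate_transition_matrix(correspondences, num_elements):
    """Accumulate tracked patch correspondences into a transition matrix.

    `correspondences` is assumed to be an iterable of (i, j) pairs: a
    detection of element type i in one frame was tracked (e.g. with the
    KLT tracker) to a detection of element type j in a later frame.
    i == j corresponds to a spatial move, i != j to an appearance change.
    """
    counts = np.zeros((num_elements, num_elements))
    for i, j in correspondences:
        counts[i, j] += 1.0
    # Normalize each row by the number of observed patches of that
    # element, turning counts into transition probabilities.
    totals = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, totals, out=np.zeros_like(counts),
                     where=totals > 0)
```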


2. Methodology: Learning Contextual Information
* The actions of agents depend not only on likely transitions but also on the scene and the surroundings in which they appear.
* We model these interactions using a reward function: how likely is it that an element of type i can move to location (x, y) in the image?
* We learn a separate reward function for the interaction of each element within the scene.
* To obtain the training data for the reward function of element type i, we detect the element in the training videos and observe which segments are likely to overlap with that element in time.

2. Methodology
* We build a training set for every element in the dictionary.
* Each segment in the test image retrieves its nearest neighbors using Euclidean distance over image features.
* We choose the top-N nearest neighbors to label the high-reward areas in the image.
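A minimal sketch of this retrieval step, assuming fixed-length feature vectors per segment and 0/1 temporal-overlap labels (both representations are illustrative, not from the paper):

```python
import numpy as np

def transfer_reward(test_feats, train_feats, train_labels, n=5):
    """Label high-reward areas of a test image by nearest-neighbor transfer.

    test_feats:   (S, D) features of the S segments in the test image
    train_feats:  (T, D) features of training segments for one element type
    train_labels: (T,) array; 1 if a training segment overlapped the
                  element in time, else 0
    Returns a per-segment reward in [0, 1] from the top-n neighbors.
    """
    # Euclidean distances between every test and training segment.
    dists = np.linalg.norm(test_feats[:, None, :] - train_feats[None, :, :],
                           axis=2)
    top_n = np.argsort(dists, axis=1)[:, :n]
    return train_labels[top_n].mean(axis=1)
```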

2. Methodology: Inferring Active Entities, Planning Actions and Choosing Goals
* We can now predict what is going to happen next.
* Prediction inference requires estimating which elements in the scene are likely to be active (Kitani et al. [18] choose the active agents manually).
* We propose an automatic approach that infers the likely active agent based on the learned transition matrix.
* The basic idea is to rank cluster types by their likelihood of being spatially active: we first detect the instances of each element using sliding-window detection and then rank these instances based on contextual information.

2. Methodology
The context-score for a patch i at location (x, y) is given by:

S_i(x, y) = Σ_d P_i(d) · R_i(x + dx, y + dy)

where d is the direction of the movement, P_i(d) is the transition probability in direction d, and R_i(x + dx, y + dy) is the reward for moving the patch from (x, y) to (x + dx, y + dy). We compute the likelihood of changing location based on the transition matrix.
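Read literally, this score can be computed as in the sketch below; the variable names follow the reconstruction above and are illustrative, not from the paper:

```python
def context_score(trans_prob, reward, x, y, directions):
    """Context score of a patch of one element type at (x, y).

    trans_prob[d]   -- probability that the element moves in direction d
    reward[y2][x2]  -- learned reward for the element being at (x2, y2)
    directions      -- (dx, dy) offset for each direction index d
    """
    return sum(trans_prob[d] * reward[y + dy][x + dx]
               for d, (dx, dy) in enumerate(directions))
```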

2. Methodology
* We search for the optimal actions/transitions given a spatial goal in the scene, re-parameterizing the reward function over states.
* State: s = (x, y, i), i.e., patch i being at location (x, y).
* Each decision a is quantified by its expected reward (intuitively, the probability of the corresponding transition weighted by the reward of the resulting state).
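A sketch of the state and the reward-to-cost conversion used for planning; the negative-log conversion and the data structures are assumptions (the slides only say that rewards are converted to costs):

```python
from math import log

def expected_reward(trans_prob, reward, state, action):
    """Expected reward of applying action a = (dx, dy) in state s = (x, y, i):
    the transition probability of that move times the reward of the
    resulting location (assumed form; containers are illustrative)."""
    x, y, i = state
    dx, dy = action
    return trans_prob[i][(dx, dy)] * reward[i][y + dy][x + dx]

def action_cost(trans_prob, reward, state, action, eps=1e-9):
    # Convert reward to a nonnegative cost for shortest-path planning,
    # assuming expected rewards lie in (0, 1].
    return -log(max(expected_reward(trans_prob, reward, state, action), eps))
```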

2. Methodology
* Our goal is to find the optimal set of actions/decisions that maximize expected reward (minimize cost) and reach the goal state g; an operator applies a set of actions to a state to estimate the resulting goal state.
* We then use Dijkstra's algorithm to plan a sequence of optimal decisions from an initial state to all given goal states by converting rewards to costs. We select the best path among the different goals based on average expected reward.
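A generic sketch of this planning step: plain Dijkstra over abstract states, with `neighbors` and `cost` as placeholders for an element's possible moves and the converted rewards (e.g. `action_cost` above):

```python
import heapq

def dijkstra_plan(start, goals, neighbors, cost):
    """Plan a minimum-cost sequence of decisions from `start` to the
    cheapest reachable state in `goals` (costs must be nonnegative).
    States are assumed hashable and comparable, e.g. (x, y, i) tuples."""
    goals = set(goals)
    best = {start: 0.0}
    parent = {start: None}
    heap = [(0.0, start)]
    while heap:
        c, s = heapq.heappop(heap)
        if c > best.get(s, float("inf")):
            continue  # stale queue entry
        if s in goals:
            path = []  # reconstruct the decision sequence
            while s is not None:
                path.append(s)
                s = parent[s]
            return list(reversed(path)), c
        for s2 in neighbors(s):
            c2 = c + cost(s, s2)
            if c2 < best.get(s2, float("inf")):
                best[s2] = c2
                parent[s2] = s
                heapq.heappush(heap, (c2, s2))
    return None, float("inf")
```

Running this once per candidate goal and comparing the paths by average expected reward matches the goal-selection step described above.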

3. Experimental Evaluation
* Baselines: there are no existing algorithms for unsupervised visual prediction, so we compare against the max-entropy Inverse Optimal Control (IOC) algorithm of Kitani et al. [18] and a Nearest Neighbor baseline followed by SIFT-flow warping [20, 33].
* Datasets: a Car Chase dataset (collected from YouTube) and the VIRAT dataset [23].
* Evaluation metric: the modified Hausdorff distance (MHD) from [18], as a measure of the distance between two trajectories.
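For reference, the modified Hausdorff distance between two point trajectories (the larger of the two mean closest-point distances) can be computed as:

```python
import numpy as np

def modified_hausdorff(traj_a, traj_b):
    """Modified Hausdorff distance between trajectories given as
    (N, 2) and (M, 2) arrays of (x, y) points."""
    # Pairwise distances between all points of the two trajectories.
    d = np.linalg.norm(traj_a[:, None, :] - traj_b[None, :, :], axis=2)
    # Mean distance from each trajectory to its closest points on the
    # other, symmetrized by taking the maximum.
    return max(d.min(axis=1).mean(), d.min(axis=0).mean())
```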

3. Experimental Evaluation [results figures]

3. Experimental Evaluation: VIRAT Dataset [results figures]

4. Conclusion and Future Work
* We have presented a simple and effective framework for visual prediction on a static scene.
* This representation allows us to train the framework in a completely unsupervised manner from a large collection of videos.
* Possible future work includes modeling the simultaneous behavior of multiple elements.