Semantic Structure from Motion Paper by: Sid Yingze Bao and Silvio Savarese Presentation by: Ian Lenz

Inferring 3D
- With special hardware: range sensor, stereo camera
- Without special hardware: local features / graphical models (Make3D, etc.), structure from motion

Structure from Motion
- Obtain 3D scene structure from multiple images taken by the same camera at different locations and poses
- Typically, camera location and pose are treated as unknowns
- Track points across frames; infer camera pose and scene structure from the correspondences

Intuition

Typical Approaches
- Fit a model of 3D points + camera positions to 2D points
- Use point matches (e.g. SIFT)
- Use RANSAC or similar to fit models
- Often a complicated pipeline, e.g. "Building Rome in a Day"
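As a concrete illustration of this classical pipeline, here is a minimal sketch using OpenCV (standard tooling, not the code from this paper; the image filenames are hypothetical):

```python
import cv2
import numpy as np

# Detect SIFT keypoints and descriptors in two views.
img1 = cv2.imread("view1.jpg", cv2.IMREAD_GRAYSCALE)  # hypothetical inputs
img2 = cv2.imread("view2.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep matches that pass Lowe's ratio test.
matches = cv2.BFMatcher().knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]

# Robustly fit epipolar geometry with RANSAC; the mask flags inliers.
pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])
F, inliers = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
```

Systems like "Building Rome in a Day" chain many such stages (matching, robust fitting, incremental reconstruction, bundle adjustment), which is the complicated pipeline the slide alludes to.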

Semantic SfM

- Use semantic object labels to inform SfM
- Use SfM to inform semantic object labels
- Hopefully, improve results by modeling both together

High-level Approach
- Maximum likelihood estimation
- Given: object detection probabilities at various poses, 2D point correspondences
- Model the probability of the observed images given the inferred parameters
- Use Markov chain Monte Carlo to maximize
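Spelled out with the notation defined on the following slides, the objective is (my rendering, consistent with the slides):

(C*, Q*, O*) = argmax_{C, Q, O} Pr(o, q, u | C, Q, O)

i.e. choose the cameras, 3D points, and 3D objects that make the observed detections, 2D points, and correspondences most likely.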

Model Parameters
- C: camera parameters
- C_k: parameters for camera k, C_k = {K_k, R_k, T_k}
- K: internal camera parameters (known)
- R: camera rotation (unknown)
- T: camera translation (unknown)

Model Parameters
- q: 2D points
- q_k^i: i-th point in camera k, q_k^i = {x, y, a}_k^i
- x, y: point location
- a: visual descriptor (SIFT, etc.)
- Known

Model Parameters
- Q: 3D points
- Q_s = (X_s, Y_s, Z_s): world-frame coordinates
- Unknown
- u: point correspondences
- u_k^i = s if q_k^i corresponds to Q_s
- Known
(Figure: camera C_k observing 3D point Q_s as image point q_k^i, with u_k^i = s.)

Model Parameters
- o: camera-space object detections
- o_k^j: j-th object detection in camera k
- o_k^j = {x, y, w, h, θ, φ, c}_k^j
- x, y: 2D location
- w, h: bounding box size
- θ, φ: 3D pose
- c: class (car, person, keyboard, etc.)
- Known

Model Parameters
- O: 3D objects
- O_t = (X, Y, Z, Θ, Φ, c)_t
- Like o, but with no bounding box and with a Z (depth) coordinate
- Unknown
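To keep the notation straight, here is one way the observations and unknowns could be bundled in code (a sketch whose names mirror the slides, not the authors' released code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Camera:               # C_k
    K: np.ndarray           # 3x3 intrinsics (known)
    R: np.ndarray           # 3x3 rotation (unknown, estimated)
    T: np.ndarray           # 3-vector translation (unknown, estimated)

@dataclass
class Point2D:              # q_k^i (known)
    xy: np.ndarray          # 2D image location
    descriptor: np.ndarray  # visual descriptor, e.g. SIFT

@dataclass
class Detection2D:          # o_k^j (known)
    xy: np.ndarray          # 2D location
    wh: np.ndarray          # bounding-box width, height
    pose: tuple             # (theta, phi)
    cls: str                # object class, e.g. "car"

@dataclass
class Object3D:             # O_t (unknown, estimated)
    XYZ: np.ndarray         # world-frame position
    pose: tuple             # (Theta, Phi)
    cls: str
```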

Likelihood Function
- Assumption: points are independent of objects
- Why? It splits the likelihood and makes inference easier
- Otherwise, a complicated model of 3D object appearance would be required
- Camera parameters appear in both terms
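Under that independence assumption the likelihood factorizes into the two terms developed on the next slides; in log form (my rendering of the slide's claim):

log Pr(o, q, u | C, O, Q) = log Pr(o | C, O) + log Pr(q, u | C, Q)

The camera parameters C appear in both terms, so the camera update must account for both points and objects.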

Point Term
- Compute by measuring agreement between predicted and actual measurements
- Compute predictions by projecting 3D points into the camera
- Assume predicted and actual locations differ by Gaussian noise (sketch below)
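A minimal sketch of this term, assuming a standard pinhole projection and isotropic Gaussian noise with a free parameter sigma (both assumptions on my part):

```python
import numpy as np

def project(K, R, T, Q):
    """Project a 3D world point Q to pixels via x ~ K (R Q + T)."""
    x = K @ (R @ Q + T)
    return x[:2] / x[2]

def point_log_likelihood(K, R, T, Q, q_xy, sigma=1.0):
    """Gaussian agreement between predicted and observed 2D locations."""
    residual = project(K, R, T, Q) - q_xy
    return -np.dot(residual, residual) / (2.0 * sigma**2)
```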

Point Term (Alternative)
- Take q_k^i and q_l^j as matching points from cameras C_k and C_l
- Determine the epipolar line of q_k^i with respect to C_l
- Take the distance from q_l^j to this line
- Also consider appearance similarity between the two descriptors
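The geometric part can be sketched as a point-to-epipolar-line distance via the fundamental matrix F relating the two cameras (the slide's actual formula did not survive transcription, so this is the textbook version):

```python
import numpy as np

def epipolar_distance(F, x1, x2):
    """Distance from x2 (camera l) to the epipolar line of x1 (camera k).

    x1, x2 are homogeneous 3-vectors (x, y, 1); the line is l = F x1 =
    (a, b, c), and the point-line distance is |l . x2| / sqrt(a^2 + b^2).
    """
    line = F @ x1
    return abs(line @ x2) / np.hypot(line[0], line[1])
```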

Object Term
- Also based on agreement
- Projection is more difficult
- Recall: a 3D object is parameterized by XYZ coordinates, orientation, and class
- The 2D detection also has bounding-box parameters

Projecting a 3D Object to 2D
- Location and pose are easy using the camera parameters
- For bounding-box width and height:
- f_k: camera focal length
- W, H: mapping from the object's bounding cube to a bounding box, "learned by using ground truth 3D object bounding cubes and corresponding observations using ML regressor"
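The formula itself did not survive transcription; a plausible reading (an assumption on my part, based on standard pinhole scaling) is that the predicted box shrinks with depth:

```python
def predicted_bbox_size(f_k, Z, theta, phi, cls, W, H):
    """Predicted bounding-box size for an object at depth Z in camera k.

    W and H are the learned regressors named on the slide, mapping the
    object's bounding cube (pose theta/phi, class cls) to 2D box extent.
    The 1/Z pinhole scaling is my assumption, not quoted from the paper.
    """
    return f_k * W(theta, phi, cls) / Z, f_k * H(theta, phi, cls) / Z
```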

Object Probability
- Scale is proportional to bounding box size
- Pose and scale are highly quantized
- Stack the detection maps as a tensor, indexed by pose and scale
- The tensor is denoted Ξ (Xi)

Object Term
- The probability of an object observation is proportional to the probability of not not seeing it in each image (yes, a double negative)
- Why do it this way?
- Occlusion → probability of not seeing = 1
- So an occluded view doesn't affect the likelihood term
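The double negative is a noisy-OR across cameras; a sketch, where detection_probs holds the per-camera detection probabilities read from the Ξ tensor of the previous slide:

```python
import numpy as np

def object_likelihood(detection_probs):
    """Noisy-OR over cameras: 1 - prod_k Pr(not seen in camera k).

    For a camera where the object is occluded, Pr(not seen) = 1, so its
    factor is 1 and it leaves the product (and the likelihood) unchanged.
    """
    p = np.asarray(detection_probs)
    return 1.0 - np.prod(1.0 - p)
```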

Estimation
- We have a model; now how do we maximize it?
- Answer: Markov chain Monte Carlo (see the sketch below)
- Estimate new parameters from the current ones
- Accept depending on the ratio of new/old probability
- Two questions remain: What are the initial parameters? How do we update?
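A generic Metropolis-style loop of the kind the slides describe (a sketch, not the authors' sampler; log_prob and propose stand in for the model terms above and the update rules two slides ahead):

```python
import numpy as np

def mcmc_maximize(log_prob, propose, init, n_iters=10000, rng=None):
    """Run a Metropolis accept/reject chain, tracking the best sample."""
    rng = rng or np.random.default_rng()
    current, current_lp = init, log_prob(init)
    best, best_lp = current, current_lp
    samples = []
    for _ in range(n_iters):
        candidate = propose(current)          # e.g. Gaussian around C
        candidate_lp = log_prob(candidate)
        # Accept with probability min(1, new/old), done in log space.
        if np.log(rng.uniform()) < candidate_lp - current_lp:
            current, current_lp = candidate, candidate_lp
        samples.append(current)
        if current_lp > best_lp:
            best, best_lp = current, current_lp
    return best, samples
```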

Initialization
- Camera location/pose, two approaches:
- Point-based: use a five-point solver to compute camera parameters from five corresponding points; scale is ambiguous, so randomly pick several
- Object-based: form possible object correspondences between frames and initialize the cameras from these
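The point-based variant can be reproduced with OpenCV's five-point essential-matrix solver (a sketch; pts1/pts2 are matched pixel coordinates and K the known intrinsics):

```python
import cv2

def init_camera_from_points(pts1, pts2, K):
    """Relative pose from >= 5 correspondences via the five-point solver.

    Translation is recovered only up to scale, which is why the slides
    sample several random scales for the initialization.
    """
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    _, R, t, mask = cv2.recoverPose(E, pts1, pts2, K, mask=mask)
    return R, t  # t has unit norm: scale is ambiguous
```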

Initialization
- Object and point locations:
- Use the estimated camera parameters (previous slide)
- Back-project points and objects from 2D to 3D
- Merge objects that get mapped to similar locations
- Determine the 2D-3D correspondences (u)
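Lifting matched 2D points to 3D given the initialized cameras is standard triangulation, e.g. with OpenCV (a sketch):

```python
import cv2
import numpy as np

def triangulate(K, R1, T1, R2, T2, pts1, pts2):
    """Triangulate matched 2D points (2xN arrays) into 3D world points."""
    P1 = K @ np.hstack([R1, T1.reshape(3, 1)])       # 3x4 projection matrices
    P2 = K @ np.hstack([R2, T2.reshape(3, 1)])
    Q_h = cv2.triangulatePoints(P1, P2, pts1, pts2)  # 4xN homogeneous
    return (Q_h[:3] / Q_h[3]).T                      # Nx3 Euclidean points
```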

Update
- Order: C, O, Q (updated versions: C', O', Q')
- Pick C' with Gaussian probability around C
- Pick O' to maximize Pr(o|O',C') (within a local area of O)
- Pick Q' to maximize Pr(q,u|Q',C'), unless the alternative point term was used

Algorithm

Obtaining Results
- Intuition: MCMC visit probability is proportional to the probability function (what we're trying to maximize)
- Cluster the MCMC samples using MeanShift
- The cluster with the most samples wins
- Read out Q, O, C as the average over that cluster
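This readout step maps directly onto scikit-learn's MeanShift (a sketch; samples would be flattened parameter vectors from the chain):

```python
import numpy as np
from sklearn.cluster import MeanShift

def read_out_estimate(samples):
    """Cluster MCMC samples; return the mean of the largest cluster."""
    X = np.asarray(samples)
    ms = MeanShift().fit(X)
    labels, counts = np.unique(ms.labels_, return_counts=True)
    winner = labels[np.argmax(counts)]           # most-visited mode wins
    return X[ms.labels_ == winner].mean(axis=0)  # average within cluster
```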

Results
sfm/index.html

Results vs. Bundler
(Figures: "Cars" and "Office" sequences.)

Object Detection Results

Runtime
- 20-minute runtime for 2 images
- Results not presented for more than 4 images
- Bad scaling?
- Code released, but only a 0.1 alpha version…
- Ran Bundler on 4 images: took < 3 minutes

Questions?