Real-Time Dense 3D Reconstructions: KinectFusion (2011) and Fusion4D (2016)
Eleanor Tursman

Previous Work
- Multiview stereo: not real-time
- Monocular SLAM: RGB camera, real-time, but sparse
- Pose estimation using iterative closest point (ICP)
- Active sensors using a signed distance function (SDF) for dense reconstruction instead of meshes
- Feature-based monocular SLAM systems, e.g., ORB-SLAM

Previous Work contd.
- Dense, real-time reconstruction using SDF fusion
Gap to fill?
- Dense, real-time reconstruction using SDF fusion
- Global reconstruction instead of frame-to-frame alignment
- Independence from scene lighting, using structured light

KinectFusion Overview
- Real-time dense reconstruction of static scenes
- Works on sensors that generate depth maps quickly (30 Hz)
- Implicit surface construction. Implicit surface: the set of all points that satisfy some f(x, y, z) = 0
- Simultaneous Localization and Mapping (SLAM)
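
For a concrete example of an implicit surface (illustrative, not from the talk): a sphere of radius r is the zero level set of the signed distance function f(p) = |p| - r. A minimal sketch:

```python
import numpy as np

def sphere_sdf(p, r=1.0):
    """Signed distance from point(s) p to a sphere of radius r.
    The implicit surface is the zero level set {p : f(p) = 0};
    f < 0 inside the sphere, f > 0 outside."""
    return np.linalg.norm(p, axis=-1) - r

# A point on the surface evaluates to ~0:
print(sphere_sdf(np.array([1.0, 0.0, 0.0])))  # 0.0
```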

Pipeline
[Image from Microsoft's KinectFusion page]

Depth Map Conversion
Process the raw depth data so that the camera's rigid transform can be calculated.
- Input: raw depth map from the Kinect at some time k
- Output: vertex map and normal map in the camera coordinate system

Depth Map Conversion contd.
Details:
- Apply a bilateral filter to smooth noise while preserving sharp edges
- Calculate the vertex map and normal map using the intrinsic camera parameters (as sketched below)
- Build a 3-level pyramid representation of both maps
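
A minimal sketch of the back-projection and normal estimation steps, assuming a pinhole camera with intrinsics fx, fy, cx, cy (variable names are illustrative, not from the paper):

```python
import numpy as np

def depth_to_vertex_map(depth, fx, fy, cx, cy):
    """Back-project an (h, w) depth map into camera-space 3D points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.dstack((x, y, depth))  # (h, w, 3) vertex map

def vertex_to_normal_map(vmap):
    """Estimate per-pixel normals from cross products of neighboring vertices.
    Borders wrap here; a real implementation would mask the last row/column."""
    dx = np.roll(vmap, -1, axis=1) - vmap   # vector to right neighbor
    dy = np.roll(vmap, -1, axis=0) - vmap   # vector to bottom neighbor
    n = np.cross(dx, dy)
    norm = np.linalg.norm(n, axis=2, keepdims=True)
    return n / np.clip(norm, 1e-8, None)
```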

Volumetric Integration
Fuse the current raw depth map into the global scene's truncated signed distance function (TSDF), using the current camera pose estimate.
- Input: raw depth map, camera pose T_{g,k}
- Output: updated global TSDF

Volumetric Integration contd.
Details:
- TSDF: a discrete version of the SDF, further truncated to points within ±μ of the surface
- Parallelizable
- Nearest-neighbor lookup instead of interpolation of depth values, to avoid smearing
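
A minimal sketch of the truncation step, assuming the common convention of normalizing by μ and discarding measurements far behind the surface:

```python
import numpy as np

def truncate_sdf(eta, mu):
    """Map a projective signed distance eta to a truncated, normalized TSDF
    value in [-1, 1]. Points farther than mu in front of the surface clamp
    to +1; points deeper than mu behind it are marked invalid (NaN)."""
    tsdf = np.clip(eta / mu, -1.0, 1.0)
    valid = eta >= -mu                     # behind-surface cutoff
    return np.where(valid, tsdf, np.nan)   # NaN marks "no measurement"
```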

Volumetric Integration contd.
Details:
- Converge toward SDF fusion by taking a weighted average of the TSDFs over every depth map (as sketched below)
- Use the raw depth map, not the bilaterally filtered map
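
A minimal sketch of the per-voxel weighted running average, in the style of Curless and Levoy; all names are illustrative:

```python
import numpy as np

def update_tsdf(tsdf, weight, new_tsdf, new_weight, max_weight=100.0):
    """Running weighted average of per-voxel TSDF values.
    tsdf, weight: accumulated volume; new_tsdf, new_weight: from the
    current depth map, already transformed by the camera pose."""
    w_sum = weight + new_weight
    fused = (tsdf * weight + new_tsdf * new_weight) / np.maximum(w_sum, 1e-8)
    return fused, np.minimum(w_sum, max_weight)  # cap weights so the model can still adapt
```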

Ray-casting
Render a surface prediction, given by the zero level set of the global TSDF, using the previous camera pose estimate.
- Input: global TSDF, rigid transform of the previous frame T_{g,k−1}
- Output: predicted vertex and normal maps

Ray-casting contd.
Details:
- Raycast the TSDF (the current world volume) to render the 3D volume (as sketched below)
- Use ray skipping to accelerate the process
- Use a simplified version of cubic interpolation to predict the vertex and normal maps
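
A minimal sketch of marching a single ray to the zero crossing; tsdf_sample is an assumed trilinear-sampling callable, and the constant step stands in for the paper's ray skipping:

```python
import numpy as np

def raycast_tsdf(tsdf_sample, origin, direction, t_max, step):
    """March one ray through the TSDF and find the zero crossing (the surface).
    tsdf_sample(p) returns the TSDF value at world point p."""
    t, prev = 0.0, None
    while t < t_max:
        val = tsdf_sample(origin + t * direction)
        if prev is not None and prev > 0 >= val:  # sign change: surface crossed
            # Linearly interpolate between the two samples for sub-step accuracy.
            t_hit = t - step * val / (val - prev)
            return origin + t_hit * direction
        prev = val
        t += step
    return None  # ray left the volume without hitting a surface
```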

Camera Tracking
Calculate the world camera position using ICP.
- Input: predicted vertex and normal maps, rigid transform of the previous frame T_{g,k−1}
- Output: rigid transform matrix T_{g,k}
Details:
- Use all of the data for pose estimation
- Assume small motion between frames
- Parallelizable processing

Camera Tracking contd.
Details:
- Align the current surface measurement maps with the predicted surface maps
- Do this with 3D iterative closest point (ICP), as sketched below
- Take closest points as initial correspondences, then iteratively improve the result until convergence
- Outliers from ICP are tracked
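
A minimal sketch of one linearized point-to-plane ICP step under the small-motion assumption; correspondences are assumed already established (e.g., by projective data association), and this is not the paper's GPU implementation:

```python
import numpy as np

def point_to_plane_icp_step(src_pts, dst_pts, dst_normals):
    """One linearized point-to-plane ICP step. Inputs are pre-associated
    Nx3 arrays. For each correspondence we want n . (R s + t - d) = 0;
    with the small-angle approximation R = I + [w]x this becomes linear
    in x = [w; t], since n . (w x s) = (s x n) . w."""
    A = np.hstack((np.cross(src_pts, dst_normals), dst_normals))   # N x 6
    b = np.einsum('ij,ij->i', dst_normals, dst_pts - src_pts)      # N
    x, *_ = np.linalg.lstsq(A, b, rcond=None)
    return x[:3], x[3:]  # incremental rotation (axis-angle) and translation
```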

Pipeline Recap
[Image from Microsoft's KinectFusion page]

Results
- Tracking and mapping run in constant time
- Qualitative results show it is superior to frame-to-frame matching
- No expensive explicit global optimization needed
- Video: https://youtu.be/quGhaggn3cQ?t=1m30s

Demo! Can we break it? LET’S FIND OUT.

Assumptions and Limitations
- The scene is static (no sustained motion)
- The scene is fairly contained
- Initialization determines accuracy
- If the currently observed scene does not constrain all degrees of freedom of the camera, solving for the camera pose can yield many arbitrary solutions
- Currently, if tracking fails, the user is asked to align the camera with the last known good position

Previous Work
- Offline multi-camera reconstructions
- Real-time parametric (template-based) reconstruction
- Real-time reconstruction of non-static scenes with one depth camera
- Fusing new frames bit by bit into one world model (like KinectFusion)

Previous Work contd.
Gap to fill?
- Templates make it hard to model features that may differ drastically from the template (clothing, a smaller person, a wider person, etc.)
- Don't assume small motion between frames
- Long runs of template-free systems accrue drift and smooth out high-frequency information

Fusion4D Overview
- Builds off of KinectFusion (overlap in authors, too)
- Real-time dense reconstruction
- Temporal coherence
- Resistant to topology changes and large motion between frames
- No templates or a priori assumptions about the scene
- Multi-view setup with many RGB-D cameras

Pipeline

Input and Pre-processing
- RGB images provide texture for the reconstruction
- RGB-D frames from the IR cameras use PatchMatch Stereo for depth calculation
- Depth maps are segmented into ROIs to keep the foreground distinct throughout the pipeline
- A dense Conditional Random Field (machine learning) is used for neighborhood smoothing

Correspondence
A dense correspondence field initializes the non-rigid alignment, using the Global Patch Collider (decision-tree-based machine learning).
- Input: two consecutive image frames
- Output: an energy term that encourages matched pixels and their corresponding key volume points to line up

Correspondence contd.
Details:
- Decision trees as in HyperDepth, but normalized by a depth term to give scale invariance
- Goal: find correspondences between pixel positions in two consecutive frames at the patch level
- Take the union of the trees, then use voting to minimize false positives (each tree votes for a match, as sketched below)
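
A minimal sketch of the voting idea, assuming a hypothetical per-tree matcher interface (tree.match is illustrative, not the Global Patch Collider API):

```python
from collections import Counter

def vote_for_match(trees, patch_a, candidate_patches_b):
    """Each tree proposes a match for patch_a among the candidates; a
    correspondence is kept only if a majority of trees agree, which
    suppresses false positives."""
    votes = Counter()
    for tree in trees:
        cand = tree.match(patch_a, candidate_patches_b)  # hypothetical interface
        if cand is not None:
            votes[cand] += 1
    if not votes:
        return None
    best, n = votes.most_common(1)[0]
    return best if n > len(trees) // 2 else None  # require a majority
```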

Non-rigid Alignment: Embedded Deformation Graph
The key volume is a TSDF. How do we calculate a deformation field to warp this key volume to new raw depth maps?
- Input: key volume
- Output: functions that warp local areas around ED nodes to the raw depth maps

Non-rigid Alignment: Embedded Deformation Graph contd.
Details:
- Use the Embedded Deformation (ED) model
- Uniformly sample k ED nodes in the key volume's representative mesh
- Warp the key volume vertices and normals off the mesh, in each ED node's region, using a local affine transformation and translation plus a global rotation and translation (see the sketch below)
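
A minimal sketch of the ED warp for one point; the per-node blending weights and the graph construction are assumed given, and variable names are illustrative:

```python
import numpy as np

def ed_warp(v, nodes, affines, translations, weights, R_global, t_global):
    """Warp one 3D point v with an Embedded Deformation graph.
    nodes: (K, 3) node positions g_k; affines: (K, 3, 3) local affine
    matrices A_k; translations: (K, 3) local offsets t_k; weights: K
    blending weights for this point (summing to 1)."""
    local = np.zeros(3)
    for g_k, A_k, t_k, w_k in zip(nodes, affines, translations, weights):
        # Each node deforms its neighborhood: A_k (v - g_k) + g_k + t_k
        local += w_k * (A_k @ (v - g_k) + g_k + t_k)
    return R_global @ local + t_global  # then apply the global rigid motion
```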

Non-rigid Alignment: Alignment Error
An energy function constrains the allowed deformations of the key volume and best aligns the model to the raw data.
- Input: warp function from key volume to raw depth data, raw depth data, correspondence energy term
- Output: the transformation from key volume to raw depth data that minimizes the energy function

Non-rigid Alignment: Alignment Error contd.
Terms of the energy function: a data term fitting the warped model to the depth maps, a visual hull term, the learned correspondence term from the Global Patch Collider, a rotation penalty keeping each node's affine transformation close to a rotation, and a smoothness regularizer between neighboring ED nodes:

E(G) = λ_data E_data(G) + λ_hull E_hull(G) + λ_corr E_corr(G) + λ_rot E_rot(G) + λ_smooth E_smooth(G)

Non-rigid Alignment: Alignment Error contd.
Details:
- Energy function minimization is a nonlinear least-squares problem
- Fix the ED nodes' affine transformations and translations, then use ICP to approximate the global motion parameters (rotation and translation)
- Accept a candidate step h only if E(X + h) < E(X), where X is the full parameter vector and h is a small step
- Use preconditioned conjugate gradient (PCG) to iteratively solve the linearized system of equations (see the sketch below)
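
A minimal sketch of the outer minimization loop implied by these details; E and solve_normal_equations stand in for the paper's energy and the PCG solve of the linearized system (both names are illustrative):

```python
def minimize_energy(E, solve_normal_equations, x0, max_iters=10):
    """Damped Gauss-Newton-style outer loop. solve_normal_equations(x)
    returns a step h, e.g., from PCG on the linearized normal equations
    J^T J h = -J^T r at the current parameters x."""
    x = x0
    for _ in range(max_iters):
        h = solve_normal_equations(x)
        if E(x + h) < E(x):   # accept the step only if the energy decreases
            x = x + h
        else:
            break             # no improvement: stop (or shrink the step)
    return x
```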

Volumetric Fusion and Blending
Accumulated data improves the TSDF model. Maintain both a key volume and a data volume.
- Input: the best transformation of the key volume (the energy minimizer), data volume
- Output: reconstructed TSDF

Volumetric Fusion and Blending contd.
Details:
- Data volume: the volume at the current frame, made from the fused depth maps
- Key volume: an instance of the reference volume

Volumetric Fusion and Blending contd.
Details:
- Selective fusion handles two failure cases: collision and misalignment
- Fusion and blending steps (sketched below):
  1. Fuse the depth maps to make the data volume
  2. Warp the key volume into the data frame
  3. Blend the data volume and the warped key volume together
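
A minimal sketch of the per-voxel blend, assuming a precomputed boolean mask of colliding/misaligned voxels where the warped key volume is rejected (the mask construction is not shown, and all names are illustrative):

```python
import numpy as np

def blend_volumes(data_tsdf, data_w, key_tsdf, key_w, rejected):
    """Weighted per-voxel blend of the data volume and the warped key
    volume. Where 'rejected' is True (collision or misalignment), the
    key volume contributes nothing and the raw data is kept."""
    key_w_eff = np.where(rejected, 0.0, key_w)
    w = data_w + key_w_eff
    blended = (data_tsdf * data_w + key_tsdf * key_w_eff) / np.maximum(w, 1e-8)
    return blended, w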

Pipeline Recap

Results
- Not limited to human subjects, since there are no prior assumptions
- Non-rigid registration (the energy function calculations) takes the most processing time (64% of the ~20 ms frame budget)
- Correspondence results are better than the SIFT detector, FAST detector, etc.
- Video: https://youtu.be/2dkcJ1YhYw4?t=4m11s

Assumptions and Limitations
- Non-rigid alignment errors lead to an overly smooth model
- Segmentation errors make the visual hull estimate incorrect, so noise is fused into the model
- Reliance on temporal coherence: the framerate can't be low, and motion between frames can't be too big

Similarities
- Both are real-time dense reconstruction algorithms
- Both use volumetric fusion from Curless and Levoy (1996): accumulated depth data is used to improve the current world model
- Both use a TSDF to represent their world surfaces
- Both use ICP to estimate the camera's global rotation and translation parameters
- Both make it possible to use depth images to obtain a view of the world

Differences

KinectFusion          | Fusion4D
----------------------|--------------------------
Rigid reconstruction  | Non-rigid reconstruction
No keyframes          | Key volumes
Static scenes         | Non-static scenes

Future Work
- Non-rigid matching algorithms designed specifically for topology changes
- 3D streaming of live concerts
- Reconstruction of larger, more complex scenes