Multi-View Stereo for Community Photo Collections


Multi-View Stereo for Community Photo Collections Michael Goesele, Noah Snavely, Brian Curless, Hugues Hoppe, Steven M. Seitz

Photos vary substantially in lighting, foreground clutter, and scale, due to different cameras, capture times, and weather.

Images of Notre Dame (a variation in sampling rate of more than 1,000)

Images taken in the wild show wide variety:
- Many photographers
- Different cameras
- Different sampling rates
- Occlusion
- Different times of day and weather
- Post-processing

The problem statement:
- Design an adaptive view selection process: given the massive number of images, find a compatible subset for each reference view
- Multi-View Stereo (MVS): reconstruct robust and accurate depth maps from this subset

Previous work:
- Global view selection: assumes a relatively uniform viewpoint distribution and simply chooses the k nearest images for each reference view
- Local view selection: uses shiftable windows in time to adaptively choose frames to match

Community photo collections (CPCs) are non-uniformly distributed in 7D viewpoint space (translation, rotation, focal length) and represent an extreme case of unorganized image sets. Algorithm overview:
- Calibrating Internet photos
- Global view selection
- Local view selection
- Multi-view stereo reconstruction

Calibrating Internet Photos
PTLens extracts camera and lens information and corrects for radial distortion based on a database of camera and lens properties. Images that cannot be corrected are discarded. The remaining images are entered into a robust, metric structure-from-motion (SfM) system (using the SIFT feature detector), which
- generates a sparse scene reconstruction from the matched features
- records, for each feature, the list of images in which it was detected
Removing radiometric distortions: all input images are mapped into a linear radiometric space (assuming the sRGB color space).

Global View Selection
For each reference view R, global view selection seeks a set N of neighboring views that are good candidates for stereo matching in terms of scene content, appearance, and scale. SIFT selects features with similar appearance, but two problems remain:
- Shared feature points may be collocated rather than spread over the scene
- SIFT's scale invariance hides scale differences that hurt stereo matching
We need a scoring measure that handles both problems.

A global score g_R is computed for each view V within a candidate neighborhood N (which includes R):
- F_V: the set of feature points observed in view V
- F_V ∩ F_R: the feature points common to views V and R
- w_N(f): measures the angular separation of pairs of views at f; the larger, the more separated in angle
- w_s(f): measures the similarity in scale of the two views at f; the larger, the more similar in scale
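Assembling the terms the slide lists, the score from the paper sums the two weights over the features shared with R:

    g_R(V) = \sum_{f \in F_V \cap F_R} w_N(f) \, w_s(f)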

Calculating w_N(f)
α is the angle between the lines of sight from V_i and V_j to f; α_max is set to 10 degrees.
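In the paper, w_N(f) multiplies a per-pair angular weight over the pairs of views in the candidate neighborhood, so that a small baseline anywhere in N is penalized:

    w_N(f) = \prod_{V_i, V_j \in N} w_\alpha(f, V_i, V_j), \qquad
    w_\alpha(f, V_i, V_j) = \min\!\left( (\alpha / \alpha_{\max})^2, \, 1 \right)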

Calculating w_s(f)
r = s_R(f) / s_V(f), where s_V(f) is the diameter of a sphere centered at f whose projected diameter in view V equals the pixel spacing in V (s_R(f) is defined analogously for R). The weight favors the case 1 ≤ r < 2, i.e., views of equal or slightly lower resolution than R.
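The piecewise form from the paper:

    w_s(f) = \begin{cases} 2/r & r \ge 2 \\ 1 & 1 \le r < 2 \\ r & r < 1 \end{cases}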

Sum the scores over all shared feature points for each view V and greedily select the top-scoring views to form N. Rescaling views:
- Find the lowest-resolution view V_min in N. If scale_R(V_min) is smaller than a threshold of 0.6 (i.e., a 5×5 window in R corresponds to roughly a 3×3 window in V_min), resample R to match V_min.
- Resample any view whose scale_R(V) > 1.2 to match the scale of R.
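To make the selection procedure concrete, here is a minimal Python sketch of greedy global view selection. All names are hypothetical; `angle` and `scale_ratio` are assumed callbacks supplying the geometric quantities defined above. The greedy loop rescores the candidates each round because w_N depends on the views already chosen:

```python
import itertools

ALPHA_MAX = 10.0  # degrees; pairs below this separation are penalized

def w_alpha(alpha_deg):
    # Per-pair angular weight: quadratic penalty for small baselines.
    return min((alpha_deg / ALPHA_MAX) ** 2, 1.0)

def w_s(r):
    # Scale-similarity weight; favors 1 <= r < 2.
    return 2.0 / r if r >= 2.0 else (1.0 if r >= 1.0 else r)

def global_score(V, N, features, angle, scale_ratio, R):
    """g_R(V): sum over features shared with R of w_N(f) * w_s(f)."""
    views = N | {V}
    score = 0.0
    for f in features[V] & features[R]:  # features: view -> set of ids
        w_N = 1.0
        for Vi, Vj in itertools.combinations(views, 2):
            w_N *= w_alpha(angle(f, Vi, Vj))
        score += w_N * w_s(scale_ratio(f, V))
    return score

def select_neighbors(R, candidates, features, angle, scale_ratio, k=10):
    """Grow N one view at a time, rescoring every round."""
    N, pool = {R}, set(candidates)
    while len(N) - 1 < k and pool:
        best = max(pool, key=lambda V: global_score(
            V, N, features, angle, scale_ratio, R))
        N.add(best)
        pool.remove(best)
    return N - {R}
```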

Local View Selection
Global view selection determines a set N of good matching candidates for a reference view R. Local view selection then picks a smaller set A ⊂ N (|A| = 4) of active views for stereo matching at each particular location in the reference view.

Stereo Matching
Use an n×n window centered on a point in R. Goal: maximize the photometric consistency of this patch with its projections into the neighboring views. Two components:
- Scene geometry model
- Photometric model

Scene Geometry Model
For a window centered at pixel (s, t):
- o_R is the center of projection of view R
- r_R(s, t) is the normalized ray direction through the pixel
- the pixel corresponds to a scene point x_R(s, t) at depth h(s, t) along the viewing ray r_R(s, t)
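These definitions combine directly into the point equation:

    x_R(s, t) = o_R + h(s, t) \, r_R(s, t)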

Photometric Model
A simple model for reflectance effects: a color scale factor c_k for each patch projected into the k-th neighboring view.
- Models Lambertian reflectance under illumination that is constant over a planar surface
- Fails at shadow boundaries, caustics, specular highlights, and bumpy surfaces
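As a minimal sketch of how such a scale factor can be used (not the paper's exact solver, and all names are hypothetical), the least-squares optimal c_k per color channel has a closed form, and the residual after scaling measures photometric consistency:

```python
import numpy as np

def best_color_scale(patch_R, patch_k):
    """Per-channel least-squares scale c_k minimizing
    the sum over pixels of (I_R - c_k * I_k)^2."""
    num = (patch_R * patch_k).sum(axis=(0, 1))        # patches: (n, n, 3)
    den = (patch_k * patch_k).sum(axis=(0, 1)) + 1e-12
    return num / den                                   # shape (3,)

def photometric_cost(patch_R, patch_k):
    """Residual after the best per-channel scaling; low = consistent."""
    c = best_color_scale(patch_R, patch_k)
    return float(((patch_R - c * patch_k) ** 2).sum())
```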

Results
Several Internet CPCs gathered from Flickr, varying widely in size, number of photographers, and scale.

Output

Video demo

Thank you!

Reconstructing Building Interiors from Images Yasutaka Furukawa Brian Curless Steven M. Seitz University of Washington, Seattle, USA Richard Szeliski Microsoft Research, Redmond, USA

Reconstruction & Visualization of Architectural Scenes
Manual (semi-automatic):
- Google Earth & Virtual Earth
- Façade [Debevec et al., 1996]
- CityEngine [Müller et al., 2006, 2007]
Automatic:
- Ground-level images [Cornelis et al., 2008; Pollefeys et al., 2008]
- Aerial images [Zebedin et al., 2008]
(Example figures: Google Earth, Virtual Earth, Müller et al., Zebedin et al.)


Reconstruction & Visualization of Architectural Scenes
Little attention has been paid to indoor scenes.

Our Goal
A fully automatic system for indoor and outdoor scenes that
- reconstructs a simple 3D model from images
- provides real-time interactive visualization

Challenges - Reconstruction
Multi-view stereo (MVS) typically produces a dense model. We want the model to be:
- Simple, for real-time interactive visualization of a large scene (e.g., a whole house)
- Accurate, for high-quality image-based rendering

Challenges - Indoor Reconstruction
- Texture-poor surfaces: hard for MVS
- Complicated visibility: occlusions corrupt the depth maps
- Prevalence of thin structures (doors, walls, tables)

Outline
- System pipeline (system contribution)
- Algorithmic details (technical contribution)
- Experimental results
- Conclusion and future work

System pipeline: Images

System pipeline: Structure-from-Motion
Bundler by Noah Snavely: structure from motion for unordered image collections
http://phototour.cs.washington.edu/bundler/
(Pipeline so far: Images → SfM)

System pipeline: Multi-view Stereo
PMVS by Yasutaka Furukawa and Jean Ponce: patch-based multi-view stereo software
http://grail.cs.washington.edu/software/pmvs/
(Pipeline so far: Images → SfM → MVS)

System pipeline: Manhattan-world Stereo
Manhattan-world Stereo [Furukawa et al., CVPR 2009]
(Pipeline so far: Images → SfM → MVS → MWS)


System pipeline: Axis-aligned depth-map merging (our contribution)
(Pipeline so far: Images → SfM → MVS → MWS → Merging)

System pipeline: Rendering via simple view-dependent texture mapping
(Full pipeline: Images → SfM → MVS → MWS → Merging → Rendering)

Outline
- System pipeline (system contribution)
- Algorithmic details (technical contribution)
- Experimental results
- Conclusion and future work

Axis-aligned Depth-map Merging
The basic framework is similar to volumetric MRF approaches [Vogiatzis 2005; Sinha 2007; Zach 2007; Hernández 2007]. We first explain the algorithm, then the differences from competing approaches.


Key Feature 1 - Penalty terms
In typical volumetric MRF formulations, the binary penalty encodes both smoothness and data, and the unary term is often a constant (an inflation term). In our formulation, the binary term is pure smoothness, defined over neighboring voxels taking the same label, and the unary term encodes the data.
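For reference, a sketch of the generic binary-MRF energy such formulations minimize over voxel labels (the standard form, not the paper's exact terms), with the smoothness written as a Potts penalty on neighboring voxels:

    E(\ell) = \sum_{v} U_v(\ell_v) + \lambda \sum_{(u, v) \in \mathcal{N}} [\ell_u \neq \ell_v],
    \qquad \ell_v \in \{\text{interior}, \text{exterior}\}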

Axis-aligned Depth-map Merging
- Align the voxel grid with the dominant (Manhattan) axes
- Data term (unary)
- Smoothness term (binary)
- Solve with graph cuts (see the sketch below)
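A minimal sketch of the standard min-cut construction for such a binary voxel labeling, assuming per-voxel interior/exterior costs have already been derived from the merged depth maps; all names are hypothetical, and a generic max-flow solver stands in for the paper's implementation:

```python
import numpy as np
import networkx as nx

def label_voxels(unary_in, unary_out, smooth=1.0):
    """Binary MRF via min-cut: t-links carry the unary costs,
    n-links carry a Potts smoothness penalty between 6-neighbors.
    unary_in/unary_out: (X, Y, Z) cost arrays. Returns a boolean
    grid where True = interior."""
    shape = unary_in.shape
    G = nx.DiGraph()
    S, T = "source", "sink"
    for p in np.ndindex(shape):
        G.add_edge(S, p, capacity=float(unary_out[p]))  # cut if exterior
        G.add_edge(p, T, capacity=float(unary_in[p]))   # cut if interior
        for d in ((1, 0, 0), (0, 1, 0), (0, 0, 1)):
            q = tuple(pi + di for pi, di in zip(p, d))
            if all(qi < si for qi, si in zip(q, shape)):
                G.add_edge(p, q, capacity=smooth)
                G.add_edge(q, p, capacity=smooth)
    _, (source_side, _) = nx.minimum_cut(G, S, T)
    labels = np.zeros(shape, dtype=bool)
    for p in np.ndindex(shape):
        labels[p] = p in source_side  # source side = interior
    return labels
```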

Outline
- System pipeline (system contribution)
- Algorithmic details (technical contribution)
- Experimental results
- Conclusion and future work

Results:
- kitchen: 22 images, 1,364 triangles
- hall: 97 images, 3,344 triangles
- house: 148 images, 8,196 triangles
- gallery: 492 images, 8,302 triangles

Demo

Conclusion & Future Work
- A fully automated 3D reconstruction and visualization system for architectural scenes
- Novel depth-map merging that produces a piecewise-planar, axis-aligned model with sub-voxel accuracy
Future work:
- Relax the Manhattan-world assumption
- Larger scenes (e.g., a whole building)

Any Questions?
(Pipeline recap: Images → SfM → MVS → MWS → Merging)

KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera

[0] KinectFusion: Real-time 3D Reconstruction and Interaction Using a Moving Depth Camera, Microsoft Research

A) Depth Map Conversion
Reduce noise (e.g., with a bilateral filter [1]) and back-project through the inferred camera intrinsic matrix to get point-cloud positions in camera coordinates.
[1] C. Tomasi and R. Manduchi, "Bilateral Filtering for Gray and Color Images", Sixth International Conference on Computer Vision, pp. 839-846, New Delhi, India, 1998.
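A minimal sketch of the back-projection step, assuming a standard pinhole intrinsic matrix K (all names hypothetical):

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a depth map into camera-space 3D points.
    depth: (h, w) array of depths along the optical axis;
    K: 3x3 pinhole intrinsic matrix."""
    h, w = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)  # (h, w, 3)
```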

B) Camera Tracking (ICP)
[2] Zhang, Zhengyou (1994). "Iterative Point Matching for Registration of Free-Form Curves and Surfaces". International Journal of Computer Vision (Springer).
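A minimal sketch of one classic point-to-point ICP iteration. KinectFusion itself uses point-to-plane ICP with projective data association; this simpler variant only illustrates the idea, and all names are hypothetical:

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_step(src, dst):
    """One ICP iteration: nearest-neighbor association, then the
    closed-form rigid transform (Kabsch/SVD). src, dst: (N, 3)."""
    _, nn = cKDTree(dst).query(src)      # associate each src point
    matched = dst[nn]
    mu_s, mu_d = src.mean(0), matched.mean(0)
    H = (src - mu_s).T @ (matched - mu_d)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:             # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_d - R @ mu_s                  # src -> dst rigid transform
    return R, t
```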

C) Volumetric Integration
Signed distance field: divide the world into voxels; each voxel stores the signed distance to the nearest surface. (2D example figure.)
[3] B. Curless and M. Levoy. A Volumetric Method for Building Complex Models from Range Images. ACM Trans. Graph., 1996.
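The per-voxel fusion rule from Curless and Levoy is a weighted running average of truncated signed-distance observations; a minimal sketch:

```python
def tsdf_update(D, W, d_obs, w_obs=1.0):
    """Fuse a new truncated signed-distance observation d_obs into a
    voxel holding distance D and weight W (weighted running average)."""
    D_new = (W * D + w_obs * d_obs) / (W + w_obs)
    return D_new, W + w_obs
```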

(3D example figure.)

D) Ray Casting
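A minimal sketch of ray casting against the fused signed distance field: march along each viewing ray until the field changes sign, then interpolate the zero crossing. `tsdf_fn` is an assumed trilinear sampler, and the fixed step size is a simplification:

```python
def raycast(tsdf_fn, origin, direction, t_max=5.0, step=0.01):
    """Return the first surface point along the ray, or None."""
    t = step
    prev = tsdf_fn(origin + t * direction)
    while t < t_max:
        t += step
        d = tsdf_fn(origin + t * direction)
        if prev > 0 and d < 0:                 # sign change: surface hit
            t_hit = t - step * d / (d - prev)  # linear zero crossing
            return origin + t_hit * direction
        prev = d
    return None
```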

Demonstration

Building Rome in a Day Paper Summary

Outcome
A system that can reconstruct 3D geometry from large, unorganized collections of photographs, using new distributed computer vision algorithms for image matching and 3D reconstruction. The algorithms are designed to:
- maximize parallelism at each stage of the pipeline
- scale well with the size of the problem
- scale well with the amount of available computation

Challenges
- Images collected from photo-sharing websites
- Images are unstructured: taken in no specific order, with no control over the distribution of camera viewpoints
- Images are uncalibrated: different photographers, different cameras, little knowledge of the camera settings for each image
- Scale of the project: 2-3 orders of magnitude larger than prior methods
- Algorithms must be fast to complete the reconstruction in one day

Applications
- Government sector: urban planning and visualization with city models
- Academic disciplines: history, archeology, geography
- Consumer mapping technology: Google Earth, GPS navigation systems, online map sites

Recover 3D Geometry
Given scene geometry and camera geometry, we can predict where the 2D projection of each point should be in each image, and compare these projections to the original measurements.
- Scene geometry is represented as 3D points
- Camera geometry is represented as a 3D position and orientation for each camera
Projection equation (in camera coordinates): (x, y, z) → (x/z, y/z)

Correspondence Problem
Definition: automatically estimate 2D correspondences between input images.
- Detect the most distinctive, repeatable features in each image
- Match features across image pairs by finding similar-looking features with an approximate nearest-neighbor search (see the sketch below):
  - For each pair of images, insert the features of one image into a k-d tree
  - Use the features of the second image as queries; for each query, if the nearest neighbor is sufficiently far from the second-nearest neighbor, declare a match
- Clean up matches: rigid scenes impose strong geometric constraints on the locations of matching features, captured by a 3×3 fundamental matrix F such that corresponding points x_ij, x_ik from images j and k satisfy x_ik^T F x_ij = 0
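A minimal sketch of the k-d-tree matching with a distance-ratio test (the exact ratio threshold is an assumption, and the paper uses approximate rather than exact nearest neighbors):

```python
import numpy as np
from scipy.spatial import cKDTree

def match_features(desc_a, desc_b, ratio=0.8):
    """Match descriptor sets (N, 128) -> list of (index_a, index_b).
    A match is declared only when the nearest neighbor is clearly
    closer than the second nearest (distance-ratio test)."""
    d, nn = cKDTree(desc_b).query(desc_a, k=2)
    good = d[:, 0] < ratio * d[:, 1]
    return [(i, int(nn[i, 0])) for i in np.flatnonzero(good)]
```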

City Scale Matching
Goal: find correspondences spanning the entire collection. Solve as a graph estimation problem over a "match graph":
- Graph vertices = images
- A graph edge exists between two vertices iff they view the same part of the scene and have a sufficient number of feature matches
Multi-round scheme: in each round, propose a set of edges in the match graph (via whole-image similarity or query expansion), then verify each edge through feature matching.

City Scale Matching: Whole Image Similarity
Used for first-round edge proposal; a metric for the overall similarity of two images:
- Cluster features into visual words, weighted by the term frequency-inverse document frequency (TF-IDF) method
- Apply document retrieval algorithms: each photo is represented as a sparse histogram of visual words, and histograms are compared by taking their inner product (see the sketch below)
- For each image, determine the k1 + k2 most similar images and verify the top k1
- Result: a sparsely connected match graph. To minimize the number of connected components, for each image consider the next k2 images and verify pairs that straddle different connected components
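A minimal sketch of the whole-image similarity computation; all names are hypothetical, the histograms are dense arrays for simplicity (the paper uses sparse representations), and the IDF weights are assumed precomputed:

```python
import numpy as np

def tfidf_similarity(hists, idf):
    """Pairwise cosine similarity of TF-IDF-weighted visual-word
    histograms. hists: (num_images, vocab_size) word counts;
    idf: (vocab_size,) inverse-document-frequency weights."""
    X = hists * idf                                   # TF-IDF weighting
    X /= np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    sim = X @ X.T                                     # inner products
    np.fill_diagonal(sim, 0.0)                        # ignore self-matches
    return sim

def propose_edges(sim, k1, k2):
    """For each image, propose its k1 + k2 most similar images as
    candidate match-graph edges (verified later by feature matching)."""
    order = np.argsort(-sim, axis=1)
    return [(i, int(j)) for i in range(sim.shape[0])
            for j in order[i, :k1 + k2]]
```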

City Scale Matching: Query Expansion
The result of the first round is a sparse match graph, insufficiently dense to produce a good reconstruction. Query expansion finds all vertices within two steps of a query vertex: if vertices i and k are both connected to j, propose that i and k are also connected, then verify edge (i, k). A sketch follows below.
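A minimal sketch of the edge-proposal step, assuming the match graph is stored as an adjacency dict (names hypothetical):

```python
def query_expansion(adj):
    """Propose edges between vertices two steps apart in the match
    graph. adj: dict mapping vertex -> set of neighbor vertices.
    Proposed edges still need verification by feature matching."""
    proposals = set()
    for j, nbrs in adj.items():
        for i in nbrs:
            for k in nbrs:
                if i < k and k not in adj[i]:
                    proposals.add((i, k))
    return proposals
```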

City Scale Matching: Implementation
Three phases: pre-processing, verification, and track generation. The system runs on a cluster of computers ("nodes"); a "master node" makes job-scheduling decisions.

Implementation: Pre-processing
- Images are distributed to cluster nodes in chunks of fixed size
- Each node down-samples its images to a fixed size
- Each node extracts features

Implementation: Verification
- Use whole-image similarity for the first two rounds, query expansion for the remaining rounds
- Schedule verification jobs with a greedy bin-packing algorithm (a bin = the set of jobs sent to one node)
- Drawback: requires multiple sweeps over the remaining image pairs
- Solution: consider only a fixed-size subset of image pairs for scheduling

Implementation: Track Generation
Definition: a track is a group of features corresponding to a single 3D point. Combine all pairwise matching information to generate consistent tracks across images. This is solved by finding connected components in a graph (sketch below):
- Vertices = features in images
- Edges = connect matching features
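A minimal sketch of track generation as connected components via union-find; the globally unique feature ids are an assumed encoding:

```python
def build_tracks(matches, num_features):
    """matches: iterable of (feature_id_a, feature_id_b) pairs with
    ids unique across all images. Returns root id -> member ids."""
    parent = list(range(num_features))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matches:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb                # union the two components

    tracks = {}
    for f in range(num_features):
        tracks.setdefault(find(f), []).append(f)
    return tracks
```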

Recover Camera Poses
- Find and reconstruct the skeletal set: a minimal subset of photographs capturing the essential geometry of the scene
- Add the remaining images by estimating each camera's pose with respect to known 3D points matched to the image

Multiview Stereo
Estimate a depth for every pixel in every image and merge the resulting 3D points into a single model. Since the scale exceeds the capability of MVS algorithms, photos are grouped into clusters that each reconstruct part of the scene.

Results