Multi-View Stereo for Community Photo Collections


Multi-View Stereo for Community Photo Collections. Michael Goesele (TU Darmstadt), Noah Snavely, Brian Curless, Steven M. Seitz (University of Washington), Hugues Hoppe (Microsoft Research). http://grail.cs.washington.edu/projects/mvscpc/

Community Photo Collections (CPCs): large, unstructured image sets from photo-sharing sites such as Flickr and Google.

But there are some complications… Images taken in the wild show wide variety: lots of photographers, different cameras, different sampling rates, occlusion, different times of day and weather, post-processing.

Trevi Fountain Notre Dame

The problem statement: design and analyze an adaptive view selection process that, given the massive number of images, finds a compatible subset of images for each reference view; then use multi-view stereo (MVS) to reconstruct robust and accurate depth maps from that image set.

Previous Work (related to MVS). Global view selection: assume a relatively uniform viewpoint distribution and simply choose the k nearest images for each reference view. Local view selection: use shiftable windows in time to adaptively choose frames to match.

CPC datasets are more challenging. Traditional multi-view stereo inputs have far less appearance variation and somewhat regular viewpoint distributions (e.g., photographs regularly spaced around an object, or video streams with spatiotemporal coherence). CPCs are non-uniformly distributed in a 7D viewpoint space (3D translation, 3D rotation, focal length) and represent an extreme case of unorganized image sets.

Algorithm Overview Calibrating Internet Photos View Selection - Global View Selection - Local View Selection - Multi-View Stereo Reconstruction

Calibrating Internet Photos: remove radial distortion from the images, eliminating any image that cannot be corrected. The remaining images are entered into a robust, metric structure-from-motion (SfM) system (using the SIFT feature detector), which generates a sparse scene reconstruction from the matched features along with, for each feature, the list of images in which it was detected.

Remove radiometric distortions: map all input images into a linear radiometric space (assuming the sRGB color space).

View Selection: Global View Selection. For each reference view R, global view selection seeks a set N of neighboring views that are good candidates for stereo matching in terms of scene content, appearance, and scale. SIFT selects features with similar appearance, and shared feature points indicate views of the same scene content (the co-location problem); however, SIFT's scale invariance means matched views may differ in scale too much for stereo matching.

Compute a global score gR for each view V within a candidate neighborhood N (which includes R): gR(V) = Σ_{f ∈ FV ∩ FR} wN(f) · ws(f), where FX is the set of feature points observed in view X, wN(f) measures angular separation, and ws(f) measures similarity in scale.

Calculating wN(f): wN(f) = Π_{Vi, Vj ∈ N ∪ {R}} wα(f, Vi, Vj), where wα(f, Vi, Vj) = min((α/αmax)², 1) and α is the angle between the lines of sight from Vi and Vj to f (αmax is set to 10 degrees).
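As a minimal sketch of this angular weighting (the function names and the degrees convention are mine; the product form over view pairs follows the slide's definition):

```python
def w_alpha(alpha_deg, alpha_max=10.0):
    # Quadratic penalty on small triangulation angles, saturating at 1
    # once the angle reaches alpha_max (10 degrees on the slide).
    return min((alpha_deg / alpha_max) ** 2, 1.0)

def w_N(pairwise_angles_deg):
    # w_N(f): product of w_alpha over all pairs of views in N ∪ {R}
    # that observe feature f; one angle per view pair.
    w = 1.0
    for a in pairwise_angles_deg:
        w *= w_alpha(a)
    return w
```

A feature triangulated under a 5-degree angle contributes weight 0.25, while any pair separated by 10 degrees or more contributes the full weight of 1.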

Calculating wS(f): sV(f) is the diameter of a sphere centered at f whose projected diameter in V equals the pixel spacing; sR(f) is defined analogously for R. The ratio r = sR(f)/sV(f) favors views with equal or higher resolution than the reference view.

Sum the scores over all feature points for each view V and select the set N maximizing Σ_{V ∈ N} gR(V), using a greedy iterative approach. Rescaling views: find the lowest-resolution view Vmin; if it is substantially lower resolution than R, resample R, and resample the higher-resolution images accordingly.
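The greedy selection above can be sketched as follows (a hypothetical helper signature, not the paper's code; the score callback may depend on the views already chosen, which is what makes the procedure greedy rather than a one-shot top-k):

```python
def select_global_views(R, candidates, score, k=10):
    # Greedily grow the neighborhood N: at each step add the candidate
    # view that maximizes the scoring function given the current N.
    N = []
    remaining = list(candidates)
    while remaining and len(N) < k:
        best = max(remaining, key=lambda V: score(R, V, N))
        N.append(best)
        remaining.remove(best)
    return N
```

With a score that ignores N this degenerates to picking the top-k candidates outright.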


Multi-View Stereo Reconstruction has two parts: region growing and stereo matching. Region growing: a successfully matched depth sample provides a good initial estimate of depth, normal, and matching confidence for the neighboring pixel locations in R. A priority queue Q holds the candidate points, initialized with the SfM features in R along with features from each V projected into R. Stereo matching is run on each point and Q is updated.

Stereo Matching: consider an n×n window centered on a point in R. Goal: maximize the photometric consistency of this patch with its projections into the neighboring views. Two ingredients: a scene geometry model and a photometric model.

Scene Geometry Model: consider the window centered at pixel (s, t); oR is the center of projection of view R, and rR(s, t) is the normalized ray direction through the pixel. The pixel in the reference view corresponds to a point xR(s, t) = oR + h(s, t) · rR(s, t) at distance h(s, t) along the viewing ray.
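The back-projection step is a one-liner worth making concrete (a sketch with my own function name, representing points as coordinate tuples):

```python
def x_R(o_R, r_R, h):
    # x_R(s, t) = o_R + h(s, t) * r_R(s, t): the 3D point at depth h
    # along the normalized viewing ray through pixel (s, t).
    return tuple(o + h * r for o, r in zip(o_R, r_R))
```

For a camera at the origin looking down the z-axis, depth h simply slides the point along z.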

Now consider the neighboring pixels, with i, j = −(n−1)/2, …, (n−1)/2. Determine the corresponding locations in a neighboring view k with sub-pixel accuracy using that view's projection: Pk(xR(s+i, t+j)).

Photometric Model: a simple model for reflectance effects, namely a color scale factor ck for each patch projected into the k-th neighboring view. This models Lambertian reflectance under illumination that is constant over planar surfaces; it fails at shadow boundaries, caustics, specular highlights, and bumpy surfaces.

MPGC Matching with Outlier Rejection (MPGC: multi-photo geometrically constrained least-squares matching). Relate the pixel intensities within a patch in R to the intensities in the k-th neighboring view using the previous models, with i, j = −(n−1)/2, …, (n−1)/2 and k = 1, …, m, where m = |A|, omitting the pixel coordinates (s, t) and substituting.

For a 3-channel color image this equation represents 3 equations, one per channel. Considering all pixels in the window and all neighboring views, we have 3n²m equations in 3 + 3m unknowns: h, hs, ht, and the per-view color scales ck. This overdetermined nonlinear system is solved with the standard MPGC approach (linearize the equations).
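The equation/unknown bookkeeping can be checked with a tiny helper (my own naming, directly transcribing the counts on the slide):

```python
def mpgc_system_size(n, m):
    # 3 color channels × n² window pixels × m neighboring views equations;
    # unknowns: depth h, its gradients h_s and h_t, plus a 3-channel
    # color scale c_k for each of the m neighboring views.
    equations = 3 * n * n * m
    unknowns = 3 + 3 * m
    return equations, unknowns
```

For example, a 3×3 window matched against 4 neighbors gives 108 equations in 15 unknowns, comfortably overdetermined.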

Iteration generally converges quickly, but problems can occur: slow convergence, oscillation, or convergence to a wrong answer. Allow 5 iterations for the system to settle. On each subsequent iteration, compute the NCC (normalized cross-correlation) score between view-pair patches and reject all views with an NCC score below κ = 0.4 (a typical value). The maximum number of iterations is 20 (otherwise the match fails). If the iteration converges, compute the average NCC score C over all n² points in the patch and use it to decide how to update the depth, normal, and confidence maps and Q.
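The NCC score used for this outlier test is standard; a minimal sketch over flattened patches (my own helper, not the paper's code):

```python
def ncc(a, b):
    # Normalized cross-correlation between two equally sized patches,
    # given as flat lists of intensities: subtract each patch's mean,
    # then divide the dot product by the product of the norms.
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    da = [x - ma for x in a]
    db = [y - mb for y in b]
    num = sum(x * y for x, y in zip(da, db))
    den = (sum(x * x for x in da) * sum(y * y for y in db)) ** 0.5
    return num / den if den else 0.0
```

NCC ranges from -1 (anti-correlated) to 1 (identical up to gain and offset); views scoring below κ = 0.4 would be rejected.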

…continuing with Local View Selection. We now have a candidate V that could be added to A. Check for a useful range of parallax: during global view selection we used α to avoid images with small triangulation angles; we now use a parameter γ (γmax = 10 degrees) to reduce the weight of those images that are nearly coplanar (a weighting scheme similar to the previous case).

If a view has a sufficiently high NCC score and satisfies the parallax constraint, add it to A. Repeat the process, selecting from the remaining non-rejected views, until either the set A reaches the desired size or no non-rejected views remain.

Results: MVS reconstructions for several Internet CPCs gathered from Flickr, varying widely in size, number of photographers, and scale.

Output

Reconstruction for Pisa Duomo

How well does it perform? Comparing the model with ground truth (a laser-scanned model): 90% of the reconstructed samples are within 0.128 m of the laser scan of this 51 m high building.

Conclusion: a multi-view stereo algorithm capable of computing high-quality reconstructions of a wide range of scenes from large, shared, multi-user photo collections available on the Internet. With the explosion of imagery available online, this capability opens up the exciting possibility of computing accurate geometric models of the world's sites, cities, and landscapes.

Reconstructing Building Interiors from Images Yasutaka Furukawa, Brian Curless, Steven M. Seitz, Richard Szeliski http://grail.cs.washington.edu/projects/interior/

Motivation (or: why another MVS algorithm for interiors?). Challenges: (1) texture-poor surfaces, (2) visibility reasoning, (3) scalability.

Incomplete 3D Reconstruction of the kitchen data set

3D Reconstruction Pipeline. Input: oriented points, each with a 3D location, a surface normal, and a set of visible images Vi = {I1, I2, …}.

Manhattan-World Stereo. Yasutaka Furukawa, Brian Curless, Steven M. Seitz. http://grail.cs.washington.edu/projects/manhattan/ Output: a depth map for each individual image. Step 1 (Input: oriented points): 3D location, surface normal, set of visible images Vi = {I1, I2, …}.

Step 2 (Dominant Axes Extraction): build a histogram of the surface normals from Step 1 and declare the largest bins as the dominant axes (only approximately orthogonal). We now have the three dominant directions d1, d2, and d3.

Step 3 (Generating Hypothesis Planes): for each k = 1, 2, 3, calculate the offsets of all points Pi from Step 1 along dk, then apply mean-shift clustering to identify peak-density areas. Output: a set of hypothesis planes.

Mean shift clustering. Data set X = {x1, x2, …, xn}. For each x in X: 1. Calculate the mean m(x) = Σi K(x − xi) xi / Σi K(x − xi), where K is a Gaussian kernel. 2. Shift x to m(x). 3. Repeat steps 1 and 2 until x converges to m(x). D. Comaniciu and P. Meer. Mean shift: a robust approach toward feature space analysis. IEEE PAMI, 24(5):603–619, May 2002.
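Since the plane offsets being clustered here are scalars, the procedure reduces to a short 1D loop (a sketch under my own naming; bandwidth and tolerance values are illustrative, not from the paper):

```python
import math

def mean_shift_1d(x, data, bandwidth=1.0, iters=100, tol=1e-6):
    # One mean-shift trajectory with a Gaussian kernel: repeatedly move x
    # to the kernel-weighted mean m(x) of the data until it converges.
    for _ in range(iters):
        w = [math.exp(-((x - d) / bandwidth) ** 2) for d in data]
        m = sum(wi * di for wi, di in zip(w, data)) / sum(w)
        if abs(m - x) < tol:
            break
        x = m
    return x
```

Running every starting point to convergence and merging the nearby fixed points yields the peak-density offsets, i.e. the hypothesis planes.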

Step 4 (final): label each pixel with a hypothesis plane. The labeling problem has a data term and a smoothness term.

Data term: penalizes hypothesis planes that conflict with the visibility of the reconstructed points.

Smoothness term: we make use of priors (image evidence) here, namely edges.

The MRF model is solved using graph cuts, leaving us with a depth map for each input image. 1. Boykov, Y., Kolmogorov, V. An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on PAMI, 26(9):1124–1137, 2004. 2. Graph cut matching in computer vision, Toby Collins.

Volumetric Reconstruction: a labeling problem again. We label each voxel as either 'interior' or 'exterior', with a data term and a smoothness term. Data term: we make each depth map vote for each voxel.

Notation: pv is the depth sample along the ray through voxel v. If v is behind pv, it is labeled interior; if v is in front of pv, exterior. We ask each depth map about a given voxel; the depth map may vote the voxel interior, exterior, or empty (no visibility). The majority wins. Corner case: what if a depth map votes the voxel empty? We take it as a vote for exterior.
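The voting rule, including the empty-counts-as-exterior corner case, fits in a few lines (a sketch with my own naming; ties resolve to exterior here, a choice the slide does not specify):

```python
def vote_voxel(votes):
    # votes: one label per depth map for this voxel, each
    # 'interior', 'exterior', or 'empty' (no visibility).
    # 'empty' is treated as a vote for 'exterior'; majority wins.
    interior = votes.count('interior')
    exterior = votes.count('exterior') + votes.count('empty')
    return 'interior' if interior > exterior else 'exterior'
```

Applying this per voxel over the whole grid produces the interior/exterior labeling that the smoothness term then regularizes.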

Smoothness term: why no priors? Because there is nothing left to encode. Note that the energy minimum is not unique.

Now we have labeled each voxel as interior or exterior. Delaunay triangulation: Delaunay triangulations maximize the minimum angle over all the angles of the triangles in the triangulation; they tend to avoid skinny triangles.

1. Identify boundaries between exterior and interior voxel regions. 2. Identify the connected components in the voxel grid. 3. For each connected component in a vertical slice, perform a Delaunay triangulation. We obtain a mesh model from this.

A list of optimizations before we get the 3D model: 1. Grid pruning: remove some grid slices; we want grid slices to pass through "grid pixels". 2. Ground-plane determination: the system has very little idea of the ground, so take the bottom-most horizontal slice with a lot of "grid pixels", or the bottom-most slice with a count greater than the average. 3. Boundary filling.

Building Rome in a Day Sameer Agarwal, Noah Snavely, Ian Simon, Steven M. Seitz, Richard Szeliski http://grail.cs.washington.edu/rome/

City-scale 3D reconstruction: the search term "Rome" on flickr.com returns more than two million photographs. Goal: construct a 3D Rome from unstructured photo collections.

The Cluster: images are available on a central store and are distributed to the cluster nodes on demand in chunks of fixed size.

Overall System Design

Preprocessing and feature extraction: downsample the images, convert to grayscale, and extract features using SIFT.

Image Matching: pairwise matching of all images is extremely expensive (Rome dataset, 100,000 images: 11.5 days). Instead, use a multi-stage matching scheme, where each stage consists of a proposal step and a verification step.

Vocabulary Tree Proposals: represent each image as a bag of visual words; obtain a term-frequency (TF) vector for each image and document frequencies (DF) over the image corpus. The per-node TF-IDF matrix is broadcast to the cluster; each node calculates TF-IDF products (between its own images and the rest of the corpus), and the top-scoring k1 + k2 images are identified as match proposals.
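The TF-IDF scoring can be sketched as follows (my own simplified helpers, treating visual words as hashable tokens; the real system works on quantized SIFT descriptors and sparse matrices):

```python
import math
from collections import Counter

def tfidf_vector(words, df, n_docs):
    # TF-IDF for one image: term frequency of each visual word times
    # log inverse document frequency over the corpus.
    tf = Counter(words)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf if w in df}

def match_score(v1, v2):
    # Dot product of two sparse TF-IDF vectors; the highest-scoring
    # image pairs become match proposals for geometric verification.
    return sum(v1[w] * v2.get(w, 0.0) for w in v1)
```

Words that appear in every image get an IDF of zero and thus contribute nothing, which is exactly why common, uninformative visual words do not flood the proposals.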

Verification and detailed matching: images are distributed over the cluster, which is a problem when the two images of a proposed pair reside on different nodes. Three approaches were tried (the master node knows which images must be verified): 1. optimize network transfers before any verification; 2. over-partition the image set into small sets and send them to nodes on demand; 3. greedy bin packing. Each image pair is then verified at a node.

Match Graph: a graph on the set of images, with edges connecting two images if matching features were found between them. We want the fewest possible connected components in this graph, and we use the vocabulary tree for this.

Track Generation: combine all the pairwise matching information to generate consistent tracks across images. The matches are gathered at the master node and broadcast. Simultaneously, the 2D coordinates and pixel color of the feature points are stored for later use in 3D rendering. Finally, the tracks are refined.
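Merging pairwise matches into tracks amounts to finding connected components in the graph of (image, feature) observations, which a union-find structure does neatly (a sketch under my own naming; the real system does this distributed):

```python
class UnionFind:
    # Minimal union-find with path halving; each (image, feature)
    # observation is a node, and matched observations are merged.
    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        self.parent[self.find(a)] = self.find(b)

def build_tracks(pairwise_matches):
    # pairwise_matches: iterable of ((img_i, feat_i), (img_j, feat_j)).
    # Returns the connected components: one track per 3D scene point.
    uf = UnionFind()
    for a, b in pairwise_matches:
        uf.union(a, b)
    tracks = {}
    for obs in list(uf.parent):
        tracks.setdefault(uf.find(obs), []).append(obs)
    return list(tracks.values())
```

Track refinement would then drop inconsistent components, e.g. tracks containing two different features from the same image.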

Geometric Estimation: Internet photo sets are redundant, so incremental reconstruction would be very slow. Instead, form a minimal skeletal set of photos such that connectivity is preserved.

Distributed Computing Engine: choices of file system; memory/speed trade-offs; whether to support multiple platforms (Unix/Windows); caching; scheduling (MapReduce).