High-Quality Video View Interpolation


High-Quality Video View Interpolation. Larry Zitnick, Interactive Visual Media Group, Microsoft Research.

3D video: a spectrum of representations from image centric to geometry centric (figure from Kang, Szeliski, and Anandan's ICIP paper). At the image-centric end are the light field, lumigraph, layered depth image, and sprites with depth; at the geometry-centric end are view-dependent geometry, view-dependent texture, and fixed geometry, with rendering ranging correspondingly from interpolation through warping to polygon rendering with texture mapping. We will be developing an imaging model that captures this spectrum and permits easy use of all these techniques: a common platform that accommodates the entire spectrum gives us the flexibility to use each technique it contains and the efficiency to mix representations without a performance penalty.

Current practice in free-viewpoint video: either many cameras, or motion jitter as the view jumps between a sparse set of cameras.

Video view interpolation: fewer cameras and smooth motion, with automatic processing and real-time rendering.

System overview. Offline: video capture → stereo → representation → compression → file. Online: selective decompression → render. The video capture and processing are done offline; processing consists of stereo computation and data compression. The dynamic scene can then be viewed interactively by selectively decompressing the data file and rendering it.

Capture hardware: cameras, concentrators, hard disks, and a controlling laptop. Our video capture system consists of 8 cameras, each with a resolution of 1024×768, capturing at 15 frames per second. Each group of 4 cameras is synchronized by a device called a concentrator, which pipes all the uncompressed video data to a bank of hard disks via a fiber-optic cable. The two concentrators are themselves synchronized and are controlled by a single laptop.

Calibration: the cameras are calibrated using the method of Zhengyou Zhang (2000).

Input videos

System overview (stereo stage). Offline: video capture → stereo → representation → compression → file. Online: selective decompression → render.

Key to view interpolation: geometry. Stereo between image 1 (camera 1) and image 2 (camera 2) recovers the geometry needed to render the virtual camera's view.

Image correspondence. For a textured region such as the leg, the correct match between image 1 and image 2 gives a good match score while incorrect matches score badly; for a low-texture region such as the wall, incorrect matches can score as well as the correct one, so the correspondence is ambiguous.
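
To make the point concrete, here is a small numpy sketch (illustrative only, not part of the system): an SSD match score along a 1-D scanline has a sharp minimum at the true disparity in a textured region but is nearly flat, and therefore ambiguous, in a textureless one. The synthetic signals, window size, and disparity range are all assumptions.

```python
import numpy as np

def ssd_scores(left, right, x, win, max_disp):
    """SSD match score of the patch around column x in `left`
    against candidate positions x - d in `right`."""
    patch = left[x - win : x + win + 1]
    scores = []
    for d in range(max_disp):
        cand = right[x - d - win : x - d + win + 1]
        scores.append(np.sum((patch - cand) ** 2))
    return np.array(scores)

rng = np.random.default_rng(0)
x_axis = np.arange(200, dtype=float)
tex_signal = np.sin(0.5 * x_axis)      # textured region ("leg")
flat_signal = np.ones(200)             # textureless region ("wall")
true_disp = 7

def make_pair(signal):
    # left/right views differ by a shift plus independent sensor noise
    left = signal + 0.05 * rng.standard_normal(signal.size)
    right = np.roll(signal, -true_disp) + 0.05 * rng.standard_normal(signal.size)
    return left, right

left_tex, right_tex = make_pair(tex_signal)
left_flat, right_flat = make_pair(flat_signal)

s_tex = ssd_scores(left_tex, right_tex, x=100, win=5, max_disp=20)
s_flat = ssd_scores(left_flat, right_flat, x=100, win=5, max_disp=20)

print("textured: best disparity", int(np.argmin(s_tex)))   # near the true value of 7
print("flat:     best disparity", int(np.argmin(s_flat)))  # essentially arbitrary
```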

Why segments? Better delineation of boundaries.

Why segments? Larger support for matching: they handle gain and offset differences without a global model (Kim, Kolmogorov, and Zabih, 2003).

Why segments? More efficient: 786,432 pixels (1024×768) vs. roughly 1,000 segments, since disparities are computed per segment rather than per pixel.

Segmentation. Many methods will work: graph-based (Felzenszwalb and Huttenlocher, 2004), mean shift (Comaniciu et al., 2001), min-cut (Boykov et al., 2001), and others.

Segmentation: important properties. Not too large, not too small: segments should be as large as possible while not spanning multiple objects.

Segmentation: important properties. Regions should be stable.

Segmentation: our approach. First average (anisotropic smoothing), then segment.
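
As an illustration of the smoothing step, here is a Perona-Malik anisotropic diffusion sketch; this is one common form of anisotropic smoothing, and the iteration count, conductance parameter, and the assumption of a single-channel image in [0, 1] are placeholders rather than the system's exact settings.

```python
import numpy as np

def anisotropic_smooth(img, iters=10, kappa=0.05, lam=0.2):
    """Perona-Malik diffusion: smooth inside regions while preserving
    strong edges, so the subsequent segmentation produces stable,
    boundary-aligned regions. `kappa` assumes intensities roughly in [0, 1]."""
    u = img.astype(float).copy()
    g = lambda d: np.exp(-(d / kappa) ** 2)   # conductance falls off at strong gradients
    for _ in range(iters):
        # finite differences toward the four neighbors
        dn = np.roll(u, -1, axis=0) - u
        ds = np.roll(u,  1, axis=0) - u
        de = np.roll(u, -1, axis=1) - u
        dw = np.roll(u,  1, axis=1) - u
        u = u + lam * (g(dn) * dn + g(ds) * ds + g(de) * de + g(dw) * dw)
    return u
```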

Segmentation: result (close-up).

Matching segments. Many measures will work: SSD, normalized correlation, mutual information; the right choice depends on color balancing and image quality.
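
For reference, minimal definitions of the two simplest measures listed here, assuming the two segments' pixels have already been placed in correspondence and flattened into arrays:

```python
import numpy as np

def ssd(a, b):
    """Sum of squared differences; sensitive to gain and offset changes."""
    return float(np.sum((a.astype(float) - b.astype(float)) ** 2))

def normalized_correlation(a, b):
    """Zero-mean normalized cross-correlation in [-1, 1];
    invariant to per-segment gain and offset."""
    a = a.astype(float) - a.mean()
    b = b.astype(float) - b.mean()
    denom = np.sqrt(np.sum(a * a) * np.sum(b * b))
    return float(np.sum(a * b) / denom) if denom > 0 else 0.0
```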

Matching segments: important properties. Never remove correct matches; remove as many false matches as possible; use global methods to remove the remaining false positives.

Matching segments: our approach. Create a gain histogram: a good match concentrates the per-pixel gain ratios in a narrow band (roughly 0.8 to 1.25), while a bad match spreads them out.
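
A hedged sketch of the gain-histogram idea: histogram the per-pixel gain ratios between the two candidate segments and score the match by how concentrated the ratios are (roughly the 0.8 to 1.25 band shown on the slide). The bin layout and the fraction-in-band score are assumptions, not the paper's exact formulation.

```python
import numpy as np

def gain_match_score(seg_a, seg_b, lo=0.8, hi=1.25, bins=20, eps=1e-6):
    """Score a candidate segment correspondence by how concentrated the
    per-pixel gain ratios seg_a / seg_b are.

    Returns (score, hist): the fraction of pixels whose gain falls in
    [lo, hi] (near 1.0 suggests a good match) and the histogram itself.
    """
    gains = (seg_a.astype(float) + eps) / (seg_b.astype(float) + eps)
    hist, _ = np.histogram(gains, bins=bins, range=(0.5, 2.0))
    inside = np.sum((gains >= lo) & (gains <= hi))
    return inside / gains.size, hist
```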

Local matching. After local matching between image 1 and image 2, low-texture regions remain ambiguous.

Global regularization. Create an MRF (Markov Random Field) over each image's segments (A–F in image 1, P–U in image 2): each segment is a node, and the number of states equals the number of depth levels.

Global regularization. The posterior over disparities given the images combines a likelihood (data term) with a prior (regularization term).

Global regularization. Neighboring segments with similar colors should have similar depths: color_A ≈ color_B → z_A ≈ z_B.

Global regularization. The prior between neighboring segments is a normal distribution on their disparity difference, whose variance depends on the percentage of shared border and the similarity of their colors.
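
Putting the last few slides together, a compact sketch of the energy being minimized: each segment is a node with a discrete disparity label, the data term is a matching cost, and neighboring segments are tied together by a Gaussian prior whose strength grows with shared border length and color similarity. The functional forms and constants below are assumptions.

```python
import numpy as np

def smoothness_weight(border_frac, color_dist, sigma_c=10.0):
    """Neighbors that share a long border and have similar mean color are
    expected to have similar disparity (a tighter Gaussian prior)."""
    return border_frac * np.exp(-(color_dist ** 2) / (2 * sigma_c ** 2))

def mrf_energy(labels, data_cost, neighbors, weights, sigma_d=1.0):
    """labels[i]      : disparity level assigned to segment i
       data_cost[i,l] : negative log likelihood of segment i at level l
       neighbors      : list of (i, j) segment adjacencies
       weights[(i,j)] : smoothness weight from smoothness_weight()"""
    e = sum(data_cost[i, labels[i]] for i in range(len(labels)))
    for (i, j) in neighbors:
        diff = labels[i] - labels[j]
        e += weights[(i, j)] * (diff ** 2) / (2 * sigma_d ** 2)
    return e
```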

Multiple disparity maps Compute a disparity map for each image. We want the disparity maps to be consistent across images…

Consistent disparities. Segment A in image 1 projects onto segments P, Q, and S in image 2, so z_A ≈ z_P, z_Q, z_S.

Consistent disparities. Each segment's disparity depends on the disparities of its corresponding segments in the other images: the likelihood term includes those neighboring disparities.

Consistent disparities. Use the original data term if the segment is not occluded; when it is occluded, bias its disparity to lie behind the known occluding surface.

Is the segment occluded? Project the segment into the other image I_i and test whether it falls behind an already-known surface (occluded) or not (not occluded).

If occluded: the segment's disparity is constrained to lie behind the occluding surface in image I_i.
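
A hedged sketch of the occlusion-aware data term described on the preceding slides; the squared penalties, sigma values, and the way the cross-view consistency term is folded in are assumptions rather than the paper's exact equations.

```python
def consistent_data_cost(d, match_cost, neighbor_disp, occluded,
                         occluder_disp=0.0, sigma_m=1.0, sigma_o=1.0):
    """Data cost for assigning disparity d to a segment in one image.

    match_cost(d)  : original photoconsistency cost at disparity d
    neighbor_disp  : disparities of the corresponding segments in the
                     other images (consistency: z_A ~ z_P, z_Q, z_S, ...)
    occluded       : True if the segment projects behind another surface
    occluder_disp  : disparity of the occluding surface (used when occluded)
    """
    if not occluded:
        cost = match_cost(d)
    else:
        # When occluded, the photoconsistency score is meaningless; instead
        # bias the segment to lie behind (at lower disparity than) the occluder.
        cost = (max(d - occluder_disp, 0.0) ** 2) / (2 * sigma_o ** 2)
    # Encourage agreement with corresponding segments in the other views.
    for nd in neighbor_disp:
        cost += ((d - nd) ** 2) / (2 * sigma_m ** 2)
    return cost
```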

Iteratively solve MRF
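
The original system iterates its own update over per-segment disparity distributions; as an illustrative stand-in, the following sketch minimizes the same kind of energy with iterated conditional modes (ICM), repeatedly re-labeling each segment given its neighbors.

```python
import numpy as np

def icm(data_cost, neighbors, weights, n_iters=10, sigma_d=1.0):
    """Iterated conditional modes: repeatedly set each segment to the
    disparity level that minimizes its local (data + smoothness) cost,
    holding all other segments fixed.

    data_cost : (n_segments, n_levels) array of data costs
    neighbors : list of (i, j) adjacencies
    weights   : dict mapping (i, j) to a smoothness weight
    """
    n_seg, n_levels = data_cost.shape
    labels = np.argmin(data_cost, axis=1)          # data-only initialization
    adj = {i: [] for i in range(n_seg)}
    for (i, j) in neighbors:
        adj[i].append(j)
        adj[j].append(i)
    levels = np.arange(n_levels)
    for _ in range(n_iters):
        for i in range(n_seg):
            cost = data_cost[i].copy()
            for j in adj[i]:
                w = weights.get((i, j), weights.get((j, i), 0.0))
                cost += w * (levels - labels[j]) ** 2 / (2 * sigma_d ** 2)
            labels[i] = int(np.argmin(cost))
    return labels
```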

Depth through time

Matting. Interpolated views without matting show artifacts along depth discontinuities. Within a strip of fixed width around each discontinuity, foreground and background surface colors, depths, and alpha are recovered with Bayesian matting (Chuang et al., 2001).
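
Mechanically, the strip can be found by marking pixels with large depth gradients and dilating by the strip width; Bayesian matting (Chuang et al., 2001) is then run only inside that band. A small sketch under those assumptions (the gradient threshold and strip width are placeholders):

```python
import numpy as np
from scipy.ndimage import binary_dilation

def matting_strip(depth, grad_thresh=4.0, strip_width=4):
    """Boolean mask of the narrow band around depth discontinuities where
    foreground/background colors, depths, and alpha are estimated."""
    gy, gx = np.gradient(depth.astype(float))
    edges = np.hypot(gx, gy) > grad_thresh
    return binary_dilation(edges, iterations=strip_width)
```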

Rendering with matting: comparison of the interpolated view without matting and with matting.

System overview (representation stage). Offline: video capture → stereo → representation → compression → file. Online: selective decompression → render.

Representation. Two layers per camera: a boundary layer (color, depth, alpha) covering a strip of fixed width around depth discontinuities and holding the foreground information there, and a main layer (color, depth) covering the rest of the image, including the background behind the strip.
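
A minimal sketch of what the two-layer representation stores per camera and frame; the field names are illustrative, and the actual on-disk format is the compressed stream described next:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MainLayer:
    color: np.ndarray   # H x W x 3, full-frame color
    depth: np.ndarray   # H x W, per-pixel disparity/depth

@dataclass
class BoundaryLayer:
    # Only a narrow strip around depth discontinuities carries data.
    color: np.ndarray   # H x W x 3, foreground color inside the strip
    depth: np.ndarray   # H x W, foreground depth inside the strip
    alpha: np.ndarray   # H x W, matte (0 = main layer, 1 = boundary layer)

@dataclass
class CameraFrame:
    main: MainLayer
    boundary: BoundaryLayer
```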

System overview (compression stage). Offline: video capture → stereo → representation → compression → file. Online: selective decompression → render.

Compression. The data is organized by camera (cameras 1–4 shown) and by time (t = 0, 1). Compression is used to reduce the large data set to a manageable size and to allow fast playback from disk. We developed our own codec to make use of both temporal and between-camera redundancy. Temporal prediction compresses the reference camera's data in terms of previously decoded results from an earlier frame time. Spatial prediction uses the reference camera's disparity map to transform its texture and disparity data into the viewpoint of a spatially adjacent camera. The differences between predicted and actual images are coded using a novel transform-based compression scheme that simultaneously handles texture, disparity, and alpha-map data. To obtain real-time interactivity, the overall decoding scheme is highly optimized for speed and makes use of the GPU where possible.

Compression: temporal prediction. Each reference camera's frame at time t = 1 is predicted from its own previously decoded frame at t = 0.

Compression: spatial prediction. The remaining cameras at each time instant are predicted from a spatially adjacent reference camera.

Spatial prediction. The reference camera's depth and texture are used to predict the adjacent camera's view.

Spatial prediction. The reference camera's depth and texture are warped into the predicted camera's viewpoint.

Spatial prediction. The warped depth and texture are subtracted from the predicted camera's actual data, and the resulting error signal is coded.

Spatial prediction. At the decoder, the warped depth and texture plus the decoded error signal reconstruct the predicted camera's frame.
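
A schematic sketch of spatial prediction as described above, using a purely horizontal disparity warp as a simplified stand-in for the full projective warp between calibrated cameras; the forward-warping scheme and residual coding below are assumptions, not the codec itself.

```python
import numpy as np

def forward_warp(color, disparity):
    """Forward-warp a reference image into an adjacent camera using its
    per-pixel disparity (horizontal shift only, as a simplification)."""
    h, w = disparity.shape
    warped = np.zeros_like(color)
    filled = np.zeros((h, w), dtype=bool)
    for y in range(h):
        for x in range(w):
            xt = x + int(round(disparity[y, x]))
            if 0 <= xt < w:
                warped[y, xt] = color[y, x]
                filled[y, xt] = True
    return warped, filled

def encode_predicted(actual, reference_color, reference_disp):
    """Spatial prediction: transmit only the residual between the actual
    view and the reference view warped into its viewpoint."""
    predicted, _ = forward_warp(reference_color, reference_disp)
    residual = actual.astype(np.int16) - predicted.astype(np.int16)
    return residual  # the residual (plus holes) is what gets compressed

def decode_predicted(residual, reference_color, reference_disp):
    """Reconstruct the predicted camera: warp the reference, add the residual."""
    predicted, _ = forward_warp(reference_color, reference_disp)
    return np.clip(predicted.astype(np.int16) + residual, 0, 255).astype(np.uint8)
```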

Boundary layer coding. The boundary layer's depth, color texture, and alpha matte are coded with our own shape coding method, similar to MPEG-4's.

System overview (online stage: file, selective decompression, render). Offline: video capture → stereo → representation → compression → file. Online: selective decompression → render.

Rendering. Next we describe the process of using the GPU to render a novel viewpoint from the compressed data. Given a novel viewpoint, the rendering program determines the two nearest source cameras; the data from these two cameras is blended to create the new viewpoint.

Rendering pipeline. For each of the two source cameras, project the main layer and project the boundary layer, then composite the projected layers into the final view.

Rendering the main layer (step 1). At every frame time, the main-layer depth map is used to create a dense mesh that is texture-mapped with the main-layer color map. A vertex shader projects the mesh into the virtual camera's view (positions and texture coordinates, with a Z-buffer), and a pixel shader rejects any pixels that lie at large depth gradients.

Rendering the main layer (step 2). Main-layer triangles that connect background to foreground across a depth discontinuity (one pixel wide) need to be removed. To avoid modifying the mesh on a frame-by-frame basis, the CPU locates the depth discontinuities and generates an "erase" mesh; a pixel shader uses it to set the Z-buffer to far away and the colors to transparent along those boundaries.
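
A CPU stand-in for the GPU main-layer pass of steps 1 and 2: the real system builds a dense mesh and uses vertex and pixel shaders with a separate erase mesh, whereas this sketch splats each pixel individually and simply skips discontinuity pixels. The pinhole model, one-pixel splats, and the gradient threshold are assumptions.

```python
import numpy as np

def render_main_layer(color, depth, K_src, K_dst, R, t, grad_thresh=4.0):
    """Project one source camera's main layer (color + depth) into a virtual
    camera with intrinsics K_dst and pose (R, t) relative to the source,
    using a z-buffer and skipping depth-discontinuity pixels."""
    h, w = depth.shape
    out = np.zeros_like(color)
    zbuf = np.full((h, w), np.inf)

    gy, gx = np.gradient(depth.astype(float))
    discont = np.hypot(gx, gy) > grad_thresh   # step 2: drop foreground/background "rubber sheets"

    Kinv = np.linalg.inv(K_src)
    for y in range(h):
        for x in range(w):
            if discont[y, x]:
                continue
            # back-project to a 3-D point in the source camera frame
            p = depth[y, x] * (Kinv @ np.array([x, y, 1.0]))
            q = K_dst @ (R @ p + t)            # project into the virtual camera
            if q[2] <= 0:
                continue
            u, v = int(round(q[0] / q[2])), int(round(q[1] / q[2]))
            if 0 <= u < w and 0 <= v < h and q[2] < zbuf[v, u]:
                zbuf[v, u] = q[2]
                out[v, u] = color[y, x]
    return out, zbuf
```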

Rendering the boundary layer. The CPU generates a boundary mesh from the boundary layer's depth, with its RGBA colors stored as vertex colors; the mesh is projected into the virtual camera's view and composited over the projected main layer using the Z-buffer.

Graphics for vision. The GPU can also be used for vision itself, for example real-time stereo (Yang and Pollefeys, CVPR 2003).

Rendering (continued). Returning to the pipeline: project the main and boundary layers from each of the two source cameras, then composite.

Compositing views. The projected views from camera 1 and camera 2 are blended in a pixel shader, with weights based on the virtual viewpoint's proximity to each camera, followed by a normalization pass that produces the final result.
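
A hedged sketch of the compositing pass: blend the two projected views with weights based on the virtual viewpoint's proximity to each source camera, then normalize per pixel so regions covered by only one camera keep full brightness. The inverse-distance weighting and validity masks are assumptions.

```python
import numpy as np

def composite(view1, valid1, view2, valid2, d1, d2, eps=1e-6):
    """view{1,2}  : H x W x 3 images projected from cameras 1 and 2
       valid{1,2} : H x W masks of pixels each projection actually covers
       d{1,2}     : distance of the virtual viewpoint to each source camera"""
    w1 = (1.0 / (d1 + eps)) * valid1.astype(float)
    w2 = (1.0 / (d2 + eps)) * valid2.astype(float)
    total = w1 + w2
    total[total == 0] = eps                        # uncovered pixels stay black
    blended = (view1 * w1[..., None] + view2 * w2[..., None]) / total[..., None]
    return blended.astype(view1.dtype)
```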

DEMO. Running in real time on a xxx machine. Pause, interpolate, with/without playback. Decompressed and rendered in real time. 640×480 × N frames = 300 MB.

“Massive Arabesque” video clip

Future work: mesh simplification; more complicated scenes; temporal interpolation (using optical flow); a wider range of virtual motion; a 2D grid of cameras.

Summary: sparse camera configuration; high-quality depth recovery; automatic matting; a new two-layer representation; inter-camera compression; real-time rendering.