RGB-D Images and Applications


RGB-D Images and Applications Yao Lu

Outline Overview of RGB-D images and sensors Recognition: human pose, hand gesture Reconstruction: Kinect fusion


How does Kinect work? Kinect has three components: a color camera (captures RGB values), an IR camera (captures depth data), and a microphone array (for speech recognition).

Depth Image

How Does the Kinect Compare? Distance-sensing alternatives range from single-point, close-range proximity sensors (cheaper than Kinect, ~$2) to high-performing motion-sensing and 3D-mapping devices with much higher cost. The Kinect offers good performance for distance and motion sensing, providing a bridge between low-cost and high-performance sensors.

Depth Sensor: the IR projector emits a predefined dotted pattern. Because of the lateral shift between the projector and the sensor, the pattern dots appear shifted, and the shift in the dots determines the depth of each region.
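To make the geometry concrete, here is a minimal triangulation sketch of how a dot's lateral shift maps to depth via Z = f * b / disparity. The function name and the focal-length/baseline numbers are illustrative assumptions, not calibrated Kinect constants.

```python
# Minimal structured-light triangulation sketch: a projected dot's lateral
# shift (disparity, in pixels) maps to depth via Z = f * b / d.
# focal_px and baseline_m are illustrative, not calibrated Kinect values.

def depth_from_dot_shift(disparity_px, focal_px=580.0, baseline_m=0.075):
    """Depth in meters for a dot shifted by disparity_px pixels."""
    if disparity_px <= 0:
        return float("inf")  # no measurable shift -> effectively infinite range
    return focal_px * baseline_m / disparity_px

# Example: a 29-pixel shift corresponds to roughly 1.5 m under these constants.
print(depth_from_dot_shift(29.0))
```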

Kinect Accuracy (OpenKinect SDK): 11-bit accuracy for measured depth, so 2^11 = 2048 possible values. A calculated 11-bit value of 2047 is the maximum distance (approx. 16.5 ft.), and 0 is the minimum distance (approx. 1.65 ft.). The reasonable range is 4-10 feet, where the value-to-distance curve provides a moderate slope. Values from: http://mathnathan.com/2011/02/depthvsdistance/

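As a rough illustration of this nonlinear mapping, the sketch below uses the tangent fit circulated in the OpenKinect community (often attributed to Stéphane Magnenat) to turn a raw 11-bit reading into meters; the constants come from that fit, and the function name is ours.

```python
import math

# Hedged sketch: one community fit for converting the raw 11-bit Kinect
# value to meters. Other calibrations exist; treat the constants as a
# reasonable approximation rather than ground truth.

def raw11_to_meters(raw):
    """Map a raw reading in [0, 2047] to an approximate distance in meters."""
    if raw >= 2047:
        return float("inf")  # 2047 is reported for out-of-range pixels
    return 0.1236 * math.tan(raw / 2842.5 + 1.1863)

# Example: larger raw values map nonlinearly to larger distances.
for raw in (600, 800, 1000):
    print(raw, round(raw11_to_meters(raw), 2), "m")
```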

Other RGB-D sensors: Intel RealSense series, Asus Xtion Pro, Microsoft Kinect V2, Structure Sensor.

Outline Overview of RGB-D images and sensors Recognition: human pose, hand gesture Reconstruction: Kinect fusion

Recognition: Human Pose Recognition. Research in pose recognition has been ongoing for 20+ years, typically under many assumptions: multiple cameras, manual initialization, controlled/simple backgrounds.

Model-Based Estimation of 3D Human Motion, Ioannis Kakadiaris and Dimitris Metaxas, PAMI 2000

Tracking People by Learning Their Appearance, Deva Ramanan, David A. Forsyth, and Andrew Zisserman, PAMI 2007

Kinect: why does depth help?

Algorithm design. Shotton et al. proposed two main steps: 1. find body parts; 2. compute joint positions. (Real-Time Human Pose Recognition in Parts from Single Depth Images, Jamie Shotton, Andrew Fitzgibbon, Mat Cook, Toby Sharp, Mark Finocchio, Richard Moore, Alex Kipman, and Andrew Blake, CVPR 2011.)

Finding body parts What should we use for a feature? What should we use for a classifier?

Finding body parts. What should we use for a feature? Difference in depth. What should we use for a classifier? Random decision forests: a set of decision trees.

Features: f(I, x) = d_I(x + u/d_I(x)) - d_I(x + v/d_I(x)), where d_I(x) is the depth at pixel x in image I, and u, v are parameters describing offsets. Normalizing the offsets by d_I(x) makes the feature depth-invariant.
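A minimal sketch of this feature, assuming the depth image is a NumPy array and using a large constant for probes that land off the image or on background (names are illustrative):

```python
import numpy as np

BACKGROUND = 1e6  # large positive constant for off-image/background probes

def probe(depth, y, x):
    """Depth at (y, x), or a large constant if the probe leaves the image."""
    h, w = depth.shape
    return depth[y, x] if 0 <= y < h and 0 <= x < w else BACKGROUND

def depth_feature(depth, pixel, u, v):
    """f(I, x) = d_I(x + u/d_I(x)) - d_I(x + v/d_I(x)).
    Scaling the offsets by 1/d_I(x) makes the feature depth-invariant;
    assumes a valid (nonzero) depth at the probed pixel."""
    y, x = pixel
    d = float(depth[y, x])
    uy, ux = int(round(y + u[0] / d)), int(round(x + u[1] / d))
    vy, vx = int(round(y + v[0] / d)), int(round(x + v[1] / d))
    return probe(depth, uy, ux) - probe(depth, vy, vx)
```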

Classification. Learning: randomly choose a set of thresholds and features for splits; pick the threshold and feature that provide the largest information gain; recurse until a target accuracy or maximum tree depth is reached.
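A toy sketch of that greedy split search is below; the entropy-based information gain is standard, but the helper names and the uniform threshold sampling are assumptions for illustration, not the paper's code.

```python
import math
import random
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def best_split(samples, labels, feature_fns, n_thresholds=50):
    """Try each candidate feature with random thresholds; keep the pair
    with the largest information gain (decrease in weighted entropy)."""
    base, best = entropy(labels), (None, None, 0.0)
    for f in feature_fns:
        values = [f(s) for s in samples]
        for _ in range(n_thresholds):
            t = random.uniform(min(values), max(values))
            left = [l for v, l in zip(values, labels) if v < t]
            right = [l for v, l in zip(values, labels) if v >= t]
            if not left or not right:
                continue  # degenerate split, skip
            gain = base - (len(left) * entropy(left) +
                           len(right) * entropy(right)) / len(labels)
            if gain > best[2]:
                best = (f, t, gain)
    return best  # (feature, threshold, information gain)
```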

Implementation details: 3 trees (depth 20); 300k unique training images per tree; 2000 candidate features and 50 thresholds; one day of training on a 1000-core cluster.

Synthetic data

Synthetic training/testing

Real test

Results

Estimating joints: apply mean-shift clustering to the labeled pixels, then "push back" each mode to lie at the center of the part (the modes sit on the body surface seen by the camera, so they are offset back into the body's interior).
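Below is a minimal mean-shift sketch for one body part, assuming the labeled pixels have already been back-projected to 3D points with per-pixel confidence weights; the Gaussian bandwidth is an illustrative value.

```python
import numpy as np

def mean_shift_mode(points, weights, bandwidth=0.06, iters=30):
    """points: (N, 3) world coordinates of pixels labeled as one part;
    weights: (N,) per-pixel confidences. Returns one density mode."""
    mode = np.average(points, axis=0, weights=weights)  # start at weighted mean
    for _ in range(iters):
        d2 = np.sum((points - mode) ** 2, axis=1)
        k = weights * np.exp(-d2 / (2 * bandwidth ** 2))  # Gaussian kernel
        mode = (k[:, None] * points).sum(axis=0) / k.sum()
    return mode  # lies near the body surface; push back to the part center
```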

Results

Outline Overview of RGB-D images and sensors Recognition: human pose, hand gesture Reconstruction: Kinect fusion

Hand gesture recognition

Hand Pose Inference. Target: low-cost markerless mocap. Challenges: full articulated pose with high DoF; real-time with low latency; many DoF contribute to model deformation; constrained unknown parameter space; self-similar parts; self-occlusion; device noise.

Pipeline Overview (Tompson et al., Real-Time Continuous Pose Recovery of Human Hands Using Convolutional Networks, ACM SIGGRAPH 2014). Supervised-learning-based approach: needs a labeled dataset + machine learning; existing datasets had limited pose information for hands. Architecture: offline database creation, then RDF hand detection, ConvNet joint detection, inverse kinematics, pose.


RDF Hand Detection: per-pixel binary classification, yielding the hand centroid location. Uses a randomized decision forest (RDF), following Shotton et al. [1]: fast (parallel) and generalizes well. Each tree (RDT1, RDT2, ...) produces per-pixel label estimates, and the forest combines them into P(L | D), the label probability given the depth data. (Figure: target vs. inferred labels.)
[1] J. Shotton et al., Real-Time Human Pose Recognition in Parts from Single Depth Images, CVPR 2011

Inferring Joint Positions: a two-stage neural network. Image preprocessing turns the PrimeSense depth image into a multi-resolution pyramid (96x96, 48x48, 24x24); three ConvNet detectors, one per resolution, output heat maps of likely joint locations.
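To show how a heat map becomes a joint estimate, here is a hedged soft-argmax sketch (a confidence-weighted average over the map); the exact decoding used in the paper may differ, so treat this as the idea rather than the method.

```python
import numpy as np

def heatmap_to_joint(heatmap):
    """heatmap: (H, W) array of joint-location confidences.
    Returns (row, col) as a sub-pixel soft-argmax estimate."""
    h, w = heatmap.shape
    p = np.clip(heatmap, 0, None)      # ignore negative responses
    p = p / p.sum()                    # normalize to a distribution
    rows, cols = np.mgrid[0:h, 0:w]
    return (rows * p).sum(), (cols * p).sum()
```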

Hand Pose Inference Results

Outline Overview of RGB-D images and sensors Recognition: human pose, hand gesture Reconstruction: Kinect fusion

Reconstruction: Kinect Fusion Newcombe et al. KinectFusion: Real-time dense surface mapping and tracking. 2011 IEEE International Symposium on Mixed and Augmented Reality. https://www.youtube.com/watch?v=quGhaggn3cQ

Motivation: augmented reality, 3D model scanning, robot navigation, etc.

Challenges: tracking the camera precisely; fusing and de-noising measurements; avoiding drift; real-time operation; low-cost hardware.

Proposed solution: fast optimization for tracking, enabled by the high frame rate; a global framework for fusing data; interleaved tracking and mapping; Kinect for depth data (low cost); GPGPU for real-time performance (low cost).

Method

Tracking: finding the camera position is the same as fitting the frame's depth map onto the model.

Tracking - the ICP algorithm (ICP = iterative closest point). Goal: fit two 3D point sets. Problem: what are the correspondences? KinectFusion's chosen solution: 1. Start with an initial transform estimate. 2. Project the model onto the camera. 3. Take as correspondences the points with the same image coordinates. 4. Find a new T with least squares. 5. Apply T, and repeat 2-5 until convergence.


Tracking - the ICP algorithm. Assumption: the frame and the model are roughly aligned, which holds because of the high frame rate.
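A rough sketch of one such iteration: project model points into the camera, pair each with the frame point stored at the same pixel, and solve for the pose by least squares. For brevity this uses a point-to-point (Kabsch) solution, whereas KinectFusion linearizes a point-to-plane error, so treat it as the idea rather than the paper's exact method.

```python
import numpy as np

def fit_rigid(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst, both (N, 3)."""
    cs, cd = src.mean(axis=0), dst.mean(axis=0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T          # proper rotation (det = +1)
    return R, cd - R @ cs

def icp_step(model_pts, frame_pts, project):
    """One projective-data-association ICP step.
    model_pts: iterable of 3D model points; frame_pts: dict pixel -> 3D point;
    project: function mapping a 3D point to its (row, col) pixel."""
    src, dst = [], []
    for p in model_pts:
        pix = project(p)
        if pix in frame_pts:     # correspondence = same image coordinates
            src.append(p)
            dst.append(frame_pts[pix])
    if not src:
        raise ValueError("no correspondences found")
    return fit_rigid(np.asarray(src), np.asarray(dst))
```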

Mapping is fusing depth maps when the camera poses are known: each new frame is merged with the model built from the existing frames. Problems: measurements are noisy, and depth maps have holes in them. Solution: use an implicit surface representation; fusing = estimating the surface from all relevant frames.

Mapping - surface representation. The surface is represented implicitly, using a Truncated Signed Distance Function (TSDF) over a voxel grid: the number stored in each cell measures the voxel's distance to the surface, D.


Mapping: for each voxel, d = [pixel depth] - [distance from sensor to voxel].
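A hedged sketch of the per-voxel fusion this implies: truncate d and fold it into a running weighted average, which de-noises the measurements across frames. The truncation distance and unit weights are illustrative choices.

```python
TRUNC = 0.03  # truncation distance in meters (illustrative)

def update_voxel(tsdf, weight, pixel_depth, sensor_to_voxel):
    """Fuse one new observation into a voxel's (tsdf, weight) state."""
    d = pixel_depth - sensor_to_voxel        # signed distance to the surface
    if d < -TRUNC:
        return tsdf, weight                  # voxel far behind the surface: skip
    f = min(1.0, d / TRUNC)                  # truncate to [-1, 1]
    new_w = weight + 1.0
    new_tsdf = (tsdf * weight + f) / new_w   # running weighted average
    return new_tsdf, new_w
```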


Method

Pros: really nice results; an elegant solution; real-time performance (30 Hz); a dense model; no drift with only local optimization; robust to scene changes. Cons: the 3D voxel grid can't be trivially up-scaled.

Limitations: doesn't work for large areas (voxel grid); doesn't work far away from objects (active ranging); doesn't work outdoors (IR); requires a powerful graphics card; uses lots of battery (active ranging); only one sensor at a time.