
1 Real-Time Human Pose Recognition in Parts from Single Depth Images Jamie Shotton Andrew Fitzgibbon Mat Cook Toby Sharp Mark Finocchio Richard Moore Alex Kipman Andrew Blake Microsoft Research Cambridge & Xbox Incubation CVPR 2011 Best Paper

2

3 OUTLINE – Introduction – Data – Body Part Inference and Joint Proposals – Experiments – Discussion

4 Introduction Robust interactive human body tracking – applications: gaming, human-computer interaction, security, telepresence, health care Real-time depth cameras – tracking from frame to frame works, but such trackers struggle to re-initialize quickly and so are not robust – our focus: per-frame initialization for a tracking algorithm, via pose recognition in parts – output: 3D position candidates for each skeletal joint

5 Introduction appropriate tracking algorithm – Tracking people with twists and exponential maps (CVPR 1998) – Tracking loose limbed people (CVPR 2004) – Nonlinear body pose estimation from depth images (DAGM 2005) – Real-time hand-tracking with a color glove (ACM 2009) – Real time motion capture using a single time-of-flight camera (CVPR 2010)

6 Introduction Inspired by recent object recognition work that divides objects into parts – Object class recognition by unsupervised scale-invariant learning [CVPR 2003] – The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006] Two key design goals – computational efficiency – robustness

7 Introduction Pipeline: the depth image is segmented into a dense probabilistic body part labeling, with parts spatially localized near skeletal joints, from which 3D joint proposals are generated

8 Introduction We treat the segmentation into body parts as a per-pixel classification task – evaluating each pixel separately Training data – generate realistic synthetic depth images – train a deep randomized decision forest classifier while avoiding overfitting

9 Introduction Overfitting is avoided by the size and variety of the training set Simple, discriminative depth comparison image features maintain high computational efficiency

10 Introduction For further speed, the classifier can be run in parallel on each pixel on a GPU Mean shift on the inferred part labels then yields the 3D joint proposals

11 What is Mean Shift? A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N Pipeline: data → discrete PDF representation → non-parametric density estimation / non-parametric density GRADIENT estimation (mean shift) → PDF analysis PDF in feature space: color space, scale space, actually any feature space you can conceive…

12 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

13 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

14 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

15 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

16 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

17 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

18 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Objective : Find the densest region
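The intuition above is just iterated hill climbing on sample density: move the window to the center of mass of the samples it contains, and repeat. A minimal sketch of that loop (my own illustration, not code from the presentation; points, start, and radius are hypothetical names for the samples, the initial window center, and the region-of-interest size):

import numpy as np

def mean_shift_mode(points, start, radius=1.0, max_iter=100, tol=1e-5):
    # Follow the mean shift vector from `start` until it converges on a mode.
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        dist = np.linalg.norm(points - x, axis=1)   # distances to all samples
        inside = points[dist < radius]              # samples in the region of interest
        if len(inside) == 0:
            break
        center_of_mass = inside.mean(axis=0)
        shift = center_of_mass - x                  # the mean shift vector
        x = center_of_mass
        if np.linalg.norm(shift) < tol:
            break                                   # converged on a density mode
    return x

Starting this loop from several seed points and merging nearby convergence points recovers the modes, i.e. the densest regions of the sample distribution.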

19 Main contribution Treat pose estimation as object recognition – using a novel intermediate body parts representation – designed to spatially localize joints – at low computational cost and high accuracy

20 Experiments (i) synthetic depth training data is an excellent proxy for real data (ii) scaling up the learning problem with varied synthetic data is important for high accuracy (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor

21 Data Depth imaging and motion capture data Pose estimation research – has often focused on techniques – and suffered from a lack of training data Two problems for depth images – color – pose

22 Depth image Use real mocap data – retargeted to a variety of base character models – to synthesize a large, varied dataset – 640x480 images at 30 frames per second Depth cameras > traditional intensity sensors – working in low light levels – giving a calibrated scale estimate – resolving silhouette ambiguities in pose

23 Motion capture data Capture a large database of motion capture (mocap) of human actions – approximately 500k frames – (driving, dancing, kicking, running, navigating menus) Need not record mocap with variation in – rotation about the vertical axis – mirroring left-right – scene position – body shape and size – camera pose all of which can be added in (semi-)automatically

24 Motion capture data The classifier uses no temporal information – it is trained on static poses, not motion Changes from one frame to the next are often so small as to be insignificant, so redundant poses are discarded – using a 'furthest neighbor' clustering algorithm – where the distance between poses P1 and P2 is the maximum over body joints j of the Euclidean distance between corresponding joint positions, max_j ||p1_j − p2_j|| – poses are kept only if they differ by more than 5 cm
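As an illustration of this subsampling step, here is a small sketch (a greedy approximation written for clarity, not the authors' exact 'furthest neighbor' implementation) that keeps a pose only if it is more than 5 cm from every pose kept so far, using the maximum-over-joints distance described above:

import numpy as np

def pose_distance(p1, p2):
    # Maximum Euclidean distance over corresponding body joints;
    # p1 and p2 are (J, 3) arrays of 3-D joint positions in meters.
    return np.max(np.linalg.norm(p1 - p2, axis=1))

def subsample_poses(poses, min_dist=0.05):
    # Greedily keep poses that differ by more than min_dist (5 cm) from every kept pose.
    kept = []
    for pose in poses:
        if all(pose_distance(pose, k) > min_dist for k in kept):
            kept.append(pose)
    return kept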

25 Motion capture data It proved necessary to iterate the process of – motion capture – sampling from our model – training the classifier – testing joint prediction accuracy (CMU mocap database)

26 Generating synthetic data Build a randomized rendering pipeline – to sample fully labeled training images Goals – realism and variety

27 Generating synthetic data First: randomly sample a set of parameters Then: use standard computer graphics techniques – to render depth and body part images – from texture-mapped 3D meshes Uses Autodesk MotionBuilder – slight random variation in height and weight gives extra coverage of body shapes – other parameters are also randomized

28 Generating synthetic data

29 Body Part Inference and Joint Proposals Body part labeling Depth image features Randomized decision forests Joint position proposals

30 Body part labeling Intermediate body part representation – color-coded parts – some directly localize particular skeletal joints – others fill the gaps This transforms the problem into one that can readily be solved by efficient classification algorithms

31 Body part labeling The parts are specified in a texture map

32 Body part labeling 31 body parts: – LU/RU/LW/RW head, neck – L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand – LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (L = Left, R = Right, U = Upper, W = loWer)

33 Depth image features fθ(I, x) = dI(x + u/dI(x)) − dI(x + v/dI(x)) – dI(x) is the depth at pixel x in image I – θ = (u, v) describes the offsets u and v – normalizing the offsets by 1/dI(x) ensures the features are depth invariant
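Reading the feature definition literally, each offset is first divided by the depth at the pixel being classified, so a probe covers a fixed world-space distance whether the person is near or far from the camera. A sketch of the computation (function and constant names are mine; the real system runs this on the GPU):

import numpy as np

BACKGROUND_DEPTH = 1e6  # large constant for probes that fall off the image or hit background

def probe_depth(depth, p):
    # Depth at pixel p = (row, col), or a large constant if p is outside the image.
    row, col = int(round(p[0])), int(round(p[1]))
    h, w = depth.shape
    if 0 <= row < h and 0 <= col < w and depth[row, col] > 0:
        return depth[row, col]
    return BACKGROUND_DEPTH

def depth_feature(depth, x, u, v):
    # f_theta(I, x) = dI(x + u / dI(x)) - dI(x + v / dI(x))
    # depth : (H, W) depth image in meters; x : pixel; u, v : the offsets theta = (u, v).
    d_x = probe_depth(depth, x)
    x = np.asarray(x, dtype=float)
    return probe_depth(depth, x + np.asarray(u) / d_x) - probe_depth(depth, x + np.asarray(v) / d_x)

Note how the feature touches at most three pixels and a handful of arithmetic operations, matching the efficiency claims on the following slides.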

34 Depth image features Individually these features provide only a weak signal – but in combination in a decision forest they are sufficient to accurately disambiguate all trained parts

35 Depth image features The design of these features was strongly motivated by their computational efficiency – no preprocessing is needed – each feature reads at most 3 image pixels – and performs at most 5 arithmetic operations – straightforwardly implemented on the GPU

36 Randomized decision forests – fast and effective multi-class classifiers – implemented efficiently on the GPU

37 Randomized decision forests

38
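To make the classifier in slides 36–38 concrete: the paper pushes each pixel down every tree, choosing left or right at each split node by thresholding one depth comparison feature, and averages the part distributions stored at the leaves it reaches. A sketch under that reading (the Node structure and names are mine; it reuses depth_feature from the sketch above):

import numpy as np

class Node:
    # Split node (u, v, threshold, left, right) or leaf (part_distribution).
    def __init__(self, u=None, v=None, threshold=None,
                 left=None, right=None, part_distribution=None):
        self.u, self.v, self.threshold = u, v, threshold
        self.left, self.right = left, right
        self.part_distribution = part_distribution  # (31,) probabilities at a leaf

def classify_pixel(trees, depth, x):
    # Average the leaf distributions reached in each tree for pixel x.
    probs = np.zeros(31)
    for root in trees:
        node = root
        while node.part_distribution is None:
            f = depth_feature(depth, x, node.u, node.v)
            node = node.left if f < node.threshold else node.right
        probs += node.part_distribution
    return probs / len(trees)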

39 Joint position proposals Generate reliable proposals for the positions of 3D skeletal joints – the final output of our algorithm – used by a tracking algorithm to self-initialize – and to recover from failure

40 Joint position proposals A local mode-finding approach based on mean shift with a weighted Gaussian kernel – the density for part c is fc(x̂) ∝ Σi wic exp(−||(x̂ − x̂i)/bc||²) – x̂i is the reprojection of image pixel xi into world space given depth dI(xi) – bc is a learned per-part bandwidth

41 Non-Parametric Density Estimation Assumption: the data points are sampled from an underlying PDF Data point density implies PDF value! (figure: assumed underlying PDF vs. real data samples)

42 Non-Parametric Density Estimation (figure: assumed underlying PDF vs. real data samples)

43 Non-Parametric Density Estimation (figure: assumed underlying PDF vs. real data samples)
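Slides 41–43 make one point: with enough samples, local sample density is itself an estimate of the PDF. A minimal kernel density estimate in one dimension illustrates this (generic textbook code, not from the paper):

import numpy as np

def kde(samples, query, bandwidth=0.1):
    # Non-parametric (kernel) density estimate at `query` from 1-D samples:
    # place a Gaussian at every sample and average their values at the query point.
    diffs = (query - np.asarray(samples)) / bandwidth
    return np.mean(np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)) / bandwidth

Mean shift (slide 11 onward) climbs the gradient of exactly this kind of estimate without evaluating it everywhere.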

44 Parametric Density Estimation Assumption: the data points are sampled from an underlying PDF (figure: assumed underlying PDF, its parametric estimate, and real data samples)

45 Joint position proposals The pixel weight wic = P(c | I, xi) · dI(xi)² considers both the inferred body part probability at the pixel and the world surface area of the pixel

46 Joint position proposals The detected modes – lie on the surface of the body – and are pushed back into the scene by a learned z offset to produce a final joint position proposal Parameters, set by grid search on 5000 validation images: – bandwidth bc = 0.065 m – probability threshold λc = 0.14 – z offset = 0.039 m
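Putting slides 40, 45, and 46 together, a simplified sketch of how a proposal for one body part c could be formed (my own condensed illustration: it runs weighted mean shift from a single start point, whereas the paper starts from every pixel above the probability threshold and keeps several modes):

import numpy as np

def joint_proposal(world_points, part_probs, depths, camera_dir,
                   bandwidth=0.065, threshold=0.14, z_offset=0.039, iters=20):
    # world_points : (N, 3) reprojected pixel positions x̂_i in meters
    # part_probs   : (N,) inferred probability P(c | I, x_i) of this part at each pixel
    # depths       : (N,) pixel depths dI(x_i) in meters
    # camera_dir   : (3,) unit vector pointing away from the camera (assumption:
    #                the learned z offset pushes the surface mode along this direction)
    w = part_probs * depths ** 2          # pixel weights w_ic = P(c|I, x_i) * dI(x_i)^2
    mask = part_probs > threshold         # only confident pixels seed the search
    if not np.any(mask):
        return None
    x = np.average(world_points[mask], axis=0, weights=w[mask])
    for _ in range(iters):                # weighted Gaussian mean shift
        k = w * np.exp(-np.sum(((world_points - x) / bandwidth) ** 2, axis=1))
        if k.sum() == 0:
            break
        x = (k[:, None] * world_points).sum(axis=0) / k.sum()
    # The mode lies on the body surface; push it into the scene by the learned z offset.
    return x + z_offset * camera_dir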

47 Joint position proposals

48 Experiments Further results are provided in the supplementary material Training parameters: – 3 trees of depth 20, 300k training images per tree – 2000 training example pixels per image – 2000 candidate features θ – 50 candidate thresholds ζ per feature

49 Experiments Test data – challenging synthetic and real depth images to evaluate our approach – 5000 synthesized depth images Real test set – 8808 frames of real depth images – 15 different subjects – labeled with 7 upper body joint positions

50 Experiments Error metrics: – classification accuracy: average of the diagonal of the confusion matrix between the ground truth part label and the most likely inferred part label – joint prediction accuracy: generate recall-precision curves as a function of confidence threshold, and quantify accuracy as average precision per joint

51 Experiments Error metric (continued): – the first joint proposal within D meters of the ground truth counts as a true positive; further proposals within D meters count as false positives – this penalizes multiple spurious detections near the correct position, which might slow a downstream tracking algorithm – D = 0.1 m, roughly the accuracy of the hand-labeled real test data
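To make the joint prediction metric concrete, a sketch of average precision for a single joint under my reading of the slides: within each frame the highest-confidence proposal within D = 0.1 m of the ground truth is the one true positive, and every other proposal is a false positive (data layout and names are hypothetical):

import numpy as np

def average_precision(frames, D=0.1):
    # frames: list of (proposals, gt); proposals is a list of (confidence, xyz)
    # for this joint, gt is its ground-truth 3-D position (frames with no gt are skipped).
    scored, n_gt = [], 0
    for proposals, gt in frames:
        if gt is None:
            continue
        n_gt += 1
        matched = False
        for conf, xyz in sorted(proposals, key=lambda p: -p[0]):
            close = np.linalg.norm(np.asarray(xyz) - np.asarray(gt)) < D
            scored.append((conf, close and not matched))  # only the first close hit is a TP
            matched = matched or close
    scored.sort(key=lambda s: -s[0])                      # sweep the confidence threshold
    tps = np.cumsum([tp for _, tp in scored])
    fps = np.cumsum([not tp for _, tp in scored])
    recall = tps / max(n_gt, 1)
    precision = tps / np.maximum(tps + fps, 1)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):                   # area under the PR curve
        ap += p * (r - prev_r)
        prev_r = r
    return ap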

52 Experiments

53

54

55

56

57 Real time motion capture using a single time-of-flight camera. [CVPR 2010]

58 Discussion Accurate proposals – for the 3D locations of body joints – at super real-time rates from single depth images Body part recognition – used as an intermediate representation A highly varied synthetic training set – allows training very deep decision forests – with depth-invariant features, without overfitting

59 Future work Study of the variability in the source mocap data A generative model underlying the synthesis pipeline A similarly efficient approach that – directly regresses joint positions – removes ambiguities in local pose

60 Thank you

