
1 Real-Time Human Pose Recognition in Parts from Single Depth Images Jamie Shotton Andrew Fitzgibbon Mat Cook Toby Sharp Mark Finocchio Richard Moore Alex Kipman Andrew Blake Microsoft Research Cambridge & Xbox Incubation CVPR 2011 Best Paper

2

3 OUTLINE – Introduction – Data – Body Part Inference and Joint Proposals – Experiments – Discussion

4 Introduction Robust interactive human body tracking – applications: gaming, human-computer interaction, security, telepresence, health care Real-time depth cameras – tracking from frame to frame works, but such trackers struggle to re-initialize quickly and so are not robust – our focus: per-frame initialization for a tracking algorithm, via pose recognition in parts – output: 3D position candidates for each skeletal joint

5 Introduction appropriate tracking algorithm – Tracking people with twists and exponential maps (CVPR 1998) – Tracking loose limbed people (CVPR 2004) – Nonlinear body pose estimation from depth images (DAGM 2005) – Real-time hand-tracking with a color glove (ACM 2009) – Real time motion capture using a single time-of-flight camera (CVPR 2010)

6 Introduction Inspired by recent object recognition work that divides objects into parts – Object class recognition by unsupervised scale-invariant learning [CVPR 2003] – The layout consistent random field for recognizing and segmenting partially occluded objects [CVPR 2006] Two key design goals – computational efficiency – robustness

7 Introduction Pipeline: the depth image is segmented into a dense probabilistic body part labeling, with parts spatially localized near skeletal joints, from which 3D joint proposals are generated

8 Introduction We treat the segmentation into body parts as a per-pixel classification task – evaluating each pixel separately Training data – generate realistic synthetic depth images – train a deep randomized decision forest classifier while avoiding overfitting

9 Introduction Overfitting is avoided by the size and variety of the training set Simple, discriminative depth comparison image features maintain high computational efficiency

10 Introduction For further speed, the classifier can be run in parallel on each pixel on a GPU Mean shift on the inferred part labels then yields the 3D joint proposals

11 What is Mean Shift? A tool for finding modes in a set of data samples, manifesting an underlying probability density function (PDF) in R^N Pipeline: data → discrete PDF representation → non-parametric density estimation / non-parametric density GRADIENT estimation (mean shift) → PDF analysis PDF in feature space: color space, scale space, actually any feature space you can conceive…

12 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

13 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

14 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

15 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

16 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

17 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Mean Shift vector Objective : Find the densest region

18 Intuitive Description Distribution of identical billiard balls Region of interest Center of mass Objective : Find the densest region
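The intuition above is just iterated hill climbing on sample density: move the window to the center of mass of the samples it contains, and repeat. A minimal sketch of that loop (my own illustration, not code from the presentation; points, start, and radius are hypothetical names for the samples, the initial window center, and the region-of-interest size):

import numpy as np

def mean_shift_mode(points, start, radius=1.0, max_iter=100, tol=1e-5):
    # Follow the mean shift vector from `start` until it converges on a mode.
    x = np.asarray(start, dtype=float)
    for _ in range(max_iter):
        dist = np.linalg.norm(points - x, axis=1)   # distances to all samples
        inside = points[dist < radius]              # samples in the region of interest
        if len(inside) == 0:
            break
        center_of_mass = inside.mean(axis=0)
        shift = center_of_mass - x                  # the mean shift vector
        x = center_of_mass
        if np.linalg.norm(shift) < tol:
            break                                   # converged on a density mode
    return x

Starting this loop from several seed points and merging nearby convergence points recovers the modes, i.e. the densest regions of the sample distribution.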

19 Main contribution Treat pose estimation as object recognition – using a novel intermediate body parts representation – designed to spatially localize joints – at low computational cost and high accuracy

20 Experiments (i) synthetic depth training data is an excellent proxy for real data (ii) scaling up the learning problem with varied synthetic data is important for high accuracy (iii) our parts-based approach generalizes better than even an oracular exact nearest neighbor

21 Data Depth imaging and motion capture data Pose estimation research – has often focused on techniques – and suffered from a lack of training data Two problems for depth images – color – pose

22 Depth image Use real mocap data – retargeted to a variety of base character models – to synthesize a large, varied dataset – 640x480 images at 30 frames per second Depth cameras > traditional intensity sensors – working in low light levels – giving a calibrated scale estimate – resolving silhouette ambiguities in pose

23 Motion capture data Capture a large database of motion capture (mocap) of human actions – approximately 500k frames – (driving, dancing, kicking, running, navigating menus) Need not record mocap with variation in – rotation about the vertical axis – mirroring left-right – scene position – body shape and size – camera pose all of which can be added in (semi-)automatically

24 Motion capture data The classifier uses no temporal information – it is trained on static poses, not motion Changes from one frame to the next are often so small as to be insignificant, so redundant poses are discarded – using a 'furthest neighbor' clustering algorithm – where the distance between poses P1 and P2 is the maximum over body joints j of the Euclidean distance between corresponding joint positions, max_j ||p1_j − p2_j|| – poses are kept only if they differ by more than 5 cm
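As an illustration of this subsampling step, here is a small sketch (a greedy approximation written for clarity, not the authors' exact 'furthest neighbor' implementation) that keeps a pose only if it is more than 5 cm from every pose kept so far, using the maximum-over-joints distance described above:

import numpy as np

def pose_distance(p1, p2):
    # Maximum Euclidean distance over corresponding body joints;
    # p1 and p2 are (J, 3) arrays of 3-D joint positions in meters.
    return np.max(np.linalg.norm(p1 - p2, axis=1))

def subsample_poses(poses, min_dist=0.05):
    # Greedily keep poses that differ by more than min_dist (5 cm) from every kept pose.
    kept = []
    for pose in poses:
        if all(pose_distance(pose, k) > min_dist for k in kept):
            kept.append(pose)
    return kept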

25 Motion capture data It proved necessary to iterate the process of – motion capture – sampling from our model – training the classifier – testing joint prediction accuracy (CMU mocap database)

26 Generating synthetic data Build a randomized rendering pipeline – to sample fully labeled training images Goals – realism and variety

27 Generating synthetic data First: randomly sample a set of parameters Then: use standard computer graphics techniques – to render depth and body part images – from texture-mapped 3D meshes Uses Autodesk MotionBuilder – slight random variation in height and weight gives extra coverage of body shapes – other parameters are also randomized

28 Generating synthetic data

29 Body Part Inference and Joint Proposals Body part labeling Depth image features Randomized decision forests Joint position proposals

30 Body part labeling Intermediate body part representation – color-coded parts – some directly localize particular skeletal joints – others fill the gaps This transforms the problem into one that can readily be solved by efficient classification algorithms

31 Body part labeling The parts are specified in a texture map

32 Body part labeling 31 body parts: – LU/RU/LW/RW head, neck – L/R shoulder, LU/RU/LW/RW arm, L/R elbow, L/R wrist, L/R hand – LU/RU/LW/RW torso, LU/RU/LW/RW leg, L/R knee, L/R ankle, L/R foot (L = Left, R = Right, U = Upper, W = loWer)

33 Depth image features fθ(I, x) = dI(x + u/dI(x)) − dI(x + v/dI(x)) – dI(x) is the depth at pixel x in image I – θ = (u, v) describes the offsets u and v – normalizing the offsets by 1/dI(x) ensures the features are depth invariant
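Reading the feature definition literally, each offset is first divided by the depth at the pixel being classified, so a probe covers a fixed world-space distance whether the person is near or far from the camera. A sketch of the computation (function and constant names are mine; the real system runs this on the GPU):

import numpy as np

BACKGROUND_DEPTH = 1e6  # large constant for probes that fall off the image or hit background

def probe_depth(depth, p):
    # Depth at pixel p = (row, col), or a large constant if p is outside the image.
    row, col = int(round(p[0])), int(round(p[1]))
    h, w = depth.shape
    if 0 <= row < h and 0 <= col < w and depth[row, col] > 0:
        return depth[row, col]
    return BACKGROUND_DEPTH

def depth_feature(depth, x, u, v):
    # f_theta(I, x) = dI(x + u / dI(x)) - dI(x + v / dI(x))
    # depth : (H, W) depth image in meters; x : pixel; u, v : the offsets theta = (u, v).
    d_x = probe_depth(depth, x)
    x = np.asarray(x, dtype=float)
    return probe_depth(depth, x + np.asarray(u) / d_x) - probe_depth(depth, x + np.asarray(v) / d_x)

Note how the feature touches at most three pixels and a handful of arithmetic operations, matching the efficiency claims on the following slides.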

34 Depth image features Individually these features provide only a weak signal – but in combination in a decision forest they are sufficient to accurately disambiguate all trained parts

35 Depth image features The design of these features was strongly motivated by their computational efficiency – no preprocessing is needed – each feature reads at most 3 image pixels – and performs at most 5 arithmetic operations – straightforwardly implemented on the GPU

36 Randomized decision forests – fast and effective multi-class classifiers – implemented efficiently on the GPU

37 Randomized decision forests

38
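To make the classifier in slides 36–38 concrete: the paper pushes each pixel down every tree, choosing left or right at each split node by thresholding one depth comparison feature, and averages the part distributions stored at the leaves it reaches. A sketch under that reading (the Node structure and names are mine; it reuses depth_feature from the sketch above):

import numpy as np

class Node:
    # Split node (u, v, threshold, left, right) or leaf (part_distribution).
    def __init__(self, u=None, v=None, threshold=None,
                 left=None, right=None, part_distribution=None):
        self.u, self.v, self.threshold = u, v, threshold
        self.left, self.right = left, right
        self.part_distribution = part_distribution  # (31,) probabilities at a leaf

def classify_pixel(trees, depth, x):
    # Average the leaf distributions reached in each tree for pixel x.
    probs = np.zeros(31)
    for root in trees:
        node = root
        while node.part_distribution is None:
            f = depth_feature(depth, x, node.u, node.v)
            node = node.left if f < node.threshold else node.right
        probs += node.part_distribution
    return probs / len(trees)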

39 Joint position proposals Generate reliable proposals for the positions of 3D skeletal joints – the final output of our algorithm – used by a tracking algorithm to self-initialize – and to recover from failure

40 Joint position proposals A local mode-finding approach based on mean shift with a weighted Gaussian kernel – the density for part c is fc(x̂) ∝ Σi wic exp(−||(x̂ − x̂i)/bc||²) – x̂i is the reprojection of image pixel xi into world space given depth dI(xi) – bc is a learned per-part bandwidth

41 Non-Parametric Density Estimation Assumption: the data points are sampled from an underlying PDF Data point density implies PDF value! (figure: assumed underlying PDF vs. real data samples)

42 Non-Parametric Density Estimation (figure: assumed underlying PDF vs. real data samples)

43 Non-Parametric Density Estimation (figure: assumed underlying PDF vs. real data samples)
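Slides 41–43 make one point: with enough samples, local sample density is itself an estimate of the PDF. A minimal kernel density estimate in one dimension illustrates this (generic textbook code, not from the paper):

import numpy as np

def kde(samples, query, bandwidth=0.1):
    # Non-parametric (kernel) density estimate at `query` from 1-D samples:
    # place a Gaussian at every sample and average their values at the query point.
    diffs = (query - np.asarray(samples)) / bandwidth
    return np.mean(np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)) / bandwidth

Mean shift (slide 11 onward) climbs the gradient of exactly this kind of estimate without evaluating it everywhere.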

44 Parametric Density Estimation Assumption: the data points are sampled from an underlying PDF (figure: assumed underlying PDF, its parametric estimate, and real data samples)

45 Joint position proposals The pixel weight wic = P(c | I, xi) · dI(xi)² considers both the inferred body part probability at the pixel and the world surface area of the pixel

46 Joint position proposals The detected modes – lie on the surface of the body – and are pushed back into the scene by a learned z offset to produce a final joint position proposal Parameters, set by grid search on 5000 validation images: – bandwidth bc = 0.065 m – probability threshold λc = 0.14 – z offset = 0.039 m
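Putting slides 40, 45, and 46 together, a simplified sketch of how a proposal for one body part c could be formed (my own condensed illustration: it runs weighted mean shift from a single start point, whereas the paper starts from every pixel above the probability threshold and keeps several modes):

import numpy as np

def joint_proposal(world_points, part_probs, depths, camera_dir,
                   bandwidth=0.065, threshold=0.14, z_offset=0.039, iters=20):
    # world_points : (N, 3) reprojected pixel positions x̂_i in meters
    # part_probs   : (N,) inferred probability P(c | I, x_i) of this part at each pixel
    # depths       : (N,) pixel depths dI(x_i) in meters
    # camera_dir   : (3,) unit vector pointing away from the camera (assumption:
    #                the learned z offset pushes the surface mode along this direction)
    w = part_probs * depths ** 2          # pixel weights w_ic = P(c|I, x_i) * dI(x_i)^2
    mask = part_probs > threshold         # only confident pixels seed the search
    if not np.any(mask):
        return None
    x = np.average(world_points[mask], axis=0, weights=w[mask])
    for _ in range(iters):                # weighted Gaussian mean shift
        k = w * np.exp(-np.sum(((world_points - x) / bandwidth) ** 2, axis=1))
        if k.sum() == 0:
            break
        x = (k[:, None] * world_points).sum(axis=0) / k.sum()
    # The mode lies on the body surface; push it into the scene by the learned z offset.
    return x + z_offset * camera_dir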

47 Joint position proposals

48 Experiments Further results are provided in the supplementary material Training parameters: – 3 trees of depth 20, 300k training images per tree – 2000 training example pixels per image – 2000 candidate features θ – 50 candidate thresholds ζ per feature

49 Experiments Test data – challenging synthetic and real depth images to evaluate our approach – 5000 synthesized depth images Real test set – 8808 frames of real depth images – 15 different subjects – labeled with 7 upper body joint positions

50 Experiments Error metrics: – classification accuracy: average of the diagonal of the confusion matrix between the ground truth part label and the most likely inferred part label – joint prediction accuracy: generate recall-precision curves as a function of confidence threshold, and quantify accuracy as average precision per joint

51 Experiments Error metric (continued): – the first joint proposal within D meters of the ground truth counts as a true positive; further proposals within D meters count as false positives – this penalizes multiple spurious detections near the correct position, which might slow a downstream tracking algorithm – D = 0.1 m, roughly the accuracy of the hand-labeled real test data
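To make the joint prediction metric concrete, a sketch of average precision for a single joint under my reading of the slides: within each frame the highest-confidence proposal within D = 0.1 m of the ground truth is the one true positive, and every other proposal is a false positive (data layout and names are hypothetical):

import numpy as np

def average_precision(frames, D=0.1):
    # frames: list of (proposals, gt); proposals is a list of (confidence, xyz)
    # for this joint, gt is its ground-truth 3-D position (frames with no gt are skipped).
    scored, n_gt = [], 0
    for proposals, gt in frames:
        if gt is None:
            continue
        n_gt += 1
        matched = False
        for conf, xyz in sorted(proposals, key=lambda p: -p[0]):
            close = np.linalg.norm(np.asarray(xyz) - np.asarray(gt)) < D
            scored.append((conf, close and not matched))  # only the first close hit is a TP
            matched = matched or close
    scored.sort(key=lambda s: -s[0])                      # sweep the confidence threshold
    tps = np.cumsum([tp for _, tp in scored])
    fps = np.cumsum([not tp for _, tp in scored])
    recall = tps / max(n_gt, 1)
    precision = tps / np.maximum(tps + fps, 1)
    ap, prev_r = 0.0, 0.0
    for p, r in zip(precision, recall):                   # area under the PR curve
        ap += p * (r - prev_r)
        prev_r = r
    return ap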

52 Experiments

53

54

55

56

57 Real time motion capture using a single time-of-flight camera. [CVPR 2010]

58 Discussion Accurate proposals – for the 3D locations of body joints – at super real-time rates from single depth images Body part recognition – used as an intermediate representation A highly varied synthetic training set – allows training very deep decision forests – with depth-invariant features, without overfitting

59 Future work Study of the variability in the source mocap data A generative model underlying the synthesis pipeline A similarly efficient approach that – directly regresses joint positions – removes ambiguities in local pose

60 Thank you

