The Kinect body tracking pipeline
Oliver Williams, Mihai Budiu
Microsoft Research, Silicon Valley
With slides contributed by Johnny Lee, Jamie Shotton
NASA Ames, February 14, 2011
Outline
- Hardware overview
- The body tracking pipeline
- Learning a classifier from large data
- Conclusions
What is Kinect?
~2000 people
Caveat: we only have knowledge about a small part of this process.
Input device
The Innards (Source: iFixit)
The vision system (Source: iFixit)
- IR laser projector
- IR camera
- RGB camera
RGB Camera
- Used for face recognition
- Face recognition requires training
- Needs good illumination
The audio sensors
- 4-channel multi-array microphone
- Time-locked with console to remove game audio
PrimeSense Chip
- Xbox Hardware Engineering dramatically improved upon PrimeSense reference design performance
- Micron-scale tolerances on large components
- Manufacturing process to yield ~1 device / 1.5 seconds
Projected IR pattern
Depth computation
Depth map
Kinect video output
- 30 Hz frame rate
- 57° field-of-view
- 8-bit VGA RGB 640 x
- bit monochrome 320 x
Xbox 360 Hardware
- Triple-core PowerPC 970, 3.2 GHz
- Hyperthreaded, 2 threads/core
- 500 MHz ATI graphics card
- DirectX
- MB RAM
- 2005 performance envelope
- Must handle real-time vision AND a modern game
THE BODY TRACKING PIPELINE
Generic Extensible Architecture
- Sensor: raw data
- Expert 1, Expert 2, Expert 3 (stateless): skeleton estimates
- Arbiter (stateful, probabilistic): fuses the hypotheses into the final estimate
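A minimal sketch of the expert/arbiter split, under stated assumptions: the two experts, the confidence-weighted fusion rule, and the `smoothing` parameter are hypothetical stand-ins chosen only to show stateless experts feeding a stateful arbiter. They are not the fusion method used in Kinect.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Estimate = Tuple[float, float]            # (position along one axis, confidence)
Expert = Callable[[List[float]], Estimate]


def mean_expert(raw: List[float]) -> Estimate:
    """Hypothetical stateless expert: mean of the raw samples, low confidence."""
    return sum(raw) / len(raw), 0.5


def median_expert(raw: List[float]) -> Estimate:
    """Hypothetical stateless expert: median of the raw samples, higher confidence."""
    ordered = sorted(raw)
    return ordered[len(ordered) // 2], 1.5


@dataclass
class Arbiter:
    """Stateful: remembers the previous final estimate between frames."""
    smoothing: float = 0.5                # assumed blend weight for the previous estimate
    previous: Optional[float] = None

    def fuse(self, hypotheses: List[Estimate]) -> float:
        # Confidence-weighted average of the experts' hypotheses (assumed rule).
        total = sum(conf for _, conf in hypotheses)
        fused = sum(pos * conf for pos, conf in hypotheses) / total
        # Blend with the previous frame's estimate, if there is one.
        if self.previous is not None:
            fused = self.smoothing * self.previous + (1 - self.smoothing) * fused
        self.previous = fused
        return fused


def run_frame(raw: List[float], experts: List[Expert], arbiter: Arbiter) -> float:
    """Every expert sees the same raw data; only the arbiter carries state."""
    return arbiter.fuse([expert(raw) for expert in experts])
```

Because the experts keep no state, one can be added or removed without touching the others, which is one way to read "extensible" here.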
One Expert: Pipeline Stages
Sensor -> Depth map -> Background segmentation -> Player separation -> Body Part Classifier -> Body Part Identification -> Skeleton
Sample test frames
Constraints
- No calibration
  - no start/recovery pose
  - no background calibration
  - no body calibration
- Minimal CPU usage
- Illumination-independent
The test matrix: body size, hair, body type, clothes, furniture, pets, FOV angle
Preprocessing
- Identify ground plane
- Separate background (couch)
- Identify players via clustering
Two trackers
- Hands + head tracking
- Body tracking
(not exposed through SDK)
The body tracking problem
- Input: depth map
- Classifier
- Output: body parts
- Runs on 320x240
Training the classifier
- Start from ground-truth data
  - depth paired with body parts
- Train classifier to work across
  - pose
  - scene position
  - height, body shape
Getting the Ground Truth (1)
- Use synthetic data (3D avatar model)
- Inject noise
Getting the Ground Truth (2)
Motion Capture:
- Unrealistic environments
- Unrealistic clothing
- Low throughput
Getting the Ground Truth (3)
Manual Tagging:
- Requires training many people
- Potentially expensive
- Tagging tool influences biases in data
- Quality control is an issue
- contractors ~= 20 years
Getting the Ground Truth (4)
Amazon Mechanical Turk:
- Build web-based tool
- Tagging tool is 2D only
- Quality control can be done with redundant HITs
- $0.04/HIT -> 6 $80/hr
Classifying pixels
- Compute $P(c_i \mid w_i)$
  - pixels $i = (x, y)$
  - body part $c_i$
  - image window $w_i$
- Learn classifier $P(c_i \mid w_i)$ from training data
  - randomized decision forests
- Example image windows: window moves with classifier
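A minimal sketch of how a decision forest can turn an image window into a distribution over body parts. The `Leaf`/`Split` types, the left-if-below-threshold convention, and averaging the per-tree leaf distributions are illustrative assumptions; the slide only states that randomized decision forests are used.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Window = List[float]                  # stand-in for an image window w_i
Distribution = Dict[str, float]       # P(c | w) over body-part labels


@dataclass
class Leaf:
    distribution: Distribution        # label frequencies seen at this leaf in training


@dataclass
class Split:
    feature: Callable[[Window], float]
    threshold: float
    left: "Node"                      # taken when feature(window) < threshold
    right: "Node"


Node = Union[Leaf, Split]


def tree_predict(node: Node, window: Window) -> Distribution:
    """Walk from the root to a leaf, testing one feature per split."""
    while isinstance(node, Split):
        node = node.left if node.feature(window) < node.threshold else node.right
    return node.distribution


def forest_predict(trees: List[Node], window: Window) -> Distribution:
    """Average the leaf distributions reached in every tree."""
    leaves = [tree_predict(tree, window) for tree in trees]
    labels = sorted({label for leaf in leaves for label in leaf})
    return {
        label: sum(leaf.get(label, 0.0) for leaf in leaves) / len(leaves)
        for label in labels
    }
```

Each pixel is classified independently, so the same forest can be evaluated for all pixels in parallel.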
Features
- depth of pixel x in image I
- parameter describing offsets u and v
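A minimal sketch of one depth-comparison feature consistent with these definitions. It assumes the form described in the published work of Shotton et al.: the feature is the difference between the depths at two probe pixels, offset from x by u and v, with both offsets divided by the depth at x so the probes cover a similar physical extent at any distance. The function names and the `BACKGROUND` constant are hypothetical.

```python
from typing import List, Tuple

BACKGROUND = 1e6   # assumed large depth returned for probes outside the image


def depth_at(depth: List[List[float]], x: int, y: int) -> float:
    """Depth of pixel (x, y), or BACKGROUND if the pixel is off-image."""
    if 0 <= y < len(depth) and 0 <= x < len(depth[0]):
        return depth[y][x]
    return BACKGROUND


def depth_feature(
    depth: List[List[float]],
    pixel: Tuple[int, int],
    u: Tuple[float, float],
    v: Tuple[float, float],
) -> float:
    """Difference of the depths at two probes offset from `pixel` by u and v."""
    x, y = pixel
    d = depth_at(depth, x, y)

    def probe(offset: Tuple[float, float]) -> float:
        dx, dy = offset
        # Dividing by d shrinks the pixel offset for far-away pixels.
        return depth_at(depth, x + int(round(dx / d)), y + int(round(dy / d)))

    return probe(u) - probe(v)
```

Each feature costs only a few depth lookups and one subtraction, which fits the "minimal CPU usage" constraint stated earlier in the deck.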
From body parts to joint positions
- Compute 3D centroids for all parts
- Generates (position, confidence)/part
- Multiple proposals for each body part
- Done on GPU
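A minimal CPU sketch of the centroid step, under simplifying assumptions: each classified pixel carries a 3D position, a part label, and a probability; the confidence of a part is taken to be the summed probability; and only one proposal per part is produced, whereas the slide mentions multiple proposals computed on the GPU. The name `part_centroids` is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Point = Tuple[float, float, float]


def part_centroids(
    pixels: Iterable[Tuple[Point, str, float]],
) -> Dict[str, Tuple[Point, float]]:
    """Map each body part to (probability-weighted 3D centroid, confidence)."""
    # Per part: weighted sums of x, y, z and the total weight.
    sums = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for (x, y, z), part, prob in pixels:
        acc = sums[part]
        acc[0] += prob * x
        acc[1] += prob * y
        acc[2] += prob * z
        acc[3] += prob
    return {
        part: ((sx / w, sy / w, sz / w), w)
        for part, (sx, sy, sz, w) in sums.items()
        if w > 0
    }
```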
From joint positions to skeleton
- Tree model of skeleton topology
- Has cost terms for:
  - Distances between connected parts (relative to "body size")
  - Bone proximity to body parts
  - Motion terms for smoothness
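A minimal sketch of a cost with the three kinds of terms listed above. The squared-error form of each term, the equal weighting, and the use of joint-to-proposal distance as a stand-in for "bone proximity to body parts" are all assumptions made for illustration; `skeleton_cost` and its parameters are hypothetical names.

```python
import math
from typing import Dict, List, Optional, Tuple

Point = Tuple[float, float, float]


def skeleton_cost(
    joints: Dict[str, Point],
    bones: List[Tuple[str, str, float]],     # (joint a, joint b, expected length / body size)
    proposals: Dict[str, Point],             # body-part proposal for each joint
    previous: Optional[Dict[str, Point]],    # joint positions in the previous frame
    body_size: float,
) -> float:
    # Term 1: connected parts should be the expected distance apart,
    # measured relative to body size.
    length_cost = sum(
        (math.dist(joints[a], joints[b]) / body_size - expected) ** 2
        for a, b, expected in bones
    )
    # Term 2: each joint should stay close to its body-part proposal.
    proximity_cost = sum(
        math.dist(joints[name], proposals[name]) ** 2 for name in joints
    )
    # Term 3: penalize large jumps between consecutive frames.
    motion_cost = 0.0
    if previous is not None:
        motion_cost = sum(
            math.dist(joints[name], previous[name]) ** 2 for name in joints
        )
    return length_cost + proximity_cost + motion_cost
```

A candidate skeleton with lower cost is preferred; because the topology is a tree, the bone terms only couple each joint to its parent.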
Where is the skeleton?
LEARNING THE BODY PARTS CLASSIFIER FROM A MOUNTAIN OF DATA
Learn from Data
Training examples -> Machine learning -> Classifier
Cluster-based training
Training examples -> Machine learning (Dryad, DryadLINQ) -> Classifier
- > Millions of input frames
- > objects manipulated
- Sparse, multi-dimensional data
- Complex datatypes (images, video, matrices, etc.)
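Data-Parallel Computation
Layers: Application, Language, Execution, Storage
- Parallel Databases: application SQL
- Map-Reduce: application Sawzall, Java; language Sawzall, FlumeJava; storage GFS, BigTable
- Hadoop: application ≈SQL; language Pig, Hive; storage HDFS, S3
- Dryad: application LINQ, SQL; language DryadLINQ, Scope; storage Cosmos, Azure, SQL Server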
Dryad = 2-D Piping
- Unix Pipes: 1-D
  grep | sed | sort | awk | perl
- Dryad: 2-D
  grep^1000 | sed^500 | sort^1000 | awk^500 | perl^50
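A minimal, single-process sketch of the 2-D idea, reading the number attached to each command as the count of parallel instances of that stage (an assumption). Each stage is a function applied independently to every partition, and data is repartitioned between stages. The round-robin partitioning and the names `partition` and `run_2d_pipeline` are hypothetical; this does not model Dryad's scheduling, channels, or fault tolerance.

```python
from typing import Callable, List, Sequence, Tuple

Vertex = Callable[[List[int]], List[int]]


def partition(items: Sequence[int], width: int) -> List[List[int]]:
    """Split items round-robin into `width` partitions."""
    parts: List[List[int]] = [[] for _ in range(width)]
    for index, item in enumerate(items):
        parts[index % width].append(item)
    return parts


def run_2d_pipeline(items: Sequence[int], stages: List[Tuple[Vertex, int]]) -> List[int]:
    """Run each stage as `width` independent instances, one per partition."""
    current = list(items)
    for vertex, width in stages:
        # On a cluster these calls would run on different machines;
        # here they run one after another.
        outputs = [vertex(part) for part in partition(current, width)]
        current = [item for out in outputs for item in out]
    return current


def keep_even(part: List[int]) -> List[int]:
    return [n for n in part if n % 2 == 0]


def square(part: List[int]) -> List[int]:
    return [n * n for n in part]


stages = [(keep_even, 4), (square, 2), (sorted, 1)]
result = run_2d_pipeline(range(10), stages)
```

The first dimension is the chain of stages, as in a Unix pipe; the second is the width of each stage.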
Virtualized 2-D Pipelines
- 2D
- DAG
- multi-machine
- virtualized
Fault Tolerance
LINQ
Dryad => DryadLINQ
LINQ = .NET + Queries
    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key k);
    // Keep only legal entries and project each to (hash, value).
    var results = from c in collection
                  where IsLegal(c.key)
                  select new { hash = Hash(c.key), c.value };
DryadLINQ Data Model
- Partition
- Collection
- .NET objects
DryadLINQ = LINQ + Dryad
    Collection<T> collection;
    bool IsLegal(Key k);
    string Hash(Key k);
    // Keep only legal entries and project each to (hash, value).
    var results = from c in collection
                  where IsLegal(c.key)
                  select new { hash = Hash(c.key), c.value };
Diagram labels: C# (collection, results); query plan (Dryad job); C# vertex code; data
Language Summary
- Where
- Select
- GroupBy
- OrderBy
- Aggregate
- Join
Highly efficient parallelization
[Figure: axes labelled time and machine]
CONCLUSIONS
Huge Commercial Success
Tremendous Interest from Developers
Consumer Technologies Push The Envelope
- Price: $6000
- Price: $150
Unique Opportunity for Technology Transfer
I can finally explain to my son what I do for a living…