Temporal Order-Preserving Dynamic Quantization for Human Action Recognition from Multimodal Sensor Streams. Jun Ye, Kai Li, Guo-Jun Qi, Kien A. Hua. University of Central Florida
Outline Background Problem, existing methods, challenges Our algorithm Dynamic Temporal Quantization Multimodal Feature Fusion Performance study MSR-Action3D UTKinect-Action MSR-ActionPairs Conclusions
Background Depth sensors have become affordable and popular New human-computer interaction: gesture recognition, speech recognition Application domains: video games, education, business, healthcare
Problem and Challenges Key problem: modeling the temporal dynamics of 3D human actions/gestures Existing methods: histogram-based methods do not preserve temporal order (bag-of-3D-words [5, 21], HOJ3D [16], HON4D [9]); temporal modeling methods suffer from video misalignment (motion template [7, 20], temporal pyramid [9, 14]) Challenge: temporal misalignment due to temporal translation and execution rate variation
Dynamic Temporal Quantization Algorithm Objective: model the temporal patterns of 3D actions according to the transitions between sub-actions, subject to two constraints: (1) frames with similar postures are clustered together (sub-action constraint); (2) the temporal order of the sequence is preserved (order-preserving constraint)
Dynamic Temporal Quantization Quantization: a video X1, X2, …, Xn of variable length n → quantized vectors V1, V2, …, Vm of fixed length m. The frame assignment a maps each frame to one of the quantized vectors. The optimal quantization is obtained by jointly optimizing a and V under the objective function
Dynamic Temporal Quantization (cont’d) It is nontrivial to solve jointly for the frame assignment a and the quantized vectors V, so the two are updated alternately. Initialization: uniform partition. Aggregation step: given a fixed assignment a, each vj is computed by aggregating the frames assigned to it. Assignment step: given fixed quantized vectors V, update the assignment a by dynamic time warping (DTW). Iterate until convergence.
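The alternating procedure above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation, and it rests on assumptions the slides do not state: aggregation is taken to be the mean of the assigned frames, the frame-to-vector cost is squared Euclidean distance, and the DTW-style assignment lets each frame either stay in the current segment or advance by exactly one, so every segment receives at least one frame. The function names (`aggregate`, `assign_frames`, `dynamic_quantize`) are hypothetical.

```python
def sq_dist(x, v):
    """Squared Euclidean distance between two equal-length vectors (assumed cost)."""
    return sum((a - b) ** 2 for a, b in zip(x, v))


def aggregate(frames, assign, m):
    """Aggregation step: mean of the frames assigned to each segment (assumed)."""
    dim = len(frames[0])
    sums = [[0.0] * dim for _ in range(m)]
    counts = [0] * m
    for x, j in zip(frames, assign):
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += x[d]
    return [[s / counts[j] for s in sums[j]] for j in range(m)]


def assign_frames(frames, V):
    """Assignment step: order-preserving alignment by dynamic programming.

    Frame 0 maps to segment 0, the last frame to the last segment, and the
    segment index never decreases and never skips a segment.
    """
    n, m = len(frames), len(V)
    inf = float("inf")
    D = [[inf] * m for _ in range(n)]      # accumulated cost
    back = [[0] * m for _ in range(n)]     # predecessor segment index
    D[0][0] = sq_dist(frames[0], V[0])
    for i in range(1, n):
        for j in range(min(i + 1, m)):
            cost = sq_dist(frames[i], V[j])
            stay = D[i - 1][j]
            step = D[i - 1][j - 1] if j > 0 else inf
            if step < stay:
                D[i][j], back[i][j] = step + cost, j - 1
            else:
                D[i][j], back[i][j] = stay + cost, j
    # Backtrack from the last frame in the last segment.
    assign = [0] * n
    j = m - 1
    for i in range(n - 1, -1, -1):
        assign[i] = j
        j = back[i][j]
    return assign


def dynamic_quantize(frames, m, max_iter=50):
    """Alternate aggregation and assignment until the assignment stops changing."""
    n = len(frames)
    if n < m:
        raise ValueError("need at least as many frames as quantized vectors")
    assign = [i * m // n for i in range(n)]   # initialization: uniform partition
    for _ in range(max_iter):
        V = aggregate(frames, assign, m)
        new_assign = assign_frames(frames, V)
        if new_assign == assign:
            break
        assign = new_assign
    return aggregate(frames, assign, m), assign


# Toy 1-D "video": three sub-actions of unequal duration.
frames = [[0.0], [0.1], [5.0], [5.1], [5.2], [9.0]]
V, assign = dynamic_quantize(frames, m=3)
```

On this toy sequence the uniform partition initially splits the middle sub-action across two segments; the assignment step moves the segment boundary so that the three similar frames end up together while the frame order is kept.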
Hierarchical representation Multilayers of the Dynamic Quantization Top layers: global temporal patterns Bottom layers: local temporal patterns Concatenate all layers
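As a rough sketch of the multilayer idea, the snippet below quantizes the same sequence at several resolutions and concatenates the results. The slide does not say how many quantized vectors each layer uses; the doubling scheme here (layer k uses 2^k vectors, starting from a single vector at the top) is an assumption, as are the names `uniform_quantize` and `hierarchical_descriptor`. A simple uniform-partition quantizer stands in for the dynamic quantizer so that the example is self-contained; any function with the same signature could be passed instead.

```python
def uniform_quantize(frames, m):
    """Stand-in quantizer: mean of each of m equal-width temporal segments."""
    n, dim = len(frames), len(frames[0])
    if n < m:
        raise ValueError("need at least as many frames as segments")
    out = []
    for j in range(m):
        seg = [frames[i] for i in range(n) if i * m // n == j]
        out.append([sum(x[d] for x in seg) / len(seg) for d in range(dim)])
    return out


def hierarchical_descriptor(frames, num_layers, quantize=uniform_quantize):
    """Concatenate quantized vectors from coarse (top) to fine (bottom) layers.

    Layer k (0-based) uses 2**k quantized vectors; this doubling is an
    assumption made for illustration only.
    """
    descriptor = []
    for layer in range(num_layers):
        for v in quantize(frames, 2 ** layer):
            descriptor.extend(v)
    return descriptor


frames = [[1.0], [2.0], [3.0], [4.0]]
desc = hierarchical_descriptor(frames, num_layers=3)
# Top layer: one global mean; bottom layer: one vector per frame.
```

With this scheme the descriptor length is the frame dimension times (2^L - 1) for L layers, so it is fixed regardless of the video length.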
Multimodal Feature Fusion Multimodal features: joint coordinate, pairwise angle, joint offset [21], histogram of velocity components (HVC) Supervised learning for all quantized vectors: multiclass SVM Fusion by regression (softmax)
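The fusion step can be illustrated with a small late-fusion sketch: each modality's classifier produces one score per class, the scores are combined with per-modality weights, and a softmax turns the result into class probabilities. The slide only states that fusion is done by (softmax) regression; the weighted-sum form, the hand-picked weights, and the score values below are all hypothetical, and in practice the weights would be learned from training data rather than set by hand.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    mx = max(z)
    exps = [math.exp(v - mx) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(scores_by_modality, weights):
    """Weighted sum of per-modality class scores, followed by softmax.

    scores_by_modality: dict mapping modality name -> list of class scores
    weights: dict mapping modality name -> scalar weight (assumed learned elsewhere)
    """
    num_classes = len(next(iter(scores_by_modality.values())))
    combined = [0.0] * num_classes
    for name, scores in scores_by_modality.items():
        for c in range(num_classes):
            combined[c] += weights[name] * scores[c]
    return softmax(combined)


# Hypothetical per-modality SVM decision scores for a 3-class problem.
scores = {
    "position": [2.0, 0.5, -1.0],
    "angle": [0.2, 0.9, -0.5],
    "offset": [0.1, 0.3, 0.0],
    "velocity": [1.0, 1.2, -0.3],
}
weights = {"position": 1.0, "angle": 0.5, "offset": 0.3, "velocity": 0.8}
probs = fuse(scores, weights)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

Here two modalities favor the second class, but the more heavily weighted position scores tip the fused prediction to the first class, which is the kind of trade-off a learned fusion is meant to resolve.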
Experiments Experiments on three public 3D human action datasets MSR-Action3D UTKinect-Action MSR-ActionPairs
Experiment: dynamic quantization vs. deterministic quantization
Dynamic quantization outperforms deterministic quantization. Accuracy on the MSR-Action3D dataset:

Feature     Dynamic quantization    Deterministic quantization
position    81.61%                  76.24%
angle       73.95%                  71.65%
offset      68.20%
velocity    80.84%                  72.80%
fused       90.42%                  83.15%

Similar performance is observed on the other two datasets.
Experiment: hierarchical representation
MSR-Action3D dataset with the joint coordinate feature:

Layers      1         2         3         4         5
Accuracy    66.28%    67.82%    71.26%    81.61%    77.39%

More layers generally yield higher accuracy, but too many layers can overfit: accuracy peaks at 81.61% with 4 layers and drops to 77.39% with 5.
Experiment: Comparison with state-of-the-art results

MSR-Action3D dataset
Method                                        Accuracy
Actionlet Ensemble [14]                       88.2%
HON4D [9]                                     88.89%
DCSF [15]                                     89.3%
Lie Group [13]                                89.48%
Super Normal Vector [18]                      93.09%
Proposed method                               90.42%

MSR-ActionPairs dataset
Method                                        Accuracy
Actionlet Ensemble [14]                       82.22%
HON4D [9]                                     93.33%
HON4D + Ddisc [9]                             96.67%
Super Normal Vector [18]                      98.89%
Proposed method                               93.71%

UTKinect-Action dataset (100% accuracy)
Method                                        Accuracy
Histogram of 3D joints [17]                   90.92%
Combined features with random forest [21]     91.9%
Lie Group [13]                                97.08%
Proposed method                               100%
Conclusions A novel algorithm for 3D human action sequence recognition from the perspective of dynamic temporal quantization. Extensive experiments on three public datasets demonstrate the effectiveness of the proposed technique for temporal modeling.
Thank you. Questions?