
1 SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks
Paper by John McCormac, Ankur Handa, Andrew Davison, and Stefan Leutenegger (Dyson Robotics Lab, Imperial College London). Presentation by Chris Conte.

2 Hey robot, go fetch me a Twix from the snack bar
In order to understand most commands, computers need to know two aspects of their surroundings:
- Where things are
- What things are
Motivation for the divide: it is a practical way to break up the task, and humans do this too!

3 Where am I? Where is everything?
This problem is often referred to as Simultaneous Localization and Mapping (SLAM). Various algorithms convert 2D video into a dense map of all of the surfaces and their positions within a scene, including the position of the camera. Challenges include loop closure, non-static objects, and distance (perspective hardly changes for faraway objects). Here, ElasticFusion is used.

4 ElasticFusion in Action
ElasticFusion creates a set of surface elements ("surfels"), each characterized by a 3D position, a normal, a color, a weight, a radius, an initialization timestamp t0, and a last-updated timestamp t. Note: these are not complete objects. Each surfel represents the local surface area around a point, while minimizing visual holes.
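As a rough sketch of the per-surfel record described above (a minimal Python sketch; the field names and types are my own illustrative assumptions, not ElasticFusion's actual code):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Surfel:
    """Illustrative surfel record; fields follow the slide's list,
    but names and types are assumptions, not ElasticFusion's code."""
    position: np.ndarray  # 3D position of the point, shape (3,)
    normal: np.ndarray    # unit surface normal, shape (3,)
    color: np.ndarray     # RGB color, shape (3,)
    weight: float         # confidence accumulated over observations
    radius: float         # disc radius covering the local surface area
    t0: int               # initialization timestamp (frame index)
    t: int                # last-updated timestamp (frame index)
```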

5 What? A semantic map is one based on concepts: the scene contains a chair and a desk. Previous work has been concerned with identifying specific object instances within the surface elements. This paper instead uses a Convolutional Neural Network to assign general class labels.

6 Why Use a CNN?
- Very effective because of a high degree of generalizability: instead of identifying one make/model of chair, it can be trained to identify chairs of all kinds.
- It learns its own feature weights, saving engineering time and, ideally, finding the optimal configuration.
- However, it requires a lot of computing power.
- It also requires a lot of labelled training data, one thing this paper did not have in abundance (<1,000 images).

7 Paper investigates two CNNs
The first is from Eigen et al.: a pre-trained segmentation network, augmented to take depth input and trained on the (complete) NYU dataset. It includes three separate scales at which to detect objects, as well as an (unused here) depth approximation.

8 The second is from Noh et al. The group also trained a CNN that takes, for each pixel, a depth value provided by the localization-and-mapping system alongside the R, G, and B values. The RGB weights were initialized from the original paper's weights; the depth weights were initialized to the average of those RGB weights, and the network was then trained on the NYUv2 dataset for 10k iterations over 2 days. RGB-D segmentation alone showed an average class accuracy of 43.6%.
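A minimal sketch of that depth-channel initialization, assuming the first convolutional layer is stored as a (filters, channels, height, width) array; the shapes and names are illustrative:

```python
import numpy as np

# Stand-in for the pretrained first-layer RGB weights:
# one kernel slice per R, G, B input channel.
rgb_weights = np.random.randn(64, 3, 7, 7).astype(np.float32)

# Initialize the new depth channel as the average of the RGB kernels,
# then stack it on to form a 4-channel (RGB-D) first layer to fine-tune.
depth_weights = rgb_weights.mean(axis=1, keepdims=True)              # (64, 1, 7, 7)
rgbd_weights = np.concatenate([rgb_weights, depth_weights], axis=1)  # (64, 4, 7, 7)
```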

9 Merge the what and the where
We can combine these two processes in a way that results in a semantic understanding of objects in a 3D space, allowing us to solve complex problems involving motion and identification. We also expect to see some improvements for segmentation itself: a 3D spatial element, seen from many viewpoints, can be identified more easily than a 2D contour in a single image. How do we do this?

10 Recursive Naive Bayes
The probability of element X belonging to an object class A over the frames observed so far is proportional to the product, across all frames i, of the probability that X is an instance of A in frame i. This yields a probability distribution P(O = oi) over all class labels oi in O, the set of all classes. It is "naive" in that it assumes independence between adjacent object elements.
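A minimal sketch of that per-surfel update (names are illustrative): each new frame's CNN output multiplies into the stored distribution, which is then renormalized.

```python
import numpy as np

def bayes_update(surfel_probs: np.ndarray, cnn_probs: np.ndarray) -> np.ndarray:
    """Recursive naive-Bayes update for one surfel.

    surfel_probs: stored distribution over the class set O, shape (num_classes,)
    cnn_probs:    CNN class probabilities for the pixel this surfel
                  projects to in the new frame, shape (num_classes,)
    """
    updated = surfel_probs * cnn_probs  # per-frame independence assumption
    return updated / updated.sum()      # renormalize to a valid distribution

# One confident observation sharpens a uniform prior:
prior = np.full(4, 0.25)
obs = np.array([0.7, 0.1, 0.1, 0.1])
print(bayes_update(prior, obs))  # -> [0.7 0.1 0.1 0.1]
```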

11 The advantage of recognizing dependence
If we see grass below a patch of white, that patch is more likely a sheep or a cow. How can we exploit this?

12 Conditional Random Fields
A probabilistic approach to modelling dependence between connected elements. We create a graph, with each node being a surface element and with Gaussian edge potentials between nodes. At each regularization step, we calculate the energy of assigning each class to each surface element.

13 Energy Function
E(x) = \sum_i \psi_u(x_i) + \sum_{i<j} \psi_p(x_i, x_j), where \psi_u(x_i) is the unary data term (the per-surfel class probabilities) and \psi_p(x_i, x_j) is the pairwise smoothness term (the Gaussian edge potentials).
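The slides do not give the exact kernels or parameter values, so the following is a minimal sketch assuming a Potts label-compatibility model and a single Gaussian kernel over surfel position and color; every name and constant here is illustrative, not the paper's implementation.

```python
import numpy as np

def crf_energy(labels, unary, positions, colors, w=1.0, theta_p=0.05, theta_c=20.0):
    """Energy of one labeling of the surfel graph.

    labels:    class index per surfel, shape (n,)
    unary:     negative log class probabilities, shape (n, num_classes)
    positions: surfel 3D positions, shape (n, 3)
    colors:    surfel RGB colors, shape (n, 3)
    """
    n = len(labels)
    energy = unary[np.arange(n), labels].sum()  # unary data term
    for i in range(n):
        for j in range(i + 1, n):               # pairwise smoothness term
            if labels[i] != labels[j]:          # Potts: penalize disagreeing pairs
                d_pos = np.sum((positions[i] - positions[j]) ** 2)
                d_col = np.sum((colors[i] - colors[j]) ** 2)
                energy += w * np.exp(-d_pos / (2 * theta_p ** 2)
                                     - d_col / (2 * theta_c ** 2))
    return energy
```

Minimizing this energy (in practice via approximate inference rather than enumeration) smooths labels across nearby, similar-looking surfels.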

14 Improvement was seen with both state-of-the-art and pre-trained CNNs
Fusion improved the Eigen network's class average accuracy from 59.9% to 63.2% (+3.3%); the RGB-D CNN improved by +2.3%. This confirms the initial hypothesis that contour information can increase the accuracy of segmentation. Success!

15 Conditional Random Field regularization results were inconclusive
It added a massive amount of computational complexity but contributed only a marginal gain of 0.5%. This is partially explained by the lack of parameter optimization.

16 Semantic segmentation does not need to run every frame
In practice, the CNN does not need to run very often to see improvement over the current single-frame literature (as little as once every 5 seconds). See the sketch below.
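A minimal sketch of that schedule (all names here are illustrative stand-ins, not the paper's code): SLAM fuses every frame, while the CNN and Bayes update fire only every N-th frame.

```python
from typing import Iterator, Tuple
import numpy as np

CNN_INTERVAL = 150  # at ~30 fps, roughly one CNN pass every 5 seconds

def camera_stream(n_frames: int = 450) -> Iterator[Tuple[np.ndarray, np.ndarray]]:
    """Stand-in frame source yielding (rgb, depth) pairs."""
    for _ in range(n_frames):
        yield np.zeros((480, 640, 3), np.uint8), np.zeros((480, 640), np.float32)

for frame_idx, (rgb, depth) in enumerate(camera_stream()):
    # SLAM tracking and surfel fusion would run here on every frame.
    if frame_idx % CNN_INTERVAL == 0:
        # Only now run the expensive CNN and fold its per-pixel class
        # probabilities into the per-surfel distributions (Bayes update).
        print(f"frame {frame_idx}: run semantic segmentation + fusion")
```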

17 Experimental Conclusions
- Semantic segmentation only needs to happen at a rate of around 1 Hz.
- The approach surpassed previously published semantic segmentation systems.
- Some videos showed clearer improvements from the combination than others.
- Run time requires some compromises.

18 Videos with close objects and variation in perspective resulted in the best classification
The eventual classifications were dependent on the camera being able to understand 3D space first. We can see that some classifications were harmed compared to their 2D counterparts. Possible extension: use CNN predictions to improve ElasticFusion?
(Figure: qualitative comparison with panels Input, Ground Truth, Single Frame, and SemanticFusion.)

19 Computational power is a limiting factor
- The SLAM system requires 29.3 ms per frame, i.e. about 34 Hz.
- The CRF runs at 0.05 Hz (20 seconds per frame) on a CPU.
- The CNN and Bayesian updates run at 10 Hz on an optimized GPU.
- New work on compressed CNNs could bring the time required down; however, we are far from real-time execution on a computer costing less than $1,000.

20 Sources
T. Whelan, S. Leutenegger, R. F. Salas-Moreno, B. Glocker, and A. J. Davison, "ElasticFusion: Dense SLAM without a pose graph," in Proceedings of Robotics: Science and Systems (RSS), 2015.
R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. J. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013.
D. Eigen and R. Fergus, "Predicting Depth, Surface Normals and Semantic Labels with a Common Multi-Scale Convolutional Architecture," in Proceedings of the International Conference on Computer Vision (ICCV), 2015.
J. McCormac, A. Handa, A. J. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D Semantic Mapping with Convolutional Neural Networks," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2017.

21 Video of the system in action!

