FaceNet A Unified Embedding for Face Recognition and Clustering

Coral Sharoni Tal Sheffer

Overview
Face Recognition, Related Work, FaceNet (applications, network architecture, triplet loss, mini-batch), Datasets, Experiments & Results, Conclusion

Face Recognition
Face recognition: a technology capable of identifying or verifying a person from a digital image or a video frame.
Why?
Face ID (Apple) - biometric authentication
Automatic tagging
Security
…

Chinese man caught by facial recognition at pop concert
Chinese police have used facial recognition technology to locate and arrest a man who was among a crowd of 60,000 concert goers. Police said the man, wanted for "economic crimes", was "shocked" when he was caught. And it is not the first time: police in China arrested 25 suspects using a facial recognition system that was set up at the International Beer Festival.

Related Work
FaceNet is based on two different deep network architectures:
The first architecture is based on the Zeiler & Fergus model. It consists of multiple interleaved layers of convolutions, non-linear activations, local response normalizations, and max pooling layers.
The second architecture is based on the Inception model of Szegedy et al. It uses mixed layers that run several different convolutional and pooling layers in parallel and concatenate their responses. It was recently used as the winning approach for ImageNet.
Both architectures have been used to great success in the computer vision community.

FaceNet
A unified system for:
Face verification - is this the same person?
Face recognition - who is this person?
Face clustering - find common people among these faces.

Facial recognition technology reunites lost man with his family
A mentally ill Chinese man who had been missing for over a year was reunited with his family after being identified by China's vast facial recognition surveillance network. Hospital officials had been unable to identify the man before the facial recognition firm assisted.

FaceNet
The FaceNet method is based on learning a Euclidean embedding per image using a deep convolutional network. The network is trained such that the squared L2 distances in the embedding space directly correspond to face similarity:
Faces of the same person have small distances.
Faces of distinct people have large distances.
In the illustrated example, a threshold of 1.1 would classify every pair correctly.

FaceNet – face clustering, recognition and verification
Once the embedding space has been produced, the aforementioned tasks become straightforward:
Face verification - thresholding the distance between the two embeddings.
Face recognition - becomes a k-NN classification problem.
Face clustering - can be achieved using simple techniques such as k-means.
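As a minimal sketch of how verification and recognition reduce to distance computations, assuming embeddings are already available as plain lists of floats. The function names (`squared_l2`, `same_person`, `recognize`), the toy 2-D vectors, and the choice k=3 are illustrative assumptions, not part of FaceNet; the default threshold of 1.1 is simply the value from the slide's example.

```python
from collections import Counter


def squared_l2(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def same_person(emb_a, emb_b, threshold=1.1):
    """Face verification: accept the pair if its distance is within the threshold."""
    return squared_l2(emb_a, emb_b) <= threshold


def recognize(query, gallery, k=3):
    """Face recognition as k-NN: majority label among the k closest gallery entries.

    gallery is a list of (embedding, label) pairs.
    """
    nearest = sorted(gallery, key=lambda item: squared_l2(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy 2-D embeddings (hypothetical; real embeddings are higher-dimensional).
gallery = [
    ([1.0, 0.0], "alice"),
    ([0.98, 0.199], "alice"),
    ([0.0, 1.0], "bob"),
    ([0.199, 0.98], "bob"),
]
query = [0.995, 0.0998]

print(same_person(query, gallery[0][0]))
print(recognize(query, gallery, k=3))
```

Clustering would follow the same pattern, running k-means (or a similar method) directly on the embedding vectors.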

FaceNet – Network Architecture
The network consists of a batch input layer and a deep CNN followed by L2 normalization, which results in the face embedding. During training, the embedding is then fed to the triplet loss.
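The L2 normalization step can be sketched as follows. This is only an illustration of the normalization itself, not of the network; `l2_normalize`, the `eps` guard, and the toy vectors are assumptions made for the example.

```python
import math


def l2_normalize(vector, eps=1e-12):
    """Scale a vector to unit Euclidean length (eps guards against division by zero)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / max(norm, eps) for x in vector]


def squared_l2(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


raw_output = [3.0, 4.0]          # hypothetical raw CNN output
embedding = l2_normalize(raw_output)
print(embedding)                  # unit-length embedding

# Two unit vectors pointing in opposite directions are as far apart as possible.
print(squared_l2(l2_normalize([1.0, 0.0]), l2_normalize([-2.0, 0.0])))
```

Because every embedding has unit length, squared distances between embeddings always fall between 0 and 4, which is what makes a fixed distance threshold meaningful.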

Triplet loss
The triplet loss:
Minimizes the distance between an anchor and a positive (both of which have the same identity).
Maximizes the distance between the anchor and a negative (of a different identity).

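Triplet loss
For all possible triplets in the training set, we want:
$\| f(x_i^{anchor}) - f(x_i^{positive}) \|_2^2 + \alpha < \| f(x_i^{anchor}) - f(x_i^{negative}) \|_2^2$
Assuming that we have N triplets, the loss function to minimize becomes:
$L = \sum_{i}^{N} \left[ \| f(x_i^{anchor}) - f(x_i^{positive}) \|_2^2 - \| f(x_i^{anchor}) - f(x_i^{negative}) \|_2^2 + \alpha \right]_+$
Here $\alpha$ is the margin enforced between positive and negative pairs, and $[z]_+ = \max(z, 0)$.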
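A minimal sketch of this loss on precomputed embeddings, assuming the hinge form used in the FaceNet paper (each triplet contributes only when it violates the margin). The names `squared_l2` and `triplet_loss` and the toy vectors are illustrative; the default margin of 0.2 is the value quoted later in these slides.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchors, positives, negatives, alpha=0.2):
    """Sum over triplets of max(d(a, p) - d(a, n) + alpha, 0)."""
    total = 0.0
    for a, p, n in zip(anchors, positives, negatives):
        total += max(squared_l2(a, p) - squared_l2(a, n) + alpha, 0.0)
    return total


# Triplet 1 already satisfies the margin; triplet 2 has positive and negative swapped.
anchors = [[1.0, 0.0], [1.0, 0.0]]
positives = [[0.8, 0.6], [0.0, 1.0]]
negatives = [[0.0, 1.0], [0.8, 0.6]]

print(triplet_loss(anchors, positives, negatives))
```

Only the second triplet contributes here: its anchor-positive distance (2.0) exceeds its anchor-negative distance (0.4), so it adds roughly 2.0 - 0.4 + 0.2 = 1.8 to the loss.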

Triplet set
Generating all possible triplets would result in many triplets that are easily satisfied. These triplets would not contribute to the training and would result in slower convergence.
To ensure fast convergence, it is crucial to select triplets that violate the triplet constraint.
This means that, given $x_i^{anchor}$, the optimal selection is:
A 'hard positive' $x_i^{p}$ such that $\| f(x_i^{anchor}) - f(x_i^{p}) \|_2^2$ is maximal.
A 'hard negative' $x_i^{n}$ such that $\| f(x_i^{anchor}) - f(x_i^{n}) \|_2^2$ is minimal.

Triplet set
It is inefficient, and sometimes infeasible, to compute the minimum and maximum across the whole training set.
The proposed solution: generate triplets online, selecting the hard positive/negative exemplars from within a mini-batch.
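The in-batch selection can be sketched as below, assuming the embeddings of one mini-batch and their identity labels are available as lists. `mine_hard_triplets` and the toy batch are hypothetical; the sketch implements only the hard positive / hard negative rule stated on these slides.

```python
def squared_l2(u, v):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def mine_hard_triplets(embeddings, labels):
    """Return (anchor, positive, negative) index triplets mined inside one batch."""
    triplets = []
    for a, label in enumerate(labels):
        positives = [i for i, l in enumerate(labels) if l == label and i != a]
        negatives = [i for i, l in enumerate(labels) if l != label]
        if not positives or not negatives:
            continue  # an anchor needs at least one positive and one negative
        # Hard positive: same identity, farthest from the anchor.
        hard_pos = max(positives, key=lambda i: squared_l2(embeddings[a], embeddings[i]))
        # Hard negative: different identity, closest to the anchor.
        hard_neg = min(negatives, key=lambda i: squared_l2(embeddings[a], embeddings[i]))
        triplets.append((a, hard_pos, hard_neg))
    return triplets


embeddings = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8], [0.0, 1.0], [-1.0, 0.0]]
labels = ["A", "A", "A", "B", "C"]
print(mine_hard_triplets(embeddings, labels))
```

Identities "B" and "C" have a single face each in this toy batch, so they can serve as negatives but never as anchors, which is one reason each identity needs several exemplars per mini-batch.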

Mini-batch
The mini-batches are on the order of a few thousand exemplars.
For the in-batch distances to be meaningful, a minimal number of exemplars of any one identity must be present in each mini-batch.
In the experiments, around 40 faces are selected per identity per mini-batch.
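One way such a batch could be assembled is sketched below, assuming the training set is a dictionary from identity to a list of image references. The function `sample_batch`, its parameters, and the synthetic file names are assumptions for illustration; only the figure of about 40 faces per identity comes from the slide.

```python
import random


def sample_batch(images_by_identity, identities_per_batch, faces_per_identity=40, seed=None):
    """Pick some identities, then up to faces_per_identity images for each of them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(images_by_identity), identities_per_batch)
    batch = []
    for identity in chosen:
        images = images_by_identity[identity]
        take = min(faces_per_identity, len(images))
        batch.extend((image, identity) for image in rng.sample(images, take))
    return batch


# Hypothetical dataset: 5 identities with 50 image names each.
dataset = {f"id{p}": [f"id{p}_img{k}.jpg" for k in range(50)] for p in range(5)}
batch = sample_batch(dataset, identities_per_batch=3, seed=0)
print(len(batch))  # 3 identities x 40 faces
```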

Deep Convolutional Networks
The CNN is trained using Stochastic Gradient Descent (SGD) with standard backprop.
The learning rate mostly starts at 0.05 and is lowered to finalize the model.
The models are trained on a CPU cluster for hours.
The margin $\alpha$ is set to 0.2.
Two types of architecture are used; in practice they differ mainly in the number of parameters and FLOPS.

Datasets and Evaluation
The experiment datasets:
Labeled Faces in the Wild (LFW) - the academic test set for face verification.
YouTube Faces - a newer dataset that has gained popularity in the face recognition community.
Hold-out test set - one million images with the same distribution as the training set.
Personal photos - collections with a total of around 12k images, manually verified to have very clean labels.

Datasets and Evaluation
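Given a pair of two face images:
True accepts TA(d) - same-identity pairs correctly classified as same at threshold d.
False accepts FA(d) - different-identity pairs incorrectly classified as same at threshold d.
Validation rate and false accept rate:
$VAL(d) = \frac{|TA(d)|}{|P_{same}|}, \quad FAR(d) = \frac{|FA(d)|}{|P_{diff}|}$
where $P_{same}$ is the set of all pairs of the same identity and $P_{diff}$ is the set of all pairs of different identities.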
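A short sketch of how these two rates could be computed from a list of labeled pair distances. The function name `val_far`, the input layout, and the toy distances are assumptions for illustration.

```python
def val_far(pairs, threshold):
    """Compute (VAL, FAR) at a distance threshold.

    pairs is a list of (squared_distance, is_same_identity) tuples.
    """
    same = [d for d, is_same in pairs if is_same]
    diff = [d for d, is_same in pairs if not is_same]
    true_accepts = sum(d <= threshold for d in same)    # |TA(d)|
    false_accepts = sum(d <= threshold for d in diff)   # |FA(d)|
    return true_accepts / len(same), false_accepts / len(diff)


# Hypothetical distances: four same-identity pairs, four different-identity pairs.
pairs = [
    (0.3, True), (0.9, True), (1.4, True), (0.5, True),
    (1.0, False), (1.8, False), (2.2, False), (1.6, False),
]
val, far = val_far(pairs, threshold=1.1)
print(val, far)  # 0.75 0.25
```

Sweeping the threshold d and plotting VAL against FAR gives the kind of ROC curve shown in the results slides.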

Experiments
Training data: 100M to 200M training faces covering about 8M different identities.
Input sizes range from 96 x 96 to 224 x 224 pixels.
A face detector is run on each image, and the detected faces are resized to the input size.

The Main Models
Model Name  Architecture    Input Size  Parameters  FLOPS  VAL (± 1.6 to 2.9)
NN1         Zeiler&Fergus   220 x 220   140M        1.6B   87.9%
NN2         Inception       224 x 224   7.5M               89.4%
NN3                         160 x 160                      88.3%
NN4                         96 x 96                 285M   82.0%
NNS1        mini Inception  165 x 165   26M         220M   82.4%
NNS2        tiny Inception  140 x 116   4.3M        20M    51.9%
VAL computed on Hold-out Test Set.

Experiments & Results ROC graph for personal photos

Experiments & Results Training Data size against VAL
Embedding Dimensionality

Experiments & Results Computation Accuracy Trade-off

Academic data set performance
LFW: Achieved record-breaking classification accuracy of % ± (standard error of the mean) using the NN1 model.
YouTube Faces DB: Achieved classification accuracy of % ± 0.39.

LFW DB errors

Conclusion
FaceNet provides a method to directly learn an embedding into a Euclidean space for face verification.
The method uses a deep convolutional network trained to directly optimize the embedding itself.
The system achieves new record accuracy on the academic datasets.

So, if you don't want to be arrested in the middle of a concert… Thank you!
Tal & Coral
