Siamese Neural Networks


1 Siamese Neural Networks
Based on slides created by Sanjeev Rao, Gary’s Unbelievable Research Unit (GURU)

2 Just so that nobody calls me out for plagiarism
My own wisdom on today's topic is very limited, so I have borrowed from wiser men and women. I am grateful to all of them.
EMPATH: A Neural Network that Categorizes Facial Expressions
Learning a Similarity Metric Discriminatively, with Application to Face Verification - Sumit Chopra, Raia Hadsell, Yann LeCun
Dimensionality Reduction by Learning an Invariant Mapping - Raia Hadsell, Sumit Chopra, Yann LeCun
A similarity-based neural network for facial expression analysis - Kenji Suzuki, Hiroshi Yamada, Shuji Hashimoto
And most importantly our GURU (Professor Cottrell) and fellow GURONS (this is how members of GURU are referred to). I have learnt a lot from all of them.

3 What mysteries await you?
Siamese Neural Networks (SNNs): what are they? The only thing I have heard about is Siamese twins. Are they somehow related?
Why would I use a Siamese neural network?
Training
A couple of seminal SNN papers:
Learning a Similarity Metric Discriminatively, with Application to Face Verification - Sumit Chopra, Raia Hadsell, Yann LeCun
Dimensionality Reduction by Learning an Invariant Mapping - Raia Hadsell, Sumit Chopra, Yann LeCun
The contrastive loss function
Sample implementations and visualizations

4 SNN - What are they? “Siamese twins” is a dated term for conjoined twins: twins that are physically joined at birth, sometimes sharing organs (source: Google). SNNs are the neural network analog of conjoined twins. An SNN has two identical neural networks (like twins that are physically joined at birth). The two networks share their weights; both networks have the same weights (like twins sharing organs). They take two inputs and share a symmetric loss function (more about the loss function later).

5 SNN - What are they? (Diagram: two identical networks with shared weights feeding a single loss function.)
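Below is a minimal sketch of the two-tower structure in the diagram above, assuming PyTorch; the layer sizes are illustrative placeholders, not an architecture from any of the papers discussed here.

```python
import torch
import torch.nn as nn

class SiameseNet(nn.Module):
    """Two 'twin' towers that are literally the same module, so the
    weights are shared by construction."""
    def __init__(self):
        super().__init__()
        # A small illustrative embedding network (sizes are placeholders).
        self.embed = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Flatten(),
            nn.Linear(16 * 4 * 4, 32),
        )

    def forward(self, x1, x2):
        # Both inputs pass through the *same* network, i.e. shared weights.
        return self.embed(x1), self.embed(x2)

# Usage: feed a pair of 28x28 images, get a pair of embeddings.
net = SiameseNet()
a, b = torch.randn(4, 1, 28, 28), torch.randn(4, 1, 28, 28)
e1, e2 = net(a, b)
energy = torch.norm(e1 - e2, dim=1)  # E_W: distance between the two embeddings
```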

6 SNN – Why take all this trouble?
Traditional approaches to classification using discriminative methods, such as neural networks or support vector machines, generally require that all the categories be known in advance. They also require that training examples be available for all the categories. They are intrinsically limited to a fairly small number of categories (on the order of 1000), and they are unsuitable for applications where the number of categories is very large, where the number of samples per category is small, and where only a subset of the categories is known at training time.

7 SNN – Why take all this trouble?
Such applications include face recognition and face verification: the number of categories can be in the hundreds or thousands, with only a few examples per category. A common approach to this kind of problem is distance-based methods, which compute a similarity metric between the pattern to be classified or verified and a library of stored prototypes. SNNs are used in such distance-based learning methods: they provide a way to extract information about the problem from the available data without requiring specific information about the categories. SNNs are also used for feature learning and dimensionality reduction.

8 SNN – Training
Create a dataset that contains pairs of images labeled with similarity information. We don't care how many categories exist in our data; we are only interested in the similarity information. The similarity information can be a binary label saying whether the two images are similar or not (1 or 0), or a measure of how similar they are in the input space (some distance metric in the input space). Choose an appropriate loss function and neural network architecture. The forward pass computes the loss and the backward pass updates the weights. This part of training is like training any other neural network (appropriate batch size, momentum, weight decay, adaptive learning rate, etc.). A sketch of building such a pair dataset follows below.
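A sketch of how such a pair dataset might be built from a category-labeled image set, assuming plain Python; make_pairs and its sampling strategy are illustrative choices, not something prescribed by the slides. The label convention (0 = genuine pair, 1 = impostor pair) matches the one used with the contrastive loss later.

```python
import random

def make_pairs(images, labels):
    """Build (img1, img2, y) pairs: y = 0 for a genuine (same-category) pair,
    y = 1 for an impostor (different-category) pair. Assumes at least two
    distinct categories are present."""
    by_label = {}
    for img, lab in zip(images, labels):
        by_label.setdefault(lab, []).append(img)

    pairs, labs = [], list(by_label)
    for lab in labs:
        for img in by_label[lab]:
            # Genuine pair: a partner from the same category.
            pairs.append((img, random.choice(by_label[lab]), 0))
            # Impostor pair: a partner from a different category.
            other = random.choice([l for l in labs if l != lab])
            pairs.append((img, random.choice(by_label[other]), 1))
    return pairs
```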

9 SNN – Training
The loss function is designed with a specific goal in mind. The neural network architecture is designed to learn the features of our input data as well as possible. Do a forward pass to compute the loss, backpropagate to update the weights, and let the magic happen! A minimal training-loop sketch follows below.
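A minimal training-loop sketch, assuming PyTorch, the SiameseNet defined earlier, a hypothetical pair_loader that yields the (image1, image2, label) batches built above, and the contrastive_loss function sketched later in these notes; the hyperparameters are placeholders.

```python
import torch

# Placeholder hyperparameters (learning rate, momentum, weight decay).
optimizer = torch.optim.SGD(net.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

for x1, x2, y in pair_loader:           # batches of (image1, image2, 0/1 label)
    optimizer.zero_grad()
    e1, e2 = net(x1, x2)                # forward pass through the shared towers
    loss = contrastive_loss(e1, e2, y)  # loss defined on the pair of embeddings
    loss.backward()                     # backprop flows through *both* towers
    optimizer.step()                    # one update to the single shared weight set
```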

10 Learning a Similarity Metric Discriminatively, with Application to Face Verification (LSMDAFV) - Chopra, Hadsell, LeCun
Main goals: a method to learn a similarity metric from data. The method can be used for recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. The authors develop their SNN for application to face images. The learned similarity metric can later be used to compare or match new samples from previously unseen categories (e.g., faces of people not seen during training).

11 LSMDAFV – Neural Network Architecture
Use architectures designed to extract representations that are robust to geometric distortions of the input: convolutional networks.
Three convolutional layers, two sub-sampling layers, and one fully connected output layer: C1 -> S2 -> C3 -> S4 -> C5 -> F6. A sketch of one such branch follows below.
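A sketch of a convolutional branch with that C1 -> S2 -> C3 -> S4 -> C5 -> F6 layout, assuming PyTorch; the filter counts, kernel sizes, and output dimension are illustrative placeholders, not the exact values from the paper.

```python
import torch.nn as nn

# Illustrative branch following the C1 -> S2 -> C3 -> S4 -> C5 -> F6 layout.
branch = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),     # C1: convolution
    nn.AvgPool2d(2),                    # S2: sub-sampling
    nn.Conv2d(6, 16, kernel_size=5),    # C3: convolution
    nn.AvgPool2d(2),                    # S4: sub-sampling
    nn.Conv2d(16, 120, kernel_size=4),  # C5: convolution
    nn.Flatten(),
    nn.LazyLinear(50),                  # F6: fully connected output (the embedding)
)
# Two copies of this branch share one set of weights, as in the SiameseNet sketch above.
```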

12 LSMDAFV – The contrastive loss function
The loss has the form shown below, where Q is the maximum possible value of E_W, and Y is 1 for an impostor pair and 0 for a genuine pair.
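The equation itself appeared as an image on the original slide and did not survive the transcript; reconstructed from Chopra, Hadsell, and LeCun (2005), it is:

```latex
L(W, Y, X_1, X_2) = (1 - Y)\, L_G(E_W) + Y\, L_I(E_W),
\qquad
L_G(E_W) = \frac{2}{Q}\,(E_W)^2,
\qquad
L_I(E_W) = 2Q\, e^{-\frac{2.77}{Q} E_W}
```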

13

14

15 The Loss function
So the loss picks out L_G (the loss for “genuine” pairs) when the two inputs are from the same category, and L_I (the loss for “impostor” pairs) when they are from different categories. Also, remember that E_W is the distance between the two internal representations, a.k.a. the “energy”. So we want high energy for impostor pairs and low energy for genuine pairs.

16 The Loss function
Again, E_W is the distance between the two internal representations, a.k.a. the “energy”, and we want high energy for impostor pairs and low energy for genuine pairs. This means we want impostor pairs to have representations that are far apart, and matching pairs to have representations that are close together.

17 The Loss function
Note what happens when we try to minimize these terms: for a genuine pair, minimizing the loss makes the distance between the two representations smaller; for a non-matching pair, making E_W bigger makes the loss smaller, which pushes the representations apart.

18 LSMDAFV – The contrastive loss function
Why use two different functions, one for similar pairs and one for dissimilar pairs? Why not just use the first term for all pairs? Because the energy E_W could then be driven to zero simply by making the mapping G a constant function. The loss function therefore needs a contrastive term to ensure not only that the energy for a pair of inputs from the same category is low, but also that the energy for a pair from different categories is large. Why do we not have this problem with normalized probabilistic models? Because making the probability of a particular pair high automatically makes the probability of other pairs low.

19 Dimensionality Reduction by Learning an Invariant Mapping (DrLIM) - Hadsell, Chopra, LeCun
Main goals: develop a method for learning a non-linear function that maps the data evenly to the output manifold. It only needs neighborhood relationships between training samples; these relationships could come from prior knowledge or manual labeling, and can be independent of any distance metric. The learned function can be used to map new samples not seen during training, with no prior knowledge. The authors test their model on a couple of datasets, including MNIST.

20 DrLIM – Neural Network Architecture
Two convolutional layers, one sub-sampling layer, and one fully connected output layer: C1 -> S2 -> C3 -> F4.

21 DrLIM – The contrastive loss function
The loss has the form shown below, where m is the margin (a constant), and Y is 1 for an impostor pair and 0 for a genuine pair.
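The equation itself was an image on the original slide; reconstructed from Hadsell, Chopra, and LeCun (2006), with D_W denoting the Euclidean distance between the two outputs (the “energy” E_W of the earlier slides), it is:

```latex
L(W, Y, X_1, X_2) = (1 - Y)\,\tfrac{1}{2}\,(D_W)^2 + Y\,\tfrac{1}{2}\,\bigl(\max(0,\, m - D_W)\bigr)^2
```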

22 DrLIM – Contrastive loss function – The Spring Analogy
Again, m is the margin (a constant), and Y is 1 for an impostor pair and 0 for a genuine pair. A genuine pair acts like an attracting spring that pulls the two outputs together; an impostor pair acts like a repelling spring that pushes them apart until they are at least the margin m apart.
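A sketch of this contrastive loss in code, assuming PyTorch and the Y = 1 (impostor) / Y = 0 (genuine) convention from these slides; the function name contrastive_loss is my own, and it is the one assumed in the training-loop sketch earlier.

```python
import torch

def contrastive_loss(e1, e2, y, margin=1.0):
    """DrLIM-style contrastive loss.
    y = 0 (genuine pair): attracting spring, pulls the embeddings together.
    y = 1 (impostor pair): repelling spring, pushes them apart, but only
    while they are closer than the margin m."""
    y = y.float()
    d = torch.norm(e1 - e2, dim=1)                      # D_W: Euclidean distance
    genuine = 0.5 * d.pow(2)
    impostor = 0.5 * torch.clamp(margin - d, min=0).pow(2)
    return ((1 - y) * genuine + y * impostor).mean()
```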

23 DrLIM - Results

24 MNIST 100 iterations

25 MNIST 200 iterations

26 MNIST 300 iterations

27 MNIST 400 iterations

28 MNIST 500 iterations

29 MNIST 5000 iterations

30 MNIST iterations

31 Empath - Emotions
When emotions are trained with the model described in the EMPATH paper, we obtain a circumplex that explains how we perceive emotions and the neighborhood relationships among the six basic emotions.

32 Empath - Emotions Problem Statement
Can we get the circumplex without explicitly training on the categories, i.e. without telling the network which image corresponds to which emotion? Can we get the circumplex from just similar/dissimilar training? If yes, it suggests that there is some inherent information in emotions and facial expressions that gives rise to the circumplex.

33 Empath - Emotions Emotions Model
At a high level, it is a Siamese neural network with three pre-trained convolutional layers from AlexNet (the winner of ILSVRC 2012) and four fully connected layers. The model reduces each input image (a facial expression) to a two-dimensional embedding, and we compare this two-dimensional embedding with the circumplex in the EMPATH paper. The model was trained on the NimStim dataset, and it turns out that we obtain a similar two-dimensional embedding. A hedged sketch of such a model follows below.
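A sketch of such a tower, assuming PyTorch/torchvision; exactly which AlexNet layers are reused, whether they are frozen, and the sizes of the fully connected layers are my assumptions, not details given on the slides.

```python
import torch.nn as nn
from torchvision import models

class EmotionEmbedder(nn.Module):
    """One Siamese tower: a pre-trained AlexNet convolutional front end
    followed by fully connected layers mapping a face image to 2 dimensions."""
    def __init__(self):
        super().__init__()
        alexnet = models.alexnet(weights=models.AlexNet_Weights.DEFAULT)
        # First three AlexNet conv blocks (an assumption about which layers were reused).
        self.conv = alexnet.features[:8]
        for p in self.conv.parameters():
            p.requires_grad = False  # keep the pre-trained features fixed
        # Four fully connected layers ending in a 2-D embedding; sizes are placeholders.
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, 128), nn.ReLU(),
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 2),
        )

    def forward(self, x):
        return self.fc(self.conv(x))

# The same tower embeds both images of each pair, and the 2-D embeddings are
# trained with the contrastive loss sketched earlier.
```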

34 NimStim Angry and Happy

35 Angry and Happy After Several Iterations

36 NimStim emotions

37 NimStim 2000 iterations

38 NimStim 3000 iterations

39 NimStim 4000 iterations

40 NimStim 5000 iterations

41 SNN- Summary SNNs can extract information from the available data, without requiring specific information about the categories. SNNs can be used for dimensionality reduction, recognition or verification applications where the number of categories is very large and not known during training, and where the number of training samples for a single category is very small. SNNs can be used for several other cool applications.

