
Show and Tell Presented by: Anurag Paul

Show and Tell: A Neural Image Caption Generator Authors (all of Google): Oriol Vinyals (vinyals@google.com), Alexander Toshev (toshev@google.com), Samy Bengio (bengio@google.com), Dumitru Erhan (dumitru@google.com)

Agenda: Related Work, Architecture, Metrics, Datasets, Analysis

How can we do Image Captioning? Consider the different data regimes:
- The dataset is small (~1,000 images)
- There is only one class of data and we need to capture fine-grained information (say, all images are of tennis)
- There is a large amount of data (~1M images), but it is noisy, i.e. not labelled professionally

Solutions in Related Work: Baby Talk
- Detects scene elements and converts them to a sentence using templates
- Such systems are heavily hand-designed and rigid when it comes to text generation

Solutions in Related Work: DeFrag
- Ranks descriptions for a given image; such approaches are based on co-embedding images and text in the same vector space
- They cannot describe previously unseen compositions of objects
- They avoid addressing the problem of evaluating how good a generated description is

Solutions in Related Work: m-RNN
- Mao et al. use a recurrent NN for the same prediction task
- NIC uses a more powerful RNN model and provides the visual input to the RNN directly

Solutions in Related Work: MNLM
- Kiros et al. propose to construct a joint multimodal embedding space using a powerful computer vision model and an LSTM that encodes text
- They use two separate pathways (one for images, one for text) to define the joint embedding; the approach is highly tuned for ranking

Model Architecture

GoogLeNet: 2014 ILSVRC Winner
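
The architecture pairs this GoogLeNet encoder with an LSTM decoder that generates the caption word by word. Below is a minimal PyTorch-style sketch of that wiring; the paper specifies feeding the image embedding once as the first LSTM input, while the layer sizes and module names here are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class NIC(nn.Module):
    """Minimal Show-and-Tell style captioner: CNN features -> LSTM decoder."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # CNN features -> word-embedding space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)    # logits over the vocabulary

    def forward(self, img_feats, captions):
        # The image embedding is fed once, as the "word" before the sentence;
        # caption tokens follow, and the LSTM predicts the next word at each step.
        x0 = self.img_proj(img_feats).unsqueeze(1)      # (B, 1, E)
        xs = self.embed(captions)                       # (B, T, E)
        h, _ = self.lstm(torch.cat([x0, xs], dim=1))    # (B, T+1, H)
        return self.out(h)
```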

Inference - Sentence Generation How do we generate a new sentence given an image (at test time, not during training)?
- Sampling: sample the first word according to p1, then provide the corresponding embedding as input and sample from p2, continuing until we sample the special end-of-sentence token or reach some maximum length.
- Beam search: iteratively consider the set of the k best sentences up to time t as candidates for generating sentences of length t + 1, and keep only the best k of the results. This better approximates S = argmax_{S'} p(S' | I). A minimal sketch follows.
The authors used beam search with a beam size of 20. Using a beam size of 1 (i.e., greedy search) degraded their results by 2 BLEU points on average.
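
A small self-contained sketch of the beam-search procedure described above; step_fn (a callable returning next-word log-probabilities from the model) and the token ids are assumptions for illustration.

```python
import heapq

def beam_search(step_fn, bos, eos, beam_size=20, max_len=20):
    """Keep the k best partial sentences at each step; approximates argmax_S p(S|I).
    step_fn(tokens) is assumed to return a list of log-probs for the next word."""
    beams = [(0.0, [bos])]                   # (cumulative log-prob, token sequence)
    finished = []
    for _ in range(max_len):
        candidates = []
        for score, seq in beams:
            if seq[-1] == eos:               # sentence already complete
                finished.append((score, seq))
                continue
            logp = step_fn(seq)
            best = heapq.nlargest(beam_size, enumerate(logp), key=lambda t: t[1])
            for tok, lp in best:
                candidates.append((score + lp, seq + [tok]))
        if not candidates:
            break                            # every beam has emitted the end token
        beams = heapq.nlargest(beam_size, candidates)
    finished.extend(beams)                   # include any still-unfinished beams
    return max(finished)[1]                  # highest-scoring sentence
```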

Training Details Most captioning datasets are quite small compared to the ones we have for image classification. Challenge: overfitting. Solutions (a rough sketch follows):
- Pretrained CNN model
- Dropout
- Ensembling
All weights were randomly initialised except for the CNN.
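
As a hedged illustration of these mitigations, here is what they typically look like in code (assuming a recent torchvision; the model choice, dropout rate, and init range are illustrative, not the authors' exact settings):

```python
import torch.nn as nn
from torchvision import models

# Transfer learning: start from a CNN pretrained on ImageNet and keep it fixed.
cnn = models.googlenet(weights="IMAGENET1K_V1")
for p in cnn.parameters():
    p.requires_grad = False

# Dropout on decoder inputs/outputs to fight overfitting.
dropout = nn.Dropout(p=0.5)

# All non-CNN weights randomly initialised (small uniform range assumed here).
decoder = nn.LSTM(512, 512, batch_first=True)
for p in decoder.parameters():
    nn.init.uniform_(p, -0.08, 0.08)
```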

Training Details

Results

Evaluation Metrics: BLEU Score
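
BLEU combines modified (clipped) n-gram precisions between the candidate and the reference sentences with a brevity penalty. A minimal sketch of the clipped-precision core (the function name and whitespace tokenization are illustrative):

```python
from collections import Counter

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: candidate n-gram counts are clipped to the
    maximum count of that n-gram in any single reference."""
    ngrams = lambda s: Counter(tuple(s[i:i + n]) for i in range(len(s) - n + 1))
    cand = ngrams(candidate)
    max_ref = Counter()
    for ref in references:
        for g, c in ngrams(ref).items():
            max_ref[g] = max(max_ref[g], c)
    clipped = sum(min(c, max_ref[g]) for g, c in cand.items())
    return clipped / max(sum(cand.values()), 1)

# Example: "the the the" is clipped against a reference containing one "the".
print(modified_precision("the the the".split(), ["the cat sat".split()], 1))  # 1/3
```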

Evaluation Metrics: CIDEr A measure of consensus should encode how often n-grams in the candidate sentence are present in the reference sentences:
- n-grams not present in the reference sentences should not appear in the candidate sentence
- n-grams that commonly occur across all images in the dataset should be given lower weight
TF-IDF captures this: TF places higher weight on n-grams that frequently occur in the reference sentences describing an image, while IDF reduces the weight of n-grams that commonly occur across all images in the dataset.

Evaluation Metrics: CIDEr The CIDEr_n score for n-grams of length n is computed as the average cosine similarity between the TF-IDF n-gram vector of the candidate sentence and those of the reference sentences (a sketch follows).
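
A rough sketch of that computation: TF-IDF-weight each sentence's n-grams, then average the candidate's cosine similarity against each reference. Here doc_freq (n-gram document frequencies over the dataset) and num_images are assumed precomputed, and the real metric additionally averages the result over n = 1..4:

```python
import math
from collections import Counter

def cider_n(candidate, references, doc_freq, num_images, n=4):
    """CIDEr_n sketch: TF-IDF n-gram vectors compared by cosine similarity,
    averaged over the reference sentences."""
    def tfidf(sent):
        grams = Counter(tuple(sent[i:i + n]) for i in range(len(sent) - n + 1))
        total = sum(grams.values()) or 1
        # TF rewards n-grams frequent in this sentence; IDF downweights
        # n-grams that occur across many images (smoothing assumed here).
        return {g: (c / total) * math.log(num_images / (1 + doc_freq.get(g, 0)))
                for g, c in grams.items()}

    def cosine(a, b):
        dot = sum(v * b.get(g, 0.0) for g, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    c_vec = tfidf(candidate)
    return sum(cosine(c_vec, tfidf(r)) for r in references) / len(references)
```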

Evaluation Metrics: METEOR
- m is the number of unigrams in the candidate translation that are also found in the reference translation
- w_t is the number of unigrams in the candidate translation, and w_r is the number of unigrams in the reference translation
- Unigram precision P = m / w_t and unigram recall R = m / w_r are combined into F_mean = 10PR / (R + 9P)
- A fragmentation penalty p = 0.5 (c / u_m)^3 is applied, where c is the number of chunks and u_m is the number of unigrams that have been mapped
- The final score is M = F_mean (1 - p)
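
Putting the formula together as a tiny function of the counts above (a sketch of the original METEOR formulation; the alignment step that produces these counts is omitted):

```python
def meteor(m, w_t, w_r, c, u_m):
    """METEOR score from precomputed matching statistics."""
    if m == 0:
        return 0.0
    P, R = m / w_t, m / w_r                  # unigram precision and recall
    f_mean = 10 * P * R / (R + 9 * P)        # recall-weighted harmonic mean
    penalty = 0.5 * (c / u_m) ** 3           # fragmentation penalty
    return f_mean * (1 - penalty)

# Example: 6 matched unigrams, candidate length 7, reference length 6, 2 chunks.
print(meteor(m=6, w_t=7, w_r=6, c=2, u_m=6))
```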

Comparisons

Comparisons

High Diversity: Novel Sentences

Analysis of Embeddings

Conclusion
- NIC is an end-to-end neural network system that can automatically view an image and generate a reasonable description in plain English.
- NIC is based on a convolutional neural network that encodes an image into a compact representation, followed by a recurrent neural network that generates the corresponding sentence.
- The model is trained to maximize the likelihood of the sentence given the image.
- The authors believe that as the size of the available datasets for image description increases, so will the performance of approaches like NIC.