
1 Show and Tell: A Neural Image Caption Generator (CVPR 2015)
Presenters: Tianlu Wang and Yin Zhang, October 5th

2 Neural Image Caption (NIC)
Main goal: automatically describe the content of an image using properly formed English sentences.
Example captions for the same image. Human: "A young girl asleep on the sofa cuddling a stuffed bear." NIC: "A baby is asleep next to a teddy bear."
Mathematically, the goal is to build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence | Image) of producing a target sequence of words.
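Written out, the objective from the paper is to maximize the log-likelihood of the correct caption given the image, with the sentence probability factored word by word via the chain rule:

    \theta^{*} = \arg\max_{\theta} \sum_{(I,S)} \log p(S \mid I; \theta),
    \qquad
    \log p(S \mid I) = \sum_{t=0}^{N} \log p(S_t \mid I, S_0, \ldots, S_{t-1})

where S = (S_0, ..., S_N) is the caption and each conditional p(S_t | I, S_0, ..., S_{t-1}) is modeled by the recurrent decoder.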

3 Inspiration from Machine Translation task
The target sentence is generated by maximizing the likelihood P(T|S), where S is the source-language sentence and T is the target-language sentence.
Uses the encoder-decoder structure:
Encoder (RNN): transforms the source sentence into a rich fixed-length vector.
Decoder (RNN): takes the encoder's output as input and generates the target sentence.
Example: translating the words "ABCD" in the source language into "XYZQ" in the target language.
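In symbols, the machine-translation decoder factors the likelihood P(T|S) one target word at a time, conditioned on the source sentence and the words emitted so far:

    P(T \mid S) = \prod_{t=1}^{|T|} P(T_t \mid S, T_1, \ldots, T_{t-1}),
    \qquad
    T^{*} = \arg\max_{T} P(T \mid S)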

4 NIC Model Architecture
Follows the same encoder-decoder structure:
Encoder (deep CNN): transforms the image into a rich fixed-length vector.
Decoder (RNN): takes the encoder's output as input and generates the target sentence.

5 NIC Model Architecture
Choice of CNN: GoogLeNet, the winner of the ILSVRC 2014 classification competition.
Choice of RNN: an LSTM RNN (a recurrent neural network with LSTM cells).
In the training process, the CNN was left unchanged; only the RNN part was trained.
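The following is a minimal PyTorch-style sketch of this setup, not the authors' code: it uses a torchvision ResNet-18 as a convenient stand-in for the GoogLeNet encoder, freezes it, and trains only the embedding/LSTM/output layers; all names and dimensions are illustrative.

    import torch
    import torch.nn as nn
    import torchvision.models as models

    class CaptionModel(nn.Module):
        """NIC-style captioner: frozen CNN encoder + trainable LSTM decoder."""
        def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
            super().__init__()
            # Encoder: pretrained CNN, kept frozen (only the RNN part is trained).
            cnn = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
            cnn.fc = nn.Identity()                      # drop the classifier head -> 512-d features
            for p in cnn.parameters():
                p.requires_grad = False
            self.cnn = cnn
            self.img_proj = nn.Linear(512, embed_dim)   # map image features into embedding space
            # Decoder: word embeddings + LSTM + linear layer over the vocabulary.
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
            self.out = nn.Linear(hidden_dim, vocab_size)

        def forward(self, images, captions):
            with torch.no_grad():
                feats = self.cnn(images)                # (B, 512) image vector
            img_emb = self.img_proj(feats).unsqueeze(1) # fed as the first "word" of the sequence
            word_emb = self.embed(captions)             # (B, T, E)
            seq = torch.cat([img_emb, word_emb], dim=1)
            hidden, _ = self.lstm(seq)
            return self.out(hidden)                     # logits over the vocabulary at each step

Training would minimize the cross-entropy between these logits and the next ground-truth word at each step, which matches the log-likelihood objective above; only the decoder parameters receive gradients.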

6 RNN (Recurrent Neural Network)
Why? Sequential tasks: speech, text, video, etc. E.g., translating a word based on the previous ones.
Advantage: information is passed from one step to the next, so it persists across the sequence.
How? Loops: multiple copies of the same cell (module), each passing a message to its successor (see the sketch below).
Want to know more? See Understanding LSTM Networks (colah's blog) in the references.
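As a toy illustration of the loop / "message to a successor" idea (dimensions and weights here are made up, not the paper's model), a vanilla RNN step in NumPy:

    import numpy as np

    hidden_dim, input_dim, steps = 8, 4, 5
    Wxh = np.random.randn(hidden_dim, input_dim) * 0.1   # input-to-hidden weights
    Whh = np.random.randn(hidden_dim, hidden_dim) * 0.1  # hidden-to-hidden weights (the loop)
    b = np.zeros(hidden_dim)

    h = np.zeros(hidden_dim)      # hidden state: the message passed from step to step
    for t in range(steps):
        x_t = np.random.randn(input_dim)          # stand-in for the t-th input (word, frame, ...)
        h = np.tanh(Wxh @ x_t + Whh @ h + b)      # the same cell is reused at every step
    # After the loop, h still carries information from all earlier steps.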

7 RNN & LSTM
Why is LSTM better? The long-term dependency problem: the translation of the last word may depend on information from the first word, and when the gap between the relevant information and the point where it is needed grows, a plain RNN fails.
Long Short-Term Memory networks remember information for long periods of time.

8 LSTM (Long Short-Term Memory)
Cell state: information flows along it.
Gates: structures that optionally let information through.

9 LSTM Cont. (forget gate)
Inputs: the current input x_t and the previous output h_{t-1}.
Output: f_t, a vector whose elements lie between 0 and 1, which decides what information to throw away from the cell state.
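In the standard notation used in the cited LSTM blog post, the forget gate is a sigmoid layer over the previous output and the current input:

    f_t = \sigma\big(W_f \cdot [h_{t-1}, x_t] + b_f\big)

Each element of f_t multiplies the corresponding element of the old cell state C_{t-1}: a value near 0 means "forget this", a value near 1 means "keep this".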

10 LSTM Cont. (input gate)
Input gate: decides which values will be updated, i.e., what new information will be stored in the cell state.
A tanh layer creates new candidate values, pushed to be between -1 and 1.
The old cell state is then combined with these candidates to form the new cell state.
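In the same notation, the input gate, the candidate values, and the cell-state update are:

    i_t = \sigma\big(W_i \cdot [h_{t-1}, x_t] + b_i\big)
    \tilde{C}_t = \tanh\big(W_C \cdot [h_{t-1}, x_t] + b_C\big)
    C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t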

11 LSTM Cont. (output gate)
Output gate: decides what parts of the cell state we will output; the cell state is passed through tanh and multiplied by the gate, so only the parts we decided on are output.
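The output gate and the new hidden state, completing the step:

    o_t = \sigma\big(W_o \cdot [h_{t-1}, x_t] + b_o\big)
    h_t = o_t \odot \tanh(C_t)

Putting the three gates together, one LSTM step can be sketched in NumPy (toy dimensions, randomly initialized weights; for illustration only):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, c_prev, W, b):
        """One LSTM step; W maps [h_prev, x_t] to the four gate pre-activations."""
        z = W @ np.concatenate([h_prev, x_t]) + b
        f, i, o, g = np.split(z, 4)
        f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget / input / output gates in (0, 1)
        g = np.tanh(g)                                 # candidate values in (-1, 1)
        c_t = f * c_prev + i * g                       # forget part of the old state, add new info
        h_t = o * np.tanh(c_t)                         # output the chosen parts of the cell state
        return h_t, c_t

    H, X = 8, 4
    W = np.random.randn(4 * H, H + X) * 0.1
    b = np.zeros(4 * H)
    h, c = np.zeros(H), np.zeros(H)
    h, c = lstm_step(np.random.randn(X), h, c, W, b)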

12 Results: captioning quality is reported with the BLEU metric.
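BLEU scores a generated sentence by its n-gram overlap with reference sentences. A small illustration with NLTK, reusing the two example captions from slide 2 (this is not the paper's evaluation setup):

    from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

    reference = "a young girl asleep on the sofa cuddling a stuffed bear".split()
    candidate = "a baby is asleep next to a teddy bear".split()

    # BLEU-1 (unigram precision only); smoothing avoids zero scores on short sentences.
    score = sentence_bleu([reference], candidate,
                          weights=(1.0, 0, 0, 0),
                          smoothing_function=SmoothingFunction().method1)
    print(f"BLEU-1: {score:.3f}")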

13 References:
Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. CVPR 2015.
Understanding LSTM Networks. colah's blog.

