1 Image Captioning Tackgeun You

2 Image Captioning Algorithms
Retrieval-based Captioning
Template-based Captioning
Machine Translation-based Captioning
How to generate language?

3 Image Classification
Problem setting: in the label space $V$, find appropriate labels for a given image $I$.
Naïve testing phase: estimate scores of all possible labels, then threshold to get the result labels.

5 Image Captioning (1)
Problem setting: in the sentence space $S = V^N$, find appropriate sentences for a given image $I$.
Naïve testing phase: estimate scores of all possible sentences, then threshold to get the result sentences.

6 Image Captioning (2)
Problem setting: in the sentence space $V^N$, find appropriate sentences for a given image $I$.
Testing phase: retrieve relevant sentences, estimate scores of the retrieved sentences, then threshold to get the result sentences.
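To make the gap concrete, here is a toy sketch (the scoring function and all numbers are hypothetical): the sentence space $V^N$ is astronomically large, so scoring a small retrieved candidate pool is the only practical option.

```python
# Toy sketch (all numbers hypothetical): scoring every sentence in V^N is
# infeasible, so we score a small retrieved candidate pool and threshold it.
vocab_size = 10_000          # |V|
max_length = 10              # N
print(f"|S| = |V|^N = {vocab_size ** max_length:.3e} candidate sentences")

def score(image_tags, sentence):
    # Stand-in for a learned image-sentence relevance score.
    words = sentence.split()
    return len(image_tags & set(words)) / len(words)

image_tags = {"woman", "camera", "crowd"}
retrieved = [                # small pool returned by a retrieval step
    "a woman holding a camera in a crowd",
    "a man riding a bicycle",
    "a woman in a crowd",
]
captions = [s for s in retrieved if score(image_tags, s) >= 0.3]
print(captions)              # only the image-relevant sentences survive
```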

7 Image Captioning (3)
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]

8 Statistical Language Model
A (statistical) language model is a probability distribution over sequences of words. It assigns a probability $p(w_N, w_{N-1}, \dots, w_1)$ to every sentence:
$p(w_N, w_{N-1}, \dots, w_1) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \dots, w_1) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2, w_1) \cdots p(w_N \mid w_{N-1}, \dots, w_1)$
e.g. "A woman holding a camera in a crowd."
$p(\text{a})\, p(\text{woman} \mid \text{a})\, p(\text{holding} \mid \text{woman}, \text{a}) \cdots p(\text{crowd} \mid \text{a}, \dots, \text{a})$
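A minimal sketch of this chain-rule factorization in Python, with made-up conditional probabilities standing in for a learned model:

```python
import math

# cond_prob is a stand-in for a learned model p(w_i | w_{i-1}, ..., w_1);
# the probability table below is invented for illustration.
def cond_prob(word, history):
    table = {
        ((), "a"): 0.20,
        (("a",), "woman"): 0.05,
        (("a", "woman"), "holding"): 0.10,
    }
    return table.get((tuple(history), word), 0.01)

sentence = ["a", "woman", "holding"]
log_p = 0.0
for i, w in enumerate(sentence):
    log_p += math.log(cond_prob(w, sentence[:i]))  # log p(w_i | w_1..w_{i-1})
print(math.exp(log_p))  # p(a) * p(woman|a) * p(holding|a,woman) = 0.001
```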

9 N-gram Model
An $n$-gram model makes an $(n-1)$-th order Markov assumption:
$p(w_N, w_{N-1}, \dots, w_1) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \dots, w_1) \approx \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \dots, w_{i-k}), \quad k = n-1$
e.g. "A woman holding a camera in a crowd." (4-gram)
$p(\text{crowd} \mid \text{a}, \text{in}, \text{camera}, \text{a}, \text{holding}, \text{woman}, \text{a}) \approx p(\text{crowd} \mid \text{a}, \text{in}, \text{camera})$
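A minimal count-based sketch, assuming maximum-likelihood trigram estimation on a toy two-sentence corpus (no smoothing):

```python
from collections import Counter

# MLE trigram model: p(w_i | w_{i-2}, w_{i-1})
#   = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1})
corpus = [
    "a woman holding a camera in a crowd".split(),
    "a man holding a phone in a park".split(),
]

trigrams, bigrams = Counter(), Counter()
for sent in corpus:
    for i in range(2, len(sent)):
        trigrams[tuple(sent[i-2:i+1])] += 1
        bigrams[tuple(sent[i-2:i])] += 1

def p(word, w1, w2):
    return trigrams[(w1, w2, word)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

print(p("camera", "holding", "a"))  # 0.5: "holding a" is followed by camera/phone
```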

10 Sentence Generation by LM (1)
Under a language model, a relevant sentence is a high-probability sentence, so retrieving a relevant sentence → searching for a high-probability sentence.
Search schemes:
Exhaustive search
Greedy search
Beam search

11 Sentence Generation by LM (2)
Greedy search: at each step, pick the word with the highest probability.
Beam search: greedy search while retaining the $K$ best partial paths.
$p(w_N, w_{N-1}, \dots, w_1) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \dots, w_1) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2, w_1) \cdots p(w_N \mid w_{N-1}, \dots, w_1)$
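A compact beam-search sketch over an arbitrary conditional model; the toy bigram table below is invented for illustration.

```python
import heapq, math

# Beam search: expand each of the K best partial paths with every next word,
# then keep the K highest-probability candidates.
def beam_search(step_probs, beam_width=3, max_len=8, eos="</s>"):
    beams = [(0.0, ["<s>"])]                       # (log-prob, partial sentence)
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            if seq[-1] == eos:                     # finished path: keep as-is
                candidates.append((log_p, seq))
                continue
            for word, p in step_probs(seq).items():
                candidates.append((log_p + math.log(p), seq + [word]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams

# Toy bigram next-word distributions (made up):
probs = {"<s>": {"a": 0.6, "the": 0.4},
         "a": {"woman": 0.5, "man": 0.3, "</s>": 0.2},
         "the": {"woman": 0.4, "man": 0.4, "</s>": 0.2},
         "woman": {"</s>": 1.0}, "man": {"</s>": 1.0}}
best = beam_search(lambda seq: probs[seq[-1]], beam_width=2, max_len=4)
print(best[0])  # highest-probability path
```

With beam_width = 1 this reduces to greedy search; with an unbounded beam it reduces to exhaustive search.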

12 Image Captioning Pipeline
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]

13 From Captions to Visual Concepts and Back (CVPR 2015)
Language model conditioned on detected words (labels).
[Pipeline diagram: Input Image → feature → Detected Words → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]

15 1. Word Detection
Learning a weakly-supervised detector:
The 1000 most frequent words in the training set cover over 92% of word occurrences.
FCN + noisy-OR version of MIL (multiple-instance learning).
Input: sets of "positive" and "negative" bags of bounding boxes for each word.
The probability of bag $b_i$ containing word $w$:
$p(b_i^w) = 1 - \prod_{j \in b_i} (1 - p_{ij}^w)$
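The noisy-OR bag probability is a one-liner; the per-region probabilities below are hypothetical stand-ins for the FCN's response map:

```python
import numpy as np

# Noisy-OR MIL: the bag (image) contains word w if at least one region j
# in it does, so p(b_i^w) = 1 - prod_j (1 - p_ij^w).
p_ij = np.array([0.05, 0.60, 0.10, 0.02])  # regions of one image, one word
p_bag = 1.0 - np.prod(1.0 - p_ij)
print(p_bag)  # ~0.665: one confident region is enough to fire the word
```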

16 Fully Convolutional Network with MIL
[Diagram: with fully connected layers, a 224×224 query image yields a 1×1×1000 output vector; after converting the fully connected layers to convolutions, a 565×565 query image yields a 12×12×1000 output map.]

17 2. Sentence Generation
Beam search with a "blackboard" of remaining detected words $\nu$:
$p(w_1 \mid \nu_0)\, p(w_2 \mid w_1, \nu_1) \cdots p(w_l \mid w_{l-1}, \dots, w_1, \nu_{l-1})$
1. A ____  {woman, crowd, cat, camera, holding, purple}
2. A woman ____  {crowd, cat, camera, holding, purple}
3. A woman holding ____  {crowd, cat, camera, purple}
...
N. A woman holding a camera in a crowd  {cat, purple}
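A sketch of the shrinking conditioning set; the word choices are hard-coded here, where in the paper a maximum-entropy LM picks each word given the caption so far and the remaining detected words:

```python
# "Blackboard" sketch: used detected words leave the conditioning set,
# so each step is scored as p(w_l | w_{l-1}, ..., w_1, nu_{l-1}).
detected = {"woman", "crowd", "cat", "camera", "holding", "purple"}
caption = []
for word in ["a", "woman", "holding", "a", "camera", "in", "a", "crowd"]:
    caption.append(word)     # word chosen by the LM given (caption, detected)
    detected.discard(word)   # remove it from the blackboard once used
    print(" ".join(caption), "| remaining:", sorted(detected))
# final remaining set: {'cat', 'purple'} — detected but never covered
```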

18 3. Re-Ranking Sentences
Off-the-shelf algorithm to rank:
MERT (Minimum Error Rate Training), optimized for BLEU on the validation set.
Similarity measure: DMSM (Deep Multimodal Similarity Model).
Image model: fine-tuned VGG16. Text model: semantic vector.
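A sketch of the re-ranking step as a linear model over candidate features; the feature values and weights below are invented, standing in for the LM score and DMSM similarity with MERT-tuned weights:

```python
# Linear re-ranker: score(c) = lambda_1 * lm_logprob + lambda_2 * similarity.
# MERT would tune the lambdas for BLEU on a validation set; these are made up.
candidates = [
    # (caption, lm_logprob, dmsm_similarity)
    ("a woman holding a camera in a crowd", -9.1, 0.82),
    ("a group of people standing around",   -7.1, 0.55),
]
weights = (0.4, 6.0)  # hypothetical MERT-tuned weights

def rerank_score(lm_logprob, similarity):
    return weights[0] * lm_logprob + weights[1] * similarity

best = max(candidates, key=lambda c: rerank_score(c[1], c[2]))
print(best[0])  # high similarity can outrank a higher LM probability
```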

19 Result in Microsoft COCO
COCO Challenge (table C5):
         BLEU-1  BLEU-2  BLEU-3  BLEU-4  M1     M2
Human    0.663   0.469   0.321   0.217   0.638  0.675
MSR      0.695   0.526   0.391   0.291   0.268  0.322
Google   0.713   0.542   0.407   0.309   0.273  0.317
c5 – five reference captions for every train/val/test image.
M1 – percentage of captions evaluated as better than or equal to the human caption.
M2 – percentage of captions that pass the Turing test.

20 #2. Show and Tell: A Neural Image Caption Generator (CVPR 2015)
Neural language model on a CNN feature; no need to re-rank!
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Result Captions]
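A minimal PyTorch sketch of the idea, with illustrative sizes rather than the paper's exact configuration: the image feature is fed as the first LSTM input, and the model predicts next-word logits at every step.

```python
import torch
import torch.nn as nn

# Show-and-Tell sketch: a CNN image feature conditions an LSTM language
# model, which generates the caption directly. Sizes are illustrative.
class NeuralCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # image -> first LSTM input
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)     # next-word logits

    def forward(self, img_feat, captions):
        img = self.img_proj(img_feat).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions)                    # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(hidden)                         # (B, T+1, vocab)

model = NeuralCaptioner()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 10000])
```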

21 Novelty in Captions (validation c4)
         BLEU-4  Unique Captions (%)  Seen in Training (%)
Human    -       99.4                 4.8
MSR      25.7    47.0                 30.0
Google   27.2                         ~80
The high "seen in training" rate indicates overfitting.

22 Discussion
Generating sentences with a (sequential) language model: how can the LM be modified to avoid overfitting?

23 Discussion: Image Captioning

24 End.

25 #3. Phrase-based Image Captioning (ICML 2015)
Phrase-based language model.
[Pipeline diagram: Input Image → feature → Phrase Model → Sentence Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]

26 μž‰μ—¬ νŽ˜μ΄μ§€ Image Captioning
Generating Sentences Input Image LSTM-LM feature CNN Result Captions μž‰μ—¬ νŽ˜μ΄μ§€ Image Captioning Generating Sentences Language Model Ranking Sentences feature Sentences Result Captions Input Image Generating Sentences Input Image words ME-LM Ranking Sentences feature sentences MIL CNN Result Captions

27 1. Word Detection
Example: "A person with a helmet is riding a bicycle."
Words = a / person / with / helmet / is / riding / bicycle
