1 Image Captioning Tackgeun You

2 Image Captioning Algorithms
Retrieval-based Captioning
Template-based Captioning
Machine Translation-based Captioning
How to generate language?

3 Image Classification
Problem setting: in the label space $V$, find appropriate labels for a given image $I$.
Naïve testing phase: estimate scores of all possible labels, then threshold to get the result labels.

5 Image Captioning (1)
Problem setting: in the sentence space $S = V^N$, find appropriate sentences for a given image $I$.
Naïve testing phase: estimate scores of all possible sentences, then threshold to get the result sentences.

6 Image Captioning (2)
Problem setting: in the sentence space $V^N$, find appropriate sentences for a given image $I$.
Testing phase: retrieve relevant sentences, estimate scores of the retrieved sentences, then threshold to get the result sentences.
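To make the gap concrete, here is a toy sketch (the scoring function and all numbers are hypothetical): the sentence space $V^N$ is astronomically large, so scoring a small retrieved candidate pool is the only practical option.

```python
# Toy sketch (all numbers hypothetical): scoring every sentence in V^N is
# infeasible, so we score a small retrieved candidate pool and threshold it.
vocab_size = 10_000          # |V|
max_length = 10              # N
print(f"|S| = |V|^N = {vocab_size ** max_length:.3e} candidate sentences")

def score(image_tags, sentence):
    # Stand-in for a learned image-sentence relevance score.
    words = sentence.split()
    return len(image_tags & set(words)) / len(words)

image_tags = {"woman", "camera", "crowd"}
retrieved = [                # small pool returned by a retrieval step
    "a woman holding a camera in a crowd",
    "a man riding a bicycle",
    "a woman in a crowd",
]
captions = [s for s in retrieved if score(image_tags, s) >= 0.3]
print(captions)              # only the image-relevant sentences survive
```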

7 Image Captioning (3)
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]

8 Statistical Language Model
A (statistical) language model is a probability distribution over sequences of words. It assigns a probability $p(w_N, w_{N-1}, \dots, w_1)$ to every sentence:
$p(w_N, w_{N-1}, \dots, w_1) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \dots, w_1) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2, w_1) \cdots p(w_N \mid w_{N-1}, \dots, w_1)$
e.g. "A woman holding a camera in a crowd."
$p(\text{a})\, p(\text{woman} \mid \text{a})\, p(\text{holding} \mid \text{woman}, \text{a}) \cdots p(\text{crowd} \mid \text{a}, \dots, \text{a})$
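A minimal sketch of this chain-rule factorization in Python, with made-up conditional probabilities standing in for a learned model:

```python
import math

# cond_prob is a stand-in for a learned model p(w_i | w_{i-1}, ..., w_1);
# the probability table below is invented for illustration.
def cond_prob(word, history):
    table = {
        ((), "a"): 0.20,
        (("a",), "woman"): 0.05,
        (("a", "woman"), "holding"): 0.10,
    }
    return table.get((tuple(history), word), 0.01)

sentence = ["a", "woman", "holding"]
log_p = 0.0
for i, w in enumerate(sentence):
    log_p += math.log(cond_prob(w, sentence[:i]))  # log p(w_i | w_1..w_{i-1})
print(math.exp(log_p))  # p(a) * p(woman|a) * p(holding|a,woman) = 0.001
```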

9 N-gram Model
An $n$-gram model makes an $(n-1)$-th order Markov assumption:
$p(w_N, w_{N-1}, \dots, w_1) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \dots, w_1) \approx \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \dots, w_{i-k}), \quad k = n-1$
e.g. "A woman holding a camera in a crowd." (4-gram)
$p(\text{crowd} \mid \text{a}, \text{in}, \text{camera}, \text{a}, \text{holding}, \text{woman}, \text{a}) \approx p(\text{crowd} \mid \text{a}, \text{in}, \text{camera})$
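A minimal count-based sketch, assuming maximum-likelihood trigram estimation on a toy two-sentence corpus (no smoothing):

```python
from collections import Counter

# MLE trigram model: p(w_i | w_{i-2}, w_{i-1})
#   = count(w_{i-2}, w_{i-1}, w_i) / count(w_{i-2}, w_{i-1})
corpus = [
    "a woman holding a camera in a crowd".split(),
    "a man holding a phone in a park".split(),
]

trigrams, bigrams = Counter(), Counter()
for sent in corpus:
    for i in range(2, len(sent)):
        trigrams[tuple(sent[i-2:i+1])] += 1
        bigrams[tuple(sent[i-2:i])] += 1

def p(word, w1, w2):
    return trigrams[(w1, w2, word)] / bigrams[(w1, w2)] if bigrams[(w1, w2)] else 0.0

print(p("camera", "holding", "a"))  # 0.5: "holding a" is followed by camera/phone
```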

10 Sentence Generation by LM (1)
Under a language model, a relevant sentence is a high-probability sentence, so retrieving a relevant sentence → searching for a high-probability sentence.
Search schemes:
Exhaustive search
Greedy search
Beam search

11 Sentence Generation by LM (2)
Greedy search: at each step, pick the word with the highest probability.
Beam search: greedy search while retaining the $K$ best partial paths.
$p(w_N, w_{N-1}, \dots, w_1) = \prod_{i=1}^{N} p(w_i \mid w_{i-1}, \dots, w_1) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2, w_1) \cdots p(w_N \mid w_{N-1}, \dots, w_1)$
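A compact beam-search sketch over an arbitrary conditional model; the toy bigram table below is invented for illustration.

```python
import heapq, math

# Beam search: expand each of the K best partial paths with every next word,
# then keep the K highest-probability candidates.
def beam_search(step_probs, beam_width=3, max_len=8, eos="</s>"):
    beams = [(0.0, ["<s>"])]                       # (log-prob, partial sentence)
    for _ in range(max_len):
        candidates = []
        for log_p, seq in beams:
            if seq[-1] == eos:                     # finished path: keep as-is
                candidates.append((log_p, seq))
                continue
            for word, p in step_probs(seq).items():
                candidates.append((log_p + math.log(p), seq + [word]))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beams

# Toy bigram next-word distributions (made up):
probs = {"<s>": {"a": 0.6, "the": 0.4},
         "a": {"woman": 0.5, "man": 0.3, "</s>": 0.2},
         "the": {"woman": 0.4, "man": 0.4, "</s>": 0.2},
         "woman": {"</s>": 1.0}, "man": {"</s>": 1.0}}
best = beam_search(lambda seq: probs[seq[-1]], beam_width=2, max_len=4)
print(best[0])  # highest-probability path
```

With beam_width = 1 this reduces to greedy search; with an unbounded beam it reduces to exhaustive search.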

12 Image Captioning Pipeline
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]

13 From Captions to Visual Concepts and Back (CVPR 2015)
Language model conditioned on detected words (labels).
[Pipeline diagram: Input Image → feature → Detected Words → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]

15 1. Word Detection
Learning a weakly-supervised detector:
The 1000 most frequent words in the training set cover over 92% of word occurrences.
FCN + noisy-OR version of MIL (multiple-instance learning).
Input: sets of "positive" and "negative" bags of bounding boxes for each word.
The probability of bag $b_i$ containing word $w$:
$p(b_i^w) = 1 - \prod_{j \in b_i} (1 - p_{ij}^w)$
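The noisy-OR bag probability is a one-liner; the per-region probabilities below are hypothetical stand-ins for the FCN's response map:

```python
import numpy as np

# Noisy-OR MIL: the bag (image) contains word w if at least one region j
# in it does, so p(b_i^w) = 1 - prod_j (1 - p_ij^w).
p_ij = np.array([0.05, 0.60, 0.10, 0.02])  # regions of one image, one word
p_bag = 1.0 - np.prod(1.0 - p_ij)
print(p_bag)  # ~0.665: one confident region is enough to fire the word
```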

16 Fully Convolutional Network with MIL
[Diagram: with fully connected layers, a 224×224 query image yields a 1×1×1000 output vector; after converting the fully connected layers to convolutions, a 565×565 query image yields a 12×12×1000 output map.]

17 2. Sentence Generation
Beam search with a "blackboard" of remaining detected words $\nu$:
$p(w_1 \mid \nu_0)\, p(w_2 \mid w_1, \nu_1) \cdots p(w_l \mid w_{l-1}, \dots, w_1, \nu_{l-1})$
1. A ____  {woman, crowd, cat, camera, holding, purple}
2. A woman ____  {crowd, cat, camera, holding, purple}
3. A woman holding ____  {crowd, cat, camera, purple}
...
N. A woman holding a camera in a crowd  {cat, purple}
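A sketch of the shrinking conditioning set; the word choices are hard-coded here, where in the paper a maximum-entropy LM picks each word given the caption so far and the remaining detected words:

```python
# "Blackboard" sketch: used detected words leave the conditioning set,
# so each step is scored as p(w_l | w_{l-1}, ..., w_1, nu_{l-1}).
detected = {"woman", "crowd", "cat", "camera", "holding", "purple"}
caption = []
for word in ["a", "woman", "holding", "a", "camera", "in", "a", "crowd"]:
    caption.append(word)     # word chosen by the LM given (caption, detected)
    detected.discard(word)   # remove it from the blackboard once used
    print(" ".join(caption), "| remaining:", sorted(detected))
# final remaining set: {'cat', 'purple'} — detected but never covered
```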

18 3. Re-Ranking Sentences
Off-the-shelf algorithm to rank:
MERT (Minimum Error Rate Training), optimized for BLEU on the validation set.
Similarity measure: DMSM (Deep Multimodal Similarity Model).
Image model: fine-tuned VGG16. Text model: semantic vector.
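A sketch of the re-ranking step as a linear model over candidate features; the feature values and weights below are invented, standing in for the LM score and DMSM similarity with MERT-tuned weights:

```python
# Linear re-ranker: score(c) = lambda_1 * lm_logprob + lambda_2 * similarity.
# MERT would tune the lambdas for BLEU on a validation set; these are made up.
candidates = [
    # (caption, lm_logprob, dmsm_similarity)
    ("a woman holding a camera in a crowd", -9.1, 0.82),
    ("a group of people standing around",   -7.1, 0.55),
]
weights = (0.4, 6.0)  # hypothetical MERT-tuned weights

def rerank_score(lm_logprob, similarity):
    return weights[0] * lm_logprob + weights[1] * similarity

best = max(candidates, key=lambda c: rerank_score(c[1], c[2]))
print(best[0])  # high similarity can outrank a higher LM probability
```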

19 Result in Microsoft COCO
COCO Challenge (table C5):
         BLEU-1  BLEU-2  BLEU-3  BLEU-4  M1     M2
Human    0.663   0.469   0.321   0.217   0.638  0.675
MSR      0.695   0.526   0.391   0.291   0.268  0.322
Google   0.713   0.542   0.407   0.309   0.273  0.317
c5 – five reference captions for every train/val/test image.
M1 – percentage of captions evaluated as better than or equal to the human caption.
M2 – percentage of captions that pass the Turing test.

20 #2. Show and Tell: A Neural Image Caption Generator (CVPR 2015)
Neural language model on a CNN feature; no need to re-rank!
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Result Captions]
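A minimal PyTorch sketch of the idea, with illustrative sizes rather than the paper's exact configuration: the image feature is fed as the first LSTM input, and the model predicts next-word logits at every step.

```python
import torch
import torch.nn as nn

# Show-and-Tell sketch: a CNN image feature conditions an LSTM language
# model, which generates the caption directly. Sizes are illustrative.
class NeuralCaptioner(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.img_proj = nn.Linear(feat_dim, embed_dim)  # image -> first LSTM input
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.out = nn.Linear(embed_dim, vocab_size)     # next-word logits

    def forward(self, img_feat, captions):
        img = self.img_proj(img_feat).unsqueeze(1)      # (B, 1, E)
        words = self.embed(captions)                    # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))
        return self.out(hidden)                         # (B, T+1, vocab)

model = NeuralCaptioner()
logits = model(torch.randn(2, 2048), torch.randint(0, 10000, (2, 5)))
print(logits.shape)  # torch.Size([2, 6, 10000])
```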

21 Novelty in Captions (validation c4)
         BLEU-4  Unique Captions (%)  Seen in Training (%)
Human    -       99.4                 4.8
MSR      25.7    47.0                 30.0
Google   27.2                         ~80
The high "seen in training" rate indicates overfitting.

22 Discussion
Generating sentences with a (sequential) language model: how can the LM be modified to avoid overfitting?

23 Discussion: Image Captioning

24 End.

25 #3. Phrase-based Image Captioning (ICML 2015)
Phrase-based language model.
[Pipeline diagram: Input Image → feature → Phrase Model → Sentence Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]

26 μž‰μ—¬ νŽ˜μ΄μ§€ Image Captioning
Generating Sentences Input Image LSTM-LM feature CNN Result Captions μž‰μ—¬ νŽ˜μ΄μ§€ Image Captioning Generating Sentences Language Model Ranking Sentences feature Sentences Result Captions Input Image Generating Sentences Input Image words ME-LM Ranking Sentences feature sentences MIL CNN Result Captions

27 1. Word Detection
Example: "A person with a helmet is riding a bicycle."
Words = a / person / with / helmet / is / riding / bicycle
