1
Image Captioning Tackgeun You
2
Image Captioning Algorithms
Retrieval-based Captioning
Template-based Captioning
Machine Translation-based Captioning
How to generate language?
3
Image Classification: Problem Setting and Naïve Testing Phase
In the label space Y, find appropriate labels for a given image I.
Naïve testing phase: estimate scores for all possible labels, then threshold to get the result labels.
5
Image Captioning (1): Problem Setting and Naïve Testing Phase
In the sentence space S = V^N (all length-N word sequences over a vocabulary V), find appropriate sentences for a given image I.
Naïve testing phase: estimate scores for all possible sentences, then threshold to get the result sentences. This is intractable: with |V| = 10,000 and N = 10 there are already 10^40 candidate sentences.
6
Image Captioning (2): Problem Setting and a Practical Testing Phase
In the sentence space S = V^N, find appropriate sentences for a given image I.
Testing phase: retrieve relevant sentences, estimate scores of the retrieved sentences only, then threshold to get the result sentences.
7
Image Captioning (3): Ranking Sentences vs. Generating Sentences
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]
8
Statistical Language Model
A (statistical) language model is a probability distribution over sequences of words: it assigns a probability $P(w_n, w_{n-1}, \dots, w_1)$ to every sentence.
$$P(w_n, w_{n-1}, \dots, w_1) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}, \dots, w_1) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1) \cdots P(w_n \mid w_{n-1}, \dots, w_1)$$
Example: "A woman holding a camera in a crowd."
$$P(\text{a})\, P(\text{woman} \mid \text{a})\, P(\text{holding} \mid \text{a, woman}) \cdots P(\text{crowd} \mid \text{a}, \dots, \text{a})$$
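To make the chain-rule factorization concrete, here is a minimal sketch that scores a sentence in log space; cond_prob is a hypothetical callback standing in for whatever model supplies P(w_i | history).

```python
import math

# Score a sentence as sum_i log P(w_i | w_1..w_{i-1}).
# cond_prob(word, history) is a hypothetical model interface.
def sentence_log_prob(words, cond_prob):
    total = 0.0
    for i, w in enumerate(words):
        total += math.log(cond_prob(w, tuple(words[:i])))
    return total

# Toy usage: a uniform model over a 10-word vocabulary.
uniform = lambda w, history: 0.1
caption = "a woman holding a camera in a crowd".split()
print(sentence_log_prob(caption, uniform))  # = 8 * log(0.1)
```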
9
N-gram Model
An N-gram model makes an (N−1)-order Markov assumption: each word depends only on the previous N−1 words.
$$P(w_n, \dots, w_1) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}, \dots, w_1) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}, \dots, w_{i-N+1})$$
Example (4-gram): "A woman holding a camera in a crowd."
$$P(\text{crowd} \mid \text{a, in, camera, a, holding, woman, a}) \approx P(\text{crowd} \mid \text{a, in, camera})$$
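A toy maximum-likelihood N-gram estimator illustrates the truncation; this is a sketch using raw counts with no smoothing, and the one-sentence corpus is made up.

```python
from collections import Counter

# Estimate P(w | previous n-1 words) from raw n-gram counts (no smoothing).
def ngram_prob(corpus, n):
    ngrams, contexts = Counter(), Counter()
    for sent in corpus:
        toks = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(n - 1, len(toks)):
            ngrams[tuple(toks[i - n + 1 : i + 1])] += 1
            contexts[tuple(toks[i - n + 1 : i])] += 1
    return lambda w, ctx: ngrams[tuple(ctx) + (w,)] / max(contexts[tuple(ctx)], 1)

corpus = [["a", "woman", "holding", "a", "camera", "in", "a", "crowd"]]
p = ngram_prob(corpus, 2)  # bigram model
print(p("woman", ["a"]))   # -> 1/3: "a" is followed by woman, camera, crowd
```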
10
Sentence Generation by LM (1)
In a language model, a relevant sentence is a high-probability sentence, so retrieving relevant sentences reduces to searching for high-probability sentences. Search schemes:
Exhaustive search
Greedy search
Beam search
11
Sentence Generation by LM (2)
Greedy search: at each step, pick the word with the highest probability.
Beam search: greedy search while retaining the K best partial paths.
$$P(w_n, w_{n-1}, \dots, w_1) = \prod_{i=1}^{n} P(w_i \mid w_{i-1}, \dots, w_1) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2, w_1) \cdots P(w_n \mid w_{n-1}, \dots, w_1)$$
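A minimal beam-search sketch under an assumed interface: next_probs(history) is a hypothetical callback returning {word: P(word | history)}. With beam_size=1 this degenerates to greedy search.

```python
import heapq, math

def beam_search(next_probs, beam_size=3, max_len=10):
    beams = [(0.0, ["<s>"])]  # (log-probability, partial sentence)
    for _ in range(max_len):
        candidates = []
        for logp, seq in beams:
            if seq[-1] == "</s>":          # finished hypotheses carry over
                candidates.append((logp, seq))
                continue
            for w, p in next_probs(seq).items():
                candidates.append((logp + math.log(p), seq + [w]))
        # keep only the K best partial paths
        beams = heapq.nlargest(beam_size, candidates, key=lambda c: c[0])
    return max(beams, key=lambda c: c[0])[1]

# Toy model: always 70% "a", 20% "cat", 10% end-of-sentence.
toy = lambda seq: {"a": 0.7, "cat": 0.2, "</s>": 0.1}
print(beam_search(toy, beam_size=2, max_len=4))
```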
12
Image Captioning Pipeline
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]
13
#1. From Captions to Visual Concepts and Back (CVPR 2015)
Language Model with Detected Words (Labels)
[Pipeline diagram: Input Image → feature → Detected Words → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]
15
1. Word Detection: Learning a Weakly-Supervised Detector
The 1,000 most frequent words in the training set cover over 92% of word occurrences.
Approach: a fully convolutional network (FCN) with a noisy-OR version of multiple-instance learning (MIL).
Input: sets of "positive" and "negative" bags of bounding boxes for each word.
The probability that bag $b_i$ contains word $w$:
$$p_i^w = 1 - \prod_{j \in b_i} \left(1 - p_{ij}^w\right)$$
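The noisy-OR combination is just a product over boxes; the sketch below evaluates it on made-up per-box probabilities.

```python
import numpy as np

# p_i^w = 1 - prod_j (1 - p_ij^w): the bag is positive if any box fires.
def noisy_or(box_probs):
    box_probs = np.asarray(box_probs)
    return 1.0 - np.prod(1.0 - box_probs)

print(noisy_or([0.1, 0.05, 0.6]))  # one confident box dominates -> 0.658
print(noisy_or([0.1, 0.1, 0.1]))   # weak evidence everywhere    -> 0.271
```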
16
Fully Convolutional Network with MIL
[Diagram: with fully connected layers, a 224×224 query image yields a 1×1×1000 output vector; replacing the fully connected layers with convolutions lets a 565×565 query image yield a 12×12×1000 output map.]
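The shape arithmetic can be reproduced with a toy stand-in; the layers below are made up (the paper's network is much deeper), chosen only so that a 224×224 input yields a 1×1×1000 vector and a 565×565 input yields a 12×12×1000 map.

```python
import torch
import torch.nn as nn

# Stand-in trunk: one strided convolution in place of a deep CNN.
trunk = nn.Conv2d(3, 64, kernel_size=7, stride=32)
# The former fully connected head, rewritten as a convolution over the
# trunk's 7x7 output window, so it slides over larger inputs.
head = nn.Conv2d(64, 1000, kernel_size=7)

small = torch.randn(1, 3, 224, 224)
large = torch.randn(1, 3, 565, 565)
print(head(trunk(small)).shape)  # torch.Size([1, 1000, 1, 1])
print(head(trunk(large)).shape)  # torch.Size([1, 1000, 12, 12])
```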
17
2. Sentence Generation: Beam Search with a Blackboard
The language model is additionally conditioned on the set of detected words not yet mentioned, and each word is crossed off the blackboard once it is emitted (a code sketch follows the example):
$$P(w_1 \mid \tilde{V}_0)\, P(w_2 \mid w_1, \tilde{V}_1) \cdots P(w_n \mid w_{n-1}, \dots, w_1, \tilde{V}_{n-1})$$
1. A ____ (remaining: woman, crowd, cat, camera, holding, purple)
2. A woman ____ (remaining: crowd, cat, camera, holding, purple)
3. A woman holding ____ (remaining: crowd, cat, camera, purple)
...
N. A woman holding a camera in a crowd (remaining: cat, purple)
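A sketch of the blackboard bookkeeping, assuming a hypothetical scorer cond_prob(word, history, remaining); the real system scores with a maximum-entropy LM and uses beam search rather than the greedy loop shown here.

```python
def generate(cond_prob, vocab, detected, max_len=12):
    history, remaining = ["<s>"], set(detected)
    for _ in range(max_len):
        # greedy choice for brevity; the paper uses beam search
        word = max(vocab, key=lambda w: cond_prob(w, history, remaining))
        history.append(word)
        remaining.discard(word)  # cross the word off the blackboard
        if word == "</s>":
            break
    return history[1:], remaining

# Toy scorer that prefers still-unmentioned detected words:
toy = lambda w, h, rem: 0.9 if w in rem else (0.5 if w == "</s>" and not rem else 0.1)
vocab = ["woman", "camera", "crowd", "the", "</s>"]
print(generate(toy, vocab, detected={"woman", "camera", "crowd"}))
```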
18
3. Re-Ranking Sentences with Off-the-Shelf Algorithms
MERT (Minimum Error Rate Training), optimized for BLEU on the validation set.
Similarity measure: DMSM (Deep Multimodal Similarity Model), sketched below.
Image model: fine-tuned VGG16
Text model: semantic vector
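As a rough sketch of DMSM-style re-ranking, assume both models map their inputs into a shared space and candidates are ranked by cosine similarity; the random vectors below merely stand in for the learned image and text embeddings.

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

rng = np.random.default_rng(0)
image_vec = rng.normal(size=256)          # stand-in for the image model output
caption_vecs = rng.normal(size=(5, 256))  # stand-ins for text semantic vectors

scores = [cosine(image_vec, c) for c in caption_vecs]
print(sorted(range(5), key=lambda i: -scores[i]))  # caption indices, best first
```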
19
Results on Microsoft COCO
COCO Challenge (c5):

         BLEU-1  BLEU-2  BLEU-3  BLEU-4  M1     M2
Human    0.663   0.469   0.321   0.217   0.638  0.675
MSR      0.695   0.526   0.391   0.291   0.268  0.322
Google   0.713   0.542   0.407   0.309   0.273  0.317

c5: five reference captions for every train/val/test image.
M1: percentage of captions judged better than or equal to a human caption.
M2: percentage of captions that pass the Turing test.
20
#2. Show and Tell: A Neural Image Caption Generator (CVPR 2015)
Neural language model on a CNN feature; no need to re-rank!
[Pipeline diagram: Input Image → feature → Language Model → Generating Sentences → Result Captions]
21
Novelty in Captions (validation c4)
         BLEU-4  Unique Captions (%)  Seen in Training (%)
Human    -       99.4                 4.8
MSR      25.7    47.0                 30.0
Google   27.2                         ~80

The large fraction of generated captions already seen in training points to overfitting.
22
Discussion: Generating Sentences with a (Sequential) Language Model
How can the language model be modified to avoid overfitting?
23
Discussion: Image Captioning
24
End.
25
#3. Phrase-based Image Captioning (ICML 2015)
Phrase-based Language Model
[Pipeline diagram: Input Image → feature → Phrase Model → Sentence Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions]
26
Extra pages: Image Captioning
[Summary diagrams of the pipelines covered:
Show and Tell: Input Image → CNN → feature → LSTM-LM → Generating Sentences → Result Captions
Generic pipeline: Input Image → feature → Language Model → Generating Sentences → Sentences → Ranking Sentences → Result Captions
MSR system: Input Image → MIL CNN → words → ME-LM → Generating Sentences → sentences → feature → Ranking Sentences → Result Captions]
27
1. Word Detection
Example caption: "A person with a helmet is riding a bicycle."
Words = a / person / with / helmet / is / riding / bicycle