Presentation on theme: "Beyond Attributes -> Describing Images"— Presentation transcript:

1 Beyond Attributes -> Describing Images
Tamara L. Berg UNC Chapel Hill

2 Berg, Attributes Tutorial CVPR13
Descriptive Text “It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” Scarlett O’Hara described in Gone with the Wind. People sometimes produce very vivid and richly informative descriptions about the visual world. For example, here the writer says “it was an arresting face, pointed of chin, …” Berg, Attributes Tutorial CVPR13

3 More Nuance than Traditional Recognition…
person car You’ll notice that this human output of recognition is quite different from traditional computer vision recognition outputs which might recognize this picture as a person, this one as a shoe or this one as a car. shoe Berg, Attributes Tutorial CVPR13

4 Toward Complex Structured Outputs
car A lot of research in visual recognition has focused on producing categorical labels for items Berg, Attributes Tutorial CVPR13

5 Toward Complex Structured Outputs
pink car Today we’ve been talking about attributes, which are a first step toward producing more complex structured recognition outputs. Attributes of objects. Berg, Attributes Tutorial CVPR13

6 Toward Complex Structured Outputs
car on road We can also think about recognizing the context of where objects are located with respect to the overall scene or relative to other objects – maybe recognizing that this is a car on a road Relationships between objects Berg, Attributes Tutorial CVPR13

7 Toward Complex Structured Outputs
Little pink smart car parked on the side of a road in a London shopping district. … Complex structured recognition outputs Ultimately we might like our recognition systems to produce more complete predictions about the objects, their appearance, their relationships, actions, and context. Perhaps even going so far as to produce a short description of the image that tells the “story behind the image.” For this image we might like to say something like “little pink smart car…” Telling the “story of an image” Berg, Attributes Tutorial CVPR13

8 Learning from Descriptive Text
“It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns” Scarlett O’Hara described in Gone with the Wind. How does the world work? I think looking at human descriptions can help inform computer vision in terms of what we should recognize and also what kinds of structured outputs we might produce about image content. In general, I believe we can learn from studying visually descriptive language in a number of ways: 1) Descriptive language can provide information about how the world works, especially the visual world. For example, looking at lots of descriptive text like this could tell us what the possible modifiers of eyebrows are – they can be black, slanted, bushy, thin, etc. 2) Descriptive text can also provide data for understanding how people describe the world, and how they construct natural language related to images – what kinds of things do they mention, and what do they perceive as important. 3) Lastly, I think collections of descriptive text might guide visual recognition research of the future. Visually descriptive language provides: information about the world, especially the visual world; information about how people construct natural language for imagery; guidance for visual recognition. What should we recognize? How do people describe the world? Berg, Attributes Tutorial CVPR13

9 Berg, Attributes Tutorial CVPR13
Methodology A random Pink Smart Car seen driving around Lambeth Roundabout and onto Lambeth Bridge. Smart Car. It was so adorable and cute in the parking lot of the post office, I had to stop and take a picture. Pink Car Sign Door Motorcycle Tree Brick building Dirty Road Sidewalk London Shopping district Natural language description I’ll talk about two types of methods being used for generating image descriptions The first is to recognize some bits of image content in the query image and then directly construct a description based on those predictions. The second is to find relevant text that we might re-use to produce a description for the query image. Generation Methods: Compose descriptions directly from recognized content Retrieve relevant existing text given recognized content Berg, Attributes Tutorial CVPR13

10 Berg, Attributes Tutorial CVPR13
Related Work. Compose descriptions given recognized content: Yao et al. (2010), Yang et al. (2011), Li et al. (2011), Kulkarni et al. (2011). Generation as retrieval: Farhadi et al. (2010), Ordonez et al. (2011), Gupta et al. (2012), Kuznetsova et al. (2012). Generation using pre-associated relevant text: Leong et al. (2010), Aker and Gaizauskas (2010), Feng and Lapata (2010a). Other (image annotation, video description, etc.): Barnard et al. (2003), Pastra et al. (2003), Gupta et al. (2008), Gupta et al. (2009), Feng and Lapata (2010b), del Pero et al. (2011), Krishnamoorthy et al. (2012), Barbu et al. (2012), Das et al. (2013). There has been a fair amount of related work in this area recently, ranging from the generation methodologies I just described, to methods that take advantage of pre-associated relevant descriptive text, to methods that work on related problems like image annotation or video description. Due to time restrictions I will just talk about a few methods related to the first two kinds of image description composition – composing descriptions directly from recognized content, and description generation as retrieval. Berg, Attributes Tutorial CVPR13

11 Method 1: Recognize & Generate
So method 1 is to recognize some image content and generate natural language descriptions based on those recognition predictions. Berg, Attributes Tutorial CVPR13

12 Baby Talk: Understanding and Generating Simple Image Descriptions
Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C. Berg, Tamara L. Berg. CVPR 2011. We called this Baby Talk, and maybe you’ll understand why in a moment

13 For an image like this one we generate a description that reads
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

14 “This picture shows one person…”
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

15 One grass “This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

16 One chair “This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

17 And one potted plant “This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

18 The person is near the green grass
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

19 And in the chair (well not quite, but perhaps a reasonable interpretation given the image and what we know about people and chairs) “This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

20 The green grass is by the chair
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

21 And near the potted plant.
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

22 So this generates a pretty simplistic, but perhaps somewhat relevant description.
How did we do it? “This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” Kulkarni et al, CVPR11

23 Methodology Vision -- detection and classification
Text inputs - statistics from parsing lots of descriptive text Graphical model (CRF) to predict best image labeling given vision and text inputs Generation algorithms to generate natural language Being computer vision people the first thing we thought to try is to generate descriptions using state of the art visual recognition algorithms to determine the content of the image. Kulkarni et al, CVPR11

24 Vision is hard! Green sheep
So we can detect objects in the image. Here we correctly detect a sheep, but as we all know vision isn’t perfect so we might predict this as the bounding box for our sheep, and if we’re not careful we might predict this as a green sheep. Clearly that’s not right! So our idea here was that if we can extract some simple world knowledge from descriptive text then we could use that to smooth noisy vision predictions like this. World knowledge (from descriptive text) can be used to smooth noisy vision predictions! Kulkarni et al, CVPR11

25 Methodology Vision -- detection and classification
Text -- statistics from parsing lots of descriptive text Graphical model (CRF) to predict best image labeling given vision and text inputs Generation algorithms to generate natural language So in addition to using computer vision predictions it turns out to be quite useful to incorporate information gleaned from parsing lots and lots of descriptive text. Kulkarni et al, CVPR11

26 Learning from Descriptive Text
Attributes a very shiny car in the car museum in my hometown of upstate NY. green green grass by the lake You can think of this as a sort of simple world knowledge learned from textual descriptions, specifically from text associated with Flickr images. In this paper we collect information from descriptive text to learn about attributes of objects such as that grass is green or cars can be shiny. We also use descriptive text to learn about relationships between pairs of objects. Such as the fact that people are often in chairs or cats can be on sofas. Relationships very little person in a big rocking chair Our cat Tusik sleeping on the sofa near a hot radiator. Kulkarni et al, CVPR11
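As a rough illustration of how such attribute and relationship statistics might be mined from captions, here is a minimal sketch using spaCy's dependency parse. This is not the exact text-processing pipeline from the paper; the example captions and the choice of spaCy are assumptions made only for illustration.

```python
# Sketch: mine attribute (adjective-noun) and simple relationship statistics
# from descriptive captions with spaCy's dependency parse. Illustration only;
# not the exact text-processing pipeline used in the paper.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

attribute_counts = Counter()  # (noun, adjective), e.g. ("grass", "green")
relation_counts = Counter()   # (noun, preposition, noun), e.g. ("cat", "on", "sofa")

captions = [
    "green green grass by the lake",
    "a very shiny car in the car museum",
    "our cat Tusik sleeping on the sofa near a hot radiator",
]

for doc in nlp.pipe(captions):
    for token in doc:
        if token.pos_ != "NOUN":
            continue
        for child in token.children:
            # adjectival modifier of a noun -> attribute statistic
            if child.dep_ == "amod":
                attribute_counts[(token.lemma_, child.lemma_)] += 1
            # noun --prep--> pobj pattern -> object/stuff relationship statistic
            if child.dep_ == "prep":
                for obj in child.children:
                    if obj.dep_ == "pobj" and obj.pos_ == "NOUN":
                        relation_counts[(token.lemma_, child.text.lower(), obj.lemma_)] += 1

print(attribute_counts.most_common(5))
print(relation_counts.most_common(5))
```

Run over millions of captions, counts like these give exactly the kind of simple world knowledge (grass is green, cats sit on sofas) used to smooth the vision predictions.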

27 Methodology Vision -- detection and classification
Text -- statistics from parsing lots of descriptive text Model (CRF) to predict best image labeling given vision and text based potentials Generation algorithms to compose natural language So in the end our method combines vision predictions for an image smoothed with text. We put that into a crf to predict image content given vision and text based potentials, and then finally use some simple generation algorithms to compose natural language from the crf prediction. Kulkarni et al, CVPR11

28 System Flow
Input Image → Extract objects/stuff: a) dog, b) person, c) sofa → Predict attributes for each object (e.g. brown 0.32, striped 0.09, furry 0.04, wooden 0.2, feathered 0.04, …) → Predict prepositions for each object pair (e.g. near(a,b) 1, against(a,b) 0.11, beside(a,b) 0.24, near(a,c) 1, against(a,c) 0.3, beside(a,c) 0.5, near(b,c) 1, against(b,c) 0.67, beside(b,c) 0.0, …) → Predict labeling – vision potentials smoothed with text potentials: <<null,person_b>,against,<brown,sofa_c>>, <<null,dog_a>,near,<null,person_b>>, <<null,dog_a>,beside,<brown,sofa_c>> → Generate natural language description: “This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa.” So here’s the overall flow. We start off with an input image. We detect objects and stuff present in that image – here a dog, a person, and a sofa. For each object we can predict possible attributes – such as this dog may be furry, the sofa is probably brown. Between pairs of objects we can predict preposition relationships, such as the dog is near the person, or the person is against the sofa. These vision-based potentials are combined with text potentials in a CRF framework to produce a final labeling for the image. In this case we predict “null – or no attribute – person against brown sofa, dog near person, dog beside brown sofa.” Finally we utilize simple generation methods to compose the resulting English description. (Inference uses sequential tree-reweighted message passing; parameters are set with cross validation on a training set – this was sort of a prototype, and I think there is probably room for improvement in the model and learning. For example, we only use a single parameter for weighting the detection scores, but we know that these models are not necessarily calibrated well with respect to each other. TRW algorithms are inspired by the problem of maximizing a lower bound on the energy.) Kulkarni et al, CVPR11
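To make the “vision potentials smoothed with text potentials” idea concrete, here is a tiny sketch of the kind of log-linear combination involved. The actual system is a CRF over objects, attributes and prepositions solved with tree-reweighted message passing; the classifier scores and prior numbers below are made up purely for illustration.

```python
# Sketch: smoothing a noisy vision prediction with a text-derived prior.
# The real system is a CRF over objects, attributes and prepositions; this
# brute-force scoring of a single (attribute, object) choice only illustrates
# the idea. All numbers are invented.
import math

# Hypothetical attribute-classifier scores for one detected object (a sheep).
attribute_scores = {"green": 0.55, "white": 0.40, "furry": 0.35}

# Text prior: relative frequency of (attribute, object) pairs mined from captions.
text_prior = {("green", "sheep"): 0.001, ("white", "sheep"): 0.30, ("furry", "sheep"): 0.25}

def labeling_score(attr, obj, alpha=0.5):
    """Log-linear combination of a vision potential and a text potential."""
    vision = math.log(attribute_scores[attr] + 1e-6)
    text = math.log(text_prior.get((attr, obj), 1e-6))
    return alpha * vision + (1 - alpha) * text

best = max(attribute_scores, key=lambda a: labeling_score(a, "sheep"))
print(best)  # "white": the text prior overrules the visually tempting "green"
```

This is exactly the green-sheep example from a few slides back: the text statistics make the implausible attribute expensive even when the classifier likes it.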

29 Some good results This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road. Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky. This is a picture of two dogs. The first dog is near the second furry dog. Kulkarni et al, CVPR11

30 Some bad results
Missed detections: “Here we see one potted plant.” “This is a picture of one dog.” False detections: “There are one road and one cat. The furry road is in the furry cat.” “This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road.” Incorrect attributes: “This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass.” “This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green grass.” Of course it doesn’t always work! Some common mistakes are: missed detections, false detections, and incorrectly predicted attributes. Kulkarni et al, CVPR11

31 Algorithm vs Humans Sounds unnatural!
“This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant.” H1: A Lemonaide stand is manned by a blonde child with a cookie. H2: A small child at a lemonade and cookie stand on a city corner. H3: Young child behind lemonade stand eating a cookie. So where are we in image description generation? Well, something you may have noticed is that our descriptions don’t really sound very human; in fact they sound kind of robotic compared to the descriptions provided by people for an image. This motivates a need for another type of approach to description generation. Sounds unnatural! Kulkarni et al, CVPR11

32 Method 2: Retrieval based generation
These are retrieval based algorithms that make use of existing text for description generation Berg, Attributes Tutorial CVPR13

33 Every picture tells a story, describing images with meaningful sentences
Ali Farhadi, Mohsen Hejrati, Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth. ECCV 2010. Our co-organizer Ali Farhadi did some of the first work in this area Slides provided by Ali Farhadi

34 A Simplified Problem Represent image/text content as subject-verb-scene triple Good triples: (ship, sail, sea) (boat, sail, river) (ship, float, water) Bad triples: (boat, smiling, sea) – bad relations (train, moving, rail) – bad words (dog, speaking, office) - both Here the idea was to simplify the generation problem a bit by defining a new representation where you represent images or text descriptions as a subject-verb-scene triple – called the meaning space. Some good triples for this image might be (ship, sail, sea) or (boat, sail, river) while some bad triples might be (boat, smiling, sea) or (train, moving, rail) Farhadi et al, ECCV10

35 The Expanded Model Map from Image Space to Meaning Space
The idea of their approach was to map from the image space to the meaning space, and from the sentence space to the meaning space. Then for a new query image you could caption it via this meaning space. Map from Image Space to Meaning Space Map from Sentence Space to Meaning Space Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV10

36 Retrieval through meaning space
So the first thing you have to do is define these mappings. Map from Image Space to Meaning Space Map from Sentence Space to Meaning Space Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV10

37 Image Space  Meaning Space
On the image side mapping was performed by predicting image content using trained classifiers for words. For example here you would hopefully predict (horse, ride, field) Predict Image Content using trained classifiers Farhadi et al, ECCV10

38 Retrieval through meaning space
On the text side you want to do the same thing, map from the sentence space to the same meaning space Map from Image Space to Meaning Space Map from Sentence Space to Meaning Space Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV10

39 Sentence Space  Meaning Space
Extract subject, verb and scene from sentences in the training data black cat over pink chair A black color cat sitting on chair in a room. cat sitting on a chair looking in a mirror. Subject: Cat Verb: Sitting Scene: room Use taxonomy trees (Object → Vehicle: Car, Train, Bike; Animal: Cat, Horse, Dog; Human) So they used some text processing to map from sentences to subject, verb, and scene, incorporating taxonomy trees to make this mapping more robust to differences between the words used in sentences and the words in the meaning space Farhadi et al, ECCV10

40 Retrieval through meaning space
Given a new image you can map it to the meaning space and then retrieve sentences with the same or similar mappings Map from Image Space to Meaning Space Map from Sentence Space to Meaning Space Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV10
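A minimal sketch of this retrieval step, assuming triples have already been predicted for the query image and extracted from the training sentences; the weighted slot-matching similarity and the toy data are illustrative stand-ins for the learned scoring in the paper.

```python
# Sketch: retrieve sentences for a query image through the <subject, verb, scene>
# meaning space. Triples are assumed to be precomputed on both sides; the
# similarity is a simple weighted slot match, not the paper's learned scoring.
def triple_similarity(t1, t2, weights=(1.0, 0.7, 0.7)):
    """Sum the weight of every slot (subject, verb, scene) on which the triples agree."""
    return sum(w for w, a, b in zip(weights, t1, t2) if a == b)

# (triple, sentence) pairs extracted from captioned training images (toy data).
sentence_pool = [
    (("ship", "sail", "sea"), "A tall ship sailing across a calm sea."),
    (("boat", "sail", "river"), "A small boat sails slowly down the river."),
    (("horse", "ride", "field"), "A rider on a horse crossing an open field."),
]

query_triple = ("ship", "float", "sea")  # predicted from the query image
ranked = sorted(sentence_pool,
                key=lambda ts: triple_similarity(query_triple, ts[0]),
                reverse=True)
print(ranked[0][1])  # the best-matching sentence is transferred as the description
```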

41 This often produces reasonable results
This often produces reasonable results – e.g. “beautiful scenery surrounds a fluffy sheep” Farhadi et al, ECCV10

42 Or eatables in the refrigerator
You can see that these descriptions sound a lot more human than the ones I showed earlier – because they were actually written by people Farhadi et al, ECCV10

43 But of course you can have mistakes like “the two girls read to drive big bullet”.
Farhadi et al, ECCV10

44 Berg, Attributes Tutorial CVPR13
Data. 1,000 images (Rashtchian et al 2010, Farhadi et al 2010): 5 descriptions per image, 20 object categories. 20,000 images (ImageCLEF challenge): 2 descriptions per image, select image categories. More data needed? The dataset used in this approach was derived from PASCAL images with relatively limited image content and contained only 1,000 images – perhaps not enough for general retrieval-based generation methods. There is also a nice dataset from the ImageCLEF folks containing 20k images with associated descriptions. But is this enough for data-driven generation? Perhaps more data is needed. Large amounts of paired data can help us study the image-language relationship. Berg, Attributes Tutorial CVPR13

45 Berg, Attributes Tutorial CVPR13
Example Flickr captions: “Through the smoke”, “Duna”, “Portrait #5”, “Mirror and gold”, “the cat lounging in the sink”. So do images with nice descriptive captions exist? Well it turns out that they do, but they are somewhat obscured by lots and lots of images with text that does not directly describe the image content. Data exists, but buried in junk! Berg, Attributes Tutorial CVPR13

46 SBU Captioned Photo Dataset http://tamaraberg.com/sbucaptions
1 million captioned photos! The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon Little girl and her dog in northern Thailand. They both seemed interested in what we were doing Interior design of modern white and brown living room furniture against white wall with a lamp hanging. So my student Vicente put in a lot of effort to sift through all of these images and find those that have descriptive image captions. He went to Flickr and performed a ton of search queries to find images. He downloaded about 500 million images using 12 parallel download streams over a period of about 2 months. Then, we automatically filtered the set using various heuristics to get captions that are often visually descriptive of the associated image. In the end we got a collection of over 1 million captioned photos. These captions tend to be visually relevant to their associated image. For example, “man sits in a rusted car buried in the sand on waitarere beach” or “little girl and her dog in northern thailand. They both seemed interested in what we were doing” (by requiring image captions to be of satisfactory length and to contain at least 2 object, attribute, scene, or action words and a prepositional word). Man sits in a rusted car buried in the sand on Waitarere beach Our dog Zoe in her bed Emma in her hat looking super cute Berg, Attributes Tutorial CVPR13
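A hedged sketch of the kind of heuristic filtering described in the parentheses above; the word lists and length thresholds are tiny illustrative placeholders, not the actual vocabularies or values used to build the dataset.

```python
# Sketch of a heuristic caption filter: keep a caption only if it has a
# reasonable length, mentions at least two visual terms (objects, attributes,
# scenes or actions) and contains a preposition. Word lists are toy stand-ins.
VISUAL_TERMS = {"car", "dog", "girl", "cat", "statue", "clock", "beach",
                "sand", "pink", "rusted", "little", "sitting", "buried"}
PREPOSITIONS = {"in", "on", "at", "under", "over", "by", "near", "with", "against"}

def keep_caption(caption, min_words=6, max_words=30):
    words = caption.lower().replace(".", " ").replace(",", " ").split()
    if not (min_words <= len(words) <= max_words):
        return False
    visual_hits = sum(1 for w in words if w in VISUAL_TERMS)
    has_prep = any(w in PREPOSITIONS for w in words)
    return visual_hits >= 2 and has_prep

print(keep_caption("Man sits in a rusted car buried in the sand on Waitarere beach"))  # True
print(keep_caption("Portrait #5"))                                                     # False
```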

47 “Im2Text: Describing Images Using 1 Million Captioned Photographs”
Vicente Ordonez, Girish Kulkarni, Tamara L. Berg NIPS 2011 This was the result

48 Big Data Driven Generation
An old bridge over dirty green water. One of the many stone bridges in town that carry the gravel carriage roads. A stone bridge over a peaceful river. With that much data we can now generate natural sounding descriptions for a wide variety of image contents. For example for this query image, we might produce a caption like “an old gray bridge over dirty green water. Or “a stone bridge over a peaceful river” Generate natural sounding descriptions using existing captions Ordonez et al, NIPS11

49 … Harness the Web! Global Matching (GIST + Color) Transfer Caption(s)
SBU Captioned Photo Dataset 1 million captioned images! The water is clear enough to see fish swimming around in it. Bridge to temple in Hoan Kiem lake. A walk around the lake near our house with Abby. We tried a number of methods, from really simple tricks like GIST- and color-based global matching for retrieval. Here we might caption the query with “the water is clear enough to see fish swimming around in it”. Hangzhou bridge in West lake. The daintree river by boat. Transfer Caption(s) e.g. “The water is clear enough to see fish swimming around in it.” Smallest house in paris between red (on right) and beige (on left). Ordonez et al, NIPS11
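A small sketch of what GIST + color global matching followed by caption transfer might look like, assuming descriptors have already been extracted offline; the descriptor dimensions and the unweighted distance combination are assumptions, not the paper's exact setup.

```python
# Sketch: retrieve globally similar database images using precomputed GIST and
# color descriptors, then transfer their captions to the query image.
import numpy as np

def global_match(query_gist, query_color, db_gist, db_color, db_captions, k=4):
    """Return captions of the k nearest database images under a summed
    L2 distance over GIST and color descriptors."""
    d_gist = np.linalg.norm(db_gist - query_gist, axis=1)
    d_color = np.linalg.norm(db_color - query_color, axis=1)
    dist = d_gist + d_color              # simple unweighted combination
    nearest = np.argsort(dist)[:k]
    return [db_captions[i] for i in nearest]

# Toy example with random descriptors standing in for the 1M-image database.
rng = np.random.default_rng(0)
db_gist, db_color = rng.random((1000, 512)), rng.random((1000, 64))
captions = [f"caption {i}" for i in range(1000)]
print(global_match(rng.random(512), rng.random(64), db_gist, db_color, captions))
```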

50 Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions)
The bridge over the lake on Suzhou Street. Iron bridge over the Duck river. Once we have a small set of globally matched images, we also try to re-rank this set using text-based measures from the matched set, as well as more computationally intensive visual recognition predictions like detection and visual similarity of objects, stuff, scene, or activities. For this query we now generate, for example, “the bridge over the lake on Suzhou Street”. Transfer Caption(s) e.g. “The bridge over the lake on Suzhou Street.” The Daintree river by boat. Bridge over Cacapon river. … Ordonez et al, NIPS11
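And a correspondingly simple sketch of the reranking step: the globally matched candidates are rescored by how well their predicted content and caption words overlap with the content predicted for the query. The overlap measure and weights here are illustrative only, not the paper's actual scoring.

```python
# Sketch: rerank globally matched candidates using estimated high-level content.
# Each candidate carries the content labels predicted for its image plus its caption.
def rerank(query_content, candidates):
    """candidates: list of (caption, content_labels) from the global-match step."""
    def score(item):
        caption, labels = item
        content_overlap = len(query_content & labels)
        caption_overlap = len(query_content & set(caption.lower().split()))
        return 2.0 * content_overlap + 1.0 * caption_overlap
    return sorted(candidates, key=score, reverse=True)

query = {"bridge", "water", "lake"}
cands = [
    ("The water is clear enough to see fish swimming around in it.", {"water", "fish"}),
    ("The bridge over the lake on Suzhou Street.", {"bridge", "lake", "water"}),
]
print(rerank(query, cands)[0][0])  # the bridge caption now ranks first
```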

51 Results
Good: “A female Mallard duck in the lake at Luukki Espoo.” “Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind.” “Fresh fruit and vegetables at the market in Port Louis Mauritius.” “Cat in sink.” Bad: “The cat in the window.” “The boat ended up a kilometre from the water in the middle of the airstrip.” These simple methods sometimes work surprisingly well – producing natural and relevant captions like “cat in sink” or “fresh fruit and vegetables at the market in Port Louis Mauritius”. Other times things can go wrong, for example we predict “the cat in the window” for this owl picture. Strange or unusual depictions sometimes can’t be captioned well because it’s hard to find relevant text. Ordonez et al, NIPS11

52 Berg, Attributes Tutorial CVPR13
Next…. Composing novel captions from pieces of existing ones Some of you may have guessed where we went next --- there may not always be a whole caption in your database that describes the specific content of a query image. But there might be little parts of existing captions that are appropriate. So next we looked at how to compose novel captions from pieces of existing ones. Berg, Attributes Tutorial CVPR13

53 Composing captions guessing game
a) monkey playing in the tree canopy, Monte Verde in the rain forest b) capuchin monkey in front of my window c) monkey spotted in Apenheul Netherlands under the tree But before I tell you what we did let’s play a game. Here is an image with 4 automatically generated captions, and one real caption generated by the photographer. Can you guess which is the real one? ;) Is it a) monkey playing in the tree canopy, monte verde in the rain forest, b) capuchin monkey in front of my window, c) monkey spotted in apenheul netherlands under the tree… How many of you think it was a? b? c? d? e? d) a white-faced or capuchin in the tree in the garden e) the monkey sitting in a tree, posing for his picture Berg, Attributes Tutorial CVPR13

54 Composing captions guessing game
a) monkey playing in the tree canopy, Monte Verde in the rain forest b) capuchin monkey in front of my window c) monkey spotted in Apenheul Netherlands under the tree Not bad right? d) a white-faced or capuchin in the tree in the garden e) the monkey sitting in a tree, posing for his picture Berg, Attributes Tutorial CVPR13

55 “Collective Generation of Natural Image Descriptions”
Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg, Tamara L. Berg and Yejin Choi. ACL 2012. This work appeared at ACL last summer

56 Composing Descriptions
Object appearance NP: the dirty sheep Object pose VP: meandered along a desolate road Scene appearance PP: in the highlands of Scotland Region appearance & relationship PP: through frozen grass Here we compose new image descriptions from phrases retrieved from other captions. So for this image we could use measures of object appearance to retrieve noun phrases from captions with similar-looking objects, verb phrases related to object pose, prepositional phrases related to the general scene characteristics, or prepositional phrases related to specific relationships between objects and their background. We can then form new captions from the retrieved phrases, for example to construct a description like “the dirty sheep…” Example Composed Description: the dirty sheep meandered along a desolate road in the highlands of Scotland through frozen grass Kuznetsova et al, ACL12

57 SBU Captioned Photo Dataset http://tamaraberg.com/sbucaptions
1 million captioned photos! The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon Little girl and her dog in northern Thailand. They both seemed interested in what we were doing Interior design of modern white and brown living room furniture against white wall with a lamp hanging. Again we make use of our captioned photo dataset Man sits in a rusted car buried in the sand on Waitarere beach Our dog Zoe in her bed Emma in her hat looking super cute Ordonez et al, NIPS11

58 Data Processing 1,000,000 images: Run object detectors
Run region based stuff detectors (grass, sky, etc.) Run global scene classifiers Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene). But first we do some extensive pre-processing on the database to make it more useful for doing things like retrieving similar detected objects. This processing includes running about 100 object detectors, regional stuff detectors, and global scene classifiers. We also parse the captions using the Berkeley parser to get phrases. Kuznetsova et al, ACL12
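As a lightweight stand-in for that caption-parsing step (the paper uses the Berkeley parser), here is a sketch that chunks captions into NP/VP/PP phrases with an NLTK regular-expression grammar; the grammar itself is a deliberate simplification.

```python
# Sketch: pull candidate NP / VP / PP phrases out of captions with a shallow
# NLTK chunk grammar (a stand-in for full constituency parsing).
# May require: nltk.download("punkt"), nltk.download("averaged_perceptron_tagger")
import nltk

GRAMMAR = r"""
  NP: {<DT>?<JJ>*<NN.*>+}          # determiner + adjectives + nouns
  PP: {<IN><NP>}                   # preposition followed by a noun phrase
  VP: {<VB.*><RP>?<PP|NP>*}        # verb (+ particle) with its complements
"""
chunker = nltk.RegexpParser(GRAMMAR)

def extract_phrases(caption):
    tagged = nltk.pos_tag(nltk.word_tokenize(caption))
    tree = chunker.parse(tagged)
    phrases = {"NP": [], "VP": [], "PP": []}
    for subtree in tree.subtrees():
        if subtree.label() in phrases:
            phrases[subtree.label()].append(" ".join(w for w, t in subtree.leaves()))
    return phrases

print(extract_phrases("Contented dog just laying on the edge of the road in front of a house"))
```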

59 Image Description Generation
Computer Vision Objects, Actions, Stuff, Scenes Phrase Retrieval The general flow of our system is given a query image we run a bunch of detectors and classifiers to extract estimates of the image content like objects, background, and general scene category. Then we can retrieve phrases referring to the extracted image content, and generate captions from the set of retrieved phrases to produce a description for the query image. The new work here is really in two of these modules, the phrase retrieval step and the generation step. Generation Description Kuznetsova et al, ACL12

60 Image Description Generation
Computer Vision Objects, Actions, Stuff, Scenes Phrase Retrieval So how does phrase retrieval work? Generation Description Kuznetsova et al, ACL12

61 Retrieving VPs Find matching detections by pose similarity Detect: dog
Contented dog just laying on the edge of the road in front of a house.. Peruvian dog sleeping on city street in the city of Cusco, (Peru) Detect: dog Well, if we detect a dog in our query image, we can look for detections from the database showing dogs in visually similar poses. Then we can transfer the verb phrases referring to that object to the query. For example giving us – laying on the edge of the road in front of a house . Find matching detections by pose similarity this dog was laying in the middle of the road on a back street in jaco Closeup of my dog sleeping under my desk. Kuznetsova et al, ACL12
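A minimal sketch of the VP-retrieval idea: restrict to database detections of the same object class and transfer the verb phrases of the nearest ones under a pose/appearance feature distance. The feature vectors and database entries below are toy placeholders.

```python
# Sketch: transfer verb phrases from database detections of the same object
# class whose pose/appearance features are closest to the query detection.
# Features (e.g. HOG inside the detection box) are assumed precomputed.
import numpy as np

def retrieve_vps(query_class, query_feat, db_detections, k=2):
    """db_detections: list of (object_class, feature_vector, verb_phrase)."""
    same_class = [(f, vp) for c, f, vp in db_detections if c == query_class]
    dists = [np.linalg.norm(f - query_feat) for f, _ in same_class]
    order = np.argsort(dists)[:k]
    return [same_class[i][1] for i in order]

rng = np.random.default_rng(1)
db = [("dog", rng.random(36), "laying on the edge of the road in front of a house"),
      ("dog", rng.random(36), "sleeping on city street"),
      ("cat", rng.random(36), "lounging in the sink")]
print(retrieve_vps("dog", rng.random(36), db))
```

The NP, PPstuff and PPscene retrieval on the following slides works the same way, just with appearance, region and global scene features in place of pose features.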

62 Retrieving NPs Find matching detections by appearance similarity
Tray of glace fruit in the market at Nice, France Fresh fruit in the market Detect: fruit The street market in Santanyi, Mallorca is a must for the oranges and local crafts. A box of oranges was just catching the sun, bringing out detail in the skin. We can use a similar approach to retrieve noun phrases relating to a detected object. Find matching detections by appearance similarity An orange tree in the backyard of the house. mandarin oranges in glass bowl Kuznetsova et al, ACL12

63 Retrieving PPstuff Find matching regions by appearance + arrangement similarity Cordoba - lonely elephant under an orange tree... I positioned the chairs around the lemon tree -- it's like a shrine Or prepositional phrases referring to relationships between objects and their direct surroundings. Mini Nike soccer ball all alone in the grass Detect: stuff Comfy chair under a tree. Kuznetsova et al, ACL12

64 Retrieving PPscene Find matching images by global scene similarity
I'm about to blow the building across the street over with my massive lung power. Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere Extract scene descriptor Or to retrieve prepositional phrases referring to the general scene context. Find matching images by global scene similarity Only in Paris will you find a bottle of wine on a table outside a bookstore View from our B&B in this photo Kuznetsova et al, ACL12

65 Image Description Generation
Computer Vision Objects, Actions, Stuff, Scenes Phrase Retrieval Once we have retrieved a bunch of phrases we’d like to form them into a coherent description Generation Description Kuznetsova et al, ACL12

66 Position 1 / Position 2 / Position 3 / Position 4
Object NPs: birds, the bird. Action VPs: are standing, looking for food. Stuff PPs: in water, over water. Scene PPs: in the ocean, near Salt Pond. Consider this example image. For a detected object we can retrieve relevant phrases. Then we want to select and combine these phrases into a plausible sentence about the object, filling the four positions – e.g. Position 1: birds, Position 2: over water, Position 3: are standing, Position 4: in the ocean. Kuznetsova et al, ACL12

67 Possible Assignments
Each of the four positions can take any of the candidate phrases (birds, the bird, are standing, in the ocean, …), but this gives us a large number of potential assignments. Kuznetsova et al, ACL12

68 Possible Assignments
For instance, one assignment picks one candidate phrase (birds, the bird, are standing, in the ocean, …) for each position. Kuznetsova et al, ACL12

69 Possible Assignments
Or another assignment picks a different combination of the candidate phrases for the four positions. Kuznetsova et al, ACL12

70 Phrases of the Same Type
But there are a few limitations. For instance we might construct a sentence with two noun phrases describing the same object – “the bird”, “birds”. So we’d like to add a constraint that we shouldn’t include multiple phrases of the same type. Kuznetsova et al, ACL12

71 Singular/Plural Relationships
We’d also like to include other constraints, such as enforcing singular/plural relationships (to avoid situations like “the bird are standing”), and so on. Kuznetsova et al, ACL12

72 ILP Optimization Vision scores Phrase cohesion Linguistic constraints
Visual detection/classification scores Phrase cohesion n-gram statistics between phrases Co-occurrence statistics between phrase head words Linguistic constraints Allow at most one phrase of each type Enforce plural/singular agreement between NP and VP Discourse constraints Prevent inclusion of repeated phrasing Optimize for: So we formulate our problem as an ILP optimization where we’re trying to select and combine phrases into sentences such that our description has good vision scores and good cohesion between phrases, subject to some linguistic and discourse constraints. (Note: commercial software is used for the optimization – CPLEX.) Subject to: Kuznetsova et al, ACL12
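Here is a small, self-contained sketch of such a phrase-selection ILP written with the open-source PuLP solver (the note above mentions CPLEX being used in practice). The candidate phrases, scores, cohesion values and the exact constraint set are toy assumptions meant only to show the structure of the optimization, not the paper's formulation.

```python
# Sketch of the phrase-selection ILP: binary variables pick one phrase per slot,
# the objective sums vision scores plus pairwise cohesion (linearized products),
# and linguistic constraints enforce one phrase per type and number agreement.
from itertools import product
import pulp

slots = ["NP", "VP", "PPstuff", "PPscene"]
candidates = {
    "NP":      [("birds", 0.9, "plural"), ("the bird", 0.8, "singular")],
    "VP":      [("are standing", 0.7, "plural"), ("is looking for food", 0.6, "singular")],
    "PPstuff": [("in water", 0.5, None), ("over water", 0.4, None)],
    "PPscene": [("in the ocean", 0.6, None), ("near Salt Pond", 0.3, None)],
}
# Toy pairwise cohesion between phrases chosen for adjacent slots.
cohesion = lambda p, q: 0.2

prob = pulp.LpProblem("caption_ilp", pulp.LpMaximize)
x = {(s, i): pulp.LpVariable(f"x_{s}_{i}", cat="Binary")
     for s in slots for i, _ in enumerate(candidates[s])}

obj = [candidates[s][i][1] * x[s, i] for s, i in x]       # vision scores
for a, b in zip(slots, slots[1:]):                        # cohesion terms
    for i, j in product(range(len(candidates[a])), range(len(candidates[b]))):
        y = pulp.LpVariable(f"y_{a}_{i}_{b}_{j}", cat="Binary")
        prob += y <= x[a, i]
        prob += y <= x[b, j]
        prob += y >= x[a, i] + x[b, j] - 1                # y = x[a,i] * x[b,j]
        obj.append(cohesion(candidates[a][i][0], candidates[b][j][0]) * y)
prob += pulp.lpSum(obj)

# Linguistic constraints: exactly one phrase per slot, NP/VP number agreement.
for s in slots:
    prob += pulp.lpSum(x[s, i] for i, _ in enumerate(candidates[s])) == 1
for i, (_, _, n_np) in enumerate(candidates["NP"]):
    for j, (_, _, n_vp) in enumerate(candidates["VP"]):
        if n_np != n_vp:
            prob += x["NP", i] + x["VP", j] <= 1

prob.solve(pulp.PULP_CBC_CMD(msg=False))
sentence = " ".join(candidates[s][i][0] for s in slots
                    for i, _ in enumerate(candidates[s]) if x[s, i].value() == 1)
print(sentence)  # e.g. "birds are standing in water in the ocean"
```

On this toy instance the agreement constraint is what keeps the plural NP paired with the plural VP, which is exactly the kind of error (“the bird are standing”) the real constraints are there to prevent.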

73 Good Examples This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings. The clock made in Korea. This is a brass viking boat moored on beach in Tobago by the ocean. this allows us to generate captions like “this is a sporty little…” Kuznetsova et al, ACL12

74 Visual Turing Test Us vs Original Human Written Caption
In some cases (16%), ILP-generated captions were preferred over human-written ones! Sometimes our captions are even preferred over the original human-written ones! Kuznetsova et al, ACL12

75 Bad Results Not Relevant Grammatically Incorrect Cognitive Absurdity
Computer Vision Error Of course it also makes mistakes. Since we are composing new captions from phrases grammatical problems can be introduced. The content can be misidentified. Or you can just produce very odd descriptions like “here you can see a cross by the frog in the sky” This is a shoulder bag with a blended rainbow effect. One of the most shirt in the wall of the house. Here you can see a cross by the frog in the sky. Kuznetsova et al, ACL12

76 Questions?

