Presentation is loading. Please wait.

Presentation is loading. Please wait.

Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill.

Similar presentations


Presentation on theme: "Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill."— Presentation transcript:

1 Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill

2 Descriptive Text It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns Scarlett OHara described in Gone with the Wind. Berg, Attributes Tutorial CVPR13

3 More Nuance than Traditional Recognition… car shoe person Berg, Attributes Tutorial CVPR13

4 Toward Complex Structured Outputs car Berg, Attributes Tutorial CVPR13

5 Toward Complex Structured Outputs pink car Attributes of objects Berg, Attributes Tutorial CVPR13

6 Toward Complex Structured Outputs car on road Relationships between objects Berg, Attributes Tutorial CVPR13

7 Toward Complex Structured Outputs Telling the story of an image Little pink smart car parked on the side of a road in a London shopping district. … Complex structured recognition outputs Berg, Attributes Tutorial CVPR13

8 Learning from Descriptive Text Visually descriptive language provides: Information about the world, especially the visual world. information about how people construct natural language for imagery. guidance for visual recognition. How do people describe the world? It was an arresting face, pointed of chin, square of jaw. Her eyes were pale green without a touch of hazel, starred with bristly black lashes and slightly tilted at the ends. Above them, her thick black brows slanted upward, cutting a startling oblique line in her magnolia-white skin–that skin so prized by Southern women and so carefully guarded with bonnets, veils and mittens against hot Georgia suns Scarlett OHara described in Gone with the Wind. How does the world work? What should we recognize? Berg, Attributes Tutorial CVPR13

9 Methodology Generation Methods: 1)Compose descriptions directly from recognized content 2)Retrieve relevant existing text given recognized content Natural language description A random Pink Smart Car seen driving around Lambeth Roundabout and onto Lambeth Bridge. Smart Car. It was so adorable and cute in the parking lot of the post office, I had to stop and take a picture. Pink Car Sign Door Motorcycle Tree Brick building Dirty Road Sidewalk London Shopping district Berg, Attributes Tutorial CVPR13

10 Related Work Compose descriptions given recognized content Yao et al. (2010), Yang et al. (2011), Li et al. ( 2011), Kulkarni et al. (2011) Generation as retrieval Farhadi et al. (2010), Ordonez et al (2011), Gupta et al (2012), Kuznetsova et al (2012) Generation using pre-associated relevant text Leong et al (2010), Aker and Gaizauskas (2010), Feng and Lapata (2010a) Other (image annotation, video description, etc) Barnard et al (2003), Pastra et al (2003), Gupta et al (2008), Gupta et al (2009), Feng and Lapata (2010b), del Pero et al (2011), Krishnamoorthy et al (2012), Barbu et al (2012), Das et al (2013) Berg, Attributes Tutorial CVPR13

11 Method 1: Recognize & Generate Berg, Attributes Tutorial CVPR13

12 Baby Talk: Understanding and Generating Simple Image Descriptions Girish Kulkarni, Visruth Premraj, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, Tamara L Berg CVPR 2011

13 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

14 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

15 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

16 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

17 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

18 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

19 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

20 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

21 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

22 This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. Kulkarni et al, CVPR11

23 Methodology Vision -- detection and classification Text inputs - statistics from parsing lots of descriptive text Graphical model (CRF) to predict best image labeling given vision and text inputs Generation algorithms to generate natural language Kulkarni et al, CVPR11

24 Vision is hard! World knowledge (from descriptive text) can be used to smooth noisy vision predictions! Green sheep Kulkarni et al, CVPR11

25 Methodology Vision -- detection and classification Text -- statistics from parsing lots of descriptive text Graphical model (CRF) to predict best image labeling given vision and text inputs Generation algorithms to generate natural language Kulkarni et al, CVPR11

26 Learning from Descriptive Text Attributes Relationships green green grass by the lake a very shiny car in the car museum in my hometown of upstate NY. Our cat Tusik sleeping on the sofa near a hot radiator. very little person in a big rocking chair Kulkarni et al, CVPR11

27 Methodology Vision -- detection and classification Text -- statistics from parsing lots of descriptive text Model (CRF) to predict best image labeling given vision and text based potentials Generation algorithms to compose natural language Kulkarni et al, CVPR11

28 System Flow Input Image Extract Objects/stuff a) dog b) person c) sofa brown 0.32 striped 0.09 furry.04 wooden.2 Feathered brown 0.94 striped 0.10 furry.06 wooden.8 Feathered brown 0.01 striped 0.16 furry.26 wooden.2 feathered a) dog b) person c) sofa Predict attributes Predict prepositions a) dog b) person c) sofa near(a,b) 1 near(b,a) 1 against(a,b).11 against(b,a).04 beside(a,b).24 beside(b,a) near(a,c) 1 near(c,a) 1 against(a,c).3 against(c,a).05 beside(a,c).5 beside(c,a) near(b,c) 1 near(c,b) 1 against(b,c).67 against(c,b).33 beside(b,c).0 beside(c,b) Predict labeling – vision potentials smoothed with text potentials,against, >,near, >,beside, > Generate natural language description This is a photograph of one person and one brown sofa and one dog. The person is against the brown sofa. And the dog is near the person, and beside the brown sofa. Kulkarni et al, CVPR11

29 This is a picture of one sky, one road and one sheep. The gray sky is over the gray road. The gray sheep is by the gray road. Here we see one road, one sky and one bicycle. The road is near the blue sky, and near the colorful bicycle. The colorful bicycle is within the blue sky. Some good results This is a picture of two dogs. The first dog is near the second furry dog. Kulkarni et al, CVPR11

30 Some bad results Here we see one potted plant. Missed detections: This is a picture of one dog. False detections: There are one road and one cat. The furry road is in the furry cat. This is a picture of one tree, one road and one person. The rusty tree is under the red road. The colorful person is near the rusty tree, and under the red road. This is a photograph of two sheeps and one grass. The first black sheep is by the green grass, and by the second black sheep. The second black sheep is by the green grass. Incorrect attributes: This is a photograph of two horses and one grass. The first feathered horse is within the green grass, and by the second feathered horse. The second feathered horse is within the green grass. Kulkarni et al, CVPR11

31 Algorithm vs Humans This picture shows one person, one grass, one chair, and one potted plant. The person is near the green grass, and in the chair. The green grass is by the chair, and near the potted plant. H1: A Lemonaide stand is manned by a blonde child with a cookie. H2: A small child at a lemonade and cookie stand on a city corner. H3: Young child behind lemonade stand eating a cookie. Sounds unnatural! Kulkarni et al, CVPR11

32 Method 2: Retrieval based generation Berg, Attributes Tutorial CVPR13

33 Every picture tells a story, describing images with meaningful sentences Ali Farhadi, Mohsen Hejrati, Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, David Forsyth ECCV 2010 Slides provided by Ali Farhadi

34 A Simplified Problem Represent image/text content as subject-verb-scene triple Good triples: (ship, sail, sea) (boat, sail, river) (ship, float, water) Bad triples: (boat, smiling, sea) – bad relations (train, moving, rail) – bad words (dog, speaking, office) - both Farhadi et al, ECCV10

35 The Expanded Model Map from Image Space to Meaning Space Map from Sentence Space to Meaning Space Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV10

36 Retrieval through meaning space Map from Image Space to Meaning Space Map from Sentence Space to Meaning Space Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV10

37 Image Space Meaning Space Predict Image Content using trained classifiers Farhadi et al, ECCV10

38 Retrieval through meaning space Map from Image Space to Meaning Space Map from Sentence Space to Meaning Space Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV10

39 Sentence Space Meaning Space Extract subject, verb and scene from sentences in the training data Subject: Cat Verb: Sitting Scene: room black cat over pink chair A black color cat sitting on chair in a room. cat sitting on a chair looking in a mirror. Vehicle CarTrainBike HumanAnimal CatHorseDog Object Use taxonomy trees Farhadi et al, ECCV10

40 Retrieval through meaning space Map from Image Space to Meaning Space Map from Sentence Space to Meaning Space Retrieve Sentences for Images via Meaning Space Farhadi et al, ECCV10

41

42

43

44 Data Rashtchian et al 2010, Farhadi et al descriptions per image 20 object categories Image-Clef challenge 2 descriptions per image Select image categories Large amounts of paired data can help us study the image-language relationship 1,000 images20,000 images More data needed? Berg, Attributes Tutorial CVPR13

45 Data exists, but buried in junk! Through the smokeDuna Portrait #5 Mirror and goldthe cat lounging in the sink Berg, Attributes Tutorial CVPR13

46 SBU Captioned Photo Dataset Our dog Zoe in her bed Interior design of modern white and brown living room furniture against white wall with a lamp hanging. The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon Man sits in a rusted car buried in the sand on Waitarere beach Emma in her hat looking super cute Little girl and her dog in northern Thailand. They both seemed interested in what we were doing 1 million captioned photos! Berg, Attributes Tutorial CVPR13

47 Im2Text: Describing Images Using 1 Million Captioned Photographs Vicente Ordonez, Girish Kulkarni, Tamara L. Berg NIPS 2011

48 Big Data Driven Generation One of the many stone bridges in town that carry the gravel carriage roads. An old bridge over dirty green water. A stone bridge over a peaceful river. Generate natural sounding descriptions using existing captions Ordonez et al, NIPS11

49 Harness the Web! Smallest house in paris between red (on right) and beige (on left). Bridge to temple in Hoan Kiem lake. The water is clear enough to see fish swimming around in it. A walk around the lake near our house with Abby. Hangzhou bridge in West lake. The daintree river by boat. … SBU Captioned Photo Dataset Transfer Caption(s) Global Matching (GIST + Color) e.g. The water is clear enough to see fish swimming around in it. 1 million captioned images! Ordonez et al, NIPS11

50 Use High Level Content to Rerank (Objects, Stuff, People, Scenes, Captions) The bridge over the lake on Suzhou Street. The Daintree river by boat. Bridge over Cacapon river. Iron bridge over the Duck river.... Transfer Caption(s) e.g. The bridge over the lake on Suzhou Street. Ordonez et al, NIPS11

51 Results Amazing colours in the sky at sunset with the orange of the cloud and the blue of the sky behind. Fresh fruit and vegetables at the market in Port Louis Mauritius. A female Mallard duck in the lake at Luukki Espoo. Cat in sink. Good The cat in the window. The boat ended up a kilometre from the water in the middle of the airstrip. Bad Ordonez et al, NIPS11

52 Next…. Composing novel captions from pieces of existing ones Berg, Attributes Tutorial CVPR13

53 Composing captions guessing game a) monkey playing in the tree canopy, Monte Verde in the rain forest e) the monkey sitting in a tree, posing for his picture c) monkey spotted in Apenheul Netherlands under the tree d) a white-faced or capuchin in the tree in the garden b) capuchin monkey in front of my window Berg, Attributes Tutorial CVPR13

54 Composing captions guessing game a) monkey playing in the tree canopy, Monte Verde in the rain forest e) the monkey sitting in a tree, posing for his picture c) monkey spotted in Apenheul Netherlands under the tree d) a white-faced or capuchin in the tree in the garden b) capuchin monkey in front of my window Berg, Attributes Tutorial CVPR13

55 Collective Generation of Natural Image Descriptions Polina Kuznetsova, Vicente Ordonez, Alexander C. Berg,Tamara L. Berg and Yejin Choi ACL 2012

56 Composing Descriptions the dirty sheep meandered along a desolate road in the highlands of Scotland through frozen grass NP: the dirty sheep VP: meandered along a desolate road PP: in the highlands of Scotland PP: through frozen grass Object appearance Object pose Scene appearance Region appearance & relationship Example Composed Description: Kuznetsova et al, ACL12

57 SBU Captioned Photo Dataset Our dog Zoe in her bed Interior design of modern white and brown living room furniture against white wall with a lamp hanging. The Egyptian cat statue by the floor clock and perpetual motion machine in the pantheon Man sits in a rusted car buried in the sand on Waitarere beach Emma in her hat looking super cute Little girl and her dog in northern Thailand. They both seemed interested in what we were doing 1 million captioned photos! Ordonez et al, NIPS11

58 Data Processing 1,000,000 images: o Run object detectors o Run region based stuff detectors (grass, sky, etc.) o Run global scene classifiers o Parse captions associated with images and retrieve phrases referring to objects (NPs, VPs), region relationships (PPstuff), and general scene context (PPscene). Kuznetsova et al, ACL12

59 Image Description Generation Generation Objects, Actions, Stuff, Scenes Phrase Retrieval Description Computer Vision Kuznetsova et al, ACL12

60 Image Description Generation Generation Objects, Actions, Stuff, Scenes Phrase Retrieval Description Computer Vision Kuznetsova et al, ACL12

61 this dog was laying in the middle of the road on a back street in jaco Closeup of my dog sleeping under my desk. Detect: dog Find matching detections by pose similarity Peruvian dog sleeping on city street in the city of Cusco, (Peru) Contented dog just laying on the edge of the road in front of a house.. Retrieving VPs Kuznetsova et al, ACL12

62 Retrieving NPs Detect: fruit Find matching detections by appearance similarity Tray of glace fruit in the market at Nice, France Fresh fruit in the market A box of oranges was just catching the sun, bringing out detail in the skin. The street market in Santanyi, Mallorca is a must for the oranges and local crafts. An orange tree in the backyard of the house. mandarin oranges in glass bowl Kuznetsova et al, ACL12

63 Find matching regions by appearance + arrangement similarity Mini Nike soccer ball all alone in the grass Comfy chair under a tree. I positioned the chairs around the lemon tree -- it's like a shrine Cordoba - lonely elephant under an orange tree... Retrieving PPstuff Detect: stuff Kuznetsova et al, ACL12

64 Retrieving PPscene View from our B&B in this photo Extract scene descriptor Find matching images by global scene similarity Pedestrian street in the Old Lyon with stairs to climb up the hill of fourviere I'm about to blow the building across the street over with my massive lung power. Only in Paris will you find a bottle of wine on a table outside a bookstore Kuznetsova et al, ACL12

65 Image Description Generation Objects, Actions, Stuff, Scenes Phrase Retrieval Computer Vision Generation Description Kuznetsova et al, ACL12

66 Object NPs Actions VPs Scene PPs Stuff PPs birds the bird birds over water are standingin the ocean Position 1Position 2Position 3Position 4 are standing looking for food in water over water in the ocean near Salt Pond Kuznetsova et al, ACL12

67 Possible Assignments birds Position1Position2Position3Position4 the bird are standing in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … Kuznetsova et al, ACL12

68 Possible Assignments Position1Position2Position3Position4 birds the bird are standing in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … Kuznetsova et al, ACL12

69 Possible Assignments Position1Position2Position3Position4 birds the bird are standing in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … Kuznetsova et al, ACL12

70 Position1Position2Position3Position4 birds the bird are standing in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … Phrases of the Same Type Kuznetsova et al, ACL12

71 Position1Position2Position3Position4 birds the bird are standing in the ocean … are standing the bird birds in the ocean … birds the bird are standing in the ocean … birds the bird are standing in the ocean … Singular/Plural Relationships Kuznetsova et al, ACL12

72 ILP Optimization Vision scores o Visual detection/classification scores Phrase cohesion o n-gram statistics between phrases o Co-occurrence statistics between phrase head words Linguistic constraints o Allow at most one phrase of each type o Enforce plural/singular agreement between NP and VP Discourse constraints o Prevent inclusion of repeated phrasing Optimize for: Subject to: Kuznetsova et al, ACL12

73 This is a sporty little red convertible made for a great day in Key West FL. This car was in the 4th parade of the apartment buildings. Good Examples This is a brass viking boat moored on beach in Tobago by the ocean. The clock made in Korea. Kuznetsova et al, ACL12

74 Visual Turing Test In some cases (16%), ILP generated captions were preferred over human written ones! Us vs Original Human Written Caption Kuznetsova et al, ACL12

75 Grammatically Incorrect Cognitive Absurdity This is a shoulder bag with a blended rainbow effect. Not Relevant Here you can see a cross by the frog in the sky. One of the most shirt in the wall of the house. Computer Vision Error Bad Results Kuznetsova et al, ACL12

76 Questions?


Download ppt "Beyond Attributes -> Describing Images Tamara L. Berg UNC Chapel Hill."

Similar presentations


Ads by Google