Attributes
Describing Objects by Their Attributes, A. Farhadi, I. Endres, D. Hoiem, D. Forsyth, CVPR 2009.
Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer, C. Lampert, H. Nickisch, S. Harmeling, CVPR 2009.
Many others.
What feature representation should we use? The standard computer vision pipeline: image → features → classification.
With attributes, the pipeline becomes image → features → attributes → classification. Instead of an opaque feature vector like [0.1, -0.9, 0.1, 0.231, -0.1], the intermediate representation is semantic: has hair, has skin, has ear, has eye, has arms. Now we can talk…
Attributes are properties shared by many objects, with explicit semantics, and they facilitate human-machine communication. Examples: materials (glass, fur, wood, etc.), parts (has wheel, has tail, etc.), shape (boxy, cylindrical, etc.). Based on a slide by David Forsyth.
Example Attributes: the FaceTracer image search engine, e.g. "Smiling Asian Men With Glasses" (Kumar et al., 2008).
Example Attributes (Lampert et al., 2009). Slide credit: Devi Parikh.
Example Attributes (Welinder et al., 2010). Slide credit: Devi Parikh.
Attribute Models: classifiers for binary attributes (Kumar et al., 2010). Slide credit: Devi Parikh.
Why attributes? They are how humans naturally describe visual concepts. Image search example: "I want elegant silver sandals with high heels." Slide credit: Devi Parikh.
Example Attributes: face verification. A classifier compares the attributes of two faces and decides SAME or NOT SAME (Kumar et al., 2010).
Why attributes? "An okapi is a mammal with a reddish dark back, with striking horizontal white stripes on the front and back legs." (Wikipedia)
Zero-shot Learning. Aye-ayes are nocturnal, live in trees, have large eyes, and have long middle fingers. Which one of these is an aye-aye? Humans can learn from descriptions alone (zero examples). Slide adapted from Christoph Lampert by Devi Parikh.
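To make the zero-shot idea concrete, here is a minimal sketch of direct attribute prediction (DAP) in the spirit of Lampert et al., 2009; the class names, attribute signatures, and probability values below are illustrative, not from the paper.

import numpy as np

# Hypothetical attribute signatures for unseen classes (1 = has attribute).
# Attributes: [nocturnal, lives_in_trees, large_eyes, long_middle_finger]
class_signatures = {
    "aye-aye":  np.array([1, 1, 1, 1]),
    "squirrel": np.array([0, 1, 0, 0]),
    "owl":      np.array([1, 1, 1, 0]),
}

def zero_shot_classify(attr_probs, signatures):
    """Direct attribute prediction (DAP): pick the unseen class whose
    attribute signature best matches the per-attribute probabilities
    p(a_m = 1 | x) predicted by classifiers trained on seen classes only."""
    best_class, best_score = None, -np.inf
    for name, sig in signatures.items():
        # log p(class | x) ∝ sum_m log p(a_m = sig_m | x), assuming independence.
        p = np.where(sig == 1, attr_probs, 1.0 - attr_probs)
        score = np.log(np.clip(p, 1e-9, 1.0)).sum()
        if score > best_score:
            best_class, best_score = name, score
    return best_class

# Attribute classifier outputs for a test image (illustrative values).
attr_probs = np.array([0.9, 0.8, 0.85, 0.7])
print(zero_shot_classify(attr_probs, class_signatures))  # -> "aye-aye"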
In the traditional active learning setting, after looking at a few examples, the learner identifies a confusing image and asks the teacher for a label: "Is this a giraffe?" In this case the teacher says no. The learner updates her model, identifies another confusing image, and asks again: "Is this a giraffe?" The teacher says yes. "Is this a giraffe?" No. Slide credit: Devi Parikh.
A learner learns better from its mistakes (Parkash and Parikh, 2012): focused feedback grounded in knowledge of the world, applied to the learner's current belief, gives accelerated discriminative learning with few examples. The learner picks a confusing image, but instead of just demanding a label, it forms a belief about the example and communicates it to the teacher: "I think this is a giraffe. What do you think?" The teacher replies, "No, this is not a giraffe, because its neck is too short for it to be a giraffe." With this, the learner realizes that if this animal's neck is too short for it to be a giraffe, then all animals with even shorter necks than the query image must not be giraffes either ("Ah! These must not be giraffes either then."), resulting in a much better understanding of giraffes. At a high level, in this active learning paradigm the learner conveys its current belief about an actively chosen query; if wrong, the supervisor provides focused feedback that conveys the teacher's knowledge about the world, and the learner transfers feedback given on one image to many previously unlabeled images. This results in (1) the classifier learning better from its mistakes and (2) accelerated learning with few labeled examples. Slide credit: Devi Parikh.
Which Attributes to Describe? "Please choose the person to the left of the person who is frowning." (Sadovnik et al., 2013)
Related Work: describing objects by attributes. Learning semantic attributes for object classification [Farhadi et al., 2009]. Clothing recognition with collar, sleeve length, placket, etc. [Zhang et al., 2008]. Gender recognition: AdaBoost and random forests on HOG features to classify male/female.
Related Work: person identification with clothing. Bounding box under the face [Anguelov, 2007]. Clothing segmentation [Gallagher, 2008]. Gender recognition: AdaBoost and random forests on HOG features to classify male/female.
Dataset Preparation: 1,856 people collected from the web; the images are unconstrained.
Dataset Preparation: $400 spent collecting 283,107 labels on Amazon Mechanical Turk (AMT).
Pose Estimation [Eichner et al., 2010]: (1) perform upper-body detection using complementary results from a face detector and deformable part models; (2) highlight the foreground within the enlarged upper-body bounding box; (3) parse the upper body into head, torso, and the upper and lower parts of the left and right arms. The upper-body detector is based on the successful part-based object detection framework and contains a model to detect near-frontal upper bodies (Object Detection with Discriminatively Trained Part Based Models, PAMI 2009). A rough stand-in for the first two steps is sketched below.
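As a rough illustration only (the actual system uses Eichner et al.'s detector and parser, which are not reproduced here), the following sketch approximates the first two steps with standard OpenCV pieces: a Haar face detector to seed an enlarged upper-body box, and GrabCut for foreground highlighting. The box-enlargement factors are assumptions.

import cv2
import numpy as np

def upper_body_region(img_bgr):
    """Detect a face, derive an enlarged upper-body box from it, then
    highlight the foreground inside that box with GrabCut. Returns the
    box and a binary foreground mask, or None if no face is found."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Enlarge the face box downward and sideways to cover the upper body
    # (factors 3x width, 4x height are illustrative assumptions).
    bx = max(0, x - w)
    by = y
    bw = min(img_bgr.shape[1] - bx, 3 * w)
    bh = min(img_bgr.shape[0] - by, 4 * h)
    # Foreground highlighting inside the enlarged box via GrabCut.
    mask = np.zeros(img_bgr.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img_bgr, mask, (bx, by, bw, bh), bgd, fgd, 5,
                cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    return (bx, by, bw, bh), fg.astype(np.uint8)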
Feature Extraction: SIFT descriptors are extracted over a regular sampling grid on the torso, with a similar procedure for the arm regions, as in the dense-SIFT sketch below.
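A minimal dense-SIFT sketch with OpenCV; the grid step and patch size are illustrative, not the paper's settings.

import cv2

def dense_sift(gray, step=8, size=16):
    """Extract SIFT descriptors on a regular sampling grid (dense SIFT),
    rather than at detected keypoints."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    keypoints, descriptors = sift.compute(gray, keypoints)
    return descriptors  # shape: (num_grid_points, 128)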
Feature Extraction: Maximum Response filters [Varma 2005], LAB color, and skin probability computed from the RGB image. The RFS filter bank consists of two anisotropic filters (an edge and a bar filter, at 6 orientations and 3 scales) and two rotationally symmetric ones (a Gaussian and a Laplacian of Gaussian); taking the maximum response over orientations gives the MR filter bank.
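A toy sketch of the maximum-response idea, assuming SciPy is available: it builds oriented Gaussian-derivative responses, takes the max over orientations, and appends the two rotationally symmetric filters. The real MR8 construction in Varma & Zisserman 2005 uses specific edge and bar filters, so treat this only as a structural illustration.

import numpy as np
from scipy import ndimage

def mr_style_responses(gray, scales=(1, 2, 4), n_orient=6):
    """Per-pixel response vector: for each scale, the max over orientations
    of an edge-like (first-derivative-of-Gaussian) response, plus a Gaussian
    and a Laplacian of Gaussian."""
    responses = []
    for s in scales:
        oriented = []
        for k in range(n_orient):
            # Rotate the image instead of the kernel, for simplicity.
            angle = 180.0 * k / n_orient
            rot = ndimage.rotate(gray, angle, reshape=False, mode="nearest")
            edge = ndimage.gaussian_filter(rot, s, order=(0, 1))
            back = ndimage.rotate(edge, -angle, reshape=False, mode="nearest")
            oriented.append(back)
        responses.append(np.max(np.abs(oriented), axis=0))  # max over orientations
    responses.append(ndimage.gaussian_filter(gray, 2))       # Gaussian
    responses.append(ndimage.gaussian_laplace(gray, 2))      # Laplacian of Gaussian
    return np.stack(responses, axis=-1)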
Feature Extraction: raw features are quantized using soft K-means (K=5 in our implementation), and the quantized features are aggregated over various body regions by max or average pooling. For learning color attributes, the feature is LAB color aggregated from non-skin regions. The combinations span feature type (SIFT, texture, color, skin probability), region (torso, left upper arm, right upper arm, left lower arm, right lower arm), and pooling method (average, max). A quantize-and-pool sketch follows.
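A minimal sketch of the quantize-and-pool step, assuming descriptors arrive as one row per pixel and region masks are boolean arrays; the Gaussian softness parameter sigma is an assumption, since the slide only says "soft K-means".

import numpy as np
from sklearn.cluster import KMeans

def soft_quantize(features, kmeans, sigma=1.0):
    """Soft assignment of raw descriptors to the K cluster centers
    (K=5 in the paper); returns a K-dim soft code per descriptor."""
    d2 = ((features[:, None, :] - kmeans.cluster_centers_[None]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)

def pool_region(soft_codes, region_mask, method="average"):
    """Aggregate soft codes over one body region (torso, arms, ...)."""
    codes = soft_codes[region_mask]
    return codes.mean(axis=0) if method == "average" else codes.max(axis=0)

# Usage sketch: fit K=5 centers on training descriptors, then pool per region.
# kmeans = KMeans(n_clusters=5).fit(train_descriptors)
# codes = soft_quantize(pixel_descriptors, kmeans)
# torso_feature = pool_region(codes, torso_mask, method="average")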
Feature Fusion: the SVM is a kernel-based classification technique, so features are fused by training a combined SVM on a weighted sum of the per-feature kernels K_1, …, K_N, with weights informed by each single-feature SVM's prediction accuracy. Combining features consistently outperforms the single best feature for attribute prediction; see the sketch below.
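A minimal sketch of the kernel-level fusion, assuming chi-squared kernels over non-negative features (the kernel used later in the experiments) and normalized validation accuracies as the weights; the exact weighting scheme is an assumption.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def fused_kernel(feature_list, weights, test_feature_list=None):
    """Weighted sum of per-feature kernels. For training, pass only
    feature_list; for testing, also pass the test features per type."""
    Ks = []
    for i, X in enumerate(feature_list):
        Y = X if test_feature_list is None else test_feature_list[i]
        Ks.append(weights[i] * chi2_kernel(Y, X))
    return np.sum(Ks, axis=0)

# Usage sketch: one feature matrix per feature type, weights e.g. from
# each single-feature SVM's validation accuracy, normalized to sum to 1.
# feats = [sift_feats, texture_feats, color_feats]
# w = np.array([0.5, 0.3, 0.2])
# K_train = fused_kernel(feats, w)
# clf = SVC(kernel="precomputed").fit(K_train, labels)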
Attribute Inference with a CRF: each attribute A_i is a node with its own features F_i, all nodes are pairwise connected, and the edge connecting two nodes corresponds to the joint probability of those two attributes.
CRF for Attribute Learning: for a fully connected CRF over attributes A_1, …, A_M with features F_1, …, F_M, we maximize

P(A_1, …, A_M | F_1, …, F_M) ∝ ∏_i φ(A_i, F_i) · ∏_{i<j} ψ(A_i, A_j),

where φ(A_i, F_i) is the node potential and ψ(A_i, A_j) is the edge potential. The CRF potential is maximized using a standard belief propagation technique [Tappen et al., 2003].
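For intuition, here is a brute-force MAP sketch for a small fully connected pairwise CRF over binary attributes; the paper uses belief propagation, which scales to larger M, and the potentials below are made-up numbers.

import itertools
import numpy as np

def map_attributes(node_pot, edge_pot):
    """Exact MAP by enumerating all 2^M binary labelings (feasible only
    for small M; a correctness sketch, not the paper's inference).
    node_pot: (M, 2) array, node_pot[i, a] = phi(A_i = a, F_i).
    edge_pot: dict {(i, j): 2x2 array} of pairwise potentials psi."""
    M = node_pot.shape[0]
    best, best_score = None, -np.inf
    for labels in itertools.product([0, 1], repeat=M):
        score = sum(np.log(node_pot[i, labels[i]]) for i in range(M))
        score += sum(np.log(pot[labels[i], labels[j]])
                     for (i, j), pot in edge_pot.items())
        if score > best_score:
            best, best_score = labels, score
    return best

# Two attributes that tend to co-occur (e.g. "wear necktie" and "suit"):
node = np.array([[0.4, 0.6], [0.7, 0.3]])
edge = {(0, 1): np.array([[0.8, 0.2], [0.2, 0.8]])}
print(map_attributes(node, edge))  # (0, 0): the edge pulls the labels together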
Failure cases (predicted attributes, with ground truth in parentheses where they disagree):
1. A man in a dress: no necktie (wear necktie), has collar, men's, has placket, low exposure, no scarf, solid pattern, black, short sleeve (long sleeve), V-shape neckline, dress (suit).
2. A suit but high skin exposure: wear necktie, high exposure (low exposure), gray & black, long sleeve, suit.
3. No sleeves but wearing a scarf: no necktie, wear scarf, brown & black, no sleeve (long sleeve), tank top (outerwear).
Experimental Results. Questions we are interested in: Does combining features improve performance? Does the pose model help? Does the CRF work?
Pose vs. No Pose: experiment setup. Positive and negative examples are balanced; SVM classification with a chi-squared kernel and leave-one-out cross-validation. For comparison with attribute learning without the pose model, features are extracted within a scaled clothing mask under the face (the clothing mask of [Gallagher 2008]), and evaluation is performed under the same experimental settings. A sketch of the evaluation loop follows.
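A sketch of this evaluation loop with scikit-learn, assuming non-negative feature vectors (required by the chi-squared kernel).

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.model_selection import LeaveOneOut

def loo_accuracy(X, y):
    """Leave-one-out evaluation of a chi-squared-kernel SVM.
    X: (n, d) non-negative features; y: (n,) numpy array of 0/1 labels."""
    K = chi2_kernel(X)  # precompute the full kernel matrix once
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="precomputed")
        clf.fit(K[np.ix_(train_idx, train_idx)], y[train_idx])
        pred = clf.predict(K[np.ix_(test_idx, train_idx)])
        correct += int(pred[0] == y[test_idx][0])
    return correct / len(y)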
Steve Jobs: "solid pattern, men's clothing, black color, long sleeves, round neckline, outerwear, wearing scarf".
The predicted dressing style at weddings. Male: "solid pattern, suit, long sleeves, V-shape neckline, wearing necktie, wearing scarf, has collar, has placket". Female: "high skin exposure, no sleeves, dress, other neckline shapes, white, >2 colors, floral pattern".
Gender Recognition. Face-based: project faces into the Fisher space. Clothing-based: the gender output of our system. Better gender recognition is achieved by combining face and clothing, as in the fusion sketch below.
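A minimal fusion sketch, assuming the face cue is an LDA (Fisher) projection with a probabilistic output and the clothing cue arrives as a posterior probability; the 50/50 weighting is an assumption.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fused_gender_probability(face_vec, clothing_prob, lda, w=0.5):
    """Combine a face-based posterior (face vector projected into the
    Fisher space via LDA) with the clothing-based posterior from the
    attribute system. Returns the probability of class 1 (e.g. male)."""
    face_prob = lda.predict_proba(face_vec.reshape(1, -1))[0, 1]
    return w * face_prob + (1 - w) * clothing_prob

# Usage sketch:
# lda = LinearDiscriminantAnalysis().fit(train_face_vectors, train_gender)
# p = fused_gender_probability(test_face_vector, clothing_gender_prob, lda)
# gender = "male" if p > 0.5 else "female"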
Conclusions. Clothing attributes can be better learned with a human pose model. The CRF offers improved performance by exploiting relations between attributes. We proposed novel applications that exploit the predicted attributes.
Future Work. Expect even better performance by using the (almost) ground-truth pose estimated by Kinect sensors [Shotton et al., Best Paper CVPR 2011]. Incorporate clothing information into person identification.
The Loop: what we know about people. Images and computer vision.
The Loop: this talk. Examples of how social data has helped in understanding images of people, and some things I've learned about people from computer vision.
What Is Computer Vision? "Vision is the process of discovering from images what is present in the world, and where it is." (David Marr, Vision, 1982) Humans can perceive and interpret images very fast and accurately.
What Is Computer Vision? Vision deals with uncertainty and probability (what is present) and geometry (where it is). Humans are really good at this!
Measurement vs. Perception. The visual system tries to decompose the measured brightness into reflectance and illumination, estimating the reflectance that is intrinsic to the object.
Measurement vs. Perception: the Müller-Lyer illusion. Our perception of geometric properties is affected by our interpretation.
What is context? We can interpret the same shape as an H or an A based on context. Example from Cognition in Action, Smyth, Collins, Morris, Levy, 1994, LEA Publishers.
Context. We ourselves are susceptible to clutter as well; this is a problem where computers might be faster than humans.
Which monster is larger? We can't help but integrate perspective cues into our interpretation of the image. (Shepard, R. N. (1990), Mind Sights: Original Visual Illusions, Ambiguities, and Other Anomalies, New York: W.H. Freeman and Company.)
Find the face in the beans. We ourselves are susceptible to clutter as well; this is a problem where computers might be faster than humans.
Understanding images of people. We use many different cues to discover identity and make inferences about people. What cues do we use to understand this image? How do we know this is a family? What we call "intuition" is often data that exists in the public domain. This thesis describes the progress we've made toward providing the computer with the same information that we have when understanding images; the goal is to give computers that same intuition that humans have.