Using Attributes to Describe What People Wear Andy Gallagher October 14, 2013 with Huizhong Chen and Bernd Girod
Objective: attribute learning. List of attributes: men’s, black color, sweater, long sleeve, solid pattern, low skin exposure, …
Outline: Attributes; Describing Clothing with Attributes; Miscellaneous Topics
Attributes
Attributes. Describing Objects by Their Attributes, A. Farhadi, I. Endres, D. Hoiem, D. Forsyth, CVPR 2009. Learning To Detect Unseen Object Classes by Between-Class Attribute Transfer, C. Lampert, H. Nickisch, S. Harmeling, CVPR 2009. Many others.
Computer Vision: image, features, classification
Computer Vision: image, features [ .1 -.9 .1 .231 -.1], classification. ?
What feature representation should we use? Computer Vision: image, features, classification
Computer Vision: image, features [ .1 -.9 .1 .231 -.1], attributes (has hair, has skin, has ear, has eye, has arms), classification. Now we can talk…
Attributes: properties shared by many objects; explicit semantics; facilitate human-CPU communication. Materials (glass, fur, wood, etc.), Parts (has wheel, has tail, etc.), Shape (boxy, cylindrical, etc.). Based on a slide by David Forsyth
Example Attributes Face Tracer Image Search “Smiling Asian Men With Glasses” Kumar et al., 2008
Example Attributes Farhadi et al. 2009
Example Attributes Lampert et al. 2009 Slide credit: Devi Parikh
Example Attributes Welinder et al. 2010 Slide credit: Devi Parikh
Attribute Models Classifiers for binary attributes Kumar et al. 2010 Slide credit: Devi Parikh
Why attributes? How humans naturally describe visual concepts Image search I want elegant silver sandals with high heels Slide credit: Devi Parikh
Example Attributes Verification classifier SAME Kumar et al., 2010
Why attributes? An okapi is a mammal with a reddish dark back, with striking horizontal white stripes on the front and back legs. (Wikipedia)
Zero-shot Learning Aye-ayes Are nocturnal Live in trees Have large eyes Have long middle fingers Which one of these is an aye-aye? Humans can learn from descriptions (zero examples). Slide adapted from Christoph Lampert by Devi Parikh
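The zero-shot idea on this slide can be sketched in a few lines: a class is described only by its attributes, and an image is assigned to the class whose description best matches the attribute predictions. This is a minimal illustration, not the method of any cited paper. The class signatures, attribute names, and probability values are all hypothetical.

```python
# Hypothetical attribute signatures (1 = class has the attribute, 0 = it does not).
CLASS_ATTRIBUTES = {
    "aye-aye": {"nocturnal": 1, "lives_in_trees": 1, "large_eyes": 1, "long_middle_finger": 1},
    "owl": {"nocturnal": 1, "lives_in_trees": 1, "large_eyes": 1, "long_middle_finger": 0},
    "giraffe": {"nocturnal": 0, "lives_in_trees": 0, "large_eyes": 0, "long_middle_finger": 0},
}


def signature_score(attribute_probs, signature):
    """Probability that every attribute matches the signature, assuming independence."""
    score = 1.0
    for name, present in signature.items():
        p = attribute_probs[name]
        score *= p if present else (1.0 - p)
    return score


def zero_shot_classify(attribute_probs, class_attributes):
    """Return the class whose attribute description best fits the predictions."""
    return max(
        class_attributes,
        key=lambda name: signature_score(attribute_probs, class_attributes[name]),
    )


# Made-up outputs of per-attribute classifiers for one image.
predicted = {"nocturnal": 0.9, "lives_in_trees": 0.8, "large_eyes": 0.7, "long_middle_finger": 0.95}
best_class = zero_shot_classify(predicted, CLASS_ATTRIBUTES)
```

No training image of an aye-aye is needed here; only the attribute classifiers are trained, on other classes.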
“Is this a giraffe?” “No.” “Is this a giraffe?” “Yes.” “Is this a giraffe?” “No.” In the traditional active learning setting, after looking at a few examples, the learner identifies a confusing image and asks the teacher for a label: “Is this a giraffe?” In this case, the teacher says no. The learner updates its model, identifies another confusing image, and asks again: “Is this a giraffe?” The teacher says yes. Slide credit: Devi Parikh
Learner learns better from its mistakes (Parkash and Parikh, 2012)
Learner (current belief): “I think this is a giraffe. What do you think?” Teacher (focused feedback, knowledge of the world): “No, its neck is too short for it to be a giraffe.” Learner: “Ah! These must not be giraffes either then.” [Animals with even shorter necks] … Feedback on one, transferred to many. Accelerated discriminative learning with few examples.
The learner picks a confusing image. Then, instead of just demanding a label for the image, the learner gives the example some thought, determines its belief about the example, and communicates it to the teacher. If it thinks this image is a giraffe, it says “I think this is a giraffe. What do you think?” The teacher says “No, this is not a giraffe, because its neck is too short for it to be a giraffe.” With this, the learner realizes that if this animal’s neck is too short for it to be a giraffe, then all animals with even shorter necks than the query image must not be giraffes either. This results in a much better understanding of giraffes. At a high level, in our proposed active learning paradigm, the learner conveys its current belief about an actively chosen query. If the belief is wrong, the supervisor provides focused feedback that conveys the teacher’s knowledge about the world. The learner takes the feedback provided on one image and transfers it to many previously unlabeled images. This results in (1) the classifier learning better from its mistakes and (2) accelerated learning with few labeled examples.
Slide credit: Devi Parikh
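The "feedback on one, transferred to many" step can be illustrated with a small sketch. It assumes each image already has a scalar score for the relevant attribute (here, neck length) from some ranking function. The function name, image ids, and scores are hypothetical, and this is a simplification of the idea rather than the published algorithm.

```python
def transfer_too_short_feedback(query_id, neck_scores, labels):
    """Apply "its neck is too short to be a giraffe" to the query and to every
    unlabeled image whose neck score is no larger than the query's.

    neck_scores: dict image_id -> assumed attribute strength (higher = longer neck)
    labels: dict image_id -> True/False for images that already have a giraffe label
    Returns a new label dict; the input is left unchanged.
    """
    threshold = neck_scores[query_id]
    updated = dict(labels)
    updated[query_id] = False
    for image_id, score in neck_scores.items():
        if image_id not in updated and score <= threshold:
            updated[image_id] = False  # even shorter neck, so not a giraffe either
    return updated


# Made-up scores: "q" is the query image the teacher just rejected.
neck_scores = {"a": 0.2, "b": 0.5, "q": 0.6, "c": 0.9, "d": 0.7}
labels = {"c": True}
new_labels = transfer_too_short_feedback("q", neck_scores, labels)
```

One answer from the teacher labels three images here ("q", "a", "b"), while "d" stays unlabeled because its neck score is above the query's.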
Which Attributes to Describe? (f) Please choose a person to the left of the person who is frowning. Sadovnik et al. 2013
Describing Clothing with Attributes
Objective: attribute learning. List of attributes: men’s, black color, sweater, long sleeve, solid pattern, low skin exposure, …
Recommend and Analyze Recommendations Formal Sport
Person Identification
Related Work: Describing objects by attributes. Learn semantic attributes for object classification [Farhadi et al., 2009]. Clothing recognition with collar, sleeve length, placket, etc. [Zhang et al., 2008]. Gender recognition: use AdaBoost and random forest with HOG features to classify male/female.
Related Work: Person identification with clothing. Bounding box under face [Anguelov, 2007]. Clothing segmentation [Gallagher, 2008].
Dataset Preparation 1856 people from the web. Images are unconstrained.
Dataset Preparation $400 spent for collecting 283,107 labels on Amazon Mechanical Turk (AMT).
Dataset Statistics 23 Binary 3 Multiclass
The System. Pipeline: pose estimation; feature extraction & quantization; attribute classifiers 1 … M; multi-attribute CRF inference; predictions (blue, solid pattern, outerwear, wear scarf, long sleeve). Inside each attribute classifier: Feature 1 … Feature N feed SVM1 … SVMN, whose outputs are combined by a combined-feature SVM. CRF diagram: attribute nodes A1, A2, A3, A4, … with features F1, F2, F3, F4, … (A: attribute, F: feature).
Pose Estimation [Eichner et al., 2010]. Perform upper body detection using complementary results from a face detector and deformable part models. Foreground highlighting within the enlarged upper body bounding box. Parse the upper body into head, torso, and upper and lower parts of the left and right arms. The upper body detector is based on the successful part-based object detection framework [1] and contains a model to detect near-frontal upper bodies. [1] Object Detection with Discriminatively Trained Part Based Models, PAMI 2009
Feature Extraction SIFT descriptor extracted over the sampling grid. Similar procedure for the arm regions.
Feature Extraction: Maximum Response Filters [Varma 2005]; LAB color; skin probability (RGB image to skin probability map). The RFS filter bank consists of 2 anisotropic filters (an edge and a bar filter, at 6 orientations and 3 scales), and 2 rotationally symmetric ones (a Gaussian and a Laplacian of Gaussian). Figure: MRF bank
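The filter-bank arithmetic on this slide can be checked with a short sketch. It assumes the usual maximum-response reduction, in which each anisotropic filter type keeps only its strongest response across orientations at each scale. The response values are made up, and some implementations take the maximum of absolute responses instead.

```python
ANISOTROPIC_KINDS = ("edge", "bar")  # the two anisotropic filter types on the slide
ISOTROPIC_KINDS = ("gaussian", "laplacian_of_gaussian")
N_SCALES = 3
N_ORIENTATIONS = 6


def count_filters():
    """Total number of filters in the bank described on the slide."""
    return len(ANISOTROPIC_KINDS) * N_SCALES * N_ORIENTATIONS + len(ISOTROPIC_KINDS)


def max_response(anisotropic, isotropic):
    """Collapse orientations by keeping the maximum response per (kind, scale).

    anisotropic: dict (kind, scale) -> list of responses, one per orientation
    isotropic: dict kind -> single response
    """
    reduced = {key: max(values) for key, values in anisotropic.items()}
    reduced.update(isotropic)
    return reduced


# Made-up responses at one pixel, for illustration only.
anisotropic = {
    (kind, scale): [float(orientation + scale) for orientation in range(N_ORIENTATIONS)]
    for kind in ANISOTROPIC_KINDS
    for scale in range(N_SCALES)
}
isotropic = {"gaussian": 0.3, "laplacian_of_gaussian": -0.1}
reduced = max_response(anisotropic, isotropic)
```

Under these assumptions, the 38 raw filter responses per pixel reduce to 8 values: 6 orientation-maximized responses plus the 2 rotationally symmetric ones.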
Feature Extraction
Raw features are quantized using soft K-means (K=5 in our implementation). Quantized features are aggregated over various body regions by max or average pooling. For learning color attributes, the feature is LAB color aggregated from non-skin regions.
Feature type: SIFT, Texture, Color, Skin probability
Region: Torso, Left upper arm, Right upper arm, Left lower arm, Right lower arm
Pooling method: Average, Max
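The quantize-then-pool step can be sketched as follows. The slide does not define "soft K-means" in detail, so the exponential weighting, the `beta` parameter, and the toy cluster centers below are assumptions for illustration, not the authors' implementation.

```python
import math


def soft_assign(descriptor, centers, beta=1.0):
    """Soft assignment of one descriptor to cluster centers; weights sum to 1."""
    sq_dists = [
        sum((x - c) ** 2 for x, c in zip(descriptor, center)) for center in centers
    ]
    nearest = min(sq_dists)  # subtract the minimum for numerical stability
    weights = [math.exp(-beta * (d - nearest)) for d in sq_dists]
    total = sum(weights)
    return [w / total for w in weights]


def pool(codes, method="average"):
    """Aggregate per-descriptor codes from one body region into a single vector."""
    if not codes:
        raise ValueError("no codes to pool")
    columns = list(zip(*codes))
    if method == "max":
        return [max(column) for column in columns]
    if method == "average":
        return [sum(column) / len(codes) for column in columns]
    raise ValueError("method must be 'max' or 'average'")


# Toy example: two cluster centers and three descriptors from one region.
centers = [[0.0, 0.0], [1.0, 1.0]]
region_descriptors = [[0.0, 0.0], [1.0, 1.0], [0.5, 0.5]]
codes = [soft_assign(d, centers) for d in region_descriptors]
torso_average = pool(codes, "average")
torso_max = pool(codes, "max")
```

Each (feature type, region, pooling method) choice listed on the slide would yield one such pooled vector.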
Feature Fusion. SVM is a kernel-based classification technique. Feature fusion solution: a combined SVM is trained using a weighted sum of the kernels. Combining features consistently outperforms the single best feature. Diagram: SVM 1, SVM 2, …, SVM N with kernels K1, K2, …, KN each yield a prediction accuracy (1, 2, …, N); the combined SVM produces the attribute prediction.
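A weighted sum of kernels can be sketched directly on Gram matrices. The exponential chi-squared base kernel, the `gamma` value, and the choice to normalize per-feature accuracies into weights are assumptions made for this illustration; the slide does not specify how the weights are set.

```python
import math


def chi2_kernel(x, y, gamma=1.0):
    """Exponential chi-squared kernel between two non-negative histograms."""
    distance = sum((a - b) ** 2 / (a + b) for a, b in zip(x, y) if a + b > 0)
    return math.exp(-gamma * distance)


def gram_matrix(samples, kernel):
    """Kernel value for every pair of samples."""
    return [[kernel(a, b) for b in samples] for a in samples]


def combine_kernels(grams, weights):
    """Entry-wise weighted sum of several Gram matrices of the same size."""
    n = len(grams[0])
    return [
        [sum(w * g[i][j] for w, g in zip(weights, grams)) for j in range(n)]
        for i in range(n)
    ]


# Two hypothetical feature types (e.g. color and texture histograms) for 3 people.
color = [[0.7, 0.3], [0.6, 0.4], [0.1, 0.9]]
texture = [[0.2, 0.8], [0.9, 0.1], [0.3, 0.7]]
grams = [gram_matrix(color, chi2_kernel), gram_matrix(texture, chi2_kernel)]

# Assumed weighting: per-feature validation accuracies normalized to sum to 1.
accuracies = [0.8, 0.6]
weights = [a / sum(accuracies) for a in accuracies]
combined = combine_kernels(grams, weights)
```

The combined matrix could then be passed to any SVM implementation that accepts a precomputed kernel. A non-negative weighted sum of valid kernels is itself a valid kernel, which is what makes this form of fusion convenient.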
Recap. Pipeline: pose estimation; feature extraction & quantization; attribute classifiers 1 … M; multi-attribute CRF inference; predictions (blue, solid pattern, outerwear, wear scarf, long sleeve). Inside each attribute classifier: Feature 1 … Feature N feed SVM1 … SVMN, whose outputs are combined by a combined-feature SVM. CRF diagram: attribute nodes A1, A2, A3, A4, … with features F1, F2, F3, F4, … (A: attribute, F: feature).
Attribute Dependencies Necktie and T-Shirt?
Attribute Inference with CRF. Each attribute is a node. All nodes are pair-wise connected. The edge connecting 2 nodes corresponds to the joint probability of these 2 attributes. Figure: attribute nodes A1 to A6, each with its features F1 to F6. Ai: attribute i; Fi: features for Ai
CRF for Attribute Learning. Figure: attribute nodes A1, A2, …, AM with features F1, F2, …, FM. For a fully connected CRF, we maximize: [equation with a node potential term and an edge potential term]. The CRF potential is maximized using a standard belief propagation technique [Tappen et al. 2003].
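To make the node and edge terms concrete, here is a minimal sketch of joint inference over a fully connected set of attributes. The exact objective from the slide is not reproduced in this text, so the score below is an assumed generic pairwise form: the sum of log node scores plus log pairwise scores. Exhaustive enumeration stands in for belief propagation, which is only practical for a handful of attributes. All numbers are made up.

```python
import itertools
import math


def map_assignment(node_scores, pair_scores):
    """Exhaustively find the joint attribute assignment with the highest score.

    node_scores[i][a]: positive score for attribute i taking state a
    pair_scores[(i, j)][a][b]: positive compatibility of state a for attribute i
        with state b for attribute j
    """
    best_states, best_score = None, -math.inf
    state_ranges = [range(len(scores)) for scores in node_scores]
    for states in itertools.product(*state_ranges):
        score = sum(math.log(node_scores[i][s]) for i, s in enumerate(states))
        for (i, j), table in pair_scores.items():
            score += math.log(table[states[i]][states[j]])
        if score > best_score:
            best_states, best_score = states, score
    return best_states, best_score


# Attribute 0: necktie (0 = no, 1 = yes). Attribute 1: T-shirt (0 = no, 1 = yes).
node_scores = [[0.45, 0.55], [0.40, 0.60]]
# Wearing a necktie with a T-shirt is given a very low compatibility.
pair_scores = {(0, 1): [[0.30, 0.40], [0.29, 0.01]]}

independent = tuple(max(range(2), key=lambda s: scores[s]) for scores in node_scores)
joint, joint_score = map_assignment(node_scores, pair_scores)
```

Taken independently, the classifiers here pick "necktie" and "T-shirt" together; with the pairwise term, the joint optimum drops the necktie, which is the kind of dependency the necktie and T-shirt slide points at.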
Example 1 (man in dress): No necktie (Wear necktie), Has collar, Men’s, Has placket, Low exposure, No scarf, Solid pattern, Black, Short sleeve (Long sleeve), V-shape neckline, Dress (Suit)
Example 2 (suit but high skin exposure): Wear necktie, High exposure (Low exposure), Gray & black, Long sleeve, Suit
Example 3 (no sleeve but wearing scarf): No necktie, Wear scarf, Brown & black, No sleeve (long sleeve), Tank top (outerwear)
Experimental Results Questions that we are interested in: Does combining features improve performance? Does the pose model help? Does the CRF work?
Pose vs. No Pose: Experiment Setup. Positive and negative examples are balanced. SVM classification. Chi-squared kernel. Leave-one-out cross validation. Comparison with attribute learning without the pose model: features are extracted within a scaled clothing mask under the face, and evaluation is performed under the same experiment settings. Figure: the clothing mask [Gallagher 2008]
Multiclass Confusion Matrix
Unbalanced data classification: G-mean. Recap: our CRF model uses the priors of the attributes. Evaluating CRF performance on the full dataset requires unbalanced data classification.
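For reference, the G-mean of a binary classifier is commonly defined as the geometric mean of its accuracy on the positive class and its accuracy on the negative class. The sketch below uses that standard definition with made-up labels; it is not taken from the authors' evaluation code.

```python
import math


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels (0 or 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


def accuracy(y_true, y_pred):
    """Plain fraction of correct predictions, for comparison."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Unbalanced toy data: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
always_negative = [0] * 10
mixed = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
```

A classifier that always predicts the majority class scores 0.8 accuracy on this toy data but a G-mean of 0, which is why G-mean suits unbalanced evaluation.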
Steve Jobs: “solid pattern, men’s clothing, black color, long sleeves, round neckline, outerwear, wearing scarf”
The predicted dressing style of weddings: Male: “solid pattern, suit, long-sleeves, V-shape neckline, wearing necktie, wearing scarf, has collar, has placket” Female: “high skin exposure, no sleeves, dress, other neckline shapes, white, >2 colors, floral pattern”
Gender Recognition Face-based: Project faces in the Fisher space. Clothing-based: The gender output of our system. Better gender recognition is achieved by combining face and clothing.
Conclusions Clothing attributes can be better learned with a human pose model. CRF offers improved performance by exploring attribute relations. Proposed novel applications that exploit the predicted attributes.
Miscellaneous
What do you have?
AutoCropping
AutoCropping Auction Probability: 97%
AutoCropping Eigenvector Quantized Eigenvector
How do photos affect value? Angled, high contrast: ~$115
How do photos affect value? Frontal, flash reflection: ~$88
Thank You!
Future Work. Expect even better performance by using the (almost) ground truth pose estimated by Kinect sensors [Shotton et al., Best Paper CVPR 2011]. Incorporate clothing information in person identification.
The Loop: What we know about people / Images and Computer Vision
The Loop: This talk. Examples of how social data has helped understand images of people. Some things I’ve learned about people from computer vision.
What Is Computer Vision? “Vision is the process of discovering from images what is present in the world, and where it is.” David Marr, Vision (1982). Humans can perceive and interpret images very fast and accurately.
What Is Computer Vision? Vision deals with: Uncertainty and Probability (what is present); Geometry (where it is). Humans are really good at this!
Measurement vs. Perception. The visual system tries to decompose the measured brightness into reflectance and illumination, and to estimate the reflectance that is inherent to the object.
Measurement vs. Perception
Measurement vs. Perception: Müller-Lyer Illusion. Our perception of geometric properties is affected by our interpretation. http://www.michaelbach.de/ot/sze_muelue/index.html
What is context? What do we mean by “context”? We can interpret the H or A based on context. Example from “Cognition in Action”, Smyth, Collins, Morris, Levy, 1994, LEA Publishers.
Context. We ourselves are susceptible to clutter as well. This is a problem where a computer might be faster than a human.
Which monster is larger? Shepard RN (1990) Mind Sights: Original Visual Illusions, Ambiguities, and other Anomalies, New York: WH Freeman and Company. We can’t help but integrate perspective cues into our interpretation of the image.
Your brain specializes in faces
Find the face in the beans. We ourselves are susceptible to clutter as well. This is a problem where a computer might be faster than a human. http://www.michaelbach.de/ot/sze_muelue/index.html
Understanding images of people. We use many different clues to discover identity and draw inferences about people. What cues do we use to understand this image? How do we know this is a family? What we call “intuition” is often data that exists in the public domain. This thesis describes the progress we’ve made toward providing the computer with the same information that we have when understanding images. The goal of this thesis is to provide computers with the same intuition that humans have.