Attributes
Describing Objects by Their Attributes, A. Farhadi, I. Endres, D. Hoiem, D. Forsyth, CVPR 2009.
Learning to Detect Unseen Object Classes by Between-Class Attribute Transfer, C. Lampert, H. Nickisch, S. Harmeling, CVPR 2009.
Many others.
What feature representation should we use? The standard computer vision pipeline: image → features → classification.
With attributes, the pipeline becomes image → features → attributes → classification. Instead of an opaque feature vector like [0.1, -0.9, 0.1, 0.231, -0.1], the intermediate representation is semantic: has hair, has skin, has ear, has eye, has arms. Now we can talk…
Attributes are properties shared by many objects, with explicit semantics, and they facilitate human-machine communication. Examples: materials (glass, fur, wood, etc.), parts (has wheel, has tail, etc.), shape (boxy, cylindrical, etc.). Based on a slide by David Forsyth.
Example Attributes: the FaceTracer image search engine, e.g. "Smiling Asian Men With Glasses" (Kumar et al., 2008).
Example Attributes (Lampert et al., 2009). Slide credit: Devi Parikh.
Example Attributes (Welinder et al., 2010). Slide credit: Devi Parikh.
Attribute Models: classifiers for binary attributes (Kumar et al., 2010). Slide credit: Devi Parikh.
Why attributes? They are how humans naturally describe visual concepts. Image search example: "I want elegant silver sandals with high heels." Slide credit: Devi Parikh.
Example Attributes: face verification. A classifier compares the attributes of two faces and decides SAME or NOT SAME (Kumar et al., 2010).
Why attributes? "An okapi is a mammal with a reddish dark back, with striking horizontal white stripes on the front and back legs." (Wikipedia)
Zero-shot Learning. Aye-ayes are nocturnal, live in trees, have large eyes, and have long middle fingers. Which one of these is an aye-aye? Humans can learn from descriptions alone (zero examples). Slide adapted from Christoph Lampert by Devi Parikh.
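To make the zero-shot idea concrete, here is a minimal sketch of direct attribute prediction (DAP) in the spirit of Lampert et al., 2009; the class names, attribute signatures, and probability values below are illustrative, not from the paper.

import numpy as np

# Hypothetical attribute signatures for unseen classes (1 = has attribute).
# Attributes: [nocturnal, lives_in_trees, large_eyes, long_middle_finger]
class_signatures = {
    "aye-aye":  np.array([1, 1, 1, 1]),
    "squirrel": np.array([0, 1, 0, 0]),
    "owl":      np.array([1, 1, 1, 0]),
}

def zero_shot_classify(attr_probs, signatures):
    """Direct attribute prediction (DAP): pick the unseen class whose
    attribute signature best matches the per-attribute probabilities
    p(a_m = 1 | x) predicted by classifiers trained on seen classes only."""
    best_class, best_score = None, -np.inf
    for name, sig in signatures.items():
        # log p(class | x) ∝ sum_m log p(a_m = sig_m | x), assuming independence.
        p = np.where(sig == 1, attr_probs, 1.0 - attr_probs)
        score = np.log(np.clip(p, 1e-9, 1.0)).sum()
        if score > best_score:
            best_class, best_score = name, score
    return best_class

# Attribute classifier outputs for a test image (illustrative values).
attr_probs = np.array([0.9, 0.8, 0.85, 0.7])
print(zero_shot_classify(attr_probs, class_signatures))  # -> "aye-aye"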
In the traditional active learning setting, after looking at a few examples, the learner identifies a confusing image and asks the teacher for a label: "Is this a giraffe?" In this case the teacher says no. The learner updates her model, identifies another confusing image, and asks again: "Is this a giraffe?" The teacher says yes. "Is this a giraffe?" No. Slide credit: Devi Parikh.
A learner learns better from its mistakes (Parkash and Parikh, 2012): focused feedback grounded in knowledge of the world, applied to the learner's current belief, gives accelerated discriminative learning with few examples. The learner picks a confusing image, but instead of just demanding a label, it forms a belief about the example and communicates it to the teacher: "I think this is a giraffe. What do you think?" The teacher replies, "No, this is not a giraffe, because its neck is too short for it to be a giraffe." With this, the learner realizes that if this animal's neck is too short for it to be a giraffe, then all animals with even shorter necks than the query image must not be giraffes either ("Ah! These must not be giraffes either then."), resulting in a much better understanding of giraffes. At a high level, in this active learning paradigm the learner conveys its current belief about an actively chosen query; if wrong, the supervisor provides focused feedback that conveys the teacher's knowledge about the world, and the learner transfers feedback given on one image to many previously unlabeled images. This results in (1) the classifier learning better from its mistakes and (2) accelerated learning with few labeled examples. Slide credit: Devi Parikh.
Which Attributes to Describe? "Please choose the person to the left of the person who is frowning." (Sadovnik et al., 2013)
Related Work: describing objects by attributes. Learning semantic attributes for object classification [Farhadi et al., 2009]. Clothing recognition with collar, sleeve length, placket, etc. [Zhang et al., 2008]. Gender recognition: AdaBoost and random forests on HOG features to classify male/female.
Related Work: person identification with clothing. Bounding box under the face [Anguelov, 2007]. Clothing segmentation [Gallagher, 2008]. Gender recognition: AdaBoost and random forests on HOG features to classify male/female.
Dataset Preparation: 1,856 people collected from the web; the images are unconstrained.
Dataset Preparation: $400 spent collecting 283,107 labels on Amazon Mechanical Turk (AMT).
Pose Estimation [Eichner et al., 2010]: (1) perform upper-body detection using complementary results from a face detector and deformable part models; (2) highlight the foreground within the enlarged upper-body bounding box; (3) parse the upper body into head, torso, and the upper and lower parts of the left and right arms. The upper-body detector is based on the successful part-based object detection framework and contains a model to detect near-frontal upper bodies (Object Detection with Discriminatively Trained Part Based Models, PAMI 2009). A rough stand-in for the first two steps is sketched below.
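As a rough illustration only (the actual system uses Eichner et al.'s detector and parser, which are not reproduced here), the following sketch approximates the first two steps with standard OpenCV pieces: a Haar face detector to seed an enlarged upper-body box, and GrabCut for foreground highlighting. The box-enlargement factors are assumptions.

import cv2
import numpy as np

def upper_body_region(img_bgr):
    """Detect a face, derive an enlarged upper-body box from it, then
    highlight the foreground inside that box with GrabCut. Returns the
    box and a binary foreground mask, or None if no face is found."""
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, 1.1, 5)
    if len(faces) == 0:
        return None
    x, y, w, h = faces[0]
    # Enlarge the face box downward and sideways to cover the upper body
    # (factors 3x width, 4x height are illustrative assumptions).
    bx = max(0, x - w)
    by = y
    bw = min(img_bgr.shape[1] - bx, 3 * w)
    bh = min(img_bgr.shape[0] - by, 4 * h)
    # Foreground highlighting inside the enlarged box via GrabCut.
    mask = np.zeros(img_bgr.shape[:2], np.uint8)
    bgd = np.zeros((1, 65), np.float64)
    fgd = np.zeros((1, 65), np.float64)
    cv2.grabCut(img_bgr, mask, (bx, by, bw, bh), bgd, fgd, 5,
                cv2.GC_INIT_WITH_RECT)
    fg = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0)
    return (bx, by, bw, bh), fg.astype(np.uint8)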
Feature Extraction: SIFT descriptors are extracted over a regular sampling grid on the torso, with a similar procedure for the arm regions, as in the dense-SIFT sketch below.
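A minimal dense-SIFT sketch with OpenCV; the grid step and patch size are illustrative, not the paper's settings.

import cv2

def dense_sift(gray, step=8, size=16):
    """Extract SIFT descriptors on a regular sampling grid (dense SIFT),
    rather than at detected keypoints."""
    sift = cv2.SIFT_create()
    h, w = gray.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(step, h - step, step)
                 for x in range(step, w - step, step)]
    keypoints, descriptors = sift.compute(gray, keypoints)
    return descriptors  # shape: (num_grid_points, 128)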
Feature Extraction: Maximum Response filters [Varma 2005], LAB color, and skin probability computed from the RGB image. The RFS filter bank consists of two anisotropic filters (an edge and a bar filter, at 6 orientations and 3 scales) and two rotationally symmetric ones (a Gaussian and a Laplacian of Gaussian); taking the maximum response over orientations gives the MR filter bank.
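A toy sketch of the maximum-response idea, assuming SciPy is available: it builds oriented Gaussian-derivative responses, takes the max over orientations, and appends the two rotationally symmetric filters. The real MR8 construction in Varma & Zisserman 2005 uses specific edge and bar filters, so treat this only as a structural illustration.

import numpy as np
from scipy import ndimage

def mr_style_responses(gray, scales=(1, 2, 4), n_orient=6):
    """Per-pixel response vector: for each scale, the max over orientations
    of an edge-like (first-derivative-of-Gaussian) response, plus a Gaussian
    and a Laplacian of Gaussian."""
    responses = []
    for s in scales:
        oriented = []
        for k in range(n_orient):
            # Rotate the image instead of the kernel, for simplicity.
            angle = 180.0 * k / n_orient
            rot = ndimage.rotate(gray, angle, reshape=False, mode="nearest")
            edge = ndimage.gaussian_filter(rot, s, order=(0, 1))
            back = ndimage.rotate(edge, -angle, reshape=False, mode="nearest")
            oriented.append(back)
        responses.append(np.max(np.abs(oriented), axis=0))  # max over orientations
    responses.append(ndimage.gaussian_filter(gray, 2))       # Gaussian
    responses.append(ndimage.gaussian_laplace(gray, 2))      # Laplacian of Gaussian
    return np.stack(responses, axis=-1)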
Feature Extraction: raw features are quantized using soft K-means (K=5 in our implementation), and the quantized features are aggregated over various body regions by max or average pooling. For learning color attributes, the feature is LAB color aggregated from non-skin regions. The combinations span feature type (SIFT, texture, color, skin probability), region (torso, left upper arm, right upper arm, left lower arm, right lower arm), and pooling method (average, max). A quantize-and-pool sketch follows.
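A minimal sketch of the quantize-and-pool step, assuming descriptors arrive as one row per pixel and region masks are boolean arrays; the Gaussian softness parameter sigma is an assumption, since the slide only says "soft K-means".

import numpy as np
from sklearn.cluster import KMeans

def soft_quantize(features, kmeans, sigma=1.0):
    """Soft assignment of raw descriptors to the K cluster centers
    (K=5 in the paper); returns a K-dim soft code per descriptor."""
    d2 = ((features[:, None, :] - kmeans.cluster_centers_[None]) ** 2).sum(-1)
    w = np.exp(-d2 / (2 * sigma ** 2))
    return w / w.sum(axis=1, keepdims=True)

def pool_region(soft_codes, region_mask, method="average"):
    """Aggregate soft codes over one body region (torso, arms, ...)."""
    codes = soft_codes[region_mask]
    return codes.mean(axis=0) if method == "average" else codes.max(axis=0)

# Usage sketch: fit K=5 centers on training descriptors, then pool per region.
# kmeans = KMeans(n_clusters=5).fit(train_descriptors)
# codes = soft_quantize(pixel_descriptors, kmeans)
# torso_feature = pool_region(codes, torso_mask, method="average")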
Feature Fusion: the SVM is a kernel-based classification technique, so features are fused by training a combined SVM on a weighted sum of the per-feature kernels K_1, …, K_N, with weights informed by each single-feature SVM's prediction accuracy. Combining features consistently outperforms the single best feature for attribute prediction; see the sketch below.
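A minimal sketch of the kernel-level fusion, assuming chi-squared kernels over non-negative features (the kernel used later in the experiments) and normalized validation accuracies as the weights; the exact weighting scheme is an assumption.

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel

def fused_kernel(feature_list, weights, test_feature_list=None):
    """Weighted sum of per-feature kernels. For training, pass only
    feature_list; for testing, also pass the test features per type."""
    Ks = []
    for i, X in enumerate(feature_list):
        Y = X if test_feature_list is None else test_feature_list[i]
        Ks.append(weights[i] * chi2_kernel(Y, X))
    return np.sum(Ks, axis=0)

# Usage sketch: one feature matrix per feature type, weights e.g. from
# each single-feature SVM's validation accuracy, normalized to sum to 1.
# feats = [sift_feats, texture_feats, color_feats]
# w = np.array([0.5, 0.3, 0.2])
# K_train = fused_kernel(feats, w)
# clf = SVC(kernel="precomputed").fit(K_train, labels)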
Attribute Inference with a CRF: each attribute A_i is a node with its own features F_i, all nodes are pairwise connected, and the edge connecting two nodes corresponds to the joint probability of those two attributes.
CRF for Attribute Learning: for a fully connected CRF over attributes A_1, …, A_M with features F_1, …, F_M, we maximize

P(A_1, …, A_M | F_1, …, F_M) ∝ ∏_i φ(A_i, F_i) · ∏_{i<j} ψ(A_i, A_j),

where φ(A_i, F_i) is the node potential and ψ(A_i, A_j) is the edge potential. The CRF potential is maximized using a standard belief propagation technique [Tappen et al., 2003].
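For intuition, here is a brute-force MAP sketch for a small fully connected pairwise CRF over binary attributes; the paper uses belief propagation, which scales to larger M, and the potentials below are made-up numbers.

import itertools
import numpy as np

def map_attributes(node_pot, edge_pot):
    """Exact MAP by enumerating all 2^M binary labelings (feasible only
    for small M; a correctness sketch, not the paper's inference).
    node_pot: (M, 2) array, node_pot[i, a] = phi(A_i = a, F_i).
    edge_pot: dict {(i, j): 2x2 array} of pairwise potentials psi."""
    M = node_pot.shape[0]
    best, best_score = None, -np.inf
    for labels in itertools.product([0, 1], repeat=M):
        score = sum(np.log(node_pot[i, labels[i]]) for i in range(M))
        score += sum(np.log(pot[labels[i], labels[j]])
                     for (i, j), pot in edge_pot.items())
        if score > best_score:
            best, best_score = labels, score
    return best

# Two attributes that tend to co-occur (e.g. "wear necktie" and "suit"):
node = np.array([[0.4, 0.6], [0.7, 0.3]])
edge = {(0, 1): np.array([[0.8, 0.2], [0.2, 0.8]])}
print(map_attributes(node, edge))  # (0, 0): the edge pulls the labels together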
Failure cases (predicted attributes, with ground truth in parentheses where they disagree):
1. A man in a dress: no necktie (wear necktie), has collar, men's, has placket, low exposure, no scarf, solid pattern, black, short sleeve (long sleeve), V-shape neckline, dress (suit).
2. A suit but high skin exposure: wear necktie, high exposure (low exposure), gray & black, long sleeve, suit.
3. No sleeves but wearing a scarf: no necktie, wear scarf, brown & black, no sleeve (long sleeve), tank top (outerwear).
Experimental Results. Questions we are interested in: Does combining features improve performance? Does the pose model help? Does the CRF work?
Pose vs. No Pose: experiment setup. Positive and negative examples are balanced; SVM classification with a chi-squared kernel and leave-one-out cross-validation. For comparison with attribute learning without the pose model, features are extracted within a scaled clothing mask under the face (the clothing mask of [Gallagher 2008]), and evaluation is performed under the same experimental settings. A sketch of the evaluation loop follows.
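A sketch of this evaluation loop with scikit-learn, assuming non-negative feature vectors (required by the chi-squared kernel).

import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import chi2_kernel
from sklearn.model_selection import LeaveOneOut

def loo_accuracy(X, y):
    """Leave-one-out evaluation of a chi-squared-kernel SVM.
    X: (n, d) non-negative features; y: (n,) numpy array of 0/1 labels."""
    K = chi2_kernel(X)  # precompute the full kernel matrix once
    correct = 0
    for train_idx, test_idx in LeaveOneOut().split(X):
        clf = SVC(kernel="precomputed")
        clf.fit(K[np.ix_(train_idx, train_idx)], y[train_idx])
        pred = clf.predict(K[np.ix_(test_idx, train_idx)])
        correct += int(pred[0] == y[test_idx][0])
    return correct / len(y)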
Steve Jobs: "solid pattern, men's clothing, black color, long sleeves, round neckline, outerwear, wearing scarf".
The predicted dressing style at weddings. Male: "solid pattern, suit, long sleeves, V-shape neckline, wearing necktie, wearing scarf, has collar, has placket". Female: "high skin exposure, no sleeves, dress, other neckline shapes, white, >2 colors, floral pattern".
Gender Recognition. Face-based: project faces into the Fisher space. Clothing-based: the gender output of our system. Better gender recognition is achieved by combining face and clothing, as in the fusion sketch below.
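A minimal fusion sketch, assuming the face cue is an LDA (Fisher) projection with a probabilistic output and the clothing cue arrives as a posterior probability; the 50/50 weighting is an assumption.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fused_gender_probability(face_vec, clothing_prob, lda, w=0.5):
    """Combine a face-based posterior (face vector projected into the
    Fisher space via LDA) with the clothing-based posterior from the
    attribute system. Returns the probability of class 1 (e.g. male)."""
    face_prob = lda.predict_proba(face_vec.reshape(1, -1))[0, 1]
    return w * face_prob + (1 - w) * clothing_prob

# Usage sketch:
# lda = LinearDiscriminantAnalysis().fit(train_face_vectors, train_gender)
# p = fused_gender_probability(test_face_vector, clothing_gender_prob, lda)
# gender = "male" if p > 0.5 else "female"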
Conclusions. Clothing attributes can be better learned with a human pose model. The CRF offers improved performance by exploiting relations between attributes. We proposed novel applications that exploit the predicted attributes.
Future Work. Expect even better performance by using the (almost) ground-truth pose estimated by Kinect sensors [Shotton et al., Best Paper CVPR 2011]. Incorporate clothing information into person identification.
The Loop: what we know about people. Images and computer vision.
The Loop: this talk. Examples of how social data has helped in understanding images of people, and some things I've learned about people from computer vision.
What Is Computer Vision? "Vision is the process of discovering from images what is present in the world, and where it is." (David Marr, Vision, 1982) Humans can perceive and interpret images very fast and accurately.
What Is Computer Vision? Vision deals with uncertainty and probability (what is present) and geometry (where it is). Humans are really good at this!
Measurement vs. Perception. The visual system tries to decompose the measured brightness into reflectance and illumination, estimating the reflectance that is intrinsic to the object.
Measurement vs. Perception: the Müller-Lyer illusion. Our perception of geometric properties is affected by our interpretation.
What is context? We can interpret the same shape as an H or an A based on context. Example from Cognition in Action, Smyth, Collins, Morris, Levy, 1994, LEA Publishers.
Context. We ourselves are susceptible to clutter as well; this is a problem where computers might be faster than humans.
Which monster is larger? We can't help but integrate perspective cues into our interpretation of the image. (Shepard, R. N. (1990), Mind Sights: Original Visual Illusions, Ambiguities, and Other Anomalies, New York: W.H. Freeman and Company.)
Find the face in the beans. We ourselves are susceptible to clutter as well; this is a problem where computers might be faster than humans.
Understanding images of people. We use many different cues to discover identity and make inferences about people. What cues do we use to understand this image? How do we know this is a family? What we call "intuition" is often data that exists in the public domain. This thesis describes the progress we've made toward providing the computer with the same information that we have when understanding images; the goal is to give computers that same intuition that humans have.