Words and Pictures Rahul Raguram

Motivation  Huge datasets where text and images co-occur ~ 3.6 billion photos

Motivation  Huge datasets where text and images co-occur

Motivation  Huge datasets where text and images co-occur Photos in the news

Motivation  Huge datasets where text and images co-occur Subtitles

Motivation  Interacting with large image datasets  Image content ‘Blobworld’ [Carson et al., 99]

Motivation  Interacting with large photo collections  Image content ‘Blobworld’ [Carson et al., 99]

Motivation  Interacting with large photo collections  Image content Query by sketch [Jacobs et al., 95]

Motivation  Interacting with large photo collections  Large disparity between user needs and what technology provides (Armitage and Enser 1997, Enser 1993, Enser 1995, Markulla and Sormunen 2000)  Queries based on image histograms, texture, overall appearance, etc. account for a vanishingly small fraction of real user queries

Motivation  Interacting with large photo collections  Text queries

Motivation  Text and images may be separately ambiguous; jointly they tend not to be  Image descriptions often leave out what is visually obvious (e.g. the colour of a flower)  …but often include properties that are difficult to infer using vision (e.g. the species of the flower)

Linking words and pictures: Applications  Automated image annotation  Auto-illustration  Browsing support (slide examples: tiger, cat, mouth, teeth; “statue of liberty”)

Learning the Semantics of Words and Pictures Barnard and Forsyth, ICCV 2001

Key idea  Model the joint distribution of words and image features  A joint probability model for text and image features rates the example pairings: random bits – impossible; keywords “apple tree” – unlikely; keywords “sky water sun” – reasonable Slide credit: David Forsyth

Input Representation  Extract keywords  Segment the image into a set of ‘blobs’

EM revisited: Image segmentation Examples from:

EM revisited: Image segmentation Image  Generative model: the image is generated as a mixture of segments (Segment 1, Segment 2, …, Segment k)  Problem: you don’t know the parameters, the mixing weights, or the segmentation

EM revisited: Image segmentation Image  If you knew the segmentation, then you could find the parameters easily: compute maximum likelihood estimates for each segment’s parameters, and the fraction of the image in each segment gives its mixing weight

EM revisited: Image segmentation Image  If you knew the segmentation, then you could find the parameters easily  If you knew the parameters, you could easily determine the segmentation: calculate the posteriors  Solution: iterate between the two steps

EM revisited: Image segmentation Image from:
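
To make the E/M alternation concrete, here is a minimal sketch of EM for a Gaussian mixture over per-pixel feature vectors. The diagonal covariances, random initialization, and fixed iteration count are simplifying assumptions for illustration, not the slides' exact setup.

```python
import numpy as np

def em_segment(X, k, iters=50, seed=0):
    """EM for a Gaussian mixture over pixel features X (n_pixels, n_dims).

    Illustrative sketch: diagonal covariances, random init.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)]            # means
    var = np.tile(X.var(axis=0) + 1e-6, (k, 1))        # diagonal covariances
    pi = np.full(k, 1.0 / k)                           # mixing weights

    for _ in range(iters):
        # E-step: posterior of each segment given each pixel
        log_p = np.zeros((n, k))
        for j in range(k):
            diff = X - mu[j]
            log_p[:, j] = (np.log(pi[j])
                           - 0.5 * np.sum(np.log(2 * np.pi * var[j]))
                           - 0.5 * np.sum(diff**2 / var[j], axis=1))
        log_p -= log_p.max(axis=1, keepdims=True)      # numerical stability
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: maximum-likelihood updates; the fraction of pixels
        # "owned" by a segment becomes its mixing weight
        nk = resp.sum(axis=0)
        pi = nk / n
        mu = (resp.T @ X) / nk[:, None]
        for j in range(k):
            diff = X - mu[j]
            var[j] = (resp[:, j] @ diff**2) / nk[j] + 1e-6

    return resp.argmax(axis=1), mu, var, pi
```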

Input Representation  Segment the image into a set of ‘blobs’  Each region/blob represented by a vector of 40 features (size, position, colour, texture, shape)

Modeling image dataset statistics  Generative, hierarchical model  Extension of Hofmann’s model for text (1998) Each node emits blobs and words Higher nodes emit more general words and blobs sky Middle nodes emit moderately general words and blobs sun Lower nodes emit more specific words and blobs waves

Modeling image dataset statistics  Generative, hierarchical model  Extension of Hofmann’s model for text (1998) Following a path from root to leaf generates image and associated text sky sun waves sun sky waves

Modeling image dataset statistics  Generative, hierarchical model  Extension of Hofmann’s model for text (1998) Each cluster is associated with a path from the root to a leaf Cluster of images

Modeling image dataset statistics  Generative, hierarchical model  Extension of Hofmann’s model for text (1998)  Each cluster is associated with a path from the root to a leaf (e.g. sky → sun, sea → waves or rocks)  Adjacent clusters share nodes near the root: one generates “sun sea sky waves”, the other “sun sea sky rocks”

Modeling image dataset statistics  D = the set of blobs and words in a document  Each cluster is associated with a path from a leaf to the root  Assuming conditional independence of the items given the nodes along the path from leaf to root: $P(D \mid c) = \prod_{i \in D} \sum_{l \in \text{path}(c)} P(i \mid l, c)\, P(l \mid c)$

Modeling image dataset statistics  For blobs: each node emits a blob’s feature vector from a Gaussian distribution  For words: tabulate word frequencies at each node
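
A hedged sketch of how a document's log-likelihood along one cluster's root-to-leaf path could be evaluated under these emission models. The per-node data layout (`mu`, `var`, `word_freq`) and the uniform distribution over the path's levels are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def doc_log_likelihood(blobs, words, path_nodes):
    """Log P(D | c) for one cluster's root-to-leaf path (sketch).

    blobs: (n_blobs, n_feats) array; words: list of strings.
    path_nodes: list of dicts with 'mu', 'var' (diagonal Gaussian over
    blob features) and 'word_freq' (word -> probability). A uniform
    distribution over the path's levels is assumed for simplicity.
    """
    L = len(path_nodes)
    total = 0.0
    for b in blobs:                    # each item mixes over the path's nodes
        p = 0.0
        for node in path_nodes:
            diff = b - node['mu']
            logg = (-0.5 * np.sum(np.log(2 * np.pi * node['var']))
                    - 0.5 * np.sum(diff**2 / node['var']))
            p += np.exp(logg) / L
        total += np.log(p + 1e-300)
    for w in words:
        p = sum(node['word_freq'].get(w, 1e-6) for node in path_nodes) / L
        total += np.log(p)
    return total
```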

Modeling image dataset statistics  Model fitting: EM  Missing data: the path and the node that generated each data element  Two hidden variables: document d is in cluster c; item i of document d was generated at level l  If the path and node were known for each data element, it would be easy to get maximum likelihood estimates of the parameters  Given a parameter estimate, the path and node are easy to figure out

Results  Clustering  Does text+image clustering have an advantage? Only text

Results  Clustering  Does text+image clustering have an advantage? Only blob features

Results  Clustering  Does text+image clustering have an advantage? Both text and image segments

Results  Clustering  Does text+image clustering have an advantage?  User study:  Generate 64 clusters for 3000 images  Generate 64 random clusters from the same images  Present real and random clusters to users, asking them to rate each cluster’s coherence (yes/no)  94% accuracy

Results  Image search  Supply a combination of text + image features  Approach: for each candidate image, compute the probability of emitting the query items, $P(Q \mid d)$, where Q is the set of query items and d is the candidate document (see the sketch below)
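
A minimal sketch of the ranking step referenced above, assuming the fitted model has already produced per-document word emission probabilities; the `doc_word_probs` layout and the smoothing constant `eps` are assumptions.

```python
import numpy as np

def query_score(doc_word_probs, query, eps=1e-9):
    """Score a candidate document by the probability of emitting the
    query items: log P(Q|d) = sum over q in Q of log P(q|d).

    doc_word_probs: dict mapping word -> P(word | d), assumed to be
    precomputed by the fitted model.
    """
    return sum(np.log(doc_word_probs.get(q, eps)) for q in query)

# Illustrative usage: rank all candidates, highest score first.
# ranked = sorted(docs, key=lambda d: query_score(d.word_probs, Q), reverse=True)
```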

Results  Image search Image credit: David Forsyth

Results  Auto-annotation  Compute the posterior probability of each word given the image’s blobs

Results  Auto-annotation  Quantitative performance:  Use 160 Corel CDs, each with 100 images (grouped by theme)  Select 80 of the CDs, split into training (75%) and test (25%); the remaining 80 CDs form a ‘harder’ test set  Model scoring: all words whose probability exceeds a threshold are predicted, and $\text{score} = \frac{r}{n} - \frac{w}{N - n}$, where n is the number of words for the image, r the number of words predicted correctly, w the number predicted incorrectly, and N the vocabulary size

Results  Auto-annotation  Quantitative performance (variant): the model predicts exactly n words per image  Baseline: one can do surprisingly well just by predicting from the empirical word frequency!

Results  Auto-annotation  Quantitative performance: Score of 0.1 indicates roughly 1 out of every 3 words is correctly predicted (vs. 1 out of 6 for the empirical model)
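
For concreteness, the normalized score reconstructed above is a one-liner; this sketch assumes that reconstruction, $\text{score} = r/n - w/(N-n)$, is the intended form.

```python
def annotation_score(r, w, n, N):
    """Normalized annotation score: reward correct predictions and
    penalize incorrect ones relative to the words available.

    r: words predicted correctly, w: words predicted incorrectly,
    n: number of words for the image, N: vocabulary size.
    Assumes the score form reconstructed from the slide's variables.
    """
    return r / n - w / (N - n)
```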

Names and Faces in the News Berg et al., CVPR 2004

Motivation President George W. Bush makes a statement in the Rose Garden while Secretary of Defense Donald Rumsfeld looks on, July 23, Rumsfeld said the United States would release graphic photographs of the dead sons of Saddam Hussein to prove they were killed by American troops. Photo by Larry Downing/Reuters

Motivation  Organize news photographs for browsing and retrieval  Build a large ‘real-world’ face dataset  Datasets captured in lab conditions do not truly reflect the complexity of the problem

Motivation  Organize news photographs for browsing and retrieval  Build a large ‘real-world’ face dataset  Datasets captured in lab conditions do not truly reflect the complexity of the problem  In many traditional face datasets, it’s possible to get excellent performance by using no facial features at all (Shamir, 2008)

Motivation Top-left 100×100 pixels of the first 10 individuals in the color FERET dataset; the subjects’ IDs are listed to the right of the images

Dataset  Download news photos and captions  ~500,000 images from Yahoo News, over a period of two years  Run a face detector  44,773 faces  Resized to 86×86 pixels  Extract names from the captions  Identify two or more capitalized words followed by a present-tense verb (see the sketch below)  Associate every face in the image with every detected name  Goal: label each face detector output with the correct name
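
A rough sketch of the caption heuristic described above (two or more consecutive capitalized words followed by a present-tense verb). The regex and the tiny verb list are illustrative stand-ins, not the authors' actual lexicon or parser.

```python
import re

# Tiny illustrative stand-in for a present-tense verb lexicon.
PRESENT_TENSE = {"makes", "waves", "says", "smiles", "looks", "shows",
                 "arrives", "speaks", "attends", "holds"}

def extract_names(caption):
    """Return runs of >= 2 capitalized words followed by a present-tense verb."""
    names = []
    tokens = caption.split()
    i = 0
    while i < len(tokens):
        j = i
        # extend a run of capitalized tokens (allows initials like "W.")
        while j < len(tokens) and re.fullmatch(r"[A-Z][a-z]*\.?,?", tokens[j]):
            j += 1
        run = j - i
        if run >= 2 and j < len(tokens) and tokens[j].lower() in PRESENT_TENSE:
            names.append(" ".join(t.rstrip(".,") for t in tokens[i:j]))
        i = max(j, i + 1)
    return names

print(extract_names("President George W. Bush makes a statement"))
# -> ['President George W Bush']
```

Note that such a heuristic happily over-grabs titles, producing names like "Defense Donald Rumsfeld" that the merging step later has to reconcile.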

Dataset Properties  Diverse  Large variation in lighting and pose  Broad range of expressions

Dataset Properties  Diverse  Large variation in lighting and pose  Broad range of expressions  Name frequencies follow a long-tailed distribution. Example captions: “Doctor Nikola shows a fork that was removed from an Israeli woman who swallowed it while trying to catch a bug that flew into her mouth, in Poriah Hospital northern Israel July 10, Doctors performed emergency surgery and removed the fork. (Reuters)” “President George W. Bush waves as he leaves the White House for a day trip to North Carolina, July 25, A White House spokesman said that Bush would be compelled to veto Senate legislation creating a new department of homeland security unless changes are made. (Kevin Lamarque/Reuters)”

Preprocessing  Rectify faces to canonical position  Train 5 SVMs as feature detectors  Corners of left and right eyes, tip of the nose, corners of the mouth  Use 150 hand-clicked faces to train the SVMs  For a test image, run the SVMs over the entire image  Produces 5 feature maps  Detect maximal outputs in the 5 maps, and estimate the affine transformation to the canonical pose Image credit: Y. J. Lee

Preprocessing  Rectify faces to canonical position  Train 5 SVMs as feature detectors  Corners of left and right eyes, tip of the nose, corners of the mouth  Use 150 hand-clicked faces to train the SVMs  For a test image, run the SVMs over the entire image  Produces 5 feature maps  Detect maximal outputs in the 5 maps, and estimate the affine transformation to the canonical pose  Reject images with poor rectification scores

Preprocessing  Rectify faces to canonical position  Train 5 SVMs as feature detectors  Corners of left and right eyes, tip of the nose, corners of the mouth  Use 150 hand-clicked faces to train the SVMs  For a test image, run the SVMs over the entire image  Produces 5 feature maps  Detect maximal outputs in the 5 maps, and estimate the affine transformation to the canonical pose (see the sketch below)  Reject images with poor rectification scores  This leaves 34,623 images  Throw out images with more than 4 names  27,742 faces
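
Given the five detected feature locations and their canonical-pose targets, the affine transformation can be estimated by least squares; a minimal sketch, where the residual serves as the kind of rectification score used to reject bad images (the exact scoring rule is an assumption).

```python
import numpy as np

def fit_affine(src, dst):
    """Least-squares affine transform mapping src -> dst.

    src, dst: (5, 2) arrays of detected feature locations (eye corners,
    nose tip, mouth corners) and their canonical-pose positions.
    Returns a 2x3 matrix A such that dst ~ A @ [x, y, 1]^T.
    """
    n = src.shape[0]
    X = np.hstack([src, np.ones((n, 1))])        # (n, 3) homogeneous coords
    A, *_ = np.linalg.lstsq(X, dst, rcond=None)  # (3, 2) solution
    return A.T                                   # (2, 3)

def residual(src, dst, A):
    """Rectification score: RMS error of the fitted transform."""
    X = np.hstack([src, np.ones((src.shape[0], 1))])
    return np.sqrt(np.mean(np.sum((X @ A.T - dst) ** 2, axis=1)))
```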

Face representation  86×86 images – 7,396-dimensional vectors  However, relatively few 7,396-dimensional vectors actually correspond to valid face images  We want to effectively model the subspace of valid face images Slide credit: S. Lazebnik

Face representation  We want to construct a low-dimensional linear subspace that best explains the variation in the set of face images Slide credit: S. Lazebnik

Principal Component Analysis (PCA)  Given N data points $x_1, \dots, x_N$ in $\mathbb{R}^d$  Consider the projection onto a 1-dimensional subspace, denoted by a d-dimensional unit vector $u_1$  Projection of each data point: $u_1^T x_n$  Mean of the projected data: $u_1^T \bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{n=1}^N x_n$  Variance of the projected data: $u_1^T S u_1$, where $S = \frac{1}{N}\sum_{n=1}^N (x_n - \bar{x})(x_n - \bar{x})^T$ defines the covariance matrix Formulation: C. Bishop

Principal Component Analysis (PCA)  Want to maximize the projected variance (alternate formulation: minimize the sum-of-squares projection error)  Maximize $u_1^T S u_1$ subject to $u_1^T u_1 = 1$  Using a Lagrange multiplier, maximize $u_1^T S u_1 + \lambda_1 (1 - u_1^T u_1)$, giving $S u_1 = \lambda_1 u_1$, so $u_1$ must be an eigenvector of S  Choose the eigenvector with the maximum eigenvalue to maximize the variance Image, formulation: C. Bishop

Principal Component Analysis (PCA)  The direction that captures the maximum variance of the data is the eigenvector corresponding to the largest eigenvalue of the data covariance matrix  Furthermore, the top k orthogonal directions that capture the most variance of the data are the k eigenvectors corresponding to the k largest eigenvalues Slide credit: S. Lazebnik
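
A minimal NumPy sketch of PCA exactly as derived: eigendecompose the covariance matrix and keep the top-k eigenvectors.

```python
import numpy as np

def pca(X, k):
    """PCA via eigendecomposition of the covariance matrix.

    X: (N, d) data matrix. Returns the top-k eigenvectors (d, k),
    their eigenvalues, and the data mean.
    """
    mean = X.mean(axis=0)
    Xc = X - mean
    S = (Xc.T @ Xc) / X.shape[0]           # covariance matrix
    eigvals, eigvecs = np.linalg.eigh(S)   # ascending eigenvalue order
    order = np.argsort(eigvals)[::-1][:k]  # top-k by captured variance
    return eigvecs[:, order], eigvals[order], mean

def project(X, components, mean):
    """Coordinates of X in the k-dimensional principal subspace."""
    return (X - mean) @ components
```

For 7,396-dimensional face vectors with far fewer samples than dimensions, one would in practice eigendecompose the much smaller N×N Gram matrix instead (the standard eigenfaces trick) and map its eigenvectors back to pixel space.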

Limitations of PCA  PCA assumes that the data has a Gaussian distribution (mean µ, covariance matrix Σ) Slide credit: S. Lazebnik

Limitations of PCA  The direction of maximum variance is not always good for classification Image credit: C. Bishop

Limitation #1  Shape of the data not modeled well by the linear principal components

The return of the kernel trick  Basic idea: express conventional PCA in terms of dot products  From before: $S = \frac{1}{N}\sum_n x_n x_n^T$ (for convenience, assume the mean has been subtracted from each vector)  Consider a nonlinear function $\Phi(x)$ mapping into M dimensions (M > D)  Assume $\sum_n \Phi(x_n) = 0$; the covariance matrix in feature space is $C = \frac{1}{N}\sum_n \Phi(x_n)\Phi(x_n)^T$ Formulation: C. Bishop

The return of the kernel trick  Covariance matrix in feature space (now M×M): eigenvectors satisfy $C v_i = \lambda_i v_i$  Substituting for C: $\frac{1}{N}\sum_n \Phi(x_n)\left(\Phi(x_n)^T v_i\right) = \lambda_i v_i$  The terms $\Phi(x_n)^T v_i$ are scalar values, so the eigenvectors $v_i$ can be written as a linear combination of the $\Phi(x_n)$: $v_i = \sum_n a_{in} \Phi(x_n)$ Formulation: C. Bishop

The return of the kernel trick  Key step: express this in terms of the kernel function $k(x, x') = \Phi(x)^T \Phi(x')$  Multiplying both sides by $\Phi(x_l)^T$ gives $K^2 a_i = \lambda_i N K a_i$, i.e. $K a_i = \lambda_i N a_i$  Projection of a point onto eigenvector i: $y_i(x) = \Phi(x)^T v_i = \sum_n a_{in} k(x, x_n)$ Formulation: C. Bishop
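
Putting the derivation together, a sketch of kernel PCA with an RBF kernel (the kernel choice and `gamma` are illustrative); centering the kernel matrix handles the zero-mean assumption made above.

```python
import numpy as np

def kernel_pca(X, k, gamma=1.0):
    """Kernel PCA with an RBF kernel (illustrative choice).

    Solves K a = lambda*N*a on the centered kernel matrix and
    returns the projections of the training points.
    """
    n = X.shape[0]
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq)                     # kernel matrix k(x_i, x_j)
    one = np.full((n, n), 1.0 / n)
    Kc = K - one @ K - K @ one + one @ K @ one  # center in feature space
    eigvals, eigvecs = np.linalg.eigh(Kc)
    order = np.argsort(eigvals)[::-1][:k]
    a, lam = eigvecs[:, order], eigvals[order]
    a = a / np.sqrt(np.maximum(lam, 1e-12))     # normalize so ||v_i|| = 1
    return Kc @ a                               # projections y_i(x_n)
```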

Kernel PCA Image credit: C. Bishop

Limitation #2  The direction of maximum variance is not always good for classification Image credit: C. Bishop

Linear Discriminant Analysis (LDA)  Goal: Perform dimensionality reduction while preserving as much of the class discriminatory information as possible  Try to find directions along which the classes are best separated  Capable of distinguishing image variation due to identity from variation due to other sources such as illumination and expression

Linear Discriminant Analysis (LDA)  Define inter-class (B) and intra-class (W) scatter matrices $S_B$ and $S_W$  LDA computes a projection w that maximizes the ratio $J(w) = \frac{w^T S_B w}{w^T S_W w}$ by solving the generalized eigenvalue problem $S_B w = \lambda S_W w$
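
A compact sketch of LDA as stated: build the scatter matrices and solve the generalized eigenvalue problem (the small regularizer on $S_W$ is an added assumption to keep the problem well-conditioned).

```python
import numpy as np
from scipy.linalg import eigh

def lda(X, y, k):
    """LDA projection maximizing the inter-/intra-class scatter ratio.

    X: (N, d) data, y: integer class labels. Returns a (d, k) projection.
    """
    mean = X.mean(axis=0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                 # intra-class (within) scatter
    Sb = np.zeros((d, d))                 # inter-class (between) scatter
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # generalized eigenvalue problem: Sb w = lambda * Sw w
    eigvals, eigvecs = eigh(Sb, Sw + 1e-6 * np.eye(d))  # regularized Sw
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order]
```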

Class labels for LDA  For the unsupervised names-and-faces dataset, you don’t have true labels  Use a proxy for labeled training data: images from the dataset with only one detected face and one detected name  Observation: using LDA on top of the space found by kernel PCA improves performance significantly

Clustering faces  Now that we have a representation for faces, the goal is to ‘clean up’ this dataset  Modified k-means clustering Obama Bush Clinton Saddam

Clustering faces  Now that we have a representation for faces, the goal is to ‘clean up’ this dataset  Modified k-means clustering Obama Bush Clinton Saddam x x x x

Clustering faces  Now that we have a representation for faces, the goal is to ‘clean up’ this dataset  Modified k-means clustering (see the sketch below) x x x x Bush Saddam
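
A sketch of one plausible "modified k-means": standard k-means except that each face may only be assigned to clusters whose names appear in its own caption. The exact modification used in the paper may differ; this captures the constraint the slides illustrate.

```python
import numpy as np

def constrained_kmeans(F, candidate_names, all_names, iters=20, seed=0):
    """k-means variant where each face may only join clusters named
    in its own caption (sketch of the assignment constraint).

    F: (N, d) face features; candidate_names[i]: set of names extracted
    from face i's caption; all_names: list of cluster names.
    """
    rng = np.random.default_rng(seed)
    idx = {name: j for j, name in enumerate(all_names)}
    centers = F[rng.choice(len(F), len(all_names), replace=False)].copy()
    assign = np.zeros(len(F), dtype=int)

    for _ in range(iters):
        for i, f in enumerate(F):
            allowed = [idx[n] for n in candidate_names[i] if n in idx]
            if not allowed:                          # no known name in caption:
                allowed = list(range(len(all_names)))  # fall back to all clusters
            dists = [np.sum((f - centers[j]) ** 2) for j in allowed]
            assign[i] = allowed[int(np.argmin(dists))]
        for j in range(len(all_names)):
            members = F[assign == j]
            if len(members):
                centers[j] = members.mean(axis=0)
    return assign, centers
```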

Pruning clusters  Remove clusters with < 3 faces  This leaves 19,355 images  For every data point, compute a likelihood score from its nearest neighbours, and remove points with low likelihood (see the sketch below), where: k – number of nearest neighbours being considered; k_i – number of those neighbours that are in cluster i; n – total number of points in the dataset; n_i – total number of points in cluster i
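
A sketch built from the quantities the slide defines; reading the score as the ratio of the cluster's share among the k nearest neighbours to its share of the whole dataset is an assumption, not the paper's exact formula.

```python
import numpy as np

def knn_likelihood(F, assign, i, k=20):
    """Likelihood score for point i: how over-represented its own
    cluster is among its k nearest neighbours, relative to that
    cluster's overall share of the data (one plausible reading of
    the quantities the slide defines).
    """
    n = len(F)
    dists = np.sum((F - F[i]) ** 2, axis=1)
    nn = np.argsort(dists)[1:k + 1]     # k nearest neighbours, excluding self
    c = assign[i]
    k_i = np.sum(assign[nn] == c)       # neighbours in point i's cluster
    n_i = np.sum(assign == c)           # cluster size
    return (k_i / k) / (n_i / n)
```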

Pruning clusters  For various thresholds:

Merging clusters  Merge clusters with different names that correspond to a single person  e.g. “Defense Donald Rumsfeld” and “Donald Rumsfeld”, or “Colin Powell” and “Secretary of State”  Look at the distance between the cluster means in discriminant space  If it is below a threshold, merge (see the sketch below)
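
A minimal sketch of the merge rule: map each cluster name to a canonical name whenever two cluster means in discriminant space fall within a threshold.

```python
import numpy as np

def merge_clusters(centers, names, threshold):
    """Greedy merge: clusters whose means in discriminant space are
    closer than `threshold` get the same canonical name (e.g.
    "Defense Donald Rumsfeld" folds into "Donald Rumsfeld").
    """
    canonical = {n: n for n in names}
    for a in range(len(names)):
        for b in range(a + 1, len(names)):
            if np.linalg.norm(centers[a] - centers[b]) < threshold:
                canonical[names[b]] = canonical[names[a]]
    return canonical
```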

Merging clusters Image credit: David Forsyth

Results