
1 A Model for Learning the Semantics of Pictures V. Lavrenko, R. Manmatha, J. Jeon Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts Amherst. Poster presented at the Neural Information Processing Systems (NIPS) conference.

2 Abstract We propose a model to automatically annotate an image with keywords and to retrieve images based on text queries. We assume that every image is divided into regions, each described by a continuous-valued feature vector. Given a training set of images with annotations, we compute a joint probabilistic model of image features and words. Experiments show that our model significantly outperforms the best of the previously reported results on the tasks of automatic image annotation and retrieval.

3 Introduction We propose a model which looks at the probability of associating words with image regions. The surrounding context often simplifies the interpretation of regions as specific objects. –For example, the association of a region with the word tiger is increased by the fact that there is a grass region and a water region in the same image, and should be decreased if instead there is a region corresponding to the interior of an aircraft.

4 Introduction Thus the association of different regions provides context, while the association of words with image regions provides meaning. Continuous-space Relevance Model (CRM) –a statistical generative model related to relevance models in information retrieval –directly associates continuous features with words and does not require an intermediate clustering stage.

5 Related Work Mori et al. (1999) –Co-occurrence Model –regular grid Duygulu et al. (2002) –Translation Model –Segmentation, blobs Jeon et al. (2003) –Cross-media relevance model (CMRM) –analogous to the cross-lingual retrieval problem

6 Related Work Blei and Jordan (2003) –Correspondence Latent Dirichlet Allocation (LDA) Model –This model assumes that a Dirichlet distribution can be used to generate a mixture of latent factors. This mixture of latent factors is then used to generate words and regions.

7 Related Work Differences between CRM and CMRM –CMRM is a discrete model and cannot take advantage of continuous features. –CMRM relies on clustering of the feature vectors into blobs. Annotation quality of the CMRM is very sensitive to clustering errors, and depends on being able to select the right cluster granularity a priori.

8 A Model of Annotated Images We learn a joint probability distribution P(w, r) over the regions r of some image and the words w in its annotation. Image Annotation –compute the conditional likelihood P(w|r), which can then be used to guess the most likely annotation w for the image. Image Retrieval –compute the query likelihood P(w_qry|r_J) for every image J in the dataset. We can then rank images in the collection according to their likelihood of having the query as annotation. (Both uses are sketched below.)
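A minimal sketch of these two uses. `p_word_given_regions` and `p_words_given_regions` are hypothetical stand-ins for the conditional likelihoods P(w|r_J) and P(w_qry|r_J) derived from the joint model defined in the following slides:

```python
def annotate(regions, vocabulary, p_word_given_regions, n_words=5):
    """Annotation: return the n_words most likely words for an image."""
    scored = [(p_word_given_regions(w, regions), w) for w in vocabulary]
    return [w for _, w in sorted(scored, reverse=True)[:n_words]]

def rank_images(query_words, images, p_words_given_regions):
    """Retrieval: rank (image_id, regions) pairs by query likelihood."""
    scored = [(p_words_given_regions(query_words, regions), image_id)
              for image_id, regions in images]
    return [image_id for _, image_id in sorted(scored, reverse=True)]
```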

9 Representation of Images and Annotations Let C denote the finite set of all possible pixel colors. –c_0: a transparent color, which will be handy when we have to layer image regions. Assume that all images are of a fixed size W×H. Represent any image as an element of the finite set R = C^(W×H). –Each image contains several distinct regions {r_1…r_n}. –Each region is itself an element of R and contains the pixels of some prominent object in the image. –All pixels around the object are set to the transparent color.

10 Representation of Images and Annotations

11 A function g maps image regions r ∈ R to real-valued vectors. The value g(r) represents a set of features, or characteristics, of an image region. An annotation for a given image is a set of words {w_1…w_m} drawn from some finite vocabulary V. We model a joint probability for observing a set of image regions {r_1…r_n} together with the set of annotation words {w_1…w_m}.

12 A Model for Generating Annotated Images A generated image J is based on three distinct probability distributions: –Words in w_J are an i.i.d. random sample from some underlying multinomial distribution P_V(·|J). –Regions r_J are produced from a corresponding set of generator vectors g_1…g_n according to a process P_R(r_i|g_i) which is independent of J. –The generator vectors g_1…g_n are themselves an i.i.d. random sample from some underlying multi-variate density function P_g(·|J).

13 A Model for Generating Annotated Images Let r_A = {r_1…r_nA} denote the regions of some image A, which is not in the training set T. Let w_B = {w_1…w_nB} be some arbitrary sequence of words. We want P(r_A, w_B), the joint probability of observing an image denoted by r_A together with annotation words w_B.

14 A Model for Generating Annotated Images The overall process for jointly generating w_B and r_A is as follows (see the sketch below): –Pick a training image J ∈ T with some probability P_T(J). –For b = 1…n_B: pick the annotation word w_b from the multinomial distribution P_V(·|J). –For a = 1…n_A: sample a generator vector g_a from the probability density P_g(·|J), then pick the image region r_a according to the probability P_R(r_a|g_a).
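A minimal sketch of this generative process under illustrative assumptions: each training image J is a dict carrying its word distribution P_V(·|J) as `J["p_v"]` and a sampler for P_g(·|J) as `J["sample_g"]`; `sample_region` is a hypothetical stand-in for P_R(r|g):

```python
import numpy as np

rng = np.random.default_rng(0)

def generate(training_images, vocabulary, n_words, n_regions, sample_region):
    # Pick a training image J uniformly: P_T(J) = 1/N_T.
    J = training_images[rng.integers(len(training_images))]
    # Draw annotation words i.i.d. from the multinomial P_V(.|J).
    words = rng.choice(vocabulary, size=n_words, p=J["p_v"])
    # Draw a generator vector from P_g(.|J) for each region,
    # then a region from P_R(r|g).
    regions = [sample_region(J["sample_g"]()) for _ in range(n_regions)]
    return list(words), regions
```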

15 A Model for Generating Annotated Images The probability of a joint observation {r_A, w_B} is given by:

P(r_A, w_B) = Σ_{J∈T} P_T(J) [ Π_{b=1…n_B} P_V(w_b|J) ] [ Π_{a=1…n_A} ∫ P_R(r_a|g) P_g(g|J) dg ]    (1)
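For illustration, a direct (unoptimized) computation of equation (1). It assumes the per-region integral ∫ P_R(r_a|g) P_g(g|J) dg has been wrapped in a hypothetical helper `region_prob(J, r)`, and P_V(w|J) in `word_prob(J, w)`:

```python
def joint_prob(regions, words, training_images, word_prob, region_prob):
    """Sum over training images J of P_T(J) times the word and region products."""
    total = 0.0
    for J in training_images:
        p = 1.0 / len(training_images)   # P_T(J) = 1/N_T
        for w in words:                  # product of P_V(w_b|J)
            p *= word_prob(J, w)
        for r in regions:                # product of the region integrals
            p *= region_prob(J, r)
        total += p
    return total
```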

16 Estimating the parameters of the model P_T(J) = 1/N_T, where N_T is the size of the training set. P_R(r|g) is uniform over the regions that share the feature vector g: –P_R(r|g) = 1/N_g if g(r) = g, and 0 otherwise, where N_g is the number of all regions r′ in R such that g(r′) = g.

17 Estimating the parameters of the model P_g(·|J) is estimated with a kernel-based density: –Let r_J = {r_1…r_n} be the set of regions of image J. –We place a Gaussian kernel over the feature vector g(r_i) of every region of image J. –Each kernel is parameterized by the feature covariance matrix Σ. We assumed Σ = βI, where I is the identity matrix. β plays the role of kernel bandwidth: it determines the smoothness of P_g around the support point g(r_i).
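A minimal sketch of this density estimate, using a standard Gaussian kernel with Σ = βI (the paper's exact normalization may differ slightly; this is an illustrative version):

```python
import numpy as np

def p_g_given_J(g, region_features, beta):
    """Kernel estimate of P_g(g|J).

    g: (k,) query feature vector.
    region_features: (n, k) array holding g(r_i) for the n regions of J.
    beta: kernel bandwidth, so that Sigma = beta * I.
    """
    n, k = region_features.shape
    sq_dist = np.sum((region_features - g) ** 2, axis=1)
    norm = (2.0 * np.pi * beta) ** (k / 2.0)   # normalizer for covariance beta*I
    return np.mean(np.exp(-sq_dist / (2.0 * beta)) / norm)
```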

18 Estimating the parameters of the model Consider the simplex of all multinomial distributions over V. –Assume a Dirichlet prior over this simplex with parameters {μp_v : v ∈ V}, where μ is a constant and p_v is the relative frequency of observing the word v in the training set. –Introducing the observation w_J results in a Dirichlet posterior with parameters {μp_v + N_{v,J} : v ∈ V}, where N_{v,J} is the number of times v occurs in the observation w_J. –The Bayes estimate of P_V(v|J) is the posterior mean: P_V(v|J) = (μp_v + N_{v,J}) / (μ + Σ_{v′} N_{v′,J}).
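A short sketch of the resulting posterior-mean estimate; `counts` (holding N_{v,J}) and `p_bg` (holding the background frequencies p_v) are hypothetical names:

```python
def word_prob(v, counts, p_bg, mu):
    """Bayes estimate P_V(v|J) = (mu*p_v + N_{v,J}) / (mu + sum_v' N_{v',J})."""
    n_J = sum(counts.values())   # length of the annotation w_J
    return (mu * p_bg[v] + counts.get(v, 0)) / (mu + n_J)
```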

19 Experimental Results Dataset: provided by Duygulu et al. –http://www.cs.arizona.edu/people/kobus/research/data/eccv_2002 –5,000 images from 50 Corel Stock Photo CDs. –Each image contains an annotation of 1-5 keywords. Overall there are 371 words. –Every image in the dataset is pre-segmented into regions using general-purpose algorithms, such as normalized cuts. –A pre-computed feature vector is provided for every segmented region. –The feature set consists of 36 features: 18 color features, 12 texture features and 6 shape features.

20 Experimental Results We divided the dataset into 3 parts: 4,000 training set images, 500 evaluation set images and 500 images in the test set. The evaluation set is used to find system parameters. After fixing the parameters, we merged the 4,000 training set and 500 evaluation set images to make a new training set. This corresponds to the training set of 4,500 images and the test set of 500 images used by Duygulu et al.

21 Results: Automatic Image Annotation Given a set of image regions r_J, we use equation (1) to arrive at the conditional distribution P(w|r_J). We take the top 5 words from that distribution and call them the automatic annotation of the image in question, then compute annotation recall and precision for every word in the testing set.

22 Results: Automatic Image Annotation –Recall is the number of images correctly annotated with a given word, divided by the number of images that have that word in the human annotation. –Precision is the number of correctly annotated images divided by the total number of images annotated with that particular word. –Recall and precision values are averaged over the set of testing words.
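A sketch of these per-word metrics. `auto` and `truth` are hypothetical dicts mapping image ids to word sets (the top-5 automatic annotation and the human annotation, respectively):

```python
def word_recall_precision(word, auto, truth):
    predicted = {i for i, ws in auto.items() if word in ws}
    relevant = {i for i, ws in truth.items() if word in ws}
    correct = len(predicted & relevant)
    recall = correct / len(relevant) if relevant else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision
```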

23 Results: Automatic Image Annotation

24

25 Results: Ranked Retrieval of Images Given a text query w_qry and a testing collection of un-annotated images, for each testing image we use equation (1) to get the conditional probability P(w_qry|r_J). We use four sets of queries, constructed from all 1-, 2-, 3- and 4-word combinations of words that occur at least twice in the testing set (see the sketch below). An image is considered relevant to a given query if its manual annotation contains all of the query words.
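A sketch of this query construction and relevance test, following the slide's description literally (`test_annotations` is a hypothetical list of per-image word lists; enumerating all combinations is fine for a sketch but combinatorially large in practice):

```python
from collections import Counter
from itertools import combinations

def build_queries(test_annotations, max_len=4):
    counts = Counter(w for ws in test_annotations for w in ws)
    frequent = sorted(w for w, c in counts.items() if c >= 2)
    return {n: list(combinations(frequent, n)) for n in range(1, max_len + 1)}

def is_relevant(query, annotation):
    return set(query) <= set(annotation)   # all query words must appear
```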

26 Results: Ranked Retrieval of Images Evaluation metrics: –precision at 5 retrieved images –non-interpolated average precision
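Both metrics are straightforward to compute from a ranked list; a short sketch, where `ranked` is the ranked list of image ids and `relevant` is the set of ids relevant to the query:

```python
def precision_at_5(ranked, relevant):
    return sum(1 for i in ranked[:5] if i in relevant) / 5.0

def average_precision(ranked, relevant):
    """Non-interpolated AP: mean precision at each relevant retrieved item."""
    hits, total = 0, 0.0
    for rank, i in enumerate(ranked, start=1):
        if i in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0
```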

27 Results: Ranked Retrieval of Images

28 Conclusions and Future Work We proposed a new statistical generative model for learning the semantics of images. –This model works significantly better than a number of other models for image annotation and retrieval. –Our model works directly on continuous features. Future work will include the extension of this work to larger datasets.

