Image Retrieval Basics Uichin Lee KAIST KSE Slides based on “Relevance Models for Automatic Image and Video Annotation & Retrieval” by R. Manmatha (UMASS)
How do we retrieve images? Using Content-based Image Retrieval (CBIR) systems –Hard to represent information needs using abstract image features: color percentages, color layout, and textures.
How do we retrieve images? IBM QBIC system example using color
How do we retrieve images? Use Google image search! –Google uses filenames and surrounding text, and ignores the contents of the images.
How do we retrieve images? Using manual annotations –Libraries, Museums –Manual annotation is expensive.
Picture from Library of Congress American Memory Collections:
CREATED/PUBLISHED: 1940 August
NOTES: Store or cafe with soft drink signs: Coca-Cola, Orange-Crush, Royal Crown, Double Cola and Dr. Pepper.
SUBJECTS: Carbonated beverages; Advertisements; Restaurants; United States--Mississippi--Natchez; Slides--Color
CALL NUMBER: LC-USF35-115
How to retrieve images/videos? Retrieval based on similarity search of visual features (similar to traditional IR w/ visterms) –Doesn’t support textual queries –Doesn’t capture “semantics” Automatically annotate images, then retrieve based on the textual annotations. Example annotations: Tiger, grass.
Content-based image retrieval [Pipeline diagram: extract features from the query image and from every image in the database, compute the similarity between the query features and each database image's features, then rank the images by similarity]
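A minimal sketch of the ranking step, assuming every image has already been reduced to a fixed-length feature vector (the function and variable names here are illustrative, not from the slides):

    import numpy as np

    def rank_images(query_features, database_features):
        """Rank database images by similarity to a query feature vector."""
        # Cosine similarity: normalize each vector, then take dot products.
        q = query_features / np.linalg.norm(query_features)
        db = database_features / np.linalg.norm(database_features, axis=1, keepdims=True)
        similarities = db @ q             # one score per database image
        return np.argsort(-similarities)  # indices, most similar first

Any distance on the feature space works here; cosine similarity is just one common choice.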
Visterms: image vocabulary Can we represent all the images with a finite set of symbols? –Text documents consist of words –Images consist of visterms [Example: image regions labeled with visterms V123, V89, V988, V4552, V12336, V2, V765, V9887]
Construction of visterms
1. Segment images into visual segments (e.g., Blobworld, Normalized-cuts algorithm)
2. Extract features from segments
3. Cluster similar segments (k-means)
4. Each cluster is a visterm
[Figure: images are broken into segments, and each segment maps to a visterm (= blob-token), e.g., V1 V2 V3 V4 and V1 V5 V6]
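A minimal sketch of steps 3-4, assuming segment feature vectors have already been extracted; the vocabulary size and the use of scikit-learn's k-means are illustrative choices:

    from sklearn.cluster import KMeans

    def build_visterm_vocabulary(segment_features, vocab_size=500):
        """Cluster segment features; each cluster id becomes a visterm."""
        kmeans = KMeans(n_clusters=vocab_size, random_state=0).fit(segment_features)
        return kmeans, kmeans.labels_  # model + visterm id per training segment

    def visterms_for_image(kmeans, new_segment_features):
        # Map a new image's segments onto the learned visterm vocabulary.
        return kmeans.predict(new_segment_features)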
Segmentation Segment images into parts (tiles or regions) –Tiling: break the image down into simple geometric shapes (e.g., a rectangular grid) –Regioning: break the image down into visually coherent areas
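A minimal NumPy sketch of tiling, assuming the image is an H x W x C array (the tile size is an arbitrary choice):

    def tile_image(image, tile_h=64, tile_w=64):
        """Partition an image into a grid of rectangular tiles."""
        tiles = []
        h, w = image.shape[:2]
        for y in range(0, h - tile_h + 1, tile_h):
            for x in range(0, w - tile_w + 1, tile_w):
                tiles.append(image[y:y + tile_h, x:x + tile_w])
        return tiles  # tiles that would cross the border are dropped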
Image features Information about color, texture, or shape extracted from an image (or a part of it) is known as an image feature. Features –Color (e.g., red), Texture (e.g., sandy), Shape –SIFT (Scale-Invariant Feature Transform)* –… [Figure: a color histogram and a texture patch] *David G. Lowe, “Distinctive image features from scale-invariant keypoints” (IJCV 2004)
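For example, a joint RGB color histogram, one common color feature, can be computed as follows (the bin count is an arbitrary choice):

    import numpy as np

    def color_histogram(image, bins_per_channel=8):
        """Normalized joint RGB histogram of a uint8 (H, W, 3) image region."""
        pixels = image.reshape(-1, 3)
        hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                                 range=((0, 256),) * 3)
        hist = hist.flatten()
        # Normalize so regions of different sizes yield comparable features.
        return hist / hist.sum()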
Discrete visterms Segmentation vs. rectangular partition –Tiling vs. regioning Result: rectangular partitioning performs better than segmentation! –The tile-based model is learned over many images, –whereas segmentation is computed over one image at a time.
Automatic annotation & retrieval Automatically annotate unseen images –A training set of annotated images Do not know which word corresponds to which part of image. –Compute visterms (based on image features) –Learn a model and annotate a set of test –Learn all annotations at the same time Retrieval based on the annotation output –Use query likelihood language model –Rank test images according to the likelihoods
Correspondence (matching) Now we want to find the relationship between visterms and words. –P( Tiger | V1 ), P( V1 | Tiger ), P( Maui | V3, V4 ) [Figure: the word set {Maui, People, Dance, Sea, Sand, Sea_Lion, Tiger, grass} matched against the visterm set {V2, V4, V6, V5, V12, V321, V1, V3}]
Correspondence models Co-occurrence model Translation model Normalized & regularized model Cross media relevance model Continuous relevance model Multiple Bernoulli model…
Co-occurrence models Mori et al. 1999 Create the co-occurrence table using a training set of annotated images Tend to annotate with high-frequency words Context is ignored –Needs joint probability models

      w1   w2   w3   w4
V1    12    2    0    1
V2    32   40   13   32
V3    13   12    0    0
V4    65   43   12    0

P( w1 | v1 ) = 12/(12+2+0+1) = 0.8
P( v3 | w2 ) = 12/(2+40+12+43) ≈ 0.12
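The same conditional probabilities fall out of normalizing the table by rows (for P(w | v)) or by columns (for P(v | w)); a small NumPy check using the counts above:

    import numpy as np

    # Rows are visterms v1..v4, columns are words w1..w4.
    counts = np.array([[12,  2,  0,  1],
                       [32, 40, 13, 32],
                       [13, 12,  0,  0],
                       [65, 43, 12,  0]])

    p_w_given_v = counts / counts.sum(axis=1, keepdims=True)  # row-normalize
    p_v_given_w = counts / counts.sum(axis=0, keepdims=True)  # column-normalize

    print(p_w_given_v[0, 0])  # P(w1 | v1) = 12/15 = 0.8
    print(p_v_given_w[2, 1])  # P(v3 | w2) = 12/97 ≈ 0.12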
Cross media relevance models Estimating the relevance model – the joint distribution of words and visterms Training: –Joint distribution computed as an expectation over the training set of images J –P(w, b1, b2, …, bm) = ∑_J P(J) P(w, b1, …, bm | J) Annotation: –Compute P(w | I) for different w –Annotate the image with every possible w in the vocabulary with associated probabilities (or pick the top k words) Retrieval: –Given a query Q, find the probability of drawing Q from image I: P(Q | I) –Rank images according to this probability J. Jeon, V. Lavrenko and R. Manmatha, Automatic Image Annotation and Retrieval using Cross-Media Relevance Models, In Proc. SIGIR’03.
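A minimal sketch of CMRM annotation scoring, assuming a uniform prior P(J) and simple count-based smoothing; the smoothing weights alpha and beta are illustrative, not the values tuned in Jeon et al.:

    def cmrm_scores(test_visterms, training_set, vocab, alpha=0.1, beta=0.9):
        """Score P(w, b1..bm) for each word w, up to a constant.

        training_set is a list of (words, blobs) pairs, one per image J.
        """
        all_words = [w for words, _ in training_set for w in words]
        all_blobs = [b for _, blobs in training_set for b in blobs]
        # Collection-level frequencies used for smoothing.
        bg_w = {w: all_words.count(w) / len(all_words) for w in vocab}
        bg_b = {b: all_blobs.count(b) / len(all_blobs) for b in set(test_visterms)}

        scores = {}
        for w in vocab:
            total = 0.0
            for words, blobs in training_set:
                # Smoothed P(w | J): mix in-image and collection frequencies.
                p_w = (1 - alpha) * words.count(w) / len(words) + alpha * bg_w[w]
                # Smoothed product of P(b | J) over the test image's visterms.
                p_b = 1.0
                for b in test_visterms:
                    p_b *= (1 - beta) * blobs.count(b) / len(blobs) + beta * bg_b[b]
                total += p_w * p_b  # uniform P(J) folded into the constant
            scores[w] = total
        return scores

Annotation keeps the top-k words by score; for retrieval, P(Q | I) comes from the per-word values after normalizing them into P(w | I).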