Inference Network Approach to Image Retrieval
Don Metzler, R. Manmatha
Center for Intelligent Information Retrieval
University of Massachusetts, Amherst
Motivation
- Most image retrieval systems assume:
  - an implicit "AND" between query terms
  - equal weight for all query terms
  - a query made up of a single representation (keywords or an image)
- "tiger grass" => "find images of tigers AND grass, where each is equally important"
- How can we search with queries made up of both keywords and images?
- How do we perform the following queries?
  - "swimmers OR jets"
  - "tiger AND grass, with more emphasis on tigers than grass"
  - "find me images of birds that are similar to this image"
Related Work
- Inference networks
- Semantic image retrieval
- Kernel methods
Inference Networks
- Inference network framework [Turtle and Croft '89]
  - Formal information retrieval framework
  - INQUERY search engine
  - Allows structured queries: phrases, term weighting, synonyms, etc.
    - e.g. #wsum( 2.0 #phrase( image retrieval ) 1.0 model )
  - Handles multiple document representations (full text, abstracts, etc.)
- MIRROR [deVries '98]
  - General multimedia retrieval framework based on the inference network framework
  - Probabilities based on clustering of metadata + feature vectors
Image Retrieval / Annotation
- Co-occurrence model [Mori, et al.]
- Translation model [Duygulu, et al.]
- Correspondence LDA [Blei and Jordan]
- Relevance model-based approaches
  - Cross-Media Relevance Models (CMRM) [Jeon, et al.]
  - Continuous Relevance Models (CRM) [Lavrenko, et al.]
Goals
- Input
  - Set of annotated training images
  - User's information need: terms, images, "soft" Boolean operators (AND, OR, NOT), weights
  - Set of test images with no annotations
- Output
  - Ranked list of test images relevant to the user's information need
Data
- Corel data set †
  - 4500 training images (annotated)
  - 500 test images
  - 374-word vocabulary
- Each image automatically segmented using normalized cuts
- Each image represented as a set of representation vectors
  - 36 geometric, color, and texture features
  - Same features used in similar past work
† Available at:
Features
- Geometric (6): area, position (2), boundary/area, convexity, moment of inertia
- Color (18): avg. RGB x 2 (6), std. dev. of RGB (3), avg. L*a*b x 2 (6), std. dev. of L*a*b (3)
- Texture (12): mean oriented energy, 30 deg. increments (12)
Image Representation
[Figure: segmented example image annotated "cat, grass, tiger, water"]
- Annotation vector: binary, the same for each segment
- Representation vector: real-valued, 1 per image segment
Image Inference Network
[Figure: network with image node J feeding representation nodes q_r1 … q_rk and word nodes q_w1 … q_wk (the "image network", fixed based on the image), which in turn feed operator nodes q_op1, q_op2 and the information-need node I (the "query network", dynamic based on the query)]
- J    – representation vectors for the image (continuous, observed)
- q_w  – word w appears in the annotation (binary, hidden)
- q_r  – representation vector r describes the image (binary, hidden)
- q_op – query operator satisfied (binary, hidden)
- I    – user's information need is satisfied (binary, hidden)
Example Instantiation
[Figure: query network for #or( #and( tiger grass ) ) — term nodes "tiger" and "grass" feed an #and node, which feeds an #or node]
What needs to be estimated?
- P(q_w | J)
- P(q_r | J)
- P(q_op | J)
- P(I | J)
P(q_w | J)  [e.g. P( tiger | J )]
- Probability that term w appears in the annotation given image J
- Apply Bayes' rule and use non-parametric density estimation
- Assumes the representation vectors are conditionally independent given that term w annotates the image:

  P(q_w | J) = P(J | q_w) P(q_w) / P(J) ∝ P(q_w) ∏_i P(r_i | q_w)
How can we compute P(r_i | q_w)?
[Figure: training-set representation vectors in feature space; the representation vectors associated with images annotated by w define an area of high likelihood, while regions far from them have low likelihood]
P(q_w | J)  [final form]

  P(r_i | q_w) = (1 / |T_w|) Σ_{t ∈ T_w} exp( -(r_i - t)ᵀ Σ⁻¹ (r_i - t) / 2 ) / √( (2π)^d |Σ| )

  where T_w is the set of training representation vectors from images annotated with w
- Σ assumed to be diagonal, estimated from training data
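The estimate above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the authors' code: the function names, the NumPy usage, and the unnormalized score are assumptions; `sigma_diag` holds the diagonal of Σ.

```python
import numpy as np

def p_r_given_word(r, word_vectors, sigma_diag):
    """Kernel density estimate of P(r | q_w): average of Gaussian
    kernels with diagonal covariance, centered on the training
    representation vectors associated with word w."""
    word_vectors = np.asarray(word_vectors)            # shape (n, d)
    diff = word_vectors - r                            # broadcast (n, d)
    d = len(sigma_diag)
    norm = np.sqrt((2 * np.pi) ** d * np.prod(sigma_diag))
    kernels = np.exp(-0.5 * np.sum(diff ** 2 / sigma_diag, axis=1))
    return kernels.mean() / norm

def p_word_given_image(word_vectors, prior, image_vectors, sigma_diag):
    """Unnormalized P(q_w | J) ∝ P(q_w) * prod_i P(r_i | q_w),
    assuming the r_i are conditionally independent given q_w."""
    score = prior
    for r in image_vectors:
        score *= p_r_given_word(r, word_vectors, sigma_diag)
    return score
```

As the slides note, the resulting scores are only relative per image; comparing them across images motivates the regularized estimates that follow.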
Regularized estimates…
- P(q_w | J) are good, but not comparable across images

    Image 1:                Image 2:
    term    P(q_w | J)      term    P(q_w | J)
    cat     0.45            cat     0.90
    grass   0.35            grass   0.05
    tiger   0.15            tiger   0.01
    water   0.05            water   0.03

- Is the 2nd image really 2x more "cat-like"?
- Probabilities are relative per image
Regularized estimates…
- Impact transformations
  - Used in information retrieval
  - "Rank is more important than value" [Anh and Moffat]
- Idea: rank each term according to P(q_w | J); give higher probabilities to higher-ranked terms
  - P(q_w | J) ≈ 1 / rank(q_w)
- Zipfian assumption on relevant words:
  - a few words are very relevant
  - a medium number of words are somewhat relevant
  - many words are not relevant
Regularized estimates…

    Image 1:                        Image 2:
    term    P(q_w | J)  1/rank      term    P(q_w | J)  1/rank
    cat     0.45        1.00        cat     0.90        1.00
    grass   0.35        0.50        grass   0.05        0.50
    tiger   0.15        0.33        tiger   0.01        0.25
    water   0.05        0.25        water   0.03        0.33
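The impact transformation amounts to replacing each probability by the reciprocal of its rank within the image. A minimal sketch (the function name is invented for illustration):

```python
def regularize_by_rank(word_probs):
    """Impact-style transform: replace each P(q_w | J) by 1/rank of w
    under that image's probabilities, so that scores become comparable
    across images."""
    # Sort words by descending probability; rank 1 gets score 1.0,
    # rank 2 gets 0.5, rank 3 gets 1/3, and so on.
    ranked = sorted(word_probs, key=word_probs.get, reverse=True)
    return {w: 1.0 / (rank + 1) for rank, w in enumerate(ranked)}
```

Applied to the two example images, both now assign "cat" the score 1.0 regardless of whether its raw probability was 0.45 or 0.90, which is exactly the point of the transform.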
What needs to be estimated?
- P(q_w | J)
- P(q_r | J)
- P(q_op | J)
- P(I | J)
P(q_r | J)
- Probability that representation vector r is observed given J
- Use non-parametric density estimation again
  - Impose a density over J's representation vectors, just as in the previous case
- Estimates may be poor
  - Based on a small sample (~10 representation vectors)
- Naïve and simple, yet somewhat effective
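This estimate can be sketched the same way as P(r | q_w), but with the kernels placed on J's own segment vectors; the function name and NumPy usage are illustrative assumptions, and `sigma_diag` again holds the diagonal of Σ:

```python
import numpy as np

def p_r_given_image(r, image_vectors, sigma_diag):
    """P(q_r | J): Gaussian kernel density over J's own (~10)
    representation vectors -- a small-sample, hence possibly poor,
    estimate, as the slide notes."""
    image_vectors = np.asarray(image_vectors)          # shape (n, d)
    d = len(sigma_diag)
    norm = np.sqrt((2 * np.pi) ** d * np.prod(sigma_diag))
    kernels = np.exp(
        -0.5 * np.sum((image_vectors - r) ** 2 / sigma_diag, axis=1)
    )
    return kernels.mean() / norm
```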
Model Comparison
- Relevance model-based (CMRM, CRM)
  - General form: P(w, r_1 … r_n) = Σ_{T ∈ training} P(T) P(w | T) ∏_i P(r_i | T)
- Fully non-parametric model used here
  - General form: P(q_w | J) ∝ P(q_w) ∏_i P(r_i | q_w)
What needs to be estimated?
- P(q_w | J)
- P(q_r | J)
- P(q_op | J)
- P(I | J)
Query Operators
- "Soft" Boolean operators
  - #and / #wand (weighted and)
  - #or
  - #not
- One node added to the query network for each operator present in the query
- Many others possible
  - #max, #sum, #wsum
  - #syn, #odn, #uwn, #phrase, etc.
#or( #and( tiger grass ) )
[Figure: query network — term nodes "tiger" and "grass" feed an #and node, which feeds an #or node]
Operator Nodes
- Combine probabilities from term and image nodes
- Closed forms derived from the corresponding link matrices
  - Allows efficient inference within the network
- Par(q) = set of q's parent nodes
… but where do they come from?

    Link matrix for Q = A AND B:
    A      B      P(Q = true | a, b)
    false  false  0
    true   false  0
    false  true   0
    true   true   1
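Marginalizing over such link matrices gives the standard inference-network closed forms. A sketch, assuming the usual #and/#or/#not beliefs and a geometric weighted form for #wand (the function names are illustrative):

```python
import math

def p_and(parents):
    """#and: product of the parent beliefs."""
    out = 1.0
    for p in parents:
        out *= p
    return out

def p_or(parents):
    """#or: 1 - prod(1 - p) over the parent beliefs."""
    out = 1.0
    for p in parents:
        out *= 1.0 - p
    return 1.0 - out

def p_not(p):
    """#not: complement of the parent belief."""
    return 1.0 - p

def p_wand(parents, weights):
    """#wand: weighted AND, prod p_i^(w_i / W) with W = sum of weights."""
    total = sum(weights)
    return math.prod(p ** (w / total) for p, w in zip(parents, weights))
```

Under this sketch, the earlier example query #or( #and( tiger grass ) ) would be evaluated bottom-up as `p_or([p_and([p_tiger, p_grass])])`.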
Results – Annotation
[Table: annotation results on the full vocabulary comparing the Translation model, CMRM, CRM, and InfNet on # words with recall >= a threshold, mean per-word recall, mean per-word precision, and F-measure]
[Figure: example automatic annotations for three test images — foals (0.46), mare (0.33), horses (0.20), field (1.9E-5), grass (4.9E-6); railroad (0.67), train (0.27), smoke (0.04), locomotive (0.01), ruins (1.7E-5); sphinx (0.99), polar (5.0E-3), stone (1.0E-3), bear (9.7E-4), sculpture (6.0E-4)]
Results – Retrieval
[Tables: precision at 5 retrieved images and mean average precision for 1-, 2-, and 3-word queries, comparing CMRM, CRM, InfNet, and InfNet-reg]
Future Work
- Use rectangular segmentation and improved features
- Different probability estimates
  - Better methods for estimating P(q_r | J)
  - Use CRM to estimate P(q_w | J)
- Apply to documents with both text and images
- Develop a method/testbed for evaluating more "interesting" queries
Conclusions
- General, robust model based on the inference network framework
- Departure from the implied "AND" between query terms
- Unique non-parametric method for estimating network probabilities
- Pros
  - Retrieval (inference) is fast
  - Makes no assumptions about the distribution of the data
- Cons
  - Estimation of term probabilities is slow
  - Requires sufficient data to get a good estimate