Using Large-Scale Web Data to Facilitate Textual Query Based Retrieval of Consumer Photos
Motivation
Digital cameras and mobile-phone cameras are spreading rapidly:
–More and more personal photos;
–Retrieving images from these enormous collections of personal photos becomes an increasingly important topic.
How to retrieve?
Prior Work: CBIR
Content-Based Image Retrieval (CBIR):
–Users provide images as queries to retrieve personal photos.
The paramount challenge -- the semantic gap:
–The gap between the low-level visual features and the high-level semantic concepts.
[Figure: a query image with a high-level concept is mapped to a low-level feature vector and compared against the feature vectors in the database; the semantic gap lies between the two levels.]
Prior Work: Image Annotation
It is more convenient for the user to retrieve the desired personal photos using textual queries.
Image annotation is used to classify images w.r.t. high-level semantic concepts:
–Semantic concepts are analogous to the textual terms describing document contents;
–An intermediate stage for textual query based image retrieval.
[Figure: a textual query (e.g., "Sunset") is compared against the annotation results (high-level concepts) of the database images to retrieve matching photos.]
Idea
Web images are accompanied by contextual information such as tags, categories and titles (e.g., "building", "people, family", "people, wedding", "sunset"):
–Google and Flickr exploit them to index web images.
But raw consumer photos from digital cameras do not contain such semantic textual descriptions.
Idea: leverage the information from web images to retrieve consumer photos in a personal photo collection, with no intermediate image annotation process.
When the user provides a textual query:
–The query is used, together with WordNet, to find relevant/irrelevant images in a large collection of web images (with descriptive words);
–A classifier is trained based on these web images;
–The raw consumer photos are ranked by the classifier's decision values, giving the top-ranked consumer photos;
–The user can also use relevance feedback to refine the retrieval results.
[Diagram: textual query → automatic web image retrieval (large web image collection, WordNet) → relevant/irrelevant images → consumer photo retrieval over raw consumer photos → top-ranked consumer photos, refined via relevance feedback.]
For the user's textual query (e.g., "boat"), first search it in the semantic word trees built from WordNet; its two-level descendants include, e.g., ark, barge, dredger and houseboat.
Using the inverted file:
–Web images containing the query word are considered relevant web images;
–Web images containing neither the query word nor any of its two-level descendants are considered irrelevant web images.
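The split described above can be sketched as follows. `WORD_TREE` and the tag sets are hypothetical toy stand-ins for the WordNet-based semantic word trees and the inverted file.

```python
# Hypothetical stand-in for the WordNet semantic word trees:
# query word -> its two-level descendants (toy data).
WORD_TREE = {
    "boat": {"ark", "barge", "dredger", "houseboat"},
}

def split_web_images(query, web_images):
    """web_images: list of (image_id, set_of_tags) pairs.
    Returns (relevant, irrelevant) image-id lists per the rule above."""
    excluded = {query} | WORD_TREE.get(query, set())
    # Relevant: tags contain the query word itself.
    relevant = [img for img, tags in web_images if query in tags]
    # Irrelevant: tags contain neither the query word nor any descendant.
    irrelevant = [img for img, tags in web_images if not (tags & excluded)]
    return relevant, irrelevant

images = [("a.jpg", {"boat", "sea"}), ("b.jpg", {"barge"}), ("c.jpg", {"sunset"})]
rel, irr = split_web_images("boat", images)
```

Note that "b.jpg" ends up in neither set: it lacks the query word but contains a descendant, so it is excluded from the irrelevant pool as well.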
Construct 100 smaller training sets:
–Negative samples: randomly sample a fixed number of irrelevant web images, repeated 100 times;
–Positive samples: the relevant web images.
Based on each training set, train decision stumps on each feature dimension, giving a classifier f_s(x).
Finally, linearly combine all decision stumps based on their training errors.
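A minimal sketch of training one such bag, assuming a simple exhaustive threshold search for each stump and a hypothetical error-based weight max(1 − 2·err, 0); the slide does not give the exact combination rule.

```python
import numpy as np

def train_stump(values, labels):
    """Best threshold/sign decision stump on one feature dimension.
    Prediction is s * sign(x - theta); returns (theta, s, training error)."""
    best = (0.0, 1, 1.0)
    for theta in np.unique(values):
        for s in (1, -1):
            pred = s * np.sign(values - theta)
            pred[pred == 0] = s          # break ties toward s
            err = float(np.mean(pred != labels))
            if err < best[2]:
                best = (float(theta), s, err)
    return best

def train_bag(pos, neg_pool, n_neg, rng):
    """One training set: all positives + a random negative sample."""
    neg = neg_pool[rng.choice(len(neg_pool), n_neg, replace=False)]
    X = np.vstack([pos, neg])
    y = np.r_[np.ones(len(pos)), -np.ones(n_neg)]
    stumps = []
    for d in range(X.shape[1]):
        theta, s, err = train_stump(X[:, d], y)
        # Hypothetical weighting: stumps with lower training error count more.
        stumps.append((d, theta, s, max(1.0 - 2.0 * err, 0.0)))
    return stumps

pos = np.array([[2.0], [3.0]])                    # relevant web images
neg_pool = np.array([[0.0], [-1.0], [-2.0]])      # irrelevant web images
stumps = train_bag(pos, neg_pool, 2, np.random.default_rng(0))
```

In the full method this is repeated for 100 random negative samples, and all resulting stumps are linearly combined.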
Relevance Feedback via Cross-Domain Regularized Regression
Design a target linear classifier f_T(x) = w^T x:
–For the user-labeled images x_1, …, x_l, f_T(x) should be close to +1 (labeled as positive) or -1 (labeled as negative);
–For the other images, f_T(x) should be close to f_s(x);
–A regularizer controls the complexity of the target classifier f_T(x).
This problem can be solved with a least-squares solver.
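The three terms above form a ridge-style least-squares problem with a closed-form solution; the sketch below assumes the objective ||X_lab w − y||² + λ||X_unlab w − f_s||² + μ||w||², with hypothetical trade-off parameters λ and μ.

```python
import numpy as np

def cross_domain_ridge(X_lab, y, X_unlab, fs_unlab, lam=1.0, mu=0.1):
    """Closed-form least-squares solution of the sketched objective:
    labeled images pulled toward their +/-1 labels, other images pulled
    toward the source decision f_s(x), plus a complexity regularizer."""
    D = X_lab.shape[1]
    A = X_lab.T @ X_lab + lam * (X_unlab.T @ X_unlab) + mu * np.eye(D)
    b = X_lab.T @ y + lam * (X_unlab.T @ fs_unlab)
    return np.linalg.solve(A, b)

X_lab = np.array([[1.0, 0.0], [0.0, 1.0]])   # user-labeled images
y = np.array([1.0, -1.0])                    # relevance-feedback labels
X_unlab = np.array([[1.0, 1.0]])             # other images
fs_unlab = np.array([0.0])                   # source decision values f_s(x)
w = cross_domain_ridge(X_lab, y, X_unlab, fs_unlab)
```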
Source Classifiers
Decision stump ensemble:
–Trained on each dimension for each bag;
–Decision values are fused after a sigmoid mapping: f_d(x) = Σ_i γ_id h(s_id(x_d − θ_id));
–Pros: non-linear; easy to parallelize;
–Cons: testing is time-consuming.
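A vectorized sketch of the fused decision value, taking h to be the sigmoid applied to each stump's margin s_id(x_d − θ_id); this is what makes testing cost one exp call per (bag, dimension) pair.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def decision_value(x, gammas, signs, thetas):
    """Soft decision-stump ensemble: each stump's margin s_id * (x_d - theta_id)
    passes through a sigmoid before the gamma-weighted sum.
    gammas, signs, thetas all have shape (N_bags, D)."""
    margins = signs * (x[None, :] - thetas)          # shape (N, D)
    return float(np.sum(gammas * sigmoid(margins)))  # N*D sigmoid evaluations

# Toy check: neutral stumps each contribute sigmoid(0) = 0.5.
score = decision_value(np.zeros(2), np.ones((1, 2)), np.ones((1, 2)), np.zeros((1, 2)))
```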
Accelerating Source Classifiers
One possible solution is to remove the sigmoid mapping:
–f_d(x) = Σ_i γ_id s_id(x_d − θ_id) = (Σ_i γ_id s_id) x_d − (Σ_i γ_id s_id θ_id).
Assume there are N bags and D dimensions:
–Testing complexity: O(ND) --> O(D);
–Cons: the classifier becomes linear, and is too weak.
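The algebra above can be checked numerically: without the sigmoid, the per-bag, per-dimension sums fold into one precomputed weight vector and offset, so each test image costs a single dot product.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 100, 8                          # bags, feature dimensions (toy sizes)
gammas = rng.random((N, D))
signs = rng.choice([-1.0, 1.0], (N, D))
thetas = rng.random((N, D))

def slow_decision(x):
    """O(ND): visit every stump explicitly."""
    return float(np.sum(gammas * signs * (x[None, :] - thetas)))

# Fold the sums once, ahead of testing:
w = (gammas * signs).sum(axis=0)       # per-dimension weight, Σ_i γ_id s_id
b = (gammas * signs * thetas).sum()    # constant offset, Σ_i,d γ_id s_id θ_id

def fast_decision(x):
    """O(D): one dot product per test image."""
    return float(w @ x - b)

x = rng.random(D)
```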
Accelerating Source Classifiers
Another possible solution: use a linear SVM instead of the decision stump ensemble.
–Train one linear SVM classifier for each bag;
–Fuse the decision values with a sigmoid mapping.
Pros:
–Fewer bags are likely to suffice for a satisfying retrieval precision;
–Although the testing complexity is still O(ND), there are far fewer ``exp'' function calls (ND --> N);
–Each individual classifier is computed with just a vector dot product, which can be evaluated efficiently with SIMD instructions.
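A sketch of this fusion, assuming hypothetical pre-trained per-bag SVM weights W and biases b; the actual SVM training is omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fused_svm_decision(x, W, b):
    """Fuse N per-bag linear SVM decision values after a sigmoid mapping.
    W has shape (N, D), b shape (N,): hypothetical pre-trained models.
    W @ x is a batch of dot products (SIMD-friendly); only N exp calls."""
    return float(np.mean(sigmoid(W @ x + b)))

# Toy check: zero-weight classifiers all output sigmoid(0) = 0.5.
score = fused_svm_decision(np.ones(2), np.zeros((3, 2)), np.zeros(3))
```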
Error Rate Refinement during RF
Assume there are M training data, of which E instances are incorrectly classified:
–err_rate = E / M.
For f_s(x), when the user labels one instance x as y ∈ {-1, +1}:
–If f_s(x) classifies x correctly (sign(f_s(x)) = y), then err_rate = E / (M + α);
–If f_s(x) misclassifies x (sign(f_s(x)) = -y), then err_rate = (E + α) / (M + α).
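The update rule above is a one-line computation; α is the weight given to the newly labeled instance (the slide leaves its value unspecified, so α = 1 below is an assumption).

```python
def refine_err_rate(E, M, correct, alpha=1.0):
    """Update the error-rate estimate after the user labels one instance
    during relevance feedback.
    correct=True  -> the classifier agreed with the user: err = E / (M + alpha)
    correct=False -> the classifier was wrong:      err = (E + alpha) / (M + alpha)"""
    if correct:
        return E / (M + alpha)
    return (E + alpha) / (M + alpha)
```

For example, with E = 10 errors out of M = 100, a correct prediction lowers the estimate slightly, while a wrong one raises it.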