Download presentation
Presentation is loading. Please wait.
1
1 An Empirical Study on Large-Scale Content-Based Image Retrieval Group Meeting Presented by Wyman 23-1-2007
2
2 Introduction Content-based Image Retrieval (CBIR) –The process of searching for digital images in large databases based on image contents –It consists of four modules in general: data acquisition and processing feature representation data indexing query and feedback processing
3
3 Introduction Content-based Image Retrieval (CBIR) –Extensively studied in both academia and industry –Yet, traditional data indexing techniques are not scalable to high dimensional data –Image contents in high dimensional feature space –Thus, traditional CBIR systems are not efficient
4
4 Contribution Propose a scalable CBIR scheme by applying locality- sensitive hashing (LSH) –LSH is good at indexing high dimensional data Conduct a comprehensive empirical evaluation of CBIR over a half million images –There is very limited empirical study on that large CBIR systems Address some challenges for building scalable CBIR systems on large-scale data –some innovative ideas are suggested for tackling these issues
5
5 Hashing Algorithm Hashing algorithm is commonly employed for indexing large-scale database –Due to its fast database lookup capability –O(1) on average However, it is never used for similarity indexing –It builds index for searching identical copy of the query key, but not for searching near neighbors
6
6 Locality-Sensitive Hashing An emerging new indexing algorithm –Proposed to solve high-dimensional near neighbor searching problem in Euclidean space l 2 –It can answer queries in sublinear time –Each near neighbor being reported with a fixed probability Principles –Two close points most likely share the same hash value –By looking into the hash bucket of the query point, we obtain many near neigbhors of the query points –Large fraction of data points are not processed
7
7 Locality-Sensitive Hashing we employ the E 2 LSH (Exact Euclidean LSH) package The probability of finding near neighbors can be controlled by two parameters L and k –L: number of hash functions g(v) = (h 1 (v), … h k (v)) L: larger L increases the probability of finding all R-near neighbors k: larger k reduces the chance of hitting data points that are not R-near neighbors h(v) controlled by two parameters a and b –a: a d dimensional vector with entries chosen independently from a Gaussian distribution –b: a real number chosen uniformly from the range [0, w] Same h if this is within [ (n-1)w, (n)w )
8
8 Main problem of E 2 LSH One main problem is that E 2 LSH is a memory based implementation –All the data points and the data structures are stored in the main memory –Maximum database size limited by the amount of free main memory available
9
9 Our Scalable Implementation We propose a multi-partition indexing approach –We divide the whole database into multiple partitions –Each of them is small enough to be loaded into the main memory –Then we can process queries on each partition respectively
10
10 The Detailed Procedure 1.Divide the database into n partitions, where the number of partitions n = ceiling of (database size / max partition’s size). 2.Run E 2 LSH indexing on each of the partitions to create the hash tables; 3.Given a query q, load the pre-built hash tables T of one partition into the main memory; 4.Run E 2 LSH query on T to retrieve top k ranking images with respect to q; 5.Repeat (3) and (4) steps until all partitions are processed. 6.Collect the results from all partitions and return top k ranking results with respect to the query q.
11
11 Some Critical Issues Disk-access overhead for loading the hash tables into the main memory We can consider some parallel solutions to overcome this overhead issue and speedup the overall solution in our future work
12
12 Feature Representation We represent an image with three types of visual feature: color, shape, and texture For color, we use Grid Color Moment For shape, we employ an edge direction histogram For texture, we adopt Gabor feature In total, a 238-dimensional feature vector is employed to represent each image in our image database
13
13 Empirical Evaluation Experimental Testbed –Testbed containing 500,000 images crawled from Web –5,000 images from COREL image data set are engaged as the query set Contains 50 categories Each category consists of exactly 100 images that are randomly selected Every category represents a different semantic topic, such as butterfly, car, cat, dog, etc
14
14 Performance Metrics Two standard performance metrics –Precision and recall –Relevant image if it is one of the 100 COREL images in the database which share the same category with the query images –Evaluate efficiency by average CPU time elapsed on a given query
15
15 Experimental Setup Form a query set of 1000 image examples by randomly select 20 images from each of the 50 COREL image categories Prepare 10 image databases of different sizes ranging from 50,000 images to 500,000 images A database of size N contains the followings: –5,000 COREL images –N-5000 other images selected from our testbed Extract the image features using the techniques discussed
16
16 Experimental Setup Perform a series of experiments using our CBIR system with LSH indexing on all the 10 databases –LSH’s parameters: L = 550 and k = 34 –Retrieve top 20 and top 50 ranking images for each query image –Calculate recall and precision Simulate two rounds of relevance feedback –re-querying the database with relevant examples with respect to the given query Environments –3.2GHz Intel Pentium 4 PC with 2GB memory running Linux kernel 2.6 –All implementations are programmed by C++ language The same experiments are repeated with Exhaustive Searching indexing
17
17 Average Precision of TOP20 The results of LSH is very close to the ES results Their maximal difference is no more than 5% at any database size Average Precision of TOP20
18
18 Average Recall of TOP50 Average recall and average precision of our CBIR system decrease with the database size, yet the decreasing rate Diminishes when the database size increases Average Recall of TOP50
19
19 Average Query Time the query time for ES is linear to the database size, while the one for LSH is sublinear
20
20 Time Performance of LSH over ES on different databases LSH approach is much faster than the ES solution with an average speedup greater than 4 times The gap of time performance between them grows even faster when the database size increases
21
21 Conclusion and Future Work Proposed a scalable CBIR scheme based on a fast high dimensional indexing technique, LSH Conducted extensive empirical evaluations on a large testbed of a half million images –Our scalable CBIR system is more efficient than traditional exhaustive linear searching methods –Our system is scalable to large-scale CBIR applications Addressed some limitations and challenges in our current solution Consider parallel solutions in the coming future
22
22 Q & A
Similar presentations
© 2024 SlidePlayer.com Inc.
All rights reserved.