1  NNH: Improving Performance of Nearest-Neighbor Searches Using Histograms
Liang Jin (UC Irvine), Nick Koudas (AT&T Labs Research), Chen Li (UC Irvine)
Supported by NSF CAREER No. IIS-0238586
EDBT 2004

2  NN (nearest-neighbor) search
- k-NN query: find the k nearest neighbors of a query object q.
- NN-join: for each object in the first dataset D1, find its k nearest neighbors in the second dataset D2.
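To make the two operations concrete, here is a minimal brute-force sketch in Python (the function names are ours, for illustration only; real systems answer these queries through index structures such as R-trees, as the following slides describe):

```python
import heapq
from math import dist  # Euclidean distance between two points (Python 3.8+)

def knn(q, data, k):
    # k-NN query: the k objects in `data` closest to q.
    return heapq.nsmallest(k, data, key=lambda o: dist(q, o))

def knn_join(d1, d2, k):
    # k-NN join: for every object in d1, its k nearest neighbors in d2.
    return {tuple(o1): knn(o1, d2, k) for o1 in d1}
```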

3  Example: image search
- Images are represented as features (color histogram, texture moments, etc.).
- Similarity search uses these features: "Find the 10 most similar images for the query image."
- Other applications:
  - Web-page search: "Find the 100 most similar pages for a given page."
  - GIS: "Find the 5 closest cities to Irvine."
  - Data cleaning

4  NN Algorithms
- Distance measure:
  - When objects are points, distance is well defined; usually Euclidean, though other distances are possible.
  - For arbitrarily-shaped objects, we assume a distance function between them is given.
- Most algorithms assume a high-dimensional tree structure (e.g., an R-tree) over the dataset.

5  Search process (1-NN for example)
- Most algorithms traverse the structure (e.g., an R-tree) top down and follow a branch-and-bound approach:
  - keep a priority queue of nodes (MBRs) to be visited, sorted by the minimum distance (MINDIST) between q and each node.
- Improvement: use MINDIST and MINMAXDIST to
  - reduce the queue size;
  - avoid unnecessary disk I/Os to access MBRs.
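A sketch of this best-first, branch-and-bound traversal. The R-tree node interface (is_leaf, objects, children, mbr) is invented here for illustration, and leaf objects are assumed to be points; MINDIST is the standard minimum distance from a point to a rectangle:

```python
import heapq
from math import dist

def mindist(q, mbr):
    # MINDIST: smallest possible distance from point q to the rectangle
    # mbr = (lows, highs); 0 in any dimension where q lies inside the range.
    lows, highs = mbr
    return sum(max(lo - x, 0.0, x - hi) ** 2
               for x, lo, hi in zip(q, lows, highs)) ** 0.5

def nn_search(q, root):
    # Best-first 1-NN search: visit nodes in increasing MINDIST order.
    best, best_d = None, float("inf")
    queue = [(0.0, 0, root)]               # (MINDIST, tiebreak, node)
    tiebreak = 1
    while queue:
        d, _, node = heapq.heappop(queue)
        if d > best_d:                     # nothing closer can remain
            break
        if node.is_leaf:
            for obj in node.objects:
                od = dist(q, obj)
                if od < best_d:
                    best, best_d = obj, od
        else:
            for child in node.children:
                heapq.heappush(queue, (mindist(q, child.mbr), tiebreak, child))
                tiebreak += 1
    return best, best_d
```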

6  Problem
- The queue size may be large. Example: 60,000 objects, 32-d image vectors, 50 NNs:
  - max queue size: 15K entries; average queue size: about half of that (7.5K entries).
- If the queue can't fit in memory: more disk I/Os!
- The problem is worse for k-NN joins. E.g., a 1500 x 1500 join:
  - max queue size: 1.7M entries, i.e., >= 1 GB of memory;
  - 750 seconds to run;
  - couldn't scale up to 2000 objects due to disk thrashing.

7  Our Solution: Nearest-Neighbor Histogram (NNH)
Outline:
- Main idea
- Utilizing NNH in a search (k-NN, join)
- Construction and incremental maintenance
- Experiments
- Related work

8  NNH: Nearest-Neighbor Histograms
- m pivots p_1, p_2, ..., p_m (m = number of pivots); the pivots are not part of the database.
- Each pivot stores the distances to its nearest neighbors: r_1, r_2, ...

9  Structure
- Nearest-neighbor vector of a pivot p: (r_1, r_2, ..., r_T), where each r_i is the distance from p to its i-th NN and T is the length of the vector.
- Nearest-neighbor histogram: the collection of the m pivots with their NN vectors.
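A minimal sketch of this structure (the class and field names are ours):

```python
from dataclasses import dataclass

@dataclass
class NNH:
    # m pivots (not part of the database), each with a sorted vector of
    # length T: vectors[i][j-1] = distance from pivots[i] to its j-th NN.
    pivots: list
    vectors: list

    def H(self, i, k):
        # H(p_i, k): distance from pivot i to its k-th nearest neighbor (k <= T).
        return self.vectors[i][k - 1]
```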

10  Outline
- Main idea
- Utilizing NNH in a search (k-NN, join)
- Construction and incremental maintenance
- Experiments
- Related work

11  Estimate NN distance for a query object
- The NNH does not give exact NN information for an arbitrary object.
- But we can estimate an upper bound δ_q^est on the k-NN distance of a query object q, using the triangle inequality.

12  Estimate NN distance for a query object (cont'd)
- Apply the triangle inequality to all pivots and take the best bound:
    δ_q^est = min over i = 1..m of ( d(q, p_i) + H(p_i, k) )
- Why this is an upper bound: at least k objects lie within distance H(p_i, k) of pivot p_i, and each of them is within d(q, p_i) + H(p_i, k) of q.
- Complexity: O(m).
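As code, the estimate is a single pass over the pivots (reusing the NNH sketch above):

```python
from math import dist

def estimate_knn_distance(q, nnh, k):
    # Upper bound on q's k-NN distance:
    #   min over pivots p_i of  d(q, p_i) + H(p_i, k)
    return min(dist(q, p) + nnh.H(i, k) for i, p in enumerate(nnh.pivots))
```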

13  Utilizing estimates in NN search
- More pruning: prune an MBR if MINDIST(q, mbr) > δ_q^est, since the MBR then cannot contain any of q's k nearest neighbors.
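In the branch-and-bound search sketched earlier, this becomes one comparison before a node is enqueued (a sketch reusing mindist and estimate_knn_distance from above, not the authors' code):

```python
def prune_node(q, mbr, est):
    # Skip this MBR entirely: even its closest possible point is farther
    # from q than the estimated k-NN distance, so no k-NN can be inside.
    return mindist(q, mbr) > est

# Usage inside nn_search, before pushing a child node:
#   est = estimate_knn_distance(q, nnh, k)
#   if not prune_node(q, child.mbr, est):
#       heapq.heappush(queue, (mindist(q, child.mbr), tiebreak, child))
```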

14  Utilizing estimates in NN join
- k-NN join: for each object o_1 in D1, find its k nearest neighbors in D2.
- Traverse the two trees top down; keep a priority queue of node pairs.

15  Utilizing estimates in NN join (cont'd)
- Construct an NNH for D2.
- For each object o_1 in D1, keep its estimated NN radius δ_o1^est, computed using the NNH of D2.
- As in a k-NN query, ignore an MBR of D2 for o_1 if MINDIST(o_1, mbr) > δ_o1^est.

16  More powerful: prune MBR pairs

17  Prune MBR pairs (cont'd)
- Prune the pair (mbr1, mbr2) if MINDIST(mbr1, mbr2) > max over objects o_1 in mbr1 of δ_o1^est: no object in mbr1 can then find any of its k nearest neighbors inside mbr2.
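A sketch of this pair-level test. It assumes each node of the D1 tree carries the maximum estimated radius over the objects it contains (an aggregate we introduce here to express the condition); mindist_rects is the standard minimum distance between two rectangles:

```python
def mindist_rects(r1, r2):
    # Minimum possible distance between two axis-aligned MBRs,
    # each given as (lows, highs) per dimension.
    s = 0.0
    for lo1, hi1, lo2, hi2 in zip(r1[0], r1[1], r2[0], r2[1]):
        gap = max(lo2 - hi1, 0.0, lo1 - hi2)   # 0 if the ranges overlap
        s += gap * gap
    return s ** 0.5

def prune_pair(mbr1, mbr2, max_est1):
    # max_est1: largest estimated k-NN radius of any D1 object in mbr1.
    # If even the closest points of the two MBRs are farther apart than
    # that radius, mbr2 cannot hold a k-NN for any object in mbr1.
    return mindist_rects(mbr1, mbr2) > max_est1
```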

18  Outline
- Main idea
- Utilizing NNH in a search (k-NN, join)
- Construction and incremental maintenance
- Experiments
- Related work

19  NNH Construction
- If the m pivots have already been selected: just run a k-NN query for each pivot to fill the NNH.
  - Cost: m k-NN queries, done offline.
- The important part is selecting the pivots:
  - size-constraint construction;
  - error-constraint construction (see paper).

20  # of pivots
- The number of pivots m determines:
  - the storage size;
  - the initial construction cost;
  - the incremental-maintenance cost.
- Goal: choose the m "best" pivots -> size-constraint NNH construction.

21  Size-constraint NNH construction
- Given m (the number of pivots), assume:
  - query objects are drawn from the database D;
  - H(p_i, k) doesn't vary too much.
- Goal: find pivots p_1, p_2, ..., p_m that minimize the sum of the objects' distances to their closest pivots.
- This is a clustering problem with many algorithms available; we use k-means for its simplicity and efficiency (a sketch follows).
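A construction sketch under the stated assumptions, reusing the NNH class from earlier. It uses scikit-learn's KMeans as one concrete clustering choice (the slides do not prescribe a library), and a brute-force T-NN scan per pivot in place of the tree-based k-NN queries of the previous slide:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_nnh(data, m, T):
    # Pivots = k-means centroids (not database objects, matching slide 8);
    # each pivot's vector holds its distances to the T nearest objects.
    data = np.asarray(data, dtype=float)
    pivots = KMeans(n_clusters=m, n_init=10).fit(data).cluster_centers_
    vectors = [np.sort(np.linalg.norm(data - p, axis=1))[:T].tolist()
               for p in pivots]
    return NNH([p.tolist() for p in pivots], vectors)
```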

22  Incremental Maintenance
- How do we update the NNH when objects are inserted or deleted?
- We need to "shift" the entries of each NN vector.
- Associate a valid length E_i with each NN vector.
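The slide only names the idea; the following is one plausible reading of it, not the paper's exact algorithm. An insertion can only shorten NN distances, so it can be folded into each sorted vector directly; a deletion may invalidate tail entries, so instead of recomputing we shrink a per-vector valid length E_i:

```python
import bisect
from math import dist

def on_insert(nnh, o):
    # A new object can only make NN distances smaller: slide its distance
    # into each pivot's sorted vector if it beats the current tail.
    for i, p in enumerate(nnh.pivots):
        d = dist(p, o)
        vec = nnh.vectors[i]
        if d < vec[-1]:
            bisect.insort(vec, d)
            vec.pop()                      # keep the vector at length T

def on_delete(nnh, o, valid_len):
    # Entries closer than the deleted object are untouched; entries at or
    # beyond its position may now be wrong, so cap the valid length E_i.
    for i, p in enumerate(nnh.pivots):
        j = bisect.bisect_left(nnh.vectors[i], dist(p, o))
        valid_len[i] = min(valid_len[i], j)
```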

23  Outline
- Main idea
- Utilizing NNH in a search (k-NN, join)
- Construction and incremental maintenance
- Experiments
- Related work

24  Experiments
- Datasets:
  - Corel image database: 60,000 images, each represented by a 32-dimensional float vector.
  - Time-series data from AT&T: similar trends, so we report results for the Corel dataset.
- Test bed:
  - PC: 1.5 GHz Athlon, 512 MB memory, 80 GB hard disk, Windows 2000;
  - GNU C++ under CYGWIN.

25  Goal
- Is the pruning using NNH estimates powerful?
  - k-NN queries
  - NN-join queries
- Is it "cheap" to have such a structure?
  - storage
  - initial construction
  - incremental maintenance

26  Improvement in k-NN search
- Ran the k-means algorithm to generate 400 pivots for the 60K objects and constructed the NNH.
- Performed k-NN queries on 100 randomly selected query objects.
- Used the queue size (max and average) to measure memory usage.

27  Reduced Memory Requirement

28  Reduced running time

29  Effects of different # of pivots

30  Join: Reduced Memory Requirement

31  Join: Reduced running time

32  Join: Running time for different data sizes

33  Cost/Benefit of NNH

Pivot # (m):                    10     50    100    150    200    250    300    350    400
Construction time (sec):       0.7   3.59    6.6    9.4   11.5   13.7   15.7   17.8   20.4
Storage space (kB):              2     10     20     30     40     50     60     70     80
Incr. maintenance time (ms):    ~0     ~0     ~0     ~0     ~0     ~0     ~0     ~0     ~0
Improved q-size, kNN (%):       40     30     28     24                   23     20     18
Improved q-size, join (%):      45     34     28     26            25     24            22

For 60,000 float vectors (32-d). "~0" means almost zero.

34  Conclusion
- NNH is an efficient, effective approach to improving NN-search performance.
- It can be easily embedded into current implementations of NN algorithms.
- It can be efficiently constructed and maintained.
- It offers substantial performance advantages.

35  Related work
- Summary histograms, e.g., [Jagadish et al., VLDB 1998], [Matias et al., VLDB 2000]:
  - objective: approximating frequency values.
- NN search algorithms: many have been developed, and many of them can benefit from NNH.
- Algorithms based on "pivots/foci/anchors", e.g., Omni [Filho et al., ICDE 2001], vantage objects [Vleugels et al., VIIS 1999], M-trees [Ciaccia et al., VLDB 1997]:
  - they choose pivots far from each other (to capture the "intrinsic dimensionality");
  - NNH pivots instead depend on how clustered the objects are; experiments show the differences.

36  Work conducted in the Flamingo Project on Data Cleansing at UC Irvine

