1 Nearest Neighbor Queries Sung-hsun Su April 12, 2001
[1] Nick Roussopoulos, Stephen Kelley, Frederic Vincent: Nearest Neighbor Queries. SIGMOD Conference 1995.
[2] G. R. Hjaltason and H. Samet: Distance Browsing in Spatial Databases. ACM Transactions on Database Systems 24(2), June 1999.

2 Outline
Introduction to Nearest Neighbor Queries
Spatial data structure – R-Tree
K-NN Algorithm in [1]
Incremental NN Algorithm in [2]

3 The Need for NN Queries
Used when data have a spatial property
Examples: Geographical Information Systems, astronomical data
Spatial predicates:
Find the k nearest stars from the Earth
Find the k nearest stars that are at least 10 LY away
Find the nearest gas station to the east
Find the furthest TCAT bus stop

4 Difficulties in NN Queries
Need to scan the whole table if the data are unordered
Spatial data structures:
1D – simply use a B+ tree or another sorted data structure
2D or higher dimensions – can a single sorted structure serve all queries? No: sorting stars by distance from the Earth, for example, only answers queries with the Earth as the center.

5 Data Structure – First Trial
Need a more complex data structure
First trial – fixed grids: partition the space evenly into rectangles, cubes, … (see the sketch below)
Search the neighboring grid cells first
The distance to objects within a cell is bounded
Disadvantages?
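A minimal sketch of the fixed-grid idea, in Python and under assumptions of my own (2D points, a hypothetical cell size of 10): points are bucketed into square cells and a query scans its own cell plus the eight neighbors. A complete search would keep expanding outward ring by ring when those cells hold no candidate.

```python
from collections import defaultdict
from math import dist, floor

CELL = 10.0  # hypothetical grid cell size (an assumption, not from the slides)

def build_grid(points):
    """Bucket 2D points into fixed-size square cells keyed by integer coordinates."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(floor(x / CELL), floor(y / CELL))].append((x, y))
    return grid

def candidates_near(grid, q):
    """Collect points from the query's cell and its eight neighboring cells."""
    cx, cy = floor(q[0] / CELL), floor(q[1] / CELL)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(grid.get((cx + dx, cy + dy), []))
    return out

points = [(3, 4), (12, 7), (25, 30), (11, 9)]
grid = build_grid(points)
q = (10, 8)
print(min(candidates_near(grid, q), key=lambda p: dist(p, q)))  # (11, 9)
```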

6 Disadvantages of Fixed Grids
May still access many additional objects
Skewed data distribution:
Grid size too large – inefficient search
Grid size too small – waste of storage
Need a hierarchical, scalable data structure
(A globular cluster can contain up to a million stars in a small region of space.)

7 Spatial Tree Structures
Make it possible to handle clustered data
Some trees provide a balanced structure
Insert/split dynamically
Good tree construction enables efficient search
Spatial trees: K-D Tree, R-Tree, LSD-Tree, Quad-Tree, etc.
Search algorithms will be described later

8 A Glance at the Algorithms
[1]: k-NN query – applies a modified DFS on the R-Tree
[2]: Incremental NN query – a priority-first (best-first) search on various kinds of spatial tree structures
Incremental distance browsing
The authors apply it to the R-Tree as an example

9 R-Tree Introduction
Balanced structure, like the B+ Tree
Each node is associated with an MBR (Minimal Bounding Rectangle)
A node's MBR minimally bounds all of its descendants
Non-leaf entry: (RECT, pointer to a child node)
Leaf entry: (RECT, pointer to an object)
The branching factor is chosen so that a node fits in a disk block or page
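A rough sketch of the node layout described above, using simple Python classes of my own rather than anything from [1] or [2]: every entry carries an MBR; non-leaf entries point to child nodes, leaf entries point to data objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[float, ...]

@dataclass
class Rect:
    """MBR given by its low corner S and high corner T (s_k <= t_k on every axis)."""
    s: Point
    t: Point

@dataclass
class Entry:
    rect: Rect                      # MBR of everything reachable through this entry
    child: Union["Node", object]    # a child Node (non-leaf) or a data object (leaf)

@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)  # at most the branching factor
```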

10 Minimal Bounding Rectangle
In higher dimensions: a cuboid, cube, or hyper-rectangle

11 R-Tree Example
[Figure: an example R-Tree over a set of objects — root entries A, B, C; child MBRs D through K bounding the underlying objects.]

12 Good and Bad R-Trees
Bad R-Tree: contains much dead space
Good R-Tree: minimizes overlapped area, so each MBR estimates its objects better
Construction affects the quality of an R-Tree; we won't go into the details.

13 Algorithms in [1]: Finding K Nearest Neighbors
Two metrics introduced:
MINDIST (optimistic)
MINMAXDIST (pessimistic)
Pruning
DFS search

14 Space and Rectangle
n-dimensional Euclidean space: E(n)
A rectangle is defined as R = (S, T), where S = (s1, s2, …, sn) and T = (t1, t2, …, tn) are the two endpoints of a diagonal, with sk ≤ tk for all k = 1 … n
This representation just simplifies the computations

15 MINDIST (Optimistic)
MINDIST(RECT, q): the shortest distance from RECT to the query point q
For every descendant (node or object) inside RECT, its distance to q is greater than or equal to MINDIST(RECT, q)
This provides a lower bound on the distance from q to the objects in RECT
The square of the distance is used as the metric (it avoids square roots)

16 Calculation of MINDIST
MINDIST(P, R) = Σk=1..n |pk − rk|², where
rk = sk if pk < sk
rk = tk if pk > tk
rk = pk otherwise (pk lies between sk and tk)
[Figure: a 2D example — query point P = (p1, p2) lies outside the rectangle with corners S = (s1, s2) and T = (t1, t2); the closest rectangle point to P is (r1, r2) = (t1, p2).]
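The per-axis clamp above translates directly into code; a small sketch with rectangles given by their low corner S and high corner T (names mine, squared distances as in [1]).

```python
def mindist_sq(p, s, t):
    """Squared MINDIST from point p to the rectangle with corners s (low) and t (high)."""
    total = 0.0
    for pk, sk, tk in zip(p, s, t):
        rk = sk if pk < sk else (tk if pk > tk else pk)  # nearest coordinate within [sk, tk]
        total += (pk - rk) ** 2
    return total

print(mindist_sq((0, 0), (1, 1), (3, 4)))  # 2.0 -- the nearest rectangle point is the corner (1, 1)
```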

17 MINMAXDIST (Pessimistic)
MBR property: every face (edge in 2D, rectangle in 3D, hyper-face in higher dimensions) of an MBR touches at least one point of some spatial object in the DB.
MINMAXDIST: for each face, compute the maximum distance from q to that face, then take the minimum over the faces.
This is an upper bound on the distance to the nearest object in the MBR: at least one object in the MBR lies at distance less than or equal to MINMAXDIST.

18 Illustration of MINMAXDIST
[Figure: a 2D example showing MINDIST and MINMAXDIST from a query point (p1, p2) to a rectangle with corners (s1, s2) and (t1, t2); MINDIST reaches the nearest edge point (t1, p2), while MINMAXDIST reaches the far end of the nearest face.]

19 Calculation of MINMAXDIST
Per [1]: MINMAXDIST(P, R) = min over k of ( |pk − rmk|² + Σi≠k |pi − rMi|² ), where rmk is the endpoint of [sk, tk] nearer to pk and rMi is the endpoint of [si, ti] farther from pi.
Can be done in O(n).
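A sketch of the O(n) computation as I read the formula in [1]: compute the far-corner contribution once, then for each axis swap in the near face along that axis; verify against the paper before relying on it.

```python
def minmaxdist_sq(p, s, t):
    """Squared MINMAXDIST from point p to the rectangle [s, t] (my reading of [1])."""
    n = len(p)
    # rm[k]: endpoint of [s[k], t[k]] nearer to p[k]; rM[k]: the farther endpoint.
    rm = [s[k] if p[k] <= (s[k] + t[k]) / 2 else t[k] for k in range(n)]
    rM = [s[k] if p[k] >= (s[k] + t[k]) / 2 else t[k] for k in range(n)]
    far_sq = [(p[k] - rM[k]) ** 2 for k in range(n)]
    total_far = sum(far_sq)
    # For each axis k: nearer face along k, farther corner coordinate on every other axis.
    return min((p[k] - rm[k]) ** 2 + (total_far - far_sq[k]) for k in range(n))

print(minmaxdist_sq((0, 0), (1, 1), (3, 4)))  # 10 -- nearer face y=1, far corner x=3: 1 + 9
```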

20 Pruning
MINDIST(M) > MINMAXDIST(M′): M can be pruned
Distance(O) > MINMAXDIST(M′): object O can be discarded
MINDIST(M) > Distance(O): M can be pruned
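The three rules written as boolean predicates — a minimal sketch assuming all arguments are (squared) distances from the query point, computed with MINDIST, MINMAXDIST, or the actual object distance as appropriate; the names are mine.

```python
def prune_mbr_by_mbr(mindist_m, minmaxdist_m2):
    return mindist_m > minmaxdist_m2      # M cannot hold the NN: M' guarantees something closer

def discard_object_by_mbr(dist_o, minmaxdist_m2):
    return dist_o > minmaxdist_m2         # O cannot be the NN: M' guarantees something closer

def prune_mbr_by_object(mindist_m, dist_o):
    return mindist_m > dist_o             # M cannot hold anything closer than the found object O
```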

21 DFS Search on R-Tree
Traversal: DFS
Expanding a non-leaf node: order its children by one of the metrics (MINDIST or MINMAXDIST); prune before/after visiting each child.
Expanding a leaf node: compare its objects to the nearest neighbor found so far; replace it if a new object is closer.
Not a straightforward approach – it makes only local decisions, so it may visit non-optimal objects before the NN is found.
Best-first search, by contrast, is simple and never visits non-optimal nodes.

22 Extending to K-NN
Maintain the k nearest neighbors found so far.
Use the distance to the k-th (furthest) of them for pruning MBRs/objects.
Blocking algorithm – no pipelining.
There is no global queue of pending nodes: each node/object is either unexplored, kept among the current k-NN, or discarded.
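A compact, illustrative sketch of the branch-and-bound k-NN DFS (not a faithful re-implementation of [1]: it orders and prunes with MINDIST only and omits MINMAXDIST-based pruning). Nodes are plain tuples, objects are stored as degenerate rectangles, and the current k best are kept in a max-heap via negated distances.

```python
import heapq
from math import inf

def mindist_sq(p, rect):
    """Squared MINDIST from point p to rect = (low_corner, high_corner)."""
    s, t = rect
    return sum(max(sk - pk, 0.0, pk - tk) ** 2 for pk, sk, tk in zip(p, s, t))

def knn_dfs(node, q, k, best=None):
    """Branch-and-bound DFS. A node is ('leaf', [(rect, obj), ...]) or
    ('inner', [(rect, child_node), ...]). `best` is a max-heap of the k
    nearest objects found so far, stored as (-dist_sq, obj)."""
    if best is None:
        best = []
    kth = -best[0][0] if len(best) == k else inf   # current pruning distance
    kind, entries = node
    if kind == 'leaf':
        for rect, obj in entries:
            d = mindist_sq(q, rect)                # object stored as a degenerate rect
            if len(best) < k or d < kth:
                heapq.heappush(best, (-d, obj))
                if len(best) > k:
                    heapq.heappop(best)            # drop the current (k+1)-th neighbor
                kth = -best[0][0] if len(best) == k else inf
    else:
        # Visit children in MINDIST order; stop once no child can improve the k-NN.
        for rect, child in sorted(entries, key=lambda e: mindist_sq(q, e[0])):
            if len(best) == k and mindist_sq(q, rect) > kth:
                break
            knn_dfs(child, q, k, best)
            kth = -best[0][0] if len(best) == k else inf
    return sorted((-nd, o) for nd, o in best)

leaf1 = ('leaf', [(((1, 1), (1, 1)), 'a'), (((4, 4), (4, 4)), 'b')])
leaf2 = ('leaf', [(((9, 0), (9, 0)), 'c')])
root = ('inner', [(((1, 1), (4, 4)), leaf1), (((9, 0), (9, 0)), leaf2)])
print(knn_dfs(root, (0, 0), 2))   # [(2, 'a'), (32, 'b')]
```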

23 Experimental Results
Real-world data: TIGER, satellite data
Synthetic data
R-Tree construction (branching factor = 50):
Presort the data by Hilbert number (Hilbert ordering)
Apply a packing technique
Performance measure: number of pages accessed

24 Experimental Results (Cont’d)
Cost grows linearly with k (the number of neighbors to find), but slowly.
Cost also grows linearly with the height of the tree, i.e., logarithmically in the size of the data set.
MINDIST ordering outperforms MINMAXDIST ordering: about 20% faster in general, 30% on dense data sets.
Reason: the R-Tree is packed very well; with little or no dead space, MINDIST approaches the actual minimal distance.

25 Problems with this Algorithm
Nodes/objects are not visited in order of distance → blocking
May access non-optimal objects, then discard/prune them → not incremental
Need to know k in advance; no distance browsing; difficult to combine with other predicates.

26 Distance Browsing
To browse objects in distance order
Example: find the k nearest stars with distance > 10 LY
How can algorithm [1] be applied to this query?
Select the stars with distance > 10 LY first, materialize the result, and then build another R-Tree over it
What if the selectivity is very high (most stars qualify)?

27 Solution to Distance Browsing
Very low selectivity (e.g., nearest city with a population of 2M+):
Perform the selection first, build an R-Tree, then run k-NN
Otherwise:
Need incremental k-NN and pipeline its results into the selection operator
Can stop at any time
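A sketch of how the pipelining could look, assuming a hypothetical incremental_nn(q) generator that yields (distance, object) pairs in increasing distance order (the role played by the algorithm of [2]); the selection is applied lazily and iteration stops as soon as k matches are found.

```python
from itertools import islice

def k_nearest_matching(q, k, predicate, incremental_nn):
    """Pull neighbors in distance order, keep those passing the predicate, stop at k."""
    matches = (obj for _, obj in incremental_nn(q) if predicate(obj))
    return list(islice(matches, k))

# Hypothetical usage: the 5 nearest stars that are more than 10 light-years away.
# result = k_nearest_matching(earth, 5, lambda star: star.distance_ly > 10, incremental_nn)
```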

28 Overview of the Algorithm in [2]
A generic algorithm for different spatial data structures and different distance definitions.
Uses a priority queue to perform best-first search keyed on the minimal distance (optimistic).
Ensures that no object/node is visited before another, closer object/node.

29 Search Algorithm
Always expand the nearest node or object in the priority queue. Treat objects as special cases of nodes. When expanding a node, compute each child's distance from the query point and add it to the priority queue. When expanding an object, just report it and continue.

30 Requirements for Tree/Distance
The tree and distance function must conform to the following rules:
A node/object may have more than one parent; there may be duplicate object pointers in the tree.
The region covered by a node must be completely contained within the union of its parents' regions.
Consistent distance: for every query point q and node/object n, at least one of its parents n′ satisfies d(q, n′) ≤ d(q, n). (This ensures nodes are expanded in order.)

31 Remarks on Tree/Distance
Applicable trees: Quad-tree, R-Tree, R+-Tree, LSD-Tree, K-D-B Tree, etc.
Applicable distance measures: Euclidean, Manhattan, Chessboard, etc.
Almost all spatial trees have no duplicate nodes: a node is fully contained in its parent.
Some trees allow duplicate object entries; duplicates must then be detected and removed. The R-Tree has no duplicates.

32 Example
[Figure: the small tree used for the walk-through — Root has children A and B; A contains C and D, B contains E and F; the leaves hold the objects Triangle (under C), Circle (under D), Rectangle (under E), and Moon (under F).]
Distances from the query point:
Node/Obj   Distance
Root       0
A          1
B          7
C          10
D          1
E          8
F          12
Circle     1
Rectangle  8
Triangle   13
Moon       14

33 Order of Expansion
R=0: Expand Root → queue { A[1], B[7] }
R=1: Expand A → { D[1], B[7], C[10] }
R=1: Expand D → { Circle[1], B[7], C[10] }
R=1: Report Circle → { B[7], C[10] }
R=7: Expand B → { E[8], C[10], F[12] }
R=8: Expand E → { Rectangle[8], C[10], F[12] }
R=8: Report Rectangle → { C[10], F[12] }
R=10: Expand C → { F[12], Triangle[13] }
R=12: Expand F → { Triangle[13], Moon[14] }
R=13: Report Triangle → { Moon[14] }
R=14: Report Moon → { }

34 Observations
All nodes/objects intersecting the search region (a circle of radius R around the query point) have been expanded, and their children placed in the queue. All nodes/objects completely inside the search circle have already been taken off the queue. All nodes/objects completely outside the search circle are never examined. This minimizes the number of nodes/objects visited.

35 Pseudocode
Queue = NewPriorityQueue()
EnQueue(Queue, Root, 0)
While NotEmpty(Queue):
    Element = DeQueue(Queue)
    If IsObject(Element):
        /* remove duplicates here, for structures that can contain them */
        Report(Element)
    Else If IsLeaf(Element):
        For each child object o of Element:
            If Dist(o, Q) >= Dist(Element, Q):   /* this comparison is not needed for the R-Tree */
                EnQueue(Queue, o, Dist(o, Q))
    Else:   /* non-leaf node */
        For each child node n of Element:
            EnQueue(Queue, n, Dist(n, Q))
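A runnable Python rendering of the pseudocode above, under my own simplifications: nodes are plain tuples, MINDIST to an MBR is used as the priority, and the duplicate-removal step is dropped since the R-Tree has no duplicates.

```python
import heapq
from itertools import count

def mindist_sq(p, rect):
    """Squared distance from point p to rect = (low, high); 0 if p lies inside."""
    s, t = rect
    return sum(max(sk - pk, 0.0, pk - tk) ** 2 for pk, sk, tk in zip(p, s, t))

def incremental_nn(root, q):
    """Best-first search: yield (dist_sq, object) in increasing distance order."""
    tie = count()                            # tie-breaker so the heap never compares nodes
    queue = [(0.0, next(tie), 'node', root)]
    while queue:
        d, _, kind, elem = heapq.heappop(queue)
        if kind == 'object':
            yield d, elem                    # report; the caller decides when to stop
            continue
        node_kind, entries = elem            # ('leaf' | 'inner', [(rect, child), ...])
        child_kind = 'object' if node_kind == 'leaf' else 'node'
        for rect, child in entries:
            heapq.heappush(queue, (mindist_sq(q, rect), next(tie), child_kind, child))

# Toy usage: a tiny two-leaf tree, query at the origin.
leaf1 = ('leaf', [(((1, 1), (1, 1)), 'a'), (((4, 4), (4, 4)), 'b')])
leaf2 = ('leaf', [(((9, 0), (9, 0)), 'c')])
root = ('inner', [(((1, 1), (4, 4)), leaf1), (((9, 0), (9, 0)), leaf2)])
for d, obj in incremental_nn(root, (0, 0)):
    print(obj, d)                            # a 2, b 32, c 81
```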

36 Variants
K furthest: use MAXDIST as the key and replace ≤ by ≥.
Distance selection: e.g., select all stars between 15 LY and 20 LY away; prune unqualified nodes.
Pseudocode for the search algorithm combining these two extensions: Figure 5 in [2].
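This is not the paper's Figure 5, but a consumer-side sketch of the distance-selection variant: because the incremental search reports objects in increasing distance order, the range filter can be applied outside the traversal and iteration can stop at the upper bound (pushing the bounds into the traversal, as [2] does, additionally prunes nodes).

```python
def distance_selection(root, q, lo_sq, hi_sq, incremental_nn):
    """Report objects with lo_sq <= dist_sq <= hi_sq, built on top of the
    incremental search: results arrive in distance order, so we can stop early."""
    for d, obj in incremental_nn(root, q):
        if d > hi_sq:
            break                  # everything still in the queue is even farther
        if d >= lo_sq:
            yield d, obj

# Hypothetical usage with the incremental_nn generator sketched earlier:
# for d, star in distance_selection(sky_rtree, earth, 15**2, 20**2, incremental_nn):
#     print(star, d)
```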

37 Implementation of the Priority Queue
Enough memory: a heap (min-heap/max-heap)
Not enough memory: use a B+ Tree (sorted → keep the nodes with smaller distances in memory)
Hybrid scheme: divide the queue into 3 tiers. Tier 1 is a small in-memory heap (fast access). Tier 2 is divided into several sections; the nodes in each section form an unordered in-memory bucket, and the first bucket is moved into Tier 1 when Tier 1 empties. Tier 3 is stored on disk and moved into memory when Tiers 1 and 2 are empty.

38 Theoretical Analysis
Assumptions: uniform distribution, 2D.
Use a circular search region of radius r for the analysis.
k is proportional to the area of the search region (πr²).
The number of leaf nodes in the priority queue is proportional to the circumference of the search region (2πr), i.e., to √k.
The number of leaf nodes accessed is therefore proportional to k + √k, and the total number of node accesses grows at the same rate.
For the non-uniform 2D case, the measured behavior is very close to this result.

39 Experimental Results
TIGER/Line file (≈17,421 segments); synthetic data (unbounded stream of random segments)
Construction: R*-Tree
Distance browsing: Inc-NN is much faster than repeated k-NN, and the advantage grows as more neighbors are retrieved
Exact k-NN queries: Inc-NN is 10–20% faster
Scalability: close to the theoretical result
Very large k: the k-NN algorithm cannot hold all k neighbors in memory

40 Conclusion
Inc-NN outperforms other k-NN algorithms.
Inc-NN enables distance browsing.
The number of node accesses (2D, uniform data) is on the order of k + √k.
Future work: compare the algorithm on different spatial structures; investigate its behavior on very large data sets where the priority queue cannot fit into memory.

