1 Nearest Neighbor Queries Sung-hsun Su April 12, 2001
[1] Nick Roussopoulos, Stephen Kelley, Frederic Vincent: Nearest Neighbor Queries. SIGMOD Conference 1995.
[2] G. R. Hjaltason and H. Samet: Distance Browsing in Spatial Databases. ACM Transactions on Database Systems 24(2), June 1999.

2 Outline
Introduction to Nearest Neighbor Queries
Spatial data structure – R-Tree
K-NN Algorithm in [1]
Incremental NN Algorithm in [2]

3 The Need for NN Queries
Used when data have a spatial property
Examples: Geographical Information Systems, astronomical data
Spatial predicates:
Find the k nearest stars from the Earth
Find the k nearest stars that are at least 10 LY away
Find the nearest gas station to the east
Find the furthest TCAT bus stop

4 Difficulties in NN Queries
Need to scan the whole table if the data are unordered
Spatial data structures:
1D – simply use a B+ tree or another sorted data structure
2D or higher dimensions – can a single sorted structure serve all queries? No: sorting stars by distance from the Earth, for example, only answers queries with the Earth as the center.

5 Data Structure – First Trial
Need a more complex data structure
First trial – fixed grids: partition the space evenly into rectangles, cubes, … (see the sketch below)
Search the neighboring grid cells first
The distance to objects within a cell is bounded
Disadvantages?
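A minimal sketch of the fixed-grid idea, in Python and under assumptions of my own (2D points, a hypothetical cell size of 10): points are bucketed into square cells and a query scans its own cell plus the eight neighbors. A complete search would keep expanding outward ring by ring when those cells hold no candidate.

```python
from collections import defaultdict
from math import dist, floor

CELL = 10.0  # hypothetical grid cell size (an assumption, not from the slides)

def build_grid(points):
    """Bucket 2D points into fixed-size square cells keyed by integer coordinates."""
    grid = defaultdict(list)
    for x, y in points:
        grid[(floor(x / CELL), floor(y / CELL))].append((x, y))
    return grid

def candidates_near(grid, q):
    """Collect points from the query's cell and its eight neighboring cells."""
    cx, cy = floor(q[0] / CELL), floor(q[1] / CELL)
    out = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            out.extend(grid.get((cx + dx, cy + dy), []))
    return out

points = [(3, 4), (12, 7), (25, 30), (11, 9)]
grid = build_grid(points)
q = (10, 8)
print(min(candidates_near(grid, q), key=lambda p: dist(p, q)))  # (11, 9)
```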

6 Disadvantages of Fixed Grids
May still access many additional objects
Skewed data distribution:
Grid size too large – inefficient search
Grid size too small – waste of storage
Need a hierarchical, scalable data structure
(A globular cluster can contain up to a million stars in a small region of space.)

7 Spatial Tree Structures
Make it possible to handle clustered data
Some trees provide a balanced structure
Insert/split dynamically
Good tree construction enables efficient search
Spatial trees: K-D Tree, R-Tree, LSD-Tree, Quad-Tree, etc.
Search algorithms will be described later

8 A Glance at the Algorithms
[1]: k-NN query – applies a modified DFS on the R-Tree
[2]: Incremental NN query – a priority-first (best-first) search on various kinds of spatial tree structures
Incremental distance browsing
The authors apply it to the R-Tree as an example

9 R-Tree Introduction
Balanced structure, like the B+ Tree
Each node is associated with an MBR (Minimal Bounding Rectangle)
A node's MBR minimally bounds all of its descendants
Non-leaf entry: (RECT, pointer to a child node)
Leaf entry: (RECT, pointer to an object)
The branching factor is chosen so that a node fits in a disk block or page
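A rough sketch of the node layout described above, using simple Python classes of my own rather than anything from [1] or [2]: every entry carries an MBR; non-leaf entries point to child nodes, leaf entries point to data objects.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

Point = Tuple[float, ...]

@dataclass
class Rect:
    """MBR given by its low corner S and high corner T (s_k <= t_k on every axis)."""
    s: Point
    t: Point

@dataclass
class Entry:
    rect: Rect                      # MBR of everything reachable through this entry
    child: Union["Node", object]    # a child Node (non-leaf) or a data object (leaf)

@dataclass
class Node:
    is_leaf: bool
    entries: List[Entry] = field(default_factory=list)  # at most the branching factor
```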

10 Minimal Bounding Rectangle
In higher dimensions: a cuboid, cube, or hyper-rectangle

11 R-Tree Example
[Figure: an example R-Tree over a set of objects — root entries A, B, C; child MBRs D through K bounding the underlying objects.]

12 Good and Bad R-Trees
Bad R-Tree: contains much dead space
Good R-Tree: minimizes overlapped area, so each MBR estimates its objects better
Construction affects the quality of an R-Tree; we won't go into the details.

13 Algorithms in [1]: Finding K Nearest Neighbors
Two metrics introduced:
MINDIST (optimistic)
MINMAXDIST (pessimistic)
Pruning
DFS search

14 Space and Rectangle
n-dimensional Euclidean space: E(n)
A rectangle is defined as R = (S, T), where S = (s1, s2, …, sn) and T = (t1, t2, …, tn) are the two endpoints of a diagonal, with sk ≤ tk for all k = 1 … n
This representation just simplifies the computations

15 MINDIST (Optimistic)
MINDIST(RECT, q): the shortest distance from RECT to the query point q
For every descendant (node or object) inside RECT, its distance to q is greater than or equal to MINDIST(RECT, q)
This provides a lower bound on the distance from q to the objects in RECT
The square of the distance is used as the metric (it avoids square roots)

16 Calculation of MINDIST
MINDIST(P, R) = Σk=1..n |pk − rk|², where
rk = sk if pk < sk
rk = tk if pk > tk
rk = pk otherwise (pk lies between sk and tk)
[Figure: a 2D example — query point P = (p1, p2) lies outside the rectangle with corners S = (s1, s2) and T = (t1, t2); the closest rectangle point to P is (r1, r2) = (t1, p2).]
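The per-axis clamp above translates directly into code; a small sketch with rectangles given by their low corner S and high corner T (names mine, squared distances as in [1]).

```python
def mindist_sq(p, s, t):
    """Squared MINDIST from point p to the rectangle with corners s (low) and t (high)."""
    total = 0.0
    for pk, sk, tk in zip(p, s, t):
        rk = sk if pk < sk else (tk if pk > tk else pk)  # nearest coordinate within [sk, tk]
        total += (pk - rk) ** 2
    return total

print(mindist_sq((0, 0), (1, 1), (3, 4)))  # 2.0 -- the nearest rectangle point is the corner (1, 1)
```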

17 MINMAXDIST (Pessimistic)
MBR property: every face (edge in 2D, rectangle in 3D, hyper-face in higher dimensions) of an MBR touches at least one point of some spatial object in the DB.
MINMAXDIST: for each face, compute the maximum distance from q to that face, then take the minimum over the faces.
This is an upper bound on the distance to the nearest object in the MBR: at least one object in the MBR lies at distance less than or equal to MINMAXDIST.

18 Illustration of MINMAXDIST
[Figure: a 2D example showing MINDIST and MINMAXDIST from a query point (p1, p2) to a rectangle with corners (s1, s2) and (t1, t2); MINDIST reaches the nearest edge point (t1, p2), while MINMAXDIST reaches the far end of the nearest face.]

19 Calculation of MINMAXDIST
Per [1]: MINMAXDIST(P, R) = min over k of ( |pk − rmk|² + Σi≠k |pi − rMi|² ), where rmk is the endpoint of [sk, tk] nearer to pk and rMi is the endpoint of [si, ti] farther from pi.
Can be done in O(n).
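A sketch of the O(n) computation as I read the formula in [1]: compute the far-corner contribution once, then for each axis swap in the near face along that axis; verify against the paper before relying on it.

```python
def minmaxdist_sq(p, s, t):
    """Squared MINMAXDIST from point p to the rectangle [s, t] (my reading of [1])."""
    n = len(p)
    # rm[k]: endpoint of [s[k], t[k]] nearer to p[k]; rM[k]: the farther endpoint.
    rm = [s[k] if p[k] <= (s[k] + t[k]) / 2 else t[k] for k in range(n)]
    rM = [s[k] if p[k] >= (s[k] + t[k]) / 2 else t[k] for k in range(n)]
    far_sq = [(p[k] - rM[k]) ** 2 for k in range(n)]
    total_far = sum(far_sq)
    # For each axis k: nearer face along k, farther corner coordinate on every other axis.
    return min((p[k] - rm[k]) ** 2 + (total_far - far_sq[k]) for k in range(n))

print(minmaxdist_sq((0, 0), (1, 1), (3, 4)))  # 10 -- nearer face y=1, far corner x=3: 1 + 9
```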

20 Pruning
MINDIST(M) > MINMAXDIST(M′): M can be pruned
Distance(O) > MINMAXDIST(M′): object O can be discarded
MINDIST(M) > Distance(O): M can be pruned
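The three rules written as boolean predicates — a minimal sketch assuming all arguments are (squared) distances from the query point, computed with MINDIST, MINMAXDIST, or the actual object distance as appropriate; the names are mine.

```python
def prune_mbr_by_mbr(mindist_m, minmaxdist_m2):
    return mindist_m > minmaxdist_m2      # M cannot hold the NN: M' guarantees something closer

def discard_object_by_mbr(dist_o, minmaxdist_m2):
    return dist_o > minmaxdist_m2         # O cannot be the NN: M' guarantees something closer

def prune_mbr_by_object(mindist_m, dist_o):
    return mindist_m > dist_o             # M cannot hold anything closer than the found object O
```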

21 DFS Search on R-Tree
Traversal: DFS
Expanding a non-leaf node: order its children by one of the metrics (MINDIST or MINMAXDIST); prune before/after visiting each child.
Expanding a leaf node: compare its objects to the nearest neighbor found so far; replace it if a new object is closer.
Not a straightforward approach – it makes only local decisions, so it may visit non-optimal objects before the NN is found.
Best-first search, by contrast, is simple and never visits non-optimal nodes.

22 Extending to K-NN
Maintain the k nearest neighbors found so far.
Use the distance to the k-th (furthest) of them for pruning MBRs/objects.
Blocking algorithm – no pipelining.
There is no global queue of pending nodes: each node/object is either unexplored, kept among the current k-NN, or discarded.
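A compact, illustrative sketch of the branch-and-bound k-NN DFS (not a faithful re-implementation of [1]: it orders and prunes with MINDIST only and omits MINMAXDIST-based pruning). Nodes are plain tuples, objects are stored as degenerate rectangles, and the current k best are kept in a max-heap via negated distances.

```python
import heapq
from math import inf

def mindist_sq(p, rect):
    """Squared MINDIST from point p to rect = (low_corner, high_corner)."""
    s, t = rect
    return sum(max(sk - pk, 0.0, pk - tk) ** 2 for pk, sk, tk in zip(p, s, t))

def knn_dfs(node, q, k, best=None):
    """Branch-and-bound DFS. A node is ('leaf', [(rect, obj), ...]) or
    ('inner', [(rect, child_node), ...]). `best` is a max-heap of the k
    nearest objects found so far, stored as (-dist_sq, obj)."""
    if best is None:
        best = []
    kth = -best[0][0] if len(best) == k else inf   # current pruning distance
    kind, entries = node
    if kind == 'leaf':
        for rect, obj in entries:
            d = mindist_sq(q, rect)                # object stored as a degenerate rect
            if len(best) < k or d < kth:
                heapq.heappush(best, (-d, obj))
                if len(best) > k:
                    heapq.heappop(best)            # drop the current (k+1)-th neighbor
                kth = -best[0][0] if len(best) == k else inf
    else:
        # Visit children in MINDIST order; stop once no child can improve the k-NN.
        for rect, child in sorted(entries, key=lambda e: mindist_sq(q, e[0])):
            if len(best) == k and mindist_sq(q, rect) > kth:
                break
            knn_dfs(child, q, k, best)
            kth = -best[0][0] if len(best) == k else inf
    return sorted((-nd, o) for nd, o in best)

leaf1 = ('leaf', [(((1, 1), (1, 1)), 'a'), (((4, 4), (4, 4)), 'b')])
leaf2 = ('leaf', [(((9, 0), (9, 0)), 'c')])
root = ('inner', [(((1, 1), (4, 4)), leaf1), (((9, 0), (9, 0)), leaf2)])
print(knn_dfs(root, (0, 0), 2))   # [(2, 'a'), (32, 'b')]
```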

23 Experimental Results
Real-world data: TIGER, satellite data
Synthetic data
R-Tree construction (branching factor = 50):
Presort the data by Hilbert number (Hilbert ordering)
Apply a packing technique
Performance measure: number of pages accessed

24 Experimental Results (Cont’d)
Cost grows linearly with k (the number of neighbors to find), but slowly.
Cost also grows linearly with the height of the tree, i.e., logarithmically in the size of the data set.
MINDIST ordering outperforms MINMAXDIST ordering: about 20% faster in general, 30% on dense data sets.
Reason: the R-Tree is packed very well; with little or no dead space, MINDIST approaches the actual minimal distance.

25 Problems with this Algorithm
Nodes/objects are not visited in order of distance → blocking
May access non-optimal objects, then discard/prune them → not incremental
Need to know k in advance; no distance browsing; difficult to combine with other predicates.

26 Distance Browsing
To browse objects in distance order
Example: find the k nearest stars with distance > 10 LY
How can algorithm [1] be applied to this query?
Select the stars with distance > 10 LY first, materialize the result, and then build another R-Tree over it
What if the selectivity is very high (most stars qualify)?

27 Solution to Distance Browsing
Very low selectivity (e.g., nearest city with a population of 2M+):
Perform the selection first, build an R-Tree, then run k-NN
Otherwise:
Need incremental k-NN and pipeline its results into the selection operator
Can stop at any time
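A sketch of how the pipelining could look, assuming a hypothetical incremental_nn(q) generator that yields (distance, object) pairs in increasing distance order (the role played by the algorithm of [2]); the selection is applied lazily and iteration stops as soon as k matches are found.

```python
from itertools import islice

def k_nearest_matching(q, k, predicate, incremental_nn):
    """Pull neighbors in distance order, keep those passing the predicate, stop at k."""
    matches = (obj for _, obj in incremental_nn(q) if predicate(obj))
    return list(islice(matches, k))

# Hypothetical usage: the 5 nearest stars that are more than 10 light-years away.
# result = k_nearest_matching(earth, 5, lambda star: star.distance_ly > 10, incremental_nn)
```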

28 Overview of the Algorithm in [2]
A generic algorithm for different spatial data structures and different distance definitions.
Uses a priority queue to perform best-first search keyed on the minimal distance (optimistic).
Ensures that no object/node is visited before another, closer object/node.

29 Search Algorithm
Always expand the nearest node or object in the priority queue. Treat objects as special cases of nodes. When expanding a node, compute each child's distance from the query point and add it to the priority queue. When expanding an object, just report it and continue.

30 Requirements for Tree/Distance
The tree and distance function must conform to the following rules:
A node/object may have more than one parent; there may be duplicate object pointers in the tree.
The region covered by a node must be completely contained within the union of its parents' regions.
Consistent distance: for every query point q and node/object n, at least one of its parents n′ satisfies d(q, n′) ≤ d(q, n). (This ensures nodes are expanded in order.)

31 Remarks on Tree/Distance
Applicable trees: Quad-tree, R-Tree, R+-Tree, LSD-Tree, K-D-B Tree, etc.
Applicable distance measures: Euclidean, Manhattan, Chessboard, etc.
Almost all spatial trees have no duplicate nodes: a node is fully contained in its parent.
Some trees allow duplicate object entries; duplicates must then be detected and removed. The R-Tree has no duplicates.

32 Example
[Figure: the small tree used for the walk-through — Root has children A and B; A contains C and D, B contains E and F; the leaves hold the objects Triangle (under C), Circle (under D), Rectangle (under E), and Moon (under F).]
Distances from the query point:
Node/Obj   Distance
Root       0
A          1
B          7
C          10
D          1
E          8
F          12
Circle     1
Rectangle  8
Triangle   13
Moon       14

33 Order of Expansion
R=0: Expand Root → queue { A[1], B[7] }
R=1: Expand A → { D[1], B[7], C[10] }
R=1: Expand D → { Circle[1], B[7], C[10] }
R=1: Report Circle → { B[7], C[10] }
R=7: Expand B → { E[8], C[10], F[12] }
R=8: Expand E → { Rectangle[8], C[10], F[12] }
R=8: Report Rectangle → { C[10], F[12] }
R=10: Expand C → { F[12], Triangle[13] }
R=12: Expand F → { Triangle[13], Moon[14] }
R=13: Report Triangle → { Moon[14] }
R=14: Report Moon → { }

34 Observations
All nodes/objects intersecting the search region (a circle of radius R around the query point) have been expanded, and their children placed in the queue. All nodes/objects completely inside the search circle have already been taken off the queue. All nodes/objects completely outside the search circle are never examined. This minimizes the number of nodes/objects visited.

35 Pseudocode
Queue = NewPriorityQueue()
EnQueue(Queue, Root, 0)
While NotEmpty(Queue):
    Element = DeQueue(Queue)
    If IsObject(Element):
        /* remove duplicates here, for structures that can contain them */
        Report(Element)
    Else If IsLeaf(Element):
        For each child object o of Element:
            If Dist(o, Q) >= Dist(Element, Q):   /* this comparison is not needed for the R-Tree */
                EnQueue(Queue, o, Dist(o, Q))
    Else:   /* non-leaf node */
        For each child node n of Element:
            EnQueue(Queue, n, Dist(n, Q))
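A runnable Python rendering of the pseudocode above, under my own simplifications: nodes are plain tuples, MINDIST to an MBR is used as the priority, and the duplicate-removal step is dropped since the R-Tree has no duplicates.

```python
import heapq
from itertools import count

def mindist_sq(p, rect):
    """Squared distance from point p to rect = (low, high); 0 if p lies inside."""
    s, t = rect
    return sum(max(sk - pk, 0.0, pk - tk) ** 2 for pk, sk, tk in zip(p, s, t))

def incremental_nn(root, q):
    """Best-first search: yield (dist_sq, object) in increasing distance order."""
    tie = count()                            # tie-breaker so the heap never compares nodes
    queue = [(0.0, next(tie), 'node', root)]
    while queue:
        d, _, kind, elem = heapq.heappop(queue)
        if kind == 'object':
            yield d, elem                    # report; the caller decides when to stop
            continue
        node_kind, entries = elem            # ('leaf' | 'inner', [(rect, child), ...])
        child_kind = 'object' if node_kind == 'leaf' else 'node'
        for rect, child in entries:
            heapq.heappush(queue, (mindist_sq(q, rect), next(tie), child_kind, child))

# Toy usage: a tiny two-leaf tree, query at the origin.
leaf1 = ('leaf', [(((1, 1), (1, 1)), 'a'), (((4, 4), (4, 4)), 'b')])
leaf2 = ('leaf', [(((9, 0), (9, 0)), 'c')])
root = ('inner', [(((1, 1), (4, 4)), leaf1), (((9, 0), (9, 0)), leaf2)])
for d, obj in incremental_nn(root, (0, 0)):
    print(obj, d)                            # a 2, b 32, c 81
```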

36 Variants
K furthest: use MAXDIST as the key and replace ≤ by ≥.
Distance selection: e.g., select all stars between 15 LY and 20 LY away; prune unqualified nodes.
Pseudocode for the search algorithm combining these two extensions: Figure 5 in [2].
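This is not the paper's Figure 5, but a consumer-side sketch of the distance-selection variant: because the incremental search reports objects in increasing distance order, the range filter can be applied outside the traversal and iteration can stop at the upper bound (pushing the bounds into the traversal, as [2] does, additionally prunes nodes).

```python
def distance_selection(root, q, lo_sq, hi_sq, incremental_nn):
    """Report objects with lo_sq <= dist_sq <= hi_sq, built on top of the
    incremental search: results arrive in distance order, so we can stop early."""
    for d, obj in incremental_nn(root, q):
        if d > hi_sq:
            break                  # everything still in the queue is even farther
        if d >= lo_sq:
            yield d, obj

# Hypothetical usage with the incremental_nn generator sketched earlier:
# for d, star in distance_selection(sky_rtree, earth, 15**2, 20**2, incremental_nn):
#     print(star, d)
```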

37 Implementation of the Priority Queue
Enough memory: a heap (min-heap/max-heap)
Not enough memory: use a B+ Tree (sorted → keep the nodes with smaller distances in memory)
Hybrid scheme: divide the queue into 3 tiers. Tier 1 is a small in-memory heap (fast access). Tier 2 is divided into several sections; the nodes in each section form an unordered in-memory bucket, and the first bucket is moved into Tier 1 when Tier 1 empties. Tier 3 is stored on disk and moved into memory when Tiers 1 and 2 are empty.

38 Theoretical Analysis
Assumptions: uniform distribution, 2D.
Use a circular search region of radius r for the analysis.
k is proportional to the area of the search region (πr²).
The number of leaf nodes in the priority queue is proportional to the circumference of the search region (2πr), i.e., to √k.
The number of leaf nodes accessed is therefore proportional to k + √k, and the total number of node accesses grows at the same rate.
For the non-uniform 2D case, the measured behavior is very close to this result.

39 Experimental Results
TIGER/Line file (≈17,421 segments); synthetic data (unbounded stream of random segments)
Construction: R*-Tree
Distance browsing: Inc-NN is much faster than repeated k-NN, and the advantage grows as more neighbors are retrieved
Exact k-NN queries: Inc-NN is 10–20% faster
Scalability: close to the theoretical result
Very large k: the k-NN algorithm cannot hold all k neighbors in memory

40 Conclusion
Inc-NN outperforms other k-NN algorithms.
Inc-NN enables distance browsing.
The number of node accesses (2D, uniform data) is on the order of k + √k.
Future work: compare the algorithm on different spatial structures; investigate its behavior on very large data sets where the priority queue cannot fit into memory.

