SASH: Spatial Approximation Sample Hierarchy


SASH: Spatial Approximation Sample Hierarchy
Authors: Michael E. Houle, Jun Sakuma

SASH features
- Indexes data in high-dimensional spaces
- Fast construction of the index: O(N log N)
- Fast lookups of the k approximate nearest neighbors: O(k log N)

Drawbacks of other methods
- Slow construction: some require a k-NN index in order to construct a k-NN index
- Slow lookups: many reduce to grid searches or sequential search
- On the other hand, they may allow for true nearest neighbor queries

SASH construction: a two-phase process
- Phase 1: divide the set into a hierarchy of subsets
- Phase 2: link elements of the hierarchy together

SASH construction: phase 1
- Start with a set of points in a metric space
- Divide the set in half randomly
- Repeatedly divide the remaining half of the set until there is one element remaining
- This hierarchy of sets is reminiscent of a skip list
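The halving process of phase 1 can be sketched in a few lines. This is an illustrative sketch only; the function and variable names are my own, not from the paper.

```python
import random

def build_levels(points, seed=0):
    """Phase 1 sketch: repeatedly halve the set at random.

    Returns levels S_0, ..., S_h, where S_0 holds the single root
    element and each later level is roughly twice the size of the
    previous one.  (Names are illustrative, not from the paper.)
    """
    rng = random.Random(seed)
    items = list(points)
    rng.shuffle(items)
    levels = []
    while len(items) > 1:
        half = len(items) // 2
        levels.append(items[half:])   # the "second half" becomes a level
        items = items[:half]          # keep halving what remains
    levels.append(items)              # single remaining element: the root
    levels.reverse()                  # S_0 (root) first, S_h (largest) last
    return levels

levels = build_levels(range(16))
print([len(s) for s in levels])  # [1, 1, 2, 4, 8]
```

Note that the level sizes depend only on the count of points, so for N = 16 the sizes are always 1, 1, 2, 4, 8; only the assignment of points to levels is random.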

SASH subsets
- The partitioning process yields roughly log N sets of size 2^k, 0 ≤ k ≤ log N
- Label the sets S0 (the set containing one element, namely the root) through Sh (the largest set, containing approximately N/2 elements)

SASH appearance
- A SASH is a hierarchy of sets of size 2^k, 0 ≤ k ≤ h, with directed edges from the set of size 2^(k-1) to the set of size 2^k
- A SASH is generally not a tree, but it has some of the flavor of a binary tree, with edges from sets of a given size to sets of double that size; a SASH usually has many more edges

SASH construction: phase 2
- The SASH is constructed inductively, starting with SASH0 = S0
- For 1 ≤ i ≤ h, SASHi-1 is a partial SASH on the set S0 ∪ S1 ∪ … ∪ Si-1
- SASHi is constructed from SASHi-1 by adding new directed edges from elements of Si-1 to elements of Si

SASH construction: phase 2
- Let SASH0 be the root, S0
- For 1 ≤ i ≤ h, assume SASHi-1 exists; then:
  - For each c in Si, use SASHi-1 to find P possible parents of c in Si-1
  - Once all c in Si are linked to possible parents, each p in Si-1 links to the C closest children that chose it as a possible parent
  - If some orphan objects in Si have no parents linking to them, repeat the above, allowing them to try to link to more parents
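A minimal sketch of the linking step for one level, with simplified orphan handling. All names here are hypothetical; `candidate_parents(c, p)` stands in for the partial-SASH search that returns p candidate parents of c from the level above, which the slides describe later.

```python
def link_level(S_prev, S_new, candidate_parents, dist, P=2, C=8):
    """Phase 2 sketch for one level (illustrative, not the paper's
    exact procedure).  candidate_parents(c, p) returns p candidate
    parents of c; dist is the metric on the points."""
    children = {parent: [] for parent in S_prev}  # parent -> accepted children
    pending, p_try = list(S_new), P
    while pending and p_try <= 2 * len(S_prev):
        requests = {}
        for c in pending:                         # children request parents
            for parent in candidate_parents(c, p_try):
                requests.setdefault(parent, []).append(c)
        for parent, reqs in requests.items():
            reqs.sort(key=lambda c: dist(parent, c))
            room = C - len(children[parent])      # keep at most C children,
            children[parent].extend(reqs[:room])  # closest requesters first
        linked = {c for kids in children.values() for c in kids}
        pending = [c for c in pending if c not in linked]  # the orphans
        p_try *= 2                                # orphans retry, more parents
    return children
```

For example, with parents [0.0, 10.0] on the real line and new points [1.0, 2.0, 9.0], the two points near 0.0 attach to it and 9.0 attaches to 10.0.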

SASH parameters: P and C
- In practice, P is small and C is at least twice P (their experiments use C = 4P)
- Objects are then likely to have at least one parent linking to them, and if C > 2P, all orphans can eventually find parents
- Children link to "nearby" parents, and parents then link to "nearby" children
- This symmetric use of "nearby" gives good results, even though the relation is not truly symmetric

A Completed SASH

Example on the real line with P=2 and C=4

Randomly divide the set in half until reaching one point

The sets Si

SASH Construction Example
- Red nodes are in a completed SASH
- Light blue nodes are in the process of being added to a SASH
- Black nodes have not been processed
- Links from children to parents are green; links from parents to children are red

SASH0: Construction (P=2, C=4)

SASH0: Complete

SASH1: Construction (P=2, C=4)

SASH1: Link children to parents

SASH1: Link parents to children

SASH1: Complete

SASH2: Construction

SASH2: Link children to parents

SASH2: Link parents to children

SASH2: Complete

SASH3: Construction

SASH3: Link children to parents

SASH3: Link parents to children

Some of the green arrows were not reversed

Because parents only link to their C=4 closest children

The green arrows are not part of the completed SASH

SASH3: Complete

SASH4: Construction (P=2, C=4)

SASH4: Link children to parents

SASH4: Link parents to children

The green links were not returned to the children

The three purple nodes are orphans

Link them by doubling P as needed.

Orphans link to P=4 parents

Parents link to up to C=4 children

Two orphans were linked, and one remains

Link the final orphan to P=8 parents

Link parents to the orphan

The final green arrows are removed

SASH4: Complete

What am I hiding from you about this algorithm?
- For 1 ≤ i ≤ h, assume SASHi-1 exists; then:
  - For each c in Si, use SASHi-1 to find P possible parents of c in Si-1
  - Once all c in Si are linked to possible parents, each p in Si-1 links to the C closest children that chose it as a possible parent
  - If some orphan objects in Si have no parents linking to them, repeat the above, allowing them to try to link to more parents

This part can be expensive
- For 1 ≤ i ≤ h, assume SASHi-1 exists; then:
  - For each c in Si, use SASHi-1 to find P possible parents of c in Si-1
  - Once all c in Si are linked to possible parents, each p in Si-1 links to the C closest children that chose it as a possible parent
  - If some orphan objects in Si have no parents linking to them, repeat the above, allowing them to try to link to more parents

Cost of this operation
- For each c in Si, use SASHi-1 to find P possible parents of c in Si-1
- Done naively, there are N/2 points in Sh and N/4 points in Sh-1, for N^2/8 distance checks
- Alternatively, we could build an index, such as a quadtree, and do a k-NN search directly
- Either way is expensive; this is the catch-22 of most k-NN algorithms
- SASH instead uses an O(N log N) method

Avoiding k-NN search in SASH construction
- Instead, perform a partial search query for the new point using the partially constructed SASH
- Start with the root as the current set
- While not at the bottom of the partial SASH, let the current set be the P children of the current set that are closest to the new point
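The partial search described above can be sketched as a simple descent. The names are hypothetical; `sash_children` maps each node to the children it has in the partial SASH built so far.

```python
def find_candidate_parents(q, sash_children, root, dist, p):
    """Sketch of the construction-time partial search: descend the
    partial SASH, keeping the p nodes closest to the new point q at
    each level, and return the final level as candidate parents."""
    current = [root]
    while True:
        kids = {k for node in current for k in sash_children.get(node, [])}
        if not kids:                 # reached the bottom of the partial SASH
            return current           # these are the candidate parents of q
        # keep only the p children closest to q
        current = sorted(kids, key=lambda x: dist(x, q))[:p]
```

For instance, with a root 5 whose children are 2 and 8, which in turn have children {1, 3} and {7, 9}, a search for q = 2.5 with p = 2 descends to the bottom level and returns {1, 3}.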

Approximate parent search without a k-NN graph

Start at the root

Search children

Keep the 2 children closest to the query point

Search children

Keep the 2 children closest to the query point

Search children

Keep the 2 closest children to the query point

These are the approximate parents of the query point

Important points:
- No k-NN index is needed
- O(log N) search time for each element
- Up to P objects are retained at each level, and each of those has up to C children
- Only those PC children are searched at each level to find the P closest objects to send down to the next level

SASH Issues
- When a large number of children are clustered near a few parents, some children will be orphaned and end up with parents that are farther away
- A SASH is mostly static: some new nodes can be added, but clusters need to be filtered up through the hierarchy during the construction process

Queries with a completed SASH
- Similar to the process described above for finding approximate parents
- Two types of searches are described:
  - Uniform: keep the same number of children at each level
  - Geometric: start the search keeping a small number of nodes at each level, then increase it

Queries with a completed SASH
- The big difference between constructing the SASH and querying it: during construction, only the nodes kept at the final level of the partial search are used
- In a query on a completed SASH, all of the intermediate points visited can be used in the final k-ANN selection

Geometric search
- Keeping too few points near the root may lead to bad results
- Instead of starting near 1, the authors found that keeping 0.5*PC nodes (4 in the case of P=2, C=4) at the smaller levels sufficed to keep the search broad enough
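One plausible way to combine the 0.5*PC floor with a geometric ramp is sketched below. This schedule is an assumption for illustration, not the paper's exact formula.

```python
def geometric_width(i, h, k, P=2, C=4):
    """Hypothetical geometric schedule for k_i, the number of nodes
    kept at level i of an h-level search: widths grow geometrically
    toward k at the bottom, but never drop below 0.5*P*C."""
    floor = max(1, (P * C) // 2)          # the slide's 0.5*PC floor
    return max(floor, round(k ** (i / h)))
```

With P=2, C=4, and k=10 over h=5 levels, the schedule stays at the floor of 4 for the small levels near the root and reaches 10 at the bottom.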

Search process
- Let ki be the number of elements we will keep at level i of the SASH
- Let U0 = S0, the root
- For 1 ≤ i ≤ h:
  - Find all children of the elements of Ui-1
  - Let Ui be the ki of those children that are closest to the query point

Search process
- After the sets U0, …, Uh have been determined, let U = U0 ∪ U1 ∪ … ∪ Uh
- The final result is the k points of U closest to the query point
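The two search-process slides above can be sketched together. Names are hypothetical; `width(i)` supplies k_i, and, unlike the construction-time search, every node kept at any level is a candidate for the final answer.

```python
def sash_query(q, sash_children, root, dist, k, width):
    """Sketch of the k-ANN query on a completed SASH.  sash_children
    maps a node to its children; width(i) gives k_i, the number of
    nodes kept at level i."""
    current, visited, i = [root], {root}, 0
    while True:
        i += 1
        kids = {c for node in current for c in sash_children.get(node, [])}
        if not kids:
            break
        current = sorted(kids, key=lambda x: dist(x, q))[:width(i)]
        visited.update(current)                 # U = U0 ∪ U1 ∪ … ∪ Uh
    # final true k-NN selection over everything kept at any level
    return sorted(visited, key=lambda x: dist(x, q))[:k]
```

On the same toy hierarchy as before (root 5 with children {2, 8}, grandchildren {1, 3} and {7, 9}), a query at 2.6 with k=2 and a uniform width of 2 returns [3, 2].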

Search complexity
- Each Ui has at most k elements, and each of those has at most C children, so we perform at most Ck distance calculations per level, over log N levels: O(k log N) time
- Once U has been determined, we perform a true k-NN search on a set of size O(k log N)

Use of transitivity when searching
- We follow links from parents to children under the assumption that children are close to their parents
- We keep only the objects closest to the query at each level
- This gives good results in practice, but may fail in pathological cases

Pathological example of failure of transitivity
- A pathological case on the real line
- Assume the rest of the SASH is to the left or the right of the chains shown (following the dotted arrows)
- The query will return two of the nodes visited at the top, even though there are points closer to the query, Q

Pathological example of failure of transitivity when k=2

A search for Q first finds S and T

T’s children are closer to Q than those of S

The search continues below T

R and S are returned as the k=2 nearest neighbors of Q

However, A and B are the true k=2 nearest neighbors of Q

SASH Comparison to the M-tree
- M-tree (Ciaccia, Patella, Zezula): deals with overlapping objects; uses a balanced hierarchy with buckets and spheres as regions
- SASH-4: P=4, C=4P
- MEDLINE: 1,055,073 objects with 1,101,003 attributes, representing keywords found in medical abstracts; an average of 75 nonzero attributes per object
- SSeq: sequential search on a randomly selected subset of the data

Complexity Comparison

Speed vs. accuracy

Internal SASH Comparisons
- BactORF: bacterial protein sequences; 385,039 objects with 40,000 attributes; sparse, with 125 nonzero attributes per object
- VidFrame: video; 9,000,000 objects with 32 densely nonzero attributes

SASH P=3,4,5,8,16; C=4P

Boosted SASH

Different dataset sizes

Conclusion
- SASH indexes high-dimensional spaces
- Efficient construction and query times
- Uses approximate similarity, and a generalization of equivalence relations (symmetry and a weak form of transitivity), to get good results
- There is a large body of work in fuzzy logic on transitivity and approximate similarity