SASH Spatial Approximation Sample Hierarchy


1 SASH Spatial Approximation Sample Hierarchy
Authors: Michael E. Houle, Jun Sakuma

2 SASH features
Index data in high-dimensional space. Fast construction of the index: N log N time. Fast lookups of k approximate nearest neighbors: k log N time.

3 Drawbacks of other methods
Slow construction: they require a k-NN index to construct a k-NN index. Slow lookups: they reduce to grid searches or sequential search. But they may allow for true nearest neighbor queries.

4 SASH construction Two-phase process
Phase 1: divide the set into a hierarchy of subsets. Phase 2: link elements of the hierarchy together.

5 SASH construction: phase 1
Start with a set of points in a metric space. Divide the set in half randomly. Repeatedly divide the “second half” of the set until there is one element remaining. This hierarchy of sets reminds me of a skip list.
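
The halving step can be sketched in a few lines of Python. This is only an illustration: the function name `split_into_levels`, the `seed` parameter, and the choice to shuffle once and then slice are assumptions made here, not details taken from the slides.

```python
import random

def split_into_levels(points, seed=0):
    """Randomly halve the point set until a single element remains.

    Returns the subsets ordered from the root (one element) to the largest set.
    """
    rng = random.Random(seed)
    remaining = list(points)
    rng.shuffle(remaining)          # one shuffle makes every later split random
    levels = []
    while len(remaining) > 1:
        half = len(remaining) // 2
        levels.append(remaining[half:])   # this half is finished and becomes a level
        remaining = remaining[:half]      # keep dividing the other half
    levels.append(remaining)              # the last remaining element is the root
    levels.reverse()
    return levels

levels = split_into_levels(range(15))
print([len(level) for level in levels])
```

With 15 points the level sizes come out as 1, 2, 4, 8, so the largest set holds roughly half of the data and each set above it holds half of the one below.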

6 SASH subsets
Partitioning process roughly yields log N sets of size 2^k, 0 ≤ k ≤ log N. Label the sets S_0 (for the set containing one element, namely the root) through S_h (for the largest set containing approximately N/2 elements).

7 SASH appearance
A SASH is a hierarchy of sets of size 2^k, 0 ≤ k ≤ h, with directed edges from the set of size 2^{k-1} to the set of size 2^k. A SASH is generally not a tree, but it has some of the flavor of a binary tree, with edges from sets of a certain size to sets that are double that size. A SASH usually has many more edges.

8 SASH construction: phase 2
The SASH is constructed inductively by first setting SASH_0 = S_0. For 1 ≤ i ≤ h, SASH_{i-1} is a partial SASH on the set S_0 ∪ S_1 ∪ … ∪ S_{i-1}. SASH_i is constructed by starting with SASH_{i-1} and producing new directed edges from elements in S_{i-1} to elements in S_i.

9 SASH construction: phase 2
Let SASH_0 be the root, S_0. For 1 ≤ i ≤ h, assume SASH_{i-1} exists. Then, for each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}. Once all c in S_i link to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent. If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents.
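
The linking of a single level can be sketched as follows. This is a simplified illustration, not the authors' code: `link_level` and `dist` are hypothetical names, the points live on the real line, and candidate parents are found by brute force over S_{i-1} instead of by searching SASH_{i-1}. The doubling of the candidate count for orphans follows the worked example in this deck.

```python
def dist(a, b):
    # Points on the real line, as in the deck's worked example.
    return abs(a - b)

def link_level(parents, kids, P=2, C=4):
    """Link S_i (`kids`) to S_{i-1} (`parents`); returns {parent: [children]}."""
    accepted = {p: [] for p in parents}
    orphans, n_candidates = list(kids), P
    while orphans:
        # Each unlinked child nominates its nearest candidate parents.
        requests = {p: [] for p in parents}
        for c in orphans:
            # ASSUMPTION: brute force replaces the search through SASH_{i-1}.
            nearest = sorted(parents, key=lambda p: dist(p, c))[:n_candidates]
            for p in nearest:
                requests[p].append(c)
        # Each parent accepts its closest requesters, up to C children in total.
        for p, wanting in requests.items():
            room = C - len(accepted[p])
            accepted[p] += sorted(wanting, key=lambda c: dist(p, c))[:room]
        linked = {c for chosen in accepted.values() for c in chosen}
        orphans = [c for c in orphans if c not in linked]
        if orphans and n_candidates >= len(parents):
            raise ValueError("every parent is full; C is too small")
        n_candidates *= 2      # orphans retry with twice as many candidates
    return accepted

links = link_level([2.0, 10.0], [1.0, 1.5, 2.5, 3.0, 3.5, 9.0], P=1, C=4)
print(links)
```

In this run five children cluster around the parent at 2.0, which accepts only its four closest. The child at 3.5 is orphaned, retries with two candidates, and ends up under the more distant parent at 10.0.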

10 SASH parameters: P and C
In practice, P is small, and C is at least twice P (their experiments use C=4P). It is likely that objects will have at least one parent that links to them, and if C > 2P, all orphans can eventually find parents. Children link to “nearby” parents, and parents then link to “nearby” children. The symmetric use of “nearby” gives good results, even though the relation isn’t really symmetric.

11 A Completed SASH

12

13 Example on the real line with P=2 and C=4

14 Randomly divide the set in half until reaching one point

15 Randomly divide the set in half until reaching one point

16 Randomly divide the set in half until reaching one point

17 Randomly divide the set in half until reaching one point

18 The sets Si

19 SASH Construction Example
Red nodes are in a completed SASH. Light blue nodes are in the process of being added to a SASH. Black nodes have not been processed. Links from children to parents are green, and links from parents to children are red.

20 SASH0:Construction P=2, C=4

21 SASH0:Complete

22 SASH1:Construction P=2, C=4

23 SASH1:Link children to parents

24 SASH1:Link parents to children

25 SASH1:Complete

26 SASH2:Construction

27 SASH2:Link children to parents

28 SASH2:Link parents to children

29 SASH2:Complete

30 SASH3:Construction

31 SASH3:Link children to parents

32 SASH3:Link parents to children

33 Some of the green arrows were not reversed

34 Because parents only link to their C=4 closest children

35 The green arrows are not parts of the completed SASH

36 SASH3:Complete

37 SASH4:Construction P=2, C=4

38 SASH4:Link children to parents

39 SASH4:Link parents to children

40 The green links were not returned to the children

41 The three purple nodes are orphans

42 Link them by doubling P as needed.

43 Orphans link to P=4 parents

44 Parents link to up to C=4 children

45 Two orphans were linked, and one remains

46 Two orphans were linked, and one remains

47 Link the final orphan to P=8 parents

48 Link parents to the orphan

49 The final green arrows are removed

50 SASH4:Complete

51 What am I hiding from you about this algorithm?
For 1 ≤ i ≤ h, assume SASH_{i-1} exists. Then, for each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}. Once all c in S_i link to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent. If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents.

52 This part can be expensive
For 1 ≤ i ≤ h, assume SASH_{i-1} exists. Then, for each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}. Once all c in S_i link to possible parents, each p in S_{i-1} links to the C closest children that chose it as a possible parent. If some orphan objects in S_i have no parents linking to them, repeat the above, allowing them to try to link to more parents.

53 Cost of this operation
For each c in S_i, use SASH_{i-1} to find P possible parents of c in S_{i-1}. There are N/2 points in S_h and N/4 points in S_{h-1}, for N^2/8 checks. Or we could build an index, like a quadtree, and do a k-NN search directly. This is expensive, and is the catch-22 of most k-NN algorithms. SASH uses an N log N method.
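
To get a feel for the gap, the brute-force count for the bottom level can be compared with N log N. The snippet below is plain arithmetic; the function name and the choice of N = 1,000,000 are made up for illustration, and constant factors in the N log N figure are ignored.

```python
import math

def brute_force_checks(n):
    # Every one of the N/2 points in S_h is compared against all N/4 points in S_{h-1}.
    return (n // 2) * (n // 4)

n = 1_000_000
print(brute_force_checks(n))        # equals n^2 / 8
print(round(n * math.log2(n)))      # order of magnitude of an N log N method
```

For a million points the brute-force count is thousands of times larger than N log N, which is why the parent search has to avoid comparing each new point with the whole level above it.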

54 Avoiding k-NN search in SASH construction
Instead, perform a partial search query on the new point using the partially constructed SASH. Start with the root as the current set. While not at the bottom of the partial SASH, let the current set equal the P children of the current set that are closest to the new point.
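
A minimal sketch of this descent, assuming the partial SASH is stored as a dictionary from each node to its list of children and that points lie on the real line. The names `approximate_parents`, `children`, and `depth` are illustrative, and the tiny SASH below is hand-built rather than produced by the construction algorithm.

```python
def dist(a, b):
    return abs(a - b)   # real-line distance, for illustration only

def approximate_parents(query, root, children, depth, P=2):
    """Walk down `depth` levels, keeping the P nodes nearest to `query` at each level."""
    current = [root]
    for _ in range(depth):
        # Only the children of the currently kept nodes are examined.
        candidates = {kid for node in current for kid in children.get(node, [])}
        # Ties are broken by value so the result is deterministic.
        current = sorted(candidates, key=lambda x: (dist(x, query), x))[:P]
    return current

# Hand-built partial SASH: root 50, then {20, 80}, then {10, 30, 60, 90}.
children = {50: [20, 80], 20: [10, 30, 60], 80: [30, 60, 90]}
print(approximate_parents(33, 50, children, depth=2))
```

The call returns the two bottom-level nodes nearest to 33 among those reachable from the nodes kept one level up, and these become the candidate parents of the new point.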

55 Approximate parent search without a k-NN graph

56 Start at the root

57 Search children

58 Keep the 2 children closest to the query point

59 Search children

60 Keep the 2 children closest to the query point

61 Search children

62 Keep the 2 closest children to the query point

63 These are the approximate parents of the query point

64 Important points: No k-NN index needed
Log N search time for each element. Up to P objects are retained at each level, and each of those has up to C children. Only those PC children are searched at each level to find the P closest objects to send down to the next level.

65 SASH Issues
When a large number of children are clustered near a few parents, some children will be orphaned and have parents that are farther away. A SASH is mostly static: some new nodes can be added, but clusters need to be filtered up through the hierarchy during the construction process.

66 Queries with a completed SASH
Similar to the process described above to get approximate parents. Two types of searches are described. Uniform: keep the same number of children at each level. Geometric: start the search with a small number of nodes kept at each level, then increase it.

67 Queries with a completed SASH
The big difference between constructing the SASH and using it for queries is that in the construction process, only the nodes in the final partial SASH are used. In a query on a completed SASH, all of the intermediate points visited can be used in the final k-ANN search

68 Geometric search
Keeping too few points near the root may lead to bad results, so instead of starting near 1, the authors found that 0.5*PC (4 in the case of P=2, C=4) nodes at smaller levels sufficed to keep the search broad enough.
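
One way to picture a geometric schedule with this lower bound is sketched below. The slides do not give the exact growth rule, so the exponent form used here (the quota reaches k at the bottom level and shrinks geometrically toward the root) is an assumption, as are the names `geometric_quotas`, `k`, `h`, and `n`.

```python
import math

def geometric_quotas(k, h, n, P=2, C=4):
    """Quota k_i for levels i = 1..h: grows toward k at the bottom, never below 0.5*P*C."""
    lower_bound = 0.5 * P * C
    quotas = []
    for i in range(1, h + 1):
        # ASSUMED growth rule: k raised to a power that reaches 1 at the bottom level.
        grown = k ** (1 - (h - i) / math.log2(n))
        quotas.append(math.ceil(max(grown, lower_bound)))
    return quotas

print(geometric_quotas(k=64, h=9, n=1024))
```

With these made-up values, the levels near the root keep the minimum of 4 nodes and the bottom level keeps the full 64.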

69 Search process
Let k_i be the number of elements we will keep at level i of the SASH. Let U_0 = S_0, the root. For 1 ≤ i ≤ h: find all children of elements in U_{i-1}, and let U_i be the k_i children of U_{i-1} that are closest to the query point.

70 Search process
After the sets U_0, …, U_h have been determined, let U = U_0 ∪ U_1 ∪ … ∪ U_h. Then the final result is the k closest points in U to the query point.
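
The two search slides can be put together in a short sketch. It assumes a completed SASH stored as a dictionary from node to children, real-line points, and a caller-supplied list of per-level quotas; `sash_query` and `quotas` are names chosen here, and the small SASH is hand-built for the example.

```python
def dist(a, b):
    return abs(a - b)   # real-line distance, for illustration only

def sash_query(query, root, children, quotas, k):
    """Approximate k-NN query; quotas[i-1] plays the role of k_i for level i."""
    current = [root]      # U_0
    visited = [root]      # running union U_0, U_1, ...
    for k_i in quotas:
        candidates = {kid for node in current for kid in children.get(node, [])}
        current = sorted(candidates, key=lambda x: (dist(x, query), x))[:k_i]  # U_i
        visited.extend(current)
    # Exact k-NN over the small set of visited nodes.
    return sorted(visited, key=lambda x: (dist(x, query), x))[:k]

# Hand-built SASH: root 50, then {20, 80}, then {10, 30, 60, 90}.
children = {50: [20, 80], 20: [10, 30, 60], 80: [30, 60, 90]}
print(sash_query(52, 50, children, quotas=[2, 2], k=2))
```

For the query 52 the answer contains the root itself, which shows why points from every visited level, not only the bottom one, are kept as candidates.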

71 Search complexity
Each U_i has at most k elements, and each of those has at most C children, so we perform at most Ck distance calculations at each of log N levels, in k log N time. Once U has been determined, we perform a true k-NN search on a set of size k log N.

72 Use of transitivity when searching
We follow links from parents to children under the assumption that children are close to parents. We keep only the objects closest to the query at each level. This gives good results in practice, but may fail in pathological cases.

73 Pathological example of failure of transitivity
Pathological case on the real line. Assume the rest of the SASH is to the left or the right of the chains shown (following the dotted arrows). The query will return two of the nodes visited at the top, even though there are points closer to the query, Q.

74 Pathological example of failure of transitivity when k=2
A B Q

75 A search for Q first finds S and T
A B Q

76 T’s children are closer to Q than those of S
A B Q

77 The search continues below T
A B Q

78 The search continues below T
A B Q

79 The search continues below T
A B Q

80 The search continues below T
A B Q

81 R and S are returned as the k=2 nearest neighbors of Q
A B Q

82 However, A and B are the true k=2 nearest neighbors of Q
A B Q

83 SASH Comparison to MTree
MTree (Ciaccia, Patella, Zezula): deals with overlapping objects; uses a balanced hierarchy with buckets and spheres as regions. SASH-4: P=4, C=4P. MEDLINE: 1,055,073 objects with 1,101,003 attributes, representing keywords found in medical abstracts; average of 75 nonzero attributes per object. SSeq: sequential search on a randomly selected subset of the data.

84 Complexity Comparison

85 Speed vs. accuracy

86 Internal SASH Comparisons
BactORF: bacterial protein sequences; 385,039 objects with 40,000 attributes; sparse, with 125 nonzero attributes per object. VidFrame: video; 9,000,000 objects with 32 attributes, densely nonzero.

87 SASH P=3,4,5,8,16; C=4P

88 Boosted SASH

89 Different dataset sizes

90 Conclusion
SASH indexes high-dimensional spaces. Efficient construction and query times. Uses approximate similarity, and a generalization of equivalence relations (symmetry and a weak form of transitivity), to get good results. There is a large body of work in fuzzy logic on transitivity and approximate similarity.

