Principles and Applications For Supporting Similarity Queries in Non-ordered Discrete and Continuous Data Spaces Gang Qian Advisor: Dr. Sakti Pramanik.

1 Principles and Applications For Supporting Similarity Queries in Non-ordered Discrete and Continuous Data Spaces Gang Qian Advisor: Dr. Sakti Pramanik Department of Computer Science and Engineering Michigan State University

2 Outline 1.Introduction –Similarity queries and applications –Research problems –Overview of the dissertation (contributions) 2.Indexing NDDSs using the ND-tree 3.The NSP-tree: an SP Approach 4.Extending NDDSs into HDSs 5.Choosing A Distance Measure 6.Conclusion

3 Introduction Similarity Queries –What: Return objects that are similar to a query object Different from traditional database queries E.g. find all genome sequences in the DB that are similar to the query sequence –Application: Many new application areas Genome Sequence Databases, Data Mining, Time Series Databases, Artificial Intelligence, Content Based Image Retrieval (CBIR), Audio Retrieval, etc. –A measure of similarity needs to be defined

4 Similarity Queries (contd) –Two query types K nearest neighbor (k-NN) query Range query Models for Similarity Queries –Vector model: Most popular and widely used Believed to be better than other models [Baeza 97] –Other models: The Boolean model, the probabilistic model, etc. Our focus is on the vector model

5 The Vector Model –Represent/approximate each database object and query object as a vector Could be non-trivial –Similarity between objects can be calculated A vector is a point in a multidimensional data space The closer the two points, the more similar are their representing objects –Similarity query becomes: Searching a DB of vectors by calculating distance values between the query vector and each vector in the DB –The focus of this dissertation is on supporting similarity queries using the vector model
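The search described on this slide can be sketched as a plain linear scan, which is the baseline that the index structures in later slides aim to beat. This is a minimal illustration, assuming Euclidean distance and in-memory tuples; the function names `range_query` and `knn_query` are hypothetical and not taken from the dissertation.

```python
import heapq
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def range_query(db, q, radius, dist=euclidean):
    """Return every vector in db whose distance to q is at most radius."""
    return [v for v in db if dist(q, v) <= radius]


def knn_query(db, q, k, dist=euclidean):
    """Return the k vectors in db that are closest to q."""
    return heapq.nsmallest(k, db, key=lambda v: dist(q, v))


# Toy database of 2-dimensional vectors (illustrative values only).
db = [(0.0, 0.0), (1.0, 1.0), (0.2, 0.1), (0.9, 0.8)]
q = (0.0, 0.1)
print(range_query(db, q, 0.25))
print(knn_query(db, q, 1))
```

Both queries compute one distance per database vector, so the cost grows linearly with the database size.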

6 Major Research Issues –Efficiency: Why: –DBs are usually very large. –Linear search is not efficient Solution: –Indexing techniques are needed Our main focus in this dissertation –Effectiveness: Why: –A number of different distance measures are available. E.g., Euclidean distance, Manhattan distance, etc. Open problem: how to choose a suitable distance measure We have made contributions to understanding the relationship among distance measures for similarity queries

7 Overview of the Dissertation –Indexing Non-ordered Discrete Data Spaces (NDDS) The ND-tree and the NSP-tree are proposed –The ND-tree is the first index structure of its kind –A theoretical performance estimation model for the ND-tree is developed –The NSP-tree is particularly efficient for skewed datasets –Indexing Hybrid Data Spaces (HDS) The NDh-tree is proposed –Efficiently supports similarity queries in HDSs –Choosing a distance measure A theoretical model is developed –Compares the behavior of the Euclidean distance (EUD) and the cosine angle distance (CAD) measures for NN queries on random data EUD and CAD are also compared experimentally for real, clustered and normalized data

8 Outline 1.Introduction 2.Indexing NDDSs using the ND-tree –Motivations for NDDSs –The problem of current multidimensional index structures –Existing techniques to search non-ordered discrete data –Challenges –The ND-tree in detail 3.The NSP-tree: an SP Approach 4.Extending NDDSs into HDSs 5.Choosing A Distance Measure 6.Conclusion

9 Non-ordered Discrete Data Spaces (NDDS) –Domains that contain non-ordered discrete values are prevalent, e.g., sex, profession, etc. –There are many new and emerging applications that use vectors with non-ordered values, e.g. genomic sequences that are broken into fixed-length substrings (vectors) with the domain {a, g, t, c}: aggcggtgatctgggccaatactga is a substring obtained from a genome sequence. It is also a vector, e.g., the value of the 3rd dimension of the vector is g –NDDS: a d-dimensional data space that is the Cartesian product of d non-ordered discrete domains
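As a small illustration of the genome example, the sketch below cuts a sequence into fixed-length vectors and compares two vectors with the Hamming distance (the measure the ND-tree uses, per slide 15). Cutting into non-overlapping windows is an assumption made here for simplicity; the slide does not say whether the substrings overlap. The function names are hypothetical.

```python
def fixed_length_vectors(sequence, d):
    """Cut a sequence into consecutive, non-overlapping vectors of length d.

    A trailing fragment shorter than d is dropped (an assumption of this sketch).
    """
    return [tuple(sequence[i:i + d]) for i in range(0, len(sequence) - d + 1, d)]


def hamming(u, v):
    """Number of dimensions on which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same number of dimensions")
    return sum(a != b for a, b in zip(u, v))


# The substring from the slide, viewed as a 25-dimensional vector.
vec = tuple("aggcggtgatctgggccaatactga")
print(len(vec), vec[2])          # number of dimensions and the 3rd value
print(hamming("aggc", "agtc"))   # the two vectors differ on one dimension
```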

10 NDDS (contd) –Databases based on an NDDS are often quite large E.g., Genbank is 24GB and growing –Multidimensional indexing methods are needed

11 Existing Multidimensional Index Structures –Typical index structure: The R-tree Widely used. The basis for many later methods: –R*-tree, SS-tree, SR-tree, X-tree, etc. –Group clusters of vectors/points into boxes, called Minimum Bounding Rectangles (MBRs) –MBRs are further grouped recursively into larger MBRs –Nested MBRs are organized as a balanced tree structure –Disk-based: Each tree node resides in one disk page/block –Dynamic construction algorithms Similar to those of the B-tree Heuristics are different from those of the B-tree Details in R-tree [Guttman 84]

12 Existing Multidimensional Index Structures (contd) [Figure: example R-tree with nested MBRs R1 to R12; leaf nodes contain points (vectors)]

13 Existing Multidimensional Indexing Methods (contd) –Must work in Continuous Data Spaces (CDS) Vectors are grouped using some geometrical shapes –Inapplicable for indexing an NDDS Problems for Other Indexing Methods –String indexing methods (Tries, Prefix B-tree, etc.) For prefix and substring search, not for similarity search Only deal with a single domain (alphabet) –Metric trees (GNAT, M-tree, etc.) Organize data only by their relative distances Too general, not optimized for the NDDS Most are static

14 Existing Search Techniques for Non-ordered Discrete Data –Bitmap index –Genome sequence search Online searching algorithms – linear scan Indexing: Hashing or inverted file – exact match –The signature tree (SG-tree) Similarity search on sets Indexing bitmaps

15 Challenges to Index NDDS –No ordering of values on each dimension –Non-applicability of continuous distance measures –High probability of duplicate values –Limited choice of splitting points for an overflow node The ND-tree is designed to address the above challenges –Discrete geometrical concepts are established –Hamming distance is used –Multiple heuristics are developed to break ties –Effective algorithms are developed to generate candidate partitions for overflow nodes

16 Discrete Geometrical Concepts for NDDSs –A d-dimensional NDDS $\Omega_d$: the Cartesian product of d alphabets (domains): $\Omega_d = A_1 \times A_2 \times \dots \times A_d$. $A_i$ ($1 \le i \le d$): an alphabet consisting of a finite number of non-ordered letters (values). –Discrete rectangle: $R = S_1 \times S_2 \times \dots \times S_d$, where $S_i \subseteq A_i$ ($1 \le i \le d$) is called the i-th component set of R –Edge length on the i-th dimension: $length(R, i) = |S_i|$ –Area, overlap of discrete rectangles, …
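The discrete geometrical concepts can be sketched with ordinary sets. The slide defines only the edge length explicitly; the sketch below assumes that the area is the product of the edge lengths and that the overlap of two rectangles is the area of their component-wise intersection. A rectangle is represented as a tuple of sets, one per dimension; all names are hypothetical.

```python
from math import prod


def edge_length(rect, i):
    """length(R, i) = |S_i|: the number of letters in the i-th component set."""
    return len(rect[i])


def area(rect):
    """Assumed definition: product of the edge lengths over all dimensions."""
    return prod(len(s) for s in rect)


def overlap(r1, r2):
    """Assumed definition: area of the component-wise intersection of r1 and r2."""
    return prod(len(a & b) for a, b in zip(r1, r2))


# Two discrete rectangles in a 3-dimensional NDDS over the alphabet {a, g, t, c}.
r1 = ({"a", "g"}, {"t"}, {"c", "g", "t"})
r2 = ({"g"}, {"t", "c"}, {"a", "c"})
print(edge_length(r1, 0), area(r1), overlap(r1, r2))
```

Under this definition, two rectangles overlap only if their component sets intersect on every dimension; a single disjoint dimension makes the overlap zero.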

17 ND-Tree Structure –Similar to that of the R-tree –M and m: Max. and min. number of entries in a node –Leaf node entry: (object pointer, vector) –Non-leaf node entry: (child pointer, DMBR) Discrete minimum bounding rectangle (DMBR): Recursively defined
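A DMBR can be sketched as the smallest discrete rectangle that contains a group of vectors (leaf level) or a group of child DMBRs (non-leaf level), which is what "recursively defined" suggests. The pruning bound `min_dist` is an assumption of this sketch: with Hamming distance, every vector inside a DMBR differs from the query on at least those dimensions where the query letter is missing from the component set.

```python
def dmbr_of_vectors(vectors):
    """Smallest discrete rectangle containing all vectors of a leaf node."""
    d = len(vectors[0])
    return tuple(frozenset(v[i] for v in vectors) for i in range(d))


def dmbr_of_rects(rects):
    """Smallest discrete rectangle containing all child DMBRs of a non-leaf node."""
    d = len(rects[0])
    return tuple(frozenset().union(*(r[i] for r in rects)) for i in range(d))


def min_dist(q, rect):
    """Lower bound on the Hamming distance from q to any vector inside rect."""
    return sum(q[i] not in rect[i] for i in range(len(q)))


leaf1 = dmbr_of_vectors(["agt", "act", "agc"])
leaf2 = dmbr_of_vectors(["ttt", "tct"])
parent = dmbr_of_rects([leaf1, leaf2])
print(min_dist("tgt", leaf1), min_dist("tgt", parent))
```

In a range query with radius r, a subtree can be skipped whenever `min_dist(q, dmbr) > r`.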

18 Building the ND-tree –Keep the data well-organized in the tree (less overlap) –Insertion algorithm 1)Choose a leaf for the new vector 2)Overflow ? Split the node –Algorithm ChooseLeaf Go top-down to a leaf node Heuristics are used (least overlap inc., area inc., etc.) –Splitting an overflow node Divide the M+1 entries into two disjoint sets (partition) Algorithm SplitNode: 1)Find a set of candidate partitions 2)Choose the best partition 3)Split based on the best partition

19 Building the ND-tree (contd) –ChoosePartitionSet Exhaustive method is infeasible Need to decide on a smaller candidate partition set –with potentially less overlap. –Permutation approach (for smaller alphabets) Generate a sorted entry list for each dimension and each permutation of the alphabet by a bucket ordering technique Generate partitions from the sorted entry list Far fewer candidate partitions are generated Proposition: can find an overlap-free partition, if one exists

20 Building the ND-tree (contd) –Merge-and-sort approach (for larger alphabets) Generate one sorted entry list for each dimension by a merge-and-sort technique, then generate partitions from the list Even fewer candidates are considered Merge-and-sort technique: –Merge entries into an auxiliary tree, sort entries using the auxiliary tree –3 data fields for each node T: i.T.sets: The set of component sets represented by the subtree ii.T.freq: Total number of entries that correspond to one of the component sets in T.sets iii.T.letters: The set of letters that appear in any component set in T.sets Can also find an overlap-free partition, if one exists –Choose the best partition Choose the best partition from the candidate set A set of heuristics is used –H1: Minimize overlap of the DMBRs of the two new nodes –H2: Favor splits on a longer edge of the DMBR of the overflow node –…… Similarity Query Algorithm
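Once a sorted entry list exists, generating candidate partitions and applying heuristic H1 can be sketched as follows. The sketch assumes that each candidate cuts the sorted list at one position and that both sides must keep at least m entries; it applies only H1 (minimum overlap) and ignores the remaining tie-breaking heuristics. How the list is sorted (permutation or merge-and-sort) is outside this sketch, and all names are hypothetical.

```python
from math import prod


def dmbr(rects):
    """Smallest discrete rectangle containing all given rectangles."""
    d = len(rects[0])
    return tuple(frozenset().union(*(r[i] for r in rects)) for i in range(d))


def overlap(r1, r2):
    """Area of the component-wise intersection (assumed definition)."""
    return prod(len(a & b) for a, b in zip(r1, r2))


def candidate_partitions(sorted_entries, m):
    """Cut the sorted list at every position that leaves at least m entries per side."""
    n = len(sorted_entries)
    for j in range(m, n - m + 1):
        yield sorted_entries[:j], sorted_entries[j:]


def best_partition(sorted_entries, m):
    """H1 only: choose the candidate whose two DMBRs overlap the least."""
    return min(candidate_partitions(sorted_entries, m),
               key=lambda p: overlap(dmbr(p[0]), dmbr(p[1])))


# Four 1-dimensional entries, already sorted on that dimension.
entries = [(frozenset("a"),), (frozenset("ab"),), (frozenset("c"),), (frozenset("cd"),)]
left, right = best_partition(entries, m=1)
print(left, right)
```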

21 Example of the auxiliary tree: –A = {a, b, c, d, e, f}, M = 10, m = 3; Right now: D = 5 –The 5th component set of the DMBRs of the 11 entries in the overflow node: {c}, {ade}, {b}, {ae}, {f}, {e}, {cf}, {de}, {e}, {cf}, {a}
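The merging idea behind this example can be sketched without building the tree itself. The code below is a flat simplification, not the auxiliary tree of the slides: it repeatedly merges component sets that share a letter and keeps, for each resulting group, the union of letters and the number of entries, loosely mirroring the T.letters and T.freq fields.

```python
def merge_component_sets(component_sets):
    """Group component sets that (transitively) share letters.

    Returns a list of (letters, freq) pairs whose letter sets are pairwise disjoint.
    """
    groups = []
    for s in component_sets:
        letters, freq = set(s), 1
        remaining = []
        for g_letters, g_freq in groups:
            if g_letters & letters:
                # The new set links to this group, so absorb the group.
                letters |= g_letters
                freq += g_freq
            else:
                remaining.append((g_letters, g_freq))
        remaining.append((letters, freq))
        groups = remaining
    return groups


# The 5th component sets of the 11 entries from the slide.
sets_dim5 = ["c", "ade", "b", "ae", "f", "e", "cf", "de", "e", "cf", "a"]
groups = merge_component_sets(sets_dim5)
summary = sorted(("".join(sorted(letters)), freq) for letters, freq in groups)
print(summary)
```

For the slide's data this yields three disjoint groups: {a, d, e} with 6 entries, {c, f} with 4 and {b} with 1. Because the groups share no letters, placing the {a, d, e} group in one node and the other two groups in the other gives a 6/5 split with no overlap on this dimension, and both sides satisfy m = 3.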

22 Comparison with The Linear Scan (Genomic Data d=25)

23 NDDS with Different Alphabet Sizes –Naive approach No change to current algorithms Advantage: simplicity Disadvantage: unfair comparison among dimensions –Normalization approach The edge length of a discrete rectangle is normalized –$norm\_length(R, i) = length(R, i) / |A_i| = |S_i| / |A_i|$ Other concepts, e.g. area, are normalized based on the normalized edge length The construction algorithms use normalized geometrical measures for their heuristics –The normalization approach is usually much better than the naive approach Even better when the difference among dimensions is large
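The normalized edge length follows directly from the formula on this slide. The normalized area below is an assumption of this sketch (a product of normalized edge lengths), since the slide states only that other concepts are based on the normalized edge length.

```python
from math import prod


def norm_length(rect, alphabets, i):
    """norm_length(R, i) = |S_i| / |A_i|."""
    return len(rect[i]) / len(alphabets[i])


def norm_area(rect, alphabets):
    """Assumed definition: product of the normalized edge lengths."""
    return prod(len(s) / len(a) for s, a in zip(rect, alphabets))


# Dimension 0 has a 4-letter alphabet, dimension 1 a 20-letter alphabet.
alphabets = (set("agtc"), set("ACDEFGHIKLMNPQRSTVWY"))
rect = ({"a", "g"}, {"A", "C", "D", "E", "F"})
print(norm_length(rect, alphabets, 0), norm_length(rect, alphabets, 1))
print(norm_area(rect, alphabets))
```

Without normalization the second dimension would look longer (5 letters against 2), although it covers a smaller fraction of its alphabet.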

24 Performance Estimation Model of the ND-tree –Motivation Analyze the performance of the ND-tree for very large databases with a large range of input parameters –Inputs Dimensions, alphabet size, database size, node size, Hamming distance –Output Estimated disk IOs for the given Hamming distance –Assumptions Vectors are uniformly distributed No correlation among dimensions –Main idea Estimate the area of DMBRs on each level of the ND-tree The area of a DMBR gives the probability that the corresponding node will be accessed

25 Model of the ND-tree (contd), where

26 Model of the ND-tree (contd) –Evaluation

27 Outline 1.Introduction 2.Indexing NDDSs using the ND-tree 3.The NSP-tree: an SP Approach –Motivations for an SP approach –Challenges –The NSP-tree –Experimental results 4.Extending NDDSs into HDSs 5.Choosing A Distance Measure 6.Conclusion

28 Motivations for A Space-Partitioning Approach –Overlap among bounding regions is a known problem in index structures for CDS [Berchtold et al. 96] –Overlap in NDDSs also causes performance degradation [Qian et al. 03] –Although overlap-reducing heuristics are applied, the ND-tree, as a data-partitioning (DP) approach, may still have overlap When the database is very skewed, overlap in the ND-tree may cause noticeable performance degradation –A space-partitioning (SP) approach can guarantee an overlap-free structure

29 Background –Data partitioning (R-tree variants) Group vectors based on data distribution – the bounding regions of the groups may overlap Guarantee a lower bound on disk utilization –Space partitioning (KD-tree variants) Partition the data space into subspaces. Vectors are grouped based on the subspace they belong to Guarantee no overlap among subspaces –Pros and cons of the SP method Advantage: fan-out is large – only split info is stored Disadvantage: subspaces contain large dead spaces –Using additional MBRs may reduce the fan-out –CDS solution: a grid-based approximation of the MBR is used as an additional pruning tool

30 Challenges for an SP approach in NDDSs and the Solution of the NSP-tree –NDDS cannot be split based on a single split point No ordering Solution: Enumerate the arrangement of each letter for a split –Difficult to determine an arrangement for absent letters Randomly decide a side may not be good Solution: Only partition the current data space –Current data space: the Cartesian product of the existing letters on each dimension –Let insertion algorithms handle new letters
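The idea of partitioning only the current data space can be sketched as follows. A split on one dimension assigns each letter that currently exists on that dimension to one of two sides; a letter that has not been seen yet belongs to neither side and is left to the insertion algorithm. The representation and the names are assumptions of this sketch.

```python
def current_data_space(vectors):
    """Cartesian product of the letters that actually occur on each dimension."""
    d = len(vectors[0])
    return tuple(frozenset(v[i] for v in vectors) for i in range(d))


def choose_side(vector, dim, left_letters, right_letters):
    """Return 'left' or 'right' for a known letter, or None for an absent letter."""
    letter = vector[dim]
    if letter in left_letters:
        return "left"
    if letter in right_letters:
        return "right"
    return None  # absent letter: the split says nothing about it


vectors = ["agt", "act", "cgt", "cct"]
space = current_data_space(vectors)

# A split on dimension 0 that arranges the two existing letters.
left, right = frozenset("a"), frozenset("c")
print(space[0], choose_side("agt", 0, left, right), choose_side("tgt", 0, left, right))
```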

31 Challenges and Solutions of the NSP-tree (contd) –Balance the fan-out and the use of DMBRs The use of DMBRs reduces the fan-out and vice versa Grid-based solution for CDSs is inapplicable for NDDSs Different approaches are tested –Several nodes share one DMBR or one node has multiple DMBRs –It is found empirically that two DMBRs per node usually lead to the best results Solution: Two DMBRs per node are used for the NSP-tree –Need to enhance the space utilization SP approaches cannot guarantee a lower bound on space utilization Solution: Heuristics to balance the number of entries in each tree node are extensively applied in the NSP-tree

32 Challenges and Solutions of the NSP-tree (contd) |A| = 10, d = 40, key# = 100,000, $r_q = 3$

33 The NSP-tree Structure –Leaf nodes contain the indexed vectors –Each non-leaf node has a Split History Tree (SHT) and two additional DMBRs for each child –SHT: An auxiliary unbalanced binary tree Each SHT node records info of one space split that occurred in the node

34 Construction Algorithms of The NSP-tree –ChooseLeaf: From root to leaf, choose the child that represents the subspace to which the new vector belongs. If no such child is found, choose the child with the fewest entries –Makes the tree more balanced –Split a node in the NSP-tree For each dimension, sort vectors based on the histogram of the alphabet –More frequent letters are put at either end of the queue –May yield more balanced splits Heuristics, such as largest stretch and balanced split, are applied to choose the best split
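The histogram-based ordering can be sketched for one dimension. The slide does not spell out the placement rule, so the sketch assumes that letters are taken in decreasing frequency and placed alternately at the front and the back of the queue; it then picks the cut that best balances the number of vectors on the two sides. This covers only the balance heuristic, and all names are hypothetical.

```python
from collections import Counter


def order_letters(values):
    """Arrange letters so that the most frequent ones sit at the two ends."""
    hist = Counter(values)
    ranked = sorted(hist, key=lambda x: (-hist[x], x))  # ties broken alphabetically
    front, back = [], []
    for idx, letter in enumerate(ranked):
        (front if idx % 2 == 0 else back).append(letter)
    return front + back[::-1], hist


def most_balanced_split(values):
    """Cut the ordered letter queue where the two sides hold the most similar counts."""
    order, hist = order_letters(values)
    if len(order) < 2:
        raise ValueError("need at least two distinct letters to split")
    total = sum(hist.values())
    best = None
    for cut in range(1, len(order)):
        left_count = sum(hist[x] for x in order[:cut])
        imbalance = abs(total - 2 * left_count)
        if best is None or imbalance < best[0]:
            best = (imbalance, order[:cut], order[cut:])
    return best[1], best[2]


# Values of 11 vectors on one dimension: a occurs 6 times, c and g twice, t once.
values = "aaaaaaccggt"
order, _ = order_letters(values)
print(order, most_balanced_split(values))
```

With the dominant letter at one end, the cut can isolate it: here the 6 vectors with a go to one side and the remaining 5 to the other.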

35 Construction Algorithms (contd) –Adjust the DMBRs Issues arise as two DMBRs per node are used –Randomly picking two DMBRs may not be the best choice The purpose of maintaining two DMBRs for a node is different from node splitting –Want two DMBRs with a combined area as small as possible, but they may overlap The quadratic algorithm of the R-tree could be adapted –Quite expensive A linear algorithm is developed for the NSP-tree –Much faster than the quadratic algorithm –The resulting query performance is comparable to the quadratic approach and much better than using one DMBR per node

36 Comparison with the ND-tree d = 40, |A| = 4, zipf2 and zipf3, respectively

37 Outline 1.Introduction 2.Indexing NDDSs using the ND-tree 3.The NSP-tree: an SP Approach 4.Extending NDDSs into HDSs –HDS concepts –The NDh-tree –Experimental results 5.Choosing A Distance Measure 6.Conclusion

38 Motivations –Data with values of different properties are very common A record in a relational table often consists of both continuous and non-ordered discrete data –Applications that conduct similarity queries on hybrid data are also very common E.g. check known attack patterns in network intrusion detection –How to efficiently conduct similarity queries on hybrid data is an open research area

39 HDS Concepts –A Hybrid Data Space (HDS) is defined as the Cartesian product of both continuous and non-ordered discrete domains Continuous dimensions are assumed to be normalized to [0, 1] –A hybrid rectangle R is defined as the Cartesian product of sets and ranges: S i can be either a set or a range depending on the dimension it belongs to Sets are for non-ordered discrete dimensions, while ranges are for continuous dimensions –A hybrid vector can be deemed as a special case
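A hybrid rectangle can be sketched as a tuple whose components are either sets (discrete dimensions) or closed ranges (continuous dimensions). The representation below, with `frozenset` for sets and `(low, high)` tuples for ranges, is an assumption of this sketch, as are the function names and the toy data.

```python
def hmbr(vectors, discrete_dims):
    """Smallest hybrid rectangle containing all given hybrid vectors."""
    d = len(vectors[0])
    rect = []
    for i in range(d):
        column = [v[i] for v in vectors]
        if i in discrete_dims:
            rect.append(frozenset(column))          # set of observed letters
        else:
            rect.append((min(column), max(column)))  # closed range [low, high]
    return tuple(rect)


def contains(rect, vector):
    """True if the vector lies inside the hybrid rectangle on every dimension."""
    for comp, value in zip(rect, vector):
        if isinstance(comp, frozenset):
            if value not in comp:
                return False
        else:
            low, high = comp
            if not (low <= value <= high):
                return False
    return True


# Dimension 0 is non-ordered discrete; dimension 1 is continuous in [0, 1].
vectors = [("tcp", 0.2), ("udp", 0.5), ("tcp", 0.9)]
rect = hmbr(vectors, discrete_dims={0})
print(rect, contains(rect, ("udp", 0.3)), contains(rect, ("icmp", 0.3)))
```

A hybrid vector is the special case in which every set has one element and every range has equal bounds.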

40 HDS Concepts (contd) –The edge length of R: –Distance measure for HDSs No well-known distance measure Extended Hamming distance (EHD): –Area, overlap, HMBR, …

41 The NDh-tree –Supports similarity queries in HDSs –The tree structure and construction algorithms are similar to those of the ND-tree Hybrid concepts such as HMBRs are used Heuristics are based on hybrid concepts The algorithms are capable of handling continuous dimensions E.g. to generate candidate partitions for an overflow node, the split algorithm of the NDh-tree scans through all dimensions of an HDS. For non-ordered discrete dimensions, either the permutation or the merge-and-sort approach is used. For continuous dimensions, the entries are sorted based on both the low and high bounds of their ranges

42 Comparison with the ND-tree and R*-tree

43 Outline 1.Introduction 2.Indexing NDDSs using the ND-tree 3.The NSP-tree: an SP Approach 4.Extending NDDSs into HDSs 5.Choosing A Distance Measure –Motivation and related work –Our approach –Results –Feature combination as an application 6.Conclusion

44 Motivations –A distance measure is an integral part of the vector model –There are a number of distance measures available (e.g. Euclidean distance, Manhattan distance, …) Different distance measures yield different similarity query results –How to choose an appropriate distance measure is an open research issue

45 Related Work –Performance comparison [Hampapur et al. 01] Based on recall and precision Used in image and video retrieval –Complexity comparison [Hafner et al. 95] Consider computational overhead Prefer simplified distance measures –Noise-distribution-based [Sebe et al. 00] Choose distance measure based on the noise distribution in the data set

46 Our Approach –Establish a theoretical model to analyze the behavior of two widely-used distance measures for NN queries Euclidean distance (EUD) and cosine angle distance (CAD) This model can be extended to analyze other distance measures –Experimentally analyze EUD and CAD for real, normalized and clustered data

47 The Theoretical Model –Basic idea: find the expected rank, under CAD, of the first nearest neighbor found by EUD (NN_e) EUD and CAD are similar if NN_e is ranked high by CAD too –Assume a unit hyper-cube data space and uniform distribution Hyper-cone of NN_e
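The quantity studied by the model, the rank that CAD gives to the EUD nearest neighbor, can also be measured empirically. The sketch below is a simulation, not the theoretical model: it draws uniform points in the unit hyper-cube, finds NN_e for a query, and counts how many points have a smaller angle to the query. Using the angle between vectors as CAD, as well as the parameter values, are assumptions of this sketch; all-zero vectors are assumed not to occur.

```python
import math
import random


def eud(u, v):
    """Euclidean distance."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def angle(u, v):
    """Angle between two non-zero vectors, used here as the cosine angle distance."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norms)))  # clamp rounding errors


def cad_rank_of_eud_nn(db, q):
    """Rank (1 = best) that CAD assigns to the nearest neighbor found by EUD."""
    nn_e = min(db, key=lambda v: eud(q, v))
    nn_angle = angle(q, nn_e)
    return 1 + sum(angle(q, v) < nn_angle for v in db)


# Small experiment on random data (illustrative sizes only).
rng = random.Random(0)
d, n, queries = 8, 200, 20
db = [tuple(rng.random() for _ in range(d)) for _ in range(n)]
ranks = [cad_rank_of_eud_nn(db, tuple(rng.random() for _ in range(d)))
         for _ in range(queries)]
print(sum(ranks) / len(ranks))  # average CAD rank of NN_e
```

An average rank close to 1 means that the two measures tend to agree on the nearest neighbor for that setting.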

48 Theoretical and experimental results –results based on the model DB = random data points –Our empirical results show that the NN query results by EUD and CAD are also quite similar for real, clustered and normalized data in high-dimensional data spaces

49 Discussion –Observation: As the dimension gets even higher, EUD and CAD eventually become less similar –Explanation: Two factors: the dimension and the hyper-angle of the hyper-cone of NN_e

50 Discussion (contd) –Explanation (contd) As the dimension gets higher, the hyper-angle of the hyper-cone of NN_e keeps increasing –Within a certain range of high dimensions, it is reasonable to claim that the NN query results of EUD and CAD are similar for random data

51 Conclusion To support similarity queries in NDDSs, the ND-tree and the NSP-tree are proposed. –Very efficient for similarity queries in NDDSs compared to other techniques. Their scalability is also very good. –The ND-tree is the first index structure of its kind. –A performance estimation model is developed for the ND-tree. –The NSP-tree is an SP approach, which is developed to further explore the problem of overlap in NDDSs. It is shown to be particularly efficient for skewed datasets. The NDh-tree is proposed to support similarity queries in HDSs. It is shown to be very efficient compared to existing methods. A theoretical model is proposed to analyze the behavior of distance measures for similarity queries. A non-trivial relationship between EUD and CAD is revealed using the model

52 Future Work Support more query types using the ND-tree and the NDh-tree –Nearest neighbor queries –Queries that specify ranges on each attribute Study other distance measures for similarity queries in HDSs

53 Thank you!
