
1 Converting Categories to Numbers for Approximate Nearest Neighbor Search
Huang-Cheng Kuo (郭煌政), Department of Computer Science and Information Engineering, Chiayi University, 2004/10/20

2 Outline
• Introduction
• Motivation
• Measurement
• Algorithms
• Experiments
• Conclusion

3 Introduction
• Memory-Based Reasoning
  – Case-Based Reasoning
  – Instance-Based Learning
• Given a training dataset and a new object, predict the class (target value) of the new object.
• Focus on table data

4 Introduction
• K Nearest Neighbor Search
  – Compute the similarity between the new object and each object in the training dataset.
  – Linear time in the size of the dataset
• Similarity: Euclidean distance
• Multi-dimensional index
  – Spatial data structures, such as the R-tree
  – Numeric data

5 Introduction
• Indexing on categorical data?
  – Requires a linear order of the categories
  – Does a correct ordering exist?
  – What is the best ordering?
• Store the mapped data in a multi-dimensional data structure as a filtering mechanism

6 Measurement for Ordering
• Ordering Problem: given an undirected weighted complete graph, a simple path visiting every vertex is an ordering of the vertices; the edge weights are the distances between pairs of vertices. The ordering problem is to find a path, called the ordering path, of maximal value according to a given scoring function.

7 Measurement for Ordering
• Relationship scoring: the reasonable ordering score
• In an ordering path, a 3-tuple (v_{i-1}, v_i, v_{i+1}) is reasonable if and only if dist(v_{i-1}, v_{i+1}) ≥ dist(v_{i-1}, v_i) and dist(v_{i-1}, v_{i+1}) ≥ dist(v_i, v_{i+1}).
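The slides state the per-tuple test but not how it aggregates into a score; a minimal sketch, assuming the score is the fraction of reasonable 3-tuples along the path (consistent with the ≈1/3 baseline for "no ordering" quoted in the experiments) and that `dist` is a nested dict of category distances:

```python
def reasonable_score(path, dist):
    """Fraction of consecutive 3-tuples (v[i-1], v[i], v[i+1]) on the path
    that are 'reasonable': the two-step distance is at least as large as
    each of the two one-step distances.  Requires len(path) >= 3."""
    n = len(path)
    ok = sum(
        1 for i in range(1, n - 1)
        if dist[path[i - 1]][path[i + 1]] >= dist[path[i - 1]][path[i]]
        and dist[path[i - 1]][path[i + 1]] >= dist[path[i]][path[i + 1]]
    )
    return ok / (n - 2)
```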

8 Measurement for Mapping
• Pairwise difference scoring
  – Normalized distance matrix
  – Mapping values of categories
  – Dist_m(v_i, v_j) = |mapping(v_i) - mapping(v_j)|
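A small sketch of pairwise difference scoring as a root mean squared error (the RMSE metric named later in the experiments and conclusion), assuming `dist` holds the normalized distance matrix and `mapping` the per-category numbers; both names are illustrative:

```python
import itertools

def pairwise_rmse(categories, dist, mapping):
    """RMSE between the normalized category distances and the distances
    induced by the 1-D mapping, over all unordered category pairs."""
    pairs = list(itertools.combinations(categories, 2))
    se = sum((dist[u][v] - abs(mapping[u] - mapping[v])) ** 2
             for u, v in pairs)
    return (se / len(pairs)) ** 0.5
```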

9 Algorithms
• Prim-like Ordering
• Kruskal-like Ordering
• Divisive Ordering
• GA Approach Ordering
  – A vertex is a category
  – A graph represents a distance matrix

10 Prim-like Ordering Algorithm
• Prim's Minimum Spanning Tree
  – Initially, choose a least-weight edge (u, v)
  – Add the edge to the tree; S = {u, v}
  – Choose a least-weight edge connecting a vertex in S and a vertex w not in S
  – Add the edge to the tree; add w to S
  – Repeat until all vertices are in S

11 Prim-like Ordering Algorithm
• Prim-like Ordering
  – Choose a least-weight edge (u, v)
  – Add the edge to the ordering path; S = {u, v}
  – Choose a least-weight edge connecting a vertex in S and a vertex w not in S
  – If the edge would create a cycle on the path, discard the edge and choose again
  – Otherwise, add the edge to the ordering path; add w to S
  – Repeat until all vertices are in S
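A compact sketch of the Prim-like ordering, again assuming a nested-dict distance matrix; discarding cycle-creating edges is implemented directly as extending only from the two current path endpoints, which is equivalent:

```python
def prim_like_ordering(categories, dist):
    """Grow an ordering path from the globally least-weight edge; each step
    attaches the outside vertex with the cheapest edge to one of the two
    path endpoints (attaching anywhere else would break the path)."""
    u, v = min(
        ((a, b) for i, a in enumerate(categories) for b in categories[i + 1:]),
        key=lambda e: dist[e[0]][e[1]],
    )
    path, rest = [u, v], set(categories) - {u, v}
    while rest:
        end, w = min(
            ((e, w) for e in (path[0], path[-1]) for w in rest),
            key=lambda ew: dist[ew[0]][ew[1]],
        )
        rest.remove(w)
        if end == path[0]:
            path.insert(0, w)
        else:
            path.append(w)
    return path
```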

12 Kruskal-like Ordering Algorithm
• Kruskal's Minimum Spanning Tree
  – Initially, choose a least-weight edge (u, v)
  – Add the edge to the tree; S = {u, v}
  – Choose a least-weight edge, as long as the edge does not create a cycle in the tree
  – Add the edge to the tree; add the two vertices to S
  – Repeat until all vertices are in S

13 Kruskal-like Ordering Algorithm
• Kruskal-like Ordering
  – Initially, choose a least-weight edge (u, v) and add it to the ordering path; S = {u, v}
  – Choose a least-weight edge, as long as the edge does not create a cycle on the path and the degree of each endpoint on the path stays ≤ 2
  – Add the edge to the ordering path; add the two vertices to S
  – Repeat until all vertices are in S
• A heap can be used to speed up choosing the least-weight edge
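A sketch of the Kruskal-like ordering under the same nested-dict assumption; a union-find structure detects cycles and a degree cap keeps the accepted edges a simple path (a heap could replace the up-front sort, as the slide notes):

```python
def kruskal_like_ordering(categories, dist):
    """Scan edges by increasing weight; accept an edge only if both endpoints
    still have degree < 2 and it does not close a cycle among accepted edges.
    On a complete graph this always yields one Hamiltonian path."""
    parent = {c: c for c in categories}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    degree = {c: 0 for c in categories}
    adj = {c: [] for c in categories}
    edges = sorted(
        (dist[a][b], a, b)
        for i, a in enumerate(categories) for b in categories[i + 1:]
    )
    for w, a, b in edges:
        if degree[a] < 2 and degree[b] < 2 and find(a) != find(b):
            parent[find(a)] = find(b)
            degree[a] += 1
            degree[b] += 1
            adj[a].append(b)
            adj[b].append(a)
    # Walk the resulting path from one of its two degree-1 endpoints.
    start = next(c for c in categories if degree[c] == 1)
    path, prev = [start], None
    while len(path) < len(categories):
        nxt = next(n for n in adj[path[-1]] if n != prev)
        prev = path[-1]
        path.append(nxt)
    return path
```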

14 Divisive Ordering Algorithm
• Idea:
  – Pick a central vertex, and split the remaining vertices
  – Build a binary tree: the vertices are the leaves
• Central vertex: [formula not captured in the transcript]

15 Divisive Ordering Algorithm
• A_R is closer to P than A_L is.
• B_L is closer to P than B_R is.
• (Figure: central vertex P between subtrees A = {A_L, A_R} and B = {B_L, B_R})

16 Clustering
• Splitting a set of vertices into two groups
  – Each group has at least one vertex
  – Close (similar) vertices in the same group; distant vertices in different groups
• Clustering algorithms
  – Two clusters

17 Clustering
• Clustering
  – Grouping a set of objects into classes of similar objects
• Agglomerative Hierarchical Clustering Algorithm
  – Start with singleton clusters
  – Repeatedly merge the most similar clusters

18 Clustering
• Clustering Algorithm: Cluster Similarity
  – Single link: dist(C_i, C_j) = min dist(p, q), p in C_i, q in C_j
  – Complete link: dist(C_i, C_j) = max dist(p, q), p in C_i, q in C_j
  – Average link (adopted in our study): dist(C_i, C_j) = avg dist(p, q), p in C_i, q in C_j
  – Others
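The three linkage formulas above transcribe directly to Python; `dist` is again an assumed nested dict of pairwise distances:

```python
def single_link(ci, cj, dist):
    """Minimum pairwise distance between members of the two clusters."""
    return min(dist[p][q] for p in ci for q in cj)

def complete_link(ci, cj, dist):
    """Maximum pairwise distance between members of the two clusters."""
    return max(dist[p][q] for p in ci for q in cj)

def average_link(ci, cj, dist):
    """Mean pairwise distance; the variant adopted in this study."""
    return sum(dist[p][q] for p in ci for q in cj) / (len(ci) * len(cj))
```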

19 Clustering
• Clustering implementation issues
  – Which pair of clusters to merge: keep the cluster-to-cluster similarity for each pair
  – Recursively partition sets of vertices while building the binary tree: a non-recursive version uses a stack (see the sketch below)
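A sketch of that stack-based, non-recursive traversal; `split_two_way` stands in for any 2-way clustering routine (e.g., average-link agglomerative clustering stopped at two clusters) and is a hypothetical hook, as is the left-before-right emit order:

```python
def divisive_ordering(categories, dist, split_two_way):
    """Build the binary tree top-down without recursion: a stack holds
    groups still to be split; singleton groups are emitted as leaves,
    giving the final ordering path left to right."""
    order, stack = [], [list(categories)]
    while stack:
        group = stack.pop()
        if len(group) == 1:
            order.append(group[0])
            continue
        left, right = split_two_way(group, dist)
        # Push right first so the left subtree is processed (emitted) first.
        stack.append(right)
        stack.append(left)
    return order
```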

20 GA Ordering Algorithm
• Genetic Algorithms for optimization problems
• Chromosome: a solution
• Population: a pool of solutions
• Genetic operations
  – Crossover
  – Mutation

21 GA Ordering Algorithm
• Encoding a solution
  – Binary string
  – Ordered list of categories (used in our ordering problem)
• Fitness function
  – Reasonable ordering score
• Selecting chromosomes for crossover
  – High fitness value => high selection probability
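A minimal sketch of fitness-proportional (roulette-wheel) selection, assuming `fitness` returns a chromosome's reasonable ordering score; both names are illustrative:

```python
import random

def select_parents(population, fitness):
    """Pick two chromosomes for crossover with probability proportional
    to fitness, so higher-scoring orderings reproduce more often."""
    weights = [fitness(ch) for ch in population]
    return random.choices(population, weights=weights, k=2)
```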

22 GA Ordering Algorithm
• Crossover
  – Single point
  – Multiple points
  – Mask
• Example: single-point crossover of AB|CDE and BD|AEC results in ABAEC and BDCDE => illegal (duplicated categories)

23 GA Ordering Algorithm
• Repairing the illegal chromosome ABAEC
  – AB*EC => fill D into the * position
• Repairing the illegal chromosome ABABC
  – AB**C
  – D and E are missing
  – Whichever is closest to B is filled into the first * position
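A sketch of single-point crossover plus the repair heuristic described above; the closest-to-left-neighbor rule is applied to every repaired slot, which generalizes slightly beyond the first-slot case the slide spells out:

```python
def crossover(p1, p2, point):
    """Single-point crossover of two orderings (lists of categories).
    Offspring may repeat categories and thus be illegal permutations."""
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def repair(child, categories, dist):
    """Turn an illegal child into a permutation: each second occurrence of a
    category becomes a slot, filled with the missing category closest to the
    slot's left neighbor.  E.g. ABAEC -> AB*EC -> ABDEC."""
    seen, slots = set(), []
    for i, c in enumerate(child):
        if c in seen:
            slots.append(i)          # duplicate -> slot to repair
        seen.add(c)
    missing = set(categories) - set(child)
    for i in slots:
        left = child[i - 1]          # slots never sit at index 0
        best = min(missing, key=lambda m: dist[left][m])
        child[i] = best
        missing.remove(best)
    return child
```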

24 Mapping Function
• Ordering path
• Mapping(v_i) = [formula not captured in the transcript]
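Since the formula itself was lost, here is one plausible reconstruction, offered only as an assumption: map each category to its cumulative distance from the head of the ordering path, so that gaps between adjacent categories mirror their distances.

```python
def path_mapping(path, dist):
    """Assumed mapping (not confirmed by the slide): mapping(v_1) = 0 and
    mapping(v_i) = mapping(v_{i-1}) + dist(v_{i-1}, v_i) along the path."""
    mapping, total = {path[0]: 0.0}, 0.0
    for prev, cur in zip(path, path[1:]):
        total += dist[prev][cur]
        mapping[cur] = total
    return mapping
```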

25 Experiments
• Synthetic Data (width/length = 5)

26 Experiments
• Synthetic Data (width/length = 10)

27 Experiments
• Synthetic data: reasonable ordering score for the divisive algorithm
  – width/length = 5 => 0.82
  – width/length = 10 => 0.9
  – No ordering => 1/3
• The divisive algorithm is better than the Prim-like algorithm when the number of categories > 100

28 Experiments
• Synthetic Data (width/length = 5)

29 Experiments
• Synthetic Data (width/length = 10)

30 Experiments
• Divisive ordering is the best among the three ordering algorithms
• For the divisive ordering algorithm on > 100 categories, RMSE scores are around 0.07 when width/length = 5, and around 0.05 when width/length = 10
• Prim-like ordering algorithm: 0.12 and 0.1, respectively

31 Experiments
• "Census-Income" dataset from the University of California, Irvine (UCI) KDD Archive
• 33 nominal attributes, 7 continuous attributes
• Sampled 5000 records as the training dataset
• Sampled 2000 records for the approximate KNN search experiment

32 Experiments
• Distance matrix: distance between two categories
• V. Ganti, J. Gehrke, and R. Ramakrishnan, "CACTUS: Clustering Categorical Data Using Summaries," ACM KDD, 1999
• D = {d_1, d_2, …, d_n} is a set of n tuples, D ⊆ D_1 × D_2 × … × D_k, where D_i is a categorical domain for 1 ≤ i ≤ k
• Each tuple d_i = (d_i1, d_i2, …, d_ik)
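The slide cites CACTUS for the category-to-category distances but does not reproduce the formula. Purely as an illustration (not the paper's definition), one way to build such a matrix is from co-occurrence counts with the other attributes:

```python
from collections import defaultdict

def category_distance_matrix(records, attr, other_attrs):
    """Illustrative only: represent each category of `attr` by its
    co-occurrence counts with values of the other attributes, then use
    1 - cosine similarity of those count vectors as the distance."""
    profile = defaultdict(lambda: defaultdict(float))
    for rec in records:
        for oa in other_attrs:
            profile[rec[attr]][(oa, rec[oa])] += 1.0

    def cosine(p, q):
        dot = sum(p[k] * q[k] for k in set(p) & set(q))
        norm_p = sum(x * x for x in p.values()) ** 0.5
        norm_q = sum(x * x for x in q.values()) ** 0.5
        return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0

    cats = list(profile)
    return {a: {b: 1.0 - cosine(profile[a], profile[b]) for b in cats}
            for a in cats}
```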

33 Experiments

34 Experiments
• Approximate KNN – nominal attributes

35 Experiments
• Approximate KNN – nominal attributes

36 Experiments
• Approximate KNN – nominal attributes

37 Experiments
• Approximate KNN – all attributes

38 Experiments
• Approximate KNN – all attributes

39 Experiments
• Approximate KNN – all attributes

40 Conclusion
• Developed ordering algorithms
  – Prim-like
  – Kruskal-like
  – Divisive
  – GA-based
• Devised measurements
  – Reasonable ordering score
  – Root mean squared error

41 Conclusion
• What next?
  – New categories, new mapping function
  – New index structure?
  – Training the mapping function for a given ordering path

42 Thank you.

