Taufik Abidin and William Perrizo


1 An Alternative Arrangement of Symmetric Datasets for Vertical Clustering Algorithms
Taufik Abidin and William Perrizo Computer Science North Dakota State University

2 Outline
Symmetric datasets and their applications
Problems in the n x n symmetric dataset
Proposed solution
Performance evaluation
Summary

3 Symmetric Dataset
A symmetric dataset is an n x n dataset that is equal to its own transpose (an undirected unipartite graph). The dataset may record pre-computed information, such as the pair-wise similarity of genes or the pair-wise Euclidean distance of objects.
Examples of algorithms that use symmetric datasets: Clustering Microarray Data Based on Density Shared NN and Vertical Density-based Clustering Algorithms
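To make the definition concrete, here is a minimal Python sketch that builds a pair-wise Euclidean distance matrix and checks that it equals its transpose. The function names and the three sample points are illustrative assumptions, not part of the original slides.

```python
import math


def pairwise_distances(points):
    """Return the n x n matrix of Euclidean distances between all points."""
    n = len(points)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(points[i], points[j])  # Euclidean distance (Python 3.8+)
            dist[i][j] = d
            dist[j][i] = d  # mirror the value: d(Oi, Oj) == d(Oj, Oi)
    return dist


def is_symmetric(matrix):
    """True if the square matrix equals its transpose."""
    n = len(matrix)
    return all(matrix[i][j] == matrix[j][i] for i in range(n) for j in range(n))


# Hypothetical sample objects, chosen only for illustration.
points = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
dist = pairwise_distances(points)
print(is_symmetric(dist))
```

Only the upper triangle is computed; the lower triangle is filled by mirroring, which is exactly the redundancy a symmetric dataset carries.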

4 Symmetric Dataset
Example of a symmetric dataset: the pre-computed pair-wise similarity of genes, where each entry is the Pearson's correlation coefficient of the expression signals of a pair of genes.
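As a rough illustration of such a similarity dataset, the sketch below computes Pearson's correlation coefficient for every pair of expression signals. The gene signals and the function names are hypothetical and serve only to show the shape of the result.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient of two equal-length signals."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def similarity_matrix(signals):
    """n x n matrix whose (i, j) entry is the correlation of signals i and j."""
    n = len(signals)
    return [[pearson(signals[i], signals[j]) for j in range(n)] for i in range(n)]


# Hypothetical expression signals for three genes (illustrative values only).
genes = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],
    [4.0, 3.0, 2.0, 1.0],
]
sim = similarity_matrix(genes)
```

Because the correlation of gene i with gene j is the same as that of gene j with gene i, the resulting matrix is symmetric.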

5 Problem and Alternative
When n is large, arranging symmetric datasets into n x n is impractical, because there are then n² tuples.
Alternative solution: organize the symmetric dataset into n' x m, with n' >> m, instead of n x n.
In other words, the cardinality is extended and the dimension is narrowed, but the number of elements remains the same, n² (= n' x m), as in n x n.
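The rearrangement can be sketched in a few lines of Python. This sketch assumes the values are laid out in row-major order, which the slides do not confirm; the function name and the 4 x 4 sample matrix are likewise illustrative.

```python
def rearrange(matrix, m):
    """Rearrange an n x n matrix into n' rows of m values each.

    Assumption: values are taken in row-major order. The number of
    elements is unchanged, so n' * m == n * n.
    """
    n = len(matrix)
    total = n * n
    if m <= 0 or total % m != 0:
        raise ValueError("m must be a positive divisor of n^2")
    flat = [value for row in matrix for value in row]
    return [flat[k:k + m] for k in range(0, total, m)]


# Hypothetical 4 x 4 symmetric matrix, narrowed to dimension m = 2.
matrix = [
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0],
]
narrow = rearrange(matrix, 2)
print(len(narrow), len(narrow[0]))
```

With n = 4 and m = 2 the result has n' = 8 rows, so the cardinality doubles while the dimension is halved.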

6 Alternative
[Figure: the pair-wise distances d(Oi, Oj) of the n x n dataset rearranged into the n' x m layout]
Algorithm: Determine n' and m
Input: mo
Output: n' and m
  let mu = md = mo
  while ((n² mod mu) != 0 && (n² mod md) != 0)
    mu++
    md--
  endwhile
  if (n² mod md) == 0
    m = md
  else
    m = mu
  endif
  n' = n² / m
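The search for m can be sketched in Python as follows. This is one reading of the algorithm, not necessarily the authors' implementation: it walks outward from the requested dimension mo one step at a time and stops at the first candidate that divides n², preferring the smaller candidate on a tie. The name `choose_dimensions` is hypothetical.

```python
def choose_dimensions(n, m0):
    """Return (n_prime, m), where m is a divisor of n^2 nearest to m0.

    Assumptions: 1 <= m0 <= n^2, and a tie between the upward and
    downward candidates is resolved in favour of the smaller one.
    """
    total = n * n
    if not 1 <= m0 <= total:
        raise ValueError("m0 must lie between 1 and n^2")
    up = down = m0
    # Move both candidates away from m0 until one of them divides n^2.
    # The loop always ends, because down reaches 1 at the latest.
    while total % up != 0 and total % down != 0:
        up += 1
        down -= 1
    m = down if total % down == 0 else up
    return total // m, m


print(choose_dimensions(4000, 1000))
print(choose_dimensions(8000, 2000))
```

For 4,000 objects and mo = 1000 this yields the 16K x 1000 arrangement, and for 8,000 objects and mo = 2000 the 32K x 2000 arrangement, both of which appear in the evaluation.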

7 Alternative
[Figure: matrix of the pair-wise distances d(Oi, Oj) in the n' x m arrangement]


9 Performance Evaluation
Two datasets: 4,000 and 8,000 objects.
It takes 1.71 and 6.91 minutes to compute the pair-wise Euclidean distances of 4,000 and 8,000 objects, respectively.

Dataset       # Vertical Bit Vectors
8K x 8000     82,143
4K x 4000     41,730
16K x 4000    41,311
8K x 2000     20,958
32K x 2000    20,777
16K x 1000    10,497
64K x 1000    10,452
32K x 500     5,250
128K x 500    5,251
64K x 250     2,626

10 Performance Evaluation

11 Performance Evaluation

12 Performance Evaluation
Definitions:
Neighbors:
Density:
* d2(Oi,Oj)

13 Performance Evaluation
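Execution   4Kx4000           8Kx2000           16Kx1000          32Kx500           64Kx250
            Density  Cluster  Density  Cluster  Density  Cluster  Density  Cluster  Density  Cluster
1           8.04     0.21     6.41     0.27     6.96     0.42     10.03    0.74     18.02    1.41
2           8.08     -        6.37     -        6.98     -        10.04    -        18.00    1.40
3           7.99     -        6.39     -        -        -        10.05    0.73     18.03    1.39
4           7.98     -        6.36     0.28     6.99     0.41     -        -        17.99    -
5           8.00     -        6.42     -        -        -        -        -        18.01    -
Average     8.02     -        -        -        -        -        -        -        -        -
Total Time  8.23              6.66              7.40              10.78             19.41
(- marks a value that is absent from the transcript.)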

14 Performance Evaluation
Execution   8Kx8000           16Kx4000          32Kx2000          64Kx1000          128Kx500
            Density  Cluster  Density  Cluster  Density  Cluster  Density  Cluster  Density  Cluster
1           33.60    7.03     22.00    8.10     23.46    12.71    37.58    22.99    71.27    43.97
2           34.36    7.12     21.22    8.02     23.51    12.74    37.60    -        -        43.96
3           33.15    6.92     21.14    8.04     -        -        37.59    22.98    71.23    43.93
4           33.28    6.93     -        8.06     23.48    -        37.53    -        71.20    -
5           -        -        21.68    -        23.49    12.73    37.55    22.97    71.25    43.92
Average     33.51    7.01     21.45    -        -        12.72    37.57    -        71.24    43.95
Total Time  40.51             29.51             36.20             60.55             115.19
(- marks a value that is absent from the transcript.)

15 Summary
We have presented an alternative arrangement of symmetric datasets (unipartite undirected graphs) for vertical clustering algorithms and analyzed its performance.
Narrowing the dimension and extending the cardinality of symmetric datasets is useful: in our experiments, total time was lowest for the 8K x 2000 and 16K x 4000 arrangements and rose again as the dimension was narrowed further.
Our study shows that a dimension in the range of 1,000 to 2,000 is a good option for datasets containing 4,000 and 8,000 objects.

