
1 Vertical Set Square Distance Based Clustering without Prior Knowledge of K
Amal Perera, Taufik Abidin, Masum Serazi, Dept. of CS, North Dakota State University.
George Hamer, Dept. of CS, South Dakota State University.
William Perrizo, Dept. of CS, North Dakota State University.

2 Outline
- Introduction
- Background
- Approach
- Results
- Conclusions

3 Introduction
Clustering: automated identification of groups of objects based on similarity.
Two major problems in clustering:
- Scalability
- The need for input parameters
We propose Vertical Set Squared Distance based clustering.

4 Background
Clustering algorithms work well on small datasets. Current approaches for large datasets:
- Sampling, e.g. CLARA: choosing a representative sample; CLARANS: selecting a randomized sample for each iteration.
- Preserving summary statistics, e.g. BIRCH: a tree structure that records sufficient statistics for the dataset.
- Requirement for input parameters.
The above techniques may lead to suboptimal solutions.

5 Background (Cont.)
Current clustering algorithms require input parameters, e.g.:
- DENCLUE: grid cell size?
- DBSCAN: neighborhood radius? minimum number of points in a core neighborhood?
- K-Means / K-Medoids: K?
Results are sensitive to the input parameters.

6 Background (Cont.)
Some approaches toward parameter-less clustering:
- OPTICS: computes an augmented cluster ordering; costs O(n log n).
- G-Means: uses Gaussian properties in the data, if they exist!
- ACE: maps the search space to a weighted grid; depends on a heuristic search.

7 Our Approach
Scalability is addressed by:
- A partition-based approach.
- A vertical data structure (the P-tree).
- Efficient computation of the Vertical Set Squared Distance (VSSD) for the entire data set (an influence function).
The need for the parameter K is addressed by observing the difference in influence for each data point.

8 Influence Functions
Influence: describes the impact of a data point within its neighborhood. Examples of influence functions are sketched below.
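As standard examples from the density-based clustering literature (e.g., DENCLUE), the square-wave and Gaussian influence functions take the forms below; the parabolic (squared-distance) influence used in this work is computed on slide 11. These specific forms are a sketch from that literature, not recovered from the slide itself:

```latex
% Standard influence-function examples (from the DENCLUE literature,
% shown here for reference; not taken from the slide's figure).
% d(x,y) is a distance metric and \sigma a neighborhood-scale parameter.
f^{\,x}_{\mathrm{square}}(y) =
  \begin{cases} 1 & \text{if } d(x,y) \le \sigma \\ 0 & \text{otherwise} \end{cases}
\qquad
f^{\,x}_{\mathrm{Gauss}}(y) = e^{-d(x,y)^2 / 2\sigma^2}
```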

9 P-tree Vertical Data Structure
Predicate-trees (P-trees): lossless, compressed, and data-mining-ready. Successfully used in KNN, ARM, Bayesian classification, SVM, etc.
A basic P-tree represents one attribute bit reorganized into a tree structure by recursively subdividing it, while recording the truth value of a purity predicate for each division. Each level of the tree contains truth bits that represent pure subtrees; construction continues recursively down each path until a pure subdivision is reached.

10 (Figure: building 1-dimensional P-trees for a relation R(A1, A2, A3, A4).)
A file R(A1..An) contains horizontal structures (a set of horizontal records) but is processed vertically: each attribute is partitioned into bit slices, and each vertical bit slice is compressed into a basic P-tree by recording the truth of the predicate "pure 1" recursively on halves until there is purity. For the slice P11 in the figure: the whole file is not pure-1 (0); the 1st half is pure (pure-0), so that branch ends; the 2nd half is not pure-1 (0) and is subdivided further until every branch reaches a pure half.
The basic P-trees are then processed horizontally using one multi-operand logical AND. E.g., to count occurrences of the value pattern 111 000 001 100, AND the corresponding P-trees (complemented where the pattern bit is 0): P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43.
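A minimal sketch of this construction in Python, assuming a plain in-memory tuple representation (function names such as build_ptree and root_count are illustrative, not the paper's API):

```python
# Minimal sketch of basic P-tree construction: recursively halve a
# vertical bit slice, ending a branch as soon as it is pure (all 1s
# or all 0s).

def build_ptree(bits):
    """Return ('pure', bit) for a pure run, else ('mixed', left, right)."""
    if all(b == 1 for b in bits):
        return ('pure', 1)            # pure-1 branch ends with truth bit 1
    if all(b == 0 for b in bits):
        return ('pure', 0)            # pure-0 branch ends with truth bit 0
    mid = len(bits) // 2
    return ('mixed', build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(tree, size):
    """rc(P): the number of 1-bits the tree represents."""
    if tree[0] == 'pure':
        return size * tree[1]
    half = size // 2
    return root_count(tree[1], half) + root_count(tree[2], size - half)

# Example: an 8-row bit slice whose first half is pure 0
p = build_ptree([0, 0, 0, 0, 1, 0, 1, 1])
print(p)                   # ('mixed', ('pure', 0), ('mixed', ...))
print(root_count(p, 8))    # 3
```

Counting tuples that match a full bit pattern then reduces to AND-ing the corresponding basic P-trees (complemented for 0-bits) and taking the root count, as in the slide's pattern example.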

11 Efficient Computation of Influence
Use the Vertical Set Squared Distance (VSSD) to compute the parabolic influence:

$f(a, X) = \sum_{x \in X} (x - a) \cdot (x - a) = T_1 + T_2 + T_3$

where

$T_1 = \sum_i \sum_j \sum_k 2^{j+k} \, rc(P_X \wedge P_{i,j} \wedge P_{i,k})$
$T_2 = -2 \sum_i a_i \sum_j 2^j \, rc(P_X \wedge P_{i,j})$
$T_3 = rc(P_X) \sum_i a_i^2$

- P_{i,j}: P-tree for the jth bit of the ith attribute.
- rc(P): root count of a P-tree (its number of truth bits).
- P_X: P-tree (mask) for the subset X.
The count operations above are independent of a, so they can be pre-computed and reused; for a different P_X, the counts can again be computed quickly with P-trees.
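A sketch of the T1 + T2 + T3 computation, using plain NumPy boolean columns in place of compressed P-trees so that rc(.) is just a masked popcount (function and variable names are our own, not the paper's):

```python
# VSSD sketch: f(a, X) = T1 + T2 + T3, with rc(.) emulated by a masked
# popcount over uncompressed bit columns (stand-ins for P-trees).
import numpy as np

def vssd(a, P, mask):
    """a[i]: value of attribute i of the target point (non-negative ints).
    P[i][j]: boolean column for bit j (LSB first) of attribute i.
    mask: boolean column selecting the subset X (the P_X mask)."""
    rc = lambda col: int(np.count_nonzero(col & mask))   # root count under P_X
    T1 = T2 = 0
    T3 = int(mask.sum()) * sum(int(v) ** 2 for v in a)   # |X| * sum_i a_i^2
    for i, cols in enumerate(P):
        # T1: sum_{x in X} x_i^2, expanded over bit pairs (j, k)
        for j, cj in enumerate(cols):
            for k, ck in enumerate(cols):
                T1 += (1 << (j + k)) * rc(cj & ck)
        # T2: -2 * a_i * sum_{x in X} x_i, via per-bit root counts
        T2 -= 2 * int(a[i]) * sum((1 << j) * rc(cj) for j, cj in enumerate(cols))
    return T1 + T2 + T3

# Tiny check against the brute-force definition sum_{x in X} ||x - a||^2
X = np.array([[2, 7], [5, 1], [6, 3]])
P = [[((X[:, i] >> j) & 1).astype(bool) for j in range(3)] for i in range(2)]
mask = np.ones(len(X), dtype=bool)
a = np.array([4, 2])
print(vssd(a, P, mask))               # 36, via root counts
print(int(((X - a) ** 2).sum()))      # 36, brute force
```

Because T1 and the per-bit counts inside T2 do not involve a, they can be computed once per subset X and reused across all candidate points, which is what makes the influence computation fast.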

12 Algorithm
1. Compute the set squared distance (VSSD) for all points.
2. Sort the VSSD values.
3. Find the difference between values i and i+1 (the gap).
4. Compute mean(gap) and stdev(gap).
5. Identify gaps > mean + 3 * stdev (large gaps).
6. Break into clusters using the large gaps as partition boundaries.
7. Compute the set squared distance for every tuple a against each cluster C_i.
8. Re-assign clusters based on min[SetSqdDist(a, C_i)].
9. Iterate until the maximum iteration count, no change in cluster sizes, or oscillation.
(The slide works a small example: the values are sorted, the gaps are computed, and the single gap exceeding mean + 3 * stdev splits the values into two natural clusters. A sketch of steps 1-6 follows.)
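A minimal sketch of steps 1-6 in Python (the statistics-module implementation is ours; the 3-sigma cutoff needs enough points to fire, so the toy data below is larger than the slide's example):

```python
# Sketch of the gap-detection stage (steps 1-6): sort the VSSD values,
# flag gaps larger than mean + 3*stdev, and cut the ordering there.
import statistics

def find_partitions(vssd_values):
    """Group point indices into natural clusters by large VSSD gaps."""
    order = sorted(range(len(vssd_values)), key=vssd_values.__getitem__)
    dists = [vssd_values[i] for i in order]
    gaps = [b - a for a, b in zip(dists, dists[1:])]             # step 3
    cutoff = statistics.mean(gaps) + 3 * statistics.stdev(gaps)  # steps 4-5
    clusters, current = [], [order[0]]
    for pos, gap in enumerate(gaps):                             # step 6
        if gap > cutoff:
            clusters.append(current)        # large gap: close this cluster
            current = []
        current.append(order[pos + 1])
    clusters.append(current)
    return clusters

# One large gap separates the low values from the high ones
vals = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 50, 51, 52]
print([[vals[i] for i in c] for c in find_partitions(vals)])
# -> [[0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6], [50, 51, 52]]
```

Steps 7-9 then refine these initial partitions k-means-style: each tuple is reassigned to the cluster with the smallest set squared distance, iterating until the cluster sizes stabilize.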

13 Cluster separation with VSSD (figure): a gap > µ + 3σ marks a cluster boundary.

14 Experimental Results
Objective: quality and scalability.
Datasets:
- Synthetic data (quality)
- Iris Plant data (quality)
- KDD-99 Network Intrusion data (quality)
- Remotely Sensed Image data (scalability)
Quality is measured with the F-measure between each found cluster and each original cluster; F = 1 for a perfect clustering.
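A sketch of the standard clustering F-measure, matching found clusters against original classes via precision and recall; this common definition is consistent with the slide's description, though the exact variant used in the paper is an assumption:

```python
# Clustering F-measure sketch. For original class i and found cluster j
# sharing n_ij points: precision P = n_ij/|cluster j|, recall R = n_ij/|class i|,
# F(i,j) = 2PR/(P+R); overall F = sum_i (|class i|/n) * max_j F(i,j).
from collections import Counter

def f_measure(original, found):
    """original, found: cluster labels per point. F = 1 for a perfect match."""
    n = len(original)
    pairs = Counter(zip(original, found))    # n_ij counts
    class_size = Counter(original)
    cluster_size = Counter(found)
    total = 0.0
    for i, ni in class_size.items():
        best = 0.0
        for j, nj in cluster_size.items():
            nij = pairs[(i, j)]
            if nij:
                p, r = nij / nj, nij / ni
                best = max(best, 2 * p * r / (p + r))
        total += (ni / n) * best             # weight by class size
    return total

print(f_measure([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0: same grouping, renamed labels
print(f_measure([0, 0, 1, 1], [0, 1, 0, 1]))  # 0.5: unrelated grouping
```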

15 Results: Cluster Quality
Synthetic data (executed until F-measure = 1):
- VSSD: required DB scans 226
- K-Means (given k): required DB scans 8814

16 Results: Cluster Quality
Iris data (3 known clusters):

            VSSD   K-Means (K=3)   (K=4)   (K=5)
Iterations     5              16      38      24
F-Measure   0.84            0.80    0.74    0.69

KDD-99 Network Intrusion data, with 6 / 4 / 2 clusters, VSSD vs. K-means (K = 5, 6, 7 / 3, 4, 5 / 2):
Iterations 7 10 12 9 16 12 16 36; F-Measure 0.81 0.80 0.79 0.90.

17 Results: Scalability
RSI data, six 8-bit attributes per data point (scalability figure).

18 Conclusions
- Ordering based on the set squared distance may be used to partition a data set into natural clusters (finding K).
- The set squared distance can be computed efficiently using P-trees.
- Vertical Set Squared Distance clustering is a scalable technique that can produce high-quality clusters without the need for user parameters.


