
Vertical K Median Clustering


1 Vertical K Median Clustering
Amal Perera, William Perrizo {amal.perera, Dept. of CS, North Dakota State University. CATA 2006 – Seattle, Washington

2 Vertical K Median Clustering
Outline: Introduction, Background, Our Approach, Results, Conclusions

3 Vertical K Median Clustering
Introduction. Clustering: automated identification of groups of objects based on similarity. Application areas include data mining, search-engine indexing, pattern recognition, image processing, trend analysis, and many others. Clustering algorithms: partition, hierarchical, density, and grid based. Major problem: scalability with respect to data set size. We propose a partition-based vertical K Median clustering.

4 Vertical K Median Clustering
Background. Many clustering algorithms work well on small datasets. Current approaches for large data sets include: sampling, e.g. CLARA (choosing a representative sample) and CLARANS (selecting a randomized sample for each iteration); and preserving summary statistics, e.g. BIRCH (a tree structure that records sufficient statistics for the data set). These require input parameters based on prior knowledge, and the techniques above may lead to suboptimal solutions.

5 Vertical K Median Clustering
Background. Partition clustering (k): the n objects in the original data set are broken into k partitions (iteratively, each time resulting in an improved k-clustering) to achieve a certain optimality criterion. Computational steps: find a representative for each cluster component; assign the other objects to the cluster of the best representative; calculate the error (repeat if the error is too high). A minimal sketch of this scheme appears below.
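The following is a minimal, illustrative sketch of the generic scheme just described, using the mean as the representative for brevity (all names are ours, not from the paper):

```python
import numpy as np

def partition_cluster(points, k, max_iter=50, tol=1e-6):
    """Generic partition clustering: pick a representative per cluster,
    assign every object to its best representative, measure the error,
    and repeat while the error keeps improving."""
    rng = np.random.default_rng(0)
    reps = points[rng.choice(len(points), size=k, replace=False)]
    prev_error = np.inf
    for _ in range(max_iter):
        # assign others to the cluster of the best (closest) representative
        dists = np.linalg.norm(points[:, None, :] - reps[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # calculate the error; repeat only while it is still improving
        error = (dists[np.arange(len(points)), labels] ** 2).sum()
        if prev_error - error < tol:
            break
        prev_error = error
        # find a new representative for each cluster component
        reps = np.array([points[labels == j].mean(axis=0)
                         if np.any(labels == j) else reps[j]
                         for j in range(k)])
    return labels, reps
```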

6 Vertical K Median Clustering
Our Approach. Scalability is addressed because: it is a partition-based approach; it uses a vertical data structure (the P-tree); and the computation is efficient, selecting the partition representative with a simple directed search across bit slices rather than down rows, assigning membership using bit slices with geometric reasoning, and computing the error through position-based manipulation of bit slices. Solution quality is improved or maintained while increasing speed and scalability. It uses a median rather than a mean.

7 P-tree* Vertical Data Structure
Predicate-trees (P-trees): lossless, compressed, and data-mining-ready. Successfully used in KNN, ARM, Bayesian classification, etc. A basic P-tree represents one attribute bit slice, reorganized into a tree structure by recursively sub-dividing, while recording the predicate truth value regarding purity for each subdivision. Each level of the tree contains truth-bits that represent pure sub-trees, and construction continues recursively down each tree path until a pure sub-division is reached (a toy construction sketch follows).
* Predicate Tree (Ptree) technology is patented by North Dakota State University (William Perrizo, primary inventor of record); patent number 6,941,303, issued September 6, 2005.
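A toy sketch of the recursive construction for one bit slice, with the "pure 1" predicate recorded on halves; the tuple structure and names are ours, not the patented implementation:

```python
def build_ptree(bits):
    """Basic P-tree for one bit slice: each node stores the truth of the
    "pure 1" predicate for its interval; subdivision stops at purity."""
    if all(bits):                  # pure-1 interval: truth bit 1, branch ends
        return (1, None)
    if not any(bits):              # pure-0 interval: truth bit 0, branch ends
        return (0, None)
    mid = len(bits) // 2           # mixed interval: record 0 and subdivide
    return (0, (build_ptree(bits[:mid]), build_ptree(bits[mid:])))

# an illustrative bit slice: impure root, pure-0 first half, mixed second half
tree = build_ptree([0, 0, 0, 0, 1, 1, 0, 1])
```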

8 1-Dimensional P-tree Construction
A file R(A1..An) contains horizontal structures (horizontal records) that are processed vertically (vertical scans). P-trees: vertically partition the file; then compress each vertical bit slice into a basic P-tree; then horizontally process these basic P-trees using one multi-operand logical AND. The slide figure walks through R(A1 A2 A3 A4) with bit slices R11..R43 and their basic P-trees P11..P43. 1-dimensional P-trees are built by recording the truth of the predicate "pure 1" recursively on halves, until there is purity. For P11: (1) the whole file is not pure1, so record 0; (2) the 1st half is not pure1: 0; (3) the 2nd half is not pure1: 0; (4) the 1st half of the 2nd half is not pure1: 0, but it is pure (pure0), so this branch ends; (5) the 2nd half of the 2nd half is pure1: 1; (6) the 1st half of the 1st half of the 2nd half is pure1: 1; (7) the 2nd half of the 1st half of the 2nd half is not pure1: 0. E.g., to count occurrences of a value, AND the basic P-trees, complemented where the value's bit is 0: P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43; the root count of the result is the number of matching records.
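A toy illustration of that multi-operand AND, using uncompressed boolean bit slices as a stand-in for the P-tree AND (names are illustrative):

```python
import numpy as np

def count_value(slices, value, nbits):
    """Count records equal to `value` with one multi-operand AND over
    bit slices (complemented where the value's bit is 0), then a
    root count (popcount) of the surviving mask."""
    mask = np.ones(slices.shape[1], dtype=bool)
    for j in range(nbits):
        bit = (value >> (nbits - 1 - j)) & 1   # value's bit, MSB first
        mask &= slices[j] if bit else ~slices[j]
    return int(mask.sum())
```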

9 Centroid: Partition Representative
Mean vs. Median. The median is usually thought to be the better estimator; it handles outliers much better, for example. Finding the median vector is an NP-hard problem. Existing median (medoid) based solutions: PAM (exhaustive search); CLARA (choosing a representative sample); CLARANS (selecting a randomized sample for each iteration).

10 Vertical K Median Clustering
Vector of Medians. The vector of medians (Hayford, 1902) consists of the median values from each individual dimension. With a traditional horizontal approach the computational costs are N and 3N scans: the mean needs N scans, while the median needs about 3N scans (it requires a partial sort). A small illustration follows.
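A tiny illustration of the definition; note that the result need not coincide with any input point:

```python
import numpy as np

pts = np.array([[1, 9], [4, 2], [6, 5]])
vom = np.median(pts, axis=0)   # vector of medians: one median per dimension
# vom == array([4., 5.]), and (4, 5) is not an actual data point
```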

11 [1] Median with P-trees
Starting from the most significant bit, repeatedly AND the appropriate bit slice until the least significant bit is reached, building the median pattern as you go. At each position, the one-count and zero-count of the records that still match the pattern built so far (obtained from Pi,j or its complement P'i,j) decide the corresponding bit of the median. In the slide's worked example over 3-bit values (000 through 111), the successive (OneCnt, ZeroCnt) comparisons 4 < 5, 6 > 3, and 4 < 5 yield the bits 0, 1, 0, so the median pattern is 010. Scalability: if the cardinality is 2^32 = 4,294,967,296, then rather than scanning 4 billion records we AND log2(2^32) = 32 P-trees. One way to realize the rule is sketched below.
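A minimal sketch of the bit-slice median selection, with boolean arrays standing in for P-trees and a rank-tracking form of the count comparison (our reconstruction, not the paper's code):

```python
import numpy as np

def bit_slice_median(values, nbits):
    """Build the median pattern MSB -> LSB.  At each position the
    zero/one counts of the records still matching the pattern decide
    the median's next bit (returns the lower median for an even count)."""
    mask = np.ones(len(values), dtype=bool)   # records matching the prefix
    rank = (len(values) + 1) // 2             # median's rank among survivors
    median = 0
    for bit in range(nbits - 1, -1, -1):
        ones_slice = ((values >> bit) & 1).astype(bool)
        zeros = np.count_nonzero(mask & ~ones_slice)
        if zeros >= rank:                     # ZeroCnt wins: median bit is 0
            mask &= ~ones_slice               # AND with the complement slice
        else:                                 # OneCnt wins: median bit is 1
            mask &= ones_slice                # AND with the bit slice
            rank -= zeros
            median |= 1 << bit
    return median
```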

12 [2] Bulk Membership Assignment (not 1-by-1)
Find perpendicular bisector boundaries from the centroids (vectors of attribute medians, which are easily computed as in the previous slide), and assign membership to all the points within these boundaries. Assignment is done using AND and OR of the respective bit slices, without a scan. Data points in the red rectangular boxes of the slide figure can be assigned to the respective cluster in bulk: it can be proved that any point within such a box is closer to the respective cluster center than to any other cluster center. Initial iterations have fewer bulk assignments than later iterations, when the algorithm is edging toward the actual cluster centers. The geometric test is sketched below.
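A conservative version of the box test just described: a box may be handed to a centroid wholesale when its farthest corner from that centroid is still nearer than its closest corner to every competitor (an illustrative sketch, not the paper's bit-slice formulation):

```python
import numpy as np

def box_assignable(lo, hi, own, others):
    """True if every point of the axis-aligned box [lo, hi] is provably
    closer to centroid `own` than to each competing centroid."""
    # farthest box corner from `own`: per dimension, the farther endpoint
    far_own = np.linalg.norm(np.maximum(np.abs(lo - own), np.abs(hi - own)))
    for c in others:
        # nearest box point to competitor c: clamp c into the box
        near_c = np.linalg.norm(np.maximum(0.0, np.maximum(lo - c, c - hi)))
        if far_own >= near_c:
            return False
    return True
```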

13 Reverse HOBBit membership
Assume the red points (7 in the slide figure) are not yet assigned among C1, C2, and C3. Starting from the higher-order bits, zoom into each HOBBit rectangle in which a centroid exists and assign all the points inside to the centroids of those HOBBit rectangles. Stop before the total number of points assigned becomes smaller than the number of available points. This may lead to multiple assignments; the motivation is efficiency over accuracy. The previous step may leave data points without cluster membership, and we can either do a scan for those remaining points or use this quick fuzzy assignment. The best depth is the one chosen before the hyper-rectangular boxes get too small and leave a lot of data points out. Remember, this is not a perfect membership assignment: T is the total number of points assigned with the current approach, and compared to the actual number of points to be assigned, if T is too small we have gone too far, while if it is too large we have a lot of multiple assignments. Initial observations show that success depends on the dataset. Most of the experimental results were obtained using the reverse HOBBit approach with a scan for the last iteration. This approach is faster than a scan, but we lose accuracy; we would suggest it when scanning for the remaining data points is not possible due to scalability issues (i.e., very large datasets). A rough sketch follows.
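A rough sketch of the zoom-and-assign loop under stated assumptions (integer-coded points, one shared prefix depth across all dimensions; names and structure are ours, not the paper's):

```python
import numpy as np

def reverse_hobbit_assign(points, centroids, nbits, needed):
    """Zoom in one high-order bit at a time; a point falls in a centroid's
    HOBBit rectangle when their high-order prefixes match in every
    dimension.  Stop before the covered total drops below `needed`;
    a point may match several centroids (fuzzy, multiple assignment)."""
    best = None
    for depth in range(1, nbits + 1):
        shift = nbits - depth
        prefixes = points >> shift                     # (n, d) int prefixes
        matches = np.array([np.all(prefixes == (c >> shift), axis=1)
                            for c in centroids])       # (k, n) bool masks
        covered = np.count_nonzero(matches.any(axis=0))
        if covered < needed:                           # zoomed in too far
            break
        best = matches                                 # deepest safe depth
    return best
```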

14 [3] Efficient Error computation
Error = sum of the squared distances from the centroid (a) to the points in the cluster (x), where: Pi,j is the P-tree for the jth bit of the ith attribute; COUNT(P) is the count of the number of truth bits; and PX is the P-tree (mask) for the cluster subset X.
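The slide's equation image did not survive the transcript; expanding the squared error with the stated definitions, the standard bit-slice form (our reconstruction, not a verbatim copy of the slide) is:

$$\sum_{x \in X} \lVert x - a \rVert^2 = \sum_i \Big[ \sum_{j,k} 2^{j+k}\, \mathrm{COUNT}(P_X \wedge P_{i,j} \wedge P_{i,k}) \;-\; 2 a_i \sum_j 2^j\, \mathrm{COUNT}(P_X \wedge P_{i,j}) \;+\; a_i^2\, \mathrm{COUNT}(P_X) \Big]$$

since $\sum_{x \in X} x_i = \sum_j 2^j\,\mathrm{COUNT}(P_X \wedge P_{i,j})$ and $\sum_{x \in X} x_i^2 = \sum_{j,k} 2^{j+k}\,\mathrm{COUNT}(P_X \wedge P_{i,j} \wedge P_{i,k})$.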

15 Vertical K Median Clustering
Algorithm.
Input: DataSet, K, Threshold. Output: K clusters.
Initialize K clusters for DataSet
Repeat
  Assign membership using hyper-rectangle pruning
  Assign membership for points outside the boundary with reverse HOBBit OR a DB scan
  Find Error = sum of Sq.Dist(SetCi, Centroidi) for all i
  Find new centroid = vector of medians
Until (QualityGain < Threshold OR Iteration > MaxIteration)
Quality gain is the difference in the error computation between iterations (the rate of improvement). A numpy sketch of this loop follows.
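The loop above in compact, illustrative form; the two membership steps are collapsed into a plain nearest-centroid assignment, so this is a readability sketch rather than the vertical implementation:

```python
import numpy as np

def vertical_k_median(points, k, threshold, max_iter=30):
    """Sketch of the overall loop: assign membership, compute the error,
    stop when the quality gain falls below the threshold, otherwise
    recompute each centroid as the vector of medians of its cluster."""
    rng = np.random.default_rng(0)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    prev_error = np.inf
    for _ in range(max_iter):
        d = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        error = (d[np.arange(len(points)), labels] ** 2).sum()
        if prev_error - error < threshold:   # quality gain below threshold
            break
        prev_error = error
        centroids = np.array([np.median(points[labels == j], axis=0)
                              if np.any(labels == j) else centroids[j]
                              for j in range(k)])
    return labels, centroids
```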

16 Experimental Results
Objective: quality and scalability. Datasets: synthetic data (quality); Iris plant data (quality); KDD-99 network intrusion data (quality); remotely sensed image data (scalability). Quality is measured with the F-measure, comparing each original cluster (class) against the best-matching found cluster; F = 1 for perfect clustering, and F is used for quality in all the experimental results.
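The measure's formula image is missing from the transcript; the standard clustering F-measure consistent with the stated legend (original class $i$, found cluster $j$) is, as an assumption:

$$F = \sum_i \frac{n_i}{n} \max_j \frac{2\,P(i,j)\,R(i,j)}{P(i,j)+R(i,j)}, \qquad P(i,j) = \frac{n_{ij}}{n_j}, \quad R(i,j) = \frac{n_{ij}}{n_i}$$

where $n_{ij}$ is the number of members of class $i$ found in cluster $j$.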

17 Vertical K Median Clustering
Results: Iterations. Synthetic data (executed until F-measure = 1). Three separate 2-dimensional synthetically generated datasets (200 data points) test the clustering capabilities; the slide pictures show the spatial locations of the data, and the numbers are iteration counts per approach and dataset: VKClust converged in 6 and 4 iterations, while PAM required more than 300. We see a clear advantage, w.r.t. the number of required iterations, in using the vertical median approach to arrive at the same solution compared to PAM.

18 Results: Quality and Iterations
IRIS data, 3 classes:

Method    F     Iterations
VKClust   0.85  5
KMeans    0.80  8
PAM       0.86  >300

Few iterations mean quick convergence; a high F-measure indicates better cluster quality. We can expect the best possible clustering from PAM compared to the other two medoid approaches for a given dataset, so we can argue that we are comparing our results against the best available partition algorithm w.r.t. quality. Here we can clearly see that our K Median approach yields comparable quality at a very low number of iterations. NOTE: Quality(PAM) > Quality(CLARANS) > Quality(CLARA).

19 Vertical K Median Clustering
This slide is shown as an example of the possibility of bulk assignment with perpendicular bisector boundaries. It shows the clusters and the bisector boundaries for one attribute, along with the final cluster assignments made by our approach. Setosa is clearly identified without any false positives, but for the other two clusters the boundary cases include a few false positives, giving an F-measure below 1.0.

20 Vertical K Median Clustering
Results: Quality. UCI network data for 2, 4, and 6 classes (F):

Method    2-Class  4-Class  6-Class
VKClust   0.91     0.86     0.77
KMeans    0.75     0.72
PAM       0.85

This is a network intrusion data set; the objective is to identify the type of network intrusion by looking at properties of a TCP/IP dump. Our objective is to show the feasibility of the clustering algorithm on real-life data sets. This slide shows only the quality; though PAM shows better results, it is about 60 times more expensive than a K-Means or our K Median approach.

21 Results: Unit Performance time in Sec.
(1M RSI data for k = 4, on a P4 2.4 GHz with 4 GB)

Unit                    Horiz   Vert
Median                  1.17+   0.35
Root Mean Sqrd Error    0.63*   0.44
Find Membership         2.48    0.19
Total (each Iter.)      3.55*   0.98

+ Best C++ implementation, std::nth_element() from the standard template algorithm library.
* The Root Mean Sqrd Error calculation overlaps with Find Membership; the 0.63 s for error computation in the horizontal approach is not added to the total, to be fair to the horizontal approach, because it can be done while scanning through the dataset for membership.

This is a breakdown of the computational cost of the algorithm proposed in this paper, with a time comparison between a horizontal approach and the suggested vertical approach. We can clearly see that the vertical approach is about 4 times faster than the horizontal approach, and this speed gain comes from the computation of the median and the membership assignment.

22 Vertical K Median Clustering
Results: Scalability. Shows an RSI data set and a clear advantage for the vertical method over a horizontal method on the two key steps of a typical K Median algorithm.

23 Vertical K Median Clustering
Conclusions. Vertical bit-slice-based computation of the median is computationally less expensive than the best horizontal approach. Hyper-rectangular queries can be used to make bulk cluster-membership assignments. Position-based manipulation and accumulation of vertical bit slices can be used to compute the squared error for an entire cluster without having to scan the DB for individual data points. Completely vertical K Median clustering is a scalable technique that can produce high-quality clusters at lower cost.

