Vertical K Median Clustering — presentation transcript

1 Vertical K Median Clustering
Amal Perera, William Perrizo {amal.perera, Dept. of CS, North Dakota State University. CATA 2006 – Seattle, Washington

2 Outline
Introduction · Background · Our Approach · Results · Conclusions

3 Introduction
Clustering: automated identification of groups of objects based on similarity.
Application areas include data mining, search-engine indexing, pattern recognition, image processing, trend analysis, and many others.
Clustering algorithms: partition, hierarchical, density, and grid based.
Major problem: scalability with respect to data set size.
We propose: a partition-based vertical k-median clustering.

4 Background
Many clustering algorithms work well on small datasets. Current approaches for large data sets include:
Sampling, e.g. CLARA (choosing a representative sample) and CLARANS (selecting a randomized sample for each iteration).
Preserving summary statistics, e.g. BIRCH (a tree structure that records sufficient statistics for the data set).
These techniques require input parameters based on prior knowledge, and they may lead to sub-optimal solutions.

5 Background
Partition clustering (k): the n objects in the original data set are broken into k partitions, iteratively, each pass producing an improved k-clustering toward some optimality criterion.
Computational steps:
Find a representative for each cluster component.
Assign the other points to the cluster of the best representative.
Calculate the error (repeat if the error is still too high).

6 Our Approach
Scalability is addressed:
it is a partition-based approach;
it uses a vertical data structure (the P-tree);
the computation is efficient: it selects the partition representative with a simple directed search over bit slices, assigns membership using bit slices with geometric reasoning, and computes the error with position-based manipulation of bit slices.
Solution quality is improved or maintained: it uses a median rather than a mean.

7 P-tree* Vertical Data Structure
Predicate-trees (P-trees): lossless, compressed, and data-mining-ready; successfully used in KNN, ARM, Bayesian classification, etc.
A basic P-tree represents one attribute bit slice, reorganized into a tree structure by recursive sub-division, recording the predicate truth value regarding purity for each subdivision. Each level of the tree contains truth bits that represent pure sub-trees; construction continues recursively down each tree path until a pure subdivision is reached.
* Predicate Tree (P-tree) technology is patented by North Dakota State University (William Perrizo, primary inventor of record); patent number 6,941,303, issued September 6, 2005.

8 1-D P-tree Construction
A file R(A1..An) contains horizontal structures (records) and is normally processed with vertical scans. For P-trees, instead: vertically partition the file into bit slices (R11, R12, ... for attribute A1, and so on); compress each vertical bit slice into a basic P-tree (P11, P12, ...); then process these basic P-trees horizontally using one multi-operand logical AND.
A 1-dimensional P-tree is built by recording the truth of the predicate "pure 1" recursively on halves until purity is reached: the whole file is not pure1 (0), the 1st half is not pure1 (0), and so on; when a half is pure (e.g., pure0), that branch ends.
E.g., to count the occurrences of a bit pattern, AND the corresponding basic P-trees and their complements, such as P11 ∧ P12 ∧ P13 ∧ P'21 ∧ P'22 ∧ P'23 ∧ P'31 ∧ P'32 ∧ P33 ∧ P41 ∧ P'42 ∧ P'43, accumulating the counts level by level.
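To make the construction concrete, here is a minimal Python sketch of a 1-dimensional P-tree built over an uncompressed bit list. The names build_ptree, ptree_and, and count are illustrative only, not from the paper; the patented implementation adds compression and a different fan-out.

```python
# Minimal 1-D P-tree sketch: recursively halve a bit slice, record whether
# each interval is "pure 1", and stop expanding a branch once it is pure.

def build_ptree(bits):
    """Return a nested-tuple P-tree for one vertical bit slice."""
    if all(b == 1 for b in bits):
        return 1                       # pure-1 leaf
    if all(b == 0 for b in bits):
        return 0                       # pure-0 leaf: this branch ends
    mid = len(bits) // 2
    # 0 marks "not pure 1"; the children refine the two halves
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def normalize(t):
    """Collapse a node whose children are both pure into a leaf."""
    if t[1] == 0 and t[2] == 0:
        return 0
    if t[1] == 1 and t[2] == 1:
        return 1
    return t

def ptree_and(a, b):
    """AND two P-trees built over bit slices of equal length."""
    if a == 0 or b == 0:
        return 0
    if a == 1:
        return b
    if b == 1:
        return a
    return normalize((0, ptree_and(a[1], b[1]), ptree_and(a[2], b[2])))

def count(t, n):
    """Count of 1-bits represented by P-tree t over n positions."""
    if t == 1:
        return n
    if t == 0:
        return 0
    half = n // 2
    return count(t[1], half) + count(t[2], n - half)

slice1 = [0, 1, 1, 1, 1, 0, 1, 1]
slice2 = [1, 1, 0, 1, 0, 0, 1, 1]
p = ptree_and(build_ptree(slice1), build_ptree(slice2))
print(count(p, 8))   # 4: rows where both bits are 1, with no row scan
```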

9 Centroid: Partition Representative
Mean vs. median: the median is usually considered the better estimator; it handles outliers much better, for example.
Finding the median vector is an NP-hard problem. Existing median (medoid) based solutions: PAM (exhaustive search), CLARA (choosing a representative sample), CLARANS (selecting a randomized sample for each iteration).

10 Vector of Medians
Vector of medians (Hayford, 1902): the vector of all the median values from each individual dimension.
With respect to computational cost under a traditional horizontal approach, the mean and the vector of medians cost N and 3N: the mean needs N scans, while the median needs 3N scans (it requires a partial sort per dimension).

11 [1] Median with P-trees
Median pattern: starting from the most significant bit, repeatedly AND the appropriate bit slices and look at the 1-bit count, fixing one bit of the median pattern per step, until the least significant bit is reached.
E.g., for nine 3-bit values: at the most significant slice, OneCnt = 4 and ZeroCnt = 5, so the median's bit is 0 and the search narrows to the values matching 0,_,_; repeating on the remaining slices narrows through 0,1,_ down to the median pattern 010.
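A sketch of this walk in Python, with plain boolean arrays standing in for uncompressed P-tree bit slices and .sum() standing in for COUNT. The slide compares OneCnt against ZeroCnt; the sketch tracks the median's rank explicitly, which amounts to the same MSB-to-LSB search.

```python
import numpy as np

def vertical_median(slices):
    """Median of the values encoded by bit slices (slices[0] is the MSB)."""
    n = len(slices[0])
    mask = np.ones(n, dtype=bool)   # points still matching the pattern so far
    rank = (n + 1) // 2             # median's position among the sorted values
    median = 0
    for s in slices:                # MSB -> LSB
        zero_cnt = int((mask & ~s).sum())
        if rank <= zero_cnt:        # median lies in the 0-branch
            bit, mask = 0, mask & ~s
        else:                       # median lies in the 1-branch
            bit, mask = 1, mask & s
            rank -= zero_cnt        # skip over the smaller 0-branch
        median = (median << 1) | bit
    return median

vals = np.array([1, 2, 2, 2, 3, 5, 6, 6, 7])    # median is 3
bits = [np.array([(v >> k) & 1 for v in vals], dtype=bool) for k in (2, 1, 0)]
print(vertical_median(bits))                    # 3
```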

12 [2] Bulk Membership Assignment
Find the perpendicular-bisector boundary from the centroid and assign membership to all the points within the boundary at once. The assignment is done using AND and OR of the respective bit slices, without a DB scan; a sketch of the bit-slice range test follows.
Data points inside such a rectangular region can be assigned to the respective cluster in bulk: it can be proved that any point within it is closer to the respective cluster center than to the other cluster centers.
Initial iterations have fewer bulk assignments than later iterations, when the algorithm is edging toward the actual cluster centers.
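The slide leaves the boundary test implicit. Below is a sketch of how a "value < t" mask, and from it an axis-aligned box membership mask, can be built from bit-slice logic alone, with boolean numpy arrays standing in for P-trees. The [lo, hi) box interface and the function names are illustrative assumptions; the paper derives the box bounds from the perpendicular bisectors.

```python
import numpy as np

def less_than(slices, t, width):
    """Mask of points whose value is < t, using only bit-slice logic.
    slices[0] is the MSB; boolean arrays stand in for P-trees."""
    n = len(slices[0])
    lt = np.zeros(n, dtype=bool)    # already known to be smaller than t
    eq = np.ones(n, dtype=bool)     # prefix so far equal to t's prefix
    for k, s in zip(range(width - 1, -1, -1), slices):
        if (t >> k) & 1:
            lt |= eq & ~s           # t has a 1 here, the value a 0: smaller
            eq &= s
        else:
            eq &= ~s                # t has a 0: value needs a 0 to stay equal
    return lt

def in_box(dim_slices, lo, hi, width):
    """Mask of points inside the box [lo, hi) in every dimension; the bulk
    assignment ANDs these masks instead of scanning the DB."""
    mask = None
    for slices, l, h in zip(dim_slices, lo, hi):
        inside = less_than(slices, h, width) & ~less_than(slices, l, width)
        mask = inside if mask is None else mask & inside
    return mask

# Example: 3-bit points in 2-D; the box [2, 6) x [1, 5)
pts = np.array([[1, 1], [3, 2], [5, 4], [6, 3], [2, 6]])
dim_slices = [[((pts[:, d] >> k) & 1).astype(bool) for k in (2, 1, 0)]
              for d in range(2)]
print(in_box(dim_slices, lo=(2, 1), hi=(6, 5), width=3))
# [False  True  True False False]
```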

13 Reverse HOBBIT Membership
The previous step may leave data points without cluster membership. For those remaining points we can either do a scan or use this quick fuzzy assignment:
Assume some points (e.g., seven of them) are not yet assigned to centroids C1, C2, C3. Starting from the higher-order bits, zoom into each HOBBIT hyper-rectangle in which a centroid exists and assign all the points inside it to that centroid. Stop before the total number of assigned points drops below the number of available points, i.e., the best selection is chosen before the hyper-rectangular boxes get too small and leave a lot of data points out.
This may lead to multiple assignments; the motivation is efficiency over accuracy. REMEMBER: this is not a perfect membership assignment. A sketch of this idea follows.
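A loose sketch of this fallback, under the assumption that "same HOBBIT hyper-rectangle" means sharing the top k bits with a centroid in every dimension; all names here are illustrative, not from the paper.

```python
import numpy as np

# Fuzzy reverse-HOBBIT assignment sketch: a point joins a centroid's cluster
# when it shares the top k bits with that centroid in every dimension. k grows
# until the rectangles get too small to cover the remaining points; later
# centroids may overwrite earlier ones, so the result is fuzzy (speed over
# accuracy), matching the slide's caveat.
def hobbit_assign(points, centroids, width, unassigned):
    best = np.full(len(points), -1)
    for k in range(1, width + 1):
        shift = width - k
        labels = np.full(len(points), -1)
        for c, cen in enumerate(centroids):
            same = ((points >> shift) == (cen >> shift)).all(axis=1)
            labels[same & unassigned] = c
        if (labels[unassigned] >= 0).sum() < unassigned.sum():
            break                 # rectangles too small: keep the previous k
        best = labels
    return best
```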

14 [3] Efficient Error Computation
Error = sum of squared distances from the centroid (a) to the points (x) in cluster X, computed entirely from bit-slice counts:
Error = Σ_i [ Σ_{j,k} 2^(j+k) · COUNT(P_i,j ∧ P_i,k ∧ P_X) − 2·a_i · Σ_j 2^j · COUNT(P_i,j ∧ P_X) + |X| · a_i² ]
Where: P_i,j is the P-tree for the jth bit of the ith attribute; COUNT(P) is the count of the number of truth bits; P_X is the P-tree (mask) for the cluster subset X.
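A small check of this expansion in Python, with a boolean membership mask standing in for P_X and .sum() for COUNT; it reproduces the directly computed squared error without any per-point distance scan.

```python
import numpy as np

# Verify the bit-slice error expansion against the direct computation.
rng = np.random.default_rng(0)
X = rng.integers(0, 8, size=(10, 2))      # 10 points, 2 attributes, 3 bits
member = rng.random(10) < 0.6             # P_X: cluster membership mask
a = np.median(X[member], axis=0)          # vector of medians as centroid

width = 3
err = 0.0
for i in range(X.shape[1]):
    P = [(X[:, i] >> j) & 1 for j in range(width)]   # bit slices, LSB first
    sum_x  = sum(2**j * (P[j] & member).sum() for j in range(width))
    sum_x2 = sum(2**(j + k) * (P[j] & P[k] & member).sum()
                 for j in range(width) for k in range(width))
    err += sum_x2 - 2 * a[i] * sum_x + member.sum() * a[i] ** 2

direct = ((X[member] - a) ** 2).sum()
print(err, direct)    # identical values
```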

15 Algorithm
Input: DataSet, K, Threshold. Output: K clusters.
Initialize K clusters for DataSet
Repeat
Assign membership using hyper-rectangle pruning
Assign membership for points outside the boundary with reverse HOBBIT or a DB scan
Find Error = sum of Sq.Dist(Set_Ci, Centroid_i) for all i
Find new centroid = vector of medians
Until (QualityGain < Threshold or Iteration > MaxIteration)
QualityGain is the difference between successive error computations (the rate of improvement).
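An end-to-end sketch of the loop. This horizontal numpy stand-in only shows the control flow; the paper performs each step vertically with P-tree AND/OR/COUNT operations instead of scans, and the helper name vk_median is illustrative.

```python
import numpy as np

def vk_median(X, k, threshold=1e-3, max_iter=50, seed=0):
    """K-median-style loop: squared-error objective, vector-of-medians update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)].astype(float)
    prev_err = np.inf
    for _ in range(max_iter):
        # membership: nearest center (done via bulk bisector pruning in the paper)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        err = d[np.arange(len(X)), labels].sum()
        if prev_err - err < threshold:       # quality gain too small: stop
            break
        prev_err = err
        # new centroid: vector of medians per cluster
        for c in range(k):
            if (labels == c).any():
                centers[c] = np.median(X[labels == c], axis=0)
    return labels, centers
```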

16 Experimental Results
Objective: quality and scalability.
Datasets: synthetic data (quality); Iris plant data (quality); KDD-99 network intrusion data (quality); remotely sensed image data (scalability).
Quality is measured with the F-measure, comparing each found cluster against the original clusters; F = 1 for perfect clustering.
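A common form of the clustering F-measure consistent with the slide's description (F = 1 for perfect clustering) is the class-size-weighted best F1 over the found clusters; a sketch, with the exact weighting an assumption:

```python
import numpy as np

# Clustering F-measure: for each original class, take the best F1 over all
# found clusters, weighted by class size. F = 1 for a perfect clustering.
def f_measure(true_labels, cluster_labels):
    true_labels = np.asarray(true_labels)
    cluster_labels = np.asarray(cluster_labels)
    n = len(true_labels)
    total = 0.0
    for cls in np.unique(true_labels):
        in_cls = true_labels == cls
        best = 0.0
        for clu in np.unique(cluster_labels):
            in_clu = cluster_labels == clu
            overlap = (in_cls & in_clu).sum()
            if overlap == 0:
                continue
            p = overlap / in_clu.sum()      # precision
            r = overlap / in_cls.sum()      # recall
            best = max(best, 2 * p * r / (p + r))
        total += in_cls.sum() / n * best
    return total

print(f_measure([0, 0, 1, 1], [5, 5, 9, 9]))   # 1.0: perfect clustering
```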

17 Results: Iterations
Iterations on the synthetic data set, executed until F-measure = 1:
VKClust  6
KMeans   4
PAM      >300

18 Results: Quality and Iterations
IRIS data for 3 classes:
         F     Iterations
VKClust  0.85  5
KMeans   0.80  8
PAM      0.86  >300
NOTE: Quality(PAM) > Quality(CLARANS) > Quality(CLARA)

19 Vertical K Median Clustering
This slide shows an example illustrating the possibility of bulk assignment with perpendicular-bisector boundaries.

20 Results: Quality
UCI network data for 2, 4, and 6 classes (F-measure):
         2-Class  4-Class  6-Class
VKClust  0.91     0.86     0.77
KMeans   0.75     0.72     –
PAM      0.85     –        –

21 Results: Unit Performance, time in sec.
1M RSI data for k = 4, on a P4 2.4 GHz with 4 GB:
Unit                     Horiz   Vert
Median                   1.17+   0.35
Root mean squared error  0.63*   0.44
Find membership          2.48    0.19
Total (each iteration)   3.55*   0.98
* The root-mean-squared-error calculation overlaps with Find Membership.
+ Best C++ implementation, std::nth_element() from the standard template algorithm library.

22 Results: Scalability

23 Conclusions
Vertical bit-slice based computation of the median is computationally less expensive than the best horizontal approach.
Hyper-rectangular queries can be used to make bulk cluster-membership assignments.
Position-based manipulation and accumulation of vertical bit slices can compute the squared error for an entire cluster without scanning the DB for individual data points.
Completely-vertical K median clustering is a scalable technique that can produce high-quality clusters at a lower cost.

