Vertical K Median Clustering


1 Vertical K Median Clustering
Amal Perera, William Perrizo. Dept. of CS, North Dakota State University. CATA 2006, Seattle, Washington.

2 Vertical K Median Clustering
Outline
- Introduction
- Background
- Our Approach
- Results
- Conclusions

3 Vertical K Median Clustering
Introduction
Clustering: automated identification of groups of objects based on similarity. Application areas include data mining, search engine indexing, pattern recognition, image processing, trend analysis, and many others.
Clustering algorithms: partition, hierarchical, density, and grid based.
Major problem: scalability with respect to data set size.
We propose: a partition-based vertical K median clustering.

4 Vertical K Median Clustering
Background
Many clustering algorithms work well on small datasets. Current approaches for large data sets include:
- Sampling, e.g. CLARA (choosing a representative sample) and CLARANS (selecting a randomized sample for each iteration).
- Preserving summary statistics, e.g. BIRCH (a tree structure that records sufficient statistics for the data set).
These techniques require input parameters based on prior knowledge, and may lead to suboptimal solutions.

5 Vertical K Median Clustering
Background
Partition clustering (k): the n objects in the original data set are broken into k partitions (iteratively, each time resulting in an improved k-clustering) to achieve a certain optimality criterion. Computational steps:
1. Find a representative for each cluster component.
2. Assign every other object to the cluster of its best representative.
3. Calculate the error (repeat if the error is too high).
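The three steps above can be sketched as a conventional horizontal (row-scan) k-median loop. This is a minimal illustration only, not the paper's vertical method; the function name and the use of the upper median are choices made here for the sketch. The paper's contribution is replacing the scans below with bit-slice operations.

```python
import random

def k_median_cluster(points, k, max_iter=20, seed=0):
    """Generic partition-clustering loop (horizontal sketch):
    pick representatives, assign members, recompute, repeat."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(max_iter):
        # Step 2: assign each point to the nearest representative (full scan)
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[i].append(p)
        # Step 1 (next round): representative = vector of medians per cluster
        new_centers = []
        for i, members in enumerate(clusters):
            if not members:
                new_centers.append(centers[i])
                continue
            dims = list(zip(*members))
            new_centers.append(tuple(sorted(d)[len(d) // 2] for d in dims))
        # Step 3 stand-in: stop when the representatives no longer move
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters
```

Every iteration of this horizontal version costs a full scan per assignment and a sort per dimension per cluster, which is exactly the cost the vertical approach attacks.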

6 Vertical K Median Clustering
Our Approach
Scalability is addressed because:
- it is a partition-based approach;
- it uses a vertical data structure (the P-tree);
- the computation is efficient: it selects the partition representative using a simple directed search across bit slices rather than down rows, assigns membership using bit slices with geometric reasoning, and computes the error using position-based manipulation of bit slices.
Solution quality is improved or maintained while increasing speed and scalability: we use a median rather than a mean.

7 P-tree* Vertical Data Structure
Predicate-trees (P-trees): lossless, compressed, and data-mining-ready; successfully used in KNN, ARM, Bayesian classification, etc. A basic P-tree represents one attribute bit slice, reorganized into a tree structure by recursively subdividing while recording the predicate truth value regarding purity for each subdivision. Each level of the tree contains truth bits that represent pure subtrees; construction continues recursively down each tree path until a pure subdivision is reached.
* Predicate Tree (P-tree) technology is patented by North Dakota State University (William Perrizo, primary inventor of record); patent number 6,941,303, issued September 6, 2005.
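The recursive purity construction can be sketched in a few lines. This is an illustrative toy, not NDSU's patented implementation: leaves are the literals 1 (pure-1) and 0 (pure-0), an impure node is a tuple recursing on halves, and the function names are made up for this sketch. The `root_count` check demonstrates that the structure is lossless.

```python
def build_ptree(bits):
    """Basic P-tree for one bit slice: a node records whether its
    interval is pure-1; impure nodes recurse on halves until pure."""
    if all(bits):
        return 1                      # pure-1 leaf
    if not any(bits):
        return 0                      # pure-0 leaf
    mid = len(bits) // 2
    return (0, build_ptree(bits[:mid]), build_ptree(bits[mid:]))

def root_count(tree, size):
    """Number of 1-bits the P-tree represents (losslessness check)."""
    if tree == 1:
        return size
    if tree == 0:
        return 0
    _, left, right = tree
    return root_count(left, size // 2) + root_count(right, size - size // 2)
```

Because pure runs collapse to a single leaf, long runs of identical bits compress well, and counts fall out of the tree without touching individual records.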

8 Building 1-Dimensional P-trees
A file, R(A1..An), contains horizontal structures (horizontal records) that would normally be processed with vertical scans. With P-trees we instead vertically partition the file, compress each vertical bit slice into a basic P-tree, and then horizontally process these basic P-trees using one multi-operand logical AND.
1-dimensional P-trees are built by recording the truth of the predicate "pure 1" recursively on halves until there is purity. For example, for bit slice P11: (1) the whole file is not pure-1, so record 0; (2) the 1st half is not pure-1, record 0; (3) the 2nd half is not pure-1, record 0; (4) the 1st half of the 2nd half is not pure-1, record 0, but it is pure (pure-0), so this branch ends; and so on down each path until every subdivision is pure. To count the occurrences of a full value pattern, AND the corresponding basic P-trees, complemented where the pattern bit is 0, e.g. P11 ^ P12 ^ P13 ^ P'21 ^ P'22 ^ P'23 ^ P'31 ^ P'32 ^ P33 ^ P41 ^ P'42 ^ P'43.
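The multi-operand AND over complemented and uncomplemented slices can be demonstrated without the tree compression. In this sketch a plain Python integer stands in for each (uncompressed) bit slice, with bit r of slice j holding bit j of record r; the function names are invented for illustration.

```python
def bit_slices(values, width):
    """Vertically partition: one integer-as-bitmask per bit position."""
    slices = [0] * width
    for r, v in enumerate(values):
        for j in range(width):
            if (v >> j) & 1:
                slices[j] |= 1 << r
    return slices  # slices[j] holds bit j of every record

def count_value(values, slices, width, target):
    """Count records equal to `target` with one multi-operand AND,
    complementing each slice where target's bit is 0."""
    n = len(values)
    full = (1 << n) - 1
    mask = full
    for j in range(width):
        s = slices[j] if (target >> j) & 1 else (full & ~slices[j])
        mask &= s
    return bin(mask).count("1")
```

The result mask also identifies which records matched, which is what the later membership and error steps exploit.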

9 Centroid: Partition Representative
Mean vs. Median
The median is usually considered the better estimator; for example, it handles outliers much better. However, finding the true median vector is an NP-hard problem. Existing median (medoid) based solutions:
- PAM: exhaustive search
- CLARA: choosing a representative sample
- CLARANS: selecting a randomized sample for each iteration

10 Vertical K Median Clustering
Vector of Medians Vector of medians (Hayford,1902): All the median values from each individual dimension Vector of Medians mean N and 3N w.r.t. computational cost With a traditional Horizontal approach Mean N scans Median 3N scans (requires a partial sort) 11/27/2018 Vertical K Median Clustering

11 [1] Median with P-trees
Starting from the most significant bit, repeatedly AND the appropriate bit slice (or its complement) until the least significant bit is reached, building the median bit pattern one bit at a time: at each position, compare the count of remaining candidate records whose bit is 1 (OneCnt) against those whose bit is 0 (ZeroCnt), fix the median's bit accordingly, and restrict the candidates to the matching side of the distribution. In the slide's worked example over the value range 000-111, the first comparison (OneCnt 4 vs. ZeroCnt 5) fixes the high bit at 0, and two more steps yield the median pattern 010.
Scalability: if the cardinality is 2^32 = 4,294,967,296, then rather than scanning 4 billion records we AND log2(2^32) = 32 P-trees.
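The MSB-to-LSB directed search can be written out concretely. As in the earlier sketches, a plain Python integer stands in for each uncompressed bit slice (a real P-tree would supply the 1-bit counts from its compressed form); the rank bookkeeping makes the slide's OneCnt/ZeroCnt comparison exact for any n.

```python
def vertical_median(slices, width, n):
    """Find the median value by walking bit slices MSB -> LSB,
    narrowing a candidate mask instead of scanning records.
    slices[j] is an int whose bit r is bit j of record r."""
    full = (1 << n) - 1
    candidates = full
    rank = n // 2                      # index of the (upper) median in sorted order
    median = 0
    for j in reversed(range(width)):
        ones = candidates & slices[j]
        zeros = candidates & ~slices[j] & full
        zero_count = bin(zeros).count("1")
        if rank < zero_count:          # median lies among the 0-bit records
            candidates = zeros
        else:                          # median lies among the 1-bit records
            candidates = ones
            rank -= zero_count
            median |= 1 << j
    return median
```

Each loop iteration is one AND plus one count, so a b-bit attribute needs only b such steps regardless of how many records there are, which is the slide's 32-vs-4-billion point.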

12 [2] Bulk Membership Assignment (not 1-by-1)
Find the perpendicular bisector boundaries from the centroids (the vectors of attribute medians, which are easily computed as in the previous slide). Assign membership to all the points within these boundaries; the assignment is done using AND and OR of the respective bit slices, without a scan. Data points in the red rectangular boxes of the slide's figure can be assigned to the respective cluster in bulk: it can be proved that any point within such a box is closer to the respective cluster center than to any other cluster center. Initial iterations have fewer bulk assignments than later iterations, when the algorithm is edging toward the actual cluster centers.
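The geometric guarantee behind the bulk step is that a point lying on center i's side of the coordinate-wise midpoint, in every dimension and against every other center, is provably closer to center i (summing the per-dimension inequality (x_d - c_i,d)^2 <= (x_d - c_j,d)^2 gives the full-distance claim). The sketch below performs that test horizontally, point by point, purely to show the logic; the paper evaluates the same interval predicates in bulk with bit-slice ANDs and ORs. Names and structure are invented here.

```python
def bulk_assign(points, centers):
    """Safe bulk membership via coordinate-wise midpoint boundaries.
    A point inside center i's box against every rival is closer to i."""
    assigned = {}
    for idx, p in enumerate(points):
        for i, ci in enumerate(centers):
            ok = True
            for cj in centers:
                if cj is ci:
                    continue
                for d in range(len(p)):
                    mid = (ci[d] + cj[d]) / 2.0
                    # p must sit on ci's side of the midpoint in dim d
                    if (ci[d] <= cj[d] and p[d] > mid) or \
                       (ci[d] > cj[d] and p[d] < mid):
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                assigned[idx] = i      # provably-correct bulk assignment
                break
    return assigned  # points absent here need a fallback (scan or HOBBit)
```

The test is conservative: it never assigns a point to the wrong center, but some points fall outside every box and are left for the fallback step, matching the slide's remark about early iterations.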

13 HOBBit Membership Assignment
The previous step may leave some data points without cluster membership (the red points in the slide's figure). For those remaining points we can either do a scan, or use this quick fuzzy assignment: starting from the higher-order bit, zoom in based on the High Order Bisecting Bit (HOBBit) until a hyper-rectangle contains a single centroid, and assign the points in it to that centroid; then recurse on the next bit within what is left. The best selection is chosen before the hyper-rectangular boxes get too small and leave too many data points out. Remember, this is not a perfect membership assignment: HOBBit membership is an alternative method that is very fast but may sacrifice some accuracy (it was not used in the performance study).

14 [3] Efficient Error computation
Error = sum of squared distances from the centroid (a) to the points (x) in cluster X, computed using P-tree ANDing. For each attribute i, expanding the square gives

  sum_{x in X} (x_i - a_i)^2 = sum_{x} x_i^2 - 2 a_i sum_{x} x_i + |X| a_i^2

where the vertical pieces are

  sum_{x} x_i   = sum_j 2^j COUNT(P_{i,j} AND P_X)
  sum_{x} x_i^2 = sum_{j,k} 2^{j+k} COUNT(P_{i,j} AND P_{i,k} AND P_X)

and:
  P_{i,j} : P-tree for the jth bit of the ith attribute
  COUNT(P) : count of the number of 1-bits
  P_X : P-tree (mask) for the cluster subset X, with |X| = COUNT(P_X)
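The expansion above can be checked directly for one attribute. As before, plain Python integers stand in for the (uncompressed) bit slices and the cluster mask; the function name is invented for this sketch.

```python
def sse_vertical(slices, width, cluster_mask, a):
    """Squared error of one attribute over a cluster, from bit-slice
    counts alone:  sum (x - a)^2 = sum x^2 - 2*a*sum x + |X|*a^2,
    with sum x and sum x^2 expanded over bit positions."""
    def cnt(m):
        return bin(m).count("1")
    n_x = cnt(cluster_mask)                       # |X| = COUNT(P_X)
    # sum_x x  =  sum_j 2^j COUNT(P_j AND P_X)
    sum_x = sum((1 << j) * cnt(slices[j] & cluster_mask)
                for j in range(width))
    # sum_x x^2  =  sum_{j,k} 2^(j+k) COUNT(P_j AND P_k AND P_X)
    sum_x2 = sum((1 << (j + k)) * cnt(slices[j] & slices[k] & cluster_mask)
                 for j in range(width) for k in range(width))
    return sum_x2 - 2 * a * sum_x + n_x * a * a
```

Every term is a count of an AND of masks, so the whole cluster's error comes out of O(width^2) count operations instead of a scan over its members.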

15 Experimental Results
Objective: quality and scalability.
Datasets:
- Synthetic data (quality)
- Iris plant data (quality)
- KDD-99 network intrusion data (quality)
- Remotely sensed image data (scalability)
Quality is measured with the F-measure, which compares each discovered cluster against the original classes; F = 1 for a perfect clustering.
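The slide's formula image did not survive the transcript, so the sketch below implements one common clustering F-measure (class-size-weighted best harmonic mean of precision and recall per class) as an assumption about what was used; it does satisfy the stated property that F = 1 for a perfect clustering.

```python
def clustering_f_measure(true_labels, cluster_labels):
    """Weighted F-measure: for each true class take the best-matching
    cluster's harmonic mean of precision and recall, weighted by class
    size. Equals 1.0 for a perfect clustering."""
    n = len(true_labels)
    total = 0.0
    for c in set(true_labels):
        class_idx = {i for i, t in enumerate(true_labels) if t == c}
        best = 0.0
        for k in set(cluster_labels):
            clus_idx = {i for i, g in enumerate(cluster_labels) if g == k}
            overlap = len(class_idx & clus_idx)
            if overlap == 0:
                continue
            p = overlap / len(clus_idx)    # precision of cluster k for class c
            r = overlap / len(class_idx)   # recall of cluster k for class c
            best = max(best, 2 * p * r / (p + r))
        total += (len(class_idx) / n) * best
    return total
```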

16 Results (Number of Iterations required)
Three 2-D synthetic datasets with 200 data points each, executed until F-measure = 1.

  Method    Iterations
  VKClust   6, 4
  KMeans
  PAM       >300

17 Results: Quality and # of Iterations
IRIS data, 3 classes:

  Method    F     Iterations
  VKClust   0.85  5
  KMeans    0.80  8
  PAM       0.86  >300

NOTE: Quality(PAM) >= Quality(CLARANS) >= Quality(CLARA); that is why we only compare against PAM.

18 Vertical K Median Clustering
[Figure: Iris data (setosa, versicolor, virginica) plotted on sepal and petal dimensions, with perpendicular bisector boundaries overlaid.] This slide is shown as an example of the possibility of bulk assignment with perpendicular bisector boundaries.

19 Vertical K Median Clustering
Results: Quality
UCI network data for 2, 4, and 6 classes (F):

  Method    2-Class  4-Class  6-Class
  VKClust   0.91     0.86     0.77
  KMeans    0.75     0.72
  PAM       0.85

20 Results: Unit Performance time in Sec.
1M RSI data for k = 4, on a P4 2.4 GHz with 4 GB RAM.

  Unit                       Horiz   Vert
  Median                     1.17+   0.35
  Root mean squared error    0.63*   0.44
  Find membership            2.48    0.19
  Total (each iteration)     3.55*   0.98

* The root mean squared error calculation overlaps with Find Membership.
+ Best C++ implementation, std::nth_element() from the standard template algorithm library.

21 Vertical K Median Clustering
Results: Scalability

22 Vertical K Median Clustering
Conclusions
- Vertical bit-slice based computation of the median is computationally less expensive than the best horizontal approach.
- Hyper-rectangular queries can be used to make bulk cluster membership assignments.
- Position-based manipulation and accumulation of vertical bit slices can be used to compute the squared error for an entire cluster without having to scan the DB for individual data points.
- Completely vertical K median clustering is a scalable technique that can produce high quality clusters at lower cost.

