Alok 1Northwestern University PARSYMONY: Scalable Parallel Data Mining Alok N. Choudhary Northwestern University (ACK: Harsha.

Alok Choudharychoudhar@ece.nwu.edu 1Northwestern University PARSYMONY: Scalable Parallel Data Mining Alok N. Choudhary Northwestern University (ACK: Harsha Nagesh (Bell Labs) and Sanjay Goil (Sun)

Alok Choudharychoudhar@ece.nwu.edu 2Northwestern University Outline Overview of PARSIMONY MAFIA (Subspace clustering) pMAFIA (Parallel Subspace Clustering) Performance Results PARSIMONY –multidimensional data analysis –parallel classification Summary

Alok Choudharychoudhar@ece.nwu.edu 3Northwestern University Overview of Knowledge Discovery Process

Alok Choudharychoudhar@ece.nwu.edu 4Northwestern University PARSIMONY Overview

Alok Choudharychoudhar@ece.nwu.edu 5Northwestern University MAFIA: Subspace Clustering for High- Dimensional Data Sets Clustering and subspace clustering Base MAFIA Algorithm pMAFIA (parallelization) Performance Results

Alok Choudharychoudhar@ece.nwu.edu 6Northwestern University Clustering Discovery of interesting patterns in large multidimensional data sets. –What is the average credit for a particular income group (Financial Services) –Areas with maximum collect calls (Telecommunications) –Categorize stocks based on their movement (Investment Banking) –Target Mailing (Marketing) –Analysis of satellite data, detection of clusters in geographic information systems, categorize web documents, etc.

Alok Choudharychoudhar@ece.nwu.edu 7Northwestern University Clustering Multi-dimensional Dimensional Data Sets Determine range of attributes in each dimension of the cluster(s)

Alok Choudharychoudhar@ece.nwu.edu 8Northwestern University Issues to be addressed Basic algorithm computational optimizations Scalability with database size (out of core data sets) Scalability with the dimensionality of data Efficient Parallelization Recognition of arbitrary shaped clusters

Alok Choudharychoudhar@ece.nwu.edu 9Northwestern University Related Work Partition based Clustering: –User specified k representative points taken as cluster centers and points assigned to cluster centers k-means, k-mediods, CLARANS (VLDB 94), BIRCH (SIGMOD 96),.. Consider clustering partitioning of points Hierarchical Clustering: –Each point is a cluster. Merge similar points together gradually. CURE (SIGMOD 98) (use sampling) Categorical Clustering: –Clustering of categorical data e.g automobile sales data: color, year, model, price, etc –Best suited for non-numerical data CACTUS (KDD 99), STIRR (VLDB 98)

Alok Choudharychoudhar@ece.nwu.edu 10Northwestern University Related Work Density and Grid Based Clustering : –Clusters are high density regions than its surroundings WaveCluster (VLDB 98), DBSCAN, CLIQUE (SIGMOD 98) –Number of subspaces is exponential in the data dimensionality –Multidimensional space divided into grids. The histogram in each hyper-rectangle is found. Grid regions with a significant histogram value are cluster regions. –Post-processing done to grow the connected cluster regions. –Fine grid size results in explosion in the number hyper-rectangles, coarser grids fail to detect clusters. –Correct Grid Size is very critical !

Alok Choudharychoudhar@ece.nwu.edu 11Northwestern University Subspace Clustering CLIQUE (SIGMOD 98)- User specified grid size and threshold for each dimension –Finer grids : enormous computation and coarser grids : loss of quality Noise is another consideration in Finer grids –A bottom-up algorithm by combining dense regions in different subspaces. –A hyper-rectangle in a multidimensional space is dense if it contains more points than a user specified threshold percentage of the total number of points. PROCLUS (SIGMOD 99) - Modification of k-means algorithm. –User input of number of clusters and average cluster dimensionality : unrealistic for real-world data sets –Uses cluster centers and points near to it to compute statistics. These determine the relevant cluster dimensions of the clusters ! ENCLUS (KDD 99) - Identifies the dimensions of a cluster followed by the application of any clustering algorithm. –Entropy Based Clustering : Uses entropy as a measure of correlation between dimensions…requires entropy thresholds to be set

Alok Choudharychoudhar@ece.nwu.edu 12Northwestern University Subspace Clustering Observation: “If a collection of points S is a cluster in a k- dimensional space, then S is also a part of a cluster in any (k-1) dimensional projection of the space” Algorithm : Growing clusters…candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions. –Ex: ( {a1,b7,d9}, {b7,c8,d9} ) --> {a1,b7,c8,d9} –Candidate dense units are populated by a pass on the data set and the dense units are found out in each dimension. –Dense units found are combined to form candidate dense units. –Algorithm terminates when no more candidate dense units found.

Alok Choudharychoudhar@ece.nwu.edu 13Northwestern University Adaptive Grids (reducing computation in practice) Automatic Grid fitting based on data distribution – MAFIA : Merging of Adaptive Finite Intervals ! Optimal Bins in each dimension leads to very few units in the grid (candidate dense units) – (a) : CLIQUE – (b) : MAFIA

Alok Choudharychoudhar@ece.nwu.edu 14Northwestern University Base MAFIA Algorithm Divide each dimension into very fine regions. Compute histogram in these regions along every dimension. Set the value of a sliding window to the maximum histogram value in the window. Adjacent units which have nearly same histogram values are merged together to form larger bins. Threshold of each bin formed is computed automatically. A bin having a histogram value much greater (by a factor 2~3) than that of equi-distribution of data is DENSE.

Alok Choudharychoudhar@ece.nwu.edu 15Northwestern University

Alok Choudharychoudhar@ece.nwu.edu 16Northwestern University MAFIA Algorithm (merging dense units) Algorithm : Candidate dense units in any k dimensions are obtained by merging dense units in (k-1) dimensions which share any (k-2) dimensions. Ex: ( {a1,c7,b8}, {c7,b8,d9} ) --> {a1,c7,b8,d9}

Alok Choudharychoudhar@ece.nwu.edu 17Northwestern University CLIQUE : CDUs in any dimension k is formed by combining dense units of dimension (k-1) which share first (k-2) dimensions. MAFIA: CDUs in any dimension k is formed by combining dense units of dimension (k-1) which share any (k-2) dimensions. Data Set CLIQUE MAFIA Fixed Grids Adaptive Grids Fixed Gridsfirst (k-2) algo any (k-2) algo Huge CDU Set Non Cluster Dims reported Huge Search Space much reduced CDU Set Correct Cluster Dims reported Reduced Search Space Dimensions aware of data distribution

Alok Choudharychoudhar@ece.nwu.edu 18Northwestern University Parallel MAFIA pMAFIA: Parallel Subspace Clustering –Scalable in data size and number of dimensions Grid and Density based clustering algorithm –Parallelization provides speedup for the subspace clustering algorithm.

Alok Choudharychoudhar@ece.nwu.edu 19Northwestern University pMAFIA Algorithm : Each processor reads part of the data in a Data Parallel fashion and constructs histogram in every dimension. // Data read in chunks (out of core) of data Reduce communication to obtain global histogram. All processors build Adaptive grids using the histogram. // Each bin formed is a Candidate Dense Unit. Current Dimension k = 1 while (no more dense units found) if( k > 1) Build Candidate Dense units(); Populate the candidate dense units in a data parallel fashion and in chunks (out-of-core) of B records. Reduce communication to obtain global CDU population. Identify the dense units (); Build dense unit data structures(); // for the next higher dimension

Alok Choudharychoudhar@ece.nwu.edu 20Northwestern University Build-Candidate-Dense-Units Current Dimension(k) = 3.

Alok Choudharychoudhar@ece.nwu.edu 21Northwestern University Build Candidate Dense Units CLIQUE : CDUs in any dimension k is formed by combining dense units of dimension (k-1) such that they share first (k-2) dimensions. pMAFIA: CDUs in any dimension k is formed by combining dense units of dimension (k-1) such that they share any (k-2) dimensions. For data dimension (d) = 10, current dimension (k) = 5, CLIQUE does not explore 93.3% of possible combinations. In general, This problem more so in data sets with clusters having a very high subspace coverage Current Dimension(k) = 3.

Alok Choudharychoudhar@ece.nwu.edu 22Northwestern University Build Candidate Dense Units Each dense unit is compared with every other dense unit to form CDUs, resulting in an O(Ndu 2 ) algorithm.(Ndu-number of dense units) For large values of Ndu CDUs built in parallel. Processors 0,..,(p-1) work in parallel on parts of total Ndu dense units. Processor k compares dense units between N i and N i+1 with all the other dense units; for optimal task partitioning we have Identical CDUs generated during the process need to be discarded. Each generated CDU compared with every other CDU to identify similar ones resulting in O(Ncdu 2 ) algorithm.

Alok Choudharychoudhar@ece.nwu.edu 23Northwestern University Task Parallelism Identify Dense Units –CDUs generated are populated in a data parallel fashion. –If the histogram count of a CDU is greater than the threshold of all the bins which form the CDU in their respective dimensions, the CDU is a Dense Unit. –If Ncdu is large, each processor processes Ncdu/p candidate dense units Build Dense Unit Data Structures –If Ndu, number of Dense units, is large dense unit data structures are constructed in parallel. Each dense unit is completely represented by a set of dimensions and the corresponding bin indices in those dimensions.

Alok Choudharychoudhar@ece.nwu.edu 24Northwestern University Parallelization (Building CDUs) Total work done by Pi is shown

Alok Choudharychoudhar@ece.nwu.edu 25Northwestern University pMAFIA : ANALYSIS Data parallelism in populating the CDUs in every dimension : effective for massive data sets. Gains of task parallelism realized when data contains large number of dense units (clusters). –A ‘k’ dimensional dense unit allocated just 2k bytes of memory, k bytes for dimensions and k for bin indices. –Data structures in form of linear arrays of bytes: communicate very small message buffers, space optimization. Although bottom-up algorithm is exponential in the data dimension, for low subspace coverage, with use of Adaptive grids and parallel formulation very promising results. –k dimension of the highest dimension dense unit, we explore all possible subspaces of these k dimensions ==> O(c k ) –For k passes over the data set O( (N/pB) * k*Tio), N-total number of records, p-processors, B- records per chunk, Tio-I/O access time for a block of B records. –Communication overhead results in O(Tcomm * S * p * k), Tcomm - constant for communication, S- size of message exchanged,. O( c k + (N/pB) * k*Ti + Tcomm * S * p * k )

Alok Choudharychoudhar@ece.nwu.edu 26Northwestern University pMAFIA Outline Clustering and subspace clustering Base MAFIA Algorithm pMAFIA (parallelization) Performance Results –Adaptivity performance –Scalability with data set size. –Scalability with data dimensionality. –Scalability with cluster dimensionality. –Some Real world data sets.

Alok Choudharychoudhar@ece.nwu.edu 27Northwestern University Quality of Results (a) CLIQUE : Loss of quality: reports ‘pqrs’ as the cluster ! Requires a complicated post processing step. Bin selection and threshold fixing is a non trivial problem. Cannot validate results. (b) MAFIA : Almost exact cluster boundaries recognized – No post processing step required.

Alok Choudharychoudhar@ece.nwu.edu 28Northwestern University CLIQUE-MAFIA 400,000 records in 10 dimensions; 2 clusters in 2 4d subspaces.

Alok Choudharychoudhar@ece.nwu.edu 29Northwestern University Advantage of Adaptive Grids 300,000 records in a 15 Dimension space with 1 cluster of 5 dimensions. –A speedup of 80 obtained over CLIQUE. –CLIQUE failed to produce results with our modified CDU generation algorithm even in 2 hours on 16 processors. –This relatively small data set mined in 32 seconds on 1 processor

Alok Choudharychoudhar@ece.nwu.edu 30Northwestern University Scalability with Data Set Size 20 Dimension data with 5 clusters in 5 different subspaces data sets :up to 11.8 million records Clusters detected in just about 3 minutes on 16 processors ! Almost Linear with the increase in the data set size (because most time in scanning)

Alok Choudharychoudhar@ece.nwu.edu 31Northwestern University Parallelization (on IBM SP2) 30 Dimension data with 8.3 million records, 5 clusters each in a 6 dimension subspace. Near linear speedups Negligible Communication overheads (<1%)

Alok Choudharychoudhar@ece.nwu.edu 32Northwestern University Data Dimensionality 250,000 records, 3 clusters in different 5 dimensional subspaces. Near linear behavior with data dimensionality : Algorithm depends on the maximum number of dimensions in a clusters and not on the data dimensionality.

Alok Choudharychoudhar@ece.nwu.edu 33Northwestern University Cluster Dimensionality 50 dimension data with 1 cluster, 650,000 records; cluster dimension from 3 to 10. Behavior in line with the order of the algorithm : increases with subspace coverage of the cluster.

Alok Choudharychoudhar@ece.nwu.edu 34Northwestern University Scalability on Movie Data 72,916 users rated 1628 movies in 18 months: 2.8 Million ratings 4D data: {user-id, movie-id, weight, score} Discovered seven interesting 2d clusters !

Alok Choudharychoudhar@ece.nwu.edu 35Northwestern University Other Data Sets One day Ahead Prediction of DAX (German Stock Exchange) –DAX prediction data set was based on a 12 input time series like stock indices, bond indices, gold prices, etc –22 dimensions with 2757 records: Major gains from task parallelism –Mined clusters in 8.16 seconds on 8 processors. –Unique clusters discovered in 3,4,5 and 6 dimensional subspaces. Ionosphere data: (UCI repository) –Radar data collected in Goose Bay, Labrador 34 dimension data, 351 records. Discovered unique clusters only in 3 and 4 dimensional sub spaces.

Alok Choudharychoudhar@ece.nwu.edu 36Northwestern University Performance Results Implementation on the distributed memory machine IBM SP2 and on a network of workstations. Massive data sets with more than 10 Million records in very high dimension spaces (>30). An order of two magnitude improvement over existing techniques. Parallelization adds additional scalability to the algorithm –Near linear speedups achieved. –Negligible Communication Overheads. Performance results on both synthetic and real data sets.

Alok Choudharychoudhar@ece.nwu.edu 37Northwestern University Summary and Conclusions for pMAFIA MAFIA : unsupervised subspace clustering algorithm Introduced the adaptive grids formulation First parallel subspace clustering algorithm for massive data sets –Incorporates both task and data parallelism.. pMAFIA : Scalable and Parallel Implementation in data size, no of dimensions

Alok Choudharychoudhar@ece.nwu.edu 38Northwestern University PARSIMONY Overview

Alok Choudharychoudhar@ece.nwu.edu 39Northwestern University Sparse Data Structures and Representations

Alok Choudharychoudhar@ece.nwu.edu 41Northwestern University Using OLAP Framework for

Alok 1Northwestern University PARSYMONY: Scalable Parallel Data Mining Alok N. Choudhary Northwestern University (ACK: Harsha.

Similar presentations

Presentation on theme: "Alok 1Northwestern University PARSYMONY: Scalable Parallel Data Mining Alok N. Choudhary Northwestern University (ACK: Harsha."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Alok 1Northwestern University PARSYMONY: Scalable Parallel Data Mining Alok N. Choudhary Northwestern University (ACK: Harsha.

Similar presentations

Presentation on theme: "Alok 1Northwestern University PARSYMONY: Scalable Parallel Data Mining Alok N. Choudhary Northwestern University (ACK: Harsha."— Presentation transcript:

Similar presentations

About project

Feedback