Distributed Data Mining

1 Distributed Data Mining
ACAI’05/SEKT’05 Advanced Course on Knowledge Discovery: Distributed Data Mining. Dr. Giuseppe Di Fatta, University of Konstanz (Germany) and ICAR-CNR, Palermo (Italy). 5 July 2005.

2 Tutorial Outline Part 1: Overview of High-Performance Computing
Technology trends Parallel and Distributed Computing architectures Programming paradigms Part 2: Distributed Data Mining Classification Clustering Association Rules Graph Mining Conclusions

3 Tutorial Outline Part 1: Overview of High-Performance Computing
Technology trends Moore’s law Processing Memory Communication Supercomputers

4 Units of HPC
Processing: 1 Mflop/s = 1 Megaflop/s = 10^6 Flop/sec; 1 Gflop/s = 1 Gigaflop/s = 10^9 Flop/sec; 1 Tflop/s = 1 Teraflop/s = 10^12 Flop/sec; 1 Pflop/s = 1 Petaflop/s = 10^15 Flop/sec.
Memory: 1 MB = 1 Megabyte = 10^6 Bytes; 1 GB = 1 Gigabyte = 10^9 Bytes; 1 TB = 1 Terabyte = 10^12 Bytes; 1 PB = 1 Petabyte = 10^15 Bytes.

5 How far did we go?

6 1 Tflop - 1 TB sequential machine
Technology Limits. Consider a 1 Tflop/s sequential machine: data must travel some distance, r, to get from memory to CPU; to get one data element per cycle, this must happen 10^12 times per second; at the speed of light, c = 3×10^8 m/s, this means r < c/10^12 = 0.3 mm. Now put 1 TB of storage in a 0.3 mm × 0.3 mm area: each word occupies about 3 Å², the size of a small atom.
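As a sanity check on the arithmetic above, here is a small Python sketch (illustrative only) reproducing the two numbers on the slide:

```python
# Back-of-the-envelope check of the 1 Tflop/s argument above.
c = 3.0e8          # speed of light, m/s
cycles = 1.0e12    # one data element per cycle at 1 Tflop/s

r = c / cycles     # maximum memory-to-CPU distance per cycle
print(f"max memory-CPU distance: {r*1e3:.1f} mm")     # ~0.3 mm

# Packing 10^12 words into an r x r square:
area_per_word = (r ** 2) / 1.0e12                      # m^2 per word
side = area_per_word ** 0.5
print(f"side per word: ~{side*1e10:.1f} Angstrom")     # ~3 Angstrom, atomic scale
```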

7 Moore’s Law (1965) Gordon Moore (co-founder of Intel)
“The complexity for minimum component costs has increased at a rate of roughly a factor of two per year. Certainly over the short term this rate can be expected to continue, if not to increase. Over the longer term, the rate of increase is a bit more uncertain, although there is no reason to believe it will not remain nearly constant for at least 10 years. That means by 1975, the number of components per integrated circuit for minimum cost will be 65,000.”

8 Moore’s Law (1975) In 1975, Moore refined his law:
circuit complexity doubles every 18 months. So far it holds for CPUs and DRAMs! Extrapolation for computing power at a given cost and semiconductor revenues.

9 Technology Trend

10 Technology Trend

11 Technology Trend Processors issue instructions roughly every nanosecond. DRAM can be accessed roughly every 100 nanoseconds. DRAM cannot keep processors busy! And the gap is growing: processors getting faster by 60% per year. DRAM getting faster by 7% per year.

12 Memory Hierarchy Most programs have a high degree of locality in their accesses spatial locality: accessing things nearby previous accesses temporal locality: reusing an item that was previously accessed Memory hierarchy tries to exploit locality.

13 Memory Latency Hiding memory latency:
temporal and spatial locality (caching) multithreading prefetching

14 Communication Topology Latency Bandwidth
Topology: the manner in which the nodes are connected. The best choice would be a fully connected network (every processor to every other), but this is unfeasible for cost and scaling reasons; instead, processors are arranged in some variation of a bus, grid, torus, or hypercube.
Latency: how long does it take to start sending a "message"? Measured in microseconds. (Also in processors: how long does it take to output the results of some operations, such as floating-point add or divide, which are pipelined?)
Bandwidth: what data rate can be sustained once the message is started? Measured in Mbytes/sec.

15 Networking Trend System interconnection network:
bus, crossbar, array, mesh, tree static, dynamic LAN/WAN

16 LAN/WAN
1st network connection in 1969: 50 kbps. At about 10:30 PM on October 29th, 1969, the first ARPANET connection was established between UCLA and SRI over a 50 kbps line provided by the AT&T telephone company. “At the UCLA end, they typed in the 'l' and asked SRI if they received it; 'got the l' came the voice reply. UCLA typed in the 'o', asked if they got it, and received 'got the o'. UCLA then typed in the 'g' and the darned system CRASHED! Quite a beginning. On the second attempt, it worked fine!” (Leonard Kleinrock) 10Base5 Ethernet followed in 1976, by Bob Metcalfe and David Boggs; by the end of the ’90s: 100 Mbps (Fast Ethernet) and 1 Gbps. Bandwidth is not the whole story! Do not forget to consider delay and latency.

17 Delay in packet-switched networks
(1) Nodal processing: check bit errors and determine the output link. (2) Queuing: time spent waiting at the output link for transmission; depends on the congestion level of the router. (3) Transmission delay: with R = link bandwidth (bps) and L = packet length (bits), the time to send the bits into the link is L/R. (4) Propagation delay: with d = length of the physical link and s = propagation speed in the medium (~2×10^8 m/s), the propagation delay is d/s. Note: s and R are very different quantities!
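The two formulas above are easy to evaluate; the following Python sketch uses hypothetical values (a 1500-byte packet over a 100 Mbps, 1000 km link) to show the orders of magnitude involved:

```python
# Hypothetical numbers: a 1500-byte packet over a 100 Mbps, 1000 km link.
L = 1500 * 8        # packet length in bits
R = 100e6           # link bandwidth in bits per second
d = 1.0e6           # physical link length in metres
s = 2.0e8           # propagation speed in the medium, m/s

transmission = L / R      # time to push all bits onto the link
propagation  = d / s      # time for a bit to travel the link
print(f"transmission delay: {transmission*1e6:.1f} us")   # 120.0 us
print(f"propagation delay:  {propagation*1e3:.1f} ms")    # 5.0 ms
```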

18 How long does it take to start sending a "message"?
Latency: how long does it take to start sending a "message"? Latency may be critical for parallel computing. Some LAN technologies provide high bandwidth and low latency, at a price in €. The Scalable Coherent Interface (SCI) is an ANSI/IEEE standard.
Ethernet (1976, 1990s): Mbps, 120 μs, 1.5-5 K€
Infiniband (2001): 850 MBps, 7 μs
Myrinet (1994): 230 MBps, 15 K€
SCI (1992): 320 MBps, 1-2 μs
QsNet (1990, 2003): 900 MBps, 3 μs

19 HPC Trend
~20 years ago: Mflop/s (1×10^6 floating-point ops/sec), scalar based.
~10 years ago: Gflop/s (1×10^9 floating-point ops/sec), vector and shared-memory computing, bandwidth aware, block partitioned, latency tolerant.
~Today: Tflop/s (1×10^12 floating-point ops/sec), highly parallel, distributed processing, message passing, network based, data decomposition, communication/computation.
~5 years away: Pflop/s (1×10^15 floating-point ops/sec), many more levels of memory hierarchy, combination of grids and HPC, more adaptive, latency and bandwidth aware, fault tolerant, extended precision, attention to SMP nodes.

20 TOP500 SuperComputers

21 TOP500 SuperComputers

22 IBM BlueGene/L

23 Tutorial Outline Part 1: Overview of High-Performance Computing
Technology trends Parallel and Distributed Computing architectures Programming paradigms

24 Parallel and Distributed Systems

25 Different Architectures
Parallel computing: single systems with many processors working on the same problem. Distributed computing: many systems loosely coupled by a scheduler to work on related problems. Grid computing (metacomputing): many systems tightly coupled by software, perhaps geographically distributed, to work together on single problems or on related problems. Massively Parallel Processors (MPPs) continue to account for more than half of all installed high-performance computers worldwide (Top500 list). Microprocessor-based supercomputers have brought a major change in accessibility and affordability; nowadays, cluster systems are the fastest-growing part.

26 Classification: Control Model
Flynn’s classical taxonomy (1966) distinguishes multi-processor computer architectures along two independent dimensions, Instruction and Data; each dimension can have only one of two possible states, Single or Multiple: SISD (Single Instruction, Single Data), SIMD (Single Instruction, Multiple Data), MISD (Multiple Instruction, Single Data), MIMD (Multiple Instruction, Multiple Data).

27 SISD Von Neumann Machine Single Instruction, Single Data
A serial (non-parallel) computer Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle Single data: only one data stream is being used as input during any one clock cycle Deterministic execution This is the oldest and until recently, the most prevalent form of computer Examples: most PCs, single CPU workstations and mainframes

28 SIMD Single Instruction, Multiple Data
Single instruction: all processing units execute the same instruction at any given clock cycle. Multiple data: each processing unit can operate on a different data element. This type of machine typically has an instruction dispatcher, a very high-bandwidth internal network, and a very large array of very small-capacity instruction units. Best suited for specialized problems characterized by a high degree of regularity, such as image processing. Synchronous (lockstep) and deterministic execution Two varieties: Processor Arrays and Vector Pipelines Examples: Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2 Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi S820

29 MISD Multiple Instruction, Single Data
Few actual examples of this class of parallel computer have ever existed. Some conceivable examples might be: multiple frequency filters operating on a single signal stream multiple cryptography algorithms attempting to crack a single coded message.

30 MIMD Multiple Instruction, Multiple Data
Currently, the most common type of parallel computer Multiple Instruction: every processor may be executing a different instruction stream. Multiple Data: every processor may be working with a different data stream. Execution can be synchronous or asynchronous, deterministic or non- deterministic. Examples: most current supercomputers, networked parallel computer "grids" and multi-processor SMP computers - including some types of PCs.

31 Classification: Communication Model
Shared vs. Distributed Memory systems

32 Shared Memory: UMA vs. NUMA

33 Distributed Memory: MPPs vs. Clusters
Processor-memory nodes are connected by some type of interconnection network. Massively Parallel Processor (MPP): tightly integrated, single system image. Cluster: individual computers connected by software.

34 Distributed Shared-Memory
Virtual shared memory (shared address space), implemented at the hardware level or at the software level: a global address space spanning all of the memory in the system. E.g., HPF, TreadMarks, software for networks of workstations (JavaParty, Manta, Jackal).

35 Parallel vs. Distributed Computing
Parallel computing usually considers dedicated homogeneous HPC systems to solve parallel problems. Distributed computing extends the parallel approach to heterogeneous general-purpose systems. Both look at the parallel formulation of a problem, but reliability, security and heterogeneity are usually not considered in parallel computing, while they are considered in Grid computing. “A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable.” (Leslie Lamport)

36 Parallel and Distributed Computing
Parallel computing: Shared-Memory SIMD Distributed-Memory SIMD Shared-Memory MIMD Distributed-Memory MIMD Behind DM-MIMD: Distributed computing and Clusters Behind parallel and distributed computing: Metacomputing SCALABILITY

37 Tutorial Outline Part 1: Overview of High-Performance Computing
Technology trends Parallel and Distributed Computing architectures Programming paradigms Programming models Problem decomposition Parallel programming issues

38 Programming Paradigms
Parallel programming models are characterized by: Control (how is parallelism created, what orderings exist between operations, how do different threads of control synchronize); Naming (what data is private vs. shared, how logically shared data is accessed or communicated); Set of operations (what are the basic operations, which operations are considered atomic); Cost (how do we account for the cost of each of the above).

39 Model 1: Shared Address Space
The program consists of a collection of threads of control, each with a set of private variables (e.g., local variables on the stack) and, collectively, a set of shared variables (e.g., static variables, shared common blocks, global heap). Threads communicate implicitly by writing and reading shared variables, and coordinate explicitly by synchronization operations on shared variables (writing and reading flags; locks, semaphores). Like concurrent programming on a uniprocessor.

40 Model 2: Message Passing
The program consists of a collection of named processes: a thread of control plus a local address space (local variables, static variables, common blocks, heap). Processes communicate by explicit data transfers: a matching pair of send and receive by the source and destination processes. Coordination is implicit in every communication event; logically shared data is partitioned over local processes. Like distributed programming. Programs are written with standard libraries: MPI, PVM. Also known as a shared-nothing architecture, or a multicomputer.
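As an illustration of the explicit send/receive style (not part of the original slides), here is a minimal sketch using the mpi4py bindings, assuming an MPI installation is available; it would be launched with something like `mpiexec -n 4 python msg_passing.py`:

```python
# Minimal message-passing sketch using mpi4py (assumed available).
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Each process owns its private data; sharing happens only via explicit messages.
local_value = rank * rank

if rank != 0:
    comm.send(local_value, dest=0, tag=0)          # explicit send to process 0
else:
    total = local_value
    for src in range(1, size):
        total += comm.recv(source=src, tag=0)      # matching receive
    print(f"sum of squares of ranks 0..{size-1}: {total}")
```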

41 Model 3: Data Parallel
A single sequential thread of control consisting of parallel operations. Parallel operations are applied to all (or a defined subset) of a data structure. Communication is implicit in parallel operators and “shifted” data structures. Elegant and easy to understand, but not all problems fit this model. Vector computing.

42 SIMD Machine An SIMD (Single Instruction Multiple Data) machine
A large number of small processors. A single “control processor” issues each instruction; each processor executes the same instruction; some processors may be turned off on any instruction. Such machines are no longer popular (CM-2), but the programming model is still implemented by mapping n-fold parallelism to p processors, mostly in the compilers (HPF = High Performance Fortran).

43 Model 4: Hybrid
Shared-memory machines (SMPs) are the fastest commodity machines; why not build a larger machine by connecting many of them with a network? CLUMP = Cluster of SMPs: shared memory within one SMP, message passing outside (clusters, ASCI Red (Intel), ...). Programming model? Either treat the machine as “flat” and always use message passing, even within an SMP (simple, but it ignores an important part of the memory hierarchy), or expose two layers, shared memory (OpenMP) and message passing (MPI): higher performance, but ugly to program.

44 Hybrid Systems

45 Model 5: BSP Bulk Synchronous Processing (BSP) (L. Valiant, 1990)
Used within the message-passing or shared-memory models as a programming convention. Phases are separated by global barriers. Compute phases: all processes operate on local data (in distributed memory) or have read access to global data (in shared memory). Communication phases: all participate in a rearrangement or reduction of global data. Generally all are doing the “same thing” in a phase (all do f, but they may do different things within f): the simplicity of data parallelism without its restrictions. (Figure: a BSP superstep.)

46 Problem Decomposition
Domain decomposition → data parallel. Functional decomposition → task parallel.

47 Parallel Programming
Directives-based data-parallel languages, such as High Performance Fortran (HPF) or OpenMP: serial code is made parallel by adding directives (which appear as comments in the serial code) that tell the compiler how to distribute data and work across the processors; the details of how data distribution, computation, and communications are to be done are left to the compiler; usually implemented on shared-memory architectures.
Message passing (e.g. MPI, PVM): a very flexible approach based on explicit message passing via library calls from standard programming languages; it is left up to the programmer to explicitly divide data and work across the processors as well as manage the communications among them.
Multi-threading in distributed environments: parallelism is transparent to the programmer; shared-memory or distributed shared-memory systems.

48 Parallel Programming Issues
The main goal of a parallel program is to get better performance over the serial version. Performance evaluation Important issues to take into account: Load balancing Minimizing communication Overlapping communication and computation

49 Speedup
Let Ts be the serial run time and Tp the parallel run time on p processors. The serial fraction fs is the fraction of the work that is inherently sequential, and the parallel fraction is fp = 1 - fs. The speedup is S = Ts / Tp; as p grows it can be linear, sublinear or, apparently, superlinear. Superlinear speedup is, in general, impossible, but it may arise in two cases: memory hierarchy phenomena and search algorithms.

50 Maximum Speedup Amdahl’s Law states that potential program speedup is defined by the fraction of code (fp) which can be parallelized.
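In symbols, with serial fraction fs and parallel fraction fp = 1 - fs, Amdahl's Law can be written as:

```latex
S(p) \;=\; \frac{T_s}{T_p} \;=\; \frac{1}{f_s + f_p/p},
\qquad
S_{\max} \;=\; \lim_{p \to \infty} S(p) \;=\; \frac{1}{f_s} \;=\; \frac{1}{1 - f_p}.
```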

51 fs and fp may not be static
There are limits to the scalability of parallelism: fp = 0.50, 0.90 and 0.99 mean that 50%, 90% and 99% of the code is parallelizable, which caps the speedup at 2, 10 and 100 respectively. However, certain problems demonstrate increased performance by increasing the problem size. Problems which increase the percentage of parallel time with their size are more "scalable" than problems with a fixed percentage of parallel time: fs and fp may not be static.

52 Efficiency
Given the parallel cost C = p·Tp, the efficiency is E = S/p = Ts/(p·Tp). With the total overhead To = p·Tp - Ts, this can be written as E = 1/(1 + To/Ts). In general, the total overhead To is an increasing function of p, growing at least linearly when fs > 0; its sources are communication, extra computation, idle periods due to sequential components, and idle periods due to load imbalance.

53 Cost-optimality of Parallel Systems
A parallel system is composed of a parallel algorithm and a parallel computational platform. A parallel system is cost-optimal if the cost of solving a problem has the same asymptotic growth (in Θ terms, as a function of the input size W) as the fastest known sequential algorithm. As a consequence, in a cost-optimal system p·Tp = Θ(W), i.e. the efficiency is Θ(1).

54 Isoefficiency
For a given problem size, when we increase the number of PEs the speedup and the efficiency decrease; and, in general, as the problem size increases while keeping the number of processors constant, the efficiency increases. How much do we need to increase the problem size to keep the efficiency constant? Isoefficiency is a metric for scalability: in a scalable parallel system, when increasing the number of PEs, the efficiency can be kept constant by increasing the problem size W. Of course, for different problems the rate at which W must be increased may vary; this rate determines the degree of scalability of the system.

55 Sources of Parallel Overhead
Total parallel overhead: To = p·Tp - Ts. Interprocessor communication: if each PE spends Tcomm time on communications, the overhead increases by p·Tcomm. Load imbalance: if it exists, some PEs will be idle while others are busy, and the idle time of any PE contributes to the overhead time; load imbalance always occurs if there is a strictly sequential component of the algorithm, and it often occurs at the end of the run because of asynchronous termination (e.g. in coarse-grain parallelism). Extra computation: a parallel version of the fastest sequential algorithm may not be straightforward, and the additional computation needed in the parallel algorithm also contributes to the overhead time.

56 Load Balancing
Load balancing is the task of equally dividing the work among the available processes. The range of load balancing problems is determined by task costs, task dependencies and locality needs, with a spectrum of solutions from static to dynamic. A closely related problem is scheduling, which determines the order in which tasks run.

57 Different Load Balancing Problems
Load balancing problems differ in: task costs (do all tasks have equal costs? if not, when are the costs known: before starting, when the task is created, or only when it ends?); task dependencies (can all tasks be run in any order, including in parallel? if not, when are the dependencies known?); locality (is it important for some tasks to be scheduled on the same processor, or nearby, to reduce communication cost? when is the information about communication between tasks known?).

58 Task cost

59 Task Dependency
Examples: Cholesky decomposition A = U^T U, where U is an upper triangular matrix, and LU decomposition A = LU, where L is lower triangular and U is upper triangular (e.g. data/control dependencies at the end/beginning of task executions).

60 Task Locality (e.g. data/control dependencies during task executions)
Example: partial differential equations.

61 Spectrum of Solutions Static scheduling. All information is available to scheduling algorithm, which runs before any real computation starts. (offline algorithms) Semi-static scheduling. Information may be known at program startup, or the beginning of each timestep, or at other well-defined points. Offline algorithms may be used even though the problem is dynamic. Dynamic scheduling. Information is not known until mid-execution. (online algorithms)

62 LB Approaches Static load balancing Semi-static load balancing
Self-scheduling (manager-workers) Distributed task queues Diffusion-based load balancing DAG scheduling (graph partitioning is NP-complete) Mixed Parallelism

63 Distributed and Dynamic LB
Dynamic load balancing algorithms, aka work stealing/donating. Basic idea, when applied to search trees: each processor performs the search on a disjoint part of the tree; when finished, it gets work from a processor that is still busy. Requires asynchronous communication. (Figure, receiver state machine: do a fixed amount of work; if idle, select a processor and request work; service pending messages; if work is found, resume, otherwise repeat.)

64 Selecting a Donor Basic distributed algorithms:
Asynchronous Round Robin (ARR): each processor k keeps a variable target_k; when a processor runs out of work, it requests work from target_k and sets target_k = (target_k + 1) % procs. Nearest Neighbor (NN): round robin over the neighbors; takes the topology into account (as diffusive techniques do); load balancing is somewhat slower than with randomized approaches. Global Round Robin (GRR): processor 0 keeps a single variable target; when a processor needs work, it reads target and sends a request to that processor; P0 increments target (mod procs) with each access. Random polling/stealing: when a processor needs work, select a random processor and request work from it.
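A minimal Python sketch of two of these donor-selection policies (function and variable names are illustrative, not from the original code):

```python
import random

def next_target_arr(target_k, p):
    """Asynchronous Round Robin: request from the current target, then advance it."""
    donor = target_k
    return donor, (target_k + 1) % p

def next_target_random(my_rank, p):
    """Random polling: pick any processor other than ourselves."""
    donor = random.randrange(p)
    while donor == my_rank:
        donor = random.randrange(p)
    return donor
```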

65 Tutorial Outline Part 2: Distributed Data Mining Classification
Clustering Association Rules Graph Mining

66 Knowledge Discovery in Databases
Knowledge Discovery in Databases (KDD) is a non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. (Figure, the KDD process: operational databases are cleaned, collected and summarized into a data warehouse; data preparation yields the training data; data mining produces models and patterns, which undergo verification and evaluation.)

67 Origins of Data Mining KDD draws ideas from machine learning/AI, pattern recognition, statistics, database systems, and data visualization. Prediction Methods Use some variables to predict unknown or future values of other variables. Description Methods Find human-interpretable patterns that describe the data. Traditional techniques may be unsuitable enormity of data high dimensionality of data heterogeneous, distributed nature of data

68 Speeding up Data Mining
Data oriented approach Discretization Feature selection Feature construction (PCA) Sampling Methods oriented approach Efficient and scalable algorithms

69 Speeding up Data Mining
Methods oriented approach (contd.) Distributed and parallel data-mining Task or control parallelism Data parallelism Hybrid parallelism Distributed-data mining Voting Meta-learning, etc.

70 Tutorial Outline Part 2: Distributed Data Mining Classification
Clustering Association Rules Graph Mining

71 What is Classification?
Classification is the process of assigning new objects to predefined categories or classes Given a set of labeled records Build a model (e.g. a decision tree) Predict labels for future unlabeled records

72 Classification learning
Supervised learning (labels are known) Example described in terms of attributes Categorical (unordered symbolic values) Numeric (integers, reals) Class (output/predicted attribute): categorical for classification numeric for regression

73 Classification learning
Training set: set of examples, where each example is a feature vector (i.e., a set of <attribute,value> pairs) with its associated class. The model is built on this set. Test set: a set of examples disjoint from the training set, used for testing the accuracy of a model.

74 Classification: Example
(Figure: a training set with two categorical attributes, one continuous attribute and a class label is used to learn a classifier model, which is then applied to a test set.)

75 Classification Models
Some models are better than others in accuracy and in understandability; models range from easy to understand to incomprehensible, roughly from easier to harder: decision trees, rule induction, regression models, genetic algorithms, Bayesian networks, neural networks.

76 Decision Trees Decision tree models are better suited for data mining:
Inexpensive to construct Easy to Interpret Easy to integrate with database systems Comparable or better accuracy in many applications

77 Decision Trees: Example
(Example tree over categorical, categorical and continuous attributes with a class label: split on Refund; Refund = Yes → NO; Refund = No → split on MarSt; Married → NO; Single, Divorced → split on TaxInc; TaxInc < 80K → NO, TaxInc > 80K → YES.) The splitting attribute at a node is determined based on the Gini index.

78 From Tree to Rules
1) Refund = Yes → NO. 2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K → NO. 3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K → YES. 4) Refund = No and MarSt in {Married} → NO.

79 Decision Trees: Sequential Algorithms
Many algorithms: Hunt’s algorithm (one of the earliest) CART ID3, C4.5 SLIQ, SPRINT General structure: Tree induction Tree pruning

80 Classification algorithm
Build tree: start with the data at the root node; select an attribute and formulate a logical test on that attribute; branch on each outcome of the test, moving the subset of examples satisfying that outcome to the corresponding child node; recurse on each child node; repeat until leaves are “pure”, i.e., have examples from a single class, or “nearly pure”, i.e., the majority of examples are from the same class. Prune tree: remove subtrees that do not improve classification accuracy, to avoid over-fitting, i.e., training-set-specific artifacts.
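A compact, self-contained Python sketch of this induction loop for categorical attributes, using the Gini index as the selection measure (pruning omitted; names and stopping rule are illustrative, not the course code):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_i p(i)^2."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(rows, labels, attrs):
    """Pick the categorical attribute whose partition has the lowest weighted Gini."""
    best = None
    for a in attrs:
        parts = {}
        for row, y in zip(rows, labels):
            parts.setdefault(row[a], []).append(y)
        score = sum(len(ys) / len(labels) * gini(ys) for ys in parts.values())
        if best is None or score < best[1]:
            best = (a, score)
    return best[0]

def build_tree(rows, labels, attrs):
    """Recursive induction following the steps above; rows are dicts attr -> value."""
    if len(set(labels)) == 1 or not attrs:           # pure node or nothing to split on
        return Counter(labels).most_common(1)[0][0]  # leaf: majority class
    a = best_split(rows, labels, attrs)
    node = {"attr": a, "children": {}}
    for v in set(row[a] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[a] == v]
        node["children"][v] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx],
            [x for x in attrs if x != a])
    return node
```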

81 Build tree Evaluate split-points for all attributes
Select the “best” point and the “winning” attribute Split the data into two Breadth/depth-first construction CRITICAL STEPS: Formulation of good split tests Selection measure for attributes

82 How to capture good splits?
Occam’s razor: prefer the simplest hypothesis that fits the data. Minimum message/description length: given a dataset D and hypotheses H1, H2, …, Hx describing D, MML(Hi) = Mlength(Hi) + Mlength(D|Hi); pick the Hk with minimum MML. Mlength is given by the Gini index, gain, etc.

83 Tree pruning using MDL Data encoding: sum classification errors
Model encoding: encode the tree structure and the split points. Pruning: choose the smallest-length option among converting the node to a leaf, pruning the left or right child, or doing nothing.

84 Hunt’s Method
Attributes: Refund (Yes, No), Marital Status (Single, Married, Divorced), Taxable Income. Class: Cheat, Don’t Cheat. (Figure: the tree is grown step by step, first splitting on Refund (Yes → Don’t Cheat), then on Marital Status (Married → Don’t Cheat), then on Taxable Income (< 80K → Don’t Cheat, >= 80K).)

85 What’s really happening?
(Figure: the splits on Marital Status and Income < 80K partition the attribute space into axis-parallel regions labelled Cheat / Don’t Cheat.)

86 Finding good split points
Use the Gini index for partition purity: Gini(S) = 1 - Σ_i p(i)^2, where p(i) is the frequency of class i in the node. If S is pure, Gini(S) = 1 - 1 = 0. Find the split point with minimum Gini; only the class distributions are needed.
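A small worked example of the formula in Python, with hypothetical class counts:

```python
def gini(class_counts):
    """Gini index from class frequencies: 1 - sum_i p(i)^2."""
    n = sum(class_counts)
    return 1.0 - sum((c / n) ** 2 for c in class_counts)

print(gini([4, 0]))    # pure node     -> 0.0
print(gini([2, 2]))    # 50/50 split   -> 0.5

# Weighted Gini of a binary split: n_left/n * gini(left) + n_right/n * gini(right)
left, right = [3, 1], [0, 4]
n = sum(left) + sum(right)
print(sum(left) / n * gini(left) + sum(right) / n * gini(right))  # 0.1875
```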

87 Finding Good Split Points
(Figure: two candidate splits on Income for the Cheat / Don’t Cheat example with Marital Status, with Gini(split) = 0.34 and Gini(split) = 0.31; the split with the lower value is chosen.)

88 Categorical Attributes: Computing Gini Index
For each distinct value, gather counts for each class in the dataset Use the count matrix to make decisions Multi-way split Two-way split (find best partition of values)

89 Decision Trees: Parallel Algorithms
Approaches for Categorical Attributes: Synchronous Tree Construction (data parallel) no data movement required high communication cost as tree becomes bushy Partitioned Tree Construction (task parallel) processors work independently once partitioned completely load imbalance and high cost of data movement Hybrid Algorithm combines good features of two approaches adapts dynamically according to the size and shape of trees

90 Synchronous Tree Construction
Partitioning of data only: a global reduction per tree node is required; a large number of classification tree nodes gives a high communication cost. (Figure: n records with m categorical attributes are partitioned across processors; each keeps local class-count tables, e.g. Good/Bad counts for attribute values such as family and sport, which are reduced globally.)

91 Partitioned Tree Construction
Partitioning of classification tree nodes: natural concurrency, but load imbalance as the amount of work associated with each node varies; child nodes use the same data as the parent node, so there is a loss of locality and a high data-movement cost. (Figure: 10,000 training records are split into child nodes of 7,000 and 3,000 records, and further into nodes of 2,000, 5,000 and 1,000.)

92 Synchronous Tree Construction
Partition data across processors: no data movement is required and load imbalance can be eliminated by breadth-first expansion, but the communication cost becomes too high in the lower parts of the tree.

93 Partitioned Tree Construction
Partition Data and Nodes Highly concurrent High communication cost due to excessive data movements Load imbalance

94 Hybrid Parallel Formulation
(Figure: the hybrid formulation starts with synchronous tree construction and switches to partitioned tree construction when the switch criterion is met.)

95 Load Balancing

96 Switch Criterion
The hybrid formulation switches to Partitioned Tree Construction when the communication cost of continuing synchronously becomes too high; the precise switch condition and the guarantee provided by the splitting criterion are given in: A. Srivastava, E.-H. Han, V. Kumar, and V. Singh, Parallel Formulations of Decision-Tree Classification Algorithms, Data Mining and Knowledge Discovery: An International Journal, vol. 3, no. 3, September 1999.

97 Speedup Comparison
(Figure: speedup curves, against the linear ideal, for the synchronous, partitioned and hybrid algorithms on data sets of 0.8 million and 1.6 million examples.)

98 Speedup of the Hybrid Algorithm with Different Size Data Sets

99 Scaleup of the Hybrid Algorithm

100 Summary of Algorithms for Categorical Attributes
Synchronous Tree Construction Approach no data movement required high communication cost as tree becomes bushy Partitioned Tree Construction Approach processors work independently once partitioned completely load imbalance and high cost of data movement Hybrid Algorithm combines good features of two approaches adapts dynamically according to the size and shape of trees

101 Tutorial Outline Part 2: Distributed Data Mining Classification
Clustering Association Rules Graph Mining

102 Clustering: Definition
Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that Data points in one cluster are more similar to one another Data points in separate clusters are less similar to one another

103 Clustering
Given N k-dimensional feature vectors, find a “meaningful” partition of the N examples into c subsets or groups; discover the “labels” automatically. c may be given, or “discovered”. Clustering is much more difficult than classification, since in the latter the groups are given and we only seek a compact description of them.

104 Clustering Illustration
k = 3; Euclidean-distance-based clustering in 3-D space: intracluster distances are minimized, intercluster distances are maximized.

105 Clustering Have to define some notion of “similarity” between examples
Similarity measures: Euclidean distance if attributes are continuous, or other problem-specific measures. Goal: maximize intra-cluster similarity and minimize inter-cluster similarity. Feature vectors may be all numeric (well-defined distances) or all categorical / mixed (harder to define similarity; geometric notions don’t work).

106 Clustering Schemes
Distance-based, numeric: Euclidean distance (root of the sum of squared differences along each dimension) or the angle between two vectors. Distance-based, categorical: number of common features. Partition-based: enumerate partitions and score each.

107 Clustering schemes Model-based
Estimate a density (e.g., a mixture of Gaussians) and go bump-hunting: compute P(feature vector i | cluster j). This also finds overlapping clusters. Example: Bayesian clustering.

108 Before clustering Normalization: Given three attributes
A in microseconds, B in milliseconds, C in seconds. We can’t treat differences as the same in all dimensions or attributes: we need to scale or normalize them for comparison, and we can assign weights to give some attributes more importance.

109 The k-means algorithm Specify ‘k’, the number of clusters
Guess k seed cluster centers. 1) Look at each example and assign it to the center that is closest. 2) Recalculate the centers. Iterate steps 1 and 2 until the centers converge, or for a fixed number of iterations.
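A plain Python sketch of these steps (illustrative only, not the course code):

```python
import random

def kmeans(points, k, iters=100):
    """Plain k-means on a list of d-dimensional points (lists of floats)."""
    centers = random.sample(points, k)                 # guess k seed centers
    for _ in range(iters):
        # Step 1: assign each point to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sum((a - b) ** 2
                                                for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Step 2: recompute each center as the mean of its cluster.
        new_centers = [
            [sum(dim) / len(cl) for dim in zip(*cl)] if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:                     # converged
            break
        centers = new_centers
    return centers
```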

110 K-means algorithm Initial seeds

111 K-means algorithm New centers

112 K-means algorithm Final centers

113 Operations in k-means Main Operation: Calculate distance to all k means or centroids Other operations: Find the closest centroid for each point Calculate mean squared error (MSE) for all points Recalculate centroids

114 Parallel k-means Divide N points among P processors
Replicate the k centroids on every processor. Each processor computes the distance of each local point to the centroids, assigns each point to the closest centroid and computes the local MSE. A reduction then produces the global centroids and the global MSE value.
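A sketch of one such iteration using mpi4py and NumPy (both assumed available; array shapes and names are illustrative):

```python
# Sketch of one parallel k-means iteration with mpi4py/NumPy.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def kmeans_step(local_points, centroids):
    """local_points: (n_local, d) array on this process; centroids: (k, d), replicated."""
    # Distance of every local point to every centroid, then closest assignment.
    d2 = ((local_points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    assign = d2.argmin(axis=1)
    local_mse = d2[np.arange(len(local_points)), assign].sum()

    k, dim = centroids.shape
    sums = np.zeros((k, dim))
    counts = np.zeros(k)
    for j in range(k):
        members = local_points[assign == j]
        sums[j] = members.sum(axis=0)
        counts[j] = len(members)

    # Global reduction: every process ends up with the same new centroids and MSE.
    sums = comm.allreduce(sums, op=MPI.SUM)
    counts = comm.allreduce(counts, op=MPI.SUM)
    mse = comm.allreduce(local_mse, op=MPI.SUM)
    new_centroids = sums / np.maximum(counts, 1)[:, None]
    return new_centroids, mse
```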

115 Serial and Parallel k-means
Group communication

116 Serial k-means Complexity
Each iteration computes the distance of all n points to the k centroids in d dimensions, i.e. O(n·k·d) work per iteration.

117 Parallel k-means Complexity
Each processor does O(n·k·d / p) work on its local points per iteration, plus a reduction whose cost depends on the physical communication topology, e.g. O(k·d·log p) in a hypercube.

118 Speedup and Scaleup Condition for linear speedup
Condition for linear scaleup (w.r.t. n)

119 Tutorial Outline Part 2: Distributed Data Mining Classification
Clustering Association Rules Frequent Itemset Mining Graph Mining

120 ARM: Definition
Given a set of records, each of which contains some number of items from a given collection:

121 ARM Definition Given a set of items/attributes, and a set of objects containing a subset of the items Find rules: if I1 then I2 (sup, conf) I1, I2 are sets of items I1, I2 have sufficient support: P(I1+I2) Rule has sufficient confidence: P(I2|I1)

122 Association Mining User specifies “interestingness”
Minimum support (minsup) and minimum confidence (minconf). Find all frequent itemsets (support > minsup): exponential search space, computation and I/O intensive. Generate strong rules (confidence > minconf): relatively cheap.

123 Association Rule Discovery: Support and Confidence
Example: for a rule I1 → I2, the support is the fraction of transactions that contain all items of I1 ∪ I2, and the confidence is the fraction of the transactions containing I1 that also contain I2, i.e. support(I1 ∪ I2) / support(I1).
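A tiny Python illustration of these two measures on a made-up transaction set:

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    """confidence(lhs -> rhs) = support(lhs U rhs) / support(lhs)."""
    return support(set(lhs) | set(rhs), transactions) / support(lhs, transactions)

transactions = [
    {"milk", "bread", "diapers"},
    {"bread", "beer"},
    {"milk", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
]
print(support({"milk", "diapers"}, transactions))          # 0.75
print(confidence({"milk"}, {"diapers"}, transactions))     # 1.0
```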

124 Handling Exponential Complexity
Given n transactions and m different items, the number of possible association rules and the computation complexity grow exponentially in m. Systematic search for all patterns is based on the support constraint: if {A,B} has support at least a, then both A and B have support at least a; if either A or B has support less than a, then {A,B} has support less than a. Use patterns of k-1 items to find patterns of k items.

125 Apriori Principle Collect single item counts. Find large items.
Find candidate pairs, count them => large pairs of items. Find candidate triplets, count them => large triplets of items, and so on... Guiding Principle: every subset of a frequent itemset has to be frequent. Used for pruning many candidates.
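The candidate-generation and pruning step can be sketched in a few lines of Python (illustrative; a simple pairwise-union variant of the usual k-1 join is used here):

```python
from itertools import combinations

def apriori_gen(frequent_k):
    """Generate (k+1)-candidates from frequent k-itemsets and prune by the
    Apriori principle: every k-subset of a candidate must itself be frequent."""
    frequent_k = set(frequent_k)                       # frozensets of size k
    k = len(next(iter(frequent_k)))
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1:
                candidates.add(union)
    # Prune candidates that have any infrequent k-subset.
    return {c for c in candidates
            if all(frozenset(s) in frequent_k for s in combinations(c, k))}

freq_2 = {frozenset(p) for p in [("bread", "milk"), ("bread", "diapers"),
                                 ("milk", "diapers"), ("milk", "beer")]}
print(apriori_gen(freq_2))   # only {bread, milk, diapers} survives pruning
```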

126 Illustrating Apriori Principle
Items (1-itemsets), then pairs (2-itemsets), then triplets (3-itemsets), with minimum support = 3. If every subset is considered, 6C1 + 6C2 + 6C3 = 41 candidates must be counted; with support-based pruning, only 14 are.

127 Counting Candidates
Frequent itemsets are found by counting candidates. The simple way, searching for each of the M candidates in each of the N transactions, is expensive!

128 Association Rule Discovery: Hash tree for fast access.
(Figure: a candidate hash tree; the hash function maps items to the branches {1,4,7}, {2,5,8}, {3,6,9}, and the candidate 3-itemsets are stored in the leaves.)

129 Association Rule Discovery: Subset Operation
(Figure: subsetting a transaction against the candidate hash tree; the transaction's items are hashed with the same function, 1,4,7 / 2,5,8 / 3,6,9, to select the branches to visit.)

130 Association Rule Discovery: Subset Operation (contd.)
(Figure: the subset operation, continued; deeper levels of the hash tree are reached by hashing the remaining items of the transaction, so most candidates are never compared against it.)

131 Parallel Formulation of Association Rules
Large-scale problems have: Huge Transaction Datasets (10s of TB) Large Number of Candidates. Parallel Approaches: Partition the Transaction Database, or Partition the Candidates, or Both

132 Parallel Association Rules: Count Distribution (CD)
Each Processor has complete candidate hash tree. Each Processor updates its hash tree with local data. Each Processor participates in global reduction to get global counts of candidates in the hash tree. Multiple database scans per iteration are required if hash tree too big for memory.
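A sketch of one CD pass using mpi4py and NumPy (both assumed available); the candidate set is replicated and a single Allreduce produces the global counts:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD

def count_distribution_step(candidates, local_transactions, minsup_count):
    """candidates: list of frozensets (replicated on every process);
    local_transactions: this process's share of the database (sets of items)."""
    local_counts = np.zeros(len(candidates), dtype=np.int64)
    for t in local_transactions:
        for i, c in enumerate(candidates):
            if c <= t:                                   # candidate contained in t
                local_counts[i] += 1
    # A single global reduction per pass: everyone obtains identical global counts.
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)
    return [c for c, n in zip(candidates, global_counts) if n >= minsup_count]
```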

133 CD: Illustration
Global reduction of counts. (Figure: each of the processors P0, P1, P2 holds N/p transactions and the full candidate set {1,2}, {1,3}, {2,3}, {3,4}, {5,8} with its local counts; a global reduction sums the counts across processors.)

134 Parallel Association Rules: Data Distribution (DD)
Candidate set is partitioned among the processors. Once local data has been partitioned, it is broadcast to all other processors. High Communication Cost due to data movement. Redundant work due to multiple traversals of the hash trees.

135 DD: Illustration
All-to-all broadcast of data. (Figure: the candidate set is partitioned among P0, P1, P2, each holding N/p local data plus remote data received via broadcast; each processor counts only its own candidates, e.g. {1,2}: 9, {2,3}: 12, {5,8}: 17, {1,3}: 10, {3,4}: 10.)

136 Parallel Association Rules: Intelligent Data Distribution (IDD)
Data Distribution using point-to-point communication. Intelligent partitioning of candidate sets. Partitioning based on the first item of candidates. Bitmap to keep track of local candidate items. Pruning at the root of candidate hash tree using the bitmap. Suitable for single data source such as database server. With smaller candidate set, load balancing is difficult.

137 IDD: Illustration
(Figure: candidates are partitioned by their first item and tracked with a per-processor bitmask, e.g. 1 / 2,3 / 5; data is shifted among processors point-to-point rather than broadcast, and each processor counts its own candidates, e.g. {1,2}: 9, {2,3}: 12, {5,8}: 17, {1,3}: 10, {3,4}: 10.)

138 Filtering Transactions in IDD
(Figure: using the bitmask of locally assigned candidate items, e.g. {1,3,5}, hash-tree branches and transaction items that cannot begin any local candidate are skipped at the root.)

139 Parallel Association Rules: Hybrid Distribution (HD)
The candidate set is partitioned into G groups, each just fitting in main memory; this ensures good load balance even with a smaller candidate set. A logical G × P/G processor mesh is formed. IDD is performed along the column processors, so data movement among processors is minimized; CD is performed along the row processors, so fewer processors take part in the global reduction operation.

140 HD: Illustration
(Figure: a logical mesh of G groups of processors × P/G processors per group; IDD is performed along the columns, with an all-to-all broadcast of the candidates C0, C1, C2 within each column, and CD is performed along the rows; each processor holds N/P transactions, i.e. N/(P/G) per group.)

141 Parallel Association Rules: Comments
HD has shown the same linear speedup and sizeup behavior as that of CD. HD exploits total aggregate main memory, while CD does not. IDD has much better scaleup behavior than DD.

142 Tutorial Outline Part 2: Distributed Data Mining Classification
Clustering Association Rules Graph Mining Frequent Subgraph Mining

143 Graph Mining
Market basket analysis → Association Rule Mining (ARM) → find frequent itemsets. Search space: unstructured data, where only the item type is important; an item set I with |I| = n has power set P(I) with |P(I)| = 2^n; pruning techniques make the search feasible. Subset test: for each user transaction t and each candidate frequent itemset s we need a subset test in order to compute the support (frequency).
Molecular compound analysis → Frequent Subgraph Mining (FSM) → find frequent subgraphs. Bigger search space: structured data, where atom types are not sufficient because atoms have bonds with other atoms. Subgraph isomorphism test: for each graph and each candidate frequent subgraph we need a subgraph isomorphism test. N.B.: for general graphs the subgraph isomorphism test is NP-complete.

144 Molecular Fragment Lattice
(Figure: the lattice of molecular fragments, grown from the empty fragment {} through single atoms C, O, S, N, two-atom fragments such as C-C, S-O, C-S, S-N, S=N, and three-atom fragments such as C-S-O, N-S-O, C-S-N, C-S=N, C-C-S, up to C-C-S-N, with minSupp = 50%.)

145 Mining Molecular Fragments
Frequent molecular fragments → Frequent Subgraph Mining (FSM). Discriminative molecular fragments: molecular compounds are classified into active compounds (the focus subset F) and inactive compounds (the complement subset C). Problem definition: find all discriminative molecular fragments, which are frequent in the set of the active compounds and not frequent among the inactive compounds, i.e. contrast substructures. User parameters: minSupp (minimum support in the focus dataset) and maxSupp (maximum support in the complement dataset).

146 Molecular Fragment Search Tree
A search tree node represents a molecular fragment. Successor nodes are generated by extending the fragment by one bond and, possibly, one atom. (Figure: discriminative fragments lie above the minSupp threshold in the focus set F and below the maxSupp threshold in the complement set C.)

147 Large-Scale Issue
Need for scalability in terms of: the number of molecules (larger main and secondary memory to store the molecules; fragments with a longer list of instances in the molecules, i.e. embeddings); the size of the molecules (larger memory to store larger molecules; fragments with a longer list of longer embeddings; more fragments, i.e. a bigger search space); the minimum support (with a lower support threshold the mining algorithm produces more embeddings for each fragment, more fragments and longer search tree branches, i.e. a bigger search space).

148 High-Performance Distributed Approach
Sequential algorithms cannot handle large-scale problems and the small values of the user parameters needed for better-quality results. Distributed mining of molecular fragments: search space partitioning and data partitioning. Search space partitioning: a distributed implementation of backtracking, with an external representation (enhanced SMILES), DB selection and projection, and tree-based reduction. Dynamic load balancing for irregular problems: donor selection and a work-splitting mechanism; peer-to-peer computing.

149 Search Space Partitioning

150 Search Space Partitioning
A 4th kind of search-tree pruning, “distributed computing pruning”: prune a search node, generate a new job and assign it to an idle processor, with asynchronous communication and low overhead. Backtracking is particularly suitable for parallel processing because the subtree rooted at any node can be searched independently.

151 Tree-based Reduction
Star reduction (master-slave) requires O(p) communication steps; tree reduction, e.g. on a 3-D hypercube (p = 8), requires O(log(p)).

152 Job Assignment and Parallel Overheads
Parallel computing overhead = communication + excess computation + idling periods. (Figure: a donor splits its search stack and sends a job in external representation (enhanced SMILES) to an idle receiver, which performs DB selection & projection and embeds the fragment; the first job assignment, later job assignments and termination detection are overlapped with useful computation; latency and donor delay contribute to the receiver's idle time; DLB handles the irregular problem.)

153 Parallel Execution Analysis
A worker’s execution (total time T0) decomposes into: setup (configuration message, DB loading); idle1 (waiting for the first job assignment, due to the initial sequential part); jobs (job processing time, including computational overhead); idle2 (processor starvation); idle3 (idle period due to load imbalance). A single job execution decomposes into: data (data preprocessing), prep (preparing the root search node, i.e. embedding the core fragment) and mining (the data mining processing, the useful work).

154 Load Balancing
The search space is not known a priori and is very irregular. Dynamic load balancing: a receiver-initiated approach, with donor selection and a work-splitting mechanism. The DLB determines the overall performance and efficiency.

155 Highly Irregular Problem
Both the search-tree node visit time (subtree visit) and the node expand time (node extension) follow a power-law distribution.

156 Work Splitting A search tree node n can be donated only if:
1) stackSize() >= minStackSize, 2) support(n) >= (1 + α) * minSupp 3) lxa(n) <= β * atomCount(n)
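Expressed as a predicate, a minimal sketch (support(), lxa() and atom_count() are assumed node accessors, and alpha and beta are the tuning parameters above):

```python
def can_donate(node, stack_size, min_stack_size, min_supp, alpha, beta):
    """Work-splitting test from the slide; the node accessors are hypothetical."""
    return (stack_size >= min_stack_size
            and node.support() >= (1 + alpha) * min_supp
            and node.lxa() <= beta * node.atom_count())
```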

157 Dynamic Load Balancing
Receiver-initiated approaches: Random Polling (RP); scheduler-based (MS), with excellent scalability; Quasi-Random Polling (QRP), an optimal solution. The QRP policy: a global list of potential donors, sorted w.r.t. the running time (the server collects job statistics and each receiver periodically gets an updated donor list), within a P2P computing framework; the receiver selects a random donor according to a probability distribution decreasing with the donor's rank in the list, giving a high probability of choosing long-running jobs.

158 Issues Penalty for global synchronization “Adaptive” application
Asynchronous communication; a highly irregular problem; difficulty in predicting workloads; heavy workloads may delay message processing; a large-scale, multi-domain, heterogeneous computing environment; network latency and delay tolerance.

159 Tutorial Outline Part 1: Overview of High-Performance Computing
Technology trends Parallel and Distributed Computing architectures Programming paradigms Part 2: Distributed Data Mining Classification Clustering Association Rules Graph Mining Conclusions

160 Large-scale Parallel KDD Systems
Data Terabyte-sized datasets Centralized or distributed datasets Incremental changes (refine knowledge as data changes) Heterogeneous data sources

161 Large-scale Parallel KDD Systems
Software Pre-processing, mining, post-processing Interactive (anytime mining) Modular (rapid development) Web services Workflow management tool integration Fault and latency tolerant Highly scalable

162 Large-scale Parallel KDD Systems
Computing Infrastructure Clusters already widespread Multi-domain heterogeneous Data and computational Grids Dynamic resource aggregation (P2P) Self-managing

163 Research Directions Fast algorithms: different mining tasks
Classification, clustering, associations, etc. Incorporating concept hierarchies Parallelism and scalability Millions of records Thousands of attributes/dimensions Single pass algorithms Sampling Parallel I/O and file systems

164 Research Directions (contd.)
Parallel Ensemble Learning parallel execution of different data mining algorithms and techniques that can be integrated to obtain a better model. Not just high performance but also high accuracy

165 Research Directions (contd.)
Tight database integration Push common primitives inside DBMS Use multiple tables Use efficient indexing techniques Caching strategies for sequence of data mining operations Data mining query language and parallel query optimization

166 Research Directions (contd.)
Understandability: too many patterns Incorporate background knowledge Integrate constraints Meta-level mining Visualization, exploration Usability: build a complete system Pre-processing, mining, post-processing, persistent management of mined results

167 Conclusions Data mining is a rapidly growing field
Fueled by enormous data collection rates, and need for intelligent analysis for business and scientific gains. Large and high-dimensional data requires new analysis techniques and algorithms. High Performance Distributed Computing is becoming an essential component in data mining and data exploration. Many research and commercial opportunities.

168 Resources Workshops Books
IEEE IPDPS Workshop on Parallel and Distributed Data Mining HiPC Special Session on Large-Scale Data Mining ACM SIGKDD Workshop on Distributed Data Mining IEEE IPDPS Workshop on High Performance Data Mining ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems ACM SIGKDD Workshop on Distributed Data Mining, IEEE IPPS Workshop on High Performance Data Mining LifeDDM, Distributed Data Mining in Life Science Books A. Freitas and S. Lavington. Mining very large databases with parallel processing. Kluwer Academic Pub., Boston, MA, 1998. M. J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000. H. Kargupta and P. Chan (eds). Advance in Distributed and Parallel Knowledge Discovery, AAAI Press, Summer 2000.

169 References Journal Special Issues Survey Articles
P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD, Data Mining and Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997. Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining, Data Mining and Knowledge Discovery: An International Journal, Vol. 3, No. 3, September 1999. V. Kumar, S. Ranka and V. Singh. High Performance Data Mining, Journal of Parallel and Distributed Computing, Vol. 61, No. 3, March 2001. M. J. Zaki and Y. Pan. Special Issue on Parallel and Distributed Data Mining, Distributed and Parallel Databases: An International Journal, forthcoming, 2001. P. Srimani, D. Talia, Parallel Data Intensive Algorithms and Applications, Parallel Computing, forthcoming, 2001. Survey Articles F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(2): , 1999. A. Srivastava, E.-H. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3): , 1999. M. J. Zaki. Parallel and distributed association mining: A survey. In IEEE Concurrency special issue on Parallel Data Mining, 7(4):14-25, Oct-Dec 1999. D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26--35, Oct-Dec 1999. M. V. Joshi, E.-H. Han, G. Karypis and V. Kumar. Efficient parallel algorithm for mining associations. In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000. M. J. Zaki. Parallel and distributed data mining: An introduction. In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.

170 References: Classification
J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H. W. To, and Y. Dan. Large scale data mining: Challenges and responses. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997. S. Goil and A. Choudhary. Efficient parallel classification using dimensional aggregates. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000. M. Holsheimer, M. L. Kersten, and A. Siebes. Data surveyor: Searching the nuggets in parallel. In Fayyad et al.(eds.), Advances in KDD, AAAI Press, 1996. M. Joshi, G. Karypis, and V. Kumar. ScalParC: A scalable and parallel classification algorithm for mining large datasets. In Intl. Parallel Processing Symposium, 1998. R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C. Suttner, editors, Parallel Processing for Artificial Intelligence 3. Elsevier-Science, 1997. S. Lavington, N. Dewhurst, E. Wilkins, and A. Freitas. Interfacing knowledge discovery algorithms to large databases management systems. Information and Software Technology, 41: , 1999. F. Provost and J. Aronis. Scaling up inductive learning with massive parallelism. Machine Learning, 23(1), April 1996. F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(2): , 1999. John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier for data mining. In Proc. of the 22nd Int'l Conference on Very Large Databases, Bombay, India, September 1996. M. Sreenivas, K. Alsabti, and S. Ranka. Parallel out-of-core divide and conquer techniques with application to classification trees. In 13th International Parallel Processing Symposium, April 1999. A. Srivastava, E-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3): , 1999. M. J. Zaki, C.-T. Ho, and R. Agrawal.Parallel classification for data mining on shared-memory multiprocessors. In 15th IEEE Intl. Conf. on Data Engineering, March 1999.

171 References: Clustering
K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1st IPPS Workshop on High Performance Data Mining, March 1998. I. Dhillon and D. Modha. A data clustering algorithm on distributed memory machines. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000. L. Iyer and J. Aronson. A parallel branch-and-bound algorithm for cluster analysis. Annals of Operations Research Vol. 90, pp 65-86, 1999. E. Johnson and H. Kargupta. Collective hierarchical clustering from distributed heterogeneous data. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000. D. Judd, P. McKinley, and A. Jain. Large-scale parallel data clustering. In Int'l Conf. Pattern Recognition, August 1996. X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11: , 1989. C.F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21: , 1995. S. Ranka and S. Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed Systems, 2(2): , 1991. F. Rivera, M. Ismail, and E. Zapata. Parallel squared error clustering on hypercube arrays. Journal of Parallel and Distributed Computing, 8: , 1990. G. Rudolph. Parallel clustering on a unidirectional ring. In R. Grebe et al., editor, Transputer Applications and Systems'93: Volume 1, pages IOS Press, Amsterdam, 1993. H. Nagesh, S. Goil and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large data sets. Technical Report , Center for Parallel and Distributed Computing, Northwestern University, June 1999. X. Xu, J. Jager and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data Mining and Knowledge Discovery: An International Journal. 3(3): , 1999. D. Foti, D. Lipari, C. Pizzuti, D. Talia, Scalable Parallel Clustering for Data Mining on Multicomputers, Proc. of the 3rd Int. Workshop on High Performance Data Mining HPDM00-IPDPS, LNCS, Springer-Verlag, pp , Cancun, Mexico, May 2000.

172 References: Association Rules
R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association rules. In U. Fayyad and et al, editors, Advances in Knowledge Discovery and Data Mining, pages AAAI Press, Menlo Park, CA, 1996. R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engg., 8(6): , December 1996. D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In 4th Intl. Conf. Parallel and Distributed Info. Systems, December 1996. D. Cheung, V. Ng, A. Fu, and Y. Fu. Efficient mining of association rules in distributed databases. IEEE Transactions on Knowledge and Data Engg., 8(6): , December 1996. D. Cheung, K. Hu, and S. Xia. Asynchronous parallel algorithm for mining association rules on shared-memory multi-processors. In 10th ACM Symp. Parallel Algorithms and Architectures, June 1998. D. Cheung and Y. Xiao. Effect of data distribution in parallel mining of associations. Data Mining and Knowledge Discovery: An International Journal. 3(3): , 1999. E-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD Conf. Management of Data, May 1997. M. Joshi, E.-H. Han, G. Karypis, and V. Kumar. Efficient parallel algorithms for mining associations. In M. Zaki and C.-T. Ho (eds), Large-Scale Parallel Data Mining, LNAI State-of-the-Art Survey, Volume 1759, Springer-Verlag, 2000. S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.

173 References: Associations (contd.)
A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical Report CS-TR-3515, University of Maryland, College Park, August 1995. J. S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf. Information and Knowledge Management, November 1995. T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In 4th Intl. Conf. Parallel and Distributed Info. Systems, December 1996. T. Shintani and M. Kitsuregawa. Parallel algorithms for mining generalized association rules with classification hierarchy. In ACM SIGMOD International Conference on Management of Data, May 1998. M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on heterogeneous PC cluster systems. In 25th Int'l Conf. on Very Large Data Bases, September 1999. M. J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14--25, October-December 1999. M. J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-memory multi-processors. In Supercomputing'96, November 1996. M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997. M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association rules. Data Mining and Knowledge Discovery: An International Journal, 1(4): , December 1997. M. J. Zaki, S. Parthasarathy, W. Li, A Localized Algorithm for Parallel Association Mining, 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), June, 1997.

174 References: Subgraph Mining
G. Di Fatta and M. R. Berthold. Distributed Mining of Molecular Fragments. IEEE DM-Grid Workshop of the Int. Conf. on Data Mining (ICDM 2004, Brighton, UK), November 1-4, 2004. M. Desphande and M. Kuramochi and G. Karypis. Automated Approaches for Classifying Structures. Proc. of Workshop on Data Mining in Bioinformatics (BioKDD), 2002, pp X. Yan and J. Han. gSpan: Graph-Based Substructure Pattern Mining. Proceedings of the IEEE International Conference on Data Mining ICDM, Maebashi City, Japan, 2002. S. Kramer and L. de Raedt and C. Helma. Molecular Feature Mining in HIV Data. Proc. of 7th Int. Conf. on Knowledge Discovery and Data Mining, (KDD-2001, San Francisco, CA, 2001, pp Takashi Washio and Hiroshi Motoda. State of the art of graph-based data mining. ACM SIGKDD Explorations Newsletter, July 2003 Vol.5, pp Chao Wang and Srinivasan Parthasarathy. Parallel Algorithms for Mining Frequent Structural Motifs in Scientific Data. ICS’04, June 26–July 1, 2004, Saint-Malo, France.

