Scalable Data Mining: Algorithms, System Support, and Applications


1 Scalable Data Mining: Algorithms, System Support, and Applications
Scalable Data Mining: Algorithms, System Support, and Applications. Ruoming Jin, The Ohio State University. Hello, everyone, my name is Ruoming Jin. Today, I will present the paper "Communication and Memory Efficient Parallel Decision Tree Construction".

2 A World Immersed in Data!
- Business data: Wal-Mart (20M transactions per day), AT&T (300M calls per day)
- Satellite and sensor data: NASA EOS project, 50 GB per hour
- Biology data: GenBank (>30 billion base pairs, >30 million sequences, 2003)
- Medical informatics: virtual placenta (Cancer Genetics, OSU), 3-5 GB per microscopic slide

3 Scalable Data Mining Challenges
- Parallel data mining: SMP clusters, large shared-memory machines
- Processing large amounts of data: many datasets are out-of-core; need for more scalable algorithms
- New data mining tasks: mining structured and semi-structured data

4 Roadmap
- System Support: FREERIDE (Framework for Rapid Implementation of Data Mining Engines)
- Algorithms for Out-of-core Datasets: SPIES (Statistical Pruning of Intervals for Enhanced Scalability) for Decision Tree Construction
- Discovering Frequent Topological Patterns in Graph Datasets
- Protein Structure Analysis

5 Scalable Data Mining Implementation
Creating scalable implementations of data mining algorithms can be very time-consuming and painful:
- Disk-resident datasets
- Shared memory parallelization
- Distributed memory parallelization
- Debugging MPI and/or Pthread code

6 FREERIDE Overview
Framework for Rapid Implementation of Data Mining Engines
- Targets distributed memory parallelism, shared memory parallelism, and their combination
- Ability to process large, disk-resident datasets
- Demonstrated on a variety of standard mining algorithms

7 Key Observation from Mining Algorithms (SDM’01)
Popular algorithms have a common canonical loop, which can be used as the basis for supporting a common middleware:

    while ( ) {
        forall (data instances d) {
            I = process(d)
            R(I) = R(I) op d
        }
        .......
    }

8 Issues / Challenges
- What are efficient shared-memory parallelization techniques?
- How to support distributed memory parallelism?
- How to combine both techniques?
- How to provide a simple interface to parallelize mining algorithms?

9 Generalized Reduction
Reduction indexing: I = process(d)
Reduction object: R
Reduction operation: R(I) = R(I) op d
The reduction operation satisfies the commutative and associative properties.
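To make the abstraction concrete, here is a small illustration of one k-means iteration written as a generalized reduction (in Python with hypothetical names, not FREERIDE's actual C++ interface): process(d) returns the index of the nearest centroid, and the reduction operation accumulates the point into that reduction element. Because the operation is commutative and associative, each thread could accumulate into a private copy of (sums, counts) and the copies could simply be added together before the final division, which is exactly what the full-replication scheme exploits.

    import numpy as np

    def kmeans_iteration(points, centroids):
        """One k-means pass written as a generalized reduction.

        Reduction object R: per-cluster (coordinate sum, count) pairs.
        process(d): index of the nearest centroid.
        op: element-wise accumulation (commutative and associative).
        """
        k, dim = centroids.shape
        sums = np.zeros((k, dim))        # reduction elements, coordinate sums
        counts = np.zeros(k, dtype=int)  # reduction elements, point counts

        for d in points:                                            # forall data instances d
            i = np.argmin(np.linalg.norm(centroids - d, axis=1))    # I = process(d)
            sums[i] += d                                            # R(I) = R(I) op d
            counts[i] += 1

        # New centroids; keep the old centroid if a cluster received no points.
        safe = np.maximum(counts, 1)[:, None]
        return np.where(counts[:, None] > 0, sums / safe, centroids)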

10 Challenges in Shared-Memory Parallelization
- Statically partitioning the reduction object to avoid race conditions is generally impossible.
- Runtime preprocessing or scheduling also cannot be applied: we cannot determine which element needs to be updated without processing the data instance.
- Replicating large reduction objects results in significant memory overhead.
- Locking and synchronization costs can be significant because of the fine-grained updates to the reduction object.

11 Memory Layout for Various Locking Schemes (SDM’02)
(Figure: memory layouts for full locking, fixed locking, optimized full locking, and cache-sensitive locking; legend: lock, reduction element.)

12 Trade-offs between Techniques
- Memory requirements: high memory requirements can cause memory thrashing
- Contention: if the number of reduction elements is small, contention for locks can be a significant factor
- Coherence cache misses and false sharing: more likely with a small number of reduction elements

13 Performance Modeling (SIGMETRICS’02)
- Target three shared memory parallelization techniques: full replication, optimized full locking, and cache-sensitive locking
- A detailed analytic model covering cache misses, TLB misses, memory contention, and waiting for locks
- Experimental evaluation: the difference between predicted and measured performance is within 20% in almost all cases

14 Apriori Association Mining
500 MB dataset, N2000, L20, 4 threads; fr (full replication), ofl (optimized full locking), csl (cache sensitive locking). The additional memory cost of the parallelization techniques can affect performance dramatically!

15 K-means Shared Memory Parallelization
Optimized full locking and cache sensitive locking are very efficient!

16 Apriori on Cluster of SMPs
Linear speedup for distributed memory parallelization and almost linear speedup for shared memory parallelization (up to 3 threads)

17 Applying FREERIDE to Processing Digital Microscope Images
- Investigated by Kishore Rao, Prof. Machiraju, and researchers in Bio-Medical Informatics
- Focusing on parallelizing segmentation of large microscope images
- Applications: virtual placenta, neuroblastoma

18 Applying FREERIDE to Scientific Data Mining
- Investigated by Leo Glimcher, Xuan Zhang, et al. (IPDPS 2004, IPDPS 2005)
- Focusing on the feature extraction, tracking, and mining approach developed by Machiraju et al.
- Applications: vortex detection, defect detection

19 FREERIDE Summary
Demonstrated a common framework for parallelization of a wide range of mining algorithms:
- Association mining: apriori and FP-tree
- Clustering: k-means and EM
- Decision tree construction
- Nearest neighbor search
Applications in analyzing large biomedical images and in scientific data mining

20 Roadmap
- System Support: FREERIDE (Framework for Rapid Implementation of Data Mining Engines)
- Algorithms for Out-of-core Datasets: SPIES (Statistical Pruning of Intervals for Enhanced Scalability) for Decision Tree Construction
- Discovering Frequent Topological Patterns in Graph Datasets
- Protein Structure Analysis

21 Mining Out-of-Core Datasets
The need to efficiently process disk-resident datasets:
- In many cases, the huge amount of data cannot fit into main memory
- The processor-memory performance gap, and consequently the processor-secondary storage performance gap, keeps growing: Moore's law (roughly 50% per year) vs. the latency gap (disk ~5 ms vs. DRAM ~50 ns, a ratio of roughly 10^5)
The problem:
- Most mining algorithms are I/O (data) intensive
- Many mining algorithms, such as decision tree construction and k-means clustering, have to rewrite or scan the dataset many times
Some remedies: approximate mining algorithms, working on samples
How can we develop efficient out-of-core mining algorithms without losing accuracy?

22 Processor/Disk Race
How can we let Jesse Owens do more of the running, to reduce the turtle's running distance?

23 Sampling Based Approach
- Use samples to get approximate results or information
- Scan the complete dataset and collect the necessary information, guided by the approximate results, in order to derive the exact final results
- If the estimate from the sample is not within a certain range, a re-scan is needed

24 Decision Tree Construction
(Figure: an example decision tree with tests on Employed (Yes/No), Balance (<50K / >=50K), and Age (<45 / >=45), and leaves labeled Class=Default or Class=Not Default.)
- Recursively construct the decision tree
- Best test condition for a node: for categorical attributes, the best subset test; for numerical attributes, the best split point
- Partition the dataset
Efficient processing of numerical attributes is the key to scaling decision tree construction!

25 Finding the Best Split Point for Numerical Attributes
(Figure: the best split point for a numerical attribute; the data comes from an IBM Quest synthetic dataset for function 0.)
In-core algorithms, such as C4.5, simply sort the numerical attribute values in memory!
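As a concrete sketch of what an in-core algorithm does for one numerical attribute (illustrative code using gini gain; C4.5 itself uses gain ratio, and all names here are made up): sort the values once, then sweep the candidate split points while maintaining class counts on both sides.

    from collections import Counter

    def gini(counts, total):
        if total == 0:
            return 0.0
        return 1.0 - sum((c / total) ** 2 for c in counts.values())

    def best_split(values, labels):
        """Sort-and-scan search for the best split point of one numerical attribute."""
        data = sorted(zip(values, labels))
        n = len(data)
        left, right = Counter(), Counter(l for _, l in data)
        parent_impurity = gini(right, n)
        best_gain, best_point = 0.0, None

        for i in range(n - 1):
            v, l = data[i]
            left[l] += 1
            right[l] -= 1
            if v == data[i + 1][0]:      # can only split between distinct values
                continue
            n_left = i + 1
            weighted = (n_left / n) * gini(left, n_left) \
                     + ((n - n_left) / n) * gini(right, n - n_left)
            gain = parent_impurity - weighted
            if gain > best_gain:
                best_gain, best_point = gain, (v + data[i + 1][0]) / 2
        return best_point, best_gain

For example, best_split([25, 32, 47, 51, 60], ["no", "no", "yes", "yes", "no"]) returns a split around 39.5. The sort is exactly what becomes prohibitively expensive once the attribute lists no longer fit in memory.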

26 Handling of Numerical Attributes for Disk-Resident Datasets
Sorting the disk-resident records is way too expensive!
- SLIQ (Mehta et al.), SPRINT (Shafer et al.): pre-sort and use attribute lists; re-write the dataset (expensive!)
- RainForest (Gehrke et al.): materialize class histograms (no sorting); shows good performance if the class histograms can be held in main memory
Decision tree construction for large datasets has been widely studied. SLIQ and SPRINT are the two well-known approaches, but both of them require pre-sorting and vertical partitioning of the dataset. RainForest provides an alternative approach to scale decision tree construction without these two requirements. It constructs the decision tree in a top-down fashion: from the root, it computes sufficient statistics for the splitting node and chooses the splitting attribute and predicate based on them. Clearly, if main memory could hold the sufficient statistics for all of the nodes in one level of the decision tree, we could split them in a single pass. However, the sufficient statistics for a single level cannot always be held in main memory. In that case, we either have to read the dataset several times to construct one level of the tree, or we partition the dataset so that each small partition is used to build a sub-tree, and then process every partition independently. The first variant is RF-read, the second is RF-write. Sometimes we can combine them as RF-hybrid, which partitions the dataset only when necessary. We can see that RF-read bears a lot of similarity to the requirements of the FREERIDE approach, but the difficulty is the large size of the resulting sufficient statistics. So one natural question is whether it is possible to reduce the sufficient statistics. In the following, we provide a solution to this problem and a new algorithm based on it.

27 Scaling Decision Tree Construction
The huge memory cost of class histograms for numerical attributes:
- Millions of distinct points (ZIP code, IP address, …)
- The class histograms for a single level of nodes might not fit in main memory
- To construct a single level of nodes, the dataset needs to be scanned several times!
- The large communication volume results in very low speedups
Can we do a better job?

28 Our approach – SPIES (SDM’03)
Statistical Pruning of Intervals for Enhanced Scalability
Reduce the size of the class histogram by partial materialization, using a sampling based approach:
- Divide the range of each numerical attribute into intervals
- Use samples to estimate the class histograms for the intervals
- Prune the intervals that are unlikely to contain the best split point
- Scan the complete dataset and materialize the class histogram only for points in the unpruned intervals
- An additional pass might be necessary if false pruning happens

29 The Intuition
- The number of intervals will be much smaller than the number of distinct points
- For one attribute, only one interval can contain the best split point, so the large number of intervals that do not contain the best split point can be pruned using samples
- The additional computation on samples and intervals can be offset by avoiding re-writing the dataset and reducing the number of passes over it!

30 Sampling Step
(Figure: maximal gain from the interval boundaries; upper bound of gains for the intervals.)

31 Completion Step
(Figure: best split point.)

32 The Technical Challenges
How can it work?
- Memory reduction: prune as many intervals as possible
- Avoid pruning the interval that contains the best split point (false pruning)
Key problems:
- What is a good estimate based on sampling?
- How to derive the sample size?

33 Estimation based on Samples
The difference can be bounded by statistical rules, such as the Hoeffding inequality. Interestingly, by utilizing the delta method, the gain function at any fixed point can be approximated by a normal distribution. A comparison of the efficiency of different estimation methods is given in our KDD'03 paper.
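For reference, the standard two-sided Hoeffding inequality (stated here in its general form; how the paper instantiates it for the gain function is detailed in the KDD'03 paper) says that for n independent samples X_1, …, X_n with values in [a, b],

    \Pr\left( \left| \bar{X}_n - \mathbb{E}[\bar{X}_n] \right| \ge \epsilon \right) \le 2 \exp\left( \frac{-2 n \epsilon^2}{(b-a)^2} \right).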

34 Sample Size: Hoeffding Bound
The probability of falsely pruning interval i is bounded: Pr(Δi < ε) < δi.
By Bonferroni's inequality (the union bound), the probability of any false pruning is bounded: Pr(∪(Δi < ε)) ≤ ∑ Pr(Δi < ε) < δ.
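As a back-of-the-envelope illustration (my own, not the paper's exact derivation), combining the Hoeffding bound with the Bonferroni/union bound yields a sample size: to keep every one of m interval estimates within ε of its expectation with overall failure probability at most δ, allocate δ/m per interval and solve the bound for n. R denotes the (assumed) width of the range the averaged quantity lies in.

    import math

    def hoeffding_sample_size(epsilon, delta, num_intervals, value_range=1.0):
        """Smallest n such that each of num_intervals sample means is within
        epsilon of its expectation with overall probability at least 1 - delta.

        Hoeffding: Pr(|mean - E[mean]| >= eps) <= 2 * exp(-2 * n * eps^2 / R^2).
        Union (Bonferroni) bound: allocate delta / num_intervals per interval.
        """
        per_interval_delta = delta / num_intervals
        n = (value_range ** 2) * math.log(2.0 / per_interval_delta) / (2.0 * epsilon ** 2)
        return math.ceil(n)

    # Example: 1000 intervals, epsilon = 0.01, overall failure probability 1%.
    print(hoeffding_sample_size(0.01, 0.01, 1000))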

35 SPIES Algorithm
Sampling step:
- Estimate class histograms for the intervals from samples
- Compute the estimated intermediate best gain and the upper bound for each interval
- Apply the Hoeffding bound to perform interval pruning
Completion step:
- Materialize the class histograms for the unpruned intervals
- Compute the final best gain
Verification:
- An additional pass might be needed if false pruning happens; it is executed together with the next completion step
Now we look at how sampling is used to replace the first pass of the algorithm and the corresponding modifications to the algorithm. If you are interested in the technical details of how to use the Hoeffding bound, please look at our paper or Domingos's paper on VFDT.
SPIES always finds the best split point while only partially materializing the class histograms, with practically one pass over the dataset for each level of the decision tree. SPIES can be efficiently parallelized!
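To make the two phases concrete, here is a simplified single-attribute sketch of the interval-pruning idea (my own illustration, not the authors' implementation): it uses gini gain, replaces the Hoeffding/Bonferroni-derived ε with a fixed slack parameter, and omits the verification pass and parallelization. All names are hypothetical.

    import bisect, itertools, random
    from collections import Counter

    def gini(counts):
        n = sum(counts)
        return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

    def gain(left, right, parent):
        n = sum(left) + sum(right)
        return parent - (sum(left) / n) * gini(left) - (sum(right) / n) * gini(right)

    def spies_best_split(values, labels, boundaries, sample_frac=0.1, slack=0.05, seed=0):
        """Toy SPIES-style split search for one numerical attribute.

        boundaries: sorted interior interval boundaries over the attribute range.
        Sampling step: estimate per-interval class histograms from a sample, take
        the best gain at the interval boundaries, and prune intervals whose
        optimistic upper bound falls short of it by more than `slack`.
        Completion step: exact evaluation only inside the unpruned intervals.
        """
        classes = sorted(set(labels))
        data = list(zip(values, labels))
        rng = random.Random(seed)

        def interval_of(v):
            return bisect.bisect_right(boundaries, v)

        def histogram(pairs):
            hist = [[0] * len(classes) for _ in range(len(boundaries) + 1)]
            for v, l in pairs:
                hist[interval_of(v)][classes.index(l)] += 1
            return hist

        def sampling_step(hist):
            total = [sum(h[k] for h in hist) for k in range(len(classes))]
            parent, prefix = gini(total), [0] * len(classes)
            best_boundary, bounds = 0.0, []
            for h in hist:
                # Optimistic bound for a split inside this interval: each class's
                # interval points go entirely left or entirely right (the gain,
                # parent impurity minus a concave weighted impurity, attains its
                # maximum over this box at such a vertex).
                best = 0.0
                for assign in itertools.product((0, 1), repeat=len(classes)):
                    left = [prefix[k] + assign[k] * h[k] for k in range(len(classes))]
                    right = [total[k] - left[k] for k in range(len(classes))]
                    if sum(left) and sum(right):
                        best = max(best, gain(left, right, parent))
                bounds.append(best)
                prefix = [p + c for p, c in zip(prefix, h)]
                if 0 < sum(prefix) < sum(total):   # gain at this interval's upper boundary
                    rest = [t - p for t, p in zip(total, prefix)]
                    best_boundary = max(best_boundary, gain(prefix, rest, parent))
            return best_boundary, bounds

        sample = [d for d in data if rng.random() < sample_frac]
        best_boundary, bounds = sampling_step(histogram(sample))
        kept = {i for i, b in enumerate(bounds) if b + slack >= best_boundary}

        # Completion step: exact search over boundaries plus points in kept intervals.
        candidates = sorted(set(boundaries) | {v for v, _ in data if interval_of(v) in kept})
        total = Counter(labels)
        parent = gini([total[c] for c in classes])
        best_gain, best_point = -1.0, None
        for split in candidates:
            left = Counter(l for v, l in data if v <= split)
            lvec = [left[c] for c in classes]
            rvec = [total[c] - left[c] for c in classes]
            if sum(lvec) and sum(rvec):
                g = gain(lvec, rvec, parent)
                if g > best_gain:
                    best_gain, best_point = g, split
        return best_point, best_gain

In the real algorithm the slack comes from the Hoeffding/Bonferroni bound of the previous slide, and a verification check on the pruned intervals' bounds triggers the extra pass when false pruning is detected.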

36 Memory Requirement
(Figure: memory requirements on an 800 MB dataset for SPIES with 100, 500, 1000, 5000, and 20000 intervals, and for RF-Read.) With 100 to 1000 intervals, the memory reduction is about 95%.

37 Parallel Performance
(Figure: distributed-memory speedup on 800 MB datasets for RF-read (without intervals) and for SPIES with 1000 intervals.) On 8 nodes, RF-read reaches a speedup of only around 4, while SPIES has better sequential performance and an almost linear speedup of about 8.

38 Applications of this Approach
- Efficient Decision Tree Construction over Streaming Data (KDD'03)
- Fast and Exact K-means (FEKM, ICDM'04)
- Distributed, Fast, and Exact K-means (KAIS journal)

39 Roadmap
- System Support: FREERIDE (Framework for Rapid Implementation of Data Mining Engines)
- Algorithms for Out-of-core Datasets: SPIES (Statistical Pruning of Intervals for Enhanced Scalability) for Decision Tree Construction
- Discovering Frequent Topological Patterns in Graph Datasets
- Protein Structure Analysis

40 Mining Graph Datasets
Graphs are a powerful representation:
- Chemical compounds
- Protein 3D structures
- Social networks
- Communication networks
- XML, Web, …
Mining graph datasets: given a collection of graphs, find the frequently occurring sub-structures
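As a minimal illustration of "frequently occurring sub-structures" (my own toy example, not one of the algorithms cited later), the sketch below counts single-edge patterns whose support, i.e. the number of graphs in the collection containing them, reaches a threshold. Pattern-growth miners such as FSG or gSpan start from exactly such size-1 patterns and extend them edge by edge.

    from collections import defaultdict

    # Each graph is a list of labeled, undirected edges: (label_u, edge_label, label_v).
    def frequent_edges(graphs, min_support):
        """Find single-edge patterns occurring in at least min_support graphs."""
        support = defaultdict(set)
        for gid, edges in enumerate(graphs):
            for u, e, v in edges:
                pattern = (min(u, v), e, max(u, v))   # canonical form for undirected edges
                support[pattern].add(gid)             # count each graph at most once
        return {p: len(g) for p, g in support.items() if len(g) >= min_support}

    graphs = [
        [("C", "single", "O"), ("C", "double", "O")],
        [("C", "single", "O"), ("N", "single", "C")],
        [("C", "double", "O")],
    ]
    print(frequent_edges(graphs, min_support=2))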

41 Why Mine Graph Datasets?
- Providing key insights into the graph datasets: what do these frequently occurring substructures suggest?
- Fundamental tools for other data mining applications: classification, clustering, association rules, comparative mining, change detection

42 Existing Research and Limitation
Finding frequent subgraphs in graph datasets:
- Different types of subgraphs: connected, induced
- Many efficient algorithms: AGM, FSG, gSpan, FFSM, Gaston, SPIN, …
- These discover frequent basic components: frequent fragments, local patterns
Frequent large-scale structures:
- Non-local (global) protein structures
- Social/communication networks, where direct connection is not the focus

43 Example Protein Structures
(Figure: protein structures, including PDB entry 1ALI.) Both share a triangle of α-helices; they have similar functionalities and belong to the class of zinc finger proteins.

44 Discovering Frequent Topological Structures (submitted to KDD’05)

45 Membrane Proteins Structure Analysis
10 binding sites for 6 membrane proteins (1KB1, 1KQF, 1M3x, 1OKC, 1V54, 1OGV). Vertices: amino acids; edges: distances within a small range ( Å).

46 Other Research
- Frequent pattern mining: streaming data, mining multiple datasets
- Parallel data cube construction
- Estimation problems in databases: approximate OLAP query processing, XML selectivity estimation

47 Future Research
- Topological structures for protein analysis: classification, protein structure alignment, finding frequent 3-D geometric structures
- Comparative mining: comparative genomics, mining multiple datasets
- Mining under constrained conditions: grids, sensor networks, streaming data

48 Conclusions
Scalable data mining:
- challenging and fundamentally important
- requires efforts from both system support and algorithm design
A knowledge discovery and data mining management system (KDDMS):
- a long-term goal for data mining
- interactive data mining
- scalable data mining techniques

49 Questions ???

50 Computational Challenges in Bio-Medical Informatics
Recent technological advances have led to an explosion in data size and content.
Virtual placenta (Cancer Genetics, OSU): anatomy of the mouse placenta, gene alteration; construct a 3-D model of the mouse placenta from microscopic slides.

51 Virtual Placenta
Size of datasets = 15K * 15K * 3 (RGB) * 800 ≈ 1 TB

52 Shared Memory Parallelization Techniques
- Full replication: create a copy of the reduction object for each thread
- Full locking: associate a lock with each element
- Optimized full locking: put the element and its corresponding lock on the same cache block
- Fixed locking: use a fixed number of locks
- Cache-sensitive locking: one lock for all elements in a cache block
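A rough way to see how these schemes differ (an illustrative sketch with made-up constants, not FREERIDE code) is to ask which lock guards a given reduction element under each scheme:

    CACHE_BLOCK_ELEMS = 8   # assumed number of reduction elements per cache block
    NUM_FIXED_LOCKS = 64    # assumed pool size for fixed locking

    def lock_index(elem, scheme):
        """Which lock guards reduction element `elem` under each locking scheme."""
        if scheme == "full":            # one lock per element, locks stored separately
            return elem
        if scheme == "optimized_full":  # one lock per element, but the lock is placed
            return elem                 # next to the element on the same cache block
        if scheme == "fixed":           # a fixed pool of locks, shared across elements
            return elem % NUM_FIXED_LOCKS
        if scheme == "cache_sensitive": # one lock for all elements in a cache block
            return elem // CACHE_BLOCK_ELEMS
        raise ValueError(scheme)

    # Full replication needs no locks at all: each thread updates a private copy
    # of the reduction object, and the copies are merged at the end.

Full locking and optimized full locking map elements to locks identically; they differ only in memory layout, which is what drives the cache behavior compared on the earlier slides.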

53 Apriori on Cluster of SMPs
Linear speedup for distributed memory parallelization and almost linear speedup for shared memory parallelization (up to 3 threads)

54 Verification
(Figure: gain of the best split point; false pruning.)
An additional pass might be required if false pruning happens.

55 Least Upper Bound of Gain for an Interval
(Figure: for the interval [50, 54], two possible best configurations are considered when computing the least upper bound of the gain.)

