Implementing Data Cube Construction Using a Cluster Middleware: Algorithms, Implementation Experience, and Performance
Ge Yang, Ruoming Jin, Gagan Agrawal

Presentation transcript:

Implementing Data Cube Construction Using a Cluster Middleware: Algorithms, Implementation Experience, and Performance
Ge Yang, Ruoming Jin, Gagan Agrawal
Department of Computer and Information Sciences, Ohio State University

Motivation
 A lot of effort has gone into developing cluster computing tools targeting scientific applications
 There is an emerging class of commercial applications that are well suited for cluster environments
   OnLine Analytical Processing (OLAP)
   Data Mining
 Can we successfully use cluster tools developed for scientific applications for commercial applications?

Overview
Focus on:
 Data cube construction, which is an OLAP problem
   Both compute and data intensive
   Frequently used in data warehouses
 Use of the Active Data Repository (ADR), developed for scientific data-intensive applications
Questions:
 Are new algorithms or variations of existing algorithms required?
 What is the implementation experience?
 What is the performance?

Outline
 Data cube construction
   Problem definition
   Challenges
 Active Data Repository (ADR)
 Scalable data cube construction algorithms targeting ADR
 Implementation experience
 Performance evaluation
 Summary

Data Cube Construction
Context: Data Warehouses
 Frequently store (possibly sparse) multidimensional datasets
 Example: sales information for a chain of stores, with time, item, and location as the three dimensions
 Frequently asked queries aggregate along one or more dimensions
Data Cube Construction:
 Perform all aggregations in advance to facilitate rapid response to all queries
 For the original n-dimensional array, construct C(n, m) arrays of m dimensions for each m, 0 <= m <= n, i.e., 2^n arrays in total

Data Cube Construction
Example:
 Consider the original 3-dimensional array ABC
 The data cube comprises
   3 two-dimensional arrays AB, BC, AC
   3 one-dimensional arrays A, B, and C
   A scalar value all
Some observations:
 Large input size: data warehouses can hold a lot of data
 The total amount of output can also be quite large
 A lot of computation is involved
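
As a concrete illustration, here is a minimal sequential numpy sketch of these eight aggregates; it shows the computation itself, not the parallel tiled algorithm presented later, and the array sizes are made up:

    import numpy as np

    # Original 3-D array: dimensions A, B, C (here 2 x 3 x 4)
    ABC = np.arange(24, dtype=np.float64).reshape(2, 3, 4)

    # Two-dimensional aggregates: sum out one dimension each
    AB = ABC.sum(axis=2)   # aggregate over C
    AC = ABC.sum(axis=1)   # aggregate over B
    BC = ABC.sum(axis=0)   # aggregate over A

    # One-dimensional aggregates, each computed from a 2-D parent
    A = AB.sum(axis=1)
    B = AB.sum(axis=0)
    C = AC.sum(axis=0)

    # The scalar value 'all'
    all_value = A.sum()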

Lattice for Data Cube Construction
 Options for computing the different output arrays can be represented by a lattice
 If A is the shortest dimension and C is the largest, the arrows represent the minimal spanning tree of the lattice
 AB is then the smallest parent of A and B, i.e., the cheapest array from which each can be computed
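
To make the smallest-parent rule concrete, here is a short sketch for the three-dimensional case, assuming dimension sizes |A| <= |B| <= |C|; the example sizes and the dictionary encoding are our own illustration:

    # Element counts of each group-by for example sizes |A| <= |B| <= |C|
    sA, sB, sC = 100, 200, 400
    size = {"ABC": sA*sB*sC, "AB": sA*sB, "AC": sA*sC, "BC": sB*sC,
            "A": sA, "B": sB, "C": sC, "all": 1}

    # Candidate parents in the lattice: any array with exactly one extra dimension
    parents = {"AB": ["ABC"], "AC": ["ABC"], "BC": ["ABC"],
               "A": ["AB", "AC"], "B": ["AB", "BC"], "C": ["AC", "BC"],
               "all": ["A", "B", "C"]}

    # The smallest parent is the candidate with the fewest elements
    smallest_parent = {node: min(cands, key=size.get)
                       for node, cands in parents.items()}
    # -> A and B come from AB, C from AC, all from A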

Active Data Repository
 Developed at the University of Maryland (Chang, Kurc, Sussman, Saltz)
 Targeted scientific data-intensive applications
 Execution model:
   Divide the output dataset(s) into tiles; allocate one tile at a time
   Fetch the input dataset one chunk at a time to compute the tile
   Decide on a plan, or schedule, for fetching the chunks that contribute to a tile
   Operations involved in computing an output element must be associative and commutative
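
In outline, the execution model looks roughly like the following skeleton; the names here are illustrative, not ADR's actual API:

    # Illustrative skeleton of ADR's tile/chunk execution model
    # (hypothetical names, not the real ADR interface)
    def process_dataset(output_tiles, schedule_chunks, local_reduce):
        for tile in output_tiles:                # one output tile in memory at a time
            tile.allocate()
            for chunk in schedule_chunks(tile):  # planned fetch order of input chunks
                # the reduction must be associative and commutative,
                # so chunks can be applied in any order
                local_reduce(tile, chunk)
            tile.write_back()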

Goals in Algorithm Design
 Must use smallest parents / the minimal spanning tree
 Maximal cache and memory reuse: perform all computations associated with an input chunk before it is discarded from memory
 Minimize interprocessor communication volume
 Minimize the amount of memory that needs to be allocated across the tiles
 Fit into ADR's computation model

Approach
 Currently consider data cube construction starting from a three-dimensional array only
 Partition and tile along a single dimension only
 If the sizes along the dimensions A, B, and C are |A|, |B|, and |C|, assume |A| <= |B| <= |C| (without loss of generality)

Partitioning and Tiling
 Always partition along the dimension C
   Minimizes communication volume: only AB, the smallest two-dimensional array, must be combined across processors
   If |A| <= |B| <= |C|, then |A||B| <= |A||C| <= |B||C|
 Let the size of the dimension C on each processor be |C'|
 Three separate cases for tiling (a quick numeric check follows this list):
   Case I: |A| <= |B| <= |C'|
   Case II: |A| <= |C'| <= |B|
   Case III: |C'| <= |A| <= |B|
 We focus on the first and second cases; the third is almost identical to the second
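
A quick sanity check of the communication-volume claim, with made-up dimension sizes: partitioning along a dimension means the two-dimensional aggregate that does not contain that dimension must be combined across processors, so partitioning along C communicates only AB, the smallest such array:

    # Communication volume (elements combined across processors) per choice
    sA, sB, sC = 100, 200, 400          # |A| <= |B| <= |C|
    comm = {"partition A": sB*sC,       # BC must be globally reduced
            "partition B": sA*sC,       # AC must be globally reduced
            "partition C": sA*sB}       # AB must be globally reduced
    assert min(comm, key=comm.get) == "partition C"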

First Case
 Tile along the dimension C on each processor
 Hold AB in memory through the processing of all tiles
 AC and BC are allocated separately for each tile

Algorithm for Case I

Allocate AB
Foreach tile:
    Allocate AC and BC
    Foreach input chunk to be read:
        Update AB, AC, and BC
    Compute C from AC
    Write back AC, BC, and C
    If last tile:
        Perform global reduction to obtain AB
        If (proc_id == 0):
            Compute A and B from AB
            Compute all from A
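
A minimal sequential numpy sketch of Case I follows; it runs on one processor, so the global reduction and the proc_id test are omitted, and chunk fetching is simplified to slicing a toy in-memory array:

    import numpy as np

    # Toy input with |A| <= |B| <= |C'|; tiling is along C
    ABC = np.random.rand(4, 6, 8)
    nA, nB, nC = ABC.shape
    tile_width = 2                             # tile width along C

    AB = np.zeros((nA, nB))                    # held in memory across all tiles
    for lo in range(0, nC, tile_width):
        chunk = ABC[:, :, lo:lo + tile_width]  # the input chunks of this tile
        AC = np.zeros((nA, tile_width))        # allocated per tile
        BC = np.zeros((nB, tile_width))
        AB += chunk.sum(axis=2)                # update AB
        AC += chunk.sum(axis=1)                # update AC
        BC += chunk.sum(axis=0)                # update BC
        C_tile = AC.sum(axis=0)                # compute this tile's slice of C from AC
        # write back AC, BC, and C_tile (omitted here)

    # After the last tile; with MPI, AB would first be all-reduced
    A = AB.sum(axis=1)
    B = AB.sum(axis=0)
    all_value = A.sum()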

Properties of the Algorithm
 All arrays are computed from their smallest parents
 Maximal cache and memory reuse
 Minimal interprocessor communication volume among all single-dimensional partitions
 The portion of the output arrays that must be kept in main memory for the entire computation is the minimum over all single-dimensional tiling possibilities

Second Case
 Tile along the dimension B
 Hold AC in main memory for the entire computation

Algorithm for Case II

Allocate AC and A
Foreach tile:
    Allocate AB and BC
    Foreach input chunk to be read:
        Update AB, AC, and BC
    Perform global reduction to obtain final AB
    If (proc_id == 0):
        Compute B from AB
        Update A using AB
    Write back AB, BC, and B
    If last tile:
        Finish AC
        Compute C from AC
        If (proc_id == 0):
            Finish A
            Compute all from A
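
For symmetry, a matching sequential sketch of Case II, again on one processor with the reductions omitted; here AC and A stay resident while AB and BC are produced tile by tile:

    import numpy as np

    # Toy input with |A| <= |C'| <= |B|; tiling is along B
    ABC = np.random.rand(4, 8, 6)
    nA, nB, nC = ABC.shape
    tile_width = 2                        # tile width along B

    AC = np.zeros((nA, nC))               # held in memory across all tiles
    A = np.zeros(nA)
    for lo in range(0, nB, tile_width):
        chunk = ABC[:, lo:lo + tile_width, :]
        AB_tile = chunk.sum(axis=2)       # this tile's slice of AB
        BC_tile = chunk.sum(axis=0)       # this tile's slice of BC
        AC += chunk.sum(axis=1)           # update the resident AC
        B_tile = AB_tile.sum(axis=0)      # compute B from AB
        A += AB_tile.sum(axis=1)          # update A using AB
        # write back AB_tile, BC_tile, and B_tile (omitted here)

    # After the last tile, AC and A are complete
    C = AC.sum(axis=0)
    all_value = A.sum()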

Implementation Experience Using ADR
 We had to supply:
   A local reduction function: the processing for each chunk
   A global reduction function: combining results after local reduction on each tile
   A finalize function: run after processing all tiles
   A specification of the tiling desired
 ADR's runtime support handled:
   Fetching the input chunks corresponding to each tile
   Scheduling asynchronous operations
   The details of interprocessor communication
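
In outline, the user-supplied pieces have roughly the following shape; the names and signatures are our illustration, not ADR's real (C++) interface:

    # Hypothetical sketch of the three user-supplied functions
    def local_reduction(partial, chunk):
        """Apply one input chunk to the per-processor partial results."""
        ...

    def global_reduction(partial):
        """Combine per-processor partial results once a tile is done."""
        ...

    def finalize(results):
        """Derive the remaining small aggregates after all tiles."""
        ...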

Experimental Evaluation
 Goals:
   Speedups on sparse and dense datasets
   Scaling of performance with respect to dataset size
   Scaling of performance with respect to the number of tiles
   Evaluating the impact of sparsity
 Experimental platform:
   Cluster of Ultra-II processors
   1 GB of main memory on each node
   Myrinet interconnect

Scaling Input Datasets: Dense Arrays
 Almost linear speedups up to 8 processors
 Execution time scales linearly with dataset size

Scaling Dataset Sizes: Sparse Datasets
 25% sparsity level
 Slightly lower speedups than for dense datasets: higher communication-to-computation ratio
 Execution time stays proportional to the amount of computation

Increasing Number of Tiles
 2 nodes, fixed amount of computation per tile
 Execution time stays proportional to the amount of computation

Impact of Sparsity
 Same number of non-zero elements in each dataset
 Good speedups in all cases
 Some reduction in sequential performance as sparsity increases, particularly for the 1% case

Summary
 Considered data cube construction on clusters
 Used a runtime system developed for scientific data-intensive applications
 New algorithms combine tiling and interprocessor communication
 Observations:
   Code writing was simplified by the use of the runtime system
   High speedups
   Performance scales well as dataset sizes are increased