HPDC 2014 Supporting Correlation Analysis on Scientific Datasets in Parallel and Distributed Settings Yu Su*, Gagan Agrawal*, Jonathan Woodring#, Ayan Biswas*, Han-Wei Shen* *The Ohio State University #Los Alamos National Laboratory

HPDC 2014 Motivation: Big Data Gaps between data generation and storage capacity keep growing –Molecular Simulation: Molecular Data –Life Science: DNA Sequencing Data (Microarray) –Earth Science: Ocean and Climate Data –Space Science: Astronomy Data

HPDC 2014 Big Data (Volume/Velocity) Challenge Data movement is the bottleneck –Memory to CPU –Disk to memory –Wide area Memory availability is another challenge Can we work with a summary of the data? –Compression approaches have already been shown to be applicable

HPDC 2014 Context: Correlation Data Analysis Scientific analysis types: –Individual Variable Analysis: data subsetting, aggregation, mining, visualization –Correlation Analysis: study relationships among multiple variables to make interesting scientific discoveries The "Big Data" problem becomes more severe: –huge data loading cost (multiple variables) –additional filtering cost for subset-based correlation analysis –huge correlation calculation cost Correlation analysis is useful but extremely time-consuming and resource-intensive

HPDC 2014 Our Solution and Contributions (1) Identify bitvectors as a summary structure –Space efficient –Data-movement efficient –Assumed to be constructed offline Correlation computation using bitmaps –Better efficiency –Smaller memory cost –Parallelization –Across data stored in distributed repositories

HPDC 2014 Our Solution and Contributions (2) An interactive framework to support both individual and correlation analysis based on bitmaps –Correlations and other operations using high-level operators –Individual Analysis: flexible data subsetting –Correlation Analysis: interactive correlation queries among multi-variables –Correlation over flexible data subsets –Combine with index-based sampling

HPDC 2014 Background: Bitmap Indexing Widely used in scientific data management Suitable for floating-point values by binning small value ranges Run-length compression (WAH, BBC) Bitmap indices can be treated as a small profile of the data
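A minimal sketch of how such a binned bitmap index could be built for a floating-point variable. This is illustrative only and not the authors' implementation: plain NumPy boolean arrays stand in for WAH/BBC run-length-compressed bitvectors, and the function name is invented for the example.

```python
import numpy as np

def build_binned_bitmap_index(values, bin_edges):
    """One bitvector per value bin: bit i of bin b is set iff values[i]
    falls into bin b. Sketch only; real bitmap indices (e.g., FastBit)
    store these bitvectors with WAH/BBC run-length compression."""
    bin_ids = np.digitize(values, bin_edges)      # bin id of every element
    n_bins = len(bin_edges) + 1                   # bins below, between, above the edges
    return [bin_ids == b for b in range(n_bins)]  # one boolean bitvector per bin

# Example: index a tiny floating-point variable with 4 bin boundaries.
temp = np.array([0.3, 2.1, 7.9, 4.4, 0.9, 6.2])
temp_index = build_binned_bitmap_index(temp, bin_edges=[1.0, 3.0, 5.0, 7.0])
```

Because scientific values are often spatially clustered, each bin's bitvector contains long runs of 0s and 1s, which is what makes run-length compression effective and keeps the index a small profile of the data.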

HPDC 2014 Bitmaps and Summarization Preserves the spatial distribution of the data Accurate within the limits of binning Storage requirement within 15-25% of the original data after compression Entropy-preserving sampling (HPDC '13) May already be built to support query processing How do we calculate correlation metrics? –Accurately and efficiently

HPDC 2014 Metrics of Correlation Analysis 2-D Histogram: –Indicates the value-distribution relationship –Value distribution of one variable with respect to changes in another Shannon's Entropy: –A metric for the variability of a dataset –Low entropy => more constant, predictable data –High entropy => more randomly distributed data Mutual Information: –A metric for the dependence between two variables –Low MI => the two variables are relatively independent –High MI => one variable provides information about the other
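For reference, the standard definitions behind these metrics (textbook formulas, not taken from the slides), written in terms of bin probabilities estimated from the histograms: p_i and q_j are the bin probabilities of the two variables and r_ij the joint bin probabilities from the 2-D histogram.

```latex
H(A)   = -\sum_{i} p_i \log p_i, \qquad
H(A,B) = -\sum_{i,j} r_{ij} \log r_{ij}
```

```latex
I(A;B) = \sum_{i,j} r_{ij} \log \frac{r_{ij}}{p_i \, q_j}
       = H(A) + H(B) - H(A,B)
```

All three metrics can therefore be computed from the same table of joint bin counts.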

HPDC 2014 Bitmap-based Correlations No Indexing Support: –Load all data of variables A and B –Filter A and B and generate the subset (for value-based subsetting) –Generate joint bins: divide A and B into bins, then produce (A_1, B_1) -> count_11, ..., (A_m, B_m) -> count_mm by scanning each data element –Calculate correlation metrics from the joint bins Dynamic Indexing (build one index per variable): –Query the bitvectors of variables A and B (much smaller index loading cost, very small filtering cost) –Generate joint bins: produce (A_1, B_1) -> count_11, ..., (A_m, B_m) -> count_mm through fast bitwise operations between the bitvectors of A and B (the number of bitvectors is much smaller than the number of elements) –Calculate correlation metrics from the joint bins
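A hedged sketch of the dynamic-indexing path: joint bin counts come from ANDing the per-bin bitvectors of the two variables and counting set bits, and the metrics are then computed from those counts. Function names are invented for the example, and boolean arrays again stand in for compressed bitvectors.

```python
import numpy as np

def joint_bins_from_bitmaps(index_a, index_b):
    """counts[i, j] = number of elements in bin i of A and bin j of B,
    obtained by a bitwise AND of the two bitvectors plus a popcount."""
    counts = np.zeros((len(index_a), len(index_b)), dtype=np.int64)
    for i, bv_a in enumerate(index_a):
        for j, bv_b in enumerate(index_b):
            counts[i, j] = np.count_nonzero(bv_a & bv_b)
    return counts

def mutual_information(counts):
    """Mutual information (in nats) from a table of joint bin counts."""
    p_joint = counts / counts.sum()
    p_a = p_joint.sum(axis=1, keepdims=True)         # marginal of A
    p_b = p_joint.sum(axis=0, keepdims=True)         # marginal of B
    nz = p_joint > 0                                  # skip empty joint bins
    return float((p_joint[nz] * np.log(p_joint[nz] / (p_a @ p_b)[nz])).sum())
```

With the index sketch from the bitmap-indexing slide, the whole pipeline is counts = joint_bins_from_bitmaps(index_a, index_b) followed by mutual_information(counts); the entropies come from the row and column sums of the same table.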

HPDC 2014 Calculation Steps [figure: in-memory calculation steps of the bitmap-based approach]

HPDC 2014 Static Indexing Dynamic Indexing: –build one index for each variable –still need to perform bitwise operations to generate the joint bins Static Indexing: –build one index over multiple variables –only need to load the joint bitvectors and compute the metrics
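One possible reading of the static-indexing idea, sketched under the assumption that the joint index stores one bitvector per (A-bin, B-bin) pair: the joint counts then reduce to popcounts of the stored bitvectors, with no cross-variable bitwise operations at query time. The data layout and function name are hypothetical.

```python
import numpy as np

def joint_bins_from_static_index(joint_index):
    """joint_index[(i, j)] is assumed to hold the bitvector of elements that
    fall in bin i of A and bin j of B simultaneously (sketch only).
    Joint counts are just popcounts; no AND between variables is needed."""
    m = 1 + max(i for i, _ in joint_index)
    n = 1 + max(j for _, j in joint_index)
    counts = np.zeros((m, n), dtype=np.int64)
    for (i, j), bv in joint_index.items():
        counts[i, j] = np.count_nonzero(bv)
    return counts
```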

HPDC 2014 Parallel Indexing: Dim-based Partitioning Pros: efficient parallel index generation Cons: a slave node cannot directly calculate the results; large reduction overhead

HPDC 2014 Parallel Indexing: Value-based Partitioning Pros: a slave node can directly calculate partial results; very small reduction overhead Cons: partitioning for parallel index generation is more time-consuming
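A toy illustration (not from the paper) of why the reduction is so small under value-based partitioning, shown for the joint entropy H(A,B); mutual information additionally needs the bin marginals. Each node owns a disjoint set of value bins, so its contribution to -sum p log p is a single scalar and the reduction just adds those scalars. The counts and node layout below are made up.

```python
import numpy as np

def partial_joint_entropy(local_counts, global_total):
    """One node's contribution to H(A, B) = -sum p log p over the joint bins
    it owns; disjoint bin ownership makes these contributions additive."""
    p = local_counts[local_counts > 0] / global_total
    return float(-(p * np.log(p)).sum())

# Node 0 owns A-bin 0, node 1 owns A-bin 1 (value-based partitioning).
node_counts = [np.array([[120, 30], [0, 0]]),
               np.array([[0, 0], [45, 105]])]
N = sum(c.sum() for c in node_counts)                            # one scalar all-reduce
H_joint = sum(partial_joint_entropy(c, N) for c in node_counts)  # scalar reduction
```

Under dim-based partitioning, by contrast, every node holds partial counts for all joint bins, so the whole joint-bin count table has to be reduced before any metric can be computed.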

HPDC 2014 Correlation Analysis in Distributed Environment [figure: without indexing support, the computing node reads the full data and then subsets it; with bitmap indexing, it reads only the index and the subset]

HPDC 2014 Correlation Analysis over Samples Select the bitvectors of variable A; select the bitvectors of variable B; perform index-based sampling on variable A; perform logical operations between the sample of A and the bitvectors of B
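A sketch, under assumptions, of how index-based sampling can compose with the bitmap pipeline: a sampled bitvector is produced for each bin of A (keeping roughly the same fraction of set bits per bin, in the spirit of the HPDC '13 entropy-preserving sampling), and the result is ANDed with B's bitvectors exactly as in the unsampled case. The names and the sampling rule are illustrative, not the authors' algorithm.

```python
import numpy as np

def sample_bitvectors(index_a, fraction, seed=0):
    """Keep roughly `fraction` of the set bits in every bin's bitvector,
    so the per-bin (value) distribution of A is preserved. Sketch only;
    boolean arrays stand in for compressed bitvectors."""
    rng = np.random.default_rng(seed)
    sampled = []
    for bv in index_a:
        keep = rng.random(bv.shape) < fraction   # random mask over all positions
        sampled.append(bv & keep)                # drop ~(1 - fraction) of set bits
    return sampled
```

The sampled bitvectors of A can then be fed to the same joint-bin routine sketched earlier, trading a controlled amount of accuracy for a large reduction in joint-bin generation time.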

HPDC 2014 System Architecture Workflow: parse the SQL expression → parse the metadata file → generate the query request → decide the query type → perform index-based data query and sampling → read bitvectors and generate joint bins (or read joint bitvectors directly) → calculate correlation metrics from the joint bitvectors → decide whether to keep or give up the current correlation result → continue the interactive query → read the actual data values once a satisfying result is found

HPDC 2014 User Interface
Please enter the variable names on which you want to perform correlation queries: TEMP SALT UVEL
Please enter your query: SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50;
Entropy: TEMP: 2.29, SALT: 2.66, UVEL: 3.05; Mutual Information: TEMP->SALT: 0.15, TEMP->UVEL: 0.036;
Please enter your query: SELECT SALT FROM POP WHERE SALT<0.0346;
Entropy: TEMP: 2.28, SALT: 2.53, UVEL: 3.06; Mutual Information: TEMP->UVEL: 0.039, SALT->UVEL: 0.33;
Please enter your query: UNDO
Entropy: TEMP: 2.29, SALT: 2.66, UVEL: 3.05; Mutual Information: TEMP->SALT: 0.15, TEMP->UVEL: 0.036;
Please enter your query: SELECT SALT FROM POP WHERE SALT<0.0346;
Entropy: TEMP: 2.22, SALT: 1.58, UVEL: 2.64; Mutual Information: TEMP->UVEL: 0.31, SALT->UVEL: 0.21;
……

HPDC 2014 Use Case Results Histogram of SALT based on TEMP: –Cold water (TEMP<5): high SALT –Hot water (TEMP>=15): high SALT Entropy: –TEMP: entropy stays similar –SALT: the diversity of SALT becomes larger as TEMP increases Mutual Information: –The correlation between TEMP and SALT is high when TEMP is cold or hot

HPDC 2014 Experiment Results Goals: –Speedup of correlation analysis using bitmap indexing –Scalability of parallel correlation analysis –Efficiency improvement in distributed environment –Efficiency and accuracy comparison with sampling Datasets: –Parallel Ocean Program – Multi-dimensional Arrays –26 Variables: TEMP (depth, lat, lon), SALT, UVEL …… Environment: –OSC Glenn Cluster: each node has 8 cores, 2.6 GHz AMD Opteron, 64 GB memory, 1.9 TB disk

HPDC 2014 Correlation Efficiency Comparison based on Different Subsets Variables: TEMP, SALT (5.6 GB each) Metrics: Entropy, Histogram, Mutual Info Input: 1000 queries divided into 5 categories based on subsetting percentage No Indexing (original): Data Loading + Filtering, Joint Bins Generation (scan each data element), Correlation Calculation Dynamic Indexing: Index Subset Loading, Joint Bins Generation (bitwise operations), Correlation Calculation –1.78x to 3.61x speedup –the speedup grows as the data subset size decreases Static Indexing: Joint Index Subset Loading, Correlation Calculation –11.4x to 15.35x speedup

HPDC 2014 Parallel Correlation Analysis based on Different Nodes# Variables: TEMP, SALT (28 GB each) Metrics: Entropy, Histogram, Mutual Info Nodes#: 1 – 32, one core per node Correlations are calculated over the entire data; speedup grows as more nodes are used Dim-based Partition: limited speedup (1.73x to 5.96x) –every node can only generate joint bins –joint bins from different nodes must be transferred for a global reduction (a large cost) –more nodes mean larger network transfer and calculation costs Value-based Partition: much better speedup (1.87x to 11.79x) –every node can directly calculate partial correlation metrics –very small reduction cost

HPDC 2014 Efficiency Improvement in Distributed Environment Data Size: 7 GB – 28 GB Indexing Method: –smaller data transfer time (the index is only 12.1% to 26.8% of the dataset size) –faster correlation analysis time (smaller data loading cost, faster joint-bin calculation) Speedup with a local data server (1 Gb/s): 1.87x – 1.91x Speedup with a remote data server (200 Mb/s): 2.78x – 2.96x [figures: Local Data Server (1 Gb/s); Remote Data Server (200 Mb/s)]

HPDC 2014 Efficiency and Accuracy Comparison with Sampling Select 10 variables (1.4 GB each) and calculate the mutual information between each pair (45 pairs) Calculate correlations based on samples: –joint-bin generation time is greatly reduced –extra cost: sampling time –speedup: 1.34x – 6.84x Use CFP to present the relative mutual information differences over the 45 pairs: more accuracy is lost as smaller samples are used; the average accuracy loss grows as the sample fraction decreases through 50%, 25%, 10%, 5%, and 1%

HPDC 2014 Conclusion ‘Big Data’ issue brings challenges for scientific data management Correlation analysis is useful but time-consuming Improve the efficiency of correlation analysis using bitmap indexing Develop a tool to support interactive correlation analysis over flexible subsets of the data Support correlation analysis in parallel and distributed environments Combine data sampling with correlation analysis

HPDC 2014 Thanks

HPDC 2014 Backup Slides

HPDC 2014 Correlation Efficiency Comparison based on Different Data Sizes Variables: TEMP, SALT Metrics: Entropy, Histogram, Mutual Info Input: data of different sizes No Indexing (original): Data Loading, Joint Bins Generation, Correlation Calculation Dynamic Indexing: Index Loading, Joint Bins Generation, Correlation Calculation –still achieves a good speedup because of the faster data loading and joint-bin calculation Static Indexing: Joint Index Loading, Correlation Calculation

HPDC 2014 Parallel Correlation Analysis based on Different Subsets Variables: TEMP, SALT (28 GB each) Metrics: Entropy, Histogram, Mutual Info Nodes#: 16 Input: 1000 queries divided into 5 categories based on subset sizes Dim-based Partition: the speedup is limited –bigger subsets generate a larger number of joint bins –more data transfer and reduction cost as the subset percentage increases Value-based Partition: much better scalability –1.17x to 1.58x speedup compared to dim-based partitioning –the speedup is not affected by the data subset percentage