HPDC 2014 Supporting Correlation Analysis on Scientific Datasets in Parallel and Distributed Settings
Yu Su*, Gagan Agrawal*, Jonathan Woodring#, Ayan Biswas*, Han-Wei Shen*
*The Ohio State University
#Los Alamos National Laboratory

HPDC 2014 Motivation: Big Data
Gaps between data generation and storage
Molecular Simulation: Molecular Data
Life Science: DNA Sequencing Data (Microarray)
Earth Science: Ocean and Climate Data
Space Science: Astronomy Data

HPDC 2014 Big Data (Volume/Velocity) Challenge
Data movement is the bottleneck
–Memory to CPU
–Disk to memory
–Wide area
Memory availability is another challenge
Can we work with a summary of the data?
–Compression approaches have already been shown to be applicable

HPDC 2014 Context: Correlation Data Analysis
Scientific Analysis Type:
–Individual variable analysis: data subsetting, aggregation, mining, visualization
–Correlation analysis: study relationships among multiple variables, make interesting scientific discoveries
“Big Data” problem becomes more severe:
–Huge data loading cost (multiple variables)
–Additional filtering cost for subset-based correlation analysis
–Huge correlation calculation cost
Correlation analysis is useful but extremely time-consuming and resource-intensive

HPDC 2014 Our Solution and Contributions (1)
Identify bitvectors as a summary structure
–Space efficient
–Data movement efficient
–Assumed to be constructed offline
Correlation computation using bitmaps
–Better efficiency
–Smaller memory cost
–Parallelization
–Across data stored in distributed repositories

HPDC 2014 Our Solution and Contributions (2)
An interactive framework to support both individual and correlation analysis based on bitmaps
–Correlations and other operations using high-level operators
–Individual analysis: flexible data subsetting
–Correlation analysis: interactive correlation queries among multiple variables
–Correlation over flexible data subsets
–Combine with index-based sampling

HPDC 2014 Background: Bitmap Indexing
Widely used in scientific data management
Suitable for floating-point values by binning small ranges
Run-length compression (WAH, BBC)
Bitmap indices can be treated as a small profile of the data
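As a rough illustration of binning floating-point values into a bitmap index, the sketch below builds one bitvector per bin and stores each bitvector as a plain Python integer. The function names, the sample values, and the bin edges are all hypothetical, and run-length compression such as WAH or BBC is left out entirely; this is not the authors' implementation.

```python
from typing import List


def build_bitmap_index(values: List[float], bin_edges: List[float]) -> List[int]:
    """Return one bitvector per bin; bit i of bitvector b is set when values[i] is in bin b.

    Bins are half-open [lo, hi), except the last bin, which also includes its
    upper edge. Values outside the overall range stay unindexed in this sketch.
    """
    n_bins = len(bin_edges) - 1
    bitvectors = [0] * n_bins
    for i, v in enumerate(values):
        for b in range(n_bins):
            lo, hi = bin_edges[b], bin_edges[b + 1]
            is_last = b == n_bins - 1
            if lo <= v < hi or (is_last and v == hi):
                bitvectors[b] |= 1 << i  # mark element i as belonging to bin b
                break
    return bitvectors


def popcount(bitvector: int) -> int:
    """Number of set bits, i.e. the number of elements in the bin."""
    return bin(bitvector).count("1")


# Hypothetical data: six values of one variable, three bins of width 1.0.
temps = [0.2, 1.7, 0.9, 2.4, 1.1, 0.4]
index = build_bitmap_index(temps, [0.0, 1.0, 2.0, 3.0])
bin_counts = [popcount(bv) for bv in index]
```

Here `bin_counts` is the 1-D histogram of the variable (3, 2, and 1 elements per bin), recovered from the index alone without rereading the values.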

HPDC 2014 Bitmaps and Summarization
Preserves the spatial distribution of the data
Accurate within the limits of binning
Storage requirement within 15-25% of the data size after compression
Entropy-preserving sampling (HPDC 13)
May already be built to support query processing
How do we calculate correlation metrics?
–Accurately and efficiently

HPDC 2014 Metrics of Correlation Analysis
2-D Histogram:
–Indicates the relationship between value distributions
–Value distribution of one variable with respect to changes in another
Shannon’s Entropy:
–A metric showing the variability of the dataset
–Low entropy => more constant, predictable data
–High entropy => more randomly distributed data
Mutual Information:
–A metric for computing the dependence between two variables
–Low mutual information => the two variables are relatively independent
–High mutual information => one variable provides information about the other
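The metrics above can be sketched with the standard textbook definitions, H(X) = -sum p(x) log p(x) and I(X;Y) = sum p(x,y) log( p(x,y) / (p(x) p(y)) ), evaluated over bin counts. The slides do not state the logarithm base; base 2 is an assumption here, and the function names are hypothetical.

```python
import math
from typing import List


def entropy(counts: List[int]) -> float:
    """Shannon entropy (in bits, assuming log base 2) of a 1-D histogram."""
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def mutual_information(joint: List[List[int]]) -> float:
    """Mutual information (in bits) from a 2-D histogram of joint bin counts."""
    total = sum(sum(row) for row in joint)
    row_sums = [sum(row) for row in joint]        # marginal counts of variable A
    col_sums = [sum(col) for col in zip(*joint)]  # marginal counts of variable B
    mi = 0.0
    for i, row in enumerate(joint):
        for j, c in enumerate(row):
            if c > 0:
                p_ab = c / total
                p_a = row_sums[i] / total
                p_b = col_sums[j] / total
                mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


uniform_entropy = entropy([1, 1, 1, 1])                # four equally likely bins
dependent_mi = mutual_information([[2, 0], [0, 2]])    # A fully determines B
independent_mi = mutual_information([[1, 1], [1, 1]])  # A says nothing about B
```

A uniform histogram over four bins gives 2 bits of entropy; the fully dependent table gives 1 bit of mutual information and the independent table gives 0.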

HPDC 2014 Bitmap-based Correlations
No Indexing Support:
–Load all data of variables A and B
–Filter A and B and generate the subset (for value-based subsetting)
–Generate joint bins: divide A and B into bins, then generate (A_1, B_1) -> count_11, …, (A_m, B_m) -> count_mm by scanning each data element
–Calculate correlation metrics based on the joint bins
Dynamic Indexing (build an index for each variable):
–Query bitvectors for variables A and B (much smaller index loading cost, very small filtering cost)
–Generate joint bins: generate (A_1, B_1) -> count_11, …, (A_m, B_m) -> count_mm using fast bitwise operations between A and B (the number of bitvectors is much smaller than the number of elements)
–Calculate correlation metrics based on the joint bins
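The joint-bin step under dynamic indexing can be sketched as one bitwise AND plus a bit count per (A bin, B bin) pair. The bitvectors are held as uncompressed Python integers, and both the function names and the example index contents are hypothetical; a real system would operate on compressed bitvectors.

```python
from typing import List


def popcount(bitvector: int) -> int:
    """Number of set bits in a bitvector stored as a Python int."""
    return bin(bitvector).count("1")


def joint_bin_counts(index_a: List[int], index_b: List[int]) -> List[List[int]]:
    """count[i][j] = number of elements that fall in bin i of A and bin j of B."""
    return [[popcount(a & b) for b in index_b] for a in index_a]


# Hypothetical indices over six elements: A has three bins, B has two bins.
index_a = [0b100101, 0b010010, 0b001000]
index_b = [0b000111, 0b111000]
joint = joint_bin_counts(index_a, index_b)
```

For these inputs `joint` is `[[2, 1], [1, 1], [0, 1]]`. The work is proportional to the number of bitvector pairs rather than the number of data elements. Under this sketch, value-based subsetting on A would amount to passing only the bitvectors of the selected bins.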

HPDC 2014 Calculation Steps
[Figure: calculation steps; surviving label: Memory]

HPDC 2014 Static Indexing
Dynamic Indexing:
–Build one index for each variable
–Still need to perform bitwise operations to generate joint bins
Static Indexing:
–Build one index over multiple variables
–Only need to load the bitvectors and perform the calculation
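One way to read the contrast above: with a static index the pairwise AND is paid once, offline, so that query time only needs bit counts. The sketch below stores one joint bitvector per (A bin, B bin) pair in a dictionary; this layout and the function names are assumptions for illustration, not the format used by the authors.

```python
from typing import Dict, List, Tuple


def build_joint_index(index_a: List[int], index_b: List[int]) -> Dict[Tuple[int, int], int]:
    """Offline step: precompute one joint bitvector per (A bin, B bin) pair."""
    return {(i, j): a & b for i, a in enumerate(index_a) for j, b in enumerate(index_b)}


def joint_counts_from_static(joint_index: Dict[Tuple[int, int], int],
                             n_bins_a: int, n_bins_b: int) -> List[List[int]]:
    """Query step: no bitwise AND is needed, only a bit count per joint bitvector."""
    return [[bin(joint_index[(i, j)]).count("1") for j in range(n_bins_b)]
            for i in range(n_bins_a)]


# Hypothetical per-variable indices over six elements.
index_a = [0b100101, 0b010010, 0b001000]
index_b = [0b000111, 0b111000]

static_index = build_joint_index(index_a, index_b)       # built once, offline
joint = joint_counts_from_static(static_index, 3, 2)     # cheap at query time
```

The trade-off in this sketch is storage: the joint index holds one bitvector per bin pair, so it grows with the product of the bin counts of the two variables.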

HPDC 2014 Parallel Indexing: Dim-based Partitioning
Pros: efficient parallel index generation
Cons: slave nodes cannot directly calculate the results; big reduction overhead

HPDC 2014 Parallel Indexing: Value-based Partitioning
Pros: slave nodes can directly calculate partial results; very small reduction overhead
Cons: partitioning for parallel index generation is more time-consuming
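The difference in reduction cost between the two partitioning schemes can be sketched in a single process. Under one reading of dim-based partitioning, each node holds a slice of element positions and can only produce a joint-bin table, so whole tables must be summed across nodes. Under value-based partitioning, each node owns some of A's bins and, assuming it also holds every bitvector of B at full length, can compute its share of the mutual information as a single number. The function names, the data, and this interpretation are all assumptions.

```python
import math
from typing import List


def popcount(bitvector: int) -> int:
    return bin(bitvector).count("1")


def joint_counts(index_a: List[int], index_b: List[int]) -> List[List[int]]:
    return [[popcount(a & b) for b in index_b] for a in index_a]


def slice_index(index: List[int], start: int, stop: int) -> List[int]:
    """Dim-based partition: keep element positions [start, stop) of every bitvector."""
    mask = (1 << (stop - start)) - 1
    return [(bv >> start) & mask for bv in index]


def partial_mi(owned_a: List[int], index_b: List[int], n: int) -> float:
    """Value-based partition: mutual-information terms for the A bins one node owns."""
    total = 0.0
    for a in owned_a:
        p_a = popcount(a) / n
        for b in index_b:
            p_ab = popcount(a & b) / n
            if p_ab > 0:
                total += p_ab * math.log2(p_ab / (p_a * (popcount(b) / n)))
    return total


index_a = [0b100101, 0b010010, 0b001000]  # hypothetical: three bins of A
index_b = [0b000111, 0b111000]            # hypothetical: two bins of B
n = 6                                     # number of data elements

# Dim-based: two nodes, three elements each; the reduction sums whole tables.
node_tables = [joint_counts(slice_index(index_a, s, e), slice_index(index_b, s, e))
               for s, e in [(0, 3), (3, 6)]]
reduced = [[sum(t[i][j] for t in node_tables) for j in range(len(index_b))]
           for i in range(len(index_a))]

# Value-based: node 0 owns A bins 0 and 1, node 1 owns A bin 2;
# the reduction sums one scalar per node.
node_scalars = [partial_mi(index_a[:2], index_b, n), partial_mi(index_a[2:], index_b, n)]
mi_value_based = sum(node_scalars)
```

In this sketch the dim-based reduction moves a table of joint-bin counts per node, while the value-based reduction moves a single floating-point number per node and metric.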

HPDC 2014 Correlation Analysis in Distributed Environment
Without Indexing Support: Read Data Subset
Using Bitmap Indexing: Read Index Subset
Computing Node

HPDC 2014 Correlation Analysis over Samples
Select bitvectors of variable A
Select bitvectors of variable B
Perform index-based sampling on variable A
Logic operations between the sample of A and the bitvectors of B
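The four steps above can be sketched as follows. Sampling is modelled here as keeping a uniformly random fraction of the set bits in each bitvector of A, which keeps the per-bin proportions of A roughly intact. That sampling rule, the function names, and the data are assumptions; the slides do not spell out the sampling procedure.

```python
import random
from typing import List


def set_bit_positions(bitvector: int) -> List[int]:
    """Positions of the set bits, lowest position first."""
    positions, i = [], 0
    while bitvector:
        if bitvector & 1:
            positions.append(i)
        bitvector >>= 1
        i += 1
    return positions


def sample_bitvector(bitvector: int, fraction: float, rng: random.Random) -> int:
    """Keep about `fraction` of the set bits (at least one if the bin is non-empty)."""
    positions = set_bit_positions(bitvector)
    k = max(1, int(len(positions) * fraction)) if positions else 0
    sampled = 0
    for p in rng.sample(positions, k):
        sampled |= 1 << p
    return sampled


def popcount(bitvector: int) -> int:
    return bin(bitvector).count("1")


rng = random.Random(0)                    # fixed seed so the sketch is repeatable
index_a = [0b00001111, 0b11110000]        # hypothetical: two bins of A, 8 elements
index_b = [0b01010101, 0b10101010]        # hypothetical: two bins of B

sampled_a = [sample_bitvector(a, 0.5, rng) for a in index_a]
# Logic operations between the sample of A and the full bitvectors of B.
sampled_joint = [[popcount(sa & b) for b in index_b] for sa in sampled_a]
```

Only A is sampled. Because each sampled bitvector is a subset of the original, the AND with B's full bitvectors automatically restricts B to the sampled positions.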

HPDC 2014 System Architecture
Parse the SQL expression
Parse the metadata file
Generate query request
Decide query types
Perform index-based data query and sampling
Read bitvectors and generate joint bins
Read joint bitvectors
Calculate correlation metrics based on joint bitvectors
Give up current correlation result or not?
Continue interactive query
Read the data values after finding a satisfying result

HPDC 2014 User Interface
Please enter variable names which you want to perform correlation queries:
TEMP SALT UVEL
Please enter your query:
SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50;
Entropy: TEMP: 2.29, SALT: 2.66, UVEL: 3.05;
Mutual Information: TEMP->SALT: 0.15, TEMP->UVEL: 0.036;
Please enter your query:
SELECT SALT FROM POP WHERE SALT<0.0346;
Entropy: TEMP: 2.28, SALT: 2.53, UVEL: 3.06;
Mutual Information: TEMP->UVEL: 0.039, SALT->UVEL: 0.33;
Please enter your query:
UNDO
Entropy: TEMP: 2.29, SALT: 2.66, UVEL: 3.05;
Mutual Information: TEMP->SALT: 0.15, TEMP->UVEL: 0.036;
Please enter your query:
SELECT SALT FROM POP WHERE SALT<0.0346;
Entropy: TEMP: 2.22, SALT: 1.58, UVEL: 2.64;
Mutual Information: TEMP->UVEL: 0.31, SALT->UVEL: 0.21;
…

HPDC 2014 Use Case Results
Histogram of SALT based on TEMP
–Cold water (TEMP<5): high SALT
–Hot water (TEMP>=15): high SALT
Entropy
–TEMP: similar entropy
–SALT: diversity of SALT becomes bigger as TEMP increases
Mutual Information
–Correlation between TEMP and SALT is high when the water is cold or hot

HPDC 2014 Experiment Results
Goals:
–Speedup of correlation analysis using bitmap indexing
–Scalability of parallel correlation analysis
–Efficiency improvement in distributed environment
–Efficiency and accuracy comparison with sampling
Datasets:
–Parallel Ocean Program (multi-dimensional arrays)
–26 variables: TEMP (depth, lat, lon), SALT, UVEL, …
Environment:
–OSC Glenn Cluster: each node has 8 cores, 2.6 GHz AMD Opteron, 64 GB memory, 1.9 TB disk

HPDC 2014 Correlation Efficiency Comparison based on Different Subsets
No Indexing (original):
–Data loading + filtering
–Joint bins generation (scan each data element)
–Correlation calculation
Dynamic Indexing:
–Index subset loading
–Joint bins generation (bitwise operations)
–Correlation calculation
–1.78x to 3.61x speedup
–Speedup becomes bigger as data subset size decreases
Static Indexing:
–Joint index subset loading
–Correlation calculation
–11.4x to 15.35x speedup
Variables: TEMP, SALT (5.6 GB each)
Metrics: Entropy, Histogram, Mutual Info
Input: 1000 queries divided into 5 categories based on subsetting percentage

HPDC 2014 Parallel Correlation Analysis based on Different Numbers of Nodes
Variables: TEMP, SALT (28 GB each)
Metrics: Entropy, Histogram, Mutual Info
Number of nodes: 1 to 32, one core per node
Calculate correlations based on the entire data
Speedup increases as more nodes are used
Dim-based Partition: the speedup is limited
–1.73x to 5.96x speedup
–Every node can only generate joint bins
–Joint bins from different nodes need to be transferred for a global reduction (big cost)
–More nodes means bigger network transfer and calculation cost
Value-based Partition: much better speedup
–1.87x to 11.79x speedup
–Every node can directly calculate partial correlation metrics
–Very small reduction cost

HPDC 2014 Efficiency Improvement in Distributed Environment
Data size: 7 GB to 28 GB
Indexing method:
–Smaller data transfer time (index size is only 12.1% to 26.8% of the dataset)
–Faster correlation analysis time (smaller data loading, faster joint bin calculation)
Speedup using local data server (1 Gb/s): 1.87x to 1.91x
Speedup using remote data server (200 Mb/s): 2.78x to 2.96x
[Figure panels: Local Data Server (1 Gb/s); Remote Data Server (200 Mb/s)]
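A back-of-the-envelope sketch of why the remote link benefits more from shipping the index instead of the data. It assumes that "Gb/s" and "Mb/s" mean bits per second, that 1 GB is 10^9 bytes, and that links run at their nominal rate with no protocol overhead. The function name is hypothetical, and the outputs are estimates derived from the slide's figures, not measurements.

```python
def transfer_seconds(size_gb: float, link_gbit_per_s: float) -> float:
    """Idealized transfer time: gigabytes times 8 bits per byte, divided by link rate."""
    return size_gb * 8 / link_gbit_per_s


DATA_GB = 28.0           # largest dataset size on the slide
INDEX_FRACTION = 0.121   # smallest reported index-to-data size ratio

estimates = {}
for name, rate in [("local (1 Gb/s)", 1.0), ("remote (200 Mb/s)", 0.2)]:
    raw = transfer_seconds(DATA_GB, rate)
    idx = transfer_seconds(DATA_GB * INDEX_FRACTION, rate)
    estimates[name] = (raw, idx, raw - idx)  # seconds: data, index, time saved

for name, (raw, idx, saved) in estimates.items():
    print(f"{name}: data {raw:.1f} s, index {idx:.1f} s, saved {saved:.1f} s")
```

Under these assumptions, moving 28 GB takes about 224 s on the local link and about 1120 s on the remote one, so the same relative reduction in bytes saves five times as many seconds remotely. These figures cover transfer only and are therefore not directly comparable to the end-to-end speedups reported on the slide.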

HPDC 2014 Efficiency and Accuracy Comparison with Sampling
Select 10 variables (1.4 GB each) and calculate mutual information between each pair (45 pairs)
Calculate correlation based on samples:
–Joint bins generation time is greatly reduced
–Extra cost: sampling time
–Speedup: 1.34x to 6.84x
Use CFP to present relative mutual information differences (45 pairs)
More accuracy is lost as smaller samples are used; average accuracy lost by sample size:
–50% sample: 1.53%
–25% sample: 3.42%
–10% sample: 7.91%
–5% sample: 12.57%
–1% sample: 18.32%

HPDC 2014 Conclusion
The “Big Data” issue brings challenges for scientific data management
Correlation analysis is useful but time-consuming
Improve the efficiency of correlation analysis using bitmap indexing
Develop a tool to support interactive correlation analysis over flexible subsets of the data
Support correlation analysis in parallel and distributed environments
Combine data sampling with correlation analysis

HPDC 2014 Thanks

HPDC 2014 Backup Slides

HPDC 2014 Correlation Efficiency Comparison based on Different Data Sizes
Variables: TEMP, SALT
Metrics: Entropy, Histogram, Mutual Info
Input: data with different sizes
No Indexing (original):
–Data loading
–Joint bins generation
–Correlation calculation
Dynamic Indexing:
–Index loading
–Joint bins generation
–Correlation calculation
–Still achieves a good speedup because of faster data loading and faster joint bins calculation
Static Indexing:
–Joint index loading
–Correlation calculation

HPDC 2014 Parallel Correlation Analysis based on Different Subsets
Variables: TEMP, SALT (28 GB each)
Metrics: Entropy, Histogram, Mutual Info
Number of nodes: 16
Input: 1000 queries divided into 5 categories based on subset sizes
Dim-based Partition: the speedup is limited
–Bigger subsets generate a larger number of joint bins
–More data transfer and reduction cost as subset percentage increases
Value-based Partition: much better scalability
–1.17x to 1.58x speedup compared to dim-based partition
–The speedup is not affected by data subset percentage
