SAGA: Array Storage as a DB with Support for Structural Aggregations
SSDBM 2014, June 30th, Aalborg, Denmark
Yi Wang, Arnab Nandi, Gagan Agrawal, The Ohio State University

Outline: Introduction, Grid Aggregations, Overlapping Aggregations, Experimental Results, Conclusion

Big Data Is Often Big Arrays
Array data is everywhere:
– Molecular simulation: molecular data
– Life science: DNA sequencing data (microarray)
– Earth science: ocean and climate data
– Space science: astronomy data

How to Process Big Arrays? Use relational databases?
– Poor expressibility: loses the natural positional/structural information; most complex operations (e.g., correlations, convolution, curve fitting) are naturally defined in terms of arrays
– Poor performance: cumbersome data transformations; too heavyweight (e.g., transactions)
One size does not fit all!
(Figure: processing arrays in a relational DB requires mapping the input table to an array, manipulating the array, and rendering the output array back to an output table.)

Array Databases
Examples: SciDB, RasDaMan, and MonetDB
– Arrays are first-class citizens: everything is defined in the array dialect
– Lightweight or no ACID maintenance: with no write conflicts, ACID is inherently guaranteed
– Other desired functionality: structural aggregations, array join, provenance, ...

The Upfront Cost of Using SciDB
– High-level data flow: requires data ingestion
– Data ingestion steps: convert raw files (e.g., HDF5) to CSV, then load the CSV files into SciDB
(See "EarthDB: scalable analysis of MODIS data using SciDB", G. Planthaber et al. A sketch of the first conversion step follows below.)
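As a concrete illustration of the first ingestion step only, here is a minimal sketch (not from the original slides) that dumps an HDF5 dataset to CSV using h5py and NumPy. The file name and dataset name are hypothetical placeholders.

```python
import h5py
import numpy as np

# Hypothetical file and dataset names; adjust to the actual raw data.
with h5py.File("climate.h5", "r") as f:
    data = f["temperature"][...]          # read the whole dataset into memory

# Flatten an n-D array into (index..., value) rows so a CSV loader can
# later reconstruct the array dimensions.
rows = [list(idx) + [val] for idx, val in np.ndenumerate(data)]
np.savetxt("temperature.csv", rows, delimiter=",", fmt="%g")
```

The second step, loading the CSV into SciDB, uses SciDB's own loading tools and is not sketched here.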

Array Storage as a DB
A paradigm similar to NoDB:
– Still maintains DB functionality, but with no data ingestion
DB and array storage as a DB: friends or foes?
– When to use a DB: load once, and query frequently
– When to directly use array storage: query infrequently, so avoid loading
Our system focuses on a set of special array operations: structural aggregations

Structural Aggregation
Aggregate the elements based on positional relationships
– E.g., moving average: calculates the average of each 2 × 2 square, from left to right
(Figure: input array and aggregation result; the elements in the same square are aggregated at a time. A minimal NumPy sketch follows below.)
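A minimal sketch of this kind of aggregation (an illustration, not the SAGA implementation): a non-overlapping 2 × 2 grid average over a NumPy array, assuming the array dimensions are multiples of 2.

```python
import numpy as np

def grid_average_2x2(a):
    """Average each non-overlapping 2 x 2 square of a 2-D array."""
    rows, cols = a.shape
    assert rows % 2 == 0 and cols % 2 == 0, "dimensions must be multiples of 2"
    # Reshape so each 2 x 2 square becomes its own block, then average each block.
    return a.reshape(rows // 2, 2, cols // 2, 2).mean(axis=(1, 3))

a = np.arange(16, dtype=float).reshape(4, 4)
print(grid_average_2x2(a))   # 2 x 2 array of per-square averages
```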

Structural Aggregation Types
– Non-overlapping aggregation (e.g., the grid aggregation)
– Overlapping aggregation (e.g., the sliding, hierarchical, and circular aggregations)

Grid Aggregation
Parallelization: easy after partitioning
Considerations:
– Data contiguity, which affects the I/O performance
– Communication cost
– Load balancing for skewed data
Partitioning strategies: coarse-grained, fine-grained, hybrid, and auto-grained
– Why not use dynamic repartitioning? Runtime overhead, poor data contiguity, and redundant data loads

Coarse-Grained Partitioning
Pros:
– Low I/O cost
– Low communication cost
Cons:
– Workload imbalance for skewed data

Fine-Grained Partitioning
Pros:
– Excellent workload balance for skewed data
Cons:
– Relatively high I/O cost
– High communication cost

Hybrid Partitioning
Pros:
– Low communication cost
– Good workload balance for skewed data
Cons:
– High I/O cost
(A toy sketch contrasting how these three strategies might assign grids to processes follows below.)
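The slides describe these strategies only through their trade-offs (the figures are not reproduced here), so the following is a rough, assumption-laden reading rather than the paper's actual scheme: coarse-grained assigns contiguous blocks of whole grids, hybrid deals whole grids round-robin, and fine-grained splits every grid across all processes.

```python
# Toy illustration of grid-to-process assignment under each strategy
# (my interpretation of the stated pros/cons, not code from the paper).

def coarse_grained(G, P):
    """Contiguous blocks of whole grids per process: contiguous I/O and no
    cross-process combining, but imbalance if some grids are dense."""
    per = (G + P - 1) // P
    return {p: list(range(p * per, min((p + 1) * per, G))) for p in range(P)}

def hybrid(G, P):
    """Whole grids dealt round-robin: better balance, still no combining,
    but non-contiguous reads raise the I/O cost."""
    return {p: [g for g in range(G) if g % P == p] for p in range(P)}

def fine_grained(G, P):
    """Every grid split across all processes: best balance, but each grid's
    partial aggregates must be communicated and merged."""
    return {p: [(g, p) for g in range(G)] for p in range(P)}  # (grid, sub-part)

print(coarse_grained(8, 2))   # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]}
print(hybrid(8, 2))           # {0: [0, 2, 4, 6], 1: [1, 3, 5, 7]}
```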

Auto-Grained Partitioning
Two steps:
– Estimate the grid density (after filtering) by uniform sampling, and hence estimate the computation cost (based on the computation complexity); for each grid, total processing cost = constant loading cost + variable computation cost
– Partition the cost array via balanced contiguous multi-way partitioning: dynamic programming for a small number of grids, greedy for a large number of grids (a greedy sketch follows below)
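For the second step, here is a minimal sketch of the greedy variant of balanced contiguous multi-way partitioning (an illustration under my own assumptions, not the paper's exact algorithm): walk the per-grid cost array in order and close the current partition once its running cost reaches the ideal per-partition share.

```python
def greedy_contiguous_partition(costs, k):
    """Split a list of per-grid costs into k contiguous partitions with
    roughly equal total cost (greedy heuristic)."""
    target = sum(costs) / k          # ideal cost per partition
    partitions, current, running = [], [], 0.0
    for i, c in enumerate(costs):
        current.append(i)
        running += c
        remaining_grids = len(costs) - i - 1
        remaining_parts = k - len(partitions) - 1
        # Close this partition once it reaches the target, but keep enough
        # grids for the partitions that still have to be formed.
        if running >= target and remaining_parts > 0 and remaining_grids >= remaining_parts:
            partitions.append(current)
            current, running = [], 0.0
    partitions.append(current)
    return partitions

# Per-grid cost = constant loading cost + estimated variable computation cost.
costs = [1.0, 1.2, 5.0, 0.8, 0.9, 4.5, 1.1, 1.0]
print(greedy_contiguous_partition(costs, 3))   # [[0, 1, 2], [3, 4, 5], [6, 7]]
```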

Auto-Grained Partitioning (Cont'd)
Pros:
– Low I/O cost
– Low communication cost
– Great workload balance for skewed data
Cons:
– Overhead of sampling and runtime partitioning

Partitioning Strategy Summary

Strategy        | I/O Performance | Workload Balance | Scalability | Additional Cost
Coarse-Grained  | Excellent       | Poor             | Excellent   | None
Fine-Grained    | Poor            | Excellent        | Poor        | None
Hybrid          | Poor            | Good             | Good        | None
Auto-Grained    | Great           | Great            | Great       | Nontrivial

Our partitioning strategy decider can help choose the best strategy

Partitioning Strategy Decider
Cost model: analyze load cost and computation cost separately
– Load cost: loading factor × data amount
– Computation cost
Exception: for auto-grained, take the load cost and the computation cost as a whole
(A toy sketch of this decision follows below.)
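A minimal sketch of how such a decider could combine the two terms. Only the formula structure (load cost = loading factor × data amount, plus a computation cost) comes from the slides; the numeric parameters and names here are hypothetical.

```python
def total_cost(loading_factor, data_amount, computation_cost):
    """Load cost = loading factor x data amount, plus the estimated computation cost."""
    return loading_factor * data_amount + computation_cost

# Hypothetical per-strategy parameters: the loading factor reflects how
# contiguous the strategy's reads are, and the computation cost reflects its
# expected load balance. Auto-grained would be costed as a single combined
# (load + computation) estimate instead, which is not modeled here.
candidates = {
    "coarse-grained": total_cost(1.0, 8.0, 40.0),
    "fine-grained":   total_cost(3.0, 8.0, 10.0),
    "hybrid":         total_cost(2.5, 8.0, 15.0),
}
print(min(candidates, key=candidates.get))   # strategy with the smallest total cost
```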

Overlapping Aggregation
I/O cost:
– Reuse the data already in memory; reduce disk I/O to enhance the I/O performance
Memory accesses:
– Reuse the data already in the cache; reduce cache misses to accelerate the computation
Aggregation approaches:
– Naïve approach
– Data-reuse approach
– All-reuse approach

Example: Hierarchical Aggregation
Aggregate 3 nested grids in a 6 × 6 array:
– The innermost 2 × 2 grid
– The middle 4 × 4 grid
– The outermost 6 × 6 grid
(Parallel) sliding aggregation is much more complicated

Naïve Approach
1. Load the innermost grid
2. Aggregate the innermost grid
3. Load the middle grid
4. Aggregate the middle grid
5. Load the outermost grid
6. Aggregate the outermost grid
For N grids: N loads + N aggregations

Data-Reuse Approach
1. Load the outermost grid
2. Aggregate the outermost grid
3. Aggregate the middle grid
4. Aggregate the innermost grid
For N grids: 1 load + N aggregations

All-Reuse Approach
1. Load the outermost grid
2. Once an element is accessed, accumulatively update all the aggregation results it contributes to: elements only in the outermost grid update the outermost result; elements also in the middle grid update both the outermost and the middle results; elements in the innermost grid update all 3 results
For N grids: 1 load + 1 aggregation
(A sequential sketch follows below.)
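A minimal sketch of the all-reuse idea for the hierarchical example above (a sequential illustration, not the SAGA code): read the outermost grid once and, for each element, update every nested aggregate whose grid contains it.

```python
import numpy as np

def hierarchical_sums_all_reuse(a, grid_sizes=(2, 4, 6)):
    """One pass over a 6 x 6 array, accumulating the sum of each centered
    k x k grid (k = 2, 4, 6) that contains the current element."""
    n = a.shape[0]
    sums = {k: 0.0 for k in grid_sizes}
    for i in range(n):
        for j in range(n):
            for k in grid_sizes:
                lo, hi = (n - k) // 2, (n + k) // 2   # bounds of the centered k x k grid
                if lo <= i < hi and lo <= j < hi:
                    sums[k] += a[i, j]                # update every result this element feeds
    return sums

a = np.arange(36, dtype=float).reshape(6, 6)
print(hierarchical_sums_all_reuse(a))
# The naive approach would load and scan the 3 grids separately (N loads +
# N aggregations); data-reuse loads once but still scans 3 times; all-reuse
# loads once and scans once.
```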

All-Reuse Approach (Cont'd)
Key insight:
– # of aggregation results ≤ # of queried elements, so it is more computationally efficient to iterate over the elements and update the associated aggregation results
More benefits:
– Load balance (for hierarchical/circular aggregations)
– More speedup for compound array elements: the data type of an aggregation result is usually primitive, but this is not always true for an array element

Parallel Performance vs. SciDB
– No preprocessing cost is included for SciDB
– Array slab / data size (8 GB) ratio: from 12.5% to 100%
– Coarse-grained partitioning is used for the grid aggregation, and the all-reuse approach for the sliding aggregation
– SciDB stores the array in chunks and can even use overlapping chunking to accelerate the sliding aggregation

Parallel Sliding Aggregation Performance
– # of nodes: from 1 to 16
– Data size: 8 GB
– Sliding grid size: from 3 × 3 to 7 × 7

Conclusion
– Support for efficient structural aggregations over native array storage
– Different partitioning strategies and a cost model for grid aggregations
– The all-reuse approach for overlapping aggregations