SAGA: Array Storage as a DB with Support for Structural Aggregations
SSDBM 2014, June 30th, Aalborg, Denmark
Yi Wang, Arnab Nandi, Gagan Agrawal
The Ohio State University

Outline
– Introduction
– Grid Aggregations
– Overlapping Aggregations
– Experimental Results
– Conclusion

Big Data Is Often Big Arrays
Array data is everywhere:
– Molecular Simulation: Molecular Data
– Life Science: DNA Sequencing Data (Microarray)
– Earth Science: Ocean and Climate Data
– Space Science: Astronomy Data

How to Process Big Arrays?
Use relational databases?
– Poor Expressibility
  – Loses the natural positional/structural information
  – Most complex operations are naturally defined in terms of arrays: e.g., correlations, convolution, curve fitting …
– Poor Performance
  – Cumbersome data transformations
  – Too heavyweight: e.g., transactions
One size does not fit all!
[Figure: Input Table, Input Array, Output Array, Output Table; steps: Mapping, Manipulation, Rendering]

Array Databases
Examples: SciDB, RasDaMan and MonetDB
Treat Arrays as First-Class Citizens
– Everything is defined in the array dialect
Lightweight or No ACID Maintenance
– No write conflict: ACID is inherently guaranteed
Other Desired Functionality
– Structural aggregations, array join, provenance…

The Upfront Cost of Using SciDB
High-Level Data Flow
– Requires data ingestion
Data Ingestion Steps
– Raw files (e.g., HDF5) -> CSV
– Load CSV files into SciDB
“EarthDB: scalable analysis of MODIS data using SciDB”, G. Planthaber et al.

Array Storage as a DB
A Paradigm Similar to NoDB
– Still maintains DB functionality
– But no data ingestion
DB and Array Storage as a DB: Friends or Foes?
– When to use DB? Load once, and query frequently
– When to directly use array storage? Query infrequently, so avoid loading
Our System
– Focuses on a set of special array operations: Structural Aggregations

Structural Aggregation
Aggregate the elements based on positional relationships
– E.g., moving average: calculates the average of each 2 × 2 square from left to right
Input Array:
  1 2 3 4
  5 6 7 8
Aggregation Result:
  3.5 4.5 5.5
Aggregate the elements in the same square at a time
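The moving-average example on this slide can be reproduced with a short sketch. The function name `sliding_average` and its parameters are hypothetical, chosen only for illustration; this is plain in-memory Python, not the SAGA implementation.

```python
def sliding_average(array, window_rows, window_cols):
    """Average every window_rows x window_cols window, moving one step at a time."""
    rows, cols = len(array), len(array[0])
    result = []
    for r in range(rows - window_rows + 1):
        out_row = []
        for c in range(cols - window_cols + 1):
            # Sum all elements that fall inside the window anchored at (r, c).
            total = sum(
                array[r + i][c + j]
                for i in range(window_rows)
                for j in range(window_cols)
            )
            out_row.append(total / (window_rows * window_cols))
        result.append(out_row)
    return result


# The 2 x 4 input array from the slide.
input_array = [[1, 2, 3, 4],
               [5, 6, 7, 8]]
moving_avg = sliding_average(input_array, 2, 2)
print(moving_avg)  # [[3.5, 4.5, 5.5]]
```

Adjacent windows share a column of elements, which is what makes this an overlapping aggregation.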

Structural Aggregation Types
– Non-Overlapping Aggregation
– Overlapping Aggregation
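As a rough illustration of the non-overlapping case, the sketch below tiles an array into disjoint grids and reduces each grid to one value. The name `grid_aggregate` and the choice of `sum` as the default reduction are assumptions made for this example, not part of the original slides.

```python
def grid_aggregate(array, grid_rows, grid_cols, reduce_fn=sum):
    """Reduce each disjoint grid_rows x grid_cols tile of the array to one value."""
    rows, cols = len(array), len(array[0])
    result = []
    for r in range(0, rows, grid_rows):          # step by a full tile: no overlap
        out_row = []
        for c in range(0, cols, grid_cols):
            tile = [
                array[i][j]
                for i in range(r, min(r + grid_rows, rows))
                for j in range(c, min(c + grid_cols, cols))
            ]
            out_row.append(reduce_fn(tile))
        result.append(out_row)
    return result


grid_sums = grid_aggregate([[1, 2, 3, 4],
                            [5, 6, 7, 8]], 2, 2)
print(grid_sums)  # [[14, 22]]
```

Every element belongs to exactly one tile here, whereas in an overlapping aggregation an element may contribute to several results.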

Grid Aggregation
Parallelization: Easy after Partitioning
Considerations
– Data contiguity, which affects the I/O performance
– Communication cost
– Load balancing for skewed data
Partitioning Strategies
– Coarse-grained, fine-grained, hybrid, and auto-grained
– Why not use dynamic repartitioning?
  – Runtime overhead
  – Poor data contiguity
  – Redundant data loads

Coarse-Grained Partitioning
Pros
– Low I/O cost
– Low communication cost
Cons
– Workload imbalance for skewed data

Fine-Grained Partitioning
Pros
– Excellent workload balance for skewed data
Cons
– Relatively high I/O cost
– High communication cost

Hybrid Partitioning
Pros
– Low communication cost
– Good workload balance for skewed data
Cons
– High I/O cost

Auto-Grained Partitioning
2 Steps
– Estimate the grid density (after filtering) by uniform sampling, and hence estimate the computation cost (based on the computation complexity)
  – For each grid, total processing cost = constant loading cost + variable computation cost
– Partition the cost array: Balanced Contiguous Multi-Way Partitioning
  – Dynamic programming (a small number of grids)
  – Greedy (a large number of grids)
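The second step can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the slides do not give the exact objective, so the sketch assumes the goal is to minimize the cost of the most expensive contiguous partition, and the greedy rule (cut once the running cost reaches the average) is one plausible heuristic. All names (`grid_costs`, `partition_dp`, `partition_greedy`) and the sample densities are hypothetical, and both functions assume the number of partitions does not exceed the number of grids.

```python
from itertools import accumulate


def grid_costs(densities, load_cost, compute_per_element):
    """Per-grid cost = constant loading cost + density-dependent computation cost."""
    return [load_cost + d * compute_per_element for d in densities]


def partition_dp(costs, k):
    """Exact split into k contiguous parts minimizing the largest part cost."""
    n = len(costs)
    prefix = [0] + list(accumulate(costs))
    inf = float("inf")
    # best[p][i]: smallest achievable bottleneck for the first i grids in p parts.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for p in range(1, k + 1):
        for i in range(1, n + 1):
            for j in range(p - 1, i):
                cand = max(best[p - 1][j], prefix[i] - prefix[j])
                if cand < best[p][i]:
                    best[p][i], cut[p][i] = cand, j
    # Walk the recorded cut points backwards to recover the partitions.
    bounds, i = [], n
    for p in range(k, 0, -1):
        j = cut[p][i]
        bounds.append((j, i))
        i = j
    return best[k][n], bounds[::-1]


def partition_greedy(costs, k):
    """Approximate split: close a part once it reaches the average part cost."""
    target = sum(costs) / k
    bounds, start, running = [], 0, 0.0
    for i, c in enumerate(costs):
        running += c
        parts_left = k - len(bounds) - 1
        items_left = len(costs) - i - 1
        # Also cut when the remaining grids are just enough to fill the remaining parts.
        if parts_left > 0 and (running >= target or items_left == parts_left):
            bounds.append((start, i + 1))
            start, running = i + 1, 0.0
    bounds.append((start, len(costs)))
    return bounds


costs = grid_costs([4, 0, 0, 0, 7, 1, 1], load_cost=1.0, compute_per_element=1.0)
bottleneck, dp_bounds = partition_dp(costs, 3)
greedy_bounds = partition_greedy(costs, 3)
print(bottleneck, dp_bounds)  # 8.0 [(0, 4), (4, 5), (5, 7)]
print(greedy_bounds)          # [(0, 3), (3, 5), (5, 7)]
```

On this skewed input the dynamic program finds the optimal bottleneck of 8.0, while the greedy pass settles for 9.0 in a single linear scan, which mirrors the slide's suggestion to use dynamic programming for a small number of grids and the greedy method for a large number.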

Auto-Grained Partitioning (Cont’d)
Pros
– Low I/O cost
– Low communication cost
– Great workload balance for skewed data
Cons
– Overhead of sampling and runtime partitioning

Partitioning Strategy Summary

Strategy        I/O Performance  Workload Balance  Scalability  Additional Cost
Coarse-Grained  Excellent        Poor              Excellent    None
Fine-Grained    Poor             Excellent         Poor         None
Hybrid          Poor             Good              Good         None
Auto-Grained    Great            Great             Great        Nontrivial

Our partitioning strategy decider can help choose the best strategy

Partitioning Strategy Decider
Cost Model: analyze load cost and computation cost separately
– Load cost
  – Loading factor × data amount
– Computation cost
Exception: Auto-Grained takes load cost and computation cost as a whole

Overlapping Aggregation
I/O Cost
– Reuse the data already in the memory
– Reduce the disk I/O to enhance the I/O performance
Memory Accesses
– Reuse the data already in the cache
– Reduce cache misses to accelerate the computation
Aggregation Approaches
– Naïve approach
– Data-reuse approach
– All-reuse approach

Example: Hierarchical Aggregation
Aggregate 3 grids in a 6 × 6 array
– The innermost 2 × 2 grid
– The middle 4 × 4 grid
– The outermost 6 × 6 grid
(Parallel) sliding aggregation is much more complicated

Naïve Approach
1. Load the innermost grid
2. Aggregate the innermost grid
3. Load the middle grid
4. Aggregate the middle grid
5. Load the outermost grid
6. Aggregate the outermost grid
For N grids: N loads + N aggregations

Data-Reuse Approach
1. Load the outermost grid
2. Aggregate the outermost grid
3. Aggregate the middle grid
4. Aggregate the innermost grid
For N grids: 1 load + N aggregations
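The difference in load counts between the naïve and data-reuse approaches can be made concrete with a small sketch of the 6 × 6 hierarchical example. `CountingStore` is a hypothetical stand-in for on-disk array storage that only counts load requests; the grids are assumed to be centered in the array, and `sum` is an arbitrary choice of aggregate.

```python
class CountingStore:
    """Hypothetical stand-in for array storage that counts load requests."""

    def __init__(self, data):
        self._data = data
        self.loads = 0

    def load(self, lo, hi):
        """Return the square sub-array covering rows and columns [lo, hi)."""
        self.loads += 1
        return [row[lo:hi] for row in self._data[lo:hi]]


def centered_bounds(size, grid):
    """Row/column range of a grid x grid square centered in a size x size array."""
    lo = (size - grid) // 2
    return lo, lo + grid


def naive(store, size, grid_sizes):
    """One load and one aggregation per grid."""
    results = {}
    for g in grid_sizes:
        lo, hi = centered_bounds(size, g)
        block = store.load(lo, hi)
        results[g] = sum(map(sum, block))
    return results


def data_reuse(store, size, grid_sizes):
    """Load the outermost grid once, then aggregate each grid from memory."""
    lo0, hi0 = centered_bounds(size, max(grid_sizes))
    block = store.load(lo0, hi0)
    results = {}
    for g in grid_sizes:
        lo, hi = centered_bounds(size, g)
        results[g] = sum(
            block[r][c]
            for r in range(lo - lo0, hi - lo0)
            for c in range(lo - lo0, hi - lo0)
        )
    return results


data = [[r * 6 + c for c in range(6)] for r in range(6)]  # values 0..35
naive_store, reuse_store = CountingStore(data), CountingStore(data)
naive_result = naive(naive_store, 6, [2, 4, 6])
reuse_result = data_reuse(reuse_store, 6, [2, 4, 6])
print(naive_result, naive_store.loads)  # {2: 70, 4: 280, 6: 630} 3
print(reuse_result, reuse_store.loads)  # {2: 70, 4: 280, 6: 630} 1
```

Both variants still perform one aggregation pass per grid; only the number of loads changes.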

All-Reuse Approach
1. Load the outermost grid
2. Once an element is accessed, accumulatively update the aggregation results it contributes to
For N grids: 1 load + 1 aggregation
Figure annotations:
– Only update the outermost aggregation result
– Update both the outermost and the middle aggregation results
– Update all the 3 aggregation results
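A minimal sketch of the all-reuse idea for the 6 × 6 hierarchical example: the loaded block is scanned once, and each element updates every aggregation result it contributes to. The function name `all_reuse`, the centered-grid layout, and the use of `sum` as the aggregate are assumptions for illustration only.

```python
def all_reuse(block, grid_sizes):
    """Single pass over the loaded block, accumulating all grid results at once."""
    size = len(block)
    # Row/column range [lo, hi) of each centered grid.
    bounds = {g: ((size - g) // 2, (size - g) // 2 + g) for g in grid_sizes}
    results = {g: 0 for g in grid_sizes}
    updates = 0
    for r, row in enumerate(block):
        for c, value in enumerate(row):
            # Each element is read once and folded into every grid containing it.
            for g, (lo, hi) in bounds.items():
                if lo <= r < hi and lo <= c < hi:
                    results[g] += value
                    updates += 1
    return results, updates


block = [[r * 6 + c for c in range(6)] for r in range(6)]  # values 0..35
hier_results, hier_updates = all_reuse(block, [2, 4, 6])
print(hier_results)  # {2: 70, 4: 280, 6: 630}
print(hier_updates)  # 56
```

The 56 updates break down as 20 outer-ring elements with one update each, 12 middle-ring elements with two, and 4 innermost elements with three, while each of the 36 elements is read only once.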

All-Reuse Approach (Cont’d)
Key Insight
– # of aggregation results ≤ # of queried elements
– More computationally efficient to iterate over elements and update the associated aggregation results
More Benefits
– Load balance (for hierarchical/circular aggregations)
– More speedup for compound array elements
  – The data type of an aggregation result is usually primitive, but this is not always true for an array element

Parallel Performance vs. SciDB
– No preprocessing cost is included for SciDB
– Array slab/data size (8 GB) ratio: from 12.5% to 100%
– Coarse-grained partitioning for the grid aggregation
– All-reuse approach for the sliding aggregation
– SciDB stores ‘chunked’ arrays: it can even support overlapping chunking to accelerate the sliding aggregation

Parallel Sliding Aggregation Performance
– # of nodes: from 1 to 16
– 8 GB data
– Sliding grid size: from 3 × 3 to 7 × 7

Conclusion
– Support efficient structural aggregations over native array storage
– Different partitioning strategies and a cost model for grid aggregations
– All-reuse approach for overlapping aggregations
