SAGA: Array Storage as a DB with Support for Structural Aggregations. SSDBM 2014, June 30th, Aalborg, Denmark. Yi Wang, Arnab Nandi, Gagan Agrawal


1 SAGA: Array Storage as a DB with Support for Structural Aggregations
SSDBM 2014, June 30th, Aalborg, Denmark
Yi Wang, Arnab Nandi, Gagan Agrawal
The Ohio State University

2 Outline
– Introduction
– Grid Aggregations
– Overlapping Aggregations
– Experimental Results
– Conclusion

3 Big Data Is Often Big Arrays
Array data is everywhere
– Molecular Simulation: Molecular Data
– Life Science: DNA Sequencing Data (Microarray)
– Earth Science: Ocean and Climate Data
– Space Science: Astronomy Data

4 How to Process Big Arrays?
Use relational databases?
– Poor Expressibility
  Loses the natural positional/structural information
  Most complex operations are naturally defined in terms of arrays, e.g., correlations, convolution, curve fitting, …
– Poor Performance
  Cumbersome data transformations
  Too heavyweight, e.g., transactions
One size does not fit all!
[Figure: data flow with stages labeled Input Table, Input Array, Output Array, Output Table and steps labeled Mapping, Manipulation, Rendering]

5 Array Databases
Examples: SciDB, RasDaMan, and MonetDB
Treat Arrays as First-Class Citizens
– Everything is defined in the array dialect
Lightweight or No ACID Maintenance
– No write conflict: ACID is inherently guaranteed
Other Desired Functionality
– Structural aggregations, array join, provenance, …

6 The Upfront Cost of Using SciDB
High-Level Data Flow
– Requires data ingestion
Data Ingestion Steps
– Raw files (e.g., HDF5) -> CSV
– Load CSV files into SciDB
Source: “EarthDB: scalable analysis of MODIS data using SciDB”, G. Planthaber et al.

7 Array Storage as a DB
A Paradigm Similar to NoDB
– Still maintains DB functionality
– But no data ingestion
DB and Array Storage as a DB: Friends or Foes?
– When to use a DB? Load once, and query frequently
– When to directly use array storage? Query infrequently, so avoid loading
Our System
– Focuses on a set of special array operations: Structural Aggregations

8 Structural Aggregation
Aggregate the elements based on positional relationships
– E.g., moving average: calculates the average of each 2 × 2 square from left to right, aggregating the elements in the same square at a time
Input Array:
  1 2 3 4
  5 6 7 8
Aggregation Result:
  3.5 4.5 5.5
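
The moving-average example on this slide can be reproduced with a few lines of plain Python. This is only a minimal sketch: the function name `sliding_mean` and its parameters are illustrative and do not come from SAGA.

```python
def sliding_mean(array, window_rows, window_cols):
    """Average every window_rows x window_cols window, moving one cell at a time."""
    rows, cols = len(array), len(array[0])
    result = []
    for i in range(rows - window_rows + 1):
        row = []
        for j in range(cols - window_cols + 1):
            # Collect the elements that fall inside the current window.
            window = [array[i + di][j + dj]
                      for di in range(window_rows)
                      for dj in range(window_cols)]
            row.append(sum(window) / len(window))
        result.append(row)
    return result

input_array = [[1, 2, 3, 4],
               [5, 6, 7, 8]]
result = sliding_mean(input_array, 2, 2)
print(result)  # [[3.5, 4.5, 5.5]]
```

Neighbouring windows share a column of elements, which is what makes this an overlapping aggregation.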

9 Structural Aggregation Types
– Non-Overlapping Aggregation
– Overlapping Aggregation
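
For contrast with a sliding window, a non-overlapping (grid) aggregation touches every element exactly once. The sketch below assumes the array shape is an exact multiple of the grid shape; `grid_sum` is an illustrative name, not a SAGA function.

```python
def grid_sum(array, grid_rows, grid_cols):
    """Sum each disjoint grid_rows x grid_cols tile of the array."""
    rows, cols = len(array), len(array[0])
    result = []
    # Step by the grid size so that tiles never share an element.
    for i in range(0, rows, grid_rows):
        row = []
        for j in range(0, cols, grid_cols):
            row.append(sum(array[i + di][j + dj]
                           for di in range(grid_rows)
                           for dj in range(grid_cols)))
        result.append(row)
    return result

input_array = [[1, 2, 3, 4],
               [5, 6, 7, 8]]
grid_result = grid_sum(input_array, 2, 2)
print(grid_result)  # [[14, 22]]
```

Because the tiles are disjoint, each one can be handed to a different worker without sharing data.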

10 Grid Aggregation
Parallelization: Easy after Partitioning
Considerations
– Data contiguity, which affects the I/O performance
– Communication cost
– Load balancing for skewed data
Partitioning Strategies
– Coarse-grained, fine-grained, hybrid, and auto-grained
– Why not use dynamic repartitioning?
  Runtime overhead
  Poor data contiguity
  Redundant data loads

11 Coarse-Grained Partitioning
Pros
– Low I/O cost
– Low communication cost
Cons
– Workload imbalance for skewed data

12 Fine-Grained Partitioning
Pros
– Excellent workload balance for skewed data
Cons
– Relatively high I/O cost
– High communication cost
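
The slides do not spell out how grids are assigned to workers, so the following is only a sketch under two assumptions: coarse-grained partitioning gives each worker one contiguous run of grids, and fine-grained partitioning deals grids out round-robin. The function names and the per-grid costs are hypothetical.

```python
def coarse_grained(num_grids, num_workers):
    """Assign each worker one contiguous run of grids (run sizes differ by at most 1)."""
    base, extra = divmod(num_grids, num_workers)
    assignment, start = [], 0
    for w in range(num_workers):
        size = base + (1 if w < extra else 0)
        assignment.append(list(range(start, start + size)))
        start += size
    return assignment

def fine_grained(num_grids, num_workers):
    """Deal grids out round-robin, one grid at a time (assumed scheme)."""
    return [list(range(w, num_grids, num_workers)) for w in range(num_workers)]

def worker_loads(assignment, grid_costs):
    """Total cost of the grids owned by each worker."""
    return [sum(grid_costs[g] for g in grids) for grids in assignment]

# Hypothetical skewed costs: the expensive grids cluster at the front.
grid_costs = [9, 8, 7, 6, 1, 1, 1, 1]
coarse_loads = worker_loads(coarse_grained(len(grid_costs), 2), grid_costs)
fine_loads = worker_loads(fine_grained(len(grid_costs), 2), grid_costs)
print(coarse_loads)  # [30, 4]
print(fine_loads)    # [18, 16]
```

Under these assumptions the contiguous assignment keeps each worker's reads sequential but leaves one worker with most of the work, while the round-robin assignment evens out the load at the price of scattered reads.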

13 Hybrid Partitioning
Pros
– Low communication cost
– Good workload balance for skewed data
Cons
– High I/O cost

14 Auto-Grained Partitioning
2 Steps
– Estimate the grid density (after filtering) by uniform sampling, and hence estimate the computation cost (based on the computation complexity)
  For each grid, total processing cost = constant loading cost + variable computation cost
– Partition the cost array: Balanced Contiguous Multi-Way Partitioning
  Dynamic programming (a small number of grids)
  Greedy (a large number of grids)
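
The second step can be illustrated with a textbook dynamic program that splits a cost array into k contiguous parts while minimizing the most expensive part. This is a sketch of the general technique, not SAGA's actual code; the function name, the loading cost of 1, and the sampled computation costs are all hypothetical, and the greedy variant is not shown.

```python
from itertools import accumulate

def partition_dp(costs, k):
    """Split costs into k contiguous parts, minimizing the largest part sum.

    Returns (bottleneck, boundaries), where boundaries holds the exclusive
    end index of each part. Assumes 1 <= k <= len(costs).
    """
    n = len(costs)
    prefix = [0] + list(accumulate(costs))
    inf = float("inf")
    # best[p][i]: smallest bottleneck when the first i costs form p parts.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0
    for p in range(1, k + 1):
        for i in range(p, n + 1):
            for j in range(p - 1, i):
                # The last part covers costs[j:i].
                candidate = max(best[p - 1][j], prefix[i] - prefix[j])
                if candidate < best[p][i]:
                    best[p][i] = candidate
                    cut[p][i] = j
    # Walk the cut table backwards to recover the part boundaries.
    boundaries, i = [], n
    for p in range(k, 0, -1):
        boundaries.append(i)
        i = cut[p][i]
    return best[k][n], boundaries[::-1]

LOADING_COST = 1                              # hypothetical constant per grid
computation = [8, 7, 6, 5, 0, 0, 0, 0]        # hypothetical sampled estimates
costs = [LOADING_COST + c for c in computation]
bottleneck, boundaries = partition_dp(costs, 3)
print(bottleneck, boundaries)  # 15 [1, 3, 8]
```

The triple loop costs O(k · n²), which is consistent with reserving dynamic programming for a small number of grids.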

15 Auto-Grained Partitioning (Cont’d)
Pros
– Low I/O cost
– Low communication cost
– Great workload balance for skewed data
Cons
– Overhead of sampling and runtime partitioning

16 Partitioning Strategy Summary
Strategy        I/O Performance  Workload Balance  Scalability  Additional Cost
Coarse-Grained  Excellent        Poor              Excellent    None
Fine-Grained    Poor             Excellent         Poor         None
Hybrid          Poor             Good              Good         None
Auto-Grained    Great            Great             Great        Nontrivial
Our partitioning strategy decider can help choose the best strategy

17 Partitioning Strategy Decider
Cost Model: analyze load cost and computation cost separately
– Load cost: loading factor × data amount
– Computation cost
Exception: Auto-Grained takes load cost and computation cost as a whole

18 Overlapping Aggregation
I/O Cost
– Reuse the data already in memory
– Reduce the disk I/O to enhance the I/O performance
Memory Accesses
– Reuse the data already in the cache
– Reduce cache misses to accelerate the computation
Aggregation Approaches
– Naïve approach
– Data-reuse approach
– All-reuse approach

19 Example: Hierarchical Aggregation
Aggregate 3 grids in a 6 × 6 array
– The innermost 2 × 2 grid
– The middle 4 × 4 grid
– The outermost 6 × 6 grid
(Parallel) sliding aggregation is much more complicated

20 Naïve Approach
1. Load the innermost grid
2. Aggregate the innermost grid
3. Load the middle grid
4. Aggregate the middle grid
5. Load the outermost grid
6. Aggregate the outermost grid
For N grids: N loads + N aggregations

21 Data-Reuse Approach
1. Load the outermost grid
2. Aggregate the outermost grid
3. Aggregate the middle grid
4. Aggregate the innermost grid
For N grids: 1 load + N aggregations
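
The difference in load counts between the naïve and data-reuse approaches can be made concrete with a small simulation. This is a sketch only: the `Storage` class stands in for real array storage, and every name and value here is hypothetical.

```python
def make_array(n):
    """Hypothetical n x n array holding the values 0 .. n*n - 1."""
    return [[r * n + c for c in range(n)] for r in range(n)]

class Storage:
    """Stand-in for array storage that counts the load requests it serves."""
    def __init__(self, data):
        self.data = data
        self.loads = 0

    def load(self, lo, hi):
        """Return the square sub-array covering rows and columns lo .. hi-1."""
        self.loads += 1
        return [row[lo:hi] for row in self.data[lo:hi]]

def grid_bounds(n, num_grids):
    """Concentric grids, innermost first: (2, 4), (1, 5), (0, 6) for n = 6."""
    center = n // 2
    return [(center - k, center + k) for k in range(1, num_grids + 1)]

def naive(storage, bounds):
    # One load and one aggregation per grid.
    return [sum(map(sum, storage.load(lo, hi))) for lo, hi in bounds]

def data_reuse(storage, bounds):
    # Load only the outermost grid, then aggregate every grid from memory.
    outer_lo, outer_hi = bounds[-1]
    block = storage.load(outer_lo, outer_hi)
    results = []
    for lo, hi in bounds:
        rows = block[lo - outer_lo:hi - outer_lo]
        results.append(sum(sum(row[lo - outer_lo:hi - outer_lo]) for row in rows))
    return results

array = make_array(6)
bounds = grid_bounds(6, 3)
naive_store, reuse_store = Storage(array), Storage(array)
naive_results = naive(naive_store, bounds)
reuse_results = data_reuse(reuse_store, bounds)
print(naive_results, naive_store.loads)  # [70, 280, 630] 3
print(reuse_results, reuse_store.loads)  # [70, 280, 630] 1
```

Both versions still run one aggregation pass per grid; only the number of loads changes.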

22 All-Reuse Approach
1. Load the outermost grid
2. Once an element is accessed, accumulatively update the aggregation results it contributes to
For N grids: 1 load + 1 aggregation
Figure annotations:
– Only update the outermost aggregation result
– Update both the outermost and the middle aggregation results
– Update all the 3 aggregation results
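
A single pass over the loaded elements is enough to produce all of the hierarchical sums. The sketch below assumes a square array with an even side length and concentric grids centered on it; `all_reuse` and the test values are illustrative, not taken from SAGA.

```python
def all_reuse(array, num_grids):
    """Aggregate concentric grids in one pass over an n x n array (n even)."""
    n = len(array)
    center = n // 2
    results = [0] * num_grids  # results[0] belongs to the innermost grid
    for r in range(n):
        for c in range(n):
            # Ring 0 is the innermost 2 x 2 grid; each step outward adds 1.
            ring = max(center - 1 - r, r - center, center - 1 - c, c - center)
            # An element in ring k lies inside grids k, k + 1, ..., outermost.
            for g in range(ring, num_grids):
                results[g] += array[r][c]
    return results

# Hypothetical 6 x 6 array holding the values 0 .. 35.
array = [[r * 6 + c for c in range(6)] for r in range(6)]
sums = all_reuse(array, 3)
print(sums)  # [70, 280, 630]
```

Corner elements update only the outermost result, while the four central elements update all three, matching the annotations on the slide.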

23 All-Reuse Approach (Cont’d)
Key Insight
– # of aggregation results ≤ # of queried elements
– More computationally efficient to iterate over elements and update the associated aggregation results
More Benefits
– Load balance (for hierarchical/circular aggregations)
– More speedup for compound array elements
  The data type of an aggregation result is usually primitive, but this is not always true for an array element

24 Parallel Performance vs. SciDB
– No preprocessing cost is included for SciDB
– Array slab/data size (8 GB) ratio: from 12.5% to 100%
– Coarse-grained partitioning for the grid aggregation
– All-reuse approach for the sliding aggregation
– SciDB stores “chunked” arrays: it can even support overlapping chunking to accelerate the sliding aggregation

25 Parallel Sliding Aggregation Performance
– # of nodes: from 1 to 16
– 8 GB data
– Sliding grid size: from 3 × 3 to 7 × 7

26 Conclusion
– Support efficient structural aggregations over native array storage
– Different partitioning strategies and a cost model for grid aggregations
– All-reuse approach for overlapping aggregations

