1 Data Management and Data Processing Support on Array-Based Scientific Data
Yi Wang
Advisor: Gagan Agrawal
Candidacy Examination

2 Big Data Is Often Big Arrays
Array data is everywhere, and it is especially prevalent in the scientific domain:
Molecular Simulation: molecular data
Life Science: DNA sequencing data (microarray)
Earth Science: ocean and climate data
Space Science: astronomy data

3 Inherent Limitations of Current Tools and Paradigms
Most scientific data management and data processing tools are too heavy-weight:
Hard to cope with different data formats and physical structures (variety)
Data transformation and data transfer are often prohibitively expensive (volume)
Prominent examples:
RDBMSs: not suited for array data
Array DBMSs: costly data ingestion
MapReduce: requires a specialized file system

4 Mismatch Between Scientific Data and DBMS
Scientific (array) datasets:
Very large, but processed infrequently
Read/append only
No resources for reloading data
Popular formats: NetCDF and HDF5
Database technologies:
Designed for read-write data, with ACID guarantees
Assume data reloading/reformatting is feasible

5 Example Array Data Format - HDF5
HDF5 (Hierarchical Data Format 5). To specify a data subset, a query can use 1) a dimensional range and/or 2) a value range. Each dimension is usually associated with a series of coordinate values stored in a separate 1D dataset, called a dimension scale. A minimal sketch is shown below.
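As a rough illustration of dimension scales, the following sketch (assuming h5py ≥ 2.10 and a made-up "salinity" dataset with "time" and "depth" coordinates) attaches 1D coordinate datasets as scales and translates a coordinate range into an index range:

```python
# A minimal sketch, assuming h5py >= 2.10; the file and dataset names are hypothetical.
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    data = f.create_dataset("salinity", data=np.random.rand(4, 3))
    time = f.create_dataset("time", data=np.array([0.0, 6.0, 12.0, 18.0]))
    depth = f.create_dataset("depth", data=np.array([5.0, 50.0, 500.0]))
    time.make_scale("time")           # mark each 1D coordinate dataset as a dimension scale
    depth.make_scale("depth")
    data.dims[0].attach_scale(time)   # associate the scales with the 2D dataset
    data.dims[1].attach_scale(depth)

with h5py.File("example.h5", "r") as f:
    d = f["salinity"]
    t = d.dims[0][0][:]               # read the coordinate values attached to dimension 0
    # a value range on the "time" scale translates into an index range on the array
    idx = np.where((t >= 6.0) & (t <= 12.0))[0]
    subset = d[idx.min():idx.max() + 1, :]
```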

6 The Upfront Cost of Using SciDB
High-level data flow: requires data ingestion.
Data ingestion steps:
Convert raw files (e.g., HDF5) to CSV
Load the CSV files into SciDB
The data ingestion experience is very painful: the ingestion cost is roughly 100x that of a simple query.
"EarthDB: scalable analysis of MODIS data using SciDB" - G. Planthaber et al.

7 Thesis Statement
Native data can be queried and/or processed efficiently using popular abstractions:
Process data stored in the native format, e.g., NetCDF and HDF5
Support SQL-like operators, e.g., selection and aggregation
Support array operations, e.g., structural aggregations
Support a MapReduce-like processing API

8 Outline
Data Management Support:
Supporting a Light-Weight Data Management Layer Over HDF5
SAGA: Array Storage as a DB with Support for Structural Aggregations
Approximate Aggregations Using Novel Bitmap Indices
Data Processing Support:
SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
Future Work

9 Overall Idea
An SQL implementation over HDF5:
Ease of use: a declarative language instead of a low-level programming language plus the HDF5 API
Abstraction: provides a virtual relational view
High efficiency:
Load data on demand (lazy loading)
Parallel query processing
Server-side aggregation

10 Functionality
Query based on dimension index values (Type 1): an index-based condition; also supported by the native HDF5 API
Query based on dimension scales (Type 2): a coordinate-based condition, expressed in the coordinate system instead of the physical layout (array subscripts)
Query based on data values (Type 3): a content-based condition, over both simple and compound datatypes
Aggregate queries: SUM, COUNT, AVG, MIN, and MAX, with server-side aggregation to minimize data transfer
A sketch of how the three query types map onto HDF5/numpy selections follows.
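The following hypothetical sketch shows how the three query types could map onto selections over the small file created in the previous sketch; the dataset names, value ranges, and the full-read fallback for the Type 3 filter are illustrative only, not the system's actual execution plan:

```python
# Illustrative only: reuses the hypothetical "example.h5" file from the earlier sketch.
import h5py
import numpy as np

with h5py.File("example.h5", "r") as f:
    d = f["salinity"]

    # Type 1: index-based condition on the dimension subscripts (a plain hyperslab)
    type1 = d[0:2, :]

    # Type 2: coordinate-based condition on the "time" dimension scale
    t = f["time"][:]
    rows = np.where((t >= 6.0) & (t <= 12.0))[0]
    type2 = d[rows.min():rows.max() + 1, :]

    # Type 3: content-based condition on the data values themselves
    block = d[:]                      # a full read here; the system pushes the filter down
    type3 = block[block > 0.5]

    # Aggregate query: SUM over the selected subset (server-side in the real system)
    total = float(np.sum(type3))
```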

11 Execution Overview
1D: AND-logic condition list
2D: OR-logic condition list
1D: OR-logic condition list (same content-based condition)
More optimizations are possible with the metadata information.

12 Experimental Setup
Experimental datasets:
4 GB (sequential experiments) and 16 GB (parallel experiments)
4 dimensions: time, cols, rows, and layers
Compared with baseline performance and OPeNDAP:
Baseline performance: no query parsing
OPeNDAP: translates HDF5 into a specialized data format

13 Sequential Comparison with OPeNDAP (Type 2 and Type 3 Queries)
With OPeNDAP, the user has to download the entire dataset from the server first and then write their own filter; we implemented a client-side filter for OPeNDAP. Its performance scales poorly due to the additional data translation overhead. Comparing the baseline performance with our sequential performance, the total sequential processing time for a Type 1 query is indistinguishable from the baseline, i.e., the intrinsic HDF5 query function.

14 Parallel Query Processing for Type 2 and Type 3 Queries
We scaled the system up to 16 nodes, and the selectivity was varied from <20% to >80%. The results show good scalability.

15 Outline
Data Management Support:
Supporting a Light-Weight Data Management Layer Over HDF5
SAGA: Array Storage as a DB with Support for Structural Aggregations
Approximate Aggregations Using Novel Bitmap Indices
Data Processing Support:
SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
Future Work

16 Array Storage as a DB
A paradigm similar to NoDB: still maintains DB functionality, but with no data ingestion
DB and array storage as a DB: friends or foes?
When to use a DB: load once, query frequently
When to directly use array storage: query infrequently, so avoid loading
Our system focuses on a set of special array operations - structural aggregations
"Absolute power corrupts absolutely."

17 Structural Aggregation Types
Non-overlapping aggregation:
Grid aggregation: a multi-dimensional histogram
Overlapping aggregation:
Sliding aggregation: apply a kernel function to a sliding window - moving average, denoising, time series, etc.
Hierarchical aggregation: observe the gradual influence of radiation from a source (e.g., a pollution source or explosion location)
Circular aggregation: concentric but disjoint circles instead of regularly shaped grids

18 Grid Aggregation
Parallelization: easy after partitioning
Considerations:
Data contiguity, which affects the I/O performance
Communication cost
Load balancing for skewed data
Partitioning strategies:
Coarse-grained
Fine-grained
Hybrid
Auto-grained

19 Partitioning Strategy Decider
Cost model: analyze the loading cost and the computation cost separately
Loading cost: loading factor × data amount
Computation cost: estimated per grid
Exception - auto-grained: takes the loading cost and the computation cost as a whole
The communication cost is trivial, with the exception of fine-grained partitioning with small grid sizes.
A toy sketch of this kind of cost comparison follows.
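A toy sketch of this kind of cost comparison is shown below; the loading factor, the per-grid computation estimates, and the way costs are combined are made-up placeholders, not the decider's actual model:

```python
# A toy illustration of the two cost terms, NOT the decider's real model: the loading
# factor and the per-grid computation estimates are made-up placeholders.
LOADING_FACTOR = 1e-9          # hypothetical seconds per byte loaded

def loading_cost(data_bytes):
    return LOADING_FACTOR * data_bytes              # loading factor x data amount

def partition_cost(grid_comp_costs, data_bytes, auto_grained=False):
    if auto_grained:
        # auto-grained: treat loading cost + computation cost of each grid as a whole
        per_grid_load = loading_cost(data_bytes) / len(grid_comp_costs)
        return max(per_grid_load + c for c in grid_comp_costs)
    # otherwise: model loading cost and the slowest partition's computation separately
    return loading_cost(data_bytes) + max(grid_comp_costs)

print(partition_cost([4.0, 1.0, 0.5, 2.5], 8 * 2**30))
print(partition_cost([4.0, 1.0, 0.5, 2.5], 8 * 2**30, auto_grained=True))
```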

20 Overlapping Aggregation
I/O cost:
Reuse the data already in memory
Reduce disk I/O to enhance the I/O performance
Memory accesses:
Reuse the data already in the cache
Reduce cache misses to accelerate the computation
Aggregation approaches:
Naïve approach
Data-reuse approach
All-reuse approach

21 Example: Hierarchical Aggregation
Aggregate 3 grids in a 6 × 6 array:
The innermost 2 × 2 grid
The middle 4 × 4 grid
The outermost 6 × 6 grid
(Parallel) sliding aggregation is much more complicated.

22 Naïve Approach
Load the innermost grid, then aggregate the innermost grid
Load the middle grid, then aggregate the middle grid
Load the outermost grid, then aggregate the outermost grid
For N grids: N loads + N aggregations

23 Data-Reuse Approach
Load the outermost grid
Aggregate the outermost grid
Aggregate the middle grid
Aggregate the innermost grid
For N grids: 1 load + N aggregations
Aggregation execution is not as straightforward as in the naïve approach.

24 All-Reuse Approach
Load the outermost grid
Once an element is accessed, accumulatively update every aggregation result it contributes to:
Outermost ring elements: update only the outermost aggregation result
Middle ring elements: update both the outermost and the middle aggregation results
Innermost elements: update all 3 aggregation results
For N grids: 1 load + 1 aggregation, with purely sequential I/O
A minimal sketch of this idea follows.
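A minimal numpy sketch of the all-reuse idea for the 6 × 6 hierarchical example (with hypothetical data values) is given below; each element is visited once and every aggregate it belongs to is updated:

```python
# One load of the array and one aggregation pass; the data values are hypothetical.
import numpy as np

array = np.arange(36, dtype=float).reshape(6, 6)
sums = np.zeros(3)       # aggregates for the innermost 2x2, middle 4x4, outermost 6x6
counts = np.zeros(3)

for i in range(6):
    for j in range(6):
        # distance from the border decides how many nested grids contain (i, j):
        # 0 = outermost ring only, 1 = outermost + middle, 2 = all three grids
        ring = min(i, j, 5 - i, 5 - j)
        for level in range(ring + 1):
            sums[2 - level] += array[i, j]
            counts[2 - level] += 1

averages = sums / counts   # [innermost, middle, outermost] averages
```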

25 Sequential Performance Comparison
Array slab / data size (8 GB) ratio: from 12.5% to 100%
Coarse-grained partitioning for the grid aggregation
All-reuse approach for the sliding aggregation
SciDB stores 'chunked' arrays, and can even support overlapping chunking to accelerate the sliding aggregation

26 Parallel Sliding Aggregation Performance
# of nodes: from 1 to 16
8 GB data
Sliding grid size: from 3 × 3 to 6 × 6

27 Outline
Data Management Support:
Supporting a Light-Weight Data Management Layer Over HDF5
SAGA: Array Storage as a DB with Support for Structural Aggregations
Approximate Aggregations Using Novel Bitmap Indices
Data Processing Support:
SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
Future Work

28 Approximate Aggregations Over Array Data
Challenges:
Flexible aggregation over any subset: dimension-based, value-based, or combined predicates
Aggregation accuracy: must capture both the spatial distribution and the value distribution
Aggregation without data reorganization: reorganization is prohibitively expensive
Existing techniques - all problematic for array data:
Sampling: unable to capture both distributions; KD-tree-based stratified sampling requires reorganization
Histograms: 1D histograms lose the spatial distribution; for multi-dimensional histograms, either the space cost or the partitioning granularity increases exponentially, leading to either substantial estimation overheads or high inaccuracy
Wavelets: lose the value distribution; if the value-based attribute is added as an extra dimension to the data cube, sorting or reorganization is required
New data synopses: bitmap indices

29 Bitmap Indexing and Pre-Aggregation
Bitmap indices
Pre-aggregation statistics
Any multi-dimensional array can be mapped to a 1D array
A simplified sketch of binned bitmap indexing with per-bin pre-aggregation follows.
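A simplified sketch of what binned bitmap indexing with per-bin pre-aggregation could look like over a flattened array is shown below; the values and bin edges are made up:

```python
# Binned bitmap indexing over a flattened (1D-mapped) array; values and bins are hypothetical.
import numpy as np

values = np.array([7.0, 2.0, 9.0, 4.0, 1.0, 6.0, 3.0, 8.0])   # 1D view of the array
edges = [0.0, 3.0, 6.0, 10.0]                                  # three value bins

bitvectors = []            # one bitvector (here a boolean mask) per bin
pre_sum, pre_count = [], []
for lo, hi in zip(edges[:-1], edges[1:]):
    bv = (values >= lo) & (values < hi)
    bitvectors.append(bv)
    pre_sum.append(float(values[bv].sum()))    # pre-aggregation statistics kept per bin
    pre_count.append(int(bv.sum()))
```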

30 Approximate Aggregation Workflow

31 Running Example
SELECT SUM(Array) WHERE Value > 3 AND ID < 4;
Predicate bitvectors: i1': … i2': …
Count1: 1, Count2: 2
Estimated sum: 7 × 1/… × 2/3 = …
Precise sum: 14
A sketch of the estimation step follows.
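The slide's exact bitvectors and intermediate values are not reproduced above, but the estimation mechanism can be sketched as follows, continuing the arrays from the previous sketch; the uniform-spread assumption inside each bin is the source of the approximation error:

```python
# Estimation sketch: scale each overlapping bin's pre-aggregated sum by the fraction of
# its members that also satisfy the dimension-based predicate. Names reuse the prior sketch.
import numpy as np

ids = np.arange(len(values))                 # element positions in the 1D view
dim_pred = ids < 4                           # dimension-based predicate, e.g. "ID < 4"
value_lo = 3.0                               # value-based predicate, e.g. "Value > 3"

estimate = 0.0
for (lo, hi), bv, s in zip(zip(edges[:-1], edges[1:]), bitvectors, pre_sum):
    if hi <= value_lo:                       # bin entirely outside the value predicate
        continue
    in_bin = int(bv.sum())
    hits = int((bv & dim_pred).sum())        # members of this bin inside the subarea
    if in_bin:
        # assume the bin's sum is spread uniformly over its members; edge bins that only
        # partially satisfy "Value > 3" contribute the estimation error
        estimate += s * hits / in_bin
```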

32 A Novel Binning Strategy
Conventional binning strategies:
Equi-width / equi-depth: not designed for aggregation
V-optimized binning strategy (inspired by the V-optimal histogram):
Goal: approximately minimize the sum of squared errors (SSE)
Unbiased V-optimized binning: assumes the data is queried uniformly at random
Weighted V-optimized binning: assumes the frequently queried subarea is known as prior knowledge

33 Unbiased V-Optimized Binning
3 steps:
Initial binning: use equi-depth binning
Iterative refinement: adjust the bin boundaries
Bitvector generation: mark the spatial positions
Adding a bin/bin boundary improves the approximation quality (decreases SSE); removing a bin/bin boundary undermines the approximation quality (increases SSE).
A toy sketch of the refinement loop follows.
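A toy sketch of the refinement loop is given below: it starts from equi-depth boundaries over the sorted values and greedily moves one boundary at a time whenever that lowers the SSE. The data, bin count, and stopping rule are illustrative, not the algorithm's actual parameters:

```python
# Greedy boundary refinement toward lower SSE; data and parameters are made up.
import numpy as np

def sse(sorted_vals, bounds):
    """Sum of squared errors when each bin is represented by its mean."""
    total = 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        bin_vals = sorted_vals[lo:hi]
        if len(bin_vals):
            total += ((bin_vals - bin_vals.mean()) ** 2).sum()
    return total

vals = np.sort(np.random.rand(1000))
n_bins = 8
bounds = list(np.linspace(0, len(vals), n_bins + 1, dtype=int))   # equi-depth start

improved = True
while improved:
    improved = False
    for k in range(1, n_bins):                        # interior boundaries only
        for step in (-1, 1):
            cand = bounds[:]
            cand[k] += step
            if bounds[k - 1] < cand[k] < bounds[k + 1] and sse(vals, cand) < sse(vals, bounds):
                bounds, improved = cand, True         # keep the move that lowers SSE
```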

34 Weighted V-Optimized Binning
Difference: minimize the weighted SSE (WSSE) instead of the SSE
Similar binning algorithm
Major modification: the representative value for each bin is no longer the mean value

35 Experimental Setup
Data skew:
Dense range: less than 5% of the space but over 90% of the data
Sparse range: over 95% of the space but less than 10% of the data
5 types of queries:
DB: with dimension-based predicates
VBD: with value-based predicates over the dense range
VBS: with value-based predicates over the sparse range
CD: with combined predicates over the dense range
CS: with combined predicates over the sparse range
Ratio of querying probabilities: 10 : 1
50% of the synthetic data is frequently queried
25% of the real-world data is frequently queried

36 SUM Aggregation Accuracy of Different Binning Strategies on the Synthetic Dataset
Equi-width: most inaccurate in all the cases
Equi-depth: most accurate when only value-based predicates exist
Unbiased V-optimized: most accurate when only dimension-based predicates exist or over the sparse range
Weighted V-optimized: most accurate when the frequently queried 50% of the data is queried

37 SUM Aggregation Accuracy of Different Methods on the Real-World Dataset
Bitmap vs. sampling with two sampling rates, 2% and 20%:
A significantly higher sampling rate does not necessarily lead to significantly higher accuracy
The bitmap method is most accurate when only value-based predicates exist; the small remaining error is caused by the edge bin(s) that overlap with the queried value range, and conservative aggregation is slightly better than aggressive aggregation in this case
Bitmap vs. (equi-depth) multi-dimensional histogram:
400 bins/buckets to partition the value domain, with an equi-depth partitioning property (similar to equi-depth binning)
Even less accurate than the equi-depth bitmap: inaccurate in processing dimension-based predicates due to the uniform distribution assumption for every dimension

38 Outline
Data Management Support:
Supporting a Light-Weight Data Management Layer Over HDF5
SAGA: Array Storage as a DB with Support for Structural Aggregations
Approximate Aggregations Using Novel Bitmap Indices
Data Processing Support:
SciMATE: A Novel MapReduce-Like Framework for Multiple Scientific Data Formats
Future Work

39 Scientific Data Analysis Today
"Store-first-analyze-after":
Reload data into another file system, e.g., load data from PVFS to HDFS
Reload data into another data format, e.g., load NetCDF/HDF5 data into a specialized format
Problems:
Long data migration/transformation time
Stresses the network and disks

40 System Overview
Key feature: the scientific data processing module

41 Scientific Data Processing Module
The data adaption layer is customizable:
A third-party adapter can be inserted
Open for extension but closed for modification
A hypothetical adapter interface is sketched below.
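A hypothetical adapter interface illustrating this open/closed design is sketched below; the class and method names are assumptions, not SciMATE's actual API:

```python
# Hypothetical adapter interface: new formats plug in without modifying the processing core.
import numpy as np

class FormatAdapter:
    """Base adapter: the processing core only sees this interface."""
    def read_block(self, path, variable, start, count):
        raise NotImplementedError

class HDF5Adapter(FormatAdapter):
    def read_block(self, path, variable, start, count):
        import h5py
        with h5py.File(path, "r") as f:
            sel = tuple(slice(s, s + c) for s, c in zip(start, count))
            return f[variable][sel]

class NetCDFAdapter(FormatAdapter):
    def read_block(self, path, variable, start, count):
        from netCDF4 import Dataset            # assumes the netCDF4 package is available
        with Dataset(path) as f:
            sel = tuple(slice(s, s + c) for s, c in zip(start, count))
            return np.asarray(f.variables[variable][sel])

# registry consulted by the core; a third-party adapter is added by registering it here
ADAPTERS = {"hdf5": HDF5Adapter(), "netcdf": NetCDFAdapter()}
```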

42 Parallel Data Processing Times on 16 GB Datasets
Applications: KNN and K-Means
Thread scalability: all the data in the different formats are loaded into our system in the default array layout
Node scalability: the performance difference comes from …

43 Future Work Outline
Data Management Support:
SciSD: Novel Subgroup Discovery over Scientific Datasets Using Bitmap Indices
SciCSM: Novel Contrast Set Mining over Scientific Datasets Using Bitmap Indices
Data Processing Support:
StreamingMATE: A Novel MapReduce-Like Framework Over Scientific Data Streams
These directions begin to analyze multi-variate datasets and the underlying relationships among multiple variables. Intuition: frequent membership operations and aggregations are involved, and bitmaps, as a vertical layout, are efficient for both kinds of operations (e.g., frequent itemset mining). Stream processing is a hot topic.

44 SciSD
Subgroup discovery:
Goal: identify all the subsets that are significantly different from the entire dataset/general population with respect to a target variable
Can be widely used in scientific knowledge discovery
Novelty:
Subsets can involve dimensional and/or value ranges
All numeric attributes
High efficiency through frequent bitmap-based approximate aggregations

45 Running Example

46 SciCSM
"Sometimes it's good to contrast what you like with something else. It makes you appreciate it even more." - Darby Conley, Get Fuzzy, 2001
Contrast set mining:
Goal: identify all the filters that can generate significantly different subsets
Common filters: time periods, spatial areas, etc.
Usage: classifier design, change detection, disaster prediction, etc.

47 Running Example

48 StreamingMATE
Extend the precursor system SciMATE to process scientific data streams
Generalized reduction:
Reduce the data stream to a reduction object
No shuffling or sorting
Focus on the load balancing issues:
Input data volume can be highly variable
Topology updates: add/remove/update streaming operators
A minimal sketch of the generalized-reduction idea follows.
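A minimal sketch of the generalized-reduction idea over a stream is given below; the class and function names are hypothetical, and a real deployment would split accumulate/merge across distributed streaming operators:

```python
# Generalized reduction: fold every incoming element into a small reduction object,
# with no shuffle or sort. Names are hypothetical, not StreamingMATE's API.
class ReductionObject:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def accumulate(self, value):          # local reduction on a streaming operator
        self.count += 1
        self.total += value

    def merge(self, other):               # combine partial results across operators
        self.count += other.count
        self.total += other.total

def process_stream(chunks):
    robj = ReductionObject()
    for chunk in chunks:                  # chunks arrive continuously from the stream
        for v in chunk:
            robj.accumulate(v)
    return robj.total / robj.count if robj.count else 0.0

print(process_stream([[1.0, 2.0], [3.0], [4.0, 5.0]]))   # running average = 3.0
```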

49 StreamingMATE Overview


51 Hyperslab Selector
True: nullify the condition list; False: nullify the elementary condition
4-dimensional salinity dataset:
dim1: time [0, 1023]
dim2: cols [0, 166]
dim3: rows [0, 62]
dim4: layers [0, 33]
Fill up all the index boundary values

52 Type 2 and Type 3 Query Examples

53 Aggregation Query Examples
AG1: simple global aggregation
AG2: GROUP BY clause + HAVING clause
AG3: GROUP BY clause

54 Sequential and Parallel Performance of Aggregation Queries

55 Array Databases
Examples: SciDB, RasDaMan, and MonetDB
Arrays are the first-class citizens: everything is defined in the array dialect, i.e., both the input and output are defined by an array schema, and every operation is array-oriented
Lightweight or no ACID maintenance: with no write conflicts, ACID is inherently guaranteed
Other desired functionality: structural aggregations, array join, provenance, ...

56 Structural Aggregations
Aggregate the elements based on positional relationships.
E.g., moving average: calculate the average of each 2 × 2 square from left to right, aggregating the elements in the same square at a time.
Input array:
1 2 3 4
5 6 7 8
Aggregation result: 3.5 4.5 5.5
The same computation is sketched below.
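The same 2 × 2 moving-average computation can be reproduced with numpy as follows:

```python
# Slide the 2 x 2 square from left to right over the 2 x 4 input array.
import numpy as np

arr = np.array([[1, 2, 3, 4],
                [5, 6, 7, 8]], dtype=float)

result = np.array([arr[0:2, j:j + 2].mean() for j in range(arr.shape[1] - 1)])
# result == [3.5, 4.5, 5.5]
```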

57 Coarse-Grained Partitioning
Pros:
Low I/O cost
Low communication cost
Cons:
Workload imbalance for skewed data

58 Fine-Grained Partitioning
Pros:
Excellent workload balance for skewed data
Cons:
Relatively high I/O cost
High communication cost

59 Hybrid Partitioning
Pros:
Low communication cost
Good workload balance for skewed data
Cons:
High I/O cost

60 Auto-Grained Partitioning
2 steps:
Estimate the grid density (after filtering) by sampling, and thereby estimate the computation cost (based on the time complexity); for each grid, total processing cost = constant loading cost + varying computation cost
Partition the resulting cost array - balanced contiguous multi-way partitioning:
Dynamic programming (small # of grids)
Greedy (large # of grids)
A greedy sketch of the contiguous partitioning follows.
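A greedy sketch of balanced contiguous multi-way partitioning is shown below; the per-grid costs are made up, and the real system would switch to dynamic programming when the number of grids is small:

```python
# Greedy balanced contiguous multi-way partitioning: walk the per-grid cost array in order
# and cut whenever the running partition reaches the average target cost (or when a cut is
# forced so that every remaining partition still gets at least one grid).
def contiguous_partition(costs, n_parts):
    target = sum(costs) / n_parts
    cuts, running, parts_left = [], 0.0, n_parts
    for i, c in enumerate(costs):
        running += c
        remaining = len(costs) - (i + 1)
        if parts_left > 1 and (running >= target or remaining == parts_left - 1):
            cuts.append(i + 1)            # close the current partition after grid i
            running, parts_left = 0.0, parts_left - 1
    return cuts                           # boundaries between contiguous partitions

# hypothetical per-grid totals = constant loading cost + estimated computation cost
print(contiguous_partition([3.0, 1.0, 2.0, 6.0, 1.0, 2.0, 1.0, 4.0], 4))   # -> [3, 4, 7]
```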

61 Auto-Grained Partitioning (Cont’d)
Pros:
Low I/O cost
Low communication cost
Great workload balance for skewed data
Cons:
Overhead of sampling and runtime partitioning

62 Partitioning Strategy Summary
Criteria: I/O performance, workload balance, scalability, and additional cost
Coarse-grained: excellent / poor / none
Fine-grained: …
Hybrid: good
Auto-grained: great / nontrivial
Our partitioning strategy decider can help choose the best strategy.

63 All-Reuse Approach (Cont’d)
Key insight: the number of aggregates ≤ the number of queried elements, so it is more computationally efficient to iterate over the elements and update the associated aggregates.
More benefits:
Load balance (for hierarchical/circular aggregations)
More speedup for compound array elements: the data type of an aggregate is usually primitive, but this is not always true for an array element
The idea is similar to a simple join over two tables: it is more efficient to cache the smaller table and scan the larger table as few times as possible.

64 Parallel Grid Aggregation Performance
Used 4 processors on a real-life dataset of 8 GB
User-defined aggregation: K-Means
Varied the number of iterations to vary the computation amount

65 Data Access Strategies and Patterns
Full read: probably too expensive for reading a small data subset
Partial read:
Strided pattern
Column pattern
Discrete point pattern
Hypothetical examples of these patterns are shown below.
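Hypothetical h5py examples of these access patterns, reusing the small file from the earlier sketch, are shown below; a real partial read would be driven by the query rather than hard-coded indices:

```python
# Access patterns against a small 2-D HDF5 dataset; file and dataset names are assumptions.
import h5py

with h5py.File("example.h5", "r") as f:
    d = f["salinity"]

    full = d[:]                          # full read: wasteful for a small subset

    strided = d[::2, :]                  # strided pattern: every other row
    column = d[:, 1]                     # column pattern: a single column
    # discrete point pattern: h5py fancy indexing needs an increasing index list
    points = d[[0, 2], 1]                # two individual elements from column 1
```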

66 Indexing Cost of Different Binning Strategies with Varying # of Bins on the Synthetic Dataset

67 SUM Aggregation of Equi-Width Binning with Varying # of Bins on the Synthetic Dataset

68 SUM Aggregation of Equi-Depth Binning with Varying # of Bins on the Synthetic Dataset

69 SUM Aggregation of V-Optimized Binning with Varying # of Bins on the Synthetic Dataset

70 Average Relative Error(%) of MAX Aggregation of Different Methods on the Real-World Dataset

71 SUM Aggregation Times of Different Methods on the Real-World Dataset (DB)

72 SUM Aggregation Times of Different Methods on the Real-World Dataset (VBD)

73 SUM Aggregation Times of Different Methods on the Real-World Dataset (VBS)

74 SUM Aggregation Times of Different Methods on the Real-World Dataset (CD)

75 SUM Aggregation Times of Different Methods on the Real-World Dataset (CS)

76 SD vs. Classification
Classification methods, such as decision trees or decision rules, appear unlikely to find all meaningful contrasts: a classifier finds a single model that maximizes the separation of multiple groups, not all interesting models, as contrast discovery seeks. The output of classification is likely to be an entire decision tree/classification system, which can have the same subsetting predicates at different levels. A classifier separates the groups from each other, rather than contrasting the subsets with the general population.

