
1 Light-Weight Data Management Solutions for Scientific Datasets
Gagan Agrawal, Yu Su (Ohio State); Jonathan Woodring (LANL)

2 Motivation
– Computing power is increasing; simulations are performed at finer spatial and temporal scales
– Some numbers from the Roadrunner EC3 simulation: 4000³ particles at 36 bytes per particle => 2.3 TB/time-step
– Expected to be 230 times bigger in the future (close to 1 PB/time-step)
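As a quick check of the per-time-step size using the figures above:

$$4000^3 \times 36\ \text{bytes} \approx 2.3 \times 10^{12}\ \text{bytes} \approx 2.3\ \text{TB per time step.}$$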

3 Specific Contexts
Data Visualization
– Cannot move massive data
– Cannot always visualize at the finest scales
Data Dissemination
– Limited wide-area bandwidth
– Limited storage at the client side
Unbalanced Systems (and more so in the future)
– Computing speeds are growing faster than memory size and speed, disk bandwidth, and WAN bandwidth

4 Visualization Context: Data Subsetting in ParaView
ParaView is a widely used data analysis and visualization tool.
Problems with its Load + Filter mode (a small sketch follows below):
– Loads the entire data set
– Filters the data at the visualization level
  Threshold Filter: based on values
  Extract Subset Filter: based on dimension information
– Grid transformation is needed during filtering (Regular Structured Grid -> Unstructured Grid)
Underlying problem: the state of the art in managing array-based data is very limited.
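As a rough illustration of the Load + Filter pattern, here is a minimal ParaView Python sketch. The file name ocean.nc and the variable TEMP are placeholders, and property names such as ThresholdRange vary across ParaView versions.

```python
# Minimal sketch of ParaView's load-then-filter pattern (hypothetical file/variable names).
from paraview.simple import OpenDataFile, Threshold, ExtractSubset

reader = OpenDataFile('ocean.nc')      # the whole dataset is read by the pipeline

# Value-based filtering: Threshold keeps cells whose TEMP lies in a range.
thr = Threshold(Input=reader)
thr.Scalars = ['POINTS', 'TEMP']
thr.ThresholdRange = [10.0, 20.0]      # property name differs in newer ParaView versions

# Dimension-based filtering: Extract Subset keeps a volume of interest (VOI).
sub = ExtractSubset(Input=reader)
sub.VOI = [0, 99, 0, 99, 0, 9]         # [imin, imax, jmin, jmax, kmin, kmax]
```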

5 Context II: Wide-Area Data Dissemination
[Diagram: simple vs. advanced requests; server-side vs. client-side subsetting]
Challenges:
– What if there is no subsetting request?
– What if the data subset is still large?

6 Current Approaches
Database Systems
– High-level query languages
– Indexing support
– Large, complex systems
– Need to load all data into the system
– Cannot handle format changes, etc.
Ad-hoc Solutions
– Use procedural or scripting languages
– Lack indexing support
– Keep data in its original format
– Light-weight solutions
– Adapt to format changes, etc.

7 Needs for Visualization and Dissemination
Cannot reformat/reload data
– Must use existing formats
Support high-level APIs
– Low-level programming is too tedious
Need subsetting support
– Dimension-based and value-based
Need sampling support
– Efficient
– Must give an assessment of the loss of accuracy

8 Our Approach
Automatic Data Virtualization
– Support a high-level view of array-based data
– Allow queries that assume such a view (an example query is sketched after this slide)
– Extract values from the dataset to serve these queries
Indexing techniques applied to low-level data
– Integrated with a high-level query system
Sampling is a critical functionality
– Integrated with the data virtualization system
– Uses an indexing method to sample
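To make the virtual relational view concrete, here is a hypothetical query over an ocean variable; the dataset and variable names and the exact grammar the system accepts are assumptions, but the query mixes dimension-based predicates (latitude, depth) with a value-based predicate (TEMP), as described above.

```sql
-- Hypothetical query against the virtual view TEMP(longitude, latitude, depth):
-- dimension-based predicates on latitude/depth plus a value-based predicate on TEMP.
SELECT TEMP
FROM ocean_dataset
WHERE latitude BETWEEN -30 AND 30
  AND depth < 500
  AND TEMP > 15.0;
```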

9 System Overview (NetCDF)
– Parse the SQL expression
– Parse the metadata file
– Generate the query request
– Index generation if the index does not yet exist; index retrieval afterward

10 A Faster Solution
Subset at the I/O level
– The user specifies both dimension and value ranges for the subset in one query
– Reduced I/O time and memory footprint
SQL queries with ParaView, and with GridFTP in the future
– Queries over dimensions: API support
– Queries over values: indexing
Bitmap indices and parallel bitmap indices
– Efficient subsetting over values

11 Subsetting Support
Dimension-based
– Possible using metadata from NetCDF (a sketch follows below)
Value-based
– Use existing indexing methods?
Dimension + value-based
– ??
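A minimal sketch of dimension-based subsetting using NetCDF metadata, assuming the netCDF4 Python bindings and a hypothetical file ocean.nc with a 4-D variable TEMP:

```python
# Dimension-based subsetting via NetCDF metadata (hypothetical file and variable names).
from netCDF4 import Dataset

ds = Dataset('ocean.nc', 'r')
temp = ds.variables['TEMP']            # e.g. dimensions (time, depth, latitude, longitude)
print(temp.dimensions, temp.shape)     # metadata tells us which indices map to which dims

# Read only the requested block from disk; no value-based filtering happens here.
subset = temp[0, 0:10, 100:200, 100:200]
ds.close()
```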

12 Background: Bitmap Indexing
– FastBit: widely used in scientific data management
– Suitable for floating-point values by binning them into small ranges
– Run-length compression (WAH, BBC): compresses a bitvector based on runs of consecutive 0s or 1s
(a small index-construction sketch follows below)
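A minimal sketch of binned bitmap-index construction in NumPy, with no WAH/BBC compression, just to fix the idea of one bitvector per value bin; the bin edges and array names are illustrative.

```python
import numpy as np

def build_bitmap_index(values, bin_edges):
    """One (uncompressed) bitvector per bin: bit i is set if values[i] falls in that bin."""
    bitvectors = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        bitvectors.append((values >= lo) & (values < hi))
    return bitvectors

# Example: bin a float attribute into four value ranges.
temp = np.random.uniform(0.0, 40.0, size=1_000_000)
index = build_bitmap_index(temp, bin_edges=[0.0, 10.0, 20.0, 30.0, 40.0])

# Answering "TEMP in [10, 30)" is then an OR of the matching bins' bitvectors.
hits = index[1] | index[2]
```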

13 Bitmap Index and Dimension Subsetting
Run-length compression (WAH, BBC)
– Good: compression rate, fast bitwise operations
– Bad: the ability to locate a dimension subset is lost
Two traditional methods:
– With bitmap indices: post-filter on dimension information
– Without bitmap indices: post-filter on values
Two-phase optimization:
– Index generation: distribute indices over sub-blocks
– Index retrieval: transform the dimension-subsetting information into bitvectors and use fast bitwise operations (a sketch follows below)
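A minimal sketch of the retrieval idea, assuming uncompressed NumPy bitvectors as in the sketch above: the dimension range is turned into a bitvector over the flattened array and combined with a value bitvector by a single bitwise AND (the actual system operates on compressed bitvectors).

```python
import numpy as np

def dim_range_bitvector(shape, ranges):
    """Turn a per-dimension index-range selection into a bitvector over the flattened array."""
    mask = np.zeros(shape, dtype=bool)
    mask[tuple(slice(lo, hi) for lo, hi in ranges)] = True
    return mask.ravel()

# Combined dimension + value subsetting as one bitwise AND.
shape = (100, 100, 100)
dim_bits = dim_range_bitvector(shape, [(0, 10), (0, 50), (0, 50)])
value_bits = np.random.rand(np.prod(shape)) > 0.5   # stand-in for one bin's value bitvector
selected = dim_bits & value_bits
```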

14 Optimization 1: Distributed Index Generation
– Study the relationship between queries and partitions
– Partition the data based on query preference

15 Index Partition Strategy
α rate: participation rate of data elements
– Number of elements touched by the index / total data size
– Worst case: all elements have to be involved
– Ideal case: exactly the elements of the dimension subset
Partition strategies (see the formula sketch after this slide):
– Strategy 1: α is proportional to the dimension-subsetting percentage and inversely proportional to the number of partitions.
– Strategy 2: In the general case, where subsetting over each dimension is equally likely, the partition should give equal preference to each dimension.
– Strategy 3: If queries only involve a subset of the dimensions, the partition should also be based on those dimensions.
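Writing the bullet points above as formulas (a reading of the slide, not the paper's exact notation): with $p$ the dimension-subsetting percentage and $k$ the number of partitions,

$$\alpha \;=\; \frac{\#\{\text{elements scanned through the index}\}}{\#\{\text{total elements}\}}, \qquad \alpha \;\propto\; \frac{p}{k}\ \ \text{(Strategy 1)},$$

with $\alpha = p$ in the ideal case (the index touches exactly the requested subset) and $\alpha = 1$ in the worst case.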

16 Parallel Index Architecture
[Figure: three-level index hierarchy; L1: data file, L2: variable, L3: data block]

17 Efficiency Comparison with Filtering in ParaView
Setup: data size 5.6 GB; input: 400 queries
Methods compared:
– Index m1: bitmap indexing, no optimization
– Index m2: bitwise operations instead of post-filtering
– Index m3: both bitwise operations and index partitioning
– Filter: load all data + filter
Results depend on the subset percentage:
– The general index method beats filtering when the data subset is < 60%
– The two-phase optimization achieved a 0.71 – 11.17x speedup over the filtering method

18 Efficiency Comparison with FastQuery
Setup: data size 8.4 GB; 48 processes; 100 queries for each query type
Result: achieved a 1.41x to 2.12x speedup over FastQuery

19 Server-side Data Sampling
– Integrate with data virtualization
– Which technique to use: simple or stratified random sampling?
– Minimize loss of ``information'' (e.g., entropy)
– Information loss is unavoidable; but how much is lost at a given sampling level? Can it be known before choosing a level? Can it be calculated efficiently?
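For reference (a standard definition, not something stated on the slide): with $p_i$ the fraction of data values falling in value bin $i$, the entropy of the value distribution is

$$H = -\sum_i p_i \log p_i,$$

so a sample that preserves the per-bin proportions $p_i$ preserves $H$ (cf. the "Preserves Entropy" property on slide 22).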

20 Additional Sampling Considerations
Many techniques fail to consider:
– the data value distribution
– data spatial locality
Error calculation is time-consuming
– Requires scanning the entire dataset
– Might defeat the purpose of sampling
Data reorganization to support sampling is undesirable
– E.g., kd-tree based methods
Data subsetting and sampling should be combined

21 Our Solution
A server-side subsetting and sampling framework
– Standard SQL interface
– Data subsetting over dimensions and values, e.g., the view TEMP(longitude, latitude, depth)
– Flexible sampling mechanism (a hypothetical query is sketched below)
Support data sampling over bitmap indices
– No data reorganization is needed
– Generates accurate error metrics
– Supports error prediction before sampling the data
– Supports data sampling over a flexible data subset
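Purely as an illustration of combining subsetting and sampling in one request (the SAMPLE clause and its syntax are an assumption, not the system's documented grammar):

```sql
-- Hypothetical combined subsetting + sampling request over TEMP(longitude, latitude, depth).
SELECT TEMP
FROM ocean_dataset
WHERE depth < 500 AND TEMP > 15.0
SAMPLE 1 PERCENT;   -- assumed clause: draw a 1% stratified sample of the qualifying subset
```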

22 Data Sampling Using Bitmap Indices
Features:
– The different bitvectors reflect the value distribution
– Key property: preserves entropy
– Error with respect to other metrics can also be assessed
– Bitmap construction for subsetting is leveraged, so subsetting and sampling can be combined
– No reorganization of the data

23 Stratified Sampling over Bitvectors
– S1: Index generation
– S2: Divide each bitvector into equal strides
– S3: Randomly select a certain percentage of the 1s out of each stride
(a sketch of steps S2–S3 follows below)
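A minimal sketch of steps S2–S3 over one bin's bitvector, assuming uncompressed NumPy boolean bitvectors as in the earlier sketches; stride length, fraction, and the stand-in bitvector are illustrative.

```python
import numpy as np

def stratified_sample_bitvector(bitvector, stride_len, fraction, rng=None):
    """S2-S3: split one bin's bitvector into equal strides and randomly keep
    `fraction` of the set bits (1s) inside each stride."""
    rng = np.random.default_rng() if rng is None else rng
    sampled = np.zeros_like(bitvector)
    for start in range(0, len(bitvector), stride_len):
        ones = np.flatnonzero(bitvector[start:start + stride_len])
        k = int(round(fraction * len(ones)))
        if k == 0:
            continue
        keep = rng.choice(ones, size=k, replace=False)
        sampled[start + keep] = True
    return sampled

# Example: sample 10% of the elements that fall in one value bin.
bits = np.random.rand(1_000_000) < 0.2        # stand-in for a bin's bitvector
sample = stratified_sample_bitvector(bits, stride_len=10_000, fraction=0.10)
```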

24 Error Prediction
Calculate errors based on bins instead of samples
– The indices classify the data into bins
– Each bin corresponds to one value or one value range
– Find a representative value for each bin: V_i
– An equal sampling probability is enforced for each bin
– Compute the number of samples within each bin: C_i
– Predict the error metrics based on V_i and C_i (a sketch follows below)
Representative value:
– Small bin: mean or median value
– Big bin: lower bound, upper bound, and mean value
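A minimal sketch of bin-based prediction of the mean and variance from the representative values V_i and per-bin sample counts C_i; the exact formulas and representative-value choices in the actual system may differ.

```python
import numpy as np

def predict_mean_variance(rep_values, sample_counts):
    """Predict the sample mean and variance from bin representatives V_i and
    per-bin sample counts C_i, without touching the raw data."""
    V = np.asarray(rep_values, dtype=float)     # V_i: one representative value per bin
    C = np.asarray(sample_counts, dtype=float)  # C_i: number of samples drawn from bin i
    n = C.sum()
    mean = (C * V).sum() / n
    var = (C * (V - mean) ** 2).sum() / n
    return mean, var

# Example: four bins with representative values and planned sample counts.
print(predict_mean_variance([5.0, 15.0, 25.0, 35.0], [120, 300, 250, 30]))
```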

25 Error Prediction Metadata
– Mean
– Variance
– Histogram
– Q-Q plot
– Mean and variance over strides

26 Error Prediction Formula (1)
[Slide shows the prediction formulas for the mean, variance, and histogram; a hedged reconstruction follows below]
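A plausible reconstruction of the bin-based error prediction from the definitions on slide 24 (V_i: representative value of bin i; C_i: number of samples drawn from bin i; N_i: total number of elements in bin i); this is stated as an assumption, not the slide's exact notation:

$$\hat{\mu}_{\text{sample}} = \frac{\sum_i C_i V_i}{\sum_i C_i}, \qquad \mu_{\text{full}} \approx \frac{\sum_i N_i V_i}{\sum_i N_i}, \qquad \widehat{\mathrm{err}}_{\mu} = \left|\hat{\mu}_{\text{sample}} - \mu_{\text{full}}\right|,$$

with the variance and the per-bucket histogram counts predicted and compared in the same bin-based way.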

27 Error Prediction Formula (2)
[Slide shows the prediction formula for the Q-Q plot]

28 Multi-Attribute Subsetting and Sampling Support
– S1: Generate value intervals for each attribute
– S2: Combine the single-attribute intervals into multi-attribute bins (mbins)
– S3: Generate bitmap indices based on the mbins
(a sketch of step S2 follows below)
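A minimal sketch of step S2, combining per-attribute value intervals into multi-attribute bins as a Cartesian product; the attribute names and intervals are illustrative.

```python
from itertools import product

# S1 output (illustrative): value intervals per attribute.
intervals = {
    'VX': [(-500.0, 0.0), (0.0, 500.0)],
    'VY': [(-500.0, 0.0), (0.0, 500.0)],
    'VZ': [(-500.0, 0.0), (0.0, 500.0)],
}

# S2: combine single-attribute intervals into multi-attribute bins (mbins).
mbins = [dict(zip(intervals, combo)) for combo in product(*intervals.values())]
print(len(mbins), mbins[0])  # 8 mbins; each mbin's bitvector would be the AND of its per-attribute bitvectors (S3)
```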

29 Experiment Setup
Environment:
– Darwin cluster: 120 nodes, 48 cores, 64 GB memory
Datasets:
– Ocean data: regular multi-dimensional dataset
– Cosmos data: discrete points with 7 attributes
Sampling methods:
– Simple random
– Simple stratified random
– KD-tree stratified random
– Big-bin index random
– Small-bin index random

30 Experiment Goals
Two applications after sampling:
– Data visualization: ParaView
– Data mining: k-means in MATE
Goals:
– Efficiency and accuracy with and without sampling
– Accuracy across different sampling methods
– Efficiency across different sampling methods
– Compare the predicted error with the actual error
– Speedup for sampling over a data subset

31 Efficiency and Accuracy of Sampling over Cosmos Data
Setup: data size 16 GB (VX, VY, VZ); network transfer speed 20 MB/s; k-means with 20 clusters, 3 dimensions, 50 iterations; MATE with 16 threads
Speedup compared to the original dataset, by sampling level: 25% sample: 2.11x; 12.5%: 4.30x; 1%: 21.02x; 0.1%: 60.14x
Error metric: means of the cluster centers; our method is much more accurate than the other methods

32 Absolute Mean Value Differences over Strides – 0.1%

33 Absolute Histogram Value Differences – 0.1%

34 Data Sampling Time
Setup: data size 1.4 GB
Our method incurs an extra striding cost
Comparison: the small-bin index random method takes 1.19 – 3.98x the time of the KD-tree random method

35 Conclusions
Current state of the art
– No ``DB'' solutions are integrated with visualization/dissemination
– DB approaches are complex, e.g., SciDB
Our approach
– Lightweight solutions
– Data stays in its original format
– High-level query support, indexing, and sampling
– Integrated with the visualization pipeline
– Ongoing work on integration with a GridFTP server

