
1 Oral Exam 2013
A Virtualization-Based Data Management Framework for Big Data Applications
Yu Su
Advisor: Dr. Gagan Agrawal, The Ohio State University

2 Motivation: Scientific Data Analysis
Science is becoming increasingly data driven, creating strong requirements for efficient data analysis.
– Roadrunner EC3 simulation: 4000³ records, 7 attributes (X, Y, VX, …, MASS), 36 bytes per record; simulation speed: 2.3 TB
– Parallel Ocean Program: 3-D grid of 42 × 2400 × 3600, > 30 attributes (TEMP, SALT, …), 1.4 GB per attribute; simulation speed: > 50 GB

3 Motivation: Big Data
"Big Data" challenges:
– Fast data generation speed
– Slow disk I/O and network speed, a gap that will only grow in the future
– Diverse data formats
Observations:
– Scientific analysis typically runs over data subsets (attribute subsets, spatial subsets, value subsets), e.g., the Community Climate System Model, data pipelines from tomography, X-ray photon correlation spectroscopy
– Multi-resolution data analysis
– Wide-area data transfer protocols

4 An Example of Ocean Simulation
[Figure: a remote data server holds an ocean simulation file (POP.nc) with attributes TEMP, SALT, UVEL, VVEL. Instead of shipping the entire data file over the network, user requests — "I want to analyze TEMP within the North Atlantic Ocean", "I want to see the average TEMP of the ocean", "I want to quickly view the general global ocean TEMP" — are answered with only a data subset, an aggregation result, or data samples. Combining flexible data management with a wide-area data transfer protocol is far more efficient.]

5 Introduction
A server-side data virtualization method:
– Standard SQL queries over scientific datasets: translate SQL into low-level data access code; data formats: NetCDF, HDF5
– Data subsetting and aggregation: multiple subsetting and aggregation types greatly decrease the data transfer volume
– Data sampling: efficient data analysis with a small loss of accuracy
– Combination with wide-area transfer protocols: flexible data management plus efficient data transfer (SDQuery_DSI in Globus GridFTP)

6 Thesis Work
Existing work:
– Supporting User-Defined Subsetting and Aggregation over Parallel NetCDF Datasets (CCGrid 2012)
– Indexing and Parallel Query Processing Support for Visualizing Climate Datasets (ICPP 2012)
– Taming Massive Distributed Datasets: Data Sampling Using Bitmap Indices (HPDC 2013)
– SDQuery DSI: Integrating Data Management Support with a Wide Area Data Transfer Protocol (SC 2013)
Future work:
– Correlation analysis among multiple variables: bitmap indexing for better efficiency and more flexibility
– Correlation data mining over scientific data

7 Outline
Current work:
– Parallel server-side data subsetting and aggregation
– Flexible data sampling and efficient error calculation
– Combining data management with a data transfer protocol
Proposed work:
– Flexible correlation analysis over multiple variables
– Correlation mining over scientific datasets
Conclusion

8 Contributions
Server-side subsetting and aggregation:
– Subsetting over dimensions, coordinates, and values
– Bitmap indexing with two-phase optimizations
– Aggregation: SUM, AVG, COUNT, MAX, MIN
Keep data in its native format (e.g., NetCDF, HDF5):
– SciDB and OPeNDAP incur a huge data loading or transformation cost
Parallel data processing:
– Data partition strategy
– Multiple parallelism levels: files, attributes, blocks
Data visualization:
– SDQueryReader in ParaView
– Visualize only subsets of the data

9 Background: Bitmap Indexing
– Widely used in scientific data management
– Suitable for floating-point values by binning small value ranges
– Run-length compression (WAH, BBC): compress bitvectors based on runs of continuous 0s or 1s
– The index can be treated as a small profile of the data
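The run-length idea can be sketched in a few lines. This is an illustrative simplification, not the actual word-aligned WAH or byte-aligned BBC encoding; real WAH packs runs into 31-bit-aligned words, but the compression principle is the same:

```python
def rle_encode(bits):
    """Collapse runs of identical bits into (bit, run_length) pairs.
    A simplified stand-in for WAH/BBC run-length compression."""
    runs = []
    for b in bits:
        if runs and runs[-1][0] == b:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([b, 1])     # start a new run
    return [tuple(r) for r in runs]

def rle_decode(runs):
    """Expand (bit, run_length) pairs back into the full bitvector."""
    return [b for b, n in runs for _ in range(n)]

# Bitvector for one bin: bit i is 1 when record i falls in the bin's range.
bv = [0] * 8 + [1] * 5 + [0] * 3
print(rle_encode(bv))  # [(0, 8), (1, 5), (0, 3)]
```

Long runs of 0s, common when a bin covers a narrow value range, collapse to a single pair, which is why the index stays small relative to the data.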

10 Overview of Server-Side Data Subsetting and Aggregation
– Parse the SQL expression and the metadata file to generate the query request
– Index generation and index retrieval
– Generate the data subset based on the matching point IDs
– Perform data aggregation
– Generate an unstructured grid

11 Bitmap Index Optimizations
Run-length compression (WAH, BBC):
– Pros: compression rate, fast bitwise operations
– Cons: the ability to locate a dimension subset is lost
Value predicates vs. dimension predicates — two traditional methods:
– Without bitmap indices: post-filter on values
– With bitmap indices (FastBit): post-filter on dimension info
Two-phase optimizations:
– Index generation: distributed indices over sub-blocks
– Index retrieval: transform dimension subsetting conditions into bitvectors, supporting bitwise operations between dimension and value bitvectors
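A toy sketch of the retrieval optimization: the grid shape, bin, and predicate below are made up for illustration, and plain lists stand in for compressed bitvectors:

```python
# Records laid out over a small 4 x 5 grid in row-major order.
NX, NY = 4, 5

def dim_bitvector(x_lo, x_hi, y_lo, y_hi):
    """Dynamically turn a dimension predicate (x_lo <= x <= x_hi and
    y_lo <= y <= y_hi) into a bitvector -- no disk access needed."""
    return [1 if x_lo <= i // NY <= x_hi and y_lo <= i % NY <= y_hi else 0
            for i in range(NX * NY)]

# A value bitvector would be read from the index on disk; fake one here:
value_bv = [1 if i % 3 == 0 else 0 for i in range(NX * NY)]

# ANDing the two bitvectors yields the point ID set directly,
# with no per-element post-filtering.
dim_bv = dim_bitvector(1, 2, 0, 2)
point_ids = [i for i in range(NX * NY) if dim_bv[i] & value_bv[i]]
print(point_ids)  # [6, 12]
```

The point is that the dimension predicate never touches the data: it is synthesized as a bitvector and combined with the stored value bitvectors using the same fast bitwise machinery.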

12 Optimization 1: Distributed Index Generation
– Index generation: generate multiple small indices over sub-blocks of the data
– Partition strategy: study the relationship between queries and partitions, and partition the data based on query preferences
– α rate: redundancy rate of data elements
– Index retrieval: filter the indices based on dimension-based query conditions

13 Partition Strategy
Queries involve both value and dimension conditions:
– Bitmap indexing + dimension filter
– Worst case: all elements have to be involved
– Ideal case: exactly the elements of the dimension subset
α rate: redundancy rate of data elements
– Number of elements in the consulted indices / total data size
Partition strategies:
– User queries have preferences (timestamp, longitude, latitude)
– Study the relationship between queries and partitions
– Partition the data based on query preferences
– The α rate can be greatly decreased

14 Optimization 2: Index Retrieval
Instead of post-filtering:
– Value-based predicates: find the satisfying bitvectors in the index files on disk
– Dimension-based predicates: dynamically generate dimension bitvectors that satisfy the current predicates
– Fast bitwise operations: logical AND operations between dimension and value bitvectors generate the point ID set

15 Parallel Processing Framework
Three parallelism levels — L1: data file; L2: attribute; L3: data block

16 Experiment Setup
Goals:
– Index-based subsetting vs. load + filter in ParaView
– Scalability of the parallel indexing method
– Parallel indexing vs. FastQuery
– Server-side vs. client-side aggregation
Datasets:
– POP (Parallel Ocean Program)
– GCRM (Global Cloud Resolving Model)
Environment:
– IBM Xeon cluster: 8 cores, 2.53 GHz, 12 GB memory

17 Efficiency Comparison with Filtering in ParaView
Setup: 5.6 GB dataset, 400 input queries; results depend on the subset percentage.
Methods:
– Index m1: traditional bitmap indexing, no optimization
– Index m2: bitwise operations instead of post-filtering
– Index m3: both bitwise operations and index partitioning
– Filter: load all data, then filter
Findings:
– The general index method beats filtering when the data subset is < 60%
– The two-phase optimizations achieved a 0.71–11.17x speedup over the traditional bitmap indexing method

18 Memory Comparison with Filtering in ParaView
Setup: 5.6 GB dataset, 400 input queries; results depend on the subset percentage.
Methods:
– Index m1: bitmap indexing, no optimization
– Index m2: bitwise operations instead of post-filtering
– Index m3: both bitwise operations and index partitioning
– Filter: load all data, then filter
Findings:
– The general index method has a much smaller memory cost than the filtering method
– The two-phase optimizations add only a small extra memory cost

19 Scalability with Different Process Counts
Setup: 8.4 GB dataset; 6, 24, 48, and 96 processes; 100 input queries; x-axis: subset percentage, y-axis: time.
– Each process takes care of one sub-block
– Good scalability as the number of processes increases

20 Comparison with FastQuery
FastQuery:
– A parallel indexing method based on FastBit: builds a relational table view over the dataset and generates parallel indices based on partitions of the table
– Pros: a standard, table-based way to process data
– Cons: the multi-dimensional structure is lost; only row-based partitioning is supported, and the basic reading unit is a run of continuous rows (1-D segments)
Our method:
– Flexible partition strategy: partition the multi-dimensional data based on users' query preferences
– Fewer reads: the basic reading unit is a multi-dimensional block
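The reading-unit difference can be made concrete with a little arithmetic; the dataset and subset shapes below are invented for illustration:

```python
# A 3-D dataset of shape (X, Y, Z) = (100, 100, 100), stored row-major,
# from which we want the sub-block x in [0,10), y in [0,10), z in [0,10).
sx, sy, sz = 10, 10, 10

# Row-segment reading (FastQuery-style): each contiguous run is a z-run,
# so one read is issued per (x, y) pair in the subset.
segment_reads = sx * sy

# Block reading (our method): if the data was partitioned into
# 10 x 10 x 10 blocks matching query preferences, the whole subset
# is one aligned block and needs a single read.
block_reads = 1

print(segment_reads, block_reads)  # 100 1
```

The gap grows with the surface-to-volume ratio of the requested region, which is why block-aligned partitioning pays off for multi-dimensional subsetting.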

21 Execution Time Comparison with FastQuery
Setup: 8.4 GB dataset, 48 processes; query types: value + 1st dim, value + 2nd dim, value + 3rd dim, and overall; 100 input queries per query type.
– Achieved a 1.41–2.12x speedup over FastQuery

22 Parallel Data Aggregation Efficiency
Setup: 16 GB dataset; 1–16 processes; 60 aggregation queries.
Query types: aggregation only; aggregation + GROUP BY; aggregation + GROUP BY + HAVING.
Findings:
– Much smaller data transfer volume
– Relative speedups: 2.61–3.08 with 4 processes, 4.31–5.52 with 8, 6.65–9.54 with 16

23 Outline
Current work:
– Parallel server-side data subsetting and aggregation
– Flexible data sampling and efficient error calculation
– Combining data management with a data transfer protocol
Proposed work:
– Flexible correlation analysis over multiple variables
– Correlation mining over scientific datasets
Conclusion

24 Contributions
Statistical sampling techniques:
– Use a subset of individuals to represent the whole population
– Information loss and error metrics: mean, variance, histogram, Q-Q plot
Challenges:
– Sampling accuracy must account for data features
– Error calculation has a high overhead
Our approach — data sampling over bitmap indices:
– Data samples have better accuracy
– Error can be predicted before sampling the data
– Sampling is supported over flexible data subsets
– No data reorganization is needed

25 Data Sampling over Bitmap Indices
Features of bitmap indexing:
– Each bin (bitvector) corresponds to one value range
– Together, the bins reflect the entire value distribution
– Each bin preserves the data's spatial locality: it contains all space IDs (0-bits and 1-bits), whether the layout is row-major, column-major, Hilbert curve, or Z-order curve
Method:
– Perform stratified random sampling over each bin
– Multi-level indices generate multi-level samples

26 Stratified Random Sampling over Bins
– S1: Index generation
– S2: Divide each bitvector into equal strides
– S3: Randomly select a certain percentage of the 1s in each stride
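Steps S2–S3 can be sketched as follows; the stride size, percentage, and bitvector below are illustrative, and an uncompressed list stands in for the bin's bitvector:

```python
import random

def stratified_sample(bitvector, stride, pct, rng):
    """S2: split the bin's bitvector into equal strides.
    S3: randomly pick ~pct of the 1-bits in each stride, so the
    sample follows the bin's spatial distribution of records."""
    picked = []
    for start in range(0, len(bitvector), stride):
        ones = [i for i in range(start, min(start + stride, len(bitvector)))
                if bitvector[i]]
        if ones:
            picked.extend(rng.sample(ones, round(pct * len(ones))))
    return sorted(picked)

rng = random.Random(0)
bv = [1 if i % 2 == 0 else 0 for i in range(100)]   # 50 one-bits
sample = stratified_sample(bv, stride=20, pct=0.2, rng=rng)
# every stride of 20 bits holds 10 one-bits, so 2 are drawn per stride
print(len(sample))  # 10
```

Because every stride contributes proportionally, the sample inherits the spatial locality that the slide's S2 step is designed to preserve.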

27 Error Prediction vs. Error Calculation
– Error calculation (traditional): submit a sampling request, sample the data, then calculate the error metrics; if the sample is not good, repeat — data sampling and error calculation run multiple times
– Error prediction (ours): submit a predict request, predict the error metrics first, and use the error-metric feedback to decide on sampling — prediction may run multiple times, but the data is sampled only once

28 Error Prediction
Pre-estimate the error metrics before sampling, calculating them from the bins:
– Bitmap indices classify the data into bins: each bin corresponds to one value or value range, and representative values V_i are chosen for each bin
– Enforce an equal sampling percentage for each bin: with a small piece of extra metadata — the number of 1-bits C_i in each bin — the number of samples S_i per bin can be computed
– Pre-calculate the error metrics based on V_i and S_i
Representative values:
– Small bin: mean value
– Big bin: lower bound, upper bound, and mean value
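For example, the predicted mean follows directly from the V_i and S_i without touching the raw data; the bins below are hypothetical, and single mean representative values (the small-bin case) are used:

```python
def predicted_mean(bins, pct):
    """Estimate the sample mean before any data is read.
    bins: (representative value V_i, 1-bit count C_i) per bitmap bin.
    With an equal sampling percentage per bin, S_i = round(pct * C_i),
    and the expected sample mean is the S_i-weighted mean of the V_i."""
    samples = [(v, round(pct * c)) for v, c in bins]
    n = sum(s for _, s in samples)
    return sum(v * s for v, s in samples) / n

# three bins: mean representative value and 1-bit count for each
bins = [(0.5, 100), (1.5, 300), (2.5, 100)]
print(predicted_mean(bins, pct=0.1))  # 1.5
```

The same bin-level bookkeeping extends to variance and histogram predictions, which is why prediction is so much cheaper than sampling and measuring.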

29 Data Subsetting + Data Sampling
– S1: Find the value subset, e.g., Value = [2, 3)
– S2: Find the spatial ID subset, e.g., RID = (9, 25)
– S3: Perform stratified sampling on the subset

30 Experiment Results
Goals:
– Accuracy of different sampling methods
– Predicted error vs. actual error
– Efficiency of different sampling methods
– Speedup from combining data sampling with subsetting
Datasets:
– Ocean data: multi-dimensional arrays
– Cosmos data: separate points with 7 attributes
Environment:
– Darwin cluster: 120 nodes, 48 cores, 64 GB memory

31 Sample Accuracy Comparison
Sampling methods:
– Simple random
– Stratified random
– KD-tree stratified random
– Big-bin index random
– Small-bin index random
Error metrics:
– Means over 200 separate sectors
– Histogram using 200 value intervals
– Q-Q plot with 200 quantiles
Sampling percentage: 0.1%

32 Sample Accuracy Comparison (Mean, Histogram, Q-Q Plot)
– Traditional sampling methods cannot achieve good accuracy
– The small-bin method achieves the best accuracy in most cases
– The big-bin method achieves accuracy comparable to the KD-tree sampling method

33 Predicted Error vs. Actual Error
Means, histogram, and Q-Q plot comparisons for the small-bin method and for the big-bin method.

34 Efficiency Comparison
– Sample generation time: index-based sample generation time is proportional to the number of bins (1.10 to 3.98 times slower)
– Error calculation time: calculating error metrics from the bins is much faster than calculating them from the data (> 28x faster)

35 Total Time vs. Number of Resamplings
(x-axis: number of resampling rounds; y-axis: total sampling time)
– Index-based sampling: multiple error calculations, but only one-time sampling
– Other sampling methods: multiple samplings and multiple error calculations
– Speedup of the small-bin method: 0.91–20.12

36 Speedup of Sampling over a Subset
Setup: x-axis: data subsetting percentage (100%, 50%, 30%, 10%, 1%); y-axis: index loading time + sample generation time; 25% sampling percentage.
– Speedup: 1.47–4.98 for spatial subsetting, 2.25–21.54 for value subsetting
(plots: subset over spatial IDs; subset over values)

37 Outline
Current work:
– Parallel server-side data subsetting and aggregation
– Flexible data sampling and efficient error calculation
– Combining data management with a data transfer protocol
Proposed work:
– Flexible correlation analysis over multiple variables
– Correlation mining over scientific datasets
Conclusion

38 Background: Wide-Area Data Transfer Protocols
Efficient data transfers over wide-area networks — Globus GridFTP:
– Striped, streaming, parallel data transfer
– Reliable and restartable data transfer
Limitation:
– The basic data transfer unit is the whole file (GB or TB level)
– There are strong requirements for transferring only data subsets
Goal: integrate core data management functionality with wide-area data transfer protocols

39 Contributions
Challenges:
– How should the method be designed to allow easy use and integration with existing GridFTP installations?
– How can users view a remote file and specify the subsets of data?
– How can data retrieval be made efficient across different subsetting scenarios?
– How can data retrieval be parallelized to benefit from multi-streaming?
GridFTP SDQuery DSI:
– Efficient data transfer over flexible file subsets
– Dynamic loading/unloading with small overhead
– Performance-model-based hybrid data reading
– Parallel streaming data reading and transfer

40 Outline
Current work:
– Parallel server-side data subsetting and aggregation
– Flexible data sampling and efficient error calculation
– Combining data management with a data transfer protocol
Proposed work:
– Flexible correlation analysis over multiple variables
– Correlation mining over scientific datasets
Conclusion

41 Motivation: Correlation Analysis
Correlation analysis among attributes (variables):
– Study the relationships among variables to make scientific discoveries
– Two scenarios: verification and discovery of basic scientific rules, and feature mining (halo finding, eddy finding)
Challenge:
– Correlation analysis is useful but extremely time consuming and resource intensive
– No existing method supports flexible correlation analysis on data subsets

42 Correlation Metrics
Multi-dimensional histogram:
– The joint value distribution of the variables
Entropy:
– A metric for the variability of the dataset
– Low: constant, predictable data; high: random data
Mutual information:
– A metric for the dependence between two variables
– Low: the two variables are independent; high: one variable provides information about the other
Pearson correlation coefficient:
– A metric quantifying the linear correspondence between two variables
– Value range: [-1, 1]; nonzero: linearly correlated (positively or negatively); 0: uncorrelated
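All three metrics can be computed from binned (histogram) counts, which is what makes an index-based implementation possible. A plain-Python sketch; the bin count, value range, and data are illustrative:

```python
import math
from collections import Counter

def _bin(v, bins, lo, hi):
    """Map a value to its bin id over [lo, hi)."""
    width = (hi - lo) / bins
    return min(int((v - lo) / width), bins - 1)

def entropy_of(counts):
    """Shannon entropy (bits) of a histogram's counts."""
    n = sum(counts.values())
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def entropy(xs, bins=4, lo=0.0, hi=4.0):
    return entropy_of(Counter(_bin(x, bins, lo, hi) for x in xs))

def mutual_information(xs, ys, bins=4, lo=0.0, hi=4.0):
    """I(X;Y) = H(X) + H(Y) - H(X,Y): zero when the binned variables
    are independent, large when one predicts the other."""
    bx = [_bin(x, bins, lo, hi) for x in xs]
    by = [_bin(y, bins, lo, hi) for y in ys]
    return (entropy_of(Counter(bx)) + entropy_of(Counter(by))
            - entropy_of(Counter(zip(bx, by))))

def pearson(xs, ys):
    """Linear correlation coefficient in [-1, 1]."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

temp = [0.5, 1.5, 2.5, 3.5] * 10          # uniform over 4 bins
print(entropy(temp))                       # 2.0 bits
print(mutual_information(temp, temp))      # 2.0 (Y == X: maximal)
print(pearson(temp, temp))                 # 1.0
```

Only entropy_of ever sees raw values here; entropy and mutual information run entirely on bin counts, which a bitmap index already stores as 1-bit counts per bin.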

43 Our Solution and Contributions
A framework supporting both individual and correlation data analysis based on bitmap indexing:
– Individual analysis: flexible data subsetting
– Correlation analysis: interactive queries among multiple variables, correlation metrics calculated from the indices, and correlation analysis over data subsets
Correlation analysis over bitmap indices:
– Better efficiency, smaller memory cost
– Supports both static indexing and dynamic indexing
– Supports correlation analysis over data samples

44 Use Cases of Correlation Analysis
Please enter the variable names over which to perform correlation queries: TEMP SALT UVEL
Please enter your SQL query: SELECT TEMP FROM POP WHERE TEMP>0 AND TEMP<1 AND depth_t<50;
  Entropy: TEMP (2.19), SALT (1.90), UVEL (1.48)
  Mutual information: TEMP-SALT: 0.18, TEMP-UVEL: 0.017
  Pearson correlation: …
  Histogram: (SALT), (UVEL)
Please enter your SQL query: SELECT SALT FROM POP WHERE SALT<0.0346;
  Entropy: TEMP (2.29), SALT (2.99), UVEL (2.68)
  Mutual information: TEMP-UVEL: 0.02, SALT-UVEL: 0.19
  Pearson correlation: …
  Histogram: (UVEL)
Please enter your SQL query: UNDO
  Entropy: TEMP (2.19), SALT (1.90), UVEL (1.48)
  Mutual information: TEMP-SALT: 0.18, TEMP-UVEL: 0.017
  Pearson correlation: …
  Histogram: (SALT), (UVEL)
Please enter your query:

45 Dynamic Indexing
Without indexing support:
– Load all data for A and B
– Filter A and B to generate the subset
– Combined bins: generate (A_1, B_1) -> count_1, …, (A_m, B_m) -> count_m by scanning every data element within the subset
– Calculate the correlation information based on the combined bins
Dynamic indexing (one index per variable):
– Query the bitvectors for A and B (no data loading cost, zero or very small filtering cost)
– Combined bins: generate (A_1, B_1) -> count_1, …, (A_m, B_m) -> count_m through bitwise operations between A's and B's bitvectors — much faster, because the number of bitvectors is much smaller than the number of elements
– Calculate the correlation information based on the combined bins
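The combined-bin construction can be sketched with Python sets standing in for the (compressed) bitvectors, so set intersection plays the role of the bitwise AND; the variables and binning below are made up:

```python
def build_index(values, bin_of):
    """One 'bitvector' per bin: the set of record ids whose value
    falls in that bin (a set stands in for a compressed bitvector)."""
    bins = {}
    for i, v in enumerate(values):
        bins.setdefault(bin_of(v), set()).add(i)
    return bins

# Two hypothetical variables over the same 5 records, binned by int().
A = [0.2, 1.7, 1.1, 0.4, 2.8]
B = [3.0, 0.1, 0.2, 3.5, 0.3]
idx_a = build_index(A, int)
idx_b = build_index(B, int)

# Combined bins (a_bin, b_bin) -> count, via one intersection per bin
# pair -- no per-element scan of the raw data.
combined = {(a, b): len(idx_a[a] & idx_b[b])
            for a in idx_a for b in idx_b
            if idx_a[a] & idx_b[b]}
print(combined)  # {(0, 3): 2, (1, 0): 2, (2, 0): 1}
```

The loop runs over bin pairs (m × n intersections) rather than over data elements, which is the source of the speedup the slide claims.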

46 Static Indexing
– Dynamic indexing: one index per variable; bitwise operations are still needed to generate the combined bins
– Static indexing: generate one big index file over multiple variables; only bitvector filtering or combining is needed (extremely small cost)

47 Outline
Current work:
– Parallel server-side data subsetting and aggregation
– Flexible data sampling and efficient error calculation
– Combining data management with a data transfer protocol
Proposed work:
– Flexible correlation analysis over multiple variables
– Correlation mining over scientific datasets
Conclusion

48 Correlation Mining
Challenges of correlation queries:
– Users do not know which subsets contain important correlations
– They must keep submitting queries to explore correlations
Correlation mining:
– Automatically find important correlations and suggest them to users
A bottom-up method:
– Generate correlations over basic spatial and value units
– Use bitmap indexing to speed up this process
– Use association rule mining to find and combine similar correlations

49 Generating Scientific Association Rules
Example association rule:
t_lon(10.1–15.1), t_lat(25.2–30.2), depth_t(1–10), TEMP(0–1), SALT(0.01–0.02) → Mutual Information(0.23, High)

50 Feature Mining
Feature mining based on correlation analysis:
– Sub-halo finding: correlation between space and velocity
– Eddy finding: correlation between speeds in different directions, using the OW (Okubo-Weiss) criterion
OW-based eddy detection:
– OW > 0: not an eddy; OW <= 0: might be an eddy
– One detection method: build v based on row-major order (x, y) and u based on column-major order (y, x); an eddy cannot exist across a long sequence of 1-bits
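Assuming the slide's OW criterion is the standard Okubo-Weiss parameter from oceanography (an assumption on my part), the per-grid-point test looks like this, with the velocity gradients taken as precomputed inputs (e.g., from finite differences over u and v):

```python
def okubo_weiss(dudx, dudy, dvdx, dvdy):
    """Okubo-Weiss parameter OW = s_n^2 + s_s^2 - omega^2.
    OW > 0: strain dominates (not an eddy core);
    OW <= 0: rotation dominates (might be an eddy)."""
    s_n = dudx - dvdy      # normal strain
    s_s = dvdx + dudy      # shear strain
    omega = dvdx - dudy    # relative vorticity
    return s_n ** 2 + s_s ** 2 - omega ** 2

# solid-body rotation (u = -y, v = x): pure rotation, OW < 0
print(okubo_weiss(0, -1, 1, 0))   # -4
# pure strain (u = x, v = -y): no rotation, OW > 0
print(okubo_weiss(1, 0, 0, -1))   # 4
```

This matches the slide's sign convention (OW > 0 rules an eddy out); how the u/v fields map onto row-major and column-major bitvectors is part of the proposed work and is not sketched here.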

51 Conclusion
– Addressed the "Big Data" challenge with a server-side data virtualization method
– Server-side data subsetting and aggregation
– Data sampling based on bitmap indexing
– Integration of flexible data management with an efficient data transfer protocol
Future work:
– Correlation queries
– Correlation mining

52 Thanks for your attention! Q & A

