Fast Data Analysis with Integrating Statistical Metadata in Scientific Datasets Jialin Liu, Yong Chen Data-Intensive Scalable Computing Laboratory (DISCL)

Fast Data Analysis with Integrating Statistical Metadata in Scientific Datasets
Jialin Liu, Yong Chen
Data-Intensive Scalable Computing Laboratory (DISCL)
Computer Science Department, Texas Tech University
Cluster’13

Scientific Applications Trend
- Scientific simulations tend to be data intensive.
- VPIC, a plasma physics simulation code, generates 1 trillion particles, with 26 bytes per particle, and a total size of 36 TB for one file at each time step (source: LBNL).

Data requirements for selected INCITE applications at ALCF (source: R. Ross et al., Argonne National Laboratory)

PI | Project | On-Line Data | Off-Line Data
Lamb, Don | FLASH: Buoyancy-Driven Turbulent Nuclear Burning | 75TB | 300TB
Fischer, Paul | Reactor Core Hydrodynamics | 2TB | 5TB
Dean, David | Computational Nuclear Structure | 4TB | 40TB
Baker, David | Computational Protein Structure | 1TB | 2TB
Worley, Patrick H. | Performance Evaluation and Analysis | 1TB |
Wolverton, Christopher | Kinetics and Thermodynamics of Metal and Complex Hydride Nanoparticles | 5TB | 100TB
Washington, Warren | Climate Science | 10TB | 345TB
Tsigelny, Igor | Parkinson's Disease | 2.5TB | 50TB
Tang, William | Plasma Microturbulence | 2TB | 10TB
Sugar, Robert | Lattice QCD | 1TB | 44TB
Siegel, Andrew | Thermal Striping in Sodium Cooled Reactors | 4TB | 8TB
Roux, Benoit | Gating Mechanisms of Membrane Proteins | 10TB |

Scientific Applications Trend (cont.)
- Data collected from instruments is also growing rapidly.
- In a global climate model with 100 × 120 km grid cells, petabytes of data are managed and analyzed.
- Various parameters, e.g., temperature and wind speed, are recorded.
- Scientists want higher resolution and finer granularity, which can lead to even larger datasets.
Figure: global climate model (source: UCAR)

Traditional Databases
- Traditional databases have been successful for:
  - Structured and relational data for commercial applications
  - Transaction processing and non-procedural query analysis
  - Sophisticated indexing
- Challenges for scientific applications:
  - Lack of support for core scientific data types
  - Lack of support for rich access patterns
  - No good visualization/plotting tools
  - Performance may be slow

Scientific Datasets and Libraries
- Scientific dataset formats and libraries, e.g., HDF5, PnetCDF, and ADIOS, have been used for scientific data management. They provide:
  - Support for sophisticated data types and access patterns
  - Programming interfaces for creating, accessing, and visualizing the data
  - Integration with parallel I/O
Figure: PnetCDF Dataset Format

Recent Developments
- Recent studies have started to combine database techniques with scientific datasets to leverage the merits of both:
  - FastBit by Wu et al. implements indexing on datasets
  - FastQuery by Chou et al. implements a parallel indexing and query system for large-scale datasets
  - Su and Agrawal implemented user-defined subsetting
  - Scientific Data Service by Wu, Byna and Dong

Idea and Motivation
- Scientists are interested in understanding the phenomena behind the data. A typical case is to select data points of interest by performing range queries:
    Select data points
    From datasets
    Where pressure>80 And 12<temperature<25;
- Traditionally, without any prior knowledge of the datasets, this is a costly process.
- Our idea: use statistical metadata to facilitate such data analysis.
- The added metadata improves query response time by more than a factor of three.
Figure: Performance Comparison with and without statistics
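The idea of using statistics to avoid a full scan can be sketched as follows. This is a minimal illustration, not the FASM implementation: it assumes each subset carries a per-variable minimum and maximum, and the class `SubsetStats`, its field names, and the sample values are all hypothetical. A subset is skipped only when its statistics prove that no point in it can satisfy `pressure>80 And 12<temperature<25`.

```python
from dataclasses import dataclass


@dataclass
class SubsetStats:
    """Per-subset summary; field names are hypothetical, not from FASM."""
    name: str
    pressure_min: float
    pressure_max: float
    temp_min: float
    temp_max: float


def may_match(s: SubsetStats) -> bool:
    """True if the subset could hold a point with pressure > 80 and 12 < temperature < 25."""
    pressure_possible = s.pressure_max > 80
    temp_possible = s.temp_max > 12 and s.temp_min < 25
    return pressure_possible and temp_possible


# Made-up statistics for three subsets.
subsets = [
    SubsetStats("s0", 10, 60, 5, 30),    # pressure never exceeds 80: skipped
    SubsetStats("s1", 70, 95, 14, 20),   # both ranges overlap the query: must be read
    SubsetStats("s2", 85, 99, 26, 40),   # temperature never drops below 25: skipped
]

candidates = [s.name for s in subsets if may_match(s)]
print(candidates)
```

Because the minimum and maximum are kept per variable, the test is conservative: a retained subset may still contain no matching point, so the data that is read must be checked against the query as usual.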

Fast Analysis with Statistical Metadata (FASM) and System Design
- The FASM system (Fast data Analysis with integrated Statistical Metadata) has four major components: Subsetting, Statistics Generating, Metadata Rich Datasets, and Runtime.
Figure: System Architecture

FASM Challenges
- What type of subsetting scheme is better?
- What type of statistical metadata is desired?
- How can the statistical metadata be utilized at runtime?

Dimension-Driven Subsetting
- Subsetting refers to how the datasets are partitioned in order to integrate the statistical metadata. For 3D datasets, we can have 1D, 2D, 3D, and combined subsetting.
Figure: Different Subsetting Schemes

Locality-Driven Subsetting
- The inconsistency between logical access and physical storage in scientific datasets causes locality issues.

Locality in Subsetting Schemes

Type | Dimension | Distance
sub1 | (lat, lon) | 0
sub2 | (lon, level) | (lat-1)×lon
sub3 | (lat, level) | lat×(lon-1)
sub4 | (lon, time) | (level×lat-1)×lon
sub5 | (lat, time) | (level×lon-1)×lat
sub6 | (level, time) | level×(lon×lat-1)
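The distance expressions in the locality table can be evaluated directly to compare schemes for a given dataset shape. The sketch below only transcribes the six formulas; the function name `locality_distance`, the sample dimension sizes, and the reading of the distance as a count of elements are assumptions, since the slide does not state the unit.

```python
def locality_distance(lat: int, lon: int, level: int) -> dict:
    """Evaluate the slide's distance expression for each subsetting scheme."""
    return {
        "sub1": 0,                        # (lat, lon)
        "sub2": (lat - 1) * lon,          # (lon, level)
        "sub3": lat * (lon - 1),          # (lat, level)
        "sub4": (level * lat - 1) * lon,  # (lon, time)
        "sub5": (level * lon - 1) * lat,  # (lat, time)
        "sub6": level * (lon * lat - 1),  # (level, time)
    }


# Hypothetical dimension sizes, chosen only for illustration.
distances = locality_distance(lat=4, lon=8, level=2)

# Smaller distance is read here as better locality.
ranking = sorted(distances, key=distances.get)
print(distances)
print(ranking)
```

For these sample sizes the schemes that avoid the time dimension (sub1 to sub3) have the smallest distances.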

Concurrency-Driven Subsetting
- Concurrency plays a critical role in exploring parallelism in the access and analysis of scientific datasets.
Figure: Distribution of Datasets on Parallel File Systems

Scheme | Concurrency
sub1 | min((x×m)/(stripe_size×level), n)
sub2 | min((x×m)/(stripe_size×lat), n)
sub3 | min((x×m)/(stripe_size×lon), n)
sub4, sub5, sub6 | min((x×m)/stripe_size, n)
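The concurrency expressions can be compared numerically with a small helper. This is a sketch under stated assumptions: the slide does not define `x`, `m`, or `n`, so they are treated as plain numeric inputs, and the function name `concurrency` and every sample value are hypothetical.

```python
def concurrency(scheme, x, m, stripe_size, n, lat, lon, level):
    """Evaluate the slide's concurrency expression for one subsetting scheme.

    x, m and n are not defined on the slide; they are passed through as
    plain numbers and carry no meaning beyond the formula itself.
    """
    divisors = {"sub1": level, "sub2": lat, "sub3": lon,
                "sub4": 1, "sub5": 1, "sub6": 1}
    if scheme not in divisors:
        raise ValueError(f"unknown scheme: {scheme}")
    return min((x * m) / (stripe_size * divisors[scheme]), n)


# Made-up inputs, chosen so the arithmetic is easy to follow.
params = dict(x=1024, m=4096, stripe_size=65536, n=32, lat=8, lon=16, level=4)
results = {s: concurrency(s, **params) for s in ("sub1", "sub2", "sub3", "sub4")}
print(results)
```

With these inputs the `min` caps sub4 at `n`, while the schemes divided by a dimension size stay below the cap.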

Statistics Generating and Enhanced Datasets
- Different statistics can be utilized, e.g., MIN, MAX, MEAN, MEDIAN, and 5-number statistics (min, lower quartile, median, upper quartile, and max).
- A statistical metadata portion is added to the dataset.
Figure: A Sample of Metadata Rich Datasets
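Generating 5-number statistics per subset can be sketched with the Python standard library. The fixed-size 1D partitioning, the function names, and the choice of the `inclusive` quartile method are assumptions made for this illustration; the slide does not say how FASM computes quartiles.

```python
from statistics import quantiles


def five_number_summary(values):
    """Return (min, lower quartile, median, upper quartile, max) for one subset."""
    q1, q2, q3 = quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


def summarize_subsets(data, subset_size):
    """Split a flat sequence into fixed-size subsets and summarize each one."""
    return [
        five_number_summary(data[i:i + subset_size])
        for i in range(0, len(data), subset_size)
    ]


# Ten made-up values split into two subsets of five.
data = list(range(1, 11))
metadata = summarize_subsets(data, subset_size=5)
for index, summary in enumerate(metadata):
    print(index, summary)
```

Each summary is five numbers regardless of subset size, which is why this kind of metadata stays small relative to the data it describes.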

FASM Runtime
- The Runtime component leverages the integrated statistical metadata to facilitate data analysis and queries.
- The current FASM system is designed for write-once, read-many applications, and thus does not deal with data modifications and regeneration of statistics.

An Example of Runtime Read Operation

  Input: query request and statistics_metadata
  Read operation:
    Step 1: In each access, get statistics_metadata
    Step 2: Filter useless subsets
    Step 3: Modify the access pattern:
              new_start[] = FASM_start;
              new_count[] = FASM_count;
    Step 4: Read:
              ncmpi_get_vara_float(ncid, varid, new_start[], new_count[], *fp);
  Return: query result
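Steps 2 and 3 of the read operation can be sketched as a planning function that turns per-subset statistics into start/count pairs. This is a hypothetical illustration rather than FASM code: it assumes 1D subsetting along the first dimension, a (min, max) pair per subset, and an open range query `lo < value < hi`. The resulting pairs play the role of `new_start[]` and `new_count[]`, which a C program would pass to `ncmpi_get_vara_float`.

```python
def plan_reads(stats, lo, hi, rows_per_subset, full_shape):
    """Return a list of [start, count] reads covering only subsets that may match.

    stats: (min, max) per subset along dimension 0 (assumed layout).
    """
    reads = []
    for i, (smin, smax) in enumerate(stats):
        # Step 2: drop subsets that cannot hold any value in (lo, hi).
        if smax <= lo or smin >= hi:
            continue
        # Step 3: build the hyperslab for this subset.
        first_row = i * rows_per_subset
        rows = min(rows_per_subset, full_shape[0] - first_row)
        start = [first_row] + [0] * (len(full_shape) - 1)
        count = [rows] + list(full_shape[1:])
        # Merge with the previous read when the two are contiguous.
        if reads and reads[-1][0][0] + reads[-1][1][0] == first_row:
            reads[-1][1][0] += rows
        else:
            reads.append([start, count])
    return reads


# Made-up statistics for a 10 x 4 x 4 variable split into subsets of 3 rows.
stats = [(0, 5), (8, 20), (25, 40), (50, 60)]
reads = plan_reads(stats, lo=10, hi=30, rows_per_subset=3, full_shape=(10, 4, 4))
print(reads)
```

Merging contiguous subsets keeps the number of read calls low; whether FASM does this is not stated on the slide.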

Current Evaluations
- Testbed
  - Hrothgar, a 640-node cluster at Texas Tech University
  - Each node contains two Intel Xeon (Westmere) 2.8 GHz 6-core processors with 24 GB of memory
  - Nodes are connected with DDR InfiniBand
  - PnetCDF v1.3
- Datasets and queries
  - Randomly generated synthetic datasets, 300 KB to 100 GB
  - Real application: BCCR-BCM, 12 GB
  - Randomly generated range queries and analyses, e.g., 10<pressure<30

Statistics and Performance Improvements
Figure: Performance of Different Statistics
- The proposed approach demonstrates clear performance advantages as the dataset size increases.

Locality and Concurrency
Figures: Performance Regarding Locality of Various Subsetting; Concurrency of Various Stripe Sizes
- Sub1, sub2, and sub3 are better than the sub4 and sub5 schemes in terms of locality, which confirms the prediction of the distance equations.
- Sub3 achieves the best performance when the stripe size is 5 MB, which is 1.67 times faster than the worst case.

Storage Overhead and Amortized Cost
Figures: Storage Overhead of Added Metadata; Amortized Cost
- The storage overhead of the integrated metadata is less than one percent for three subsetting schemes.

Real Application Test
Figure: BCCR Model Test Using FASM
- The BCCR-BCM 2.0 dataset is a climate model dataset from the Bergen Climate Model (BCM), downloaded from the Bjerknes Center for Climate Research (BCCR).
- Individual dataset sizes range from 100 MB to 1.8 GB, and the total size of these datasets is more than 12 GB.

Related Work
- Databases and Improvements
  - Weaknesses of traditional databases: no support for derived types, e.g., arrays; no support for spatial access patterns; data cannot be manipulated using standard application programs, etc.
  - FastBit provides a set of compressed bitmap indexes for scientific datasets.
  - Su and Agrawal designed tools to support user-defined subsetting and aggregation over NetCDF datasets.
  - The proposed approach has a similar idea to the indexing scheme, but is lightweight and incurs significantly less storage overhead.
Figure labels: Traditional Database; Scientific Datasets; Query optimization techniques

Related Work (cont.)
- Existing Optimizations for Scientific Dataset Management Libraries
  - PnetCDF optimizes serial NetCDF using MPI-IO techniques.
  - Compression algorithms are utilized in HDF5 to store the raw data.
  - ADIOS supports collecting local, simple, statistical and/or analytical data.
  - Our FASM system tries to provide a systematic solution by integrating the statistical metadata during the initial write, where it can be utilized further.

Related Work (cont.)
- File Systems and Data Organizations for Scientific Datasets
  - Buck combined MapReduce and NetCDF by implementing a SciHadoop prototype.
  - Kalyanaraman mapped n-D datasets to space-filling curves.
  - This study introduces a new approach to boost data analysis performance that considers file systems and the impact of data distribution and locality.

Conclusion
- Raw datasets and current formats are not sufficient for achieving optimal performance.
- Integration of database techniques and HPC scientific datasets is necessary; the potential has been shown in prior studies.
- We propose integrating statistical metadata into datasets.
- Experiments have shown that the integrated statistical metadata improves query and analysis performance.

Ongoing and Future Work
- Integrate FASM with FastBit to form two-level query filtering.
- Investigate further how to support data modifications, resubsetting, and regeneration of statistics at runtime.

Questions?
Welcome to visit our website: http://discl.cs.ttu.edu
Thank You
