Comparing NetCDF and a multidimensional array database on managing and querying large hydrologic datasets: a case study of SciDB
P5, Haicheng Liu

Outline
  Background
  Query design
  Selection of multidimensional (MD) array database
  Test environment setup
  Benchmark test and analysis
  Conclusions

Background

NetCDF
A concept that can refer to a data model, a format or an API
  Dimension: physical dimension or index, such as time step
  Variable: core data stored, e.g. precipitation
  Attribute: metadata of variables or of the file
Format
  Classic and 64-bit offset formats: a header and a data array, stored contiguously
  NetCDF-4 and NetCDF-4 classic model formats: support for dynamic schema and chunked storage

Problem for query
Contiguous storage structure adopted by the classic and 64-bit offset formats
[Figure: the values of Grid 1, Grid 2, Grid 3, … are stored one after another in a one-dimensional array]
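To make the problem concrete, the sketch below computes where the values of one grid cell's time series sit in a contiguous, row-major array laid out as (time, y, x). The 4000 x 4000 x 4 shape is taken from the MPE dataset; the 4-byte value size, the cell position, the dimension order and the function name are assumptions for illustration, and the file header is ignored.

```python
def time_series_offsets(nt, ny, nx, y, x, item_size=4):
    """Byte offsets of cell (y, x) at every time step in a contiguous
    row-major array laid out as (time, y, x). item_size is an assumed
    value size in bytes, not a figure from the slides."""
    return [((t * ny + y) * nx + x) * item_size for t in range(nt)]


# One MPE-sized file: 4 time steps of a 4000 x 4000 grid.
offsets = time_series_offsets(nt=4, ny=4000, nx=4000, y=1200, x=2500)
stride = offsets[1] - offsets[0]

print("offsets:", offsets)
print("bytes between consecutive time steps:", stride)
```

Under these assumptions, consecutive values of the time series are one full grid apart, so a time series extraction has to seek across the whole array to collect a handful of values.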

MD array database
A database whose abstract model for data management and query is a multidimensional array consisting of dimensions and attributes
Many solutions
  Open source: Rasdaman, SciDB, MonetDB, etc.
  Commercial: Essbase, Caché, Oracle Spatial, etc.
Most utilize a chunked storage structure

Possible solution
Chunked storage structure of the NetCDF-4 format and of multidimensional (MD) array databases
MD array databases also have a smarter caching strategy
[Figure: an index pointing to MD chunks]
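How chunked storage limits the data read by a query can be sketched as follows. The code maps a cell to its chunk and lists the chunks that a query box intersects. The function names and the query box are hypothetical; the three chunk shapes are among those tested later for MPE, read here as (time, y, x). It is a simplified model, not the indexing code of NetCDF-4 or SciDB.

```python
from itertools import product
from math import prod


def chunk_of(cell, chunk_shape):
    """Index of the chunk that contains a cell (integer division per dimension)."""
    return tuple(c // s for c, s in zip(cell, chunk_shape))


def chunks_touched(lo, hi, chunk_shape):
    """Indices of all chunks intersecting the inclusive box [lo, hi]."""
    ranges = [range(l // s, h // s + 1) for l, h, s in zip(lo, hi, chunk_shape)]
    return list(product(*ranges))


# Hypothetical sub grid query: 4 time steps, 200 rows, 300 columns.
lo, hi = (0, 1000, 2000), (3, 1199, 2299)

for shape in [(4, 4000, 4000), (1, 800, 800), (4, 100, 100)]:
    touched = chunks_touched(lo, hi, shape)
    cells_read = len(touched) * prod(shape)
    print(shape, "chunks:", len(touched), "cells read:", cells_read)
```

In this toy case the query needs 240,000 cells. The 4 x 100 x 100 chunks deliver exactly that, while a single 4 x 4000 x 4000 chunk forces a read of the whole 64,000,000-cell block.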

Research question
Can an MD array database process frequently implemented queries faster than NetCDF solutions for large hydrological datasets?

Roadmap
  Query design (dataset selection)
  Selection of MD array database
  Test environment setup (HydroNET-4)
    NetCDF connector
    MD array database connector
  Benchmark
    64-bit offset storage
    NetCDF-4 (normal chunk)
    NetCDF-4 (compressed chunk)
    MD array database (normal chunk)
    MD array database (compressed chunk)

Query design

Query and dataset collection
6 experts interviewed in total
19 conceptual queries categorized into 5 classes
  Selection based on dimension value
  Selection based on variable value
  Masking query, e.g. data quality check
  Statistical operation, e.g. Sum, Avg and Max
  Spatial operation, e.g. intersection
Datasets include 1D time series records, 2D satellite images, 5D forecast datasets, etc.

Datasets
MPE (Multi-Sensor Precipitation Estimate): rainfall rate from satellite data product
  Information stored: Rainfall rate; Availability; Quality
  Dimension count: 3
  Dimensions: x, y, time
  Span (single file): (4000, 4000, 4)
  Temporal resolution: 15 minutes
  Spatial resolution and coverage: 0.03 degree (3.3 km), 1/3 of the world
  Single file size: 250 MB
  Data format: 64-bit offset
GEFS (Global Ensemble Forecast System): weather forecast data
  Information stored: Temperature 2m above ground; Maximum temperature 2m above ground; Minimum temperature 2m above ground; Relative humidity 2m above ground; Total precipitation; Total cloud cover; U-component of wind 10m above ground; V-component of wind 10m above ground; Data status
  Dimension count: 5
  Dimensions: longitude, latitude, forecast, ensemble, model run
  Span (single file): (360, 181, 40, 20, 1)
  Temporal resolution: 6 hours
  Spatial resolution and coverage: 1 degree (111 km), global
  Single file size: 1.55 GB

MPE & GEFS
[Figure: the 3D MPE array (longitude, latitude, time) and the 5D GEFS array (longitude, latitude, forecast, ensemble, model run)]

Queries designed
MPE dataset
  Sub grid selection (Delft and northern part of the Netherlands)
  Time series extraction (a spot location in the Indian Ocean)
  Pyramid query (the Netherlands)
  Average calculation (the Netherlands)
  Maximum calculation (the Netherlands)
GEFS dataset
  Time series extraction (Delft, one cell in GEFS)
  Percentile calculation (Delft, one cell)
  Ensemble mean calculation (the Netherlands and Europe)
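As a rough illustration of what two of the statistical queries above compute, the sketch below runs a percentile and an ensemble mean on toy values. The slides do not say which percentile definition was used, so the nearest-rank method here is an assumption, as are the function names and the sample numbers.

```python
import math
from statistics import mean


def percentile_nearest_rank(values, p):
    """p-th percentile (0 < p <= 100) using the nearest-rank definition.
    The definition is an assumption; the slides do not specify one."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def ensemble_mean(members):
    """Mean per cell across ensemble members.
    members: one list of cell values per ensemble member."""
    return [mean(cell_values) for cell_values in zip(*members)]


# Toy data: 3 ensemble members, 2 cells each (hypothetical values).
members = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print("ensemble mean per cell:", ensemble_mean(members))

# Toy forecast values for a single cell across 5 ensemble members.
print("80th percentile:", percentile_nearest_rank([5, 1, 3, 2, 4], 80))
```

Both operations aggregate along the ensemble dimension, which is why the position of that dimension in the storage layout matters for the GEFS queries.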

Selection of MD array database

MD array database selection
The comparison focuses on Rasdaman and SciDB
9 criteria in total, each assessed with its own approach, e.g.
  Implementation of MD data storage structure: paper study, official documentation, forums, source code and discussion with developers
No practical tests are performed

MD array database selection
Criteria graded for Rasdaman and SciDB
  License (i.e. commercial open-source) 1
  Implementation of MD data storage structure
  Lossless compression support
  Parallelization
  .Net API 0.5
  Query language
  Spatial calculating capability
  NetCDF importer
  Maintenance
Overall grade: Rasdaman 6, SciDB 6.5
The final grade shows that SciDB scores higher

Test environment setup

Benchmark architecture

Benchmark test and analysis

64-bit offset NetCDF files
MPE dataset
  One file contains 4 time steps, 250 MB
  A folder contains 1722 files
GEFS dataset
  Only one file stored, containing 1 model run, 20 ensembles, 40 forecast steps, 181 latitudes and 360 longitudes, 1.55 GB

NetCDF-4 files
MPE dataset
  One file contains 4 time steps
  Two folders created for the two data stores, each with 720 files
  Data stores, with chunk size (X x Y x Time) and single file size
    NetCDF4_C2: 4000 x 4000 x 1, 250 MB
    NetCDF4_C2_C (compression): 4000 x 4000 x 1, 3 MB
GEFS dataset (1 file for one data store)
  Data stores, with chunk size (X x Y x Forecast x Ensemble x Modelrun) and single file size
    NetCDF4_GEFS_S3: 360 x 181 x 1 x 20 x 1, 1.55 GB
    NetCDF4_GEFS_S3_C (compression): 360 x 181 x 1 x 20 x 1, 654 MB
    NetCDF4_GEFS_S5: 360 x 181 x 1 x 1 x 1
    NetCDF4_GEFS_S5_C (compression): 360 x 181 x 1 x 1 x 1, 561 MB

SciDB arrays
MPE dataset
  Diverse chunk sizes and compression settings
GEFS dataset
  4 data schemas for storage, obtained by modifying the order of dimensions
MPE array levels
  Tiny: first 2 hours of 1st September, 2013; 8 time steps; SciDB array size 37 MB; original files in 64-bit offset format 488 MB
  Small: first 6 hours of 1st September, 2013; 24 time steps; SciDB array size 112 MB; original files 1.3 GB
  Medium: 1st September, 2013; 96 time steps; SciDB array size 448 MB; original files 5.7 GB
  Large: 7 days from 1st to 7th September, 2013; 672 time steps; SciDB array size 3 GB; original files 40 GB
  Very large: 30 days of September, 2013; 2880 time steps; SciDB array size 13 GB; original files 171.6 GB
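The time step counts of the array levels follow from the 15-minute temporal resolution of MPE, and the number of source files from the 4 time steps stored per file. The short sketch below reproduces that arithmetic; the constant and function names are illustrative only.

```python
STEPS_PER_HOUR = 4  # 15-minute temporal resolution of MPE
STEPS_PER_FILE = 4  # one MPE file holds 4 time steps

# Time span of each array level, in hours.
LEVEL_HOURS = {
    "Tiny": 2,
    "Small": 6,
    "Medium": 24,
    "Large": 7 * 24,
    "Very large": 30 * 24,
}


def steps_and_files(hours):
    """Time step count and number of source files covering a span in hours."""
    steps = hours * STEPS_PER_HOUR
    return steps, steps // STEPS_PER_FILE


for level, hours in LEVEL_HOURS.items():
    steps, files = steps_and_files(hours)
    print(f"{level}: {steps} time steps from {files} files")
```

The 30-day level works out to 2880 time steps from 720 files, which matches the 720 files per NetCDF-4 data store.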

6 chunk sizes for MPE arrays
  C1: 4 x 4000 x 4000
  C2: 1 x 4000 x 4000
  C3: 4 x 800 x 800
  C4: 1 x 800 x 800
  C5: 4 x 100 x 100
  C6: 1 x 100 x 100
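The trade-off between these chunk sizes can be made tangible by counting how many chunks each scheme produces. The sketch below does this for the smallest (8 time steps) and largest (2880 time steps) MPE arrays, reading each chunk size as time x 4000-cell grid dimensions. It counts logical chunk positions only; how SciDB multiplies these per stored attribute, and any chunk overlap, are outside this simplified model.

```python
import math

GRID = (4000, 4000)  # spatial extent of one MPE time step

# Chunk schemes as (time, grid dim 1, grid dim 2).
SCHEMES = {
    "C1": (4, 4000, 4000),
    "C2": (1, 4000, 4000),
    "C3": (4, 800, 800),
    "C4": (1, 800, 800),
    "C5": (4, 100, 100),
    "C6": (1, 100, 100),
}


def chunk_count(time_steps, chunk):
    """Number of chunk positions needed to cover an array of `time_steps` grids."""
    t, a, b = chunk
    return (math.ceil(time_steps / t)
            * math.ceil(GRID[0] / a)
            * math.ceil(GRID[1] / b))


for name, chunk in SCHEMES.items():
    print(name,
          "cells per chunk:", math.prod(chunk),
          "| chunks for 8 steps:", chunk_count(8, chunk),
          "| chunks for 2880 steps:", chunk_count(2880, chunk))
```

Going from C1 to C6 shrinks each chunk from 64,000,000 cells to 10,000, but raises the chunk count for the 2880-step array from 720 to 4,608,000, each of which needs its own metadata entry.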

GEFS: effect of dimensions order
[Figure: chunk layouts under different dimension orders (X, Y, F = forecast, E = ensemble, M = model run)]

Benchmark test
Two systems (NetCDF and SciDB) are benchmarked
Each specific query is run 20 times and the average of the middle 12 records is used as the query response time
For SciDB, network delay and query parsing add an extra cost of between 0.05 s and 0.2 s
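The timing rule above can be sketched as a trimmed mean. The code assumes that "the middle 12 records" means the 12 timings left after sorting the 20 runs and dropping the 4 fastest and the 4 slowest; the function names and the `run_query` callable are hypothetical stand-ins for the actual benchmark harness.

```python
import time
from statistics import mean


def response_time(timings, runs=20, keep=12):
    """Average of the middle `keep` timings out of `runs` sorted timings."""
    if len(timings) != runs:
        raise ValueError(f"expected {runs} timings, got {len(timings)}")
    drop = (runs - keep) // 2  # 4 fastest and 4 slowest are discarded
    ordered = sorted(timings)
    return mean(ordered[drop:runs - drop])


def benchmark(run_query, runs=20, keep=12):
    """Time a zero-argument callable `runs` times and apply the rule above."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append(time.perf_counter() - start)
    return response_time(timings, runs, keep)


# Toy timings in seconds: one slow outlier does not affect the result.
toy = [float(i) for i in range(1, 20)] + [1000.0]
print("response time:", response_time(toy))
```

Discarding both tails keeps cold-cache first runs and occasional system hiccups from dominating the reported response time.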

MPE sub grid selection
Selecting the grid covering the northern part of the Netherlands
[Figure: the selected sub grid within the MPE array; axes X, Y and Time]
Chunk size per scheme (the _C variants are compressed)
  C1, C1_C: 4 x 4000 x 4000
  C2, C2_C: 1 x 4000 x 4000
  C3, C3_C: 4 x 800 x 800
  C4, C4_C: 1 x 800 x 800
  C5, C5_C: 4 x 100 x 100
  C6, C6_C: 1 x 100 x 100

GEFS forecast time series extraction
Extracting a precipitation forecast time series from Delft, a spot location
[Figure: for 1 model run, one X-Y-Forecast cube per ensemble member]
Dimensions order and chunk size per scheme (the _C variants are compressed)
  S1, S1_C: M E F Y X, 1 x 20 x 1 x 181 x 360
  S2, S2_C: M F Y X E, 1 x 1 x 181 x 360 x 20
  S3, S3_C: X Y F E M, 360 x 181 x 1 x 20 x 1
  S5, S5_C: X Y F E M, 360 x 181 x 1 x 1 x 1
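A back-of-the-envelope model shows how many chunks each scheme must read for a time series at a single cell. The sketch assumes the query covers all 40 forecast steps of the one model run and either all 20 ensemble members or just one; which of these the benchmarked query used is not stated here, so both are shown. Function and variable names are illustrative.

```python
import math

# Dimension order and chunk shape per scheme
# (M = model run, E = ensemble, F = forecast, Y = latitude, X = longitude).
SCHEMES = {
    "S1": (("M", "E", "F", "Y", "X"), (1, 20, 1, 181, 360)),
    "S2": (("M", "F", "Y", "X", "E"), (1, 1, 181, 360, 20)),
    "S3": (("X", "Y", "F", "E", "M"), (360, 181, 1, 20, 1)),
    "S5": (("X", "Y", "F", "E", "M"), (360, 181, 1, 1, 1)),
}


def chunks_touched(scheme, extent):
    """Chunks read for a query spanning `extent[d]` cells along each dimension d.
    Exact when an extent is a single cell or starts on a chunk boundary."""
    order, chunk = SCHEMES[scheme]
    return math.prod(math.ceil(extent[d] / c) for d, c in zip(order, chunk))


def cells_read(scheme, extent):
    """Cells read when every touched chunk is loaded in full."""
    return chunks_touched(scheme, extent) * math.prod(SCHEMES[scheme][1])


all_members = {"M": 1, "E": 20, "F": 40, "Y": 1, "X": 1}
one_member = {"M": 1, "E": 1, "F": 40, "Y": 1, "X": 1}

for name in SCHEMES:
    print(name,
          "| all members:", chunks_touched(name, all_members), "chunks,",
          cells_read(name, all_members), "cells",
          "| one member:", chunks_touched(name, one_member), "chunks,",
          cells_read(name, one_member), "cells")
```

Because every chunk spans the full 360 x 181 grid, a time series over all ensemble members touches the whole array under any of the four schemes; only S5 reads less when a single member is requested.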

Overall evaluation
Data solutions, in column order: 64-bit offset; NetCDF-4; NetCDF-4 with DEFLATE compression; SciDB array; SciDB array with DEFLATE compression
Management
  Data loading: 5 4 3 1
  Storage
  Scheme transformation
  Management overall score: 7 6 9 8
Query
  MPE sub grid selection: 2
  MPE time series extraction
  MPE average calculation
  MPE maximum calculation
  GEFS forecast time series extraction
  GEFS percentile calculation
  GEFS ensemble mean calculation
  Query overall score: 29 32 12 20 14
Compound score (management * query * 0.14): 6.48 6.57 4.71 5.52 5.00
NetCDF-4 ranks first, then 64-bit offset; the SciDB solutions come after

Conclusions and future work

Summary
Within the scope of this research, NetCDF-4 without compression is the best solution for managing and querying large hydrologic datasets
For SciDB, a small chunk size is preferable, but the overhead of the large in-memory chunk metadata (i.e. <InstanceID, ArrayID, ChunkID, VersionID>) is a problem
DEFLATE compression of SciDB arrays has either a negative effect or no effect on query performance

Summary
A correlation between SciDB DEFLATE compression and chunk size is observed in time series extraction
With hypercubic and modest chunk sizes, the internal data structure of chunks in SciDB has an insignificant influence on query performance
Masking queries (e.g. data quality check) and spatial operations should be included in comprehensive benchmarking

Future work
Generic chunk model, to determine the best chunk size for querying
More realistic benchmark test, e.g. analyze the Hydrologic Research query log and simulate scenarios
Test with less memory capacity, with a focus on NetCDF
Parallel query processing and parallel loading for SciDB

Reflection
Knowledge gained from the Geo-database and Geo-web courses is utilized, e.g. blocks to store images, HTTP communication
The research makes use of geomatics techniques to solve water problems
The research fulfills organizations' needs (Hydrologic, Deltares, etc.) and contributes water services to the public

Questions?
