
1 Semantic Data Management for Organising Terabyte Data Archives
Michael Lautenschlager, World Data Center for Climate (M&D/MPIMET, Hamburg)
CAS2K3 Workshop, Sept. 2003, Annecy, France
Home: http://www.mad.zmaw.de/wdcc

2 Content:
General remarks
DKRZ archive development
CERA 1) concept
CERA data model and structure
Automatic fill process
Database access statistics
1) Climate and Environmental data Retrieval and Archiving

3 Semantic data management
Data consist of numbers and metadata. Metadata construct the semantic data context. Metadata form a data catalogue which makes the data searchable. Data are produced, archived and extracted within their semantic context. Data without explanation are only numbers.
Problems:
Metadata are of different complexity for different data types.
Consistency between numbers and metadata has to be ensured.
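As a minimal illustration of the consistency requirement (names and fields invented for this sketch): a fill process can verify that the catalogue's claims match the stored numbers before accepting an entry.

```python
# Hypothetical illustration: numbers only become data inside their
# semantic context, so metadata and payload are checked together.
def check_consistency(metadata: dict, values: list) -> None:
    # The catalogue claims a record count; the payload must match it.
    if metadata["record_count"] != len(values):
        raise ValueError("metadata and numbers are inconsistent")

entry = {"variable": "2m temperature", "unit": "K", "record_count": 3}
check_consistency(entry, [271.4, 272.0, 273.1])  # passes silently
```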

4 DKRZ Archive Development
Basic observations and assumptions:
1) Unix-file archive content end of 2002: 600 TB including backups
2) Observed archive rate (Jan.-May 2003): 40 TB/month
3) System changes: 50% compute power increase in August 2003
4) CERA DB size end of 2002: 12 TB
5) Observed increase (Jan.-May 2003): 1 TB/month
6) The automatic fill process into the CERA DB is going to become operational at 4 TB/month this year and should grow from 10% of the archiving rate to approx. 30% by the end of 2004
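For orientation, the conservative estimate on the next slide follows from these rates; the sketch below assumes the 50% compute power increase lifts archiving from the observed 40 TB/month to the 60 TB/month quoted on slide 16, and reproduces the published figures to within rounding.

```python
# Reproduce the conservative file-archive estimate from the stated rates.
archive_tb = 600.0          # Unix-file archive content end of 2002 (TB)
rate_before = 40.0          # observed archive rate Jan.-May 2003 (TB/month)
rate_after = 60.0           # assumed rate after the Aug. 2003 upgrade (slide 16)

archive_tb += 7 * rate_before + 5 * rate_after     # Jan.-Jul. + Aug.-Dec. 2003
print(f"end 2003: {archive_tb / 1000:.1f} PB")     # ~1.2 PB

for year in (2004, 2005, 2006, 2007):
    archive_tb += 12 * rate_after
    # ~1.9 ... 4.1 PB, matching the published table to within rounding
    print(f"end {year}: {archive_tb / 1000:.1f} PB")
```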

5 DKRZ Archive Development: Conservative Estimate

Year | 2003 | 2004 | 2005 | 2006 | 2007
Estimated file archive size | 1.2 PB | 1.9 PB | 2.6 PB | 3.4 PB | 4.1 PB

6 Problems with direct file archive access:
Missing data catalogue: the directory structure of the Unix file system is not sufficient to organise millions of files.
Data are not stored application-oriented: raw data contain time series of 4D data blocks (3D in space plus type of variable), while the access pattern is time series of 2D fields.
Lack of experience with climate model data: problems in extracting relevant information from climate model raw data files.
Lack of computing facilities at the client site: non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years T106 or 50 years T42 at 6-hour storage intervals).
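The layout mismatch in the second point can be made concrete with a few lines of numpy; shapes and variable counts here are invented for illustration.

```python
import numpy as np

# Hypothetical shapes: 12 monthly raw files, each holding 120 time steps
# (6-hourly over a 30-day month) x 5 variables x 512 grid points.
raw_months = [np.random.rand(120, 5, 512) for _ in range(12)]

# Raw layout: one file per month, all variables interleaved.
# Application-oriented layout: one contiguous time series per variable,
# which is what the CERA BLOB tables provide.
per_variable = np.concatenate(raw_months, axis=0).transpose(1, 0, 2)
temperature_series = per_variable[0]   # shape (1440, 512): one year, one field
print(temperature_series.shape)
```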

7 Limits of model resolution
ECHAM4 (T42): grid resolution 2.8°, time step 40 min
ECHAM4 (T106): grid resolution 1.1°, time step 20 min
Figure: Noreiks (MPIM), 2001

8 CERA Concept: Semantic Data Management
(I) Data catalogue and pointers to Unix files
- Enable search and identification of data
- Allow access to the data as they are
(II) Application-oriented data storage
- Time series of individual variables are stored as BLOB entries in DB tables, allowing fast and selective data access
- Storage in a standard file format (GRIB) allows application of standard data processing routines (the PINGOs)
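A minimal sketch of idea (II), using SQLite as a stand-in for the production DBMS (table and column names are made up, not the actual CERA schema): one row per GRIB-encoded 2D field lets a client fetch exactly the variable and time range it needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE blob_data (
    variable  TEXT,     -- e.g. 'temperature'
    timestep  INTEGER,  -- position in the time series
    field     BLOB      -- one 2D global field, GRIB-encoded
)""")

# Store a fake GRIB record; real entries are typically 17-100 kB each.
conn.execute("INSERT INTO blob_data VALUES (?, ?, ?)",
             ("temperature", 0, b"GRIB...fake-record...7777"))

# Selective access: fetch only the requested variable's time series,
# instead of transferring whole raw-data files.
rows = conn.execute("SELECT field FROM blob_data WHERE variable = ? "
                    "ORDER BY timestep", ("temperature",)).fetchall()
print(len(rows))
```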

9 CERA Database System
Diagram: a web-based user interface (catalogue inspection, climate data retrieval) provides Internet access to the CERA database system and the DKRZ mass storage archive.
Parts of the CERA DB: data catalogue, processed climate data, pointers to raw data files.
CERA database: 7.1 TB (12.2001); mass storage archive: 210 TB neglecting security copies (12.2001).
Current database size: 20.5074 TB
Number of experiments: 298
Number of datasets: 29,715
Number of BLOBs within CERA at 03-SEP-03: 1,262,566,234
Typical BLOB sizes: 17 kB and 100 kB
Number of data retrievals: 1,500-8,000 per month
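As a quick plausibility check (not from the slides), the quoted size and BLOB count imply an average BLOB size near the smaller of the two typical values:

```python
# Rough consistency check of the quoted statistics.
db_size_bytes = 20.5074e12           # 20.5074 TB
blob_count = 1_262_566_234
print(db_size_bytes / blob_count / 1e3)  # ~16.2 kB, near the typical 17 kB
```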

10 CERA Data: Jan. Temp.

11 (figure slide, no transcript text)

12 CERA-2 Data Model Blocks
Metadata Entry: the central CERA block, providing information on
- the entry's title, type and relation to other entries
- the project the data belong to
- a summary of the entry
- a list of general keywords related to the data
- creation and review dates of the metadata
Additionally, modules and local extensions:
- Module DATA_ORGANIZATION (grid structure)
- Module DATA_ACCESS (physical storage)
- Local extensions for specific information on, e.g., data usage, data access and data administration
Coverage: information on the volume of space-time covered by the data
Reference: any publication related to the data, together with the publication form
Status: status information like data quality, processing steps, etc.
Distribution: distribution information including access restrictions, data format and fees if necessary
Contact: data related to contact persons and institutes like distributor, investigator, and owner of copyright
Parameter: block describing the data topic, variable and unit
Spatial Reference: information on the coordinate system used
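One way to picture the block structure is as nested record types. The sketch below invents concrete fields for a few blocks; it is illustrative only, not the actual CERA-2 schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Coverage:             # space-time covered by the data
    region: str
    start: str              # ISO dates, e.g. "1978-01-01"
    end: str

@dataclass
class Parameter:            # data topic, variable and unit
    topic: str
    variable: str
    unit: str

@dataclass
class MetadataEntry:        # the central CERA block
    title: str
    entry_type: str
    project: str
    summary: str
    keywords: List[str] = field(default_factory=list)
    coverage: Optional[Coverage] = None
    parameter: Optional[Parameter] = None

entry = MetadataEntry(
    title="ECHAM4 T106 example experiment",   # hypothetical entry
    entry_type="experiment",
    project="WDCC",
    summary="Example catalogue entry",
    keywords=["climate model", "temperature"],
    coverage=Coverage("global", "1978-01-01", "2002-12-31"),
    parameter=Parameter("atmosphere", "2m temperature", "K"),
)
print(entry.title)
```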

13 Data Model Functions
The CERA-2 data model ...
- allows for data search according to discipline, keyword, variable, project, author, geographical region and time interval, and for data retrieval.
- allows for specification of data processing (aggregation and selection) without touching the primary data.
- is flexible with respect to local adaptations and storage of different types of geo-referenced data.
- is open for cooperation and interchange with other database systems.
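In catalogue terms, the search function amounts to filtering metadata entries on the attributes listed above; a hypothetical sketch (names invented, not the CERA interface):

```python
from typing import Iterable

# Hypothetical catalogue search over metadata entries (plain dicts here),
# mirroring the searchable attributes listed above.
def search(entries: Iterable[dict], **criteria) -> list:
    return [e for e in entries
            if all(e.get(k) == v for k, v in criteria.items())]

catalogue = [
    {"project": "WDCC", "variable": "2m temperature", "region": "global"},
    {"project": "WDCC", "variable": "precipitation", "region": "global"},
]
print(search(catalogue, variable="precipitation"))
```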

14 Data Structure in CERA DB
Level 1 interface: metadata entries (XML, ASCII)
- Experiment description, with pointers to Unix files
- Dataset 1 description ... dataset n description
Level 2 interface: separate files containing BLOB table data
- Each dataset description points to its BLOB data table

15 Automatic Fill Process (AFP)
Creation of the application-oriented data storage must be automatic because of the large archive rates!

16 Archive Data Flow per Month
Compute server -> common file system (Unix files, application-oriented data hierarchy) -> mass storage archive: 60 TB/month.
Common file system -> CERA DB system (metadata initialisation, application-oriented data hierarchy): 2003: 4 TB/month; 2004: 12 TB/month; 2005+: 20 TB/month.
Important: the automatic fill process has to be performed before the corresponding files migrate to the mass storage archive.

17 Automatic Fill Process (AFP) Steps and Relations
DB server:
1. Initialisation of the CERA DB: metadata and BLOB data tables are created.
Compute server:
1. The climate model calculation starts with the first month.
2. The next model month starts, together with primary data processing of the previous month; BLOB table input is produced and stored in the dynamic DB fill cache.
3. Step 2 is repeated until the end of the model experiment.
DB server:
1. BLOB data table input is accessed from the DB fill cache.
2. BLOB table injection and update of the metadata.
3. Step 2 is repeated until a table partition is filled (BLOB table fill cache).
4. Close the partition, write the corresponding DB files to the HSM archive, open a new partition and continue with step 2.
5. Close the entire table and update the metadata after the end of the model experiment.
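Schematically, the compute-server loop and the DB-server loop are decoupled by the fill cache; the toy sketch below mirrors that control flow with invented names and a tiny partition size.

```python
# Toy sketch of the AFP control flow; all names are invented.
PARTITION_LIMIT = 3            # BLOB inputs per partition (10 GB in production)

def automatic_fill(model_months):
    fill_cache = []            # dynamic DB fill cache, written by the compute server
    partition, closed_partitions = [], []

    for month in model_months:                    # compute-server loop
        fill_cache.append(f"blob-input-{month}")  # primary data processing

    while fill_cache:                             # DB-server loop
        partition.append(fill_cache.pop(0))       # BLOB table injection
        if len(partition) == PARTITION_LIMIT:     # partition filled
            closed_partitions.append(partition)   # close it, send to HSM
            partition = []                        # open a new partition
    if partition:                                 # end of model experiment:
        closed_partitions.append(partition)       # close the table
    return closed_partitions

print(automatic_fill(range(1, 13)))
```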

18 AFP Disk Cache Sizes
Dynamic DB fill cache (BLOB table input time series): in order to guarantee stable operation, the fill cache should buffer data from approximately 10 days of production. The cache size is determined by the automatic data fill rate of up to 1/3 of the archive increase.

Year | AFP [TB/month] | DB fill cache [TB]
2003 | 4 | 1.5
2004 | 12 | 4
2005-2007 | 20 | 7
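The cache column follows from the 10-day rule applied to the AFP rates; the published figures appear to be rounded up, as this quick check shows.

```python
# The 10-day buffering rule applied to the AFP rates; the published
# table rounds the results up (1.33 -> 1.5 TB, 6.67 -> 7 TB).
for year, afp_tb_per_month in (("2003", 4), ("2004", 12), ("2005-2007", 20)):
    cache_tb = afp_tb_per_month * 10 / 30   # ~10 days of production
    print(year, round(cache_tb, 2), "TB")
```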

19 AFP Disk Cache Sizes
BLOB table fill cache (open BLOB table partitions). Its size depends on the BLOB table partition = table space = DB file: 10 GB (adapted to the UniTree environment; data are in GRIB format).
10 GB per table partition results in 5 TB per fill stream for the standard set of variables (approx. 500 2D global fields) of the current climate model.
Number of parallel fill streams (climate model calculations): 8
An additional 25% is needed for HSM transfer of closed partitions and error tracking.
The BLOB table partition cache results as (8 streams * 5 TB/stream) * 1.25 = 50 TB.

20 CERA Access Statistics Summary:
The system is used, mainly by external users. Application-dependent data storage allows precise data access and reduces the data transfer volume by a factor of 100 compared to direct file transfer. But at present, data access via the CERA DB is only a few percent of DKRZ's data download.

21 Countries using the CERA DB
URL: http://cera-www.dkrz.de/CERA/index.html

