
1 Long-term Archiving of Climate Model Data at WDC Climate and DKRZ Michael Lautenschlager WDC Climate / Max-Planck-Institute for Meteorology, Hamburg Wolfgang Stahl + Joachim Biercamp German Climate Computing Centre (DKRZ) Hamburg Visit at NCAR October 27th – 29th, 2008 in Boulder, USA

2 NCAR (Oct. 27-29, 2008), Lautenschlager (WDCC/MPI-M). DKRZ: Earth system model development; simulations of past, present and future climate. WDC Climate: long-term data archiving; interdisciplinary data dissemination.

3 Block diagram of the HLRE-II system at DKRZ: 250 IBM Power6 nodes (240 compute, 10 I/O); GPFS file system on IBM DS5300 (2-5 PByte); StorageTek silos with a total capacity of 60000 tapes, approx. 60 PB (LTO and Titan).

4 Increase in installed compute power motivates finer spatial and temporal model resolution and the integration of additional physical and chemical processes into climate models.

5 The next generation of compute server (HLRE-II) and of climate models at DKRZ implies an increase in data production, with implications for long-term archiving. HLRE-II: compute power increases by a factor of 60 (sustained). Experience at DKRZ: data production increases linearly with installed compute power. Previous data storage strategy: all data were migrated to the long-term mass storage archive, so the archive increase followed the compute power increase directly. Resulting problem: since the total budget for investment and operations is fixed, the cost balance between compute service and data service shifts towards data service and reduces the compute service fraction. This is no longer feasible for HLRE-II at DKRZ. The long-term archive increase has therefore been limited to 10 PB/year, which is five times the present data archive increase. The database increase of WDCC has been limited to 1 PB/year (presently 60 – 100 TB/year).
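The limits on this slide imply a few further figures that are easy to check. The sketch below is plain arithmetic on the slide's numbers; the variable names are illustrative, "five times" is read as a factor of 5, 1 PB is taken as 1000 TB, and the "unconstrained" projection assumes that the linear scaling of data production with compute power would carry over unchanged to HLRE-II.

```python
# Figures taken from the slide.
archive_limit_pb_per_year = 10.0          # new long-term archive limit
limit_vs_present_factor = 5.0             # limit read as 5x the present increase
compute_power_factor = 60.0               # sustained compute power increase of HLRE-II
wdcc_limit_tb_per_year = 1000.0           # 1 PB/year, assuming 1 PB = 1000 TB
wdcc_present_tb_per_year = (60.0, 100.0)  # present WDCC increase (range)

# Present archive increase implied by the factor of five.
present_increase_pb_per_year = archive_limit_pb_per_year / limit_vs_present_factor

# Assumption: data production keeps scaling linearly with compute power
# and everything is still migrated to the archive (the previous strategy).
unconstrained_pb_per_year = present_increase_pb_per_year * compute_power_factor

# Growth that the WDCC limit allows relative to the present range.
wdcc_growth_factors = tuple(
    wdcc_limit_tb_per_year / present for present in wdcc_present_tb_per_year
)

print(present_increase_pb_per_year)
print(unconstrained_pb_per_year)
print(wdcc_growth_factors)
```

Under these assumptions the archive limit is well below what the previous "archive everything" strategy would produce, which is the motivation for the new storage strategy described on the following slides.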

6 Data archive at DKRZ. Compute server architectures: Cray C90 (1996-2003) / HLRE: NEC SX-6 (2003-2008) / HLRE-II: IBM Power6 (2009-2013+). HLRE stands for Höchstleistungsrechnersystem für die Erdsystemforschung (high-performance computing system for Earth system research).

7 Increase of WDCC data archive

8 WDCC data downloads for 2007 (catalogue accesses neglected)

9 Analysis of data classes
- Test data from model code development; life cycle: weeks to months
- Project data from scientific model evaluation and research projects (DKRZ resources at project level); life cycle: 3 – 5 years
- Final results as data products for international projects (IPCC) and scientific publications; life cycle: 10 years and longer
Resulting data hierarchy levels
- Temp(orary): scratch discs at the compute server
- Work: fixed disc space at project level for evaluation
- Arch(ive): tape storage space (single copy) with expiration date, for project data beyond the available disc space
- Docu(mentation): documented long-term tape archive (security copy) for data products; focus on interdisciplinary data utilisation; data are fixed and no longer subject to change
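The relation between data classes and hierarchy levels can be sketched as a small lookup. This is only an illustration: the slide lists the classes and the levels but does not spell out the mapping, so the assignment below (test data to "temp", project data to "work" or "arch", final results to "docu") is an assumption, and the names `Level` and `choose_level` are hypothetical.

```python
from enum import Enum


class Level(Enum):
    TEMP = "temp"  # scratch discs at the compute server
    WORK = "work"  # fixed disc space at project level
    ARCH = "arch"  # single tape copy with expiration date
    DOCU = "docu"  # documented long-term tape archive with security copy


def choose_level(data_class: str, fits_on_work_disc: bool = True) -> Level:
    """Pick a hierarchy level for a data class (assumed mapping)."""
    if data_class == "test":
        return Level.TEMP
    if data_class == "project":
        # Project data beyond the available disc space goes to tape.
        return Level.WORK if fits_on_work_disc else Level.ARCH
    if data_class == "final":
        return Level.DOCU
    raise ValueError(f"unknown data class: {data_class!r}")


print(choose_level("project", fits_on_work_disc=False))
```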

10 Tape space distribution to archive classes at DKRZ at the beginning of 2007: part of the “work” space is on tape because GFS is too small; the “docu” domain consists of WDCC; there are no expiration dates in the “arch” domain, and parts of the “arch” domain belong to “docu” but are not yet documented.

11 The new project-based data storage strategy and resource assignment at DKRZ comprises:
- Separation of project data and long-term archive
- Expiration date for project data
- A deliberate scientific decision to move data into the long-term archive within the given archive limits
- Data documentation requirements for the long-term archive
The long-term data archive (“docu” hierarchy level) complies with the rules for good scientific practice.

12 Data documentation requirements are met by using the WDCC infrastructure.
CERA-2 metadata model, developed in 1999
- Catalogue interface: cera.wdc-climate.de
- Input interface: input.wdc-climate.de
CERA-2 metadata content is complete with respect to browsing, discovering and using climate data that are stored in the database system or outside it in flat files.
Missing: structured information on data provenance (topic of the EU project METAFOR).
The WDCC conforms to international description standards such as ISO 19115, Dublin Core and GCMD, and is integrated in international data federations.
The data storage structure stores climate time series per variable in BLOB data tables. This allows web-based data catalogue search and data access in small data granules.

13 WDCC / CERA: general statistics as of 01-10-2008 00:00:10. Database size (TByte): 370. Number of BLOBs: 6663287791 (6.6 billion). Data access is by fields and not by files. Number of experiments: 1146. Number of datasets: 142062. Total size divided by the number of BLOBs gives the average size of data access granules: 50 - 60 kB/BLOB.
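The average granule size can be reproduced from the two figures on the slide. The slide does not say whether "TByte" is meant in decimal or binary units, so this sketch computes both; the variable names are illustrative.

```python
# Figures taken from the slide.
database_size_tbyte = 370
number_of_blobs = 6_663_287_791

# Decimal interpretation: 1 TByte = 10**12 bytes, 1 kB = 1000 bytes.
avg_kb = database_size_tbyte * 10**12 / number_of_blobs / 1000

# Binary interpretation: 1 TByte = 2**40 bytes, 1 KiB = 1024 bytes.
avg_kib = database_size_tbyte * 2**40 / number_of_blobs / 1024

print(f"{avg_kb:.1f} kB per BLOB (decimal), {avg_kib:.1f} KiB per BLOB (binary)")
```

Both interpretations fall inside the 50 - 60 kB/BLOB range quoted on the slide.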

14 WDCC user categories
Experienced user
- From numerical model development and analysis
- Consolidated knowledge of model data structure and limitations of model results
- Application experience in tools and infrastructure to process model raw data (files)
- Familiar with Unix environments and programming languages
Non-experienced user
- From climate mitigation and adaptation
- Only little knowledge of model data structure and limitations of model results
- Requires application-adapted model data products and field-based data access
- Familiar with MS-Windows environments and Office tools

15 CERA data model (diagram). Blocks: Entry, Reference, Status, Distribution, Contact, Coverage, Parameter, Spatial Reference, Local Adm., Data Access, Data Org.

16 Coloured columns correspond to BLOB data tables in WDCC. Collections of matrix rows represent storage in model raw data files (complete model output, storage time step by storage time step).
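The row/column picture on this slide amounts to a transposition: raw files hold all variables for one storage time step, while each BLOB table holds one variable for all time steps. The following is a minimal in-memory sketch of that idea, not the WDCC implementation; the variable names, the byte strings and the helper `rows_to_columns` are hypothetical.

```python
# Hypothetical raw model output: one record per storage time step (matrix row),
# each holding the encoded field of every variable.
raw_rows = [
    {"time": 0, "temperature": b"T0", "pressure": b"P0"},
    {"time": 6, "temperature": b"T1", "pressure": b"P1"},
    {"time": 12, "temperature": b"T2", "pressure": b"P2"},
]


def rows_to_columns(rows):
    """Regroup time-step records into one time series of BLOBs per variable."""
    tables = {}
    for row in rows:
        for variable, blob in row.items():
            if variable == "time":
                continue
            tables.setdefault(variable, []).append((row["time"], blob))
    return tables


blob_tables = rows_to_columns(raw_rows)

# A single small granule (one field at one time step) can now be addressed
# without reading a complete raw file.
print(blob_tables["temperature"][1])
```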

17 Additionally, WDCC offers a primary data publication service for final data entities that are of general scientific interest, following the STD-DOI concept (Scientific and Technical Data – Digital Object Identifier, URL: www.std-doi.de). Important aspects of the publication process are:
- the identification of independent data entities that are suitable for publication at the level of scientific literature,
- the execution of a thorough review process for metadata and climate data,
- the assignment of additional metadata for electronic publication (ISO 690-2) and of persistent identifiers (DOI / URN), and
- the integration of publication metadata and persistent identifiers into the TIB library catalogue (Technical Information Library, Hannover), so that primary data entities are searchable and citable together with scientific literature.
The quality characteristic is presently “approved by author”; future development should be “peer reviewed”.

18 STD-DOI data publication workflow

20 Data infrastructure integrates data stewardship in the long-term archive: bit-stream preservation, quality assurance, usability enabling.

21 Long-term archive data stewardship: bit-stream preservation
- Secondary tape copies on different tapes and technology at a separate location
- Copy to new tapes after the maximum number of tape accesses is reached (refreshment)
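The two preservation rules can be expressed as simple checks. This is a sketch under stated assumptions: the `Tape` record, its fields, the example labels and the access limits are all hypothetical and do not describe the actual DKRZ tape management software.

```python
from dataclasses import dataclass


@dataclass
class Tape:
    label: str
    technology: str
    location: str
    access_count: int
    max_accesses: int  # hypothetical per-medium limit


def needs_refresh(tape: Tape) -> bool:
    """Refreshment rule: copy to a new tape once the access limit is reached."""
    return tape.access_count >= tape.max_accesses


def is_valid_secondary(primary: Tape, secondary: Tape) -> bool:
    """Secondary copy rule: different tape, different technology, separate location."""
    return (
        secondary.label != primary.label
        and secondary.technology != primary.technology
        and secondary.location != primary.location
    )


primary = Tape("A001", "T10000", "site-1", access_count=5000, max_accesses=5000)
secondary = Tape("B001", "LTO", "site-2", access_count=12, max_accesses=5000)

print(needs_refresh(primary), is_valid_secondary(primary, secondary))
```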

22 DKRZ archive development

23 DKRZ archive increase and transition in tape technology: in 2002 most data were on SD3 (helical scan); migration to 9940A and 9940B as they became available; migration to T1A (Titanium T10000).

24 Number of files in the DKRZ archive on different tape media. Small files are stored on 9840C (small capacity but fast access). First peak: the start of the NEC SX-6 yields an exponential increase in the number of files and imposes the implementation of file quotas. Second peak: delay in cleaning up the number of files.

25 Long-term archive data stewardship (continued): quality assurance
Semantic examinations: behavior of a numerical model compared to observations and to other models; part of the scientific evaluation process.
Syntactic examinations: formal aspects of data archiving, ensuring that data archiving is free of errors as far as possible
- Consistency between metadata and climate data
- Completeness of climate data
- Standard range of values (expectation ranges and simple data statistics)
- Spatial and temporal data arrangement
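The four syntactic examinations lend themselves to automated checks. The sketch below shows one possible shape for them on a single time series; the function `check_dataset`, the metadata keys `n_steps` and `valid_range`, and the sample values are assumptions made for illustration and are not taken from the WDCC quality assurance procedure. Only the temporal part of the data arrangement is checked here.

```python
def check_dataset(metadata, times, values):
    """Return a list of syntactic problems found in one time series."""
    problems = []

    # Consistency between metadata and climate data.
    if metadata["n_steps"] != len(values):
        problems.append("metadata/data mismatch")

    # Completeness of climate data.
    if len(times) != len(values) or any(v is None for v in values):
        problems.append("incomplete data")

    # Standard range of values (expectation range).
    low, high = metadata["valid_range"]
    if any(v is not None and not (low <= v <= high) for v in values):
        problems.append("value out of range")

    # Temporal data arrangement: time axis must be strictly increasing.
    if any(later <= earlier for earlier, later in zip(times, times[1:])):
        problems.append("time axis not strictly increasing")

    return problems


meta = {"n_steps": 3, "valid_range": (150.0, 350.0)}  # hypothetical metadata
good = check_dataset(meta, [0, 6, 12], [280.1, 281.4, 279.9])
bad = check_dataset(meta, [0, 12, 6], [280.1, None, 999.0])
print(good, bad)
```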

26 Long-term archive data stewardship (continued): usability enabling
- Complete and searchable documentation of climate data entities (database tables and flat files) in the catalogue system of the WDCC
- WDCC offers web-based data access to small data granules (individual entries in BLOB DB tables)
- Archive technology transitions must be backward compatible to keep old data technically readable
- Data processing tools and data format access libraries must be migrated to new architectures

27 WDCC Architecture

28 Summary
- The DKRZ long-term data archive will continue to grow, but more slowly than linearly with the installed compute power.
- The key increase limits are 10 PB/year for the long-term archive and 1 PB/year for WDCC.
- Reliability of the long-term archive improves because more emphasis is placed on data stewardship than on technical data service operations.
- Ultimately, the new data archive concept will result in a completely documented and searchable long-term data archive.
- In the future, more server-side data processing is requested for on-site data reduction, on-the-fly generation of application data products, and visualisation at working level and for presentations.

29 References
2008: Lautenschlager, M. Preservation of Earth System Model Data. In: Digital Preservation Europe, Briefing Paper, 30th June 2008. URL: http://www.digitalpreservationeurope.eu/publications/briefs/ (preservation-of-earth-system-model-data; size: 95 Kbyte, type: pdf)
2007: Lautenschlager, M., Stahl, W. Long-Term Archiving of Climate Model Data at WDC Climate and DKRZ. In: E. Mikusch (Ed.): PV2007 - Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical Data, Conference Proceedings. DLR, German Remote Sensing Data Center, Oberpfaffenhofen, 2007. URL: http://www.mad.zmaw.de/service-support/publications/ (size: 2.9 Mbyte, type: pdf)

