

Long-term Archiving of Climate Model Data at WDC Climate and DKRZ

Michael Lautenschlager, WDC Climate / Max-Planck-Institute for Meteorology, Hamburg
Wolfgang Stahl + Joachim Biercamp, German Climate Computing Centre (DKRZ), Hamburg

Visit at NCAR, October 27th – 29th, 2008, in Boulder, USA

NCAR (Oct 27 – 29, 2008), Lautenschlager (WDCC/MPI-M)

DKRZ:
- Earth system model development
- Simulations of past, present and future climate

WDC Climate:
- Long-term data archiving
- Inter-disciplinary data dissemination

Block diagram: HLRE-II system at DKRZ
- 250 IBM Power6 nodes (240 compute, 10 I/O)
- GPFS file system on IBM DS5300 (2 – 5 PByte)
- StorageTek silos, total tape capacity approx. 60 PB (LTO and Titan)

Increase in installed compute power motivates finer spatial and temporal model resolution and integration of additional physical and chemical processes into climate models.

The next generation of compute server (HLRE-II) and of climate models at DKRZ implies an increase in data production, with implications for long-term archiving.
- HLRE-II: compute power increases by a factor of 60 (sustained)
- Experience at DKRZ: data production increases linearly with installed compute power
- Previous data storage strategy: all data were migrated to the long-term mass storage archive, so the archive increase directly followed the compute power increase
- Resulting problem: since the total amount of money for investment and operations is fixed, the cost relation between compute service and data service shifts towards data service and reduces the compute service fraction

This is no longer feasible for HLRE-II at DKRZ. The long-term archive increase has therefore been limited to 10 PB/year, which is five times the present data archive increase. The database increase of the WDCC has been limited to 1 PB/year (presently 60 – 100 TB/year).
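The size of the gap between linear growth and the new limit can be estimated with a few lines of arithmetic. This is a rough sketch only: the present archive increase of 2 PB/year is not stated on the slide but inferred from "10 PB/year is five times the present increase", and the function names are illustrative.

```python
# Figures from the slide, except PRESENT_GROWTH_PB, which is inferred
# (assumption): the 10 PB/year cap is said to be five times the present increase.
COMPUTE_FACTOR = 60          # sustained compute power increase with HLRE-II
ARCHIVE_CAP_PB = 10.0        # new limit on long-term archive increase per year
PRESENT_GROWTH_PB = ARCHIVE_CAP_PB / 5


def unconstrained_growth_pb(present_pb, compute_factor):
    """Yearly archive increase if it kept scaling linearly with compute power."""
    return present_pb * compute_factor


def archived_fraction(cap_pb, produced_pb):
    """Share of the produced data that fits under the yearly archive cap."""
    return min(1.0, cap_pb / produced_pb)


linear_growth = unconstrained_growth_pb(PRESENT_GROWTH_PB, COMPUTE_FACTOR)
fraction = archived_fraction(ARCHIVE_CAP_PB, linear_growth)
print(f"linear scaling: {linear_growth:.0f} PB/year, archivable share: {fraction:.1%}")
```

Under these assumptions only about one twelfth of the linearly scaled data volume could enter the long-term archive, which is why a deliberate selection of data becomes necessary.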

Data archive at DKRZ
Compute server architectures: Cray C90 / HLRE: NEC SX-6 / HLRE-II: IBM Power6
(HLRE: Höchstleistungsrechnersystem für die Erdsystemforschung, i.e. high-performance computing system for Earth system research)

Increase of WDCC data archive

WDCC data downloads for 2007 (catalogue accesses neglected)

Analysis of data classes
- Test data from model code development; life cycle: weeks to months
- Project data from scientific model evaluation and research projects (DKRZ resources at project level); life cycle: 3 – 5 years
- Final results as data products for international projects (IPCC) and scientific publications; life cycle: 10 years and longer

Resulting data hierarchy levels
- Temp(orary): scratch discs at the compute server
- Work: fixed disc space at project level for evaluation
- Arch(ive): tape storage space (single copy) with expiration date, for project data beyond the available disc space
- Docu(mentation): documented long-term tape archive (with security copy) for data products; focus on interdisciplinary data utilisation; data are fixed and no longer subject to change
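The four hierarchy levels can be pictured as a small lookup table with an expiration rule. The following is a minimal sketch, not the DKRZ implementation: the concrete lifetimes are assumptions picked from the upper end of the life cycles named above, and `StorageLevel`, `LEVELS` and `is_expired` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass(frozen=True)
class StorageLevel:
    name: str
    medium: str
    max_lifetime: Optional[timedelta]  # None means no expiration date


# Lifetimes are illustrative assumptions, not DKRZ policy values.
LEVELS = {
    "temp": StorageLevel("temp", "scratch disc", timedelta(weeks=26)),
    "work": StorageLevel("work", "project disc", timedelta(days=5 * 365)),
    "arch": StorageLevel("arch", "tape, single copy", timedelta(days=5 * 365)),
    "docu": StorageLevel("docu", "tape, with security copy", None),
}


def is_expired(level_name, created, today):
    """Return True if data stored at this level has passed its expiration date."""
    level = LEVELS[level_name]
    if level.max_lifetime is None:
        return False  # documented long-term archive: data are kept
    return today - created > level.max_lifetime


print(is_expired("arch", date(2002, 1, 1), date(2008, 10, 27)))
print(is_expired("docu", date(1990, 1, 1), date(2008, 10, 27)))
```

Only the "docu" level has no expiration, which mirrors the idea that data products in the long-term archive are fixed and kept.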

Tape space distribution to archive classes at DKRZ at the beginning of 2007
- Part of the “work” space is on tape because the GFS is too small
- The “docu” domain consists of the WDCC
- There are no expiration dates in the “arch” domain; parts of the “arch” domain belong to “docu” but are not yet documented

The new project-based data storage strategy and resource assignment at DKRZ comprises:
- Separation of project data and long-term archive
- Expiration date for project data
- A conscious, scientific decision to move data into the long-term archive within the given archive limits
- Data documentation requirements for the long-term archive

The long-term data archive (“docu” hierarchy level) complies with the rules of good scientific practice.

Data documentation requirements are met by using the WDCC infrastructure.
- CERA-2 metadata model, developed in 1999
  - Catalogue interface: cera.wdc-climate.de
  - Input interface: input.wdc-climate.de
- CERA-2 metadata content is complete with respect to browsing, discovering and using climate data that are stored in the database system or outside it in flat files
- Missing: structured information on data provenance (topic of the EU project METAFOR)
- The WDCC matches international description standards such as ISO 19115, Dublin Core and GCMD, and is integrated into international data federations
- The data storage structure holds climate time series per variable in BLOB data tables. This allows web-based data catalogue search and data access in small data granules.

WDCC / CERA: General Statistics at :00:10
- Database size (TByte): 370
- Number of BLOBs: (6.6 billion); data access is by fields and not by files
- Number of experiments: 1146
- Number of datasets:
- Total size divided by number of BLOBs gives the average size of data access granules: kB/BLOB
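The division described in the last item can be repeated with the two rounded figures that survive in this transcript. This is only an order-of-magnitude sketch: it assumes decimal units (1 TByte = 10^12 bytes, 1 kB = 10^3 bytes) and the rounded BLOB count of 6.6 billion, so it need not match the exact value shown on the original slide.

```python
# Rounded figures from the statistics above; decimal units are an assumption.
DATABASE_SIZE_TBYTE = 370
NUMBER_OF_BLOBS = 6.6e9


def average_granule_kb(size_tbyte, n_blobs):
    """Average size of one data access granule (BLOB) in kB."""
    size_bytes = size_tbyte * 1e12
    return size_bytes / n_blobs / 1e3


avg_kb = average_granule_kb(DATABASE_SIZE_TBYTE, NUMBER_OF_BLOBS)
print(f"average granule size: {avg_kb:.0f} kB/BLOB")
```

With these inputs the average granule comes out at a few tens of kilobytes, which illustrates how small the units of web-based data access are compared to complete model output files.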

WDCC user categories

Experienced user
- From numerical model development and analysis
- Consolidated knowledge of model data structure and limitations of model results
- Application experience with tools and infrastructure to process model raw data (files)
- Familiar with Unix environments and programming languages

Non-experienced user
- From climate mitigation and adaptation
- Only little knowledge of model data structure and limitations of model results
- Requires application-adapted model data products and field-based data access
- Familiar with MS Windows environments and Office tools

CERA data model (diagram). Blocks: Entry, Reference, Status, Distribution, Contact, Coverage, Parameter, Spatial Reference, Local Adm., Data Access, Data Org.

Coloured columns correspond to BLOB data tables in the WDCC. Collections of matrix rows represent storage in model raw data files (complete model output, storage time step by storage time step).
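The difference between the two layouts can be sketched as a transposition: raw model output holds all variables for one time step, whereas a BLOB table holds all time steps for one variable. The code below is a toy illustration with plain Python structures; the variable names, the values and the function `to_variable_series` are hypothetical and do not reflect the CERA schema.

```python
def to_variable_series(raw_records):
    """Regroup time-step-ordered records (matrix rows) into one
    time series per variable (matrix columns)."""
    series = {}
    for step, record in enumerate(raw_records):
        for variable, field in record.items():
            # Each (step, field) pair stands for one small data granule.
            series.setdefault(variable, []).append((step, field))
    return series


# Two storage time steps of raw model output, each with all variables.
raw_records = [
    {"temperature": [280.1, 281.3], "precipitation": [0.0, 1.2]},
    {"temperature": [280.4, 281.0], "precipitation": [0.3, 0.9]},
]

series = to_variable_series(raw_records)

# Field-based access: fetch one variable at one time step without
# reading the complete model output of that step.
granule = series["temperature"][1]
print(granule)
```

Storing data per variable makes the typical request of a non-expert user (one field, a limited period) cheap, at the cost of regrouping the raw output once at ingest time.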

In addition, the WDCC offers a primary data publication service for final data entities that are of general scientific interest, following the STD-DOI concept (Scientific and Technical Data – Digital Object Identifier). Important aspects of the publication process are:
- the identification of independent data entities that are suitable for publication at the level of scientific literature,
- the execution of an elaborate review process for metadata and climate data,
- the assignment of additional metadata for electronic publication (ISO 690-2) and of persistent identifiers (DOI / URN), and
- the integration of publication metadata and persistent identifiers into the TIB library catalogue (Technical Information Library, Hannover), so that primary data entities are searchable and citable together with scientific literature.

The quality characteristic is presently “approved by author”; future development should be “peer reviewed”.

STD-DOI data publication workflow


Data infrastructure integrates data stewardship in the long-term archive:
- Bit-stream preservation
- Quality assurance
- Usability enabling

Long-term archive data stewardship: bit-stream preservation
- Secondary tape copies on different tapes and technology at a separate location
- Copy to new tapes after the maximum number of tape accesses is reached (refreshment)
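Both rules can be expressed as simple predicates over tape records. This is a sketch under stated assumptions: the access threshold `MAX_TAPE_ACCESSES`, the class `TapeCopy`, the labels and the site names are all hypothetical, since the slide gives no concrete values.

```python
from dataclasses import dataclass

MAX_TAPE_ACCESSES = 5000  # hypothetical threshold; the real limit is media-specific


@dataclass
class TapeCopy:
    label: str
    technology: str
    location: str
    access_count: int = 0


def needs_refresh(tape, max_accesses=MAX_TAPE_ACCESSES):
    """True once the tape has reached its maximum number of accesses
    and its content should be copied to a new tape."""
    return tape.access_count >= max_accesses


def is_independent_copy(primary, secondary):
    """True if the secondary copy is on a different tape, uses a different
    technology and is kept at a separate location."""
    return (
        primary.label != secondary.label
        and primary.technology != secondary.technology
        and primary.location != secondary.location
    )


primary = TapeCopy("A00017", "T10000", "site A", access_count=5200)
secondary = TapeCopy("B00411", "LTO", "site B", access_count=12)

print(needs_refresh(primary), needs_refresh(secondary))
print(is_independent_copy(primary, secondary))
```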

DKRZ archive development

DKRZ archive increase and transition in tape technology
- In 2002 most data were on SD3 (helical scan)
- Migration to 9940A and 9940B once they became available
- Migration to T1A (Titanium T10000)

Number of files in the DKRZ archive on different tape media
- Small files are stored on 9840C (small capacity but fast access)
- 1st peak: the start of the NEC SX-6 yields an exponential increase in the number of files and imposes the implementation of file quotas
- 2nd peak: delay in cleaning up the number of files

Long-term archive data stewardship (continued): quality assurance
- Semantic examinations: behaviour of a numerical model compared to observations and to other models; part of the scientific evaluation process
- Syntactic examinations: formal aspects of data archiving, ensuring that data archiving is free of errors as far as possible
  - Consistency between metadata and climate data
  - Completeness of climate data
  - Standard range of values (expectation ranges and simple data statistics)
  - Spatial and temporal data arrangement
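The syntactic examinations lend themselves to automation. The sketch below shows one possible shape for such checks on a single time series; it is not the WDCC quality control code, and the metadata keys (`n_values`, `expected_steps`, `valid_min`, `valid_max`) as well as all function names are hypothetical.

```python
def check_metadata_consistency(metadata, values):
    """Metadata must describe the data actually stored."""
    return metadata["n_values"] == len(values)


def check_completeness(times, expected_steps):
    """No storage time step may be missing."""
    return len(times) == expected_steps


def check_value_range(values, lower, upper):
    """All values must lie inside the expectation range."""
    return all(lower <= v <= upper for v in values)


def check_temporal_arrangement(times):
    """Time steps must be strictly increasing."""
    return all(a < b for a, b in zip(times, times[1:]))


def run_syntactic_checks(metadata, times, values):
    """Run all checks and report the result of each one by name."""
    return {
        "consistency": check_metadata_consistency(metadata, values),
        "completeness": check_completeness(times, metadata["expected_steps"]),
        "value_range": check_value_range(
            values, metadata["valid_min"], metadata["valid_max"]
        ),
        "arrangement": check_temporal_arrangement(times),
    }


# Toy example: four time steps of a temperature series in Kelvin,
# one of them with an implausible value.
metadata = {"n_values": 4, "expected_steps": 4, "valid_min": 150.0, "valid_max": 350.0}
times = [0, 6, 12, 18]
values = [280.1, 281.3, 999.0, 279.8]

report = run_syntactic_checks(metadata, times, values)
print(report)
```

Returning one flag per check, rather than a single pass/fail result, makes it easier to report back to the data author which formal aspect needs to be corrected.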

Long-term archive data stewardship (continued): usability enabling
- Complete and searchable documentation of climate data entities (database tables and flat files) in the catalogue system of the WDCC
- The WDCC offers web-based data access to small data granules (individual entries in BLOB DB tables)
- Archive technology transfer must be backward compatible to keep old data technically readable
- Data processing tools and data format access libraries must be migrated to new architectures

WDCC Architecture

Summary
- The DKRZ long-term data archive will still grow, but more slowly than linearly with the installed compute power
- Key increase figures are 10 PB/year for the long-term archive and 1 PB/year for the WDCC
- The reliability of the long-term archive improves because more emphasis is placed on data stewardship than on technical data service operations
- In the end, the new data archive concept will result in a completely documented and searchable long-term data archive
- In the future, more server-side data processing is requested for on-site data reduction, on-the-fly generation of application data products, and visualisation at working level and for presentations

References

Lautenschlager, M.: Preservation of Earth System Model Data. In: Digital Preservation Europe, Briefing Paper, 30th June 2008.

Lautenschlager, M., Stahl, W.: Long-Term Archiving of Climate Model Data at WDC Climate and DKRZ. In: E. Mikusch (Ed.): PV Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical Data, Conference Proceedings. DLR, German Remote Sensing Data Center, Oberpfaffenhofen, 2007.