H. Thiemann (M&D) / 19.05.2014 / 1 CERA (Climate and Environmental Retrieval and Archive) Hannes Thiemann (M&D/MPIMET, Hamburg) Kiel, 17.3.2004.

2 Data Group maintaining the WDCC: Michael Kurtz, Hans Luthardt, Michael Lautenschlager, Heinke Höck, Hannes Thiemann, Hermann Winter, Jörg Wegner, Frank Toussaint, Peter Lenzen

3 Content
- General remarks
- DKRZ archive development
- CERA concept (CERA: Climate and Environmental data Retrieval and Archiving)
- CERA data model and structure
- Automatic fill process (not presented)
- CERA user interface

4 Semantic data management
- Data consist of numbers and metadata.
- Metadata construct the semantic data context.
- Metadata form a data catalogue which makes data searchable.
- Data are produced, archived and extracted within their semantic context.
- Data without explanation are only numbers.
Problems:
- Metadata differ in complexity between data types.
- Consistency between numbers and metadata has to be ensured.

5 DKRZ Architecture
- Proc.: 24 nodes, 192 CPUs
- Memory: 1.5 TeraByte
- Perform.: 1.5 TeraFLOPS (peak), 500 GigaFLOPS (sust.)
- Tape Archive: > 3.4 PetaByte
- Disk Cache: 60 TeraByte
- Bandwidth Comp.S. – Data S.: 450 MByte/sec
- 155 Mbs

6 DKRZ Archive Development
Basic observations and assumptions:
1. Unix file archive content at the end of 2002: 600 TB, including backups
2. Observed archive rate (Jan. - May 2003): 40 TB/month
3. System changes: 50% compute power increase in August 2003
4. CERA DB size at the end of 2002: 12 TB
5. Observed increase (Jan. - May 2003): 1 TB/month
6. The automatic fill process into the CERA DB is going to become operational with 4 TB/month this year and should increase from 10% of the archiving rate to approx. 30% by the end of 2004
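The file archive figures above can be turned into a rough projection. The sketch below is only an illustration: it assumes that the archive rate grows in proportion to compute power (so 40 TB/month becomes 60 TB/month from August 2003) and stays constant afterwards. That scaling rule is an assumption and is not stated on the slide. Under it, the archive reaches roughly 1.2 PB at the end of 2003 and about 4.1 PB at the end of 2007.

```python
# Figures taken from the slide; the scaling rule below is an assumption.
START_TB = 600.0          # Unix file archive at the end of 2002, incl. backups
RATE_TB_PER_MONTH = 40.0  # observed archive rate, Jan.-May 2003
COMPUTE_FACTOR = 1.5      # 50% compute power increase in August 2003


def project_archive_tb(last_year: int) -> dict[int, float]:
    """Return the projected archive size in TB at the end of each year."""
    sizes = {}
    size = START_TB
    for year in range(2003, last_year + 1):
        for month in range(1, 13):
            # Assumption: the archive rate scales with compute power
            # from the upgrade month onwards.
            upgraded = (year, month) >= (2003, 8)
            size += RATE_TB_PER_MONTH * (COMPUTE_FACTOR if upgraded else 1.0)
        sizes[year] = size
    return sizes


projection = project_archive_tb(2007)
for year, size_tb in projection.items():
    print(year, f"{size_tb / 1000:.2f} PB")
```

A linear model like this ignores further system upgrades, so it is best read as a lower-bound style estimate.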

7 DKRZ Archive Development

8 Problems in file archive access:
- Missing data catalogue
- Data are not stored application-oriented
- Lack of experience with climate model data
- Lack of computing facilities at client site
Year                          2003    2004    2005    2006    2007
Estimated File Archive Size   1.2 PB  1.9 PB  2.6 PB  3.4 PB  4.1 PB

9 Limits of model resolution
- ECHAM4 (T42): grid resolution 2.8°, time step 40 min
- ECHAM4 (T106): grid resolution 1.1°, time step 20 min
Noreiks (MPIM), 2001

10 CERA Concept: Semantic Data Management
(I) Data catalogue and Unix files (pointer or BLOB-table entry)
- Enable search and identification of data
- Allow for data access as they are
(II) Application-oriented data storage
- Time series of individual variables are stored as BLOB entries in DB tables
- Allow for fast and selective data access
- Storage in standard file format (GRIB, NetCDF)
- Allow for application of standard data processing routines (PINGOs)
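The idea of storing the time series of one variable as BLOB entries in a database table, so that single time steps can be read selectively, can be sketched as follows. This is a minimal illustration only: it uses SQLite from the Python standard library as a stand-in database, packs raw floats instead of GRIB or NetCDF records, and the table name `temp2` and the helper functions are hypothetical.

```python
import sqlite3
import struct

# In-memory stand-in database; one table per variable (hypothetical layout).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE temp2 (step INTEGER PRIMARY KEY, field BLOB NOT NULL)")


def encode_field(values):
    """Pack one 2D field (flattened) into a binary BLOB of 32-bit floats."""
    return struct.pack(f"<{len(values)}f", *values)


def decode_field(blob):
    """Unpack a BLOB written by encode_field back into a list of floats."""
    return list(struct.unpack(f"<{len(blob) // 4}f", blob))


# Store a toy time series: 8 time steps of a 4-point "field".
for step in range(8):
    field = [273.0 + step + 0.5 * i for i in range(4)]
    conn.execute("INSERT INTO temp2 VALUES (?, ?)", (step, encode_field(field)))


def read_steps(first, last):
    """Selective access: fetch only the requested range of time steps."""
    rows = conn.execute(
        "SELECT step, field FROM temp2 WHERE step BETWEEN ? AND ? ORDER BY step",
        (first, last),
    )
    return {step: decode_field(blob) for step, blob in rows}


subset = read_steps(2, 3)
print(subset)
```

Because each time step is its own row, a request for a short period of one variable touches only those rows instead of a whole model output file.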

11 CERA Database System
Parts of CERA DB:
- CERA Database: 30 TB (12.2003)
  * Data Catalogue
  * Processed Climate Data
  * Pointer to Raw Data files
- DKRZ Mass Storage Archive: 1 PB neglecting Security Copies (12.2003)
- Web-Based User Interface: Catalogue Inspection, Climate Data Retrieval
- Internet Access: Web access to entire CERA DB content
Statistics:
- Current database size: 30 Terabyte
- Number of experiments: 318
- Number of datasets: 32042
- Number of BLOBs within CERA on 19-JAN-04: 1551518720
- Typical BLOB sizes: 17 kB and 100 kB
- Number of data retrievals: 1500 – 8000 / month
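As a quick plausibility check of these numbers, the mean BLOB size can be estimated from the database size and the BLOB count. The sketch assumes decimal units (1 TB = 10^12 bytes) and attributes the whole database to BLOB data, ignoring the catalogue tables. Both are simplifications and not statements from the slide.

```python
# Values from the slide; unit convention and "all space is BLOB data"
# are assumptions made for this estimate.
DB_SIZE_BYTES = 30 * 10**12      # 30 TB, decimal units assumed
BLOB_COUNT = 1_551_518_720       # BLOBs within CERA on 19-JAN-04

mean_blob_kb = DB_SIZE_BYTES / BLOB_COUNT / 1000
print(f"mean BLOB size: {mean_blob_kb:.1f} kB")
```

The result is about 19 kB, much closer to the smaller of the two typical BLOB sizes, which would be consistent with most BLOBs being of the 17 kB kind.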

12 Parts of CERA
- Web-Based User Interface: Catalogue Inspection, Climate Data Retrieval
- CERA Database, 30 TB (12/2003): Data Catalogue, Processed Climate Data, Pointer to Raw Data
- Mass Storage Archive, 1 PB (12/2003)

13 CERA Data: Jan. Temp.

14 CERA Data: Jan. Wind (2 x 250 MB)

15 Complete with respect to IEEE's Reference Model for Metadata (Bretherton, 1994)
- Browse, Search and Retrieval
- Ingest, Quality Assurance, Reprocessing
- Application to Application Transfer
- Storage and Archive Reference
- The CERA-2 Data Model (DKRZ-Report No. 15, 1998)
- URL: CERA-2 Data Model

16 Interoperability
Supports interoperability through the inclusion of international standards:
- Directory Interchange Format (NASA, 1998)
- FGDC Metadata Content Standard (FGDC, 1996)
- ISO Metadata Standard for Geographic Information (ISO 19115)

17 CERA-2 Data Model Blocks
Metadata Entry: This is the central CERA block, providing information on
- the entry's title, type and relation to other entries
- the project the data belong to
- a summary of the entry
- a list of general keywords related to the data
- creation and review dates of the metadata
Additionally: Modules and Local Extensions
- Module DATA_ORGANIZATION (grid structure)
- Module DATA_ACCESS (physical storage)
- Local extension for specific information on (e.g.) data usage, data access and data administration
Coverage: Information on the volume of space-time covered by the data
Reference: Any publication related to the data together with the publication form
Status: Status information like data quality, processing steps, etc.
Distribution: Distribution information including access restrictions, data format and fees if necessary
Contact: Data related to contact persons and institutes like distributor, investigator, and owner of copyright
Parameter: Block describes data topic, variable and unit
Spatial Reference: Information on the coordinate system used
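To make the block structure more concrete, the central metadata entry with a few attached blocks can be sketched as plain data classes. This is not the CERA-2 schema: the class names, field names and the `matches` helper are hypothetical and only mirror the block descriptions above. The example values are taken from the WDCC example slides of this talk.

```python
from dataclasses import dataclass, field


@dataclass
class Parameter:
    """Parameter block: data topic, variable and unit."""
    topic: str
    variable: str
    unit: str


@dataclass
class Contact:
    """Contact block: a person or institute and its role."""
    role: str
    name: str


@dataclass
class MetadataEntry:
    """Central block: title, type, project, summary, keywords, relations."""
    title: str
    entry_type: str
    project: str
    summary: str
    keywords: list[str] = field(default_factory=list)
    related_entries: list[str] = field(default_factory=list)
    parameters: list[Parameter] = field(default_factory=list)
    contacts: list[Contact] = field(default_factory=list)

    def matches(self, keyword: str) -> bool:
        """Case-insensitive keyword lookup, as a catalogue search might do."""
        return keyword.lower() in (k.lower() for k in self.keywords)


entry = MetadataEntry(
    title="EH5_T63L19_R365_TEMP2",
    entry_type="dataset",
    project="Climate Model Simulations at MPI",
    summary="This dataset contains 6H values.",
    keywords=["AMIP2"],
    related_entries=["EH5_T63L19_AMIP_6H"],
    parameters=[Parameter("atmosphere", "2m temperature", "Kelvin")],
)
print(entry.matches("amip2"))
```

Keeping the optional blocks as lists hanging off one central entry reflects the slide's description of the metadata entry as the hub that the other blocks relate to.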

18 CERA Structure
- Level 1 interface: Metadata entries (XML, ASCII) + data files
- Level 2 interface: Separate files containing BLOB table data in application-adapted structure (time series of single variables)
Diagram labels: Experiment Description; Unix-Files; Table / Pointer; Dataset 1 Description; Dataset n Description; BLOB Data Table; BLOB Data Table

19 Climate Model Raw Data; Application-oriented Data Storage (Interface level 2); Primary Data Processing

20
Start: Approved in January 2003
Maintenance: Model and Data (M&D/MPIMET) and German Climate Computing Centre (DKRZ)
Mission: Data for climate research are collected, stored and disseminated
ICSU Policy: long-term archiving and unrestricted data access for scientists
Restriction: Only climate data products in CERA DB, no raw data storage.
Content: Emphasis is placed on climate modelling and related data products.
Co-operation: with thematically corresponding data centres like WDC-MARE (Bremen) and WDC-RSAT (Oberpfaffenhofen)
URL:

21 WDC Verbund Erdsystemforschung (WDC Cluster Earth System Research)
Founded on 25.04.03 in Oberpfaffenhofen by the 3 German ICSU WDCs.
- WDC for Climate: M&D / DKRZ, Hamburg
- WDC MARE (Marine Environmental Sciences): Marum, Bremen and Bremerhaven
- WDC RSAT (Remote Sensing for the Atmosphere): DFD/DLR, Oberpfaffenhofen
Commitment: Long-term data archiving and free, unrestricted data access for all scientists (ICSU rules for WDCs and rules of good scientific practice)

22 WDC Verbund Erdsystemforschung: Policy Statement
Data publication
- The data themselves should be uniquely identifiable, citable and universally accessible, independently of the archiving system (e.g. by assigning DOIs or URNs).
- DFG project "Publikation und Zitierfähigkeit wissenschaftlicher Primärdaten" (Publication and Citability of Scientific Primary Data; 12 months, start 01.10.03)
Service of the data centres
- Qualified thematic data centres take on the role of archiving and publishing scientific data.
- The centres guarantee long-term and free availability of archived data within the guidelines of the ICSU World Data Centres.
- Data centres make their expertise available in an advisory capacity to funding agencies, reviewers and the scientific community.

23 WDC-CLIMATE Data Content
Climate Model Data (continuous stream of new data)
IPCC DDC (Data Distribution Centre)
- Will be continued for the Fourth Assessment Report
CEOP (Coordinated Enhanced Observing Period) model output retention and handling centre
- Part of WCRP that was motivated by GEWEX with focus on water and energy cycles within the climate system (01.10.2002 – 31.12.2004)
Observational Data
- Model-related observations: ERA15/40 (ECMWF), NCEP 40-year reanalysis
- Instrumental data: WOCE (World Ocean Circulation Experiment)
- Earth observations: Access to SSTs from NOAA AVHRR in cooperation with WDC RSAT (distributed archive)
Project Support (encourage Good Scientific Practice)
- HOAPS (Hamburg Ocean Atmosphere Parameters and Fluxes from Satellite Data)
- CARIBIC (Civil Aircraft for Regular Investigation of the Atmosphere Based on an Instrumentation Container), MPI Mainz
- Different model applications

24 WDCC Example: Experiment
Exp.-Acronym: EH5_T63L19_AMIP_6H
Exp.-Name: ECHAM5_T63L19_AMIP Control Run 6H values
Exp.-Description: Simulation of current climate using ECHAM5.2 forced with observed monthly sea surface temperatures and sea-ice concentrations (AMIP-2). The simulation was run on a NEC-SX6 (hurrikan). Atmospheric data are stored every 6 hours. Monthly means are available, too.
Related experiments:
- ECHAM5_TTTLLL_AMIP, where TTTLLL is: T21L19, T31L19, T42L19, T85L19, T106L19, T42L31, T63L31, T85L31 and T106L31
The output from the model run:
Project: Climate Model Simulations at MPI
Keyword: AMIP2
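The TTTLLL part of these experiment names can be picked apart mechanically. The sketch below reads Tnn as the spectral truncation and Lnn as the number of vertical levels, which is the usual ECHAM naming convention but is an interpretation, not something spelled out on the slide. The function name `parse_resolution` is hypothetical.

```python
import re

# Matches the resolution tag inside an experiment name, e.g. "T63L19".
RESOLUTION = re.compile(r"T(?P<truncation>\d+)L(?P<levels>\d+)")


def parse_resolution(name: str) -> tuple[int, int]:
    """Return (spectral truncation, vertical levels) found in the name."""
    match = RESOLUTION.search(name)
    if match is None:
        raise ValueError(f"no TTTLLL resolution tag in {name!r}")
    return int(match["truncation"]), int(match["levels"])


# Related experiments as listed on the slide.
tags = ["T21L19", "T31L19", "T42L19", "T85L19", "T106L19",
        "T42L31", "T63L31", "T85L31", "T106L31"]
related = [f"ECHAM5_{tag}_AMIP" for tag in tags]

for name in ["EH5_T63L19_AMIP_6H", *related]:
    truncation, levels = parse_resolution(name)
    print(f"{name}: T{truncation}, {levels} levels")
```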

25 WDCC Example
Experiment
- Exp.-Acronym: EH5_T63L19_AMIP_6H
Dataset (BLOB-Table)
- DS-Acronym: EH5_T63L19_R365_TEMP2
- Variable: 2m temperature
Dataset (BLOB-Table)
- DS-Acronym: EH5_T63L19_R365_WIND10M
- Variable: 10m wind speed
Number of datasets: 350 time series of 2D global fields
Total amount of GRIB data: 350 * 1.6 GB = 560 GB
NEWEXP/EXP300/run365

26 WDCC Example: Dataset
DS-Acronym: EH5_T63L19_R365_TEMP2
DS-Name: EH5_T63L19_R365_TEMP2
DS-Summary: See summary of corresponding experiment. This dataset contains 6H values.
Creation Date: 25-MAY-2003
Format: GRIB
Size (Bytes): 1659519420
Storage: Model and Data: DB Internal Storage; Nearline Download Permission: No
Topic / Parameter / Variable / Unit: atmosphere / atmospheric temperature / 2m temperature / Kelvin
Code Type / Code # / Code Acronym: Echam5 / 167 / TEMP2
Temporal Structure: length of time series and storage intervals
Spatial Structure: precise definition of 3D grid points


28 Inclusion of other Data Sources
- Client applet receives foreign data URI from CERA-2 DB
- Foreign server provides DB data by http: German Aerospace Centre

29 Download Statistics

30 CERA DB using countries

31 Contact Email: Web:

