M.Lautenschlager (WDCC, Hamburg) / 03.09.03
Semantic Data Management for Organising Terabyte Data Archives
Michael Lautenschlager, World Data Center for Climate


Slide 1: Semantic Data Management for Organising Terabyte Data Archives
Michael Lautenschlager, World Data Center for Climate (M&D/MPIMET, Hamburg)
CAS2K3 Workshop, September 2003, Annecy, France

Slide 2: Content
- General remarks
- DKRZ archive development
- CERA (1) concept
- CERA data model and structure
- Automatic fill process
- Database access statistics
(1) CERA: Climate and Environmental data Retrieval and Archiving

Slide 3: Semantic Data Management
- Data consist of numbers and metadata; the metadata construct the semantic data context.
- Metadata form a data catalogue, which makes the data searchable.
- Data are produced, archived and extracted within their semantic context; data without explanation are only numbers.
Problems:
- Metadata are of different complexity for different data types.
- Consistency between numbers and metadata has to be ensured.

Slide 4: DKRZ Archive Development
Basic observations and assumptions:
1) Unix file archive content at the end of 2002: 600 TB, including backups
2) Observed archiving rate (Jan.-May 2003): 40 TB/month
3) System changes: 50% compute-power increase in August 2003
4) CERA DB size at the end of 2002: 12 TB
5) Observed increase (Jan.-May 2003): 1 TB/month
6) The automatic fill process into the CERA DB is to become operational at 4 TB/month this year and should grow from 10% of the archiving rate to approx. 30% by the end of 2004.
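The rates on this slide can be turned into a simple linear projection. The following sketch only restates the slide's figures; the one-year horizon and the neglect of the compute-power increase are simplifying assumptions for illustration.

```python
# Linear projection from the slide's observed rates (illustrative only).
archive_tb = 600.0      # Unix file archive, end of 2002 (incl. backups)
archive_rate = 40.0     # observed archiving rate, TB/month (Jan.-May 2003)
cera_tb = 12.0          # CERA DB size, end of 2002
cera_rate = 1.0         # observed CERA DB increase, TB/month

# Project 12 months ahead at the observed rates,
# ignoring the August 2003 compute-power increase.
archive_2003 = archive_tb + 12 * archive_rate   # 1080 TB
cera_2003 = cera_tb + 12 * cera_rate            # 24 TB

# Share of the archiving rate captured by the planned automatic fill process:
afp_rate_2003 = 4.0                             # TB/month
share = afp_rate_2003 / archive_rate            # 0.1 -> the "10%" on the slide
```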

Slide 5: DKRZ Archive Development, Conservative Estimate (chart)

Slide 6: Problems with Direct File Archive Access
- Missing data catalogue: the directory structure of the Unix file system is not sufficient to organise millions of files.
- Data are not stored application-oriented: raw data contain time series of 4D data blocks (3D in space plus type of variable), while the typical access pattern is time series of 2D fields.
- Lack of experience with climate model data: problems in extracting the relevant information from climate model raw data files.
- Lack of computing facilities at the client site: non-modelling scientists are not equipped to handle large amounts of data (1/2 TB = 10 years of T106 or 50 years of T42 at 6-hour storage intervals).
Estimated file archive size by year: 1.2 PB, 1.9 PB, 2.6 PB, 3.4 PB, 4.1 PB.
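The mismatch between storage layout and access pattern can be sketched in a few lines: the raw output is a time series of 4D blocks, while users want a contiguous time series of one 2D field. The array shapes below are invented for illustration and are not CERA's actual dimensions.

```python
import numpy as np

# Raw model output: a time series of 4D blocks (variable, level, lat, lon).
# Shapes are illustrative, not the real model's.
n_time, n_var, n_lev, n_lat, n_lon = 8, 3, 2, 4, 5
raw = np.arange(n_time * n_var * n_lev * n_lat * n_lon, dtype=np.float32)
raw = raw.reshape(n_time, n_var, n_lev, n_lat, n_lon)

# Application-oriented storage: for each (variable, level) a contiguous
# time series of 2D fields -- the typical access pattern of data users.
by_variable = {
    (v, l): np.ascontiguousarray(raw[:, v, l])   # shape (n_time, n_lat, n_lon)
    for v in range(n_var)
    for l in range(n_lev)
}

# Reading "variable 1, level 0, all time steps" now touches one array only,
# instead of slicing every 4D block in the raw files.
series = by_variable[(1, 0)]
```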

Slide 7: Limits of Model Resolution
- ECHAM4 (T42): grid resolution 2.8°, time step 40 min
- ECHAM4 (T106): grid resolution 1.1°, time step 20 min
(Figure: Noreiks (MPIM), 2001)

Slide 8: CERA Concept: Semantic Data Management
(I) Data catalogue and pointers to Unix files
- Enable search and identification of data
- Allow for data access as they are
(II) Application-oriented data storage
- Time series of individual variables are stored as BLOB entries in DB tables, allowing fast and selective data access.
- Storage in a standard file format (GRIB) allows application of standard data processing routines (PINGOs).
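The two-level idea, a searchable catalogue row per dataset plus BLOB rows holding one GRIB record per time step, can be sketched with any relational database. The schema below uses SQLite and invented table and column names; CERA's actual (Oracle-based) schema differs.

```python
import sqlite3

# Illustrative two-level layout: catalogue table + BLOB data table.
# Names are invented for this sketch, not CERA's real schema.
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE dataset (
        id INTEGER PRIMARY KEY,
        experiment TEXT, variable TEXT, unit TEXT,
        raw_file TEXT            -- pointer to the Unix file ("access as is")
    );
    CREATE TABLE blob_data (
        dataset_id INTEGER REFERENCES dataset(id),
        time_step INTEGER,
        grib BLOB                -- one 2D field in standard GRIB format
    );
""")
db.execute("INSERT INTO dataset VALUES (1, 'EXP1', '2m temperature', 'K', "
           "'/arch/exp1/t2m.grb')")
db.execute("INSERT INTO blob_data VALUES (1, 0, ?)", (b'GRIB...record...7777',))

# Selective access: one field of one variable, without touching the raw file.
row = db.execute(
    "SELECT grib FROM blob_data WHERE dataset_id = 1 AND time_step = 0"
).fetchone()
```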

Slide 9: CERA Database System
- CERA database: 7.1 TB, comprising the data catalogue, processed climate data, and pointers to raw data files
- Mass storage archive: 210 TB, neglecting security copies
- Web-based user interface: catalogue inspection and climate data retrieval via Internet access to the DKRZ mass storage archive
- Number of experiments: 298
- Number of datasets:
- Number of BLOBs within CERA at 03-SEP-03:
- Typical BLOB sizes: 17 kB and 100 kB
- Number of data retrievals: 1500-8000 per month

Slide 10: CERA Data: January Temperature (figure)

Slide 11 (figure only)

Slide 12: CERA-2 Data Model Blocks
- Metadata Entry: the central CERA block, providing information on the entry's title, type and relation to other entries; the project the data belong to; a summary of the entry; a list of general keywords; and creation and review dates of the metadata.
- Coverage: information on the volume of space-time covered by the data.
- Reference: any publication related to the data, together with the publication form.
- Status: status information such as data quality, processing steps, etc.
- Distribution: distribution information, including access restrictions, data format and fees if necessary.
- Contact: data on contact persons and institutes, such as distributor, investigator, and owner of copyright.
- Parameter: describes data topic, variable and unit.
- Spatial Reference: information on the coordinate system used.
Additionally, modules and local extensions:
- Module DATA_ORGANIZATION (grid structure)
- Module DATA_ACCESS (physical storage)
- Local extensions for specific information on, e.g., data usage, data access and data administration.
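A metadata entry built from a few of these blocks can be sketched as an XML document (slide 14 mentions XML as a metadata exchange format). The element and attribute names below are invented for the sketch and do not reproduce the real CERA-2 exchange schema.

```python
import xml.etree.ElementTree as ET

# Sketch of a metadata entry with a few CERA-2-style blocks.
# Element names are illustrative, not the real CERA-2 schema.
entry = ET.Element("entry")
ET.SubElement(entry, "title").text = "EXP1 monthly mean 2m temperature"
ET.SubElement(entry, "summary").text = "Illustrative dataset description."

coverage = ET.SubElement(entry, "coverage")        # space-time covered
ET.SubElement(coverage, "temporal", start="1990-01", stop="1999-12")

parameter = ET.SubElement(entry, "parameter")      # topic, variable, unit
ET.SubElement(parameter, "variable").text = "2m temperature"
ET.SubElement(parameter, "unit").text = "K"

contact = ET.SubElement(entry, "contact", role="distributor")
contact.text = "WDC Climate"

xml_text = ET.tostring(entry, encoding="unicode")
```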

Slide 13: Data Model Functions
The CERA-2 data model ...
- allows for data search by discipline, keyword, variable, project, author, geographical region and time interval, and for data retrieval;
- allows for specification of data processing (aggregation and selection) without accessing the primary data;
- is flexible with respect to local adaptations and storage of different types of geo-referenced data;
- is open for cooperation and interchange with other database systems.

Slide 14: Data Structure in the CERA DB
- Experiment description, with pointers to Unix files
- Dataset 1 ... Dataset n descriptions, each with its own BLOB data table
- Level 1 interface: metadata entries (XML, ASCII)
- Level 2 interface: separate files containing BLOB table data

Slide 15: Automatic Fill Process (AFP)
Creation of the application-oriented data storage must be automatic because of the large archiving rates.

Slide 16: Archive Data Flow per Month
(Diagram: compute server -> common file system -> mass storage archive; Unix files flow at 60 TB/month. From the common file system, the AFP builds the application-oriented data hierarchy in the CERA DB system: 2003: 4 TB/month, 2004: 12 TB/month, 2005+: 20 TB/month. Metadata initialisation precedes the fill.)
Important: the automatic fill process has to be performed before the corresponding files migrate to the mass storage archive.

Slide 17: Automatic Fill Process (AFP), Steps and Relations
DB server:
1. Initialisation of the CERA DB: metadata and BLOB data tables are created.
Compute server:
1. The climate model calculation starts with the first month.
2. The next model month starts while primary data processing of the previous month produces the BLOB table input, which is stored in the dynamic DB fill cache.
3. Step 2 is repeated until the end of the model experiment.
DB server:
1. BLOB data table input is accessed from the DB fill cache.
2. BLOB table injection and update of the metadata.
3. Step 2 is repeated until the table partition is filled (BLOB table fill cache).
4. Close the partition, write the corresponding DB files to the HSM archive, open a new partition, and continue with step 2.
5. Close the entire table and update the metadata after the end of the model experiment.
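The control flow above can be sketched as a small producer/consumer loop. The cache structure, partition size and month count below are invented for illustration; the real AFP worked on Oracle table partitions and an HSM archive.

```python
from collections import deque

# Minimal sketch of the AFP control flow; sizes are invented.
PARTITION_MONTHS = 3   # months of BLOB input per table partition

def compute_server(n_months, fill_cache):
    """Model run: each finished month's post-processed output
    goes into the dynamic DB fill cache."""
    for month in range(1, n_months + 1):
        fill_cache.append(f"blob-input-month-{month}")

def db_server(n_months, fill_cache, hsm_archive):
    """Inject BLOB input; close and archive a partition when it is full."""
    partition, injected = [], 0
    while injected < n_months:
        partition.append(fill_cache.popleft())    # steps 1-2: inject + update
        injected += 1
        if len(partition) == PARTITION_MONTHS:    # steps 3-4: partition full,
            hsm_archive.append(tuple(partition))  # write DB files to HSM
            partition = []
    if partition:                                 # step 5: close the table
        hsm_archive.append(tuple(partition))

cache, hsm = deque(), []
compute_server(7, cache)        # a 7-month experiment
db_server(7, cache, hsm)        # -> partitions of 3, 3 and 1 months in HSM
```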

Slide 18: AFP Disk Cache Sizes
Dynamic DB fill cache (BLOB table input time series):
- To guarantee stable operation, the fill cache should buffer data from approximately 10 days of production.
- The cache size is determined by the automatic data fill rate of up to 1/3 of the archive increase.
(Table: Year / AFP [TB/month] / DB fill cache [TB])
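The table's values did not survive the transcript, but the sizing rule is stated on the slide, so the cache sizes can be recomputed from the planned AFP rates of slide 16. The 30-day month is an assumption of this sketch.

```python
# Fill-cache sizing rule from the slide: buffer ~10 days of production
# at the AFP rate. Assumes a 30-day month for the conversion.
def fill_cache_tb(afp_tb_per_month, buffer_days=10, days_per_month=30):
    return afp_tb_per_month * buffer_days / days_per_month

# Planned AFP rates (TB/month) from the archive data flow slide.
for year, rate in [(2003, 4), (2004, 12), (2005, 20)]:
    print(year, round(fill_cache_tb(rate), 1), "TB")
```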

Slide 19: AFP Disk Cache Sizes (continued)
BLOB table fill cache (open BLOB table partitions):
- Depends on the BLOB table partition = table space = DB file: 10 GB (adapted to the UniTree environment; data are in GRIB format).
- 10 GB per table partition results in 5 TB per fill stream for the standard set of variables (2D global fields) of the current climate model.
- Number of parallel fill streams (climate model calculations): 8
- An additional 25% is needed for HSM transfer of closed partitions and error tracking.
- The BLOB table partition cache thus results as (8 streams * 5 TB/stream) * 1.25 = 50 TB.
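The slide's partition-cache arithmetic, spelled out with the figures it gives:

```python
# BLOB table partition cache, as computed on the slide.
streams = 8           # parallel climate-model fill streams
tb_per_stream = 5.0   # 10 GB partitions -> 5 TB per stream (standard variables)
overhead = 1.25       # +25% for HSM transfer of closed partitions, error tracking

cache_tb = streams * tb_per_stream * overhead
print(cache_tb)  # 50.0
```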

Slide 20: CERA Access Statistics, Summary
- The system is used, mainly by external users.
- Application-oriented data storage allows for precise data access and reduces the data transfer volume by a factor of 100 compared to direct file transfer.
- But at present, data access via the CERA DB is only a few percent of DKRZ's data download volume.

Slide 21: Countries Using the CERA DB (map)