ExArch: Climate analytics on distributed exascale data archives Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner, D. Waliser


1 ExArch: Climate analytics on distributed exascale data archives
Martin Juckes, V. Balaji, B.N. Lawrence, M. Lautenschlager, S. Denvil, G. Aloisio, P. Kushner, D. Waliser
The ExArch project seeks to develop a framework for the scientific interpretation of multi-model ensembles at the peta- and exa-scale: specifically, a framework for evaluating the imminent Climate Model Intercomparison Project, Phase 5 (CMIP5) archive, which will be the largest of its kind ever assembled in this domain, and the Coordinated Regional Climate Downscaling Experiment (CORDEX), which will push even beyond CMIP5 in resolution, albeit on the regional scale.

2 ExArch: management numbers
Start: March 1st, 2011
Duration: 39 months
Budget: 1.44 million Euros
Effort: 246 staff months

3 The ExArch back-story
Global Organization for Earth System Science Portals (GO-ESSP) [unfunded]: ... a federation of frameworks that can work together using agreed-upon standards.
METAFOR (EU FP7): detailed metadata for climate models.
IS-ENES (EU FP7): infrastructure support, including the distributed data archive.
ESG (DOE, SciDAC): access to data, information, models, analysis tools, and computational resources.
CURATOR (NSF, NASA, NOAA): capable infrastructure for Earth system research and operations.

4 The climate assessment process
Model development → CMIP3 → Evaluation → Downscaling → Impacts models → IPCC 4th Assessment Report (2005-7)
Model development → CMIP5 → Evaluation → Downscaling → Impacts models → IPCC 5th Assessment Report (2012-14)

5 CMIP5 and AR5: a brief organisational overview
Experiment design coordinated by Karl Taylor at PCMDI on behalf of the WCRP.
CMIP5 archive: a globally distributed archive; a federation of data centres and modelling centres will host it.
Access to the global archive expected in April.

6 A schematic roadmap

7 Improving precision Image STS41B-41-2347 was provided by the Earth Sciences and Image Analysis Laboratory at NASA's Johnson Space Center. NASA Graphics

8 Dealing with complexity
Improved accuracy requires representing a vast range of processes:
- complexity in model design;
- complexity in experimental design.
(Fig. 1 of Delaney and Barga, 2010)

9 Sampling uncertainty
Uncertainty in initial conditions – we do not know the state of the ocean to sufficient accuracy.
Uncertainty from natural variability – some components of the system are chaotic; we can never predict what state they will be in beyond O(10 days).
Uncertainty in model design and configuration – many choices need to be made to create a computable system.
The multi-model ensemble is a “multi-metric, distance-optimising Monte-Carlo approach”, where each major design decision is a trade-off between adopting approaches used by others or exploring new territory.

10 Overpeck et al. (2011), Science.

11 N_pp = number of mesh points pole to pole
N_g = total number of spatial mesh points = O(N_pp^3)
N_v = number of variables ~ sqrt(N_pp)
N_e = ensemble size ~ N_pp
N_t = time steps per simulated year ~ N_pp
N_y = years simulated per intercomparison ~ sqrt(N_pp)
Cost ~ N_g × N_v × N_e × N_t × N_y ~ N_pp^6
An O(40) decrease in storage density is needed to bring this estimate in line with Overpeck et al.
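The exponent arithmetic behind the Cost ~ N_pp^6 estimate can be checked mechanically. A minimal sketch (not from the slides), summing the N_pp exponents of each factor:

```python
# Check that the slide's scaling assumptions multiply out to Cost ~ N_pp^6.
from fractions import Fraction

# Exponent of N_pp contributed by each factor, as stated on the slide:
exponents = {
    "N_g (spatial mesh points)": Fraction(3),     # O(N_pp^3)
    "N_v (variables)":           Fraction(1, 2),  # ~ sqrt(N_pp)
    "N_e (ensemble size)":       Fraction(1),     # ~ N_pp
    "N_t (time steps/year)":     Fraction(1),     # ~ N_pp
    "N_y (years simulated)":     Fraction(1, 2),  # ~ sqrt(N_pp)
}

total = sum(exponents.values())
print(f"Cost ~ N_pp^{total}")  # Cost ~ N_pp^6
```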

12 Some trends – ballpark figures
Change per year (per decade):
Data centre storage: +60%/year
Data centre energy use: +25%/year
Energy use per unit capacity: -22%/year (a 10-fold decrease per decade)
2010 vs 2020:
Purchase cost/Tb: 200 USD → 3 USD
Operating power: 10 W/Tb → 1 W/Tb
Electricity cost (UK): 90 GBP/MWh → 120 GBP/MWh
Cost of 1 Tb over 3 years: 200 + 46 → 3 + 6
Archive size at constant funding: 1 Pb → 27 Pb
Analysis by Kryder and Soo Kim (2009) suggests hard drives will not be replaced by solid state or other new media before 2020.
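Reading the 3-year cost-of-ownership row as roughly 200 + 46 (purchase + power, 2010) and 3 + 6 (2020) – our interpretation of the garbled table – the "27 Pb at constant funding" figure follows directly:

```python
# Rough check of the constant-funding estimate: how much more storage
# the same budget buys in 2020 if 3-year cost of ownership per Tb falls
# from ~246 to ~9 (figures as read from the slide's table).
cost_2010 = 200 + 46   # purchase + 3-year power cost per Tb, 2010
cost_2020 = 3 + 6      # purchase + 3-year power cost per Tb, 2020
ratio = cost_2010 / cost_2020
print(round(ratio, 1))  # 27.3 -> a 1 Pb budget in 2010 buys ~27 Pb in 2020
```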

13 Work Packages
Three themes: Strategy, Informatics, Climate Science.
Topics: 2 workshops; interactions with GCOS, ESA and NASA; governance structures; accessibility; software management; robust metadata; query management; near-archive processing; quality assurance; climate science diagnostics.

14 Provenance
Dealing with thousands of experiments, with complex models in multiple configurations, requires comprehensive and machine-readable documentation of model configuration.
Accurate metadata
As metadata becomes more complex, we need to develop systems to ensure that it is accurate:
Downstream: auto-generate metadata from model configurations.
Upstream: prescribe metadata and auto-generate model configurations.
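The "downstream" approach can be sketched as deriving the metadata record directly from the configuration that actually drives the model, so the two cannot drift apart. A minimal illustration – the field names here are invented for the sketch, not METAFOR CIM fields:

```python
# Hypothetical sketch: auto-generate a metadata record from a model
# run configuration. All names are illustrative assumptions.
model_config = {
    "model": "EXAMPLE-ESM",
    "resolution_deg": 1.25,
    "vegetation": "dynamic",
    "ocean_component": "EXAMPLE-OCN",
}

def generate_metadata(config):
    """Derive a metadata record from the configuration itself,
    rather than asking a human to re-type it."""
    return {
        "source_id": config["model"],
        "nominal_resolution": f"{config['resolution_deg']} degrees",
        "components": {"ocean": config["ocean_component"]},
        "vegetation_scheme": config["vegetation"],
    }

record = generate_metadata(model_config)
print(record["source_id"])  # EXAMPLE-ESM
```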

15 Query syntax
The query syntax is a key element of the communication between system components.
Resource management
Bringing calculations to an exa-scale data archive will create a significant computational demand.
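To make the role of the query syntax concrete, here is one possible shape for a structured query passed between components: facet constraints plus a reduction to be computed near the archive. The syntax is entirely hypothetical, not an ExArch specification:

```python
# Illustrative only: a structured query as a dictionary of facet
# constraints plus a near-archive reduction operator (hypothetical syntax).
query = {
    "project": "CMIP5",
    "experiment": "rcp45",
    "variable": "tas",
    "facets": {"vegetation": "dynamic"},
    "reduction": "ensemble_mean",  # to be computed near the archive
}

def to_query_string(q):
    """Flatten the structured query into a key=value&... string."""
    parts = [f"{k}={v}" for k, v in q.items() if not isinstance(v, dict)]
    parts += [f"{k}={v}" for k, v in q.get("facets", {}).items()]
    return "&".join(parts)

qs = to_query_string(query)
print(qs)
```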

16 Many-to-many data distribution
Can we create a BitTorrent-like system for data, with access control?
Security, transparency, inter-operability – particularly inter-operability with observational climate archives.
Need to support multiple access routes:
- OPeNDAP – for climate scientists
- OGC
- Browser interface

17 Quality control
With millions of files, quality control is very important. With hundreds of different variables it is also very hard.
Standard diagnostics
Many climate data users want to look at standard indices (e.g. the strength of El Niño) rather than the raw data.
Structured queries
E.g. the ensemble mean of all models with dynamic* vegetation.
*: “dynamic” vegetation in the climate modelling community refers to temporally evolving vegetation, as opposed to prescribed fixed vegetation.
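The structured-query example above can be sketched as a filter on model metadata followed by an averaging step. The models, attribute flags, and data values below are invented for illustration:

```python
# Toy sketch: ensemble mean over the subset of models whose metadata
# flags dynamic vegetation. All models and values are invented.
import numpy as np

models = {
    "MODEL-A": {"dynamic_vegetation": True,  "tas": np.array([287.1, 287.4])},
    "MODEL-B": {"dynamic_vegetation": False, "tas": np.array([286.8, 287.0])},
    "MODEL-C": {"dynamic_vegetation": True,  "tas": np.array([287.5, 287.8])},
}

# Select only models satisfying the facet constraint, then average.
selected = [m["tas"] for m in models.values() if m["dynamic_vegetation"]]
ensemble_mean = np.mean(selected, axis=0)
print(ensemble_mean)  # mean of MODEL-A and MODEL-C only
```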

18 Themes: Taking the computation to the data
The Climate Data Operators (CDO) library is maintained by DKRZ. It is designed to act on files containing gridded spatial data, exploiting variable and units naming conventions. Over 400 operators (cf. Gray's law: start with 20).
Objectives: (1) allow preliminary analysis to select data (e.g. days with cyclones affecting the US mainland) to be carried out at the archive; (2) allow data reduction operations (e.g. area mean) to be carried out at the archive.
Later: take the diagnostic computation into the run-time?
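To show what a near-archive reduction like an area mean buys, here is the underlying calculation: a cosine-latitude-weighted mean collapses a full lat-lon field to a single number, so only that number needs to leave the archive. The grid and field values below are synthetic:

```python
# What an area-mean reduction (e.g. CDO's fldmean) amounts to:
# a cos(latitude)-weighted average over a lat-lon field. Running this
# at the archive means shipping one value instead of the whole field.
import numpy as np

lats = np.linspace(-87.5, 87.5, 72)    # cell-centre latitudes (synthetic grid)
field = np.full((72, 144), 288.0)      # uniform 288 K test field
weights = np.cos(np.deg2rad(lats))     # grid-cell area ~ cos(latitude)

# Weighted mean over latitude, then plain mean over longitude.
area_mean = np.average(field, axis=0, weights=weights).mean()
print(area_mean)  # ~288.0 for a uniform field
```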

19 Degrees of closeness to the data
In the room: limited CPU resource.
On site (inside the firewall): general purpose resources; restricted access rights.
Regional, on the backbone.
Intercontinental: may be necessary if using multiple archives.

20 Themes: Governance and communication (brain to brain)
Efficient exploitation of complex archives depends on a range of standards with various governance procedures. Will these standards respond to the demands of exa-scale systems?
ExArch will: (1) support the extension of the METAFOR Common Information Model to cover regional climate models; (2) explore methods of ensuring the validity of model documentation; (3) engage with the NetCDF CF conventions governance process; (4) encourage early planning by sharing documents and interlinking milestones, and discourage late planning by getting everyone round the table (which doesn't scale well).

21 The end

