On the Road to a Sequential CCSM
Robert Jacob, Argonne National Laboratory
Including work by: Mariana Vertenstein (NCAR), Ray Loy (ANL), Tony Craig (NCAR)
SEWG Meeting, March 16, 2007

Taxonomy of Model Integration Schemes
Execution (a computer-instruction view)
–Concurrent: completely different execution sequences are occurring on different processors. Inherently parallel.
–Sequential: the same instruction sequence on all processors.
Integration (a science view)
–Sequential (in phase): each model integrates over the same time period before advancing to the next time period.
–Staggered (out of phase): one or more models are lagged.

CCSM models in this taxonomy
Concurrent execution, staggered integration:
–current CCSM (single or multiple executable)
Sequential execution, sequential integration:
–current CAM
Sequential execution, staggered integration:
–sequential CCSM and the CAM mode of sequential CCSM
Concurrent execution, sequential integration:
–possible, but has a pathologically unbalanced load

Motivation for Sequential CCSM
2-64 processors
–Only option for 2-4 processors
–Tough to load balance a concurrent system on small processor counts
–Trivial to load balance a sequential execution system
–2-64 processors coming soon to your next laptop/workstation!
High resolution / 1000s of processors
–Again, trivial to load balance, assuming scalability
–Impossible to remove all idle time from the concurrent case; this is costly when 1000s of processors are idle
–Platforms like BlueGene will be used for high-resolution runs and ensembles ( proc runs)

Sequential CCSM: Current Status
seq_ccsm_drv.F90
–Main driver for the sequential CCSM
–MCT-based; an ESMF-based driver is in development
–Declares all states (AttributeVectors), grids (GeneralGrids), and decomposition descriptors (GlobalSegMaps)
–Declares and initializes all mappings

Sequential CCSM: Current Status
seq_ccsm_drv.F90
–For each model, the driver calls:
  _init_mct (initializes states, decomposition, grids)
  _run_mct (reads/writes states)
  _final_mct (clean up)
  Model developers must implement these calls for their component to be coupled with the sequential driver.
–The driver also calls all mappings, fluxes, and merges.
The current CAM trunk uses seq_ccsm_drv to call CAM, CLM, DOM/SOM, and CICE. It works with a single resolution.
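The call pattern above can be illustrated with a minimal, self-contained sketch. The two stub components, their print statements, and the loop count are hypothetical stand-ins; the real _init_mct/_run_mct/_final_mct routines exchange the driver's MCT states rather than printing messages, and their argument lists are not shown here.

  ! Sketch only: the sequential driver's init / run-loop / finalize pattern.
  ! Component routines are stubs; actual interfaces differ.
  program seq_drv_sketch
     implicit none
     integer :: nstep
     ! Init phase: each component sets up its states (AttributeVectors),
     ! decomposition (GlobalSegMap), and grid (GeneralGrid).
     call atm_init_mct()
     call lnd_init_mct()
     ! Sequential time loop: components run one after another on the same
     ! processors; the driver performs mappings, merges, and fluxes in between.
     do nstep = 1, 48
        call atm_run_mct()
        ! ... driver would map atm fields to the land grid here ...
        call lnd_run_mct()
        ! ... driver would map land fields back to the atm grid here ...
     end do
     ! Final phase: components clean up.
     call atm_final_mct()
     call lnd_final_mct()
  contains
     subroutine atm_init_mct()
        print *, 'atm: init states, decomposition, grid'
     end subroutine atm_init_mct
     subroutine atm_run_mct()
        print *, 'atm: advance one coupling interval'
     end subroutine atm_run_mct
     subroutine atm_final_mct()
        print *, 'atm: clean up'
     end subroutine atm_final_mct
     subroutine lnd_init_mct()
        print *, 'lnd: init states, decomposition, grid'
     end subroutine lnd_init_mct
     subroutine lnd_run_mct()
        print *, 'lnd: advance one coupling interval'
     end subroutine lnd_run_mct
     subroutine lnd_final_mct()
        print *, 'lnd: clean up'
     end subroutine lnd_final_mct
  end program seq_drv_sketch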

Sequential CCSM: Development with dead code
dead7
–New code that provides the functionality of CCSM's dead models for the sequential CCSM
–Can mimic any resolution and decomposition
–Receives/sends the same states as the full model
–dead7 is being used to test different resolutions in the sequential CCSM
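A minimal sketch of the dead-model idea, assuming a rectangular global grid and a contiguous block of points per task: the routine fills the fields it would normally send with synthetic, index-based values at whatever resolution and decomposition it is handed. The routine name, arguments, and fill formula are hypothetical, not the actual dead7 code.

  ! Sketch only: fake a component's send state with synthetic data so the
  ! driver's mappings and merges can be exercised at any resolution.
  subroutine dead_run_sketch(nflds, npts_local, ibeg, nx_global, ny_global, sendfields)
     implicit none
     integer, intent(in)  :: nflds, npts_local       ! number of fields, local points
     integer, intent(in)  :: ibeg                    ! first global point owned locally
     integer, intent(in)  :: nx_global, ny_global    ! requested global grid size
     double precision, intent(out) :: sendfields(nflds, npts_local)
     integer :: i, k
     do i = 1, npts_local
        do k = 1, nflds
           ! value depends on field index and global point index, so interpolated
           ! results downstream can be checked against a known pattern
           sendfields(k, i) = dble(k) + dble(ibeg + i - 1) / dble(nx_global * ny_global)
        end do
     end do
  end subroutine dead_run_sketch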

Sequential CCSM: multiple resolutions
The same map routine supports identical or different resolutions for the cases where different resolutions are allowed: atmocn, atmice, rofocn.
A simple test for the same resolution, based on the global grid size:
  icesize = mct_gsMap_gsize(gsMap_i)
  atmsize = mct_gsMap_gsize(gsMap_a)
  samegrid = .false.
  if (icesize .eq. atmsize) samegrid = .true.

Sequential CCSM: multiple resolutions
If the grids differ, a config/resource file is read during mapping initialization to determine the mapping-weight filename and the mapping type (excerpt from seq_maps.rc):
  atm2ocnFmapname: map_T31_to_gx3v5_aave_da_ nc
  atm2ocnFmaptype: X
  atm2ocnSmapname: map_T31_to_gx3v5_bilin_da_ nc
  atm2ocnSmaptype: X
Currently working with dead models.

Sequential CCSM: other recent advances
–datm7 (single-node version) added to the driver
–CICE also integrated
–dead7 land with a river model introduced, and river mapping called from the driver

CCSM and memory scaling
Both the sequential- and concurrent-execution CCSM may be run on BlueGene.
BlueGene, and similar planned machines, have low per-node memory.
What is the memory scaling of the coupler?
Early suspect: reading in mapping weights
–Concurrent and sequential CCSM use the same mapping scheme.

Mapping memory issues
Old mapping algorithm:
–Allocate memory for all non-zero interpolation weights (ns) and all area weights for each grid (na, nb)
–Read all data in on node 0
–Scatter all data to the other nodes
–River to gx3: 242 MB!
First new algorithm (Rob), sketched below:
–Allocate the weight and area arrays one at a time
–Read and scatter one array at a time
–Deallocate between each set
–River to gx3: 190 MB
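A minimal sketch of the one-array-at-a-time read-and-scatter pattern, using plain MPI. The array size, the block distribution, and the stand-in for reading the weights file are assumptions, not the actual coupler code.

  ! Sketch only: root allocates, fills, and scatters ONE array, then frees the
  ! full-size buffer before handling the next one (areas, indices, ...).
  program scatter_weights_sketch
     use mpi
     implicit none
     integer :: ierr, rank, nprocs, ns, nloc, i
     integer, allocatable :: counts(:), displs(:)
     double precision, allocatable :: wgt_all(:), wgt_loc(:)

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

     ns = 1000000                          ! number of non-zero weights (assumed)
     allocate(counts(nprocs), displs(nprocs))
     counts = ns / nprocs                  ! simple block distribution
     counts(nprocs) = ns - (nprocs - 1)*(ns/nprocs)
     displs(1) = 0
     do i = 2, nprocs
        displs(i) = displs(i-1) + counts(i-1)
     end do
     nloc = counts(rank+1)
     allocate(wgt_loc(nloc))

     if (rank == 0) then
        allocate(wgt_all(ns))              ! only the root holds the full array
        wgt_all = 1.0d0                    ! stand-in for reading the weights file
     else
        allocate(wgt_all(1))               ! dummy; send buffer unused off the root
     end if

     call MPI_Scatterv(wgt_all, counts, displs, MPI_DOUBLE_PRECISION, &
                       wgt_loc, nloc, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
     deallocate(wgt_all)                   ! free the full-size buffer before the
                                           ! next array is read and scattered
     call MPI_Finalize(ierr)
  end program scatter_weights_sketch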

Mapping memory issues
Second new algorithm (Tony Craig), sketched below:
–Allocate a user-adjustable amount of memory
–Read in an equivalent amount of weights
–Broadcast to all nodes
–Each node finds its piece
–Repeat until all weights are read
–Tests show this to be faster than read-all-and-scatter
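A minimal sketch of the chunked read-and-broadcast approach, again in plain MPI. The chunk size, total size, the stand-in for reading the file, and the round-robin ownership rule are illustrative assumptions.

  ! Sketch only: the root reads a bounded chunk of weights, broadcasts it, and
  ! every rank copies out the entries it owns; repeat until all weights are read.
  program bcast_chunks_sketch
     use mpi
     implicit none
     integer, parameter :: chunk = 100000         ! user-adjustable buffer size
     integer :: ierr, rank, nprocs, ns, istart, n, i, ig, nown
     double precision :: buf(chunk)
     double precision, allocatable :: wgt_loc(:)

     call MPI_Init(ierr)
     call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
     call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)

     ns = 1000000                                  ! total number of weights (assumed)
     allocate(wgt_loc((ns + nprocs - 1)/nprocs))   ! room for this rank's share
     nown = 0

     istart = 1
     do while (istart <= ns)
        n = min(chunk, ns - istart + 1)
        if (rank == 0) buf(1:n) = 1.0d0            ! stand-in for reading n weights
        call MPI_Bcast(buf, n, MPI_DOUBLE_PRECISION, 0, MPI_COMM_WORLD, ierr)
        do i = 1, n                                ! every rank scans the chunk and
           ig = istart + i - 1                     ! keeps only the weights it owns
           if (mod(ig - 1, nprocs) == rank) then   ! assumed round-robin ownership
              nown = nown + 1
              wgt_loc(nown) = buf(i)
           end if
        end do
        istart = istart + n
     end do

     call MPI_Finalize(ierr)
  end program bcast_chunks_sketch

Only the fixed-size broadcast buffer is ever replicated, so the peak memory per node is bounded by the user-adjustable chunk rather than by the full weight set.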

Investigate coupler memory scaling on BlueGene
Use the concurrent CCSM with dead models (single executable).
In all cases, the dead models are given 1 processor each and all remaining processors go to the coupler:
–"32" = 28 coupler processors
–"512" = 508 coupler processors
T42_gx1: runs to completion on 32, 64, ..., 512
–1024: ran out of memory in mapping init with the old algorithm
–New algorithm: gets past map init. Some runs make it to the main loop; others die in frac_set.

Investigate coupler memory scaling on BlueGene
Use the concurrent CCSM with dead models (single executable).
T85_gx1: similar results
T340_x01:
–Dies in initialization before reaching map init
–Problems with trapping the exit on BlueGene
–Dead models appear to be running out of memory even though diagnostics indicate plenty of free memory

Another suspect: extra memory for different mappings
In cpl6:
–The coupler receives all data from the atmosphere
–States and fluxes are each copied into new datatypes and follow different paths through the coupler; states are mapped with the bilinear mapping, fluxes with the conservative mapping:
  call cpl_map_bun(atm_states_a, bilinear_map, atm_states_o)
  call cpl_map_bun(atm_fluxes_a, conserv_map, atm_fluxes_o)
–This results in a 2x memory hit compared to keeping just one copy of the received data.
In sequential CCSM:
–Use a new MCT feature that allows the attributes to be mapped to be specified:
  call map_atmocn_mct(atmfields_a, bilinear_map, atmfields_o, fields_to_mapb)
  call map_atmocn_mct(atmfields_a, conserv_map, atmfields_o, fields_to_mapc)

Live models on BlueGene (ccsm3.1.beta41)
–T31_gx3 runs without modification! On 32 BlueGene processors (the smallest count available; 12 atm, 12 ocn, 4 ice, 2 lnd, 2 cpl) and up to 256 (not load balanced).
–Note: BlueGene has no dynamic libraries; everything must be statically linked.
–First attempt at 1.9x2.4_gx1: a 1.5 GB executable!
  It was using the same decomposition as the dead-model studies: 1 processor each for POP and CICE.

CAM on BlueGene
–T42: runs fine up to the 64-processor limit, in CO and VN modes
–T85: runs fine up to 128 processors in CO mode; VN mode has memory problems
–FV 1x1.25: CO mode runs on 32, 64, 128, 256, and 480 processors; VN mode has memory problems
–Higher resolutions: need parallel I/O

CCSM sequential: next steps
–Add time averaging
–Use the same low-memory map read as cpl6
–Add other cpl6 functionality: area normalization, diagnostics
–Check the scaling of the executable image size on BlueGene
–Integrate the other data models
–Integrate POP2