T511-L60 CY26R3 - 32 MPI tasks and 4 OpenMP threads


1 T511-L60 CY26R3 - 32 MPI tasks and 4 OpenMP threads

2 TL1023 ~ 20 km

3 AFES on the Earth Simulator (T1279L96)
T1279 ~ 10 km Source: Satoru Shingu et al. “A Tflops Global Atmospheric Simulation with the Spectral Transform Method on the Earth Simulator”

4 ECMWF’s recent experience
For the 6 months up to March 2003, ECMWF operated three vector-parallel systems from Fujitsu and two large scalar Cluster 1600s from IBM
Advantages of vector-based machines include:
- Clear feedback on the efficiency of the codes being run
- Better environmental characteristics than systems built using general-purpose SMP servers
Advantages of scalar-based machines include:
- Front-end machines for scalar-only tasks are not required
- More forgiving if a medium level of efficiency is acceptable
The consensus view of our application programmers is that vector and scalar systems can equally well provide the HPC resources for high-end modelling work

5 Which architecture is best suited for the job?
Number 1 criterion: cost/performance
For Earth System Modelling, the following do not constitute good measures for cost/performance evaluations:
- Peak performance
- LINPACK results
- Sustained Gflops achieved on the user's application as measured by HW counters
- Peak/sustained performance ratio
The most reliable measure of performance is the wall-clock time needed to solve a given task (e.g. a year-long simulation representative of the planned scientific work)

6 Scalability issues
Earth System codes do not scale linearly if the same-size problem is run on a larger number of processors:
- Load balancing, subroutine-call overheads increase, etc.
- Serial components (Amdahl's law)
- The planned increases in problem sizes mitigate this effect
Generally, a high ratio of the following is advantageous:
- # of parallel instruction streams in the application (and their length)
- # of parallel instruction streams required to keep the hardware busy
In this respect, an 8-way M&A vector pipe will require as many independent instruction streams as 8 scalar M&A units
The sustainable flop count (in absolute terms) per "hardware thread" is very important for application scalability; the ratio of peak/sustained flops is not
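
The serial-component point above is Amdahl's law; a minimal sketch (illustrative, not from the slides; function name and the 1% serial fraction are my assumptions) of how even a small serial component caps scalability at the processor counts used in this talk:

```python
def amdahl_speedup(parallel_fraction: float, n_procs: int) -> float:
    """Amdahl's law: speed-up when (1 - parallel_fraction) of the work is serial."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_procs)

# Even a 1% serial component limits speed-up well below linear:
for n in (32, 288, 1920):
    print(n, round(amdahl_speedup(0.99, n), 1))
# 32 -> 24.4, 288 -> 74.4, 1920 -> 95.1
```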

7 ECMWF Model /Assimilation / Computing Status & Plans
Anthony Hollingsworth, Walter Zwieflhoefer, Deborah Salmond

8 Scope of talk
- ECMWF operational forecast systems 2003
- ECMWF HPC configuration 2003, and planned upgrade
- Operational timings for production model codes
- Planned Forecast system upgrades
- Drivers for 2007 Computer Upgrade

9 ECMWF operational assimilation systems 2003
- 4D-Var with 12-hour period, inner-loop minimizations at (up to) T159 L60, outer-loop resolution T511 L60
- Short-cut-off analyses (6-hourly 3D-Var + 4 forecasts/day) for Limited Area Modelling in Member States
- Ocean wave assimilation system
- Ocean circulation assimilation system

10 ECMWF operational prediction systems 2003
- Deterministic forecasts to 10 days, 2x/day, T511/L60 model
- Ensemble Predictions to 10 days, 2x/day, T255/L40, N=51
- Ensemble forecasts to 1 month, using a T159/L40 atmosphere and a 110 km ocean (33 km meridional in the tropics); 2 per month at present, weekly in 2004
- Seasonal forecasts, once per month, based on ensembles using a T95 (210 km) L40 atmospheric model and the same ocean model as used for the one-month forecasts

11 Performance profile of the contracted IBM solution relative to VPP systems
[Chart: performance on ECMWF codes relative to the Fujitsu service (Fujitsu = 400 GF sustained), 2002-2006: rising from 1x (Fujitsu, 2002) through Phase 1 and Phase 2 to roughly 5x with Phase 3 (Regatta H+ with Federation Switch)]

12 ECMWF HPC configuration, end 2003
- Two IBM Cluster 1600s, with 30 p690 servers each
- Each p690 is partitioned into four 8-way nodes, so each cluster has 120 nodes
- 12 of the nodes in each cluster have 32 GB memory; all other nodes have 8 GB
- Processors run at 1.3 GHz
- 960 processors per cluster for user work
- Dual-plane Colony switch with PCI adapters
- 4.2 terabytes of disk space per cluster
- Both clusters are configured identically

13 Timings for production model codes, IFS Cycle 26r3, October 2003 (D. Salmond)

Resolution | PEs (MPI x OMP) | Time step | Time for 10-day forecast (=> FC days/day) | Equivalent FC_d/d on 1920 PEs
T511/L60 | 288 (72 x 4) | 900s | 4298s (~201 FC_d/d on 288 PEs) | 1340 FC_d/d (3.67 yrs/d) on 1920 PEs
T255/L40 | 32 (8 x 4, 1 node) | 2700s | 1800s (~480 FC_d/d on 32 PEs) | 28800 FC_d/d (78.9 yrs/d) on 1920 PEs
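
The table's throughput figures follow from the wall-clock times; a small sketch of the conversion (function name and the assumption of linear scaling to 1920 PEs are mine, matching the table's apparent method):

```python
def forecast_days_per_day(fc_days: float, wall_seconds: float) -> float:
    """Forecast days produced per wall-clock day."""
    return fc_days * 86400.0 / wall_seconds

t511 = forecast_days_per_day(10, 4298)  # ~201 FC_d/d on 288 PEs
t255 = forecast_days_per_day(10, 1800)  # 480 FC_d/d on 32 PEs

# Scale linearly to 1920 PEs (assumed perfect scaling, as the table implies):
print(round(t511), round(t511 * 1920 / 288))  # 201 1340
print(round(t255), round(t255 * 1920 / 32))   # 480 28800
```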

14 Relative efficiency of T255 and T511 production runs
To meet the delivery schedule, T255/L40 is run on 1 server (32 PEs) and T511/L60 on 9 servers (288 PEs)
Expected speed-up of T255/L40 vs. T511/L60:
- Horizontal resolution: x 4
- Vertical resolution: x 1.5
- Time step: x 3
- OVERALL: x 18
Actual speed-up of daily production on 1920 PEs: 28800/1340 ≈ x 21.5
Benefit of reduced communication, and other scalability issues?
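
The expected speed-up above is a simple product of cost factors (cost taken as proportional to horizontal grid points x vertical levels x number of time steps); a sketch of that arithmetic, with illustrative variable names:

```python
# Back-of-envelope cost ratio of T511/L60 (900s steps) vs. T255/L40 (2700s steps):
horizontal = 4.0        # ~(511/255)**2, i.e. 4x the horizontal grid points
vertical   = 60 / 40    # 1.5x the vertical levels
timestep   = 2700 / 900 # 3x as many time steps
expected = horizontal * vertical * timestep
print(expected)  # 18.0
```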

15 Operational timings for production Seasonal Forecast code, October 2003 (D.Salmond)
The system runs on 1 LPAR (8 PEs):

Component | Resolution | Threads/PEs | Time step
IFS Atmosphere | T95/L40 | 3 OMP threads | 3600s
HOPE Ocean | 1 degree (0.3 deg in equatorial band) | - | -
OASIS Coupler | - | 1 PE | 24 hours

Time for 6-month forecast: 10 hrs (441 FC_d/d on 8 PEs)
Equivalent on 1920 PEs: 105984 FC_d/d (288 yr/d)

16 Relative efficiency of T255 atmosphere and T95 Seasonal Forecast production runs
Expected speed-up of Seas_Fcst model vs. T255:
(The cost of the ocean is dominant; a 1-deg. ocean with 40 levels is estimated at ~T159 L40; the T95 atmosphere waits for the ocean; ignore the different costs of physics in ocean and atmosphere)
- Horizontal resolution (255/159)**2: x 2.6
- Vertical resolution (40/40): x 1.0
- Time step (3600/2400): x 1.5
- OVERALL: x 3.9
Actual speed-up of daily production on 1920 PEs: 105984/28800 ≈ x 3.67
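
The same factor decomposition as for the T255/T511 comparison, applied to the seasonal system under the slide's assumptions (ocean cost taken as ~T159 L40; variable names are illustrative):

```python
# Estimated cost ratio of the T255/L40 run vs. the ocean-dominated seasonal run:
horizontal = (255 / 159) ** 2   # ≈ 2.57, quoted as 2.6 on the slide
vertical   = 40 / 40            # 1.0, same number of levels
timestep   = 3600 / 2400        # 1.5x as many time steps
print(round(horizontal * vertical * timestep, 1))  # 3.9
```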

17 Planned Forecast system upgrades 2004-2005 on IBM phase 3
The expected resolutions are:
- Deterministic forecast & outer loops of 4D-Var: T799 (25 km) L91
- Ensemble Prediction System: T399 (50 km) L65
- Inner loops of 4D-Var: T255 (80 km) L91
- 15-day and monthly forecast system: T255 (80 km) L65 and T159 (125 km) L65

18 Drivers for 2007 Computer Upgrade
Increased computational resources are needed in 2007, to enable:
- 4D-Var inner-loop resolution T399
- An ensemble component of the data assimilation
- Further improvement of the inner-loop physics
- Increased use of satellite data (both reduced thinning and introduction of new instruments such as IASI)
- An increase in resolution for the seasonal forecasting system (to T159 L65 for the atmospheric model)

19 END: Thank you for your attention!

20 Research Directions and Operational Targets 2004
- Assimilation of MSG data and additional ENVISAT and AIRS data
- Increased vertical resolution, particularly in the vicinity of the tropopause
- Upgrades of inner-loop physics and assimilation of cloud/rainfall information
- Weekly running of the monthly forecasting system
- Preparation for upgrades in horizontal resolution
- High-resolution moist singular vectors for the EPS initial states

21 Research Directions and Operational Targets 2005
- Final validation and implementation of increases in horizontal resolution
- Validation and assimilation of new satellite data such as SSMIS, AMSR, OMI and HIRDLS
- Enhanced preparations for monitoring and assimilation of METOP data
- Seamless ensemble forecast system for medium-range and monthly forecasts

22 Research Directions and Operational Targets 2006 - 2007
- Monitoring and then assimilation of data from the METOP instruments (IASI/AMSU/HIRS/MHS/ASCAT/GRAS/GOME)
- Preparation for NPP
- Increased inner-loop resolution & enhanced inner-loop physics for 4D-Var
- Ensemble component to data assimilation
- Increased resolution and forecast range for seasonal forecasting

