1 The Centre for Australian Weather and Climate Research
A partnership between CSIRO and the Bureau of Meteorology
makebc performance
Ilia Bermous, 21 June 2012

2 Performance issues with the current version of makebc
 A makebc run for a City model on a single core (Wenming's message of 7 March 2012) takes 80-90 min, reading and uncompressing large pi files totalling ~100 GB; each pi file, containing 2 time segments at 30 min frequency for 13 model fields with a 1088x746x70 resolution for R12, is ~2.2 GB
 Performance analysis showed that a significant amount of time (~50% of the total elapsed time) is spent reading and uncompressing the input files in the following loop:
   DO JJ = 1, IY
     ICX = IC(JJ)
     CALL XPND( )
   END DO
 This loop can be parallelised with multithreading, giving a theoretical improvement on Solar of 50% + 50%/8 = 56.25% of the original elapsed time, i.e. 100%/56.25% < 1.8 times
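The theoretical bound quoted above is an application of Amdahl's law; a minimal sketch of the arithmetic, with the 50% parallel fraction and 8 threads taken from the slide:

```python
def amdahl_speedup(parallel_fraction, n_threads):
    """Theoretical speedup when only parallel_fraction of the
    elapsed time scales across n_threads (Amdahl's law)."""
    remaining = (1.0 - parallel_fraction) + parallel_fraction / n_threads
    return 1.0 / remaining

# 50% of the time in the XPND loop, spread over 8 threads:
print(round(amdahl_speedup(0.5, 8), 2))  # 1.78, i.e. just under the 1.8x quoted
```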

3 Three main approaches for performance improvement
 Approach #1: after parent model completion, parallel makebc processing of each pi file and parallel merging of the sequential LBCs
 Approach #2 (suggested by Yi Xiao) includes 2 stages:
   extracting frames from pi files during parent model execution
   makebc processing of the frame files with efficient multithreading on a single node after parent model completion
  This method can be used efficiently in research, where pi files are retrieved from the archive
 Approach #3 (Tom Green & Mike Naughton):
   on-the-fly makebc processing of sequentially produced pi files during parent model execution (the most efficient scenario is a batch job submission using a whole Solar node for each makebc process)
   sequential on-the-fly merging of "simple" LBCs into a single accumulated LBC file; the efficient way is with an appending operation, if one is available

4 Concepts of the parallel implementation (approach #1)
 makebc processing stage: all makebc tasks are run in batch jobs as parallel background processes
 Merging LBCs into a single LBC file: N is the number of LBC files to merge at merging stage (j); all merges within each merging stage (j) are done in parallel

5 Main performance advantages of the new implementation
Old scheme: a single makebc run which includes
1. sequential reading and uncompressing of all relatively large input pi files
2. sequential processing of the read information, time & date segment by segment
New scheme:
1. parallel makebc processing of each input pi file
    reading, uncompressing and generation of boundary condition files for M input pi files are done in parallel using M processes/cores
2. parallel pairwise merging of sequential LBCs in the merging tree structure

6 Main ideas in parallel processing (approach #1)
 Parallel processing of each pi file
   a number of makebc processes are packed into a batch job submitted from the main batch job
   due to significant memory requirements, which depend on the size of the pi files, the most efficient way to execute makebc tasks on a Solar node is to run no more than 3-4 processes in parallel in the background
   reasonable HPC resources are required for parallel makebc processing: a 48-hour run => 48 pi files => 3 LBCs per 8-core node => (1+16) 8-core nodes
[Diagram: batch job #1 runs parallel "makebc &" processes on pi000, pi001, pi002, producing lbc000, lbc001, lbc002]
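The node-packing arithmetic above (48 pi files at 3 makebc processes per 8-core node => 16 batch-job nodes) can be sketched as follows; the helper is illustrative, not part of makebc:

```python
def pack_into_jobs(pi_files, procs_per_node=3):
    """Group pi files into batch jobs, each running procs_per_node
    makebc background processes on one node (3 per node, as on the slide)."""
    return [pi_files[i:i + procs_per_node]
            for i in range(0, len(pi_files), procs_per_node)]

pi_files = [f"pi{n:03d}" for n in range(48)]   # a 48-hour run
jobs = pack_into_jobs(pi_files)
print(len(jobs))          # 16 batch jobs, one 8-core node each
print(jobs[0])            # ['pi000', 'pi001', 'pi002']
```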

7 Main ideas in parallel processing (approach #1), continued
Some additional comments on the implementation:
 Execution of the submitted makebc jobs is monitored within the main batch job until all pi files have been processed successfully
 In the parallel merging procedure, the LBC files are merged in stages
 If the number of merging processes running in parallel at any stage is greater than the number of cores available per node (8 on the Solar system), then the corresponding merging processes are packed into batch jobs (8 per job) which are executed in parallel

8 Parallel merging tree structure
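A minimal sketch of the merging tree: at each stage, adjacent pairs of LBC files are merged in parallel, halving the file count until a single accumulated file remains. The merge_pair callback below is a stand-in for the actual merge step (done with VarScr_UMFileUtils in the implementation), not the real tool.

```python
def merge_tree(lbc_files, merge_pair):
    """Merge LBC files stage by stage; all merges within one stage
    are independent and can run in parallel."""
    stage = list(lbc_files)
    while len(stage) > 1:
        nxt = []
        for i in range(0, len(stage) - 1, 2):
            nxt.append(merge_pair(stage[i], stage[i + 1]))
        if len(stage) % 2:          # odd file carried over to the next stage
            nxt.append(stage[-1])
        stage = nxt
    return stage[0]

# With a toy merge_pair, 4 LBC files need 2 merging stages:
print(merge_tree(["lbc0", "lbc1", "lbc2", "lbc3"],
                 lambda a, b: f"({a}+{b})"))  # ((lbc0+lbc1)+(lbc2+lbc3))
```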

9 Some issues with makebc processing of pi files produced by the R12 model
 The current makebc version (vn7.9) cannot process pi files that start at a non-whole hour; as a result, a number of source changes have been implemented by Tom Green to resolve this problem
[Diagram: pi000, pi001, pi002 on a half-hourly timeline from 09:00 to 12:00]
 The form of the makebc command for processing the pi0nn file is:
   makebc -n file.nml -i pi000 pi001 ... pi0nn -ow lbc
  with some special settings in the input file.nml namelist file, such as
   N_DUMPS=NEXT_HOUR
   A_INTF_START_HR=CURRENT_HOUR
   A_INTF_END_HR=NEXT_HOUR
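The per-file namelist settings quoted above could be generated along the following lines. The helper itself is hypothetical; only the three variable names and the CURRENT_HOUR/NEXT_HOUR pattern come from the slide.

```python
def makebc_namelist(current_hour, next_hour):
    """Render the three per-file namelist settings from the slide
    (N_DUMPS, A_INTF_START_HR, A_INTF_END_HR) for one pi file."""
    return (f"N_DUMPS={next_hour}\n"
            f"A_INTF_START_HR={current_hour}\n"
            f"A_INTF_END_HR={next_hour}\n")

# e.g. for the pi file covering the 10:00-11:00 window:
print(makebc_namelist(10, 11))
```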

10 Some possible cases of resulting LBC files after makebc processing of pi files
[Diagram: in the City model case, input files pi000, pi001, pi002 cover 09:00-12:00 at 30 min intervals; in the Regional model case, input files pi000, pi002, pi004 cover 21:00-03:00 at hourly intervals; makebc produces output files LBC1, LBC2, LBC3 in both cases]
Before the merging process:
 City model case: "first" duplicate segments are removed, and the orography field is removed from all LBCs excluding LBC1
 Regional model case: the orography field is removed from all LBCs excluding LBC1

11 Software used for removing duplicate segments and merging LBCs
 Duplicate segments are removed from the beginning of each LBC file, starting from the second LBC file
 The subset_um script, recommended by Martin Dix and based on a program developed by Alan Iwi at Reading University, is used for this purpose; at this stage the parameters to run the script are not chosen automatically, and their values depend on
   the number of fields produced in the LBC file
   the number of segments in each LBC file
 Merging LBC files is done using the VAR VarScr_UMFileUtils script with the corresponding VarProg_UMFileUtils program
 According to Tom Green, the UM mergeum utility cannot be used, as it has not been kept up to date in the latest UM versions; it is also unused at the Met Office
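The duplicate-segment rule above amounts to dropping leading time segments of an LBC file that already appear in the previous one (the real work is done by the subset_um script; this sketch only illustrates the rule, operating on segment time labels rather than UM files):

```python
def drop_leading_duplicates(prev_times, cur_times):
    """Drop leading segments of cur_times that duplicate segments
    already present in prev_times."""
    seen = set(prev_times)
    i = 0
    while i < len(cur_times) and cur_times[i] in seen:
        i += 1
    return cur_times[i:]

# LBC2's first segment (10:00) repeats the tail of LBC1, so it is dropped:
lbc1 = ["09:00", "09:30", "10:00"]
lbc2 = ["10:00", "10:30", "11:00"]
print(drop_leading_duplicates(lbc1, lbc2))  # ['10:30', '11:00']
```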

12 Manual steps for setting up the main batch script
 Separate makebc processing of pi files is required in order to understand
   how the input namelist file for makebc execution should be set up for processing this type of pi files
   what kind of output information is generated in the output LBC files:
     which (time & date) segments, and how many segments, are produced
     how many fields, including the orography field, are produced in an LBC
     whether there is any overlap between the segments produced in the LBCs

13 Results
 The performance of the implemented procedure (approach #1) has been tested on 3 model cases
 City model case (Brisbane): processing 41 pi files of ~2.2 GB each gave a best elapsed time of 3 min 47 sec (with the latest improved merging procedure and the use of Lustre file striping), running 3 makebc processes per 8-core node, compared with ~50 min of elapsed time for the current makebc processing executed on a single core => a speed-up of ~13 times
   pure makebc processing of a single pi file with 2 time & date segments takes 2 min 10 sec - 2 min 30 sec to produce an LBC with 3 time segments
 Sydney model case: using pi files from Wenming's job run yesterday, the best elapsed time (from 3 runs) of 4 min 16 sec for processing 51 pi files was obtained, compared with ~83 min taken by Wenming's makebc job (processing 48 pi files) => a speed-up of over 19 times

14 Results (cont.)
 Regional model case: processing 39 pi files of ~630 MB each gave a best elapsed time of 3 min 30 sec, compared with ~15.5 min in the standard usage case. The performance improvement is not as significant as in the City model case above, due to
   a relatively large final LBC file of just over 7 GB; unfortunately, merging relatively large files (over 1 GB) is an expensive operation even when done in parallel
   a relatively smaller value of the size(pi)/size(LBC0) ratio in the regional model case

15 Factors affecting how fast the new procedure runs compared with the standard method
 The number of pi files to process
 The size of each separate LBC file produced by makebc processing
 The size of the resulting LBC file
 The size(pi)/size(LBC0) ratio, where LBC0 denotes the LBC files resulting from makebc processing

16 Some aspects of approach #2
 Advantages
   in the City model case each frame file should be ~1000 times smaller than the pi file => < 2 MB
   makebc executed with multithreading on 8 cores will be very efficient, and the whole run, depending on the size of the resulting LBC file, should not take more than 2-3 minutes
   minimal HPC resources are required: no more than a single 8-core Solar node
 Issues to be addressed
   at the moment there are problems running the frames utility; the latest version with vn8.2 may resolve them. Currently this utility is also limited to handling hourly datasets only (Tom Green's comment)
   a special utility is required to identify whether a pi file is complete during parent model execution; this is very important with the asynchronous I/O used from UM7.8 onwards. According to Tom Green: "This is something we are also trying to understand how best to handle".

17 Some aspects of approach #3
 Advantages
   as soon as a pi file sequentially produced by the parent model is ready, it is processed by makebc running in parallel with the parent model
   after makebc processing, the obtained LBC file can be merged on the fly into the accumulated LBC version
 Some technical issues to be addressed
   as in approach #2, a special utility is required to identify whether a pi file is complete
   merging simple LBCs into the accumulated version must be done in the right order
 Notes on performance comparison with approach #1
   the last pi file is produced near parent model completion, so its makebc processing has no significant performance advantage over the parallel makebc processing implemented in approach #1
   merging the last simple LBC file with the accumulated LBC version takes a significant amount of time, comparable with the final merging stage of approach #1
Summary: I do not expect a substantial advantage of this approach over approach #1.

18 Forms of parallel processing
 The current implementation uses a set of batch jobs created in the main batch job and submitted/monitored from that job
 Other possible ways of parallel processing include:
   using the mpirun command, with the scripts for parallel processing called from a Fortran/C MPI program
   using the pbsdsh command with PBS
   using the GNU parallel utility
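In the same spirit as the alternatives listed above, the per-file tasks could also be driven from a fixed-size worker pool; this is an assumption sketched for illustration, not the deck's implementation, and run_task is a placeholder for invoking the real makebc wrapper script on one pi file.

```python
from concurrent.futures import ThreadPoolExecutor

def run_task(pi_file):
    # a real worker would launch the makebc wrapper here (e.g. via subprocess);
    # this placeholder just reports which LBC the file would produce
    return f"{pi_file} -> lbc{pi_file[2:]}"

# 3 workers, matching the 3 makebc processes per node used earlier:
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_task, ["pi000", "pi001", "pi002"]))
print(results)  # ['pi000 -> lbc000', 'pi001 -> lbc001', 'pi002 -> lbc002']
```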

19 Conclusions
 The implemented parallel makebc processing can reduce the corresponding elapsed times by a factor of 10-13, or even more
 The numerical results produced by the forecast model are identical to the previous results obtained with the old makebc execution
 The main job ASCII output differs in 2 output lines:
   "Last Validity time" line: during the merging procedure this line is taken from the first LBC file, whereas in the old makebc run the value corresponds to the time taken from the last pi file
   "Model Data" line: the first number (the starting address of the data in the file after the header) is different, and the other 2 numbers have non-zero values; these differences are harmless and can at the moment be fixed using Martin Dix's change_inthead.py Python utility
 It is still worthwhile to try the frames approach (#2), which can be more efficient in HPC resource usage while at the same time providing better performance through multithreading; however, any saving in elapsed time achievable with this approach would be minimal, no more than 1-3 min on our Solar system

