13 May 2014: Supervision Workshop. Morning: outgoing messages. IPSL: general introduction: supervision of the computing and post-processing chain at TGCC.

1 13 May 2014: Supervision Workshop
Morning: outgoing messages
- IPSL: general introduction: supervision of the computing and post-processing chain at TGCC and IDRIS
- IPSL: needs and outline solutions for sending information messages from the computing centres to IPSL
- IDRIS: existing tools and constraints
- TGCC: existing tools and constraints
- Discussion
- Action plan
Afternoon: incoming jobs
- IPSL: needs for relaunching computing and/or post-processing jobs from IPSL
- IDRIS: job submission: which solutions? which APIs? which constraints?
- TGCC: job submission: which solutions? which APIs? which constraints?
- Discussion
- Action plan

2 ANR MN2013 CONVERGENCE
Tasks:
- T0: management
- T1: platform
- T2: towards a high-resolution coupled model
- T3: runtime environments
- T4: Big Data management and analytics of Climate Simulations
- T5: CliMAF: a framework for climate models evaluation and analysis
Diagram labels:
- ensemble of tools, different configurations, different resolution, set of simulations, set of diagnostics, assessment
- Improving coupled model parallelism in terms of computing and memory
- Managing efficiently input and restart files
- Integrating parallel interpolation mechanisms in XIOS
- Parallel component coupling
- Process assignment
- Optimization, load balancing
- Climate Simulations Supervision
- XIOS implemented within project models
- XIOS, a bridge towards standardisation
- Data and metadata services
- Big Data Analytics
- General driver and upstream user interface
- Services layer
- Visualization tools
- Evaluation and monitoring diagnostics
- IPSL implementation
- GAME-CERFACS implementation

3 Task 3.3 : Runtime Environment
Leaders: Arnaud Caubel and Marie-Alice Foujols
Contributors: IPSL, CERFACS, IDRIS, CNRS-GAME
Help and expertise: TGCC, MDLS
Task | IPSL | CERFACS | CNRS-GAME | IDRIS | TGCC | MDLS
3.1 Process assignment: X X X
3.2 Optimisation, load balancing: X X X
3.3 Climate Simulations Supervision: X X X

4 Task 3.3 : Climate Simulations Supervision
Diagram labels: Launch a simulation with libIGCM; IDRIS or TGCC: Computing, Post-processing; Supervisor; User; Jussieu; ?; Commands
Objective: libIGCM as a self-healing application: more reliability, less human intervention

5 Task 3.3 : Climate Simulations Supervision
Context
- one simulation: 3 weeks running, files, 25 TB, 1000 jobs: 40 computing and post-processing
- static workflow vs dynamic workflow
Development of a supervisor agent
- detect and understand failure events
- understand the ultimate goals of the workflow
- re-plan, re-schedule, re-map the workflow
Tasks for the supervisor
- event log in a comprehensive call tree (job sub., work to be done, each cp, ...)
- reliable lightweight communication channel between client agents and server agents (RabbitMQ implementation of AMQP)
- call tree traversal capabilities to determine checkpoint restart
- autonomous rescheduling of necessary jobs
- monitoring capabilities: coloured graphs with all jobs and status
- regression test handling capabilities
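The call tree traversal mentioned above could look roughly like the following. This is only a minimal sketch under stated assumptions: the `Event` class, its fields, the event names and the `restart_point` function are all hypothetical and are not part of libIGCM or of the supervisor design. It assumes events are logged in chronological order and that some leaf events (for example a restart file being written) are flagged as checkpoints.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Event:
    """One node of the event call tree (hypothetical structure)."""
    name: str
    status: str = "ok"            # "ok" or "failed"
    is_checkpoint: bool = False   # True if a restart can begin after this event
    children: List["Event"] = field(default_factory=list)


def walk(event: Event) -> Iterator[Event]:
    """Yield events depth-first, parents before children (logging order)."""
    yield event
    for child in event.children:
        yield from walk(child)


def restart_point(root: Event) -> Optional[str]:
    """Return the last checkpoint reached before the first failed event.

    Returns None when nothing failed or when no checkpoint precedes the failure.
    """
    last_checkpoint = None
    for event in walk(root):
        if event.status == "failed":
            return last_checkpoint
        if event.is_checkpoint:
            last_checkpoint = event.name
    return None


# Hypothetical log: two periods completed, the third one failed.
tree = Event("simulation", children=[
    Event("compute_period_1", children=[Event("write_restart_1", is_checkpoint=True)]),
    Event("compute_period_2", children=[Event("write_restart_2", is_checkpoint=True)]),
    Event("compute_period_3", children=[Event("model_run_3", status="failed")]),
])
print(restart_point(tree))  # write_restart_2
```

Flagging only leaf events as checkpoints keeps the traversal simple: a parent job is never treated as a safe restart point before all of its children have been inspected.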

6 Task 3.3 : Climate Simulations Supervision
Milestone | Date | Description | IPSL | CERFACS | CNRS-GAME | IDRIS | TGCC | MDLS
MS3.3a | M12: 10/2014 | Supervisor agent architectural design | X X X x
MS3.3b | M24: 10/2015 | Supervisor agent release candidate: enabling control channel, full event logs, call tree traversal capabilities and regression test handling | X X X x
MS3.3c | M48: 10/2017 | Supervisor agent final release: successful rescheduling for known failure | X X X x

7 Task 3.3 : Climate Simulations Supervision
Additional manpower:
- CDD 21 pm IPSL (tasks 3.1, 3.2, 3.3) + CDD/IDRIS 6 pm
- Subcontractor IPSL 42 pm (task 3.3)
- TGCC/CEA: service contract?
Success criteria: a significant number of "standard" (i.e. non-expert) users of the Earth System model launch typical climate simulations (including developments done in this WP) using the libIGCM runtime environment on HPC centres (IDRIS and TGCC).
Identified risks:
- if it is not possible to install the supervisor agent: lighter installation with warning instead of correction
- the supervisor must be as transparent as possible: lighter usage, i.e. de/activation of main tasks/secondary tasks
Planning for the next 6 months: meeting/workshop to be planned to discuss "Supervisor Design" (task 3.3)

8 Diagram labels: Computing job: PeriodLength, PeriodNb. Post-processing jobs: RebuildFrequency, PackFrequency, SeasonalFrequency, TimeSeriesFrequency, pack_debug.

9 Generic job: AA_Job PeriodLength

10 TGCC computers and file system in a nutshell (October 2013)
Computers:
- curie front-end
- curie thin nodes (-q standard)
- curie hybrid nodes (-q hybrid)
- curie large nodes (-q xlarge)
- airain front-end, airain nodes
File system: $HOME, $CCCWORKDIR, $CCCSTOREDIR, $SCRATCHDIR; HPSS: robotic tapes; dods/work, dods/store
Contents:
- sources, small results
- IGCM_OUT: MONITORING/ATLAS
- temporary REBUILD; IGCM_OUT: files to be packed, outputs of post-proc jobs
- IGCM_OUT: packed results (Output, Analyse: SE and TS)
Commands: cp, dods_cp, ccc_hsm get
Legend: Temporary space; Saved space; Non saved space; Space on tapes; Small precious files; compute; login; Visible from www; quotas

11 Job_EXP00 at TGCC
- Compute (curie): Job_EXP00, PeriodLength; $SCRATCHDIR/IGCM_OUT/XXX/Output; $SCRATCHDIR/IGCM_OUT/XXX/Restart, Debug
- Post (curie), RebuildFrequency: rebuild; $SCRATCHDIR/IGCM_OUT/.../REBUILD
- Post (curie), PackFrequency: pack_output (ncrcat) → $CCCSTOREDIR/IGCM_OUT/XXX/Output; pack_restart, pack_debug (tar) → $CCCSTOREDIR/IGCM_OUT/.../RESTART, DEBUG
- Post (curie), TimeSeriesFrequency: create_ts, monitoring
- Post, SeasonalFrequency: create_se, atlas
- DodsCopy=TRUE/FALSE
- TS and SE: $CCCSTOREDIR/IGCM_OUT/… → dods/store
- MONITORING and ATLAS: $CCCWORKDIR → dods/work

12 IDRIS computers and file system in a nutshell (October 2013)
Computers (login and compute): adapp, ada, adapp front-end, turing front-end, turing compute
File system: $HOME, $WORKDIR, $TMPDIR; gaya; robotic tapes; dods
Contents:
- sources, small results
- temporary REBUILD; IGCM_OUT: files to be packed, outputs of post-proc jobs
- IGCM_OUT: Output, Analyse, MONITORING/ATLAS
Commands: mfput/mfget, dmput/dmget, dods_cp
Legend: Temporary space; Saved space; Non saved space; Space on tapes; Small precious files; compute; login; Visible from www

13 Job_EXP00 at IDRIS
- Compute (ada): Job_EXP00, PeriodLength; $WORKDIR/IGCM_OUT/XXX/Output; $WORKDIR/IGCM_OUT/XXX/Restart, Debug
- Post (adapp), RebuildFrequency: rebuild; $WORKDIR/IGCM_OUT/.../REBUILD
- Post (adapp), PackFrequency: pack_output (ncrcat) → gaya:IGCM_OUT/XXX/Output; pack_restart, pack_debug (tar) → gaya:IGCM_OUT/.../RESTART, DEBUG
- Post (adapp), TimeSeriesFrequency: create_ts, monitoring
- Post, SeasonalFrequency: create_se, atlas
- DodsCopy=TRUE/FALSE
- gaya:IGCM_OUT/… → dods.idris.fr

14 CM5AEH01 :

15 CM5AEH01 - 500 years
- PeriodLength=1M → 240 computing jobs, PeriodNb=12, 60
- RebuildFrequency=1Y → 432 rebuild
- PackFrequency=10Y → 43 pack_debug, 43 restart, 43 output
- SeasonalFrequency=50Y → 8 create_se and 32 atlas
- TimeSeriesFrequency=10Y → 757 create_ts and 43 monitoring
12 manual interventions → 12/1641 = 0.73%
Incident | Detection | Remedy | Supervision
1- Fatal computing | One e-mail | Clean_month and relaunch | 3 attempts
2- Fatal post-processing, Fatal computing | Two e-mails | Clean_year and relaunch | 3 attempts
3- Computing job missing | Manual: log in and run RunChecker.job | Clean_month and relaunch | heartbeat
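The job total and the intervention rate on this slide can be checked with a few lines of arithmetic. The sketch below copies the job counts from the slide; the dictionary keys are illustrative names (the slide's "43 restart" and "43 output" are taken to be pack_restart and pack_output), and the stray "60" after PeriodNb=12 is left out because its meaning is not recoverable. The final consistency check assumes 432 simulated years, a value that is not stated on the slide and is only inferred from the 432 rebuild jobs at RebuildFrequency=1Y.

```python
# Job counts copied from the slide (CM5AEH01); key names are illustrative.
job_counts = {
    "compute": 240,
    "rebuild": 432,
    "pack_debug": 43,
    "pack_restart": 43,
    "pack_output": 43,
    "create_se": 8,
    "atlas": 32,
    "create_ts": 757,
    "monitoring": 43,
}
manual_interventions = 12

total_jobs = sum(job_counts.values())
rate_percent = 100 * manual_interventions / total_jobs
print(total_jobs, round(rate_percent, 2))  # 1641 0.73

# ASSUMPTION: 432 simulated years, inferred from 432 rebuild jobs at 1Y.
simulated_years = 432
expected = {
    "rebuild": simulated_years // 1,       # RebuildFrequency=1Y
    "pack_output": simulated_years // 10,  # PackFrequency=10Y
    "create_se": simulated_years // 50,    # SeasonalFrequency=50Y
}
for name, count in expected.items():
    print(name, count, count == job_counts[name])
```

Under that assumption, the rebuild, pack and create_se counts are all consistent with one job per completed frequency interval.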

16 CM5AEH01 : RunChecker.job

17 CM5AEH01: errors encountered
Computing job errors:
- Fatal: error writing restartphy, job blocked for a few hours → clean_month and relaunch
- Fatal: SLURM error → clean_month and relaunch
- Fatal: blocked for 3 hours, killed → clean_month and relaunch
Post-processing job errors:
- 1999, 2000, 2118 and 2127: pack_restart (1999) and rebuild hit the time limit → pack_r and rebuild relaunched and, if needed, pack_output (2119, 2129)
- 2166 and 2174: rebuild failed, IGCM_sys_rebuild[1860]: /ccc/cont003/home/dsm/p86ipsl/X64_CURIE/bin/rebuild: cannot execute [Permission denied] → rebuild relaunched
- 2059, 2079, 2119 and 2129: pack_output launched too early → pack_output relaunched
Other errors:
- 13 monitoring jobs failed: unstable environment problems (nco) between 10/3 and 30/31
- 1 rebuild submission failed, resource temporarily unav: 3 attempts in libIGCM v2.2
- IDRIS: all jobs disappeared
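The "3 attempts" policy mentioned for job submission can be illustrated with a small retry loop. This is a hedged sketch, not the libIGCM v2.2 implementation (which is not shown in the slides): `submit_with_retries` and the `submit` callable are hypothetical names, and the callable stands in for whatever command actually submits the job.

```python
import time
from typing import Callable


def submit_with_retries(submit: Callable[[], bool], attempts: int = 3,
                        delay_s: float = 0.0) -> int:
    """Call submit() until it succeeds; return the number of attempts used.

    Raises RuntimeError if every attempt fails.
    """
    for attempt in range(1, attempts + 1):
        if submit():
            return attempt
        if attempt < attempts:
            time.sleep(delay_s)  # wait before trying again
    raise RuntimeError(f"submission failed after {attempts} attempts")


# Hypothetical submission that fails twice (e.g. a transient resource
# error) and succeeds on the third try.
outcomes = iter([False, False, True])
print(submit_with_retries(lambda: next(outcomes)))  # 3
```

Raising an exception once the attempts are exhausted leaves the decision to the caller, for example sending a warning instead of attempting a correction.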


