
1 CyberShake Study 15.4 Technical Readiness Review

2 Study 15.4 Scientific Goals
- Calculate a 1 Hz map of Southern California
- Produce meaningful 2 second results for the UGMS
  - RotD50 and RotD100 at 2, 3, 4, 5, 7.5, 10 seconds
  - Contour maps
- Compare 0.5 Hz and 1 Hz hazard maps
- Use the Graves & Pitarka (2014) rupture generator with regularly spaced hypocenters
- 336 sites (10 km mesh, points of interest, “gap” sites)
  - Run the 14 UGMS sites first
- Produce 1 Hz seismograms which could be combined with BBP high-frequency seismograms

3 Study 15.4 Technical Goals
- Run CyberShake across Blue Waters and Titan
  - SGT and post-processing workflows on Blue Waters
  - SGTs only on Titan
  - All SGTs to be calculated on GPUs
- Will measure the CyberShake application makespan (see the sketch below)
  - Equivalent to the makespan of all of the workflows: (all jobs complete) minus (first workflow submitted)
  - Includes hazard curve calculation time
  - Includes system downtime and workflow stoppages
- Compare performance of Blue Waters and Titan
- Compare 1 Hz performance to previous studies
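A minimal sketch of the makespan measurement defined above, assuming hypothetical per-workflow records with submit and completion timestamps (the field names are illustrative, not the actual Pegasus/monitord schema):

```python
from datetime import datetime

# Hypothetical per-workflow records; in practice these timestamps would
# come from the start and end of the workflow logs.
workflows = [
    {"submitted": "2015-04-16 08:00:00", "finished": "2015-04-18 13:45:00"},
    {"submitted": "2015-04-16 09:30:00", "finished": "2015-04-19 02:10:00"},
]

fmt = "%Y-%m-%d %H:%M:%S"
first_submit = min(datetime.strptime(w["submitted"], fmt) for w in workflows)
last_finish = max(datetime.strptime(w["finished"], fmt) for w in workflows)

# Application makespan: wall-clock time from the first workflow submission
# to the last job completion, so downtime and stoppages are included.
makespan = last_finish - first_submit
print("Makespan:", makespan)
```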

4 Performance Enhancements
- Pegasus cleanup used to decrease temp storage
  - Should avoid the Study 14.2 problem of running out of scratch space
- Parallel version of the AWP reformat code
  - 65% reduction in runtime
- DirectSynth for post-processing
  - Single job for the entire post-processing
  - Reads in SGTs across nodes, communicates via MPI
  - Reduces 1 Hz post-processing CPU-hours by 75%
- MD5 sums in the post-processing workflow moved out of the critical path
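The MD5 checks in the last bullet can run as a standalone verification job off the critical path. A minimal sketch, assuming a file path and an expected digest passed on the command line (the script name and invocation are illustrative, not the production CyberShake code):

```python
import hashlib
import sys

def md5sum(path, chunk_size=64 * 1024 * 1024):
    """Compute the MD5 digest of a (potentially very large) file in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

if __name__ == "__main__":
    # Usage: python verify_md5.py <file> <expected_md5_hex>
    path, expected = sys.argv[1], sys.argv[2]
    actual = md5sum(path)
    if actual != expected.lower():
        # A non-zero exit status lets the workflow manager treat this as a
        # failed job and abort the dependent jobs.
        sys.exit("MD5 mismatch for %s: %s != %s" % (path, actual, expected))
```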

5 Proposed Study Sites (336)
- Green sites on the site map are the 50 new “gap” sites

6 Study 15.4 Data Products
- CVM-S4.26 Los Angeles-area hazard maps
  - RotD100 at 2, 3, 4, 5, 7.5, 10 sec
  - RotD50 at 2, 3, 4, 5, 7.5, 10 sec
  - Geometric mean at 2, 3, 5, 10 sec
- Hazard curves for 336 sites, at 2, 3, 5, 10 sec
- 2-component seismograms for all ruptures (~160M)
- Peak amplitudes in the database
  - Geometric mean at 2, 3, 5, 10 sec
  - RotD100 and RotD50 at 2, 3, 4, 5, 7.5, 10 sec

7 Study 15.4 Notables
- First 1 Hz hazard maps
- First study with RotD50 and RotD100 calculated
- First study to use OLCF Titan
- First study with the Graves & Pitarka (2014) rupture generator with uniformly spaced hypocenters
- First study with 200 m rupture grid point spacing
- First study with the source filtered at a different frequency than the simulation frequency

8 Study 15.4 Parameters
- 1.0 Hz deterministic
- 100 m spacing
- dt = 0.005 sec
- nt = 40,000 timesteps
- CVM-S4.26
- Vs min = 500 m/s
- UCERF 2
- Graves & Pitarka (2014) rupture variations
- 200 m rupture grid point spacing
- Source filtered at 2.0 Hz
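For reference, the timestep parameters above imply a simulated duration of

$$ T_{\mathrm{sim}} = n_t \, \Delta t = 40{,}000 \times 0.005\ \mathrm{s} = 200\ \mathrm{s}. $$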

9 Changes to SGT Software Stack
- UCVM 14.3
- SGTs
  - Only using the AWP-ODC-SGT GPU version, on 800 nodes per component
- PostAWP
  - Changed to the parallel version of the AWP reformatting code (65% speedup)
  - Reduced read sizes to avoid an issue with the Titan filesystem
  - Separated the MD5 sum calculation into a separate job
- Handoff
  - Modified the handoff job from Study 13.4 to provide an interface between Titan SGTs and Blue Waters post-processing

10 Changes to PP Software Stack
- Rupture generator V3.3.1
- Extraction & synthesis
  - Created a new job, DirectSynth (see the rank-layout sketch below)
  - Single job for all seismogram synthesis, PSA, and RotD calculation
  - A set of SGT handler processes reads in the SGTs
  - A set of workers performs synthesis, requesting SGTs from the SGT handlers
  - Data products are sent to the master, which writes to the filesystem
  - Will use 1024 SGT handlers and 2560 workers per job
- Additional database jobs added for RotD data
  - Checks, insertions, curve calculations
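A minimal mpi4py sketch of the rank layout described above (one master, a block of SGT handler ranks, and the remaining ranks as synthesis workers); the handler count and role logic here are illustrative only, not the actual DirectSynth implementation:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Illustrative rank layout (the real DirectSynth job uses 1024 handlers
# and 2560 workers): rank 0 is the master, the next N_HANDLERS ranks
# hold SGTs in memory, and the remaining ranks are synthesis workers.
N_HANDLERS = 4

if rank == 0:
    role = "master"        # gathers finished data products, writes to disk
elif rank <= N_HANDLERS:
    role = "sgt_handler"   # reads a slice of the SGTs, serves worker requests
else:
    role = "worker"        # requests SGTs, synthesizes seismograms/PSA/RotD

print("rank %d of %d acting as %s" % (rank, size, role))
comm.Barrier()
```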

11 Changes to Workflows
- Post-processing workflows are simpler
  - Only 1 job for extraction and synthesis
  - MD5 sum out of the critical path, but will abort the workflow if it fails
- Auto-submit cron tool on shock enhanced (see the sketch below)
  - List of sites to execute is passed to the cron job
  - Will start with the 14 UGMS sites
  - Will repeat the first 2 sites (CCP, COO) on Titan and Blue Waters for verification
  - Maintains a constant number of workflows: when more are needed, selects the next site(s) and creates, plans, and runs the workflow
  - Now supports SGT, PP, and full workflows
  - Dynamically assigns workflows to remote resources
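A minimal sketch of the "maintain a constant number of workflows" logic, assuming hypothetical helper functions (count_active_workflows, plan_and_submit) that stand in for the real condor_q / pegasus-plan invocations; the target count and resource names are illustrative:

```python
import itertools

TARGET_WORKFLOWS = 10  # illustrative; not the production setting

def count_active_workflows():
    """Placeholder: query the workflow manager for workflows still running."""
    raise NotImplementedError

def plan_and_submit(site, resource):
    """Placeholder: create, plan, and run one SGT/PP/full workflow."""
    raise NotImplementedError

def autosubmit(pending_sites, resources=("bluewaters", "titan")):
    """Run periodically from cron on shock to top up the workflow pool."""
    resource_cycle = itertools.cycle(resources)   # simple round-robin assignment
    needed = TARGET_WORKFLOWS - count_active_workflows()
    while needed > 0 and pending_sites:
        site = pending_sites.pop(0)               # next site from the study list
        plan_and_submit(site, next(resource_cycle))
        needed -= 1
```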

12 Workflow Hierarchy
- Integrated workflow (one per site per model), comprising:
  - SGT workflow (Blue Waters or Titan): PreCVM (creates volume), Generate AWP Workflow, AWP Workflow
  - Handoff (Titan/BW interface)
  - Post-processing workflow (Blue Waters & shock): PP Pre Workflow, PP Main Workflow, DB Workflow
- 336 workflows for Study 15.4
- AutoSubmit cron job on shock.usc.edu

13 Distributed Processing
- Pegasus 4.5.0 RC, HTCondor 8.2.8, Globus 5.2.5
- Cron job on shock.usc.edu creates, plans, and runs SGT, PP, and full workflows
- Jobs submitted to Blue Waters via GRAM
- Results staged back to shock, DB populated, curves generated
- Jobs submitted to Titan using pilot jobs
  - Cannot submit jobs to Titan directly, due to security

14 Titan Distributed Processing
1. Every 5 minutes, the monitor daemon (monitor_daemon.py on shock.usc.edu) queries the Condor queue on shock.
2. If there are more Titan SGT workflows in the Condor queue than sets of pilot jobs in the Titan batch queue, a new set of pilot jobs (pre-SGT, SGT, post-SGT) is submitted, chained with qsub dependencies.
3. These pilot jobs start up Condor processes (condor_master) which call back to the shock Condor collector and can be assigned work.
(The slide diagram shows the shock Condor queue holding SGT and PP workflows bound for Titan and Blue Waters, and the Titan batch queue holding chained pre-SGT / SGT / post-SGT pilot jobs; a sketch of the monitor loop follows.)
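A minimal sketch of the monitor-daemon loop in steps 1-3, using condor_q and qsub via subprocess; the ClassAd constraint, PBS script names, and the pilot-set counting are placeholders, not the actual monitor_daemon.py:

```python
import subprocess
import time

CHECK_INTERVAL = 300  # seconds ("every 5 minutes")

def count_titan_sgt_workflows():
    """Count Titan-bound SGT workflows in the shock Condor queue.
    The ClassAd constraint is illustrative; the real attribute names differ."""
    out = subprocess.run(
        ["condor_q", "-constraint", 'TargetResource == "titan"',
         "-format", "%d\n", "ClusterId"],
        capture_output=True, text=True, check=True).stdout
    return len(out.split())

def count_titan_pilot_sets():
    """Placeholder: count pilot-job chains already in the Titan batch queue
    (the real daemon would inspect the Titan queue, e.g. qstat over ssh)."""
    return 0

def submit_pilot_set():
    """Submit a pre-SGT / SGT / post-SGT pilot chain with qsub dependencies.
    The PBS script names are hypothetical."""
    pre = subprocess.run(["qsub", "pre_sgt.pbs"],
                         capture_output=True, text=True, check=True).stdout.strip()
    sgt = subprocess.run(["qsub", "-W", "depend=afterok:" + pre, "sgt.pbs"],
                         capture_output=True, text=True, check=True).stdout.strip()
    subprocess.run(["qsub", "-W", "depend=afterok:" + sgt, "post_sgt.pbs"], check=True)

if __name__ == "__main__":
    while True:
        if count_titan_sgt_workflows() > count_titan_pilot_sets():
            submit_pilot_set()
        time.sleep(CHECK_INTERVAL)
```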

15 Computational Requirements
- Per site: ~3720 node-hrs
  - SGTs: depends on the execution site (~50% of the cost)
    - Titan: 2110 node-hrs / 63,300 SUs
    - Blue Waters: 1760 node-hrs / 30,200 SUs
    - More expensive on Titan because of padding in the pilot jobs and a different node-hrs -> SU conversion
  - PP: 1880 node-hrs / 60,200 SUs (~50% of the cost)
- Computational time:
  - Titan (SGTs): 355K node-hours / 10.7M SUs
  - Blue Waters: 928K node-hours
    - SGTs: 275K GPU node-hrs, 21K CPU node-hrs
    - PP: 632K CPU node-hrs
- Titan has 104M SUs remaining
- Blue Waters has 5.3M node-hrs remaining
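These totals are consistent with the per-site costs above if the SGT workload is split evenly between the two machines (168 sites each), with post-processing for all 336 sites on Blue Waters; the even split is an assumption for this check, not a stated plan:

```python
sites = 336
titan_sgt_sites = sites // 2            # assumed 50/50 SGT split: 168 sites
bw_sgt_sites = sites - titan_sgt_sites  # remaining 168 sites

titan_sgt_nodehrs = titan_sgt_sites * 2110   # ~354K (slide: 355K)
titan_sgt_sus = titan_sgt_sites * 63300      # ~10.6M (slide: 10.7M)
bw_sgt_nodehrs = bw_sgt_sites * 1760         # ~296K (slide: 275K GPU + 21K CPU)
bw_pp_nodehrs = sites * 1880                 # ~632K (slide: 632K)
bw_total = bw_sgt_nodehrs + bw_pp_nodehrs    # ~928K (slide: 928K)

print(titan_sgt_nodehrs, titan_sgt_sus, bw_sgt_nodehrs, bw_pp_nodehrs, bw_total)
```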

16 Storage Requirements
- Titan
  - Purged: 526 TB (for SGTs and temp data)
- Blue Waters
  - Delayed purge: 506 TB (for Titan SGTs)
  - Purged: 526 TB of SGTs + 9 TB of data products
- SCEC
  - Archived: 9.1 TB (seismograms, PSA, RotD)
  - Database: 268 GB (geometric mean at 4 periods, RotD at 6)
  - Temporary: 608 GB (workflow logs)
  - Shared SCEC disks have 171 TB free

17 Metrics Gathering
- Monitord for workflow metrics
  - Will run during the workflows, since DirectSynth dramatically cuts the number of tasks
- Python scripts to calculate standard metrics (see the sketch below)
  - Cron job on Blue Waters: core usage over time, jobs running and idle
  - Pilot monitor process on Titan: core usage, jobs running and idle
- Will use the start and end of the workflow logs to perform the makespan measurement
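A minimal sketch of the kind of cron metrics script described above, assuming PBS/Torque-style `qstat -u` output with the job state (R/Q) in the next-to-last column; the output file and parsing are illustrative, not the production scripts:

```python
import csv
import getpass
import subprocess
import time

def sample_queue(user):
    """Count running and idle batch jobs for one user via qstat.
    Assumes the job state appears in the next-to-last column of each job line."""
    out = subprocess.run(["qstat", "-u", user],
                         capture_output=True, text=True, check=True).stdout
    running = idle = 0
    for line in out.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        state = fields[-2]
        if state == "R":
            running += 1
        elif state == "Q":
            idle += 1
    return running, idle

if __name__ == "__main__":
    # Append one timestamped sample per cron invocation.
    running, idle = sample_queue(getpass.getuser())
    with open("queue_metrics.csv", "a", newline="") as f:
        csv.writer(f).writerow([int(time.time()), running, idle])
```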

18 Monitoring Tools
- Will use the Study Manager to track progress
  - Hosted on northridge.usc.edu
  - Tracks the number of runs in each state
  - Estimates completion time based on velocity
- Run Manager tracks the status of individual runs
- If there are errors, will dig into individual logs

19 Estimated Duration
- Limiting factors:
  - Queue time for XK nodes (800 XK nodes is 19% of Blue Waters' XK partition)
  - Titan -> Blue Waters transfers: if throughput is very high, the transfer could become the bottleneck
  - USC HPC downtime for ~1 week in April
- Estimated completion is 12 weeks (11 running + 1 downtime)
  - Based on the same node availability as Study 14.2
- Planning to request a reservation on Blue Waters
- Planning to request high priority on Titan

20 Personnel
- Support Scientists: Tom Jordan, Kim Olsen, Rob Graves
- Technical Lead: Scott Callaghan
- Job Submission / Run Monitoring: Scott Callaghan, David Gill, Phil Maechling
- NCSA Support: Omar Padron, Tim Bouvet
- Titan Support: Val Anantharaj
- USC Support: John Yu, John Mehringer
- Workflow Support: Karan Vahi, Gideon Juve

21 Risks
- Queue times on Blue Waters for XK nodes
  - Will try to dynamically assign SGT jobs to resources
- Unforeseen complications with the Titan pilot jobs
- Globus toolkit upgrades at NCSA
  - Globus upgraded on shock and tested
  - Waiting on a new Condor release for shock
- Congestion protection events (network overloaded)
  - If triggered consistently, will need to limit the number of post-processing workflows
- Scott goes on leave before the study is complete
  - Difficult to run the study under other accounts at this time

22 Action Items
- Upgrade UCVM on Blue Waters to 14.3.0
- Notify OLCF, NCSA, and USC of the study
- Request a reservation on Blue Waters
- Request increased priority on Titan
- Develop an approach for continuity of the study during Scott's leave
- Test the single workflow, multiple sites approach

23 Thanks for your time!

