
1 Experiences Running Seismic Hazard Workflows
Scott Callaghan
Southern California Earthquake Center
University of Southern California
SC13 Workflow BoF

2 Overview of SCEC Computational Problem
Domain: Probabilistic Seismic Hazard Analysis (PSHA)
PSHA computation is defined as a workflow with two parts:
1. "Strain Green Tensor (SGT) Workflow"
–~10 jobs; the largest are 4,000 core-hours
–Output: two 20 GB files = 40 GB total
2. "Post-processing Workflow"
–High throughput: ~410,000 serial jobs, runtimes under 1 min
–Output: 14,000 files totaling 12 GB
Uses Pegasus-WMS, HTCondor, Globus
Calculates hazard for 1 location

3 Recent PSHA Research Calculation Using Workflows – CyberShake Study 13.4
PSHA calculation exploring the impact of alternative velocity models and alternative codes, defined as 1144 workflows
Parallel Strain Green Tensor (SGT) workflow on Blue Waters
–No remote job submission support in Spring 2013
–Created a basic workflow system using qsub dependencies
Post-processing workflow on Stampede
–Pegasus-MPI-Cluster to manage serial jobs
–A single job is submitted to Stampede
–Master-worker paradigm; work is distributed to workers
–Relatively little output, so writes were aggregated in the master
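The "qsub dependencies" approach on Blue Waters can be sketched as below. This is a minimal illustration, not the actual CyberShake submission code; the PBS script names are hypothetical. The `-W depend=afterok:<jobid>` option holds the post-processing job until the SGT job exits successfully.

```shell
# Minimal sketch (assumed, not the Study 13.4 scripts) of chaining two
# PBS jobs so the second runs only after the first succeeds.
chain_jobs() {
    sgt_script=$1     # e.g. sgt_generate.pbs (hypothetical name)
    post_script=$2    # e.g. sgt_merge.pbs (hypothetical name)
    # Submit the SGT job and capture the job ID that qsub prints.
    sgt_id=$(qsub "$sgt_script") || return 1
    # afterok: release the dependent job only if the SGT job exits 0.
    qsub -W depend=afterok:"$sgt_id" "$post_script"
}
```

A chain of such submissions stands in for a remote workflow engine: the batch scheduler itself tracks the dependency and releases each job when its predecessor completes.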

4 CyberShake Study 13.4 Workflow Performance Metrics
April 17, 2013 – June 17, 2013 (62 days)
–Running jobs 60% of the time
Blue Waters usage – parallel SGT calculations:
–Average of 19,300 cores, 8 jobs
–Average queue time: 401 sec (34% of job runtime)
Stampede usage – many-task post-processing:
–Average of 1,860 cores, 4 jobs
–470 million tasks executed (177 tasks/sec)
–21,912 jobs total
–Average queue time: 422 sec (127% of job runtime)

5 Constructing Workflow Capabilities Using HPC System Services
Our workflows rely on multiple services:
–Remote gatekeeper
–Remote GridFTP server
–Local GridFTP server
Currently, issues are detected only via workflow failure
It would be helpful to have automated checks that everything is operational
Additional factors to check:
–Valid X509 proxy
–Available disk space
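The pre-flight checks asked for above could be sketched as follows. The threshold and directory are hypothetical; `grid-proxy-info -exists -valid H:M` is the standard Globus way to test proxy lifetime, but service endpoints and policies vary by site.

```shell
# Sketch of automated pre-submission checks (assumed, not SCEC's actual
# tooling): verify the X509 proxy and free disk space before launching.

check_proxy() {
    # Succeeds only if the Globus proxy remains valid for at least
    # 12 hours (enough to outlast a typical queue wait plus run).
    grid-proxy-info -exists -valid 12:00
}

check_disk() {
    # Succeeds only if directory $1 has at least $2 KB free.
    dir=$1
    need_kb=$2
    free_kb=$(df -Pk "$dir" | awk 'NR==2 {print $4}')
    [ "$free_kb" -ge "$need_kb" ]
}
```

Running such checks from the submission host before each workflow launch would turn silent mid-run failures into immediate, diagnosable errors.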

6 Why Define PSHA Calculations as Multiple Workflows?
CyberShake Study 13.4 used 1144 workflows
–Too many jobs to merge into 1 workflow
Currently, we create 1144 workflows and monitor submission via a cron job
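The cron-driven monitoring described above might look like the sketch below: keep a bounded number of workflows in flight, submitting more as slots open. The directory layout and the use of `pegasus-run` are assumptions for illustration, not the actual Study 13.4 scripts.

```shell
# Sketch (assumed layout): pending workflow submit directories live in
# $2, and each one moves to $3 once handed to Pegasus. A cron job calls
# submit_pending periodically to keep at most $1 workflows in flight.
submit_pending() {
    max=$1
    todo=$2
    run=$3
    running=$(ls "$run" | wc -l)
    for wf in "$todo"/*; do
        [ -e "$wf" ] || break                # nothing left to submit
        [ "$running" -lt "$max" ] || break   # throttle reached
        pegasus-run "$wf" || continue        # hand the workflow to Pegasus
        mv "$wf" "$run"/
        running=$((running + 1))
    done
}
```

A crontab entry invoking a script built around this function every few minutes gives rudimentary group management, which is exactly the bookkeeping the next slide argues the workflow tools themselves should provide.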

7 Improving Workflow Capabilities
Workflow tools could provide higher-level organization:
–Specify a set of workflows as a group
–Workflows are submitted and monitored
–Errors are detected and the group paused
–Easy to obtain status and progress
–Statistics are aggregated over the group

8 Summary of SCEC Workflow Needs
Workflow tools for large calculations on leadership-class systems that do not support remote job submission
–Need capabilities to define job dependencies, manage failure and retry, and track files using a DBMS
Lightweight workflow tools that can be included in scientific software releases
–For unsophisticated users, the complexity of workflow tools should not outweigh their advantages
Methods to define workflows with interactive (people-in-the-loop) stages
–Needed to automate computational pathways that have manual stages
Methods to gather both scientific and computational metadata into databases
–Needed to ensure reproducibility and support user interfaces

