Challenges in Executing Large Parameter Sweep Studies across Widely Distributed Computing Environments
Edward Walker and Chona S. Guiang

Outline
Two applications
– Zeolite structure search
– Binding energy calculations
Solutions
– Workflow
– Submission system
– Exported file system
– Resources aggregated
– Example week of work

What are Zeolites?
Crystalline micro-porous material
– Structures exhibit regular arrays of channels from 0.3 to 1.5 nm.
– When the channels are filled with water (or another substance), they make excellent molecular sieves for industrial processes and commercial products, e.g. the deodorant in cat litter.
– The acid form also has useful catalytic properties, e.g. ZSM-5 is used as a co-catalyst in crude oil refinement.
Basic building block is a TO4 tetrahedron
– T = Si, Al, P, etc.
Prior to this study, only 180 structures were known.

Scientific goals
Goal 1: Discover as many thermodynamically feasible zeolite structures as possible.
Goal 2: Populate a public database for materials scientists to synthesize and experiment with these new structures.

Computational methodology
General strategy: create a potential cell structure and solve its energy function.
Approach:
– Group potential cell structures with a similar template structure into space groups (230 groups in total).
– Each cell structure in the space group is further characterized by the space variables (a, b, c, α, β, γ).
– Solve the multi-variable energy function for each cell structure using simulated annealing.

Ligand binding energy calculations
– Binding energy is a quantitative measure of ligand affinity to a receptor.
– It is important in docking a ligand to a protein.
– Ligand binding energies can be used as the basis for scoring ligand-receptor interactions (used in structure-based drug design).

Scientific goals
– Calculate binding energies between trypsin and benzamidine at different values of the force-field parameters.
– Compare the calculated binding energies with experimental values.
– Validate the force-field parameters based on this comparison.
– Apply the approach to different ligand-receptor systems.

Computational methodology
Binding energy is calculated from:
– molecular dynamics (MD) simulations of the ligand "disappearing" in water
– MD simulations of the ligand's extinction in the solvated ligand-protein complex
– MD calculations were performed with Amber
– the extinction is parameterized by the coupling parameter λ
– each job is characterized by a different λ and different force-field parameters
[Thermodynamic cycle: S(aq) → 0(aq); E-S(aq) → E(aq); E(aq) + S(aq) → E-S(aq)]
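
The slides do not give the estimator explicitly; assuming the usual double-decoupling treatment with thermodynamic integration over the coupling parameter λ (an assumption on our part, since a perturbation estimator would serve equally well), the cycle above implies, up to standard-state and restraint corrections:

    \Delta G_{\mathrm{bind}} = \Delta G_{1} - \Delta G_{2},
    \qquad
    \Delta G_{i} = \int_{0}^{1}
        \left\langle \frac{\partial U(\lambda)}{\partial \lambda} \right\rangle_{\lambda}
        \, d\lambda

Here ΔG1 is the free energy of making the ligand disappear in water, ΔG2 the corresponding quantity in the solvated ligand-protein complex, and U(λ) the λ-coupled potential energy function.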

Computational usage: zeolite search
Ran on the TeraGrid.
Allocated over two million service units:
– one million to the Purdue Condor pool
– one million to all other HPC resources on the TeraGrid

Computational usage: ligand binding energy calculation
– Running on a departmental cluster, the TACC Condor cluster, and Lonestar.
– Each 2.5 ns simulation takes more than two weeks.
– Will require additional CPU time.

Challenge 1: Hundreds of thousands of simulations need to be run
– The energy function for every potential cell structure needs to be solved.
– Cell structures whose energy functions have feasible solutions correspond to feasible zeolite structures.
– Many sites limit the number of jobs that can be submitted to a local queue.

Challenge 2: Each simulation task is intrinsically serial
– The simulated annealing method is intrinsically serial.
– Each MD simulation (a function of λ and the force-field parameters) is serial and independent.
– Many TeraGrid sites prioritize parallel jobs.
– There are limited slots for serial jobs.

Challenge 3: Wide variability in execution times
Zeolite search
– The pseudo-random solution method iterates over 100 seeds, with potential run times of 10 minutes to 10 hours; some computations may never complete.
– It is inefficient to request a CPU for 10 hours when the computation may never need it.
– The computation is therefore re-factored into tasks of at most 2 hours.

Challenge 3: Wide variability in execution times
Ligand binding energy calculation
– Each MD simulation calculates dynamics to 2.5 ns.
– Each 2.5 ns of simulation time takes more than two weeks.
– Convergence is not assured after 2.5 ns.

Workflow: zeolite search
Level 1 is an ensemble of workflows, each evaluating a space group
– 230 space groups evaluated
Level 2 evaluates a candidate structure
– 6000 to … structures per space group
– The main task generates a solution
– A post-processing task checks the sanity of the result
– Retries up to 5 times if the results are wrong
Level 3 solves the energy function for a candidate structure
– A chain of 5 sub-tasks
– Each sub-task computes over 20 seeds, consuming at most 2 hours of compute time
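
The slides do not show the actual DAG files; as a rough illustration only, a level-3 chain with its sanity check and retries could be expressed to Condor DAGMan along these lines (the submit file and script names are hypothetical, and the passing of restart data between sub-tasks is omitted):

    # candidate_structure.dag (hypothetical): chain of 5 two-hour sub-tasks
    JOB  anneal1  anneal.submit
    JOB  anneal2  anneal.submit
    JOB  anneal3  anneal.submit
    JOB  anneal4  anneal.submit
    JOB  anneal5  anneal.submit
    PARENT anneal1 CHILD anneal2
    PARENT anneal2 CHILD anneal3
    PARENT anneal3 CHILD anneal4
    PARENT anneal4 CHILD anneal5
    # sanity-check the result of each sub-task; a failed check triggers a retry
    SCRIPT POST anneal1 check_result.sh
    RETRY  anneal1 5
    # (similar SCRIPT POST / RETRY lines would follow for anneal2..anneal5)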

Workflow: ligand binding energy calculations
– The Condor cluster has no maximum run-time limit.
– Lonestar has a 24-hour run-time limit, so MD jobs there need to be restarted.
– Workflow jobs (chains of restartable segments) therefore need to be submitted to Lonestar.

Challenge 4: Application is dynamically linked
– Amber was built with Intel shared libraries.
– These libraries are not installed on the backend nodes.
– The shared libraries can be copied to the backend, but this wastes space ($HOME on some systems is limited).
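
As an illustration of the problem (not necessarily what was done in this project), the dependencies of a dynamically linked Amber binary can be listed with ldd on the submission host; the binary path below is assumed and varies between Amber installations:

    # List the Intel shared libraries a dynamically linked sander binary needs
    ldd $AMBERHOME/exe/sander | grep -i intel

    # Bundling every resolved library with each job is possible, but it
    # duplicates tens of megabytes per job in a limited $HOME.
    ldd $AMBERHOME/exe/sander | awk '/=> \// {print $3}' | xargs tar czf amber-libs.tar.gz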

Challenge 5: Output file needs to be monitored
– Some MD simulations do not converge.
– Non-convergence can be detected by 2 ns.
– Jobs that have not converged by 2 ns should be terminated.
– No global file system exists on some systems, so the output file is not directly visible from the submission host.
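
One way to enforce this on the compute node itself is a watchdog alongside the MD run; the sketch below uses hypothetical file names and hypothetical helper scripts for the 2 ns and convergence checks, since the slides do not say how the check was actually implemented:

    #!/bin/sh
    # Hypothetical watchdog: start the MD run, poll its output file, and
    # kill the run if it reaches ~2 ns without passing a convergence check.
    sander -O -i md.in -o md.out -p prmtop -c inpcrd &
    MDPID=$!
    while kill -0 "$MDPID" 2>/dev/null; do
        sleep 600
        if ./reached_2ns.sh md.out && ! ./converged.sh md.out; then
            kill "$MDPID"    # give up on this non-converging trajectory
            break
        fi
    done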

Submission system
Want to run many simple jobs/workflows of serial tasks
– Condor DAGMan is an excellent tool for this
– but it requires a Condor pool
How to form a Condor pool from HPC systems?
– form a virtual cluster managed by Condor using MyCluster
– submit jobs/workflows to this virtual cluster
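
For concreteness, each serial task submitted into that pool would be described by an ordinary vanilla-universe submit file roughly like the one below; the executable and file names are hypothetical, not taken from the slides:

    # anneal.submit (hypothetical): one serial annealing task, queued 100 times
    universe                = vanilla
    executable              = anneal.sh
    arguments               = $(Process)
    output                  = anneal_$(Process).out
    error                   = anneal_$(Process).err
    log                     = anneal.log
    should_transfer_files   = YES
    when_to_transfer_output = ON_EXIT
    queue 100

DAGMan jobs then chain and retry tasks of this kind, as described on the workflow slides above.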

MyCluster overview
– Creates a personal virtual cluster for a user, from one system or from pieces of different systems.
– Schedules user jobs onto this cluster.
– The user can pick one of several workload managers: Condor, SGE, or OpenPBS (Condor is the one currently used on the TeraGrid).
– The user submits all their jobs to this workload manager.
– Deployed on the TeraGrid.

Starting MyCluster
– Log in to a system with MyCluster installed (the majority of TeraGrid systems; it can also be installed on other systems).
– Execute vo-login to start a session; you are now in a MyCluster shell.
[Diagram: a MyCluster shell on the user's workstation. Step 1: Create MyCluster.]

Configuring MyCluster
– The personal cluster is defined using a user-specified configuration file, which identifies which clusters can be part of the personal cluster and specifies limits on the portion of those clusters to use.
– The personal workload manager is started (Condor in this case).
[Diagram: the workstation's MyCluster shell now runs a personal Condor pool (condor_schedd, condor_collector, condor_negotiator); a PBS cluster and an LSF cluster are the target resources. Steps so far: 1. Create MyCluster; 2. MyCluster is configured.]

Submitting Work to MyCluster
– Jobs are submitted to the personal workload manager.
– For workflows, DAGMan jobs are submitted that in turn submit the individual Condor jobs.
– DAGMan is configured to submit at most 380 jobs at a time.
– The personal workload manager manages jobs just as it would for any other cluster.
[Diagram: as before, with step 3: the user submits DAGMan jobs to the personal Condor pool.]
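
DAGMan's built-in throttle can express the 380-job limit directly at submission time; a minimal sketch, with a hypothetical DAG file name:

    # Submit one space group's workflow to the personal Condor pool,
    # keeping at most 380 jobs submitted to the queue at any time.
    condor_submit_dag -maxjobs 380 spacegroup_042.dag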

MyCluster Resource Management
– MyCluster submits parallel jobs to the clusters.
– These jobs start personal workload manager daemons (condor_startd in this case).
– The daemons contact the personal workload manager to say they have resources available.
– MyCluster grows and shrinks the size of its virtual cluster based on the number of jobs it is managing.
– The file system on the workstation may be mounted on the backend.
[Diagram: step 4: MyCluster submits and manages workload manager daemons (condor_startd) on the PBS and LSF clusters; step 5: MyCluster uses XUFS to mount the workstation's file system on the remote resources.]

Example MyCluster login session

    % vo-login
    Enter GRID passphrase:            <- GRAM or SSH login
    Spawning on lonestar.tacc.utexas.edu
    Spawning on tg-login2.ncsa.teragrid.org
    Setting up VO participants......Done
    Welcome to your MyCluster/Condor environment
    To shutdown environment, type "gexit"
    To detach from environment, type "detach"
    mycluster(gtcsh.9676)% condor_status
    Name    OpSys  Arch   State      Activity  LoadAv  Mem  ActvtyTime
    …       LINUX  INTEL  Unclaimed  Idle      …       …    [?????]
    …
    …       LINUX  IA64   Unclaimed  Idle      …       …    [?????]

                  Machines  Owner  Claimed  Unclaimed  Matched  Preempting
     INTEL/LINUX  …
      IA64/LINUX  …
           Total  …

Systems aggregated with MyCluster

    TeraGrid Site   HPC System     Architecture
    NCSA            Tungsten       IA-32
    NCSA            Mercury        IA-64
    NCSA            Cobalt         IA-64
    SDSC            Tg-login       IA-64
    ANL             Tg-login       IA-64
    ANL             Tg-login-viz   IA-32
    TACC            Lonestar       X86_64

Expanding and shrinking Condor cluster created with MyCluster (1 week period)

Running and pending jobs in a personal cluster using MyCluster (1 week period)

Project Conclusion
– The zeolite allocation was completely consumed in January.
– Over 3 million new structures have been found.
– Ligand binding energy calculations are deployed on rodeo and Lonestar, will be deployed on other TeraGrid systems, and are still ongoing.

References
J. R. Boisseau, M. Dahan, E. Roberts, and E. Walker, "TeraGrid User Portal Ensemble Manager: Automatically Provisioning Parameter Sweeps in a Web Browser".
E. Walker, D. J. Earl, and M. W. Deem, "How to Run a Million Jobs in Six Months on the NSF TeraGrid" (…/papers/walker/walker.pdf).
Please contact the authors for further information.