Challenges in Executing Large Parameter Sweep Studies across Widely Distributed Computing Environments Edward Walker Chona S. Guiang.


1 Challenges in Executing Large Parameter Sweep Studies across Widely Distributed Computing Environments Edward Walker Chona S. Guiang

2 Outline

Two applications
– Zeolite structure search
– Binding energy calculations

Solutions
– Workflow
– Submission system
– Exported file system
– Resources aggregated
– Example week of work

3 What are Zeolites?

Crystalline micro-porous material
– Structures exhibit regular arrays of channels from 0.3 to 1.5 nm
– When the channels are filled with water (or another substance), zeolites make excellent molecular sieves for industrial processes and commercial products, e.g. deodorant in cat litter.
– The acid form also has useful catalytic properties, e.g. ZSM-5 is used as a co-catalyst in crude-oil refinement.

Basic building block is a TO4 tetrahedron
– T = Si, Al, P, etc.

Prior to this study, only 180 structures were known.

4 Scientific goals

Goal 1: Discover as many thermodynamically feasible zeolite structures as possible.
Goal 2: Populate a public database so material scientists can synthesize and experiment with the new structures.

5 Computational methodology

General strategy: create a potential cell structure and solve its energy function.

Approach:
– Group potential cell structures with a similar template structure into space groups (230 groups in total)
– Each cell structure in a space group is further characterized by the space variables (a, b, c, α, β, γ)
– Solve the multi-variable energy function for each cell structure using simulated annealing
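The core per-structure computation described above is a simulated-annealing minimization over the cell parameters. A minimal sketch of that loop, assuming a hypothetical `energy` callable standing in for the real zeolite energy function:

```python
import math
import random

def anneal(energy, x0, steps=10000, t0=1.0, cooling=0.999):
    """Minimize an energy function by simulated annealing.

    `energy` maps a parameter vector (e.g. a, b, c, alpha, beta, gamma)
    to a scalar; `x0` is the starting cell geometry. Step size, schedule,
    and iteration count here are illustrative, not the study's values.
    """
    x, e = list(x0), energy(x0)
    best_x, best_e = list(x), e
    t = t0
    for _ in range(steps):
        # Perturb one randomly chosen cell parameter.
        cand = list(x)
        i = random.randrange(len(cand))
        cand[i] += random.gauss(0.0, 0.05)
        e_cand = energy(cand)
        # Accept downhill moves always; uphill moves with Boltzmann probability.
        if e_cand < e or random.random() < math.exp((e - e_cand) / t):
            x, e = cand, e_cand
            if e < best_e:
                best_x, best_e = list(x), e
        t *= cooling  # geometric cooling schedule
    return best_x, best_e
```

Because each annealing run is a single sequential chain of dependent steps, this is exactly the "intrinsically serial" task shape discussed in Challenge 2 below.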

6 Ligand binding energy calculations

Binding energy is a quantitative measure of ligand affinity for a receptor.
It is important in docking a ligand to a protein.
Ligand energies can be used as a basis for scoring ligand-receptor interactions (used in structure-based drug design).

7 Scientific goals

Calculate binding energies between trypsin and benzamidine at different values of the force-field parameters.
Compare the calculated binding energies with experimental values.
Validate the force-field parameters based on this comparison.
Apply the approach to different ligand-receptor systems.

8 Computational methodology

Binding energy is calculated from:
– molecular dynamics (MD) simulations of the ligand "disappearing" in water
– MD simulations of ligand extinction in the solvated ligand-protein complex
– MD calculations were performed with Amber
– extinction is parameterized by a coupling parameter λ
– each job is characterized by a different λ and different force-field parameters

[Thermodynamic cycle: S(aq) → 0(aq); E(aq) + S(aq) → E-S(aq)]
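Once the independent per-λ MD windows finish, their averages must be combined into a single free-energy estimate. A minimal post-processing sketch, assuming the standard thermodynamic-integration formula ΔG = ∫₀¹ ⟨∂H/∂λ⟩ dλ with trapezoidal quadrature (the actual analysis protocol is not stated on the slide):

```python
def thermodynamic_integration(lambdas, dhdl_means):
    """Trapezoidal integration of per-window <dH/dlambda> averages
    over the coupling parameter lambda, yielding a free-energy change.
    `lambdas` must be sorted ascending; each entry of `dhdl_means` is
    the time-averaged dH/dlambda from one independent MD job."""
    if len(lambdas) != len(dhdl_means) or len(lambdas) < 2:
        raise ValueError("need matching lambda and <dH/dlambda> lists")
    dg = 0.0
    for i in range(len(lambdas) - 1):
        width = lambdas[i + 1] - lambdas[i]
        dg += 0.5 * (dhdl_means[i] + dhdl_means[i + 1]) * width
    return dg
```

This combination step is cheap; all of the expense (and all of the scheduling difficulty discussed later) lives in producing the per-window averages.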

9 Computational usage: zeolite search

Ran on TeraGrid.
Allocated over two million service units:
– one million to the Purdue Condor pool
– one million to all other HPC resources on TeraGrid

10 Computational usage: ligand binding energy calculation

Running on a departmental cluster, the TACC Condor cluster, and Lonestar.
Each 2.5 ns simulation takes more than two weeks.
Will require additional CPU time.

11 Challenge 1: Hundreds of thousands of simulations need to be run

The energy function for every potential cell structure needs to be solved; a feasible solution indicates a feasible structure.
Many sites limit the number of jobs that can be submitted to a local queue.

12 Challenge 2: Each simulation task is intrinsically serial

The simulated annealing method is intrinsically serial.
Each MD simulation (a function of λ and the force-field parameters) is serial and independent.
Many TeraGrid sites prioritize parallel jobs, and there are limited slots for serial jobs.

13 Challenge 3: Wide variability in execution times

Zeolite search
– The pseudo-random solution method iterates over 100 seeds, with potential run times of 10 minutes to 10 hours; some computations may never complete.
– It is inefficient to request a CPU for 10 hours when the computation may finish much sooner or never complete.
– The computation is therefore re-factored into tasks of up to 2 hours.
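Re-factoring an open-ended seed sweep into bounded tasks amounts to processing seeds until a wall-clock budget expires, then checkpointing whatever is left for the next task. A minimal sketch, with a hypothetical `solve` callable and checkpoint file name (the study's actual task wrapper is not shown on the slides):

```python
import json
import time

def run_chunk(seeds, solve, budget_s=2 * 3600, state_file="remaining.json"):
    """Process as many seeds as fit within the wall-clock budget, then
    checkpoint the unprocessed seeds so a follow-up task can resume."""
    start = time.monotonic()
    results = {}
    remaining = list(seeds)
    while remaining and time.monotonic() - start < budget_s:
        seed = remaining.pop(0)
        results[seed] = solve(seed)
    with open(state_file, "w") as f:
        json.dump(remaining, f)  # empty list means this work item is done
    return results, remaining
```

A chain of such tasks (as in Level 3 of the zeolite workflow below) makes each scheduler allocation short and predictable, even though the total work per structure is highly variable.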

14 Challenge 3: Wide variability in execution times

Ligand binding energy calculation
– Each MD simulation calculates dynamics to 2.5 ns.
– Each 2.5 ns of simulation time takes more than two weeks.
– Convergence is not assured after 2.5 ns.

15 Workflow: zeolite search

Level 1 is an ensemble of workflows, each evaluating a space group
– 230 space groups evaluated

Level 2 evaluates a candidate structure
– 6,000 to 30,000 structures per space group
– Main task generates a solution
– Post-processing task checks the sanity of the result
– Retries up to 5 times if the results are wrong

Level 3 solves the energy function for a candidate structure
– Chain of 5 sub-tasks
– Each sub-task computes over 20 seeds, consuming at most 2 hours of compute time
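In Condor DAGMan (the tool named on slide 19), the Level-2/Level-3 pattern above maps naturally onto DAG nodes. A minimal sketch for one candidate structure, with hypothetical submit-file and script names:

```
# One Level-2 candidate-structure node: the main solve task, a
# post-script that sanity-checks the result, and up to 5 retries
# if the check fails.
JOB    solve  solve.sub
SCRIPT POST solve check_sanity.sh solve.out
RETRY  solve 5

# Level-3 chain of 2-hour sub-tasks, each resuming the previous one.
JOB    stage1 stage.sub
JOB    stage2 stage.sub
PARENT solve  CHILD stage1
PARENT stage1 CHILD stage2
```

One such DAG per candidate structure, grouped per space group, gives the Level-1 ensemble; DAGMan's throttling (used later to cap submissions at 380 jobs) keeps the ensemble within site queue limits.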

16 Workflow: ligand binding energy calculations

The Condor cluster has no maximum run-time limit.
Lonestar has a 24-hour run-time limit, so MD jobs there need to be restarted.
Jobs submitted to Lonestar therefore need to be workflows.

17 Challenge 4: Application is dynamically linked

Amber was built with Intel shared libraries, and these libraries may not be installed on the backend nodes.
The shared libraries can be copied to the backend, but this wastes space ($HOME on some systems is limited).

18 Challenge 5: Output file needs to be monitored

Some MD simulations do not converge.
Non-convergence can often be detected by 2 ns, so jobs that have not converged by then should be terminated.
No global file system exists on some systems, which makes monitoring the output file difficult.
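The monitoring decision itself is simple once the output file is reachable (e.g. via the exported file system described on slide 24). A minimal sketch of a convergence check; note the `DRIFT = ...` line format is a hypothetical stand-in, not real Amber output:

```python
import re

def check_convergence(mdout_path, threshold=0.1):
    """Scan an MD output file for the most recent energy-drift value and
    report whether the run has converged. The 'DRIFT = <float>' line
    format is an assumed placeholder for whatever quantity the real
    analysis monitors."""
    drift = None
    with open(mdout_path) as f:
        for line in f:
            m = re.search(r"DRIFT\s*=\s*(-?\d+\.\d+)", line)
            if m:
                drift = abs(float(m.group(1)))
    if drift is None:
        return False  # nothing reported yet; keep the job running
    return drift < threshold
```

A periodic wrapper can call this as the simulation passes 2 ns and terminate the job if it returns False, reclaiming the more-than-two-week allocation a doomed run would otherwise consume.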

19 Submission system

Want to run many simple jobs/workflows of serial tasks
– Condor DAGMan is an excellent tool for this
– but it requires a Condor pool

How to form a Condor pool from HPC systems?
– form a virtual cluster managed by Condor using MyCluster
– submit jobs/workflows to this cluster

20 MyCluster overview

Creates a personal virtual cluster for a user
– from one system or from pieces of different systems

Schedules user jobs onto this cluster
– User can pick one of several workload managers: Condor, SGE, OpenPBS (Condor currently on TeraGrid)
– User submits all their jobs to this workload manager

Deployed on TeraGrid
– http://www.teragrid.org/userinfo/jobs/mycluster.php

21 Starting MyCluster

Log in to a system with MyCluster installed
– the majority of TeraGrid systems
– it can be installed on other systems

Execute vo-login to start a session
– you're now in a MyCluster shell

[Diagram: workstation creates a MyCluster shell (step 1)]

22 Configuring MyCluster

The personal cluster is defined using a user-specified configuration file
– Identifies which clusters can be part of the personal cluster
– Specifies limits on the portion of those clusters to use

The personal workload manager is started
– Condor in this case (condor_schedd, condor_collector, condor_negotiator)

[Diagram: the MyCluster shell on the workstation is configured over a PBS cluster and an LSF cluster (step 2)]

23 Submitting work to MyCluster

Jobs are submitted to the personal workload manager
– for workflows, DAGMan jobs are submitted that in turn submit individual Condor jobs
– DAGMan is configured to submit at most 380 jobs at a time

The personal workload manager manages jobs as for any other cluster.

[Diagram: the user submits DAGMan jobs to the personal Condor pool spanning the PBS and LSF clusters (step 3)]

24 MyCluster resource management

MyCluster submits parallel jobs to the clusters.
These jobs start personal workload manager daemons
– condor_startd in this case

The daemons contact the personal workload manager to report that they have resources available.
MyCluster grows and shrinks the size of its virtual cluster based on the number of jobs it is managing.
The file system on the workstation may be mounted on the backend
– MyCluster uses XUFS to mount the workstation file system on remote resources

[Diagram: MyCluster submits and manages workload-manager daemons on the PBS and LSF clusters (steps 4-5)]

25 Example MyCluster login session

% vo-login
Enter GRID passphrase:            (GRAM or SSH login)
Spawning on lonestar.tacc.utexas.edu
Spawning on tg-login2.ncsa.teragrid.org
Setting up VO participants......Done
Welcome to your MyCluster/Condor environment
To shutdown environment, type "gexit"
To detach from environment, type "detach"
mycluster(gtcsh.9676)% condor_status

Name           OpSys  Arch   State      Activity  LoadAv  Mem   ActvtyTime
32020@compute  LINUX  INTEL  Unclaimed  Idle      0.000   2026  [?????]
...
32021@tg-c383  LINUX  IA64   Unclaimed  Idle      0.000   2026  [?????]

              Machines  Owner  Claimed  Unclaimed  Matched  Preempting
 INTEL/LINUX         2      0        0          2        0           0
  IA64/LINUX         2      0        0          2        0           0
       Total         4      0        0          4        0           0

26 Systems aggregated with MyCluster

TeraGrid Site   HPC System     Architecture
NCSA            Tungsten       IA-32
NCSA            Mercury        IA-64
NCSA            Cobalt         IA-64
SDSC            Tg-login       IA-64
ANL             Tg-login       IA-64
ANL             Tg-login-viz   IA-32
TACC            Lonestar       X86_64

27 Expanding and shrinking Condor cluster created with MyCluster (1 week period)

28 Running and pending jobs in a personal cluster using MyCluster (1 week period)

29 Project conclusion

The allocation was completely consumed in January 2007.
Over 3 million new structures have been found.
– http://www.hypotheticalzeolites.net/DATABASE/DEEM/index.php

The ligand binding energy calculations are deployed on Rodeo and Lonestar, will be deployed on other TeraGrid systems, and are still ongoing.

30 References

J. R. Boisseau, M. Dahan, E. Roberts, and E. Walker, "TeraGrid User Portal Ensemble Manager: Automatically Provisioning Parameter Sweeps in a Web Browser"

E. Walker, D. J. Earl, and M. W. Deem, "How to Run a Million Jobs in Six Months on the NSF TeraGrid"
http://www.usenix.org/events/worlds06/tech/prelim_papers/walker/walker.pdf

MyCluster user guide: http://www.tacc.utexas.edu/services/userguides/mycluster/

Please contact ewalker@tacc.utexas.edu

