Presentation on theme: "Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++"— Presentation transcript:

1 Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++
James Phillips, Beckman Institute, University of Illinois, http://www.ks.uiuc.edu/Research/namd/
Chao Mei, Parallel Programming Lab, University of Illinois, http://charm.cs.illinois.edu/
NIH Resource for Macromolecular Modeling and Bioinformatics, Beckman Institute, UIUC, http://www.ks.uiuc.edu/

2 The UIUC Beckman Institute is a "home away from home" for interdisciplinary researchers
Theoretical and Computational Biophysics Group

3 Biomolecular simulations are our computational microscope
Ribosome: synthesizes proteins from genetic information; a target for antibiotics
Silicon nanopore: a bionanodevice for sequencing DNA efficiently

4 Our goal for NAMD is practical supercomputing for NIH researchers
44,000 users can't all be computer experts.
– 11,700 have downloaded more than one version.
– 2,300 citations of NAMD reference papers.
One program for all platforms.
– Desktops and laptops: setup and testing
– Linux clusters: affordable local workhorses
– Supercomputers: free allocations on TeraGrid
– Blue Waters: sustained petaflop/s performance
User knowledge is preserved.
– No change in input or output files.
– Run any simulation on any number of cores.
Available free of charge to all.
Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.

5 NAMD uses a hybrid force-spatial parallel decomposition
Spatially decompose data and communication.
Separate but related work decomposition.
"Compute objects" facilitate an iterative, measurement-based load balancing system (see the sketch below).
Kale et al., J. Comp. Phys. 151:283-312, 1999.
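
The decomposition on this slide can be made concrete with a small toy program (a sketch only; the Patch and Compute structs below are illustrative stand-ins, not NAMD's actual classes): space is cut into patches of roughly cutoff size, and a compute object is created for each patch and for each pair of neighboring patches. The computes are the fine-grained, migratable work units that the load balancer places on processors.

```cpp
// Toy illustration of a NAMD-style hybrid decomposition (not NAMD's real classes):
// space is cut into "patches" of roughly cutoff size, and a separate "compute"
// object is created for every patch and for every pair of neighboring patches.
#include <cstdio>
#include <vector>

struct Patch { int x, y, z; };          // spatial cell holding the atoms in its region
struct Compute { int patchA, patchB; }; // work unit: forces within or between patches

int main() {
    const int dim = 4;                  // 4x4x4 patch grid (box length / cutoff)
    std::vector<Patch> patches;
    for (int x = 0; x < dim; ++x)
        for (int y = 0; y < dim; ++y)
            for (int z = 0; z < dim; ++z)
                patches.push_back({x, y, z});

    // One self-compute per patch, one pair-compute per neighboring patch pair.
    std::vector<Compute> computes;
    auto id = [&](int x, int y, int z) { return (x * dim + y) * dim + z; };
    for (int a = 0; a < (int)patches.size(); ++a) {
        computes.push_back({a, a});
        const Patch& p = patches[a];
        for (int dx = -1; dx <= 1; ++dx)
            for (int dy = -1; dy <= 1; ++dy)
                for (int dz = -1; dz <= 1; ++dz) {
                    int nx = p.x + dx, ny = p.y + dy, nz = p.z + dz;
                    if (nx < 0 || ny < 0 || nz < 0 || nx >= dim || ny >= dim || nz >= dim)
                        continue;
                    int b = id(nx, ny, nz);
                    if (b > a) computes.push_back({a, b}); // count each pair once
                }
    }
    std::printf("%zu patches, %zu computes\n", patches.size(), computes.size());
    return 0;
}
```

Because computes far outnumber patches, the runtime has many small migratable objects to place, which is what makes iterative, measurement-based load balancing effective.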

6 Charm++ overlaps NAMD algorithms
Objects are assigned to processors, queued as data arrives, and executed in priority order.
Phillips et al., SC2002.

7 NAMD adjusts grainsize to match parallelism to processor count
Tradeoff between parallelism and overhead
Maximum patch size is based on the cutoff
Ideally one or more patches per processor
– To double the patch count, split along the x, y, or z dimension
– The number of computes grows much faster!
Hard to automate completely
– Also need to select the number of PME pencils
Computes are partitioned in the outer atom loop (see the sketch below)
– Old: heuristic based on distance and atom count
– New: measurement-based compute partitioning
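
The last two bullets can be illustrated with a short sketch (the variable names and cost model are assumptions for the example, not NAMD's code): the old heuristic splits a compute's outer atom loop into equal atom counts, while measurement-based partitioning splits it into pieces of roughly equal measured cost.

```cpp
// Sketch contrasting heuristic and measurement-based splitting of a compute's
// outer atom loop (illustrative only).
#include <cstdio>
#include <numeric>
#include <vector>

// Cut the outer-loop atom range into `parts` pieces of roughly equal measured cost.
std::vector<int> splitByMeasuredCost(const std::vector<double>& costPerAtom, int parts) {
    double target = std::accumulate(costPerAtom.begin(), costPerAtom.end(), 0.0) / parts;
    std::vector<int> starts{0};
    double acc = 0.0;
    for (int i = 0; i < (int)costPerAtom.size(); ++i) {
        if (acc >= target && (int)starts.size() < parts) {
            starts.push_back(i);   // start a new piece once the target cost is reached
            acc = 0.0;
        }
        acc += costPerAtom[i];
    }
    return starts;                 // starts[k] = first atom of piece k
}

int main() {
    // Skewed case: a few atoms dominate the cost, so equal atom counts
    // (the old heuristic) give very unequal pieces.
    std::vector<double> cost(1000, 1.0);
    for (int i = 0; i < 100; ++i) cost[i] = 20.0;

    std::printf("measurement-based piece boundaries:");
    for (int s : splitByMeasuredCost(cost, 4)) std::printf(" %d", s);
    std::printf("\nheuristic (equal atom count) boundaries: 0 250 500 750\n");
    return 0;
}
```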

8 Measurement-based grainsize tuning enables scalable implicit solvent simulation
[Figures: before, heuristic partitioning (256 cores); after, measurement-based partitioning (512 cores)]

9 The age of petascale biomolecular simulation is near

10 Larger machines enable larger simulations

11 2002 Gordon Bell Award: PSC Lemieux, 3,000 cores; ATP synthase, 300K atoms
Blue Waters: 300,000 cores, 1.2M threads; chromatophore, 100M atoms
Target is still 100 atoms per thread

12 Scale brings other challenges
Limited memory per core
Limited memory per node
Finicky parallel filesystems
Limited inter-node bandwidth
Long load balancer runtimes
Which is why we collaborate with PPL!

13 Challenges in 100M-atom Biomolecule Simulation
How to overcome the sequential bottleneck?
– Initialization
– Output of trajectory & restart data
How to achieve good strong-scaling results?
– Charm++ runtime

14 Loading Data into the System (1)
Traditionally done on a single core
– acceptable when the molecule is small
Result for the 100M-atom system:
– Memory: 40.5 GB!
– Time: 3301.9 sec!

15 Loading Data into the System (2)
Compression scheme (sketched below)
– An atom "signature" represents the attributes common to many atoms
– Supports more simulation parameters
– However, still not enough:
Memory: 12.8 GB!
Time: 125.5 sec!
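
A minimal sketch of the signature idea follows (a hypothetical layout, not NAMD's actual data structures): static per-atom parameters that repeat across many atoms are stored once in a signature table, and each atom keeps only a small index into that table.

```cpp
// Illustration of the "atom signature" compression idea (hypothetical layout).
#include <cstdint>
#include <cstdio>
#include <map>
#include <tuple>
#include <vector>

struct AtomSignature {           // attributes shared by many atoms
    double mass, charge;
    int    vdwType;
    // ...bonded-term templates would live here too
};

struct Atom {                    // per-atom data that cannot be shared
    float x, y, z;
    int32_t signatureIndex;      // index into the signature table
};

struct CompressedMolecule {
    std::vector<AtomSignature> signatures;
    std::vector<Atom>          atoms;

    // Return the index of an existing identical signature, or add a new one.
    int32_t internSignature(const AtomSignature& s) {
        auto key = std::make_tuple(s.mass, s.charge, s.vdwType);
        auto it = index.find(key);
        if (it != index.end()) return it->second;
        signatures.push_back(s);
        int32_t id = (int32_t)signatures.size() - 1;
        index[key] = id;
        return id;
    }
private:
    std::map<std::tuple<double, double, int>, int32_t> index;
};

int main() {
    CompressedMolecule mol;
    // Water: only two distinct signatures (O and H) no matter how many molecules.
    for (int i = 0; i < 1000000; ++i) {
        int32_t o = mol.internSignature({15.999, -0.834, 0});
        int32_t h = mol.internSignature({1.008, 0.417, 1});
        mol.atoms.push_back({0, 0, 0, o});
        mol.atoms.push_back({0, 0, 0, h});
        mol.atoms.push_back({0, 0, 0, h});
    }
    std::printf("%zu atoms, %zu signatures\n", mol.atoms.size(), mol.signatures.size());
    return 0;
}
```

For a system dominated by repeated residues and water, the number of distinct signatures stays tiny even as the atom count grows to 100M, which is where the memory saving comes from.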

16 Loading Data into the System (3)
Parallelizing initialization (see the MPI-style sketch below)
– The number of input procs is a parameter chosen by the user or auto-computed at runtime
– First, each input proc loads 1/N of all atoms
– Second, atoms are shuffled with neighboring procs for later spatial decomposition
– Good enough: e.g., with 600 input procs,
Memory: 0.19 GB
Time: 12.4 sec
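
The sketch below illustrates the first step in MPI terms (the file name atoms.bin and the fixed-size record layout are assumptions for the example, not NAMD's input format): each input processor computes its 1/N slice and reads only that byte range; the neighbor shuffle for spatial decomposition is indicated by a comment rather than implemented.

```cpp
// Minimal MPI sketch of the parallel-input idea (assumed binary atom records).
#include <mpi.h>
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <vector>

struct AtomRecord { double x, y, z; int32_t signatureIndex; };

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long long totalAtoms = 100000000;      // 100M-atom system
    long long perProc = (totalAtoms + nprocs - 1) / nprocs;
    long long begin = rank * perProc;
    long long end   = std::min(totalAtoms, begin + perProc);

    // Each input proc's slice is a few MB, so the byte count fits in an int.
    std::vector<AtomRecord> mine(end > begin ? end - begin : 0);

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "atoms.bin",
                  MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
    MPI_File_read_at(fh, (MPI_Offset)(begin * (long long)sizeof(AtomRecord)),
                     mine.data(), (int)(mine.size() * sizeof(AtomRecord)),
                     MPI_BYTE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    // Next step (omitted): exchange atoms with neighboring input procs so that
    // each one ends up holding spatially contiguous regions for patch creation.

    if (rank == 0)
        std::printf("each of %d input procs read ~%lld atoms\n", nprocs, perProc);
    MPI_Finalize();
    return 0;
}
```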

17 Output of Trajectory & Restart Data (1)
At least 4.8 GB written to the file system per output step
– a target of tens of ms per step makes this even more critical
Parallelizing output (see the sketch below)
– Each output proc is responsible for a portion of the atoms
– Output goes to a single file for compatibility
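
A minimal MPI-IO sketch of the "many output processors, one shared file" idea (the record layout and file name are hypothetical, not NAMD's trajectory format): each output processor writes its contiguous block of coordinates at an offset determined by the atoms it owns, so the result is still a single file.

```cpp
// Minimal MPI-IO sketch: many output procs, one shared file.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const long long atomsPerProc = 100000;              // atoms owned by this output proc
    std::vector<double> coords(3 * atomsPerProc, 0.0);  // x, y, z for each owned atom

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "trajectory_step.bin",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    // Offset is determined by which atoms this proc owns; ownership is contiguous here.
    MPI_Offset offset = (MPI_Offset)rank * atomsPerProc * 3 * sizeof(double);
    MPI_File_write_at_all(fh, offset, coords.data(), (int)coords.size(),
                          MPI_DOUBLE, MPI_STATUS_IGNORE);
    MPI_File_close(&fh);

    if (rank == 0)
        std::printf("%d output procs wrote one shared file\n", nprocs);
    MPI_Finalize();
    return 0;
}
```

Using the collective MPI_File_write_at_all lets the MPI library aggregate the writes, which tends to behave better on finicky parallel filesystems than many independent per-process writes.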

18 Output Issue (1)

19 Output Issue (2)
Multiple independent files
Post-processing into a single file

20 Initial Strong Scaling on Jaguar
[Figure: strong-scaling results at 6,720, 53,760, 107,520, and 224,076 cores]

21 Multi-threading the MPI-based Charm++ Runtime
Exploits multicore nodes
Portable, since it is based on MPI
On each node (see the toy model below):
– each "processor" is represented as a thread
– N "worker" threads share 1 "communication" thread
Worker threads: handle only computation
Communication thread: handles only network messages
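
The division of labor can be modeled with plain C++ threads (a toy model only; Charm++'s SMP machine layer is far more involved): worker threads only compute and enqueue outgoing messages, while the one communication thread per node drains the queue and talks to the network, stubbed out here as a print.

```cpp
// Toy model of SMP mode: N worker threads, one communication thread per node.
#include <condition_variable>
#include <cstdio>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Message { int srcWorker; int bytes; };

std::queue<Message> outbox;       // outgoing messages produced by workers
std::mutex mtx;
std::condition_variable cv;
bool done = false;                // set (under the mutex) when all workers finish

// Worker threads only compute and enqueue messages; they never touch the network.
void worker(int id, int steps) {
    for (int s = 0; s < steps; ++s) {
        // ...force computation for this worker's patches would go here...
        std::lock_guard<std::mutex> lk(mtx);
        outbox.push({id, 1024});
        cv.notify_one();
    }
}

// The single communication thread drains the outbox and talks to the network.
void commThread() {
    std::unique_lock<std::mutex> lk(mtx);
    while (true) {
        cv.wait(lk, [] { return !outbox.empty() || done; });
        while (!outbox.empty()) {
            Message m = outbox.front();
            outbox.pop();
            lk.unlock();
            std::printf("comm thread sends %d bytes from worker %d\n", m.bytes, m.srcWorker);
            lk.lock();
        }
        if (done) return;
    }
}

int main() {
    const int nWorkers = 3;       // e.g. cores per node minus the communication core
    std::thread comm(commThread);
    std::vector<std::thread> workers;
    for (int i = 0; i < nWorkers; ++i) workers.emplace_back(worker, i, 2);
    for (auto& w : workers) w.join();
    {
        std::lock_guard<std::mutex> lk(mtx);
        done = true;
    }
    cv.notify_one();
    comm.join();
    return 0;
}
```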

22 Benefits of SMP Mode (1)
Intra-node communication is faster
– messages are transferred as pointers (see the sketch below)
Program launch time is reduced
– 224K cores: from ~6 min to ~1 min
Transparent to application developers
– a correct Charm++ program runs in both non-SMP and SMP mode
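
The first bullet is the key point: within one SMP process all threads share an address space, so delivering a message to a same-node destination is a pointer handoff, whereas an off-node send still has to pack the payload for the network. A toy contrast (not the Charm++ implementation):

```cpp
// Toy contrast between intra-node pointer handoff and off-node packing.
#include <cstdio>
#include <cstring>
#include <memory>
#include <vector>

struct Msg { std::vector<double> forces; };

// Same node, same address space: transfer ownership of the pointer, no copy.
void deliverIntraNode(std::unique_ptr<Msg> m, std::vector<std::unique_ptr<Msg>>& destQueue) {
    destQueue.push_back(std::move(m));
}

// Different node: the payload has to be packed into a contiguous send buffer.
std::vector<char> packForNetwork(const Msg& m) {
    std::vector<char> buf(m.forces.size() * sizeof(double));
    std::memcpy(buf.data(), m.forces.data(), buf.size());
    return buf;
}

int main() {
    std::vector<std::unique_ptr<Msg>> queueOnSameNode;
    auto m = std::make_unique<Msg>();
    m->forces.assign(3 * 100000, 0.0);               // ~2.4 MB of force data

    deliverIntraNode(std::move(m), queueOnSameNode); // pointer handoff only
    auto wire = packForNetwork(*queueOnSameNode[0]); // copy needed off-node

    std::printf("queued %zu msg(s); off-node copy would be %zu bytes\n",
                queueOnSameNode.size(), wire.size());
    return 0;
}
```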

23 Benefits of SMP Mode (2)
Reduces memory footprint further
– read-only data structures are shared
– the memory footprint of the MPI library is reduced
– on average, a 7x reduction!
Better cache performance
Enables the 100M-atom run on Intrepid (BlueGene/P, 2 GB/node)

24 Potential Bottleneck on the Communication Thread
Overlap of computation and communication alleviates the problem to some extent

25 Node-aware Communication
In the runtime: multicast, broadcast, etc.
– e.g., a series of broadcasts during startup: 2.78x reduction
In the application: multicast trees (see the sketch below)
– knowledge of the computation guides the construction of the tree
– the least loaded node is chosen as the intermediate node
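
A sketch of the multicast-tree idea (illustrative only; this is not the runtime's actual algorithm): destinations are handled at the node level, and the least loaded node in each branch is picked as the intermediate forwarder, so heavily loaded nodes only receive.

```cpp
// Sketch of a node-aware multicast tree with least-loaded intermediate nodes.
#include <algorithm>
#include <cstdio>
#include <utility>
#include <vector>

struct Node { int id; double load; };

// Returns (from, to) edges: who forwards the multicast message to whom.
std::vector<std::pair<int, int>> buildTree(int root, std::vector<Node> dests, int branch) {
    std::vector<std::pair<int, int>> edges;
    std::sort(dests.begin(), dests.end(),
              [](const Node& a, const Node& b) { return a.load < b.load; });
    // The root feeds up to `branch` subtrees; each subtree is forwarded by its
    // least loaded member (the front of its slice after the sort).
    size_t per = (dests.size() + branch - 1) / branch;
    for (size_t start = 0; start < dests.size(); start += per) {
        size_t end = std::min(dests.size(), start + per);
        int forwarder = dests[start].id;               // least loaded in this slice
        edges.push_back({root, forwarder});
        for (size_t i = start + 1; i < end; ++i)
            edges.push_back({forwarder, dests[i].id});
    }
    return edges;
}

int main() {
    std::vector<Node> dests = {{1, 0.9}, {2, 0.2}, {3, 0.5}, {4, 0.1}, {5, 0.7}, {6, 0.4}};
    for (auto& e : buildTree(0, dests, 2))
        std::printf("node %d -> node %d\n", e.first, e.second);
    return 0;
}
```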

26 Handling Bursts of Messages (1)
A global barrier after each timestep, due to the constant-pressure algorithm
Further amplified because there is only 1 comm thread per node

27 Handling Bursts of Messages (2)
Workflow of the comm thread (see the sketch below)
– alternates between send/release/receive modes
Dynamic flow control
– switches from one mode to another on demand
– e.g., 12.3% for 4,480 nodes (53,760 cores)
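
The workflow can be pictured with a small state-machine sketch (a hypothetical pattern, not Charm++'s machine layer): the communication thread rotates through send, release, and receive modes, and dynamic flow control means it abandons a mode early when pending incoming messages pile up, instead of following fixed quotas.

```cpp
// Toy model of a comm thread cycling through send/release/receive modes
// with a pressure threshold that forces an early mode switch.
#include <cstddef>
#include <cstdio>
#include <deque>

struct CommThread {
    std::deque<int> toSend, toRelease, received;    // stand-ins for real message queues
    static constexpr std::size_t BURST_LIMIT = 64;  // pressure that forces a mode switch

    void pollNetwork() {
        // A real runtime would probe the network here and append to `received`.
    }
    void sendMode() {                               // push outgoing messages to the network
        while (!toSend.empty() && received.size() < BURST_LIMIT) {
            toRelease.push_back(toSend.front());    // "sent": the buffer now awaits release
            toSend.pop_front();
            pollNetwork();
        }
    }
    void releaseMode() {                            // free buffers of completed sends
        while (!toRelease.empty() && received.size() < BURST_LIMIT) {
            toRelease.pop_front();
            pollNetwork();
        }
    }
    void receiveMode() {                            // hand incoming messages to worker queues
        while (!received.empty())
            received.pop_front();
    }
    void progress(int iterations) {
        for (int i = 0; i < iterations; ++i) {
            sendMode();
            releaseMode();
            receiveMode();
        }
    }
};

int main() {
    CommThread c;
    for (int i = 0; i < 10; ++i) c.toSend.push_back(i);  // a small burst of outgoing messages
    c.progress(3);
    std::printf("pending sends: %zu, pending releases: %zu\n",
                c.toSend.size(), c.toRelease.size());
    return 0;
}
```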

28 Hierarchical Load Balancer
The centralized load balancer consumes too much memory
Processors are divided into groups
Load balancing is done within each group (see the sketch below)
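
A simplified sketch of the hierarchical idea (a greedy stand-in, not the actual Charm++ strategy): processors are split into groups and each group balances only its own objects, so no single processor has to hold load information for the entire machine.

```cpp
// Greedy per-group load balancing: heaviest object goes to the least loaded processor.
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

std::vector<int> balanceGroup(const std::vector<double>& objLoads, int procsInGroup) {
    // Sort object indices by decreasing load.
    std::vector<int> order(objLoads.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoads[a] > objLoads[b]; });

    // Min-heap of (current load, processor id) within this group.
    using ProcLoad = std::pair<double, int>;
    std::priority_queue<ProcLoad, std::vector<ProcLoad>, std::greater<ProcLoad>> procs;
    for (int p = 0; p < procsInGroup; ++p) procs.push({0.0, p});

    std::vector<int> assignment(objLoads.size());
    for (int obj : order) {
        auto [load, p] = procs.top();
        procs.pop();
        assignment[obj] = p;
        procs.push({load + objLoads[obj], p});
    }
    return assignment;
}

int main() {
    // Two groups of 2 processors each; each group balances independently.
    std::vector<std::vector<double>> groupObjLoads = {{3.0, 1.0, 2.0, 2.0},
                                                      {5.0, 1.0, 1.0, 1.0}};
    for (size_t g = 0; g < groupObjLoads.size(); ++g) {
        auto a = balanceGroup(groupObjLoads[g], 2);
        std::printf("group %zu:", g);
        for (size_t o = 0; o < a.size(); ++o) std::printf(" obj%zu->proc%d", o, a[o]);
        std::printf("\n");
    }
    return 0;
}
```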

29 Improvement due to Load Balancing

30 Performance Improvement of SMP over non-SMP on Jaguar

31 Strong Scaling on Jaguar (2)
[Figure: strong-scaling results at 6,720, 53,760, 107,520, and 224,076 cores]

32 Weak Scaling on Intrepid (~1466 atoms/core)
[Figure: weak-scaling results for systems of 2M, 6M, 12M, 24M, 48M, and 100M atoms]
1. The 100M-atom system runs ONLY in SMP mode
2. Dedicating one core per node to communication in SMP mode (a 25% loss) caused the performance gap

33 Conclusion and Future Work
The I/O bottleneck is solved by parallelization
An approach that optimizes both the application and its underlying runtime
– SMP mode in the runtime
Continue to improve performance
– PME calculation
Integrate and optimize new science codes

34 Acknowledgements
Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, Osman Sarood for the 100M-atom simulation
David Tanner for the implicit solvent work
Machines: Jaguar@NCCS and Intrepid@ANL, supported by DOE
Funding: NIH, NSF

35 Thanks

