Scaling NAMD to 100 Million Atoms on Petascale Machines with Charm++
James Phillips, Beckman Institute, University of Illinois
Chao Mei, Parallel Programming Lab, University of Illinois
NIH Resource for Macromolecular Modeling and Bioinformatics, Beckman Institute, UIUC

Theoretical and Computational Biophysics Group
The UIUC Beckman Institute is a “home away from home” for interdisciplinary researchers.

Biomolecular simulations are our computational microscope
Ribosome: synthesizes proteins from genetic information; a target for antibiotics
Silicon nanopore: a bionanodevice for sequencing DNA efficiently

Our goal for NAMD is practical supercomputing for NIH researchers
44,000 users can’t all be computer experts.
–11,700 have downloaded more than one version.
–2,300 citations of NAMD reference papers.
One program for all platforms.
–Desktops and laptops – setup and testing
–Linux clusters – affordable local workhorses
–Supercomputers – free allocations on TeraGrid
–Blue Waters – sustained petaflop/s performance
User knowledge is preserved.
–No change in input or output files.
–Run any simulation on any number of cores.
Available free of charge to all.
Phillips et al., J. Comp. Chem. 26:1781-1802, 2005.

NAMD uses a hybrid force-spatial parallel decomposition
Spatially decompose data and communication.
Separate but related work decomposition.
“Compute objects” facilitate an iterative, measurement-based load balancing system.
Kale et al., J. Comp. Phys. 151:283-312, 1999.
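The spatial side of this decomposition can be pictured with a short sketch. The code below is illustrative C++ only (PatchGrid, cellSize, and patchOf are invented names, not NAMD source): atoms are binned into cubic patches whose edge is at least the nonbonded cutoff, so every interaction partner of an atom lies in the same patch or one of its 26 neighbors.

    #include <algorithm>
    #include <array>
    #include <cmath>

    // Sketch: map an atom position onto a uniform patch grid whose cell edge
    // is >= the nonbonded cutoff, so all short-range partners of an atom are
    // found in its own patch or one of the 26 neighboring patches.
    struct PatchGrid {
        double origin[3];
        double cellSize;   // >= cutoff (plus a margin in practice)
        int    dims[3];    // number of patches in x, y, z

        std::array<int,3> patchOf(const double pos[3]) const {
            std::array<int,3> idx;
            for (int d = 0; d < 3; ++d) {
                int i = (int)std::floor((pos[d] - origin[d]) / cellSize);
                idx[d] = std::min(std::max(i, 0), dims[d] - 1);  // clamp boundary atoms
            }
            return idx;
        }
    };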

Charm++ overlaps NAMD algorithms
Objects are assigned to processors, queued as data arrives, and executed in priority order.
Phillips et al., SC2002.

NAMD adjusts grainsize to match parallelism to processor count
Tradeoff between parallelism and overhead
Maximum patch size is based on the cutoff distance
Ideally one or more patches per processor
–To double the patch count, split along the x, y, or z dimension
–The number of computes grows much faster!
Hard to automate completely
–Also need to select the number of PME pencils
Computes are partitioned in the outer atom loop
–Old: heuristic based on distance and atom count
–New: measurement-based compute partitioning
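The “computes grow much faster” point can be checked with a little arithmetic. The sketch below is illustrative only (a cubic box, uniform patches, a 100 Å box edge, and a 12 Å cutoff are assumptions, not numbers from the talk): splitting the patch edge in half in every dimension multiplies the patch count by roughly eight, but each smaller patch must now pair with patches up to two shells away, so the compute count grows several times faster.

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    // Patches per side when a box edge L is divided into patches of edge e.
    static int patchesPerSide(double L, double e) { return (int)std::ceil(L / e); }

    // With patch edge e, the cutoff reaches k = ceil(cutoff/e) patches away, so a
    // patch interacts with itself and (2k+1)^3 - 1 neighbors; counting each
    // unordered pair once gives ((2k+1)^3 + 1) / 2 computes per patch.
    static uint64_t computesPerPatch(double cutoff, double e) {
        int k = (int)std::ceil(cutoff / e);
        uint64_t shell = (uint64_t)(2 * k + 1) * (2 * k + 1) * (2 * k + 1);
        return (shell + 1) / 2;
    }

    int main() {
        const double L = 100.0, cutoff = 12.0;   // box edge and cutoff in Angstroms
        for (double e : {12.0, 6.0}) {           // full-size vs. split-in-half patches
            uint64_t n = (uint64_t)patchesPerSide(L, e);
            uint64_t patches = n * n * n;
            std::printf("patch edge %4.1f A: %llu patches, ~%llu computes\n", e,
                        (unsigned long long)patches,
                        (unsigned long long)(patches * computesPerPatch(cutoff, e)));
        }
    }

With these assumed numbers, halving the patch edge multiplies the patch count by about 7 but the compute count by about 30, which is one reason grainsize selection is hard to automate completely.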

Measurement-based grainsize tuning enables scalable implicit solvent simulation
Before: heuristic (256 cores). After: measurement-based (512 cores).

The age of petascale biomolecular simulation is near

Larger machines enable larger simulations

2002 Gordon Bell Award – PSC Lemieux: 3,000 cores; ATP synthase: 300K atoms
Blue Waters: 300,000 cores, 1.2M threads; chromatophore: 100M atoms
The target is still 100 atoms per thread

Scale brings other challenges
Limited memory per core
Limited memory per node
Finicky parallel filesystems
Limited inter-node bandwidth
Long load balancer runtimes
Which is why we collaborate with PPL!

Challenges in 100M-atom Biomolecule Simulation
How to overcome the sequential bottlenecks?
–Initialization
–Output of trajectory & restart data
How to achieve good strong-scaling results?
–Charm++ runtime

Loading Data into the System (1)
Traditionally done on a single core
–Molecule size is small
Result for the 100M-atom system
–Memory: 40.5 GB!
–Time: sec!

Loading Data into the System (2)
Compression scheme (sketched below)
–Atom “signature” representing the common attributes of an atom
–Supports more science simulation parameters
–However, not enough:
Memory: 12.8 GB!
Time: sec!
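A hedged sketch of the signature idea: identical per-atom attribute tuples are deduplicated into a shared table, and each atom keeps only a small index into it, which is what makes the memory reduction possible. The struct fields and class names here are illustrative, not NAMD’s actual data structures.

    #include <cstdint>
    #include <map>
    #include <vector>

    // Illustrative attribute tuple shared by many atoms of the same type/role.
    struct AtomSignature {
        int    vdwType;
        double charge;
        double mass;
        bool operator<(const AtomSignature& o) const {
            if (vdwType != o.vdwType) return vdwType < o.vdwType;
            if (charge  != o.charge)  return charge  < o.charge;
            return mass < o.mass;
        }
    };

    class SignatureTable {
        std::map<AtomSignature, uint32_t> index_;   // signature -> compact id
        std::vector<AtomSignature>        table_;   // compact id -> signature
    public:
        // Returns a compact id; identical signatures share a single table entry,
        // so each atom stores a small index instead of a full parameter record.
        uint32_t intern(const AtomSignature& s) {
            auto it = index_.find(s);
            if (it != index_.end()) return it->second;
            uint32_t id = (uint32_t)table_.size();
            table_.push_back(s);
            index_.emplace(s, id);
            return id;
        }
        const AtomSignature& get(uint32_t id) const { return table_[id]; }
    };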

Loading Data into the System (3)
Parallelizing initialization (see the sketch below)
–#input procs: a parameter chosen either by the user or auto-computed at runtime
–First, each input proc loads 1/N of all atoms
–Second, atoms are shuffled with neighbor procs for the later spatial decomposition
–Good enough: e.g. with 600 input procs
Memory: 0.19 GB
Time: 12.4 sec
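A minimal MPI-flavored sketch of the two-phase parallel input described above. readAtomSlice and spatialOwner are placeholders for the file reader and the spatial decomposition, and the exchange step is only indicated; NAMD’s actual implementation sits on Charm++ rather than raw MPI.

    #include <mpi.h>
    #include <algorithm>
    #include <vector>

    struct Atom { double x, y, z; int id; };

    // Placeholder: read atoms [begin, end) from the (compressed) molecule file.
    std::vector<Atom> readAtomSlice(long begin, long end);
    // Placeholder: which input rank spatially owns this atom.
    int spatialOwner(const Atom& a, int numRanks);

    void parallelLoad(long totalAtoms, MPI_Comm comm) {
        int rank, nranks;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &nranks);

        // Phase 1: each input rank reads a contiguous 1/N slice of the atoms.
        long perRank = (totalAtoms + nranks - 1) / nranks;
        long begin = rank * perRank;
        long end   = std::min(totalAtoms, begin + perRank);
        std::vector<Atom> mine = readAtomSlice(begin, end);

        // Phase 2: bucket atoms by their spatial owner and exchange the buckets
        // (e.g. with MPI_Alltoallv) so each rank ends up holding the atoms that
        // fall in its region of the simulation box.
        std::vector<std::vector<Atom>> outgoing(nranks);
        for (const Atom& a : mine) outgoing[spatialOwner(a, nranks)].push_back(a);
        // ... perform the exchange and build the local spatial decomposition ...
    }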

Output of Trajectory & Restart Data (1)
At least 4.8 GB written to the file system per output step
–the target of tens of ms/step makes this more critical
Parallelizing output (illustrated below)
–Each output proc is responsible for a portion of the atoms
–Output goes to a single file for compatibility
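A hedged sketch of the single-file idea using MPI-IO: each output rank writes the coordinates of its atoms at a byte offset computed from the global index of its first atom, so the result is one file no matter how many ranks wrote it. The record layout is simplified (a flat array of doubles after a fixed header); real trajectory formats carry additional per-frame metadata.

    #include <mpi.h>
    #include <vector>

    // Write one frame of coordinates into a single shared file. Each output rank
    // owns a contiguous range of atoms starting at global index `firstAtom` and
    // supplies 3 doubles (x, y, z) per owned atom in `xyz`.
    void writeCoordinateFrame(const char* path,
                              const std::vector<double>& xyz,
                              long firstAtom,
                              long headerBytes,          // size of any per-frame header
                              MPI_Comm comm) {
        MPI_File fh;
        MPI_File_open(comm, path, MPI_MODE_CREATE | MPI_MODE_WRONLY,
                      MPI_INFO_NULL, &fh);

        // Byte offset of this rank's block in the shared file.
        MPI_Offset offset = headerBytes + (MPI_Offset)firstAtom * 3 * sizeof(double);
        MPI_File_write_at(fh, offset, xyz.data(), (int)xyz.size(),
                          MPI_DOUBLE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
    }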

Output Issue (1)

Output Issue (2)
Multiple independent files
Post-processing into a single file

Initial Strong Scaling on Jaguar (6,720 / 53,760 / 107,520 / 224,076 cores)

Multi-threading the MPI-based Charm++ Runtime
Exploits multicore nodes
Portable, since it is based on MPI
On each node:
–a “processor” is represented as a thread
–N “worker” threads share 1 “communication” thread
Worker threads: handle only computation
Communication thread: handles only network messages
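A simplified sketch of the node layout with std::thread (not the Charm++ implementation; queues, core pinning, and shutdown details are elided): N worker threads execute application work while a single communication thread per process drives the network.

    #include <atomic>
    #include <thread>
    #include <vector>

    std::atomic<bool> running{true};

    void workerLoop(int id) {
        while (running.load()) {
            // pop a message/task from this worker's queue and execute it
            // (computation only; worker threads never touch the network)
        }
    }

    void commLoop() {
        while (running.load()) {
            // poll the network, hand incoming messages to worker queues,
            // and push outgoing messages produced by the workers
        }
    }

    int main() {
        const int workersPerNode = 15;   // e.g. a 16-core node: 15 workers + 1 comm thread
        std::vector<std::thread> workers;
        for (int i = 0; i < workersPerNode; ++i) workers.emplace_back(workerLoop, i);
        std::thread comm(commLoop);

        // ... run the simulation, then shut everything down ...
        running.store(false);
        for (auto& w : workers) w.join();
        comm.join();
    }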

Benefits of SMP Mode (1)
Intra-node communication is faster
–a message is transferred as a pointer (illustrated below)
Program launch time is reduced
–224K cores: ~6 min → ~1 min
Transparent to application developers
–a correct Charm++ program runs in both non-SMP and SMP mode
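The pointer-passing benefit can be illustrated with a tiny mailbox sketch (names are illustrative): inside one SMP process, “sending” a message to another worker is just enqueuing the pointer, with no packing, copying, or MPI call.

    #include <mutex>
    #include <queue>

    struct Message { /* payload */ };

    // Illustrative per-worker mailbox shared between threads of one SMP process.
    class Mailbox {
        std::mutex mtx_;
        std::queue<Message*> q_;
    public:
        // Intra-node "send": enqueue the pointer itself; no copy, no serialization.
        void push(Message* m) { std::lock_guard<std::mutex> g(mtx_); q_.push(m); }
        Message* pop() {
            std::lock_guard<std::mutex> g(mtx_);
            if (q_.empty()) return nullptr;
            Message* m = q_.front(); q_.pop();
            return m;
        }
    };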

Benefits of SMP Mode (2)
Reduces memory footprint further
–Read-only data structures are shared
–Memory footprint of the MPI library is reduced
–On average a 7X reduction!
Better cache performance
Enables the 100M-atom run on Intrepid (BlueGene/P, 2 GB/node)

Potential Bottleneck on the Communication Thread
Computation & communication overlap alleviates the problem to some extent

Node-aware Communication
In the runtime: multicast, broadcast, etc.
–E.g. a series of broadcasts during startup: 2.78X reduction
In the application: multicast tree (see the sketch below)
–Incorporate knowledge of the computation to guide the construction of the tree
–Least-loaded node used as the intermediate node
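A hedged sketch of a node-aware multicast tree: destination ranks are grouped by physical node, one representative rank per node forwards to its on-node peers, and the least-loaded node is picked as the intermediate forwarding hop. The load metric, fan-out, and data structures are illustrative, not the Charm++ implementation.

    #include <map>
    #include <vector>

    struct MulticastTree {
        int root;                                  // source rank
        std::map<int, std::vector<int>> children;  // parent rank -> child ranks
    };

    // destNode[r] = physical node of rank r; nodeLoad[n] = measured load of node n.
    MulticastTree buildNodeAwareTree(int root,
                                     const std::vector<int>& destRanks,
                                     const std::vector<int>& destNode,
                                     const std::vector<double>& nodeLoad) {
        // Group destination ranks by the node they live on (assumed non-empty).
        std::map<int, std::vector<int>> byNode;
        for (int r : destRanks) byNode[destNode[r]].push_back(r);

        // Pick the least-loaded destination node as the intermediate hop.
        int hub = byNode.begin()->first;
        for (const auto& kv : byNode)
            if (nodeLoad[kv.first] < nodeLoad[hub]) hub = kv.first;

        MulticastTree t; t.root = root;
        int hubRank = byNode[hub].front();          // representative rank on the hub node
        t.children[root].push_back(hubRank);
        for (const auto& kv : byNode) {
            int rep = kv.second.front();            // representative rank on this node
            if (kv.first != hub) t.children[hubRank].push_back(rep);
            for (size_t i = 1; i < kv.second.size(); ++i)
                t.children[rep].push_back(kv.second[i]);   // on-node fan-out
        }
        return t;
    }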

Handling Bursts of Messages (1)
A global barrier after each timestep, due to the constant-pressure algorithm
Further amplified by having only 1 comm thread per node

Handling Bursts of Messages (2)
Workflow of the comm thread
–Alternates between send/release/receive modes
Dynamic flow control (sketched below)
–Exits one mode and enters another
–E.g. 12.3% for 4,480 nodes (53,760 cores)
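A simplified sketch of the alternating-mode loop with a per-mode quota, so a burst of incoming messages cannot starve sends (or the other way around). trySendOne, tryReleaseOne, and tryReceiveOne are placeholders for the runtime’s actual operations, and the quota is an assumed tuning knob.

    #include <cstddef>

    // Placeholders for the underlying runtime operations; each processes one item
    // and returns false when there is nothing left to do in that mode.
    bool trySendOne();      // send one queued outgoing message
    bool tryReleaseOne();   // release one completed send buffer
    bool tryReceiveOne();   // receive and dispatch one incoming message

    // One iteration of the communication thread: visit each mode in turn, but
    // leave a mode as soon as it runs dry or has used up its quota, so a burst
    // in one direction cannot monopolize the thread.
    void commThreadStep(std::size_t quota) {
        for (std::size_t i = 0; i < quota && trySendOne();    ++i) {}
        for (std::size_t i = 0; i < quota && tryReleaseOne(); ++i) {}
        for (std::size_t i = 0; i < quota && tryReceiveOne(); ++i) {}
    }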

Hierarchical Load Balancer
Centralized load balancing has large memory consumption
Processors are divided into groups
Load balancing is done within each group (sketched below)
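A hedged sketch of the hierarchy: processors are split into fixed-size groups, and each group rebalances only its own objects, so no single processor ever holds the global object list. The greedy within-group assignment below is a stand-in for whatever strategy runs inside a group, not the Charm++ balancer itself.

    #include <algorithm>
    #include <vector>

    struct Obj { int id; double load; };

    // Which group a processor belongs to, with G processors per group; each group
    // balances independently, so memory use and decision time scale with G, not
    // with the size of the whole machine.
    int groupOf(int processor, int groupSize) { return processor / groupSize; }

    // Rebalance the objects of one group: assign the heaviest objects first to the
    // currently least-loaded processor. Returns one processor index (within the
    // group) per object, in descending-load order.
    std::vector<int> balanceGroup(std::vector<Obj> objects, int procsInGroup) {
        std::sort(objects.begin(), objects.end(),
                  [](const Obj& a, const Obj& b) { return a.load > b.load; });
        std::vector<double> procLoad(procsInGroup, 0.0);
        std::vector<int> assignment(objects.size());
        for (size_t i = 0; i < objects.size(); ++i) {
            int p = (int)(std::min_element(procLoad.begin(), procLoad.end())
                          - procLoad.begin());
            assignment[i] = p;
            procLoad[p] += objects[i].load;
        }
        return assignment;
    }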

Improvement due to Load Balancing

Performance Improvement of SMP over non-SMP on Jaguar

Strong Scaling on Jaguar (2) (6,720 / 53,760 / 107,520 / 224,076 cores)

Weak Scaling on Intrepid (~1,466 atoms/core)
System sizes: 2M, 6M, 12M, 24M, 48M, 100M atoms
1. The 100M-atom system runs ONLY in SMP mode
2. Dedicating one core per node to communication in SMP mode (a 25% loss) caused the performance gap

Conclusion and Future Work
The I/O bottleneck is solved by parallelization
An approach that optimizes both the application and its underlying runtime
–SMP mode in the runtime
Continue to improve performance
–PME calculation
Integrate and optimize new science codes

Acknowledgements
Gengbin Zheng, Yanhua Sun, Eric Bohm, Chris Harrison, and Osman Sarood for the 100M-atom simulation
David Tanner for the implicit solvent work
Machines: supported by DOE
Funding: NIH, NSF

Thanks