NAMD on BlueWaters
Presented by: Eric Bohm
Team: Eric Bohm, Chao Mei, Osman Sarood, David Kunzman, Yanhua Sun, Jim Phillips, John Stone, L.V. Kale

NSF/NCSA Blue Waters Project
- Sustained-petaflops system funded by NSF
- System expected to exceed 300,000 processor cores
- NSF acceptance test: 100-million-atom BAR domain simulation using NAMD
NAMD PRAC: The Computational Microscope
- Systems from 10 to 100 million atoms
- A recently submitted PRAC from an independent group wishes to use NAMD for 1 billion atoms!

NAMD: molecular dynamics simulation of biological systems
Uses the Charm++ approach:
- Decompose the computation into a large number of objects
- Let the intelligent Charm++ runtime system assign objects to processors for dynamic load balancing
Hybrid of spatial and force decomposition (a toy sketch follows this slide):
- Spatial decomposition of atoms into cubes (called patches)
- For every pair of interacting patches, create one object for calculating electrostatic interactions
- Recent: Blue Matter, Desmond, etc. use this idea in some form
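A minimal, self-contained C++ toy of the decomposition idea described above; the types Patch and PairCompute and all sizes are invented for illustration and this is not NAMD or Charm++ source:

    // Illustrative toy (plain C++, not NAMD or Charm++ source): spatial
    // decomposition into patches plus one compute object per interacting
    // patch pair.
    #include <cstdio>
    #include <cstdlib>
    #include <vector>

    struct Patch { int x, y, z; };               // one cutoff-sized cube of atoms
    struct PairCompute { int patchA, patchB; };  // non-bonded work for one patch pair

    int main() {
        const int D = 4;                         // toy 4x4x4 patch grid
        std::vector<Patch> patches;
        for (int x = 0; x < D; ++x)
            for (int y = 0; y < D; ++y)
                for (int z = 0; z < D; ++z)
                    patches.push_back({x, y, z});

        // One compute object per pair of neighboring patches (1-away, no
        // periodic wrap-around); a runtime such as Charm++ would then place
        // these objects on processors for load balance.
        std::vector<PairCompute> computes;
        for (std::size_t a = 0; a < patches.size(); ++a)
            for (std::size_t b = a; b < patches.size(); ++b) {
                const Patch &p = patches[a], &q = patches[b];
                if (std::abs(p.x - q.x) <= 1 && std::abs(p.y - q.y) <= 1 &&
                    std::abs(p.z - q.z) <= 1)
                    computes.push_back({(int)a, (int)b});
            }
        std::printf("%zu patches, %zu pair computes\n",
                    patches.size(), computes.size());
        return 0;
    }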

BW Challenges and Opportunities
- Support systems of >= 100 million atoms
- Meet performance requirements at the 100-million-atom scale
- Scale to over 300,000 cores
- POWER7 hardware
  - PPC architecture
  - Wide nodes: at least 32 cores with 128 hardware threads
- Blue Waters Torrent interconnect
- Doing research under NDA

BlueWaters Architecture
- IBM Power7
- Peak performance ~10 PF, sustained ~1 PF
- 300,000+ cores
- 1.2+ PB memory
- 18+ PB disk
- 8 cores/chip, 4 chips/MCM, 8 MCMs/drawer, 4 drawers/SuperNode = 1024 cores/SuperNode
- Linux OS

Power7
- 64-bit PowerPC, 3.7-4.0 GHz
- Up to 8 FLOPs/cycle per core (a quick arithmetic check follows this slide)
- 4-way SMT
- 128-byte cache lines
- 32 KB L1 and 256 KB L2 per core; 4 MB local region in the shared 32 MB L3 cache
- Execution units: 2 fixed-point, 2 load/store, 1 VMX, 1 decimal FP, 2 VSX (4 FLOPs/cycle each)
- 6-wide in-order dispatch, 8-wide out-of-order issue
- Prefetching of up to 12 data streams
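As a quick consistency check of the figures on the two slides above (my own arithmetic; 4 GHz is taken as the top of the stated clock range):

    \[
      300{,}000~\text{cores} \times 4\times10^{9}~\tfrac{\text{cycles}}{\text{s}}
        \times 8~\tfrac{\text{FLOPs}}{\text{cycle}}
        \approx 9.6\times10^{15}~\text{FLOP/s} \approx 10~\text{PF},
    \]
    \[
      8~\tfrac{\text{cores}}{\text{chip}} \times 4~\tfrac{\text{chips}}{\text{MCM}}
        \times 8~\tfrac{\text{MCMs}}{\text{drawer}} \times 4~\tfrac{\text{drawers}}{\text{SuperNode}}
        = 1024~\tfrac{\text{cores}}{\text{SuperNode}}.
    \]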

Hub Chip Module
- Connects 8 QCMs via L-local links (copper): 24 GB/s
- Connects 4 P7-IH drawers via L-remote links (optical): 6 GB/s
- Connects up to 512 SuperNodes via D links (optical): 10 GB/s

Availability
- NCSA BlueDrop machine
  - Linux
  - IBM 780 (MR) POWER7, 3.8 GHz
  - Login node: 2x 8-core processors
  - Compute node: 4x 8-core in 2 enclosures
- BlueBioU
  - Linux
  - 18 IBM 750 (HV32) nodes, 3.55 GHz
  - InfiniBand 4x DDR (Galaxy)

NAMD on BW
- Use SMT=4 effectively
- Use Power7 effectively:
  - Shared-memory topology
  - Prefetch
  - Loop unrolling
  - SIMD via VSX
- Use Torrent effectively:
  - LAPI/XMI

Petascale Scalability Concerns
- Centralized load balancer: solved
- I/O:
  - Unscalable file formats: solved
  - Input read at startup: solved
  - Sequential output: in progress
- Fine-grain overhead: in progress
- Non-bonded multicasts: being studied
- Particle Mesh Ewald:
  - Largest grid target <= 1024
  - Communication overhead is the primary issue
  - Considering the Multilevel Summation alternative

NAMD and SMT=4
- P7 hardware threads are prioritized: 0 and 1 highest, 2 and 3 lowest
- The Charm++ runtime measures per-processor performance, and the load balancer operates accordingly (a toy sketch follows this slide)
- NAMD with SMT=4 is 35% faster than with SMT=1
  - No new code required!
- At the limit it requires 4x more decomposition
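A toy, hedged illustration of measurement-driven placement (not the actual Charm++ load balancer): equal work units are assigned greedily to the hardware thread with the earliest predicted finish time, using hypothetical measured speeds so the low-priority SMT threads receive proportionally less work.

    // Toy illustration only (not the Charm++ load balancer).
    #include <cstdio>
    #include <vector>

    int main() {
        // Hypothetical measured relative speeds for SMT threads 0..3.
        std::vector<double> speed = {1.0, 1.0, 0.6, 0.6};
        std::vector<double> work(32, 1.0);        // 32 equal-cost objects
        std::vector<double> finish(speed.size(), 0.0);

        for (double w : work) {
            std::size_t best = 0;
            for (std::size_t t = 1; t < speed.size(); ++t)
                if (finish[t] + w / speed[t] < finish[best] + w / speed[best])
                    best = t;                     // earliest predicted finish wins
            finish[best] += w / speed[best];
        }
        for (std::size_t t = 0; t < finish.size(); ++t)
            std::printf("thread %zu: predicted busy time %.2f\n", t, finish[t]);
        return 0;
    }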

NAMD on Power7 HV 32 AIX

SIMD -> VSX
- VSX adds double-precision support to VMX
- SSE2 is already in use in two NAMD functions
- An MD-SIMD implementation of the non-bonded MD benchmark is available from Kunzman
- Translate the SSE code to VSX (a hedged example follows this slide)
- Add VSX support to MD-SIMD
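A hedged sketch of the kind of translation the slide refers to: the same two-wide double-precision multiply-add written with SSE2 intrinsics and with VSX intrinsics. This is not NAMD source; the function names are invented, and the VSX path assumes a POWER compiler with altivec.h and VSX enabled (e.g. gcc -mvsx).

    #include <cstdio>

    #if defined(__SSE2__)
    #include <emmintrin.h>
    /* SSE2 has no fused multiply-add, so r = a*b + c is a mul plus an add. */
    static inline __m128d madd_sse2(__m128d a, __m128d b, __m128d c) {
        return _mm_add_pd(_mm_mul_pd(a, b), c);
    }
    #elif defined(__VSX__)
    #include <altivec.h>
    /* VSX extends VMX with 64-bit floats: __vector double holds two doubles,
       and vec_madd maps to a fused multiply-add. */
    static inline __vector double madd_vsx(__vector double a, __vector double b,
                                           __vector double c) {
        return vec_madd(a, b, c);
    }
    #endif

    int main() {
        double out[2] = {0.0, 0.0};
    #if defined(__SSE2__)
        __m128d r = madd_sse2(_mm_set_pd(2.0, 3.0), _mm_set_pd(4.0, 5.0),
                              _mm_set_pd(1.0, 1.0));
        _mm_storeu_pd(out, r);
    #elif defined(__VSX__)
        __vector double a = {3.0, 2.0}, b = {5.0, 4.0}, c = {1.0, 1.0};
        __vector double r = madd_vsx(a, b, c);
        out[0] = r[0]; out[1] = r[1];
    #endif
        std::printf("%g %g\n", out[0], out[1]);   /* expect 16 9 on either path */
        return 0;
    }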

MD-SIMD performance

Support for Large Molecular Systems
- New compressed PSF file format:
  - Supports >100 million atoms
  - Supports parallel startup
  - Supports the MEM_OPT molecule representation
- The MEM_OPT molecule format reduces data replication through signatures (a sketch of the idea follows this slide)
- Parallelized reading of input at startup:
  - Cannot support the legacy PDB format
  - Uses the binary coordinate format
- Changes in VMD courtesy of John Stone
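A hedged sketch of the signature idea (invented type and field names, not NAMD's actual MEM_OPT structures): bonded-term lists are factored into a small shared table, so each per-atom record carries only coordinates and a signature index.

    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Signature {                            // one per distinct atom pattern
        std::vector<int32_t> bondPartnerOffsets;  // bonded partners, relative indices
        std::vector<int32_t> angleTupleOffsets;   // angle-term partners, relative indices
    };

    struct Atom {                                 // one per atom, kept small
        float   x, y, z;
        int32_t signatureId;                      // index into the shared table
    };

    int main() {
        std::vector<Signature> signatures(1000);  // a few thousand distinct patterns
        std::vector<Atom> atoms(1000000);         // millions of atoms share them
        std::printf("bytes per atom record: %zu (atoms: %zu, signatures: %zu)\n",
                    sizeof(Atom), atoms.size(), signatures.size());
        return 0;
    }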

Parallel Startup

Fine-grain overhead
- End-user targets are all fixed-size problems
- Strong-scaling performance dominates:
  - Maximize the number of nanoseconds/day of simulation
- The non-bonded cutoff distance determines patch size:
  - A patch can be subdivided along the x, y, z dimensions: 2-away X, 2-away XY, 2-away XYZ
  - Theoretically k-away... (a combinatorial sketch follows this slide)
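A purely combinatorial sketch (my own arithmetic, not NAMD measurements) of why finer decomposition multiplies object counts: a k-away split divides each cutoff cube k ways per dimension, so there are k^3 times more patches and each patch interacts with a (2k+1)^3 neighborhood.

    #include <cstdio>

    int main() {
        for (long k = 1; k <= 3; ++k) {
            long morePatches   = k * k * k;
            long neighbors     = (2 * k + 1) * (2 * k + 1) * (2 * k + 1);
            long pairsPerPatch = (neighbors + 1) / 2;   // each pair counted once, incl. self
            std::printf("%ld-away: %ldx patches, %ld pair computes per patch\n",
                        k, morePatches, pairsPerPatch);
        }
        return 0;
    }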

1-away vs 2-away X

Fine-grain overhead reduction
- Distant computes have little or no interaction:
  - The long diagonal opposites of 2-away XYZ are mostly outside the cutoff
- Optimizations:
  - Don't migrate tiny computes
  - Sort pairlists to truncate computation (illustrated after this slide)
  - Increase the margin and do not create redundant compute objects
- Slight (<5%) reduction in step time
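A hedged illustration of the pairlist-sorting idea above (not NAMD's actual kernel; the 1/r^2 term is only a stand-in potential): once pair entries are sorted by distance, the inner loop can break at the first entry beyond the cutoff instead of testing every pair.

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    struct PairEntry { int j; double dist2; };    // partner index, squared distance

    static double interact(const std::vector<PairEntry>& pairs, double cutoff) {
        const double c2 = cutoff * cutoff;
        double energy = 0.0;
        for (const PairEntry& p : pairs) {
            if (p.dist2 > c2) break;              // all remaining entries are farther
            energy += 1.0 / p.dist2;              // placeholder interaction
        }
        return energy;
    }

    int main() {
        std::vector<PairEntry> pairs = {{0, 9.0}, {1, 1.0}, {2, 25.0}, {3, 4.0}};
        std::sort(pairs.begin(), pairs.end(),
                  [](const PairEntry& a, const PairEntry& b) { return a.dist2 < b.dist2; });
        std::printf("energy = %f\n", interact(pairs, 3.0));  // cutoff 3 -> dist2 <= 9
        return 0;
    }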

Future work
- Integrate parallel output into CVS NAMD
- Consolidate small compute objects
- Leverage the native communication API
- Improve or replace Particle Mesh Ewald
- Parallel I/O optimization study on multiple platforms
- High (>16k) scaling study on multiple platforms