
1 Scaling Applications to Massively Parallel Machines using Projections Performance Analysis Tool Presented by Chee Wai Lee Authors: L. V. Kale, Gengbin Zheng, Chee Wai Lee, Sameer Kumar

2 Motivation  Performance optimization is increasingly challenging –Modern applications are complex and dynamic –Some involve only a small amount of computation per step –Performance issues and obstacles change as applications and machines scale  We need very good performance analysis tools –Feedback at the level of the application –Analysis capabilities –Scalable views –Automatic instrumentation

3 Projections  Outline: –Projections: trace generation –Projections: views –Case study: NAMD, a molecular dynamics program that won a Gordon Bell award at SC’02 by scaling MD for biomolecules to 3,000 procs –Case study: CPAIMD, a Car-Parrinello ab initio MD application –Performance analysis on next-generation supercomputers: challenges

4 Trace Generation  Automatic instrumentation by the runtime system  Detailed –In log mode, each event is recorded in full detail (including timestamp) in an internal buffer  Summary –Reduces the size of output files and memory overhead –Produces (in the default mode) a few lines of output data per processor –Data is recorded in bins corresponding to intervals of 1 ms by default  Flexible –APIs and runtime options for instrumenting user events and controlling data generation
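In a Charm++ program the tracing mode is typically selected at link time (e.g., -tracemode projections for detailed logs or -tracemode summary for binned summaries), and the user-event API mentioned above can mark application-specific phases. A minimal sketch, assuming the standard Charm++ tracing calls (traceRegisterUserEvent, traceUserBracketEvent, CkWallTimer); the event name and the surrounding function are made up for illustration:

```cpp
// Minimal sketch (not from the talk): marking an application phase as a
// Projections user event in a Charm++ program. Assumes the standard
// Charm++ tracing API; the event name and function are illustrative.
#include "charm++.h"

static int EV_FORCE;   // user-event id assigned by the runtime

void registerEvents() {
  // Give the event a human-readable name that Projections will display.
  EV_FORCE = traceRegisterUserEvent("non-bonded force phase");
}

void computeForces() {
  double t0 = CkWallTimer();
  // ... force computation ...
  double t1 = CkWallTimer();
  // Bracket the phase so it shows up on the Projections timeline.
  traceUserBracketEvent(EV_FORCE, t0, t1);
}
```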


6 Post-mortem analysis: views  Utilization graph –As a function of time interval or processor –Shows processor utilization –As well as time spent on specific parallel methods  Timeline –Upshot-like, but with more detail –Pop-up views of method execution, message arrows, user-level events  Profile: stacked graphs –For a given period, a breakdown of the time on each processor, including idle time and message sending/receiving times


8 Projections Views: continued  Overview –Like a timeline, but includes all processors, and all time! –Each pixel (x,y) represents the utilization of processor y at time x  Histogram of method execution times –How many method-execution instances took 0-1 ms? 1-2 ms? and so on  Performance counters –Associated with each entry method –The usual hardware counters, via an interface to PAPI
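The histogram view described above amounts to binning per-invocation durations into fixed-width buckets. A generic sketch of that binning (not Projections code; the 1 ms bin width and the sample data are illustrative):

```cpp
// Generic sketch of the binning behind a "histogram of method execution
// times" view; bin width and data are illustrative, not Projections internals.
#include <cstdio>
#include <vector>

std::vector<long> histogram(const std::vector<double>& durations_ms,
                            double bin_ms, int nbins) {
  std::vector<long> bins(nbins, 0);
  for (double d : durations_ms) {
    int b = static_cast<int>(d / bin_ms);
    if (b >= nbins) b = nbins - 1;   // overflow bucket for stretched calls
    ++bins[b];
  }
  return bins;
}

int main() {
  std::vector<double> d = {0.4, 0.7, 1.3, 0.9, 5.2, 0.6};   // made-up times
  auto bins = histogram(d, 1.0, 8);                         // 1 ms bins
  for (std::size_t i = 0; i < bins.size(); ++i)
    std::printf("%zu-%zu ms: %ld\n", i, i + 1, bins[i]);
  return 0;
}
```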

9 Projections and Performance Analysis  Identify performance bottlenecks  Verify that performance meets expectations

10 Case Studies: Outline  We illustrate the use of Projections –Through case studies of NAMD & CPAIMD –Illustrating the use of different visualization options –Showing the performance debugging methodology

11 NAMD: A Production MD Program –Fully featured program –NIH-funded development –Distributed free of charge (~5000 downloads so far): binaries and source code –Installed at NSF centers –User training and support –Large published simulations (e.g., aquaporin simulation featured in the SC’02 keynote) –Collaboration with K. Schulten, R. Skeel, and co-workers

12 Molecular Dynamics in NAMD  Collection of [charged] atoms, with bonds –Newtonian mechanics –Thousands of atoms (10,000s)  At each time-step –Calculate forces on each atom: bonded, and non-bonded (electrostatic and van der Waals) –Short-range: every timestep –Long-range: using PME (3D FFT) –Multiple time stepping: PME every 4 timesteps –Calculate velocities and advance positions  Challenge: femtosecond time-step, millions of steps needed!
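The per-step structure above can be written as a plain multiple-time-stepping loop. This is a generic sketch rather than NAMD code; the function names, the System type, and the 4-step PME period parameter are placeholders for illustration:

```cpp
// Generic sketch of the multiple-time-stepping loop described on the slide;
// not NAMD code. Force routines are empty placeholders so the sketch compiles.
struct System { /* positions, velocities, forces, masses ... */ };

static void computeBondedForces(System&)        { /* bonds, angles, ... */ }
static void computeShortRangeNonbonded(System&) { /* cutoff electrostatics + van der Waals */ }
static void computePMELongRange(System&)        { /* PME via 3D FFT */ }
static void integrate(System&, double /*dt*/)   { /* update velocities, advance positions */ }

void runMD(System& sys, int nsteps, double dt_fs, int pmePeriod = 4) {
  for (int step = 0; step < nsteps; ++step) {
    computeBondedForces(sys);
    computeShortRangeNonbonded(sys);
    if (step % pmePeriod == 0)      // multiple time stepping:
      computePMELongRange(sys);     // long-range forces every few steps
    integrate(sys, dt_fs);          // femtosecond-scale timestep
  }
}
```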

13 NAMD Parallelization using Charm++ with PME –These 30,000+ Virtual Processors (VPs) are mapped to real processors by the Charm++ runtime system
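A minimal sketch of this kind of processor virtualization in Charm++: declare a chare array with many more elements than physical processors and let the runtime place them. The class, entry method, and generated header names are invented for illustration, not NAMD's actual decomposition:

```cpp
// Minimal sketch of Charm++-style processor virtualization (illustrative
// names). The corresponding hypothetical .ci file would declare:
//   array [1D] ForcePiece { entry ForcePiece(); entry void step(); }
#include "charm++.h"
#include "forcepiece.decl.h"   // hypothetical generated declarations

class ForcePiece : public CBase_ForcePiece {
public:
  ForcePiece() {}
  ForcePiece(CkMigrateMessage*) {}
  void step() { /* compute this piece's share of the forces */ }
};

void createPieces() {
  // Many more virtual processors than physical ones; the runtime maps
  // (and can later migrate) them across the machine.
  int numPieces = 30000;
  CProxy_ForcePiece pieces = CProxy_ForcePiece::ckNew(numPieces);
  pieces.step();   // broadcast an entry-method invocation to all elements
}
```

Because placement is the runtime's job, the same code runs unchanged as the processor count varies, and the runtime can later migrate elements for load balancing.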

14 Grainsize Issues  A variant of Amdahl’s law, for objects: –The fastest achievable time can be no shorter than the time for the biggest single object! –A lesson from previous efforts  Splitting computation objects: –30,000 nonbonded compute objects –Instead of approximately 10,000
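The observation can be stated as a lower bound on completion time; the notation below is mine, not from the slides. With n objects of costs t_1,...,t_n on P processors:

```latex
T(P) \;\ge\; \max\!\left( \frac{1}{P}\sum_{i=1}^{n} t_i ,\; \max_{i} t_i \right)
```

Splitting the roughly 10,000 nonbonded compute objects into about 30,000 smaller ones reduces the second term, which otherwise limits scaling.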

15 Distribution of execution times of non-bonded force computation objects (over 24 steps); mode: 700 µs

16 Message Packing Overhead and Multicast –Effect of the multicast optimization on integration overhead: it eliminates the overhead of message copying and allocation

17 Load Balancing –Processor utilization against time on 128 and 1024 processors –On 128 processors, a single aggressive load balancing step suffices, but on 1024 processors we also need a “refinement” step

18 Load Balancing Steps –Regular timesteps –Instrumented timesteps –Detailed, aggressive load balancing (object migration) –Refinement load balancing
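These steps rely on Charm++'s measurement-based load balancing framework. A minimal sketch of how a chare array element (continuing the hypothetical ForcePiece from the earlier sketch) typically participates, assuming the documented AtSync interface; the class and header names remain illustrative:

```cpp
// Sketch of measurement-based load balancing in a Charm++ chare array
// element (illustrative names; see the Charm++ LB framework documentation).
#include "charm++.h"
#include "forcepiece.decl.h"   // hypothetical generated declarations

class ForcePiece : public CBase_ForcePiece {
public:
  ForcePiece() { usesAtSync = true; }   // opt in to AtSync-based balancing
  ForcePiece(CkMigrateMessage*) {}
  void step() {
    // ... one timestep's work; the runtime records this object's load ...
    AtSync();                            // pause here at load balancing points
  }
  void ResumeFromSync() {
    // Called after objects have (possibly) migrated; continue timestepping.
    thisProxy[thisIndex].step();
  }
  void pup(PUP::er& p) { /* pack/unpack object state so it can migrate */ }
};
```

The aggressive and refinement steps would then correspond to balancer choices made at job launch, e.g. +balancer GreedyLB followed by +balancer RefineLB.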

19 Processor utilization across processors after (a) greedy load balancing and (b) refinement –Note that the underloaded processors are left underloaded (as they don’t impact performance); refinement deals only with the overloaded ones

20 Benefits of Avoiding Barriers  Problem with barriers: –Not so much the direct cost of the operation itself –But that they prevent the program from adjusting to small variations  E.g., K phases, separated by barriers (or scalar reductions); the load is effectively balanced, but –In each phase there may be slight non-deterministic load imbalance –Let $L_{i,j}$ be the load on the $i$-th processor in the $j$-th phase –With barriers the total time is $\sum_{j}\max_{i} L_{i,j}$; without them it is approximately $\max_{i}\sum_{j} L_{i,j}$
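A small made-up example of why the two expressions differ, with two processors and two phases (the numbers are purely illustrative):

```latex
% L_{1,1}=4,\; L_{1,2}=6,\qquad L_{2,1}=6,\; L_{2,2}=4
\text{With barriers: } \sum_{j}\max_{i} L_{i,j} = \max(4,6)+\max(6,4) = 6+6 = 12
\qquad
\text{Without: } \max_{i}\sum_{j} L_{i,j} = \max(4+6,\; 6+4) = 10
```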

21 (Figure: timings in milliseconds)

22 Handling Stretches  Challenge –NAMD still did not scale well to 3000 procs with 4 procs per node –Due to stretches: inexplicable increases in compute time or communication gaps at random (but few) points –Stretches caused by: operating system, file system and resource management daemons interfering with the job; a badly configured network API (messages waiting for the rendezvous of the previous message to be acknowledged, leading to stretches in the ISends)  Managing stretches –Use blocking receives –Give the OS time to run daemons when the job process is idle –Fine-tune the network layer
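A generic MPI-flavored sketch (not NAMD's or Charm++'s actual machine layer) of the first two remedies above: replacing a busy-polling receive with a blocking one, and yielding the CPU while idle so daemons can run; the function names are invented:

```cpp
// Generic sketch, not the actual NAMD/Charm++ network layer: contrast a
// busy-polling receive (which starves OS daemons) with a blocking receive,
// and yield the CPU while idle so daemons get time without causing stretches.
#include <mpi.h>
#include <sched.h>

void receiveBusyPolling(void* buf, int count, int src, int tag) {
  MPI_Request req;
  int done = 0;
  MPI_Irecv(buf, count, MPI_BYTE, src, tag, MPI_COMM_WORLD, &req);
  while (!done) {
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);  // spins at 100% CPU
  }
}

void receiveBlocking(void* buf, int count, int src, int tag) {
  // A blocking call lets the MPI layer and OS decide how to wait,
  // rather than burning cycles in a user-level spin loop.
  MPI_Recv(buf, count, MPI_BYTE, src, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

void idleHook() {
  // Called when the process has no work: give the OS a chance to run
  // file-system and resource-management daemons now, instead of having
  // them preempt the job in the middle of a timestep.
  sched_yield();
}
```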

23 Stretched Computations  Jitter in computes of up to 80 ms –On runs using 4 processors per node –NAMD ATPase on 3000 processors, with time steps of 12 ms –Within that time, each processor sends and receives messages of 4-6 KB each –OS context-switch time is 10 ms –The OS and communication layer can have “hiccups”; these “hiccups” are termed stretches –Stretches can be a large performance impediment

24 Stretch Removal: Histogram Views –Number of function executions vs. their granularity (note: log scale on the Y-axis) –Before optimizations: over 16 large stretched calls –After optimizations: about 5 large stretched calls, the largest of them much smaller, and almost all calls taking less than 3.2 ms

25 Activity Priorities  The Time Profile tool identified a portion of CPAIMD that ran too early.
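In Charm++, work that runs too early can be pushed back by sending its triggering message at a lower priority. A minimal sketch using CkEntryOptions with integer-prioritized queueing; the proxy, entry method, header, and priority value are invented for illustration:

```cpp
// Sketch of deprioritizing an activity with Charm++ prioritized messages.
// CProxy_PhaseWorker, its startLatePhase() entry method, and the header
// are hypothetical; CkEntryOptions is the standard mechanism.
#include "charm++.h"
#include "phaseworker.decl.h"   // hypothetical generated declarations

void scheduleLatePhase(CProxy_PhaseWorker workers) {
  CkEntryOptions opts;
  opts.setQueueing(CK_QUEUEING_IFIFO);  // integer-prioritized FIFO queueing
  opts.setPriority(100);                // larger integer = lower priority
  // Delivered behind higher-priority messages already in the queue,
  // so this phase no longer runs "too early".
  workers.startLatePhase(&opts);
}
```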

26 Serial Performance  Performance counters helped identify serial performance issues such as poor cache behavior.  Projections uses PAPI to read hardware performance counters.
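For context, a stand-alone sketch of reading cache-miss and cycle counters with PAPI's low-level API; this shows plain PAPI usage, not how Projections integrates it, and the choice of events and the dummy kernel are illustrative:

```cpp
// Stand-alone sketch of reading hardware counters with PAPI's low-level API;
// event selection is illustrative, not Projections' internal instrumentation.
#include <papi.h>
#include <cstdio>

int main() {
  if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) return 1;

  int eventSet = PAPI_NULL;
  PAPI_create_eventset(&eventSet);
  PAPI_add_event(eventSet, PAPI_L2_TCM);   // total L2 cache misses
  PAPI_add_event(eventSet, PAPI_TOT_CYC);  // total cycles

  PAPI_start(eventSet);
  // ... the serial section being analyzed, e.g. a force-computation kernel ...
  volatile double sink = 0;
  for (int i = 0; i < 1000000; ++i) sink += i * 0.5;

  long long counts[2];
  PAPI_stop(eventSet, counts);
  std::printf("L2 misses: %lld, cycles: %lld\n", counts[0], counts[1]);
  return 0;
}
```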

27 Challenges Ahead  Scalable performance data generation –Meaningful restrictions on trace data generation –Data compression –Online analysis  Scalable performance visualization –Automatic identification of performance problems