Optimizing Molecular Dynamics and Stencil Mini-applications for Intel's Knights Landing
IXPUG, SC'16 Lightning Talk
Kavitha Chandrasekar*, Laxmikant V. Kale
University of Illinois at Urbana-Champaign

Charm++ Runtime System
- Supports an asynchronous, message-driven execution model
- Overdecomposition of work units (chares) onto PEs (Processing Elements); see the sketch below
- Migratability of work units
- Load balancing framework
- Supports several applications: NAMD, OpenAtom, ChaNGa

[Figure: over-decomposition of work into Charm++ objects, mapping of Charm++ objects to processors, and the Charm++ RTS view of over-decomposition]
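To make the over-decomposition idea concrete, here is a minimal plain-C++ sketch, not actual Charm++ code (which uses chare arrays and .ci interface files): it creates many more work units than PEs and maps them round-robin, the kind of default mapping the RTS starts from before load balancing migrates objects. All names and counts are illustrative.

```cpp
#include <cstdio>
#include <vector>

// Illustrative sketch of overdecomposition: many more work units
// ("chares") than processing elements, mapped round-robin to PEs.
// The real Charm++ RTS handles mapping and migration itself.
struct Chare {
    int id;            // index of this work unit
    double load = 1.0; // measured or estimated work
};

int main() {
    const int numPEs = 8;              // hardware threads (PEs)
    const int overdecomposition = 8;   // chares per PE (tunable knob)
    const int numChares = numPEs * overdecomposition;

    std::vector<Chare> chares(numChares);
    std::vector<std::vector<int>> peToChares(numPEs);

    for (int i = 0; i < numChares; ++i) {
        chares[i].id = i;
        peToChares[i % numPEs].push_back(i); // simple round-robin map
    }

    for (int pe = 0; pe < numPEs; ++pe)
        std::printf("PE %d owns %zu chares\n", pe, peToChares[pe].size());
    return 0;
}
```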

LeanMD – Molecular Dynamics Mini-application
- Molecular dynamics simulation mini-application
- Atoms are divided spatially into k-cells roughly the size of the cutoff distance (3-away strategy); see the sketch below
- Uses the PINY MD physics engine for its molecular dynamics calculations
- Computation bound: computation of interacting forces
- Communication involved: exchange of atoms based on their calculated positions
- Load imbalance arises from the exchange of atoms

[Figure: cutoff radius and LeanMD work units]
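A hedged sketch of the spatial decomposition described above, in plain C++ rather than LeanMD's actual code: atoms are binned into cells whose edge length is roughly the cutoff distance, so force computation only needs to consider atoms in nearby cells. The Atom type, cutoff value, and cellIndex helper are illustrative assumptions.

```cpp
#include <array>
#include <cmath>
#include <cstdio>
#include <vector>

// Bin each atom into the spatial cell it belongs to, where the cell edge
// length equals the interaction cutoff (box assumed to start at the origin).
struct Atom { double x, y, z; };

std::array<int, 3> cellIndex(const Atom& a, double cutoff) {
    return { static_cast<int>(std::floor(a.x / cutoff)),
             static_cast<int>(std::floor(a.y / cutoff)),
             static_cast<int>(std::floor(a.z / cutoff)) };
}

int main() {
    const double cutoff = 12.0;  // illustrative cutoff distance
    std::vector<Atom> atoms = { {1.0, 2.0, 3.0}, {25.0, 13.0, 7.5} };
    for (const Atom& a : atoms) {
        auto c = cellIndex(a, cutoff);
        std::printf("atom at (%.1f, %.1f, %.1f) -> cell (%d, %d, %d)\n",
                    a.x, a.y, a.z, c[0], c[1], c[2]);
    }
    return 0;
}
```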

Stencil3D Application
- Representative of the communication pattern between neighboring chares in a 2D or 3D grid (e.g., MIMD lattice computation)
- 3D chare array of doubles
- Regular communication pattern with 7 neighbor chares
- Computation performed on received data: temperatures are updated after receiving boundary data from neighbors (sketched below)
- Communication/memory bound
- No load imbalance

[Figure: five-point communication for a 2D stencil; (x, y) exchanges with (x-1, y), (x+1, y), (x, y-1), (x, y+1)]
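The per-chare computation can be pictured as a 7-point 3D Jacobi update, sketched below in plain C++ under simplified assumptions (no ghost exchange, fixed boundaries); it is not the actual Stencil3D kernel.

```cpp
#include <cstdio>
#include <vector>

// 7-point 3D Jacobi update: each interior point is averaged with its six
// face neighbors. The real application first exchanges ghost layers with
// neighboring chares, which is omitted here.
void jacobiStep(const std::vector<double>& in, std::vector<double>& out,
                int nx, int ny, int nz) {
    auto idx = [=](int i, int j, int k) { return (i * ny + j) * nz + k; };
    for (int i = 1; i < nx - 1; ++i)
        for (int j = 1; j < ny - 1; ++j)
            for (int k = 1; k < nz - 1; ++k)
                out[idx(i, j, k)] =
                    (in[idx(i, j, k)] +
                     in[idx(i - 1, j, k)] + in[idx(i + 1, j, k)] +
                     in[idx(i, j - 1, k)] + in[idx(i, j + 1, k)] +
                     in[idx(i, j, k - 1)] + in[idx(i, j, k + 1)]) / 7.0;
}

int main() {
    const int n = 16;  // small illustrative block
    std::vector<double> a(n * n * n, 1.0), b(n * n * n, 0.0);
    jacobiStep(a, b, n, n, n);
    std::printf("center value after one step: %f\n",
                b[(n / 2 * n + n / 2) * n + n / 2]);
    return 0;
}
```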

Experimental Setup
- Charm++ in SMP mode
  - Reduces message copies
  - Ability to map workers to cores
- Single KNL node (Stampede 1.5)
  - Quadrant mode
  - Cache mode (except for the comparison of Cache and Flat modes)
  - 256 of the 272 hardware threads were used for mapping work units to PEs
- Performance tools
  - IPC, MCDRAM, and DDR4 accesses: perf stat
  - Energy numbers: RAPL counters (see sketch below)
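For the energy numbers, one way to sample RAPL counters on Linux is the powercap sysfs interface, sketched below; the exact path, driver availability, and permissions are system-dependent assumptions, and this is not necessarily how the measurements in the talk were taken.

```cpp
#include <cstdio>
#include <fstream>

// Hedged sketch: read the RAPL package-energy counter via Linux's powercap
// sysfs interface (requires the intel_rapl driver and read permission).
// energy_uj is reported in microjoules and eventually wraps around, which
// this sketch ignores.
long long readEnergyMicrojoules() {
    std::ifstream f("/sys/class/powercap/intel-rapl:0/energy_uj");
    long long uj = -1;
    f >> uj;
    return uj;
}

int main() {
    long long before = readEnergyMicrojoules();
    // ... run the application phase being measured ...
    long long after = readEnergyMicrojoules();
    if (before >= 0 && after >= before)
        std::printf("package energy: %.3f J\n", (after - before) / 1e6);
    else
        std::printf("RAPL counter not available or wrapped\n");
    return 0;
}
```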

LeanMD – Performance Analysis
- Load balancing using greedy heuristics (see sketch below)
- Input size (cell and compute dimensions): 16 x 16 x 16 cells
- CPU pinning: no significant degradation was observed without CPU pinning

[Plots: speedup (higher is better) and energy usage (lower is better)]
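The greedy heuristic can be illustrated with a short C++ sketch in the spirit of Charm++'s GreedyLB, though not its actual implementation: sort work units by measured load and repeatedly assign the heaviest remaining unit to the currently least-loaded PE, tracked with a min-heap. The loads and PE count below are made up.

```cpp
#include <algorithm>
#include <cstdio>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

int main() {
    std::vector<double> objLoad = {5.0, 1.0, 3.0, 2.0, 4.0, 2.5}; // example loads
    const int numPEs = 3;

    // Order objects from heaviest to lightest.
    std::vector<int> order(objLoad.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoad[a] > objLoad[b]; });

    // Min-heap of (current load, PE id) to find the least-loaded PE.
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> pes;
    for (int p = 0; p < numPEs; ++p) pes.push({0.0, p});

    std::vector<int> assignment(objLoad.size());
    for (int obj : order) {
        auto [load, pe] = pes.top();
        pes.pop();
        assignment[obj] = pe;
        pes.push({load + objLoad[obj], pe});
    }
    for (size_t i = 0; i < assignment.size(); ++i)
        std::printf("object %zu (load %.1f) -> PE %d\n",
                    i, objLoad[i], assignment[i]);
    return 0;
}
```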

Stencil3D – Performance Analysis
- Large input: 2048 x 2048 x 1024, block dimensions 256 x 128 x 128
- The optimal configuration for smaller working-set sizes was also 128 threads (2 hyperthreads per core)
- Factors limiting further performance improvement:
  - Working-set size, which potentially causes increased eviction from MCDRAM
  - More memory streams in parallel
  - Overhead from a higher number of worker threads
- Scope for energy-efficient execution with 2 hyperthreads per core
- CPU pinning: 10%-30% performance degradation without CPU pinning (pinning mechanism sketched below)
  - Loss of L2 locality, context switches, thread migrations
  - Load imbalance due to variation in core frequencies
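As an illustration of the pinning mechanism behind these numbers, the sketch below pins the calling thread to a core with the Linux pthread affinity API; Charm++ normally handles this itself through its own affinity options (e.g., +setcpuaffinity), so this only shows the underlying idea, not what the talk used.

```cpp
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   // for cpu_set_t macros and pthread_setaffinity_np
#endif
#include <pthread.h>
#include <sched.h>
#include <cstdio>

// Pin the calling thread to one core (compile with -pthread on Linux).
bool pinToCore(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

int main() {
    if (pinToCore(0))
        std::printf("pinned to core 0\n");
    else
        std::printf("failed to pin (no such core or insufficient permissions)\n");
    return 0;
}
```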

Flat Mode vs. Cache Mode
- LeanMD and Stencil3D show no significant performance difference between Cache mode and Flat mode (MCDRAM allocation in Flat mode sketched below)
  - Input data fits in MCDRAM in both cases
- STREAM is the only benchmark that benefits from Flat mode, for an input data size half the size of MCDRAM
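In Flat mode, MCDRAM appears as a separate NUMA domain, so hot buffers can be placed there explicitly instead of relying on the Cache-mode hardware cache. The sketch below assumes the memkind library's hbwmalloc interface is installed; numactl-based binding is another common route, and neither is claimed to be what the talk used.

```cpp
#include <cstdio>
#include <hbwmalloc.h>   // memkind's high-bandwidth-memory interface (link with -lmemkind)

int main() {
    const size_t n = 1 << 20;
    if (hbw_check_available() != 0) {            // 0 means HBM is available
        std::printf("no high-bandwidth memory on this node\n");
        return 0;
    }
    double* buf = static_cast<double*>(hbw_malloc(n * sizeof(double)));
    if (!buf) return 1;
    for (size_t i = 0; i < n; ++i) buf[i] = 1.0; // first touch lands in MCDRAM
    std::printf("allocated %zu doubles in MCDRAM\n", n);
    hbw_free(buf);
    return 0;
}
```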

Conclusion and Future Work
- HPC applications like LeanMD and Stencil3D can benefit in both performance and energy when tuned appropriately on Knights Landing
- Depending on working-set size, there may be thrashing of HBM in Cache mode
- Memory-bound applications might not benefit from using all 4 hyperthreads
- The overdecomposition factor is a key influence on performance, depending on the number of worker PEs
- Variation in core frequencies can lead to load imbalance, making CPU pinning or similar techniques necessary
- Future work: automated tuning for application characteristics within the Charm++ framework, adjusting knobs such as:
  - Hyperthreading
  - CPU affinity
  - Scheduling on nodes with the preferred HBM usage mode