The Asynchronous Dynamic Load-Balancing Library
Rusty Lusk, Steve Pieper, Ralph Butler, Anthony Chan
Mathematics and Computer Science Division / Nuclear Physics Division
Argonne National Laboratory

2 Outline
- The nuclear physics problem: GFMC
- The CS problem: scalability, load balancing, simplicity for the application
- Approach: a portable, scalable library implementing a simple programming model
- Status
  – Choices made
  – Lessons learned
  – Promising results so far
- Plans
  – Scaling up
  – Branching out
  – Using threads

3 The Physics Project: Green's Function Monte Carlo (GFMC)
- The benchmark for nuclei with 12 or fewer nucleons
- Starts with a variational wave function containing non-central correlations
- Uses imaginary-time propagation to filter out excited-state contamination
  – Samples are removed or multiplied during propagation, so the amount of work fluctuates
  – Local energies are evaluated every 40 steps
  – For 12C, expect ~10,000 Monte Carlo samples to be propagated for ~1000 steps
- Non-ADLB version:
  – Several samples per node; work on one sample is not parallelized
- ADLB version:
  – Three-body potential and propagator parallelized
  – Wave functions for kinetic energy and two-body potential computed in parallel
  – Can use many processors per sample
- Physics target: properties of both ground and excited states of 12C

4 The Computer Science Project: ADLB
- Specific problem: scale up GFMC, a master/slave code
- General problem: scale up the master/slave paradigm in general
  – Usually based on a single (or shared) data structure
- Goal for GFMC: scale to 160,000 processes (available on BG/P)
- General goal: provide a simple yet scalable programming model for algorithms parallelized via a master/slave structure
- General goal in the GFMC setting: eliminate (most) MPI calls from GFMC and scale the general approach
- GFMC is not an easy case:
  – Multiple types of work
  – Any process can create work
  – Large work units (multi-megabyte)
  – Priorities and types used together to specify some sequencing without constraining parallelism ("workflow")

5 Master/Slave Algorithms and Load Balancing
- Advantages
  – Automatic load balancing
- Disadvantages
  – Scalability: the master can become a bottleneck
- Wrinkles
  – Slaves may create new work
  – Multiple work types and priorities that impose ordering
(Diagram: a master managing a shared work queue, with slaves requesting work from it.)
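To make the pattern concrete, here is a minimal master/slave loop in C with MPI. It is only a sketch: the message tags, the one-double work unit, and the fixed work count are illustrative assumptions, not GFMC's actual protocol.

```c
/* Minimal master/slave sketch (illustrative only; tags and the
 * work-unit layout are assumptions, not the GFMC protocol). */
#include <mpi.h>

#define TAG_REQUEST 1
#define TAG_WORK    2
#define TAG_STOP    3

static double do_work(double unit) { return unit * unit; }  /* placeholder kernel */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                       /* master: owns the work queue */
        int next = 0, nwork = 100, active = size - 1;
        while (active > 0) {
            MPI_Status st;
            double dummy;
            MPI_Recv(&dummy, 1, MPI_DOUBLE, MPI_ANY_SOURCE,
                     TAG_REQUEST, MPI_COMM_WORLD, &st);
            if (next < nwork) {            /* hand out the next unit */
                double unit = (double) next++;
                MPI_Send(&unit, 1, MPI_DOUBLE, st.MPI_SOURCE,
                         TAG_WORK, MPI_COMM_WORLD);
            } else {                       /* no work left: retire the slave */
                double none = 0.0;
                MPI_Send(&none, 1, MPI_DOUBLE, st.MPI_SOURCE,
                         TAG_STOP, MPI_COMM_WORLD);
                active--;
            }
        }
    } else {                               /* slave: request, compute, repeat */
        for (;;) {
            MPI_Status st;
            double unit, request = 0.0;
            MPI_Send(&request, 1, MPI_DOUBLE, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&unit, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            do_work(unit);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Every request funnels through rank 0, which is the scalability bottleneck the slide refers to; ADLB's aim is to keep this simple request/compute loop in the application while distributing the queue behind the scenes.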

6 Load Balancing in GFMC Before ADLB
- Master/slave algorithm
- Slaves do create work dynamically
- Newly created work is stored locally
- Periodic load balancing (see the sketch below):
  – Done by the master
  – Slaves communicate their work-queue lengths to the master
  – Master determines a reallocation of work
  – Master tells the slaves to reallocate work
  – Slaves communicate work to one another
  – Computation continues with balanced work queues
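The sketch below, in C with MPI, shows roughly what such a periodic rebalancing step looks like. The pairing heuristic, the two-int plan format, and the function name rebalance are assumptions made for illustration; they are not GFMC's actual code.

```c
/* Sketch of a periodic, master-driven rebalancing step. */
#include <mpi.h>
#include <stdlib.h>

void rebalance(MPI_Comm comm, int my_qlen, int *send_to, int *nsend)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    /* 1. Slaves report their work-queue lengths to the master (rank 0). */
    int *qlens = NULL;
    if (rank == 0) qlens = malloc(size * sizeof(int));
    MPI_Gather(&my_qlen, 1, MPI_INT, qlens, 1, MPI_INT, 0, comm);

    /* 2. Master decides who sends how much work to whom.  Here: a naive
     *    plan that moves each rank's surplus above the mean to the most
     *    underloaded rank; the plan is two ints per rank (dest, count). */
    int *plan = NULL;
    if (rank == 0) {
        plan = calloc(2 * size, sizeof(int));
        long total = 0;
        for (int i = 0; i < size; i++) total += qlens[i];
        int mean = (int)(total / size);
        for (int i = 0; i < size; i++) {
            if (qlens[i] <= mean) { plan[2*i] = -1; continue; }
            int dest = 0;
            for (int j = 1; j < size; j++)
                if (qlens[j] < qlens[dest]) dest = j;
            plan[2*i]     = dest;
            plan[2*i + 1] = qlens[i] - mean;
            qlens[dest]  += qlens[i] - mean;   /* account for the transfer */
            qlens[i]      = mean;
        }
    }

    /* 3. Master tells every slave its part of the plan; the slaves then
     *    exchange the actual work units among themselves (not shown). */
    int my_plan[2] = { -1, 0 };
    MPI_Scatter(plan, 2, MPI_INT, my_plan, 2, MPI_INT, 0, comm);
    *send_to = my_plan[0];
    *nsend   = my_plan[1];

    if (rank == 0) { free(qlens); free(plan); }
}
```

This is the centralized, periodic step that ADLB replaces with continuous, server-mediated balancing.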

7 GFMC Before ADLB

8 Zoomed in

9 Blue Gene/P
- 4 CPUs per node
- 4 threads or processes per node
- 160,000 CPUs
- Efficient MPI communication
- Original GFMC not expected to continue to scale past 2,000 processors
  – More parallelism needed to exploit BG/P for carbon-12
  – Existing load-balancing mechanism will not scale
- A general-purpose load-balancing library should
  – Enable GFMC to scale
  – Be of use to other codes as well
  – Simplify parallel programming of applications

10 The Vision
- No explicit master for load balancing; slaves make calls to the ADLB library, and those subroutines access local and remote data structures (remote ones via MPI)
- Simple Put/Get interface from application code to a distributed work queue hides most MPI calls
  – Advantage: multiple applications may benefit
  – Wrinkle: variable-size work units, in Fortran, introduce some complexity in memory management
- Proactive load balancing in the background
  – Advantage: the application is never delayed by searching for work on other slaves
  – Wrinkle: scalable work-stealing algorithms are not obvious

11 The API (Application Programming Interface)
- Basic calls
  – ADLB_Init( num_servers, am_server, app_comm )
  – ADLB_Server()
  – ADLB_Put( type, priority, len, buf, answer_dest )
  – ADLB_Reserve( req_types, work_handle, work_len, work_type, work_prio, answer_dest )
  – ADLB_Ireserve( … )
  – ADLB_Get_Reserved( … )
  – ADLB_Set_done()
  – ADLB_Finalize()
- A few others, for tuning and debugging
  – (still at an experimental stage)
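A sketch of how an application might drive these calls is shown below, in C. It follows the simplified argument lists on the slide; the header name, buffer sizes, the work-type constant, the return codes (ADLB_SUCCESS, ADLB_NO_MORE_WORK), and the compute_on placeholder are assumptions for illustration and will differ in detail from the real library.

```c
/* Usage sketch only: follows the simplified signatures listed above.
 * Header name, constants, and return codes are assumptions, not the
 * library's exact interface. */
#include <mpi.h>
#include "adlb.h"              /* assumed header name */

#define NUM_SERVERS   1        /* ranks dedicated to serving the work queue */
#define TYPE_SAMPLE   1        /* illustrative application-defined work type */
#define MAX_WORK_LEN  (1 << 20)

/* Placeholder application kernel: returns nonzero if it created new work. */
static int compute_on(char *work, int len) { (void)work; (void)len; return 0; }

static void worker_loop(MPI_Comm app_comm)
{
    static char buf[MAX_WORK_LEN];
    int handle[16];            /* opaque reservation handle (size assumed) */
    int work_len, work_type, work_prio, answer_dest;
    int req_types[4] = { TYPE_SAMPLE, -1, -1, -1 };   /* types we accept */

    (void)app_comm;            /* available for the application's own MPI traffic */
    for (;;) {
        int rc = ADLB_Reserve(req_types, handle, &work_len,
                              &work_type, &work_prio, &answer_dest);
        if (rc == ADLB_NO_MORE_WORK)       /* signalled after ADLB_Set_done() */
            break;
        ADLB_Get_Reserved(buf, handle);    /* fetch the reserved payload */

        if (compute_on(buf, work_len))     /* new work spawned goes back via Put */
            ADLB_Put(TYPE_SAMPLE, /*priority=*/1, work_len, buf, /*answer_dest=*/-1);
    }
}

int main(int argc, char **argv)
{
    int am_server;
    MPI_Comm app_comm;

    MPI_Init(&argc, &argv);
    ADLB_Init(NUM_SERVERS, &am_server, &app_comm);

    if (am_server) {
        ADLB_Server();         /* this rank spends its life serving the queue */
    } else {
        worker_loop(app_comm);
        /* In a real code, only the rank that detects global completion
         * would call ADLB_Set_done(); shown here for completeness. */
        ADLB_Set_done();
    }

    ADLB_Finalize();
    MPI_Finalize();
    return 0;
}
```

The point of the interface is visible in the worker loop: the application only puts and gets work units by type and priority, and never issues the point-to-point MPI calls that the pre-ADLB load balancer required.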

12 Asynchronous Dynamic Load Balancing: Thread Approach
- The basic idea:
(Diagram: application threads put/get work into a work queue held in shared memory on the node; a dedicated ADLB library thread handles MPI communication with other nodes.)

13 ADLB: Process Approach
(Diagram: application processes put/get work to a set of dedicated ADLB server processes.)

14 Early Version of ADLB in GFMC on BG/L

15 History and Status
- Year 1: learned the application; worked out the first version of the API; did a thread version of the implementation
- Year 2: switched to the process version; began experiments at scale
  – On BG/L (4096), SiCortex (5800), and BG/P (16K so far)
  – Variational Monte Carlo (part of GFMC)
  – Full GFMC (see the following performance graph)
  – Latest: GFMC for a neutron drop at large scale
- Some additions to the API for increased scalability and memory management
- Internal changes to manage memory
- Still working on memory management for full GFMC with fine-grain parallelism, needed for 12C
- Basic API not changing

16 Comparing Speedup

17 Most Recent Runs
- 14-neutron drop on 16,384 processors of BG/P
- Speedup of 13,634 (83% efficiency)
- No microparallelization, since there are more configurations than processors
- ADLB processed 171 million work packages of 129 KB each, a total of 20.5 terabytes of data moved
- Heretofore uncomputed level of accuracy for the computed energy and density
- Also some benchmarking runs for 9Be and 7Li
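As a rough consistency check on those figures: 13,634 / 16,384 ≈ 0.832, i.e. the quoted 83% parallel efficiency, and 171 × 10^6 packages × 129 KB ≈ 2.2 × 10^13 bytes, consistent with the stated ~20.5 terabytes moved (using binary kilobytes and terabytes).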

18 Future Plans
- Short detour into scalable debugging tools for understanding behavior, particularly memory usage
- Further microparallelization of GFMC using ADLB
- Scaling up on BG/P
- Revisit the thread model for the ADLB implementation, particularly in anticipation of BG/Q and experimental compute-node Linux on BG/P
- Help with multithreading the application (locally parallel execution of work units via OpenMP; see the sketch below)
- Work with others in the project who use manager/worker patterns in their codes
- Work with others outside the project in other SciDACs
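As an illustration of that OpenMP point, a work unit fetched from ADLB could be executed locally in parallel along the lines below. The work_unit_t layout and the per-term kernel are hypothetical stand-ins, not GFMC data structures.

```c
/* Hypothetical sketch: OpenMP threads share the loop over the terms of
 * one work unit fetched via ADLB.  Names are illustrative, not GFMC's. */
#include <omp.h>

typedef struct {
    int     nterms;
    double *terms;
} work_unit_t;

double process_work_unit(const work_unit_t *w)
{
    double energy = 0.0;

    /* One MPI rank reserved/fetched this unit; its OpenMP threads now
     * split the inner loop and combine their partial sums. */
    #pragma omp parallel for reduction(+:energy)
    for (int i = 0; i < w->nterms; i++)
        energy += 0.5 * w->terms[i];   /* stand-in for the real per-term kernel */

    return energy;
}
```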

19 Summary
- We have designed a simple programming model for a class of applications
- We are working on this in the context of a specific UNEDF application, which is a challenging one
- We have done two implementations
- Performance results are promising so far
- We still have not completely conquered the problem
- Needed: tools for understanding behavior, particularly to help with application-level debugging

20 The End