ADLB Update: Recent and Current Adventures with the Asynchronous Dynamic Load Balancing Library
Rusty Lusk
Mathematics and Computer Science Division, Argonne National Laboratory

Outline
• Review of what ADLB is
• Progress in the context of GFMC
• Recent applications other than GFMC
• Current implementation activities

Load Balancing
• Definition: the assignment (scheduling) of tasks (code + data) to processes so as to minimize the total idle time of the processes
• Static load balancing
  – all tasks are known in advance and pre-assigned to processes
  – works well if all tasks take the same amount of time
  – requires no coordination process
• Dynamic load balancing
  – tasks are assigned to processes by a coordinating process as workers become available
  – requires communication between manager and worker processes
  – tasks may create additional tasks
  – tasks may be quite different from one another

Generic Master/Slave Algorithm
• Easily implemented in MPI
• Solves some problems
  – implements dynamic load balancing
  – termination detection (either preemptive or by exhaustion)
  – dynamic task creation
  – can implement workflow structure of tasks
• Scalability problems
  – master can become a communication bottleneck (granularity dependent)
  – memory can become a bottleneck (depends on task description size)
[Diagram: a master process holding the shared work queue, serving slave processes]
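The pattern on this slide is standard enough that a compact sketch may help make the bottleneck discussion concrete. The following is a generic MPI master/slave illustration, not ADLB code; the tags, the integer task representation, and do_task are invented for the example.

```c
/* Minimal MPI master/slave sketch (illustrative only). */
#include <mpi.h>

enum { TAG_REQUEST = 1, TAG_WORK = 2, TAG_STOP = 3 };

static int do_task(int task) { return task * task; }   /* stand-in computation */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                              /* master */
        const int num_tasks = 100;
        int next = 0, stopped = 0, msg;
        MPI_Status st;
        while (stopped < size - 1) {
            /* every incoming message is a request for more work */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &st);
            if (next < num_tasks) {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK,
                         MPI_COMM_WORLD);
                next++;
            } else {
                MPI_Send(&next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP,
                         MPI_COMM_WORLD);
                stopped++;
            }
        }
    } else {                                      /* slave */
        int task, result = 0;
        MPI_Status st;
        for (;;) {
            /* ask the master for work (the value carried is the last result) */
            MPI_Send(&result, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
            MPI_Recv(&task, 1, MPI_INT, 0, MPI_ANY_TAG, MPI_COMM_WORLD, &st);
            if (st.MPI_TAG == TAG_STOP) break;
            result = do_task(task);
        }
    }
    MPI_Finalize();
    return 0;
}
```

Every request and reply passes through rank 0, which is exactly why the master becomes a communication and memory bottleneck at scale.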

The ADLB Model (no master)
• Doesn't really change algorithms in slaves
• Not a new idea (e.g., Linda)
• But need scalable, portable, distributed implementation of shared work queue
  – MPI complexity hidden here
[Diagram: slaves putting work to and getting work from a shared work queue; no master process]

API for a Simple Programming Model
• Basic calls
  – ADLB_Init( num_servers, am_server, app_comm )
  – ADLB_Server()
  – ADLB_Put( type, priority, len, buf, target_rank, answer_dest )
  – ADLB_Reserve( req_types, handle, len, type, prio, answer_dest )
  – ADLB_Ireserve( … )
  – ADLB_Get_Reserved( handle, buffer )
  – ADLB_Set_Done()
  – ADLB_Finalize()
• A few others, for tuning and debugging
  – ADLB_{Begin,End}_Batch_Put()
  – getting performance statistics with ADLB_Get_info(key)
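A hedged sketch of how a program using this API is typically structured, based only on the call list above: some ranks become dedicated servers inside ADLB_Server(), the rest run application code. The exact C prototypes, the pointer-versus-value argument passing, the header name, and the choice of 4 servers are assumptions for illustration, not the library's documented interface.

```c
#include <mpi.h>
#include "adlb.h"          /* header name assumed */

int main(int argc, char **argv)
{
    int am_server;
    MPI_Comm app_comm;      /* communicator containing only application ranks */

    MPI_Init(&argc, &argv);
    ADLB_Init(4, &am_server, &app_comm);   /* ask for 4 dedicated server ranks */

    if (am_server) {
        ADLB_Server();      /* this rank manages part of the shared work queue */
    } else {
        /* application ranks: seed and process work with ADLB_Put / ADLB_Reserve /
           ADLB_Get_Reserved (see the worker-loop sketch after the next slide) */
    }

    ADLB_Finalize();
    MPI_Finalize();
    return 0;
}
```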

API Notes
• Return codes (defined constants)
  – ADLB_SUCCESS
  – ADLB_NO_MORE_WORK
  – ADLB_DONE_BY_EXHAUSTION
  – ADLB_NO_CURRENT_WORK (for ADLB_Ireserve)
• Batch puts are for inserting work units that share a large proportion of their data
• Types, answer_rank, target_rank can be used to implement some common patterns
  – sending a message
  – decomposing a task into subtasks
  – maybe should be built into the API
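To show how the return codes drive termination, here is a hedged sketch of a worker loop built from the calls and constants above; TYPE_WORK, the task struct, the handle size, the -1 terminator in req_types, and the integer argument types are all invented for illustration.

```c
#include "adlb.h"           /* header name assumed */

#define TYPE_WORK 1

typedef struct { double input; } task_t;            /* hypothetical work unit */

static void worker_loop(void)
{
    int req_types[2] = { TYPE_WORK, -1 };            /* work types we accept */
    int handle[8];                                   /* opaque reservation handle */
    int len, type, prio, answer_dest, rc;
    task_t task;

    for (;;) {
        rc = ADLB_Reserve(req_types, handle, &len, &type, &prio, &answer_dest);
        if (rc == ADLB_NO_MORE_WORK || rc == ADLB_DONE_BY_EXHAUSTION)
            break;                                   /* global termination */
        if (rc != ADLB_SUCCESS)
            break;
        ADLB_Get_Reserved(handle, &task);            /* fetch the reserved unit */

        /* compute on the task; new subtasks can be Put back at any time */
        if (task.input > 1.0) {
            task_t sub = { task.input * 0.5 };
            ADLB_Put(TYPE_WORK, /* priority */ 1, sizeof sub, &sub,
                     /* target_rank */ -1, /* answer_dest */ -1);
        }
    }
    /* a rank that detects completion preemptively would call ADLB_Set_Done() */
}
```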

How It Works (current production version)
[Diagram: application processes issue put/get requests to a pool of dedicated ADLB server processes]

Progress with GFMC

Another Physics Application – Parameter Sweep
• Luminescent solar concentrators
  – stationary, no moving parts
  – operate efficiently under diffuse light conditions (northern climates)
  – inexpensive collector; concentrates light on a high-performance solar cell
• In this case the application authors had not used any parallel programming approach before ADLB (ADLB as a high-level programming model)

Two Other Applications
• The "Batcher" (sketched below)
  – simple but potentially useful
  – input is a file of Unix command lines
  – ADLB worker processes execute each one with the Unix "system" call
• Swift substrate
  – Swift is a high-level workflow description language
  – ADLB is being tested as an execution engine for Swift programs
  – fine granularity needed
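Because the Batcher is so small, a hedged sketch of the idea may be useful. This illustrates the approach described above rather than the distributed code; the type constant, buffer sizes, handle size, and helper names are invented, and the ADLB prototypes follow the assumptions of the earlier sketches.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include "adlb.h"            /* header name assumed */

#define TYPE_CMD  1
#define MAX_CMD   4096

/* One rank reads the command file and Puts each line as a work unit. */
static void batcher_master(const char *filename)
{
    char line[MAX_CMD];
    FILE *f = fopen(filename, "r");
    while (f && fgets(line, sizeof line, f)) {
        line[strcspn(line, "\n")] = '\0';            /* strip the newline */
        ADLB_Put(TYPE_CMD, 1, (int)strlen(line) + 1, line, -1, -1);
    }
    if (f) fclose(f);
}

/* Workers drain the queue, running each command line with system(). */
static void batcher_worker(void)
{
    int req_types[2] = { TYPE_CMD, -1 };
    int handle[8], len, type, prio, answer_dest, rc;
    char cmd[MAX_CMD];

    while ((rc = ADLB_Reserve(req_types, handle, &len, &type, &prio,
                              &answer_dest)) == ADLB_SUCCESS) {
        ADLB_Get_Reserved(handle, cmd);
        system(cmd);                                  /* execute one command line */
    }
}
```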

Alternate Implementations of the Same API
• Single server with one-sided communication among clients
  – motivation: eliminate multiple views of the "shared" queue data structure and the effort required to keep them (almost) coherent; free up more processors for application calculations by eliminating most servers; use the larger client memory to store work packages
  – relies on "passive target" MPI-2 remote memory operations (see the sketch below)
  – the single master proved to be a scalability bottleneck at 32,000 processors (8K nodes on BG/P), not because of processing capability but because of network congestion
  – have not yet experimented with a hybrid version (one-sided, multiple-server)
• Completely symmetric ("no server") threaded version
  – ADLB code runs in a separate thread on each node
[Diagram: ADLB_Put/ADLB_Get implemented on top of MPI_Put/MPI_Get one-sided operations]
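The passive-target mechanism referred to above can be illustrated with a few lines of MPI-2 RMA. This is not the actual single-server implementation; the window layout, slot size, and helper names are invented for the sketch.

```c
/* One client deposits a work descriptor directly into a window exposed by
 * another process, without the target making any matching MPI call. */
#include <mpi.h>

#define SLOT_BYTES 256

static char slot[SLOT_BYTES];            /* memory each client exposes */
static MPI_Win win;

void rma_setup(void)
{
    MPI_Win_create(slot, SLOT_BYTES, 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);
}

void rma_deposit(int target, const void *desc, int len)
{
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);   /* passive-target lock */
    MPI_Put((void *)desc, len, MPI_BYTE, target, 0, len, MPI_BYTE, win);
    MPI_Win_unlock(target, win);                         /* completes the Put */
}
```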

Asynchronous Dynamic Load Balancing
The basic idea:
[Diagram: on each node, application threads put/get work through shared memory to an ADLB library thread, which owns the local work queue and uses MPI to communicate with other nodes]
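A hedged sketch of the structure in the diagram: one service thread per MPI process owns the node's work queue and handles MPI traffic, while application threads interact with it through shared memory. All names and the polling loop are illustrative, not the actual implementation.

```c
#include <mpi.h>
#include <pthread.h>

static volatile int done = 0;

/* The ADLB library thread: services local requests and moves work between nodes. */
static void *adlb_service_thread(void *arg)
{
    (void)arg;
    while (!done) {
        /* 1. satisfy local put/get requests from the shared-memory work queue */
        /* 2. use MPI to push work to, or pull work from, other nodes          */
    }
    return NULL;
}

int main(int argc, char **argv)
{
    int provided;
    pthread_t svc;

    /* both the service thread and application threads make MPI calls */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    pthread_create(&svc, NULL, adlb_service_thread, NULL);

    /* ... application threads run here, putting and getting work units ... */

    done = 1;
    pthread_join(svc, NULL);
    MPI_Finalize();
    return 0;
}
```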

Preliminary Experiments with the Threaded Version
• Two kinds of experiments
  – 0-size and 0-length work units to test minimal overheads
  – random size and length of work units to test load balancing

Getting ADLB
• Web site is
• To download adlb:
  – svn co adlbm
• What you get:
  – source code (multiple versions)
  – configure script and Makefile
  – README, with API documentation
  – examples: Sudoku, Batcher (with its own README), Traveling Salesman Problem
• To run your application
  – configure, make to build the ADLB library
  – compile your application with mpicc, using the Makefile as an example
  – run with mpiexec
• Problems/complaints/kudos to

Conclusions
• The philosophical accomplishment: scalability need not come at the price of complexity
• The practical accomplishment: multiple uses
  – as a high-level library to make simple applications scalable
  – as an execution engine for complicated applications (like GFMC) and for higher-level "many-task" programming models

The End