Dynamic Load Balancing for VORPAL
Viktor Przebinda, Center for Integrated Plasma Studies

What is VORPAL? A parallel simulation package that models laser-plasma interactions using particle-in-cell (PIC) or fluid models. Built on an object-oriented framework. Over 1,300 pages (68,000 lines) of C++ code.

Goal: Implement Dynamic Load Balancing for VORPAL. Dynamic: automatically adjusts the decomposition appropriately at runtime. Fully compatible with all simulations. Efficient: minimizes overhead. User friendly: requires no special configuration.

Grids define how regions are to be updated. (Figure: a physical region surrounded by an extended region one or two cells wide.)

Extended regions simplify parallel synchronization. Extended regions contain data from the physical regions of other CPUs. (Figure: CPU 0 and CPU 1 each hold a physical region plus an extended region overlapping the neighbor's physical region.)
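
The relationship between physical and extended regions can be sketched in one dimension. This is an illustrative model, not VORPAL's API: the struct and function names are ours, and the ghost width is the one-or-two-cell value quoted above.

```cpp
#include <algorithm>

// Hypothetical 1-D region along one decomposition direction.
// "Physical" cells are owned by this CPU; the extended region pads them
// by `ghost` cells on each side so neighbor data can be mirrored locally.
struct Region1D {
    int lo, hi;   // physical cell range [lo, hi)
    int ghost;    // ghost-layer width (one or two cells)
    int extLo() const { return lo - ghost; }
    int extHi() const { return hi + ghost; }
};

// Cells of a's extended region that lie inside b's physical region:
// exactly the cells CPU a must receive from CPU b during synchronization.
inline int overlapWidth(const Region1D& a, const Region1D& b) {
    return std::max(0, std::min(a.extHi(), b.hi) - std::max(a.extLo(), b.lo));
}
```

With two adjacent regions of ghost width 2, each CPU's extended region overlaps two cells of its neighbor's physical region.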

The allocated region provides a simple over-allocation scheme. (Figure: an allocated region enclosing the extended and physical regions.)

Working within the current framework introduces restrictions. Boundaries can be adjusted by at most one cell in each direction. The decomposition description is restricted to prevent irregular domains. (Figure: a normal rectangular decomposition over directions 0 and 1 versus an impossible irregular one.)

Two strategies exist to accommodate decomposition adjustment: over-allocate all fields to accommodate future adjustments, or resize and copy all field data as needed for each adjustment.

Over-allocation. Advantage: minimal overhead. Disadvantage: decreases cache hits, resulting in lower efficiency. (Figure: a region in use with overflow space on either side.)

Over-allocation introduces minimal overhead. (Figure: a region in use followed by overflow space.)

Over-allocation in the direction of maximum stride is optimal. (Figure: overflow space appended past the region in use.)
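
Why the maximum-stride direction? Growing that direction only appends memory past the end of the buffer, so existing elements keep their offsets and nothing is copied. A minimal sketch, assuming row-major contiguous storage (the function names are illustrative, not VORPAL's):

```cpp
#include <cstddef>
#include <vector>

// Row-major strides for an N-dimensional field: the last direction varies
// fastest, so direction 0 carries the largest stride.
inline std::vector<std::size_t> strides(const std::vector<std::size_t>& shape) {
    std::vector<std::size_t> s(shape.size(), 1);
    for (int d = static_cast<int>(shape.size()) - 2; d >= 0; --d)
        s[d] = s[d + 1] * shape[d + 1];
    return s;
}

// Direction whose stride is largest. Padding this direction leaves the
// strides of every other direction, and hence every element offset, intact.
inline std::size_t maxStrideDir(const std::vector<std::size_t>& shape) {
    auto s = strides(shape);
    std::size_t best = 0;
    for (std::size_t d = 1; d < s.size(); ++d)
        if (s[d] > s[best]) best = d;
    return best;
}
```

For a 4x8x16 field the strides are 128, 16, 1, so overflow cells are appended along direction 0.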

Resizing. A new memory block is requested and values are copied over. Advantage: does not affect efficiency. Disadvantage: large overhead.

Optimal performance is achieved by combining both methods: when a new memory block is requested and values are copied over, the new block is over-allocated to prevent future resizes.

Decomposition adjustment occurs at the end of each update. CPU time is measured each iteration. (Figure: timeline alternating update and load-balancing phases.)

DLB is achieved in eight steps.
1. All nodes send processing times to node zero.
2. Node zero decides whether to perform an adjustment.
3. Node zero constructs an adjustment description and sends it to all other nodes.
4. All nodes apply the adjustment and reconfigure the grid.
5. All fields resize as necessary.
6. Field iterators are rebuilt.
7. All messengers are rebuilt.
8. Fields and particles are synchronized.
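
The eight steps above can be sketched as a driver skeleton. These are stubs recording the call order only; the real VORPAL versions perform MPI communication and grid surgery, and all names here are ours.

```cpp
#include <string>
#include <vector>

// Skeleton of one DLB pass: each entry stands in for the corresponding
// numbered step above, executed in order on every node.
std::vector<std::string> runLoadBalancePass() {
    std::vector<std::string> log;
    log.push_back("gather times");         // 1. all nodes -> node zero
    log.push_back("decide");               // 2. node zero: adjust or not?
    log.push_back("broadcast adjustment"); // 3. node zero -> all nodes
    log.push_back("apply adjustment");     // 4. reconfigure grids
    log.push_back("resize fields");        // 5. grow allocations if needed
    log.push_back("rebuild iterators");    // 6. pointers may have moved
    log.push_back("rebuild messengers");   // 7. intersections changed
    log.push_back("synchronize");          // 8. fields and particles
    return log;
}
```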

1. Processing times are collected to aid in adjustment decisions. Each node measures the virtual time it took to perform its last update; virtual time excludes time spent blocked on I/O. This amount is sent to node zero, and the process waits for a reply from node zero.

2. The adjustment decision is made based on idle time. Given the cost of performing a load balance, VORPAL does so only if some node was idle for more than 10% of the last time step.
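
Node zero's decision rule can be written as a small predicate. A sketch under the assumptions that a node's idle time is the gap between its virtual update time and the slowest node's, and that the function name and signature are ours:

```cpp
#include <algorithm>
#include <vector>

// Rebalance only when some node's idle fraction during the last step
// exceeded the threshold (10% per the rule above). The fastest node idles
// for (tMax - tMin) out of each tMax-long step.
inline bool shouldRebalance(const std::vector<double>& virtualTimes,
                            double idleThreshold = 0.10) {
    if (virtualTimes.empty()) return false;
    double tMax = *std::max_element(virtualTimes.begin(), virtualTimes.end());
    if (tMax <= 0.0) return false;
    double tMin = *std::min_element(virtualTimes.begin(), virtualTimes.end());
    return (tMax - tMin) / tMax > idleThreshold;
}
```

A node 20% faster than the slowest triggers a rebalance; a 5% gap does not.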

3. An adjustment is computed to eliminate bottlenecks. Using a greedy algorithm, node zero constructs adjustment information for all processors and sends it to all nodes.

Boundaries are shrunk around processors with high load. (Figure: a 2-D decomposition over directions 0 and 1; CPU0 through CPU3 shaded from lowest to highest load, with boundaries pulled in around the most loaded CPU.)
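
Along a single direction, the greedy adjustment can be sketched as a one-cell boundary shift toward the slower neighbor, honoring the one-cell-per-step restriction noted earlier. This is our illustrative reconstruction, not VORPAL's actual algorithm:

```cpp
#include <vector>

// boundaries[i] is the first cell owned by CPU i+1, so there is one fewer
// boundary than there are times. Each boundary moves at most one cell per
// step, toward the more heavily loaded neighbor (shrinking its region).
inline std::vector<int> adjustBoundaries(std::vector<int> boundaries,
                                         const std::vector<double>& times) {
    for (std::size_t b = 0; b < boundaries.size(); ++b) {
        if (times[b] > times[b + 1])      // left neighbor slower: shrink it
            --boundaries[b];
        else if (times[b] < times[b + 1]) // right neighbor slower: shrink it
            ++boundaries[b];
    }
    return boundaries;
}
```

If the left CPU took 2.0 s and the right 1.0 s, the boundary between them moves one cell left, handing work to the faster CPU.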

4. The decomposition object is modified on each processor. Each node applies the adjustment, and the local grid is adjusted to match the new size. The allocated region is modified if it cannot support the new size.

5. Fields resize if the allocated region has changed. All fields check whether the allocated region has grown. If so, the field allocates additional memory to accommodate 25 more cells in the direction of growth.
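
The growth policy amounts to allocating headroom beyond the requested size so subsequent one-cell boundary shifts need no further reallocation. A minimal sketch; the constant 25 is the value quoted above, while the function name is ours:

```cpp
#include <cstddef>

// New capacity (in cells, along the growth direction) for a field that
// must hold `needed` cells: grow with 25 cells of headroom, or keep the
// current capacity if it already suffices.
inline std::size_t grownCapacity(std::size_t current, std::size_t needed,
                                 std::size_t headroom = 25) {
    return needed > current ? needed + headroom : current;
}
```

This mirrors the combined strategy from the earlier slide: a resize is expensive, so each one buys room for many future adjustments.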

6. Outdated information in field iterators is rebuilt. Pointers to specific memory locations may have changed if a resize was performed, and the physical and extended regions may have changed size.

7. Outdated message-passing objects are rebuilt. Intersecting regions may have changed, so they must be reconstructed.

8. New boundaries must be synchronized with neighbors. Field data on physical boundaries is sent to neighboring processors, and extended regions are filled from neighbors. Particles that may have crossed outside the boundary of the current node are sent to neighboring nodes. Unfortunately, since there is nothing else to do while synchronization takes place, an enormous overhead is incurred at this step.

When to use load balancing: when running PIC simulations; when the plasma concentration is expected to change; when the decomposition is along direction zero; when a large number of time steps are used.

Case study: DLB can beat the best static decomposition by 23%. (Figure: particles loaded into the right region, with the CPU0/CPU1 boundary at the midpoint.)

A sliding window moves particles left to CPU0. (Figure: particles shifted into the left region, now on CPU0.)

A standard run shows large differences in CPU use across nodes.

Load balancing keeps the busy time on each node nearly equal.

Conclusion: load balancing performs the desired functions, but the overhead involved in message passing is quite significant, somewhat limiting its usefulness.