1
Adaptive MPI: Intelligent runtime strategies and performance prediction via simulation
Laxmikant Kale, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign
2
PPL Mission and Approach
To enhance Performance and Productivity in programming complex parallel applications
- Performance: scalable to thousands of processors
- Productivity: of human programmers
- Complex: irregular structure, dynamic variations
Approach: application-oriented yet CS-centered research
- Develop enabling technology for a wide collection of apps
- Develop, use, and test it in the context of real applications
How?
- Develop novel parallel programming techniques
- Embody them into easy-to-use abstractions, so application scientists can use advanced techniques with ease
- Enabling technology: reused across many apps
3
Develop abstractions in context of full-scale applications
[Diagram: applications (Protein Folding, Quantum Chemistry (QM/MM), Molecular Dynamics, Computational Cosmology, Crack Propagation, Dendritic Growth, Space-time meshes, Rocket Simulation) built on Parallel Objects, Adaptive Runtime System, Libraries and Tools]
The enabling CS technology of parallel objects and intelligent runtime systems has led to several collaborative applications in CSE
4
Migratable Objects (aka Processor Virtualization)
Programmer: [over]decomposition into virtual processors
Runtime: assigns VPs to processors
- Enables adaptive runtime strategies
- Implementations: Charm++, AMPI
Benefits:
- Software engineering: number of virtual processors can be independently controlled; separate VPs for different modules
- Message-driven execution: adaptive overlap of communication; predictability; automatic out-of-core; asynchronous reductions
- Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; change set of processors used; automatic dynamic load balancing; communication optimization
[Diagram: user view vs. system implementation]
5
Outline
- Adaptive MPI
- Load balancing
- Fault tolerance
- Projections: performance analysis
- Performance prediction: BigSim
6
AMPI: MPI with Virtualization
Each virtual MPI process is implemented as a user-level thread embedded in a Charm++ object.
[Diagram: MPI "processes" implemented as virtual processes (user-level migratable threads) mapped onto real processors]
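As a minimal usage sketch (pgm stands for any AMPI binary; the +p and +vp options are the ones shown on the checkpoint/restart slide later in this deck), the number of virtual processors is chosen at launch time, independently of the number of physical processors:

    ./charmrun ./pgm +p4 +vp16

Here 16 MPI ranks run as user-level threads on 4 physical processors, a virtualization ratio of 4.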
7
Making AMPI work
Multiple user-level threads per processor: problems with global variables
- Solution 1 (automatic): switch the GOT pointer at context switch; available on most machines
- Solution 2 (manual): replace global variables
- Solution 3 (automatic): via compiler support (AMPIzer)
Migrating stacks:
- Use the isomalloc technique (Mehaut et al.): memory files and mmap()
Heap data:
- Isomalloc heaps, or user-supplied pack/unpack functions for the heap data
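A minimal sketch of the manual approach (Solution 2), with hypothetical names: each file-scope global is moved into a per-rank structure owned by that MPI "process" (user-level thread), so threads sharing a processor no longer share the variable.

    #include <mpi.h>
    #include <stdlib.h>

    /* Formerly file-scope globals, now one copy per virtual MPI process. */
    typedef struct {
        int    iteration_count;
        double residual;
    } RankGlobals;

    static void step(RankGlobals *g)
    {
        g->iteration_count++;      /* was: iteration_count++; */
        g->residual *= 0.5;        /* was: residual *= 0.5;   */
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        RankGlobals *g = calloc(1, sizeof(RankGlobals));  /* thread-private state */
        g->residual = 1.0;
        for (int i = 0; i < 10; i++)
            step(g);
        free(g);
        MPI_Finalize();
        return 0;
    }

The automatic solutions (GOT-pointer swapping and AMPIzer) achieve the same privatization without changing the source.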
8
ELF and global variables
The Executable and Linking Format (ELF):
- The executable has a Global Offset Table (GOT) containing global data
- The GOT pointer is stored in the %ebx register
- Switch this pointer when switching between threads
- Supported on Linux, Solaris 2.x, and more
- Integrated in Charm++/AMPI; invoked with the compile-time option -swapglobal
9
Adaptive overlap and modules
This is nicely illustrated in a figure borrowed from Attila Gursoy's thesis. Module A needs to avail itself of services from modules B and C. In a message-passing paradigm, these modules cannot execute concurrently, so idle time in one module cannot be filled by computation from another module. This is possible in the message-driven paradigm. We therefore base our component architecture on a message-driven runtime system, called Converse.
SPMD and Message-Driven Modules (from A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr. 1994)
Modularity, Reuse, and Efficiency with Message-Driven Libraries: Proc. of the Seventh SIAM Conference on Parallel Processing for Scientific Computing, San Francisco, 1995
10
Benefit of Adaptive Overlap
Problem setup: 3D stencil calculation of size 240³ run on Lemieux. Shows AMPI with virtualization ratios of 1 and 8.
11
Comparison with Native MPI
Performance:
- Slightly worse without optimization; being improved
Flexibility:
- Useful when only a small number of PEs is available, or when the algorithm has special requirements
Problem setup: 3D stencil calculation of size 240³ run on Lemieux. AMPI runs on any number of PEs (e.g., 19, 33, 105); native MPI needs a cube number.
12
AMPI Extensions
- Automatic load balancing
- Non-blocking collectives
- Checkpoint/restart mechanism
- Multi-module programming
13
Load Balancing in AMPI
Automatic load balancing: MPI_Migrate()
- A collective call informing the load balancer that the thread is ready to be migrated, if needed.
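A minimal sketch of how an application might use this call, assuming MPI_Migrate() takes no arguments as written on this slide (the timestep count and migration interval are illustrative):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        for (int step = 0; step < 1000; step++) {
            /* ... compute and exchange boundary data for one timestep ... */
            if (step % 100 == 0)
                MPI_Migrate();   /* collective: the runtime may move this thread now */
        }
        MPI_Finalize();
        return 0;
    }

Because every rank is a migratable user-level thread, the load balancer can move it at the collective call without any further help from the application.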
14
Load Balancing Steps
[Diagram: timeline of regular timesteps, instrumented timesteps, detailed/aggressive load balancing (object migration), and refinement load balancing]
15
Processor Utilization against Time on 128 and 1024 processors
[Charts: processor utilization over time, showing the aggressive and refinement load balancing steps]
On 128 processors, a single load balancing step suffices, but on 1024 processors we need a "refinement" step.
16
Shrink/Expand
Problem: the availability of the computing platform may change
- Fit the application onto the platform by object migration
[Chart: time per step for the million-row CG solver on a 16-node cluster; an additional 16 nodes become available at step 600]
17
Optimized All-to-all "Surprise"
Completion time vs. computation overhead: 76-byte all-to-all on Lemieux. The CPU is free during most of the time taken by a collective operation. This led to the development of asynchronous collectives, now supported in AMPI.
[Chart: AAPC completion time (ms) vs. message size (100 B to 8 KB) for Radix Sort, comparing Mesh and Direct strategies]
18
Asynchronous Collectives
Our implementation is asynchronous:
- The collective operation is posted
- Test/wait for its completion
- Meanwhile, useful computation can utilize the CPU
MPI_Ialltoall( … , &req);
/* other computation */
MPI_Wait(&req, MPI_STATUS_IGNORE);
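A fuller sketch of the same overlap pattern, assuming the MPI_Ialltoall signature that later became standard in MPI-3 (the AMPI extension described here predates it); buffer sizes are illustrative:

    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int nprocs;
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        int *sendbuf = malloc(nprocs * sizeof(int));
        int *recvbuf = malloc(nprocs * sizeof(int));
        for (int i = 0; i < nprocs; i++) sendbuf[i] = i;

        MPI_Request req;
        /* Post the collective; it progresses while we keep computing. */
        MPI_Ialltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT,
                      MPI_COMM_WORLD, &req);

        /* ... useful computation that does not touch recvbuf ... */

        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* recvbuf is valid from here on */

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }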
19
Fault Tolerance
20
Motivation
Applications need fast, low-cost, and scalable fault tolerance support:
- As machines grow in size, MTBF decreases
- Applications have to tolerate faults
Our research:
- Disk-based checkpoint/restart
- In-memory double checkpointing/restart
- Sender-based message logging
- Proactive response to fault prediction (impending-fault response)
21
Checkpoint/Restart Mechanism
Automatic checkpointing for AMPI and Charm++:
- Migrate objects to disk!
- Automatic fault detection and restart
- Now available in the distribution version of AMPI and Charm++
Blocking coordinated checkpoint:
- States of chares are checkpointed to disk
- Collective call MPI_Checkpoint(DIRNAME)
- The entire job is restarted; virtualization allows restarting on a different number of processors
- Runtime option: ./charmrun pgm +p4 +vp16 +restart DIRNAME
Simple but effective for common cases
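A minimal sketch of periodic disk checkpointing using the collective named above (the directory name "ckpt" and the checkpoint interval are illustrative choices, not part of the API):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        for (int step = 0; step < 1000; step++) {
            /* ... one timestep of computation ... */
            if (step % 100 == 0)
                MPI_Checkpoint("ckpt");   /* collective: all threads checkpoint to disk */
        }
        MPI_Finalize();
        return 0;
    }

After a crash, the same binary is relaunched with the +restart option shown above, possibly on a different number of physical processors.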
22
In-memory Double Checkpoint
In-memory checkpoint:
- Faster than disk
Coordinated checkpoint:
- Simple collective: MPI_MemCheckpoint(void)
- The user can decide what makes up useful state
Double checkpointing:
- Each object maintains 2 checkpoints: on the local physical processor and on a remote "buddy" processor
For jobs with large memory, use local disks! (32 processors with 1.5 GB memory each)
23
Restart
A "dummy" process is created:
- Need not have application data or a checkpoint
- Necessary for the runtime
- Starts recovery on all other processors
Other processors:
- Remove all chares
- Restore checkpoints lost on the crashed PE
- Restore chares from local checkpoints
Load balance after restart
24
Restart Performance
- 10 crashes, 128 processors
- Checkpoint every 10 time steps
25
Scalable Fault Tolerance
Motivation: when one processor out of 100,000 fails, the other 99,999 shouldn't have to roll back to their checkpoints!
How?
- Sender-side message logging
- Asynchronous checkpoints on buddy processors; latency tolerance mitigates the costs
- Restart can be sped up by spreading out the failed processor's objects
- Only the failed processor's objects recover from checkpoints, playing back their messages, while the others "continue"
Current progress:
- Basic scheme implemented and tested in simple programs
- General-purpose implementation in progress
26
Recovery Performance
Execution time with increasing number of faults on 8 processors (checkpoint period 30 s)
27
Projections: Performance visualization and analysis tool
28
An Introduction to Projections
- Performance analysis tool for Charm++-based applications
- Automatic trace instrumentation
- Post-mortem visualization
- Multiple advanced features that support:
  - Data volume control
  - Generation of additional user data
29
Trace Generation
Automatic instrumentation by the runtime system.
Detailed (log) mode:
- Each event is recorded in full detail (including timestamp) in an internal buffer
Summary mode:
- Reduces the size of output files and memory overhead
- Produces a few lines of output data per processor
- Data is recorded in bins corresponding to intervals (1 ms by default)
Flexible APIs and runtime options for instrumenting user events and controlling data generation.
The Charm++ runtime system can automatically instrument the code for performance analysis because it gets control before and after the execution of every asynchronously invoked method, as well as when a message (method invocation) is sent out. The runtime can record detailed traces of these events. Alternatively, in "summary" mode, it records a short summary file for each processor, with data for each method and each time interval; it maintains bins for a fixed number of intervals and, if the program runs longer, shrinks the number of intervals by doubling the interval period.
30
The Summary View
- Provides a view of the overall utilization of the application
- Very quick to load
31
Graph View
Features:
- Selectively view entry points
- Convenient means to switch between axis data types
32
Timeline
- The most detailed view in Projections
- Useful for understanding critical-path issues or unusual entry-point behaviors at specific times
33
Animations
37
Time Profile
- Identified a portion of CPAIMD (a quantum chemistry code) that ran too early, via the Time Profile tool
- Solved by prioritizing entry methods
40
Overview: one line for each processor, time on the X-axis
- White: busy; black: idle; red: intermediate
41
A boring but good-performance overview
42
An interesting but pathetic overview
43
Stretch Removal
Histogram views: number of function executions vs. their granularity (note: log scale on the Y-axis)
- Before optimizations: over 16 large stretched calls
- After optimizations: about 5 large stretched calls, the largest of them much smaller, and almost all calls take less than 3.2 ms
We used a variety of techniques to eliminate or reduce OS interference: a low-level Elan communication library instead of the default MPI implementation of Charm++, letting the OS run its daemons via sleep calls when idle, and so on. The eventual benefit is recorded in the histogram plot of Projections.
44
Miscellaneous Features - Color Selection
Colors are automatically supplied by default. We allow users to select their own colors and save them. These colors can then be restored the next time Projections loads.
45
User APIs
Controlling trace generation:
- void traceBegin()
- void traceEnd()
Tracing user events:
- int traceRegisterUserEvent(char *, int)
- void traceUserEvent(char *)
- void traceUserBracketEvent(int, double, double)
- double CmiWallTimer()
Runtime options:
- +traceoff
- +traceroot <directory>
- Projections mode only: +logsize <# entries>, +gz-trace
- Summary mode only: +bincount <# of intervals>, +binsize <interval time quanta (us)>
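A minimal sketch of bracketing a user-defined phase with these calls; the declarations are copied from the list above (real programs get them from the Charm++/AMPI headers), and the event name, the -1 identifier argument, and do_solve_phase() are hypothetical:

    /* Declarations as listed above; normally provided by Charm++/AMPI headers. */
    extern int traceRegisterUserEvent(char *, int);
    extern void traceUserBracketEvent(int, double, double);
    extern double CmiWallTimer();

    void do_solve_phase(void);   /* hypothetical application routine */

    void traced_step(void)
    {
        static int solveEvent = -1;
        if (solveEvent < 0)
            solveEvent = traceRegisterUserEvent("solve phase", -1);  /* register once; id argument is an assumption */

        double t0 = CmiWallTimer();
        do_solve_phase();
        traceUserBracketEvent(solveEvent, t0, CmiWallTimer());  /* appears as a user event in Projections */
    }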
46
Performance Prediction via Parallel Simulation
47
BigSim: Performance Prediction
Extremely large parallel machines are already here or about to be available:
- ASCI Purple (12k processors, 100 TF)
- BlueGene/L (64k processors, 360 TF)
- BlueGene/C (1M processors, 1 PF)
How do we write a petascale application?
- What will the performance be like?
- Would existing parallel applications scale?
But:
- The machines are not there yet
- Parallel performance is hard to model without actually running the program
48
Objectives and Simulation Model
- Develop techniques to facilitate the development of efficient petascale applications
- Based on performance prediction of applications on large simulated parallel machines
Simulation-based performance prediction:
- Focus on the Charm++ and AMPI programming models
- Performance prediction based on PDES (parallel discrete event simulation)
- Supports varying levels of fidelity: processor prediction, network prediction
- Modes of execution: online and post-mortem
49
Blue Gene Emulator/Simulator
Actually BigSim: for simulation of any large machine using smaller parallel machines.
Emulator:
- Allows development of the programming environment and algorithms before the machine is built
- Allowed us to port Charm++ to the real BG/L in 1-2 days
Simulator:
- Allows performance analysis of programs running on large machines, without access to those machines
- Uses parallel discrete event simulation
50
Architecture of BigNetSim
51
Simulation Details
- Emulate large parallel machines on smaller existing parallel machines: run a program with multi-million-way parallelism (implemented using user-level threads)
- Consider memory and stack-size limitations
- Ensure time-stamp correction
- The emulator layer API is built on top of the machine layer
- Charm++/AMPI is implemented on top of the emulator, like any other machine layer
- The emulator layer supports all Charm++ features: load balancing, communication optimizations
52
Performance Prediction
Usefulness of performance prediction:
- For the application developer (making small modifications):
  - It is difficult to get runtimes on huge current machines; for future machines, simulation is the only possibility
  - The performance debugging cycle can be considerably reduced
  - Even approximate predictions can identify performance issues such as load imbalance, serial bottlenecks, communication bottlenecks, etc.
- For the machine architecture designer:
  - Knowledge of how target applications behave can help identify problems with the machine design early
Approach: record traces during parallel emulation, then run trace-driven simulation (PDES)
53
Performance Prediction (contd.)
Predicting the time of sequential code:
- User-supplied time for every code block
- Wall-clock measurements on the simulating machine, scaled by a suitable multiplier
- Hardware performance counters for floating-point, integer, branch instructions, etc.; cache performance and memory footprint approximated by the percentage of memory accesses and the cache hit/miss ratio
- Instruction-level simulation (not implemented)
Predicting network performance:
- No contention: time based on topology and other network parameters
- Back-patching: modifies communication time using the amount of communication activity
- Network simulation: modeling the network entirely
54
Performance Prediction Validation
- 7-point stencil program with 3D decomposition: run on 32 real processors, simulating 64, 128, ... PEs
- NAMD benchmark: Apo-Lipoprotein A1 dataset with 92k atoms, running for 15 timesteps
- For large processor counts, because of cache and memory effects, the predicted value diverges from the actual value
55
Performance on Large Machines
Problem:
- How to predict the performance of applications on future machines (e.g., BG/L)?
- How to do performance tuning without continuous access to a large machine?
Solution:
- Leverage virtualization: develop a machine emulator
- Simulator: accurate time modeling
- Run the program on "100,000 processors" using only hundreds of processors
Analysis: use the performance visualization suite (Projections)
Molecular dynamics benchmark ER-GRE: 1.6 million objects, 8-step simulation, 16k BG processors
56
Projections: Performance visualization
57
Network Simulation
Detailed implementation of interconnection networks.
Configurable network parameters:
- Topology / routing
- Input / output VC selection
- Bandwidth / latency
- NIC parameters: buffer / message size, etc.
Support for hardware collectives in the network layer
58
Higher-level programming
Orchestration language:
- Allows expressing global control flow in a Charm++ program
- HPF-like flavor, but with Charm++-like processor virtualization and explicit communication
Multiphase Shared Arrays:
- Provide a disciplined use of shared address space
- Each array can be accessed only in one of the following modes: ReadOnly, Write-by-One-Thread, Accumulate-only
- The access mode can change from phase to phase; phases are delineated by a per-array "sync"
59
Other projects
- Faucets: flexible cluster scheduler, resource management across clusters, multi-cluster applications
- Load balancing strategies
- Communication optimization
- POSE: parallel discrete event simulation
- ParFUM: parallel framework for unstructured meshes
We invite collaborations on:
- Virtualization of other languages and libraries
- New load balancing strategies
- Applications
60
Some Active Collaborations
- Biophysics: molecular dynamics (NIH, ..): long-standing (1991-), with Klaus Schulten and Bob Skeel; Gordon Bell award in 2002; production program used by biophysicists
- Quantum chemistry (NSF): QM/MM via the Car-Parrinello method, with Roberto Car, Mike Klein, Glenn Martyna, Mark Tuckerman, Nick Nystrom, Josep Torrellas, Laxmikant Kale
- Material simulation (NSF): dendritic growth, quenching, space-time meshes, QM/FEM, with R. Haber, D. Johnson, J. Dantzig, and others
- Rocket simulation (DOE): DOE-funded ASCI center, Mike Heath and 30+ faculty
- Computational cosmology (NSF, NASA): simulation, scalable visualization
- Others: simulation of plasma; electromagnetics
61
Summary
We are pursuing a broad agenda aimed at productivity and performance in parallel programming:
- Intelligent runtime system for adaptive strategies
- Charm++/AMPI are production-level systems
- Support for dynamic load balancing and communication optimizations
- Performance prediction capabilities based on simulation
- Basic fault tolerance and performance visualization tools are part of the suite
- Application-oriented yet computer-science-centered research
Workshop on Charm++ and Applications: October, UIUC