Runtime Optimizations via Processor Virtualization


1 Runtime Optimizations via Processor Virtualization
Laxmikant Kale
Parallel Programming Laboratory
Dept. of Computer Science
University of Illinois at Urbana-Champaign

2 Outline
- Where we stand: predictable behavior is very important for performance
- What is virtualization: Charm++, AMPI
- Consequences:
  - Software engineering: a natural unit of parallelism; data vs. code abstraction; cohesion/coupling: MPI vs. objects
  - Message-driven execution: adaptive overlap; handling jitter; out-of-core execution; caches / controllable SRAM
  - Flexible and dynamic mapping: vacating processors; accommodating speed variation; shrink/expand
  - Principle of persistence: measurement-based load balancing; communication optimizations; learning algorithms
- New uses of compiler analysis: apparently simple; threads and global variables; minimizing state at migration; border fusion

3 Technical Approach
Seek an optimal division of labor between the "system" and the programmer: decomposition is done by the programmer, everything else is automated. Develop a standard library of reusable parallel components.
[Figure: a specialization/automation spectrum over decomposition, mapping, scheduling, and expression, with HPF, Charm++, and MPI positioned along it.]

4 Object-based Decomposition
Idea: divide the computation into a large number of pieces
- Independent of the number of processors
- Typically larger than the number of processors
Let the system map objects to processors.
This is "virtualization++": language and runtime support for virtualization, and exploitation of virtualization to the hilt.

5 Object-based Parallelization
The user is only concerned with the interaction between objects. [Figure: user's view of interacting objects vs. the system's implementation mapping them onto processors.]

6 Realizations: Charm++ and AMPI
Charm++: parallel C++ with data-driven objects (chares)
- Object arrays / object collections
- Object groups: a global object with a "representative" on each PE
- Asynchronous method invocation
- Prioritized scheduling
- Information-sharing abstractions: read-only data, tables, ...
- Mature, robust, portable
AMPI: Adaptive MPI, with many MPI threads on each processor

7 Chare Arrays
Elements are data-driven objects, indexed by a user-defined data type: [sparse] 1D, 2D, 3D, tree, ...
Send messages to an index; receive messages at the element.
Reductions and broadcasts across the array.
Dynamic insertion, deletion, and migration -- and everything still has to work!
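As a concrete illustration, here is a minimal sketch of a 1D chare array; the module and entry names (chunk, start, recvBoundary) are hypothetical, not from the talk. The .ci interface file is translated by the Charm++ tools into the CBase_Chunk class used in the C++ code.

```cpp
// chunk.ci -- interface description (a sketch, not code from the talk)
//   module chunk {
//     array [1D] Chunk {
//       entry Chunk(int numChunks);
//       entry void start();                              // broadcast target
//       entry void recvBoundary(int n, double data[n]);  // point-to-point target
//     };
//   };

// chunk.C -- each element is a data-driven object addressed by its index
#include <vector>
#include "chunk.decl.h"

class Chunk : public CBase_Chunk {
  int numChunks;                       // total elements, independent of #processors
  std::vector<double> boundary;
 public:
  Chunk(int n) : numChunks(n), boundary(16, 0.0) {}

  void start() {
    // Send to an index, not to a processor; the runtime locates the element.
    int right = (thisIndex + 1) % numChunks;
    thisProxy[right].recvBoundary((int)boundary.size(), boundary.data());
  }

  void recvBoundary(int n, double* data) {
    // Runs whenever a message addressed to this element arrives,
    // wherever the runtime currently keeps the element.
  }
};
#include "chunk.def.h"
```

The array would typically be created from a main chare with something like `CProxy_Chunk chunks = CProxy_Chunk::ckNew(1024, 1024); chunks.start();`, creating far more elements than processors and letting the runtime place (and later re-place) them.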

8 Object Arrays
A collection of data-driven objects (chares), with a single global name for the collection, and each member addressed by an index. The mapping of element objects to processors is handled by the system. [Figure: user's view -- A[0] A[1] A[2] A[3] A[..]]

9 Object Arrays
The same collection, seen from both sides: the user addresses A[0] A[1] A[2] A[3] A[..] by index, while in the system view the elements (e.g. A[0], A[3]) are distributed across processors by the runtime. [Figure: user's view above, system view below.]

11 Comparison with MPI
Advantage Charm++: modules/abstractions are centered on application data structures, not processors; several others.
Advantage MPI: highly popular, widely available, an industry standard; an "anthropomorphic" view of the processor that many developers find intuitive; but mostly, there is no hope of weaning people away from MPI.
There is no need to choose between them!

12 Adaptive MPI
A migration path for legacy MPI codes that gives them the dynamic load-balancing capabilities of Charm++: AMPI = MPI + dynamic load balancing. Uses Charm++ object arrays and migratable threads. Minimal modifications are needed to convert existing MPI programs (automated via AMPizer). Bindings for C, C++, and Fortran 90.
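For illustration, a sketch of what an AMPI-ready program can look like: it is ordinary MPI code (a hypothetical example, not one of the Roc* codes); the main porting concern is that mutable global/static variables must be privatized, since several AMPI ranks share one process.

```cpp
// jacobi_ampi.cpp -- plain MPI; under AMPI each rank below is a user-level,
// migratable thread, and MPI_Comm_size reports virtual processors, not cores.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, nranks;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nranks);

  std::vector<double> local(1000, rank);   // no mutable globals: keep state local
  double sum = 0.0, total = 0.0;
  for (double v : local) sum += v;
  MPI_Allreduce(&sum, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  // AMPI additionally provides a call (AMPI_Migrate in current releases -- an
  // assumption about the exact name) marking a safe point at which the runtime
  // may migrate this rank for load balance.

  MPI_Finalize();
  return 0;
}
```

Such a program would be built with AMPI's compiler wrapper and launched with more virtual processors than physical ones, e.g. `ampicxx jacobi_ampi.cpp -o jacobi` followed by `./charmrun +p4 ./jacobi +vp64` (wrapper and flag names as in the AMPI documentation; treat them as assumptions here).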

13 AMPI
MPI "processes" are implemented as virtual processors (user-level, migratable threads), many of which are mapped onto each real processor. [Figure: MPI "processes" as virtual processors layered over the real processors.]

14 II: Consequences of virtualization
- Better software engineering
- Message-driven execution
- Flexible and dynamic mapping to processors

15 Modularization
"Number of processors" is decoupled from the logical units of the program (e.g. oct-tree nodes for particle data), so there is no artificial restriction on the number of processors, such as requiring a cube or a power of 2. Modularity, in software-engineering terms, is about cohesion and coupling: MPI's "are they on the same processor" is a bad coupling principle. Objects liberate you from that, e.g. the solid and fluid modules in a rocket simulation.

16 Rocket Simulation
A large collaboration headed by Mike Heath, in a DOE-supported center. Challenge: a multi-component code, with modules from independent researchers, with MPI as the common base. AMPI is new wine in an old bottle: easier to convert to, and the original codes can still run on plain MPI, unchanged.

17 Rocket simulation via virtual processors
[Figure: Rocflo, Rocface, and Rocsolid modules mapped onto virtual processors.]

18 AMPI and Roc*: Communication
[Figure: communication among the Rocflo, Rocface, and Rocsolid virtual processors.]

19 Message Driven Execution
[Figure: each processor runs a scheduler that serves object method invocations from its message queue.]

20 Adaptive Overlap via Data-driven Objects
Problem: processors wait too long at "receive" statements. The routine communication optimizations in MPI: move sends up and receives down, and sometimes use irecvs -- but be careful. With data-driven objects: adaptive overlap of computation and communication; no object or thread holds up the processor; no need to guess which message is likely to arrive first.
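A small sketch (hypothetical names, in the same Charm++ style as earlier) of why nothing holds up the processor: whichever input arrives first simply runs its entry method, and the real work fires only once its data dependences are met.

```cpp
// overlap.ci (sketch)
//   module overlap {
//     array [1D] Consumer {
//       entry Consumer();
//       entry void fromModuleA(double x);
//       entry void fromModuleB(double y);
//     };
//   };

// overlap.C
#include "overlap.decl.h"

class Consumer : public CBase_Consumer {
  bool haveA = false, haveB = false;
  double a = 0.0, b = 0.0;

  void maybeCompute() {
    // Until both inputs are here, this object simply returns, and the scheduler
    // runs whatever other object has a ready message -- adaptive overlap.
    if (haveA && haveB) { /* ... compute with a and b ... */ }
  }

 public:
  Consumer() {}
  void fromModuleA(double x) { a = x; haveA = true; maybeCompute(); }  // either may
  void fromModuleB(double y) { b = y; haveB = true; maybeCompute(); }  // arrive first
};
#include "overlap.def.h"
```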

21 Adaptive overlap and modules
This is nicely illustrated in a figure borrowed from Attila Gursoy's thesis: module A needs services from modules B and C. In a message-passing paradigm these modules cannot execute concurrently, so idle time in one module cannot be filled by computation from another; in a message-driven paradigm it can. We therefore base our component architecture on a message-driven runtime system, called Converse. [Figure: SPMD and message-driven modules, from A. Gursoy, "Simplified expression of message-driven programs and quantification of their impact on performance," Ph.D. thesis, Apr 1994.]

22 Modularity and Adaptive Overlap
"Parallel Composition Principle: For effective composition of parallel components, a compositional programming language should allow concurrent interleaving of component execution, with the order of execution constrained only by availability of data." (Ian Foster, "Compositional parallel programming languages," ACM Transactions on Programming Languages and Systems, 1996.) The first substantiation of our approach comes from Foster's parallel composition principle: a programming language should allow individual components to interleave their execution concurrently, with the order of execution determined implicitly by data availability.

23 Handling OS Jitter via MDE
MDE (message-driven execution) encourages asynchrony -- asynchronous reductions, for example: only data dependences should force synchronization. One benefit: consider an algorithm with N steps, where each step has a different load balance and there is only loose dependence between steps (on neighbors, for example). The completion time becomes a sum-of-max (MPI) versus a max-of-sums (MDE), as illustrated below. OS jitter, which causes random processors to add delays in each step, is handled automatically by MDE.
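A toy illustration of the sum-of-max vs. max-of-sums point, with invented numbers: two processors, two loosely dependent steps, per-step loads of (1, 2) on P1 and (2, 1) on P2, in arbitrary time units.

```latex
% Barrier after every step (MPI-style): each step costs the slowest processor's time.
T_{\text{lock-step}} = \sum_{k} \max_{p} t_{p,k} = \max(1,2) + \max(2,1) = 4

% Only data dependences enforced (message-driven): a delay in one step can be
% absorbed in the next, so the cost approaches the most loaded processor's total.
T_{\text{MDE}} \approx \max_{p} \sum_{k} t_{p,k} = \max(1+2,\; 2+1) = 3
```

Random OS-jitter delays act like extra per-step load on random processors: under lock-step execution they accumulate across steps, while MDE lets them be absorbed wherever there is slack.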

24 Virtualization/MDE leads to predictability
Ability to predict which data is going to be needed and which code will execute, based on the ready queue of object method invocations.

25 Virtualization/MDE leads to predictability
Ability to predict which data is going to be needed and which code will execute, based on the ready queue of object method invocations. So we can prefetch data accurately, and prefetch code if needed: out-of-core execution, caches vs. controllable SRAM. [Figure: scheduler and message queue.]

26 Flexible Dynamic Mapping to Processors
The system can migrate objects between processors (see the sketch below):
- Vacate a workstation used by a parallel program, dealing with extraneous loads on shared workstations
- Shrink and expand the set of processors used by an app: adaptive job scheduling, better system utilization
- Adapt to speed differences between processors, e.g. a cluster with 500 MHz and 1 GHz processors
- Automatic checkpointing: checkpointing = migrate to disk! Restart on a different number of processors
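What makes all of this possible is that each object can serialize itself. A sketch of the Charm++ PUP idiom (the same hypothetical Chunk element as above; details such as base-class pup calls are omitted): one pack/unpack routine serves migration, shrink/expand, and checkpointing alike.

```cpp
#include <vector>
#include "chunk.decl.h"   // hypothetical module, as in the earlier sketch
#include "pup_stl.h"      // PUP support for STL containers

class Chunk : public CBase_Chunk {
  int numChunks;
  std::vector<double> boundary;
 public:
  Chunk(int n) : numChunks(n), boundary(16, 0.0) {}
  Chunk(CkMigrateMessage*) {}     // constructor used on the destination side

  void pup(PUP::er& p) {
    // Called with a packing PUP::er on the source processor, an unpacking one
    // on the destination -- or a disk-backed one when checkpointing.
    p | numChunks;
    p | boundary;
  }
};
#include "chunk.def.h"
```

With this in place, vacating a workstation, shrinking or expanding the processor set, and "checkpointing = migrate to disk" all reuse the same routine; restarting on a different number of processors works because elements, not processors, are what get saved.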

27 Load Balancing with AMPI/Charm++
The Turing cluster has processors with different speeds:

Phase          16P3     16P2     8P3,8P2 w/o LB   8P3,8P2 w. LB
Fluid update   75.24    97.50    96.73            86.89
Solid update   41.86    52.50    52.20            46.83
Pre-Cor Iter   117.16   150.08   149.01           133.76
Time Step      235.19   301.56   299.85           267.75

28 Principle of Persistence
A parallel analog of the principle of locality: a heuristic that holds for most CSE applications. Once the application is expressed in terms of interacting objects, object communication patterns and computational loads tend to persist over time, in spite of dynamic behavior -- whether abrupt but infrequent changes, or slow and small changes.

29 Utility of the principle of persistence
Learning / adaptive algorithms. Adaptive communication libraries: for each-to-all individualized sends, performance depends on many runtime characteristics, so the library switches between different algorithms (see the sketch below). Measurement-based load balancing.
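This is not Charm++'s actual communication-library interface -- just a sketch of the idea, with invented thresholds: because runtime characteristics persist, the library can measure once and then keep using whichever each-to-all algorithm those measurements favor.

```cpp
#include <cstddef>

// Candidate algorithms for an each-to-all individualized send (illustrative only).
enum class AllToAllAlgo { Direct, Mesh2D, Hypercube };

// Pick an algorithm from persistent, measured characteristics of the operation.
AllToAllAlgo chooseAllToAll(int numProcs, std::size_t bytesPerMessage) {
  if (bytesPerMessage < 1024)                    // small messages: per-message
    return numProcs > 64 ? AllToAllAlgo::Mesh2D  // overhead dominates, so combine
                         : AllToAllAlgo::Hypercube;
  return AllToAllAlgo::Direct;                   // large messages: bandwidth dominates
}
```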

30 Dynamic Load Balancing For CSE
Many CSE applications exhibit dynamic behavior: adaptive mesh refinement, atom migration, particle migration in cosmology codes. Objects allow the RTS to remap them to balance load: just move objects away from overloaded processors to underloaded processors. Just??

31 Measurement Based Load Balancing
Based on the principle of persistence. Runtime instrumentation measures communication volume and computation time. Measurement-based load balancers use the instrumented database periodically to make new decisions (a sketch of the element-side hooks is below). Many alternative strategies can use the database: centralized vs. distributed, greedy improvements vs. complete reassignments, taking communication into account, taking dependences into account (more complex).
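From an element's point of view the hooks are small. A sketch, reusing the hypothetical Chunk element from earlier (doStep assumed declared as an entry in its .ci file):

```cpp
class Chunk : public CBase_Chunk {
  int iter = 0;
 public:
  Chunk(int n) { usesAtSync = true; }   // ask the runtime to instrument this element
  Chunk(CkMigrateMessage*) {}
  void pup(PUP::er& p) { p | iter; }    // state that travels if the element is moved

  void doStep() {                       // entry method, assumed declared in chunk.ci
    ++iter;
    // ... one step of computation; the runtime records its time and messages ...
    if (iter % 100 == 0) AtSync();      // safe point: the balancer may migrate elements
    else thisProxy[thisIndex].doStep();
  }
  void ResumeFromSync() {               // called once rebalancing has finished
    thisProxy[thisIndex].doStep();
  }
};
```

Which strategy runs at each AtSync point (centralized vs. distributed, greedy vs. full reassignment, communication-aware) is then a runtime choice, typically selected at job launch (e.g. a +balancer option naming a strategy such as GreedyLB or RefineLB -- names quoted from memory of the Charm++ distribution, so treat them as assumptions).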

32 Load balancer in action
Automatic load balancing in crack propagation: 1. elements added; 2. load balancer invoked; 3. chunks migrated.

33 “Overhead” of Multipartitioning

34 Optimizing for Communication Patterns
The parallel-objects runtime system can observe, instrument, and measure communication patterns, because communication is from/to objects, not processors. Load balancers can use this to optimize object placement. Communication libraries can optimize by substituting the most suitable algorithm for each operation, learning at runtime (V. Krishnan, MS thesis, 1996).

35 Example: All to all on Lemieux

36 The Other Side: Pipelining
A sends a large message to B, whereupon B computes. Problem: B is idle for a long time while the message gets there. Solution: pipelining -- send the message in multiple pieces, triggering a computation on each (sketched below). Objects make this easy to do. Example: ab initio computations using the Car-Parrinello method, with a multiple-3D-FFT kernel. Recent collaboration with R. Car, M. Klein, G. Martyna, M. Tuckerman, N. Nystrom, J. Torrellas.
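A sketch of the pipelining pattern in the same Charm++ style as earlier; the class and entry names are hypothetical, bProxy is assumed to be A's proxy to the B chare, and recvPiece would be declared as an entry taking a marshalled array in the .ci file.

```cpp
#include <vector>

// On A's side: split one large message into K pieces and send each asynchronously;
// communication of piece k+1 then overlaps B's computation on piece k.
void A::sendPipelined(const std::vector<double>& big, int K) {
  int per = static_cast<int>(big.size()) / K;          // assume K divides the size
  for (int k = 0; k < K; ++k)
    bProxy.recvPiece(k, per, big.data() + k * per);    // async send; returns immediately
}

// On B's side: each piece triggers its share of the computation as it arrives,
// e.g. one set of planes of a 3D FFT in the Car-Parrinello kernel.
void B::recvPiece(int k, int n, const double* piece) {
  processPiece(k, n, piece);                           // assumed helper, not shown
}
```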

38 Effect of Pipelining
Multiple concurrent 3D FFTs, on 64 processors of Lemieux (V. Ramkumar, PPL).

39 Control Points: learning and tuning
The RTS can automatically optimize the degree of pipelining if it is given a control point (knob) to tune by the application. Controlling pipelining between a pair of objects: S. Krishnan, PhD thesis, 1994. Controlling the degree of virtualization: orchestration framework, ongoing PhD thesis.

40 Example: Molecular Dynamics in NAMD
A collection of [charged] atoms, with bonds, under Newtonian mechanics. At each time-step: calculate the forces on each atom -- bonded, and non-bonded (electrostatic and van der Waals) -- then calculate velocities and advance positions (a schematic step is sketched below). A 1-femtosecond time-step, with millions of steps needed, for thousands of atoms. Collaboration with K. Schulten, R. Skeel, and coworkers.
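A schematic of the time-step just described (serial, simplified, and not NAMD's actual code), to make the force terms and the update concrete; the two force routines are assumed helpers.

```cpp
#include <vector>

struct Atom { double x[3], v[3], f[3], mass, charge; };

void addBondedForces(std::vector<Atom>& atoms);     // bonds, angles, ... (assumed helper)
void addNonBondedForces(std::vector<Atom>& atoms);  // electrostatics + van der Waals (assumed helper)

// One ~1 fs step; millions of these are executed over a run.
void timeStep(std::vector<Atom>& atoms, double dt) {
  for (auto& a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;   // clear accumulated forces
  addBondedForces(atoms);
  addNonBondedForces(atoms);                              // the expensive part
  for (auto& a : atoms)
    for (int d = 0; d < 3; ++d) {
      a.v[d] += dt * a.f[d] / a.mass;                     // advance velocities
      a.x[d] += dt * a.v[d];                              // advance positions
    }
}
```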

41 NAMD performance using virtualization
Written in Charm++; uses measurement-based load balancing. Object-level performance feedback using the "Projections" tool for Charm++ identifies problems at the source level easily, and almost suggests fixes. Attained unprecedented performance.

42 Grainsize analysis
[Figure: grainsize distribution illustrating the problem.] Solution: split compute objects that may have too much work, using a heuristic based on the number of interacting atoms.

43 Integration overhead analysis
Problem: integration time had doubled compared with the sequential run.

44 Improved Performance Data
Published in SC2000: Gordon Bell Award finalist.

46 Role of compilers
New uses of compiler analysis -- apparently simple, but then again, data-flow analysis must have seemed simple once: supporting threads, shades of global variables, minimizing state at migration, border fusion, split-phase semantics (UPC), separately compiled components. Compiler-RTS collaboration is needed!

47 Summary
Virtualization as a magic bullet:
- Logical decomposition, better software engineering
- Flexible and dynamic mapping to processors
- Message-driven execution: adaptive overlap, modularity, predictability
- Principle of persistence: measurement-based load balancing, adaptive communication libraries
Future: compiler support; realize the potential with strategies and applications.
More info:

48 Component Frameworks
Seek an optimal division of labor between the "system" and the programmer: decomposition is done by the programmer, everything else is automated. Develop a standard library of reusable parallel components and domain-specific frameworks.
[Figure: the same specialization/automation spectrum as before, over decomposition, mapping, scheduling, and expression, with HPF, Charm++, and MPI.]

