Techniques for Developing Efficient Petascale Applications

1 Techniques for Developing Efficient Petascale Applications
Laxmikant (Sanjay) Kale, Parallel Programming Laboratory, Department of Computer Science, University of Illinois at Urbana-Champaign

2 Application Oriented Parallel Abstractions
Synergy between computer science research and applications has been beneficial to both. (Diagram: applications such as NAMD, LeanCP, ChaNGa, rocket simulation, space-time meshing, and others pose issues to, and receive techniques and libraries from, Charm++.)

4 Enabling CS technology of parallel objects and intelligent runtime systems (Charm++ and AMPI) has led to several collaborative applications in CSE

5 Outline
- Techniques for petascale programming; my biases: isoefficiency analysis, overdecomposition, automated resource management, sophisticated performance analysis tools
- BigSim: early development of applications on future machines
- Dealing with multicore processors and SMP nodes
- Novel parallel programming models

6 Techniques for Scaling to Multi-PetaFLOP/s

7 1. Analyze Scalability of Algorithms
(say, via the isoefficiency metric)

8 Isoefficiency Analysis
Vipin Kumar (circa 1990). An algorithm is scalable if, when you double the number of processors available to it, it can retain the same parallel efficiency by increasing the problem size by some amount. Not all algorithms are scalable in this sense. Isoefficiency is the rate at which the problem size must be increased, as a function of the number of processors, to keep the same efficiency. Parallel efficiency = T1/(Tp*P), where T1 is the time on one processor and Tp is the time on P processors. (Figure: equal-efficiency curves in the problem-size vs. processors plane.)
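To make the definition concrete, here is the standard formulation following Kumar and co-authors; the worked example (adding n numbers with a tree reduction) is a textbook illustration, not taken from the slide.

```latex
% Parallel efficiency and the isoefficiency function.
\begin{align*}
  E &= \frac{T_1}{P\,T_P}, \qquad
      T_o(W,P) = P\,T_P - T_1 \quad \text{(total overhead for problem size } W\text{)} \\
  E &= \frac{1}{1 + T_o(W,P)/T_1}
      \;\Longrightarrow\;
      \text{constant } E \text{ requires } T_1 = \tfrac{E}{1-E}\,T_o(W,P).
\end{align*}
% Example: adding n numbers on P processors with a tree reduction gives
% T_P \approx n/P + 2\log P, hence T_o \approx 2P\log P, so the problem size
% must grow as W = \Theta(P \log P) to hold efficiency constant:
% that growth rate is the isoefficiency function.
```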

9 NAMD: A Production MD Program
- Fully featured program, NIH-funded development
- Distributed free of charge (~20,000 registered users): binaries and source code
- Installed at NSF centers; user training and support
- Large published simulations

10 Molecular Dynamics in NAMD
- Collection of (charged) atoms, with bonds; Newtonian mechanics; thousands to millions of atoms (10,000 to 5,000,000)
- At each timestep: calculate forces on each atom (bonded; non-bonded: electrostatic and van der Waals), then calculate velocities and advance positions
- Short-range non-bonded forces: every timestep; long-range electrostatics: via PME (3D FFT)
- Multiple time stepping: PME every 4 timesteps
- Challenge: femtosecond timesteps, and millions of them are needed!
- Collaboration with K. Schulten, R. Skeel, and coworkers
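As a sketch of the timestep structure described above (illustrative C++, not NAMD's actual source; the force routines are placeholders, and pmeStride = 4 mirrors the multiple-time-stepping choice on the slide):

```cpp
// Schematic multiple-time-stepping MD loop (illustration only, not NAMD code).
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };

struct Atom {
    Vec3 pos, vel, force;
    double mass = 1.0;
};

void computeBondedForces(std::vector<Atom>&)        { /* bonds, angles, dihedrals */ }
void computeShortRangeNonbonded(std::vector<Atom>&) { /* cutoff electrostatics + van der Waals */ }
void computeLongRangePME(std::vector<Atom>&)        { /* 3D-FFT-based PME */ }

void runMD(std::vector<Atom>& atoms, int nSteps, double dt, int pmeStride = 4) {
    for (int step = 0; step < nSteps; ++step) {
        for (auto& a : atoms) a.force = Vec3{};      // zero forces

        computeBondedForces(atoms);                  // every timestep
        computeShortRangeNonbonded(atoms);           // every timestep
        if (step % pmeStride == 0)                   // multiple time stepping:
            computeLongRangePME(atoms);              // long-range part every few steps

        for (auto& a : atoms) {                      // update velocities, advance positions
            a.vel.x += dt * a.force.x / a.mass;
            a.vel.y += dt * a.force.y / a.mass;
            a.vel.z += dt * a.force.z / a.mass;
            a.pos.x += dt * a.vel.x;
            a.pos.y += dt * a.vel.y;
            a.pos.z += dt * a.vel.z;
        }
    }
}
```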

11 Traditional Approaches: Non-Isoefficient
- Replicated data: all atom coordinates stored on each processor; communication/computation ratio: O(P log P). Not scalable.
- Atom decomposition: partition the atoms array across processors; nearby atoms may not be on the same processor; C/C ratio: O(P). Not scalable.
- Force decomposition: distribute the force matrix to processors; the matrix is sparse and non-uniform; C/C ratio: O(sqrt(P)). Not scalable.

12 Spatial Decomposition via Charm++
- Atoms are distributed to cubes ("cells" or "patches") based on their location
- Size of each cube: just a bit larger than the cut-off radius, so each cube communicates only with its neighbors
- Work: one compute per pair of neighboring objects; C/C ratio: O(1)
- However: load imbalance and limited parallelism; Charm++ is useful for handling this
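A minimal sketch of the binning step, assuming a simple box origin and a user-chosen margin (illustrative C++, not NAMD's patch code):

```cpp
// Cutoff-based spatial decomposition: atoms are binned into cubes ("patches")
// whose side is a bit larger than the cutoff, so each cube only needs to
// exchange data with its 26 neighbors.
#include <array>
#include <cmath>

struct PatchIndex { int ix, iy, iz; };

PatchIndex patchOf(const std::array<double, 3>& pos,
                   const std::array<double, 3>& boxOrigin,
                   double cutoff, double margin = 0.5) {
    const double side = cutoff + margin;   // cube side: just larger than the cutoff
    return {
        static_cast<int>(std::floor((pos[0] - boxOrigin[0]) / side)),
        static_cast<int>(std::floor((pos[1] - boxOrigin[1]) / side)),
        static_cast<int>(std::floor((pos[2] - boxOrigin[2]) / side)),
    };
}
```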

13 Object-Based Parallelization for MD: Force Decomposition + Spatial Decomposition
- Now we have many objects to load balance: each "diamond" (pairwise compute) can be assigned to any processor
- Number of diamonds (3D): 14 x number of patches
- 2-away variation: half-size cubes, 5x5x5 interactions
- 3-away variation: 7x7x7 interactions

14 Performance of NAMD: STMV
STMV: ~1 million atoms. (Figure: time per step in ms vs. number of cores.)

15 2. Decouple decomposition from Physical Processors

16 Migratable Objects (aka Processor Virtualization)
User view: the programmer [over]decomposes the computation into virtual processors (VPs); the runtime assigns VPs to processors, enabling adaptive runtime strategies. Implementations: Charm++, AMPI.
Benefits:
- Software engineering: the number of virtual processors can be controlled independently; separate VPs for different modules
- Message-driven execution: adaptive overlap of communication; predictability enables automatic out-of-core execution and prefetch to local stores; asynchronous reductions
- Dynamic mapping: heterogeneous clusters (vacate, adjust to speed, share); automatic checkpointing; changing the set of processors used
- Automatic dynamic load balancing
- Communication optimization

17 Migratable Objects (aka Processor Virtualization)
As on the previous slide, the programmer [over]decomposes into virtual processors and the runtime assigns VPs to processors, enabling adaptive runtime strategies (implementations: Charm++, AMPI), with the same list of benefits. (Diagram: MPI processes become virtual processors, user-level migratable threads, which the runtime maps onto real processors.)

18 Parallel Decomposition and Processors
- MPI style encourages decomposition into P pieces, where P is the number of physical processors available; if your natural decomposition is a cube, then the number of processors must be a cube
- Charm++/AMPI style "virtual processors": decompose into the natural objects of the application and let the runtime map them to processors
- This decouples decomposition from load balancing
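A toy sketch of the idea (plain C++, not the Charm++ runtime): the object count is chosen by the application, and a mapper, here a trivial round-robin placeholder with arbitrary example sizes, decides placement and can be swapped without touching the decomposition.

```cpp
// Over-decomposition: create many more work objects than processors and let a
// runtime-style mapper decide where each object runs.
#include <cstdio>
#include <vector>

struct WorkObject { int id; double load = 1.0; };

// A trivial mapper: round-robin. A real runtime can remap objects later
// for load balance or communication locality.
std::vector<int> mapObjects(const std::vector<WorkObject>& objs, int numProcs) {
    std::vector<int> placement(objs.size());
    for (size_t i = 0; i < objs.size(); ++i)
        placement[i] = static_cast<int>(i) % numProcs;
    return placement;
}

int main() {
    const int numProcs = 8;
    std::vector<WorkObject> objs;
    for (int i = 0; i < 512; ++i) objs.push_back({i});   // 64x over-decomposition

    std::vector<int> placement = mapObjects(objs, numProcs);
    std::printf("object 42 runs on processor %d\n", placement[42]);
}
```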

19 Decomposition Independent of numCores
Rocket simulation example under traditional MPI vs. the Charm++/AMPI framework. Benefits: load balance, communication optimizations, modularity. (Diagram: solid and fluid partitions, Solid1..Solidn and Fluid1..Fluidm, mapped to processors.)
In the traditional MPI paradigm, the number of partitions of both modules is typically equal to the number of processors P, and although the i-th elements of the fluid and solid modules are not connected geometrically in the simulation, they are glued together on the i-th processor. Under the Charm++/AMPI framework, the two modules each get their own set of parallel objects, and the sizes of the arrays are not restricted or related. The benefit is performance optimization and better modularity. One problem: due to asynchronous method invocation, the flow of control is buried deep in the object code.

20 Fault Tolerance
- Automatic checkpointing: migrate objects to disk; in-memory checkpointing as an option; automatic fault detection and restart
- "Impending fault" response: migrate objects to other processors; adjust processor-level parallel data structures
- Scalable fault tolerance: when a processor out of 100,000 fails, the other 99,999 shouldn't have to run back to their checkpoints! Sender-side message logging; latency tolerance helps mitigate costs; restart can be sped up by spreading out the objects from the failed processor

21 3. Use Dynamic Load Balancing Based on the Principle of Persistence
Principle of persistence: computational loads and communication patterns tend to persist, even in dynamic computations, so the recent past is a good predictor of the near future.
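A minimal sketch of what a measurement-based balancer does with this principle (illustrative C++, not one of Charm++'s actual load balancers): treat each object's recently measured load as its predicted load, then greedily assign objects, heaviest first, to the currently least-loaded processor.

```cpp
// Persistence-based greedy rebalancing: recent measurements predict future load.
#include <algorithm>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Returns placement[i] = processor chosen for object i.
std::vector<int> greedyRebalance(const std::vector<double>& measuredLoad, int numProcs) {
    std::vector<int> order(measuredLoad.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    std::sort(order.begin(), order.end(),                 // heaviest objects first
              [&](int a, int b) { return measuredLoad[a] > measuredLoad[b]; });

    // Min-heap of (current load, processor id).
    using Entry = std::pair<double, int>;
    std::priority_queue<Entry, std::vector<Entry>, std::greater<Entry>> procs;
    for (int p = 0; p < numProcs; ++p) procs.push({0.0, p});

    std::vector<int> placement(measuredLoad.size());
    for (int obj : order) {
        auto [load, p] = procs.top();                     // least-loaded processor
        procs.pop();
        placement[obj] = p;
        procs.push({load + measuredLoad[obj], p});        // assume past load persists
    }
    return placement;
}
```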

22 Two-Step Balancing
- Computational entities (nodes, structured grid points, particles, ...) are partitioned into objects, say a few thousand FEM elements per chunk on average
- Objects are migrated across processors to balance load, a much smaller problem than repartitioning the entire mesh
- Occasional repartitioning for some problems

23 Load Balancing Steps
(Timeline with phases: regular timesteps; instrumented timesteps; detailed, aggressive load balancing; refinement load balancing.)

24 ChaNGa: Parallel Gravity
- Collaborative project (NSF ITR) with Prof. Tom Quinn, Univ. of Washington
- Components: gravity, gas dynamics
- Barnes-Hut tree codes: an oct-tree is the natural decomposition, since the geometry has better aspect ratios and so you "open" fewer nodes; but it is often not used because, under the assumption of a one-to-one map between subtrees and processors, it leads to bad load balance, and binary trees are considered better balanced
- With Charm++: use the oct-tree, and let Charm++ map subtrees to processors
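For concreteness, a tiny sketch of oct-tree decomposition (illustrative C++, not ChaNGa's tree code): each particle descends one octant per level, and the resulting subtrees are the objects a runtime such as Charm++ can place on processors.

```cpp
// Octree decomposition: choose a child octant per level by comparing a
// particle's coordinates with the cell center.
#include <array>

struct Cell {
    std::array<double, 3> center;
    double halfSide;
};

// Child octant index in [0, 7]: one bit per axis.
int octantOf(const std::array<double, 3>& pos, const Cell& cell) {
    int idx = 0;
    for (int axis = 0; axis < 3; ++axis)
        if (pos[axis] >= cell.center[axis]) idx |= (1 << axis);
    return idx;
}

// The child cell corresponding to a given octant index.
Cell childCell(const Cell& cell, int octant) {
    Cell child{cell.center, cell.halfSide * 0.5};
    for (int axis = 0; axis < 3; ++axis) {
        double offset = (octant & (1 << axis)) ? child.halfSide : -child.halfSide;
        child.center[axis] += offset;
    }
    return child;
}
```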

26 ChaNGa Load Balancing: Simple Balancers Worsened Performance!
dwarf 5M on 1,024 BlueGene/L processors: step time went from 5.6 s to 6.1 s. (Figure: processor timelines.)

27 Load Balancing with OrbRefineLB
dwarf 5M on 1,024 BlueGene/L processors: step time improved from 5.6 s to 5.0 s. Need sophisticated balancers, and the ability to choose the right ones automatically.

28 ChaNGa Preliminary Performance
ChaNGa: parallel gravity code developed in collaboration with Tom Quinn (Univ. of Washington) using Charm++.

29 Load Balancing for Large Machines: I
- Centralized balancers achieve the best balance (they collect the object-communication graph on one processor) but won't scale beyond tens of thousands of nodes
- Fully distributed load balancers avoid the bottleneck but achieve poor load balance and are not adequately agile
- Hierarchical load balancers: careful control of what information goes up and down the hierarchy can lead to fast, high-quality balancers

30 Load Balancing for Large Machines: II
- Interconnection topology starts to matter again; it was hidden for a while by wormhole routing etc.; latency variation is still small, but bandwidth occupancy is a problem
- Topology-aware load balancers: some general heuristics have shown good performance, but may require too much compute power; special-purpose heuristics work fine when applicable
- Still, many open challenges
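As one concrete example of what a topology-aware mapper optimizes (an illustrative C++ sketch, not Charm++'s topology API), the "hop-bytes" metric weights each message's volume by the number of torus links it must traverse:

```cpp
// Hop-bytes on a 3D torus: message volume times the number of links traversed.
#include <algorithm>
#include <array>
#include <cstdlib>

struct TorusDims { int x, y, z; };

// Shortest hop distance along one torus dimension (wrap-around allowed).
int torusHops(int a, int b, int dim) {
    int d = std::abs(a - b);
    return std::min(d, dim - d);
}

int hopDistance(const std::array<int, 3>& p, const std::array<int, 3>& q,
                const TorusDims& dims) {
    return torusHops(p[0], q[0], dims.x) +
           torusHops(p[1], q[1], dims.y) +
           torusHops(p[2], q[2], dims.z);
}

// Hop-bytes for one message; a topology-aware mapper tries to keep the sum small.
double hopBytes(double bytes, const std::array<int, 3>& src,
                const std::array<int, 3>& dst, const TorusDims& dims) {
    return bytes * hopDistance(src, dst, dims);
}
```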

31 OpenAtom: Car-Parrinello ab initio MD
NSF ITR, IBM. Target systems: molecular clusters, nanowires, semiconductor surfaces, 3D solids/liquids. G. Martyna, M. Tuckerman, L. Kale.

32 Goal: The Accurate Treatment of Complex Heterogeneous Nanoscale Systems
(Figure, courtesy G. Martyna: a recording device with read head, write head, bit, soft under-layer, and recording medium.)

33 New OpenAtom Collaboration (DOE)
- Principal investigators: G. Martyna (IBM TJ Watson), M. Tuckerman (NYU), L. V. Kale (UIUC), K. Schulten (UIUC), J. Dongarra (UTK/ORNL)
- Current effort focus: QM/MM via integration with NAMD2
- ORNL Cray XT4 Jaguar (31,328 cores)
- A unique parallel decomposition of the Car-Parrinello method: using Charm++ virtualization we can efficiently scale small (32-molecule) systems to thousands of processors

34 Decomposition and Computation Flow

35 OpenAtom Performance
(Figure: scaling on BlueGene/L, BlueGene/P, and Cray XT3.)

36 Topology aware mapping of objects

37 Improvements from Network-Topology-Aware Mapping of Computational Work to Processors
The simulation in the left panel maps computational work to processors taking the network connectivity into account, while the right panel simulation does not. The "black", or idle, time that processors spend waiting for computational work to arrive is significantly reduced at left. (256 waters, 70 Ry, on 4,096 BG/L cores)

38 OpenAtom: Topology Aware Mapping on BG/L

39 OpenAtom: Topology Aware Mapping on Cray XT3

40 4. Analyze Performance with Sophisticated Tools

41 Exploit Sophisticated Performance Analysis Tools
- We use a tool called "Projections"; many other tools exist; there is a need for scalable analysis
- A recent example: trying to identify the next performance obstacle for NAMD
- Running on 8,192 processors with a 92,000-atom simulation; test scenario: without PME
- Time is 3 ms per step, but the lower bound is about 1.6 ms

45 5. Model and Predict Performance via Simulation

46 How to Tune Performance for a Future Machine?
- For example, Blue Waters will arrive in 2011, but we need to prepare applications for it starting now
- Even for existing machines, the full-size machine may not be available as often as needed for tuning runs
- A simulation-based approach is needed
- Our approach, BigSim: full-scale emulation followed by trace-driven simulation

47 BigSim: Emulation
- Run an existing, full-scale MPI, AMPI, or Charm++ application using an emulation layer that pretends to be (say) 100k cores
- Leverages Charm++'s object-based virtualization
- Generates traces: characteristics of SEBs (Sequential Execution Blocks), and dependences between SEBs and messages
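To make the trace contents concrete, here is a hypothetical record layout in C++ for the kind of information just described; the field names and structure are assumptions for illustration, not BigSim's actual trace format.

```cpp
// Illustrative per-SEB trace record for an emulation-based simulator (assumed layout).
#include <cstdint>
#include <string>
#include <vector>

struct MessageRecord {
    int      destObject;     // which emulated object receives the message
    uint64_t bytes;          // payload size, fed to the network model later
    double   sendOffset;     // when it was sent, relative to the SEB start
};

struct SEBRecord {
    int      object;                     // emulated object (virtual processor)
    double   duration;                   // measured sequential execution time
    std::vector<int> dependsOn;          // SEBs/messages that must complete first
    std::vector<MessageRecord> sends;    // messages generated by this block
    std::string functionName;            // which routine / entry method ran
};
```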

48 BigSim: Trace-Driven Simulation
- Trace-driven parallel simulation, typically on tens to hundreds of processors
- Multiple-resolution simulation of sequential execution: from simple scaling to cycle-accurate modeling
- Multiple-resolution simulation of the network: from a simple latency/bandwidth model to a detailed packet- and switching-port-level simulation
- Generates timing traces just as a real application would on the full-scale machine
- Phase 3: analyze performance; identify bottlenecks, even without predicting exact performance; carry out various "what-if" analyses
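The simple end of that network-model spectrum can be written in a few lines; this is a generic latency/bandwidth estimate (an illustration, not BigSim's simulator code):

```cpp
// Simple latency/bandwidth network model:
// predicted time = per-message latency + per-hop latency + bytes / bandwidth.
double messageTime(double bytes,
                   double latencySec,          // e.g. a few microseconds
                   double bandwidthBytesPerSec,
                   int    hops = 0,
                   double perHopSec = 0.0) {
    return latencySec + hops * perHopSec + bytes / bandwidthBytesPerSec;
}
```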

49 Projections: Performance visualization

50 BigSim: Challenges
This simple picture hides many complexities:
- Emulation: automatic out-of-core support for applications with large memory footprints
- Simulation: accuracy vs. cost trade-offs, interpolation mechanisms, memory optimizations
This is an active area of research, but early versions will be available for Blue Waters application developers next month.

51 Validation

52 Challenges of Multicore Chips and SMP Nodes

53 Challenges of Multicore Chips
- SMP nodes have been around for a while; typically, users just run separate MPI processes on each core, which is seen to be faster (OpenMP + MPI has not succeeded much)
- Yet you cannot ignore the shared memory: e.g., for gravity calculations (cosmology), a processor may request a set of particles from remote nodes; with shared memory, one can maintain a node-level cache, minimizing communication
- We need a model without the pitfalls of Pthreads/OpenMP: costly locks, false sharing, no respect for locality
- Charm++ avoids these pitfalls, if its RTS can avoid them
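A minimal sketch of such a node-level particle cache (illustrative C++, not ChaNGa's or Charm++'s code; a production version would be more careful about locking than the single mutex used here, and fetchRemote is a hypothetical placeholder): all cores on a node share one cache, so each remote bucket is fetched only once per node.

```cpp
// Node-level shared software cache for remote particle data on an SMP node.
#include <map>
#include <mutex>
#include <utility>
#include <vector>

struct Particle { double x, y, z, mass; };

class NodeParticleCache {
public:
    // Returns the requested bucket of particles, fetching from the remote node
    // only if no core on this node has requested it before.
    const std::vector<Particle>& get(int remoteNode, int bucketId) {
        std::lock_guard<std::mutex> guard(mutex_);
        auto key = std::make_pair(remoteNode, bucketId);
        auto it = cache_.find(key);
        if (it == cache_.end())
            it = cache_.emplace(key, fetchRemote(remoteNode, bucketId)).first;
        return it->second;
    }

private:
    // Placeholder for the actual communication call (hypothetical).
    std::vector<Particle> fetchRemote(int /*remoteNode*/, int /*bucketId*/) {
        return {};
    }

    std::mutex mutex_;
    std::map<std::pair<int, int>, std::vector<Particle>> cache_;
};
```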

54 Charm++ Experience: Pingpong Performance

55 Raising the Level of Abstraction

56 Raising the Level of Abstraction
- Clearly (?) we need new programming models; many opinions and ideas exist
- Two meta-points: the need for exploration (don't standardize too soon), and interoperability
- Interoperability allows a "beachhead" for novel paradigms; long-term, it allows each module to be coded using the paradigm best suited for it
- Interoperability requires concurrent composition, and this may require message-driven execution

57 Simplifying Parallel Programming
- By giving up completeness! A paradigm may be simple, yet not suitable for expressing all parallel patterns
- Still, if it can cover a significant class of patterns (applications, modules), it is useful
- A collection of incomplete models, backed by a few complete ones, will do the trick
- Our own examples, both of which outlaw non-determinism: Multiphase Shared Arrays (MSA), a restricted shared memory model (LCPC '04), and Charisma, static data flow among collections of objects (LCR '04, HPDC '07)

58 MSA: The Simple Model
A program consists of a collection of Charm threads and multiple collections of data arrays, partitioned into pages (user-specified). Each array is in one mode at a time, but its mode may change from phase to phase. Modes: write-once, read-only, accumulate, owner-computes. (Diagram: two arrays, A and B.)
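A toy C++ sketch of the phase/mode discipline described above; this illustrates the model only and is not the actual MSA library interface.

```cpp
// Restricted shared array with phase-wise modes, in the spirit of MSA.
#include <cassert>
#include <cstddef>
#include <vector>

enum class Mode { WriteOnce, ReadOnly, Accumulate, OwnerComputes };

class MultiphaseArray {
public:
    explicit MultiphaseArray(std::size_t n) : data_(n, 0.0), mode_(Mode::WriteOnce) {}

    // All threads agree to switch mode at a phase boundary (a sync point in MSA).
    void setMode(Mode m) { mode_ = m; }

    void write(std::size_t i, double v) {          // legal only in write-once mode
        assert(mode_ == Mode::WriteOnce);
        data_[i] = v;
    }
    double read(std::size_t i) const {             // legal only in read-only mode
        assert(mode_ == Mode::ReadOnly);
        return data_[i];
    }
    void accumulate(std::size_t i, double v) {     // legal only in accumulate mode
        assert(mode_ == Mode::Accumulate);
        data_[i] += v;                             // commutative-associative update
    }

private:
    std::vector<double> data_;
    Mode mode_;
};
```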

59 A View of an Interoperable Future

60 Summary
Petascale applications will require a large toolbox. My pet tools/techniques:
- Scalable algorithms with isoefficiency analysis
- Object-based decomposition
- Dynamic load balancing
- Scalable performance analysis
- Early performance tuning via BigSim
- Multicore chips, SMP nodes, and accelerators require extra programming effort with locality-respecting models
- Raise the level of abstraction via "incomplete yet simple" paradigms and domain-specific frameworks
More info: http://charm.cs.uiuc.edu/

