
1 1 Petascale Programming with Virtual Processors: Charm++, AMPI, and domain-specific frameworks Laxmikant Kale http://charm.cs.uiuc.edu Parallel Programming Laboratory Dept. of Computer Science University of Illinois at Urbana-Champaign

2 2 Outline Challenges and opportunities: character of the new machines Charm++ and AMPI –Basics –Capabilities, programming techniques –Dice them fine: VPs to the rescue –Juggling for overlap –Load balancing: scenarios and strategies Case studies –Classical Molecular Dynamics –Car-Parrinello ab initio MD (Quantum Chemistry) –Rocket Simulation Raising the level of abstraction: –Higher-level compiler-supported notations –Domain-specific “frameworks” –Example: Unstructured mesh (FEM) framework

3 3 Machines: current, planned and future Current: –Lemieux: 3000 processors, 750 nodes, full-bandwidth fat-tree network –ASCI Q: similar architecture –System X: Infiniband –Tungsten: Myrinet –Thunder –Earth Simulator Planned: –IBM’s Blue Gene/L: 65k nodes, 3D-torus topology –Red Storm (10k procs) Future? –BG/L is an example: 1M processors! 0.5 MB per processor –HPCS: 3 architectural plans

4 4 Some Trends: Communication Bisection bandwidth: –Can’t scale as well with the number of processors without being expensive –Wire-length delays even on Lemieux: messages going through the highest-level switches take longer Two possibilities: –Grid topologies, with near-neighbor connections High link speed, low bisection bandwidth –Expensive, full-bandwidth networks

5 5 Trends: Memory Memory latencies are 100 times slower than processor cycles! –This will get worse A solution: put more processors in –To increase bandwidth between processors and memory –On-chip DRAM –In other words: a low memory-to-processor ratio But this can be handled with programming style Application viewpoint, for physical modeling: –Given a fixed amount of run-time (4 hours or 10 days) –Doubling spatial resolution increases CPU needs more than 2-fold (smaller time-steps)

6 6 Application Complexity is increasing Why? –With more FLOPS, need better algorithms.. Not enough to just do more of the same Example: Dendritic growth in materials –Better algorithms lead to more complex structure Example: Gravitational force calculation –Direct all-pairs: O(N^2), but easy to parallelize –Barnes-Hut: O(N log N), but more complex –Multiple modules, dual time-stepping –Adaptive and dynamic refinements Ambitious projects –Projects with new objectives lead to dynamic behavior and multiple components

7 7 Specific Programming Challenges Explicit management of resources –This data on that processor –This work on that processor Analogy: memory management –We declare arrays, and malloc dynamic memory chunks as needed –Do not specify memory addresses As usual, Indirection is the key –Programmer: This data, partitioned into these pieces This work divided that way –System: map data and work to processors

8 8 Virtualization: Object-based Parallelization User View System implementation User is only concerned with interaction between objects Idea: Divide the computation into a large number of objects –Let the system map objects to processors

9 9 Virtualization: Charm++ and AMPI These systems seek an optimal division of labor between the “system” and the programmer: –Decomposition done by programmer –Everything else automated [Figure: a spectrum from specialization (MPI) to abstraction (HPF), with Charm++ in between; the levels shown are decomposition, mapping, scheduling, and expression]

10 10 Charm++ and Adaptive MPI Charm++: Parallel C++ –Asynchronous methods –Object arrays –In development for over a decade –Basis of several parallel applications –Runs on all popular parallel machines and clusters AMPI: A migration path for legacy MPI codes –Gives them the dynamic load balancing capabilities of Charm++ –Uses Charm++ object arrays –Minimal modifications to convert existing MPI programs Automated via AMPizer (collaboration w. David Padua) –Bindings for C, C++, and Fortran90 Both available from http://charm.cs.uiuc.edu

11 11 Parallel Objects, Adaptive Runtime System Libraries and Tools The enabling CS technology of parallel objects and intelligent Runtime systems has led to several collaborative applications in CSE Molecular Dynamics Crack Propagation Space-time meshes Computational Cosmology Rocket Simulation Protein Folding Dendritic Growth Quantum Chemistry (QM/MM)

12 12 Message From This Talk Virtualization and the associated techniques that we have been exploring for the past decade are ready and powerful enough to meet the needs of high-end parallel computing and of complex, dynamic applications These techniques are embodied in: –Charm++ –AMPI –Frameworks (Structured Grids, Unstructured Grids, Particles) –Virtualization of other coordination languages (UPC, GA,..)

13 13 Acknowledgements Graduate students including: –Gengbin Zheng –Orion Lawlor –Milind Bhandarkar –Terry Wilmarth –Sameer Kumar –Jay deSouza –Chao Huang –Chee Wai Lee Recent funding: –NSF (NGS: Frederica Darema) –DOE (ASCI: Rocket Center) –NIH (Molecular Dynamics)

14 14 Charm++ : Object Arrays A collection of data-driven objects (aka chares), –With a single global name for the collection, and –Each member addressed by an index –Mapping of element objects to processors handled by the system [Figure: user’s view of array elements A[0], A[1], A[2], A[3], ..., A[..]]

15 15 Charm++ : Object Arrays A collection of chares, –with a single global name for the collection, and –each member addressed by an index –Mapping of element objects to processors handled by the system [Figure: user’s view of the contiguous array A[0]..A[..]; system view showing elements such as A[0] and A[3] scattered across processors]

16 16 Chare Arrays Elements are data-driven objects Elements are indexed by a user-defined data type-- [sparse] 1D, 2D, 3D, tree,... Send messages to index, receive messages at element. Reductions and broadcasts across the array Dynamic insertion, deletion, migration-- and everything still has to work!

17 17 Charm++ Remote Method Calls To call a method on a remote C++ object foo, use the local “proxy” C++ object CProxy_foo generated from the interface file.

Interface (.ci) file:
  array[1D] foo {
    entry void foo(int problemNo);
    entry void bar(int x);
  };

In a .C file, invoking the method (bar) with its parameters (17) on the i’th object through the generated proxy class:
  CProxy_foo someFoo = ...;
  someFoo[i].bar(17);

This results in a network message, and eventually in a call to the real object’s method, in another .C file:
  void foo::bar(int x) {... }

18 18 Charm++ Startup Process: Main

Interface (.ci) file:
  module myModule {
    array[1D] foo {
      entry foo(int problemNo);
      entry void bar(int x);
    }
    mainchare myMain {
      entry myMain(int argc,char **argv);
    }
  };

In a .C file (myMain is the special startup object, whose constructor is called at startup; CBase_myMain is the generated class):
  #include “myModule.decl.h”
  class myMain : public CBase_myMain {
    myMain(int argc,char **argv) {
      int nElements=7, i=nElements/2;
      CProxy_foo f=CProxy_foo::ckNew(2,nElements);
      f[i].bar(3);
    }
  };
  #include “myModule.def.h”

19 19 Other Features Broadcasts and Reductions Runtime creation and deletion nD and sparse array indexing Library support (“modules”) Groups: per-processor objects Node Groups: per-node objects Priorities: control ordering
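As a rough illustration of the first item (broadcasts and reductions on a chare array), a sketch building on the foo/myMain example from the previous slides; the reducer choice, the reportMaxError target method, and the readonly mainProxy are assumptions for illustration, not code from the talk:

  // Broadcast: invoking an entry method on the array proxy reaches every element.
  someFoo.bar(17);          // broadcast to all elements of the array
  someFoo[3].bar(17);       // ordinary point-to-point invocation on element 3

  // Reduction: each element contributes a value; the runtime combines them and
  // delivers the result to a callback (here assumed to be an entry method of
  // myMain, reached through an assumed readonly proxy mainProxy).
  void foo::contributeError(double localError) {
    contribute(sizeof(double), &localError, CkReduction::max_double,
               CkCallback(CkReductionTarget(myMain, reportMaxError), mainProxy));
  }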

20 20 AMPI: “Adaptive” MPI MPI interface, for C and Fortran, implemented on Charm++ Multiple “virtual processors” per physical processor –Implemented as user-level threads –Very fast context switching: ~1 µs –E.g., MPI_Recv only blocks the virtual processor, not the physical one Supports migration (and hence load balancing) via extensions to MPI

21 21 AMPI: 7 MPI processes

22 22 AMPI: Real Processors 7 MPI “processes” Implemented as virtual processors (user-level migratable threads)

23 23 How to Write an AMPI Program Write your normal MPI program, and then … Link and run with Charm++ –Compile and link with charmc charmc -o hello hello.c -language ampi charmc -o hello2 hello.f90 -language ampif –Run with charmrun charmrun hello
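For concreteness, a minimal MPI program of the kind that runs unchanged under AMPI (this hello.c is just a generic example, not taken from the talk):

  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* under AMPI, the rank of this virtual processor */
    MPI_Comm_size(MPI_COMM_WORLD, &size);  /* ...and the total number of virtual processors */
    printf("Hello from VP %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
  }

Built with charmc -o hello hello.c -language ampi and run with something like charmrun hello +p4 +vp16 (the +p and +vp options are described on the next slide), the same binary can use more virtual processors than physical ones.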

24 24 How to Run an AMPI program Charmrun –A portable parallel job execution script –Specify number of physical processors: +pN –Specify number of virtual MPI processes: +vpN –Special “nodelist” file for net-* versions

25 25 AMPI MPI Extensions Process Migration Asynchronous Collectives Checkpoint/Restart

26 26 How to Migrate a Virtual Processor? Move all application state to new processor Stack Data –Subroutine variables and calls –Managed by compiler Heap Data –Allocated with malloc/free –Managed by user Global Variables

27 27 Stack Data The stack is used by the compiler to track function calls and provide temporary storage –Local Variables –Subroutine Parameters –C “alloca” storage Most of the variables in a typical application are stack data

28 28 Migrate Stack Data Without compiler support, we cannot change the stack’s address –Because we can’t change the stack’s interior pointers (return frame pointer, function arguments, etc.) Solution: “isomalloc” addresses –Reserve address space on every processor for every thread stack –Use mmap to scatter stacks in virtual memory efficiently –Idea comes from PM2
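A minimal sketch of the address-space reservation idea behind isomalloc (illustrative only, not the actual Charm++ implementation; the base address and slot layout are arbitrary assumptions):

  #include <sys/mman.h>
  #include <stdio.h>

  /* Reserve the same slot of virtual address space for a given thread's stack on
   * every processor, so the stack bytes can later be copied there unchanged and
   * all interior pointers remain valid. */
  void *reserve_stack_slot(int threadId, size_t slotSize) {
    char *base = (char *)0x100000000000ULL;          /* assumed-safe region on 64-bit */
    void *want = base + (size_t)threadId * slotSize; /* slot i starts at base + i*slotSize */
    void *got = mmap(want, slotSize, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (got != want) {                               /* the kernel may ignore the hint */
      if (got != MAP_FAILED) munmap(got, slotSize);
      fprintf(stderr, "reservation failed for thread %d\n", threadId);
      return NULL;
    }
    return got;
  }

This also makes the limitations noted a few slides later concrete: the scheme consumes address space proportional to (number of threads) x (slot size), and guessing a safe base address is inherently unportable.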

29 29 Migrate Stack Data [Figure: Processor A’s memory (code, globals, heap, stacks for threads 2, 3, and 4) and Processor B’s memory (code, globals, heap, stack for thread 1), both spanning 0x00000000 to 0xFFFFFFFF; Thread 3 is to be migrated from A to B]

30 30 Migrate Stack Data [Figure: after migrating Thread 3, Processor A holds stacks for threads 2 and 4, while Processor B holds stacks for threads 1 and 3]

31 31 Migrate Stack Data Isomalloc is a completely automatic solution –No changes needed in application or compilers –Just like a software shared-memory system, but with proactive paging But it has a few limitations –Depends on having large quantities of virtual address space (best on 64-bit) 32-bit machines can only have a few gigabytes of isomalloc stacks across the whole machine –Depends on unportable mmap Which addresses are safe? (We must guess!) What about Windows? Blue Gene?

32 32 Heap Data Heap data is any dynamically allocated data –C “malloc” and “free” –C++ “new” and “delete” –F90 “ALLOCATE” and “DEALLOCATE” Arrays and linked data structures are almost always heap data

33 33 Migrate Heap Data Automatic solution: isomalloc all heap data just like stacks! –“-memory isomalloc” link option –Overrides malloc/free –No new application code needed –Same limitations as isomalloc Manual solution: application moves its heap data –Need to be able to size message buffer, pack data into message, and unpack on other side –“pup” abstraction does all three
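As a rough sketch of the pup abstraction mentioned above (the Chunk class and its fields are invented for illustration; the calls follow the standard Charm++ pup idiom):

  #include "pup.h"   // Charm++ PUP framework

  class Chunk {
    int n;           // number of elements
    double *data;    // heap-allocated array owned by this object
  public:
    void pup(PUP::er &p) {
      p | n;                    // sizes first, so unpacking knows how much to allocate
      if (p.isUnpacking())      // on the destination processor...
        data = new double[n];   // ...recreate the heap storage
      PUParray(p, data, n);     // size, pack, or unpack the array contents
    }
  };

The same routine serves all three purposes named above: run with a sizing PUP::er it measures the message buffer, with a packing one it fills the message, and with an unpacking one it rebuilds the object on the other side.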

34 34 Comparison with Native MPI Problem setup: 3D stencil calculation of size 240^3 run on Lemieux. AMPI runs on any number of PEs (e.g., 19, 33, 105); native MPI needs a cube number. Performance –Slightly worse without optimization –Being improved Flexibility –Useful when only a small number of PEs is available –Or when the algorithm requires a specific processor count

35 35 Benefits of Virtualization Software engineering –Number of virtual processors can be independently controlled –Separate VPs for different modules Message driven execution –Adaptive overlap of communication –Modularity –Predictability Automatic out-of-core –Asynchronous reductions Dynamic mapping –Heterogeneous clusters Vacate, adjust to speed, share –Automatic checkpointing –Change set of processors used Principle of persistence –Enables runtime optimizations –Automatic dynamic load balancing –Communication optimizations –Other runtime optimizations More info: http://charm.cs.uiuc.edu

36 36 Data driven execution [Figure: a scheduler on each processor picking messages from a message queue]

37 37 Adaptive Overlap of Communication With virtualization, you get data-driven execution –There are multiple entities (objects, threads) on each processor No single object or thread holds up the processor Each one is “continued” when its data arrives –No need to guess which is likely to arrive first –So: achieves automatic and adaptive overlap of computation and communication This kind of data-driven idea can be used in MPI as well, using wild-card receives (see the sketch below) –But as the program gets more complex, it gets harder to keep track of all pending communication in all the places that do a receive
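The MPI analogue mentioned above, processing whichever message arrives first via wild-card receives, looks roughly like this (a sketch; the tags, buffer size, and processing routines are assumptions):

  #include <mpi.h>

  #define LEFT_TAG  1     /* assumed tag values */
  #define RIGHT_TAG 2
  #define BUFLEN    1024

  void process_left(double *b);   /* assumed application routines */
  void process_right(double *b);

  void exchange(double *buf) {
    MPI_Status st;
    for (int pending = 2; pending > 0; --pending) {
      /* Accept whichever neighbor's data shows up first. */
      MPI_Recv(buf, BUFLEN, MPI_DOUBLE, MPI_ANY_SOURCE, MPI_ANY_TAG,
               MPI_COMM_WORLD, &st);
      if (st.MPI_TAG == LEFT_TAG) process_left(buf);
      else                        process_right(buf);
    }
  }

With two pending messages this is manageable; with many modules each posting such receives, the bookkeeping the slide warns about is exactly what a message-driven runtime takes over.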

38 38 Why Message-Driven Modules? SPMD and Message-Driven Modules (From A. Gursoy, Simplified Expression of Message-Driven Programs and Quantification of Their Impact on Performance, Ph.D. thesis, Apr 1994.)

39 39 Checkpoint/Restart Any long running application must be able to save its state When you checkpoint an application, it uses the pup routine to store the state of all objects State information is saved in a directory of your choosing Restore also uses pup, so no additional application code is needed (pup is all you need)

40 40 Checkpointing Job In AMPI, use MPI_Checkpoint( ); –Collective call; returns when checkpoint is complete In Charm++, use CkCheckpoint(, ); –Called on one processor; calls resume when checkpoint is complete Restarting: –The charmrun option ++restart is used to restart –Number of processors need not be the same

41 41 AMPI’s Collective Communication Support Communication operations in which all (or a large subset of) processes participate –For example, broadcast They are a performance impediment All-to-all communication –All-to-all personalized communication (AAPC) –All-to-all multicast (AAM)

42 42 Communication Optimization Organize processors in a 2D (virtual) mesh Message from (x1,y1) to (x2,y2) goes via (x1,y2) About 2*sqrt(P) messages per processor instead of P-1 But each byte travels twice on the network
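A quick worked count: on P = 4096 processors arranged as a 64 x 64 virtual mesh, each processor sends to the 63 others in its row and the 63 others in its column, about 126 messages instead of 4095, at the cost of each byte crossing the network twice.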

43 43 Performance Benchmark: A Mystery? [Figure: Radix Sort]

44 44 CPU Time vs. Elapsed Time Time breakdown of an all-to-all operation using the Mesh library Computation is only a small proportion of the elapsed time A number of optimization techniques have been developed to improve collective communication performance

45 45 Asynchronous Collectives Time breakdown of 2D FFT benchmark [ms] VPs implemented as threads Overlapping computation with waiting time of collective operations Total completion time reduced

46 46 Shrink/Expand Problem: Availability of computing platform may change Fitting applications on the platform by object migration Time per step for the million-row CG solver on a 16-node cluster Additional 16 nodes available at step 600


48 48 Projections Projections is designed for use with a virtualized model like Charm++ or AMPI Instrumentation built into runtime system Post-mortem tool with highly detailed traces as well as summary formats Java-based visualization tool for presenting performance information

49 49 Trace Generation (Detailed) Link-time option “-tracemode projections” –In log mode, each event is recorded in full detail (including timestamp) in an internal buffer –Memory footprint is controlled by limiting the number of log entries –I/O perturbation can be reduced by increasing the number of log entries –Generates a .log file for each processor and a .sts file for the entire application Commonly used run-time options: +traceroot DIR +logsize NUM

50 50 Visualization Main Window

51 51 Post-mortem analysis: views Utilization graph –Shows processor utilization against time, and the time spent in specific parallel methods Profile: stacked graphs –For a given period, a breakdown of the time on each processor Includes idle time, and message-sending and receiving times Timeline: –upshot-like, but more detailed –Pop-up views of method execution, message arrows, user-level events


53 53 Projections Views: continued Histogram of method execution times –How many method-execution instances had a time of 0-1 ms? 1-2 ms?..

54 54 Projections Views: continued Overview –A fast utilization chart for entire machine across the entire time period


56 56 Projections Conclusions Instrumentation built into runtime Easy to include in Charm++ or AMPI program Working on –Automated analysis –Scaling to tens of thousands of processors –Integration with hardware performance counters

57 57 Multi-run analysis: in progress Collect performance data from different runs –On varying number of processors: –See which functions increase in computation time: Algorithmic overhead –See how the communication costs scale up per processor and total


59 59 Load balancing scenarios Dynamic creation of tasks –Initial vs. continuous –Coarse-grained vs. fine-grained tasks –Master-slave –Tree-structured –Use “seed balancers” in Charm++/AMPI Iterative computations –When there is a strong correlation across iterations: measurement-based load balancers –When the correlation is weak –When there is no correlation: use seed balancers

60 60 Measurement Based Load Balancing Principle of persistence –Object communication patterns and computational loads tend to persist over time –In spite of dynamic behavior Abrupt but infrequent changes Slow and small changes Runtime instrumentation –Measures communication volume and computation time Measurement-based load balancers –Use the instrumented database periodically to make new decisions –Many alternative strategies can use the database

61 61 Periodic Load balancing Strategies Stop the computation? Centralized strategies: –Charm RTS collects data (on one processor) about: Computational Load and Communication for each pair –If you are not using AMPI/Charm, you can do the same instrumentation and data collection –Partition the graph of objects across processors Take communication into account –Pt-to-pt, as well as multicast over a subset –As you map an object, add to the load on both sending and receiving processor The red communication is free, if it is a multicast.

62 62 Object partitioning strategies You can use graph partitioners like METIS, K-R –BUT: the graphs are smaller, and the optimization criteria are different Greedy strategies (a bare-bones sketch follows below) –If communication costs are low: use a simple greedy strategy Sort objects by decreasing load Maintain processors in a heap (by assigned load) –In each step: assign the heaviest remaining object to the least loaded processor –With small-to-moderate communication cost: Same strategy, but add communication costs as you add an object to a processor –Always add a refinement step at the end: Swap work from the most heavily loaded processor to “some other processor” Repeat a few times or until no improvement –Refinement-only strategies
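A bare-bones sketch of the simple greedy strategy described above (ignoring communication costs and the refinement pass; all names are invented for illustration):

  #include <algorithm>
  #include <queue>
  #include <vector>

  struct Proc { double load; int id; };
  struct ByLoad {   // makes the priority_queue a min-heap on assigned load
    bool operator()(const Proc &a, const Proc &b) const { return a.load > b.load; }
  };

  // objLoad[i] = measured load of object i; returns an object-to-processor mapping.
  std::vector<int> greedyMap(const std::vector<double> &objLoad, int nProcs) {
    std::vector<int> order(objLoad.size()), objToProc(objLoad.size());
    for (size_t i = 0; i < order.size(); ++i) order[i] = (int)i;
    // Sort objects by decreasing load.
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return objLoad[a] > objLoad[b]; });
    // Maintain processors in a heap keyed by assigned load.
    std::priority_queue<Proc, std::vector<Proc>, ByLoad> heap;
    for (int p = 0; p < nProcs; ++p) heap.push({0.0, p});
    // Assign the heaviest remaining object to the least loaded processor.
    for (int obj : order) {
      Proc least = heap.top(); heap.pop();
      objToProc[obj] = least.id;
      least.load += objLoad[obj];
      heap.push(least);
    }
    return objToProc;
  }

The communication-aware variants and the refinement step described on this and the next slide would be layered on top of this core loop.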

63 63 Object partitioning strategies When communication cost is significant: –Still use greedy strategy, but: At each assignment step, choose between assigning O to least loaded processor and the processor that already has objects that communicate most with O. –Based on the degree of difference in the two metrics –Two-stage assignments: »In early stages, consider communication costs as long as the processors are in the same (broad) load “class”, »In later stages, decide based on load Branch-and-bound –Searches for optimal, but can be stopped after a fixed time

64 64 Crack Propagation Decomposition into 16 chunks (left) and 128 chunks, 8 for each PE (right). The middle area contains cohesive elements. Both decompositions obtained using Metis. Pictures: S. Breitenfeld, and P. Geubelle As computation progresses, crack propagates, and new elements are added, leading to more complex computations in some chunks

65 65 Load balancer in action Automatic load balancing in crack propagation: 1. Elements added 2. Load balancer invoked 3. Chunks migrated

66 66 Distributed Load balancing Centralized strategies –Still ok for 3000 processors for NAMD Distributed balancing is needed when: –Number of processors is large and/or –load variation is rapid Large machines: –Need to handle locality of communication Topology sensitive placement –Need to work with scant global information Approximate or aggregated global information (average/max load) Incomplete global info (only “neighborhood”) Work diffusion strategies (1980’s work by author and others!) –Achieving global effects by local action…

67 67 Other features Client-Server interface (CCS) Live Visualization support Libraries: –Communication optimization libraries –2D, 3D FFTs, CG,.. Debugger: –freeze/thaw …

68 68 Scaling to PetaFLOPS machines: Advice Dice them fine: –Use a fine grained decomposition –Just enough to amortize the overhead Juggle as much as you can –Keeping communication ops in flight for latency tolerance Avoid synchronizations as much as possible –Use asynchronous reductions, –Async. Collectives in general

69 69 Grainsize control A Simple definition of grainsize: –Amount of computation per message –Problem: short message/ long message More realistic: –Computation to communication ratio

70 70 Grainsize Control Wisdom One may think that: –One should choose the largest grainsize that still generates sufficient parallelism In fact: –One should select the smallest grainsize that will amortize the overhead Total CPU time T: –T = Tseq + (Tseq/g)·Toverhead, since there are Tseq/g grains, each incurring overhead Toverhead
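Working through the formula: the relative overhead is (T - Tseq)/Tseq = Toverhead/g, independent of Tseq. So with, for instance, a per-grain overhead of 10 microseconds, grains of 1 ms already keep the overhead at 1%; the grainsize only needs to be a modest multiple of the overhead, not as large as possible.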

71 71 How to avoid Barriers/Reductions Sometimes, they can be eliminated –With careful reasoning –Somewhat complex programming When they cannot be avoided, –one can often render them harmless Use an asynchronous reduction (not normal MPI) –E.g. in NAMD, energies need to be computed via a reduction and output; they are not used for anything except output –Use an asynchronous reduction, working in the background When it reports to an object at the root, output it

72 72 Asynchronous reductions: Jacobi Convergence check –At the end of each Jacobi iteration, we do a convergence check –Via a scalar reduction (on maxError) But note: –Each processor can maintain old data for one iteration So, use the result of the reduction one iteration later! –The deposit of the reduction is separated from its result –MPI_Ireduce(..) returns a handle (like MPI_Irecv), and later MPI_Wait(handle) will block when you need it to
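A sketch of this pattern in MPI terms. The MPI_Ireduce named on the slide is an AMPI extension; the standard MPI-3 call MPI_Iallreduce is used here as a stand-in, and the helper routine, tolerance, and variable names are invented:

  #include <mpi.h>

  #define TOL 1e-6                 /* assumed convergence tolerance */
  double do_jacobi_sweep(void);    /* assumed: one local sweep, returns local max error */

  void jacobi_loop(void) {
    double sendErr, recvErr;
    double knownErr = 1e300;       /* no global result yet, so assume not converged */
    MPI_Request req = MPI_REQUEST_NULL;
    while (knownErr > TOL) {       /* tests an error that is one iteration old */
      double localMaxErr = do_jacobi_sweep();    /* overlaps the pending reduction */
      if (req != MPI_REQUEST_NULL) {
        MPI_Wait(&req, MPI_STATUS_IGNORE);       /* previous iteration's result is ready */
        knownErr = recvErr;
      }
      sendErr = localMaxErr;       /* safe: no reduction is in flight at this point */
      MPI_Iallreduce(&sendErr, &recvErr, 1, MPI_DOUBLE, MPI_MAX,
                     MPI_COMM_WORLD, &req);      /* completes during the next sweep */
    }
    if (req != MPI_REQUEST_NULL)   /* drain the last outstanding reduction */
      MPI_Wait(&req, MPI_STATUS_IGNORE);
  }

The loop exits one iteration later than a synchronous check would, which is exactly the trade the slide describes: the deposit of the reduction is separated from the use of its result.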

73 73 Asynchronous reductions in Jacobi [Figure: processor timelines with synchronous vs. asynchronous reduction; the gap between compute phases under the synchronous reduction is avoided by the asynchronous one]

74 74 Asynchronous or Split-phase interfaces Notify/wait syncs in CAF

75 75 Case Studies Examples of Scalability Series of examples –Where we attained scalability –What techniques were useful –What lessons we learned Molecular Dynamics: NAMD Rocket Simulation

76 76 Object Based Parallelization for MD: Force Decomposition + Spatial Decomposition Now, we have many objects to load balance: –Each diamond can be assigned to any processor –Number of diamonds (3D): 14 · number of patches

77 77 Bond Forces Multiple types of forces: –Bonds (2), angles (3), dihedrals (4),.. –Luckily, each involves atoms in neighboring patches only Straightforward implementation: –Send a message to all neighbors, receive forces from them –26*2 messages per patch! Instead, we do: –Send to the (7) upstream neighbors –Each force calculated at one patch [Figure: patches A, B, C]

78 78 Virtualized Approach to Implementation: using Charm++ [Figure: 700 VPs, 192 + 144 VPs, 30,000 VPs] These 30,000+ Virtual Processors (VPs) are mapped to real processors by the Charm++ runtime system


80 80 Case Study: NAMD (Molecular Dynamics) NAMD: Biomolecular Simulation on Thousands of Processors. J. C. Phillips, G. Zheng, S. Kumar, and L. V. Kale. Proc. of Supercomputing 2002. Gordon Bell Award. Unprecedented performance for this application: 1.02 TeraFLOPS [Figure: ATPase synthase]

81 81 Scaling to 64K/128K processors of BG/L What issues will arise? –Communication Bandwidth use more important than processor overhead Locality –Global synchronizations Costly, but not because they take longer Rather, small “jitters” have a large impact Sum of max vs. max of sum –Load imbalance is important, but low grainsize is crucial –Critical paths gain importance

82 82 Electronic Structures using CP Car-Parrinello method Based on PINY MD –Glenn Martyna, Mark Tuckerman Data structures: –A bunch of states (say 128) –Represented as 3D arrays of coefficients in G-space, and also 3D arrays in real space –Real-space probability density –S-matrix: one number for each pair of states For orthonormalization –Nuclei Computationally: –Transformation from G-space to real space Uses multiple parallel 3D-FFTs –Sums up real-space densities –Computes energies from the density –Computes forces –Normalizes the G-space wave function

83 83 One Iteration

84 84 Parallel Implementation


86 86 Orthonormalization At the end of every iteration, after updating electron configuration Need to compute (from states) –a “correlation” matrix S, S[i,j] depends on entire data from states i, j –its transform T –Update the values Computation of S has to be distributed –Compute S[i,j,p], where p is plane number –Sum over p to get S[i,j] Actual conversion from S->T is sequential

87 87 Orthonormalization

88 88 Computation/Communication Overlap

89 89 [Figure: object decomposition for one CP iteration: G-space planes (integration, 1D-FFT), real-space planes (2D-FFT), Rho real-space planes, computation of forces on/by nuclei, real-space planes (2D-IFFT), G-space planes (integration, 1D-IFFT), and pair-calculators]


94 94 Rocket Simulation Dynamic, coupled physics simulation in 3D Finite-element solids on unstructured tet mesh Finite-volume fluids on structured hex mesh Coupling every timestep via a least-squares data transfer Challenges: –Multiple modules –Dynamic behavior: burning surface, mesh adaptation Robert Fielder, Center for Simulation of Advanced Rockets Collaboration with M. Heath, P. Geubelle, others

95 95 Application Example: GEN2 [Figure: GEN2 component architecture on MPI/AMPI: Rocpanda, Rocblas, Rocface, Rocman, Roccom; physics modules Rocflo-MP, Rocflu-MP, Rocsolid, Rocfrac, Rocburn2D (ZN, APN, PY); mesh tools Truegrid, Tetmesh, Metis, Gridgen, Makeflo]

96 96 Rocket simulation via virtual processors Scalability challenges: –Multiple independently developed modules, possibly executing concurrently –Evolving simulation Changes the balance between fluid and solid –Adaptive refinements –Dynamic insertion of sub-scale simulation components Crack-driven fluid flow and combustion –Heterogeneous (speed-wise) clusters

97 97 Rocket simulation via virtual processors [Figure: many Rocflo, Rocface, and Rocsolid virtual processors, several of each mapped onto the physical processors]

98 98 AMPI and Roc*: Communication [Figure: Rocflo, Rocface, and Rocsolid modules placed on separate sets of virtual processors] By separating independent modules into separate sets of virtual processors, we gained the flexibility to deal with alternate formulations: fluids and solids executing concurrently OR one after the other; changes in the pattern of load distribution within or across modules


100 100 Performance Prediction on Large Machines Problem: –How to develop a parallel application for a non-existent machine? –How to predict the performance of applications on future machines? –How to do performance tuning without continuous access to a large machine? Solution: –Leverage virtualization –Develop a machine emulator –Simulator: accurate time modeling –Run a program on “100,000 processors” using only hundreds of processors Originally targeted to BlueGene/Cyclops; now generalized (and used for BG/L)

101 101 Why Emulate? Allow development of parallel software –Exposes scalability limitations in data structures, e.g. O(P) arrays are ok if P is not a million –Software is ready before the machine is

102 102 How to emulate 1M processor apps Leverage processor virtualization –Let each virtual processor of Charm++ stand for a real processor of the emulated machine –Adequate if you want to emulate MPI apps on 1M processors!

103 103 Emulation on a Parallel Machine [Figure: simulating (host) processors, each holding simulated multi-processor nodes and simulated processors] Emulating 8M threads on 96 ASCI-Red processors

104 104 How to emulate 1M processor apps A twist: what if you want to emulate a Charm++ app? –E.g. 8M object VPs using 1M target-machine processors? –A little runtime trickery –Target processors are modeled as data structures, while VPs remain VPs!

105 105 Memory Limit? Some applications have low memory use –Molecular dynamics [Some] Large machines may have low memory per processor –E.g. BG/L: 256 MB for a 2-processor node –A BG/C design: 16-32 MB for a 32-processor node A more general solution is still needed: –Provided by the out-of-core execution capability of Charm++

106 106 Message Driven Execution and Out-of-Core Execution [Figure: scheduler and message queue] Virtualization leads to message-driven execution. So we can: –Prefetch data accurately –Get automatic out-of-core execution

107 107 A Success Story Emulation-based implementation of the lower layers, as well as Charm++ and AMPI, was completed last year As a result: –The BG/L port of Charm++/AMPI was accomplished in 1-2 days –Actually, 1-2 hours for the basic port –1-2 days to fix an OS-level “bug” that prevented user-level multi-threading

108 108 Emulator Performance Scalable Emulating a real-world MD application on a 200K processor BG machine Gengbin Zheng, Arun Singla, Joshua Unger, Laxmikant V. Kalé, ``A Parallel-Object Programming Model for PetaFLOPS Machines and Blue Gene/Cyclops'' in NGS Program Workshop, IPDPS02

109 109 Performance Prediction How to predict component performance? –Multiple resolution levels –Sequential component: user-supplied expressions; timers; performance counters; instruction-level simulation –Communication component: simple latency-based network model; contention-based network simulation Parallel Discrete Event Simulation (PDES) –Each logical processor (LP) has a virtual clock –Events are time-stamped –The state of an LP changes when an event arrives at it –Protocols: conservative vs. optimistic Conservative: (examples: DaSSF, MPISIM) Optimistic: (examples: Time Warp, SPEEDES)

110 110 Why not use existing PDES? Major synchronization overheads –Checkpointing overhead –Rollback overhead We can do better –Exploit the inherent determinacy of the parallel application –Most parallel programs are written to be deterministic

111 111 Categories of Applications Linear-order applications –No wildcard receives –Strong determinacy, no timestamp correction necessary Reactive applications (atomic) –Message driven objects –Methods execute as corresponding messages arrive Multi-dependent applications –Irecvs with WaitAll –Uses of structured dagger to capture dependency Gengbin Zheng, Gunavardhan Kakulapati, Laxmikant V. Kalé, ``BigSim: A Parallel Simulator for Performance Prediction of Extremely Large Parallel Machines'', in IPDPS 2004

112 112 Architecture of BigSim Simulator [Figure: components: Charm++ and MPI applications; BigSim Emulator; Charm++ Runtime; Load Balancing Module; Online PDES engine; Instruction Sim (RSim, IBM,..); Performance counters; Simple Network Model; simulation output trace logs; performance visualization (Projections)]

113 113 Architecture of BigSim Simulator [Figure: the same components as the previous slide, plus the BigNetSim (POSE) Network Simulator performing offline PDES on the simulation output trace logs]

114 114 Big Network Simulation Simulate network behavior: packetization, routing, contention, etc. Incorporates post-mortem timestamp correction via POSE Currently models: torus (BG/L), fat-tree (QsNet) [Figure: BG log files (tasks & dependencies) from the BigSim emulator go through POSE timestamp correction in BigNetSim, producing timestamp-corrected tasks]

115 115 BigSim Validation on Lemieux

116 116 Performance of the BigSim Simulator [Figure: speedup vs. number of real processors (PSC Lemieux)]

117 117 FEM simulation Simple 2D structural simulation in AMPI 5 million element mesh 16k BG processors Running on only 32 PSC Lemieux processors

118 118 Case Study - LeanMD Molecular dynamics simulation designed for large machines K-away cut-off parallelization Benchmark: er-gre with 3-away, 36573 atoms 1.6 million objects vs. 6000 in 1-away 8-step simulation of a 32k-processor BG machine Running on 400 PSC Lemieux processors Performance visualization tools

119 119 Load Imbalance Histogram

120 120 Performance visualization


122 122 Component Frameworks Motivation –Reduce tedium of parallel programming for commonly used paradigms –Encapsulate required parallel data structures and algorithms –Provide easy to use interface, Sequential programming style preserved No alienating invasive constructs –Use adaptive load balancing framework Component frameworks –FEM –Multiblock –AMR

123 123 FEM framework Present clean, “almost serial” interface: –Hide parallel implementation in the runtime system –Leave physics and time integration to user –Users write code similar to sequential code –Or, easily modify sequential code Input: –connectivity file (mesh), boundary data and initial data Framework: –Partitions data, and –Starts driver for each chunk in a separate thread –Automates communication, once user registers fields to be communicated –Automatic dynamic load balancing

124 124 Why use the FEM Framework? Makes parallelizing a serial code faster and easier –Handles mesh partitioning –Handles communication –Handles load balancing (via Charm) Allows extra features –IFEM Matrix Library –NetFEM Visualizer –Collision Detection Library

125 125 Serial FEM Mesh
  Element   Surrounding Nodes
  E1        N1 N3 N4
  E2        N1 N2 N4
  E3        N2 N4 N5

126 126 Partitioned Mesh
  Chunk A:
    Element   Surrounding Nodes
    E1        N1 N3 N4
    E2        N1 N2 N3
  Chunk B:
    Element   Surrounding Nodes
    E1        N1 N2 N3
  Shared Nodes:
    A    B
    N2   N1
    N4   N3

127 127 FEM Mesh: Node Communication Summing forces from other processors only takes one call: FEM_Update_field Similar call for updating ghost regions
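Roughly, the time loop inside a chunk's driver using the call named above looks like the following sketch. This is from memory and heavily simplified: the surrounding setup (registering the force field to obtain forceField, obtaining the chunk's nodes array) and the compute/advance routines are assumptions, not the framework's verbatim API:

  /* Inside the user's per-chunk driver routine, which the framework runs in its own thread. */
  for (int step = 0; step < nSteps; ++step) {
    compute_local_element_forces(nodes);  /* assumed user routine: purely local work */
    FEM_Update_field(forceField, nodes);  /* sums the registered force field at nodes
                                             shared with other chunks */
    advance_nodes(nodes);                 /* assumed user routine */
  }

The point of the framework is that this is the only parallel call in the loop; partitioning, communication schedules, and (via Charm) load balancing happen underneath it.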

128 128 FEM Framework Users: CSAR Rocflu fluids solver, a part of GENx Finite-volume fluid dynamics code Uses FEM ghost elements Author: Andreas Haselbacher (Picture: Robert Fielder, Center for Simulation of Advanced Rockets)

129 129 FEM Experience Previous: –3-D volumetric/cohesive crack propagation code (P. Geubelle, S. Breitenfeld, et al.) –3-D dendritic growth fluid solidification code (J. Dantzig, J. Jeong) –Adaptive insertion of cohesive elements Mario Zaczek, Philippe Geubelle Performance data –Multi-grain contact (in progress) Spandan Maiti, S. Breitenfeld, O. Lawlor, P. Geubelle Using the FEM framework and collision detection –NSF-funded project –Space-time meshes Did initial parallelization in 4 days

130 130 Performance data: ASCI Red Mesh with 3.1 million elements Speedup of 1155 on 1024 processors.

131 131 Dendritic Growth Studies the evolution of solidification microstructures using a phase-field model computed on an adaptive finite element grid Adaptive refinement and coarsening of the grid involves re-partitioning Jon Dantzig et al. with O. Lawlor and others from PPL

132 132 “Overhead” of Multipartitioning Conclusion: the overhead of virtualization is small, and in fact performance often benefits from the automatic adaptive overlap of computation and communication that it creates

133 133 Parallel Collision Detection Detect collisions (intersections) between objects scattered across processors Approach, based on Charm++ Arrays Overlay regular, sparse 3D grid of voxels (boxes) Send objects to all voxels they touch Collide objects within each voxel independently and collect results Leave collision response to user code

134 134 Parallel Collision Detection Results: 2 µs per polygon; good speedups to 1000s of processors ASCI Red, 65,000 polygons per processor (scaled problem), up to 100 million polygons This was a significant improvement over the state of the art. Made possible by virtualization, and –Asynchronous, as-needed creation of voxels –Localization of communication: a voxel is often on the same processor as the contributing polygon

135 135 Summary Processor virtualization is a powerful technique Charm++/AMPI are production-quality systems –with bells and whistles –Can scale to petaFLOPS-class machines Domain-specific frameworks –Can raise the level of abstraction and promote reuse –Unstructured mesh (FEM) framework Next: compiler support, new coordination mechanisms Software available from http://charm.cs.uiuc.edu

136 136 Optimizing for Communication Patterns The parallel-objects Runtime System can observe, instrument, and measure communication patterns –Communication is from/to objects, not processors –Load balancers can use this to optimize object placement –Communication libraries can optimize By substituting most suitable algorithm for each operation Learning at runtime V. Krishnan, MS Thesis, 1996

137 137 Molecular Dynamics: Benefits of Avoiding Barriers In NAMD: –The energy reductions were made asynchronous –No other global barriers are used in cut-off simulations This came in handy when: –Running on Pittsburgh’s Lemieux (3000 processors) –The machine (plus our way of using the communication layer) produced unpredictable, random delays in communication A send call would remain stuck for 20 ms, for example How did the system handle it? –See the timeline plots

138 138 Golden Rule of Load Balancing Golden Rule: it is OK if a few processors idle, but avoid having processors that are overloaded with work Finish time = max over i of {time on the i’th processor} (excepting data dependence and communication overhead issues) Example: 50,000 tasks of equal size, 500 processors: A: All processors get 99, except the last 5 get 100+99 = 199 OR B: All processors have 101, except the last 5 get 1 Fallacy: the objective of load balancing is to minimize variance in load across processors Identical variance, but situation A is much worse!

139 139 Amdahl’s Law and grainsize Before we get to load balancing: Original “law”: –If a program has a K% sequential section, then speedup is limited to 100/K, even if the rest of the program is parallelized completely Grainsize corollary: –If any individual piece of work takes more than K time units, and the sequential program takes Tseq, speedup is limited to Tseq / K So: –Examine performance data via histograms to find the sizes of remappable work units –If some are too big, change the decomposition method to make smaller units
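A quick worked instance of the corollary: if Tseq is 100 seconds and the largest indivisible piece of work takes 0.5 seconds, then no matter how many processors are used the speedup cannot exceed 100/0.5 = 200; hence the advice above to examine the histogram of remappable work-unit sizes.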

140 140 Grainsize: LeanMD for Blue Gene/L BG/L is a planned IBM machine with 128k processors Here, we need even more objects: –Generalize hybrid decomposition scheme 1-away to k-away 2-away : cubes are half the size.

141 141 [Figure: decompositions with 5,000 VPs, 76,000 VPs, and 256,000 VPs]

142 142 New strategy

