BigSim: Simulating PetaFLOPS Supercomputers Gengbin Zheng Parallel Programming Laboratory University of Illinois at Urbana-Champaign
5th Annual Workshop on Charm++ and its Applications Introduction Petascale machines are being built It is hard to understand parallel application without actually running it Tools that allow one to predict the performance of applications before machines are built, also help debugging and tuning Allows easier "offline" experimentation Provides a feedback for machine designers 4/10/2019 5th Annual Workshop on Charm++ and its Applications
5th Annual Workshop on Charm++ and its Applications Overview Prediction accuracy Sequential performance Parallel performance (message passing) Flexibility - different levels of fidelity/tradeoffs Challenges for usability Memory constraints Time cost How to analyze result - scalable performance analysis tools Portability Scalability BigSim system 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Summary of Our Approach Parallel Discrete Event Simulation (PDES) the operation of a system is represented as a chronological sequence of events Event-driven simulation method Two phase simulation BigSim Emulator, capable of running application Charm++, Adaptive MPI / MPI Predict sequential execution blocks Generate trace logs Parallel Discrete Event Simulator (POSE-based) Predict message passing performance Implemented on Charm++ 4/10/2019 5th Annual Workshop on Charm++ and its Applications
BigSim System Architecture Performance visualization (Projections) Offline PDES Network Simulator BigSim Simulator Simulation output trace logs Charm++ Runtime Load Balancing Module Performance counters Instruction Sim (RSim, IBM, ..) Simple Network Model BigSim Emulator Charm++ and MPI applications 4/10/2019 5th Annual Workshop on Charm++ and its Applications 5
BigSim Emulator – the first step Actual execution of real applications on much smaller machines Model each target processor as virtual processor Using light-weight (Converse) threads simulating each target processor Charm++/AMPI applications without change No global variables/swap-globals/copy-globals 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Emulation on a Parallel Machine Target Nodes Simulating (Host) Processor Hardware thread 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Predicting Sequential Performance Different level of fidelity: Direct mapping Performance counter Instruction-level simulator Tradeoff between time cost and accuracy Exploit inherent determinacy of Charm++ application Out of order messages do not change the computation behavior (structured dagger) Different approaches 4/10/2019 5th Annual Workshop on Charm++ and its Applications
5th Annual Workshop on Charm++ and its Applications Trace Logs Event logs are generated for each target processor Predicted time on execution blocks with timestamps Event dependencies Message sending events [22] name:msgep (srcnode:0 msgID:21) ep:1 [[ recvtime:0.000498 startTime:0.000498 endTime:0.000498 ]] backward: forward: [23] [23] name:Chunk_atomic_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime:-1.000000 startTime:0.000498 endTime:0.000503 ]] msgID:3 sent:0.000498 recvtime:0.000499 dstPe:7 size:208 msgID:4 sent:0.000500 recvtime:0.000501 dstPe:1 size:208 backward: [22] forward: [24] [24] name:Chunk_overlap_0 (srcnode:-1 msgID:-1) ep:0 [[ recvtime:-1.000000 startTime:0.000503 endTime:0.000503 ]] backward: [0x80a7af0 23] forward: [25] [28] 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Predicting message passing performance – second step Parallel Discrete Event Simulation Needed for correcting causality errors due to out-of-order messages Timestamps are re-adjusted without rerunning the actual application Simulate network behaviours: packetization, routing, contention, etc simple latency based network model, and contention-based network model 4/10/2019 5th Annual Workshop on Charm++ and its Applications
5th Annual Workshop on Charm++ and its Applications PDES and Charm++ PDES is a naturally message-driven activity Large number of entities naturally maps to virtual processors Fine grained PDES Migratable PDES entities and load balancing 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Network Simulator Design 4/10/2019 5th Annual Workshop on Charm++ and its Applications 12
Network Simulator: Data Flow 4/10/2019 5th Annual Workshop on Charm++ and its Applications 13
Network Communication Pattern Analysis NAMD with apoa1 15 timestep 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Network Communication Pattern Analysis Data transferred (KB) in a single time step 4/10/2019 5th Annual Workshop on Charm++ and its Applications
NAMD network simulation (scalability) Simulating NAMD on 2048 processors with 3D Torus topology Tests ran on Turing Apple cluster with Myrinet Terry L. Wilmarth, Gengbin Zheng, Eric J. Bohm, Yogesh Mehta, Praveen Jagadishprasad, Laxmikant V. Kalé, ``Performance Prediction using Simulation of Large-scale Interconnection Networks in POSE'', PADS 2005 4/10/2019 5th Annual Workshop on Charm++ and its Applications
LeanMD Performance Analysis Benchmark 3-away ER-GRE 36573 atoms 1.6 million objects 8 step simulation 64k BG processors Running on PSC Lemieux 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Integration with Instruction level simulators Instruction level simulators typically are Standalone applications Sequential execution Very slow An interpolation-based scheme Use result of a smaller scale instruction level simulation to interpolate for large dataset do a least-squares fit to determine the coefficients of an approximation polynomial function 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Case study: BigSim / Mambo void func( ) { StartBigSim( ) … EndBigSim( ) } Mambo Prediction for Target System Cycle-accurate prediction of sequential blocks on POWER7 processor BigSim Parallel Emulation Interpolation BigSim Parallel Simulation Results are not available + Replace sequential timing Trace files Parameter files for sequential blocks Adjusted trace files 4/10/2019 5th Annual Workshop on Charm++ and its Applications
5th Annual Workshop on Charm++ and its Applications Future Work Use performance counter to predict sequential timing Out-of-core execution for memory bound applications 4/10/2019 5th Annual Workshop on Charm++ and its Applications
Out-of-core Emulation Motivation Physical memory is shared VM system would not handle well Message driven execution Peek msg queue => what execute next? (prefetch) 4/10/2019 5th Annual Workshop on Charm++ and its Applications