BigSim: Simulating PetaFLOPS Supercomputers

Gengbin Zheng
Parallel Programming Laboratory
University of Illinois at Urbana-Champaign

5th Annual Workshop on Charm++ and its Applications

Introduction
- Petascale machines are being built.
- It is hard to understand a parallel application without actually running it.
- Tools that predict application performance before the machine is built also help with debugging and tuning.
- Such tools allow easier "offline" experimentation.
- They provide feedback for machine designers.

Overview
- Prediction accuracy
  - Sequential performance
  - Parallel performance (message passing)
- Flexibility: different levels of fidelity and their tradeoffs
- Challenges for usability
  - Memory constraints
  - Time cost
  - How to analyze results: scalable performance analysis tools
- Portability
- Scalability
- The BigSim system

Summary of Our Approach
- Parallel Discrete Event Simulation (PDES): the operation of a system is represented as a chronological sequence of events
  - Event-driven simulation method
- Two-phase simulation:
  - BigSim Emulator, capable of running the application (Charm++, Adaptive MPI / MPI)
    - Predicts sequential execution blocks
    - Generates trace logs
  - Parallel Discrete Event Simulator (POSE-based)
    - Predicts message-passing performance
    - Implemented on Charm++
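To make the event-driven method concrete, here is a minimal sequential sketch of a discrete-event loop: a priority queue delivers events in timestamp order regardless of the order they were posted. This illustrates the general idea only; it is not BigSim's or POSE's implementation.

    #include <queue>
    #include <vector>
    #include <functional>
    #include <cstdio>

    // One simulation event: a virtual timestamp and an action to perform.
    struct Event {
        double time;                    // virtual time at which the event fires
        std::function<void()> action;   // what happens when it fires
    };

    // Order the priority queue so the earliest timestamp comes out first.
    struct Later {
        bool operator()(const Event& a, const Event& b) const { return a.time > b.time; }
    };

    int main() {
        std::priority_queue<Event, std::vector<Event>, Later> pending;

        // Seed the simulation with two events posted out of order; the
        // queue delivers them chronologically, which is the essence of
        // discrete-event simulation.
        pending.push({2.0, []{ std::puts("event at t=2.0"); }});
        pending.push({1.0, []{ std::puts("event at t=1.0"); }});

        while (!pending.empty()) {
            Event e = pending.top();
            pending.pop();
            e.action();   // handlers may push further events here
        }
    }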

BigSim System Architecture
(Architecture diagram.) Components: Charm++ and MPI applications; BigSim Emulator; Charm++ runtime with load-balancing module; performance counters; instruction-level simulators (RSim, IBM, ...); simple network model; simulation output trace logs; BigSim Simulator (offline PDES network simulator); performance visualization (Projections).

BigSim Emulator – the first step
- Actual execution of real applications on much smaller machines
- Models each target processor as a virtual processor
- Uses lightweight (Converse) threads to simulate each target processor
- Runs Charm++/AMPI applications without change, provided they avoid global variables (or handle them via the swap-globals / copy-globals schemes); see the sketch below
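The global-variable restriction exists because many emulated target processors share one host address space, so a single C global would be silently shared by all of them. A minimal sketch of the privatization idea, with illustrative names (this is not the Converse/Charm++ API):

    #include <vector>
    #include <cstdio>

    // Illustrative only: state that would have been a global variable,
    // privatized so each emulated target processor owns its own copy.
    struct VPState {
        int iteration = 0;   // was: "int iteration;" at file scope
    };

    // Code touches only the state of the virtual processor it runs on.
    void step(VPState& vp) {
        vp.iteration++;      // safe: modifies only this VP's copy
    }

    int main() {
        std::vector<VPState> vps(8);        // emulate 8 target processors
        for (auto& vp : vps) step(vp);
        std::printf("vp[0].iteration = %d\n", vps[0].iteration);
    }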

Emulation on a Parallel Machine
(Diagram: target nodes, the simulating (host) processor, and hardware threads.)

Predicting Sequential Performance
- Different levels of fidelity:
  - Direct mapping
  - Performance counters
  - Instruction-level simulator
- Tradeoff between time cost and accuracy
- Exploits the inherent determinacy of Charm++ applications
  - Out-of-order messages do not change the computation's behavior (structured dagger)
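As an illustration of the cheapest option, direct mapping: time the block on the host and scale by an assumed host-to-target performance ratio. The ratio and function names below are invented for the example.

    #include <chrono>
    #include <cstdio>

    // Illustrative scaling factor: how fast we assume the target
    // processor is relative to the host running the emulation.
    constexpr double kHostToTargetRatio = 0.5;  // target assumed 2x faster

    // The sequential block whose target-time we want to predict.
    void compute_block() {
        volatile double x = 0;
        for (int i = 0; i < 1000000; i++) x += i * 1e-9;
    }

    int main() {
        auto t0 = std::chrono::steady_clock::now();
        compute_block();
        auto t1 = std::chrono::steady_clock::now();
        double host_s = std::chrono::duration<double>(t1 - t0).count();
        // Direct mapping: predicted target time = measured host time * ratio.
        std::printf("host %.6f s -> predicted target %.6f s\n",
                    host_s, host_s * kHostToTargetRatio);
    }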

Trace Logs
- Event logs are generated for each target processor
- Predicted times on execution blocks, with timestamps
- Event dependencies
- Message-sending events

Example log entries:

    [22] name:msgep (srcnode:0 msgID:21) ep:1
         [[ recvtime:0.000498 startTime:0.000498 endTime:0.000498 ]]
         backward: forward: [23]
    [23] name:Chunk_atomic_0 (srcnode:-1 msgID:-1) ep:0
         [[ recvtime:-1.000000 startTime:0.000498 endTime:0.000503 ]]
         msgID:3 sent:0.000498 recvtime:0.000499 dstPe:7 size:208
         msgID:4 sent:0.000500 recvtime:0.000501 dstPe:1 size:208
         backward: [22] forward: [24]
    [24] name:Chunk_overlap_0 (srcnode:-1 msgID:-1) ep:0
         [[ recvtime:-1.000000 startTime:0.000503 endTime:0.000503 ]]
         backward: [0x80a7af0 23] forward: [25] [28]
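As a reading aid, the fields of each entry can be summarized in a struct like the following. Field names mirror the log text above; this is not BigSim's actual data structure.

    #include <string>
    #include <vector>

    // One message send recorded inside an execution block.
    struct MsgSend {
        int    msgID;      // e.g. msgID:3
        double sentTime;   // sent:0.000498
        double recvTime;   // predicted arrival: recvtime:0.000499
        int    dstPe;      // destination processor: dstPe:7
        int    sizeBytes;  // size:208
    };

    // One execution block (event) in a per-target-processor trace log.
    struct LogEntry {
        int         seq;        // [22], [23], ...
        std::string name;       // e.g. "Chunk_atomic_0"
        int         ep;         // entry point id
        double      recvTime;   // -1 when not triggered by a message
        double      startTime;  // predicted start of the block
        double      endTime;    // predicted end of the block
        std::vector<int>     backward;  // dependencies on earlier events
        std::vector<int>     forward;   // events this one enables
        std::vector<MsgSend> sends;     // messages sent during the block
    };

    int main() {
        LogEntry e{22, "msgep", 1, 0.000498, 0.000498, 0.000498, {}, {23}, {}};
        return e.seq == 22 ? 0 : 1;
    }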

Predicting Message-Passing Performance – the second step
- Parallel Discrete Event Simulation
  - Needed to correct causality errors due to out-of-order messages
  - Timestamps are re-adjusted without rerunning the actual application
- Simulates network behavior: packetization, routing, contention, etc.
  - A simple latency-based network model and a contention-based network model
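A sketch of what the simple latency-based model computes, together with the kind of causality correction applied during simulation. The latency and bandwidth constants are invented for the example; BigSim's actual models are more detailed.

    #include <algorithm>
    #include <cstdio>

    // Illustrative parameters for a simple latency-based network model.
    constexpr double kAlpha     = 5e-6;   // fixed per-message latency (s)
    constexpr double kBandwidth = 1e9;    // bytes per second

    // Predicted arrival time of a message under the simple model.
    double predict_recv(double sendTime, int sizeBytes) {
        return sendTime + kAlpha + sizeBytes / kBandwidth;
    }

    // Causality correction: an event cannot start before the message that
    // triggers it arrives, nor before the previous block on that processor ends.
    double corrected_start(double msgArrival, double prevBlockEnd) {
        return std::max(msgArrival, prevBlockEnd);
    }

    int main() {
        double arrive = predict_recv(0.000498, 208);
        std::printf("arrival %.9f, start %.9f\n",
                    arrive, corrected_start(arrive, 0.000503));
    }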

PDES and Charm++
- PDES is a naturally message-driven activity
- Its large number of entities maps naturally to virtual processors
- Fine-grained PDES
- Migratable PDES entities and load balancing

Network Simulator Design
(Diagram.)

Network Simulator: Data Flow
(Diagram.)

Network Communication Pattern Analysis
(Plot: NAMD with apoa1, 15 timesteps.)

Network Communication Pattern Analysis
(Plot: data transferred (KB) in a single timestep.)

NAMD Network Simulation (Scalability)
- Simulating NAMD on 2048 processors with a 3D torus topology
- Tests ran on the Turing Apple cluster with Myrinet

Terry L. Wilmarth, Gengbin Zheng, Eric J. Bohm, Yogesh Mehta, Praveen Jagadishprasad, Laxmikant V. Kalé, "Performance Prediction using Simulation of Large-scale Interconnection Networks in POSE", PADS 2005.

LeanMD Performance Analysis
- Benchmark: 3-away ER-GRE
  - 36,573 atoms
  - 1.6 million objects
- 8-step simulation
- 64K simulated BlueGene processors
- Running on PSC Lemieux

Integration with Instruction-Level Simulators
- Instruction-level simulators typically are:
  - Standalone applications
  - Sequential in execution
  - Very slow
- An interpolation-based scheme (see the sketch below):
  - Use the results of a smaller-scale instruction-level simulation to interpolate for large datasets
  - Do a least-squares fit to determine the coefficients of an approximating polynomial
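A minimal sketch of the least-squares step, fitting a degree-1 polynomial t ≈ a + b·n to timings gathered at small scales and extrapolating to a larger problem size. The sample points are invented; real usage would fit over the parameters of the sequential execution blocks.

    #include <cstdio>
    #include <vector>

    // Least-squares fit of t = a + b*n to sample points (n_i, t_i).
    // Closed form for one independent variable; higher-degree polynomial
    // fits solve the analogous normal equations.
    void fit_line(const std::vector<double>& n, const std::vector<double>& t,
                  double& a, double& b) {
        double sn = 0, st = 0, snn = 0, snt = 0;
        size_t m = n.size();
        for (size_t i = 0; i < m; i++) {
            sn += n[i]; st += t[i]; snn += n[i] * n[i]; snt += n[i] * t[i];
        }
        b = (m * snt - sn * st) / (m * snn - sn * sn);
        a = (st - b * sn) / m;
    }

    int main() {
        // Invented sample: block times measured at small problem sizes.
        std::vector<double> size = {1000, 2000, 4000, 8000};
        std::vector<double> time = {0.011, 0.019, 0.042, 0.081};
        double a, b;
        fit_line(size, time, a, b);
        // Extrapolate the fitted polynomial to a larger problem size.
        std::printf("predicted t(100000) = %.4f s\n", a + b * 100000);
    }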

Case Study: BigSim / Mambo
The application marks the sequential blocks to be timed:

    void func() {
        StartBigSim();
        ...
        EndBigSim();
    }

(Data-flow diagram: BigSim parallel emulation produces trace files and parameter files for the sequential blocks; Mambo gives cycle-accurate predictions of the sequential blocks on a POWER7 processor, the target system; where results are not available, interpolation fills in; the sequential timings are then replaced, yielding adjusted trace files for the BigSim parallel simulation.)

Future Work
- Use performance counters to predict sequential timing
- Out-of-core execution for memory-bound applications

Out-of-core Emulation
- Motivation:
  - Physical memory is shared among the many emulated target processors
  - The OS virtual memory system would not handle this access pattern well
- Message-driven execution:
  - Peeking at the message queue reveals what will execute next, enabling prefetch (see the sketch below)
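A sketch of the peek-and-prefetch idea: the message queue tells the emulator which virtual processor's state will be needed next, so that state can be brought into memory ahead of time. The queue and storage types here are invented for illustration.

    #include <deque>
    #include <cstdio>

    // Invented types for illustration: a message names the virtual
    // processor whose state it will run on.
    struct Message { int targetVP; };

    struct VPStore {
        // Stand-in for a backing store; a real out-of-core emulator
        // would asynchronously read/write VP state to disk.
        void prefetch(int vp) { std::printf("prefetching VP %d state\n", vp); }
        void execute(const Message& m) { std::printf("executing on VP %d\n", m.targetVP); }
    };

    int main() {
        VPStore store;
        std::deque<Message> queue = {{3}, {7}, {3}};

        while (!queue.empty()) {
            // Peek past the head of the queue: the next message tells us
            // which VP's state to fetch while the current one executes.
            if (queue.size() > 1) store.prefetch(queue[1].targetVP);
            store.execute(queue.front());
            queue.pop_front();
        }
    }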