BigSim: Large Parallel Machine Simulation. Presented by Eric Bohm, PPL, Charm++ Workshop 2004.

Presentation transcript:

BigSim: Large Parallel Machine Simulation
Presented by Eric Bohm, PPL
Charm++ Workshop 2004

Motivations
● Big machines are coming!
– BG/L (128,000 processors)
– ASCI Purple
– Red Storm
● Can your application scale to 128,000 processors?
– Not without a lot of wasted runtime on a petascale machine
– How much runtime can you get on hardware that isn't available yet?

Approach
● Processor simulation
– Coarse-grained emulation
– Fine-grained instruction simulation
● Network simulation
– Coarse-grained latency simulation
– Fine-grained transport layer
● Composition
– Online: run it all at once
– Offline: break the simulation up into levels

The Medium is the Message
● Sequential performance is not the key to scalability
● Problem decomposition
– Load balancing
– Timing of result phases
● Communication
– Timing and speed
– Network contention

Life is Short
● Detail, speed, generality: choose two.
– The more accuracy you want, the longer the simulation takes to run and the more architecture-specific it must be
● We picked speed and generality
– Coarse-grained processor emulation
– Coarse-grained communication latency model (sketched below)
● We want it all
– Let the user add detail during post-mortem analysis
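To make the latency-model bullet concrete: a coarse-grained model of this kind typically charges a fixed per-message software overhead plus bandwidth and per-hop terms. The form below is a representative assumption, not necessarily the exact model BigSim uses:

    T(m, h) = \alpha + m / \beta + h \cdot \delta

where \alpha is the per-message overhead, m the message size, \beta the link bandwidth, h the hop count, and \delta the per-hop latency. A model like this is cheap to evaluate during emulation and needs only a handful of measured parameters per target machine.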

Paths Not Taken
● Instruction-level simulation
– Architecture-specific complexity: pipelines, branch prediction, multiple instructions per cycle, compiler optimizations, etc.
– Detailed instruction simulators are heavyweight sequential applications
– This level of accuracy is not vital to parallel performance optimization of scientific applications
– For sequential performance measurement, use sequential optimization techniques

BigSim Features
● Choose network size and topology
● Configurable performance prediction methods
● Compile AMPI and Charm++/SDAG programs to run on the emulator (example below)
● Supports standard Charm++ frameworks
● Projections tracing for performance analysis
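The emulator point is worth making concrete: an ordinary MPI program such as the minimal sketch below needs no source changes to run on BigSim, because AMPI implements MPI on top of Charm++ and virtualizes each rank. The program itself is plain MPI; the BigSim-specific part is entirely in how it is compiled and launched, which depends on the local Charm++ build and is omitted here.

    #include <mpi.h>
    #include <cstdio>

    // Minimal MPI ring exchange: each rank passes its id to the next rank.
    // Under AMPI these ranks become virtual processors, so the emulator can
    // run far more simulated processors than physical ones.
    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int token = rank;
        int next = (rank + 1) % size;
        int prev = (rank + size - 1) % size;
        int received = -1;
        MPI_Sendrecv(&token, 1, MPI_INT, next, 0,
                     &received, 1, MPI_INT, prev, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank %d of %d received token from rank %d\n",
                    rank, size, received);
        MPI_Finalize();
        return 0;
    }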

BigSim Architecture
[Architecture diagram] Components: Charm++ and MPI applications; the BigSim Emulator on the Charm++ runtime with an online PDES engine; pluggable detail sources (instruction simulators such as RSim and IBM's, a simple network model, performance counters); the load balancing module; simulation output trace logs feeding the offline BigNetSim (POSE) network simulator; and performance visualization with Projections.

BigSim Emulator
● Emulate the full machine on existing parallel machines
– Actually run a parallel program with multi-million-way parallelism
● Started by mimicking the Blue Gene low-level API
● Machine layer abstraction
– Many multiprocessor (SMP) nodes connected via message passing

Emulation
[Diagram: each simulating (host) processor emulates several multiprocessor nodes, each containing multiple simulated processors]

BigSim Emulator: Functional View
[Diagram of a target node: worker processors and communication processors share an incoming buffer (inBuff); messages are routed to per-worker affinity message queues, a shared non-affinity message queue, and a correction queue (CorrectionQ), all driven by the Converse scheduler and its queue. A sketch of these structures follows.]
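A highly simplified sketch of the data structures the functional view implies. All names here are illustrative stand-ins, not the emulator's actual types:

    #include <queue>
    #include <vector>

    // Illustrative emulator message.
    struct BgMsg {
        double recvTime;   // simulated arrival time
        int targetWorker;  // specific worker processor, or -1 for any
    };

    // One emulated (target) node: communication processors fill the
    // incoming buffer; dispatch() routes each message to the affinity
    // queue of its target worker, or to the shared non-affinity queue.
    struct BgNode {
        std::queue<BgMsg> inBuff;                  // filled by comm processors
        std::vector<std::queue<BgMsg>> affinityQ;  // one queue per worker
        std::queue<BgMsg> nonAffinityQ;            // served by any worker
        std::queue<BgMsg> correctionQ;             // timestamp corrections

        void dispatch() {
            while (!inBuff.empty()) {
                BgMsg m = inBuff.front();
                inBuff.pop();
                if (m.targetWorker >= 0) affinityQ[m.targetWorker].push(m);
                else                     nonAffinityQ.push(m);
            }
        }
    };

    int main() {
        BgNode node;
        node.affinityQ.resize(2);    // two worker processors
        node.inBuff.push({0.0, 1});  // bound to worker 1
        node.inBuff.push({0.1, -1}); // any worker may execute it
        node.dispatch();
        return 0;
    }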

Simulation
● Parallel Discrete Event Simulation
– Machine behaviors can be modeled as events that begin at a particular time and last for a set duration
– Direct execution or trace-driven
● Charm++ allows out-of-order message delivery
● Dependent events may need to execute in an order different from their arrival order
● Need timestamp correction based on dependencies (sketched below)
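A minimal sketch of the timestamp-correction idea, under the assumption that each event records its simulated arrival time, its duration, and at most one prerequisite: the corrected start time is the later of the event's arrival and its prerequisite's completion. The names and the single-dependency restriction are illustrative simplifications:

    #include <algorithm>
    #include <cstdio>

    // Illustrative event record for timestamp correction.
    struct Event {
        double arrivalTime;  // simulated time the triggering message arrives
        double duration;     // simulated execution cost
        double startTime;    // corrected start time (computed)
        double endTime;      // corrected end time (computed)
        int dependsOn;       // index of an earlier prerequisite event, or -1
    };

    // An event cannot start before its message has arrived, nor before its
    // prerequisite has finished. Assumes prerequisites appear earlier in the
    // array, so one forward pass suffices.
    void correctTimestamps(Event* ev, int n) {
        for (int i = 0; i < n; ++i) {
            double ready = ev[i].arrivalTime;
            if (ev[i].dependsOn >= 0)
                ready = std::max(ready, ev[ev[i].dependsOn].endTime);
            ev[i].startTime = ready;
            ev[i].endTime = ready + ev[i].duration;
        }
    }

    int main() {
        Event ev[2] = {
            {0.0, 5.0, 0.0, 0.0, -1},  // event 0: no dependency, runs 0..5
            {2.0, 3.0, 0.0, 0.0, 0},   // event 1: arrives at t=2, needs event 0
        };
        correctTimestamps(ev, 2);
        std::printf("event 1 starts at t=%.1f, not at its arrival t=2.0\n",
                    ev[1].startTime);
        return 0;
    }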

A Tale of Two Networks
[Diagrams: a direct network and an indirect network]

Post-Mortem Network Simulation
● Run the application on the emulator and gather event trace logs recording, per event (see the record sketched below):
– source
– destinations
– time stamp
– event dependencies
– message size
● Replay the logs on a network simulator that models:
– contention
– topology
– routing algorithms
– packetization
– collective communication
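What one such trace record might hold, mirroring the fields listed above. The real BigSim logs are a binary format with more detail, so this struct is illustrative only:

    #include <cstdint>
    #include <vector>

    // Illustrative trace-log record for one emulated event.
    struct TraceEvent {
        int32_t source;                     // emulated processor that sent it
        std::vector<int32_t> destinations;  // receivers (several for multicast)
        double timestamp;                   // simulated send time
        std::vector<int64_t> dependsOn;     // events this one must wait for
        int64_t messageBytes;               // size, for the network model
    };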

POSE
● Parallel Object-oriented Simulation Environment
– Built on Charm++
● Virtualization, load balancing, communication optimization, performance analysis
● POSE advantages
– Optimistic synchronization: maximize utilization with speculative execution (sketched below)
– Adaptive strategies adjust to simulation behavior
– Optimized for fine-grained simulations
– Good scalability
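The heart of optimistic synchronization is executing events speculatively and rolling back when a straggler (an event with an earlier timestamp) arrives late. The sketch below is a conceptual model of that mechanism, not POSE's actual API; a real engine would also re-execute cancelled events and send anti-messages to undo speculatively sent ones:

    #include <cstdio>
    #include <map>

    // Conceptual optimistic simulation object: executes events speculatively,
    // checkpointing state so it can roll back when a straggler arrives.
    class OptimisticObject {
        int state_ = 0;                     // toy simulation state
        double now_ = 0.0;                  // object's current virtual time
        std::map<double, int> checkpoints_; // event time -> state before event
    public:
        void execute(double ts, int delta) {
            if (ts < now_) rollback(ts);    // straggler: undo speculation
            checkpoints_[ts] = state_;      // checkpoint before executing
            state_ += delta;
            now_ = ts;
        }
    private:
        void rollback(double ts) {
            // Restore the state saved before the earliest event at or after
            // the straggler's time, and forget the undone checkpoints.
            auto it = checkpoints_.lower_bound(ts);
            if (it == checkpoints_.end()) return;
            state_ = it->second;
            checkpoints_.erase(it, checkpoints_.end());
            now_ = ts;
            std::printf("rolled back to t=%.1f\n", ts);
        }
    };

    int main() {
        OptimisticObject obj;
        obj.execute(1.0, +1);
        obj.execute(5.0, +1);  // speculative: nothing earlier has arrived yet
        obj.execute(3.0, +1);  // straggler: forces a rollback past t=5.0
        return 0;
    }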

POSE Design
[Design diagram]

POSE Performance
● Tungsten, 1 to 256 processors
● >13,000,000 events
● Wall clock time
– 8 seconds on 256 processors (out of work?)
– 1775 seconds sequential ("Cheater!"): the sequential run was swapping heavily, and is estimated at 325 seconds without swapping

TCSim
● Time Stamp Correction network simulation
● Transform the trace log into event messages
● Send the messages into the network model (replay loop sketched below)
– BGnode
– BGproc
● Capture the results
● Terminate at a set time, or when we run out of messages
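A minimal sketch of that replay step: read logged events, wrap them as timestamped messages, and inject them into the simulated network objects in timestamp order until a cutoff. SimNode stands in for whatever the BGnode/BGproc simulation objects actually expose; all names are illustrative:

    #include <algorithm>
    #include <cstdio>
    #include <vector>

    // Minimal event message for replay (illustrative).
    struct NetMsg { double timestamp; int src, dst, bytes; };

    // Stand-in for a simulated network endpoint (a BGnode/BGproc in TCSim).
    struct SimNode {
        void deliver(const NetMsg& m) {
            std::printf("t=%.2f: %d -> %d (%d bytes)\n",
                        m.timestamp, m.src, m.dst, m.bytes);
        }
    };

    // Inject messages in timestamp order; terminate at a set time or when
    // the log runs out of messages.
    void replay(std::vector<NetMsg> log, std::vector<SimNode>& nodes,
                double tEnd) {
        std::sort(log.begin(), log.end(),
                  [](const NetMsg& a, const NetMsg& b) {
                      return a.timestamp < b.timestamp;
                  });
        for (const NetMsg& m : log) {
            if (m.timestamp > tEnd) break;
            nodes[m.dst].deliver(m);
        }
    }

    int main() {
        std::vector<SimNode> nodes(4);
        std::vector<NetMsg> log = {
            {0.5, 0, 1, 64}, {1.2, 1, 2, 128}, {9.9, 2, 3, 32},
        };
        replay(log, nodes, 5.0);  // cutoff drops the t=9.9 message
        return 0;
    }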

HiSim Blue Gene
[Diagram]

What If?
● What if Lemieux had far more processors than it really does?
– FEM framework benchmark, scaling from 125 simulated processors upward
– Run on 32 real Lemieux processors

LeanMD
● Molecular dynamics simulation designed for large machines
● K-away cutoff parallelization
● Benchmark: er-gre with 3-away parallelization
– 1.6 million objects
– 8-step simulation
– Simulating a 32K-processor BG machine, running on 400 PSC Lemieux processors

LeanMD on BigSim
[Figure]

QsNet
● Indirect network
– Hierarchical
– Node-to-switch links
– Switch-to-switch links
● AKA Elan

Network Performance Prediction
[Plot comparing actual measured and simulated performance of the K-Shift strategy under random load on 64 Lemieux processors]

Validation
[Figure]

Future Work
● User events in the Projections event log, reported in simulation time
● More validation to improve accuracy
● Hybrid networks
● Approximation from performance counters
● Integration with instruction-level simulation
– Use statistical sampling to make it viable
● Sample network configuration files