µπ: A Scalable & Transparent System for Simulating MPI Programs
Kalyan S. Perumalla, Ph.D., Senior R&D Manager, Oak Ridge National Laboratory; Adjunct Professor, Georgia Institute of Technology

Presentation transcript:

Slide 1: µπ – A Scalable & Transparent System for Simulating MPI Programs
Kalyan S. Perumalla, Ph.D., Senior R&D Manager, Oak Ridge National Laboratory; Adjunct Professor, Georgia Institute of Technology
SimuTools, Malaga, Spain, March 17, 2010

Slide 2: Motivation & Background
Software & Hardware Lifetimes
– Lifetime of a large parallel machine: 5 years
– Lifetime of useful parallel code: 20 years
– Port, analyze, optimize
– Ease of development: obviate actual scaled hardware
– Energy efficiency: reduce failed runs at actual scale
Software & Hardware Design
– Co-design: e.g., 1 µs barrier cost/benefit
– Hardware: e.g., load from the application
– Software: scaling, debugging, testing, customizing

Slide 3: µπ Performance Investigation System
µπ = micro parallel performance investigator
– Performance prediction for MPI, Portals, and other parallel applications
– Actual application code executed on the real hardware
– Platform simulated at large virtual scale
– Timing customized by a user-defined machine model
Scale is the key differentiator
– Target: 1,000,000 virtual cores
– E.g., 1,000,000 virtual MPI ranks in a simulated MPI application
Based on the µsik micro simulator kernel
– Highly scalable PDES (parallel discrete event simulation) engine

Slide 4: Generalized Interface & Timing Framework
Accommodates an arbitrary level of timing detail
– Compute time: can use a full-system (instruction-level) simulation on the side, or a model with cache effects, a corrected processor speed, etc., depending on the user's desired accuracy-cost trade-off
– Communication time: can use a network simulator, queueing and congestion models, etc., again depending on the desired accuracy-cost trade-off
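At the simplest end of that spectrum, a pluggable communication-time estimate could be a plain latency-plus-bandwidth model. The sketch below is a generic illustration under that assumption; the function name, parameters, and units are hypothetical and do not represent µπ's actual timing-model interface, which the slides do not show.

    /* Generic sketch of the simplest communication-time model a user
     * might supply: the linear model t = latency + bytes / bandwidth.
     * Names and units are hypothetical, not µπ's API. */
    double model_message_time_s(double latency_s,
                                double bandwidth_bytes_per_s,
                                double message_bytes)
    {
        return latency_s + message_bytes / bandwidth_bytes_per_s;
    }

More detailed choices (a full network simulator, queueing or congestion models) would replace such an estimate at the cost of longer simulation runs.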

Slide 5: Compiling an MPI application with µπ
Modify the #include and recompile
– Change the application's MPI header #include to the corresponding µπ header
Relink to the µπ library
– Instead of -lmpi, use -lmupi
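To make this step concrete, here is a minimal hypothetical sketch. The µπ header name mupi.h and the cc compile command are assumptions inferred from the -lmupi link flag; the slides specify only swapping the MPI header for the µπ header and linking -lmupi instead of -lmpi.

    /* hello_sim.c - a hypothetical minimal MPI program prepared for µπ.
     * The only source change is the header swap; "mupi.h" is an assumed
     * header name (only the -lmupi link flag is given on the slide). */
    #include <stdio.h>
    /* #include <mpi.h> */   /* original MPI header */
    #include <mupi.h>        /* µπ replacement header (assumed name) */

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        printf("virtual rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }

Linking then pulls in the µπ library, e.g. cc hello_sim.c -lmupi -o test rather than linking -lmpi (the compiler invocation is illustrative); the resulting binary is what slide 6 launches under mpirun.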

Slide 6: Executing an MPI application over µπ
Run the modified MPI application (now a µπ simulation)
– mpirun -np 4 test -nvp 32
  – runs test with 32 virtual MPI ranks
  – the simulation uses 4 real cores
  – µπ itself uses the multiple real cores to run the simulation in parallel

Slide 7: Interface Support
Existing, sufficient:
– MPI_Init(), MPI_Finalize()
– MPI_Comm_rank(), MPI_Comm_size()
– MPI_Barrier()
– MPI_Send(), MPI_Recv()
– MPI_Isend(), MPI_Irecv()
– MPI_Waitall()
– MPI_Wtime()
– MPI_COMM_WORLD
Planned, optional:
– Other wait variants
– Other send/recv variants
– Other collectives
– Group communication
Other, performance-oriented:
– MPI_Elapse_time(dt): added for simulation speed; avoids actual computation and instead simply elapses simulated time
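As an illustration of the performance-oriented call, the sketch below shows how a compute kernel might be elided when running under µπ. The double-seconds signature of MPI_Elapse_time(), the cost coefficient, and the function name are assumptions; the slides state only that MPI_Elapse_time(dt) skips actual computation and simply elapses simulated time.

    /* Sketch: replacing real computation with a modeled delay under µπ.
     * MPI_Elapse_time() is assumed here to take a duration in seconds;
     * the per-iteration cost model is purely illustrative. */
    #include <mupi.h>   /* assumed µπ header name, as above */

    void compute_phase(long iterations)
    {
        /* Instead of executing the real kernel, advance the simulated
         * clock by an analytic estimate of its running time. */
        double modeled_seconds = 2.5e-9 * (double)iterations;
        MPI_Elapse_time(modeled_seconds);
    }

Because no real work is performed, very large virtual runs can be simulated quickly while the simulated timeline still reflects the modeled compute cost.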

Slide 8: Performance Study
Benchmarks
– Zero lookahead
– 10 µs lookahead
Platform
– Cray XT5, 226K cores
Scaling results
– Event cost
– Synchronization overhead
– Multiplexing gain

Slide 9: Experimentation Platform: Jaguar*
* Data and images from

Slide 10: Event Cost (results figure)

Slide 11: Synchronization Speed (results figure)

Slide 12: Multiplexing Gain (results figure)

Slide 13: µπ Summary – Quantitative
Unprecedented scalability
– 27,648,000 virtual MPI ranks on 216,000 actual cores
Optimal multiplex factor observed
– 64 virtual ranks per real rank
Low slowdown even in zero-lookahead scenarios
– Even on fast virtual networks

Slide 14: µπ Summary – Qualitative
The only available simulator for highly scaled MPI runs
– Suitable for source-available, trace-driven, or modeled applications
Configurable hardware timing
– User-specified latencies, bandwidths, arbitrary inter-network models
Executions are repeatable and deterministic
– Global time-stamped ordering
– Deterministic timing model
– Purely discrete event simulation
Most suitable for applications whose MPI communication can be trapped, instrumented, or modeled
– Trapped: on-line, live actual execution
– Instrumented: off-line trace generation, trace-driven on-line execution
– Modeled: model-driven computation and MPI communication patterns
Nearly zero perturbation with unlimited instrumentation

Slide 15: Ongoing Work
– NAS benchmarks (e.g., FFT)
– Actual at-scale applications (e.g., chemistry)
– Optimized implementations of certain MPI primitives (e.g., MPI_Barrier(), MPI_Reduce())
– Ties to other important phenomena (e.g., energy consumption models)

Thank you! Questions?
Discrete Computing Systems