Emulating Massively Parallel (PetaFLOPS) Machines

Presentation transcript:

Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
Parallel Programming Laboratory, Department of Computer Science
http://charm.cs.uiuc.edu

Roadmap
- BlueGene Architecture
- Need for an Emulator
- Charm++ BlueGene
- Converse BlueGene
- Future Work

Blue Gene: Processor-in-Memory Case Study
Five steps to a PetaFLOPS (taken from http://www.research.ibm.com/bluegene/), from processor to node/chip to board to tower to the full machine:
- PROCESSOR: 1 GFlop/s, 0.5 MB
- NODE/CHIP: 25 GFlop/s, 12.5 MB
- BLUE GENE: 1 PFlop/s, 0.5 TB
Functional model: a 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.

SMP Node
- 25 processors, 200 processing elements
- Input/output buffer: 32 x 128 bytes
- Network: connected to six neighbors via duplex links, 16 bits @ 500 MHz = 1 Gigabyte/s
- Latencies: 5 cycles per hop, 75 cycles per turn

Processor
- 500 MHz
- Memory-side cache eliminates coherency problems: 10 cycles local cache, 20 cycles remote cache, 10 cycles cache miss
- 8 integer units sharing 2 floating-point units
- 8 x 25 x ~40,000 = ~8 x 10^6 processing elements!

Need for Emulator
An emulator enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.

Emulator Objectives
- Emulate Blue Gene and other PetaFLOPS machines.
- Memory and time limitations on a single processor require that the emulation be performed on a parallel architecture.
- Issues: assume that a program written for a processor-in-memory machine will handle out-of-order execution and messaging; therefore no complex event queue/rollback is needed.

Emulator Implementation
What are the basic data structures and interfaces?
- Machine configuration (topology) and handler registration
- Nodes with node-level shared data
- Threads (associated with each node) representing processing elements
- Communication between nodes
How are all these objects handled on a parallel architecture, and how is object-to-object communication handled? The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm. A sketch of these structures follows.
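The slide lists the abstractions but not their concrete shape. Below is a minimal C++ sketch of what such data structures might look like; every name here (MachineConfig, BgMessage, BgNode, registerHandler) is an illustrative assumption, not the emulator's actual API.

```cpp
#include <vector>
#include <queue>

using BgHandler = void (*)(void *msg);          // handler invoked when a message is executed

struct MachineConfig {                          // machine configuration (topology)
  int x, y, z;                                  // e.g. a 34 x 34 x 36 node grid
  int workerThreadsPerNode;                     // processing elements represented as threads
  std::vector<BgHandler> handlerTable;          // handler registration
};

struct BgMessage {                              // message sent between nodes
  int handlerId;                                // index of the registered handler to run
  int targetThread;                             // -1 => any worker thread may execute it
  std::vector<char> payload;
};

struct BgNode {                                 // one emulated BlueGene node
  void *nodeSharedData;                         // node-level shared data
  std::queue<BgMessage> inBuffer;               // incoming messages from other nodes
};

// Registering a handler returns the id that senders embed in outgoing messages.
inline int registerHandler(MachineConfig &cfg, BgHandler h) {
  cfg.handlerTable.push_back(h);
  return static_cast<int>(cfg.handlerTable.size()) - 1;
}
```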

Experiments on Emulator
Sample applications implemented:
- Primes
- Jacobi relaxation
- MD prototype: 40,000 atoms, no bonds calculated, nearest-neighbor cutoff; ran the full Blue Gene configuration (with 8 x 10^6 threads) on ~100 ASCI Red processors
(Slide figure: ApoA-I, 92k atoms)

Collective Operations
Explore different algorithms for broadcasts and reductions: RING, LINE, and OCTREE (illustrated on the 3D node grid in the original slide).
Used a "primitive" 30 x 30 x 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster.
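As an illustration of the octree pattern, here is a hedged C++ sketch of one way such a broadcast could be structured; it is an assumption for illustration, not the emulator's actual code. A node that receives the broadcast for a sub-box of the 3D node grid forwards it to the low corner of every other non-empty octant of that box, then repeats on its own octant until it holds only itself.

```cpp
#include <vector>

struct Box { int x0, y0, z0, x1, y1, z1; };           // half-open node-index bounds
struct Hop { int x, y, z; Box sub; };                 // destination node and its sub-box

// Forwarding targets for the node at the low corner of `b`: split `b` into
// octants (ceiling midpoint, so no other octant starts at our own corner),
// skip empty octants and our own, and return each remaining octant's
// low-corner node together with that octant.
std::vector<Hop> octreeChildren(const Box &b) {
  std::vector<Hop> out;
  int mx = (b.x0 + b.x1 + 1) / 2, my = (b.y0 + b.y1 + 1) / 2, mz = (b.z0 + b.z1 + 1) / 2;
  int xs[] = {b.x0, mx, b.x1}, ys[] = {b.y0, my, b.y1}, zs[] = {b.z0, mz, b.z1};
  for (int i = 0; i < 2; ++i)
    for (int j = 0; j < 2; ++j)
      for (int k = 0; k < 2; ++k) {
        if (i == 0 && j == 0 && k == 0) continue;     // our own octant: handled locally
        Box c{xs[i], ys[j], zs[k], xs[i + 1], ys[j + 1], zs[k + 1]};
        if (c.x0 >= c.x1 || c.y0 >= c.y1 || c.z0 >= c.z1) continue;  // empty octant
        out.push_back({c.x0, c.y0, c.z0, c});
      }
  return out;
}
```

A receiving node would send the message to each returned hop, shrink its own box to the first octant, and repeat until that octant contains only itself, so every node is reached exactly once in a logarithmic number of rounds.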

Converse BlueGene Emulator
Objectives:
- Performance estimation (with proper time stamping)
- Provide an API for building Charm++ on top of the emulator
Switching from the Charm++ BlueGene emulator to the Converse emulator allows better performance by accessing the low-level communication and thread libraries directly via Converse. It also makes it possible to port Charm++ on top of the BlueGene emulator, so that Charm++ can become one of the possible parallel programming languages on Blue Gene and existing Charm++ applications can run on the emulator.

BlueGene Emulator Node Structure
(Slide diagram: inBuffer, communication threads, worker threads, affinity and non-affinity message queues.)
Like Converse, the BlueGene emulator is a message-driven system: the only way two nodes communicate is to send a BlueGene message with a handler function associated with it, much like active messages. This slide shows our abstraction of a BlueGene node. Since there are 200 processing elements on each node, we represent them as threads, divided into two types: communication threads and worker threads. A communication thread polls the node's inBuffer and schedules messages onto worker threads; a worker thread picks the tasks assigned to it and executes them. There are two kinds of messages: affinity and non-affinity. For performance, we introduce affinity messages, special messages that can only be executed on a specified thread; an affinity message carries a specific thread ID, so the communication threads must schedule it to that worker thread. A non-affinity message can be assigned to any worker thread.
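A minimal sketch of that scheduling step, assuming illustrative names (BgMsg, NodeQueues, scheduleIncoming) rather than the emulator's real types, and omitting the locking a multithreaded build would need:

```cpp
#include <deque>
#include <vector>

struct BgMsg {
  int threadId;        // >= 0: affinity message bound to that worker thread; -1: any worker
  int handlerId;       // registered handler to invoke
  void *data;
};

struct NodeQueues {
  std::deque<BgMsg> inBuffer;                    // filled by the network layer
  std::vector<std::deque<BgMsg>> affinityQ;      // one queue per worker thread
  std::deque<BgMsg> nonAffinityQ;                // shared by all worker threads
};

// Body of a communication thread's scheduling step: drain the node's inBuffer
// and route each message to the queue its type requires.
void scheduleIncoming(NodeQueues &node) {
  while (!node.inBuffer.empty()) {
    BgMsg m = node.inBuffer.front();
    node.inBuffer.pop_front();
    if (m.threadId >= 0)
      node.affinityQ[m.threadId].push_back(m);   // must run on that specific worker
    else
      node.nonAffinityQ.push_back(m);            // any worker may execute it
  }
}
```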

Performance
Pingpong:
- Close to the Converse pingpong: 81-103 us vs. 92 us RTT
- Charm++ pingpong: 116 us RTT
- Charm++ BlueGene pingpong: 134-175 us RTT
Eliminating the Charm++ message overhead makes performance much better than the previous Charm++ BlueGene emulator. Tests were conducted on an Origin2000.

Charm++ on Top of the Emulator
- A BlueGene thread represents a Charm++ node.
- Name conflicts: Cpv, Ctv; MsgSend, etc.; CkMyPe(), CkNumPes(), etc.
- Note: CkMyPe() now returns the thread's global serial number.
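The slide only states that CkMyPe() returns the thread's global serial number; a plausible mapping, shown below as an assumption for illustration rather than the emulator's actual code, linearizes the node's 3D coordinates and appends the local worker-thread index.

```cpp
// Hypothetical helper: compute a globally unique serial number for the thread
// that stands in for a Charm++ node on the emulated machine.
inline int globalSerial(int x, int y, int z,          // node coordinates
                        int localThread,              // worker-thread index on that node
                        int ny, int nz,               // machine dimensions in y and z
                        int threadsPerNode) {
  int nodeIndex = (x * ny + y) * nz + z;              // linearize the 3D node grid
  return nodeIndex * threadsPerNode + localThread;    // then the thread within the node
}

// Example: on a 34 x 34 x 36 machine with 200 threads per node,
// globalSerial(0, 0, 1, 3, 34, 36, 200) == 203.
```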

Future Work: Simulator
- LeanMD: fully functional MD with only cutoff
- How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
- Several layers of detail to measure:
  - Basic: correctly model performance; timestamp messages with correction for out-of-order execution
  - More detailed: network performance, memory access, modeling sharing of the floating-point unit, estimation techniques