Emulating Massively Parallel (PetaFLOPS) Machines


1 Emulating Massively Parallel (PetaFLOPS) Machines
Neelam Saboo, Arun Kumar Singla, Joshua Mostkoff Unger, Gengbin Zheng, Laxmikant V. Kalé
Department of Computer Science, Parallel Programming Laboratory

2 Roadmap
BlueGene Architecture
Need for an Emulator
Charm++ BlueGene
Converse BlueGene
Future Work

3 Blue Gene: Processor-in-memory Case Study
Five steps to a PetaFLOPS, taken from:
PROCESSOR: 1 GFlop/s, 0.5 MB
NODE/CHIP: 25 GFlop/s, 12.5 MB
BOARD
TOWER
BLUE GENE: 1 PFlop/s, 0.5 TB
FUNCTIONAL MODEL: 34 x 34 x 36 cube of shared-memory nodes, each having 25 processors.

4 SMP Node
25 processors, 200 processing elements
Input/Output Buffer: 32 x 128 bytes
Network: connected to six neighbors via duplex links ( MHz = Gigabyte/s)
Latencies: 5 cycles per hop, 75 cycles per turn (a rough latency estimate is sketched below)
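To make those latency figures concrete, here is a small illustrative C++ sketch that estimates one-way latency between two nodes of the mesh from the quoted 5 cycles per hop and 75 cycles per turn; the dimension-order turn counting and all names are our assumptions, not part of the original design.

```cpp
// Hypothetical sketch: estimate message latency on the 34 x 34 x 36 mesh
// from the quoted per-hop and per-turn cycle counts, at a 500 MHz clock.
#include <cstdlib>
#include <cstdio>

struct Coord { int x, y, z; };

int estimateLatencyCycles(Coord src, Coord dst) {
  int dx = std::abs(dst.x - src.x);
  int dy = std::abs(dst.y - src.y);
  int dz = std::abs(dst.z - src.z);
  int hops  = dx + dy + dz;                        // one hop per link traversed
  int turns = (dx > 0) + (dy > 0) + (dz > 0) - 1;  // dimension changes along a dimension-ordered route
  if (turns < 0) turns = 0;
  return 5 * hops + 75 * turns;
}

int main() {
  Coord a{0, 0, 0}, b{33, 33, 35};                 // opposite corners of the mesh
  int cycles = estimateLatencyCycles(a, b);
  std::printf("~%d cycles (~%.2f us at 500 MHz)\n", cycles, cycles / 500.0);
  return 0;
}
```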

5 Processor
STATS: 500 MHz
Memory-side cache eliminates coherency problems
10 cycles local cache, 20 cycles remote cache, 10 cycles cache miss
8 integer units sharing floating-point units
8 x 25 x ~40,000 = ~8 x 10^6 processing elements!

6 Need for Emulator
Emulator: enables the programmer to develop, compile, and run software using the programming interface that will be used on the actual machine.

7 Emulator Objectives
Emulate Blue Gene and other PetaFLOPS machines.
Memory and time limitations on a single processor require that the emulation be performed on a parallel architecture.
Issues: we assume that a program written for a processor-in-memory machine will tolerate out-of-order execution and messaging, so no complex event queue or rollback is needed.

8 Emulator Implementation
What are the basic data structures and interfaces? (See the sketch below.)
Machine configuration (topology), handler registration
Nodes with node-level shared data
Threads (associated with each node) representing processing elements
Communication between nodes
How do we handle all these objects on a parallel architecture?
How do we handle object-to-object communication?
The difficulties of implementation are eased by using Charm++, an object-oriented parallel programming paradigm.
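As a rough illustration of the objects listed above, here is a minimal C++ sketch of the kind of data structures involved; all names (BgConfig, BgNode, BgMachine, etc.) are illustrative assumptions, not the emulator's actual API.

```cpp
// Minimal sketch (assumed names) of the slide's objects: machine configuration,
// nodes with node-level shared data, per-node threads standing in for
// processing elements, and handler-driven messages between nodes.
#include <vector>
#include <queue>
#include <functional>

struct BgConfig {                  // machine configuration (topology)
  int sizeX, sizeY, sizeZ;         // e.g. 34 x 34 x 36
  int commThreads, workThreads;    // threads emulated per node
};

struct BgMessage;
using BgHandler = std::function<void(BgMessage*)>;

struct BgMessage {
  int handlerId;                   // registered handler to run on arrival
  int destX, destY, destZ;         // destination node
  int threadId;                    // >= 0 for an "affinity" message, -1 otherwise
};

struct BgNode {
  std::queue<BgMessage*> inBuffer;            // filled by the communication threads
  std::vector<char> nodeData;                 // node-level shared data
  std::vector<std::queue<BgMessage*>> workQ;  // one queue per worker thread
};

class BgMachine {
  BgConfig cfg;
  std::vector<BgNode> nodes;                  // in the real emulator these are parallel Charm++ objects
  std::vector<BgHandler> handlers;
public:
  int registerHandler(BgHandler h) {
    handlers.push_back(h);
    return (int)handlers.size() - 1;
  }
  void sendPacket(BgMessage* m) {             // route m to the destination node's inBuffer
    int idx = (m->destX * cfg.sizeY + m->destY) * cfg.sizeZ + m->destZ;
    nodes[idx].inBuffer.push(m);
  }
};
```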

9 Experiments on Emulator
Sample applications implemented:
Primes
Jacobi relaxation (see the sketch below)
MD prototype: 40,000 atoms, no bonds calculated, nearest-neighbor cutoff
Ran the full Blue Gene configuration (with 8 x 10^6 threads) on ~100 ASCI Red processors.
ApoA-I: 92k atoms
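For reference, a minimal serial sketch of one Jacobi relaxation sweep (the plain kernel, not the emulator version of the application) might look like this:

```cpp
// Illustrative Jacobi sweep: each interior point is replaced by the average
// of its four neighbors, reading from `in` and writing to `out`.
#include <vector>

void jacobiSweep(const std::vector<std::vector<double>>& in,
                 std::vector<std::vector<double>>& out) {
  for (size_t i = 1; i + 1 < in.size(); ++i)
    for (size_t j = 1; j + 1 < in[i].size(); ++j)
      out[i][j] = 0.25 * (in[i-1][j] + in[i+1][j] + in[i][j-1] + in[i][j+1]);
}
```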

10 Collective Operations
Explore different algorithms for broadcasts and reductions: RING, LINE, OCTREE (an octree broadcast sketch follows below).
Used a "primitive" 30 x 30 x 20 (10 threads) Blue Gene emulation on a 50-processor Linux cluster.
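As an illustration of one of these patterns, here is a hypothetical sketch of an octree-style broadcast over node ranks; the 8-ary tree numbering and the sendTo callback are assumptions, not the emulator's API.

```cpp
// Octree broadcast sketch: each node that receives the broadcast forwards it
// to up to eight children in an implicit 8-ary tree over node ranks, so the
// broadcast completes in O(log8 N) forwarding steps.
#include <vector>

void octreeBroadcast(int myRank, int numNodes, const std::vector<char>& payload,
                     void (*sendTo)(int rank, const std::vector<char>&)) {
  // Children of node r in an 8-ary tree rooted at 0 are ranks 8*r+1 .. 8*r+8.
  for (int i = 1; i <= 8; ++i) {
    int child = 8 * myRank + i;
    if (child < numNodes)
      sendTo(child, payload);   // each child's handler repeats this step in turn
  }
}
```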

11 Converse BlueGene Emulator Objective
Performance estimation (with proper time stamping)
Provide an API for building Charm++ on top of the emulator.
Switching from the Charm++ BlueGene emulator to the Converse emulator gives better performance by accessing the low-level communication and thread libraries directly via Converse. It also makes it possible to port Charm++ on top of the BlueGene emulator, so that Charm++ becomes one of the possible parallel programming languages on Blue Gene and existing Charm++ applications can run on the emulator.

12 BlueGene Emulator Node Structure
Like Converse, the BlueGene emulator is a message-driven system: the only way two nodes communicate is by sending a BlueGene message with a handler function associated with it, much like active messages.
This slide shows our abstraction of a BlueGene node. The 200 processing elements on each node are represented as threads, divided into two types: communication threads and worker threads. A communication thread polls the node's inBuffer and schedules incoming messages onto worker threads; a worker thread picks up the tasks assigned to it and executes them.
There are two kinds of messages. For performance, we introduce affinity messages: an affinity message carries a specific thread ID and can only be executed on that thread, so the communication threads must schedule it onto the specified worker thread. A non-affinity message can be assigned to any worker thread. (A sketch of this scheduling rule follows below.)
Node structure: inBuffer, communication threads, worker threads, affinity message queues, non-affinity message queue.
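The scheduling rule described above can be summarized in a short C++ sketch; the names (BgMsg, BgNodeQueues, commThreadStep) are illustrative assumptions, not the emulator's actual code.

```cpp
// Sketch of the dispatch rule: a communication thread drains the node's
// inBuffer, routing affinity messages to the specified worker's queue and
// non-affinity messages to a shared queue any worker may drain.
#include <queue>
#include <vector>

struct BgMsg { int threadId; /* -1 means non-affinity; handler and payload omitted */ };

struct BgNodeQueues {
  std::queue<BgMsg*> inBuffer;                   // filled by the network layer
  std::vector<std::queue<BgMsg*>> affinityQ;     // one per worker thread
  std::queue<BgMsg*> nonAffinityQ;               // shared by all worker threads
};

void commThreadStep(BgNodeQueues& node) {
  while (!node.inBuffer.empty()) {
    BgMsg* m = node.inBuffer.front();
    node.inBuffer.pop();
    if (m->threadId >= 0)
      node.affinityQ[m->threadId].push(m);       // must run on that specific worker
    else
      node.nonAffinityQ.push(m);                 // any worker may execute it
  }
}
```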

13 Performance: Pingpong
Close to Converse pingpong: us vs. 92 us RTT
Charm++ pingpong: 116 us RTT
Charm++ BlueGene pingpong: us RTT
Eliminating the Charm++ message overhead makes performance much better than the previous Charm++ BlueGene emulator.
Tests were conducted on an Origin2000.

14 Charm++ on top of Emulator
A BlueGene thread represents a Charm++ node.
Name conflicts: Cpv, Ctv; MsgSend, etc.; CkMyPe(), CkNumPes(), etc.
Note: CkMyPe() now returns the thread's global serial number (see the sketch below).
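One plausible way to realize that mapping, purely as an illustration and not the emulator's actual implementation, is to derive a unique rank from the node's mesh coordinates and the thread's local index:

```cpp
// Illustrative sketch: a global serial number for each emulated worker thread,
// computed from its node's (x, y, z) position and its local thread index.
int globalSerial(int x, int y, int z,       // node coordinates in the mesh
                 int sizeY, int sizeZ,      // mesh dimensions in y and z
                 int localThread, int threadsPerNode) {
  int nodeIndex = (x * sizeY + y) * sizeZ + z;
  return nodeIndex * threadsPerNode + localThread;
}
```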

15 Future Work: Simulator
LeanMD: fully functional MD with only cutoff interactions.
How can we examine the performance of algorithms on variants of the processor-in-memory design in a massive system?
Several layers of detail to measure:
Basic: correctly model performance; timestamp messages with correction for out-of-order execution (see the sketch below)
More detailed: network performance, memory access, modeling the sharing of the floating-point unit, estimation techniques
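As a sketch of the timestamp-correction idea, and purely our illustration rather than the planned simulator design: each emulated thread keeps a virtual clock, and a handler is charged from the later of the message's arrival time and the thread's current time, so out-of-order delivery can be corrected after the fact.

```cpp
// Hypothetical timestamp correction: the handler's start time is the max of
// the message's (timestamped) arrival time and the thread's virtual clock,
// and the clock then advances by the handler's estimated cost.
struct VirtualClock { double now = 0.0; };

double executeWithTimestamp(VirtualClock& thread, double msgArrivalTime, double handlerCost) {
  double start = (msgArrivalTime > thread.now) ? msgArrivalTime : thread.now;
  thread.now = start + handlerCost;   // advance the thread's virtual time
  return thread.now;                  // corrected completion timestamp
}
```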

