
Lecture 7, Part 2: Message Passing Multicomputers (Distributed Memory Machines)


Slide 2: Message Passing Multicomputer
- Consists of multiple computing units, called nodes
- Each node is an autonomous computer, consisting of:
  - Processor(s) (may be an SMP)
  - Local memory
  - Disks or I/O peripherals (optional)
  - A full-scale OS (some systems use only a microkernel)
- Nodes communicate by message passing: no-remote-memory-access (NORMA) machines (a minimal code sketch follows below)
- Also called distributed memory machines
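Programs on such machines exchange data through an explicit message-passing library. The sketch below is a minimal illustration, assuming MPI (which the slides do not name, but which is the standard interface on SP2-class machines and clusters): rank 0 sends a buffer to rank 1 over the interconnect.

    /* Minimal message-passing sketch (illustrative, MPI assumed):
     * node 0 sends an array to node 1 over the interconnect. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        double buf[256];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            for (int i = 0; i < 256; i++) buf[i] = (double)i;
            /* Blocking send: returns once buf can be safely reused. */
            MPI_Send(buf, 256, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocking receive of the matching message from node 0. */
            MPI_Recv(buf, 256, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("node 1 received %g ... %g\n", buf[0], buf[255]);
        }

        MPI_Finalize();
        return 0;
    }

Each process runs on its own node with its own local memory; the only way data moves between nodes is through explicit sends and receives like these.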

Slide 3: IBM SP2

Slide 4: SP2
- IBM SP2 = Scalable POWERparallel System
- Based on the RISC System/6000 architecture (POWER2 processor)
- Interconnect: High-Performance Switch (HPS)

Slide 5: SP2 Nodes
- 66.7 MHz POWER2 processor with L2 cache
- POWER2 can issue six instructions per cycle (2 load/store, index increment, conditional branch, and 2 floating-point)
- 2 floating-point units (FPU) + 2 fixed-point units (FXU)
- Up to four floating-point operations (2 multiply-add operations) per cycle
- Peak performance of 266 Mflops (66.7 MHz x 4 flops/cycle), as worked out in the sketch below
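A quick way to sanity-check the quoted peak rate is clock frequency times floating-point operations per cycle. The snippet below is purely illustrative arithmetic using the figures on this slide.

    /* Peak-rate arithmetic from the slide: 66.7 MHz x 4 flops/cycle. */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz = 66.7;        /* POWER2 clock (slide figure)   */
        double flops_per_cycle = 4.0;   /* 2 FPUs x fused multiply-add   */
        printf("peak = %.1f Mflops\n", clock_mhz * flops_per_cycle); /* ~266.8 */
        return 0;
    }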

Slide 6: IBM SP2 uses two types of nodes
- Thin node: 4 Micro Channel (I/O) slots, 96 KB L2 cache, 64-512 MB memory, 1-4 GB disk
- Wide node: 8 Micro Channel slots, 288 KB L2 cache, 64-2048 MB memory, 1-8 GB disk

Slide 7: SP2 Wide Node (diagram)

Slide 8: IBM SP2 Interconnect
- Switch:
  - High-Performance Switch (HPS), operating at 40 MHz; peak link bandwidth 40 MB/s (40 MHz x 8-bit links)
  - Omega-switch-based multistage network
- Network interface:
  - Enhanced Communication Adapter
  - The adapter incorporates an Intel i860 XR 64-bit microprocessor (40 MHz) for communication coprocessing and data checking

Slide 9: SP2 Switch Board
- Each board uses 8 switch elements operating at 40 MHz; 16 elements are installed for reliability
- 4 routes between each pair of nodes (set at boot time)
- Hardware latency: 500 ns per board
- Capable of scaling bisection bandwidth linearly with the number of nodes

Slide 10: SP2 HPS (a 16 x 16 switch board, Vulcan chip)
- Maximum point-to-point bandwidth: 40 MB/s
- One packet consists of 256 bytes; flit size = 1 byte (wormhole routing); a latency sketch follows below
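The figures on slides 9 and 10 can be plugged into a simple first-order wormhole-routing model (this is an illustrative model, not a vendor formula): the header pays the per-board latency at each hop, then the packet streams through at link bandwidth. The 3-hop path length below is an assumed example.

    /* First-order wormhole latency model using the HPS figures:
     * 500 ns per switch board, 40 MB/s links, 256-byte packets. */
    #include <stdio.h>

    int main(void)
    {
        double hop_ns = 500.0;          /* per-board hardware latency      */
        double bw_bps = 40e6;           /* 40 MB/s link bandwidth          */
        double pkt_b  = 256.0;          /* packet size in bytes            */
        int    hops   = 3;              /* example path length (assumed)   */

        double latency_ns = hops * hop_ns + (pkt_b / bw_bps) * 1e9;
        printf("estimated latency: %.0f ns\n", latency_ns);  /* ~7900 ns */
        return 0;
    }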

Slide 11: SP2 Communication Adapter
- One adapter per node
- One switch board unit per rack
- Send FIFO: 128 entries (256 bytes each)
- Receive FIFO: 64 entries (256 bytes each)
- 2 DMA engines

Slide 12: SP2 Communication Adapter (diagram: POWER2 host node connected to the network adapter)

Slide 13: 128-node SP2 (16 nodes per frame)

Slide 14: Intel Paragon

Slide 15: Intel Paragon (2-D mesh)

Slide 16: Intel Paragon Node Architecture
- Up to three 50 MHz Intel i860 processors (75 Mflop/s each) per node (usually two in most installations)
  - One is used as the message processor (communication co-processor), handling all communication events
  - Two are application processors (computation only)
- Each node is a shared-memory multiprocessor (64-bit bus, 400 MB/s, with cache-coherence support)
  - Peak memory-to-processor bandwidth: 400 MB/s
  - Peak cache-to-processor bandwidth: 1.2 GB/s

Slide 17: Intel Paragon Node Architecture (continued)
- Message processor:
  - Handles message-protocol processing for the application program, freeing the application processor to continue with numeric computation while messages are transmitted and received
  - Also used to implement efficient global operations such as synchronization, broadcasting, and global reduction calculations (e.g., global sum); a global-sum sketch follows below
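To illustrate the kind of global reduction a message processor can accelerate, here is a hedged sketch of a global sum. MPI is used for illustration only (the Paragon was typically programmed with Intel's native NX library, but the collective pattern is the same).

    /* Global-sum reduction sketch (MPI used for illustration). Each node
     * contributes a partial result; the collective returns the total on
     * every node - the kind of operation a communication co-processor
     * can offload from the application processors. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double partial = rank + 1.0;   /* stand-in for a local computation */
        double total   = 0.0;

        /* Global reduction (sum) across all nodes. */
        MPI_Allreduce(&partial, &total, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            printf("global sum = %g\n", total);

        MPI_Finalize();
        return 0;
    }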

Slide 18: Paragon Node Architecture (diagram)

Slide 19: Paragon Interconnect
- 2-D mesh
  - I/O devices attached on a single side
  - 16-bit links, 175 MB/s
- Mesh Routing Components (MRCs)
  - One for each node
  - 40 ns per hop (switch delay), 70 ns when changing dimension (from the x dimension to the y dimension)
  - In a 512-PE machine (16 x 32 mesh), a 10-hop route takes roughly 400-700 ns (see the sketch below)
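The per-hop figures on this slide give a rough routing-latency estimate for a 2-D mesh. The sketch below is an illustrative model only; the 8-hops-in-x, 2-hops-in-y split is an assumed example of a 10-hop route.

    /* Rough 2-D mesh routing-latency estimate using the MRC figures:
     * 40 ns per hop, 70 ns for the hop that turns from x to y. */
    #include <stdio.h>
    #include <stdlib.h>

    /* Dimension-ordered route: all x hops, then all y hops,
     * so there is at most one dimension change. */
    static double route_ns(int dx, int dy)
    {
        int hops  = abs(dx) + abs(dy);
        int turns = (dx != 0 && dy != 0) ? 1 : 0;
        return hops * 40.0 + turns * 30.0;   /* 70 ns turn = 40 ns + 30 ns extra */
    }

    int main(void)
    {
        /* A 10-hop route across a 16 x 32 mesh, e.g. 8 hops in x, 2 in y. */
        printf("10-hop route: %.0f ns\n", route_ns(8, 2));  /* ~430 ns */
        return 0;
    }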

Slide 20: Cray T3D

Slide 21: Cray T3D Node Architecture
- Each processing node contains two PEs, a network interface, and a block transfer engine (shared by the two PEs)
- PE: 150 MHz DEC 21064 Alpha AXP, 34-bit address space, 64 MB memory, 150 Mflop/s
- 1024 processors: sustained maximum speed of 152 Gflop/s

Slide 22: T3D Node and Network Interface (diagram)

Slide 23: Cray T3D Interconnect
- Interconnect: 3-D torus, 16-bit data per link, 150 MHz
- Communication channel peak rate: 300 MB/s

Slide 24: T3D Routing Costs
- Routing data between processors through interconnect nodes costs two clock cycles (6.67 ns per cycle) per node traversed, plus one extra clock cycle to turn a corner
- The overhead of using the block transfer engine is high: startup cost > 480 cycles x 6.67 ns = 3.2 us (see the sketch below)
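The costs quoted on this slide can be combined into a simple estimate, shown below. The path length and number of corner turns are assumed example values; only the per-cycle, per-node, per-turn, and startup figures come from the slide.

    /* Routing-cost arithmetic from the slide: 2 cycles per node traversed,
     * 1 extra cycle per corner turn, at 6.67 ns per clock cycle; plus the
     * quoted >480-cycle block-transfer-engine startup. Illustrative only. */
    #include <stdio.h>

    int main(void)
    {
        double cycle_ns      = 6.67;
        int nodes_traversed  = 10;   /* example path length (assumed)        */
        int corner_turns     = 2;    /* assumed: up to two turns in 3-D      */

        double route_ns       = (2 * nodes_traversed + corner_turns) * cycle_ns;
        double blt_startup_ns = 480 * cycle_ns;

        printf("route latency : %.0f ns\n", route_ns);        /* ~147 ns  */
        printf("BLT startup   : %.0f ns\n", blt_startup_ns);  /* ~3.2 us  */
        return 0;
    }

The comparison makes the slide's point concrete: a short remote access costs on the order of 100-200 ns of routing, while the block transfer engine only pays off for transfers large enough to amortize its multi-microsecond startup.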

Slide 25: T3D Local and Remote Memory
- Local memory:
  - 16 or 64 MB DRAM per PE
  - Latency: 13 to 38 clock cycles (87 to 253 ns)
  - Bandwidth: up to 320 MB/s
- Remote memory:
  - Directly addressable by the processor
  - Latency: 1 to 2 microseconds
  - Bandwidth: over 100 MB/s (measured in software)

Slide 26: T3D Local and Remote Memory (continued)
- The T3D is a distributed shared memory machine
- All memory is directly accessible; no action is required by remote processors to formulate responses to remote requests
- NCC-NUMA: non-cache-coherent NUMA

Slide 27: T3D Bisection Bandwidth
- The network moves data in packets with payload sizes of either one or four 64-bit words
- The bisection bandwidth of a 1024-PE T3D is about 76 GB/s
  - 512 nodes arranged as an 8 x 8 x 8 torus, 64 nodes per frame; 4 x 64 channels x 300 MB/s (see the check below)
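The quoted figure follows directly from the numbers on the slide; the snippet below just reproduces the 4 x 64 x 300 MB/s arithmetic.

    /* Bisection-bandwidth check from the slide: an 8 x 8 x 8 torus of
     * 512 nodes (1024 PEs) cut in half crosses 4 x 64 channels, each
     * with a 300 MB/s peak rate. */
    #include <stdio.h>

    int main(void)
    {
        int channels   = 4 * 64;    /* channels crossing the bisection */
        double link_mb = 300.0;     /* peak rate per channel, MB/s     */

        printf("bisection bandwidth: %.1f GB/s\n",
               channels * link_mb / 1000.0);   /* 76.8 GB/s */
        return 0;
    }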

Slide 28: T3E Node (diagram)
- E-registers
- Alpha 21164: 4-issue (2 integer + 2 floating-point), 600 Mflop/s at 300 MHz

Slide 29: Clusters
- Network of Workstations (NOW)
- Cluster of Workstations (COW)
- Pile of PCs (POPC)

Slide 30: Clusters of Workstations
- Several workstations connected by a network
  - Connected with Fast/Gigabit Ethernet, ATM, FDDI, etc.
  - Some software layer to tightly integrate all resources
- Each workstation is an independent machine

Slide 31: Clusters
- Advantages
  - Cheaper
  - Easy to scale
  - Coarse-grain parallelism (traditionally)
- Disadvantages
  - Longer communication latency compared with other parallel systems (traditionally)

Slide 32: ATM Cluster (Fore SBA-200)
- Cluster node: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, ...
- NI location: I/O bus
- Communication processor: Intel i960, 33 MHz, 128 KB RAM
- Peak bandwidth: 19.4 MB/s or 77.6 MB/s per port
- HKU: PearlCluster (16-node), SRG DP-ATM Cluster ($-node, 16.2 MB/s)

Slide 33: Myrinet Cluster
- Cluster node: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, ...
- NI location: I/O bus
- Communication processor: LANai, 25 MHz, 128 KB SRAM
- Peak bandwidth: 80 MB/s -> 160 MB/s

Slide 34: Conclusion
- Many current network interfaces employ a dedicated processor to offload communication tasks from the main processor
- Overlapping computation with communication improves performance (see the sketch below)
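To make the overlap point concrete, the following hedged sketch (MPI assumed, as in the earlier examples) posts nonblocking sends and receives, lets the CPU compute while the network interface or message processor progresses the transfers, then waits for completion. It assumes exactly two ranks.

    /* Overlapping computation with communication: nonblocking MPI sketch.
     * Ranks 0 and 1 exchange buffers while each keeps computing locally. */
    #include <mpi.h>
    #include <stdio.h>

    #define N 4096

    int main(int argc, char **argv)
    {
        int rank, peer;
        double sendbuf[N], recvbuf[N], local = 0.0;
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        peer = (rank == 0) ? 1 : 0;      /* assumes exactly 2 ranks */

        for (int i = 0; i < N; i++) sendbuf[i] = rank + i;

        /* Post the communication; it can proceed in the background. */
        MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* Local computation overlapped with the transfers. */
        for (int i = 0; i < N; i++) local += sendbuf[i] * 0.5;

        /* Block only when the results of the communication are needed. */
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        if (rank == 0)
            printf("local = %g, first received = %g\n", local, recvbuf[0]);

        MPI_Finalize();
        return 0;
    }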

Slide 35: Paragon (summary)
- Main processor: 50 MHz i860 XP, 75 Mflop/s
- NI location: memory bus (64-bit, 400 MB/s)
- Communication processor: 50 MHz i860 XP (a full processor)
- Peak bandwidth: 175 MB/s (16-bit link, 1 DMA engine)

Slide 36: SP2 (summary)
- Main processor: 66.7 MHz POWER2, 266 Mflop/s
- NI location: I/O bus (32-bit Micro Channel)
- Communication processor: 40 MHz i860 XR (a full processor)
- Peak bandwidth: 40 MB/s (8-bit link, 40 MHz)

Slide 37: T3D (summary)
- Main processor: 150 MHz DEC 21064 Alpha AXP, 150 Mflop/s
- NI location: memory bus (320 MB/s local; 100 MB/s remote)
- Communication processor: controller (BLT), hardware circuitry
- Peak bandwidth: 300 MB/s (16-bit data per link at 150 MHz)

