1 Lecture 7: Part 2: Message Passing Multicomputers (Distributed Memory Machines)

2 Message Passing Multicomputer
- Consists of multiple computing units, called nodes
- Each node is an autonomous computer, consisting of:
  - Processor(s) (may be an SMP)
  - Local memory
  - Disks or I/O peripherals (optional)
  - A full-scale OS (or, in some systems, a microkernel)
- Nodes communicate by message passing
  - No-remote-memory-access (NORMA) machines
- Also called distributed memory machines
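For concreteness, here is a minimal message-passing sketch in C. MPI is assumed as the programming interface (the machines discussed below shipped with their own vendor message-passing libraries); the point is the model: two nodes with separate address spaces that exchange data only through explicit send and receive calls.

    /* Minimal message-passing sketch: node 0 sends an integer to node 1.
       Each rank runs in its own address space (NORMA: no remote memory access). */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            data = 42;                        /* exists only in node 0's local memory */
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", data);   /* arrived over the network */
        }

        MPI_Finalize();
        return 0;
    }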

3 IBM SP2

4 SP2
- IBM SP2 = Scalable POWERparallel System
- Developed based on the RISC System/6000 architecture (POWER2 processor)
- Interconnect: High-Performance Switch (HPS)

5 SP2 - Nodes
- 66.7 MHz POWER2 processor with L2 cache
- POWER2 can perform six instructions (2 load/store, index increment, conditional branch, and two floating-point) per cycle
- 2 floating-point units (FPU) + 2 fixed-point units (FXU)
- Performs up to four floating-point operations (2 multiply-add ops) per cycle
- Peak performance: 266 Mflops (66.7 MHz x 4)
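The peak figure is just clock rate times floating-point operations per cycle. The snippet below is only that arithmetic written out, not SP2-specific code.

    /* Peak rate = clock (MHz) x flops per cycle; two FPUs each doing a
       multiply-add give 4 flops per cycle.                              */
    #include <stdio.h>

    int main(void)
    {
        double clock_mhz = 66.7;
        int flops_per_cycle = 2 /* FPUs */ * 2 /* multiply-add = 2 flops */;
        printf("peak = %.1f Mflops\n", clock_mhz * flops_per_cycle);   /* ~266.8 */
        return 0;
    }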

6 IBM SP2: Two Types of Nodes
- Thin node: 4 Micro Channel (I/O) slots, 96 KB L2 cache, MB memory, 1-4 GB disk
- Wide node: 8 Micro Channel slots, 288 KB L2 cache, MB memory, 1-8 GB disk

7 SP2 Wide Node

8 IBM SP2: Interconnect
- Switch:
  - High-Performance Switch (HPS), operating at 40 MHz; peak link bandwidth 40 MB/s (8-bit links x 40 MHz)
  - Omega-switch-based multistage network
- Network interface:
  - Enhanced Communication Adapter
  - The adapter incorporates an Intel i860 XR 64-bit microprocessor (40 MHz) that performs communication coprocessing and data checking

9 SP2 Switch Board
- Each board has 8 switch elements operating at 40 MHz; 16 elements are installed for reliability
- 4 routes between each pair of nodes (set at boot time)
- Hardware latency: 500 nsec per board
- Capable of scaling bisection bandwidth linearly with the number of nodes

10 SP2 HPS (a 16 x 16 switch board, built from Vulcan chips)
- Maximum point-to-point bandwidth: 40 MB/s
- 1 packet consists of 256 bytes; flit size = 1 byte (wormhole routing)
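A rough wormhole-routing latency model helps interpret these numbers: the packet header cuts through each switch while the remaining flits stream behind it, so latency is roughly (per-board delay x boards crossed) + (packet size / link bandwidth). The sketch below reuses the 500 nsec board latency from the previous slide; the two-board path is an assumed example, not an SP2 specification.

    /* Rough wormhole-routing latency estimate for one 256-byte packet.
       Illustrative model only, not official SP2 figures.               */
    #include <stdio.h>

    int main(void)
    {
        double board_latency_ns = 500.0;   /* hardware latency per switch board */
        double link_mb_s        = 40.0;    /* 8-bit link at 40 MHz              */
        int    packet_bytes     = 256;
        int    boards           = 2;       /* assumed: route crosses two boards */

        double transfer_ns = packet_bytes / link_mb_s * 1000.0;   /* bytes/(MB/s) = us -> ns */
        printf("estimated latency: %.0f ns\n",
               boards * board_latency_ns + transfer_ns);          /* 1000 + 6400 = 7400 ns */
        return 0;
    }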

11 SP2 Communication Adapter
- One adapter per node
- One switch board unit per rack
- Send FIFO: 128 entries (256 bytes each)
- Receive FIFO: 64 entries (256 bytes each)
- 2 DMA engines

12 SP2 Communication Adapter (figure: POWER2 host node and network adapter)

13 An SP2 system (16 nodes per frame)

14 INTEL PARAGON

15 Intel Paragon (2-D mesh)

16 Intel Paragon Node Architecture
- Up to three 50 MHz Intel i860 processors (75 Mflop/s) per node (usually two in most installations)
  - One of them is used as the message processor (communication co-processor), handling all communication events
  - Two are application processors (computation only)
- Each node is a shared-memory multiprocessor (64-bit bus, bus speed 400 MB/s, with cache coherence support)
  - Peak memory-to-processor bandwidth: 400 MB/s
  - Peak cache-to-processor bandwidth: 1.2 GB/s

17 Intel Paragon Node Architecture
- Message processor:
  - Handles message protocol processing for the application program, freeing the application processors to continue with numeric computation while messages are transmitted and received
  - Also used to implement efficient global operations such as synchronization, broadcasting, and global reduction calculations (e.g., global sum)

18 Paragon Node Architecture

19 Paragon Interconnect
- 2-D mesh
  - I/O devices attached on a single side
  - 16-bit links, 175 MB/s
- Mesh Routing Components (MRCs), one for each node
  - 40 nsec per hop (switch delay), or 70 nsec for a hop that changes dimension (from x-dim to y-dim)
  - In a 512-PE machine (16x32 mesh), a 10-hop route costs a few hundred nsec of switch delay (worked out in the sketch below)
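The switch-delay arithmetic for that last bullet, written out; the 10-hop route with one dimension change is an assumed example path, not a figure from the slide.

    /* Switch delay along a Paragon 2-D mesh route, from the per-hop
       figures above: 40 ns straight, 70 ns when changing dimension. */
    #include <stdio.h>

    int main(void)
    {
        int hops        = 10;   /* example route length       */
        int turns       = 1;    /* dimension changes (x -> y) */
        int straight_ns = 40;
        int turn_ns     = 70;

        int delay_ns = (hops - turns) * straight_ns + turns * turn_ns;
        printf("switch delay: %d ns\n", delay_ns);   /* 9*40 + 70 = 430 ns */
        return 0;
    }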

20 CRAY T3D

21 Cray T3D Node Architecture
- Each processing node contains two PEs, a network interface, and a block transfer engine (shared by the two PEs)
- PE: 150 MHz DEC Alpha AXP, 34-bit address, 64 MB memory, 150 MFLOPS
- 1024 processors: maximum sustained speed 152 Gflop/s

22 T3D Node and Network Interface

23 Cray T3D Interconnect
- Interconnect: 3-D torus, 16-bit data per link, 150 MHz
- Communication channel peak rate: 300 MB/s (2 bytes x 150 MHz)

24 T3D
- The cost of routing data between processors through interconnect nodes is two clock cycles (6.67 nsec per cycle) per node traversed, plus one extra clock cycle to turn a corner
- The overhead of using the block transfer engine is high (startup cost > 480 cycles x 6.67 nsec = 3.2 usec)
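The same cycle bookkeeping written out in code; the 8-hop route with 2 corner turns is an assumed example, not a figure from the slide.

    /* T3D routing delay and BLT startup cost, from the cycle counts above
       (6.67 ns per clock at 150 MHz).                                     */
    #include <stdio.h>

    int main(void)
    {
        double cycle_ns = 6.67;
        int hops = 8, corners = 2;                    /* assumed example route */

        double route_ns       = (2 * hops + corners) * cycle_ns;   /* 2 cycles/hop + 1/corner */
        double blt_startup_us = 480 * cycle_ns / 1000.0;           /* ~3.2 us                 */

        printf("routing delay: %.0f ns, BLT startup: %.1f us\n",
               route_ns, blt_startup_us);
        return 0;
    }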

25 T3D: Local and Remote Memory
- Local memory:
  - 16 or 64 MB DRAM per PE
  - Latency: 13 to 38 clock cycles (87 to 253 nsec)
  - Bandwidth: up to 320 MB/s
- Remote memory:
  - Directly addressable by the processor
  - Latency: 1 to 2 microseconds
  - Bandwidth: over 100 MB/s (measured in software)

26 T3D: Local and Remote Memory
- The T3D is a distributed shared memory machine
- All memory is directly accessible; no action is required by remote processors to formulate responses to remote requests
- NCC-NUMA: non-cache-coherent NUMA

27 T3D: Bisection Bandwidth
- The network moves data in packets with payload sizes of either one or four 64-bit words
- The bisection bandwidth of a 1024-PE T3D is 76 GB/s
  - 512 nodes = 8x8x8 torus, 64 nodes per frame; 4 x 64 x 300 MB/s
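The arithmetic behind the 76 GB/s figure, following the 4 x 64 x 300 formula on the slide (reading the factor 4 as the number of links crossing the bisecting plane per cut position is an assumption):

    /* Bisection bandwidth of the 512-node (8x8x8) T3D torus,
       per the slide's 4 x 64 x 300 MB/s formula.             */
    #include <stdio.h>

    int main(void)
    {
        int cut_positions = 8 * 8;   /* 64 node positions in the bisecting plane         */
        int links_factor  = 4;       /* assumed: links per cut position crossing the cut */
        int link_mb_s     = 300;     /* peak channel rate                                */

        printf("bisection bandwidth: %.1f GB/s\n",
               links_factor * cut_positions * link_mb_s / 1000.0);   /* 76.8 GB/s */
        return 0;
    }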

28 T3E Node
- E-registers
- Alpha processor, 4-issue (2 integer + 2 floating point)
- 600 Mflop/s (300 MHz)

29 Cluster
- Network of Workstations (NOW)
- Cluster of Workstations (COW)
- Pile of PCs (POPC)

30 Clusters of Workstations
- Several workstations connected by a network
  - Connected with Fast/Gigabit Ethernet, ATM, FDDI, etc.
  - Some software layer tightly integrates all resources
- Each workstation is an independent machine

31 Cluster
- Advantages:
  - Cheaper
  - Easy to scale
  - Coarse-grain parallelism (traditionally)
- Disadvantages of clusters:
  - Longer communication latency compared with other parallel systems (traditionally)

32 ATM Cluster (Fore SBA-200)
- Cluster node: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, ...
- NI location: I/O bus
- Communication processor: Intel i960, 33 MHz, 128 KB RAM
- Peak bandwidth: 19.4 MB/s or 77.6 MB/s per port
- HKU: PearlCluster (16-node), SRG DP-ATM Cluster ($-node, 16.2 MB/s)

33 Myrinet Cluster
- Cluster node: Intel Pentium II, Pentium SMP, SGI, Sun SPARC, ...
- NI location: I/O bus
- Communication processor: LANai, 25 MHz, 128 KB SRAM
- Peak bandwidth: 80 MB/s --> 160 MB/s

34 Conclusion
- Many current network interfaces employ a dedicated processor to offload communication tasks from the main processor
- Overlapping computation with communication improves performance
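A minimal sketch of that overlap idea, using nonblocking MPI calls as a stand-in for the asynchronous primitives these systems provide: the send is started, the main processor keeps computing while the communication processor moves the data, and the send is completed later.

    /* Start a nonblocking send, compute while the communication hardware
       moves the data, then wait before reusing the buffer.               */
    #include <mpi.h>

    void exchange_and_compute(double *send_buf, int n, int dest,
                              double *local, int m)
    {
        MPI_Request req;
        MPI_Isend(send_buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD, &req);

        for (int i = 0; i < m; i++)           /* useful work overlaps the transfer */
            local[i] = 2.0 * local[i] + 1.0;

        MPI_Wait(&req, MPI_STATUS_IGNORE);    /* complete the send before send_buf is reused */
    }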

35 Paragon
- Main processor: 50 MHz i860 XP, 75 Mflop/s
- NI location: memory bus (64-bit, 400 MB/s)
- Communication processor: 50 MHz i860 XP -- a processor
- Peak bandwidth: 175 MB/s (16-bit link, 1 DMA engine)

36 SP2
- Main processor: 66.7 MHz POWER2, 266 Mflop/s
- NI location: I/O bus (32-bit Micro Channel)
- Communication processor: 40 MHz i860 XR -- a processor
- Peak bandwidth: 40 MB/s (8-bit link, 40 MHz)

37 T3D
- Main processor: 150 MHz DEC Alpha AXP, 150 MFLOPS
- NI location: memory bus (320 MB/s local; 100 MB/s remote)
- Communication processor: controller (BLT) -- hardware circuitry
- Peak bandwidth: 300 MB/s (16-bit data per link at 150 MHz)