Slide 1: Network Connected Multiprocessors [Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005]

Slide 2: Review: Bus Connected SMPs (UMAs)
- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware for cache coherence and process synchronization
- Bus traffic and bandwidth limit scalability (to fewer than ~36 processors)
[Figure: processors, each with a private cache, sharing a single bus to memory and I/O]

Slide 3: Review: Multiprocessor Basics

                                              # of Processors
  Communication model   Message passing       8 to 2048
                        Shared address NUMA   8 to 256
                        Shared address UMA    2 to 64
  Physical connection   Network               8 to 256
                        Bus                   2 to 36

- Q1 – How do they share data?
- Q2 – How do they coordinate?
- Q3 – How scalable is the architecture? How many processors?

Slide 4: Network Connected Multiprocessors
- Either a single address space (NUMA and ccNUMA) with implicit processor communication via loads and stores, or multiple private memories with explicit message passing communication via sends and receives
  - The interconnection network supports interprocessor communication
[Figure: nodes, each with a processor, cache, and memory, connected by an Interconnection Network (IN)]

Slide 5: Communication in Network Connected Multiprocessors
- Implicit communication via loads and stores
  - hardware designers have to provide coherent caches and process synchronization primitives
  - lower communication overhead
  - harder to overlap computation with communication
  - more efficient to fetch remote data on demand via its address than to send it ahead of time in case it might be used (such a machine has distributed shared memory (DSM))
- Explicit communication via sends and receives
  - simplest solution for hardware designers
  - higher communication overhead
  - easier to overlap computation with communication
  - easier for the programmer to optimize communication
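To make the contrast concrete, here is a minimal sketch (not from the slides) using Python's multiprocessing module: one worker communicates implicitly through a shared-address counter it simply stores to, while the other communicates explicitly by sending a message over a pipe. The worker names and toy payload are illustrative only.

```python
# Minimal sketch contrasting implicit (shared-address) and explicit
# (message-passing) communication; illustrative only, not from the slides.
from multiprocessing import Process, Value, Pipe

def shared_address_worker(counter):
    # Implicit communication: an ordinary store to a shared location.
    with counter.get_lock():          # synchronization primitive (cf. Slide 5)
        counter.value += 1

def message_passing_worker(conn):
    # Explicit communication: the data is packaged and sent.
    conn.send("result from worker")
    conn.close()

if __name__ == "__main__":
    counter = Value("i", 0)           # shared "memory word"
    p1 = Process(target=shared_address_worker, args=(counter,))

    parent_conn, child_conn = Pipe()
    p2 = Process(target=message_passing_worker, args=(child_conn,))

    p1.start()
    p2.start()
    p1.join()
    print("shared counter:", counter.value)   # read with a load, no receive needed
    print("received:", parent_conn.recv())    # explicit receive
    p2.join()
```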

Slide 6: Cache Coherency in NUMAs
- For performance reasons we want to allow shared data to be stored in caches
- Once again there are multiple copies of the same data, with the same address, in different processors
  - bus snooping won't work, since there is no single bus on which all memory references are broadcast
- Directory-based protocols
  - keep a directory that is a repository for the state of every block in main memory (which caches have copies, whether it is dirty, etc.)
  - directory entries can be distributed (the sharing status of a block is always in a single known location) to reduce contention
  - the directory controller sends explicit commands over the IN to each processor that has a copy of the data
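As a rough illustration of the bookkeeping a directory keeps per block, the sketch below (a simplification with no transient states, not the protocol from the text) tracks a sharing state and a vector of which caches hold a copy, and returns the invalidate commands the controller would send over the IN.

```python
# Sketch of a directory entry for one memory block in a directory-based
# protocol; simplified (no transient states), for illustration only.
from enum import Enum

class State(Enum):
    UNCACHED = 0   # no cache has a copy
    SHARED = 1     # one or more caches hold clean copies
    MODIFIED = 2   # exactly one cache holds a dirty copy

class DirectoryEntry:
    def __init__(self, num_nodes):
        self.state = State.UNCACHED
        self.sharers = [False] * num_nodes   # which caches hold the block

    def read_miss(self, node):
        """Node requests a read copy; a MODIFIED owner would first be told
        (via a command over the IN) to write the block back."""
        self.sharers[node] = True
        self.state = State.SHARED
        return "send data to node %d" % node

    def write_miss(self, node):
        """Node requests exclusive ownership; all other sharers must be
        invalidated via explicit commands over the IN."""
        invalidate = [i for i, s in enumerate(self.sharers) if s and i != node]
        self.sharers = [False] * len(self.sharers)
        self.sharers[node] = True
        self.state = State.MODIFIED
        return invalidate   # nodes that must be sent invalidate commands

# Example: node 2 reads the block, then node 0 writes it,
# so node 2 must be invalidated.
entry = DirectoryEntry(num_nodes=4)
entry.read_miss(2)
print(entry.write_miss(0))   # [2]
```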

Slide 7: IN Performance Metrics
- Network cost
  - number of switches
  - number of (bidirectional) links on a switch to connect to the network (plus one link to connect to the processor)
  - width in bits per link, length of link
- Network bandwidth (NB) – represents the best case
  - bandwidth of each link * number of links
- Bisection bandwidth (BB) – represents the worst case
  - divide the machine into two parts, each with half the nodes, and sum the bandwidth of the links that cross the dividing line
- Other IN performance issues
  - latency on an unloaded network to send and receive messages
  - throughput – maximum number of messages transmitted per unit time
  - worst-case number of routing hops, congestion control, and delay
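Both bandwidth metrics can be computed straight from these definitions. The sketch below (illustrative, not from the slides) represents a topology as a list of bidirectional links, multiplies the link count by the per-link bandwidth for NB, and finds BB by trying every split of the nodes into two equal halves and taking the worst (minimum) bandwidth crossing the cut; the brute-force search is only feasible for small networks.

```python
# Network bandwidth (NB) and bisection bandwidth (BB) computed from their
# definitions on Slide 7; brute-force bisection search, so small N only.
from itertools import combinations

def network_bandwidth(links, link_bw):
    # Best case: every link transfers at once.
    return link_bw * len(links)

def bisection_bandwidth(nodes, links, link_bw):
    # Worst case: minimum, over all splits into two equal halves, of the
    # bandwidth of the links crossing the dividing line.
    best = None
    for half in combinations(nodes, len(nodes) // 2):
        half = set(half)
        crossing = sum(1 for a, b in links if (a in half) != (b in half))
        if best is None or crossing < best:
            best = crossing
    return link_bw * best

# Example: an 8-node ring (Slide 9 predicts NB = 8, BB = 2 for link_bw = 1).
n = 8
nodes = list(range(n))
ring_links = [(i, (i + 1) % n) for i in range(n)]
print(network_bandwidth(ring_links, 1))            # 8
print(bisection_bandwidth(nodes, ring_links, 1))   # 2
```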

Slide 8: Bus IN
- N processors, 1 switch, 1 link (the bus)
- Only 1 simultaneous transfer at a time
  - NB = link (bus) bandwidth * 1
  - BB = link (bus) bandwidth * 1
[Figure legend: processor node; bidirectional network switch]

Slide 9: Ring IN
- If a link is as fast as a bus, the ring is only twice as fast as a bus in the worst case, but is N times faster in the best case
- N processors, N switches, 2 links/switch, N links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * 2

Slide 10: Fully Connected IN
- N processors, N switches, N-1 links/switch, (N*(N-1))/2 links
- N simultaneous transfers
  - NB = link bandwidth * (N*(N-1))/2
  - BB = link bandwidth * (N/2)^2

Slide 11: Crossbar (Xbar) Connected IN
- N processors, N^2 switches (unidirectional), 2 links/switch, 2N^2 links
- N simultaneous transfers
  - NB = link bandwidth * N
  - BB = link bandwidth * N/2

Slide 12: [figure only; no transcript text]

Slide 13: Hypercube (Binary N-cube) Connected IN
- N processors, N switches, log N links/switch, (N log N)/2 links
- N simultaneous transfers
  - NB = link bandwidth * (N log N)/2
  - BB = link bandwidth * N/2
[Figure: a 2-cube and a 3-cube]
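A hypercube's links follow directly from the node numbers: node i connects to every node that differs from i in exactly one address bit, i.e. i XOR 2^d for each dimension d. The sketch below (illustrative) builds the link list for a 3-cube and checks the counts quoted above.

```python
# Build the links of a binary n-cube: node i connects to i ^ (1 << d)
# for every dimension d. Illustrative check of the counts on Slide 13.
def hypercube_links(dims):
    n = 1 << dims                       # N = 2^dims processors
    links = set()
    for node in range(n):
        for d in range(dims):
            neighbor = node ^ (1 << d)  # flip one address bit
            links.add((min(node, neighbor), max(node, neighbor)))
    return sorted(links)

links = hypercube_links(3)              # a 3-cube: N = 8
print(len(links))                       # (N log N)/2 = 12 links
print([b for a, b in links if a == 0])  # node 0's neighbors: [1, 2, 4]
```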

Slide 14: 2D and 3D Mesh/Torus Connected IN
- N processors, N switches, 2, 3, 4 (2D torus) or 6 (3D torus) links/switch, 4N/2 links or 6N/2 links
- N simultaneous transfers
  - NB = link bandwidth * 4N or link bandwidth * 6N
  - BB = link bandwidth * 2*N^(1/2) or link bandwidth * 2*N^(2/3)

Slide 15: Fat Tree
- N processors, log(N-1)*logN switches, 2 up + 4 down = 6 links/switch, N*logN links
- N simultaneous transfers
  - NB = link bandwidth * N log N
  - BB = link bandwidth * 4

Slide 16: Fat Tree
[Figure: a tree network with leaf nodes A, B, C, D]
- Trees are good structures. People use them all the time. Suppose we wanted to make a tree network.
- Any time A wants to send to C, it ties up the upper links, so B can't send to D.
  - The bisection bandwidth of a tree is terrible: 1 link, at all times
- The solution is to "thicken" the upper links.
  - More links as the tree gets thicker increases the bisection bandwidth
- Rather than design a bunch of N-port switches, use pairs

Slide 17: SGI NUMAlink Fat Tree
[Figure source: www.embedded-computing.com/articles/woodacre]

Slide 18: IN Comparison
- For a 64-processor system (Bus column filled in as an example; the rest is completed on the next slide):

                            Bus   Ring   Torus   6-cube   Fully connected
  Network bandwidth          1
  Bisection bandwidth        1
  Total # of switches        1
  Links per switch
  Total # of links           1

Slide 19: IN Comparison
- For a 64-processor system:

                            Bus    Ring    2D Torus   6-cube   Fully connected
  Network bandwidth          1      64       256        192         2016
  Bisection bandwidth        1       2        16         32         1024
  Total # of switches        1      64        64         64           64
  Links per switch                 2+1       4+1        6+1         63+1
  Total # of links (bidi)    1   64+64    128+64     192+64      2016+64
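The table's entries follow from the formulas on Slides 8-14. The sketch below (illustrative) recomputes the network bandwidth, bisection bandwidth, and switch counts for N = 64 with link bandwidth normalized to 1; the "+1" and "+64" terms in the table are the extra per-processor links, which the sketch leaves out.

```python
# Recompute the 64-processor comparison table from the NB/BB formulas on
# Slides 8-14; link bandwidth normalized to 1. Illustrative only.
from math import isqrt, log2

def metrics(n):
    logn = int(log2(n))
    return {
        "Bus":             dict(nb=1,                 bb=1,             switches=1),
        "Ring":            dict(nb=n,                 bb=2,             switches=n),
        "2D Torus":        dict(nb=4 * n,             bb=2 * isqrt(n),  switches=n),
        f"{logn}-cube":    dict(nb=n * logn // 2,     bb=n // 2,        switches=n),
        "Fully connected": dict(nb=n * (n - 1) // 2,  bb=(n // 2) ** 2, switches=n),
    }

for name, m in metrics(64).items():
    print(f"{name:16s} NB={m['nb']:5d}  BB={m['bb']:5d}  switches={m['switches']}")
```

Running it reproduces the first three rows of the table: NB = 1, 64, 256, 192, 2016 and BB = 1, 2, 16, 32, 1024.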

Slide 20: Network Connected Multiprocessors

  Machine          Proc             Proc Speed   # Proc     IN Topology                   BW/link (MB/sec)
  SGI Origin       R16000                        128        fat tree                      800
  Cray T3E         Alpha 21164      300 MHz      2,048      3D torus                      600
  Intel ASCI Red   Intel            333 MHz      9,632      mesh                          800
  IBM ASCI White   Power3           375 MHz      8,192      multistage Omega              500
  NEC ES           SX-5             500 MHz      640*8      640-xbar                      16000
  NASA Columbia    Intel Itanium2   1.5 GHz      512*20     fat tree, Infiniband
  IBM BG/L         Power PC 440     0.7 GHz      65,536*2   3D torus, fat tree, barrier

Slide 21: IBM BlueGene

                  512-node proto           BlueGene/L
  Peak Perf       1.0 / 2.0 TFlops/s       180 / 360 TFlops/s
  Memory Size     128 GByte                16 / 32 TByte
  Foot Print      9 sq feet                2500 sq feet
  Total Power     9 KW                     1.5 MW
  # Processors    512 dual proc            65,536 dual proc
  Networks        3D Torus, Tree, Barrier

Slide 22: A BlueGene/L Chip
[Block diagram: two 700 MHz PowerPC 440 CPUs, each with a double FPU and 32K/32K L1 caches; 2KB L2s; a 16KB multiport SRAM buffer; a 4MB ECC eDRAM L3 (128B line, 8-way assoc); network interfaces for Gbit ethernet, the 3D torus (6 in, 6 out, 1.4Gb/s links), the fat tree (3 in, 3 out, 2.8Gb/s links), and 4 global barriers; and a 144b DDR controller to 256MB at 5.5GB/s]

Slide 23: Networks of Workstations (NOWs) / Clusters
- Clusters of off-the-shelf, whole computers with multiple private address spaces
- Clusters are connected using the I/O bus of the computers
  - lower bandwidth than multiprocessors that use the memory bus
  - lower speed network links
  - more conflicts with I/O traffic
- Clusters of N processors have N copies of the OS, limiting the memory available for applications
- Improved system availability and expandability
  - easier to replace a machine without bringing down the whole system
  - allows rapid, incremental expandability
- Economy-of-scale advantages with respect to costs

Slide 24: Commercial (NOW) Clusters

  Machine          Proc             Proc Speed   # Proc    Network
  Dell PowerEdge   P4 Xeon          3.06 GHz     2,500     Myrinet
  eServer IBM SP   Power4           1.7 GHz      2,944
  VPI BigMac       Apple G5         2.3 GHz      2,200     Mellanox Infiniband
  HP ASCI Q        Alpha 21264      1.25 GHz     8,192     Quadrics
  LLNL Thunder     Intel Itanium2   1.4 GHz      1,024*4   Quadrics
  Barcelona        PowerPC 970      2.2 GHz      4,536     Myrinet

Slide 25: Summary
- Flynn's classification of processors: SISD, SIMD, MIMD
  - Q1 – How do processors share data?
  - Q2 – How do processors coordinate their activity?
  - Q3 – How scalable is the architecture (what is the maximum number of processors)?
- Shared address multiprocessors – UMAs and NUMAs
  - Scalability of bus connected UMAs is limited (to fewer than ~36 processors)
  - Network connected NUMAs are more scalable
  - Interconnection Networks (INs): fully connected, xbar, ring, mesh, n-cube, fat tree
- Message passing multiprocessors
- Cluster connected (NOWs) multiprocessors

