Presentation on theme: "Commercial Parallel Computer Architecture"— Presentation transcript:

1 Commercial Parallel Computer Architecture
♦ Commodity processor with commodity inter-processor connection: Clusters (loosely coupled)
  Processors: Pentium, Itanium, Opteron, Alpha
  Interconnects: GigE, Infiniband, Myrinet, Quadrics, SCI
  Examples: NEC TX7, HP Alpha
♦ Commodity processor with custom interconnect
  SGI Altix (Intel Itanium 2), Cray Red Storm (AMD Opteron)
♦ Custom processor with custom interconnect (tightly coupled)
  Cray X1, NEC SX-7, IBM Regatta, IBM Blue Gene/L
These categories range from loosely coupled (clusters) to tightly coupled (custom processor with custom interconnect).

2 Supercomputer examples
SGI Altix: the Columbia supercomputer at NASA's Advanced Supercomputing Facility at Ames Research Center. It is a 10,240-processor SGI Altix system made up of 20 nodes, each with 512 Intel Itanium 2 processors, running a Linux operating system. (Image caption: Black Hole Simulations.)
Other examples: Hitachi SR11000, NEC SX-7, Apple, Cray RedStorm, Cray BlackWidow, IBM Blue Gene/L

3 IBM Regatta p690+
41 SMP nodes with 32 processors each (1,312 total)
Processor type: Power4+, 1.7 GHz
Overall peak performance: 8.9 Teraflops
Linpack: 5.6 Teraflops
Main memory: 41 x 128 GB (5.2 TB aggregate)
Operating system: AIX 5.2
Fujitsu Primepower
16 SPARC64 processors, 1.35 GHz / 1.89 GHz
128 GB memory, 16 disks
2 x 8-way system boards
Solaris 8, 9, 10

4 Processors used in supercomputers and their performance
♦ Intel Pentium Xeon: 3.2 GHz, peak = 6.4 Gflop/s; Linpack 100 = 1.7 Gflop/s; Linpack 1000 = 3.1 Gflop/s
♦ AMD Opteron: 2.2 GHz, peak = 4.4 Gflop/s; Linpack 100 = 1.3 Gflop/s; Linpack 1000 = 3.1 Gflop/s
♦ Intel Itanium 2: 1.5 GHz, peak = 6 Gflop/s; Linpack 100 = 1.7 Gflop/s; Linpack 1000 = 5.4 Gflop/s
♦ HP PA-RISC
♦ Sun UltraSPARC IV
♦ HP Alpha EV68: 1.25 GHz, 2.5 Gflop/s
♦ MIPS R16000
Linpack: a standard benchmark that tests how fast your computer runs floating-point code (it solves a dense system of linear equations).
Gflop/s: one billion floating-point operations per second.
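The peak numbers above follow from clock rate times floating-point operations per cycle. A minimal sketch of that arithmetic in C, where the flops-per-cycle values are assumptions chosen to reproduce the peaks quoted on this slide (they are not stated on the slide itself):

    #include <stdio.h>

    /* Theoretical peak = clock rate (GHz) x floating-point operations per cycle.
       The flops-per-cycle values below are assumptions that reproduce the
       peak figures quoted on the slide. */
    int main(void) {
        struct { const char *name; double ghz; int flops_per_cycle; } cpu[] = {
            { "Intel Pentium Xeon", 3.2, 2 },   /* -> 6.4 Gflop/s */
            { "AMD Opteron",        2.2, 2 },   /* -> 4.4 Gflop/s */
            { "Intel Itanium 2",    1.5, 4 },   /* -> 6.0 Gflop/s */
        };
        for (int i = 0; i < 3; i++)
            printf("%-20s peak = %.1f Gflop/s\n",
                   cpu[i].name, cpu[i].ghz * cpu[i].flops_per_cycle);
        return 0;
    }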

5 Inter-processor connection technologies
♦ Gig Ethernet  ♦ Myrinet  ♦ Infiniband  ♦ QsNet  ♦ SCI

Technology         Switch topology   NIC cost   Cost per node   MPI latency (us)   1-way speed (MB/s)   Bi-dir speed (MB/s)
Gigabit Ethernet   Bus               $50        $100            30                 100                  150
SCI                Torus             $1,600     $1,600          5                  300                  400
QsNetII (R)        Fat tree          $1,200     $2,900          3                  880                  900
Myrinet (D card)   Clos              $595       $995            6.5                240                  480
Myrinet (E card)   Clos              $995       $1,395          6                  450                  900
IB 4X              Fat tree          $1,000     $1,400          6                  820                  790

More detail…
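Latency and one-way bandwidth figures like those in the table are typically measured with an MPI ping-pong microbenchmark between two nodes. A minimal sketch, assuming an installed MPI implementation; the message size and repetition count are arbitrary choices, not values from the slide:

    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* MPI ping-pong: ranks 0 and 1 bounce a message back and forth.
       Half the average round-trip time approximates the one-way time;
       bytes divided by that time approximates one-way bandwidth. */
    int main(int argc, char **argv) {
        int rank, reps = 1000, bytes = 1 << 20;   /* 1 MB messages (arbitrary) */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        char *buf = malloc(bytes);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < reps; i++) {
            if (rank == 0) {
                MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
                MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double one_way = (MPI_Wtime() - t0) / (2.0 * reps);
        if (rank == 0)
            printf("one-way time %.2f us, bandwidth %.1f MB/s\n",
                   one_way * 1e6, bytes / one_way / 1e6);
        free(buf);
        MPI_Finalize();
        return 0;
    }

Run with at least two ranks, e.g. mpirun -np 2 ./pingpong; the same loop with very small messages approximates the MPI latency column.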

6 Tree, fat tree
Tree network: there is only one path between any pair of processors.
Fat-tree network: the number of communication links increases close to the root, so the root level has more physical connections (and hence more bandwidth).
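One way to see the difference is to count the total upward link capacity at each level. A small sketch, assuming a binary tree with 16 leaf processors and an idealized fat tree whose link capacity doubles at every level toward the root (a common textbook idealization, not a figure from this slide):

    #include <stdio.h>

    /* Upward capacity per level in a plain binary tree vs. an ideal fat tree.
       Leaves are at level 0.  In the plain tree every link has capacity 1,
       so total capacity shrinks toward the root; in the ideal fat tree link
       capacity doubles per level, so total capacity stays constant. */
    int main(void) {
        int depth = 4;                 /* assumed: 2^4 = 16 leaf processors */
        int leaves = 1 << depth;
        printf("level  upward links  plain-tree capacity  fat-tree capacity\n");
        for (int level = 0; level < depth; level++) {
            int links = leaves >> level;           /* one upward link per node */
            int plain = links;                     /* capacity 1 each */
            int fat   = links * (1 << level);      /* capacity doubles toward root */
            printf("%5d  %12d  %19d  %17d\n", level, links, plain, fat);
        }
        return 0;
    }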

7 Torus topology
Also known as a wrapped-around-mesh topology: a mesh whose edges wrap around, so every node has neighbors in each dimension.
Figures: a three-dimensional mesh; a mesh with wraparound.
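With wraparound, neighbor coordinates are computed modulo the dimension size, so the farthest node along a dimension of size d is only d/2 hops away instead of d-1. A minimal sketch; the 8 x 8 x 8 torus size is an arbitrary assumption:

    #include <stdio.h>

    /* Neighbors and maximum hop count on a 3-D torus with wraparound. */
    #define DX 8   /* assumed torus dimensions */
    #define DY 8
    #define DZ 8

    int main(void) {
        int x = 0, y = 0, z = 0;                       /* an example node */
        printf("+x neighbor of (0,0,0): (%d,%d,%d)\n", (x + 1) % DX, y, z);
        printf("-x neighbor of (0,0,0): (%d,%d,%d)\n", (x - 1 + DX) % DX, y, z);
        printf("farthest node is %d hops away\n", DX / 2 + DY / 2 + DZ / 2);
        return 0;
    }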

8 Clos network
A Clos network is a kind of multistage switching network.
It has three stages, each consisting of a number of crossbar switches.
The middle stage has redundant switching boxes to reduce the probability of blocking.
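A classical result due to Clos is that a three-stage network with n inputs per first-stage crossbar is strictly non-blocking when the number of middle-stage crossbars m satisfies m >= 2n - 1; using fewer middle switches trades cost against blocking probability. A small sketch of that check; the example sizes are assumptions, not values from the slide:

    #include <stdio.h>

    /* Strict-sense non-blocking condition for a 3-stage Clos network:
       n inputs per ingress crossbar, m middle-stage crossbars. */
    int strictly_nonblocking(int n, int m) {
        return m >= 2 * n - 1;
    }

    int main(void) {
        int n = 8;                                     /* assumed inputs per ingress switch */
        for (int m = 8; m <= 16; m += 7)               /* try m = 8 and m = 15 */
            printf("n = %d, m = %2d -> %s\n", n, m,
                   strictly_nonblocking(n, m) ? "strictly non-blocking" : "may block");
        return 0;
    }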

9 Myrinet
Made by the Myricom company; the first Myrinet appeared in 1994.
An alternative to Ethernet for connecting the nodes in a cluster.
Operates entirely in user space, with no operating-system delays.
Myrinet switch: 10 Gbit/s, $12,800; Clos networks up to 128 host ports.
10G PCI Express NIC with fiber connectors.

10 QsNetII network
By Quadrics (formed in 1996); uses a fat-tree topology.
QsNetII scales up to 4,096 nodes; each node may have multiple CPUs.
Designed for use within SMP systems.
MPI latency on a standard AMD Opteron starts at 1.22 μs; bandwidth on Intel Xeon EM64T is 912 MB/s.
QsNetII E-Series 128-way switch.

11 BlueGene/L nodes
Each chip contains two nodes.
Each node is a PPC440 processor.
Each node has 512 MB of local memory.
Each node runs a lightweight OS with MPI.
Each node runs one user process; there is no context switching at a node.

12 BlueGene/L interconnection
Uses five networks:
– GigE for I/O nodes, to external systems
– A control network using Fast Ethernet
– A 3-D torus for node-to-node message passing; it handles the majority of application traffic (MPI messaging). Longest path: 64 hops
MPI software is highly customized:
– A collective network for broadcasting
– A barrier network

13 BlueGene/L interconnection networks
3-dimensional torus
– Interconnects all compute nodes (65,536)
– Virtual cut-through hardware routing
– 1.4 Gb/s on all 12 node links (2.1 GB/s per node)
– 1 μs latency between nearest neighbors, 5 μs to the farthest
– 4 μs latency for one hop with MPI, 10 μs to the farthest
Global tree
– Interconnects all compute and I/O nodes (1,024)
– One-to-all broadcast functionality
– Reduction operations functionality
– 2.8 Gb/s of bandwidth per link
– Latency of one-way tree traversal: 2.5 μs
– ~23 TB/s total binary-tree bandwidth (64K machine)
Ethernet
– Incorporated into every node ASIC
– Active in the I/O nodes (1:64)
– All external communication (file I/O, control, user interaction, etc.)
Low-latency global barrier and interrupt
– Latency of round trip: 1.3 μs
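The "longest path: 64 hops" figure from slide 12 is consistent with the 65,536 compute nodes forming a 64 x 32 x 32 torus, since with wraparound the farthest node is half of each dimension away. A small sanity check; the 64 x 32 x 32 shape is an assumption consistent with these numbers, not stated on the slides:

    #include <stdio.h>

    /* Check the BlueGene/L torus figures, assuming a 64 x 32 x 32 torus. */
    int main(void) {
        int dx = 64, dy = 32, dz = 32;
        printf("compute nodes: %d\n", dx * dy * dz);                  /* 65536 */
        printf("longest path: %d hops\n", dx / 2 + dy / 2 + dz / 2);  /* 64 */
        return 0;
    }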


