1 TCCluster: A Cluster Architecture Utilizing the Processor Host Interface as a Network Interconnect
Heiner Litz, University of Heidelberg

2 Motivation
- Future trends
  - More cores: 2-fold increase per year [Asanovic, 2006]
  - More nodes: 200,000+ nodes for Exascale [Exascale Rep.]
- Consequence
  - Exploit fine-grain parallelism
  - Improve serialization/synchronization
- Requirement
  - Low-latency communication

3 Motivation
- Latency lags bandwidth [Patterson, 2004]
- Memory vs. network:
  - Memory bandwidth 10 GB/s vs. network bandwidth 5 GB/s: a 2x gap
  - Memory latency 50 ns vs. network latency 1 µs: a 20x gap

4 State of the Art
[Figure: design space spanned by scalability and latency, plotting Infiniband and Ethernet clusters, QuickPath/HyperTransport SMPs, SW DSM, Larrabee and Tilera; TCCluster is positioned to combine high scalability with low latency.]

5 Observation
- Today's CPUs represent complete cluster nodes:
  - Processor cores
  - Switch
  - Links

6 Approach
- Use the host interface as the interconnect
- Tightly Coupled Cluster (TCCluster)

7 Background
- Coherent HyperTransport (cHT)
  - Shared-memory SMPs
  - Cache-coherency overhead
  - Max. 8 endpoints
  - Table-based routing (by nodeID)
- Non-coherent HyperTransport (ncHT)
  - Subset of cHT
  - I/O devices, Southbridge, ...
  - PCI-like protocol
  - "Unlimited" number of devices
  - Interval routing (by memory address); a lookup of this kind is sketched below
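Interval routing selects an output link by testing which configured address range a packet's target address falls into, instead of consulting a per-node routing table. A minimal sketch of such a lookup in C; the base/limit values and link numbers are purely illustrative, not taken from the original system.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical address-interval-to-link table, in the spirit of the
 * Opteron MMIO base/limit registers (all values are illustrative). */
struct route_entry {
    uint64_t base;   /* first address covered by this interval */
    uint64_t limit;  /* last address covered by this interval  */
    int      link;   /* output HT link for this interval       */
};

static const struct route_entry routes[] = {
    { 0x100000000ULL, 0x13FFFFFFFULL, 0 },  /* 4-5 GB -> link 0 */
    { 0x140000000ULL, 0x17FFFFFFFULL, 1 },  /* 5-6 GB -> link 1 */
};

/* Return the output link for an address, or -1 for local memory. */
static int route_lookup(uint64_t addr)
{
    for (size_t i = 0; i < sizeof(routes) / sizeof(routes[0]); i++)
        if (addr >= routes[i].base && addr <= routes[i].limit)
            return routes[i].link;
    return -1;
}

int main(void)
{
    uint64_t addr = 0x140000010ULL;
    printf("0x%llx -> link %d\n", (unsigned long long)addr, route_lookup(addr));
    return 0;
}
```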

8 Approach
- Processors pretend to be I/O devices
- Partitioned global address space
- Communication via PIO writes to MMIO (a sketch follows below)
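Once a neighbor's memory appears as an MMIO window in the local physical address space, sending data is just an ordinary store through a pointer. A minimal user-space sketch, assuming a hypothetical /dev/tcc character device that exposes the window via mmap; the device name and the window size are assumptions, not part of the original system.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define WINDOW_SIZE (1UL << 20)   /* illustrative 1 MB remote window */

int main(void)
{
    /* Hypothetical character device exported by the TCCluster driver. */
    int fd = open("/dev/tcc", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    /* Map the peer's remotely writable region; the programming model never
     * reads remote memory, so the mapping is only ever stored to. */
    volatile uint64_t *remote = mmap(NULL, WINDOW_SIZE, PROT_WRITE,
                                     MAP_SHARED, fd, 0);
    if (remote == MAP_FAILED) { perror("mmap"); return 1; }

    /* A plain store through this pointer becomes a posted ncHT write
     * that lands directly in the other node's DRAM. */
    remote[0] = 0xdeadbeefULL;

    munmap((void *)remote, WINDOW_SIZE);
    close(fd);
    return 0;
}
```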

9 Routing
[Figure only on this slide.]

10 Programming Model
- Remote Store programming model (RSM)
- Each process has local private memory
- Each process supports remotely writable regions
- Sending by storing to remote locations (see the sketch after this list)
- Receiving by reading from local memory
- Synchronization through serializing instructions
- No support for bulk transfers (DMA)
- No support for remote reads
- Emphasis on locality and low-latency reads
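A minimal sketch of the send/receive pattern under this model, assuming x86 and a message-slot layout invented for illustration: the sender stores the payload and then a flag into the receiver's remotely writable region, and the receiver polls the flag in its own local DRAM.

```c
#include <immintrin.h>   /* _mm_sfence() */
#include <stdint.h>

/* One message slot inside a remotely writable region. The receiver owns
 * this memory; the sender has it mapped through the ncHT MMIO window.
 * The layout is illustrative, not taken from the original system. */
struct slot {
    volatile uint64_t payload;
    volatile uint64_t flag;     /* 0 = empty, 1 = full */
};

/* Sender side: 'remote' points into the receiver's writable region. */
void rsm_send(struct slot *remote, uint64_t value)
{
    remote->payload = value;
    _mm_sfence();               /* drain write-combining buffers so the   */
    remote->flag = 1;           /* payload arrives no later than the flag */
    _mm_sfence();
}

/* Receiver side: 'local' is ordinary local DRAM, polled with plain loads. */
uint64_t rsm_recv(struct slot *local)
{
    while (local->flag == 0)
        ;                       /* spin until the sender's store arrives */
    uint64_t value = local->payload;
    local->flag = 0;            /* mark the slot empty again */
    return value;
}
```

The sfence plays the role of the serializing instruction mentioned above: it keeps the payload store from being observed after the flag store.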

11 Implementation
- 2x two-socket quad-core Shanghai in Tyan boxes
[Figure: BOX 0 and BOX 1, each with its own Southbridge and HTX slot, connected back to back by a 16-bit ncHT link at 3.6 Gbit, plus a shared Reset/PWR connection.]

12 Implementation
[Figure only on this slide.]

13 Implementation
- Software-based approach
- Firmware
  - Coreboot (LinuxBIOS)
  - Link de-enumeration
  - Force links non-coherent
  - Link frequency & electrical parameters
- Driver
  - Linux-based
  - Topology & routing
  - Manages remotely writable regions (see the driver sketch below)
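One way such a Linux driver could hand a remotely writable window to user space is with a character device whose mmap handler maps the neighbor node's MMIO range write-combined. A minimal sketch: the device name (matching the hypothetical /dev/tcc used earlier), the physical base and the window size are assumptions; a real driver would derive them from the northbridge base/limit registers.

```c
#include <linux/fs.h>
#include <linux/io.h>
#include <linux/miscdevice.h>
#include <linux/mm.h>
#include <linux/module.h>

/* Physical base/size of the neighbor's MMIO window (illustrative values). */
#define TCC_REMOTE_BASE 0x140000000ULL
#define TCC_REMOTE_SIZE (1UL << 20)

static int tcc_mmap(struct file *file, struct vm_area_struct *vma)
{
    unsigned long size = vma->vm_end - vma->vm_start;

    if (size > TCC_REMOTE_SIZE)
        return -EINVAL;

    /* Write-combining: stores are posted and may be merged on the link. */
    vma->vm_page_prot = pgprot_writecombine(vma->vm_page_prot);

    return remap_pfn_range(vma, vma->vm_start,
                           TCC_REMOTE_BASE >> PAGE_SHIFT,
                           size, vma->vm_page_prot);
}

static const struct file_operations tcc_fops = {
    .owner = THIS_MODULE,
    .mmap  = tcc_mmap,
};

static struct miscdevice tcc_dev = {
    .minor = MISC_DYNAMIC_MINOR,
    .name  = "tcc",
    .fops  = &tcc_fops,
};

module_misc_device(tcc_dev);
MODULE_LICENSE("GPL");
```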

14 Memory Layout
[Figure: physical address maps of BOX 0 and BOX 1 with marks at 0, 4, 5 and 6 GB, showing the local DRAM of node 0 and node 1 mapped write-back (WB), the remote MMIO window mapped write-combining (WC), the remotely writable region mapped uncacheable (UC), and the DRAM hole on BOX 1.]

15 Bandwidth – HT800 (16 bit)
[Figure: bandwidth plot.]
- Single-thread message rate: 142 million messages/s (a sketch of such a measurement follows below)
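A message-rate measurement of this kind could simply time a loop of small stores into the mapped window. A sketch, assuming 'win' comes from a mapping like the hypothetical /dev/tcc one above; the iteration count and window indexing are arbitrary.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

#define ITERS (100u * 1000u * 1000u)

/* Messages per second a single thread can issue into a mapped window;
 * 'win' would come from an mmap of the remote window as sketched above. */
double message_rate(volatile uint64_t *win)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint32_t i = 0; i < ITERS; i++)
        win[i & 4095] = i;          /* one 8-byte message per store */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    return ITERS / secs;            /* messages per second */
}
```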

16 Latency – HT800 (16 bit)
[Figure: latency plot.]
- 227 ns software-to-software half round-trip (a ping-pong measurement of this kind is sketched below)
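A half-round-trip figure like this is typically obtained by ping-pong: each side stores a counter into the peer's remotely writable region and spins on its own local copy, and the averaged round-trip time is halved. A sketch under those assumptions; the peer is assumed to run the mirror-image loop, echoing each value back.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <time.h>

#define ROUNDS 1000000u

/* Ping-pong between two nodes: 'remote_flag' is a word in the peer's
 * remotely writable region, 'local_flag' is a word in our own local DRAM
 * that the peer echoes into. Half of the averaged round-trip time is the
 * one-way, software-to-software latency. */
double half_roundtrip_ns(volatile uint64_t *remote_flag,
                         volatile uint64_t *local_flag)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (uint64_t i = 1; i <= ROUNDS; i++) {
        *remote_flag = i;           /* ping: store into the peer        */
        while (*local_flag != i)    /* pong: wait for the peer's echo   */
            ;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    return ns / ROUNDS / 2.0;       /* half round-trip in nanoseconds */
}
```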

17 Conclusion
- Introduced a novel tightly coupled interconnect
- "Virtually" moved the NIC into the CPU
- Order-of-magnitude latency improvement
- Scalable
- Next steps:
  - MPI over RSM support
  - A custom mainboard with multiple links

18 References
- [Asanovic, 2006] Asanovic K., Bodik R., Catanzaro B., Gebis J. et al. The Landscape of Parallel Computing Research: A View from Berkeley. UC Berkeley Technical Report, 2006.
- [Exascale Rep.] ExaScale Computing Study: Technology Challenges in Achieving Exascale Systems.
- [Patterson, 2004] Patterson D. Latency Lags Bandwidth. Communications of the ACM, vol. 47, no. 10, pp. 71-75, October 2004.

19 Routing
- Traditional system: all nodes have the same view of memory
[Figure: a cHT fabric in which the nodes' DRAM covers the ranges x00-x0F, x10-x1F, x20-x2F and x30-x3F, and I/O devices reached over ncHT cover x50-x5F.]

20 Routing
- Our approach: each CPU has its own view of memory, with a single coherent node 0 and four I/O links
[Figure: each node sees its own DRAM in ranges such as x00-x0F, x10-x1F, x20-x2F and x30-x3F, while the neighboring nodes' DRAM is reached over ncHT as plain address ranges.]

21 Routing in the Opteron Fabric
- The type of an HT packet (posted, non-posted, cHT, ncHT) is determined by the SRQ based on:
  - MTRRs
  - GART
  - Top-of-memory register
  - I/O and DRAM range registers
- The routing is determined by the northbridge based on:
  - Routing table registers
  - MMIO base/limit registers (a sketch of how they could be inspected follows below)
  - Coherent link traffic distribution register
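For illustration, the MMIO base/limit pairs could be dumped from user space by reading the PCI config space of the first node's address-map function (00:18.1). This is a sketch only: the config offsets (0x80 onward) and bit fields are assumptions recalled from the Family 10h BKDG and should be verified there, and the sysfs path depends on the platform.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* PCI config space of the first Opteron's address-map function (00:18.1).
 * Offsets 0x80..0xB8 are assumed to hold the 8 MMIO base/limit pairs;
 * verify offsets and bit layout against the BKDG before relying on them. */
#define NB_F1_CONFIG "/sys/bus/pci/devices/0000:00:18.1/config"
#define MMIO_BASE0   0x80
#define NUM_PAIRS    8

int main(void)
{
    int fd = open(NB_F1_CONFIG, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    for (int i = 0; i < NUM_PAIRS; i++) {
        uint32_t base, limit;
        if (pread(fd, &base,  sizeof(base),  MMIO_BASE0 + 8 * i)     != sizeof(base) ||
            pread(fd, &limit, sizeof(limit), MMIO_BASE0 + 8 * i + 4) != sizeof(limit))
            break;

        /* Bits [31:8] are assumed to hold address bits [39:16]; the low
         * bits of the limit register select the destination node/link. */
        printf("pair %d: base=0x%010llx limit=0x%010llx dst=0x%02x\n",
               i,
               ((unsigned long long)(base  & 0xFFFFFF00u)) << 8,
               ((unsigned long long)(limit & 0xFFFFFF00u)) << 8,
               limit & 0xFFu);
    }

    close(fd);
    return 0;
}
```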

22 Transaction Example
1. Core 0 performs a write to an I/O address; the SRQ forwards it to the X-Bar.
2. The X-Bar forwards it to the I/O bridge, which converts it into a posted write.
3. The X-Bar forwards the posted write to the I/O link.
4. On the receiving node, the X-Bar forwards it to the I/O bridge, which converts it into a coherent sized write (sizedWr).
5. The X-Bar forwards it to the memory controller.

23 Topology and Addressing
[Figure: a grid of nodes numbered 1-36 illustrating interval routing. One node routes the address intervals 1-12 to Top, 13-15 to Left, 17-18 to Right and 19-24 to Down; another routes 1-30 to Top, 31-34 to Left, 36 to Right and nothing to Down.]
- Limited possibilities, as the Opteron only supports 8 address range registers

24 Limitations
- Communication is PIO only: no DMA, no offloading
- No congestion management, no hardware barriers, no multicast, limited QoS, etc.
- Synchronous system: all Opterons require the same clock, so no COTS boxes
- Security issue: nodes can write directly to physical memory on any other node
- Posted writes to remote memory do not have the coherency bit set; no local caching possible?

25 How does it work?
- Minimalistic Linux kernel (MinLin)
- 100 MB, runs in a ramdisk
- Boots over Ethernet or FILO
- Mounts homes over ssh
- PCI subsystem included, to access the NB configuration
- Multicore/multiprocessor supported
- No hard disk, VGA, keyboard, ...
- No module support, no device drivers
- No swapping/paging

