Presentation on theme: "NUMA machines and directory cache mechanisms"— Presentation transcript:
1 NUMA machines and directory cache mechanisms. AMANO, Hideharu. Textbook pp. 70–79.
2 NUMA (Non-Uniform Memory Access model): provides shared memory whose access latency and bandwidth differ depending on the address. Usually a PU's own memory module is easy to access, but modules attached to other PUs are not. All shared memory modules are mapped into a single logical address space, so a program written for a UMA machine works without modification. Also called a machine with Distributed Shared Memory ⇔ a machine with Centralized Shared Memory (UMA).
3 The model of NUMA: nodes 0–3 are connected by a network, and all of their memory modules form a single (unique) address space.
5 Variations of NUMA. Simple NUMA: cache coherence is not kept by the hardware (CM*, Cenju, T3D, RWC-1, Earth Simulator). CC-NUMA (Cache Coherent NUMA): provides coherent caches (DASH, Alewife, Origin, Synfinity NUMA, NUMA-Q, recent servers like POWER7). COMA (Cache Only Memory Architecture): no home memory (DDM, KSR-1).
7 Simple NUMA: a PU can access memory attached to other PUs/clusters, but cache coherence is not maintained. Simple hardware; software cache-support functions are sometimes provided. Suitable for connecting a large number of PUs, i.e. supercomputers: Cenju, T3D, Earth Simulator, IBM BlueGene, Roadrunner. Why do recent top supercomputers take the simple NUMA structure? Easy programming for a wide variety of applications, and a powerful interconnection network.
8 CM* (CMU, the late 1970s): one of the roots of multiprocessors. Clusters CM00 ... CM09 of PDP-11 compatible processors. Slocal is an address-translation mechanism; Kmap is a kind of switch.
9 Cray's T3D: a simple NUMA supercomputer (1993), using the Alpha 21064.
12 The fastest computer is also a simple NUMA. (From the IBM web site.)
13 Supercomputer K: SPARC64 VIIIfx chips, each with 8 cores sharing an L2 cache and a core controller; Tofu interconnect (6-D torus/mesh) with an RDMA mechanism. NUMA, or UMA + NORMA. 4 nodes/board, 24 boards/rack, 96 nodes/rack.
14 Cell (IBM/SONY/Toshiba): a PPE (CPU core, IBM Power architecture, 2-way superscalar, 2-thread, 32KB+32KB L1 cache, 512KB L2 cache) and eight SPEs (Synergistic Processing Elements: SIMD cores, 128bit (32bit x 4), 2-way superscalar, each with a Local Store and DMA). The EIB (2+2 ring bus) connects them with the MIC (to external DRAM) and the BIC (Flex I/O). The Local Stores of the SPEs are mapped into the same address space as the PPE.
15 CC-NUMA: a directory management mechanism is required for coherent caches. Early CC-NUMAs used hierarchical buses. Complete hardwired logic: Stanford DASH, MIT Alewife, Origin, Synfinity NUMA. Management processor: Stanford FLASH (MAGIC), NUMA-Q (SCLIC), JUMP-1 (MBP-light). Recently, CC-NUMAs using multicore nodes are widely used.
16 Ultramax (Sequent Co.): an early NUMA using a hierarchical bus, a hierarchical extension of bus-connected multiprocessors with shared memory and caches at each level. The hierarchical bus bottlenecks the system.
17 Stanford DASH: a root of recent CC-NUMAs. A 2-D mesh with the Caltech router connects clusters (SGI Power Challenge, PU00–PU03), each with a directory and main memory. Directory-based coherence control, point-to-point connection, release consistency.
18 SGI Origin: main memory is connected to the Hub Chip directly. Bristled hypercube network; one cluster consists of 2 PEs.
25 Distributed cache management of CC-NUMA: the cache directory is provided in the home memory, and cache coherence is kept by messages between PUs. Invalidation-type protocols are commonly used. The protocol itself is similar to those used in snoop caches, but everything must be managed with message transfers.
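The message-based management above can be sketched in a few lines of code. This is a minimal illustrative model, not the protocol of any specific machine: one home node holds a full-map directory entry with MSI-like states, and a list of logged messages stands in for real interconnect traffic.

```python
# Minimal sketch of directory-based coherence (assumed MSI-like states).
class DirectoryLine:
    def __init__(self):
        self.state = "U"       # U = uncached, S = shared, D = dirty
        self.sharers = set()   # full map: which nodes hold a copy

class Home:
    """The home node: keeps the directory and exchanges messages with PUs."""
    def __init__(self):
        self.line = DirectoryLine()
        self.messages = []     # log standing in for interconnect traffic

    def read(self, node):
        line = self.line
        if line.state == "D":
            owner = next(iter(line.sharers))
            self.messages.append(("WriteBackReq", owner))  # fetch dirty copy
            line.state = "S"
        line.sharers.add(node)
        if line.state == "U":
            line.state = "S"
        self.messages.append(("Data", node))               # cache line to reader

    def write(self, node):
        line = self.line
        for sharer in line.sharers - {node}:
            if line.state == "D":
                self.messages.append(("WriteBackReq", sharer))
            else:
                self.messages.append(("Invalidate", sharer))
        line.sharers = {node}                              # writer is sole owner
        line.state = "D"
        self.messages.append(("Data", node))

home = Home()
home.read(1)    # node 1 reads:  U -> S, sharers {1}
home.read(2)    # node 2 reads:  still S, sharers {1, 2}
home.write(2)   # node 2 writes: invalidate node 1, S -> D, sharers {2}
print(home.line.state, sorted(home.line.sharers))   # D [2]
```

Note that every state change that a snoop cache would observe for free on the bus here costs an explicit message in the log.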
30 Cache coherent control (Node 2 reads): Node 2 sends a read request to the home node (Node 0). The directory shows the line is Dirty in Node 1, so the home sends a write-back request to Node 1; after the write-back the line becomes Shared (D → S) and the cache line is sent to Node 2.
31 Cache coherent control (Node 2 writes): Node 2 sends a write request to the home node (Node 0). The line is Dirty in Node 1, so the home sends a write-back request to Node 1, whose copy becomes Invalid (D → I); the cache line is then sent to Node 2, which becomes the new Dirty owner.
32 Quiz: show the states of the cache in each node and the directory of the home memory in a CC-NUMA when the memory in node 0 is accessed in the following order: Node 1 reads; Node 2 reads; Node 1 writes; Node 2 writes.
33 Triangle data transfer: on Node 2's write, the home (Node 0) sends Node 1 a write-back request addressed to Node 2, so Node 1 transfers the line directly to Node 2 (D → I at Node 1; Node 2 becomes Dirty). The three messages form a triangle. MESI- or MOSI-like protocols can be implemented, but the performance is not improved much.
34 Synchronization in CC-NUMA: simple indivisible operations (e.g. Test&Set) increase traffic too much. Test and Test&Set is effective but not sufficient: after an invalidation message is sent, traffic concentrates around the home node. Queue-based lock: a linked list of lock requesters is formed using the directory for cache management, and only the node that obtains the lock is informed.
35 Traffic congestion caused by Test and Test&Set(x) (Node 3 executes the critical section): Node 3 has set x from 0 to 1 and entered the critical section; Nodes 0, 1, and 2 keep polling their Shared copies of x = 1.
36 Traffic congestion caused by Test and Test&Set(x) (Node 3 finishes the critical section): Node 3 writes x = 0 to release the lock; invalidation messages go out, the copies in Nodes 0, 1, and 2 become Invalid, and the line becomes Dirty in Node 3.
37 Traffic congestion caused by Test and Test&Set(x) (waiting nodes issue the request): the invalidated nodes all re-read x at once, and their requests concentrate at the home node.
38 Queue-based lock: requesting a lock. While node 3 holds the lock, the requests from node 1 and node 2 are linked into a waiting list through the directory's lock pointer.
39 Queue-based lock: releasing the lock. When the holder releases the lock, the lock pointer hands it to the next node in the list; only that node is informed.
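The request/release behaviour of the two slides above can be sketched as a small queue-lock model. The class and field names are illustrative, not taken from any real machine: the directory keeps a lock pointer to the tail of a linked list of waiting nodes, and a release informs only the next node in the queue instead of invalidating every waiter.

```python
# Minimal sketch of a queue-based (linked-list) lock.
class QueueLock:
    def __init__(self):
        self.tail = None     # directory's lock pointer (tail of the queue)
        self.next = {}       # per-node link to the next waiter
        self.granted = []    # order in which nodes obtained the lock

    def acquire(self, node):
        prev = self.tail             # swap ourselves into the lock pointer
        self.tail = node
        if prev is None:
            self.granted.append(node)   # lock was free: enter at once
        else:
            self.next[prev] = node      # otherwise queue behind prev

    def release(self, node):
        succ = self.next.pop(node, None)
        if succ is not None:
            self.granted.append(succ)   # inform only the next waiter
        elif self.tail == node:
            self.tail = None            # no one is waiting: lock is free

lock = QueueLock()
for n in [0, 1, 2]:
    lock.acquire(n)   # node 0 gets the lock; nodes 1 and 2 queue behind it
lock.release(0)       # only node 1 is notified
lock.release(1)       # only node 2 is notified
lock.release(2)
print(lock.granted)   # [0, 1, 2]
```

Each release sends exactly one notification, so the burst of simultaneous re-reads seen with Test and Test&Set does not occur.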
40 Directory structure. Directory methods: full map directory, limited pointer, chained directory, hierarchical bit-map. Recent CC-NUMAs with multicore nodes are small scale, so the simple full map directory is preferred; the number of cores in a node is increasing rather than the number of nodes.
41 Full map directory: one bit per node records whether that node holds a copy. If the system is large, a large directory memory is required. Used in Stanford DASH.
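A full-map entry is just a bit vector, one bit per node. This small sketch (function names are illustrative) makes the memory-cost point concrete: the directory entry for every memory line grows linearly with the number of nodes.

```python
# Full-map directory entry as a bit vector: bit n set = node n holds a copy.
def set_sharer(bitmap, node):
    return bitmap | (1 << node)

def clear_sharer(bitmap, node):
    return bitmap & ~(1 << node)

def sharers(bitmap, num_nodes):
    return [n for n in range(num_nodes) if bitmap & (1 << n)]

bitmap = 0
bitmap = set_sharer(bitmap, 1)
bitmap = set_sharer(bitmap, 2)
print(sharers(bitmap, 4))    # [1, 2]

# Directory overhead per memory line: one bit per node, so it scales
# linearly with the machine size.
def overhead_bits(num_nodes):
    return num_nodes

print(overhead_bits(1024))   # 1024 bits per line for a 1024-node machine
```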
43 Limited pointer: a limited number of pointers is used, since the number of nodes sharing a line is usually not large (from profiling of parallel programs). If the number of sharers exceeds the pointers: invalidate one copy (eviction), broadcast messages, or call management software (LimitLess). Used in MIT Alewife.
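One of the overflow options above, falling back to broadcast, can be sketched as follows. This is an illustrative model, not the exact Alewife entry format: the entry holds up to k node pointers, and when a new sharer would exceed k it sets an overflow flag meaning "potentially everyone".

```python
# Sketch of a limited-pointer directory entry with broadcast fallback.
class LimitedPointerEntry:
    def __init__(self, k):
        self.k = k
        self.pointers = []
        self.broadcast = False    # overflow flag: sharers no longer tracked

    def add_sharer(self, node):
        if self.broadcast or node in self.pointers:
            return
        if len(self.pointers) < self.k:
            self.pointers.append(node)
        else:
            self.broadcast = True  # too many sharers: remember "everyone"

    def invalidation_targets(self, num_nodes):
        if self.broadcast:
            return list(range(num_nodes))   # must broadcast the invalidation
        return list(self.pointers)

entry = LimitedPointerEntry(k=2)
for n in [0, 1]:
    entry.add_sharer(n)
print(entry.invalidation_targets(8))   # [0, 1]
entry.add_sharer(5)                    # a third sharer overflows the pointers
print(entry.invalidation_targets(8))   # [0, 1, 2, 3, 4, 5, 6, 7]
```

The trade-off is visible directly: as long as the profiling assumption holds (few sharers), invalidations stay precise; past the limit, every node pays for the overflow.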
44 Linked list: note that the pointers are provided in the caches; the home memory points to the first sharer, and each sharer points to the next.
45 Linked list: pointers are provided in each cache, so the memory requirement is small, but the latency of following the pointer chain often becomes large. Improved method: tree structure. SCI (Scalable Coherent Interface).
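The chained directory can be sketched in the style of SCI (a simplified model, not the real SCI state machine): the home memory keeps only a head pointer, each cache holds a pointer to the next sharer, and an invalidation must walk the chain hop by hop, which is exactly where the latency comes from.

```python
# Sketch of a chained (linked-list) directory.
class ChainedDirectory:
    def __init__(self):
        self.head = None   # home memory: pointer to the first sharer only
        self.next = {}     # per-cache pointer to the next sharer

    def add_sharer(self, node):
        self.next[node] = self.head   # a new reader links in at the head
        self.head = node

    def invalidate_all(self):
        hops = 0
        node = self.head
        while node is not None:       # walk the chain node by node:
            hops += 1                 # latency grows with the sharer count
            node = self.next.pop(node)
        self.head = None
        return hops

d = ChainedDirectory()
for n in [0, 1, 3]:
    d.add_sharer(n)
print(d.invalidate_all())   # 3 hops, one per sharer
```

The directory memory at the home is a single pointer per line regardless of machine size, the mirror image of the full-map trade-off; the tree-structured improvement mentioned above shortens the walk from linear to logarithmic depth.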
50 COMA (Cache Only Memory Architecture): no home memory; every memory behaves like a cache (though it is not an actual cache). Cache lines gather at the clusters that require them, so optimal data allocation is achieved dynamically without special care. On a miss, the target line must be searched for. DDM, KSR-1.
51 DDM (Data Diffusion Machine): on a miss, first check the node's own cluster; if the line is not there, go upward in the hierarchy and search.
53 Summary: Simple NUMA is used for large-scale supercomputers. Recent servers use the CC-NUMA structure, in which each node is a multicore SMP; directory-based cache coherence protocols are used between L3 caches. This style has become the mainstream of large-scale servers.
54 Exercise: show the states of the cache in each node and the directory of the home memory in a CC-NUMA when the memory in node 0 is accessed in the following order: Node 1 reads; Node 3 reads; Node 1 writes; Node 2 writes; Node 3 writes.