High Performance Computing
Firmware Communications
Master Program in Computer Science and Networking
Dr. Gabriele Mencagli, PhD, Department of Computer Science, University of Pisa
18/09/2018
Contents
- Average distance in interconnection networks
- Definitions of stream and sub-stream
- Streaming at the firmware level
- Store-and-forward flow control
- Wormhole flow control
- Switch units at the firmware level
- Evaluation of base latencies in multiprocessors: memory access latency, inter-processor communication
Parts of the book covered by this lecture: Sections 10.5, 18.2, 21.1, 21.2, 21.3, and Part 3 of the Appendix.
Pipelined Firmware Communications
Goal of this Lecture
Goal: after this lecture we will know how to evaluate the base latency of firmware messages, for any multiprocessor architecture.
Several firmware messages are transmitted between PEs, memory macro-modules, and UCs. Firmware messages (request, reply) of different types are generated:
- during the interpretation of LOAD and STORE instructions;
- during the execution of an inter-processor communication via I/O.
Typical example (cache block reading, i.e., a LOAD causing a cache fault):
- Block read request from PE0 to M: a header (1 word: src id, dst id, msg_type) plus the physical address of the block (2 words). Total = 3 words.
- Block read reply from M to PE0: a header (1 word: src id, dst id, msg_type) plus the words of the block. Total = 9 words.
These are different requests, each with a request-reply behavior. Conceptually, FW units act as clients or servers depending on the interaction type.
Average Distance
At the firmware level a system is a graph of interconnected units (e.g., P, C1, C2, W, MINFs, SW, WW). The distance between two units U1 and U2 is the number of units along the shortest path between U1 and U2.
Examples (from the figure): diameter = 6, D(U1, U11) = 4.
Based on the properties of the system, the distance may be equal for all the paths; otherwise an average distance can be calculated according to the probabilities of the different paths. In the example of the figure, the probability-weighted combination of the path lengths gives D_avg(U1, U11) = 5.25.
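The calculation is just a weighted mean. A minimal sketch in Python, with hypothetical path lengths and probabilities (chosen only to reproduce the value 5.25; the actual paths are those of the figure, not reproduced here):

    # Average distance = probability-weighted mean of the path lengths.
    # The (length, probability) pairs below are illustrative assumptions.
    paths = [(6, 0.25), (5, 0.75)]
    d_avg = sum(length * p for length, p in paths)
    print(d_avg)  # 5.25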
Distance by Composition
In some cases the specification of a system defines some units explicitly, while for some sub-systems only the (average) distance is known.
Example (figure): the diameter of the graph can be approximated (at best, with the information we have) as D ≈ d_net + 4 = 7.
This case will be common in our analysis of multiprocessors: we have to compute the average distance between two units (e.g., a PE and a memory macro-module), and we have a complex interconnection network of which only the average distance is known.
Firmware Streaming
At the firmware level many streams (information flows) are generated and transmitted by the units composing the system, according to the different operations/functionalities they execute.
Each stream is composed of several sub-streams, where all the stream elements of a sub-stream follow exactly the same path from the source to the destination. Different sub-streams (of the same stream) may follow different paths (if they exist) between the same source and destination.
Terminology (figure): a message is a stream; a packet is a sub-stream; the stream elements are, e.g., words (32 bits, 64 bits, or more).
Our goal is to evaluate the latency of a sub-stream. In the following, the term stream is used in place of sub-stream unless the distinction is necessary.
Paths and Hop Latency
A packet is the unit of routing, i.e., all the words of a packet are forwarded along the same path.
Example (figure): a LOAD causes a fault in C1 and in C2; the request (1 stream of 3 words) follows the path P, C1, C2, W, interconnection network, MINF, Im, M, and the reply (1 stream of 9 words) follows the reverse path.
We call hop latency the latency of a "hop", i.e., a routing step that utilizes a sub-system composed of a unit and an output link. It is calculated as:
T_hop = τ + T_tr
We suppose (and this is realistic) that each unit along the path takes the routing decision and performs the forwarding action in one clock cycle.
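In the small snippets accompanying these notes (a convention of these notes, not of the book), all latencies are expressed as multiples of the clock cycle τ, so the hop latency reduces to:

    # Hop latency in units of tau: T_hop = tau + T_tr.
    def t_hop(Ttr_in_tau, tau=1):
        return tau + Ttr_in_tau

    print(t_hop(2))  # 3, i.e., T_hop = 3*tau when T_tr = 2*tau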
Base and Under-Load Latency
Stream latency: the time spent from the transmission of the first word of the stream (by the source) to the reception of the last word by the destination.
In this lecture we study the base latency, i.e., the latency measured without taking into account the possible contention in the network switches and in the memory. The base latency only depends on:
- the architectural features (e.g., type of interconnection network, clock speed of the units);
- the process mapping strategies (critical with distance-sensitive networks like meshes/cubes).
The under-load latency also takes contention into account (network contention in the switches and links, access contention in the memory macro-modules) and may depend on the program and on RTS strategies (e.g., shared data mapping). Goal: ρ < ρ_critical.
Flow Control
Two important aspects of the network behavior are flow control and routing.
Important: interconnection networks for parallel architectures must execute the routing and flow control algorithms directly at the firmware level, since the large overhead of traditional approaches (like TCP/IP) cannot be tolerated in this case. In the case of distributed-memory architectures, more advanced networks are available in a double version: with firmware-level protocol and with TCP/IP protocol.
What is flow control? Definition: flow control denotes the set of techniques to manage the network resources (switches and links) and their buffering capacity.
Unit of routing: in any flow control strategy, all the words belonging to the same packet must travel exactly the same path.
We will study two flow control strategies:
- store-and-forward flow control;
- wormhole flow control (and its variants).
Store-and-Forward
Property: the packet must be completely received by an intermediate node (switch) before the switch starts forwarding it to one of its output links (according to the routing strategy).
Effect: the transmission of a packet is not pipelined. However, if a message is composed of different packets, they are transmitted in pipeline (possibly following different paths, depending on the routing).
Example (figure): a packet of 4 words traverses two switches towards the destination over links of width 1 word; each unit buffers the whole packet before taking the routing decision.
While the size of a packet can vary (depending on the operation requested), the link width is fixed (one or a few words). All the words of a packet are transmitted word-by-word.
Stream Latency
Description: a packet of size S = 10^3 words (32 bits) is routed along a path from unit U_i to unit U_j in a network with average distance d_net = 14, hence d = 16 including the source and destination units. The firmware communication uses single-buffering interfaces with T_tr = 2τ.
Problem: find the stream latency for transmitting the packet from the source unit to the destination unit, supposing that the interconnection network adopts the store-and-forward flow control strategy.
Simplification: the transmission latency is the same for all the links, and the calculation time of each switch is 1τ.
We will show that the stream latency with store-and-forward flow control is L(S) ~ O(d·S).
Store-and-Forward Cost Model
Consider a smaller example with S = 2 and d = 4, in order to draw explicitly the temporal pattern of the communication. With T_tr = 2τ:
L(2) = 3·(3τ + 3·T_tr) = 27τ
In general, with any S ≥ 1 and d ≥ 1, each of the d − 1 hops costs
(S − 1)·(2τ + 2·T_tr) + (τ + T_tr) = (2S − 1)·(τ + T_tr)
since, with single buffering, each of the first S − 1 words also waits for the acknowledgement before the next word can be sent, while the last word only pays τ + T_tr. The general cost model is:
L(S) = (d − 1)·(2S − 1)·(τ + T_tr)
Exercise: with S = 10^3, d = 16 and T_tr = 2τ, L(10^3) = 15·1999·3τ ≈ 10^5 τ. Indeed, L(S) ~ O(d·S).
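A quick computational check of the model (a sketch; latencies in units of τ):

    def l_store_and_forward(S, d, Ttr_in_tau, tau=1):
        # Store-and-forward, single buffering:
        # L(S) = (d - 1) * (2S - 1) * (tau + T_tr)
        return (d - 1) * (2 * S - 1) * (tau + Ttr_in_tau)

    print(l_store_and_forward(S=2, d=4, Ttr_in_tau=2))       # 27 (= 27*tau)
    print(l_store_and_forward(S=10**3, d=16, Ttr_in_tau=2))  # 89955 ~ 1e5*tau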
Wormhole Flow Control
Goal: reduce the stream latency by better exploiting the effect of pipelined communications (also within a single packet).
Idea: each packet is composed of smaller units called flits (equal to the link width). All the flits of the same packet are forwarded along the same path; however, flits are treated as stream elements to be transmitted in pipeline.
Consequences: buffering is not applied to all the flits of a packet before routing; instead, the buffering unit is one single flit. For each switch:
- the first flit is the packet header (e.g., 32 bits with fields SRC, DST, TYPE, LEN), containing the information needed to correctly decide the packet routing;
- all the next flits of the same packet are forwarded using the same output link.
Wormhole Cost Model
Consider a smaller example with S = 2 and d = 4, in order to draw explicitly the temporal pattern of the communication with wormhole flow control. With T_tr = 2τ:
L(2) = 3·(τ + T_tr) + 2·(τ + T_tr) = 15τ
The header takes d − 1 hops to reach the destination, and with single buffering each further flit adds 2·(τ + T_tr). In general, with any S ≥ 1 and d ≥ 1, the stream latency cost model is:
L(S) = (2S + d − 3)·(τ + T_tr)
The cost model can also be applied to double-buffering communications, obtaining:
L(S) = (S + d − 2)·(τ + T_tr)
Exercise: with S = 10^3, d = 16 and T_tr = 2τ, single buffering gives L(10^3) = 2013·3τ ≈ 0.6·10^4 τ. Indeed, L(S) ~ O(d + S).
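The two wormhole variants as functions (a sketch; latencies in units of τ):

    def l_wormhole_single(S, d, Ttr_in_tau, tau=1):
        # Wormhole, single buffering: L(S) = (2S + d - 3) * (tau + T_tr)
        return (2 * S + d - 3) * (tau + Ttr_in_tau)

    def l_wormhole_double(S, d, Ttr_in_tau, tau=1):
        # Wormhole, double buffering: L(S) = (S + d - 2) * (tau + T_tr)
        return (S + d - 2) * (tau + Ttr_in_tau)

    print(l_wormhole_single(2, 4, 2))       # 15 (matches the drawn example)
    print(l_wormhole_single(10**3, 16, 2))  # 6039 ~ 0.6e4*tau
    print(l_wormhole_double(10**3, 16, 2))  # 3042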
Wormhole Variants
In traditional wormhole flow control, in case of a temporary block of the output link (e.g., the ack is delayed), the "worm" of flits is distributed backward in the switches preceding the congested one.
An alternative approach is called virtual cut-through: in case of a blocking condition, all the flits are received and buffered by the congested switch, as in store-and-forward. This reduces the congestion on the switches along the path.
Extensions of the Cost Model
Expressed in terms of the hop latency, the wormhole cost models are:
- single buffering: L(S) = (2S + d − 3)·T_hop
- double buffering: L(S) = (S + d − 2)·T_hop
So far we made some initial assumptions:
- all the links have the same transmission latency;
- all the units have a calculation time of one clock cycle;
- the stream length remains constant from unit to unit.
These constraints can easily be removed. For instance, suppose that units U11, U12, U13, U14 are integrated in the same chip, where T_tr ≈ 0: the hop latency inside the chip is T_hop = τ, while off-chip T_hop = τ + T_tr. The stream latency can then be approximated by using the maximum T_hop in the cost model of pipelined firmware communications, as in the sketch below.
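A sketch of this approximation (the hop-latency values below are illustrative):

    # Heterogeneous path approximated with the maximum hop latency
    # (double buffering): L(S) ~= (S + d - 2) * max(T_hop_i).
    def l_mixed(S, d, hop_latencies_in_tau):
        return (S + d - 2) * max(hop_latencies_in_tau)

    # e.g., three on-chip hops (T_hop = tau), two off-chip (T_hop = 2*tau)
    print(l_mixed(S=10, d=6, hop_latencies_in_tau=[1, 1, 1, 2, 2]))  # 28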
Extensions of the Cost Model
Suppose we have a pipeline of two systems Σ1 and Σ2: the first processes a stream of length S1, of which S2 elements are also processed by the second system (S2 < S1).
Case 1: the latency of the whole system is L(S1) = L1(S1), i.e., the latency of Σ2 is hidden by the pipelining with Σ1.
Case 2: the latency of the whole system can be approximated as L(S1) = L1(1) + L2(S2) ≈ L2(S2), i.e., Σ2 starts working as soon as the first element is produced by Σ1, and its latency dominates. A numeric sketch follows.
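Both cases instantiated with the double-buffered wormhole model (stream lengths and distances below are illustrative assumptions):

    # L1 and L2 are both modeled as double-buffered wormhole latencies.
    def L(S, d, t_hop_in_tau=1):
        return (S + d - 2) * t_hop_in_tau

    S1, S2, d1, d2 = 100, 80, 5, 5
    case1 = L(S1, d1)              # L1(S1) = 103*tau
    case2 = L(1, d1) + L(S2, d2)   # L1(1) + L2(S2) = 87*tau ~ L2(S2)
    print(case1, case2)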
Wormhole Switch
The structure of a wormhole switch unit (2x2) is illustrated in the figure. The switch is decomposed into two independent units, each one a unidirectional 2x2 switch. All the firmware communication interfaces are asynchronous (see Part 3 of the Appendix).
Switch Microprogram (Sketched)
Firmware units are programmed using a micro-language. Each micro-program is composed of a set of micro-instructions of the form:

    i. switch(C1, ..., CN) do {
         (0, ..., 0) <operations>, goto j;
         ...
         (1, ..., 1) <operations>, goto k;
       }

The microprogram of the unidirectional switch must:
- test non-deterministically both input interfaces in every clock cycle;
- if a message is present on one of the interfaces, in the first clock cycle read the first flit (the header) and determine the output link (routing); then receive and transmit all the flits of the message, one per clock cycle, onto the same link;
- during this phase, at every clock cycle, also check the presence of a header on the other input interface: if the routing is compatible (a different output link), the switch can receive and forward two flits per clock cycle;
- otherwise, if the second packet requires the same output link, its header remains in the input interface until the transmission of all the flits of the first packet has been completed.
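The same policy expressed as a behavioral sketch in Python (an illustration of the rules above, not the micro-language itself; the flit format and the queue-based interfaces are assumptions):

    from collections import deque

    def clock_cycle(inputs, outputs, bound):
        """One clock cycle of a unidirectional 2x2 wormhole switch:
        bind headers to output links, then forward at most one flit per
        bound input; two flits per cycle flow when the outputs differ."""
        for i, q in enumerate(inputs):
            if not q:
                continue
            flit = q[0]
            if i not in bound:                    # flit is a header
                out = flit["dst_link"]            # routing decision
                if out in bound.values():         # same output busy:
                    continue                      # header waits in the interface
                bound[i] = out
            outputs[bound[i]].append(q.popleft()) # forward one flit
            if flit.get("last"):                  # tail flit releases the link
                del bound[i]

    # Two packets with compatible routing: both advance every cycle.
    ins = [deque([{"dst_link": 0}, {"dst_link": 0, "last": True}]),
           deque([{"dst_link": 1}, {"dst_link": 1, "last": True}])]
    outs = [deque(), deque()]
    bound = {}
    for _ in range(3):
        clock_cycle(ins, outs, bound)
    print(len(outs[0]), len(outs[1]))  # 2 2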
Idea of the Switch Unit Design
The figure exemplifies the functionalities implementable in one clock cycle. The Control Part is a sequential circuit (combinational logic plus state registers) that receives condition variables (e.g., routing functions, RDYs, ACKs) and produces commands (e.g., the alphas controlling the multiplexer selections). The Operating Part contains the data path: the inputs IN1 and IN2, a multiplexer, a register R, and the output OUT.
Switch Complexity
We give only a very general idea of the design complexity of a monolithic switch unit operating as in the previous slide (i.e., with wormhole flow control). A firmware unit is composed of a Control Part and an Operating Part, and the number of states of the Control Part is equal to the number of micro-instructions.
How many states for our wormhole 2x2 switch? The Control Part states include:
S1 = No(IN1), No(IN2)
S2 = IN1 -> OUT1
S3 = IN1 -> OUT2
S4 = IN2 -> OUT1
S5 = IN2 -> OUT2
S6 = IN1 -> OUT1, IN2 -> OUT2
... (optimizations are possible)
The design complexity of the microprogram of such a unit is exponential in the number N of input interfaces, i.e., O(α^N) with α ≥ 2 a constant. This is the most serious problem of a monolithic implementation of large switch units, besides other technological problems like pin count. Therefore, a modular decomposition of switches is strongly needed, both per se and for limited-degree networks in general.
Base Latency Evaluation
Space of Solutions
The techniques studied in the previous part can be applied to evaluate the base latency (in the absence of contention) of various operations in a firmware architecture. The cost models apply to a very large space of architectural solutions, studied in the second part of the course.
Some notable cases will be studied explicitly. The cache coherence (CC) parts will be explained in the second part of this course. The following slide shows a set of assumptions that students must remember!
Assumptions
Some notable cases will be studied explicitly. We will make the following assumptions during our study:
- 2-level C2-C1 private caches per PE (inclusive, on-demand, with σ1 = 8 and σ2 = 128 data words);
- wormhole networks with 1-word (32 bits) link width, minimal deterministic routing;
- interleaved external memory macro-modules with m = σ1 internal modules; physical addresses of 48 bits;
- all the units with the same clock cycle τ, except the memory modules, for which we assume (unless otherwise noted) τ_M = 20τ;
- intra-chip link transmission latency T_tr ≈ 0, inter-chip T_tr ≈ τ;
- all the firmware communication interfaces use the double-buffering technique.
Important statement: the numerical values have no conceptual meaning. Although realistic, they are used to show the application of the cost models, to understand the impact of the latencies, and to do our exercises!
Streamed Firmware Messages
Operations in multiprocessor architectures consist of firmware messages exchanged between a source and a destination unit. A request (or reply) is a firmware message (packet) composed of several words. In general:
- inside a PE or inside a macro-module, a firmware message has no specific streamed representation: all the bits are transmitted using parallel links;
- firmware messages crossing the interconnection network are streams of flits, of which the first one is the message header while the others are data flits.
Header fields: type (5 bits), src (5 bits), dst (5 bits), length (8 bits).
Firmware messages are formatted in the interface units, i.e., PE-W, CMP-WW, and the memory macro-module interface unit Im. (The figure shows the structure of a Processing Element: CPU with primary cache (I+D), MMUs, secondary cache, the W unit towards the interconnection network, UC and local I/O units, external memory interface, I/O interface, and interrupt arbiter.)
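As a concrete illustration, the header flit above fits in a 32-bit word. A sketch of the packing (the field order and the unused padding bits are assumptions; the slides only give the field widths):

    def make_header(msg_type, src, dst, length):
        # 5 + 5 + 5 + 8 = 23 bits used out of a 32-bit flit.
        assert msg_type < 2**5 and src < 2**5 and dst < 2**5 and length < 2**8
        return (msg_type << 18) | (src << 13) | (dst << 8) | length

    h = make_header(msg_type=1, src=0, dst=7, length=9)  # e.g., a 9-word reply
    print(f"{h:032b}")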
Service on a C1-block Basis
Memory read requests: a PE executes a LOAD instruction for a cache block X which is in neither its C1 nor its C2 (it must be transferred from the main memory).
SMP case: suppose that σ1 = 8 and σ2 = 128, so each C2 block includes 16 contiguous C1 blocks. Suppose that we have 16 macro-modules, each having σ1 internal modules.
- If the macro-modules are mutually interleaved, a C2-block read request involves q = σ2/σ1 macro-modules: the interleaved memory is not well exploited, and in general this may be a source of additional congestion.
- Analogously, if the macro-modules are sequentially interleaved, a C2 block is read by a single macro-module as a continuous and indivisible stream of q = σ2/σ1 C1 blocks: this unacceptably increases the waiting time (the service time of the macro-module is longer).
We assume that the memory service is executed on a C1-block basis: after a fault in C2, the q C1 blocks are requested to the memory one at a time, as independent requests.
CMP Memory Base Latency
Request length S_req = 3 words, reply length S_reply = 9 words.
General cost model (assume T_tr = 1τ off-chip):
L_read-C1(σ1) = L_read-req + L_read-reply
where:
L_read-req = (S_req + d − 2)·T_hop
L_read-reply = τ_M + (S_reply + d − 2)·T_hop
On a single CMP:
- path: C1-C2-W-INT_NET-MINF-Im-M;
- distance: d = 6 + d_net = 7;
- max T_hop = 2τ.
L_read-C1(σ1) = (3 + 5)·2τ + 20τ + (9 + 5)·2τ = 64τ
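The same cost model as a small helper, reused for the configurations of the next slides (a sketch; latencies in units of τ, parameters from the assumptions slide):

    def l_read_c1(d, S_req=3, S_reply=9, T_hop_in_tau=2, tau_M_in_tau=20):
        # Base latency of a C1-block read: request + memory access + reply,
        # double-buffered wormhole communications.
        l_req = (S_req + d - 2) * T_hop_in_tau
        l_reply = tau_M_in_tau + (S_reply + d - 2) * T_hop_in_tau
        return l_req + l_reply

    print(l_read_c1(d=7))   # 64  -> single CMP, crossbar
    print(l_read_c1(d=10))  # 76  -> 16-ary 2-cube, NUMA-like mapping
    print(l_read_c1(d=14))  # 92  -> worst case / multi-CMP SMP
    print(l_read_c1(d=9))   # 72  -> multi-CMP NUMA, local memory
    print(l_read_c1(d=22))  # 124 -> multi-CMP NUMA, remote memory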
CMP Memory Base Latency
In the previous case the internal interconnect is a crossbar (d_net ≈ 1). For more parallel CMPs the internal network can be more complex.
Example: a CMP with a toroidal 16-ary 2-cube internal interconnect, with 256 switches, 240 PEs and 16 MINFs, and sequentially interleaved macro-modules. The distance depends on where the data structures are allocated in the internal interconnect.
Important: the CMP can be used as a NUMA machine, where the macro-module connected to MINF_i is the preferred one for the PEs in the i-th column. With this assumption we have d_net ≈ 4 (d = 6 + 4 = 10), and:
L_read-C1(σ1) = 76τ
In the worst case, in which each PE accesses all the macro-modules with the same probability (e.g., they are mutually interleaved), we have d_net ≈ 8 (d = 14), and:
L_read-C1(σ1) = 92τ
Multi-CMP SMP Memory Base Latency
Let us study the case of a multi-CMP with SMP organization.
Path: C1-C2-W-INT_NET-MINF-WW-EXT_NET-Im-M.
Each CMP-WW interfaces the CMP MINFs with the external interconnect. It must have the necessary bandwidth (e.g., it should be able to serve four requests per clock cycle in the best case; it can be implemented by multiple units).
Assumption: the internal interconnect (intra-CMP) is a crossbar, while the external interconnect is a k-ary n-fly butterfly (here, a 2-ary n-fly).
Size of the system: N = 256 PEs organized in 16 CMPs (each with 16 PEs and 4 MINFs).
Distance: d = 8 + d_net, with d_net = ?
Distance in a Butterfly
The butterfly is an indirect network with minimal distance between any two nodes. In a 2-ary n-fly (the figure shows a smaller example, a 2-ary 3-fly) the distance is d_net = n.
In our architecture we have 64 WWs on one side and 64 mutually interleaved memory macro-modules on the other, so we need a 2-ary 6-fly butterfly interconnect (the figure just gives the idea of the structure). Thus d_net = 6 and d = 8 + 6 = 14:
L_read-C1(σ1) = 92τ
Due to the large size of this multiprocessor system, the base latency is significantly higher than in the single-CMP case (92τ vs. 64τ).
Multi-CMP NUMA Local Memory Base Latency
Let us study the case of a multi-CMP NUMA with internal SMP organization.
Local memory path: C1-C2-W-INT_NET-MINF-WW-EXT_SMP_NET-Im-M.
Assumptions: the internal network of the CMP is a crossbar, the SMP external network (towards the local memory) is a crossbar, and the external NUMA network is a binary generalized fat tree based on a 2-ary n-fly.
Size of the system: N = 256 PEs organized in 16 CMPs (each with 16 PEs and 4 MINFs).
Local memory access, with distance d = 9:
L_read-C1(σ1) = 72τ
Slightly greater than the latency of a single-CMP system.
Multi-CMP NUMA Remote Memory Base Latency
The external NUMA network is a generalized fat tree.
Remote memory path: C1-C2-W-INT_NET-MINF-WW-EXT_NUMA_NET-WW-EXT_SMP_NET-Im-M.
The height of the tree is n = log2(64) = 6. It can be proven (page 296 of the book) that d_net ≈ δ·n, with 1 ≤ δ ≤ 2.
Without any information about the program mapping (i.e., under a uniform distribution) we have d_net ≈ 1.9·n ≈ 12, and in our example d ≈ 22:
L_read-C1(σ1) = 124τ
Interprocessor Communications
Both in SMP and NUMA architectures, PEs can communicate through their local I/O (UC). This kind of communication does not imply an explicit reply from the destination (request only): it is a pure synchronization message, though typically a few data words are transmitted (h ≥ 1). Therefore, the length of the stream is S = 1 + h.
Single-CMP case (crossbar internal interconnect):
- path: DM-UC-W-INT_NET-W-UC-IU, hence d = 7;
- T_tr ≈ 0, so T_hop = τ.
L_notify = (S + d − 2)·τ = (6 + h)·τ ≤ 10τ (for h ≤ 4)
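The notification latency as a helper (a sketch; latencies in units of τ), covering both this case and the multi-CMP cases of the next slides:

    def l_notify(h, d, T_hop_in_tau):
        # Base latency of an inter-processor notification:
        # a single request of S = 1 + h words, no reply.
        S = 1 + h
        return (S + d - 2) * T_hop_in_tau

    print(l_notify(h=4, d=7, T_hop_in_tau=1))   # 10 -> single CMP
    print(l_notify(h=4, d=24, T_hop_in_tau=2))  # 54 -> multi-CMP (<= 55*tau)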
Interprocessor Communications
Multi-CMP SMP case: 16 CMPs, each one with 4 parallel WWs, i.e., we have 64 leaves. For inter-processor communications the butterfly is used as a fat tree (the figure is illustrative, not the actual butterfly of this architecture).
Path: DM-UC-W-INT_NET-MINF-WW-EXT_NET-WW-MINF-INT_NET-W-UC-IU.
As in the previous case (with d_ext-net ≈ 1.9·n ≈ 12) we have d = 24, and T_hop = 2τ:
L_notify = (S + d − 2)·T_hop = (46 + 2h)·τ ≤ 55τ
Similarly, we can do the same evaluation for a multi-CMP NUMA architecture.
Interprocessor Communications
Multi-CMP NUMA case: 16 CMPs, each one with 4 parallel WWs, i.e., we have 64 leaves. As in the previous case (with d_numa ≈ 12) we have d = 24.
Path: DM-UC-W-INT_NET-MINF-WW-NUMA_EXT_NET-WW-MINF-INT_NET-W-UC-IU.
L_notify = (S + d − 2)·T_hop = (46 + 2h)·τ ≤ 55τ