Parallel Programming Platforms Chapter 2

1 Parallel Programming Platforms Chapter 2
Dagoberto A. R. Justo, PPGMAp, UFRGS

2 The Standard Serial Architecture: von Neumann model
A design model for a digital computer: CPU + memory. Instructions and data are mixed in the same memory, and a single stream of instructions and data flows between CPU and memory. (Diagram: CPU and memory exchanging instruction, data, and result streams.)

3 The Standard Serial Architecture : Harvard model
Separate storage for instructions and data, with one stream for instructions and one stream for data. In Flynn's taxonomy this is still a single-instruction, single-data (SISD) machine. (Diagram: CPU connected to separate instruction and data memories.)

4 The Standard Serial Architecture : Mixed model
One address space built from several memory banks; memory is divided into many pages, with separate pages for instructions and data. (Diagram: CPU connected to paged memory carrying instruction, data, and result streams.)

5 Inside CPU
Input/Output Unit: interacts with memory via LOAD/STORE over the address and data buses. ALU: performs the arithmetic and logic operations. Registers (A, B, C, D, E, F, G, H): store operands and results. Control Unit: the Instruction Register (IR), where the instruction is decoded before being executed; the Program Counter (PC), the address of the next instruction to be executed; and the Processor Status Word (PSW), flags describing the last operation (zero, nonzero, positive, negative, ...), used for conditional jumps. (Diagram: registers, ALU, control unit, and L1 cache inside the CPU.)

6 Adding a Cache Memory
A cache is an extra memory that is faster than RAM. (Diagram: CPU with an L1 cache inside it and an L2 cache between the CPU and the paged main memory.)

7 Assembly Language
LOAD A,nn : A = nn (load the immediate value nn)
LOAD B,#nnnn : B = the value stored at address #nnnn
STORE #nnnn,B : store B at address #nnnn
LOAD A,B : A = B
ADD A,B : A = A + B
SUB A,B : A = A - B
JUMP #nnnn : jump to address #nnnn (replace the address in PC with #nnnn)
JMPZ #nnnn : if the PSW indicates zero, jump to #nnnn
JMPNZ #nnnn : if the PSW indicates nonzero, jump to #nnnn
The processor status word (PSW) stores information about the last operation, such as whether the result was positive or zero, or produced a carry.

8 Basic Stages in a Microprocessor
Programming model. A simple program: D=2; E=3; F=D+E. Compiled, it becomes:
Line 1: LOAD A,#C000
Line 2: LOAD B,#C002
Line 3: ADD A,B
Line 4: STORE #C004,A
Inside the CPU an infinite loop is executed:
loop: IF (instruction fetch); ID (instruction decode); OF (operand fetch); EX (execute instruction); WB (write back); end loop.
(Diagram: the program counter, registers A-D, and the machine-language bytes for each instruction.)

9 Execution with No Pipeline: Adding four numbers
It took 20 cycles to compute 6 instructions:
1. LOAD A,#C000
2. LOAD B,#C002
3. ADD A,#C004
4. ADD B,#C006
5. ADD A,B
6. STORE A,#C008
Each instruction runs its stages (IF, ID, OF, EX, WB) to completion before the next one starts. (Diagram: the stage timing over cycles 1-20.)

10 Instruction Pipeline: Adding four numbers
Pipelining overlaps the stages of successive instructions to achieve performance. The same 6 instructions (LOAD A,#C000; LOAD B,#C002; ADD A,#C004; ADD B,#C006; ADD A,B; STORE A,#C008) now take 9 cycles. In the best-case scenario, this pipeline can sustain a completion rate of one instruction per cycle. (Diagram: the overlapped IF/ID/OF/EX/WB stages over cycles 1-10.)

11 Limitations in Pipelining
The speed of a pipeline is limited by its slowest stage, so conventional processors use deep pipelines (20-stage pipelines in Pentium processors). In a typical program (at machine level), every fifth or sixth instruction is a conditional jump! This requires very accurate branch prediction. The penalty of a misprediction grows with the depth of the pipeline, since a larger number of in-flight instructions must be flushed.

12 Superscalar Execution: Issue Mechanisms
Consider:
1. LOAD A,#C000
2. ADD A,#C002
3. LOAD B,#C004
with a processor that can issue two instructions per cycle.
Simple model: instructions are issued in program order, and two instructions are issued in the same cycle only if they have no dependency; otherwise only one is issued. This is called in-order issue (limited performance).
Aggressive model: instructions can be issued out of order; here the first and third instructions can be co-scheduled. This is called dynamic issue.
(Diagram: the stage timing under in-order and dynamic issue.)

13 Superscalar Execution, example 1
One simple way of alleviating these bottlenecks is to use multiple pipelines, issuing two instructions at a time:
1. LOAD A,#C000
2. LOAD B,#C002
3. ADD A,#C004
4. ADD B,#C006
5. ADD A,B
6. STORE A,#C008
(Diagram: the two-way issue schedule; unused execution slots show up as horizontal and vertical waste.)

14 Superscalar Execution, example 2
Too much dependency: each instruction depends on the result of the previous one, so only one instruction is issued per cycle.
1. LOAD A,#C000
2. ADD A,#C002
3. ADD A,#C004
4. ADD A,#C006
5. STORE A,#C008
(Diagram: the resulting serialized schedule.)

15 Superscalar Execution, example 3
Almost the same as example 1, but the instructions are not ordered to expose independent pairs, so fewer of them can be co-scheduled.
1. LOAD A,#C000
2. ADD A,#C002
3. LOAD B,#C004
4. ADD B,#C006
5. ADD A,B
6. STORE A,#C008
(Diagram: the resulting schedule.)

16 Superscalar Execution
There is some wastage of resources due to dependencies.
True data dependency: the result of one operation is an input to the next.
Resource dependency: two operations require the same functional unit.
Branch dependency: scheduling instructions across a conditional branch cannot be done deterministically a priori.
Different instruction mixes with identical semantics can therefore take significantly different execution times.

17 Superscalar Execution
The scheduler (in hardware) looks at a large number of instructions in an instruction queue and selects appropriate instructions to execute concurrently. Even so, not all functional units can be kept busy at all times.
Vertical waste: during a cycle, no functional unit is utilized.
Horizontal waste: during a cycle, only some of the functional units are utilized.

18 Very Long Instruction Word (VLIW) Processors
Pre-processing is done by the compiler: VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently. These instructions are packed into one long instruction word and dispatched together. This approach was used in the Multiflow Trace machine (circa 1984), and variants are employed in the Intel IA-64 processors.

19 Very Long Instruction Word (VLIW) Processors: Considerations
(+) The compiler has a bigger context from which to select instructions to co-schedule.
(-) Compilers do not have runtime information such as cache misses, so scheduling is conservative.
(-) Branch and memory prediction is more difficult.
(-) VLIW performance is highly dependent on the compiler; techniques such as loop unrolling, speculative execution, and branch prediction are critical.

20 (Figure: the memory hierarchy; each CPU has registers and an L1 cache, L2 caches sit between the CPUs and several DRAM memory banks.)

21 Limitations of Memory System Performance
Bottleneck: memory access is often much slower than the processor. Memory performance is captured by two parameters:
Latency: the time from the issue of a memory request until the data is available at the processor.
Bandwidth: the rate at which data can be pumped to the processor by the memory system.

22 Bandwidth vs. Latency: Consider the example of a fire hose.
If the water comes out of the hose two seconds after the hydrant is turned on, the latency is two seconds. Once the water starts flowing, if the hydrant delivers water at 5 gallons/second, the bandwidth is 5 gallons/second. If you want immediate response, reduce latency; if you want to fight big fires, increase bandwidth.

23 Memory Latency: An Example
Consider a processor at 1 GHz (1 ns clock) with two multiply-add units, capable of executing four instructions in each 1 ns cycle, connected to DRAM with a latency of 100 ns (no caches). The peak processor rating is 4 GFLOPS. Since the memory latency equals 100 cycles and the block size is one word, every time a memory request is made the processor must wait 100 cycles before it can process the data.

24 Consider computing a dot-product of two vectors
A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch. It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!
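As a concrete illustration (not from the slides; the vector length and values below are made up), here is a minimal Fortran sketch of the loop being analyzed. With no cache, every iteration stalls on its element fetches, so the loop runs at memory speed rather than at the 4 GFLOPS peak:

  program dot_example
    implicit none
    integer, parameter :: n = 1024        ! illustrative vector length
    real :: a(n), b(n), s
    integer :: i
    a = 1.0
    b = 2.0
    s = 0.0
    ! One multiply-add per pair of elements fetched: with a 100 ns DRAM
    ! latency and no cache, each fetch stalls the processor, so the loop
    ! runs at roughly 10 MFLOPS instead of the 4 GFLOPS peak.
    do i = 1, n
      s = s + a(i)*b(i)
    end do
    print *, 'dot product =', s
  end program dot_example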

25 Improving Effective Memory Latency Using Caches
Caches are small, fast memory elements between the processor and DRAM; they act as low-latency, high-bandwidth storage. If a piece of data is repeatedly used, the cache reduces the effective latency of the memory system. Cache hit ratio: the fraction of data references satisfied by the cache; it largely determines the memory performance of a code.

26 Impact of Caches: Example
Consider the same architecture as before, plus a cache of size 32 KB with a latency of 1 ns (one cycle). Multiply two matrices A and B of dimensions 32 x 32 to produce C. Note that the cache is large enough to store A, B, and C.

27 Time to fetch the two matrices into the cache
Fetching the two input matrices (2K words) takes approximately 200 us. Multiplying two n x n matrices takes 2n^3 operations = 64K operations, or 16K cycles (16 us) at four instructions per cycle. The total time for the computation is therefore about 200 + 16 = 216 us, for a peak computation rate of 64K operations / 216 us, roughly 303 MFLOPS.

28 Impact of Caches
Repeated references to the same data item correspond to temporal locality. In our example, we had O(n^2) data accesses and O(n^3) computation; this asymptotic difference makes the example particularly well suited to caches. Data reuse is critical for cache performance.

29 Impact of Memory Bandwidth
Memory bandwidth is determined by the bandwidth of the memory bus as well as by the memory units. Memory bandwidth can be improved by increasing the size of memory blocks. The underlying system takes l time units (where l is the latency of the system) to deliver b units of data (where b is the block size).

30 Impact of Memory Bandwidth: Example
Consider the same architecture as before, except that the block size is 4 words instead of 1 word, and repeat the dot-product computation. Assuming the vectors are laid out linearly in memory, a single memory access now fetches four consecutive words of a vector, so two accesses (200 cycles) fetch four elements of each vector, and 8 FLOPs (4 multiply-adds) can be performed in those 200 cycles. This corresponds to one FLOP every 25 ns, for a peak speed of 40 MFLOPS.

31 Impact of Memory Bandwidth
Note: increasing block size does not change latency of the system Physically, the scenario can be viewed as a wide data bus (4 words or 128 bits) connected to multiple memory banks. In practice, such wide buses are expensive to construct. In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.

32 Impact of Memory Bandwidth
The above examples clearly illustrate how increased bandwidth results in higher peak computation rates. The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions (spatial locality of reference). If we take a data-layout centric view, computations must be reordered to enhance spatial locality of reference.

33 Impact of Memory Bandwidth: Example
Consider the code:
  do i = 1, 1000
    column_sum(i) = 0.0
    do j = 1, 1000
      column_sum(i) = column_sum(i) + B(i,j)
    end do
  end do
For each i, the code sums the elements B(i,1), ..., B(i,1000) into column_sum(i).

34 Impact of Memory Bandwidth: Example
The vector column_sum is small and easily fits into the cache. The matrix B, however, is accessed with a stride of 1000 words in the inner loop (consecutive values of j are a full column apart in Fortran's column-major storage). This strided access results in very poor performance.

35 Impact of Memory Bandwidth: Example
We can fix the above code by interchanging the loops:
  do i = 1, 1000
    column_sum(i) = 0.0
  end do
  do j = 1, 1000
    do i = 1, 1000
      column_sum(i) = column_sum(i) + B(i,j)
    end do
  end do
In this case B is traversed in its storage order (unit stride down each column), and performance can be expected to be significantly better.

36 Multiplying a matrix by a vector
There are two natural orderings: multiplying column-by-column, keeping a running sum; or computing each element of the result as the dot product of a row of the matrix with the vector (see the sketch below).
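A minimal Fortran sketch of the two orderings (the array names, size, and values are illustrative, not from the slides). In Fortran's column-major layout the column-by-column form accesses the matrix with unit stride, while the row-dot-product form strides through it:

  program matvec_orders
    implicit none
    integer, parameter :: n = 4            ! illustrative size
    real :: A(n,n), x(n), y(n)
    integer :: i, j
    A = 1.0
    x = 2.0
    ! Column-by-column: accumulate x(j)*A(:,j) into y.
    ! The inner loop walks down a column of A, i.e. unit stride in Fortran.
    y = 0.0
    do j = 1, n
      do i = 1, n
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do
    print *, 'column-by-column y =', y
    ! Row dot products: y(i) = dot product of row i of A with x.
    ! The inner loop walks along a row of A, i.e. stride n in Fortran.
    do i = 1, n
      y(i) = 0.0
      do j = 1, n
        y(i) = y(i) + A(i,j)*x(j)
      end do
    end do
    print *, 'row dot products y =', y
  end program matvec_orders

Which ordering is faster depends on the storage order of the language; the point of the slides is to match the traversal to the layout.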

37 Memory System Performance: Summary
Exploiting spatial and temporal locality in applications is critical for amortizing memory latency and increasing effective memory bandwidth. The ratio of the number of operations to the number of memory accesses is a good indicator of the anticipated tolerance to memory bandwidth. Choosing memory layouts and organizing the computation appropriately can have a significant impact on spatial and temporal locality.

38 Alternate Approaches for Hiding Memory Latency
Consider browsing the web on a slow network. Possible solutions:
Prefetching: anticipate which pages we are going to browse and issue requests for them in advance.
Multithreading: open multiple browsers and access a different page in each; while we are waiting for one page to load, we can be reading another.
Spatial locality: access a whole bunch of related pages in one go, amortizing the latency across the various accesses (this corresponds to accessing consecutive memory words).

39 Multithreading for Latency Hiding
A thread is a single stream of control in the flow of a program. Consider:
  do i = 1, n
    c(i) = dot_product(A(i,:), b)
  end do
Each dot-product is independent of the others and therefore represents a concurrent unit of execution; conceptually, each iteration could be handed to its own thread:
  c(i) = create_thread(dot_product, A(i,:), b)
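The create_thread call above is pseudocode. As one possible concrete rendering (an assumption, not the slides' API), the same loop can be written with OpenMP, which a later slide mentions; compile with OpenMP enabled so the iterations may run as separate threads:

  program threaded_dotprods
    use omp_lib, only: omp_get_max_threads
    implicit none
    integer, parameter :: n = 8            ! illustrative size
    real :: A(n,n), b(n), c(n)
    integer :: i
    A = 1.0
    b = 2.0
    ! Each row's dot product is independent, so iterations can run in
    ! separate threads; while one thread waits on memory, others compute.
    !$omp parallel do
    do i = 1, n
      c(i) = dot_product(A(i,:), b)
    end do
    !$omp end parallel do
    print *, 'threads available:', omp_get_max_threads()
    print *, 'c =', c
  end program threaded_dotprods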

40 Multithreading for Latency Hiding: Example
The first thread accesses a pair of vector elements and waits for them. In the meantime, the second thread accesses two other elements in the next cycle, and so on. After l units of time (where l is the memory latency), the first thread gets the requested data from memory and can perform the required computation. In the next cycle, the data items for the next thread arrive, and so on. In this way, we can perform a computation in every clock cycle.

41 Prefetching for Latency Hiding
Misses on loads cause programs to stall. Why not advance the loads so that by the time the data is actually needed, it is already there! The only drawback is that you might need more space to store advanced loads. However, if the advanced loads are overwritten, we are no worse than before!

42 Tradeoffs of Multithreading and Prefetching
Consider a machine with a 1 GHz clock, a 4-word cache line, single-cycle access to the cache, and 100 ns latency to DRAM. The computation has a cache hit ratio of 25% at 1 KB and 90% at 32 KB. Consider two cases: (1) a single-threaded execution in which the entire 32 KB cache is available to the serial context, and (2) a multithreaded execution with 32 threads, where each thread has a cache residency of 1 KB. If the computation makes one data request (one 4-byte word) in every 1 ns cycle, the first scenario misses on 10% of requests and the second on 75%, so they require roughly 0.10 x 4 bytes/ns = 400 MB/s and 0.75 x 4 bytes/ns = 3 GB/s of memory bandwidth, respectively.

43 Tradeoffs of Multithreading and Prefetching
Bandwidth requirements of a multithreaded system may increase very significantly because of the smaller cache residency of each thread. Multithreaded systems become bandwidth bound instead of latency bound. Multithreading and prefetching only address the latency problem and may often exacerbate the bandwidth problem. Multithreading and prefetching also require significantly more hardware resources in the form of storage.

44 Avoidance Of Bottlenecks
Bottlenecks: many accesses to memory; memory latency too long; CPU too slow.
Solutions: multiple memory banks; caches with blocks of size 1; caches with larger blocks; pipelined execution units; multiple execution units; fused add/multiply; multiple processors sharing one memory.

45 Implicit Parallelism
All of these solutions represent implicit parallelism: parallelism implemented in or by the hardware. The programmer has no direct way to specify or control its use; the way the program is written either benefits from the parallelism or inhibits it. In languages such as C*, Fortran D, and HPF, program constructs encourage the use of this hardware parallelism, but this requires good optimizing, hardware-cognizant compilers. Example, for vectors A, B, C:
  A = B + C
  S = sum(B*C)
OpenMP, an extension to Fortran and C, also provides language support for parallelism, but there the parallelism is specified explicitly in the program.

46 2.3 Classification of Parallel Platforms
2.3.1 Control Structure
2.3.2 Communication Model

47 2.3.1. Control Structure of Parallel Platforms
Replace the single CPU by many processors.
With a centralized control or synchronization mechanism (a vector/array machine): the same instruction goes to all processors, but each processor has different data. This is a single-instruction, multiple-data (SIMD) machine, and it uses the data-parallelism idea (C*, Fortran D, HPF).
With no centralized control mechanism (a multicomputer): multiple CPUs and memories, completely separate instruction streams, and no synchronization. This is a multiple-instruction, multiple-data (MIMD) machine.
SPMD (single program, multiple data) is a programming paradigm that simplifies the use of an MIMD architecture: every processor runs the same program, with no synchrony.

48 Flynn-Johnson's Taxonomy
Classify general computational activities by the way the instruction streams and data streams are handled Aside: some modern processor chips support both streams with caches and different data paths

49 SIMD vs MIMD Diagrams
SIMD: one global control unit drives all processing elements (P) through the interconnection network. MIMD: independent control units (CU), one for each processing element (P+CU), connected by the interconnection network. (Diagram: the two organizations side by side.)

50 SIMD Program With Synchrony
Consider four processors executing, in lockstep:
  if (B == 0) then
    C = A
  else
    C = A/B
  end if
  D = C
Suppose division takes longer than assignment. Processors whose B is zero execute the then-branch and then sit idle while the others execute the division, because in SIMD all processors step through the same instruction stream together. (Diagram: per-processor values of A, B, C, D over five time steps, with idle slots.)

51 SPMD
Each processor has its own instruction and data streams (as in an MIMD machine), BUT each processor loads and runs the same executable; it does not necessarily execute the same instructions at the same time. A typical program (each processor has its own program counter, PC):
  if (my_id == 0) then   ! I am the master
    ! Perform some work myself
  else                   ! I am a worker
    ! Start some work
  endif

52 SPMD -- No Synchrony
The same code as before, but now each processor advances at its own pace:
  if (B == 0) then
    C = A
  else
    C = A/B
  end if
  D = C
Suppose division takes longer than assignment. Processors that take the assignment branch finish earlier and move on to D = C without waiting for the others. (Diagram: per-processor values of A, B, C, D over five time steps; no idle slots, but processors finish at different times.)

53 2.3.2. Communication Model of Parallel Platforms
Message-Passing Platform: each processor has its own private memory with its own addresses for memory locations. The programming paradigm is to pass messages from task to task to coordinate the computation.
Shared-Address-Space Platform: all processors share the same address space, so location n is the same place for all processors. Memory is accessed via a switch, bus, or memory router (depending on the manufacturer). Software uses a shared-memory programming paradigm; sometimes specialized languages such as C* or HPF are used, in which case the compiler generates code to synchronize access to the shared address space.

54 Message Passing Platform
Memory is local, associated with each processor. For a processor to read or write another processor's memory, a message containing the data must be sent and received. (Diagram: processor-memory pairs (P, M) connected by an interconnection network.)

55 NUMA and UMA Shared Address Space Platform
(Diagram) Shared-address-space computers with: (a) uniform memory access; (b) uniform memory access with caches; (c) non-uniform memory access with local memory only. In each case the processors P (and caches C, where present) reach the memories M through an interconnection network.

56 Emulating Message Passing Using A Shared-Address machine
Messages are passed notionally as follows: write the message to a shared space; tell the receiver where the message is; the receiver reads the message and places it in its own space. Emulating shared memory on a distributed-memory machine, in contrast, needs hardware support to be effective (e.g. the SGI Origin). Thus message passing is viewed as the more general programming paradigm: messaging can be learned and implemented on either local or shared memory, so implement your program in the message-passing paradigm and use (or write) a message-passing library.

57 Summary -- Parallel Issues
Control: SIMD vs. MIMD
Coordination: synchronous vs. asynchronous
Memory organization: private vs. shared
Address space: local vs. global
Memory access: uniform vs. non-uniform
Granularity: the power of each processor
Scalability: efficiency as the number of processors grows
Interconnection network: topology; routed or switched

58 2.4 Physical Organization

59 Architecture of an Ideal Parallel Computer
A parallel random access machine (PRAM) has p processors and a global memory with uniform access (all processors share the same address space). The processors are synchronized by a single global clock but can execute different instructions, so a PRAM is a synchronous shared-memory MIMD machine. The PRAM model divides into different classes depending on how multiple processors may read and/or write the same memory location.

60 PRAM (Parallel RAM) Access Subclasses
Exclusive: each processor reads/writes a location alone, while the others wait. Concurrent: different processors may read/write the same address at the same time. The subclasses are:
Exclusive-read, exclusive-write (EREW)
Concurrent-read, exclusive-write (CREW)
Exclusive-read, concurrent-write (ERCW)
Concurrent-read, concurrent-write (CRCW)

61 Concurrency Issue Concurrent reads are not a problem
But concurrent writes must be arbitrated. The most frequent protocols are (see the sketch below):
Common: the write succeeds only if all concurrent writes carry the same value; otherwise the writes fail.
Arbitrary: one processor is arbitrarily allowed to write and the other writes fail.
Priority: the processors are ordered by priority; the processor with the highest priority succeeds in writing and all the others fail.
Sum: the sum of all the values being written is stored into the location and no processor fails.
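A minimal Fortran sketch of how the four arbitration rules would resolve one set of simultaneous writes to a single location (the processor ids, the values, and the choice of "lowest id wins" as the priority order are illustrative assumptions):

  program crcw_protocols
    implicit none
    integer, parameter :: n = 3
    integer :: proc(n), val(n)
    integer :: common_res, arb_res, prio_res, sum_res
    integer :: k, best
    proc = (/ 2, 0, 1 /)             ! processor ids issuing the writes
    val  = (/ 7, 7, 5 /)             ! values they try to write
    ! Common: succeeds only if all values agree.
    if (all(val == val(1))) then
      common_res = val(1)
    else
      common_res = -1                ! -1 stands for "writes fail"
    end if
    ! Arbitrary: pick any one writer (here simply the first in the list).
    arb_res = val(1)
    ! Priority: lowest processor id wins (one possible priority order).
    best = 1
    do k = 2, n
      if (proc(k) < proc(best)) best = k
    end do
    prio_res = val(best)
    ! Sum: the location receives the sum of all written values.
    sum_res = sum(val)
    print *, 'common:', common_res, ' arbitrary:', arb_res, &
             ' priority:', prio_res, ' sum:', sum_res
  end program crcw_protocols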

62 2.4.2 Interconnection Networks
How the components are connected to each other. Nodes: processors and memories. Edges: the wires, fiber, or wireless links connecting the nodes, plus network interface cards, repeaters, hubs, bridges, switches, and routers. Classification:
Static: fixed connections that never change with time. Limitation: the physical space needed for the connections.
Dynamic: the connections can change with time and execution; the program changes the connections in the switch to link the nodes. Limitation: congestion in the paths through the switch, depending on how carefully the user program is designed.

63 Static Interconnection Networks
No switches: the connections are fixed, essentially wires directly connecting processors. In effect, the processor is the switch, selecting the wire the message is to traverse. In contrast, in a switched network the message carries a header that the switch reads as the message arrives, routing it by analyzing the sender and receiver addresses in the header. There is potentially less overhead with static interconnection networks; on the other hand, the processor somehow has to know where to route each message.

64 Static and Dynamic Interconnection Networks
(Diagram) Classification of interconnection networks: (a) a static network; (b) a dynamic network built from switching elements and network interfaces. Note: the examples show only processors, but the situation with memories is equivalent.

65 Dynamic Interconnection Networks
Consider an EREW PRAM with p processors and m memory locations. Connecting every processor to every memory location requires O(mp) switching elements; a crossbar switch of this kind is non-blocking but uses mp elements, which is prohibitively costly to build. The compromise is to place the memory in banks, say b banks with m/b locations per bank. This is considerably less costly, needing only O(pb) switching elements. Its disadvantage is that while one processor is reading or writing any location in a bank, all other processors are blocked from that bank. The Cray Y-MP was built with this kind of switch.

66 2.4.3 Network Topologies
Bus-Based; Crossbar; Multistage (Omega); Completely-Connected; Star-Connected; Linear Array, Meshes, ...; Tree-Based

67 Bus Based Networks
A bus is a shared connection between all processors and all memory banks. A request is made on the bus to fetch or send data; the bus arbitrates all requests and performs the data transfer when it is not busy with other requests. The bus is limited in the amount of data that can be sent at once. (Diagram: processors P0 ... Pn attached to shared address and data buses leading to shared memory.)

68 As the number of processors and memory banks increases, processors increasingly wait for data to be transferred; this results in what is called saturation of the bus.

69 Use of Caches The saturation issue is alleviated by processor caches
Caches introduce the requirement of cache coherency. (Diagram: each processor P0 ... Pn now has a cache C between it and the shared address/data bus and memory.)

70 Use Of Caches
Assuming the program needs data in address-local chunks, the architecture transmits a block of items at consecutive addresses instead of just one item at a time. The time to send a block is often nearly the same as the time to get the first item (high latency, large bandwidth). Sending chunks at a time reduces the number of bus requests for data from memory and thus reduces saturation of the bus. Drawback: the caches hold copies of memory, and the same data may be copied into different caches. Consider writing into one of them: all caches must end up with the same value for the same memory location (be made coherent) to maintain the integrity of the computation.

71 Crossbar Networks
Non-blocking in the sense that a connection from one processor to a particular memory bank does NOT block another processor's connection to a different memory bank. Contention arises when more than one processor needs data from the same memory bank; that bank becomes a bottleneck. (Diagram: processors P0 ... Pn and memory banks M0 ... Mb joined by a grid of switching elements.)

72 Crossbar Switch Properties
The number of switching elements is O(pb). It makes no sense to have b less than p, so the cost is Omega(p^2). This is the notation for a lower bound: there exists an N such that the cost is at least c*p^2 for some constant c and for all p >= N. It is not cost effective to have b near m; a frequently configured system has b as some modest constant multiple of p, often 1. For this case, the cost is O(p^2).

73 The Compromise Connection
Crossbars scale well in performance but become very expensive as they scale up; buses are inexpensive but do not scale in performance because of saturation. Is there anything that has moderate cost but is scalable? Yes: the multistage switch. It is built from a set of stages whose size does not grow quickly with the number of processors, so it is not too costly, yet the number of stages grows modestly with the number of processors, enough to avoid saturation. The elements in the stages are simple, so they are inexpensive, yet the network can grow to avoid severe saturation.

74 Cost and Performance Vs Number of Processors
Crossbars cost the most and buses the least as the number of processors increases; BUT crossbars perform the best and buses the worst as the number of processors increases. Multistage switches are the compromise and are frequently used.

75 Multi-Stage Network The stages are switches
The stages are switches: an omega network is made up of a collection of 2x2 (sub)switches, each in either the pass-through or the crossover configuration. (Diagram: processors P0 ... Pn connected to memory banks M0 ... Mb through stages 1 ... n of switches.)

76 Configuration For 2x2 Switch
Suppose a particular 2x2 switch in the k-th stage encounters a message from processor number a going to destination number b. Then this switch is set to: the pass-through configuration when the k-th bits of a and b are the same; the crossover configuration when the k-th bits of a and b are different.

77 The Omega Network Assume b = p The number of stages is log(p)
The number of switches in a stage is p/2, and each switch is very simple: it has 2 inputs and 2 outputs (hence p/2 switches per stage) and is either in the pass-through or the crossover configuration. The configuration is dynamic, determined stage by stage by the corresponding bit of the source and destination addresses of the current message. Each stage has p inputs and p outputs. The connections into the 0-th stage and between successive stages are arranged in what is called a perfect shuffle -- see the next slide.

78 Perfect Shuffle
Position i is connected to position j, where j is obtained by a one-bit left circular rotation of the binary representation of i; equivalently, j = 2i for 0 <= i <= p/2 - 1 and j = 2i + 1 - p for p/2 <= i <= p - 1.

79 Complete Omega Network For 8x8 Switch
Each box is a 2x2 switch. The trace of the path for a message from processor 0 to bank 5 (i.e. 000 to 101) is:
Stage 0: switch 0 in crossover configuration; the message goes to stage 1, switch 1.
Stage 1: switch 1 in pass-through configuration; the message goes to stage 2, switch 2.
Stage 2: switch 2 in crossover configuration; the message goes to output 101 (bank 5).
(Diagram: the 8x8 omega network with processors 000-111 on the left, three switch stages, and memory banks 000-111 on the right.)
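A minimal Fortran sketch of the switch-setting rule this trace follows (the program and variable names are illustrative): scanning the bits of source XOR destination from most significant to least, a zero bit means pass-through and a one bit means crossover.

  program omega_route
    implicit none
    integer, parameter :: nstages = 3          ! log2(p) for p = 8
    integer :: src, dst, diff, k, bit
    src = 0                                    ! processor 000
    dst = 5                                    ! memory bank 101
    diff = ieor(src, dst)                      ! bits that differ
    do k = 0, nstages - 1
      bit = ibits(diff, nstages - 1 - k, 1)    ! stage k looks at this bit
      if (bit == 0) then
        print *, 'stage', k, ': pass-through'
      else
        print *, 'stage', k, ': crossover'
      end if
    end do
  end program omega_route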

80 But An Omega Switch May Block
Consider two messages sent at the same time: one from processor 2 (010) to bank 7 (111), the other from processor 6 (110) to bank 4 (100). They both route to switch 2 of stage 0; one requires the switch to be in the pass-through configuration, the other the crossover configuration, and both messages end up on the second output port of this switch. Either the switch blocks, or the link from the second output of switch 2, stage 0 to switch 1, stage 1 overloads (blocks: two messages simultaneously try to traverse the same wire or link).

81 Blocking In An Omega Network

82 Completely-Connected Network
Like a crossbar switch, every pair of processors can communicate in one hop without creating a blocking situation, and one processor can communicate with all others simultaneously (as long as the processor supports this). But it has physical scaling problems: a chaotic collection of wires even for a modest number of processors (consider, say, 128 processors), where it is physically difficult to find space for the connections. It is an everyone-to-everyone connection with no preferences.

83 Star-Connected Network
The center processor can communicate with all others in one hop; all other pairs of processors require two hops. The center processor thus becomes a bottleneck when communication is between the "starlets". This network behaves very much like a bus: congestion is possible and likely. Many Ethernet clusters in offices without switches behave like this; with switches, the backplanes have very large bandwidths to ameliorate the congestion (saturation). The star suits the boss/worker model, with the center as the boss.

84 A Linear Network And Ring Connected Network
Linear array: the message must be passed in the correct direction and goes through intermediate processors (multiple hops). Ring: for the shortest path, the message must be sent in the correct direction around the ring, and it still goes through intermediate processors (multiple hops).

85 2-D Mesh, 2-D Wrapped Mesh, and 3-D Mesh Connected Static Networks
(b) is also called a 2-D wraparound mesh or 2-D torus; if not square, it is called a rectangular 2-D mesh or torus. These kinds of configurations have been built, including 3-D tori.

86 Hypercube Connections
A cube of dimension d has 2^d processors. For d = 1 it is a linear configuration (linear array) with 2 = 2^1 processors, one at each end. For d = 2 it is a square with processors at the corners (4 = 2^2 of them). For d = 3 it is a cube with processors at the corners (8 = 2^3 of them), and so on (can you guess what d = 4 looks like?). The special properties of such configurations are: in a d-dimensional cube every processor is connected to d other processors, and the number of steps (connection segments) between any pair of processors is at most d. In fact, the shortest distance is the Hamming distance between the binary representations of the processor numbers, assuming the numbering runs from 0 to 2^d - 1. Note that a d-dimensional hypercube is made up of two (d-1)-dimensional hypercubes connected at their corresponding positions.

87 1D, 2D, 3D Hypercube Networks

88 Distinct Partitions Of A 3-D Hypercube
There are always d distinct partitions of a d-dimensional hypercube into two subcubes of dimension d-1.

89 More Properties Of Hypercubes
Two processors are neighbors (connected by a direct link) if the binary representations of their processor numbers differ in exactly one bit. Recall that the Hamming distance is the number of 1-bits in s XOR t (where XOR is exclusive or: 1 only in positions where the bits differ). Thus s and t are neighbors if and only if s XOR t has exactly one 1-bit (see the sketch below). Consider the set of all processors that agree in the same k bits (any fixed subset of the d bits): this set forms a (d-k)-dimensional hypercube, a sub-hypercube of the original hypercube.
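A minimal Fortran sketch of the neighbor test and Hamming distance for hypercube node labels (the labels chosen are arbitrary examples; ieor and popcnt are standard Fortran 90/2008 intrinsics):

  program hypercube_neighbors
    implicit none
    integer :: s, t, dist
    s = 5                            ! node 101 in a 3-D hypercube
    t = 7                            ! node 111
    dist = popcnt(ieor(s, t))        ! Hamming distance = number of differing bits
    print *, 'Hamming distance:', dist
    if (dist == 1) then
      print *, 'nodes', s, 'and', t, 'are neighbors (direct link)'
    else
      print *, 'nodes', s, 'and', t, 'are', dist, 'hops apart'
    end if
  end program hypercube_neighbors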

90 Sub-cubes

91 k-ary d-cube Networks
Instead of 2 processors in each dimension of the cube (as in the hypercube), consider k processors connected in a ring along each dimension. A d-dimensional torus with p processors in each dimension is a p-ary d-cube; k linearly connected processors (with wraparound) form a k-ary 1-cube.

92 Tree Networks With Message Routing
Linear arrays and star networks are special cases of trees. (a) is a complete binary tree realized as a static network: the processors at the internal connections also act as routers or switches. (b) is a dynamic tree network: processors sit only at the leaves, and the internal connections are switches only.

93 Message Routing In Trees
Routing in a dynamic tree: the message travels up the tree until it reaches the root of the smallest subtree that contains both the sender and the receiver, then goes down the tree to the receiver. For a static tree the same algorithm works, except that the message may not need to go down the tree (the destination may lie on the upward path). Note that there is congestion in trees: the upper parts of the tree carry more traffic than the lower parts. To ameliorate the congestion, the links in the upper parts of the tree are fattened; this is called a fat-tree interconnection.

94 Dynamic Fat Tree Network
This tree doubles the number of paths as you go higher in the tree The CM-5 used a dynamic fat tree with 100s of processors I believe it did not double in width at all levels

95 2.4.4 Evaluating Static Interconnection Networks
Distance (between two processors): the number of edges in the shortest path linking them. Cost and performance measures of networks: diameter; connectivity; bisection width, channel width, channel rate, and bisection bandwidth; cost.

96 Evaluating Static Interconnection Networks: Diameter
The diameter is the maximum distance between any two processors. Examples:
for a 2-D hypercube, it is 2
for a completely connected network, it is 1
for a star network, it is 2
for a linear array of size p, it is p-1
for a ring of size p, it is floor(p/2)
for a p x q mesh (no wraparound), it is p+q-2
for a d-dimensional hypercube, it is d, i.e. log2 p for a hypercube of p processors

97 Evaluating Static Interconnection Networks: Connectivity
Connectivity is a measure of the number of paths between any two processors: the higher the connectivity, the more choice in routing messages and the less chance of contention. Arc connectivity: the least number of arcs (connections) that must be removed to break the network into disconnected pieces. Examples:
for a star, a linear array, and a tree, it is 1
for a completely connected network of p processors, it is p-1
for a d-dimensional hypercube, it is d
for a ring, it is 2
for a 2-D wraparound mesh, it is 4

98 Bisection width, channel width, channel rate, and bisection bandwidth
Bisection width: the least number of arcs that must be removed to break the network into two partitions with an equal number of processors in each.
Channel width: the maximum number of bits that can be transmitted simultaneously over a link connecting two processors; it equals the number of wires (arcs) forming the link.
Channel rate: the rate at which bits can be transferred over a single wire (bits/sec).
Channel bandwidth: the peak rate at which a channel can operate, i.e. the product channel rate x channel width.
Bisection bandwidth: the peak rate across all the arcs counted by the bisection width, i.e. the product bisection width x channel bandwidth.

99 Examples Of Bisection Widths
for a d-dimensional hypercube with p processors, it is 2^(d-1), i.e. p/2
for a √p x √p mesh, it is √p
for a √p x √p wraparound mesh, it is 2√p
for a tree, it is 1
for a star, it is 1 (by convention: you cannot make two equal partitions)
for a completely connected network of p processors, it is p^2/4

100 Evaluating Static Interconnection Networks: Cost
There are two common measures.
The number of arcs (links): for a d-dimensional hypercube with p = 2^d processors it is (p log p)/2, the solution of the recurrence c_d = 2*c_(d-1) + 2^(d-1), c_0 = 0; for a linear array and for trees of p processors it is p-1.
Or the bisection bandwidth: it is related to the minimal size (area or volume) of the packaging needed to build the network. For example, if the bisection bandwidth is w, then for a 2-dimensional package a lower bound on the area is Omega(w^2), and for a 3-dimensional package a lower bound on the volume is Omega(w^(3/2)).

101 Summary: Characteristics Of Static Network Topologies
(Columns: Diameter; Bisection Width; Arc Connectivity; Cost as number of links)
Completely connected: 1; p^2/4; p-1; p(p-1)/2
Star: 2; 1; 1; p-1
Complete binary tree: 2 log((p+1)/2); 1; 1; p-1
Linear array: p-1; 1; 1; p-1
Ring: floor(p/2); 2; 2; p
2-D mesh, no wraparound: 2(√p - 1); √p; 2; 2(p - √p)
2-D wraparound mesh: 2 floor(√p/2); 2√p; 4; 2p
Hypercube: log p; p/2; log p; (p log p)/2
Wraparound k-ary d-cube: d floor(k/2); 2k^(d-1); 2d; dp

102 2.4.5 Evaluating Dynamic Interconnection Networks
What is the diameter of a dynamic network? Nodes now include both processors and switches, and both add a delay when a message passes through them; the diameter is therefore the maximum number of nodes on the shortest path between any two processors.
What is the connectivity of a dynamic network? Similarly, it is the minimum number of connections that must be removed to partition the network into two mutually unreachable parts.
What is the bisection width of a dynamic network? The minimum number of edges (connections) that must be removed to partition the network into two halves with an equal number of processors.
What is the cost of a dynamic network? It depends on the link cost and the switch cost; in practice the switch cost dominates, so the cost is taken to be the number of switches.

103 An Example Of Bisection Width
Lines A, B, and C are three cuts that each cross the same number (4) of connections and partition the processors into two equal halves. This is the minimum number of connections crossed by any such cut; thus the bisection width is 4.

104 Summary: Characteristics Of Dynamic Network Topologies
(Columns: Diameter; Bisection Width; Arc Connectivity; Cost as number of links)
Crossbar: 1; p; 1; p^2
Omega network: log p; p/2; 2; p/2
Dynamic tree: 2 log p; 1; 2; p-1

105 2.4.6 Cache Coherence In Multiprocessor Systems
Cache coherence is a complicated issue even with uniprocessors, particularly when there are multiple levels of cache and caches are refreshed in blocks or lines. For a multiprocessor the problem is worse: as well as multiple copies, there are multiple processors trying to read and write the same memory locations. There are two frequently used protocols to ensure the integrity of the computation:
Invalidate protocol: invalidate all cached copies of a location when one of the copies is changed; the invalid copies are refreshed only when needed. This is the most frequently implemented protocol today.
Update protocol: update all cached copies of a location when one of the copies is written.

106 Cache Coherence In Multiprocessor Systems
(Diagram: the invalidate protocol and the update protocol. In both, P0 and P1 load x = 1; when P0 writes x = 3, the invalidate protocol marks P1's copy invalid, while the update protocol sends the new value 3 to P1.)

107 Property Of The Update Protocol
Under the invalidate protocol: suppose one processor reads a location once (placing it in its cache) and never accesses it again, while another processor reads and updates that location many times (also in its cache). Behavior: the second processor's first write marks the first processor's cached copy invalid, and that copy is not updated again. Under the update protocol, by contrast, every write would update the first processor's cache, which would be a waste of effort.

108 The False Sharing Issue
False sharing is caused by cache lines: the neighboring addresses that are brought into the cache whenever any single value is loaded into a processor's cache. Sharing between processors now occurs when a second processor accesses a data item in the same line, even though it is not the same value; a whole block (line) of data ends up resident in two or more caches. Now suppose each processor repeatedly writes a different item in the same cache line. It looks as if the items are being shared (hence "false" sharing), but they are not: updates to one item in the line force the whole line to be invalidated or updated, as the protocol requires, even though no individual item is actually shared. For this pattern the cost of the update protocol turns out to be slightly lower, but the overall tradeoff between communication overhead (updating) and idling (stalling on invalidates) still favors the invalidate protocol. (A sketch of an access pattern that triggers false sharing follows.)
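A minimal Fortran/OpenMP sketch (illustrative only, not from the slides) of the access pattern just described: each thread repeatedly updates its own element of a small shared array, so several threads' counters share a cache line and the line bounces between caches.

  program false_sharing_demo
    use omp_lib, only: omp_get_thread_num
    implicit none
    integer, parameter :: reps = 1000000
    integer :: counts(0:63)          ! adjacent counters likely share cache lines
    integer :: tid, k                ! (assumes at most 64 threads)
    counts = 0
    !$omp parallel private(tid, k) shared(counts)
    tid = omp_get_thread_num()
    do k = 1, reps
      ! Each thread touches only its own element, yet every write invalidates
      ! (or updates) the whole cache line that also holds its neighbors' counters.
      counts(tid) = counts(tid) + 1
    end do
    !$omp end parallel
    print *, 'total increments =', sum(counts)
    ! A common remedy is padding: declare counts(16, 0:nthreads-1) and have
    ! thread tid update counts(1, tid), so each counter gets its own 64-byte line.
  end program false_sharing_demo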

109 Maintaining Coherence Using Invalidate Protocols
For analysis and understanding, consider three states for a cached memory address:
Shared: two or more processors have loaded the memory location.
Invalid: some processor has updated the value, and all other processors mark their (stale) copies with this state.
Dirty: the processor that modified the value, while other copies exist elsewhere, marks its copy with this state; a processor holding a dirty value is the source for any subsequent reads of that value.

110 State Diagram For The 3-State Coherence Invalidate Protocol
(Diagram: state-transition diagram among the Invalid, Shared, and Dirty states. Legend: plain labels (read, write) are local processor actions; C_read and C_write are coherence actions observed from other processors; flush may occur when a processor replaces a dirty item during cache replacement.)

111 Parallel Program Execution On A 3-State Coherence System Using The Invalidate Protocol
(Table: a trace of two processors executing a mix of reads and updates of x and y (read x, x = x + 1, read y, x = x + y, y = y + 1, y = x + y), showing after each step the values of x and y and their coherence states (S = shared, D = dirty, I = invalid) in processor 0's cache, processor 1's cache, and global memory.)

112 Implementation Techniques For Cache Coherence
1. Snoopy cache systems
2. Directory-based systems
2.1 Centralized directory
2.2 Distributed directory

113 1. Snoopy Cache Systems
Used on broadcast interconnection networks such as a bus or a ring. Each processor snoops on the bus, looking for transactions that affect its cache; the cache has tag bits that record each cache line's state. When a processor holding a dirty item sees a read request for that item, it takes over the request and sends the data out. When a processor sees a write to an item it has a copy of, it invalidates its copy.

114 (Diagram: a snoopy bus-based system; each processor has a cache with tag bits and snoop hardware, all attached to the shared address/data bus and memory.)

115 Performance Of Snoopy Caches
Snoopy caches have been extensively studied and implemented; the implementation is simple and straightforward and can easily be added to existing bus-based systems. Good performance properties: if different caches access different data (the expected case), it performs well; once a line is designated dirty, the processor holding the dirty tag can keep using the data without penalty; computations that only read shared data also perform well. Poor performance when processors are reading and updating the same data value: this generates many coherence operations across the bus, and because the bus is shared (between processors and with ordinary data movement), the coherence operations saturate it.

116 2.Directory Based Systems
Directory-based systems provide a solution to this problem. With snoopy buses, each processor has to monitor the bus continually for updates of interest; the solution is to direct the updates only to the processors that need them. This is done with a directory, associated with each block of memory, that records which processors hold each block and whether the block is shared.

117 2.1. Centralized Directory Scheme
(Diagram: processors with caches connected through an interconnection network to the directory and memory; each memory block has a state field and presence bits alongside its data. In the example shown, a block is in the shared state (S) with presence bits set for processors 0 and 3.)

118 Typical Scenario For a Directory Based System
Consider the earlier cache-coherence scenario: both processors access x (with value 1), so x is moved into each cache and marked shared. Processor 0 then stores 3 into x: x is marked dirty in the directory and the presence bits of all other processors are cleared; processor 0 can now access the changed x at will (memory is not yet updated). When processor 1 next accesses x, the directory shows a dirty tag and that processor 0 holds x; processor 0 updates the memory block and sends the updated x to processor 1. The presence bits for processors 0 and 1 are set and x is again marked shared. (Diagram: the directory entry for x changing from shared to dirty and back to shared.)

119 Performance Of Centralized Directory Scheme
The implementation is more complex. Good performance properties (as before): if different caches access different data (the usual case), it performs well; once a line is designated dirty, the processor with the dirty tag can continue to use the data without penalty; computations that only read shared data also perform well. Poor performance when processors are reading and updating the same data value. Two kinds of overhead are encountered, the propagation of state and the generation of state, which lead to communication and contention problems respectively. The contention is on the directory: too many requests arrive at the directory for information. The communication is on the bus; the solution there is to increase its bandwidth. There is also a space issue with the directory itself; its size can be reduced by increasing the block length of the cache line, BUT that may increase the false-sharing overhead.

120 2.2. Distributed Directory Scheme
Contention on the directory of states can be ameliorated by distributing the directory. The distribution follows the distribution of memory: the state and presence bits are kept near (or in) each piece of the distributed memory, and each processor maintains the coherence of its own memory. Behavior: a first read is a request from the using processor to the owner processor for the block, and state information is recorded in the owner's directory; a write by a user propagates an invalidate to the owner, and the owner forwards the invalidation to all other processors sharing the data. The effect: because the directory is distributed, contention is only at the owning processor, rather than the previous situation where many processors contend for one central directory.

121 Distributed Directory Scheme
(Diagram: each node holds a processor, cache, its portion of memory, and the presence bits/state for that memory, all joined by the interconnection network.)

122 Performance Of Distributed Directory Schemes
Better performance: this design permits O(p) simultaneous coherence operations, so it scales much better than simple snoopy or centralized directory-based systems. The latency and bandwidth of the network now become the bottleneck.

123 2.5 Communication Costs in Parallel Machines
1. Message Passing Costs
2. Communication Costs in Shared-Address-Space Machines

124 Message Passing Costs
Startup time ts: the time to process the message at the sending and receiving nodes; it covers adding the header, trailer, and error-correction information, executing the routing algorithm, and interfacing between node and router.
Per-hop time th: the time for the header of the message to leave one node and arrive at the next node (over one link); also called node latency. It is dominated by the time to determine which output buffer or channel to route the message to.
Per-word transfer time tw: equal to 1/r, where r is the channel bandwidth in words per second; this time includes network and buffering overheads.

125 1. Store-Forward Routing
When a message is sent down a link, the entire message is stored at the receiving node before that node can begin forwarding it (or processing other messages). Total communication time for a message of size m traversing l links:
  tcomm = ts + (m tw + th) l
For typical algorithms th is small compared with m tw, even for small m, so th is often ignored, giving
  tcomm = ts + m tw l

126 2. Packet Routing Waiting for the entire message is inefficient
For LANs, WANs, and long-haul networks, the message is broken into packets. Intermediate nodes then wait only on small pieces (packets), not the whole message. This reduces the cost of handling an error: only a packet gets resent, not the entire message. It also allows packets to take different paths, and error correction operates on smaller pieces and so is more effective. BUT there are overheads, because each packet must carry routing, error-correction, and sequencing information. In these networks the advantages outweigh the overhead incurred.

127 A Cost Model For Packet Routing
Each packet has size r + s, where r is the amount of original message data it carries and s is the additional per-packet information (routing, error correction, sequencing). The time to packetize a message of size m is proportional to m, say m tw1. Let the number of hops be l, the per-word transfer time over the network be tw2, and the per-hop latency be th. The time to receive the first packet is th l + tw2 (r + s); there are m/r - 1 remaining packets, each taking a further tw2 (r + s). Simplifying, the total is
  tcomm = ts + th l + tw m, where tw = tw1 + tw2 (1 + s/r)

128 3. Cut-Though Routing Packets but
A processor-interconnect network is more restricted than a long-haul network in generality, size, and the amount of noise or interference, so the overheads can be reduced: no per-packet routing information is needed; no sequencing information is needed, since the packets are transmitted and received in order; errors can be handled at the level of the whole message instead of per packet; and errors occur less frequently, so simpler schemes can be used. The term cut-through routing is applied to this simpler packetizing of messages. The packets are of fixed size and are called flow control digits, or flits; they are smaller than long-haul network packets since they carry no headers.

129 Tracer packets establish the route:
A tracer packet is sent ahead to establish the route for the message, setting the path for the flits to follow. The flits then pass down the path one after the other; they are not buffered at each node but are forwarded as soon as each arrives, without waiting for the following flit, which reduces the memory and memory-bandwidth needs at each node.

130 A Simplified Cost Model for Cut-Through Routing
Assume: l is the number of links; m is the number of words in the message; ts is the startup time; tw is the per-word transfer time, so the cost of transmitting the m words is m tw; th is the per-hop time. The time to start up and drain the flit pipeline, and the time to send the tracer packet, are both proportional to l and are folded into the per-hop term; the per-hop cost scales with the number of links because each node along the path is simultaneously handling a different flit. The communication cost is then
  tcomm = ts + l th + m tw
Compare this with the store-forward time tcomm = ts + m tw l.
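A tiny Fortran sketch that simply evaluates the two formulas above for some made-up parameter values (purely illustrative, not from the slides), to show how the compounding l factor hurts store-forward routing:

  program sf_vs_ct
    implicit none
    real :: ts, th, tw, t_sf, t_ct
    integer :: m, l
    ts = 100.0      ! startup time (illustrative units)
    th = 1.0        ! per-hop time
    tw = 0.5        ! per-word transfer time
    m  = 1000       ! message length in words
    l  = 8          ! number of links traversed
    t_sf = ts + m*tw*l        ! store-and-forward: per-word cost multiplies with hops
    t_ct = ts + l*th + m*tw   ! cut-through: hop cost and word cost simply add
    print *, 'store-forward :', t_sf
    print *, 'cut-through   :', t_ct
  end program sf_vs_ct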

131 A Simplified Cost Model for Cut-Through Routing (continued)
The cost model for cut-through routing is tcomm = ts + l th + m tw. To minimize this cost, we might design our software to:
Communicate in bulk (reduce the effect of ts): fewer, larger messages mean we pay ts fewer times, which matters because ts is usually large relative to th and tw.
Minimize the data volume (reduce the m tw term): reduce the amount of communication.
Minimize the distance traveled by messages (reduce l): this reduces the l th term.

132 A Simplified Cost Model for Cut-Through Routing (continued)
On the other hand: message-passing libraries such as MPI give the user very little control over the mapping between the program's logical processes and the machine's physical processors; message-passing implementations may use two-hop schemes that pick the middle node at random to reduce contention; and most switches nowadays have essentially equal link times between any two processors. Thus l th is small compared with the other terms, and for these reasons we use the simplified cost model
  tcomm = ts + m tw
that is, we drop the l th term.

133 Performance Comparison And Issues
Cut-through vs. store-forward routing: SF has the compounding factor m*l, while CT routing is linear in m and l separately. Size of flits: too small means lots of flits to process, so the per-flit processing must be very fast; too large means more memory and memory bandwidth at each node, or increased latency. There is thus a tradeoff in the design of the routers; flits are typically 4 bits to 32 bytes. Congestion and contention: CT routing can deadlock, particularly when heavily loaded; the solution is message buffering and/or careful routing.

134 Store-Forward And Cut-Through Routing

135 Deadlock In Cut-Through Routing
The destinations of messages 0, 1, 2, and 3 are processors A, B, C, and D respectively. A flit from message 0 occupies the link C-B; it cannot progress because a flit from message 3 occupies the link B-A, and so on around the cycle, so no message can advance.

136 2. Communication Costs For Shared-Address-Space Machines
The bottom line is that there is no uniform, simple model except for very special cases and architectures, so we give up on a general treatment. The issues that cause this are:
memory layout, which is determined by the system and the compiler;
the size of the caches, particularly when caches are small;
the details of the invalidate and/or update protocols used by the caches;
spatial locality of data: cache line sizes vary, and the compiler decides where data is placed;
prefetching by the compiler;
false sharing (which depends on threading, scheduling, and the compiler);
contention for resources (cache update lines, directories, etc.), which depends on the execution schedule.

137 2.6. Routing Mechanisms For Interconnection Networks
Routing is the process whereby the network determines which route (or routes) to use for a message and how the message is sent along it. The routing may use the state of the network (how busy its parts are) to determine the path.
Minimal routing: a path of least length is used, usually computed directly from the source and destination addresses. Minimal routing can lead to congestion, so non-minimal routing is often used.
Non-minimal routing: a path of least length is not necessarily chosen; to avoid congestion, the route is selected randomly or using network state information.

138 Routing Mechanisms Continued
Routing can be deterministic: always the same route for a given pair of source and destination addresses. Routing can be adaptive: it tries to avoid congestion and delay depending on which links are currently in use (or it may be a two-hop scheme with the middle node selected at random). Dimension-ordered routing uses the dimensional structure of the network to determine the path, routing along minimal paths one dimension at a time: for a mesh it is X-Y routing; for a hypercube it is called E-cube routing.

139 X-Y Routing For A Mesh
Send the message along the row of the source processor until it reaches the X coordinate (column) of the destination processor; then send the message along that column to the destination. This is a minimal path.

140 E-Cube Routing Let Ps and Pd be the binary labels for nodes s and d
Then Ps XOR Pd gives the Hamming pattern: its 1-bits indicate the dimensions along which the message must travel. The next dimension is computed from the destination address and the address of the node currently holding the message; that is: compute Ps XOR Pd and send the message from Ps along the dimension corresponding to the least significant 1-bit. The message is now at some processor Pq; form Pq XOR Pd and send the message from Pq along the dimension of its least significant 1-bit, arriving at a new Pq. Repeat until the message arrives at its destination (see the sketch below). NOTE: the assumption is that the message always carries the address of its destination.
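A minimal Fortran sketch of this E-cube rule (the node labels are illustrative; ieor, ishft, and trailz are Fortran 90/2008 intrinsics); at each step it flips the least significant differing bit of the current node's label:

  program ecube_route
    implicit none
    integer :: src, dst, cur, diff, dim
    src = 2                          ! node 010 in a 3-D hypercube
    dst = 7                          ! node 111
    cur = src
    print *, 'start at node', cur
    do while (cur /= dst)
      diff = ieor(cur, dst)          ! bits still to be corrected
      dim  = trailz(diff)            ! least significant differing bit = next dimension
      cur  = ieor(cur, ishft(1, dim))! move along that dimension
      print *, 'hop along dimension', dim, 'to node', cur
    end do
  end program ecube_route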

141 E-Cube Routing For 3-D Cube
Form the XOR of the current node's label with the destination's label; send the message to the neighbor obtained by flipping the least significant 1-bit of that XOR, and repeat from each new node until the destination is reached.

142 2.7. Embedding Other Networks into a Hypercube Network
Using all of the processors of a hypercube and an appropriate subset of its connections, can the processors be configured to look like other networks? This is an important question, because an algorithm may naturally fit one network configuration while the actual machine uses a different one. The problem is a general graph problem: given two graphs G(V,E) and G'(V',E'), map each vertex in V onto one or more vertices in V', and map each edge in E onto one or more edges in E'. The vertices are processors and the edges are network links. The answer is yes, as shown in the following cases and slides.

143 Example Of A Bad Case
Take a 4×4 mesh and map the nodes at random to another 4×4 mesh
The original arrangement was designed to avoid congestion: each link is used just once when all neighbor pairs communicate
The particular random arrangement can congest up to 5 messages on one link:
k communicates with g, o, and l across the k-h link
j communicates with i along the k-h link
d communicates with h along the k-h link
(the text suggests 6 links, but I do not see that case)

144 Diagram Of Bad Case
[Figure: two 4×4 grids of the nodes a through p -- the original layout, with only nearest-neighbor communication on dotted lines, and the same grid with the processors mapped at random]
k communicates with g, o, and l across the k-h link
j communicates with i along the k-h link
d communicates with h along the k-h link
Up to 5 paths are mapped onto the same link, assuming the original code performed only nearest-neighbor communication

145 Properties Of Such Mappings
Congestion: the maximum number of edges in E mapped onto a single edge in E'
It measures the amount of traffic required on an edge of G'
Dilation: the maximum number of edges of E', joined into a path, that correspond to a single edge in E
It measures the increased delay in G' caused by traversing multiple links
Expansion: the ratio of the number of processors in V' to the number of processors in V
However, the cases described next all have an expansion of 1
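For concreteness, a small sketch (mine, not from the slides) that computes congestion and dilation when an embedding is given explicitly as a map from each edge of E to the path of E' links it uses; the representation and the names are assumptions for this illustration.

def congestion_and_dilation(edge_to_path):
    """Compute congestion and dilation of an embedding.

    edge_to_path maps each edge of the guest graph G(V, E) to the list of
    host-graph edges (edges of E') that its image path traverses.
    Congestion = max number of guest edges using any one host edge.
    Dilation   = max length of any image path.
    """
    load = {}                                  # host edge -> number of guest edges using it
    dilation = 0
    for path in edge_to_path.values():
        dilation = max(dilation, len(path))
        for e in path:
            host_edge = tuple(sorted(e))       # treat links as undirected
            load[host_edge] = load.get(host_edge, 0) + 1
    congestion = max(load.values()) if load else 0
    return congestion, dilation

# Toy example: guest edges (a,b) and (a,c) both routed across host link (0,1)
embedding = {("a", "b"): [(0, 1)],
             ("a", "c"): [(0, 1), (1, 2)]}
print(congestion_and_dilation(embedding))      # (2, 2)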

146 A Linear Array embedded into a Hypercube
Consider a linear array (or ring) of 2^d processors
For convenience, the linear array processors are labeled from 0 to 2^d − 1
Processor i in the linear array maps to the processor numbered G(i,d) in the hypercube, where G is the binary reflected Gray code (RGC)
The (d+1)-bit Gray codes are derived from the d-bit Gray codes as follows:
Take two copies of the d-bit codes
For one copy, prefix a 0 bit; for the other copy, reflect the list of d-bit codes and prefix a 1 bit
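Here is a minimal sketch of the reflect-and-prefix construction just described, together with a check that consecutive codes (including the wraparound pair) differ in exactly one bit; the function names are assumptions for this illustration.

def reflected_gray_codes(d):
    """Return the 2**d binary reflected Gray codes as d-character bit strings.

    Built exactly as on the slide: take two copies of the (d-1)-bit codes,
    prefix 0 to the first copy, and prefix 1 to the reflected second copy.
    """
    if d == 0:
        return [""]
    prev = reflected_gray_codes(d - 1)
    return ["0" + c for c in prev] + ["1" + c for c in reversed(prev)]

def G(i, d):
    """Gray code of index i with d bits (equivalently: i ^ (i >> 1))."""
    return reflected_gray_codes(d)[i]

codes = reflected_gray_codes(3)
print(codes)   # ['000', '001', '011', '010', '110', '111', '101', '100']

# Consecutive codes (ring neighbors) differ in exactly one bit,
# so the embedding of a ring into the hypercube has dilation 1.
n = len(codes)
assert all(sum(a != b for a, b in zip(codes[i], codes[(i + 1) % n])) == 1
           for i in range(n))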

147 A Ring Of 8 Processors embedded into a Hypercube
This mapping has a dilation of 1 and a congestion of 1
[Figure: the 8-node ring laid onto a 3-D hypercube, with node 0 and the last node (7) of the ring marked]

148 Note: hypercube processors in consecutive rows of the Gray code table differ by one bit -- thus, they are adjacent in the cube
Thus, each edge in the linear array maps to one and only one edge in the hypercube
Thus, the dilation (the maximum number of edges in E' that an edge in E is mapped to) is 1
Also, the congestion (the maximum number of edges in E mapped onto a single edge in E') is 1, by the properties of G

149 Meshes embedded into Hypercubes
We consider the mesh to be a wraparound mesh of size 2^r × 2^s and the hypercube to be of dimension r+s
We use the properties of the mapping of a ring to a hypercube -- because each row and each column is a ring -- as follows:
For processor (i,j) in the mesh (the processor at the intersection of row i and column j), the binary hypercube processor number is G(i,r) || G(j,s), where || denotes concatenation of the two binary strings
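A minimal sketch of this mapping, using the standard closed form i xor (i >> 1) for the reflected Gray code, and checking that every wraparound mesh edge lands on a single hypercube link (dilation 1); the mesh size and the names are assumptions for this illustration.

def gray(i, bits):
    """Binary reflected Gray code of i as a bit string of the given width."""
    return format(i ^ (i >> 1), "0{}b".format(bits))

def mesh_to_hypercube(i, j, r, s):
    """Map processor (i, j) of a 2**r x 2**s wraparound mesh to its
    hypercube label G(i, r) || G(j, s) (concatenation of bit strings)."""
    return gray(i, r) + gray(j, s)

def differ_by_one_bit(a, b):
    return sum(x != y for x, y in zip(a, b)) == 1

r, s = 2, 3                       # a 4 x 8 wraparound mesh, 32-node hypercube
R, S = 2 ** r, 2 ** s

# Every wraparound mesh edge maps onto a single hypercube link: dilation 1.
for i in range(R):
    for j in range(S):
        here = mesh_to_hypercube(i, j, r, s)
        right = mesh_to_hypercube(i, (j + 1) % S, r, s)   # row neighbor
        down = mesh_to_hypercube((i + 1) % R, j, r, s)    # column neighbor
        assert differ_by_one_bit(here, right)
        assert differ_by_one_bit(here, down)
print("dilation-1 check passed for the", R, "x", S, "mesh")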

150

151 A Square p-Processor Mesh embedded into a p-Processor Ring
The number of links in a p-processor square wraparound mesh is 2·√p·√p = 2p:
√p links on each row, with √p rows, gives p links
Considering the columns, there are just as many links again, another p
Thus, there is a total of 2p links
The number of links in a p-processor ring is p
Thus, there must be congestion
The natural mappings from mesh to ring and from ring to mesh are shown on the next slide
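As a quick check, here is a tiny sketch (not from the slides) that counts the links of a wraparound mesh by enumeration for p = 16; the value of p and the variable names are chosen for this illustration only.

p = 16
side = int(p ** 0.5)                                   # sqrt(p) = 4

# Enumerate the links of a side x side wraparound mesh, counting each once.
mesh_links = set()
for i in range(side):
    for j in range(side):
        mesh_links.add(frozenset({(i, j), (i, (j + 1) % side)}))   # row link
        mesh_links.add(frozenset({(i, j), ((i + 1) % side, j)}))   # column link

ring_links = p                                         # one link per ring node
print(len(mesh_links), ring_links)                     # 32 16 -> 2p vs p links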

152 A Mesh embedded into a Linear Array (and Vice Versa)
Linear array onto mesh: congestion 1, dilation 1
Mesh onto linear array: congestion 5 = √p + 1, dilation 7 = 2√p − 1
Legend: bold lines are links in the linear array, dotted lines are links in the mesh
[Figure: the most congested link carries 5 crossings; the most dilated mesh edge is mapped onto 7 bold edges]

153 Can We Do Better? What is a lower bound for the congestion?
Recall that congestion is the maximum number of edges in E mapped onto a single edge in E'
The lower bound turns out to be √p, so it is possible to do a little better than √p + 1, but not much better
A simple but misleading argument: the mesh has 2p links and the linear array has p links, so a congestion of 2 seems possible
Proof of the real bound (using bisection widths, not link counts):
The bisection width of the 2-D mesh is √p
The bisection width of a linear array is 1
Thus, the congestion is at least √p

154 Hypercubes embedded into 2-D Meshes
Assume the hypercube has p nodes, where p is an even power of 2
Treat the hypercube as √p sub-hypercubes, each with √p nodes
Let d = log2 p -- d is even by assumption
Consider the sub-hypercubes with the least significant d/2 bits varying and the first d/2 bits fixed
Map each of these sub-hypercubes to a row of a √p × √p mesh -- each row has √p nodes and there are √p rows
Use the inverse of the mapping from a linear array to a hypercube on slide 102
Connect the nodes column-wise so that the nodes with the same d/2 least significant bits are in the same column
Notice that this is an arrangement similar to the one for the rows

155 16-Node Hypercube embedded into 16-Node 2-D Mesh
[Figure: the 16 hypercube nodes laid out on a 4×4 mesh, with the Gray sequence 00, 01, 11, 10 marked along the rows]
All nodes whose two least significant bits are 10 are in the first column
Nodes whose labels run through the sequence 00, 01, 11, 10 lie along a row
See the textbook for a 32-node example, Figure 2.33, page 72
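A minimal sketch of this placement for the 16-node case, using the inverse Gray code to position each node within its row; the exact layout convention (high bits select the row, low bits select the position within it) is my reading of the slide, so treat it as an assumption.

def gray(i, bits):
    """Reflected Gray code of i, returned as a bit string of the given width."""
    return format(i ^ (i >> 1), "0{}b".format(bits))

def inverse_gray(bits_str):
    """Index i such that gray(i, len(bits_str)) == bits_str."""
    i = 0
    for b in bits_str:
        i = (i << 1) | (int(b) ^ (i & 1))
    return i

d = 4                               # 16-node hypercube, d/2 = 2 bits per half
side = 2 ** (d // 2)                # 4 x 4 mesh

# Place node (hi || lo): the row is chosen by hi, the position within the
# row by the inverse Gray code of lo, so nodes sharing lo share a column.
grid = [[None] * side for _ in range(side)]
for node in range(2 ** d):
    label = format(node, "04b")
    hi, lo = label[:2], label[2:]
    grid[int(hi, 2)][inverse_gray(lo)] = label

for row in grid:
    print(" ".join(row))
# Each row reads ..00 ..01 ..11 ..10 (the Gray sequence), and all nodes with
# the same two low bits sit in the same column, as the slide describes.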

156 Congestion Is √p/2 And Is Best
The argument is as follows:
The congestion of this mapped 2-D mesh is √p/2
Proof: the bisection width of a √p-node sub-hypercube is √p/2, and the bisection width of a row (a linear array) is 1, so the congestion within a row is √p/2 (the ratio (√p/2)/1)
Similarly, for the columns, the congestion is √p/2
Because the row mapping and the column mapping affect disjoint sets of links, the overall congestion is the same, namely √p/2
The lower bound on the congestion is also √p/2: (bisection width of the hypercube)/(bisection width of the mesh) = (p/2)/√p = √p/2
Thus, because the lower bound equals what the mapping achieves, this is the best mapping of a hypercube onto a 2-D mesh

157 Processor-Processor Mapping And Fat Interconnection Networks
The previous examples were mappings of dense networks onto sparse networks, and the congestion was larger than one
If both networks have the same bandwidth on each link, this can be disastrous
The denser network usually has higher dimensionality, which is costly: complicated layouts, wire crossings, variable wire lengths
However, the sparser network is simpler, and it is easier to make its congested links fatter
Example: the congestion of the mapping from the hypercube onto the 2-D mesh with the same number of processors is √p/2
Widen the links of the mesh by a factor of √p/2; then the two networks have the same bisection bandwidth
The disadvantage is that the diameter of the mesh is larger than that of the cube

158 Cost Performance Tradeoffs -- The Mesh Is Better For The Same Cost
Consider a fattened mesh (fattened so as to make the network costs equal) and a hypercube with the same number of processors
Assume the cost of the network is proportional to the number of wires
Increase the width of the links of the p-node wraparound mesh by a factor of (log p)/4 wires, which makes the p-node hypercube and the p-node wraparound mesh cost the same
Let's compare the average communication costs
Let lav be the average distance between any two nodes
For the 2-D wraparound mesh, this distance is √p/2 links; for the hypercube, it is (log p)/2 links
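A small sketch (not from the slides) that checks these average distances by brute force for p = 16; the choice of p and the averaging over all ordered node pairs, including a node with itself, are assumptions made for this check.

from itertools import product

p = 16
side = int(p ** 0.5)
d = p.bit_length() - 1               # log2(p) = 4

# Average distance in a side x side wraparound mesh (sum of ring distances
# in each coordinate), averaged over all ordered pairs of nodes.
def ring_dist(a, b, n):
    return min((a - b) % n, (b - a) % n)

nodes = list(product(range(side), repeat=2))
mesh_avg = sum(ring_dist(a[0], b[0], side) + ring_dist(a[1], b[1], side)
               for a in nodes for b in nodes) / len(nodes) ** 2

# Average distance in a p-node hypercube = average Hamming distance.
cube_avg = sum(bin(a ^ b).count("1")
               for a in range(p) for b in range(p)) / p ** 2

print(mesh_avg, side / 2)            # 2.0  2.0   -> sqrt(p)/2
print(cube_avg, d / 2)               # 2.0  2.0   -> (log p)/2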

159 The average time for a message of length m (over lav hops) with cut-through routing is ts + th·lav + tw·m
For the 2-D mesh: because the channel width has been increased by a factor of (log p)/4, the term tw·m is decreased by that factor, and lav = √p/2, so
tav_comm_2D = ts + th·√p/2 + 4·tw·m/(log p)
For the hypercube (with lav = (log p)/2):
tav_comm_cube = ts + th·(log p)/2 + tw·m
Comparison, for large m: the average communication time is smaller for the 2-D mesh when p > 16 (since 4/(log p) < 1 exactly when log p > 4)
This is not true with a store-and-forward protocol on the mesh
This comparison is for the case where the cost is determined by the number of links

160 Suppose instead that the cost is determined by the bisection width
This time, we increase the bandwidth of the mesh links by a factor of √p/4
This is the ratio of the bisection width of a hypercube to that of a wraparound mesh, each with p processors -- that is, (p/2)/(2√p) = √p/4 (see slide 58)
Comparison:
For the 2-D mesh with cut-through routing (with lav = √p/2):
tav_comm_2D = ts + th·√p/2 + 4·tw·m/√p
For the hypercube (with lav = (log p)/2):
tav_comm_cube = ts + th·(log p)/2 + tw·m
Again, the average communication time is smaller for the 2-D mesh when p > 16
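A short sketch (mine) that evaluates both sets of formulas for a few values of p; the parameter values ts, th, tw, and m are illustrative assumptions, not taken from the slides.

import math

# Illustrative parameters (assumed for this sketch, not from the slides).
ts, th, tw, m = 100.0, 2.0, 1.0, 10_000

def t_cube(p):
    """Average cut-through time on a p-node hypercube."""
    return ts + th * math.log2(p) / 2 + tw * m

def t_mesh_link_cost(p):
    """Fattened mesh, links widened by (log p)/4 (equal wire count)."""
    return ts + th * math.sqrt(p) / 2 + 4 * tw * m / math.log2(p)

def t_mesh_bisection_cost(p):
    """Fattened mesh, links widened by sqrt(p)/4 (equal bisection width)."""
    return ts + th * math.sqrt(p) / 2 + 4 * tw * m / math.sqrt(p)

for p in (16, 64, 256, 1024):
    print(p, round(t_cube(p)), round(t_mesh_link_cost(p)),
          round(t_mesh_bisection_cost(p)))
# For large m the tw*m term dominates, so beyond p = 16 both fattened meshes
# have a smaller average communication time than the hypercube.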

161 Case Studies: The IBM Blue-Gene Architecture
The hierarchical architecture of Blue Gene.

162 Case Studies: The Cray T3E Architecture
Interconnection network of the Cray T3E: (a) node architecture; (b) network topology.

163 Case Studies: The SGI Origin 3000 Architecture
Architecture of the SGI Origin 3000 family of servers.

164 Case Studies: The Sun HPC Server Architecture
Architecture of the Sun Enterprise family of servers.

