
1 CS4/MSc Parallel Architectures - 2009-2010 CS4 Parallel Architectures - Introduction  Instructor: Marcelo Cintra (mc@staffmail.ed.ac.uk – 1.03 IF)  Lectures: Tue and Fri in G0.9 WRB at 10am  Pre-requisites: CS3 Computer Architecture  Practicals: Practical 1 – out week 3 (26/1/10); due week 5 (09/2/10) Practical 2 – out week 5 (09/2/10); due week 7 (23/2/10) Practical 3 – out week 7 (23/2/10); due week 9 (09/3/10) (MSc only) Practical 4 – out week 7 (26/2/10); due week 9 (12/3/10)  Books: –(**) Culler & Singh - Parallel Computer Architecture: A Hardware/Software Approach – Morgan Kaufmann –(*) Hennessy & Patterson - Computer Architecture: A Quantitative Approach – Morgan Kaufmann – 3rd or 4th editions  Lecture slides (no lecture notes)  More info: www.inf.ed.ac.uk/teaching/courses/pa/ 1

2 CS4/MSc Parallel Architectures - 2009-2010 Topics  Fundamental concepts –Performance issues –Parallelism in software  Uniprocessor parallelism –Pipelining, superscalar, and VLIW processors –Vector, SIMD processors  Interconnection networks –Routing, static and dynamic networks –Combining networks  Multiprocessors, Multicomputers, and Multithreading –Shared memory and message passing systems –Cache coherence and memory consistency  Performance and scalability 2

3 CS4/MSc Parallel Architectures - 2009-2010 Lect. 1: Performance Issues  Why parallel architectures? –Performance of sequential architecture is limited (by technology and ultimately by the laws of physics) –Relentless increase in computing resources (transistors for logic and memory) that can no longer be exploited for sequential processing –At any point in time many important applications cannot be solved with the best existing sequential architecture  Uses of parallel architectures –To solve a single problem faster (e.g., simulating protein folding: researchweb.watson.ibm.com/bleugene) –To solve a larger version of a problem (e.g., weather forecast: www.jamstec.go.jp/esc) –To solve many problems at the same time (e.g., transaction processing) 3

4 CS4/MSc Parallel Architectures - 2009-2010 Limits to Sequential Execution  Speed of light limit –Computation/data flow through logic gates, memory devices, and wires –At all of these there is a non-zero delay that is at a minimum equal to the speed-of-light delay –Thus, the speed of light and the minimum physical feature sizes impose a hard limit on the speed of any sequential computation  Von Neumann’s limit –Programs consist of an ordered sequence of instructions –Instructions are stored in memory and must be fetched in order (same for data) –Thus, sequential computation is ultimately limited by the memory bandwidth 4

5 CS4/MSc Parallel Architectures - 2009-2010 Examples of Parallel Architectures  An ARM processor in a common mobile phone has 10s of instructions in-flight in its pipeline  Pentium IV executes up to 6 microinstructions per cycle and has up to 126 microinstructions in-flight  Intel’s quad-core chips have four processors and are now in mainstream desktops and laptops  Japan’s Earth Simulator has 5120 vector processors, each with 8 vector pipelines  IBM’s largest BlueGene supercomputer has 131,072 processors  Google has about 100,000 Linux machines connected in several cluster farms 5

6 CS4/MSc Parallel Architectures - 2009-2010 Comparing Execution Times  Example: system A: T_A = execution time of program P on A; system B: T_B = execution time of program P’ on B  Speedup: S = T_B / T_A; we say: A is S times faster, or A is (T_B / T_A × 100 − 100)% faster  Notes: –For fairness P and P’ must be the “best possible implementation” on each system –If multiple programs are run then report the weighted arithmetic mean –Must report all details such as: input set, compiler flags, command line arguments, etc. 6

7 CS4/MSc Parallel Architectures - 2009-2010 Amdahl’s Law  Let: F = fraction of the problem that can be optimized; S_opt = speedup obtained on the optimized fraction  S_overall = 1 / ((1 − F) + F / S_opt)  e.g.: F = 0.5 (50%), S_opt = 10: S_overall = 1 / ((1 − 0.5) + 0.5/10) ≈ 1.8; with S_opt = ∞: S_overall = 1 / (1 − 0.5) = 2  Bottom-line: performance improvements must be balanced 7
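
A minimal sketch of the formula in C (function and variable names are mine, not from the slides); it reproduces the two results above:

    #include <math.h>
    #include <stdio.h>

    /* Amdahl's Law: S_overall = 1 / ((1 - F) + F/S_opt) */
    double amdahl(double f, double s_opt) {
        return 1.0 / ((1.0 - f) + f / s_opt);
    }

    int main(void) {
        printf("%.2f\n", amdahl(0.5, 10.0));     /* 1.82, the slide's ~1.8 */
        printf("%.2f\n", amdahl(0.5, INFINITY)); /* 2.00, the S_opt = inf limit */
        return 0;
    }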

8 CS4/MSc Parallel Architectures - 2009-2010 Amdahl’s Law and Efficiency  Let: F = fraction of the problem that can be parallelized; S_par = speedup obtained on the parallelized fraction; P = number of processors  S_overall = 1 / ((1 − F) + F / S_par)  E = S_overall / P  e.g.: 16 processors (S_par = 16), F = 0.9 (90%): S_overall = 1 / ((1 − 0.9) + 0.9/16) = 6.4 and E = 6.4 / 16 = 0.4 (40%)  Bottom-line: for good scalability E > 50%; when resources are “free” then lower efficiencies are acceptable 8
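
Continuing the sketch above (same hedged naming), efficiency is just the overall speedup divided by the processor count:

    #include <stdio.h>

    double amdahl(double f, double s) { return 1.0 / ((1.0 - f) + f / s); }

    int main(void) {
        int p = 16;                        /* processors; here S_par = p */
        double s = amdahl(0.9, (double)p);
        printf("S_overall = %.1f, E = %.0f%%\n", s, 100.0 * s / p);  /* 6.4, 40% */
        return 0;
    }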

9 CS4/MSc Parallel Architectures - 2009-2010 Performance Trends: Computer Families  Bottom-line: microprocessors have become the building blocks of most computer systems across the whole range of price-performance 9 Culler and Singh Fig. 1.1

10 CS4/MSc Parallel Architectures - 2009-2010 Technological Trends: Moore’s Law 10  Bottom-line: overwhelming number of transistors allow for incredibly complex and highly integrated systems

11 CS4/MSc Parallel Architectures - 2009-2010 Tracking Technology: The role of CA  Bottom-line: architectural innovations complement technological improvements 11 H&P Fig. 1.1

12 CS4/MSc Parallel Architectures - 2009-2010 The Memory Gap  Bottom-line: memory access is increasingly expensive and CA must devise new ways of hiding this cost 12 H&P Fig. 5.2

13 CS4/MSc Parallel Architectures - 2009-2010 Software Trends  Ever larger applications: memory requirements double every year  More powerful compilers and increasing role of compilers on performance  Novel applications with different demands: e.g., multimedia –Streaming data –Simple fixed operations on regular and small data  MMX-like instructions e.g., web-based services –Huge data sets with little locality of access –Simple data lookups and processing  Transactional Memory(?) (www.cs.wisc.edu/trans-memory)  Bottom-line: architecture/compiler co-design 13

14 CS4/MSc Parallel Architectures - 2009-2010 Current Trends in CA  Very complex processor design: –Hybrid branch prediction (MIPS R14000) –Out-of-order execution (Pentium IV) –Multi-banked on-chip caches (Alpha 21364) –EPIC (Explicitly Parallel Instruction Computer) (Intel Itanium)  Parallelism and integration at chip level: –Chip-multiprocessors (CMP) (Sun T2, IBM Power6, Intel Itanium 2) –Multithreading (Intel Hyperthreading, IBM Power6, Sun T2) –Embedded Systems On a Chip (SOC)  Multiprocessors: –Servers (Sun Fire, SGI Origin) –Supercomputers (IBM BlueGene, SGI Origin, IBM HPCx) –Clusters of workstations (Google server farm)  Power-conscious designs 14

15 CS4/MSc Parallel Architectures - 2009-2010 Lect. 2: Types of Parallelism  Parallelism in Hardware (Uniprocessor) –Parallel arithmetic –Pipelining –Superscalar, VLIW, SIMD, and vector execution  Parallelism in Hardware (Multiprocessor) –Chip-multiprocessors a.k.a. Multi-cores –Shared-memory multiprocessors –Distributed-memory multiprocessors –Multicomputers a.k.a. clusters  Parallelism in Software –Tasks –Data parallelism –Data streams (note: a “processor” must be capable of independent control and of operating on non-trivial data types) 1

16 CS4/MSc Parallel Architectures - 2009-2010 Taxonomy of Parallel Computers  According to instruction and data streams (Flynn): –Single instruction, single data (SISD): this is the standard uniprocessor –Single instruction, multiple data streams (SIMD):  Same instruction is executed in all processors with different data  E.g., graphics processing –Multiple instruction, single data streams (MISD):  Different instructions on the same data  Never used in practice –Multiple instruction, multiple data streams (MIMD): the “common” multiprocessor  Each processor uses its own data and executes its own program (or part of the program)  Most flexible approach  Easier/cheaper to build by putting together “off-the-shelf” processors 2

17 CS4/MSc Parallel Architectures - 2009-2010 Taxonomy of Parallel Computers  According to physical organization of processors and memory: –Physically centralized memory, uniform memory access (UMA)  All memory is allocated at the same distance from all processors  Also called symmetric multiprocessors (SMP)  Memory bandwidth is fixed and must accommodate all processors → does not scale to a large number of processors  Used in most CMPs today (e.g., IBM Power5, Intel Core Duo) 3 [Diagram: several CPUs, each with a cache, connected through an interconnection to a single centralized main memory]

18 CS4/MSc Parallel Architectures - 2009-2010 Taxonomy of Parallel Computers  According to physical organization of processors and memory: –Physically distributed memory, non-uniform memory access (NUMA)  A portion of memory is allocated with each processor (node)  Accessing local memory is much faster than remote memory  If most accesses are to local memory then overall memory bandwidth increases linearly with the number of processors 4 [Diagram: nodes, each a CPU with its cache and a local memory, connected through an interconnection]

19 CS4/MSc Parallel Architectures - 2009-2010 Taxonomy of Parallel Computers  According to memory communication model –Shared address or shared memory  Processes in different processors can use the same virtual address space  Any processor can directly access memory in another processor node  Communication is done through shared memory variables  Explicit synchronization with locks and critical sections  Arguably easier to program –Distributed address or message passing  Processes in different processors use different virtual address spaces  Each processor can only directly access memory in its own node  Communication is done through explicit messages  Synchronization is implicit in the messages  Arguably harder to program  Some standard message passing libraries (e.g., MPI) 5

20 CS4/MSc Parallel Architectures - 2009-2010 Shared Memory vs. Message Passing 6
 Shared memory:

    Producer (p1):          Consumer (p2):
      flag = 0;               flag = 0;
      ...                     ...
      a = 10;                 while (!flag) {}
      flag = 1;               x = a * y;

 Message passing:

    Producer (p1):          Consumer (p2):
      ...                     ...
      a = 10;                 receive(p1, b, label);
      send(p2, a, label);     x = b * y;

21 CS4/MSc Parallel Architectures - 2009-2010 Types of Parallelism in Applications  Instruction-level parallelism (ILP) –Multiple instructions from the same instruction stream can be executed concurrently –Generated and managed by hardware (superscalar) or by compiler (VLIW) –Limited in practice by data and control dependences  Thread-level or task-level parallelism (TLP) –Multiple threads or instruction sequences from the same application can be executed concurrently –Generated by compiler/user and managed by compiler and hardware –Limited in practice by communication/synchronization overheads and by algorithm characteristics 7

22 CS4/MSc Parallel Architectures - 2009-2010 Types of Parallelism in Applications  Data-level parallelism (DLP) –Instructions from a single stream operate concurrently (temporally or spatially) on several data –Limited by non-regular data manipulation patterns and by memory bandwidth  Transaction-level parallelism –Multiple threads/processes from different transactions can be executed concurrently –Sometimes not really considered as parallelism –Limited by access to metadata and by interconnection bandwidth 8

23 CS4/MSc Parallel Architectures - 2009-2010 Example: Equation Solver Kernel  The problem: –Operate on a (n+2)x(n+2) matrix –Points on the rim have fixed value –Inner points are updated as: –Updates are in-place, so top and left are new values and bottom and right are old ones –Updates occur at multiple sweeps –Keep difference between old and new values and stop when difference for all points is small enough 9 A[i,j] = 0.2 x (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

24 CS4/MSc Parallel Architectures - 2009-2010 Example: Equation Solver Kernel  Dependences: –Computing the new value of a given point requires the new value of the point directly above and to the left –By transitivity, it requires all points in the sub-matrix in the upper-left corner –Points along the top-right to bottom-left diagonals can be computed independently 10

25 CS4/MSc Parallel Architectures - 2009-2010 Example: Equation Solver Kernel  ILP version (from sequential code): –Machine instructions from each j iteration can occur in parallel –Branch prediction allows overlap of multiple iterations of the j loop –Some of the instructions from multiple j iterations can occur in parallel 11

    while (!done) {
      diff = 0;
      for (i=1; i<=n; i++) {
        for (j=1; j<=n; j++) {
          temp = A[i,j];
          A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
          diff += abs(A[i,j] - temp);
        }
      }
      if (diff/(n*n) < TOL) done=1;
    }

26 CS4/MSc Parallel Architectures - 2009-2010 Example: Equation Solver Kernel  TLP version (shared-memory): 12

    int mymin = 1 + (pid * n/P);
    int mymax = mymin + n/P - 1;
    while (!done) {
      diff = 0; mydiff = 0;
      for (i=mymin; i<=mymax; i++) {
        for (j=1; j<=n; j++) {
          temp = A[i,j];
          A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
          mydiff += abs(A[i,j] - temp);
        }
      }
      lock(diff_lock); diff += mydiff; unlock(diff_lock);
      barrier(bar, P);
      if (diff/(n*n) < TOL) done=1;
      barrier(bar, P);
    }

27 CS4/MSc Parallel Architectures - 2009-2010 Example: Equation Solver Kernel  TLP version (shared-memory) (for 2 processors): –Each processor gets a chunk of rows  E.g., processor 0 gets: mymin=1 and mymax=2 and processor 1 gets: mymin=3 and mymax=4 13

    int mymin = 1 + (pid * n/P);
    int mymax = mymin + n/P - 1;
    while (!done) {
      diff = 0; mydiff = 0;
      for (i=mymin; i<=mymax; i++) {
        for (j=1; j<=n; j++) {
          temp = A[i,j];
          A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
          mydiff += abs(A[i,j] - temp);
        }
      }
      ...

28 CS4/MSc Parallel Architectures - 2009-2010 Example: Equation Solver Kernel  TLP version (shared-memory): –All processors can freely access the same data structure A –Access to diff, however, must be done in turns –All processors update their own done variable together 14

      ...
      for (i=mymin; i<=mymax; i++) {
        for (j=1; j<=n; j++) {
          temp = A[i,j];
          A[i,j] = 0.2*(A[i,j]+A[i,j-1]+A[i-1,j]+A[i,j+1]+A[i+1,j]);
          mydiff += abs(A[i,j] - temp);
        }
      }
      lock(diff_lock); diff += mydiff; unlock(diff_lock);
      barrier(bar, P);
      if (diff/(n*n) < TOL) done=1;
      barrier(bar, P);
    }

29 CS4/MSc Parallel Architectures - 2009-2010 Types of Speedups and Scaling  Scalability: adding x times more resources to the machine yields close to x times better “performance” –Usually resources are processors, but can also be memory size or interconnect bandwidth –Usually means that with x times more processors we can get ~x times speedup for the same problem –In other words: How does efficiency (see Lecture 1) hold as the number of processors increases?  In reality we have different scalability models: –Problem constrained –Time constrained –Memory constrained  Most appropriate scalability model depends on the user interests 15

30 CS4/MSc Parallel Architectures - 2009-2010 Types of Speedups and Scaling  Problem constrained (PC) scaling: –Problem size is kept fixed –Wall-clock execution time reduction is the goal –Number of processors and memory size are increased –“Speedup” is then defined as: S_PC = Time(1 processor) / Time(p processors) –Example: CAD tools that take days to run, weather simulation that does not complete in reasonable time 16

31 CS4/MSc Parallel Architectures - 2009-2010 Types of Speedups and Scaling  Time constrained (TC) scaling: –Maximum allowable execution time is kept fixed –Problem size increase is the goal –Number of processors and memory size are increased –“Speedup” is then defined as: S_TC = Work(p processors) / Work(1 processor) –Example: weather simulation with refined grid 17

32 CS4/MSc Parallel Architectures - 2009-2010 Types of Speedups and Scaling  Memory constrained (MC) scaling: –Both problem size and execution time are allowed to increase –Problem size increase with the available memory with the smallest increase in execution time is the goal –Number of processors and memory size are increased –“Speedup” is then defined as: S_MC = (Work(p processors) / Time(p processors)) × (Time(1 processor) / Work(1 processor)) = Increase in Work / Increase in Time –Example: astrophysics simulation with more planets and stars 18
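
The three definitions are easy to confuse; a small hedged C sketch (all work/time numbers are made up for illustration, and each speedup assumes its own experimental setup):

    #include <stdio.h>

    int main(void) {
        /* Hypothetical measurements, one pair per scaling model. */
        double t1 = 100.0, tp = 10.0;     /* PC: same problem, less time  */
        double w1 = 1000.0, wp = 4000.0;  /* TC: same deadline, more work */

        printf("S_PC = %.1f\n", t1 / tp);               /* 10.0 */
        printf("S_TC = %.1f\n", wp / w1);               /*  4.0 */
        printf("S_MC = %.1f\n", (wp / tp) / (w1 / t1)); /* 40.0: increase in work
                                                           over increase in time */
        return 0;
    }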

33 CS4/MSc Parallel Architectures - 2009-2010 Lect. 3: Superscalar Processors I/II  Pipelining: several instructions are simultaneously at different stages of their execution  Superscalar: several instructions are simultaneously at the same stages of their execution  (Superpipelining: very deep pipeline with very short stages to increase the amount of parallelism)  Out-of-order execution: instructions can be executed in an order different from that specified in the program  Dependences between instructions: –Read after Write (RAW) (a.k.a. data dependence) –Write after Read (WAR) (a.k.a. anti dependence) –Write after Write (WAW) (a.k.a. output dependence) –Control dependence  Speculative execution: tentative execution despite dependences 1

34 CS4/MSc Parallel Architectures - 2009-2010 A 5-stage Pipeline 2
[Diagram: IF → ID → EXE → MEM → WB, with the general-purpose registers read in ID and written in WB, and memory accessed in IF and MEM]
IF = instruction fetch (includes PC increment)
ID = instruction decode + fetching values from general purpose registers
EXE = arithmetic/logic operations or address computation
MEM = memory access or branch completion
WB = write back results to general purpose registers

35 CS4/MSc Parallel Architectures - 2009-2010 A Pipelining Diagram 3  Start one instruction per clock cycle:

    cycle   1    2    3    4    5    6
    IF      I1   I2   I3   I4   I5   I6
    ID           I1   I2   I3   I4   I5
    EXE               I1   I2   I3   I4
    MEM                    I1   I2   I3
    WB                          I1   I2

 Each instruction still takes 5 cycles, but instructions now complete every cycle: CPI → 1

36 CS4/MSc Parallel Architectures - 2009-2010 Multiple-issue 4  Start two instructions per clock cycle:

    cycle   1       2       3       4       5        6
    IF      I1,I2   I3,I4   I5,I6   I7,I8   I9,I10   I11,I12
    ID              I1,I2   I3,I4   I5,I6   I7,I8    I9,I10
    EXE                     I1,I2   I3,I4   I5,I6    I7,I8
    MEM                             I1,I2   I3,I4    I5,I6
    WB                                      I1,I2    I3,I4

 CPI → 0.5; IPC → 2
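
The CPI/IPC arithmetic above generalizes; a hedged C sketch (the closed form is the standard fill-then-drain count, not from the slides):

    #include <stdio.h>

    /* Ideal cycles for n instructions on a k-stage pipeline issuing up to
       w instructions per cycle: k cycles to fill, then w complete per cycle. */
    long pipeline_cycles(long n, int k, int w) {
        return k + (n + w - 1) / w - 1;
    }

    int main(void) {
        long n = 1000000;
        printf("1-wide: IPC = %.2f\n", (double)n / pipeline_cycles(n, 5, 1));
        printf("2-wide: IPC = %.2f\n", (double)n / pipeline_cycles(n, 5, 2));
        return 0;
    }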

37 CS4/MSc Parallel Architectures - 2009-2010 A Pipelined Processor (DLX) 5 H&P Fig. A.18

38 CS4/MSc Parallel Architectures - 2009-2010 Advanced Superscalar Execution 6  Ideally: in an n-issue superscalar, n instructions are fetched, decoded, executed, and committed per cycle  In practice: –Control flow changes spoil fetch flow –Data, control, and structural hazards spoil issue flow –Multi-cycle arithmetic operations spoil execute flow  Buffers at issue (issue window or issue queue) and commit (reorder buffer) decouple these stages from the rest of the pipeline and somewhat smooth out breaks in the flow [Diagram: a fetch engine feeds instructions through buffered ID, EXE, MEM, and WB stages, with the general registers and memory]

39 CS4/MSc Parallel Architectures - 2009-2010 Problems At Instruction Fetch 7  Crossing instruction cache line boundaries –e.g., 32 bit instructions and 32 byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor –More than one cache lookup is required in the same cycle –What if one of the line accesses is a cache miss? –Words from different lines must be ordered and packed into the instruction queue Case 1: all instructions located in the same cache line and no branch Case 2: instructions spread over more lines and no branch

40 CS4/MSc Parallel Architectures - 2009-2010 Problems At Instruction Fetch 8  Control flow –e.g., 32 bit instructions and 32 byte instruction cache lines → 8 instructions per cache line; 4-wide superscalar processor –Branch prediction is required within the instruction fetch stage –For wider issue processors multiple predictions are likely required –In practice most fetch units only fetch up to the first predicted taken branch Case 1: single not taken branch Case 2: single taken branch outside fetch range and into other cache line

41 CS4/MSc Parallel Architectures - 2009-2010 Example Frequencies of Control Flow 9

    benchmark   taken %   avg. BB size   # of inst. between taken branches
    eqntott     86.2      4.20           4.87
    espresso    63.8      4.24           6.65
    xlisp       64.7      4.34           6.70
    gcc         67.6      4.65           6.88
    sc          70.2      4.71           6.71
    compress    60.9      5.39           8.85

 Data from Rotenberg et al. for SPEC 92 Int
 One branch/jump about every 4 to 6 instructions
 One taken branch/jump about every 4 to 9 instructions

42 CS4/MSc Parallel Architectures - 2009-2010 Solutions For Instruction Fetch 10  Advanced fetch engines that can perform multiple cache line lookups –E.g., interleaved I-caches where consecutive program lines are stored in different banks that can be accessed in parallel  Very fast, albeit not very accurate, branch predictors (e.g., next line predictor in the Alpha 21464) –Note: usually used in conjunction with more accurate but slower predictors (see Lecture 4)  Restructuring instruction storage to keep commonly consecutive instructions together (e.g., Trace cache in Pentium 4)

43 CS4/MSc Parallel Architectures - 2009-2010 Example Advanced Fetch Unit 11 [Figure from Rotenberg et al.: a 2-way interleaved I-cache; control flow prediction units (i) Branch Target Buffer, (ii) Return Address Stack, (iii) Branch Predictor; a mask to select instructions from each of the cache lines; and a final alignment unit]

44 CS4/MSc Parallel Architectures - 2009-2010 Trace Caches 12  Traditional I-cache: instructions laid out in program order  Dynamic execution order does not always follow program order (e.g., taken branches) and the dynamic order also changes  Idea: –Store instructions in execution order (traces) –Traces can start with any static instruction and are identified by the starting instruction’s PC –Traces are dynamically created as instructions are normally fetched and branches are resolved –Traces also contain the outcomes of the implicitly predicted branches –When the same trace is again encountered (i.e., same starting instruction and same branch predictions) instructions are obtained from trace cache –Note that multiple traces can be stored with the same starting instruction

45 CS4/MSc Parallel Architectures - 2009-2010 Pros/Cons of Trace Caches 13 +Instructions come from a single trace cache line +Branches are implicitly predicted –The instruction that follows the branch is fixed in the trace and implies the branch’s direction (taken or not taken) +I-cache still present, so no need to change cache hierarchy +In CISC IS’s (e.g., x86) the trace cache can keep decoded instructions (e.g., Pentium 4) -Wasted storage as instructions appear in both I-cache and trace cache, and in possibly multiple trace cache lines -Not very good at handling indirect jumps and returns (which have multiple targets, instead of only taken/not taken) and even unconditional branches -Not very good when there are traces with common sub-paths

46 CS4/MSc Parallel Architectures - 2009-2010 Structure of a Trace Cache 14 Figure from Rotenberg et al.

47 CS4/MSc Parallel Architectures - 2009-2010 Structure of a Trace Cache 15  Each line contains n instructions from up to m basic blocks  Control bits: –Valid –Tag –Branch flags: m-1 bits to specify the directions of the up to m branches –Branch mask: the number of branches in the trace –Trace target address and fall-through address: the address of the next instruction to be fetched after the trace is exhausted  Trace cache hit: –Tag must match –Branch predictions must match the branch flags for all branches in the trace
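
A hedged C sketch of the line format and hit test just described (field and function names are mine; Rotenberg et al. give the authoritative format):

    typedef struct {
        int      valid;          /* line holds a trace                  */
        unsigned tag;            /* starting instruction's PC           */
        int      n_branches;     /* branch mask: number of branches     */
        unsigned branch_flags;   /* one taken/not-taken bit per branch  */
        unsigned target_addr;    /* next PC if the trace ends taken     */
        unsigned fallthru_addr;  /* next PC otherwise                   */
        /* ... up to n instructions ... */
    } trace_line;

    /* Hit: tag matches AND the current predictions match the recorded
       flags for every branch in the trace. */
    int trace_hit(const trace_line *t, unsigned pc, unsigned predictions) {
        unsigned mask = (1u << t->n_branches) - 1;
        return t->valid && t->tag == pc &&
               ((predictions ^ t->branch_flags) & mask) == 0;
    }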

48 CS4/MSc Parallel Architectures - 2009-2010 Trace Creation 16  Starts on a trace cache miss  Instructions are fetched up to the first predicted taken branch  Instructions are collected, possibly from multiple basic blocks (when branches are predicted taken)  Trace is terminated when either n instructions or m branches have been added  Trace target/fall-through address are computed at the end

49 CS4/MSc Parallel Architectures - 2009-2010 Example 17  I-cache lines contain 8 32-bit instructions and Trace Cache lines contain up to 24 instructions and 3 branches  Processor can issue up to 4 instructions per cycle

    Machine Code:
    L1: I1 [ALU] ... I5 [Cond. Br. to L3]
    L2: I6 [ALU] ... I12 [Jump to L4]
    L3: I13 [ALU] ... I18 [ALU]
    L4: I19 [ALU] ... I24 [Cond. Br. to L1]

    Basic Blocks: B1 (I1-I5), B2 (I6-I12), B3 (I13-I18), B4 (I19-I24)

    Layout in I-Cache (8 instructions per line):
    line 1: I1  I2  I3  I4  I5  I6  I7  I8
    line 2: I9  I10 I11 I12 I13 I14 I15 I16
    line 3: I17 I18 I19 I20 I21 I22 I23 I24

50 CS4/MSc Parallel Architectures - 2009-2010 Example 18  Step 1: fetch I1-I3 (stop at end of line) → Trace Cache miss → Start trace collection  Step 2: fetch I4-I5 (possible I-cache miss) (stop at predicted taken branch)  Step 3: fetch I13-I16 (possible I-cache miss)  Step 4: fetch I17-I19 (I18 is predicted not taken branch, stop at end of line)  Step 5: fetch I20-I23 (possible I-cache miss) (stop at predicted taken branch)  Step 6: fetch I24-I27  Step 7: fetch I1-I4 replaced by Trace Cache access

    Common path through basic blocks: B1 (I1-I5) → B3 (I13-I18) → B4 (I19-I24)

    Layout in I-Cache (8 instructions per line):
    line 1: I1  I2  I3  I4  I5  I6  I7  I8
    line 2: I9  I10 I11 I12 I13 I14 I15 I16
    line 3: I17 I18 I19 I20 I21 I22 I23 I24

    Layout in Trace Cache:
    I1 I2 I3 I4 I5 I13 I14 I15 I16 I17 I18 I19 I20 I21 I22 I23 I24

51 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 19  Original hardware trace cache: “Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching”, E. Rotenberg, S. Bennett, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.  Next trace prediction for trace caches: “Path-Based Next Trace Prediction”, Q. Jacobson, E. Rotenberg, and J. Smith, Intl. Symp. on Microarchitecture, December 1997.  A Software trace cache: “Software Trace Cache”, A. Ramirez, J.-L. Larriba-Pey, C. Navarro, J. Torrellas, and M. Valero, Intl. Conf. on Supercomputing, June 1999.

52 CS4/MSc Parallel Architectures - 2009-2010 Lect. 4: Superscalar Processors II/II  n-wide instruction width + m-deep pipeline + d delay to resolve branches: –Up to n*m instructions in-flight –Up to n*d instructions must be re-executed on branch misprediction –Current processors have 10 to 20 cycles of branch misprediction penalty  Current branch prediction accuracy is around 80%-90% for “difficult” applications and >95% for “easy” applications  Increasing prediction accuracy usually involves increasing the size of tables  Different predictor types are good at different types of branch behavior  Current processors have multiple branch predictors with different accuracy-delay tradeoffs 1

53 CS4/MSc Parallel Architectures - 2009-2010 Quantifying Prediction Accuracy 2  Two measures: –Coverage: the fraction of branches for which the predictor has a prediction (Note: usually, it is considered that coverage is 100% and no prediction equals predict not taken) –Accuracy: the ratio of correctly predicted branches over the total number of branches predicted (Pitfall: higher accuracy is not necessarily better when coverage is lower)  Performance impact is proportional to (1 − accuracy), the penalty, and the number of branches in the application  Two ways of looking at accuracy improvements, e.g., when accuracy improves from 95% to 97%: –(97 − 95) / 95 = 0.021 → only a 2% increase in accuracy –(5 − 3) / 5 = 0.4 → a 40% reduction in mispredictions

54 CS4/MSc Parallel Architectures - 2009-2010 2-bit Branch Prediction  Branch prediction buffers: –Match branch PC during IF or ID stages  2-bit saturating counter: –00: do not take –01: do not take –10: take –11: take 3 [Figure: a buffer mapping branch PCs (0x135c8, 0x147e0, …) to 2-bit outcome counters (00, 01, …), looked up when fetching, e.g., 0x135c4: add r1,r2,r3; 0x135c8: bne r1,r0,n]
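
The 2-bit saturating counter is compact enough to state exactly in C; a hedged sketch (table size and PC hashing are my own choices):

    #include <stdint.h>

    #define TABLE_SIZE 1024
    static uint8_t counters[TABLE_SIZE];   /* 2-bit counters, values 0..3 */

    /* Predict taken in states 10 and 11 (counter >= 2). */
    int predict(uint32_t pc) {
        return counters[pc % TABLE_SIZE] >= 2;
    }

    /* Saturating update: step toward 3 on taken, toward 0 on not taken. */
    void update(uint32_t pc, int taken) {
        uint8_t *c = &counters[pc % TABLE_SIZE];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }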

55 CS4/MSc Parallel Architectures - 2009-2010 (2,2) Correlating Predictor  Organized as a table of 2-bit prediction values indexed by the sequence of past branch outcomes and by the branch PC  For example: if the four counter values (for histories NT/NT, T/NT, NT/T, T/T) are 00, 01, 10, 01 and the last two branches were, respectively, taken and not taken, then we will predict the branch as not taken (01)  This is an example of a context-based branch predictor 4 [Figure: per-branch row of four 2-bit counters, one column per two-branch history; 00/01 mean do not take, 10/11 mean take]

56 CS4/MSc Parallel Architectures - 2009-2010 Two Level Branch Predictors 5  Two types of arrangement/indexing: –Global: Information is not particular to a branch and the table/information is not directly indexed by the branch’s PC  Good when branches are highly correlated –Local (a.k.a. per address): Information is particular to a branch and the table/information is indexed by the branch’s PC  Good when branches are individually highly biased –Partially local: Table/information is indexed by part of the branch’s PC (in order to save bits in the tags for the tables) –Note: sometimes global information may be indexed by information that was local, and is then somewhat indexed by the branch’s PC

57 CS4/MSc Parallel Architectures - 2009-2010 Two Level Branch Predictors 6  1st level: history of the last n branches –If global:  Single History Register (HR) (n-bit shift register) with the last outcomes of all branches –If local:  Multiple HR’s in a History Register Table (HRT) that is indexed by the branch’s PC, where each HR contains the last outcomes of the corresponding branch only  2nd level: the branch behavior of the last s occurrences of the history pattern –If global:  Single Pattern History Table (PHT) indexed by the resulting HR contents –If local:  Multiple PHT’s that are indexed by the branch’s PC, where each entry is indexed by the resulting HR contents –Thus, 2^n entries for each HR

58 CS4/MSc Parallel Architectures - 2009-2010 Two Level Branch Predictors 7  Example with global history and global pattern table (GAg): –All branches use the same HR –All branches use the same PHT –The 2-bit saturating counter is only an example and other schemes are possible –Meaning: “When the outcome of the last n branches, whichever they were, is 11…10 then the prediction is P, regardless of what branch is being predicted” [Figure: the n-bit Branch History Register (e.g., 111…0) indexes the Pattern History Table of 2-bit saturating counters; the selected entry (e.g., P = 01) gives the prediction, here Predict Not Taken]
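
A hedged C sketch of GAg (history depth and table size are my own choices; the scheme is the one on the slide):

    #include <stdint.h>

    #define HIST_BITS 12
    static uint16_t ghr;                   /* global Branch History Register */
    static uint8_t  pht[1 << HIST_BITS];   /* Pattern History Table of
                                              2-bit saturating counters      */

    int gag_predict(void) {
        return pht[ghr & ((1u << HIST_BITS) - 1)] >= 2;
    }

    void gag_update(int taken) {
        uint8_t *c = &pht[ghr & ((1u << HIST_BITS) - 1)];
        if (taken  && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
        ghr = (uint16_t)((ghr << 1) | (taken ? 1u : 0u));  /* shift in outcome */
    }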

59 CS4/MSc Parallel Architectures - 2009-2010 Two Level Branch Predictors 8  Example with local history and global pattern table (PAg): –Each branch uses its own HR –All branches use the same PHT –Meaning: “When the outcome of the last n instances of the branch being predicted is 11…10 then the prediction is P, regardless of what branch is being predicted” [Figure: the branch PC selects one tagged HR from the Branch History Registers; that HR’s contents index the single shared Pattern History Table entry P]

60 CS4/MSc Parallel Architectures - 2009-2010 Two Level Branch Predictors 9  Example with local history and local pattern table (PAp): –Each branch uses its own HR –Each branch uses its own PHT –Meaning: “When the outcome of the last n instances of the branch being predicted is 11…10 then the prediction is P for this particular branch” [Figure: the branch PC selects one tagged HR and one per-branch Pattern History Table; the HR contents index that PHT, so the same history can give P for one branch and P’ for another]

61 CS4/MSc Parallel Architectures - 2009-2010 Two Level Branch Predictors 10  Notes: –When only part of the branch’s PC is used for indexing there is aliasing (i.e., multiple branches appear to be the same) –In practice there is a finite number of entries in the tables with local information, so  Either these only cache information for the most recently seen branches  Or the tables are indexed by hashing (usually with an XOR) the branch’s PC (this also leads to aliasing) –Aliasing also happens with global information, as multiple branches appear to have the same behavior/prediction –Accuracy of the predictor depends on:  Local versus global information at each level  Size of the tables in local schemes (number of different branches that can be tracked)  Depth of the history (n)  Amount of aliasing

62 CS4/MSc Parallel Architectures - 2009-2010 Two Level Branch Predictors 11  Updates: –The HR’s are updated with the outcome of the branch being predicted (only the corresponding HR in case of a local scheme) –The predictor in the selected PHT entry is updated with the outcome of the branch (e.g., a 2-bit saturating counter is incremented/decremented if the outcome is taken/not taken)  Taxonomy: –History Table type:  Global: GA; Local (per address): PA –Pattern Table type:  Global: g; Local (per address): p –Thus: GAg = global history table and global pattern table; PAg = local history table and global pattern table –The GAp combination does not make much sense

63 CS4/MSc Parallel Architectures - 2009-2010 Local vs. Global Predictors 12  The simple 2-bit predictor performs best for small predictor sizes, but saturates quickly at an accuracy below that of the other predictors  Local outperforms global for all these predictor sizes Data from McFarling for SPEC 1989 Int and FP

64 CS4/MSc Parallel Architectures - 2009-2010 Combining Branch Predictors 13  Different predictors are good at different behaviors  Different predictors have different accuracy and latency  Combining predictors –Can lead to schemes that are good at more behaviors –Can quickly generate a reasonably accurate prediction and, with some more delay, a highly accurate prediction, which corrects the previous prediction if necessary –Usually combine a simple and a complex predictor  Choosing between multiple predictors: –“Meta-predictor” to choose the predictor that most likely has the correct prediction –Augment predictors with confidence estimators

65 CS4/MSc Parallel Architectures - 2009-2010 Combining Branch Predictors 14  Meta predictor –Use a 2-bit saturating counter to select which predictor to use [Figure: the branch PC indexes a tagged table of 2-bit selector counters; the selected counter (e.g., S = 01 → use predictor P2) drives a 2:1 MUX that picks between the predictions of P1 and P2 to produce the final prediction]

66 CS4/MSc Parallel Architectures - 2009-2010 Combining Branch Predictors 15  Meta predictor –2-bit saturating counter interpretation:  00: Use P2  01: Use P2  10: Use P1  11: Use P1 –Updating counter:  P1 correct and P2 correct this time: no change to counter  P1 correct and P2 incorrect this time: increment counter  P1 incorrect and P2 correct this time: decrement counter  P1 incorrect and P2 incorrect this time: no change to counter  Choosing among more than 2 predictors is more involved and rarely pays off
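
The selector logic above maps directly to code; a hedged C sketch (names mine, update rule as on the slide):

    #include <stdint.h>

    typedef struct { uint8_t sel; } meta_entry;   /* 2-bit counter, 0..3 */

    /* 11/10 -> use P1's prediction, 01/00 -> use P2's. */
    int meta_choose(const meta_entry *m, int pred1, int pred2) {
        return m->sel >= 2 ? pred1 : pred2;
    }

    /* Move toward the predictor that alone was correct; no change
       when both were right or both were wrong. */
    void meta_train(meta_entry *m, int p1_correct, int p2_correct) {
        if (p1_correct && !p2_correct && m->sel < 3) m->sel++;
        if (!p1_correct && p2_correct && m->sel > 0) m->sel--;
    }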

67 CS4/MSc Parallel Architectures - 2009-2010 Example: The Alpha 21464 Predictors 16  8-wide out-of-order superscalar processor with a very deep pipeline and multithreading  Predictors take approximately 44KBytes of storage  Up to 16 branches predicted every cycle  Minimum misprediction penalty of 14 cycles (112 instructions) and most common is 20 or 25 cycles (160 or 200 instructions)  Based on global schemes; local schemes were ruled out because: –They would require up to 16 parallel lookups of the tables –Difficult to maintain per-branch information (e.g., the same branch may appear multiple times in such a deeply pipelined wide issue machine)  In addition to conditional branch prediction it has a jump predictor and a return address stack predictor

68 CS4/MSc Parallel Architectures - 2009-2010 Example: The Alpha 21464 Predictors 17  Fetch unit: –Can fetch up to 16 instructions from 2 dynamically consecutive I-cache lines –Instruction fetch stops at the first taken branch (predicted not taken branches (up to 16) do not stop fetch)  1 st Predictor: Next Line Predictor –Operates within a single cycle –Unacceptably high misprediction rate  2 nd Predictor: 2Bc-gskew –Operates over 2 cycles and is pipelined –Actually consists of 2 different predictors (a 2-bit saturating counter and an e-gskew) combined and with a meta predictor selector –Uses “de-aliasing” approach:  Partition the tables into multiple sets and use special hashing functions  Shown to reduce aliasing in global schemes

69 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 18  Seminal branch prediction work: “Two-Level Adaptive Training Branch Prediction”, T.-Y. Yeh and Y. Patt, Intl. Symp. on Microarchitecture, December 1991. “Alternative Implementations of Two-Level Adaptive Branch Prediction”, T.-Y. Yeh and Y. Patt, Intl. Symp. on Computer Architecture, June 1992. “Improving the Accuracy of Dynamic Branch Prediction Using Branch Correlation”, S.-T. Pan, K. So, and J. T. Rahmeh, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1992. “Combining Branch Predictors”, S. McFarling, WRL Technical Note TN-36, June 1993.  Adding confidence estimation to predictors: “Assigning Confidence to Conditional Branch Predictions”, E. Jacobsen, E. Rotenberg, and J. Smith, Intl. Symp. on Microarchitecture, December 1996.

70 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 19  Alpha 21464 predictor: “Design Tradeoffs for the Alpha EV8 Conditional Branch Predictor”, A. Seznec, S. Felix, V. Krishnan, and Y. Sazeides, Intl. Symp. on Computer Architecture, June 2002. “Next Cache Line and Set Prediction”, B. Calder and D. Grunwald, Intl. Symp. on Computer Architecture, June 1995. “Trading Conflict and Capacity Aliasing in Conditional Branch Predictors”, P. Michaud, A. Seznec, and R. Uhlig, Intl. Symp. on Computer Architecture, June 1997.  Neural net based branch predictors: “Fast Path-Based Neural Branch Prediction”, D. Jimenez, Intl. Symp. on Microarchitecture, December 2003.  Championship Branch Prediction –www.jilp.org/cbp/ –camino.rutgers.edu/cbp2/

71 CS4/MSc Parallel Architectures - 2009-2010 Probing Further 20  Advanced register allocation and de-allocation “Late Allocation and Early Release of Physical Registers”, T. Monreal, V. Vinals, J. Gonzalez, A. Gonzalez, and M. Valero, IEEE Trans. on Computers, October 2004.  Value prediction “Exceeding the Dataflow Limit Via Value Prediction”, M. H. Lipasti and J. P. Shen, Intl. Symp. on Microarchitecture, December 1996.  Limitations to wide issue processors “Complexity-Effective Superscalar Processors”, S. Palacharla, N. P. Jouppi, and J. Smith, Intl. Symp. on Computer Architecture, June 1997. “Clock Rate Versus IPC: the End of the Road for Conventional Microarchitectures”, V. Agarwal, M. S. Hrishikesh, S. W. Keckler, and D. Burger, Intl. Symp. on Computer Architecture, June 2000.  Recent alternatives to out-of-order execution ““Flea-flicker” Multipass Pipelining: An Alternative to the High-Power Out-of-Order Offense”, R. D. Barnes, S. Ryoo, and W. Hwu, Intl. Symp. on Microarchitecture, November 2005.

72 CS4/MSc Parallel Architectures - 2009-2010 Lect. 5: Vector Processors  Many real-world problems, especially in science and engineering, map well to computation on arrays  RISC approach is inefficient: –Based on loops → require dynamic or static unrolling to overlap computations –Indexing arrays based on arithmetic updates of induction variables –Fetching of array elements from memory based on individual, and unrelated, loads and stores –Small register files –Instruction dependences must be identified for each individual instruction  Idea: –Treat operands as whole vectors, not as individual integer or floating-point numbers –A single machine instruction now operates on whole vectors (e.g., a vector add) –Loads and stores to memory also operate on whole vectors –Individual operations on vector elements are independent and only dependences between whole vector operations must be tracked 1

73 CS4/MSc Parallel Architectures - 2009-2010 Execution Model  Straightforward RISC code: –F2 contains the value of s –R1 contains the address of the first element of a –R2 contains the address of the first element of b –R3 contains the address of the last element of a + 8 2

    for (i=0; i<64; i++) a[i] = b[i] + s;

    loop: L.D    F0,0(R2)    ;F0=array element of b
          ADD.D  F4,F0,F2    ;main computation
          S.D    F4,0(R1)    ;store result
          DADDUI R1,R1,8     ;increment index
          DADDUI R2,R2,8     ;increment index
          BNE    R1,R3,loop  ;next iteration

74 CS4/MSc Parallel Architectures - 2009-2010 Execution Model  Straightforward vector code: –F2 contains the value of s –R1 contains the address of the first element of a –R2 contains the address of the first element of b –Assume vector registers have 64 double precision elements 3

    for (i=0; i<64; i++) a[i] = b[i] + s;

    LV      V1,R2     ;V1=array b
    ADDVS.D V2,V1,F2  ;main computation
    SV      V2,R1     ;store result

–Notes:  Some vector operations require access to the integer and FP register files as well  In practice vector registers are not of the exact size of the arrays  Refer to Figure G.3 of Hennessy & Patterson for a list of the most common types of vector instructions  Only 3 instructions are executed, compared to the 6*64=384 executed in the RISC version

75 CS4/MSc Parallel Architectures - 2009-2010 Execution Model (Pipelined) 4  With multiple vector units, I2 can execute together with I1 (as we will see later)  In practice, the vector unit takes several cycles to operate on each element, but is pipelined [Diagram: I1 passes through IF and ID once; from cycle 3 onward its elements I1.1, I1.2, I1.3, … enter EXE one per cycle and flow through MEM and WB, with I2 fetched behind it]

76 CS4/MSc Parallel Architectures - 2009-2010 Pros of Vector Processors  Reduced pressure on instruction fetch –Fewer instructions are necessary to specify the same amount of work  Reduced pressure on instruction issue –Reduced number of branches alleviates branch prediction –Much simpler hardware for checking dependences  Simpler register file –No need for too many ports as only one element is used per cycle (for the pipelined approach)  More streamlined memory accesses –Vector loads and stores specify a regular access pattern –High latency of initiating memory access is amortized 5

77 CS4/MSc Parallel Architectures - 2009-2010 Cons of Vector Processors  Requires a specialized, high-bandwidth, memory system –Caches do not usually work well with vector processors –Usually built around heavily banked memory with data interleaving  Still requires a traditional scalar unit (integer and FP) for the non-vector operations  Difficult to maintain precise interrupts (can’t roll back all the individual operations already completed)  Compiler or programmer has to vectorize programs  Not very efficient for small vector sizes  Not suitable/efficient for many different classes of applications 6

78 CS4/MSc Parallel Architectures - 2009-2010 Performance Issues  Performance of a vector instruction depends on the length of the operand vectors  Initiation rate –Rate at which individual operations can start in a functional unit –For fully pipelined units this is 1 operation per cycle –Usually >1 for load/store unit  Start-up time –Time it takes to produce the first element of the result –Depends on how deep the pipeline of the functional units are –Especially large for load/store unit  With an initiation rate of 1, the time to complete a single vector instruction is equal to the vector size + the start-up time, which is approximately equal to the vector size for large vectors 7
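
The amortization argument is worth seeing numerically; a hedged C sketch of the slide's model (the start-up value is an assumption):

    #include <stdio.h>

    /* With an initiation rate of 1, a length-n vector instruction takes
       about startup + n cycles, i.e., (startup + n) / n cycles per element. */
    double cycles_per_element(int n, int startup) {
        return (double)(startup + n) / n;
    }

    int main(void) {
        for (int n = 4; n <= 256; n *= 4)
            printf("n=%3d: %.2f cycles/element\n", n, cycles_per_element(n, 12));
        return 0;
    }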

79 CS4/MSc Parallel Architectures - 2009-2010 Performance Issues  Common vector processor performance metrics: –R∞: the rate of execution of the processor with vectors of infinite size (i.e., with no overheads due to smaller vectors) –N_1/2: the vector length required for the processor to reach half of R∞ –N_V: the vector length required for the processor to match the performance of scalar execution (i.e., the point at which it pays off to execute in vector mode) 8

80 CS4/MSc Parallel Architectures - 2009-2010 Dealing with Vector Sizes  Two new registers are used: –vector length register (VLR) specifies (to the hardware) what length is to be assumed for the next instruction to be issued –maximum vector length (MVL) specifies (to the programmer/compiler) what the maximum length is (i.e., the size of the registers in the particular machine)  Use strip mining for user arrays larger than MVL (the strip-mining loop on this slide did not survive the transcript; a sketch follows below) 9
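
A hedged reconstruction of the standard strip-mining idiom (variable names mine; compare Hennessy & Patterson's version):

    /* Process n elements in strips of at most MVL elements; the inner
       loop corresponds to one vector instruction executed with VLR = VL. */
    int low = 0;
    int VL  = n % MVL;                 /* possibly short first strip */
    for (int j = 0; j <= n / MVL; j++) {
        for (int i = low; i < low + VL; i++)
            a[i] = b[i] + s;
        low += VL;
        VL = MVL;                      /* remaining strips are full length */
    }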

81 CS4/MSc Parallel Architectures - 2009-2010 Advanced Features: Masking  What if the operations involve only some elements of the array, depending on some run-time condition?  Solution: masking –Add a new boolean vector register (the vector mask register) –The vector instruction then only operates on elements of the vectors whose corresponding bit in the mask register is 1 –Add new vector instructions to set the mask register  E.g., SNEVS.D V1,F0 sets to 1 the bits in the mask register whose corresponding elements in V1 are not equal to the value in F0  The CVM instruction sets all bits of the mask register to 1 10

    for (i=0; i<64; i++) if (b[i] != 0) a[i] = b[i] + s;

82 CS4/MSc Parallel Architectures - 2009-2010 Advanced Features: Masking  Vector code: –F2 contains the value of s and F0 contains zero –R1 contains the address of the first element of a –R2 contains the address of the first element of b –Assume vector registers have 64 double precision elements 11

    for (i=0; i<64; i++) if (b[i] != 0) a[i] = b[i] + s;

    LV      V1,R2     ;V1=array b
    SNEVS.D V1,F0     ;mask bit is 1 if b!=0
    ADDVS.D V2,V1,F2  ;main computation
    CVM               ;reset all mask bits to 1
    SV      V2,R1     ;store result

83 CS4/MSc Parallel Architectures - 2009-2010 Advanced Features: Scatter-Gather  How can we handle sparse matrices?  Solution: scatter-gather –Use the contents of an auxiliary vector to select which elements of the main vector are to be used –This is done by pointing to the address in memory of the elements to be selected –Add a new vector instruction to load memory values based on this auxiliary vector  E.g., LVI V1,(R1+V2) loads the elements of a user array from memory locations R1+V2(i)  Also the SVI store counterpart 12

    for (i=0; i<64; i++) a[K[i]] = b[K[i]] + s;

84 CS4/MSc Parallel Architectures - 2009-2010 Advanced Features: Scatter-Gather  Vector code: –F2 contains the value of s –R1 contains the address of the first element of a –R2 contains the address of the first element of b –V3 contains the indices of a and b that need to be used –Assume vector registers have 64 double precision elements 13

    for (i=0; i<64; i++) a[K[i]] = b[K[i]] + s;

    LVI     V1,(R2+V3)  ;V1=array b indexed by V3
    ADDVS.D V2,V1,F2    ;main computation
    SVI     V2,(R1+V3)  ;store result

85 CS4/MSc Parallel Architectures - 2009-2010 Advanced Features: Striding  Assume that the 2D array b is laid out by rows –Iterations access non-contiguous elements of b –Could use scatter-gather, but this would waste a vector register –The access pattern is very regular and a single integer, the stride, fully defines it –Add a new vector instruction to load values from memory based on the stride  E.g., LVWS V1,(R1,R2) loads the elements of a user array from memory locations R1+i*R2  Also the SVWS store counterpart 14

    for (i=0; i<64; i++) a[i] = b[i,j] + s;
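
In C terms the pattern is a column walk of a row-major array; a hedged illustration (the column j is arbitrary):

    #include <stdio.h>

    int main(void) {
        static double a[64], b[64][64];
        double s = 3.0;
        int j = 5;                       /* fixed column, chosen arbitrarily */
        for (int i = 0; i < 64; i++)
            a[i] = b[i][j] + s;          /* consecutive accesses are one full
                                            row apart: the stride for LVWS   */
        printf("stride = %zu bytes\n", sizeof(b[0]));  /* 64*8 = 512 bytes */
        return 0;
    }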

86 CS4/MSc Parallel Architectures - 2009-2010 Advanced Features: Chaining  Forwarding in pipelined RISC processors allows dependent instructions to execute as soon as the result of the previous instruction is available 15

    ADD.D R1,R2,R3  # R1=R2+R3
    MUL.D R4,R5,R1  # R4=R5*R1

[Diagram: the MUL’s EXE stage starts right after the ADD’s EXE stage, with the R1 value forwarded directly between the functional units]

87 CS4/MSc Parallel Architectures - 2009-2010 Advanced Features: Chaining  A similar idea applies to vector instructions and is called chaining –The difference is that chaining of vector instructions requires multiple functional units, as the same unit cannot be used back-to-back 16

    ADDV.D V1,V2,V3  # V1=V2+V3
    MULV.D V4,V5,V1  # V4=V5*V1

[Diagram: as each element A.i of the ADDV leaves its unit, its value is forwarded so the corresponding element M.i of the MULV starts in a second unit one cycle behind]

88 CS4/MSc Parallel Architectures - 2009-2010 Example: The Earth Simulator 17  73rd fastest supercomputer as of the Top500 list of November 2008 (was 1st from March 2002 to September 2004)  Multiprocessor Vector architecture –640 nodes, 8 vector processors per node → 5120 processors –8 pipelines per vector processor –10 TBytes of main memory –Vector units contain 72 vector registers, each with 256 elements  Performance and Power consumption –35.9 TFLOPS on the Top500 benchmark (the closest RISC-based multiprocessor (#72) reaches 36.6 TFLOPS using 9216 processors) –12800 KWatts power consumption  Designed specifically to simulate nature (e.g., weather, ocean, earthquakes) at a global scale (i.e., the whole earth)

89 CS4/MSc Parallel Architectures - 2009-2010 Further Reading 18  The first truly successful vector supercomputer: “The CRAY-1 Computer System”, R. M. Russel, Communications of the ACM, January 1978.  A recent vector processor on a chip: “Vector vs. Superscalar and VLIW Architectures for Embedded Multimedia Benchmarks”, C. Kozyrakis and D. Patterson, Intl. Symp. on Microarchitecture, December 2002.  Integrating a vector unit with a state-of-the-art superscalar: “Tarantula: A Vector Extension to the Alpha Architecture”, R. Espasa, F. Ardanaz, J. Elmer, S. Felix, J. Galo, R. Gramunt, I. Hernandez, T. Ruan, G. Lowney, M. Mattina, and A. Seznec, Intl. Symp. on Computer Architecture, June 2002.

90 CS4/MSc Parallel Architectures - 2009-2010 Lect. 6: SIMD Processors 1  Superscalar execution model: –Mix of scalar ALUs –n unrelated instructions per cycle –2n unrelated operands per cycle –Results from any ALU can feed back to any ALU individually –Operands are wide (32/64 bits)  Vector execution model: –Vector ALU –1 vector instruction → multiple instances of the same operation –Operands belong to an array –Results are written back to the reg. file –Operands are wide (32/64 bits) [Diagrams: both models have an instruction sequencer and a register file; the superscalar feeds several independent scalar ALUs, the vector model feeds one vector ALU]

91 CS4/MSc Parallel Architectures - 2009-2010 Original SIMD Idea 2  Network of simple processing elements (PE) –PEs operate in lockstep under the control of a master sequencer, the array control unit (ACU) (note: masking is possible) –PEs can exchange results with a small number of neighbors via special data-routing instructions –Each PE has its own local memory or (less common) accesses memory via an alignment network –PEs operate on very narrow operands (1 bit in the extreme case of the CM-1) –Very large (up to 64K) number of PEs –Usually operated as co-processors with a host computer to perform I/O and to handle external memory  Suitable for some scientific, AI, and visualization applications  Intended for use as supercomputers  Programmed via custom extensions of common HLLs

92 CS4/MSc Parallel Architectures - 2009-2010 Original SIMD Idea 3 [Diagram: a single instruction sequencer broadcasting to an array of processing elements, each with its own local memory M]

93 CS4/MSc Parallel Architectures - 2009-2010 Example: Equation Solver Kernel  The problem: –Operate on a (n+2)x(n+2) matrix  SIMD implementation: –Assign one node to each PE –Step 1: all PE’s send their data to their east neighbors and simultaneously read the data sent by their west neighbors (nodes at the right, top, and bottom rim are masked out at this step) –Steps 2 to 4: same as step 1 for west, south, and north (again, appropriate nodes are masked out) –Step 5: all PE’s compute the new value using equation above –Note: strictly speaking we need some extra tricks to juggle new and old values 4 A[i,j] = 0.2 x (A[i,j] + A[i,j-1] + A[i-1,j] + A[i,j+1] + A[i+1,j])

94 CS4/MSc Parallel Architectures - 2009-2010 Example: MasPar MP-1  Key features –First SIMD to use a traditional RISC IS –ACU also performs non-SIMD operations/computation –From 1K to 16K PE’s –PE array interconnects  2D mesh for 8-way (N, S, E, W, NE, SE, SW, NW) neighbor communication (X net)  Circuit-switched 3 stage hierarchical crossbar for any-to-any communication  Two global buses for ACU-PE lockstep control –PE’s have local memory for data (16KB) (instructions are stored in the ACU) –PE’s commonly operate on 32 bit words, but can also operate on individual bits, bytes, 16 bit words, and 64 bit words 5

95 CS4/MSc Parallel Architectures - 2009-2010 Example: MasPar MP-1 6 [Figure from Blank: the PE array with its 2D mesh, the ACU and Unix host, and the crossbar with routers]

96 CS4/MSc Parallel Architectures - 2009-2010 A Modern SIMD Co-processor  ClearSpeed CSX600 –Intended as an accelerator for high performance technical computing –The current implementation has 96 PE’s plus a scalar unit for non-SIMD operations (including control flow) –Each PE is in fact a VLIW core –1, 2, 4, and 8 byte operands –PE’s can communicate directly with right and left neighbors –Also supports multithreading to hide I/O latency (Lecture 12) –Uses traditional instruction and data caches in addition to memory local to each PE –Programmed with an extension of C  Poly variables: replicated in each PE with different values  Mono variables: only a single instance exists (either at the host, or replicated at the PE’s but with synchronized values) 7

97 CS4/MSc Parallel Architectures - 2009-2010 A Modern SIMD Co-processor 8 [Figure from ClearSpeed: the PE array with local memories (SRAM) and registers, the RISC scalar processor and ACU, and the neighbor communication infrastructure (swazzle)]

98 CS4/MSc Parallel Architectures - 2009-2010 Multimedia SIMD Extensions  Key ideas: –No network of processing elements, but an array of (narrow) ALU’s –No memories associated with ALU’s, but a pool of relatively wide (64 to 128 bits) registers that store several operands –Still narrow operands (8 bits) and instructions that use operands of different sizes –No direct communication between ALU’s, but via registers and with special shuffling/permutation instructions –Not co-processors or supercomputers, but tightly integrated into CPU pipeline –Still lockstep operation of ALU’s –Special instructions to handle common media operations (e.g., saturated arithmetic) 9

99 CS4/MSc Parallel Architectures - 2009-2010 Multimedia SIMD Extensions  SIMD ext. execution model: 10 [Diagram: an instruction sequencer and a wide register file feed an array of narrow ALUs through a shuffling network; inter-register operations apply the same operation element-wise across two registers (R3 = R1 op R2), while intra-register operations combine elements within a register]

100 CS4/MSc Parallel Architectures - 2009-2010 Example: Intel SSE  Streaming SIMD Extensions introduced in 1999 with the Pentium III  Improved over the earlier MMX (1997) –MMX re-used the FP registers –MMX only operated on integer operands  70 new machine instructions (SSE2 added 144 more in 2001) and 8 128-bit registers –Registers are part of the architectural state –Include instructions to move values between SSE and x86 registers –Operands can be: single (32-bit) and double (64-bit) precision FP; 8, 16, and 32 bit integer –Some instructions to support digital signal processing (DSP) and 3D –SSE2 included instructions for handling the cache (recall that streaming data does not utilize caches efficiently) 11
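
A small taste of how SSE is reached from C via intrinsics (a hedged sketch; the header and intrinsic names are the standard ones, the rest is mine):

    #include <stdio.h>
    #include <xmmintrin.h>   /* SSE intrinsics */

    int main(void) {
        float a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40}, c[4];
        __m128 va = _mm_loadu_ps(a);     /* load 4 packed floats             */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* one instruction, 4 lockstep adds */
        _mm_storeu_ps(c, vc);
        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
        return 0;
    }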

101 CS4/MSc Parallel Architectures - 2009-2010 A Modern SIMD Variation: Cell 12  IBM/Sony/Toshiba Cell Broadband Engine:  Heterogeneous “multi-core” system with 1 PowerPC (PPE) + 8 SIMD engines (SPE – “Synergistic Processor Units”)  On-chip storage based on “scratch pads” (very, very hard to program)  Used in the Playstation 3  SIMD support  SPE’s are incapable of independent control and are “slaves” to PowerPC  PPE already supports SIMD extensions (IBM’s VMX)  SPE supports SIMD through specific IS  128 128-bit registers and 128 bit datapath (note: no scalar registers in SPE)  Accessible to programmer through HLL intrinsics (i.e., function calls, e.g., spu_add(a,b))  Additional support for synchronization across SPE’s and PPE and for data transfer

102 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 13  Seminal SIMD work: “A Model of SIMD Machines and a Comparison of Various Interconnection Networks”, H. Siegel, IEEE Trans. on Computers, December 1979. “The Connection Machine”, D. Hillis, Ph.D. dissertation, MIT, 1985.  Two commercial SIMD supercomputers: “The CM-2 Technical Summary”, Thinking Machines Corporation, 1990. “The MasPar MP-1 Architecture”, T. Blank, Compcon, 1990.  A modern SIMD co-processor: “CSX Processor Architecture”, ClearSpeed, Whitepaper, 2006.

103 CS4/MSc Parallel Architectures - 2009-2010 Lect. 7: Shared Mem. Multiprocessors I/V  Obtained by connecting full processors together –Processors contain normal width (32 or 64 bits) datapaths –Processors are capable of independent execution and control –Processors have their own connection to memory (Thus, by this definition, Sony’s Playstation 3 is not a multiprocessor as the 8 SPE’s in the Cell are not full processors)  Have a single OS for the whole system, support both processes and threads, and appear as a common multiprogrammed system (Thus, by this definition, Beowulf clusters are not multiprocessors)  Can be used to run multiple sequential programs concurrently or parallel programs  Suitable for parallel programs where threads can follow different code 1

104 CS4/MSc Parallel Architectures - 2009-2010  Recall the communication model: –Threads in different processors can use the same virtual address space –Communication is done through shared memory variables –Explicit synchronization with locks (e.g., variable flag below) and critical sections 2

Producer (p1):        Consumer (p2):
    flag = 0;             flag = 0;
    …                     …
    a = 10;               while (!flag) {}
    flag = 1;             x = a * y;

Shared Memory Multiprocessors

105 CS4/MSc Parallel Architectures - 2009-2010 Shared Memory Multiprocessors  Recall the two common organizations: –Physically centralized memory, uniform memory access (UMA) (a.k.a. SMP) –Physically distributed memory, non-uniform memory access (NUMA) (Note that both organizations have caches between processors and memory) 3 [Figure: UMA, several CPUs each with a cache sharing one main memory; NUMA, several nodes each with a CPU, a cache, and a local memory]

106 CS4/MSc Parallel Architectures - 2009-2010 The Cache Coherence Problem 4 [Figure: two CPUs with private caches and a main memory holding A=1. Timeline T0-T5: at T0, A=1 in memory and uncached; each CPU in turn loads A (A=1) into its cache; at T3 one CPU stores A=2 into its own cache, leaving the copy in the other cache (A=1) stale; at T4 and T5 the other CPU loads A and its cache returns A=1, so it uses the stale value!]

107 CS4/MSc Parallel Architectures - 2009-2010 Cache Coherence Protocols  Idea: –Keep track of which processors have copies of which data –Enforce that at any given time a single current value exists for each datum:  By getting rid of copies of the data with old values → invalidate protocols  By updating everyone’s copy of the data → update protocols  In practice: –Guarantee that old values are eventually invalidated/updated (write propagation) (recall that without synchronization there is no guarantee that a load will return the new value anyway) –Guarantee that only a single processor is allowed to modify a certain datum at any given time (write serialization) –Must appear as if no caches were present  Note: must fit with the cache’s operation at the granularity of lines 5

108 CS4/MSc Parallel Architectures - 2009-2010 Write-invalidate Example 6 [Figure: same setup and timeline as before, but at T3 the store A=2 also invalidates the stale copy in the other cache; the subsequent loads at T4 and T5 miss, fetch the line again, and return the new value A=2]

109 CS4/MSc Parallel Architectures - 2009-2010 Write-update Example 7 [Figure: same setup and timeline, but at T3 the store A=2 updates the copy in the other cache to A=2 as well; the subsequent loads at T4 and T5 hit and return the new value]

110 CS4/MSc Parallel Architectures - 2009-2010 Invalidate vs. Update Protocols  Invalidate: + Multiple writes by the same processor to the cache block only require one invalidation + No need to send the new value of the data (less bandwidth) –Caches must be able to provide up-to-date data upon request –Must write-back data to memory when evicting a modified block Usually used with write-back caches (more popular)  Update: + New value can be re-used without the need to ask for it again + Data can always be read from memory + Modified blocks can be evicted from caches silently –Possible multiple useless updates (more bandwidth) Usually used with write-through caches (less popular) 8

111 CS4/MSc Parallel Architectures - 2009-2010 Cache Coherence Protocols  Implementation –Can be in either hardware or software, but software schemes are not very practical (and will not be discussed further in this course)  Add state bits to cache lines to track state of the line –Most common: Invalid, Shared, Owned, Modified, Exclusive –Protocols usually named after the states supported  Global state of a memory line corresponds to the collection of its state in all caches  Cache lines transition between states upon load/store operations from the local processor and by remote processors  These state transitions must guarantee the invariant: no two cache copies can be simultaneously modified 9

112 CS4/MSc Parallel Architectures - 2009-2010 Example: MSI Protocol  States: –Modified (M): block is cached only in this cache and has been modified –Shared (S): block is cached in this cache and possibly in other caches (no cache can modify the block) –Invalid (I): block is not cached 10

113 CS4/MSc Parallel Architectures - 2009-2010 Example: MSI Protocol  Transactions originated at this CPU: 11 [State diagram: Invalid → Shared on a CPU read miss; Invalid → Modified on a CPU write miss; Shared → Modified on a CPU write (upgrade); CPU read hits leave Shared and Modified unchanged; CPU write hits leave Modified unchanged]

114 CS4/MSc Parallel Architectures - 2009-2010 Example: MSI Protocol  Transactions originated at another CPU: 12 [State diagram: in addition to the local transitions of the previous slide, a remote write miss moves Shared and Modified lines to Invalid; a remote read miss moves Modified lines to Shared]
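To make the transitions concrete, a minimal sketch of the MSI next-state logic in C; this ignores the bus transactions and transient states (see Lecture 8), so it is an illustration rather than a full controller:

    typedef enum { MSI_I, MSI_S, MSI_M } msi_t;   /* line states */

    /* Next state on a local access; misses and S->M upgrades would
       also issue a bus transaction. */
    msi_t msi_local(msi_t cur, int is_write) {
        switch (cur) {
        case MSI_I: return is_write ? MSI_M : MSI_S;  /* read/write miss */
        case MSI_S: return is_write ? MSI_M : MSI_S;  /* upgrade on write */
        case MSI_M: return MSI_M;                     /* hits */
        }
        return cur;
    }

    /* State change observed when a remote processor misses on the line. */
    msi_t msi_remote(msi_t cur, int remote_is_write) {
        if (remote_is_write) return MSI_I;            /* invalidate */
        return (cur == MSI_M) ? MSI_S : cur;          /* downgrade M->S */
    }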

115 CS4/MSc Parallel Architectures - 2009-2010 Example: MESI Protocol  States: –Modified (M): block is cached only in this cache and has been modified –Exclusive (E): block is cached only in this cache, has not been modified, but can be modified at will –Shared (S): block is cached in this cache and possibly in other caches –Invalid (I): block is not cached  State E is obtained on reads when no other processor has a shared copy –All processors must answer if they have copies or not  Easily done in bus-based systems with a shared-OR line –Or some device must know if processors have copies  Advantage over MSI –Often variables are loaded, modified in register, and then stored –The store on state E then does not require asking for permission to write 13

116 CS4/MSc Parallel Architectures - 2009-2010 Example: MESI Protocol  Transactions originated at this CPU: 14 [State diagram: Invalid → Shared on a CPU read miss with sharing; Invalid → Exclusive on a CPU read miss with no sharing; Invalid → Modified on a CPU write miss; Shared → Modified on a CPU write, which must inform everyone (upgrade); Exclusive → Modified on a CPU write, which can be done silently; read hits leave Shared, Exclusive, and Modified unchanged; write hits leave Modified unchanged]

117 CS4/MSc Parallel Architectures - 2009-2010 Example: MESI Protocol  Transactions originated at another CPU: 15 [State diagram: a remote write miss moves Shared, Exclusive, and Modified lines to Invalid; a remote read miss moves Exclusive and Modified lines to Shared]

118 CS4/MSc Parallel Architectures - 2009-2010 Possible Implementations  Three possible ways of implementing coherence protocols in hardware –Snooping: all cache controllers monitor all other caches’ activities and maintain the state of their lines  Commonly used with buses and in many CMP’s today –Directory: a central control device directly handles all cache activities and tells the caches what transitions to make  Can be of two types: centralized and distributed  Commonly used with scalable interconnects and in many CMP’s today –List: each cache controller keeps track of its own state and the identity and state of its neighbors in a linked list  E.g., IEEE SCI protocol (ANSI/IEEE Std 1596-1992)  Only used in a few machines in the late 90’s 16

119 CS4/MSc Parallel Architectures - 2009-2010 Behavior of Cache Coherence Protocols  Uniprocessor cache misses (the 3 C’s): –Cold (or compulsory) misses: when a block is accessed for the first time –Capacity misses: when a block is not in the cache because it was evicted because the cache was full –Conflict misses: when a block is not in the cache because it was evicted because the cache set was full  Coherence misses: when a block is not in the cache because it was invalidated by a write from another processor –Hard to reduce → relates to intrinsic communication and sharing of data in the parallel application –False sharing coherence misses: processors modify different words of the cache block (no real communication or sharing) but end up invalidating the complete block 17

120 CS4/MSc Parallel Architectures - 2009-2010 Behavior of Cache Coherence Protocols  False sharing increases with larger cache line size –Only true sharing remains with single word/byte cache lines  False sharing can be reduced with better placement of data in memory  True sharing tends to decrease with larger cache line sizes (due to locality)  Classifying misses in a multiprocessor is not straightforward –E.g., if P0 has line A in the cache and evicts it due to capacity limitation, and later P1 writes to the same line: is this a capacity or a coherence miss? It is both, as fixing one problem (e.g., increasing cache size) won’t fix the other (see Figure 5.20 of Culler&Singh for a complete decision chart) 18
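As an illustration of the data-placement fix mentioned above, a hedged C sketch of two per-thread counters, first falsely shared and then padded onto separate lines (the 64-byte line size and the struct names are assumptions):

    #define LINE_SIZE 64

    /* Both counters fall in the same cache line: every update by one
       thread invalidates the other thread's copy (false sharing). */
    struct counters_bad {
        long c0;   /* updated by thread 0 */
        long c1;   /* updated by thread 1 */
    };

    /* Padding places each counter on its own line: no false sharing. */
    struct counters_good {
        long c0;
        char pad[LINE_SIZE - sizeof(long)];
        long c1;
    };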

121 CS4/MSc Parallel Architectures - 2009-2010 Behavior of Cache Coherence Protocols  Common types of data access patterns –Private: data that is only accessed by a single processor –Read-only shared: data that is accessed by multiple processors but only for reading (this includes instructions) –Migratory: data that is used and modified by multiple processors, but in turn –Producer-consumer: data that is updated by one processor and consumed by another –Read-write: data that is used and modified by multiple processors simultaneously  Falsely shared data  Data used for synchronization (Lecture 10)  Bottom line: threads don’t usually read and write the same data indiscriminately 19

122 CS4/MSc Parallel Architectures - 2009-2010  Snooping coherence on a simple shared bus –“Easy”, as all processors and the memory controller can observe all transactions –A bus-side cache controller monitors the tags of the lines involved and reacts if necessary by checking the contents and state of the local cache –The bus provides a serialization point (i.e., every transaction A is either before or after another transaction B)  More complex with split-transaction buses 1 [Figure: two processors P1 and P2, each with an L1 holding per-line state bits: 00 = invalid, 01 = shared, 10 = modified] Lect. 8: Shared Mem. Multiprocessors II/V

123 CS4/MSc Parallel Architectures - 2009-2010 “The devil is in the details”, Classic Proverb  Problem: conflict when the processor and the bus-side controller must check the cache at the same time  Solutions: –Use dual-ported modules for the tag and state array –Or duplicate the tag and state array  Both copies must be kept consistent when one is changed, which introduces some amount of conflicts 2 [Figure: as before, P1 and P2 with L1 state bits (00 = invalid, 01 = shared, 10 = modified); Ld/St requests from the processor and snoop lookups from the bus contend for the same tag and state arrays] Snooping on Simple Bus

124 CS4/MSc Parallel Architectures - 2009-2010  Problem: even if the bus is atomic, transactions are not instantaneous and may require several steps → transactions are not atomic –E.g., part of a transaction may be delayed by a memory response or by a bus-side controller that had to wait to access its tags –E.g., out-of-order processors may issue cache requests that conflict with the current request being served –E.g., an upgrade request may lose bus arbitration to another processor’s and may have to be re-issued as a full write miss (due to the required invalidation)  Solution: –Introduce transient states to cache lines and the protocol (the I, S, M, etc. states seen in Lecture 7 are then called the stable states) 3 Snooping on Simple Bus

125 CS4/MSc Parallel Architectures - 2009-2010 Example: Extended MESI Protocol  Transactions originated at this CPU: 4 [State diagram: the stable MESI transitions of Lecture 7 plus transient states. I→S,E: read miss pending, moving to Shared or Exclusive once the bus is granted, depending on whether any sharer answers. I→M: write miss pending, moving to Modified once the bus is granted. S→M: upgrade pending, moving to Modified if the bus is granted without conflict; on a conflicting remote request the upgrade is re-issued as a full write miss]

126 CS4/MSc Parallel Architectures - 2009-2010  Problems: –The processor interacts with the L1 while the bus snooping device interacts with the L2, and propagating such operations up or down is not instantaneous –L2 lines are usually bigger than L1 lines 5 Snooping with Multi-Level Hierarchies [Figure: P1 and P2, each with an L1 and an L2 holding per-line state bits (00 = invalid, 01 = shared, 10 = modified); Ld/St requests reach the L1 while the bus snoops at the L2]

127 CS4/MSc Parallel Architectures - 2009-2010  Solution: 1. Maintain the inclusion property –Lines in L1 must also be in L2 → no data is found solely in L1, so there is no risk of missing a relevant transaction when snooping at L2 –Lines in M state in L1 must also be in M state in L2 → the snooping controller at L2 can identify all data that is modified locally 2. Propagate coherence transactions 6 Snooping with Multi-Level Hierarchies

128 CS4/MSc Parallel Architectures - 2009-2010  Maintaining the inclusion property Assume: L1: associativity a1, number of sets n1, block size b1 L2: associativity a2, number of sets n2, block size b2 –Difficulty: replacement policy (e.g., LRU) Assume: a1=a2=2; b1=b2; n2=k*n1; lines m1, m2, and m3 map to the same set in L1 and the same set in L2 7 Snooping with Multi-Level Hierarchies [Figure: Ld m1 misses and fills both L1 and L2; Ld m2 misses and fills both; Ld m1 then hits in L1, so the LRU information in L2 is not updated; Ld m3 misses and fills both, with L1 evicting its LRU line m2 but L2 evicting its LRU line m1. Now m1 is in L1 but not in L2: inclusion is violated]

129 CS4/MSc Parallel Architectures - 2009-2010  Maintaining the inclusion property Assume: L1: associativity a1, number of sets n1, block size b1 L2: associativity a2, number of sets n2, block size b2 –Difficulty: different line sizes Assume: a1=a2=1; b1=1, b2=2; n1=4, n2=8 Thus, words w0 and w17 can coexist in L1 (they map to L1 sets 0 and 1), but not in L2 (their two-word blocks, w0-w1 and w16-w17, both map to the same L2 set) 8 Snooping with Multi-Level Hierarchies

130 CS4/MSc Parallel Architectures - 2009-2010  Maintaining the inclusion property –Most combinations of L1/L2 size, associativity, and line size do not automatically lead to inclusion –One solution is to have a1=1, a2≥1, b1=b2, and n1≤n2 –The more common solution is to invalidate the L1 line (or lines, if b1<b2) whenever the corresponding L2 line is evicted or invalidated 9 Snooping with Multi-Level Hierarchies

131 CS4/MSc Parallel Architectures - 2009-2010  Non-split-transaction buses are idle from when the address request is finished until the data returns from memory or another cache  In split-transaction buses, transactions are split into a request transaction and a response transaction, which can be separated in time  Sometimes implemented as two buses: one for requests and one for responses 10 Snooping with Split-Transaction Buses [Figure: bus timelines. On a normal bus, address 1 must be followed by data 1 before address 2 can start; on a split-transaction bus, new addresses (e.g., address 3) are issued while earlier responses (data 0, data 1) are still pending]

132 CS4/MSc Parallel Architectures - 2009-2010  Problems –Multiple requests can clash (e.g., a read and a write, or two writes, to the same data) (Note that this is more complicated than the case in Slide 3, as now different transactions may be at different stages of service) –Buffers used to hold pending transactions may fill up and cause incorrect execution and even deadlock (flow control is required) –Responses from multiple requests may appear in a different order than their respective requests  Responses and requests must then be matched using tags for each transaction  Note: it may be necessary for snoop controllers to request more time before responding (e.g., when they can’t get quick enough access to the local cache tags)  Note: snoop controllers may have to keep track themselves of which transactions are pending, in case there are conflicts 11 Snooping with Split-Transaction Buses

133 CS4/MSc Parallel Architectures - 2009-2010  Clashing requests –Allow only one request at a time for each line (e.g., SGI Challenge)  Flow control –Use negative acknowledgement (NACK) when buffers are full (requests must be retried later; a bit more tricky with responses, due to the danger of deadlock) (e.g., SGI Challenge) –Or design the size of all queues for the worst-case scenario  Ordering of transactions –Responses can appear in any order → the interleaving of the requests fully determines the order of transactions (e.g., SGI Challenge) –Or enforce a FIFO order of transactions across the whole system (caches + memory) (e.g., Sun Enterprise) 12 Snooping with Split-Transaction Buses

134 CS4/MSc Parallel Architectures - 2009-2010 Sun Enterprise (1996-2001)  Up to 30 UltraSparc processors (Enterprise 6000)  The Gigaplane bus (3rd generation of buses from Sun): –Peak bandwidth of 2.67GB/s at 83MHz –Supports up to 16 nodes (either processor or I/O boards) –256 bits data, 43 bits address, 32 bits ECC, and 57 control lines –Split-transaction with up to 112 outstanding transactions  Up to 30GB of main memory, 16-way interleaved  Memory is physically located on the processor boards, but it is still a UMA system 13 [Figure: CPU/Mem cards (two processors with L1/L2 and memory) and I/O cards (SBUS with FiberChannel, 100bT, SCSI) attached to the Gigaplane bus through bus interfaces]

135 CS4/MSc Parallel Architectures - 2009-2010 Sun Fire (2001-present)  Up to 106 UltraSparc III processors (Fire 15K)  The Fireplane bus (4th generation of buses from Sun): –Peak bandwidth of 9.6GB/s at 150MHz –Actually implemented using 4 levels of switches, not bus lines –Consists of two snooping domains connected by the upper-level switch  Up to 576GB of main memory  Memory is physically located on the processor boards, but it is still a UMA system 14 [Figure: switch hierarchy. Level 0 switches connect processor/L1/memory pairs in a low-end 2-processor system; a level 1 3x3 data switch scales to 8 processors; level 2 data switches scale to 24 processors; a level 3 18x18 data switch connects up to 106 processors]

136 CS4/MSc Parallel Architectures - 2009-2010  Like a bus, rings easily support broadcasts  Snooping is implemented by all controllers checking each message as it passes by and re-injecting it into the ring  Potentially multiple transactions can be in flight simultaneously on different stretches of the ring (harder to enforce proper ordering)  Large latency for long rings, growing linearly with the number of processors  Used to provide coherence across multiple chips in current CMP systems (e.g., IBM Power 5) 15 Snooping with Ring [Figure: six nodes, each with a processor, L1, and memory, connected in a ring]

137 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 16  Original (hardware) cache coherence works: “Using Cache Memory to Reduce Processor Memory Traffic”, J. Goodman, Intl. Symp. on Computer Architecture, June 1983. “A Low-Overhead Coherence Solution for Multiprocessors with Private Cache Memories”, M. Papamarcos and J. Patel, Intl. Symp. on Computer Architecture, June 1984. “Hierarchical Cache/Bus Architecture for Shared-Memory Multiprocessors”, A. Wilson Jr., Intl. Symp. on Computer Architecture, June 1987.  An early survey of cache coherence protocols: “Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model”, J. Archibald and J.-L. Baer, ACM Trans. on Computer Systems, November 1986.  Discussion on the difficulties of maintaining inclusion “On the Inclusion Properties for Multi-Level Cache Hierarchies”, J.-L. Baer and W.-H. Wang, Intl. Symp. on Computer Architecture, May 1988.

138 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 17  Modern bus-based coherent multiprocessors: “The Sun Fireplane System Interconnect”, A. Charlesworth, Supercomputing Conf., November 2001.  Some software cache coherence schemes: “The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture”, G. Pfister, W. Brantley, D. George, S. Harvey, W. Kleinfelder, K. McAuliffe, E. Melton, V. Norton, and J. Weiss, Intl. Conf. on Parallel Processing, August 1985. “Automatic Management of Programmable Caches”, R. Cytron, S. Karlowsky, and K. McAuliffe, Intl. Conf. on Parallel Processing, August 1988.

139 CS4/MSc Parallel Architectures - 2009-2010  Snooping coherence –Global state of a memory line is the collection of its state in all caches, and there is no summary state anywhere –All cache controllers monitor all other caches’ activities and maintain the state of their lines –Requires a broadcast shared medium (e.g., bus or ring) that also maintains a total order of all transactions –Bus acts as a serialization point to provide ordering  Directory coherence –Global state of a memory line is the collection of its state in all caches, but there is a summary state at the directory –Cache controllers do not observe all activity, but interact only with directory –Can be implemented on scalable networks, where there is no total order and no simple broadcast, but only one-to-one communication –Directory acts as a serialization point to provide ordering (Lecture 11) 1 Lect. 9: Shared Mem. Multiprocessors III/V

140 CS4/MSc Parallel Architectures - 2009-2010 Directory Structure  Directory information (for every memory line) –Line state bits (e.g., not cached, shared, modified) –Sharing bit-vector: one bit for each processor that is sharing, or for the single processor that has the modified line –Organized as a table indexed by the memory line address  Directory controller –Hardware logic that interacts with the cache controllers and enforces cache coherence 2 [Figure: directory information for a system with up to 3 processors. Cache states: 00 = invalid, 01 = shared, 10 = modified; directory states: 00 = not cached, 01 = shared, 10 = modified. A line in state 00 with an empty sharing vector is not cached anywhere, so the memory value (e.g., 4) is valid; a line in state 01 with sharing vector 101 is shared by P0 and P2, and the memory value (e.g., 9) is valid]
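A hedged C sketch of such a full-bit-vector entry (the field widths and names are illustrative, not from a particular machine):

    /* Full-bit-vector directory entry. */
    enum dir_state { DIR_NOT_CACHED, DIR_SHARED, DIR_MODIFIED };

    typedef struct {
        enum dir_state state;    /* summary state of the memory line */
        unsigned int   sharers;  /* bit i set -> processor i has a copy */
    } dir_entry_t;

    /* e.g., a line shared by P0 and P2:
       entry.state = DIR_SHARED; entry.sharers = (1u << 0) | (1u << 2); */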

141 CS4/MSc Parallel Architectures - 2009-2010 Directory Operation  Example: load with no sharers 3 [Figure: a processor suffers a load miss and sends a request to the directory; the line is in state “not cached”, so the directory returns the memory value (4), sets the line state to shared (01), and sets the requester’s bit in the sharing vector]

142 CS4/MSc Parallel Architectures - 2009-2010 Directory Operation  Example: load with sharers 4 [Figure: a processor suffers a load miss on a line already cached in shared state elsewhere (here in P0); the directory returns the memory value (4), which is still valid, keeps the state shared, and adds the requester’s bit to the sharing vector]

143 CS4/MSc Parallel Architectures - 2009-2010 Directory Operation  Example: store with sharers 5 [Figure: a processor suffers a store miss on a line shared by other processors; the directory sends invalidations to all sharers, the sharers acknowledge, and the directory then replies to the requester, setting the state to modified (10) with only the requester’s bit in the sharing vector]

144 CS4/MSc Parallel Architectures - 2009-2010 Directory Operation  Example: load with owner 6 [Figure: a processor suffers a load miss on a line held modified by P1 (the owner); the directory forwards the request to the owner, which sends the value (6) to the requester and an acknowledgement plus the value back to the directory; the state becomes shared with both processors in the sharing vector, and memory is updated with the value 6]

145 CS4/MSc Parallel Architectures - 2009-2010 Notes on Directory Operation  On a write with multiple sharers it is necessary to collect and count all the invalidation acknowledgements (ACKs) before actually writing  On transactions that involve more complex state changes the directory must also receive an acknowledgement –In case something goes wrong –To establish the completion of the load or store (Lecture 11)  As with snooping on buses, “the devil is in the details”: we actually need transient states, must deal with conflicting requests, and must handle multi-level caches  As with buses, when buffers overflow we need to introduce NACKs  Directories should work well if only a small number of processors share common data at any given time (otherwise broadcasts are better) 7
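A minimal sketch of the invalidation-and-ACK flow just described, in C; the message helpers are stubs standing in for the real interconnect (assumptions for illustration, not any machine's API):

    #define NPROC 4
    enum dir_state { DIR_NOT_CACHED, DIR_SHARED, DIR_MODIFIED };
    typedef struct { enum dir_state state; unsigned sharers; } dir_entry_t;

    /* Stubs standing in for interconnect messages. */
    static void send_invalidate(int p) { (void)p; }
    static void wait_for_acks(int n)   { (void)n; }
    static void send_reply(int p)      { (void)p; }

    /* Directory side of a store miss under an invalidate protocol. */
    void dir_store_miss(dir_entry_t *e, int requester) {
        int acks = 0;
        for (int p = 0; p < NPROC; p++) {
            if (((e->sharers >> p) & 1u) && p != requester) {
                send_invalidate(p);        /* invalidate every other sharer */
                acks++;
            }
        }
        wait_for_acks(acks);               /* collect all ACKs before replying */
        e->sharers = 1u << requester;      /* requester becomes sole owner */
        e->state   = DIR_MODIFIED;
        send_reply(requester);             /* grant ownership (and data) */
    }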

146 CS4/MSc Parallel Architectures - 2009-2010 Quantitative Motivation for Directories  Number of invalidations per store miss on MSI with infinite caches  Bottom-line: number of sharers for read-write data is small 8 Culler and Singh Fig. 8.9

147 CS4/MSc Parallel Architectures - 2009-2010 Example Implementation Difficulties 9
 Operations have to be serialized locally:
1. P0 sends a read request for line A.
2. P1 sends a read-exclusive request for line A (waits at the directory).
3. The directory responds to (1) and sets the sharing vector (the message gets delayed).
4a/b. The directory responds to (2), to both P0 (sharer, to invalidate) and P1 (new owner).
5. P0 invalidates line A and sends an acknowledgement.
Problem: when (3) finally arrives at P0, the stale value of line A is placed in the cache. Solution: P0 must serialize transactions locally so that it won’t react to (4b) while it has a read pending.
 Operations have to be serialized at the directory:
1. P1 sends a read-exclusive request for line A.
2. The directory forwards the request to P0 (the owner).
3a/b. P0 sends the data to P1 and an acknowledgement to the directory (the ack gets delayed).
4. P1 receives (3a) and considers the read-exclusive complete. A replacement miss sends the updated value back to memory.
Problem: when (4) arrives, the directory accepts it and overwrites memory; when (3b) finally arrives, the directory completes the ownership transfer and thinks that P1 is the owner. Solution: the directory must serialize transactions so that it won’t react to (4) while the ownership transfer is pending.

148 CS4/MSc Parallel Architectures - 2009-2010 Directory Overhead  Problem: consider a system with 128 processors, 256GB of memory, 1MB of L2 cache per processor, and 64-byte cache lines –128 bits for the sharing vector plus 3 bits for state → ~16 bytes –Per line: 16/64 = 0.25 → 25% memory overhead –Total: 0.25*256GB = 64GB of memory overhead!  Solution: Cached Directories –At any given point in time there can be only 128MB/64B = 2M lines actually cached in the whole system –Lines not cached anywhere are implicitly in state “not cached” with a null sharing vector –To maintain only the entries for the actively cached lines we also need to keep the tags → 64 bits = 8 bytes –Overhead per cached line: (8+16)/64 = 0.375 → 37.5% overhead –Total: 24 bytes * 2M cached lines = 48MB of memory overhead 10

149 CS4/MSc Parallel Architectures - 2009-2010 Scalability of Directory Information  Problem: the number of bits in the sharing vector limits the maximum number of processors in the system –Larger machines are not possible once we decide on the size of the vector –Smaller machines waste memory  Solution: Limited Pointer Directories –In practice only a small number of processors share each line at any time –To keep the ID of one of n processors we need log2(n) bits, and to remember m sharers we need m IDs → m*log2(n) bits –For n=128 and m=4 → 4*log2(128) = 28 bits = 3.5 bytes –Total overhead: (3.5/64)*256GB = 14GB of memory –Idea:  Start with the pointer scheme  If more than m processors attempt to share the same line then trap to the OS and let the OS manage longer lists of sharers  Maintain one extra bit per directory entry to identify the current representation 11
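A hedged sketch of such an entry for n=128 and m=4; the field packing is illustrative:

    /* Limited-pointer directory entry: up to m=4 sharers, 7-bit IDs (n=128). */
    typedef struct {
        unsigned char state;      /* not cached / shared / modified */
        unsigned char overflow;   /* 1 -> OS manages a longer sharer list */
        unsigned char nsharers;   /* number of valid pointers (0..4) */
        unsigned char ptr[4];     /* sharer IDs, 7 bits used per entry */
    } lp_entry_t;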

150 CS4/MSc Parallel Architectures - 2009-2010 Distributed Directories  Directories can be used with UMA systems, but are more commonly used with NUMA systems  In this case the directory is actually distributed across the system  These machines are then called cc-NUMA, for cache-coherent NUMA, and DSM, for distributed shared memory 12 [Figure: several nodes, each with a CPU, cache, local memory, and a slice of the directory, connected by an interconnection network]

151 CS4/MSc Parallel Architectures - 2009-2010 Distributed Directories  Now each part of the directory is only responsible for the memory lines of its node  How are memory lines distributed across the nodes? –Lines are mapped per OS page to nodes –Pages are mapped to nodes following their physical address –Mapping of physical pages to nodes is done statically in chunks –E.g., 4 processors with 1GB of memory each and 4KB pages (thus, 256 pages per node)  Node 0 is responsible (home) for pages [0,255]  Node 1 is responsible for pages [256,511]  Node 2 is responsible for pages [512,767]  Node 3 is responsible for pages [768,1023]  Load to address 1478656 goes to page 1478656/4096=361, which goes to node 361/256=1 13
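A minimal C sketch of the home-node computation from the example above (the function name is made up; the constants are the example's):

    /* Home node lookup: 4KB pages and 256 pages (1GB) per node,
       statically mapped in contiguous chunks. */
    #define PAGE_SIZE      4096UL
    #define PAGES_PER_NODE 256UL

    unsigned long home_node(unsigned long paddr) {
        unsigned long page = paddr / PAGE_SIZE;   /* 1478656/4096 = 361 */
        return page / PAGES_PER_NODE;             /* 361/256 = node 1   */
    }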

152 CS4/MSc Parallel Architectures - 2009-2010 Distributed Directories  How is data mapped to nodes? –With a single user, OS can map a virtual page to any physical page→ OS can place data almost anywhere, albeit at the granularity of pages –Common mapping policies:  First-touch: the first processor to request a particular data has the data’s page mapped to its range of physical pages –Good when each processor is the first to touch the data it needs, and other nodes do not access this page often  Round-robin: as data is requested virtual pages are mapped to physical pages in circular order (i.e., node 0, node 1, node 2, … node N, node 0, …) –Good when one processor manipulates most of the data at the beginning of a phase (e.g., initialization of data) –Good when some pages are heavily shared (hot pages)  Note: data that is only private is always mapped locally –Advanced cc-NUMA OS functionality  Mapping of virtual pages to nodes can be changed on-the-fly (page migration)  A virtual page with read-only data can be mapped to physical pages in multiple nodes (page replication) 14

153 CS4/MSc Parallel Architectures - 2009-2010 Combined Coherence Schemes  Use bus-based snooping within nodes and a directory (or bus snooping) across nodes –Bus-based snooping coherence for a small number of processors is relatively straightforward –Hopefully communication across processors within a node will not have to go beyond this domain –Easier to scale the machine size up and down –Two levels of state:  Per-node at the higher level (e.g., a whole node owns modified data, but the directory does not know which processor in the node actually has it)  Per-processor at the lower level (e.g., by snooping inside the node we can find the exact owner and the exact up-to-date value) 15 [Figure: two nodes, each a bus-based SMP (CPUs with caches, main memory, and a directory), connected by a bus or scalable interconnect]

154 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 16  Original directory coherence idea: “A New Solution to Coherence Problems in Multicache Systems”, L. Censier and P. Feautrier, IEEE Trans. on Computers, December 1978  Seminal work on distributed directories: “The DASH Prototype: Implementation and Performance”, D. Lenoski, J. Laudon, T. Joe, D. Nakahira, L. Stevens, A. Gupta, and J. Hennessy, Intl. Symp. on Computer Architecture, June 1992.  A commercial machine with distributed directories: “The SGI Origin: a ccNUMA Highly Scalable Server”, J. Laudon and D. Lenoski, Intl. Symp. on Computer Architecture, June 1997.  A commercial machine with SCI: “STiNG: a CC-NUMA Computer System for the Commercial Marketplace”, T. Lovett and R. Clapp, Intl. Symp. on Computer Architecture, June 1996.  Adaptive full/limited pointer distributed directory protocols: “An Evaluation of Directory Schemes for Cache Coherence”, A. Agarwal, R. Simoni, J. Hennessy, and M. Horowitz, Intl. Symp. on Computer Architecture, June 1988.

155 CS4/MSc Parallel Architectures - 2009-2010 Probing Further 17  Page migration and replication for ccNUMA: “Operating System Support for Improving Data Locality on CC-NUMA Compute Servers”, B. Verghese, S. Devine, A. Gupta, and M. Rosenblum, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.  Cache Only Memory Architectures: “Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures”, P. Stenstrom, T. Joe, and A. Gupta, Intl. Symp. on Computer Architecture, June 1992.  Recent alternative protocols: token, ring: “Token Coherence: Decoupling Performance and Correctness”, M. Martin, M. Hill, and D. Wood, Intl. Symp. on Computer Architecture, June 2003. “Coherence Ordering for Ring-Based Chip Multiprocessors”, M. Marty and M. Hill, Intl. Symp. on Microarchitecture, December 2006.

156 CS4/MSc Parallel Architectures - 2009-2010  Synchronization is necessary to ensure that operations in a parallel program happen in the correct order  Different primitives are used at different levels of abstraction –High-level (e.g., critical sections, monitors, parallel sections and loops, atomics): supported in languages themselves or language extensions (e.g., Java threads, OpenMP) –Middle-level (e.g., semaphores, condition variables, locks, barriers): supported in libraries (e.g., POSIX threads) –Low-level (e.g., compare&swap, test&set, load-link & store-conditional): supported in hardware  Higher-level primitives can be constructed from lower-level ones  Things to consider: deadlock, livelock, starvation 1 Lect. 10: Shared Mem. Multiprocessors IV/V

157 CS4/MSc Parallel Architectures - 2009-2010 Example: Sync. in Java Threads 2  Synchronized Methods –Concurrent calls to the method on the same object have to be serialized –All data modified during one call to the method becomes atomically visible to all calls to other methods of the object –E.g.: –Can be implemented with locks

public class SynchronizedCounter {
    private int c = 0;
    public synchronized void increment() {
        c++;
    }
}

SynchronizedCounter myCounter;

158 CS4/MSc Parallel Architectures - 2009-2010 Example: Sync. in OpenMP 3  Doall loops –Iterations of the loop can be executed concurrently –After the loop, all processors have to wait and a single one continues with the following code –All data modified during the loop is visible after the loop –E.g.: –Can be implemented with a barrier

#pragma omp parallel for \
        private(i,s) shared(A,B) \
        schedule(static)
for (i=0; i < ...)   /* loop bound and body truncated in the source */
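Since the loop body is cut off in the transcript, here is a minimal self-contained doall loop of the same shape; the bound N and the statements inside the loop are assumptions, not the original slide's code:

    #include <omp.h>
    #define N 1024
    double A[N], B[N];

    void doall(void) {
        int i; double s;
        #pragma omp parallel for private(i,s) shared(A,B) schedule(static)
        for (i = 0; i < N; i++) {   /* iterations run concurrently */
            s = A[i] * A[i];        /* assumed body */
            B[i] = s;
        }
        /* implicit barrier: all iterations complete before execution continues */
    }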

159 CS4/MSc Parallel Architectures - 2009-2010 Example: Sync. in POSIX Threads 4  Locks –Only one thread can own the lock at any given time –Unlocking makes all the modified data visible to all threads, and locking forces the thread to obtain fresh copies of all data –E.g.: –Can be implemented with test&set

pthread_mutex_t mylock;
pthread_mutex_init(&mylock, NULL);

pthread_mutex_lock(&mylock);
Count++;
pthread_mutex_unlock(&mylock);

160 CS4/MSc Parallel Architectures - 2009-2010 Example: Building CS from Locks 5  Relatively straightforward  The actual implementation is encapsulated in a library function  In practice, the library may implement different policies on how to wait for a lock and how to avoid starvation

initialization:
    int A, B, C;
    lock_t mylock;

parallel:
Processor 0:                Processor 1:
    lock(&mylock);              lock(&mylock);
    A = …;                      … = A + …;
    B = …;                      … = C + …;
    unlock(&mylock);            unlock(&mylock);

161 CS4/MSc Parallel Architectures - 2009-2010 Example: Building CS from Ld/St? 6  E.g., Peterson’s algorithm  No! This is not a safe way to implement CS in a modern multiprocessor (Lecture 11)

initialization:
    int A, B, C;
    int mylock[2], turn;
    mylock[0]=0; mylock[1]=0; turn = 0;

parallel:
Processor 0:                            Processor 1:
    mylock[0] = 1; turn = 1;                mylock[1] = 1; turn = 0;
    while (mylock[1] && turn==1);           while (mylock[0] && turn==0);
    A = …;                                  … = A + …;
    B = …;                                  … = C + …;
    mylock[0] = 0;                          mylock[1] = 0;

162 CS4/MSc Parallel Architectures - 2009-2010 Hardware Primitives 7  The hardware’s job is to provide atomic memory operations, which involves both the processors and the memory subsystem  Implemented in the IS, but usually encapsulated in library function calls by the manufacturers  At a minimum, hardware must provide an atomic swap  Examples: –Compare&Swap (e.g., Sun Sparc) and Test&Set: if the value in memory is equal to the value in register Ra then swap the memory value with the value in Rb, and return memory’s original value in Ra  Can implement more complex conditions for synchronization  Requires a comparison operation in memory, or must block the memory location until the processor is done with the comparison

CAS (R1),R2,R3    ; if MEM[R1]==R2 then MEM[R1]=R3; old MEM[R1] is returned

163 CS4/MSc Parallel Architectures - 2009-2010 Hardware Primitives 8  Examples: –Fetch&Increment (e.g., Intel x86) (in general Fetch&Op): increment the value in memory and return the old value in a register  Less flexible than Compare&Swap  Requires an arithmetic operation in memory, or must block the memory location (or bus) until the processor is done with the operation (e.g., x86) –Swap: swap the values in memory and in a register  Least flexible of all  Does not require a comparison or arithmetic operation in memory

lock ADD (R1),R2    ; MEM[R1] = MEM[R1]+R2
LODSW               ; accumulator = MEM[DS:SI]
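As an illustration of building Fetch&Op from Compare&Swap, a hedged C sketch using the GCC/Clang __sync builtin (the builtin is real; the surrounding function is illustrative):

    /* Fetch-and-increment built from compare&swap: retry until the
       value read has not changed between the read and the swap. */
    int fetch_and_inc(volatile int *p) {
        int old;
        do {
            old = *p;
        } while (__sync_val_compare_and_swap(p, old, old + 1) != old);
        return old;   /* the old value, as Fetch&Increment requires */
    }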

164 CS4/MSc Parallel Architectures - 2009-2010 Building Locks with Hdw. Primitives 9  Example: Test&Set

int lock(int *mylock) {
    int value;
    value = test&set(mylock, 1);   /* pseudocode: atomic test&set */
    if (value) return FALSE;       /* lock was already taken */
    else return TRUE;              /* lock acquired */
}

void unlock(int *mylock) {
    *mylock = 0;
    return;
}

165 CS4/MSc Parallel Architectures - 2009-2010 What If the Lock is Taken? 10  Spin-wait lock –Each call to lock invokes the hardware primitive, which involves an expensive memory operation and takes up network bandwidth

while (!lock(&mylock));
…
unlock(&mylock);

 Spin-wait on cache: Test-and-Test&Set –Spin on the cached value using normal loads and rely on the coherence protocol –Still, all processors race to memory, and clash, once the lock is released

while (TRUE) {
    if (lock(&mylock)) break;   /* try to acquire */
    while (mylock);             /* spin on cached copy while lock is held */
}
…
unlock(&mylock);

166 CS4/MSc Parallel Architectures - 2009-2010 What If the Lock is Taken? 11  Software solution: Blocking locks and Backoff –The wait can be implemented in the application itself (backoff) or by calling the OS to be put to sleep (blocking) –The waiting time is usually increased exponentially with the number of retries –Similar to the backoff mechanism adopted in the Ethernet protocol

while (TRUE) {
    if (lock(&mylock)) break;   /* acquired */
    wait(time);                 /* back off; time typically grows exponentially */
}
…
unlock(&mylock);

167 CS4/MSc Parallel Architectures - 2009-2010 A Better Hardware Primitive 12  Load-link and Store-conditional –Implement the atomic memory operation as two operations –Load-link (LL):  Registers the intention to acquire the lock  Returns the present value of the lock –Store-conditional (SC):  Only stores the new value if no other processor attempted a store between our previous LL and now  Returns 1 if it succeeds and 0 if it fails –Relies on the coherence mechanism to detect conflicting SCs –All operations are done locally at the cache controllers or directory; there is no need for a complex blocking operation in memory –A new register is added to the L1 to remember the pending LL from the local processor (e.g., the PowerPC RESERVE register) –Also benefits from blocking and backoff –Introduced in the MIPS processor, now also used in PowerPC and ARM

168 CS4/MSc Parallel Architectures - 2009-2010 A Better Hardware Primitive 13  Load-link and Store-conditional operation [Figure: two scenarios over a coherence substrate, with P0 and P1 each holding a RESERVE register next to the L1. (1) P0 performs LL 0xA, the reservation is recorded, and its later SC 0xA succeeds because no other store intervened. (2) P0 and P1 both perform LL 0xA; P1’s SC 0xA succeeds first and, via the coherence substrate, clears P0’s reservation, so P0’s SC fails (returns 0) and P0 must retry with a new LL]

169 CS4/MSc Parallel Architectures - 2009-2010  E.g., spin-wait with attempted swap –At the end, if the SC succeeds, the old value of the lock variable will be in R4 –If the lock is taken then start over again Building Locks with LL/SC 14

try:    OR   R3,R4,R0   ; move value to be exchanged into R3
        LL   R2,0(R1)   ; load current value of the lock
        SC   R3,0(R1)   ; try to store the new value
        BEQZ R3,try     ; branch if the SC failed
        MOV  R4,R2      ; move the old lock value into R4
check:  BNEZ R4,try     ; try again if the lock was taken

170 CS4/MSc Parallel Architectures - 2009-2010 An Alternative Hdw. Approach 15  Locks have a relatively large overhead and, thus, are suitable for guarding relatively large amounts of data  Some algorithms need to exchange only a small number of words each time  Also, consumer thread must wait for all data guarded by a lock to be ready before it can begin work  Better approach for fine-grain synchronization: Full/Empty Bits –Associate one bit with every memory word (1.5% overhead for 64bit words) –Augment the behavior of load/store  Load: if word is empty then trap to OS (to wait) otherwise, return value and set bit to empty  Store: if word is full then trap to OS (to deal with error) otherwise, store the new value, set bit to full, and release any threads pending on the word (with OS help)  Reset: set the bit to empty –Good for producer-consumer type of communication/synchronization

171 CS4/MSc Parallel Architectures - 2009-2010 Example: Using Full/Empty Bits 16  Compare against the example in Slide 5

Processor 0:
    int A, B, C;
    A = …;   // blocks if not yet used
    B = …;   // no impact on P1

Processor 1:
    … = A + …;   // waits if not ready
    … = C + …;   // does not have to wait

172 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 17  A commercial machine with Full/Empty bits: “The Tera Computer System”, R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, Intl. Conf. on Supercomputing, June 1990.  Performance evaluations of synchronization for shared memory: “The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors”, T. Anderson, IEEE Trans. on Parallel and Distributed Systems, January 1990. “Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors”, J. Mellor-Crummey and M. Scott, ACM Trans. on Computer Systems, February 1991.

173 CS4/MSc Parallel Architectures - 2009-2010  Consider the following code: –What are the possible outcomes? 1 Lect. 11: Shared Mem. Multiprocessors V/V

initialization: A=0, B=0, C=0;

P1:           P2:               P3:
    C = 1;        while (A==0);     while (B==0);
    A = 1;        B = 1;            print A;

A==1, C==1? Yes. This is what one would expect.
A==0, C==1? Yes, if the st to B overtakes the st to A on the interconnect toward P3.
A==1, C==0? Yes, if the st to A overtakes the earlier st to C from the same processor.
A==0, C==0? Yes, combining the two effects above.

174 CS4/MSc Parallel Architectures - 2009-2010 Memory Consistency  Cache coherence: –Guarantees eventual write propagation (but gives no guarantee on when writes propagate) –Guarantees a single order of all writes, but only to the same memory location (no guarantee on the order of writes to different locations)  Memory consistency: –Specifies the ordering of loads and stores to different memory locations –Defined in so-called Memory Consistency Models –This is really a “contract” between the hardware, the compiler, and the programmer  i.e., the hardware and compiler will not violate the ordering specified  i.e., the programmer will not assume a stricter order than that of the model –Hardware/compiler provide “safety net” mechanisms so the user can enforce a stricter order than that provided by the model 2

175 CS4/MSc Parallel Architectures - 2009-2010 Sequential Consistency (SC)  Key ideas: –The behavior of a multiprocessor should be the same as that of a time-shared multiprocessor –Thus, memory ordering has to follow the individual order in each thread, and there can be any interleaving of such sequential segments –The memory abstraction is that of a random switch connecting one processor at a time to memory: [Figure: processors P0, P1, …, Pn connected through a single switch to one memory] –Notice that in practice many orderings are still valid 3

176 CS4/MSc Parallel Architectures - 2009-2010 Terminology  Issue: memory operation leaves the processor and becomes visible to the memory subsystem  Performed: memory operation appears to have taken place –Performed w.r.t. processor X: as far as processor X can tell  E.g., a store S by processor Y to variable A is performed w.r.t. processor X if a subsequent load by X to A returns the value of S (or the value of a store later than S, but never a value older than that of S)  E.g., a load L is performed w.r.t. processor X if all subsequent stores by any processor cannot affect the value returned by L to X –Globally performed or complete: performed w.r.t. to all processors  E.g., a store S by processor Y to variable A is globally performed if any subsequent load by any processor to A returns the value of S  X consistent execution: any execution that matches one of the possible total orders (interleavings) as defined by model X 4

177 CS4/MSc Parallel Architectures - 2009-2010 Example: Sequential Consistency  Some valid SC orderings for the earlier code (init A=0, B=0, C=0; P1: C = 1; A = 1; P2: while (A==0); B = 1; P3: while (B==0); print A): 5

Ordering 1:
    P1: st C   # C=1
    P1: st A   # A=1
    P2: ld A   # while
    P2: st B   # B=1
    P3: ld B   # while
    P3: ld A   # print

Ordering 2:
    P1: st C   # C=1
    P2: ld A   # while (spins) …
    P1: st A   # A=1
    P2: ld A   # while
    P2: st B   # B=1
    P3: ld B   # while
    P3: ld A   # print

Ordering 3:
    P1: st C   # C=1
    P2: ld A   # while (spins) …
    P1: st A   # A=1
    P2: ld A   # while
    P3: ld B   # while (spins) …
    P2: st B   # B=1
    P3: ld B   # while
    P3: ld A   # print

178 CS4/MSc Parallel Architectures - 2009-2010 Sequential Consistency (SC)  Sufficient conditions 1. Threads issue memory operations in program order 2. Before issuing the next memory operation, threads wait until the last issued write completes (i.e., performs w.r.t. all other processors) 3. Before issuing the next memory operation, threads wait until the last issued read completes and until the matching write (i.e., the one whose value is returned to the read) also completes  Notes: –Condition 3 is actually quite demanding and is the one that guarantees write atomicity –In practice the necessary conditions may be more relaxed –These conditions are easily violated in real hardware and compilers (e.g., write buffers in hdw. and ld-st scheduling in the compiler) –Program order is defined by the source code (the programmer’s intention) and may differ from the assembly code due to compiler optimizations 6

179 CS4/MSc Parallel Architectures - 2009-2010 Relaxed Memory Consistency Models  At a high level they relax ordering constraints between pairs of reads, writes, and read-write (e.g., reads are allowed to bypass writes, writes are allowed to bypass each other)  In practice there are some implementation artifacts (e.g., no write atomicity in Pentium)  Some models make synchronization explicit and different from normal loads and stores  Many models have been proposed and implemented –Total Store Ordering (TSO) (e.g., Sparc) –Partial Store Ordering (PSO) (e.g., Sparc) –Relaxed Memory Ordering (RMO) (e.g., Sparc) –Processor Consistency (PC) (e.g., Pentium) –Weak Ordering (WO) –Release Consistency (RC) –PowerPC 7

180 CS4/MSc Parallel Architectures - 2009-2010 Relaxed Memory Consistency Models  Note that control flow and data flow dependences within a thread must still be honored regardless of the consistency model –E.g.: 8

A=0, B=0, C=0;
P1: C = 1; A = 1;
P2: while (A==0); B = 1;      # the st to B cannot overtake the ld of A
P3: while (B==0); print A;    # the ld of A cannot overtake the ld of B

A = 1;
…
A = 2;    # the second st to A cannot overtake the earlier st to A
…
B = A;    # the ld of A cannot overtake the earlier st to A

181 CS4/MSc Parallel Architectures - 2009-2010 Example: Total Store Ordering (TSO)  Reads are allowed to bypass writes (can hide write latency)  Similar to PC  Still makes the prior example work as expected, but breaks some intuitive assumptions, including Peterson’s algorithm (Lecture 10) 9

P1: C = 1; A = 1;      P2: while (A==0); B = 1;
(still works as expected under TSO)

P1: A = 1; print B;    P2: B = 1; print A;
SC guarantees that “A==0 and B==0” will never be printed; TSO allows it if the ld of B (P1) overtakes the st to A (P1) and the ld of A (P2) overtakes the st to B (P2)

182 CS4/MSc Parallel Architectures - 2009-2010 Example: Release Consistency (RC)  Reads and writes are allowed to bypass both reads and writes (i.e., any order that satisfies control flow and data flow is allowed)  Assumes explicit synchronization operations: acquire and release (Lecture 10). So, for correct operation, our example must become:

P1: C = 1; Release(A);      P2: while (!Lock(A)); B = 1;

 Constraints –All previous writes must complete before a release can complete –No subsequent reads can complete before a previous acquire completes –All synchronization operations must be sequentially consistent (i.e., follow the rules of Slide 6, where an acquire is equivalent to a read and a release is equivalent to a write) 10

183 CS4/MSc Parallel Architectures - 2009-2010 Example: Release Consistency (RC)  Example: original program order 11 [Figure: a thread executes block 1 (reads/writes), then Acquire, block 2 (reads/writes), Release, then block 3 (reads/writes)]  Allowable overlaps –Reads and writes from block 1 can appear after the acquire (thus, initialization also requires an acquire-release pair) –Reads and writes from block 3 can appear even before the release –Between acquire and release any order is valid within block 2 (and also within blocks 1 and 3)  Note that despite the many reorderings, this still matches our intuition of critical sections

184 CS4/MSc Parallel Architectures - 2009-2010 Races and Proper Synchronization  Races: unsynchronized loads and stores “race each other” through the memory hierarchy (e.g., the loads and stores to A, B, and C in the prior example)  Delay-set Analysis –A technique for identifying the races that require synchronization –Mark all memory references in both threads and create arcs between them:  Directed arcs that follow program order (blue in the original figure)  Undirected arcs that follow cross-thread data dependences (green in the original figure; recall that the print implicitly contains a read) –Cycles following the arcs indicate the problematic memory references 12

P1: A = 1; print B;      P2: B = 1; print A;

185 CS4/MSc Parallel Architectures - 2009-2010 Memory Barriers/Fences  How can one enforce some order of memory accesses? –Ideally use the synchronization primitives, but these can be very costly  Memory Barriers/Fences: –New instructions in the IS, supported in the processor and memory –Specify that previously issued memory operations must complete before the processor is allowed to proceed past the barrier  Write-to-read barrier: all previous writes must complete before the next read can be issued  Write-to-write barrier: all previous writes must complete before the next write can be issued  Full barrier: all previous loads and stores must complete before the next memory operation can be issued  Note: not to be confused with synchronization barriers (Lecture 10)  Note: stricter models can be emulated with such barriers on systems that only support less strict models 13
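For instance, a hedged C11 sketch of the Dekker-style fragment from the TSO slide, using a full fence as the safety net (availability of C11 atomics is assumed; the function names are made up):

    #include <stdatomic.h>

    atomic_int A = 0, B = 0;

    /* Without the fences, each load could bypass the preceding store
       (as under TSO) and both threads could read 0. */
    void p1(void) {
        atomic_store_explicit(&A, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);   /* write-to-read barrier */
        int b = atomic_load_explicit(&B, memory_order_relaxed);
        (void)b;   /* "print B" */
    }

    void p2(void) {
        atomic_store_explicit(&B, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_seq_cst);
        int a = atomic_load_explicit(&A, memory_order_relaxed);
        (void)a;   /* "print A" */
    }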

186 CS4/MSc Parallel Architectures - 2009-2010 Final Notes  Many processors/systems support more than one consistency model, usually set at boot time  It is possible to decouple consistency model presented to programmer from that of the hardware/compiler –E.g., hardware may implement a relaxed model but compiler guarantees SC via memory barriers  It is possible to allow a great degree of reordering with SC through speculative execution in hardware (with rollback when stricter model is violated) (e.g., MIPS R10000/SGI Origin) 14

187 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 15  Original definition of sequential consistency: “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs”, L. Lamport, IEEE Trans. on Computers, September 1979.  Original work on relaxed consistency models: “Correct Memory Operation of Cache-Based Multiprocessors”, C. Scheurich and M. Dubois, Intl. Symp. on Computer Architecture, June 1987. “Weak Ordering: A New Definition”, S. Adve and M. Hill, Intl. Symp. on Computer Architecture, June 1990. “Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors”, K. Gharachorloo, D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, Intl. Symp. on Computer Architecture, June 1990.  A very good tutorial on memory consistency models: “Shared Memory Consistency Models: A Tutorial”, S. Adve and K. Gharachorloo, IEEE Computer, December 1996.

188 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 16  Problems with OO memory consistency models (e.g., Java): “Fixing the Java Memory Model”, W. Pugh, Conf. on Java Grande, June 1999.  Delay set analysis: “Efficient and Correct Execution of Parallel Programs that Share Memory”, D. Shasha and M. Snir, ACM Trans. on Programming Languages and Systems, February 1988.  Compiler support for SC on non-SC hardware: “Analyses and Optimizations for Shared Address Space Programs”, A. Krishnamurthy and K. Yelick, Journal of Parallel and Distributed Computing, February 1996. “Hiding Relaxed Memory Consistency with Compilers”, J. Lee and D. Padua, Intl. Conf. on Parallel Architectures and Compilation Techniques, October 2000.

189 CS4/MSc Parallel Architectures - 2009-2010 Probing Further 17  Transactional Memory: “Transactional Memory: Architectural Support for Lock-Free Data Structures”, M. Herlihy and J. Moss, Intl. Symp. on Computer Architecture, June 1993. “Transactional Memory Coherence and Consistency”, L. Hammond, V. Wong, M. Chen, B. Carlstrom, J. Davis, B. Hertzberg, M. Prabhu, H. Wijaya, C. Kozyrakis, and K. Olukotun, Intl. Symp. on Computer Architecture, June 2004. “Transactional Execution: Toward Reliable, High-Performance Multithreading”, R. Rajwar and J. Goodman, IEEE Micro, November 2003.

190 CS4/MSc Parallel Architectures - 2009-2010 Lect. 12: Multithreading  Memory latencies and even latencies to lower level caches are becoming longer w.r.t. processor cycle times  There are basically 3 ways to hide/tolerate such latencies by overlapping computation with the memory access –Dynamic out-of-order scheduling –Prefetching –Multithreading  OOO execution and prefetching allow overlap of computation and memory access within the same thread (these were covered in CS3 Computer Architecture)  Multithreading allows overlap of memory access of one thread/process with computation by another thread/process 1

191 CS4/MSc Parallel Architectures - 2009-2010 Blocked Multithreading 2  Basic idea: –Recall multi-tasking: on I/O, a process is context-switched out of the processor by the OS –With multithreading, a thread/process is context-switched out of the pipeline by the hardware on longer-latency operations [Figure: two timelines. Multi-tasking: process 1 runs until a system call for I/O, the OS interrupt handler runs, process 2 runs, and on I/O completion the handler switches back to process 1. Multithreading: process 1 runs until a long-latency operation, a hardware context switch starts process 2, and on the next long-latency operation another hardware context switch resumes process 1]

192 CS4/MSc Parallel Architectures - 2009-2010 Blocked Multithreading 3  Basic idea: –Unlike in multi-tasking, context is still kept in the processor and OS is not aware of any changes –Context switch overhead is minimal (usually only a few cycles) –Unlike in multi-tasking, the completion of the long-latency operation does not trigger a context switch (the blocked thread is simply marked as ready) –Usually the long-latency operation is a L1 cache miss, but it can also be others, such as a fp or integer division (which takes 20 to 30 cycles and is unpipelined)  Context of a thread in the processor: –Registers –Program counter –Stack pointer –Other processor status words  Note: the term is commonly (mis)used to mean simply the fact that the system supports multiple threads

193 CS4/MSc Parallel Architectures - 2009-2010 Blocked Multithreading 4  Latency hiding example: [Figure (Culler and Singh Fig. 11.27): four threads A-D take turns in the pipeline; the memory and pipeline latencies of one thread are overlapped with the execution of the others, with short context-switch overheads and the remaining idle (stall) cycles marked]

194 CS4/MSc Parallel Architectures - 2009-2010 Blocked Multithreading 5  Hardware mechanisms: –Keeping multiple contexts and supporting fast switch  One register file per context  One set of special registers (including PC) per context –Flushing instructions from the previous context from the pipeline after a context switch  Note that such squashed instructions add to the context switch overhead  Note that keeping instructions from two different threads in the pipeline increases the complexity of the interlocking mechanism and requires that instructions be tagged with context ID throughout the pipeline –Possibly replicating other microarchitectural structures (e.g., branch prediction tables, load-store queues, non-blocking cache queues)  Employed in the Sun T1 and T2 systems (a.k.a. Niagara)

195 CS4/MSc Parallel Architectures - 2009-2010 Blocked Multithreading 6  Simple analytical performance model: –Parameters:  Number of threads (N): the number of threads supported in the hardware  Busy time (R): time the processor spends computing between context switch points  Switching time (C): time the processor spends on each context switch  Latency (L): time required by the operation that triggers the switch –To completely hide all of L we need enough threads N such that ~N*(R+C) equals L (strictly speaking, (N-1)*R + N*C = L)  Fewer threads mean we can’t hide all of L  More threads are unnecessary –Note: these are only average numbers, and ideally N should be bigger to accommodate variation [Figure: timeline of one thread’s busy time R, switch time C, and latency L, with the latency overlapped by the R and C of the other threads]

196 CS4/MSc Parallel Architectures - 2009-2010 Blocked Multithreading 7  Simple analytical performance model: –The minimum value of N is referred to as the saturation point (N_sat) –Thus, there are two regions of operation:  Before saturation, adding more threads increases processor utilization linearly  After saturation, processor utilization does not improve with more threads, but is limited by the switching overhead –Solving (N-1)*R + N*C = L gives:

N_sat = (R + L) / (R + C)        U_sat = R / (R + C)

–E.g.: for R=40, L=200, and C=10: N_sat = 240/50 = 4.8 (i.e., 5 threads) and U_sat = 40/50 = 0.8 (Culler and Singh Fig. 11.25)
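A tiny C sketch of the model, reproducing the example's numbers:

    #include <stdio.h>

    /* Blocked-multithreading model: saturation point and utilization. */
    int main(void) {
        double R = 40, L = 200, C = 10;     /* the example above */
        double n_sat = (R + L) / (R + C);   /* = 4.8 -> need 5 threads */
        double u_sat = R / (R + C);         /* = 0.8 at saturation */
        printf("N_sat = %.1f, U_sat = %.2f\n", n_sat, u_sat);
        return 0;
    }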

197 CS4/MSc Parallel Architectures - 2009-2010 Fine-grain or Interleaved Multithreading 8  Basic idea: –Instead of waiting for long-latency operation, context switch on every cycle –Threads waiting for a long latency operation are marked not ready and are not considered for execution –With enough threads no two instructions from the same thread are in the pipeline at the same time → no need for pipeline interlock at all  Advantages and disadvantages over blocked multithreading: + No context switch overhead (no pipeline flush) + Better at handling short pipeline latencies/bubbles –Possibly poor single thread performance (each thread only gets the processor once every N cycles) –Requires more threads to completely hide long latencies –Slightly more complex hardware than blocked multithreading  Some machines have taken this idea to the extreme and eliminated caches altogether (e.g., Cray MTA-2, with 128 threads per processor)

198 CS4/MSc Parallel Architectures - 2009-2010 Fine-grain or Interleaved Multithreading 9  Latency hiding example: [Figure: cycle-by-cycle interleaving of Threads A–F, with memory and pipeline latencies hidden by switching every cycle; threads that are still blocked (e.g., A or E) are skipped in the rotation, and remaining gaps are idle (stall) cycles — Culler and Singh Fig. 11.28]

199 CS4/MSc Parallel Architectures - 2009-2010 Fine-grain or Interleaved Multithreading 10  Simple analytical performance model (see Slide 6): –Parameters:  Number of threads (N) and Latency (L)  Busy time (R) is now 1 and switching time (C) is now 0 –To completely hide L we need enough threads that the N-1 other threads cover the latency, i.e., N-1 = L –Again, these are only average values and ideally N should be bigger to accommodate variation –The minimum value of N (i.e., N_sat = L+1) is the saturation point –Again, there are two regions of operation:  Before saturation, adding more threads increases processor utilization linearly  After saturation, processor utilization does not improve with more threads, but is 100% (i.e., U_sat = 1) [Figure: timeline of one thread — a single busy cycle R, then its latency L overlapped with the single cycles of the other threads]
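The same back-of-the-envelope calculation for the interleaved case (a sketch; below saturation each of the N threads gets one cycle in every L+1, so utilization grows linearly until N = L+1):

    #include <stdio.h>

    /* Interleaved multithreading: R = 1, C = 0, so N_sat = L + 1. */
    int main(void) {
        int L = 200;            /* latency of the blocking operation   */
        int n_sat = L + 1;      /* threads needed for 100% utilization */

        for (int N = 50; N <= n_sat; N += 50) {
            double u = (N < n_sat) ? (double)N / n_sat : 1.0;
            printf("N = %3d -> utilization = %.2f\n", N, u);
        }
        printf("N_sat = %d, U_sat = 1.00\n", n_sat);
        return 0;
    }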

200 CS4/MSc Parallel Architectures - 2009-2010 Simultaneous Multithreading (SMT) 11  Basic idea: –Don't actually context switch, but on a superscalar processor fetch and issue instructions from different threads/processes simultaneously –E.g., 4-issue processor  Advantages: + Can handle not only long latencies and pipeline bubbles but also unused issue slots + Full performance in single-thread mode –Most complex hardware of all multithreading schemes [Figure: issue-slot diagrams over time for a 4-issue processor, comparing no multithreading (slots wasted on a cache miss), blocked, interleaved, and SMT]

201 CS4/MSc Parallel Architectures - 2009-2010 Simultaneous Multithreading (SMT) 12  Fetch policies: –Non-multithreaded fetch: only fetch instructions from one thread in each cycle, in a round-robin alternation –Partitioned fetch: divide the total fetch bandwidth equally between some of the available threads (requires more complex fetch unit to fetch from multiple I-cache lines; see Lecture 3) –Priority fetch: fetch more instructions for specific threads (e.g., those not in control speculation, those with the least number of instructions in the issue queue)  Issue policies: –Round-robin: select one ready instruction from each ready thread in turn until all issue slots are full or there are no more ready instructions (note: should remember which thread was the last to have an instruction selected and start from there in the next cycle) –Priority issue:  E.g., threads with older instructions in the issue queue are tried first  E.g., threads in control speculative mode are tried last  E.g., issue all pending branches first
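A minimal software sketch of the round-robin issue policy described above (the per-thread ready counts are made-up; in reality this selection is done by hardware within a single cycle):

    #include <stdio.h>

    #define NUM_THREADS 4
    #define ISSUE_WIDTH 4

    /* Toy model: ready[tid] = number of ready instructions in thread
     * tid's issue queue this cycle (hypothetical values). */
    static int ready[NUM_THREADS] = {3, 0, 2, 1};
    static int start_tid = 0;   /* thread considered first this cycle */

    /* Round-robin: take one ready instruction per ready thread in turn
     * until the slots are full or nothing is ready. */
    static void issue_cycle(void) {
        int slots = ISSUE_WIDTH, last_served = -1, progress = 1;

        while (slots > 0 && progress) {
            progress = 0;
            for (int i = 0; i < NUM_THREADS && slots > 0; i++) {
                int tid = (start_tid + i) % NUM_THREADS;
                if (ready[tid] > 0) {
                    ready[tid]--;
                    slots--;
                    last_served = tid;
                    progress = 1;
                    printf("issue from thread %d\n", tid);
                }
            }
        }
        /* Next cycle starts after the last thread served, as the slide
         * notes, so no thread is starved. */
        if (last_served >= 0)
            start_tid = (last_served + 1) % NUM_THREADS;
    }

    int main(void) {
        issue_cycle();
        return 0;
    }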

202 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 13  Original work on multithreading: “The Tera Computer System”, R. Alverson, D. Callahan, D. Cummings, B. Koblenz, A. Porterfield, and B. Smith, Intl. Conf. on Supercomputing, June 1990. “Performance Tradeoffs in Multithreaded Processors”, A. Agarwal, IEEE Trans. on Parallel and Distributed Systems, September 1992. “Simultaneous Multithreading: Maximizing On-Chip Parallelism”, D. Tullsen, S. Eggers, and H. Levy, Intl. Symp. on Computer Architecture, June 1995. “Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor”, D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, Intl. Symp. on Computer Architecture, June 1996.  Intel’s hyper-threading mechanism: “Hyper-Threading Technology Architecture and Microarchitecture”, D. Marr, F. Binns, D. Hill, G. Hinton, D. Koufaty, J. Miller, and M. Upton, Intel Technology Journal, Q1 2002.

203 CS4/MSc Parallel Architectures - 2009-2010 Lect. 13: Chip-Multiprocessors (CMP)  Main driving forces: –Complexity of design and verification of wider-issue superscalar processors would be unmanageable –Performance gains of either wider issue width or deeper pipelines would be only marginal  Limited ILP in applications  Wire delays and longer access times of larger structures –Power consumption of the large centralized structures necessary in wider-issue superscalar processors would be unmanageable –Increased relative importance of throughput-oriented computing as compared to latency-oriented computing –Continuation of Moore's law so that more transistors fit in a chip 1

204 CS4/MSc Parallel Architectures - 2009-2010 Early (ca. 2006) CMP’s 2  Example: Intel Core Duo –2 cores  3-issue superscalar  12-stage pipeline  2-way simultaneous multithreading (HT)  Up to 2.33GHz  P6 (Pentium M) microarchitecture –2MB shared L2 cache –151M transistors in 65nm technology –Power consumption between 9W and 30W

205 CS4/MSc Parallel Architectures - 2009-2010 Current (ca. 2007) CMP’s 3  Example: Sun T2 –8 cores  Single issue, statically scheduled  8-stage pipeline  8-way multithreading (blocked)  Up to 1.4GHz  UltraSparc V9 IS –4MB shared L2 cache –65nm technology –Power consumption around 72W

206 CS4/MSc Parallel Architectures - 2009-2010 Future CMP's? 4  Example: Intel Polaris (2007) –80 cores  Single issue, statically scheduled  3.2GHz (up to 5GHz) –Scalable, packet-switched interconnect (8x10 mesh) –No shared L2 or L3 cache –No cache coherence –“Tiled” approach  Core + cache + router –Stacked memory technology –Power consumption around 62W  Example: Intel SCC (2010) –48 cores (full IA-32 compatible)

207 CS4/MSc Parallel Architectures - 2009-2010 CMP’s vs. Multi-chip Multiprocessors 5  While conceptually similar to traditional multiprocessors, CMP’s have specific issues: –Off-chip memory bandwidth: number of pins per package does not increase much –On-chip interconnection network: wires and metal layers are a very scarce resource –Shared memory hierarchy: processors must share some lower level cache (e.g., L2 or L3) and the on-chip links between these –Wire delays: actual physical distances to be crossed for communication affect the latency of the communication –Power consumption and heat dissipation: both are much harder to fit within the limitations of a single chip package

208 CS4/MSc Parallel Architectures - 2009-2010 Shared vs. Private L2 Caches 6  Private caches: + Less chance of negative interference between processors + Simpler interconnections –Possibly wasted storage in less loaded parts of the chip –Must enforce coherence across L2's  Shared caches: –More chance of negative interference between processors + Possible positive interference between processors + Better utilization of storage + Single/few threads have access to all resources when cores are idle + No need to enforce coherence across L2 (but still must enforce coherence across L1's) and L2 can act as a coherence point (i.e., directory) –All-to-one interconnect takes up large area and may become a bottleneck  Note: L1 caches are tightly integrated into the pipeline and are an inseparable part of the core

209 CS4/MSc Parallel Architectures - 2009-2010 Shared vs. Private L2 Caches 7  Priority Inversion and Fair Sharing –In uniprocessors and multi-chip multiprocessors: processes with higher priority are given more resources (e.g., more processors, larger scheduling quanta, more memory/caches, etc) → faster execution –In CMP's with shared resources (e.g., L2 caches, off-chip memory bandwidth, issue slots with multithreading)  Dynamic allocation of resources to threads/processes happens without OS knowledge (e.g., LRU replacement policy in caches)  Hardware policies attempt to maximize utilization across the board  Hardware treats all threads/processes equally and threads/processes compete dynamically for resources –Thus, at run time, a lower-priority thread/process may grab a larger share of resources and may execute relatively faster than a higher-priority thread/process –One of the biggest problems is that of fair cache sharing –In more general terms, overall quality of service should be directly proportional to priority

210 CS4/MSc Parallel Architectures - 2009-2010 Shared vs. Private L2 Caches 8  Fair Sharing –Example: interference in the shared L2 causes gzip to suffer 3 to 10 times more L2 misses and to run at as low as half its original speed –The effect of the interference depends on which application is co-scheduled with gzip [Figure from Kim et al.: relative L2 misses and speed of gzip under different co-scheduled applications]

211 CS4/MSc Parallel Architectures - 2009-2010 Shared vs. Private L2 Caches 9  Fair Sharing –Condition for fair sharing: Tshr_1/Tded_1 = Tshr_2/Tded_2 = … = Tshr_n/Tded_n  Where Tded_i is the execution time of thread i when executed alone in the CMP with a dedicated L2 cache and Tshr_i is its execution time when sharing the L2 with the other n-1 threads –To maximize fair sharing, minimize M_ij = |X_i - X_j|, where X_i = Tshr_i/Tded_i –Possible solution: partition caches in different-sized portions either statically or at run time
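As a concrete illustration, a sketch of computing these quantities for a few co-scheduled threads (the execution times are made-up numbers, not measurements from Kim et al.):

    #include <stdio.h>
    #include <math.h>

    /* Fairness metric from the slide: X_i = Tshr_i / Tded_i, and for
     * each pair of threads M_ij = |X_i - X_j|; perfect fairness means
     * all M_ij are zero. */
    int main(void) {
        /* Hypothetical execution times (ms): dedicated vs shared L2. */
        double tded[] = {100.0, 150.0, 80.0};
        double tshr[] = {140.0, 180.0, 160.0};
        int n = 3;

        double x[3];
        for (int i = 0; i < n; i++)
            x[i] = tshr[i] / tded[i];   /* slowdown under sharing */

        for (int i = 0; i < n; i++)
            for (int j = i + 1; j < n; j++)
                printf("M_%d%d = %.2f\n", i, j, fabs(x[i] - x[j]));
        return 0;
    }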

212 CS4/MSc Parallel Architectures - 2009-2010 NUCA L2 Caches 10  On-chip L2 and L3 caches are expected to continue increasing in size (e.g., Core Duo has a 2MB L2 while Core 2 Duo has 4MB)  Such caches are logically divided into a few (2 to 8) logical banks with independent access  Banks are physically divided into small (128KB to 512KB) sub-banks  Thus, future multi-megabyte L2 and L3 caches will likely have 32 or more sub-banks  Increasing wire delays mean that sub-banks closer to a given processor can be accessed more quickly than sub-banks further away  Also, some sub-banks will invariably be close to one processor and far from another, and some sub-banks will be at similar distances from a few processors  Bottom-line: uniform (worst-case) access times will be increasingly inefficient

213 CS4/MSc Parallel Architectures - 2009-2010 NUCA L2 Caches 11  Key ideas: –Allow and exploit the fact that different sub-banks have different access times –Each sub-bank has its own wire set to the cache controller (which does increase overall area) –Either statically or dynamically map and migrate the most heavily used lines to the banks closer to the processor –By tweaking the dynamic mapping and migration mechanisms such NUCA caches can adapt from private to shared caches –Obviously, with such dynamic mapping and migration, searching the cache and performing replacements becomes more expensive  E.g., Sun’s T2 uses a NUCA L2 cache with 8 banks spread across the chip borders, but with static mapping and no migration

214 CS4/MSc Parallel Architectures - 2009-2010 Directory Coherence On-Chip? 12 [Figure: two CC-NUMA nodes (CPU, L2 cache, memory + directory) next to two CMP tiles (CPU, L1 cache, L2 cache slice + directory)]  One-to-One mapping from CC-NUMA?  L2 Cache → L1 Cache  Main memory → L2 Cache  Dir. entry per memory line → Dir. entry per L2 cache line  Mem. lines mapped to physical mem. by first-touch policy at OS page granularity → L2 lines mapped to physical L2 by first-touch policy at OS page level

215 CS4/MSc Parallel Architectures - 2009-2010 Directory Coherence On-Chip 13  The mapping problem (home node)  OS page granularity is too coarse: many lines homed at Px's L2 by the page-level first-touch policy might actually be used by Py, yet still have to be cached at Px (ok for a large memory but not ok for a small L2; it may also lead to imbalance in the mapping)  Line granularity with first-touch needs a hardware/OS mapping of every individual cache line to a physical L2 (too expensive)  Solution: map at line granularity but circularly based on physical address (mem. line 0 maps to L2 #0, mem. line 1 maps to L2 #1, etc)  The problem with this solution is that locality of use is lost!  The eviction problem  Upon eviction of an L2 (mem.) line the corresponding dir. entry is lost and all L1 cached copies must be invalidated (ok for the rare paging case in CC-NUMA, but not ok for a small L2)  Solution: associate dir. entries not with L2 cache lines, but with cached L1 lines (replicated tags and exclusive L1-Home L2)
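The circular (line-interleaved) home mapping described above is trivial to compute in hardware; a sketch (the line size and tile count are example values, not from the slides):

    #include <stdint.h>
    #include <stdio.h>

    #define LINE_SIZE  64   /* bytes per cache line (example value) */
    #define NUM_TILES   8   /* L2 slices / cores (example value)    */

    /* Home L2 slice for a physical address: consecutive lines are
     * interleaved across the tiles. */
    static unsigned home_tile(uint64_t paddr) {
        return (unsigned)((paddr / LINE_SIZE) % NUM_TILES);
    }

    int main(void) {
        for (uint64_t a = 0; a < 4 * LINE_SIZE; a += LINE_SIZE)
            printf("line at 0x%llx -> home L2 #%u\n",
                   (unsigned long long)a, home_tile(a));
        return 0;
    }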

216 CS4/MSc Parallel Architectures - 2009-2010 Exclusivity with Replicated Tags 12  Dir. contains a copy of the L1 tags of lines mapped to the home L2, but the L2 does not have to keep the L1 data itself  Good: lines can be evicted from L2 silently (by exclusivity, they are not cached in any L1) and the Dir. does not change  Bad: the replicated tags (i.e., the Dir. information) grow with the number of L1 caches  E.g., for 8 cores with 32KB L1 with 32B lines (i.e., 1024 lines) and fully associative → 8x1024 = 8,192 entries per Dir.  (In practice, associativity reduces this overhead, and alternatives exist) [Figure: two CMP tiles, each with CPU + L1 and an L2 slice with an attached directory of replicated L1 tags]

217 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 14  Early study of chip-multiprocessors “The Case for a Single-Chip Multiprocessor”, K. Olukotun, B. Nayfeh, L. Hammond, K. Wilson, and K. Chang, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 1996.  More recent study of chip-multiprocessors (throughput- oriented) “Maximizing CMP Throughput with Mediocre Cores”, J. Davis, J. Laudon, and K. Olukotun, Intl. Conf. on Parallel Architecture and Compilation Techniques, September 2005.  First NUCA caches proposal (for uniprocessor) “An Adaptive, Non-uniform Cache Structure for Wire-delay Dominated On- chip Caches”, C. Kim, D. Burger, and S. Keckler, Intl. Conf. on Architectural Support for Programming Languages and Operating Systems, October 2002.

218 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 15  NUCA cache study for CMP “Managing Wire Delay in Large Chip-Multiprocessor Caches”, B. Beckmann and D. Wood, Intl. Symp. on Microarchitecture, December 2004.  Recent fair cache sharing studies “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture”, S. Kim, D. Chandra, and Y. Solihin, Intl. Conf. on Parallel Architecture and Compilation Techniques, October 2004. “CQoS: A Framework for Enabling QoS in Shared Caches of CMP Platforms”, R. Iyer, Intl. Conf. on Supercomputing, June 2004.  Other recent studies on priorities and quality of service in CMP/SMT “Symbiotic Job-Scheduling with Priorities for Simultaneous Multithreading Processors”, A. Snavely, D. Tullsen, and G. Voelker, Intl. Conf. on Measurement and Modeling of Computer Systems, June 2002.

219 CS4/MSc Parallel Architectures - 2009-2010 Lect. 14: Interconnection Networks  Communication networks (e.g., LANs and WANs) –Must follow industry standards –Must support many different types of packets –Many features, such as reliability, are handled by upper software layers –Currently based on buses (e.g., Ethernet LAN) and optic fiber (WAN) –Latency is high and bandwidth is low –Topologies are highly irregular (e.g., Internet)  Multiprocessor interconnects –Custom made and proprietary –Must only support a few (3 to 4) different types of packets –Most features handled in hardware –Many different topologies and technologies are commonly used –Latency is low and bandwidth is high –Topologies are very regular 1

220 CS4/MSc Parallel Architectures - 2009-2010 Interconnection Networks 2  General organization –Network controller (NC): links the processor (host) to the network –Switches (SW): links different parts of the network internally –Note: SW's may not be present at all in some topologies [Figure: three hosts, each a CPU + cache + memory attached to its own NC; the NCs connect to a central SW]

221 CS4/MSc Parallel Architectures - 2009-2010 Interconnection Networks 3  Characterizing Interconnects –Topology: the “shape” or structure of the interconnect (e.g., buses, meshes, hypercubes, butterflies, etc)  Direct networks: each host+NC connects directly to other hosts+NCs  Indirect networks: hosts+NCs connect to a subset of the switches, which are then the entry points to the network and are themselves connected to other internal switches –Routing algorithm: the rules and mechanisms for routing messages  Dynamic: route from a given A to B may change at different times  Static: route from a given A to B is fixed –Switching strategy: how exchange of messages is set up  Circuit switching: route and connection from source to destination is established and fixed before communication (e.g., like telephone calls)  Packet switching: each part of the communication (packet) is handled separately –Flow control mechanism: how traffic flow under conflict and/or congestion is handled

222 CS4/MSc Parallel Architectures - 2009-2010 Interconnection Networks 4  Terminology: –Link: the physical connection between two hosts/switches –Channel: a logical connection between two hosts/switches that are connected with a link (multiple channels may be multiplexed into a single link) –Degree of a switch: the number of input/output channels –Simplex channel: communication can only happen in one direction Duplex channel: communication can happen in both directions –Phit: the smallest physical unit of data that can be transferred in a unit of time over a link –Flit: the smallest unit of data that can be exchanged between two hosts/switches (1 flit ≥ 1 phit) –Hop: each step between two adjacent hosts/switches –Permutations: a combination of pairs of hosts that can communicate simultaneously

223 CS4/MSc Parallel Architectures - 2009-2010 Interconnection Networks 5  Important properties: –Degree or radix: the smallest number of hosts/switches that any given host/switch can connect directly to –Diameter: the longest distance between any two hosts (in number of hops) –Bisection: a collection of links that, if removed, would divide (disconnect) the network into two equal-size parts Bisection width: the minimum number of links across all bisections Bisection bandwidth: the minimum bandwidth across the bisections –Total bandwidth: the maximum communication bandwidth that can be attained –Cost: usually given as a function of the total number of links, switches, and network controllers –Scalability: how a given property scales with the increase in the number of hosts (e.g., bandwidth, cost, diameter, etc)  Usually given in terms of O() (e.g., O(1) is constant and O(N) is linear) –Fault tolerance: whether communication between any two nodes is still possible after failure of some links

224 CS4/MSc Parallel Architectures - 2009-2010 Topologies  Buses: –Degree: N-1 (i.e., fully connected) –Diameter: 1 –Bisection width: 1 –Total bandwidth: O(1) –Cost: O(N) –Permutations: single pair, broadcast (one-to-all), multicast (one-to-many) 6 [Figure: CPUs and main memory attached to a single shared bus]

225 CS4/MSc Parallel Architectures - 2009-2010 Topologies  Crossbar: –Degree: N-1 (i.e., fully connected) –Diameter: 2 (sometimes also said to be 1) –Bisection width: N –Total bandwidth: O(N) –Cost: O(N^2) –Permutations: single-pair, any pair-wise permutation 7 [Figure: 4x4 crossbar with CPUs 1–4 on the rows, CPUs/memories 1–4 on the columns, and a switch at every crosspoint]

226 CS4/MSc Parallel Architectures - 2009-2010 Topologies  Bidirectional Ring (drawn as a circle, or folded so that all wires have the same length): –Degree: 2 –Diameter: N/2 –Bisection width: 2 –Total bandwidth: O(N) (e.g., all nodes communicate with their neighbors) –Cost: O(N) –Permutations: single-pair, neighbor 8 [Figure: nodes (CPU + memory) connected in a ring]

227 CS4/MSc Parallel Architectures - 2009-2010 Topologies  2-D Mesh: –Degree: 2 (maximum is 4 at internal nodes) –Diameter: 2*(k-1) (k is the number of nodes per row/column, i.e., N^(1/2)) –Bisection width: k –Total bandwidth: O(N) –Cost: O(N) –Permutations: single-pair, neighbor 9 [Figure: 2-D mesh of nodes (CPU + memory)]

228 CS4/MSc Parallel Architectures - 2009-2010 Topologies  2-D Torus: –Degree: 4 –Diameter: k –Bisection width: 2*k –Total bandwidth: O(N) –Cost: O(N) –Permutations: single-pair, neighbor 10 [Figure: 2-D torus — a mesh with wrap-around links]

229 CS4/MSc Parallel Architectures - 2009-2010 Topologies  4-D Cube (hypercube): –Degree: 4 –Diameter: 4 –Bisection width: 8 –Total bandwidth: O(N) –Cost: O(N) –Permutations: single-pair, neighbor 11 [Figure: 4-D hypercube of nodes (CPU + memory)]
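The hypercube numbers above generalize: a d-dimensional cube with N = 2^d nodes has degree d, diameter d, and bisection width 2^(d-1) (a quick tabulation, not from the slides; the 4-D cube is the d = 4 row):

    #include <stdio.h>

    /* d-dimensional hypercube: N = 2^d, degree = diameter = d,
     * bisection width = 2^(d-1). */
    int main(void) {
        for (int d = 1; d <= 6; d++) {
            long n = 1L << d;
            long bisection = 1L << (d - 1);
            printf("d=%d: N=%ld degree=%d diameter=%d bisection=%ld\n",
                   d, n, d, d, bisection);
        }
        return 0;
    }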

230 CS4/MSc Parallel Architectures - 2009-2010 Topologies  Binary Tree (also drawn in an H-tree configuration): –Degree: 1 for hosts and 3 for switches –Diameter: 2*log2(N) –Bisection width: 1 –Total bandwidth: O(N) –Cost: O(N) –Permutations: single-pair, neighbor –Note: “fat” tree → width of links increases as we go toward the root 12 [Figure: binary tree with a switch at the root and at intermediate nodes and hosts (CPU + memory) at the leaves; also shown in H-tree layout]

231 CS4/MSc Parallel Architectures - 2009-2010 Topologies  Switched network: –Degree: 1 for hosts and 2 for switches –Diameter: log2(N) –Bisection width: N/2 –Total bandwidth: O(N) –Cost: O(N*log2(N)) –Permutations: depends on the actual topology 13 [Figure: multistage network connecting CPUs 1–4 on one side to CPUs 1–4 on the other through stages of 2x2 switches]

232 CS4/MSc Parallel Architectures - 2009-2010 Topologies  Switched network: e.g., Omega network 14 [Figure: 8x8 Omega network built from 2x2 switches connected in a perfect-shuffle pattern]

233 CS4/MSc Parallel Architectures - 2009-2010 Routing 15  Example: mesh and d-dimension cubes –Hosts are numbered as in a matrix –To avoid deadlock use dimension-ordered routing (a.k.a. X-Y routing in 2D)  Follow all the steps necessary in one dimension before changing dimensions  Always choose dimensions in the same order –E.g. (resolving the row dimension first): from (1,1) to (3,3) the route is (1,1)→(2,1)→(3,1)→(3,2)→(3,3), and from (3,3) to (1,1) it is (3,3)→(2,3)→(1,3)→(1,2)→(1,1) [Figure: 4x4 mesh with nodes labeled (0,0) through (3,3), showing both routes]
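A sketch of dimension-ordered (X-Y) routing on a 2-D mesh, assuming the row dimension is always resolved first (the coordinate convention is the one used in the example above):

    #include <stdio.h>

    /* Dimension-ordered routing: move along the first dimension until
     * it matches the destination, then along the second. Because every
     * packet orders the dimensions the same way, cyclic channel
     * dependencies (and hence deadlock) cannot form. */
    static void route(int r, int c, int dr, int dc) {
        printf("(%d,%d)", r, c);
        while (r != dr) {                 /* first dimension: rows     */
            r += (dr > r) ? 1 : -1;
            printf(" -> (%d,%d)", r, c);
        }
        while (c != dc) {                 /* second dimension: columns */
            c += (dc > c) ? 1 : -1;
            printf(" -> (%d,%d)", r, c);
        }
        printf("\n");
    }

    int main(void) {
        route(1, 1, 3, 3);   /* slide's example, forward */
        route(3, 3, 1, 1);   /* slide's example, reverse */
        return 0;
    }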

234 CS4/MSc Parallel Architectures - 2009-2010 Routing 16  Example: Omega network –Hosts are numbered linearly in binary (log2 N bits are required) –The routing function is given by F=S XOR D, where S and D are the binary numbers of the source and destination hosts, respectively –At each level of the network, use the corresponding bit of the routing function to go:  Straight, if the bit is 0  Across, if the bit is 1 –Assign numbers to hosts appropriately (easy for Omega, but more complex for other networks) –E.g., from 010 to 011: F=001 → straight, straight, across; and from 100 to 111: F=011 → straight, across, across [Figure: 8x8 Omega network with hosts 000–111 on both sides, showing both routes]
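The per-stage decisions follow directly from the XOR; a sketch for an 8-host (3-stage) network, reproducing the two routes in the example above:

    #include <stdio.h>

    #define STAGES 3   /* log2 of the number of hosts (8 here) */

    /* Omega routing: F = S XOR D; stage i uses bit i of F, from the
     * most significant bit, to go straight (0) or across (1). */
    static void omega_route(unsigned s, unsigned d) {
        unsigned f = s ^ d;
        printf("from %u to %u (F=", s, d);
        for (int i = STAGES - 1; i >= 0; i--)
            putchar('0' + ((f >> i) & 1));
        printf("): ");
        for (int i = STAGES - 1; i >= 0; i--)
            printf("%s%s", ((f >> i) & 1) ? "across" : "straight",
                   i ? ", " : "\n");
    }

    int main(void) {
        omega_route(2, 3);   /* 010 -> 011: straight, straight, across */
        omega_route(4, 7);   /* 100 -> 111: straight, across, across   */
        return 0;
    }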

235 CS4/MSc Parallel Architectures - 2009-2010 Packet Switching 17  Store-and-forward –Enough space must be pre-allocated in destination router’s buffers for the complete packet –Router must wait until the complete packet is received before it can initiate forwarding it  Cut-through –Enough space must be pre-allocated in destination router’s buffer for the complete packet –Router may initiate forwarding parts of the packet as soon as they arrive  Wormhole –Packets are divided in small pieces called flow units (flits) –Only header flit contains address of destination and is responsible for setting up the route (trailing flits simply follow the header) –No need to allocate enough buffer space for entire packet (packet spreads through multiple routers and links like a “worm”) –May lead to deadlock
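A first-order, contention-free latency comparison of these strategies (a standard textbook model, not from the slides: P = packet size, F = flit/header size, B = link bandwidth, h = hops; switch and routing delays are ignored):

    #include <stdio.h>

    /* Store-and-forward: each of the h routers receives the whole
     * packet before forwarding it       -> T = h * P/B
     * Cut-through / wormhole: only the header is delayed per hop
     *                                    -> T = (h-1) * F/B + P/B  */
    int main(void) {
        double P = 1024.0;  /* bytes per packet (example value) */
        double F = 8.0;     /* bytes per flit (example value)   */
        double B = 1.0;     /* bytes per cycle (example value)  */
        int    h = 5;       /* hops from source to destination  */

        printf("store-and-forward:    %.0f cycles\n", h * (P / B));
        printf("cut-through/wormhole: %.0f cycles\n",
               (h - 1) * (F / B) + P / B);
        return 0;
    }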

236 CS4/MSc Parallel Architectures - 2009-2010 References and Further Reading 18  Recent books on multiprocessor interconnects “Principles and Practice of Interconnection Networks”, W. Dally and B. Towles, Morgan Kaufmann, 2003. “Interconnection Networks”, J. Duato, Morgan Kaufmann, 2002.

