© V. De Florio, KU Leuven, 2002. Computer Architectures, part 2.3. Presentation transcript.

1 Course contents
Basic Concepts
→ Computer Design
Computer Architectures for AI
Computer Architectures in Practice

2 Computer Design
Quantitative assessments
Instruction sets
Pipelining
→ Parallelism

3 Parallelism
Introduction to parallel processing
Instruction level parallelism
(Data level parallelism) → Part 3
(Task level parallelism) → Part 3

4 Parallelism
Introduction to parallel processing
- Basic concepts: granularity, program, process, thread, language aspects
- Types of parallelism
Instruction level parallelism

5 Parallelism
Introduction to parallel processing
- Basic concepts: granularity, program, process, thread
- Types of parallelism
Instruction level parallelism

6 Granularity
Definition:
- granularity is the complexity/grain size of some item
- e.g. computation item (instruction), data item (scalar, array, struct), communication item (token granularity), hardware building block (gate, RTL component)
[Figure: granularity scale, low to high: RISC (e.g. add r1,r2,r4), CISC (e.g. ld *a0++,r1), high-level languages (e.g. x = sin(y)), application-specific (e.g. edge-detect/invert image)]

7 Granularity
Deciding the granularity is an important design choice.
E.g. grain size for the communication tokens in a parallel computer:
- coarse grain: less communication overhead
- fine grain: less time penalty when two communication packets compete for transmission over the same channel and collide

8 Program
What do we mean by "program"?
- to a programmer, a program is an ordered set of instructions
- to an operating system, a program is an executable file stored on disk

9 Process
Definition: a process is a quantum of work (e.g. a job) dealt with as an entity.
Resources are allocated and scheduled on a per-process basis.

10 Process
An analogy with a car-repair garage:
- the dispatcher schedules the jobs, thereby allocating the resources for each started job (mechanic, hydraulic car lift, …), with the goal of optimizing throughput by keeping all mechanics busy while still satisfying all other constraints (limited number of hydraulic car lifts, restricted capabilities of the mechanics, …)
- the OS schedules the processes, thereby allocating the resources for each process (CPU, memory, peripherals, …), with the goal of optimizing throughput by keeping all CPUs busy while still satisfying all other constraints (limited number of peripherals, restricted capabilities of some CPUs, …)
- when an owner needs his/her car back before a certain time, the job scheduling needs to take this real-time deadline into account as a constraint → processes with RT deadlines

11 Process
Life cycle of a process:
- process creation
  - fill out the Process Control Block: process ID; owner; process status (ready to run, running, suspended); allocated address space (cf. MMU); required resources; scheduling data (priorities, deadlines, …)
  - allocate the address space
  - load the program into the address space
  - pass the process description to the scheduler
- process execution
- process termination
  - remove the process from the scheduler list
  - free the address space
  - free the Process Control Block
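
A minimal C sketch of what such a Process Control Block might contain (field names and types are illustrative assumptions, not from the slides):

    #include <stddef.h>

    enum proc_status { READY_TO_RUN, RUNNING, SUSPENDED };

    struct pcb {
        int              pid;          /* process ID                      */
        int              owner;        /* owning user                     */
        enum proc_status status;       /* ready to run/running/suspended  */
        void            *addr_space;   /* allocated address space (MMU)   */
        size_t           addr_size;
        int              priority;     /* scheduling data                 */
        long             deadline_ms;  /* RT deadline, if any             */
    };

Creation fills such a record in, loads the program into the allocated space, and hands the record to the scheduler; termination frees it again.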

12 Thread
Definition: a thread is a part of a process that can be autonomously scheduled.
All threads of the same process share the same address space.
Multiple threads of the same process can run concurrently.
Threads were introduced to exploit the parallelism available within a process.
The grain size for scheduling is thus reduced, leading to more available parallelism, without increasing allocation overhead: resource allocation is done at the process level, scheduling at the thread level.
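
A small POSIX-threads sketch in C (an assumed illustration, not from the slides) showing two threads of one process scheduled independently while sharing one address space:

    #include <pthread.h>
    #include <stdio.h>

    static int counter = 0;                      /* shared: one address space */
    static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg)
    {
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&m);              /* sharing requires mutual   */
            counter++;                           /* exclusion                 */
            pthread_mutex_unlock(&m);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, worker, NULL); /* scheduled per thread      */
        pthread_create(&t2, NULL, worker, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("%d\n", counter);                 /* 2000: same memory seen    */
        return 0;
    }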

13 Parallelism
Introduction to parallel processing
- Basic concepts: granularity, program, process, thread
- Types of parallelism
Instruction level parallelism

14 Types of parallelism
Functional parallelism (important for the exam!)
- Different computations have to be performed on the same or different data
- E.g. multiple users submit jobs to the same computer, or a single user submits multiple jobs to the same computer
  → this is functional parallelism at the process level
  → taken care of at run time by the OS

15 Types of parallelism
Functional parallelism
- E.g. equalization, reverberation, surround sound on CD audio samples
  → this is functional parallelism at the thread level
  → taken care of at design time by the designer tools (worst-case analysis) or at run time (actual case) by the OS
- worst-case analysis with no run-time decisions is best when application behavior is manifest, i.e. control flow is data independent
- actual-case analysis with run-time decisions is best when application behavior is non-manifest, i.e. control flow is data dependent

16 Types of parallelism
Functional parallelism
- E.g. multiply-accumulate samples to implement a FIR filter
  → this is functional parallelism at the instruction level
  → taken care of at design time by the designer/compiler (worst-case analysis) or at run time by the instruction dispatch hardware (actual case)
- worst-case analysis with no run-time decisions is best when instruction behavior is manifest, i.e. control flow is data independent (e.g. data-independent loops)
- actual-case analysis with run-time decisions is best when application behavior is non-manifest, i.e. control flow is data dependent (e.g. data-dependent loops)

17 Types of parallelism
Data parallelism (important for the exam!)
- The same computations have to be performed on a whole set of data
- E.g. 2D convolution of an image
- This is data parallelism at the loop level: consecutive loop iterations are candidates for parallel execution, subject to inter-iteration data dependencies
- Often leads to a massive amount of parallelism
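
As a concrete sketch (illustrative C, not from the slides), the 2D convolution mentioned above is the textbook data-parallel loop nest: the same body runs on different pixels, and distinct (y, x) iterations are independent of each other:

    /* 3x3 convolution of an h-by-w image; border pixels left untouched. */
    void conv2d(const float *in, float *out, int w, int h, const float k[3][3])
    {
        for (int y = 1; y < h - 1; y++)        /* iterations independent...      */
            for (int x = 1; x < w - 1; x++) {  /* ...so they may run in parallel */
                float acc = 0.0f;
                for (int dy = -1; dy <= 1; dy++)
                    for (int dx = -1; dx <= 1; dx++)
                        acc += k[dy + 1][dx + 1] * in[(y + dy) * w + (x + dx)];
                out[y * w + x] = acc;
            }
    }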

18 Levels of parallelism
Instruction level parallel (ILP)
- functional parallelism at the instruction level
- example: pipelining
Data level parallel (DLP)
- data parallelism at the loop level
Process & thread level parallel (TLP)
- functional parallelism at the thread and process level

19 Parallelism
Introduction to parallel processing
Instruction level parallelism
- Introduction
- VLIW
- Advanced pipelining techniques
- Super scalar


21 Types of Instruction Level Parallelism utilization
Sequential instruction issuing, sequential instruction execution → von Neumann processors
[Figure: one instruction word feeding a single EU]

22 Types of Instruction Level Parallelism utilization
Sequential instruction issuing, parallel instruction execution → pipelined processors
[Figure: one instruction word feeding EU 1 to EU 4 working in parallel]

23 Types of Instruction Level Parallelism utilization
Parallel instruction issuing, determined at compile time by the compiler; parallel instruction execution → VLIW processors: Very Long Instruction Word
[Figure: one very long instruction word with one slot per EU, feeding EU 1 to EU 4]

24 Types of Instruction Level Parallelism utilization
Parallel instruction issuing, determined at run time by the HW dispatch unit; parallel instruction execution → super-scalar processors (to be seen later)
[Figure: an instruction window feeding EU 1 to EU 4]

25 Types of Instruction Level Parallelism utilization
Most processors provide sequential execution semantics:
- regardless of how the processor actually executes the instructions (sequentially or in parallel, in order or out of order), the result is the same as sequential execution in the order they were written
VLIW and IA-64 provide parallel execution semantics:
- the assembly code indicates explicitly which instructions are executed in parallel

26 Parallelism
Introduction to parallel processing
Instruction level parallelism
- Introduction
→ VLIW
- Advanced pipelining techniques
- Super scalar

27 VLIW
[Block diagram: main instruction memory (128 bit) → instruction cache (128 bit) → instruction register → decoders (256 decoded bits each) → EUs; register file (32 bit each; 8 read ports, 4 write ports) ↔ EUs; cache/RAM (32 bit each; 2 read ports, 1 write port) ↔ main data memory (32 bit; 1 bidirectional port)]

28 VLIW
Properties
- Multiple execution units: multiple instructions issued in one clock cycle
- Every EU requires 2 operands and delivers one result every clock cycle: high data memory bandwidth needed
- Careful design of the data memory hierarchy:
  - register file with many ports
  - large register file: 64-256 registers
  - carefully balanced cache/RAM hierarchy with a decreasing number of ports and increasing memory size and access time for the higher levels (IMEC research: DTSE)

29 VLIW
Properties
- Every EU requires 2 operands and delivers one result every clock cycle: high data memory bandwidth needed
- The application should be optimized to fit into the data memory hierarchy (IMEC research: DTSE), for speed and power reasons:
  - lower-level memories are smaller: word lines and bit lines are shorter, and hence their capacitance is smaller
  - lower-level memories are faster
  - accessing a lower-level memory consumes less power

30 VLIW
Properties
- Every EU requires an instruction every clock cycle
- Memory is wasted when not enough parallel instructions are available to keep all EUs concurrently busy: many instruction slots will contain a NOP
- Solution: dynamically varying instruction size, i.e. only non-NOP slots are present; an 8-slot instruction word can then contain, for example, 2 instruction words, one with 2 non-NOP slots and another with 6 non-NOP slots

31 VLIW
Properties
- The compiler should determine which instructions can be issued in a single cycle without control dependency conflicts or data dependency conflicts
- Deterministic utilization of parallelism: good for hard real-time
- Compile-time analysis of the source code: worst-case analysis instead of actual case
- Very sophisticated compilers, especially when the EUs are pipelined! They perform well since the early 2000s

32 VLIW
Properties
- The compiler should determine which instructions can be issued in a single cycle without control dependency conflicts or data dependency conflicts
- Very difficult to write assembly: the programmer should resolve
  - all control flow conflicts
  - all data flow conflicts
  - all pipelining conflicts
  and at the same time fit data accesses into the available data memory bandwidth and all program accesses into the available program memory bandwidth; e.g. 2 weeks for a sum-of-products (3 lines of C code)
- All high-end DSP processors since 1999 are VLIW processors (examples: Philips Trimedia, high-end TV; TI TMS320C6x, GSM base stations and ISP modem arrays)

33 Low power DSP
[Block diagram as on slide 27: main instruction memory (128 bit) → instruction cache (128 bit) → instruction register → decoders (256 decoded bits each) → EUs; register file (32 bit each; 8 read ports, 4 write ports); cache/RAM (32 bit each; 2 read ports, 1 write port)]
Too much power is dissipated in fetching wide instructions.

34 Low power DSP
[Block diagram: the main instruction memory and instruction cache are now 24 bit wide; an instruction-expansion stage widens each compact instruction to the 128-bit instruction register, after which decoding proceeds as before]
E.g. ADD4 is expanded into ADD || ADD || ADD || ADD

35 Low power DSP
Properties
- Power consumption in the program memory is reduced by specializing the instructions for the application
- Not all combinations of all instructions for the EUs are possible, only a limited set: those combinations that lead to a substantial speed-up of the application
- Those relevant combinations are represented by the smallest possible number of bits, to reduce program memory width and hence program memory power consumption
- Can only be done for embedded DSP applications: the processor is specialized for one application (examples: TI TMS320C54x, GSM mobile phones; TI TMS320C55x, UMTS mobile phones)

36 Low power DSP for interactive multimedia
[Block diagram as on slide 34, but with a reconfigurable instruction-expansion stage and reconfigurable EUs (REU)]
Run-time reconfiguration makes it possible to adapt the specialization to changing application requirements.

37 Low power DSP for interactive multimedia
Properties of interactive multimedia:
- every object has its own decoder algorithm
- scalable decoders
- variable number of objects
- interaction with the user
- decoder algorithm downloaded together with the data: SAOL/SASL, Java
- decoders last longer than the standards
- low power consumption, high performance
Properties of the VLIW architecture:
- power consumption in the program memory is reduced by specializing the instructions for the current application instance (research at ACCA)
- the specialization of the EUs is adapted to the application instance

38 Parallelism
Introduction to parallel processing
Instruction level parallelism
- Introduction
- VLIW
→ Advanced pipelining techniques
- Super scalar

39 Advanced Pipelining
Pipeline CPI is the result of many components.
A number of techniques act on one or more of these components (to be seen later):
- loop unrolling
- scoreboarding
- dynamic branch prediction
- speculation
- …
CPU time(p) = IC(p) × CPI(p) / clock rate
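
To make the formula concrete (numbers chosen for illustration, not from the slides): a program with IC = 10^9 instructions and an average CPI of 1.5 on a 1 GHz clock needs 10^9 × 1.5 / 10^9 Hz = 1.5 s of CPU time; reducing the pipeline CPI to 1.2 cuts this to 1.2 s, which is why the techniques below all attack the CPI components.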

40 Advanced Pipelining
Until now, instruction-level parallelism was searched for within the boundaries of a basic block (BB).
A BB is 6-7 instructions on average → too small to reach the expected performance.
What is worse, there is a big chance that these instructions have dependencies → even less performance can be expected.

41 Advanced Pipelining
To obtain more, we need to go beyond the BB limitation: we must exploit ILP across multiple BBs.
Simplest way: loop level parallelism (LLP), i.e. exploiting the parallelism among the iterations of a loop.
Converting LLP into ILP:
- loop unrolling
  - statically (compiler-based)
  - dynamically (HW-based)
- using vector instructions
  - does not require the LLP → ILP conversion

42 Advanced Pipelining
The efficiency of the conversion depends
- on the amount of ILP available
- on the latencies of the functional units in the pipeline
- on the ability to avoid pipeline stalls by separating dependent instructions by a "distance" (in terms of stages) equal to the latency peculiar to the source instruction:
    LW    x, …
    INSTR …, x
  a load must not be followed by the immediate use of the load destination register

43 Advanced Pipelining: Loop unrolling
Assumptions and steps
1. We assume the following latencies (as used in the traces below): one stall cycle between a load and the use of its result; two stall cycles between an FP ALU operation and the store of its result; one delay cycle after a branch.

44 Advanced Pipelining: Loop unrolling
2. We assume to work with a simple loop such as
    for (I=1; I<=1000; I++)
        x[I] = x[I] + s;
Note: each iteration is independent of the others → a very simple case.

45 Advanced Pipelining: Loop unrolling
3. Translated into DLX, this simple loop looks like this:
    ; assumptions: R1 = &x[1000]
    ;              F2 = s
    Loop: LD    F0, 0(R1)   ; F0 = x[I]
          ADDD  F4, F0, F2  ; F4 = F0 + s
          SD    0(R1), F4   ; store result
          SUBI  R1, R1, #8  ; R1 = R1 - 8 (one double word per element)
          BNEZ  R1, Loop    ; if (R1) goto Loop

46 Advanced Pipelining: Loop unrolling
4. Tracing the loop (no scheduling!):
    Loop: LD    F0, 0(R1)   ; cycle 1
          (stall)           ; cycle 2
          ADDD  F4, F0, F2  ; cycle 3
          (stall)           ; cycle 4
          (stall)           ; cycle 5
          SD    0(R1), F4   ; cycle 6
          SUBI  R1, R1, #8  ; cycle 7
          BNEZ  R1, Loop    ; cycle 8
          (stall)           ; cycle 9
9 clock cycles per iteration, with 4 stalls.

47 Advanced Pipelining: Loop unrolling
5. With scheduling, we move from
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
to
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SUBI  R1, R1, #8
          BNEZ  R1, Loop
          SD    8(R1), F4   ; offset adjusted: R1 was already decremented
whose trace shows that fewer cycles are wasted:

48 Advanced Pipelining: Loop unrolling
6. Tracing the loop (with scheduling!):
    Loop: LD    F0, 0(R1)   ; cycle 1
          (stall)           ; cycle 2
          ADDD  F4, F0, F2  ; cycle 3
          SUBI  R1, R1, #8  ; cycle 4
          BNEZ  R1, Loop    ; cycle 5
          SD    8(R1), F4   ; cycle 6
6 clock cycles per iteration, with 1 stall: 3 stalls less!
Still, only 3 of the cycles do useful work. How to gain more?

49 Advanced Pipelining: Loop unrolling
7. With loop unrolling: replicating the body of the loop multiple times
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4      ; skip SUBI and BNEZ
          LD    F6, -8(R1)     ; F6 vs. F0
          ADDD  F8, F6, F2     ; F8 vs. F4
          SD    -8(R1), F8     ; skip SUBI and BNEZ
          LD    F10, -16(R1)   ; F10 vs. F0
          ADDD  F12, F10, F2   ; F12 vs. F4
          SD    -16(R1), F12   ; skip SUBI and BNEZ
          LD    F14, -24(R1)   ; F14 vs. F0
          ADDD  F16, F14, F2   ; F16 vs. F4
          SD    -24(R1), F16   ; skip SUBI and BNEZ
          SUBI  R1, R1, #32    ; R1 = R1 - 32 (4 elements)
          BNEZ  R1, Loop
Spared: 3 × (SUBI + BNEZ)

50 Advanced Pipelining: Loop unrolling
Loop unrolling: replicating the body of the loop multiple times
- Some branches are eliminated
- The ratio of useful operations to loop overhead increases
- The BB artificially increases in size → higher probability of optimal scheduling
- Requires a wider set of registers, and adjusting the offsets of the loads and stores
- (In the given example) every operation is followed by a dependent instruction → will cause a stall
- Trace of the unscheduled unrolled loop: 27 cycles (2 per LD, 3 per ADDD, 2 per branch, 1 per any other instruction) → 6.8 clock cycles per iteration
- Pure scheduling is better! (6 cycles)
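
In C, the transformation applied here looks roughly as follows (an illustrative sketch of 4-way unrolling, assuming the trip count is divisible by 4; not from the slides):

    /* rolled:   for (I = 1; I <= 1000; I++) x[I] = x[I] + s;        */
    /* unrolled by 4: three of every four increment/branch pairs go. */
    void add_s_unrolled(double *x, double s, int n)
    {
        for (int i = 0; i < n; i += 4) {
            x[i]     += s;     /* iteration i                         */
            x[i + 1] += s;     /* iteration i+1: distinct temporaries */
            x[i + 2] += s;     /* map to distinct registers (F6, F10, */
            x[i + 3] += s;     /* F14 in the DLX version)             */
        }
    }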

51 Advanced Pipelining: Loop unrolling
Unrolled loop plus scheduling: starting point (the unrolled loop of step 7):
    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4      ; skip SUBI and BNEZ
          LD    F6, -8(R1)
          ADDD  F8, F6, F2
          SD    -8(R1), F8     ; skip SUBI and BNEZ
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12   ; skip SUBI and BNEZ
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16   ; skip SUBI and BNEZ
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

52 Advanced Pipelining: Loop unrolling
Unrolled loop plus scheduling: the second LD is hoisted above the first ADDD:
    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          LD    F10, -16(R1)
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

53 Advanced Pipelining: Loop unrolling
Unrolled loop plus scheduling: the third LD is hoisted as well:
    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          LD    F14, -24(R1)
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

54 Advanced Pipelining: Loop unrolling
Unrolled loop plus scheduling: all four LDs grouped at the top:
    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          ADDD  F8, F6, F2
          SD    -8(R1), F8
          ADDD  F12, F10, F2
          SD    -16(R1), F12
          ADDD  F16, F14, F2
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop

55 Advanced Pipelining: Loop unrolling
Unrolled loop plus scheduling: finally the ADDDs and the SDs are grouped too:
    Loop: LD    F0, 0(R1)
          LD    F6, -8(R1)
          LD    F10, -16(R1)
          LD    F14, -24(R1)
          ADDD  F4, F0, F2
          ADDD  F8, F6, F2
          ADDD  F12, F10, F2
          ADDD  F16, F14, F2
          SD    0(R1), F4
          SD    -8(R1), F8
          SD    -16(R1), F12
          SD    -24(R1), F16
          SUBI  R1, R1, #32
          BNEZ  R1, Loop
14 clock cycles, or 3.5 clock cycles per iteration.
There is now enough distance to prevent the dependencies from turning into hazards.

56 Advanced Pipelining: Loop unrolling
Unrolling the loop exposes more computation that can be scheduled to minimize the stalls.
Unrolling increases the BB; as a result, a better choice can be made for scheduling.
A useful technique, with two key requirements:
- understanding how an instruction depends on another
- understanding how to change or reorder the instructions, given the dependencies
In what follows we concentrate on dependencies.

57 Loop unrolling: dependencies
Again, let (I_k), 1 ≤ k ≤ IC(p), be the ordered series of instructions executed during the run of program p.
Given two instructions I_i and I_j, with i < j, we say that I_j is dependent on I_i (written I_i → I_j) iff
- R(I_i) ∩ D(I_j) ≠ ∅, where R is the range and D the domain of a given instruction (I_i produces a result which is consumed by I_j), or
- ∃ n ∈ {1, …, IC(p)} and ∃ k_1 < k_2 < … < k_n such that I_i → I_k1 → I_k2 → … → I_kn → I_j

58 Loop unrolling: dependencies
(I_i → I_k1 → I_k2 → … → I_kn → I_j) is called a dependency (transitive) chain.
Note that a dependency chain can be as long as the entire execution of p.
A hazard implies a dependency; a dependency does not imply a hazard!
Scheduling tries to place dependent instructions in places where no hazard can occur.

59 Loop unrolling: dependencies
For instance:
    SUBI R1, R1, #8
    BNEZ R1, Loop
This is clearly a dependence, but it does not result in a hazard: forwarding eliminates the hazard.
Another example:
    LD   F0, 0(R1)
    ADDD F4, F0, F2
This is a data dependency which does lead to a hazard and a stall.

60 Loop unrolling: dependencies
Dealing with data dependencies: two classes of methods:
1. keeping the dependence while avoiding the hazard (via scheduling)
2. eliminating the dependence by transforming the code

61 Loop unrolling: dependencies
Class 2 implies more work; these are optimization methods used by compilers.
Detecting dependencies when only registers are used is easy; the difficulties come from detecting dependencies through memory:
- for instance, 100(R4) and 20(R6) may point to the same memory location
- the opposite situation may also take place:
    LD  20(R4), R2
    …
    ADD R3, R1, 20(R4)
  if R4 changes in between, there is no dependency

62 Loop unrolling: dependencies
I_i → I_j means that I_i produces a result that is consumed by I_j.
When there is no such production, e.g. when I_i and I_j are both loads or stores, we call this a name dependency.
Two types of name dependencies:
- antidependence, corresponding to WAR hazards: I_i reads x, then I_j writes x (reordering implies an error)
- output dependence, corresponding to WAW hazards: I_i writes x, then I_j writes x again (reordering implies an error)
No value is transferred between the instructions.
Register renaming solves the problem.
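
A C-level sketch of the two name dependencies (an illustrative example, not from the slides), reusing one temporary the way unrolled code reuses one register:

    void name_deps(double a, double b, double c, double d,
                   double *out1, double *out2)
    {
        double t;
        t     = a + b;    /* I1: writes t                             */
        *out1 = t * 2.0;  /* I2: reads t  (true dependence I1 -> I2)  */
        t     = c + d;    /* I3: writes t again.                      */
                          /* I2 -> I3: antidependence (WAR);          */
                          /* I1 -> I3: output dependence (WAW);       */
                          /* no value flows from I1/I2 into I3.       */
        *out2 = t * 3.0;  /* I4: reads the new t                      */
    }

Renaming the second t into a fresh variable removes both name dependencies, after which I3/I4 can be reordered freely with respect to I1/I2: exactly what register renaming does on the next slides.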

63 Loop unrolling: dependencies
Register renaming: if the register name is changed, the conflict disappears.
This technique can be either static (done by the compiler) or dynamic (done by the HW).
Let us consider again the following loop:
    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          SUBI R1, R1, #8
          BNEZ R1, Loop
Let us perform unrolling without renaming:

64 Loop unrolling: dependencies
    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F0, -8(R1)
          ADDD F4, F0, F2
          SD   -8(R1), F4
          LD   F0, -16(R1)
          ADDD F4, F0, F2
          SD   -16(R1), F4
          LD   F0, -24(R1)
          ADDD F4, F0, F2
          SD   -24(R1), F4
          SUBI R1, R1, #32
          BNEZ R1, Loop
(The arrows in the original slide mark the name dependencies on F0 and F4. To remove them, we perform renaming.)

65 Loop unrolling: dependencies
Renaming the second iteration:
    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F6, -8(R1)
          ADDD F8, F6, F2
          SD   -8(R1), F8
          LD   F0, -16(R1)
          ADDD F4, F0, F2
          SD   -16(R1), F4
          LD   F0, -24(R1)
          ADDD F4, F0, F2
          SD   -24(R1), F4
          SUBI R1, R1, #32
          BNEZ R1, Loop

66 Loop unrolling: dependencies
After renaming all iterations:
    Loop: LD   F0, 0(R1)
          ADDD F4, F0, F2
          SD   0(R1), F4
          LD   F6, -8(R1)
          ADDD F8, F6, F2
          SD   -8(R1), F8
          LD   F10, -16(R1)
          ADDD F12, F10, F2
          SD   -16(R1), F12
          LD   F14, -24(R1)
          ADDD F16, F14, F2
          SD   -24(R1), F16
          SUBI R1, R1, #32
          BNEZ R1, Loop

67 Loop unrolling: dependencies
(Same code as on the previous slide.)
The arrows in the original slide mark the remaining (true) data dependencies: each LD feeds its ADDD, which feeds its SD. To deal with them, we reorder the instructions.

68 Loop unrolling: dependencies
A third class of dependencies is that of control dependencies.
Examples:
    if (p1) s1;
    if (p2) s2;
then
- p1 →c s1 (s1 is control dependent on p1)
- p2 →c s2 (s2 is control dependent on p2)
Clearly ¬(p1 →c s2), that is, s2 is not control dependent on p1.

69 Loop unrolling: dependencies
Control dependence requires two constraints to be fulfilled:
    if (p1) { s1; s2; … }
is not equivalent to
    s1; if (p1) { s2; … }
1. That is, if p1 →c s1, then one cannot schedule s1 so that it is executed before p1.

70 Loop unrolling: dependencies
Control dependence requires two constraints to be fulfilled:
    s; if (p1) { s1; … }
is not equivalent to
    if (p1) { s; s1; … }
2. That is, if ¬(p1 →c s), then one cannot schedule s so that it is executed under the control dependence of p1.

71 Loop unrolling: dependencies
Two properties are critical to control dependency:
- exception behaviour
- data flow
Exception behaviour: suppose we have the following excerpt:
        BEQZ R2, L1
        DIVI R1, 8(R2)
    L1: …
We may be able to move the DIVI to before the BEQZ without violating the sequential semantics of the program.
Suppose the branch is taken: normally one would simply need to undo the DIVI.
But what if DIVI triggers a DIVBYZERO exception?

72 Loop unrolling: dependencies
Two properties are critical to control dependency:
- exception behaviour
- data flow
Data flow must be preserved. Let us consider the following excerpt:
        ADD  R1, R2, R3
        BEQZ R4, L
        SUB  R1, R5, R6
    L:  OR   R7, R1, R8
The value of R1 depends on the control flow: the OR depends on both the ADD and the SUB, and also on the nature of the branch: R1 = (taken) ? ADD… : SUB…

73 Loop Level Parallelism
Let us consider the following loop:
    for (I=1; I<=100; I++) {
        A[I+1] = A[I] + C[I];    /* S1 */
        B[I+1] = B[I] + A[I+1];  /* S2 */
    }
S1 is a loop-carried dependency (LCD): iteration I+1 is dependent on iteration I: A' = f(A).
S2 is B' = f(B, A').
If a loop has only non-LCDs, then it is possible to execute more than one loop iteration in parallel, as long as the dependencies within each iteration are not violated.

74 Loop Level Parallelism
What to do in the presence of LCDs? Loop transformations. Example:
    for (I=1; I<=100; I++) {
        A[I+1] = A[I] + B[I];    /* S1 */
        B[I+1] = C[I] + D[I];    /* S2 */
    }
A' = f(A, B)
B' = f(C, D)
Note: no dependencies except LCDs → the instructions can be swapped!

75 Loop Level Parallelism
What to do in the presence of LCDs? Loop transformations. Example:
    for (I=1; I<=100; I++) {
        A[I+1] = A[I] + B[I];    /* S1 */
        B[I+1] = C[I] + D[I];    /* S2 */
    }
[Figure: the flow A0 B0, C0 D0, A1 B1, C1 D1, A2 B2, C2 D2, … can be regrouped, since S1 and S2 of an iteration are mutually independent]

76 Loop Level Parallelism
    for (i=1; i <= 100; i=i+1) {
        A[i] = A[i] + B[i];      /* S1 */
        B[i+1] = C[i] + D[i];    /* S2 */
    }
becomes
    A[1] = A[1] + B[1];
    for (i=1; i <= 99; i=i+1) {
        B[i+1] = C[i] + D[i];
        A[i+1] = A[i+1] + B[i+1];
    }
    B[101] = C[100] + D[100];

77 Loop Level Parallelism
Before:              After:
A' = f(A, B)         B' = f(C, D)
B' = f(C, D)         A' = f(A', B')
Now we have dependencies, but no more LCDs!
→ It is possible to execute more than one loop iteration in parallel, as long as the dependencies within each iteration are not violated.

78 Dependency avoidance
1. "Batch" approaches: at compile time, the compiler schedules the instructions so as to minimize the dependencies (static scheduling)
2. "Interactive" approaches: at run time, the HW rearranges the instructions so as to minimize the stalls (dynamic scheduling)
Advantages of 2:
- the only approach when dependencies are known only at run time (pointers etc.)
- the compiler can be simpler
- an executable compiled for a machine with machine level X and pipeline organization Y can run efficiently on another machine with the same machine level but a different pipeline organization Z

79 Dynamic Scheduling
Static scheduling: compiler techniques for scheduling (rearranging) the instructions
- so as to separate dependent instructions
- and hence minimize unsolvable hazards causing unavoidable stalls
Dynamic scheduling: HW-based, run-time techniques.
A dynamically scheduled processor does not try to remove true data dependencies (which would be impossible): it tries to avoid stalling when dependencies are present.
The two techniques can both be used.

80 Dynamic Scheduling: General Idea
If an instruction is stalled in the pipeline, no later instruction can proceed.
A dependence between two instructions close to each other causes a stall.
A stall means that, even though there may be idle functional units that could potentially serve other instructions, those units have to stay idle.
Example:
    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F12, F8, F14
ADDD depends on DIVD, but SUBD does not. Despite this, it is not issued!

81 Dynamic Scheduling: General Idea
So SUBD is not issued, even though there might be a functional unit ready to perform the requested operation.
A big performance limitation!
What are the reasons that lead to this problem?
In-order instruction issuing and execution: instructions issue and execute one at a time, one after the other.

82 Dynamic Scheduling: General Idea
Example: in DLX, the issue of an instruction occurs at ID (instruction decode).
In DLX, ID checks for the absence of structural hazards and waits for the absence of data hazards.
These two steps may be made distinct.

83 Dynamic Scheduling: General Idea
The issue process gets divided into two parts:
1. checking for the presence of structural hazards
2. waiting for the absence of a data hazard
Instructions are issued in order, but they execute and complete as soon as their data operands are available → a data flow approach.

84 Dynamic Scheduling: General Idea
The ID pipeline stage is divided into two sub-stages:
- ID.1 (issue): decode the instruction, check for structural hazards
- ID.2 (read operands): wait until no data hazards, then read operands

85 Dynamic Scheduling: General Idea
In the DLX floating point pipeline, the EX stage of an instruction may take multiple cycles.
For each issued instruction I, depending on the resolution of structural and data hazards, I may be waiting for resources or data, or in execution, or completed.
More than a single instruction can be in execution at the same time.

86 Scoreboarding
Scoreboard (CDC 6600, 1964): a technique to allow instructions to execute out of order when there are sufficient resources and no data dependencies.
Goal: an execution rate of 1 instruction per clock cycle in the absence of structural hazards.
Large set of FUs:
- 4 FPUs
- 5 units for memory references
- 7 integer FUs
→ a highly redundant (parallel) system.
Four steps replace the ID, EX, and WB stages.

87 Scoreboarding
The four steps (pseudo-code; "as soon as" replaces the slide's ASA, "end of execution" its EOX):
1. Issue:
       if (a FU is available && no active instruction has the same destination reg) {
           issue I to the FU; update the scoreboard state;
       }                                    /* avoids WAW hazards */
2. Read operands:
       as soon as (the two source operands are available in the registers) {
           read the operands;               /* RAW stalls are managed here */
       }
3. Execution, for each FU:
       as soon as (the operands are available) { start EX; }
       at the end of execution: alert the scoreboard;
4. Write back:
       wait until (no WAR hazards), then store the output to the destination reg;
                                            /* avoids WAR hazards */
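
A compact C sketch of the bookkeeping behind these four steps (field names are assumptions for exposition; the CDC 6600's actual tables differ):

    enum fu_state { FU_IDLE, FU_ISSUED, FU_READ_OPS, FU_EXEC, FU_WB };

    struct fu_entry {
        enum fu_state state;
        int op;            /* operation being executed                     */
        int dest_reg;      /* destination register                         */
        int src1, src2;    /* source registers                             */
        int q1, q2;        /* FUs that will produce src1/src2 (-1 = ready) */
    };

    /* result_of[r]: index of the FU that will write register r, -1 if none.
       Issue stalls while result_of[dest_reg] != -1  -> avoids WAW.
       Read-operands waits until q1 == q2 == -1      -> respects RAW.
       Write-back waits until no earlier instruction
       still has to read the old value of dest_reg   -> avoids WAR.        */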

88 Scoreboarding
In eliminating stalls, a scoreboard is limited by several factors:
- the amount of parallelism available among the instructions (in the presence of many dependencies there is not much that one can do…)
- the number of scoreboard entries (how far ahead the pipeline can look for independent instructions)
- the number and types of FUs
- the number of WARs and WAWs

89 Scoreboarding
The effectiveness of the scoreboard depends heavily on the register file.
All operands are read from registers, and all outputs go to destination registers → the availability of registers influences the capability to eliminate stalls.

90 Tomasulo's approach
Tomasulo's approach (IBM 360/91, 1967): an improvement of scoreboarding for when a machine architecture allows only a limited number of registers.
Based on virtual registers.
The IBM 360/91 had two key design goals:
- to be faster than its predecessors
- to be machine-level compatible with its predecessors
Problem: the 360 family had only 4 FP registers.
Tomasulo combined the key ideas of scoreboarding with register renaming.

91 Tomasulo's approach
IBM 360/91 FUs: 3 ADDD/SUBD, 2 MULD, 6 LD, 6 SD.
Key element: the reservation station (RS): a buffer which holds the operands of an instruction waiting to issue.
Key concepts:
- an RS fetches and buffers an operand as soon as it is available, eliminating the need to get that operand from a register
- instead of tracking source and destination registers, we track source and destination RSs
[Figure: an operation OP whose operand slots name reservation stations RS_a, RS_b, RS_c]
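
In C, an RS entry can be pictured as below; the field names follow the ISSUE pseudo-code of slide 96, but the layout itself is an expository assumption, not the IBM 360/91 hardware:

    struct rs_entry {
        int    busy;            /* entry in use?                          */
        int    op;              /* operation to perform                   */
        int    input_lx_RSid;   /* RS that will produce the left operand  */
        double input_lx_value;  /*   (RSid == 0: value already present)   */
        int    input_rx_RSid;   /* RS that will produce the right operand */
        double input_rx_value;
    };

    struct reg {
        int    output_RSid;     /* RS that will produce this register's   */
        double value;           /*   next value (0: value is current)     */
    };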

92 Tomasulo's approach
A reservation station represents:
- a static datum, read from a register, or
- a "live" datum (a future datum) that will be produced by another RS and FU
Hazard detection and execution control are not centralised in a scoreboard.
They are distributed over the RSs; each RS, independently:
- controls the FU attached to it,
- and starts that FU the moment the operands become available.

93 Tomasulo's approach
The operands go to the FUs through the (wide set of) RSs, not through the (small) register file.
This is managed through a broadcast that makes use of a common result-or-data bus.
All units waiting for an operand can load it at the same time.
[Figure: one result broadcast on the bus is picked up simultaneously by several waiting RSs]

94 Tomasulo's approach
The execution is driven by a graph of dependencies.
[Figure: a dependence graph of RSs, with SUBD and MULTD operations feeding one another]
A "live data structure" approach (similar to LINDA): a tuple is made available in the future, when a thread will have finished producing it.

95 Tomasulo's approach
Let
- RSid mean "reservation station id",
- R_x mean "register x".
Objective: computing R_D ← f(R_S1, R_S2)

96 Tomasulo's approach: ISSUE
    /* R_D is assigned to RS #r */
    if (R[S1].output_RSid != 0)
        RS[r].input_lx_RSid = R[S1].output_RSid;
        /* this means "to be produced, and to appear in the RS whose
           number is R[S1].output_RSid" */
    else {
        /* no RS will produce the input value; fetch it from register R_S1 */
        RS[r].input_lx_value = R[S1].value;
        RS[r].input_lx_RSid  = 0;
    }
    if (R[S2].output_RSid != 0)
        RS[r].input_rx_RSid = R[S2].output_RSid;
    else {
        /* no RS will produce the input value; fetch it from register R_S2 */
        RS[r].input_rx_value = R[S2].value;
        RS[r].input_rx_RSid  = 0;
    }
    RS[r].busy = YES;
    R[D].output_RSid = r;
These statements attach to an RS:
(a) a "present", available value, or
(b) a "future", "live" value, i.e. a computation which has not yet produced its output and is represented by another RS.

97 Tomasulo's approach: EXEC
Whenever, in RS[r], both fields (input_lx_RSid, input_rx_RSid) == (0, 0), both operands are available in the fields (input_lx_value, input_rx_value).
This means that the execution of f can start.

98 Tomasulo's approach: WB
When RS[r] completes, its result is broadcast on the common data bus:
    1) forall (x): if (RS[x].input_lx_RSid == r) {
           RS[x].input_lx_value = RS[r].output_value;
           RS[x].input_lx_RSid  = 0;
       }
    2) forall (x): if (RS[x].input_rx_RSid == r) {
           RS[x].input_rx_value = RS[r].output_value;
           RS[x].input_rx_RSid  = 0;
       }
    3) forall (x): if (R[x].output_RSid == r) {
           R[x].value = RS[r].output_value;
           R[x].output_RSid = 0;
       }
    4) forall (x): if (StoreBuffer[x].RSid == r) {
           StoreBuffer[x].value = RS[r].output_value;
           StoreBuffer[x].RSid  = 0;
       }
    5) RS[r].busy = NO;
All the registers, buffers, and RSs whose RSid field names the completing RS update their values from the CDB and clear that RSid field to indicate that the value has been received.

99 Major Advantages of Tomasulo's approach
Distributed approach: the RSs independently control the FUs.
Distributed hazard detection logic.
The CDB broadcasts results → all pending instructions depending on that result are unblocked simultaneously:
- the CDB, being a bus, reaches many destinations in a single clock cycle
- if the waiting instructions get their missing operand in that clock cycle, they can all begin execution on the next clock cycle
WAR and WAW hazards are eliminated by renaming registers using the RSs.

100 Reducing branch penalties
Static approaches
→ Dynamic approaches

101 Reducing branch penalties: Dynamic Branch Prediction
A branch history table:
    Address       Branch
    0xA0B2DF37    BNEZ …
    0xA0B2F02A    BEQ …
    0xA0B30504    BNEZ …
    0xA0B30537    BGT …
[Figure: the low-order address byte (0x37, 0x2A, 0x04, …) indexes the table, whose entries record "taken" or "untaken"]

102 Dynamic Branch Prediction: Branch History Table
Algorithm:
    /* before the branch is evaluated */
    if (current instruction is a branch) {
        entry = PC & 0x000000FF;
        predict branch as BHT[entry];
    }
    /* after the branch */
    if (branch was mispredicted)
        BHT[entry] = 1 - BHT[entry];
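
A runnable C rendering of this one-bit scheme (sizes and names are illustrative):

    #include <stdint.h>

    static uint8_t bht[256];             /* 0 = predict untaken, 1 = taken */

    int predict(uint32_t pc)             /* before the branch */
    {
        return bht[pc & 0xFF];
    }

    void update(uint32_t pc, int taken)  /* after the branch */
    {
        uint32_t entry = pc & 0xFF;
        if (bht[entry] != (uint8_t)(taken != 0))
            bht[entry] = 1 - bht[entry]; /* mispredicted: flip the bit */
    }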

103 Dynamic Branch Prediction: Branch History Table
Algorithm:
- just one bit is enough for coding the Boolean value "taken" vs. "untaken"
- note: the function associating addresses to entries in the BHT is not guaranteed to be a bijection (a one-to-one relationship): the algorithm records the most recent behaviour of one or more branches; for instance, entry 37 corresponds to two branches
- despite this, the scheme works well…
- …though in some cases the performance of the scheme is not that satisfactory:

104 Dynamic Branch Prediction: Branch History Table
Accuracy:
    for (i=0; i<BIGN; i++)
        for (j=0; j<9; j++) { do_stg(); }
The inner loop branch is
- taken nine times in a row,
- then not taken once.
Taken 90%, untaken 10%. What is the prediction accuracy?

105 Dynamic Branch Prediction: Branch History Table
Accuracy:
[Trace: in each burst of 9 taken branches followed by 1 untaken branch, the one-bit entry mispredicts twice: once at the final (untaken) branch, and once at the first taken branch of the next burst]
Per burst: 8 successful predictions, 2 mispredictions.
Steady-state prediction accuracy is just 80%!

106 Dynamic Branch Prediction: Branch History Table
Accuracy:
Loop branches (taken n-1 times in a row, untaken once).
Performance of this dynamic branch predictor (based on a single-bit prediction entry):
- misprediction rate: 2 × 1/n
- twice the rate of untaken branches

107 Dynamic Branch Prediction: Two-bit Prediction Scheme
Use a two-bit field as a "branch behaviour recorder".
Allow the prediction to change only when two mispredictions in a row occur:
[Figure: the 2-bit saturating-counter state machine]
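
In C, the two-bit recorder is a saturating counter per entry (an illustrative sketch; states 0-1 predict untaken, states 2-3 predict taken, so a single misprediction never flips the prediction):

    #include <stdint.h>

    static uint8_t bht2[256];              /* each entry in 0..3 */

    int predict2(uint32_t pc)
    {
        return bht2[pc & 0xFF] >= 2;       /* 1 = predict taken */
    }

    void update2(uint32_t pc, int taken)
    {
        uint8_t *e = &bht2[pc & 0xFF];
        if (taken && *e < 3)
            (*e)++;                        /* saturate at "strong taken"   */
        else if (!taken && *e > 0)
            (*e)--;                        /* saturate at "strong untaken" */
    }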

108 Dynamic Branch Prediction: Branch History Table
Accuracy:
[Trace: with the two-bit recorder, the first burst costs 2 mispredictions; from then on, each burst of 9 taken branches plus 1 untaken branch costs only the single misprediction at the untaken branch, i.e. 9 successful predictions per burst]
Steady-state prediction accuracy is now 90%.

109 Dynamic Branch Prediction: Branch History Table
Accuracy:
[Chart: prediction accuracy for SPEC89 programs, with a 2-bit prediction buffer of 4096 entries]

110 Dynamic Branch Prediction: General Scheme
In the general case, one could use an n-bit branch behaviour recorder and a branch history table of 2^m entries.
In this case:
- a change occurs only after 2^(n-1) mispredictions
- there is a higher chance that not too many branch addresses are associated with the same BHT entry
- larger memory penalty

111 Dynamic Branch Prediction
[Chart: comparing the 2-bit predictor with the general case]

112 Dynamic Branch Prediction Schemes
One-bit prediction buffer
- good, but with limited accuracy
Two-bit prediction buffer
- very good: greater accuracy, slightly higher overhead
Infinite-bit prediction buffer
- as good as the two-bit one, but with a very large overhead
Correlating predictors

113 Dynamic Branch Prediction: Correlated predictors
Two-level predictors.
If the behaviour of a branch is correlated with the behaviour of another branch, no single-level predictor would be able to capture its behaviour.
Example:
    if (aa == 2) aa = 0;
    if (bb == 2) bb = 0;
    if (aa != bb) { …
If we keep track of the recent behaviour of other, previous branches, our accuracy may increase.

114 Dynamic Branch Prediction: Correlated predictors
A simpler example:
    if (d == 0) d = 1;
    if (d == 1) …
In DLX, this is
        BNEZ R1, L1    ; b1 (d != 0)
        MOV  R1, #1
    L1: SUBI R3, R1, #1
        BNEZ R3, L2    ; b2 (d != 1)
        ...
    L2: ...

115 Dynamic Branch Prediction: Correlated predictors
(DLX code as on the previous slide.)
Let us assume that d is 0, 1 or 2:
    Initial d | d==0? | b1      | d before b2 | d==1? | b2
    0         | yes   | untaken | 1           | yes   | untaken
    1         | no    | taken   | 1           | yes   | untaken
    2         | no    | taken   | 2           | no    | taken

116 Dynamic Branch Prediction: Correlated predictors
(Table as on the previous slide.)
This means that (b1 == untaken) → (b2 == untaken).
A one-bit predictor may not be able to capture this property and may behave very badly.

117 Dynamic Branch Prediction: Correlated predictors
Let us suppose that d alternates between 2 and 0.
This is the table for the one-bit predictor:
    d | b1 pred | b1 action | new b1 pred | b2 pred | b2 action | new b2 pred
    2 | NT      | T         | T           | NT      | T         | T
    0 | T       | NT        | NT          | T       | NT        | NT
    2 | NT      | T         | T           | NT      | T         | T
    0 | T       | NT        | NT          | T       | NT        | NT
ALL branches are mispredicted!

118 Dynamic Branch Prediction: Correlated predictors
Correlated predictor, example: every branch, say branch number j > 1, has two separate prediction bits:
- first bit: the predictor used if branch j-1 was NT
- second bit: the predictor used otherwise
    /* at the end of branch j-1 */
    behaviour_j_min_1 = (taken?) 1 : 0;
    /* at the beginning of branch j */
    predict branch as BHT[behaviour_j_min_1][entry];
    /* at the end of branch j */
    if (branch was mispredicted)
        BHT[behaviour_j_min_1][entry] = 1 - BHT[behaviour_j_min_1][entry];

119 Dynamic Branch Prediction: Correlated predictors
The behaviour of a branch selects a one-bit branch predictor.
If the prediction is not OK, its state is flipped.

120 Dynamic Branch Prediction: Correlated predictors
We may also consider the last TWO branches:
- the behaviour of these two branches selects, e.g., a one-bit predictor
- (NT NT, NT T, T NT, T T) → (0-3) → BHT[0..3]
- this is called a (2,1) predictor
- or, the behaviour of the last two branches selects an n-bit predictor → this is a (2,n) predictor

121 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/121 Dynamic Branch Prediction  Correlated predictors A (2,2) predictor: a 2-bit branch-history entry selects a 2-bit predictor.

122 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/122 Dynamic Branch Prediction  Correlated predictors
General case: (m, n) predictors
 Consider the last m branches and their 2^m possible outcome patterns
 This m-tuple selects an n-bit predictor
 With n-bit saturating counters, a change in the prediction only occurs after 2^(n-1) consecutive mispredictions
(See the sketch below.)
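The same idea generalizes directly. A hedged sketch of one common realization of an (m, n) predictor, assuming an m-bit global history shift register that selects among 2^m n-bit saturating counters per BHT entry; sizes and names are again illustrative:

    #include <stdbool.h>
    #include <stdint.h>

    #define M 2                            /* history length: last m branches */
    #define N 2                            /* counter width in bits           */
    #define ENTRIES 1024

    static uint8_t  cnt[ENTRIES][1 << M];  /* n-bit counters, 0 .. 2^N - 1    */
    static unsigned history;               /* m-bit global history register   */

    static bool predict(uint32_t pc)
    {
        /* predict taken when the selected counter is in its upper half */
        return cnt[pc % ENTRIES][history] >= (1u << (N - 1));
    }

    static void update(uint32_t pc, bool taken)
    {
        uint8_t *c = &cnt[pc % ENTRIES][history];
        if (taken  && *c < (1u << N) - 1) (*c)++;   /* saturate at 2^N - 1 */
        if (!taken && *c > 0)             (*c)--;   /* saturate at 0       */
        history = ((history << 1) | (taken ? 1 : 0)) & ((1u << M) - 1);
    }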

123 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/123 Dynamic Branch Prediction  Branch-Target Buffer
A run-time technique to reduce the branch penalty.
In DLX it is possible to "predict" the new PC via a branch-prediction buffer during the second stage of the pipeline.
With a Branch-Target Buffer (BTB), the new PC can be derived already during the first stage of the pipeline.

124 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/124 Dynamic Branch Prediction  Branch-Target Buffer
The BTB is a branch-prediction cache that stores the addresses of taken branches.
It is an associative array that maps (instruction address)  (branch target address).
In case of a hit, we know the predicted instruction address one cycle earlier than with the branch-prediction buffer: fetching begins immediately at the predicted PC. (A sketch of the lookup/update logic follows.)
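A hedged sketch of one possible organization, a direct-mapped BTB queried during IF; field and function names are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define BTB_ENTRIES 512

    struct btb_entry {
        uint32_t branch_pc;   /* the full address: the match must be one-to-one */
        uint32_t target_pc;
        bool     valid;
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* during IF: true on a hit, with *next_pc set to the predicted target */
    static bool btb_lookup(uint32_t pc, uint32_t *next_pc)
    {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        if (e->valid && e->branch_pc == pc) {
            *next_pc = e->target_pc;
            return true;
        }
        return false;              /* miss: fetch falls through to pc + 4 */
    }

    /* after the branch resolves: insert taken branches, drop wrong entries */
    static void btb_update(uint32_t pc, uint32_t target, bool taken)
    {
        struct btb_entry *e = &btb[pc % BTB_ENTRIES];
        if (taken) {
            e->branch_pc = pc;
            e->target_pc = target;
            e->valid = true;
        } else if (e->valid && e->branch_pc == pc) {
            e->valid = false;      /* untaken branches need no BTB entry */
        }
    }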

125 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/125 Dynamic Branch Prediction  Branch-Target Buffer
Design issues:
 The entire branch address must be stored: the correspondence must be one-to-one
 The number of entries in the BTB is limited  keep the most frequently used branches
 The BTB requires a number of actions to be executed during the first pipeline stage, also in order to update the state of the buffer
 Pipeline management gets more complex, and the clock cycle duration may have to be increased

126 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/126 Dynamic Branch Prediction  Branch-Target Buffer
Total branch penalty for a BTB. Assumptions: penalties are as follows ("*" = any prediction):

    Instruction in buffer | Prediction | Actual branch | Penalty (cycles)
    Yes                   | Taken      | Taken         | 0
    Yes                   | Taken      | Untaken       | 2
    No                    | *          | Taken         | 2

Prediction accuracy: 90%. Hit rate in buffer: 90%. Taken-branch frequency: 60%.

127 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/127 Dynamic Branch Prediction  Branch-Target Buffer
Branch penalty = (buffer hit rate) x (incorrect predictions) x penalty
               + (1 - buffer hit rate) x (taken-branch frequency) x penalty
             = 90% x 10% x 2 + 10% x 60% x 2
             = 0.18 + 0.12 = 0.30 clock cycles (vs. 0.50 for the delayed branch)

128 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/128 Dynamic Branch Prediction  Branch-Target Buffer
The same approach can be applied to procedure return addresses.
Example:
    0x4ABC  CALL 0x30A0
    0x4AC0  …
    …
    0x4CF4  CALL 0x30A0
    0x4CF8  …
    …
An associative array of stacks: the callee address (0x30A0) maps to a stack of pending return addresses (0x4CF8 and 0x4AC0, the instructions following the two calls).
If the cache is large enough, all return addresses are predicted correctly. (A sketch of the stack mechanism follows.)
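A minimal sketch of the stack mechanism, simplified to a single global return-address stack rather than the per-callee stacks depicted on the slide; depth and names are assumptions. With the example above, the two calls push 0x4AC0 and 0x4CF8, and the two returns then predict 0x4CF8 and 0x4AC0:

    #include <stdint.h>

    #define RAS_DEPTH 16

    static uint32_t ras[RAS_DEPTH];
    static int ras_top;

    static void ras_on_call(uint32_t call_pc)
    {
        ras[ras_top % RAS_DEPTH] = call_pc + 4;  /* push the fall-through PC  */
        ras_top++;                               /* oldest entries overwritten
                                                    when the stack overflows  */
    }

    static uint32_t ras_on_return(void)
    {
        if (ras_top > 0)
            ras_top--;
        return ras[ras_top % RAS_DEPTH];         /* the predicted return PC   */
    }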

129 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/129Parallelism Introduction to parallel processing Instruction level parallelism  Introduction  VLIW  Advanced pipelining techniques  Superscalar

130 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/130 Superscalar architectures
So far the goal was reaching the ideal CPI = 1.
Further increasing performance, i.e. CPI < 1, is the goal of superscalar processors (SPs). To reach this goal, SPs issue multiple instructions in the same clock cycle.
Multiple-issue processors:
 VLIW (seen already)
 SPs
   Statically scheduled (compiler)
   Dynamically scheduled (HW: scoreboarding / Tomasulo)
In SPs, a varying number of instructions is issued per cycle, depending on structural limits and dependencies.

131 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/131 Superscalar architectures
Superscalar version of DLX: at most two instructions can be issued per clock cycle:
 1. One of: load, store (integer or FP), branch, integer ALU operation
 2. An FP ALU operation
IF and ID operate on 64 bits of instructions (one instruction pair).
Multiple independent FPUs are available.

132 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/132 Superscalar architectures
The superscalar DLX is indeed a sort of "bidimensional pipeline":

    Integer instr.  IF  ID  EX  MEM WB
    FP instr.       IF  ID  EX  MEM WB
    Integer instr.      IF  ID  EX  MEM WB
    FP instr.           IF  ID  EX  MEM WB
    Integer instr.          IF  ID  EX  MEM WB
    FP instr.               IF  ID  EX  MEM WB

133 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/133 Superscalar architectures
Every new solution breeds new problems… latencies!
When the latency of the load is 1:
 In the "monodimensional" pipeline, the result of the load cannot be used by the next instruction: one delay slot must be filled.
 In the bidimensional pipeline of the SP, the same latency costs three instruction slots: the instruction paired with the load and both instructions of the following cycle cannot use its result.

134 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/134 Superscalar architectures
Let us consider again the following loop:

    Loop: LD    F0, 0(R1)
          ADDD  F4, F0, F2
          SD    0(R1), F4
          SUBI  R1, R1, #8
          BNEZ  R1, Loop

Let us perform unrolling (x5) + scheduling on the superscalar DLX:

135 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/135 Superscalar architectures

          Integer instruction     FP instruction      Cycle
    Loop: LD   F0, 0(R1)                              1
          LD   F6, -8(R1)                             2
          LD   F10, -16(R1)       ADDD F4, F0, F2     3
          LD   F14, -24(R1)       ADDD F8, F6, F2     4
          LD   F18, -32(R1)       ADDD F12, F10, F2   5
          SD   0(R1), F4          ADDD F16, F14, F2   6
          SD   -8(R1), F8         ADDD F20, F18, F2   7
          SD   -16(R1), F12                           8
          SD   -24(R1), F16                           9
          SUBI R1, R1, #40                            10
          BNEZ R1, Loop                               11
          SD   8(R1), F20                             12   ; 8 = -32 + 40: offset adjusted for the SUBI already executed

12 clock cycles per 5 iterations = 2.4 cc per iteration

136 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/136 Superscalar architectures
Superscalar: 2.4 cc/iteration vs. plain DLX: 3.5 cc/iteration.
But in the example there were not enough FP instructions to keep the FP pipeline busy: from cycle 8 to cycle 12, and in the first two cycles, each cycle holds just one instruction.
How to get more?
 Dynamic scheduling for SPs
 A multicycle extension of the Tomasulo algorithm

137 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/137 Superscalar architectures and the Tomasulo algorithm
Idea: employ separate data structures for the integer and the FP registers:
 Integer Reservation Stations (IRS)
 FP Reservation Stations (FRS)
In the same cycle, issue an FP instruction (to an FRS) and an integer instruction (to an IRS).
Note: issuing does not mean executing! Possible dependencies may still serialize the two instructions issued in parallel.
Dual issue is obtained by pipelining the instruction-issue stage so that it runs twice as fast.

138 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/138 Superscalar architectures
The multiple-issue strategy's inherent limitations:
 The amount of ILP may be limited (see the loop on p. 134)
 Extra HW is required: multiple FPUs and IUs
 More complex (hence slower) design
 Extra need for memory and register-file bandwidth
 Increase in code size due to aggressive loop unrolling
Recall: CPU time(p) = IC(p) x CPI(p) / clock rate

139 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/139 Superscalar architectures: compiler support
Symbolic loop unrolling: the loop is not physically unrolled but reorganized so as to eliminate stalls.
Software pipelining:
 Stalls are eliminated by interleaving instructions from different iterations of the loop
 The loop is not unrolled

    Original loop:              Software-pipelined loop:
    Loop: LD   F0, 0(R1)        Loop: SD   0(R1), F4     ; store for iteration i
          ADDD F4, F0, F2             ADDD F4, F0, F2    ; add for iteration i-1
          SD   0(R1), F4              LD   F0, -16(R1)   ; load for iteration i-2
          SUBI R1, R1, #8             SUBI R1, R1, #8
          BNEZ R1, Loop               BNEZ R1, Loop

RAW dependences: problematic. WAR dependences: removable in HW.
(A C rendering with the start-up and wind-down code made explicit follows.)
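To make the interleaving concrete, here is a hedged C rendering of the software-pipelined loop (the computation is x[i] = x[i] + s over n >= 2 elements, walking downwards like the DLX loop); the prologue and epilogue, which the slide omits, are written out:

    /* software-pipelined x[i] += s, a sketch assuming n >= 2 */
    void add_scalar_swp(double *x, double s, int n)
    {
        double f0, f4;
        int i;

        f0 = x[n - 1];                 /* prologue: load for iteration n-1   */
        f4 = f0 + s;                   /*           add  for iteration n-1   */
        f0 = x[n - 2];                 /*           load for iteration n-2   */

        for (i = n - 1; i >= 2; i--) { /* steady state: SD / ADDD / LD       */
            x[i] = f4;                 /* store from iteration i             */
            f4 = f0 + s;               /* add   from iteration i-1           */
            f0 = x[i - 2];             /* load  for  iteration i-2           */
        }

        x[1] = f4;                     /* epilogue: drain the two iterations */
        x[0] = f0 + s;                 /* still in flight                    */
    }

In the steady-state body the store, the add and the load belong to three different iterations, which is exactly what separates the dependent LD / ADDD / SD of the original loop.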

140 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/140 Superscalar architectures: compiler support Trace scheduling Aim: tackling the problem of too short basic blocks Method:  Trace selection  Trace compaction

141 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/141 Superscalar architectures: compiler support
Trace selection:
 A number of contiguous basic blocks are put together into a "trace"
 Using static branch prediction, the conditional branches are chosen as taken/untaken, while loop branches are considered taken
[Figure: basic blocks A, B, C joined into one trace; the off-trace path at each test gets book-keeping code]

142 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/142 Superscalar architectures: compiler support
Trace compaction:
 The resulting trace is a longer straight-line sequence of code
 Trace compaction = global code scheduling: code is scheduled as if it were one basic block whose size is that of A + B + C, with speculative movement of code
[Figure: blocks A, B, C with book-keeping code, compacted into a single scheduling region]

143 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/143 Superscalar architectures: HW support
Conditional instructions: instructions like
    CMOVZ R2, R3, R1
which means
    if (R1 == 0) R2 = R3;
or, equivalently, R2 = (R1 == 0) ? R3 : R2.
The instruction turns into a NOP if the condition is not met; this also means that no exceptions are raised!
Using conditional instructions we convert a control dependence (due to a branch) into a data dependence.
Speculative transformation in a two-issue superscalar with conditional instructions:

144 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/144 Superscalar architectures: HW support : conditional instructions

    Before:
        Integer instruction     FP instruction    Cycle
        LW   R1, 40(R2)         ADDD R3, R4, R5   1
                                ADDD R6, R3, R7   2
        BEQZ R10, L                               3
        LW   R8, 20(R10)                          4
        LW   R9, 0(R8)                            5

    After (the first load made conditional and hoisted above the branch):
        LW   R1, 40(R2)         ADDD R3, R4, R5   1
        LWC  R8, 20(R10), R10   ADDD R6, R3, R7   2
        BEQZ R10, L                               3
        LW   R9, 0(R8)                            4

We speculate on the outcome of the branch. If the condition is not met we don't slow down execution, because we used a slot that would otherwise have been lost.

145 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/145 Superscalar architectures: HW support : conditional instructions
Conditional instructions are useful to implement short alternative control flows. Their usefulness, though, is limited by several factors:
 Conditional instructions that are annulled still take execution time, unless they are scheduled into otherwise wasted slots
 They are good only in limited cases, when there is a simple alternative sequence
 Moving an instruction across multiple branches would require doubly-conditional instructions, e.g. LWCC R1, R2, R10, R12 (which makes no sense)
 They do extra work w.r.t. their "regular" version; the extra time required for the test may cost more cycles than the regular version

146 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/146 Superscalar architectures: HW support : conditional instructions
Most architectures support only a few conditional instructions (typically a conditional move).
The HP PA architecture allows any register-register instruction to turn the next instruction into a NOP, which makes the latter a conditional instruction.
Exceptions:

147 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/147 Superscalar architectures: HW support : conditional instructions
Exceptions:
 Fatal (normally causing termination; e.g., a memory-protection violation)
 Resumable (causing a delay but no termination; e.g., a page fault)
Resumable exceptions can be processed for speculative instructions just as if they were normal instructions: the corresponding time penalty does not make the program incorrect.
Fatal exceptions cannot be taken on behalf of speculative instructions; they must be deferred until the instruction is known to be non-speculative.

148 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/148 Superscalar architectures: HW support : conditional instructions
Moving instructions across a branch must not affect:
 the (fatal) exception behaviour
 the data dependences
How to obtain this?
1. All exceptions triggered by speculative instructions are "ignored": HW and OS still process resumable exceptions normally, but return an undefined value for any fatal exception. The program is allowed to continue, though this will almost certainly lead to incorrect results.
Note: scheme 1 can never cause a correct program to fail, regardless of whether speculation was used.

149 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/149 Superscalar architectures: HW support : conditional instructions
2. Poison bits: a speculative instruction does not trigger any exception, but turns a bit on in its result register. The next "normal" (non-speculative) instruction using that register will be "poisoned": it will cause the exception. (A toy model follows.)
3. Boosting: renaming and buffering in the HW (similar to the Tomasulo approach).
Speculation can be used, e.g., to optimize an if-then-else such as
    if (a == 0) a = b; else a = a + 4;
or, equivalently,
    a = (a == 0) ? b : a + 4;
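A toy model of the poison-bit scheme, assuming one poison flag per architectural register; the memory environment and all names are hypothetical, chosen only for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define NREGS 32
    static int32_t reg[NREGS];
    static bool    poison[NREGS];

    /* toy memory environment, standing in for a real machine */
    static int32_t memory[1024];
    static bool addr_would_fault(uint32_t a)     { return a >= 1024; }
    static int32_t mem_read(uint32_t a)          { return memory[a]; }
    static void mem_write(uint32_t a, int32_t v) { memory[a] = v; }
    static void raise_exception(void)            { /* trap in a real machine */ }

    /* LW*: a speculative load never traps; it poisons rd instead */
    static void lw_spec(int rd, uint32_t addr)
    {
        if (addr_would_fault(addr)) {
            poison[rd] = true;            /* defer the fatal exception */
        } else {
            reg[rd] = mem_read(addr);
            poison[rd] = false;
        }
    }

    /* a speculative ALU op inherits poison into its destination */
    static void add_spec(int rd, int rs, int rt)
    {
        reg[rd]    = reg[rs] + reg[rt];
        poison[rd] = poison[rs] || poison[rt];
    }

    /* a non-speculative use of a poisoned register finally traps */
    static void sw_normal(uint32_t addr, int rs)
    {
        if (poison[rs])
            raise_exception();
        mem_write(addr, reg[rs]);
    }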

150 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/150 Superscalar architectures: HW support : conditional instructions
Suppose A is in 0(R3) and B in 0(R2).
Example:
        LW   R1, 0(R3)     ; load A
        BNEZ R1, L1        ; A != 0 ? goto L1
        LW   R1, 0(R2)     ; load B
        J    L2            ; skip ELSE
    L1: ADD  R1, R1, 4     ; ELSE part
    L2: SW   0(R3), R1     ; store A
Speculation:
        LW   R1, 0(R3)     ; load A
        LW   R9, 0(R2)     ; speculatively load B
        BEQZ R1, L3        ; A == 0 ? keep R9 = B
        ADD  R9, R1, 4     ; here R9 is A+4
    L3: SW   0(R3), R9     ; here R9 is A+4 or B
In this case a temporary register (R9) is used.
Method 1: speculation is transparent.

151 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/151 Superscalar architectures: HW support : conditional instructions
Method 2 applied to the previous code fragment:
        LW   R1, 0(R3)     ; load A
        LW*  R9, 0(R2)     ; speculatively load B
        BEQZ R1, L3
        ADD  R9, R1, 4     ; here R9 is A+4
    L3: SW   0(R3), R9     ; here R9 is A+4 or B
LW* is a speculative version of LW: an opcode that also turns on the poison bit of the destination register R9.
The next non-speculative instruction using R9 will be "poisoned": it will cause an exception.
If another speculative instruction uses R9, the poison bit is inherited.

152 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/152 Superscalar architectures: HW support : conditional instructions
Combining speculation with dynamic scheduling:
 An attribute bit is added to each instruction (1: speculative, 0: normal)
 When that bit is 1, the instruction is allowed to execute but cannot enter the commit (WB) stage
 It then has to wait until the end of the speculated code: it may modify the register file / memory only once speculative mode ends
Hence: instructions execute out of order, but are forced to commit in order.
A special set of buffers holds the results that have finished execution but have not committed yet: the reorder buffers.

153 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/153 Superscalar architectures: HW support : conditional instructions
Since neither the register values nor the memory values are actually WRITTEN until an instruction commits, the processor can easily undo its speculative actions when a branch is found to be mispredicted.
If a speculated instruction raises an exception, this is recorded in the reorder buffer.
If, because of a branch misprediction, a speculative instruction should not have been executed, the recorded exception is flushed along with the instruction when the reorder buffer is cleared.

154 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/154 Superscalar architectures: HW support : conditional instructions
Reorder buffers:
 An additional set of virtual registers that hold the results of instructions that have finished execution but have not committed yet
 Issue: only when both a reservation station and a reorder buffer are available
 As soon as an instruction completes, its result goes into its reorder buffer
 Until the instruction commits, consumers receive its result from the reorder buffer (the reservation station is freed; the reorder buffer is not)
 The actual updating of registers takes place when the instruction reaches the head of the list of reorder buffers
(A minimal sketch follows.)
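A minimal circular reorder buffer in C, to make the in-order commit rule concrete; the sizes, field names and the three operations are assumptions for illustration:

    #include <stdbool.h>
    #include <stdint.h>

    #define ROB_SIZE 32

    struct rob_entry {
        bool    done;        /* execution finished, result available     */
        bool    exception;   /* recorded here, acted upon only at commit */
        int     dest_reg;    /* architectural register to update         */
        int32_t value;       /* the result waiting to be committed       */
    };

    static struct rob_entry rob[ROB_SIZE];
    static int head, tail, count;   /* commit from head, allocate at tail */

    /* issue: allocate an entry at the tail; -1 means full, so issue stalls */
    static int rob_alloc(int dest_reg)
    {
        if (count == ROB_SIZE) return -1;
        int idx = tail;
        rob[idx] = (struct rob_entry){ .done = false, .dest_reg = dest_reg };
        tail = (tail + 1) % ROB_SIZE;
        count++;
        return idx;
    }

    /* execution complete: the result waits in the buffer, not the registers */
    static void rob_complete(int idx, int32_t value, bool exception)
    {
        rob[idx].value     = value;
        rob[idx].exception = exception;
        rob[idx].done      = true;
    }

    /* commit: only the head may update the register file, so instructions
       finish out of order but commit strictly in order */
    static void rob_commit(int32_t *regfile)
    {
        while (count > 0 && rob[head].done) {
            if (rob[head].exception) {
                /* a real CPU would take the (now precise) exception here */
            }
            regfile[rob[head].dest_reg] = rob[head].value;
            head = (head + 1) % ROB_SIZE;
            count--;
        }
    }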

155 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/155 Superscalar architectures: HW support : conditional instructions At this point the commit phase takes place:  Either the result is written into the register file,  Or, in case of a mispredicted branch, the reorder buffer is flushed and execution restarts at the correct successor of the branch Assumption: when a branch with incorrect prediction reaches the head of the buffer, it means that the speculation was wrong

156 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/156 Superscalar architectures: HW support : conditional instructions
This technique also allows us to tackle situations like
    if (cond) do_this; else do_that;
One may "bet" on the outcome of the branch and say, e.g., that it will be taken. But even unlikely events do happen, so sooner or later a misprediction occurs.
Idea: let the instructions in the else part (do_that) issue and execute too, with a separate list of reorder buffers (list2).
This second list is simpler: we don't check for the current head of list; elements need to be removed explicitly.
In case of a misprediction, the second list has already executed the do_that part, and we just need to perform its commit.
In case of a correct prediction, the ELSE part is purged from list2.

157 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/157 Superscalar architectures
If a processor A has a lower CPI than another processor B, will A always run faster than B? Not always!
 A higher clock rate is indeed a deterministic measure of performance improvement
 A multiple-issue (superscalar) architecture cannot guarantee its improvements (they are stochastic)
 Pushing towards a low CPI means adopting sophisticated (= complex) techniques… which slows down the clock rate!
 Improving one aspect of a multiple-issue processor does not necessarily lead to an overall performance improvement

158 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/158 Superscalar architectures
A simple question: "how much ILP exists in a program?", or, in other words, "how much can we expect from techniques based on the exploitation of ILP?"
How to proceed: make a set of very optimistic assumptions and measure how much parallelism is available under those assumptions.

159 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/159 Superscalar architectures
Assumptions (HW model of an ideal processor):
1. Infinite number of virtual registers ( no WAW or WAR hazard can stall the pipeline)
2. All conditional branches are predicted exactly (!!)
3. All computed jumps and returns are perfectly predicted
4. All memory addresses are known exactly, so a store can be moved before a load, provided the addresses are not identical
5. Infinite-issue processor
6. No restriction on the types of instructions that may execute in a cycle (no structural hazards)
7. All latencies are 1

160 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/160 Superscalar architectures
How to match these assumptions? Gambling!
We run the program once and produce a trace with the outcomes of all the instances of each branch:
 taken, taken, taken, untaken, taken, …
 Each corresponding target address is recorded and assumed to be available
 Then we use a simulator to mimic, e.g., a machine with infinite virtual registers, etc.
Results are depicted in the next picture. Parallelism is expressed in IPC: instructions issued per clock cycle. (A sketch of the measurement follows.)
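Under these assumptions the only remaining limit is true data dependences, so the available parallelism can be computed directly from the trace: every instruction may issue one cycle after the last of its producers. A hedged sketch of that measurement; the two-source trace format and all names are assumptions (error handling elided):

    #include <stdlib.h>

    struct trace_op { int src1, src2; };  /* producers' trace indices; -1 = none */

    /* dataflow limit: schedule each op one cycle after its last producer */
    double ideal_ipc(const struct trace_op *trace, int n)
    {
        int *cycle = calloc(n, sizeof *cycle);   /* issue cycle of each op  */
        int last = 0;
        for (int i = 0; i < n; i++) {
            int c1 = trace[i].src1 >= 0 ? cycle[trace[i].src1] : 0;
            int c2 = trace[i].src2 >= 0 ? cycle[trace[i].src2] : 0;
            cycle[i] = (c1 > c2 ? c1 : c2) + 1;  /* assumption 7: latency 1 */
            if (cycle[i] > last) last = cycle[i];
        }
        free(cycle);
        return last ? (double)n / last : 0.0;    /* instructions per cycle  */
    }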

161 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/161 Superscalar architectures Tomcatv reaches 150 IPC (for a particular run)

162 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/162 Superscalar architectures
Then we can weaken the above assumptions and introduce limitations that reflect what current computer-design techniques for ILP actually allow:
 Window size: the actual range of instructions we inspect when looking for candidates for simultaneous issue
 Realistic branch prediction
 Finite number of registers
See figures 4-39 and 4-40.

163 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/163 Superscalar architectures

164 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/164 Superscalar architectures

165 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/165 Superscalar architectures: conclusive notes
In the next 10 years it is realistic to reach an architecture that looks like this:
 64 instruction issues per clock cycle
 Selective predictor with 1K entries; 16-entry return-address predictor
 Perfect disambiguation of memory references
 Register renaming with 64 + 64 extra registers
Computer architectures in practice: Section 4.8 (PowerPC 620)

166 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/166 Superscalar architectures: conclusive notes Reachable performance

167 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/167 Pipelining and communications
Suppose that N+1 processes each need to communicate a private value to all the others.
They use all the values to produce their next output (e.g., for voting).
Communication is fully synchronous and needs to be repeated m times, with m large...

168 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/168 Pipelining and communications
Let us assume that no bus is available: communication is point-to-point.
Processes are numbered p0 … pN. Two instructions are available:
 Send(pj, value)
 Receive(pj, &value)
Both are blocking functions: if the receiver is ready to receive, they last one stage time; otherwise they block the caller for a multiple of the stage time.
Sending and receiving occur at discrete time steps.

169 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/169 Pipelining and communications
At each time t, processor pi may be:
 sending data (next stage pi is unblocked)
 receiving data (next stage pi is unblocked)
 blocked in a Receive()
 blocked in a Send()
Slot = the time corresponding to one entire stage. At each time t we have n slots (a slot per process).
If pi is blocked, its slot is wasted (it is a "bubble"); otherwise the slot is used.

170 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/170 Pipelining and communications
At each time t, processor pi may be in:
 state S(j): sending data to processor pj
 state R(j): receiving data from pj
 state WR(j): blocked in a Receive(pj, …)
 state WS(j): blocked in a Send(pj, …)
We use the formalism  proc s_t proc'  to indicate that, at time t, proc is in state s with proc'.
For instance, p1 WR(3)_21 p3 means that the 21st slot of p1 is wasted waiting for p3 to send its value to it.

171 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/171 Pipelining and communications
The following algorithm is executed by process j (a reconstruction is sketched below):
 Before gaining the right to broadcast, process j goes through j couples of states (WR, R), one per lower-numbered broadcaster
 Ordered broadcast: the k-th message it sends goes to process p_k
 Finally, process j goes through N-j couples of states (WR, R), one per higher-numbered broadcaster
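The original listing did not survive in this transcript; the following is a reconstruction from the description above, written against the blocking Send/Receive primitives of slide 168 and the permutation vector p[] of slide 172 (both assumed here as mere prototypes):

    /* assumed primitives (slide 168), not implemented here */
    void Send(int proc, int value);
    void Receive(int proc, int *value);

    /* one round of N+1 broadcasts, as executed by process j */
    void broadcast_round(int j, int N, int my_value, int value[], const int p[])
    {
        int k;

        for (k = 0; k < j; k++)         /* j couples of states (WR, R):   */
            Receive(k, &value[k]);      /* serve broadcasters 0 .. j-1    */

        for (k = 0; k < N; k++)         /* my turn: the k-th message      */
            Send(p[k], my_value);       /* goes to process p[k]           */

        for (k = j + 1; k <= N; k++)    /* N-j couples of states (WR, R): */
            Receive(k, &value[k]);      /* serve broadcasters j+1 .. N    */
    }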

172 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/172 Pipelining and communications
p is a vector of indices. For process j, p can be any arrangement of the integers 0, 1, …, j-1, j+1, …, N.
Whatever the arrangement, the algorithm works correctly.
For instance, if N = 4 (5 processes) and j = 1, then p can be any permutation of 0, 2, 3 and 4.
p determines the order in which process j sends its value to its neighbours.
Example: p[] = [3, 2, 0, 4]. Then p1 executes: send(p3), send(p2), send(p0), send(p4).

173 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/173 Pipelining and communications
Example: p[] = the ordered permutation, i.e. for process j, p = [0, …, j-1, j+1, …, N] (here N = 5).
[Figure: slot timeline showing the frequencies of used slots, the slots wasted in send, the slots wasted in receive, and the total duration]

174 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/174 Pipelining and communications
Case N = 20, p[] = ordered permutation.
[Figure: slot diagram; gray = wasted slots, black = used slots. Shown in the figure: the duration in general, the ratio used slots / total # of slots, and the average # of used slots during one stage time]
This image reminds us of another one:

175 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/175 No pipelining: Many slots are wasted! 30 B C D A Time 30 6 PM 7 8 9 10 11 12 1 2 AM Pipelining and communications

176 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/176 Pipelining and communications Let us now consider the case in which processor k uses p[] = [k+1, k+2, …, N, 0, 1, …, k-1]

177 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/177 Pipelining and communications

178 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/178 Pipelining and communications Duration: first case vs. second case

179 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/179 Pipelining and communications Efficiency: first case vs. second case

180 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/180 Pipelining and communications
Algorithm of pipelined broadcast.
[Figure: slot diagram showing the beginning of the steady state]
In steady state, every 10 slots, 5 mark the completion of a broadcast: a full broadcast is finished every 2t (t = 1 slot), i.e. the throughput is one broadcast per 2t.
The image may remind us of another one…

181 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/181 Between 7.30 and 9.30pm, a whole job is completed every 30’ 6 PM B C D A 30 Pipelining (slide P2.2/20)       … … … … During that period, each worker is permanently at work… …but a new input must arrive within 30’ 122 AM 78 9 10 11 1

182 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/182 Process
Definition: a process is a quantum of work (e.g. a job) dealt with as an entity. Resources are allocated and scheduled on a process basis.
Comparison:
 a car needs to be repaired and is parked in the repair shop's car park  the car corresponds to the program, and the car park to the disk
 the owner goes to the dispatcher at the counter and asks to repair the car  the user types in a command interpreted by the OS command interpreter
 the dispatcher creates a new job and fills out a worksheet  the OS creates a new process and fills out the process descriptor (called Process Control Block, PCB)

183 © V. De Florio KULeuven 2002 Basic Concepts Computer Design Computer Architectures for AI Computer Architectures In Practice 2.3/183 VLIW
Properties:
 Every EU requires an instruction every clock cycle
 Wide instruction memory, up to the main memory! Instruction-memory width does not decrease with the level of the hierarchy, as it does for data memory
 High power consumption in program-memory accesses (ACCA research)
 Applications should be optimized to fit into the program-memory hierarchy (ACCA research)
 Waste of memory when not enough parallel instructions are available to keep all EUs concurrently busy: many instruction slots will contain a NOP
 Solution: dynamically varying instruction size, i.e. only non-NOP slots are stored; an 8-slot instruction word can then contain, for example, 2 compressed instruction words, one with 2 non-NOP slots and another with 6 non-NOP slots (a decoding sketch follows)
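One possible encoding of this NOP-compression idea, purely an assumption for illustration (this is not any real VLIW ISA): each compressed instruction word starts with a mask saying which of the 8 slots are present, and absent slots expand to NOPs:

    #include <stdint.h>

    #define SLOTS 8
    #define NOP   0u

    /* expand one compressed word; returns a pointer past its last field */
    static const uint32_t *decode_bundle(const uint32_t *code,
                                         uint32_t slot[SLOTS])
    {
        uint32_t mask = *code++;               /* low 8 bits: present slots */
        for (int s = 0; s < SLOTS; s++)
            slot[s] = (mask & (1u << s)) ? *code++ : NOP;
        return code;
    }

With this encoding a bundle holding only 2 non-NOP operations occupies 3 words instead of 8.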

