
Slide 1: CE 454 Computer Architecture – Lecture 8
The Microarchitecture Level (Ch. 4.4, 4.5)
Ahmed Ezzat

Slide 2: Outline
– Design of the Microarchitecture Level
  – Speed vs. Cost
  – Reducing Execution Path Length
  – Instruction Prefetching Design: The Mic-2
  – Pipelined Design: The Mic-3
  – Seven-Stage Pipeline Design: The Mic-4
– Improving Performance
  – Cache Memory
  – Branch Prediction
  – Out-of-Order Execution and Register Renaming
  – Speculative Execution
– Reading Assignment: Examples of the Microarchitecture Level

Slide 3: Design of the Microarchitecture Level: Speed vs. Cost
– Simple machines are slow; fast machines are complex
– Speed improvements due to organization, as opposed to faster technology, cannot be ignored
– Ways to make machines faster:
  – Reduce the number of clock cycles needed to execute an instruction (known as the execution path length)
  – Make the clock cycle shorter
  – Overlap the execution of instructions, e.g., instruction pipelining
– Ways to measure cost:
  – Count the number of components, transistors, etc.
  – The area (real estate) required on the IC is more important

Slide 4: Design of the Microarchitecture Level: Reducing Execution Path Length
– Mic-1 is a simple CPU built from minimal hardware (fewer than 5,000 transistors) plus a control store (ROM) and main memory (RAM)
– IJVM was implemented in microcode with little hardware; now let us look for a faster alternative
– The POP instruction costs 4 clock cycles: 3 microinstructions plus 1 for the main loop
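
For concreteness, here is a minimal C-flavored sketch of the Mic-1 microprogram for POP, one statement per clock cycle; the register declarations and the rd/fetch stubs are hypothetical scaffolding standing in for the real memory-control signals.

    #include <stdint.h>

    static uint32_t SP, MAR, MDR, TOS, PC;  /* subset of the Mic-1 registers */
    static void rd(void)    {}              /* stub: assert a memory read     */
    static void fetch(void) {}              /* stub: start next opcode fetch  */

    /* One statement per microinstruction; each costs one clock cycle. */
    static void pop_mic1(void) {
        MAR = SP = SP - 1; rd();   /* pop1: address the word below the old top  */
                                   /* pop2: dead cycle, waiting for memory      */
        TOS = MDR;                 /* pop3: the new top-of-stack word arrives   */
        PC = PC + 1; fetch();      /* Main1: fetch and dispatch the next opcode */
    }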

Slide 5: Reducing Execution Path Length: Merging the Interpreter Loop with the Microcode
– Merge the interpreter loop into the microcode:
  – The main-loop microinstruction can be overlapped with the end of the previous instruction
  – The ALU is idle during pop2, so Main1's work can be done there; as a result, POP's cost is reduced to 3 clock cycles
  – Dead cycles in which the ALU is unused are not common, so merging Main1 into the end of each microinstruction sequence is worth doing
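
A sketch of the merged version, reusing the hypothetical declarations from the previous sketch: Main1's work fills the dead cycle, and the dispatch happens in the final microinstruction.

    /* Merged POP: 3 cycles instead of 4 (declarations as in the sketch above). */
    static void pop_merged(void) {
        MAR = SP = SP - 1; rd();   /* pop1: as before                           */
        PC = PC + 1; fetch();      /* pop2: Main1's work fills the dead cycle   */
        TOS = MDR;                 /* pop3: also dispatches to the next opcode  */
    }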

Slide 6: Reducing Execution Path Length: A Three-Bus Architecture
– Using the Mic-1 architecture, let us revisit the ILOAD instruction (push a local variable onto the stack)
– With two full input buses, A and B (in addition to the output bus C), any two registers can be added in one cycle

Slide 7: Instruction Prefetching Design – The Mic-2: The Instruction Fetch Unit (IFU)
– The execution loop:
  – The PC is passed through the ALU and incremented
  – The PC is used to fetch the next byte of the instruction
  – Operands are read from memory
  – Operands are written to memory
  – The ALU computes and the result is stored
– The ALU intervenes in instruction fetching (one byte is fetched at a time and then assembled); this ties up the ALU at 1 cycle per byte
– Instead, have a separate Instruction Fetch Unit (IFU) to:
  – Increment the PC
  – Fetch bytes ahead of time
  – Assemble 8- and 16-bit operands

Slide 8: Instruction Prefetching Design – The Mic-2: The Instruction Fetch Unit (IFU)
– Two ways the IFU could work:
  – Interpret each opcode, fetch any additional fields, and assemble them in a register for use by the main execution unit (ALU)
  – Always keep the next 8- and 16-bit pieces of the instruction stream available, whether or not they are needed (the design shown on the next slide)
– Use two MBRs (MBR1 holds the oldest byte of the shift register; MBR2 holds the two oldest bytes):
  – The IFU automatically senses when MBR1 is read, shifts the register by 1 byte, and reloads MBR1 with the next byte
  – When MBR2 is read, the shift register shifts by 2 bytes and MBR2 is reloaded
  – The IFU has its own IMAR for addressing memory when a new word is needed
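
A small C sketch of the second approach may help; the shift-register size, byte counts, and refill policy here are illustrative assumptions, not the exact Mic-2 hardware.

    #include <stdint.h>
    #include <string.h>

    static uint8_t shiftreg[6];  /* prefetched instruction bytes, oldest first */
    static int     nbytes;      /* number of valid bytes currently held        */

    static void consume(int n) {            /* drop n consumed bytes; the IFU  */
        memmove(shiftreg, shiftreg + n,     /* refills via its own IMAR        */
                (size_t)(nbytes - n));
        nbytes -= n;
    }

    static uint8_t read_mbr1(void) {        /* oldest byte */
        uint8_t b = shiftreg[0];
        consume(1);
        return b;
    }

    static uint16_t read_mbr2(void) {       /* two oldest bytes, high byte first */
        uint16_t w = (uint16_t)((shiftreg[0] << 8) | shiftreg[1]);
        consume(2);
        return w;
    }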

Slide 9: Instruction Prefetching Design – The Mic-2: The Instruction Fetch Unit (figure)

Slide 10: Instruction Prefetching Design – The Mic-2: The Whole Design (figure)

Slide 11: Instruction Prefetching Design – The Mic-2: Summary
– Mic-2 is an enhanced version of Mic-1
– It eliminates the main loop entirely
– It avoids tying up the ALU to increment the PC
– It reduces the path length whenever a 16-bit index or offset is calculated: there is no need to assemble it in H
– Mic-2 improves some instructions more than others. For example, it reduces:
  – LDC_W from 9 to 3 microinstructions
  – ILOAD from 6 to 3 microinstructions
  – SWAP from 8 to 6 microinstructions
  – IADD from 4 to 3 microinstructions

Slide 12: Pipelined Design – The Mic-3
– Mic-2 is faster than Mic-1, with only a small increase in real estate introduced by the IFU
– Reducing the cycle time is tied to the technology used; how about exploiting parallelism? Mic-2 is highly sequential except for the IFU
– Major components of the data path cycle:
  – Drive the selected registers onto A and B
  – Let the ALU and shifter do their work
  – Drive the results back to the registers and store them
– We can introduce latches to partition the buses so the parts operate independently. Why?
  – The clock can be sped up, because the maximum delay is less
  – All parts can be used during every subcycle

Slide 13: Pipelined Design – The Mic-3
– A three-bus architecture with three latches; a latch is inserted in the middle of each bus
– In effect, the latches partition the data path into three distinct parts that can operate independently (the Mic-3)
– Each subcycle is about 1/3 of the original cycle length, so the clock speed can roughly triple
– Previously the ALU was idle during subcycles 1 and 3; now the ALU can be used on every subcycle, for better throughput

Slide 14: Pipelined Design – The Mic-3
– In Mic-3, three microsteps are needed to use the data path:
  – Load A and B
  – Perform the operation and load C
  – Write the result back
– SWAP in Mic-2 (figure)

Slide 15: Pipelined Design – The Mic-3: SWAP in Mic-3 (figure)
– A Mic-3 instruction takes more cycles than its Mic-2 counterpart, but each Mic-3 cycle is 1/3 of a Mic-2 cycle
– For SWAP, Mic-3 costs 11 microsteps, while Mic-2 costs 6 cycles, i.e., 6 × 3 = 18 microsteps at the Mic-3 clock: a speedup of 18/11 ≈ 1.6

Slide 16: Pipelined Design – The Mic-3: Dependencies
– We would like to start SWAP3 in cycle 3, but MDR is not available until cycle 5. This is called a true dependence, or RAW (Read After Write) dependence; SWAP3 has to wait (stall) until SWAP1 completes
– Pipelining is a key technique in all modern CPUs. An analogy is a car assembly line: it delivers one car per hour regardless of how long it actually takes to assemble a single car
– Reading assignment: A Seven-Stage Pipeline (the Mic-4)

Slide 17: Improving Performance
– Ways to improve performance (primarily of the CPU and memory):
  – Implementation improvements without architectural changes
    – Old programs run without changes, a major selling point
    – The improvements from the 80386 through the Pentiums are like this
  – Architectural changes
    – New or additional instructions and/or registers
    – New architectures such as RISC, IA-64, etc.
– Major techniques:
  – Cache memory
  – Branch prediction
  – Out-of-order execution with register renaming
  – Speculative execution

Slide 18: Improving Performance: Cache Memory
– Memory latency and bandwidth are at odds (e.g., with pipelining), hence caches
– Split cache: separate caches for instructions and data
  – Two separate memory ports
  – Doubles the bandwidth, with independent access
– Level 2 and higher caches sit between the split instruction/data caches and main memory

Slide 19: Improving Performance: Cache Memory
– Caches are generally inclusive: the L3 cache includes the L2 cache's content, and the L2 cache includes the L1 cache's content
– Caching depends on locality of reference:
  – Spatial locality: memory locations with addresses numerically close to recently accessed locations are likely to be accessed in the near future
  – Temporal locality: recently accessed memory locations are likely to be accessed again
– Cache model:
  – Main memory is divided into fixed-size blocks called cache lines, typically 4 to 64 consecutive bytes
  – When memory is referenced, the cache controller checks whether the referenced word is in the cache; if not, a cache line is evicted and the needed line is fetched from main memory
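
To make the two kinds of locality concrete, here is a small illustrative C example (the array size is an arbitrary assumption): the row-major loop walks consecutive bytes of each cache line, while the column-major loop strides across lines.

    #include <stdio.h>

    #define N 1024
    static int a[N][N];

    int main(void) {
        long sum = 0;
        for (int i = 0; i < N; i++)      /* row-major: spatial locality,   */
            for (int j = 0; j < N; j++)  /* consecutive addresses          */
                sum += a[i][j];
        for (int j = 0; j < N; j++)      /* column-major: 4 KB stride,     */
            for (int i = 0; i < N; i++)  /* touches a new line each access */
                sum += a[i][j];
        printf("%ld\n", sum);
        return 0;
    }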

Slide 20: Improving Performance: Direct-Mapped Caches
– A given memory word can be stored in exactly one place; if it is not there, it is not in the cache
– Entry format:
  – VALID bit: set if the cache line holds valid data
  – TAG (16 bits): a unique value identifying the corresponding line of memory
  – DATA (32 bytes): a copy of the data from memory

Slide 21: Improving Performance: Direct-Mapped Caches
– Address translation fields:
  – TAG: corresponds to the TAG field stored in the cache entry
  – LINE: indicates which cache entry holds the data, if it is present
  – WORD: which word within the line
  – BYTE: which byte within the word (normally not used)
– When the CPU issues an address, the hardware extracts the LINE bits and indexes into the cache to find one of the 2,048 entries. If that entry is valid, the TAG fields are compared; if they match, it is a cache hit
– Otherwise it is a cache miss: the whole cache line is fetched from memory and stored in the cache, and the existing line is written back to memory if necessary
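
A sketch of the lookup in C, assuming the geometry implied by the slides: 2,048 entries of 32 bytes each, 16-bit tags, and 32-bit addresses split as TAG(16) | LINE(11) | WORD(3) | BYTE(2).

    #include <stdbool.h>
    #include <stdint.h>

    struct line {
        bool     valid;
        uint16_t tag;
        uint32_t data[8];          /* 32 bytes = 8 words */
    };
    static struct line cache[2048];

    static bool lookup(uint32_t addr, uint32_t *word_out) {
        uint16_t tag  = (uint16_t)(addr >> 16);  /* high 16 bits     */
        uint32_t idx  = (addr >> 5) & 0x7FF;     /* next 11 bits     */
        uint32_t word = (addr >> 2) & 0x7;       /* word within line */
        struct line *e = &cache[idx];
        if (e->valid && e->tag == tag) {         /* hit              */
            *word_out = e->data[word];
            return true;
        }
        return false;   /* miss: fetch the line from memory and store it,
                           writing the existing line back if necessary   */
    }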

Slide 22: Improving Performance: Direct-Mapped Caches
– Consecutive memory lines go in consecutive cache entries
– If the access pattern alternates between two addresses that map to the same entry (e.g., address X and address X + cache size), the line is overwritten on each access; if this pattern is frequent, the result is poor performance due to constant misses
– Direct-mapped caches are nonetheless very common and typically effective, because collisions like the one described above are rare

Slide 23: Improving Performance: N-way Set-Associative Caches
– Allow n entries for each hashed address (address modulo cache size); these entries need to be ordered by LRU for replacement
– Each of the n entries must be checked to see whether the needed line is present
– 2-way and 4-way set-associative caches have performed well
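
A minimal sketch of one set of a 4-way cache with LRU age counters (the field layout is assumed for illustration): on a hit, every way that was more recently used than the hit way ages by one, and the hit way becomes most recent.

    #include <stdbool.h>
    #include <stdint.h>

    #define WAYS 4
    struct way { bool valid; uint16_t tag; uint8_t age; };  /* age 0 = MRU */

    static bool set_lookup(struct way set[WAYS], uint16_t tag) {
        for (int w = 0; w < WAYS; w++) {
            if (set[w].valid && set[w].tag == tag) {   /* hit in way w     */
                for (int v = 0; v < WAYS; v++)         /* age the ways     */
                    if (set[v].age < set[w].age)       /* used since w     */
                        set[v].age++;
                set[w].age = 0;                        /* w is now MRU     */
                return true;
            }
        }
        return false;  /* miss: the victim is the way with the largest age */
    }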

Slide 24: Improving Performance: Issues in Cache Design
– Cache replacement policy: LRU
– Writing the cache back:
  – Write-through
  – Write-deferred, or write-back
– Writing to an address that is not in the cache:
  – Write-allocate: bring the line into the cache; typically used with write-back caches
  – Write to memory directly: typically used with write-through caches
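
The two write policies on a cache hit, sketched in C; mem_write is a hypothetical stub standing in for a memory-bus transaction.

    #include <stdbool.h>
    #include <stdint.h>

    struct cline { bool dirty; uint32_t data[8]; };

    static void mem_write(uint32_t addr, uint32_t v) { (void)addr; (void)v; }

    /* Write-through: memory is updated on every store. */
    static void store_through(struct cline *l, int word, uint32_t addr, uint32_t v) {
        l->data[word] = v;
        mem_write(addr, v);
    }

    /* Write-back: only the cache is updated; memory sees the line at eviction. */
    static void store_back(struct cline *l, int word, uint32_t v) {
        l->data[word] = v;
        l->dirty = true;
    }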

Slide 25: Improving Performance: Branch Prediction
– Pipelining works best with linear code, but about 20% of instructions are branches or conditional branches, hence branch prediction is important
– Most pipelined machines execute the instruction following a branch even though logically they should not (the opcode is known only after the fetch of the next instruction has already started)
  – Try to find a useful instruction to execute after each branch
  – Compilers can stuff in NOP instructions instead, but that slows execution and makes the code longer
– Example predictions:
  – Backward branches will be taken, e.g., at the end of a loop
  – Some forward branches occur on error conditions, which are rare, so predicting forward branches as not taken is acceptable
– Two ways to run ahead of an unresolved branch:
  – Execute until an instruction would change state (i.e., write a register), then keep the update in a scratch register until the prediction is known to be correct
  – Record the overwritten value so the update can be rolled back if needed

Slide 26: Improving Performance: Dynamic Branch Prediction
– The CPU maintains a history table of previous branches in hardware and looks predictions up in it:
  (a) The table is organized just like a cache
  (b) A 1-bit history guesses wrong at the end of every loop; with a 2-bit branch history, the prediction is changed only after two consecutive wrong guesses
  (c) Entries can be 2- or 4-way associative, as with caches
– The predictor can be described as a finite-state machine
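
A sketch of the 2-bit scheme as a saturating counter per table entry; the table size and the indexing by PC are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define HIST 1024
    static uint8_t bht[HIST];   /* 0,1 = predict not taken; 2,3 = predict taken */

    static bool predict(uint32_t pc) { return bht[pc % HIST] >= 2; }

    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &bht[pc % HIST];
        if (taken  && *c < 3) (*c)++;   /* saturate at "strongly taken"     */
        if (!taken && *c > 0) (*c)--;   /* saturate at "strongly not taken" */
    }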

Slide 27: Improving Performance: Static Branch Prediction
– Dynamic branch prediction is carried out at run time and requires special, expensive hardware
– Alternatively, the compiler can pass hints (a new branch instruction format):
  – A bit in the instruction indicates which way the branch will most likely go
  – This also requires special hardware (enhanced instructions)
– Profiling:
  – The program is run through a profiler (simulator) to capture its branch behavior; the information is passed to the compiler, which encodes it in the special branch instructions
  – IA-64 supports profiling

Slide 28: Improving Performance: Out-of-Order Execution and Register Renaming
– Pipelined superscalar machines fetch and issue instructions before they are needed
– In-order instruction issue and retirement is simpler, but inefficient
– Some instructions depend on others, so instructions cannot be reordered blindly
– Example machine:
  – 8 registers; each instruction uses 2 registers for operands and 1 for the result
  – An instruction decoded in cycle N starts executing in cycle N+1
  – Addition and subtraction results are written back in cycle N+2; multiplication results in cycle N+3
– A scoreboard is a table that tracks, at run time, which registers are in use for reading and writing

Slide 29: Improving Performance: Example – In-Order Execution (figure)

Slide 30: Improving Performance: Example – In-Order Execution
– In-order issue and in-order retirement
– Instruction dependencies:
  – Read After Write (RAW): if any operand is being written, do not issue
  – Write After Read (WAR): if the result register is being read, do not issue
  – Write After Write (WAW): if the result register is being written, do not issue
– Instruction I4 has a RAW dependency, so it stalls:
  – The decode unit stalls until R4 is available
  – It stops pulling from the fetch unit
  – When its buffer is full, the fetch unit stalls fetching from memory
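
A minimal sketch of the scoreboard checks in C, assuming per-register counters of in-flight readers and writers in the spirit of the example machine.

    #include <stdbool.h>

    #define NREGS 8
    static int readers[NREGS];   /* in-flight instructions reading Rn */
    static int writers[NREGS];   /* in-flight instructions writing Rn */

    static bool can_issue(int src1, int src2, int dst) {
        if (writers[src1] || writers[src2]) return false;  /* RAW */
        if (readers[dst])                   return false;  /* WAR */
        if (writers[dst])                   return false;  /* WAW */
        return true;
    }

    static void issue(int src1, int src2, int dst) {
        readers[src1]++; readers[src2]++; writers[dst]++;
    }

    static void retire(int src1, int src2, int dst) {
        readers[src1]--; readers[src2]--; writers[dst]--;
    }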

Slide 31: Improving Performance: Out-of-Order Execution and Register Renaming
– Instructions are issued out of order and may retire out of order
– Instruction I5 is issued while instruction I4 is stalled
– Problem: I5 could use an operand that I4 has not yet computed
– New rule: do not issue an instruction that uses an operand stored by a previous, still-incomplete instruction
– Example: I7 uses R1, which is written by I6. That value is never used afterwards, because I8 overwrites R1, so I6 can use a different register to hold its value
– Register renaming: the decode unit changes R1 in I6 and I7 to a secret register, S1, so that I5 and I6 can be issued concurrently
– Renaming often eliminates WAW and WAR dependencies
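
A toy sketch of decode-time renaming: each architectural register maps to a current physical register, and every write gets a fresh one, which is what removes WAW and WAR conflicts. The physical-register pool and the wrap-around allocator here are deliberately naive, hypothetical choices.

    #define NARCH 8          /* architectural registers R0..R7            */
    #define NPHYS 16         /* physical registers, incl. the secret ones */

    static int map[NARCH];   /* architectural -> current physical register */
    static int next_free = NARCH;

    static void rename_init(void) {
        for (int r = 0; r < NARCH; r++) map[r] = r;  /* identity at start */
    }

    static int rename_src(int r) { return map[r]; }  /* reads use the map  */

    static int rename_dst(int r) {                   /* writes get a fresh */
        return map[r] = next_free++ % NPHYS;         /* register           */
    }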

Slide 32: Improving Performance: Out-of-Order Execution and Register Renaming
– The same eight instructions:
  – Executed in 18 cycles using in-order issue and retirement
  – Executed in 9 cycles using out-of-order issue and retirement

Slide 33: Improving Performance: Speculative Execution
– Code consists of basic blocks: linear sequences of code with no control structures (no if-then-else or while statements, no branches)
– Within each block, reordering works well, and the program can be represented as a directed graph of basic blocks
– Problem: basic blocks are short, so they offer insufficient parallelism
– Slow instructions can be moved up across block boundaries (hoisting), so that if they do turn out to be needed, their results are ready in time
– Speculative execution: executing code before it is known whether that code will be needed at all

Slide 34: Improving Performance: Speculative Execution – Example (figure)

Slide 35: Improving Performance: Speculative Execution – Problems
– In the example, suppose that all variables except even-sum and odd-sum are kept in registers
  – The LOADs of even-sum and odd-sum can then be moved to the top of the loop
  – Only one of {even-sum, odd-sum} is needed in any iteration, so the other LOAD is wasted
– Reordered instructions must have no irrevocable results; to that end, all destination registers in speculative code can be renamed
– Problem: speculative code can cause exceptions (e.g., a cache miss or a page fault)
  – Solution: use a SPECULATIVE-LOAD instead of a LOAD, so that a cache miss does not force a (possibly useless) fetch from memory
– Poison bit: if a speculative instruction such as a LOAD would cause a trap, a special version of the instruction is used that instead sets a poison bit on the result register. If that register is later touched by a regular instruction, the trap occurs then; if the result is never used, the poison bit is eventually cleared and no harm is done
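
A C rendering of the slide's loop with the two loads hoisted above the branch; the variable names follow the slide (underscored for C), while the array and its length are assumed for illustration.

    /* even_sum/odd_sum live in memory; everything else is register-resident. */
    static int even_sum, odd_sum;

    static void tally(const int *a, int n) {
        for (int i = 0; i < n; i++) {
            int e = even_sum;    /* hoisted LOAD: speculative, maybe wasted */
            int o = odd_sum;     /* hoisted LOAD: speculative, maybe wasted */
            if (a[i] % 2 == 0)
                even_sum = e + a[i];   /* only one branch uses its load     */
            else
                odd_sum = o + a[i];
        }
    }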
