1999 ©UCB. CS 161 Computer Architecture: Introduction to Advanced Architectures, Lecture 13. Instructor: L.N. Bhuyan (www.cs.ucr.edu/~bhuyan). Adapted from notes by Dave Patterson (http.cs.berkeley.edu/~patterson).


Slide 1: Title slide (as above).

Slide 2: Stages of Execution in Pipelined MIPS
Five-stage instruction pipeline:
1) I-fetch: fetch instruction, increment PC
2) Decode: decode instruction, read registers
3) Execute: mem-reference: calculate address; R-format: perform ALU operation
4) Memory: load: read data from data memory; store: write data to data memory
5) Write Back: write data to register
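The five stages above can be sketched in a few lines of Python. This is an illustrative model, not part of the slides: the `execute` function and the stage abbreviations are chosen here to mirror the slide's list.

```python
# Minimal sketch (assumption, not from the slides): step one MIPS
# instruction through the five pipeline stages named on this slide.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def execute(instr):
    """Return the (instruction, stage) pairs the instruction passes
    through: every instruction occupies all five stages, one clock
    cycle per stage."""
    return [(instr, stage) for stage in STAGES]

print(execute("lw $t0, 0($s1)"))
```

Because every instruction takes the same five steps, a new instruction can enter the pipeline each cycle, which is what the next slide's overlapped diagram shows.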

Slide 3: Pipelined Execution Representation
° To simplify the pipeline, every instruction takes the same number of steps, called stages
° One clock cycle per stage
(Figure: five overlapping instructions, each passing through IFtch, Dcd, Exec, Mem, WB; program flow vs. time)

Slide 4: Review: Single-Cycle Datapath for MIPS
° Use the datapath figure to represent the pipeline
(Figure: PC, Instruction Memory (Imem), Registers, ALU, and Data Memory (Dmem), laid out as Stages 1-5: IFtch, Dcd, Exec, Mem, WB)

Slide 5: Graphical Pipeline Representation
(Figure: instructions Load, Add, Store, Sub, Or flowing through IM, Reg, ALU, DM, Reg over clock cycles; a right half highlighted means read, a left half means write)

Slide 6: Required Changes to Datapath
° Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
° The next PC value is computed in the 3rd stage, but we need to bring in the next instruction in the very next cycle, so move the PCSrc mux to the 1st stage.
° The branch address is computed in the 3rd stage. With the pipeline, the PC value has changed by then! We must carry the PC value along with the instruction. Width of the IF/ID register = (IR) + (PC) = 64 bits.
° For the lw instruction, we need the write-register address at stage 5, but the IR is by then occupied by another instruction! So we must carry the IR destination field along as we move through the stages (see the connection in the figure). Length of the ID/EX register = (Reg1) + (Reg2) + (offset) + (PC) + (destn) = 133 bits.
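The register widths quoted above can be checked by summing the fields each pipeline register must latch. The field names below are illustrative labels for the quantities listed on the slide, not signal names from the actual datapath figure:

```python
# Field widths (in bits) carried by the first two pipeline registers,
# per the slide. Names here are illustrative, not from the figure.
IF_ID = {"IR": 32, "PC": 32}                     # instruction + its PC
ID_EX = {"Reg1": 32, "Reg2": 32,                 # two register reads
         "offset": 32,                           # sign-extended immediate
         "PC": 32,                               # carried for branch target
         "destn": 5}                             # write-register number

print(sum(IF_ID.values()), sum(ID_EX.values()))  # 64 133
```

The 5-bit destination field is the part that must ride all the way to stage 5 so that lw writes the correct register.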

Slide 7: Pipelined Datapath with Pipeline Registers (6.2)
(Figure: the single-cycle datapath (Imem, Regs, ALU, Dmem, sign extend, shift-left-2, muxes, adders) cut into Fetch, Decode, Execute, Memory, and Write Back by the IF/ID, ID/EX, EX/MEM, and MEM/WB registers, of width 64, 133, 102, and 69 bits respectively)

Slide 8: Pipelined Control (6.3)
° Start with the single-cycle controller
° Group the control lines by the pipeline stage that needs them: EX (RegDst, ALUOp, ALUSrc), MEM (Branch, MemRead, MemWrite), WB (MemToReg, RegWrite)
° Extend the pipeline registers (IF/ID, ID/EX, EX/MEM, MEM/WB) with control bits
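The idea that control bits ride along in the pipeline registers can be sketched as follows. This is a hedged illustration: the control word is generated in ID, and each pipeline register carries only the groups still needed downstream (the specific bit values here are made up for one hypothetical R-format instruction):

```python
# Sketch (assumption): control bits for one instruction, grouped by the
# stage that consumes them, exactly as the slide groups them.
control = {
    "EX":  {"RegDst": 1, "ALUOp": 0b10, "ALUSrc": 0},
    "MEM": {"Branch": 0, "MemRead": 0, "MemWrite": 0},
    "WB":  {"MemToReg": 0, "RegWrite": 1},
}

# ID/EX carries all three groups; EX/MEM drops the EX group after the
# ALU uses it; MEM/WB keeps only the WB group for the final stage.
id_ex  = {**control["EX"], **control["MEM"], **control["WB"]}
ex_mem = {**control["MEM"], **control["WB"]}
mem_wb = dict(control["WB"])
print(sorted(mem_wb))
```

Each stage thus reads its control bits locally instead of from a central controller, which is what makes overlapping five instructions possible.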

Slide 9: Problems for Pipelining
° Hazards prevent the next instruction from executing during its designated clock cycle, limiting speedup
° Structural hazards: HW cannot support this combination of instructions (a single person to fold and put clothes away)
° Control hazards: conditional branches and other instructions may stall the pipeline, delaying later instructions (must check the detergent level before washing the next load)
° Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (matching socks in a later load)
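A data hazard of the kind described in the last bullet can be detected with a simple set test. This is an illustrative sketch, not hardware from the slides: `raw_hazard` and its arguments are names invented here:

```python
# Illustrative check (assumption): a read-after-write (RAW) data hazard
# exists when an instruction reads a register that an earlier,
# still-in-flight instruction has not yet written back.
def raw_hazard(in_flight_dests, src_regs):
    """True if any source register is still pending a write."""
    return any(r in in_flight_dests for r in src_regs)

# add $t2, $t0, $t1 follows lw $t0, 0($s1): $t0 is still in the pipeline.
print(raw_hazard({"$t0"}, ["$t0", "$t1"]))   # True
```

Real pipelines resolve most such hazards by forwarding rather than stalling, but the detection logic starts from exactly this comparison.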

Slide 10: MIPS R4000 Pipeline (figure)

Slide 11: Advanced Architectural Concepts
° Can we achieve CPI < 1? State-of-the-art microprocessors do:
° "Superscalar" execution, or Instruction-Level Parallelism (ILP): deeper pipeline => dynamic branch prediction => speculation => recovery
° "Out-of-order" execution => instruction window and prefetch => reorder buffers
° "VLIW", e.g. the Intel/HP Itanium

Slide 12: Instruction-Level Parallelism (ILP): IPC > 1
(Figure: two instructions enter the pipeline each cycle, the IFtch, Dcd, Exec, Mem, WB stages overlapped with ILP = 2; program flow vs. time)
Ex: Pentium, SPARC, MIPS R10000, IBM PowerPC

Slide 13: HW Schemes: Instruction Parallelism
° Key idea: allow instructions behind a stall to proceed
DIVD F0, F2, F4
ADDD F10, F0, F8
SUBD F12, F8, F14
° ADDD must wait for DIVD to produce F0, but SUBD depends on neither and may proceed past it
° Enables out-of-order execution => out-of-order completion
° The ID stage checks for hazards; if there are none, it issues the instruction for execution. Scoreboarding dates to the CDC 6600 in 1963.
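The ID-stage check described above can be sketched as a tiny, simplified scoreboard. This is an assumption-laden model (the `try_issue` function and tuple encoding are invented here); a real scoreboard also tracks functional units and WAR/WAW hazards:

```python
# Simplified scoreboard sketch (assumption): issue the first waiting
# instruction whose source registers are not the destination of an
# instruction still executing. Letting later, independent instructions
# pass a stalled one is what gives out-of-order execution.
def try_issue(window, busy_dests):
    """window: (opcode, dest, src1, src2) tuples; return first issuable."""
    for instr in window:
        srcs = instr[2:]
        if not any(s in busy_dests for s in srcs):
            return instr
    return None

window = [("ADDD", "F10", "F0", "F8"),    # waits: F0 comes from DIVD
          ("SUBD", "F12", "F8", "F14")]   # independent, may go first
print(try_issue(window, busy_dests={"F0"}))
```

With DIVD in flight (F0 busy), SUBD issues ahead of ADDD, exactly the reordering the slide's three-instruction example illustrates.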

Slide 14: How ILP Works
° Issuing multiple instructions per cycle requires fetching multiple instructions from memory per cycle; the number issued per cycle is called the superscalar degree, or issue width
° To find independent instructions, we must have a big pool of instructions to choose from, called the instruction buffer (IB). As the IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time
° Prefetch instructions sequentially with an instruction fetch unit (IFU) that operates independently of the datapath control: fetch instruction (PC)+L, where L is the IB size, or as directed by the branch predictor.
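The prefetch rule "(PC)+L" can be made concrete with a short sketch. The constants and function name here are illustrative assumptions, not from the slides:

```python
# Sketch (assumption): an independent instruction fetch unit keeps an
# instruction buffer (IB) of L instructions filled ahead of the decoder.
L = 8        # IB size in instructions (illustrative choice)
WORD = 4     # bytes per MIPS instruction

def next_fetch_addr(pc):
    """Address the IFU prefetches while the IB holds the instructions
    at PC .. PC + (L-1)*WORD; i.e. it runs L instructions ahead."""
    return pc + L * WORD

print(hex(next_fetch_addr(0x1000)))
```

A branch predictor overrides this sequential address when it predicts a taken branch, which is the "or as directed by the branch predictor" clause above.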

Slide 15: Microarchitecture of an ILP-Based CPU (PowerPC) (figure)

Slide 16: (figure only)

Slide 17: Very Long Instruction Word (VLIW): IPC > 1
(Figure: one wide instruction issues multiple Exec operations per cycle alongside IFtch, Dcd, Mem, WB; program flow vs. time)
Ex: Itanium

Slide 18: TriMedia TM32 Architecture
(Figure: VLIW datapath: 32 KB instruction cache holding compressed code, VLIW instruction decode and launch, multi-port 128-word x 32-bit register file, functional units (FUs) with a bypass network, PC, 16 KB data cache, 64-bit memory bus, 32-bit peripheral bus)

Slide 19: What Is Multiprocessing?
° Parallelism at the instruction level is limited by data dependences => speedup is limited!!
° Program-level parallelism is abundantly available, e.g. DO I = 1, 1000: loop-level parallelism. How about employing multiple processors to execute the loop iterations => parallel processing, or multiprocessing
° With a billion transistors on a chip, we can put a few CPUs in one chip => chip multiprocessor
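Exploiting loop-level parallelism means partitioning the iteration space across processors. The sketch below only shows the partitioning step (the function name and chunking policy are assumptions made here, not anything from the slides); actually running the chunks on separate CPUs is a separate scheduling problem:

```python
# Sketch (assumption): split the iterations of a DO I = 1, N loop into
# near-equal contiguous chunks, one per CPU of a multiprocessor.
def split_iterations(n_iters, n_cpus):
    """Partition 1..n_iters into n_cpus contiguous ranges."""
    base, extra = divmod(n_iters, n_cpus)
    chunks, start = [], 1
    for cpu in range(n_cpus):
        size = base + (1 if cpu < extra else 0)  # spread the remainder
        chunks.append(range(start, start + size))
        start += size
    return chunks

chunks = split_iterations(1000, 4)
print([len(c) for c in chunks])   # [250, 250, 250, 250]
```

This works only when the iterations are independent; a loop-carried dependence reintroduces the same limit that data dependences impose on ILP.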

Slide 20: Hardware Multithreading
° We need a hardware multithreading technique because switching between threads in software is very time-consuming (why?): it can hide I/O latency (e.g. multitasking) but is too slow to hide a main-memory access.
° Provide multiple PCs and register sets on the CPU so that thread switching can occur without having to store the register contents in main memory (on the stack, as is done for a software context switch).
° Several threads reside in the CPU simultaneously, and execution switches between the threads on a main-memory access.
° How about both multiprocessing and multithreading on a chip? => network processor
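The per-thread PC and register set can be pictured as resident contexts the hardware merely selects between. The class and function below are an illustrative model invented here, with hypothetical start addresses; they only show why the switch costs no memory traffic:

```python
# Sketch (assumption): each hardware thread keeps its own PC and
# private register set on the CPU, so a thread switch is just picking
# another resident context; nothing is written to main memory.
class HWThread:
    def __init__(self, pc):
        self.pc = pc
        self.regs = [0] * 32   # private register set, stays on chip

contexts = [HWThread(0x400), HWThread(0x800)]   # hypothetical PCs
current = 0

def switch_on_memory_miss():
    """Select the other resident thread: O(1), no stores to memory."""
    global current
    current = (current + 1) % len(contexts)
    return contexts[current].pc
```

Contrast a software context switch, which must spill and reload all 32 registers (plus the PC) through the memory hierarchy before the other thread can run.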

Slide 21: Hardware Multithreading (cont.)
° How can we guarantee no dependences between instructions in a pipeline? One way is to interleave the execution of instructions from different program threads on the same pipeline
° Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
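The fixed round-robin interleave above can be sketched directly; the `interleave` function is an illustrative model written for this note, using the slide's five instructions:

```python
# Sketch of the slide's fine-grained interleave (assumption): threads
# issue in fixed round-robin order, so consecutive pipeline slots never
# hold two instructions from the same thread; by the time T1 issues
# again, its previous instruction has cleared the non-bypassed pipe.
threads = {
    "T1": ["LW r1, 0(r2)", "LW r5, 12(r1)"],
    "T2": ["ADD r7, r1, r4"],
    "T3": ["XORI r5, r4, #12"],
    "T4": ["SW 0(r7), r5"],
}

def interleave(threads):
    """Issue one instruction per thread per round, round-robin."""
    queues = {t: list(v) for t, v in threads.items()}
    issued = []
    while any(queues.values()):
        for t in threads:
            if queues[t]:
                issued.append((t, queues[t].pop(0)))
    return issued

print([t for t, _ in interleave(threads)])
```

Note that T2's ADD reads r1, which T1's LW writes; the interleave is safe only because three other threads' instructions separate them in the pipe.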

Slide 22: Architectural Comparisons (cont.)
(Figure: issue slots over processor cycles for superscalar, fine-grained multithreading, coarse-grained multithreading, multiprocessing, and simultaneous multithreading; Threads 1-5 and idle slots shown)

Slide 23: Intel IXP2400 Network Processor
° XScale core replaces StrongARM
° 1.4 GHz target in 0.13-micron
° Nearest-neighbor routes added between microengines
° Hardware to accelerate CRC operations and random-number generation
° 16-entry CAM

Slide 24: IBM Cell Processor (figure). SPU: Synergistic Processor Unit

Slide 25: Chip Multiprocessors (figure)

