1999 ©UCB CS 161 Computer Architecture: Introduction to Advanced Architectures, Lecture 13. Instructor: L.N. Bhuyan, www.cs.ucr.edu/~bhuyan. Adapted from notes.

Presentation transcript:

CS 161 Computer Architecture: Introduction to Advanced Architectures, Lecture 13. Instructor: L.N. Bhuyan. Adapted from notes by Dave Patterson (http.cs.berkeley.edu/~patterson)

Stages of Execution in Pipelined MIPS
5-stage instruction pipeline:
1) I-fetch: Fetch instruction, increment PC
2) Decode: Decode instruction, read registers
3) Execute: Mem-reference: calculate address. R-format: perform ALU operation
4) Memory: Load: read data from data memory. Store: write data to data memory
5) Write Back: Write data to register

Pipelined Execution Representation
° To simplify the pipeline, every instruction takes the same number of steps, called stages
° One clock cycle per stage
[Figure: five instructions in program order, each passing through IFtch, Dcd, Exec, Mem, WB and each starting one clock cycle after the previous one. Vertical axis: program flow; horizontal axis: time.]
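
The staggered diagram described above can be regenerated with a short script. This is a minimal sketch, not part of the original slides: the function names `pipeline_diagram` and `total_cycles` are made up for illustration, and it assumes an ideal pipeline with no stalls, so instruction i enters IFtch in clock cycle i.

```python
# Stage names as used on the slide.
STAGES = ["IFtch", "Dcd", "Exec", "Mem", "WB"]


def pipeline_diagram(num_instructions):
    """Return one text row per instruction; row i is shifted right by i cycles."""
    width = max(len(s) for s in STAGES) + 1  # column width for one clock cycle
    rows = []
    for i in range(num_instructions):
        cells = [""] * i + STAGES  # i empty cycles, then the five stages
        rows.append("".join(c.ljust(width) for c in cells).rstrip())
    return rows


def total_cycles(num_instructions):
    """Cycles to finish all instructions in an ideal (stall-free) pipeline."""
    return len(STAGES) + num_instructions - 1


if __name__ == "__main__":
    print("\n".join(pipeline_diagram(5)))
    print("cycles:", total_cycles(5))
```

Under these assumptions, five instructions finish in 5 + 5 - 1 = 9 cycles, compared with 25 stage-times if each instruction had to finish before the next one started.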

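Review: Single-cycle Datapath for MIPS
° Use the datapath figure to represent the pipeline
[Figure: single-cycle datapath with PC, Instruction Memory (Imem), Registers, ALU, and Data Memory (Dmem), divided into Stage 1 to Stage 5 (IFtch, Dcd, Exec, Mem, WB). In pipeline diagrams the five stages are drawn as IM, Reg, ALU, DM, Reg.]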

Graphical Pipeline Representation
[Figure: instructions in order (Load, Add, Store, Sub, Or) against time in clock cycles. Each instruction is drawn as IM, Reg, ALU, DM, Reg, starting one cycle after the previous one. (Right half highlighted means read, left half write.)]

Required Changes to Datapath
° Introduce registers to separate the 5 stages by putting IF/ID, ID/EX, EX/MEM, and MEM/WB registers in the datapath.
° The next PC value is computed in the 3rd stage, but we need to fetch the next instruction in the next cycle, so move the PCSrc mux to the 1st stage.
° The branch address is computed in the 3rd stage. With pipelining, the PC value has changed by then, so the PC value must be carried along with the instruction. Width of IF/ID register = (IR) + (PC) = 64 bits.
° For a lw instruction, we need the write register address at stage 5, but by then the IR is occupied by another instruction. So we must carry the IR destination field along as the instruction moves through the stages (see the connection in the figure). Width of ID/EX register = (Reg1) + (Reg2) + (offset) + (PC) + destination register = 133 bits.
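
The register widths quoted above can be checked by adding up field sizes. The sketch below assumes 32-bit instructions, PC values, register values, and sign-extended offsets, plus a 5-bit register number. The IF/ID and ID/EX breakdowns follow the slide. The EX/MEM and MEM/WB breakdowns are an assumption chosen to be consistent with the 102-bit and 69-bit labels in the datapath figure, and the field names are illustrative only.

```python
WORD = 32      # instruction, PC, register value, or sign-extended offset
REG_ADDR = 5   # register number (destination field carried down the pipe)

# Breakdowns stated on the slide.
IF_ID = {"IR": WORD, "PC": WORD}
ID_EX = {"Reg1": WORD, "Reg2": WORD, "offset": WORD, "PC": WORD,
         "destn": REG_ADDR}

# Assumed breakdowns (not spelled out in the notes); they match the
# 102-bit and 69-bit labels on the datapath figure.
EX_MEM = {"branch_target": WORD, "Zero": 1, "ALU_result": WORD,
          "Reg2": WORD, "destn": REG_ADDR}
MEM_WB = {"mem_read_data": WORD, "ALU_result": WORD, "destn": REG_ADDR}


def width(fields):
    """Total width in bits of a pipeline register given its fields."""
    return sum(fields.values())


if __name__ == "__main__":
    for name, reg in [("IF/ID", IF_ID), ("ID/EX", ID_EX),
                      ("EX/MEM", EX_MEM), ("MEM/WB", MEM_WB)]:
        print(name, width(reg), "bits")
```

These widths count data fields only; the control bits added to the pipeline registers for pipelined control are extra.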

Pipelined Datapath (with Pipeline Regs) (6.2)
[Figure: pipelined datapath divided into Fetch, Decode, Execute, Memory, and Write Back stages, showing PC, Imem, Regs, sign extend, shift left 2, ALU, adders, muxes, and Dmem. Pipeline register widths: IF/ID 64 bits, ID/EX 133 bits, EX/MEM 102 bits, MEM/WB 69 bits. The 5-bit write register number is carried through the stages.]

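Pipelined Control (6.3)
° Start with the single-cycle controller
° Group control lines by the pipeline stage in which they are needed
° Extend the pipeline registers with control bits
[Figure: the Control unit decodes the instruction from IF/ID and produces the EX, Mem, and WB groups of control lines. ID/EX carries EX, Mem, and WB; EX/MEM carries Mem and WB; MEM/WB carries WB. Control lines: RegDst, ALUop, ALUSrc, Branch, MemRead, MemWrite, MemToReg, RegWrite.]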
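
The grouping of control lines by stage can be written down as a small table. This is a sketch under one assumption: the eight control lines are split among the EX, Mem, and WB groups in the order the slide lists them (three for EX, three for Mem, two for WB), which is the usual textbook grouping. The names `CONTROL_GROUPS`, `CARRIED`, and `control_lines_in` are illustrative.

```python
# Control lines grouped by the stage that consumes them (assumed grouping).
CONTROL_GROUPS = {
    "EX": ["RegDst", "ALUop", "ALUSrc"],
    "Mem": ["Branch", "MemRead", "MemWrite"],
    "WB": ["MemToReg", "RegWrite"],
}

# Groups each pipeline register still has to carry forward.
CARRIED = {
    "ID/EX": ["EX", "Mem", "WB"],
    "EX/MEM": ["Mem", "WB"],
    "MEM/WB": ["WB"],
}


def control_lines_in(register):
    """Control lines stored in the given pipeline register."""
    return [line for group in CARRIED[register]
            for line in CONTROL_GROUPS[group]]


if __name__ == "__main__":
    for reg in CARRIED:
        print(reg, control_lines_in(reg))
```

Each group is dropped once its stage has used it, so the later pipeline registers carry fewer control lines.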

Problems for Pipelining
° Hazards prevent the next instruction from executing during its designated clock cycle, limiting speedup:
Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away)
Control hazards: conditional branches and other instructions may stall the pipeline, delaying later instructions (must check the detergent level before washing the next load)
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (matching socks in a later load)
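
As an illustration of the data hazard case, the sketch below flags read-after-write dependences between nearby instructions. It is not from the slides. It assumes a 5-stage pipeline without forwarding in which the register file is written in the first half of a cycle and read in the second half, so only the next two instructions after a producer are at risk (`window=2`). The instruction sequence and the `raw_hazards` helper are hypothetical.

```python
def raw_hazards(program, window=2):
    """Return (producer_index, consumer_index, register) for each RAW hazard.

    program is a list of (mnemonic, dest, sources) tuples in program order.
    """
    hazards = []
    for i, (_, dest, _) in enumerate(program):
        if dest is None:  # e.g. a store or branch writes no register
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            _, _, sources = program[j]
            if dest in sources:
                hazards.append((i, j, dest))
    return hazards


if __name__ == "__main__":
    # Hypothetical sequence: the first instruction produces $2.
    program = [
        ("sub", "$2", ("$1", "$3")),
        ("and", "$12", ("$2", "$5")),
        ("or", "$13", ("$6", "$2")),
        ("add", "$14", ("$2", "$2")),
    ]
    for producer, consumer, reg in raw_hazards(program):
        print(program[consumer][0], "needs", reg, "from", program[producer][0])
```

In this sequence the `and` and `or` read `$2` too early, while the `add` is far enough behind the `sub` to read the updated value.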

MIPS R4000 pipeline

Advanced Architectural Concepts
° Can we achieve CPI < 1? (State-of-the-art microprocessors)
° “Superscalar” execution or Instruction Level Parallelism (ILP): deeper pipeline => dynamic branch prediction => speculation => recovery
° “Out-of-order” execution => instruction window and prefetch => reorder buffers
° “VLIW”, e.g. Intel/HP Itanium

Instruction Level Parallelism (ILP)
IPC > 1
[Figure: pipeline diagram with ILP = 2, in which two instructions at a time pass through IFtch, Dcd, Exec, Mem, WB. Vertical axis: program flow; horizontal axis: time.]
Examples: Pentium, SPARC, MIPS R10000, IBM PowerPC

HW Schemes: Instruction Parallelism
° Key idea: allow instructions behind a stall to proceed
DIVD F0,F2,F4
ADDD F10,F0,F8
SUBD F12,F8,F14
Enables out-of-order execution => out-of-order completion
The ID stage checks for hazards. If there are no hazards, it issues the instruction for execution.
Scoreboard dates to CDC 6600 in 1963
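
The three-instruction example can be traced with a small dependence check. This is a simplified sketch, not a scoreboard: it tracks only read-after-write dependences on the long-running DIVD (and on anything that waits behind it), and ignores WAR/WAW hazards and functional unit conflicts. The `split_ready` helper is a made-up name.

```python
def split_ready(long_op, followers):
    """Split followers into those that can proceed and those that must wait.

    Each instruction is a (mnemonic, dest, sources) tuple. long_op is the
    stalled or long-latency instruction whose result is not yet available.
    """
    pending = {long_op[1]}  # registers whose values are not ready yet
    ready, waiting = [], []
    for mnemonic, dest, sources in followers:
        if pending & set(sources):
            waiting.append(mnemonic)
            pending.add(dest)  # its own result is now delayed as well
        else:
            ready.append(mnemonic)
    return ready, waiting


if __name__ == "__main__":
    divd = ("DIVD", "F0", ("F2", "F4"))
    followers = [
        ("ADDD", "F10", ("F0", "F8")),   # reads F0, so it waits for DIVD
        ("SUBD", "F12", ("F8", "F14")),  # independent, so it may proceed
    ]
    print(split_ready(divd, followers))
```

ADDD has to wait for F0, but SUBD reads only F8 and F14, so it can execute and complete before ADDD, which is the out-of-order completion the slide refers to.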

How ILP Works
° Issuing multiple instructions per cycle requires fetching multiple instructions from memory per cycle. The number issued per cycle is called the superscalar degree or issue width.
° To find independent instructions, we must have a big pool of instructions to choose from, called the instruction buffer (IB). As the IB length increases, the complexity of the decoder (control) increases, which increases the datapath cycle time.
° Instructions are prefetched sequentially by an instruction fetch unit (IFU) that operates independently of datapath control. It fetches the instruction at (PC)+L, where L is the IB size, or as directed by the branch predictor.

Microarchitecture of an ILP-based CPU (PowerPC)


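Very Large Instruction Word (VLIW)
IPC > 1
[Figure: pipeline diagram in which each instruction passes through IFtch and Dcd and then occupies several Exec units in parallel before Mem and WB. Vertical axis: program flow; horizontal axis: time.]
Example: Itanium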

TriMedia TM32 Architecture
[Figure: block diagram with these labels: 32 KB instruction cache (compressed code in the instruction cache), VLIW instruction decode and launch, PC, multi-port register file of 128 words x 32 bits, functional units (FU), bypass network, 16 KB data cache, 64-bit memory bus, 32-bit peripheral bus.]

What is Multiprocessing
° Parallelism at the instruction level is limited because of data dependencies => speedup is limited!
° Program-level parallelism is abundant, for example loop-level parallelism in a loop such as Do I = 1000. How about employing multiple processors to execute the loop iterations? => Parallel processing or multiprocessing
° With a billion transistors on a chip, we can put a few CPUs on one chip => Chip multiprocessor
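
Loop-level parallelism can be sketched by splitting the iterations of a loop into chunks and handing each chunk to a separate worker. The example below is illustrative only. It assumes the loop has 1000 independent iterations, uses a made-up loop body (`i * i`), and uses Python threads as stand-ins for processors, so it shows the partitioning rather than real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_bounds(n, workers):
    """Split iterations 1..n into contiguous [start, stop) chunks."""
    base, extra = divmod(n, workers)
    bounds, start = [], 1
    for w in range(workers):
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def loop_body(i):
    """Hypothetical work for iteration i; independent of other iterations."""
    return i * i


def partial_sum(bounds):
    """Run one worker's share of the loop and return its partial result."""
    start, stop = bounds
    return sum(loop_body(i) for i in range(start, stop))


def parallel_loop(n=1000, workers=4):
    """Execute the loop on several workers and combine the partial results."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum, chunk_bounds(n, workers)))


if __name__ == "__main__":
    print(parallel_loop(1000, 4))
```

The partitioning works only because no iteration depends on another; a loop with dependences between iterations would need synchronization between the workers.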

Hardware Multithreading
° We need a hardware multithreading technique because switching between threads in software is very time-consuming (why?). Software switching, as in multitasking, is acceptable for hiding I/O latency but not for hiding main memory access latency.
° Provide multiple PCs and register sets on the CPU so that thread switching can occur without having to store the register contents in main memory (on the stack, as is done for context switching).
° Several threads reside in the CPU simultaneously, and execution switches between the threads on a main memory access.
° How about both multiprocessors and multithreading on a chip? => Network processor

Hardware Multithreading
° How can we guarantee no dependencies between instructions in a pipeline? One way is to interleave the execution of instructions from different program threads on the same pipeline.
Interleave 4 threads, T1-T4, on a non-bypassed 5-stage pipe:
T1: LW r1, 0(r2)
T2: ADD r7, r1, r4
T3: XORI r5, r4, #12
T4: SW 0(r7), r5
T1: LW r5, 12(r1)
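
Why four threads are enough on a non-bypassed 5-stage pipe can be checked with a little cycle arithmetic. The sketch below assumes one instruction issues per cycle in strict round-robin order, registers are read in the second stage and written in the fifth, and a value cannot be read in the same cycle in which it is written. The helper names are illustrative.

```python
READ_STAGE = 1   # Dcd: registers are read (stage index, counting from 0)
WRITE_STAGE = 4  # WB: result is written to the register file


def interleave(threads, rounds):
    """Round-robin issue order as (cycle, thread) pairs, one issue per cycle."""
    total = rounds * len(threads)
    return [(cycle, threads[cycle % len(threads)]) for cycle in range(total)]


def same_thread_safe(num_threads):
    """True if a thread's next instruction reads registers only after the
    thread's previous instruction has written its result."""
    write_cycle = 0 + WRITE_STAGE          # previous instruction issued at cycle 0
    read_cycle = num_threads + READ_STAGE  # next one issues num_threads cycles later
    return read_cycle > write_cycle


if __name__ == "__main__":
    for cycle, thread in interleave(["T1", "T2", "T3", "T4"], 2):
        print(cycle, thread)
    print("4 threads safe:", same_thread_safe(4))
```

With four threads, T1's second LW reads r1 in cycle 5, one cycle after T1's first LW wrote it in cycle 4, so no bypassing or interlock is needed within a thread.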

Architectural Comparisons (cont.)
[Figure: issue slots over time (processor cycles) for five designs: Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing, and Simultaneous Multithreading. Legend: Thread 1, Thread 2, Thread 3, Thread 4, Thread 5, Idle slot.]

Intel IXP2400 Network Processor
° XScale core replaces StrongARM
° 1.4 GHz target in 0.13-micron
° Nearest neighbor routes added between microengines
° Hardware to accelerate CRC operations and random number generation
° 16-entry CAM

IBM Cell Processor
SPU: Synergistic Processor Unit

Chip Multiprocessors