How Computers Work Lecture 13 Details of the Pipelined Beta

Review: 1-Stage Beta (Top-Down View)
[Figure: unpipelined Beta datapath. Labels: PC (+1, PCSEL, ISEL, XADDR, JMP(R31,XADDR,XP)); instruction fields OPCODE 31:26, 25:21, 20:5, 9:5, 4:0; Register File (RA1, RA2, RD1, RD2, WA, WD, WERF); SEXT; ASEL, BSEL; ALU (ALUFN, A op B, Z); Memory (WE, WEMEM, WD); WDSEL.]
With st(ra,C,rc): Mem[C+<rc>] <- <ra>

Review: Pipeline Stages
GOAL: Maintain (nearly) 1.0 CPI, but increase clock speed.
APPROACH: Structure the processor as a 4-stage pipeline (IF, RF, ALU, WB):
- Instruction Fetch stage: maintains PC, fetches one instruction per cycle and passes it to
- Register File stage: reads source operands from register file, passes them to
- ALU stage: performs indicated operation, passes result to
- Write-Back stage: writes result back into register file.
Four “major” datapath subsystems (Instr Memory, Register File, ALU, Data Memory) suggest a 4-stage pipe.
WHAT OTHER information do we have to pass down the pipeline? PC, INST
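The goal of (nearly) 1.0 CPI can be checked with a little arithmetic. The sketch below assumes an ideal pipeline with no stalls; the function names are illustrative and do not come from the lecture.

```python
def pipeline_cycles(num_instructions, num_stages=4):
    """Cycles needed to finish a run of instructions on an ideal pipeline.

    The first instruction takes num_stages cycles to drain; every later
    instruction completes one cycle after its predecessor.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


def cycles_per_instruction(num_instructions, num_stages=4):
    """Average CPI over the whole run (approaches 1.0 for long runs)."""
    return pipeline_cycles(num_instructions, num_stages) / num_instructions


if __name__ == "__main__":
    for n in (1, 10, 100, 1000):
        print(n, pipeline_cycles(n), cycles_per_instruction(n))
```

A single instruction still needs all 4 cycles, so latency is not improved; only the rate at which instructions finish gets better.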

Sketch of 4-Stage Pipeline
[Figure: pipeline sketch. Instruction Fetch (IF) -> Register File read (RF) -> ALU -> Write Back (RF write). Each stage is drawn with a CL block and passes the instruction on to the next stage; operands A and B enter the ALU, and result Y goes to Write Back.]
- Instruction exec time: 4 CL. Exec rate: 1 instr/CL.
- Short combinational paths, hence faster clock.
- NEED ALSO: the instruction (or parts of it) to drive control logic, and <PC> for BR/JMP instructions (recall, they load the old PC into the destination register).
To make it real, need to add some detail...

[Figure: the Beta datapath from the review slide, divided into four pipeline stages: IF, RF, ALU, WB.]

Pipeline Hazards
Return to 2nd problem: suppose there’s communication between consecutive instrs? The contents of a register WRITTEN by instruction k are READ by instruction k+1... before they are stored in RF!
EG:
ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
fails since CMPLEC sees “stale” <r3>. This is called a PIPELINE HAZARD.
[Figure: timing diagram over the IF, RF, ALU, WB stages. ADD(r1,r2,r3) writes R3 in its WB stage; CMPLEC(r3,100,r0) reads R3 in its RF stage, which comes earlier in time.]
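The timing behind the stale read can be sketched in a few lines. This assumes each instruction spends exactly one cycle per stage and that a register write only becomes visible to reads in cycles after the writer's WB cycle; both are modelling assumptions chosen to match the 2-cycle stall quoted later in the lecture.

```python
STAGES = ["IF", "RF", "ALU", "WB"]


def stage_cycle(fetch_cycle, stage):
    """Cycle in which an instruction fetched at fetch_cycle occupies a stage."""
    return fetch_cycle + STAGES.index(stage)


def reads_stale(distance):
    """True if a reader fetched `distance` cycles after a writer sees old data.

    Assumption: the write lands at the end of the writer's WB cycle, so a
    read in that same cycle (or earlier) still returns the old value.
    """
    write_cycle = stage_cycle(0, "WB")
    read_cycle = stage_cycle(distance, "RF")
    return read_cycle <= write_cycle


if __name__ == "__main__":
    # ADD(r1,r2,r3) fetched in cycle 0, CMPLEC(r3,100,r0) in cycle 1.
    print("R3 written in cycle", stage_cycle(0, "WB"))
    print("R3 read in cycle", stage_cycle(1, "RF"))
    for d in (1, 2, 3):
        print("distance", d, "stale:", reads_stale(d))
```

Under this model the two instructions right behind a writer are exposed, and the third one is safe.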

SOLUTIONS: 1. “Program around it”.
Document the weirdo semantics, declare it a software problem.
- Breaks sequential semantics!
- Costs code efficiency.
Can renegotiate the contract... BUT IT COSTS!
EXAMPLE: Rewrite
ADD(r1, r2, r3)
CMPLEC(r3, 100, r0)
MULC(r1, 100, r4)
SUB(r1, r2, r5)
as
ADD(r1, r2, r3)
MULC(r1, 100, r4)
SUB(r1, r2, r5)
CMPLEC(r3, 100, r0)
HOW OFTEN can we do this? Compilers can fill: 1 delay slot: 75%; 2 delay slots: 25%.
[Figure: timing diagram over IF, RF, ALU, WB showing where ADD(r1,r2,r3) writes R3 and where CMPLEC(r3,100,r0) reads it.]
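A compiler or assembler pass could check an instruction sequence for this problem before and after reordering. The sketch below is a hypothetical checker, not part of the lecture: it encodes each instruction as (opcode, source registers, destination register) and assumes a written register is unsafe to read for the next 2 instructions.

```python
# Assumption: a register write is not visible to the 2 instructions that follow.
HAZARD_WINDOW = 2


def find_hazards(program):
    """Return (writer_index, reader_index, register) for each unsafe read."""
    hazards = []
    for j, (_, sources, _) in enumerate(program):
        for src in sources:
            # Look back at the nearest earlier instructions only.
            for d in range(1, HAZARD_WINDOW + 1):
                k = j - d
                if k < 0:
                    break
                if program[k][2] == src:
                    hazards.append((k, j, src))
                    break
    return hazards


original = [
    ("ADD", ["r1", "r2"], "r3"),
    ("CMPLEC", ["r3"], "r0"),
    ("MULC", ["r1"], "r4"),
    ("SUB", ["r1", "r2"], "r5"),
]
# Same instructions, with CMPLEC moved behind the two independent ones.
rewritten = [original[0], original[2], original[3], original[1]]

if __name__ == "__main__":
    print("original :", find_hazards(original))
    print("rewritten:", find_hazards(rewritten))
```

The reordering works here only because MULC and SUB do not depend on r3 or r0; when no such independent instructions exist, the slots cannot be filled with useful work.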

SOLUTIONS: 2. Stall the pipeline.
Freeze IF, RF stages for 2 cycles, inserting NOPs into ALU IR...
PIPELINE STALL: HOLD current IF, RF state; ADVANCE ALU, WB state (putting in NOPs or annulled instrs!)
[Figure: timing diagram. ADD(r1,r2,r3) proceeds through IF, RF, ALU, WB; NOP bubbles are inserted ahead of CMPLEC(r3,100,r0), so that R3 is read after it is written.]
DRAWBACK: SLOW PERFORMANCE!
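The cost of stalling can be estimated by counting bubble cycles. This is a rough sketch under the same assumptions as the slide (4 stages, no bypass paths, a register readable only in cycles after its writer's WB cycle); the tuple encoding of instructions is made up for the example.

```python
def count_stalls(program):
    """Count stall cycles for a list of (opcode, sources, dest) tuples."""
    writer_wb_cycle = {}  # register -> cycle in which its latest writer is in WB
    rf_cycle = 0          # cycle in which the previous instruction was in RF
    stalls = 0
    for _, sources, dest in program:
        earliest = rf_cycle + 1  # RF cycle if nothing holds the instruction back
        # Each source must be read in a cycle after its writer's WB cycle.
        ready = max((writer_wb_cycle.get(s, -1) + 1 for s in sources), default=0)
        rf_cycle = max(earliest, ready)
        stalls += rf_cycle - earliest
        if dest is not None:
            writer_wb_cycle[dest] = rf_cycle + 2  # RF -> ALU -> WB
    return stalls


if __name__ == "__main__":
    pair = [("ADD", ["r1", "r2"], "r3"), ("CMPLEC", ["r3"], "r0")]
    print("stall cycles:", count_stalls(pair))
```

For the ADD/CMPLEC pair this gives the 2 frozen cycles mentioned on the slide.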

SOLUTIONS: 3. Bypass Paths.
Add extra data paths & control logic to re-route data in problem cases.
MOST AMBITIOUS SOLN: Fix the problem.
OBSERVATION: The new data exists in our data paths; it’s just not where we’re looking.
APPROACH: Add new paths, control logic.
HOW MANY DO WE NEED? (ans: 4)
[Figure: timing diagram over IF, RF, ALU, WB for the sequence
ADD(r1,r2,r3)
CMPLEC(r3,100,r0)
BT(r0,LOOP)
XOR(r0,r0,r3)
marking where <R1>+<R2> is produced and used, and where <R0> is produced, about to be written back, and used.]

Hardware Implementation of Bypass Paths
[Figure: pipelined Beta datapath (stages IF, RF, ALU, WB) with bypass paths added.]

Bypass Paths - I
Case 1:
• instr i writes r3
• instr i+1 uses <r3> as 2nd source operand
Instructions shown: ADD(r1,r2,r3) followed by CMPLEC(r3,100,r0)
New data is at ALU OUTPUT; must route it to B input reg.
LOGIC will select this path under proper circumstances (NB: RDest is Rc for operate instrs).
BUG! Won’t work for BR/JMP/LD.
[Figure: Register File (RA1, RA2, RD1, RD2, WA, WD, WE), ALU (A, B, Y) and the IR of the RF, ALU and WB stages.]

Bypass Paths - II
Case 2:
• instr i writes r3
• instr i+2 uses <r3> as 2nd source operand
Instructions shown: ADD(r1, r2, r3) and XOR(r3, r4, r2)
New data is at RF write data input; must route it to B input reg.
LOGIC will select this path under proper circumstances (NB: RDest is Rc for operate instrs).
[Figure: Register File (RA1, RA2, RD1, RD2, WA, WD, WE), ALU (A, B, Y) and the IR of the RF, ALU and WB stages.]
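The "LOGIC" that picks a bypass path in Cases 1 and 2 can be sketched as a small selection function. Everything named here (the function, the "ALU_OUT" / "WB_DATA" / "RF" labels) is hypothetical; the sketch also assumes r31 always reads as zero, as the NOP encoding ADD(r31, r31, r31) suggests, and it covers operate instructions only, since the slides note that the ALU-output path does not work for BR/JMP/LD.

```python
def select_operand_source(src_reg, alu_stage_dest, wb_stage_dest):
    """Pick where an instruction in the RF stage should take one operand from.

    src_reg        -- register the RF-stage instruction wants to read
    alu_stage_dest -- destination of the instruction now in the ALU stage
    wb_stage_dest  -- destination of the instruction now in the WB stage
    """
    if src_reg == "r31":
        return "RF"        # assumed hard-wired zero: never bypassed
    if src_reg == alu_stage_dest:
        return "ALU_OUT"   # Case 1: written by instr i, read by instr i+1
    if src_reg == wb_stage_dest:
        return "WB_DATA"   # Case 2: written by instr i, read by instr i+2
    return "RF"            # no hazard: the register file value is current


if __name__ == "__main__":
    # CMPLEC(r3,100,r0) in RF while ADD(r1,r2,r3) is in ALU.
    print(select_operand_source("r3", "r3", None))
    # XOR(r3,r4,r2) in RF while ADD(r1,r2,r3) is in WB.
    print(select_operand_source("r3", "r0", "r3"))
```

The ALU-stage check comes first so that, when both older instructions write the same register, the more recent value wins. Applying the same choice to each of the two ALU inputs gives two sources times two operands, which is one plausible reading of the earlier answer of 4 bypass paths.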

PC Bypassing
[Figure: pipelined Beta datapath with stages IF (+1), RF, ALU, WB, illustrating PC bypassing.]

BRANCH DELAY SLOTS
PROBLEM: One (or more) following instructions have been pre-fetched by the time a branch is taken.
POSSIBLE SOLUTIONS:
1. “Program around it”. Either
1a. Follow each BR with 2 NOP instructions; or
1b. Make your compiler clever enough to move USEFUL instructions into the slots following branches.
2. Make the pipeline “annul” instructions following branches which are taken, e.g. by disabling WERF and WEMEM and PCSEL.
NOP = ADD(r31, r31, r31)
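Annulment (solution 2) amounts to forcing the control signals of the pre-fetched instructions to harmless values. The sketch below models control signals as a plain dictionary; the dictionary layout is hypothetical, and treating PCSEL value 0 as "keep the sequential PC" is an assumption made only for this illustration.

```python
def annul(ctrl):
    """Return a copy of an instruction's control signals with side effects off."""
    squashed = dict(ctrl)
    squashed["WERF"] = 0    # no register file write
    squashed["WEMEM"] = 0   # no memory write
    squashed["PCSEL"] = 0   # assumption: 0 means the annulled instr cannot redirect the PC
    return squashed


def on_taken_branch(prefetched):
    """Annul every instruction that was fetched behind a taken branch."""
    return [annul(ctrl) for ctrl in prefetched]


if __name__ == "__main__":
    behind_branch = [
        {"WERF": 1, "WEMEM": 0, "PCSEL": 0},  # e.g. an operate instruction
        {"WERF": 0, "WEMEM": 1, "PCSEL": 0},  # e.g. a store
    ]
    print(on_taken_branch(behind_branch))
```

An annulled instruction still flows through the pipeline and occupies its slot; it just cannot change any state, so it behaves like the NOP above.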

Can we reduce the number of delay slots? A: Yes (by 1)

Load Delays
SIMILAR problems with LD... BUT data isn’t around as early!
Consider LOADS: Can we fix all these problems using our previous bypass paths?
LD(r1, 0, r4)
ADD(r1, r4, r5)
XOR(r3, r4, r6)
For a LD instruction fetched during clock i, data isn’t returned from memory until late into cycle i + 3!
2 hazards in this example: can both be fixed using bypasses? ANSWER: No, only 1 (XOR)!
[Figure: timing diagram over IF, RF, ALU, WB. LD data becomes available late in cycle i+3; R4 is needed by both ADD and XOR, and ADD needs it too early (conflict).]

LD Data Bypass
[Figure: pipelined Beta datapath (stages IF, RF, ALU, WB) with a bypass path for LD data.]

Load Delays - II
Load Timing Problems:
LD(r1, 0, r4)
ADD(r1, r4, r5)   (Problem 1)
XOR(r3, r4, r6)   (Problem 2)
Can bypass one load delay, skipping the RF access cycle. The other load delay is more fundamental... it results from the slowness of big memories.
Can relegate both problems to the compiler. Alternatively, fix Problem 2 using BYPASSES and fix Problem 1 using NOPs / STALLS.
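The difference between the two load problems comes down to when the consumer needs the data. The sketch below is a simplified timing model, assuming the LD is fetched in cycle 0, its data returns late in cycle 3 (as stated for the 4-stage pipe), and a bypassed value must arrive before the consumer's ALU cycle begins; the function name is illustrative.

```python
LD_DATA_CYCLE = 3  # LD fetched in cycle 0: data returns late in cycle 3


def load_use_fix(distance):
    """How a consumer fetched `distance` cycles after an LD can get its data.

    Returns "stall" if the data arrives too late even for a bypass,
    "bypass" if it can skip the register file, "none" if no fix is needed.
    """
    rf_cycle = distance + 1   # cycle in which the consumer reads the RF
    alu_cycle = distance + 2  # cycle in which the consumer uses the operand
    if rf_cycle > LD_DATA_CYCLE:
        return "none"
    if alu_cycle > LD_DATA_CYCLE:
        return "bypass"
    return "stall"


if __name__ == "__main__":
    print("ADD right after LD :", load_use_fix(1))
    print("XOR two after LD   :", load_use_fix(2))
    print("three after LD     :", load_use_fix(3))
```

This reproduces the slide's split: the instruction directly behind the load needs a NOP or stall, while the one after it can be served by a bypass.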

Load Problems - III
But, but, what about FASTER processors?
FACT: Processors will become fast relative to memories!
PROBLEM: Memory access time is O(log size), at best. (One can argue that it’s worse! We’ll see next week.) Are we up the creek, as processors get faster? Do we just lengthen the cycle time?
ALTERNATIVE: Longer pipelines. Try lengthening the pipeline...
1. Add “MEMORY WAIT” stages between START of read operation & return of data.
2. Build pipelined memories, so that multiple (say, N) memory transactions can be in progress at once.
3. (Optional) Stall pipeline when the N limit is exceeded.
4-stage pipeline requires 1 instruction’s delay. 5-stage pipeline requires 2 instructions’ delay.

5 Stage Pipeline
[Figure: the Beta datapath divided into five pipeline stages: IF, RF, ALU, MEM RD, WB.]

The Long Pipeline Fantasy: are there any limits?
Suppose memory access time is 80 ns, clock period is 2 ns. Might include 40 MEMORY-WAIT stages in pipeline:
IF, RF, ALU, 40 stages of memory waiting (MEM 1 ... MEM 40), WB
44-Stage Pipeline: COMPLICATIONS?
BNE(...) ... LD(..., r1) ... ADD(r1, ...)
- Lots of delay slots (possibly bypassed)
- Lots of cycles of pipeline stalls
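The stage count follows directly from the two numbers given. A minimal sketch of the arithmetic, with illustrative function names and the 4 base stages (IF, RF, ALU, WB) taken from the earlier slides:

```python
import math


def memory_wait_stages(access_ns, clock_ns):
    """Number of clock periods needed to cover one memory access."""
    return math.ceil(access_ns / clock_ns)


def total_stages(access_ns, clock_ns, base_stages=4):
    """Base pipeline stages plus the memory-wait stages."""
    return base_stages + memory_wait_stages(access_ns, clock_ns)


if __name__ == "__main__":
    print("wait stages :", memory_wait_stages(80, 2))
    print("total stages:", total_stages(80, 2))
```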

What Have We Learned Today?
- Pipelining improves throughput by lowering clock period
- Pipelining cannot improve latency
- Data Hazards can be fixed with Re-Programming, NOPs, Bypass Paths
- Branch Hazards can be fixed with Re-Programming, NOPs, Annulment
- Memory Hazards can be fixed with Re-Programming, NOPs, (1) Bypass Path
- As in Karate, balance is important... too much pipelining is BAD

Next Time:
- Implicit Multiple Issue
- Automatic Out-of-Order Execution