Advanced Computer Architectures Laboratory on DLX Pipelining Vittorio Zaccaria.

Vittorio Zaccaria – Laboratory of Architectures
DLX
– Load/Store architecture
  – Registers are faster than memory
  – The compiler can do deeper optimization
– 16-bit offsets and immediates
– 32-bit integer registers
– 64-bit floating point registers
– Fixed operation encoding:
  – Addressing mode contained in the operation code
  – Fits in one word
  – Faster decoding

DLX (cont.)
– 32 general purpose registers
– 32-bit instructions
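As a rough illustration of how one 32-bit word can hold two 5-bit register numbers (32 registers) and a 16-bit immediate, the sketch below packs and unpacks an I-type instruction. The field layout (6-bit opcode, 5-bit rs1, 5-bit rd, 16-bit immediate) is assumed from the usual textbook description of DLX rather than taken from these slides, and the opcode value is a hypothetical placeholder.

```python
def encode_i_type(opcode: int, rs1: int, rd: int, imm: int) -> int:
    """Pack opcode(6) | rs1(5) | rd(5) | immediate(16) into one 32-bit word."""
    if not 0 <= opcode < 64:
        raise ValueError("opcode must fit in 6 bits")
    if not (0 <= rs1 < 32 and 0 <= rd < 32):
        raise ValueError("register numbers must be in r0..r31")
    if not -(1 << 15) <= imm < (1 << 15):
        raise ValueError("immediate must fit in 16 signed bits")
    return (opcode << 26) | (rs1 << 21) | (rd << 16) | (imm & 0xFFFF)


def decode_i_type(word: int):
    """Unpack a 32-bit word into (opcode, rs1, rd, signed immediate)."""
    opcode = (word >> 26) & 0x3F
    rs1 = (word >> 21) & 0x1F
    rd = (word >> 16) & 0x1F
    imm = word & 0xFFFF
    if imm & 0x8000:          # sign-extend the 16-bit immediate
        imm -= 1 << 16
    return opcode, rs1, rd, imm


HYPOTHETICAL_OPCODE = 0x08    # placeholder value, not a documented DLX opcode
word = encode_i_type(HYPOTHETICAL_OPCODE, rs1=3, rd=3, imm=4)
fields = decode_i_type(word)
```

Because every field sits at a fixed bit position, decoding needs only shifts and masks, which is the "faster decoding" benefit of a fixed encoding.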

DLX Pipeline

Pipeline Visualization

Hazards
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
– Structural hazards: HW cannot support this combination of instructions
– Data hazards: instruction depends on the result of a prior instruction still in the pipeline
– Control hazards: pipelining of branches & other instructions that change the PC
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” in the pipeline.
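The data-hazard case can be sketched as a small check over an instruction list. This is only an illustration: the `Instr` tuple and the `window` parameter are hypothetical names, and the window of 2 assumes a pipeline without forwarding in which a register written in WB can be read in ID of the same cycle.

```python
from typing import List, NamedTuple, Optional, Tuple


class Instr(NamedTuple):
    text: str
    dest: Optional[str]        # register written, or None
    srcs: Tuple[str, ...]      # registers read


def raw_hazards(program: List[Instr], window: int = 2):
    """Return (producer, consumer, register) index triples for every read
    that follows a write to the same register within `window` instructions."""
    found = []
    for i, producer in enumerate(program):
        if producer.dest is None or producer.dest == "r0":
            continue
        for j in range(i + 1, min(i + 1 + window, len(program))):
            if producer.dest in program[j].srcs:
                found.append((i, j, producer.dest))
            if program[j].dest == producer.dest:
                break          # value overwritten, later reads see the new one
    return found


loop_body = [
    Instr("lw r4,dati_a(r3)", "r4", ("r3",)),
    Instr("lw r5,dati_b(r3)", "r5", ("r3",)),
    Instr("sub r5,r5,r4", "r5", ("r5", "r4")),
    Instr("addi r3,r3,4", "r3", ("r3",)),
    Instr("bnez r5,loop", None, ("r5",)),
]
hazards = raw_hazards(loop_body)
```

On the loop body of the example program used in these slides, the check reports three dependencies: both loads feeding `sub`, and `sub` feeding `bnez`.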

Structural Hazards

Data Hazards

Control Hazards

An example program:
        .data
dati_a: .word 1,2,3,4,5,6,7,8
dati_b: .word 2,3,4,5,6,7,7,9
        .text
        .global main
        add   r3,r0,0
loop:   lw    r4,dati_a(r3)
        lw    r5,dati_b(r3)
        sub   r5,r5,r4
        addi  r3,r3,4
        bnez  r5,loop
exit:

1st Exercise: Draw pipeline chart
Indicate:
– Data hazards between WB stages and ID stages
– Control hazards between EX stage and IF stage

Hazard Identification

2nd Exercise: Hazard Resolution
– Software solution
  – NOP insertion
– Hardware solutions
  – Bubble/stall generation
  – Register forwarding
– Software optimizations
  – Code rescheduling

NOP insertion
        add   r3,r0,0
        nop
        nop
loop:   lw    r4,dati_a(r3)
        lw    r5,dati_b(r3)
        nop
        nop
        sub   r5,r5,r4
        addi  r3,r3,4
        nop
        bnez  r5,loop
        nop
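One way to derive a NOP placement mechanically is sketched below. It assumes that, without forwarding, at least two instructions must sit between a producer and its consumer, and that one NOP follows every branch; both rules, the `min_gap` parameter and the tuple layout are assumptions of this sketch. Loop-carried dependencies are not checked. Under these assumptions the loop body receives 4 NOPs, the per-loop count quoted in the slides.

```python
from typing import List, Optional, Tuple

Instr = Tuple[str, Optional[str], Tuple[str, ...]]   # (text, dest, sources)


def insert_nops(program: List[Instr], min_gap: int = 2) -> List[str]:
    """Pad straight-line code with NOPs so that every consumer is separated
    from its producer by at least `min_gap` instructions."""
    out: List[str] = []
    last_write = {}                      # register -> index in `out`
    for text, dest, srcs in program:
        need = 0
        for reg in srcs:
            if reg in last_write:
                gap = len(out) - last_write[reg] - 1
                need = max(need, min_gap - gap)
        out.extend(["nop"] * need)
        if dest is not None and dest != "r0":
            last_write[dest] = len(out)
        out.append(text)
        if text.split()[0] in ("bnez", "beqz", "j"):
            out.append("nop")            # slot after the branch
    return out


program = [
    ("add r3,r0,0", "r3", ("r0",)),
    ("lw r4,dati_a(r3)", "r4", ("r3",)),
    ("lw r5,dati_b(r3)", "r5", ("r3",)),
    ("sub r5,r5,r4", "r5", ("r5", "r4")),
    ("addi r3,r3,4", "r3", ("r3",)),
    ("bnez r5,loop", None, ("r5",)),
]
padded = insert_nops(program)
```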

NOP dynamic execution (first loop, second loop)
The loop is composed of 5 instructions and 4 NOPs.

Performance Indexes
CPI = average clock cycles per instruction
Clock cycles = no. of instructions + no. of stalls/NOPs + 4
(4 is the number of extra cycles needed to complete the last instruction.)
CPI = [clock cycles] / [no. of instructions]
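These definitions translate directly into a few lines of Python. The function names are illustrative; the MIPS formula (frequency divided by CPI times 10^6) and the 200 MHz clock are the ones used elsewhere in these slides.

```python
def total_cycles(instructions: int, stalls_or_nops: int, drain: int = 4) -> int:
    """Clock cycles of a trace: instructions + stalls/NOPs + cycles to
    complete the last instruction."""
    return instructions + stalls_or_nops + drain


def cpi(instructions: int, stalls_or_nops: int, drain: int = 4) -> float:
    """Average clock cycles per instruction of the trace."""
    return total_cycles(instructions, stalls_or_nops, drain) / instructions


def mips(cpi_value: float, frequency_hz: float = 200e6) -> float:
    """Millions of instructions per second at the given clock frequency."""
    return frequency_hz / (cpi_value * 1e6)


# Example: a trace of 11 instructions padded with 10 NOPs.
trace_cpi = cpi(11, 10)          # (11 + 10 + 4) / 11
trace_mips = mips(trace_cpi)
```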

Performance evaluation of NOPs
Actual CPI = (Instructions + NOPs + 4) / Instructions = 17/7 ≈ 2.43
MIPS = frequency [= 200 MHz] / (CPI * 10^6) ≈ 82 MIPS

NOPs Manual Exercise
Manually execute the loop for two iterations (finishing on the NOP after the 2nd bnez) and calculate CPI and MIPS.
10 minutes

Results
CPI = (21+4)/11 ≈ 2.27
MIPS ≈ 88

Asymptotic loop performance
Consider an intermediate iteration of the loop. Count the instructions + NOPs of that iteration and divide by the number of effective instructions -> asymptotic CPI.
10 minutes

Performance evaluation of NOPs (asymptotic)
Asymptotic loop CPI = [(Instructions + NOPs)*n + 4] / [Instructions*n] = [9n + 4] / 5n ≈ 1.8
MIPS = frequency [= 200 MHz] / (CPI * 10^6) ≈ 111 MIPS
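A quick numeric check makes the limit visible: the fixed 4-cycle term is amortized as the iteration count n grows, so the CPI tends to 9/5. The function name and default arguments below are illustrative.

```python
def loop_cpi(n: int, instr_per_iter: int = 5, nops_per_iter: int = 4,
             drain: int = 4) -> float:
    """CPI of n loop iterations: [(instr + nops) * n + drain] / (instr * n)."""
    return ((instr_per_iter + nops_per_iter) * n + drain) / (instr_per_iter * n)


# The CPI decreases towards 9/5 = 1.8 as n grows.
cpi_by_iterations = {n: loop_cpi(n) for n in (1, 10, 100, 1000)}
```

For short traces the pipeline fill/drain cost dominates, which is why the two-iteration trace has a noticeably higher CPI than the asymptotic value.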

Bubbles
– Bubbles are NOPs inserted by the hardware.
– Branch instructions cause the generation of a NOP.
– Subsequent instructions are stalled.
– Previous instructions continue to execute.

Bubbles Example

Performance evaluation of bubbles
Actual CPI = (Instructions + Bubbles/aborts + 4) / Instructions = 17/7 ≈ 2.43
MIPS = frequency [= 200 MHz] / (CPI * 10^6) ≈ 82 MIPS

Verify on the simulator
– File -> load code... -> pipe1.s -> select -> load -> yes
– Configuration -> disable forwarding
– Open clock cycle diagram
– Execute -> single cycle (until the 1st load of the 2nd iteration has been executed)

Result

Manual Exercise
– Predict what happens in an intermediate iteration.
– Calculate the asymptotic CPI and MIPS.
10 minutes

Let's simulate it
Simulate the program until the 4th iteration.

Solutions
After the 1st iteration, we note the same behavior each time:
– 5 instructions
– 1 NOP
– 3 stalls
so the asymptotic values are: CPI = 1.8, MIPS = 111.11

Result Forwarding

Result Forwarding

Forwarding Example

Simulation of 2 iterations of the loop
– Configuration -> enable forwarding
– Open clock cycle diagram
– File -> Reset DLX
– Execute -> single cycle, up to the WB of the 2nd bnez

Simulation results

Manual Exercise
– Calculate CPI and MIPS for the 2 iterations.
– Calculate the asymptotic CPI and MIPS.
15 minutes

Results
2 iterations:
– 11 instructions
– 1 NOP
– 2 stalls
– 4 cycles to flush the pipe
=> CPI = 18/11 ≈ 1.64
=> MIPS ≈ 122

Asymptotic Results
– 5 instructions
– 1 NOP
– 1 stall
CPI = [7n+4]/5n ≈ 1.4
MIPS = 200/1.4 ≈ 143

Speedup
Speedup of A w.r.t. B = Exec. Time B / Exec. Time A

Calculate asymptotic speedup
– Speedup(NOPs, Bubbles)
– Speedup(Forwarding, NOPs)
– Speedup(Forwarding, Bubbles)
5 minutes

Calculate asymptotic speedup (results)
– Speedup(NOPs, Bubbles) = 1
– Speedup(Forwarding, NOPs) = 1.29
– Speedup(Forwarding, Bubbles) = 1.29
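These ratios follow from the asymptotic CPIs (1.8 for NOPs and bubbles, 1.4 for forwarding). The sketch below assumes the same instruction count and clock frequency in every configuration, so that the execution-time ratio reduces to a CPI ratio; the dictionary keys are just labels.

```python
def speedup(cpi_a: float, cpi_b: float) -> float:
    """Speedup of A w.r.t. B = exec. time B / exec. time A.
    With equal instruction count and clock, this equals CPI_B / CPI_A."""
    return cpi_b / cpi_a


ASYMPTOTIC_CPI = {"NOPs": 1.8, "Bubbles": 1.8, "Forwarding": 1.4}

nops_vs_bubbles = speedup(ASYMPTOTIC_CPI["NOPs"], ASYMPTOTIC_CPI["Bubbles"])
fwd_vs_nops = speedup(ASYMPTOTIC_CPI["Forwarding"], ASYMPTOTIC_CPI["NOPs"])
fwd_vs_bubbles = speedup(ASYMPTOTIC_CPI["Forwarding"], ASYMPTOTIC_CPI["Bubbles"])
```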

Scheduling Optimizations
Change the order of operations to minimize stalls/bubbles (forwarding enabled).
Original order:
        lw    r3,0(r2)
        add   r3,r3,r7
        lw    r4,0(r2)
        add   r4,r4,r8
        add   r4,r4,r3
CPI = (5+2+4)/5
Rescheduled order:
        lw    r3,0(r2)
        lw    r4,0(r2)
        add   r3,r3,r7
        add   r4,r4,r8
        add   r4,r4,r3
CPI = (5+4)/5
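The two stall counts can be reproduced with a small load-use check. The sketch assumes that, with forwarding enabled, the only remaining stall is one cycle when an instruction reads the result of the load immediately before it; the tuple layout `(opcode, dest, sources)` is a hypothetical representation.

```python
def load_use_stalls(program) -> int:
    """Count one stall for every instruction that reads the destination
    register of the load immediately preceding it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        if prev_op == "lw" and prev_dest in cur[2]:
            stalls += 1
    return stalls


def trace_cpi(program, drain: int = 4) -> float:
    """CPI = (instructions + stalls + drain cycles) / instructions."""
    n = len(program)
    return (n + load_use_stalls(program) + drain) / n


original = [
    ("lw", "r3", ("r2",)),
    ("add", "r3", ("r3", "r7")),
    ("lw", "r4", ("r2",)),
    ("add", "r4", ("r4", "r8")),
    ("add", "r4", ("r4", "r3")),
]
rescheduled = [original[0], original[2], original[1], original[3], original[4]]
```

Moving the second load up places an independent instruction between each load and its first use, which removes both stalls without changing the result.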

1st Exercise
        addi  r1,r0,1
        seq   r2,r1,r1
        add   r3,r3,r3
loop:   lw    r4,0(r3)
        sub   r3,r3,r4
        bnez  r1,loop

Manual Exercises
– Draw the conflicts between operations until the end of the 3rd execution of the loop (last instruction: bnez). No forwarding possible.
– Insert bubbles/aborts in the right place to solve hazards.
– Calculate CPI and throughput of the trace.
– Calculate the asymptotic CPI of the loop.
20 minutes

Hazard Diagram

Bubbles/Stall insertion

CPIs
Trace CPI = [24+4]/12 ≈ 2.33
Asymptotic CPI = [6n+4]/3n ≈ 2

Manual Exercises
– Suppose now that forwarding is possible.
– Draw the new execution pipeline diagram (until the execution of the 3rd bnez) and indicate when stalls must be generated by the hardware.
– Calculate CPI and MIPS.
– Calculate the asymptotic CPI and MIPS.
20 minutes

Pipeline Diagram

Results
CPI = 21/12 = 1.75
Asymptotic CPI = [(4+1)n+4]/3n ≈ 5/3 ≈ 1.67

2nd Exercise
loop:   lw    r2,dati_a(r4)
        lw    r3,dati_b(r5)
        add   r1,r2,r3
        sw    dati_a(r6),r1
        addi  r4,r4,4
        addi  r5,r5,4
        addi  r6,r6,4
        j     loop

1st part
– Assume no forwarding is possible.
– Insert bubbles/aborts in the right place to solve hazards.
– Calculate the asymptotic CPI of the loop.
– Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes

Results
8 instructions + 1 NOP + 4 stalls => CPI ≈ 13/8

Results
No forwarding and no scheduling, asymptotic result: 13/8

A Possible Re-Scheduling
loop:   lw    r2,dati_a(r4)
        lw    r3,dati_b(r5)
        addi  r4,r4,4
        addi  r5,r5,4
        add   r1,r2,r3
        sw    dati_a(r6),r1
        addi  r6,r6,4
        j     loop
Idea: increase the distance of the add from the last lw.

Re-Scheduling results
The scheduled code decreases the CPI to 11/8.

2nd part
– Now assume that forwarding is possible.
– Insert the needed bubbles/aborts in the right place to solve hazards.
– Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
– Calculate the asymptotic CPI of the two loops.
– Calculate the speedup between the original code (w/o forwarding) and the last rescheduled and forwarded code.
10 minutes

Forwarding Results
With forwarding but without rescheduling we obtain: 10/8

Re-scheduling
We use the same re-scheduled code. By rescheduling the loop we obtain 9/8.

Speedup Results
Total requested speedup is:
CPI[unscheduled,unforwarded] / CPI[scheduled,forwarded] = (13/8) / (9/8) = 13/9 ≈ 1.44

3rd Exercise
loop:   lw    r2,dati_a(r1)
        addi  r2,r2,4
        lw    r3,dati_b(r1)
        addi  r3,r3,4
        lw    r4,dati_a(r1)
        addi  r4,r4,4
        add   r2,r2,r3
        add   r2,r2,r4
        sw    dati_a(r1),r2
        addi  r1,r1,4
        bnez  r1,loop

1st part
– Assume no forwarding is possible.
– Insert bubbles/aborts in the right place to solve hazards.
– Calculate the asymptotic CPI of the loop.
– Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
20 minutes

Bubbles insertion
11 instructions + 1 NOP + 12 stalls => CPI = 24/11

Rescheduled code
loop:   lw    r2,dati_a(r1)
        lw    r3,dati_b(r1)
        lw    r4,dati_a(r1)
        addi  r2,r2,4
        addi  r3,r3,4
        addi  r4,r4,4
        add   r2,r2,r3
        add   r2,r2,r4
        sw    dati_a(r1),r2
        addi  r1,r1,4
        bnez  r1,loop
Idea: perform the computations after all data has been loaded.

Scheduled code results
11 instr. + 1 NOP + 7 stalls => CPI = 19/11

2nd part
– Now assume that forwarding is possible.
– Insert the needed bubbles/aborts in the right place to solve hazards.
– Schedule the instructions to minimize stalls by increasing the distance between conflicting instructions.
– Calculate the asymptotic CPI of the loop.
– Calculate the speedup between the original code (w/o forwarding) and the last rescheduled and forwarded code.
10 minutes

Bubbles insertion
11 instr. + 1 NOP + 4 stalls => CPI = 16/11

Rescheduling Results
11 instr. + 1 NOP + 1 stall => CPI = 13/11
Requested speedup = 24/13

Floating Point Pipeline Hazards
DLX FPU Pipeline

DLX FPU Pipeline
– Latency of a functional unit (FU): number of cycles that must intervene between an instruction that produces a value through the FU and an instruction that uses this value (i.e., the number of EX cycles of the FU minus 1).
– Initiation interval of the FU: time that must elapse between issuing two operations to the same FU.
– A stall in one pipeline does not mean a stall in the entire processor.

FPU Latencies and I.I. (WINDLX default latencies)
FU                        Latency   Initiation Interval
Integer ALU               0         1
FP add                    1         1
FP and integer multiply   4         1
FP and integer divide     18        19 [structural hazards!]
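The table can be turned into stall estimates as sketched below. The key names and the two helper functions are hypothetical, and the model assumes full forwarding: a consumer issued `distance` instructions after its producer has `distance - 1` instructions in between.

```python
# (latency, initiation interval) per functional unit, WINDLX defaults
FU_TABLE = {
    "int_alu": (0, 1),
    "fp_add": (1, 1),
    "mul": (4, 1),
    "div": (18, 19),
}


def raw_stalls(fu: str, distance: int) -> int:
    """Stalls for a consumer issued `distance` instructions after a producer
    that executes on `fu` (distance 1 = immediately following)."""
    latency, _ = FU_TABLE[fu]
    return max(0, latency - (distance - 1))


def structural_stalls(fu: str, distance: int) -> int:
    """Stalls for a second operation on the same `fu` issued `distance`
    instructions after the first one."""
    _, interval = FU_TABLE[fu]
    return max(0, interval - distance)
```

With these numbers only the divider causes structural stalls: two back-to-back divides are 18 cycles apart at best, while adds and multiplies can be issued every cycle.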

Problems with FPUs
– Divide instructions can cause structural hazards and need to be stalled in the ID stage.
– More than one register file write can be required in the same cycle.
– WAW hazards are possible because WB can be reached out of order.
– RAW hazards are more frequent due to the longer latency of operations.

Long Stalls even with Full Forwarding

Register file structural hazard solution
Structural hazards on the register file: stall one of the instructions before it enters the MEM stage.

FPU WAW Hazards
        ld    f6,dati_a(r2)
        ld    f2,dati_b(r3)
        multd f6,f2,f4
        subd  f6,f2,f2
        addd  f6,f8,f2
subd finishes before multd! There is a WAW conflict: if we don't stall subd, multd will overwrite its result.
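A simplified way to spot such conflicts is to compare completion times of instructions that write the same register. The model below is deliberately idealized: it assumes one instruction issued per cycle, ignores RAW stalls, and uses latency 4 for `multd` and 1 for `subd`/`addd`; the latency of 1 for `ld` is an assumption of this sketch.

```python
LATENCY = {"ld": 1, "multd": 4, "subd": 1, "addd": 1}   # `ld` value assumed


def waw_hazards(program):
    """program: list of (opcode, dest) in issue order, one issue per cycle.
    Report (earlier, later, register) when a later write to the same
    register would complete no later than the earlier one."""
    hazards = []
    for i, (op_i, dest_i) in enumerate(program):
        done_i = i + LATENCY[op_i]
        for j in range(i + 1, len(program)):
            op_j, dest_j = program[j]
            if dest_j == dest_i and j + LATENCY[op_j] <= done_i:
                hazards.append((i, j, dest_i))
    return hazards


program = [
    ("ld", "f6"),
    ("ld", "f2"),
    ("multd", "f6"),
    ("subd", "f6"),
    ("addd", "f6"),
]
conflicts = waw_hazards(program)
```

In this idealized timing both `subd` and `addd` would write f6 before `multd` does, so the hardware has to delay them to keep the writes in program order.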

Exercise: execute only one iteration of this loop:
loop:   ld    f0,dati_a(r2)
        ld    f4,dati_b(r3)
        multd f0,f0,f4
        addd  f2,f0,f2
        addi  r2,r2,8
        addi  r3,r3,8
        sub   r5,r4,r2
        bnez  r5,loop
How many cycles elapse between the IF of the 1st ld and the WB of the 1st bnez?

Results
CPI of the trace = 19/8 (19 cycles over 8 instructions)