Chapter Six

Pipelining: the laundry analogy
1. Place one dirty load of clothes in the washer.
2. When the washer is finished, place the wet load in the dryer.
3. When the dryer is finished, place the dry load on a table and fold it.
4. When folding is finished, ask your roommate to put the clothes away.

6.1 Pipelining
The washer, dryer, "folder," and "storer" each take 30 minutes for their task. Sequential laundry takes 8 hours for four loads of wash; pipelined laundry takes only 3.5 hours. How? We show the pipeline stage of each load over time by drawing copies of the four resources on a two-dimensional timeline, but we really have just one of each resource. (The laundry analogy for pipelining.)
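The arithmetic behind those two numbers is worth spelling out. A minimal Python sketch, using only the stage count and stage time given on the slide:

```python
STAGE_MIN = 30   # minutes per stage: wash, dry, fold, put away
STAGES = 4
LOADS = 4

# Sequential: each load runs all four stages before the next load starts.
sequential = LOADS * STAGES * STAGE_MIN          # 480 minutes

# Pipelined: once the first load has filled the pipeline,
# one load finishes every 30 minutes.
pipelined = (STAGES + LOADS - 1) * STAGE_MIN     # 210 minutes

print(sequential / 60, pipelined / 60)           # 8.0 3.5
```

Note that the pipelined time is not sequential/4: the fill time of the first load keeps the measured speedup below the ideal factor of four until the number of loads is large.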

Pipelining
So pipelined laundry is potentially four times faster than non-pipelined laundry. How does this apply to processors? In a processor, we pipeline instruction execution. MIPS instructions classically take five steps:
1. Fetch the instruction.
2. Read registers while decoding the instruction. (The format of MIPS instructions allows reading and decoding to occur simultaneously.)
3. Execute the operation or calculate an address.
4. Access an operand in data memory.
5. Write the result into a register.

Pipelining
All steps (called stages in pipelining) operate concurrently. As long as we have separate resources for each stage, we can pipeline the tasks. The time for each stage is no shorter under pipelining; the reason pipelining is faster for many loads is that everything works in parallel, so more loads finish per hour. Pipelining improves the throughput of our laundry system without improving the time to complete a single load. It does not decrease the time to complete one load of laundry, but when we have many loads to do, the improvement in throughput decreases the total time to complete the work.

Pipelining
Example: single-cycle versus pipelined performance. Limit the instruction set to eight instructions: load word (lw), store word (sw), add (add), subtract (sub), and (and), or (or), set-less-than (slt), and branch-on-equal (beq). Compare the single-cycle implementation, in which all instructions take 1 clock cycle, to a pipelined implementation. See Figure 6-2: eight instructions with different total times. In the single-cycle model, every instruction takes exactly 1 clock cycle, so the clock cycle must be long enough to accommodate the slowest operation.

Pipelining
Example: single-cycle versus pipelined performance, continued. Assume 200 ps for memory access and ALU operation, and 100 ps for register file read or write. Each instruction class uses only some of the five stages:

Instruction class                  | Fetch  | Reg read | ALU    | Data access | Reg write | Total
lw                                 | 200 ps | 100 ps   | 200 ps | 200 ps      | 100 ps    | 800 ps
sw                                 | 200 ps | 100 ps   | 200 ps | 200 ps      |           | 700 ps
R-format (add, sub, and, or, slt)  | 200 ps | 100 ps   | 200 ps |             | 100 ps    | 600 ps
beq                                | 200 ps | 100 ps   | 200 ps |             |           | 500 ps
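The totals follow directly from summing the stage latencies each instruction class actually uses. A quick check in Python (the 200 ps fetch and data-access latencies are inferred from the 800 ps lw total together with the stated ALU and register-file latencies):

```python
# Stage latencies in picoseconds.
FETCH, REG_READ, ALU, MEM, REG_WRITE = 200, 100, 200, 200, 100

total = {
    "lw":       FETCH + REG_READ + ALU + MEM + REG_WRITE,  # uses every stage
    "sw":       FETCH + REG_READ + ALU + MEM,              # no register write
    "R-format": FETCH + REG_READ + ALU + REG_WRITE,        # no data access
    "beq":      FETCH + REG_READ + ALU,                    # no access, no write
}
print(total)  # {'lw': 800, 'sw': 700, 'R-format': 600, 'beq': 500}

# The single-cycle clock must fit the slowest instruction.
single_cycle_clock = max(total.values())  # 800 ps
```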

Pipelining
Pipelining improves performance by increasing instruction throughput. The ideal speedup equals the number of stages in the pipeline. Do we achieve this? Note: the timing assumptions changed for this example (Figure 6-3).

Pipelining
Assumption: the write to the register file occurs in the first half of the clock cycle and the read occurs in the second half.

lw $s1, 100($s2)   # $s1 = Memory[$s2 + 100]

$s0, $s1, ... and $t0, $t1, ... are fast locations for data. Each pipeline stage takes a single clock cycle, so the clock cycle must be long enough to accommodate the slowest operation: 200 ps. The pipelined execution clock must use this worst-case cycle of 200 ps even though some stages take only 100 ps. What is the cycle time for the single-cycle machine: 500 or 800 ps? The ideal speedup is the number of stages in the pipeline. Do we achieve this?

Pipelining
Pipeline speedup: if the stages are perfectly balanced (the ideal case),

time between instructions (pipelined) = time between instructions (non-pipelined) / number of pipe stages

Ideally, five stages means five times faster: 800 ps / 5 = 160 ps. For the previous example with three instructions, the times were 1,400 ps pipelined versus 2,400 ps non-pipelined. Why not 5x? The number of instructions is not large. Add 1,000,000 more instructions:

  pipelined execution time     = 1,000,000 * 200 ps + 1,400 ps = 200,001,400 ps
  non-pipelined execution time = 1,000,000 * 800 ps + 2,400 ps = 800,002,400 ps
  speedup = 800,002,400 / 200,001,400 ≈ 4.0 = 800/200
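Checking the large-program speedup in Python, using the slide's figures:

```python
PIPE_CYCLE = 200     # ps, worst-case pipeline stage time
SINGLE_CYCLE = 800   # ps, slowest single-cycle instruction

n = 1_000_000
# The constants 1,400 and 2,400 ps are the three-instruction times
# from the earlier example, carried along as fill/startup overhead.
pipelined_time = n * PIPE_CYCLE + 1_400
non_pipelined_time = n * SINGLE_CYCLE + 2_400

print(non_pipelined_time / pipelined_time)  # ~4.0, approaching the ideal 800/200
```

As the instruction count grows, the fixed 1,400 and 2,400 ps terms vanish and the speedup converges to the ratio of cycle times, 800/200 = 4.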

Pipelining
Pipeline speedup: pipelining improves performance by increasing instruction throughput, as opposed to decreasing the execution time of an individual instruction; for real workloads, instruction throughput is the important metric.

Design Instruction Set for Pipelining
What makes it easy?
- All instructions are the same length. This makes it easier to fetch instructions in the first pipeline stage and to decode them in the second stage. (IA-32 instructions are translated into simple micro-operations that look like MIPS instructions; the Pentium 4 actually pipelines the micro-operations rather than the native IA-32 instructions.)
- There are just a few instruction formats, with the source register fields located in the same place in each instruction. As a result, the second stage can begin reading the register file at the same time the hardware determines what type of instruction was fetched.
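To illustrate why fixed field positions help, here is a sketch of extracting the MIPS source-register fields, which sit at the same bit positions in both R-format and I-format instructions; the helper name `decode_fields` is ours, and the example word is the standard encoding of `add $t0, $t1, $t2`:

```python
def decode_fields(word):
    """Extract the MIPS fields that sit at fixed positions in every
    instruction word, so the register file can be read before the
    opcode has been fully interpreted."""
    return {
        "opcode": (word >> 26) & 0x3F,
        "rs":     (word >> 21) & 0x1F,   # first source register
        "rt":     (word >> 16) & 0x1F,   # second source / destination
    }

# add $t0, $t1, $t2 encodes as 0x012A4020 (rs = $t1 = 9, rt = $t2 = 10)
fields = decode_fields(0x012A4020)
print(fields)  # {'opcode': 0, 'rs': 9, 'rt': 10}
```

Because `rs` and `rt` always occupy bits 25-21 and 20-16, the hardware can extract and read them unconditionally, in parallel with decoding the opcode.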

Design Instruction Set for Pipelining
- Memory operands appear only in loads and stores. We can use the execute stage to calculate the memory address and then access memory in the following stage. If we could operate on operands in memory, stages 3 and 4 would expand into an address stage, a memory stage, and an execute stage.
- Operands must be aligned in memory, so we need not worry about a single data transfer instruction requiring two data memory accesses; a single memory stage is enough.

Design Instruction Set for Pipelining
What makes it hard? A situation in which the next instruction cannot execute in the following clock cycle is called a hazard. There are three different types.
Structural hazards: the hardware cannot support the combination of instructions that we want to execute in the same clock cycle. For example, suppose we had only one memory: in Figure 6.3, the first instruction is accessing data from memory at the same time the fourth instruction is fetching an instruction from that same memory. Two memories are required to solve this problem. In the laundry analogy, this is like using a washer-dryer combination instead of a separate washer and dryer.

Design Instruction Set for Pipelining
Control hazards: we need to make decisions for branch instructions while other instructions are executing.
Data hazards: an instruction depends on a previous instruction, so the pipeline must be stalled because one step must wait for another to complete. A data hazard arises from the dependence of one instruction on an earlier one that is still in the pipeline.

Design Instruction Set for Pipelining
For example:

add $s0, $t0, $t1   # $s0 = $t0 + $t1
sub $t2, $s0, $t3

The add instruction doesn't write its result until the fifth stage, so we would have to add three bubbles to the pipeline. The solution: we don't need to wait for the instruction to complete before trying to resolve the data hazard. As soon as the ALU creates the sum for the add, we can supply it as an input to the subtract. Adding extra hardware to retrieve the missing item early from internal resources is called forwarding or bypassing.
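The forwarding condition itself is simple: forward when the instruction ahead in the pipeline writes a register the current instruction reads. A minimal sketch, where the `(op, dest, src1, src2)` tuple encoding is purely illustrative:

```python
def needs_forwarding(prev, curr):
    """True when curr reads a register that prev writes, so prev's
    ALU result must be forwarded rather than waiting for the
    register write in prev's fifth stage.
    prev/curr are (op, dest, src1, src2) tuples."""
    _, prev_dest, _, _ = prev
    _, _, src1, src2 = curr
    return prev_dest in (src1, src2)

add = ("add", "$s0", "$t0", "$t1")   # add $s0, $t0, $t1
sub = ("sub", "$t2", "$s0", "$t3")   # sub $t2, $s0, $t3
print(needs_forwarding(add, sub))    # True: sub reads $s0 right after add computes it
```

Real forwarding units compare register numbers from the pipeline registers of several stages at once; this sketch shows only the adjacent-instruction case from the example above.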

Design Instruction Set for Pipelining
Example: for the two instructions above, show which pipeline stages would be connected by forwarding. In the figure, shading indicates an element used by the instruction: shading on the right half of the register file or memory means the element is read in that stage, and shading on the left half means it is written in that stage.

Design Instruction Set for Pipelining

Design Instruction Set for Pipelining
Suppose the first instruction were a load of $s0 instead of an add. The desired data would be available only after the fourth stage of the first instruction, which is too late for the input of the third stage of sub. This is a load-use data hazard, and even with forwarding it forces a pipeline stall: a bubble. See the example on page 378: find the hazards in the code.
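The load-use case differs from the forwarding case only in the opcode test: a load's result is not ready until after the memory stage. A sketch, again with an illustrative tuple encoding:

```python
def load_use_hazard(prev, curr):
    """True when prev is a load whose destination feeds curr's sources.
    Even with forwarding the loaded value arrives too late, so the
    pipeline must insert one bubble (stall cycle) before curr."""
    op, dest, *_ = prev
    _, _, *srcs = curr
    return op == "lw" and dest in srcs

lw_  = ("lw",  "$s0", "$t0")          # lw $s0, 0($t0) -- illustrative encoding
sub_ = ("sub", "$t2", "$s0", "$t3")   # sub $t2, $s0, $t3
print(load_use_hazard(lw_, sub_))     # True: one stall needed before sub
```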

Design Instruction Set for Pipelining
Example: consider the following code segment in C:

A = B + E;
C = B + F;

Corresponding assembly code:

lw  $t1, 0($t0)
lw  $t2, 4($t0)
add $t3, $t1, $t2
sw  $t3, 12($t0)
lw  $t4, 8($t0)
add $t5, $t1, $t4
sw  $t5, 16($t0)

Find the hazards in the code and reorder the instructions to avoid any pipeline stalls.
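The exercise can be checked mechanically. A sketch that counts load-use stalls under the assumption of full forwarding (so a load only stalls the very next instruction); the tuple encoding `(op, dest, *sources)` is illustrative, and one possible reordering simply hoists the third lw above the first add:

```python
def count_stalls(program):
    """Count one-cycle load-use stalls, assuming full forwarding:
    a stall occurs only when a load's destination is read by the
    immediately following instruction."""
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        if prev[0] == "lw" and prev[1] in curr[2:]:
            stalls += 1
    return stalls

original = [
    ("lw",  "$t1", "$t0"),
    ("lw",  "$t2", "$t0"),
    ("add", "$t3", "$t1", "$t2"),   # reads $t2 right after its load -> stall
    ("sw",  "$t3", "$t0"),
    ("lw",  "$t4", "$t0"),
    ("add", "$t5", "$t1", "$t4"),   # reads $t4 right after its load -> stall
    ("sw",  "$t5", "$t0"),
]
# Hoist the third lw so no load is followed immediately by its user.
reordered = [original[0], original[1], original[4],
             original[2], original[3], original[5], original[6]]

print(count_stalls(original), count_stalls(reordered))  # 2 0
```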

Design Instruction Set for Pipelining
Control hazards: we need to make a decision based on the result of a branch instruction while other instructions are executing.
Solution 1: stall. In a pipeline, we must begin fetching the instruction following the branch on the very next clock cycle, but the pipeline cannot possibly know what the next instruction should be, since it has only just received the branch instruction from memory. We can add extra hardware to test registers, calculate the branch address, and update the PC during the second stage of the pipeline, but one extra 200 ps clock cycle is still needed for every branch.
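The cost of always stalling can be estimated from the cycle time. A sketch, where the 17% branch frequency is an illustrative assumption, not a figure from the slides:

```python
CYCLE = 200             # ps, pipeline cycle time from the running example
BRANCH_FRACTION = 0.17  # illustrative assumption: 17% of instructions branch

# With stall-on-branch, every branch adds one bubble (one extra cycle),
# so the average time per instruction grows by that fraction of a cycle.
avg_time = CYCLE * (1 + BRANCH_FRACTION)
print(avg_time)         # ~234 ps per instruction instead of 200 ps
```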

Design Instruction Set for Pipelining

Design Instruction Set for Pipelining
Solution 2: predict. Predict that branches will be untaken. When the prediction is right, the pipeline proceeds at full speed; only when branches are taken does the pipeline stall. When the prediction is wrong, the pipeline control must ensure that the instructions following the wrongly guessed branch have no effect, and it must restart the pipeline from the proper branch address.

Design Instruction Set for Pipelining

Design Instruction Set for Pipelining
We'll build a simple pipeline and look at these issues. We'll then talk about modern processors and what really makes pipelining hard: exception handling, trying to improve performance with out-of-order execution, and so on.