CSCI 4717/5717 Computer Architecture
Topic: CPU Operations and Pipelining
Reading: Stallings, Sections 12.3 and 12.4

A Short Analogy

Imagine a small bathroom with no privacy partitions around the toilet, i.e., one person in the bathroom at a time.
– 1.5 minutes to pee
– 1 minute to wash hands
– 2 minutes to dry hands under hot air
Four people need to use the bathroom: 4 × (1.5 + 1 + 2) = 18 minutes
How long would it take if we added a partition around the toilet, allowing three people to use the bathroom at the same time?
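Here is a minimal Python sketch of the answer (my own, not from the slides): with the partition, the bathroom behaves like a three-stage pipeline whose throughput is set by the slowest station, the 2-minute dryer.

    # Pipelined bathroom: each person flows through three stations in order.
    stage_times = [1.5, 1.0, 2.0]   # toilet, sink, dryer (minutes)
    people = 4

    # Sequential (no partition): everyone occupies the whole room alone.
    sequential = people * sum(stage_times)            # 4 * 4.5 = 18.0 minutes

    # Pipelined (with partition): the first person takes the full 4.5 minutes,
    # then one person finishes per bottleneck interval. Because the slowest
    # station is the last one, this simple formula is exact here.
    bottleneck = max(stage_times)                     # 2.0 minutes
    pipelined = sum(stage_times) + (people - 1) * bottleneck   # 4.5 + 3*2 = 10.5

    print(f"sequential: {sequential} min, pipelined: {pipelined} min")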

Instruction Cycle

Over the past few weeks, we have visited the steps the processor uses to execute an instruction.

Data Flow

The better we can break up the execution of an instruction into its sub-cycles, the better we will be able to optimize the processor's performance. This partitioning of the instruction's operation depends on the CPU design. In general, the execution of an instruction can be described as a sequence of cycles:
– Fetch cycle
– Data fetch cycle
– Indirect cycle
– Execute cycle
– Interrupt cycle

Instruction Fetch

– PC contains the address of the next instruction
– Address moved to the Memory Address Register (MAR)
– Address placed on the address bus
– Control unit requests a memory read
– Result placed on the data bus, copied to the Memory Buffer Register (MBR), then to the IR
– Meanwhile, the PC is incremented by the size of the machine code (typically one address)

Data Fetch

– Operand address is fetched into the MBR
– IR is examined to determine if indirect addressing is needed. If so, the indirect cycle is performed:
   - Address of the location holding the operand address is calculated based on the first fetch
   - Control unit requests a memory read
   - Result (actual address of operand) moved to the MBR
– Address in MBR moved to MAR
– Address placed on the address bus
– Control unit requests a memory read
– Result placed on the data bus, copied to the MBR
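As a minimal illustration of the last two slides (my own sketch, not from the text), the fetch and data-fetch cycles can be modeled as register transfers against a small memory. The register names (PC, MAR, MBR, IR) follow the slides; the instruction encoding and addresses are assumptions.

    # Register-transfer model: fetch, then a (possibly indirect) operand fetch.
    mem = {0: ("LOAD_IND", 10), 10: 20, 20: 42}   # address 10 holds a pointer to 20
    regs = {"PC": 0, "MAR": 0, "MBR": 0, "IR": None}

    def mem_read():
        """Control unit requests a memory read: MAR -> address bus, data -> MBR."""
        regs["MBR"] = mem[regs["MAR"]]

    # --- Instruction fetch cycle ---
    regs["MAR"] = regs["PC"]        # address of next instruction to MAR
    mem_read()                      # result copied to MBR...
    regs["IR"] = regs["MBR"]        # ...then to IR
    regs["PC"] += 1                 # meanwhile, PC incremented

    # --- Data fetch cycle ---
    opcode, addr_field = regs["IR"]
    regs["MAR"] = addr_field
    mem_read()                      # MBR: operand (direct) or operand address (indirect)
    if opcode == "LOAD_IND":        # IR examined: indirect cycle needed
        regs["MAR"] = regs["MBR"]   # actual operand address moved to MAR
        mem_read()                  # MBR now holds the operand itself

    print(regs["MBR"])              # -> 42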

Indirect Cycle

– Some instructions require operands, each of which requires a memory access
– With indirect addressing, an additional memory access is required to determine the final operand address
– Indirect addressing may be required for more than one operand, e.g., a source and a destination
– Each time indirect addressing is used, an additional operand fetch cycle is required

Execute Cycle

Due to the wide range of instruction complexity, the execute cycle may take one of many forms:
– register-to-register transfer
– memory or I/O read
– ALU operation
Its duration also varies widely.

Interrupt Cycle

At the end of the execution of an instruction, interrupts are checked. Unlike the execute cycle, this cycle is simple and predictable.
If no interrupt is pending, go to instruction fetch.
If an interrupt is pending:
– Current PC is saved to allow resumption after the interrupt
– Contents of PC copied to MBR
– Special memory location (e.g., stack pointer) loaded to MAR
– MBR written to memory
– PC loaded with the address of the interrupt-handling routine
– Next instruction (first of the interrupt handler) can be fetched
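In the same register-transfer style as the earlier sketch (again my own illustration; the SP register and handler address are assumptions), the interrupt check might look like:

    # Standalone sketch of the interrupt cycle's register transfers.
    mem = {}
    regs = {"PC": 7, "MBR": 0, "MAR": 0, "SP": 99}

    def check_interrupt(pending, handler_addr=100):
        """If an interrupt is pending, save the PC and vector to the handler."""
        if not pending:
            return                      # no interrupt: proceed to instruction fetch
        regs["MBR"] = regs["PC"]        # contents of PC copied to MBR
        regs["MAR"] = regs["SP"]        # special memory location loaded to MAR
        mem[regs["MAR"]] = regs["MBR"]  # MBR written to memory (PC saved)
        regs["PC"] = handler_addr       # PC loaded with the handler's address

    check_interrupt(pending=True)
    print(regs["PC"], mem)              # -> 100 {99: 7}; next fetch is the handler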

Pipelining

As with a manufacturing assembly line, the goal of instruction execution in a CPU pipeline is to:
– break the process into smaller steps, each step handled by a sub-process
– as soon as one sub-process finishes its task, pass its result to the next sub-process and begin the next task
– operate on multiple tasks simultaneously to improve performance
No single instruction is made faster, but the entire workload is completed faster.

Breaking an Instruction into Cycles

A simple approach is to divide an instruction into two stages:
– Fetch instruction
– Execute instruction
There are times when the execution of an instruction doesn't use main memory. In these cases, the idle bus can be used to fetch the next instruction in parallel with execution. This is called instruction prefetch.

Instruction Prefetch (figure)

Improved Performance of Prefetch

Without prefetch:

    I1: fetch exec
    I2:            fetch exec
    I3:                       fetch exec
    I4:                                  fetch exec

With prefetch (the next fetch overlaps the current execute):

    I1: fetch exec
    I2:       fetch exec
    I3:             fetch exec
    I4:                   fetch exec

Improved Performance of Prefetch (continued)

Examining the operation of prefetch, execution appears to take half as many cycles as the number of instructions increases. Performance, however, is not doubled:
– Except when forced to wait, a fetch is usually shorter than an execution
– Any jump or branch means that the prefetched instructions are not the required instructions
Solution: break the execute stage into more stages to improve performance.
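A quick simulation (my own sketch; the cycle counts and branch rate are illustrative assumptions, not from the slides) shows both effects keeping the speedup below 2:

    import random

    random.seed(1)
    FETCH, EXEC = 1, 3        # assumed: a fetch is shorter than an execution
    N, P_BRANCH = 1000, 0.2   # assumed instruction count and taken-branch rate

    no_prefetch = N * (FETCH + EXEC)

    with_prefetch = FETCH + EXEC               # first instruction runs serially
    for _ in range(N - 1):
        if random.random() < P_BRANCH:
            with_prefetch += FETCH + EXEC      # prefetched the wrong instruction: refetch
        else:
            with_prefetch += max(FETCH, EXEC)  # fetch hides under the longer execute

    print(f"speedup: {no_prefetch / with_prefetch:.2f}")   # well below 2.0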

Three Cycle Instruction

The average number of cycles per instruction is further reduced (to approximately a third) if we break an instruction into three cycles (fetch/decode/execute):

Without pipelining:

    I1: F D E
    I2:       F D E
    I3:             F D E
    I4:                   F D E

With a three-stage pipeline:

    I1: F D E
    I2:   F D E
    I3:     F D E
    I4:       F D E

Pipelining Strategy

Theoretically, if instruction execution could be broken into more pieces, we could realize even better performance:
– Fetch instruction (FI) – Read next instruction into buffer
– Decode instruction (DI) – Determine the opcode
– Calculate operands (CO) – Find effective address of source operands
– Fetch operands (FO) – Get source operands from memory
– Execute instruction (EI) – Perform the indicated operation
– Write operands (WO) – Store the result
This decomposition produces stages of nearly equal duration.

Sample Timing Diagram for Pipeline (figure: instructions advance one stage per cycle through FI, DI, CO, FO, EI, WO)
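The timing diagram can be generated with a few lines of Python (my own sketch, assuming one cycle per stage and no hazards):

    # Print the classic staircase diagram for a six-stage pipeline.
    STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]
    N = 4   # instructions to show

    for i in range(N):
        # Instruction i enters stage FI at cycle i; pad the cycles before it.
        row = ["  "] * i + STAGES
        print(f"I{i+1}: " + " ".join(row))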

Problems with the Previous Figure (Important Slide!)

– Assumes that each instruction goes through all six stages of the pipeline
– Assumes that FI, FO, and WO can happen at the same time, even though they may contend for memory
– Even with the more detailed decomposition, some stages will still take more time than others
– Conditional branches will disrupt the pipeline even worse than two-stage prefetch/execute
– Interrupts, like conditional branches, will disrupt the pipeline
– The CO and FO stages may depend on results of a previous instruction at a point before its WO stage has written them

Other Disruptions to the Pipeline

– Resource limitations – the same resource is required by more than one stage of the pipeline, e.g., the system bus
– Data hazards – a subsequent instruction depends on the outcome of a previous instruction and must wait for it to complete
– Conditional program flow – the next instruction after a branch cannot be fetched until we know whether we're branching

Effects of a Branch in a Pipeline (figure)

More Roadblocks to Realizing Full Speedup

Two additional factors frustrate improving performance using pipelining:
– Overhead required between stages, such as buffer-to-buffer transfers
– The amount of control logic required to handle memory and register dependencies and to control the pipeline itself
With each added stage, the hardware needed to support pipelining requires careful consideration and design.

Pipeline Performance Equations

Here are some simple measures of pipeline performance and relative speedup:
    τ   = time for one stage (the pipeline cycle time)
    τ_i = time delay of stage i
    τ_m = maximum stage delay
    d   = delay of latches between stages
    k   = number of stages
    τ = max[τ_i] + d = τ_m + d,   1 ≤ i ≤ k

Pipeline Performance Equations (continued)

In general, d is equivalent to a clock pulse and τ_m >> d. For n instructions with no branches, the total time required to execute all n instructions through a k-stage pipeline, T_k, is:
    T_k = [k + (n – 1)] τ
It takes k cycles to fill the pipeline, then one cycle each for the remaining n – 1 instructions.

Speedup Factor

For a k-stage pipeline, the ideal speedup calculated with respect to execution without a pipeline is:
    S_k = T_1 / T_k = n·k·τ / [k + (n – 1)]τ = n·k / [k + (n – 1)]
As n → ∞, the speedup goes to k.
The potential gains of a pipeline are offset by increased cost, delay between stages, and the consequences of branches.
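A minimal sketch (my own) evaluating these two formulas, with an assumed k = 6 stages, shows S_k approaching k as n grows:

    def t_pipelined(n, k, tau):
        """T_k = [k + (n - 1)] * tau: k cycles to fill, then one per instruction."""
        return (k + (n - 1)) * tau

    def speedup(n, k):
        """S_k = n*k / [k + (n - 1)]; note that tau cancels out."""
        return n * k / (k + (n - 1))

    k = 6
    for n in (10, 100, 1_000_000):
        print(n, round(speedup(n, k), 3))   # approaches k = 6 as n grows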

In-Class Exercise

Assume that we are executing 1.5 × 10^6 instructions using a 6-stage pipeline. If there is a 10% chance that an instruction will be a conditional branch and a 50% chance that a conditional branch will be taken, how long should it take to execute this code? Assume a single stage takes τ seconds.
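One possible way to set the problem up (my own sketch, not an official solution; it assumes a taken branch flushes the pipeline at a cost of k – 1 extra cycles, a penalty model the slides do not state explicitly):

    k, n = 6, 1.5e6
    p_branch, p_taken = 0.10, 0.50

    base_cycles = k + (n - 1)           # T_k / tau with no branches
    flushes = n * p_branch * p_taken    # expected number of taken branches
    penalty = flushes * (k - 1)         # assumed k-1 cycle flush per taken branch

    total = base_cycles + penalty
    print(f"~{total:,.0f} cycles, i.e. that many * tau seconds")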

Dealing with Branches

A variety of approaches have been used to reduce the consequences of branches encountered in a pipelined system:
– Multiple streams
– Prefetch branch target
– Loop buffer
– Branch prediction
– Delayed branching

Multiple Streams

The branch penalty is a result of having two possible paths of execution.
Solution: have two pipelines. Prefetch each branch path into a separate pipeline; once the outcome of the conditional branch is determined, use the appropriate pipeline.
Drawbacks:
– Competing for resources – this method leads to bus and register contention
– More streams than pipes – multiple branches lead to further pipelines being needed

Prefetch Branch Target

– The target of the branch is prefetched in addition to the instructions following the branch
– The target is kept until the branch is executed
– Used by the IBM 360/91

Loop Buffer

– Add a small, very fast memory maintained by the fetch stage of the pipeline
– Use it to contain the n most recently fetched instructions, in sequence
– Before taking a branch, see if the branch target is in the buffer
– Similar in concept to a cache dedicated to instructions, while maintaining an order of execution
– Used by the CRAY-1

Loop Buffer Benefits

– Particularly effective with loops: if the buffer is large enough to contain all of the instructions in a loop, those instructions only need to be fetched once
– When executing from within the buffer, it acts like a prefetch, with all of the instructions already loaded into high-speed memory; no access to main memory or the cache is needed
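As a rough illustration (my own sketch; real loop buffers are hardware, and the size and replacement policy here are assumptions), the fetch stage's check might behave like:

    from collections import OrderedDict

    class LoopBuffer:
        """Holds the n most recently fetched instructions, keyed by address."""
        def __init__(self, size=8):
            self.size = size
            self.buf = OrderedDict()   # address -> instruction, oldest first

        def record_fetch(self, addr, instr):
            self.buf[addr] = instr
            self.buf.move_to_end(addr)
            if len(self.buf) > self.size:
                self.buf.popitem(last=False)   # drop the oldest entry

        def lookup(self, target):
            """Before taking a branch, see if the target is already buffered."""
            return self.buf.get(target)        # hit: no memory/cache access needed

    # A tight loop fits entirely in the buffer, so the backward branch hits.
    lb = LoopBuffer(size=8)
    for addr in (100, 101, 102, 103):          # loop body fetched once
        lb.record_fetch(addr, f"instr@{addr}")
    print(lb.lookup(100) is not None)          # -> True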

Loop Buffer Diagram (figure)

Branch Prediction

There are a number of methods that processors employ to make an educated guess as to the direction a branch may take.
Static:
– Predict never taken
– Predict always taken
– Predict by opcode
Dynamic (depend on execution history):
– Taken/not-taken switch
– Branch history table

Static Branch Strategies

Predict never taken:
– Assume that the jump will not happen; always fetch the next instruction
– Used by the 68020 and the VAX 11/780
– The VAX will not prefetch after a branch if a page fault would result (a conflict between the operating system and the CPU design)
Predict always taken:
– Assume that the jump will happen; always fetch the target instruction
Predict by opcode:
– Some instructions are more likely to result in a jump than others
– Can get up to 75% success

Dynamic Branch Strategies

– Attempt to improve accuracy by basing the prediction on history
– Dedicate one or more bits to each branch instruction to reflect its recent history
– These bits are not stored in main memory, but in high-speed storage:
   - one possibility is in the cache with the instructions (history is lost when the instruction is replaced)
   - another is to keep a small table of recently executed branch instructions (could use a tag-like structure, with the low-order bits of the instruction's address selecting a line)

Taken/Not-Taken Switch

Storing one bit of history:
– 0: last branch not taken
– 1: last branch taken
– Shortcoming is with loops: the first branch is always predicted wrong, since the last time through the loop the CPU didn't branch. It also predicts wrong on the last pass through the loop.
Storing two bits of history, recording the outcomes of the last two executions:
– 00: two branches not taken in a row
– 01: branch not taken, followed by branch taken
– 10: branch taken, followed by branch not taken
– 11: two branches taken in a row
– Can be optimized for loops

Branch Prediction State Diagram (figure)

Must get two disagreements in a row before switching the prediction.
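The figure is a four-state diagram; a minimal Python sketch of the same idea (my own, using the common 2-bit saturating counter formulation) shows that two wrong guesses in a row are needed before the prediction flips:

    class TwoBitPredictor:
        """States 0,1 predict 'not taken'; states 2,3 predict 'taken'."""
        def __init__(self, state=0):
            self.state = state

        def predict(self):
            return self.state >= 2      # True means "predict taken"

        def update(self, taken):
            # Saturating counter: step toward 3 on taken, toward 0 on not taken.
            if taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)

    p = TwoBitPredictor(state=0)        # start strongly "not taken"
    for outcome in (True, True, True, False, True):
        print(p.predict(), "actual:", outcome)
        p.update(outcome)
    # The first taken branch is mispredicted, but one disagreement alone
    # does not flip the prediction; two in a row are needed.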

Branch History Table

There are three things that should be kept in the branch history table:
– Address of the branch instruction
– Bits indicating the branch history
– Branch target information, i.e., where do we go if we decide to branch?
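Building on the predictor sketch above (reusing its TwoBitPredictor class; this is again my own illustration, not a hardware-accurate one), a branch history table can be modeled as a small dictionary holding exactly the three items the slide lists:

    # Branch history table: keyed by branch instruction address, storing
    # the history bits (a 2-bit predictor) and the branch target address.
    bht = {}

    def lookup_or_add(branch_addr, target_addr):
        if branch_addr not in bht:
            bht[branch_addr] = {"history": TwoBitPredictor(), "target": target_addr}
        return bht[branch_addr]

    def next_fetch_addr(branch_addr, fallthrough_addr, target_addr):
        entry = lookup_or_add(branch_addr, target_addr)
        # Predict taken -> fetch from the stored target, else fall through.
        return entry["target"] if entry["history"].predict() else fallthrough_addr

    print(hex(next_fetch_addr(0x40, 0x44, 0x80)))   # new entry predicts not taken -> 0x44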

Delayed Branch

– It is possible to improve pipeline performance by rearranging instructions
– Start making the calculations needed by the branch earlier, so that the pipeline can be filled with real processing while the branch is being assessed
– Chapter 13 will examine this in greater detail

Without delayed branch:

    ADD r1, 5
    CMP r2, 10
    BNE GO_HERE

With delayed branch (delay slot filled with a NOP):

    ADD r1, 5
    CMP r2, 10
    BNE GO_HERE
    NOP

With delayed branch (ADD moved into the delay slot):

    CMP r2, 10
    BNE GO_HERE
    ADD r1, 5