Pipelining: Computer Architecture (Fall 2006)



Hardware Recap

A modern CPU consists of multiple, independent, interacting units
–Each unit does a specific task in processing instructions:
 Instruction Fetch (IF)
 Instruction Decode (ID)
 Execute (EX), using the ALU
 Write Back (WB)
 Memory Unit

Stages in Instruction Processing

Stage 1: Instruction Fetch (IF)
–Retrieve instruction bytes from memory
 Note that instructions may be multiple bytes in length
Stage 2: Instruction Decode (ID)
–Convert the instruction to a microprogram
–Retrieve additional operands if necessary
Stage 3: Execute (EX)
–Process the instruction using the ALU
 May involve the FPU too!
Stage 4: Write Back (WB)
–Store results back into memory or registers

Instruction Execution

Given several instructions, they are typically executed in a serial fashion:

 I1: IF ID EX WB
 I2:             IF ID EX WB
 I3:                         IF ID EX WB

If each stage takes k msec, then each instruction takes 4k msec.
Time for 3 instructions = 4k * 3 = 12k msec.
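The serial-execution arithmetic above can be sketched in a few lines; a minimal model, where `STAGES` and the per-stage time `k` are the slide's illustrative values rather than figures for any real CPU:

```python
# Serial (non-pipelined) execution: every instruction runs through all
# four stages (IF, ID, EX, WB) before the next instruction starts.
STAGES = 4

def serial_time(n_instructions, k=1):
    """Total time when instructions are processed one after another."""
    return n_instructions * STAGES * k

# 3 instructions at k msec per stage -> 12k msec, matching the slide.
print(serial_time(3))  # 12
```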

Concurrency

While one instruction is being processed, what are the other units doing?
–They are all working all the time
 Producing the same output!
–Consuming energy
 Dissipated in the form of heat
–But not doing anything really useful
What do we do with these idling units?
–Try to use them in some useful way.

Pipelining

Pipelining is an implementation technique in which the stages of instruction processing from multiple instructions are overlapped to improve the overall throughput of the CPU.

 Cycle: 1  2  3  4  5  6
 I1:    IF ID EX WB
 I2:       IF ID EX WB
 I3:          IF ID EX WB

If each stage takes k msec, then each instruction still takes 4k msec.
However, the time for 3 instructions = 6k msec.
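The pipelined timing follows a simple formula: the pipeline takes `STAGES` cycles to fill, and then completes one instruction per cycle. A minimal sketch of that arithmetic (an ideal hazard-free pipeline is assumed):

```python
STAGES = 4  # IF, ID, EX, WB

def pipelined_time(n_instructions, k=1):
    """Fill time for the first instruction, then one stage-time per
    additional instruction, assuming no hazards or stalls."""
    return (STAGES + n_instructions - 1) * k

# 3 instructions -> 6k msec, matching the 6k figure on the slide.
print(pipelined_time(3))  # 6
```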

Basic Facts about Pipelining

Facts you must note about pipelining:
–Pipelining does not reduce the time to process a single instruction
–It only increases the throughput of the processor
 By effectively using the hardware
 But it requires more hardware to implement
–It is effective only when a large number of instructions are processed
 Typically 1000s of instructions
–The theoretical performance improvement is proportional to the number of stages in the pipeline
 In our examples we have a 4-stage pipeline, so performance improves by about 4 times!
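Both facts, that many instructions are needed and that the theoretical speedup equals the stage count, fall out of dividing serial time by ideal pipelined time; a short sketch under the same idealized assumptions:

```python
STAGES = 4

def speedup(n_instructions):
    """Serial time / ideal pipelined time. Approaches STAGES as the
    instruction count grows, which is why pipelining only pays off
    over long instruction streams."""
    serial = n_instructions * STAGES
    pipelined = STAGES + n_instructions - 1
    return serial / pipelined

print(round(speedup(3), 2))      # 2.0  -- far from the limit
print(round(speedup(1000), 2))   # 3.99 -- close to the 4x theoretical maximum
```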

Facts about Pipelining (Contd.)

–Increasing the number of stages typically increases the throughput of the CPU, but it also:
 Increases the implementation complexity of the CPU
 Increases logic gate counts, making the hardware expensive
 Increases heat dissipation
–There is an upper limit to the depth of a pipeline
–Practical performance is usually well below the theoretical maximum
 Performance is limited by hazards, or stalls, in the pipeline

Hazards in a Pipeline

Hazards, or stalls, prevent a pipeline from operating at maximum efficiency
–They force the pipeline to skip processing instructions in a given cycle
Hazards are classified into 3 categories:
–Data hazards
–Control hazards
–Structural hazards

Data Hazards

A data hazard arises due to interference between instructions
–Consider the instructions shown below:

 add %ebx, %eax
 add %eax, %ecx

–The second instruction depends on the result of the first
 Consequently, the second instruction has to wait for the first instruction to complete!

Execution with Data Hazards

Given the following instructions with a data hazard, the pipeline stalls as shown below:

 I1: add %ebx, %eax
 I2: add %eax, %ecx

 I1: IF ID EX WB
 I2:    IF ID (stall) EX WB

Stalls are typically illustrated using "bubbles".

Forwarding: A Solution for Data Hazards

Forwarding: short-circuit the stages to forward results from one stage to the execution of the next instruction.

 I1: IF ID EX WB
 I2:    IF ID EX WB    (I1's EX result is forwarded directly to I2's EX)

Results from the previous instruction are forwarded to the next instruction to circumvent the data hazard!
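The cost of the bubble, and what forwarding buys back, can be sketched with the same fill-time formula used earlier. The one-cycle bubble is an illustrative assumption; the exact stall length depends on the pipeline design:

```python
STAGES = 4  # IF, ID, EX, WB

def pipeline_cycles(n_instructions, stall_cycles=0):
    """Ideal pipeline time plus any bubbles injected for hazards."""
    return STAGES + n_instructions - 1 + stall_cycles

# Without forwarding, I2 waits for I1's result (assumed 1-cycle bubble);
# with forwarding, I1's EX output feeds I2's EX directly and the bubble
# disappears.
print(pipeline_cycles(2, stall_cycles=1))  # 6: stalled
print(pipeline_cycles(2, stall_cycles=0))  # 5: forwarding removes the bubble
```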

Notes on Forwarding

Forwarding does not solve all data hazards
–Compilers also reorder instructions to minimize data hazards
Forwarding requires complex hardware
–It may require forwarding between multiple instructions
–Deeper pipelines suffer more because of the increased complexity

Control Hazards

Control hazards occur due to branching
–Conditional or unconditional
Branching requires new instructions to be fetched
–The pipeline has to be flushed and refilled
–Deeper pipelines incur greater penalties here
–About every 7th or 8th instruction is a branch!
This is a significant hazard and has to be circumvented to achieve reasonable performance from pipelines.
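The impact of branches on average performance can be estimated with a standard penalty model. The 3-cycle flush penalty below is an illustrative assumption (in practice it is tied to pipeline depth); the 1-in-7 branch frequency comes from the slide:

```python
def effective_cpi(branch_freq, mispredict_rate, flush_penalty, base_cpi=1.0):
    """Average cycles per instruction once branch flushes are counted:
    each mispredicted branch adds flush_penalty cycles of bubbles."""
    return base_cpi + branch_freq * mispredict_rate * flush_penalty

# With no prediction, every branch (~1 in 7) pays the full flush penalty.
print(round(effective_cpi(1/7, 1.0, 3), 2))  # 1.43
```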

Dynamic Branch Prediction: A Solution for Control Hazards

The processor includes hardware to predict the outcome of branch instructions even before they are executed
–So that the appropriate instructions can be preloaded into the pipeline
Dynamic branch prediction:
–Is performed while a program is executing
–Is achieved by internally associating a 2-bit branch predictor with each branch instruction
 Branch predictors are transparent to the programmer!
 They take up internal memory space on the CPU
 They require additional hardware for processing
–The predictors are about 90% accurate!
 They significantly reduce control hazards
 They do not eliminate control hazards!
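A 2-bit predictor is a saturating counter: two states predict "taken", two predict "not taken", so a single surprise outcome does not flip a strongly held prediction. A minimal sketch (the initial state and the loop-branch test pattern are illustrative choices):

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0,1 predict not-taken;
    states 2,3 predict taken."""
    def __init__(self, state=2):
        self.state = state  # start weakly taken (an arbitrary choice)

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        # Saturate at the ends of the 0..3 range.
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# A typical loop branch: taken 9 times, then not taken at loop exit.
p = TwoBitPredictor()
outcomes = [True] * 9 + [False]
correct = sum(1 for actual in outcomes
              if p.predict() == actual or p.update(actual))
# (update() returns None, so the 'or' only runs it on mispredictions;
# rewritten plainly below for clarity)
p = TwoBitPredictor()
correct = 0
for actual in outcomes:
    if p.predict() == actual:
        correct += 1
    p.update(actual)
print(correct, "of", len(outcomes))  # 9 of 10 -> 90% on this pattern
```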

Structural Hazards

Structural hazards arise due to limitations of the hardware
–The CPU cannot read from and write to memory simultaneously
–Memory may not keep up with the CPU's speed
 The CPU has to stall for memory to respond
 Usually, caches are used to minimize stalls
 Caches don't eliminate stalls, though.

Clock Cycles Per Instruction (CPI)

Instructions require varying numbers of stages to be processed, due to the various hazards
–Each stage consumes 1 clock cycle
The average number of clock cycles required to process an instruction is called CPI
–CPI is a strong measure of CPU performance
 A smaller CPI is better. Ideally, it is 1.
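CPI plugs directly into the classic performance equation, time = instructions × CPI × cycle time. A short sketch with made-up example numbers:

```python
def cpi(total_cycles, instructions):
    """Average clock cycles per instruction over a program run."""
    return total_cycles / instructions

def cpu_time(instructions, cpi_value, clock_hz):
    """Execution time = instruction count x CPI x cycle time (1/clock)."""
    return instructions * cpi_value / clock_hz

# A program that needed 1200 cycles for 1000 instructions: CPI = 1.2.
print(cpi(1200, 1000))  # 1.2
# A million such instructions on a 1 GHz clock:
print(cpu_time(1_000_000, 1.2, 1_000_000_000))  # 0.0012 seconds
```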

Extracting More Performance

Pipelining inherently aims to exploit potential parallelism among instructions, or instruction-level parallelism (ILP)
–ILP can be maximized by increasing the number of pipeline stages (the depth of the pipeline)
 But increasing pipeline depth has negative consequences!
–The alternative is to increase the number of functional units and process instructions in parallel
 Instructions may then be processed out of order
 Faster instructions may start later and finish earlier while a slower instruction is running on another unit!
 This requires additional hardware to reorder the out-of-order instructions

Dynamic Multiple Issue

Dynamic multiple issue, or superscalar, processors:
–Have multiple processing units
 Typically fed by a single pipeline
–Dynamically (while the program is running) issue multiple instructions to be processed in parallel
 Instructions are typically executed out of order
–Have an in-order commit unit
 It reorders the instructions that were processed out of order.
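The interplay of out-of-order completion and in-order commit can be sketched with a toy model. The issue-one-per-cycle policy and per-instruction latencies below are illustrative assumptions, not a model of any real core:

```python
def commit_order(latencies):
    """Issue one instruction per cycle in program order; each finishes
    after its (assumed) latency, but the commit unit retires results
    strictly in program order, at most one per cycle."""
    finish = [issue + lat for issue, lat in enumerate(latencies)]
    commits = []
    cycle = 0
    for f in finish:
        # An instruction commits only after it finishes AND every older
        # instruction has already committed.
        cycle = max(cycle + 1, f)
        commits.append(cycle)
    return finish, commits

finish, commits = commit_order([5, 1, 1])  # I0 is slow, I1 and I2 are fast
print(finish)   # [5, 2, 3] -- I1 and I2 finish before I0 (out of order)
print(commits)  # [5, 6, 7] -- but all three commit in program order
```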

Overview of a Superscalar Processor

 Instruction Fetch -> Decode Unit                         (in-order issue)
 Decode Unit -> Reservation Stations (queues) for the
  Integer Unit, the FPU, and the Load/Store unit          (out-of-order execute)
 Functional Units -> Commit Unit                          (in-order commit)

Athlon vs. Pentium

 Feature                                             | Athlon                      | Pentium
 Pipeline depth (smaller is better)                  | 10 (int), 15 (FPU)          | 30
 Functional units per core (more is better)          | 6 (int/load-store), 3 (FPU) | 5 (int/FPU)
 Clock frequency (more is better)                    | Less than 2.5 GHz           | Almost 5 GHz
 Instructions in flight                              | 72                          | 126
 Instructions per clock cycle (IPC) (more is better) | 8.75                        | 4
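The table hints at the trade-off between IPC and clock frequency; a rough first-order comparison multiplies the two. This sketch takes the slide's own figures at face value purely for illustration; real performance depends on far more than IPC times clock:

```python
def throughput(ipc, clock_ghz):
    """First-order throughput estimate in billions of instructions/sec."""
    return ipc * clock_ghz

# Using the slide's (approximate) numbers: a higher-IPC, lower-clock part
# can match or beat a lower-IPC, higher-clock one.
print(throughput(8.75, 2.5))  # 21.875
print(throughput(4, 5.0))     # 20.0
```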

Performance & Benchmarking

Performance is measured as the time taken to execute a program
–Such programs are called benchmarks
–Benchmarks provide a standard for measurement
Performance depends on many factors in addition to pipeline design and superscalar operation:
–Cache sizes and cache memory performance
–Memory-CPU bus interconnect speed
–Memory design