Reconfigurable Computing - Pipelined Systems John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia


Pipelines
 Key strategy for improving the performance of systems
 Provide a form of parallelism (pipeline parallelism)
  Different parts of different computations are being processed at the same time
 In general, blocks A, B, C, … will be different
  Although in some applications, eg a pipelined multiplier, digital filter or image processing application, some (or all) of them may be identical
[Diagram: combinatorial blocks A, B, C separated by registers, all driven by a common clock]

Pipelines
 Any modern high performance processor provides an example of a pipelined system
 ‘Work’ of processing an instruction is broken up into several sub-tasks, eg
  IF – Instruction fetch
  ID/OF – Instruction decode and operand fetch
  Ex – Execute
  WB – Write back
[Diagram: part of a simple pipelined RISC processor – registers separate the IF, ID/OF, Ex and WB stages; instruction memory feeds IF, and a register file serves ID/OF and WB]

High performance processor pipelines
 Basic idea
  If an instruction requires x ns to fetch, decode, execute and store results,
  a simple (non-pipelined) processor can be driven by a clock, f = 1/x
 However
  divide the work into 4 blocks, each requiring x/4 ns
  build a 4-stage pipeline clocked at 4/x = 4f
  The pipeline completes an instruction every x/4 ns, so it appears as if it is processing instructions at a 4f rate
 4-fold increase in processing power!!
  Because the system is processing 4 instructions at once!!
 but …
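The ideal overlap can be checked with a toy Python model (not tied to any real ISA): on an n-stage pipeline, k instructions finish in n + (k − 1) cycles instead of n × k, so the speedup approaches n as k grows:

```python
def completion_cycles(n_stages, k_instructions):
    """Cycles for k instructions on an ideal n-stage pipeline:
    n cycles to fill it, then one instruction completes per cycle."""
    return n_stages + (k_instructions - 1)

def completion_cycles_serial(n_stages, k_instructions):
    """Non-pipelined: every instruction occupies all n stage-times."""
    return n_stages * k_instructions

n, k = 4, 1000
print(completion_cycles(n, k))         # 1003
print(completion_cycles_serial(n, k))  # 4000
print(completion_cycles_serial(n, k) / completion_cycles(n, k))  # just under 4
```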

High performance processor pipelines
 Basic idea
  Use an n-stage pipeline
  n-fold increase in processing power!!
   Because the system is processing n instructions at once!!
 Note
  The time to actually process an instruction hasn’t changed – it’s still x ns
  Thus the latency (time for the first instruction to complete) is still x ns
  It’s the throughput that has increased to 4f

High performance processor pipelines
 Basic idea
  Use an n-stage pipeline
  n-fold increase in processing power!!
   Because the system is processing n instructions at once!!
 … and don’t forget reality!!
  1. It will not be possible to divide the work exactly into x/4 ns chunks, so the longest stage will take y > x/4 ns
  2. The registers are not ‘free’
   There is a propagation delay associated with them
   so the shortest cycle time is now y_min = x/4 + (t_SU + t_OD) ns
    where t_SU and t_OD are the setup and output delay times for the register
   thus the real throughput will be f′ = 1/y_max < 4f

High performance processor pipelines
 Basic idea
  Use an n-stage pipeline
  n-fold increase in processing power!!
   Because the system is processing n instructions at once!!
 So we should write ..
  n′-fold increase in processing power!! where n′ < n
 Nevertheless, n′ is usually substantial, so that pipelining speeds up throughput considerably

High performance processor pipelines
 Basic idea
  Use an n-stage pipeline
  n-fold increase in processing power!!
   Because the system is processing n instructions at once!!
 So we should write ..
  n′-fold increase in processing power!! where n′ < n
 Nevertheless, n′ is usually substantial, so that pipelining speeds up throughput considerably
 Remember
  Throughput increases
  but
  Latency remains the same
   In fact, it increases to n × y_max
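A short Python sketch makes the overhead arithmetic above concrete. The figures are assumed for illustration only: x = 20 ns of total work, 4 stages, t_SU = t_OD = 0.5 ns, and a 10% imbalance in the longest stage:

```python
def pipeline_figures(x_ns, n_stages, t_su_ns, t_od_ns, imbalance=1.0):
    """Cycle time, throughput and latency of an n-stage pipeline.
    imbalance >= 1 scales the longest stage above the ideal x/n split."""
    y = (x_ns / n_stages) * imbalance + t_su_ns + t_od_ns  # cycle time, ns
    return y, 1.0 / y, n_stages * y                        # y, f', latency

# Ideal case: perfectly balanced stages, 'free' registers
y_ideal, f_ideal, lat_ideal = pipeline_figures(20.0, 4, 0.0, 0.0)
print(y_ideal, lat_ideal)   # 5.0 ns cycle, 20.0 ns latency (= x)

# Realistic case: register overheads and imbalance lengthen the cycle,
# and the latency grows to n * y_max > x
y_real, f_real, lat_real = pipeline_figures(20.0, 4, 0.5, 0.5, imbalance=1.1)
print(y_real, lat_real)     # 6.5 ns cycle, 26.0 ns latency (> x)
```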

High performance processor pipelines
 Pipeline stalls
  The picture presented earlier makes a severe assumption, ie that the pipeline is always full or that it never stalls
 For example, extend the simple RISC processor with a cache and data memory
[Diagram: the pipelined RISC processor of the earlier slides, with a cache and data memory added after the Ex stage]

High performance processor pipelines
 Pipeline stalls
  The picture presented earlier makes a severe assumption, ie that the pipeline is always full or that it never stalls
 For example, extend the simple RISC processor with a cache and data memory
[Diagram: the pipelined RISC processor of the earlier slides, with a cache and data memory added after the Ex stage]
 Now, when an instruction reads from memory, the execution unit tries to find the data in the cache and, if that fails, it looks in main memory
 Assume the slowest arithmetic operation is multiply: time = 5 ns (incl register time)
  So f can be set to 200 MHz
 Now cache access time = 8 ns, main memory access time = 100 ns
 This means that
  1. For a cache access, the pipeline must stall (wait) for 1 extra cycle
  2. For a main memory access, the pipeline must stall for 10 extra cycles

High performance processor pipelines
 Pipeline stalls
  The simple picture presented up to now makes one severe assumption, ie that the pipeline is always full or that it never stalls
 When a pipeline may stall (as in a general purpose processor)
  the effect of stalls on throughput is generally >> all other factors!
  eg in a typical processor, ~25% of instructions access memory and so stall the pipeline for 1–10 cycles
 Calculate the effect for a cache hit rate of 90%
  75% of instructions – stall 0 cycles
  25% × 0.9 = 22.5% – stall 1 cycle
  2.5% – stall 10 cycles
  Average stall = 0.225 × 1 + 0.025 × 10 = 0.475 cycles = 0.475 × 5 ns ≈ 2.4 ns
  So the effective cycle time is 5 + 2.4 ≈ 7.4 ns
 Still considerably better than the original 4 × 5 ns = 20 ns! ie we still gained from pipelining! (Just not quite so much!)
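The stall arithmetic can be reproduced with a few lines of Python; the 5 ns cycle, 25% memory-access mix and 90% hit rate are the figures used on this slide:

```python
def effective_cycle_ns(base_cycle_ns, mix):
    """Average cycle time when some instructions stall the pipeline.
    mix: list of (fraction of instructions, extra stall cycles)."""
    avg_stall = sum(frac * stalls for frac, stalls in mix)
    return base_cycle_ns * (1.0 + avg_stall)

mix = [(0.75, 0),          # non-memory instructions: no stall
       (0.25 * 0.9, 1),    # cache hits: 1 extra cycle
       (0.25 * 0.1, 10)]   # cache misses: 10 extra cycles
print(effective_cycle_ns(5.0, mix))   # 7.375 ns -- the slide's ~7.4 ns
```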

Balance
 If a processing operation is divided into n stages, in general these stages will perform different operations and have different delay times, t_1, t_2, t_3, …, t_n
 The pipeline can not run faster than the slowest of these times. Thus, the critical time is:
  t_stage = max(t_1, t_2, t_3, …, t_n)
  f_max = 1/(t_stage + t_SU + t_OD)
 In order that t_stage ≈ Σt_i/n, the average time for a stage, the pipeline must be balanced, ie the stage times must be as close to the same as possible!
  One slow stage slows the whole pipeline!
 This implies that
  1. the separation of work into pipeline stages needs care!
  2. because of the fixed overheads, too many stages can have a negative effect on performance!
   Too many stages → t_i < (t_SU + t_OD) and no net gain!
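The balance rule above can be sketched in a few lines of Python (the stage delays and register overheads below are invented example figures): one slow stage sets f_max for the whole pipeline, however fast the other stages are:

```python
def f_max_mhz(stage_times_ns, t_su_ns, t_od_ns):
    """Maximum clock rate: set by the slowest stage plus the
    register setup (t_SU) and output delay (t_OD) overheads."""
    t_stage = max(stage_times_ns)
    return 1000.0 / (t_stage + t_su_ns + t_od_ns)

# The same 20 ns of total work, split well and split badly
balanced   = [5.0, 5.0, 5.0, 5.0]
unbalanced = [2.0, 3.0, 12.0, 3.0]
print(f_max_mhz(balanced, 0.5, 0.5))     # ~166.7 MHz
print(f_max_mhz(unbalanced, 0.5, 0.5))   # ~76.9 MHz: one slow stage dominates
```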

Pipelines – Performance effects
 Remember
  Throughput increases
  but
  Latency remains (almost) the same
   In fact, it increases slightly because of overhead factors!

Pipelines in VHDL
 VHDL synthesizers will ‘register’ SIGNALs!
  You don’t need to explicitly add registers!
 Example – ALU (Execution unit) of a pipelined processor
[Diagram: the pipelined RISC processor, highlighting the Ex stage. Input register, 3 components: opcode (operation + dest reg address), op1 (operand 1), op2 (operand 2). Output register, 3 components: result, dest reg address, exception flag]

[Diagram: the Ex stage of the pipelined RISC processor. Input register, 3 components: opcode (operation + dest reg address), op1 (operand 1), op2 (operand 2). Output register, 3 components: result, dest reg address, exception flag]

-- Pipelined ALU
ENTITY ALU IS
    PORT( instn     : IN  std_ulogic_vector;
          op1, op2  : IN  std_ulogic_vector;
          res       : OUT std_ulogic_vector;
          add       : OUT std_ulogic_vector;
          exception : OUT std_ulogic;
          clk       : IN  std_ulogic );
END ENTITY ALU;

ARCHITECTURE m OF ALU IS
    CONSTANT op_start  : NATURAL := 0;
    CONSTANT op_end    : NATURAL := 2;
    CONSTANT add_start : NATURAL := op_end + 1;
    CONSTANT add_end   : NATURAL := op_end + 5;
    SUBTYPE opcode_wd IS std_ulogic_vector( op_start TO op_end );
    CONSTANT no_op  : opcode_wd := "000";
    CONSTANT add_op : opcode_wd := "001";
    CONSTANT sub_op : opcode_wd := "010";
BEGIN
    PROCESS ( clk )
        -- Variables must be constrained: take the width from the port
        VARIABLE result : SIGNED( res'RANGE );
        VARIABLE opcode : opcode_wd;
    BEGIN
        IF clk'EVENT AND clk = '1' THEN
            opcode := instn( opcode_wd'RANGE );
            CASE opcode IS
                WHEN no_op  => exception <= '0';
                WHEN add_op => result := SIGNED(op1) + SIGNED(op2);
                               exception <= '0';
                WHEN sub_op => result := SIGNED(op1) - SIGNED(op2);
                               exception <= '0';
                WHEN OTHERS => exception <= '1';
            END CASE;
            -- res is a SIGNAL assigned in a clocked process, so the
            -- synthesizer infers the pipeline register automatically
            res <= std_ulogic_vector( result );
        END IF;
    END PROCESS;

    add <= instn( add_start TO add_end );
END ARCHITECTURE;
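For intuition, the clocked process above can be mimicked by a small Python model; the opcode encodings follow the VHDL constants, but this is only a behavioural sketch, not generated from the HDL:

```python
# Opcode encodings follow the constants in the VHDL architecture
NO_OP, ADD_OP, SUB_OP = "000", "001", "010"

class RegisteredALU:
    """Behavioural stand-in for the clocked VHDL process: inputs are
    sampled on tick() (the rising clock edge) and the registered
    outputs become visible after the edge."""
    def __init__(self):
        self.result = None      # contents of the output register
        self.exception = None

    def tick(self, opcode, op1, op2):
        if opcode == ADD_OP:
            self.result, self.exception = op1 + op2, 0
        elif opcode == SUB_OP:
            self.result, self.exception = op1 - op2, 0
        elif opcode == NO_OP:
            self.exception = 0  # result register keeps its old value
        else:
            self.exception = 1  # illegal opcode

alu = RegisteredALU()
alu.tick(ADD_OP, 7, 5)
print(alu.result, alu.exception)   # 12 0
alu.tick(SUB_OP, 7, 5)
print(alu.result, alu.exception)   # 2 0
alu.tick("111", 0, 0)
print(alu.exception)               # 1
```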


Multipliers - Pipelined
 Pipelining will ↑ throughput (results produced per second)
  but also ↑ total latency (time to produce the full result)
[Diagram: array multiplier with registers inserted between adder rows to capture partial sums]
 Insert registers to capture partial sums
 Benefits
  * Simple
  * Regular
  * Register width can vary
   - Need to capture operands also!
  * Usual pipeline advantages
 Inserting a register at every stage may not produce a benefit!

Multipliers
 We can add the partial products with FA blocks
[Diagram: array of FA blocks summing the partial products of operand bits a_0…a_3 and b_0…b_2; product bits p_0, p_1, … emerge from successive rows; a carry select adder sits below the last row]
 Note that an extra adder is needed below the last row to add the last partial products and the carries from the row above!
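The FA-block array can be sketched in Python to check that rows of full adders (plus the extra adder below the last row for the final carries) really compute the product; the bit widths and the little-endian bit order are choices made here for illustration:

```python
def full_adder(a, b, cin):
    """One FA block: sum bit and carry-out bit."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def array_multiply(a_bits, b_bits):
    """Unsigned multiply built only from AND gates (partial products)
    and rows of full adders, as in the array on the slide.
    Bit lists are little-endian (index 0 = LSB)."""
    n, m = len(a_bits), len(b_bits)
    acc = [0] * (n + m)                     # running partial sum
    for j, bj in enumerate(b_bits):
        carry = 0
        for i, ai in enumerate(a_bits):     # one row of FA blocks
            s, carry = full_adder(ai & bj, acc[i + j], carry)
            acc[i + j] = s
        # the extra adder below the row absorbs the row's final carry
        k = j + n
        while carry:
            s, carry = full_adder(acc[k], carry, 0)
            acc[k] = s
            k += 1
    return acc

def to_bits(x, width):
    return [(x >> i) & 1 for i in range(width)]

def from_bits(bits):
    return sum(b << i for i, b in enumerate(bits))

# 4-bit x 3-bit, matching the a_0..a_3 / b_0..b_2 operands on the slide
print(from_bits(array_multiply(to_bits(13, 4), to_bits(5, 3))))  # 65
```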