Presentation on theme: "Reconfigurable Computing - Pipelined Systems John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound, Western."— Presentation transcript:

1 Reconfigurable Computing - Pipelined Systems John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia

2 Pipelines  Key strategy for improving the performance of systems  Provide a form of parallelism (Pipeline parallelism)  Different parts of different computations are being processed at the same time  In general, blocks A, B, C, … will be different  Although in some applications eg pipelined multiplier, digital filter, image processing applications, …  some (or all) of them may be identical Register A C Clock B A, B, C – combinatorial blocks

3 Pipelines  Any modern high performance processor provides an example of a pipelined system  ‘Work’ of processing an instruction is broken up into several sub-tasks, eg  IF - Instruction fetch  ID/OF - Instruction decode and operand fetch  Ex - Execute  WB - Write back Register IF Register Ex Clock ID OF WB Instruct n memory Register File Part of a simple pipelined RISC processor

4 High performance processor pipelines
 Basic idea
 If an instruction requires x ns to fetch, decode, execute and store results,
 a simple (non-pipelined) processor can be driven by a clock, f = 1/x
 However,
 divide the work into 4 blocks, each requiring x/4 ns, and
 build a 4-stage pipeline clocked at 4/x = 4f
 The pipeline completes an instruction every x/4 ns, so it appears to process instructions at a 4f rate
 4-fold increase in processing power!!
 Because the system is processing 4 instructions at once!!
 but …
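The throughput argument above can be sketched numerically; the instruction time x = 20 ns is an assumed illustrative value, not a figure from the slides:

```python
# Ideal pipelining arithmetic from the slide, with an assumed x = 20 ns.
x = 20e-9                 # time to process one whole instruction (seconds)
f = 1 / x                 # non-pipelined clock rate

n = 4                     # number of pipeline stages
stage_time = x / n        # each stage does x/4 ns of work (perfectly even split)
f_pipe = 1 / stage_time   # pipeline clock = 4/x = 4f

print(f"non-pipelined: {f / 1e6:.0f} MHz")       # 50 MHz
print(f"4-stage ideal: {f_pipe / 1e6:.0f} MHz")  # 200 MHz: a 4-fold throughput gain
```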

5 High performance processor pipelines
 Basic idea
 Use an n-stage pipeline
 n-fold increase in processing power!!
 Because the system is processing n instructions at once!!
 Note
 The time to actually process an instruction hasn’t changed: it’s still x ns
 Thus the latency (time for the first instruction to complete) is still x ns
 It’s the throughput that has increased to nf (4f in the example)

6 High performance processor pipelines
 Basic idea
 Use an n-stage pipeline
 n-fold increase in processing power!!
 Because the system is processing n instructions at once!!
 … and don’t forget reality!!
 1. It will not be possible to divide the work exactly into x/4 ns chunks, so the longest stage will take y_max > x/4 ns
 2. The registers are not ‘free’
 There is a propagation delay associated with them,
 so even with a perfect split the shortest cycle time is y_min = x/4 + (t_SU + t_OD) ns,
 where t_SU and t_OD are the setup and output delay times for the registers
 Thus the real throughput will be f′ = 1/y_max < 4f
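The overhead correction can be sketched with assumed illustrative numbers (the work time x, the register overheads and the achieved y_max are all assumptions, not slide figures):

```python
# Register overhead limits the real pipeline clock (all numbers assumed).
x = 20e-9                    # total work per instruction (s)
n = 4
t_su, t_od = 0.5e-9, 0.5e-9  # register setup and output-delay times (s)

y_min = x / n + (t_su + t_od)  # best possible stage time, even with a perfect split
y_max = 6.5e-9                 # longest real stage incl. overhead (assumed, > y_min)

f_ideal = n / x                # the naive 4f figure
f_real = 1 / y_max             # what the pipeline actually achieves
print(f"y_min = {y_min * 1e9:.1f} ns")  # 6.0 ns, already more than x/4 = 5 ns
print(f"ideal {f_ideal / 1e6:.0f} MHz, real {f_real / 1e6:.1f} MHz")
```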

7 High performance processor pipelines
 Basic idea
 Use an n-stage pipeline
 n-fold increase in processing power!!
 Because the system is processing n instructions at once!!
 So we should write …
 n′-fold increase in processing power!! where n′ < n
 Nevertheless, n′ is usually substantial, so pipelining speeds up throughput considerably

8 High performance processor pipelines
 Basic idea
 Use an n-stage pipeline
 n-fold increase in processing power!!
 Because the system is processing n instructions at once!!
 So we should write …
 n′-fold increase in processing power!! where n′ < n
 Nevertheless, n′ is usually substantial, so pipelining speeds up throughput considerably
 Remember
 Throughput increases
 but
 Latency remains the same
 In fact, it increases to n × y_max

9 High performance processor pipelines
 Pipeline stalls
 The picture presented earlier makes a severe assumption, i.e. that the pipeline is always full, or that it never stalls
 For example,
 extend the simple RISC processor with a cache and data memory
[Slide diagram: the pipelined RISC processor of slide 3 extended with a cache and a data memory]

10 High performance processor pipelines
 Pipeline stalls
 The picture presented earlier makes a severe assumption, i.e. that the pipeline is always full, or that it never stalls
 For example, extend the simple RISC processor with a cache and data memory
[Slide diagram: the pipelined RISC processor extended with a cache and a data memory]
 Now, when an instruction reads from memory, the execution unit tries to find the data in the cache and, if that fails, it looks in main memory
 Assume the slowest arithmetic operation is a multiply: time = 5 ns (incl. register time)
 So f can be set to 200 MHz
 Now assume cache access time = 8 ns and main memory access time = 100 ns
 This means that
 1. for a cache access, the pipeline must stall (wait) for 1 extra cycle
 2. for a main memory access, the pipeline must stall for 10 extra cycles

11 High performance processor pipelines
 Pipeline stalls
 The simple picture presented up to now makes one severe assumption, i.e. that the pipeline is always full, or that it never stalls
 When a pipeline may stall (as in a general-purpose processor),
 the effect of stalls on throughput is generally >> all other factors!
 e.g. in a typical processor, ~25% of instructions access memory and so stall the pipeline for 1–10 cycles
 Calculate the effect for a cache hit rate of 90%:
 75% of instructions: stall 0 cycles
 25% × 0.9 = 22.5%: stall 1 cycle
 2.5%: stall 10 cycles
 Average stall = 0.225 × 1 + 0.025 × 10 = 0.475 cycles = 0.475 × 5 ns ≈ 2.4 ns
 So the effective cycle time is 5 + 2.4 ≈ 7.4 ns
 Still considerably better than the original 4 × 5 ns = 20 ns! i.e. we still gained from pipelining! (Just not quite so much!)
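The stall arithmetic on this slide can be checked directly, using the slide's own figures:

```python
# Effective cycle time with stalls (figures from the slide).
cycle = 5e-9                   # 5 ns base cycle, i.e. a 200 MHz clock
p_mem = 0.25                   # fraction of instructions that access memory
hit_rate = 0.90                # cache hit rate
stall_hit, stall_miss = 1, 10  # extra cycles on a cache hit / miss

avg_stall = p_mem * hit_rate * stall_hit + p_mem * (1 - hit_rate) * stall_miss
effective = cycle * (1 + avg_stall)
print(f"average stall = {avg_stall:.3f} cycles")          # 0.475
print(f"effective cycle time = {effective * 1e9:.1f} ns") # 7.4 ns, well under 20 ns
```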

12 Balance  If a processing operation is divided into n stages, in general, these stages will perform different operations and have different delay times, t 1, t 2, t 3, …, t n  The pipeline can not run faster than the slowest of these times. Thus, the critical time is:  t stage = max(t 1, t 2, t 3, …, t n )  f max = 1/(t stage + t SU + t OD )  In order that t stage   t i /n, the average time for a stage, the pipeline must be balanced ie the stage times must be as close to the same as possible!  One slow stage slows the whole pipeline!  This implies that 1.the separation of work into pipeline stages needs care! 2.because of the fixed overheads, too many stages can have a negative effect on performance!  Too many stages  t i < (t SU + t OD ) and no net gain!

13 Pipelines – Performance effects
 Remember
 Throughput increases
 but
 Latency remains (almost) the same
 In fact, it increases slightly because of the register overhead factors!

14 Pipelines in VHDL
 VHDL synthesizers will ‘register’ SIGNALs assigned in a clocked process!
 You don’t need to explicitly add registers!
 Example – ALU (execution unit) of a pipelined processor
[Slide diagram: the pipelined RISC processor with the Ex stage highlighted.
 Input components: opcode (operation + destination register address), op1 (operand 1), op2 (operand 2).
 Output components: result, destination register address, exception flag]

15
[Slide diagram as on slide 14: the Ex stage with its 3 input components (opcode, op1, op2) and 3 output components (result, destination register address, exception flag)]

-- Pipelined ALU
ENTITY ALU IS
    PORT( instn     : IN  std_ulogic_vector;
          op1, op2  : IN  std_ulogic_vector;
          res       : OUT std_ulogic_vector;
          add       : OUT std_ulogic_vector;
          exception : OUT std_ulogic;
          clk       : IN  std_ulogic );
END ENTITY ALU;

16
-- Pipelined ALU: architecture
ARCHITECTURE m OF ALU IS
    CONSTANT op_start  : NATURAL := 0;
    CONSTANT op_end    : NATURAL := 2;
    CONSTANT add_start : NATURAL := (op_end+1);
    CONSTANT add_end   : NATURAL := (op_end+5);
    SUBTYPE opcode_wd IS std_ulogic_vector( op_start TO op_end );
    CONSTANT no_op  : opcode_wd := "000";
    CONSTANT add_op : opcode_wd := "001";
    CONSTANT sub_op : opcode_wd := "010";
BEGIN
    PROCESS ( clk )
        VARIABLE result : signed( res'RANGE );
        VARIABLE opcode : opcode_wd;
    BEGIN
        IF clk'EVENT AND clk = '1' THEN      -- rising edge: one pipeline stage
            opcode := instn( opcode_wd'RANGE );
            CASE opcode IS
                WHEN no_op  => result := (OTHERS => '0'); exception <= '0';
                WHEN add_op => result := SIGNED(op1) + SIGNED(op2); exception <= '0';
                WHEN sub_op => result := SIGNED(op1) - SIGNED(op2); exception <= '0';
                WHEN OTHERS => exception <= '1';
            END CASE;
            res <= std_ulogic_vector( result );  -- assigned in the clocked branch,
        END IF;                                  -- so a register is inferred
    END PROCESS;
    add <= instn( add_start TO add_end );
END ARCHITECTURE m;


18 Multipliers – Pipelined
 Pipelining will increase throughput (results produced per second)
 but it will also increase total latency (time to produce the full result)
[Slide diagram: an array multiplier with registers inserted between rows to capture the partial sums]
 Benefits
 * Simple
 * Regular
 * Register width can vary
 - Need to capture the operands also!
 * Usual pipeline advantages
 Inserting a register at every stage may not produce a benefit!

19 Multipliers  We can add the partial products with FA blocks b0b0 b1b1 a0a0 a1a1 a2a2 a3a3 FA 0 p0p0 p1p1 b2b2 product bits Note that an extra adder is needed below the last row to add the last partial products and the carries from the row above! Carry select adder

