Reconfigurable Computing - Pipelined Systems


1 Reconfigurable Computing - Pipelined Systems
John Morris Chung-Ang University The University of Auckland ‘Iolanthe’ at 13 knots on Cockburn Sound, Western Australia

2 A, B, C – combinatorial blocks
Pipelines are a key strategy for improving the performance of systems. They provide a form of parallelism (pipeline parallelism): different parts of different computations are being processed at the same time. In general, blocks A, B, C, … will be different, although in some applications (e.g. a pipelined multiplier, a digital filter, or image processing) some, or all, of them may be identical. (Diagram: combinatorial blocks A, B and C separated by clocked registers.)

3 Part of a simple pipelined RISC processor
Pipelines. Any modern high-performance processor provides an example of a pipelined system. The 'work' of processing an instruction is broken up into several sub-tasks, e.g.: IF, instruction fetch; ID/OF, instruction decode and operand fetch; Ex, execute; WB, write back. (Diagram: part of a simple pipelined RISC processor, with instruction memory, register file, and clocked registers between the IF, ID/OF, Ex and WB stages.)

4 High performance processor pipelines
Basic idea: if an instruction requires x ns to fetch, decode, execute and store results, a simple (non-pipelined) processor can be driven by a clock with f = 1/x. However, divide the work into 4 blocks, each requiring x/4 ns, and build a 4-stage pipeline clocked at 4/x = 4f. The pipeline completes an instruction every x/4 ns, so it appears as if it is processing instructions at a 4f rate: a 4-fold increase in processing power!! Because the system is processing 4 instructions at once!! but …
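The ideal-speedup arithmetic above can be sketched in a few lines; the x = 20 ns instruction time below is a hypothetical value, not from the slides:

```python
# Ideal 4-stage pipeline arithmetic; x = 20 ns is an assumed example value.
x = 20.0                   # ns to fetch, decode, execute and store one instruction
f = 1.0 / x                # non-pipelined clock rate (1/ns)
n = 4                      # pipeline stages
stage_time = x / n         # each stage now takes x/4 = 5 ns
f_pipe = 1.0 / stage_time  # pipelined clock rate: 4/x
print(f_pipe / f)          # -> 4.0, the ideal n-fold speedup
```

This is the best case: it assumes a perfectly even split of the work and zero register overhead, which the next slides walk back.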

5 High performance processor pipelines
Basic idea: use an n-stage pipeline and get an n-fold increase in processing power!! Because the system is processing n instructions at once!! Note, however, that the time to actually process an instruction hasn't changed: it's still x ns. Thus the latency (the time for the first instruction to complete) is still x ns. It's the throughput that has increased, to 4f in our 4-stage example.

6 High performance processor pipelines
Basic idea: use an n-stage pipeline and get an n-fold increase in processing power!! Because the system is processing n instructions at once!! … and don't forget reality!! It will not be possible to divide the work into exactly equal x/4 ns chunks, so the longest stage will take y > x/4 ns. The registers are not 'free': there is a propagation delay associated with them, so the shortest possible cycle time is ymin = x/4 + (tSU + tOD) ns, where tSU and tOD are the setup and output delay times for the register. The real cycle time is set by the longest stage plus this overhead, so the real throughput will be f' = 1/ymax < 4f.
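A rough numeric sketch of how the uneven split and the register overhead erode the ideal 4f rate; the stage delays and the tSU/tOD values below are assumptions chosen purely for illustration:

```python
# How register overhead and an uneven split reduce the ideal speedup.
# Stage delays and setup/output times are illustrative assumptions.
x = 20.0                          # ns of total work per instruction
stages = [5.5, 5.0, 4.5, 5.0]     # imperfect split: sums to x, longest > x/4
t_su, t_od = 0.3, 0.2             # ns, register setup and output delay
y_max = max(stages) + t_su + t_od # real cycle time, set by the slowest stage
f_real = 1.0 / y_max
f_ideal = 4.0 / x                 # the hoped-for 4f
print(y_max, f_real < f_ideal)    # -> 6.0 True
```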

7 High performance processor pipelines
Basic idea: use an n-stage pipeline and get an n-fold increase in processing power!! Because the system is processing n instructions at once!! So we should write: an n'-fold increase in processing power, where n' < n. Nevertheless, n' is usually substantial, so pipelining speeds up throughput considerably.

8 High performance processor pipelines
Basic idea: use an n-stage pipeline and get an n-fold increase in processing power!! Because the system is processing n instructions at once!! So we should write: an n'-fold increase in processing power, where n' < n. Nevertheless, n' is usually substantial, so pipelining speeds up throughput considerably. Remember: throughput increases, but latency remains the same. In fact, it increases to n × ymax.

9 High performance processor pipelines
Pipeline stalls. The picture presented earlier makes a severe assumption, i.e. that the pipeline is always full, or that it never stalls. For example, extend the simple RISC processor with a cache and data memory. (Diagram: the simple pipelined RISC processor from before, extended with a cache and a data memory.)

10 High performance processor pipelines
Now, when an instruction reads from memory, the execution unit tries to find the data in the cache and, if that fails, it looks in main memory. Assume the slowest arithmetic operation is a multiply with time = 5 ns (incl. register time), so f can be set to 200 MHz. Now take cache access time = 8 ns and main memory access time = 100 ns. This means that for a cache access, the pipeline must stall (wait) for 1 extra cycle, and for a main memory access, the pipeline must stall for 10 extra cycles.

11 High performance processor pipelines
Pipeline stalls. The simple picture presented up to now makes one severe assumption, i.e. that the pipeline is always full, or that it never stalls. When a pipeline may stall (as in a general-purpose processor), the effect of stalls on throughput is generally >> all other factors! E.g. in a typical processor, ~25% of instructions access memory and so stall the pipeline for 1-10 cycles. Calculate the effect for a cache hit rate of 90%: 75% of instructions stall 0 cycles; 25% × 0.9 = 22.5% stall 1 cycle; 2.5% stall 10 cycles. Average stall = 0.225 × 1 + 0.025 × 10 = 0.475 cycles ≈ 0.48 × 5 ns ≈ 2.4 ns. So the effective cycle time is 5 + 2.4 = 7.4 ns. Still considerably better than the original 4 × 5 ns = 20 ns! I.e. we still gained from pipelining (just not quite so much)!
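The stall arithmetic on this slide can be checked directly, using the slide's own figures (5 ns cycle, 25% memory instructions, 90% hit rate, 1 or 10 extra cycles):

```python
# Average stall per instruction and effective cycle time, from the slide's figures.
cycle = 5.0                      # ns per pipeline cycle
frac_mem, hit_rate = 0.25, 0.9
avg_stall = frac_mem * hit_rate * 1 + frac_mem * (1 - hit_rate) * 10
# 0.225 * 1 + 0.025 * 10 = 0.475 extra cycles per instruction
effective = cycle * (1 + avg_stall)
print(round(avg_stall, 3), round(effective, 1))   # -> 0.475 7.4
```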

12 fmax = 1/(tstage + tSU + tOD)
Balance. If a processing operation is divided into n stages, in general these stages will perform different operations and have different delay times, t1, t2, t3, …, tn. The pipeline cannot run faster than the slowest of these times. Thus the critical time is tstage = max(t1, t2, t3, …, tn), and fmax = 1/(tstage + tSU + tOD). In order that tstage ≈ (Σti)/n, the average time for a stage, the pipeline must be balanced, i.e. the stage times must be as close to the same as possible! One slow stage slows the whole pipeline! This implies that the separation of work into pipeline stages needs care, because of the fixed overheads: too many stages can have a negative effect on performance! With too many stages, ti < (tSU + tOD) and there is no net gain!
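A small sketch of this balance arithmetic; the stage delays and register overheads below are hypothetical:

```python
# The slowest stage, not the average, sets the maximum clock rate.
stages = [4.0, 5.5, 4.5, 5.0]   # ns, hypothetical stage delays
t_su, t_od = 0.3, 0.2           # ns, assumed register overhead
t_stage = max(stages)           # critical time: 5.5 ns
f_unbalanced = 1.0 / (t_stage + t_su + t_od)
average = sum(stages) / len(stages)          # 4.75 ns
f_balanced = 1.0 / (average + t_su + t_od)   # what a perfect split would give
print(t_stage, average, f_balanced > f_unbalanced)   # -> 5.5 4.75 True
```

The same total work, rebalanced so every stage takes the 4.75 ns average, would allow a faster clock; this is the gain that careful stage partitioning buys.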

13 Pipelines – Performance effects
Remember: throughput increases, but latency remains (almost) the same. In fact, it increases slightly because of overhead factors!

14 Assignment Feedback Adders Should have TWO architectures!
One for a 'standard' implementation using full adder components, and one using Altera's (or Xilinx's, or …) fast carry logic.

15 Ripple Carry Adder ENTITY
Formally describes the interface for this model. Common: used by both implementations of the adder.

ENTITY adder IS
    GENERIC( n_bits : positive := 8 );
    PORT( a, b  : IN  std_logic_vector;
          c     : IN  std_logic;
          sum   : OUT std_logic_vector;
          c_out : OUT std_logic );
END ENTITY adder;

16 Ripple Carry Adder First ARCHITECTURE ENTITY adder IS
ENTITY adder IS
    GENERIC( n_bits : positive := 8 );
    PORT( a, b  : IN  std_logic_vector;
          c     : IN  std_logic;
          sum   : OUT std_logic_vector;
          c_out : OUT std_logic );
END ENTITY adder;

ARCHITECTURE simple OF adder IS
    COMPONENT full_adder …
    SIGNAL c_int : std_logic_vector( n_bits DOWNTO 1 );
BEGIN
    gen: FOR j IN n_bits DOWNTO 1 GENERATE
        fa: full_adder PORT MAP( … );
    END GENERATE;
END ARCHITECTURE simple;

17 Ripple Carry Adder Second ARCHITECTURE ENTITY adder IS
ENTITY adder IS
    GENERIC( n_bits : positive := 8 );
    PORT( a, b  : IN  std_logic_vector;
          c     : IN  std_logic;
          sum   : OUT std_logic_vector;
          c_out : OUT std_logic );
END ENTITY adder;

ARCHITECTURE fast_carry OF adder IS
    COMPONENT lpm_add_sub …
BEGIN
    rc: lpm_add_sub PORT MAP( … );
END ARCHITECTURE fast_carry;

18 Specifies the architecture to be used
Configurations. VHDL'93 allows you to write a configuration. This capability is very important! It allows you to: check that the substitution of an improved architecture does not alter the function (correctness) of a system; substitute different versions of basic modules in different parts of your design, e.g. a fast adder when speed is important, a small adder when space is important.

CONFIGURATION std OF test_bench IS
    FOR test_bench_architecture
        FOR ALL: adder USE ENTITY work.adder( simple );
        END FOR;
    END FOR;
END std;

The '( simple )' clause specifies the architecture to be used.

19 Specifies the architecture to be used
Configurations. VHDL'93 allows you to write a configuration. The full capabilities of a configuration allow: specification of an architecture for all instances of a model; specification of architectures for individual (labelled) instances (an instantiation has a label, e.g. adder_a : adder PORT MAP( a=>.., b=>…, … );, used to identify a component in simulation); substitution of equivalent entities ('equivalent' = having compatible interfaces).

CONFIGURATION std OF test_bench IS
    FOR test_bench_architecture
        FOR ALL: adder USE ENTITY work.adder( simple );
        END FOR;
    END FOR;
END std;

The '( simple )' clause specifies the architecture to be used.

20 Configurations (and Altera Limitations)
VHDL'93 allows you to write a configuration. Unfortunately, Altera's software is, in some respects, primitive!! It does not support configurations, i.e. it does not allow you to specify which architecture to use when instantiating a component. It also insists that the ENTITY and ARCHITECTURE parts are in the same file, preventing you from separating them! Good design separates interfaces and implementations, separating the abstract from the concrete. It probably has some good points though! E.g. the synthesizer and simulator appear to be well integrated.

CONFIGURATION std OF test_bench IS
    FOR test_bench_architecture
        FOR ALL: adder USE ENTITY work.adder( simple );
        END FOR;
    END FOR;
END std;

21 Altera - Getting around the problem
Copy (link) everything into a different directory and alter one of the architectures. Fine until you change something else! This is the reason that we want to have configuration capabilities: so that we could work with two (or more) architectures. Do NOT try renaming one of the entities, unless you need to use BOTH architectures in the same system! (Altera only!)

22 Abstract designs. Copying and renaming defeats the idea that you can have the abstract part (a common interface, represented by an ENTITY) and concrete parts, or different implementations (represented by different ARCHITECTUREs). Abstraction allows you to design at a high level: concentrate on the major functions of a system and leave low-level details until later. Example A: a washing machine clearly needs a timer. Specify a generic timer by writing out its entity, and decide how to implement the timer later.

23 Abstract designs Concept: designs have two parts
The abstract part, a common interface, and (several) concrete parts, or implementations. Designing at a high level, Example B: at this point in the system we need an adder, but we know several types of adder that we could use. Defer the decision as to what type of adder until more is known about the full system: we may need a fast one, a compact one, or a really small one (bit serial?). At an early stage in design, the only critical thing is that it adds correctly! You might even use the multiple-architectures capability to test various implementations to find the best one.

24 Assignment Summary. The assignment specification called for a short report 'summarizing what you have done so far'. Some of you interpreted 'short' in an extreme way!! Ideally, your summary should have consisted of a table with resource usage (# of logic blocks) and times for various adder configurations, plus some text commenting on this table. Which configuration is best for each adder width? The configuration with minimum delay should have been highlighted, marked or mentioned. How much does the resource usage differ? If the time advantage is small, you might prefer the smaller structure! If there were anomalies or unexpected results, they should have been explained, or at least (if you were unable to explain them) mentioned as unexpected.

25 Assignment Summary. Most of you omitted the 'base' configuration for an n-bit adder: a simple n-bit ripple carry adder, the 1 × n configuration, which requires you to synthesize the ripple carry adder only! This is actually important because, with fast carry logic, it is almost as fast as the CS adder, with only ~half the resources (or space) required! One important design question: use a simple ripple carry adder? (Not too slow if the fast carry logic is used, uses considerably less space, simple to implement (1 line of VHDL), regular.) Or try to implement a faster adder? Is it really needed?

26 Assignment - Details. The following details are important and were commonly omitted. Which software package was used? (Altera's MaxPlus, Quartus, Xilinx's Webpack, etc.) A precise explanation of the measurement unit for resources: what does a slice mean? Even the term 'logic block' has a different sense for different FPGAs: a Xilinx CLB has 2 FFs, 9 inputs, 2 outputs; an Altera Logic Element has 1 FF, 4 inputs, 1 output; Quicklogic has … So resource columns should be headed 'Altera Flex10K logic cells', 'Xilinx 4000 series CLBs', …, i.e. be precise!

27 Delay Estimates and Accuracy
Do not take the figures for delays reported by the tools as perfectly accurate!! Delays are computed by summing the propagation delays through every element on a signal's path. These elements include transmission gates (probably several types), multiplexers, look-up tables (LUTs), flip-flops, and sometimes buffers and other logic gates. Allowance must also be made for heavily loaded lines. Thus each reported path delay might be a sum of 100 or more elementary delays.

28 Delay Estimates and Accuracy
Delays. For a given technology, the manufacturers estimate (by measurement or calculation) the propagation delays of each of these types of elements. tpd for a real circuit is a function of temperature, supply voltage and circuit dimensions (width of gates, length of channels, etc). Even if the voltage and circuit dimensions are precisely known, tpd values are only accurate for one temperature. Circuits are usually rated: Commercial, 0°C to 70°C; Industrial, -40°C to ~100°C; Military and space, an even wider range.

29 Delay Estimates and Accuracy
Delays. Combined with manufacturing tolerances for circuit dimensions (track and gate widths; the size of etched, implanted, … regions), it is unlikely that a propagation delay for an individual circuit element is accurate to better than 100 ps (0.1 ns) over even a small temperature range. If several hundred figures with errors of 0.1 ns are added, the error in the sum can be very large! A predicted delay is probably only good to 1 ns, and could be much less accurate! Values in your report should reflect this!! Thus you should NOT copy the raw ns figure from the synthesis report. A realistic estimate is probably 56 ns, and this is the value which should appear in your report; 55.8 ns is tolerable, but probably completely unrealistic!
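The claim about accumulated error can be illustrated numerically; the path length of 200 elements is an assumption (the text says the path may sum a hundred or more elementary delays):

```python
import math

# Accumulated uncertainty when many ~0.1 ns element errors are summed.
n_elem, err = 200, 0.1                # 200 elements is an assumed path length
worst_case = n_elem * err             # errors all in one direction: 20 ns
independent = math.sqrt(n_elem) * err # statistically independent errors
print(round(worst_case, 1), round(independent, 1))   # -> 20.0 1.4
```

Even in the optimistic case where the per-element errors are independent and partially cancel, the uncertainty is on the order of a nanosecond, which is why quoting a synthesis-report delay to 0.1 ns precision is unrealistic.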

30 Delay Estimates and Accuracy
Delays: selecting the best configuration. If the synthesizer reports tpd = 23 ns for configuration A and tpd = 25 ns for configuration B, is it safe to assume that A at 23 ns is really faster? Almost certainly not! It may be, but on a chip with slightly different track widths, or running at a slightly different temperature or Vdd, the relative speeds may be different and B may be faster. Again, your report should reflect this: it should not claim that a configuration differing from another by a small margin is faster; it should note the similarity and look at other factors, such as size, regularity, …
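This comparison rule can be phrased as a tiny check. The ~1 ns error bar below is an assumption (consistent with the previous slide's estimate); with it, 23 ns vs 25 ns is not a meaningful difference:

```python
# Is A meaningfully faster than B, given an error bar on each predicted delay?
def clearly_faster(t_a, t_b, err=1.0):
    # True only if A's worst case still beats B's best case
    return t_a + err < t_b - err

print(clearly_faster(23.0, 25.0))   # -> False: too close to call
print(clearly_faster(20.0, 25.0))   # -> True
```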

31 Finally, English - The hardest bit?
Technical reports should be mainly written in the simple past tense. Use the simple past tense in the active voice ('Errors in measurement caused …', 'I/we implemented …', 'This result confirmed …') and in the passive voice ('this effect was caused by …', 'the designs were implemented in …', 'our hypotheses were supported by these observations'). Acronyms: explain them before you use them! Several of you used 'RCA' for ripple-carry adder; this is not generally accepted (even in this context), so it must be spelt out in full before use. VHDL and FPGA are OK (in this context), though!

32 Finally, English … The hardest bit? Articles
Even native speakers have difficulty explaining the rules! I’ve tried to correct most errors See if you can follow the patterns in my use

