1 ECE 506 Reconfigurable Computing http://www. ece. arizona ECE 506 Reconfigurable Computing Lecture 4 Reconfigurable Architectures Ali AkogluGive qualifications of instructors:DAPteaching computer architecture at Berkeley since 1977Co-athor of textbook used in classBest known for being one of pioneers of RISCcurrently author of article on future of microprocessors in SciAm Sept 1995RYtook 152 as student, TAed 152,instructor in 152undergrad and grad work at Berkeleyjoined NextGen to design fact 80x86 microprocessorsone of architects of UltraSPARC fastest SPARC mper shipping this Fall
2 FPGA Introduced in 1985 by Xilinx Similar to CPLDs A function to be implemented in FPGAPartitioned into modules , each implemented in a logic block.Logic blocks connected with the programmable interconnection.
3 FPGA Components Problem: How to handle sequential logic Truth tables don’t workPossible solution:Add a flip-flop to the output of LUTBLEs: the basic logic elementCircuit can now use output from LUT or from FFWhere does select come from?
5 FPGA Components Example: 8-bit register using 3-input, 2-output LUTs Input: x, Output: yWhat does LUT need to do to implement register?x(7)x(6)x(5)x(4)x(3)x(2)x(1)x(0)3-in, 2-outLUT3-in, 2-outLUT3-in, 2-outLUT3-in, 2-outLUTFFFFFFFFFFFFFFFFy(7)y(6)y(5)y(4)y(3)y(2)y(1)y(0)
8 FPGA ComponentsIsn’t it a waste to use LUTs for registers?YES! (when it can be used for something else)Commonly used for pipelined circuitsExample: Pipelined adder3-in, 2-outLUT3-in, 2-outLUT++RegisterRegisterFFFFFFFF+RegisterAdder and output register combined – not a separate LUT for each
9 FPGA ComponentsConfigurable Logic Blocks (CLBs) usually contain more than 1 BLEWhy?Efficient way of handling common I/O between adjacent LUTsSaves routing resources2x13-in, 2-outLUT3-in, 2-outLUTCLBFFFFFFFF2x12x12x12x1
10 FPGA Components Example: Ripple-carry adder CLB Each LUT implements 1 full adderUse efficient connections between LUTs for carry signalsA(1)B(1)A(0)B(0)Cin(0)Cin(1)2x13-in, 2-outLUT3-in, 2-outLUTCLBFFFFFFFF2x12x12x12x1Cout(0)Cout(1)S(1)S(0)
11 FPGA ComponentsOn real FPGAs: a cluster of LUTs per switch matrix (e.g., eight LUTs and switch matrix form a configurable logic block on Xilinx FPGAs)
12 Typical CLB The arithmetic logic provides XOR-gate and faster carry chain to build faster adder without wasting too much LUT-resources.
13 Xilinx CLBs Macro cells are CLBs. The number of basic blocks in a CLB varies from device to device.4000, Virtex, Virtex E, Spartan: 1 slice, 2 basic blocksSpartan 3, VirtexII, Virtex II-Pro Virtex 4: 4 slices, 2 basic blocks/sliceVirtex 5:2 slices, 4 basic blocks/slice
14 Xilinx CLBs Left part slices of a CLB (SLICEM) SLICEL Each BLE configured either as combinatorial logic or SRAM or shift registerSLICELonly to be configured as combinatorial logic.Each BLE4 inputs and 1 outputSpartan 3, VirtexII, Virtex II-Pro Virtex 4: 4 slices, 2 basic blocks/slice
15 Xilinx CLBs LUT has 6 inputs and 2 outputs. LUT can be configured either asa 6-input LUT, in which case only one output can be used,or as two 5-input LUTs, two outputs usedVirtex 5:2 slices, 4 basic blocks/slice
16 Xilinx Virtex6 CLB A CLB contains 2 identical slices on Virtex 6 2 slices are split in two columns of 1 slices each1 slice contains:4x 6-inputs LUT8x FF for storing LUT resultsMUX to feed LUT either to a FF or the outputCarry in and carry out to construct fast adder using neighbor CLBs
18 Altera FPGA Basic Blocks Altera’s FPGAs (Cyclone, FLEX)basic unit of logic is the logic element (LE)also LUT-baseda 4-LUT, a flip flop, a multiplexer and additional logic for carry chainLEs can operate in different modes each of which defines different usage of the LUT inputs.Altera LEsgrouped into logic array blocks (LAB).Flex 6000 LAB contains 10 LEsFLEX 8000 LAB contains 8 LEs.Cyclone II LAB contains 16 LEs
19 Altera FPGA Basic Blocks Stratix IIbasic computing unit is called adaptive logic module (ALM)Each LAB contains 8 ALMsALM can be used to implement functions with variable number of inputs.Ensures a backward compatibility to 4-input-based designs,Possible to implement module with up to 8 inputs.Additional modules: including flip flops, adders and carry logic
20 FPGA ComponentsCLBs often have specialized connections between adjacent CLBsFurther improves carry chainsAvoids routing resourcesBasic building block is CLBCan implement combinational+sequential logicAll circuits consist of combinational and sequential logicSo what else is needed?FPGAs need some way of connecting CLBs togetherReconfigurable interconnectBut, we can only put fixed wires on a chip
21 FPGA ComponentsProblem: If FPGA doesn’t know which CLBs will be connected, where does it put wires?Solution:Put wires everywhere!Referred to as channel wires, routing channels, routing tracks, many othersCLBs typically arranged in a grid, with wires on all sidesCLB
22 How to connect CLB to wires? Solution: Connection box FPGA ComponentsHow to connect CLB to wires?Solution: Connection boxDevice that allows inputs and outputs of CLB to connect to different wiresConnection boxCLBCLB
23 FPGA Components Connection box characteristics CLB CLB CLB CLB TopologyDefines the specific wires each CLB I/O can connect toExamples: same flexibility, different topologyCLBCLBCLBCLB
24 FPGA ComponentsConnection boxes allow CLBs to connect to routing wiresBut, that only allows us to move signals along a single wireNot very usefulHow do FPGAs connect wires together?
25 Solution: Switch boxes, switch matrices FPGA ComponentsSolution: Switch boxes, switch matricesConnects horizontal and vertical routing channelsBut, we can only put fixed wires on a chipCLBCLBSwitch box/matrixCLBCLB
26 FPGA Components Switch boxes Every possible connection? Flexibility - defines how many wires a single wire can connect toEvery possible connection?Too bigToo slow
27 FPGA Components Why do flexiblity and topology matter? Routability: a measure of the number of circuits that can be routedHigher flexibility = better routabilityWilton switch box topology = better routabilitySrcSrcCLBCLBNo possible route from src to destDestDest
28 FPGA Components Many Topologies possible Fs = 3 is commonDisjointWiltonUniversalTopology - defines which wires can be connected
29 FPGA ComponentsDisjoint: a wire entering can only connect to other wires with the same numerical designation.potential source–destination routes in the FPGA are isolated into distinct routing domains, limiting routing flexibility.Wilton: uses same number of routing switches but overcomes the domain issueBy allowing for a change in domain assignment on connections that turn.ability to change domains in at least one direction facilitates routing as a greater diversity of routing paths from a net source to a destination is possible.G. Lemieux and D. Lewis, “Circuit design of routing switches,” in Proceedings: ACM/SIGDA International Symposium on Field Programmable Gate Array, pp. 19–28, February 2002.G. Lemieux and D. Lewis, Design and Interconnection Networks for Programmable Logic. Boston, MA: Kluwer Academic Publishers, 2004.
30 FPGA ComponentsAt each switch block: some tracks end some tracks pass right through
31 FPGA Components Switch boxes Short channels Long channels Useful for connecting adjacent CLBsLong channelsUseful for connecting CLBs that are separatedAllows for reduced routing delay for non-adjacent CLBsMediumShortLong
32 FPGA Components FPGA layout called a “fabric” 2-dimensional array of CLBs and programmable interconnectSometimes referred to as an “island style” architecture
33 Solution 1: Use LUTs for logic or memory FPGA and Data StorageSolution 1: Use LUTs for logic or memoryLUTs are just an SRAMXilinx refers to as distributed RAMSolution 2: Include dedicated RAM components in the FPGA fabricXilinx refers to as Block RAMCan be single/dual-portedCan be combined into arbitrary sizesCan be used as FIFODifferent clock speeds for reads/writes
34 FPGA Components Fabric with Block RAM . . . . . . . Block RAM can be placed anywhereTypically, placed in columns of the fabricBRCLBCLBCLBCLBBR. . .BRCLBCLBCLBCLBBRBRCLBCLBCLBCLBBR
35 FPGA Components FPGAs commonly used for DSP apps Example: Xilinx DSP48 Makes sense to include custom DSP units instead of mapping onto LUTsCustom unit = faster/smallerExample: Xilinx DSP48Starting with Virtex 4 family, Xilinx introduced DSP48 block for high-speed DSP on FPGAsEssentially a multiply-accumulate core with many other featuresProvides efficient way of implementingAdd/subtract/multiplyMAC (Multiply-accumulate)Barrel shifterFIR FilterSquare root
36 FPGA ComponentsFPGAs are 2-dimensional arrays of CLBs, DSP, Block RAM, and programmable interconnectActual layout/placement differs for different FPGAsBRDSPDSPDSPDSPBRBRCLBCLBCLBCLBBRBRCLBCLBCLBCLBBRBRCLBCLBCLBCLBBR
42 Virtex7 FPGA DSP48DSP48E1 Tile (Two DSP48E1 Slices and Interconnects)
43 Virtex7 FPGA DSP48Single-instruction-multiple-data (SIMD) arithmetic unit:Dual 24-bit or quad 12-bit add/subtract/accumulateCascading capability on both pipeline paths for larger multipliers and larger post-adders
48 Mapping a circuit onto FPGA fabric Programming FPGAsMapping a circuit onto FPGA fabricKnown as technology mappingProcess of converting a circuit in one representation into a representation that corresponds to physical componentsGates to LUTsMemory to Block RAMsMultiplications to DSP48sEtc.But, we need some way of configuring each component to behave as desiredExamples:How to store truth tables in LUTs?How to connecting wires in switch boxes?
49 Programming FPGAs FPGAs programmed with a “bitfile” File containing all information needed to program FPGAContains bits for each control FFAlso, contains bits to fill LUTsBut, how do you get the bitfile into the FPGA?> 10k LUTsSmall number of pins
50 Programming FPGAs Solution: Shift Registers General Idea Make a huge shift register out of all programmable components (LUTs, control FFs)Shift in bitfile one bit at a timeConfiguration bits input hereCLBShift register shifts bits to appropriate location in FPGA
51 Programming FPGAs Example: Program CLB with 3-input, 1-output LUT to implement sum output of full adderAssume data is shifted in this direction11Should look like this after programmingInOutABCinS1FFFF12x112x1
52 Programming FPGAs Example, Cont: Bitfile is just a sequence of bits based on order of shift registerDuring programmingAfter programming1FFFF2x112x1
53 Programming FPGAs During programming After programming FF FF 2x1 2x1 11FFFF2x112x1
54 Programming FPGAs During programming After programming FF FF 2x1 2x1 11FFFF2x112x1
55 Programming FPGAs During programming After programming FF FF 2x1 2x1 01101011FFFF2x112x1
56 Programming FPGAs During programming After programming FF FF 2x1 2x1 0110111FFFF2x112x1
57 Programming FPGAs During programming After programming FF FF 2x1 2x1 011011FFFF2x112x1
58 Programming FPGAs During programming After programming FF FF 2x1 2x1 01111FFFF2x112x1
59 Programming FPGAs During programming After programming FF FF 2x1 2x1 0111FFFF2x112x1
60 Programming FPGAs During programming After programming FF FF 2x1 2x1 1 11FFFF2x112x1
61 Programming FPGAs During programming After programming 11CLB is programmed to implement full adder!FFEasily extended to program entire FPGAFF12x112x1
62 Problem: Reconfiguring FPGA is slow Programming FPGAsProblem: Reconfiguring FPGA is slowShifting in 1 bit at a time not efficientBitfiles can be greater than 1 MBEliminates one of the main advantages of RCPartial reconfigurationWith shift registers, entire FPGA has to be reconfigured
64 CPLD vs FPGA Simpler interconnect structure Timing performance more predictable than FPGAs.Density is less than most FPGAsCPLDs feature logic resources with a wide number of inputs (AND planes)Performance is usually better than FPGAsA single FPGA can replace tens of normal PLDsPrimitive FPGA 'logic cells' are more complex than PLD cells.Can program the routing between FPGA logic cells in addition to programming the logic cells themselves.Many FPGAs now offer embedded memory blocks in addition to logic blocks or other special features such as fast carry logic chains.FPGAs offer a higher ratio of flip-flops to logic resources than do CPLDs.