Reconfigurable Architectures

1 Reconfigurable Architectures
Andrea Lodi

2 SoC trends
Increasing mask cost (~ 3M$)
Increasing design complexity
Increasing design time (~ 3M$)
Rapidly changing communication standards
Low-power design in wireless environment
Increasing algorithmic complexity requirements

3 Time-to-market
[Figure: product life cycle, sales vs. time, through growth, maturity and decrease. Two sales curves are compared, one with time-to-market met and one with time-to-market failed, with the loss marked.]

4 Trends in wireless systems
Increased on-chip transistor density (Moore’s law)
Increased design complexity
Increased algorithmic complexity
Low battery capacity growth
Demand for reusability and flexibility
Demand for high performance and energy efficiency
[Figure: algorithm complexity, Moore’s law and battery capacity over 1997-2009; millions of transistors per chip and technology (nm) over 1997-2009.]

5 Digital architecture design space

6 Parallelism in computation
Thread level parallelism Instruction level parallelism (ILP) Pipeline (loop level) Fine-grain parallelism (bit/byte-level)

7 Instruction level parallelism
[Figure: data-flow graph of an expression over inputs b, c, d, e built from +, - and *3 operators, shown as an ASIC implementation.]

8 Spatial vs. Temporal Computing
Temporal (processor): (Ax + B)x + C
Spatial (ASIC): Ax² + Bx + C
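The two forms on this slide can be sketched in C. This is only an illustration, assuming the factored form is the one a processor would step through sequentially and the expanded form is the one laid out in space; the function names are hypothetical and do not come from the slides.

```c
#include <stdint.h>

/* Temporal style: (A*x + B)*x + C.
   Each step depends on the previous one, so a single multiplier and a
   single adder can be reused over successive cycles. */
int32_t poly_temporal(int32_t a, int32_t b, int32_t c, int32_t x)
{
    int32_t acc = a;
    acc = acc * x + b;   /* cycle 1: A*x + B */
    acc = acc * x + c;   /* cycle 2: (A*x + B)*x + C */
    return acc;
}

/* Spatial style: A*x*x + B*x + C.
   x*x and B*x do not depend on each other, so dedicated hardware could
   compute them side by side before the final additions. */
int32_t poly_spatial(int32_t a, int32_t b, int32_t c, int32_t x)
{
    int32_t x2  = x * x;   /* multiplier 1 */
    int32_t bx  = b * x;   /* multiplier 2, independent of x2 */
    int32_t ax2 = a * x2;  /* multiplier 3 */
    return ax2 + bx + c;   /* adders */
}
```

Both functions return the same value for inputs that do not overflow; the difference is how many operators are needed and how many of them can work at once.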

9 Superscalar/VLIW processors
FU limitations Register file size limitation Crossbar inefficiency

10 Byte-level parallelism in processors
MMX technology: 57 new instructions Byte and half word parallel computation SIMD execution model
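As a rough illustration of byte-parallel SIMD computation, the sketch below adds four packed bytes held in one 32-bit word using portable C. It does not use MMX instructions or intrinsics; the function name and the 32-bit packing are assumptions made for the example.

```c
#include <stdint.h>

/* Add four packed unsigned bytes lane by lane.
   Each lane wraps around on overflow and no carry crosses into the
   neighbouring lane, which is the behaviour of a packed byte add. */
uint32_t add_bytes_packed(uint32_t a, uint32_t b)
{
    /* Add the low 7 bits of every lane; the cleared top bit absorbs the carry. */
    uint32_t low = (a & 0x7F7F7F7Fu) + (b & 0x7F7F7F7Fu);
    /* Fold in the top bit of every lane with XOR, discarding its carry-out. */
    return low ^ ((a ^ b) & 0x80808080u);
}
```

For example, adding 0x01FF7F10 and 0x0101017F gives 0x0200808F: the second lane wraps from 0xFF to 0x00 without disturbing the first lane.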

11 Bit-level parallelism
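unsigned reverse(unsigned v) {       /* reverse the low WIDTH bits of v */
    unsigned r = 0;
    for (int x = 0; x < WIDTH; x++) {
        r = (r << 1) | (v & 1);      /* make room, then append the next bit of v */
        v = v >> 1;
    }
    return r;
}

int popcount(unsigned v) {           /* count the bits set in v */
    int r = 0;
    while (v) {
        if (v & 1) r++;
        v = v >> 1;
    }
    return r;
}

[Figure: hardware datapaths for the two functions, from v to r.]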

12 Pipeline parallelism
for (j = 0; j < MAX; j++) b[j] = popcount(a[j]);
[Figure: popcount adder tree from v to r, shown without and with pipeline registers between the adder stages.]

13 FPGA
An FPGA (Field-Programmable Gate Array) is composed of 2 elements:
Array of CLBs (configurable logic blocks), each composed of:
  One or a few small LUTs (4:1 or 3:1)
  Control logic: muxes controlled by configuration bits
  Dedicated computational logic (carry chain …)
Configurable routing network connecting the CLBs, composed of:
  Wires of different lengths
  Connection blocks connecting CLBs to the routing network
  Switch blocks connecting routing wires
The LUT contents and the configuration bits that program the CLBs and the routing network make up the FPGA configuration, which determines the function implemented.
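To make the role of LUT configuration bits concrete, here is a minimal software model of a 4-input, 1-output LUT. It is a sketch for illustration only: the type and function names are hypothetical, and a real CLB adds muxes, carry logic and registers around the LUT.

```c
#include <stdint.h>

/* A 4-input LUT holds 16 configuration bits: one output bit for each
   possible combination of the four inputs. */
typedef struct {
    uint16_t truth_table;
} lut4_t;

/* Evaluate the LUT: the inputs form a 4-bit index that selects one
   configuration bit. */
int lut4_eval(lut4_t lut, unsigned in3, unsigned in2, unsigned in1, unsigned in0)
{
    unsigned index = ((in3 & 1u) << 3) | ((in2 & 1u) << 2) |
                     ((in1 & 1u) << 1) |  (in0 & 1u);
    return (int)((lut.truth_table >> index) & 1u);
}
```

Loading the truth table 0x6996 turns the LUT into a 4-input XOR, while 0x8000 turns the same hardware into a 4-input AND; changing the function only means changing the configuration bits.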

14 Configurable logic block

15 Xilinx CLB
Xilinx CLB, 4000 series:
11 input bits, 4 output bits
3 LUTs
Carry logic
2 output registers

16 Configurable routing network

17 Example

18 Density Comparison

19 FPGA vs. Processor
FPGA (computing in space)          Processor (computing in time)
Parallel execution                 Sequential execution
Configurable in cycles             Programmable every cycle
Fine-grained data                  Fixed-size operands
Application specific operators     Basic operators (ALU)
Large area (switches, SRAM)        Compact
Entire applications don’t fit      Handles complex control flow
Slow synthesis, P&R tools          Fast compilers

20 Reconfigurable processors
But: 90% of execution time is spent in computational kernels:
FPGAs x speed-up over processors
FPGAs x denser than processors (bit-ops/λ²s)
Reconfigurable processor: RISC + FPGA
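The 90% figure bounds what accelerating the kernels alone can achieve. The sketch below applies Amdahl's law, which the slide does not name but which is the standard way to reason about this; the kernel speed-up values used in the note after the code are illustrative assumptions, not numbers from the slides.

```c
/* Amdahl's law: overall speed-up when only a fraction of the execution
   time (the computational kernels) is accelerated.
   kernel_fraction: share of execution time spent in kernels, e.g. 0.9
   kernel_speedup:  speed-up obtained on that share alone */
double overall_speedup(double kernel_fraction, double kernel_speedup)
{
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup);
}
```

With 90% of the time in kernels, a 10x kernel speed-up yields about 5.3x overall, and no kernel speed-up, however large, can push the total beyond 10x, because the remaining 10% still runs on the processor.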

21 Reconfigurable processor architecture
Hybrid architectures: RISC processor FPGA

22 Computational models
RC array used as:
IO processor/interface logic
Attached processor: PipeRench, T-Recs
ISA extension
  Function unit: PRISC, OneChip, Chimaera
  Coprocessor: Garp, NAPA, Molen

23 IO Processor/Interface Logic
Case for:
  There is always some system adaptation to do
  Modern chips have the capacity to hold processor + glue logic
  Reduces part count
  Glue logic varies
  Many protocols and services, but only a few are needed at a time
Logic used in place of:
  ASIC environment customization
  External FPGA/PLD devices
Looks like an IO peripheral to the processor
Examples:
  Protocol handling
  Stream computation: compression, encryption
  Peripherals: sensors, actuators

24 Example: Interface/Peripherals
Triscend E5

25 Instruction Set Extension
Instruction bandwidth
A processor can only describe a small number of basic computations in a cycle: $I$ bits → $2^I$ operations
This is a small fraction of the operations one could do, even in terms of $w \times w \to w$ ops: $w\,2^{2^{2w}}$ operations
A processor could have to issue $w\,2^{(2^{2w} - I)}$ operations just to describe some computations
An a priori selected base set of functions could be very bad for some applications

26 Instruction Set Extension
Idea: provide a way to augment the processor’s instruction set with operations needed by a particular application

27 Architectural Models for I.S.A extension
PLEIADES (Zhang et al, 2000)
  CPU surrounded by a collection of application-specific custom computing devices
  High performance
  Overdesigned for most applications
  Difficult to program
XTENSA (Tensilica Inc, 2002)
  RISC CPU featuring application-specific function units optionally inserted in the processor pipeline
  Good performance
  Easy to program
  Configured at mask level

28 Dynamic ISA Extension models
Standard processor coupled with embedded programmable logic where application specific functions are dynamically re-mapped depending on the performed algorithm 1: Coprocessor model 2: Function unit model

29 Coprocessor model: Garp
Explicit instructions moving data to and from the array
High communication overhead (long latency array operations)
Processor stalled each time the array is active
Array performs at TASK level (very coarse grain)
10-20x on stream, feed-forward operations
2-3x when data dependencies limit pipelining
Callahan, Hauser, Wawrzynek, 2000

30 Function unit model: Prisc
Array fits in the RISC pipeline
No communication overhead
Some degree of parallelism between function units
Gate array performs combinational instructions ONLY (very fine grain)
Low speed-up figures (2x/3x)
Razdan, Smith 1994

31 Function Unit Model: pros
No communication overhead:
  Strict synergy between FPGA and other function units
  FPGA can be used frequently even for small functions
Small reconfigurable array area
Flow control handled by the core
Memory access handled by the core
Easy instruction set extension
Configuration streams compiled from C

32 Extendible instruction set RISC architecture
32-bit load/store RISC architecture (5-stage pipeline)
Set of specialized functional units
  Multiply/MAC unit
  Branch/decrement unit
  ALU featuring “MMX” byte-wide concurrent operations
VLIW elaboration
  Concurrent fetch and execution of two 32-bit instructions per cycle
  Fully bypassed, to minimize pipeline stalls (average of 10/20% for most computational cores)
Embedded reconfigurable device for dynamic ISA extension
  DSP-oriented reconfigurable functional unit (PiCoGA)
  Fully configurable at execution time
  Elaboration and configuration controlled by asm instructions inserted in C source code
  PiCoGA used as a programmable data-path with independent pipeline structure

33 XiRisc Architecture

34 Dynamic Instruction Set Extension

35 Dynamic Instruction Set Extension
[Figure: instruction stream (…, pgaload, pgaop $3,$4,$5, …, add $8,$3) shown together with the register file and the configuration memory.]

36 PiCoGA Architecture
PiCoGA (Pipelined Configurable Gate Array):
Embedded datapath for dynamic ISA extension
Dynamically reconfigurable
Structured in rows activated in data-flow fashion by the PiCoGA control unit
Can hold a state
pGA-op latency depends on the specific mapped function
Functionality is determined from the DFG extracted from C code
[Figure: processor interface, PiCoGA control unit and PicoRows (synchronous elements).]

37 Pico-cell Description
4x32-bit input data from the register file
2x32-bit output data to the register file
12 global lines to/from the register file
[Figure: Pico-cell (RLC) with input logic, 16x2 LUT, carry chain, output logic and registers, enable (EN), PiCoGA control unit signals, configuration bus, loop-back, connect block and switch.]

38 Computing on PiCoGA
[Figure: a data flow graph and its mapping onto the array for two operations, Pga_op1 and Pga_op2, each shown between data in and data out under the PiCoGA control unit.]

39 Multi-context Array
Four configuration planes are available, one of them executing
While a plane is executing another may be reconfigured → no reconfiguration time overhead
Plane switch takes just 1 clock cycle
[Figure: PiCoGA with a configuration cache holding Func. 1, Func. 2, Func. 3, Func. 4, …, Func. n.]
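The plane behaviour described on this slide can be modelled in a few lines of C. This is a toy bookkeeping model under stated assumptions, not the PiCoGA control logic: all type and function names are hypothetical, and the only facts taken from the slide are the four planes, the background reload of non-executing planes, and the one-cycle switch.

```c
#include <stdbool.h>

#define NUM_PLANES  4      /* four configuration planes, as on the slide */
#define PLANE_EMPTY (-1)

typedef struct {
    int function_id[NUM_PLANES]; /* function held by each plane, or PLANE_EMPTY */
    int active;                  /* plane currently executing */
    unsigned long switch_cycles; /* cycles spent switching planes */
} context_array_t;

void context_init(context_array_t *c, int first_function)
{
    for (int i = 0; i < NUM_PLANES; i++)
        c->function_id[i] = PLANE_EMPTY;
    c->function_id[0] = first_function;
    c->active = 0;
    c->switch_cycles = 0;
}

/* Reconfigure a plane in the background. The executing plane is refused,
   so execution is never interrupted by a reload. */
bool plane_load(context_array_t *c, int plane, int function_id)
{
    if (plane < 0 || plane >= NUM_PLANES || plane == c->active)
        return false;
    c->function_id[plane] = function_id;
    return true;
}

/* Make an already-configured plane the executing one: 1 clock cycle. */
bool plane_switch(context_array_t *c, int plane)
{
    if (plane < 0 || plane >= NUM_PLANES || c->function_id[plane] == PLANE_EMPTY)
        return false;
    c->active = plane;
    c->switch_cycles += 1;
    return true;
}
```

In this model the cost visible to the running program is only the switch cycle; the time needed to fill a plane is hidden as long as the load is issued while another plane executes.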

40 Architecture Flexibility
[Figure: decision flowchart with yes/no branches. Questions, in order:
Parallelism to exploit? (Ex: Turbo decoding, motion estimation)
Bit-level operations? (Ex: DES, Reed-Solomon)
MAC intensive? (Ex: FFT, scalar product)
Memory intensive? (Ex: DCT, motion estimation)
Outcomes: speed-up from pGA (5x – 100x); speed-up from DSP instructions and VLIW (1.5x – 2x).]
Improvements for a large number of data and signal processing algorithms

41 Programming XiRisc: Restrictions
Fixed-point algorithms
Variable size specification at the bit level
Not supported yet:
  Dynamic memory allocation
  Math library
  Operating system

42 XiRisc Compilation Flow
[Figure: compilation flow with the following blocks: File.c, C compiler, profiler, software simulation, PiCoGA configurator, PiCoGA-op configuration bit stream, configuration library.]

43 Example: Motion Estimation
Sum of Absolute Difference (SAD) - High instruction-level and inter-iteration parallelism

44 Data Flow Graph
Pixel-pixel absolute difference: Abs(p1[i] – p2[i]), where p1[i] and p2[i] are pixels
The absolute differences feed a sum tree
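For reference, the sum of absolute differences described here can be written as a plain C loop. This is a software sketch only, assuming 8-bit pixels; the function name and signature are hypothetical and this is not the PiCoGA mapping.

```c
#include <stddef.h>
#include <stdint.h>

/* Sum of absolute differences between two blocks of n 8-bit pixels. */
uint32_t sad(const uint8_t *p1, const uint8_t *p2, size_t n)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) {
        int diff = (int)p1[i] - (int)p2[i];          /* pixel-pixel difference */
        sum += (uint32_t)(diff < 0 ? -diff : diff);  /* accumulate |difference| */
    }
    return sum;
}
```

Every absolute difference is independent of the others, and the additions can be regrouped into a tree, which is where the instruction-level and inter-iteration parallelism comes from.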

45 Sum of Absolute Difference
[Figure: operands from the register file feed absolute-difference units AD1 to AD4 and a SAD stage; SAD8 results are written back to the register file.]

46 Latency and Issue Delay
[Figure: tool flow with the following blocks: high-level C compiler, DFG-based description, Griffy compiler, mapping, place & route, configuration bits, and an emulation function with latency and issue delay.]

47 Performance evaluation
Emulation function
Latency and issue-delay back-annotation
Profiling
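One way back-annotated latency and issue delay could feed a cycle estimate is sketched below. It assumes that latency is the number of cycles from issuing an operation to its result, and that issue delay is the minimum number of cycles between two successive issues of the same operation; the slides do not define the terms, and the function names are hypothetical.

```c
/* Cycles to complete n_ops back-to-back operations on a pipelined unit:
   the first result appears after `latency` cycles, and each further
   operation adds `issue_delay` cycles. */
unsigned long pipelined_cycles(unsigned long n_ops,
                               unsigned long latency,
                               unsigned long issue_delay)
{
    if (n_ops == 0)
        return 0;
    return latency + (n_ops - 1) * issue_delay;
}

/* Same work when each operation must finish before the next one starts. */
unsigned long serial_cycles(unsigned long n_ops, unsigned long latency)
{
    return n_ops * latency;
}
```

With an illustrative latency of 5 cycles and an issue delay of 1, 16 operations take 20 cycles instead of 80, so for long loops the issue delay matters more than the latency.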

48 Motion Estimation: Results
16 SAD operations in parallel
PiCoGA occupation: ~100%
Speed-up: 7x (with respect to standard XiRisc)
MPEG preliminary result: H.261 standard, QCIF (176x144): 10 frames/sec

49 Reed-Solomon Encoder: Results
Encoder RS(15,9): 4-bit symbols
  PiCoGA occupation: ~25%
  Speed-up: 37x
  Throughput: 70.6 Mb/sec
Encoder RS(255,239), widely used: 8-bit symbols
  PiCoGA occupation: ~60%
  Speed-up: 135x
  Throughput: Mb/sec
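To show the kind of bit-level arithmetic an RS(15,9) encoder performs, here is a software sketch of a systematic encoder over GF(16). The slides do not give the field polynomial or the generator roots, so the choices below (field polynomial x^4 + x + 1, roots α^1 to α^6 with α = 2) are assumptions, and every name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

#define RS_N 15                    /* codeword length in 4-bit symbols */
#define RS_K 9                     /* message length in 4-bit symbols */
#define RS_PARITY (RS_N - RS_K)    /* 6 parity symbols */

/* Multiply in GF(16): shift-and-XOR, reduced modulo x^4 + x + 1 (assumed). */
uint8_t gf16_mul(uint8_t a, uint8_t b)
{
    uint8_t p = 0;
    for (int i = 0; i < 4; i++) {
        if (b & 1u) p ^= a;
        b >>= 1;
        a <<= 1;
        if (a & 0x10u) a ^= 0x13u;   /* reduce when degree reaches 4 */
    }
    return p & 0x0Fu;
}

/* Build g(x) = (x + a^1)(x + a^2)...(x + a^6); g[j] is the x^j coefficient. */
void rs_generator(uint8_t g[RS_PARITY + 1])
{
    uint8_t root = 2;                /* alpha^1 (assumed first root) */
    memset(g, 0, RS_PARITY + 1);
    g[0] = 1;
    for (int deg = 0; deg < RS_PARITY; deg++) {
        for (int j = deg + 1; j > 0; j--)      /* multiply g(x) by (x + root) */
            g[j] = g[j - 1] ^ gf16_mul(g[j], root);
        g[0] = gf16_mul(g[0], root);
        root = gf16_mul(root, 2);
    }
}

/* Systematic encoding: codeword = message followed by the remainder of
   message * x^6 divided by g(x). Index 0 is the highest-degree symbol. */
void rs_encode(const uint8_t msg[RS_K], uint8_t codeword[RS_N])
{
    uint8_t g[RS_PARITY + 1];
    uint8_t rem[RS_PARITY] = { 0 };  /* rem[j] is the x^j coefficient */
    rs_generator(g);
    for (int i = 0; i < RS_K; i++) {
        uint8_t feedback = (msg[i] & 0x0Fu) ^ rem[RS_PARITY - 1];
        for (int j = RS_PARITY - 1; j > 0; j--)
            rem[j] = rem[j - 1] ^ gf16_mul(feedback, g[j]);
        rem[0] = gf16_mul(feedback, g[0]);
        codeword[i] = msg[i] & 0x0Fu;
    }
    for (int j = 0; j < RS_PARITY; j++)
        codeword[RS_K + j] = rem[RS_PARITY - 1 - j];
}

/* Evaluate the codeword polynomial at x; a valid codeword gives 0 at each root. */
uint8_t rs_syndrome(const uint8_t codeword[RS_N], uint8_t x)
{
    uint8_t acc = 0;
    for (int i = 0; i < RS_N; i++)
        acc = gf16_mul(acc, x) ^ codeword[i];
    return acc;
}
```

The inner loops consist only of shifts, XORs and 4-bit multiplications, operations that are awkward on a 32-bit ALU but map naturally onto LUT-based logic.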

50 Speed-up and Power Consumption
Algorithm            Energy consumption reduction (vs. std. XiRisc)   Speed-up
DES encryption       89%                                              13.5x
Turbo decoder        75%                                              11.7x
Motion prediction    46%                                              4.5x
Median filter        60%                                              7.7x
CRC                  49%                                              4.3x

