Presentation is loading. Please wait.

Presentation is loading. Please wait.

Reconfigurable Computing Nehir Sönmez 25-11-2004.

Similar presentations


Presentation on theme: "Reconfigurable Computing Nehir Sönmez 25-11-2004."— Presentation transcript:

1 Reconfigurable Computing Nehir Sönmez 25-11-2004

2 Reconfigurable Computing Standard Definition: A reconfigurable computer is a device which computes by using post-fabrication spatial components of compute elements. [Dehon] FPGA implementation of a processor core to run a program is excluded - not spatial mapping of problem. ASIC implementations excluded – not postfabrication programmable. The definition restricts RC to mapping to fine- grained devices (such as FPGAs). Whereas General Purpose computers compute by making connections in time.

3 What is Reconfigurable Computing? Computation using hardware that can adapt at the logic level to solve specific problems Why is this interesting? –Some applications are poorly suited to microprocessor. –VLSI “explosion” provides increasing resources. –Hardware/Software –Relatively new research area.

4 Spatial Computation Example: grade = 0.2 × mt1 + 0.2 × mt2 + 0.2 × mt3 + 0.4 × project; A hardware resource (multiplier or adder) is allocated for each operator in the compute graph. The abstract computation graph becomes the implementation template.

5 Temporal Computation A hardware resource is time-multiplexed to implement the actions of the operators in the compute graph. Close to a sequential processor/software solution. Many inbetween cases exist.

6 Why is Custom Logic Faster Than Software? Spatial vs. Temporal Computation –Processors divide computation across time, dedicated logic divides across space

7 Why is Custom Logic Faster Than Software? Specialization –Instruction set may not provide the operations your program needs –Processors provide hardware that may not be useful in every program or in every cycle of a given program Multipliers Dividers Instruction Memory –Processors need lots of memory to hold the instructions that make up a program and to hold intermediate results. Bit Width Mismatches –In general, processors have a fixed bit width, and all computations are performed on that many bits Multimedia vector instructions (MMX) a response to this

8 Microprocessor-based Systems –Generalized to perform many functions well. –Operates on fixed data sizes. –Inherently sequential. Data Storage (Register File) ALU ABC 64

9 Reconfigurable Computing –Create specialized hardware for each application. –Functional units optimized to perform a special task. If (A > B) { H = A; L = B; } Else { H = B; L = A; } Functional Unit A B H L

10 Dataflow Superscalar must find dataflow graph at run time RC constructs data flow graph at compile time no logic control overhead no window size limitations

11 Implementation Spectrum –ASIC gives high performance at cost of inflexibility. –Processor is very flexible but not tuned to the application. –Reconfigurable hardware is a nice compromise. MicroprocessorReconfigurable Hardware ASIC

12 Flexibility vs Data-Processing Rate

13

14 Field-Programmable Gate Array –Each logic element outputs one data bit. –Interconnect programmable between elements. –Interconnect tracks grouped into channels. LE Logic Element Tracks

15 FPGA Architecture Issues –Need to explore architectural issues. –How much functionality should go in a logic element? –How many routing tracks per channel? –Switch “population”? Logic Element

16 Real World Physical Issues –Modelling FPGA delay. –Improving performance through buffering/segmentation. –Technology dependent. –The cost of reconfigurability. SS Wires have real cost

17 Translating a Design to an FPGA –CAD to translate circuit from text description to physical implementation well understood. –CAD to translate from C program to circuit not well understood. –Very difficult for application designers to successfully write high-performance applications C program. C = A+B. Circuit A B +C Array Need for design automation!

18 High-level Compilers –Difficult to estimate hardware resources. –Some parts of program more appropriate for processor (hardware/software codesign). –Compiler must parallelize computation across many resources. –Engineers like to write in C rather than pushing little blocks around. C = A+B AB + C for (i = 0; i<n, i++) {. }

19 Reconfigurable Hardware –Each logic element operates on four one-bit inputs. –Output is one data bit. –Can perform any boolean function of four inputs 2 = 64K functions! Logic Element A B C D Out A B C D = out 2 4

20 Basic Logic Block Architecture

21 Xilinx - Spartan II Architecture IOBs provide the interface between the package pins and the internal logic CLBs provide the functional elements for constructing most logic Dedicated block RAM memories of 4096 bits each Clock DLLs for clockdistribution delay compensation and clock domain control Versatile multi-level interconnect structure

22 Spartan II Configurable Logic Block LUT capacity is completely determined by the number of inputs, not the complexity Basic block is a logic cell (LC) – A 4-input function generator (LUT), – Carry logic – storage element. Each CLB contains – four LCs, organized in two similar slices. – logic that combines function generators to provide functions of five or six inputs.

23 Spartan II CLB

24 Example: Two Bit Adder FA AB CoCo CiCi S Made of Full Adders A+B = D Logic synthesis tool reduces circuit to SOP form C o = ABC i + ABC i + ABC i + ABC i S = ABC i + ABC i + ABC i + ABC i LUT CoCo CiCi B A S CiCi B A

25 Circuit Compilation 1.Technology Mapping 2.Placement 3.Routing LUT ? Assign a logical LUT to a physical location. Select wire segments And switches for Interconnection.

26 Processor + FPGA 1. FPGA serves as coprocessor for data intensive applications. Three possibilities Backplane bus (e.g. PCI) Proc chip daughtercard FPGA chip FPGA Proc 2. FPGA serves as embedded computer for low latency transfer. “Reconfigurable Functional Unit”

27 Processor + FPGA (cont..) –FPGA logic embedded inside processor. –A number of problems with 2 and 3. Process technology an issue. ALU much faster than FPGA generally. FPGA much faster than the entire processor. RF ALU FPGA Processor 3. Processor integration

28 Multi-FPGA Systems –Most applications don’t fit on one device. –Create need for partitioning designs across many devices. –Effectively a “netlist computer” Each FPGA is a logic processor interconnected in a given topology. F F F F F F F F F

29 Xilinx XC4000 Cell –2 4-input look-up tables –1 3-input look-up table –2 D flip flops

30 Altera Flex10K

31 Xilinx Virtex CLB

32 Reconfiguration Reconfiguration methodology Static Partially static (=partial reconfiguration) Dynamic

33 The Design Process 1.Partition a program into sections to be implemented on hardware and software separately 2.Synthesize the computations destined for reconfigurable hardware into gate-level or circuit level description. 3.Map the circuit onto reconfigurable blocks and connect them using reconfigurable routing. 4.After compilation, the circuit is ready for configuration onto the hardware at runtime.

34 RC Objectives RC objectives: Specialization, performance, flexibility Basic idea: “Programmable Hardware”  Specialization l  Performance l  Power consumption l  Flexibility l  Programming

35 Routing strategies Reconfigurable Devices Reconfigurable Computing A B C A B C Continuous Routing Structured Routing

36 Xilinx XC4000 Routing 25

37 By including reconfigurability we can increase flexibility with high specialization Reconfigurable Instruction Set Processors ProcessorPLD Reconfigurable Processor

38 Coprocessor based approach ASIP based approach Reconfigurable Instruction Set Processors · · · Task 1 Task K · · · Task K+1Task N SoftwareHardware Task 1Task 2Task N Software Hardware · · ·

39 Typical example: CPU + PCI board –Altera ARC-PCI –Compaq Pamette System on Chip (SoC) –Altera´s Excalibur device –Chameleon Systems, Inc. Coprocessor based approach (I) Reconfigurable Instruction Set Processors

40 Altera ARC-PCI Coprocessor based approach (II) Reconfigurable Instruction Set Processors

41 Compaq Pamette Coprocessor based approach (III) Reconfigurable Instruction Set Processors

42 Altera´s Excalibur device –Embedded Processor: ARM, MIPS or NIOS Coprocessor based approach (IV) Reconfigurable Instruction Set Processors

43 Chameleon Systems, Inc. Coprocessor based approach (V) Reconfigurable Instruction Set Processors

44 Reconfigurable unit within CPU ASIP based approach (I) Reconfigurable Instruction Set Processors Fetch Decode Issue Integer Unit FP Unit Branch Unit LD/ST Unit Reconfigurable Unit

45 Challenge: CAD tools ASIP based approach (II) Reconfigurable Instruction Set Processors C Code Compiler Assembly Code Instruction Description (Configuration bits)

46 ASIP based approach (III) Reconfigurable Instruction Set Processors C Parsing Optimizations Inst. Identification Inst. Selection Config. Scheduling Code Generation C Code Assembly Code Hardware Generation Configuration bits Hardware Estimator Compiler Structure

47 Example: Philips CinCISe Architecture ASIP based approach (II) Reconfigurable Instruction Set Processors Encoded Instruction Word Register File ALU RFU MUX 5 5 5 4 32

48 Why Compute With FPGAs? Huge performance gap between software and hand-designed hardware systems –Often 100-to-1 ratio of performance or performance/area Hardware systems not so good for general computing –Big design, cost barriers to implementation –Not practical to buy a new machine every time you want to run a different program Reconfigurable systems offer best-of-both-worlds –Run-time programmability –Hardware-level performance

49 Good Applications for Reconfigurable Computing Relatively small application graph –FPGAs have limited capacity –Simple control flow helps a lot Data Parallelism –Execute same computations on many independent data elements –Pipeline computations through the hardware Small and/or varying bit widths –Take advantage of the ability to customize the size of operators

50 Reconfigurable Computing Successes RSA Decryption –Programmable-Active-Memory machine set record for decryption of RSA-encrypted data DNA Sequence Matching –Reconfigurable hardware has achieved 100x better performance than contemporary supercomputers Signal Processing –FPGA-based filters often get 10x better performance than DSP chips –Benefit from customization of hardware to the application Emulation –Use reconfigurable logic to simulate new processors at high speeds Cryptographic Attacks –High-performance low-cost implementations for breaking encryption algorithms

51 FPGAs vs CPUs Capacity: Instructions are very dense representation, logic blocks aren’t Tools: Compilers for reconfigurable logic aren’t very good –Some operations are hard to implement on FPGAs One approach to capacity is to exploit the 90-10 rule of software –Run the 90% of code that takes 10% of execution time on a conventional processor –Run the 10% of code that takes 90% of execution time on reconfigurable logic Programmable-reconfigurable processors

52 Fine-Grained System: CHIMERAE Treat reconfigurable array as ALU within superscalar –Array implements some number of custom instructions for each program –Register file is interface between programmable and reconfigurable

53 CHIMERAE Programmed in C –Instruction combining –Control localization –SIMD Within a Register Simulation Studies –Example applications only require 8 RFUOPs in the reconfigurable array –Equivalent to 32 rows in RFU Performance Results –Vary strongly from application to application –Also dependent on model used for RFU delay –Average speedup of 20-30%, one application sees >2x improvement

54 Coarse-Grained System: Garp Small programmable processor with large reconfigurable array –Interface through memory system

55 Garp Again, Programmed in C –Compiler attempts to map loop nests onto the reconfigurable array Data Encryption Standard –Estimate 24x speedup over UltraSPARC Image Dithering –9x Speedup Sorting –2x Speedup

56 Advantages of RC Relative to microprocessors: on average a higher percentage of peak (or raw) computational density is achieved with reconfigurable devices. Fine-grain flexibility leads to exploitation of problem specific parallelism at many levels. Also, many different computation models (or patterns) can be supported. In general, it is possible to match problem characteristics to hardware, through the use of problem specific architectures and low-level circuit specialization. Spatial mapping of computation versus multiplexing of function units (as in processors) relieves pressure for memory capacity, BW, and low-latency and local communication patterns. Modern FPGAs make good system-level components: Relatively large number of IOs (many parallel memory ports) High- BW communications. Machines based on these components can easily scale peak performance by riding Moore’s curve (FPGAs are process drivers). Low-level redundancy permits fault-tolerance and great cost savings. Built-in microprocessors. Is there still room for research in novel devices for RC?

57 Advantages of RC Even in an application with fixed algorithms, reconfigurable devices may offer advantages over a full-custom or ASIC approach: FPGAs are processes drivers, therefore a generation ahead of ASIC. Increasing NREs for ASIC and full-custom has pushed "cross-over" point way out. Time to market advantage. Programmability leads to: project risk management extended product life-times Dynamic reconfiguration might permit even higher efficiency through hardware sharing (multiplexing) and on the fly circuit specialization. Largely unexploited (unproven) to date. A few research projects have explored this idea.

58 RC Disadvantages Reconfiguration time might be critical in run-time reconfigurable systems. Low utilization of hardware resources in configurable systems.

59 FPGAs are Reconfigurable 1. Commercial applications have not taken advantage of reconfigurability. Xilinx/Altera haven’t done much to help. Methodologies/tools nearly nonexistent. 2. Volume/cost graphs don’t accurately capture the potential real costs and other advantages. Re configuration uses: Field upgrades. product life extension, changing requirements. In system board-level testing and field diagnostics. Tolerance to manufacturing faults. Risk-management in system development. Runtime reconfiguration -- higher silicon efficiency. Time-multiplexed pre-designed circuits take maximum use of resources. Runtime specialized circuit generation.

60 Silicon Usage

61 Performance: ~10x Speedup Efficiency: ~10x Lower Chip Costs: ~0.5x --increased yield Decreased complexity Decreased design cost


Download ppt "Reconfigurable Computing Nehir Sönmez 25-11-2004."

Similar presentations


Ads by Google