2
JR.S00 1 Lecture 13: (Re)configurable Computing
Prof. Jan Rabaey
Computer Science 252, Spring 2000
The major contributions of André DeHon to this slide set are gratefully acknowledged.
3
JR.S00 2 Computers in the News …
TI announces 2 new DSPs
C64x
  – Up to 1.1 GHz
  – 9 billion operations/sec
  – 10x performance of C62x
  – 32 full-rate DSL modems on a single chip!
C55x
  – 0.05 mW/MIPS (20 MIPS/mW!)
  – Cuts power consumption of C54x by 85%
  – 5x performance of C54x
4
JR.S00 3 C64x
5
JR.S00 4 Enhanced performance for communications and multimedia
6
JR.S00 5 From the C54x core …
7
JR.S00 6 To the C55x
8
JR.S00 7 Leading to higher energy efficiency (?)
9
JR.S00 8 Evaluation metrics for Embedded Systems
Flexibility
Power
Cost
Performance as a Functionality Constraint (“Just-in-Time Computing”)
Components of Cost
  – Area of die / yield
  – Code density (memory is the major part of die size)
  – Packaging
  – Design effort
  – Programming cost
  – Time-to-market
  – Reusability
10
JR.S00 9 Special Instructions for Specific Applications
11
JR.S00 10 What is Configurable Computing?
Spatially-programmed connection of processing elements
“Hardware” customized to specifics of problem
Direct map of problem-specific dataflow, control
Circuits “adapted” as problem requirements change
12
JR.S00 11 Spatial vs. Temporal Computing Spatial Temporal
13
JR.S00 12 Defining Terms
Fixed Function:
  – Computes one function (e.g. FP-multiply, divider, DCT)
  – Function defined at fabrication time
Programmable:
  – Computes “any” computable function (e.g. Processor, DSPs, FPGAs)
  – Function defined after fabrication
Parameterizable Hardware:
  – Performs limited “set” of functions
14
JR.S00 13 “Any” Computation? (Universality)
Any computation which can “fit” on the programmable substrate
Limitations: the substrate must hold the entire computation and its intermediate data
Recall size/fit constraint
15
JR.S00 14 Benefits of Programmable
Non-permanent customization and application development after fabrication
  – “Late Binding”
Economies of scale (amortize large, fixed design costs)
Time-to-market (evolving requirements and standards, new ideas)
Disadvantages
Efficiency penalty (area, performance, power)
Correctness verification
16
JR.S00 15 Spatial/Configurable Benefits
10x raw density advantage over processors
Potential for fine-grained (bit-level) control; can offer another order of magnitude benefit
Locality!
Spatial/Configurable Drawbacks
Each compute/interconnect resource dedicated to single function
Must dedicate resources for every computational subtask
Infrequently needed portions of a computation sit idle --> inefficient use of resources
17
JR.S00 16 Density Comparison
18
JR.S00 17 Processor vs. FPGA Area
19
JR.S00 18 Processors and FPGAs
20
JR.S00 19 Early RC Successes
Fastest RSA implementation is on a reconfigurable machine (DEC PAM)
Splash2 (SRC) performs DNA sequence matching at 300x Cray2 speed, and 200x a 16K CM2
Many modern processors and ASICs are verified using FPGA emulation systems
For many signal processing/filtering operations, single-chip FPGAs outperform DSPs by 10-100x
21
JR.S00 20 Issues in Configurable Design
Choice and Granularity of Computational Elements
Choice and Granularity of Interconnect Network
(Re)configuration Time and Rate
  – Fabrication time --> Fixed function devices
  – Beginning of product use --> Actel/Quicklogic FPGAs
  – Beginning of usage epoch --> (Re)configurable FPGAs
  – Every cycle --> traditional Instruction Set Processors
22
JR.S00 21 The Choice of the Computational Elements
Reconfigurable Logic: Bit-Level Operations, e.g. encoding
Reconfigurable Datapaths: Dedicated data paths, e.g. Filters, AGU
Reconfigurable Arithmetic: Arithmetic kernels, e.g. Convolution
Reconfigurable Control: RTOS, Process management
23
JR.S00 22 FPGA Basics
LUT for compute
FF for timing/retiming
Switchable interconnect
…everything we need to build fixed logic circuits
  – don’t really need programmable gates
  – latches can be built from gates
24
JR.S00 23 Field Programmable Gate Array (FPGA) Basics
Collection of programmable “gates” embedded in a flexible interconnect network
…a “user programmable” alternative to gate arrays
[Figure: Programmable Gate, contents marked “?”]
25
JR.S00 24 Look-Up Table (LUT)
2-LUT: memory (Mem) addressed by In1, In2, producing Out
In   Out
00   0
01   1
10   1
11   0
26
JR.S00 25 LUTs
K-LUT -- K-input lookup table
Implements any function of K inputs by programming the table
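A K-LUT can be modeled in software as a 2^K-entry memory addressed by the input bits. The sketch below is illustrative only: the class name `LUT` and its methods are hypothetical and not part of any FPGA tool flow, and the first input is assumed to be the most significant address bit.

```python
from itertools import product


class LUT:
    """Software model of a K-input lookup table (hypothetical helper, for illustration)."""

    def __init__(self, k, table):
        # The configuration memory holds one output bit per input combination.
        if len(table) != 2 ** k:
            raise ValueError("a K-LUT needs exactly 2**K table entries")
        self.k = k
        self.table = list(table)

    @classmethod
    def from_function(cls, k, func):
        # "Program" the LUT by evaluating func on every input combination.
        return cls(k, [func(*bits) & 1 for bits in product((0, 1), repeat=k)])

    def __call__(self, *inputs):
        # Treat the inputs as a binary address; the first input is the MSB (assumption).
        if len(inputs) != self.k:
            raise ValueError("wrong number of inputs")
        address = 0
        for bit in inputs:
            address = (address << 1) | (bit & 1)
        return self.table[address]


# A 2-LUT programmed with the table 0, 1, 1, 0 for inputs 00, 01, 10, 11 (XOR).
xor_lut = LUT(2, [0, 1, 1, 0])
# The same structure reprogrammed as a 2-input AND.
and_lut = LUT.from_function(2, lambda a, b: a & b)
```

Only the table contents differ between `xor_lut` and `and_lut`; the lookup logic is identical, which is the sense in which the function is set by programming the table.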
27
JR.S00 26 Conventional FPGA Tile K-LUT (typical k=4) w/ optional output Flip-Flop
28
JR.S00 27 Commercial FPGA (XC4K)
Cascaded 4-LUTs (2 4-LUTs -> 1 3-LUT)
Fast carry support
Segmented interconnect
Can use LUT config as memory
29
JR.S00 28 XC4000 CLB
30
JR.S00 29 Not Restricted to Logic Gates Example: Paddi-2 (1995)
31
JR.S00 30 A Data-driven Computation Paradigm
32
JR.S00 31 Not restricted to Logic Gate Operations
33
JR.S00 32 For Spatial Architectures
Interconnect dominant
  – area
  – power
  – time
…so need to understand in order to optimize architectures
34
JR.S00 33 Dominant in Area
35
JR.S00 34 Dominant in Time
36
JR.S00 35 Dominant in Power XC4003A data from Eric Kusse (UCB MS 1997)
37
JR.S00 36 Interconnect
Problem
  – Thousands of independent (bit) operators producing results
    » true of FPGAs today
    » …true for *LIW, multi-uP, etc. in future
  – Each taking as inputs the results of other (bit) processing elements
  – Interconnect is late bound
    » don’t know until after fabrication
38
JR.S00 37 Design Issues
Flexibility -- route “anything”
  – (w/in reason?)
Area -- wires, switches
Delay -- switches in path, stubs, wire length
Power -- switch, wire capacitance
Routability -- computational difficulty finding routes
39
JR.S00 38 First Attempt: Crossbar
Any operator may consume output from any other operator
Try a crossbar?
40
JR.S00 39 Crossbar
Flexibility (++)
  – routes everything (guaranteed)
Delay (Power) (-)
  – wire length $O(kn)$
  – parasitic stubs: $kn+n$
  – series switch: 1
  – $O(kn)$
Area (-)
  – Bisection bandwidth $n$
  – $kn^2$ switches
  – $O(n^2)$
Too expensive and not scalable
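To see how quickly the crossbar's quadratic switch count grows, the figures on this slide can be tabulated for a few array sizes. This is a minimal sketch; the function name `crossbar_cost` is hypothetical, and it assumes n operators with k inputs each, so that every one of the kn inputs needs a switch to each of the n outputs.

```python
def crossbar_cost(n, k):
    """Switch and stub counts for a crossbar of n operators with k inputs each."""
    return {
        # Each of the k*n operator inputs has a switch to each of the n outputs.
        "switches": k * n * n,
        # Parasitic stub count as given on the slide: kn + n.
        "stubs": k * n + n,
        # Any connection passes through exactly one switch.
        "series_switches": 1,
    }


for n in (64, 256, 1024):
    print(n, crossbar_cost(n, k=4))
```

Quadrupling n multiplies the switch count by sixteen while the number of operators only quadruples, which is the scalability problem the slide points to.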
41
JR.S00 40 Avoiding Crossbar Costs
Good architectural design
  – Optimize for the common case
Designs have spatial locality
We have freedom in operator placement
Thus: Place connected components “close” together
  – don’t need full interconnect?
42
JR.S00 41 Exploit Locality
Wires expensive
Local interconnect cheap
Try a mesh?
[Figure: mesh of LUTs with connect boxes (C Box) and switch boxes (S Box)]
43
JR.S00 42 The Toronto Model
[Figure: Switch Box, Connect Box]
44
JR.S00 43 Mesh Analysis
Flexibility - ?
  – Ok w/ large w
Delay (Power)
  – Series switches
    » 1 -- $\sqrt{n}$
  – Wire length
    » w -- $\sqrt{n}$
  – Stubs
    » $O(w)$ -- $O(w\sqrt{n})$
Area
  – Bisection BW -- $w\sqrt{n}$
  – Switches -- $O(nw)$
  – $O(w^2 n)$
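The mesh figures can be made concrete with a small estimate. The sketch below assumes the n tiles are laid out as a square array of side √n with w wires per channel; the function name `mesh_estimates` and the simple hop-count model are illustrative assumptions, not measurements of any device.

```python
import math


def mesh_estimates(n, w):
    """Rough cost figures for a square mesh of n tiles with w wires per channel.

    Assumes n is a perfect square and the tiles form a side x side array.
    """
    side = math.isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square in this sketch")
    return {
        # Cutting the array in half crosses roughly `side` channels of w wires each.
        "bisection_wires": w * side,
        # Corner-to-corner route: Manhattan distance measured in tile hops.
        "worst_case_hops": 2 * (side - 1),
        # Adjacent tiles are a single hop apart.
        "best_case_hops": 1,
    }


for n in (64, 256, 1024):
    print(n, mesh_estimates(n, w=8))
```

Both the bisection width and the worst-case hop count grow with the square root of n, while well-placed neighbors stay one hop apart, which is why placement quality matters so much on a mesh.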
45
JR.S00 44 Mesh Analysis Can we place everything close?
46
JR.S00 45 Mesh “Closeness” Try placing “everything” close
47
JR.S00 46 Adding Nearest Neighbor Connections
Connection to 8 neighbors
Improvement over Mesh by 3x
Good for neighbor-neighbor connections
48
JR.S00 47 Typical Extensions
Segmented Interconnect
Hardwired/Cascade Inputs
49
JR.S00 48 XC4K Interconnect
50
JR.S00 49 XC4K Interconnect Details
51
JR.S00 50 Creating Hierarchy Example: Paddi-2
52
JR.S00 51 Level-1 Communication Network
53
JR.S00 52 Level-2 Communication Network (Pipelined)
54
JR.S00 53 Paddi-2 Processor
1-µm 2-metal CMOS tech
1.2 x 1.2 mm^2
600k transistors
208-pin PGA
f_clock = 50 MHz
P_av = 3.6 W @ 5V
55
JR.S00 54 How to Provide Scalability?
Tree of Meshes
Main question: How to populate/parameterize the tree?
56
JR.S00 55 Hierarchical Interconnect
Two regions of connectivity lengths
[Figure: Energy x Delay vs. Manhattan Distance, curves for Mesh and Binary Tree]
Hybrid architecture using both Mesh and Binary structures favored
57
JR.S00 56 Hybrid Architecture Revisited
Straightforward combination of Mesh and Binary tree is not smart
Short connections will be through the Mesh architecture
The cheap connections on the Binary tree will be redundant
58
JR.S00 57 Inverse Clustering
Blocks further away are connected at the lowest levels
Inverse clustering complements Mesh Architecture
[Figure: Energy x Delay vs. Manhattan Distance, curves for Mesh, Binary Tree, and Mesh + Inverse]
59
JR.S00 58 Hybrid Interconnect Architecture
Levels of interconnect targeting different connectivity lengths
  – Level0: Nearest Neighbor
  – Level1: Mesh Interconnect
  – Level2: Hierarchical
60
JR.S00 59 Prototype
Array Size: 8x8 (2 x 4 LUT)
Power Supply: 1.5V & 0.8V
Configuration: Mapped as RAM
Toggle Frequency: 125MHz
Area: 3mm x 3mm
Process: 0.25 µm ST
61
JR.S00 60 Programming the Configurable Platform
RTL --> Tech. Indep. Optimization --> LUT Mapping --> Placement --> Routing --> Bitstream Generation --> Config. Data
62
JR.S00 61 Starting Point
RTL
  – t=A+B
  – Reg(t,C,clk);
Logic
  – $O_i = A_i \oplus B_i \oplus C_i$
  – $C_{i+1} = A_i B_i \lor B_i C_i \lor A_i C_i$
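The step from the RTL statement t=A+B to per-bit logic can be checked in a few lines. The sketch below assumes the standard full-adder equations (the sum bit is the XOR of the two operand bits and the carry-in, and the carry-out is their majority); the function names `full_adder` and `ripple_add` are hypothetical.

```python
def full_adder(a, b, c):
    """One bit slice: sum = a XOR b XOR c, carry-out = majority(a, b, c)."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def ripple_add(a, b, width):
    """Add two unsigned integers bit by bit using full_adder (carry into bit 0 is 0)."""
    carry = 0
    total = 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    # The final carry becomes bit `width` of the result.
    return total | (carry << width)


# Exhaustive check for 4-bit operands: the bit-level logic matches integer addition.
assert all(ripple_add(a, b, 4) == a + b for a in range(16) for b in range(16))
```

Each bit slice is a function of three inputs, so both the sum and the carry fit comfortably in a 4-input LUT, which is what makes this form a convenient starting point for LUT mapping.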
63
JR.S00 62 LUT Map
64
JR.S00 63 Placement
Maximize locality
  – minimize number of wires in each channel
  – minimize length of wires
  – (but, cannot put everything close)
Often start by partitioning/clustering
State-of-the-art finish via simulated annealing
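A toy version of annealing-based placement is sketched below. It is a simplified illustration, not the algorithm of any particular tool: the cost is total Manhattan wirelength over two-pin nets, moves are random swaps of two grid slots, and the cooling schedule, parameter values, and function names (`wirelength`, `anneal_placement`) are all assumptions.

```python
import math
import random


def wirelength(placement, nets):
    """Total Manhattan length over two-pin nets; placement maps block -> (x, y)."""
    return sum(
        abs(placement[a][0] - placement[b][0]) + abs(placement[a][1] - placement[b][1])
        for a, b in nets
    )


def anneal_placement(blocks, nets, width, height, steps=5000,
                     t_start=5.0, t_end=0.01, seed=0):
    """Place blocks on a width x height grid by swapping slots under a cooling schedule."""
    rng = random.Random(seed)
    slots = [(x, y) for x in range(width) for y in range(height)]
    if len(blocks) > len(slots):
        raise ValueError("grid has fewer slots than blocks")
    # contents[i] is the block sitting in slots[i], or None for an empty slot.
    contents = list(blocks) + [None] * (len(slots) - len(blocks))
    rng.shuffle(contents)

    def as_placement():
        return {blk: slots[i] for i, blk in enumerate(contents) if blk is not None}

    cost = wirelength(as_placement(), nets)
    initial_cost = cost
    best_cost, best = cost, as_placement()
    for step in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i, j = rng.sample(range(len(slots)), 2)
        contents[i], contents[j] = contents[j], contents[i]
        new_cost = wirelength(as_placement(), nets)
        delta = new_cost - cost
        # Always accept improvements; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best_cost, best = cost, as_placement()
        else:
            # Reject the move: undo the swap.
            contents[i], contents[j] = contents[j], contents[i]
    return best, best_cost, initial_cost


# Example: a chain of 8 blocks (b0-b1-...-b7) placed on a 3x3 grid.
blocks = [f"b{i}" for i in range(8)]
nets = [(f"b{i}", f"b{i + 1}") for i in range(7)]
best, best_cost, initial_cost = anneal_placement(blocks, nets, 3, 3)
print("initial wirelength:", initial_cost, "annealed wirelength:", best_cost)
```

Recomputing the full cost after every swap keeps the sketch short; a practical placer would update the cost incrementally and would also account for channel congestion, not just wire length.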
65
JR.S00 64 Place
66
JR.S00 65 Routing
Often done in two passes
  – Global to determine channel
  – Detailed to determine actual wires and switches
Difficulty is
  – limited channels
  – switchbox connectivity restrictions
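As a stand-in for the detailed pass, the sketch below uses Lee-style breadth-first maze routing on a grid, routing nets one at a time so that cells used by earlier nets become obstacles for later ones. This technique is swapped in for illustration only: it models limited routing resources but ignores switchbox connectivity restrictions, and the function names `maze_route` and `route_nets` are hypothetical.

```python
from collections import deque


def maze_route(width, height, blocked, src, dst):
    """Shortest rectilinear path from src to dst avoiding blocked cells, or None."""
    if src in blocked or dst in blocked:
        return None
    prev = {src: None}
    queue = deque([src])
    while queue:
        cell = queue.popleft()
        if cell == dst:
            # Walk the predecessor chain back to the source.
            path = []
            while cell is not None:
                path.append(cell)
                cell = prev[cell]
            return path[::-1]
        x, y = cell
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            nx, ny = nxt
            inside = 0 <= nx < width and 0 <= ny < height
            if inside and nxt not in blocked and nxt not in prev:
                prev[nxt] = cell
                queue.append(nxt)
    return None


def route_nets(width, height, nets):
    """Route two-pin nets in order; cells used by earlier nets block later ones."""
    used = set()
    routes = {}
    for name, (src, dst) in nets.items():
        path = maze_route(width, height, used, src, dst)
        routes[name] = path
        if path is not None:
            used.update(path)
    return routes


routes = route_nets(4, 4, {"n1": ((0, 0), (3, 0)), "n2": ((0, 1), (3, 1))})
print(routes)
```

Because nets are routed sequentially, the result depends on net order, and a net routed early can block a later one entirely; that ordering sensitivity is one reason routing with limited channels is hard.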
67
JR.S00 66 Route
68
JR.S00 67 Summary
Configurable Computing uses “programming in space” versus “programming in time” for traditional instruction-set computers
Key design choices
  – Computational units and their granularity
  – Interconnect network
  – (Re)configuration time and frequency
Next class: Some practical examples of reconfigurable computers