HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani,

Presentation transcript:

HSRA: High-Speed, Hierarchical Synchronous Reconfigurable Array William Tsu, Kip Macy, Atul Joshi, Randy Huang, Norman Walker, Tony Tung, Omid Rowhani, Varghese George, John Wawrzynek, and André DeHon BRASS Project University of California at Berkeley

Myth FPGAs inherently run at an order of magnitude lower clock rates than microprocessors.

What’s in a Clock Cycle FPGA cycle times are elusive –cycle not defined by architecture –varies almost continuously based on routing –makes timing difficult Processor cycles are well defined –cycle defined by architecture –all operations quantized to this cycle –for all applications => run processor at cycle

Defining a Cycle Pick a target clock cycle Define what happens in a clock cycle based on that –how much computation –how much interconnect Assemble computation by combining cycles –...you were paying for the delay anyway...

Don’t Believe It! Example: XC4000XL-09 (0.35 µm) –Minimum clock low/high 2.3ns ⇒ 4.6ns cycle –Composing: clock→Q 1.5ns + interconnect budget 1.5ns + logic→clock setup 1.6ns = 4.6ns Also: Von Herzen FPGA97, XC ⇒ 4ns
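
The budget on this slide can be checked with a few lines of arithmetic. This is only a minimal sketch: the dictionary keys and function names are illustrative, and the delays are the ones quoted in the example above.

```python
# Timing budget from the XC4000XL-09 example (all values in ns).
# Key names are illustrative labels, not vendor terminology.
budget_ns = {
    "clock_to_q": 1.5,
    "interconnect": 1.5,
    "logic_and_setup": 1.6,
}

def cycle_ns(budget):
    """Sum the per-stage delays that make up one clock cycle."""
    return sum(budget.values())

def freq_mhz(period_ns):
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns

cycle = cycle_ns(budget_ns)
print(f"cycle = {cycle:.1f} ns, about {freq_mhz(cycle):.0f} MHz")
```

The same conversion turns the 4ns cycle quoted later in the talk into 250MHz.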

Cycle Comparison FPGA cycles comparable to contemporary microprocessors.

Outline FPGA cycle times Why low frequency? Architecture and CAD for high frequency HSRA Experiments Assessment

Why do FPGA designs run slowly? Few designs run at 200+MHz 1. Limited application/user requirements 2. Cyclic data dependencies 3. Poor tool support 4. Long interconnect delays 5. Pipelining expensive?

HSRA High-Speed, Hierarchical Synchronous Reconfigurable Array Attacks architecture and CAD impediments –pipeline the interconnect (4) –balance retiming resources (5) –CAD for auto retiming (3)

HSRA Architecture

HSRA 5-LUT with 5th input hardwired to neighbor –(can be used as a 4-input, 2-output LUT w/ some restrictions) Flip-flop bank on inputs for retiming Hierarchical Interconnect Fixed clock cycle (0.4 µm = 4ns) Pipelined Interconnect

Input Retiming

Balancing Logic Evaluation Cycle (BLB Cascade Timing)

Hierarchical Interconnect Fat-Tree/Fat-Pyramid inspired network; Geometric bandwidth growth toward root. (Parameterized growth allows exploration/tuning. =>Our recent study suggests p=0.6 good for “random logic”)
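
To make "geometric bandwidth growth toward root" concrete, here is a small sketch. It assumes that p acts as a Rent-style exponent, so a subtree with N leaves gets bandwidth proportional to N^p; the slide itself only says the growth is parameterized. The function name, `base_channels`, and `arity` are hypothetical.

```python
def channel_width(level, base_channels=1.0, p=0.6, arity=2):
    """Channels leaving a subtree rooted `level` levels above the leaves.

    Assumption: bandwidth grows as (number of leaves) ** p, so each
    level multiplies the width by arity ** p.
    """
    leaves = arity ** level
    return base_channels * leaves ** p

for level in range(6):
    print(level, round(channel_width(level), 2))
```

Under this reading, p=1.0 would double the bandwidth at every level of a binary tree, while p=0.6 grows it by a factor of 2^0.6 (roughly 1.5) per level.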

What Cycle? Data from 0.4 µm DRAM Process

Area vs. Cycle

Flop Experiment #1 Pipeline and retime to single LUT delay per cycle –MCNC benchmarks to LUTs –no interconnect accounting –average 1.7 registers/LUT (some circuits 2--7)

HSRA Retiming One additional twist to retiming task –long, pipelined interconnect ⇒ need more than one register on paths

Accommodating HSRA Interconnect Delays (CAD) Add “logical” buffers to LUT→LUT path to match interconnect register requirements Reduces HSRA retiming to existing retiming problem Retime to C=1 as before Buffer chains force enough registers to cover interconnect delays
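
The buffer-insertion step might look like the sketch below. It is not the HSRA tool itself: the edge format, the `extra_regs` field (interconnect registers needed beyond the one a direct LUT-to-LUT hop already gets at C=1), and the buffer naming scheme are all assumptions made for illustration.

```python
def add_logical_buffers(edges):
    """Expand each (src, dst, extra_regs) edge into a chain of unit edges.

    Every inserted buffer behaves like a one-delay node, so a retimer
    targeting C=1 must place a register on each hop of the chain. A chain
    with k buffers therefore ends up with at least k + 1 registers.
    """
    unit_edges = []
    for src, dst, extra_regs in edges:
        prev = src
        for i in range(extra_regs):
            buf = f"buf_{src}_{dst}_{i}"  # hypothetical naming scheme
            unit_edges.append((prev, buf))
            prev = buf
        unit_edges.append((prev, dst))
    return unit_edges

# One short hop and one long route that needs two extra registers.
print(add_logical_buffers([("a", "b", 0), ("b", "c", 2)]))
```

After this expansion the netlist can be handed to an ordinary retimer, which is the reduction the slide describes.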

Add Interconnect Delays

Flop Experiment #2 Pipeline and retime to HSRA cycle –place on HSRA –single LUT or interconnect domain –same MCNC benchmarks –average 4.7 registers/LUT

Design Question How deep should we make input retiming register bank? –Most inputs need only one (60%) –Some inputs need very deep (>10) –Average Input depth: 4.7

Limit Input Depth Experiment limiting input depths For each output -> input pair –calculate delay –get regs –if (regs-delay) > input_regs: allocate retiming buffer(s) to cover regs; share among sinks if possible
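
The loop on this slide can be sketched as runnable code. Two points are assumptions rather than something the slide states: each retiming buffer block is taken to absorb up to `input_depth` registers, and "share among sinks" is modeled as one buffer chain per source, sized for its most demanding sink.

```python
import math
from collections import defaultdict

def extra_blocks(pairs, input_depth):
    """Count retiming buffer blocks for (source, sink, regs, delay) pairs."""
    chain_len = defaultdict(int)  # longest buffer chain needed per source
    for source, _sink, regs, delay in pairs:
        excess = (regs - delay) - input_depth
        if excess > 0:
            # Assumption: one block covers up to input_depth registers.
            needed = math.ceil(excess / input_depth)
            # Assumption: sinks of one source share a single chain.
            chain_len[source] = max(chain_len[source], needed)
    return sum(chain_len.values())

pairs = [
    ("a", "x", 3, 1),   # fits in the input register bank
    ("a", "y", 20, 2),  # 18 registers to place, 10 more than the bank holds
    ("a", "z", 12, 2),  # covered by the chain already built for a -> y
    ("b", "w", 9, 1),   # exactly fills an 8-deep bank
]
print(extra_blocks(pairs, input_depth=8))
```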

HSRA Input

Extra Blocks (limited input depth) [Chart: average and worst-case extra blocks by benchmark]

Input Depth Optimization Real design, fixed input retiming depth –truncate deeper and allocate additional logic blocks

HSRA CAD Flow [Flow diagram; boxes shown: LUT Mapping, Partition, Placement, Bitstream Generation, Tech. Indep. Optimization, Config. Data, RTL, Routing, Retiming, BOOM design generator]

HSRA Interconnect

Mapping => Retiming Exploit technique developed for Systolic Arrays (Leiserson) Retime –find a legal movement of registers to improve circuit performance (area) For HSRA: retime to fully pipeline design –match HSRA cycle –justify / cover interconnect delays

HSRA Retiming Automatic Mapping Attack –pipeline as far as possible –find resulting cycle, C –make C-slow –final retime to distribute C-slow registers
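
A rough sketch of the "find C, make C-slow" steps follows. It uses the standard loop bound (a feedback loop with d unit delays and r registers cannot be retimed below ceil(d/r) delays per cycle) in place of a full retiming solver, so it is a simplification and not the HSRA flow. The data layout and function names are hypothetical.

```python
import math

def required_slowdown(loops):
    """Smallest C that lets every feedback loop reach one delay per cycle.

    loops: list of (delay_units, registers), with registers >= 1, as any
    feedback loop in a synchronous design must contain a register.
    """
    return max((math.ceil(d / r) for d, r in loops), default=1)

def c_slow(edge_registers, c):
    """C-slow a design: replace every register with a chain of c registers."""
    return {edge: regs * c for edge, regs in edge_registers.items()}

loops = [(2, 1), (3, 2)]  # (unit delays, registers) around each loop
c = required_slowdown(loops)
print(c, c_slow({("a", "b"): 1, ("b", "a"): 0}, c))
```

After C-slowing, every loop holds at least as many registers as unit delays, so the final retiming pass can spread them to one delay per cycle.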

Cycle => C-slow

Retimed 2-Slow Cycle

C-Slow applicable? Available parallelism –solve C identical, independent problems e.g. process packets (blocks) separately e.g. independent regions in images Commutative operators –e.g. max example
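
To illustrate why C independent problems fit a C-slow design, the sketch below simulates a running-max loop with C registers in its feedback path. The simulation is an illustration only; the function names and the deque model of the register chain are assumptions.

```python
from collections import deque

def c_slow_running_max(interleaved, c):
    """Simulate a running-max loop whose feedback path holds c registers."""
    state = deque([float("-inf")] * c)  # the c registers in the loop
    out = []
    for x in interleaved:
        prev = state.popleft()  # value written c cycles ago (same stream)
        cur = max(prev, x)
        state.append(cur)
        out.append(cur)
    return out

def interleave(*streams):
    """Round-robin equal-length streams: a0, b0, a1, b1, ..."""
    return [x for group in zip(*streams) for x in group]

a = [3, 1, 4]
b = [10, 20, 5]
out = c_slow_running_max(interleave(a, b), c=2)
print(out[0::2], out[1::2])  # running max of a, running max of b
```

Because max is commutative and associative, a single long stream can also be split across the C slots; taking the max of the C final values then gives the overall maximum.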

Assessment Cost: –our designs: 1.5× area of no pipelining –plausible ballpark for other designs –w/ 8 deep retiming, 20% BLB overhead –total: 1.8× area Running LUT→LUT delay on FPGA –70% overhead for retiming –frequency still varies with interconnect Benefits –2--17× higher frequency operation than unpipelined ⇒ Net Area-Time win + automation/consistency
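
The area and area-time figures can be reproduced with a short calculation. Treating the 20% BLB overhead as a multiplier on top of the 1.5× pipelining cost is an interpretation; it is used here because it matches the quoted 1.8× total. Variable and function names are illustrative.

```python
pipelining_area = 1.5  # area relative to an unpipelined design
blb_overhead = 0.20    # extra BLB area for 8-deep input retiming
total_area = pipelining_area * (1 + blb_overhead)

def area_time_ratio(area_factor, speedup):
    """Area-time product relative to the unpipelined design (< 1 is a win)."""
    return area_factor / speedup

print(f"total area = {total_area:.1f}x")
for speedup in (2, 17):
    ratio = area_time_ratio(total_area, speedup)
    print(f"{speedup}x faster -> area-time ratio {ratio:.2f}")
```

Even at the low end of the quoted frequency range the ratio stays below 1, which is the net area-time win the slide claims.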

Better way to build Arrays? Can we exploit higher frequency offered? –High throughput, feed-forward –Cycles in flowgraph: abundant data level parallelism; no data level parallelism –Low throughput tasks: structured (e.g. datapaths); unstructured –Data dependent operations: similar ops; dis-similar ops

Better Efficiently use fully spatial design: –feed forward (no cycles, high throughput) –cycles w/ data level parallelism (C-slow) –low throughput datapaths (serialize or swap) –similar data dependent operations (local control, share datapaths) HSRA, clocked interconnect allows –reliable execution at high clock rate –(not achievable with traditional FPGAs)

Remaining Cases Benefit from multicontext as well as high clock rate –cycles, no parallelism –data dependent, dissimilar operations –low throughput, irregular (can’t afford swap?) Single context HSRA and FPGA suffer similarly in these cases HSRA style retiming/pipelining –applicable to multicontext design

HSRA Highlights Design achieves 250MHz operation 2Mλ²/BLB in subarray –BLB = cascade 5-LUT or 2-output 4-LUT –scales to 6Mλ²/BLB for large arrays; room for density improvement (not satisfactory) Students in (RC Class) demo –full rate filters –FIR –IIR (nice bit-level cycle implementation by Michael Chu)

HSRA Testchip

Summary No inherent reasons for FPGAs/RC arrays to run slower than microprocessors Current FPGAs lack architectural and CAD support to reliably achieve high clock rates HSRA demonstrates how to attack problems –retiming balance – interconnect pipelining – automated retiming

Berkeley Reconfigurable Architectures Software and Systems (BRASS)