On “Control Adaptivity”: how to sell the idea of Atomic Transactions to the HW design community Rishiyur Nikhil Bluespec, Inc. WG 2.8 meeting, June 16, 2008

Presentation transcript:

On “Control Adaptivity”: how to sell the idea of Atomic Transactions to the HW design community Rishiyur Nikhil Bluespec, Inc. WG 2.8 meeting, June 16, 2008

2 Background: The BSV language (Bluespec SystemVerilog) is founded on composable Atomic Transactions (atomic rewrite rules). This talk is about how we’re slowly finding the vocabulary, the voice, the story to sell the idea to the HW design community (it’s taken several years!).

3 Arguments that go over like a ton of bricks

Atomic Transactions are good because...
- they’re new; modern; cool; different...
- they’re good for your soul...
- it helps you reason about correctness using invariants...
- they give compositional concurrency...
- see how elegantly they solve the Dining Philosophers problem!
- see how elegantly they solve the Santa Claus problem!
- they give consistent access to your shared resources

Audience reactions:
- Do I have a “soul”?
- I don’t use “shared resources” (!!!)
- Cute! But I can’t see the relevance to what I do.
- Thank you, but I don’t have the time
- What do you mean by “reason”? What’s an “invariant”?
- What’s “compositional”?
- What’s “atomic”?

4 In summary... (summarizing slides that follow)

Atomic Transactions are good because they make your designs control-adaptive. And here’s how that helps you in writing executable specs, in modeling, in refining models to designs, in reusable designs, and in maintainable and evolvable designs.

Audience reaction: Ah! I see! And maybe you should also tell me about atomicity, reasoning, invariants, compositionality, shared resources! ... and don’t forget about dining philosophers, Santa Claus and my soul!

5 GCD in BSV

    module mkGCD (I_GCD);
       // State
       Reg#(int) x <- mkRegU;
       Reg#(int) y <- mkReg(0);

       // Internal behavior
       rule swap ((x > y) && (y != 0));
          x <= y; y <= x;
       endrule

       rule subtract ((x <= y) && (y != 0));
          y <= y - x;
       endrule

       // External interface
       method Action start(int a, int b) if (y==0);
          x <= a; y <= b;
       endmethod

       method int result() if (y==0);
          return x;
       endmethod
    endmodule
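The behavior of the two rules can be mimicked in ordinary software. What follows is a minimal Python sketch, not BSV and not a description of what the Bluespec compiler does: each call to `step` fires at most one rule atomically, reading the old register values and writing the new ones together. The names `GcdModel`, `step`, and `run_gcd` are invented for this illustration, and the initial value of `x` is an arbitrary stand-in for `mkRegU`.

```python
import math


class GcdModel:
    """Software stand-in for mkGCD: two registers plus two guarded rules."""

    def __init__(self):
        self.x = 0  # mkRegU leaves x undefined in BSV; 0 is an arbitrary choice here
        self.y = 0

    def start(self, a, b):
        # Method guard from the slide: start is only ready when y == 0.
        if self.y != 0:
            raise RuntimeError("start is not ready (y != 0)")
        self.x, self.y = a, b

    def result(self):
        # Method guard from the slide: result is only ready when y == 0.
        if self.y != 0:
            raise RuntimeError("result is not ready (y != 0)")
        return self.x

    def step(self):
        """Fire at most one rule atomically; return its name, or None if no guard holds."""
        x, y = self.x, self.y  # every read sees the state before the rule fires
        if x > y and y != 0:
            self.x, self.y = y, x
            return "swap"
        if x <= y and y != 0:
            self.y = y - x
            return "subtract"
        return None


def run_gcd(a, b, max_steps=10_000):
    """Start the model, let rules fire until none is enabled, return (result, trace)."""
    model = GcdModel()
    model.start(a, b)
    trace = []
    for _ in range(max_steps):
        fired = model.step()
        if fired is None:
            break
        trace.append(fired)
    return model.result(), trace
```

Because the two rule guards are mutually exclusive, firing one rule per step loses nothing relative to the hardware, which also fires at most one of them per clock. For positive inputs the result agrees with `math.gcd`.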

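6 Generated Hardware

[Diagram: registers x and y feed a comparator (>), a zero test (!(=0)), and a subtractor (sub). The predicates swap? and subtract? select the next state values. Interface signals: start (a, b, en, rdy) and result (x, rdy).]

x_en = swap?
y_en = swap? OR subtract?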

7 Generated Hardware Module

[Diagram: the same datapath as on the previous slide, now with the start method's start_en signal feeding the enables.]

x_en = swap? OR start_en
y_en = swap? OR subtract? OR start_en
rdy = (y==0)

Control logic (muxes, conditions on muxes and enables) on shared resources (x and y)

8 Generated Hardware Module

[Diagram: the same datapath, with x_en, y_en, start_en and the start/result interface signals.]

When coding this in Verilog, this control logic is explicit:

    always @(posedge CLK)
       if (start_en) x <= a;
       else if (swap_en) x <= y;

Control logic (muxes, conditions on muxes and enables) on shared resources (x and y)
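To make the "explicit control logic" concrete, here is a small Python sketch of one clock edge of the GCD hardware. It is a software model only. The enable equations it uses (x_en = swap? OR start_en, y_en = swap? OR subtract? OR start_en, rdy = (y==0)) are a reading of the garbled diagram text and should be treated as an assumption, as should the function names `clock_edge` and `gcd_cycles`.

```python
def clock_edge(x, y, start_en=False, a=0, b=0):
    """Model one rising clock edge: predicates, register enables, next-state muxes."""
    swap = (x > y) and (y != 0)
    subtract = (x <= y) and (y != 0)
    rdy = (y == 0)
    start = start_en and rdy  # the environment may only start when rdy is high

    # Register enables (assumed equations, see the lead-in).
    x_en = swap or start
    y_en = swap or subtract or start

    # Next-state muxes on the shared registers x and y.
    if start:
        x_next, y_next = a, b
    elif swap:
        x_next, y_next = y, x
    else:
        x_next, y_next = x, y - x  # subtract path; only latched when y_en is set

    new_x = x_next if x_en else x
    new_y = y_next if y_en else y
    return new_x, new_y, rdy


def gcd_cycles(a, b, max_cycles=10_000):
    """Load (a, b) with one start edge, then clock until y == 0."""
    x, y, _ = clock_edge(0, 0, start_en=True, a=a, b=b)
    cycles = 0
    while y != 0 and cycles < max_cycles:
        x, y, _ = clock_edge(x, y)
        cycles += 1
    return x, cycles
```

Every line that computes an enable or selects a mux input here is something the RTL designer writes by hand, while in the rule-based version it is implied by the rule guards and bodies.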

9 Summarizing, we have Point 1: BSV’s Atomic Transactions (rules) are just another way of specifying control logic that you would have written anyway

10 Fundamentally, we are scheduling three potentially concurrent atomic transactions that share resources.

[Diagram: process 0 (cond0) increments x; process 1 (cond1) decrements x and increments y; process 2 (cond2) decrements y.]
Process priority: 2 > 1 > 0

Resource-access scheduling logic, i.e., control logic:

    always @(posedge CLK) begin
       if (cond2)
          y <= y - 1;
       else if (cond1) begin
          y <= y + 1; x <= x - 1;
       end
       if (cond0 && !cond1)
          x <= x + 1;
    end

Better scheduling:

    always @(posedge CLK) begin
       if (cond2)
          y <= y - 1;
       else if (cond1) begin
          y <= y + 1; x <= x - 1;
       end
       if (cond0 && (!cond1 || cond2))
          x <= x + 1;
    end

Note that control of x is affected by a “non-adjacent” condition (cond2), because of atomicity of the intervening process 1. This is fundamentally what makes RTL so fragile!

* There are other ways to write this RTL, but all suffer from the same analysis

11 In BSV

[Diagram: processes 0, 1 and 2, guarded by cond0, cond1 and cond2, updating registers x and y.]
Process priority: 2 > 1 > 0

    (* descending_urgency = "proc2, proc1, proc0" *)
    rule proc0 (cond0);
       x <= x + 1;
    endrule

    rule proc1 (cond1);
       y <= y + 1;
       x <= x - 1;
    endrule

    rule proc2 (cond2);
       y <= y - 1;
    endrule
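The scheduling that the urgency annotation asks for can be sketched in a few lines of Python. This is a deliberately simplified model, not the Bluespec compiler's actual conflict analysis: two rules are assumed to conflict whenever they write a common register, and a rule fires if its condition holds and no more urgent conflicting rule fires. The names `WRITES`, `URGENCY`, and `will_fire` are invented for the sketch.

```python
from itertools import product

# Registers written by each rule (in this example each rule also reads what it writes).
WRITES = {"proc0": {"x"}, "proc1": {"x", "y"}, "proc2": {"y"}}
URGENCY = ["proc2", "proc1", "proc0"]  # most urgent first


def will_fire(writes, urgency, can_fire):
    """Return the rules that fire this cycle, visiting them from most to least urgent."""
    fired, claimed = [], set()
    for rule in urgency:
        # A rule fires if its guard holds and no register it writes is already claimed
        # by a more urgent rule that is firing.
        if can_fire[rule] and not (writes[rule] & claimed):
            fired.append(rule)
            claimed |= writes[rule]
    return fired


def check_against_hand_written_rtl():
    """Compare the derived firing conditions with the 'better scheduling' RTL conditions."""
    for c0, c1, c2 in product([False, True], repeat=3):
        fired = will_fire(WRITES, URGENCY, {"proc0": c0, "proc1": c1, "proc2": c2})
        assert ("proc2" in fired) == c2
        assert ("proc1" in fired) == (c1 and not c2)
        assert ("proc0" in fired) == (c0 and (not c1 or c2))
    return True
```

The check passes for all eight combinations of conditions, including the case where all three hold: process 2 blocks process 1, which in turn frees process 0 to fire. That is the "non-adjacent" influence of cond2 on x, derived mechanically instead of by hand.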

12 Now, let’s make a small change: add a new process and insert its priority

[Diagram: processes 0 to 3, guarded by cond0, cond1, cond2 and cond3, updating registers x and y.]
Process priority: 2 > 3 > 1 > 0

13 Changing the BSV design

[Diagram: processes 0 to 3, guarded by cond0, cond1, cond2 and cond3, updating registers x and y.]
Process priority: 2 > 3 > 1 > 0

Pre-change:

    (* descending_urgency = "proc2, proc1, proc0" *)
    rule proc0 (cond0);
       x <= x + 1;
    endrule

    rule proc1 (cond1);
       y <= y + 1;
       x <= x - 1;
    endrule

    rule proc2 (cond2);
       y <= y - 1;
    endrule

Post-change:

    (* descending_urgency = "proc2, proc3, proc1, proc0" *)
    rule proc0 (cond0);
       x <= x + 1;
    endrule

    rule proc1 (cond1);
       y <= y + 1;
       x <= x - 1;
    endrule

    rule proc2 (cond2);
       y <= y - 1;
       x <= x + 1;
    endrule

    rule proc3 (cond3);
       y <= y - 2;
       x <= x + 2;
    endrule

14 Changing the Verilog design

[Diagram: processes 0 to 3, guarded by cond0, cond1, cond2 and cond3, updating registers x and y.]
Process priority: 2 > 3 > 1 > 0

Pre-change:

    always @(posedge CLK) begin
       if (!cond2 && cond1)
          x <= x - 1;
       else if (cond0)
          x <= x + 1;

       if (cond2)
          y <= y - 1;
       else if (cond1)
          y <= y + 1;
    end

Post-change:

    always @(posedge CLK) begin
       if ((cond2 && cond0) || (cond0 && !cond1 && !cond3))
          x <= x + 1;
       else if (cond3 && !cond2)
          x <= x + 2;
       else if (cond1 && !cond2)
          x <= x - 1;

       if (cond2)
          y <= y - 1;
       else if (cond3)
          y <= y - 2;
       else if (cond1)
          y <= y + 1;
    end
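One way to appreciate how much reasoning is packed into the post-change Verilog conditions is to re-derive them mechanically and compare. The Python sketch below does this under two stated assumptions: rules conflict when they write a common register, and proc2 writes only y (as in the pre-change design, and as the hand-written Verilog conditions imply). The table `EFFECTS` and the function names are invented for the illustration.

```python
from itertools import product

# rule -> (change to x, change to y); None means the register is not written.
EFFECTS = {
    "proc0": (+1, None),
    "proc1": (-1, +1),
    "proc2": (None, -1),  # assumption: proc2 writes only y
    "proc3": (+2, -2),
}
URGENCY = ["proc2", "proc3", "proc1", "proc0"]  # most urgent first


def derived_deltas(conds):
    """Derive the net change to (x, y) from the rule list and the urgency order."""
    dx = dy = 0
    x_taken = y_taken = False
    for rule in URGENCY:
        ex, ey = EFFECTS[rule]
        conflict = (ex is not None and x_taken) or (ey is not None and y_taken)
        if conds[rule] and not conflict:
            if ex is not None:
                dx, x_taken = ex, True
            if ey is not None:
                dy, y_taken = ey, True
    return dx, dy


def hand_written_deltas(c0, c1, c2, c3):
    """Transcription of the post-change Verilog if/else chains."""
    dx = dy = 0
    if (c2 and c0) or (c0 and not c1 and not c3):
        dx = +1
    elif c3 and not c2:
        dx = +2
    elif c1 and not c2:
        dx = -1
    if c2:
        dy = -1
    elif c3:
        dy = -2
    elif c1:
        dy = +1
    return dx, dy


def mismatches():
    """List every combination of conditions where the two descriptions disagree."""
    bad = []
    for c0, c1, c2, c3 in product([False, True], repeat=4):
        conds = {"proc0": c0, "proc1": c1, "proc2": c2, "proc3": c3}
        if derived_deltas(conds) != hand_written_deltas(c0, c1, c2, c3):
            bad.append((c0, c1, c2, c3))
    return bad
```

Under these assumptions the two descriptions agree on all sixteen input combinations. The derived version needed one new table entry and one changed priority list, whereas the hand-written one required rewriting the condition on x.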

15 A software analogy

f(...) calls g(x,y).
In assembler: decide calling convention, decide register allocation ... decide register allocation; the decisions are related!
In C: the compiler does all the work.

16 A software analogy

f(...) now calls g(x,y,z).
In assembler: decide calling convention, decide register allocation ... decide register allocation; related! ... all over again (groan!)
In C: the compiler redoes all the work.

17 In HW design

module f communicates with module g.
In Verilog: decide communications, decide control logic ... decide control logic; the decisions are related!
In BSV: the compiler does all the work.

18 In HW design

module f’ communicates with module g’.
In Verilog: decide communications, decide control logic ... decide control logic; related! ... all over again (groan!)
In BSV: the compiler redoes all the work.

19 Summarizing, we have Point 2: In RTL, control logic can be complex to write the first time, and complex to change/maintain, because one needs to consider and reconsider non-local influences (even across module boundaries). BSV, on the other hand, is “control adaptive” because it automatically computes and recomputes the control logic as you write/change the design.

20 MEMOCODE 2008 codesign contest

Goal: Speed up a software reference application running on the PowerPC on Xilinx XUP reference board using SW/HW codesign
Application: decrypt, sort, re-encrypt db of records in external memory
Time allotted: 4 weeks

[Photo: Xilinx XUP board]

21 The winning architecture

[Diagram blocks: Read Memory Logic, Record Input, AES Cores (2), xor, Sort Tree (merger), Record Output, Memory Write Logic]

- The “merger” has to be used repeatedly in order to mergesort the db
- A control issue: address generation and sequencing of memory reads and writes
- Observation: throughput in merger is limited by apex, so at each lower level, a single comparator can be shared => can fit a deeper merger into available space

22 Merger: each level

[Diagram: pairs of input fifos, a comparator (<), and output fifos]

Loop:
1. Choose a non-empty input pair corresponding to an output fifo with room (scheduling)
2. Compare the fifo heads
3. Dequeue the smaller one and put it on the output fifo

23 Scheduler as a Module Parameter

Level 1 merger: must check 2 input FIFOs. Scheduler can be combinational.
Level 6 merger: must check 64 input FIFOs. Scheduler must be pipelined.
=> scheduler is a parameter of level merger
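The merger loop and the idea of passing the scheduler in as a parameter can be illustrated with a short Python sketch. It is a software analogy only: the two schedulers (`scan_scheduler` and the one built by `make_round_robin`) are hypothetical policies standing in for the combinational and pipelined hardware schedulers, and the `INF` sentinel that terminates each sorted run is an assumption of this sketch, not something stated on the slides.

```python
from collections import deque

INF = float("inf")  # assumed end-of-run marker for this sketch


def ready(pair, out, capacity):
    """A pair can be served if both inputs are non-empty and its output has room."""
    left, right = pair
    return bool(left) and bool(right) and len(out) < capacity


def scan_scheduler(pairs, outs, capacity):
    """Check every pair in one pass and return the first ready index (or None)."""
    for i, (pair, out) in enumerate(zip(pairs, outs)):
        if ready(pair, out, capacity):
            return i
    return None


def make_round_robin():
    """Build a scheduler that remembers where it stopped and resumes from there."""
    state = {"next": 0}

    def scheduler(pairs, outs, capacity):
        n = len(pairs)
        for k in range(n):
            i = (state["next"] + k) % n
            if ready(pairs[i], outs[i], capacity):
                state["next"] = (i + 1) % n
                return i
        return None

    return scheduler


def merge_step(pairs, outs, capacity, scheduler):
    """One iteration of the loop: schedule, compare the heads, move the smaller one."""
    i = scheduler(pairs, outs, capacity)
    if i is None:
        return False
    left, right = pairs[i]
    source = left if left[0] <= right[0] else right
    outs[i].append(source.popleft())
    return True


def run_level(runs, scheduler, capacity=8):
    """Merge each pair of sorted runs; `runs` is a list of (left, right) lists."""
    pairs = [(deque(l + [INF]), deque(r + [INF])) for l, r in runs]
    outs = [deque() for _ in runs]
    while merge_step(pairs, outs, capacity, scheduler):
        pass
    return [list(out) for out in outs]
```

The merge step itself never changes; only the scheduler argument does. In the sketch both policies produce the same merged runs and differ only in the order in which pairs are served, which mirrors the point that the level's datapath is reusable while its scheduling is swapped per instance.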

24 MEMOCODE 2008 codesign contest

Contest Results
[Chart categories: C + Impulse C; C + HDL; C; Bluespec]
Reference: 11X of runner-up (5x in 2007) (3 people, 3 weeks, direct to FPGA)

27 teams started the four week contest. 8 teams delivered 9 solutions.

25 Summarizing, we have Point 3: Some control logic is too difficult even to contemplate without the support of atomic transactions

26 Another application of control adaptivity: parameterizing an 802.11a Transmitter

[Diagram blocks: Controller, Scrambler, Encoder, Interleaver, Mapper, IFFT, Cyclic Extend. Inputs: headers, data (24 uncoded bits). Callout: “accounts for 85% area”.]

27 Combinational IFFT

[Diagram: inputs in0 ... in63 pass through stages of Bfly4 nodes (x16 per stage), each followed by a Permute block, to outputs out0 ... out63. Inset: a Bfly4 node with multipliers (*), a *j term, and twiddle inputs t0 to t3.]

All numbers are complex and represented as two sixteen bit quantities. Fixed-point arithmetic is used to reduce area, power, ...

28 Pipelined IFFT

[Diagram: the same Bfly4 (x16) and Permute stages between in0 ... in63 and out0 ... out63, arranged as a pipeline.]

29 Circular pipeline: Reusing the Pipeline Stage

[Diagram: a single stage of Bfly4 nodes and Permute between in0 ... in63 and out0 ... out63, with its output fed back to its input under control of a Stage Counter.]

30 Superfolded circular pipeline: Just one Bfly-4 node!

[Diagram: one Bfly4 node and Permute between in0 ... in63 and out0 ... out63. Index: 0 to 15, with an “Index == 15?” test; Stage: 0 to 2. Steering logic: 64 2-way muxes, 4 16-way muxes, 4 16-way demuxes.]

These different micro-architectures will have different (area, speed, power). Each may be the “best” for different target chips.
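The relationship between the unfolded, circular, and superfolded variants can be shown with a toy Python model. The point is only that all three compute the same function while trading hardware units for time steps. The `bfly4` and `permute` functions below are arbitrary stand-ins, not the real butterfly or permutation of the IFFT. The unit and step counts follow from the figures on the slides (3 stages, 16 Bfly4 nodes per stage).

```python
STAGES, NODES = 3, 16  # "Stage 0 to 2" and "Bfly4 x16" from the slides


def bfly4(stage, group):
    """Toy 4-input, 4-output node (NOT the real butterfly arithmetic)."""
    a, b, c, d = group
    return [a + b + c + d + stage, a - b + c - d, a + b - c - d, a - b - c + d]


def permute(v):
    """Toy fixed permutation of 64 values (NOT the real IFFT permutation)."""
    return [v[(i % 4) * 16 + i // 4] for i in range(64)]


def stage(s, v):
    """One full stage: 16 butterflies side by side, then the permutation."""
    out = []
    for k in range(NODES):
        out.extend(bfly4(s, v[4 * k:4 * k + 4]))
    return permute(out)


def unfolded(v):
    """All three stages laid out in space: 48 butterfly units, a single pass."""
    return stage(2, stage(1, stage(0, v))), {"bfly4_units": 48, "steps": 1}


def circular(v):
    """One stage reused under a stage counter: 16 units, 3 steps."""
    for s in range(STAGES):
        v = stage(s, v)
    return v, {"bfly4_units": 16, "steps": STAGES}


def superfolded(v):
    """One butterfly reused under an index counter: 1 unit, 48 steps."""
    steps = 0
    for s in range(STAGES):
        out = list(v)
        for idx in range(NODES):  # Index: 0 to 15
            out[4 * idx:4 * idx + 4] = bfly4(s, v[4 * idx:4 * idx + 4])
            steps += 1
        v = permute(out)  # performed once the index reaches 15
    return v, {"bfly4_units": 1, "steps": steps}
```

In software the three functions look almost alike. In hardware each folding step adds counters, muxes, and demuxes around the shared unit, so every variant needs its own control logic even though the algorithm is unchanged.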

31 IP Reuse via parameterized modules

Example: OFDM based protocols (OFDM multi-radio)

[Diagram blocks: MAC (standard specific, potential reuse); TX Controller, Scrambler, FEC Encoder, Interleaver, Mapper, Pilot & Guard Insertion, IFFT, CP Insertion, D/A; RX Controller, De-Scrambler, FEC Decoder, De-Interleaver, De-Mapper, Channel Estimator, FFT, Synchronizer, S/P, A/D]

Callouts:
- Reusable algorithm with different parameter settings: WiFi: x^7 + x^4 + 1; WiMAX: x^15 + x…; WUSB: x^15 + x…
- Different algorithms: Convolutional, Reed-Solomon, Turbo
- Different throughput requirements: WiFi: 0.25MHz; WiMAX: 0.03MHz; WUSB: 128pt 8MHz

85% reusable code between WiFi and WiMAX
From WiFi to WiMAX in 4 weeks
(Alfred) Man Cheuk Ng, …
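A scrambler is a good small example of "reusable algorithm with different parameter settings": the structure is fixed and only the polynomial changes. The Python sketch below parameterizes a linear-feedback shift register by its tap exponents and exercises it with the WiFi polynomial x^7 + x^4 + 1 from the slide. The function names and the all-ones seed are choices made for this illustration.

```python
def lfsr_stream(taps, seed_bits, n):
    """Generate n bits from a shift register with feedback polynomial given by `taps`.

    taps: polynomial exponents, e.g. (7, 4) for x^7 + x^4 + 1.
    seed_bits: initial register contents; seed_bits[i] is the stage for x^(i+1).
    """
    state = list(seed_bits)
    if len(state) != max(taps):
        raise ValueError("seed length must equal the polynomial degree")
    out = []
    for _ in range(n):
        feedback = 0
        for t in taps:
            feedback ^= state[t - 1]
        out.append(feedback)
        state = [feedback] + state[:-1]  # shift; the new bit enters the x^1 stage
    return out


def scramble(bits, taps=(7, 4), seed_bits=(1,) * 7):
    """XOR the data with the generated stream; applying it twice restores the data."""
    stream = lfsr_stream(taps, seed_bits, len(bits))
    return [b ^ s for b, s in zip(bits, stream)]
```

A different protocol would be handled by passing a different `taps` tuple and a seed of matching length, with no change to the body of either function. With the all-ones seed the (7, 4) generator repeats every 127 bits.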

32 Summarizing, we have Point 4: Parameterized architectures can deliver tremendous succinctness and reuse, but each instance of a parameterized architecture has different resource contentions and so needs different control logic. Consider the effort to redesign the control logic for each instance (RTL), vs. regenerating it automatically due to control-adaptivity.

33 Control-adaptivity has a profound effect on design methodology

[Diagram: parallel refinement tracks with synthesis and execution at every step.
Design IP/System: Executable spec; Initial IP (approx); Add Functionality; Add Architectural Detail; Explore Architecture; Final IP
Verification IP: Initial Testbench (approx); Add Functionality; Final Testbench
Software: Early, fast SW platform
Interfaces: High-level Transactional I/F; Refine/Replace; Final Bus/Interconnect I/F
Verify/Analyze/Debug with Bluespec Tools: FPGA/Emulation (>>100X speed), Bluesim (10X speed), RTL simulation (1X speed)
Outputs: Bluespec Synthesis or RTL synthesis; Power Estimation; Verilog
Synthesizable testbenches, for fast verification]

Software fragment shown on the slide:

    void TestBench::Result( void* tbPointer )
    {
        TestBench* tb = reinterpret_cast<TestBench*>( tbPointer );
        unsigned data[4];
        doResp( tb->drv0, &(tb->out[0]), data );
        cout << "data is: " << hex << data[0] << data[1] << data[2] << data[3] << endl;
        Tcl_Obj* cmd = Tcl_NewStringObj( "reportResults ", -1 );
        for( int i=0; i<4; i++ )
        ...

34 Gradually expanding ways in which BSV is used

- Modeling for early SW development. Denali: cycle-accurate LPDDR controller
- Modeling for architecture exploration. Intel, IBM, UT Austin, CMU: x86 and Power multi-thread/multi-core design
- Verification. Qualcomm: testbenches, transactors
- and, of course, IP creation (original use). TI, ST, Nokia, Mercury, MIT: DMAs, LCD controllers, wireless modems, video, ...

35 In summary... (summarizing)

Atomic Transactions are good because they make your designs control-adaptive. And this helps you in writing executable specs, in modeling, in refining models to designs, in creating complex, reusable designs, maintainable and evolvable designs.

Audience reaction: Ah! I see! What about Santa Claus* and my soul?

* ask me if you want to see the Bluespec solution

36 Extra slides

37 Example: IP verification (AXI demo on Bluespec)

Bluespec lines of code: 2,000 (including comments)
ASIC gates: 125K
Virtex-4 FX100 slices: 4,723 (10% utilization)

Platform              Speedup     Rate
Verilog simulation    1X          1.4K cycles/sec
Bluesim simulation    22X         31K cycles/sec
FPGA emulation        35,714X     50M cycles/sec

[Diagram: the same C/SystemC/Bluesim testbench driving, in turn, Verilog simulation, Bluesim simulation, and FPGA emulation.]

22 day Verilog sim → 1 day Bluesim → 53 sec on FPGA
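The speedup and run-time figures on this slide are mutually consistent, which a few lines of Python arithmetic can confirm. The rates and the 22-day baseline are taken from the slide; the dictionary and function names are invented for the check.

```python
# Simulation rates from the slide, in cycles per second.
RATES = {"verilog": 1.4e3, "bluesim": 31e3, "fpga": 50e6}

SECONDS_PER_DAY = 24 * 3600


def speedup(platform, baseline="verilog"):
    """Ratio of simulation rates relative to the baseline platform."""
    return RATES[platform] / RATES[baseline]


def runtime_seconds(platform, baseline_days=22, baseline="verilog"):
    """Time for the same workload that takes `baseline_days` on the baseline platform."""
    return baseline_days * SECONDS_PER_DAY / speedup(platform, baseline)
```

Rounded, `speedup("bluesim")` gives 22 and `speedup("fpga")` gives 35714, while `runtime_seconds("fpga")` comes to about 53 seconds and the Bluesim run time to just under one day, matching the quoted progression from 22 days to 1 day to 53 seconds.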

38 Example: IP verification (AXI demo on Bluespec)

[Diagram: Workstation connected by USB cable to an Avnet $3.5K FPGA board]

Why BSV?
- Inexpensive emulation platform (low cost FPGA board)
- Easy setup (BSV transactors)
- Quick availability (synthesizability of early approximation)