Logic Emulation and Prototyping: It’s the Interconnect (Rent rules) Mike Butts NVIDIA RAMP at Stanford, August 2010.


1 Logic Emulation and Prototyping: It’s the Interconnect (Rent rules) Mike Butts NVIDIA RAMP at Stanford, August 2010

2 In the beginning I’ve always been a computer architect. Before the ASIC (early 1980’s) we built computers with off-the-shelf chips. – Am2901 bit slices, PALs, 7400 logic. Just hook up some parts and run it now. Full-speed wire-wrapped prototypes. When it ran it shipped. Design Verification: It doesn’t crash. Debug visibility: scope, maybe LA. Design revision: wire-wrap gun. Project time: months, not years. Example: Kurzweil 1978 – Nova clone for Kurzweil Reading Machine – 2901s, 74F TTL, 16Kb DRAMs, 4 MHz clock – When the prototype ran the reading machine app for three days without crashing, I released the design to manufacturing. Mike Butts - RAMP - August 2010

3 Then came the ASIC Tapeout Must get the design perfect before tapeout. Emergence of EDA, design capture, logic simulation: “Daisy/Mentor/Valid”. Simulation is very slow, must write testbenches, can’t run the real app. This makes the design process very conservative. Crimps architect’s style. To me EDA has always been a bit of a video game.

4 FPGAs Emerge! Real hardware! We can prototype again! But simulators are automatic, and FPGA tools are strange and hard. What if we had an automatic box of FPGAs that plugs into an ASIC socket? Emulate! Many FPGAs are needed. How to interconnect? Extend the row-column FPGA architecture: Sample, US 5,109,353, 1992. XC2064 FPGA, 64 CLBs, 1986.

5 First Logic Emulator Product Quickturn RPM: 1989. Nearest-neighbor interconnect. Hard to get expected logic capacity, hard to manage delays. But it worked! Sample, US 5,109,353, 1992

6 First big success: Intel P5 Quickturn worked closely with Intel to emulate the original Pentium microarchitecture: P5. – Ten RPM systems were cabled together, and the design was manually broken up into RPM-sized segments which were emulated. “The emulator had one more benefit: blunting the spread of RISC. At a technology forum for PC companies and software developers last November (1991), (Intel VP Albert Yu) dialed it up and ran a Lotus 1-2-3 spreadsheet from a terminal. The crowd was astonished that a model was already working. Six months later, Compaq Computer Corp. scrubbed its plans for a RISC-based PC.” - Business Week 6/1/1992 “Inside Intel”

7 But row/column doesn’t scale Logic circuit topology is not flat, 2D nearest-neighbor. Wires go anywhere. FPGA pins get used up by nets that are just passing through. Long delays. Quickturn RPM had serious capacity, placement and routing issues. It turns out the wires and pins of an FPGA are its most precious resource. – 80-90% of FPGA transistors are interconnect. – “We charge for the wires, the gates are free” -- Altera VP Eng. Clive McCarthy, 1994 Logic density follows Moore’s Law, but packaging and pin counts do not. – Not even the square root (perimeter). Logic emulators inevitably outstripped FPGA pin counts. Why???

8 Rent’s Rule The problem of how many pins to provide for each partition of a system came up in the IBM 1401 project, 1960. Ed Rent found this empirical rule for the relationship between pins per logic block and the number of gates in the block: $p = K g^{r}$, where p = pins, g = gates, r is the “Rent exponent”, and K is the “Rent constant”.

9 Rent’s Rule IBM 1401 used a Standard Modular System (SMS) of logic modules, backplanes and chassis, with standard pin counts. How to size? Rent’s Rule. Rent never published, but in 1971 Landman and Russo did. B. S. Landman, R. L. Russo, On a Pin Versus Block Relationship For Partitions of Logic Graphs, IEEE Trans. Comp., vol. C-20, 1971. Profound influence on system architecture and CAD/EDA tools. Different Rent coefficients apply to different environments. Empirical. Theory? Inconclusive. – Exponent > 0.5: global connectivity. – Constant > 1: net fanout. Rent’s Rule guided FPGA emulation system architecture. We used $p = 2.5\,g^{0.57}$. IEEE Solid-State Circuits magazine, winter 2010
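A minimal Python sketch of the rule as stated on this slide, using the quoted coefficients K = 2.5 and r = 0.57. The function name `rent_pins` is an illustrative choice, not something from the talk. It shows that doubling the gate count multiplies the pin estimate by about 1.48, so pin demand keeps growing with block size.

```python
def rent_pins(gates, k=2.5, r=0.57):
    """Estimate the pins needed by a block of `gates` gates: p = K * g**r."""
    return k * gates ** r

# Doubling the gate count scales the pin estimate by 2**r (about 1.48x),
# regardless of the starting block size.
for g in (1_000, 2_000, 4_000, 8_000):
    print(f"{g:6d} gates -> about {round(rent_pins(g))} pins")

print("growth per doubling:", round(2 ** 0.57, 2))
```

The coefficients are empirical and differ between environments, as the slide notes, so treat the defaults as the values used for this emulation work only.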

10 Emulators: Big Green Button A logic emulator is automatic and universal. It takes any arbitrary netlist and implements it in standard hardware, with little or no user intervention. Uniform hardware, uniform-size FPGAs. Design netlist is cut arbitrarily into many equal partitions to keep the chips full. – Balanced k-way partitioning (NP-hard) This means Rent’s Rule applies. An FPGA prototype is manual and specific. Hardware is usually chosen for one project, the design is manually partitioned according to its modular structure, FPGAs are sized accordingly. System modules naturally have smaller pinouts than arbitrary cuts. Rent’s Rule does not apply. (Well, yes it does but weakly.) G. Schelle et al., Intel Nehalem Processor Core Made FPGA Synthesizable, ACM FPGA 2010. M. Butts, “Emulators”, Wiley Encyclopedia of Electrical and Electronics Engineering, 1999.
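To make the difference between a modular cut and an arbitrary cut concrete, here is a small Python sketch that counts the external pins each partition needs. The netlist, the gate names and the helper `partition_pins` are all hypothetical. It assumes a net costs one pin on every partition it touches once it spans more than one.

```python
def partition_pins(nets, part_of):
    """Count external pins per partition.

    nets:    list of tuples of gate names, one tuple per net
    part_of: dict mapping gate name -> partition id
    A net that spans several partitions uses one pin on each of them.
    """
    pins = {p: 0 for p in set(part_of.values())}
    for net in nets:
        touched = {part_of[g] for g in net}
        if len(touched) > 1:
            for p in touched:
                pins[p] += 1
    return pins

# Toy ring of four gates: a-b, b-c, c-d, d-a.
nets = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d")]

modular = {"a": 0, "b": 0, "c": 1, "d": 1}    # neighbours kept together
arbitrary = {"a": 0, "c": 0, "b": 1, "d": 1}  # neighbours split apart

print("modular cut:  ", partition_pins(nets, modular))
print("arbitrary cut:", partition_pins(nets, arbitrary))
```

Even in this tiny example the cut that ignores structure needs twice the pins, which is the effect the slide attributes to automatic equal-size partitioning.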

11 Rent’s Rule says FPGA Pins are Precious XC3090: 640 LUTs, 5K gates. Rent’s Rule says 325 pins, FPGA has 144 pins, only 44%. Lesson: FPGA pins are vital to FPGA emulator capacity. => Separate interconnect. Crossbar is ideal – Interconnects any pins, any way, with any fanout – Uniform delay: one level. Far too expensive: $O(n^2)$. Far more fanout than needed, average net fanout is 2 to 3. Doesn’t take advantage of FPGA pin routability. Butts, US 5,036,473, 1991

12 Partial Crossbar Interconnect Drop out most of the crosspoints, leaving a partial crossbar. – Group FPGA pins into subsets, – Fully populate crosspoints within each subset, – Leave the rest out. For each net, find a subset which can route it. – High fanout nets first. Map nets to FPGA pins accordingly. Still uniform single-level delay. Symmetrical, no placement needed. Scalable: $O(n)$. Butts, US 5,036,473, 1991
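The routing step described on this slide can be sketched in Python as a greedy pass. This is a simplified illustration under stated assumptions, not the patented algorithm. It assumes every FPGA contributes the same number of pins (`pins_per_subset`) to each crossbar subset, that a net is just the set of FPGAs it touches, and that a net fits a subset when every one of its FPGAs still has a free pin there. The function name and parameters are hypothetical.

```python
def route_partial_crossbar(nets, num_fpgas, num_subsets, pins_per_subset):
    """Greedily assign each net to one crossbar subset, high fanout first.

    nets: list of sets of FPGA indices touched by each net.
    Returns a dict: net index -> subset index, or None if unroutable.
    """
    # free[k][f] = unused pins of FPGA f that are wired to subset k
    free = [[pins_per_subset] * num_fpgas for _ in range(num_subsets)]
    assignment = {}
    # Widest nets first: they are the hardest to fit.
    order = sorted(range(len(nets)), key=lambda i: len(nets[i]), reverse=True)
    for i in order:
        fpgas = set(nets[i])
        assignment[i] = None
        for k in range(num_subsets):
            if all(free[k][f] > 0 for f in fpgas):
                for f in fpgas:
                    free[k][f] -= 1   # one pin per FPGA on this subset
                assignment[i] = k
                break
    return assignment

nets = [{0, 1}, {0, 1, 2}, {1, 2}]
print(route_partial_crossbar(nets, num_fpgas=3, num_subsets=2, pins_per_subset=2))
```

Because every subset reaches every FPGA, the choice of subset never depends on where logic was placed, which matches the slide's point that no placement is needed.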

13 Partial Crossbar Systems Redraw: Group each subset’s crosspoints into a crossbar chip for that subset. Each crossbar has pins to every FPGA, and vice versa. Make crossbar chip or use cheap FPGA. Multilevel for systems: second-level crossbars on the backplane. Max delay is three hops. Cost is slightly higher than $O(n)$. Scalable. Partial crossbar interconnect made large-scale logic emulation practical. Butts, US 5,036,473, 1991

14 History of FPGA Emulators, 1989-2000 Nearest-neighbor architecture: Quickturn RPM (1989): First commercial emulator. Virtual Machine Works (1994): Virtual Wires pin multiplexing. Partial Crossbar architecture: Mentor Realizer (1989): First hardware, emulated Apple II mobo. Mentor Realizer (1991): Proof-of-concept system prototype – 8 logic boards (14 XC3090 FPGAs, 32 XC2018 xbars), 64 XC2018 2nd-level xbars. Mentor sold this logic emulator technology to Quickturn (1992). Quickturn Enterprise (1993): First commercial partial crossbar emulator – 11 logic boards (46 XC3090s, 46 custom xbars), 144 2nd-level xbars, 330K gates. HP Teramac (1995): Configurable computing research machine: 1M gates. Quickturn System Realizer (1995): XC4000 series, 2M gates. Quickturn Mercury Plus (2000): Large custom emulation FPGA, 20M gates.

15 FPGA Emulation Clocking Issues ASIC and custom chips have gated clocks, latches, many clock domains. FPGAs can introduce their own violations. FPGA interconnect delay is very hard to manage. – FPGAs use dedicated low-skew clock networks. Gated clocks: must run clock through logic blocks. Hold-time violations: data arrives sooner than the clock. Latches: timing of both edges matters, plus there’s latch transparency. How to reliably map these to FPGA? Re-synthesis. – Map gated clocks to FPGA FF clock enables (which is the gate, which is the clock?) – Map latches into flops, using 2x clocking. Emulators developed sophisticated design mapping techniques.

16 Emulator User Psychology Emulators were often hard to use, especially in the early days. – First-time users + clocking issues = errors. – Ultra-high pincount backplanes, cabling = errors. This trained users to blame the emulator. After weeks of effort, they finally get their design up and running on the emulator. A bug is found. What is their response? a) “Wonderful! It found a bug in our design. We’re getting value from all this expense.” b) “It’s not our design, it’s your emulator.” User starts running diagnostics and swapping boards. Swap enough boards and guess what happens... Solutions: Locked board extractors, Better emulators. Emulators have thousands of pins per board

17 1995: Quickturn System Realizer Up to 990 FPGAs (Xilinx XC4013), custom crossbar chips. Logic board: 45 FPGAs, 100K gates – 2500 pins to backplane, 900 pins in-circuit or LA. Max system: 22 boards, 2M gates, 14 MB RAM. Built-in LAPG 14K I/Os for multiple systems. Compiler: 100K gates/hr. Two-level partial crossbar connects 990 FPGAs in 3 hops max.

18 2000: Mercury Plus FPGA Custom FPGA for emulation. Five-level partial crossbar across entire 20M gate system: – Logic cluster: full crossbar – Two partial crossbar levels on-chip – Two more levels in the system. 10x faster compile. Predictable capacity and delays. 6-LUTs, FFs, RAMs – hold time trimmers. Full visibility, on-chip logic analyzer. QT’s last FPGA emulator

19 FPGA Pin Shortage Gets Worse Over Time
Device        LCs (4-LUT)   Gates       Rent pins   Real pins*   Shortfall
XC2064        128           1024        130         58           2.24
XC3090        640           5120        325         144          2.26
XC4062        5472          43776       1105        352          3.14
XC40200       16758         134064      2092        448          4.67
XCV800        21168         169344      2390        512          4.67
XC2V6000      67584         540672      4631        1104         4.20
XC4VLX160     200448        1603584     8607        960          8.97
XC6VLX550T    549888        4399104     15299       1200         12.75
XC7V2000T     1954560       15636480    31521       1200         26.27
Using FPGAs directly in logic emulators falls to Rent’s Rule – FPGA-based emulators were always starved for pins. – Xilinx FPGAs from the beginning. Altera, other FPGAs are similar.
* ordinary pins only, SERDES latency is too long for logic emulation
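The shortfall column can be recomputed from the gate counts and real pin counts with a short Python sketch. It assumes shortfall means the Rent estimate divided by the real pin count, and it reuses the coefficients quoted earlier in the talk (K = 2.5, r = 0.57). The names `DEVICES` and `shortfall_table` are illustrative.

```python
# (device, gates, real pins) copied from the table on this slide
DEVICES = [
    ("XC2064", 1024, 58),
    ("XC3090", 5120, 144),
    ("XC4062", 43776, 352),
    ("XC40200", 134064, 448),
    ("XCV800", 169344, 512),
    ("XC2V6000", 540672, 1104),
    ("XC4VLX160", 1603584, 960),
    ("XC6VLX550T", 4399104, 1200),
    ("XC7V2000T", 15636480, 1200),
]

def shortfall_table(devices, k=2.5, r=0.57):
    """Return (name, rounded Rent pins, real pins, Rent/real ratio) per device."""
    rows = []
    for name, gates, real_pins in devices:
        rent = k * gates ** r
        rows.append((name, round(rent), real_pins, rent / real_pins))
    return rows

for name, rent, real, ratio in shortfall_table(DEVICES):
    print(f"{name:12s} rent={rent:6d} real={real:5d} shortfall={ratio:6.2f}")
```

Gate counts grow by four orders of magnitude across the table while real pins grow by roughly 20x, so the ratio climbs steadily.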

20 FPGA Emulator Pin Multiplexing Babb et al., “Logic Emulation with Virtual Wires”, vol. 16, pp. 609 - 626, 1997. Xilinx data book. Multiple nets per pin, slower design clock. Quickturn: – Asynchronous free-running high-speed using DDR IOBs – Transparent to the emulated design. VMW: Virtual Wires – Synchronous to design – Modify design netlist: Eval/mux/latch, many levels – Multiple clock domains?

21 Continuous to Discrete Time As FPGAs got further and further from Rent’s Rule, FPGA emulators went to deeper and deeper pin multiplexing. Continuous time: – Pure FPGA emulator runs in the continuous time of the design. Signals propagate as in the real hardware, just with different delays. Continuous / discrete time mix: – Pin-multiplexed FPGA emulator runs in an ad-hoc mix of continuous and discrete time. Yet pins still mostly lie idle. Discrete time: – Go all the way into discrete time == levelized simulation Now it’s a massively parallel computer

22 Processor-based Emulation Levelize netlist, evaluate all gates every cycle, level-by-level. No branches: deep pipelining, fast, massively parallel, very scalable. Compile-time net scheduling: Emulated design escapes Rent’s Rule. IBM Yorktown Simulation Engine: Monty Denneau, DAC 1982. – “... high speed special purpose parallel processor designed and built at the IBM Thomas J. Watson Research Center to simulate logical operation... up to 2,000,000 gates at a rate exceeding 3 billion gate computations per second” IBM Engineering Verification Engine: Beece et al., DAC 1988.
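Levelized evaluation can be illustrated with a small Python sketch. It covers combinational logic only, and the netlist format, the `OPS` table and the function names are assumptions made for the example rather than anything from the IBM or Quickturn engines. Gates are sorted into levels so that every gate's inputs come from earlier levels, then all gates are evaluated in that fixed order with no data-dependent branching.

```python
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: 1 - a,
}

def levelize(gates, primary_inputs):
    """gates: dict output name -> (op, [input names]). Returns a list of levels."""
    known = set(primary_inputs)
    remaining = dict(gates)
    levels = []
    while remaining:
        # A gate is ready once all of its inputs were produced by earlier levels.
        ready = [g for g, (_, ins) in remaining.items() if all(i in known for i in ins)]
        if not ready:
            raise ValueError("combinational loop or undriven net")
        for g in ready:
            del remaining[g]
        known.update(ready)
        levels.append(ready)
    return levels

def evaluate(gates, levels, inputs):
    """Evaluate every gate once, level by level, in the precomputed order."""
    values = dict(inputs)
    for level in levels:
        for g in level:
            op, ins = gates[g]
            values[g] = OPS[op](*(values[i] for i in ins))
    return values

# One-bit full adder as a toy netlist.
adder = {
    "s1": ("XOR", ["a", "b"]),
    "sum": ("XOR", ["s1", "cin"]),
    "c1": ("AND", ["a", "b"]),
    "c2": ("AND", ["s1", "cin"]),
    "cout": ("OR", ["c1", "c2"]),
}
levels = levelize(adder, ["a", "b", "cin"])
print(levels)
print(evaluate(adder, levels, {"a": 1, "b": 0, "cin": 1}))
```

Since the schedule is fixed at compile time, gates within one level are independent and could be spread across many processors, which is the property the slide relies on.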

23 Quickturn CoBALT Wm. Beausoleil et al., IBM. 1997 commercialization of IBM engines. 8M gates, 1 MHz emulation speed. IBM HW, QT front end compiler. Maps multi clock domains, latches, gated clocks onto single faster clock, making use of FPGA compiler experience. Compiles 1M gates / hour. Full custom 100 MHz 250 nm chip with 64 logic processors. 65 chips / board.

24 Processor-based Emulation in 2000’s IBM technology and team acquired by QT, then QT acquired by Cadence. FPGA emulators dropped. 2002: Palladium – 128M gates, 0.75 MHz – Full visibility – Compile 30M gates / hour – Multi-user. 2004: Palladium II – 256M gates, 1.5 MHz. 2007: Palladium III – 256M gates, 2 MHz. 2010: Palladium XP – 2000M gates, 4 MHz. Palladium XP

25 Emulation at NVIDIA One of the largest emulation labs in the world

26 Early Emulation Success In 1995, CEO Jensen Huang “spent $1 million, a third of the company’s cash, on a technology known as emulation, which allows engineers to play with virtual copies of their graphics chips before they put them into silicon. That allowed Nvidia to speed a new graphics chip to market every six to nine months, a pace the company has sustained ever since.” - Forbes, 1/7/08 RIVA 128, or "NV3", was one of the first consumer graphics processing units to integrate 2D and 3D acceleration. When announced in 1997, the market found the specifications hard to believe: performance superior to market-leader 3dfx. RIVA 128 shipped in volume, and the combination of its low cost and high performance made it a popular choice for OEMs. - Wikipedia

27 Emulation in 2005 The specific verification goals that were required for the GeForce 6800 project include: Bring up a new generation of GPUs on an accelerated verification platform in a one-week time frame. Derivative chips must be brought up in a few days. Automate the Compile-Run-Debug process so that ASIC design engineers could use an accelerated verification platform. Verify GPU and frame-buffer/system-memory interaction. Validate AGP/PCI-bus interface functions. Ensure functionality at various levels of abstraction (RTL and gates). Expand accelerated verification solution to ATPG and BIST applications. - Chip Design Magazine, January 2005

28 Emulation Today 2010: Cadence Palladium XP. Up to 2 billion gates, up to 4 MHz, up to 512 users – Compile up to 35M gates / hour on 1 PC. Full visibility to all signals. Integrates with logic and power simulation, SystemC/C++ models, prototype hardware. System integration steps used at NVIDIA: – Design and verify the silicon itself. Power analysis is vital. – Run silicon in the virtual system (such as a PC), verify that the GPU works in a system. – Run lots of software applications on the virtualized platform. - “NVidia Engineer Cites HW/SW Integration Challenges”, 5/5/10, cadence.com

29 FPGA Prototyping today FPGA prototyping is widely used as a verification tool by chip development projects (not to mention RAMP of course). Practical for one to four to maybe ten FPGAs. – 2-4M gates each, typically 10 to 50 MHz. Prototypes are rarely disclosed; two research efforts were: Atom CPU in one Virtex-5 LX330, 50 MHz (ACM FPGA ‘09). Nehalem CPU in five FPGAs, 520 kHz due to 18- to 24-way pin multiplexing (ACM FPGA ‘10).

30 Future State-of-the-art projects continue to rely heavily on processor-based emulation and FPGA prototyping for tapeouts. State-of-the-art tapeouts today cost $50-100M++. – Only possible for established $B vendors. – Very hard to get new chip startups funded. Therefore, ASIC project starts are dropping. FPGAs and GPUs are the only processing silicon that scales with Moore’s Law (so far). – Their vendors are the “foundries” for new HW efforts. Off-the-shelf chips: we’re coming full circle.

31 The Ultimate Interconnect Human brain: $10^{11}$ neurons, $10^{14}$ to $10^{15}$ total synapses, 20-40 W, somewhat reconfigurable. “The Brain Unveiled”, Technology Review, Nov-Dec, 2008

