RAMP in Retrospect David Patterson August 25, 2010

Outline
- Beginning and Original Vision
- Successes
- Problems and Mistaken Assumptions
- New Direction: FPGA Simulators
- Observations

Where did RAMP come from?
- June 7, 2005: ISCA Panel Session, 2:30-4 PM, "Chip Multiprocessors are here, but where are the threads?"


Where did RAMP come from? (cont'd)
- Hallway conversations that evening (>4 PM) and the next day (<noon, end of ISCA) with Krste Asanović (MIT), Dave Patterson (UCB), …
- Krste recruited from the "Workshop on Architecture Research using FPGAs" community: Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), and John Wawrzynek (Berkeley, PI)
- Met at Berkeley and wrote an NSF proposal based on the BEE2 board at Berkeley in July/August; funded the following March

Problems with "Manycore" Sea Change (Original RAMP Vision)
1. Algorithms, programming languages, compilers, operating systems, architectures, libraries, … not ready for 1000 CPUs/chip
2. Only companies can build HW, and it takes years
3. Software people don't start working hard until hardware arrives; 3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
4. How do we get 1000-CPU systems into the hands of researchers to innovate in a timely fashion in algorithms, compilers, languages, OS, architectures, …?
5. Can we avoid waiting years between HW/SW iterations?

Build Academic Manycore from FPGAs (Original RAMP Vision)
- As ≈16 CPUs will fit in a Field Programmable Gate Array (FPGA), a 1000-CPU system from ≈64 FPGAs?
  - 8 32-bit simple "soft core" RISC CPUs at 100 MHz in 2004 (Virtex-II)
  - FPGA generations every 1.5 yrs: ≈2X CPUs, ≈1.2X clock rate
- HW research community does logic design ("gate shareware") to create an out-of-the-box Manycore
  - E.g., 1000-processor, standard-ISA, binary-compatible, 64-bit, cache-coherent, ≈150 MHz/CPU in 2007
  - RAMPants: 10 faculty at Berkeley, CMU, MIT, Stanford, Texas, and Washington
- "Research Accelerator for Multiple Processors" as a vehicle to attract many to the parallel challenge
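(The arithmetic behind the estimate: ≈16 CPUs/FPGA × ≈64 FPGAs = 1024 ≈ 1000 CPUs.)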

Why Good for Research Manycore? (Original RAMP Vision)

                         SMP                   Cluster               Simulate                RAMP
Scalability (1k CPUs)    C                     A                     A                       A
Cost (1k CPUs)           F ($40M)              C ($2-3M)             A+ ($0M)                A ($0.1-0.2M)
Cost of ownership        A                     D                     A                       A
Power/Space              D (120 kW, 12 racks)  D (120 kW, 12 racks)  A+ (0.1 kW, 0.1 racks)  A (1.5 kW, 0.3 racks)
Community                D                     A                     A                       A
Observability            D                     C                     A+                      A+
Reproducibility          B                     D                     A+                      A+
Reconfigurability        D                     C                     A+                      A+
Credibility              A+                    A+                    F                       B+/A-
Performance (clock)      A (2 GHz)             A (3 GHz)             F (0 GHz)               C (0.1 GHz)
GPA                      C                     B-                    B                       A-

Software Architecture Model Execution (SAME)

             Median Instructions     Median    Median Instructions
             Simulated/Benchmark     #Cores    Simulated/Core
ISCA 1998    100M                    1         100M
ISCA 2008    100M                    16        6.25M

Effect is dramatically shorter (~10 ms) simulation runs
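A quick sanity check on that scale (assuming, purely for illustration, a 1 GHz target clock and 1 instruction per cycle, neither of which the slide states): 100 × 10^6 instructions / (16 cores × 10^9 instructions/s per core) ≈ 6 ms of target execution per experiment, the ~10 ms scale noted above.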

Why RAMP More Credible? (Original RAMP Vision)
- Starting point for processor is a debugged design from industry in HDL
- Fast enough that it can run more software, more experiments than simulators
- Design flow, CAD similar to real hardware: logic synthesis, place and route, timing analysis
- HDL units implement operation vs. a high-level description of function
  - Model queuing delays at buffers by building real buffers
- Must work well enough to run an OS
  - Can't go backwards in time, which simulators can unintentionally
- Can measure anything as sanity checks

Outline
- Beginning and Original Vision
- Successes
- Problems and Mistaken Assumptions
- New Direction: FPGA Simulators
- Observations

RAMP Blue: 1008-core MPP
- Core is the soft-core MicroBlaze (32-bit Xilinx RISC)
- 12 MicroBlaze cores/FPGA; 21 BEE2s × 4 FPGAs/module = 84 FPGAs, so 84 × 12 = 1008 cores at 90 MHz; ≈$10k/board
- Full star connection between modules
- Works Jan 2007; runs NAS benchmarks in UPC
- Final RAMP Blue demo in poster session today!
- Krasnov, Burke, Schultz, Wawrzynek at Berkeley

RAMP Red: Transactional Memory
- 8 CPUs with 32KB L1 data cache with Transactional Memory support (Kozyrakis, Olukotun, … at Stanford)
  - CPUs are the FPGA's hard PowerPC 405 cores, with emulated FPU
  - UMA access to shared memory (no L2 yet)
  - Caches and memory operate at 100 MHz; links between FPGAs run at 200 MHz; CPUs operate at 300 MHz
- A separate, 9th processor runs the OS (PowerPC Linux)
- It works: runs SPLASH-2 benchmarks, AI apps, a C version of SPECjbb2000 (a 3-tier-like benchmark)
- 1st Transactional Memory Computer!
- Transactional Memory RAMP runs 100x faster than a simulator on an Apple 2 GHz G5 (PowerPC)

Academic / Industry Cooperation
- Cooperation between universities: Berkeley, CMU, MIT, Texas, Washington
- Cooperation between companies: Intel, IBM, Microsoft, Sun, Xilinx, …
- Offspring from the marriage of academia and industry: BEEcube

Other successes
- RAMP Orange (Texas): FAST, an x86 software simulator + FPGA for cycle-accurate timing
- ProtoFlex (CMU): Simics + FPGA; 16 processors, 40X faster than the simulator alone
- OpenSPARC: open-source T1 processor
- DOE: RAMP for HW/SW co-development BEFORE buying hardware (Yelick's talk)
- Datacenter-in-a-box: 10K processors + networking simulator (Zhangxi Tan's demo)
- BEE3: Microsoft + BEEcube

BEE3 Around the World
Anadolu University, Barcelona Supercomputing Center, Cambridge University, University of Cyprus, Tsinghua University, University of Alabama in Huntsville, Leiden University, MIT, University of Michigan, Pennsylvania State University, Stanford University, Technische Universität Darmstadt, Tokyo University, Peking University, CMC Microsystems, Thales Group, Sun Microsystems, Microsoft Corporation, L3 Communications, UC Berkeley, Lawrence Berkeley National Laboratory, UC Los Angeles, UC San Diego, North Carolina State University, University of Pennsylvania, Fort George G. Meade, GE Global Research, The Aerospace Corporation

Sun Never Sets on BEE3

Outline
- Beginning and Original Vision
- Successes
- Problems and Mistaken Assumptions
- New Direction: FPGA Simulators
- Observations

Problems and mistaken assumptions
"Starting point for processor is debugged design from Industry in HDL"
- Tapeout every day => it's easy to fix => debugged as well as software is (but not done by world-class programmers)
- Most "gateware" IP blocks are starting points for a working block
- Others are large, brittle, monolithic blocks of HDL that are hard to subset

Mistaken Assumptions: FPGA CAD Tools as Good as ASIC
"Design flow, CAD similar to real hardware"
- Compared to ASIC tools, FPGA tools are immature
- Encountered 84 formally-tracked bugs developing RAMP Gold (including several in the formal verification tools!)
- Highly frustrating to many; the biggest barrier by far
- Making internal formats proprietary prevented a "Mead-Conway" effect like the VLSI era of the 1980s
  - "I can do it better," and they did => reinvented the CAD industry
  - FPGA = no academic is allowed to try to do it better

Mistaken Assumptions: FPGAs are Easy
"Architecture researchers can program FPGAs"
- Reaction of some: "Too hard for me to write (or even modify)"
- Do we have a generation of La-Z-Boy architecture researchers, spoiled by ILP/cache studies using just software simulators?
- Don't know which end of the soldering iron to grab??
- Make sure our universities don't graduate any more La-Z-Boy architecture researchers!

Problems and mistaken assumptions
"RAMP consortium will share IP"
- Due to differences in:
  - Instruction sets (x86 vs. SPARC)
  - Number of target cores (multicore vs. manycore)
  - HDL (Bluespec vs. Verilog)
- Ended up sharing ideas and experiences vs. IP

Problems and mistaken assumptions
"HDL units implement operation vs. a high-level description of function"
- E.g., model queuing delays at buffers by building real buffers
- Since we couldn't simply cut and paste IP, we needed a new solution: build an architecture simulator in FPGAs vs. build an FPGA computer
  - FPGA Architecture Model Execution (FAME)
  - Took a while to figure out what to do and how to do it

FAME Design Space
Three dimensions of FAME simulators:
- Direct or Decoupled: does one host cycle model one target cycle?
- Full RTL or Abstract RTL?
- Host single-threaded or host multi-threaded?
See the ISCA paper for a FAME taxonomy: "A Case for FAME: FPGA Architecture Model Execution," Zhangxi Tan, Andrew Waterman, Henry Cook, Sarah Bird, Krste Asanović, David Patterson, Proc. Int'l Symposium on Computer Architecture, June 2010.
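The paper names each design point by reading the three binary choices as a binary number. A small sketch of that encoding (the bit order here is an assumption based on the paper; see it for the authoritative definition) also explains the "FAME 1" and "FAME 7" labels on the RAMP Gold slide below:

```python
# FAME level = the binary number formed by the three design choices,
# per the ISCA 2010 FAME paper (bit order assumed: decoupled is the
# low bit, abstract RTL the middle bit, multithreading the high bit).
def fame_level(decoupled, abstract_rtl, multithreaded):
    return (multithreaded << 2) | (abstract_rtl << 1) | int(decoupled)

print(fame_level(decoupled=True, abstract_rtl=False, multithreaded=False))  # 1
print(fame_level(decoupled=True, abstract_rtl=True,  multithreaded=True))   # 7 (RAMP Gold)
```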

FAME Dimension 1: Direct vs. Decoupled
- Direct FAME: compile target RTL to the FPGA
  - Problem: common ASIC structures map poorly to FPGAs
  - Solution: resource-efficient multi-cycle FPGA mapping
- Decoupled FAME: decouple host cycles from target cycles (see the sketch below)
  - Full RTL still modeled, so timing accuracy still guaranteed
[Figure: a target-system register file with 4 read ports (R1-R4, Rd1-Rd4) and 2 write ports (W1-W2), and its decoupled host implementation: a 2-read/1-write register file sequenced by an FSM]
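The decoupling in the figure can be sketched in software. In this minimal Python sketch (hypothetical; class and method names invented, and not RAMP Gold's actual RTL), one target cycle of a 4-read/2-write register file is modeled by several host cycles against a 2-read/1-write host RAM, and the simulator counts target cycles, so target timing stays exact:

```python
class DecoupledRegfile:
    def __init__(self, nregs=32):
        self.ram = [0] * nregs   # host block RAM: 2 read ports, 1 write port
        self.host_cycles = 0     # host cycles spent so far
        self.target_cycles = 0   # target cycles modeled so far

    def target_cycle(self, reads, writes):
        """Model one target cycle: up to 4 reads and 2 writes."""
        assert len(reads) <= 4 and len(writes) <= 2
        host = 0
        results = {}
        # All reads are scheduled before all writes, so every read
        # observes the register state at the start of the target cycle,
        # preserving target semantics exactly.
        for i in range(0, len(reads), 2):       # 2 reads per host cycle
            for r in reads[i:i + 2]:
                results[r] = self.ram[r]
            host += 1
        for reg, val in writes:                 # 1 write per host cycle
            self.ram[reg] = val
            host += 1
        # A real FSM would overlap reads and writes where no hazard
        # exists; this sketch serializes them for clarity.
        self.host_cycles += max(host, 1)
        self.target_cycles += 1
        return results

rf = DecoupledRegfile()
rf.target_cycle(reads=[1, 2, 3, 4], writes=[(5, 42), (6, 7)])
print(rf.host_cycles, rf.target_cycles)  # 4 host cycles, 1 target cycle
```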

FAME Dimension 2: Full RTL vs. Abstract RTL
- Decoupled FAME models the full RTL of the target machine
  - Don't have full RTL in the initial design phase
  - Full RTL is too much work for design space exploration
- Abstract FAME: model the target RTL at a high level
  - For example, split timing and functional models (à la SAME; see the sketch below)
  - Also enables runtime parameterization: run different simulations without re-synthesizing the design
- Advantages of Abstract FAME come at a cost: model verification
  - Timing of the abstract model is not guaranteed to match the target machine
[Figure: target RTL abstracted into separate functional and timing models]
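A toy illustration of the timing/functional split (hypothetical Python; names invented, memory system and pipelines omitted): the functional model guarantees correct execution, while a separately parameterized timing model charges target cycles, so a new memory latency is a runtime knob rather than a re-synthesis.

```python
class FunctionalModel:
    """Executes target instructions correctly; has no notion of time."""
    def __init__(self):
        self.regs = [0] * 32

    def execute(self, instr):
        op, rd, rs1, rs2 = instr
        if op == "add":
            self.regs[rd] = self.regs[rs1] + self.regs[rs2]
        elif op == "load":
            self.regs[rd] = 0   # memory system omitted in this sketch
        return op

class TimingModel:
    """Decides how many target cycles each operation costs.

    Latencies are runtime parameters, so exploring a different
    memory system needs no re-synthesis of the functional model."""
    def __init__(self, load_latency=20):
        self.load_latency = load_latency
        self.target_cycles = 0

    def account(self, op):
        self.target_cycles += self.load_latency if op == "load" else 1

func = FunctionalModel()
timing = TimingModel(load_latency=100)   # try a slower memory system
for instr in [("add", 1, 2, 3), ("load", 4, 1, 0)]:
    timing.account(func.execute(instr))
print(timing.target_cycles)  # 101
```

The verification cost the slide mentions shows up here too: nothing forces this timing model to match a real machine, so it must be validated separately.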

FAME Dimension 3: Single- or Multi-threaded Host
- Problem: can't fit a big manycore on an FPGA, even abstracted
- Problem: long host latencies reduce utilization
- Solution: host multithreading, a single hardware pipeline with multiple copies of CPU state (see the sketch below)
[Figure: a 4-CPU target model mapped onto a multithreaded emulation engine on the FPGA, with per-CPU PCs and register files feeding one pipeline (I$, IR, GPRs, D$)]
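A sketch of host multithreading (again hypothetical Python standing in for the FPGA pipeline): per-core state lives in arrays, and each host cycle a round-robin selector advances a different target core, so one core's long host-memory latency overlaps with useful work for the others.

```python
class MultithreadedEngine:
    """One host pipeline, N copies of target-CPU state."""
    def __init__(self, ncores, program):
        self.pcs = [0] * ncores                        # per-core PC
        self.regs = [[0] * 32 for _ in range(ncores)]  # per-core registers
        self.program = program                         # shared target binary
        self.ncores = ncores
        self.next_core = 0                             # round-robin selector

    def host_cycle(self):
        """Advance one target core by one instruction."""
        core = self.next_core
        self.next_core = (core + 1) % self.ncores      # rotate threads
        if self.pcs[core] < len(self.program):
            op = self.program[self.pcs[core]]
            # ... decode and execute op against self.regs[core] ...
            self.pcs[core] += 1

engine = MultithreadedEngine(ncores=64, program=["nop"] * 1000)
for _ in range(64 * 1000):       # 64 target cores x 1000 instructions
    engine.host_cycle()
print(engine.pcs[0], engine.pcs[63])  # both cores reach PC 1000
```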

RAMP Gold: A Multithreaded FAME Simulator
- Rapid, accurate simulation of manycore architectural ideas using FPGAs
- Initial version models 64 cores of SPARC V8 with a shared memory system on a $750 board
- Hardware FPU, MMU; boots an OS
- FAME 1 => BEE3; FAME 7 => XUP

                    Cost             Performance (MIPS)    Simulations per day
Simics (SAME)       $2,000           …                     …
RAMP Gold (FAME)    $2,000 + $750    …                     …

RAMP Gold Performance
- FAME (RAMP Gold) vs. SAME (Simics) performance
- PARSEC parallel benchmarks, large input sets
- >250x faster than the full-system simulator for a 64-core target system

Researcher Productivity is Inversely Proportional to Latency
- Simulation latency is even more important than throughput (for OS/architecture studies)
  - How long before the experimenter gets feedback?
  - How many experimenter-days are wasted if there was an error in the experimental setup?

        Median Latency (days)    Maximum Latency (days)
FAME    0.04 (~1 hour)           0.12 (~3 hours)
SAME    …                        …

FAME Conclusion
- This is research, not product development: we often end up in a different place than expected
- Eventually delivered on the original inspiration:
  1. "How do we get 1000-CPU systems into the hands of researchers to innovate in a timely fashion in algorithms, compilers, languages, OS, architectures, …?"
  2. "Can we avoid waiting years between HW/SW iterations?"
- Need to simulate trillions of instructions to figure out how best to transition the whole IT technology base to parallelism

Potential to Accelerate Manycore (Original RAMP Vision)
With RAMP: fast, wide-ranging exploration of HW/SW options + head-to-head competitions to determine winners and losers
- Common artifact for HW and SW researchers => innovate across HW/SW boundaries
- Minutes vs. years between "HW generations"
- Cheap, small, low power => every department owns one
- FTP a supercomputer overnight, check claims locally
- Emulate any manycore => aid to teaching parallelism
- If HP, IBM, Intel, Microsoft, Sun, … had RAMP boxes => easier to carefully evaluate research claims => help technology transfer
Without RAMP: one best shot + Field of Dreams?