1 RAMP Tutorial Introduction/Overview Krste Asanovic UC Berkeley RAMP Tutorial, ASPLOS, Seattle, WA March 2, 2008.

Presentation transcript:

1 RAMP Tutorial Introduction/Overview Krste Asanovic UC Berkeley RAMP Tutorial, ASPLOS, Seattle, WA March 2, 2008

2 Technology Trends: CPU
Microprocessor: Power Wall + Memory Wall + ILP Wall = Brick Wall
⇒ End of uniprocessors and faster clock rates
⇒ Every program(mer) is a parallel program(mer); sequential algorithms are slow algorithms
Since parallel is more power efficient (W ≈ CV²F), the new “Moore’s Law” is 2X processors or “cores” per socket every 2 years, at the same clock frequency
⇒ Conservative: cores, cores, cores for embedded, desktop, & server
⇒ Sea change for HW and SW industries, since it changes the programmer model and responsibilities
⇒ HW/SW industries bet the farm that parallel is successful
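The power argument behind W ≈ CV²F can be made concrete with a little arithmetic. The sketch below assumes, purely for illustration, that supply voltage can be scaled down in proportion to clock frequency and that the workload parallelizes perfectly. Neither assumption comes from the slides, and the function name `dynamic_power` is hypothetical.

```python
def dynamic_power(capacitance, voltage, frequency):
    """Dynamic power from the slide's approximation W ≈ C * V^2 * F."""
    return capacitance * voltage ** 2 * frequency

# Baseline: one core with normalized C = V = F = 1.
one_fast_core = dynamic_power(1.0, 1.0, 1.0)

# Assumption (not from the slides): halving F allows V to be halved too,
# and two half-speed cores match the throughput of one full-speed core.
two_slow_cores = 2 * dynamic_power(1.0, 0.5, 0.5)

print(one_fast_core)   # 1.0
print(two_slow_cores)  # 0.25
```

Under these idealized assumptions the two slower cores deliver the same throughput at a quarter of the power. Real voltage scaling is less generous, so the figure is an upper bound on the benefit.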

3 Problems with “Manycore” Sea Change
1. Algorithms, programming languages, compilers, operating systems, architectures, libraries, … are not ready for 1000 CPUs / chip
2. ≈ Only companies can build HW, and it takes years
3. Software people don’t start working hard until hardware arrives
   3 months after HW arrives, SW people list everything that must be fixed, then we all wait 4 years for the next iteration of HW/SW
4. How do we get 1000-CPU systems into the hands of researchers so they can innovate in a timely fashion in algorithms, compilers, languages, OS, architectures, …?
5. Can we avoid waiting years between HW/SW iterations?

4 Vision: Build Research MPP from FPGAs
“Research Accelerator for Multiple Processors”
As ≈ 16 CPUs will fit in a Field Programmable Gate Array (FPGA), 1000-CPU system from ≈ 64 FPGAs?
- 8 32-bit simple “soft core” RISC at 100 MHz in 2004 (Virtex-II)
- FPGA generations every 1.5 yrs; ≈ 2X CPUs, ≈ 1.2X clock rate
HW research community does logic design (“gate shareware”) to create out-of-the-box MPP
⇒ E.g., 1000 processor, standard ISA binary-compatible, 64-bit, cache-coherent, ≈ 150 MHz/CPU in 2007
- 6 universities, 10 faculty
- 3rd party sells RAMP 2.0 (BEE3) hardware at low cost
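The scaling claims on this slide can be checked with a short projection. This is a rough sketch that compounds the slide's per-generation factors from the 2004 data point. The function names and the choice to count only whole generations are assumptions made for illustration.

```python
import math

CPUS_PER_FPGA_2004 = 8      # slide: 8 simple 32-bit soft cores (Virtex-II)
CLOCK_MHZ_2004 = 100.0      # slide: 100 MHz in 2004
YEARS_PER_GENERATION = 1.5  # slide: FPGA generations every 1.5 years
CPU_GROWTH = 2.0            # slide: about 2X CPUs per generation
CLOCK_GROWTH = 1.2          # slide: about 1.2X clock rate per generation


def project(year):
    """Return (cpus_per_fpga, clock_mhz) projected for the given year.

    Assumption: growth compounds once per whole generation after 2004.
    """
    generations = int((year - 2004) // YEARS_PER_GENERATION)
    cpus = CPUS_PER_FPGA_2004 * CPU_GROWTH ** generations
    clock = CLOCK_MHZ_2004 * CLOCK_GROWTH ** generations
    return int(cpus), clock


def fpgas_needed(target_cpus, cpus_per_fpga):
    """Smallest number of FPGAs that holds the requested CPU count."""
    return math.ceil(target_cpus / cpus_per_fpga)


cpus_2007, clock_2007 = project(2007)
print(cpus_2007, round(clock_2007))
print(fpgas_needed(1000, 16))
```

Two generations after 2004 the projection gives 32 simple cores at roughly 144 MHz, close to the slide's ≈ 150 MHz. At the slide's ≈ 16 CPUs per FPGA, 1000 CPUs need 63 FPGAs, which is in line with the ≈ 64 quoted.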

5 Why RAMP Good for Research MPP?
Columns, in order: SMP, Cluster, Custom, Simulate, RAMP
Scalability (1k): C, A, A, A, A
Cost (1k CPUs): F ($20M), C ($1M), F ($3M), A+ ($0M), A ($0.1M)
Cost to own: A, D, A, A, A
Power/Space (kilowatts, racks): D (120 kw, 6 racks), A (100 kw, 3 racks), A+ (.1 kw, 0.1 racks), A (1.5 kw, 0.3 racks)
Community: D, A, F, A, A
Observability: D, C, D, A+
Reproducibility: B, D, B, A+
Reconfigurability: D, C, D, A+
Credibility: A+, A-, F, B
Perform. (clock): A (2 GHz), A (3 GHz), B (.4 GHz), F (0 GHz), C (.1 GHz)
GPA: C, B-, C+, B, A-

6 Partnerships
Co-PIs: Krste Asanović (UCB), Derek Chiou (UT Austin), Joel Emer (MIT/Intel), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley), and John Wawrzynek (Berkeley)
RAMP hardware development activity centered at Berkeley Wireless Research Center.
Three-year NSF grant for staff (awarded 3/06).
GSRC (Jan Rabaey) has paid partial staff and some students.
Major continuing commitment from Xilinx.
Collaboration with MSR (Chuck Thacker) on BEE3 FPGA platform.
Sun, IBM contributing processor designs, IBM faculty awards.
High-speed high-confidence emulation is widely recognized as a necessary component of multiprocessor research and development. FPGA emulation is the only practical approach.

7 BEE3, 1st prototype 11/07
New RAMP systems to be based on Berkeley Emulation Engine version 3 (BEE3).
BEE3 design: Chuck Thacker; Chen Chang, UC Berkeley
BEECube, Inc.
- (UC Berkeley spinout startup company)
- To provide manufacturing, distribution, and support to commercial and academic users.
- General availability 2Q08
For small-scale design, or to get started, use Xilinx ML505

8 RAMP: An infrastructure to build simulators using FPGAs

9 Run Target Model on Host Platform
[Figure labels: Target Model; Host Platform; CPU; Interconnect Network; DRAM; Hard Work]

10 Reduce, Reuse, Recycle
Reduce effort to build target models
- Users just build components (units), infrastructure handles connections (the RDL Compiler)
Reuse units by having good abstractions
- Across different target models
- Across different host platforms: XUP, Calinx, BEE2, BEE3, ML505, also Altera platforms
Recycle existing IP for use as simulation models
- Commercial processor RTL is (almost) its own model

11 RAMP Target Model
Units
- Relatively large chunks of functionality, e.g., processor + L1 cache
- User-written in some HDL or software
Channels
- Point-to-point, unidirectional, two kinds:
  - FIFO channel: flow-controlled interface
  - Pipeline channel: simple shift register, bits drop off the end
- Generated by RAMP infrastructure
[Figure: Unit A, Unit B, and Unit C connected by a FIFO channel and a pipeline channel]
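The difference between the two channel kinds can be illustrated with a small software model. This is a minimal sketch, not RAMP's generated hardware. The class and method names are hypothetical, and one `tick` call stands in for one target cycle.

```python
from collections import deque


class FifoChannel:
    """Flow-controlled channel: the sender is refused when the buffer is full."""

    def __init__(self, depth):
        self.depth = depth
        self.buffer = deque()

    def send(self, message):
        if len(self.buffer) >= self.depth:
            return False  # back-pressure: sender must retry later
        self.buffer.append(message)
        return True

    def receive(self):
        return self.buffer.popleft() if self.buffer else None


class PipelineChannel:
    """Fixed-latency shift register: data that is not read drops off the end."""

    def __init__(self, latency):
        # One slot per cycle of latency, initially holding "no message".
        self.stages = deque([None] * latency, maxlen=latency)

    def tick(self, message=None):
        """Advance one cycle: shift in `message`, return what falls out."""
        leaving = self.stages[0]
        self.stages.append(message)  # maxlen discards the oldest slot
        return leaving
```

A value pushed into a `PipelineChannel(2)` comes out exactly two ticks later whether or not the receiver wants it, while a `FifoChannel` holds messages until they are consumed and pushes back on the sender when full.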

12 Target Pipeline Channel Parameters
[Figure: pipeline channel annotated with its parameters, forward latency and datawidth D]

13 RAMP Description Language (RDL)
User describes target model topology, channel parameters, and (manual) mapping to host platform FPGAs using RDL
RDL Compiler (RDLC) generates configurations
[Figure: Target: Unit A, Unit B, Unit C. RDLC maps them to Host: FPGA1 and FPGA2, with generated unit wrappers and generated links that carry channels]
[ Greg Gibeling, UCB ]

14 Virtual Target Clock

15 Virtualized RTL Improves FPGA Resource Usage
RAMP allows units to run at varying target-host clock ratios to optimize area and overall performance
Example 1: Multiported register file
- Example: Sun Niagara has 3 read ports and 2 write ports to 6KB of register storage
- If RTL is mapped directly, requires 48K flip-flops: slow cycle time, large area
- If mapped into block RAMs (one read + one write per cycle), takes 3 host cycles and 3x2KB block RAMs: faster cycle time (~3X) and far fewer resources
Example 2: Large L2/L3 caches
- Current FPGAs only have ~1MB of on-chip SRAM
- Use on-chip SRAM to build a cache of the active piece of the L2/L3 cache; stall the target cycle if an access misses and fetch the data from off-chip DRAM
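The numbers in both examples follow from simple arithmetic, sketched below. The helper names are hypothetical, and the LRU replacement policy and fixed DRAM penalty in the cache model are assumptions chosen for illustration. The slide does not specify either.

```python
import math
from collections import OrderedDict


def flipflops_if_mapped_directly(storage_bytes):
    """One flip-flop per bit of register storage."""
    return storage_bytes * 8


def host_cycles_per_target_cycle(read_ports, write_ports,
                                 bram_reads=1, bram_writes=1):
    """Host cycles needed to time-multiplex the ports onto a block RAM."""
    return max(math.ceil(read_ports / bram_reads),
               math.ceil(write_ports / bram_writes))


class HostCachedTargetCache:
    """Large target cache kept in DRAM, with hot lines held in on-chip SRAM.

    Assumptions: LRU replacement and a fixed DRAM penalty in host cycles.
    """

    def __init__(self, onchip_lines, dram_penalty_host_cycles):
        self.onchip = OrderedDict()
        self.capacity = onchip_lines
        self.penalty = dram_penalty_host_cycles

    def access(self, line_addr):
        """Return host cycles used; the target sees one access either way."""
        if line_addr in self.onchip:
            self.onchip.move_to_end(line_addr)  # mark as most recently used
            return 1
        if len(self.onchip) >= self.capacity:
            self.onchip.popitem(last=False)     # evict least recently used
        self.onchip[line_addr] = True
        return 1 + self.penalty                 # target cycle is stalled


print(flipflops_if_mapped_directly(6 * 1024))  # 49152, i.e. 48K
print(host_cycles_per_target_cycle(3, 2))      # 3
```

In both cases the target timing is unchanged. Only the number of host cycles spent per target cycle varies, which is what the virtual target clock makes possible.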

16 Start/Done Timing Interface
Wrapper generated by RDL asserts “Start” on the physical FPGA cycle when the inputs to the unit are ready for the next target cycle
Unit asserts “Done” when it finishes the target cycle and its outputs are ready
Unit can take a variable amount of time
Unvirtualized RTL unit can connect “Done” to “Start” (but must not clock until “Start”)
[Figure: Wrapper around a Unit with inputs In1 and In2, output Out, and Start/Done signals]
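The handshake described on this slide can be modeled in software as follows. This is an illustrative sketch only. The class and function names, the use of queues for the wrapper's inputs, and the unit's behavior (summing its two inputs) are all hypothetical.

```python
from collections import deque


class VariableLatencyUnit:
    """Unit that needs a data-dependent number of host cycles per target cycle."""

    def __init__(self, latency_fn):
        self.latency_fn = latency_fn  # hypothetical: inputs -> host cycles (>= 1)
        self.remaining = 0
        self.result = None

    def host_tick(self, start, inputs=None):
        """Advance one host cycle; return (done, outputs)."""
        if start:
            self.remaining = self.latency_fn(inputs)
            self.result = sum(inputs)  # placeholder for the unit's real work
        if self.remaining > 0:
            self.remaining -= 1
            if self.remaining == 0:
                return True, self.result  # Done: outputs are ready
        return False, None


def run_wrapper(unit, in1, in2):
    """Assert Start only when both inputs hold data and the unit is idle."""
    in1, in2 = deque(in1), deque(in2)
    outputs, host_cycles, idle = [], 0, True
    while (in1 and in2) or not idle:
        start = idle and bool(in1) and bool(in2)
        args = (in1.popleft(), in2.popleft()) if start else None
        done, out = unit.host_tick(start, args)
        host_cycles += 1
        if start:
            idle = False
        if done:
            outputs.append(out)
            idle = True
    return outputs, host_cycles


unit = VariableLatencyUnit(lambda args: 1 + args[0] % 3)
print(run_wrapper(unit, [0, 1, 2], [10, 10, 10]))  # ([10, 11, 12], 6)
```

Three target cycles complete in six host cycles here because the unit takes one, two, and three host cycles respectively. The outputs are the same regardless of how long each target cycle takes on the host.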

17 Distributed Timing Models

18 Distributed Timing Example
[Figure: Target: Unit A connected to Unit B by a channel of latency L and datawidth D, with RDY signals. Host: Unit A and Unit B each with Start/Done, connected through ENQ and DEQ signals]
Pipeline target channel implemented as distributed FIFO with at least L buffers
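One way to see why a FIFO with at least L buffers can stand in for a latency-L pipeline channel is the sketch below. It assumes the FIFO is primed with L empty tokens, so the receiver can run its first L target cycles before any real data arrives. That priming scheme, the class name, and the host schedules are assumptions made for illustration and are not taken from the slides.

```python
from collections import deque


class TimedChannel:
    """Latency-L target channel implemented as a host FIFO of tokens."""

    def __init__(self, latency, capacity=None):
        self.capacity = latency if capacity is None else capacity
        assert self.capacity >= latency, "need at least L buffers"
        # Assumption: primed with L empty tokens, one per cycle of latency.
        self.fifo = deque([None] * latency)

    def can_enq(self):
        return len(self.fifo) < self.capacity

    def can_deq(self):
        return len(self.fifo) > 0


def simulate(latency, target_cycles, a_period, b_period):
    """Unit A sends its target cycle number; Unit B records what it sees.

    Each unit gets a host slot every `period` host cycles, and advances one
    target cycle only when its side of the channel is ready.
    """
    channel = TimedChannel(latency)
    a_cycle, received, host = 0, [], 0
    while len(received) < target_cycles:
        if host % a_period == 0 and channel.can_enq():
            channel.fifo.append(a_cycle)
            a_cycle += 1
        if host % b_period == 0 and channel.can_deq():
            received.append(channel.fifo.popleft())
        host += 1
    return received


print(simulate(2, 5, a_period=1, b_period=3))  # [None, None, 0, 1, 2]
print(simulate(2, 5, a_period=3, b_period=1))  # [None, None, 0, 1, 2]
```

Both host schedules produce the same target-level behavior: the value sent in target cycle t arrives in target cycle t + L, even though the two units advance at different host rates.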

19 Other Automatically Generated Networks
Control network has workstation as master and every unit as slave device
- Memory-mapped interface with block transfers
- Used for initialization, stats gathering, debugging, and monitoring
Units can connect to DRAM resources outside of timed target channels
- Used to support emulation and virtualization state
Units can communicate with each other outside of timed target channels
- Support arbitrary communication, e.g., for distributed stats gathering

20 Wide Variety of RAMP Simulators

21 Simulator Design Choices
Structural Analog versus Highly Virtualized
Functional-only versus Functional+Timing
Timing via (virtual) RTL design versus separate functional and timing models
Hybrid software/hardware simulators

22 Host Multithreading (Zhangxi Tan (UCB), Chung (CMU))
Multithreading emulation engine reduces FPGA resource use and improves emulator throughput
Hides emulation latencies (e.g., communicating across FPGAs)
Single hardware pipeline with multiple copies of CPU state
[Figure: Target Model (CPU 1, CPU 2, CPU 3, CPU 4); Multithreaded Host Emulation Engine (on FPGA) with PC, I$, IR, GPR, X, Y, D$]
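The idea of one pipeline serving several target CPUs can be sketched in software. The example below is a toy model under stated assumptions: the one-instruction ISA (`addi`), the round-robin thread selection, and all class names are hypothetical, chosen only to show a single execution engine cycling through multiple copies of CPU state.

```python
class TargetCpuState:
    """Per-target-CPU architectural state: a PC and a few registers."""

    def __init__(self, cpu_id, num_regs=4):
        self.pc = 0
        self.regs = [0] * num_regs
        self.regs[0] = cpu_id  # r0 holds the CPU id in this toy model


class MultithreadedHostEngine:
    """One execution pipeline shared by all target CPUs, round-robin."""

    def __init__(self, program, num_target_cpus):
        self.program = program
        self.states = [TargetCpuState(i) for i in range(num_target_cpus)]
        self.next_thread = 0
        self.host_cycles = 0

    def host_cycle(self):
        """Execute one instruction for the next target CPU in turn."""
        state = self.states[self.next_thread]
        if state.pc < len(self.program):
            op, rd, rs, imm = self.program[state.pc]
            if op != "addi":
                raise ValueError("unknown op: " + op)
            state.regs[rd] = state.regs[rs] + imm
            state.pc += 1
        self.next_thread = (self.next_thread + 1) % len(self.states)
        self.host_cycles += 1


program = [("addi", 1, 0, 10), ("addi", 1, 1, 1)]
engine = MultithreadedHostEngine(program, num_target_cpus=4)
for _ in range(8):
    engine.host_cycle()
print([s.regs[1] for s in engine.states])  # [11, 12, 13, 14]
```

Eight host cycles advance each of the four target CPUs by two instructions. Because consecutive host cycles belong to different target CPUs, a long-latency event for one CPU can overlap with useful work for the others.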

23 Split Functional/Timing Models (HASIM: Emer (MIT/Intel); FAST: Chiou (UT Austin))
Functional model executes CPU ISA correctly, no timing information
- Only need to develop functional model once for each ISA
Timing model captures pipeline timing details, does not need to execute code
- Much easier to change timing model for architectural experimentation
- Without RTL design, cannot be 100% certain that timing is accurate
Many possible splits between timing and functional model
[Figure: Functional Model connected to Timing Model]
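A toy version of the split shows why the timing model is cheap to change. This sketch is not HASIM or FAST. The four-instruction ISA, the per-instruction latency table, and all names are assumptions made for illustration, and it shows only one of the many possible splits.

```python
class FunctionalModel:
    """Executes instructions correctly; knows nothing about time."""

    def __init__(self):
        self.regs = {}
        self.memory = {}

    def execute(self, instr):
        """Run one instruction and return its kind for the timing model."""
        kind = instr[0]
        if kind == "li":
            _, rd, value = instr
            self.regs[rd] = value
        elif kind == "add":
            _, rd, ra, rb = instr
            self.regs[rd] = self.regs[ra] + self.regs[rb]
        elif kind == "store":
            _, rs, addr = instr
            self.memory[addr] = self.regs[rs]
        elif kind == "load":
            _, rd, addr = instr
            self.regs[rd] = self.memory[addr]
        else:
            raise ValueError("unknown instruction: " + kind)
        return kind


class TimingModel:
    """Counts target cycles from instruction kinds; executes no code."""

    def __init__(self, latencies):
        self.latencies = latencies
        self.target_cycles = 0

    def account(self, kind):
        self.target_cycles += self.latencies[kind]


def simulate(program, latencies):
    functional, timing = FunctionalModel(), TimingModel(latencies)
    for instr in program:
        timing.account(functional.execute(instr))
    return functional.regs, timing.target_cycles


program = [("li", "r1", 2), ("li", "r2", 3), ("add", "r3", "r1", "r2"),
           ("store", "r3", 0x10), ("load", "r4", 0x10)]
fast = {"li": 1, "add": 1, "store": 1, "load": 2}
slow = {"li": 1, "add": 1, "store": 5, "load": 10}
print(simulate(program, fast)[1], simulate(program, slow)[1])  # 6 18
```

Swapping the latency table changes the cycle count from 6 to 18 while the architectural results stay identical, which mirrors the slide's point that the functional model is written once per ISA and the timing model is what gets varied.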

24 Multithreaded Func. & Timing Models (RAMP Gold: Tan, Gibeling, Asanovic, UCB)
MT-Unit multiplexes multiple target units on a single host engine
MT-Channel multiplexes multiple target channels over a single host link
[Figure: MT-Unit containing a Functional Model Pipeline with Arch State and a Timing Model Pipeline with Timing State, connected by MT-Channels]
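The channel multiplexing idea can be sketched as several logical channels sharing one physical link. The example below assumes each message is tagged with its target channel id and that the host link carries one message per host cycle. Both are illustrative assumptions, as is the class name, and none of this is taken from the RAMP Gold implementation.

```python
from collections import deque


class MTChannel:
    """Several target channels multiplexed over a single host link."""

    def __init__(self, num_target_channels):
        self.host_link = deque()  # the one shared physical link
        self.inboxes = [deque() for _ in range(num_target_channels)]

    def send(self, channel_id, message):
        """Queue a message, tagged with the target channel it belongs to."""
        self.host_link.append((channel_id, message))

    def host_cycle(self):
        """Carry at most one tagged message across the host link."""
        if self.host_link:
            channel_id, message = self.host_link.popleft()
            self.inboxes[channel_id].append(message)

    def receive(self, channel_id):
        inbox = self.inboxes[channel_id]
        return inbox.popleft() if inbox else None


link = MTChannel(num_target_channels=2)
link.send(0, "a")
link.send(1, "b")
link.send(0, "c")
for _ in range(3):
    link.host_cycle()
print(link.receive(0), link.receive(0), link.receive(1))  # a c b
```

Each target channel still sees its own messages in order. The cost of sharing is that the link needs one host cycle per message, so more target channels mean more host cycles per target cycle.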

25 Schedule
9:00-9:45 Welcome/Overview
9:45-10:15 RAMP Blue Overview & Demo
10:15-10:45 Break
10:45-12:30 RAMP White Live Demo; BEE3 Rollout (MSR/BEEcube/Q&A)
12:30-13:30 Lunch
13:30-15:00 ATLAS Transactional Memory (RAMP Red)
15:00-15:15 Break
15:15-16:45 CMU Simics/RAMP Cache Study
16:45 Wrapup

26 RAMP Blue Release 2/25/ design available from RAMP website - ramp.eecs.berkeley.edu

27 RAMP White
Hari Angepat, Derek Chiou (UT Austin)
Scalable Coherent Shared Memory Multiprocessor
Support standard shared memory programming models
[Figure labels: Leon3 (Mst, Slv, Dbg, Int); Leon3 shim; AHB bus; AHB shim; Intersection Unit; NIU; Router; MP IntCntrl; DSU; Eth; DDR2]

28

29 CMU Simics/RAMP Simulator 16-CPU Shared-memory UltraSPARC III Server (SunFire 3800) BEE2 Platform

30 RAMP Home Page/Repository ramp.eecs.berkeley.edu Remotely accessible subversion repository

31 Thank You! Questions?