HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing
Michael Adler, Elliott Fleming, Michael Pellauer, Joel Emer

1 Simulating Multicores
Simulating an N-core target is fundamentally N times the work, plus the on-chip network.
Duplicating the cores will quickly fill the FPGA; going multi-FPGA will slow simulation.
(Diagram: CPU cores connected by a network)

2 Trading Time for Space
We can leverage the separation of the model clock and the FPGA clock to save space.
Two techniques: serialization and time-multiplexing.
But doesn't this just slow down our simulator? The tradeoff is a good idea if we can:
- save a lot of space
- improve the FPGA critical path
- improve utilization
- slow down rare events while keeping common events fast
The LI (latency-insensitive) approach enables a wide range of tradeoff options.

3 Serialization: A First Tradeoff

4 Example Tradeoff: Multi-Port Register File
2 read ports, 2 write ports; 5-bit index, 32-bit data; reads take zero model clock cycles.
On a Virtex-2Pro FPGA: 9242 slices (>25%), 104 MHz.
(Diagram: a 2R/2W register file with ports rd addr 1/2, wr addr 1 + wr val 1, wr addr 2 + wr val 2, rd val 1/2)

5 Trading Time for Space
Simulate the circuit sequentially using a BlockRAM: 94 slices (<1%), 1 BlockRAM, 224 MHz (2.2x).
The simulation rate is 224 / 3 = 75 MHz, since each model cycle now takes 3 FPGA cycles.
(Diagram: an FSM sequencing the same read/write ports through a 1R/1W BlockRAM)
Each module may have a different FPGA-cycle to Model-cycle Ratio (FMR).
A-Ports allow us to connect many such modules together while maintaining a consistent notion of model time.
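The serialization idea can be sketched in software: a 2R/2W register file is emulated on a single-ported memory by stepping through several FPGA cycles per model cycle. This is an illustrative Python sketch, not HAsim code; the exact three-cycle schedule (read, read+write, write) and all names are assumptions.

```python
class SerializedRegFile:
    """Emulate a 2R/2W register file on a 1R/1W memory.

    One model cycle takes 3 FPGA cycles (FMR = 3); the schedule below
    is an illustrative assumption that respects the 1R/1W port limit.
    """
    def __init__(self, size=32):
        self.mem = [0] * size     # the single 1R/1W BlockRAM
        self.fpga_cycles = 0
        self.model_cycles = 0

    def model_cycle(self, rd1, rd2, wr1=None, wr2=None):
        # wr1/wr2 are optional (addr, value) pairs
        val1 = self.mem[rd1]                    # FPGA cycle 0: first read
        self.fpga_cycles += 1
        val2 = self.mem[rd2]                    # FPGA cycle 1: read + write
        if wr1 is not None:
            self.mem[wr1[0]] = wr1[1]
        self.fpga_cycles += 1
        if wr2 is not None:                     # FPGA cycle 2: second write
            self.mem[wr2[0]] = wr2[1]
        self.fpga_cycles += 1
        self.model_cycles += 1
        return val1, val2

rf = SerializedRegFile()
rf.model_cycle(0, 1, wr1=(0, 42), wr2=(1, 7))
v1, v2 = rf.model_cycle(0, 1)
fmr = rf.fpga_cycles / rf.model_cycles
print(v1, v2, fmr)            # 42 7 3.0
print(224 / fmr)              # ~75 MHz model rate from a 224 MHz FPGA clock
```

Note that reads observe the values written on the previous model cycle, matching the zero-model-cycle read semantics of the original circuit.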

6 Example: Inorder Front End
(Diagram: FET, Branch Pred, Line Pred, IMEM, PC Resolve, Inst Q, I$, ITLB, connected by A-Ports of latency 0 or 1; redirect, mispredict, training, and fault paths to and from the Back End. Legend: Ready to simulate? Yes / No; here FET and part of IMEM are ready.)
Modules may simulate at any wall-clock rate.
Corollary: adjacent modules may not be simulating the same model cycle.

7 Simulator "Slip"
Adjacent modules may be simulating different model cycles!
In the paper: a distributed resynchronization scheme.
This can actually speed up simulation.
Case study: achieved 17% better performance than a centralized controller.
Performance can approach the dynamic average. Let's see how...

8 Traditional Software Simulation
(Table: wall-clock time vs. pipeline stages FET DEC EXE MEM WB; instructions and NOPs advance one model cycle at a time, with the simulator stepping every stage before starting the next model cycle)

9 Global Controller "Barrier" Synchronization
(Table: FPGA cycles vs. stages FET DEC EXE MEM WB; all modules advance together, so a single slow module stalls every stage for that model cycle)

10 A-Ports Distributed Synchronization
(Table: FPGA cycles vs. stages FET DEC EXE MEM WB; stages advance independently, stopping only when a port's buffering fills)
Long-running operations can overlap, even when they occur on different FPGA cycles.
Modules run ahead in time until their buffering fills.
Takeaway: LI makes serialization tradeoffs more appealing.
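The run-ahead behavior can be illustrated with a toy A-Port model: each port is a bounded FIFO pre-filled with initial tokens equal to its model latency, and a module fires whenever every input has data and every output has space. The names, buffer depths, and port latencies below are illustrative assumptions, not the HAsim implementation.

```python
from collections import deque

class APort:
    """Toy A-Port: a bounded FIFO pre-filled with `latency` tokens."""
    def __init__(self, latency, depth=4):
        self.q = deque([None] * latency)   # initial tokens = model latency
        self.depth = depth
    def can_enq(self): return len(self.q) < self.depth
    def can_deq(self): return len(self.q) > 0

class Module:
    def __init__(self, name, ins, outs):
        self.name, self.ins, self.outs = name, ins, outs
        self.model_cycle = 0
    def try_step(self):
        # fire only when every input has data and every output has space
        if all(p.can_deq() for p in self.ins) and all(p.can_enq() for p in self.outs):
            for p in self.ins:
                p.q.popleft()
            for p in self.outs:
                p.q.append(self.model_cycle)
            self.model_cycle += 1
            return True
        return False

# FET <-> DEC loop; the redirect port carries 3 initial tokens, so FET can
# run ahead of a stalled DEC by exactly 3 model cycles before blocking.
f2d = APort(latency=1)
d2f = APort(latency=3)
fet = Module("FET", ins=[d2f], outs=[f2d])
dec = Module("DEC", ins=[f2d], outs=[d2f])

for _ in range(6):        # DEC never fires: FET slips ahead, then stalls
    fet.try_step()
print(fet.model_cycle, dec.model_cycle)   # 3 0
```

No global controller is involved: FET's run-ahead is bounded purely by the buffering on its ports.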

11 Leveraging Latency-Insensitivity
Modeling large caches: the L2$ cache controller uses a 256 KB LEAP scratchpad, backed by a hierarchy of BRAM (KBs, 1 FPGA cycle), SRAM (MBs, 10s of cycles), and system memory (GBs, 100s of cycles).
Expensive instructions: the CPU's EXE stage forwards FPU and other rare operations over RRR to the LEAP instruction emulator (M5). [With Parashar, Adler]

12 Time-Multiplexing: A Tradeoff to Scale Multicores (resume at 3:45)

13 Multicores Revisited
What if we just duplicate the cores?
Benefits: simple to describe; maximum parallelism.
Drawbacks: probably won't fit; low utilization of functional units.
(Diagram: CORE 0, CORE 1, CORE 2, each with its own state)

14 Module Utilization
A module is unutilized on an FPGA cycle if it is:
- waiting for all input ports to become non-empty, or
- waiting for all output ports to become non-full.
Case study: in-order functional units were utilized on only 13% of FPGA cycles, on average.

15 Time-Multiplexing: First Approach
Duplicate the state, sequentially share the logic.
Benefits: better unit utilization.
Drawbacks: more expensive than duplication(!)
(Diagram: one physical pipeline, with multiplexors selecting among the virtual instances' state)

16 Round-Robin Time-Multiplexing
Fix the ordering, remove the multiplexors.
Benefits: much better area; good unit utilization.
Drawbacks: head-of-line blocking may limit performance.
We need to limit the impact of slow events: pipeline at a fine granularity.
We need a distributed, controller-free mechanism to coordinate...

17 Port-Based Time-Multiplexing
Duplicate the local state in each module, and change the port implementation:
- minimum buffering: N * latency + 1
- initialize each FIFO with N * latency tokens
Result: adjacent modules can simultaneously be simulating different virtual instances.
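A time-multiplexed port can be sketched directly from those two rules: for N virtual instances and a port latency of `latency` model cycles, the FIFO holds at least N * latency + 1 entries and starts with N * latency initial tokens, so the producer always runs one instance ahead of the consumer. The sketch below is illustrative; the class and message names are assumptions.

```python
from collections import deque

class MultiplexedPort:
    """Port carrying N virtual instances' traffic in round-robin order."""
    def __init__(self, n_instances, latency):
        self.capacity = n_instances * latency + 1        # minimum buffering
        self.q = deque([None] * (n_instances * latency)) # initial tokens
    def enq(self, msg):
        assert len(self.q) < self.capacity, "port full"
        self.q.append(msg)
    def deq(self):
        assert self.q, "port empty"
        return self.q.popleft()

N, LATENCY = 4, 1
port = MultiplexedPort(N, LATENCY)

first, second = [], []
for inst in range(N):             # producer simulates model cycle 0...
    port.enq(("c0", inst))
    first.append(port.deq())      # ...while the consumer drains tokens
for inst in range(N):             # producer on cycle 1, consumer on cycle 0:
    port.enq(("c1", inst))        # adjacent modules are simultaneously
    second.append(port.deq())     # simulating different instances/cycles
print(first)    # [None, None, None, None]
print(second)   # [('c0', 0), ('c0', 1), ('c0', 2), ('c0', 3)]
```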

18 The Front End, Multiplexed
(Diagram: the same in-order front end, now with multiplexed ports. Legend: Ready to simulate? CPU 1 / CPU 2 / No; here FET is simulating CPU 1 while IMEM simulates CPU 2.)

19 On-Chip Networks in a Time-Multiplexed World

20 Problem: On-Chip Network
(Diagram: CPUs 0..2, each with L1/L2 $ and a router r, connected by msg and credit links to a memory controller)
Problem: routing wires to and from each router is similar to the "global controller" scheme, and utilization is low.

21 Multiplexing On-Chip Network Routers
Routers 0..3 collapse into one multiplexed Router 0..3; each inter-router link becomes a permutation port:
σ(x) = (x + 1) mod 4
σ(x) = (x + 2) mod 4
σ(x) = (x + 3) mod 4
(Diagram: the current instance's to/from ports for each neighbor, with a reorder stage)
Simulate the network without a network.

22 Ring/Double Ring Topology, Multiplexed
A ring needs only "to next" and "from prev" ports, using σ(x) = (x + 1) mod 4.
For the opposite direction, flip to and from.

23 Implementing Permutations on FPGAs Efficiently
Side buffer: fits networks like ring/torus (e.g. σ(x) = (x + 1) mod N); move the first element to the Nth slot, the Nth to the first, or every Kth to slot N-K.
Indirection table: more general, but more expensive (a permutation table in RAM, a buffer, and an FSM).
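Both schemes can be sketched in a few lines: the side buffer realizes a rotation like σ(x) = (x + 1) mod N by holding one element aside while the rest of the stream flows through, while the indirection table stores an arbitrary σ in RAM. This is an illustrative software sketch, not the FPGA implementation; function names are assumptions.

```python
def side_buffer_rotate(stream):
    """Apply sigma(x) = (x + 1) mod N to a stream: output slot sigma(i)
    receives input i, so the element for slot 0 is the last input,
    held in the side buffer while the others pass through."""
    return [stream[-1]] + stream[:-1]

def indirection_permute(stream, table):
    """General permutation via a lookup table: out[table[i]] = in[i].
    Holes (None) in the table mean 'always send NoMessage'."""
    out = [None] * len(stream)          # None models NoMessage
    for i, dst in enumerate(table):
        if dst is not None:
            out[dst] = stream[i]
    return out

msgs = ["m0", "m1", "m2", "m3"]
print(side_buffer_rotate(msgs))                 # ['m3', 'm0', 'm1', 'm2']
print(indirection_permute(msgs, [1, 2, 3, 0]))  # same result via the table
```

The indirection table reproduces the rotation when loaded with σ(x) = (x + 1) mod 4, but it can also express permutations the side buffer cannot, at the cost of the extra RAM and FSM.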

24 Torus/Mesh Topology, Multiplexed
Mesh: don't transmit on non-existent links.

25 Dealing with Heterogeneous Networks
Compose "Mux Ports" with permutation ports.
In the paper: generalized to any topology.

26 Putting It All Together

27 Typical HAsim Model Leveraging These Techniques
Target: a 16-core chip multiprocessor
- 10-stage pipeline (speculative, bypassed): F, BP1, BP2, PCC, IQ, D, X, DM, CQ, C, plus ITLB, I$, DTLB, D$, L/S Q, L2$, Route
- 64-bit Alpha ISA, floating point
- 8 KB lockup-free L1 caches
- 256 KB 4-way set-associative L2 cache
- network: 2 virtual channels, 4 slots, x-y wormhole routing
Implementation:
- a single detailed pipeline, 16-way time-multiplexed
- a 64-bit Alpha functional partition, floating point
- caches modeled with a different cache hierarchy
- a single router, multiplexed, with 4 permutations

28 Time-Multiplexed Multicore Simulation Rate Scaling
(Chart: best, worst, and average FMR, overall and per-core, as the number of simulated cores scales)


32 Takeaways
The latency-insensitive approach provides a unified way to exploit interesting tradeoffs.
Serialization: leverage FPGA-efficient circuits at the cost of FMR.
- A-Port-based synchronization can amortize the cost by giving the dynamic average, especially if long events are rare.
Time-multiplexing: reuse datapaths and duplicate only the state.
- The A-Port-based approach means not all modules are fully utilized; increased utilization means performance degradation is sublinear.
- Time-multiplexing the on-chip network requires permutations.

33 Next Steps
Here we were able to push one FPGA to its limits. What if we want to scale farther?
Next, we'll explore how latency-insensitivity can help us scale to multiple FPGAs with better performance than traditional techniques, and how we can increase designer productivity by abstracting the platform.

35 Resynchronizing Ports
Modules follow a modified scheme: if any incoming port is heavy, or any outgoing port is light, simulate the next cycle (when ready).
Result: balanced, without centralized coordination.
Argument:
- The modules farthest ahead in time will never proceed: ports into (out of) this set will be light (resp. heavy), so those modules may try to proceed but will not be able to.
- There is also a set farthest behind in time, which is always able to proceed.
- Since the graph is connected, simulating only enables more modules, so the system makes progress toward quiescence.
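A toy version of the heavy/light rule, under assumed definitions (a port is "heavy" when it holds more than its initial number of tokens, "light" when it holds fewer; all names and depths are illustrative):

```python
from collections import deque

class Port:
    def __init__(self, init_tokens, depth=8):
        self.init = init_tokens
        self.q = deque([0] * init_tokens)
        self.depth = depth
    def heavy(self): return len(self.q) > self.init
    def light(self): return len(self.q) < self.init

class Module:
    def __init__(self, ins, outs):
        self.ins, self.outs, self.cycle = ins, outs, 0
    def step(self):
        # modified scheme: simulate the next cycle (when ready) only if
        # some incoming port is heavy or some outgoing port is light
        wants = any(p.heavy() for p in self.ins) or any(p.light() for p in self.outs)
        ready = all(p.q for p in self.ins) and all(len(p.q) < p.depth for p in self.outs)
        if wants and ready:
            for p in self.ins:
                p.q.popleft()
            for p in self.outs:
                p.q.append(self.cycle)
            self.cycle += 1

# Two modules in a loop, each port starting balanced with 2 tokens.
a2b, b2a = Port(2), Port(2)
A, B = Module([b2a], [a2b]), Module([a2b], [b2a])

for _ in range(2):     # force A ahead: a2b becomes heavy, b2a light
    b2a.q.popleft(); a2b.q.append(A.cycle); A.cycle += 1

for _ in range(5):     # run the scheme: B catches up, A holds still
    A.step(); B.step()
print(A.cycle, B.cycle)   # 2 2 -- rebalanced without a controller
```

Once the ports return to their initial occupancy, neither module wants to fire: the system quiesces in a balanced state, as the argument above requires.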

36 Other Topologies: Tree, Butterfly

37 Generalizing OCN Permutations
Represent the model as a directed graph G = (M, P).
Label the modules M with a simulation order 0..(N-1).
Partition the ports into sets P_0..P_m where:
- no two ports in a set P_m share a source, and
- no two ports in a set P_m share a destination.
Transform each P_m into a permutation σ_m:
- for all {s, d} in P_m, σ_m(s) = d
- holes in the range represent "don't cares": always send NoMessage on those steps.
Time-multiplex each module as usual, associating each σ_m with a physical port.
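The partitioning step is an edge coloring: within each set, every source and every destination may appear at most once. A greedy sketch of this construction (illustrative, and not necessarily the paper's algorithm):

```python
def partition_into_permutations(n_modules, ports):
    """Split directed edges (src, dst) into sets where no two edges
    share a source or a destination, then express each set as a
    sigma table; None holes mean 'always send NoMessage'."""
    groups = []
    for s, d in ports:
        for group in groups:          # greedily reuse an existing set
            if all(s != gs and d != gd for gs, gd in group):
                group.append((s, d))
                break
        else:                         # conflict everywhere: open a new set
            groups.append([(s, d)])
    sigmas = []
    for group in groups:              # each set becomes sigma[s] = d
        sigma = [None] * n_modules
        for s, d in group:
            sigma[s] = d
        sigmas.append(sigma)
    return sigmas

# A 4-node ring plus one extra link: the ring edges fit one permutation,
# and the conflicting edge (0, 2) spills into a second one.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
sigmas = partition_into_permutations(4, edges)
print(sigmas)   # [[1, 2, 3, 0], [2, None, None, None]]
```

Each resulting σ table then drives one physical permutation port of the time-multiplexed router, exactly as in the ring example earlier.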

38 Example: Arbitrary Network

39 Results: Multicore Simulation Rate
(Table: min/max/avg FMR and simulation rate, overall and per-core; overall simulation rate up to 3.2 MHz with a 625 KHz average, per-core rates up to 4.54 MHz)
We must simulate multiple cores to get the full benefit of time-multiplexed pipelines.
Functional cache pressure is the rate-limiting factor.