HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing Michael Adler Elliott Fleming Michael Pellauer Joel Emer.


1 HAsim FPGA-Based Processor Models: Multicore Models and Time-Multiplexing Michael Adler Elliott Fleming Michael Pellauer Joel Emer

2 1 Simulating Multicores
Simulating an N-core multicore target is fundamentally N times the work, plus the on-chip network. Duplicating the cores will quickly fill the FPGA, and spanning multiple FPGAs will slow simulation.

3 2 Trading Time for Space
We can leverage the separation of the model clock from the FPGA clock to save space, using two techniques: serialization and time-multiplexing. But doesn't this just slow down our simulator? The tradeoff is a good idea if we can:
– Save a lot of space
– Improve the FPGA critical path
– Improve utilization
– Slow down rare events while keeping common events fast
The LI (latency-insensitive) approach enables a wide range of tradeoff options.

4 3 Serialization: A First Tradeoff

5 4 Example Tradeoff: Multi-Ported Register File
2 read ports, 2 write ports; 5-bit index, 32-bit data; reads take zero model clock cycles. Implemented directly on a Virtex-2 Pro FPGA: 9242 slices (>25%), 104 MHz.

6 5 Trading Time for Space
Simulate the circuit sequentially using a 1R/1W BlockRAM and a small FSM: 94 slices (<1%), 1 BlockRAM, 224 MHz (a 2.2x faster clock). At 3 FPGA cycles per model cycle, the simulation rate is 224 / 3 = 75 MHz. We call this the FPGA-cycle to Model-cycle Ratio (FMR). Each module may have a different FMR; A-Ports allow us to connect many such modules together while maintaining a consistent notion of model time.
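The serialization tradeoff can be sketched in software. Below is an illustrative Python model (our own names and FSM ordering, not HAsim code) of a 2R/2W register file emulated on a single 1R/1W memory over 3 FPGA cycles per model cycle, i.e. FMR = 3:

```python
# Illustrative sketch (not HAsim code): a 2R/2W register file emulated
# on a single 1R/1W memory, serialized over 3 "FPGA cycles" per model
# cycle (FMR = 3). The intra-cycle read/write ordering is a
# simplifying assumption.
class SerializedRegFile:
    def __init__(self, size=32):
        self.mem = [0] * size    # stands in for the 1R/1W BlockRAM
        self.fpga_cycles = 0

    def model_cycle(self, rd1, rd2, wr1, wr2):
        # FPGA cycle 1: first read and first write use the RAM's ports
        v1 = self.mem[rd1]
        self.mem[wr1[0]] = wr1[1]
        # FPGA cycle 2: second read and second write
        v2 = self.mem[rd2]
        self.mem[wr2[0]] = wr2[1]
        # FPGA cycle 3: results presented; the FSM returns to idle
        self.fpga_cycles += 3
        return v1, v2
```

One model cycle thus costs three FPGA cycles, which is how a 224 MHz FPGA clock yields a 75 MHz simulation rate.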

7 6 Example: In-Order Front End
(Diagram: FET, Branch Pred, Line Pred, IMEM, I$, ITLB, PC Resolve, and Inst Q modules connected by A-Ports with per-port latencies; a legend marks which modules are ready to simulate.)
Modules may simulate at any wall-clock rate. Corollary: adjacent modules may not be simulating the same model cycle.

8 7 Simulator "Slip"
Adjacent modules may be simulating different model cycles! In the paper: a distributed resynchronization scheme. This can actually speed up simulation. Case study: achieved 17% better performance than a centralized controller. Performance can approach the dynamic average rather than the worst case. Let's see how...

9 8 Traditional Software Simulation
(Diagram: wall-clock timeline of a FET/DEC/EXE/MEM/WB pipeline; each model cycle is simulated to completion before the next begins.)

10 9 Global Controller "Barrier" Synchronization
(Diagram: FPGA-cycle timeline of the FET/DEC/EXE/MEM/WB pipeline; on each model cycle every stage waits at a barrier for the slowest stage. Slide footer: 2008.06.30, "Challenges in Conducting Compelling Architecture Research".)

11 10 A-Ports Distributed Synchronization
(Diagram: FPGA-cycle timeline of the same pipeline using A-Ports.) Long-running operations can overlap even when they fall on different FPGA cycles, and modules run ahead in model time until port buffering fills. Takeaway: LI makes serialization tradeoffs more appealing.
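The benefit of slip over a barrier shows up even in a tiny timing model. The sketch below (our own simplification, not the paper's scheme) times a producer/consumer module pair two ways: the barrier pays the per-model-cycle worst case, while A-Port-style run-ahead with bounded buffering approaches the dynamic average:

```python
# Simplified timing model of barrier vs. A-Port "slip" for two modules
# connected producer -> consumer. The latency lists give each module's
# FPGA-cycle cost per model cycle.
def barrier_cycles(prod, cons):
    # Lockstep: every model cycle costs the slower module's time.
    return sum(max(p, c) for p, c in zip(prod, cons))

def slip_cycles(prod, cons, buffering=4):
    # Run-ahead: the producer may advance until port buffering fills.
    n = len(prod)
    fp = [0] * n   # FPGA cycle at which producer finishes model cycle i
    fc = [0] * n   # FPGA cycle at which consumer finishes model cycle i
    for i in range(n):
        stall = fc[i - buffering] if i >= buffering else 0
        fp[i] = max(fp[i - 1] if i else 0, stall) + prod[i]
        fc[i] = max(fc[i - 1] if i else 0, fp[i]) + cons[i]
    return fc[-1]

prod = [1, 4, 1, 4]   # producer alternates fast/slow model cycles
cons = [4, 1, 4, 1]   # consumer is slow when the producer is fast
```

With these latencies the long operations never coincide: the barrier pays 16 FPGA cycles, while slip overlaps them and finishes in 11.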

12 11 Leveraging Latency-Insensitivity [With Parashar, Adler]
Modeling large caches: the L2$ cache controller is backed by a LEAP scratchpad, which transparently spans on-FPGA BRAM (KBs, 1 CC), SRAM (MBs, 10s of CCs), and system memory (GBs, 100s of CCs). Expensive instructions: the EXE/FPU stages fall back to the LEAP instruction emulator (M5) over RRR.

13 12 Time-Multiplexing: A Tradeoff to Scale Multicores

14 13 Multicores Revisited
What if we duplicate the cores?
Benefits: simple to describe; maximum parallelism.
Drawbacks: probably won't fit; low utilization of functional units.

15 14 Module Utilization
A module is unutilized on an FPGA cycle if it is waiting for all input ports to be non-empty, or waiting for all output ports to be non-full. Case study: in-order functional units were utilized only 13% of FPGA cycles on average.

16 15 Time-Multiplexing: First Approach
Duplicate the state, sequentially share the logic: one physical pipeline serves several virtual instances.
Benefits: better unit utilization.
Drawbacks: more expensive than duplication(!)

17 16 Round-Robin Time-Multiplexing
Fix the ordering and remove the multiplexors.
Benefits: much better area; good unit utilization.
Drawbacks: head-of-line blocking may limit performance.
We need to limit the impact of slow events, pipeline at a fine granularity, and find a distributed, controller-free mechanism to coordinate...

18 17 Port-Based Time-Multiplexing
Duplicate the local state in each module, and change the port implementation:
– Minimum buffering: N * latency + 1
– Initialize each FIFO with N * latency tokens
Result: adjacent modules can simultaneously be simulating different virtual instances.
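As a sketch (our own naming, not HAsim's), a multiplexed port is just a FIFO pre-loaded with N * latency NoMessage tokens; with N virtual cores interleaved round-robin, each core then sees its own messages delayed by exactly `latency` model cycles:

```python
from collections import deque

# Sketch of a time-multiplexed A-Port for N round-robin virtual cores.
# Pre-loading N * latency initial (NoMessage = None) tokens gives each
# virtual core a model latency of `latency` cycles; the hardware FIFO
# would need N * latency + 1 slots of buffering.
class MultiplexedPort:
    def __init__(self, n_cores, latency):
        self.fifo = deque([None] * (n_cores * latency))

    def send(self, msg):
        self.fifo.append(msg)

    def receive(self):
        return self.fifo.popleft()

# With N = 2 and latency = 1, core a's cycle-0 message arrives when
# core a simulates cycle 1:
port = MultiplexedPort(n_cores=2, latency=1)
log = []
for msg in ["a0", "b0", "a1", "b1"]:   # cores a/b, model cycles 0-1
    port.send(msg)
    log.append(port.receive())
# log == [None, None, "a0", "b0"]
```

The initial None tokens play the role of messages "sent before model time began", which is what lets the producer and consumer be offset by the port's model latency.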

19 18 The Front End Multiplexed
(Diagram: the in-order front end from before, now time-multiplexed; FET and IMEM can be simulating CPU 2 while the downstream modules are still working on CPU 1.)

20 19 On-Chip Networks in a Time-Multiplexed World

21 20 Problem: On-Chip Network
(Diagram: CPUs with L1/L2 caches and a memory controller, each attached to a router; routers exchange msg and credit wires.) Problem: routing wires to/from each router, which is similar to the "global controller" scheme. Utilization is also low.

22 21 Multiplexing On-Chip Network Routers
A single multiplexed router simulates Routers 0..3. Messages between virtual routers are carried by permutation ports that reorder the multiplexed stream: σ(x) = (x + 1) mod 4, σ(x) = (x + 2) mod 4, and σ(x) = (x + 3) mod 4. We simulate the network without a network.
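The idea can be sketched as pure data movement (illustrative code, our own naming): one round of the multiplexed router's output stream is delivered through three permutation ports, so no physical network is needed:

```python
N = 4  # virtual routers simulated by one physical router

def route_round(outbox):
    """Deliver one multiplexed round of messages via permutation ports.

    outbox[x][k] = message router x sends toward router (x + k) % N.
    Each relative offset k is one permutation port sigma_k(x) = (x+k) % N;
    applying it reorders the stream so inbox[d][k] holds what arrived at
    virtual router d via that port.
    """
    inbox = [{} for _ in range(N)]
    for k in range(1, N):
        for x in range(N):
            if k in outbox[x]:
                inbox[(x + k) % N][k] = outbox[x][k]
    return inbox
```

Slots with no message would carry NoMessage (here: simply absent) so that every virtual router still takes its turn.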

23 22 Ring/Double Ring Topology Multiplexed
For a ring, the multiplexed router needs only "to next" / "from prev" ports, implemented with the single permutation σ(x) = (x + 1) mod 4. For the opposite direction of a double ring, flip to/from.

24 23 Implementing Permutations on FPGAs Efficiently
– Side buffer: fits networks like ring/torus (e.g. σ(x) = (x + 1) mod N); move the first element to the Nth slot, the Nth to the first, or every Kth to N-K.
– Indirection table (a permutation-table RAM plus buffer and FSM): more general, but more expensive.
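For a ring-like permutation such as σ(x) = (x + 1) mod N, a full indirection table is unnecessary. A sketch of the side-buffer approach (our own code): only the wrap-around element is parked, while the rest stream through with a one-slot shift:

```python
# Side-buffer sketch for sigma(x) = (x + 1) mod N: element x of the
# multiplexed stream is emitted at position x + 1, and only the final
# (wrap-around) element is parked in a one-entry side buffer until
# position 0 of the next pass.
def permute_stream(msgs, n):
    out = [None] * n
    side = None
    for x, m in enumerate(msgs):
        if x == n - 1:
            side = m          # wraps around to position 0
        else:
            out[x + 1] = m
    out[0] = side
    return out

# permute_stream(["r0", "r1", "r2", "r3"], 4) == ["r3", "r0", "r1", "r2"]
```

In hardware this costs one register and a mux rather than a RAM-based indirection table.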

25 24 Torus/Mesh Topology Multiplexed
Mesh: simply don't transmit on non-existent links.

26 25 Dealing with Heterogeneous Networks
Compose "Mux Ports" with permutation ports. In the paper: generalized to any topology.

27 26 Putting It All Together

28 27 Typical HAsim Model Leveraging These Techniques
– 16-core chip multiprocessor
– 10-stage pipeline (speculative, bypassed): F, BP1, BP2, PCC, IQ, D, X, DM, CQ, C
– 64-bit Alpha ISA, floating point
– 8 KB lockup-free L1 caches (ITLB, I$, DTLB, D$, L/S Q)
– 256 KB 4-way set-associative L2 cache
– Network: 2 virtual channels, 4 slots, x-y wormhole routing
Implementation: a single detailed pipeline, 16-way time-multiplexed; a 64-bit Alpha functional partition with floating point; caches modeled with a different cache hierarchy; a single multiplexed router with 4 permutations.

29 28 Time-Multiplexed Multicore Simulation Rate Scaling
FMR: Best 15.7, Worst 27.1, Avg 18.4

30 29 Time-Multiplexed Multicore Simulation Rate Scaling
FMR Per-Core: Best 5.4, Worst 14.4, Avg 8.95

31 30 Time-Multiplexed Multicore Simulation Rate Scaling
FMR Per-Core: Best 8.5, Worst 13.5, Avg 11.6

32 31 Time-Multiplexed Multicore Simulation Rate Scaling
FMR Per-Core: Best 8.45, Worst 19.8, Avg 11.5

33 32 Takeaways
The latency-insensitive approach provides a unified way to make interesting tradeoffs.
Serialization: leverage FPGA-efficient circuits at the cost of FMR.
– A-Port-based synchronization can amortize the cost by giving the dynamic average, especially if long events are rare.
Time-Multiplexing: reuse datapaths and duplicate only the state.
– The A-Port-based approach means not all modules are fully utilized; increased utilization means performance degradation is sublinear.
– Time-multiplexing the on-chip network requires permutations.

34 33 Next Steps
Here we were able to push one FPGA to its limits. What if we want to scale farther? Next, we'll explore how latency-insensitivity can help us scale to multiple FPGAs with better performance than traditional techniques, and how we can increase designer productivity by abstracting the platform.


36 35 Resynchronizing Ports
Modules follow a modified scheme: if any incoming port is heavy, or any outgoing port is light, simulate the next cycle (when ready). Result: balance without centralized coordination.
Argument: the modules farthest ahead in time will never proceed.
– Ports into (out of) this set will be light (resp. heavy), so those modules may try to proceed but will not be able to.
– There is also a set farthest behind in time, which is always able to proceed.
– Since the graph is connected, simulating only enables more modules, making progress toward quiescence.
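The per-module decision rule can be written down directly as a predicate. This is a sketch with our own names: ports are modeled as lists of buffered tokens, and `balanced` is the port's balanced occupancy:

```python
# Sketch of the resynchronization predicate: a port is "heavy" when it
# holds more tokens than its balanced level and "light" when it holds
# fewer. A module simulates its next model cycle (when ready) whenever
# any incoming port is heavy or any outgoing port is light.
def should_simulate(in_ports, out_ports, balanced):
    heavy_in = any(len(p) > balanced for p in in_ports)
    light_out = any(len(p) < balanced for p in out_ports)
    return heavy_in or light_out
```

Intuitively, a module that has run ahead of its neighbors sees light inputs and heavy outputs, so the predicate is false and it yields; the farthest-behind modules see the opposite and keep firing, which is what drives the system back toward balance.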

37 36 Other Topologies: Tree, Butterfly

38 37 Generalizing OCN Permutations
Represent the model as a directed graph G = (M, P). Label the modules M with a simulation order 0..(N-1). Partition the ports into sets P_0..P_m where:
– No two ports in a set P_m share a source
– No two ports in a set P_m share a destination
Transform each P_m into a permutation σ_m:
– For all {s, d} in P_m, σ_m(s) = d
– Holes in the range represent "don't cares"; always send NoMessage on those steps
Time-multiplex the module as usual, associating each σ_m with a physical port.
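The partitioning step above can be sketched greedily (illustrative code, not necessarily the paper's algorithm): assign each directed port to the first set whose sources and destinations it does not collide with, so each resulting set is a partial permutation:

```python
# Greedy sketch of partitioning ports (directed edges) into sets in
# which no two share a source or a destination. Each set m defines a
# partial permutation sigma_m(src) = dst; sources missing from a set
# are the "don't care" slots that send NoMessage.
def partition_into_permutations(ports):
    groups = []                    # each group is a dict: src -> dst
    for s, d in ports:
        for g in groups:
            if s not in g and d not in g.values():
                g[s] = d
                break
        else:
            groups.append({s: d})
    return groups

# A 4-node ring collapses to a single permutation:
ring = [(0, 1), (1, 2), (2, 3), (3, 0)]
# partition_into_permutations(ring) == [{0: 1, 1: 2, 2: 3, 3: 0}]
```

Two ports sharing a source (or a destination) cannot coexist in one permutation, so they are split across sets, each of which maps onto one physical permutation port of the multiplexed module.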

39 38 Example: Arbitrary Network

40 39 Results: Multicore Simulation Rate

           FMR                  Simulation Rate
           Min   Max   Avg      Min        Max       Avg
Overall    16    218   80       160 KHz    3.2 MHz   625 KHz
Per-Core   5     27    11       1.84 MHz   9.5 MHz   4.54 MHz

We must simulate multiple cores to get the full benefit of time-multiplexed pipelines. Functional cache pressure is the rate-limiting factor.

