TMBox: A Configurable 16-core Hybrid TM FPGA prototype

1 TMBox: A Configurable 16-core Hybrid TM FPGA prototype
Osman Unsal

2 The people
Nehir Sonmez (BSC)
Oriol Arcas (BSC)
Osman Unsal (BSC)
Adrian Cristal (BSC)
Satnam Singh (MSR Cambridge)

3 BeeFarm
Software simulators are poorly parallelized.
An FPGA can be significantly faster for multicore emulation: an FPGA emulator at 25 MHz can be faster than a software simulator on a 2 GHz host.
Reference: Nehir Sonmez, Oriol Arcas, Gokhan Sayilar, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh and Mateo Valero. "From Plasma to BeeFarm: Design Experience of an FPGA-based Multicore Prototype." In 7th International Symposium on Applied Reconfigurable Computing (ARC 2011), March 2011.

4 BeeFarm 8-core, FPGA-based multiprocessor
Completely modifiable from top to bottom
Diagram labels: Bus, P, L1, Arbiter, DDR2 Controller, Bootmem, I/O, 25 MHz, 125 MHz
Honeycomb: MIPS R3000 compatible
L1 cache: unified 8 KB cache
Shared bus: 128-bit split bus

5 The Honeycomb core
R3000-compatible Honeycomb with flexible HTM support =
Original Plasma (MIPS R2000-compatible)
+ MMU, FPU
+ exceptions support
+ synchronization primitives: LL/SC
+ snooping, coherent caches (MSI)
+ debugging, performance counters
+ system libraries to support string, I/O, TM

6 BeeFarm performance
[Figure: two panels, "Functional simulation" and "Detailed simulation". Results normalized to M5 with 1 thread.]

7 TMbox
HTM multiprocessor on FPGA
Inspired by AMD’s Advanced Synchronization Facility
BeeFarm improved:
Ring bus instead of shared bus (which doesn’t fit well on FPGA)
x2 frequency (50 MHz)
Reference: Nehir Sonmez, Oriol Arcas, Otto Pflucker, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh and Mateo Valero. "TMbox: A Flexible and Reconfigurable 16-core Hybrid Transactional Memory System." In 19th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2011), May 2011.

8 HTM ISA extensions
Inspired by AMD-ASF
10 new MIPS instructions:
XBEGIN (addr)
XLB, XLH, XLW, XSB, XSH, XSW
XCOMMIT, XABORT (code)
MFTM
4 new special registers, which can only be read with the MFTM (move from TM) instruction:
$TM0 contains the abort address (XCOMMIT)
$TM1 has a copy of the stack pointer (XCOMMIT)
$TM2 contains the abort cause (overflow, contention or explicit)
$TM3 stores a 20-bit software abort code (XABORT)

9 HTM example
atomic {a++} example in MIPS assembler:
        LI    $11, 5
        LI    $13, HW_OFLOW
        J     $TX
$TX:    XBEGIN $ABORT
        XLW   $8, 0($a0)
        ADDI  $8, $8, 1
        XSW   $8, 0($a0)
        XCOMMIT
$ABORT: MFTM  $12, $TM2
        BEQ   $12, $13, $ERR
        ADDIU $10, $10, 1
        SLTU  $12, $10, $11
        BEQZ  $12, $ERR2
        J     $TX
        next code...
$ERR:   ...
Flow labels from the slide diagram: "Abort due to conflict, retry...", "Abort", "Transaction committed", "HW capacity exceeded".
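
The control flow of the assembly above can be restated as a minimal Python sketch. It is an illustration only, not the TMbox implementation: `AbortCause`, `TxAbort`, `run_hw_transaction`, and `make_body` are hypothetical names, the transaction body is an ordinary callable standing in for the XBEGIN ... XCOMMIT region, and the retry counter in `$10` is assumed to start at zero.

```python
from enum import Enum


class AbortCause(Enum):
    # The three abort causes the slides list for $TM2.
    OVERFLOW = "overflow"
    CONTENTION = "contention"
    EXPLICIT = "explicit"


class TxAbort(Exception):
    """Hypothetical stand-in for a hardware abort that jumps to $ABORT."""

    def __init__(self, cause):
        super().__init__(cause.value)
        self.cause = cause


def run_hw_transaction(body, max_retries=5):
    """Mirror the $TX / $ABORT loop: retry on abort, give up on overflow."""
    retries = 0  # plays the role of $10 (assumed to start at 0)
    while True:
        try:
            body()  # XBEGIN ... XCOMMIT
            return "committed"
        except TxAbort as abort:
            if abort.cause is AbortCause.OVERFLOW:  # BEQ $12, $13, $ERR
                return "overflow"
            retries += 1  # ADDIU $10, $10, 1
            if not retries < max_retries:  # SLTU / BEQZ -> $ERR2
                return "retry_limit"


def make_body(failures, cause):
    """Build a test body that aborts `failures` times before succeeding."""
    state = {"left": failures, "attempts": 0}

    def body():
        state["attempts"] += 1
        if state["left"] > 0:
            state["left"] -= 1
            raise TxAbort(cause)

    return body, state
```

With `max_retries=5`, a body that always aborts due to contention is attempted five times before the sketch reports `"retry_limit"`, matching the `LI $11, 5` bound; an overflow abort ends the loop on the first attempt.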

10 TinySTM – ASF integration
atomic {a++} example with TinySTM hybrid TM:
        tm_thread_init();
        tm_start();
        t = tm_read(a);
        tm_write(a, t + 1);
        tm_commit();
        next code...
Flow labels from the slide diagram: "Abort due to conflict, retry...", "TinySTM conflict management", "Abort", "HW capacity exceeded, explicit SW abort", "Transaction committed", "Switch to software".
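
The hybrid policy on this slide can be sketched as follows: conflicts are retried in hardware, while a capacity overflow or an explicit abort switches the transaction to the software path. This is a minimal sketch under stated assumptions, not the TinySTM API: `hybrid_atomic`, `hw_attempt`, and `sw_run` are hypothetical names, outcomes are plain strings, and the bound on hardware retries (and the fallback to software once it is exhausted) is an assumption borrowed from the assembly example rather than something the slides state.

```python
def hybrid_atomic(hw_attempt, sw_run, max_hw_retries=5):
    """Run a transaction in hardware first, falling back to software.

    hw_attempt() returns "committed", "contention", "overflow" or "explicit".
    sw_run() executes the same transaction on the software (STM) path.
    Returns "hw" or "sw" to report which path completed the transaction.
    """
    for _ in range(max_hw_retries):
        outcome = hw_attempt()
        if outcome == "committed":
            return "hw"
        if outcome in ("overflow", "explicit"):
            break  # hardware cannot finish this transaction
        # Otherwise the abort was due to contention: retry in hardware.
    # Assumption: exhausting the hardware retries also falls back to software.
    sw_run()
    return "sw"
```

A caller would pass two callables that run the same critical section, one wrapped in the hardware TX instructions and one instrumented for the software path.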

11 Compilation
Standard GCC-MIPS cross-compiler + HyTM extensions (to use the 10 new TX instructions)
4 new TM registers, read with the MFTM instruction
Also extend the cache FSM to support TM

12 TMbox architecture
Diagram labels: DDR, Responses, Requests, Invalidations, L1, Honeycomb CPU, TM Unit, CAM, RAM, data, hit, addr, Bus Node, Bus Ctrl.
To commit (serialized):
1. Lock ring (to prevent other writes and commits). Will destroy ongoing write/commit requests.
2. Commit the TX writes through channel. Will abort conflicting TXs snooping the ring.
3. Unlock ring
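
The three commit steps can be modelled in a few lines of Python. This is a software sketch under stated assumptions, not the hardware design: `Core`, `Ring`, `snoop`, and `commit` are hypothetical names, a `threading.Lock` stands in for the ring lock, each core's transactional read/write set is a plain set of addresses, and the destruction of in-flight write/commit requests in step 1 is not modelled.

```python
import threading


class Core:
    """A core with a hypothetical transactional read/write address set."""

    def __init__(self, name):
        self.name = name
        self.tx_addrs = set()  # addresses touched by the running TX
        self.aborted = False

    def snoop(self, addr):
        # A committed write to an address in our TX set is a conflict.
        if addr in self.tx_addrs:
            self.aborted = True
            self.tx_addrs.clear()


class Ring:
    """Toy model of the ring bus shared by all cores."""

    def __init__(self, cores):
        self.cores = cores
        self.lock = threading.Lock()  # stands in for the ring lock
        self.memory = {}

    def commit(self, committer, writes):
        with self.lock:  # 1. lock ring: commits are serialized
            for addr, value in writes.items():  # 2. commit the TX writes
                self.memory[addr] = value
                for core in self.cores:
                    if core is not committer:
                        core.snoop(addr)  # conflicting TXs abort
        # 3. ring is unlocked on leaving the with-block
        committer.tx_addrs.clear()
```

For example, if cores `a` and `b` both hold address `0x10` in their transactional sets and `a` commits a write to it, `b` is marked aborted while a core working on unrelated addresses is unaffected.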

13 Performance
Eigenbench synthetic TM benchmark on 16 cores (lower is better):
Left: 10-element r/w set, overflows the TM cache
Right: 8-element r/w set, fits in the TM cache
Chart annotation: HyTM better

14 Performance (cont.)
From the STAMP TM benchmark suite (higher is better):
SSCA2: An efficient and scalable graph kernel construction algorithm.
Intruder: A high abort rate benchmark.
If the program scales, so do we…
Chart annotations: 48% TX in HW (HW aborts are less expensive); 5-8% better; SSCA2; Intruder

15 Future Work: TMbox 2?
Diagram labels: DDR, Directory, Switch, FPGA A, FPGA B, FPGA C, FPGA D, BEE3 board, RS232, PCIe, Ethernet
MIPS R KB I$1 + 8 KB D$1
100 MHz
Low-overhead, online profiling
4 GB DDR2
256 LB L2 cache
Distributed memory directory
4 FPGAs
64 cores
Maps well on FPGA
Similar to Stanford DASH

16 Bluespec System Verilog
Functional language for HW modeling
Functional, object-oriented, rule-based
HW functional verification is fast and easy (static rule conditions verification)
Compiles to Verilog source code (better for component refinement)
First prototype: MIPS 5-stage processor
Faster (100 MHz) and smaller

17 TMbox is available at:
Any questions? Contact:
