TMBox: A Configurable 16-core Hybrid TM FPGA prototype Osman Unsal.


1 TMBox: A Configurable 16-core Hybrid TM FPGA prototype Osman Unsal

2 The people
Nehir Sonmez (BSC), Oriol Arcas (BSC), Osman Unsal (BSC), Adrian Cristal (BSC), Satnam Singh (MSR Cambridge)

3 BeeFarm
Software simulators are poorly parallelized.
An FPGA can be significantly faster for multicore emulation: an FPGA emulator at 25 MHz can be faster than a software simulator on a 2 GHz host.
Reference: From Plasma to BeeFarm: Design Experience of an FPGA-based Multicore Prototype. Nehir Sonmez, Oriol Arcas, Gokhan Sayilar, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh and Mateo Valero. In 7th International Symposium on Applied Reconfigurable Computing (ARC 2011), March 2011.

4 BeeFarm
8-core, FPGA-based multiprocessor
Completely modifiable from top to bottom
[Block diagram labels: processors (P), L1, Bus, Arbiter, DDR2 Controller, Bootmem, I/O; clock labels 25 MHz and 125 MHz]
Honeycomb: MIPS R3000 compatible
Shared bus: 128-bit split bus
L1 cache: unified 8 KB cache

5 The Honeycomb core
R3000-compatible Honeycomb with flexible HTM support = original Plasma (MIPS R2000-compatible)
+ MMU, FPU
+ exceptions support
+ synchronization primitives: LL/SC
+ snooping, coherent caches (MSI)
+ debugging, performance counters
+ system libraries to support string, I/O, TM

6 BeeFarm performance
[Figure: results normalized to M5 with 1 thread; panels: functional simulation, detailed simulation]

7 TMbox
HTM multiprocessor on FPGA
–Inspired by AMD's Advanced Synchronization Facility
BeeFarm improved:
–Ring bus instead of shared bus (which doesn't fit well on FPGA)
–x2 frequency (50 MHz)
Reference: TMbox: A Flexible and Reconfigurable 16-core Hybrid Transactional Memory System. Nehir Sonmez, Oriol Arcas, Otto Pflucker, Osman S. Unsal, Adrián Cristal, Ibrahim Hur, Satnam Singh and Mateo Valero. In 19th IEEE International Symposium on Field-Programmable Custom Computing Machines (FCCM 2011), May 2011.

8 HTM ISA extensions
Inspired by AMD-ASF
10 new MIPS instructions
–XBEGIN (addr)
–XLB, XLH, XLW, XSB, XSH, XSW
–XCOMMIT, XABORT (code)
–MFTM
4 new special registers
–Can only be read with the MFTM (move from TM) instruction
–$TM0 register contains the abort address (XCOMMIT)
–$TM1 has a copy of the stack pointer (XCOMMIT)
–$TM2 contains the abort cause (overflow, contention or explicit)
–$TM3 stores a 20-bit software abort code (XABORT)
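To make the register roles above concrete, here is a minimal software model in C. It is only a sketch: the numeric encodings of the abort causes, the struct layout and the function names (tm_model_xabort, tm_model_mftm) are assumptions made for illustration and are not taken from the slides or from the TMbox sources. The one property it demonstrates is that an explicit abort leaves the cause in $TM2 and a code truncated to 20 bits in $TM3.

```c
#include <stdint.h>

/* Abort causes reported in $TM2. The numeric values are assumptions. */
typedef enum {
    TM_ABORT_NONE       = 0,
    TM_ABORT_OVERFLOW   = 1,
    TM_ABORT_CONTENTION = 2,
    TM_ABORT_EXPLICIT   = 3
} tm_abort_cause;

/* Hypothetical software model of the four special TM registers. */
typedef struct {
    uint32_t tm0; /* abort address */
    uint32_t tm1; /* copy of the stack pointer */
    uint32_t tm2; /* abort cause */
    uint32_t tm3; /* 20-bit software abort code */
} tm_regs;

#define TM_SW_CODE_MASK 0xFFFFFu /* low 20 bits */

/* Model of XABORT (code): record an explicit abort and keep 20 bits of code. */
static void tm_model_xabort(tm_regs *r, uint32_t code)
{
    r->tm2 = TM_ABORT_EXPLICIT;
    r->tm3 = code & TM_SW_CODE_MASK;
}

/* Model of MFTM: read TM register n (0..3); returns 0 for any other index. */
static uint32_t tm_model_mftm(const tm_regs *r, int n)
{
    switch (n) {
    case 0: return r->tm0;
    case 1: return r->tm1;
    case 2: return r->tm2;
    case 3: return r->tm3;
    default: return 0;
    }
}
```

An abort handler written against this model would read register 2 first and only look at register 3 when the cause is an explicit abort.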

9 HTM example
atomic {a++} example in MIPS assembler:
        LI      $11, 5
        LI      $13, HW_OFLOW
        J       $TX
$TX:    XBEGIN  $ABORT
        XLW     $8, 0($a0)
        ADDI    $8, $8, 1
        XSW     $8, 0($a0)
        XCOMMIT
        next code...
$ABORT: MFTM    $12, $TM2
        BEQ     $12, $13, $ERR
        ADDIU   $10, $10, 1
        SLTU    $12, $10, $11
        BEQZ    $12, $ERR2
        J       $TX
$ERR:   ...
Diagram annotations: Transaction committed; Abort; Abort due to conflict, retry...; HW capacity exceeded
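The control flow of the abort handler above can be sketched in C as a bounded retry loop. This is an illustration rather than the authors' code: the hardware attempt is replaced by a callback, all type and function names are hypothetical, and the retry counter (register $10 in the listing) is assumed to start at zero, which the slide does not show.

```c
/* Result of one hardware attempt ($TX: XBEGIN ... XCOMMIT). */
typedef enum { HW_COMMITTED, HW_CONFLICT, HW_OFLOW } hw_result;

/* Final outcome of the retry loop. */
typedef enum { TX_DONE, TX_ERR_OVERFLOW, TX_ERR_RETRIES } tx_outcome;

/* Hypothetical stand-in for one transactional attempt. */
typedef hw_result (*hw_attempt_fn)(void *ctx);

static tx_outcome run_with_retries(hw_attempt_fn attempt, void *ctx,
                                   unsigned max_retries /* $11 */)
{
    unsigned retries = 0;                  /* $10, assumed to start at 0 */
    for (;;) {
        hw_result r = attempt(ctx);        /* $TX block */
        if (r == HW_COMMITTED)
            return TX_DONE;                /* fall through to next code */
        if (r == HW_OFLOW)
            return TX_ERR_OVERFLOW;        /* BEQ $12, $13, $ERR */
        retries++;                         /* ADDIU $10, $10, 1 */
        if (!(retries < max_retries))      /* SLTU, then BEQZ to $ERR2 */
            return TX_ERR_RETRIES;
        /* otherwise J $TX: retry the transaction */
    }
}

/* Scripted attempt for demonstration: conflicts a fixed number of times. */
typedef struct { int conflicts_left; int calls; } scripted_hw;

static hw_result scripted_attempt(void *ctx)
{
    scripted_hw *s = (scripted_hw *)ctx;
    s->calls++;
    if (s->conflicts_left > 0) {
        s->conflicts_left--;
        return HW_CONFLICT;
    }
    return HW_COMMITTED;
}

/* Attempt that always exceeds the hardware capacity. */
static hw_result overflow_attempt(void *ctx)
{
    (void)ctx;
    return HW_OFLOW;
}
```

With a limit of 5, as loaded by LI $11, 5, a transaction that keeps conflicting is attempted five times before control reaches the $ERR2 path, while a capacity overflow leaves the loop on the first abort.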

10 TinySTM – ASF integration
atomic {a++} example with TinySTM hybrid TM:
        tm_thread_init();
        tm_start();
        t = tm_read(a);
        tm_write(a, t + 1);
        tm_commit();
        next code...
Diagram annotations: Transaction committed; Abort; Abort due to conflict, retry...; HW capacity exceeded, explicit SW abort; Switch to software; TinySTM conflict management
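The hybrid policy in the diagram (retry in hardware on a conflict, switch to software on a capacity overflow or an explicit software abort) can be sketched as follows. Everything here is a toy stand-in: the hy_ names are hypothetical, the software path is a plain read-modify-write rather than real TinySTM calls, and conflict retries are left unbounded for brevity.

```c
/* Status of one hardware transaction attempt. */
typedef enum {
    HY_HW_COMMIT,
    HY_HW_CONFLICT,
    HY_HW_OVERFLOW,
    HY_HW_EXPLICIT
} hy_hw_status;

/* Which path finally committed the update. */
typedef enum { HY_PATH_HW, HY_PATH_SW } hy_path;

/* Hypothetical stand-in for the hardware attempt of atomic {a++}. */
typedef hy_hw_status (*hy_hw_fn)(int *a);

/* Toy software path: read, add one, write back.
 * A real system would run this under the STM's conflict management. */
static void hy_sw_increment(int *a)
{
    int t = *a;
    *a = t + 1;
}

static hy_path hy_atomic_increment(int *a, hy_hw_fn hw_try)
{
    for (;;) {
        hy_hw_status s = hw_try(a);
        if (s == HY_HW_COMMIT)
            return HY_PATH_HW;
        if (s == HY_HW_CONFLICT)
            continue;                 /* abort due to conflict: retry in HW */
        hy_sw_increment(a);           /* overflow or explicit abort: go SW */
        return HY_PATH_SW;
    }
}

/* Demonstration attempts: one that commits, one that overflows. */
static hy_hw_status hy_hw_ok(int *a)
{
    *a += 1;
    return HY_HW_COMMIT;
}

static hy_hw_status hy_hw_too_big(int *a)
{
    (void)a;
    return HY_HW_OVERFLOW;
}
```

The caller sees the same result on either path; only the returned hy_path tells it whether the hardware or the software mechanism committed the increment.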

11 Compilation
Standard GCC-MIPS cross-compiler + HyTM extensions (to use the 10 new transactional instructions)
4 new TM registers, read with the MFTM instruction
Also extend the cache FSM to support TM

12 TMbox architecture
[Block diagram labels: cores C0 to C7, DDR, Requests, Responses, Invalidations, Bus Node, Bus Ctrl., L1, Honeycomb CPU, TM Unit (CAM, RAM; addr, data, hit)]
To commit (serialized):
1. Lock ring (to prevent other writes and commits)
   Will destroy ongoing write/commit requests
2. Commit the TX writes through channel
   Will abort conflicting TXs snooping the ring
3. Unlock ring
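The three commit steps above can be illustrated with a small single-threaded simulation in C. It is a sketch under several assumptions that do not come from the slides: conflicts are detected only against the read sets of other cores, memory is a small word-addressed array, the destruction of in-flight write/commit requests is not modeled, and all names and sizes (other than the 16 cores) are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CORES     16  /* core count from the title */
#define RING_SET_MAX   8   /* assumed read-set capacity */
#define RING_MEM_WORDS 64  /* assumed toy memory size */

typedef struct {
    bool     in_tx;
    bool     aborted;
    uint32_t read_set[RING_SET_MAX];
    size_t   n_reads;
} ring_core;

typedef struct {
    bool      locked;
    ring_core cores[RING_CORES];
    uint32_t  memory[RING_MEM_WORDS];
} ring_system;

/* True if the core has read addr inside its current transaction. */
static bool ring_has_read(const ring_core *c, uint32_t addr)
{
    for (size_t i = 0; i < c->n_reads; i++)
        if (c->read_set[i] == addr)
            return true;
    return false;
}

/* Serialized commit of n writes by core `committer`.
 * Returns false if the ring is already locked by another commit. */
static bool ring_commit(ring_system *s, int committer,
                        const uint32_t *addrs, const uint32_t *vals, size_t n)
{
    if (s->locked)
        return false;
    s->locked = true;                              /* 1. lock ring */
    for (size_t i = 0; i < n; i++) {               /* 2. commit TX writes */
        s->memory[addrs[i] % RING_MEM_WORDS] = vals[i];
        for (int c = 0; c < RING_CORES; c++) {     /* other cores snoop */
            ring_core *o = &s->cores[c];
            if (c != committer && o->in_tx && ring_has_read(o, addrs[i])) {
                o->aborted = true;                 /* conflicting TX aborts */
                o->in_tx = false;
            }
        }
    }
    s->cores[committer].in_tx = false;
    s->locked = false;                             /* 3. unlock ring */
    return true;
}
```

In this model, a core that has read an address being committed is marked aborted, and cores with disjoint read sets are left untouched.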

13 Performance
Eigenbench synthetic TM benchmark on 16 cores (lower is better):
–Left: 10 element r/w set: overflows the TM cache
–Right: 8 element r/w set: fits in the TM cache
[Figure annotation: HyTM better]

14 Performance (cont.)
From the STAMP TM benchmark suite:
–SSCA2: An efficient and scalable graph kernel construction algorithm.
–Intruder: A high abort rate benchmark.
If the program scales, so do we… (higher is better)
[Figure: panels SSCA2 and Intruder; annotations: % better; 48% TX in HW (HW aborts are less expensive)]

15 Future Work: TMbox 2?
Distributed memory directory
4 FPGAs, 64 cores
Maps well on FPGA
Similar to Stanford DASH

16 Bluespec SystemVerilog
Functional language for HW modeling
–Functional, object-oriented, rule-based
–HW functional verification is fast and easy (static rule conditions verification)
–Compiles to Verilog source code (better for component refinement)
First prototype: MIPS 5-stage processor
–Faster (100 MHz) and smaller

17 TMbox is available at: Any questions? Contact:

