DASX : Hardware Accelerator for Software Data Structures Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman (Simon Fraser University), Vijayalakshmi Srinivasan.


1 DASX: Hardware Accelerator for Software Data Structures
Snehasish Kumar, Naveen Vedula, Arrvindh Shriraman (Simon Fraser University), Vijayalakshmi Srinivasan (IBM Research)

2 Executive Summary
Data: void simple() { for (int i = 0; i < size; ++i) { a[i] = b[i] + c[i]; } }
For each array element, the core first computes the element's address (base + i × 4) and then loads it: mov $400, %r1; mov $4, %r2; mul %r3, %r2; add %r1, %r2; ld (%r2), %r4. These mov, mul, add and ld instructions fill the core's reorder buffer.
Extra work encumbers the core! High-level information about the data structure is lost!
DASX: accelerate the access of, and the compute on, software data structures.

3 Outline
– Challenges of data-centric applications
– Existing mechanisms to address challenges
– DASX: Data Structure Accelerator
– Benchmarks and Evaluation

4 Challenge 1/3: Instruction Overhead
Instructions per element: 1D vector: 2; 2D vector: 3; 6D vector: 12. An OLAP cube [Gray et al., DMKD '96] can have up to 15 dimensions!
Unordered set: avg. 12 instructions. BTree: 100s of instructions.
Figure: instruction breakdown with labels COMPUTE and DATA (values 9% and 66%).
void simple() { for (int i = 0; i < size; ++i) { a[i] = b[i] + c[i]; } }
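
To make the growth with dimensionality concrete, here is a minimal sketch (hypothetical helper `flat_offset`, not from the slides) of the row-major offset a scalar core must compute before every element load. In this naive form each dimension costs one multiply and one add; compilers can strength-reduce part of this in simple loops, so real per-element counts, such as the slide's, differ.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper: row-major offset of element `idx` in an array with
// extents `dims`. Each dimension adds one multiply and one add, the kind of
// address arithmetic the core repeats for every element it touches.
template <std::size_t N>
std::size_t flat_offset(const std::array<std::size_t, N>& dims,
                        const std::array<std::size_t, N>& idx) {
    std::size_t off = 0;
    for (std::size_t d = 0; d < N; ++d) {
        off = off * dims[d] + idx[d];  // Horner form: MUL + ADD per dimension
    }
    return off;
}
```

For example, `flat_offset<3>({2, 3, 4}, {1, 2, 3})` evaluates 1·12 + 2·4 + 3 = 23.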

5 Challenge 2/3: Memory Level Parallelism
Each element is independent, but for each array element the reorder buffer fills with address-generation instructions (mov $400, %r1; mov $4, %r2; mul %r3, %r2; add %r1, %r2) ahead of the load itself (ld (%r2), %r4).
The core can't discover more MLP! Accessing multiple data structures makes this worse!
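
A back-of-the-envelope way to see the MLP limit, under a simplifying assumption of ours (not the authors' model): if every element costs a fixed number of instructions and the reorder buffer holds a fixed number of instructions, only so many independent loads fit in the window at once. The ROB size of 128 and the helper `loads_in_window` are hypothetical; the model ignores MSHR limits and other bottlenecks.

```cpp
#include <cstddef>

// First-order model (an assumption, not from the slides): with a reorder
// buffer of `rob_entries` instructions and `insts_per_element` instructions
// per element (address generation plus the load), at most this many
// independent element loads can be in the window at once.
constexpr std::size_t loads_in_window(std::size_t rob_entries,
                                      std::size_t insts_per_element) {
    // Guard against division by zero for a degenerate input.
    return insts_per_element == 0 ? 0 : rob_entries / insts_per_element;
}

// Hypothetical 128-entry ROB with the slide's five-instruction sequence
// (mov, mov, mul, add, ld): about 25 loads can be in flight.
static_assert(loads_in_window(128, 5) == 25, "25 loads in the window");
// A traversal costing ~100 instructions per element leaves room for one.
static_assert(loads_in_window(128, 100) == 1, "one load in the window");
```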

6 Challenge 3/3: Managing Cache Space
Figure: memory hierarchy (CPU, L1, L2, MEM). Not enough space in the cache.
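
A minimal fully associative LRU model (hypothetical `lru_hits`, not part of DASX) shows why cache space matters for data-centric loops: when a loop cyclically streams over more lines than the cache holds, LRU evicts each line just before it is reused, so every access misses.

```cpp
#include <cstddef>
#include <list>
#include <unordered_map>
#include <vector>

// Hypothetical fully associative LRU cache model. Returns the number of
// hits for a trace of cache-line addresses.
std::size_t lru_hits(const std::vector<long>& trace, std::size_t capacity) {
    if (capacity == 0) return 0;
    std::list<long> order;  // front = most recently used
    std::unordered_map<long, std::list<long>::iterator> where;
    std::size_t hits = 0;
    for (long line : trace) {
        auto it = where.find(line);
        if (it != where.end()) {
            ++hits;
            order.erase(it->second);    // re-inserted at the front below
        } else if (order.size() == capacity) {
            where.erase(order.back());  // evict the least recently used line
            order.pop_back();
        }
        order.push_front(line);
        where[line] = order.begin();
    }
    return hits;
}
```

Cycling three times over 5 lines gives 0 hits with 4 entries but 10 hits with 5: one line too many turns reuse into thrashing.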

7 Outline – Existing mechanisms to address challenges

8 Existing Mechanisms: Prefetching
+ Increases memory level parallelism
– Increases instructions (SW PF)
– Best effort (HW PF)
– Can cause cache thrashing
void simple() { for (int i = 0; i < size; ++i) { prefetch(a + i + k); prefetch(b + i + k); prefetch(c + i + k); a[i] += b[i] + c[i]; } }, where k is the prefetch distance.
Reorder buffer contents: add, pref, mov, load
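
A compilable version of the prefetching loop, assuming GCC or Clang for the `__builtin_prefetch(addr, rw)` builtin (`rw` = 1 means the line will be written). The prefetch distance `k` is a tuning parameter; its default of 16 is an arbitrary placeholder, and `simple_prefetch` is a hypothetical name. It uses `a[i] = b[i] + c[i]` as in `simple()`.

```cpp
#include <cstddef>

// Software-prefetched simple(): fetch k elements ahead of use.
void simple_prefetch(int* a, const int* b, const int* c, std::size_t size,
                     std::size_t k = 16) {
    for (std::size_t i = 0; i < size; ++i) {
        if (i + k < size) {                   // stay inside the arrays
            __builtin_prefetch(&a[i + k], 1);  // a will be written
            __builtin_prefetch(&b[i + k], 0);  // b, c are read only
            __builtin_prefetch(&c[i + k], 0);
        }
        a[i] = b[i] + c[i];
    }
}
```

The bounds check and three prefetches per iteration are the extra instructions the slide lists as the cost of SW prefetching.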

9 Existing Mechanisms: SIMD
+ Reduces instructions
– Requires algorithm changes
– Increases power
void simple() { for (int i = 0; i < size; i += k) { SIMD_LOAD(a[i]:a[i+k-1]); SIMD_LOAD(b[i]:b[i+k-1]); SIMD_LOAD(c[i]:c[i+k-1]); SIMD_ADD(a[…], b[…], c[…]); } }, where k is the SIMD width.
Reorder buffer contents: add, load, add, load, add
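
The SIMD pseudocode can be made concrete with the GCC/Clang vector extension (`__attribute__((vector_size(16)))`, assumed compiler support; not available in MSVC), taking four 32-bit lanes as the width k. Unlike the slide, this sketch loads only `b` and `c`, since `a` is write-only in `simple()`, and it finishes leftover elements with a scalar loop, an example of the restructuring SIMD demands. `simple_simd` is a hypothetical name.

```cpp
#include <cstddef>
#include <cstring>

// Four 32-bit ints per vector via the GCC/Clang vector extension.
typedef int v4si __attribute__((vector_size(16)));

void simple_simd(int* a, const int* b, const int* c, std::size_t size) {
    constexpr std::size_t k = 4;  // lanes per vector
    std::size_t i = 0;
    for (; i + k <= size; i += k) {
        v4si vb, vc;
        std::memcpy(&vb, b + i, sizeof vb);  // unaligned-safe vector load
        std::memcpy(&vc, c + i, sizeof vc);
        v4si va = vb + vc;                   // one add covers k elements
        std::memcpy(a + i, &va, sizeof va);  // vector store
    }
    for (; i < size; ++i) a[i] = b[i] + c[i];  // scalar remainder
}
```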

10 Outline – DASX: Data Structure Accelerator

11 Our Approach – DASX
Figure: OOO core with its cache, the shared last level cache, and DASX (collector plus processing elements, PEs).
– Collector: data-structure-specific fetch engine
– PEs: lightweight pipelines; all instructions have fixed latency

12 DASX – Sample Programmer's API
Original loop: void simple() { for (int i = 0; i < size; ++i) { a[i] = b[i] + c[i]; } }
BEGIN SIMPLE
Initialize collectors:
coll_a = new coll(ST, &a, INT, size, 0, VEC);
coll_b = new coll(LD, &b, INT, size, 0, VEC);
coll_c = new coll(LD, &c, INT, size, 0, VEC);
Kernel: auto kfn = [](auto i, auto j) { return i + j; };
Run in lock-step: group::add(coll_a, coll_b, coll_c);
Start processing: start(kfn, size);
END SIMPLE
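
To pin down what the API expresses, here is a hypothetical software model of its semantics. The names echo the slide (LD/ST, `add`, `start`), but the types, signatures and behavior below are our assumptions, not the DASX interface; the element type (INT) and structure kind (VEC) are fixed to int vectors.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

enum class Mode { LD, ST };

// Hypothetical collector descriptor for an int vector.
struct Coll {
    Mode mode;
    int* base;           // first element of the vector
    std::size_t size;    // number of elements
    std::size_t offset;  // starting element
};

struct Group {
    std::vector<Coll*> colls;  // collectors advanced in lock-step

    template <typename... Cs>
    void add(Cs... cs) { (colls.push_back(cs), ...); }

    // Runs kfn once per iteration: reads element `it` from each of two LD
    // collectors and writes the result through the ST collector. Returns
    // false unless the group is exactly two LD plus one ST collector and
    // every collector has `iters` elements available.
    bool start(const std::function<int(int, int)>& kfn, std::size_t iters) {
        std::vector<Coll*> loads;
        Coll* store = nullptr;
        std::size_t n_st = 0;
        for (Coll* c : colls) {
            if (c->offset + iters > c->size) return false;
            if (c->mode == Mode::ST) { store = c; ++n_st; }
            else loads.push_back(c);
        }
        if (n_st != 1 || loads.size() != 2) return false;
        for (std::size_t it = 0; it < iters; ++it) {
            int x = loads[0]->base[loads[0]->offset + it];
            int y = loads[1]->base[loads[1]->offset + it];
            store->base[store->offset + it] = kfn(x, y);
        }
        return true;
    }
};
```

Usage mirrors the slide: build three `Coll` descriptors, call `g.add(&ca, &cb, &cc)`, then `g.start([](auto i, auto j) { return i + j; }, size)`. In DASX the per-iteration fetch is done by the collector in hardware rather than by this loop.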

13 DASX – Data Structure Accelerator
Figure: collector between cache/memory and PEs, with STOP/GO signals between them.
1 Translate key, fetch elements
2 Allocate
3 Lock iteration data
4 Fill local storage
5 Compute (SPMD)
6 Write back dirty data
7 Unlock iteration data

14 DASX – Data Structure Accelerator (decoupled view)
Decoupled access (steps 1–3): 1 Translate key, fetch elements; 2 Allocate; 3 Lock iteration data.
4 Fill local storage.
Execute (steps 5–7): 5 Compute (SPMD); 6 Write back dirty data; 7 Unlock iteration data.
Figure: collector with cache/memory on one side and PEs on the other; STOP signal between them.
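
The decoupled access/execute split can be mimicked in software. This is an analogy under stated assumptions, not how the DASX hardware is built: a hypothetical `dae_simple` runs an "access" thread, standing in for the collector, that fetches each iteration's operands into a bounded queue, while an "execute" thread, standing in for the PEs, consumes them, computes and writes back. The queue depth acts like STOP/GO flow control: the access side stalls when the queue is full.

```cpp
#include <algorithm>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <vector>

struct Operands { std::size_t iter; int b, c; };

void dae_simple(std::vector<int>& a, const std::vector<int>& b,
                const std::vector<int>& c, std::size_t depth = 8) {
    if (depth == 0) depth = 1;  // a zero-depth queue would deadlock
    const std::size_t n = std::min({a.size(), b.size(), c.size()});
    std::deque<Operands> q;
    std::mutex m;
    std::condition_variable not_full, not_empty;

    std::thread access([&] {  // fetch runs ahead of compute
        for (std::size_t i = 0; i < n; ++i) {
            std::unique_lock<std::mutex> lk(m);
            not_full.wait(lk, [&] { return q.size() < depth; });  // STOP
            q.push_back({i, b[i], c[i]});
            not_empty.notify_one();
        }
    });

    std::thread execute([&] {  // consume operands, compute, write back
        for (std::size_t done = 0; done < n; ++done) {
            std::unique_lock<std::mutex> lk(m);
            not_empty.wait(lk, [&] { return !q.empty(); });
            Operands op = q.front();
            q.pop_front();
            not_full.notify_one();  // GO: room for the access side again
            lk.unlock();
            a[op.iter] = op.b + op.c;
        }
    });

    access.join();
    execute.join();
}
```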

15 Challenges Recap
– Challenge 1: Reduce instruction overhead
– Challenge 2: Increase memory level parallelism
– Challenge 3: Better cache management

16 DASX – Processing Elements
Figure: instruction memory (1KB); lanes 1 to 8, each with a register file labeled REG (32).
Features:
– 3-stage pipeline
– Single Program Multiple Data (SPMD)
– Each PE executes one iteration
– No address generation: data is referenced using "keys"
"Reduce instruction overhead" by using the SPMD model and removing address generation.

17 DASX – Key Interface
– Vector keys: LD key = LD iter × size + offset
– Hash table keys: LD KEY
– BTree keys: figure mapping keys to data
Remove address generation overhead.
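
A minimal sketch of the vector-key idea: the PE names an element only by its key, and the collector turns the key into an address. Reading `size` as the number of elements each iteration advances (for example, a row length) is our assumption; `vector_key` and `translate_vector_key` are hypothetical helpers.

```cpp
#include <cstddef>

// Vector key as written on the slide: key = iter * size + offset.
constexpr std::size_t vector_key(std::size_t iter, std::size_t size,
                                 std::size_t offset) {
    return iter * size + offset;
}

// Hypothetical collector-side translation: the PE supplies only the key,
// so the address arithmetic happens outside the PE.
template <typename T>
T* translate_vector_key(T* base, std::size_t key) {
    return base + key;
}
```

For a row-major matrix with 4 columns, iteration 2 reading column 1 uses key 2 × 4 + 1 = 9.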

18 DASX – Collector
Data structure fetch engine: specialized traversal, user-defined elements.
Data structure | Collector | HW OP
Vector | Address / stride calc. | ADD, CMP
Hash table | Index calc. + bucket traversal | INT ALU
BTree | Traversal | CMOV, ADD, CMP
Tasks: 1) Prefetch 2) Manage cache space
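
To illustrate the hash-table row (index calculation plus bucket traversal), the sketch below uses a hypothetical chained hash table; `fetch` is the kind of work a collector would perform so that a PE receives the element directly. The node layout and the use of `std::hash` are assumptions.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical chained hash table; nodes are owned by the caller.
struct Node { long key; int value; Node* next; };

struct HashTable {
    std::vector<Node*> buckets;
    explicit HashTable(std::size_t n) : buckets(n == 0 ? 1 : n, nullptr) {}

    std::size_t index(long key) const {  // index calculation
        return std::hash<long>{}(key) % buckets.size();
    }
    void insert(Node* n) {  // push onto the bucket's chain
        std::size_t i = index(n->key);
        n->next = buckets[i];
        buckets[i] = n;
    }
    // Collector-style fetch: key -> element (nullptr if absent).
    Node* fetch(long key) const {
        for (Node* n = buckets[index(key)]; n != nullptr; n = n->next) {
            if (n->key == key) return n;  // bucket traversal
        }
        return nullptr;
    }
};
```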

19 Collector Task 1: Prefetch
Figure: collector between cache/memory; 1 Translate keys, fetch elements; 2 Allocate.
– Runs asynchronously with compute
– Reduces address generation cost
– Granularity of access: data structure element
– Enhanced memory level parallelism

20 Collector Task 2: Manage Cache Space
Steps: 3 Lock iteration data; 4 Fill local storage; 6 Write back dirty data; 7 Unlock iteration data.
– Manages cache fill and replacement
– Bulk fill of the OBJ-Store before each iteration
– Per-element refill from cache to OBJ-Store
Figure: cache, collector, PEs, OBJ-Store.
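
A hypothetical model of "lock iteration data": an LRU cache whose replacement policy skips locked lines, so data that an in-flight iteration needs cannot be evicted by later fills. `LockingLRU` and its methods are illustrative names, not the DASX design.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <unordered_set>

class LockingLRU {
public:
    explicit LockingLRU(std::size_t cap) : cap_(cap) {}

    // Brings `line` in (or refreshes it). Returns false if the cache is full
    // and every resident line is locked, so no victim can be chosen.
    bool fill(long line) {
        for (auto it = lines_.begin(); it != lines_.end(); ++it) {
            if (*it == line) {  // hit: move to MRU position
                lines_.splice(lines_.begin(), lines_, it);
                return true;
            }
        }
        if (lines_.size() == cap_) {
            auto victim = lines_.end();
            // Scan from the LRU end for the first unlocked line.
            for (auto r = lines_.rbegin(); r != lines_.rend(); ++r) {
                if (!locked_.count(*r)) { victim = std::next(r).base(); break; }
            }
            if (victim == lines_.end()) return false;
            lines_.erase(victim);
        }
        lines_.push_front(line);
        return true;
    }
    void lock(long line) { locked_.insert(line); }
    void unlock(long line) { locked_.erase(line); }
    bool resident(long line) const {
        for (long l : lines_) if (l == line) return true;
        return false;
    }

private:
    std::size_t cap_;
    std::list<long> lines_;  // front = most recently used
    std::unordered_set<long> locked_;
};
```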

21 Outline – Benchmarks and Evaluation

22 Benchmarks
Recommender, Text Search, Hash Table, OLAP Cubing, BTree, Black-Scholes

23 Evaluation – Setup
Configurations compared: DASX vs. MT vs. OOO.
– DASX: figure labels "8" and "1KB"
– MT (8 threads): IO core, 32 KB L1
– OOO: OOO core, 64 KB L1
– Memory system: LLC 4MB, 16-way, NUCA; DRAM DDR2-400, 16GB, 4 channels

24 Evaluation – Performance Breakdown
Figure: execution time normalized to the OOO core (lower is better; axis 0.00 to 1.25) for D. Cube (memory bound) and Black. (compute bound). Bars: 1 in-order core at LLC; + collector (data structure engine); – address gen.; + local store; × 8 MT.

25 Evaluation – Performance
Figure (legend includes MT (8)).

26 Evaluation – Energy vs Performance
Figure: energy vs. execution cycles for Data-Cubing; configurations MT-32, MT-16, MT-8, DASX-4, DASX-8 and OOO, with a "Best" marker.

27 Summary
– Highlighted the challenges of data-centric workloads
– Demonstrated the effectiveness of using data-structure-specific information
– A data-structure-aware hardware accelerator achieves a 4.4X performance improvement

28 Q & A

29 Backup
1. Percentage of data structure instructions – 30
2. Why collector groups? – 31
3. Energy breakdown – 32
4. OBJ-Store details – 33
5. Address translation for keys – 34

30 Percentage of data structure instructions

31 Why collector groups?

32 Evaluation – Energy Reduction
Figure labels: Streaming, Cache Thrashing

33 DASX – OBJ-Store
– Reduces energy: filters accesses to the LLC
– Organization: decoupled sector cache (1KB)
– Minimizes tag overhead for vectors
– Adapts to spatial locality (e.g., struct fields)
Figure: entry fields KEY, V/I, LLC* tag and data; LD/ST from the PEs; write backs.

34 DASX – Address Translation for Keys
– Reduces energy overhead
– Keys are coalesced by the collector into cache lines
– Only one translation per line, instead of one per access
– No reverse translation is needed, thanks to the back pointer (see OBJ-Store)
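
A sketch of the saving from coalescing: with a hypothetical 64-byte line, 16 consecutive 4-byte keys need one translation instead of 16. `translations_coalesced` is an illustrative helper that counts distinct lines, not the DASX translation logic.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Number of translations if the collector coalesces accesses by cache line
// (one per distinct line). Without coalescing it would be addrs.size().
std::size_t translations_coalesced(const std::vector<std::uintptr_t>& addrs,
                                   std::size_t line_bytes = 64) {
    if (line_bytes == 0) return addrs.size();  // degenerate: no coalescing
    std::unordered_set<std::uintptr_t> lines;
    for (std::uintptr_t a : addrs) lines.insert(a / line_bytes);
    return lines.size();
}
```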

