
1 Hardware Transactional Memory for GPU Architectures* Wilson W. L. Fung, Inderpeet Singh, Andrew Brownsword, Tor M. Aamodt, University of British Columbia. *In Proc. 2011 ACM/IEEE Int'l Symp. on Microarchitecture (MICRO-44)

2 Talk Outline
 What we mean by "GPU" in this work
 Data synchronization on GPUs
 What is Transactional Memory (TM)?
 TM is compatible with OpenCL... but is TM compatible with GPU hardware?
 KILO TM: a hardware TM for GPUs
 Results

3 What is a GPU (in this work)?
 A GPU is an NVIDIA/AMD-like compute accelerator: SIMD hardware plus an aggressive memory subsystem => high compute throughput and efficiency
 Programmed through non-graphics APIs: OpenCL, DirectCompute, CUDA
 Programming model: a hierarchy of scalar threads; a kernel launches a grid of work groups / thread blocks, each made of wavefronts / warps of scalar threads, with shared (local) memory and barriers inside a block and global memory across blocks (see the sketch below)
 Today: limited communication & synchronization
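
To make the hierarchy concrete, a minimal CUDA sketch (illustrative, not code from the talk):

    // Each scalar thread computes its own element; the hardware groups
    // threads of a block into 32-thread warps that execute in lockstep.
    __global__ void scale(float *x) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;  // scalar thread id
        x[tid] *= 2.0f;
    }
    // launch: scale<<<numBlocks, threadsPerBlock>>>(x);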

4 Baseline GPU Architecture
[Diagram] Several SIMT cores, each with a SIMT front end (fetch, decode, schedule, branch) driving a SIMD datapath, plus shared memory, a non-coherent L1 data cache, and texture and constant caches. The cores connect through an interconnection network to memory partitions, each containing a last-level cache bank, an atomic operation unit, and an off-chip DRAM channel.

5 Stack-Based SIMD Reconvergence ("SIMT") (Levinthal SIGGRAPH'84, Fung MICRO'07)
[Figure: a warp of four threads runs a branch. All threads execute A and B (active mask 1111), the warp diverges into C (mask 1001) and D (mask 0110), reconverges at E (mask 1111), and finishes at G. A per-warp stack of (Reconvergence PC, Next PC, Active Mask) entries drives this: divergent paths are pushed with their masks, the top-of-stack (TOS) entry selects which path executes, and when the Next PC reaches the Reconvergence PC the entry is popped, restoring the full mask.]

6 Data Synchronization on GPUs
Motivation: solve a wider range of problems on the GPU; data races demand data synchronization. The current solution is atomic read-modify-write operations (32-bit/64-bit). Is that the best solution, or is there a case for Transactional Memory?
E.g., N-Body with 5M bodies (traditional synchronization, not TM):
 CUDA SDK: O(n²), 1640 s (barrier)
 Barnes-Hut: O(n log n), 5.2 s (atomics, harder to get right)
The case for TM: make efficient algorithms easier to write and debug, with practical efficiency. We want the efficiency of the GPU with reasonable (not superhuman) programmer effort and time.
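
For concreteness, here is the kind of atomic read-modify-write synchronization the slide refers to: a minimal CUDA histogram sketch (the kernel and its names are illustrative, not code from the talk):

    __global__ void histogram(const int *data, int *bins, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)
            atomicAdd(&bins[data[i]], 1);  // hardware RMW: only updates that
                                           // hit the same bin serialize
    }

This style works well for single-word updates; anything that must update several words atomically is where locks, and the problems on the next slides, come in.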

7 Data Synchronization on GPUs
Writing deadlock-free code with fine-grained locks and 10,000+ hardware-scheduled threads is hard: the number of possible global lock states grows with (# locks) x (# sharing threads), and which of those states are deadlocks?!
Other general problems with lock-based synchronization:
 The relationship between locks and the objects they protect is implicit
 Lock-based code is not composable

8 Data Synchronization Problems Specific to GPUs
Interaction between locks and SIMT control flow can cause deadlocks. A conventional spin lock hangs under lockstep execution:

    A: while(atomicCAS(lock,0,1)==1); // losing lanes spin at A; the winning
    B: // Critical Section ...        // lane waits for them at reconvergence,
    C: lock = 0;                      // so it never reaches C to release

A SIMT-safe variant keeps every lane inside one loop, so the winning lane can finish the critical section and release the lock before the warp reconverges:

    A: done = 0;
    B: while(!done){
    C:   if(atomicCAS(lock,0,1)==0){  // acquired: old value was 0
    D:     // Critical Section ...
    E:     lock = 0;
    F:     done = 1;
    G:   }
    H: }

9 Transactional Memory
The program specifies atomic code blocks called transactions [Herlihy'93].

Lock version (potential deadlock if threads acquire the same locks in different orders!):

    Lock(X[a]); Lock(X[b]); Lock(X[c]);
    X[c] = X[a]+X[b];
    Unlock(X[c]); Unlock(X[b]); Unlock(X[a]);

TM version:

    atomic { X[c] = X[a]+X[b]; }

10 Transactional Memory
 Non-conflicting transactions may run in parallel: TX1 accessing A and B commits alongside TX2 accessing C and D.
 Conflicting transactions are automatically serialized: if TX1 and TX2 touch the same data, one commits and the other aborts and re-executes before committing.
 Programmers' view: execution is as if TX1 ran before TX2, or TX2 before TX1.

11 Transactional Memory
Each transaction has 3 phases:
 Execution: track all memory accesses (the read-set and write-set)
 Validation: detect conflicting accesses between transactions; resolve conflicts if needed (abort/stall)
 Commit: update global memory
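
A minimal sketch of the three phases for a single transaction computing X[c] = X[a] + X[b] (CUDA-flavored; the log structure and names are illustrative, and a real design must make validation plus commit appear atomic, which is the commit units' job in KILO TM):

    #define MAX_ENTRIES 8
    struct LogEntry { int *addr; int value; };

    __device__ bool run_tx(int *Xa, int *Xb, int *Xc) {
        LogEntry read_log[MAX_ENTRIES];  int n_reads  = 0;
        LogEntry write_log[MAX_ENTRIES]; int n_writes = 0;

        // Phase 1: Execution. Log every read, buffer every write.
        int a = *Xa; read_log[n_reads++] = {Xa, a};
        int b = *Xb; read_log[n_reads++] = {Xb, b};
        write_log[n_writes++] = {Xc, a + b};     // not yet visible to others

        // Phase 2: Validation. Re-read the read-set; any changed value means
        // a conflicting transaction committed in between.
        for (int i = 0; i < n_reads; i++)
            if (*read_log[i].addr != read_log[i].value)
                return false;                    // abort: caller re-executes

        // Phase 3: Commit. Publish the buffered write-set to global memory.
        for (int i = 0; i < n_writes; i++)
            *write_log[i].addr = write_log[i].value;
        return true;
    }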

12 Transactional Memory on OpenCL
A natural extension to the OpenCL programming model:
 A program can launch many more threads than the hardware can execute concurrently
 GPU-TM? Threads currently running transactions do not need to wait for future, not-yet-scheduled threads
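
A syntax sketch of what a transactional kernel could look like, using the talk's bank-account example; tx_begin/tx_commit are hypothetical intrinsics standing in for the atomic block, not a real CUDA or OpenCL API (placeholder bodies are included only so the sketch compiles):

    __device__ void tx_begin() {}                // placeholder: HW-managed
    __device__ bool tx_commit() { return true; } // placeholder: validate +
                                                 // commit, false means abort
    __global__ void transfer(int *accounts, const int *src, const int *dst,
                             const int *amount, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            do {
                tx_begin();
                accounts[src[i]] -= amount[i];   // update two arbitrary
                accounts[dst[i]] += amount[i];   // accounts atomically
            } while (!tx_commit());              // abort => re-execute
        }
    }

Because transactions only wait on other running transactions, threads beyond the hardware's concurrency limit can simply be scheduled later, exactly as in today's OpenCL/CUDA model.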

13 Are TM and GPUs Incompatible?
The problems with GPUs (from a TM perspective):
 1000s of concurrent threads
 Inter-thread spatial locality is common
 No cache coherence
 No private cache per thread (so where does speculative state get buffered?)
 Transaction aborts cause control flow divergence

14 Hardware TM for GPUs Challenge: Conflict Detection
Conventional HTMs piggyback on scalable coherence: a committing transaction's bus invalidation (say, of address C) is checked against every other transaction's private data cache or signature (e.g., among TX1 R(A),W(C); TX2 R(C),W(B); TX3 R(D); TX4 R(A), the invalidation of C conflicts with TX2's read-set). Neither mechanism fits a GPU: there is no coherence protocol, each scalar thread cannot have its own cache, and signatures do not scale, since a 1024-bit signature per thread is 3.8MB for 30k threads.

15 Hardware TM for GPUs Challenge: Transaction Rollback
A CPU core checkpoints its register file (10s of registers) at transaction entry and restores it on abort. A GPU core (SM) has 32k registers shared by its warps, about 2MB of register storage chip-wide: checkpointing that in hardware is impractical.

16 Hardware TM for GPUs Challenge: Access Granularity and Write Buffer
A CPU core can buffer transactional writes in its L1 data cache: a 32kB cache serves 1-2 threads. A GPU core (SM) runs 1024-1536 threads against a single L1. Fermi's 48kB L1 data cache = 384 lines of 128B each, so 384 lines / 1536 threads < 1 line per thread! The L1 cannot hold each thread's write-set; write buffering must live elsewhere.

17 Hardware TM on GPUs Challenge: SIMT Hardware
On GPUs, scalar threads in a warp/wavefront execute in lockstep:

    TxBegin
    LD  r2,[B]
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit

Within one warp of 8 scalar threads, some transactions may commit while others abort. How do the aborted threads re-execute the transaction (reconvergence)?

18 Goal
We take it as a given that most programmers trying lock-based programming on a GPU will give up before they get their application working. Hence, our goal was to find the most efficient approach to implementing TM on a GPU.

19 KILO TM
 Supports 1000s of concurrent transactions
 Transaction-aware SIMT stack
 No dependence on a cache coherence protocol
 Word-level conflict detection
 Captures 59% of fine-grained lock performance; 128X faster than serialized transaction execution

20 KILO TM: Design Highlights
 Value-based conflict detection: self-validation + abort gives simple communication and no dependence on cache coherence
 Speculative validation: increases commit parallelism

21 High-Level GPU Architecture + KILO TM Implementation Overview

22 KILO TM: SIMT Core Changes
 SW register checkpoint. Observation: most overwritten registers are not needed after the transaction, and compiler analysis can identify exactly what to checkpoint.
 Transaction abort behaves like a do-while loop: the SIMT stack is extended with special entries that track the aborted transactions in each warp.

    TxBegin
    LD  r2,[B]     ; r2 overwritten here
    ADD r2,r2,2
    ST  r2,[A]
    TxCommit       ; on abort, jump back to TxBegin
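
In source form, the transformation the slide describes looks roughly like this (a sketch, with tx_commit the same hypothetical intrinsic as before; only registers overwritten inside the transaction and still live afterwards need a software checkpoint, and here r2's old value is dead, so nothing is saved):

    __device__ bool tx_commit() { return true; } // placeholder: validate+commit

    __device__ void tx_example(int *A, int *B) {
        do {                        // TxBegin: implicit loop head for aborts
            int r2 = *B;            // LD  r2,[B]   (logged in read-set)
            r2 = r2 + 2;            // ADD r2,r2,2
            *A = r2;                // ST  r2,[A]   (buffered in write-set)
        } while (!tx_commit());     // TxCommit: abort => re-execute the body
    }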

23 Transaction-Aware SIMT Stack

    A: t = tid.x;
       if (...) {
    B:   tx_begin;
    C:   x[t%10] = y[t] + 1;
    D:   if (s[t])
    E:     y[t] = 0;
    F:   tx_commit;
    G:   z = y[t];
       }
    H: w = y[t+1];

[Figure: SIMT stack snapshots for one warp, with entries of (RPC, PC, Type, Active Mask). On tx_begin the active mask is copied into a special transaction-restart entry, the implicit loop target for aborts. Branch divergence inside the transaction (D/E) pushes ordinary divergence entries on top. At tx_commit, threads that fail validation (threads 6 & 7 in the example) have their mask and PC copied into the restart entry, and the transaction re-executes for just those threads. Once all threads' transactions have committed, the restart entry is popped and execution continues at G.]

24 KILO TM: Value-Based Conflict Detection
Example: global memory starts with A=1, B=0. TX1 runs atomic{B=A+1} and TX2 runs atomic{A=B+2}; each keeps a read-log and a write-log in its own private memory:

    TX1: TxBegin           TX2: TxBegin
         LD  r1,[A]             LD  r2,[B]
         ADD r1,r1,1            ADD r2,r2,2
         ST  r1,[B]             ST  r2,[A]
         TxCommit               TxCommit

TX1's read-log holds A=1 and its write-log B=2; TX2's read-log holds B=0 and its write-log A=2. Self-validation + abort: at commit, each transaction re-reads its read-log addresses and compares values. This detects only the existence of a conflict (not the identity of the conflicting transaction), so no transaction-to-transaction messages are needed: simple communication.
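
The self-validation step itself is a simple value comparison; a sketch (LogEntry as in the earlier three-phase sketch):

    struct LogEntry { int *addr; int value; };

    // Re-read every address in the read-log and compare against the value
    // observed during execution. A mismatch proves SOME conflicting commit
    // happened, without identifying which transaction caused it.
    __device__ bool self_validate(const LogEntry *read_log, int n_reads) {
        for (int i = 0; i < n_reads; i++)
            if (*read_log[i].addr != read_log[i].value)
                return false;   // conflict exists: abort, no Tx-to-Tx message
        return true;            // read-set intact: safe to publish write-log
    }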

25 Parallel Validation?
Init: A=1, B=0 (TX1: B=A+1; TX2: A=B+2). The two serializable outcomes are TX1 then TX2 (B=2, A=4) or TX2 then TX1 (A=2, B=3). But if both transactions validate in parallel, each sees its read value unchanged, both commit, and memory ends up A=2, B=2, matching neither order. A data race!?!

26 Serialize Validation?
Let one transaction at a time validate and commit (V+C) at the commit unit while the others stall.
 Benefit #1: no data race
 Benefit #2: no livelock (a generic problem in lazy TMs)
 Drawback: serializes non-conflicting transactions ("collateral damage")

27 Identifying Non-conflicting Tx, Step 1: Leverage Parallelism
Global memory is divided into partitions, and each partition has its own commit unit. Transactions that access different memory partitions (e.g., TX1 committing at one partition while TX2 and TX3 commit at others) validate and commit in parallel; only transactions meeting at the same commit unit contend.

28 Solution: Speculative Validation
Key idea: split validation into two parts:
 Part 1: check against recently committed transactions
 Part 2: check against concurrently committing transactions

29 KILO TM: Speculative Validation
The memory subsystem is deeply pipelined and highly parallel, so each commit unit is pipelined too. A committing transaction passes through: hazard detection, log transfer, speculative validation, validation wait, finalize outcome, and commit. Example: TX1 (R(C),W(D)), TX2 (R(A),W(B)), and TX3 (R(D),W(E)) arrive in a memory partition's validation queue together.

30 KILO TM: Speculative Validation
The commit unit tracks a last-writer history: for each recently written address, the commit ID (CID) of the last transaction to write it, kept in a small (Addr, CID) lookup table whose evicted entries fall into a recency bloom filter. In the example, TX1's lookup of C returns nil, so TX1 validates C speculatively and records its pending W(D); TX2's lookup of A likewise misses, so TX2 proceeds (recording W(B), as TX3 records W(E)). TX3's lookup of D, however, hits TX1's pending W(D), so TX3 must STALL in validation wait until TX1's outcome is known.
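
A much-simplified software model of this bookkeeping (plain C++; the real unit is a fixed-size lookup table backed by a recency bloom filter, for which an unordered_map stands in here, and all names are illustrative):

    #include <cstdint>
    #include <unordered_map>

    struct CommitUnit {
        std::unordered_map<uintptr_t, int> last_writer; // word addr -> CID
        int oldest_active_cid = 0;  // CIDs >= this are still committing

        // Hazard detection for a validating read: if the last writer of
        // 'addr' is still committing, this transaction must wait for that
        // writer's outcome; otherwise it may validate speculatively.
        bool must_stall(uintptr_t addr) const {
            auto it = last_writer.find(addr);
            return it != last_writer.end() && it->second >= oldest_active_cid;
        }

        // Record a pending transactional write during validation.
        void record_write(uintptr_t addr, int cid) { last_writer[addr] = cid; }
    };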

31 Log Storage
Transaction logs are stored in the private memory of each thread, which is located in DRAM and cached in the L1 and L2 caches. Private memory is interleaved so that entry k of each thread's log (T0-T3 in the figure) sits at consecutive physical addresses. A transactional LD appends an (address, value) pair at the read-log pointer; a transactional ST appends at the write-log pointer. Because the lanes of a wavefront append in lockstep, the log accesses coalesce.
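
A sketch of what a logged transactional read looks like from one thread, assuming per-thread arrays in local (private) memory; this is illustrative only, since a compiler may promote small local arrays to registers, whereas KILO TM's logs are genuinely memory-backed:

    #define LOG_LEN 16   // illustrative fixed log size

    __global__ void tx_read_sketch(const int *y, int *out) {
        int read_addr[LOG_LEN];   // read-log: word indices
        int read_val[LOG_LEN];    // read-log: observed values
        int n_reads = 0;

        int t = blockIdx.x * blockDim.x + threadIdx.x;
        int v = y[t];             // transactional load...
        read_addr[n_reads] = t;   // ...and its log entry; every lane of the
        read_val[n_reads]  = v;   // warp appends in lockstep, and the
        n_reads++;                // interleaved layout makes this coalesce
        out[t] = v + 1;           // stand-in for the rest of the transaction
    }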

32 Log Transfer
When a transaction's logs are shipped for validation, entries heading to the same memory partition are grouped into a larger packet (e.g., a read-log spanning addresses A, B, C, and D splits into one packet per owning partition, each sent to that partition's commit unit), amortizing transfer overhead. A sketch of the grouping follows below.
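
A host-C++ model of the grouping step (the partition hash and all names are illustrative assumptions, not KILO TM's actual mapping):

    #include <cstdint>
    #include <vector>

    struct Entry { uintptr_t addr; int value; };
    constexpr int kPartitions = 6;

    // Which memory partition owns this word (simple address hash).
    int partition_of(uintptr_t addr) {
        return static_cast<int>((addr >> 7) % kPartitions);
    }

    // Group a transaction's log entries into one packet per partition.
    std::vector<std::vector<Entry>> pack_logs(const std::vector<Entry> &log) {
        std::vector<std::vector<Entry>> packets(kPartitions);
        for (const Entry &e : log)
            packets[partition_of(e.addr)].push_back(e);
        return packets;   // packets[p] goes to commit unit p
    }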

33 Distributed Commit / HW Org.

34 ABA Problem?
Classic example: a linked-list-based stack. Thread 0 pops:

    while (true) {
        t = top;
        next = t->next;
        // meanwhile, thread 2: pop A, pop B, push A
        if (atomicCAS(&top, t, next) == t)
            break;  // succeeds!
    }

Starting from top -> A -> B -> C, thread 0 reads t = A and next = B. Thread 2 then pops A, pops B, and pushes A back, leaving top -> A -> C. Thread 0's CAS still sees top == A, so it succeeds and installs the stale next, leaving top -> B, a node that is no longer in the stack.

35 ABA Problem?
atomicCAS protects only a single word, i.e., only part of the data structure. Value-based conflict detection protects all relevant parts of the data structure: a transaction that read both top and t->next aborts if either value changed.
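
For contrast with the CAS loop above, a sketch of pop() as a transaction (tx_begin/tx_commit are the same hypothetical intrinsics as earlier, with placeholder bodies so the sketch compiles):

    struct Node { Node *next; };

    __device__ void tx_begin() {}                // placeholder: HW-managed
    __device__ bool tx_commit() { return true; } // placeholder: HW-managed

    // The read-set holds BOTH top and t->next, so the "pop A, pop B, push A"
    // interleaving changes a logged value (A->next goes from B to C) and
    // value-based validation aborts, where the single-word CAS succeeded.
    __device__ Node *pop_tm(Node **top) {
        Node *t;
        do {
            tx_begin();
            t = *top;                // read-set: top
            if (t) *top = t->next;   // read-set: t->next; write-set: top
        } while (!tx_commit());      // any changed value => abort + retry
        return t;
    }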

36 Evaluation Methodology
GPGPU-Sim 3.0 (BSD license), extended with a timing-driven KILO TM model; the detailed simulator has an IPC correlation of 0.93 vs. GT200 hardware.
GPU TM applications:
 Hash table (HT-H, HT-L)
 Bank account (ATM)
 Cloth physics (CL)
 Barnes-Hut (BH)
 CudaCuts (CC)
 Data mining (AP)

37 GPGPU-Sim 3.0.x running SASS (via decuda) achieves 0.976 IPC correlation on the subset of the CUDA SDK that decuda disassembles correctly. The rest of the data uses PTX instead of SASS (0.93 correlation). We believe GPGPU-Sim is a reasonable proxy for real hardware.

38 Performance (vs. Serializing Tx)

39 Absolute Performance (IPC)
TM on the GPU performs well for applications with low contention. It performs poorly under memory divergence, low parallelism, or high conflict rates; these could be tackled through algorithm design and tuning. CPU vs. GPU?
 CC: the fine-grained-lock GPU version is 400X faster than its CPU version
 BH: the fine-grained-lock GPU version is 2.5X faster than its CPU version

40 Performance (Exec. Time)
KILO TM captures 59% of fine-grained lock performance and is 128X faster than serialized transaction execution.

41 KILO TM Scaling

42 Abort Commit Ratio
Increasing the number of concurrent transactions increases the probability of conflict. Two possible solutions (future work):
 Solution 1: application performance tuning (easier with TM than with fine-grained locks)
 Solution 2: transaction scheduling

43 Thread Cycle Breakdown
Status of a thread at each cycle. Categories:
 TC: in a warp stalled by concurrency control
 TO: in a warp committing its transactions
 TW: has passed commit, waiting for other threads in the warp to pass
 TA: executing an eventually-aborted transaction
 TU: executing an eventually-committed transaction (useful work)
 AT: acquiring a lock or doing an atomic operation
 BA: waiting at a barrier
 NL: doing non-transactional (normal) work

44 Thread Cycle Breakdown
[Chart: per-benchmark thread cycle breakdown for HT-H, HT-L, ATM, CL, BH, CC, and AP]

45 Core Cycle Breakdown
Action performed by a core at each cycle. Categories:
 EXEC: issuing a warp for execution
 STALL: stalled by a downstream warp
 SCRB: all warps blocked by the scoreboard, due to data hazards, concurrency control, pending commits (or any combination thereof)
 IDLE: none of the warps are ready in the instruction buffer

46 Core Cycle Breakdown Wilson Fung, Inderpeet Singh, Andrew Brownsword, Tor Aamodt Hardware TM for GPU Architectures 46

47 Read-Write Buffer Usage

48 # In-Flight Buffers

49 Implementation Complexity
 Logs in private memory, cached in the L1 data cache
 Commit unit storage: 5kB last-writer-history unit, 19kB transaction status, 32kB read-set and write-set buffer
 CACTI 5.3 @ 40nm: 0.40mm² x 6 memory partitions = 0.5% of a 520mm² die

50 Summary
KILO TM:
 1000s of concurrent transactions
 Value-based conflict detection
 Speculative validation for commit parallelism
 59% of fine-grained locking performance
 0.5% area overhead

51 Backup Slides

52 Logical Stage Organization

53 Execution Time Breakdown

