1 GPUDet: A Deterministic GPU Architecture. Hadi Jooybar (1), Wilson Fung (1), Mike O’Connor (2), Joseph Devietti (3), Tor M. Aamodt (1). (1) The University of British Columbia, (2) AMD Research, (3) University of Washington.

2 GPUs are fast, energy efficient, commodity hardware running 1000s of threads. But they are mostly used for a limited range of applications. Why? Communication among concurrent threads.

3 Motivation: BFS algorithm published in HiPC 2007.
    __global__ void BFS_step_kernel(...) {
      if (active[tid]) {
        active[tid] = false;
        visited[tid] = true;
        foreach (int id = neighbour_nodes) {
          if (visited[id] == false) {
            cost[id] = cost[tid] + 1;
            active[id] = true;
            *over = true;
          }
        }
      }
    }
[Figure: graph with nodes V0, V1, V2 in three states. First: Cost = -, Active = - on both labelled nodes. Second: Cost = 1, Active = 1 on both. Third: Cost = 1, Active = 1 on one and Cost = 2, Active = 1 on the other.]
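The order dependence in the kernel on slide 3 can be reproduced on the host. The sketch below is a hypothetical C++ model, not the authors' code: it runs the kernel body one thread at a time in a chosen order, on an assumed three-node graph (V0 to V1, V0 to V2, V1 to V2) picked to match the figure. The names `BfsState`, `run_step` and `cost_of_v2` are illustrative, and the `*over` flag is omitted.

```cpp
#include <cassert>
#include <vector>

// Hypothetical host-side state mirroring the kernel's arrays.
struct BfsState {
    std::vector<bool> active, visited;
    std::vector<int> cost;
};

// Executes the body of BFS_step_kernel for each thread id in `order`,
// one thread at a time. adj[t] lists the neighbours of node t.
void run_step(BfsState& s, const std::vector<std::vector<int>>& adj,
              const std::vector<int>& order) {
    for (int tid : order) {
        assert(tid >= 0 && tid < static_cast<int>(s.cost.size()));
        if (!s.active[tid]) continue;
        s.active[tid] = false;
        s.visited[tid] = true;
        for (int id : adj[tid]) {
            if (!s.visited[id]) {  // races with thread `id` setting visited[id]
                s.cost[id] = s.cost[tid] + 1;
                s.active[id] = true;
            }
        }
    }
}

// Cost of V2 after two steps, given the thread order used in the second step.
int cost_of_v2(const std::vector<int>& second_step_order) {
    std::vector<std::vector<int>> adj = {{1, 2}, {2}, {}};  // assumed graph
    BfsState s{{true, false, false}, {false, false, false}, {0, -1, -1}};
    run_step(s, adj, std::vector<int>{0});  // step 1: only V0 is active
    run_step(s, adj, second_step_order);    // step 2: V1 and V2 are active
    return s.cost[2];
}
```

Under these assumptions, running thread 1 before thread 2 in the second step leaves V2 with cost 2, while the opposite order leaves it at cost 1, which is the kind of run-to-run difference the slide illustrates.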

4 Motivation: What about debuggers?! The bug may appear occasionally or in different places in each run. [Cartoon captions: "I will debug it this time"; "OMG! Where was that bug?!"]

5 GPUDet: Strong Determinism (hardware proposal). Same outputs; same execution path. Makes the program easier to debug and test.

6 Motivation: BFS algorithm published in HiPC 2007.
    __global__ void BFS_step_kernel(...) {
      if (active[tid]) {
        active[tid] = false;
        visited[tid] = true;
        foreach (int id = neighbour_nodes) {
          if (visited[id] == false) {
            cost[id] = cost[tid] + 1;
            active[id] = true;
            *over = true;
          }
        }
      }
    }
[Figure: graph with nodes V0, V1, V2; one labelled node has Cost = 1, Active = 1 and the other has Cost = 2, Active = 1.]

7 GPUDet: Strong Determinism. Same outputs; same execution path. Makes the program easier to debug and test. But there is no free lunch: performance overhead. Our goal is to provide deterministic execution on GPU architectures with acceptable performance overhead.

8 GPU Architecture. [Diagram labels: CPU, kernel launch, workgroups (workgroup 0, workgroup 1, workgroup 2), compute unit (ALU, memory unit, L1 cache), L2 cache, DRAM.] Example kernel body: x = input[threadID]; y = func(x); output[threadID] = y;

9 Outline: Introduction; GPU Architecture; Challenges; Deterministic Execution with GPUDet; GPUDet Optimizations (Workgroup-Aware Quantum Formation, Deterministic parallel commit using Z-Buffer Unit, Compute Unit level serialization); Results and Conclusion.

10 Deterministic GPU Execution Challenges. Isolation mechanism; provide a method to pause execution of a thread. [Diagram: under normal execution, threads T0 to T3 run freely; under deterministic execution, they run in quanta (Quantum 0 ... Quantum n), each shown with an Isolation phase and a Communication phase.]
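The isolation and communication phases can be sketched with per-wavefront store buffers. This is a minimal host-side C++ model under stated assumptions, not the hardware design: `Memory`, `Wavefront`, `store`, `load` and `commit` are hypothetical names, unset words are assumed to read as 0, and the conflict policy (lowest wavefront ID wins) is an assumption chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <vector>

using Memory = std::map<int, int>;  // address -> value; hypothetical model

struct Wavefront {
    int id;
    Memory store_buffer;  // private stores made during the current quantum
};

// Isolation phase: a store is buffered and stays invisible to other wavefronts.
void store(Wavefront& w, int addr, int value) {
    assert(addr >= 0);
    w.store_buffer[addr] = value;
}

// Isolation phase: a load sees the wavefront's own stores first, then global
// memory, which is not written until the quantum ends. Unset words read 0.
int load(const Wavefront& w, const Memory& global, int addr) {
    auto own = w.store_buffer.find(addr);
    if (own != w.store_buffer.end()) return own->second;
    auto g = global.find(addr);
    return g == global.end() ? 0 : g->second;
}

// Communication phase: drain buffers in descending id order so that, on a
// conflict, the lowest wavefront id is written last and wins (assumed policy).
void commit(std::vector<Wavefront>& wavefronts, Memory& global) {
    std::sort(wavefronts.begin(), wavefronts.end(),
              [](const Wavefront& a, const Wavefront& b) { return a.id > b.id; });
    for (Wavefront& w : wavefronts) {
        for (const auto& entry : w.store_buffer) global[entry.first] = entry.second;
        w.store_buffer.clear();
    }
}
```

Because the commit order depends only on wavefront IDs, the contents of global memory after a quantum do not depend on how the wavefronts were interleaved during the isolation phase.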

11 Deterministic GPU Execution Challenges. Isolation mechanism: lack of private caches; lack of cache coherency. Provide a method to pause execution of a thread: Single Instruction Multiple Threads (SIMT); potential deadlock condition; major changes in control flow hardware; performance overhead. [Diagram labels: workgroup, wavefront.]

12 Deterministic GPU Execution Challenges. Very large number of threads: expensive global synchronization; expensive serialization. Different program properties: large number of short-running threads; frequent workgroup synchronization; less locality in intra-thread memory accesses.

13 Outline: Introduction; GPU Architecture; Challenges; Deterministic Execution with GPUDet; GPUDet Optimizations (Workgroup-Aware Quantum Formation, Deterministic parallel commit using Z-Buffer Unit, Compute Unit level serialization); Results and Conclusion.

14 Deterministic Execution of a Wavefront. Code: if (tid < 16) x[tid%2] = tid; Threads T0, T1, T2, ..., T15 issue the stores x[0] = 0, x[1] = 1, x[0] = 2, ..., x[1] = 15 to the coalescing unit (a data race). Coalescing unit output for address x: Mask = v v - - - - - - ... -; Data = 14 15 - - - - - - ... -. To memory: x[0] = 14 and x[1] = 15; the remaining words are not modified. Execution of one wavefront is deterministic.
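The outcome on slide 14 (x[0] = 14, x[1] = 15) is consistent with a rule where, for each address, the store from the highest lane wins. The C++ sketch below models only that rule; it is an assumption inferred from the slide's numbers, not a description of the actual coalescing hardware, and `LaneStore`, `coalesce` and `slide_example` are hypothetical names.

```cpp
#include <cassert>
#include <map>
#include <utility>
#include <vector>

struct LaneStore {
    int lane;   // thread position inside the wavefront
    int addr;   // word index into x
    int value;
};

// Merges the stores of one wavefront instruction. For each address the store
// from the highest lane is kept (assumed rule), so the result does not depend
// on the order in which the stores are presented.
std::map<int, int> coalesce(const std::vector<LaneStore>& stores) {
    std::map<int, std::pair<int, int>> winner;  // addr -> (lane, value)
    for (const LaneStore& s : stores) {
        assert(s.lane >= 0);
        auto it = winner.find(s.addr);
        if (it == winner.end() || s.lane > it->second.first) {
            winner[s.addr] = {s.lane, s.value};
        }
    }
    std::map<int, int> out;  // addr -> value sent to memory
    for (const auto& w : winner) out[w.first] = w.second.second;
    return out;
}

// The slide's example: if (tid < 16) x[tid % 2] = tid;
std::vector<LaneStore> slide_example() {
    std::vector<LaneStore> stores;
    for (int tid = 0; tid < 16; ++tid) stores.push_back({tid, tid % 2, tid});
    return stores;
}
```

Only two words are written, matching the two `v` entries in the slide's mask; every other word of `x` is absent from the result and so stays unmodified.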

15 Deterministic GPU Execution Challenges. Isolation mechanism; provide a method to pause execution of a thread. Annotation: wavefront granularity, not a challenge anymore. [Diagram: threads T0 to T3 shown with Isolation and Communication phases.]

16 GPUDet-Basic. [Diagram labels: wavefronts, global memory (read only), store buffers, local memory; operations: Load Op, Commit, Atomic Op.] Reaching a quantum boundary: 1. Instruction count; 2. Atomic operations; 3. Memory fences; 4. Workgroup barriers; 5. Execution complete.
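The five quantum-boundary conditions listed on slide 16 can be written as a small decision function. This is a hedged C++ sketch: the `Event` enum, the `WavefrontQuantum` counter and the `quantum_size` parameter are hypothetical, and the slide does not state the instruction limit, so it is left as a caller-supplied value.

```cpp
#include <cassert>

// What a wavefront is about to do; hypothetical classification.
enum class Event { Instruction, AtomicOp, MemoryFence, WorkgroupBarrier, Exit };

struct WavefrontQuantum {
    int instructions_in_quantum = 0;
};

// Returns true when the wavefront must end its current quantum, following
// the five conditions on the slide. quantum_size is an assumed parameter.
bool reaches_quantum_boundary(WavefrontQuantum& w, Event e, int quantum_size) {
    assert(quantum_size > 0);
    switch (e) {
        case Event::Instruction:
            // 1. Instruction count: stop once the per-quantum budget is used.
            return ++w.instructions_in_quantum >= quantum_size;
        case Event::AtomicOp:          // 2. Atomic operations
        case Event::MemoryFence:       // 3. Memory fences
        case Event::WorkgroupBarrier:  // 4. Workgroup barriers
        case Event::Exit:              // 5. Execution complete
            return true;
    }
    return false;
}
```

In this model every barrier ends the quantum, which is the GPUDet-Basic behaviour that the workgroup-aware quantum formation slides later relax.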

17 Outline: Introduction; GPU Architecture; Challenges; Deterministic Execution with GPUDet; GPUDet Optimizations (Workgroup-Aware Quantum Formation, Deterministic parallel commit using Z-Buffer Unit, Compute Unit level serialization); Results and Conclusion.

18 Workgroup-Aware Quantum Formation. Extra global synchronizations; load imbalance. Reducing the number of synchronizations; avoiding unnecessary quantum termination.

19 Workgroup-Aware Quantum Formation. Quanta are finished by workgroup barriers. Workgroup-aware decision making: all reach a workgroup barrier; continue execution in the parallel mode.

20 Workgroup-Aware Quantum Formation. Workgroup-aware decision making: finish execution of the kernel function. Deterministic workgroup partitioning.

21 Deterministic Parallel Commit using the Z-Buffer Unit. Store buffer contents ≈ color values; wavefront ID ≈ depth values. [Figure: Z-Buffer Unit depth buffer grid, initially all ∞, then updated with depth values 7, 8 and 5.]

22 Compute Unit Level Serialization. GPUs preserve point-to-point ordering; serialization is only among compute units.

23 Outline: Introduction; GPU Architecture; Challenges; Deterministic Execution with GPUDet; GPUDet Optimizations (Workgroup-Aware Quantum Formation, Deterministic parallel commit using Z-Buffer Unit, Compute Unit level serialization); Results and Conclusion.

24 Results (GPGPU-Sim 3.0.2). Chart annotations: 2x slowdown; applications with atomic operations.

25 Quantum Formation. 20% performance improvement for applications with barriers; 19% performance improvement for applications with small kernel functions.

26 Deterministic Parallel Commit using the Z-Buffer Unit. 60% performance improvement on average.

27 Compute Unit Level Serialization. 6.1x performance improvement in serial mode.

28 Conclusion. GPUDet encourages programmers to use GPUs in a broader range of applications. It exploits GPU characteristics to reduce performance overhead: deterministic execution within a wavefront; workgroup-aware quantum formation; deterministic parallel commit using the Z-Buffer Unit; compute unit level serialization. Questions?

29 Code: if (tid == 0) x = 0; else if (tid == 1) x = 1; This is racy code in the CPU multi-threaded programming model: a data race between different instructions. Under SIMT execution within a wavefront, it is handled by the SIMT stack. The execution order of instructions within a wavefront is deterministic.

30 Deterministic parallel commit using the Z-Buffer Unit. The Z-Buffer Unit manages the Z-buffer to ensure that each pixel on the screen displays the color of the foremost triangle covering that pixel. The Z-Buffer Unit allows out-of-order writes to produce a deterministic result. GPUDet uses the wavefront ID as the depth value for Z-Buffer operations.

31 Deterministic Parallel Commit Using Z-Buffer Unit. [Animation: the store buffers of wavefronts W0, W1 and W2 send the writes A:=6 with D(A):=0, A:=2 with D(A):=1, B:=7 with D(B):=1 and B:=2 with D(B):=2 through the interconnect to the Z-Buffer Unit in the memory partition (L2 cache, DRAM interface). The (Loc, Value, Depth) table starts as A: -, ∞ and B: -, ∞. After the depth comparisons it ends as A: 6, depth 0 and B: 7, depth 1.]
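The depth comparison on slides 30 and 31 can be modelled as a per-location depth test. The C++ sketch below is an illustrative host-side model, not the hardware: `Write`, `Entry`, `zbuffer_commit` and `commit_all` are hypothetical names, ∞ is represented by the largest `int`, and a write is assumed to win only when its depth is strictly smaller than the stored one.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>
#include <vector>

struct Write {
    std::string loc;  // memory location, e.g. "A"
    int value;
    int depth;        // wavefront ID used as the depth value
};

struct Entry {
    int value = 0;
    int depth = std::numeric_limits<int>::max();  // stands in for infinity
};

// Applies one write with a depth test: it lands only if its depth is
// smaller than the depth already stored for that location.
void zbuffer_commit(std::map<std::string, Entry>& table, const Write& w) {
    assert(w.depth >= 0);
    Entry& e = table[w.loc];  // new locations start at "infinite" depth
    if (w.depth < e.depth) {
        e.value = w.value;
        e.depth = w.depth;
    }
}

// Commits a batch of writes in whatever order they arrive.
std::map<std::string, Entry> commit_all(const std::vector<Write>& arrivals) {
    std::map<std::string, Entry> table;
    for (const Write& w : arrivals) zbuffer_commit(table, w);
    return table;
}
```

Feeding the slide's four writes in any arrival order yields A = 6 at depth 0 and B = 7 at depth 1, because the minimum depth per location does not depend on the order of comparison. Two writes to the same location with equal depth are not handled here; the model assumes each wavefront contributes at most one value per location.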

