GPUDet: A Deterministic GPU Architecture
Hadi Jooybar (1), Wilson Fung (1), Mike O’Connor (2), Joseph Devietti (3), Tor M. Aamodt (1)
(1) The University of British Columbia, (2) AMD Research, (3) University of Washington

Slide 2: GPUs are fast, energy efficient, and commodity hardware. But they are mostly used for a narrow range of applications. Why? Communication among concurrent threads is hard when there are 1000s of threads.

Slide 3: Motivation. BFS algorithm, published in HiPC 2007:

__global__ void BFS_step_kernel(...) {
    if( active[tid] ) {
        active[tid] = false;
        visited[tid] = true;
        foreach (int id = neighbour_nodes) {
            if( visited[id] == false ) {
                cost[id] = cost[tid] + 1;
                active[id] = true;
                *over = true;
            }
        }
    }
}

(Figure: a three-node graph V0, V1, V2; over successive steps the nodes' cost and active fields are updated, first to Cost = 1, then a neighbour to Cost = 2.)

Slide 4: Motivation. "I will debug it this time." What about debuggers? The bug may appear only occasionally, or in a different place on each run. "OMG! Where was that bug?!"

Slide 5: GPUDet provides strong determinism (a hardware proposal): the same outputs and the same execution path on every run. This makes programs easier to debug and test.

Slide 6: Motivation (continued). The BFS kernel from slide 3 contains a data race: two active nodes that share a neighbour can both see visited[id] == false and both execute cost[id] = cost[tid] + 1, so the neighbour's final cost depends on which write lands last. (Figure: the same graph ends one run with Cost = 1 and another with Cost = 2 on the same node.)

Slide 7: GPUDet provides strong determinism: same outputs, same execution path, making programs easier to debug and test. But there is no free lunch: determinism adds performance overhead. Our goal is to provide deterministic execution on GPU architectures with acceptable performance overhead.

Slide 8: GPU architecture. The CPU launches a kernel; threads are grouped into workgroups (workgroup 0, 1, 2, ...) that run on compute units, each with its own ALUs, memory unit, and L1 cache, sharing an L2 cache and DRAM. Example kernel body: x = input[threadID]; y = func(x); output[threadID] = y;

Slide 9: Outline: Introduction; GPU Architecture; Challenges; Deterministic Execution with GPUDet; GPUDet Optimizations (workgroup-aware quantum formation, deterministic parallel commit using the Z-Buffer unit, compute-unit-level serialization); Results and Conclusion.

Slide 10: Deterministic GPU execution challenges: an isolation mechanism, and a method to pause the execution of a thread. (Figure: normal execution of threads T0..T3 is replaced by a sequence of quanta; within each quantum the threads execute in isolation, then communicate at a deterministic boundary.)

Slide 11: Challenges (continued). An isolation mechanism is hard to build on GPUs: there are no per-thread private caches and no cache coherence. Pausing the execution of a single thread is also hard under Single Instruction Multiple Threads (SIMT): it creates potential deadlock conditions, requires major changes to the control-flow hardware, and adds performance overhead. (Figure: workgroup n composed of wavefronts.)

Slide 12: Challenges (continued). GPUs run a very large number of threads, which makes global synchronization and serialization expensive. GPU programs also have different properties from CPU programs: a large number of short-running threads, frequent workgroup synchronization, and less locality in intra-thread memory accesses.

Slide 13: Outline, revisited: Introduction; GPU Architecture; Challenges; Deterministic Execution with GPUDet; GPUDet Optimizations (workgroup-aware quantum formation, deterministic parallel commit using the Z-Buffer unit, compute-unit-level serialization); Results and Conclusion.

Slide 14: Deterministic execution of a wavefront. Example: if (tid < 16) x[tid%2] = tid; is a data race within the wavefront: even lanes write x[0] (T0 writes 0, T2 writes 2, ...) and odd lanes write x[1] (T1 writes 1, ..., T15 writes 15). The coalescing unit merges these stores into a single memory request under a fixed resolution policy, so x[0] = 14 and x[1] = 15 on every run. Execution of one wavefront is deterministic.

Slide 15: Challenges revisited. At wavefront granularity, the isolation mechanism and the need to pause execution are no longer a challenge: GPUDet applies the isolation and communication phases per wavefront rather than per thread.

Slide 16: Reaching a quantum boundary. During a quantum, global memory is read-only; each wavefront's stores go into its store buffer, and loads check the store buffer before global memory. In GPUDet-Basic, a wavefront ends its quantum when it: 1. reaches the instruction-count limit, 2. executes an atomic operation, 3. executes a memory fence, 4. reaches a workgroup barrier, or 5. completes execution.

Slide 17: Outline, revisited: Introduction; GPU Architecture; Challenges; Deterministic Execution with GPUDet; GPUDet Optimizations (workgroup-aware quantum formation, deterministic parallel commit using the Z-Buffer unit, compute-unit-level serialization); Results and Conclusion.

Slide 18: Workgroup-aware quantum formation. Naive quantum formation causes extra global synchronizations and load imbalance. The optimization reduces the number of synchronizations and avoids unnecessary quantum terminations.

Slide 19: Workgroup-aware decision making. When quanta would be finished by workgroup barriers: once all wavefronts of the workgroup reach the barrier, they continue execution in parallel mode instead of terminating the quantum.

Slide 20: Workgroup-aware quantum formation (continued). Workgroup-aware decision making also applies when a workgroup finishes executing the kernel function; GPUDet uses deterministic workgroup partitioning so the next workgroup can be dispatched deterministically.

Slide 21: Deterministic parallel commit using the Z-Buffer unit. Store-buffer contents play the role of color values, and the wavefront ID plays the role of the depth value. (Figure: a depth buffer initialized to ∞, with some entries filled in by wavefront ID 7.)

Slide 22: Compute-unit-level serialization. GPUs preserve point-to-point ordering in the memory system, so serialization is only needed among compute units, not among individual wavefronts.

Slide 23: Outline, revisited: Introduction; GPU Architecture; Challenges; Deterministic Execution with GPUDet; GPUDet Optimizations (workgroup-aware quantum formation, deterministic parallel commit using the Z-Buffer unit, compute-unit-level serialization); Results and Conclusion.

Slide 24: Results. Evaluated with GPGPU-Sim; GPUDet shows about a 2x slowdown. (Chart highlights applications with atomic operations.)

Slide 25: Workgroup-aware quantum formation: 20% performance improvement for applications with barriers and 19% for applications with small kernel functions.

Slide 26: Deterministic parallel commit using the Z-Buffer unit: 60% performance improvement on average.

Slide 27: Compute-unit-level serialization: 6.1x performance improvement in serial mode.

Slide 28: Conclusion. GPUDet encourages programmers to use GPUs for a broader range of applications, and it exploits GPU characteristics to reduce the overhead of determinism: deterministic execution within a wavefront, workgroup-aware quantum formation, deterministic parallel commit using the Z-Buffer unit, and compute-unit-level serialization. Questions?

Slide 29 (backup): if (tid == 0) x = 0; else if (tid == 1) x = 1; In the CPU multi-threaded programming model this is racey code: a data race between different instructions. Within a wavefront, SIMT execution is handled by the SIMT stack, so the execution order of instructions within a wavefront is deterministic.

Slide 30 (backup): Deterministic parallel commit using the Z-Buffer unit. The Z-Buffer unit manages the Z-buffer, ensuring that each pixel on the screen displays the color of the foremost triangle covering that pixel. The Z-Buffer unit therefore allows out-of-order writes to produce a deterministic result. GPUDet uses the wavefront ID as the depth value for Z-buffer operations.

Slide 31 (backup): Deterministic parallel commit using the Z-Buffer unit, worked example. Wavefronts W0, W1, W2 drain their store buffers through the interconnect into a memory partition, where the Z-Buffer unit sits in front of the L2 cache and DRAM interface. The stores are: W0 writes A := 6 with depth D(A) = 0; W1 writes A := 2 and B := 7 with depth 1; W2 writes B := 2 with depth 2. The Z-Buffer unit's depth comparison keeps, for each location, the write with the smallest depth, so whatever order the stores arrive in, the table converges to A = 6 (depth 0) and B = 7 (depth 1).