1 "MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs"
John A. Stratton, Sam S. Stone and Wen-mei W. Hwu
Presentation for class TDT24, 29.10.2010
Yngve Sneen Lindal

2 The article
Implementation paper. Suggests a source-to-source compiler (CUDA C to C).
Chapters:
– Introduction
– Programming model background: CUDA features, mapping possibilities
– Kernel translation: implementation challenges (mostly synchronization)
– Implementation and performance
– Related work
– Conclusions

3 Introduction
Motivation: why write the same code twice? The programming models should map well onto each other. MCUDA aims to preserve CUDA's synchronization and data-locality benefits in order to achieve good performance.

4 Programming model background
CUDA: threads are organized in blocks on a grid (hereafter called logical threads). Per-block thread synchronization. Overview of the different memory types. Branching is expensive; warps (SIMD) use a stack-based reconvergence algorithm. Performance strategy: assign each block to a specific core to avoid inter-core synchronization overhead and to preserve high locality. Similar control flows/operations should enable the use of vector instructions.
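
To make this concrete, here is a minimal, hypothetical CUDA kernel of the kind MCUDA translates. The kernel name, the block size and the use of shared memory are my own illustration, not an example from the paper:

    /* Hypothetical example kernel: each block reverses a BLOCK_SIZE-wide
       slice of the input array, staged through block-shared memory. */
    #define BLOCK_SIZE 256

    __global__ void reverse_slice(const float *in, float *out)
    {
        __shared__ float buf[BLOCK_SIZE];     /* block-shared memory */
        int tid  = threadIdx.x;               /* logical thread index */
        int base = blockIdx.x * BLOCK_SIZE;   /* this block's slice */

        buf[tid] = in[base + tid];            /* stage into shared memory */
        __syncthreads();                      /* per-block barrier */
        out[base + tid] = buf[BLOCK_SIZE - 1 - tid];  /* reversed read */
    }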

5 Programming model background
Thread-local memory (registers) and block-shared memory fit well in the L1 cache. Constant memory should fit well in the L2 cache (which is often shared among CPU cores).

6 Kernel translation
One OS thread per logical GPU thread would work against the locality goals and be very expensive to schedule. MCUDA instead assigns blocks to a core and runs each block sequentially. The blocks are divided into "thread loops", which we will return to soon. The translation involves three explicit transformation stages (performed on the AST):
– Transform a kernel into a serial function (fig. 1)
– Enforce synchronization (translate __syncthreads()) (fig. 2)
– Replicate thread-local data (fig. 3)

7 Transforming a thread block into a serial function
Introduces an iterative structure called a "thread loop":
– No synchronization needed inside
– No side entries or exits
Thread loops expose similar instructions in a "non-branching environment", and thereby help an optimizing C compiler generate fast code.
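
Applied to the hypothetical kernel from slide 4, the first transformation stage could look roughly like this (a sketch of the idea, not the compiler's actual output):

    /* Sketch of stage 1: the kernel body becomes a serial C function for
       one block, wrapped in a thread loop over the logical threads. The
       __syncthreads() left inside is not yet honored; stage 2 fixes it. */
    #define BLOCK_SIZE 256

    void reverse_slice_block(const float *in, float *out, int blockIdx_x)
    {
        float buf[BLOCK_SIZE];                        /* block-shared memory */
        for (int tid = 0; tid < BLOCK_SIZE; tid++) {  /* thread loop */
            int base = blockIdx_x * BLOCK_SIZE;
            buf[tid] = in[base + tid];
            /* __syncthreads();  -- illegal inside a thread loop */
            out[base + tid] = buf[BLOCK_SIZE - 1 - tid];
        }
    }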

8 Enforcing synchronization with deep fission
For-loops become while-loops to get rid of the initialization and update statements (removing side effects). A loop fission transforms a synchronization statement S into two thread loops placed above and below S; equivalently, it divides a thread loop into two thread loops (more on that later). Algorithm 1 is applied to each synchronization statement and is also run for its containing constructs. Any conditional affecting a synchronization statement must evaluate to the same value (true or false) for all threads in the block; this is part of the CUDA spec.
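
For the running example the fission is simple, since the barrier sits at the top level of the kernel body (again a sketch; 'base' only depends on blockIdx, so it can be hoisted without replication):

    /* Sketch of stage 2: loop fission at the barrier. Every logical
       thread finishes loop 1 before any thread enters loop 2, which is
       exactly the guarantee __syncthreads() gives on the GPU. */
    #define BLOCK_SIZE 256

    void reverse_slice_block(const float *in, float *out, int blockIdx_x)
    {
        float buf[BLOCK_SIZE];                      /* block-shared memory */
        int base = blockIdx_x * BLOCK_SIZE;         /* block-uniform value */

        for (int tid = 0; tid < BLOCK_SIZE; tid++)  /* thread loop 1 */
            buf[tid] = in[base + tid];
        /* the barrier is now the boundary between the two thread loops */
        for (int tid = 0; tid < BLOCK_SIZE; tid++)  /* thread loop 2 */
            out[base + tid] = buf[BLOCK_SIZE - 1 - tid];
    }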

9 Enforcing synchronization with deep fission
One more pass over the AST is needed to correct any control flow inside thread loops that the fission has made invalid.

10 Replicating thread-local data
Shared memory: straightforward. Local variables: "universal replication", i.e. an array[num_threads] holding one instance of the variable per logical thread. This is inefficient when variable storage could be reused, so live-variable analysis detects which variables actually need replication; this is called "selective replication". See the sketch below.
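
A minimal illustration of replication (my own example, not from the paper): a local variable that is live across a thread-loop boundary needs one slot per logical thread, whereas selective replication would keep a local whose live range stays inside a single thread loop as a plain scalar.

    /* Sketch: 'x' is live across the thread-loop boundary (defined in
       loop 1, used in loop 2), so universal replication turns it into
       an array with one element per logical thread. */
    #define BLOCK_SIZE 256

    void scale_then_add_block(const float *in, float *out, int blockIdx_x)
    {
        float x[BLOCK_SIZE];                        /* replicated local */
        int base = blockIdx_x * BLOCK_SIZE;         /* block-uniform */

        for (int tid = 0; tid < BLOCK_SIZE; tid++)  /* thread loop 1 */
            x[tid] = in[base + tid] * 2.0f;         /* def before boundary */
        for (int tid = 0; tid < BLOCK_SIZE; tid++)  /* thread loop 2 */
            out[base + tid] = x[tid] + 1.0f;        /* use after boundary */
    }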

11 Work distribution and runtime framework
Iterating sequentially through all blocks and calling the block function for each is not optimal on a multi-core processor. Scheduling a portion of the blocks to each core is optimal and corresponds to the programming model.

12 Implementation and performance analysis
Uses OpenMP's "parallel for" to take advantage of multiple OS threads on a multi-core machine. Benchmarks a selection of algorithms against highly optimized CPU versions. Linear performance scaling with the number of cores suggests good exploitation of locality and truly independent blocks.
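
A minimal sketch of how the translated block function from slide 8 could be driven; the paper only states that OpenMP's "parallel for" is used, so the driver function here is my own:

    #include <omp.h>

    /* Sketch: distribute the grid's blocks over the CPU cores; each
       iteration runs one fully serialized block on a single core. */
    void reverse_slice_grid(const float *in, float *out, int num_blocks)
    {
        #pragma omp parallel for
        for (int b = 0; b < num_blocks; b++)
            reverse_slice_block(in, out, b);
    }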

13 Related work
According to the article, no one has done this before (why am I not surprised?). Nvidia's CUDA CPU emulation is meant for debugging, not performance; MCUDA is less suitable for debugging since the code is compiled. The article also mentions some other frameworks that use different approaches (parallelizing serial code).

14 Conclusions
Translated kernels perform comparably to optimized serial code (based on these benchmarks). That means high locality is preserved and computational regularity is exposed to an optimizing compiler. There is a trade-off between portability and performance.

15 My thoughts
An unconventional (and a bit cool) problem, but when do you need the CPU? GPUs are cheap. A reversed GPGPU development cycle? Maybe some more benchmarks? The examples use quite a simplified kernel. This is very conceptual, but I guess one has to refine the problem when building something like this. What about C++?

