
1 Big Kernel: High Performance CPU-GPU Communication Pipelining for Big Data style Applications Sajitha Naduvil-Vadukootu CSC 8530 (Parallel Algorithms) Instructor: Dr. Sushil Prasad

2 Outline
o Background on CPU-GPU communication
o Problem statement
o What is Big Kernel?
o How does Big Kernel help?
o Implementation details
o Results & improvements

3 Review: GPU Programming
Programming model:
o GPU = device, CPU = host, kernel = program
o A GPU/CUDA program copies input data to the GPU, triggers the kernel, and copies the results back after execution (a minimal sketch of this pattern follows this list).
o Threads are grouped into warps, and warps into thread blocks.
Memory model:
o Registers: per thread; data lifetime = thread lifetime
o Local memory: per-thread off-chip memory (physically in device DRAM); data lifetime = thread lifetime
o Shared memory: per-thread-block on-chip memory; data lifetime = block lifetime
o Global (device) memory: accessible by all threads as well as the host (CPU); data lifetime = from allocation to deallocation
o Host (CPU) memory: not directly accessible by CUDA threads
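
A minimal CUDA sketch of the copy-launch-copy pattern; the kernel, array name, and sizes are illustrative, not taken from the slides.

```cuda
#include <cuda_runtime.h>

__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) data[i] *= 2.0f;                      // one element per thread
}

int main() {
    const int n = 1 << 20;
    float *h = new float[n];                 // host (CPU) memory
    for (int i = 0; i < n; i++) h[i] = 1.0f;

    float *d;
    cudaMalloc(&d, n * sizeof(float));       // global (device) memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);   // copy data to GPU
    scaleKernel<<<(n + 255) / 256, 256>>>(d, n);                   // trigger the kernel
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);   // copy data back
    cudaFree(d);
    delete[] h;
    return 0;
}
```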

4 Problem Statement
Scope: streaming algorithms that process data that does not fit inside GPU memory.
Problem: suboptimal execution due to issues stemming from the transfer of data between CPU and GPU.
Traditional solution: partition the data on the CPU side and call the kernel iteratively for each partition.
Double-buffering scheme: the CPU fills one buffer while the GPU consumes data from a second buffer (a stream-based sketch follows).
Dynamic stream graph [2] API: the programmer specifies high-level communication hints for optimization.
Issues: heavy burden on the programmer, error prone.
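
A sketch of the double-buffering scheme using two CUDA streams; the kernel, chunk size, and function names are assumptions for illustration, not the paper's code. For the copies to actually overlap with computation, the host buffer should be pinned (allocated with cudaHostAlloc).

```cuda
#include <cuda_runtime.h>

__global__ void processChunk(float *buf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) buf[i] = buf[i] * buf[i];    // stand-in for the real per-chunk work
}

// While the GPU consumes buffer b on its stream, the CPU refills the
// other buffer on the second stream.
void streamChunks(const float *host, int numChunks, int chunkLen) {
    float *dev[2]; cudaStream_t s[2];
    for (int b = 0; b < 2; b++) {
        cudaMalloc(&dev[b], chunkLen * sizeof(float));
        cudaStreamCreate(&s[b]);
    }
    for (int c = 0; c < numChunks; c++) {
        int b = c % 2;                       // alternate between the two buffers
        cudaStreamSynchronize(s[b]);         // wait until buffer b is free again
        cudaMemcpyAsync(dev[b], host + (size_t)c * chunkLen,
                        chunkLen * sizeof(float),
                        cudaMemcpyHostToDevice, s[b]);   // CPU-side fill of buffer b
        processChunk<<<(chunkLen + 255) / 256, 256, 0, s[b]>>>(dev[b], chunkLen);
    }
    cudaDeviceSynchronize();
    for (int b = 0; b < 2; b++) { cudaFree(dev[b]); cudaStreamDestroy(s[b]); }
}
```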

5 Introducing Big Kernel
Analyzing the problem further:
o Tendency for coding errors
o Efficiency of partitioning the data
o Limited physical bandwidth of the PCIe link between the two memories
o The GPU needs the threads of a warp to access adjacent memory locations.
Solution: a 4-stage pipeline with data prefetching
o Acts like virtual memory for GPU threads
o The programmer works with arbitrarily large data structures
o A static (compile-time) transformation turns the kernel into a 'Big Kernel'; i.e., data partitioning, data transfer, and CPU-GPU communication are managed underneath.

6 Big Kernel: Pipeline
1) Prefetch address generation: GPU threads calculate the addresses of the data needed for later computation and record them in a CPU-side address buffer.
2) Data assembly: the CPU assembles the data into the prefetch buffer based on the addresses from (1).
3) Data transfer: the GPU DMA engine transfers the contents of the prefetch buffer to the data buffer on the GPU.
4) Kernel computation: GPU threads execute the actual computation using the data in the data buffer.
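
A deliberately sequential, unpipelined sketch of the four stages (BigKernel overlaps them across chunks and uses a host-mapped address buffer; all names, sizes, and the trivial access pattern here are illustrative):

```cuda
#include <cuda_runtime.h>

// Stage 1 (GPU): each thread records the address (here: index) of the
// element it will need later.
__global__ void genAddresses(int *addrBuf, int chunkBase, int chunkLen) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < chunkLen) addrBuf[i] = chunkBase + i;   // trivial sequential pattern
}

// Stage 4 (GPU): the actual computation on the assembled data buffer.
__global__ void compute(const float *dataBuf, float *out, int chunkLen) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < chunkLen) out[i] = dataBuf[i] * 2.0f;
}

int main() {
    const int total = 1 << 22, chunkLen = 1 << 20;
    float *bigInput = new float[total];      // assumed too big for GPU memory
    for (int i = 0; i < total; i++) bigInput[i] = 1.0f;

    int *dAddr, *hAddr = new int[chunkLen];
    float *dData, *dOut, *prefetch = new float[chunkLen];
    cudaMalloc(&dAddr, chunkLen * sizeof(int));
    cudaMalloc(&dData, chunkLen * sizeof(float));
    cudaMalloc(&dOut,  chunkLen * sizeof(float));

    for (int base = 0; base < total; base += chunkLen) {
        genAddresses<<<(chunkLen + 255) / 256, 256>>>(dAddr, base, chunkLen); // Stage 1
        cudaMemcpy(hAddr, dAddr, chunkLen * sizeof(int), cudaMemcpyDeviceToHost);
        for (int i = 0; i < chunkLen; i++)                       // Stage 2: CPU assembly
            prefetch[i] = bigInput[hAddr[i]];
        cudaMemcpy(dData, prefetch, chunkLen * sizeof(float),
                   cudaMemcpyHostToDevice);                      // Stage 3: DMA transfer
        compute<<<(chunkLen + 255) / 256, 256>>>(dData, dOut, chunkLen); // Stage 4
    }
    cudaDeviceSynchronize();
    cudaFree(dAddr); cudaFree(dData); cudaFree(dOut);
    delete[] bigInput; delete[] hAddr; delete[] prefetch;
    return 0;
}
```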

7 Stage 1: Prefetch Address Generation
At compile time, from the GPU kernel code, remove all instructions other than:
o (1) control-flow statements
o (2) statements contributing to memory-address computation
o (3) memory access instructions
Memory access instructions are changed to write the accessed address into an address buffer on the CPU side (a before/after sketch follows this list).
Optimization: encode address patterns on the GPU side, transfer the compact pattern to the CPU, and decode it on the CPU side.
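
A before/after sketch of the transformation (hypothetical compiler output; the address buffer is assumed to be host-mapped memory):

```cuda
// Original (simplified) kernel: control flow + index arithmetic + accesses.
__global__ void kernelOrig(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // (2) contributes to the address
    if (i < n)                                       // (1) control flow
        out[i] = in[i] * 2.0f;                       // (3) memory accesses
}

// Derived address-generation kernel: the computation is stripped out and
// each access is replaced by a write of the accessed address (here, the
// element index of in[]) into the CPU-side address buffer.
__global__ void kernelAddrGen(int *addrBuf, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // (2) kept
    if (i < n)                                       // (1) kept
        addrBuf[i] = i;                              // records the address of in[i]
}
```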

8 Stage 2: Data Assembly
The CPU fetches the address pattern / addresses from the address buffer, decodes the pattern, and determines the addresses of the data items to be fetched.
The CPU fetches the data and assembles it in contiguous locations, in the order the addresses were received.
This lets the GPU threads in one warp access the memory in a single coalesced transaction (once the data reaches GPU memory).
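
The CPU-side gather is essentially a loop like the following (plain C++ sketch; buffer names are illustrative). Because items land in the prefetch buffer in consumption order, a warp's 32 loads later hit 32 adjacent words:

```cuda
// Gather the requested items into a contiguous prefetch buffer,
// in the exact order the GPU threads will consume them.
void assemble(const float *bigInput, const int *addrBuf,
              float *prefetchBuf, int count) {
    for (int i = 0; i < count; i++)
        prefetchBuf[i] = bigInput[addrBuf[i]];   // contiguous, consumption-ordered
}
```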

9 Stage 3: Data Transfer & Stage 4: Execution
The DMA (direct memory access) engine transfers the contents of the CPU prefetch buffer to GPU memory over the PCIe link.
One advantage is that only the minimal amount of data (what is needed next) travels over the PCIe link.
Synchronization is required at stage 3 so that data still being used by threads is not overwritten: stage 3 can only proceed once all the data in GPU memory has been consumed.
Synchronization is also required at stage 4: threads must wait for notification that the data transfer is complete before they can start consuming the data.
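
One way to express the two synchronization gates is with CUDA events (an assumed mechanism for illustration; the paper's implementation may use different primitives):

```cuda
#include <cuda_runtime.h>

__global__ void computeKernel(const float *buf, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = buf[i] + 1.0f;     // placeholder stage-4 work
}

// One pipeline round with both gates made explicit.
void transferAndCompute(float *dBuf, float *dOut, const float *hPrefetch,
                        int n, cudaStream_t copyS, cudaStream_t compS,
                        cudaEvent_t consumed, cudaEvent_t ready) {
    cudaEventSynchronize(consumed);            // stage-3 gate: buffer fully consumed
    cudaMemcpyAsync(dBuf, hPrefetch, n * sizeof(float),
                    cudaMemcpyHostToDevice, copyS);
    cudaEventRecord(ready, copyS);             // signal: transfer complete
    cudaStreamWaitEvent(compS, ready, 0);      // stage-4 gate: wait for the data
    computeKernel<<<(n + 255) / 256, 256, 0, compS>>>(dBuf, dOut, n);
    cudaEventRecord(consumed, compS);          // signal: buffer is free again
}
```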

10 Big Kernel: Data flow in Buffers

11 Big Kernel Pipeline

12 Example: K-Means Computation
The k-means computation:
o Takes in a set of data points in the numP array.
o Compares each point with the existing cluster centers.
o FindClosestCluster returns the ID of the cluster closest to the data point (x, y, z) in the numP array.
o The particles array collects the returned cluster IDs.
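
An illustrative reconstruction of this step (the layout of numP as packed x,y,z triples and the squared-Euclidean metric are assumptions based on the slide, not the paper's code):

```cuda
__device__ int findClosestCluster(float x, float y, float z,
                                  const float *centers, int k) {
    int best = 0;
    float bestDist = 1e30f;
    for (int c = 0; c < k; c++) {
        float dx = x - centers[3 * c],
              dy = y - centers[3 * c + 1],
              dz = z - centers[3 * c + 2];
        float d = dx * dx + dy * dy + dz * dz;   // squared Euclidean distance
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
}

__global__ void kmeansAssign(const float *numP, const float *centers,
                             int *particles, int nPoints, int k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nPoints)   // one data point (x, y, z) per thread
        particles[i] = findClosestCluster(numP[3 * i], numP[3 * i + 1],
                                          numP[3 * i + 2], centers, k);
}
```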

13 Example: K-Means Computation
Assume the numP array won't fit into GPU memory. The traditional approach partitions numP into chunks, and the kernel executes one chunk at a time (a chunked host loop is sketched below).
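
A sketch of this traditional chunked host loop, reusing the kmeansAssign kernel from the previous sketch (chunk size and names are assumptions):

```cuda
#include <algorithm>
#include <cuda_runtime.h>

void kmeansChunked(const float *hNumP, const float *hCenters,
                   int *hParticles, int nPoints, int k) {
    const int CHUNK = 1 << 20;                    // points per chunk (assumed)
    float *dP, *dC; int *dIds;
    cudaMalloc(&dP, 3 * CHUNK * sizeof(float));
    cudaMalloc(&dC, 3 * k * sizeof(float));
    cudaMalloc(&dIds, CHUNK * sizeof(int));
    cudaMemcpy(dC, hCenters, 3 * k * sizeof(float), cudaMemcpyHostToDevice);

    for (int base = 0; base < nPoints; base += CHUNK) {
        int len = std::min(CHUNK, nPoints - base);
        cudaMemcpy(dP, hNumP + 3 * (size_t)base, 3 * len * sizeof(float),
                   cudaMemcpyHostToDevice);       // copy one chunk in
        kmeansAssign<<<(len + 255) / 256, 256>>>(dP, dC, dIds, len, k);
        cudaMemcpy(hParticles + base, dIds, len * sizeof(int),
                   cudaMemcpyDeviceToHost);       // copy the chunk's results back
    }
    cudaFree(dP); cudaFree(dC); cudaFree(dIds);
}
```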

14 Example: K-Means Computation
Big Kernel method: the CPU code uses StreamingMalloc() and StreamingMap(), provided by Big Kernel, and the address-prefetching code is derived from the GPU kernel code (see the sketch below).
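
How the host side might look with these calls (StreamingMalloc() and StreamingMap() are named on the slide, but their exact signatures and the launch details below are assumptions, not the published interface):

```cuda
// Allocate arbitrarily large, BigKernel-managed arrays on the host side;
// BigKernel handles partitioning and CPU-GPU transfer underneath.
float *numP      = (float *)StreamingMalloc(3 * (size_t)nPoints * sizeof(float));
int   *particles = (int   *)StreamingMalloc((size_t)nPoints * sizeof(int));
StreamingMap(numP);         // register the arrays with the pipeline
StreamingMap(particles);

// The kernel is written against the full arrays; BigKernel's compile-time
// pass rewrites it into the address-generation + compute pair and drives
// the 4-stage pipeline.
kmeansAssign<<<grid, block>>>(numP, centers, particles, nPoints, k);
```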

15 Additional Optimizations
o Pattern recognition in prefetch address generation: look for patterns in the address sequence and encode them compactly; the addresses are decoded on the CPU side (an illustrative encoding follows this list).
o Data locality when assembling data: read all the data needed by one GPU thread at a time.
o Synchronization: the first 3 stages produce data and the 4th stage consumes it; production and consumption must be synchronized.
o Buffer allocation (active vs. inactive thread blocks): only allocate buffer space to active blocks.
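
As an illustration of the pattern idea (the concrete encoding is an assumption): a per-thread arithmetic address sequence can be shipped as a (base, stride, count) triple and expanded on the CPU.

```cuda
struct AddrPattern {
    long base;     // first address
    long stride;   // constant step between consecutive addresses
    int  count;    // number of addresses in the sequence
};

// CPU-side decode: regenerate the full address list from the compact pattern.
void decodePattern(const AddrPattern &p, long *addrs) {
    for (int i = 0; i < p.count; i++)
        addrs[i] = p.base + (long)i * p.stride;   // base, base+stride, base+2*stride, ...
}
```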

16 Experimental Results
Big-data / streaming application scenarios: speedup comparison between (1) multi-threaded CPU, (2) GPU with a single buffer, (3) GPU with double buffering, and (4) GPU with Big Kernel.

17 Improvements
Consider applying Big Kernel to more complex algorithms whose kernels include pointer chasing or complex control-flow instructions.
Integration with MapReduce.

18 References
[1] R. Mokhtari and M. Stumm. BigKernel -- High Performance CPU-GPU Communication Pipelining for Big Data-Style Applications. In Proc. 2014 IEEE 28th International Parallel and Distributed Processing Symposium (IPDPS), page 819, 2014. ISBN: 9781479937998.
[2] T. Komoda, S. Miwa, and H. Nakamura. Communication Library to Overlap Computation and Communication for OpenCL Application. In Proc. 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), pages 567-573, 2012.

