GPU Programming using OpenCL


GPU Programming using OpenCL Blaise Tine School of Electrical and Computer Engineering Georgia Institute of Technology

Outline
- What's OpenCL?
- The OpenCL Ecosystem
- OpenCL Programming Model
- OpenCL vs CUDA
- OpenCL VectorAdd Sample
- Compiling OpenCL Programs
- Optimizing OpenCL Programs
- Debugging OpenCL Programs
- The SPIR Portable IL
- Other Compute APIs (DirectX, C++ AMP, SYCL)
- Resources

What's OpenCL?
- Low-level programming API for data-parallel computation
  - Platform API: device query and pipeline setup
  - Runtime API: resource management + execution
- Cross-platform API: Windows, macOS, Linux, mobile, web…
- Portable device targets: CPUs, GPUs, FPGAs, DSPs, etc.
- Kernel language based on C99
- Maintained by the Khronos Group (www.khronos.org)
- Current version: 2.2, with C++ kernel support (classes & templates)

OpenCL Implementations
Morgan Kaufmann Publishers, September 22, 2018. Chapter 1: Computer Abstractions and Technology.
(Slide shows vendor implementations as an image, not reproduced in the transcript.)

OpenCL Front-End APIs
- 200+ front-end APIs
- They internally implement a source-to-source compiler that converts programs written against their own API into OpenCL

OpenCL Platform Model
- Multiple compute devices attached to a host processor
- Each compute device has multiple compute units
- Each compute unit has multiple processing elements
- Processing elements within a compute unit execute work-items in lock-step
- A compute device can be a GPU, an FPGA, a custom IC, etc.
(An Introduction to the OpenCL Programming Model (2012), Jonathan Thompson, Kristofer Schlachter)

OpenCL Execution Model
- A kernel is a logical unit of instructions to be executed on a compute device
- Kernels are executed over a multi-dimensional index space: the NDRange
- For every element of the index space, a work-item is executed
- The index space is tiled into work-groups
- Work-items within a work-group are synchronized using barriers or fences
- The NDRange construct makes it easier to partition a computation domain
- work-item => compute thread; work-group => thread block
(AMD OpenCL User Guide, 2015)

OpenCL Memory Model
- Global memory: shared memory accessible to all work-items + host
- Constant/texture memory: read-only shared memory accessible to all work-items + host
- Local memory: for sharing data between work-items within the same work-group
- Private memory: accessible only within a work-item; typically implemented in the register file
- Some implementations internally optimize storage for space or speed

OpenCL vs CUDA
OpenCL terminology aims for generality.

  OpenCL Terminology    CUDA Terminology
  Compute Unit          Streaming Multiprocessor (SM)
  Processing Element    Processor Core
  Wavefront (AMD)       Warp
  Work-item             Thread
  Work-group            Thread Block
  NDRange               Grid
  Global Memory         Global Memory
  Constant Memory       Constant Memory
  Local Memory          Shared Memory
  Private Memory        Local Memory (registers)

Note: OpenCL "local memory" corresponds to CUDA's per-block shared memory.

OpenCL vs CUDA (2): Resource Qualifiers

  Description                       OpenCL Terminology    CUDA Terminology
  Kernel global function            __kernel              __global__
  Kernel-local (device) function    (nothing*)            __device__
  Read-only memory                  __constant            __constant__
  Global memory                     __global              __device__
  Work-group local (shared) memory  __local               __shared__

OpenCL vs CUDA (3): Work-Item Indexing

  OpenCL Terminology    CUDA Terminology
  get_num_groups()      gridDim
  get_local_size()      blockDim
  get_group_id()        blockIdx
  get_local_id()        threadIdx
  get_global_id()       blockIdx * blockDim + threadIdx
  get_global_size()     gridDim * blockDim

OpenCL vs CUDA (4): Thread Synchronization

  OpenCL Terminology                      CUDA Terminology
  barrier()                               __syncthreads()
  (no direct equivalent*)                 __threadfence()
  mem_fence()                             __threadfence_block()
  (no direct equivalent*)                 __threadfence_system()
  (no direct equivalent*)                 __syncwarp()
  read_mem_fence() / write_mem_fence()    (no direct equivalent*)

OpenCL vs CUDA (5): API Terminology

  OpenCL Terminology                            CUDA Terminology (driver API)
  clGetContextInfo()                            cuDeviceGet()
  clCreateCommandQueue()                        (no direct equivalent*)
  clBuildProgram() / clCreateKernel()           (no direct equivalent*)
  clCreateBuffer()                              cuMemAlloc()
  clEnqueueWriteBuffer()                        cuMemcpyHtoD()
  clEnqueueReadBuffer()                         cuMemcpyDtoH()
  clSetKernelArg() / clEnqueueNDRangeKernel()   kernel<<<...>>>()
  clReleaseMemObject()                          cuMemFree()

OpenCL vs CUDA (6): Which is Best?
Strengths:
- API performance: CUDA is better on Nvidia cards
- Device capabilities: CUDA has an edge
- Portability: CUDA is not portable
- Documentation: CUDA has many online resources
- Tools: CUDA has more mature tools
- Language accessibility: CUDA's C++ extensions are nice

OpenCL Program Flow
- Compile kernel programs (offline or online)
- Load kernel objects
- Load application data into memory objects
- Build the command queue (batch instructions, defer rendering)
- Submit the command queue (execute the program)

Compiling OpenCL Programs
- The compiler toolchain uses the LLVM optimizer
- LLVM generates a device-specific IL for the target GPU
- LLVM can also generate a CPU target binary
- The CPU target can be used for verification and debugging

OpenCL VectorAdd Sample
(The kernel source is shown as an image, not reproduced in the transcript. Callouts: address-space qualifier, kernel qualifier, global thread index, the vector addition itself.)

OpenCL VectorAdd Sample (2)
(Host-setup code is shown as an image, not reproduced in the transcript. Callouts: set up the kernel grid, allocate host resources, create the device context, allocate device resources, populate device memory.)

OpenCL VectorAdd Sample (3)
(Host-execution code is shown as an image, not reproduced in the transcript. Callouts: build the kernel program, set kernel arguments, launch kernel execution, read back the destination buffer.)

Optimizing OpenCL Programs
- Profile before optimizing!
- Fine-grained workload partitioning: subdivide work to feed all compute resources
- Use constant memory when possible: implementations may optimize access to the data
- Use local memory: much faster than global memory; tile computation to fit local memory
- Reduce thread synchronization
- Reduce host-device communication overhead: command-queue batching and synchronization

Debugging OpenCL Programs
- Debugging support is vendor-specific
- Most implementations support debugging via GDB
- Compile the program with debug symbols: clBuildProgram() with "-g"
- Internally uses the CPU target for efficiency
- Runtime profiling is supported: clGetEventProfilingInfo()

The SPIR Portable IL
- SPIR (Standard Portable Intermediate Representation) is the portable equivalent of Nvidia's PTX

Other Compute APIs: DirectX 12
- Full-featured compute API
- Supported by the major vendors: Nvidia, AMD, Intel
- Optimized for gaming, graphics, and AI
- Windows only
(Image: https://www2.cs.duke.edu/courses/compsci344/spring15/classwork/15_shading)

Other Compute APIs: C++ AMP, SYCL
- Single-source compute APIs
- Exploit modern C++ lambda extensions
- Productivity without performance loss!
(Code shown as an image; callouts: device resources, grid dimension, parallel lambda function.)

Resources
- API specifications: https://www.khronos.org/registry/OpenCL
- Open-source implementation: https://01.org/beignet
- OpenCL tutorials: http://www.cmsoft.com.br/opencl-tutorial
- Khronos resources: https://www.khronos.org/opencl/resources