OpenCL Programming James Perry EPCC The University of Edinburgh.

OpenCL Programming
Introducing OpenCL
Important concepts
– Work groups and work items
Programming with OpenCL
– Initialising OpenCL and finding devices
– Allocating and copying memory
– Defining kernels
– Launching kernels
OpenCL vs. CUDA

Introducing OpenCL

What is OpenCL?
OpenCL is an open, royalty-free standard for parallel programming across heterogeneous devices
– Originally developed by Apple
– Now maintained by the Khronos Group
– Adopted by many major chip vendors, including Intel, ARM, AMD and NVIDIA, giving cross-platform support
– The same program can run on CPUs and various types of accelerator
– Further details are available from the Khronos Group website

What is OpenCL? (2)
Consists of:
– A programming language for writing kernels (code fragments that run on devices), based on standard C
– An API (used for querying and initialising devices, managing memory, launching kernels, etc.)
Kernel functions are compiled at runtime for whatever device is available
– The same code can run on an NVIDIA GPU, an AMD GPU, an Intel CPU, more exotic accelerators…

Important Concepts

Work Items and Work Groups
For OpenCL, each problem is composed of an array of work items
– May be a 1D, 2D or 3D array
The domain is sub-divided into work groups, each consisting of a number of work items
Example: a domain with global dimensions 12x8 and local dimensions 4x4, i.e. the work items are divided into 4x4 work groups

Work items and kernels
When a kernel function is invoked, it runs on every work item in the domain independently
Many work items will be processed in parallel
– Depending on the hardware

Traditional looped vector addition
void add_vectors(float *a, float *b, float *c, int n)
{
    int i;
    for (i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

OpenCL vector addition
__kernel void add_vectors(__global float *a, __global float *b, __global float *c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}
No loop – each kernel instance operates on a single element
The global ID determines its position in the domain

Work groups
All the work items within a work group will be scheduled together on the same processing unit
– E.g. the same SM on an NVIDIA GPU
Synchronisation is possible between items in the same work group
– But not between different work groups
Items in the same work group also share the same local memory

Programming with OpenCL

Initialising OpenCL
Quite a long-winded process; typically you need to:
– Find the platform you wish to target (a system may have multiple platforms, e.g. CPU and GPU)
– Find the target device within that platform
– Create a context and a command queue on the target device
– Compile your kernels for the target device
Kernels are typically distributed as source so they can be compiled optimally for whatever device is present at runtime
OpenCL has many API functions to help with this
See the practical for an example of how to do this
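The steps above can be sketched as follows. This is a minimal, mostly unchecked sketch, not the practical's code: real code should test every returned error code, the device type and kernel name are just examples, and on macOS the header is OpenCL/opencl.h rather than CL/cl.h.

```c
#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_int err;

    /* 1. Find a platform, and a device within that platform. */
    cl_platform_id platform;
    err = clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;
    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    /* 2. Create a context and a command queue on the device. */
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

    /* 3. Compile the kernel source for this device at runtime. */
    const char *source =
        "__kernel void add_vectors(__global float *a, __global float *b,"
        "                          __global float *c)"
        "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }";
    cl_program program =
        clCreateProgramWithSource(context, 1, &source, NULL, &err);
    err = clBuildProgram(program, 1, &device, NULL, NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "add_vectors", &err);

    printf("initialised OK\n");
    /* Clean-up (clRelease* calls) omitted for brevity. */
    return 0;
}
```

Because the kernel is compiled in step 3, a build failure is reported at runtime; clGetProgramBuildInfo can then be used to retrieve the compiler log.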

Allocating Memory
Global device memory can be allocated with:
cl_int err;
cl_mem mem = clCreateBuffer(context, flags, size, host_ptr, &err);
and freed with:
clReleaseMemObject(mem);
Note that clCreateBuffer returns the new buffer object and writes its error code through the final pointer argument.

Copying to and from device memory
Copying between host and device memory is initiated by clEnqueueWriteBuffer and clEnqueueReadBuffer
Example:
clEnqueueWriteBuffer(queue, d_input, CL_TRUE, 0, size, h_input, 0, NULL, NULL);
// ... perform processing on device ...
clEnqueueReadBuffer(queue, d_output, CL_TRUE, 0, size, h_output, 0, NULL, NULL);
Can also perform non-blocking transfers, synchronise with events, do partial transfers, etc.

Writing kernels
OpenCL kernels are functions that run on the device
– A kernel processes a single work item
Written in a separate source file from the host C code
– Often with a .cl file extension
– Usually compiled at runtime
Mostly C syntax, but with some extensions and some restrictions
Various OpenCL API calls are available within kernels
– For synchronisation etc.

Declaring a kernel
Like a normal C function declaration, but preceded by __kernel:
__kernel void add_vectors(__global float *a, __global float *b, __global float *c, int n)
__global specifies that the pointers point to global memory on the device – i.e. main memory that is accessible to all work groups
__local and __constant qualifiers are also available

Some OpenCL APIs
get_global_id(dimension) returns the position of the current work item within the whole domain, e.g.:
int x = get_global_id(0);
int y = get_global_id(1);
get_local_id(dimension) returns the position within the local work group
barrier(CLK_LOCAL_MEM_FENCE) synchronises all items within the local work group
– useful for co-ordinating accesses to local memory

Vector datatypes
Vectors of length 2, 4, 8 or 16 are supported
– Both integer and floating point
May improve performance on hardware that has vector instructions
Simple to use:
float4 v1 = (float4)(1.0, 2.0, 3.0, 4.0);
float4 v2 = (float4)(0.0, -0.5, -0.3, 7.9);
v1 -= v2;
Various vector-specific operations and functions are also available

Launching Kernels
Call clSetKernelArg to set each argument
Call clEnqueueNDRangeKernel to launch the kernel
clFinish can be used to wait for all enqueued work to complete
Example (globalsize and localsize are size_t arrays giving the global and local dimensions):
clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_input);
clSetKernelArg(kernel, 1, sizeof(int), &size);
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, globalsize, localsize, 0, NULL, NULL);
clFinish(queue);

OpenCL vs. CUDA

Similar in many ways
Different terminology
– CUDA has threads and blocks
– OpenCL has work items and work groups
OpenCL is cross-platform, CUDA is NVIDIA-only
– OpenCL supports NVIDIA and AMD GPUs, CPUs, etc.
CUDA is currently more mature
– On NVIDIA hardware, CUDA is faster and better supported
The OpenCL API tends to be more verbose
– But also more flexible
OpenCL kernels are normally compiled at runtime

Summary
OpenCL allows programming of heterogeneous processors and accelerators using a common language and API
– Supports both NVIDIA and AMD GPUs, as well as CPUs
For more information, see the Khronos Group website