OpenCL: The Open Standard for Heterogeneous Parallel Programming

Processor Parallelism

CPUs
- Multiple cores driving performance increases
- Multi-processor programming, e.g. OpenMP

GPUs
- Increasingly general-purpose data-parallel computing
- Improving numerical precision
- Graphics APIs and shading languages

Emerging intersection of the two: heterogeneous computing

OpenCL (Open Computing Language)
- Open, royalty-free standard for portable, parallel programming of heterogeneous systems
- Targets CPUs, GPUs, and other processors

OpenCL Timeline

- Six months from proposal to released specification, thanks to a strong initial proposal and a shared commercial incentive to work quickly
- Apple's Mac OS X Snow Leopard will include OpenCL, improving speed and responsiveness for a wide spectrum of applications
- Multiple OpenCL implementations expected in the next 12 months, on diverse platforms

Milestones:
- Jun 08: Apple, after working with AMD, Intel, NVIDIA and others on a draft proposal, proposes the OpenCL working group and contributes the draft specification to Khronos
- Jun-Oct 08: the OpenCL working group develops the draft into a cross-vendor specification
- Oct 08: the working group sends the completed draft to the Khronos Board for ratification
- Dec 08: Khronos publicly releases OpenCL as a royalty-free specification
- May 09: Khronos to release conformance tests to ensure high-quality implementations

OpenCL Technical Overview

OpenCL Design Requirements

- Use all computational resources in the system
  - Program GPUs, CPUs, and other processors as peers
  - Support data-parallel and task-parallel compute models
- Efficient C-based parallel programming model
  - Abstracts the specifics of the underlying hardware
  - Abstraction is low-level and high-performance, but device-portable
  - Approachable, but primarily targeted at expert developers
  - Ecosystem foundation: no middleware or "convenience" functions
- Implementable on a range of embedded, desktop, and server systems
  - HPC, desktop, and handheld profiles in one specification
- Drive future hardware requirements
  - Floating-point precision requirements
  - Applicable to both consumer and HPC applications

Anatomy of OpenCL

- Language Specification
  - C-based cross-platform programming interface
  - Subset of ISO C99 with language extensions, familiar to developers
  - Well-defined numerical accuracy (IEEE 754 rounding with specified maximum error)
  - Online or offline compilation and build of compute kernel executables
  - Includes a rich set of built-in functions
- Platform Layer API
  - A hardware abstraction layer over diverse computational resources
  - Query, select, and initialize compute devices
  - Create compute contexts and work-queues
- Runtime API
  - Execute compute kernels
  - Manage scheduling, compute, and memory resources

Hierarchy of Models

- Platform Model
- Memory Model
- Execution Model
- Programming Model

OpenCL Platform Model (Section 3.1)

- One Host plus one or more Compute Devices
- Each Compute Device is composed of one or more Compute Units
- Each Compute Unit is further divided into one or more Processing Elements

OpenCL Memory Model (Section 3.3)

[Diagram: a compute device contains compute units 1..N; each compute unit runs work-items 1..M, each with its own private memory, and shares a local memory; all compute units access global/constant memory through a data cache, backed by global memory in compute device memory]

- Shared memory model with relaxed consistency
- Multiple distinct address spaces; address spaces can be collapsed depending on the device's memory subsystem
- Address spaces:
  - Private: private to a work-item
  - Local: local to a work-group
  - Global: accessible by all work-items in all work-groups
  - Constant: read-only global space
- Implementations map this hierarchy to the available physical memories

OpenCL Execution Model (Section 3.2)

An OpenCL program consists of:
- Kernels
  - Basic units of executable code, similar to C functions, CUDA kernels, etc.
  - Data-parallel or task-parallel
- A host program
  - A collection of compute kernels and internal functions
  - Analogous to a dynamic library

Kernel execution:
- The host program invokes a kernel over an index space called an NDRange
  - NDRange ("N-Dimensional Range") can be a 1D, 2D, or 3D space
- A single kernel instance at a point in the index space is called a work-item
  - Work-items have unique global IDs drawn from the index space
- Work-items are further grouped into work-groups
  - Work-groups have a unique work-group ID
  - Work-items have a unique local ID within their work-group

Kernel Execution

- Total number of work-items = Gx * Gy
- Size of each work-group = Sx * Sy
- The global ID of a work-item can be computed from its work-group ID and local ID
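As a sketch of that last point: inside a kernel, the identity below holds per dimension (out is a hypothetical buffer used only to record the check; OpenCL 1.0 has no global offset).

__kernel void check_ids(__global int* out)
{
    size_t gid = get_global_id(0);
    // global ID reconstructed from work-group ID and local ID
    size_t rec = get_group_id(0) * get_local_size(0) + get_local_id(0);
    out[gid] = (gid == rec);  // writes 1 for every work-item
}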

Programming Model: Data-Parallel (Section 3.4.1)

- Must be implemented by all OpenCL compute devices
- Define an N-dimensional computation domain
  - Each independent element of execution in the N-dimensional domain is called a work-item
  - The N-dimensional domain defines the total number of work-items that execute in parallel: the global work size
- Work-items can be grouped together into work-groups
  - Work-items in a group can communicate with each other
  - Execution can be synchronized among work-items in a group to coordinate memory access
- Multiple work-groups execute in parallel
  - The mapping of global work size to work-groups can be implicit or explicit

Programming Model: Task-Parallel (Section 3.4.2)

- Some compute devices can also execute task-parallel compute kernels
- The kernel executes as a single work-item, and can be either:
  - A compute kernel written in OpenCL
  - A native C/C++ function

Basic OpenCL Program Structure

- Host program
  - Platform layer: query compute devices, create contexts
  - Runtime: create memory objects associated with contexts, compile and create kernel program objects, issue commands to the command-queue, synchronize commands, clean up OpenCL resources
- Kernels (language)
  - C code with some restrictions and extensions

Platform Layer (Chapter 4)

The platform layer allows applications to query for platform-specific features.
- Querying platform info, i.e. the OpenCL profile (Section 4.1)
- Querying devices (Section 4.2)
  - clGetDeviceIDs(): find out what compute devices are on the system; device types include CPUs, GPUs, or accelerators
  - clGetDeviceInfo(): queries the capabilities of the discovered compute devices, such as:
    - Number of compute cores
    - NDRange limits
    - Maximum work-group size
    - Sizes of the different memory spaces (constant, local, global)
    - Maximum memory object size
- Creating contexts (Section 4.3)
  - The OpenCL runtime uses contexts to manage objects and execute kernels on one or more devices
  - Contexts are associated with one or more devices; multiple contexts can be associated with the same device
  - clCreateContext() and clCreateContextFromType() return a handle to the created context
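A minimal sketch of such a platform-layer query, assuming a single platform with at least one GPU device (error checks omitted):

#include <CL/cl.h>
#include <stdio.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id device;
    size_t maxWorkGroupSize;
    cl_ulong globalMemSize;

    // find a platform, then a GPU device on it
    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);

    // query two of the capabilities listed above
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE,
                    sizeof(maxWorkGroupSize), &maxWorkGroupSize, NULL);
    clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                    sizeof(globalMemSize), &globalMemSize, NULL);

    printf("max work-group size: %u, global memory: %llu bytes\n",
           (unsigned)maxWorkGroupSize, (unsigned long long)globalMemSize);
    return 0;
}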

OpenCL Runtime (Chapter 5)

Command-Queues (Section 5.1)
- Store a set of operations to perform
- Are associated with a context
- Multiple command-queues can be created to handle independent commands that don't require synchronization
- Command-queue execution is guaranteed to be complete at sync points
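As a minimal sketch (ctx and dev are assumed to be an existing context and device; not every implementation supports the out-of-order mode):

cl_int err;
// in-order queue (the default): commands complete in submission order
cl_command_queue inOrderQ = clCreateCommandQueue(ctx, dev, 0, &err);
// out-of-order queue: ordering must then be enforced with events and barriers
cl_command_queue oooQ = clCreateCommandQueue(ctx, dev,
        CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE, &err);
clFlush(oooQ);   // submit queued commands to the device without waiting
clFinish(oooQ);  // block until everything queued has completed (a sync point)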

Memory Objects (Section 5.2)

- Buffer objects
  - 1D collection of elements (like C arrays)
  - Scalar and vector types, as well as user-defined structures
  - Accessed via pointers in the kernel
- Image objects
  - 2D or 3D textures, frame-buffers, or images
  - Must be addressed through built-in functions
- Sampler objects
  - Describe how to sample an image in the kernel
  - Addressing modes
  - Filtering modes

Creating Memory Objects

- clCreateBuffer(), clCreateImage2D(), and clCreateImage3D()
- Memory objects are created with an associated context
- Memory can be created as read-only, write-only, or read-write
- Where objects are created in the platform memory space can be controlled:
  - Device memory
  - Device memory with data copied from a host pointer
  - Host memory
  - Host memory associated with a pointer; memory at that pointer is guaranteed to be valid at synchronization points
- Image objects are also created with a channel format
  - Channel order (e.g., RGB, RGBA, etc.)
  - Channel data type (e.g., UNORM_INT8, FLOAT, etc.)
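For example, a 2D RGBA image could be created roughly as follows (ctx, width, height, and hostPixels are assumed to exist; the API shown is the OpenCL 1.0/1.1-era clCreateImage2D):

cl_int err;
cl_image_format fmt;
fmt.image_channel_order     = CL_RGBA;        // channel order
fmt.image_channel_data_type = CL_UNORM_INT8;  // channel data type
cl_mem img = clCreateImage2D(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                             &fmt, width, height,
                             0,            // row pitch (0 = tightly packed)
                             hostPixels,   // host data to copy from
                             &err);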

Manipulating Object Data

- Object data can be copied to or from host memory, or to other objects
- Memory commands are enqueued in the command buffer and processed when the command is executed
  - clEnqueueReadBuffer(), clEnqueueReadImage()
  - clEnqueueWriteBuffer(), clEnqueueWriteImage()
  - clEnqueueCopyBuffer(), clEnqueueCopyImage()
- Data can be copied between image and buffer objects
  - clEnqueueCopyImageToBuffer()
  - clEnqueueCopyBufferToImage()
- Regions of the object data can be accessed by mapping them into the host address space
  - clEnqueueMapBuffer(), clEnqueueMapImage()
  - clEnqueueUnmapMemObject()
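A sketch of the mapping path, assuming an existing queue and a buffer buf of n floats:

cl_int err;
// map the buffer into host address space (blocking map, for writing)
float* p = (float*)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                      0, n * sizeof(float),
                                      0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    p[i] = 0.0f;                      // touch the data directly from the host
// unmap to hand the region back to the OpenCL runtime
clEnqueueUnmapMemObject(queue, buf, p, 0, NULL, NULL);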

Program Objects (Section 5.4)

- Program objects encapsulate:
  - An associated context
  - Program source or binary
  - The latest successful program build, the list of targeted devices, and the build options
  - The number of attached kernel objects
- Build process
  - Create a program object: clCreateProgramWithSource() or clCreateProgramWithBinary()
  - Build the program executable: clBuildProgram()
    - Compiles and links from source or binary, for all devices or specific devices in the associated context
    - Build options cover the preprocessor, math intrinsics (floating-point behavior), and optimizations
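When clBuildProgram() fails, the build log can be retrieved per device. A sketch (program and device assumed to exist; includes and other error checks omitted; -cl-fast-relaxed-math shown only as an example build option):

cl_int err = clBuildProgram(program, 0, NULL, "-cl-fast-relaxed-math",
                            NULL, NULL);
if (err != CL_SUCCESS) {
    size_t logSize;
    // first call gets the log size, second call fills the buffer
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          0, NULL, &logSize);
    char* log = (char*)malloc(logSize);
    clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                          logSize, log, NULL);
    fprintf(stderr, "build log:\n%s\n", log);  // compiler diagnostics
    free(log);
}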

Kernel Objects (Section 5.5)

- Kernel objects encapsulate:
  - A specific kernel function declared in a program
  - The argument values used for kernel execution
- Creating kernel objects
  - clCreateKernel(): creates a kernel object for a single function in a program
  - clCreateKernelsInProgram(): creates objects for all kernels in a program
- Setting arguments
  - clSetKernelArg(kernel, arg_index, arg_size, arg_value)
  - Every argument of the kernel function must be set
  - Argument values are copied and stored in the kernel object
- Kernel vs. program objects
  - Kernels are related to program execution
  - Programs are related to program source

Kernel Execution (Section 5.6)

To execute, a kernel must be enqueued to the command-queue.
- clEnqueueNDRangeKernel()
  - Data-parallel execution model
  - Describes the index space for kernel execution
  - Requires information on NDRange dimensions and work-group size
- clEnqueueTask()
  - Task-parallel execution model (multiple queued tasks)
  - The kernel is executed on a single work-item
- clEnqueueNativeKernel()
  - Task-parallel execution model
  - Executes a native C/C++ function not compiled with the OpenCL compiler
  - This mode does not use a kernel object, so arguments must be passed in directly
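For instance, a hypothetical 2D launch (queue and kernel assumed to exist, sizes chosen only for illustration) would describe the index space like this:

size_t global[2] = { 1024, 768 };  // Gx, Gy: total work-items per dimension
size_t local[2]  = { 16, 16 };     // Sx, Sy: work-group size (must divide global)
cl_int err = clEnqueueNDRangeKernel(queue, kernel,
                                    2,      // work_dim
                                    NULL,   // global offset (must be NULL in 1.0)
                                    global, local,
                                    0, NULL, NULL);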

Command-Queues & Synchronization

- Command-queue execution
  - The execution model signals when commands are complete or data is ready
  - A command-queue can be explicitly flushed to the device
  - Command-queues execute in-order or out-of-order
    - In-order: commands complete in the order queued, and memory is consistent
    - Out-of-order: no guarantee of when commands execute or when memory is consistent without synchronization
- Synchronization
  - Signals command completion to the host or to other commands in the queue
  - Blocking calls: commands that do not return until complete; e.g. clEnqueueReadBuffer() can be called as blocking and will block until the read completes
  - Event objects: track the execution status of a command; some commands can be made to wait until event objects signal completion of a previous command; e.g. clEnqueueNDRangeKernel() can take event objects as arguments and wait until a previous command (such as clEnqueueWriteBuffer()) is complete
  - Profiling
  - Queue barriers: queued commands that block command execution
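A sketch of that event-based ordering, assuming queue, kernel, a device buffer bufA, and host data hostA of length bytes:

cl_int err;
cl_event writeDone;
// non-blocking write: returns immediately, signals writeDone when finished
err = clEnqueueWriteBuffer(queue, bufA, CL_FALSE, 0, bytes, hostA,
                           0, NULL, &writeDone);
// the kernel waits on writeDone, so it cannot start before the write completes
size_t globalSize = 1024;  // illustrative global work size
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL,
                             1, &writeDone, NULL);
clReleaseEvent(writeDone);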

OpenCL C for Compute Kernels (Chapter 6)

- Derived from ISO C99
  - A few restrictions: no recursion, no function pointers, no functions from the C99 standard headers, ...
  - Preprocessing directives defined by C99 are supported
- Built-in data types
  - Scalar and vector data types, pointers
  - Data-type conversion functions: convert_<type>()
  - Image types: image2d_t, image3d_t, and sampler_t
- Built-in functions, required
  - Work-item functions, math.h functions, image reads and writes
  - Relational, geometric, and synchronization functions
- Built-in functions, optional
  - Double precision, atomics to global and local memory
  - Selection of rounding mode, writes to image3d_t surfaces
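As a kernel-side sketch of the vector types and conversion built-ins (the kernel name and output buffer are only illustrative):

__kernel void vec_demo(__global float4* out)
{
    float4 a = (float4)(1.0f, 2.0f, 3.0f, 4.0f);  // vector literal
    float4 r = a.wzyx;                            // swizzle: (4, 3, 2, 1)
    int4   i = convert_int4(a);                   // explicit conversion built-in
    out[get_global_id(0)] = a + r + convert_float4(i);  // component-wise ops
}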

OpenCL C Language Highlights

- Function qualifiers
  - The __kernel qualifier declares a function as a kernel
  - Kernels can call other kernel functions
- Address space qualifiers
  - __global, __local, __constant, __private
  - Pointer kernel arguments must be declared with an address space qualifier
- Work-item functions
  - Query work-item identifiers: get_work_dim(), get_global_id(), get_local_id(), get_group_id()
- Image functions
  - Images must be accessed through built-in functions
  - Reads and writes are performed through sampler objects, created on the host or defined in the kernel source
- Synchronization functions
  - Barriers: all work-items within a work-group must execute the barrier function before any work-item can continue
  - Memory fences: provide ordering between memory operations
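Putting several of these together, a sketch of a per-work-group sum using a __local scratch buffer and a barrier (the __local argument's size would be supplied from the host via clSetKernelArg with a NULL value pointer):

__kernel void group_sum(__global const float* in,
                        __global float* out,
                        __local float* scratch)
{
    size_t lid = get_local_id(0);
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);  // all work-items in the group reach this point first
    if (lid == 0) {
        float s = 0.0f;
        for (size_t i = 0; i < get_local_size(0); ++i)
            s += scratch[i];
        out[get_group_id(0)] = s;  // one partial sum per work-group
    }
}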

Sample Walkthrough: oclVectorAdd

- Simple element-by-element vector addition: for all i, C(i) = A(i) + B(i)
- Outline:
  - Query compute devices
  - Create context and queue
  - Create memory objects associated with the context
  - Compile and create kernel program objects
  - Issue commands to the command-queue
  - Synchronization of commands
  - Clean up OpenCL resources

Kernel Code

// The source code for the computation kernel, built just-in-time at run time
// *********************************************************************
const char* cVectorAdd[] = {
    "__kernel void VectorAdd(",
    "    __global const float* a,",
    "    __global const float* b,",
    "    __global float* c)",
    "{",
    "    int iGID = get_global_id(0);",
    "    c[iGID] = a[iGID] + b[iGID];",
    "}"
};
const int SOURCE_NUM_LINES = sizeof(cVectorAdd) / sizeof(cVectorAdd[0]);

Declarations

cl_context cxMainContext;       // OpenCL context
cl_command_queue cqCommandQue;  // OpenCL command-queue
cl_device_id* cdDevices;        // OpenCL device list
cl_program cpProgram;           // OpenCL program
cl_kernel ckKernel;             // OpenCL kernel
cl_mem cmMemObjs[3];            // OpenCL memory buffer objects
cl_int ciErrNum = 0;            // error code variable
size_t szGlobalWorkSize[1];     // global number of work-items
size_t szLocalWorkSize[1];      // number of work-items per work-group
size_t szParmDataBytes;         // byte length of parameter storage
size_t szKernelLength;          // byte length of kernel code
int iTestN = 10000;             // length of demo test vectors

Contexts and Queues

// create the OpenCL context on a GPU device
cxMainContext = clCreateContextFromType(0, CL_DEVICE_TYPE_GPU,
                                        NULL, NULL, NULL);

// get the list of GPU devices associated with the context
clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, 0, NULL,
                 &szParmDataBytes);
cdDevices = (cl_device_id*)malloc(szParmDataBytes);
clGetContextInfo(cxMainContext, CL_CONTEXT_DEVICES, szParmDataBytes,
                 cdDevices, NULL);

// create a command-queue on the first device
cqCommandQue = clCreateCommandQueue(cxMainContext, cdDevices[0], 0, NULL);

Create Memory Objects

// allocate the first source buffer memory object ... source data, so read-only
cmMemObjs[0] = clCreateBuffer(cxMainContext,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(cl_float) * iTestN, srcA, NULL);

// allocate the second source buffer memory object ... source data, so read-only
cmMemObjs[1] = clCreateBuffer(cxMainContext,
                              CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                              sizeof(cl_float) * iTestN, srcB, NULL);

// allocate the destination buffer memory object ... result data, so write-only
cmMemObjs[2] = clCreateBuffer(cxMainContext, CL_MEM_WRITE_ONLY,
                              sizeof(cl_float) * iTestN, NULL, NULL);

Create Program and Kernel

// create the program, in this case from an OpenCL C source string array
cpProgram = clCreateProgramWithSource(cxMainContext, SOURCE_NUM_LINES,
                                      cVectorAdd, NULL, &ciErrNum);

// build the program
ciErrNum = clBuildProgram(cpProgram, 0, NULL, NULL, NULL, NULL);

// create the kernel
ckKernel = clCreateKernel(cpProgram, "VectorAdd", &ciErrNum);

// set the kernel argument values
ciErrNum  = clSetKernelArg(ckKernel, 0, sizeof(cl_mem), (void*)&cmMemObjs[0]);
ciErrNum |= clSetKernelArg(ckKernel, 1, sizeof(cl_mem), (void*)&cmMemObjs[1]);
ciErrNum |= clSetKernelArg(ckKernel, 2, sizeof(cl_mem), (void*)&cmMemObjs[2]);

Launch Kernel and Read Results

// set work-item dimensions
szGlobalWorkSize[0] = iTestN;
szLocalWorkSize[0] = 1;

// execute the kernel
ciErrNum = clEnqueueNDRangeKernel(cqCommandQue, ckKernel, 1, NULL,
                                  szGlobalWorkSize, szLocalWorkSize,
                                  0, NULL, NULL);

// read the output (blocking read, so this also acts as a sync point)
ciErrNum = clEnqueueReadBuffer(cqCommandQue, cmMemObjs[2], CL_TRUE, 0,
                               iTestN * sizeof(cl_float), dst,
                               0, NULL, NULL);

Cleanup

// release kernel, program, and memory objects
DeleteMemobjs(cmMemObjs, 3);  // sample helper that releases the three buffers
free(cdDevices);
clReleaseKernel(ckKernel);
clReleaseProgram(cpProgram);
clReleaseCommandQueue(cqCommandQue);
clReleaseContext(cxMainContext);

Questions?