Heterogeneous Computing with OpenCL Dr. Sergey Axyonov.


Overview
- What is OpenCL?
- Execution model
- Conceptual OpenCL device architecture
- Program execution sequence
- Kernel functions and examples

1. OpenCL allows you to:
- Use different processors (CPUs and GPUs) to accelerate parallel computation
- Get speedups for computationally intensive applications
- Write portable code across different devices and architectures

2. OpenCL-to-CUDA data-parallelism model mapping

OpenCL Parallelism Concept | CUDA Equivalent
kernel                     | kernel
host program               | host program
NDRange (index space)      | grid
work item                  | thread
work group                 | block

3. OpenCL execution model

4. Mapping of OpenCL dimensions and indices to CUDA

OpenCL API Call     | Explanation                                                           | CUDA Equivalent
get_global_id(0)    | global index of the work item in the x dimension                      | blockIdx.x*blockDim.x + threadIdx.x
get_local_id(0)     | local index of the work item within the work group in the x dimension | threadIdx.x
get_global_size(0)  | size of the NDRange in the x dimension                                | gridDim.x*blockDim.x
get_local_size(0)   | size of each work group in the x dimension                            | blockDim.x

5. Conceptual OpenCL Device Architecture

6. Mapping OpenCL memory types to CUDA

OpenCL Memory Type | CUDA Equivalent
global memory      | global memory
constant memory    | constant memory
local memory       | shared memory
private memory     | local memory

7. Program execution sequence

Set up:
- Set work sizes for kernel execution
- Allocate and initialize host data buffers
- Create context for device
- Query compute devices
- Create command queue
- Create buffers on the device
- Create and build program
- Create kernel and set its arguments

Core sequence:
- Copy data from host to device
- Launch kernel in command queue
- Copy data from device to host

Clean up

8. OpenCL context for device management

9. Useful functions

To get the list of platforms available:
cl_int clGetPlatformIDs(cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms)

To get the list of devices available on a platform:
cl_int clGetDeviceIDs(cl_platform_id platform, cl_device_type device_type, cl_uint num_entries, cl_device_id *devices, cl_uint *num_devices)

10. Kernel example

__kernel void vectorAdd(__global const float *a,
                        __global const float *b,
                        __global float *result)
{
    int id = get_global_id(0);
    result[id] = a[id] + b[id];
}

11. Kernel storage: the source can be kept in a file or embedded in the host program as an array of strings (char **):

const char *simple_kernel[] = {
    "__kernel void vectorAdd(__global const float *a, \n",
    "__global const float *b, __global float *result) \n",
    "{\n",
    "int id = get_global_id(0);\n",
    "result[id] = a[id] + b[id];\n",
    "}\n"
};

12. Host code: context & program

cl_context context = clCreateContext(NULL, 1, devices, NULL, NULL, &err);

cl_program program = clCreateProgramWithSource(context,
    sizeof(program_source) / sizeof(*program_source),
    program_source, NULL, &err);
clBuildProgram(program, 1, devices, NULL, NULL, NULL); /* build before creating kernels */
clUnloadCompiler();

13. Host code: memory objects

cl_mem input_Abuffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
    sizeof(int)*NUM_DATA, NULL, &err);
cl_mem input_Bbuffer = clCreateBuffer(context, CL_MEM_READ_ONLY,
    sizeof(int)*NUM_DATA, NULL, &err);
cl_mem output_buffer = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
    sizeof(int)*NUM_DATA, NULL, &err);

14. Host code: kernel & queue

cl_kernel kernel = clCreateKernel(program, "vectorAdd", &err); /* name must match the __kernel function */
clSetKernelArg(kernel, 0, sizeof(input_Abuffer), &input_Abuffer);
clSetKernelArg(kernel, 1, sizeof(input_Bbuffer), &input_Bbuffer);
clSetKernelArg(kernel, 2, sizeof(output_buffer), &output_buffer);

cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, &err);

15. Host code: copy data from host to device and back

clEnqueueWriteBuffer(queue, input_Abuffer, CL_TRUE, 0,
    NUM_DATA*sizeof(int), sourceA, 0, NULL, NULL);
clEnqueueReadBuffer(queue, output_buffer, CL_TRUE, 0,
    NUM_DATA*sizeof(int), results, 0, NULL, NULL);

16. Host code: kernel launch

cl_event kernel_completion;
size_t global_work_size[1] = { NUM_DATA };
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size,
    NULL, 0, NULL, &kernel_completion);
clWaitForEvents(1, &kernel_completion);
clReleaseEvent(kernel_completion);

17. Host code: clean up

clReleaseMemObject(input_Abuffer);
clReleaseMemObject(input_Bbuffer);
clReleaseMemObject(output_buffer);
clReleaseKernel(kernel);
clReleaseProgram(program);
clReleaseCommandQueue(queue); /* release the queue as well */
clReleaseContext(context);