Introduction to CUDA Programming Introduction to OpenCL Andreas Moshovos Spring 2011 Based on:

What is OpenCL
Open Computing Language
–A framework for writing programs
Target: Heterogeneous Platforms
–CPUs, GPUs, other accelerators/processors
Many companies/entities behind it
Looks a lot like CUDA
OpenCL reference pages: sdk/1.0/docs/man/xhtml/index.html

Devices and Host
Similar to CUDA:
–The host runs the main program
–Devices run kernels (the entry points to device-executed code)
SIMT model (Single Instruction, Multiple Threads):
–Same static code per thread
–Threads may follow different paths
–Threads may access different data

Kernel Structure
Work-item ↔ CUDA thread
–Smallest execution entity
–Has an ID
Work-group ↔ CUDA block
–Has an ID
ND-Range
–N-dimensional index space of work-items, organized into work-groups

Kernel Computation Structure

Kernel Example

__kernel void vector_add_gpu(__global const float* src_a,
                             __global const float* src_b,
                             __global float* res,
                             const int num)
{
    /* get_global_id(0) returns the ID of the executing work-item.
       Many work-items are launched at the same time, all executing the
       same kernel; each receives a different ID and therefore performs
       a different part of the computation. */
    const int idx = get_global_id(0);

    /* Each work-item asks: "is my ID inside the vector's range?"
       If yes, it performs the corresponding addition. */
    if (idx < num)
        res[idx] = src_a[idx] + src_b[idx];
}

Creating a Basic OpenCL Run-Time Environment
Step #1: Allocate a platform object
Platform:
–Host + devices
–Allows an application to share resources and execute kernels

// Returns the error code
cl_int oclGetPlatformID(cl_platform_id *platforms); // Pointer to the platform object

(oclGetPlatformID is a convenience wrapper from the NVIDIA OpenCL SDK; the standard API call is clGetPlatformIDs.)

#2: Initialize Devices

cl_int clGetDeviceIDs(
    cl_platform_id  platform,
    cl_device_type  device_type, // Bit-field identifying the type; for the GPU use CL_DEVICE_TYPE_GPU
    cl_uint         num_entries, // Number of devices, typically 1
    cl_device_id   *devices,     // Pointer to the device object(s)
    cl_uint        *num_devices); // Receives the number of devices matching device_type

#3: Context
The OpenCL environment:
–Kernels, devices, memory management, command queues

// Returns the context
cl_context clCreateContext(
    const cl_context_properties *properties, // Bit-field with the properties (see specification)
    cl_uint num_devices,                     // Number of devices
    const cl_device_id *devices,             // Pointer to the devices object
    void (*pfn_notify)(const char *errinfo, const void *private_info,
                       size_t cb, void *user_data), // error callback (don't worry about this)
    void *user_data,                         // (don't worry about this)
    cl_int *errcode_ret);                    // error code result

#4: Command Queue
One per device; preserves command ordering; multiple queues are allowed.

cl_command_queue clCreateCommandQueue(
    cl_context context,
    cl_device_id device,
    cl_command_queue_properties properties, // Bit-field with the properties
    cl_int *errcode_ret);                   // error code result

Initialization: Complete Example

Memory Allocation

// Returns the cl_mem object referencing memory allocated on the device
cl_mem clCreateBuffer(
    cl_context context,   // The context where the memory will be allocated
    cl_mem_flags flags,
    size_t size,          // The size in bytes
    void *host_ptr,
    cl_int *errcode_ret);

Memory Flags
flags is a bit vector; the options are:
–CL_MEM_READ_WRITE
–CL_MEM_WRITE_ONLY
–CL_MEM_READ_ONLY
–CL_MEM_USE_HOST_PTR: the device may access the host memory directly or copy it
–CL_MEM_ALLOC_HOST_PTR: allocate host-accessible memory; contains no valid data yet
–CL_MEM_COPY_HOST_PTR: copies the memory pointed to by host_ptr
Flags can be combined, e.g.:
–CL_MEM_ALLOC_HOST_PTR | CL_MEM_COPY_HOST_PTR: allocate on the host and initialize it with the data copied from the host pointer

Back to our kernel example

const int mem_size = sizeof(float) * size;

// Allocates a buffer of mem_size bytes and copies mem_size bytes from src_a_h
cl_mem src_a_d = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                mem_size, src_a_h, &error);
cl_mem src_b_d = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                mem_size, src_b_h, &error);
cl_mem res_d   = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                                mem_size, NULL, &error);

Program and Kernels
Kernels:
–Functions that the host calls and that run on devices
–Compiled at run time
Program:
–Created at run time; the result of compiling the kernel sources
–Contains kernels, functions, and declarations
–You must specify which source files compose the program

#5: Creating a Program

// Returns the OpenCL program
cl_program clCreateProgramWithSource(
    cl_context context,
    cl_uint count,         // number of source strings/files
    const char **strings,  // array of strings, each one a source file
    const size_t *lengths, // array specifying the string lengths
    cl_int *errcode_ret);  // error code to be returned

#6: Compiling the Program

cl_int clBuildProgram(
    cl_program program,
    cl_uint num_devices,
    const cl_device_id *device_list,
    const char *options, // Compiler options; see the specification for details
    void (*pfn_notify)(cl_program, void *user_data), // NULL: block and wait; otherwise, callback when the compilation has finished
    void *user_data);    // passed to pfn_notify

#7: Compiler Log

cl_int clGetProgramBuildInfo(
    cl_program program,
    cl_device_id device,
    cl_program_build_info param_name, // The parameter we want to query, e.g., CL_PROGRAM_BUILD_LOG
    size_t param_value_size,
    void *param_value,                // The answer
    size_t *param_value_size_ret);

#8: Get the Kernel Entry Points

cl_kernel clCreateKernel(
    cl_program program,      // The program containing the kernel
    const char *kernel_name, // The name of the kernel function as it is declared in the code
    cl_int *errcode_ret);

Back to our example

#9: Preparing to Launch the Kernel -- Arguments
One call per argument:

cl_int clSetKernelArg(
    cl_kernel kernel,       // Which kernel
    cl_uint arg_index,      // Which argument
    size_t arg_size,        // Size of the argument (not of the value pointed to by it!)
    const void *arg_value); // Pointer to the value

#10: Calling the Kernel

cl_int clEnqueueNDRangeKernel(
    cl_command_queue command_queue,
    cl_kernel kernel,
    cl_uint work_dim, // 1, 2 or 3: the dimensionality of the work-items and work-groups
    const size_t *global_work_offset, // must be NULL (reserved: ID grid origin in the future)
    const size_t *global_work_size,   // The total number of work-items (must have work_dim dimensions)
    const size_t *local_work_size,    // The number of work-items per work-group (must have work_dim dimensions)
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);

Back to our example: Calling the Kernel

#11: Reading the Results

cl_int clEnqueueReadBuffer(
    cl_command_queue command_queue,
    cl_mem buffer,         // buffer to read from
    cl_bool blocking_read, // whether it is a blocking or non-blocking read
    size_t offset,         // offset from the beginning
    size_t cb,             // size to be read (in bytes)
    void *ptr,             // pointer to the host memory
    cl_uint num_events_in_wait_list,
    const cl_event *event_wait_list,
    cl_event *event);

For our Example

#12: Cleaning up

More info OpenCL reference pages: sdk/1.0/docs/man/xhtml/index.html