
1 ITCS 4/5010 CUDA Programming, UNC-Charlotte, B. Wilkinson, Feb 28, 2013, OpenCL.ppt
OpenCL
These notes will introduce OpenCL.

2 OpenCL (Open Computing Language)
A standard based on C for portable parallel applications.
Focuses on multi-platform support (multiple CPUs, GPUs, ...).
Supports task-parallel and data-parallel applications.
Very similar to CUDA, but a little more complicated because it handles heterogeneous platforms.
Initiated by Apple. Developed by the Khronos Group, which also manages OpenGL. Now adopted by Intel, AMD, NVIDIA, ...
OpenCL 1.0 released in 2008 with Mac OS X 10.6 (Snow Leopard). Most recent: OpenCL 1.2, Nov 2011.
Implementation available for NVIDIA GPUs.
Wikipedia "OpenCL": http://en.wikipedia.org/wiki/OpenCL
http://www.khronos.org/opencl/

3 OpenCL Programming Model
Uses a data-parallel programming model, similar to CUDA.
The host program launches kernel routines as in CUDA, but allows for just-in-time compilation during host execution.
OpenCL "work items" correspond to CUDA threads.
OpenCL "work groups" correspond to CUDA thread blocks.
An OpenCL "NDRange" corresponds to a CUDA grid.
Work items in the same work group can be synchronized with a barrier, as in CUDA (see the kernel sketch below).
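As a minimal sketch of work-group synchronization (a hypothetical kernel, not from these notes), each work item can stage one element in __local memory, call barrier(), and then safely read elements written by other work items in the same work group:

__kernel void reverse_in_group(__global int* data, __local int* tmp)
{
    int lid = get_local_id(0);      // index within the work group
    int gid = get_global_id(0);     // index within the NDRange
    int lsz = get_local_size(0);    // work-group size
    tmp[lid] = data[gid];           // each work item stages one element
    barrier(CLK_LOCAL_MEM_FENCE);   // wait until the whole work group has written tmp
    data[gid] = tmp[lsz - 1 - lid]; // safe: all writes to tmp are now visible
}

The __local buffer would be supplied from the host with clSetKernelArg(kernel, 1, lsz*sizeof(int), NULL).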

4 Sample OpenCL code to add two vectors
To illustrate OpenCL commands:
Add two vectors, A and B, to produce C.
A and B are transferred to the device (GPU).
The result, C, is returned to the host (CPU).
Similar to the CUDA vector addition example.

5 Structure of OpenCL main program
1. Get information about platform and devices available on system
2. Select devices to use - context
3. Create an OpenCL command queue
4. Create memory buffers on device
5. Transfer data from host to device memory buffers
6. Create kernel program object
7. Build (compile) kernel in-line (or load precompiled binary)
8. Create OpenCL kernel object
9. Set kernel arguments
10. Execute kernel
11. Read kernel memory and copy to host memory

6 1. Platform
"The host plus a collection of devices managed by the OpenCL framework that allow an application to share resources and execute kernels on devices in the platform."
Platforms are represented by a cl_platform_id, obtained with clGetPlatformIDs().
http://opencl.codeplex.com/wikipage?title=OpenCL%20Tutorials%20-%201

7 clGetPlatformIDs: obtain the list of platforms available.
cl_int clGetPlatformIDs(cl_uint num_entries, cl_platform_id *platforms, cl_uint *num_platforms)
Parameters:
num_entries: Number of cl_platform_id entries that can be added to platforms. If platforms is not NULL, num_entries must be greater than zero.
platforms: Returns the list of OpenCL platforms found. The cl_platform_id values returned can be used to identify a specific OpenCL platform. If platforms is NULL, this argument is ignored. The number of OpenCL platforms returned is the minimum of the value specified by num_entries and the number of OpenCL platforms available.
num_platforms: Returns the number of OpenCL platforms available. If num_platforms is NULL, this argument is ignored.
http://www.khronos.org/registry/cl/sdk/1.1/docs/man/xhtml/

8 Simple code for identifying the platform
// Platform
cl_platform_id platform;
clGetPlatformIDs(1, &platform, NULL);
The first argument is the number of platform entries requested (1 here); the second receives the list of OpenCL platform IDs found (just one platform in our case, returned in &platform); the third would return the number of OpenCL platforms available and is ignored when NULL.
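A common variant (a sketch, not on the slide) first asks how many platforms exist and then retrieves them all:

cl_uint num_platforms;
clGetPlatformIDs(0, NULL, &num_platforms);          // query only the count
cl_platform_id *platforms = (cl_platform_id*)malloc(num_platforms * sizeof(cl_platform_id));
clGetPlatformIDs(num_platforms, platforms, NULL);   // fill the list
// individual platforms can then be examined with clGetPlatformInfo()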

9 9 2. Context “The environment within which the kernels execute and the domain in which synchronization and memory management is defined. The context includes a set of devices, the memory accessible to those devices, the corresponding memory properties and one or more command-queues used to schedule execution of a kernel(s) or operations on memory objects.” The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf

10 Code for context
//Context
cl_context_properties props[3];
props[0] = (cl_context_properties) CL_CONTEXT_PLATFORM;
props[1] = (cl_context_properties) platform;
props[2] = (cl_context_properties) 0;
cl_context GPUContext = clCreateContextFromType(props, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);
//Context info
size_t ParmDataBytes;
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, 0, NULL, &ParmDataBytes);
cl_device_id* GPUDevices = (cl_device_id*)malloc(ParmDataBytes);
clGetContextInfo(GPUContext, CL_CONTEXT_DEVICES, ParmDataBytes, GPUDevices, NULL);

11 11 3. Command Queue “An object that holds commands that will be executed on a specific device. The command-queue is created on a specific device in a context. Commands to a command-queue are queued in-order but may be executed in-order or out-of-order....” The OpenCL Specification version 1.1 http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf

12 Simple code for creating a command queue
// Create command-queue
cl_command_queue GPUCommandQueue = clCreateCommandQueue(GPUContext, GPUDevices[0], 0, NULL);

13 4. Allocating memory on device
Use clCreateBuffer():
cl_mem clCreateBuffer(cl_context context, cl_mem_flags flags, size_t size, void *host_ptr, cl_int *errcode_ret)
context: the OpenCL context, from clCreateContextFromType()
flags: bit field to specify the type of allocation/usage (CL_MEM_READ_WRITE, ...)
size: number of bytes in the buffer memory object
host_ptr: pointer to the buffer data (may be previously allocated)
errcode_ret: returns an error code if there is an error
Returns a memory object.

14 Sample code for allocating memory on device for source data
// source data on host, two vectors
int *A, *B;
A = new int[N];
B = new int[N];
for (int i = 0; i < N; i++) {
    A[i] = rand() % 1000;
    B[i] = rand() % 1000;
}
...
// Allocate GPU memory for source vectors
cl_mem GPUVector1 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int)*N, A, NULL);
cl_mem GPUVector2 = clCreateBuffer(GPUContext, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, sizeof(int)*N, B, NULL);

15 Sample code for allocating memory on device for results on GPU
// Allocate GPU memory for output vector
cl_mem GPUOutputVector = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(int)*N, NULL, NULL);

16 6. Kernel program
Simple programs might place the kernel in the same file as the host code (as in our CUDA examples). In that case the kernel needs to be formed into strings in a character array.
If the kernel is in a separate file, that file can be read into the host program as a character string.

17 Kernel program
const char* OpenCLSource[] = {
    "__kernel void vectorAdd (const __global int* a,",
    "                         const __global int* b,",
    "                         __global int* c)",
    "{",
    "    unsigned int gid = get_global_id(0);",
    "    c[gid] = a[gid] + b[gid];",
    "}"
};
...
int main(int argc, char **argv) {
    ...
}
If in the same file as the host program, the kernel needs to be given as strings (it can also be a single string).
__kernel is the OpenCL qualifier that marks kernel code; __global marks kernel memory (memory objects allocated from the global memory pool). The double underscores are optional in OpenCL qualifiers.
get_global_id(0) returns the global work-item ID in the given dimension (0 here).

18 Kernel in a separate file
// Load the kernel source code into the array source_str
#define MAX_SOURCE_SIZE (0x100000)   // maximum kernel source size in bytes
FILE *fp;
char *source_str;
size_t source_size;
fp = fopen("vector_add_kernel.cl", "r");
if (!fp) {
    fprintf(stderr, "Failed to load kernel.\n");
    exit(1);
}
source_str = (char*)malloc(MAX_SOURCE_SIZE);
source_size = fread(source_str, 1, MAX_SOURCE_SIZE, fp);
fclose(fp);
http://mywiki-science.wikispaces.com/OpenCL

19 Create kernel program object
const char* OpenCLSource[] = { ... };
int main(int argc, char **argv)
...
// Create OpenCL program object
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 7, OpenCLSource, NULL, NULL);
The second argument (7) is the number of strings in the kernel program array. The fourth argument gives the lengths of the strings and is only needed if the strings are not null-terminated (NULL here). The last argument returns an error code if there is an error.
This example uses a single file for both host and kernel code. clCreateProgramWithSource() can also be used with a separate kernel file read into the host program as a string.
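If the kernel source was instead read from a separate file as on slide 18, the same call can be given that single string and its length; a sketch under that assumption (source_str and source_size come from the fread code above):

cl_int err;
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 1,
                               (const char **)&source_str, &source_size, &err);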

20 7. Build kernel program
// Build the program (OpenCL JIT compilation)
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
Arguments: the program object from clCreateProgramWithSource(); the number of devices; the list of devices (if more than one); the build options; a function pointer to a notification routine called when the build is complete (if supplied, clBuildProgram returns immediately, otherwise it returns only when the build is complete); and arguments for the notification routine.
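If the build fails, the compiler messages can be retrieved with clGetProgramBuildInfo(); a minimal sketch (not on the slide), reusing GPUDevices[0] from the context code:

if (clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL) != CL_SUCCESS) {
    size_t log_size;
    clGetProgramBuildInfo(OpenCLProgram, GPUDevices[0], CL_PROGRAM_BUILD_LOG, 0, NULL, &log_size);
    char *log = (char*)malloc(log_size);
    clGetProgramBuildInfo(OpenCLProgram, GPUDevices[0], CL_PROGRAM_BUILD_LOG, log_size, log, NULL);
    fprintf(stderr, "Build log:\n%s\n", log);   // kernel compile errors appear here
    free(log);
}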

21 8. Creating kernel objects
// Create a handle to the compiled OpenCL function
cl_kernel OpenCLVectorAdd = clCreateKernel(OpenCLProgram, "vectorAdd", NULL);
Arguments: the built program from clBuildProgram(); the function name given the __kernel qualifier; and a pointer that returns an error code (NULL here).

22 9. Set kernel arguments
// Set kernel arguments
clSetKernelArg(OpenCLVectorAdd, 0, sizeof(cl_mem), (void*)&GPUVector1);
clSetKernelArg(OpenCLVectorAdd, 1, sizeof(cl_mem), (void*)&GPUVector2);
clSetKernelArg(OpenCLVectorAdd, 2, sizeof(cl_mem), (void*)&GPUOutputVector);
Arguments: the kernel object from clCreateKernel(); which argument (its index); the size of the argument; and a pointer to the data for the argument, here the memory objects from clCreateBuffer().

23 10. Enqueue command to execute kernel on device
// Launch the kernel
size_t WorkSize[1] = {N};          // Total number of work items
size_t localWorkSize[1] = {256};   // Number of work items in a work group
clEnqueueNDRangeKernel(GPUCommandQueue, OpenCLVectorAdd, 1, NULL, WorkSize, localWorkSize, 0, NULL, NULL);
Arguments: the command queue; the kernel object from clCreateKernel(); the number of dimensions of the work items (1 here); an offset used with the work-item IDs (NULL here); an array giving the total number of global work items; an array giving the number of work items that make up a work group; the number of events to complete before this command; the event wait list; and an event object for this command.
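Note that in OpenCL 1.x the global work size must be a multiple of the work-group size, so this launch assumes N is a multiple of 256. A common alternative (a sketch, not on the slide) rounds the global size up instead, with the kernel guarding against the extra work items:

size_t localWorkSize[1] = {256};
size_t WorkSize[1] = { ((N + 255) / 256) * 256 };   // round N up to a multiple of 256
// the kernel then needs a bound check, e.g.  if (gid < N) c[gid] = a[gid] + b[gid];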

24 Function to copy from host memory to a buffer object
The following function enqueues a command to write to a buffer object from host memory:
cl_int clEnqueueWriteBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_write, size_t offset, size_t cb, const void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
The OpenCL Specification version 1.1, http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
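In the vector-addition example above the source data was copied at buffer-creation time via CL_MEM_COPY_HOST_PTR, so no explicit write is needed; an explicit blocking write of A into GPUVector1 would look like this (a sketch using the names from the earlier slides):

clEnqueueWriteBuffer(GPUCommandQueue, GPUVector1, CL_TRUE, 0, sizeof(int)*N, A, 0, NULL, NULL);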

25 Function to copy from a buffer object to host memory
The following function enqueues a command to read from a buffer object into host memory:
cl_int clEnqueueReadBuffer(cl_command_queue command_queue, cl_mem buffer, cl_bool blocking_read, size_t offset, size_t cb, void *ptr, cl_uint num_events_in_wait_list, const cl_event *event_wait_list, cl_event *event)
The OpenCL Specification version 1.1, http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf

26 11. Copy data back from kernel
// Copy the output back to CPU memory
int *C;
C = new int[N];
clEnqueueReadBuffer(GPUCommandQueue, GPUOutputVector, CL_TRUE, 0, N*sizeof(int), C, 0, NULL, NULL);
Arguments: the command queue from clCreateCommandQueue(); the device buffer from clCreateBuffer(); CL_TRUE to make the read blocking; the byte offset in the buffer; the size of the data to read in bytes; a pointer to the host buffer that receives the data; the number of events to complete before this command; the event wait list; and an event object.

27 Results from GPU (C++ here)
cout << "C[" << 0 << "]: " << A[0] << "+" << B[0] << "=" << C[0] << "\n";
cout << "C[" << N-1 << "]: " << A[N-1] << "+" << B[N-1] << "=" << C[N-1] << "\n";

28 Clean-up
// Cleanup
free(GPUDevices);
clReleaseKernel(OpenCLVectorAdd);
clReleaseProgram(OpenCLProgram);
clReleaseCommandQueue(GPUCommandQueue);
clReleaseContext(GPUContext);
clReleaseMemObject(GPUVector1);
clReleaseMemObject(GPUVector2);
clReleaseMemObject(GPUOutputVector);

29 Compiling
Need the OpenCL header:
#include <CL/cl.h>
(For Mac: #include <OpenCL/opencl.h>)
and link to the OpenCL library.
Compile the OpenCL host program main.c using gcc in two phases:
gcc -c -I /path-to-include-dir-with-cl.h/ main.c -o main.o
gcc -L /path-to-lib-folder-with-OpenCL-libfile/ -l OpenCL main.o -o host
Ref: http://www.thebigblob.com/getting-started-with-opencl-and-gpu-computing/

30 Makefile (program called scalarmulocl)
CC = g++
LD = g++ -lm
CFLAGS = -Wall -shared
CDEBUG =
LIBOCL = -L/nfs-home/mmishra2/NVIDIA_GPU_Computing_SDK/OpenCL/common/lib
INCOCL = -I/nfs-home/mmishra2/NVIDIA_GPU_Computing_SDK/OpenCL/common/inc
SRCS = scalarmulocl.cpp
OBJS = scalarmulocl.o
EXE = scalarmulocl.a

all: $(EXE)

$(OBJS): $(SRCS)
	$(CC) $(CFLAGS) $(INCOCL) -I/usr/include -c $(SRCS)

$(EXE): $(OBJS)
	$(LD) -L/usr/local/lib $(OBJS) $(LIBOCL) -o $(EXE) -l OpenCL

clean:
	rm -f $(OBJS) *~
	clear

References: http://mywiki-science.wikispaces.com/OpenCL
Submitted by: Manisha Mishra

31 Includes
#include <CL/cl.h>   // OpenCL header for C
#include <iostream>  // C++ input/output
using namespace std;

32 Another OpenCL program to add two vectors: kernel code
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/cl.h>

// OpenCL kernel. Each work item takes care of one element of c.
// The first line of the kernel source enables double precision values.
const char *kernelSource =                                  "\n" \
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable               \n" \
"__kernel void vecAdd( __global double *a,                   \n" \
"                      __global double *b,                   \n" \
"                      __global double *c,                   \n" \
"                      const unsigned int n)                 \n" \
"{                                                           \n" \
"    //Get our global thread ID                              \n" \
"    int id = get_global_id(0);                              \n" \
"                                                            \n" \
"    //Make sure we do not go out of bounds                  \n" \
"    if (id < n)                                             \n" \
"        c[id] = a[id] + b[id];                              \n" \
"}                                                           \n" \
"\n" ;

http://www.olcf.ornl.gov/tutorials/opencl-vector-addition/

33
int main( int argc, char* argv[] )
{
    // Length of vectors
    unsigned int n = 100000;

    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;

    // Device input buffers
    cl_mem d_a;
    cl_mem d_b;
    // Device output buffer
    cl_mem d_c;

    cl_platform_id cpPlatform;   // OpenCL platform
    cl_device_id device_id;      // device ID
    cl_context context;          // context
    cl_command_queue queue;      // command queue
    cl_program program;          // program
    cl_kernel kernel;            // kernel

    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*)malloc(bytes);
    h_b = (double*)malloc(bytes);
    h_c = (double*)malloc(bytes);

    // Initialize vectors on host
    int i;
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    size_t globalSize, localSize;
    cl_int err;

    // Number of work items in each local work group
    localSize = 64;

    // Number of total work items - localSize must be a divisor
    globalSize = ceil(n/(float)localSize)*localSize;

    // Bind to platform
    err = clGetPlatformIDs(1, &cpPlatform, NULL);

    // Get ID for the device
    err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);

    // Create a context
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);

    // Create a command queue
    queue = clCreateCommandQueue(context, device_id, 0, &err);

34
    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1, (const char **) &kernelSource, NULL, &err);

    // Build the program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "vecAdd", &err);

    // Create the input and output arrays in device memory for our calculation
    d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);

    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0, bytes, h_a, 0, NULL, NULL);
    err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0, bytes, h_b, 0, NULL, NULL);

    // Set the arguments to our compute kernel
    err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
    err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
    err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
    err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);

    // Execute the kernel over the entire range of the data set
    err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);

    // Wait for the command queue to get serviced before reading back results
    clFinish(queue);

    // Read the results from the device
    clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, bytes, h_c, 0, NULL, NULL);

    // Sum up vector c and print result divided by n; this should equal 1 within error
    double sum = 0;
    for(i = 0; i < n; i++)
        sum += h_c[i];
    printf("final result: %f\n", sum/n);

    // Release OpenCL resources
    clReleaseMemObject(d_a);
    clReleaseMemObject(d_b);
    clReleaseMemObject(d_c);
    clReleaseProgram(program);
    clReleaseKernel(kernel);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);

    // Release host memory
    free(h_a);
    free(h_b);
    free(h_c);

    return 0;
}

35 Questions

36 More Information
Chapter 11 of Programming Massively Parallel Processors by D. B. Kirk and W-M W. Hwu, Morgan Kaufmann, 2010.

