OpenCL Introduction
AN EXAMPLE FOR OPENCL
LU, OCT. 11, 2014



OPENCL INTRODUCTION | APRIL 11, 2014

CONTENTS
1. Environment Configuration
2. Case Analyzing

1. ENVIRONMENT CONFIGURATION

ENVIRONMENT CONFIGURATION
 IDE
–Any IDE for C/C++ can be used with OpenCL. We use Microsoft Visual Studio.
 Settings for projects that use OpenCL:
–Add the include path of the SDK to Additional Include Directories.
–Add the library path of the SDK to Additional Library Directories.

ENVIRONMENT CONFIGURATION
 Include Directory

ENVIRONMENT CONFIGURATION
 Lib Directory

ENVIRONMENT CONFIGURATION
 OpenCL Lib

2. CASE ANALYZING

CASE ANALYZING
1. Problem Description
2. Algorithm
3. Calculation Features
4. Parallelizing
5. Programming
   1. Kernel
   2. Host
6. Tools
   1. AMD Profiler
   2. gDEBugger

PROBLEM DESCRIPTION
 Input: an image, the rotation center, and the rotation angle.
 Output: the rotated image, with the same size as the input (original) image.
(Figure: the original image and the rotated result.)

ALGORITHM
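The equation content of this slide is image-only in the transcript. Reconstructed from the kernel code later in the deck, the inverse mapping (rotation about the origin by angle θ, evaluated per output pixel) is:

```latex
x_{\text{src}} = x \cos\theta + y \sin\theta, \qquad
y_{\text{src}} = y \cos\theta - x \sin\theta
```

Each output pixel (x, y) reads its value from source position (x_src, y_src), which is the output-image decomposition described on the Parallelizing slide.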

CALCULATION FEATURES
 The calculation for each point is the same and independent.
 There is a large number of points.
 So the problem is well suited to parallel computing on a GPU.

PARALLELIZING
 With the OpenCL framework, assign one work-item to the calculation for each point.
 There are two ways to implement the algorithm:
–Assign work-items over the original image: each work-item calculates its point's new position and copies the value to the output image. This causes write-memory conflicts.
–Assign work-items over the output image: each work-item calculates its point's source position and copies the value from the original image. This causes read-memory conflicts.

PROGRAMMING
1. Kernel – runs on the GPU.
2. Host – runs on the CPU.

KERNEL

__kernel void image_rotate(
    __global float *src_data, __global float *dest_data, // Data in global memory
    int W, int H,                                        // Image dimensions
    float sinTheta, float cosTheta)                      // Rotation parameters
{
    // Each work-item gets its index within the index space
    const int ix = get_global_id(0);
    const int iy = get_global_id(1);

    // Calculate the source location for (ix, iy) -- output decomposition as mentioned
    float xpos = ((float)ix) * cosTheta + ((float)iy) * sinTheta;
    float ypos = ((float)iy) * cosTheta - ((float)ix) * sinTheta;

    // Bounds checking: only copy when the source position lies inside the image
    if (((int)xpos >= 0) && ((int)xpos < W) && ((int)ypos >= 0) && ((int)ypos < H))
    {
        // Read (xpos, ypos) from src_data and store at (ix, iy) in dest_data
        dest_data[iy * W + ix] = src_data[(int)ypos * W + (int)xpos];
    }
}
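As a sanity check, the same per-pixel mapping can be run on the CPU and compared against the kernel's output. The sketch below is a plain-C reference version (a hypothetical helper, not part of the original deck); like the kernel as shown, it rotates about the origin rather than an arbitrary center:

```c
#include <assert.h>

/* CPU reference for the image_rotate kernel: the same inverse mapping,
 * looping over output pixels instead of launching one work-item each. */
static void image_rotate_ref(const float *src, float *dest,
                             int W, int H,
                             float sinTheta, float cosTheta)
{
    for (int iy = 0; iy < H; ++iy) {
        for (int ix = 0; ix < W; ++ix) {
            /* Source position for output pixel (ix, iy) */
            float xpos = (float)ix * cosTheta + (float)iy * sinTheta;
            float ypos = (float)iy * cosTheta - (float)ix * sinTheta;

            /* Bounds check: only copy when the source lies inside the image */
            if ((int)xpos >= 0 && (int)xpos < W &&
                (int)ypos >= 0 && (int)ypos < H) {
                dest[iy * W + ix] = src[(int)ypos * W + (int)xpos];
            }
        }
    }
}
```

Rotating by θ = 0 (sinθ = 0, cosθ = 1) must reproduce the input exactly, which makes a convenient first test before comparing GPU results.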


KERNEL
 KernelAnalyzer

KERNEL
 KernelAnalyzer
–The bottleneck is ALU operations.
–This means the kernel's main work is computation, not data transfer.
–So this kernel has high performance.

HOST

Platform:
  Query Platform
  Query Devices
  Create Context
  Create Command Queue
Compiler:
  Compile Program
  Create Kernel
Runtime:
  Create Buffers
  Write Buffers
  Set Kernel Arguments
  Run Kernel
  Read Buffers

HOST
 Query Platform

cl_int clGetPlatformIDs(cl_uint num_entries,
                        cl_platform_id *platforms,
                        cl_uint *num_platforms)

–This function is usually called twice: the first call gets the number of platforms, and the second call gets the platforms themselves.
–First call: clGetPlatformIDs(0, NULL, &num)
–Second call: clGetPlatformIDs(num, platforms, NULL)
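The two-call pattern above can be sketched in host code. Since a machine without an OpenCL SDK cannot link against the real library, the snippet below stubs clGetPlatformIDs with a fake two-platform implementation purely to illustrate the calling convention; with a real SDK you would instead include <CL/cl.h> and link the OpenCL library:

```c
#include <assert.h>
#include <stdlib.h>
#include <stdint.h>

/* Stand-in types and a stubbed clGetPlatformIDs (assumption: this stub
 * replaces the real entry point so the pattern runs without an OpenCL
 * runtime; it pretends the system exposes two platforms). */
typedef unsigned int cl_uint;
typedef int cl_int;
typedef struct _cl_platform *cl_platform_id;
enum { CL_SUCCESS = 0 };

static cl_int clGetPlatformIDs(cl_uint num_entries, cl_platform_id *platforms,
                               cl_uint *num_platforms)
{
    const cl_uint available = 2;
    if (num_platforms) *num_platforms = available;
    if (platforms)
        for (cl_uint i = 0; i < num_entries && i < available; ++i)
            platforms[i] = (cl_platform_id)(uintptr_t)(i + 1);
    return CL_SUCCESS;
}

/* The two-call pattern: ask for the count, then fetch the IDs. */
static cl_uint query_platforms(void)
{
    cl_uint num = 0;
    cl_int err = clGetPlatformIDs(0, NULL, &num);  /* first call: count only */
    if (err != CL_SUCCESS || num == 0) return 0;

    cl_platform_id *platforms = malloc(num * sizeof *platforms);
    err = clGetPlatformIDs(num, platforms, NULL);  /* second call: fetch IDs */
    cl_uint got = (err == CL_SUCCESS) ? num : 0;
    free(platforms);
    return got;
}
```

The same count-then-fetch idiom applies to clGetDeviceIDs on the next slide.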

HOST
 Query Devices

cl_int clGetDeviceIDs(cl_platform_id platform,
                      cl_device_type device_type,
                      cl_uint num_entries,
                      cl_device_id *devices,
                      cl_uint *num_devices)

–Like clGetPlatformIDs, this function is usually called twice.
–device_type: CL_DEVICE_TYPE_ALL, CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU


HOST
 Create Context

cl_context clCreateContext(const cl_context_properties *properties,
                           cl_uint num_devices,
                           const cl_device_id *devices,
                           void (CL_CALLBACK *pfn_notify)(const char *errinfo,
                                                          const void *private_info,
                                                          size_t cb,
                                                          void *user_data),
                           void *user_data,
                           cl_int *errcode_ret)

 Create Command Queue

cl_command_queue clCreateCommandQueue(cl_context context,
                                      cl_device_id device,
                                      cl_command_queue_properties properties,
                                      cl_int *errcode_ret)

HOST
 Compile Program

cl_program clCreateProgramWithSource(cl_context context,
                                     cl_uint count,
                                     const char **strings,
                                     const size_t *lengths,
                                     cl_int *errcode_ret)

–clCreateProgramWithSource only creates the program object; the program must then be built with clBuildProgram before kernels can be created from it.

 Create Kernel

cl_kernel clCreateKernel(cl_program program,
                         const char *kernel_name,
                         cl_int *errcode_ret)

HOST
 Create Buffers

cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size,
                      void *host_ptr,
                      cl_int *errcode_ret)

 Write Buffers

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_write,
                            size_t offset,
                            size_t size,
                            const void *ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list,
                            cl_event *event)

HOST
 Set Kernel Arguments (once for each argument)

cl_int clSetKernelArg(cl_kernel kernel,
                      cl_uint arg_index,
                      size_t arg_size,
                      const void *arg_value)

 Run Kernel

cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                              cl_kernel kernel,
                              cl_uint work_dim,
                              const size_t *global_work_offset,
                              const size_t *global_work_size,
                              const size_t *local_work_size,
                              cl_uint num_events_in_wait_list,
                              const cl_event *event_wait_list,
                              cl_event *event)

HOST
 Parameters of clEnqueueNDRangeKernel
–work_dim is the number of dimensions used to specify the global work-items and the work-items in a work-group.
–global_work_offset can be used to specify an array of work_dim unsigned values that describe the offset used to calculate the global ID of a work-item.
–If global_work_offset is NULL, the global IDs start at offset (0, 0, …, 0).
–local_work_size points to an array of work_dim unsigned values that describe the number of work-items that make up a work-group (also referred to as the size of the work-group) that will execute the kernel specified by kernel.
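The relationship between these parameters can be written out: in each dimension, a work-item's global ID is the offset, plus the start of its work-group, plus its local ID. The helper below is a small illustration of that arithmetic (a hypothetical function, not an OpenCL API):

```c
#include <assert.h>
#include <stddef.h>

/* Per-dimension global ID of a work-item, derived from the
 * clEnqueueNDRangeKernel parameters described above. */
static size_t global_id(size_t global_work_offset, size_t local_work_size,
                        size_t group_id, size_t local_id)
{
    return global_work_offset + group_id * local_work_size + local_id;
}
```

For example, with no offset, a work-group size of 64, group ID 2, and local ID 5, the global ID is 2 * 64 + 5 = 133, which is what get_global_id would return inside the kernel.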

HOST
 Parameters of clEnqueueNDRangeKernel (continued)
–If local_work_size is NULL, the implementation determines how to break the global work-items specified by global_work_size into appropriate work-group instances. If local_work_size is specified, global_work_size must be evenly divisible by local_work_size.
–event_wait_list and num_events_in_wait_list specify events that need to complete before this particular command can be executed.
–event returns an event object that identifies this particular kernel-execution instance.
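Because of the divisibility requirement, a common host-side idiom is to round the global size up to the next multiple of the local size and rely on bounds checks inside the kernel (as image_rotate already does). A sketch of that helper (hypothetical name):

```c
#include <assert.h>
#include <stddef.h>

/* Round global_work_size up to the next multiple of local_work_size so
 * that clEnqueueNDRangeKernel's divisibility requirement is satisfied. */
static size_t round_up(size_t global, size_t local)
{
    return ((global + local - 1) / local) * local;
}
```

For a 1000-pixel-wide image and a work-group width of 256, this enqueues 1024 work-items per row; the extra 24 fail the kernel's bounds check and do nothing.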

HOST
 Read Buffers

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_read,
                           size_t offset,
                           size_t size,
                           void *ptr,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event)

HOST
 Release
–clReleaseKernel
–clReleaseProgram
–clReleaseMemObject
–clReleaseCommandQueue
–clReleaseContext
–clReleaseDevice

TOOLS
1. AMD Profiler
2. gDEBugger

AMD PROFILER
 Counters
–Shows the execution statistics (performance counters) of each kernel.

AMD PROFILER
 Trace
–Traces the calls into the OpenCL runtime.

GDEBUGGER
 Debug into the kernel.

THANK YOU!

DISCLAIMER & ATTRIBUTION

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes.

AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. SPEC is a registered trademark of the Standard Performance Evaluation Corporation (SPEC). Other names are for informational purposes only and may be trademarks of their respective owners.