 Open standard for parallel programming across heterogeneous devices  Devices can include CPUs, GPUs, embedded processors, etc. – OpenCL uses all the processing resources available  Includes a language based on C99 for writing kernels and an API used to define and control the devices  Supports parallel computing through task-based and data-based parallelism.
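To make the kernel language concrete, a minimal data-parallel kernel in OpenCL C might look like the sketch below (the kernel name and arguments are illustrative, not from the slides):

    __kernel void vec_add(__global const float *a,
                          __global const float *b,
                          __global float *c)
    {
        int i = get_global_id(0);   /* one work-item per output element */
        c[i] = a[i] + b[i];
    }

Each work-item computes one output element, which is the data-parallel model described above.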

 Use all computational resources in the system  Platform independence  Provide a data- and task-parallel computational model  Provide a programming model which abstracts the specifics of the underlying hardware  Specify the accuracy of floating-point computations  Support both desktop and handheld/portable devices.

 Host connected to one or more OpenCL devices  Device consists of one or more cores  Execution per processor may be SIMD or SPMD  Contexts group together devices and enable inter-device communication (Diagram: a Host connected to Device A - CPU, Device B - GPU and Device C - DSP, grouped into contexts)
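A minimal host-side sketch of this platform model in C, discovering a platform and its devices and grouping them into one context (error handling omitted for brevity):

    #include <CL/cl.h>

    int main(void) {
        cl_int err;
        cl_platform_id platform;
        cl_device_id devices[2];
        cl_uint num_devices;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 2, devices, &num_devices);

        /* Group the devices in one context so they can communicate/share */
        cl_context ctx = clCreateContext(NULL, num_devices, devices,
                                         NULL, NULL, &err);
        clReleaseContext(ctx);
        return 0;
    }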

 Kernel – basic unit of execution – data-parallel  Program – collection of kernels and other related functions  Kernels are executed across a collection of work-items – one work-item per computation  Work-items are grouped together into workgroups  Workgroups are executed together on one device  Multiple workgroups are executed independently  Applications queue kernel instances in order, but they may be executed in-order or out-of-order
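The work-item/workgroup decomposition is chosen when a kernel instance is queued; a sketch, assuming queue and kernel objects created as on the later slides:

    size_t global[1] = { 1024 };  /* total work-items: one per computation  */
    size_t local[1]  = { 64 };    /* work-items per workgroup -> 16 groups  */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global, local,
                           0, NULL, NULL);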

 Some compute devices can also execute task-parallel kernels  Execute as a single work-item  Implemented as either a kernel in OpenCL C or a native C/C++ function
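In the OpenCL 1.x API this is done with clEnqueueTask(), which runs a kernel as a single work-item; a sketch, with task_kernel assumed to be an already-built kernel object:

    /* Equivalent to an NDRange with global size 1 and local size 1 */
    clEnqueueTask(queue, task_kernel, 0, NULL, NULL);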

 Private memory is available per work-item  Local memory is shared within a workgroup  No synchronization between workgroups  Synchronization is possible between work-items in a workgroup  Global/constant memory is accessible by all work-items – not synchronized  Host memory – accessed through the CPU  Memory management is explicit  Data must be moved from host->global->local and back
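A sketch of how these memory spaces appear inside a kernel (illustrative kernel, not from the slides):

    __kernel void scale_shared(__global float *data,   /* global memory      */
                               __local  float *tile)   /* per-workgroup      */
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        float x = data[gid];            /* x lives in private memory         */
        tile[lid] = x;
        barrier(CLK_LOCAL_MEM_FENCE);   /* sync within this workgroup only   */
        data[gid] = tile[lid] * 2.0f;   /* result back to global memory      */
    }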

 Devices – the multiple cores of a CPU/GPU are taken together as a single device  Kernels are executed across all cores in a data-parallel manner  Contexts – enable sharing between different devices  Devices must be within the same context to be able to share  Queues – used for submitting work, one per device  Buffers – simple chunks of memory like arrays; read-write access  Images – 2D/3D data structures  Accessed using built-in functions (read_imagef(), write_imagef(), etc.)  Either read or write within a kernel, but not both
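A host-side sketch of queues, buffers and the explicit host->global data movement, reusing ctx, devices and err from the platform example (clCreateCommandQueue is the OpenCL 1.x call; n and host_data are illustrative):

    size_t n = 1024;
    float host_data[1024];
    cl_command_queue queue = clCreateCommandQueue(ctx, devices[0], 0, &err);
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, &err);
    /* Explicit movement: host memory -> device global memory */
    clEnqueueWriteBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                         host_data, 0, NULL, NULL);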

 Declared with a __kernel qualifier  Encapsulate a kernel function  Kernel objects are created after the executable is built  Execution:  Set the kernel arguments  Enqueue the kernel  Kernels are executed asynchronously  Events are used to track the execution status  Used for synchronizing execution of two kernels  clWaitForEvents(), clEnqueueMarker(), etc.
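A sketch of these execution steps, assuming a built program object and buffers buf_a/buf_b/buf_c created as on the previous slides:

    cl_kernel k = clCreateKernel(program, "vec_add", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &buf_a);
    clSetKernelArg(k, 1, sizeof(cl_mem), &buf_b);
    clSetKernelArg(k, 2, sizeof(cl_mem), &buf_c);

    size_t global = 1024;
    cl_event done;
    clEnqueueNDRangeKernel(queue, k, 1, NULL, &global, NULL, 0, NULL, &done);
    clWaitForEvents(1, &done);   /* kernels run asynchronously; block here */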

 Encapsulate  A program source/binary  List of devices and the latest successfully built executable for each device  List of kernel objects  Kernel source can be specified as a string and compiled at runtime using clCreateProgramWithSource() – platform independence  Overhead – compiling programs can be expensive  OpenCL allows for reusing precompiled binaries
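A sketch of runtime compilation from a source string, reusing ctx, devices and num_devices from the platform example:

    const char *src =
        "__kernel void vec_add(__global const float *a,"
        "                      __global const float *b,"
        "                      __global float *c)"
        "{ int i = get_global_id(0); c[i] = a[i] + b[i]; }";

    cl_program program = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
    clBuildProgram(program, num_devices, devices, NULL, NULL, NULL);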

 Derived from ISO C99  No standard headers, function pointers, recursion, variable-length arrays, bit fields  Added features: work-items, workgroups, vector types, synchronization  Address space qualifiers  Optimized image access  Built-in functions specific to OpenCL  Data types  char, uchar, short, ushort, int, uint, long, ulong  bool, intptr_t, ptrdiff_t, size_t, uintptr_t, half  image2d_t, image3d_t, sampler_t  Vector types – portable, varying length (2, 4, 8, 16), endian safe  char2, ushort4, int8, float16, double2, etc.
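A small illustrative kernel showing the vector types (component-wise operations map naturally onto SIMD hardware):

    __kernel void vec_demo(__global float4 *out)
    {
        float4 a = (float4)(1.0f, 2.0f, 3.0f, 4.0f);
        float4 b = (float4)(0.5f);       /* scalar broadcast to all lanes  */
        out[get_global_id(0)] = a * b;   /* component-wise multiply        */
    }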

 Work-item and workgroup functions  get_work_dim(), get_global_size()  get_group_id(), get_local_id()  Vector operations and components are pre-defined as a language feature  Kernel functions  get_global_id() – returns the globally unique ID of the current work-item  Conversions  Explicit – convert_destType  Reinterpret – as_destType  Scalar and pointer conversions follow C99 rules  No implicit conversions/casts for vector types
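An illustrative kernel contrasting the two conversion styles:

    __kernel void convert_demo(__global float4 *in,
                               __global int4   *out,
                               __global uint4  *bits)
    {
        int i = get_global_id(0);     /* ID of the current work-item   */
        float4 f = in[i];
        out[i]  = convert_int4(f);    /* explicit value conversion     */
        bits[i] = as_uint4(f);        /* reinterpret the same bits     */
    }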

 Address spaces  Kernel pointer arguments must use global, local or constant  The default for local variables is private  image2d_t and image3d_t are always in the global address space  Program-scope (global) variables must be in the constant address space  Casting between different address spaces is undefined
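An illustrative kernel showing the qualifier rules above:

    __constant float scale = 2.0f;  /* program-scope: __constant only      */

    __kernel void apply_scale(__global float *data,    /* pointer args:    */
                              __local  float *scratch) /* global, local or */
    {                                                  /* constant only    */
        int gid = get_global_id(0);
        float tmp = data[gid];      /* unqualified locals are __private    */
        scratch[get_local_id(0)] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);
        data[gid] = tmp * scale;
    }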

 Portable and high-performance framework  For computationally intensive algorithms  Access to all computational resources  Well-defined memory/computational model  An efficient parallel programming language  C99 with extensions for task and data parallelism  Set of built-in functions for synchronization, math and memory operations  Open standard for parallel computing across a heterogeneous collection of devices.