© 2009 Nokia V1-OpenCLEnbeddedProfilePresentation.ppt / 2009-02-26 / JyrkiLeskelä 1 OpenCL Embedded Profile Presentation for Multicore Expo 16 March 2009.

Presentation transcript:

OpenCL Embedded Profile
Presentation for Multicore Expo, 16 March 2009
V0.3 Improved draft – Still need some work
Kari Pulli, Nokia Research Center
Jyrki Leskelä, Nokia Devices R&D / Technology Renewal

OpenCL Embedded Profile - Basics

OpenCL Relation to Khronos Embedded Ecosystem

OpenCL 1.0 Embedded Profile One-Slider

Embedded Profile Main Differences
The embedded profile is defined as a subset of each version of OpenCL:
- The online compiler is optional.
- No 64-bit integers or 64-bit integer vectors.
- Float 2D/3D images can only be used with nearest-neighbor sampling.
- The macro __EMBEDDED_PROFILE__ is added to the language, and the CL_PLATFORM_PROFILE query returns the string EMBEDDED_PROFILE if the OpenCL implementation supports the embedded profile only.
- Minimum requirements for constant buffer size, object allocation size, constant argument count and local memory size are scaled down.
- Image support and floating-point support are aligned with OpenGL ES 2.0 texture requirements.
- The extensions of the full profile can be applied to the embedded profile.
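As a minimal sketch of how a host program might act on the profile string described above: the helper name `is_embedded_profile` is hypothetical, and the string is passed in as an argument so that the example compiles without OpenCL headers. In a real program the string would be obtained from `clGetPlatformInfo` with `CL_PLATFORM_PROFILE`.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if the profile string reports the embedded profile, 0 otherwise.
 * Assumption: `profile` holds the NUL-terminated string that a host program
 * would read with clGetPlatformInfo(platform, CL_PLATFORM_PROFILE, ...). */
int is_embedded_profile(const char *profile)
{
    assert(profile != NULL);
    return strcmp(profile, "EMBEDDED_PROFILE") == 0;
}
```

Inside kernel source, the same distinction can be made at compile time by testing the `__EMBEDDED_PROFILE__` macro with `#ifdef`.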

Floating Point Numbers in Embedded Profile
- INF and NaN values for floats are not mandated.
- Accuracy requirements of some single-precision floating-point operations are relaxed from the full profile:
  - x / y <= 3 ulp
  - exp <= 4 ulp
  - log <= 4 ulp
- Float add, sub, mul and mad can be rounded toward zero, resulting in an error <= 1 ulp; this relaxation accommodates strict hardware area constraints.
- Denormalized numbers for the half float data type can be flushed to zero.
- The precision of conversions from normalized integers is <= 2 ulp for the embedded profile (instead of <= 1.5 ulp).
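To make the ulp bounds above concrete, the following sketch counts how many representable single-precision values lie between a computed result and a reference value. The function names are hypothetical. Note the simplification: the error is measured against a reference that has itself been rounded to float, so it only approximates an error measured against the exact result.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Map a float's bit pattern to an integer that increases monotonically
 * with the float's value, so that adjacent floats differ by exactly 1. */
static int64_t ordered_bits(float v)
{
    uint32_t u;
    memcpy(&u, &v, sizeof u);            /* well-defined way to read the bits */
    if (u & 0x80000000u)                 /* negative values mirror below zero */
        return -(int64_t)(u & 0x7fffffffu);
    return (int64_t)u;
}

/* Distance between two non-NaN floats, in units in the last place. */
int64_t ulp_distance(float a, float b)
{
    assert(a == a && b == b);            /* NaN has no meaningful distance */
    int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}

/* Hypothetical check: 1 if `result` lies within `max_ulp` of `reference`. */
int within_ulp(float result, float reference, int64_t max_ulp)
{
    return ulp_distance(result, reference) <= max_ulp;
}
```

For example, a division result read back from a device could be compared with a host-computed quotient using `within_ulp(result, reference, 3)`.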

Image Support in Embedded Profile
- Image support is an optional feature within an OpenCL device.
- If images are supported, the minimum requirements for the supported image capabilities are lowered to the level of OpenGL ES 2.0 textures:
  - A kernel must be able to read >= 8 image objects simultaneously.
  - A kernel must be able to write >= 1 image object simultaneously.
  - Width and height of a 2D image >= 2048.
  - Number of samplers >= 8.
- Image formats are similar to the corresponding OpenGL ES 2.0 texture formats.
- Support for 3D images is optional for embedded implementations.
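The minimums listed above can be expressed as a simple check. This is only a sketch: the `image_caps` struct and the function name are hypothetical, and the fields stand in for values a host program would query with `clGetDeviceInfo` (the corresponding query names are given in the comment).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical container for values queried with clGetDeviceInfo:
 * CL_DEVICE_IMAGE_SUPPORT, CL_DEVICE_MAX_READ_IMAGE_ARGS,
 * CL_DEVICE_MAX_WRITE_IMAGE_ARGS, CL_DEVICE_IMAGE2D_MAX_WIDTH,
 * CL_DEVICE_IMAGE2D_MAX_HEIGHT and CL_DEVICE_MAX_SAMPLERS. */
struct image_caps {
    int      image_support;
    unsigned max_read_image_args;
    unsigned max_write_image_args;
    size_t   image2d_max_width;
    size_t   image2d_max_height;
    unsigned max_samplers;
};

/* Returns 1 if the capabilities satisfy the embedded-profile minimums.
 * A device without image support passes, because images are optional. */
int meets_embedded_image_minimums(const struct image_caps *c)
{
    assert(c != NULL);
    if (!c->image_support)
        return 1;
    return c->max_read_image_args  >= 8
        && c->max_write_image_args >= 1
        && c->image2d_max_width    >= 2048
        && c->image2d_max_height   >= 2048
        && c->max_samplers         >= 8;
}
```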

Potential Mobile Device Use-Cases
- Image post-processing and enhancement
- Image editing software
- Compatibility for devices lacking high-end imaging HW
- Machine vision, local media search, augmented reality
- Support emerging new coding schemes quickly
  - For example, web-originated media codecs
- Streaming math/algorithm libraries
- Physics modeling
- Gaming engines and WOW effects

Potential Benefits for Mobile Devices
- Easier programming in a heterogeneous processor environment
  - Instead of learning different programming methods for CPU, GPU and DSP
  - The OpenCL framework also handles event queuing
- Code developed once will run on future hardware
  - If the application conforms to the specification, it will run
  - The OpenCL computing model will be relatively easy to virtualize
- Area- and energy-constrained embedded devices
  - Computing power of each computing device is close to the "sweet spot"
  - Allocation of the workload to multiple computing devices is valuable

Example Case 1: Split computation

Split computation: Image Post Processing
[Diagram labels: CPU, GPU, Host Application, CL API Calls, Camera Image, OpenCL Post-Processing, CL Buffer, CL Buffer, …, Render]

Image Post-Processing Kernel Program

    /* Convolves an RGBA image with a kernel_dim x kernel_dim filter.
     * One work-item per output pixel; kernel_dim is assumed to be odd. */
    __kernel void convolution(
        __global const uchar4 *srcdata,
        __global uchar4 *destdata,
        __global const float *weights,  /* named "kernel" on the slide, which is a reserved qualifier */
        float kernel_multiplier,
        float kernel_bias,
        int kernel_dim )
    {
        int x = get_global_id(0), y = get_global_id(1);
        int sizex = get_global_size(0), sizey = get_global_size(1);
        int half_kernel = kernel_dim / 2;
        float4 sum = (float4)(0.0f);    /* float accumulator, zero-initialized */

        for( int j = y-half_kernel, kj = 0; j <= y+half_kernel; j++, kj++ ) {
            if( ( j >= 0 ) && ( j < sizey ) ) {          /* skip rows outside the image */
                for( int i = x-half_kernel, ki = 0; i <= x+half_kernel; i++, ki++ ) {
                    if( ( i >= 0 ) && ( i < sizex ) ) {  /* skip columns outside the image */
                        sum += convert_float4( srcdata[ j * sizex + i ] )
                             * weights[ kj * kernel_dim + ki ];
                    }
                }
            }
        }
        sum = sum * kernel_multiplier + kernel_bias;
        destdata[ y * sizex + x ] = convert_uchar4_sat( sum );
    }
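A scalar CPU reference can help check the output of a convolution kernel like the one above. The sketch below handles one channel of one output pixel and is not taken from the presentation; the function name and the `weights` parameter are illustrative, and the filter dimension is assumed to be odd. The final cast truncates, which mirrors the default round-toward-zero behavior of float-to-integer conversions in OpenCL C.

```c
#include <assert.h>

/* CPU reference for a single channel of one output pixel.
 * `src` is a sizex * sizey row-major image; `weights` is a dim * dim
 * row-major filter. Taps that fall outside the image are skipped. */
unsigned char convolve_pixel(const unsigned char *src, int sizex, int sizey,
                             int x, int y, const float *weights, int dim,
                             float multiplier, float bias)
{
    assert(dim > 0 && dim % 2 == 1);
    int half = dim / 2;
    float sum = 0.0f;

    for (int j = y - half, kj = 0; j <= y + half; j++, kj++) {
        if (j < 0 || j >= sizey)
            continue;
        for (int i = x - half, ki = 0; i <= x + half; i++, ki++) {
            if (i < 0 || i >= sizex)
                continue;
            sum += (float)src[j * sizex + i] * weights[kj * dim + ki];
        }
    }

    sum = sum * multiplier + bias;
    if (sum < 0.0f)   sum = 0.0f;      /* saturate to the uchar range */
    if (sum > 255.0f) sum = 255.0f;
    return (unsigned char)sum;         /* truncates toward zero */
}
```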

Split computation: Speedup
t_cpu is the time to process the task with only the CPU, t_gpu is the time to process the task with only the GPU, and t_gpuif is the time to transfer the data between CPU and GPU (the transfer is modeled to be CPU bound). In this case, the speed-optimal workload split between CPU and GPU would yield the following execution time:
[Equation not included in the transcript]
Example: t_gpu = k t_cpu, k ∈ 0.5 … 1.5; t_gpuif = 0.1 t_cpu
Comparison of total execution times:
[Chart not included in the transcript]
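The slide's own formula is not available in this transcript, so the following is only one plausible model built from the definitions above, not the authors' equation. Assumptions: a fraction x of the work goes to the GPU; the CPU spends (1 - x) t_cpu computing plus x t_gpuif on the transfer; the GPU spends x t_gpu; and the split is optimal when both sides finish together. The function names are hypothetical.

```c
#include <assert.h>

/* Fraction of the work sent to the GPU under the assumed model.
 * Solves (1 - x) * t_cpu + x * t_gpuif = x * t_gpu for x. */
double optimal_gpu_fraction(double t_cpu, double t_gpu, double t_gpuif)
{
    assert(t_cpu > 0.0 && t_gpu > 0.0 && t_gpuif >= 0.0);
    assert(t_gpuif < t_gpu);   /* otherwise offloading never pays off in this model */
    return t_cpu / (t_cpu + t_gpu - t_gpuif);
}

/* Total execution time at the optimal split: both processors finish
 * at the same moment, so the GPU's busy time is the total time. */
double split_time(double t_cpu, double t_gpu, double t_gpuif)
{
    return optimal_gpu_fraction(t_cpu, t_gpu, t_gpuif) * t_gpu;
}
```

Under these assumptions, the example parameters with k = 1 (t_gpu = t_cpu, t_gpuif = 0.1 t_cpu) give a split time of t_cpu / 1.9, roughly 0.53 t_cpu.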

Split computation: Energy efficiency
t_cpu, t_gpu and t_gpuif are as defined on the previous slide. p_cpu, p_gpu and p_gpuif are the average battery power drain of CPU execution, GPU execution and data transfer between CPU and GPU, respectively. p_split is the average power drain when the computation is time-optimally split between CPU and GPU. c_split is the corresponding consumed battery capacity, the product of power and time.
Example: t_gpu = k t_cpu, k ∈ 0.5 … 1.5; t_gpuif = 0.1 t_cpu; p_gpu = 0.5 p_cpu; p_gpuif = 0.1 p_cpu
Total consumption of battery capacity:
[Chart not included in the transcript]
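As a rough illustration of "capacity as a product of power and time", the sketch below sums power times busy time over the three activities. It is an assumption-laden model rather than the slide's formula: a fraction x of the work goes to the GPU, idle power is ignored, and the function name is hypothetical.

```c
#include <assert.h>

/* Battery capacity consumed when a fraction x of the work runs on the GPU.
 * Assumed model: each activity drains its average power only while busy. */
double split_energy(double x,
                    double t_cpu, double t_gpu, double t_gpuif,
                    double p_cpu, double p_gpu, double p_gpuif)
{
    assert(x >= 0.0 && x <= 1.0);
    return p_cpu   * (1.0 - x) * t_cpu    /* CPU computing its share  */
         + p_gpuif * x * t_gpuif          /* transferring the GPU share */
         + p_gpu   * x * t_gpu;           /* GPU computing its share  */
}
```

With x = 0 the function reduces to p_cpu t_cpu, the CPU-only consumption, which gives a baseline for comparing other splits.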

More Example Cases

Pipelining: Mixing computation and graphics
[Diagram labels: DSP, CPU, GPU, OpenCL Fractal Anim. Texture, OpenGL ES 2.0 Rendering, Host Application, CL API Calls, GL API Calls, GL Renderbuffer, CL Buffer, GL Texture, CL Buffer]

Multimedia Frameworks: OpenMAX environment
More portability by using OpenCL in some hotspots
Diagram Copyright © 2009 Khronos Group

Summary
- OpenCL 1.0 Embedded Profile is a subset of the full profile
  - Not an "ES" specification of its own
- Easier programming of heterogeneous multi-processor systems
  - Fast multiprocessor code without portability hassle
- Speedups and energy efficiency via parallelism
  - Parallelize a uniform task to different processors
  - Split pipeline stages to different processors

Demo

Demo: Magnification Lens
- Internal development environment for evaluating the OpenCL Embedded Profile
  - Early pilot version only
  - No conformance test coverage at the moment
- Runs on
  - N810 (OMAP2420 CPU)
  - Zoom MDK (OMAP3430 CPU+SIMD+DSP)
- The lens effect is a mapping of the original image f(x,y) into the modified image g(x,y) as a piecewise continuous function
[Equation not included in the transcript]
where R_o and R_i are the outer and inner boundaries of the lens frame, (x_c, y_c) is the center point of the lens, and M is the magnification factor in the center area of the lens.
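The mapping itself is not available in this transcript. Purely as an illustration of what a piecewise continuous lens of this kind could look like, the sketch below maps the distance r of an output pixel from the lens center to the distance at which the source image is sampled. This is an assumed construction, not the presenters' formula: inside R_i the image is magnified by M, outside R_o it is unchanged, and the frame between them is interpolated linearly so the mapping stays continuous. The function name is hypothetical.

```c
#include <assert.h>

/* Source sampling radius for an output pixel at distance r from the
 * lens center (assumed mapping; see the caveats in the text above). */
double lens_source_radius(double r, double r_i, double r_o, double m)
{
    assert(r >= 0.0 && r_i > 0.0 && r_o > r_i && m >= 1.0);
    if (r <= r_i)
        return r / m;                   /* magnified center area */
    if (r >= r_o)
        return r;                       /* outside the lens: unchanged */
    double inner = r_i / m;             /* source radius at the inner boundary */
    return inner + (r - r_i) * (r_o - inner) / (r_o - r_i);
}
```

With s = lens_source_radius(r, ...) / r for r > 0, an output pixel would then read g(x, y) = f(x_c + (x - x_c) s, y_c + (y - y_c) s).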