SOC Runtime Gregory Stoner.

Latency Kills: does anyone remember the WEITEK 4167? Is it 1988 all over again? The Weitek 3167 was a highly specialized, memory-mapped I/O device. To take advantage of the performance it offered, applications had to be recompiled with a compiler that supported the Weitek coprocessor. Applications recompiled for Weitek ran faster than with any other 80387 coprocessor, especially when they used single-precision floating-point numbers: about 2.5 times the performance of an Intel 387DX coprocessor.

Pull Forward 30 Years: hardware acceleration is again a common pathway. Current solutions have significant hardware and software challenges when integrating accelerators.

Programmability is the core issue: parallel programming is difficult. Today, highly specialized languages (e.g., OpenCL and CUDA) are used to take advantage of GPU acceleration, which limits the number of developers who can access the system.

We need a better solution for integrating emerging memory technologies into the SoC, to solve the core problems of applications that need increased bandwidth and a better match to their arithmetic intensities.

Speaker notes: Consider changing "Performance/Watt/$" to allow different key goals. We need to discuss the value on this slide. Set off what we want from the discussion in a separate group of bullets. We removed "If AMD does not secure a partner, we are likely to scale back the program"; this can be said verbally if needed. Previous bullet for customized hardware acceleration: "Customized Hardware Acceleration: New silicon processes carry a heavy tax for application-specific acceleration."

Additional questions to ask if they don't offer more big challenges:
What features are most important to you for future processors? Programmability, performance, power efficiency, memory bandwidth, memory capacity, network connectivity, etc.
Do you currently use GPUs or many-core CPUs as accelerators? If yes, what has been your experience with them?
What programming languages and environments are most important to you?
What types of applications are you most interested in executing?
What are the key characteristics of your most important applications? Memory bound, compute bound, or I/O bound?

Can we simplify programming accelerators? Single-source compilation of C++:

int column = 128;
int row = 256;

// define the compute grid
bounds<2> grid { column, row };

float* a = new float[grid.size()];
float* b = new float[grid.size()];
float* c = new float[grid.size()];
…

// begin() and end() return iterator objects over the grid;
// the index<> object provides the loop indices / work-item IDs
parallel::for_each(par, begin(grid), end(grid), [&](index<2> idx) {
    int i = idx[1] * column + idx[0]; // idx contains the work-item coordinate in the grid
    c[i] = a[i] + b[i];
});

Single-Source, Single-Heap

HCC - Heterogeneous Compute Compiler

C++ & C compiler: HCC supports ISO C++11/14 and C11 compilation.
Fully open-sourced; the compiler leverages Clang/LLVM technology.
Single-source: host and device code live in the same source file.
C++17 "Parallel Standard Template Library" support.
Support for OpenMP 3.1/4.0 on the CPU initially, with GPU acceleration planned for OpenMP 4.5.
Supports both CPU and GPU code generation:
Traditional CPU programs can be compiled with HCC (just like using the clang++ compiler).
For heterogeneous programs, HCC compiles the host code for the CPU and the kernel/parallel region for the GPU.
Motivated programmers can drill down to optimize where desired: control, pre-fetch, and discard data movement; run asynchronous compute kernels; access GPU scratchpad memories.

The AMD HSA Runtime provides an open-source Linux driver and runtime:
Initialization, device discovery, and shutdown APIs
Low-latency user-mode dispatch and scheduling APIs
Architected queuing packets
Signals API
Memory management APIs: unified coherent address space; demand-paged system memory
System and device information
Code object and executable APIs
Notifications API
(Slide legend: Available Now / In Development.)
Documentation @ http://www.hsafoundation.com/html/HSA_Library.htm

Compiler Architecture

Single-source C or C++ enters the HCC C/C++ compiler front end, which emits LLVM IR plus API calls into the HCC and HSA runtimes. From there the IR takes two paths:
CPU path: LLVM x86 code generation produces the host code.
GPU path: the LLVM-HSAIL compiler produces HSAIL, which the finalizer lowers to device ISA.
The combined result executes on the system (CPU + GPU).

A Runtime Foundation Supporting a Broad Set of Languages

Applications: OpenCL™, OpenMP, C++, Python, and additional languages.
User mode: each language runtime (OpenCL, OpenMP, C++, Python) sits on top of the HSA helper libraries, the HSA core runtime, and the HSA finalizer.
Kernel mode: the HSA kernel-mode driver.
The common infrastructure of HSAIL and the HSA runtime brings the benefits of GPU computing to programming languages that programmers are already using.

HSA: Taking the Challenges Head-On

HSA addresses the industry's challenges by:
Defining efficient accelerator integration: a coherent, single address space with low-latency communication between processors.
Laying the runtime foundation for simplified programming: single-source, standard computing environments for mainstream languages.
Providing fast access to new memory architectures: including hierarchical (near/fast and far/slow) memory in a shared memory space.

Speaker notes (some of these can be integrated into slides 10 and 11, or into this slide):
Simplified programming: make GPU compute programming much more like multi-core CPU programming; enable a vast pool of programmers to leverage the power of GPU compute; provide single-source, standard computing environments for mainstream languages (C, C++, Fortran, Python, OpenMP); lower development costs through faster code optimization.
Improved performance and energy usage: high performance and energy efficiency for data- and compute-intensive workloads; fast CPU cores for sequential code; energy-efficient GPU cores for parallel code.
Advanced memory and networking features: single shared memory address space; high memory bandwidth and capacity; HSA-enabled memory and networking enhancements.

Disclaimer & Attribution

The information presented in this document is for informational purposes only and may contain technical inaccuracies, omissions and typographical errors. The information contained herein is subject to change and may be rendered inaccurate for many reasons, including but not limited to product and roadmap changes, component and motherboard version changes, new model and/or product releases, product differences between differing manufacturers, software changes, BIOS flashes, firmware upgrades, or the like. AMD assumes no obligation to update or otherwise correct or revise this information. However, AMD reserves the right to revise this information and to make changes from time to time to the content hereof without obligation of AMD to notify any person of such revisions or changes. AMD MAKES NO REPRESENTATIONS OR WARRANTIES WITH RESPECT TO THE CONTENTS HEREOF AND ASSUMES NO RESPONSIBILITY FOR ANY INACCURACIES, ERRORS OR OMISSIONS THAT MAY APPEAR IN THIS INFORMATION. AMD SPECIFICALLY DISCLAIMS ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE. IN NO EVENT WILL AMD BE LIABLE TO ANY PERSON FOR ANY DIRECT, INDIRECT, SPECIAL OR OTHER CONSEQUENTIAL DAMAGES ARISING FROM THE USE OF ANY INFORMATION CONTAINED HEREIN, EVEN IF AMD IS EXPRESSLY ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

ATTRIBUTION
© 2013 Advanced Micro Devices, Inc. All rights reserved. AMD, the AMD Arrow logo and combinations thereof are trademarks of Advanced Micro Devices, Inc. in the United States and/or other jurisdictions. Other names are for informational purposes only and may be trademarks of their respective owners.