Presentation transcript:

Performance Portability and Programmability for Heterogeneous Many-core Architectures (PEPPHER)
Siegfried Benkner (on behalf of the PEPPHER Consortium)
Research Group Scientific Computing, Faculty of Computer Science, University of Vienna, Austria
HiPEAC CSW 2013, Paris, May 2, 2013

EU Project PEPPHER: Performance Portability & Programmability for Heterogeneous Manycore Architectures
- ICT FP7, Computing Systems; 3 years; finished Feb. 2013
- Consortium coordinated by the University of Vienna
- Goal: enable portable, productive, and efficient programming of single-node heterogeneous many-core systems
- Holistic approach: component-based high-level program development; auto-tuned algorithms & data structures; compilation strategies; runtime systems; hardware mechanisms

Performance, Portability, Programmability
- Focus: single-node/chip heterogeneous architectures
- Approach: multi-architectural, performance-aware components (multiple implementation variants of functions, each with a performance model); task-based execution model & intelligent runtime system (runtime selection of the best task implementation variant for the given platform)
- Methodology & framework for the development of performance-portable code: execute the same application efficiently on different heterogeneous architectures
- Supports multiple parallel APIs: OpenMP, OpenCL, CUDA, TBB, ...
[Figure: the PEPPHER framework mapping a C/C++ application onto many-core CPU, CPU+GPU, PePU (Movidius), the PEPPHER simulator, and Intel Xeon Phi]

PEPPHER Approach
- Mainstream programmer: writes a component-based application with annotations; annotates calls to performance-critical functions (= components)
- Expert programmer (or compiler/autotuner): provides component implementation variants for different platforms, algorithms, inputs, ...; provides component meta-data and platform descriptors (PDL)
- Measured performance on the target platforms is fed back to guide variant provision and selection
[Figure: components C1, C2 flowing from the annotated application through implementation variants to the target platforms, with performance feedback]

PEPPHER Approach (continued)
The PEPPHER framework mediates between the annotated application and the target platforms:
- Management of components and implementation variants
- Transformation / composition into an intermediate task-based representation
- Implementation variant selection
- Dynamic, performance-aware task scheduling (StarPU runtime system), with dynamic selection of the best implementation variant by a heterogeneous task scheduler

PEPPHER Framework
The framework forms a layered stack, from applications down to hardware:
- Applications: embedded, general purpose, HPC
- High-level coordination / patterns / skeletons: asynchronous calls, data distribution patterns, SkePU skeletons
- Components (C/C++, OpenMP, CUDA, OpenCL, TBB, Offload) with autotuned algorithms and data structures: C/C++ source code with annotated component calls; component implementation variants for different core architectures, algorithms, ...
- Transformation tool and composition tool: generate component glue code, perform static variant selection (if any), and build a component task graph with explicit data dependencies (PEPPHER task graph)
- PEPPHER run-time (StarPU): performance-aware, data-aware dynamic scheduling of the best component variants onto free execution units, guided by a scheduling strategy and performance models
- Drivers (CUDA, OpenCL, OpenMP) targeting a single-node heterogeneous many-core: CPU, GPU, Xeon Phi, SIM (= PEPPHER simulator), PePU (= PEPPHER processing unit, Movidius)

PEPPHER Components
- Component interface: specification of functionality; used by mainstream programmers
- Implementation variants: for different architectures/platforms, different algorithms/data structures, different input characteristics, different performance goals; written by expert programmers or generated, e.g. by auto-tuning (cf. the EU AutoTune project)
- Features: different programming languages (C/C++, OpenCL, CUDA, OpenMP); task & data parallelism
- Constraints: no side effects; non-preemptive; stateless; composition on CPU only
[Figure: an «interface» C f(param-list) with interface meta-data, and implementation variants «variant» C_1 f(param-list) {...} ... «variant» C_n f(param-list) {...}, each with its own variant meta-data]
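
A minimal sketch of the idea in plain C (the function names and the OpenMP variant are illustrative, not from the slides): one interface, used by the mainstream programmer, backed by several implementation variants among which the runtime can choose.

    /* Interface: what the mainstream programmer calls. */
    void vector_scale(float *a, int n, float factor);

    /* Variant 1: sequential CPU implementation. */
    void vector_scale_cpu(float *a, int n, float factor) {
        for (int i = 0; i < n; ++i)
            a[i] *= factor;
    }

    /* Variant 2: OpenMP implementation for multi-core CPUs. */
    void vector_scale_openmp(float *a, int n, float factor) {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            a[i] *= factor;
    }

In the framework, each variant would additionally carry meta-data (see the meta-data slides below) declaring its target platform, so that the scheduler can select among variants at runtime.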

Platform Description Language (PDL)
Goal: make platform-specific information explicit for tools and users.
- Processing units (PUs): master (initiates program execution); worker (executes delegated tasks); hybrid (master & worker)
- Memory regions: express key characteristics of the memory hierarchy; can be defined for all processing units
- Interconnects: describe communication facilities and data movement between PUs
- Hardware and software properties: e.g., core count, memory sizes, available libraries
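
To make this concrete, here is a hypothetical sketch of a PDL descriptor; the element and attribute names below are assumptions for illustration, not the project's actual schema.

    <!-- Hypothetical PDL sketch; element/attribute names are illustrative only -->
    <platform name="cpu-gpu-node">
      <pu id="cpu0" type="master" cores="4">      <!-- initiates execution -->
        <property name="memory-size" value="24GB"/>
      </pu>
      <pu id="gpu0" type="worker" cores="448">    <!-- executes delegated tasks -->
        <property name="memory-size" value="3GB"/>
        <property name="library" value="CUBLAS"/>
      </pu>
      <interconnect from="cpu0" to="gpu0" type="PCIe"/>
    </platform>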

PEPPHER Coordination Language
- Component calls, asynchronous & synchronous:

    #pragma pph call          // read A, write B -> meta-data
    cf1(A, N, B, M);
    #pragma pph call
    cf2(B, M);
    #pragma pph call sync
    cf(A, N);

- Patterns, e.g. the pipeline pattern:

    #pragma pph pipeline
    while (inputstream >> file) {
        readImage(file, image);
        #pragma pph stage replicate(N)
        {
            resizeAndColorConvert(image);
            detectFace(image, outImage);
        }
        ...
    }

- Other features: specification of optimization goals (time vs. power) and execution targets; data partitioning; array access patterns; parameter assertions; memory consistency control

Transformation System
- Source-to-source transformation, based on ROSE: generates C++ with calls to the coordination layer and the StarPU runtime
- Coordination layer: support for parallel patterns (pipelining); submission of tasks to StarPU
- Heterogeneous runtime system, based on INRIA's StarPU: selection of implementation variants based on available hardware resources; data-aware & performance-aware task scheduling onto heterogeneous PUs
[Figure: application with annotations -> transformation tool (using the PEPPHER component repository and PDL platform descriptors) -> coordination layer -> task-based heterogeneous runtime -> hybrid hardware (SMP, GPU, MIC)]
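
To give a flavor of the runtime layer that the generated code targets, here is a minimal hand-written StarPU example (assuming StarPU >= 1.2 for starpu_task_insert and STARPU_MAIN_RAM; the glue code PEPPHER generates is of course more elaborate):

    #include <stdint.h>
    #include <starpu.h>

    /* CPU implementation of the kernel; a .cuda_funcs variant could be added
       to the codelet, and the scheduler would then choose between them. */
    static void scale_cpu(void *buffers[], void *cl_arg) {
        (void)cl_arg;
        float *a = (float *)STARPU_VECTOR_GET_PTR(buffers[0]);
        unsigned n = STARPU_VECTOR_GET_NX(buffers[0]);
        for (unsigned i = 0; i < n; i++) a[i] *= 2.0f;
    }

    static struct starpu_codelet cl = {
        .cpu_funcs = { scale_cpu },
        .nbuffers = 1,
        .modes = { STARPU_RW },
    };

    int main(void) {
        float a[1024] = { 0 };
        starpu_data_handle_t h;
        starpu_init(NULL);
        starpu_vector_data_register(&h, STARPU_MAIN_RAM,
                                    (uintptr_t)a, 1024, sizeof(float));
        starpu_task_insert(&cl, STARPU_RW, h, 0);  /* scheduler picks a PU */
        starpu_data_unregister(h);                 /* waits for the task */
        starpu_shutdown();
        return 0;
    }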

Performance Results: OpenCV Face Detection
- 3425 images; image resolution: 640x480 (VGA)
- Different implementation variants for the middle pipeline stages (CPU vs. GPU)
- Compared against the plain OpenCV version and a hand-coded Intel TBB (pipeline) version
- Architecture: 2x Xeon X5550 (4 cores each), 2x NVIDIA C2050, 1x NVIDIA C1060

Major Results of PEPPHER
- Component framework: multi-architectural, resource- & performance-aware components; PDL adopted by the Open Community Runtime (OCR) in the US X-Stack program
- Transformation, composition, compilation: transformation tool (U. Vienna); composition tool & SkePU (U. Linköping); Offload C++ compiler used by the game industry (Codeplay)
- Runtime system (U. Bordeaux): StarPU part of the Linux (Debian) distribution and the MAGMA library
- Superior parallel algorithms and data structures (KIT, Chalmers)
- PePU experimental hardware platform & simulator (Movidius); PeppherSIM used in industry


S. Benkner, University of Vienna HIPEAC CSW 2013, Paris, May 2, 2013 Backup Slides

Example: Tiled Cholesky Factorization

    FOR k = 0..TILES-1
        POTRF(A[k][k])
        FOR m = k+1..TILES-1
            TRSM(A[k][k], A[m][k])
        FOR n = k+1..TILES-1
            SYRK(A[n][k], A[n][n])
            FOR m = n+1..TILES-1
                GEMM(A[m][k], A[n][k], A[m][n])

- Utilizes expert-written components: BLAS kernels from MAGMA and PLASMA
- Implementation variants: multi-core CPU (PLASMA); GPU (MAGMA)
- Each kernel becomes a PEPPHER component: interface, implementation variants + meta-data
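
As an illustrative sketch (combining this example with the pragma syntax from the coordination-language slides; annotating the Cholesky code this way is an assumption, not shown on the slide), the tile kernels could be invoked as asynchronous component calls, letting the runtime build the task DAG from the declared data accesses:

    // Illustrative only: tiled Cholesky as annotated PEPPHER component calls.
    for (int k = 0; k < TILES; k++) {
        #pragma pph call                 // read/write A[k][k]
        POTRF(A[k][k]);
        for (int m = k + 1; m < TILES; m++) {
            #pragma pph call             // read A[k][k]; read/write A[m][k]
            TRSM(A[k][k], A[m][k]);
        }
        for (int n = k + 1; n < TILES; n++) {
            #pragma pph call             // read A[n][k]; read/write A[n][n]
            SYRK(A[n][k], A[n][n]);
            for (int m = n + 1; m < TILES; m++) {
                #pragma pph call         // read A[m][k], A[n][k]; read/write A[m][n]
                GEMM(A[m][k], A[n][k], A[m][n]);
            }
        }
    }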

PEPPHER Approach: Execution
- Transformation/composition: processing of user annotations/meta-data; generation of a task-based representation (DAG); static pre-selection of variants
- Multi-level parallelism: coarse-grained inter-component parallelism; fine(r)-grained intra-component parallelism; exploit ALL execution units
- Task-based execution model: runtime task variant selection & scheduling; data/topology-aware (minimize data transfer); performance-aware (minimize makespan, or other objectives such as power, ...)
[Figure: task DAG of POTRF, TRSM, SYRK, and GEMM tiles, with CPU-GEMM and GPU-GEMM as selectable variants]

Component Meta-Data
- Interface meta-data (XML): parameter intent (read/write); supported performance aspects (execution time, power)
- Implementation variant meta-data (XML): supported target platforms (PDL); performance model; input data constraints (if any); tunable parameters (if any); required components (if any)
- Key issues: make platform-specific optimizations/dependencies explicit; make components performance- and resource-aware; support runtime variant selection; support code transformation and auto-tuning
[Figures: XML schemas for interface meta-data and variant meta-data]
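
A hypothetical sketch of variant meta-data for the earlier vector_scale example (the element and attribute names below are assumptions; the actual structure is fixed by the project's XML schemas):

    <!-- Hypothetical variant meta-data; names are illustrative only -->
    <variant name="vector_scale_openmp" interface="vector_scale">
      <platform pdl="multicore-cpu"/>              <!-- supported target (PDL) -->
      <performanceModel ref="vector_scale_omp.pm"/>
      <constraint param="n" min="10000"/>          <!-- input data constraint -->
      <tunable param="num_threads" values="2,4,8"/>
    </variant>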

Performance-Aware Components
Each component is associated with an abstract performance model.
- Invocation context: captures performance-relevant information about the input data (problem size, data layout, etc.)
- Resource context: specifies the main HW/SW characteristics (cores, memory, ...)
- Performance descriptor: usually includes (relative) runtime and power estimates
- Generic performance prediction function:

    PerfDsc getPrediction(InvocationContextDsc icd, ResourceContextDsc rcd)

[Figure: component performance model mapping invocation context descriptors and a resource context descriptor (PDL) to a performance descriptor]
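
As an illustrative C++ sketch of such a prediction function (only the type and function names come from the slide; the descriptor fields and the linear cost model are assumptions):

    // Hypothetical performance model for a GPU variant; fields are assumed.
    struct InvocationContextDsc { double problemSize; };
    struct ResourceContextDsc  { double coreCount; double pcieBandwidth; };
    struct PerfDsc             { double estimatedTime; };

    PerfDsc getPrediction(InvocationContextDsc icd, ResourceContextDsc rcd) {
        const double elemsPerCorePerSec = 1e9;  // assumed per-core throughput
        PerfDsc p;
        // Simple linear model: host<->device transfer plus parallel compute.
        p.estimatedTime = icd.problemSize / rcd.pcieBandwidth
                        + icd.problemSize / (rcd.coreCount * elemsPerCorePerSec);
        return p;
    }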

Basic Coordination Language: Memory Consistency
- flush: ensures consistency between host and workers
- Component calls imply memory consistency across workers

    #pragma pph call
    cf1(A, N);
    ...
    #pragma pph flush(A)   // block until A has become available
    int first = A[0];      // explicit flush required since A is accessed

    #pragma pph call
    cf1(A, N);             // A: read / write
    ...                    // implicit memory consistency on workers only
    ...                    // no explicit flush is needed here provided A
    ...                    // is not accessed within the master process
    #pragma pph call
    cf2(A, N);             // A: read; actual values of A produced by cf1()

Basic Coordination Language (continued)
- Parameter assertions: influence component variant selection

    #pragma pph call parameter(size < 1000)
    cf1(A, size);

- Optimization goals: specify goals to be taken into account by the runtime scheduler

    #pragma pph call optimize(TIME)
    cf1(A, size);
    ...
    #pragma pph call optimize(POWER < 100 && TIME < 10)
    cf2(A, size);

- Execution target: specify a pre-defined target library (e.g., OPENCL) or a processing-unit group from a PDL platform descriptor

    #pragma pph call target(OPENCL)
    cf(A, size);

Basic Coordination Language (continued)
- Data partitioning: generate multiple component calls, one for each partition (cf. HPF)

    #pragma pph call partition(A(size:BLOCK(size/2)))
    cf1(A, size);

- Access to array sections: specify which array section is accessed in a component call (cf. Fortran array sections)

    #pragma pph call access(A(size:50:size-1))
    cf(A+50, size-50);

Performance Results: Leukocyte Tracking
- Adapted from the Rodinia benchmark suite
- Different implementation variants for Motion Gradient Vector Flow (CPU vs. GPU)
- Compared against the OpenMP version
- Architecture: 2x Xeon X5550 (4 cores each), 2x NVIDIA C2050, 1x NVIDIA C1060; the different configurations are expressed as PDL descriptors
[Figure: speedup of PEPPHER vs. the original (Rodinia) version for SEQ, OMP, 8 CPU cores, 7 CPU + 1 GPU, 6 CPU + 2 GPU, and 5 CPU + 3 GPU configurations]

Future Work
- EU AutoTune project: autotuning of high-level patterns (pipeline replication factor, ...); tunable parameters specified in component descriptors
- Energy efficiency: energy-aware components; runtime scheduling for energy efficiency; user-specified optimization goals; trade-off between execution time and energy consumption; QoS support
- Extension towards clusters: combine with a global MPI layer across nodes