Last time: Runtime infrastructure for hybrid (GPU-based) platforms
 Task scheduling: extracting performance models at runtime
 Memory management: Asymmetric Distributed Shared Memory

References:
 StarPU: a Runtime System for Scheduling Tasks over Accelerator-Based Multicore Machines, Cédric Augonnet, Samuel Thibault, and Raymond Namyst, TR-7240, INRIA, March 2010
 An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems, Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro, Wen-mei Hwu, ASPLOS’10

A task-submission sketch follows as a concrete reminder of the StarPU model.
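This is a minimal sketch in the style of StarPU, assuming a recent StarPU installation; the codelet, the scaling operation, and the buffer size are illustrative, not taken from the paper. The point is that one task carries both a CPU and a GPU implementation, and the runtime's scheduler picks a worker using performance models measured at runtime:

```cuda
// Build sketch (assumed): nvcc scale.cu $(pkg-config --cflags --libs starpu-1.3)
#include <stdint.h>
#include <starpu.h>

// GPU implementation of the task.
__global__ void scale_kernel(float *x, unsigned n) {
    unsigned i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

static void scale_cuda(void *buffers[], void *cl_arg) {
    (void)cl_arg;
    struct starpu_vector_interface *v = (struct starpu_vector_interface *)buffers[0];
    float *x = (float *)STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    scale_kernel<<<(n + 255) / 256, 256, 0, starpu_cuda_get_local_stream()>>>(x, n);
    cudaStreamSynchronize(starpu_cuda_get_local_stream());
}

// CPU implementation of the same task.
static void scale_cpu(void *buffers[], void *cl_arg) {
    (void)cl_arg;
    struct starpu_vector_interface *v = (struct starpu_vector_interface *)buffers[0];
    float *x = (float *)STARPU_VECTOR_GET_PTR(v);
    unsigned n = STARPU_VECTOR_GET_NX(v);
    for (unsigned i = 0; i < n; i++) x[i] *= 2.0f;
}

static struct starpu_codelet cl;

int main(void) {
    float x[1024];
    for (int i = 0; i < 1024; i++) x[i] = (float)i;

    starpu_init(NULL);
    cl.cpu_funcs[0]  = scale_cpu;   // scheduler may pick the CPU...
    cl.cuda_funcs[0] = scale_cuda;  // ...or the GPU, based on runtime models
    cl.nbuffers = 1;
    cl.modes[0] = STARPU_RW;

    // Register the buffer so the runtime manages CPU/GPU coherence.
    starpu_data_handle_t h;
    starpu_vector_data_register(&h, STARPU_MAIN_RAM, (uintptr_t)x, 1024, sizeof(float));
    starpu_task_insert(&cl, STARPU_RW, h, 0);  // asynchronous submit
    starpu_data_unregister(h);                 // waits for the task, syncs back to x
    starpu_shutdown();
    return 0;
}
```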


Today:
 Bridging runtime and language support
 ‘Virtualizing GPUs’

References:
 Achieving a Single Compute Device Image in OpenCL for Multiple GPUs, Jungwon Kim, Honggyu Kim, Joo Hwan Lee, Jaejin Lee, PPoPP’11
 Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework, Vignesh T. Ravi et al., HPDC 2011 (best paper!)

Context: clouds are shifting to support HPC applications
 Initially, tightly coupled applications were not well suited to clouds
 Today: a Chinese cloud offers 40 Gbps InfiniBand; Amazon has HPC instances; GPU instances are available from Amazon and Nimbix

Challenge: make GPUs shared resources in the cloud.

Challenge: make GPUs a shared resource in the cloud. Why do this?
 GPUs are costly resources; multiple VMs on a node may share a single GPU
 Increase utilization:
 - at the app level, some apps might not use GPUs much
 - at the kernel level, some kernels can be collocated

Two streams:
1. How?
2. Evaluate… opportunities, gains, overheads

1. The ‘How?’
Preamble: concurrent kernels are supported by today’s GPUs
 Each kernel can execute a different task
 Tasks can be mapped to different streaming multiprocessors (via the thread-block configuration)
 Problem: concurrent execution is limited to the set of kernels invoked within a single process context (see the first sketch below)
Past virtualization solutions:
 API rerouting / an intercept library (see the second sketch below)
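Concurrency within a single context is expressed with CUDA streams. In this minimal sketch (kernel names and sizes are illustrative), launches issued to different streams of the same process may overlap on the device; two separate processes cannot combine their kernels this way, which is exactly the limitation the consolidation framework targets:

```cuda
#include <cuda_runtime.h>

// Two independent kernels; with enough free SMs the hardware can run
// them concurrently, but only because both launches come from the same
// process (the same GPU context).
__global__ void taskA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}
__global__ void taskB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Different streams: the two kernels may space-share the SMs.
    taskA<<<n / 256, 256, 0, s1>>>(x, n);
    taskB<<<n / 256, 256, 0, s2>>>(y, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```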

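API rerouting is typically implemented as a shim library loaded ahead of the real runtime. This is a hypothetical LD_PRELOAD-style sketch using the modern cudaLaunchKernel entry point; the paper's framework intercepts CUDA/OpenCL calls in the same spirit, and the logging here is a stand-in for the real consolidation logic:

```cuda
// Hypothetical intercept shim. Build as a shared object and preload it:
//   g++ -shared -fPIC intercept.cpp -o libintercept.so -ldl
//   LD_PRELOAD=./libintercept.so ./gpu_app
// Works only when the application links the CUDA runtime dynamically.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <dlfcn.h>
#include <stdio.h>
#include <cuda_runtime.h>

extern "C" cudaError_t cudaLaunchKernel(const void *func, dim3 grid, dim3 block,
                                        void **args, size_t sharedMem,
                                        cudaStream_t stream) {
    // Locate the real implementation further down the link order.
    typedef cudaError_t (*launch_t)(const void *, dim3, dim3, void **,
                                    size_t, cudaStream_t);
    static launch_t real = (launch_t)dlsym(RTLD_NEXT, "cudaLaunchKernel");

    // A consolidation framework would queue the launch here, choose a
    // device, and possibly mold the configuration before forwarding.
    fprintf(stderr, "intercepted launch: grid=(%u,%u,%u) block=(%u,%u,%u)\n",
            grid.x, grid.y, grid.z, block.x, block.y, block.z);
    return real(func, grid, block, args, sharedMem, stream);
}
```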

1. The ‘How?’ – Architecture


2. Evaluation – The opportunity
 Key assumption: under-utilization of GPUs
 Sharing:
 - Space-sharing: kernels occupy different SMs
 - Time-sharing: kernels time-share the same SMs (benefiting from hardware support for context switches)
 Note: resource conflicts may prevent this
 Molding: change the kernel configuration (a different number of thread blocks / threads per block) to improve collocation (see the sketch below)
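Molding presumes the kernel is configuration-independent. A common way to write such kernels is a grid-stride loop, sketched below with illustrative sizes: the framework can then shrink or reshape the launch to free SMs for a collocated kernel without touching the kernel source.

```cuda
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles i, i + stride, i + 2*stride, ...
// so correctness does not depend on gridDim or blockDim.
__global__ void scale(float *x, int n, float a) {
    int stride = gridDim.x * blockDim.x;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        x[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *x;
    cudaMalloc(&x, n * sizeof(float));

    // The same kernel under two molded configurations:
    scale<<<1024, 128>>>(x, n, 2.0f);  // many small blocks spread across SMs
    scale<<<64, 512>>>(x, n, 2.0f);    // fewer blocks leave SMs free for a
                                       // collocated kernel
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}
```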

2. Evaluation – The gains

2. Evaluation – The overheads

Discussion
 Limitations
 Hardware support

OpenCL vs. CUDA
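For reference during the comparison, here is the same vector-add kernel in both models; the mapping is illustrative, not from the slides. CUDA is shown as code, with the OpenCL C equivalent in comments:

```cuda
// CUDA: a grid of thread blocks; each thread computes one element.
__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // OpenCL: get_global_id(0)
    if (i < n) c[i] = a[i] + b[i];
}

/* OpenCL C equivalent:
   __kernel void vadd(__global const float *a, __global const float *b,
                      __global float *c, int n) {
       int i = get_global_id(0);
       if (i < n) c[i] = a[i] + b[i];
   }
   Terminology mapping: grid = NDRange, thread block = work-group,
   thread = work-item, shared memory = local memory. */
```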