Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework
Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Laboratories America)

Two Interesting Trends
- GPUs: a "big player" in high-performance computing
  - Excellent price-performance and performance-per-watt ratios
  - Heterogeneous architectures: AMD Fusion APU, Intel Sandy Bridge, NVIDIA Project Denver
  - 3 of the top 4 supercomputers (Tianhe-1A, Nebulae, and Tsubame) use GPUs
- Emergence of the cloud: "pay-as-you-go" model
  - Cluster instances and high-speed interconnects for HPC users
  - Amazon and Nimbix GPU instances: a big first step, but still at an early stage

Motivation
- Sharing is the basis of the cloud; GPUs are no exception
  - Multiple virtual machines may share a physical node
- Modern GPUs are more expensive than multi-core CPUs
  - Fermi cards with 6 GB memory cost about $4,000
  - Better resource utilization is therefore important
- Modern GPUs expose a high degree of parallelism
  - Individual applications may not utilize their full potential

Related Work
- Enabling GPU visibility from virtual machines: vCUDA (Shi et al.), GViM (Gupta et al.), gVirtuS (Giunta et al.), rCUDA (Duato et al.)
- But how can GPUs be shared from virtual machines?
  - CUDA compute capability 2.0+ supports task parallelism (see the sketch below)
  - Limitation: only from a single process context
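To make the single-process limitation concrete, here is a minimal CUDA sketch (not from the paper): on a compute capability 2.0+ device, two kernels launched into different streams of the same process may run concurrently, whereas kernels from two separate processes (for example, two VMs) get no such overlap. The kernel body and launch sizes are illustrative.

```cpp
// Illustrative only: two kernels launched from the SAME process into
// different streams may overlap on a compute capability 2.0+ GPU.
#include <cuda_runtime.h>

__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = data[i];
        for (int k = 0; k < 1000; ++k) v = v * 0.999f + 0.001f; // burn cycles
        data[i] = v;
    }
}

int main() {
    const int n = 1 << 16;
    float *a, *b;
    cudaMalloc(&a, n * sizeof(float));
    cudaMalloc(&b, n * sizeof(float));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    // Small grids leave SMs idle, so the two launches can space-share.
    busyKernel<<<7, 256, 0, s1>>>(a, n);
    busyKernel<<<7, 256, 0, s2>>>(b, n);

    cudaDeviceSynchronize();   // wait for both streams
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(a);
    cudaFree(b);
    return 0;
}
```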

Contributions
- A framework for transparent GPU sharing in the cloud: no source code changes required, making it feasible in cloud settings
- Sharing through consolidation
  - A solution to the conceptual consolidation problem
  - A new method for computing consolidation affinity scores
  - Two new molding methods
  - An overall runtime consolidation algorithm
- Extensive evaluation with 8 benchmarks on 2 GPUs
  - At high contention, 50% improved throughput
  - Framework overheads are small

Outline
- Background
- Understanding Consolidation on GPU
- Framework Design
- Consolidation Decision Making Layer
- Experimental Results
- Conclusions

Outline: Background

BACKGROUND
- GPU Architecture
- CUDA Mapping and Scheduling

Background: GPU architecture (figure): multiple SMs, each with its own shared memory (SH MEM), all attached to GPU device memory.
- Resource requirements < maximum available: interleaved execution
- Resource requirements > maximum available: serialized execution

Outline: Understanding Consolidation on GPU

UNDERSTANDING CONSOLIDATION ON GPU
- Demonstrate the potential of consolidation
- Relation between utilization and performance
- Preliminary experiments with consolidation

GPU Utilization vs. Performance (figure): scalability of applications, ranging from linear and sub-linear scaling (good improvement as utilization grows) to no significant improvement.

Consolidation with Space and Time Sharing (figure): App 1 and App 2 sharing SMs, each SM with its own shared memory (SH MEM).
- One application alone cannot utilize all SMs effectively
- Better performance at a large number of blocks

Outline: Framework Design

FRAMEWORK DESIGN
- Challenges
- gVirtuS Current Design
- Consolidation Framework and its Components

Design Challenges
- Enabling GPU sharing: requires a virtual process context
- Deciding when and what to consolidate: requires policies and algorithms
- Overheads: call for a light-weight design

gVirtuS Current Design (figure): on the guest side, CUDA App1 in VM1 and CUDA App2 in VM2 each link against a frontend library. Calls cross the guest-host communication channel provided by Linux / the VMM. On the host side, the gVirtuS backend forks one backend process per application (with no communication between processes); the backend processes run on the CUDA runtime and driver over GPU1 ... GPUn. (The sketch below illustrates the interception idea.)
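The essence of this design is API remoting. Here is a minimal sketch of the idea, not gVirtuS's actual source: the frontend library exports CUDA runtime entry points and forwards each call to the backend over the guest-host channel. The `channel_send`/`channel_recv` transport, the opcode, and the message layout are all hypothetical stand-ins for gVirtuS's real protocol.

```cpp
// Sketch of the split-driver idea (not gVirtuS's real code): a
// guest-side stub library exports CUDA runtime entry points and
// forwards each call over a guest-host channel to the backend
// process, which executes it against the real CUDA runtime.
#include <cstddef>

typedef int cudaError_t;   // stands in for the real CUDA type

// Hypothetical transport; in gVirtuS this is a VMM-specific channel.
void channel_send(const void *buf, size_t len);
void channel_recv(void *buf, size_t len);

enum { OP_CUDA_MALLOC = 1 };   // hypothetical wire opcode

// The stub exported to guest applications in place of libcudart's symbol.
extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
    struct { int op; size_t size; } req = { OP_CUDA_MALLOC, size };
    channel_send(&req, sizeof(req));   // marshal opcode + arguments

    struct { cudaError_t err; unsigned long long ptr; } rep;
    channel_recv(&rep, sizeof(rep));   // backend's status + device pointer
    *devPtr = reinterpret_cast<void *>(rep.ptr);
    return rep.err;
}
```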

Runtime Consolidation Framework (figure, host side): workloads arrive from the frontends at a backend server, which queues them to a dispatcher. The consolidation decision maker, guided by policies and heuristics, queues each workload into the ready queue of a virtual context. Each virtual context thread runs a workload consolidator that drives one GPU. (A structural sketch follows.)
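A structural sketch of how these components could fit together using C++ threads and queues: the shapes (one virtual-context thread per GPU, a blocking ready queue each, a dispatcher routing through the decision maker) follow the diagram, while all names and the placeholder `chooseContext` policy are assumptions, not the authors' code.

```cpp
// Structural sketch (assumed shapes): a dispatcher drains workloads
// arriving from the frontends, asks the decision maker for a virtual
// context, and each virtual-context thread consolidates its ready
// queue onto one GPU.
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

struct Workload { int id; int blocks; int threadsPerBlock; };

class WorkQueue {                 // blocking FIFO shared between threads
    std::queue<Workload> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(Workload w) {
        { std::lock_guard<std::mutex> l(m_); q_.push(w); }
        cv_.notify_one();
    }
    Workload pop() {
        std::unique_lock<std::mutex> l(m_);
        cv_.wait(l, [&] { return !q_.empty(); });
        Workload w = q_.front(); q_.pop();
        return w;
    }
};

// Decision maker: placeholder for the affinity/molding policies.
int chooseContext(const Workload &w, int numGpus) { return w.id % numGpus; }

int main() {
    const int numGpus = 2;
    WorkQueue incoming;                       // fed by the backend server
    std::vector<WorkQueue> ready(numGpus);    // one ready queue per context

    std::vector<std::thread> contexts;        // one virtual context per GPU
    for (int g = 0; g < numGpus; ++g)
        contexts.emplace_back([&ready, g] {
            for (;;) {
                Workload w = ready[g].pop();  // consolidate onto GPU g
                (void)w;                      // ...launch via the consolidator
            }
        });

    std::thread dispatcher([&] {              // route incoming workloads
        for (;;) {
            Workload w = incoming.pop();
            ready[chooseContext(w, numGpus)].push(w);
        }
    });

    dispatcher.join();                        // sketch: runs until killed
    for (auto &t : contexts) t.join();
}
```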

Outline: Consolidation Decision Making Layer

CONSOLIDATION DECISION MAKING LAYER
- GPU Sharing Mechanisms and Resource Contention
- Two Molding Policies
- Consolidation Runtime Scheduling Algorithm

Sharing Mechanisms and Resource Contention
- Sharing mechanisms: consolidation by space sharing, and consolidation by time sharing
- Sources of resource contention: a large number of threads within a block, and pressure on shared memory
- Resource contention forms the basis of the affinity score (a sketch of such a score follows)
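One way such an affinity score could look, sketched under stated assumptions: the score falls as the pair's combined thread count and shared memory demand approach an SM's limits (Fermi figures used as defaults). This only mirrors the idea that contention drives affinity; it is not the paper's exact formula.

```cpp
// Assumed-form affinity sketch: a kernel pair scores lower the more
// their combined thread and shared memory demand would oversubscribe
// one SM. Not the paper's exact formula.
struct KernelCfg {
    int blocks;
    int threadsPerBlock;
    int sharedMemPerBlock;   // bytes
};

double affinity(const KernelCfg &a, const KernelCfg &b,
                int maxThreadsPerSM = 1536,        // Fermi: 1536 threads/SM
                int sharedMemPerSM  = 48 * 1024) { // Fermi: 48 KB/SM
    double threadLoad =
        double(a.threadsPerBlock + b.threadsPerBlock) / maxThreadsPerSM;
    double shmemLoad =
        double(a.sharedMemPerBlock + b.sharedMemPerBlock) / sharedMemPerSM;
    double pressure = threadLoad > shmemLoad ? threadLoad : shmemLoad;
    return pressure <= 1.0 ? 1.0 - pressure : 0.0;  // 0 = heavy contention
}
```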

Molding Kernel Configuration
- Perform molding dynamically, leveraging gVirtuS to intercept the kernel launch
- This makes the configuration flexible to modify: mold it to reduce contention
- Molding may increase the latency of an individual application, but can still improve global throughput (see the sketch below)
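Because every launch already passes through the interposed runtime layer, molding amounts to rewriting the launch configuration before it is forwarded. A sketch of that hook, with all types and names assumed (the real interception happens inside gVirtuS's marshalled CUDA calls):

```cpp
// Sketch of the molding hook (assumed names/types): the interposed
// layer rewrites the grid/block configuration before forwarding the
// launch to the real CUDA runtime on the backend.
#include <cstddef>

struct Dim3 { unsigned x, y, z; };   // stands in for CUDA's dim3

struct LaunchCfg {
    Dim3   grid;
    Dim3   block;
    size_t sharedMem;
};

// Identity placeholder; the two concrete policies follow on the next slide.
LaunchCfg moldPolicy(LaunchCfg cfg) { return cfg; }

void interceptedConfigure(LaunchCfg cfg) {
    LaunchCfg molded = moldPolicy(cfg);  // reduce contention if needed
    (void)molded;                        // ...forward to the backend's launch
}
```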

Two Molding Policies (sketched below)
- Forced space sharing: e.g., mold a 14 * 256 configuration to 7 * 256; may resolve shared memory contention
- Time sharing with reduced threads: e.g., mold 14 * 512 to 14 * 128; may reduce register pressure in the SM
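The two policies translate directly into configuration rewrites. A sketch using the slide's numbers; the halving and quartering factors simply reproduce the 14 * 256 to 7 * 256 and 14 * 512 to 14 * 128 examples and are not a general rule from the paper.

```cpp
// The two molding policies in assumed form, mirroring the slide's
// example configurations (blocks * threadsPerBlock).
struct Config { int blocks; int threadsPerBlock; };

// Forced space sharing: halve the grid, e.g. 14*256 -> 7*256.
// Fewer resident blocks can relieve shared memory contention.
Config forcedSpaceSharing(Config c) {
    c.blocks = (c.blocks + 1) / 2;
    return c;
}

// Time sharing with reduced threads: shrink the block, e.g.
// 14*512 -> 14*128. Fewer threads can relieve register pressure.
Config timeSharingReducedThreads(Config c) {
    if (c.threadsPerBlock > 128)
        c.threadsPerBlock /= 4;
    return c;
}
```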

Consolidation Scheduling Algorithm
- A greedy scheduling algorithm: schedule N kernels on 2 GPUs
- Input: a list of 3-tuple execution configurations, one per kernel
- Data structure: a work queue for each virtual context
- The overall algorithm builds on three helpers: generate pairwise affinity, generate affinity for a list, and get affinity by molding

Consolidation Scheduling Algorithm (flowchart; a code sketch follows)
1. Create work queues for the virtual contexts from the configuration list.
2. Generate pairwise affinities; find the pair with minimum affinity and split that pair into different queues.
3. For each remaining kernel, against each work queue: (a1, a2) = generate affinity for the list; (a3, a4) = get affinity by molding.
4. Find max(a1, a2, a3, a4) and push the kernel into the corresponding queue (molded if that scored best).
5. Dispatch the queues into the virtual contexts.
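Putting the flowchart into code, under stated assumptions: this sketch reuses the KernelCfg and affinity() helpers from the affinity-score sketch above, defines list affinity as the minimum pairwise affinity (the paper's definition may differ), molds only by reducing threads, and assumes at least two kernels and two GPUs.

```cpp
// Greedy consolidation scheduler for 2 GPUs, paraphrasing the slide's
// flowchart; helper internals are assumptions, the control flow is not.
#include <algorithm>
#include <cstddef>
#include <vector>

// Affinity of adding kernel k to an existing queue: the minimum
// pairwise affinity against its members (assumed definition).
double affinityForList(const KernelCfg &k, const std::vector<KernelCfg> &q) {
    double a = 1.0;
    for (const KernelCfg &m : q) a = std::min(a, affinity(k, m));
    return a;
}

void schedule(std::vector<KernelCfg> kernels, std::vector<KernelCfg> queue[2]) {
    // Steps 1-2: find the least-compatible pair and split it across queues.
    std::size_t pi = 0, pj = 1;
    double worst = 2.0;
    for (std::size_t i = 0; i < kernels.size(); ++i)
        for (std::size_t j = i + 1; j < kernels.size(); ++j) {
            double a = affinity(kernels[i], kernels[j]);
            if (a < worst) { worst = a; pi = i; pj = j; }
        }
    queue[0].push_back(kernels[pi]);
    queue[1].push_back(kernels[pj]);
    kernels.erase(kernels.begin() + pj);   // erase the larger index first
    kernels.erase(kernels.begin() + pi);

    // Steps 3-4: place each remaining kernel, as-is or molded, wherever
    // it scores the highest affinity (max of a1, a2, a3, a4).
    for (const KernelCfg &k : kernels) {
        KernelCfg molded = k;              // illustrative: reduce threads
        molded.threadsPerBlock = std::max(128, k.threadsPerBlock / 4);
        double best = -1.0;
        int bestQ = 0;
        KernelCfg bestCfg = k;
        for (int q = 0; q < 2; ++q) {
            double a = affinityForList(k, queue[q]);        // a1 / a2
            if (a > best) { best = a; bestQ = q; bestCfg = k; }
            double am = affinityForList(molded, queue[q]);  // a3 / a4
            if (am > best) { best = am; bestQ = q; bestCfg = molded; }
        }
        queue[bestQ].push_back(bestCfg);
    }
    // Step 5: the queues are then dispatched into the virtual contexts.
}
```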

Outline: Experimental Results

EXPERIMENTAL RESULTS
- Setup, Metric, and Baselines
- Benchmarks
- Results

Setup, Metric, and Baselines
- Setup: a machine with two quad-core Intel Xeon E5520 CPUs and two NVIDIA Tesla C2050 GPUs (14 streaming multiprocessors of 32 cores each, 3 GB device memory, 48 KB shared memory per SM), virtualized with gVirtuS 2.0
- Evaluation metric: global throughput benefit obtained after consolidation of kernels
- Baselines: serialized execution based on CUDA runtime scheduling, and blind round-robin consolidation (unaware of execution configurations)

Benchmarks and Goals (table): the benchmarks and their characteristics.

Benefits of Space and Time Sharing Mechanisms (figures: space sharing, time sharing)
- Scenario with no resource contention
- Consolidation through the blind round-robin algorithm, compared against serialized execution of the kernels

Drawbacks of Blind Scheduling
- In the presence of resource contention (a large number of threads, or shared memory contention), consolidation yields no benefit

Effect of Molding (figures)
- Contention from a large number of threads: addressed by time sharing with reduced threads
- Contention on shared memory: addressed by forced space sharing

Effect of Affinity Scores
- Kernel configurations: 2 kernels with 7 * 512 and 2 kernels with 14 * 256
- Without affinity: unbalanced threads per SM
- With affinity: better thread balancing per SM

Benefits at a High-Contention Scenario
- 8 kernels on 2 GPUs; 6 of the 8 kernels molded
- 31.5% improvement over blind scheduling; 50% over serialized execution

Framework Overheads
- No consolidation: compared to plain gVirtuS execution, overhead is always less than 1%
- With consolidation: compared with manually consolidated execution, overhead is always less than 4%

Outline: Conclusions

Conclusions
- A framework for transparent sharing of GPUs, using consolidation as the sharing mechanism, with no source-code-level changes
- New affinity and molding methods, and a runtime consolidation scheduling algorithm
- At high contention, significant throughput benefits; the overheads of the framework are small

Thank you for your attention! Questions? Authors' contact information: raviv@cse.ohio-state.edu, becchim@missouri.edu, agrawal@cse.ohio-state.edu, chak@nec-labs.com

Impact of Large Number of Threads

Per-Application Slowdown / Choice of Molding Type