Presentation on theme: "Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework" — Presentation transcript:

1 Supporting GPU Sharing in Cloud Environments with a Transparent Runtime Consolidation Framework
Vignesh Ravi (The Ohio State University), Michela Becchi (University of Missouri), Gagan Agrawal (The Ohio State University), Srimat Chakradhar (NEC Laboratories America)

2 Two Interesting Trends
GPUs: a "big player" in high-performance computing
- Excellent price-performance and performance-per-watt ratios
- Heterogeneous architectures: AMD Fusion APU, Intel Sandy Bridge, NVIDIA Project Denver
- 3 of the top 4 supercomputers (Tianhe-1A, Nebulae, and Tsubame) use GPUs
Emergence of the cloud: the "pay-as-you-go" model
- Cluster instances and high-speed interconnects for HPC users
- Amazon and Nimbix GPU instances: a big first step, but still at an early stage

3 Motivation
Sharing is the basis of the cloud, and GPUs are no exception
- Multiple virtual machines may share a physical node
Modern GPUs are more expensive than multi-core CPUs
- Fermi cards with 6 GB memory cost about $4,000
Better resource utilization
- Modern GPUs expose a high degree of parallelism
- A single application may not utilize the full potential

4 Related Work
Enabling GPU visibility from virtual machines:
- vCUDA (Shi et al.)
- GViM (Gupta et al.)
- gVirtuS (Giunta et al.)
- rCUDA (Duato et al.)
How to share GPUs from virtual machines?
- CUDA compute supports task parallelism
- Limitation: only from a single process context

5 Contributions
A framework for transparent GPU sharing in the cloud
- No source-code changes required; feasible in the cloud
Sharing through consolidation
- A solution to the conceptual consolidation problem
- A new method for computing consolidation affinity scores
- Two new molding methods
- An overall runtime consolidation algorithm
Extensive evaluation with 8 benchmarks on 2 GPUs
- At high contention, 50% improved throughput
- Small framework overheads

6 Outline
Background
Understanding Consolidation on GPU
Framework Design
Consolidation Decision Making Layer
Experimental Results
Conclusions

7 Outline: Background
(Background, Understanding Consolidation on GPU, Framework Design, Consolidation Decision Making Layer, Experimental Results, Conclusions)

8 Background
GPU Architecture
CUDA Mapping and Scheduling

9 Background
[Figure: GPU architecture: a row of SMs, each with its own shared memory, above the GPU device memory]
- Resource requirements < max available → interleaved execution
- Resource requirements > max available → serialized execution
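To make the mapping concrete: a CUDA launch specifies blocks × threads (plus per-block shared memory), and the hardware interleaves or serializes blocks depending on whether those demands fit under the per-SM limits. A minimal sketch; the kernel, sizes, and shared-memory usage here are illustrative, not from the talk:

```cuda
#include <cuda_runtime.h>

// Illustrative kernel: each block reserves shared memory, so an SM's
// resources (shared memory, registers, thread slots) bound how many
// blocks it can run at once.
__global__ void scale(float *data, float factor, int n) {
    __shared__ float tile[256];                   // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        tile[threadIdx.x] = data[i] * factor;
        data[i] = tile[threadIdx.x];
    }
}

int main() {
    const int n = 14 * 256;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    // 14 blocks x 256 threads: if each block's demand fits under the per-SM
    // limits, the hardware interleaves blocks across (and within) SMs;
    // blocks that exceed the remaining resources wait, i.e. serialize.
    scale<<<14, 256>>>(d, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d);
    return 0;
}
```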

10 Outline: Understanding Consolidation on GPU
(Background, Understanding Consolidation on GPU, Framework Design, Consolidation Decision Making Layer, Experimental Results, Conclusions)

11 Understanding Consolidation on GPU
- Demonstrate the potential of consolidation
- Relation between utilization and performance
- Preliminary experiments with consolidation

12 GPU Utilization vs. Performance
[Chart: scalability of the applications as GPU utilization grows; some show good improvement (linear or sub-linear scaling), others no significant improvement]

13 Consolidation with Space and Time Sharing
[Figure: App 1 and App 2 consolidated across SMs, each SM with its own shared memory]
- One application alone cannot utilize all SMs effectively
- Better performance at a large number of blocks
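Both sharing modes rest on concurrent kernel execution, which (per the related-work slide) is available only within a single process context. A minimal sketch of that mechanism, with made-up kernels on a 14-SM card:

```cuda
#include <cuda_runtime.h>

// Two toy "application" kernels standing in for two consolidated workloads.
__global__ void appKernelA(float *x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}
__global__ void appKernelB(float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] *= 2.0f;
}

int main() {
    const int n = 7 * 256;
    float *x, *y;
    cudaMalloc(&x, n * sizeof(float));
    cudaMalloc(&y, n * sizeof(float));

    // One process context, two streams: on Fermi-class GPUs, kernels in
    // different streams may run concurrently, space-sharing the SMs when
    // neither kernel needs all of them (here, 7 blocks each on 14 SMs).
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);
    appKernelA<<<7, 256, 0, s1>>>(x, n);
    appKernelB<<<7, 256, 0, s2>>>(y, n);
    cudaDeviceSynchronize();

    cudaStreamDestroy(s1); cudaStreamDestroy(s2);
    cudaFree(x); cudaFree(y);
    return 0;
}
```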

14 Outline: Framework Design
(Background, Understanding Consolidation on GPU, Framework Design, Consolidation Decision Making Layer, Experimental Results, Conclusions)

15 Framework Design
Challenges
gVirtuS Current Design
Consolidation Framework and Its Components

16 Design Challenges
Enabling GPU sharing
- Need a virtual process context
When and what to consolidate
- Need policies and algorithms to decide
Overheads
- Need a lightweight design

17 gVirtuS Current Design
[Figure: guest side: CUDA App1 in VM1 and CUDA App2 in VM2, each linked against a frontend library; a guest-host communication channel crosses Linux/VMM to the host side, where the gVirtuS backend forks one backend process per application on top of the CUDA runtime and driver, managing GPU1 ... GPUn]
- The backend forks a separate process per application
- No communication between the backend processes, hence no sharing
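The slides give no code for the frontend/backend split. Purely as a hypothetical sketch of the interception idea (the Channel type, opcode, and reply layout are invented here, not gVirtuS's actual protocol):

```cuda
// Hypothetical sketch only -- NOT the real gVirtuS code or wire protocol.
// The guest-side frontend library exports CUDA runtime symbols and forwards
// each call over a guest-host channel to a backend process, which executes
// it on the real CUDA runtime.
#include <cstddef>
#include <cstring>

typedef int cudaError_t;             // stand-in for the real typedef

struct Channel {                     // assumed guest-host transport
    void send(int opcode, const void *args, size_t len) { /* write to channel */ }
    void recv(void *reply, size_t len) { memset(reply, 0, len); /* read reply */ }
};
static Channel gChannel;             // e.g., one channel per guest process

enum { OP_CUDA_MALLOC = 1 /* ... one opcode per intercepted call ... */ };

// The application links against this instead of the real libcudart.
extern "C" cudaError_t cudaMalloc(void **devPtr, size_t size) {
    gChannel.send(OP_CUDA_MALLOC, &size, sizeof(size));
    struct { cudaError_t err; void *ptr; } reply;
    gChannel.recv(&reply, sizeof(reply));
    *devPtr = reply.ptr;             // a handle valid in the backend's context
    return reply.err;
}
```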

18 Runtime Consolidation Framework
[Figure, host side: workloads arrive from the frontends at the backend server, which queues them to the dispatcher; the consolidation decision maker, guided by policies and heuristics, queues workloads to the ready queues of the virtual contexts; in each virtual context, a workload-consolidator thread runs the consolidated workloads on its GPU]

19 Outline: Consolidation Decision Making Layer
(Background, Understanding Consolidation on GPU, Framework Design, Consolidation Decision Making Layer, Experimental Results, Conclusions)

20 Consolidation Decision Making Layer
GPU Sharing Mechanisms & Resource Contention
Two Molding Policies
Consolidation Runtime Scheduling Algorithm

21 Sharing Mechanisms & Resource Contention
Sharing mechanisms
- Consolidation by space sharing
- Consolidation by time sharing (for kernels with a large number of threads within a block)
Resource contention
- Basis of the affinity score: pressure on shared memory (an illustrative scoring sketch follows)
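The slides name shared-memory pressure as the basis of the affinity score but not the formula, so the scoring function below is an assumption for illustration: two kernels score high when their combined per-block shared-memory demand leaves headroom in an SM.

```cuda
#include <cstddef>

// Illustrative affinity score; this particular formula is an assumption,
// not the paper's. Higher is better: two kernels whose combined per-block
// shared-memory demand fits in an SM can space-share without contention.
struct KernelCfg {
    int blocks;             // grid size (number of thread blocks)
    int threads;            // threads per block
    size_t shmemPerBlock;   // shared memory per block, in bytes
};

double affinity(const KernelCfg &a, const KernelCfg &b, size_t shmemPerSM) {
    double demand = double(a.shmemPerBlock + b.shmemPerBlock);
    if (demand > double(shmemPerSM))
        return 0.0;                             // co-resident blocks would contend
    return 1.0 - demand / double(shmemPerSM);   // remaining headroom per SM
}
```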

22 Molding Kernel Configuration
Perform molding dynamically
- Leverage gVirtuS to intercept the kernel launch
- Flexible configuration modification
Mold the configuration to reduce contention
- Potential increase in application latency
- However, may still improve global throughput

23 Two Molding Policies
Forced space sharing
- e.g., mold 14 × 256 down to 7 × 256
- May resolve shared-memory contention
Time sharing with reduced threads
- e.g., mold 14 × 512 down to 14 × 128
- May reduce register pressure in the SM
(A configuration-transform sketch follows.)
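The two policies can be viewed as transforms on the execution configuration. A sketch matching the slide's example configurations; the exact reduction factors applied at runtime are not spelled out on the slide:

```cuda
// Simplified configuration (shared-memory field omitted from the affinity sketch).
struct KernelCfg { int blocks; int threads; };

// Forced space sharing: halve the block count so a co-scheduled kernel
// gets SMs of its own. Assumes the kernel iterates until all work is done,
// so fewer blocks simply each process more data.
KernelCfg forcedSpaceSharing(KernelCfg c) {
    c.blocks = (c.blocks + 1) / 2;               // e.g., 14 x 256 -> 7 x 256
    return c;
}

// Time sharing with reduced threads: shrink threads per block to relieve
// per-SM register (and shared-memory) pressure.
KernelCfg timeSharingReducedThreads(KernelCfg c) {
    c.threads /= 4;                              // e.g., 14 x 512 -> 14 x 128
    return c;
}
```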

24 Consolidation Scheduling Algorithm
Greedy scheduling algorithm: schedule N kernels on 2 GPUs
- Input: the 3-tuple execution-configuration list of all kernels
- Data structure: a work queue per virtual context
The overall algorithm builds on three routines:
- Generate pairwise affinity
- Generate affinity for a list
- Get affinity by molding

25 Consolidation Scheduling Algorithm
- Create work queues for the virtual contexts
- From the configuration list, generate pairwise affinities; find the pair with minimum affinity and split it into different queues
- For each remaining kernel, with each work queue: (a1, a2) = generate affinity for list; (a3, a4) = get affinity by molding
- Find max(a1, a2, a3, a4) and push the kernel into the corresponding queue
- Dispatch the queues into the virtual contexts
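Fleshing the pseudocode out with the earlier sketches gives something like the following; the affinity formula, molding factor, and tie-breaking are assumptions, and only the overall greedy structure comes from the slide:

```cuda
#include <algorithm>
#include <cstddef>
#include <vector>

struct KernelCfg { int blocks; int threads; size_t shmemPerBlock; };

static const size_t kShmemPerSM = 48 * 1024;     // Tesla C2050: 48 KB per SM

double pairAffinity(const KernelCfg &a, const KernelCfg &b) {
    double demand = double(a.shmemPerBlock + b.shmemPerBlock);
    return demand > kShmemPerSM ? 0.0 : 1.0 - demand / kShmemPerSM;
}

// Affinity of kernel k with everything already queued for one context.
double affinityForList(const KernelCfg &k, const std::vector<KernelCfg> &q) {
    double a = 1.0;
    for (const KernelCfg &other : q) a = std::min(a, pairAffinity(k, other));
    return a;
}

// Affinity of k with the queue if k is first molded. Assumes shared-memory
// use scales with threads per block (as in tiled kernels); k is replaced
// by its molded form.
double affinityByMolding(KernelCfg &k, const std::vector<KernelCfg> &q) {
    k.threads /= 2;
    k.shmemPerBlock /= 2;
    return affinityForList(k, q);
}

// Split the minimum-affinity pair across the two queues, then greedily
// place each remaining kernel (molded or not) where its affinity is best.
void schedule(const std::vector<KernelCfg> &ks,
              std::vector<KernelCfg> &q1, std::vector<KernelCfg> &q2) {
    // 1. Seed: the worst-conflicting pair must not share a GPU.
    size_t pi = 0, pj = 1;                       // assumes ks.size() >= 2
    for (size_t i = 0; i < ks.size(); ++i)
        for (size_t j = i + 1; j < ks.size(); ++j)
            if (pairAffinity(ks[i], ks[j]) < pairAffinity(ks[pi], ks[pj])) {
                pi = i; pj = j;
            }
    q1.push_back(ks[pi]);
    q2.push_back(ks[pj]);

    // 2. Greedy placement of every remaining kernel.
    for (size_t i = 0; i < ks.size(); ++i) {
        if (i == pi || i == pj) continue;
        KernelCfg m1 = ks[i], m2 = ks[i];
        double a1 = affinityForList(ks[i], q1);  // unmolded vs. queue 1
        double a2 = affinityForList(ks[i], q2);  // unmolded vs. queue 2
        double a3 = affinityByMolding(m1, q1);   // molded vs. queue 1
        double a4 = affinityByMolding(m2, q2);   // molded vs. queue 2
        double best = std::max({a1, a2, a3, a4});
        if (best == a1)      q1.push_back(ks[i]);
        else if (best == a2) q2.push_back(ks[i]);
        else if (best == a3) q1.push_back(m1);
        else                 q2.push_back(m2);
    }
    // 3. q1 and q2 are then dispatched to the two virtual contexts.
}
```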

26 Outline: Experimental Results
(Background, Understanding Consolidation on GPU, Framework Design, Consolidation Decision Making Layer, Experimental Results, Conclusions)

27 Experimental Results
Setup, Metric & Baselines
Benchmarks
Results

28 Setup, Metric & Baselines
Setup
- A machine with two quad-core Intel Xeon E5520 CPUs
- Two NVIDIA Tesla C2050 GPU cards: 14 streaming multiprocessors (32 cores each), 3 GB device memory, 48 KB shared memory per SM
- Virtualized with gVirtuS 2.0
Evaluation metric
- Global throughput benefit obtained after consolidation of kernels
Baselines
- Serialized execution, based on CUDA runtime scheduling
- Blind round-robin consolidation (unaware of execution configurations)

29 Benchmarks & Goals
[Table: the 8 benchmarks and their characteristics]

30 Benefits of Space and Time Sharing Mechanisms
- Workloads with no resource contention
- Consolidation through the blind round-robin algorithm
- Compared against serialized execution of the kernels
[Charts: throughput benefits under space sharing and under time sharing]

31 Drawbacks of Blind Scheduling
In the presence of resource contention:
- Large number of threads
- Shared-memory contention
No benefit from consolidation

32 Effect of Molding
- Contention from a large number of threads: molded by time sharing with reduced threads
- Contention on shared memory: molded by forced space sharing

33 Effect of Affinity Scores
Kernel configurations: 2 kernels with 7 × 512, 2 kernels with 14 × 256
- Without affinity: unbalanced threads per SM
- With affinity: better thread balancing per SM (e.g., pairing the two 7 × 512 kernels lets their 7 + 7 blocks fill the 14 SMs evenly)

34 Benefits at a High-Contention Scenario
- 8 kernels on 2 GPUs; 6 of the 8 kernels were molded
- 31.5% improvement over blind scheduling
- 50% improvement over serialized execution

35 Framework Overheads
No consolidation: compared to plain gVirtuS execution, overhead is always less than 1%
With consolidation: compared with manually consolidated execution, overhead is always less than 4%

36 Outline: Conclusions
(Background, Understanding Consolidation on GPU, Framework Design, Consolidation Decision Making Layer, Experimental Results, Conclusions)

37 Conclusions
- A framework for transparent sharing of GPUs: consolidation as the sharing mechanism, with no source-code changes
- New affinity and molding methods
- A runtime consolidation scheduling algorithm
- Significant throughput benefits at high contention
- Small framework overheads

38 Thank You for Your Attention!
Questions?
Authors' contact information:

39 Impact of Large Number of Threads

40 Per-Application Slowdown / Choice of Molding Type

