Task-Based Execution of GPU Applications with Dynamic Data Dependencies
Mehmet E. Belviranli, Chih H. Chou, Laxmi N. Bhuyan, Rajiv Gupta

GP-GPU Computing
- GPUs enable high-throughput, data- and compute-intensive computations
- Data is partitioned into a grid of "Thread Blocks" (TBs)
- Thousands of TBs in a grid can be executed in any order
- No HW support for efficient inter-TB communication
- High scalability & throughput for independent data
- Challenging & inefficient for inter-TB dependent data

The Problem
- Data-dependent & irregular applications
  - Simulations (n-body, heat)
  - Graph algorithms (BFS, SSSP)
- Inter-TB synchronization
  - Sync through global memory
- Irregular task graphs
  - Static partitioning fails
- Heterogeneous execution
  - Unbalanced distribution
(Figure: data dependency graph)

The Solution
"Task-based execution": a transition from SIMD to MIMD

5 Challenges
- Breaking applications into tasks
- Task-to-SM assignment
- Dependency tracking
- Inter-SM communication
- Load balancing

6 Proposed Task-Based Execution Framework
- Persistent worker TBs (one per SM)
- Distributed task queues (one per SM)
- In-GPU dependency tracking & scheduling
- Load balancing via different queue insertion policies
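A minimal CUDA sketch of what a persistent worker TB with a per-SM task queue might look like. All struct and function names here are illustrative assumptions, not the framework's actual API; the queue protocol is simplified to a single-consumer ring of task pointers.

```cuda
#define QUEUE_CAP 1024

struct Task { int id; /* application payload */ };
struct WorkerContext;                     // application-wide data

__device__ void user_task(WorkerContext* ctx, Task* t);  // app-supplied

struct Queue {
    Task* slots[QUEUE_CAP];               // queues store pointers to tasks
    volatile unsigned int head;           // consumer index (this TB only)
    volatile unsigned int tail;           // producer index (scheduler)
};

__device__ Task* try_pop(Queue* q) {
    unsigned int h = q->head;
    if (h == q->tail) return nullptr;     // nothing ready yet
    Task* t = q->slots[h % QUEUE_CAP];
    q->head = h + 1;                      // single consumer: no atomic needed
    return t;
}

__global__ void worker_kernel(WorkerContext* ctx, Queue* queues,
                              volatile int* shutdown) {
    __shared__ Task* cur;                 // current task, shared by the TB
    Queue* my_q = &queues[blockIdx.x];    // one persistent worker TB per SM
    while (!*shutdown) {                  // persist until host sets the flag
        if (threadIdx.x == 0) cur = try_pop(my_q);
        __syncthreads();                  // whole TB sees the same task
        if (cur) user_task(ctx, cur);     // all threads cooperate on it
        __syncthreads();                  // finish before polling again
    }
}
```

Launching exactly one such TB per SM sidesteps the lack of a TB-to-SM affinity mechanism: each resident TB effectively owns one SM and its queue for the lifetime of the kernel.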

7 Overview
(1) Grab a ready task
(2) Queue it
(3) Retrieve & execute
(4) Output results
(5) Resolve dependencies
(6) Grab new tasks

8 Concurrent Worker & Scheduler
(Figure: worker and scheduler TBs executing concurrently)

Queue Access & Dependency Tracking
- IQS and OQS
- Efficient signaling mechanism via global memory
- Parallel task pointer retrieval
  - Queues store pointers to tasks
- Parallel dependency check
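The parallel dependency check could be sketched as follows: when a task completes, scheduler threads decrement the unmet-dependency counters of its successors in parallel, and any task whose counter reaches zero is enqueued. Field and helper names (`unmet_deps`, `push`, `pick_queue`) are assumptions for illustration.

```cuda
struct Task {
    int unmet_deps;           // remaining unresolved dependencies
    int num_succ;             // number of successor tasks
    Task** successors;        // tasks waiting on this one
};

struct Queue;                                  // per-SM task queue
__device__ void push(Queue* q, Task* t);       // enqueue a ready task
__device__ int  pick_queue(Task* t);           // chosen by insertion policy

__device__ void resolve_dependencies(Task* finished, Queue* queues) {
    // One thread per successor: the dependency check itself is parallel.
    for (int i = threadIdx.x; i < finished->num_succ; i += blockDim.x) {
        Task* succ = finished->successors[i];
        // atomicSub returns the previous value, so a return of 1 means
        // this was the last unmet dependency and succ is now ready.
        if (atomicSub(&succ->unmet_deps, 1) == 1)
            push(&queues[pick_queue(succ)], succ);
    }
}
```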

10 Queue Insertion Policy
Round robin:
- Better load balancing
- Poor cache locality
Tail submit [J. Hoogerbrugge et al.]:
- The first child task is always processed by the same SM as its parent
- Increased locality
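The two policies above can be contrasted in a short sketch, assuming `NUM_QUEUES` per-SM queues and a `push()` helper (illustrative names, not the framework's API):

```cuda
#define NUM_QUEUES 14                       // e.g., one queue per SM

struct Task;
struct Queue;
__device__ void push(Queue* q, Task* t);

__device__ unsigned int rr_next = 0;        // shared round-robin cursor

// Round robin: children are spread evenly over all queues. Load balance
// is good, but a child rarely runs on the SM whose cache holds its data.
__device__ void insert_round_robin(Queue* queues, Task* t) {
    unsigned int q = atomicAdd(&rr_next, 1) % NUM_QUEUES;
    push(&queues[q], t);
}

// Tail submit: the first child stays on the parent's SM so it can reuse
// the parent's cached data; remaining children fall back to round robin.
__device__ void insert_tail_submit(Queue* queues, Task* t,
                                   int child_idx, int my_sm) {
    if (child_idx == 0) push(&queues[my_sm], t);
    else                insert_round_robin(queues, t);
}
```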

11 API
- Application-specific data is added under WorkerContext and Task
- user_task is called by worker_kernel
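A hedged sketch of how an application might plug into this API: only `WorkerContext`, `Task`, `user_task`, and `worker_kernel` are named on the slide; every field below is an assumption chosen to match the Heat 2D example.

```cuda
struct WorkerContext {
    float* grid;              // application data, e.g., the Heat 2D surface
    int    width, height;
};

struct Task {
    int tile_x, tile_y;       // application-specific task description
};

// Invoked by worker_kernel for each retrieved task; all threads of the
// persistent TB run it cooperatively, giving intra-task parallelism.
__device__ void user_task(WorkerContext* ctx, Task* t) {
    int base = t->tile_y * ctx->width + t->tile_x;
    // ... update this tile, using threadIdx for intra-tile parallelism ...
    (void)base;
}
```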

12 Experimental Results
- NVIDIA Tesla C2050
  - 14 SMs, 3 GB memory
- Applications:
  - Heat 2D: simulation of heat dissipation over a 2D surface
  - BFS: breadth-first search
- Comparison: central queue vs. distributed queues

13 Applications: Heat 2D
- Regular dependencies, wavefront parallelism
- Each tile is a task; both intra-tile and inter-tile parallelism
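The wavefront structure can be made concrete with a host-side sketch (plain C++) of how Heat 2D might be decomposed into tile tasks: each tile depends on its west and north neighbors, so anti-diagonals of tiles become ready together. The struct and field names are illustrative, not the paper's code.

```cpp
struct Task {
    int tile_x, tile_y;
    int unmet_deps;   // 0, 1, or 2 unresolved neighbor dependencies
};

void build_heat2d_tasks(Task* tasks, int tiles_x, int tiles_y) {
    for (int y = 0; y < tiles_y; ++y) {
        for (int x = 0; x < tiles_x; ++x) {
            Task* t = &tasks[y * tiles_x + x];
            t->tile_x = x;
            t->tile_y = y;
            // Corner tile starts ready; edge tiles wait on one neighbor,
            // interior tiles on two.
            t->unmet_deps = (x > 0 ? 1 : 0) + (y > 0 ? 1 : 0);
        }
    }
}
```

As dependencies resolve, the set of ready tiles sweeps diagonally across the surface, which is the wavefront parallelism named on the slide.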

14 Applications: BFS
- Irregular dependencies
- The unreached neighbors of a node form a task

15 Runtime

16 Scalability

17 Future Work
S/W support for:
- Better task representation
- More task insertion policies
- Automated task-graph partitioning for higher SM utilization

18 Future Work
H/W support for:
- Fast inter-TB synchronization
- TB-to-SM affinity
- "Sleep" support for TBs

19 Conclusion
- Transition from SIMD to MIMD
- Task-based execution model
- Per-SM task assignment
- In-GPU dependency tracking
- Locality-aware queue management
- Room for improvement with added HW and SW support