CS6963 L19: Dynamic Task Queues and More Synchronization

Administrative
Design review feedback
- Sent out yesterday – feel free to ask questions
Deadline extended to May 4: Symposium on Application Accelerators in High Performance Computing
Final reports on projects
- Poster session April 29, with dry run April 27
- Also, submit written document and software by May 6
- Invite your friends! I’ll invite faculty, NVIDIA, graduate students, application owners, ...
Industrial Advisory Board meeting on April 29
- There is a poster session immediately following class
- I was asked to select a few projects (4-5) to be presented
- The School will spring for printing costs for your poster!
- Posters need to be submitted for printing immediately following Monday’s class

Final Project Presentation
Dry run on April 27
- Easels, tape and poster board provided
- Tape a set of PowerPoint slides to a standard 2’x3’ poster, or bring your own poster
Final report on projects due May 6
- Submit code
- And a written document, roughly 10 pages, based on the earlier submission
- In addition to the original proposal, include:
  - Project plan and how it was decomposed (from DR)
  - Description of the CUDA implementation
  - Performance measurement
  - Related work (from DR)

Let’s Talk about Demos
For some of you, with very visual projects, I asked you to think about demos for the poster session.
This is not a requirement, just something that would enhance the poster session.
Realistic?
- I know everyone’s laptops are slow …
- … and don’t have enough memory to solve very large problems
Creative suggestions?

Sources for Today’s Lecture
- “On Dynamic Load Balancing on Graphics Processors,” D. Cederman and P. Tsigas, Graphics Hardware (2008). Balancing-GH08.pdf
- “A Simple, Fast and Scalable Non-Blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems,” P. Tsigas and Y. Zhang, SPAA (more on the lock-free queue)
- Thread Building Blocks (more on task stealing)

Last Time: Simple Lock Using Atomic Updates
Can you use atomic updates to create a lock variable?
Consider these primitives:

  int lockVar;
  atomicAdd(&lockVar, 1);
  atomicAdd(&lockVar, -1);

Suggested Implementation

  // also unsigned int and long long versions
  int atomicCAS(int* address, int compare, int val);

Reads the 32-bit or 64-bit word old located at address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap). 64-bit words are only supported for global memory.

  __device__ void getLock(int *lockVarPtr) {
    while (atomicCAS(lockVarPtr, 0, 1) == 1);
  }
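To make the lock usable, a matching release is needed as well. Below is a minimal sketch of how acquire and release might be paired around a critical section; releaseLock, lockVar and the incrementCounter kernel are illustrative names added here, not from the slides. Only one thread per block takes the lock in this sketch, since threads of the same warp spinning on a lock held by a sibling thread can deadlock on older (pre-Volta) hardware.

  // Sketch only: lockVar is a device-global lock word, assumed to be
  // initialized to 0 (unlocked) before any kernel runs.
  __device__ int lockVar = 0;

  __device__ void getLock(int *lockVarPtr) {
    // Spin until we observe 0 and atomically replace it with 1.
    while (atomicCAS(lockVarPtr, 0, 1) == 1);
  }

  __device__ void releaseLock(int *lockVarPtr) {
    __threadfence();              // make critical-section writes visible
    atomicExch(lockVarPtr, 0);    // release the lock
  }

  __global__ void incrementCounter(int *counter) {
    // One thread per block enters the critical section.
    if (threadIdx.x == 0) {
      getLock(&lockVar);
      *counter += 1;              // critical section
      releaseLock(&lockVar);
    }
  }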

Constructing a Dynamic Task Queue on GPUs
Must use some sort of atomic operation for global synchronization to enqueue and dequeue tasks.
Numerous decisions about how to manage task queues:
- One on every SM?
- A global task queue?
- The former can be made far more efficient, but is also more prone to load imbalance.
Many choices of how to do synchronization:
- Optimize for the properties of the task queue (e.g., very large task queues can use simpler mechanisms)
All proposed approaches have a statically allocated task list that must be as large as the maximum number of waiting tasks.

Synchronization
Blocking
- Uses mutual exclusion to allow only one process at a time to access the object.
Lock-free
- Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
Wait-free
- Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
Slide source: Daniel Cederman

Load Balancing Methods
- Blocking Task Queue
- Non-blocking Task Queue
- Task Stealing
- Static Task List
Slide source: Daniel Cederman

Static Task List (Recommended)

  function DEQUEUE(q, id)
      return q.in[id]

  function ENQUEUE(q, task)
      localtail ← atomicAdd(&q.tail, 1)
      q.out[localtail] ← task

  function NEWTASKCNT(q)
      q.in, q.out, oldtail, q.tail ← q.out, q.in, q.tail, 0
      return oldtail

  procedure MAIN(taskinit)
      q.in, q.out ← newarray(maxsize), newarray(maxsize)
      q.tail ← 0
      ENQUEUE(q, taskinit)
      blockcnt ← NEWTASKCNT(q)
      while blockcnt != 0 do
          run blockcnt blocks in parallel
              t ← DEQUEUE(q, TBid)
              subtasks ← doWork(t)
              for each nt in subtasks do
                  ENQUEUE(q, nt)
          blockcnt ← NEWTASKCNT(q)

Two lists:
- q.in is read-only and not synchronized
- q.out is write-only and is updated atomically
When NEWTASKCNT is called at the end of a major task-scheduling phase, q.in and q.out are swapped.
Synchronization is required to insert tasks, but at least one always gets through (wait-free).
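A minimal CUDA sketch of this two-list scheme, assuming tasks are plain ints and that the host drives the phase loop; the TaskQueue struct, MAXSIZE and runPhase are illustrative names, not from the paper.

  #define MAXSIZE 4096

  // Sketch of the static task list, assuming int-valued tasks.
  // 'in' is read-only during a phase; 'out' is filled via atomicAdd.
  struct TaskQueue {
    int *in;    // tasks to execute in the current phase
    int *out;   // tasks produced for the next phase
    int tail;   // number of tasks enqueued into 'out' so far
  };

  __device__ void enqueue(TaskQueue *q, int task) {
    int localtail = atomicAdd(&q->tail, 1);  // claim a unique slot
    q->out[localtail] = task;
  }

  __global__ void runPhase(TaskQueue *q) {
    // One thread block per task; DEQUEUE is just an indexed read.
    int t = q->in[blockIdx.x];
    // ... do the work for task t; each generated subtask nt is
    // inserted for the next phase with: enqueue(q, nt);
  }

  // Host side (NEWTASKCNT): after each phase, read q->tail as blockcnt,
  // swap the in/out pointers, reset tail to 0, and relaunch
  // runPhase<<<blockcnt, threadsPerBlock>>>(q) while blockcnt != 0.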

Blocking Dynamic Task Queue

  function DEQUEUE(q)
      while atomicCAS(&q.lock, 0, 1) == 1 do ;
      if q.beg != q.end then
          q.beg++
          result ← q.data[q.beg]
      else
          result ← NIL
      q.lock ← 0
      return result

  function ENQUEUE(q, task)
      while atomicCAS(&q.lock, 0, 1) == 1 do ;
      q.end++
      q.data[q.end] ← task
      q.lock ← 0

Use a lock for both adding and deleting tasks from the queue. All other threads block waiting for the lock.
Potentially very inefficient, particularly for fine-grained tasks.
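A minimal CUDA sketch of this blocking queue, assuming int tasks, a pre-allocated data array that never overflows, and NIL == -1 meaning "queue empty"; the struct and function names are illustrative.

  #define NIL (-1)

  // Sketch of the blocking queue from the slide, assuming int tasks.
  struct BlockingQueue {
    int lock;   // 0 = free, 1 = held
    int beg;    // index of the most recently dequeued slot
    int end;    // index of the most recently enqueued slot
    int *data;  // task storage in global memory
  };

  __device__ int dequeue(BlockingQueue *q) {
    while (atomicCAS(&q->lock, 0, 1) == 1);  // spin until the lock is acquired
    int result = NIL;
    if (q->beg != q->end) {
      q->beg++;
      result = q->data[q->beg];
    }
    __threadfence();
    atomicExch(&q->lock, 0);                 // release the lock
    return result;
  }

  __device__ void enqueue(BlockingQueue *q, int task) {
    while (atomicCAS(&q->lock, 0, 1) == 1);
    q->end++;
    q->data[q->end] = task;
    __threadfence();
    atomicExch(&q->lock, 0);                 // release the lock
  }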

Lock-free Dynamic Task Queue

  function DEQUEUE(q)
      oldbeg ← q.beg
      lbeg ← oldbeg
      while (task ← q.data[lbeg]) == NIL do
          lbeg++
      if atomicCAS(&q.data[lbeg], task, NIL) != task then
          restart
      if lbeg mod x == 0 then
          atomicCAS(&q.beg, oldbeg, lbeg)
      return task

  function ENQUEUE(q, task)
      oldend ← q.end
      lend ← oldend
      while q.data[lend] != NIL do
          lend++
      if atomicCAS(&q.data[lend], NIL, task) != NIL then
          restart
      if lend mod x == 0 then
          atomicCAS(&q.end, oldend, lend)

Idea: at least one thread will succeed in adding or removing a task from the queue.
Optimization: only update the beginning and end pointers with atomicCAS every x elements.
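A CUDA sketch of the same lock-free scheme, assuming int tasks, slots initialized to NIL == -1, a lazy-update interval X, and (as in the pseudocode) that the scan never runs past the end of the data array; all names here are illustrative.

  #define NIL (-1)
  #define X   32   // lazy-update interval for the head/tail indices

  struct LockFreeQueue {
    int beg;     // approximate index of the first occupied slot
    int end;     // approximate index of the first free slot
    int *data;   // task slots, initialized to NIL
  };

  __device__ int dequeue(LockFreeQueue *q) {
    while (true) {
      int oldbeg = q->beg;
      int lbeg = oldbeg;
      int task;
      while ((task = q->data[lbeg]) == NIL)   // scan for a non-empty slot
        lbeg++;
      // Try to claim the task; if another thread got it first, restart.
      if (atomicCAS(&q->data[lbeg], task, NIL) == task) {
        if (lbeg % X == 0)                    // lazily advance the head
          atomicCAS(&q->beg, oldbeg, lbeg);
        return task;
      }
    }
  }

  __device__ void enqueue(LockFreeQueue *q, int task) {
    while (true) {
      int oldend = q->end;
      int lend = oldend;
      while (q->data[lend] != NIL)            // scan for a free slot
        lend++;
      // Try to claim the slot; if another thread filled it, restart.
      if (atomicCAS(&q->data[lend], NIL, task) == NIL) {
        if (lend % X == 0)                    // lazily advance the tail
          atomicCAS(&q->end, oldend, lend);
        return;
      }
    }
  }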

Task Stealing
No code is provided in the paper.
Idea:
- A set of independent task queues.
- When a task queue becomes empty, its owner goes out to the other task queues to find available work.
- Lots and lots of engineering is needed to get this right.
- The best work on this is in Intel Thread Building Blocks.
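Since the paper gives no code, here is only a highly simplified illustration of the idea (not the work-stealing deques used in practice, e.g. in Thread Building Blocks), reusing the BlockingQueue sketch above: one queue per thread block, and a block that finds its own queue empty scans the others for work. NUMQUEUES, getTask and the scan order are arbitrary choices for this sketch.

  #define NUMQUEUES 64   // assumed: one queue per resident thread block

  // Highly simplified task stealing, reusing the BlockingQueue sketch.
  __device__ int getTask(BlockingQueue *queues, int myQueue) {
    // First try the local queue.
    int task = dequeue(&queues[myQueue]);
    if (task != NIL) return task;

    // Local queue is empty: try to steal from the other queues.
    for (int i = 1; i < NUMQUEUES; i++) {
      int victim = (myQueue + i) % NUMQUEUES;
      task = dequeue(&queues[victim]);
      if (task != NIL) return task;
    }
    return NIL;   // no work found anywhere
  }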

General Issues
One or multiple task queues?
Where does the task queue reside?
- If possible, in shared memory
- The actual tasks can be stored elsewhere, perhaps in global memory
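One possible layout for that last point, sketched under the assumption that the queue only holds indices into a global array of task descriptors; the Task struct, QCAP and worker are illustrative names.

  // Illustrative layout: the queue bookkeeping (indices) lives in fast
  // shared memory, while the task descriptors stay in global memory.
  struct Task { int start, length; };   // assumed task descriptor

  #define QCAP 256

  __global__ void worker(Task *tasks /* in global memory */) {
    __shared__ int queue[QCAP];   // per-block queue of task indices
    __shared__ int head, tail;

    if (threadIdx.x == 0) { head = 0; tail = 0; }
    __syncthreads();

    // Enqueue/dequeue manipulate only the small index queue in shared
    // memory; a dequeued index is then used to fetch the real task:
    //   Task t = tasks[queue[myIndex]];
  }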