
Overview
• Work-stealing scheduler: O(p·S₁) worst-case space, small overhead
• Narlikar scheduler¹: O(S₁ + p·K·T∞) worst-case space, large overhead
• Hybrid scheduler
  Idea: combine space-saving ideas from Narlikar with the work-stealing scheduler

1. Girija J. Narlikar and Guy E. Blelloch. Space-Efficient Scheduling of Nested Parallelism. ACM Transactions on Programming Languages and Systems (TOPLAS), 21(1), January 1999.

What We Did
• Implemented the Narlikar scheduler for Cilk
  - Replaced the WS scheduling code
  - Modified cilk2c
• Designed the WS-Narlikar hybrid scheduler
• Implemented the hybrid scheduler
  - Modified the WS scheduling code
  - Modified cilk2c
• Performed empirical tests for space and time comparisons

Results
Data from running the modified fib program on 16 processors.
• Hybrid retains some of the space-saving benefits of Narlikar with a much smaller overhead.

Outline
I. Example
II. Narlikar Algorithm
   a. Description
   b. Overheads/Bottlenecks
III. Hybrid Algorithm
   a. Motivation
   b. Description
IV. Empirical Results
V. Future Work
VI. Conclusions

Example

main() {
  for (i = 1 to n)
    spawn F(i, n);
}

F(int i, int n) {
  Temp B[n];
  for (j = 1 to n)
    spawn G(i, j, n);
}

Schedule 1
Schedule the outer parallelism first.
• Memory used (heap): Θ(n²)
• Similar to the work-stealing scheduler (Θ(p·n) space)
[Figure: DAG; green nodes are executed before white nodes]

Schedule 2
Schedule the inner parallelism first.
• Memory used (heap): Θ(n)
• Similar to the Narlikar scheduler (Θ(n + p·K·T∞) = Θ(n) space)
[Figure: DAG; green nodes are executed before white nodes]

Narlikar Algorithm - Idea
• Perform a p-leftmost execution of the DAG
[Figure: p-depth-first execution for p = 2]

Narlikar Data Structures
[Figure: processors feeding Q_in, scheduler queue R in increasing thread order, and Q_out feeding processors]
• Q_in and Q_out are FIFO queues that support parallel accesses
• R is a priority queue that maintains the depth-first order of all threads in the system

Narlikar – Thread Life Cycle
• A processor executes a thread until:
  - a spawn
  - a memory allocation
  - a return
• The processor puts the thread in Q_in and gets a new thread from Q_out
• The scheduler thread moves threads from Q_in to R, performs spawns, and moves the leftmost p threads to Q_out

Narlikar – Memory Allocation
• "Voodoo" parameter K
• If a thread wants to allocate more than K bytes, preempt it
• To allocate M bytes, where M > K, put the thread to sleep for M/K scheduling rounds
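The throttling rule above can be sketched as a small helper (a minimal sketch; the function name and the value of K are illustrative, not taken from the Cilk implementation):

```c
#include <stddef.h>

#define K 4096  /* "voodoo" threshold in bytes; value is illustrative */

/* Returns how many scheduling rounds a thread must sleep before an
   allocation of `size` bytes may proceed; 0 means allocate now. */
static inline size_t narlikar_sleep_rounds(size_t size) {
    if (size <= K)
        return 0;        /* allocations of at most K bytes proceed */
    return size / K;     /* sleep M/K rounds for M > K bytes       */
}
```

Delaying large allocators this way is what keeps memory-hungry threads from running ahead of the p-leftmost schedule.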

Problems with Narlikar
• Large scheduling overhead (can be more than 400 times slower than the WS scheduler)
  - Bad locality: must preempt on every spawn
  - Contention on global data structures
  - Bookkeeping performed by the scheduling thread
  - Wasted processor time (bad scalability)
• As of yet, we haven't performed empirical tests to determine a breakdown of the overhead

Hybrid Scheduler Idea
• Keeping track of the left-to-right ordering is expensive
• What about just delaying the threads that wish to perform large memory allocations?
• Can we achieve some space efficiency with a greedy scheduler biased toward non-memory-intensive threads?

Hybrid Algorithm
• Start with the randomized work-stealing scheduler
• Preempt threads that perform large memory allocations and put them to sleep
• Reactivate sleeping threads when work-stealing

Hybrid Algorithm – Walkthrough
[Figure: a processor with its deque (per processor), a sleep queue ordered by wake_time (global), and a global current_time counter]
• Get a thread from the bottom of the deque
• Before malloc(size), compute sleep_rounds = f(size + current_allocation)
• If sleep_rounds > 0, set wake_time = sleep_rounds + current_time and insert the thread into the sleep queue
• If there are no threads on the deque, increment current_time; then:
  - if the first thread in the sleep queue is ready, take it from the sleep queue, reset its wake_time and current_allocation, and execute it
  - otherwise, work-steal
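The idle path of the walkthrough can be sketched as follows (a hedged sketch: the type and function names are hypothetical, the sleep queue is represented only by its earliest-waking entry, and the per-thread allocation counter is omitted for brevity):

```c
#include <stddef.h>

/* Hypothetical per-thread record; wake_time == -1 means runnable. */
typedef struct { long wake_time; } sleeper_t;

long current_time = 0;  /* global round counter */

/* Called when a worker's deque is empty. Returns a woken sleeper to
   execute, or NULL to signal that the worker should work-steal.
   `head` is the earliest-waking entry of the sleep queue (may be NULL). */
sleeper_t *on_empty_deque(sleeper_t *head) {
    current_time++;                       /* each idle step advances the clock */
    if (head != NULL && head->wake_time <= current_time) {
        head->wake_time = -1;             /* reset before resuming  */
        return head;                      /* run the ready sleeper  */
    }
    return NULL;                          /* otherwise, work-steal  */
}
```

Checking the sleep queue before stealing is what biases the scheduler toward waking delayed threads once their time has been served, rather than starving them.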

How Long to Sleep?
• Want the sleep time to be proportional to the size of the memory allocation
• Increment time on every work-steal attempt
• Scale with the number of processors
• Place for future improvement?
  Current function: sleep_rounds = floor(size / (α + β·p)), where α and β are "voodoo" parameters
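With the current function, and illustrative (untuned) values for the two parameters, the computation looks like this; since the clock ticks on every work-steal attempt, dividing by a term that grows with p is one way to keep wall-clock sleep time roughly stable as processors are added:

```c
#include <stddef.h>

/* α and β are "voodoo" parameters; these values are illustrative only. */
#define ALPHA 1024
#define BETA   512

/* sleep_rounds = floor(size / (α + β·p)) for an allocation of `size`
   bytes on p processors: larger allocations sleep longer. */
static inline size_t hybrid_sleep_rounds(size_t size, size_t p) {
    return size / (ALPHA + BETA * p);
}
```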

Empirical Results

Future Work on Hybrid Scheduler
• Find the best sleep function and values for the "voodoo" parameters
• Optimize the implementation to reduce scheduling overhead
• Determine a theoretical space bound
• More detailed empirical analysis

Conclusions
• The Narlikar scheduler provides a provably good space bound but incurs a large scheduling overhead
• It appears possible to achieve space usage that scales well with the number of processors while retaining much of the efficiency of work-stealing