University of Massachusetts, Amherst, Department of Computer Science: The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)

Presentation transcript:

The Implementation of the Cilk-5 Multithreaded Language (Frigo, Leiserson, and Randall)
Alistair Dundas, Department of Computer Science, University of Massachusetts

Outline
- What is Cilk?
- Cilk example: the Fibonacci algorithm.
- The work-first principle.
- Work stealing.
- The T.H.E. protocol.
- Empirical results.
- Summary and questions.

What Is Cilk?
- An extension of C for parallel programming.
- Designed for SMP machines with support for shared memory.
- Benefits: a provably efficient work-stealing scheduler and a clean programming model.
- Benefits over normal thread programming: discussion topic!
- Specifically: a source-to-source compiler that generates C.

Example: Fibonacci Algorithm

    #include <stdio.h>
    #include <stdlib.h>

    /* Serial Fibonacci: defined before main so the call is declared. */
    int fib (int n)
    {
        if (n < 2) return n;
        else {
            int x, y;
            x = fib (n-1);
            y = fib (n-2);
            return (x+y);
        }
    }

    int main (int argc, char *argv[])
    {
        int n, result;
        n = atoi(argv[1]);
        result = fib(n);
        printf("Result:%d\n", result);
        return 0;
    }

Example: Fibonacci In Parallel

    #include <stdio.h>
    #include <stdlib.h>

    /* Each recursive call is spawned; sync waits for both children. */
    cilk int fib (int n)
    {
        if (n < 2) return n;
        else {
            int x, y;
            x = spawn fib (n-1);
            y = spawn fib (n-2);
            sync;
            return (x+y);
        }
    }

    cilk int main (int argc, char *argv[])
    {
        int n, result;
        n = atoi(argv[1]);
        result = spawn fib(n);
        sync;
        printf("Result:%d\n", result);
        return 0;
    }

Source to Source Compiler

The Work-First Principle
- Work is the amount of time needed to execute the computation serially.
- Critical path length is the execution time on an infinite number of processors.
- The work-first principle: minimize the scheduling overhead borne by the work, even at the expense of increasing the critical path.

Theory: The Work-First Principle
- Where $T_P$ is the time on $P$ processors: $T_P = T_1/P + O(T_\infty)$ (1)
- Making the critical-path overhead explicit: $T_P \le T_1/P + c_\infty T_\infty$ (2)
- Define average parallelism (maximum speedup): $P_{\mathrm{AVERAGE}} = T_1/T_\infty$
- Define parallel slackness: $P_{\mathrm{AVERAGE}}/P$

The Work-First Principle (cont.)
- Assumption of parallel slackness: $P_{\mathrm{AVERAGE}}/P \gg c_\infty$
- Combining this with inequality (2), we get: $T_P \approx T_1/P$
- Define work overhead: $c_1 = T_1/T_S$, so $T_P \approx c_1 T_S/P$
- Conclusion: minimize work overhead.
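
The step from inequality (2) to the approximation can be spelled out using only the definitions of average parallelism, parallel slackness, and work overhead given on these slides. The layout below is a sketch and assumes the `amsmath` package:

```latex
\begin{align*}
T_\infty &= \frac{T_1}{P_{\mathrm{AVERAGE}}}
  && \text{(definition of average parallelism)} \\
c_\infty T_\infty &= \frac{c_\infty}{P_{\mathrm{AVERAGE}}}\, T_1
  \ll \frac{T_1}{P}
  && \text{(since } P_{\mathrm{AVERAGE}}/P \gg c_\infty \text{)} \\
T_P &\le \frac{T_1}{P} + c_\infty T_\infty \approx \frac{T_1}{P}
  = \frac{c_1 T_S}{P}
  && \text{(using } c_1 = T_1/T_S \text{)}
\end{align*}
```

The critical-path term is negligible next to the work term, so the only constant left to reduce is the work overhead $c_1$.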

Work Stealing Algorithm
- Each worker keeps a ready deque (double-ended queue) of procedure instances waiting to run.
- Workers treat the deque as a stack, pushing and popping procedure calls at the end.
- When a worker has no more work, it steals from the front of another worker's deque.
- Parents are stolen before children.
- This is implemented using two versions of each procedure: a fast clone and a slow clone.

Fast Clone
- The fast clone runs when a procedure is spawned.
- It has little support for parallelism.
- Whenever a call is made, it saves the complete state and pushes it on to the end of the deque.
- When the call returns, it checks whether the procedure was stolen. If stolen, it returns immediately; if not, it carries on execution.
- Since children are never stolen, sync is a no-op.

Fast Clone Example

    cilk int fib (int n)
    {
        if (n < 2) return n;
        else {
            int x, y;
            x = spawn fib (n-1);
            y = spawn fib (n-2);
            sync;
            return (x+y);
        }
    }

Fast Clone Example

    int fib (int n)
    {
        fib_frame *f;               /* frame pointer */
        f = alloc(sizeof(*f));      /* allocate frame */
        f->sig = fib_sig;           /* initialize frame */
        if (n < 2) {
            free(f, sizeof(*f));    /* free frame */
            return n;
        }
        else { … }

Fast Clone Example

            int x, y;
            f->entry = 1;               /* save PC */
            f->n = n;                   /* save live vars */
            *T = f;                     /* store frame pointer */
            push();                     /* push frame */
            x = fib (n-1);              /* do C call */
            if (pop(x) == FAILURE)      /* pop frame */
                return 0;               /* procedure stolen */
            /* second spawn */
            ;                           /* sync is free! */
            free(f, sizeof(*f));        /* free frame */
            return (x+y);
        }
    }

Slow Clone
- The slow clone is used when a procedure is stolen.
- It is similar to the fast clone, but also supports concurrent execution.
- It restores the program counter and procedure state from the copy stored on the deque.
- Calling sync makes a call to the runtime system to check on the children's status.

The T.H.E. Protocol
- Deques are held in shared memory. Workers operate at the end, thieves at the front.
- We must prevent race conditions where a thief and a victim try to access the same procedure frame.
- Locking deques would be expensive for workers.
- The T.H.E. protocol removes the overhead in the common case, where there is no conflict.

The T.H.E. Protocol
- Assumes only that reads and writes are atomic.
- The head of the deque is H, the tail is T, and T ≥ H.
- Only a thief can change H. Only the worker can change T.
- To steal, a thief must get the lock L, so at most two processors operate on a deque at once.
- Three cases of interaction:
  - Two or more items on the deque: each gets one.
  - One item on the deque: either the worker or the thief gets the frame, but not both.
  - No items on the deque: both worker and thief fail.

One Item on Deque Case
- Both thief and worker assume they can get a procedure frame, and change H or T.
- If both thief and worker try to take the frame, one or both of them will discover H > T, depending on instruction order.
- If the thief discovers H > T, it backs off and restores H.
- If the worker discovers H > T, it restores T and then tries for the lock. Inside the lock, the procedure can be safely popped if it is still there.

Empirical Results
- On an eight-processor Sun SMP, achieved an average speedup of 6.2 over the elisions (the serial, non-threaded C versions).
- The assumptions of the work-first principle seem sound: the applications tested all showed high amounts of "average parallelism", and work overhead was small for most programs.
- The least speedup is where the overhead is greatest.

Summary
- Extension of C for parallel programming.
- Aims to simplify parallelization.
- The main idea is to minimize overhead for workers rather than focus on the critical path.
- Empirical results show an average speedup of 6.2 on an 8-processor machine.

My Questions
- A Cilk spawn is always just a C call. Who starts the threads, and how many are there?
- Why use Cilk rather than use threads directly?
- What about using Cilk on a Beowulf cluster?
- Are their test programs representative of SMP applications?

Other Extensions
- Inlets: a wrapper around spawned procedure returns.
- Abort: terminates work that is no longer needed (e.g. in parallel search).
- Locking facilities for access to shared data.

T.H.E. Protocol: The Worker/Victim

    push() {
        T++;
    }

    pop() {
        T--;
        if (H > T) {
            T++;
            lock(L);
            T--;
            if (H > T) {
                T++;
                unlock(L);
                return FAILURE;
            }
            unlock(L);
        }
        return SUCCESS;
    }

    steal() {
        lock(L);
        H++;
        if (H > T) {
            H--;
            unlock(L);
            return FAILURE;
        }
        unlock(L);
        return SUCCESS;
    }

Fibonacci Illustration