
1 Parallel Programming in Distributed Systems, or Distributed Systems in Parallel Programming
Philippas Tsigas, Chalmers University of Technology, Computer Science and Engineering Department
© Philippas Tsigas

2 WHY PARALLEL PROGRAMMING IS ESSENTIAL IN DISTRIBUTED SYSTEMS AND NETWORKING © Philippas Tsigas

3 How did we get here?
Picture from Pat Gelsinger, Intel Developer Forum, Spring 2004 (Pentium at 90W)

4 Concurrent Software Becomes Essential
[Chart: instead of single cores scaling from 3 GHz to 6, 12, and 24 GHz, clock speeds stay at 3 GHz while core counts grow: 1, 2, 4, 8 cores]
Our work is to help the programmer develop efficient parallel programs and also survive the multicore transition.
1) Scalability becomes an issue for all software.
2) Modern software development relies on the ability to compose libraries into larger programs.
© Philippas Tsigas

5 DISTRIBUTED APPLICATIONS © Philippas Tsigas

6 Distributed Applications Demand a High Degree of Data Sharing:
• Commercial computing (media and information processing)
• Control computing (on-board flight-control systems)
© Philippas Tsigas

7 Data Sharing: Gameplay Simulation as an Example
This is the hardest problem…
• 10,000s of objects
• Each one contains mutable state
• Each one updated 30 times per second
• Each update touches 5-10 other objects
Manual synchronization (shared-state concurrency) is hopelessly intractable here. Solutions?
Slide: Tim Sweeney, CEO, Epic Games, POPL 2006
© Philippas Tsigas

8 NETWORKING © Philippas Tsigas

9 40 Multithreaded Packet-Processing Engines
http://www.cisco.com/assets/cdc_content_elements/embedded-video/routers/popup.html
• On chip, there are 40 32-bit, 1.2-GHz packet-processing engines. Each engine works on a packet from birth to death within the Aggregation Services Router.
• Each multithreaded engine handles four threads (each thread handles one packet at a time), so each QuantumFlow Processor chip can work on 160 packets concurrently.
© Philippas Tsigas

10 DATA SHARING © Philippas Tsigas

11 Data Sharing: Gameplay Simulation as an Example
This is the hardest problem…
• 10,000s of objects
• Each one contains mutable state
• Each one updated 30 times per second
• Each update touches 5-10 other objects
Manual synchronization (shared-state concurrency) is hopelessly intractable here. Solutions?
Slide: Tim Sweeney, CEO, Epic Games, POPL 2006
© Philippas Tsigas

12 Blocking Data Sharing
A typical Counter implementation:

    class Counter {
        int next = 0;
        synchronized int getNumber() {
            int t;
            t = next;
            next = t + 1;
            return t;
        }
    }

Execution (next = 0 initially): Thread 1 calls getNumber(), acquires the lock, reads t = 0, sets next = 1, releases the lock, and returns result = 0. Thread 2's getNumber() acquires the lock only afterwards, so it reads t = 1, sets next = 2, and returns result = 1.

13 Do We Need Synchronization?

    class Counter {
        int next = 0;
        int getNumber() {
            int t;
            t = next;
            next = t + 1;
            return t;
        }
    }

What can go wrong here? With next = 0, Thread 1 and Thread 2 can both read t = 0 before either writes back, so both return result = 0 and next ends up at 1: one increment is lost.
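To see this race concretely, here is a small, hypothetical Java harness (not from the slides): two threads each draw 100,000 numbers from the unsynchronized counter, and the final value of next usually ends up below 200,000 because interleaved read-then-write updates get lost.

    // Hypothetical demo (not from the slides): losing updates with
    // the unsynchronized counter.
    class RaceDemo {
        static int next = 0;                    // shared, no lock

        static int getNumber() {
            int t = next;                       // read
            next = t + 1;                       // write back; another thread
            return t;                           // may have read the same next
        }

        public static void main(String[] args) throws InterruptedException {
            Runnable worker = () -> {
                for (int i = 0; i < 100_000; i++) getNumber();
            };
            Thread a = new Thread(worker), b = new Thread(worker);
            a.start(); b.start();
            a.join(); b.join();
            // With mutual exclusion this would print 200000; without it,
            // it usually prints less: increments were lost.
            System.out.println(next);
        }
    }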

14 Blocking Synchronization = Sequential Behavior © Philippas Tsigas

15 Blocking Synchronization → Priority Inversion
• A high-priority task is delayed because a low-priority task holds a shared resource; the low-priority task is in turn delayed by a medium-priority task that is executing.
• Solutions: priority-inheritance protocols
• These work OK for single processors, but for multiple processors …
[Timeline: Task H blocked on a resource held by Task L, while Task M preempts Task L]
© Philippas Tsigas

16 Critical Sections + Multiprocessors
• Reduced parallelism: several tasks with overlapping critical sections cause waiting processors to go idle.
[Timeline: Tasks 1-4 serialized on overlapping critical sections]
© Philippas Tsigas

17 The BIGGEST Problem with Locks? Blocking Locks Are Not Composable
All code that accesses a piece of shared state must know and obey the locking convention, regardless of who wrote the code or where it resides.
© Philippas Tsigas

18 Interprocess Synchronization = Data Sharing
• Synchronization is required for concurrency
• Mutual exclusion (semaphores, mutexes, spin-locks, disabling interrupts) protects critical sections, but:
- Locks limit concurrency
- Busy waiting – repeated checks to see whether the lock has been released
- Convoying – processes stack up before locks
- Blocking locks are not composable
- All code that accesses a piece of shared state must know and obey the locking convention, regardless of who wrote the code or where it resides.
• A better approach is … not to lock

19 A Lock-free Implementation
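The slide's code is not preserved in the transcript; below is a minimal sketch (class and method names ours) of what a lock-free counter looks like when built on CAS, here via Java's java.util.concurrent.atomic.AtomicInteger. Instead of excluding other threads, each thread retries until its CAS succeeds; a failed CAS means some other thread's operation succeeded, so the counter as a whole always makes progress.

    import java.util.concurrent.atomic.AtomicInteger;

    // Sketch of a lock-free counter: no locks, just a CAS retry loop.
    class LockFreeCounter {
        private final AtomicInteger next = new AtomicInteger(0);

        int getNumber() {
            while (true) {
                int t = next.get();                // read current value
                if (next.compareAndSet(t, t + 1))  // atomically: if next == t,
                    return t;                      // set next = t + 1
                // CAS failed: another thread updated next first; retry
            }
        }
    }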

20 How did it start?
• "Synchronization is an enforcing mechanism used to impose constraints on the order of execution of threads. … Synchronization is used to coordinate thread execution and manage shared data."
Does it have to be like that? When we share data, do we have to impose constraints on the execution of threads?

21 HOW SAFE IS IT? LET US START FROM THE BEGINNING © Philippas Tsigas

22 Shared Abstract Data Types
• Object in memory
- Supports some set of operations (ADT)
- Concurrent access by many processes/threads
- Useful to, e.g.:
  • Exchange data between threads
  • Coordinate thread activities
[Diagram: processes P1-P4 concurrently invoking operations Op A and Op B on a shared object]

23 Executing Operations (borrowed from H. Attiya)
[Diagram: processes P1, P2, P3; each operation spans the interval from its invocation to its response]

24 Interleaving Operations
[Diagram: overlapping operation intervals: a concurrent execution]

25 Interleaving Operations
[Diagram: only the invocation and response events: the (external) behavior]

26 Interleaving Operations, or Not
[Diagram: non-overlapping operations: a sequential execution]

27 Interleaving Operations, or Not
Sequential behavior: invocations & responses alternate and match (on process & object).
Sequential specification: all the legal sequential behaviors satisfying the semantics of the ADT.
- E.g., for a (LIFO) stack: pop returns the last item pushed
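Stated in code terms (an illustrative sketch, not from the slides), a sequential specification is the ADT's signature plus the rules every legal sequential behavior must obey:

    // Illustrative: a stack ADT's signature, with its sequential
    // (LIFO) specification expressed as comments.
    interface Stack<T> {
        void push(T item);  // puts item on top of the stack
        T pop();            // removes and returns the most recently
                            // pushed item not yet popped
    }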

28 Correctness: Sequential Consistency [Lamport, 1979]
• For every concurrent execution there is a sequential execution that
- contains the same operations,
- is legal (obeys the sequential specification), and
- preserves the order of operations by the same process.

29 Sequential Consistency: Examples
Concurrent (LIFO) stack: [Timeline: concurrent push(4), push(7), and pop():4 across two processes]
Sequentially consistent: the operations can be ordered as push(4); pop():4; push(7), which is legal for Last In, First Out.

30 Sequential Consistency: Examples
Concurrent (LIFO) stack: [Timeline: concurrent push(4), push(7), and pop():7 across two processes]
Also sequentially consistent: ordering the operations as push(4); push(7); pop():7 is legal for Last In, First Out, even if this reorders operations of different processes.

31 Safety: Linearizability
• Linearizable ADTs
- Sequential specification defines legal sequential executions
- Concurrent operations are allowed to interleave
- Operations appear to execute atomically
• The external observer gets the illusion that each operation takes effect instantaneously at some point between its invocation and its response (this preserves the real-time order of all operations)
[Timeline: concurrent (LIFO) stack, threads T1 and T2: push(4), push(7), pop():4, each operation taking effect at one instant between its invocation and response]

32 Safety II
An accessible node is never freed.

33 Liveness
Non-blocking implementations:
- Wait-free implementation of an ADT [Lamport, 1977]
  • Every operation finishes in a finite number of its own steps.
- Lock-free (≠ FREE of LOCKS) implementation [Lamport, 1977]
  • At least one operation (from a set of concurrent operations) finishes in a finite number of steps: the data structure as a system always makes progress. (A lock-free stack sketch follows after the next slide.)

34 Liveness II
• Every garbage node is eventually collected.
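As a concrete illustration of these guarantees (a sketch, not from the slides), here is the classic lock-free stack attributed to Treiber, in Java. Each successful CAS on top is the operation's linearization point; a failed CAS means some concurrent operation succeeded, which is exactly the lock-free progress property. Java's garbage collector handles reclamation, matching both "an accessible node is never freed" and "every garbage node is eventually collected"; in C/C++ this takes extra machinery.

    import java.util.concurrent.atomic.AtomicReference;

    // Sketch of a lock-free (Treiber-style) stack.
    class TreiberStack<T> {
        private static class Node<T> {
            final T item;
            Node<T> next;
            Node(T item) { this.item = item; }
        }

        private final AtomicReference<Node<T>> top = new AtomicReference<>();

        void push(T item) {
            Node<T> n = new Node<>(item);
            while (true) {
                Node<T> t = top.get();
                n.next = t;                        // link above current top
                if (top.compareAndSet(t, n))       // linearization point
                    return;                        // push succeeded
                // CAS failed: another push/pop got in first; retry
            }
        }

        T pop() {
            while (true) {
                Node<T> t = top.get();
                if (t == null) return null;        // empty stack
                if (top.compareAndSet(t, t.next))  // linearization point
                    return t.item;
                // CAS failed: retry against the new top
            }
        }
    }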

35 Abstract Data Types (ADT)
• Cover most concurrent applications
• At least encapsulate their data needs
• An object-oriented programming point of view
• Abstract representation of data & a set of methods (operations) for accessing it
• Signature
• Specification

36 Implementing a High-Level ADT
[Diagram: a high-level ADT's data and operations implemented using lower-level ADTs & procedures]

37 Lower-Level Operations
• High-level operations translate into primitives on base objects that are available in hardware
• Obvious: read, write
• Common: compare&swap (CAS), LL/SC, FAA (fetch&add); see the sketch below
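For instance (illustrative, not from the slides), when fetch-and-add is available as a primitive, the CAS retry loop from the earlier lock-free counter sketch collapses into a single atomic operation; Java exposes FAA as AtomicInteger.getAndAdd:

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative: the lock-free counter expressed directly with the
    // fetch-and-add (FAA) primitive -- no explicit retry loop needed.
    class FaaCounter {
        private final AtomicInteger next = new AtomicInteger(0);
        int getNumber() { return next.getAndAdd(1); }  // atomic FAA
    }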

38 CAN I FIND A JOB IF I STUDY THIS? © Philippas Tsigas

39
• 8 Feb 2002: Release of NOBLE version 1.0
• 23 Jan 2002: Expert group formation (JSR: Java Concurrency Utilities)
• 8 Jan 2004: JSR first release
• 29 Aug 2006: Intel's TBB release 1.0

40 ERLANG
OTP_R15A: R15 pre-release
Written by Kenneth, 23 Nov 2011
We have recently pushed a new master to GitHub tagged OTP_R15A. This is a stabilized snapshot of the current R15 development (to be released as R15B on December 14th) which, among other things, includes:
• OTP-9468 'Line numbers in exceptions'
• OTP-9451 'Parallel make'
• OTP-4779 A new GUI for Observer, integrating pman, etop and tv into observer with tracing facilities.
• OTP-7775 A number of memory allocation optimizations have been implemented. Most optimizations reduce contention caused by synchronization between threads during allocation and deallocation of memory. Most notably:
- Synchronization of memory management in scheduler-specific allocator instances has been rewritten to use lock-free synchronization.
- Synchronization of memory management in scheduler-specific pre-allocators has been rewritten to use lock-free synchronization.
- The 'mseg_alloc' memory segment allocator now uses scheduler-specific instances instead of one instance. Apart from reducing contention, this also ensures that memory allocators always create memory segments on the local NUMA node on a NUMA system.
• OTP-9632 An ERTS-internal, generic, many-to-one, lock-free queue for communication between threads has been introduced. The many-to-one scenario is very common in ERTS, so it can be used in a lot of places in the future. Currently it is used by scheduling of certain jobs and the async thread pool, but more uses are planned.
- Drivers using the driver_async functionality are no longer automatically locked to the system and can be unloaded like any dynamically linked-in driver.
- Scheduling of ready async jobs is now also interleaved in between other jobs. Previously all ready async jobs were performed at once.
• OTP-9631 The ERTS-internal system block functionality has been replaced by new functionality for blocking the system. The old system block functionality had contention and complexity issues. The new functionality piggy-backs on the thread-progress-tracking functionality needed by the newly introduced lock-free synchronization in the runtime system. When the functionality for blocking the system isn't used, there is more or less no overhead at all, since the functionality for tracking thread progress is there and needed anyway.
… and much much more.
This is not a full release of R15 but rather a pre-release. Feel free to try our R15A release and get back to us with your findings. Your feedback is important to us and highly welcomed.
Regards,
The OTP Team
© Philippas Tsigas

41 © Philippas Tsigas

42 © Philippas Tsigas

43 Locks Are Not Supported
• Not in CUDA, not in OpenCL
• Fairness of the hardware scheduler is unknown
• A thread block holding a lock might be swapped out indefinitely, for example

44 No Fairness Guarantees

    …
    // spin until the CAS flips lock from 0 (free) to 1 (taken)
    while (atomicCAS(&lock, 0, 1));
    ctr++;        // critical section
    lock = 0;     // release the lock
    …

Thread holding the lock is never scheduled!

45 Where do we stand?

