CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support.


1 CMPT 431 Dr. Alexandra Fedorova Lecture IV: OS Support

2 Outline
–Continue discussing OS support for threads and processes
–Alternative distributed systems architectures inspired by the limitations of threads
–Support for IPC
–Scalable synchronization

3 Process/Thread Support: Good Enough? Many computer scientists have observed the limited scalability of MT and MP architectures. [Graph: performance of a threaded web server degrades as the number of threads grows. M. Welsh, SOSP ‘01]

4 Alternative Web Services Architectures Alternative architectures for web services that rely less heavily on threads/processes:
–Single-Process Event-Driven (SPED)
–Asymmetric Multiprocess Event-Driven (AMPED)
–Staged Event-Driven Architecture (SEDA)

5 Web Services Architecture. Case Study: A Web Server A sequence of actions takes place at the web server, and each step can block:
–Socket read/accept can block on network I/O
–File find/read can block on disk I/O
–Send can block on the TCP buffer queue
How do servers overlap blocking and computation? V. Pai, USENIX ‘99

6 Multiprocess (MP) or Multithreaded (MT) Architecture: A Review
MP: one process performs all steps for a request. I/O and computation overlap naturally; the OS switches to a new process when a process blocks.
MT: one thread performs all steps for a request. I/O and computation can overlap using kernel threads (provided by modern OSs). V. Pai, USENIX ‘99

7 Single-Process Event-Driven (SPED) Architecture A single process executes the processing steps for all requests. It uses non-blocking network and disk I/O system calls, and uses the select system call to check on the status of those operations.
Problem #1: many OSs do not provide non-blocking system calls for disk I/O.
Problem #2: those that do, do not integrate them with select – one cannot check for completion of network and disk I/O simultaneously. V. Pai, USENIX ‘99

8 Asymmetric Multiprocess Event-Driven (AMPED) Architecture AMPED = MP + SPED.
–Use the SPED architecture for I/O operations with a non-blocking interface: socket read/write, accept.
–Use the MP architecture for I/O operations without a non-blocking interface (file read/write): mmap the file, use mincore to check if the file is in memory, and if not, spawn a helper process to bring the file into memory. Communicate with the helper process via IPC.
Flash – a web server implemented using AMPED (V. Pai, et al., USENIX ‘99) – matches or exceeds the performance of existing web servers by up to 50%.

9 Staged Event-Driven Architecture (SEDA) Observation: AMPED is good, but it is not easy to control application resources – e.g., which event should be processed first?
SEDA: create a stage for each logical step of processing, and manage each stage separately.
–There is a queue of events for each stage, so you can tell how loaded each stage is.
–Each stage can be processed by several (a small number of) threads.
–Adaptive load shedding – manage queues to control load. E.g., if the stage that involves disk I/O is the bottleneck, drop the queued-up requests or reject new requests.
–Dynamic control – adjust the number of threads per stage based on demand. M. Welsh, SOSP ‘01

10 Outline
–Continue discussing OS support for threads and processes
–Alternative distributed systems architectures inspired by the limitations of threads
–Support for IPC
–Support for scalable synchronization
–Distributed operating systems

11 OS Support for Inter-Process Communication (IPC) Cooperating processes or threads need to communicate. Threads share an address space, so they communicate via shared memory. What about processes? They do not share an address space; they communicate via:
–Unix pipes
–Memory-mapped files
–Inter-process shared memory

12 Unix Pipes A pipe is a communication channel between two processes. Using a pipe in a shell: prompt% cat log_file | grep “May 16” (cat writes into the pipe, grep reads from it). Pipes can also be created using the pipe() system call.

13 Implementation of Pipes In Solaris: a data structure containing two vnodes, a lock, and a buffer. To the user, each end of the pipe is represented by a file descriptor, and the user reads/writes the pipe by reading/writing that file descriptor. The OS blocks a process reading from an empty pipe, and blocks a process writing into a full pipe (when the buffer is full).

14 Memory-mapped Files [Figure: a file mapped into the address spaces of both process A and process B; each process sees the mapped file as part of its own address space.]

15 Inter-process Shared Memory Inter-process shared memory: a piece of physical memory set up to be shared among processes. Allocate inter-process shared memory using shmget; get permission to use it (attach to it) via shmat. Disadvantage: shared memory is not cleaned up automatically when processes exit; it must be cleaned up explicitly.

16 Performance of IPC IPC involves inter-process context switching – the expensive kind of context switch, because it involves switching address spaces. The cost of a context switch determines the cost of IPC, and it largely depends on the hardware.

17 Outline
–Continue discussing OS support for threads and processes
–Alternative distributed systems architectures inspired by the limitations of threads
–Support for IPC
–Support for scalable synchronization

18 Synchronization
Thread 1: perform a withdrawal
if (account_balance >= amount) { account_balance -= amount; }
Thread 2: subtract the service fee
if (account_balance >= service_fee) { account_balance -= service_fee; }
Unsynchronized access: the account balance can change between one thread’s check and its update!!!
Synchronized access:
lock_acquire(account_balance_lock); if (account_balance >= amount) { account_balance -= amount; } lock_release(account_balance_lock);
lock_acquire(account_balance_lock); if (account_balance >= service_fee) { account_balance -= service_fee; } lock_release(account_balance_lock);

19 Synchronization Primitives (SP) Synchronization primitives provide atomic access to a critical section. Types of synchronization primitives:
–mutex
–semaphore
–lock
–condition variable
–etc.
Synchronization primitives are provided by the OS; they can also be implemented by a library (e.g., pthreads) or by the application. Hardware provides special atomic instructions for implementing synchronization primitives (test-and-set, compare-and-swap, etc.).

20 Implementation of SP The performance of applications that use SPs is determined by the implementation of the SP. An SP must be scalable – it must continue to perform well as the number of contending threads increases. We will look at several implementations of locks to understand how to create a scalable implementation.

21 What should you do if you can’t get a lock? Keep trying:
–“spin” or “busy-wait”
–Good if delays are short
Give up the processor:
–Good if delays are long
–Always good on a uniprocessor
Systems usually use a combination: spin for a while, then give up the processor. We will focus on multiprocessors, so we’ll look at spinlock implementations. © Herlihy-Shavit 2007

22 A Shared Memory Multiprocessor [Figure: processors with private caches connected to shared memory by a bus.] © Herlihy-Shavit 2007

23 Basic Spinlock [Figure: threads spin on the lock; one at a time enters the critical section (CS) and resets the lock upon exit.] The lock suffers from contention, and the critical section is a sequential bottleneck: no parallelism. © Herlihy-Shavit 2007

24 Review: Test-and-Set We have a boolean value in memory. Test-and-set (TAS):
–Swap true with the prior value
–The return value tells whether the prior value was true or false
Can reset just by writing false. © Herlihy-Shavit 2007

25 TAS TAS is provided by the hardware. Example on SPARC: the ldstub (load-store unsigned byte) assembly instruction loads a byte from memory into a register and writes the value 0xFF into the addressed byte, atomically. TAS can also be implemented in a high-level language. Example in Java:
public class AtomicBoolean {
  boolean value;
  public synchronized boolean getAndSet(boolean newValue) {
    boolean prior = value;   // swap the old and new values
    value = newValue;
    return prior;
  }
} © Herlihy-Shavit 2007

26 TAS Locks The value of the TAS’ed memory shows the lock state:
–Lock is free: value is false
–Lock is taken: value is true
Acquire the lock by calling TAS: if the result is false, you win; if the result is true, you lose. Release the lock by writing false.

27 TAS Lock in SPARC Assembly
spin_lock:
busy_loop:
        ldstub [%o0], %o1   ! load the old value into %o1; write 0xFF into memory at the address in %o0
        tst %o1             ! test whether %o1 equals zero
        bne busy_loop       ! if %o1 is not zero (the old value was true), spin
        nop                 ! delay slot for branch
        retl
        nop                 ! delay slot for branch

28 TAS Lock in Java
class TASlock {
  AtomicBoolean state = new AtomicBoolean(false);  // initialize lock state to false (unlocked)
  void lock() {
    while (state.getAndSet(true)) {}  // while the lock is taken (true), spin
  }
  void unlock() {
    state.set(false);                 // release the lock: set state to false
  }
} © Herlihy-Shavit 2007

29 Performance of TAS Lock Experiment:
–N threads on a multiprocessor
–Increment a shared counter 1 million times (total)
–A thread acquires a lock before incrementing the counter
–Each thread does 1,000,000/N increments
–N does not exceed the number of processors, so there is no thread-switching overhead
How long should it take? How long does it take?

30 Expected Performance [Graph: ideal total time stays flat as the number of threads grows.] There is no speedup because there is no parallelism: the threads take turns (lock_acquire, increment, lock_release), the same as a sequential execution. © Herlihy-Shavit 2007

31 Actual Performance [Graph: the TAS lock’s total time grows with the number of threads – much worse than ideal.] © Herlihy-Shavit 2007

32 Reasons for Bad TAS Lock Performance It has to do with cache behaviour on a multiprocessor system: TAS causes a lot of invalidation misses, and this hurts performance. To understand what this means, let’s review how caches work.

33 Processor Issues Load Request [Figure: a processor issues a load request on the bus; memory responds and the data is placed in the processor’s cache.] © Herlihy-Shavit 2007

34 Another Processor Issues Load Request [Figure: a second processor asks for the same data on the bus; the cache that has it responds, and both caches now hold a copy.] © Herlihy-Shavit 2007

35 Processor Modifies Data [Figure: a processor modifies the data in its cache; the other cached copies are now invalid.] © Herlihy-Shavit 2007

36 Send Invalidation Message to Others [Figure: the writing processor sends an invalidation message on the bus; the other caches lose read permission.] There is no need to update memory right away: the modifying cache can provide valid data later. © Herlihy-Shavit 2007

37 Processor Asks for Data [Figure: another processor requests the data again; it is supplied by the cache holding the modified copy.] © Herlihy-Shavit 2007

38 Multiprocessor Caches: Summary Simultaneous reads and writes of shared data make cached copies invalid, and invalidation is bad for performance: on the next request, the data must be fetched from another cache, which slows things down.

39 What This Has to Do with TAS Locks Recall that the TAS lock had bad performance; invalidations were the cause. [Graph: TAS lock vs. ideal.] Here is why: all spinners do loads/stores in a loop, they all read/write the same location, and this causes lots of invalidations.

40 A Solution: Test-And-Test-And-Set Lock Wait until the lock “looks” free:
–Spin on the local cache
–No bus use while the lock is busy
class TTASlock {
  AtomicBoolean state = new AtomicBoolean(false);
  void lock() {
    while (true) {
      // wait until the lock looks free: we read the lock instead of
      // TASing it, so we avoid repeated invalidations
      while (state.get()) {}
      if (!state.getAndSet(true))  // now try to acquire it
        return;
    }
  }
} © Herlihy-Shavit 2007

41 TTAS Lock Performance [Graph: the TTAS lock is better than the TAS lock, but still far from ideal.] © Herlihy-Shavit 2007

42 The Problem with TTAS Lock When the lock is released:
–Everyone tries to acquire it
–Everyone does TAS
–There is a storm of invalidations
Only one processor can use the bus at a time, so all processors queue up, waiting for the bus, so they can perform the TAS.

43 A Solution: TTAS Lock with Backoff Intuition: if I fail to get the lock, there must be contention, so I should back off before trying again. Introduce a random “sleep” delay before trying to acquire the lock again. © Herlihy-Shavit 2007

44 TTAS Lock with Backoff: Performance [Graph: the backoff lock outperforms the TAS and TTAS locks, approaching ideal.] © Herlihy-Shavit 2007

45 Backoff Locks Better performance than TAS and TTAS. Caveats:
–Performance is sensitive to the choice of the delay parameter
–The delay parameter depends on the number of processors and their speed
–Easy to tune for one platform
–Difficult to write an implementation that will work well across multiple platforms
© Herlihy-Shavit 2007

46 An Idea Avoid useless invalidations by keeping a queue of threads: each thread notifies the next in line, without bothering the others. © Herlihy-Shavit 2007

47 Anderson Queue Lock [Figure: an array of flags (T F F F F F F F) – idle locations on which threads spin, one per thread; next points to the next unused “spin” location.] An acquiring thread does getAndIncrement: atomically get the value of next and increment the next pointer. If the slot it obtained was TRUE, the lock is acquired. © Herlihy-Shavit 2007

48 Acquiring a Held Lock [Figure: a second thread does getAndIncrement and spins on its FALSE slot; when the holder releases, it sets the next slot to TRUE and the waiting thread acquires the lock.] © Herlihy-Shavit 2007

49 Anderson Lock: Performance [Graph: the Anderson lock is almost ideal, well below the TAS and TTAS locks.] We avoid all unnecessary invalidations, and the lock is portable – no tunable parameters. © Herlihy-Shavit 2007

50 Scalable Synchronization: Summary Making synchronization primitives scalable is tricky, and performance is tied to the hardware architecture. We looked at these spinlocks:
–TAS – poor performance due to invalidations
–TTAS – avoids constant invalidations, but there is a storm of invalidations on lock release
–TTAS with backoff – eliminates the storm of invalidations on release
–Anderson queue lock – completely eliminates all useless invalidations
One could think of other optimizations. For more information, see the references in the syllabus.

51 Transactional Memory Programming with locks is tough, yet everyone has to do synchronization – multithreaded programming is driven by the multicore revolution. Transactional memory: concurrent programming without locks.

52 Coarse vs. Fine Synchronization
void update_shared_counters(int *counters, int n_counters) {
  int i;
  coarse_lock_acquire(counters_lock);      /* coarse: one lock for all counters */
  for (i = 0; i < n_counters; i++) {
    fine_lock_acquire(counter_locks[i]);   /* fine: one lock per counter */
    counters[i]++;
    fine_lock_release(counter_locks[i]);
  }
  coarse_lock_release(counters_lock);
}
(The two locking styles are shown together for comparison.) Coarse locks are easy to program but perform poorly; fine locks perform well but are difficult to program.

53 Transactional Memory To the Rescue! Can we have the best of both worlds – good performance and ease of programming? The answer is: Transactional Memory (TM).

54 Transactional Memory (TM) Programming model:
–An extension to the language
–Runtime and/or hardware support
Lets you do synchronization without locks, with the performance of fine-grained locks and the ease of programming of coarse-grained locks.

55 Transactional Memory vs. Locks
void update_shared_counters(int *counters, int n_counters) {
  int i;
  ATOMIC_BEGIN();              /* the transactional section replaces the locks */
  for (i = 0; i < n_counters; i++) {
    counters[i]++;
  }
  ATOMIC_END();
}
The coarse and fine lock calls of the previous slide are no longer needed. The transactional section looks like a coarse-grained lock but acts like a fine-grained lock: performance degrades only if there is a conflict.

56 The Backend of TM [Figure: two transactions’ read/write sets (read A, write B, read B, write A, …) conflict on shared locations; the conflicting transaction aborts and restarts.]

57 State of TM Still evolving – more work is needed to make it usable and well-performing. But it is very real: Sun’s new Rock processor has TM support, and Intel is very active.

58 OS Support For Distributed Systems: Summary (I) Networking:
–Access to network devices
–Implementation of network protocols: TCP, UDP, IP
Processes and threads (because many DS components use MP/MT architectures). Must ensure:
–Good load balance
–Good response time
–Minimal context switches
–We looked at how the Solaris time-sharing scheduler does this

59 OS Support For Distributed Systems: Summary (II) Inter-process communication:
–Pipes
–Memory-mapped files
–Inter-process shared memory
Scalable synchronization

