Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm.


1 Tornado: Maximizing Locality and Concurrency in a Shared Memory Multiprocessor Operating System Ben Gamsa, Orran Krieger, Jonathan Appavoo, Michael Stumm

2 Locality What do they mean by locality? – locality of reference? – temporal locality? – spatial locality?

3 Temporal Locality Recently accessed data and instructions are likely to be accessed in the near future

4 Spatial Locality Data and instructions close to recently accessed data and instructions are likely to be accessed in the near future

5 Locality of Reference If we have good locality of reference, is that a good thing for multiprocessors?

6 Locality in Multiprocessors Good performance depends on data being local to a CPU – Each CPU uses data from its own cache: cache hit rate is high; each CPU has good locality of reference – Once data is brought into cache it stays there: cache contents are not invalidated by other CPUs; different CPUs have different locality of reference

7-12 Example: Shared Counter [diagram sequence: memory holds the counter and each CPU has its own cache; one CPU loads the counter (0) and increments it to 1; the other CPU then reads the value 1 (Read: OK); the next increment, to 2, triggers an Invalidate of the copy held in the other cache]
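
The shared counter in this example can be sketched as a single atomic word. This is a minimal illustration rather than code from Tornado; the name `SharedCounter` is hypothetical.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical name: one counter word shared by every CPU.
struct SharedCounter {
    std::atomic<std::uint64_t> value{0};

    // Every increment needs exclusive ownership of the cache line that
    // holds `value`, so the line moves to whichever CPU wrote last and
    // the copies in all other caches are invalidated.
    void increment() { value.fetch_add(1, std::memory_order_relaxed); }

    // A read only needs a shared copy of the line.
    std::uint64_t read() const { return value.load(std::memory_order_relaxed); }
};
```

The code is correct on any number of CPUs; the cost is that every `increment()` from a different CPU is a cache miss.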

13 Performance

14 Problems Counter bounces between CPU caches – cache miss rate is high Why not give each CPU its own piece of the counter to increment? – take advantage of commutativity of addition – counter updates can be local – reads require all counters

15-18 Array-based Counter [diagram sequence: the counter is an array in memory with one element per CPU, initially 0, 0; each CPU increments only its own element (1, 0 and then 1, 1); Read Counter: Add All Counters (1 + 1) = 2]
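
A minimal sketch of the array-based counter, assuming four CPUs and a caller that passes its own CPU index. `ArrayCounter` and `kArrayCpus` are illustrative names, not taken from the slides.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kArrayCpus = 4;  // assumed CPU count, for illustration

struct ArrayCounter {
    // One element per CPU, packed next to each other in memory.
    std::array<std::atomic<std::uint64_t>, kArrayCpus> slots{};

    // Updates touch only the caller's own element.
    void increment(std::size_t cpu) {
        assert(cpu < kArrayCpus);
        slots[cpu].fetch_add(1, std::memory_order_relaxed);
    }

    // Addition is commutative, so a read is the sum of all elements.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const auto& s : slots) total += s.load(std::memory_order_relaxed);
        return total;
    }
};
```

Because the elements are adjacent, several of them usually land in the same cache line.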

19 Performance Performs no better than ‘shared counter’!

20 Problem: False Sharing Caches operate at the granularity of cache lines – if two pieces of the counter are in the same cache line, they cannot be cached (for writing) on more than one CPU at a time
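
To make the granularity point concrete, the sketch below checks whether two addresses fall in the same cache line. The 64-byte line size is an assumption (it is common but not universal), and `same_cache_line` and `UnpaddedSlots` are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kFsLineBytes = 64;  // assumed cache-line size

// True when both addresses map to the same cache line.
inline bool same_cache_line(const void* a, const void* b) {
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kFsLineBytes;
    };
    return line(a) == line(b);
}

// Two 8-byte counter elements stored back to back. The struct is aligned
// to a line boundary so that both elements are guaranteed to share a line.
struct alignas(kFsLineBytes) UnpaddedSlots {
    std::uint64_t slot[2];
};
```

With this layout, `same_cache_line(&s.slot[0], &s.slot[1])` holds for any `UnpaddedSlots s`, so a write to either element by one CPU invalidates the other CPU's cached copy of both.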

21-26 False Sharing [diagram sequence: both array elements (0,0) sit in one cache line, which each CPU caches in the Sharing state; when one CPU increments its element (1,0) the copy in the other cache is invalidated (Invalidate); the line is shared again (1,0), and the second CPU's increment (1,1) invalidates the first CPU's copy]

27 Solution? Spread the counter components out in memory: pad the array

28-29 Padded Array [diagram sequence: each CPU's counter element, initially 0, lies in its own cache line; both CPUs increment their element to 1] Updates independent of each other
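
One way to pad the array in C++ is to align each element to the cache-line size. This is a sketch under the assumption of 64-byte lines and four CPUs; `PaddedSlot`, `PaddedCounter`, `kPadLine` and `kPadCpus` are illustrative names.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kPadLine = 64;  // assumed cache-line size
constexpr std::size_t kPadCpus = 4;   // assumed CPU count

// alignas rounds the size of each element up to a full cache line,
// so no two elements can share a line.
struct alignas(kPadLine) PaddedSlot {
    std::atomic<std::uint64_t> value{0};
};
static_assert(sizeof(PaddedSlot) == kPadLine, "one slot per cache line");

struct PaddedCounter {
    std::array<PaddedSlot, kPadCpus> slots{};

    // A CPU writes only to its own line; other caches are not invalidated.
    void increment(std::size_t cpu) {
        assert(cpu < kPadCpus);
        slots[cpu].value.fetch_add(1, std::memory_order_relaxed);
    }

    // Reads still have to visit every element.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const auto& s : slots) total += s.value.load(std::memory_order_relaxed);
        return total;
    }
};
```

The trade-off is memory: each 8-byte counter element now occupies 64 bytes.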

30 Performance Works better

31 Locality in OS Serious performance impact. Difficult to retrofit. Tornado – Ground-up design – Object-oriented approach (natural locality)

32 Tornado Object-oriented approach. Clustered objects. Protected procedure call. Semi-automatic garbage collection – Simplifies locking protocols

33 Object Oriented Structure Each resource is represented by an object. Requests to virtual resources are handled independently – No shared data structure access – No shared locks

34-36 Why Object Oriented? [diagram sequence: Process 1, Process 2, … all go through a single Process Table; with coarse-grain locking, Process 1 holds the table's one Lock while Process 2 waits for it]

37 Object Oriented Approach Class ProcessTableEntry{ data lock code }

38 Object Oriented Approach Fine-grain, instance locking: [diagram: Process 1 and Process 2 access different entries of the Process Table; each entry has its own lock (Process 1 Lock, Process 2 Lock)]
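
The `ProcessTableEntry` outline from the previous slide (data, lock, code) might look like the following in C++. The fields and methods are assumptions chosen for illustration; only the class name comes from the slides.

```cpp
#include <mutex>
#include <string>

class ProcessTableEntry {
public:
    explicit ProcessTableEntry(int pid) : pid_(pid), state_("new") {}

    // Code: each method takes only this entry's lock, so operations on
    // different processes never contend with each other.
    void set_state(const std::string& s) {
        std::lock_guard<std::mutex> guard(lock_);
        state_ = s;
    }

    std::string state() const {
        std::lock_guard<std::mutex> guard(lock_);
        return state_;
    }

    int pid() const { return pid_; }

private:
    mutable std::mutex lock_;  // Lock: one per instance, not one per table
    const int pid_;            // Data
    std::string state_;        // Data (hypothetical field)
};
```

With a coarse-grain design the same two calls on different processes would serialize on the single table lock.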

39 Clustered Objects Problem: how to improve locality for widely shared objects? A single logical object can be composed of multiple local representatives (reps) – the reps coordinate with each other to manage the object’s state – they share the object’s reference

40 Clustered Objects

41 Clustered Object References

42 Clustered Objects: Implementation A translation table per processor – Located at the same virtual address on every processor – Each entry is a pointer to a rep. A clustered object reference is just a pointer into the table. Reps are created on demand when first accessed, through a global miss-handling object

43 Clustered Objects Degree of clustering. Management of state – partitioning – distribution – replication (how to maintain consistency?) Coordination between reps? – Shared memory – Remote PPCs (protected procedure calls)
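
The lookup path described on this slide can be sketched with ordinary arrays. This is a simplified model, not Tornado's implementation: the real tables are mapped at the same virtual address on each processor, whereas here the CPU index is passed explicitly, and `TranslationTables`, `Rep` and the table sizes are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <memory>

struct Rep {               // hypothetical per-CPU representative
    std::size_t cpu;
    int local_state;
};

constexpr std::size_t kTableCpus = 2;   // assumed CPU count
constexpr std::size_t kTableSlots = 8;  // assumed table size

class TranslationTables {
public:
    using ObjRef = std::size_t;  // a reference is an index valid in every table

    // Reserves the same slot in every per-CPU table (not synchronized here).
    ObjRef allocate() {
        assert(next_ < kTableSlots);
        return next_++;
    }

    // Returns the calling CPU's rep. An empty entry is a miss: the rep is
    // created on first access and installed in that CPU's table only.
    Rep* lookup(std::size_t cpu, ObjRef ref) {
        assert(cpu < kTableCpus && ref < next_);
        std::unique_ptr<Rep>& entry = tables_[cpu][ref];
        if (!entry) entry.reset(new Rep{cpu, 0});  // stands in for the miss handler
        return entry.get();
    }

private:
    std::array<std::array<std::unique_ptr<Rep>, kTableSlots>, kTableCpus> tables_{};
    ObjRef next_ = 0;
};
```

The same reference resolves to a different rep on each CPU, and a CPU that never touches the object never pays for a rep.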

44-49 Counter: Clustered Object [diagram sequence: a single Object Reference names the counter; each CPU has its own rep; each CPU increments its local rep (updates independent of each other); Read Counter: Add All Counters (1 + 1)]
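
A sketch of the counter as a clustered object, with the degree of clustering as a constructor parameter. The class and parameter names are assumptions, and the CPU index is passed explicitly instead of being resolved through a per-processor translation table.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

class ClusteredCounter {
    // Each rep is a separate heap allocation in this sketch; a real kernel
    // would place it in memory local to its CPUs and pad it to a cache line.
    struct Rep {
        std::atomic<std::uint64_t> count{0};
    };

public:
    // cpus_per_rep is the degree of clustering: 1 gives one rep per CPU,
    // num_cpus gives a single rep shared by all CPUs.
    ClusteredCounter(std::size_t num_cpus, std::size_t cpus_per_rep)
        : cpus_per_rep_(cpus_per_rep) {
        assert(num_cpus > 0 && cpus_per_rep > 0);
        std::size_t reps = (num_cpus + cpus_per_rep - 1) / cpus_per_rep;
        for (std::size_t i = 0; i < reps; ++i) reps_.emplace_back(new Rep);
    }

    // Updates go to the rep that serves the calling CPU.
    void increment(std::size_t cpu) {
        std::size_t i = cpu / cpus_per_rep_;
        assert(i < reps_.size());
        reps_[i]->count.fetch_add(1, std::memory_order_relaxed);
    }

    // A read combines the state of all reps.
    std::uint64_t read() const {
        std::uint64_t total = 0;
        for (const auto& r : reps_) total += r->count.load(std::memory_order_relaxed);
        return total;
    }

    std::size_t rep_count() const { return reps_.size(); }

private:
    std::size_t cpus_per_rep_;
    std::vector<std::unique_ptr<Rep>> reps_;
};
```

Callers see one logical counter; how many reps back it is an internal choice that can be tuned per object.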

50 Synchronization Two distinct locking issues – Locking: mutually exclusive access to objects – Existence guarantees: making sure an object is not freed while still in use

51 Locking in Tornado Encapsulates locking within individual objects. Uses clustered objects to limit contention. Uses spin-then-block locks

52 Existence Guarantees: the problem Use a lock to protect all references to an object? – eliminates races where one thread is accessing the object and another is deallocating it – results in a complex global hierarchy of locks Tornado: semi-automatic garbage collection – a clustered object reference can be used at any time – eliminates the need for locks

53 Existence Guarantees in Tornado Semi-automatic garbage collection: – programmer decides what to free, system decides when to free it – guarantees that object references can be used safely – eliminates the need for reference locks

54 How does it work? Programmer removes all persistent references – Normal cleanup done manually. System tracks all temporary references – Event-driven kernel – Maintains an activity counter for each processor – Deletes an object only when the activity counter is zero
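
A spin-then-block lock can be sketched in user-level C++ as follows. This illustrates the general technique and is not Tornado's lock; the spin limit of 1000 is an arbitrary assumption.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

class SpinThenBlockLock {
public:
    void lock() {
        // Phase 1: spin briefly, which is cheap if the holder releases soon.
        for (int i = 0; i < kSpinLimit; ++i) {
            if (try_lock()) return;
        }
        // Phase 2: give up the CPU and sleep until a release is signalled.
        std::unique_lock<std::mutex> guard(mutex_);
        cv_.wait(guard, [this] { return try_lock(); });
    }

    bool try_lock() {
        return !locked_.exchange(true, std::memory_order_acquire);
    }

    void unlock() {
        {
            // Clearing the flag under the mutex means a waiter cannot check
            // the flag and then go to sleep in between, missing the wake-up.
            std::lock_guard<std::mutex> guard(mutex_);
            locked_.store(false, std::memory_order_release);
        }
        cv_.notify_one();
    }

private:
    static constexpr int kSpinLimit = 1000;  // assumed tuning value
    std::atomic<bool> locked_{false};
    std::mutex mutex_;
    std::condition_variable cv_;
};
```

Spinning avoids a context switch when critical sections are short, and blocking stops a waiter from burning CPU time when they are not.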
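
The scheme on this slide can be approximated with per-CPU activity counters and a list of retired objects. This is a simplified sketch with hypothetical names (`DeferredReclaimer`, `retire`, `try_reclaim`) that follows the slide's description, not Tornado's actual mechanism.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <functional>
#include <mutex>
#include <utility>
#include <vector>

constexpr std::size_t kGcCpus = 2;  // assumed CPU count

class DeferredReclaimer {
public:
    // Every kernel operation brackets its use of temporary references.
    void begin_operation(std::size_t cpu) { active_[cpu].fetch_add(1); }
    void end_operation(std::size_t cpu) { active_[cpu].fetch_sub(1); }

    // The programmer decides WHAT to free: call this after removing all
    // persistent references, so no new operation can find the object.
    void retire(std::function<void()> destroy) {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push_back(std::move(destroy));
    }

    // The system decides WHEN: retired objects are destroyed only once
    // every CPU's activity counter has been observed at zero.
    std::size_t try_reclaim() {
        std::vector<std::function<void()>> ready;
        {
            std::lock_guard<std::mutex> guard(mutex_);
            for (const auto& a : active_) {
                if (a.load() != 0) return 0;  // an operation is still in flight
            }
            ready.swap(pending_);
        }
        for (auto& destroy : ready) destroy();
        return ready.size();
    }

private:
    std::array<std::atomic<int>, kGcCpus> active_{};
    std::mutex mutex_;
    std::vector<std::function<void()>> pending_;
};
```

Operations that start after `retire()` cannot reach the object, so it is enough to wait for the operations that were already running; no per-reference lock is taken on the access path.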

55 Performance Scalability

56 Conclusion Object-oriented approach and clustered objects exploit locality to improve concurrency. OO design has some overhead, but it is low compared to the performance advantages. Tornado scales extremely well and achieves high performance on shared-memory multiprocessors.

