By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Shared Memory Consistency Models: A Tutorial.

Slides:



Advertisements
Similar presentations
Symmetric Multiprocessors: Synchronization and Sequential Consistency.
Advertisements

1 Episode III in our multiprocessing miniseries. Relaxed memory models. What I really wanted here was an elephant with sunglasses relaxing On a beach,
1 Lecture 20: Synchronization & Consistency Topics: synchronization, consistency models (Sections )
Memory Consistency Models Kevin Boos. Two Papers Shared Memory Consistency Models: A Tutorial – Sarita V. Adve & Kourosh Gharachorloo – September 1995.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Snoopy Caches I Steve Ko Computer Sciences and Engineering University at Buffalo.
CS 162 Memory Consistency Models. Memory operations are reordered to improve performance Hardware (e.g., store buffer, reorder buffer) Compiler (e.g.,
© Krste Asanovic, 2014CS252, Spring 2014, Lecture 12 CS252 Graduate Computer Architecture Spring 2014 Lecture 12: Synchronization and Memory Models Krste.
D u k e S y s t e m s Time, clocks, and consistency and the JMM Jeff Chase Duke University.
Is SC + ILP = RC? Presented by Vamshi Kadaru Chris Gniady, Babak Falsafi, and T. N. VijayKumar - Purdue University Spring 2005: CS 7968 Parallel Computer.
CS492B Analysis of Concurrent Programs Consistency Jaehyuk Huh Computer Science, KAIST Part of slides are based on CS:App from CMU.
Memory Consistency in Vector IRAM David Martin. Consistency model applies to instructions in a single instruction stream (different than multi-processor.
Slides 8d-1 Programming with Shared Memory Specifying parallelism Performance issues ITCS4145/5145, Parallel Programming B. Wilkinson Fall 2010.
By Sarita Adve & Kourosh Gharachorloo Review by Jim Larson Shared Memory Consistency Models: A Tutorial.
1 Lecture 7: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
Lecture 13: Consistency Models
Computer Architecture II 1 Computer architecture II Lecture 9.
1 Lecture 15: Consistency Models Topics: sequential consistency, requirements to implement sequential consistency, relaxed consistency models.
1 Sharing Objects – Ch. 3 Visibility What is the source of the issue? Volatile Dekker’s algorithm Publication and Escape Thread Confinement Immutability.
Memory Consistency Models
CS510 Concurrent Systems Class 5 Threads Cannot Be Implemented As a Library.
Shared Memory Consistency Models: A Tutorial By Sarita V Adve and Kourosh Gharachorloo Presenter: Meenaktchi Venkatachalam.
1 Lecture 22: Synchronization & Consistency Topics: synchronization, consistency models (Sections )
Shared Memory Consistency Models: A Tutorial By Sarita V Adve and Kourosh Gharachorloo Presenter: Sunita Marathe.
Operating Systems CSE 411 CPU Management Oct Lecture 13 Instructor: Bhuvan Urgaonkar.
Memory Consistency Models Some material borrowed from Sarita Adve’s (UIUC) tutorial on memory consistency models.
Concurrency Recitation – 2/24 Nisarg Raval Slides by Prof. Landon Cox.
Computer Architecture 2015 – Cache Coherency & Consistency 1 Computer Architecture Memory Coherency & Consistency By Yoav Etsion and Dan Tsafrir Presentation.
Shared Memory Consistency Models: A Tutorial Sarita V. Adve Kouroush Ghrachorloo Western Research Laboratory September 1995.
Memory Consistency Models Alistair Rendell See “Shared Memory Consistency Models: A Tutorial”, S.V. Adve and K. Gharachorloo Chapter 8 pp of Wilkinson.
COMP 111 Threads and concurrency Sept 28, Tufts University Computer Science2 Who is this guy? I am not Prof. Couch Obvious? Sam Guyer New assistant.
Shared Memory Consistency Models. SMP systems support shared memory abstraction: all processors see the whole memory and can perform memory operations.
Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.
Threads Cannot be Implemented as a Library Hans-J. Boehm.
Threads cannot be implemented as a library Hans-J. Boehm (presented by Max W Schwarz)
Memory Consistency Zhonghai Lu Outline Introduction What is a memory consistency model? Who should care? Memory consistency models Strict.
CS533 Concepts of Operating Systems Jonathan Walpole.
Fundamentals of Memory Consistency Smruti R. Sarangi Prereq: Slides for Chapter 11 (Multiprocessor Systems), Computer Organisation and Architecture, Smruti.
1 Programming with Shared Memory - 3 Recognizing parallelism Performance issues ITCS4145/5145, Parallel Programming B. Wilkinson Jan 22, 2016.
740: Computer Architecture Memory Consistency Prof. Onur Mutlu Carnegie Mellon University.
EE 382 Processor DesignWinter 98/99Michael Flynn 1 EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors, Part I.
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Lecture 20: Consistency Models, TM
CS5102 High Performance Computer Systems Memory Consistency
Memory Consistency Models
Threads Cannot Be Implemented As a Library
Lecture 11: Consistency Models
Memory Consistency Models
12.4 Memory Organization in Multiprocessor Systems
Threads and Memory Models Hal Perkins Autumn 2011
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Chapter 26 Concurrency and Thread
Shared Memory Consistency Models: A Tutorial
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Introduction to High Performance Computing Lecture 20
Threads and Memory Models Hal Perkins Autumn 2009
Lecture 22: Consistency Models, TM
Background and Motivation
Shared Memory Consistency Models: A Tutorial
Lecture 10: Consistency Models
Programming with Shared Memory Specifying parallelism
Memory Consistency Models
CSE 153 Design of Operating Systems Winter 19
Lecture 24: Multiprocessors
Programming with Shared Memory - 3 Recognizing parallelism
Programming with Shared Memory Specifying parallelism
Lecture 21: Synchronization & Consistency
Lecture: Consistency Models, TM
Advanced Operating Systems (CS 202) Memory Consistency and Transactional Memory Feb. 6, 2019.
CS 152 Computer Architecture and Engineering CS252 Graduate Computer Architecture Lecture 19 Memory Consistency Models Krste Asanovic Electrical Engineering.
Lecture 11: Consistency Models
Presentation transcript:

By Sarita Adve & Kourosh Gharachorloo Slides by Jim Larson Shared Memory Consistency Models: A Tutorial

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 1 Flag1 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 1 Flag1 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 1 Flag1 = 1 Critical Section is Protected Works the same if Process 2 runs first! Process 2 enters its Critical Section

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 1 Arbitrary interleaving of Processes

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 1 Flag1 = 1 Arbitrary interleaving of Processes

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 1 Flag1 = 1 Both processes can block but the critical section remains protected. Deadlock can be fixed by extending the algorithm with turn-taking

Outline Concurrent Programming on a Uniprocessor The effect of optimizations on a Uniprocessor The effect of the same optimizations on a Multiprocessor without Sequential Consistency Methods for restoring Sequential Consistency Conclusion

Optimization: Write Buffer with Bypass SpeedUp: Write takes 100 cycles, buffering takes 1 cycle. So Buffer and keep going. Problem: Read from a Location with a buffered Write pending?? (Single Processor Case)

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Write Buffering Flag1 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Write Buffering Flag1 = 1 Flag2 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Write Buffering Flag1 = 1 Flag2 = 1

Optimization: Write Buffer with Bypass SpeedUp: Write takes 100 cycles, buffering takes 1 cycle. Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete.

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Write Buffering Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. STALL! Flag1 = 1 Flag2 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 1 Write Buffering Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Flag2 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Does this work for Multiprocessors??

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Does this work for Multiprocessors?

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Multiprocessor Case

Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Multiprocessor Case Flag1 = 1 Process 1:: Flag1 = 1 If (Flag2 == 0) critical section

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Multiprocessor Case Flag1 = 1Flag2 = 1 Process 2:: Flag2 = 1 If (Flag1 == 0) critical section

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Multiprocessor Case Flag1 = 1Flag2 = 1 Process 2:: Flag2 = 1 If (Flag1 == 0) critical section

Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Multiprocessor Case Flag1 = 1Flag2 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Multiprocessor Case Flag1 = 1Flag2 = 1

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Multiprocessor Case Flag1 = 1Flag2 = 1

What happens on a Processor stays on that Processor

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Rule: If a WRITE is issued, buffer it and keep executing Unless: there is a READ from the same location (subsequent WRITEs don't matter), then wait for the WRITE to complete. Processor 2 knows nothing about the write to Flag1, so has no reason to stall! Flag1 = 1Flag2 = 1 Process 2:: Flag2 = 1 If (Flag1 == 0) critical section

A more general way to look at the Problem: Reordering of Reads and Writes (Loads and Stores).

Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section Consider the Instructions in these processes. WX RX WY RY Simplify as:

RY WY RX WY RX WX WY RX RY RX WY RX RY RX WY RX RY RX WY RX RY RX WY WX RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY WX There are 4! or 24 possible orderings. If either WX<RX or WY<RY Then the Critical Section is protected (Correct Behavior). WX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX

WX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX WX WY RX RY RX WY RX RY RX WY RX RY RX WY RX RY RX WY WX RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY WX There are 4! or 24 possible orderings. If either WX<RX or WY<RY Then the Critical Section is protected (Correct Behavior) 18 of the 24 orderings are OK. But the other 6 are trouble!

Consider another example...

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 0 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate. Memory Interconnect

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 0 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate. Memory Interconnect Data = 2000

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 0 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate. Memory Interconnect Data = 2000 Head = 1

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 1 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate. Memory Interconnect Data = 2000

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 1 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate. Memory Interconnect Data = 2000

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 1 Data = 0 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate. Memory Interconnect Data = 2000

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 1 Data = 2000 Write By-Pass: General Interconnect to multiple memory modules means write arrival in memory is indeterminate. Fix: Write must be acknowledged before another write (or read) from the same processor. Memory Interconnect

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 0 Data = 0 Memory Interconnect Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 0 Data = 0 Memory Interconnect Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Global Data Initialized to 0 Head = 0 Data = 0 Memory Interconnect Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data (0) Global Data Initialized to 0 Head = 0 Data = 0 Memory Interconnect Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data (0) Global Data Initialized to 0 Head = 0 Data = 2000 Memory Interconnect Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data (0) Global Data Initialized to 0 Head = 1 Data = 2000 Memory Interconnect Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data (0) Global Data Initialized to 0 Head = 1 Data = 2000 Memory Interconnect Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data (0) Global Data Initialized to 0 Head = 1 Data = 2000 Memory Interconnect Non-Blocking Reads: Lockup-free Caches, speculative execution, dynamic scheduling allow execution to proceed past a Read. Assume Writes are acknowledged.

Let's reason about reordering of reads and writes again.

Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data Consider the Instructions in these processes. WX RX WY RY Simplify as:

WX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX WX WY RX RY RX WY RX RY RX WY RX RY RX WY RX RY RX WY WX RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY WX Correct behavior requires WX<RX, WY<RY. Program requires WY<RX. => 6 correct orders out of

WX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX WX WY RX RY RX WY RX RY RX WY RX RY RX WY RX RY RX WY WX RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY WX Correct behavior requires WX<RX, WY<RY. Program requires WY<RX. => 6 correct orders out of Write Acknowledgment means WX < WY. Does that Help? Disallows only 12 out of still incorrect!

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Sequential Consistency for Multiprocessors Why is it surprising that these code examples break on a multi- processor? What ordering property are we assuming (incorrectly!) that multiprocessors support? Are we assuming that they are sequentially consistent?

Sequential Consistency Sequential Consistency requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were interleaved arbitrarily....appears as if a memory operation executes atomically or instantaneously with respect to other memory operations (Hennessy and Patterson, 4th ed.)

Understanding Ordering Program Order Compiled Order Interleaving Order Execution Order

Reordering Writes reach memory, and Reads see memory in an order different than that in the Program. Caused by Processor Caused by Multiprocessors (and Cache) Caused by Compilers

What Are the Choices? If we want our results to be the same as those of a Sequentially Consistent Model. Do we: Enforce Sequential Consistency at the memory level? Use Coherent (Consistent) Cache ? Or what ?

Enforce Sequential Consistency? Removes virtually all optimizations => Too Slow!

What Are the Choices? If we want our results to be the same as those of a Sequentially Consistent Model. Do we: Enforce Sequential Consistency at the memory level? Use Coherent (Consistent) Caches ? Or what ?

Cache Coherence Multiple processors have a consistent view of memory (i.e. by using the MESI protocol) But this does not say when a processor must see a value updated by another processor. Cache coherency does not guarantee Sequential Consistency! Example: a write-through cache acts just like a write buffer with bypass.

What Are the Choices? Enforce Sequential Consistency ? Too Slow! Use Coherent (Consistent) Caches ? Won't help! What's left ?????

Involve the Programmer?

If you don't talk to your CPU about concurrency, who's going to??

What Are the Choices? Enforce Sequential Consistency? (Too Slow!) Use Coherent (Consistent) Cache? (Won't help) Provide Memory Barrier (Fence) Instructions?

Barrier Instructions Methods for overriding relaxations of the Sequential Consistency Model. Also known as a Safety Net. Example: A Fence would require buffered Writes to complete before allowing further execution. Not Cheap, but not often needed. Must be placed by the Programmer. Memory Consistency Model for Processor tells you how.

Process 1:: Flag1 = 1 >>Mem_Bar<< If (Flag2 == 0) critical section Process 2:: Flag2 = 1 >>Mem_Bar<< If (Flag1 == 0) critical section Consider the Instructions in these processes. WX RX WY RY Simplify as: >>Fence<< Fence: WX < RY Fence: WY < RX

WX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX WX WY RX RY RX WY RX RY RX WY RX RY RX WY RX RY RX WY WX RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY WX There are 4! or 24 possible orderings. If either WX<RX or WY<RY Then the Critical Section is protected (Correct Behavior) 18 of the 24 orderings are OK. But the other 6 are trouble! Enforce WX<RY and WY<RX. Only 6 of the 18 good orderings are allowed OK. But the 6 bad ones are forbidden!

Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Flag1 = 1 Process 1:: Flag1 = 1 >>Mem_Bar<< If (Flag2 == 0) critical section Process 2:: Flag2 = 1 >>Mem_Bar<< If (Flag1 == 0) critical section Fence: Wait for pending I/O to complete before more I/O (includes cache updates).

Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 0 Flag1 = 1 Process 1:: Flag1 = 1 >>Mem_Bar<< If (Flag2 == 0) critical section Process 2:: Flag2 = 1 >>Mem_Bar<< If (Flag1 == 0) critical section Fence: Wait for pending I/O to complete before more I/O (includes cache updates). Flag2 = 1

Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 1 Process 1:: Flag1 = 1 >>Mem_Bar<< If (Flag2 == 0) critical section Process 2:: Flag2 = 1 >>Mem_Bar<< If (Flag1 == 0) critical section Fence: Wait for pending I/O to complete before more I/O (includes cache updates). Flag2 = 1Flag1 = 1

Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 1 Process 1:: Flag1 = 1 >>Mem_Bar<< If (Flag2 == 0) critical section Process 2:: Flag2 = 1 >>Mem_Bar<< If (Flag1 == 0) critical section Fence: Wait for pending I/O to complete before more I/O (includes cache updates). Flag2 = 1

Dekker's Algorithm: Global Flags Init to 0 Flag2 = 0 Flag1 = 1 Process 1:: Flag1 = 1 >>Mem_Bar<< If (Flag2 == 0) critical section Process 2:: Flag2 = 1 >>Mem_Bar<< If (Flag1 == 0) critical section Flag2 = 1 Flag1 protects critical section when Process 2 continues at Mem_Bar.

Process 1:: Data = 2000; >>Mem_Bar<< Head = 1; Process 2:: While (Head == 0) {;} >>Mem_Bar<< LocalValue = Data Consider the Instructions in these processes. WX RX WY RY Simplify as: >>Fence<< Fence: WX < WYFence: RY < RX

WX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX RY WY RX WY RX WX WY RX RY RX WY RX RY RX WY RX RY RX WY RX RY RX WY WX RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY RX WY RX WY RY WX Correct behavior requires WX<RX, WY<RY. Program requires WY<RX. => 6 correct orders out of We can require WX<WY and RY<RX. Is that enough? Program requires WY<RX. Thus, WX<WY<RY<RX; hence WX<RX and WY<RY. Only 2 of the 6 good orderings are allowed - But all 18 incorrect orderings are forbidden.

Global Data Initialized to 0 Head = 0 Data = 0 Memory Interconnect Data = 2000 Process 1:: Data = 2000; >>Mem_Bar<< Head = 1; Process 2:: While (Head == 0) {;} >>Mem_Bar<< LocalValue = Data Fence: Wait for pending I/O to complete before more I/O (includes cache updates).

Global Data Initialized to 0 Head = 0 Data = 0 Memory Interconnect Data = 2000 Process 1:: Data = 2000; >>Mem_Bar<< Head = 1; Process 2:: While (Head == 0) {;} >>Mem_Bar<< LocalValue = Data Fence: Wait for pending I/O to complete before more I/O (includes cache updates).

Global Data Initialized to 0 Head = 0 Data = 2000 Memory Interconnect Data = 2000 Process 1:: Data = 2000; >>Mem_Bar<< Head = 1; Process 2:: While (Head == 0) {;} >>Mem_Bar<< LocalValue = Data Fence: Wait for pending I/O to complete before more I/O (includes cache updates).

Global Data Initialized to 0 Head = 1 Data = 2000 Memory Interconnect Process 1:: Data = 2000; >>Mem_Bar<< Head = 1; Process 2:: While (Head == 0) {;} >>Mem_Bar<< LocalValue = Data Fence: Wait for pending I/O to complete before more I/O (includes cache updates).

Global Data Initialized to 0 Head = 1 Data = 2000 Memory Interconnect Process 1:: Data = 2000; >>Mem_Bar<< Head = 1; Process 2:: While (Head == 0) {;} >>Mem_Bar<< LocalValue = Data Fence: Wait for pending I/O to complete before more I/O (includes cache updates). When Head reads as 1, Data will have the correct value when Process 2 continues at Mem_Bar.

Results appear in a Sequentially Consistent manner.

I've never heard of this. Is this for Real??

Memory Ordering in Modern Microprocessors, Part I Linux provides a carefully chosen set of memory-barrier primitives, as follows: smp_mb(): “memory barrier” that orders both loads and stores. This means loads and stores preceding the memory barrier are committed to memory before any loads and stores following the memory barrier. smp_rmb(): “read memory barrier” that orders only loads. smp_wmb(): “write memory barrier” that orders only stores.

OK, I get it. So what's a Programmer supposed to do??

Words of Advice? “The difficult problem is identifying the ordering constraints that are necessary for correctness.” “...the programmer must still resort to reasoning with low level reordering optimizations to determine whether sufficient orders are enforced.” “...deep knowledge of each CPU's memory- consistency model can be helpful when debugging, to say nothing of writing architecture-specific code or synchronization primitives.”

Memory Consistency Models Explain what relaxations of Sequential Consistency are implemented. Explain what Barrier statements are available to avoid them. Provided for every processor (YMMV).

Memory Consistency Models

Programmer's View What does a programmer need to do? How do they know when to do it? Compilers & Libraries can help, but still need to use primitives in parallel programming (like in a kernel). Assuming the worst and synchronizing everything results in sequential consistency. Too slow, but a good way to debug.

How to Reason about Sequential Consistency Applies to parallel programs (Kernel!) Parallel Programming Language may provide the protection (DoAll loops). Language may have types to use. Distinguish Data and Sync regions. Library may provide primitives (Linux). How to know if you need synchronization?

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Parallel programming on a Multiprocessor that relaxes the Sequentially Consistent Model presents new challenges. Know the memory consistency models for the processors you use. Use barrier (fence) instructions to allow optimizations while protecting your code. Simple examples were used, there are others much more subtle. The fix is basically the same.

Conclusion Parallel programming on a Multiprocessor that relaxes the Sequentially Consistent Model presents new challenges. Know the memory consistency models for the processors you use. Use barrier (fence) instructions to allow optimizations while protecting your code. Simple examples were used, there are others much more subtle. The fix is basically the same.

Conclusion Parallel programming on a Multiprocessor that relaxes the Sequentially Consistent Model presents new challenges. Know the memory consistency models for the processors you use. Use barrier (fence) instructions to allow optimizations while protecting your code. Simple examples were used, there are others much more subtle. The fix is basically the same.

Conclusion Parallel programming on a Multiprocessor that relaxes the Sequentially Consistent Model presents new challenges. Know the memory consistency models for the processors you use. Use barrier (fence) instructions to allow optimizations while protecting your code. Simple examples were used, there are others much more subtle. The fix is basically the same.

References Shared Memory Consistency Models: A Tutorial By Sarita Adve & Kourosh Gharachorloo Vince Shuster Presentation for CS533, Winter, 2010 Memory Ordering in Modern Microprocessors, Part I, Paul E. McKenney, Linux Journel, June, 2005 Computer Architecture, Hennessy and Patterson, 4 th Ed., 2007