
CS533 Concepts of Operating Systems Jonathan Walpole

Shared Memory Consistency Models: A Tutorial

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion


Dekker’s Algorithm Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section
Walkthrough (memory initially holds Flag1 = 0, Flag2 = 0): Process 1 sets Flag1 = 1, reads Flag2 == 0, and enters the critical section. Process 2 then sets Flag2 = 1, reads Flag1, sees 1, and stays out. Critical section is protected!

Dekker’s Algorithm Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section
Walkthrough: Process 1 sets Flag1 = 1 and Process 2 sets Flag2 = 1 before either of them reads. Each then reads the other’s flag, sees 1, and stays out. Both processes can block, but the critical section is still protected!
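For concreteness, here is a minimal C/pthreads sketch of the two-flag protocol above (the thread and variable names are illustrative, not from the slides). With plain int variables it is correct only if memory accesses really happen in program order and are interleaved as written - the sequentially consistent behaviour whose absence the rest of the talk examines.

#include <pthread.h>
#include <stdio.h>

/* Plain ints: correct ONLY if the system behaves sequentially consistently. */
int flag1 = 0, flag2 = 0;

void *process1(void *arg) {
    (void)arg;
    flag1 = 1;                 /* announce intent to enter               */
    if (flag2 == 0) {          /* other process has not announced yet    */
        puts("P1 in critical section");
    }
    return NULL;
}

void *process2(void *arg) {
    (void)arg;
    flag2 = 1;
    if (flag1 == 0) {
        puts("P2 in critical section");
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, process1, NULL);
    pthread_create(&t2, NULL, process2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}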

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Write Buffer With Bypass Speedup: - a write to memory takes 100 cycles - putting it in the write buffer takes 1 cycle - so buffer the write and keep going! Problem: what about a read from a location with a buffered write pending?

Dekker’s Algorithm with a write buffer Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section
Walkthrough: both writes sit in the write buffer, so memory still holds Flag1 = 0 and Flag2 = 0. If the reads are allowed to bypass the buffered writes, each process reads 0 from memory and enters. Critical section is not protected!

Write Buffer With Bypass Rule: - if a write is issued, buffer it and keep executing - unless there is a read from the same location (subsequent writes don't matter); in that case, wait for the buffered write to complete
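A toy software model may make the rule concrete (purely illustrative - real hardware implements this in the load/store pipeline, and the names here are invented): a read checks the buffer for a pending write to the same location and drains it before going to memory, while reads from other locations bypass the buffer.

#include <stdbool.h>
#include <stddef.h>

/* Toy model of a write buffer with bypass. Illustrative only. */
#define MEM_WORDS 64
#define BUF_SLOTS  8

static int mem[MEM_WORDS];                       /* "main memory" */
static struct { size_t addr; int value; bool valid; } wbuf[BUF_SLOTS];

/* Issue a write: buffer it and keep executing (1 cycle, not 100). */
void cpu_write(size_t addr, int value) {
    for (size_t i = 0; i < BUF_SLOTS; i++) {
        if (!wbuf[i].valid) {
            wbuf[i].addr = addr; wbuf[i].value = value; wbuf[i].valid = true;
            return;
        }
    }
    mem[addr] = value;                           /* buffer full: write through */
}

/* Issue a read: if a write to the SAME location is still buffered,
 * "stall" by draining it first; reads to other locations bypass.  */
int cpu_read(size_t addr) {
    for (size_t i = 0; i < BUF_SLOTS; i++) {
        if (wbuf[i].valid && wbuf[i].addr == addr) {
            mem[wbuf[i].addr] = wbuf[i].value;   /* complete the write */
            wbuf[i].valid = false;
        }
    }
    return mem[addr];
}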

Dekker’s Algorithm with the bypass rule Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section
Walkthrough: Process 1 buffers Flag1 = 1 and Process 2 buffers Flag2 = 1; memory still holds 0 for both. A read of a flag with a buffered write pending now stalls (Stall!) until that write drains to memory, so the reads return the updated values and, on a uniprocessor with one shared write buffer, the original behaviour is restored.

Is This a General Solution? - If each CPU in a multiprocessor has its own write buffer with bypass, and follows the rule above, will the algorithm still work correctly?

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Dekker’s Algorithm on a multiprocessor Process 1:: Flag1 = 1 If (Flag2 == 0) critical section Process 2:: Flag2 = 1 If (Flag1 == 0) critical section
Walkthrough: each processor buffers its write in its own private write buffer, so memory still holds Flag1 = 0 and Flag2 = 0. Each processor reads the other’s flag from memory (its own buffer holds no write to that location, so the bypass rule never triggers a stall), sees 0, and enters the critical section.

It’s Broken! How did that happen? - Write buffers are processor-specific - Writes are not visible to other processors until they reach memory

Generalization of the Problem Dekker’s algorithm has the form: Process 1: W(X) then R(Y); Process 2: W(Y) then R(X). - The write buffer delays each write until after the read that follows it - That reorders the reads and writes - So both processes can read the old value, from before the other’s write!

There are 4! = 24 possible orderings of the four operations WX, WY, RX, RY. If either WX < RX or WY < RY, the critical section is protected (correct behavior). 18 of the 24 orderings are OK, but the other 6 are trouble!
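The W(X)/R(Y) versus W(Y)/R(X) pattern is the classic "store buffering" litmus test. Below is a hedged C11 sketch (names and iteration count are invented for illustration) that runs the two halves on separate threads with relaxed atomics; the outcome r1 == 0 && r2 == 0, impossible under sequential consistency, is permitted here and can often be observed on real hardware.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Store-buffering litmus test: W(X); R(Y) on one thread, W(Y); R(X) on
 * the other. Relaxed atomics permit the reordering that breaks Dekker. */
atomic_int X, Y;
int r1, r2;

void *p1(void *arg) {
    (void)arg;
    atomic_store_explicit(&X, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&Y, memory_order_relaxed);
    return NULL;
}

void *p2(void *arg) {
    (void)arg;
    atomic_store_explicit(&Y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&X, memory_order_relaxed);
    return NULL;
}

int main(void) {
    for (int i = 0; i < 100000; i++) {
        atomic_store(&X, 0);
        atomic_store(&Y, 0);
        pthread_t t1, t2;
        pthread_create(&t1, NULL, p1, NULL);
        pthread_create(&t2, NULL, p2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        if (r1 == 0 && r2 == 0)
            printf("iteration %d: r1 == 0 && r2 == 0 (reordered!)\n", i);
    }
    return 0;
}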

Another Example What happens if reads and writes can be delayed by the interconnect? -non-uniform memory access time -cache misses -complex interconnects

Non-Uniform Write Delays Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data
Walkthrough: memory starts with Head = 0 and Data = 0. Process 1 issues both writes, but the interconnect delivers Head = 1 to memory before Data = 2000. Process 2 sees Head == 1, exits the loop, and reads Data while memory still holds 0: WRONG DATA! The write Data = 2000 arrives only afterwards.

What Went Wrong? Maybe we need to acknowledge each write before proceeding to the next?

Write Acknowledgement? But what about reordering of reads? -Non-Blocking Reads -Lockup-free Caches -Speculative execution -Dynamic scheduling... all allow execution to proceed past a read Acknowledging writes may not help!

General Interconnect Delays Process 1:: Data = 2000; Head = 1; Process 2:: While (Head == 0) {;} LocalValue = Data
Walkthrough: this time Process 2’s read of Data is issued early, before the loop’s read of Head, and returns the old value 0. The writes Data = 2000 and Head = 1 then reach memory, Process 2 sees Head == 1 and leaves the loop, but LocalValue already holds the stale 0: WRONG DATA! Acknowledging writes would not have helped, because it was the reads that were reordered.
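This Data/Head pattern is the "message passing" litmus test. A hedged C11 sketch (illustrative names, not from the slides): with relaxed atomics, seeing head == 1 does not guarantee seeing data == 2000, which is exactly the failure in the two walkthroughs above. On strongly ordered hardware it may be hard to observe, but the language model permits it.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

/* Message-passing litmus test with NO ordering enforced: the consumer
 * may see head == 1 and still read the old value of data.            */
atomic_int data, head;

void *producer(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 2000, memory_order_relaxed);
    atomic_store_explicit(&head, 1, memory_order_relaxed);
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&head, memory_order_relaxed) == 0)
        ;                                       /* spin until "ready" */
    int local = atomic_load_explicit(&data, memory_order_relaxed);
    if (local != 2000)
        printf("WRONG DATA: read %d\n", local); /* permitted outcome  */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_create(&t1, NULL, producer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}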

Generalization of the Problem This algorithm has the form: Process 1: W(X) then W(Y); Process 2: R(Y) then R(X) (X is Data, Y is Head). - The interconnect reorders the reads and writes

Of the 24 possible orderings: correct behavior requires WX < RX and WY < RY, and the program requires WY < RX, which leaves 6 correct orders out of the 24. Write acknowledgement means WX < WY. Does that help? It disallows only 12 of the 24 orderings - still incorrect!

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Sequential Consistency for MPs Why is it surprising that these code examples break on a multi-processor? What ordering property are we assuming (incorrectly!) that multiprocessors support? We are assuming they are sequentially consistent!

Sequential Consistency Sequential consistency requires that the result of any execution be the same as if the memory accesses executed by each processor were kept in order and the accesses among different processors were interleaved arbitrarily. “...appears as if a memory operation executes atomically or instantaneously with respect to other memory operations” (Hennessy and Patterson, 4th ed.)

Understanding Ordering Program Order Compiled Order Interleaving Order Execution Order

Reordering Writes reach memory, and reads see memory, in an order different than that in the program! -Caused by Processor -Caused by Multiprocessors (and Cache) -Caused by Compilers
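The compiler part is easy to overlook. A small hedged C sketch (names mirror the Data/Head example and are purely illustrative): with ordinary, non-atomic variables the compiler may assume no other thread touches them, so it is free to reorder the stores, keep values in registers, or hoist the load out of the spin loop.

/* Plain, non-atomic, non-volatile shared variables: without any
 * synchronization this is a data race, and the compiler may legally
 * transform the code as described in the comments below.            */
int data, head;

void producer(void) {
    data = 2000;        /* may be emitted after the store to head,    */
    head = 1;           /* or both may sit in registers for a while   */
}

void consumer(void) {
    while (head == 0)
        ;               /* may be compiled as: if (head == 0) loop forever */
    int local = data;   /* may even be loaded before the loop         */
    (void)local;
}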

What Are the Choices? If we want our results to be the same as those of a sequentially consistent model, do we: - enforce sequential consistency at the memory level? - use coherent (consistent) caches? - or what?

Enforce Sequential Consistency? Removes virtually all optimizations Too slow!

What Are the Choices? If we want our results to be the same as those of a sequentially consistent model, do we: - enforce sequential consistency at the memory level? - use coherent (consistent) caches? - or what?

Cache Coherence Multiple processors have a consistent view of memory (e.g. via the MESI protocol). But this does not say when a processor must see a value updated by another processor. Cache coherence does not guarantee sequential consistency! Example: a write-through cache acts just like a write buffer with bypass.

What Are the Choices? If we want our results to be the same as those of a sequentially consistent model, do we: - enforce sequential consistency at the memory level? - use coherent (consistent) caches? - or what?

Involve the Programmer Someone’s got to tell your CPU about concurrency! Use memory barrier / fence instructions when order really matters!

Memory Barrier Instructions A way to prevent reordering - also known as a safety net - they require the preceding memory operations to complete before further memory operations are performed on that CPU Not cheap, but perhaps not often needed? - they must be placed by the programmer - the processor’s memory consistency model tells you what reordering is possible

Using Memory Barriers Process 1:: Flag1 = 1 >>Mem_Bar<< If (Flag2 == 0) critical section Process 2:: Flag2 = 1 >>Mem_Bar<< If (Flag1 == 0) critical section In the WX/WY/RX/RY notation: Process 1 is WX, >>Fence<<, RY, so the fence enforces WX < RY; Process 2 is WY, >>Fence<<, RX, so the fence enforces WY < RX.

There are 4! = 24 possible orderings. If either WX < RX or WY < RY, the critical section is protected (correct behavior): 18 of the 24 orderings are OK, but the other 6 are trouble. Enforcing WX < RY and WY < RX allows only 6 of the 18 good orderings - but all 6 bad ones are forbidden!
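One way to express the >>Mem_Bar<< above in portable code is a C11 sequentially consistent fence between the flag store and the flag load. This is a hedged sketch with illustrative names; atomic_thread_fence(memory_order_seq_cst) is the standard C11 full fence, and smp_mb() plays the corresponding role in Linux kernel code.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int flag1, flag2;

/* The seq_cst fence plays the role of >>Mem_Bar<<: it prevents the store
 * to this process's flag from being reordered after the load of the
 * other process's flag.                                                 */
void *process1(void *arg) {
    (void)arg;
    atomic_store_explicit(&flag1, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);          /* >>Mem_Bar<< */
    if (atomic_load_explicit(&flag2, memory_order_relaxed) == 0)
        puts("P1 in critical section");                 /* WX < RY     */
    return NULL;
}

void *process2(void *arg) {
    (void)arg;
    atomic_store_explicit(&flag2, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);          /* >>Mem_Bar<< */
    if (atomic_load_explicit(&flag1, memory_order_relaxed) == 0)
        puts("P2 in critical section");                 /* WY < RX     */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, process1, NULL);
    pthread_create(&t2, NULL, process2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;       /* at most one of the two messages can be printed */
}

Making the stores and loads memory_order_seq_cst themselves would also work; the explicit fence form is used here only because it mirrors the slide's >>Mem_Bar<<.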

Example 2 Process 1:: Data = 2000; >>Mem_Bar<< Head = 1; Process 2:: While (Head == 0) {;} >>Mem_Bar<< LocalValue = Data In the WX/WY/RX/RY notation: Process 1 is WX, >>Fence<<, WY, so the fence enforces WX < WY; Process 2 is RY, >>Fence<<, RX, so the fence enforces RY < RX.

Correct behavior requires WX < RX and WY < RY; the program requires WY < RX, leaving 6 correct orders out of the 24. We can require WX < WY and RY < RX. Is that enough? The loop only exits after the write to Head is seen, so WY < RY; thus WX < WY < RY < RX, hence WX < RX and WY < RY. Only 2 of the 6 good orderings are allowed - but all 18 incorrect orderings are forbidden.
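The same fix for the Data/Head example, as a hedged C11 sketch (illustrative names): a release fence keeps the data store before the head store, and an acquire fence keeps the head load before the data load, which is exactly the WX < WY and RY < RX requirement from the slide.

#include <stdatomic.h>
#include <pthread.h>
#include <stdio.h>

atomic_int data, head;

void *process1(void *arg) {
    (void)arg;
    atomic_store_explicit(&data, 2000, memory_order_relaxed);
    atomic_thread_fence(memory_order_release);       /* enforces WX < WY */
    atomic_store_explicit(&head, 1, memory_order_relaxed);
    return NULL;
}

void *process2(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&head, memory_order_relaxed) == 0)
        ;                                            /* spin             */
    atomic_thread_fence(memory_order_acquire);       /* enforces RY < RX */
    int local = atomic_load_explicit(&data, memory_order_relaxed);
    printf("LocalValue = %d\n", local);              /* always 2000      */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t2, NULL, process2, NULL);
    pthread_create(&t1, NULL, process1, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}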

Memory Consistency Models Every CPU architecture has one! - It defines which reorderings of memory operations that CPU may perform The CPU's instruction set contains memory barrier instructions of various kinds - These can be used to constrain reordering where necessary - The programmer must understand both the memory consistency model and the memory barrier instruction semantics!

Memory Consistency Models

Code Portability? Linux provides a carefully chosen set of memory-barrier primitives, as follows: -smp_mb(): “memory barrier” that orders both loads and stores. This means loads and stores preceding the memory barrier are committed to memory before any loads and stores following the memory barrier. -smp_rmb(): “read memory barrier” that orders only loads. -smp_wmb(): “write memory barrier” that orders only stores.
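As a usage sketch (a kernel-style fragment, not a standalone program; the variable names are illustrative), the Data/Head example from earlier would pair smp_wmb() in the writer with smp_rmb() in the reader:

/* Kernel-style fragment: the Data/Head pattern written with the Linux
 * barrier primitives named above. Production kernel code would also use
 * WRITE_ONCE()/READ_ONCE() for the shared accesses.                     */
int data;
int head;

void producer(void)
{
        data = 2000;
        smp_wmb();      /* the store to data is visible before head = 1 */
        head = 1;
}

int consumer(void)
{
        while (head == 0)
                ;       /* wait until the producer publishes            */
        smp_rmb();      /* the load of head completes before data       */
        return data;    /* sees 2000                                    */
}

smp_mb() would also work in both places; the weaker barriers suffice here because only store-store ordering is needed on the writer side and load-load ordering on the reader side.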

Words of Advice - “The difficult problem is identifying the ordering constraints that are necessary for correctness.” - “...the programmer must still resort to reasoning with low level reordering optimizations to determine whether sufficient orders are enforced.” - “...deep knowledge of each CPU's memory-consistency model can be helpful when debugging, to say nothing of writing architecture-specific code or synchronization primitives.”

Programmer's View - What does a programmer need to do? - How do they know when to do it? - Compilers and libraries can help, but truly concurrent programs still need to use these primitives - Assuming the worst and synchronizing everything results in sequentially consistent behaviour - too slow, but it may be a good way to start
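For instance, guarding every access to the shared variables with one lock (a hedged sketch; the lock and variable names are invented) makes the earlier Data/Head example correct without any explicit barriers, because the lock and unlock operations themselves provide the necessary ordering - at the cost of serializing all the accesses.

#include <pthread.h>
#include <stdio.h>

/* "Synchronize everything": every access to the shared state happens
 * under the same mutex, so no explicit memory barriers are needed.   */
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
int data;
int head;

void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&lock);
    data = 2000;
    head = 1;
    pthread_mutex_unlock(&lock);
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        int ready = head;
        int local = data;
        pthread_mutex_unlock(&lock);
        if (ready) {
            printf("LocalValue = %d\n", local);   /* always 2000 */
            return NULL;
        }
    }
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, producer, NULL);
    pthread_create(&t2, NULL, consumer, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}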

Outline Concurrent programming on a uniprocessor The effect of optimizations on a uniprocessor The effect of the same optimizations on a multiprocessor Methods for restoring sequential consistency Conclusion

Conclusion - Parallel programming on a multiprocessor that relaxes the sequentially consistent memory model presents new challenges - Know the memory consistency models of the processors you use - Use barrier (fence) instructions to allow the optimizations while still protecting your code - Only simple examples were shown here; there are others that are much more subtle

References - Shared Memory Consistency Models: A Tutorial, Sarita V. Adve and Kourosh Gharachorloo - Memory Ordering in Modern Microprocessors, Part I, Paul E. McKenney, Linux Journal, June 2005 - Computer Architecture: A Quantitative Approach, Hennessy and Patterson, 4th Ed., 2007