“Shared Memory Consistency Models: A Tutorial”
By Sarita Adve, Kourosh Gharachorloo
WRL Research Report, 1995
Presentation: Vince Schuster

Contents
  Overview
  Uniprocessor Review
  Sequential Consistency
  Relaxed Memory Models
  Program Abstractions
  Conclusions

Overview
  Correct & efficient shmem programs
  Require a precise notion of behavior w.r.t. read (R) and write (W) operations between processors and memory.

Example 1, Figure 1 — Initially, all ptrs = NULL; all ints = 0;

  P1:
    While (there are more tasks) {
      Task = GetFromFreeList();
      Task->Data = ...;
      insert Task in task queue
    }
    Head = head of task queue;

  P2, P3, ..., Pn:
    While (MyTask == null) {
      Begin Critical Section
      if (Head != null) {
        MyTask = Head;
        Head = Head->Next;
      }
      End Critical Section
    }
    ... = MyTask->Data;

Q: What will Data be?  A: Could be old Data.
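A minimal C++11 sketch of the hazard in Example 1 (the std::atomic usage, the Task struct layout, and the release/acquire orderings are my illustration, not code from the paper or the slides): if the store that publishes Head were allowed to become visible before the store to Task->Data, a consumer could observe the new Head but stale Data; the release/acquire pairing below is one way to forbid that.

    #include <atomic>
    #include <thread>
    #include <cassert>

    struct Task { int Data = 0; Task* Next = nullptr; };

    std::atomic<Task*> Head{nullptr};

    void producer(Task* t) {
        t->Data = 42;                                   // initialize the task's payload
        // The release store orders the Data write before the publish of Head;
        // with memory_order_relaxed instead, hardware/compiler reordering could
        // let a consumer see the new Head but the old (zero) Data.
        Head.store(t, std::memory_order_release);
    }

    void consumer() {
        Task* t;
        while ((t = Head.load(std::memory_order_acquire)) == nullptr) { /* spin */ }
        assert(t->Data == 42);                          // guaranteed only by the release/acquire pairing
    }

    int main() {
        Task task;
        std::thread p(producer, &task), c(consumer);
        p.join(); c.join();
    }

Weakening the store and load to memory_order_relaxed reintroduces exactly the "could be old Data" outcome the slide asks about.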

Definitions
  Memory Consistency Model
    Formal specification of memory-system behavior to the programmer.
  Program Order
    The order in which memory operations appear in the program.
  Sequential Consistency (SC): An MP is SC if
    The result of any execution is the same as if the operations of all processors were executed in some single sequential order, and
    The operations of each processor appear in this sequence in the order specified by its program. (Lamport [16])
  Relaxed Memory Consistency Models (RxM)
    An RxM is less restrictive than SC. Valuable for efficient shmem.
    System-centric: the HW/SW mechanisms enabling a memory model.
    Programmer-centric: a memory model specified from the programmer's viewpoint, in terms of observed program behavior.
  Cache Coherence:
    1. A write is eventually made visible to all processors.
    2. Writes to the same location appear serialized (in the same order) to all processors.
    NOTE: not equivalent to Sequential Consistency (SC).

UniProcessor Review
  Only needs to maintain control and data dependences.
  Compiler can perform aggressive optimizations: reg alloc, code motion, value propagation, loop transformations, vectorizing, SW pipelining, prefetching, ...
  A multi-threaded program looks like threads T1, T2, T3, ..., Tn all connected to one Memory.
  All of memory appears to have the same values to every thread on a uniprocessor system. You still have to deal with the normal multi-threaded problems on one processor, but you don't have to deal with issues such as write-buffer effects or cache coherence.
  Conceptually, SC wants this single program memory with a switch that connects processors to memory, plus program order on a per-processor basis.

Seq. Consist. Examples

Dekker's Algorithm (init: all = 0):
  P1:  Flag1 = 1                P2:  Flag2 = 1
       If (Flag2 == 0)               If (Flag1 == 0)
         critical section              critical section
Q: What if Flag1 is set to 1, then Flag2 is set to 1, then both ifs execute? Or the Read of Flag2 bypasses the Write of Flag1?
A: Sequential consistency (program order & a single per-processor sequence) forbids the bypass, so at most one processor enters the critical section.

Write atomicity example (init: all = 0):
  P1:  A = 1
  P2:  If (A == 1) B = 1
  P3:  If (B == 1) reg1 = A
Q: What if P2 gets the new value of A but P3 gets the old value of A?
A: SC requires atomicity of memory operations (all processors see an instantaneous and identical view of each memory operation).

NOTE: A uniprocessor system doesn't have to deal with old values or R/W bypasses.
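The second example above, as a runnable C++ sketch (mapping "an SC machine" onto seq_cst atomics is my analogy, not the slides' code): under SC, if P3 sees B == 1 it must also see A == 1, so B == 1 together with reg1 == 0 is impossible.

    #include <atomic>
    #include <thread>
    #include <cassert>

    std::atomic<int> A{0}, B{0};
    int reg1 = -1;   // -1 means "P3 never entered the if"

    void p1() { A.store(1, std::memory_order_seq_cst); }

    void p2() {
        if (A.load(std::memory_order_seq_cst) == 1)
            B.store(1, std::memory_order_seq_cst);
    }

    void p3() {
        if (B.load(std::memory_order_seq_cst) == 1)
            reg1 = A.load(std::memory_order_seq_cst);
    }

    int main() {
        std::thread t1(p1), t2(p2), t3(p3);
        t1.join(); t2.join(); t3.join();
        // Under SC (seq_cst), B == 1 implies P2 saw A == 1, and P3 seeing B == 1
        // means its later read of A must also see 1, so reg1 is -1 or 1, never 0.
        assert(reg1 != 0);
    }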

Architectures
  Will visit:
  Architectures w/o Cache
    Write Buffers w/ Bypass Capability
    Overlapping Write Operations
    Non-Blocking Read Operations
  Architectures w/ Cache
    Cache Coherence & SC
    Detecting Completion of Write Operations
    Illusion of Write Atomicity

Write Buffer w/ Bypass Capability

(Figure: bus-based memory system without caches; P1 buffers Write Flag1 (t3) and issues Read Flag2 (t1); P2 buffers Write Flag2 (t4) and issues Read Flag1 (t2); memory holds Flag1 = 0, Flag2 = 0.)

  P1 (init: all = 0):           P2 (init: all = 0):
    Flag1 = 1                     Flag2 = 1
    If (Flag2 == 0)               If (Flag1 == 0)
      critical section              critical section

  Bypassing can hide Write latency, but violates Sequential Consistency.
  Q: What happens if the Reads of Flag1 & Flag2 bypass the buffered Writes?
  A: Both enter the critical section.
  NOTE: A write buffer is not a problem for uniprocessor programs.
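A sketch of the same bypass hazard in C++ terms (using relaxed atomics to stand in for the hardware write buffer is my illustration, not the slide's hardware): with relaxed stores and loads, each thread's read may effectively bypass the other thread's still-buffered write, so both r1 and r2 can be 0, i.e. both processors "enter the critical section".

    #include <atomic>
    #include <thread>
    #include <cstdio>

    std::atomic<int> Flag1{0}, Flag2{0};
    int r1, r2;

    void p1() {
        Flag1.store(1, std::memory_order_relaxed);      // may sit in a write buffer
        r1 = Flag2.load(std::memory_order_relaxed);     // read may bypass the buffered write
    }

    void p2() {
        Flag2.store(1, std::memory_order_relaxed);
        r2 = Flag1.load(std::memory_order_relaxed);
    }

    int main() {
        std::thread t1(p1), t2(p2);
        t1.join(); t2.join();
        // r1 == 0 && r2 == 0 is permitted here (both "enter the critical section");
        // under sequential consistency (memory_order_seq_cst) it is not.
        std::printf("r1=%d r2=%d\n", r1, r2);
    }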

Overlapping Writes

(Figure: general interconnection network with distributed memory; P1 issues Write Head (t1) and Write Data (t4); P2 issues Read Head (t2) and Read Data (t3); memory holds Head = 0, Data = 0.)

  P1 (init: all = 0):           P2 (init: all = 0):
    Data = 2000                   While (Head == 0) ;
    Head = 1                      ... = Data

  An interconnection network alleviates the serialization bottleneck of a bus-based design. Also, Writes can be coalesced.
  Q: What happens if the Write of Head bypasses the Write of Data?
  A: The Read of Data returns 0.

Non-Blocking Reads

(Figure: interconnect with distributed memory; the reading processor issues Read Data early (t1) and Read Head later (t4), while the writing processor issues Write Data (t2) then Write Head (t3); memory holds Head = 0, Data = 0.)

  P1 (init: all = 0):           P2 (init: all = 0):
    Data = 2000                   While (Head == 0) ;
    Head = 1                      ... = Data

  Non-blocking Reads are enabled by non-blocking caches, speculative execution, and dynamic scheduling.
  Q: What happens if the Read of Data bypasses the Read of Head?
  A: The Read of Data returns 0.
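A sketch combining the last two slides' hazards using explicit fences (the fence placement is my analogy for the ordering the slides require, not code from the paper): the release fence keeps the Data write from being overtaken by the Head write, and the acquire fence keeps the Data read from being performed before the Head read.

    #include <atomic>
    #include <thread>
    #include <cassert>

    std::atomic<int> Head{0};
    int Data = 0;   // ordinary (non-atomic) data, ordered only by the fences below

    void p1() {
        Data = 2000;
        std::atomic_thread_fence(std::memory_order_release);  // later stores cannot pass earlier stores
        Head.store(1, std::memory_order_relaxed);
    }

    void p2() {
        while (Head.load(std::memory_order_relaxed) == 0) { /* spin */ }
        std::atomic_thread_fence(std::memory_order_acquire);  // later loads cannot pass earlier loads
        assert(Data == 2000);   // without the fences, reading 0 here would be permitted
    }

    int main() {
        std::thread t1(p1), t2(p2);
        t1.join(); t2.join();
    }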

Cache-Coherence & SC
  A write buffer without a cache behaves like a write-through cache:
    Reads can proceed before a Write completes (on other processors).
  Cache Coherence: not equivalent to Sequential Consistency (SC)
    1. A write is eventually made visible to all processors.
    2. Writes to the same location appear serialized (in the same order) to all processors.
    3. The value is propagated by invalidating or updating the cached copy(ies).
  Detecting Completion of a Write Operation
    What if P2 gets the new Head but the old Data?
    Avoided if the invalidate/update completes before the 2nd Write.
    A Write ACK is needed, or at least an Invalidate ACK.

(Figure: write-through caches over memory; P1 issues Write Data and Write Head; P2 issues Read Head (t2) and Read Data (t3); initially Head = 0, Data = 0.)

Illusion of Write Atomicity
  Cache-coherence problems:
    1. The cache-coherence (cc) protocol must propagate a value to all copies.
    2. Detecting Write completion takes multiple operations with multiple replicas.
    3. Hard to create the "illusion of atomicity" with non-atomic writes.

Example (init: A = B = C = 0):
  P1:  A = 1 ; B = 1
  P2:  A = 2 ; C = 1
  P3:  While (B != 1) ; While (C != 1) ; Reg1 = A
  P4:  While (B != 1) ; While (C != 1) ; Reg2 = A

Q: What if P1's & P2's updates reach P3 & P4 in different orders?
A: Reg1 & Reg2 might end up with different values (which violates SC).
Solution: Serialize writes to the same location.
Alternative: Delay updates until the previous update to the same location is ACK'd.
Still not equivalent to Sequential Consistency.
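The standard IRIW (independent reads of independent writes) litmus test makes the same write-atomicity point in runnable C++ (a related textbook example, not the slide's exact figure): with seq_cst operations everywhere, the two readers cannot disagree about the order in which the two writes became visible.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    // Independent Reads of Independent Writes: do two readers agree on the
    // order in which two independent writes became visible?
    std::atomic<int> X{0}, Y{0};
    int r1, r2, r3, r4;

    void w1() { X.store(1, std::memory_order_seq_cst); }
    void w2() { Y.store(1, std::memory_order_seq_cst); }

    void rd1() { r1 = X.load(std::memory_order_seq_cst); r2 = Y.load(std::memory_order_seq_cst); }
    void rd2() { r3 = Y.load(std::memory_order_seq_cst); r4 = X.load(std::memory_order_seq_cst); }

    int main() {
        std::thread a(w1), b(w2), c(rd1), d(rd2);
        a.join(); b.join(); c.join(); d.join();
        // With seq_cst everywhere, r1==1 && r2==0 && r3==1 && r4==0 is forbidden:
        // all threads observe a single total order of the two writes (write atomicity).
        // Weakening the loads to memory_order_acquire permits that outcome on
        // machines whose writes are not atomic (e.g. some POWER implementations).
        std::printf("r1=%d r2=%d r3=%d r4=%d\n", r1, r2, r3, r4);
    }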

Ex2: Illusion of Wr Atomicity

  P1:  A = 1
  P2:  If (A == 1) B = 1
  P3:  If (B == 1) reg1 = A

Q: What if P2 reads the new A before P3 is updated with A, AND P2's update of B reaches P3 before the update of A does, AND P3 reads the new B & the old A?
A: Prohibit any read of the new value until all copies have ACK'd.
Update protocol (2-phase scheme):
  1. Send the update; receive an ACK from each processor.
  2. Updated processors get an ACK-of-all-ACKs.
  (Note: The writing processor can consider the Write complete after step 1.)

Compilers
  Compilers perform many optimizations that reorder memory operations:
    CSE, code motion, register allocation, SW pipelining, vector temporaries, constant propagation, ...
  All are done from a uniprocessor perspective, which violates shmem SC.
    e.g., register-allocating a flag means we would never exit from many of our while loops.
  The compiler needs to know the shmem objects and/or sync points, or must forego many optimizations.
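A C++ sketch of the "would never exit from many of our while loops" point (my illustration, not the slides' code): on a plain int the compiler may legally keep the flag in a register and hoist the load out of the loop, while a std::atomic flag forces a fresh load on every iteration.

    #include <atomic>
    #include <thread>

    int plain_head = 0;                 // ordinary variable: compiler may cache it in a register
    std::atomic<int> atomic_head{0};    // atomic: every iteration below performs a real load

    void spin_broken() {
        // With no write to plain_head visible here, the compiler may hoist the
        // load and turn this into an infinite loop (it is also a data race);
        // this is the "would never exit" case, so we never call it below.
        while (plain_head == 0) { }
    }

    void spin_ok() {
        while (atomic_head.load(std::memory_order_acquire) == 0) { }
    }

    int main() {
        std::thread waiter(spin_ok);
        atomic_head.store(1, std::memory_order_release);   // waiter is guaranteed to observe this
        waiter.join();
    }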

Sequential Consistency Summary
  SC imposes many HW and compiler constraints.
  Requirements:
    1. Completion of each memory op before the next (or the illusion thereof).
    2. Writes to the same location must be serialized (cache-based systems).
    3. Write atomicity (or the illusion thereof).
  HW techniques useful for SC & efficiency:
    Prefetch-exclusive requests for buffered writes (hide delays due to program order; overlap cc invalidations).
    Read rollbacks (due to speculative execution or dynamic scheduling).
    Global shmem data-dependence analysis (Shasha & Snir).
  Relaxed Memory Models (RxM) next.

Relaxed Memory Models
  Characterization (3 categories, 5 specific relaxations):
    1a. Relax Write-to-Read program order (PO)
    1b. Relax Write-to-Write PO
    1c. Relax Read-to-Read & Read-to-Write POs
        (1a-1c apply to operations to different locations)
    2. Read others' Writes early (cache-based systems only)
    3. Read own Write early (most models allow this, and it is usually safe; but what if there are two writers to the same location?)
  Some relaxations can be detected by the programmer, others not.
  Various systems use different fence techniques to provide safety nets: AlphaServer, Cray T3D, SparcCenter, Sequent, IBM 370, PowerPC.

Relaxed Write-to-Read PO
  Relax the constraint of a Write followed by a Read to a different location.
    Reorder Reads w.r.t. previous Writes, with memory disambiguation.
  3 models (IBM 370, TSO, PC) handle it differently; all do it to hide Write latency.
    Only IBM 370 provides a serialization instruction as a safety net between the W & R.
    TSO can use a Read-Modify-Write (RMW) in place of either the Read or the Write.
    PC must use an RMW in place of the Read, since it has less stringent RMW requirements.

Example (a): init all = 0
  P1:  F1 = 1 ; A = 1 ; Rg1 = A ; Rg2 = F2
  P2:  F2 = 1 ; A = 2 ; Rg3 = A ; Rg4 = F1
  Result: Rg1 = 1, Rg3 = 2, Rg2 = Rg4 = 0
  Allowed by TSO & PC, since they let the Read of F2/F1 bypass the preceding Write of F1/F2 on each processor.

Example (b): init all = 0
  P1:  A = 1
  P2:  if (A == 1) B = 1
  P3:  if (B == 1) Rg1 = A
  Result: B = 1, Rg1 = 0
  Allowed by PC, since it lets P2 read the new A while P3 still reads the old A.

Relaxing W2R and W2W
  SPARC Partial Store Order (PSO)
    Writes to different locations from the same processor can be pipelined or overlapped, and are allowed to reach memory or other caches out of program order.
    PSO == TSO w.r.t. letting a processor read its own write early and prohibiting others from reading the value of another processor's write before it is ACK'd by all.

W2R: Dekker's Algorithm (init: all = 0) – PSO still allows non-SC results:
  P1:  Flag1 = 1                P2:  Flag2 = 1
       If (Flag2 == 0)               If (Flag1 == 0)
         critical section              critical section

W2W: PSO's safety net is STBAR (store barrier):
  P1:  Data = 2000              P2:  While (Head == 0) ;
       STBAR  // Write Barrier       ... = Data
       Head = 1
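A C++ sketch of the W2R point (treating a release fence as the analogue of a store barrier like STBAR, and a seq_cst fence as a full barrier, is my analogy, not the SPARC definition): a store barrier between the flag write and the flag read would not restore SC for Dekker's algorithm, but a full fence does.

    #include <atomic>
    #include <thread>
    #include <cstdio>

    std::atomic<int> Flag1{0}, Flag2{0};
    std::atomic<int> in_critical{0};

    void p1() {
        Flag1.store(1, std::memory_order_relaxed);
        // A release fence here (a "store barrier", STBAR-like) would order stores
        // against later stores, but NOT this store against the following load.
        std::atomic_thread_fence(std::memory_order_seq_cst);   // full barrier restores the SC outcome
        if (Flag2.load(std::memory_order_relaxed) == 0)
            in_critical.fetch_add(1);
    }

    void p2() {
        Flag2.store(1, std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_seq_cst);
        if (Flag1.load(std::memory_order_relaxed) == 0)
            in_critical.fetch_add(1);
    }

    int main() {
        std::thread t1(p1), t2(p2);
        t1.join(); t2.join();
        std::printf("entered critical section: %d (at most 1 with the full fences)\n",
                    in_critical.load());
    }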

Relaxing All Program Order
  A Read or Write may be reordered w.r.t. a later Read or Write to a different location.
    Can hide the latency of Reads with either static or dynamic (out-of-order) scheduling (can use speculative execution & non-blocking caches).
  Alpha, SPARC V9 RMO, PowerPC
    SPARC & PPC even allow reordering of reads to the same location.
    All violate SC for the previous codes (let's get dangerous!).
    All allow a processor to read its own write early, but:
      RCpc and PPC also allow a read to get the value of another processor's write early (complex).
  Two categories of parallel-program semantics:
    1. Distinguish memory ops based on type (WO, RCsc, RCpc).
    2. Provide explicit fence capabilities (Alpha, RMO, PPC).

Weak Ordering (WO)
  Two memory-op categories: 1) data ops, 2) sync ops
    Program order is enforced across sync ops.
    The programmer must identify the sync ops (the safety net); the hardware uses a counter (inc/dec of outstanding data ops).
    Data regions between sync ops can be reordered/optimized.
    Writes appear atomic to the programmer.
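A loose C++ sketch of the Weak Ordering idea (mapping "sync op" onto a seq_cst atomic is my analogy, not the model's formal definition): the ordinary data operations between synchronization points may be freely reordered, but every data op before the sync store completes before any op after the matching sync load.

    #include <atomic>
    #include <thread>
    #include <cassert>

    int a = 0, b = 0, c = 0;             // ordinary data: free to be reordered/optimized
    std::atomic<int> sync_flag{0};       // a "synchronization operation" in WO terms

    void writer() {
        a = 1; b = 2; c = 3;                              // data region: order among these doesn't matter
        sync_flag.store(1, std::memory_order_seq_cst);    // sync op: all prior data ops complete first
    }

    void reader() {
        while (sync_flag.load(std::memory_order_seq_cst) == 0) { }  // sync op
        assert(a == 1 && b == 2 && c == 3);               // data ops after the sync see everything before it
    }

    int main() {
        std::thread w(writer), r(reader);
        w.join(); r.join();
    }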

Release Consistency (RCsc/RCpc)

Classification of memory operations:
  shared
    ordinary
    special
      nsync
      sync
        acquire
        release

RCsc: maintains SC among special ops.
  Constraints: acquire → all ; all → release ; special → special
RCpc: maintains Processor Consistency among special ops; the Write → Read program order among special ops is eliminated.
  Constraints: acquire → all ; all → release ; special → special, except special Write → special Read
An RMW may be needed to maintain program order (subtle details).
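A minimal sketch of the acquire/release intuition behind RC, using a C++ spinlock (the lock shape and names are mine, not the paper's formalism): the acquire keeps later accesses from moving above the lock acquisition, and the release keeps earlier accesses from moving below the lock release, mirroring the acquire → all and all → release constraints above.

    #include <atomic>
    #include <thread>
    #include <cassert>

    std::atomic<int> lock_word{0};   // 0 = free, 1 = held
    int shared_counter = 0;          // "ordinary" data protected by the acquire/release pair

    void acquire() {
        int expected;
        do { expected = 0; }                           // acquire: later accesses stay below this point
        while (!lock_word.compare_exchange_weak(expected, 1, std::memory_order_acquire));
    }

    void release() {
        lock_word.store(0, std::memory_order_release); // release: earlier accesses stay above this point
    }

    void worker() {
        for (int i = 0; i < 1000; ++i) { acquire(); ++shared_counter; release(); }
    }

    int main() {
        std::thread t1(worker), t2(worker);
        t1.join(); t2.join();
        assert(shared_counter == 2000);
    }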

Alpha, RMO, PPC
  Alpha: fences provided
    MB – Memory Barrier
    WMB – Write Memory Barrier
    No write-atomicity fence needed.
  SPARC V9 RMO: MEMBAR fence
    Bits select any of R→W, W→W, R→R, W→R ordering.
    No need for an RMW.
    No write-atomicity fence needed.
  PowerPC: SYNC fence
    Similar to MB except:
      R→R to the same location can still be executed out of order (use an RMW).
      Write atomicity may require an RMW.
    Allows a write to be seen early by another processor's read.

Abstractions
  Generally, compiler optimizations can go full bore between sync/special/fence points or sync IDs.
  Some optimizations can be done w.r.t. global shmem objects.
  Programmer-supplied, standardized safety nets.
  "Don't know; assume worst" – a starting method? Over-marking SYNCs is overly conservative.
  Programming-model support:
    doall – no dependences between iterations (HPF/F95 – forall, where)
    SIMD (CUDA) – implied multithreaded access without sync or IF conditions
    Data type – volatile (C/C++)
    Directives – OpenMP: #pragma omp parallel (sync region), shared(A) (data type)
    Library – (e.g., MPI, OpenCL, CUDA)
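A small OpenMP illustration of the directive-based support mentioned above (the array name, size, and the specific pragma clauses are my example, not from the slides): the loop is a doall with no inter-iteration dependences, A is declared shared, and the implicit barrier at the end of the parallel region acts as the sync point after which serial code may safely read A.

    // Compile with OpenMP enabled, e.g. g++ -fopenmp example.cpp
    #include <omp.h>
    #include <cstdio>

    int main() {
        const int N = 8;
        int A[N] = {0};

        #pragma omp parallel for shared(A)   // A is shared; iterations are independent (a "doall")
        for (int i = 0; i < N; ++i)
            A[i] = i * i;

        // The implicit barrier at the end of the parallel region is the sync point:
        // after it, the serial code below may safely read every element of A.
        for (int i = 0; i < N; ++i)
            std::printf("A[%d] = %d\n", i, A[i]);
    }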

Using the HW & Conclusions
  Compilers can:
    Protect memory ranges [low…high].
    Assign data segments to protected page locations.
    Use high-order address bits of VM.
    Use extra opcodes (e.g., GPU sync).
    Modify internal memory-disambiguation methods.
    Perform inter-procedural optimization for shmem.
  Relaxed Memory Consistency Models
    + Put more performance, power & responsibility into the hands of programmers and compilers.
    – Put more performance, power & responsibility into the hands of programmers and compilers.

The End