Lazy Release Consistency for Software Distributed Shared Memory. Pete Keleher, Alan L. Cox, Willy Zwaenepoel. Presented by Nooruddin Shaik.

Lazy Release Consistency Problem: reduce both the number of messages and the amount of data exchanged for remote memory accesses. Importance: these reductions matter most for programs that exhibit false sharing and make extensive use of locks.

Overview Software DSM Release Consistency Eager Release Consistency Lazy Release Consistency Simulations Conclusion Future Work

Software DSM Software DSM is a runtime system that provides the shared address space abstraction across a message-passing cluster of computers. It relies on (user-level) memory management techniques to detect accesses and updates to shared data. Communication overheads are high and the page-sized coherence units are large; sending messages is expensive in software DSM.

Pipelining Remote Memory Accesses in DASH The DASH implementation of RC combats memory latency by pipelining writes to shared memory. The processor stalls only when executing a release; at that point it must wait for all of its previous writes to perform.

Problem with DASH Pipelining writes in DASH increases the number of messages passing through the network. Munin's write-shared protocol, a software implementation of RC, therefore buffers writes until a release instead of pipelining them.

Merging of Remote Memory Updates in Munin At the release, all writes going to the same destination are merged into a single message. Even so, Munin's write-shared protocol may send more messages than a message-passing implementation of the same application.

RC – Formal Definition A system is release consistent if:
- Before an ordinary access is allowed to perform with respect to any other processor, all previous acquires must be performed.
- Before a release is allowed to perform with respect to any other processor, all previous ordinary reads and writes must be performed.
- Special accesses are sequentially consistent with respect to each other.

Eager Release Consistency (based on Munin's write-shared protocol)
- ERC: modifications are propagated at release.
- Invalidate protocol: sends invalidations for all modified pages to the other processors that cache those pages.
- Update protocol: sends a diff of each modified page to the other cachers, where the diffs are merged. Diffs limit the amount of data exchanged.
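The twin-and-diff idea can be sketched as follows. This is our own illustrative Python, not the Munin or TreadMarks code: a "twin" copy of the page is made at the first write, and at release the diff records only the words that changed.

```python
# Illustrative twin/diff sketch (names and structure are ours, not the paper's).

def make_twin(page):
    """Copy the page before the first write so changes can be detected later."""
    return list(page)

def make_diff(twin, page):
    """Record only the words that differ between the twin and the current page."""
    return {i: w for i, (t, w) in enumerate(zip(twin, page)) if t != w}

def apply_diff(page, diff):
    """Merge a diff into a (possibly stale) copy of the page."""
    for i, w in diff.items():
        page[i] = w

page = [0, 0, 0, 0]
twin = make_twin(page)
page[2] = 7                    # a write, detected via a page fault in a real DSM
diff = make_diff(twin, page)   # only one word changed, far smaller than the page
stale = [0, 0, 0, 0]
apply_diff(stale, diff)
assert stale == page
```

The diff is what makes the update protocol's data exchange proportional to the modified words rather than the page size.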

Eager Release Consistency (contd.)
- Acquire: no consistency-related operations; the protocol locates the processor that last executed a release on the same variable.
- Access miss: a message is sent to the directory manager, which forwards the request to the current owner.

Repeated Updates of Cached Copies in Eager RC In the figure above, processors P1 through P4 repeatedly acquire the lock l, write the shared variable x, and then release l. If an update policy is used in conjunction with Munin's write-shared protocol and x is present in all caches, then all of these cached copies are updated at every release.

Lazy Release Consistency Rather than eagerly "syncing up" data at the release, LRC "lazily" waits until the subsequent acquire: propagation of modifications is postponed until the time of an acquire.

Lazy Release Consistency At that time, the acquiring processor determines which modifications it needs to see according to the definition of RC. To do this, LRC uses a representation of the happened-before-1 partial order introduced by Adve and Hill.

happened-before-1 Partial Order Shared memory accesses are partially ordered by happened-before-1, denoted →hb1, defined as follows:
- If a1 and a2 are accesses on the same processor, and a1 occurs before a2 in program order, then a1 →hb1 a2.
- If a1 is a release on processor p1, and a2 is an acquire on the same location on processor p2, and a2 returns the value written by a1, then a1 →hb1 a2.
- If a1 →hb1 a2 and a2 →hb1 a3, then a1 →hb1 a3.
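The three rules above can be checked mechanically. The sketch below is our own illustration (the access names are invented): program-order and release/acquire edges form a graph, and →hb1 is reachability in that graph, which gives transitivity for free.

```python
# Illustrative happened-before-1 check; edge construction and names are ours.

def hb1(edges, a, b):
    """Return True if access a happened-before-1 access b (graph reachability)."""
    seen, stack = set(), [a]
    while stack:
        x = stack.pop()
        for y in edges.get(x, []):
            if y == b:
                return True
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return False

# p1: w1(x) then rel1(l)  [program order]
# p2: acq2(l) then r2(x), where acq2 returns the value released by rel1
edges = {"w1": ["rel1"], "rel1": ["acq2"], "acq2": ["r2"]}
assert hb1(edges, "w1", "r2")        # follows by transitivity through the lock
assert not hb1(edges, "r2", "w1")    # the order is partial, not symmetric
```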

Write Notices A write notice is an indication that a page has been modified in a particular interval; it does not contain the actual modifications. In LRC, consistency is guaranteed by propagating write notices.

Write Notice Propagation
- The execution of each processor is divided into intervals.
- An interval is performed at a processor when all modifications made during that interval have been performed at that processor.
- Vp(i): vector timestamp for interval i of processor p. The number of entries in Vp(i) equals the number of processors; the entry for p in Vp(i) is i; the entry for q in Vp(i) is the most recent interval of q performed at p.

Write Notice Propagation (example) Timeline: P1: w(x), rel; P2: acq, w(x), rel; P3: acq, r(x); intervals ip1..ip4 on processors P1..P4. Resulting vector timestamps:
Vp1(ip1) = {ip1, 0, 0, 0}
Vp2(ip2) = {ip1, ip2, 0, 0}
Vp3(ip3) = {0, ip2, ip3, 0}
Vp4(ip4) = {0, 0, ip3, ip4}
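A minimal vector-timestamp sketch, assuming the common realization in which a processor bumps its own entry when it starts an interval and folds in the releaser's vector component-wise at an acquire. This is our simplification for illustration, not the paper's exact propagation rule.

```python
# Illustrative vector timestamps for 4 processors; structure is assumed.

N = 4  # number of processors

def new_interval(V, p, i):
    """Processor p starts interval i: set its own entry."""
    V = list(V)
    V[p] = i
    return V

def merge_on_acquire(V_local, V_releaser):
    """Component-wise max: fold the releaser's knowledge into the acquirer."""
    return [max(a, b) for a, b in zip(V_local, V_releaser)]

V1 = new_interval([0] * N, 0, 1)     # P1 runs interval 1: [1, 0, 0, 0]
V2 = merge_on_acquire([0] * N, V1)   # P2 acquires the lock released by P1
V2 = new_interval(V2, 1, 1)          # P2 runs interval 1: [1, 1, 0, 0]
assert V2 == [1, 1, 0, 0]
```

Comparing vectors entry by entry tells the acquirer exactly which intervals (and hence which write notices) it is missing.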

Data Movement Protocols
- Multiple-writer protocol: allows multiple processors to write to different parts of the same page concurrently.
- False sharing: occurs when two or more processors access different variables within a page, with at least one of the accesses being a write; it generates a large amount of message traffic.
- DASH uses exclusive-writer protocols; LRC uses a multiple-writer protocol, which allows writes to falsely shared pages, merges the modifications using diffs, and reduces message traffic.

Invalidate vs. Update
- Invalidate: the acquiring processor invalidates all pages in its cache for which it receives write notices.
- Update: the acquiring processor updates the pages for which it receives write notices. Diffs must be obtained from all concurrent modifiers: for interval i, diffs must be obtained from all intervals j such that j →hb1 i and there exists no interval k with j →hb1 k →hb1 i in which the modifications from j were overwritten.
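The invalidate choice is the simpler of the two and can be sketched in a few lines. This is a hedged illustration with invented names, not the TreadMarks implementation: at an acquire, each incoming write notice marks the corresponding cached page invalid, so the diffs are fetched only later, on an actual access miss.

```python
# Illustrative lazy-invalidate step; cache representation and names are ours.

VALID, INVALID = "valid", "invalid"

def apply_write_notices(cache, notices):
    """At an acquire, invalidate every locally cached page named in a notice."""
    for page_id in notices:
        if page_id in cache:
            cache[page_id] = INVALID

cache = {0: VALID, 1: VALID, 2: VALID}
apply_write_notices(cache, [1, 2, 5])   # page 5 is not cached here, so no-op
assert cache == {0: VALID, 1: INVALID, 2: INVALID}
```

The update alternative would instead fetch and apply the diffs for pages 1 and 2 at the acquire itself, trading extra data movement for fewer later misses.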

Eager Invalidate vs. Lazy Invalidate

Eager Update vs. Lazy Update

Access Misses
- A copy of the page, as well as a number of diffs, may have to be retrieved; the modifications summarized by the diffs are merged before the access.
- On an access miss at interval i, diffs must be obtained from all intervals j such that j →hb1 i and there exists no interval k such that j →hb1 k →hb1 i.
- If the processor has an invalidated copy of the page, the write notices contain all the information needed to determine which diffs must be applied, which reduces the amount of data sent.
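Servicing a miss then amounts to replaying the missing diffs over the stale copy in →hb1 order, so that a later interval's write to the same word wins. The sketch below is our simplification: it assumes the caller has already collected and topologically ordered the relevant diffs.

```python
# Illustrative miss handling; diff format matches the earlier sketch (ours).

def service_miss(page, diffs_in_hb1_order):
    """Apply missing diffs to a stale page copy, earliest interval first."""
    for diff in diffs_in_hb1_order:
        for offset, word in diff.items():
            page[offset] = word
    return page

stale = [0, 0, 0, 0]
# diff from interval j1, then a later interval j2 that overwrote word 1
result = service_miss(stale, [{1: 5, 3: 9}, {1: 8}])
assert result == [0, 8, 0, 9]   # j2's write to word 1 supersedes j1's
```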

Simulation The simulation study was based on multiprocessor traces of five shared-memory application programs from the SPLASH suite. It measured the number of messages and the amount of data exchanged by each program under four proposed protocols: Lazy Update (LU), Lazy Invalidate (LI), Eager Update (EU), and Eager Invalidate (EI). Methodology: a trace was generated from a 32-processor execution of each program using the Tango multiprocessor simulator. These traces were then fed into the protocol simulator, with simulated page sizes from 512 to 892 bytes. The study assumed infinite caches and reliable FIFO communication channels, but no broadcast or multicast capability in the network.

Shared Memory Operation Message Costs
M = number of concurrent last modifiers for the missing page
H = number of other concurrent last modifiers for any local page
C = number of other cachers of the page
N = number of processors in the system
P = number of pages in the system
U = Σ (i = 1..n) of (number of other cachers of pages modified by i)
V = Σ (i = 1..n) of (number of excess invalidations of page i)

SPLASH Program Suite Five SPLASH benchmark programs were used to simulate the four protocols (LU, LI, EU, EI):
- LocusRoute
- Cholesky factorization
- MP3D
- Water
- Pthor
For each program, the authors compared the data and messages exchanged under the four protocols and recorded the results.

LocusRoute: synchronization is dominated by locks.

Cholesky Factorization: synchronization is dominated by locks.

Pthor: synchronization is dominated by locks.

MP3D: synchronization is dominated by barriers.

Water: synchronization is dominated by barriers.

Eager vs. Lazy

Conclusion The performance of software DSM is sensitive to the number of messages and the amount of data exchanged to provide the shared memory abstraction. LRC aims to reduce both by allowing changes to propagate lazily, only when needed.

Future Work An expected next step is an implementation of lazy release consistency to measure the runtime cost of the algorithm.

References www-cse.ucsd.edu/classes/fa99/cse221/OSSurveyF99/papers