Distributed Shared Memory: Ivy Brad Karp UCL Computer Science CS GZ03 / M030 14th, 16th October, 2008.


Distributed Shared Memory: Ivy Brad Karp UCL Computer Science CS GZ03 / M030 14th, 16th October, 2008

2 Increasing Transparency: From RPC to Shared Memory In RPC, we’ve seen one way to split application across multiple nodes –Carefully specify interface between nodes –Explicitly communicate between nodes –Transparent to programmer? Can we hide all inter-node communication from programmer, and improve transparency? –Today’s topic: Distributed Shared Memory

3 Ivy: Distributed Shared Memory Supercomputer: super-expensive 100-CPU machine, custom-built hardware Ivy: 100 cheap PCs and a LAN (all off-the-shelf hardware!) Both offer same easy view for programmer: – single, shared memory, visible to all CPUs

4 Distributed Shared Memory: Problem An application has a shared address space; all memory locations accessible to all instructions Divide code for application into pieces, assign one piece to each of several computers on a LAN Each computer has own separate memory Each piece of code may want to read or write any part of data Where do you put the data?

5 Distributed Shared Memory: Solution Goal: create illusion that all boxes share single memory, accessible by all Shared memory contents divided across nodes –Programmer needn’t explicitly communicate among nodes –Pool memory of all nodes into one shared memory Performance? Correctness? –Far slower to read/write across LAN than read/write from/to local (same host’s) memory –Remember NFS: caching should help –Remember NFS: caching complicates consistency!

6 Context: Parallel Computation Still need to divide program code across multiple CPUs Potential benefit: more total CPU power, so faster execution Potential risk: how will we know if distributed program executes correctly? To understand distributed shared memory, must understand what “correct” execution means…

7 Simple Case: Uniprocessor Correctness When you only have one processor, what does “correct” mean? Define “correct” separately for each instruction Each instruction takes machine from one state to another (e.g., ADD, LD, ST) –LD should return value of most recent ST to same memory address “Correct” means: Execution gives same result as if you ran one instruction at a time, waiting for each to complete

8 Why Define Correctness? Programmers want to be able to predict how CPU executes program! –…to write correct program Note that modern CPUs don’t execute instructions one-at-a-time in program order –Multiple instruction issue –Out-of-order instruction issue Nevertheless, CPUs must behave such that they obey uniprocessor correctness!

9 Distributed Correctness: Naïve Shared Memory Suppose we have multiple hosts with (for now) naïve shared memory –3 hosts, each with one CPU, connected by Internet –Each host has local copy of all memory –Reads local, so very fast –Writes sent to other hosts (and execution continues immediately) Is naïve shared memory correct?

10 Example 1: Mutual Exclusion Initialization: x = y = 0 on both CPUs CPU0: x = 1; if (y == 0) critical section; CPU1: y = 1; if (x == 0) critical section; Why is code correct? –If CPU0 sees y == 0, CPU1 can’t have executed “y = 1” –So CPU1 will see x == 1, and can’t enter critical section –So CPU0 and CPU1 cannot simultaneously enter critical section
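
The snippet above, written out as two POSIX threads, makes the argument concrete. This is an illustrative sketch, not code from the lecture: critical_section() and the thread scaffolding are invented for the example, and the reasoning only holds if the memory system is sequentially consistent.

    /* Example 1 as two threads sharing x and y; correct only on a
     * sequentially consistent memory. critical_section() is a stand-in. */
    #include <pthread.h>
    #include <stdio.h>

    int x = 0, y = 0;                       /* initialization: x = y = 0 */

    void critical_section(const char *who) { printf("%s entered\n", who); }

    void *cpu0(void *arg) {
        x = 1;
        if (y == 0)
            critical_section("CPU0");
        return NULL;
    }

    void *cpu1(void *arg) {
        y = 1;
        if (x == 0)
            critical_section("CPU1");
        return NULL;
    }

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, cpu0, NULL);
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        return 0;
    }

Under sequential consistency at most one line is ever printed; under naïve distributed memory (or a relaxed hardware model) both threads can enter.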

11 Naïve Distributed Memory: Incorrect for Example 1 Problem A: –CPU0 sends “write x=1”, reads local “y == 0” –CPU1 reads local “x == 0” before write arrives Local memory and slow writes cause disagreement about read/write order! –CPU0 thinks its “x = 1” was before CPU1’s read of x –CPU1 thinks its read of x was before arrival of “write x = 1” Both CPU0 and CPU1 enter critical section!

12 Example 2: Data Dependencies CPU0: v0 = f0(); done0 = true; CPU1: while (done0 == false) ; v1 = f1(v0); done1 = true; CPU2: while (done1 == false) ; v2 = f2(v0, v1); Intent: CPU2 should run f2() with results from CPU0 and CPU1 Waiting for CPU1 implies waiting for CPU0
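
A minimal sketch of Example 2 as three threads, again illustrative rather than lecture code: f0/f1/f2 are placeholder computations, and volatile only keeps the compiler from deleting the spin loops; it does not restore write ordering.

    #include <pthread.h>

    volatile int done0 = 0, done1 = 0;
    int v0, v1, v2;

    int f0(void)         { return 1; }          /* placeholder work */
    int f1(int a)        { return a + 1; }
    int f2(int a, int b) { return a + b; }

    void *cpu0(void *arg) { v0 = f0(); done0 = 1; return NULL; }
    void *cpu1(void *arg) { while (!done0) ; v1 = f1(v0); done1 = 1; return NULL; }
    void *cpu2(void *arg) { while (!done1) ; v2 = f2(v0, v1); return NULL; }

    int main(void) {
        pthread_t t[3];
        pthread_create(&t[0], NULL, cpu0, NULL);
        pthread_create(&t[1], NULL, cpu1, NULL);
        pthread_create(&t[2], NULL, cpu2, NULL);
        for (int i = 0; i < 3; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

If the memory system reorders CPU0’s writes (Problem B), cpu1 can read a stale v0 even after seeing done0 == 1.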

13 Naïve Distributed Memory: Incorrect for Example 2 Problem B: –CPU0’s writes of v0 and done0 may be reordered by network, leaving v0 unset, but done0 true But even if each CPU sees each other CPU’s writes in issue order… Problem C: –CPU2 sees CPU1’s writes before CPU0’s writes –i.e., CPU2 and CPU1 disagree on order of CPU0’s and CPU1’s writes Naïve distributed memory isn’t correct (Or we shouldn’t expect code like these examples to work…)

14 Distributed Correctness: Consistency Models How can we write correct distributed programs with shared storage? Need to define rules that memory system will follow Need to write programs with these rules in mind Rules are a consistency model Build memory system to obey model; programs that assume model then correct

15 How Do We Choose a Consistency Model? No such thing as “right” or “wrong” model –All models are artificial definitions Different models may be harder or easier to program for –Some models produce behavior that is more intuitive than others Different models may be harder or easier to implement efficiently –Performance vs. semantics trade-off, as with NFS/RPC

16 Back to Ivy: What’s It Good For? Suppose you’ve got 100 PCs on a LAN and shared memory across all of them Fast, parallel sorting program: Load entire array into shared memory Each PC processes one section of array On PC i: sort own piece of array done[i] = true; wait for all done[] to be true merge my piece of array with my neighbors’…
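
As a sketch of what node i’s code might look like on Ivy (not the lecture’s actual program; NPCS, SECTION, array[], and done[] are assumed names), the point is that the only “communication” is plain loads and stores to the shared arrays:

    #include <stdlib.h>

    #define NPCS    4          /* number of PCs; illustrative           */
    #define SECTION 10000      /* elements per PC's piece; illustrative */

    int array[NPCS * SECTION]; /* lives in DSM, visible to every node   */
    int done[NPCS];            /* completion flags, also in DSM         */

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* What PC `me` runs: sort its own piece, publish done[me], spin
     * until every PC is done, then merge with its neighbors. */
    void sort_phase(int me) {
        qsort(&array[me * SECTION], SECTION, sizeof(int), cmp_int);
        done[me] = 1;

        for (int i = 0; i < NPCS; i++)
            while (!done[i])
                ;              /* these reads are served by the DSM layer */

        /* merge own piece with neighbors' pieces (elided on the slide) */
    }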

17 Partitioning Address Space: Fixed Approach Fixed approach: –First MB on host 0, 2nd on host 1, &c. –Send all reads and writes to “owner” of address –Each CPU read- and write-protects pages in address ranges held by other CPUs Detect reads and writes to remote pages with VM hardware What if we placed pages on hosts poorly? Can’t always predict which hosts will use which pages
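
The fixed placement rule amounts to a pure function from address to home host; a minimal sketch (MB-sized chunks and NUM_HOSTS are assumptions drawn from the slide’s 100-PC setting):

    #include <stdint.h>

    #define MB        (1UL << 20)
    #define NUM_HOSTS 100

    /* Home host of a shared address under the fixed scheme:
     * first MB on host 0, second MB on host 1, and so on. */
    static inline int home_of(uintptr_t addr) {
        return (int)((addr / MB) % NUM_HOSTS);
    }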

18 Partitioning Address Space: Dynamic, Single-Copy Approach Move the page to the reading/writing CPU each time it is used CPU trying to read or write must find current owner, then take page from it Requires mechanism to find current location of page What if many CPUs read same page?

19 Partitioning Address Space: Dynamic, Multi-Copy Approach Move page for writes, but allow read-only copies When CPU reads page it doesn’t have in its own local memory, find other CPU that most recently wrote to page Works if pages are read-only and shared or read-write by one host Bad case: write sharing –When does write sharing occur? –False sharing, too…

20 Simple Ivy: Centralized Manager (Section 3.1) [diagram: CPU0, CPU1, and CPU2 (which also hosts the manager, MGR), each with a per-page ptable; the manager additionally keeps a per-page info table] –ptable (all CPUs): lock; access: R, W, or nil; owner: T or F –info (MGR only): lock; copy_set: list of CPUs with read-only copies; owner: CPU that can write page
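
In C, the per-page state in the diagram might look roughly like this; a sketch of the data structures only, not Ivy’s actual source (MAX_CPUS and the boolean copy_set array are assumptions):

    #include <pthread.h>
    #include <stdbool.h>

    #define MAX_CPUS 100

    enum access { ACCESS_NIL, ACCESS_READ, ACCESS_WRITE };

    /* ptable entry: one per page, kept on every CPU */
    struct ptable_entry {
        pthread_mutex_t lock;    /* serializes fault handling for this page */
        enum access     access;  /* R, W, or nil                            */
        bool            owner;   /* true if this CPU currently owns page    */
    };

    /* info entry: one per page, kept only on the manager */
    struct info_entry {
        pthread_mutex_t lock;                /* serializes manager actions  */
        bool            copy_set[MAX_CPUS];  /* CPUs with read-only copies  */
        int             owner;               /* CPU that can write the page */
    };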

21 Centralized Manager (2): Message Types Between CPUs RQ (read query, reader to MGR) RF (read forward, MGR to owner) RD (read data, owner to reader) RC (read confirm, reader to MGR) WQ (write query, writer to MGR) IV (invalidate, MGR to copy_set) IC (invalidate confirm, copy_set to MGR) WF (write forward, MGR to owner) WD (write data, owner to writer) WC (write confirm, writer to MGR)
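
A sketch of how these ten message types might be tagged in code; the names simply mirror the slide’s abbreviations:

    enum ivy_msg {
        MSG_RQ,  /* read query:         reader   -> MGR      */
        MSG_RF,  /* read forward:       MGR      -> owner    */
        MSG_RD,  /* read data:          owner    -> reader   */
        MSG_RC,  /* read confirm:       reader   -> MGR      */
        MSG_WQ,  /* write query:        writer   -> MGR      */
        MSG_IV,  /* invalidate:         MGR      -> copy_set */
        MSG_IC,  /* invalidate confirm: copy_set -> MGR      */
        MSG_WF,  /* write forward:      MGR      -> owner    */
        MSG_WD,  /* write data:         owner    -> writer   */
        MSG_WC   /* write confirm:      writer   -> MGR      */
    };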

22–31 Centralized Manager Example 1: Owned by CPU0, CPU1 wants to read [sequence of diagram slides showing the page tables at each step] –CPU1 takes a read fault, locks its ptable entry, and sends RQ to the manager –Manager locks its info entry, adds CPU1 to copy_set, and sends RF to the owner, CPU0 –CPU0 downgrades its access from W to R and sends the page to CPU1 in RD –CPU1 installs the page with access R and sends RC to the manager –Locks are released; final state: owner CPU0, copy_set {CPU1}, both CPUs hold read-only copies
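
The read-fault path in those slides might be sketched as below, reusing the structures and message enum sketched earlier. This is a control-flow sketch only: send_msg() and wait_for_page_data() are hypothetical messaging helpers, declared but not implemented here, and the manager’s unlock on RC is elided.

    extern struct ptable_entry ptable[];  /* this CPU's per-page table       */
    extern struct info_entry   info[];    /* per-page table on the manager   */

    void send_msg(int to_cpu, enum ivy_msg type, int page);  /* hypothetical */
    void wait_for_page_data(int page);    /* blocks until RD/WD arrives      */

    /* Faulting reader (CPU1 in the example), talking to manager `mgr`. */
    void handle_read_fault(int page, int mgr) {
        pthread_mutex_lock(&ptable[page].lock);
        send_msg(mgr, MSG_RQ, page);          /* RQ: reader -> MGR           */
        wait_for_page_data(page);             /* owner's RD carries the page */
        ptable[page].access = ACCESS_READ;
        send_msg(mgr, MSG_RC, page);          /* RC: reader -> MGR           */
        pthread_mutex_unlock(&ptable[page].lock);
    }

    /* Manager, on receiving RQ for `page` from `reader`. */
    void mgr_handle_rq(int reader, int page) {
        pthread_mutex_lock(&info[page].lock);
        info[page].copy_set[reader] = true;        /* remember the new copy  */
        send_msg(info[page].owner, MSG_RF, page);  /* RF: MGR -> owner       */
        /* owner downgrades to R and sends RD; manager unlocks on RC */
    }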

32–39 Centralized Manager Example 2: Owned by CPU0, CPU2 wants to write [sequence of diagram slides showing the page tables at each step] –CPU2 takes a write fault, locks its ptable entry, and sends WQ to the manager –Manager sends IV to every CPU in copy_set ({CPU1}); CPU1 invalidates its copy (access nil) and replies IC –Manager clears copy_set and sends WF to owner CPU0; CPU0 invalidates its copy and sends the page to CPU2 in WD –CPU2 installs the page with access W and sends WC to the manager –Manager records CPU2 as the new owner; locks are released
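
The write-fault path, manager side, might look roughly like this under the same assumptions (hypothetical helpers, structures from the earlier sketches); the blocking wait_for_msg() stands in for the confirm messages:

    extern struct info_entry info[];       /* per-page table on the manager */

    void send_msg(int to_cpu, enum ivy_msg type, int page);    /* hypothetical */
    void wait_for_msg(int from_cpu, enum ivy_msg type, int page);

    /* Manager, on receiving WQ for `page` from `writer` (CPU2 above). */
    void mgr_handle_wq(int writer, int page) {
        pthread_mutex_lock(&info[page].lock);
        for (int c = 0; c < MAX_CPUS; c++)          /* invalidate all copies */
            if (info[page].copy_set[c]) {
                send_msg(c, MSG_IV, page);          /* IV: MGR -> copy_set   */
                wait_for_msg(c, MSG_IC, page);      /* IC: copy -> MGR       */
                info[page].copy_set[c] = false;
            }
        send_msg(info[page].owner, MSG_WF, page);   /* WF: MGR -> owner      */
        /* old owner invalidates its copy and sends WD (the page) to writer */
        wait_for_msg(writer, MSG_WC, page);         /* WC: writer -> MGR     */
        info[page].owner = writer;                  /* ownership transfers   */
        pthread_mutex_unlock(&info[page].lock);
    }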

40 What if Two CPUs Want to Write to Same Page at Same Time? Write has several steps, modifies multiple tables Invariants for tables: –MGR must agree with CPUs about single owner –MGR must agree with CPUs about copy_set –copy_set != {} must agree with read-only for owner Write operation should thus be atomic! What enforces atomicity?

41 Sequential Consistency: Definition Must exist total order of operations such that: –All CPUs see results consistent with that total order (i.e., LDs see most recent ST in total order) –Each CPU’s instructions appear in order in total order Two rules sufficient to implement sequential consistency [Lamport, 1979]: –Each CPU must execute reads and writes in program order, one at a time –Each memory location must execute reads and writes in arrival order, one at a time
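
To make the definition concrete, the small brute-force check below (an illustration, not from the slides) enumerates every total order of Example 1’s four operations that preserves each CPU’s program order, and confirms that none of them lets both CPUs read 0:

    #include <stdio.h>

    int main(void) {
        /* ops: 0 = CPU0 "x=1", 1 = CPU0 "read y",
         *      2 = CPU1 "y=1", 3 = CPU1 "read x"                          */
        static const int orders[6][4] = {   /* all orders keeping 0<1, 2<3 */
            {0,1,2,3},{0,2,1,3},{0,2,3,1},{2,0,1,3},{2,0,3,1},{2,3,0,1}
        };
        int both_enter = 0;
        for (int i = 0; i < 6; i++) {
            int x = 0, y = 0, r0 = -1, r1 = -1;
            for (int j = 0; j < 4; j++) {
                switch (orders[i][j]) {
                case 0: x = 1;  break;
                case 1: r0 = y; break;      /* CPU0 enters iff r0 == 0 */
                case 2: y = 1;  break;
                case 3: r1 = x; break;      /* CPU1 enters iff r1 == 0 */
                }
            }
            if (r0 == 0 && r1 == 0)
                both_enter = 1;
        }
        printf("both can enter under some SC order? %s\n",
               both_enter ? "yes" : "no");  /* prints "no" */
        return 0;
    }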

42 Ivy and Consistency Models Consider done{0,1,2} example: –v0 = f0(); done0 = true –In Ivy, can other CPU see done == true, but still see old v0? Does Ivy obey sequential consistency? –Yes! –Each CPU does R/W in program order –Each memory location does R/W in arrival order

43 Ivy: Evaluation Experiments include performance of PDE, matrix multiplication, and “block odd-even based merge-split algorithm” How to measure performance? –Speedup: x-axis is number of CPUs used, y-axis is how many times faster the program ran with that many CPUs What’s the best speedup you should ever expect? –Linear When do you expect speedup to be linear?

44 What’s “Block Odd-Even Based Merge-Split Algorithm”? Partition data to be sorted over N CPUs, held in one shared array Sort data in each CPU locally View CPUs as in a line, numbered 0 to N-1 Repeat N times: –Even CPUs send to (higher) odd CPUs –Odd CPUs merge, send lower half back to even CPUs –Odd CPUs send to (higher) even CPUs –Even CPUs merge, send lower half back to odd CPUs “Send” just means “receiver reads from right place in shared memory”
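
A single-process simulation of the block algorithm, to pin down the merge-split step (N, B, and the qsort-based merge are illustrative; on Ivy each “send” is just a read of the neighbor’s block in shared memory):

    #include <stdlib.h>
    #include <string.h>

    #define N 8       /* "CPUs" (blocks); illustrative  */
    #define B 64      /* elements per block; illustrative */

    static int cmp_int(const void *a, const void *b) {
        int x = *(const int *)a, y = *(const int *)b;
        return (x > y) - (x < y);
    }

    /* Merge blocks lo and hi; keep smaller half in lo, larger half in hi. */
    static void merge_split(int *lo, int *hi) {
        int both[2 * B];
        memcpy(both, lo, B * sizeof(int));
        memcpy(both + B, hi, B * sizeof(int));
        qsort(both, 2 * B, sizeof(int), cmp_int);   /* stands in for a merge */
        memcpy(lo, both, B * sizeof(int));
        memcpy(hi, both + B, B * sizeof(int));
    }

    void block_odd_even_sort(int data[N][B]) {
        for (int i = 0; i < N; i++)                 /* each CPU sorts locally */
            qsort(data[i], B, sizeof(int), cmp_int);

        for (int round = 0; round < N; round++) {
            for (int i = 0; i + 1 < N; i += 2)      /* even CPU i with odd i+1 */
                merge_split(data[i], data[i + 1]);
            for (int i = 1; i + 1 < N; i += 2)      /* odd CPU i with even i+1 */
                merge_split(data[i], data[i + 1]);
        }
    }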

45 Ivy’s Speedup PDE and matrix multiplication: linear Sorting: worse than linear, flattens significantly beyond 2 CPUs

46 Ivy vs. RPC When would you prefer DSM to RPC? –More transparent –Easier to program for When would you prefer RPC to DSM? –Isolation –Control over communication –Latency-tolerance –Portability Could Ivy benefit from RPC? –Possibly for efficient blocking/unblocking

47 DSM: Successful Idea? Spreading a computation across workstations? –Yes! Google, Inktomi, Beowulf, … Coherent access to shared memory? –Yes! Multi-CPU PCs use Ivy-like protocols for cache coherence between CPUs DSM as model for programming workstation cluster? –Little evidence of broad adoption –Too little control over communication, and communication dictates performance