Reactive Synchronization Algorithms for Multiprocessors

Reactive Synchronization Algorithms for Multiprocessors
Beng-Hong Lim and Anant Agarwal
Margoob Mohiyuddin

Outline
- Motivation
- Algorithms
  - Reactive spin lock
  - Reactive fetch-and-op
- Results
  - Locks: test-and-test&set vs. queue
  - Fetch-and-op: lock vs. software combining tree
  - Shared memory vs. message passing

Motivation
- Passive algorithms
  - Choice of protocol fixed → optimized for certain contention/concurrency levels
- [Figures: Passive Spin Lock; Passive Fetch-and-Op]

Reactive Algorithm: Components
- Protocol selection algorithm
  - Which protocol to use for a synchronization operation? test&set vs. queuing
  - Low contention → test-and-set
  - High contention → queuing
- Waiting algorithm
  - What do we do when waiting for synchronization delays? Spinning vs. blocking
  - This paper spins
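
As an illustration of the spinning side of the waiting algorithm, here is a minimal sketch of a test-and-test&set lock. It is not the paper's implementation: Python has no raw test&set instruction, so a non-blocking `threading.Lock.acquire` stands in for the atomic primitive, and the names `TTSLock`, `failed_attempts` and `demo` are made up for this example.

```python
import threading
import time


class TTSLock:
    """Test-and-test&set spin lock, emulated on top of threading.Lock."""

    def __init__(self):
        self._flag = threading.Lock()  # stands in for the shared lock word
        self.failed_attempts = 0       # racy diagnostic counter, not exact

    def acquire(self):
        while True:
            # "Test": spin on a plain read until the lock looks free.
            while self._flag.locked():
                time.sleep(0)          # yield so the holder can make progress
            # "Test&set": the non-blocking acquire plays the atomic primitive.
            if self._flag.acquire(blocking=False):
                return
            self.failed_attempts += 1  # lost the race, go back to spinning

    def release(self):
        self._flag.release()


def demo(num_threads=4, iterations=500):
    """Increment a shared counter under the lock and return the final value."""
    lock = TTSLock()
    total = 0

    def worker():
        nonlocal total
        for _ in range(iterations):
            lock.acquire()
            total += 1
            lock.release()

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total
```

The count of failed test&set attempts is the kind of contention signal a protocol selection algorithm could monitor; here it is only collected, not acted on.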

Challenges
- Protocol selection algorithm
  - What is the current protocol?
    - Checked for each synchronization operation → must be efficient
    - Multiple processes may try to synchronize at the same time
  - How to change protocols?
    - Correctly update state
    - Frequent protocol changes → must be efficient
  - When to change protocols?
    - Crossover point is architecture dependent → tuning needed
- Waiting algorithm
  - Local operation
  - Different processes can use different waiting algorithms

Reactive Spin Lock
- Components: mode variable, test-and-test&set lock, MCS queue lock
- What is the current protocol?
  - Mode variable: TTS → test-and-test&set, QUEUE → queue
  - Read-cached and infrequent mode changes → efficient
  - At most one lock is free
  - Protocol change between mode variable access and lock access → processes executing the wrong protocol retry
- How to change protocols?
  - Done by the lock holder
  - TTS→QUEUE: update mode, release queue lock
  - QUEUE→TTS: update mode, signal queue waiters to retry, release test-and-test&set lock
  - Ensures at most one lock is free
- When to change protocols?
  - # failed test-and-test&set attempts > threshold → TTS→QUEUE
  - # consecutive lock acquisitions when queue empty > threshold → QUEUE→TTS
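
The "when to change protocols" rules can be sketched as a small state machine that the lock holder updates on each acquisition. This is an illustrative model only: the class name `ReactiveLockPolicy`, its parameters, and the default thresholds are assumptions, and the real crossover points are architecture dependent and need tuning.

```python
from dataclasses import dataclass

TTS, QUEUE = "TTS", "QUEUE"


@dataclass
class ReactiveLockPolicy:
    tts_fail_limit: int = 3     # hypothetical threshold for TTS -> QUEUE
    empty_queue_limit: int = 4  # hypothetical threshold for QUEUE -> TTS
    mode: str = TTS
    empty_streak: int = 0       # consecutive acquisitions with an empty queue

    def on_acquire(self, failed_tts_attempts=0, queue_was_empty=False):
        """Run by the lock holder; returns the mode later acquirers should use."""
        if self.mode == TTS:
            # Too many failed test-and-test&set attempts signal high contention.
            if failed_tts_attempts > self.tts_fail_limit:
                self.mode = QUEUE
                self.empty_streak = 0
        else:
            # A run of acquisitions that found the queue empty signals low contention.
            self.empty_streak = self.empty_streak + 1 if queue_was_empty else 0
            if self.empty_streak > self.empty_queue_limit:
                self.mode = TTS
                self.empty_streak = 0
        return self.mode
```

Only the decision is modeled here; the mode update itself, the release of the other lock, and the retry signal to queue waiters are left out.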

Generalizing: Required Properties
- Each protocol has a unique consensus object
  - Atomic access
  - Valid or invalid
- A process can complete protocol execution after accessing consensus object
- Protocol state modified only by a process with atomic access to a consensus object
  - Allows processes to safely execute an invalid protocol
- Changing protocol A → B:
  1. Acquire atomic access to B's consensus object
  2. Invalidate A's consensus object
  3. Update state of B to reflect current state of synch. operation
  4. Change mode variable
  5. Validate B's consensus object and release it
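
The five steps for changing from protocol A to protocol B can be walked through in a toy model. Everything named here (`ConsensusObject`, `ProtocolImpl`, `ReactiveObject`, `change_protocol`) is hypothetical, and a `threading.Lock` is assumed as the source of atomic access.

```python
import threading


class ConsensusObject:
    """One per protocol: atomic access (a lock) plus a valid/invalid flag."""

    def __init__(self, valid):
        self.guard = threading.Lock()
        self.valid = valid


class ProtocolImpl:
    def __init__(self, name, valid=False):
        self.name = name
        self.consensus = ConsensusObject(valid)
        self.state = None  # this protocol's view of the synchronization object


class ReactiveObject:
    def __init__(self, a, b, sync_state):
        self.protocols = {a.name: a, b.name: b}
        self.mode = a.name       # mode variable names the valid protocol
        a.state = sync_state

    def change_protocol(self, src_name, dst_name):
        """Caller must already hold atomic access to src's consensus object."""
        src, dst = self.protocols[src_name], self.protocols[dst_name]
        if not (src.consensus.guard.locked() and src.consensus.valid):
            raise RuntimeError("caller does not hold a valid source protocol")
        dst.consensus.guard.acquire()  # 1. acquire atomic access to B
        src.consensus.valid = False    # 2. invalidate A's consensus object
        dst.state = src.state          # 3. bring B's state up to date
        self.mode = dst_name           # 4. change the mode variable
        dst.consensus.valid = True     # 5. validate B's consensus object ...
        dst.consensus.guard.release()  #    ... and release it
```

At no point are both consensus objects valid, so a process that read a stale mode variable finds an invalid object and can retry.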

Reactive Fetch-and-Op
- Protocols: test-and-test&set lock, queue lock, software combining tree
  - Consensus object for the combining tree is the lock for the root
- What is the current protocol?
  - Mode variable
  - At most one protocol's consensus object is valid
  - Invalid lock access → retry
  - Combining tree → backtrack tree traversal and notify waiting processes to retry
- How to change protocol?
  - Lock holder updates state of target protocol to current value of fetch-and-op variable
- When to change protocol?
  - # failed test-and-test&set attempts > threshold → TTS→QUEUE
  - # successive fetch-and-op attempts when {waiting time on queue > time limit} > threshold → QUEUE→combining tree
  - # fetch-and-op requests reaching root without combining with enough fetch-and-op requests > threshold → combining tree→QUEUE
  - # successive fetch-and-op attempts when queue empty > threshold → QUEUE→TTS
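
The four transition rules can be written as one decision function. This is a sketch under assumptions: the counter names (`failed_tts`, `long_queue_waits`, `empty_queue_ops`, `uncombined_at_root`) and the threshold values are invented for illustration, and the order in which the two QUEUE rules are checked is a choice made here, not something the slide specifies.

```python
TTS, QUEUE, TREE = "TTS", "QUEUE", "COMBINING_TREE"


def next_protocol(mode, counters, thresholds):
    """Return the protocol to use next, given the monitored counters."""

    def over(key):
        return counters.get(key, 0) > thresholds[key]

    if mode == TTS and over("failed_tts"):
        return QUEUE   # contention too high for test-and-test&set
    if mode == QUEUE and over("long_queue_waits"):
        return TREE    # queue waits too long, move to combining
    if mode == QUEUE and over("empty_queue_ops"):
        return TTS     # queue keeps being empty, contention is low
    if mode == TREE and over("uncombined_at_root"):
        return QUEUE   # too little combining to pay for the tree
    return mode


# Hypothetical thresholds; the slides do not give concrete values.
EXAMPLE_THRESHOLDS = {
    "failed_tts": 3,
    "long_queue_waits": 4,
    "empty_queue_ops": 4,
    "uncombined_at_root": 4,
}
```

For example, `next_protocol(QUEUE, {"long_queue_waits": 5}, EXAMPLE_THRESHOLDS)` selects the combining tree, while the same call with all counters at zero keeps the queue lock.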

Reactive Message-Passing
- Shared memory vs. message-passing
  - Shared memory on top of message-passing
- test-and-test&set vs. message-passing queue locks
- test-and-test&set lock vs. centralized message-passing vs. software combining tree fetch-and-op

Experimental Setup
- Alewife machine
  - Message passing
  - 2D mesh network
  - LimitLESS cache coherence protocol
  - Atomic fetch-and-store hardware primitive
- Cycle accurate simulator
- 16-node Alewife machine
- Applications
  - Gamteb (9 counters using fetch-and-increment)
  - Traveling Salesman Problem (fetch-and-increment for access to global task queue)
  - Adaptive Quadrature (similar task queue to TSP, but lower contention)
  - MP3D (one lock used for atomic update of collision counts, another lock for atomic update of cell parameters)
  - Cholesky (not important)

Results: Baseline Performance
- Fetch-and-Op (radix-2 combining tree with 64 leaves)
  1. Fetch-and-increment
  2. Delay for 0-500 cycles
  3. Repeat 1-2
- Spin Lock
  1. Acquire lock
  2. Execute 100 cycle critical section
  3. Release lock
  4. Delay for 0-500 cycles
  5. Repeat 1-4
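
The spin-lock baseline loop can be mimicked in a few lines to show its structure. This is only a rough sketch: `busy_wait` is a hypothetical stand-in for a delay in processor cycles (loop iterations, not real cycles), `threading.Lock` stands in for the lock under test, and no timing conclusions should be drawn from it.

```python
import random
import threading


def busy_wait(cycles):
    """Hypothetical stand-in for a delay of `cycles` processor cycles."""
    for _ in range(cycles):
        pass


def spin_lock_baseline(num_threads=4, iterations=50, seed=0):
    """Run the acquire / critical section / release / delay loop on each thread."""
    lock = threading.Lock()  # stand-in for the lock being measured
    entered = 0              # number of critical sections executed

    def worker(rng):
        nonlocal entered
        for _ in range(iterations):
            lock.acquire()                  # 1. acquire lock
            busy_wait(100)                  # 2. 100-"cycle" critical section
            entered += 1
            lock.release()                  # 3. release lock
            busy_wait(rng.randint(0, 500))  # 4. delay for 0-500 "cycles"

    threads = [
        threading.Thread(target=worker, args=(random.Random(seed + i),))
        for i in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return entered
```

The fetch-and-op baseline has the same shape, with steps 1 to 3 replaced by a single fetch-and-increment.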

Results: Applications (Fetch-and-Op)
- Choice of fetch-and-op algorithm important
- Gamteb: Reactive algorithm best for 128 processors due to different protocols for different counters

Results: Applications (Spin Locks)
- Queue lock good for all contention levels
  - Coarse grained computation
- MP3D
  - Queue lock for collision counts update

Results: Reactive Message-Passing (Baseline)

Results: Reactive Message-Passing
- 16 processors
- Normalized w.r.t. queue lock
- Contention levels varying frequently → poor performance
  - Overhead dominates

Discussion