Symmetric Multiprocessors and Performance of Spin Lock Techniques, Based on Anderson’s paper “Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors”


Multiprocessors
Machines with many processors where all memory is accessed through the same mechanism
– there is no distinction between local and non-local memory at the machine level
– processors, caches, memory, and an interconnection network or shared bus
– Shared Memory Symmetric Multiprocessors: all processors are equal and have full access to all resources; main memory is equidistant from all processors

Processors need to communicate and synchronize:
message passing MPs
– communication is explicit, synchronization implicit
shared address space MPs
– synchronization is explicit, communication implicit
– cache coherent (CC): KSR-1, Origin 2000
– non-cache coherent (NCC): BBN Butterfly, Cray T3D/T3E

Caches and Consistency
– a cache with each processor hides the latency of memory accesses
– a miss in the cache => go to memory to fetch the data
– multiple copies of the same address may exist in different caches
– caches need to be kept consistent
(diagram: processors with private caches connected to memory through an interconnection network, ICN)

Cache Coherence: write-invalidate
(diagram: a write X -> X’ in one cache sends an “invalidate” on the bus; the other caches mark X invalid)
Before a write, an “invalidate” command goes out on the bus. All other processors are “snooping” and invalidate their copies. Invalid entries cause read faults, and the new values are fetched from main memory.

Cache Coherence: write-update (also called distributed write)
(diagram: a write X -> X’ in one cache broadcasts an update on the bus; the other caches change X to X’)
As soon as the cache line is written, the new value is broadcast on the bus, so all processors update their caches.

write-invalidate vs. write-update
write-update
– simple
– wins if most written values are read elsewhere
– eats bandwidth
write-invalidate
– complex
– wins if values are often overwritten before being read (e.g., a loop index)
– conserves bandwidth

Paper objectives
– Anderson’s paper discusses some spin lock alternatives
– proposes a new technique based on sequence numbers, i.e., queued lock requests
– analogies are drawn to CSMA networks (carrier-sense multiple access, e.g., Ethernet)

Shared Memory Programming
synchronization
– for coordination of the threads
– atomic ops to read-modify-write a memory location: test-and-set, fetch-and-store(location, value), fetch-and-add(location, increment), compare-and-swap(location, old, new)
communication
– for inter-thread sharing of data
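As a concrete illustration, these read-modify-write primitives can be expressed with C11 atomics. This is a minimal sketch; the wrapper names and argument orders follow the slide, not any particular machine’s instruction set.

#include <stdatomic.h>
#include <stdbool.h>

/* test-and-set: atomically set a flag and return its previous value */
bool test_and_set(atomic_flag *f) { return atomic_flag_test_and_set(f); }

/* fetch-and-store(location, value): swap in value, return the old contents */
int fetch_and_store(atomic_int *loc, int value) { return atomic_exchange(loc, value); }

/* fetch-and-add(location, increment): add atomically, return the old contents */
int fetch_and_add(atomic_int *loc, int increment) { return atomic_fetch_add(loc, increment); }

/* compare-and-swap(location, old, new): store new iff *loc == old, report success */
bool compare_and_swap(atomic_int *loc, int old_v, int new_v) {
    return atomic_compare_exchange_strong(loc, &old_v, new_v);
}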

Why spin locks?
– efficient for brief critical sections (e.g., barriers): context switching and blocking are more expensive
– can be done solely in software, but that is expensive
– based on hardware support for atomic operations
– during an atomic operation, other atomic operations are disabled and caches are invalidated, even if the value is unchanged. why?

Sync. Alg. Objectives
reduce latency
– how quickly can I get the lock in the absence of competition?
reduce waiting time, which delays the application
– the time between the lock being freed and another processor acquiring it
reduce contention / bandwidth consumption
– other processors need to do useful work, and the process holding the lock needs to complete ASAP
– how to design the waiting scheme to reduce interconnection network contention?
Any conflicts here??

Locks in Shared Memory MP
L is a shared memory location:
  lock:   while (test-and-set(L) == locked);
  unlock: L = unlocked;
spin or block if the lock is not available?
– the focus of Anderson’s paper is efficient spin-waiting
spin locks are a polling mechanism: if they poll often, they hurt bandwidth; if rarely, they hurt delay
– (note: with snooping, not exactly polling)

Spin locks
two opposing concerns
– retry to acquire the lock as soon as it is released
– minimize disruptions to processors doing work
spin on test-and-set:
  init:   lock = clear;
  lock:   while (t&s(lock) == busy);   /* spin */
  unlock: lock = clear;
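A minimal C11 rendering of spin on test-and-set, with atomic_flag standing in for the t&s primitive (the function names are illustrative):

#include <stdatomic.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;   /* init: lock = clear */

void ts_lock(void) {
    while (atomic_flag_test_and_set(&lock_word))   /* every attempt is an atomic bus op */
        ;                                          /* spin */
}

void ts_unlock(void) {
    atomic_flag_clear(&lock_word);                 /* lock = clear */
}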

spin on test&set
performance should be terrible:
– t&s always goes to memory (bypasses caches), so it does not exploit caching
– unlocks are queued behind other test&sets
– in bus-based systems, spinners delay the lock holder and other processes
– in network-based systems, spinners create hot spots

spin on read (test-t&s)
  lock: while (lock == busy) spin;
        if (t&s(lock) == busy) goto lock;
– why is this better? exploits caching; goes to the bus only when the lock “looks” free
– problem? parallel contention on lock release: increased network traffic on lock release
  – invalidation: O(N^2)
    » some processors will get the new L and see it’s busy; others will still think it’s clear and go to t&s => invalidate L again
  – update: O(N)
    » all N processors will see L is free and issue t&s
– normal memory accesses in the C.S. are affected
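A C11 sketch of spin-on-read: the relaxed load spins on the cached copy, and the atomic exchange (standing in for t&s) is issued only when the lock looks free. The names are illustrative.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_word;                      /* false = clear, true = busy */

void tts_lock(void) {
    for (;;) {
        while (atomic_load_explicit(&lock_word, memory_order_relaxed))
            ;                                      /* spin on the cached copy */
        if (!atomic_exchange(&lock_word, true))    /* t&s only when it looks free */
            return;                                /* got the lock */
    }
}

void tts_unlock(void) {
    atomic_store(&lock_word, false);
}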

Results from Paper
20-processor Sequent Symmetry: cache-coherent (CC), bus-based, write-invalidate (WI) MP

Other issues
after a lock is released, it creates a “bandwidth storm”
Quiescence: the time for the “b/w storm” to subside
to measure it:
– acquire the lock, idle for time t, generate bus accesses, release the lock
– this should take the same total time whether 1 or many processors are waiting on the lock, but it doesn’t

Results
20-processor Sequent Symmetry: cache-coherent, bus-based, write-invalidate MP

Summary so Far
problem:
– more than one processor sees the lock is free
– more than one processor tries to acquire it
this can be avoided if:
– one processor sees the release
– and does t&s before the others decide that they should do so

Delay Alternatives
delay after lock release (see the sketch below):
  while ((lock == busy) or (t&s(lock) == busy)) {
      /* failed to get lock */
      while (lock == busy) spin;
      delay();   /* don’t check immediately */
  }
– the Pi’s are statically assigned different delay periods
– at most one Pi will try t&s at a time, reducing contention
– when is this good/bad?
– problem? the lock release is noticed immediately, but acquiring is delayed
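A hedged C11 sketch of this static-delay scheme; my_delay (the processor’s statically assigned slot) and the busy-wait in pause_for() are illustrative placeholders.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_word;
extern unsigned my_delay;                 /* statically assigned, different per Pi */

static void pause_for(unsigned n) {
    for (volatile unsigned i = 0; i < n; i++)
        ;                                 /* crude busy-wait */
}

void delay_lock(void) {
    while (atomic_load(&lock_word) || atomic_exchange(&lock_word, true)) {
        /* failed to get the lock */
        while (atomic_load_explicit(&lock_word, memory_order_relaxed))
            ;                             /* spin until the release is noticed */
        pause_for(my_delay);              /* don't retry t&s immediately */
    }
}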

Dynamic Delay
idea: vary the delay based on the contention level
– how do we detect the contention level online?
– like in CSMA: the more collisions you have, the more contention there is

delay on lock release with exponential backoff
– dynamic delay proportional to the contention for the lock
– try t&s right away
– double the delay on each failure
– problem? “collisions” in the spin lock case cost more when more processors are waiting: it takes longer to start spinning on the cache again
– bursty arrival of lock requests leads to a long quiescence time, making static slots better
– each t&s is bad news in combination with an invalidation-based CC protocol
– exponential backoff with write-update is good for scalable spin locks

good heuristics for backoff (see the sketch below)
– bound the maximum mean delay by the number of processors
– the initial delay should be a function of the delay last time, just before the lock was acquired (in Ethernet things start over each time; here a t&s collision is more costly)
– Anderson suggests half of the last delay
– note that if the lock is free, it will be acquired immediately
both static and dynamic delay perform poorly with bursty t&s’s
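A sketch of t&s with exponential backoff following these heuristics; NPROCS, last_delay, and pause_for() are illustrative, with last_delay remembered across acquisitions so the next attempt starts at half of it.

#include <stdatomic.h>
#include <stdbool.h>

#define NPROCS 20                          /* bound mean delay by the processor count */

static atomic_bool lock_word;
static unsigned last_delay = 2;            /* carried over from the last acquisition */

static void pause_for(unsigned n) { for (volatile unsigned i = 0; i < n; i++); }

void backoff_lock(void) {
    unsigned d = last_delay / 2 ? last_delay / 2 : 1;   /* half of the last delay */
    while (atomic_exchange(&lock_word, true)) {         /* try t&s right away */
        pause_for(d);
        if (d < NPROCS)
            d *= 2;                                     /* double the delay on failure */
    }
    last_delay = d;   /* a free lock is still acquired immediately */
}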

delay between each memory reference
  while ((lock == busy) or (t&s(lock) == busy)) delay();
– works for both NCC and CC MPs
– static delay behaves much like delay-after-noticing-a-released-lock
– dynamic delay assignment with backoff may be very bad (the delay grows while merely waiting, not contending)
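For completeness, the same probe loop with a pause between every reference to the lock, which is why it also suits machines without coherent caches; lock_word, my_delay, and pause_for() are the same illustrative placeholders as in the earlier sketches.

#include <stdatomic.h>
#include <stdbool.h>

static atomic_bool lock_word;
extern unsigned my_delay;                  /* static per-processor delay */
extern void pause_for(unsigned n);         /* busy-wait, as sketched earlier */

void ref_delay_lock(void) {
    while (atomic_load(&lock_word) || atomic_exchange(&lock_word, true))
        pause_for(my_delay);               /* delay between each memory reference */
}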

Queuing locks
– a unique ticket for each arriving processor
– get a ticket in the waiting queue
– go into the CS when you have the lock
– signal the next waiter when you are done
– assume a primitive op r&inc(mem) (atomic read-and-increment)
(diagram: a flags array of p entries, 0..p-1, each either has-lock or must-wait; one entry marks the current lock holder; queuelast indexes the tail)

the algorithm (see the C sketch below):
  init:   flags[0] = has-lock; flags[1..p-1] = must-wait;
          queuelast = 0;   /* global variable */
  lock:   myplace = r&inc(queuelast);   /* get ticket */
          while (flags[myplace mod p] == must-wait);   /* spin */
          /* now in C.S. */
          flags[myplace mod p] = must-wait;   /* reset my slot for later reuse */
  unlock: flags[(myplace+1) mod p] = has-lock;
– latency is not great (a test, a read, and a write to acquire the lock; a write to release it). bandwidth/delay?
– only one atomic op per CS
– processors are sequenced
– the spin variable is distinct for each processor (invalidations are reduced)
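A C11 sketch of this array-based queue lock; NPROCS and the padding size are illustrative. Each flag is padded out to its own cache block, as Anderson recommends, so every processor spins on a distinct location.

#include <stdatomic.h>

#define NPROCS 20
enum { MUST_WAIT, HAS_LOCK };

static struct { atomic_int val; char pad[60]; }       /* one flag per cache block */
    flags[NPROCS] = { [0] = { HAS_LOCK } };           /* the rest start as MUST_WAIT */
static atomic_uint queuelast;                         /* global ticket counter */

unsigned queue_lock(void) {
    unsigned myplace = atomic_fetch_add(&queuelast, 1);       /* r&inc: get ticket */
    while (atomic_load(&flags[myplace % NPROCS].val) == MUST_WAIT)
        ;                                                     /* spin on my own flag */
    atomic_store(&flags[myplace % NPROCS].val, MUST_WAIT);    /* reset slot for reuse */
    return myplace;                                           /* pass to queue_unlock */
}

void queue_unlock(unsigned myplace) {
    atomic_store(&flags[(myplace + 1) % NPROCS].val, HAS_LOCK);   /* hand off */
}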

with write-update
– all P’s can spin on a single counter
– on lock release, Pj writes the next sequence number into this counter for notification
– a single bus transaction notifies everyone; the winner is determined locally
with write-invalidate
– each Pj has a flag in a separate cache block
– two bus transactions to notify the next winner
bus-based MP without caches
– the algorithm is no good without delay
MIN-based MP without caches
– no hot spots, due to the distinct spin locations

Problems
not only is it bad for the lock holder to be preempted, but also for each spin-waiting thread that reaches the front of the queue before being scheduled
– why?
– can be addressed by telling a thread it is about to be preempted (remember Psyche?)
what if the architecture doesn’t support r&inc?

Results again
20-processor Sequent Symmetry: cache-coherent, bus-based, write-invalidate MP

queueing has high latency (startup cost)
– r&inc has to be simulated
static delay-after-release is better than backoff at high loads
– why? some collisions are needed for the backoff to kick in
delay after each reference is better than delay after release, since the Sequent uses a WI protocol
queuing is best at high load

Results