Queue Locks and Local Spinning Some Slides based on: The Art of Multiprocessor Programming by Maurice Herlihy & Nir Shavit.

Memory Models
Memory Contention
Communication Contention
Communication Latency
Cache Coherent (CC) memory
Distributed Shared Memory (DSM)

Today: Revisit Mutual Exclusion
Think of performance, not just correctness and progress
Begin to understand how performance depends on our software properly utilizing the multiprocessor machine’s hardware

Remote Access
Remote access is expensive!
Allow spinning only on local variables:
–DSM: spin only on variables in the local memory
–CC: spin only on variables in cache

Basic Spin-Lock
[Figure: threads spin on a single lock in front of the critical section (CS); the holder resets the lock upon exit.]
…lock suffers from contention – no local spinning!
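As a minimal sketch of the basic spin-lock above, assuming a Java setting: the class name TASLock and the use of Thread.yield() inside the loop are illustrative choices, not taken from the slides. Every waiter repeatedly writes the same shared flag, which is why there is no local spinning.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical name: a test-and-set lock in which all waiters hit one shared flag.
class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    void lock() {
        // getAndSet writes the flag on every attempt, so each spin iteration
        // invalidates the copies cached by the other waiting threads.
        while (state.getAndSet(true)) {
            Thread.yield(); // assumption: be polite when threads outnumber cores
        }
    }

    void unlock() {
        state.set(false); // resets the lock upon exit
    }
}
```

A caller wraps the critical section as lock(); try { ... } finally { unlock(); }.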

Idea
Avoid useless invalidations
–By keeping a queue of threads
Each thread
–Notifies next in line
–Without bothering the others

Anderson Queue Lock
[Figure: a flags array holding T F F F F F F F and a shared next counter; an acquiring thread calls getAndIncrement on next to obtain its slot, and the thread whose slot holds T has acquired the lock.]
Good
–Local spinning (CC model)
–Simple, easy to implement
Bad
–One bit per thread: unknown number of threads? Small number of actual contenders?
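The Anderson queue lock can be sketched roughly as follows. The class name AndersonLock, the capacity parameter, and the thread-local slot bookkeeping are assumptions made for this illustration; the slides only show the flags array, the next counter, and getAndIncrement. Padding the slots to separate cache lines is omitted here.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

// Hypothetical name: array-based queue lock, one flag slot per possible contender.
class AndersonLock {
    private final AtomicIntegerArray flags;              // 1 = go ahead, 0 = wait
    private final AtomicInteger next = new AtomicInteger(0);
    private final ThreadLocal<Integer> mySlot = ThreadLocal.withInitial(() -> 0);
    private final int size;

    // capacity must be at least the maximum number of concurrent contenders
    AndersonLock(int capacity) {
        size = capacity;
        flags = new AtomicIntegerArray(capacity);
        flags.set(0, 1);                                  // T F F F ... initially
    }

    void lock() {
        // Take a ticket; floorMod keeps the index non-negative
        // (assumption: capacity is a power of two if the counter may wrap).
        int slot = Math.floorMod(next.getAndIncrement(), size);
        mySlot.set(slot);
        while (flags.get(slot) == 0) {                    // spin on this thread's own slot
            Thread.yield();
        }
    }

    void unlock() {
        int slot = mySlot.get();
        flags.set(slot, 0);                               // reset own slot for reuse
        flags.set((slot + 1) % size, 1);                  // notify only the next in line
    }
}
```

The capacity argument is where the "one bit per thread" cost shows up: the array must be sized for the worst case even if few threads ever contend.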

CLH Lock
FIFO order
Small, constant-size overhead per thread

[Figure sequence: CLH lock walkthrough]
Initially: the lock is idle; tail points to a node holding false.
Green Wants the Lock: Green creates a node holding true and swaps it into tail, receiving the previous tail node, which holds false.
Green Has the Lock: Green's predecessor node holds false, so Green has acquired the lock.
Blue Wants the Lock: Blue creates a node holding true and swaps it into tail, receiving Green's node, which holds true. The nodes form an implicitly linked list. Blue spins on Green's node; actually, it spins on a cached copy.
Green Releases: Green sets its node to false. Bingo! Blue sees false and has acquired the lock.

CLH Queue Lock

Entry section:
  new myNode
  myNode := true
  do
    myPred := tail
  while !CAS(tail, myPred, myNode)
  wait until !myPred

Exit section:
  myNode := false
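One way to render the CLH entry and exit sections in Java is sketched below. The names CLHLock and QNode are hypothetical, and getAndSet (an atomic swap) stands in for the CAS retry loop of the pseudocode; a fresh node is allocated on every acquisition, as in the pseudocode, and left to the garbage collector.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names: each thread spins on the node of its predecessor.
class CLHLock {
    private static final class QNode {
        volatile boolean locked;                          // true = owner holds or wants the lock
    }

    // tail initially points to a node holding false (lock idle)
    private final AtomicReference<QNode> tail = new AtomicReference<>(new QNode());
    private final ThreadLocal<QNode> myNode = new ThreadLocal<>();

    void lock() {
        QNode node = new QNode();                         // new myNode
        node.locked = true;                               // myNode := true
        myNode.set(node);
        QNode pred = tail.getAndSet(node);                // swap replaces the CAS loop
        while (pred.locked) {                             // wait until !myPred
            Thread.yield();                               // assumption: yield while spinning
        }
    }

    void unlock() {
        myNode.get().locked = false;                      // myNode := false
    }
}
```

The queue is never stored explicitly: each thread only remembers the node it received from the swap, which is what makes the list implicit.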

CLH Lock
Good
–Lock release affects successor only
–Small, constant-sized space
Bad
–Not local spinning for DSM model

CLH Lock
Each thread spins on its predecessor’s memory
Could be far away …

MCS Lock
FIFO order
Spin on local memory only
Small, constant-size overhead

[Figure sequence: MCS lock walkthrough]
Initially: the lock is idle.
Acquiring: the first thread allocates a QNode and swaps it into tail.
Acquired: the first thread holds the lock.
Acquiring: a second thread allocates a QNode, swaps it into tail, and waits with its flag set to true while the first thread still holds the lock. The final frame is annotated “Yes!”

MCS Queue Lock

Entry section:
  new myNode
  do
    myPred := tail
  while !CAS(tail, myPred, myNode)
  if myPred != null
    myNode.locked := true
    myPred.next := myNode
    wait until !myNode.locked

Exit section:
  if myNode.next == null
    if CAS(tail, myNode, null) then return
    wait until myNode.next != null
  myNode.next.locked := false
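A rough Java sketch of the MCS entry and exit sections follows. The names MCSLock and QNode are hypothetical, and getAndSet (an atomic swap) is used in place of the CAS retry loop. Note that the waiting thread spins on the locked field of its own node, which is what gives local spinning, and that this is the field the releasing thread clears.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical names: each thread spins on its own node; the queue is linked explicitly.
class MCSLock {
    private static final class QNode {
        volatile boolean locked;
        volatile QNode next;
    }

    private final AtomicReference<QNode> tail = new AtomicReference<>(null);
    private final ThreadLocal<QNode> myNode = new ThreadLocal<>();

    void lock() {
        QNode node = new QNode();                         // new myNode
        myNode.set(node);
        QNode pred = tail.getAndSet(node);                // swap replaces the CAS loop
        if (pred != null) {
            node.locked = true;                           // set before linking, so the releaser never sees a stale flag
            pred.next = node;                             // make this node reachable from the predecessor
            while (node.locked) {                         // spin on local memory only
                Thread.yield();                           // assumption: yield while spinning
            }
        }
    }

    void unlock() {
        QNode node = myNode.get();
        if (node.next == null) {
            if (tail.compareAndSet(node, null)) {
                return;                                   // nobody was waiting
            }
            while (node.next == null) {                   // a successor swapped in but has not linked yet
                Thread.yield();
            }
        }
        node.next.locked = false;                         // hand the lock to the successor
    }
}
```

The extra wait in unlock() covers the window between a successor's swap on tail and its write to next.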

Green Release
[Figure sequence: Green is releasing the lock while another thread has already swapped its node into tail.]
Green: “By looking at the queue, I see another thread is active. I have to wait for that thread to finish.”
The other thread prepares to spin (its flag is set to true) and then spins.
Green sets that flag to false, and the other thread has acquired the lock.

Non-Uniform Memory Architecture (NUMA)
[Figure: NUMA diagram showing memory]
Today, many large-scale modern multiprocessors are NUMA:
–Clusters of processors with shared local memory
–Access by a processor to the memory of its own cluster is two or more times faster than access to remote memory
–Per-cluster cache

Lock Bouncing
[Figure: lock bouncing diagram showing memory]

Hierarchical Locks
Encourage threads with high mutual memory locality to acquire the lock consecutively
Reduce overall cache misses

Hierarchical CLH (HCLH) Lock [Luchangco, Nussbaum and Shavit 2006]
Local queue per cluster
Global queue to enter the critical section
A local queue is added to the global queue with a single CAS

HCLH Lock
First, add the thread to the local queue
If a thread is the first in the local queue, it is responsible for merging into the global queue

[Figure sequence: HCLH lock walkthrough]
Each queue node carries three fields: cid (cluster id), Successor_must_wait, and Tail_when_merged, drawn as cid / true / false.
An acquiring thread swaps its node into its cluster's local tail; further threads in the same cluster do the same, forming the local queue.
Cluster master: sees lock is held, so waits a “combining delay”.
The local queue is then attached at the global tail with a SWAP; the last node of the merged local queue has Tail_when_merged set to TRUE.

References
Spin, Anderson, CLH, MCS Locks: “The Art of Multiprocessor Programming”, Herlihy and Shavit, Chapter 7.
HCLH Lock: “A Hierarchical CLH Queue Lock”, Luchangco, Nussbaum and Shavit, Euro-Par