More on Locks: Case Studies


More on Locks: Case Studies
Topics:
- Case study of two architectures: Xeon and Opteron
- Detailed lock code and cache coherence

Putting it all together
- Background: the architecture of the two test machines
- A more detailed treatment of locks and cache coherence, with code examples and implications for parallel software design in that context

Two case studies: a 48-core AMD Opteron and an 80-core Intel Xeon.

48-core AMD Opteron (diagram): 6 cores per die, each core with its own L1; 8 dies on the motherboard, connected cross-socket to each other and to RAM. The last-level cache (LLC) is NOT shared; cache coherence is directory-based.

80-core Intel Xeon (diagram): 10 cores per die, each core with its own L1, all sharing one last-level cache (LLC); 8 dies on the motherboard, connected cross-socket to each other and to RAM. The LLC is shared; cache coherence is snooping-based.

Interconnect between sockets: cross-socket communication can take 2 hops.

Performance of memory operations

Local caches and memory latencies: memory access to a line cached locally (in cycles). Best case: L1, under 10 cycles. Worst case: RAM, 136-355 cycles.

Latency of remote access: read (cycles). "State" is the MESI state of the cache line in the remote cache. Cross-socket communication is expensive!
- Xeon: loading from the Shared state over two hops is 7.5 times more expensive than within a socket.
- Opteron: cross-socket latency is even larger than a RAM access.
- Opteron: latency is uniform regardless of the cache state, because the protocol is directory-based and the directory is distributed across all LLCs; if you are unlucky, the directory lookup itself is remote and expensive.
- Xeon: a load from the Shared state is much faster than from the Modified or Exclusive states, because a Shared-state read is served from the LLC instead of a remote cache; the Xeon keeps the shared state and ownership information in the LLC.
These computers may be the future.

Latency of remote access: write (cycles). "State" is the MESI state of the cache line in the remote cache. Cross-socket communication is expensive!
- Opteron: a store to a Shared cache line is much more expensive, because its directory-based protocol is incomplete: it does not keep track of the sharers, so the store is equivalent to a broadcast and has to wait for all invalidations to complete.
- Xeon: store latency is similar regardless of the previous cache-line state, thanks to its snooping-based coherence and to the shared state and ownership information kept in the LLC.

Detailed Treatment of Lock-Based Synchronization

Synchronization implementation: hardware support is required to implement synchronization primitives, in the form of atomic instructions. Common examples include test-and-set, compare-and-swap, etc. These are used to implement high-level synchronization primitives such as lock/unlock, semaphores, barriers, and condition variables. We will only discuss test-and-set here.
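
For contrast with test-and-set, here is a C-like sketch (not from the slides; function and parameter names are illustrative) of the semantics of compare-and-swap, the other atomic instruction named above. The hardware executes the whole body as one indivisible step:

    int compare_and_swap(int *addr, int expected, int new_value) {
        /* Everything below happens atomically in hardware. */
        int old = *addr;           /* record the old value           */
        if (old == expected)
            *addr = new_value;     /* swap only if it still matches  */
        return old;                /* caller checks if the swap won  */
    }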

Test-And-Set. The semantics of test-and-set are: record the old value, set the value to TRUE (this is a write!), and return the old value. Hardware executes all of this atomically!
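
The same semantics written as C-like pseudocode (a sketch of the behavior only; real test-and-set is a single atomic instruction, not a function call):

    int test_and_set(int *addr) {
        /* The three steps below execute as one atomic step in hardware. */
        int old = *addr;   /* record the old value               */
        *addr = 1;         /* set the value to TRUE -- a write!  */
        return old;        /* return the old value               */
    }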

Test-And-Set in hardware: the core issues a read-exclusive request (sending invalidations), then modifies the line (changing its state). A memory barrier completes all memory operations before the TAS and cancels all memory operations after the TAS. The whole sequence is atomic!

Using Test-And-Set (courtesy Ding Yuan). Here is our lock implementation with test-and-set:

    struct lock {
        int held = 0;
    };

    void acquire(lock) {
        while (test_and_set(&lock->held))
            ;   // spin until test_and_set observes held == 0
    }

    void release(lock) {
        lock->held = 0;
    }
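
For reference, here is one way to write the same spinlock in standard, compilable C11 (a sketch of ours, not from the slides; the type and function names are illustrative, and C11's atomic_flag_test_and_set plays the role of the hardware TAS):

    #include <stdatomic.h>

    struct spinlock { atomic_flag held; };   /* initialize with ATOMIC_FLAG_INIT */

    void spin_acquire(struct spinlock *l) {
        /* atomic_flag_test_and_set sets the flag and returns its old
           value in one atomic step, like the TAS instruction above. */
        while (atomic_flag_test_and_set(&l->held))
            ;  /* spin while another thread holds the lock */
    }

    void spin_release(struct spinlock *l) {
        atomic_flag_clear(&l->held);
    }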

TAS and cache coherence (walk-through of the animation; the lock line starts with held = 0 in shared memory):
1. Thread A executes acq(lock). Its processor issues a read-exclusive request for the lock's cache line.
2. The line is filled into A's cache in the Dirty state with held = 1; A now holds the lock.
3. Thread B also executes acq(lock). Its read-exclusive request sends an invalidation to A's cache.
4. A's copy is invalidated and the updated value (held = 1) is written back to shared memory.
5. The line is filled into B's cache in the Dirty state with held = 1; B's TAS returns the old value 1, so B does not get the lock and keeps trying.

What if there is contention? Threads A and B both spin in while (TAS(l)) ; with the lock already held (held = 1 in shared memory), so every iteration issues another TAS on the same line.

How bad can it be? (The slide compares measured TAS and store latencies.) Recall: TAS essentially is a store plus a memory barrier.

How to optimize? While the lock is held, a contending acquire keeps writing 1 into the lock variable, which is not necessary! Test-and-test-and-set spins with plain reads first:

    void test_and_test_and_set(lock) {
        do {
            while (lock->held == 1)
                ;  // spin with plain reads while the lock looks held
        } while (test_and_set(&lock->held));
    }

    void release(lock) {
        lock->held = 0;
    }

Why is the do .. while necessary?
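
A compilable C11 rendering of the same idea (our own sketch with illustrative names; the relaxed load is the plain read, and atomic_exchange plays the role of TAS):

    #include <stdatomic.h>

    struct ttas_lock { atomic_int held; };   /* initialize to 0 */

    void ttas_acquire(struct ttas_lock *l) {
        do {
            /* Spin with plain loads: reading a Shared line stays local. */
            while (atomic_load_explicit(&l->held, memory_order_relaxed) == 1)
                ;
            /* Re-check with TAS: another thread may grab the lock between
               our last read and the exchange -- that is why the do..while
               is needed. */
        } while (atomic_exchange(&l->held, 1) == 1);
    }

    void ttas_release(struct ttas_lock *l) {
        atomic_store(&l->held, 0);
    }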

What if there is contention, now with test-and-test-and-set? (walk-through)
1. Thread A holds the lock (held = 1, Dirty in its cache; shared memory still shows the stale held = 0). Thread B spins in while (held == 1) ; and issues a plain read request.
2. A's copy is downgraded to the Shared state, memory is updated to held = 1, and the line is supplied to B in the Shared state; Thread C now spins too and issues its own read request.
3. A, B, and C all hold the line in the Shared state, and the spinning threads keep re-reading their local copies: repeated reads of a Shared cache line generate no cache-coherence traffic. Result: a 6x speedup!

Let’s put everything together TAS Ignore Load Write Local access

Implications for programmers:
- Cache coherence is expensive (more than you thought). Avoid unnecessary sharing (e.g., false sharing) and unnecessary coherence traffic (e.g., replace TAS with TATAS).
- Get a clear understanding of the performance. Crossing sockets is a killer: it can be slower than running the same program on a single core! pthreads provides a CPU affinity mask, so pin cooperating threads onto cores within the same die.
- Loads and stores can be as expensive as atomic operations.
- Programming gurus understand the hardware. So do you now! Have fun hacking!
More details in "Everything you always wanted to know about synchronization but were afraid to ask", David et al., SOSP'13.
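
As a concrete illustration of the affinity advice, here is a Linux/glibc-specific sketch (the helper name pin_to_core is ours) that pins the calling thread to one core using the pthreads affinity API:

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to core `cpu`, so cooperating threads can be
       kept on cores of the same die/socket. Returns 0 on success. */
    int pin_to_core(int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }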