ECE 454 Computer Systems Programming Parallel Architectures and Performance Implications (II) Ding Yuan ECE Dept., University of Toronto

What we have already learned
- How to benefit from multi-cores: parallelize a sequential program into a multi-threaded program
- Watch out for locks: atomic regions are serialized
- Use fine-grained locks, and avoid locking where possible
But is that all? As long as you do the above, will your multi-threaded program run Nx faster on an N-core machine?

Putting it all together [1]
- Performance implications of parallel architectures
- Background: the architecture of the two test machines
- Cache-coherence performance and its implications for parallel software design
[1] "Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask." David et al., SOSP '13

Two case studies
- 48-core AMD Opteron
- 80-core Intel Xeon
Question to keep in mind: which machine would you use?

48-core AMD Opteron
- Each die has 6 cores, each with a private L1 cache, sharing one Last-Level Cache; the LLC is NOT shared across dies
- 8 dies in total; each socket contains 2 dies, so even some same-socket traffic is cross-die!
- Directory-based cache coherence
[Figure: motherboard with RAM and 8 dies of 6 cores each; each core has an L1, each die its own LLC]

80-core Intel Xeon
- Each socket has 10 cores, each with a private L1 cache, all sharing one Last-Level Cache; the LLC is shared within a socket
- 8 sockets in total; communication between sockets is cross-socket
- Snooping-based cache coherence
[Figure: motherboard with RAM and 8 sockets of 10 cores each; each core has an L1, all cores on a die share the LLC]

Interconnect between sockets
Cross-socket communication can take 2 hops: the sockets are not fully connected, so a request may be forwarded through an intermediate socket.

Performance of memory operations Ding Yuan, ECE454 8

Local caches and memory latencies
Memory access to a line cached locally (cycles):
- Best case: L1, under 10 cycles (remember this)
- Worst case: RAM, 136-355 cycles (remember this)

Latency of remote access: read (cycles)
"State" is the MESI state of the cache line in the remote cache (the local copy is Invalid).
- Cross-socket communication is expensive!
  - Xeon: loading from the Shared state is 7.5 times more expensive over two hops than within a socket
  - Opteron: cross-socket latency is even larger than RAM latency
- Opteron: latency is uniform regardless of the cache state
  - Directory-based protocol (the directory is distributed across all LLCs; here we assume the directory lookup stays on the same die)
- Xeon: a load from the Shared state is much faster than from the Modified or Exclusive states
  - A Shared-state read is served from the LLC instead of from the remote cache

Latency of remote access: write (cycles)
"State" is the MESI state of the cache line in the remote cache.
- Cross-socket communication is expensive!
- Opteron: a store to a Shared cache line is much more expensive
  - The directory is incomplete: it does not keep track of the sharers, so the store is equivalent to a broadcast and has to wait for all invalidations to complete
- Xeon: store latency is similar regardless of the previous cache-line state
  - Snooping-based coherence

How about synchronization? Ding Yuan, ECE454 12

Synchronization implementation
- Hardware support is required to implement synchronization primitives
  - In the form of atomic instructions
  - Common examples: test-and-set, compare-and-swap, etc.
- Used to implement high-level synchronization primitives
  - e.g., lock/unlock, semaphores, barriers, condition variables, etc.
- We will only discuss test-and-set (see the sketch below)
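
On real hardware these atomic instructions are reached through compiler intrinsics or the C11 <stdatomic.h> interface. A minimal sketch (ours, not from the lecture) exercising both primitives:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

int main(void) {
    /* Test-and-set: atomically set the flag and return its previous value. */
    atomic_flag flag = ATOMIC_FLAG_INIT;
    bool was_set = atomic_flag_test_and_set(&flag);   /* false: flag was clear */

    /* Compare-and-swap: atomically replace val with 1 only if it still equals 0. */
    atomic_int val = 0;
    int expected = 0;
    bool swapped = atomic_compare_exchange_strong(&val, &expected, 1);

    printf("was_set=%d swapped=%d val=%d\n", was_set, swapped, atomic_load(&val));
    return 0;
}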

Test-and-set
The semantics of test-and-set are:
- Record the old value
- Set the value to True (this is a write!)
- Return the old value
The hardware executes it atomically!
When executing test-and-set on "flag":
- What is the value of flag afterwards if it was initially False? True?
- What is the return result if flag was initially False? True?

bool test_and_set(bool *flag) {
    bool old = *flag;   // record the old value
    *flag = true;       // set the value to True: a write!
    return old;         // return the old value
}

In the cache-coherence protocol, the write side of TAS issues a read-exclusive request (invalidating other copies) and then modifies the line (changing its state). TAS also acts as a memory barrier: all memory operations before the TAS must complete, and memory operations after the TAS cannot be reordered before it.

Using test-and-set
Here is our lock implementation with test-and-set:

struct lock {
    int held;           // initially 0
};
void acquire(struct lock *lock) {
    while (test_and_set(&lock->held))
        ;               // spin until TAS returns 0 (lock was free)
}
void release(struct lock *lock) {
    lock->held = 0;
}

- When will the while loop return? What is the value of held then?
- Does it work? What about on multiprocessors?
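
For reference, a hedged, runnable C11 version of the same lock (the names spinlock_t, spin_acquire, and spin_release are ours, not from the slides):

#include <stdatomic.h>

typedef struct {
    atomic_flag held;   /* clear = free, set = held; initialize with ATOMIC_FLAG_INIT */
} spinlock_t;

void spin_acquire(spinlock_t *l) {
    /* Spin until test-and-set returns false, i.e., the flag was clear. */
    while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
        ;  /* spin */
}

void spin_release(spinlock_t *l) {
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}

The acquire/release orderings give the barrier behavior described above: operations inside the critical section cannot be reordered past the lock or unlock.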

TAS and cache coherence
[Sequence of figures: Thread A and Thread B, each with a private cache, above a shared memory that initially holds lock->held = 0]
1. Thread A calls acquire(lock): its TAS issues a read-exclusive request for the line holding lock->held.
2. The line is filled into A's cache in the Dirty state with lock->held = 1; A now holds the lock.
3. Thread B calls acquire(lock): its TAS issues a read-exclusive request, which invalidates A's copy.
4. Memory is updated to lock->held = 1 and A's cache line becomes Invalid.
5. The line is filled into B's cache in the Dirty state with lock->held = 1; B's TAS returns 1, so B does not get the lock and keeps spinning.

What if there is contention?
[Figure: Thread A and Thread B both spin in while(TAS(lock)) ;]
Every spinning TAS is a write, so the cache line holding the lock ping-pongs between the two caches, with an invalidation and a refill on every iteration.

How bad can it be?
[Chart: latency of TAS compared with a plain Store]
Recall: a TAS is essentially a Store plus a Memory Barrier.
Takeaway: heavy lock contention may lead to worse performance than serializing the execution!
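
A minimal microbenchmark sketch (ours, not from the lecture) that exposes this effect: N threads bump a shared counter under a bare TAS lock; compare the wall-clock time at 1 thread against several.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

#define ITERS 1000000

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < ITERS; i++) {
        while (atomic_flag_test_and_set(&lock))
            ;                       /* bare TAS spin: every iteration writes */
        counter++;                  /* tiny critical section */
        atomic_flag_clear(&lock);
    }
    return NULL;
}

int main(int argc, char **argv) {
    int n = (argc > 1) ? atoi(argv[1]) : 4;     /* number of threads */
    pthread_t *t = malloc(sizeof(*t) * n);
    for (int i = 0; i < n; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < n; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);
    free(t);
    return 0;
}

On many machines the per-increment cost rises sharply with the thread count: the contended run can take longer than performing all the increments in a single thread.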

How to optimize?
While the lock is held, a contending acquire keeps writing 1 to the lock variable. Not necessary!

void test_and_test_and_set(struct lock *lock) {
    do {
        while (lock->held == 1)
            ;                               // spin on plain reads
    } while (test_and_set(&lock->held));    // TAS only once the lock looks free
}
void release(struct lock *lock) {
    lock->held = 0;
}
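
A hedged C11 rendering of test-and-test-and-set (our sketch; the tatas_* names are ours):

#include <stdatomic.h>
#include <stdbool.h>

typedef struct {
    atomic_bool held;   /* false = free, true = held; initialize to false */
} tatas_lock_t;

void tatas_acquire(tatas_lock_t *l) {
    do {
        /* Wait with plain loads: the line sits in the Shared state,
         * so spinning generates no coherence traffic. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;
        /* Attempt the write-causing exchange only once the lock looks free. */
    } while (atomic_exchange_explicit(&l->held, true, memory_order_acquire));
}

void tatas_release(tatas_lock_t *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}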

What if there is contention (with test-and-test-and-set)?
[Sequence of figures: one thread holds the lock, with lock->held = 1 Dirty in its cache; the other threads spin in while(lock->held == 1) ;]
1. A spinner's first read issues a plain read request; the holder's Dirty line is downgraded to Shared and memory is updated to lock->held = 1.
2. The line is filled into the spinner's cache in the Shared state.
3. Further spinners do the same, each ending with a Shared copy.
Repeated reads to a "Shared" cache line: no cache-coherence traffic!

Let's put everything together
[Chart: latency of TAS, Load, and Write operations compared with local access]

Implications for programmers
- Cache coherence is expensive (more than you thought)
  - Avoid unnecessary sharing (e.g., false sharing)
  - Avoid unnecessary coherence (e.g., TAS -> TATAS)
- Build a clear understanding of the performance
  - Crossing sockets is a killer: it can be slower than running the same program on a single core!
  - pthreads provides a CPU affinity mask: pin cooperating threads to cores within the same die (see the sketch below)
  - Loads and stores can be as expensive as atomic operations
- Programming gurus understand the hardware
  - So do you now! Have fun hacking!
More details in "Everything You Always Wanted to Know About Synchronization but Were Afraid to Ask." David et al., SOSP '13
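
For the affinity point, a minimal Linux-only sketch using the GNU extension pthread_setaffinity_np (the assumption that cores 0-5 share a die is ours, for illustration; query the real topology with hwloc or /proc/cpuinfo):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single core. */
static int pin_to_core(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void *worker(void *arg) {
    int core = *(int *)arg;
    if (pin_to_core(core) != 0)
        fprintf(stderr, "failed to pin to core %d\n", core);
    /* ... cooperative work here: all sharing now stays within one die ... */
    return NULL;
}

int main(void) {
    pthread_t threads[6];
    int cores[6] = {0, 1, 2, 3, 4, 5};  /* assumed: one die's worth of cores */
    for (int i = 0; i < 6; i++)
        pthread_create(&threads[i], NULL, worker, &cores[i]);
    for (int i = 0; i < 6; i++)
        pthread_join(threads[i], NULL);
    return 0;
}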