Multiprocessors and Thread-Level Parallelism Cont’d


ENGS 116 Lecture 16
Multiprocessors and Thread-Level Parallelism Cont'd
Vincent Berk, November 2, 2005

Reading for Today: Sections 6.1 – 6.3
Reading for Friday: Sections 6.4 – 6.7
Reading for Wednesday: Sections 6.8 – 6.10, 6.14

An Example Snoopy Protocol

Invalidation protocol, write-back cache
Each block of memory is in one state:
- Clean in all caches and up-to-date in memory (Shared), OR
- Dirty in exactly one cache (Exclusive), OR
- Not in any caches
Each cache block is in one state (track these):
- Shared: block can be read
- Exclusive: cache has the only copy; it is writable and dirty
- Invalid: block contains no data
Read misses: cause all caches to snoop the bus
Writes to a clean line are treated as misses
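To make the three block states concrete, here is a minimal software model of the transitions described above. This is a sketch only: the type and function names (block_state_t, on_local_write, and so on) are mine, and real coherence logic lives in cache-controller hardware, not in C.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE } block_state_t;

/* Local processor writes the block: a write to a clean (Shared) or
 * Invalid line is treated as a miss, invalidates other copies on the
 * bus, and leaves this cache's copy Exclusive (dirty). */
block_state_t on_local_write(block_state_t s) {
    (void)s;                 /* any prior state ends up Exclusive */
    return EXCLUSIVE;
}

/* Another cache's write miss is observed on the bus: if we hold the
 * block dirty we must write it back; either way our copy is invalidated. */
block_state_t on_remote_write(block_state_t s, int *needs_writeback) {
    *needs_writeback = (s == EXCLUSIVE);
    return INVALID;
}

/* A remote read miss demotes a dirty copy to Shared (after write-back). */
block_state_t on_remote_read(block_state_t s, int *needs_writeback) {
    *needs_writeback = (s == EXCLUSIVE);
    return (s == INVALID) ? INVALID : SHARED;
}

int main(void) {
    int wb = 0;
    block_state_t s = INVALID;
    s = on_local_write(s);          /* write miss -> Exclusive */
    s = on_remote_read(s, &wb);     /* snooped read -> Shared, write-back needed */
    printf("state=%d writeback=%d\n", s, wb);
    return 0;
}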

Figure 6.11: A write-invalidate, cache-coherence protocol for a write-back cache, showing the states and state transitions for each block in the cache.

Figure 6.12: Cache-coherence state diagram with the state transitions induced by the local processor shown in black and by bus activities shown in gray.

Snooping Cache Variations

Basic protocol: Exclusive, Shared, Invalid
Berkeley protocol: Owned Exclusive, Owned Shared, Shared, Invalid
- Owner can update via a bus invalidate operation
- Owner must write back when the block is replaced in the cache
Illinois protocol: Private Dirty, Private Clean, Shared, Invalid
- If a read is sourced from memory, the block becomes Private Clean
- If a read is sourced from another cache, the block becomes Shared
- Can write in cache if the block is held Private Clean or Private Dirty
MESI protocol: Modified (private, ≠ memory), Exclusive (private, = memory), Shared (shared, = memory), Invalid

Implementing Snooping Caches

Multiple processors must be on the bus, with access to both addresses and data
Add a few new commands to perform coherency, in addition to read and write
Processors continuously snoop on the address bus
- If an address matches a tag, either invalidate or update the block
Since every bus transaction checks cache tags, snooping could interfere with the CPU:
- Solution 1: duplicate the set of tags for the L1 caches, allowing checks in parallel with the CPU
- Solution 2: use the L2 cache as the duplicate, provided L2 obeys inclusion with L1
  - Block size and associativity of L2 then affect L1

Implementing Snooping Caches (cont'd)

The bus serializes writes; getting the bus ensures that no one else can perform a memory operation
On a miss in a write-back cache, another cache may have the desired copy and it may be dirty, so that cache must reply
Add an extra state bit to each cache block to record whether it is shared
Add a 4th state (MESI)

Larger Multiprocessors

Separate memory per processor
Local or remote access via a memory controller
One cache-coherency solution: non-cached pages
Alternative: a directory that tracks the state of every block in every cache
- Which caches have copies of the block, dirty vs. clean, ...
Info per memory block vs. per cache block?
- PLUS: in memory → simpler protocol (centralized, one location)
- MINUS: in memory → directory size is ƒ(memory size) rather than ƒ(cache size)
Prevent the directory from becoming a bottleneck?
- Distribute directory entries with the memory, each node keeping track of which processors have copies of its blocks

Distributed-Directory Multiprocessors

[Figure: nodes, each containing processor + cache, memory, directory, and I/O, connected by an interconnection network]

Example: Cray T3E
- 480 MB/sec per link, 3 links per node
- Memory on node
- Switch-based, up to 2048 nodes
- $30M to $50M

Directory Protocol

Similar to the snoopy protocol: three states
- Shared: ≥ 1 processors have the data; memory is up-to-date
- Uncached: no processor has it; not valid in any cache
- Exclusive: 1 processor (the owner) has the data; memory is out-of-date
In addition to the cache state, must track which processors have the data when in the Shared state (usually a bit vector: bit i is 1 if processor i has a copy)
Keep it simple:
- Writes to non-exclusive data → write miss
- Processor blocks until the access completes
- Assume messages are received and acted upon in the order sent
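As a rough illustration of this bookkeeping, here is a sketch of a directory entry with a sharer bit vector and the write-miss path the slide describes. The 64-node limit, the stand-in printf for a network message, and all names (dir_entry_t, dir_write_miss) are assumptions for illustration only.

#include <stdint.h>
#include <stdio.h>

typedef enum { UNCACHED, SHARED_ST, EXCLUSIVE_ST } dir_state_t;

typedef struct {
    dir_state_t state;
    uint64_t    sharers;   /* bit i set => processor i holds a copy */
} dir_entry_t;

/* Processor p takes a write miss on this block: every other sharer must
 * be invalidated, then p becomes the single (dirty) owner. */
void dir_write_miss(dir_entry_t *e, int p) {
    for (int i = 0; i < 64; i++)
        if (((e->sharers >> i) & 1) && i != p)
            printf("send invalidate to node %d\n", i);  /* stand-in for a network message */
    e->sharers = 1ull << p;
    e->state   = EXCLUSIVE_ST;
}

int main(void) {
    dir_entry_t e = { SHARED_ST, (1ull << 2) | (1ull << 5) };
    dir_write_miss(&e, 2);   /* node 2 writes: node 5 gets an invalidate */
    return 0;
}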

Figure 6.29: State transition diagram for an individual cache block in a directory-based system.

Figure 6.30: The state transition diagram for the directory has the same states and structure as the transition diagram for an individual cache block.

Summary

Caches contain all information on the state of cached memory blocks
Snooping and directory protocols are similar; the bus makes snooping easier because of broadcast (snooping → uniform memory access)
A directory has an extra data structure to keep track of the state of all cache blocks
Distributing the directory → scalable shared-address multiprocessor → cache-coherent, non-uniform memory access

Synchronization

Why synchronize? Need to know when it is safe for different processes to use shared data
Issues for synchronization:
- Uninterruptible instruction to fetch and update memory (atomic operation)
- User-level synchronization operations built using this primitive
- For large-scale MPs, synchronization can be a bottleneck; techniques to reduce the contention and latency of synchronization

Uninterruptible Instruction to Fetch and Update Memory

Atomic exchange: interchange a value in a register with a value in memory
- 0 → synchronization variable is free
- 1 → synchronization variable is locked and unavailable
- Set the register to 1 and swap
- The new value in the register determines success in getting the lock:
  - 0 if you succeeded in setting the lock (you were first)
  - 1 if another processor had already claimed access
- The key is that the exchange operation is indivisible
Test-and-set: tests a value and sets it if the value passes the test
Fetch-and-increment: returns the value of a memory location and atomically increments it
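These primitives map naturally onto C11 <stdatomic.h>. A minimal sketch follows; the variable and function names are mine, not from the text.

#include <stdatomic.h>
#include <stdio.h>

atomic_int lock_var = 0;   /* 0 = free, 1 = locked */

int try_acquire(void) {
    /* atomic exchange: write 1 and get the old value back indivisibly;
       old value 0 means we got the lock, 1 means someone already held it */
    return atomic_exchange(&lock_var, 1) == 0;
}

int main(void) {
    atomic_int counter = 0;
    printf("first acquire:  %d\n", try_acquire());   /* 1 (success) */
    printf("second acquire: %d\n", try_acquire());   /* 0 (already locked) */
    /* fetch-and-increment: returns the old value and bumps the location */
    int old = atomic_fetch_add(&counter, 1);
    printf("old counter: %d\n", old);                /* 0 */
    return 0;
}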

Uninterruptible Instruction to Fetch and Update Memory (cont'd)

Hard to have read and write in one instruction: use two instead
Load linked (or load locked) + store conditional
- Load linked returns the initial value
- Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise

Example: atomic swap with LL & SC:

try:    mov   R3,R4      ; move exchange value
        ll    R2,0(R1)   ; load linked
        sc    R3,0(R1)   ; store conditional
        beqz  R3,try     ; branch if store fails (R3 = 0)
        mov   R4,R2      ; put loaded value in R4

Example: fetch & increment with LL & SC:

try:    ll    R2,0(R1)   ; load linked
        addi  R2,R2,#1   ; increment (OK if reg–reg)
        sc    R2,0(R1)   ; store conditional
        beqz  R2,try     ; branch if store fails (R2 = 0)
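C11 exposes no LL/SC pair directly, but atomic_compare_exchange_weak typically compiles down to an ll/sc retry loop on machines that provide one. Here is a sketch of the same two idioms in that style; the function names are mine.

#include <stdatomic.h>
#include <stdio.h>

int atomic_swap(atomic_int *loc, int newval) {
    int old = atomic_load(loc);                         /* like ll */
    while (!atomic_compare_exchange_weak(loc, &old, newval))
        ;   /* "sc" failed (another store intervened): old is reloaded, retry */
    return old;
}

int fetch_and_increment(atomic_int *loc) {
    int old = atomic_load(loc);
    while (!atomic_compare_exchange_weak(loc, &old, old + 1))
        ;
    return old;
}

int main(void) {
    atomic_int x = 5;
    printf("swapped out %d\n", atomic_swap(&x, 9));           /* 5 */
    printf("incremented from %d\n", fetch_and_increment(&x)); /* 9 */
    return 0;
}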

User-Level Synchronization

Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:

        li    R2,#1
lockit: exch  R2,0(R1)   ; atomic exchange
        bnez  R2,lockit  ; already locked?

What about an MP with cache coherency?
- Want to spin on a cached copy to avoid full memory latency
- Likely to get cache hits for such variables
Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
Solution: start by simply repeatedly reading the variable; when it changes, then try the exchange ("test and test&set"):

try:    li    R2,#1
lockit: lw    R3,0(R1)   ; load var
        bnez  R3,lockit  ; not free → spin
        exch  R2,0(R1)   ; atomic exchange
        bnez  R2,try     ; already locked?
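The same "test and test&set" idea in C11 atomics: spin on a plain load, which hits in the local cache and generates no invalidations, and attempt the invalidating exchange only when the lock looks free. A sketch, not a production lock (no backoff or fairness):

#include <stdatomic.h>

atomic_int lock_var = 0;   /* 0 = free, 1 = locked */

void spin_lock(atomic_int *l) {
    for (;;) {
        while (atomic_load(l) != 0)
            ;   /* read-only spin on the cached copy: no bus traffic */
        if (atomic_exchange(l, 1) == 0)
            return;   /* the exchange won the lock */
        /* someone beat us to it: fall back to read-only spinning */
    }
}

void spin_unlock(atomic_int *l) {
    atomic_store(l, 0);
}

int main(void) {
    spin_lock(&lock_var);
    /* critical section */
    spin_unlock(&lock_var);
    return 0;
}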

Memory Consistency

What is consistency? When must a processor see a new value?

P1:  A = 0;             P2:  B = 0;
     .....                   .....
     A = 1;                  B = 1;
L1:  if (B == 0) ...    L2:  if (A == 0) ...

Is it impossible for both statements L1 and L2 to be true?
What if the write invalidate is delayed and the processor continues?
Memory consistency models: what are the rules for such cases?
Sequential consistency (SC): the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved
- Delay completion of a memory access until all invalidations complete, or
- Delay a memory access until the previous one is complete
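The P1/P2 example can be written with C11 atomics, whose default memory_order_seq_cst enforces sequential consistency, so at most one of the two messages can print in any execution. A sketch using POSIX threads (compile with -pthread):

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int A = 0, B = 0;

void *p1(void *arg) {
    (void)arg;
    atomic_store(&A, 1);            /* A = 1 */
    if (atomic_load(&B) == 0)       /* L1 */
        puts("P1 saw B == 0");
    return NULL;
}

void *p2(void *arg) {
    (void)arg;
    atomic_store(&B, 1);            /* B = 1 */
    if (atomic_load(&A) == 0)       /* L2 */
        puts("P2 saw A == 0");
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Under SC, zero or one message prints; with relaxed ordering on
       real hardware, both could print. */
    return 0;
}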

Memory Consistency Model

A more efficient approach is to assume that programs are synchronized
All accesses to shared data are ordered by synchronization operations:

    write (x)
    ...
    release (s)   // unlock
    ...
    acquire (s)   // lock
    ...
    read (x)

Only programs willing to be nondeterministic are not synchronized; the outcome depends on processor speed, and a data race occurs
Several relaxed models for memory consistency exist, since most programs are synchronized; they are characterized by their attitude toward RAR, WAR, RAW, and WAW to different addresses
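The write(x) ... release(s) ... acquire(s) ... read(x) pattern above, expressed with explicit C11 release/acquire ordering. A sketch assuming exactly one producer and one consumer; all names are mine:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int x;                  /* ordinary shared data */
atomic_int s = 0;       /* synchronization variable */

void *producer(void *arg) {
    (void)arg;
    x = 42;                                              /* write(x) */
    atomic_store_explicit(&s, 1, memory_order_release);  /* release(s) */
    return NULL;
}

void *consumer(void *arg) {
    (void)arg;
    while (atomic_load_explicit(&s, memory_order_acquire) == 0)
        ;                                                /* acquire(s): spin until released */
    printf("read(x) = %d\n", x);   /* guaranteed to see 42 */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}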

Measuring Performance of Parallel Systems

Wall-clock time is a good measure
Often want to know how performance scales as processors are added to the system
- Unscaled speedup: processors added to a fixed-size problem
- Scaled speedup: processors added to a scaled-size problem
For scaled speedup, how do we scale the application?
- Memory-constrained scaling
- Time-constrained scaling (CPU power)
- Bandwidth-constrained scaling (communication)
How do we measure the uniprocessor application?
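For reference, the usual definition underlying both variants (a standard formulation, not spelled out on the slide): with T_1 the best uniprocessor time and T_p the time on p processors,

\[
\mathrm{Speedup}(p) = \frac{T_1}{T_p}
\]

Unscaled speedup holds the problem size fixed as p grows; scaled speedup grows the problem with p under one of the constraints listed above.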

Multithreading: Processes and Threads

Process:
- Address space
- Context
- One or more execution threads
Thread:
- Shared heap (with the other threads in the process)
- Shared text
- Exclusive stack (one per thread)
- Exclusive CPU context
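A minimal POSIX-threads sketch of these sharing rules: the heap allocation is visible to every thread, while each thread's locals live on its own private stack. All names here are mine, for illustration only.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

_Atomic int *shared_heap_counter;   /* heap allocation: visible to all threads */

void *worker(void *arg) {
    int local = (int)(long)arg;     /* 'local' lives on this thread's own stack */
    atomic_fetch_add(shared_heap_counter, local);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    shared_heap_counter = calloc(1, sizeof *shared_heap_counter);
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("shared counter: %d\n", atomic_load(shared_heap_counter));  /* 3 */
    free(shared_heap_counter);
    return 0;
}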

The Big Picture (Figure 6.44)

Superscalar

Each thread is treated as a separate process
The address space is shared through shared pages
A thread switch requires a full context switch
Example: Linux 2.4

Real Multithreading

Thread switches don't require context switches
Threads are switched on:
- Timeout
- Blocking for I/O
The CPU has to keep multiple states (contexts), one for each concurrent thread:
- Program counter
- Register array (usually done through renaming)
Since memory is shared, threads operate in the same virtual address space
The TLB and caches are specific to the process, the same for each thread

Coarse MT

CPU hardware support is inexpensive:
- The number of concurrent threads is low (< 4), equal to the number of contexts
- The thread library prepares the operating system
- Threads are hard to share between CPUs
Switching occurs on block boundaries:
- Instruction cache block
- Instruction fetch block
Implemented in: SPARC Solaris

Fine MT

Switching between threads on each cycle
Hardware support is extensive:
- Has to support several concurrent contexts
- Hazards in the write-back stage (RAW, WAW, WAR)
IBM used both fine MT and SMT in experimental CPUs

Simultaneous Multithreading (SMT)

Coolest of all: interleaves instructions from multiple threads in each cycle
Hardware control logic borders on the impossible:
- Extensive context and renaming logic
- Very tight coupling with the operating system
IBM had great successes; the bottleneck: memory bandwidth
Intel recently implemented "Hyper-Threading" in the Xeon series:
- A slimmed-down version of SMT
- No OS support needed: you see 2 'virtual' CPUs
- Just one more marketing term from our friends in Silicon Valley