December 1, 2006©2006 Craig Zilles1 Threads & Atomic Operations in Hardware  Previously, we introduced multi-core parallelism & cache coherence —Today.

Slides:



Advertisements
Similar presentations
Synchronization. How to synchronize processes? – Need to protect access to shared data to avoid problems like race conditions – Typical example: Updating.
Advertisements

Background Concurrent access to shared data can lead to inconsistencies Maintaining data consistency among cooperating processes is critical What is wrong.
Review: Multiprocessor Systems (MIMD)
Synchronization. Shared Memory Thread Synchronization Threads cooperate in multithreaded environments – User threads and kernel threads – Share resources.
Operating Systems ECE344 Ding Yuan Synchronization (I) -- Critical region and lock Lecture 5: Synchronization (I) -- Critical region and lock.
Threads 1 CS502 Spring 2006 Threads CS-502 Spring 2006.
CS510 Concurrent Systems Class 2 A Lock-Free Multiprocessor OS Kernel.
CPS110: Implementing threads/locks on a uni-processor Landon Cox.
Concurrency Recitation – 2/24 Nisarg Raval Slides by Prof. Landon Cox.
CS510 Concurrent Systems Introduction to Concurrency.
CS510 Concurrent Systems Jonathan Walpole. A Lock-Free Multiprocessor OS Kernel.
Implementing Synchronization. Synchronization 101 Synchronization constrains the set of possible interleavings: Threads “agree” to stay out of each other’s.
COMP 111 Threads and concurrency Sept 28, Tufts University Computer Science2 Who is this guy? I am not Prof. Couch Obvious? Sam Guyer New assistant.
Operating Systems ECE344 Ashvin Goel ECE University of Toronto Mutual Exclusion.
University of Washington What is parallel processing? Spring 2014 Wrap-up When can we execute things in parallel? Parallelism: Use extra resources to solve.
Memory Consistency Models. Outline Review of multi-threaded program execution on uniprocessor Need for memory consistency models Sequential consistency.
1. 2 Pipelining vs. Parallel processing  In both cases, multiple “things” processed by multiple “functional units” Pipelining: each thing is broken into.
Threads Tutorial #7 CPSC 261. A thread is a virtual processor Each thread is provided the illusion that it owns a core – Copy of the registers – It is.
December 1, 2006©2006 Craig Zilles1 Threads and Cache Coherence in Hardware  Previously, we introduced multi-cores. —Today we’ll look at issues related.
CS399 New Beginnings Jonathan Walpole. 2 Concurrent Programming & Synchronization Primitives.
U NIVERSITY OF M ASSACHUSETTS A MHERST Department of Computer Science Computer Systems Principles Synchronization Emery Berger and Mark Corner University.
Implementing Lock. From the Previous Lecture  The “too much milk” example shows that writing concurrent programs directly with load and store instructions.
CSCI1600: Embedded and Real Time Software Lecture 17: Concurrent Programming Steven Reiss, Fall 2015.
CS510 Concurrent Systems Jonathan Walpole. Introduction to Concurrency.
Implementing Mutual Exclusion Andy Wang Operating Systems COP 4610 / CGS 5765.
Mutual Exclusion -- Addendum. Mutual Exclusion in Critical Sections.
1 Introduction to Threads Race Conditions. 2 Process Address Space Revisited Code Data OS Stack (a)Process with Single Thread (b) Process with Two Threads.
Multiprocessors – Locks
Kernel Synchronization David Ferry, Chris Gill CSE 522S - Advanced Operating Systems Washington University in St. Louis St. Louis, MO
Symmetric Multiprocessors: Synchronization and Sequential Consistency
CSE 120 Principles of Operating
CS703 – Advanced Operating Systems
Background on the need for Synchronization
Synchronization.
© Craig Zilles (adapted from slides by Howard Huang)
Memory Consistency Models
Threads Threads.
Atomic Operations in Hardware
Atomic Operations in Hardware
Parallel Shared Memory
Memory Consistency Models
Exploiting Parallelism
143a discussion session week 3
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Designing Parallel Algorithms (Synchronization)
Concurrent Programming
COT 5611 Operating Systems Design Principles Spring 2014
Chapter 26 Concurrency and Thread
Symmetric Multiprocessors: Synchronization and Sequential Consistency
Jonathan Walpole Computer Science Portland State University
Thread Implementation Issues
COT 5611 Operating Systems Design Principles Spring 2012
Mutual Exclusion.
UNIVERSITY of WISCONSIN-MADISON Computer Sciences Department
Background and Motivation
Coordination Lecture 5.
Implementing Mutual Exclusion
Multithreading Tutorial
CSCI1600: Embedded and Real Time Software
Implementing Mutual Exclusion
Kernel Synchronization II
CSE 451: Operating Systems Autumn 2003 Lecture 7 Synchronization
CSE 451: Operating Systems Autumn 2005 Lecture 7 Synchronization
CSE 451: Operating Systems Winter 2003 Lecture 7 Synchronization
CS510 Operating System Foundations
CSE 153 Design of Operating Systems Winter 19
CS333 Intro to Operating Systems
Chapter 6: Synchronization Tools
CSCI1600: Embedded and Real Time Software
© Craig Zilles (adapted from slides by Howard Huang)
Presentation transcript:

December 1, 2006©2006 Craig Zilles1 Threads & Atomic Operations in Hardware  Previously, we introduced multi-core parallelism & cache coherence —Today we’ll look at 2 things: 1.Threads 2.Instruction support for synchronization. —And some pitfalls of parallelization. —And solve a few mysteries. Intel Core i7

Process View  Process = thread + code, data, and kernel context shared libraries run-time heap 0 read/write data Program context: Data registers Condition codes Stack pointer (SP) Program counter (PC) Code, data, and kernel context read-only code/data stack SP PC brk Thread Kernel context: VM structures brk pointer Descriptor table

Process with Two Threads shared libraries run-time heap 0 read/write data Program context: Data registers Condition codes Stack pointer (SP) Program counter (PC) Code, data, and kernel context read-only code/data stack SP PC brk Thread 1 Kernel context: VM structures brk pointer Descriptor table Program context: Data registers Condition codes Stack pointer (SP) Program counter (PC) stack SP Thread 2

Thread Execution  Single Core Processor —Simulate concurrency by time slicing  Multi-Core Processor —Can have true concurrency Time Thread AThread BThread C Thread AThread BThread C Run 3 threads on 2 cores

December 1, 2006©2006 Craig Zilles5 A simple piece of code unsigned counter = 0; void *do_stuff(void * arg) { for (int i = 0 ; i < ; ++ i) { counter ++; } return arg; } How long does this program take? How can we make it faster? adds one to counter

December 1, 2006©2006 Craig Zilles6 A simple piece of code unsigned counter = 0; void *do_stuff(void * arg) { for (int i = 0 ; i < ; ++ i) { counter ++; } return arg; } How long does this program take? Time for iterations How can we make it faster? Run iterations in parallel adds one to counter

December 1, 2006©2006 Craig Zilles7 unsigned counter = 0; void *do_stuff(void * arg) { for (int i = 0 ; i < ; ++ i) { counter ++; } return arg; } Exploiting a multi-core processor #1#2 Split for-loop across multiple threads running on separate cores

December 1, 2006©2006 Craig Zilles8 How much faster?

December 1, 2006©2006 Craig Zilles9 How much faster?  We’re expecting a speedup of 2  OK, perhaps a little less because of Amdahl’s Law —overhead for forking and joining multiple threads  But its actually slower!! Why??  Here’s the mental picture that we have – two processors, shared memory counter shared variable in memory

December 1, 2006©2006 Craig Zilles10 This mental picture is wrong!  We’ve forgotten about caches! —The memory may be shared, but each processor has its own L1 cache —As each processor updates counter, it bounces between L1 caches Multiple bouncing slows performance

December 1, 2006©2006 Craig Zilles11 The code is not only slow, its WRONG!  Since the variable counter is shared, we can get a data race  Increment operation: counter++ MIPS equivalent:  A data race occurs when data is accessed and manipulated by multiple processors, and the outcome depends on the sequence or timing of these events. Sequence 1 Sequence 2 Processor 1Processor 2 Processor 1 Processor 2 lw $t0, counter addi $t0, $t0, 1 lw $t0, counter sw $t0, counter addi $t0, $t0, 1 lw $t0, counter addi $t0, $t0, 1 addi $t0, $t0, 1 sw $t0, counter sw $t0, counter counter increases by 2 counter increases by 1 !! lw $t0, counter addi $t0, $t0, 1 sw $t0, counter

December 1, 2006©2006 Craig Zilles12 What is the minimum value at the end of the program?

December 1, 2006©2006 Craig Zilles13 Atomic operations  You can show that if the sequence is particularly nasty, the final value of counter may be as little as 2, instead of  To fix this, we must do the load-add-store in a single step —We call this an atomic operation —We’re saying: “Do this, and don’t allow other processors to interleave memory accesses while doing this.”  “Atomic” in this context means “as if it were a single operation” —either we succeed in completing the operation with no interruptions or we fail to even begin the operation (because someone else was doing an atomic operation) —Furthermore, it should be “isolated” from other threads.  x86 provides a “lock” prefix that tells the hardware: “don’t let anyone read/write the value until I’m done with it” —Not the default case (because it is slower!)

December 1, 2006©2006 Craig Zilles14 What if we want to generalize beyond increments?  The lock prefix only works for individual x86 instructions.  What if we want to execute an arbitrary region of code without interference? —Consider a red-black tree used by multiple threads.

December 1, 2006©2006 Craig Zilles15 What if we want to generalize beyond increments?  The lock prefix only works for individual x86 instructions.  What if we want to execute an arbitrary region of code without interference? —Consider a red-black tree used by multiple threads.  Best mainstream solution: Locks —Implements mutual exclusion You can’t have it if I have it, I can’t have it if you have it

December 1, 2006©2006 Craig Zilles16 What if we want to generalize beyond increments?  The lock prefix only works for individual x86 instructions.  What if we want to execute an arbitrary region of code without interference? —Consider a red-black tree used by multiple threads.  Best mainstream solution: Locks —Implement “mutual exclusion” You can’t have it if I have, I can’t have it if you have it when lock = 0, set lock = 1, continue lock = 0

December 1, 2006©2006 Craig Zilles17 Lock acquire code High-level versionMIPS version unsigned lock = 0; while (1) { if (lock == 0) { lock = 1; break; }  What problem do you see with this? spin: lw$t0, 0($a0) bne$t0, 0, spin li$t1, 1 sw$t1, 0($a0) &lock

December 1, 2006©2006 Craig Zilles18 Race condition in lock-acquire spin: lw$t0, 0($a0) bne$t0, 0, spin li$t1, 1 sw$t1, 0($a0)

December 1, 2006©2006 Craig Zilles19 Doing “lock acquire” atomically  Make sure no one gets between load and store  Common primitive: compare-and-swap (old, new, addr) —If the value in memory matches “old”, write “new” into memory temp = *addr; if (temp == old) { *addr = new; } else { old = temp; }  x86 calls it CMPXCHG (compare-exchange) —Use the lock prefix to guarantee it’s atomicity

December 1, 2006©2006 Craig Zilles20 Using CAS to implement locks  Acquiring the lock: lock_acquire: li $t0, 0 # old li $t1, 1 # new cas $t0, $t1, lock beq $t0, $t1, lock_acquire # failed, try again  Releasing the lock: sw $0, lock

December 1, 2006©2006 Craig Zilles21 Conclusions  When parallel threads access the same data, potential for data races —Even true on uniprocessors due to context switching  We can prevent data races by enforcing mutual exclusion —Allowing only one thread to access the data at a time —For the duration of a critical section  Mutual exclusion can be enforced by locks —Programmer allocates a variable to “protect” shared data —Program must perform: 0  1 transition before data access — 1  0 transition after  Locks can be implemented with atomic operations —(hardware instructions that enforce mutual exclusion on 1 data item) —compare-and-swap If address holds “old”, replace with “new”