Threads Cannot Be Implemented As a Library

Presentation transcript:

Threads Cannot Be Implemented As a Library, or how C Pthreads / user-library threading is broken
Hans-J. Boehm
Presented for CS533 @ PSU by Larry Diehl

Thesis
- Threads cannot be implemented as a user library.
- Threads must be part of the programming language specification.

Outline
- Course context
- Motivation
- Specifications
- 3 specification bugs
- Performance
- Conclusion

Course Context
- Heavyweight kernel threads
- Lightweight user threads
- Swap out the scheduling / threading library
- Small static number of threads
- Large dynamic number of threads
- Problems for the C language
- More generally, could use different languages to preserve the swapping "dream"

Motivation for Threads
- Multithreaded programs are pervasive
  - Word processor: character input and spellchecking
  - Background processing and rendering of multiple GUI windows
- Multiprocessors are now common
  - Utilize more processing power

Programming Language Specification
- Written in formal logic using judgements
- Model-theoretic or syntactic abstraction of local context, definitions, state, types, threads, etc.
- Semantics define how these interact
- Defines a memory model
- Concurrent execution is part of the formal semantics, e.g. in Java

Unspecified Parts
- Can be implemented differently by each compiler
- The user application is responsible for knowing the specifics if it cares

C Specification Consumers
- C language specification [formal, without threads]
- C compiler
- Pthreads specification [informal, with threads]
- User program

Pthreads Specification
- Informal, i.e. written in English
- Justification: programmers cannot read formal specifications
- If it were formal, it would extend the C language specification
- If C and Pthreads had formal proofs of metatheory, the paper would not exist

Pthreads Specification
- Access to memory locations shared by multiple threads must be restricted via synchronization primitives: pthread_mutex_lock/unlock
- The C compiler logically treats the primitives like function calls that access global data
- The primitives prevent hardware instruction reordering via additional instructions (memory barriers)

Pthreads Specification
- The C compiler logically treats the primitives like function calls that access global data
- pthread_mutex_lock() is like f(), where f() { x++ }
- Initial state: x = 0

Source order:
    x = 1
    f()        [x==2]

Prevented incorrect reordering:
    f()        [x==1]
    x = 1

Pthreads Specification Problems
- Incorrect / not safe
- Not always performant, due to the lack of non-locking primitives

Memory Model
- Defines which variable assignments are visible to other threads
- More practically: which code transformations to consider when determining whether a variable is shared between threads
- If a variable is shared after transformations, then lock

Sequential Consistency Memory Model

Initial state: x = y = 0
Thread 1: x = 1; r1 = y
Thread 2: y = 1; r2 = x

Interleaving
Thread 1: x = 1; r1 = y
Thread 2: y = 1; r2 = x

One sequentially consistent interleaving:
    x = 1
    y = 1
    r1 = y     [1]
    r2 = x     [1]

Interleaving
Thread 1: x = 1; r1 = y
Thread 2: y = 1; r2 = x

Another sequentially consistent interleaving:
    y = 1
    r2 = x     [0]
    x = 1
    r1 = y     [1]

Sequential Consistency
- Intuitive
- Not used much in practice
- Hard to optimize, because it rules out:
  - Reordering of operations by the compiler
  - Reordering of operations by the hardware

Weaker Memory Model

Thread-sequential Reordering
Thread 1 (source):     x = 1; r1 = y
Thread 1 (reordered):  r1 = y; x = 1
Thread 2 (source):     y = 1; r2 = x
Thread 2 (reordered):  r2 = x; y = 1

Reordering & Interleaving
Thread 1 (source):     x = 1; r1 = y
Thread 1 (reordered):  r1 = y; x = 1
Thread 2 (source):     y = 1; r2 = x
Thread 2 (reordered):  r2 = x; y = 1

Interleaving of the reordered threads:
    r1 = y     [0]
    r2 = x     [0]
    x = 1
    y = 1
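The difference between the two memory models can be checked by brute force. The sketch below is not from the slides: it enumerates all six interleavings of the two-thread example and reports whether r1 == 0 and r2 == 0 can both hold at the end. The names `enum op`, `both_zero_possible`, and the `T1_*` / `T2_*` arrays are illustrative assumptions.

```c
#include <stdbool.h>

/* The four operations of the example; names are illustrative only. */
enum op { STORE_X, LOAD_Y_INTO_R1, STORE_Y, LOAD_X_INTO_R2 };

/* Run one interleaving: schedule[i] says which thread (0 or 1) takes step i. */
static void run(const enum op t1[2], const enum op t2[2],
                const int schedule[4], int *r1, int *r2)
{
    int x = 0, y = 0;
    int next[2] = {0, 0};          /* next operation index per thread */
    *r1 = *r2 = -1;
    for (int i = 0; i < 4; i++) {
        int t = schedule[i];
        enum op o = (t == 0) ? t1[next[0]++] : t2[next[1]++];
        switch (o) {
        case STORE_X:        x = 1;   break;
        case LOAD_Y_INTO_R1: *r1 = y; break;
        case STORE_Y:        y = 1;   break;
        case LOAD_X_INTO_R2: *r2 = x; break;
        }
    }
}

/* True if some interleaving of t1 and t2 ends with r1 == 0 and r2 == 0. */
bool both_zero_possible(const enum op t1[2], const enum op t2[2])
{
    /* The six ways to interleave two steps from each of two threads. */
    static const int schedules[6][4] = {
        {0, 0, 1, 1}, {0, 1, 0, 1}, {0, 1, 1, 0},
        {1, 0, 0, 1}, {1, 0, 1, 0}, {1, 1, 0, 0},
    };
    for (int s = 0; s < 6; s++) {
        int r1, r2;
        run(t1, t2, schedules[s], &r1, &r2);
        if (r1 == 0 && r2 == 0)
            return true;
    }
    return false;
}

/* Program order as written in the source, and after per-thread reordering. */
const enum op T1_SOURCE[2]    = { STORE_X, LOAD_Y_INTO_R1 };
const enum op T2_SOURCE[2]    = { STORE_Y, LOAD_X_INTO_R2 };
const enum op T1_REORDERED[2] = { LOAD_Y_INTO_R1, STORE_X };
const enum op T2_REORDERED[2] = { LOAD_X_INTO_R2, STORE_Y };
```

Calling `both_zero_possible(T1_SOURCE, T2_SOURCE)` returns false, since no sequentially consistent interleaving produces r1 == r2 == 0. With the reordered arrays it returns true, which is the outcome shown on the slide.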

3 Pthreads Bugs

1. Concurrent Modification
- Pthreads mandates protecting shared variables
- Which variables need protection is determined by the memory model, which is unspecified
- If you assume sequential consistency, you do not need a lock
- However, a bug occurs because speculative-store transformations, which are only valid under weaker consistency, are performed

Interleaving, Sequential Consistency
Initial state: x = y = 0
Thread 1: if (x == 1) y++
Thread 2: if (y == 1) x++

Interleaving, Sequential Consistency
Thread 1: if (x == 1) y++
Thread 2: if (y == 1) x++

Interleaving:
    if (x == 1) // noop     [y==0]
    if (y == 1) // noop     [x==0]

Speculative Store & Interleaving, Weaker Consistency
Thread 1 (source):
    if (x == 1) y++
Thread 1 (transformed):
    y++                     [y==1]
    if (x != 1) y--         [y==0]
Thread 2 (source):
    if (y == 1) x++
Thread 2 (transformed):
    x++                     [x==1]
    if (y != 1) x--         [x==0]

Speculative Store & Interleaving, Weaker Consistency
Thread 1 (transformed):
    y++                     [y==1]
    if (x != 1) y--         [y==0]
Thread 2 (transformed):
    x++                     [x==1]
    if (y != 1) x--         [x==0]

Interleaving of the transformed threads:
    y++                     [y==1]
    x++                     [x==1]
    if (x != 1) // noop     [y==1]
    if (y != 1) // noop     [x==1]
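The interleaving above can be replayed deterministically. The following sketch is an illustration rather than real multithreaded code: it executes the statements of both threads in one function, in the order shown on the slide. `struct state`, `run_source`, and `run_speculative` are hypothetical names.

```c
/* Shared variables from the example; both start at 0. */
struct state { int x, y; };

/* Source program. Neither condition ever holds, so every interleaving
 * of the two statements leaves x == 0 and y == 0. */
struct state run_source(void)
{
    struct state s = {0, 0};
    if (s.x == 1) s.y++;        /* Thread 1 */
    if (s.y == 1) s.x++;        /* Thread 2 */
    return s;
}

/* Program after the speculative-store transformation, executed in the
 * interleaving from the slide. */
struct state run_speculative(void)
{
    struct state s = {0, 0};
    s.y++;                      /* Thread 1: speculative store, y == 1 */
    s.x++;                      /* Thread 2: speculative store, x == 1 */
    if (s.x != 1) s.y--;        /* Thread 1: x is 1, so nothing is undone */
    if (s.y != 1) s.x--;        /* Thread 2: y is 1, so nothing is undone */
    return s;
}
```

`run_source()` ends with x == 0 and y == 0, while `run_speculative()` ends with x == 1 and y == 1, a result the source program cannot produce under sequential consistency.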

2. Rewriting Adjacent Data
- The previous "speculative store" optimization looks fishy because it introduces operations on a variable that appears in the source
- The current bug happens even though the optimization only adds operations on variables that are not in the original source
- It occurs due to adjacent-data transformations when partially updating C structs

Adjacency Transformation
Initial state: struct { int a:17; int b:15 } x, with a and b both 0 (the 17-bit width of a matches the 0x1ffff mask below)

Source:
    x.a = 42

Compiled as:
    tmp = x             // Read both fields into 32-bit variable.
    tmp &= ~0x1ffff     // Mask off old a.
    tmp |= 42
    x = tmp             // Overwrite all of x.

Adjacency Transformation
Locking a but not b, as b is not shared between threads

Thread 1 (source):       x.a = 1
Thread 1 (transformed):  x.a = 1; x.b = 0
Thread 2 (source):       x.a = 2; x.b = 3
Thread 2 (transformed):  x.a = 2; x.b = 3

Adjacency Transformation & Interleaving
Thread 1 (source):       x.a = 1
Thread 1 (transformed):  x.a = 1; x.b = 0
Thread 2 (source):       x.a = 2; x.b = 3
Thread 2 (transformed):  x.a = 2; x.b = 3

Interleaving of the transformed threads:
    x.a = 1
    x.a = 2
    x.b = 3
    x.b = 0
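The lost update to b can be reproduced at the word level. The sketch below is an assumption-laden model, not compiler output: it represents x as one 32-bit word with a in the low 17 bits and b in the high 15 bits (real bit-field layout is implementation-defined), and replays a single-function interleaving in which Thread 2 runs between Thread 1's read and its write-back. All helper names are hypothetical.

```c
#include <stdint.h>

#define A_MASK  0x1ffffu    /* low 17 bits hold field a (assumed layout) */
#define B_MASK  0x7fffu     /* field b is 15 bits wide */
#define B_SHIFT 17          /* field b sits above field a */

/* Return `word` with field a replaced; field b is copied along unchanged. */
static uint32_t splice_a(uint32_t word, uint32_t a)
{
    return (word & ~A_MASK) | (a & A_MASK);
}

/* Return `word` with field b replaced; field a is copied along unchanged. */
static uint32_t splice_b(uint32_t word, uint32_t b)
{
    return (word & A_MASK) | ((b & B_MASK) << B_SHIFT);
}

uint32_t field_a(uint32_t word) { return word & A_MASK; }
uint32_t field_b(uint32_t word) { return word >> B_SHIFT; }

/* Thread 1 performs x.a = 1 as read / modify / write of the whole word;
 * Thread 2 performs x.a = 2 and x.b = 3 in between. */
uint32_t lost_update(void)
{
    uint32_t x = 0;                 /* shared word: a == 0, b == 0 */
    uint32_t tmp = x;               /* Thread 1: read both fields */
    x = splice_a(x, 2);             /* Thread 2: x.a = 2 */
    x = splice_b(x, 3);             /* Thread 2: x.b = 3 */
    tmp = splice_a(tmp, 1);         /* Thread 1: update a in private copy */
    x = tmp;                        /* Thread 1: overwrite all of x */
    return x;
}
```

After `lost_update()`, `field_b` of the result is 0 rather than 3: Thread 1 never mentions b in its source, yet its write-back restores the stale value of b.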

3. Register Promotion
- Sometimes we would like to promote a variable to a register for a speedup
- Example: promoting a variable used in a loop to a register because profiling determined that the loop usually runs without threads

mt is whether or not we are using threads

    for (...) {
        ...
        if (mt) lock()
        x = ...[x]...
        if (mt) unlock()
    }

    r = x
    for (...) {
        ...
        if (mt) {
            x = r
            lock()
            r = x
        }
        r = ...[r]...
        if (mt) {
            x = r
            unlock()
            r = x
        }
    }
    x = r

lock() is like f(), where f() { x++ }
Initial state: x = 0

Thread 1 (source):
    for (...) {
        f()
        x = 2*x
    }

Thread 1 (transformed):
    r = 0
    for (...) {
        x = r
        f()
        r = x
        r = 2*r
    }

Now we have an x read "without a lock"; threads could interleave such lockless code.
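To make the hazard concrete, the following sketch replays one interleaving in a single function. It assumes mt is true, uses x = x + 1 as the loop body, and models a second thread's locked update with the hypothetical helper `other_thread_update`, which runs between two loop iterations. The lock calls appear only as comments because no real threads are involved.

```c
static int x;   /* shared variable, protected by the lock in the source */

/* Stands in for another thread doing: lock(); x += 100; unlock(); */
static void other_thread_update(void) { x += 100; }

/* Source loop: every access to x happens while the lock is held. */
int run_source_loop(void)
{
    x = 0;
    for (int i = 0; i < 2; i++) {
        /* lock() */
        x = x + 1;
        /* unlock() */
        if (i == 0)
            other_thread_update();  /* other thread runs between iterations */
    }
    return x;
}

/* Loop after register promotion: x is cached in r, then written back and
 * reloaded around each lock operation, outside the critical section. */
int run_promoted_loop(void)
{
    x = 0;
    int r = x;
    for (int i = 0; i < 2; i++) {
        x = r;  /* lock() */   r = x;   /* x = r stores a possibly stale r */
        r = r + 1;
        x = r;  /* unlock() */ r = x;
        if (i == 0)
            other_thread_update();  /* same point as in the source loop */
    }
    x = r;
    return x;
}
```

`run_source_loop()` returns 102, while `run_promoted_loop()` returns 2: the store `x = r` before the second lock() overwrites the other thread's update with the stale register value.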

Poor Performance
- Locking is useful for most concurrent programs
- Some concurrent programs can be expressed more efficiently by directly using the hardware primitives that locks implicitly use

Sieve of Eratosthenes

    for (my_prime = 2; my_prime < 10000; my_prime++)
        if (!get(my_prime))
            for (multiple = my_prime; multiple < 100000000; multiple += my_prime)
                if (!get(multiple))
                    set(multiple)

Sieve of Eratosthenes
[Figure: outer loop and inner loop over the values 2, 3, 4, 6, 8, 9, 12, 16]

Sieve execution time for bit array (secs)

Sieve execution time for bit array (secs)
- The mutex is around segments of the array
- The spin lock performs worse when processors are in heavy use
- The lock-free (atomic instruction) version does not have this problem
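A lock-free bit array of the kind compared above might look like the sketch below. It is not the benchmark code: it uses C11 `<stdatomic.h>` (`atomic_fetch_or`, `atomic_load`), a much smaller limit than the slide's 100,000,000, and starts the inner loop at 2 * my_prime so that primes stay unmarked and can be counted. `SIEVE_LIMIT`, `sieve_worker`, and `count_primes` are assumed names.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define SIEVE_LIMIT 1000u                           /* assumed, kept small */
#define WORD_BITS   (sizeof(unsigned) * CHAR_BIT)   /* bits per array word */

/* Shared bit array; a set bit means "known composite". Zero-initialized. */
static atomic_uint bits[SIEVE_LIMIT / (sizeof(unsigned) * CHAR_BIT) + 1];

static bool get(unsigned n)
{
    return (atomic_load(&bits[n / WORD_BITS]) >> (n % WORD_BITS)) & 1u;
}

/* Atomic OR sets one bit without a lock and without disturbing the
 * neighbouring bits that other threads may be setting concurrently. */
static void set(unsigned n)
{
    atomic_fetch_or(&bits[n / WORD_BITS], 1u << (n % WORD_BITS));
}

/* One worker's pass over the sieve. Several threads could run this at
 * the same time; bits only ever change from 0 to 1, so duplicated work
 * is harmless. */
void sieve_worker(unsigned start)
{
    for (unsigned my_prime = start; my_prime * my_prime < SIEVE_LIMIT; my_prime++)
        if (!get(my_prime))
            for (unsigned multiple = 2 * my_prime; multiple < SIEVE_LIMIT;
                 multiple += my_prime)
                if (!get(multiple))
                    set(multiple);
}

/* Count the numbers in [2, SIEVE_LIMIT) that were never marked. */
unsigned count_primes(void)
{
    unsigned count = 0;
    for (unsigned n = 2; n < SIEVE_LIMIT; n++)
        if (!get(n))
            count++;
    return count;
}
```

Run single-threaded, `sieve_worker(2)` followed by `count_primes()` yields 168, the number of primes below 1000. Expressing `set` this way requires atomic operations with defined semantics, which is the kind of primitive the slides argue a library-only threads specification cannot offer.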

Conclusion
- Threading should be part of a formal language specification rather than a user library
- Failing to do so results in incorrect compilation and in the inability to express lock-free algorithms, which results in poor performance