Multicore programming

Presentation on theme: "Multicore programming"— Presentation transcript:

1 Multicore programming
Advanced transactional memory, and OpenMP
Week 6 – Monday
Trevor Brown

2 Last time Transactional memory (TM)
Transactional memory (TM)
Intel’s restricted transactional memory (RTM / HTM / TSX-NI)
Transactional lock elision (TLE)
Hash table
External binary search tree
Downside: TLE does not do well with “heavy” operations (which abort a lot)
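The TLE structure recapped above can be sketched in C++. This is my own illustrative sketch, not the course's code: on TSX hardware the fast path would use _xbegin()/_xend() from <immintrin.h>, but here those calls are stubbed out (always reporting HTM unavailable) so the sketch compiles anywhere and exercises only the lock fallback. tle_execute, glock, and the stub names are all hypothetical.

```cpp
#include <atomic>

// Hypothetical sketch of transactional lock elision (TLE).
// The HTM calls are stubbed so the code compiles without TSX support;
// a real version would use _xbegin()/_xend()/_xabort().
static std::atomic<bool> glock{false};

static bool xbegin_stub() { return false; }  // pretend HTM is unavailable
static void xend_stub() {}

template <class F>
void tle_execute(F critical_section, int max_retries = 5) {
    for (int i = 0; i < max_retries; ++i) {
        if (xbegin_stub()) {                 // transaction started
            if (glock.load()) continue;      // lock held: would _xabort() here
            critical_section();
            xend_stub();                     // commit
            return;
        }
    }
    // fallback: acquire the global lock (simple test-and-set spinlock)
    while (glock.exchange(true)) {}
    critical_section();
    glock.store(false);
}
```

The key TLE detail is that the transaction reads the lock word: a fallback thread's lock acquisition then conflicts with every in-flight transaction and aborts it, which is what keeps the two paths mutually safe.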

3 This time More sophisticated uses of hardware transactional memory (HTM) What to do when transactional lock elision (TLE) performs poorly? Accelerating lock-free KCAS

4 Sometimes TLE performs poorly
Lists: when a thread is traversing the list, a huge prefix of the list is in its read-set!
Example: thread p runs Search(27) while thread q runs Insert(6). Every node p has visited (−∞, 15, 17, 20, 25, 27, +∞ in the diagram) is in p’s read-set, and q’s write-set falls inside that prefix, so any such change aborts p.
TLE is GREAT with few aborts, but sometimes we need a different approach to get good performance.

5 Performance problems with TLE
Traversals are performed inside the transaction
Usually fine for trees and hash tables, where threads naturally spread out
But terrible for lists, where many threads follow the same path
The global locking fallback path kills scalability when aborts are common
What if we use HTM to accelerate existing concurrent algorithms like KCAS?
Does not help with correctness / progress arguments
But can obtain big performance benefits!
Use KCAS for these.

6 Using HTM to accelerate lock-free KCAS
Goal: an HTM-based KCAS that uses lock-free KCAS as the fallback path
Fast path transactions should be able to run concurrently with the fallback path!
Approach (fast path algorithm): wrap KCAS in a transaction: xbegin ; KCAS ; xend
Now each KCAS on the fast path is atomic, just because it is in a transaction
Some parts of the algorithm are no longer needed because of the transaction
Example: DCSS is not needed – could just do two reads and a write in the transaction!
Get rid of parts of the algorithm that are unnecessary

7 Step 1: Adding transactions
Recall (lock-free KCAS):

bool KCAS_LF(addr1..., exp1..., new1...)
1 KCAS_desc * d = new KCAS_desc(addr1...);
2 d->status = Undecided;
3 SortRowsByAddress(d); // can often skip
4 return KCASHelp(d);

Fast path: lock-free KCAS code inside a transaction. (We still use the same old KCASRead.)

bool KCAS(addr1..., exp1..., new1...)
2 int retries = 5;
3 retry:
4 if (xbegin() == XSTARTED) {
5 | bool result = KCAS_TXN(addr1..., exp1..., new1...);
6 | xend();
7 | return result;
8 } else {
9 | if (--retries > 0) goto retry;
10 | return KCAS_LF(addr1..., exp1..., new1...);
11 }

KCAS_TXN is initially the same as KCAS_LF; we will optimize it. The fallback path is the lock-free KCAS code.

8 Optimizing KCAS_TXN bool KCAS_TXN(addr1..., exp1..., new1...)
12 KCAS_desc * d = new KCAS_desc(addr1...);
13 d->status = Undecided;
14 SortRowsByAddress(d);
15 return KCASHelp(d);

Inline this help function so we can modify it here (and not affect the lock-free KCAS code).

9 bool KCAS_TXN(addr1..., exp1..., new1...)
12 KCAS_desc * d = new KCAS_desc(addr1...);
13 d->status = Undecided;
14 SortRowsByAddress(d);
15 if (d->status == Undecided)
16 | int newStatus = Succeeded;
17 | for (int i = 0; i < d->n; i++)
18 | | word_t val2 = DCSS(&d->status, d->row[i].addr, Undecided, d->row[i].exp, packKCAS(d));
19 | | if (val2 != d->row[i].exp) // if DCSS failed
20 | | | if (isKCAS(val2)) // because of a KCAS
21 | | | if (unpack(val2) != d) // a DIFFERENT KCAS
22 | | | KCASHelp(unpack(val2));
23 | | | --i; continue; // retry "locking" this addr
24 | | | else // addr does not contain its exp value
25 | | | newStatus = Failed; break;
26 | CAS(&d->status, Undecided, newStatus);
27 bool succ = (d->status == Succeeded);
28 for (int i = 0; i < d->n; i++)
29 | val = (succ) ? d->row[i].new : d->row[i].exp;
30 | CAS(d->row[i].addr, packKCAS(d), val);
31 return succ;

DCSS: change addr from exp to my KCAS descriptor, only if my descriptor has status Undecided.
Phase 1 (lines 15–26): lock-free “locking”. Phase 2 (lines 27–30): completion. Line 26 is the status CAS.
The status is always Undecided at lines 15, 18, and 26, because this runs entirely inside one transaction.
Can any other thread access my KCAS descriptor? Only if I store a pointer to it and commit (xend)! I never do that: before I return, I always CAS each address to the new value, or back to the expected value.
If no one can see my descriptor, why create it at all?

10 Now that we have no descriptor pointer to store. This becomes a READ.
bool KCAS_TXN(addr1..., exp1..., new1...)
12
13
14 SortByAddress(addr1..., exp1..., new1...);
15
16
17 for (int i = 0; i < d->n; i++)
18 | word_t val2 = DCSS(&d->status, d->row[i].addr, Undecided, d->row[i].exp, packKCAS(d));
19 | if (val2 != d->row[i].exp) // if DCSS failed
20 | | if (isKCAS(val2)) // because of a KCAS
21 | | if (unpack(val2) != d) // a DIFFERENT KCAS
22 | | KCASHelp(unpack(val2));
23 | | --i; continue; // retry "locking" this addr
24 | | else // addr does not contain its exp value
25 | | newStatus = Failed; break;
26 CAS(&d->status, Undecided, newStatus);
27 bool succ = (d->status == Succeeded);
28 for (int i = 0; i < d->n; i++)
29 | val = (succ) ? d->row[i].new : d->row[i].exp;
30 | CAS(d->row[i].addr, packKCAS(d), val);
31 return succ;

11 bool KCAS_TXN(addr1..., exp1..., new1...)
12
13
14 SortByAddress(addr1..., exp1..., new1...);
15
16
17 for (int i = 0; i < d->n; i++)
18 | word_t val2 = *addri;
19 | if (val2 != d->row[i].exp) // if DCSS failed
20 | | if (isKCAS(val2)) // because of a KCAS
21 | | if (unpack(val2) != d) // a DIFFERENT KCAS
22 | | KCASHelp(unpack(val2));
23 | | --i; continue; // retry "locking" this addr
24 | | else // addr does not contain its exp value
25 | | newStatus = Failed; break;
26 CAS(&d->status, Undecided, newStatus);
27 bool succ = (d->status == Succeeded);
28 for (int i = 0; i < d->n; i++)
29 | val = (succ) ? d->row[i].new : d->row[i].exp;
30 | CAS(d->row[i].addr, packKCAS(d), val);
31 return succ;

Descriptor d no longer exists; we must fix every remaining reference to it (lines 17, 19, 21, 26–30).

12 bool KCAS_TXN(addr1..., exp1..., new1...)
12
13
14 SortByAddress(addr1..., exp1..., new1...);
15
16
17 for (int i = 0; i < n; i++)
18 | word_t val2 = *addri;
19 | if (val2 != expi) // if DCSS failed
20 | | if (isKCAS(val2)) // because of a KCAS
21 | | if (unpack(val2) != d) // a DIFFERENT KCAS
22 | | KCASHelp(unpack(val2));
23 | | --i; continue; // retry "locking" this addr
24 | | else // addr does not contain its exp value
25 | | newStatus = Failed; break;
26 CAS(&d->status, Undecided, newStatus);
27 bool succ = (d->status == Succeeded);
28 for (int i = 0; i < n; i++)
29 | val = (succ) ? d->row[i].new : d->row[i].exp;
30 | CAS(d->row[i].addr, packKCAS(d), val);
31 return succ;

Wherever we got n from to put in the descriptor, we pass it to functions etc., to make it available here.
Since d does not exist, the if-statement on line 21 always evaluates to true! Kill it.

13 Could we make xabort work?
bool KCAS_TXN(addr1..., exp1..., new1...)

Small optimization: why not abort instead of commit? We want to return false, and xabort will move our program counter back to the last xbegin, so we will not execute any return statement. Could we make xabort work?

12
13
14 SortByAddress(addr1..., exp1..., new1...);
15
16
17 for (int i = 0; i < n; i++)
18 | word_t val2 = *addri;
19 | if (val2 != expi) // if DCSS failed
20 | | if (isKCAS(val2)) // because of a KCAS
21 | |
22 | | KCASHelp(unpack(val2));
23 | | --i; continue; // retry "locking" this addr
24 | | else // addr does not contain its exp value
25 | | newStatus = Failed; break;
26 CAS(&d->status, Undecided, newStatus);
27 bool succ = (d->status == Succeeded);
28 for (int i = 0; i < n; i++)
29 | val = (succ) ? d->row[i].new : d->row[i].exp;
30 | CAS(d->row[i].addr, packKCAS(d), val);
31 return succ;

If we get here (line 25), KCAS will return false. Any further steps are simply done to roll back previous changes. But we haven’t made any changes! Just return false (and commit)!

14 bool KCAS_TXN(addr1..., exp1..., new1...)
12
13
14 SortByAddress(addr1..., exp1..., new1...);
15
16
17 for (int i = 0; i < n; i++)
18 | word_t val2 = *addri;
19 | if (val2 != expi) // if DCSS failed
20 | | if (isKCAS(val2)) // because of a KCAS
21 | |
22 | | KCASHelp(unpack(val2));
23 | | --i; continue; // retry "locking" this addr
24 | | else // addr does not contain its exp value
25 | | return false;
26 CAS(&d->status, Undecided, newStatus);
27 bool succ = (d->status == Succeeded);
28 for (int i = 0; i < n; i++)
29 | val = (succ) ? d->row[i].new : d->row[i].exp;
30 | CAS(d->row[i].addr, packKCAS(d), val);
31 return succ;

d->status does not exist (lines 26 and 27).

15 Also fix references to d
bool KCAS_TXN(addr1..., exp1..., new1...)
12
13
14 SortByAddress(addr1..., exp1..., new1...);
15
16
17 for (int i = 0; i < n; i++)
18 | word_t val2 = *addri;
19 | if (val2 != expi) // if DCSS failed
20 | | if (isKCAS(val2)) // because of a KCAS
21 | |
22 | | KCASHelp(unpack(val2));
23 | | --i; continue; // retry "locking" this addr
24 | | else // addr does not contain its exp value
25 | | return false;
26
27
28 for (int i = 0; i < n; i++)
29 | val = (succ) ? d->row[i].new : d->row[i].exp;
30 | CAS(d->row[i].addr, packKCAS(d), val);
31 return succ;

If we are here, we saw all of our expected values. No need to test for success or store expected values. We haven’t stored anything yet! Just store new values! Also fix the references to d.

16 bool KCAS_TXN(addr1..., exp1..., new1...)
12
13
14 SortByAddress(addr1..., exp1..., new1...);
15
16
17 for (int i = 0; i < n; i++)
18 | word_t val2 = *addri;
19 | if (val2 != expi) // if DCSS failed
20 | | if (isKCAS(val2)) // because of a KCAS
21 | |
22 | | KCASHelp(unpack(val2));
23 | | --i; continue; // retry "locking" this addr
24 | | else // addr does not contain its exp value
25 | | return false;
26
27
28 for (int i = 0; i < n; i++)
29 |
30 | CAS(addri, expi, newi);
31 return succ;

No need for CAS. We only got here because addri contains its expected value. If that changes, we are aborted! We can just write!

17 If we get here, we succeeded.
bool KCAS_TXN(addr1..., exp1..., new1...)
12
13
14 SortByAddress(addr1..., exp1..., new1...);
15
16
17 for (int i = 0; i < n; i++)
18 | word_t val2 = *addri;
19 | if (val2 != expi) // if DCSS failed
20 | | if (isKCAS(val2)) // because of a KCAS
21 | |
22 | | KCASHelp(unpack(val2));
23 | | --i; continue; // retry "locking" this addr
24 | | else // addr does not contain its exp value
25 | | return false;
26
27
28 for (int i = 0; i < n; i++)
29 |
30 | *addri = newi;
31 return succ;

If we get here, we succeeded. Just return true.

18 Cleaning up white space / comments
bool KCAS_TXN(addr1..., exp1..., new1...)
12 SortByAddress(addr1..., exp1..., new1...);
13 for (int i = 0; i < n; i++)
14 | word_t val2 = *addri;
15 | if (val2 != expi) // if we see a non-expected val
16 | | if (isKCAS(val2)) // --that is a KCAS descriptor
17 | | KCASHelp(unpack(val2)); // unpack & help it
18 | | --i; continue; // retry "locking" this addr
19 | | else // addr contains a non-expected program val
20 | | return false;
21 for (int i = 0; i < n; i++)
22 | *addri = newi;
23 return true;

It seems implausible that we will get to retry “locking” this addr (by reading it). Aren’t we likely to get aborted by then?

19 Helping and transactions
Helping involves touching data other threads are working on (data conflicts!!)
Transactions that help non-transactional operations: if you read some data, and someone else writes to it, your transaction will abort. They are highly likely to write to data you’ve read, since you have found them in the middle of their operation.
Non-transactional operations helping transactions: if you perform a write that a transaction is trying to do also, two cases arise: (a) you write after the transaction commits, and you didn’t really help; (b) you write before the transaction commits, and it must abort.
Transactions helping transactions: No. Just no.

20 Why do we help at all? To guarantee lock-free progress:
Some operation always completes.
How much helping is needed to guarantee progress in our algorithm? What if transactions don’t help, and we don’t help them?
Suppose all transactions abort (so they do not make progress). Then all operations go to their fallback code paths, and run lock-free code. This lock-free code guarantees progress.

21 Removing transactional helping
Note: we could even remove this sorting as a fast-path optimization!

bool KCAS_TXN(addr1..., exp1..., new1...)
12 SortByAddress(addr1..., exp1..., new1...);
13 for (int i = 0; i < n; i++)
14 | word_t val2 = *addri;
15 | if (val2 != expi) // if we see a non-expected val
16 | | if (isKCAS(val2)) // --that is a KCAS descriptor
17 | | KCASHelp(unpack(val2)); // unpack & help it
18 | | --i; continue; // retry "locking" this addr
19 | | else // addr contains a non-expected program val
20 | | return false;
21 for (int i = 0; i < n; i++)
22 | *addri = newi;
23 return true;

Instead of helping, just assume we will get aborted, and issue our own explicit xabort.

22 Final KCAS_TXN implementation
Step 1: sort args by address.

bool KCAS_TXN(addr1..., exp1..., new1...)
12 SortByAddress(addr1..., exp1..., new1...);
13 for (int i = 0; i < n; i++)
14 | word_t val2 = *addri;
15 | if (val2 != expi) // if we see a non-expected val
16 | | if (isKCAS(val2)) // --that is a KCAS descriptor
17 | | xabort(); // give up (and try again)
18 | | else // addr contains a non-expected program val
19 | | return false;
20 for (int i = 0; i < n; i++)
21 | *addri = newi;
22 return true;

Step 2: read all addresses and check if they contain their expected values. If an address contains a non-expected program value, return false. If we encounter a KCAS descriptor, abort (and retry).
Step 3: write new values and return true.
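As a rough C++ rendering of the final three-step algorithm (my own sketch, not the lecture's code): since most machines running this will not have TSX, the hardware transaction is emulated here with a global mutex, which keeps the three steps atomic but loses the real version's abort-on-conflict behavior. Row, kcas_txn, and txn_emulation are names I invented for the sketch.

```cpp
#include <algorithm>
#include <cstdint>
#include <mutex>
#include <vector>

// Sketch of the final fast-path KCAS. On real HTM the mutex below is
// replaced by xbegin/xend, and a conflicting write aborts the
// transaction instead of blocking behind the lock.
struct Row { uint64_t *addr, exp, newv; };

static std::mutex txn_emulation;  // stand-in for one hardware transaction

bool kcas_txn(std::vector<Row> rows) {
    std::lock_guard<std::mutex> g(txn_emulation);
    // Step 1: sort by address (optional on the fast path)
    std::sort(rows.begin(), rows.end(),
              [](const Row &a, const Row &b) { return a.addr < b.addr; });
    // Step 2: read every address; fail if any expected value is missing.
    // (A real implementation would also xabort() on seeing a descriptor.)
    for (const Row &r : rows)
        if (*r.addr != r.exp) return false;
    // Step 3: write all new values and report success
    for (const Row &r : rows) *r.addr = r.newv;
    return true;
}
```

Used, for instance, to atomically increment two cells: kcas_txn({{&a, 0, 1}, {&b, 0, 1}}) succeeds only if both cells still hold their expected values.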

23 Example execution 1
Consider an execution where KCAS is used to increment cells in an array.
Suppose thread p runs on the fallback path, and “lock-free locks” k addresses.
Then thread q runs on the fast path and reads one of these addresses.
Thread q sees a pointer to p’s KCAS descriptor and aborts.
Thread p then completes its KCAS.
Thread q can then retry and perform its KCAS.
[Diagram: an array of counters, shown before and after the two increments.]

24 Example execution 2
Consider an execution where KCAS is used to increment cells in an array.
Suppose thread p runs on the fast path, reads all k addresses, and sees the expected values.
Before p commits, thread q runs on the fallback path and uses CAS to store a pointer to q’s KCAS descriptor in one of these addresses.
Thread p will be immediately aborted by the HTM system due to a data conflict.
[Diagram: the array, with one cell now holding a pointer to q’s KCAS descriptor.]

25 Correctness argument intuition
For simplicity, consider a system with two threads.
Two operations running on the fallback path: both behave correctly because the lock-free algorithm is correct.
Two operations on the fast path: correct because both are atomic, because of transactional memory.
One operation on the fast path and one on the fallback path: claim: the fast path operation does not modify addresses while they are “lock-free locked” by the fallback path operation. I.e., the fast path respects the “lock-free locks” taken by the fallback path.

26 Mechanics of proving correctness

Correctness of each path in isolation: the fallback path is correct in isolation; the fast path is atomic because of transactions, and correct in isolation.
Compatibility between paths: the fast path was obtained from the fallback path by wrapping it in a transaction (which makes it atomic) and then performing correctness-preserving transformations.
To be rigorous, you start with a correct lock-free algorithm, and prove that each transformation preserves correctness.

27 Performance I don’t know exactly, but I think it should be really good
Planning to make this part of assignment 2.
HTM-based KCAS has been studied before [Timnat2015], but their performance evaluation didn’t look at the impact on KCAS itself (only on some data structures built from it). This obscures the performance advantage for KCAS (Amdahl’s law).
Their results have somewhat lower absolute throughputs than I’d expect.

28 Using HTM to implement synchronization primitives like KCAS

Advantages:
Programmer only needs to write one code path (the fast path & fallback path are hidden in the KCAS implementation)
Hides the complexity of proving correctness for interactions between the fast path & fallback path
Makes it practical to design / accelerate data structures with KCAS (should result in great performance)
Code still works on systems with no HTM (just run the fallback path)
Disadvantages:
Still need to prove correctness for searches
Minor: must use KCASRead to read
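The KCASRead mentioned above has to guard against returning a descriptor pointer instead of a program value. A minimal hypothetical sketch, using a low-bit tagging convention of my own choosing (not necessarily the lecture's) and with the helping call elided:

```cpp
#include <cstdint>

// Sketch of why KCASRead is needed: a reader may observe a tagged
// KCAS descriptor pointer in a cell, and must not return it as data.
// Here the low bit marks a descriptor (my convention for this sketch).
static bool isKCAS(uint64_t v) { return (v & 1u) != 0; }

uint64_t KCASRead(uint64_t *addr) {
    while (true) {
        uint64_t v = *addr;          // would be an atomic read in a real impl
        if (!isKCAS(v)) return v;    // ordinary program value: done
        // v is a descriptor pointer: help that KCAS finish, then retry.
        // KCASHelp(unpack(v));      // helping omitted in this sketch
    }
}
```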

29 Summarizing We can use TLE to make designing new data structures easy
We can use advanced HTM-based techniques to make existing data structures faster.
Open question: can we make designing new data structures both easy and fast?
Hybrid transactional memory? Combines HTM with software implementations of transactional memory to guarantee progress. Good algorithms have been designed, but they may be too complex to implement in compilers!
KCAS with some generic theory that proves searches work? Some work has been done in this direction: “generalized hindsight” and “data expansion” lemmas give easy proofs that searches work for data structures that satisfy some simple invariants.

30 What next
This time: the OpenMP library.
Next time: sequence locks; using sequence locks to implement KCAS; accelerating sequence-locking KCAS with HTM; programming tools: Valgrind and GDB (time permitting).
Idea: today we saw how to accelerate a lock-free algorithm with HTM. We also want to see how to accelerate a lock-based algorithm with HTM.

31 OpenMP (Open Multi-Processing)
A powerful library for fork-join parallelism (arguably, what most programs do) Diagram shamelessly stolen from: this presentation

32 OpenMP (Open Multi-Processing)

Easy to use in your own projects (and this class might make it seem less like “magic” and more like “easy shortcuts”).
Tons of features; we just look at a couple of simple tools: parallel sections, parallel for loops, and reductions.
Warning: Windows Subsystem for Linux appears to use a global lock in its implementation of OpenMP. It offered no speedup in my testing. (Try running on a real Linux box if you expected more speedup.)

33 Output: “ hello(0) world(0)”
Parallel section: a shortcut for spawning n threads, where n = # of logical processors in the system, having them all execute the same code block, and then joining them.

Sequential code:

#include <cstdio>
int main() {
    int id = 0;
    printf(" hello(%d)", id);
    printf(" world(%d)\n", id);
}

Output: “ hello(0) world(0)”

Concurrent/parallel code:

#include <cstdio>
#include <omp.h>
int main() {
    #pragma omp parallel               // spawn n threads
    {
        int id = omp_get_thread_num(); // what is my thread ID?
        printf(" hello(%d)", id);      // n threads execute this block
        printf(" world(%d)\n", id);
    }                                  // join (stop) the n threads
}

Output?

34 You can see the parallelism…
Parallel output

35 Why is this useful? Using OpenMP Equivalent code without using OpenMP
Using OpenMP:

#pragma omp parallel
{
    doSomething();
}

Equivalent code without using OpenMP:

const int n = SomehowGetNumLogicalProcessors();
pthread_t *threads = new pthread_t[n];
for (int i = 0; i < n; ++i) {
    if (pthread_create(&threads[i], NULL, doSomething, NULL)) {
        std::cerr << "ERROR: could not create thread" << std::endl;
        exit(-1);
    }
}
for (int i = 0; i < n; ++i) {
    if (pthread_join(threads[i], NULL)) {
        std::cerr << "ERROR: could not join thread" << std::endl;
        exit(-1);
    }
}
delete[] threads;

36 Volatile to prevent benchmark from being optimized into nothing
Parallel for loop: if your for loop iterations are independent, why not run them in parallel?

Sequential code to zero out an array:

int main() {
    long n = L;
    auto array = new volatile long[n];
    for (long i = 0; i < n; ++i) {
        array[i] = 0;
    }
}

Parallel OpenMP version:

#include <omp.h>
int main() {
    long n = L;
    auto array = new volatile long[n];
    #pragma omp parallel for
    for (long i = 0; i < n; ++i) {
        array[i] = 0;
    }
}

(The array is volatile to prevent the benchmark from being optimized into nothing.)
Many threads (# of logical processors) will execute the body of the loop, each performing different iterations. The loop execution is divided into batches of k iterations, and threads repeatedly grab the next batch to do, until no more batches remain.
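The batch-grabbing scheme described above (roughly what OpenMP calls dynamic scheduling) can be sketched without OpenMP using a shared atomic counter. parallel_zero, the thread count, and the chunk size k are my own illustrative choices, not OpenMP's internals:

```cpp
#include <algorithm>
#include <atomic>
#include <thread>
#include <vector>

// Threads repeatedly claim the next batch of k iterations with
// fetch_add until no batches remain, then all threads are joined.
void parallel_zero(volatile long *array, long n,
                   int nthreads = 4, long k = 1024) {
    std::atomic<long> next{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&] {
            long start;
            while ((start = next.fetch_add(k)) < n) {   // claim a batch
                long end = std::min(start + k, n);
                for (long i = start; i < end; ++i) array[i] = 0;
            }
        });
    for (auto &th : threads) th.join();
}
```

Each index is claimed by exactly one thread, because fetch_add hands out disjoint [start, start+k) ranges.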

37 Why is this useful? Using OpenMP
Using OpenMP:

#pragma omp parallel for
for (long i = 0; i < n; ++i) {
    array[i] = 0;
}

Without using OpenMP: starting threads to run loop_body:

const int numThreads = GetNumberOfLogicalProcessors();
pthread_t *threads = new pthread_t[numThreads];
for (int i = 0; i < numThreads; ++i) {
    if (pthread_create(&threads[i], NULL, loop_body, ...)) {
        std::cerr << "ERROR: could not create thread" << std::endl;
        exit(-1);
    }
}
for (int i = 0; i < numThreads; ++i) {
    if (pthread_join(threads[i], NULL)) {
        std::cerr << "ERROR: could not join thread" << std::endl;
        exit(-1);
    }
}
delete[] threads;

Without using OpenMP: loop_body (a static partition of the iterations):

int id = getMyThreadID();
int istart = id * n / numThreads;
int iend = (id+1) * n / numThreads;
if (id == numThreads-1) iend = n;
for (int i = istart; i < iend; i++)
    array[i] = 0;

38 Performance 144 thread Intel machine with 4 sockets
Sequential code to zero out an array (as on slide 36): time to complete 19.2 seconds (12.3s in the OS managing pages, 6.9s writing to the array).
Parallel OpenMP version: time to complete 2.0 seconds (0.4s in the OS, 1.6s writing to the array).
Why only ~9x speedup with 144 threads?

39 Reductions
A reduction takes a vector (array) and turns it into a scalar (single number). Example: summing an array.

Sequential code:

long sum = 0;
for (long i = 0; i < n; ++i) {
    sum += array[i];
}

Naïve use of OpenMP:

long sum = 0;
#pragma omp parallel for
for (long i = 0; i < n; ++i) {
    sum += array[i];
}

Problem: many threads do this, and this increment is not atomic!

How about using fetch&add?

long sum = 0;
#pragma omp parallel for
for (long i = 0; i < n; ++i) {
    __sync_fetch_and_add(&sum, array[i]);
}

Problem: correct, but not very scalable!

Ideally: we want each thread to maintain a local sum while processing a batch, and fetch&add its local sum into the global sum at the end of the batch.
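The ideal scheme just described, per-thread local sums plus one fetch&add per thread, can be sketched with std::thread. parallel_sum and the strided split are my own illustrative choices; OpenMP's actual reduction implementation may differ:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Each thread accumulates a private partial sum over a strided slice
// of the array, then performs a single fetch_add into the global sum.
long parallel_sum(const long *array, long n, int nthreads = 4) {
    std::atomic<long> sum{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < nthreads; ++t)
        threads.emplace_back([&, t] {
            long local = 0;                          // thread-private sum
            for (long i = t; i < n; i += nthreads)   // strided split of work
                local += array[i];
            sum.fetch_add(local);                    // one atomic op per thread
        });
    for (auto &th : threads) th.join();
    return sum.load();
}
```

This keeps contention on the shared counter to one atomic operation per thread instead of one per array element, which is why it scales where the naïve fetch&add loop does not.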

40 OpenMP reductions
OpenMP natively supports reductions over numerous operators (+, *, &, |, …). You must tell OpenMP which variable will be used to store the reduction. Example: summing an array.

Sequential code:

long sum = 0;
for (long i = 0; i < n; ++i) {
    sum += array[i];
}

Time to complete: 4330 ms (146 ms in the OS (paging), 4184 ms reducing the array)

Correct OpenMP reduction:

long sum = 0;
#pragma omp parallel for reduction(+:sum)
for (long i = 0; i < n; ++i) {
    sum += array[i];
}

Time to complete: 185 ms (77 ms in the OS, 108 ms reducing the array)

23x speedup overall, 39x speedup in the reduction.

