A Methodology for Implementing Highly Concurrent Data Objects

1 A Methodology for Implementing Highly Concurrent Data Objects
Maurice Herlihy, October 1991. Presented by Tina Swenson, April 15, 2010. Dr. Herlihy is a professor of Computer Science at Brown University. His focus is multiprocessor programming. He was awarded the Gödel Prize, which honors outstanding journal articles in theoretical computer science.

2 Agenda
Introduction – keywords; motivation; the practical issues addressed; the automatic transformation; the "primitives" used (load_linked, store_conditional); the hardware used; priority queues. The author implements a priority queue to test his new coding paradigm. Pqueues are heaps; heaps are complete binary trees. The root of the binary tree contains the highest-priority value, and each parent node has a higher value than either of its children.
Small Objects (the focus of most of our time) – an object that is small enough to be copied in one instruction.
  Non-blocking transformation – transforming a sequential object into a non-blocking concurrent object: guaranteed correctness; improved performance; the code; the race condition and its solution; issues with CAS; fault tolerance, pros and cons; experimental results; exponential backoff to improve performance.
  Wait-free transformation – based on the non-blocking transformation, but also applying operational combining; keywords.
Large Objects (non-blocking) – objects that are too large to be copied at once, represented by a set of blocks linked by pointers. Memory management is more complex: copying and tracking ownership. Performance improvements – the skew heap. Experimental results.
Conclusion – transforming data from sequential to concurrent; motivation (the problem it solves); results; going forward.

3 Introduction

4 Key Words Critical Section – In the author’s context, CS refers to blocking code. Non-blocking (NB) – some process will complete its operation after a finite number of steps. Wait-free (a.k.a. starvation-free) (WF) – all processes will complete their operations after a finite number of steps.

5 Motivation
Conventional techniques – the use of a critical section (by the author's definition) means only one process has access to the data at a time.
Implementing NB/WF – we cannot use a critical section, since it could cause a process to block forever (thus violating the definitions of NB and WF).
Practical issues addressed: reasoning about concurrent code is hard, and fault tolerance is costly.

6 Automatic Transformations
Allow the programmer to reason and program sequentially. The sequential code is converted into concurrent objects. The author doesn’t specify what performs this transformation! Access to the concurrent object is protected via atomic instructions.

7 Atomics Used
Load_linked – copies the value of the shared variable to a local and starts watching the memory for any other processor accessing it.
Store_conditional – uploads the new version to the shared variable, returning success or failure. If load_linked detects that some other process accessed the memory, store_conditional (SC) will fail.
NOTE: Implemented here in software (these instructions did not exist on the hardware used). "Typically, this behavior is implemented using the cache hardware - on load-linked the hardware starts watching the cache line, and if that cache line is accessed by another process (and hence invalidated) the subsequent SC will fail. This failure might be spurious because (a) the cache line might be invalidated because a different location on that cache line was accessed, or (b) because some other unrelated data that maps to the same cache line was pulled into the cache causing the line in question to be evicted. At this point the hardware loses the ability to tell if the address has been accessed, so it defaults to failure at the SC." - Jon Walpole
The author claims these software "primitives" could easily be implemented in hardware. The MIPS R4000 has LL and SC instructions; the PowerPC has load-linked (lwarx) and store-conditional (stwcx) instructions.
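To make the pattern concrete, here is the canonical LL/SC retry loop. This is a minimal sketch of my own (the increment example is not from the paper), assuming load_linked and store_conditional behave as described above:

    /* Atomically increment a shared counter. load_linked reads the
     * value and starts watching the location; store_conditional
     * fails (returns 0) if anything touched the location in
     * between, in which case we simply retry. */
    int atomic_increment(int *shared)
    {
        int old_val, new_val;
        do {
            old_val = load_linked(shared);
            new_val = old_val + 1;
        } while (!store_conditional(shared, new_val));
        return new_val;
    }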

8 Atomics Used
3 reasons for LL and SC:
1. Efficient implementation in cache-coherent architectures. The CAS instruction is inadequate: less efficient and more complex.
2. LL and SC are easy to use (compared to CAS code). SC only has to check whether the cached copy of the shared variable was invalidated.
3. LL and SC are universal: "powerful enough to transform any sequential object implementation into a NB or WF implementation."
CAS has the same functionality as LL/SC, but the author claims it is more difficult to reason about CAS code.
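To see why the author prefers LL/SC, compare the equivalent CAS retry loop. This is my illustration, assuming a compare_and_swap(addr, expected, new) that atomically replaces *addr with new iff *addr still equals expected, returning nonzero on success:

    /* CAS-based increment. Note the plain read: nothing is
     * "watched" between the read and the CAS. */
    int cas_increment(int *shared)
    {
        int expected, new_val;
        do {
            expected = *shared;
            new_val  = expected + 1;
            /* Succeeds even if *shared changed A -> B -> A in
             * between (the ABA problem). */
        } while (!compare_and_swap(shared, expected, new_val));
        return new_val;
    }

For a plain counter this is harmless, but when the shared variable is a pointer (as in the transformations here), "looks unchanged" is not the same as "was untouched" — that is the extra reasoning burden the author is pointing at, which LL/SC avoids by failing on any intervening access.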

9 Correctness
Linearizability – used as the basic correctness condition for the concurrent objects created by the automatic transformation.
Is this claim really strong enough? What about this quote from p. 18? "...as long as the store_conditional has no spurious failures, each operation will complete after at most 2 loop iterations."

10 Priority Queues
The author implements a priority queue to test his new coding paradigm, using heaps because they are notoriously difficult to program concurrently. Pqueues are heaps; heaps are complete binary trees (full except possibly at the leaf level). The root of the binary tree contains the highest-priority value, and each parent node has a higher value than either of its children.

Dequeue sequential code:

    int pqueue_deq(pqueue_type *p)
    {
        int best;
        if (!p->size) return PQUEUE_EMPTY;
        best = p->element[0];
        p->element[0] = p->element[--p->size];
        pqueue_heapify(p, 0);
        return best;
    }

Notice: no code to protect the shared data!
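The handout calls pqueue_heapify but never shows it. Here is a minimal sketch of what it presumably does — sift the new root down until the heap property is restored. The field names match the dequeue code above; the body is my reconstruction, not the paper's code:

    /* Sift element i down until each parent again outranks both
     * children. Assumes pqueue_type has the int size and
     * int element[] fields used by pqueue_deq above. */
    void pqueue_heapify(pqueue_type *p, int i)
    {
        for (;;) {
            int left = 2 * i + 1, right = 2 * i + 2, largest = i, tmp;
            if (left  < p->size && p->element[left]  > p->element[largest])
                largest = left;
            if (right < p->size && p->element[right] > p->element[largest])
                largest = right;
            if (largest == i) return;    /* heap property restored */
            tmp = p->element[i];
            p->element[i] = p->element[largest];
            p->element[largest] = tmp;
            i = largest;                 /* continue from the child */
        }
    }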

11 Hardware & Software Used
Encore Multimax with 18 National Semiconductor NS32532 processors (CISC, Spring 1987). Code implemented in C.
National Semiconductor is publicly traded, headquartered in Santa Clara, CA, and trades around $14.70. Its product platforms today include Audio, Data Conversion, Interface, and Power Conversion.

12 Small Objects

13 Key Words
Small object – an object that is small enough to be copied in one instruction.
Sequential object – a data structure that occupies a fixed-size, contiguous region of memory. In our example, the heap.
Concurrent object – a shared variable that holds a pointer to a structure with 2 fields:
  version – the heap.
  check[2] – an array of 2 counters used to check for data consistency. If the counters are equal, the data is consistent.

14 Non-Blocking Transformations
Small Objects

15 Non-Blocking Transformation
Transforming a sequential object into a non-blocking concurrent object. Our sequential program code must:
  have no side-effects other than modifying the block (our sequential object) occupied by the object.
  be total: the "sequential operation must return a valid result for every possible state of the data structure." Edge cases must be handled in the sequential code (e.g., pqueue_deq returns PQUEUE_EMPTY on an empty queue rather than blocking).

16 Race Condition
Processes P and Q read a pointer to block b. Q replaces b with b', completes its operation, and begins a new operation that reuses b — so Q may be copying into b while P is still copying out of it. P's copy may not be a valid state of the sequential object.
Solution – code example coming! A consistency check after copying the old version and before applying the sequential operation.

17 The Code: Non-Blocking
    typedef struct {
        pqueue_type version;
        unsigned check[2];
    } Pqueue_type;
    ...

We've converted our sequential object (the heap) into a concurrent object! version is our original heap; check is our flag to help detect the race condition.

18 The Code: Non-Blocking
    ...
    static Pqueue_type *new_pqueue;

    int Pqueue_deq(Pqueue_type **Q)
    {
        Pqueue_type *old_pqueue;
        pqueue_type *old_version, *new_version;
        int result;
        unsigned first, last;

Local copies of pointers: old_pqueue points at the concurrent object; old_version and new_version point at the heaps. result is the priority-queue value removed by this Pqueue_deq operation. first and last help us detect the race condition. More later.

19 The Code: Non-Blocking
    int Pqueue_deq(Pqueue_type **Q)
    {
        ...
        while (1) {
            old_pqueue = load_linked(Q);
            old_version = &old_pqueue->version;
            new_version = &new_pqueue->version;
            first = old_pqueue->check[1];
            copy(old_version, new_version);
            last = old_pqueue->check[0];
            if (first == last) {
                result = pqueue_deq(new_version);
                if (store_conditional(Q, new_version)) break;
            }
        } /* end while */
        new_pqueue = old_pqueue;
        return result;
    }

load_linked copies the concurrent object's pointer (loads it into a register) and starts watching that memory for any other processor accessing it. We then dereference the old and new objects, saving pointers to their versions (heaps).

20 The Code: Non-Blocking
(Same code as slide 19.)
Preventing the race condition: read check[1], copy the old version into the new one, then read check[0]. If the check values do not match, the copy may be inconsistent — we failed, so loop again.

21 The Code: Non-Blocking
(Same code as slide 19.)
If the check values DO match, now we can perform our dequeue operation on the copy. Then try to publicize the new heap via store_conditional, which could fail and send us back around the loop: the SC fails if the location has been accessed since it was load-linked, and it can also fail spuriously (see the Jon Walpole quote on slide 7). Lastly, reclaim the old concurrent object as the new scratch copy (new_pqueue = old_pqueue) and return our priority-queue result.

22 Experimental Results Small Object, Non-Blocking (naive)
Ugh! That’s terrible! Bus contention Starvation Wasted Parallelism! Benchmark – million enqueue/dequeue pairs. Even taking the software implemented atomics instructions into account, the performance of the NB code is poor: Bus Contention – The spin lock uses a cached copy of the lock, avoiding bus contension. The NB code does work on the memory repeated, even when store_conditional or the consistency check fail. This increases traffic on the bus. Starvation - enqueue is slower than dequeue. It ends up starving for processor time.

23 Exponential Backoff

        ...
        if (first == last) {
            result = pqueue_deq(new_version);
            if (store_conditional(Q, new_version)) break;
        }
        if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
        delay = random() % max_delay;
        for (i = 0; i < delay; i++);
    } /* end while */
    new_pqueue = old_pqueue;
    return result;
}

When the consistency check or the store_conditional fails, back off for a random (and geometrically growing) amount of time before retrying!

24 Experimental Results Small Object, Non-Blocking (back-off)
Better, but NB is still not as fast as spin locks (with backoff). Wasted parallelism!
Benchmark – a million enqueue/dequeue pairs.
Spin lock: a test-and-test-and-set loop repeatedly reads the lock until it observes the lock is free, and then tries the test&set operation.
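For reference, here is a minimal sketch of that test-and-test-and-set spin lock — my illustration, assuming a test_and_set that atomically sets the flag to 1 and returns its previous value:

    void spin_lock(volatile int *lock)
    {
        for (;;) {
            while (*lock)             /* "test": spin on the cached copy, */
                ;                     /* generating no bus traffic        */
            if (!test_and_set(lock))  /* "test&set": previous value was 0 */
                return;               /* ...so we now hold the lock       */
        }
    }

    void spin_unlock(volatile int *lock)
    {
        *lock = 0;
    }

The inner read-only loop is what keeps the lock cheap under contention: waiting processors hit their caches instead of the bus.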

25 Wait-Free Transformations
Small Objects

26 Key Words
Operational combining – a process starts an operation and records the call in an invocation; upon completion of the operation, the result is recorded in a result structure.

27 Wait-Free Protocol
Based on the non-blocking transformation, but also applying operational combining. Record an operation in an invocation. Invocation structure:
  operation name
  argument value
  toggle bit
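A plausible C layout for that invocation record — the handout never declares it, so the type and field names below are assumptions matched to how apply() uses them on slide 30:

    /* One slot per process in the shared announce array (the array
     * itself is declared with the other globals on slide 31). */
    typedef struct {
        int      op_name;   /* e.g. ENQ_CODE or DEQ_CODE        */
        int      arg;       /* argument value, if the op has one */
        unsigned toggle;    /* flipped on each new invocation    */
    } inv_type;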

28 Wait-Free Protocol
Concurrent object:
  version
  check[2]
  responses[n] – new to our concurrent object! The pth element is the result of process p's last completed operation.
All the processes share an announce array used to announce invocations.

29 Wait-Free Protocol When an operation starts, record the operation name and argument in announce[p] When a process records a new invocation, flip the toggle bit inside the invocation struct! Flipping the bit distinguishes old invocations from new invocations.

30 Wait-Free Protocol New Function: Apply()
Does the work of any waiting processes, as well as the caller's own.

    void apply(inv_type announce[MAX_PROCS], Pqueue_type *object)
    {
        int i;
        for (i = 0; i < MAX_PROCS; i++) {
            if (announce[i].toggle != object->responses[i].toggle) {
                switch (announce[i].op_name) {
                case ENQ_CODE:
                    object->responses[i].value =
                        pqueue_enq(&object->version, announce[i].arg);
                    break;
                case DEQ_CODE:
                    object->responses[i].value =
                        pqueue_deq(&object->version);
                    break;
                default:
                    fprintf(stderr, "Unknown operation code\n");
                    exit(1);
                }
                object->responses[i].toggle = announce[i].toggle;
            }
        }
    }

For ALL processes, do ALL the outstanding work!

31 The Code: Wait-Free

    typedef struct {
        pqueue_type version;
        unsigned check[2];
        response_type responses[MAX_PROCS]; /* value + toggle per process */
    } Pqueue_type;

    static Pqueue_type *new_pqueue;
    static int max_delay;
    static inv_type announce[MAX_PROCS];
    static int P; /* current process ID */
    ...

responses is new to the concurrent object: the Pth element holds the result of process P's last completed operation. announce[P] tracks each process's current invocation (operation name, argument, toggle bit).

32 The Code: Wait-Free

    int Pqueue_deq(Pqueue_type **Q)
    {
        Pqueue_type *old_pqueue;
        pqueue_type *old_version, *new_version;
        int i, delay, result, new_toggle;
        unsigned first, last;

        announce[P].op_name = DEQ_CODE;
        new_toggle = announce[P].toggle = !announce[P].toggle;
        if (max_delay > 1) max_delay = max_delay >> 1;

Record the operation name and flip the toggle bit (flipping distinguishes this invocation from the previous one). The starting backoff delay is also halved.

33 Check the toggle bit TWICE!
The author claims it avoids a race condition??? (Explained on slide 38.)

        while (((*Q)->responses[P].toggle != new_toggle) ||
               ((*Q)->responses[P].toggle != new_toggle)) {
            old_pqueue = load_linked(Q);
            old_version = &old_pqueue->version;
            new_version = &new_pqueue->version;
            first = old_pqueue->check[1];
            memcopy(old_version, new_version, sizeof(pqueue_type));
            last = old_pqueue->check[0];
            if (first == last) {
                apply(announce, new_pqueue);
                if (store_conditional(Q, new_version)) break;
            }
            if (max_delay < DELAY_LIMIT) max_delay = 2 * max_delay;
            delay = random() % max_delay;
            for (i = 0; i < delay; i++);
        } /* end while */
        new_pqueue = old_pqueue;
        result = (*Q)->responses[P].value;
        return result;
    }

Note: P's own dequeue is performed inside apply() along with everyone else's outstanding operations (the handout also called pqueue_deq directly, which would have dequeued twice), and the result is read back from the responses array.

34 Same as before: load_linked the object, dereference the old and new versions, read check[1], copy the version, read check[0]. (Same code as slide 33.)

35 Pretty much same as before: if the check values match, the copy is consistent and we can apply operations to it. (Same code as slide 33.)

36 Apply pending operations to the NEW version: apply(announce, new_pqueue). I added this call! It seems to be missing from the handout. (Same code as slide 33.)

37 Same code as slide 33: store_conditional the data; back off if the SC or the consistency check fails; if the SC succeeds, break out of the while loop, reclaim the old data, and return the result from the responses array.

38 Race Condition
P reads a pointer to version v (our heap). Q replaces v with v'. Q starts another operation, checks the announce array, applies P's operation to v', and stores the result in v''s response array. P sees the toggle bits match and returns — but Q then fails to install v' as the next version, so P has returned a result that never took effect.
Solution: check the value of the toggle bit twice. What?

39 Experimental Results
Wasted parallelism!
Author's comments: "Substantial overhead imposed by scanning announce[] and by copying the version's response[] with each operation." "The probabilistic guarantee against starvation provided by exponential backoff may be preferable to the deterministic guarantee provided by operation combining."

40 Large Objects

41 Key Words
Large objects – objects that are too large to be copied at once; represented by a set of blocks linked by pointers.
Logically distinct – an operation creates and returns a new object based on the old one. The old and new versions may share a lot of memory. Control of copying is given to the programmer.

42 Memory Management
Each process owns its own pool of blocks, tracked in a struct called a recoverable set (set_type). Blocks are in one of 3 states: committed, allocated, and freed. Operations (see the sketch below):
  set_alloc moves a block from committed to allocated and returns its address.
  set_free moves a block to freed.
  set_prepare marks blocks in allocated as consistent.
  set_commit sets committed to the union of freed and committed.
  set_abort sets freed and allocated to the empty set.
This slide directly from Professor Walpole!
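To make those operations concrete, here is a minimal sketch of one plausible set_type. The representation, the names, and the choice to return aborted allocations to the pool (rather than leaking them) are my assumptions, not the paper's code; set_prepare is omitted:

    #define POOL_SIZE 64

    typedef struct block block_type;        /* opaque object block */

    typedef struct {
        block_type *committed[POOL_SIZE]; int n_committed;
        block_type *allocated[POOL_SIZE]; int n_allocated;
        block_type *freed[POOL_SIZE];     int n_freed;
    } set_type;

    /* set_alloc: move a block from committed to allocated, return it. */
    block_type *set_alloc(set_type *s)
    {
        block_type *b;
        if (s->n_committed == 0) return 0;  /* pool exhausted */
        b = s->committed[--s->n_committed];
        s->allocated[s->n_allocated++] = b;
        return b;
    }

    /* set_free: tentatively release a block; it only really becomes
     * reusable if the operation commits. */
    void set_free(set_type *s, block_type *b)
    {
        s->freed[s->n_freed++] = b;
    }

    /* set_commit: the store_conditional succeeded. Freed blocks
     * rejoin the pool (committed = committed U freed); allocated
     * blocks now belong to the installed version. */
    void set_commit(set_type *s)
    {
        while (s->n_freed > 0)
            s->committed[s->n_committed++] = s->freed[--s->n_freed];
        s->n_allocated = 0;
    }

    /* set_abort: the store_conditional failed. The freed list is
     * discarded (those blocks still belong to the old version); the
     * aborted allocations go back to the pool for reuse on retry. */
    void set_abort(set_type *s)
    {
        while (s->n_allocated > 0)
            s->committed[s->n_committed++] = s->allocated[--s->n_allocated];
        s->n_freed = 0;
    }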

43 Performance Improvements
Skew heap – an approximately-balanced binary tree. Easier to maintain, thus better performance: the update process doesn't touch most of the tree.
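To see why an update touches so little of the tree, here is the standard skew-heap merge — my illustration, using max-heap ordering to match the priority queue above; the node layout is assumed:

    typedef struct node {
        int          value;
        struct node *left, *right;
    } node_t;

    /* Merge two skew heaps. We merge down the right spine, swapping
     * children at every step; the swap is what keeps the tree
     * approximately balanced without storing balance information. */
    node_t *skew_merge(node_t *a, node_t *b)
    {
        node_t *tmp;
        if (a == 0) return b;
        if (b == 0) return a;
        if (a->value < b->value) { tmp = a; a = b; b = tmp; }
        tmp = skew_merge(a->right, b);  /* merge into right subtree */
        a->right = a->left;             /* ...then swap the children */
        a->left  = tmp;
        return a;
    }

Enqueue is a merge with a one-node heap; dequeue removes the root and merges its two children. Only the nodes along the merge path change, so in the large-object scheme only those blocks would need to be copied via set_alloc.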

44 Experimental Results Author’s conclusion: Ummm. You figure out a ingenious way to implement large blocks...

45 Conclusion

46 Transforming Data
Transforming data from sequential to concurrent: let the programmer write sequentially, without worrying about shared memory, and let some mechanism (e.g., a compiler) perform the transformation to concurrent code automatically.
Key instructions: load_linked, store_conditional.

47 General Observation
Is it really worth all the extra work and wasted parallelism just to avoid starvation? Just to ensure fault tolerance?
"We propose extremely simple and efficient memory management techniques..." Is this true? It doesn't seem simple to me!

48 Going Forward Resulting Research? Are we in the wrong paradigm?

49 Thank You

