Simple, Fast, and Practical Non- Blocking and Blocking Concurrent Queue Algorithms Presenter: Jim Santmyer By: Maged M. Micheal Michael L. Scott Department.

Simple, Fast, and Practical Non- Blocking and Blocking Concurrent Queue Algorithms Presenter: Jim Santmyer By: Maged M. Micheal Michael L. Scott Department of Computer Science, University of Rochester

Simple, Fast, and Practical Non- Blocking and Blocking Concurrent Queue Algorithms Contributions from Past Presentations by: Ahmed Badran (Cs510-2008) Joseph Rosenski (Cs510-2006)

Agenda Presentation of the Issues Non-Blocking Queue Algorithm Two Lock Concurrent Queue Algorithm Performance Conclusion

Issue – Concurrent FIFO queues Must synchronize access to insure correctness Two types of synchronize (Generally) - Blocking Allows slower process to delay faster process - Non-blocking Guarantee if there are one or more active processes trying to perform operations on a shared data structure, SOME operation will complete within a finite number of time steps.

Issue – Blocking Blocking algorithms in general - Uses locks - May deadlock - Processes may wait for arbitrarily long times - Lock/unlock primitives need to interact with scheduling logic to avoid priority inversion - Possibility of starvation

Issue – Blocking If Blocking used for queues: - Two Locks better than one - Allows concurrency between enqueue and dequeue processes - Use Dummy node to prevent contention

Issue – Blocking, Why two locks? Queue with one lock: P1(enqueue) – acquires lock & finishes P2(enqueue) – acquires lock & proceeds P3(dequeue) – Blocked, P2 has lock A more efficient algorithm would allow P3 to proceed as soon as P1 has completed, i.e. when there is a node available to be dequeued.

Issue – Blocking Using Two Locks Two locks allows, one for tail pointer updates and a second lock for updates to the head pointer. - Enqueue process locks tail pointer - Dequeue process locks head pointer A dummy node prevents contention because enqueue process does not access head pointer/lock and dequeue process does not access tail pointer/lock.

Issue – Blocking Using Two Locks without dummy node value next Head Tail H_lock T_lock Dequeue requires updata of both head and tail

Issue – Blocking Using Two Locks without dummy node Head Tail H_lock T_lock Dequeue requires updata of both head and tail OO

Issue – Blocking Using Two Locks without dummy node Head Tail H_lock T_lock Enqueue requires updata of both tail and head O O

Issue – Blocking Using Two Locks without dummy node valu e next Head Tail H_lock T_lock Issue – Blocking Using Two Locks without dummy node value next Head Tail H_lock T_lock Enqueue requires updata of both tail and head

Issue – Tail lag behind Head value next value next value next Head Tail H_lock T_lock This illustrates a Blocking/locking algorithm, but the same issue applies to Non-blocking Algorithms. The issue is caused by separate head and tail pointers

Issue – Tail lag behind Head value next value next Head Tail H_lock T_lock

Issue – Tail lag behind Head value next value next Head Tail H_lock T_lock free(node) One solution is for a faster dequeue process to complete the work of the slower enqueue process by swinging the tail pointer to the next node before dequeuing the node. - What about lock contention, or deadlock? - Can use reference counter, and not free node until counter is zero

Issue – Tail lag behind Head Problem with reference count Head Tail H_lock T_lock If cnt == 0 free(node) valuenextcnt=1 ProcessY slow or never releases reference can cause system to run out of memory Will discuss solutions during algorithm walk through valuenextcnt=1... valuenextcnt=3

Non-Blocking Algorithms Optimistic Have a structure similar to: Initialize local structures Begin loop Do some work If CAS == true break End loop

Issue - Linearizability A data structure gives an external observer the illusion that the operations takes effect instantaneously This requires that there be a specific point during each operation at which it is considered to “take effect”.

Issue - Linearizability Similar to Serializability in Databases - History instead of Transactions - Invocations/responses vs Reads/Writes

ABA problem SM 5 Stack THREAD1THREAD2 v1=Pop() SP = 5 newSP = 4 data=X time v2=Pop() SP = 5 newSP = 4 data=X x Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break …

ABA problem THREAD1THREAD2 time v1=Pop() SP = 5 newSP = 4 data=X Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break v2=Pop() SP = 5 newSP = 4 data=X SM 5 Stack x …

ABA problem SM 4 Stack THREAD1THREAD2 time v2=Pop() SP = 5 newSP = 4 data=X CAS(&SM,SP,newSP) v2 = x x … v1=Pop() SP = 5 newSP = 4 data=X Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break

ABA problem SM 4 Stack THREAD1THREAD2 time Push(z) x … v1=Pop() SP = 5 newSP = 4 data=X Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break v2=Pop() SP = 5 newSP = 4 data=X CAS(&SM,SP,newSP) v2 = x

ABA problem SM 4 Stack THREAD1THREAD2 time Push(z) SP = 4 x … v1=Pop() SP = 5 newSP = 4 data=X Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break v2=Pop() SP = 5 newSP = 4 data=X CAS(&SM,SP,newSP) v2 = x

ABA problem SM 4 Stack THREAD1THREAD2 time Push(z) SP = 4 newSP=5 x … v1=Pop() SP = 5 newSP = 4 data=X Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break v2=Pop() SP = 5 newSP = 4 data=X CAS(&SM,SP,newSP) v2 = x

ABA problem SM 5 Stack THREAD1THREAD2 time Push(z) SP = 4 newSP=5 CAS(&SM,SP,newSP) z … v1=Pop() SP = 5 newSP = 4 data=X Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break v2=Pop() SP = 5 newSP = 4 data=X CAS(&SM,SP,newSP) v2 = x

ABA problem SM 4 Stack THREAD1THREAD2 time Push(z) SP = 4 newSP=5 CAS(&SM,SP,newSP) z v1=x … v1=Pop() SP = 5 newSP = 4 data=X v1=Pop() SP = 5 newSP = 4 data=X Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break v2=Pop() SP = 5 newSP = 4 data=X CAS(&SM,SP,newSP) v2 = x

ABA problem SM 4 Stack THREAD1THREAD2 time Push(z) SP = 4 newSP=5 CAS(&SM,SP,newSP) z v1=x … CAS should fail but it succeeds Both threads retrieved data=x and More important, the stack is now corrupt v1=Pop() SP = 5 newSP = 4 data=X Pop Pop () { loop SP = SM newSP = SP -1 data = Stack Data CAS(&SM, SP, newSP) break return data Push Push (d) { loop SP = SM newSP = SP +1 Stack Data = d CAS(&SM, SP, newSP) break v2=Pop() SP = 5 newSP = 4 data=X CAS(&SM,SP,newSP) v2 = x

ABA Solution Add a modification counter to the pointer - atomic update of pointer and counter - can use double word CAS - Not available on most platforms - Pack pointer and counter into single word - Use single word CAS to update - Authors solution

Correctness Properties 1. The linked list is always connected 2. Nodes are only inserted after the last node in the linked list. 3. Nodes are only deleted from the beginning of the linked list 4. Head always points to the first node in the linked list 5. Tail always points to a node in the linked list.

Algorithm – Non blocking, Globals structure pointer_t {ptr: pointer to node t, count: unsigned integer} structure node_t {value: data type, next: pointer_t} structure queue_t {Head: pointer_t, Tail: pointer_t} ptr count HeadTail ptr count ptr count value next ptr count (packed Word)

Algorithm – Non Blocking Intialization initialize(Q: pointer to queue_t) node = new node() node–>next.ptr = NULL Q–>Head = Q–>Tail = node Q value next ptr count HeadTail ptr count ptr count node O

Algorithm – Non blocking Enqueue enqueue(Q: pointer to queue t, value: data type) node = new node() node–>value = value node–>next.ptr = NULL loop tail = Q–>Tail next = tail.ptr–>next if tail == Q–>Tail if next.ptr == NULL if CAS(&tail.ptr–>next, next, ) break endif else CAS(&Q–>Tail, tail, ) endif endloop CAS(&Q–>Tail, tail, ) Q value next ptr count HeadTail ptr count ptr count tail value next ptr count O node ptr count next ptr count O O

Algorithm – Non blocking Enqueue enqueue(Q: pointer to queue t, value: data type) node = new node() node–>value = value node–>next.ptr = NULL loop tail = Q–>Tail next = tail.ptr–>next if tail == Q–>Tail if next.ptr == NULL if CAS(&tail.ptr–>next, next, ) break endif else CAS(&Q–>Tail, tail, ) endif endloop CAS(&Q–>Tail, tail, ) Q value next ptr count+1 HeadTail ptr count ptr count value next ptr count O tail node ptr count next ptr count O

Algorithm – Non blocking Enqueue enqueue(Q: pointer to queue t, value: data type) node = new node() node–>value = value node–>next.ptr = NULL loop tail = Q–>Tail next = tail.ptr–>next if tail == Q–>Tail if next.ptr == NULL if CAS(&tail.ptr–>next, next, ) break endif else CAS(&Q–>Tail, tail, ) endif endloop CAS(&Q–>Tail, tail, ) Q value next ptr count HeadTail ptr count ptr count+1 value next ptr count O tail node ptr count next ptr count O

Algorithm – Non blocking Enqueue Two Process enqueue(Q: pointer to queue t, value: data type) node = new node() node–>value = value node–>next.ptr = NULL loop tail = Q–>Tail next = tail.ptr–>next if tail == Q–>Tail if next.ptr == NULL if CAS(&tail.ptr–>next, next, ) break endif else CAS(&Q–>Tail, tail, ) endif endloop CAS(&Q–>Tail, tail, ) Q value next ptr count HeadTail ptr count ptr count value next ptr count O tail ptr count next ptr count value next ptr count O node

Algorithm – Non blocking Dequeue dequeue(Q: pointer to queue t, pvalue: pointer to data type): boolean loop head = Q–>Head tail = Q–>Tail next = head–>next if head == Q–>Head if head.ptr == tail.ptr if next.ptr == NULL return FALSE endif Q value next ptr count HeadTail ptr count ptr count ptr count ptr count ptr count head tail next O O

Algorithm – Non blocking Dequeue dequeue(Q: pointer to queue t, pvalue: pointer to data type): boolean loop head = Q–>Head tail = Q–>Tail next = head–>next if head == Q–>Head if head.ptr == tail.ptr if next.ptr == NULL return FALSE endif CAS(&Q–>Tail, tail, ) else *pvalue = next.ptr–>value if CAS(&Q–>Head, head, ) break endif endloop free(head.ptr) return TRUE Q value next ptr count HeadTail ptr count ptr count ptr count ptr count ptr count head tail next value next ptr count O

Algorithm – Non blocking Dequeue dequeue(Q: pointer to queue t, pvalue: pointer to data type): boolean loop head = Q–>Head tail = Q–>Tail next = head–>next if head == Q–>Head if head.ptr == tail.ptr if next.ptr == NULL return FALSE endif CAS(&Q–>Tail, tail, ) else *pvalue = next.ptr–>value if CAS(&Q–>Head, head, ) break endif endloop free(head.ptr) return TRUE Q value next ptr count HeadTail ptr count ptr count ptr count ptr cout ptr count head tail next value next ptr count O

Algorithm – Non blocking Dequeue dequeue(Q: pointer to queue t, pvalue: pointer to data type): boolean loop head = Q–>Head tail = Q–>Tail next = head–>next if head == Q–>Head if head.ptr == tail.ptr if next.ptr == NULL return FALSE endif CAS(&Q–>Tail, tail, ) else *pvalue = next.ptr–>value if CAS(&Q–>Head, head, ) break endif endloop free(head.ptr) return TRUE Q value next ptr count HeadTail ptr count ptr count ptr count next count ptr count head tail next value next ptr count O

Algorithm – Non blocking Dequeue dequeue(Q: pointer to queue t, pvalue: pointer to data type): boolean loop head = Q–>Head tail = Q–>Tail next = head–>next if head == Q–>Head if head.ptr == tail.ptr if next.ptr == NULL return FALSE endif CAS(&Q–>Tail, tail, ) else *pvalue = next.ptr–>value if CAS(&Q–>Head, head, ) break endif endloop free(head.ptr) return TRUE Q value next ptr count HeadTail ptr count+1 ptr count ptr count next count ptr count head tail next O value next ptr count

Algorithm – Two-Lock Concurrent Queue structure node_t {value: data type, next: pointer to node_t} structure queue_t {Head: pointer to node_t, Tail: pointer to node_t, H_lock: lock type, T_lock: lock type} value next Head Tail H_lock T_lock

Algorithm – Two-Lock Concurrent Queue value next Head Tail H_lock T_lock initialize(Q: pointer to queue_t) node = new_node() node->next = NULL Q->Head = Q->Tail = node Q->H_lock = Q->T_lock = FREE O Free

Algorithm – Two-Lock Concurrent Queue value next Head Tail H_lock T_lock O FreeLocked enqueue(Q: pointer to queue_t, value: data type) node = new_node() // Allocate a new node from the free list node->value = value// Copy enqueued value into node node->next = NULL // Set next pointer of node to NULL lock(&Q->T_lock)// Acquire T_lock in order to access Tail Q->Tail->next = node// Link node at the end of the linked list Q->Tail = node// Swing Tail to node unlock(&Q->T_lock)// Release T_lock value next O

Algorithm – Two-Lock Concurrent Queue Head Tail H_lock T_lock value next Free Locked enqueue(Q: pointer to queue_t, value: data type) node = new_node() // Allocate a new node from the free list node->value = value// Copy enqueued value into node node->next = NULL // Set next pointer of node to NULL lock(&Q->T_lock)// Acquire T_lock in order to access Tail Q->Tail->next = node// Link node at the end of the linked list Q->Tail = node// Swing Tail to node unlock(&Q->T_lock)// Release T_lock value next O

Algorithm – Two-Lock Concurrent Head Tail H_lock T_lock value next Locked value next O dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean lock(&Q->H_lock) // Acquire H_lock in order to access Head node = Q->Head// Read Head new_head = node->next// Read next pointer if new_head == NULL// Is queue empty? unlock(&Q->H_lock)// Release H_lock before return return FALSE// Queue was empty endif *pvalue = new_head->value// Queue not empty. Read value before release Q->Head = new_head// Swing Head to next node unlock(&Q->H_lock)// Release H_lock free(node)// Free node return} TRUE node new_head

Algorithm – Two-Lock Concurrent Head Tail H_lock T_lock value next Free Locked value next O dequeue(Q: pointer to queue_t, pvalue: pointer to data type): boolean lock(&Q->H_lock) // Acquire H_lock in order to access Head node = Q->Head// Read Head new_head = node->next// Read next pointer if new_head == NULL// Is queue empty? unlock(&Q->H_lock)// Release H_lock before return return FALSE// Queue was empty endif *pvalue = new_head->value// Queue not empty. Read value before release Q->Head = new_head// Swing Head to next node unlock(&Q->H_lock)// Release H_lock free(node)// Free node return} TRUE node new_head

Performance Parameters Net execution time for one million enqueue/dequeue pairs 12-processor Silicon Graphics Challenge multiprocessor Algorithms compiled with using highest optimization level Including many hand optimizations

Performance – Dedicated Processor

Performance – Two processes Per Processor

Performance – Three Processes Per Processor

Conclusion NBS clear winner for multiprocessor multiprogrammed systems Above 5 processors, use the new non-blocking queue If hardware only supports test-and-set use two lock queue For two or less processors use a single lock algorithm for queues

Simple, Fast, and Practical Non- Blocking and Blocking Concurrent Queue Algorithms Presenter: Jim Santmyer By: Maged M. Micheal Michael L. Scott Department.

Similar presentations

Presentation on theme: "Simple, Fast, and Practical Non- Blocking and Blocking Concurrent Queue Algorithms Presenter: Jim Santmyer By: Maged M. Micheal Michael L. Scott Department."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Simple, Fast, and Practical Non- Blocking and Blocking Concurrent Queue Algorithms Presenter: Jim Santmyer By: Maged M. Micheal Michael L. Scott Department.

Similar presentations

Presentation on theme: "Simple, Fast, and Practical Non- Blocking and Blocking Concurrent Queue Algorithms Presenter: Jim Santmyer By: Maged M. Micheal Michael L. Scott Department."— Presentation transcript:

Similar presentations

About project

Feedback