1
Practical non-blocking data structures
Tim Harris
tim.harris@cl.cam.ac.uk
Computer Laboratory
2
Overview
  Introduction
    Lock-free data structures
    Correctness requirements
  Linked lists using CAS
  Multi-word CAS
  Conclusions
3
Introduction
class Counter {
  int next = 0;
  int getNumber () {
    int t;
    t = next;
    next = t + 1;
    return t;
  }
}
What can go wrong here?
Initially next = 0
Thread1: getNumber() reads t = 0
Thread2: getNumber() reads t = 0
Both threads write next = 1 and both return result=0
4
Introduction (2)
class Counter {
  int next = 0;
  synchronized int getNumber () {
    int t;
    t = next;
    next = t + 1;
    return t;
  }
}
What about now?
Initially next = 0
Thread1: getNumber() reads t = 0, writes next = 1, returns result=0, lock released
Thread2: getNumber() lock acquired, writes next = 2, returns result=1
5
Introduction (3)
class Counter {
  int next = 0;
  synchronized int getNumber () {
    int t;
    t = next;
    next = t + 1;
    return t;
  }
}
Now the problem is liveness
Thread1 and Thread2 both call getNumber()
Priority inversion: 1 is low priority, 2 is high priority, but some other thread 3 (of medium priority) prevents 1 making any progress
Sharing: suppose that these operations may be invoked both in ordinary code and in interrupt handlers…
Failure: what if thread 1 fails while holding the lock? The lock’s still held and the state may be inconsistent
6
Introduction (4)
In this case a non-blocking design is easy:
class Counter {
  int next = 0;
  int getNumber () {
    int t;
    do {
      t = next;
    } while (CAS (&next, t, t + 1) != t);
    return t;
  }
}
CAS is an atomic compare and swap. Its arguments are the location, the expected value and the new value, and (as the loop test relies on) it returns the value it found at the location.
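The retry loop on this slide maps closely onto java.util.concurrent.atomic.AtomicInteger, whose compareAndSet stands in for CAS (it reports success as a boolean rather than returning the old value). A minimal sketch; the class name CasCounter is hypothetical and not from the slides:

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical name: a sketch of the slide's CAS retry loop, not the author's code.
final class CasCounter {
    private final AtomicInteger next = new AtomicInteger(0);

    int getNumber() {
        int t;
        do {
            t = next.get();                       // read the current value
        } while (!next.compareAndSet(t, t + 1));  // retry if another thread updated it first
        return t;
    }
}
```

No thread ever holds a lock here: a failed compareAndSet means some other thread's call succeeded, so the system as a whole always makes progress.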
7
Correctness
Safety: we usually want a ‘linearizable’ implementation (Herlihy 1990)
  The data structure is only accessed through a well-defined interface
  Operations on the data structure appear to occur atomically at some point between invocation and response
Liveness: usually one of two requirements
  A ‘wait free’ implementation guarantees per-thread progress
  A ‘non-blocking’ implementation guarantees only system-wide progress
8
Overview
  Introduction
  Linked lists using CAS
    Basic list operations
    Alternative implementations
    Extensions
  Multi-word CAS
  Conclusions
9
Lists using CAS
Insert 20:
[Diagram: list H, 10, 30, T; a new node 20 pointing to 30 is linked in between 10 and 30]
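The insert step on this slide can be sketched with java.util.concurrent.atomic.AtomicReference: build the new node pointing at its successor, then use a single CAS on the predecessor's next field to link it in. This is an insert-only sketch under assumptions: the names CasList, Node and keys are hypothetical, keys are assumed to lie strictly between the two sentinel values, and deletion is deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: sorted list of ints with head (H) and tail (T) sentinels.
final class CasList {
    static final class Node {
        final int key;
        final AtomicReference<Node> next;
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicReference<>(next);
        }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null);
    private final Node head = new Node(Integer.MIN_VALUE, tail);

    boolean insert(int key) {
        while (true) {
            // Search for the adjacent pair (pred, curr) with pred.key < key <= curr.key.
            Node pred = head;
            Node curr = pred.next.get();
            while (curr.key < key) {
                pred = curr;
                curr = curr.next.get();
            }
            if (curr.key == key) return false;       // already present
            Node node = new Node(key, curr);          // new node points at its successor
            if (pred.next.compareAndSet(curr, node))  // swing pred.next from curr to node
                return true;
            // CAS failed: another insert changed pred.next, so search again.
        }
    }

    // Snapshot of the keys, intended for single-threaded inspection only.
    List<Integer> keys() {
        List<Integer> out = new ArrayList<>();
        for (Node n = head.next.get(); n != tail; n = n.next.get()) out.add(n.key);
        return out;
    }
}
```

When two inserts race for the same predecessor, exactly one compareAndSet succeeds and the other thread retries its search.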
10
Lists using CAS (2)
Insert 20:
[Diagram: list H, 10, 30, T with two concurrent inserts after node 10, of new nodes 20 and 25, each pointing to 30]
11
Lists using CAS (3)
Delete 10:
[Diagram: list H, 10, 30, T; node 10 is unlinked so that H points to 30]
12
Lists using CAS (4)
Delete 10 & insert 20:
[Diagram: list H, 10, 30, T; node 10 is being unlinked while a new node 20 is being linked in after it]
13
Logical vs physical deletion
Use a ‘spare’ bit to indicate logically deleted nodes:
[Diagram: list H, 10, 30, T with a new node 20; node 10's next field is marked (X) before node 10 is unlinked]
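A sketch of the two-step deletion in Java. Java references have no spare bit to steal, so this sketch swaps in java.util.concurrent.atomic.AtomicMarkableReference, which pairs a reference with a boolean mark and updates both in one CAS. The names MarkedList, Node and find are hypothetical, and keys are assumed to lie strictly between the two sentinel values.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Hypothetical sketch: the mark on node.next means "node is logically deleted".
final class MarkedList {
    static final class Node {
        final int key;
        final AtomicMarkableReference<Node> next;
        Node(int key, Node next) {
            this.key = key;
            this.next = new AtomicMarkableReference<>(next, false);
        }
    }

    private final Node tail = new Node(Integer.MAX_VALUE, null);
    private final Node head = new Node(Integer.MIN_VALUE, tail);

    // Returns {pred, curr} with pred.key < key <= curr.key, unlinking marked nodes on the way.
    private Node[] find(int key) {
        retry:
        while (true) {
            Node pred = head;
            Node curr = pred.next.getReference();
            while (true) {
                boolean[] marked = {false};
                Node succ = curr.next.get(marked);
                while (marked[0]) {
                    // curr is logically deleted: attempt the physical unlink.
                    if (!pred.next.compareAndSet(curr, succ, false, false)) continue retry;
                    curr = succ;
                    succ = curr.next.get(marked);
                }
                if (curr.key >= key) return new Node[] {pred, curr};
                pred = curr;
                curr = succ;
            }
        }
    }

    boolean insert(int key) {
        while (true) {
            Node[] w = find(key);
            Node pred = w[0], curr = w[1];
            if (curr.key == key) return false;
            Node node = new Node(key, curr);
            // Fails if pred has been marked or its successor has changed.
            if (pred.next.compareAndSet(curr, node, false, false)) return true;
        }
    }

    boolean delete(int key) {
        while (true) {
            Node[] w = find(key);
            Node pred = w[0], curr = w[1];
            if (curr.key != key) return false;
            Node succ = curr.next.getReference();
            // Logical deletion: set the mark without changing the reference.
            if (!curr.next.compareAndSet(succ, succ, false, true)) continue;
            // Physical deletion is best effort; find() finishes the job if this CAS fails.
            pred.next.compareAndSet(curr, succ, false, false);
            return true;
        }
    }

    boolean contains(int key) {
        Node curr = head.next.getReference();
        while (curr.key < key) curr = curr.next.getReference();
        return curr.key == key && !curr.next.isMarked();
    }
}
```

Because insert expects an unmarked next field on the predecessor, an insert after a logically deleted node fails its CAS and retries instead of attaching the new node to a node that is about to disappear.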
14
Implementation problems
Also need to consider visibility of updates
[Diagram: list H, 10, 30, T with a new node 20 pointing to 30; a write barrier is labelled on the insertion of node 20]
15
Implementation problems (2)
…and the ordering of reads too
[Diagram: list H, 10, 30, T with node 20 being inserted]
while (val < seek) {
  p = p->next;
  val = p->val;
}
val = ???
16
Overview
  Introduction
  Linked lists using CAS
  Multi-word CAS
    Design
    Results
  Conclusions
17
Multi-word CAS
Atomic read-modify-write to a set of locations
A useful building block:
  Many existing designs (queues, stacks, etc) use CAS2 directly (e.g. Detlefs ’00)
  More generally it can be used to move a structure between consistent states
We’d like it to be non-blocking, disjoint-access parallel, linearizable, and efficient with natural data
18
Previous work
Lots of designs…
Design        Parallel  Requires      Reserved bits
Anderson ’95  Yes       Strong LL/SC  p(w+l)+l, where l = log2 p + log2 a
I+R ’95       Yes       CAS           p + log2 p
Herlihy ’93   No        CAS           0
              Yes       CAS           0 or 2
Moir ’97      Yes       Strong LL/SC  log2 p + log2 n
I+R ’95       Yes       Strong LL/SC  log2 p
…none of them practicable
p processors, word size w, max n locations, max a addresses
19
Design
[Diagram: list H, 10, 20, T laid out in memory words 0x100, 0x104, 0x108, 0x10C, 0x110, 0x114, 0x118, 0x11C]
Descriptor:
  status=UNDECIDED
  locations=2
  a1=0x10C o1=0x110 n1=0x118
  a2=0x114 o2=0x118 n2=null
Build descriptor
Acquire locations
  DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor)
  DCSS (&status, UNDECIDED, 0x114, 0x118, &descriptor)
Decide outcome
  CAS (&status, UNDECIDED, SUCCEEDED)
  status=SUCCEEDED
Release locations
  CAS (0x10C, &descriptor, 0x118)
  CAS (0x114, &descriptor, null)
20
Reading
[Diagram: list H, 10, 20, T laid out in memory words 0x100, 0x104, 0x108, 0x10C, 0x110, 0x114, 0x118, 0x11C]
Descriptor:
  status=UNDECIDED
  locations=2
  a1=0x10c o1=0x110 n1=0x118
  a2=0x114 o2=0x118 n2=
word_t read (addr_t a) {
  word_t val = *a;
  if (!isDescriptor(val))
    return val;
  else {
    /* val points to the descriptor that currently owns location a */
    /* status is SUCCEEDED => return new value; */
    /* otherwise return old value; */
  }
}
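A sketch of this read in Java, assuming each shared word is an AtomicReference<Object> that holds either an ordinary value or a descriptor. The names CasnRead, Descriptor, Status, addrs, oldVals and newVals are hypothetical, as is the FAILED status. The sketch follows the slide in returning the old value whenever the status is not SUCCEEDED, and it does not help the owning operation to finish.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch of reading a word that may be owned by a multi-word CAS.
final class CasnRead {
    enum Status { UNDECIDED, FAILED, SUCCEEDED }

    // Mirrors the slide's descriptor: a status plus (address, old, new) triples.
    static final class Descriptor {
        final AtomicReference<Status> status = new AtomicReference<>(Status.UNDECIDED);
        final List<AtomicReference<Object>> addrs;
        final Object[] oldVals;
        final Object[] newVals;

        Descriptor(List<AtomicReference<Object>> addrs, Object[] oldVals, Object[] newVals) {
            this.addrs = addrs;
            this.oldVals = oldVals;
            this.newVals = newVals;
        }
    }

    static Object read(AtomicReference<Object> a) {
        Object val = a.get();
        if (!(val instanceof Descriptor)) return val;  // ordinary value
        Descriptor d = (Descriptor) val;
        // AtomicReference does not override equals, so indexOf compares by identity.
        int i = d.addrs.indexOf(a);
        return d.status.get() == Status.SUCCEEDED ? d.newVals[i] : d.oldVals[i];
    }
}
```

The run-time type test plays the role of isDescriptor; in a language without type information that test is where the reserved bits would be used.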
21
Whither DCSS?
Now we need DCSS from CAS:
Easier than full CAS2: the locations used for ‘control’ and ‘update’ addresses must not overlap, only the ‘update’ address may be changed, and we don’t need the result
[Diagram: node 10 at memory words 0x108 and 0x10C]
DCSS descriptor: ac=0x200 oc=0 au=0x10C ou=0x110 nu=0x200
DCSS (&status, UNDECIDED, 0x10C, 0x110, &descriptor):
  CAS (0x10C, 0x110, &DCSSDescriptor)
  if (*0x200 == 0)
    CAS (0x10C, &DCSSDescriptor, 0x200)
  else
    CAS (0x10C, &DCSSDescriptor, 0x110);
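The three CAS steps above can be sketched in Java over AtomicReference<Object> words, comparing values by identity as a stand-in for comparing machine words. The names Dcss, DcssDescriptor and complete are hypothetical. One addition is an assumption of this sketch and is not shown on the slide: if the update location already holds another operation's DCSS descriptor, that operation is completed first before retrying.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: set 'update' from oldValue to newValue only if 'control'
// still holds expectedControl. No result is returned, as on the slide.
final class Dcss {
    static final class DcssDescriptor {
        final AtomicReference<Object> control;  // ac: control address, never written here
        final Object expectedControl;           // oc
        final AtomicReference<Object> update;   // au: update address
        final Object oldValue;                  // ou
        final Object newValue;                  // nu

        DcssDescriptor(AtomicReference<Object> control, Object expectedControl,
                       AtomicReference<Object> update, Object oldValue, Object newValue) {
            this.control = control;
            this.expectedControl = expectedControl;
            this.update = update;
            this.oldValue = oldValue;
            this.newValue = newValue;
        }
    }

    static void dcss(DcssDescriptor d) {
        while (true) {
            Object seen = d.update.get();
            if (seen instanceof DcssDescriptor) {
                complete((DcssDescriptor) seen);  // help the other operation, then retry
                continue;
            }
            if (seen != d.oldValue) return;       // update location lacks the expected value
            if (d.update.compareAndSet(d.oldValue, d)) {  // step 1: install the descriptor
                complete(d);
                return;
            }
        }
    }

    // Steps 2 and 3: check the control word, then install newValue or restore oldValue.
    static void complete(DcssDescriptor d) {
        if (d.control.get() == d.expectedControl) {
            d.update.compareAndSet(d, d.newValue);
        } else {
            d.update.compareAndSet(d, d.oldValue);
        }
    }
}
```

Both CAS calls in complete expect the descriptor itself, so if several threads complete the same descriptor only the first one changes the update location.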
22
Evaluation: method
Attempt to permute elements in a vector. Can control:
  Level of concurrency
  Length of the vector
  Number of elements being permuted
  Padding between elements
  Management of descriptors
[Diagram: a vector of numbered elements]
23
Evaluation: small systems
gargantubrain.cl: 4-processor IA-64 (Itanium)
Vector=1024, Width=2-64, No padding
s per successful update, by algorithm used (rows) and CASn width in words permuted per update (columns)
Algorithm  2    4    8    16   32   64
HF         1.6  2.8  6.0  17   71   280
HF-RC      1.5  2.6  5.6  16   68   270
IR         3.4  4.4  7.9  19   76   300
MCS        5.6  8.2  13   24   46   92
MCS-FG     1.4  2.8  6.0  14   42   130
24
Evaluation: large systems
hodgkin.hpcf: 64-processor Origin-2000, MIPS R12000
Vector=1024, Width=2
One element per cache line
[Plot: ms per successful update against number of processors, for HF-RC, IR and MCS]
25
Overview
  Introduction
  Linked lists using CAS
  Multi-word CAS
  Conclusions
26
Conclusions
Some general techniques
The descriptor pointers serve two purposes:
  They allow ‘helpers’ to find out the information needed to complete their work.
  They indicate ownership of locations
Correctness seems clearest when thinking about the state of the shared memory, not the state of individual threads
Unlike previous work we need only a small and constant number of reserved bits (e.g. 2 to identify descriptor pointers if there’s no type information available at run time)
27
Conclusions (2)
Our scheme is the first practical one:
  Can operate on general pointer-based data structures
  Competitive with lock-based schemes
  Can operate on highly parallel systems
  Disjoint-access parallel, non-blocking, linearizable
http://www.cl.cam.ac.uk/~tlh20/papers/hfp-casn-submitted.pdf