Presentation is loading. Please wait.

Presentation is loading. Please wait.

CS-510 Transactional Memory: Architectural Support for Lock-Free Data Structures By Maurice Herlihy and J. Eliot B. Moss 1993 Presented by Steve Coward.

Similar presentations


Presentation on theme: "CS-510 Transactional Memory: Architectural Support for Lock-Free Data Structures By Maurice Herlihy and J. Eliot B. Moss 1993 Presented by Steve Coward."— Presentation transcript:

1 CS-510 Transactional Memory: Architectural Support for Lock-Free Data Structures By Maurice Herlihy and J. Eliot B. Moss 1993 Presented by Steve Coward PSU Fall 2011 CS-510 11/02/2011 Slide content heavily borrowed from Ashish Jha PSU SP 2010 CS-510 05/04/2010

2 2 CS-510 Agenda  Lock-based synchronization  Non-blocking synchronization  TM (Transactional Memory) concept  A HW-based TM implementation –Core additions –ISA additions –Transactional cache –Cache coherence protocol changes  Test Methodology and results  Blue Gene/Q  Summary

3 3 CS-510  Generally easy to use, except not composable  Generally easy to reason about  Does not scale well due to lock arbitration/communication  Pessimistic synchronization approach  Uses Mutual Exclusion –Blocking, only ONE process/thread can execute at a time Lock-based synchronization Lo-priority Holds Lock X Hi-priority Pre-emptionCan’t proceed Priority Inversion Holds Lock X De-scheduledCan’t proceed Ex: Quantum expiration, Page Fault, other interrupts Proc AProc B Convoying Can’t proceed Deadlock Can’t proceed Get Lock X Holds Lock Y Get Lock X Holds Lock X Get Lock Y High context switch overhead Med-priority Proc C

4 4 CS-510 Lock-Free Synchronization  Non-Blocking - optimistic, does not use mutual exclusion  Uses RMW operations such as CAS, LL&SC –limited to operations on single-word or double-words  Avoids common problems seen with conventional techniques such as Priority inversion, Convoying and Deadlock  Difficult programming logic  In absence of above problems and as implemented in SW, lock-free doesn’t perform as well as a lock-based approach

5 5 CS-510 Non-Blocking Wish List  Simple programming usage model  Avoids priority inversion, convoying and deadlock  Equivalent or better performance than lock- based approach –Less data copying  No restrictions on data set size or contiguity  Composable  Wait-free Enter Transactional Memory (TM) …

6 6 CS-510 What is a transaction (tx)?  A tx is a finite sequence of revocable operations executed by a process that satisfies two properties: –Serializable –Steps of one tx are not seen to be interleaved with the steps of another tx –Tx’s appear to all processes to execute in the same order –Atomic –Each tx either aborts or commits – –Abort causes all tentative changes of tx to be discarded – –Commit causes all tentative changes of tx to be made effectively instantaneously globally visible  This paper assumes that a process executes at most one tx at a time –Tx’s do not nest (but seems like a nice feature) –Hence, these Tx’s are not composable –Tx’s do not overlap (seems less useful)

7 7 CS-510 What is Transactional Memory?  Transactional Memory (TM) is a lock-free, non- blocking concurrency control mechanism based on tx’s that allows a programmer to define customized read-modify-write operations that apply to multiple, independently chosen memory locations  Non-blocking –Multiple tx’s optimistically execute CS in parallel (on diff CPU’s) –If a conflict occurs only one can succeed, others can retry

8 8 CS-510 Basic Transaction Concept beginTx:[A]=1;[B]=2;[C]=3;x=_VALIDATE(); Proc A beginTx:z=[A];y=[B];[C]=y;x=_VALIDATE(); Proc B Atomicity ALL or NOTHING IF (x) _COMMIT(); _COMMIT();ELSE _ABORT(); _ABORT(); GOTO beginTx; GOTO beginTx; // _COMMIT - instantaneously make all above changes visible to all Proc’s IF (x) _COMMIT(); _COMMIT();ELSE _ABORT(); _ABORT(); GOTO beginTx; GOTO beginTx; // _ABORT - discard all above changes, may try again True concurrency How is validity determined? Optimistic execution Serialization ensured if only one tx commits and others abort Linearization ensured by (combined) atomicity of validate, commit, and abort operations How to COMMIT? How to ABORT? Changes must be revocable!

9 9 CS-510 TM vs. SW Non-Blocking LT or LTX // READ set of locations VALIDATE // CHECK consistency of READ values ST // MODIFY set of locations COMMIT Succeed Fail Fail Critical Section // HW makes changes PERMANENT Pass // TM While (1) { curCnt = LTX(&cnt); if (Validate()) { int c = curCnt+1; ST(&cnt, c); if (Commit()) return; } // Non-Block While (1) { curCnt = LL(&cnt); // Copy object if (Validate(&cnt)) { int c = curCnt + 1; if (SC(&cnt, c)) return; } // Do work

10 10 CS-510 HW or SW TM?  TM may be implemented in HW or SW –Many SW TM library packages exist –C++, C#, Java, Haskell, Python, etc. –2-3 orders of magnitude slower than other synchronization approaches  This paper focuses on HW implementation –HW offers significantly greater performance than SW for reasonable cost –Minor tweaks to CPU’s* – –Core – –ISA – –Caches – –Bus and Cache coherency protocols –Leverage cache-coherency protocol to maintain tx memory consistency –But HW implementation w/o SW cache overflow handling is problematical * See Blue Gene/Q

11 11 CS-510 Core: TM Updates  Each CPU maintains two additional Status Register bits –TACTIVE –flag indicating whether a transaction is in progress on this cpu –Implicitly set upon entering transaction –TSTATUS –flag indicating whether the active transaction has conflicted with another transaction TACTIVETSTATUSMEANING FalseDCNo tx active TrueFalseOrphan tx - executing, conflict detected, will abort True Active tx - executing, no conflict yet detected

12 12 CS-510 ISA: TM Memory Operations  LT: load from shared memory to register  ST: tentative write of register to shared memory which becomes visible upon successful COMMIT  LTX: LT + intent to write to same location later –A performance optimization for early conflict detection ISA: TM Verification Operations   VALIDATE: validate consistency of read set – –Avoid misbehaved orphan   ABORT: unconditionally discard all updates – –Used to handle extraordinary circumstances   COMMIT: attempt to make tentative updates permanent

13 13 CS-510 TM Conflict Detection  LOAD and STORE are supported but do not affect tx’s READ or WRITE set –Why would a STORE be performed within a tx?  Left to implementation –Interaction between tx and non-tx operations to same address –Is generally a programming error –Consider LOAD/STORE as committed tx’s with conflict potential, otherwise non- linearizable outcome LT reg, [MEM] LTX reg, [MEM] ST [MEM], reg // Tx READ with intent to WRITE later // Tx WRITE to local cache, value globally // visible only after COMMIT // pure Tx READ // pure Tx READ Tx Dependencies Definition Tx Abort Condition Diagram* READ-SET WRITE-SET DATASET DATA-SETUpdated? WRITE-SET Read by any other Tx? COMMIT //WRITE-SET visible to other processes ABORT //Discard changes to WRITE-SET N N Y Y LOAD reg, [MEM] STORE[MEM], reg // non-Tx READ // non-Tx WRITE - careful! *Subject to arbitration (as in Reader/Writer paradigm)

14 14 CS-510 Core TM Cache Architecture  Two primary, mutually exclusive caches –In absence of TM, non-Tx ops uses same caches, control logic and coherency protocols as non-Tx architecture –To avoid impairing design/performance of regular cache –To prevent regular cache set size from limiting max tx size –Accessed sequentially based on instruction type!  Tx Cache –Fully-associative –Otherwise how would tx address collisions be handled? –Single-cycle COMMIT and ABORT –Small - size is implementation dependent –Intel Core i5 has a first level TLB with 32 entries! –Holds all tentative writes w/o propagating to other caches or memory –May be extended to at act as Victim Cache –Upon ABORT –Modified lines set to INVALID state –Upon COMMIT –Lines can be snooped by other processors –Lines WB to memory upon replacement 1 st level Cache L2DL3D 2 nd level … 3 rd level Main Mem 1 Clk 4 Clk Core L1DDirect-mappedExclusive 2048 lines x 8B Tx Cache Fully-associativeExclusive 64 lines x 8B L2DL3D

15 15 CS-510 HW TM Leverages Cache Protocol  M - cache line only in current cache and is modified  E - cache line only in current cache and is not modified  S - cache line in current cache and possibly other caches and is not modified  I - invalid cache line  Tx commit logic will need to detect the following events (akin to R/W locking) –Local read, remote write   S -> I   E -> I –Local write, remote write   M -> I –Local write, remote read –M, snoop read, write back -> S  Works with bus-based (snoopy cache) or network-based (directory) architectures http://thanglongedu.org/showthread.php?2226-Nehalem/page2

16 16 CS-510 Protocol: Cache States  Every Tx Op allocates 2 CL entries (2 cycle allocation) –Single CL entry also possible –Effectively makes tx cache twice as big –Two entry scheme has optimizations –Allows roll back (and roll forward) w/o bus traffic –LT allocates one entry (if LTX used properly) –Second LTX cycle can be hidden on cache hit –Authors decision appears to be somewhat arbitrary  A dirty value “originally read” must either be –WB to memory, or –Allocated to XCOMMIT entry as its “old” entry –avoids WB’s to memory and improves performance XCOMMIT XABORT New Value Old value Tx Cache Line Entry, States and Replacement Tx Op Allocation 2 lines COMMITABORT CL Entry & States CL Replacement EMPTY search No replace Yes NORMAL No replace XCOMMIT WB if DIRTY then replace EMPTY NORMAL New Value Old value New Value Old value NORMAL EMPTY No ABORT Tx/Trap to SW YesYes

17 17 CS-510 ISA: TM Verification Operations  Orphan = TACTIVE==TRUE && TSTATUS==FALSE –Tx continues to execute, but will fail at commit  Commit does not force write back to memory –Memory written only when CL is evicted or invalidated  Conditions for calling ABORT –Interrupts –Tx cache overflow VALIDATE [Mem] Return TRUE TSTATUS TSTATUS=TRUETACTIVE=FALSE FALSE ABORT TSTATUS=TRUETACTIVE=FALSE For ALL entries 1. Drop XABORT 2. Set XCOMMIT to NORMAL to NORMAL For ALL entries 1. Drop XCOMMIT 2. Set XABORT to NORMAL to NORMAL Return FALSE COMMIT TSTATUS ABORT TSTATUS=TRUETACTIVE=FALSE TRUE Return TRUE

18 18 CS-510 TM Memory Access Is XABORT DATA? LT reg, [Mem] //search Tx cache Is NORMAL DATA? XABORT NORMAL DATA XCOMMIT DATA T_READ cycle XABORT DATA XCOMMIT DATA //ABORT Tx For ALL entries 1. Drop XABORT 1. Drop XABORT 2. Set XCOMMIT to NORMAL 2. Set XCOMMIT to NORMALTSTATUS=FALSE Return DATA Return arbitrary DATA Y Y OK BUSY CL State as Goodman’s protocol for LOAD (Valid) CL State as Goodman’s protocol for LOAD (Reserved) ST to Tx cache only!!! Is XABORT DATA? LTX reg, [Mem] //search Tx cache Is NORMAL DATA? XABORT NORMAL DATA XCOMMIT DATA T_RFO cycle XABORT DATA XCOMMIT DATA //ABORT Tx For ALL entries 1. Drop XABORT 1. Drop XABORT 2. Set XCOMMIT to NORMAL 2. Set XCOMMIT to NORMALTSTATUS=FALSE Return DATA Return arbitrary DATA //Tx cache miss Y Y OK BUSY Is XABORT DATA? ST [Mem], reg Is NORMAL DATA? XCOMMIT NORMAL DATA XABORT NEW DATA T_RFO cycle XCOMMIT DATA XABORT NEW DATA //ABORT Tx For ALL entries 1. Drop XABORT 1. Drop XABORT 2. Set XCOMMIT to NORMAL 2. Set XCOMMIT to NORMALTSTATUS=FALSE Return arbitrary DATA //Tx cache miss Y OK BUSY CL State as Goodman’s protocol for STORE (Dirty) XCOMMIT XABORT New Value Old value Tx Op Allocation For reference XCOMMIT DATA XABORT NEW DATA Y   Tx requests REFUSED by BUSY response – –Tx aborts and retries (after exponential backoff?) – –Prevents deadlock or continual mutual aborts   Exponential backoff not implemented in HW – –Performance is parameter sensitive – –Benchmarks appear not to be optimized

19 19 CS-510 TM – Snoopy Cache Actions  Both Regular and Tx Cache snoop on the bus  Main memory responds to all L1 read misses  Main memory responds to cache line replacement WRITE ‘s  If TSTATUS==FALSE, Tx cache acts as Regular cache (for NORMAL entries)

20 20 CS-510 Test Methodology  TM implemented in Proetus - execution driven simulator from MIT –Two versions of TM implementation –Goodman’s snoopy protocol for bus-based arch –Chaiken directory protocol for (simulated) Alewife machine –32 Processors –memory latency of 4 clock cycles –1 st level cache latency of 1 clock cycles – –2048x8B Direct-mapped regular cache – –64x8B fully-associative Tx cache –Strong Memory Consistency Model  Compare TM to 4 different implementation Techniques –SW –TTS (test-and-test-and-set) spinlock with exponential backoff [TTS Lock] –SW queuing [MCS Lock] – –Process unable to lock puts itself in the queue, eliminating poll time –HW –LL/SC (LOAD_LINKED/STORE_COND) with exponential backoff [LL/SC Direct/Lock] –HW queuing [QOSB] – –Queue maintenance incorporated into cache-coherency protocol Goodman’s QOSB protocol - head in memory elements in unused CL’s  Benchmarks –Counting –LL/SC directly used on the single-word counter variable –Producer & Consumer –Doubly-Linked List –All benchmarks do fixed amount of work

21 21 CS-510 Counting Benchmark  N processes increment shared counter 2^16/n times, n=1 to 32  Short CS with 2 shared-mem accesses, high contention  In absence of contention, TTS makes 5 references to mem for each increment –RD + test-and-set to acquire lock + RD and WR in CS + release lock  TM requires only 3 mem accesses –RD & WR to counter and then COMMIT (no bus traffic) SOURCE: from paper

22 22 CS-510 Counting Results  LL/SC outperforms TM –LL/SC applied directly to counter variable, no explicit commit required –For other benchmarks, adv lost as shared object spans multiple words – only way to use LL/SC is as a spin lock  TM has higher thruput than all other mechanisms at most levels of concurrency –TM uses no explicit locks and so fewer accesses to memory (LL/SC -2, TM-3,TTS-5) Total cycles needed to complete the benchmark Concurrent Processes BUSNW SOURCE: Figure copied from paper TTS Lock MCS Lock - SWQ QOSB - HWQ TM LL/SC Direct TTS Lock TM LL/SC Direct MCS Lock - SWQ QOSB - HWQ

23 23 CS-510 Prod/Cons Benchmark  N processes share a bounded buffer, initially empty –Half produce items, half consume items  Benchmark finishes when 2^16 operations have completed SOURCE: from paper

24 24 CS-510 Prod/Cons Results  In Bus arch, almost flat thruput for all –TM yields higher thruput but not as dramatic as counting benchmark  In NW arch, all thruputs suffer as contention increases –TM suffers the least and wins Cycles needed to complete the Benchmark Concurrent Processes BUSNW SOURCE: Figure copied from paper N QOSB - HWQ TTS Lock TM LL/SC Direct MCS Lock - SWQ TM QOSB - HWQ LL/SC Direct MCS Lock - SWQ TTS Lock

25 25 CS-510 Doubly-Linked List Benchmark  N processes share a DL list anchored by Head & Tail pointers –Process Dequeues an item by removing the item pointed by tail and then Enqueues it by threading it onto the list as head –Process that removes last item sets both Head & Tail to NULL –Process that inserts item into an empty list set’s both Head & Tail to point to the new item  Benchmark finishes when 2^16 operations have completed SOURCE: from paper

26 26 CS-510 Doubly-Linked List Results  Concurrency difficult to exploit by conventional means –State dependent concurrency is not simple to recognize using locks –Enquerers don’t know if it must lock tail-ptr until after it has locked head-ptr & vice-versa for dequeuers –Queue non-empty: each Tx modifies head or tail but not both, so enqueuers can (in principle) execute without interference from dequeuers and vice-versa –Queue Empty: Tx must modify both pointers and enqueuers and dequeuers conflict –Locking techniques uses only single lock –Lower thruput as single lock prohibits overlapping of enqueues and dequeues  TM naturally permits this kind of parallelism Cycles needed to complete the Benchmark Concurrent Processes BUSNW SOURCE: Figure copied from paper MCS Lock - SWQ TM QOSB - HWQ TTS Lock LL/SC Direct

27 27 CS-510 Blue Gene/Q Processor  First Tx Memory HW implementation  Used in Sequoia supercomputer built by IBM for Lawrence Livermore Labs  Due to be completed in 2012  Sequoia is 20 petaflop machine  Blue Gene/Q will have 18 cores  One dedicated to OS tasks  One held in reserve (fault tolerance?)  4 way hyper-threaded, 64-bit PowerPC A2 based  Sequoia may use up to 100k Blue Genes  1.6 GHz, 205 Gflops, 55W, 1.47 B transistors, 19 mm sq

28 28 CS-510 TM on Blue Gene/Q  Transactional Memory only works intra chip  Inter chip conflicts not detected  Uses a tag scheme on the L2 cache memory  Tags detect load/store data conflicts in tx  Cache data has ‘version’ tag  Cache can store multiple versions of same data  SW commences tx, does its work, then tells HW to attempt commit  If unsuccessful, SW must re-try  Appears to be similar to approach on Sun’s Rock processor  Ruud Haring on TM: ”a lot of neat trickery”, “sheer genius” –Full implementation much more complex than paper suggests?  Blue Gene/Q is the exception and does not mark wide scale acceptance of HW TM

29 29 CS-510 Pros & Cons Summary  Pros –TM matches or outperforms atomic update locking techniques for simple benchmarks –Uses no locks and thus has fewer memory accesses –Avoids priority inversion, convoying and deadlock –Easy programming semantics –Complex NB scenarios such as doubly-linked list more realizable through TM –Allow true concurrency and hence highly scalable (for smaller Tx sizes)  Cons –TM can not perform undoable operations including most I/O –Single cycle commit and abort restrict size of 1 st level cache and hence Tx size –Is it good for anything other than data containers? –Portability is restricted by transactional cache size –Still SW dependent –Algorithm tuning benefits from SW based adaptive backoff –Tx cache overflow handling –Longer Tx increases the likelihood of being aborted by an interrupt or scheduling conflict –Tx should be able to complete within one scheduling time slot –Weaker consistency models require explicit barriers at start and end, impacting performance –Other complications make it more difficult to implement in HW –Multi-level caches –Nested Transactions (required for composability) –Cache coherency complexity on many-core SMP and NUMA arch’s –Theoretically subject to starvation –Adaptive backoff strategy suggested fix - authors used exponential backoff –Else queuing mechanism needed –Poor debugger support

30 30 CS-510 Summary  TM is a novel multi-processor architecture which allows easy lock-free multi- word synchronization in HW –Leveraging concept of Database Transactions –Overcoming single/double-word limitation –Exploiting cache-coherency mechanisms WishGranted?Comment Simple programming modelYesVery elegant Avoids priority inversion, convoying and deadlock YesInherent in NB approach Equivalent or better performance than lock- based approach YesFor very small and short tx’s No restrictions on data set size or contiguity NoLimited by practical considerations ComposableNoPossible but would add significant HW complexity Wait-freeNoPossible but would add significant HW complexity

31 31 CS-510 References  M.P. Herlihy and J.E.B. Moss. Transactional Memory: Architectural support for lock-free data structures. Technical Report 92/07, Digital Cambridge Research Lab, One Kendall Square, Cambridge MA 02139, December 1992.

32 32 CS-510 Appendix Linearizability: Herlihy’s Correctness Condition  Invocation of an object/function followed by a response  History - sequence of invocations and responses made by a set of threads  Sequential history - a history where each invocation is followed immediately by its response  Serializable - if history can be reordered to form sequential history which is consistent with sequential definition of the object  Linearizable - serializable history in which each response that preceded invocation in history must also precede in sequential reordering  Object is linearizable if all of its usage histories may be linearized. A invokes lock | B invokes lock | A fails | B succeeds An example history Reordering 1 - A sequential history but not serializable reordering Reordering 2 - A linearizable reordering A invokes lock | A fails | B invokes lock | B succeeds B invokes lock | B succeeds | A invokes lock | A fails

33 33 CS-510 Appendix Goodman’s Write-Once Protocol  D - line present only in one cache and differs from main memory. Cache must snoop for read requests.  R - line present only in one cache and matches main memory.  V - line may be present in other caches and matches main memory.  I - Invalid.  Writes to V,I are write thru  Writes to D,R are write back

34 34 CS-510 Appendix Tx Memory and dB Tx’s  Differences with dB Tx’s –Disk vs. memory - HW better for Tx’s than dB systems –Tx’s do not need to be durable (post termination) –dB is a closed system; Tx interacts with non-Tx operations –Tx must be backward compatible with existing programming environment –Success in database transaction field does not translate to transactional memory


Download ppt "CS-510 Transactional Memory: Architectural Support for Lock-Free Data Structures By Maurice Herlihy and J. Eliot B. Moss 1993 Presented by Steve Coward."

Similar presentations


Ads by Google