# Transactional-Memory Real Time Systems: Leeor Peled, Advanced topics 049011, Technion, December 2014


Lock-freedom

- Shared data that does not require mutual exclusion.
  - Avoids common problems such as deadlocks, livelocks, priority inversion, convoying, and issues with fault tolerance and async-signal safety
  - Allows interruption/preemption without blocking the objects being operated upon.
- LF algorithms vs. LF data structures

Synchronization Paradigms

Classification:

- Blocking
  - Blocking
  - Starvation-Free
- Obstruction-Free
- Lock-Free
- Wait-Free
  - Wait-Free
  - Wait-Free Bounded
  - Wait-Free Population Oblivious

Synchronization for lawyers

- Starvation-Free: As long as one thread is in the critical section, some other thread that wants to enter the critical section will eventually succeed (even if the thread in the critical section has halted).
- Obstruction-Free: A function is Obstruction-Free if, from any point after which it executes in isolation, it finishes in a finite number of steps.
- Lock-Free: A method is Lock-Free if it guarantees that infinitely often some thread calling this method finishes in a finite number of steps.
- Wait-Free: A method is Wait-Free if it guarantees that every call finishes its execution in a finite number of steps.
- Wait-Free Bounded: A method is Wait-Free Bounded if it guarantees that every call finishes its execution in a finite and bounded number of steps. This bound may depend on the number of threads.
- Wait-Free Population Oblivious: A Wait-Free method whose performance does not depend on the number of active threads.
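To make the gap between blocking and lock-free concrete, here is a minimal Java sketch of one counter incremented both ways. The class and method names (`ProgressDemo`, `lockFreeIncrement`, `blockingIncrement`) are invented for illustration and are not from the slides. The CAS loop is lock-free but not wait-free: every failed CAS implies that some other thread succeeded, yet a single caller has no bound on its retries.

```java
import java.util.concurrent.atomic.AtomicInteger;

final class ProgressDemo {
    private final AtomicInteger value = new AtomicInteger();
    private int lockedValue; // guarded by the monitor of `this`

    // Lock-free: a failed CAS means another thread's CAS succeeded, so the
    // system as a whole progresses, but this caller may retry without bound.
    int lockFreeIncrement() {
        while (true) {
            int seen = value.get();
            if (value.compareAndSet(seen, seen + 1)) {
                return seen + 1;
            }
        }
    }

    // Blocking: a thread preempted while holding the monitor delays everyone.
    synchronized int blockingIncrement() {
        return ++lockedValue;
    }

    // Runs both increments from several threads and returns the two totals.
    static int[] run(int threads, int perThread) {
        ProgressDemo d = new ProgressDemo();
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int j = 0; j < perThread; j++) {
                    d.lockFreeIncrement();
                    d.blockingIncrement();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) {
            try {
                t.join();
            } catch (InterruptedException e) {
                throw new IllegalStateException(e);
            }
        }
        return new int[] { d.value.get(), d.lockedValue };
    }
}
```

Both counters end at the same total; the difference lies only in what happens to the other threads when one of them stalls in the middle of an increment.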

Synchronization Paradigms (2)

Are lock-free algorithms completely useless in an RT context?

- Bounded number of retries in priority-based systems (Anderson, '97)
  - A hard-RT scheduler based on lock-free objects often incurs less overhead than a wait-free implementation
- Nonblocking serialization for RT systems (Hohmuth & Härtig, '01)
  - Implemented Linux kernel benchmarks with LF/WF algorithms, demonstrating RT capabilities

Alternative: Transactional Memory

- Originally proposed by Herlihy & Moss, '93
  - Earlier idea by Knight, '86
- HW concept based on a cache coherency extension
  - Speculative work: writes are marked in the cache and can't become external/visible until commit
  - Upon commit, allow snoops/WB
  - Upon abort, invalidate spec lines and roll back
  - Reads are also marked to monitor conflicts

Example: deadlock prevention

Consider implementations of move(A, B, elem), which moves a single element from data structure A to B.

- Non-TM: Lock A; Lock B; A.remove(elem); B.insert(elem); Unlock B; Unlock A
- TM: atomic { A.remove(elem); B.insert(elem) }

Drawbacks? Think of a linked list.
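The non-TM variant can deadlock when one thread runs move(A, B, x) while another runs move(B, A, y), because the two locks are taken in opposite orders. A common lock-based workaround is a global lock order. The sketch below assumes each container carries a unique `id` that defines that order; `LockedBag` and its fields are hypothetical names, not part of the slides.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

final class LockedBag {
    final int id; // assumption: unique per container, defines the lock order
    final ReentrantLock lock = new ReentrantLock();
    final List<String> items = new ArrayList<>();

    LockedBag(int id) {
        this.id = id;
    }

    // Moves elem from `from` to `to`; returns false if elem was not in `from`.
    // Locks are always taken in ascending id order, whatever the direction.
    static boolean move(LockedBag from, LockedBag to, String elem) {
        LockedBag first = from.id <= to.id ? from : to;
        LockedBag second = (first == from) ? to : from;
        first.lock.lock();
        try {
            second.lock.lock(); // reentrant, so from == to is also safe
            try {
                boolean removed = from.items.remove(elem);
                if (removed) {
                    to.items.add(elem);
                }
                return removed;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }
}
```

The ordering rule has to be known to every caller, which is exactly the composability burden the `atomic { ... }` form removes.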

Overflow…

Assume [a]..[e] all map to the same L1 set.

- Write transaction: store 0,[a]; TX_begin; store 1,[a]; store 1,[b]; store 1,[c]; store 1,[d]; store 1,[e]; TX_end
- Read transaction: TX_begin; ld [b+10]; ld [b+20]; ld [b+30]; ld [b+40]; ld [b+50]; TX_end

[Figure: one set of a 4-way L1 cache. The line [a], 0, M from before the transaction becomes [a], 1, w; Way 0..Way 3 then hold [a], 1, w; [b], 1, w; [c], 1, w; [d], 1, w; the store [e], 1 has no way left.]

- What happens if a write hits a spec/non-spec line?
- Other resources are also limited
- Limited capacity
- Worse: non-determinism
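The overflow scenario can be replayed with a toy model of a single cache set. This is a sketch under simplifying assumptions, not a description of any real HTM: speculative lines can never be evicted, non-speculative lines can always be evicted, and coherence, write-back of the old [a] value and the read set are ignored. `SpecCacheSet` and its methods are invented names.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

final class SpecCacheSet {
    private final int ways;
    // address -> true if the line was written speculatively in this transaction
    private final Map<String, Boolean> lines = new LinkedHashMap<>();

    SpecCacheSet(int ways) {
        this.ways = ways;
    }

    // Returns false when no way can hold the line; inside a transaction
    // that is the point where the hardware would have to abort.
    boolean store(String addr, boolean inTransaction) {
        Boolean present = lines.get(addr);
        if (present != null) {
            // Hit: a real design would first write back a modified non-spec copy.
            lines.put(addr, present || inTransaction);
            return true;
        }
        if (lines.size() == ways && !evictNonSpeculative()) {
            return false; // the set is full of speculative lines: overflow
        }
        lines.put(addr, inTransaction);
        return true;
    }

    private boolean evictNonSpeculative() {
        Iterator<Map.Entry<String, Boolean>> it = lines.entrySet().iterator();
        while (it.hasNext()) {
            if (!it.next().getValue()) {
                it.remove();
                return true;
            }
        }
        return false;
    }

    // Replays the write transaction on a 4-way set and returns the address
    // at which it overflows, or null if it fits.
    static String runSlideExample() {
        SpecCacheSet set = new SpecCacheSet(4);
        set.store("a", false); // store 0,[a] before TX_begin
        for (String addr : new String[] { "a", "b", "c", "d", "e" }) {
            if (!set.store(addr, true)) {
                return addr;
            }
        }
        return null;
    }
}
```

In this model the fifth distinct speculative line is the one that cannot be placed. Whether a given transaction fits depends on address-to-set mapping, which is the source of the non-determinism noted on the slide.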

Software Transactional Memory

- Proposed by Shavit and Touitou ('95)
  - Manage the data structure through a SW intermediate layer
  - Log all reads/writes to track conflicts
- Enhanced in TL2
  - Relies on a versioned clock for commits
- Standalone approach, or a temporary solution until HW catches up?
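The idea of logging reads and writes and validating against a version clock can be sketched in a few dozen lines of Java. This is a deliberately simplified toy, loosely in the spirit of TL2's global version clock but using a single global commit lock instead of per-location versioned locks; it is not the TL2 algorithm. All names (`TVar`, `Tx`, `TxAbort`, `atomically`) are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.function.Consumer;

final class TVar {
    static final class Versioned {
        final int value;
        final long version;
        Versioned(int value, long version) { this.value = value; this.version = version; }
    }
    volatile Versioned cur; // value and version are swapped together
    TVar(int initial) { cur = new Versioned(initial, 0L); }
}

final class TxAbort extends RuntimeException { }

final class Tx {
    private static final Object COMMIT_LOCK = new Object();
    private static long clock = 0L; // global version clock, guarded by COMMIT_LOCK

    private final long readVersion = sampleClock();
    private final Set<TVar> readSet = new HashSet<>();
    private final Map<TVar, Integer> writeSet = new HashMap<>();

    // Sampled under the commit lock so a start never lands mid-commit.
    private static long sampleClock() {
        synchronized (COMMIT_LOCK) { return clock; }
    }

    int read(TVar v) {
        Integer pending = writeSet.get(v);
        if (pending != null) return pending;      // read-your-own-writes
        TVar.Versioned snap = v.cur;
        if (snap.version > readVersion) throw new TxAbort(); // changed since start
        readSet.add(v);
        return snap.value;
    }

    void write(TVar v, int value) { writeSet.put(v, value); } // buffered until commit

    void commit() {
        synchronized (COMMIT_LOCK) {
            for (TVar v : readSet) {
                if (v.cur.version > readVersion) throw new TxAbort();
            }
            long writeVersion = ++clock;
            for (Map.Entry<TVar, Integer> e : writeSet.entrySet()) {
                e.getKey().cur = new TVar.Versioned(e.getValue(), writeVersion);
            }
        }
    }

    // Retries the body until it commits; no bound on retries is given.
    static void atomically(Consumer<Tx> body) {
        while (true) {
            Tx tx = new Tx();
            try {
                body.accept(tx);
                tx.commit();
                return;
            } catch (TxAbort retry) {
                // conflict detected: discard the logs and run again
            }
        }
    }
}
```

A typical use would be `Tx.atomically(tx -> tx.write(x, tx.read(x) + 1))`. The unbounded retry loop in `atomically` is the part a real-time design has to replace with an analyzable bound.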

TM flavors

- TM (Herlihy, Moss, '93): original design, best effort
- SLE (Rajwar, Goodman, '01): simplified interface, avoids locks, no TM ISA required
- LTM (Ananian, '03): physical memory spilling by HW
- UTM (Ananian, '03): virtual memory, context switch support, very heavy (virtualizes each line)
- VTM (Rajwar, Herlihy, '05): another unbounded flavor, virtualizes Txs like virtual memory
- HyTM (Moir, Sun Labs, '05): attempt HTM, fall back on STM. Special consideration given to syncing between instances of both types.
- DSTM (Kumar): similar to HyTM (although both are trying hard to deny it)
- TL2 (Dice, Shavit, '06): another hybrid, very popular as a baseline for others
- PhTM (Lev, '07): another hybrid, no simultaneous HW/SW transactions
- USTM (Baugh, '08): another hybrid, user fault-on STM, with unbounded HTM based on HW memory protection
- TLE (Dice, '08): TM version of SLE
- TTM, LogTM, etc. (Moore)

Bottom line: most of the above are still best-effort HTMs. No success (forward progress) is guaranteed, and some level of SW support is required.

HTM: Industry Trends

- Sun Microsystems: Rock CPU
  - Featured Hybrid-TM and lots of other goodies such as spec-lookahead, OOO retirement, and a built-in desk warmer (250W!). Allows a mix of Tx and non-Tx code inside Tx boundaries, but retains TSO.
  - R.I.P. as of May 2010
- Azul: Vega 2/3, "Java Compute Appliance (JCA)"
  - Released 2007/8. RISC, in-order, CMP (48/54 cores per die)
  - JVM oriented, >100k threads
  - Simple HTM, no register rollback (relies on SW), no STM fallback
- AMD: Advanced Synchronization Facility (ASF)
  - Spec released in 2009. ISA includes speculate/commit, locked mov
  - Very resource constrained (4 atomic lines), flat nesting. Also allows a mix of Tx and non-Tx code inside Tx boundaries, but may break x86 memory consistency.
- Intel
  - TM compiler with HW support (HASTM based on RSM)
  - TSX on Haswell! Oops, sorry: only as of HSW-EX due to errata

Links:

- Sun: http://labs.oracle.com/scalable/pubs/ASPLOS2006.pdf
- Azul: http://sss.cs.purdue.edu/projects/tm/tmw2010//talks/Click-2010_TMW.pdf
- AMD: http://www.amd64.org/fileadmin/user_upload/pub/transact-2010-asfooo.pdf
- AMD: http://www-ali.cs.umass.edu/~moss/transact-2010/public-papers/08.pdf
- AMD: http://llvm.org/pubs/2010-04-EUROSYS-DresdenTM.pdf

RTTM (Schoeberl '10): premise

"RTTM brings the benefits of transactional memories into the real-time systems world".

Paper contributions:

- Design of a time-predictable hardware transactional memory
- Analysis of the worst-case number of retries in a periodic thread model
- Suggestions for analysis to reduce the number of possible conflicting transactions
- First evaluation of RTTM on a simulation within a Java-based CMP

Optimized for WCET, not average performance. Implemented on the Java Optimized Processor (JOP).

Java Optimized Processor (Schoeberl '07)

Unlike the JVM, JOP is "a RISC stack architecture".

WCET-friendly CPU

Time-predictable computer architecture (Schoeberl '08)

- A collection of simplifications for CPU design to reduce the bounds on WCET, at a small penalty to ACET/BCET
- Provides some reasoning (but no concrete proof)

WCET-friendly CPU (2)

Time Division Multiple Access (TDMA) memory access scheduling (Pitter and Schoeberl, '09; Rosen, '07)

- Memory access allows one slot per core
  - Transactions may only start during the access window
  - A gap allows completion (depends on memory access time)
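The appeal of TDMA for WCET analysis is that the worst-case wait for memory depends only on the schedule, not on what the other cores do. The sketch below works through that arithmetic under an assumed model that is not taken from the cited papers: each core's slot is `window + gap` cycles, an access may start only in the first `window` cycles of the core's own slot, and slots repeat round-robin. `TdmaModel` and its parameters are illustrative names.

```java
final class TdmaModel {
    // Cycles a request arriving at time t must wait before it may start.
    static int waitCycles(int t, int core, int cores, int window, int gap) {
        int slot = window + gap;
        int period = cores * slot;
        int ownStart = core * slot;
        // Position of t relative to the start of this core's access window.
        int phase = Math.floorMod(t - ownStart, period);
        return phase < window ? 0 : period - phase;
    }

    // Brute-force maximum of waitCycles over one full TDMA period.
    static int worstCaseWait(int core, int cores, int window, int gap) {
        int period = cores * (window + gap);
        int worst = 0;
        for (int t = 0; t < period; t++) {
            worst = Math.max(worst, waitCycles(t, core, cores, window, gap));
        }
        return worst;
    }
}
```

Under this model the worst case is a request that arrives just as its own window closes, giving a wait of `cores * (window + gap) - window` cycles. The bound is the same for every core and grows linearly with the core count.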

Memory access WCET

OS scheduling

"Real Time Specification for Java"

- RT threads are assigned a deadline
- Scheduler is preemptive, based on priority
  - Same priority behaves like FIFO
- Scheduler guarantees all threads hit their deadline
  - Estimation of blocking boundaries
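The dispatch rule (highest priority first, FIFO among equal priorities) can be expressed as a comparator on a ready queue. This is only a sketch of the ordering, not of the RTSJ API: `ReadyQueue`, `Task`, `release` and `pickNext` are invented names, larger numbers are assumed to mean higher priority, and preemption itself is not modeled.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class ReadyQueue {
    static final class Task {
        final String name;
        final int priority;  // assumption: larger value = higher priority
        final long arrival;  // release order, used to break ties FIFO
        Task(String name, int priority, long arrival) {
            this.name = name;
            this.priority = priority;
            this.arrival = arrival;
        }
    }

    private long nextArrival = 0;
    private final PriorityQueue<Task> queue = new PriorityQueue<>(
            Comparator.comparingInt((Task t) -> -t.priority)
                      .thenComparingLong(t -> t.arrival));

    void release(String name, int priority) {
        queue.add(new Task(name, priority, nextArrival++));
    }

    // Name of the task the scheduler would run next, or null if none is ready.
    String pickNext() {
        Task t = queue.poll();
        return t == null ? null : t.name;
    }
}
```

The explicit arrival counter is needed because `PriorityQueue` makes no ordering promise for elements that compare as equal.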

RTTM: proposal

- Transaction buffering, fully associative
- Read set caching (tags only)
- Word granularity (no false conflicts)
- Commit in bursts
  - All other cores listen (conflict checks)
  - Protected by a global lock ("commit token") (what is the overhead for short transactions?)
- No aborts on overflow! Grab the commit token on the fly
- On true abort, mark as a zombie transaction
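The commit protocol can be illustrated with a small sequential model: each core keeps read tags and a write buffer, a committing core broadcasts its written word addresses, and any other core whose read set contains one of them becomes a zombie that fails at its own commit. This is a sketch for intuition, not the RTTM hardware. Because the model is single-threaded, a commit running to completion stands in for holding the commit token, and overflow handling is left out. All names are hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class RttmModel {
    static final class Core {
        final Set<Integer> readTags = new HashSet<>();             // word addresses only
        final Map<Integer, Integer> writeBuffer = new HashMap<>(); // buffered until commit
        boolean zombie;                                            // a conflict was observed
    }

    final Map<Integer, Integer> memory = new HashMap<>();
    final Core[] cores;

    RttmModel(int n) {
        cores = new Core[n];
        for (int i = 0; i < n; i++) {
            cores[i] = new Core();
        }
    }

    int read(int core, int addr) {
        Core c = cores[core];
        Integer buffered = c.writeBuffer.get(addr);
        if (buffered != null) return buffered;
        c.readTags.add(addr);
        return memory.getOrDefault(addr, 0);
    }

    void write(int core, int addr, int value) {
        cores[core].writeBuffer.put(addr, value);
    }

    // Returns true if the transaction committed, false if it was a zombie and
    // must be retried. Either way the core's transactional state is cleared.
    boolean commit(int core) {
        Core c = cores[core];
        boolean ok = !c.zombie;
        if (ok) {
            // Runs to completion: stands in for holding the global commit token.
            for (Map.Entry<Integer, Integer> w : c.writeBuffer.entrySet()) {
                memory.put(w.getKey(), w.getValue());
                for (Core other : cores) {
                    if (other != c && other.readTags.contains(w.getKey())) {
                        other.zombie = true; // listener saw a word it had read
                    }
                }
            }
        }
        c.readTags.clear();
        c.writeBuffer.clear();
        c.zombie = false;
        return ok;
    }
}
```

Because conflicts are checked per word address, two cores touching different words of the same cache line do not conflict in this model, which matches the word-granularity point on the slide.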

RTTM Analysis

RTTM Analysis (2)

Preliminary analysis

Possible directions:

- Context-sensitive points-to analysis
- Static detection of race conditions
- Simulation-based analysis of buffer overflows

RTTM's analysis was based on the WALA analyzer (open source from IBM, '06).

Experiment methodology

Implemented over JOP, simulated on a JVM.

- 3 tasks
  - Producer enqueues into a buffer
  - Consumer removes elements from its buffer
  - Mover atomically moves elements between buffers
- Buffer types
  - Standard Java Vector
  - Bounded queue

Results

STM example (Fahmy, '09)

- EDF scheduling
- Response time analysis
  - Predicted vs. simulated, with random alignments (> 1)
  - Utilization: task time vs. period (< 1)
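The utilization figure is the sum of each task's execution time divided by its period. As a reminder of why values below 1 matter, the sketch below applies the classic uniprocessor EDF test for independent periodic tasks whose deadlines equal their periods (utilization at most 1). That textbook test is an assumption brought in here; it is not the analysis from the cited work, where transaction retry costs would have to be folded into the execution times. Names are illustrative.

```java
final class EdfUtilization {
    // U = sum of wcet[i] / period[i] over all tasks.
    static double utilization(double[] wcet, double[] period) {
        if (wcet.length != period.length) {
            throw new IllegalArgumentException("one period per task is required");
        }
        double u = 0.0;
        for (int i = 0; i < wcet.length; i++) {
            u += wcet[i] / period[i];
        }
        return u;
    }

    // Classic test: independent periodic tasks, deadline == period, one CPU.
    static boolean schedulableUnderEdf(double[] wcet, double[] period) {
        return utilization(wcet, period) <= 1.0;
    }
}
```

For example, three tasks with execution times 1, 2, 3 and periods 4, 8, 12 give U = 0.25 + 0.25 + 0.25 = 0.75, which passes the test.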

Bibliography

- M. Herlihy and J. E. B. Moss. Transactional memory: architectural support for lock-free data structures. ISCA '93.
- J. H. Anderson, S. Ramamurthy, K. Jeffay. Real-time computing with lock-free shared objects. ACM ToCS, May '97.
- M. Hohmuth, H. Härtig. Pragmatic nonblocking synchronization for real-time systems. USENIX '01.
- M. Schoeberl, F. Brandner, J. Vitek. RTTM: Real-Time Transactional Memory. SAC '10.
- M. Schoeberl. A Java processor architecture for embedded real-time systems. Journal of Systems Architecture, volume 54, Jan 2008, 265-286.
- M. Schoeberl. Time-predictable computer architecture. EURASIP J. Embedded Syst. 2009, Article 2 (January 2009).
- C. Pitter and M. Schoeberl. A real-time Java chip-multiprocessor. Trans. on Embedded Computing Sys., accepted for publication 2009.
- Manson ('05). Preemptible atomic regions (uni-processor).

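Memory ordering rules

| Type | Alpha | ARMv7 | PA-RISC | POWER | SPARC RMO | SPARC PSO | SPARC TSO | x86 | x86 oostore | AMD64 | IA-64 | zSeries |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Loads reordered after loads | Y | Y | Y | Y | Y | | | | Y | | Y | |
| Loads reordered after stores | Y | Y | Y | Y | Y | | | | Y | | Y | |
| Stores reordered after stores | Y | Y | Y | Y | Y | Y | | | Y | | Y | |
| Stores reordered after loads | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y | Y |
| Atomic reordered with loads | Y | Y | | Y | Y | | | | | | Y | |
| Atomic reordered with stores | Y | Y | | Y | Y | Y | | | | | Y | |
| Dependent loads reordered | Y | | | | | | | | | | | |
| Incoherent instruction cache pipeline | Y | Y | | Y | Y | Y | Y | Y | Y | | Y | Y |

Source: http://en.wikipedia.org/wiki/Memory_ordering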