EazyHTM: Eager-Lazy Hardware Transactional Memory Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris,


1 EazyHTM: Eager-Lazy Hardware Transactional Memory
Saša Tomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián Cristal, Osman Unsal, Tim Harris, Mateo Valero
Barcelona Supercomputing Center, UPC, BITS Pilani, Microsoft Research Cambridge

2 Why Transactional Memory?
Lock-based parallel programming has problems
–Deadlocks, races, complexity, performance, …
Transactional Memory (TM) to the rescue
–Optimistic concurrency control mechanism
–Easy to use
–Deadlock free
–Supports composability
–Protects data in critical sections
Hardware-TM (HTM), Software-TM (STM) and hybrid

3 HTM terminology
Atomic section/transaction: a group of instructions that appear to take effect instantaneously
Version management (where speculative values are stored):
–in place, logging the original value, or
–buffered in private storage and published on commit
Conflict: one TX writes a location that another TX reads
–Detection: the action in which we check for conflicts
–Resolution: the action performed to resolve the conflict; it can be an abort, stalling the execution, …
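The two version-management options can be contrasted in a minimal Python sketch. The class names `UndoLogTx` and `WriteBufferTx`, and the dictionary standing in for memory, are illustrative assumptions rather than structures from the talk.

```python
class UndoLogTx:
    """In-place writes: memory is updated immediately, old values are logged."""

    def __init__(self, memory):
        self.memory = memory      # shared dict: address -> value (assumed pre-populated)
        self.undo_log = []        # (address, original value), in write order

    def write(self, addr, value):
        self.undo_log.append((addr, self.memory[addr]))
        self.memory[addr] = value

    def commit(self):
        self.undo_log.clear()     # nothing to copy: new values are already in place

    def abort(self):
        # Restore original values, newest first.
        for addr, old in reversed(self.undo_log):
            self.memory[addr] = old
        self.undo_log.clear()


class WriteBufferTx:
    """Buffered writes: memory stays untouched until commit publishes the buffer."""

    def __init__(self, memory):
        self.memory = memory
        self.buffer = {}          # private storage: address -> speculative value

    def write(self, addr, value):
        self.buffer[addr] = value

    def read(self, addr):
        if addr in self.buffer:   # a TX must see its own speculative writes
            return self.buffer[addr]
        return self.memory[addr]

    def commit(self):
        self.memory.update(self.buffer)
        self.buffer.clear()

    def abort(self):
        self.buffer.clear()       # discarding the buffer is all an abort needs
```

In this sketch the undo log makes commit trivial and abort proportional to the number of writes, while the write buffer does the opposite.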

4 Eager HTM
A.k.a. pessimistic
Writes in place; detects and resolves conflicts on every access
LogTM [Moore, HPCA06], LogTM-SE [Yen, HPCA07]
[Diagram: timelines of TX 1, TX 2 and TX 3 with one write (W), two reads (R), a stall, and a fast commit]
Limited concurrency
Fast commit
Slow abort

5 Lazy HTM
A.k.a. optimistic
Writes buffered; detects and resolves conflicts on commit
TCC [Hammond, ISCA04], Scalable-TCC [Chafi, HPCA07]
[Diagram: timelines of TX 1, TX 2 and TX 3 with one write (W), two reads (R), and a complex commit: validate + write]
Fast abort
Complex commit
Good concurrency

6 The Motivation
Splitting conflict management
Eager-Lazy hardware-software TM exists (FlexTM [Shriraman, ISCA08]):
–Software begin, commit and abort
–Probabilistic (signature based) conflict detection
EazyHTM is the first pure-hardware Eager-Lazy TM

Conflict resolution    Eager conflict detection    Lazy conflict detection
Eager                  LogTM                       Impossible
Lazy                   EazyHTM                     TCC, S-TCC

Fast commit, good concurrency

7 Outline
Motivation
Contributions
Hardware changes
The Protocol
Evaluation
Conclusions

8 EazyHTM Contributions
The best of both worlds
–Eager conflict detection: simple commit, exact list of conflicts known in advance
–Lazy conflict resolution: good concurrency
Parallel commits of non-conflicting TXs
Designed for CMPs (Chip-Multiprocessors)
–Exploits core proximity
–MESI/MOESI protocol upgrade (easier verification)

9 Hardware changes
CPU (per core):
–Register file checkpoint
–Racers list – 1 bit per core
–Killers list – 1 bit per core
–The two lists track conflicts: a bit-vector of 32 bits for 32 cores
Private cache(s), next to the existing cache logic:
–SR – 1 bit per line
–SM – 1 bit per line
–These bits hold the read/write set
Directory, next to the existing directory logic:
–TD – 1 bit per line
–Read-only optimization bit (details in the paper)
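A rough feel for the added per-core state can be had with a little arithmetic. The helper below is a hypothetical sketch: `tm_state_bits` is not from the talk, and it assumes the SR and SM bits are attached to every line of one private cache, using the 32KB L1 with 64B lines listed on the Implementation slide. It ignores the register file checkpoint and the TD bits in the directory.

```python
def tm_state_bits(num_cores, cache_bytes, line_bytes):
    """Return (list_bits, line_bits) of added state for one core.

    Assumption: SR/SM bits cover every line of a single private cache.
    """
    list_bits = 2 * num_cores               # racers + killers, 1 bit per core each
    num_lines = cache_bytes // line_bytes
    line_bits = 2 * num_lines               # SR + SM, 1 bit per line each
    return list_bits, line_bits


list_bits, line_bits = tm_state_bits(num_cores=32, cache_bytes=32 * 1024, line_bytes=64)
print(list_bits, line_bits)  # 64 1024
```

Under these assumptions the two lists cost 64 bits per core and the SR/SM bits cost 1024 bits (128 bytes) per 32KB cache.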

10 Racers and killers lists
If a line is shared between two TXs:
–Read-Read: no conflict
–Write-Read, Read-Write, Write-Write: conflict
Writer adds the reader TX to its “racers” list
–“TXs that I have to abort” list, if I commit first
Reader adds the writer TX to its “killers” list
–“TXs that can abort me” list, if they commit first
We illustrate only the Write-after-Read (WAR) conflict
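The list-update rule can be sketched with one integer bit-vector per list, one bit per core. The names `ConflictLists`, `on_shared_line` and `members` are hypothetical. The Write-Write case is an assumption of this sketch: the slide does not spell it out, so both TXs are treated as racer and killer of each other.

```python
NUM_CORES = 32  # list width from the hardware slide: 1 bit per core


class ConflictLists:
    """Racers and killers lists of one TX, stored as bit-vectors."""

    def __init__(self):
        self.racers = 0   # bit i set: TX on core i must be aborted if I commit first
        self.killers = 0  # bit i set: TX on core i can abort me if it commits first

    def on_shared_line(self, my_access, other_access, other_core):
        """Record sharing of one line; accesses are "R" or "W"."""
        if not 0 <= other_core < NUM_CORES:
            raise ValueError("core id out of range")
        bit = 1 << other_core
        if my_access == "W":      # I write what the other TX accessed
            self.racers |= bit
        if other_access == "W":   # the other TX writes what I accessed
            self.killers |= bit
        # Read-Read sets no bit: no conflict.


def members(mask):
    """List the core ids whose bit is set in a bit-vector."""
    return [core for core in range(NUM_CORES) if (mask >> core) & 1]


# WAR example: TX 0 read the line, then TX 2 writes it.
tx0, tx2 = ConflictLists(), ConflictLists()
tx2.on_shared_line("W", "R", other_core=0)
tx0.on_shared_line("R", "W", other_core=2)
print(members(tx2.racers), members(tx0.killers))  # [0] [2]
```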

11 EazyHTM Protocol: Conflict Detection (1/2)
[Diagram: TX 0 (BTX, RD A, CTX) and TX 2 (BTX, WR A, CTX), each with racers and killers lists; the directory holds the sharers of @A]
Messages: txMark @A (replaces GETS/GETX); ACK @A, 0 (no other sharers)

12 EazyHTM Protocol: Conflict Detection (2/2)
[Diagram: TX 0 (BTX, RD A, CTX) and TX 2 (BTX, WR A, CTX), each with racers and killers lists; the directory holds the sharers of @A]
Messages: txMark @A; txAccessor #2, @A (Writer #2, @A); ACK @A, 1 (1 other sharer); Reader #0, @A (potential conflict)
TX 2 remembers: abort TX#0 on commit
TX 0 remembers: TX#2 can abort me
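One plausible reading of the detection flow is sketched below in Python. The message names `txMark`, `txAccessor` and `ACK` come from the slides; the classes `Core` and `Directory`, the method names, and the collapsing of the sharer's reply into a direct function call are assumptions made for illustration. Write-Write sharing is again assumed to be symmetric.

```python
class Core:
    def __init__(self, core_id):
        self.core_id = core_id
        self.racers = set()    # TXs I must abort if I commit first
        self.killers = set()   # TXs that can abort me if they commit first
        self.access = {}       # line -> "R" or "W" (strongest access so far)


class Directory:
    def __init__(self, cores):
        self.cores = cores     # core id -> Core
        self.sharers = {}      # line -> set of core ids that marked it

    def tx_mark(self, core_id, line, kind):
        """Handle txMark from core_id; return the sharer count carried by the ACK."""
        requester = self.cores[core_id]
        others = sorted(self.sharers.setdefault(line, set()) - {core_id})
        for other_id in others:
            # txAccessor: tell each existing sharer who is accessing the line.
            self._tx_accessor(self.cores[other_id], requester, line, kind)
        self.sharers[line].add(core_id)
        if kind == "W" or line not in requester.access:
            requester.access[line] = kind
        return len(others)

    @staticmethod
    def _tx_accessor(sharer, requester, line, kind):
        if kind == "W":                    # requester writes what the sharer accessed
            requester.racers.add(sharer.core_id)
            sharer.killers.add(requester.core_id)
        if sharer.access[line] == "W":     # sharer wrote what the requester accesses
            sharer.racers.add(requester.core_id)
            requester.killers.add(sharer.core_id)


cores = {0: Core(0), 2: Core(2)}
directory = Directory(cores)
print(directory.tx_mark(0, "A", "R"))  # 0
print(directory.tx_mark(2, "A", "W"))  # 1
print(cores[2].racers, cores[0].killers)  # {0} {2}
```

Nothing is aborted at this point: the sketch only records who conflicts with whom, which is what makes the detection eager and the resolution lazy.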

13 EazyHTM Protocol: Conflict Resolution
[Diagram: TX 0 (BTX, RD A, CTX) and TX 2 (BTX, WR A, CTX), each with racers and killers lists; the directory holds the sharers of @A]
TX#2 came to the commit point first: abort TX#0!
Messages: Abort from TX#2; Abort Ack from TX#0; WR @A (commit)
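The commit-time resolution can be sketched as follows. `Tx` and `commit` are hypothetical names, the abort and acknowledgement messages are modeled as direct state changes, and clearing the victim's lists on abort is an assumption of this sketch.

```python
class Tx:
    def __init__(self, tx_id):
        self.tx_id = tx_id
        self.racers = set()      # filled in earlier, during conflict detection
        self.killers = set()
        self.status = "running"  # "running", "committed" or "aborted"


def commit(tx, all_txs):
    """Abort every racer that is still running, then commit tx."""
    if tx.status != "running":
        return False                       # an aborted TX cannot commit
    for victim_id in sorted(tx.racers):
        victim = all_txs[victim_id]
        if victim.status == "running":     # "Abort from TX#n" then "Abort Ack"
            victim.status = "aborted"
            victim.racers.clear()
            victim.killers.clear()
    tx.status = "committed"                # "WR @A (commit)"
    return True


# WAR case from the slides: TX 0 read @A, TX 2 wrote @A.
tx0, tx2 = Tx(0), Tx(2)
tx2.racers.add(0)
tx0.killers.add(2)
txs = {0: tx0, 2: tx2}
commit(tx2, txs)
print(tx0.status, tx2.status)  # aborted committed
```

In this sketch, if TX 0 reached its commit point first, its empty racers list would let it commit without aborting TX 2, and TX 2 could still commit afterwards.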

14 EazyHTM Protocol: Disjoint data => parallel commit
[Diagram: TX 0 (BTX, WR A, CTX) and TX 2 (BTX, WR B, CTX), each with racers and killers lists; the directory holds the sharers of @A and @B]
TX#0 works with line @A; TX#2 works with line @B
Messages: txMark @A and txMark @B; ACK @A, 0 and ACK @B, 0 (0 other sharers); WR @A (commit) and WR @B (commit)
NO SERIALIZATION
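Whether two TXs touch disjoint data can be checked at cache-line granularity, as in the sketch below. The function names are hypothetical, byte addresses are assumed to be plain integers, and the 64B line size is taken from the Implementation slide.

```python
LINE_BYTES = 64  # cache-line size listed on the Implementation slide


def to_lines(addresses):
    """Map byte addresses to cache-line numbers."""
    return {addr // LINE_BYTES for addr in addresses}


def conflict(reads_a, writes_a, reads_b, writes_b):
    """True if TX a and TX b conflict on at least one cache line."""
    ra, wa = to_lines(reads_a), to_lines(writes_a)
    rb, wb = to_lines(reads_b), to_lines(writes_b)
    # A conflict needs at least one writer on a shared line.
    return bool(wa & (rb | wb)) or bool(wb & ra)


# TX 0 writes line @A, TX 2 writes line @B: no shared line, so both racers
# lists stay empty and neither commit has to wait for the other.
print(conflict(set(), {0x1000}, set(), {0x2000}))  # False
```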

15 Implementation
Implemented in M5, a full-system simulator (Alpha)
Private L1 (32KB, 4-way, 64B CL, 2 cycles)
Private L2 (512KB, 8-way, 64B CL, 10 cycles)
Memory (with directory, 100 cycles)
ICN (2D Mesh, 10 cycles per hop)
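To see what 10 cycles per hop means for protocol messages, the sketch below computes a one-way mesh latency. It assumes dimension-ordered (XY) routing, row-major numbering of the nodes, and a caller-chosen mesh width; none of these details are given on the slide, and `mesh_latency` is a hypothetical helper.

```python
CYCLES_PER_HOP = 10  # from the ICN entry above


def mesh_latency(src, dst, width):
    """One-way latency in cycles between two nodes of a 2D mesh.

    Assumptions: row-major node numbering and Manhattan-distance routing.
    """
    sx, sy = src % width, src // width
    dx, dy = dst % width, dst // width
    hops = abs(sx - dx) + abs(sy - dy)
    return hops * CYCLES_PER_HOP


# 32 nodes arranged as an 8 x 4 mesh (an assumed layout).
print(mesh_latency(0, 9, width=8))   # 20
print(mesh_latency(0, 31, width=8))  # 100
```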

16 Evaluation
Evaluated STAMP benchmarks
Compared with Scalable-TCC-like HTM
–Same base simulator
–Implemented specialized directory protocol
Compared with ideal lazy HTM (MESI based)
–magical conflict detection
–instant conflict resolution
–parallel write-back commit

17 Kmeans Low
Small TXs (read set (RS) 15 cache lines (CL); write set (WS) 5 CL)
Low contention (10% aborts)
Similar profile to “replacing locks with atomic”
Near ideal performance
K-means: groups points in an N-dimensional space into K clusters
Most of the SPLASH-2 suite has a similar profile

18 SSCA2
Small TXs (RS 50 CL, WS 10 CL)
Low contention (1.2% aborts)
Near ideal performance
Scalability affected by barriers, not by contention
SSCA2: large directed graph operations

19 Yada
Large TXs (RS 260 CL, WS 140 CL)
Moderate contention (35% aborts)
Good performance for large TXs as well
Yada: Delaunay mesh refinement

20 Intruder
Medium TXs (RS 53 CL, WS 20 CL)
High contention (85% aborts)
Very bad scalability for all HTMs
Every transaction detects conflicts over and over again; the many conflict-detection messages slow down the execution
Intruder: signature-based network intrusion detection system

21 Only high-conflict STAMP
>50% abort rate only
High-contention, high-core-count execution should be optimized
Averages: Labyrinth, Intruder, Kmeans-Hi
Results highly affected by Intruder

22 Only low-conflict STAMP
<50% abort rate only
Low abort rate necessary for scaling
Excludes: Labyrinth 8-32, Intruder 16-32, Kmeans-Hi 32

23 Conclusions
Introduced EazyHTM, a new HTM implementation
–Eager conflict detection, lazy conflict resolution
–Fast: performs well for low-conflict parallel applications
–Minimal changes to directory protocols (easier verification)
–As scalable as a standard directory protocol
EazyHTM mechanism could allow (future work):
–Simpler transaction prioritization
–Less wasted work
–Better performance optimization
–Power-efficient TM mechanisms

24 Thank you!
Questions?
sasa.tomic@bsc.es

