More on Thread Level Speculation Anthony Gitter Dafna Shahaf Or Sheffet.


1 More on Thread Level Speculation Anthony Gitter Dafna Shahaf Or Sheffet

2 Thread Level Speculation (TLS) A technique for automatic parallelization. Run threads in parallel, but in a speculative state. Check for violations. Commit upon successful completion. Squash when detecting a violation. – Propagate the squash onwards. – Re-run the thread.
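The run / check / commit / squash cycle on this slide can be sketched in software. This is a minimal illustration, not the hardware mechanism the talk describes: the names `speculate` and `run_tls`, the dictionary used as memory, and the choice of commit-time (lazy) checking are all assumptions made for the example. Tasks do not forward values to each other here, so "propagating the squash onwards" reduces to each later task being checked at its own commit.

```python
def speculate(task, memory):
    """Run one task speculatively: log the addresses it reads, buffer its writes."""
    reads, writes = set(), {}

    def load(addr):
        if addr in writes:          # value produced by this task itself
            return writes[addr]
        reads.add(addr)             # exposed read: may depend on an earlier task
        return memory[addr]

    def store(addr, value):
        writes[addr] = value        # buffered, invisible to others until commit

    task(load, store)
    return reads, writes


def run_tls(tasks, memory):
    """Execute tasks (given in program order) with commit-time violation checks."""
    snapshot = dict(memory)
    # "Parallel" phase: every task runs against the same initial state.
    attempts = [speculate(t, snapshot) for t in tasks]
    committed = set()               # addresses written by commits so far
    squashes = 0
    for task, (reads, writes) in zip(tasks, attempts):
        if reads & committed:       # read something an earlier task has since written
            squashes += 1
            reads, writes = speculate(task, memory)   # squash and re-run
        memory.update(writes)       # commit
        committed.update(writes)
    return squashes


# Hypothetical three-iteration loop body, one task per iteration.
def inc_x(load, store):
    store("x", load("x") + 1)

def y_from_x(load, store):          # truly depends on inc_x
    store("y", load("x") * 10)

def set_z(load, store):             # independent of the others
    store("z", 5)

mem = {"x": 1, "y": 0, "z": 0}
squash_count = run_tls([inc_x, y_from_x, set_z], mem)
```

Only `y_from_x` is squashed and re-run; the final memory matches what sequential execution would produce.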

3 Thread Level Speculation Example

4 Mechanism of TLS
1. Managing speculative state.
2. Disambiguation: checking addresses for violating dependencies – Eager vs. Lazy
3. Upon commit – Broadcast (Everybody? Relevant?) – Invalidate/update of other threads – Leave speculative state
4. Upon squash – Broadcast – Invalidate changes for this thread – Re-run
At hardware level. Involve cache → Simple. Fast.

5 Scenarios Thread attributes: – Length – Memory accesses – Dependences
Length   Accesses   Depend.   Outcome
?        ?          Many      Serial
?        ?          0         Easily parallel
Short    Many       Few       TLS costly
Short    Few        Few       TLS works
Long     Many       Few       TLS costly
Long     Few        Few       TLS costly

6 When is TLS Too Costly? “Too much data” scenario – Thread touches too many addresses. “Too much time” scenario – Execution involves many instructions (e.g. database transactions). Papers covered: “Bulk Disambiguation of Speculative Threads in Multiprocessors” (Ceze, Tuck, Cascaval, Torrellas). “Tolerating Dependences Between Large Speculative Threads Via Sub-Threads” (Colohan, Ailamaki, Steffan, Mowry).

7 Too Many Addresses – Solution 1 Each thread maintains a bitwise mask of the cache. Flip bit on when touching an address. Upon completion, check addresses you and others touched. (Lazy) Commit / Squash : send mask. Invalidating/replacing/changing address state in cache: use mask. All bitwise operations. Very simple! Infeasible for size reasons (won’t scale).
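A minimal sketch of the bitmask idea, using Python integers as arbitrarily long bit vectors. Keeping separate read and write masks is an assumption of this example (the slide mentions a single mask), and `AccessMask`, `violates` and `touched` are hypothetical names.

```python
class AccessMask:
    """One bit per address; a bit is switched on when the thread touches it."""

    def __init__(self):
        self.read = 0
        self.write = 0

    def on_load(self, addr):
        self.read |= 1 << addr

    def on_store(self, addr):
        self.write |= 1 << addr


def violates(committer, other):
    """Lazy check at commit: did `other` read something `committer` wrote?"""
    return (committer.write & other.read) != 0


def touched(mask):
    """Expand a mask back into the exact list of addresses (e.g. to invalidate)."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


t1, t2, t3 = AccessMask(), AccessMask(), AccessMask()
t1.on_store(3)
t2.on_load(3)
t2.on_load(5)
t3.on_load(4)
```

Every operation is a single bitwise AND/OR, which is the appeal; the cost is that the mask needs one bit per trackable address, which is why the slide calls it infeasible at scale.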

8 Solution: Hash! Introducing BULK: hardware that hashes the address space into a signature (~2k in size). Upon completion, send signature! Upon receiving, pull back to a superset of possible addresses. [Figure: addresses from the address space are hashed to bit vectors (010100001, 001100100) and combined by bitwise OR into the signature (011100101).]

9 Bulk Features: Separate Reading / Writing signatures. Committing: sending signature. Invalidating: pulling back signature into a superset. Granularity is on word level (not cache line) – since we map addresses Caveat: We might see violations even if there weren't any!
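The signature operations can be sketched as follows. The slides do not describe the hardware hash, so `addr % SIG_BITS` is a stand-in chosen only to make aliasing easy to see, and the 16-bit signature is deliberately tiny; `make_signature`, `may_conflict` and `expand` are hypothetical names.

```python
SIG_BITS = 16   # toy size; the slide mentions roughly 2k


def sig_bit(addr):
    # Placeholder hash (an assumption of this sketch): address modulo signature length.
    return 1 << (addr % SIG_BITS)


def make_signature(addrs):
    """Fold a set of addresses into one fixed-size bit vector with bitwise OR."""
    sig = 0
    for a in addrs:
        sig |= sig_bit(a)
    return sig


def may_conflict(sig_a, sig_b):
    """Non-empty intersection means a possible dependence (never misses a real one)."""
    return (sig_a & sig_b) != 0


def expand(sig, candidates):
    """'Pull back' a signature to the superset of candidate addresses it may contain."""
    return [a for a in candidates if sig & sig_bit(a)]


write_sig = make_signature([3, 40])          # committing thread wrote 3 and 40
real_dep = make_signature([3])               # another thread really read 3
alias = make_signature([19])                 # 19 hashes to the same bit as 3
unrelated = make_signature([5])
superset = expand(write_sig, range(48))
```

`alias` illustrates the caveat on the slide: address 19 was never written, yet its signature intersects the write signature, so that thread would be squashed needlessly.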

10 Bulk Performance

11 Fraction of False Positives as a function of Signature Length
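The trend in this plot can be reproduced qualitatively with a small experiment. This is not the data from the talk: the modulo hash, the set size of 8 addresses, the 20-bit address range and the trial count are all assumptions of the sketch. It measures how often two disjoint address sets produce intersecting signatures, and compares that with the simple union bound (pairs of addresses) / (signature bits).

```python
import random


def signature(addrs, sig_bits):
    sig = 0
    for a in addrs:
        sig |= 1 << (a % sig_bits)   # placeholder hash, as an assumption
    return sig


def false_positive_rate(sig_bits, set_size=8, trials=2000, seed=0):
    """Fraction of trials in which disjoint read/write sets appear to conflict."""
    rng = random.Random(seed)        # fixed seed keeps the experiment repeatable
    hits = 0
    for _ in range(trials):
        addrs = rng.sample(range(1 << 20), 2 * set_size)   # distinct addresses
        reads, writes = addrs[:set_size], addrs[set_size:]
        if signature(reads, sig_bits) & signature(writes, sig_bits):
            hits += 1
    return hits / trials


def union_bound(sig_bits, set_size=8):
    """Upper bound: each read/write pair aliases with probability at most 1/sig_bits."""
    return min(1.0, set_size * set_size / sig_bits)


rate_short = false_positive_rate(64)
rate_long = false_positive_rate(4096)
```

Longer signatures spread the same addresses over more bits, so `rate_long` comes out well below `rate_short`.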

12 When is TLS Too Costly? “Too much data” scenario – Thread touches too many addresses. “Too much time” scenario – Execution involves many instructions (e.g. database transactions). Papers covered: “Bulk Disambiguation of Speculative Threads in Multiprocessors” (Ceze, Tuck, Cascaval, Torrellas). “Tolerating Dependences Between Large Speculative Threads Via Sub-Threads” (Colohan, Ailamaki, Steffan, Mowry).

13 Handling Long Threads (Attempt 1) Q: Does eliminating a data dependence help? Upon violation – we re-execute a long thread. [Figure: parallel threads with stores *p=, *q= and loads =*p, =*q; labels: “R2”, “Violation!”, “Parallel”.] Image courtesy Chris Colohan

14 Handling Long Threads (Attempt 1) [Figure: the same threads as on slide 13, plus a second timeline labeled “Eliminate *p Dep.” in which *q= and =*q still produce “Violation!”.] Image courtesy Chris Colohan

15 Handling Long Threads (Attempt 2): Sub-Threads. Sub-threads are checkpoints during thread execution. No longer “all or nothing”. Must be lightweight. Help with primary and secondary violations. [Figure: a *q= store and an =*q load marked “Violation!”.] Image courtesy Chris Colohan
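The benefit of checkpoints can be shown with a small calculation of where a thread restarts after a violation. This is a software simplification: the load log, the operation indices and the function name `rewind_point` are assumptions of the sketch, not how the hardware tracks sub-threads.

```python
def rewind_point(load_log, checkpoints, violated_addr):
    """Return the operation index at which execution resumes after a violation.

    load_log     -- list of (op_index, addr) for the thread's speculative loads
    checkpoints  -- op indices where sub-threads begin (must include 0)
    """
    # Earliest operation that consumed the stale value.
    first_bad = min(i for i, addr in load_log if addr == violated_addr)
    # Restart at the latest checkpoint that is not after that operation.
    return max(c for c in checkpoints if c <= first_bad)


loads = [(100, "p"), (620, "q")]                 # hypothetical 1000-op thread
with_subthreads = [0, 250, 500, 750]
without_subthreads = [0]

restart_q = rewind_point(loads, with_subthreads, "q")
restart_q_plain = rewind_point(loads, without_subthreads, "q")
restart_p = rewind_point(loads, with_subthreads, "p")
```

With sub-threads, a violation on `q` discards only the work after operation 500; without them the whole thread restarts from 0. A violation on `p`, read early in the thread, still rewinds to the start either way.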

16 Sub-thread Implementation. Assume CMP with shared L2. L1 is unaware of sub-threads – Speculatively modified bit per cache line. L2 performs eager violation detection – 2 additional bits per cache line per sub-thread – Replication to track different sub-thread contexts.
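A rough model of the per-sub-thread bits and eager detection at the L2, as a sketch under stated assumptions: thread ids follow program order (lower means logically earlier), the two bits per line per sub-thread are taken to be a load bit and a store bit, and line replication and exposed-load filtering are ignored. `SpeculativeL2` and `L2Line` are hypothetical names.

```python
from collections import defaultdict


class L2Line:
    def __init__(self):
        # (thread, sub_thread) -> [loaded, stored]: the two bits per sub-thread
        self.bits = defaultdict(lambda: [False, False])


class SpeculativeL2:
    def __init__(self):
        self.lines = defaultdict(L2Line)

    def load(self, thread, sub, addr):
        self.lines[addr].bits[(thread, sub)][0] = True

    def store(self, thread, sub, addr):
        """Record the store and eagerly report which sub-threads must rewind.

        Returns a sorted list of (thread, earliest violated sub-thread).
        """
        line = self.lines[addr]
        line.bits[(thread, sub)][1] = True
        victims = {}
        for (t, s), (loaded, _stored) in line.bits.items():
            if t > thread and loaded:        # a later thread already read this line
                victims[t] = min(s, victims.get(t, s))
        return sorted(victims.items())


l2 = SpeculativeL2()
l2.load(2, 0, 0x00)                 # thread 2, sub-thread a reads 0x00
l2.load(2, 1, 0x01)                 # thread 2, sub-thread b reads 0x01
late_store = l2.store(1, 0, 0x01)   # thread 1 then writes 0x01
early_store = l2.store(1, 0, 0x00)  # thread 1 writes 0x00
no_reader = l2.store(1, 0, 0x02)    # nobody has read 0x02
```

Because the load bits are kept per sub-thread, the store to 0x01 identifies sub-thread b of thread 2 as the rewind point, leaving sub-thread a's work intact.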

17 Sub-thread Evaluation [Figure: bar chart of normalized time (0 to 1.2), broken down into Idle CPU, Failed, Cache Miss and Busy, for the benchmarks New Order, New Order 150, Delivery, Delivery Outer, Stock Level, Payment and Order Status. Bars per benchmark: N = no sub-threads, S = with sub-threads, L = limit, ignoring violations.] Image courtesy Chris Colohan

18 Summary Thread attributes: – Length – Memory accesses – Dependences
Length   Accesses   Depend.   Outcome
?        ?          Many      Serial
?        ?          0         Easily parallel
Short    Few        Few       TLS works
Short    Many       Few       TLS costly: BULK
Long     Few        Few       TLS costly: Sub-Threads
Long     Many       Few       Hopeless??

19 Open Questions. Long threads that also touch many addresses. – Bulk on top of sub-threads? Combining lazy/eager evaluations. Thank you!

20 Backup Slides

21 Buffering Large Threads. Instructions shown: store X, 0x00. Store and load bit per thread. [Diagram: contents and S/L bits of the two L1 caches and the shared L2 for addresses 0x00 and 0x01.] Slide courtesy Chris Colohan

22 Buffering Large Threads. Instructions shown: store X, 0x00; store A, 0x01. [Diagram: L1/L2 cache contents for addresses 0x00 and 0x01.] Slide courtesy Chris Colohan

23 Buffering Large Threads. Instructions shown: store X, 0x00; store A, 0x01; load 0x00. [Diagram: L1/L2 cache contents for addresses 0x00 and 0x01.] Slide courtesy Chris Colohan

24 Buffering Large Threads. Instructions shown: store X, 0x00; store A, 0x01; load 0x00; store Y, 0x00. Replicate line – one version per thread. [Diagram: L1/L2 cache contents for addresses 0x00 and 0x01.] Slide courtesy Chris Colohan

25 Buffering Large Threads. Instructions shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00. [Diagram: L1/L2 cache contents for addresses 0x00 and 0x01.] Slide courtesy Chris Colohan

26 Buffering Large Threads. Instructions shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01. [Diagram: L1/L2 cache contents for addresses 0x00 and 0x01.] Slide courtesy Chris Colohan

27 Sub-thread Support. Instructions shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01. Divide into two sub-threads (a and b). Only roll back violated sub-thread. [Diagram: L1/L2 cache contents for addresses 0x00 and 0x01.] Slide courtesy Chris Colohan

28 Sub-thread Support. Instructions shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01. Store and load bit per sub-thread (a and b). [Diagram: L1/L2 cache contents with per-sub-thread bits for addresses 0x00 and 0x01.] Copyright 2006 Chris Colohan. Slide courtesy Chris Colohan

29 Sub-thread Support. Instructions shown: store X, 0x00; store A, 0x01; load 0x00; load 0x01; store Y, 0x00; store B, 0x01. [Diagram: L1/L2 cache contents with per-sub-thread bits (sub-threads a and b) for addresses 0x00 and 0x01.] Slide courtesy Chris Colohan

30 Sub-thread Evaluation Evaluate using large database transactions Parallelize the loops Can we place an upper bound on the possible speedup?

