Thread-Level Speculation Karan Singh CS 612 2.23.2006.

Presentation transcript:

1 Thread-Level Speculation Karan Singh CS 612 2.23.2006

2 Introduction • extraction of parallelism at compile time is limited • Thread-Level Speculation (TLS) is a form of optimistic parallelization • TLS allows automatic parallelization by supporting thread execution without advance knowledge of any dependence violations

3 Introduction • Zhang et al.: extensions to the cache coherence protocol hardware to detect dependence violations • Pickett et al.: design for a Java-specific software TLS system that operates at the bytecode level

4 Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors — Ye Zhang, Lawrence Rauchwerger, Josep Torrellas

5 Outline • Loop parallelization basics • Speculative Run-Time Parallelization in Software • Speculative Run-Time Parallelization in Hardware • Evaluation and Comparison

6 Loop parallelization basics • a loop can be executed in parallel without synchronization only if the outcome is independent of the order of iterations • need to analyze data dependences across iterations: flow, anti, output • if no dependences: doall loop • if only anti or output dependences: privatization, scalar expansion, …
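To make the dependence categories concrete, here is a small Python sketch (not from the slides; the loops and the helper `has_flow_dependence` are illustrative) of the cross-iteration patterns and of checking for the flow dependences that rule out a doall loop:

```python
# Hypothetical loops illustrating the three cross-iteration dependence types:
#   flow (true):  for i in 1..N:   A[i] = A[i-1] + 1   (read of an earlier write)
#   anti:         for i in 0..N-1: A[i] = A[i+1] * 2   (write of a later read)
#   output:       for i in 0..N:   s = A[i]            (every iteration writes s)

def has_flow_dependence(reads, writes):
    """Check whether any iteration reads an element written by an
    earlier iteration (a flow dependence forbids a doall loop).
    reads/writes: per-iteration lists of element indices touched."""
    written_before = set()
    for it in range(len(reads)):
        if any(r in written_before for r in reads[it]):
            return True
        written_before.update(writes[it])
    return False

# A[i] = A[i-1] + 1 over iterations i = 1..4: iteration reads i-1, writes i
reads = [[0], [1], [2], [3]]
writes = [[1], [2], [3], [4]]
print(has_flow_dependence(reads, writes))  # True: not a doall
```

If each iteration instead read and wrote only its own element, the function would return False and the loop could run as a doall.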

7 Loop parallelization basics • to parallelize such a loop speculatively, we need: a way of saving and restoring state; a method to detect cross-iteration dependences

8 Speculative Run-Time Parallelization in Software • mechanism for saving/restoring state: before executing speculatively, save the state of the arrays that will be modified; dense access: save the whole array; sparse access: save individual elements; save only modified shared arrays; in all cases, if the loop is not found parallel after execution, the arrays are restored from their backups
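The save/restore mechanism can be sketched as follows — a minimal Python illustration, not the paper's implementation; `run_speculative_loop`, `parallel_body`, and `check_parallel` are hypothetical names, and the whole-array backup corresponds to the dense-access case:

```python
import copy

def run_speculative_loop(arrays, parallel_body, check_parallel):
    """Back up the shared arrays the speculative loop may modify, run it,
    and roll back from the backups if the loop was not actually parallel."""
    # dense access: save the whole array (a sparse-access variant would
    # save only the individual elements actually modified)
    backups = {name: copy.deepcopy(a) for name, a in arrays.items()}
    parallel_body(arrays)            # speculative parallel execution
    if not check_parallel():         # e.g. the LRPD analysis phase
        for name, saved in backups.items():
            arrays[name][:] = saved  # restore: loop was not a doall
        return False
    return True
```

On success the speculative updates are kept; on failure the arrays are bitwise identical to their pre-loop state and the loop can be re-executed serially.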

9 Speculative Run-Time Parallelization in Software • LRPD test to detect dependences: flags the existence of cross-iteration dependences; applied to those arrays whose dependences cannot be analyzed at compile time; two phases: Marking & Analysis

10 LRPD test • setup: back up A(1:s); initialize the shadow arrays Ar(1:s), Aw(1:s) to zero; initialize the scalar Atw to zero

11 LRPD test • marking: performed for each iteration during speculative parallel execution of the loop; write to A(i): set Aw(i); read from A(i): if A(i) has not been written in this iteration, set Ar(i); at the end of the iteration, count how many different elements of A have been written and add the count to Atw

12 LRPD test • analysis: performed after the speculative execution; compute Atm = number of non-zero Aw(i) over all elements i of the shadow array; if any(Aw(:) ∧ Ar(:)), the loop is not a doall: abort execution; else if Atw == Atm, the loop is a doall
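The two phases above can be sketched in Python — a sequential simulation of the test for illustration, assuming each iteration's accesses are given as an ordered list of ('r'/'w', element) pairs; the function name and input encoding are my own, not the paper's:

```python
def lrpd_test(n, iterations):
    """Simulate the LRPD marking and analysis phases for an array of
    size n. `iterations`: one ordered list of ('r'|'w', index) accesses
    per iteration of the speculatively parallelized loop."""
    Aw = [0] * n   # shadow array: element was written by some iteration
    Ar = [0] * n   # shadow array: element was read without a prior
                   # write in the same iteration
    Atw = 0        # running total of distinct elements written per iteration

    # marking phase: performed during each iteration
    for accesses in iterations:
        written_here = set()
        for op, i in accesses:
            if op == 'w':
                Aw[i] = 1
                written_here.add(i)
            elif i not in written_here:  # read not preceded by a write
                Ar[i] = 1                # in this same iteration
        Atw += len(written_here)

    # analysis phase: performed after the speculative execution
    Atm = sum(Aw)                        # number of non-zero Aw(i)
    if any(w and r for w, r in zip(Aw, Ar)):
        return False                     # cross-iteration dependence: not a doall
    return Atw == Atm                    # Atw == Atm => at most one writer per element

print(lrpd_test(1, [[('w', 0)], [('w', 0)]]))  # two iterations write x -> False
```

Run against the four examples on the following slides, this sketch reproduces their outcomes: two writers fail on Atw ≠ Atm, a cross-iteration write/read pair fails on any(Aw ∧ Ar), and a write followed by a read within one iteration passes.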

13 Example w(x) • two parallel threads each write element x of the array: Aw(x) = 1, Ar(x) = 0; Aw ∧ Ar = 0, so any(Aw ∧ Ar) = 0; Atw = 2, Atm = 1; since Atw ≠ Atm, parallelization fails

14 Example w(x) r(x) • one iteration writes element x and a different iteration reads it: Aw(x) = 1, Ar(x) = 1; any(Aw ∧ Ar) = 1; Atw = 1, Atm = 1; since any(Aw ∧ Ar) == 1, parallelization fails

15 Example w(x) r(x) • element x is again written in one iteration and read in another (the same cross-iteration pattern with the accesses in the other order): Aw(x) = 1, Ar(x) = 1; any(Aw ∧ Ar) = 1; Atw = 1, Atm = 1; since any(Aw ∧ Ar) == 1, parallelization fails

16 Example w(x) r(x) • a single iteration writes element x and then reads it: because A(i) was written in this iteration, Ar(i) is not set; Aw(x) = 1, Ar(x) = 0, so any(Aw ∧ Ar) = 0; Atw = 1, Atm = 1; since Atw == Atm, the loop is a doall

17 Example (figure)

18 Speculative Run-Time Parallelization in Software • implementation: in a DSM system, each processor allocates a private copy of the shadow arrays; the marking phase is performed locally; for the analysis phase, the private shadow arrays are merged in parallel • compiler integration: part of a front-end parallelizing compiler; loops to parallelize are chosen based on user feedback or heuristics about the previous success rate
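The analysis-phase merge of per-processor shadow arrays can be sketched as an element-wise OR plus a sum of the write counts — an illustrative Python fragment (the function and argument names are mine), shown sequentially where the paper performs the merge in parallel:

```python
def merge_shadow_arrays(private_Aw, private_Ar, private_Atw):
    """Combine the private shadow arrays each processor filled during
    its local marking phase: OR the per-element bits across processors
    and sum the per-processor write counts."""
    n = len(private_Aw[0])
    Aw = [int(any(p[i] for p in private_Aw)) for i in range(n)]
    Ar = [int(any(p[i] for p in private_Ar)) for i in range(n)]
    Atw = sum(private_Atw)
    return Aw, Ar, Atw
```

The merged Aw, Ar, and Atw then feed the same analysis checks as in the single-array formulation.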

19 Speculative Run-Time Parallelization in Software • improvements: privatization; iteration-wise vs. processor-wise testing • shortcomings: overhead of the analysis phase and of the extra instructions for marking; parallelization failure is detected only after the loop completes execution

20 privatization • example:
for i = 1 to N
  tmp = f(i)  /* f is some operation */
  A(i) = A(i) + tmp
enddo
• in privatization, each processor creates a private copy of the variables causing anti or output dependences (here, the scalar tmp)
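A runnable Python sketch of the same idea — the shared scalar tmp carries anti and output dependences across iterations, but making it private to each worker removes them, so the iterations can run in parallel (the function names `f` and `parallel_with_privatization` are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def f(i):
    """Stand-in for 'some operation' from the slide's loop."""
    return i * i

def parallel_with_privatization(A, nthreads=4):
    """Run A(i) = A(i) + f(i) over all i in parallel. The temporary is a
    local variable in each call, i.e. private per worker, so the anti and
    output dependences a shared scalar tmp would cause are eliminated."""
    def body(i):
        tmp = f(i)      # private copy of tmp: no cross-iteration dependence
        A[i] += tmp
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        list(ex.map(body, range(len(A))))
    return A

print(parallel_with_privatization([0, 0, 0, 0]))  # [0, 1, 4, 9]
```

Each iteration still writes only its own A(i), so the array accesses themselves are dependence-free and the loop is a doall once tmp is privatized.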

21 Speculative Run-Time Parallelization in Hardware • extend the cache coherence protocol hardware of a DSM multiprocessor with extra transactions to flag any cross-iteration data dependences • on detection, parallel execution is immediately aborted • extra state in the tags of all caches • fast memory in the directories

22 Speculative Run-Time Parallelization in Hardware • two sets of transactions: non-privatization algorithm; privatization algorithm

23 non-privatization algorithm • identifies as parallel those loops where each element of the array under test is either read-only or accessed by only one processor • a pattern where an element is read by several processors and later written by one is flagged as not parallel
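The acceptance condition can be sketched as a simple check over the access trace — a software simulation for illustration only (the hardware performs this on the fly via coherence transactions; `loop_is_parallel` and the record format are hypothetical):

```python
def loop_is_parallel(accesses):
    """Accept the pattern the non-privatization algorithm accepts: each
    element is either read-only, or accessed by a single processor.
    accesses: iterable of (proc, op, elem) records, op in {'r', 'w'}."""
    readers, writers = {}, {}
    for proc, op, elem in accesses:
        table = writers if op == 'w' else readers
        table.setdefault(elem, set()).add(proc)
    for elem, ws in writers.items():
        procs = ws | readers.get(elem, set())
        if len(procs) > 1:   # a written element touched by several processors,
            return False     # e.g. read by many and later written by one
    return True              # every element read-only or single-processor

print(loop_is_parallel([(0, 'r', 5), (1, 'r', 5)]))  # read-only element -> True
```

Note this check is processor-wise, matching the comparison slide later: it accepts loops the iteration-wise software test would also accept, without requiring static scheduling.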

24 non-privatization algorithm • the fast memory has three entries per element: ROnly, NoShr, First • these entries are also sent to the cache and stored in the tags of the corresponding cache line • the per-element bits in the tags of the different caches and directories are kept coherent

25 non-privatization algorithm (figure)

26 Speculative Run-Time Parallelization in Hardware • implementation needs three supports: storage for the access bits; logic to test and change the bits; a table in the directory to find the access bits for a given physical address • three parts are modified: primary cache, secondary cache, directory

27 implementation • primary cache: access bits stored in an SRAM table called the Access Bit Array; the algorithm's operations are determined by the Control input; the Test Logic performs the operations

28 implementation • secondary cache: also needs an Access Bit Array; on an L1 miss that hits in L2, L2 provides the data and access bits to L1; the access bits are sent directly to the test logic in L1; the bits generated are stored in the Access Bit Array of L1

29 implementation • directory: small dedicated memory for the access bits, with a lookup table; the access bits generated by the logic are sent to the processor; the transaction is overlapped with the memory and directory access

30 Evaluation • execution-driven simulations of a CC-NUMA shared-memory multiprocessor using Tango-lite • loops from applications in the Perfect Club set and one application from NCSA: Ocean, P3m, Adm, Track • four environments compared: Serial, Ideal, SW, HW • loops run with 16 processes (except Ocean, which runs with 8 processes)

31 Evaluation • loop execution speedup (figure)

32 Evaluation • slowdown due to failure (figure)

33 Evaluation • scalability (figure)

34 Software vs. Hardware • in hardware, failure to parallelize is detected on the fly • several operations are performed in hardware, which reduces overheads • the hardware scheme scales better with the number of processors • the hardware scheme has less space overhead

35 Software vs. Hardware • in hardware, the non-privatization test is processor-wise without requiring static scheduling • the hardware scheme can be applied more efficiently to pointer-based C code • however, the software implementation does not require any extra hardware!

