Thread-Level Speculation Karan Singh CS 612 2.23.2006.

Presentation transcript:

1 Thread-Level Speculation Karan Singh CS 612 2.23.2006

2 Introduction • extraction of parallelism at compile time is limited • Thread-Level Speculation (TLS) is a form of optimistic parallelization • TLS allows automatic parallelization by supporting thread execution without advance knowledge of any dependence violations

3 Introduction • Zhang et al.: extensions to the cache coherence protocol hardware to detect dependence violations • Pickett et al.: design for a Java-specific software TLS system that operates at the bytecode level

4 Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors — Ye Zhang, Lawrence Rauchwerger, Josep Torrellas

5 Outline • Loop parallelization basics • Speculative Run-Time Parallelization in Software • Speculative Run-Time Parallelization in Hardware • Evaluation and Comparison

6 Loop parallelization basics • a loop can be executed in parallel without synchronization only if the outcome is independent of the order of iterations • need to analyze data dependences across iterations: flow, anti, output • if no dependences: doall loop • if only anti or output dependences: privatization, scalar expansion, …
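To make the dependence categories concrete, here is a small Python sketch (not from the slides; the loops and the helper `has_flow_dependence` are illustrative) of the cross-iteration patterns and of checking for the flow dependences that rule out a doall loop:

```python
# Hypothetical loops illustrating the three cross-iteration dependence types:
#   flow (true):  for i in 1..N:   A[i] = A[i-1] + 1   (read of an earlier write)
#   anti:         for i in 0..N-1: A[i] = A[i+1] * 2   (write of a later read)
#   output:       for i in 0..N:   s = A[i]            (every iteration writes s)

def has_flow_dependence(reads, writes):
    """Check whether any iteration reads an element written by an
    earlier iteration (a flow dependence forbids a doall loop).
    reads/writes: per-iteration lists of element indices touched."""
    written_before = set()
    for it in range(len(reads)):
        if any(r in written_before for r in reads[it]):
            return True
        written_before.update(writes[it])
    return False

# A[i] = A[i-1] + 1 over iterations i = 1..4: iteration reads i-1, writes i
reads = [[0], [1], [2], [3]]
writes = [[1], [2], [3], [4]]
print(has_flow_dependence(reads, writes))  # True: not a doall
```

If each iteration instead read and wrote only its own element, the function would return False and the loop could run as a doall.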

7 Loop parallelization basics • to parallelize such a loop speculatively, we need: a way of saving and restoring state; a method to detect cross-iteration dependences

8 Speculative Run-Time Parallelization in Software • mechanism for saving/restoring state: before executing speculatively, save the state of the arrays that will be modified; dense access: save the whole array; sparse access: save individual elements; save only modified shared arrays; in all cases, if the loop is not found parallel after execution, the arrays are restored from their backups
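The save/restore mechanism can be sketched as follows — a minimal Python illustration, not the paper's implementation; `run_speculative_loop`, `parallel_body`, and `check_parallel` are hypothetical names, and the whole-array backup corresponds to the dense-access case:

```python
import copy

def run_speculative_loop(arrays, parallel_body, check_parallel):
    """Back up the shared arrays the speculative loop may modify, run it,
    and roll back from the backups if the loop was not actually parallel."""
    # dense access: save the whole array (a sparse-access variant would
    # save only the individual elements actually modified)
    backups = {name: copy.deepcopy(a) for name, a in arrays.items()}
    parallel_body(arrays)            # speculative parallel execution
    if not check_parallel():         # e.g. the LRPD analysis phase
        for name, saved in backups.items():
            arrays[name][:] = saved  # restore: loop was not a doall
        return False
    return True
```

On success the speculative updates are kept; on failure the arrays are bitwise identical to their pre-loop state and the loop can be re-executed serially.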

9 Speculative Run-Time Parallelization in Software • LRPD test to detect dependences: flags the existence of cross-iteration dependences; applied to those arrays whose dependences cannot be analyzed at compile time; two phases: Marking & Analysis

10 LRPD test • setup: back up A(1:s); initialize the shadow arrays Ar(1:s), Aw(1:s) to zero; initialize the scalar Atw to zero

11 LRPD test • marking: performed for each iteration during speculative parallel execution of the loop; write to A(i): set Aw(i); read from A(i): if A(i) has not been written in this iteration, set Ar(i); at the end of the iteration, count how many different elements of A have been written and add the count to Atw

12 LRPD test • analysis: performed after the speculative execution; compute Atm = number of non-zero Aw(i) over all elements i of the shadow array; if any(Aw(:) ∧ Ar(:)), the loop is not a doall: abort execution; else if Atw == Atm, the loop is a doall
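The two phases above can be sketched in Python — a sequential simulation of the test for illustration, assuming each iteration's accesses are given as an ordered list of ('r'/'w', element) pairs; the function name and input encoding are my own, not the paper's:

```python
def lrpd_test(n, iterations):
    """Simulate the LRPD marking and analysis phases for an array of
    size n. `iterations`: one ordered list of ('r'|'w', index) accesses
    per iteration of the speculatively parallelized loop."""
    Aw = [0] * n   # shadow array: element was written by some iteration
    Ar = [0] * n   # shadow array: element was read without a prior
                   # write in the same iteration
    Atw = 0        # running total of distinct elements written per iteration

    # marking phase: performed during each iteration
    for accesses in iterations:
        written_here = set()
        for op, i in accesses:
            if op == 'w':
                Aw[i] = 1
                written_here.add(i)
            elif i not in written_here:  # read not preceded by a write
                Ar[i] = 1                # in this same iteration
        Atw += len(written_here)

    # analysis phase: performed after the speculative execution
    Atm = sum(Aw)                        # number of non-zero Aw(i)
    if any(w and r for w, r in zip(Aw, Ar)):
        return False                     # cross-iteration dependence: not a doall
    return Atw == Atm                    # Atw == Atm => at most one writer per element

print(lrpd_test(1, [[('w', 0)], [('w', 0)]]))  # two iterations write x -> False
```

Run against the four examples on the following slides, this sketch reproduces their outcomes: two writers fail on Atw ≠ Atm, a cross-iteration write/read pair fails on any(Aw ∧ Ar), and a write followed by a read within one iteration passes.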

13 Example w(x) • two parallel threads each write element x of the array: Aw(x) = 1, Ar(x) = 0; Aw ∧ Ar = 0, so any(Aw ∧ Ar) = 0; Atw = 2, Atm = 1; since Atw ≠ Atm, parallelization fails

14 Example w(x) r(x) • one iteration writes element x and a different iteration reads it: Aw(x) = 1, Ar(x) = 1; any(Aw ∧ Ar) = 1; Atw = 1, Atm = 1; since any(Aw ∧ Ar) == 1, parallelization fails

15 Example w(x) r(x) • element x is again written in one iteration and read in another (the same cross-iteration pattern with the accesses in the other order): Aw(x) = 1, Ar(x) = 1; any(Aw ∧ Ar) = 1; Atw = 1, Atm = 1; since any(Aw ∧ Ar) == 1, parallelization fails

16 Example w(x) r(x) • a single iteration writes element x and then reads it: because A(i) was written in this iteration, Ar(i) is not set; Aw(x) = 1, Ar(x) = 0, so any(Aw ∧ Ar) = 0; Atw = 1, Atm = 1; since Atw == Atm, the loop is a doall

17 Example (figure)

18 Speculative Run-Time Parallelization in Software • implementation: in a DSM system, each processor allocates a private copy of the shadow arrays; the marking phase is performed locally; for the analysis phase, the private shadow arrays are merged in parallel • compiler integration: part of a front-end parallelizing compiler; loops to parallelize are chosen based on user feedback or heuristics about the previous success rate
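The analysis-phase merge of per-processor shadow arrays can be sketched as an element-wise OR plus a sum of the write counts — an illustrative Python fragment (the function and argument names are mine), shown sequentially where the paper performs the merge in parallel:

```python
def merge_shadow_arrays(private_Aw, private_Ar, private_Atw):
    """Combine the private shadow arrays each processor filled during
    its local marking phase: OR the per-element bits across processors
    and sum the per-processor write counts."""
    n = len(private_Aw[0])
    Aw = [int(any(p[i] for p in private_Aw)) for i in range(n)]
    Ar = [int(any(p[i] for p in private_Ar)) for i in range(n)]
    Atw = sum(private_Atw)
    return Aw, Ar, Atw
```

The merged Aw, Ar, and Atw then feed the same analysis checks as in the single-array formulation.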

19 Speculative Run-Time Parallelization in Software • improvements: privatization; iteration-wise vs. processor-wise testing • shortcomings: overhead of the analysis phase and of the extra instructions for marking; parallelization failure is detected only after the loop completes execution

20 privatization • example:
for i = 1 to N
  tmp = f(i)  /* f is some operation */
  A(i) = A(i) + tmp
enddo
• in privatization, each processor creates a private copy of the variables causing anti or output dependences (here, the scalar tmp)
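A runnable Python sketch of the same idea — the shared scalar tmp carries anti and output dependences across iterations, but making it private to each worker removes them, so the iterations can run in parallel (the function names `f` and `parallel_with_privatization` are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def f(i):
    """Stand-in for 'some operation' from the slide's loop."""
    return i * i

def parallel_with_privatization(A, nthreads=4):
    """Run A(i) = A(i) + f(i) over all i in parallel. The temporary is a
    local variable in each call, i.e. private per worker, so the anti and
    output dependences a shared scalar tmp would cause are eliminated."""
    def body(i):
        tmp = f(i)      # private copy of tmp: no cross-iteration dependence
        A[i] += tmp
    with ThreadPoolExecutor(max_workers=nthreads) as ex:
        list(ex.map(body, range(len(A))))
    return A

print(parallel_with_privatization([0, 0, 0, 0]))  # [0, 1, 4, 9]
```

Each iteration still writes only its own A(i), so the array accesses themselves are dependence-free and the loop is a doall once tmp is privatized.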

21 Speculative Run-Time Parallelization in Hardware • extend the cache coherence protocol hardware of a DSM multiprocessor with extra transactions to flag any cross-iteration data dependences • on detection, parallel execution is immediately aborted • extra state in the tags of all caches • fast memory in the directories

22 Speculative Run-Time Parallelization in Hardware • two sets of transactions: non-privatization algorithm; privatization algorithm

23 non-privatization algorithm • identifies as parallel those loops where each element of the array under test is either read-only or accessed by only one processor • a pattern where an element is read by several processors and later written by one is flagged as not parallel
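The acceptance condition can be sketched as a simple check over the access trace — a software simulation for illustration only (the hardware performs this on the fly via coherence transactions; `loop_is_parallel` and the record format are hypothetical):

```python
def loop_is_parallel(accesses):
    """Accept the pattern the non-privatization algorithm accepts: each
    element is either read-only, or accessed by a single processor.
    accesses: iterable of (proc, op, elem) records, op in {'r', 'w'}."""
    readers, writers = {}, {}
    for proc, op, elem in accesses:
        table = writers if op == 'w' else readers
        table.setdefault(elem, set()).add(proc)
    for elem, ws in writers.items():
        procs = ws | readers.get(elem, set())
        if len(procs) > 1:   # a written element touched by several processors,
            return False     # e.g. read by many and later written by one
    return True              # every element read-only or single-processor

print(loop_is_parallel([(0, 'r', 5), (1, 'r', 5)]))  # read-only element -> True
```

Note this check is processor-wise, matching the comparison slide later: it accepts loops the iteration-wise software test would also accept, without requiring static scheduling.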

24 non-privatization algorithm • the fast memory has three entries per element: ROnly, NoShr, First • these entries are also sent to the cache and stored in the tags of the corresponding cache line • the per-element bits in the tags of the different caches and directories are kept coherent

25 non-privatization algorithm (figure)

26 Speculative Run-Time Parallelization in Hardware • implementation needs three supports: storage for the access bits; logic to test and change the bits; a table in the directory to find the access bits for a given physical address • three parts are modified: primary cache, secondary cache, directory

27 implementation • primary cache: access bits stored in an SRAM table called the Access Bit Array; the algorithm's operations are determined by the Control input; the Test Logic performs the operations

28 implementation • secondary cache: also needs an Access Bit Array; on an L1 miss that hits in L2, L2 provides the data and access bits to L1; the access bits are sent directly to the test logic in L1; the bits generated are stored in the Access Bit Array of L1

29 implementation • directory: small dedicated memory for the access bits, with a lookup table; the access bits generated by the logic are sent to the processor; the transaction is overlapped with the memory and directory access

30 Evaluation • execution-driven simulations of a CC-NUMA shared-memory multiprocessor using Tango-lite • loops from applications in the Perfect Club set and one application from NCSA: Ocean, P3m, Adm, Track • four environments compared: Serial, Ideal, SW, HW • loops run with 16 processes (except Ocean, which runs with 8 processes)

31 Evaluation • loop execution speedup (figure)

32 Evaluation • slowdown due to failure (figure)

33 Evaluation • scalability (figure)

34 Software vs. Hardware • in hardware, failure to parallelize is detected on the fly • several operations are performed in hardware, which reduces overheads • the hardware scheme scales better with the number of processors • the hardware scheme has less space overhead

35 Software vs. Hardware • in hardware, the non-privatization test is processor-wise without requiring static scheduling • the hardware scheme can be applied more efficiently to pointer-based C code • however, the software implementation does not require any extra hardware!

