Automatic Thread Extraction with Decoupled Software Pipelining


1 Automatic Thread Extraction with Decoupled Software Pipelining
Ottoni, Rangan, Stoler and August [2005]

2 DSWP Threads are long-running, concurrently executing, non-speculative,
and truly decoupled. DSWP reduces inter-core communication and per-core resource requirements.

3 Hardware Additions Software Additions Other Techniques
Hardware: two new instructions (produce and consume). Software: DSWP is added to the back end of existing compilers. Other techniques are either too limiting in implementation (covering only a few classes of parallelization problems) or speculative, requiring hardware support.

4 Comparison with DOACROSS

5 DOACROSS: communication cost between cores lies on the critical path
DSWP: the critical path remains in core 0; there is no exposed communication latency because X will already have been executed on core 1.

6 Key Concepts The flow of data between cores is acyclic: the instructions in each recurrence are executed on the same core. Acyclic flows are decoupled with the help of inter-core queues used for communication, and different recurrences are assigned to different cores. This gives better overall utilization of the cores, better execution speed, and a limited number of restrictions (compared with DOACROSS and other techniques).

7 Communication and Synchronization
A produce instruction sends data to the other core; a consume instruction obtains data from the other core. Produces and consumes are matched in order. Data are stored in software-implemented inter-core queues. To avoid overhead, synchronization blocks only when enqueuing to a full queue or dequeuing from an empty queue. Queue latency was measured to be of no importance. A minimal sketch of these semantics follows.
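A minimal sketch of produce/consume over a software queue, with the paper's new instructions emulated as C functions. The names (queue_t, produce, consume), the sentinel, and the list-traversal loop are all illustrative, not taken from the paper:

```c
#include <pthread.h>
#include <stdio.h>

#define QSIZE 32              /* queue capacity */
#define DONE  (-1L)           /* illustrative end-of-stream sentinel */

typedef struct {
    long buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} queue_t;

/* blocks only when the queue is full (cf. slide 7) */
static void produce(queue_t *q, long v) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE) pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v; q->tail = (q->tail + 1) % QSIZE; q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* blocks only when the queue is empty */
static long consume(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0) pthread_cond_wait(&q->not_empty, &q->lock);
    long v = q->buf[q->head]; q->head = (q->head + 1) % QSIZE; q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}

typedef struct node { long data; struct node *next; } node_t;

static queue_t q = { .lock = PTHREAD_MUTEX_INITIALIZER,
                     .not_full = PTHREAD_COND_INITIALIZER,
                     .not_empty = PTHREAD_COND_INITIALIZER };

/* Thread 1: the pointer-chasing recurrence (the critical path) */
static void *traverse(void *arg) {
    for (node_t *p = arg; p; p = p->next)
        produce(&q, p->data);         /* send values downstream */
    produce(&q, DONE);
    return NULL;
}

int main(void) {
    node_t nodes[10];                 /* build a small list: 0..9 */
    for (int i = 0; i < 10; i++) {
        nodes[i].data = i;
        nodes[i].next = (i < 9) ? &nodes[i + 1] : NULL;
    }
    pthread_t t1;
    pthread_create(&t1, NULL, traverse, nodes);

    /* Thread 2 (here: main) consumes values and does the work */
    long sum = 0, v;
    while ((v = consume(&q)) != DONE)
        sum += v * v;                 /* off-recurrence computation */
    pthread_join(t1, NULL);
    printf("sum = %ld\n", sum);       /* 285 */
    return 0;
}
```

Thread 1 holds the pointer-chasing recurrence, while the consumer does the off-recurrence work, mirroring the acyclic producer-to-consumer flow of slide 6.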

8 DSWP Algorithm

9 Transformed Code Instructions are partitioned among multiple threads
Necessary instructions are replicated, and produce and consume instructions are inserted to handle dependences. Dependences flow in only one direction, from the producer thread to the consumer thread. The main thread produces loop live-in values for the other thread and consumes live-out values after the consumer loop terminates.

10 Transforming the original code

11 Identify data and control dependences and build the necessary arcs
SCCs (strongly connected components): sets of instructions that have mutual ("reverse") dependences between them, i.e. recurrences. All instructions in an SCC are executed in the same thread. The SCCs are then grouped into a sequence of partitions Pn. A toy SCC computation is sketched below.
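A toy illustration of the SCC idea, assuming a small dependence graph given as an adjacency matrix; this uses a simple transitive-closure test rather than the linear-time SCC algorithm a real compiler would use:

```c
#include <stdio.h>

#define N 4  /* instructions in the loop body */

int main(void) {
    /* reach[u][v] = 1 if there is a dependence arc u -> v.
       Example: 0 <-> 1 form a recurrence; 1 -> 2 -> 3 are acyclic. */
    int reach[N][N] = { {0,1,0,0}, {1,0,1,0}, {0,0,0,1}, {0,0,0,0} };

    /* Floyd-Warshall-style transitive closure */
    for (int k = 0; k < N; k++)
        for (int u = 0; u < N; u++)
            for (int v = 0; v < N; v++)
                if (reach[u][k] && reach[k][v]) reach[u][v] = 1;

    /* u and v share an SCC iff each reaches the other */
    for (int u = 0; u < N; u++) {
        printf("SCC of %d:", u);
        for (int v = 0; v < N; v++)
            if (u == v || (reach[u][v] && reach[v][u])) printf(" %d", v);
        printf("\n");
    }
    return 0;   /* prints: {0,1}, {0,1}, {2}, {3} */
}
```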

12 Rules for Grouping SCCs into Pn Partitions
1 <= n <= t, where t is the maximum number of simultaneous threads. Each vertex belongs to exactly one partition in P. Each dependence arc between partitions goes from Pi to Pj with i < j, so each partition can execute separately, forming a pipeline. A validity check is sketched below.
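A small sketch of the arc-direction rule, assuming instructions are already assigned to partitions (all names are mine); arcs within one partition are allowed, but no arc may point to a lower-numbered partition:

```c
#include <stdio.h>

typedef struct { int src, dst; } arc_t;

/* returns 1 if the assignment part[] forms a valid pipeline */
static int valid_pipeline(const int *part, const arc_t *arcs, int narcs) {
    for (int i = 0; i < narcs; i++)
        if (part[arcs[i].src] > part[arcs[i].dst])
            return 0;  /* backward arc would create a cross-thread cycle */
    return 1;
}

int main(void) {
    /* 4 instructions; SCC {0,1} in partition 0, {2} and {3} in partition 1 */
    int part[4] = { 0, 0, 1, 1 };
    arc_t arcs[] = { {0,1}, {1,0}, {1,2}, {2,3} };
    printf("valid: %d\n", valid_pipeline(part, arcs, 4));  /* prints 1 */
    return 0;
}
```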

13 Load Balancing (TPP) is NP-complete
Balancing is crucial for good performance: the thread pipeline is limited by the stage with the longest average latency. Solution: a heuristic method. Estimate the latency of each SCC and add consecutive SCC latencies until each of the n partitions (n = maximum threads) carries roughly an equal share. Partitions with the fewest outgoing dependences are preferred, to reduce communication and synchronization. A total estimate determines whether creating multiple threads is profitable. A greedy sketch follows.
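A greedy sketch of the balancing idea under stated assumptions: the latency values and the total/n target are illustrative, and the paper's preference for partitions with few outgoing dependences is omitted here:

```c
#include <stdio.h>

int main(void) {
    int lat[] = { 3, 1, 4, 1, 5, 9, 2 };   /* estimated SCC latencies */
    int nscc = 7, n = 3;                    /* n = max simultaneous threads */

    int total = 0;
    for (int i = 0; i < nscc; i++) total += lat[i];
    double target = (double)total / n;      /* ideal per-stage latency */

    int part = 0, acc = 0;
    for (int i = 0; i < nscc; i++) {
        /* start a new partition once this one has its fair share,
           keeping at least one partition index per remaining thread */
        if (acc >= target && part < n - 1) { part++; acc = 0; }
        acc += lat[i];
        printf("SCC %d (latency %d) -> partition %d\n", i, lat[i], part);
    }
    return 0;
}
```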

14 Splitting the Code Basic blocks (BBs) are computed by collecting all instructions that belong to Pi. Produce and consume instructions are added for data and control (branch-flag) dependences. BBs are created, branch targets are fixed, and instructions are replicated where necessary to respect control flow.

15 Extra Control Dependences
Loop-iteration control dependence: needed so the correct queues are used across iterations (a different queue may be used in each iteration). Approach: build a two-loop graph, extract the classic dependences, and coalesce the final graph.


17 Conditional control dependence: communicates the condition under which a dependence may occur, or under which a live-out value may originate. For the former, an extra dependence arc is inserted; for the latter, output dependences must not be ignored.


19 Measurements Performance is not sensitive to communication latency
and is fairly insensitive to queue size. DSWP can be used together with current ILP techniques for better results.


21 Improving Data Locality with Loop Transformations
McKinley, Carr and Tseng [1996]

22 Main Focus Processor speed increases faster than memory speed,
hence memory hierarchies with fast but small levels (caches). Goal of the model: enhance the spatial and temporal locality of programs by estimating the reuse of cache lines and transforming code toward a desirable loop organization.

23 Compiler Optimization
Improves the order of memory accesses to exploit all levels of the memory hierarchy, through loop permutation, fusion, distribution, and reversal. It is machine independent, requiring only knowledge of the cache line size.

24 Data Dependence δ = {δ1, ..., δk} is a hybrid distance/direction vector that represents a dependence between two array references, with entries ordered from the outermost to the innermost loop enclosing the references. Direction: a dependence from an earlier to a later iteration of a loop is positive. Distance: the minimum difference in iteration numbers between the pair of references at which the dependence occurs. A worked example follows.
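A worked example, assuming a simple two-deep nest (the array and bounds are mine):

```c
#include <stdio.h>
#define N 4

int main(void) {
    static double A[N][N];
    for (int i = 1; i < N; i++)
        for (int j = 1; j < N; j++)
            /* The value written to A[i][j] is read at iteration (i+1, j)
               via A[i-1][j]  -> distance vector (1, 0),
               and at iteration (i, j+1) via A[i][j-1] -> vector (0, 1).
               Both are lexicographically positive, as required. */
            A[i][j] = A[i - 1][j] + A[i][j - 1] + 1.0;
    printf("A[%d][%d] = %f\n", N - 1, N - 1, A[N - 1][N - 1]);  /* 19.0 */
    return 0;
}
```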

25 Measure of Locality The number of cache lines a loop nest accesses.
We assume there are no conflict or capacity cache misses within one iteration of the innermost loop. The algorithms RefGroup, RefCost, and LoopCost determine the total number of cache lines accessed when a candidate loop l is placed innermost.

26 RefGroup Calculates group reuse
Two references are in the same group if they exhibit temporal or spatial locality. Ref1 and Ref2 belong to the same group with respect to loop l if: (temporal) there is a dependence Ref1 δ Ref2, and either δ is a loop-independent dependence or δl is a small constant d (|d| <= 2) with all other entries zero; or (spatial) they refer to the same array and their first subscripts differ by at most d', where d' is less than or equal to the cache line size in array elements. A reference can belong to only one group. A small example follows.
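An illustrative grouping, assuming a cache line of 8 doubles and C's row-major layout (so the paper's "first subscript", written for column-major Fortran, corresponds to the last subscript here):

```c
#include <stdio.h>
#define N 8

int main(void) {
    static double A[N][N], B[N][N];
    double sum = 0.0;
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N - 1; j++)
            /* With respect to the candidate innermost loop j:
               - A[i][j] and A[i][j+1]: same array, contiguous subscript
                 differs by 1 <= line size -> one group (spatial).
               - B[i][j] and B[i-1][j]: their dependence has distance
                 (1,0), carried by i, so w.r.t. j they stay in separate
                 groups; w.r.t. i they would merge (|d| = 1 <= 2, all
                 other entries zero).
               RefGroup thus counts 3 groups here, not 4 references. */
            sum += A[i][j] + A[i][j + 1] + B[i][j] + B[i - 1][j];
    printf("%f\n", sum);
    return 0;
}
```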

27 RefGroup Example

28 RefCost Calculates the locality of a reference group for a candidate innermost loop l, i.e. the number of cache lines l uses: 1 for loop-invariant references; trip/(cls/stride) for consecutive references, where trip = (ub - lb + step)/step and stride is the step size of l multiplied by the coefficient of the loop index variable in the subscript; trip for non-consecutive references.

29 LoopCost Calculates the total number of cache lines accessed by all references when l is the innermost loop: it sums RefCost over all reference groups, then multiplies the result by the trip counts of all the remaining loops in the nest. Both formulas are sketched below.
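A hedged sketch of both formulas; the group classification, struct layout, and sample numbers are mine:

```c
#include <stdio.h>

typedef enum { INVARIANT, CONSECUTIVE, NONCONSECUTIVE } ref_kind;

typedef struct {
    ref_kind kind;
    double stride;   /* step of l times the index coefficient */
} group_t;

/* cache lines one reference group uses when l is innermost (slide 28) */
static double ref_cost(group_t g, double trip, double cls) {
    switch (g.kind) {
    case INVARIANT:   return 1.0;
    case CONSECUTIVE: return trip / (cls / g.stride);
    default:          return trip;            /* non-consecutive */
    }
}

/* total lines with l innermost: sum of group costs times the trip
   counts of all remaining loops in the nest (slide 29) */
static double loop_cost(const group_t *gs, int n, double trip_l,
                        double other_trips, double cls) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) sum += ref_cost(gs[i], trip_l, cls);
    return sum * other_trips;
}

int main(void) {
    /* e.g. a 2-deep nest, 100 iterations per loop, line = 8 elements */
    group_t gs[] = { { CONSECUTIVE, 1.0 }, { INVARIANT, 0.0 },
                     { NONCONSECUTIVE, 0.0 } };
    printf("LoopCost = %.0f\n", loop_cost(gs, 3, 100.0, 100.0, 8.0));
    /* (100/8 + 1 + 100) * 100 = 11350 */
    return 0;
}
```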


31 Loop Transformations Permutation: least-cost memory model
More reuse means lower LoopCost, so we want the most-reused loop innermost. Rank the loops by LoopCost from outermost to innermost so that LoopCost(l(i-1)) >= LoopCost(l(i)). Permute the distance/direction vectors to check whether the permutation placing the best loop innermost is legal; if it is not, the second best is tried.

32 Legal Permutations Test placing a loop l at position k+1, after outer loops {p1, ..., pk}, against each dependence vector: legal if the vector entry for l is positive or zero; legal if the entry for l is negative but some entry in {p1, ..., pk} is positive; illegal if the entry for l is negative and all entries in {p1, ..., pk} are zero. Equivalently, every permuted vector must stay lexicographically non-negative, as checked in the sketch below.
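A small sketch of the lexicographic check, with an illustrative dependence vector:

```c
#include <stdio.h>

#define DEPTH 2

/* a permutation is legal iff every dependence vector, read in the new
   loop order, has a non-negative first nonzero entry */
static int legal(const int order[DEPTH], int vecs[][DEPTH], int nvecs) {
    for (int v = 0; v < nvecs; v++) {
        for (int k = 0; k < DEPTH; k++) {
            int d = vecs[v][order[k]];  /* entry for loop at position k */
            if (d > 0) break;           /* carried here: rest is free  */
            if (d < 0) return 0;        /* negative leading entry      */
        }                               /* d == 0: check next position */
    }
    return 1;
}

int main(void) {
    int vecs[][DEPTH] = { {1, -1} };    /* e.g. A[i][j] = A[i-1][j+1] */
    int ij[DEPTH] = {0, 1}, ji[DEPTH] = {1, 0};
    printf("i,j order legal: %d\n", legal(ij, vecs, 1));  /* 1 */
    printf("j,i order legal: %d\n", legal(ji, vecs, 1));  /* 0 */
    return 0;
}
```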

33 Permutation Complexity
Tests the legality of n(n-1) loop permutations in the worst case. The most expensive part is evaluating the locality of each loop (LoopCost), so the algorithm is actually O(n) in the number of LoopCost invocations, where n is the number of loops in the nest.

34 Loop Reversal Reverses the order in which the iterations of a loop execute. It does not improve locality by itself; it serves only as an enabler for permutation. If Permute cannot legally place a loop in the desired position, it tests whether reversal is legal and whether it enables the desired permutation, as in the example below.
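An illustrative example (the loop and arrays are mine): the nest below has distance vector (1,-1), so plain interchange would yield (-1,1), which is illegal; reversing j negates its entry to give (1,1), after which interchange is legal:

```c
#include <stdio.h>
#include <string.h>
#define N 6

int main(void) {
    static double A[N][N], B[N][N];

    /* original order: dependence distance vector (1,-1) */
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N - 1; j++)
            A[i][j] = A[i - 1][j + 1] + 1.0;

    /* interchanged with j reversed: column j+1 is fully computed
       before column j, so the read of B[i-1][j+1] is safe */
    for (int j = N - 2; j >= 0; j--)
        for (int i = 1; i < N; i++)
            B[i][j] = B[i - 1][j + 1] + 1.0;

    printf("results match: %d\n", memcmp(A, B, sizeof A) == 0);  /* 1 */
    return 0;
}
```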

35 Loop Fusion Takes multiple loop nests and combines them into one
Legal only if no data dependences are reversed and the loops have the same number of iterations. Fusion improves locality directly by moving accesses to the same cache line into the same loop iteration. RefGroup and LoopCost are computed for both the fused and the non-fused code and compared. The two versions may have different memory orders, and fusion may enable permutation. An example follows.
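An illustrative before/after (arrays mine): fusion moves the second nest's reads of a[i] and b[i] into the iteration that produces b[i]:

```c
#include <stdio.h>
#define N 1000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) a[i] = i;

    /* before fusion: two nests, a[] traversed twice */
    for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
    for (int i = 0; i < N; i++) c[i] = a[i] + b[i];

    /* after fusion: legal here because b[i] is produced and consumed
       in the same iteration (loop-independent dependence, not reversed) */
    for (int i = 0; i < N; i++) {
        b[i] = 2.0 * a[i];
        c[i] = a[i] + b[i];
    }
    printf("c[N-1] = %f\n", c[N - 1]);  /* 3*(N-1) = 2997 */
    return 0;
}
```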

36 Fusion

37 Fusion Complexity Loop nests may be reordered by fusion, so LoopCost may need to be computed for every pair of candidate nests: O(m²), where m is the number of nests that are candidates for fusion. Restricted to adjacent loops only, it is O(m).

38 Loop Distribution Separates independent statements in a single loop into multiple loops with identical headers. Statements in a recurrence must be placed in the same loop (much like DSWP's SCCs). Distribution is applied at the finest granularity and tested to see whether it enables permutation. It is invoked only if memory order cannot be achieved on a nest and not all of the inner nests can be fused, and it is kept only if it combines with permutation to improve the actual LoopCost. An example follows.
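An illustrative before/after (arrays mine): the recurrence on a[] stays in one loop while the independent statement is distributed out, after which the second loop could be transformed freely:

```c
#include <stdio.h>
#define N 100

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) c[i] = i;

    /* original loop: recurrence and independent statement mixed */
    for (int i = 1; i < N; i++) {
        a[i] = a[i - 1] + 1.0;   /* recurrence: must stay together */
        b[i] = 2.0 * c[i];       /* independent of a[] */
    }

    /* after distribution: identical headers, finest granularity */
    for (int i = 1; i < N; i++) a[i] = a[i - 1] + 1.0;
    for (int i = 1; i < N; i++) b[i] = 2.0 * c[i];

    printf("a[N-1]=%f b[N-1]=%f\n", a[N - 1], b[N - 1]);
    return 0;
}
```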

39 Distribution Complexity
LoopCost is calculated for each individual partition: O(m), where m is the number of partitions created by distribution.

40 Compound Transformation
Combines all the previously mentioned algorithms into one

41 Compound Algorithm

42 Complexity O(nm²) in the number of LoopCost invocations, where n is the number of loops and m the number of adjacent loop nests; the m² term is due to fusion. Fusion and distribution are only occasionally applied.

43 Effect of Permutation

44 Speed-up

