Automatic Thread Extraction with Decoupled Software Pipelining
Ottoni, Rangan, Stoler and August [2005]

DSWP
- Long-running
- Concurrently executing
- Non-speculative
- Truly decoupled
- Reduces inter-core communication and per-core resources

Hardware Additions
- Two new instructions (produce and consume)
Software Additions
- DSWP is added in the back end of existing compilers
Other Techniques
- Either too limited in applicability (handle only a few classes of parallelization problems)
- Or speculative methods, which require hardware support

Comparison with DOACROSS

DOACROSS: communication cost between cores lies on the critical path.
DSWP: the critical path remains within core 0; communication latency is not exposed because X will already have been executed on core 1.

Key Concepts
- The flow of data between cores is acyclic: all instructions in a recurrence execute on the same core
- Acyclic flows are decoupled through inter-core queues used for communication
- Different recurrences are assigned to different cores
- Better overall utilization of cores
- Better execution speed
- Few restrictions compared with DOACROSS and other techniques

Communication and Synchronization
- A produce instruction sends data to the other core; a consume instruction obtains data from it
- Produce and consume instructions are matched in order
- Data are stored in software-implemented inter-core queues
- To avoid overhead, synchronization blocks only when enqueuing to a full queue or dequeuing from an empty queue (a software sketch of these semantics follows)
- Measurements show queue latency has little impact on performance
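The paper realizes produce and consume as new instructions backed by hardware queues; as a software analogue, here is a minimal sketch of a bounded blocking queue with the same semantics, blocking only when full or empty. The queue_t type and function names are illustrative, not from the paper.

```c
#include <pthread.h>
#include <stdint.h>

#define QSIZE 128   /* queue capacity: blocking happens only at full/empty */

typedef struct {
    uint64_t buf[QSIZE];
    int head, tail, count;
    pthread_mutex_t lock;
    pthread_cond_t not_full, not_empty;
} queue_t;

void queue_init(queue_t *q) {
    q->head = q->tail = q->count = 0;
    pthread_mutex_init(&q->lock, NULL);
    pthread_cond_init(&q->not_full, NULL);
    pthread_cond_init(&q->not_empty, NULL);
}

/* produce: enqueue a value for the other thread; blocks only if full */
void produce(queue_t *q, uint64_t v) {
    pthread_mutex_lock(&q->lock);
    while (q->count == QSIZE)
        pthread_cond_wait(&q->not_full, &q->lock);
    q->buf[q->tail] = v;
    q->tail = (q->tail + 1) % QSIZE;
    q->count++;
    pthread_cond_signal(&q->not_empty);
    pthread_mutex_unlock(&q->lock);
}

/* consume: dequeue the next value in matched order; blocks only if empty */
uint64_t consume(queue_t *q) {
    pthread_mutex_lock(&q->lock);
    while (q->count == 0)
        pthread_cond_wait(&q->not_empty, &q->lock);
    uint64_t v = q->buf[q->head];
    q->head = (q->head + 1) % QSIZE;
    q->count--;
    pthread_cond_signal(&q->not_full);
    pthread_mutex_unlock(&q->lock);
    return v;
}
```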

DSWP Algorithm

Transformed Code
- Instructions are partitioned among multiple threads
- The necessary instructions are replicated
- Produce and consume instructions are inserted to handle dependences
- Dependences go only in one direction, from the producer thread to the consumer thread
- The main thread produces loop live-in values for the other thread and consumes live-out values after the consumer loop terminates
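As a concrete illustration, consider the classic pointer-chasing loop below (the loop and all names are hypothetical, not taken from the paper's figures); the traversal recurrence goes to one thread and the loop body to another, reusing queue_t, produce and consume from the sketch above:

```c
typedef struct node { int val; struct node *next; } node;

extern queue_t q;        /* inter-core queue from the previous sketch */
extern node *list;
extern long sum;
extern int work(int);

/* Original loop: the traversal p = p->next is the recurrence;
   the work on p->val carries no cross-iteration dependence.
       for (node *p = list; p != NULL; p = p->next)
           sum += work(p->val);                                  */

/* Thread 1: runs the recurrence and forwards each node pointer. */
void *traverse(void *arg) {
    for (node *p = list; p != NULL; p = p->next)
        produce(&q, (uintptr_t)p);
    produce(&q, 0);               /* sentinel communicates loop exit */
    return NULL;
}

/* Thread 2: runs the loop body on the forwarded pointers. */
void *compute(void *arg) {
    uint64_t v;
    while ((v = consume(&q)) != 0)
        sum += work(((node *)(uintptr_t)v)->val);
    return NULL;
}
```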

Transforming the original code

- Identify data and control dependences and add the corresponding arcs
- SCCs: strongly connected components of the dependence graph, i.e. sets of instructions with mutual ("cyclic") dependences among them (recurrences)
- All instructions in an SCC are executed in the same thread
- The SCCs are then grouped into a sequence of partitions Pn

Rules for Grouping SCCs into Partitions Pn
- 1 <= n <= t, where t is the maximum number of simultaneous threads
- Each vertex belongs to exactly one partition in P
- Each dependence arc goes from Pi to Pj with i < j, so each partition can execute separately, forming a pipeline

Load Balancing (TPP)
- The thread-partitioning problem is NP-complete
- Crucial for good performance: the thread pipeline is limited by the stage with the longest average latency
- Solution: a heuristic method
- Estimate the latency E of each SCC
- Accumulate consecutive SCC latencies until a partition reaches its share of the total, for n the maximum number of threads (see the sketch below)
- Prefer partitions with the fewest outgoing dependences, to reduce communication and synchronization
- A total cost estimate determines whether creating multiple threads is profitable
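A minimal sketch of such a greedy cut, assuming the SCCs are already in pipeline (topological) order, that latency[] holds the per-SCC estimates, and that the per-stage target is total/nthreads; the paper's exact criterion also weighs outgoing dependences, which this sketch omits:

```c
/* Assigns SCCs 0..nsccs-1 (in pipeline order) to at most nthreads stages,
   cutting a new stage once its accumulated latency reaches the per-stage
   target. Returns the number of stages actually used. */
int partition(const long latency[], int nsccs, int nthreads, int stage_of[]) {
    long total = 0;
    for (int i = 0; i < nsccs; i++) total += latency[i];
    long target = total / nthreads;

    int stage = 0;
    long acc = 0;
    for (int i = 0; i < nsccs; i++) {
        stage_of[i] = stage;
        acc += latency[i];
        if (acc >= target && stage < nthreads - 1) {
            stage++;        /* cut here; later SCCs go to the next stage */
            acc = 0;
        }
    }
    return stage + 1;
}
```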

Splitting the Code
- Compute the basic blocks (BBs) by collecting all instructions that belong to Pi
- Insert produce and consume instructions for data and control (branch-flag) dependences
- Create the new BBs
- Fix branch targets and replicate instructions where necessary to respect control flow

Extra Control Dependences
- Loop-iteration control dependence: ensures the correct queues are used across iterations (a different queue may be used in each iteration)
- Handled by building a two-loop graph, extracting the classic dependences, and coalescing the final graph

- Conditional control dependence: communicates the condition under which a dependence may occur or from which a live-out value may originate
- For the former, an extra dependence arc is inserted; for the latter, output dependences are not ignored

Measurements
- Not sensitive to communication latency
- Fairly insensitive to queue size
- Can be used together with current ILP techniques for better results

Improving Data Locality with Loop Transformations
McKinley, Carr and Tseng [1996]

Main Focus
- Processor speed increases faster than memory speed
- Memory hierarchies pair fast but small memory levels (caches) with slower main memory
- Goal of the model: enhance the spatial and temporal locality of programs by estimating cache-line reuse and transforming code toward a desirable loop organization

Compiler Optimization
- Improves the order of memory accesses to exploit all levels of the memory hierarchy, through loop permutation, fusion, distribution and reversal
- Machine independent: requires only knowledge of the cache line size

Data Dependence
- δ = {δ1 ... δk} is a hybrid distance/direction vector representing the dependence between two array references, with entries ordered from the outermost to the innermost loop enclosing the references
- Direction: positive when the source iteration precedes the sink iteration in that loop
- Distance: the minimum difference in iteration numbers between the pair of references when the dependence occurs (see the example below)
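For instance, in the illustrative nest below, the read A[i-1][j] at iteration (i, j) uses the value written at iteration (i-1, j):

```c
enum { N = 100 };
double A[N][N];

/* Source (i-1, j) precedes sink (i, j):
   distance vector (1, 0), direction vector (<, =). */
void dependence_example(void) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = A[i-1][j] + 1.0;
}
```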

Measure of Locality
- The number of cache lines a loop nest accesses
- Assumption: no conflict or capacity cache misses occur within one iteration of the innermost loop
- The algorithms RefGroup, RefCost and LoopCost determine the total number of cache lines accessed when a candidate loop l is placed innermost

RefGroup
- Calculates group reuse: two references are in the same group if they exhibit temporal or spatial locality
- Ref1 and Ref2 belong to the same group with respect to loop l if:
- (temporal) there is a dependence Ref1 δ Ref2 and either δ is loop-independent, or δl is a small constant d (|d| <= 2) and all other entries are zero
- (spatial) they refer to the same array and differ only in the first subscript, by at most d', where d' is less than or equal to the cache line size in array elements
- A reference can belong to only one group

RefGroup Example
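An illustrative instance (in row-major C the spatial condition applies to the last subscript, whereas the paper's column-major Fortran uses the first):

```c
#define N 100
double A[N][N], B[N][N], C[N][N];

/* With a cache line of at least 2 doubles:
     A[i][j] and A[i][j+1] : same group (spatial: adjacent in one line)
     B[i][j] and B[i-1][j] : same group w.r.t. loop i (temporal, d = 1)
     A[i][j] and B[i][j]   : different groups (different arrays)      */
void refgroup_example(void) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N - 1; j++)
            C[i][j] = A[i][j] + A[i][j+1] + B[i][j] + B[i-1][j];
}
```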

RefCost
- Calculates the locality of loop l, i.e. the number of cache lines l uses (a worked instance follows):
- 1 for loop-invariant references
- trip/(cls/stride) for consecutive references, where trip = (ub - lb + step)/step, cls is the cache line size in array elements, and stride is the step of l multiplied by the coefficient of the loop index variable in the subscript
- trip for non-consecutive references
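A worked instance with illustrative numbers: for a loop with lb = 1, ub = 100 and step = 1, trip = (100 - 1 + 1)/1 = 100. With a cache line of cls = 8 array elements, a loop-invariant reference costs 1 line, a consecutive stride-1 reference costs 100/(8/1) = 12.5 lines, and a non-consecutive reference costs the full trip of 100 lines.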

LoopCost
- Calculates the total number of cache lines accessed by all references when l is the innermost loop
- Sums RefCost over all reference groups, then multiplies the result by the trip counts of all remaining loops (see the example below)
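As an illustration in row-major C (the paper works the analogous column-major Fortran example), with trip count n per loop and cls elements per cache line:

```c
enum { n = 512 };
static double A[n][n], B[n][n], C[n][n];

/* LoopCost with each loop placed innermost:
     j innermost: C consec (n/cls) + A invariant (1) + B consec (n/cls)
                  -> n^2 (2n/cls + 1)                        [cheapest]
     k innermost: C invariant (1) + A consec (n/cls) + B non-consec (n)
                  -> n^2 (1 + n/cls + n)
     i innermost: C non-consec (n) + A non-consec (n) + B invariant (1)
                  -> n^2 (2n + 1)                            [costliest]
   Memory order (highest cost outermost): i, k, j. */
void matmul(void) {
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            for (int j = 0; j < n; j++)
                C[i][j] += A[i][k] * B[k][j];
}
```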

Loop Transformations
Permutation: the least-cost memory model
- More reuse means a lower LoopCost, so we want the most-reused loop innermost
- Rank the loops by LoopCost from outermost to innermost, so that LoopCost(l_{i-1}) >= LoopCost(l_i)
- Permute the distance/direction vectors to check whether placing the best loop innermost is legal
- If it is not legal, the second-best loop is tried

Legal Permutations
Placing loop l at position k+1, given a single dependence whose entries at the already-placed positions are p1, ..., pk (a checker is sketched below):
- Legal if the vector entry for l is positive or zero
- Legal if the entry for l is negative and some entry among p1, ..., pk is positive
- Illegal if the entry for l is negative and p1, ..., pk are all zero
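A minimal legality check over distance vectors, applying the rule above position by position; perm[k] giving the original index of the loop placed at position k is an assumption of this hypothetical helper, not the paper's code:

```c
/* Returns 1 if every permuted distance vector stays lexicographically
   positive (leftmost nonzero entry > 0), i.e. the permutation is legal.
   dv[d] is one distance vector of length depth. */
int permutation_is_legal(int **dv, int ndeps, int depth, const int *perm) {
    for (int d = 0; d < ndeps; d++) {
        for (int k = 0; k < depth; k++) {
            int e = dv[d][perm[k]];
            if (e > 0) break;      /* positive entry: this vector is fine */
            if (e < 0) return 0;   /* negative before any positive: illegal */
        }
    }
    return 1;
}
```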

Permutation Complexity
- Tests the legality of n(n-1) loop permutations in the worst case
- The most expensive part is evaluating the locality of a loop (LoopCost), so the algorithm is O(n) in the number of LoopCost invocations, where n is the number of loops in the nest

Loop Reversal
- Reverses the order in which the iterations of a loop execute
- Does not improve locality by itself; it serves only as an enabler for permutation (see the example below)
- If Permute cannot legally place a loop in the desired position, it tests whether reversal is legal and whether reversal enables the desired permutation
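For example, a dependence with distance vector (1, -1) makes interchange alone illegal, since it would yield (-1, 1); reversing the inner loop negates its entry to (1, 1), after which interchange is legal. An illustrative nest:

```c
#define N 100
double A[N][N];

/* Original: A[i][j] depends on A[i-1][j+1]; distance vector (1, -1).
   Interchanging i and j directly would give (-1, 1): illegal. */
void original(void) {
    for (int i = 1; i < N; i++)
        for (int j = 0; j < N - 1; j++)
            A[i][j] = A[i-1][j+1];
}

/* After reversing j (vector becomes (1, 1)) and interchanging: legal,
   since the source iteration still executes before the sink. */
void reversed_interchanged(void) {
    for (int j = N - 2; j >= 0; j--)
        for (int i = 1; i < N; i++)
            A[i][j] = A[i-1][j+1];
}
```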

Loop Fusion
- Takes multiple loop nests and combines them into one
- Legal only if no data dependences are reversed; the loops must also have the same number of iterations
- Improves locality directly by moving accesses to the same cache line into the same loop iteration
- Compute RefGroup and LoopCost for the fused and non-fused versions and compare them
- The memory order of the fused nest may differ from the originals'
- May enable permutation

Fusion
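An illustrative case (array names are hypothetical):

```c
#define N 1000
double a[N], b[N], c[N];

/* Before fusion: each loop streams through a[] separately. */
void unfused(void) {
    for (int i = 0; i < N; i++) b[i] = a[i] + 1.0;
    for (int i = 0; i < N; i++) c[i] = a[i] * 2.0;
}

/* After fusion: both uses of a[i] occur in the same iteration,
   so each cache line of a[] is brought in only once. */
void fused(void) {
    for (int i = 0; i < N; i++) {
        b[i] = a[i] + 1.0;
        c[i] = a[i] * 2.0;
    }
}
```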

Fusion Complexity
- Fusion may reorder loop nests, so LoopCost may have to be computed for every pair of candidate nests: O(m^2), where m is the number of nests that are candidates for fusion
- If only adjacent loops are considered: O(m)

Loop Distribution
- Separates independent statements in a single loop into multiple loops with identical headers (an example follows)
- Statements in a recurrence must be placed in the same loop (as in DSWP)
- Distributes to the finest granularity and tests whether that enables permutation
- Invoked only if memory order cannot be achieved on a nest and not all of the inner nests can be fused
- Performed only if, combined with permutation, it improves the actual LoopCost
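An illustrative case: the two independent statements below prefer different memory orders in row-major C, and distribution lets each resulting nest be permuted on its own:

```c
#define N 100
double a[N][N], c[N][N];

/* Before: one nest; a[i][j] wants j innermost, c[j][i] wants i innermost. */
void combined(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            a[i][j] = a[i][j] + 1.0;
            c[j][i] = c[j][i] * 2.0;
        }
}

/* After distribution, each nest gets its own best memory order. */
void distributed(void) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = a[i][j] + 1.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            c[j][i] = c[j][i] * 2.0;
}
```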

Distribution Complexity
- LoopCost is calculated for each individual partition: O(m), where m is the number of partitions created by distribution

Compound Transformation
Combines all the previously mentioned algorithms into one

Compound Algorithm

Complexity
- O(nm^2) in the number of LoopCost invocations, where n is the number of loops and m the number of adjacent loop nests
- The m^2 factor is due to fusion
- Fusion and distribution are only occasionally applied

Effect of Permutation

Speed-up