Optimizing Loop Performance for Clustered VLIW Architectures
by Yi Qian (Texas Instruments)
Co-authors: Steve Carr (Michigan Technological University), Phil Sweany (Texas Instruments)

Clustered VLIW Architecture

Motivation
    Clustered VLIW architectures have been adopted to improve ILP while keeping the port requirements of the register files low.
    The compiler must
        expose maximal parallelism,
        maintain minimal communication overhead.
    High-level optimizations can improve loop performance on clustered VLIW machines.

Background
Software Pipelining -- modulo scheduling
    Achieve ILP by overlapping the execution of different loop iterations.
    Initiation Interval (II)
        ResII -- constraints from the machine resources.
        RecII -- constraints from the dependence recurrences.
        MinII = max(ResII, RecII)
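
The bound MinII = max(ResII, RecII) can be illustrated with a small C sketch. The slide does not define how ResII and RecII are computed, so the formulas below are the usual textbook modulo-scheduling definitions and are an assumption here: ResII is the largest ceil(ops / units) over the resource classes, and RecII is the largest ceil(latency / distance) over the dependence cycles. All type and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One class of functional unit: operations issued on it per iteration,
 * and how many such units the machine has. (Hypothetical type.) */
struct resource {
    int ops;
    int units;
};

/* One dependence cycle: summed operation latency around the cycle and
 * summed iteration distance of its loop-carried edges. (Hypothetical type.) */
struct recurrence {
    int latency;
    int distance;
};

/* Ceiling of a / b for a >= 0 and b > 0. */
static int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0);
    return (a + b - 1) / b;
}

/* ResII: the most heavily used resource class bounds the II from below. */
int res_ii(const struct resource *r, size_t n)
{
    int ii = 1; /* an iteration needs at least one cycle */
    for (size_t k = 0; k < n; ++k) {
        int need = ceil_div(r[k].ops, r[k].units);
        if (need > ii)
            ii = need;
    }
    return ii;
}

/* RecII: the slowest dependence cycle bounds the II from below.
 * Returns 0 when the loop has no recurrences. */
int rec_ii(const struct recurrence *c, size_t n)
{
    int ii = 0;
    for (size_t k = 0; k < n; ++k) {
        int need = ceil_div(c[k].latency, c[k].distance);
        if (need > ii)
            ii = need;
    }
    return ii;
}

/* MinII = max(ResII, RecII), as stated on the slide. */
int min_ii(int res, int rec)
{
    return res > rec ? res : rec;
}
```

For instance, a loop with 4 memory operations on one memory unit and a recurrence of latency 5 over distance 1 would be bounded by the recurrence in this model.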

Loop Transformations
Scalar Replacement
    Replace array references with scalar variables.
    Improve register usage.

    Original loop:
        for (i=0; i<n; ++i)
            for (j=0; j<n; ++j)
                a[i] = a[i] + b[j] * x[i][j];

    After scalar replacement:
        for (i=0; i<n; ++i) {
            t = a[i];
            for (j=0; j<n; ++j)
                t = t + b[j] * x[i][j];
            a[i] = t;
        }
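
A compilable version of the two loops makes it easy to check that scalar replacement preserves the result. This is only a sketch: the fixed size N, the int element type, and the function names are illustrative assumptions, and the transformation is valid only if a does not alias b or x.

```c
#include <assert.h>

#define N 4 /* illustrative size, not from the slides */

/* Original form: a[i] is loaded and stored on every inner iteration. */
void accumulate_original(int a[N], const int b[N], const int x[N][N])
{
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            a[i] = a[i] + b[j] * x[i][j];
}

/* Scalar-replaced form: a[i] lives in the scalar t (a register candidate)
 * for the whole inner loop and is stored once at the end. */
void accumulate_scalar_replaced(int a[N], const int b[N], const int x[N][N])
{
    for (int i = 0; i < N; ++i) {
        int t = a[i];
        for (int j = 0; j < N; ++j)
            t = t + b[j] * x[i][j];
        a[i] = t;
    }
}
```

The inner loop of the second version performs two array loads per iteration instead of three loads and a store, which is where the reduction in memory traffic comes from.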

Loop Transformations
Unrolling
    Reduce inter-iteration overhead.
    Enlarge loop body size.
Unroll-and-jam
    Balance the computation and memory-access requirements.
    Improve uMinII (MinII / unrollAmount).

    Example (1 computational unit, 1 memory unit)

    Original loop, uMinII = 4:
        for (i=1; i<=2*n; ++i)
            for (j=1; j<=n; ++j)
                a[i][j] = a[i][j] + b[j] * c[j];

    Unroll-and-jammed loop, uMinII = 3:
        for (i=1; i<=2*n; i+=2)
            for (j=1; j<=n; ++j) {
                a[i][j] = a[i][j] + b[j] * c[j];
                a[i+1][j] = a[i+1][j] + b[j] * c[j];
            }
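
The uMinII values on this slide can be reproduced with a short C sketch. The operation counts are this sketch's own reading of the example, not figures stated on the slide: every array load or store is one memory operation, every add or multiply is one computational operation, and b[j] * c[j] (with its two loads) is shared between the jammed copies. The inner loop has no recurrence, so MinII is taken to equal ResII. Names are hypothetical.

```c
#include <assert.h>

/* Per-iteration operation counts of a loop body (hypothetical type). */
struct body_counts {
    int mem_ops;  /* array loads and stores */
    int comp_ops; /* adds and multiplies */
};

/* ResII on a machine with exactly one memory unit and one computational
 * unit: the busier of the two units determines the bound. */
int res_ii_one_unit_each(struct body_counts b)
{
    return b.mem_ops > b.comp_ops ? b.mem_ops : b.comp_ops;
}

/* uMinII = MinII / unrollAmount: cycles per original iteration. */
double u_min_ii(int min_ii, int unroll_amount)
{
    assert(min_ii > 0 && unroll_amount > 0);
    return (double)min_ii / unroll_amount;
}
```

Under these assumptions the original body has 4 memory operations and 2 computational operations (uMinII = 4 / 1), while the jammed body has 6 memory operations and 3 computational operations for two original iterations (uMinII = 6 / 2).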

Loop Transformations
Unroll-and-jam/unrolling
    Generate intercluster parallelism.

    Without a loop-carried dependence:
        for (i=0; i<2*n; ++i)
            a[i] = a[i] + 1;

        for (i=0; i<2*n; i+=2) {
            /* cluster 0 */ a[i] = a[i] + 1;
            /* cluster 1 */ a[i+1] = a[i+1] + 1;
        }

    With a loop-carried dependence:
        for (i=0; i<2*n; ++i)
            a[i] = a[i-1] + 1;

        for (i=0; i<2*n; i+=2) {
            /* cluster 0 */ a[i] = a[i-1] + 1;
            /* cluster 1 */ a[i+1] = a[i] + 1;
        }

Loop Transformations
Loop Alignment
    Remove loop-carried dependences.
    Alignment conflicts
        Used to determine intercluster communication cost.

    Original loop:
        for (i=1; i<n; ++i) {
            a[i] = b[i] + c[i];
            x[i] = a[i-1] * q;
        }

    Aligned loop:
        x[1] = a[0] * q;
        for (i=1; i<n-1; ++i) {
            a[i] = b[i] + c[i];
            x[i+1] = a[i] * q;
        }
        a[n-1] = b[n-1] + c[n-1];

    Loops with alignment conflicts:
        for (i=1; i<n; ++i) {
            a[i] = b[i] + q;
            c[i] = a[i-1] + a[i-2];
        }

        for (i=1; i<n; ++i)
            a[i] = a[i-1] + b[i];
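
The first pair of loops on this slide can be written as two C functions to confirm that alignment leaves the results unchanged. This is a sketch under stated assumptions: the int element type and the function names are illustrative, the arrays must not overlap, and the aligned form needs n >= 2 because it peels one statement before and one after the loop.

```c
#include <assert.h>

/* Original loop: x[i] uses a[i-1], a value produced by the previous
 * iteration, so the dependence is loop-carried. */
void original_loop(int n, int *a, const int *b, const int *c, int *x, int q)
{
    for (int i = 1; i < n; ++i) {
        a[i] = b[i] + c[i];
        x[i] = a[i - 1] * q;
    }
}

/* Aligned loop: the second statement is shifted by one iteration so that
 * x[i+1] uses a[i], produced in the same iteration. */
void aligned_loop(int n, int *a, const int *b, const int *c, int *x, int q)
{
    assert(n >= 2);
    x[1] = a[0] * q;                 /* peeled first use */
    for (int i = 1; i < n - 1; ++i) {
        a[i] = b[i] + c[i];
        x[i + 1] = a[i] * q;         /* dependence is now loop-independent */
    }
    a[n - 1] = b[n - 1] + c[n - 1];  /* peeled last definition */
}
```

After alignment, different iterations no longer exchange values, so unrolled copies placed on different clusters would not need to communicate for this dependence.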

Related Work
Partitioning Problem
    Ellis -- BUG
    Capitanio et al. -- LC-VLIW
    Nystrom et al. -- cluster assignment & software pipelining
    Ozer et al. -- UAS
    Sanchez et al. -- unified method
    Hiser et al. -- RCG
    Aleta et al. -- pseudo-scheduler

Loop Transformations
Scalar Replacement
    Callahan et al. -- pipelined architectures
    Carr, Kennedy -- general algorithm
    Duesterwald -- data flow framework
Loop Alignment
    Allen et al. -- shared-memory machines
Unrolling/Unroll-and-jam
    Callahan et al. -- pipelined architectures
    Carr, Kennedy -- ILP
    Carr, Guan -- linear algebra
    Carr -- cache, software pipelining
    Sarkar -- ILP, IC
    Sanchez et al. -- clustered machines
    Huang et al. -- clustered machines
    Shin et al. -- superword register files

Optimization Strategy
    Source Code
    -> Unroll-and-jam/Unrolling
    -> Scalar Replacement
    -> Intermediate Code Generator
    -> Data-flow Optimization
    -> Value Cloning
    -> Register Partitioning
    -> Software Pipelining
    -> Assembly Code Generator
    -> Target Code

Our Method
    Picking loops to unroll
    Computing uMinII
    Computing register pressure (see paper)
    Determining unroll amounts

Picking Loops to Unroll
    : carries the most dependences that are amenable to scalar replacement.
    : contains the fewest alignment conflicts.
Computing uMinII
    uRecII does not increase
    uResII where

Computing Communication Cost for Unrolled Loops
Intercluster Copies
    single loop
        innermost loop is unrolled
            invariant dep.
            variant dep.
        innermost loop is not unrolled
            invariant dep.
            variant dep.
    multiple loops (see paper)

Unrolling a Single Loop
Variant Dep.
    [Figure: dependence pattern across the clusters before and after unrolling, with formulas for copies per cluster, sinks of the new dependences, and total costs]

Unrolling a Single Loop
Variant Dep.
Special Cases

    if, then
    4 clusters:
        for (i=0; i<4*n; i+=4) {
            a[i] = a[i-4];
            a[i+1] = a[i-3];
            a[i+2] = a[i-2];
            a[i+3] = a[i-1];
        }

    if, then
    2 clusters:
        for (i=0; i<6*n; i+=6) {
            a[i] = a[i-2];
            a[i+1] = a[i-1];
            a[i+2] = a[i];
            a[i+3] = a[i+1];
            a[i+4] = a[i+2];
            a[i+5] = a[i+3];
        }
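
Why these two loops need no intercluster communication can be checked with a small counting sketch in C. It is not the cost model from the paper. It assumes a single statement of the form a[i] = f(a[i - distance]), and it assumes that unrolled copy k is placed on cluster k mod clusters (a round-robin assignment chosen here for illustration). The function names are hypothetical.

```c
#include <assert.h>

/* Cluster of unrolled copy k under the assumed round-robin assignment. */
static int cluster_of(int k, int clusters)
{
    return k % clusters;
}

/* Number of unrolled copies whose input value is produced on a different
 * cluster, for a loop a[i] = f(a[i - distance]) unrolled `unroll` times. */
int cross_cluster_deps(int unroll, int distance, int clusters)
{
    assert(unroll > 0 && distance >= 0 && clusters > 0);
    int crossing = 0;
    for (int k = 0; k < unroll; ++k) {
        /* Copy k reads the element written by copy (k - distance) mod unroll,
         * possibly in an earlier iteration of the unrolled loop. */
        int src = ((k - distance) % unroll + unroll) % unroll;
        if (cluster_of(src, clusters) != cluster_of(k, clusters))
            ++crossing;
    }
    return crossing;
}
```

Under these assumptions, cross_cluster_deps(4, 4, 4) and cross_cluster_deps(6, 2, 2) both return 0, matching the two loops on this slide, whereas a distance-1 recurrence unrolled twice over two clusters, cross_cluster_deps(2, 1, 2), returns 2.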

Unrolling a Single Loop
Invariant Dep.
    references can be eliminated by scalar replacement.
    clusters need a copy operation.

    Original loop:
        for (j=1; j<=4*n; ++j)
            for (i=1; i<=m; ++i)
                a[j][i] = a[j][i-1] + b[i];

    Unrolled loop:
        for (j=1; j<=4*n; j+=4)
            for (i=1; i<=m; ++i) {
                t = b[i];
                a[j][i] = a[j][i-1] + t;
                a[j+1][i] = a[j+1][i-1] + t;
                a[j+2][i] = a[j+2][i-1] + t;
                a[j+3][i] = a[j+3][i-1] + t;
            }

Determining Unroll Amounts
    Integer optimization problem
    Exhaustive search
    Heuristic method

Experimental Results
Benchmarks
    119 DSP loops from TI's benchmark suite
    DSP applications: FIR filter, correlation, Reed-Solomon decoding, lattice filter, LMS filter, etc.
Architectures
    URM, a simulated architecture
        8 functional units -- 2 clusters, 4 clusters (1 copy unit)
        16 functional units -- 2 clusters, 4 clusters (2 copy units)
    TMS320C64x

Unroll-and-jam/unrolling is applicable to 71 loops.
URM Speedups: Transformed vs. Original

    Width                 8       8       16      16
    Clusters              2       4       2       4
    Speedup (harmonic)    1.39    1.68    1.4     1.43
    Speedup (median)      1.521.781.6
    Improved              50      69      50      51
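
The "harmonic" rows in the results are taken here to be the harmonic mean of the per-loop speedups. A minimal C sketch of that summary statistic follows, with a hypothetical function name. The slides do not show how the authors computed it, so this is the standard definition rather than their code.

```c
#include <assert.h>
#include <stddef.h>

/* Harmonic mean of n positive speedups: n / sum(1 / speedup[k]). */
double harmonic_mean(const double *speedup, size_t n)
{
    assert(n > 0);
    double sum_inv = 0.0;
    for (size_t k = 0; k < n; ++k) {
        assert(speedup[k] > 0.0);
        sum_inv += 1.0 / speedup[k];
    }
    return (double)n / sum_inv;
}
```

The harmonic mean never exceeds the arithmetic mean, so a few loops with large speedups cannot mask loops that slow down. For example, made-up speedups of 1.0, 4.0 and 4.0 give a harmonic mean of 2.0 against an arithmetic mean of 3.0.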

Our Algorithm vs. Fixed Unroll Amounts
Using a fixed unroll amount may cause performance degradation when communication costs are dominant.

    Width                          8       8       16      16
    Clusters                       2       4       2       4
    Speedup (harmonic)             1       0.91    1       1.07
    Speedup (harmonic, fixed)      0.88    0.84    0.88    0.95
    # of loops                     94921

C64x Results
TMS320C64x Speedups: Unrolled vs. Original

    Speedup (harmonic)    1.7
    Speedup (median)      2
    Improved              55

Accuracy of Communication Cost Model
    Compare the number of predicted data transfers against the actual number of intercluster dependences found in the transformed loops.
        2-cluster: 66 exact predictions
        4-cluster: 64 exact predictions

Conclusion
    Proposed a communication cost model and an integer-optimization problem for predicting the performance of unrolled loops.
    70%-90% of the 71 loops improve, with speedups of 1.4-1.7.
    High-level transformations should be an integral part of compilation for clustered VLIW machines.
