1 Increasing and Detecting Memory Address Congruence
Sam Larsen, Emmett Witchel, Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology

6 The Congruence Property
    int a[M], b[n];
    for (i=0; i<n; i++) {
        a[b[i]*8] = 0;
    }
Line offsets: 0 4 8 12 16 20 24 28
Congruent with offset of 0

7 The Congruence Property
    int a[M];
    for (i=0; i<n; i++) {
        a[16*i+2] = 0;
    }
Line offsets: 0 4 8 12 16 20 24 28
Congruent with offset of 8

8 The Congruence Property
    int a[M];
    for (i=0; i<n; i++) {
        a[15*i+3] = 0;
    }
Line offsets: 0 4 8 12 16 20 24 28
Not congruent (32-byte line)
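The loops on these slides differ only in the index expression. As a minimal sketch, assuming the array base sits on a 32-byte line boundary and `int` is 4 bytes, the hypothetical helper `affine_line_offset` (not from the slides) checks whether every access `a[stride*i + c]` lands on the same line offset:

```c
#define LINE_SIZE 32  /* cache line size in bytes, as on the slides */

/* Hypothetical helper, not part of the presentation.
 * For an access a[stride*i + c] with elem_size-byte elements and a
 * line-aligned array base, return the common line offset if every
 * iteration hits the same offset, or -1 if the offset varies with i. */
int affine_line_offset(long stride, long c, long elem_size)
{
    long step  = stride * elem_size;  /* byte distance between iterations */
    long first = c * elem_size;       /* byte offset of iteration 0 */

    if (step % LINE_SIZE != 0)
        return -1;                    /* offset changes between iterations */
    return (int)(((first % LINE_SIZE) + LINE_SIZE) % LINE_SIZE);
}
```

With 4-byte elements, `a[16*i+2]` advances 64 bytes per iteration and starts at byte 8, so the offset is fixed; `a[15*i+3]` advances 60 bytes, so it is not. The `a[b[i]*8]` case works the same way, with `b[i]` playing the role of `i` and a stride of 8.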

9 Outline
Uses of congruence information
Congruence detection algorithm
Congruence-increasing transformations
Results
Related work

10 SIMD Compilation [PLDI ’00]
Multimedia extensions offer wide mem ops
  – Motorola’s AltiVec
  – Intel’s MMX/SSE
Automatic SIMD parallelization
  – Multiple mem ops → single wide mem op
128-bit loads/stores must be 128-bit aligned
  – SSE: 6-9 cycle penalty for unaligned accesses
  – AltiVec: All wide mem ops have to be aligned

11 Energy Savings [Micro ’01]
Skip tag checks in a set-associative cache
Add special loads/stores to ISA
  – First mem op memoizes the cache way
  – Second mem op uses this to skip the check
Compiler analysis determines when data occupy the same line
  – Need congruence information

12 Banked Memory Architectures
Offset specifies the memory bank
  – Place data close to computation
  – Access banks in parallel
[Figure: four regfile/memory pairs, with memory banks labeled 0, 4, 8, 12]

13 Congruence Recognition
Iterative dataflow analysis
  – Low-level IR
Lattice elements of the form an+b
  – For pointers, memory locations accessed
If a = cache line size then b = offset
  – 32n+8 → accesses offset 8 in a 32-byte line
Line offsets: 0 4 8 12 16 20 24 28

14 Dataflow Lattice
8 byte cache line
    ⊤
    8n+0 8n+4 8n+2 8n+6 8n+1 8n+5 8n+3 8n+7
    4n+0 4n+2 4n+1 4n+3
    2n+0 2n+1
    n+0

15 Dataflow Lattice
    ⊤
    8n+0 8n+4 8n+2 8n+6 8n+1 8n+5 8n+3 8n+7
    4n+0 4n+2 4n+1 4n+3
    2n+0 2n+1
    n+0
Highlighted elements: ⊤, 8n+0, 4n+2, 2n+0, n+0

16 Transfer Functions
Meet:
    a = gcd(a1, a2, |b1 - b2|)
    b = b1 % a

17 Transfer Functions
Meet:
    a = gcd(a1, a2, |b1 - b2|)
    b = b1 % a
Add:
    a = gcd(a1, a2)
    b = (b1 + b2) % a
Subtract:
    a = gcd(a1, a2)
    b = (b1 - b2) % a
Multiply:
    a = gcd(a1 a2, a1 b2, a2 b1, C)
    b = (b1 b2) % a
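The transfer functions above translate almost directly into code. The following is a minimal sketch, not the authors' implementation. It assumes that the constant C in the multiply rule is the cache line size (32 bytes here), that a literal c is modeled as 32n+c, and that `a == 0` is used as a marker for the lattice top. The names `cong_t`, `cong_const`, `cong_meet`, `cong_add`, `cong_sub`, `cong_mul`, and `cong_line_offset` are hypothetical.

```c
#include <stdlib.h>

#define LINE_SIZE 32  /* assumed value of the constant C in the multiply rule */

/* Lattice element a*n + b. a == 0 marks TOP (no information yet). */
typedef struct { long a, b; } cong_t;

static long gcd(long x, long y)
{
    x = labs(x); y = labs(y);
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x;
}

/* Build a*n + b with b reduced into [0, a); requires a > 0. */
static cong_t norm(long a, long b)
{
    cong_t r = { a, ((b % a) + a) % a };
    return r;
}

static int is_top(cong_t x) { return x.a == 0; }

/* Assumption: a compile-time constant c is modeled as LINE_SIZE*n + c. */
cong_t cong_const(long c) { return norm(LINE_SIZE, c); }

cong_t cong_meet(cong_t x, cong_t y)
{
    if (is_top(x)) return y;
    if (is_top(y)) return x;
    return norm(gcd(gcd(x.a, y.a), x.b - y.b), x.b);
}

cong_t cong_add(cong_t x, cong_t y)
{
    if (is_top(x) || is_top(y)) { cong_t top = { 0, 0 }; return top; }
    return norm(gcd(x.a, y.a), x.b + y.b);
}

cong_t cong_sub(cong_t x, cong_t y)
{
    if (is_top(x) || is_top(y)) { cong_t top = { 0, 0 }; return top; }
    return norm(gcd(x.a, y.a), x.b - y.b);
}

cong_t cong_mul(cong_t x, cong_t y)
{
    if (is_top(x) || is_top(y)) { cong_t top = { 0, 0 }; return top; }
    long a = gcd(gcd(x.a * y.a, x.a * y.b), gcd(y.a * x.b, LINE_SIZE));
    return norm(a, x.b * y.b);
}

/* Line offset if it is known (a equals the line size), else -1. */
int cong_line_offset(cong_t x)
{
    return x.a == LINE_SIZE ? (int)x.b : -1;
}
```

For example, `cong_mul(cong_add(cong_const(0), cong_const(7)), cong_const(4))` yields 32n+28, and `cong_meet(cong_const(0), cong_const(8))` drops to 8n+0, whose line offset is no longer known.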

18 The Bad News
Most mem ops are not congruent
  – 32 byte cache line

19 Congruence Conventions (Padding)
Allocate arrays/structs on a line boundary
  – Congruent accesses to arrays for a given index
  – Congruent accesses to struct fields
Requires that we:
  – Allocate stack frames on cache line boundary
  – Modify malloc to return aligned data
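The slides do not show how malloc was modified. As an illustrative sketch only, a wrapper can over-allocate and round the returned pointer up to the next line boundary; `line_malloc` and `line_free` are hypothetical names, and the 32-byte line size is taken from the slides.

```c
#include <stdint.h>
#include <stdlib.h>

#define LINE_SIZE 32  /* cache line size in bytes, as on the slides */

/* Hypothetical wrapper: returns a block whose address is a multiple of
 * LINE_SIZE. The pointer obtained from malloc is stored just below the
 * aligned block so that line_free can recover it. */
void *line_malloc(size_t nbytes)
{
    /* Room for the payload, alignment slack, and the saved pointer. */
    unsigned char *raw = malloc(nbytes + LINE_SIZE + sizeof(void *));
    if (raw == NULL)
        return NULL;

    uintptr_t start   = (uintptr_t)raw + sizeof(void *);
    uintptr_t aligned = (start + LINE_SIZE - 1) & ~(uintptr_t)(LINE_SIZE - 1);

    ((void **)aligned)[-1] = raw;   /* remember what to pass to free() */
    return (void *)aligned;
}

void line_free(void *p)
{
    if (p != NULL)
        free(((void **)p)[-1]);
}
```

With every heap block starting at line offset 0, an access such as `p[k]` has the same line offset for a given `k` no matter which allocation `p` came from.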

20 Unrolling
Unrolling creates congruent references
    int a[100];
    for (i=0; i<n; i+=8) {
        a[i+0] = 0;
        a[i+1] = 0;
        …
        a[i+7] = 0;
    }
a[0] a[8] a[16] …
Line offsets: 0 4 8 12 16 20 24 28

21 Unrolling
Unrolling creates congruent references
    int a[100];
    for (i=0; i<n; i+=8) {
        a[i+0] = 0;
        a[i+1] = 0;
        …
        a[i+7] = 0;
    }
a[1] a[9] a[17] …
Line offsets: 0 4 8 12 16 20 24 28

22 Congruence with Parameters
    void init(int* a) {
        for (i=0; i<n; i+=8) {
            a[i+0] = 0;
            a[i+1] = 0;
            …
            a[i+7] = 0;
        }
    }
    void main() {
        int a[100];
        init(&a[2]);
        init(&a[3]);
    }
Line offsets: 0 4 8 12 16 20 24 28

24 Pre-loop
Add a pre-loop to enforce congruence
    for (i=0; i<n; i++) {
        if ((int)&a[i] % 32 == 0) break;
        a[i] = 0;
    }
    for (; i<n; i+=8) {
        a[i+0] = 0;
        a[i+1] = 0;
        …
        a[i+7] = 0;
    }
Line offsets: 0 4 8 12 16 20 24 28

25 Pre-loop
Add a pre-loop to enforce congruence
Mem ops congruent in the unrolled body
Pre-loop has few iterations
  – Most dynamic mem ops are congruent
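A compilable version of the pre-loop idea might look like the sketch below. It is an illustration rather than the authors' generated code. It assumes a 32-byte line and 4-byte `int`, uses `uintptr_t` instead of the slide's `(int)` cast so the address test is safe with 64-bit pointers, and adds a cleanup loop for leftover iterations, which the slide omits. The name `zero_ints` is hypothetical.

```c
#include <stdint.h>

#define LINE_SIZE 32  /* cache line size in bytes, as on the slides */
#define UNROLL     8  /* assumes 4-byte int: 8 ints fill one line */

/* Hypothetical example: zero n ints starting at a. */
void zero_ints(int *a, int n)
{
    int i = 0;

    /* Pre-loop: run scalar iterations until &a[i] is on a line boundary. */
    for (; i < n; i++) {
        if ((uintptr_t)&a[i] % LINE_SIZE == 0)
            break;
        a[i] = 0;
    }

    /* Unrolled body: a[i+k] is at line offset 4*k on every iteration. */
    for (; i + UNROLL <= n; i += UNROLL) {
        a[i + 0] = 0;
        a[i + 1] = 0;
        a[i + 2] = 0;
        a[i + 3] = 0;
        a[i + 4] = 0;
        a[i + 5] = 0;
        a[i + 6] = 0;
        a[i + 7] = 0;
    }

    /* Cleanup loop for the fewer-than-UNROLL remaining elements. */
    for (; i < n; i++)
        a[i] = 0;
}
```

The pre-loop runs at most 7 iterations for a 4-byte-aligned pointer, so for long arrays nearly all stores execute in the unrolled body, where each one has a statically known line offset.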

26 Finding the Break Condition
Can we choose arbitrarily?
    void init(int *x) {
        int i;
        for (i=0; i<100; i+=2) {
            if ((int)&x[i] % 32 == 0) break;
            x[i] = 0;
        }
        ...
    }
    int main() {
        int x[200];
        init(&x[1]);
    }
i    &x[i]%32
0    4
2    12
4    20
6    28
8    4
NO!
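The table shows why an arbitrary break condition fails: starting at offset 4 and stepping 8 bytes, the address never reaches offset 0. A small check makes this precise, since the offsets a loop can reach are exactly those that differ from the starting offset by a multiple of gcd(step, line size). The helper name `offset_reachable` is hypothetical.

```c
#include <stdlib.h>

static long gcd_long(long x, long y)
{
    x = labs(x); y = labs(y);
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x;
}

/* Hypothetical helper: returns 1 if a loop whose address starts at line
 * offset `start` and advances `step` bytes per iteration ever reaches
 * line offset `target` within a line of `line` bytes, else 0. */
int offset_reachable(long start, long step, long target, long line)
{
    long g = gcd_long(step, line);
    return (target - start) % g == 0;
}
```

For the slide's example, `offset_reachable(4, 8, 0, 32)` is 0, while offsets 4, 12, 20, and 28 are all reachable.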

27 Finding the Break Condition
    void copy(int *x, int *y) {
        int i;
        for (i=0; i<100; i++) {
            if ((int)&x[i] % 32 == 0 &&
                (int)&y[i] % 32 == 0) break;
            x[i] = y[i];
        }
        ...
    }
    int main() {
        int x[200], y[200];
        copy(&x[0], &y[0]);
        copy(&x[0], &y[1]);
    }
First call:
i    &x[i]%32    &y[i]%32
0    0           0
1    4           4
…    …           …
8    0           0
Second call:
i    &x[i]%32    &y[i]%32
0    0           4
1    4           8
…    …           …
8    0           4

29 Finding the Break Condition
    void copy(int *x, int *y) {
        int i;
        for (i=0; i<100; i++) {
            if ((int)&x[i] % 32 == 0 &&
                (int)&y[i] % 32 == 4) break;
            x[i] = y[i];
        }
        ...
    }
    int main() {
        int x[200], y[200];
        copy(&x[0], &y[0]);
        copy(&x[0], &y[1]);
    }
First call:
i    &x[i]%32    &y[i]%32
0    0           0
1    4           4
…    …           …
8    0           0
Second call:
i    &x[i]%32    &y[i]%32
0    0           4
1    4           8
…    …           …
8    0           4

30 Finding the Break Condition
    void copy(int *x, int *y) {
        int i;
        for (i=0; i<100; i++) {
            if ((int)&x[i] % 32 == 0) break;
            x[i] = y[i];
        }
        ...
    }
    int main() {
        int x[200], y[200];
        copy(&x[0], &y[0]);
        copy(&x[0], &y[1]);
    }
First call:
i    &x[i]%32    &y[i]%32
0    0           0
1    4           4
…    …           …
8    0           0
Second call:
i    &x[i]%32    &y[i]%32
0    0           4
1    4           8
…    …           …
8    0           4

31 Finding the Break Condition
Use profiling to observe runtime addresses
Find best break condition for the profile
Exhaustive search:
  – Consider all possible break conditions
  – Compute iterations in unrolled loop
  – Multiply by # of mem ops with known offset
  – Break condition with highest value is the best
Results vary little with profile data set
  – Insignificant on all but one benchmark
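The search can be sketched for the two-pointer `copy` example. This is a simplified illustration, not the authors' profiler. It assumes 4-byte elements, a 32-byte line, and one load and one store per iteration. It counts a memory operation as having a known offset only when its pointer is tested in the break condition, and it ignores unroll-factor remainders. All names (`call_t`, `score`, `best_break`) are hypothetical; a condition value of -1 means the pointer is not tested.

```c
#define LINE_SIZE 32
#define ELEM       4   /* assumed element size in bytes */

/* One profiled call: line offsets of x and y at i = 0, and the trip count. */
typedef struct { int x_off, y_off, n; } call_t;

/* Iterations left for the unrolled loop when the pre-loop breaks on
 * &x[i] % 32 == cx and &y[i] % 32 == cy (a negative value means that
 * pointer is not tested). Returns 0 if the condition never holds. */
static int unrolled_iters(call_t c, int cx, int cy)
{
    for (int i = 0; i < c.n; i++) {
        int xo = (c.x_off + i * ELEM) % LINE_SIZE;
        int yo = (c.y_off + i * ELEM) % LINE_SIZE;
        if ((cx < 0 || xo == cx) && (cy < 0 || yo == cy))
            return c.n - i;
    }
    return 0;
}

/* Score = sum over calls of (unrolled iterations * mem ops with known offset). */
long score(const call_t *prof, int ncalls, int cx, int cy)
{
    int known = (cx >= 0) + (cy >= 0);
    long total = 0;
    for (int k = 0; k < ncalls; k++)
        total += (long)unrolled_iters(prof[k], cx, cy) * known;
    return total;
}

/* Try every break condition; keep the first one with the highest score. */
void best_break(const call_t *prof, int ncalls, int *best_cx, int *best_cy)
{
    static const int cand[] = { -1, 0, 4, 8, 12, 16, 20, 24, 28 };
    const int ncand = (int)(sizeof cand / sizeof cand[0]);
    long best = -1;

    for (int a = 0; a < ncand; a++) {
        for (int b = 0; b < ncand; b++) {
            long s = score(prof, ncalls, cand[a], cand[b]);
            if (s > best) {
                best = s;
                *best_cx = cand[a];
                *best_cy = cand[b];
            }
        }
    }
}
```

For the two calls on the preceding slides (x at offset 0 in both, y at offset 0 and then 4), testing only `x` lets both calls enter the unrolled loop immediately, while testing both pointers doubles the known operations but locks out whichever call has the other relative alignment.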

32 Congruence Results (SPECfp95)
[Chart; series: Original, Congruent]

33 Congruence Results (SPECfp95)
[Chart; series: Original, Congruent, Detected]

34 Congruence Results (MediaBench)
[Chart; series: Original, Congruent, Detected]

35 Execution Time Overhead
benchmark    unrolling    + pre-loop
applu        -6.27%       -5.28%
apsi          0.93%        1.13%
fpppp         0.00%
hydro2d       0.99%        0.39%
mgrid         0.72%
su2cor       -0.32%        0.11%
swim         -0.96%       -0.17%
tomcatv      -0.18%        0.65%
turb3d       -0.80%        1.72%
wave5         3.75%        4.58%

36 DCache Energy Savings [Micro ’01]

37 Related Work
Fisher and Ellis: Bulldog Compiler
  – Memory bank disambiguation
  – Loop unrolling
Barua et al.: Raw Compiler
  – Modulo unrolling
Davidson et al.: Mem Access Coalescing
  – Loop unrolling
  – Alignment checks at runtime

38 Conclusions
Increased number of congruent refs by 5x
Analysis detected 95%
Results are good
  – MediaBench: 65% congruent, 60% detected
  – SPECfp95: 84% congruent, 82% detected
Many uses of congruence information
  – Wide accesses in multimedia extensions
  – Energy savings by tag check elimination
  – Bank disambiguation in clustered architectures

40 Example
    int a[100];
    for (i=0; i<n; i+=8) {
        a[i+0] = 0;
        a[i+1] = 0;
        …
        a[i+7] = 0;
    }
Control-flow graph:
  Entry block:
    i = 0
  Loop block:
    r0 = i+7
    r1 = r0*4
    r2 = r1+a
    *r2 = 0
    i = i+8
    i < n

41 Example
    i = 0        i:  32n+0
    r0 = i+7     r0: 32n+0 + 32n+7 = 32n+7
    r1 = r0*4    r1: 32n+7 * 32n+4 = 32n+28
    r2 = r1+a    r2: 32n+28 + 32n+0 = 32n+28
    *r2 = 0
    i = i+8      i:  32n+0 + 32n+8 = 32n+8
    i < n

42 Example
First pass:
    i = 0        i:  32n+0
    r0 = i+7     r0: 32n+0 + 32n+7 = 32n+7
    r1 = r0*4    r1: 32n+7 * 32n+4 = 32n+28
    r2 = r1+a    r2: 32n+28 + 32n+0 = 32n+28
    i = i+8      i:  32n+0 + 32n+8 = 32n+8
Second pass:
    i = 0        i:  32n+0
    loop head    i:  32n+0 ∧ 32n+8 = 8n+0
    r0 = i+7     r0: 8n+0 + 32n+7 = 8n+7
    r1 = r0*4    r1: 8n+7 * 32n+4 = 32n+28
    r2 = r1+a    r2: 32n+28 + 32n+0 = 32n+28
    *r2 = 0      *r2: offset is 28
    i = i+8      i:  8n+0 + 32n+8 = 8n+0
    i < n
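The two passes above can be reproduced with a small fixed-point loop. This is a sketch under stated assumptions, not the authors' analysis: constants are modeled as 32n+c, the array base `a` is assumed line-aligned (32n+0), `a == 0` stands for "no value yet" on the back edge, and the multiply rule's C is taken to be the line size. All function names are hypothetical.

```c
#include <stdlib.h>

#define LINE_SIZE 32

typedef struct { long a, b; } cong_t;  /* a*n + b; a == 0 means no value yet */

static long gcd2(long x, long y)
{
    x = labs(x); y = labs(y);
    while (y != 0) { long t = x % y; x = y; y = t; }
    return x;
}

static cong_t mk(long a, long b) { cong_t r = { a, ((b % a) + a) % a }; return r; }
static cong_t konst(long c) { return mk(LINE_SIZE, c); }

static cong_t meet(cong_t x, cong_t y)
{
    if (x.a == 0) return y;
    if (y.a == 0) return x;
    return mk(gcd2(gcd2(x.a, y.a), x.b - y.b), x.b);
}

static cong_t add(cong_t x, cong_t y) { return mk(gcd2(x.a, y.a), x.b + y.b); }

static cong_t mul(cong_t x, cong_t y)
{
    long a = gcd2(gcd2(x.a * y.a, x.a * y.b), gcd2(y.a * x.b, LINE_SIZE));
    return mk(a, x.b * y.b);
}

/* Iterate the loop body's transfer functions until the value of i on the
 * back edge stops changing; return the lattice value of the address r2. */
cong_t analyze_example(void)
{
    cong_t init = konst(0);   /* i = 0 on loop entry */
    cong_t back = { 0, 0 };   /* i on the back edge: nothing known yet */
    cong_t base = konst(0);   /* assumption: array a is line-aligned */
    cong_t r2;

    for (;;) {
        cong_t i    = meet(init, back);     /* loop head */
        cong_t r0   = add(i, konst(7));     /* r0 = i+7  */
        cong_t r1   = mul(r0, konst(4));    /* r1 = r0*4 */
        r2          = add(r1, base);        /* r2 = r1+a */
        cong_t next = add(i, konst(8));     /* i = i+8   */
        if (next.a == back.a && next.b == back.b)
            break;                          /* fixed point reached */
        back = next;
    }
    return r2;
}
```

Even though `i` degrades from 32n+0 to 8n+0 at the loop head, multiplying by the element size restores a modulus of 32, so the store address keeps a known line offset of 28.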

43 Multimedia Compilation
PowerMAC G4 with AltiVec
Commercial vectorizing compiler
  – Alignment pragmas
datatype    Vector length    Speedup (unaligned)    Speedup (aligned)    Improvement
float       4                3.25                   4.75                 46%
int         4                2.15                   2.93                 36%
short       8                2.98                   5.87                 97%
char        16               5.21                   11.53                121%

