Embedded Systems in Silicon TD5102 Data Management (2) Loop transformations & Data reuse Henk Corporaal

1 Embedded Systems in Silicon TD5102 Data Management (2) Loop transformations & Data reuse Henk Corporaal http://www.ics.ele.tue.nl/~heco/courses/EmbSystems Technical University Eindhoven DTI / NUS Singapore 2005/2006

2 Thanks to the IMEC DTSE experts: Erik Brockmeyer (IMEC, Leuven, Belgium), and also Martin Palkovic, Sven Verdoolaege, Tanja van Achteren, Sven Wuytack, Arnout Vandecappelle, Miguel Miranda, Cedric Ghez, Tycho van Meeuwen, Eddy Degreef, Michel Eyckmans, Francky Catthoor, and others.

3 H.C. TD5102 DM methodology (C-in to C-out): Dataflow Transformations, Analysis/Preprocessing, Loop/control-flow transformations, Data Reuse, Storage Cycle Budget Distribution, Memory Allocation and Assignment, Memory Layout organisation, Address optimization

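4 Locality of Reference
Before (A is produced in one loop and consumed in a second loop):
    for (i=0; i < 8; i++) A[i] = …;
    for (i=0; i < 8; i++) B[7-i] = f(A[i]);
After merging the loops (each A[i] is consumed right after it is produced):
    for (i=0; i < 8; i++) {
        A[i] = …;
        B[7-i] = f(A[i]);
    }
(Figures: location versus time, with production and consumption marked.)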
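A minimal C sketch of the locality example, under stated assumptions: produce() stands in for the elided "…" and f() is an arbitrary function; both are hypothetical. It illustrates that once production and consumption sit in the same iteration, the whole array A can shrink to a single scalar without changing B.

```c
#include <assert.h>

#define LEN 8

/* Hypothetical stand-ins for the slide's elided producer and for f(). */
static int produce(int i) { return 3 * i + 1; }
static int f(int x) { return x * x; }

/* Separate loops: all of A[] stays alive between production and use. */
static void separate_loops(int B[LEN]) {
    int A[LEN];
    for (int i = 0; i < LEN; i++) A[i] = produce(i);
    for (int i = 0; i < LEN; i++) B[LEN - 1 - i] = f(A[i]);
}

/* Merged loop: the value is consumed at once, so one scalar suffices. */
static void merged_loop(int B[LEN]) {
    for (int i = 0; i < LEN; i++) {
        int a = produce(i);
        B[LEN - 1 - i] = f(a);
    }
}

/* Returns 1 when both versions fill B[] identically. */
static int versions_agree(void) {
    int B1[LEN], B2[LEN];
    separate_loops(B1);
    merged_loop(B2);
    for (int i = 0; i < LEN; i++)
        if (B1[i] != B2[i]) return 0;
    return 1;
}
```

Calling versions_agree() from a test driver checks that the merge preserves the result.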

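5 Regularity
A consumed in the reverse order of its production (dependency vectors cross):
    for (i=0; i < 8; i++) A[i] = …;
    for (i=0; i < 8; i++) B[i] = f(A[7-i]);
A consumed in production order (dependency vectors are parallel):
    for (i=0; i < 8; i++) A[i] = …;
    for (i=0; i < 8; i++) B[7-i] = f(A[i]);
(Figures: location versus time, with production and consumption marked.)

6 Enabling Reuse
Before (A is read in two separate loops):
    for (i=0; i < 8; i++) B[i] = f1(A[i]);
    for (i=0; i < 8; i++) C[i] = f2(A[i]);
After merging the loops (both consumptions of A[i] happen in the same iteration):
    for (i=0; i < 8; i++) {
        B[i] = f1(A[i]);
        C[i] = f2(A[i]);
    }
(Figures: location versus time, with consumption marked.)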
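A small C sketch of the reuse that merging enables, assuming hypothetical functions f1() and f2() and a read counter load_A() that is not part of the slides. With separate loops every A[i] is fetched twice; in the merged loop it is fetched once and kept in a local variable.

```c
#include <assert.h>

#define LEN 8

static int A[LEN];
static int reads_of_A;                       /* instrumentation only */

static int load_A(int i) { reads_of_A++; return A[i]; }
static int f1(int x) { return x + 1; }       /* hypothetical */
static int f2(int x) { return 2 * x; }       /* hypothetical */

/* Two loops: A[i] is read once per loop. Returns the number of reads. */
static int reads_separate(int B[LEN], int C[LEN]) {
    reads_of_A = 0;
    for (int i = 0; i < LEN; i++) B[i] = f1(load_A(i));
    for (int i = 0; i < LEN; i++) C[i] = f2(load_A(i));
    return reads_of_A;
}

/* Merged loop: A[i] is read once and reused for both consumers. */
static int reads_merged(int B[LEN], int C[LEN]) {
    reads_of_A = 0;
    for (int i = 0; i < LEN; i++) {
        int a = load_A(i);
        B[i] = f1(a);
        C[i] = f2(a);
    }
    return reads_of_A;
}
```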

7 How to do these loop transformations automatically?
- Requires cost function
- Requires technique
Let's introduce some terminology:
- iteration spaces
- polytopes
- ordering vector / execution order

8 Iteration space and polytopes
    // assume A[][] exists
    for (i=1; i<6; i++) {
        for (j=2; j<6; j++) {
            B[i][j] = g( A[i-1][j-2]);
        }
    }
(Figure: i-j grid with both axes running from 0 to 5, showing the iteration space, the consumption space, the production space and a dependency vector.)
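The spaces named on this slide can be enumerated directly. The sketch below is an illustration only: the Box type and the helper names are hypothetical, and it records bounding boxes rather than general polytopes. It scans the loop nest and collects the iteration points, the elements of A that are consumed, and the elements of B that are produced.

```c
#include <assert.h>

/* Bounding box of a set of 2-D points plus the number of points added. */
typedef struct { int lo_i, hi_i, lo_j, hi_j, points; } Box;

static void box_add(Box *b, int i, int j) {
    if (b->points == 0) { b->lo_i = b->hi_i = i; b->lo_j = b->hi_j = j; }
    if (i < b->lo_i) b->lo_i = i;
    if (i > b->hi_i) b->hi_i = i;
    if (j < b->lo_j) b->lo_j = j;
    if (j > b->hi_j) b->hi_j = j;
    b->points++;
}

/* Scans the loop nest B[i][j] = g(A[i-1][j-2]) for i in 1..5, j in 2..5. */
static void scan_spaces(Box *iteration, Box *consumption, Box *production) {
    Box empty = {0, 0, 0, 0, 0};
    *iteration = *consumption = *production = empty;
    for (int i = 1; i < 6; i++)
        for (int j = 2; j < 6; j++) {
            box_add(iteration, i, j);          /* the iteration (i, j)   */
            box_add(consumption, i - 1, j - 2); /* element of A read      */
            box_add(production, i, j);          /* element of B written   */
        }
}
```

The offset between the element read and the element written is (1, 2) in every iteration, which is the kind of uniform vector the figure draws.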

9 Example with 3 polytopes
Algorithm having 3 loops:
A:  for (i=1; i<=N; ++i)
      for (j=1; j<=N-i+1; ++j)
        a[i][j] = in[i][j] + a[i-1][j];
B:  for (p=1; p<=N; ++p)
      b[p][1] = f( a[N-p+1][p], a[N-p][p] );
C:  for (k=1; k<=N; ++k)
      for (l=1; l<=k; ++l)
        b[k][l+1] = g (b[k][l]);
(Figure: polytopes A, B and C with their iterators i, j, p, k and l.)

10 Common iteration space
Initial solution having a common iteration space:
    for (i=1; i<=(2*N+1); ++i)
      for (j=1; j<=2*N; ++j) {
        if (i>=1 && i<=N && j>=1 && j<=N-i+1)
          a[i][j] = in[i][j] + a[i-1][j];
        if (i==N+1 && j>=1 && j<=N)
          b[j][1] = f( a[N-j+1][j], a[N-j][j] );
        if (i>=N+2 && i<=2*N+1 && j>=N+1 && j<=i-1)   /* j<=N+k with k = i-N-1 */
          b[i-N-1][j-N+1] = g (b[i-N-1][j-N]);
      }
- Bad locality
- Bad regularity
- Requires 2N memory locations
- Many dummy iterations
(Figure: common iteration space with i running from 1 to 2*N+1 and j from 1 to 2*N, and the ordering vector.)
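To make "many dummy iterations" concrete, the sketch below counts how many points of the common iteration space execute a statement. It is a sketch under assumptions: the guards are reconstructed from the three original loops (statement A for i in 1..N, statement B on row i = N+1, statement C with k = i-N-1 and l = j-N), and the names IterationCount and count_iterations are hypothetical.

```c
#include <assert.h>

typedef struct { long total, a_count, b_count, c_count; } IterationCount;

/* Walks the (2N+1) x 2N common iteration space and counts guard hits. */
static IterationCount count_iterations(int n) {
    IterationCount c = {0, 0, 0, 0};
    for (int i = 1; i <= 2 * n + 1; ++i)
        for (int j = 1; j <= 2 * n; ++j) {
            c.total++;
            if (i >= 1 && i <= n && j >= 1 && j <= n - i + 1) c.a_count++;
            if (i == n + 1 && j >= 1 && j <= n) c.b_count++;
            if (i >= n + 2 && i <= 2 * n + 1 && j >= n + 1 && j <= i - 1)
                c.c_count++;
        }
    return c;
}

/* Iterations in which no statement executes (the three guards are disjoint). */
static long dummy_iterations(IterationCount c) {
    return c.total - c.a_count - c.b_count - c.c_count;
}
```

For N = 4 this gives 72 iterations in total, of which 48 do no work.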

11 Cost function needed for automation
- Regularity
  - Equal direction for dependency vectors
  - Avoid that dependency vectors cross each other
  - Good for storage size
- Temporal locality
  - Equal length of all dependency vectors
  - Good for storage size
  - Good for data reuse

12 Regularity (figure: a regular versus an irregular dependency pattern)

13 Bad regularity limits the ordering freedom. Ordering freedom = 90 degrees. (Figure: common iteration space with i running from 1 to 2*N+1 and j from 1 to 2*N.)

14 Locality estimates
Dependency vector length is a measure for locality.
Q: Which length is the best estimate?
- Sum{d_i}
- Max{d_i}
- Spanning tree
(Figures: one production P with several consumptions C, connected by dependency vectors d_i. P = production, C = consumption.)
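Two of the candidate estimates, Sum{d_i} and Max{d_i}, are easy to compute once the dependency vectors are known. The sketch below assumes 2-D integer vectors and uses the Manhattan length; the slides do not say which norm is meant, so that choice and the names DepVec, dep_sum and dep_max are assumptions. The spanning-tree estimate is not covered.

```c
#include <assert.h>
#include <stdlib.h>

/* One dependency vector from a production to a consumption. */
typedef struct { int di, dj; } DepVec;

/* Manhattan length; an assumed norm, chosen to stay in integers. */
static int dep_length(DepVec d) { return abs(d.di) + abs(d.dj); }

/* Sum{d_i}: total length over all dependency vectors. */
static int dep_sum(const DepVec *d, int n) {
    int sum = 0;
    for (int k = 0; k < n; k++) sum += dep_length(d[k]);
    return sum;
}

/* Max{d_i}: length of the longest dependency vector (0 if n == 0). */
static int dep_max(const DepVec *d, int n) {
    int max = 0;
    for (int k = 0; k < n; k++)
        if (dep_length(d[k]) > max) max = dep_length(d[k]);
    return max;
}
```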

15 Three step approach for loop transformation tool
1. Affine loop transformations
   1. Only geometric information is available during placement
   2. Rotation, skewing, interchange, reverse
2. Polytope placement
   1. Only geometric information is available during placement
   2. Translation
3. Choose ordering vector
Combined transformation: (formula not preserved in the transcript)
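One common way to represent the first two steps is as an affine map on iteration points, with a matrix for the linear part (rotation, skewing, interchange, reverse) and an offset for the translation. The sketch below assumes that representation for a 2-D iteration space; the Affine type and its helpers are hypothetical and are not taken from the slides.

```c
#include <assert.h>

/* A 2-D affine map on iteration points: out = m * in + t. */
typedef struct {
    int m[2][2];   /* linear part: interchange, skewing, reverse, ... */
    int t[2];      /* translation part: polytope placement            */
} Affine;

static void affine_apply(const Affine *a, const int in[2], int out[2]) {
    for (int r = 0; r < 2; r++)
        out[r] = a->m[r][0] * in[0] + a->m[r][1] * in[1] + a->t[r];
}

/* Composition: apply 'first', then 'second'. */
static Affine affine_compose(const Affine *second, const Affine *first) {
    Affine c;
    for (int r = 0; r < 2; r++) {
        for (int k = 0; k < 2; k++)
            c.m[r][k] = second->m[r][0] * first->m[0][k]
                      + second->m[r][1] * first->m[1][k];
        c.t[r] = second->m[r][0] * first->t[0]
               + second->m[r][1] * first->t[1] + second->t[r];
    }
    return c;
}

static const Affine INTERCHANGE = { { {0, 1}, {1, 0} }, {0, 0} };  /* (i,j) -> (j,i)   */
static const Affine SKEW_J_BY_I = { { {1, 0}, {1, 1} }, {0, 0} };  /* (i,j) -> (i,i+j) */
```

Composing an interchange with a translation yields a single map that can be applied to every point of a polytope.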

16 Three step approach for loop transformation tool: affine loop transformations, polytope placement, choose ordering vector
A: (i: 1..N):: (j: 1.. N-i+1):: a[i][j] = in[i][j] + a[i-1][j];
B: (p: 1..N):: b[p][1] = f( a[N-p+1][p], a[N-p][p] );
C: (k: 1..N):: (l: 1..k):: b[N-k+1][l+1] = g( b[N-k+1][l] );

17 Three step approach for loop transformation tool: affine loop transformations, polytope placement, choose ordering vector (figure)

18 Three step approach for loop transformation tool: affine loop transformations, polytope placement (= merging loops), choose ordering vector (figure)

19 Choose optimal ordering vector (figure: Ordering Vector 1 versus Ordering Vector 2)

20 From the Polyhedral model back to C
Steps applied: affine loop transformations, polytope placement, choose ordering vector.
Optimized solution having a common iteration space:
    for (j=1; j<=N; ++j) {
      for (i=1; i<=N-j+1; ++i)
        a[i][j] = in[i][j] + a[i-1][j];
      b[j][1] = f( a[N-j+1][j], a[N-j][j] );
      for (l=1; l<=j; ++l)
        b[j][l+1] = g( b[j][l] );
    }
- Optimal locality
- Optimal regularity
- Requires 2 memory locations
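A quick way to gain confidence in such a transformation is to run both versions on the same input and compare every element. The sketch below does that for the three original loops (A, B, C) and the merged loop. The bodies of f(), g() and input() are hypothetical, N is fixed at 6, and the arrays keep their full size; the reduction to 2 memory locations is not modelled here.

```c
#include <assert.h>
#include <string.h>

#define N 6

/* Hypothetical stand-ins for f, g and the input array in[][]. */
static int f(int x, int y) { return x + 2 * y; }
static int g(int x) { return 3 * x + 1; }
static int input(int i, int j) { return 5 * i + j; }

typedef struct { int a[N + 2][N + 2]; int b[N + 2][N + 2]; } State;

/* The three original loop nests A, B and C. */
static void three_loops(State *s) {
    for (int i = 1; i <= N; ++i)
        for (int j = 1; j <= N - i + 1; ++j)
            s->a[i][j] = input(i, j) + s->a[i - 1][j];
    for (int p = 1; p <= N; ++p)
        s->b[p][1] = f(s->a[N - p + 1][p], s->a[N - p][p]);
    for (int k = 1; k <= N; ++k)
        for (int l = 1; l <= k; ++l)
            s->b[k][l + 1] = g(s->b[k][l]);
}

/* The merged loop nest after interchange, placement and ordering. */
static void merged_loop(State *s) {
    for (int j = 1; j <= N; ++j) {
        for (int i = 1; i <= N - j + 1; ++i)
            s->a[i][j] = input(i, j) + s->a[i - 1][j];
        s->b[j][1] = f(s->a[N - j + 1][j], s->a[N - j][j]);
        for (int l = 1; l <= j; ++l)
            s->b[j][l + 1] = g(s->b[j][l]);
    }
}

/* Returns 1 when both versions leave a[][] and b[][] identical. */
static int same_result(void) {
    State s1, s2;
    memset(&s1, 0, sizeof s1);
    memset(&s2, 0, sizeof s2);
    three_loops(&s1);
    merged_loop(&s2);
    return memcmp(&s1, &s2, sizeof s1) == 0;
}
```

The versions agree because column j of a[][] depends only on column j, so it can be finished before b[j][...] is computed.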

21 Loop transformation - cavity detection. X-Y loop interchange: buffer size goes from N x M to N x (2GB+1). (Figure: Scanner, Gauss Blur x and Gauss Blur y operating on N x M images, with X and Y axes.)

22 Loop transformation - cavity (1). Steps: 1 Transform: interchange; 2 Translate: merge; 3 Order (figure)

23 Loop transformation - cavity (2). Steps: 1 Transform: interchange; 2 Translate: merge; 3 Order. Shown for the x-blur filter (figure).

24 Loop transformation - cavity detection. X-Y loop interchange: buffer size goes from N x M to N x (2GB+1). (Figure: Scanner, Gauss Blur x and Gauss Blur y operating on N x M images, with X and Y axes.)

25 Loop transformation - cavity (3). Step 2, comparing different translations: Translate 1 and Translate 2 (figures)

26 Loop transformation - cavity (4). Step 3, order: combining (merging) multiple polytopes (figure)

27 Result on gauss filter
    for (y=0; y<M+GB; ++y) {
      for (x=0; x<N+GB; ++x) {
        /* horizontal blur */
        if (x>=GB && x<=N-1-GB && y>=GB && y<=M-1-GB) {
          gauss_x_compute = 0;
          for (k=-GB; k<=GB; ++k)
            gauss_x_compute += image_in[x+k][y]*Gauss[abs(k)];
          gauss_x_image[x][y] = gauss_x_compute/tot;
        } else if (x<N && y<M)
          gauss_x_image[x][y] = 0;
        /* vertical blur, delayed by GB lines */
        if (x>=GB && x<=N-1-GB && (y-GB)>=GB && (y-GB)<=M-1-GB) {
          gauss_xy_compute = 0;
          for (k=-GB; k<=GB; ++k)
            gauss_xy_compute += gauss_x_image[x][y-GB+k]*Gauss[abs(k)];
          gauss_xy_image[x][y-GB] = gauss_xy_compute/tot;
        } else if (x<N && (y-GB)>=0 && (y-GB)<M)
          gauss_xy_image[x][y-GB] = 0;
      }
    }

28 Intermezzo: before we continue with data reuse, have a look at other loop transformations.

29 DM methodology (C-in to C-out): Dataflow Transformations, Analysis/Preprocessing, Loop/control-flow transformations, Data Reuse, Storage Cycle Budget Distribution, Memory Allocation and Assignment, Memory Layout organisation, Address optimization

30 Memory hierarchy and Data reuse
1. Determine reuse candidates
2. Combine reuse candidates into reuse chains
3. If there are multiple access statements per array, combine the chains into reuse trees
4. Determine number of layers (if architecture is not fixed)
5. Select candidates and assign to memory layers
6. Add extra transfers between the different memory layers (for scratchpad RAM; not for caches)
(Figure: data paths on top of memory layers Layer 1, Layer 2 and Layer 3.)

31 TI C55@200MHz example platform (TMS320vc5510@200MHz, Vdd = 1.5 V, P = unknown)
- Processor partition: register file + core, size 2x16 registers, bandwidth 4.8 Gwords/s
- On-chip RAM, size 320kB: 8x 4Kx16 dual (total 64Kb, 2 elem in 1 cycle) and 32x 4Kx16 sing (total 256Kb, 1 elem in 1 cycle); BW: 400M Word/s dual port
- ROM partition (Data/program/DMA): 16Kx16 ROM, size 32kB, first 3 cycles, next 2 cycles, bandwidth 100M words/s. It seems this can be in parallel with the 256Kb memory.
- Off-chip: MAX 8MBx16 SRAM/EPROM/SDRAM/SBSRAM, size 16 MB, BW: 50M Word/s single port
- Other labels in the figure: variable size RAM partition, fixed size RAM partition, layers L0, L1 and L2

32 Exploiting Memory Hierarchy for reduced Power: principle
Before: all accesses (#A = 100%) go to array A in the main memory M, with P = 1 per access, so P total (before) = 100%.
(Figure: processor with data paths and register file, connected to M.)

33 Exploiting Memory Hierarchy for reduced Power: principle
P total (before) = 100%
Two-level figure: M (P = 1) holds A, with a copy A' (P = 0.3); access shares 100% and 5%.
Three-level figure: M (P = 1) holds A, with copies A' (P = 0.1) and A'' (P = 0.01); 100% of the accesses go to A'', 10% to A' and 1% to A.
P total (after) = 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%
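The power estimate is a weighted sum: the share of accesses that reaches each level, multiplied by the relative energy per access at that level. A minimal sketch, with a hypothetical function name and array layout:

```c
#include <assert.h>

/* Relative power of a memory hierarchy: for each level, the fraction of
   accesses reaching it times its relative energy per access. */
static double hierarchy_power(const double *access_fraction,
                              const double *energy_per_access,
                              int levels) {
    double total = 0.0;
    for (int l = 0; l < levels; l++)
        total += access_fraction[l] * energy_per_access[l];
    return total;
}
```

With fractions {1.0, 0.10, 0.01} and energies {0.01, 0.1, 1.0} this reproduces 100% x 0.01 + 10% x 0.1 + 1% x 1 = 3%, against 100% for a single level with energy 1.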

34 Data reuse decision and memory hierarchy: principle. Customized connections in the memory subsystem to bypass the memory hierarchy and avoid the overhead. (Figure: processor with data paths and register file, main memory M holding A and B, copies A' and A'', and customized connections.)

35 Step 1: identify arrays with data reuse potential
    for (i=0; i<4; i++)
     for (j=0; j<3; j++)
      for (k=0; k<6; k++)
       … = A[i*4+k];
(Figure: array index versus time, with copy1 to copy4 in time frames 1 to 4; intra-copy reuse within a time frame, inter-copy reuse between consecutive copies.)

36 Importance of high level cost estimate
Same loop nest as on the previous slide. Each copy (Mk) holds 6 elements. Array copies are stored in-place!
(Figure: array index versus time, with copy1 to copy4 in time frames 1 to 4.)

37 Step 1: determine gains. Intra-copy reuse factor
    for (i=0; i<4; i++)
     for (j=0; j<3; j++)
      for (k=0; k<6; k++)
       … = A[i*4+k];
The j iterator is not present in the index expression, so each copy (Mk, 6 elements) is read once per iteration of j: intra-copy reuse factor = 3.

38 Step 1: determine gains. Inter-copy reuse factor
    for (i=0; i<n; i++)        /* i<4 in the running example */
     for (j=0; j<3; j++)
      for (k=0; k<6; k++)
       … = A[i*4+k];
The i iterator has a smaller weight (4) than the k range (6), so consecutive copies overlap: inter-copy reuse factor = 1/(1-1/3) = 3/2.
(Figure: array index versus time, with copy1 to copy4 in time frames 1 to 4; copy Mk holds 6 elements.)
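Both reuse factors can be checked by brute force on the example A[i*4+k]. The sketch below assumes the interpretation that the intra-copy factor is accesses per copy divided by copy size, and that the inter-copy factor is copy size divided by the number of elements that are new relative to earlier copies, taken in steady state (the last copy). The macro and function names are hypothetical.

```c
#include <assert.h>

#define I_RANGE  4
#define J_RANGE  3
#define K_RANGE  6
#define I_WEIGHT 4    /* the index expression is A[i*4 + k] */

/* Accesses served by one copy divided by the elements the copy holds. */
static double intra_copy_reuse(void) {
    int accesses = J_RANGE * K_RANGE;   /* per iteration of i         */
    int elements = K_RANGE;             /* A[i*4 + 0] .. A[i*4 + 5]    */
    return (double)accesses / elements;
}

/* Copy size divided by the elements not already seen in an earlier copy,
   measured on the last copy (steady state). */
static double inter_copy_reuse(void) {
    int seen[I_RANGE * I_WEIGHT + K_RANGE] = {0};
    int fresh = 0;
    for (int i = 0; i < I_RANGE; i++) {
        fresh = 0;
        for (int k = 0; k < K_RANGE; k++) {
            int idx = i * I_WEIGHT + k;
            if (!seen[idx]) { seen[idx] = 1; fresh++; }
        }
    }
    return (double)K_RANGE / fresh;
}
```

Each new copy shares 2 of its 6 elements with the previous one, so only 4 elements are fetched per time frame, which gives 6/4 = 3/2.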

39 Possibility for multi-level hierarchy
    for (i=0; i<10; i++)
     for (j=0; j<2; j++)
      for (k=0; k<3; k++)
       for (l=0; l<3; l++)
        for (m=0; m<5; m++)
         … = A[i*15+k*5+m];
(Figure: array index versus time. Copy Mk holds 15 elements per time frame (time frame 1, time frame 2, ...); copy Mm holds 5 elements per sub-frame (tf 1.1 to tf 1.6, tf 2.1, ...).)

40 Step 2: determine data reuse chains for each memory access
- Many reuse possibilities
- Cost estimate needed
- Prune for promising ones
(Figure: candidate chains for access R1(A): directly from A; through one copy A' (two variants); through two copies A' and A''.)

41 Cost function needs both size and number of accesses to intermediate array
    for (i=0; i<10; i++)
     for (j=0; j<2; j++)
      for (k=0; k<3; k++)
       for (l=0; l<3; l++)
        for (m=0; m<5; m++)
         … = A[i*15+k*5+m];
Estimate #misses from different levels for one iteration of i:
- R1(A): 2*3*3*5 = 90
- A' of 15 elements (Gk): 3*5 = 15
- A' of 5 elements (Gm): 2*3*5 = 30
Estimate size.
(Figure: #misses, 0 to 100, versus # elements, 0 to 20.)
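The miss counts 90, 15 and 30 can be reproduced by simulating a copy buffer over one iteration of i. The sketch below assumes a simple model that is not spelled out on the slide: the buffer holds one aligned block of `size` consecutive elements of A and is refilled completely whenever an access falls outside the current block. The function names are hypothetical.

```c
#include <assert.h>

/* Number of reads issued by access R1(A) during one iteration of i. */
static int accesses_per_i(void) { return 2 * 3 * 3 * 5; }

/* Elements copied into an aligned block buffer of `size` elements during
   one iteration of i, for the access A[i*15 + k*5 + m]. */
static int misses_for_buffer(int size, int i) {
    int misses = 0;
    int current_base = -1;                    /* no block loaded yet */
    for (int j = 0; j < 2; j++)
        for (int k = 0; k < 3; k++)
            for (int l = 0; l < 3; l++)
                for (int m = 0; m < 5; m++) {
                    int idx = i * 15 + k * 5 + m;
                    int base = idx - idx % size;   /* start of the block */
                    if (base != current_base) {    /* refill whole block */
                        misses += size;
                        current_base = base;
                    }
                }
    return misses;
}
```

A 15-element buffer is filled once per iteration of i; a 5-element buffer is refilled for every (j, k) pair, which is 6 refills of 5 elements.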

42 Very simplistic power and area estimation for different data-reuse versions. (Figure: for each of the four candidate chains of R1(A), the numbers of accesses, the sizes and the resulting energy estimates; the individual values cannot be matched to the chains in this transcript.)

43 Step 3: determine data reuse trees for multiple accesses
Access R1(A), reuse chain A, A', A'':
    for (i=0; i<10; i++)
     for (j=0; j<2; j++)
      for (k=0; k<3; k++)
       for (l=0; l<3; l++)
        for (m=0; m<5; m++)
         … = A[i*15+k*5+m];
Access R2(A), reuse chain A, A':
    for (x=0; x<8; x++)
     for (y=0; y<5; y++)
      … = A[i*5+y];

44 Step 3: determine data reuse trees for multiple accesses. (Figure: the reuse chains of R1(A) (A, A', A'') and R2(A) (A, A') combined into one reuse tree rooted at A.)

45 Assign all data reuse trees (multiple arrays) to memory hierarchy. (Figure: the reuse tree of A (R1(A) through A' and A'', R2(A) through A') and the reuse chain of B (R1(B) through B', B'' and B''') placed on Layer 1, Layer 2 and Layer 3; in the second version the chain of B is B, B', B'''.)

46 Step 4: Determine number of layers. (Figure: data reuse trees A and B against the hierarchy layers Layer1, Layer2 and Layer3, the foreground memory and the datapath.)

47 Step 5: Select and assign reuse candidates. (Figure: data reuse trees for A against hierarchy layers 1, 2, 3 and FG, with the possible hierarchy assignments enumerated.)

48 Step 5: All freedom in array to memory hierarchy. (Figure: data reuse trees A and B against the hierarchy layers.)

49 Step 5: Prune reuse graph (platform independent). Quite some solutions never make sense. (Figure: hierarchy layers with full freedom versus pruned.)

50 Step 5: Prune reuse graph further (platform dependent). (Figure: the pruned graph versus the final solution on a 4 layer platform, with A, A', B, B' and FG.)

51 Introducing 1D reuse buffer
Original:
    int in[H][W+8], out[H][W];
    const int coef[] = {1,0,1,2,2,1,0,1};   /* c[] on the slide; renamed because c is also the column iterator */
    for (r=0; r < H; r++)
      for (c=0; c < W; c++)
        for (dc=0; dc < 8; dc++)
          out[r][c] += in[r][c+dc]*coef[dc];
With a reuse buffer:
    int in[H][W+8], out[H][W], buf[8];      /* intermediate level declaration */
    const int coef[] = {1,0,1,2,2,1,0,1};
    for (r=0; r < H; r++) {
      for (i=0; i<7; i++)
        buf[i] = in[r][i];                  /* initial copy */
      for (c=0; c < W; c++) {
        buf[(c+7)%8] = in[r][c+7];          /* additional copy */
        for (dc=0; dc < 8; dc++)
          out[r][c] += buf[(c+dc)%8]*coef[dc];   /* reread from buffer */
      }
    }
Reuse Factor = 7
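The buffered version relies on one invariant: element in[r][x] always lives at buf[x % 8]. The sketch below checks that invariant indirectly by comparing the buffered convolution with the direct one on a small image. H, W, the test pattern and the function names are assumptions made for the example, and the coefficient array is called coef to avoid a clash with the column iterator c.

```c
#include <assert.h>

#define H 3
#define W 10
#define TAPS 8

static const int coef[TAPS] = {1, 0, 1, 2, 2, 1, 0, 1};

/* Direct version: every tap is read from the image. */
static void convolve_direct(int in[H][W + TAPS], int out[H][W]) {
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++) {
            out[r][c] = 0;
            for (int dc = 0; dc < TAPS; dc++)
                out[r][c] += in[r][c + dc] * coef[dc];
        }
}

/* Buffered version: in[r][x] is kept at buf[x % TAPS]. */
static void convolve_buffered(int in[H][W + TAPS], int out[H][W]) {
    int buf[TAPS];
    for (int r = 0; r < H; r++) {
        for (int i = 0; i < TAPS - 1; i++)
            buf[i] = in[r][i];                          /* initial copy    */
        for (int c = 0; c < W; c++) {
            buf[(c + TAPS - 1) % TAPS] = in[r][c + TAPS - 1]; /* one new element */
            out[r][c] = 0;
            for (int dc = 0; dc < TAPS; dc++)
                out[r][c] += buf[(c + dc) % TAPS] * coef[dc];
        }
    }
}

/* Returns 1 when both versions produce the same output image. */
static int buffered_matches_direct(void) {
    int in[H][W + TAPS], a[H][W], b[H][W];
    for (int r = 0; r < H; r++)
        for (int x = 0; x < W + TAPS; x++)
            in[r][x] = (r * 31 + x * 7) % 13;           /* arbitrary pattern */
    convolve_direct(in, a);
    convolve_buffered(in, b);
    for (int r = 0; r < H; r++)
        for (int c = 0; c < W; c++)
            if (a[r][c] != b[r][c]) return 0;
    return 1;
}
```

Per output pixel the buffered version reads one new element from the image and serves all eight taps from the buffer.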

52 Data Reuse on 1D horizontal convolution. How to make explicit copies? Image NxM, traversed in row order. (Figure labels: init buffer, reuse data, new data.)

53 Introducing line buffers for vertical filtering. Whole image: size [N][M]. Set of lines: [2GB+1][N]. Why keep the whole image in that case?

54 Simplified "reuse script"
1. Identify arrays with sufficient reuse potential
2. Determine reuse chains and prune these (for every array read)
3. Determine reuse trees and prune these (for every array)
4. Determine reuse graph including bypasses and prune (for entire application)
5. Determine memory hierarchy layout assignment incorporating given background memory restrictions (layers) and real-time constraints
6. Introduce copies in code: init, update, use code
For scratchpad memories only. For caches we need a different approach.

55 Data re-use trees: cavity detector. (Figure: reuse trees between main memory and CPU for image_in, gauss_x, gauss_xy/comp_edge and image_out, with copy candidates of sizes N*M, N*1, 3*1, N*3, 1*3, 3*3 and 1*1, annotated with array read and array write counts such as 0, N*M, N*M*3 and N*M*8.)

56 Memory hierarchy assignment: cavity detector. (Figure: image_in, gauss_x, gauss_xy, comp_edge and image_out and their copies (N*M, N*3, 3*3, 3*1, 1*1) assigned to layers L1, L2 and L3; memories shown: 1MB SDRAM, 16KB Cache, 128 B RegFile; access counts such as 0, N*M, N*M*3 and N*M*8.)

57 Data reuse & memory hierarchy

