Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout.


1 Processor Architectures and Program Mapping 5KK70 TU/e Henk Corporaal Bart Mesman Data Management Part c: SCBD, MAA, and Data Layout

2 H.C. Platform-based Design 5KK70
Part 3 overview
Recap on design flow
Platform dependent steps
– SCBD: Storage Cycle Budget Distribution
– MAA: Memory Allocation and Assignment
– Data layout techniques for RAM
– Data layout techniques for caches
Results
Conclusions
Thanks to the IMEC DTSE people

3 [Figure: design flow with the stages Concurrent OO spec, Remove OO overhead, Dynamic memory mgmt, Task concurrency mgmt, Physical memory mgmt, Address optimization, SW/HW co-design, SW design flow and HW design flow.]

4 DM steps
C-in
Preprocessing
Dataflow transformations
Loop transformations
Data reuse
Memory hierarchy layer assignment
Cycle budget distribution
Memory allocation and assignment
Data layout
C-out
Address optimization
[The figure marks part of this flow as "Today".]

5 Result of memory hierarchy assignment for cavity detection
[Figure: the arrays image_in, gauss_x, gauss_xy, comp_edge and image_out mapped onto three layers: L3 (1 MB SDRAM), L2 (16 KB cache) and L1 (128 B register file). The transfers are annotated with copy sizes and access counts such as N*M, N*M*3, N*M*8, M*3, 3*3, 3*1 and 1*1.]

6 Data-reuse - cavity detection code
Code after reuse transformation (partly; upper bounds on x that did not survive are shown as …):
for (y=0; y<M+3; ++y) {
  for (x=0; x<N+2; ++x) {
    /* first in_pixel initialized */
    if (x==0 && y>=1 && y<=M-2)
      in_pixels[x%3] = image_in[x][y];
    /* copy rest of in_pixel's in row */
    if (x>=0 && x<… && y>=1 && y<=M-2)
      in_pixels[(x+1)%3] = image_in[x+1][y];
    if (x>=1 && x<… && y>=1 && y<=M-2) {
      gauss_x_tmp = 0;
      for (k=-1; k<=1; ++k)
        gauss_x_tmp += in_pixels[(x+k)%3]*Gauss[Abs(k)];
      gauss_x_lines[x][y%3] = foo(gauss_x_tmp);
    } else if (x<N && y<M)
      gauss_x_lines[x][y%3] = 0;

7 Storage Cycle Budget Distribution & Memory Allocation and Assignment

8 Define the memory organization which can provide enough bandwidth with minimal cost

9 Balancing memory bandwidth
Reduce the maximum number of loads/stores per cycle.
[Figure: two plots of required memory bandwidth over time, one labeled High and one labeled Low.]

10 Data management approach
[Figure: one of the many possible schedules.]

11 Data management approach

12 Conflict cost calculation
Key issues:
Number of conflicts
Self conflicts
Chromatic number (never smaller than the size of the maximum clique)

13 Self conflict -> dual port memory

14 Chromatic number -> minimum # single port memories

15 Low number of conflicts -> large assignment freedom
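As a rough illustration of how the conflict graph drives the number of memories, the sketch below greedily colors a conflict graph so that two conflicting arrays never end up in the same single-port memory. This is not the SCBD/MAA tooling behind the slides: `assign_memories`, `MAX_ARRAYS` and the adjacency-matrix representation are hypothetical names chosen for this example, and a greedy pass only yields an upper bound on the chromatic number. Self conflicts are left out, since they call for a multi-port memory rather than a different assignment.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_ARRAYS 16  /* hypothetical limit for this sketch */

/* conflict[a][b] is true when arrays a and b are accessed in the same cycle
 * and therefore may not share a single-port memory.
 * Returns the number of memories used; memory_of[a] receives the choice. */
int assign_memories(int n, bool conflict[MAX_ARRAYS][MAX_ARRAYS],
                    int memory_of[MAX_ARRAYS])
{
    assert(n >= 0 && n <= MAX_ARRAYS);
    int memories_used = 0;
    for (int a = 0; a < n; ++a) {
        bool taken[MAX_ARRAYS] = { false };
        /* Mark memories that already hold an array conflicting with a. */
        for (int b = 0; b < a; ++b)
            if (conflict[a][b])
                taken[memory_of[b]] = true;
        /* Pick the lowest-numbered memory that is still allowed. */
        int m = 0;
        while (taken[m])
            ++m;
        memory_of[a] = m;
        if (m + 1 > memories_used)
            memories_used = m + 1;
    }
    return memories_used;
}
```

With three mutually conflicting arrays and one array without conflicts, the function reports three memories and places the fourth array in the first one.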

16 Conflict Directed Ordering is used for flat graph scheduling
Reduce intervals until all conflicts known
Driven by cost of conflicts
Constructive algorithm
[Figure: accesses R(A), W(A), R(B), W(B), R(C), W(C), R(D) and W(D) placed into time slots 1 to 6.]

17 Local optimization is not good for global optimization

18 Budget distribution has large impact on memory cost

19 Decreasing basic block length until target cycle budget is met

20 Obtain more freedom by merging loops
More scheduling freedom
Extension to different threads

21 Memory allocation and assignment

22 Memory Allocation and Assignment Substeps
Memory Allocation
Array-to-memory Assignment
Port Assignment
Bus Sharing
[Figure: substeps numbered 1 to 3, illustrated with arrays A, B, C and D.]

23 Influence of MAA
Bitwidth, address range, nr. memories, nr. ports
Assign arrays to memory
Memory interconnect
Minimize power & area
Per memory (MEMORY-1 with arrays A and B, up to MEMORY-N with arrays K and L): bitwidth (maximum), size, nr. ports (R/W/RW)
[Figure: example bit patterns 1001001110101001, 100100111010XXXX, 1001XXXXXX and 0101110010.]

24 Example of bus sharing possibilities
[Figure: simultaneous access pairs R(A) R(B), R(B) W(A), W(C) R(A), R(A) W(B), W(A) W(B) and W(A) W(C), with three alternative tables assigning arrays A, B and C to memories m1, m2 and m3.]

25 Decreasing cycle budget limits freedom and raises cost

26 Resulting Pareto curve for DAB synchro application
[Figure: Pareto curve; axis label: Energy cost.]

27 Example conflict graph for cavity detection

28 MAA result
Power:
On-chip area:

29 Data layout: how to put data into memory

30 Memory data layout for custom and cache architectures
[Figure: a PE with a cache holding copies A' and B', and memories MEM1 and MEM2 holding arrays A, B, C, F, G and H; question marks mark locations that are still to be decided.]

31 Intra-array in-place mapping reduces size of one array at a time
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);
Window: max nr. of live elements
[Figure: the window covers rows i-1 and i along j.]

32 Two-phase mapping of array elements onto addresses
Storage order: variable domains (arrays A, B, C) -> abstract addresses (a_A, a_B, a_C)
Allocation: abstract addresses -> real addresses

33 Exploration of storage orders for 2-dimensional array
[Figure: variable domain with axes $a_1$ and $a_2$, mapped to a memory address $a = ???$]
Candidate storage orders:
$a = 3a_1 + a_2$
$a = 3(1-a_1) + a_2$
$a = 3a_1 + (2-a_2)$
$a = 2a_2 + a_1$
$a = 2a_2 + (1-a_1)$
$a = 2(2-a_2) + a_1$
$a = 3(1-a_1) + (2-a_2)$
$a = 2(2-a_2) + (1-a_1)$
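To make the eight candidates concrete, the following sketch evaluates each storage order and checks that it maps every element to a distinct address. The formulas suggest a 2x3 array (a1 in 0..1, a2 in 0..2); that domain, the order numbering and the names `storage_address` and `is_valid_order` are assumptions made for this example only.

```c
#include <assert.h>
#include <stdbool.h>

#define N1 2          /* a1 ranges over 0..1 (assumed from the formulas) */
#define N2 3          /* a2 ranges over 0..2 (assumed from the formulas) */
#define NUM_ORDERS 8  /* hypothetical numbering of the candidates */

/* Address of element (a1, a2) under one of the eight storage orders. */
int storage_address(int order, int a1, int a2)
{
    assert(order >= 0 && order < NUM_ORDERS);
    assert(a1 >= 0 && a1 < N1 && a2 >= 0 && a2 < N2);
    int r1 = (N1 - 1) - a1;  /* a1 traversed in reverse */
    int r2 = (N2 - 1) - a2;  /* a2 traversed in reverse */
    switch (order) {
    case 0:  return 3 * a1 + a2;
    case 1:  return 3 * r1 + a2;
    case 2:  return 3 * a1 + r2;
    case 3:  return 3 * r1 + r2;
    case 4:  return 2 * a2 + a1;
    case 5:  return 2 * a2 + r1;
    case 6:  return 2 * r2 + a1;
    default: return 2 * r2 + r1;
    }
}

/* True when the order assigns each element its own address in 0..5. */
bool is_valid_order(int order)
{
    bool used[N1 * N2] = { false };
    for (int a1 = 0; a1 < N1; ++a1)
        for (int a2 = 0; a2 < N2; ++a2) {
            int a = storage_address(order, a1, a2);
            if (a < 0 || a >= N1 * N2 || used[a])
                return false;
            used[a] = true;
        }
    return true;
}
```

All eight orders pass the check; they differ only in which element lands on which address, which is what the window analysis later exploits.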

34 Chosen storage order determines window size
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[i][j] = f(a[i-1][j]);
Row-major ordering: a=5i+j
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*i+j] = f(a[5*i+j-5]);
Highest live address: 5*i+j
Lowest live address: 5*i+j-5
Difference + 1 = Window: 6
Column-major ordering: a=5j+i
for (i=1; i<5; i++)
  for (j=0; j<5; j++)
    a[5*j+i] = f(a[5*j+i-1]);
Highest live address: 5*4+i-1
Lowest live address: 5*0+i-1
Difference + 1 = Window: 21
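The two window sizes can be cross-checked by brute force. The sketch below walks through the loop nest, tracks which elements are live at each iteration, and records the largest span between the lowest and highest live address. The liveness model is an assumption of this example: row 0 is live from the start, an element dies after its single read, and the last row stays live until the loop ends. The names `window_size`, `row_major` and `column_major` are hypothetical.

```c
#include <assert.h>

#define ROWS 5
#define COLS 5

typedef int (*storage_order_fn)(int row, int col);

int row_major(int row, int col)    { return COLS * row + col; }
int column_major(int row, int col) { return ROWS * col + row; }

/* Sequence number of loop body (i, j), with i = 1..ROWS-1, j = 0..COLS-1. */
static int iteration(int i, int j) { return (i - 1) * COLS + j; }

/* Largest (highest live address - lowest live address + 1) over the loop. */
int window_size(storage_order_fn address)
{
    assert(address != 0);
    int last = iteration(ROWS - 1, COLS - 1);
    int window = 0;
    for (int t = 0; t <= last; ++t) {
        int lo = ROWS * COLS, hi = -1;
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c) {
                /* Row 0 exists up front; a[r][c] is written at (r, c). */
                int born = (r == 0) ? 0 : iteration(r, c);
                /* a[r][c] is last read at (r+1, c); the final row is
                 * assumed to stay live until the loop ends. */
                int dies = (r < ROWS - 1) ? iteration(r + 1, c) : last;
                if (born <= t && t <= dies) {
                    int a = address(r, c);
                    if (a < lo) lo = a;
                    if (a > hi) hi = a;
                }
            }
        if (hi - lo + 1 > window)
            window = hi - lo + 1;
    }
    return window;
}
```

Under these assumptions `window_size(row_major)` gives 6 and `window_size(column_major)` gives 21, in line with the figures on the slide.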

35 Static allocation: no in-place mapping
[Figure: arrays A to E with address ranges a_A to a_E, plotted as memory size over time.]

36 Windowed Allocation: intra-array in-place mapping
[Figure: two layouts of arrays A to E, labeled "Static, windowed" and "Dynamic, windowed"; axis: memory size.]

37 Dynamic allocation: inter-array in-place mapping
[Figure: arrays A to E with address ranges a_A to a_E; axis: memory size.]

38 Dynamic allocation strategy with common window
[Figure: arrays A to E in a layout labeled "Dynamic, common window"; axis: memory size.]

39 Expressing memory data layout in source code
Example: array of 10x20 elements
A: offset 120, no window
B: storage order [20, 2], offset 134, window 78
Before:
bit8 B[10][20];
bit6 A[30];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[x][y] = …;
  }
After:
bit8 memory[334];
bit8* B = (bit8*)&memory[134];
bit6* A = (bit6*)&memory[120];
for (x=0; x<10; ++x)
  for (y=0; y<20; ++y) {
    … = A[3*x-y];
    B[(x*20+y*2)%78] = …;
  }

40 Example of memory data layout for storage size reduction
int x[W], y[W];
for (i1=0; i1 < W; i1++)
  x[i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++) {
    sum += c[N+di2] * x[wrap(i2+di2,W)];
  }
  y[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(y[i3]);

41 Occupied address-time domain of x[] and y[]

42 Optimized source code after memory data layout
int mem1[N+W];
for (i1=0; i1 < W; i1++)
  mem1[N+i1] = getInput();
for (i2=0; i2 < W; i2++) {
  sum = 0;
  for (di2=-N; di2 <= N; di2++) {
    sum += c[N+di2] * mem1[N+wrap(i2+di2,W)];
  }
  mem1[i2] = sum;
}
for (i3=0; i3 < W; i3++)
  putOutput(mem1[i3]);

43 Optimized OAT domain after memory data layout

44 In-place mapping for cavity detection example
Input image is partly consumed by the time first results for output image are ready
[Figure: index-versus-time plots for Image_in and Image_out, and an address-versus-time plot for a single array Image.]

45 In-place - cavity detection code
With separate input and output arrays:
for (y=0; y<=M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image_out[x-5][y-3] = …;
    /* code removed */
    … = image_in[x+1][y];
  }
}
After in-place mapping onto a single array:
for (y=0; y<=M+3; ++y) {
  for (x=0; x<N+5; ++x) {
    image[x-5][y-3] = …;
    /* code removed */
    … = image[x+1][y];
  }
}

46 Cavity detection summary
Overall result:
Local accesses reduced by factor 3
Memory size reduced by factor 5
Power reduced by factor 5
System bus load reduced by factor 12
Performance worsened by factor 6

47 Data layout for caches
Caches are hardware controlled
Therefore: no explicit copy code needed!
What can we do?

48 Cache principles
[Figure: the CPU issues a p-bit address that is split into a tag (p-k-m bits), an index (k bits) and a byte address (m bits). The cache has 2^k lines; each cache line / block stores a tag of p-k-m bits and 2^m bytes of data. Comparing the stored tag with the address tag answers "Hit?"; main memory sits behind the cache.]

49 Cache Architecture Fundamentals
Block placement
– Where in the cache will a new block be placed?
Block identification
– How is a block found in the cache?
Block replacement policy
– Which block is evicted from the cache?
Updating policy
– How is a block written from cache to memory?

50 Block placement policies
Mapping of memory blocks 0 to 15 onto a cache with blocks 0 to 7:
Fully associative (one-to-many): anywhere in cache
Direct mapped (one-to-one): here only!

51 Direct mapped cache
[Figure: a 32-bit address (bit positions 31 to 0) is split into a 20-bit tag, a 10-bit index and a byte offset. The index selects one of 1024 entries (0 to 1023), each with Valid, Tag and Data fields; the outputs are Hit and 32-bit Data.]
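The address split used by both cache slides can be written down in a few lines. The sketch below takes the number of index bits (k) and byte-offset bits (m) as parameters and extracts the three fields from a 32-bit address; `cache_address` and `split_address` are hypothetical names for this example, and the 2-bit byte offset in the usage note follows from 32 - 20 - 10 rather than from an explicit label.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    uint32_t tag;          /* upper 32 - k - m bits */
    uint32_t index;        /* k bits selecting the cache line */
    uint32_t byte_offset;  /* m bits selecting the byte within the line */
} cache_address;

/* Split a 32-bit address for a cache with 2^index_bits lines
 * of 2^offset_bits bytes each. */
cache_address split_address(uint32_t address,
                            unsigned index_bits, unsigned offset_bits)
{
    assert(index_bits + offset_bits < 32);
    cache_address parts;
    parts.byte_offset = address & ((1u << offset_bits) - 1u);
    parts.index = (address >> offset_bits) & ((1u << index_bits) - 1u);
    parts.tag = address >> (offset_bits + index_bits);
    return parts;
}
```

For the direct mapped cache above, `split_address(0x12345678u, 10, 2)` yields tag 0x12345, index 0x19E and byte offset 0.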

52 Taking advantage of spatial locality: direct mapped cache with larger blocks
[Figure: address (bit positions).]

53 Performance
Increasing the block size tends to decrease miss rate.

54 4-way associative cache

55 Performance
[Figure: curves for cache sizes 1 KB, 2 KB and 8 KB.]

56 Cache Fundamentals: The “Three C's”
Compulsory Misses
– 1st access to a block: never in the cache
Capacity Misses
– Cache cannot contain all the blocks
– Blocks are discarded and retrieved later
– Avoided by increasing cache size
Conflict Misses
– Too many blocks mapped to same set
– Avoided by increasing associativity

57 Compulsory miss example
for(i=0; i<10; i++)
  A[i] = f(B[i]);
Cache (@ i=2) holds A[0], A[1], A[2], B[0], B[1], B[2]
B[3], A[3] required
B[3] never loaded before -> loaded into cache
A[3] never loaded before -> allocates new line
Cache (@ i=3)

58 Capacity miss example
for(i=0; i<N; i++)
  A[i] = B[i+3]+B[i];
Cache size: 8 blocks of 1 word, fully associative
Cache contents per iteration:
i=0: B[3] B[0] A[0]
i=1: B[3] B[0] A[0] B[4] B[1] A[1]
i=2: A[2] B[0] A[0] B[4] B[1] A[1] B[5] B[2]
i=3: A[2] B[6] B[3] A[3] B[1] A[1] B[5] B[2]
i=4: A[2] B[6] B[3] A[3] B[7] B[4] A[4] B[2]
i=5: B[5] A[5] B[3] A[3] B[7] B[4] A[4] B[8]
i=6: B[5] A[5] B[9] B[6] A[6] B[4] A[4] B[8]
i=7: B[5] A[5] B[9] B[6] A[6] B[10] B[7] A[7]
11 compulsory misses (+8 write misses)
5 capacity misses

59 Conflict miss example
for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
A[i] read 10 times
A[0] multiply loaded: A[0] flushed in favor of B[0][j] -> miss
[Figure: main memory holding A[0] to A[3] followed by B[0][0], B[1][0], B[2][0], B[3][0], B[0][1], ... up to B[3][9], mapped onto an 8-line cache (lines 0 to 7). The cache is shown at i=0 for j=even and j=odd, with line 0 marked A[0]/B[0][j].]

60 “Three C's” vs Cache size [Gee93]

61 Data layout may reduce cache misses

62 Example 1: Capacity & Compulsory miss reduction
for(i=0; i<N; i++)
  A[i] = B[i+3]+B[i];
[Figure: same cache contents per iteration as in the capacity miss example on slide 58.]
11 compulsory misses (+8 write misses)
5 capacity misses

63 Fit data in cache with in-place mapping
for(i=0; i<12; i++)
  A[i] = B[i+3]+B[i];
Traditional analysis: max=27 words
Detailed analysis: max=15 words
[Figure: #words occupied by A[] and B[] versus i (axis marks 6, 12, 15); cache memory and main memory (16 words) holding the merged array AB[new].]
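A quick way to convince oneself that the merged array is safe is to run both versions on the same input and compare. The sketch below does that for the 12-iteration loop; the function names and the test input are hypothetical, and the merged array is called `ab` after the AB[new] label in the figure.

```c
#include <assert.h>

#define N 12  /* loop count from the slide */

/* Reference: separate arrays, (N + 3) + N = 27 words in total. */
void compute_separate(const int b[N + 3], int a[N])
{
    for (int i = 0; i < N; ++i)
        a[i] = b[i + 3] + b[i];
}

/* In place: each result overwrites b[i], which is never read again,
 * so N + 3 = 15 words are enough. */
void compute_in_place(int ab[N + 3])
{
    for (int i = 0; i < N; ++i)
        ab[i] = ab[i + 3] + ab[i];
}

/* Returns 1 when both versions produce the same N results. */
int in_place_matches_reference(void)
{
    int b[N + 3], ab[N + 3], a[N];
    assert(sizeof ab == 15 * sizeof(int));  /* the 15 words of the slide */
    for (int i = 0; i < N + 3; ++i)
        b[i] = ab[i] = 7 * i + 1;           /* arbitrary test input */
    compute_separate(b, a);
    compute_in_place(ab);
    for (int i = 0; i < N; ++i)
        if (a[i] != ab[i])
            return 0;
    return 1;
}
```

The in-place version works because iteration i reads only positions i and i+3, neither of which has been overwritten by an earlier iteration.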

64 Remove capacity / compulsory misses with in-place mapping
for(i=0; i<N; i++)
  AB[i] = AB[i+3]+AB[i];
Cache contents per iteration:
i=0: AB[3] AB[0]
i=1: AB[3] AB[0] AB[4] AB[1]
i=2: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2]
i=3: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6]
i=4: AB[3] AB[0] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=5: AB[3] AB[8] AB[4] AB[1] AB[5] AB[2] AB[6] AB[7]
i=6: AB[3] AB[8] AB[4] AB[9] AB[5] AB[2] AB[6] AB[7]
i=7: AB[3] AB[8] AB[4] AB[9] AB[5] AB[10] AB[6] AB[7]
11 compulsory misses
5 cache hits (+8 write hits)
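The miss counts for both layouts can be reproduced with a small simulation of a fully associative cache with 8 one-word blocks. Several choices in the sketch below are assumptions of this example rather than facts from the slides: LRU replacement, write-allocate on a write miss, the access order (B[i+3], then B[i], then the write), and all identifiers such as `cache_sim` and `A_BASE`. In a fully associative cache every non-compulsory miss is counted as a capacity miss.

```c
#include <assert.h>

#define CACHE_BLOCKS 8
#define MAX_BLOCK_ID 64
#define A_BASE 32  /* hypothetical block ids: B (or AB) from 0, A from 32 */

typedef struct {
    int block[CACHE_BLOCKS];      /* resident block ids, -1 when empty */
    long last_use[CACHE_BLOCKS];  /* time of the most recent access */
    int seen[MAX_BLOCK_ID];       /* 1 once a block has been touched */
    long clock;
    int read_hits, write_hits;
    int compulsory_read_misses, capacity_read_misses, write_misses;
} cache_sim;

void cache_reset(cache_sim *c)
{
    *c = (cache_sim){ 0 };
    for (int i = 0; i < CACHE_BLOCKS; ++i)
        c->block[i] = -1;
}

void cache_access(cache_sim *c, int id, int is_write)
{
    assert(id >= 0 && id < MAX_BLOCK_ID);
    int victim = 0;
    ++c->clock;
    for (int i = 0; i < CACHE_BLOCKS; ++i) {
        if (c->block[i] == id) {  /* hit: refresh the LRU timestamp */
            c->last_use[i] = c->clock;
            if (is_write) ++c->write_hits; else ++c->read_hits;
            return;
        }
        if (c->last_use[i] < c->last_use[victim])
            victim = i;           /* least recently used (or empty) slot */
    }
    if (is_write) ++c->write_misses;
    else if (c->seen[id]) ++c->capacity_read_misses;
    else ++c->compulsory_read_misses;
    c->seen[id] = 1;
    c->block[victim] = id;        /* allocate on read and write misses */
    c->last_use[victim] = c->clock;
}

/* A[i] = B[i+3] + B[i] with separate arrays. */
void run_separate(cache_sim *c, int n)
{
    assert(n >= 0 && n + 3 <= A_BASE && A_BASE + n <= MAX_BLOCK_ID);
    cache_reset(c);
    for (int i = 0; i < n; ++i) {
        cache_access(c, i + 3, 0);
        cache_access(c, i, 0);
        cache_access(c, A_BASE + i, 1);
    }
}

/* AB[i] = AB[i+3] + AB[i] after in-place mapping. */
void run_in_place(cache_sim *c, int n)
{
    assert(n >= 0 && n + 3 <= MAX_BLOCK_ID);
    cache_reset(c);
    for (int i = 0; i < n; ++i) {
        cache_access(c, i + 3, 0);
        cache_access(c, i, 0);
        cache_access(c, i, 1);
    }
}
```

For n = 8 the separate-array run reports 11 compulsory read misses, 5 capacity read misses and 8 write misses, while the in-place run reports 11 compulsory read misses, 5 read hits and 8 write hits.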

65 Example 2: Conflict miss reduction
for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
[Figure: same memory-to-cache mapping as in the conflict miss example on slide 59.]

66 Avoid conflict miss with main memory data layout
for(j=0; j<10; j++)
  for(i=0; i<4; i++)
    A[i] = A[i]+B[i][j];
Leave gap
A[i] read multiple times
j=any: no conflict
[Figure: main memory holding A[0] to A[3] and the columns of B with gaps left between them, mapped onto an 8-line cache shown at i=0; A[0] and B[0][j] occupy different cache lines.]
© imec 2001
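The effect of the gap can be checked with a toy model of an 8-line direct mapped cache with one word per line. The sketch assumes that A occupies addresses 0 to 3, that column j of B starts at address 4 + stride * j, that each iteration reads A[i], reads B[i][j] and then writes A[i], and that write misses allocate a line. These layout details and the names `touch` and `count_a_misses` are assumptions of this example.

```c
#include <assert.h>

#define CACHE_LINES 8
#define A_SIZE 4
#define B_COLUMNS 10

/* Access one word; returns 1 on a miss and installs the word. */
static int touch(int line[CACHE_LINES], int addr)
{
    int l = addr % CACHE_LINES;  /* direct mapped: one possible line */
    if (line[l] == addr)
        return 0;
    line[l] = addr;
    return 1;
}

/* Number of misses on A[] for A[i] = A[i] + B[i][j], where column j
 * of B starts at address A_SIZE + b_column_stride * j. */
int count_a_misses(int b_column_stride)
{
    assert(b_column_stride >= A_SIZE);
    int line[CACHE_LINES];
    int a_misses = 0;
    for (int l = 0; l < CACHE_LINES; ++l)
        line[l] = -1;
    for (int j = 0; j < B_COLUMNS; ++j)
        for (int i = 0; i < A_SIZE; ++i) {
            a_misses += touch(line, i);                      /* read A[i] */
            touch(line, A_SIZE + b_column_stride * j + i);   /* read B[i][j] */
            a_misses += touch(line, i);                      /* write A[i] */
        }
    return a_misses;
}
```

With a packed layout (stride 4) every odd column of B lands on the lines of A, giving 24 misses on A; with a gap of four words (stride 8) only the 4 compulsory misses on A remain.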

67 Data Layout Organization for Direct Mapped Caches

68 Conclusion on Data Management
In multi-media applications, exploring data transfer and storage issues should be done at source code level
DMM method
– Reducing number of external memory accesses
– Reducing external memory size
– Trade-offs between internal memory complexity and speed
– Platform independent high-level transformations
– Platform dependent transformations exploit platform characteristics (efficient use of memory, cache, …)
– Substantial energy reduction

