Published by Ezra Brooks. Modified over 8 years ago.
1
Tile Size Selection Using Cache Organization and Data Layout
Stephanie Coleman, Intermetrics, Inc.
Kathryn S. McKinley, Computer Science, LGRC, University of Massachusetts Amherst
10/27/01
2
Where to Use Tiling/Blocking?
–Registers
–TLB
–L1 cache
–L2 cache
–Any other level of the memory hierarchy
3
Cache Misses
–Compulsory misses
–Capacity misses
–Interference misses: self-interference and cross-interference
4
Data Reuse and Locality
Data reuse
–Temporal reuse
–Spatial reuse
Locality: reused data remain in cache
Reuse does not necessarily result in locality
5
Without Tiling: Matrix Multiply
for I=1 to N do
  for K=1 to N do
    R=X(K,I)
    for J=1 to N do
      Z(J,I)=Z(J,I)+R*Y(J,K)
6
Reuse Pattern without tiling
7
Reuse Pattern after tiling
8
After Tiling (tile size = TK*TJ)
for KK=1 to N by TK do
  for JJ=1 to N by TJ do
    for I=1 to N do
      for K=KK to MIN(KK+TK-1,N) do
        R=X(K,I)
        for J=JJ to MIN(JJ+TJ-1,N) do
          Z(J,I)=Z(J,I)+R*Y(J,K)
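To make the transformation concrete, here is a minimal Python sketch of the untiled and tiled loop nests from the slides above. It assumes 0-based indexing and square n-by-n matrices stored as lists of lists; the function names are illustrative and not from the slides. Python lists are not laid out column-major as Fortran arrays are, so the sketch only shows that both loop orders compute the same result, not the cache behavior.

```python
def matmul_untiled(x, y, n):
    """Z(J,I) = sum over K of X(K,I) * Y(J,K), using the untiled loop order."""
    z = [[0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            r = x[k][i]
            for j in range(n):
                z[j][i] += r * y[j][k]
    return z


def matmul_tiled(x, y, n, tk, tj):
    """Same computation, with the K and J loops tiled by tk and tj."""
    z = [[0] * n for _ in range(n)]
    for kk in range(0, n, tk):          # tile loop over K
        for jj in range(0, n, tj):      # tile loop over J
            for i in range(n):
                # 0-based, exclusive upper bounds replace MIN(KK+TK-1, N)
                for k in range(kk, min(kk + tk, n)):
                    r = x[k][i]
                    for j in range(jj, min(jj + tj, n)):
                        z[j][i] += r * y[j][k]
    return z
```

Within one (KK, JJ) tile, the same TK-by-TJ block of Y is touched for every value of I, which is the reuse that tiling is meant to keep in cache.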
9
General Formula for Tiling
Before tiling:
for I=lo to hi do
Tiled into:
for It=floor((lo-off)/ts)*ts+off to floor((hi-off)/ts)*ts+off by ts do
  for I=max(lo,It) to min(hi,It+ts-1) do
(off: offset; ts: tile size)
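As a sanity check on the formula above, the following Python sketch enumerates the iterations of the tiled loop pair so they can be compared with the original `lo..hi` range. The function name `tiled_range` is hypothetical; Python's `//` is floor division, which matches `floor(...)` for negative operands as well.

```python
def tiled_range(lo, hi, ts, off=0):
    """Return the values of I visited by the tiled loops, in order."""
    visited = []
    first = ((lo - off) // ts) * ts + off   # origin of the first tile
    last = ((hi - off) // ts) * ts + off    # origin of the last tile
    for it in range(first, last + 1, ts):   # tile loop
        # element loop, clipped to the original bounds
        for i in range(max(lo, it), min(hi, it + ts - 1) + 1):
            visited.append(i)
    return visited
```

For example, `tiled_range(3, 17, 5, off=1)` walks tiles starting at 1, 6, 11 and 16 and visits exactly 3 through 17.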
10
Loop Interchange
Interchange an inner tile loop with an outer element loop:
for I=max(l1,l2,…) to min(u1,u2,…) do
  for Jt=floor((k1*I+m1)/ts)*ts+off to floor((ku*I+mu)/ts)*ts+off by ts do
The limits of the I loop do not change.
The new lower/upper limit of the Jt loop is the max of a set of expressions, where each expression is the old limit with I replaced by one of l1,l2,… (if k1>0) or u1,u2,… (if k1<0).
11
Tile Size Selection
12
Tile Size Selection
Cache layout with a tile size of 24
13
Potential Column Dimensions
Euclidean algorithm
–G.C.D(a,b)=G.C.D(a-b,b)
CS = q1*N + r1
N = q2*r1 + r2
r1 = q3*r2 + r3
…
Example (CS=1024, N=200):
1024 = 5*200 + 24
200 = 8*24 + 8
Potential column dimensions: 24, 8.
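The remainder sequence above can be generated mechanically. This is a minimal Python sketch, assuming the cache size CS and the array column length N are both counted in array elements; the function name is illustrative. It collects only the nonzero remainders, as the slide does.

```python
def potential_column_sizes(cache_size, n):
    """Nonzero remainders of the Euclidean algorithm applied to (CS, N)."""
    sizes = []
    a, b = cache_size, n
    r = a % b
    while r > 0:
        sizes.append(r)     # each remainder is a candidate column size
        a, b = b, r
        r = a % b
    return sizes
```

With the slide's values, `potential_column_sizes(1024, 200)` yields 24 and then 8; the next step, 24 = 3*8 + 0, ends the sequence.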
14
Computing row size for a column size
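The slide's own method for this step is not reproduced in the transcript. As a stand-in, here is a brute-force Python sketch that counts how many consecutive array columns can place their first `col_size` elements in a direct-mapped cache before two of them overlap. It assumes column-major storage, sizes counted in elements, and an array that starts at cache location 0; all names are hypothetical.

```python
def rows_for_column_size(n, cache_size, col_size):
    """Count consecutive columns whose tile segments map to disjoint cache locations."""
    occupied = set()
    count = 0
    while count < cache_size:                 # at most CS columns can ever fit
        start = (count * n) % cache_size      # cache location of column `count`
        locs = {(start + i) % cache_size for i in range(col_size)}
        if occupied & locs:                   # self-interference: stop here
            break
        occupied |= locs
        count += 1
    return count
```

For N=200 and CS=1024, a column size of 24 gives 41 under these assumptions, so a 24-by-41 tile occupies 984 of the 1024 locations without self-interference.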
15
Improve Spatial Locality with Cache Line Size
colSize =
  colSize, if colSize mod CLS = 0 or colSize = column length
  floor(colSize/CLS)*CLS, otherwise
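The rule above translates directly into a small helper. This Python sketch assumes CLS is the cache line size in array elements and that `column_length` is the full column length of the array; the function name is illustrative.

```python
def align_col_size(col_size, cls, column_length):
    """Round col_size down to a multiple of the cache line size, unless it is
    already a multiple or equals the full column length."""
    if col_size % cls == 0 or col_size == column_length:
        return col_size
    return (col_size // cls) * cls
```

For instance, a column size of 24 stays 24 when CLS is 4, but is cut to 16 when CLS is 16.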
16
Minimize Cross Interference
Working set size constraint: TJ*TK + TJ + 1*CLS < CS
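The constraint above can be checked for a candidate tile with a one-line predicate. This Python sketch reads the inequality with ordinary operator precedence, that is TJ*TK + TJ + (1*CLS) < CS, which is an assumption about the slide's intent; sizes are taken to be in array elements and the function name is hypothetical.

```python
def fits_working_set(tj, tk, cls, cache_size):
    """True if the working set of the tile is strictly smaller than the cache."""
    # Terms as written on the slide: TJ*TK, TJ, and 1*CLS.
    working_set = tj * tk + tj + 1 * cls
    return working_set < cache_size
```

With TJ=24, TK=41 and CS=1024, the check passes for CLS=4 (1012 < 1024) and fails for CLS=16 (1024 is not less than 1024), which would call for a smaller tile.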
17
Tile Size Selection Algorithm(TSS)
18
Other Algorithms for Computing Tile Size
LRW
–improves the average cache performance
–sensitive to the array size
–ineffective cache utilization
ESS
–effective only for one-dimensional tiling
–no consideration of cross-interference
19
Conclusion
TSS
–incorporates the effect of cache line size and cross-interference between arrays
–performs better than ESS and LRW on direct-mapped caches and on caches of higher associativity
–is sensitive to the array dimension
–does not fully exploit temporal reuse for some matrix sizes