
1 OMS 2007: A Model for the Effect of Caching on Algorithmic Efficiency in Radix based Sorting. Arne Maus and Stein Gjessing, Dept. of Informatics, University of Oslo, Norway

2 Overview
- Motivation: CPU versus memory speed; caches
- A cache test
- A simple model for the execution times of algorithms
  - Do theoretical cache tests carry over to real programs?
- A real example: three radix sorting algorithms compared
The number of instructions executed is no longer a good measure of the performance of an algorithm.

3 The need for caches: the CPU-memory performance gap. From: John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Morgan Kaufmann Publishers, San Francisco, CA, 2003.

4 A cache test: random vs. sequential access in large arrays
- Both a and b are of length n (n = 100, 200, 400, ..., 97m)
- Two test runs, with the same number of instructions performed:
  - Random access: set b[i] = random(0..n-1). We then get 15 random accesses in b and 1 in a, plus 1 sequential access in b (the innermost).
  - Sequential access: set b[i] = i. Then b[b[.....b[i]....]] = i, and we get 16 sequential accesses in b and 1 in a.
The test loop:

for (int i = 0; i < n; i++)
    a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
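The test described above can be reproduced directly. The sketch below is a minimal harness, not the authors' benchmark code: the array size, the random seed, and the use of System.nanoTime are our own assumptions.

```java
import java.util.Random;

public class CacheTest {
    // Runs the 16-level indirection loop from the slide. With b[i] = i every
    // access is sequential; with b filled randomly, 15 of the 16 lookups in b
    // (and the store into a) hit effectively random addresses.
    static long run(int[] a, int[] b) {
        long t0 = System.nanoTime();
        for (int i = 0; i < a.length; i++)
            a[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[b[i]]]]]]]]]]]]]]]]] = i;
        return System.nanoTime() - t0;
    }

    public static void main(String[] args) {
        int n = 1 << 20;                 // assumed size, large enough to exceed L2
        int[] a = new int[n], bSeq = new int[n], bRnd = new int[n];
        Random rnd = new Random(42);
        for (int i = 0; i < n; i++) {
            bSeq[i] = i;                 // sequential case: b[b[...b[i]...]] = i
            bRnd[i] = rnd.nextInt(n);    // random case
        }
        System.out.println("sequential ns: " + run(a, bSeq));
        System.out.println("random     ns: " + run(a, bRnd));
    }
}
```

In the sequential case the nested lookups collapse to the identity, so the loop simply sets a[i] = i; both runs execute the same instruction count, isolating the cache effect.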

5 Random vs. sequential access times, with the same number of instructions performed. Cache misses slow random access down by a factor of 50-60 (4 CPUs).
[Figure annotations: start of cache misses from L1 to L2; start of cache misses from L2 to memory]

6 Why a slowdown of 50-60, and not a factor of 400?
- Patterson and Hennessy suggest a slowdown factor of 400; the test shows 50 to 60. Why?
- Answer: every array access in Java is checked against the lower and upper array limits, roughly:
  - load array index
  - compare with zero (lower limit)
  - load upper limit
  - compare index and upper limit
  - load array base address
  - load/store array element (= possible cache miss)
- We thus see 5 cache-hit operations plus one cache miss, giving an average of (5 + 400)/6 = 67.
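The closing average can be checked directly; the 5 cache-hit operations and the 400-cycle miss penalty are the slide's own figures:

```java
public class MissAverage {
    public static void main(String[] args) {
        int cacheHitOps = 5;    // bounds-check operations that stay in cache
        int missPenalty = 400;  // Patterson and Hennessy's suggested miss cost
        // Average cost per operation over the 6 operations of one array access
        int avg = (cacheHitOps + missPenalty) / 6;
        System.out.println("average slowdown per operation: " + avg);  // 67
    }
}
```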

7 A simple model for the execution time of a program
From the figure for random access, we see an asymptotic slowdown factor of:
- 1 if n < L1
- 4 if L1 < n < L2
- 50 if L2 < n
The access time T_R for one random read or write is then:

    T_R = 1*Pr(access in L1) + 4*Pr(access in L2) + 50*Pr(access in memory)
        ( = 1*L1/n + 4*L2/n + 50*(n - L2)/n, when n > L2 )

The cost of a sequential read or write is set to 1, and we can then estimate the total execution time as the weighted sum over all loop accesses.
For every loop in the program:
- count the number of sequential references
- count the number of random accesses, and the number of places n in which the randomly accessed object (array) is used
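The formula for T_R can be sketched as a small helper. The L1 and L2 capacities in the example call are illustrative assumptions (expressed in array elements), not the paper's measured values:

```java
public class AccessModel {
    // The slide's formula, valid for n > L2:
    //   T_R = 1*L1/n + 4*L2/n + 50*(n - L2)/n
    // n, l1 and l2 are all measured in array elements; the result is the
    // expected cost of one random access in units of one sequential access.
    static double tR(double n, double l1, double l2) {
        return 1.0 * l1 / n + 4.0 * l2 / n + 50.0 * (n - l2) / n;
    }

    public static void main(String[] args) {
        double l1 = 16_384, l2 = 262_144;  // assumed L1/L2 capacities in ints
        // For n = 52m almost every random access goes to memory,
        // so T_R approaches the full factor of 50 (here roughly 49.8).
        System.out.println(tR(52_000_000, l1, l2));
    }
}
```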

8 Applying the model to Radix sorting: the test
Three radix algorithms:
- radix1: sorts the array in one pass with one 'large' digit
- radix2: sorts the array in two passes with two half-sized digits
- radix3: sorts the array in three passes with three 'small' digits
radix3 performs almost three times as many instructions as radix1; should it then be almost 3 times as slow as radix1?
radix2 performs almost twice as many instructions as radix1; should it be almost 2 times as slow as radix1?

9 Base: the right-radix sorting algorithm. One pass over array a with one sorting digit of width maskLen (shifted shift bits up).

static void radixSort(int[] a, int[] b, int left, int right, int maskLen, int shift) {
    int acumVal = 0, j, n = right - left + 1;
    int mask = (1 << maskLen) - 1;
    int[] count = new int[mask + 1];
    // a) count = the frequency of each radix value in a
    for (int i = left; i <= right; i++)
        count[(a[i] >> shift) & mask]++;
    // b) add up in 'count' - accumulated values
    for (int i = 0; i <= mask; i++) {
        j = count[i];
        count[i] = acumVal;
        acumVal += j;
    }
    // c) move numbers in sorted order from a to b
    for (int i = 0; i < n; i++)
        b[count[(a[i + left] >> shift) & mask]++] = a[i + left];
    // d) copy b back to a
    for (int i = 0; i < n; i++)
        a[i + left] = b[i];
}

10 Radix sort with 1, 2 and 3 digits = 1, 2 and 3 passes

static void radix1(int[] a, int left, int right) {
    // 1 digit radixSort: a[left..right]
    int max = 0, numBit = 1, n = right - left + 1;
    for (int i = left; i <= right; i++)
        if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int[] b = new int[n];
    radixSort(a, b, left, right, numBit, 0);
}

static void radix3(int[] a, int left, int right) {
    // 3 digit radixSort: a[left..right]
    int max = 0, numBit = 3, n = right - left + 1;
    for (int i = left; i <= right; i++)
        if (a[i] > max) max = a[i];
    while (max >= (1 << numBit)) numBit++;
    int bit1 = numBit / 3, bit2 = bit1, bit3 = numBit - (bit1 + bit2);
    int[] b = new int[n];
    radixSort(a, b, left, right, bit1, 0);
    radixSort(a, b, left, right, bit2, bit1);
    radixSort(a, b, left, right, bit3, bit1 + bit2);
}
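Only radix1 and radix3 appear on this slide; radix2 (two passes with two half-sized digits, as described on slide 8) is not shown. A reconstruction following the radix3 pattern, bundled with slide 9's radixSort so it runs standalone, might look like this; it is a sketch, not the authors' exact code:

```java
import java.util.Arrays;

public class Radix2Demo {
    // radixSort as on slide 9: one stable counting-sort pass over a
    // on the digit of width maskLen, shifted shift bits up.
    static void radixSort(int[] a, int[] b, int left, int right,
                          int maskLen, int shift) {
        int acumVal = 0, j, n = right - left + 1;
        int mask = (1 << maskLen) - 1;
        int[] count = new int[mask + 1];
        for (int i = left; i <= right; i++)      // a) count digit frequencies
            count[(a[i] >> shift) & mask]++;
        for (int i = 0; i <= mask; i++) {        // b) accumulate into offsets
            j = count[i];
            count[i] = acumVal;
            acumVal += j;
        }
        for (int i = 0; i < n; i++)              // c) move a to b in digit order
            b[count[(a[i + left] >> shift) & mask]++] = a[i + left];
        for (int i = 0; i < n; i++)              // d) copy b back to a
            a[i + left] = b[i];
    }

    static void radix2(int[] a, int left, int right) {
        // 2 digit radixSort: a[left..right] (reconstructed by analogy)
        int max = 0, numBit = 2, n = right - left + 1;
        for (int i = left; i <= right; i++)
            if (a[i] > max) max = a[i];
        while (max >= (1 << numBit)) numBit++;
        int bit1 = numBit / 2, bit2 = numBit - bit1;
        int[] b = new int[n];
        radixSort(a, b, left, right, bit1, 0);
        radixSort(a, b, left, right, bit2, bit1);
    }

    public static void main(String[] args) {
        int[] a = {170, 45, 75, 90, 802, 24, 2, 66};
        radix2(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a));
        // prints [2, 24, 45, 66, 75, 90, 170, 802]
    }
}
```

Because each radixSort pass is stable, sorting first on the low bit1 bits and then on the high bit2 bits yields a fully sorted array, exactly as in the three-pass radix3.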

11 Random/sequential test (AMD Opteron): radix1, 2 and 3 compared with Quicksort and Flashsort.
[Figure annotations: radix1 slowed down by a factor of 7; radix3, no slowdown; radix2, slowdown just started]

12 Random/sequential test (Intel Xeon): radix1, 2 and 3 compared with Quicksort and Flashsort.

13 The model: careful counting of loops in radix1, 2 and 3
Let E_k denote the number of the different operations for a k-pass radix algorithm (k = 1, 2, ...), S a sequential read or write, and R_k a random read or write into m different places in an array.
[The expressions for E_k and their simplifications appeared as formulas on the slide; only the resulting ratios survive, in the table on the next slide.]

14 Model vs. test results (Opteron and Xeon)

Model     | n small: n < L1, R_1 = R_2 = R_3 = S | n large: n >> L2, R_1 = 10R_2 = 50R_3 = 50S
E_2/E_1   | 21/13 = 1.61                         | (15 + 6*10)/(10*1 + 3*50) = 0.46
E_3/E_1   | 31/13 = 2.38                         | 31/(10*1 + 3*50) = 0.19

Test - Opteron | n = 100 | n = 52m
R2/R1          | 1.44    | 0.31
R3/R1          | 1.77    | 0.25

Test - Xeon    | n = 100 | n = 52m
R2/R1          | 1.28    | 0.43
R3/R1          | 1.74    | 0.18

15 Conclusions
- The effects of cache misses are real and show up in ordinary user algorithms when doing random access in large arrays.
- We have demonstrated that radix3, which performs almost 3 times as many instructions as radix1, is 4-5 times as fast as radix1 for large n; i.e. radix1 experiences a slowdown by a factor of 7-10 because of cache misses.
1. The number of instructions executed is no longer a good measure of the performance of an algorithm.
2. Algorithms should be rewritten so that random access in large data structures is removed.

