CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop.

CS420 lecture six Loops

Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop body takes constant time 2. loop body is executed times

Convex hull Given a set of points in 2D ((x,y) coordinates), find the smallest convex polygon surrounding them all.

Convex hull Given a set of points in 2D ((x,y) coordinates), find the smallest polygon surrounding them all. The problem reduces to finding line segments connecting points of the set.

Convex hull

Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line.

Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line. How do you find out all other points are on the same side?

Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line. for i = 1 to n for j= i+1 to n for k = 1 to n if (k!=i&&k!=j) check(p i,p j,p k )

Convex hull: first attempt for i = 1 to n for j= i+1 to n for k = 1 to n if (k!=i&&k!=j) check(p i,p j,p k ) check is O(1) so this algorithm is O(n 3 )

the question that drives us.....

is there a better algorithm Find lowest point P1 Sort remaining points by angle they form with P1 and the horizontal, resulting in a sequence P2…Pn Start with P1-P2 in current hull for i from 3 to n add Pi in current hull for j from i-1 downto 3 eliminate Pj if P1 and Pi are on different side of line Pj-P(j-1); if Pj stays break

is there a better algorithm Find lowest point P1 Sort remaining points by angle they form with P1 and the horizontal, resulting in a sequence P2…Pn Start with P1-P2 in current hull for i from 3 to n add Pi in current hull for j from i-1 downto 3 eliminate Pj if P1 and Pi are on different side of line Pj-P(j-1); if Pj stays break 1 3 2 4

Complexity? find lowest: O(n) sort O(nlgn) nested add/eliminate loop outer: i from 3 to n inner: j from i-1 downto 3 O(?)

nested add/eliminate loop O(N) !! why? n-2 points considered in i loop j loop either eliminates a point, ie it will not be checked again, or stops. The total number of points considered in all j loop iterations is therefore O(n) Complex hull algorithm complexity O(n lg n)

is there a better algorithm? no

is there a better algorithm? no, argument is harder (lower bound arguments usually are) it can be shown that sorting can be reduced to convex hull (reduced: translated such that when the convex hull problem is solved the original sorting problem is solved) and we have shown that sorting is Ω(n lg n)

sort({3, 1, 2})  convex hull({(3,9), (2,4), (1,1)}) 3,9 2,4 1,1 reduction: x  x,x 2

Sub-O Optimizations Suppose you have written an asymptotically optimal program, and still want to speed it up. Using a profiler identify which parts of your code are the hotspots of your program. 10/90 rule of thumb: 90% of the time is spent in 10% of the code: hotspots – Usually some of the innermost loops – Only improve the hotspots. Leave the rest clear and simple.

Data reorganization Create sentinel (value at boundary) to simplify loop control. found = false; i=0; while (i<n and not found) if (x[i]==T) found = true; else i++;

Data reorganization Create sentinel to simplify loop control. found = false; i=0; while (i<n and not found) if (x[i]==T) found = true; else i++; Sentinel: value at boundary x[n]=T; i=0; while (x[i]!=T)i++; found = (i<n);

Loop unrolling Loop unrolling is textually repeating the loop body so that the loop control is executed fewer times – Eg, a median filter operator on an image executes a 3x3 inner loop for each resulting pixel; this can be fully unrolled – some compilers (eg CUDA) allow unroll k pragmas – in a linked list, if the last element points at itself, visiting the elements can be partially unrolled

Loop peeling When the body of a loop tests whether it is on a boundary, and has a special case for that boundary, it is often advantageous to have separate code for the boundary avoiding the conditional in the loop body. Eg, median filter

Loop unrolling and trivial assignments fibonacci(n) a=b=c =1; // what happens if the loop gets unrolled once? for i = 3 to n { c=a+b, a=b; b=c } return c;

Loop unrolling and trivial assignments fibonacci(n) a=b=c =1; for i = 3 to n { c=a+b, a=b; b=c } return c; fibonacci(n) a=b=1; for i = 1 to (n/2 -1) {a=a+b; b=a+b} if odd(n) b = a+b; return b;

Memory hierarchy (cache) issues Processor are an order of magnitude faster than memories – both have been speeding up exponentially for ~30 years: but with different bases, so their ratio has been growing exponentially as well – caches keep recently used (temporal locality) and fetch in cache lines (spatial locality)

cache issues memory wall getting over it: cache cache line cache replacement policy: LRU cache and memory layout of 1D representation of 2D arrays in C – row access – col access

Data or loop reordering for improve cache performance Matrix multiply: for i = 1 to n for j= 1 to n C[i,j]=0 for k = 1 to n C[i,j]+=A[i,k]*B[k,j]

Data or loop reordering for improve cache performance Matrix multiply: for i = 1 to n for j= 1 to n C[i,j]=0 for k = 1 to n C[i,j]+=A[i,k]*B[k,j] B is accessed in column order. If the arrays are (as in C) stored in row major order, this causes cache misses and unnecessary reads!!

Data or loop reordering for improve cache performance Matrix multiply: for i = 1 to n for j= 1 to n C[i,j]=0 for k = 1 to n C[i,j]+=A[i,k]*B[k,j] While one row of A is read, all of B is read If the cache cannot keep all of B and uses the Least Recently Used replace policy, all reads of B will cause a cache miss

Tiling for improved cache behavior Instead of reading a whole row of A and doing n whole row A column B inner products we can read a block of A and compute smaller inner products with sub columns of B. (Remember blocked matrix multiply in Strassen) These partial products are then added up.

Conventional matrix multiply

etc......

Conventional matrix multiply All elements of B are used once, while all of row A[i] are used n times. A[i] may fit in the cache, B will probably not!

Tiled matrix multiply

Reuse of tile of B A k x k tile of A (which can fit in the cache) block multiplies with a k x k tile of B (which can fit in the cache) and thus reuses the B tile k times, potentially providing better cache use We can parameterize our program with k and experiment Data and loop reordering matrix multiply: assignment 2

Experiments you can do Transpose B for better cache line behavior Tile the loop as in the example In array access A[i*N+j] avoid the multiply by doing pointer increments and dereferences You will have a number of versions of your code. Make a 2D table of results. Then make observations about your results. In a follow up discussion, exchange your experiences

Tiling Loops become nested loops – outer loop visits tile origins – inner loops visit the tile points

Can every loop be tiled??? Tile this for i=1 to n for j=1 to n if (i==1 && j==1) A[i,j]=1 elif j==1 A[i,j]=A[i-1,n] else A[i,j]=A[i,j-1]

Tiling cont' Tiling is loop reordering. The reordered loop must obey the data dependences in the original loop. Let iteration i',j',k' occur before iteration i,j,k – true dependence: i,j,k uses a value that i',j',k produced – anti dependence: i,j,k redefines a value that i',j',k' used – output dependence: i,j,k redefines a value that i',j',k' defined in all these cases the reordered loop must obey the original ordering of the two iterations.

CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop.

Similar presentations

Presentation on theme: "CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop.

Similar presentations

Presentation on theme: "CS420 lecture six Loops. Time Analysis of loops Often easy: eg bubble sort for i in 1..(n-1) for j in 1..(n-i) if (A[j] > A[j+1) swap(A,j,j+1) 1. loop."— Presentation transcript:

Similar presentations

About project

Feedback