
CS420 lecture six Loops

Time Analysis of loops Often easy: e.g. bubble sort

for i in 1..(n-1)
  for j in 1..(n-i)
    if (A[j] > A[j+1]) swap(A, j, j+1)

1. the loop body takes constant time
2. the loop body is executed (n-1) + (n-2) + … + 1 = n(n-1)/2 times, so bubble sort is O(n²)
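As a runnable reference, here is the same sort in C (a straightforward transcription; the only changes are 0-based indexing and the illustrative names):

#include <stddef.h>

/* Bubble sort: the loop body is O(1) and executes
 * (n-1) + (n-2) + ... + 1 = n(n-1)/2 times, i.e. O(n²). */
static void bubble_sort(int *A, size_t n)
{
    for (size_t i = 0; i + 1 < n; i++)          /* passes 1 .. n-1 */
        for (size_t j = 0; j + 1 < n - i; j++)  /* compare neighbors */
            if (A[j] > A[j + 1]) {              /* out of order: swap */
                int t = A[j];
                A[j] = A[j + 1];
                A[j + 1] = t;
            }
}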

Convex hull Given a set of points in 2D ((x,y) coordinates), find the smallest convex polygon surrounding them all.

Convex hull Given a set of points in 2D ((x,y) coordinates), find the smallest convex polygon surrounding them all. The problem reduces to finding the line segments connecting points of the set that form the polygon's boundary.

Convex hull

Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line.

Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line. How do you find out whether all other points are on the same side?

Convex hull: first attempt Let L be a line segment connecting two points in the set. For L to be in the convex hull it is sufficient that all other points are on the same side of L’s extension to a full line.

for i = 1 to n
  for j = i+1 to n
    for k = 1 to n
      if (k != i && k != j) check(p_i, p_j, p_k)

Convex hull: first attempt

for i = 1 to n
  for j = i+1 to n
    for k = 1 to n
      if (k != i && k != j) check(p_i, p_j, p_k)

check is O(1), so this algorithm is O(n³)
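The slides do not show check, but a common way to implement it is with the sign of a cross product; a minimal sketch, assuming the point type and function names below (they are illustrative, not from the lecture):

#include <stdbool.h>

typedef struct { double x, y; } Point;

/* Sign of the cross product (pj - pi) x (pk - pi):
 * > 0: pk is left of the directed line pi->pj,
 * < 0: right, == 0: collinear. */
static double side(Point pi, Point pj, Point pk)
{
    return (pj.x - pi.x) * (pk.y - pi.y)
         - (pj.y - pi.y) * (pk.x - pi.x);
}

/* Segment (i,j) is a hull edge iff no two other points fall on
 * opposite sides of its extension (collinear cases ignored for
 * brevity). O(n) per pair, O(n³) over all pairs. */
static bool on_hull(const Point *p, int n, int i, int j)
{
    bool left = false, right = false;
    for (int k = 0; k < n; k++) {
        if (k == i || k == j) continue;
        double s = side(p[i], p[j], p[k]);
        if (s > 0) left  = true;
        if (s < 0) right = true;
    }
    return !(left && right);
}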

the question that drives us.....

is there a better algorithm? Yes (this is the Graham scan): Find the lowest point P1. Sort the remaining points by the angle they form with P1 and the horizontal, resulting in a sequence P2…Pn. Start with P1-P2 in the current hull.

for i from 3 to n
  add Pi to the current hull
  for j from i-1 downto 3
    eliminate Pj if P1 and Pi are on different sides of the line through Pj and P(j-1); if Pj stays, break
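The add/eliminate loop is usually written as a stack; a sketch in C, assuming the points have already been angle-sorted around the lowest point (the cross-product left-turn test below is the standard equivalent of the slide's "different sides" test):

typedef struct { double x, y; } Point;

static double cross(Point o, Point a, Point b)
{
    return (a.x - o.x) * (b.y - o.y) - (a.y - o.y) * (b.x - o.x);
}

/* p[0] is the lowest point; p[1..n-1] are sorted by angle around
 * p[0]. Writes the hull vertices to hull[] and returns their count. */
static int scan(const Point *p, int n, Point *hull)
{
    int top = 0;
    for (int i = 0; i < n; i++) {
        /* eliminate: pop points that would make a non-left turn */
        while (top >= 2 && cross(hull[top - 2], hull[top - 1], p[i]) <= 0)
            top--;
        hull[top++] = p[i];   /* add p[i] to the current hull */
    }
    return top;
}

Each point is pushed once and popped at most once, which is exactly the amortized O(n) argument below.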

Complexity? find lowest: O(n). sort: O(n lg n). nested add/eliminate loop (outer: i from 3 to n; inner: j from i-1 downto 3): O(?)

nested add/eliminate loop O(n) !! why? n-2 points are considered in the i loop. The j loop either eliminates a point, i.e. it will never be checked again, or stops. The total number of points considered over all j loop iterations is therefore O(n). Convex hull algorithm complexity: O(n lg n)

is there a better algorithm? no

is there a better algorithm? no, but the argument is harder (lower bound arguments usually are): it can be shown that sorting can be reduced to convex hull (reduced: translated such that when the convex hull problem is solved, the original sorting problem is solved), and we have shown that sorting is Ω(n lg n)

sort({3, 1, 2}) → convex hull({(3,9), (2,4), (1,1)})

reduction: x → (x, x²)
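In code the reduction is one pass; a minimal sketch (names are illustrative):

typedef struct { double x, y; } Point;

/* Reduction from sorting to convex hull: lift each key x onto the
 * parabola (x, x²). Every lifted point is a hull vertex, and the
 * hull visits them in increasing x, i.e. in sorted order. */
static void lift(const double *keys, int n, Point *pts)
{
    for (int i = 0; i < n; i++) {
        pts[i].x = keys[i];
        pts[i].y = keys[i] * keys[i];
    }
}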

Sub-O Optimizations Suppose you have written an asymptotically optimal program and still want to speed it up. Using a profiler, identify which parts of your code are the hotspots of your program. 10/90 rule of thumb: 90% of the time is spent in 10% of the code: the hotspots – usually some of the innermost loops – improve only the hotspots; leave the rest clear and simple.

Data reorganization Create a sentinel (a value at the boundary) to simplify loop control.

found = false; i = 0;
while (i < n and not found)
  if (x[i] == T) found = true;
  else i++;

With a sentinel planted at the boundary:

x[n] = T; i = 0;
while (x[i] != T) i++;
found = (i < n);
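The same two searches in C (assuming x[] has one writable slot of scratch space at index n for the sentinel):

#include <stdbool.h>

/* Plain search: two tests per iteration (bound and match). */
static bool find_plain(const int *x, int n, int T)
{
    int i = 0;
    while (i < n && x[i] != T)
        i++;
    return i < n;
}

/* Sentinel search: planting T at the boundary guarantees the scan
 * stops, so the loop needs only one test per iteration. */
static bool find_sentinel(int *x, int n, int T)
{
    x[n] = T;                  /* sentinel at the boundary */
    int i = 0;
    while (x[i] != T)
        i++;
    return i < n;              /* real hit, or just the sentinel? */
}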

Loop unrolling Loop unrolling is textually repeating the loop body so that the loop control is executed fewer times – e.g., a median filter operator on an image executes a 3x3 inner loop for each resulting pixel; this can be fully unrolled – some compilers (e.g. CUDA's) accept "unroll k" pragmas – in a linked list, if the last element points at itself, visiting the elements can be partially unrolled (the self-pointing tail acts as a sentinel, so the unrolled copies cannot run off the end of the list)
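A minimal sketch of manual unrolling, using an array sum as the stand-in loop (the function is illustrative, not from the lecture):

/* Sum with the body unrolled four times: the loop test and
 * increment execute roughly n/4 times instead of n times. */
static long sum_unrolled(const int *a, int n)
{
    long s = 0;
    int i = 0;
    for (; i + 3 < n; i += 4) {   /* unrolled by 4 */
        s += a[i];
        s += a[i + 1];
        s += a[i + 2];
        s += a[i + 3];
    }
    for (; i < n; i++)            /* leftover 0..3 iterations */
        s += a[i];
    return s;
}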

Loop peeling When the body of a loop tests whether it is on a boundary, and has a special case for that boundary, it is often advantageous to have separate code for the boundary, avoiding the conditional in the loop body. E.g., the median filter above: handle the border pixels separately and run the interior loop without the test.
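A 1D stand-in for the median filter makes the idea concrete (a 3-point smoothing filter, illustrative only):

/* Before peeling, every iteration would test for the boundary:
 *   out[i] = (i == 0 || i == n-1) ? in[i]
 *          : (in[i-1] + in[i] + in[i+1]) / 3;
 * Peeling the first and last iterations leaves a branch-free
 * interior loop. */
static void smooth(const int *in, int *out, int n)
{
    if (n == 0) return;
    out[0] = in[0];                      /* peeled first iteration */
    for (int i = 1; i + 1 < n; i++)      /* branch-free interior */
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3;
    if (n > 1)
        out[n - 1] = in[n - 1];          /* peeled last iteration */
}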

Loop unrolling and trivial assignments

fibonacci(n)
  a = b = c = 1;  // what happens if the loop gets unrolled once?
  for i = 3 to n { c = a+b; a = b; b = c }
  return c;

Loop unrolling and trivial assignments

fibonacci(n)
  a = b = c = 1;
  for i = 3 to n { c = a+b; a = b; b = c }
  return c;

fibonacci(n)  // unrolled once: the trivial copies a = b; b = c disappear
  a = b = 1;
  for i = 1 to (n/2 - 1) { a = a+b; b = a+b }
  if odd(n) b = a+b;
  return b;
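Both variants transcribed into C (the unrolled form matches the original for n >= 2, just like the pseudocode above):

/* Straight transcription of the slide's loop. */
static unsigned long fib(int n)
{
    unsigned long a = 1, b = 1, c = 1;
    for (int i = 3; i <= n; i++) {
        c = a + b;   /* the copies a = b; b = c are pure shuffling */
        a = b;
        b = c;
    }
    return c;
}

/* Unrolled once: two real additions per iteration, no copies,
 * plus a fix-up step when n is odd. */
static unsigned long fib_unrolled(int n)
{
    unsigned long a = 1, b = 1;
    for (int i = 1; i <= n / 2 - 1; i++) {
        a = a + b;
        b = a + b;
    }
    if (n % 2 == 1)
        b = a + b;
    return b;
}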

Memory hierarchy (cache) issues Processors are an order of magnitude faster than memories – both have been speeding up exponentially for ~30 years, but with different bases, so their ratio has been growing exponentially as well – caches keep recently used data (temporal locality) and fetch whole cache lines at a time (spatial locality)

cache issues – the memory wall – getting over it: caches – cache lines – cache replacement policy: LRU – cache behavior and the memory layout of the 1D representation of 2D arrays in C – row access – column access

Data or loop reordering for improved cache performance Matrix multiply:

for i = 1 to n
  for j = 1 to n
    C[i,j] = 0
    for k = 1 to n
      C[i,j] += A[i,k]*B[k,j]

Data or loop reordering for improved cache performance Matrix multiply:

for i = 1 to n
  for j = 1 to n
    C[i,j] = 0
    for k = 1 to n
      C[i,j] += A[i,k]*B[k,j]

B is accessed in column order. If the arrays are (as in C) stored in row-major order, this causes cache misses and unnecessary reads!

Data or loop reordering for improved cache performance Matrix multiply:

for i = 1 to n
  for j = 1 to n
    C[i,j] = 0
    for k = 1 to n
      C[i,j] += A[i,k]*B[k,j]

While one row of A is read, all of B is read. If the cache cannot hold all of B and uses the Least Recently Used (LRU) replacement policy, every read of B will cause a cache miss.
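One standard remedy (loop interchange, not shown on the slide) moves k to the middle: in i-k-j order the inner loop streams through B[k][*] and C[i][*] with unit stride, so every fetched cache line is fully used. A sketch with the arrays stored 1D in row-major order:

/* ikj matrix multiply: all inner-loop accesses are stride-1. */
static void matmul_ikj(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;

    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];     /* invariant in the j loop */
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
}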

Tiling for improved cache behavior Instead of reading a whole row of A and doing n whole-row-of-A by whole-column-of-B inner products, we can read a block of A and compute smaller inner products with sub-columns of B. (Remember blocked matrix multiply from Strassen.) These partial products are then added up.

Conventional matrix multiply


Conventional matrix multiply For each i, all elements of B are used once, while row A[i] is used n times. A[i] may fit in the cache; B probably will not!

Tiled matrix multiply

Reuse of a tile of B A k x k tile of A (which can fit in the cache) block-multiplies with a k x k tile of B (which can also fit in the cache) and thus reuses the B tile k times, potentially giving better cache use. We can parameterize our program with k and experiment. Data and loop reordering for matrix multiply: assignment 2
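A sketch of the tiled multiply with the tile size K as the tunable parameter (K is illustrative and would be chosen by experiment; n is assumed to be a multiple of K for brevity):

#define K 32   /* tile size: pick so a few KxK tiles fit in cache */

/* Tiled multiply: each (ii,kk,jj) step multiplies a KxK tile of A
 * by a KxK tile of B, reusing the B tile K times while it is hot. */
static void matmul_tiled(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n * n; i++)
        C[i] = 0.0;

    for (int ii = 0; ii < n; ii += K)               /* tile origins */
        for (int kk = 0; kk < n; kk += K)
            for (int jj = 0; jj < n; jj += K)
                for (int i = ii; i < ii + K; i++)   /* points in tile */
                    for (int k = kk; k < kk + K; k++) {
                        double a = A[i * n + k];
                        for (int j = jj; j < jj + K; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}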

Experiments you can do – Transpose B for better cache line behavior – Tile the loop as in the example – In the array access A[i*N+j], avoid the multiply by doing pointer increments and dereferences You will have a number of versions of your code. Make a 2D table of results, then make observations about your results. In a follow-up discussion, exchange your experiences.

Tiling Loops become nested loops – outer loop visits tile origins – inner loops visit the tile points

Can every loop be tiled??? Tile this:

for i = 1 to n
  for j = 1 to n
    if (i==1 && j==1) A[i,j] = 1
    elif (j==1) A[i,j] = A[i-1,n]
    else A[i,j] = A[i,j-1]

You can't: each iteration reads the value written by the immediately preceding iteration in row-major order, so the whole nest is one long dependence chain and no reordering (hence no tiling) is legal.

Tiling cont'd Tiling is loop reordering. The reordered loop must obey the data dependences of the original loop. Let iteration i',j',k' occur before iteration i,j,k – true dependence: i,j,k uses a value that i',j',k' produced – anti dependence: i,j,k redefines a value that i',j',k' used – output dependence: i,j,k redefines a value that i',j',k' defined In all these cases the reordered loop must preserve the original ordering of the two iterations.
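Tiny C examples of the three kinds, with iteration i-1 (or i+1) playing the role of i',j',k':

/* a[], b[] have length n; s is a scalar. Illustrative only. */
void dependences(double *a, const double *b, double *s, int n)
{
    /* true dependence: iteration i reads what i-1 produced */
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0;

    /* anti dependence: iteration i+1 redefines a[i+1],
     * which the earlier iteration i read */
    for (int i = 0; i < n - 1; i++)
        a[i] = a[i + 1] * 2.0;

    /* output dependence: every iteration redefines s;
     * the last write must win, so order matters */
    for (int i = 0; i < n; i++)
        *s = b[i];
}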