Recursion Unrolling for Divide and Conquer Programs Radu Rugina and Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology.


Recursion Unrolling for Divide and Conquer Programs Radu Rugina and Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology

What This Talk Is About Automatic generation of efficient large base cases for divide and conquer programs

Outline
1. Motivating Example
2. Computation Structure
3. Transformations
4. Related Work
5. Conclusion

1. Motivating Example

Divide and Conquer Matrix Multiply
Divide the matrices into sub-matrices A0, A1, A2, A3 and B0, B1, B2, B3, and use the blocked matrix multiply equations for R = A × B:
    R0 = A0×B0 + A1×B2    R1 = A0×B1 + A1×B3
    R2 = A2×B0 + A3×B2    R3 = A2×B1 + A3×B3

Divide and Conquer Matrix Multiply
Recursively multiply the sub-matrices: each quadrant of R = A × B is computed by recursive calls on the corresponding sub-matrices of A and B.

Divide and Conquer Matrix Multiply
Terminate the recursion with a simple base case: the product of single elements, r0 += a0 × b0.

Divide and Conquer Matrix Multiply

Implements R += A × B:

    void matmul(int *A, int *B, int *R, int n) {
        if (n == 1) {
            (*R) += (*A) * (*B);
        } else {
            matmul(A, B, R, n/4);
            matmul(A, B+(n/4), R+(n/4), n/4);
            matmul(A+2*(n/4), B, R+2*(n/4), n/4);
            matmul(A+2*(n/4), B+(n/4), R+3*(n/4), n/4);
            matmul(A+(n/4), B+2*(n/4), R, n/4);
            matmul(A+(n/4), B+3*(n/4), R+(n/4), n/4);
            matmul(A+3*(n/4), B+2*(n/4), R+2*(n/4), n/4);
            matmul(A+3*(n/4), B+3*(n/4), R+3*(n/4), n/4);
        }
    }

Divide and Conquer Matrix Multiply
Divide the matrices into sub-matrices and recursively multiply the sub-matrices (code as above).

Divide and Conquer Matrix Multiply
Identify the sub-matrices with pointers (the A, B, and R arguments of the recursive calls).

Divide and Conquer Matrix Multiply
Use a simple algorithm for the base case (the single-element update when n == 1).

Divide and Conquer Matrix Multiply
Advantage of the small base case: simplicity. The code is easy to write, maintain, debug, and understand.

Divide and Conquer Matrix Multiply
Disadvantage: inefficiency. The control flow overhead is large: most of the time is spent dividing the matrix into sub-matrices rather than multiplying elements.

Hand Coded Implementation void serialmul(block *As, block *Bs, block *Rs) { int i, j; DOUBLE *A = (DOUBLE *) As; DOUBLE *B = (DOUBLE *) Bs; DOUBLE *R = (DOUBLE *) Rs; for (j = 0; j < 16; j += 2) { DOUBLE *bp = &B[j]; for (i = 0; i < 16; i += 2) { DOUBLE *ap = &A[i * 16]; DOUBLE *rp = &R[j + i * 16]; register DOUBLE s0_0 = rp[0], s0_1 = rp[1]; register DOUBLE s1_0 = rp[16], s1_1 = rp[17]; s0_0 += ap[0] * bp[0]; s0_1 += ap[0] * bp[1]; s1_0 += ap[16] * bp[0]; s1_1 += ap[16] * bp[1]; s0_0 += ap[1] * bp[16]; s0_1 += ap[1] * bp[17]; s1_0 += ap[17] * bp[16]; s1_1 += ap[17] * bp[17]; s0_0 += ap[2] * bp[32]; s0_1 += ap[2] * bp[33]; s1_0 += ap[18] * bp[32]; s1_1 += ap[18] * bp[33]; s0_0 += ap[3] * bp[48]; s0_1 += ap[3] * bp[49]; s1_0 += ap[19] * bp[48]; s1_1 += ap[19] * bp[49]; s0_0 += ap[4] * bp[64]; s0_1 += ap[4] * bp[65]; s1_0 += ap[20] * bp[64]; s1_1 += ap[20] * bp[65]; s0_0 += ap[5] * bp[80]; s0_1 += ap[5] * bp[81]; s1_0 += ap[21] * bp[80]; s1_1 += ap[21] * bp[81]; s0_0 += ap[6] * bp[96]; s0_1 += ap[6] * bp[97]; s1_0 += ap[22] * bp[96]; s1_1 += ap[22] * bp[97]; s0_0 += ap[7] * bp[112]; s0_1 += ap[7] * bp[113]; s1_0 += ap[23] * bp[112]; s1_1 += ap[23] * bp[113]; s0_0 += ap[8] * bp[128]; s0_1 += ap[8] * bp[129]; s1_0 += ap[24] * bp[128]; s1_1 += ap[24] * bp[129]; s0_0 += ap[9] * bp[144]; s0_1 += ap[9] * bp[145]; s1_0 += ap[25] * bp[144]; s1_1 += ap[25] * bp[145]; s0_0 += ap[10] * bp[160]; s0_1 += ap[10] * bp[161]; s1_0 += ap[26] * bp[160]; s1_1 += ap[26] * bp[161]; s0_0 += ap[11] * bp[176]; s0_1 += ap[11] * bp[177]; s1_0 += ap[27] * bp[176]; s1_1 += ap[27] * bp[177]; s0_0 += ap[12] * bp[192]; s0_1 += ap[12] * bp[193]; s1_0 += ap[28] * bp[192]; s1_1 += ap[28] * bp[193]; s0_0 += ap[13] * bp[208]; s0_1 += ap[13] * bp[209]; s1_0 += ap[29] * bp[208]; s1_1 += ap[29] * bp[209]; s0_0 += ap[14] * bp[224]; s0_1 += ap[14] * bp[225]; s1_0 += ap[30] * bp[224]; s1_1 += ap[30] * bp[225]; s0_0 += ap[15] * bp[240]; s0_1 += ap[15] * bp[241]; s1_0 += ap[31] * bp[240]; s1_1 += ap[31] * 
bp[241]; rp[0] = s0_0; rp[1] = s0_1; rp[16] = s1_0; rp[17] = s1_1; } cilk void matrixmul(long nb, block *A, block *B, block *R) { if (nb == 1) { flops = serialmul(A, B, R); } else if (nb >= 4) { spawn matrixmul(nb/4, A, B, R); spawn matrixmul(nb/4, A, B+(nb/4), R+(nb/4)); spawn matrixmul(nb/4, A+2*(nb/4), B+(nb/4), R+2*(nb/4)); spawn matrixmul(nb/4, A+2*(nb/4), B, R+3*(nb/4)); sync; spawn matrixmul(nb/4, A+(nb/4), B+2*(nb/4), R); spawn matrixmul(nb/4, A+(nb/4), B+3*(nb/4), R+(nb/4)); spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+2*(nb/4)); spawn matrixmul(nb/4, A+3*(nb/4), B+3*(nb/4), R+3*(nb/4)); sync; }

Goal
The programmer writes simple code with small base cases.
The compiler automatically generates efficient code with large base cases.

2. Computation Structure

Running Example – Array Increment

    void f(char *p, int n) {
        if (n == 1) {
            /* base case: increment one element */
            (*p) += 1;
        } else {
            f(p, n/2);       /* increment first half */
            f(p+n/2, n/2);   /* increment second half */
        }
    }

Dynamic Call Tree for n=4 Execution of f(p,4)

Dynamic Call Tree for n=4
[Figure: the dynamic call tree for the execution of f(p,4). The root (n=4) and its two children (n=2) each execute a test n=1 and two calls to f; the four leaves (n=1) each execute a test n=1 and an increment of *p. Each node corresponds to an activation frame on the stack and to the instructions executed by that call.]

Control Flow Overhead and Computation
[Figure: the same call tree for f(p,4), annotated.] The Call f instructions are call overhead and the Test n=1 instructions are test overhead; only the Inc *p instructions at the leaves perform the actual computation.

Large Base Cases = Reduced Overhead
[Figure: the call tree for f(p,4) with a base case for n=2. The root tests n=2 and calls f twice; each of the two leaves tests n=2 and executes Inc *p and Inc *(p+1).] The same computation is performed with far fewer calls and tests.

3. Transformations

Transformation 1: Recursion Inlining
Start with the original recursive procedure:

    void f(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f(p, n/2);
            f(p+n/2, n/2);
        }
    }

Transformation 1: Recursion Inlining
Make two copies of the original procedure:

    void f1(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f1(p, n/2);
            f1(p+n/2, n/2);
        }
    }

    void f2(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f2(p, n/2);
            f2(p+n/2, n/2);
        }
    }

Transformation 1: Recursion Inlining
Transform direct recursion to mutual recursion:

    void f1(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f2(p, n/2);
            f2(p+n/2, n/2);
        }
    }

    void f2(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            f1(p, n/2);
            f1(p+n/2, n/2);
        }
    }

Transformation 1: Recursion Inlining
Inline procedure f2 at its call sites in f1 (in the mutually recursive code above).

Transformation 1: Recursion Inlining

    void f1(char *p, int n) {
        if (n == 1) {
            (*p) += 1;
        } else {
            if (n/2 == 1) {
                *p += 1;
            } else {
                f1(p, n/2/2);
                f1(p+n/2/2, n/2/2);
            }
            if (n/2 == 1) {
                *(p+n/2) += 1;
            } else {
                f1(p+n/2, n/2/2);
                f1(p+n/2+n/4, n/2/2);
            }
        }
    }

Transformation 1: Recursion Inlining
The inlined code (above) has reduced procedure call overhead, more code exposed at the intra-procedural level, and opportunities to simplify control flow in the inlined code.

Transformation 1: Recursion Inlining
In particular, the two inner if statements have identical condition expressions (n/2 == 1), an opportunity to simplify the control flow.

Transformation 2: Conditional Fusion
Merge if statements with identical conditions:

    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else {
            f1(p, n/2/2);
            f1(p+n/2/2, n/2/2);
            f1(p+n/2, n/2/2);
            f1(p+n/2+n/4, n/2/2);
        }
    }

Transformation 2: Conditional Fusion
Merging if statements with identical conditions yields reduced branching overhead, bigger basic blocks, and a larger base case for n/2 == 1.

Unrolling Iterations
Repeatedly apply inlining and conditional fusion to the unrolled procedure (above).

Second Unrolling Iteration

    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else {
            f1(p, n/2/2);
            f1(p+n/2/2, n/2/2);
            f1(p+n/2, n/2/2);
            f1(p+n/2+n/4, n/2/2);
        }
    }

    void f2(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else {
            f2(p, n/2);
            f2(p+n/2, n/2);
        }
    }

Second Unrolling Iteration

    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else {
            f2(p, n/2/2);
            f2(p+n/2/2, n/2/2);
            f2(p+n/2, n/2/2);
            f2(p+n/2+n/4, n/2/2);
        }
    }

    void f2(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else {
            f1(p, n/2);
            f1(p+n/2, n/2);
        }
    }

Result of Second Unrolling Iteration

    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else if (n/2/2 == 1) {
            *p += 1;
            *(p+n/2/2) += 1;
            *(p+n/2) += 1;
            *(p+n/2+n/2/2) += 1;
        } else {
            f1(p, n/2/2/2);
            f1(p+n/2/2/2, n/2/2/2);
            f1(p+n/2/2, n/2/2/2);
            f1(p+n/2/2+n/2/2/2, n/2/2/2);
            f1(p+n/2, n/2/2/2);
            f1(p+n/2+n/2/2/2, n/2/2/2);
            f1(p+n/2+n/2/2, n/2/2/2);
            f1(p+n/2+n/2/2+n/2/2/2, n/2/2/2);
        }
    }

Unrolling Iterations
The unrolling process stops when the number of iterations reaches the desired unrolling factor. The unrolled recursive procedure has base cases for larger problem sizes and divides the given problem into more sub-problems of smaller sizes. In our example, it has base cases for n=1, n=2, and n=4, and problems are divided into 8 sub-problems of 1/8 the size.

Speedup for Matrix Multiply Matrix of 512 x 512 elements


Speedup for Matrix Multiply Matrix of 1024 x 1024 elements

Efficiency of Unrolled Recursive Part
Because the recursive part is also unrolled, the recursion may not exercise the large base cases: which base case is executed depends on the size of the input problem. In our example, for a problem of size n=8 the base case for n=1 is executed, and for a problem of size n=16 the base case for n=2 is executed; the efficient base case for n=4 is not executed in either case.

Solution: Recursion Re-Rolling Roll back the recursive part of the unrolled procedure after the large base cases are generated Re-Rolling ensures that larger base cases are always executed, independent of the input problem size The compiler unrolls the recursive part only temporarily, to generate the base cases

Transformation 3: Recursion Re-Rolling
Start from the result of the second unrolling iteration (the code above, with base cases for n=1, n=2, and n=4).

Transformation 3: Recursion Re-Rolling
Identify the recursive part: the final else block containing the eight recursive calls with problem size n/2/2/2.

Transformation 3: Recursion Re-Rolling
Replace the recursive part with the recursive part of the original procedure: the two calls f1(p, n/2) and f1(p+n/2, n/2).

Final Result

    void f1(char *p, int n) {
        if (n == 1) {
            *p += 1;
        } else if (n/2 == 1) {
            *p += 1;
            *(p+n/2) += 1;
        } else if (n/2/2 == 1) {
            *p += 1;
            *(p+n/2/2) += 1;
            *(p+n/2) += 1;
            *(p+n/2+n/2/2) += 1;
        } else {
            f1(p, n/2);
            f1(p+n/2, n/2);
        }
    }

Speedup for Matrix Multiply Matrix of 512 x 512 elements

Speedup for Matrix Multiply Matrix of 1024 x 1024 elements

Other Optimizations
Inlining moves code from the inter-procedural level to the intra-procedural level. Conditional fusion brings code from the inter-basic-block level to the intra-basic-block level. Together, inlining and conditional fusion give subsequent compiler passes the opportunity to perform more aggressive optimizations.

Comparison to Hand Coded Programs
Two applications (matrix multiply, LU decomposition), three machines (Pentium III, Origin 2000, PowerPC), and two different problem sizes. We compare the automatically unrolled programs to optimized, hand coded versions from the Cilk benchmarks. The best automatically unrolled version performs between 2.2 and 2.9 times worse than the hand coded version for matrix multiply, and as well as the hand coded version for LU.

Related Work
Procedure inlining: Scheifler (1977); Richardson, Ganapathi (1989); Chambers, Ungar (1989); Cooper, Hall, Torczon (1991); Appel (1992); Chang, Mahlke, Chen, Hwu (1992)

Conclusion
Recursion unrolling is analogous to the loop unrolling transformation. For divide and conquer programs, the programmer writes simple base cases and the compiler automatically generates large base cases. Key techniques: Inlining conceptually inlines recursive calls; Conditional Fusion simplifies intra-procedural control flow; Re-Rolling ensures that the large base cases are executed.

Comparison to Hand Coded Programs
Matrix multiply, 512 x 512 elements: best automatically unrolled program: 2.55 sec; hand coded with three nested loops: 3.46 sec; hand coded Cilk program: 1.16 sec.
Matrix multiply, 1024 x 1024 elements: best automatically unrolled program: sec; hand coded with three nested loops: sec; hand coded Cilk program: 9.19 sec.

Correctness
Recursion unrolling preserves the semantics of the program: the unrolled program terminates if and only if the original recursive program terminates, and when both programs terminate, they yield the same result.

Speedup for Matrix Multiply Pentium III, Matrix of 512 x 512 elements

Speedup for Matrix Multiply Pentium III, Matrix of 1024 x 1024 elements

Speedup for Matrix Multiply Power PC, Matrix of 512 x 512 elements

Speedup for Matrix Multiply Power PC, Matrix of 1024 x 1024 elements

Speedup for Matrix Multiply Origin 2000, Matrix of 512 x 512 elements

Speedup for Matrix Multiply Origin 2000, Matrix of 1024 x 1024 elements

Speedup for LU Pentium III, Matrix of 512 x 512 elements

Speedup for LU Pentium III, Matrix of 1024 x 1024 elements

Speedup for LU Power PC, Matrix of 512 x 512 elements

Speedup for LU Power PC, Matrix of 1024 x 1024 elements

Speedup for LU Origin 2000, Matrix of 1024 x 1024 elements

Speedup for LU Origin 2000, Matrix of 512 x 512 elements