Program Analysis Techniques for Memory Disambiguation


1 Program Analysis Techniques for Memory Disambiguation
Radu Rugina and Martin Rinard Laboratory for Computer Science Massachusetts Institute of Technology

2 Basic Problem
*p = v;   (write v into the memory location that p points to)
What memory location may *p = v access?
Without any analysis: *p = v may access any location.
This talk is about static memory disambiguation. The goal is to automatically determine which pieces of memory the statements and procedures in a program access. The problem is interesting because it has many applications: we have used it to automatically parallelize sequential programs, to check the absence of array bounds violations, and to check for data races in parallel programs. There are also other potential uses, such as improving other compiler analyses, understanding and debugging programs, and security applications such as information-flow analysis.

3 Basic Problem
*p = v;   (write v into the memory location that p points to)
What memory location may *p = v access?
With analysis: *p = v may access this location, and does not access these other memory locations.

4 Static Memory Disambiguation
Analyze the program to characterize the memory locations that statements in the program read and write. This is a fundamental problem in program analysis with many applications.

5 Application: Automatic Parallelization
*p = v1; *q = v2;   can run as   *p = v1 || *q = v2
If the analysis shows that the two statements access disjoint memory locations, they can safely be executed in parallel.

6 Application: Data Race Detection
(Dual Problem)   *p = v1 || *q = v2
Given two parallel statements, check whether they may access the same memory location; if so, the program has a potential data race.

7 Application: Detection of Array Bounds Violations
*p = v;   A[1 .. n]
Check whether the location written by *p = v falls within the bounds of the array A[1 .. n].

8 Many Other Applications
Virtually all program analyses, transformations, and validations require information about how the program accesses memory:
Foundation for other analyses and transformations
Understand, maintain, and debug programs
Provide security guarantees

9 Analysis Techniques for Memory Disambiguation
Pointer Analysis: disambiguates memory accesses via pointers
Symbolic Analysis: characterizes accessed subregions within dynamically allocated memory blocks
In this talk I will present two program analysis techniques for static memory disambiguation. In the first part I will present a pointer analysis, which determines where pointers may point during the execution of the program and can therefore be used to disambiguate memory accesses via pointers. In the second part I will present a symbolic analysis technique, which characterizes how the program accesses regions within dynamically allocated memory blocks.

10 1. Pointer Analysis I will start the first part of the talk, which presents pointer analysis.

11 Pointer Analysis GOAL: Statically compute where pointers may point
e.g. “p → x before statement *p = 1”
Must represent points-to relations between memory locations.
Complications:
1. Statically unbounded number of locations: recursive data structures (lists, trees), dynamically allocated arrays
2. Multiple possible executions of the program may create different dynamic data structures
The goal of pointer analysis is to statically compute where pointers may point in all possible executions of the program. We want to compute facts like “p → x before *p = 1”, so we need to represent points-to relations between memory locations. There are two complications: first, the program has a statically unbounded number of locations, as with dynamic structures such as lists and trees, or with arrays; second, there are multiple possible executions of the program, which may create different data structures with different points-to relations between memory locations. To solve these problems, the analysis uses the memory abstraction shown next.

12 Memory Abstraction
[Figure: physical memory (stack and heap) on top, abstract memory below; variables p, i, head, q, v, r]
This is a graphical representation of the abstraction of physical memory. The upper part of the slide shows the physical memory, consisting of the stack and the heap; the lower part shows the abstraction of that memory. There is one abstract location for each scalar on the stack, a single abstract location for all the elements of each array, and all the elements of a dynamic list are abstracted into a single element. Moreover, this abstraction can model multiple executions of the program that build different dynamic structures in the heap.

13 Memory Abstraction
[Figure repeated from the previous slide]

14 Sequential vs. Multithreaded Pointer Analysis
Variety of existing algorithms for sequential programs: [CWZ90], [LR92], [CBC93], [And94], [EGH94], [WL95], [Ruf95], [Ste96], [DMM98]
Dataflow analysis: computes points-to information at each program point; the dataflow information is a points-to graph; analyzing each statement creates and kills edges
Pointer analysis for multithreaded programs is challenging: parallel threads may concurrently update shared pointers
There is a large body of work on pointer analysis for sequential programs, with techniques offering different degrees of precision and efficiency. Here we focus on the techniques based on dataflow analysis. These algorithms compute points-to graphs at each point in the program; they iteratively analyze statements, creating new edges and potentially removing existing ones. In contrast, little work has been done on the analysis of multithreaded programs. This is a challenging problem because parallel threads may concurrently update shared pointers. In this talk I will present a pointer analysis technique I developed that handles programs with multiple threads.

15 Example
Two integers and one shared pointer: int x, y; int *p;
p = &x;
parbegin
  *p = 1;   ||   p = &y;
parend
*p = 2;
Questions: what location is written by *p = 1? What location is written by *p = 2? Equivalently: Q1: where may p point in the left thread? Q2: where may p point after both threads have completed?
I will start with a very simple program that illustrates the basic problems that arise in the analysis of multithreaded programs.

16 Two Possible Executions
Execution 1: p = &x (p → x); *p = 1; p = &y (p → y); *p = 2
Execution 2: p = &x (p → x); p = &y (p → y); *p = 1; *p = 2

17 Analysis Results
p = &x;        p → x
parbegin
  *p = 1;      p → x or p → y
  ||
  p = &y;      p → y
parend
*p = 2;        p → y

18 Analysis of Multithreaded Programs
Straightforward solution (the Ideal Algorithm): analyze all possible interleavings of statements from the parallel threads and merge the results; this fails because of exponential complexity.
Our approach: analyze the threads in turn; during the analysis of each thread, take into account all the edges created by the parallel threads, which we call interference information.
We analyze the threads in turn and, for each thread, take into account the set of edges created by the other parallel threads; we refer to this set as interference information.

19 Interference Information
Interference = points-to edges created by the other parallel threads.
[Figure: analyzed thread ti among parallel threads t1 ... ti-1, ti+1 ... tn; the edges they create flow into ti as interference]
In contrast to the Ideal Algorithm, our approach does not take into account the order in which these interference edges are created. To compute the interference information for each thread, the analysis must also keep track of which edges each thread creates.

20 Multithreaded Analysis
Dataflow information is a triple <C, I, E>:
C = current points-to information
I = interference points-to edges from parallel threads
E = set of points-to edges created by the current thread
Interference: I_k = ∪_{j ≠ k} E_j, where t_1 ... t_n are the n parallel threads
Invariant: I ⊆ C, because within each thread the interference points-to edges are always added to the current information
The analysis computes dataflow information that is a triple with these three components: the current points-to information, the interference information (edges created by parallel threads), and the set of edges created by the current thread. The first property above defines the interference information; the second shows how the analysis uses it: interference edges are always added to the current points-to graph, which gives the invariant I ⊆ C. Instead of giving a formal definition of how the algorithm computes this dataflow information, I will show how it works on our example.
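To make the triple concrete, here is a minimal C sketch of one way the dataflow facts could be represented and of how interference is folded into the current points-to set. The bitmask encoding of points-to edges, the tiny two-location universe, and the helper names are illustrative assumptions for this sketch, not the authors' implementation.

#include <stdio.h>

/* Tiny illustrative universe: one pointer p, two locations x and y.
   A points-to set is a bitmask over the possible edges.            */
#define EDGE_P_X (1u << 0)   /* p -> x */
#define EDGE_P_Y (1u << 1)   /* p -> y */

typedef struct {
    unsigned C;   /* current points-to edges                  */
    unsigned I;   /* interference edges from parallel threads */
    unsigned E;   /* edges created by the current thread      */
} Dataflow;

/* Transfer function for "p = &loc": kill p's old edges in C, add the
   new edge to C and record it in E, then fold the interference back
   into C so the invariant I ⊆ C keeps holding.                      */
static Dataflow assign_p(Dataflow in, unsigned new_edge)
{
    Dataflow out = in;
    out.C &= ~(EDGE_P_X | EDGE_P_Y);
    out.C |= new_edge;
    out.E |= new_edge;
    out.C |= out.I;
    return out;
}

int main(void)
{
    Dataflow d = { 0, 0, 0 };
    d = assign_p(d, EDGE_P_X);   /* p = &x in the parent thread      */
    d.I |= EDGE_P_Y;             /* a parallel thread created p -> y */
    d.C |= d.I;                  /* apply the interference           */
    printf("C = %#x, I = %#x, E = %#x\n", d.C, d.I, d.E);
    return 0;
}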

21 Analysis for Example
p = &x;
parbegin
  *p = 1;   ||   p = &y;
parend
*p = 2;
The analysis starts with the first statement in the program and computes the dataflow information after it.

22 Analysis for Example
After p = &x:   < {p → x}, ∅, {p → x} >
The first item is the current points-to graph, which shows that the first statement makes p point to x. The second item is the interference information; here it is the empty set, because we assume this is the main thread of the program. The last item is the set of edges created so far, which is p → x. Next, the algorithm starts the analysis of the two parallel threads.

23 Analysis of Parallel Threads
At the start of each thread:   left: < {p → x}, ∅, ∅ >      right: < {p → x}, ∅, ∅ >
The analysis of the threads starts with the same dataflow information as before the parallel construct, except that the sets of created edges are now empty, because these are fresh new threads.

24 Analysis of Parallel Threads
After the left thread (*p = 1):   < {p → x}, ∅, ∅ >
The algorithm analyzes each thread in turn. The left thread does not change the dataflow information.

25 Analysis of Parallel Threads
After the right thread (p = &y):   < {p → y}, ∅, {p → y} >
The right thread writes p and makes it point to y.

26 Analysis of Parallel Threads
(same dataflow information as on the previous slide)
The second thread has created a new points-to edge p → y, which might affect the analysis and the results for the left thread.

27 Analysis of Parallel Threads
Reanalysis of the left thread with interference {p → y}:   < {p → x, p → y}, {p → y}, ∅ >
Therefore the algorithm reanalyzes the left thread using the edge p → y as interference information; this is the second iteration of the algorithm. Note that the edge p → y is also added to the current points-to information.

28 Analysis of Parallel Threads
Left thread: < {p → x, p → y}, {p → y}, ∅ >      Right thread: < {p → y}, ∅, {p → y} >
The reanalysis of the left thread does not change the dataflow information, and the algorithm has reached a fixed point.

29 Analysis of Parallel Threads
Left thread: < {p → x, p → y}, {p → y}, ∅ >      Right thread: < {p → y}, ∅, {p → y} >
In particular, the interference information for each thread is the set of edges created by the other parallel thread.

30 Analysis of Thread Joins
After parend:   < {p → y}, ∅, {p → x, p → y} >
Next the algorithm computes the dataflow information after both threads have completed; this is the result.

31 Analysis of Thread Joins
After parend:   < {p → y}, ∅, {p → x, p → y} >
Even though p may point to either x or y at the end of the left thread, the analysis determines that p definitely points to y after both threads have completed. The correct combination operator here is intersection: the points-to information after both threads have completed is the intersection of the points-to information at the end of each thread. The other two components are straightforward: the interference information is the same as before the parallel construct, and the sets of edges created by the parallel threads are added to the set of edges created before the parallel construct.

32 Final Result
p = &x;        < {p → x}, ∅, {p → x} >
parbegin
  *p = 1;  < {p → x, p → y}, {p → y}, ∅ >   ||   p = &y;  < {p → y}, ∅, {p → y} >
parend        < {p → y}, ∅, {p → x, p → y} >
*p = 2;
This is the final analysis result. The algorithm correctly determined that p → x or p → y before *p = 1, and that p → y before *p = 2. It is now fairly easy to see the general dataflow equations for the analysis of two parallel threads.

33 General Dataflow Equations
Parent thread before parbegin:  < C, I, E >
Thread 1 entry:  < C ∪ E2, I ∪ E2, ∅ >        Thread 2 entry:  < C ∪ E1, I ∪ E1, ∅ >
Thread 1 exit:   < C1, I ∪ E2, E1 >           Thread 2 exit:   < C2, I ∪ E1, E2 >
Parent thread after parend:  < C1 ∩ C2, I, E ∪ E1 ∪ E2 >
These are the equations.

34 General Dataflow Equations
(equations as on the previous slide)
The interference information for each thread is the set of edges created by the other thread, so these equations are recursive; we solve them with a fixed-point algorithm, as shown in the example.

35 General Dataflow Equations
(equations as on the previous slide)
Also note the intersection operator used to compute the points-to information after both threads have completed.
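As a rough illustration of how these recursive equations can be solved, here is a small C sketch of the fixed-point loop for a parbegin/parend construct with two threads, using the same bitmask encoding as in the earlier sketch. The analyze_thread transfer function is a placeholder that hard-codes the two threads of the example program; it stands in for the real per-thread analysis and is an assumption of this sketch.

#include <stdio.h>

#define EDGE_P_X (1u << 0)   /* p -> x */
#define EDGE_P_Y (1u << 1)   /* p -> y */

/* Placeholder per-thread transfer function: given the points-to set at
   thread entry, produce the set at thread exit and the edges created.
   Thread 1 is "*p = 1" (creates nothing); thread 2 is "p = &y".        */
static void analyze_thread(int which, unsigned entry,
                           unsigned *exitC, unsigned *created)
{
    if (which == 1) { *exitC = entry; *created = 0; }
    else {
        *exitC = (entry & ~(EDGE_P_X | EDGE_P_Y)) | EDGE_P_Y;
        *created = EDGE_P_Y;
    }
}

int main(void)
{
    unsigned C = EDGE_P_X, I = 0, E = EDGE_P_X;   /* state after p = &x */
    unsigned C1 = 0, C2 = 0, E1 = 0, E2 = 0, oldE1, oldE2;
    unsigned Cafter, Eafter;

    /* Fixed point over the recursive equations: thread 1 is analyzed
       with interference I ∪ E2, thread 2 with interference I ∪ E1.    */
    do {
        oldE1 = E1; oldE2 = E2;
        analyze_thread(1, C | E2, &C1, &E1);  C1 |= I | E2;
        analyze_thread(2, C | E1, &C2, &E2);  C2 |= I | E1;
    } while (E1 != oldE1 || E2 != oldE2);

    /* parend: C' = C1 ∩ C2, I' = I, E' = E ∪ E1 ∪ E2 */
    Cafter = C1 & C2;
    Eafter = E | E1 | E2;
    printf("after parend: C = %#x, E = %#x\n", Cafter, Eafter);
    return 0;
}

On the example, the loop stabilizes after two iterations and prints C = {p → y} and E = {p → x, p → y}, matching the result on the Final Result slide.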

36 Overall Algorithm
Extensions: parallel loops, conditionally spawned threads, recursively generated concurrency
Flow-sensitive at the intra-procedural level
Context-sensitive at the inter-procedural level
The overall algorithm uses similar equations for the analysis of multiple threads, and it has extensions for parallel loops, conditionally spawned threads, and recursively generated concurrency. The algorithm is flow-sensitive at the intra-procedural level, which means it takes into account the ordering of statements in the program. At the inter-procedural level it is context-sensitive, which means it reanalyzes procedures for every new input context.

37 Algorithm Evaluation
Soundness: the multithreaded algorithm conservatively approximates all possible interleavings of statements from the parallel threads
Termination of the fixed-point algorithms: follows from the monotonicity of the transfer functions
Complexity of the fixed-point algorithms: worst-case polynomial, O(n^4), where n is the number of statements
Precision: if the concurrent threads do not (pointer-)interfere, the algorithm gives the same result as the Ideal Algorithm
In brief: the algorithm is sound, giving a conservative approximation of all possible interleavings of statements from the concurrent threads; it is guaranteed to terminate, and does so in time polynomial in the size of the program; and it is precise in the sense that if the parallel threads do not interfere from the point of view of pointers, it gives the same result as the Ideal Algorithm.

38 Experimental Results
Implementation: SUIF infrastructure, Cilk benchmarks
Columns: Program / Lines of Code / Pointer Abstract Locations / Number of parallel constructs / Mean iterations per parallel construct / Description
Barnes 1149 125 12 2.00 Barnes-Hut N-body Simulation
BlockMul 324 9 13 1.00 Blocked Matrix Multiply
Cholesky 932 29 109 1.83 Sparse Cholesky Factorization
CilkSort 499 14 8 Parallel Mergesort
Ck 505 38 3 Checkers Program
Fft 3255 335 182 1.73 Fast Fourier Transform
Fib 53 1 Fibonacci Calculation
Game 195 Simple Game
Heat 360 1.62 Heat Diffusion on Mesh
Knapsack 122 6 Knapsack, Branch and Bound
Knary 114 Synthetic Benchmark
LU 594 10 LU Decomposition
Magic 965 74 24 Magic Squares
Mol 4478 387 99 1.18 Viral Protein Simulation
NoTemp 341 15
Pousse 1379 118 1.22 Pousse Game Program
Queens 106 2.25 N Queens Program
Space 458 Blocked Matrix Multiplication
We implemented these techniques in the SUIF infrastructure and used a set of benchmarks written in Cilk, a multithreaded extension of C. The second column gives the size of each benchmark; the third gives the number of pointer abstract locations, a measure of how pointer-intensive the program is; the fourth gives the number of parallel constructs; and the fifth gives the average number of analysis iterations per parallel construct. The algorithm needs very few iterations to reach the fixed point: in most cases a single iteration, and in a few cases about two iterations per parallel construct.

39 Precision of Pointer Analysis
Number of targets for dereferenced pointers at loads and stores:
usually a unique target: 83% of the loads, 88% of the stores
few potentially uninitialized pointers
very few pointers with more than two targets
We counted the number of targets of the dereferenced pointer at each load and store statement; two histograms (one for loads, one for stores) combine the results for all benchmarks, separating definitely initialized from potentially uninitialized pointers. The majority of dereferenced pointers point to a unique abstract location, there are few potentially uninitialized pointers, and very few pointers have more than two targets. We also compared the algorithm against the Ideal Algorithm using the sequential versions of the benchmarks, which execute the threads one after another. The points-to relations from the analysis of the sequential versions are included in the results of the Ideal Algorithm, because sequential execution is one of the possible interleavings of the parallel threads, and the edges computed by the Ideal Algorithm are a subset of the edges computed by our algorithm, because our algorithm is sound. Running our analysis on the multithreaded and on the sequential versions of these benchmarks produced identical results, so for this set of benchmarks our algorithm gives the same results as the Ideal Algorithm.

40 What Pointer Analysis Gives Us
Disambiguation of memory accesses via pointers: for pointer-based loads and stores, use the pointer analysis results to derive the memory locations that each load or store statement may access.
MOD-REF (read-write sets) analysis: use the memory access information for loads and stores to compute which locations each procedure accesses.
Since pointer analysis tells us where pointers may point, this information can be used at pointer-based load and store statements to determine which memory locations those instructions access, and then propagated to compute which locations every load, store, and procedure in the program accesses; this is usually called MOD-REF or read-write sets analysis.
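A minimal sketch of how MOD-REF (read/write) sets might be accumulated from points-to results, as described above. The bitmask encoding of abstract locations, the Access record, and the helper name mod_ref are illustrative assumptions, not an actual compiler interface.

#include <stdio.h>

typedef struct {
    int is_store;       /* 1 = store through the pointer, 0 = load          */
    unsigned targets;   /* bitmask of abstract locations the dereferenced
                           pointer may point to, from the pointer analysis  */
} Access;

typedef struct {
    unsigned read;      /* REF set of the procedure */
    unsigned write;     /* MOD set of the procedure */
} ModRef;

/* Fold the accesses of a procedure's loads and stores into its
   MOD-REF sets.                                                  */
static ModRef mod_ref(const Access *acc, int n)
{
    ModRef r = { 0, 0 };
    for (int i = 0; i < n; i++) {
        if (acc[i].is_store) r.write |= acc[i].targets;
        else                 r.read  |= acc[i].targets;
    }
    return r;
}

int main(void)
{
    /* e.g. a store "*p = 1" where p may point to locations 0 or 1,
       and a load through a pointer that targets only location 0.   */
    Access body[] = { { 1, 0x3 }, { 0, 0x1 } };
    ModRef m = mod_ref(body, 2);
    printf("MOD = %#x, REF = %#x\n", m.write, m.read);
    return 0;
}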

41 Other Uses of Pointer Analysis
In the MIT RAW CC compiler: static promotion, promoting memory accesses to the fast, static network and avoiding the slow, dynamic network [Barua et al., PLDI99]
In the MIT DeepC project, a C-to-silicon compiler: split memory into smaller memories with narrow address spaces [Babb et al., FCCM99]
Memory disambiguation for bitwidth analysis: the Bitwise project at MIT [Stephenson and Amarasinghe, PLDI00] and the PipeWrench project at CMU [Budiu et al., EuroPar00]
Our implementation of pointer analysis has also been used in other projects: in a C compiler for the RAW machine to determine which memory accesses can be scheduled on the fast, static communication network; in a C-to-silicon compiler where the program's memory is split into smaller memories with narrow address spaces; and to improve other compiler analyses such as bitwidth analysis, and ...

42 Other Uses of Pointer Analysis (ctd.)
In the MIT Superword Level Parallelism project: again, disambiguates memory for subsequent analyses [Larsen and Amarasinghe, PLDI00]
In the FlexCache project at MIT and the University of Massachusetts, Amherst: use pointer analysis and other static analyses to eliminate a large portion of the cache-tag lookups [Moritz et al., IRAM00]
... another compiler analysis, which detects superword level parallelism. Finally, it is currently used in the FlexCache project, along with other static analyses, to eliminate a large portion of the cache-tag lookups. So far I have shown a pointer analysis algorithm that handles both multithreaded and sequential programs, together with an evaluation of its efficiency and precision. Let's now see how this information can be used to disambiguate memory accesses.

43 Is Pointer Analysis Always Enough to Disambiguate Memory?
But now the question is: is pointer analysis always enough to disambiguate memory?

44 Is Pointer Analysis Always Enough to Disambiguate Memory?

45 Is Pointer Analysis Always Enough to Disambiguate Memory?
Pointer analysis uses a memory abstraction that merges together all the elements within an allocated memory block. We sometimes need more sophisticated techniques to characterize the accessed regions within allocated memory blocks.

46 Motivating Example

47 Parallel Divide and Conquer Sort
7 4 6 1 3 5 8 2

48 Parallel Divide and Conquer Sort
7 4 6 1 3 5 8 2 4 7 6 1 5 3 8 2 Divide

49 Parallel Divide and Conquer Sort
7 4 6 1 3 5 8 2 4 7 6 1 5 3 8 2 Divide 4 7 1 6 3 5 2 8 Conquer

50 Parallel Divide and Conquer Sort
7 4 6 1 3 5 8 2 4 7 6 1 5 3 8 2 Divide 4 7 1 6 3 5 2 8 Conquer 1 4 6 7 2 3 5 8 Combine

51 Parallel Divide and Conquer Sort
7 4 6 1 3 5 8 2
4 7 6 1 5 3 8 2   (Divide)
4 7 1 6 3 5 2 8   (Conquer)
1 4 6 7 2 3 5 8   (Combine)
1 2 3 4 5 6 7 8
The motivating example is a divide and conquer sorting algorithm. The algorithm first divides the input array into sub-arrays, here four quarters, and recursively sorts each sub-array. After each quarter is sorted, it combines sorted quarters into sorted halves, and finally combines the sorted halves into the final sorted sequence. Note that this algorithm has a lot of concurrency: the computations that sort the sub-arrays access independent pieces of memory, so they are independent tasks and can execute in parallel, and likewise the computations that combine sorted quarters into sorted halves access disjoint pieces of memory and can also execute concurrently. The concurrency is relatively straightforward in this abstract, high-level representation, but let's look at the code that implements the algorithm.

52 Motivating Problem: Data Race Detection
Data race: one thread accesses a location written by another parallel thread.
Presence of data races: non-deterministic execution of the program, which makes programs difficult to debug and indicates potential programming errors.
Goal: statically check the absence of data races.
Sorting example: the absence of data races is relatively straightforward in the abstract algorithm.

53 “Sort n Items in d, Using t as Temporary Storage”
void sort(int *d, int *t, int n) {
  if (n > CUTOFF) {
    spawn sort(d,t,n/4);
    spawn sort(d+n/4,t+n/4,n/4);
    spawn sort(d+2*(n/4),t+2*(n/4),n/4);
    spawn sort(d+3*(n/4),t+3*(n/4),n-3*(n/4));
    sync;
    spawn merge(d,d+n/4,d+n/2,t);
    spawn merge(d+n/2,d+3*(n/4),d+n,t+n/2);
    sync;
    merge(t,t+n/2,t+n,d);
  } else
    insertionSort(d,d+n);
}
In this implementation of the sorting algorithm, it is not obvious that the parallel calls are free of data races; we would like the compiler to check this automatically. To do so, we first need to understand what the program does: it sorts n items in the input array d, using a temporary array t.

54 Automatically Check Absence of Data Races
“Sort n Items in d, Using t as Temporary Storage”
(code as on the previous slide)
Motivating Problem: Automatically Check the Absence of Data Races

55 “Recursively Sort Four Quarters of d”
(code as on the previous slides)
Divide the array into subarrays and recursively sort the subarrays: the program first divides the input array into four sub-arrays and then recursively sorts each sub-array.

56 “Recursively Sort Four Quarters of d”
(code as on the previous slides)
Subproblems are identified using pointers into the middle of the array: the first two arguments of sort identify subproblems in the input array and in the temporary array.
4 7 6 1 5 3 8 2
d d+n/4 d+n/2 d+3*(n/4)

57 “Recursively Sort Four Quarters of d”
(code as on the previous slides)
4 7 6 1 5 3 8 2
d d+n/4 d+n/2 d+3*(n/4)

58 “Recursively Sort Four Quarters of d”
(code as on the previous slides)
Sorted results are written back into the input array: the program recursively sorts each quarter of the array and writes the results back in place. These are destructive updates, so the compiler has to reason about them.
4 7 1 6 3 5 2 8
d d+n/4 d+n/2 d+3*(n/4)

59 “Merge Sorted Quarters of d Into Halves of t”
(code as on the previous slides)
The program next implements the combine phases of the algorithm: the first two calls to merge combine sorted quarters of d into sorted halves of the temporary array t.
4 7 1 6 3 5 2 8   d
1 4 6 7 2 3 5 8   t t+n/2

60 “Merge Sorted Halves of t Back Into d”
(code as on the previous slides)
The final call to merge combines the sorted halves from the temporary array into the final sorted sequence in the input array.
1 2 3 4 5 6 7 8   d
1 4 6 7 2 3 5 8   t t+n/2

61 “Use a Simple Sort for Small Problem Sizes”
(code as on the previous slides)
The program terminates the recursion using a simple sorting algorithm, insertion sort, for small problem sizes. In the figure, a small subarray with the two elements 6 and 1 is sorted by swapping the two values.
4 7 6 1 5 3 8 2
d d+n

62 “Use a Simple Sort for Small Problem Sizes”
(code as on the previous slides)
4 7 1 6 5 3 8 2
d d+n
For this program, the goal is to automatically check that the parallel calls are free of data races.

63 What Do You Need To Know To Check the Absence of Data Races?
So what does the compiler need to know about this program to check the absence of data races?

64 What Do You Need To Know To Check the Absence of Data Races?
Points-to information is not enough! The parallel threads access the same array.
The compiler needs symbolic information about the regions of memory that the whole computation of each procedure accesses; the next slides show what this means for our example program.

65 What Do You Need To Know To Check the Absence of Data Races?
Key piece of information: symbolic information about the accessed memory regions.

66 Information Needed For Data Race Checking
The calls to sort access disjoint parts of d and t; together, the calls access [d, d+n-1] and [t, t+n-1].
sort(d,t,n/4); sort(d+n/4,t+n/4,n/4); sort(d+n/2,t+n/2,n/4); sort(d+3*(n/4),t+3*(n/4),n-3*(n/4));
[Figure: for each call, the accessed quarter of d within [d, d+n-1] and of t within [t, t+n-1]]
The compiler needs to know that the first call to sort accesses only the first quarters of the arrays, that the second call accesses only the second quarters, and so on. These recursive calls access independent regions of the two arrays, so they are independent computations and can be executed concurrently.

67 Information Needed For Data Race Checking
The first two calls to merge access disjoint parts of d and t; together, the calls access [d, d+n-1] and [t, t+n-1].
merge(d,d+n/4,d+n/2,t); merge(d+n/2,d+3*(n/4),d+n,t+n/2);
[Figure: the halves of d and t accessed by each call]
For the combine phase, the compiler needs to know that the first call to merge accesses the first halves of the arrays and the second call accesses the second halves, and that the last call to merge accesses all of both arrays. The first two calls access disjoint pieces of memory and can therefore execute in parallel, while the last call has data dependences with the previous two calls and must wait for both to complete before it starts.

68 Information Needed For Data Race Checking
The calls to insertionSort access [d, d+n-1].
insertionSort(d,d+n);
Finally, the compiler needs to know that the insertionSort procedure accesses all the array elements between its two pointer arguments, here the region [d, d+n-1].

69 What Do You Need To Know To Check the Absence of Data Races?
Symbolic information about the accessed memory regions:
sort(p,n) accesses [p, p+n-1]
insertionSort(p,n) accesses [p, p+n-1]
merge(l,m,h,d) accesses [l, h-1] and [d, d+(h-l)-1]
To summarize, to check the absence of data races in our example program, the compiler needs to know the symbolic access regions of each procedure. Each accessed region consists of a lower and an upper bound, and these bounds are symbolic expressions in the procedure parameters.

70 How Hard Is It To Figure These Things Out?
But how hard is it for the compiler to extract this information? …

71 How Hard Is It To Figure These Things Out?
Challenging. This is not an easy task for the compiler.

72 How Hard Is It To Figure These Things Out?
void insertionSort(int *l, int *h) {
  int *p, *q, k;
  for (p = l+1; p < h; p++) {
    for (k = *p, q = p-1; l <= q && k < *q; q--)
      *(q+1) = *q;
    *(q+1) = k;
  }
}
Not immediately obvious that insertionSort(l,h) accesses [l, h-1].
Looking at the code that implements the insertion sort algorithm, we see that it makes heavy use of pointers: pointer arithmetic, pointer comparisons, and memory references via pointers. It is not immediately obvious that this code accesses exactly this memory region.

73 How Hard Is It To Figure These Things Out?
void merge(int *l1, int *m, int *h2, int *d) {
  int *h1 = m;
  int *l2 = m;
  while ((l1 < h1) && (l2 < h2))
    if (*l1 < *l2) *d++ = *l1++;
    else *d++ = *l2++;
  while (l1 < h1 && l2 <= h2) *d++ = *l1++;
  while (l2 < h2 && l1 <= h1) *d++ = *l2++;
}
Not immediately obvious that merge(l,m,h,d) accesses [l, h-1] and [d, d+(h-l)-1].
The merge procedure is similar: it also uses pointer arithmetic, pointer comparisons, and pointer-based memory accesses, and it is not immediately obvious that this code accesses these two regions of memory.

74 Issues
Heavy use of pointers: pointers into the middle of arrays, pointer arithmetic, pointer comparison
Multiple procedures: sort(int *d, int *t, n), insertionSort(int *l, int *h), merge(int *l, int *m, int *h, int *t)
Recursion
Besides the heavy use of pointers, the compiler has to handle other problems too: it has to reason about the memory regions accessed by multiple procedures, which requires an inter-procedural analysis, and it has to reason about recursive computations.

75 2. Symbolic Bounds Analysis Algorithm
So far I have presented an example that introduced the problem and the difficulties associated with solving this problem. I will switch now and show you how the compiler extracts the needed information.

76 Overall Compiler Structure
Pointer Analysis: disambiguate memory at the granularity of abstract locations
Bounds Analysis: symbolic upper and lower bounds for each memory access in each procedure
Region Analysis: symbolic regions accessed by the execution of each procedure
Data Race Detection: check whether parallel threads are independent
Here is the general structure of the compiler. It uses pointer analysis as a foundation to disambiguate memory accesses via pointers; this analysis works at the granularity of allocation blocks, so all the elements of an array are merged into a single abstract location. The next stages refine this memory access information with symbolic region information within each block. The second phase is an intra-procedural analysis, bounds analysis, which derives symbolic bounds for the pieces of memory accessed by each load and store. The compiler then uses this information in an inter-procedural analysis, region analysis, to compute for each procedure the regions of memory that the whole computation of the procedure accesses. Finally, the compiler uses the symbolic region information to check whether parallel threads access disjoint regions and are therefore free of data races. I will go into more detail about the last three stages.

77 Running Example – Array Increment
void f(char *p, int n) {
  if (n > CUTOFF) {
    spawn f(p, n/2);      /* increment first half */
    spawn f(p+n/2, n/2);  /* increment second half */
    sync;
  } else {
    /* base case: increment small array */
    int i = 0;
    while (i < n) { *(p+i) += 1; i++; }
  }
}
In this section I will use a simpler divide and conquer running example, probably the simplest divide and conquer algorithm imaginable. The program increments n elements of a given array: it divides the array into two halves and recursively increments each half, and it terminates the recursion with a base case that iteratively increments the elements of a small array.

78 Intra-procedural Bounds Analysis
Pointer Analysis → Bounds Analysis → Region Analysis → Data Race Detection
Bounds Analysis (this stage): symbolic upper and lower bounds for each memory access in each procedure.

79 Intra-procedural Bounds Analysis
GOAL: for each pointer and array index variable, at each program point, derive lower and upper bounds.
E.g. “0 ≤ i ≤ n-1 at statement *(p+i) += 1”
Bounds are symbolic expressions: the variables represent the initial values of the parameters of the enclosing procedure, and the bounds are combinations of these variables. Example expression for f(p,n): p + (n/2) - 1.
Bounds analysis computes symbolic bounds for pointers and array indices at each program point, deriving facts such as “i is between 0 and n-1 at the update statement in the base case”. The lower and upper bounds are symbolic expressions whose variables are the initial values of the parameters of the enclosing procedure; the expression at the bottom of the slide is an example for a procedure f with parameters p and n.

80 Intra-procedural Bounds Analysis
What are the upper and lower bounds of i at each program point in the base case?
int i = 0;
while (i < n) {
  *(p+i) += 1;
  i++;
}
Let's see how the intra-procedural bounds analysis computes the symbolic bounds for the index variable i at each program point of the while loop in the base case of our running example.

81 Bounds Analysis, Step 1: Build the control flow graph
[CFG: block 1: i = 0; block 2: test i < n; block 3: *(p+i) += 1; i = i+1, with an edge back to the test]
The compiler first builds the control flow graph of the computation.

82 Bounds Analysis, Step 2: Set up bounds at the beginning of the basic blocks
Block 1 (i = 0):                  l1 ≤ i ≤ u1
Block 2 (i < n):                  l2 ≤ i ≤ u2
Block 3 (*(p+i) += 1; i = i+1):   l3 ≤ i ≤ u3
The compiler sets up symbolic bounds for i at the beginning of each basic block: l1, u1, l2, u2, l3, u3. These are the target bounds; the goal is to find symbolic expressions for them.

83 Bounds Analysis, Step 3: Compute transfer functions
Block 1: [l1 ≤ i ≤ u1]  i = 0  gives  0 ≤ i ≤ 0
Block 3: [l3 ≤ i ≤ u3]  *(p+i) += 1  gives  l3 ≤ i ≤ u3;  i = i+1  gives  l3+1 ≤ i ≤ u3+1
The compiler computes transfer functions for the statements and basic blocks. For the statement i = 0, whatever the bounds of i were before, both bounds after the statement are 0. After the increment i = i+1, both the lower and the upper bound of i are incremented. Note that the update statement that writes the memory location at address p+i does not change the bounds of i; here the compiler uses the pointer analysis results to determine that this statement does not write i.

84 Bounds Analysis, Step 3 (continued): Compute transfer functions
Block 2: [l2 ≤ i ≤ u2]  i < n  gives  l2 ≤ i ≤ n-1 on the true branch and l2 ≤ i ≤ u2 on the false branch
The transfer functions for conditionals are computed specially: a conditional yields two sets of bounds, one for the true branch and one for the false branch. On the true branch the test condition generates a new upper bound for i; on the false branch the bounds of i remain unchanged. Once the transfer functions are computed, the symbolic bounds at each program point are expressed purely in terms of the symbolic bounds at the beginning of the basic blocks.

85 Bounds Analysis, Step 4 (key step): Set up constraints for the bounds
Region constraints:
[0, 0] ⊆ [l2, u2]
[l3+1, u3+1] ⊆ [l2, u2]
[l2, n-1] ⊆ [l3, u3]
To find the values of the bounds at the beginning of the basic blocks, the compiler formulates a constraint problem. The constraints require that the range of each variable at the beginning of each basic block include the ranges of that variable at the end of the predecessor blocks. For instance, the range of i at the beginning of the second basic block must include the range of i at the end of the first and of the third block. This translates into symbolic inequalities between lower and upper bounds: the lower bound l2 must be at most 0 and at most l3+1, and the upper bound u2 must be at least 0 and at least u3+1.

86-89 Bounds Analysis, Step 4 (continued)
(These slides repeat the region constraints and the control flow graph annotations from the previous slide.)
90 Bounds Analysis, Step 4 (continued): Set up constraints for the bounds
Region constraints:              Inequality constraints:
[0, 0] ⊆ [l2, u2]                l2 ≤ 0        0 ≤ u2
[l3+1, u3+1] ⊆ [l2, u2]          l2 ≤ l3+1     u3+1 ≤ u2
[l2, n-1] ⊆ [l3, u3]             l3 ≤ l2       n-1 ≤ u3
Each region constraint translates into one inequality between the corresponding lower bounds and one between the corresponding upper bounds.

91 Bounds Analysis, Step 5: Generate symbolic expressions for the bounds
Goal: express the bounds in terms of the parameters:
l2 = c1·p + c2·n + c3        u2 = c7·p + c8·n + c9
l3 = c4·p + c5·n + c6        u3 = c10·p + c11·n + c12
The analysis does not yet know the bounds, but it proceeds under the assumption that they are linear combinations of the parameters of the enclosing procedure. It writes the bounds as such combinations, so the problem is now reduced to finding the coefficients in these expressions.

92 Bounds Analysis, Step 5 (continued): Generate symbolic expressions for the bounds
Constraints: l2 ≤ 0, l2 ≤ l3+1, l3 ≤ l2, 0 ≤ u2, u3+1 ≤ u2, n-1 ≤ u3
Expressions: l2 = c1·p + c2·n + c3, l3 = c4·p + c5·n + c6, u2 = c7·p + c8·n + c9, u3 = c10·p + c11·n + c12
We can now substitute these expressions into our system of constraints ...

93 Bounds Analysis, Step 6: Substitute the expressions into the constraints
c1·p + c2·n + c3 ≤ 0
c1·p + c2·n + c3 ≤ c4·p + c5·n + c6 + 1
c4·p + c5·n + c6 ≤ c1·p + c2·n + c3
0 ≤ c7·p + c8·n + c9
c10·p + c11·n + c12 + 1 ≤ c7·p + c8·n + c9
n - 1 ≤ c10·p + c11·n + c12
... and obtain a system of symbolic inequality constraints. The goal is now to find values of the coefficient variables that satisfy these symbolic inequalities. For this, the compiler translates each symbolic inequality into a set of linear inequalities between the coefficients, using ...

94 Bounds Analysis, Step 7: Reduce symbolic inequalities to linear inequalities
c1·p + c2·n + c3 ≤ c4·p + c5·n + c6   if   c1 ≤ c4, c2 ≤ c5, and c3 ≤ c6
A symbolic inequality is reduced to linear inequalities between the coefficients of corresponding terms on its left-hand and right-hand sides. The compiler applies this reduction and ...
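The reduction is sound when the parameter values are non-negative; treating p as a non-negative address and n as a non-negative problem size is an assumption made explicit here, not something stated on the slide. A short justification:
\[
c_1 \le c_4,\; c_2 \le c_5,\; c_3 \le c_6,\; p \ge 0,\; n \ge 0
\;\Longrightarrow\;
c_1 p + c_2 n + c_3 \;\le\; c_4 p + c_5 n + c_6,
\]
since $(c_4 - c_1)\,p \ge 0$ and $(c_5 - c_2)\,n \ge 0$ term by term. The converse does not hold, so the reduction is conservative: it may reject coefficient assignments that would in fact satisfy the symbolic inequality.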

95 Bounds Analysis, Step 8: Apply the reduction and generate a linear program
c1 ≤ 0     c2 ≤ 0     c3 ≤ 0
c1 ≤ c4    c2 ≤ c5    c3 ≤ c6+1
c4 ≤ c1    c5 ≤ c2    c6 ≤ c3
0 ≤ c7     0 ≤ c8     0 ≤ c9
c10 ≤ c7   c11 ≤ c8   c12+1 ≤ c9
0 ≤ c10    1 ≤ c11    -1 ≤ c12
The reduction yields a set of linear inequalities between the coefficients, that is, a linear program. The linear constraints ensure a safe solution to the bounds problem, but the compiler also wants a precise one, minimizing all the computed ranges: it therefore maximizes all the lower bounds and minimizes all the upper bounds, which means maximizing the coefficients of the lower bounds and minimizing the coefficients of the upper bounds. Next the compiler solves the linear program ...

96 Bounds Analysis, Step 8 (continued): Apply the reduction and generate a linear program
(linear constraints as on the previous slide)
Objective function:  max  (c1 + ••• + c6) - (c7 + ••• + c12),  i.e. maximize the lower-bound coefficients and minimize the upper-bound coefficients.

97 Solve linear program to extract bounds
Bounds Analysis, Step 9 Solve linear program to extract bounds

Solution: c1=0, c2=0, c3=0, c4=0, c5=0, c6=0, c7=0, c8=1, c9=0, c10=0, c11=1, c12=-1

[Flow graph omitted: before i = 0 the range of i is unconstrained (-∞ ≤ i ≤ +∞); after i = 0 it is 0 ≤ i ≤ 0; at the loop test i < n it is l2 ≤ i ≤ u2, and l2 ≤ i < n on the true branch; at *(p+i) += 1 it is l3 ≤ i ≤ u3; after i = i+1 it is l3+1 ≤ i ≤ u3+1.]

… and generate symbolic bounds for index variable i at each program point.

98 Solve linear program to extract bounds
Bounds Analysis, Step 9 Solve linear program to extract bounds

Solution: c1=0, c2=0, c3=0, c4=0, c5=0, c6=0, c7=0, c8=1, c9=0, c10=0, c11=1, c12=-1

Symbolic Bounds: l2 = 0, l3 = 0, u2 = n, u3 = n-1

[Flow graph omitted: the same annotated graph as on the previous slide.]

… the solution to the linear program yields numeric values for the coefficient variables. The compiler now substitutes these values into the expressions for the lower bounds l2, l3 and the upper bounds u2, u3, and derives symbolic expressions for them. Finally, the compiler can substitute these symbolic bounds at each program point …

99 Substitute bounds at each program point
Bounds Analysis, Step 10 Substitute bounds at each program point

Solution: c1=0, c2=0, c3=0, c4=0, c5=0, c6=0, c7=0, c8=1, c9=0, c10=0, c11=1, c12=-1

Symbolic Bounds: l2 = 0, l3 = 0, u2 = n, u3 = n-1

[Flow graph omitted: after i = 0: 0 ≤ i ≤ 0; at the loop test i < n: 0 ≤ i ≤ n, and 0 ≤ i ≤ n-1 on the true branch; at *(p+i) += 1: 0 ≤ i ≤ n-1; after i = i+1: 1 ≤ i ≤ n.]

… and generate symbolic bounds for index variable i at each program point.

100 Compute access regions at each load or store
Solution: c1=0, c2=0, c3=0, c4=0, c5=0, c6=0, c7=0, c8=1, c9=0, c10=0, c11=1, c12=-1

Symbolic Bounds: l2 = 0, l3 = 0, u2 = n, u3 = n-1

Access region of *(p+i) += 1: [p, p+n-1]

[Flow graph omitted: after i = 0: 0 ≤ i ≤ 0; at the loop test i < n: 0 ≤ i ≤ n; at *(p+i) += 1: 0 ≤ i ≤ n-1; after i = i+1: 1 ≤ i ≤ n.]

The compiler can now use this information to compute the access regions for each load and store in the program. In our example, the compiler has detected that i is between 0 and n-1 before the pointer-based memory reference, and it computes an access region from p to p+n-1 for this statement.
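Restated as plain C, here is the base case of the running example with the analysis results written as comments (the function name base_case is only for illustration; the comments are what the bounds analysis derives, not part of the source):

void base_case(char *p, int n)
{
    int i = 0;              /* 0 <= i <= 0                             */
    while (i < n) {         /* at the test: 0 <= i <= n                */
                            /* on the true branch: 0 <= i <= n-1       */
        *(p + i) += 1;      /* so this store accesses only [p, p+n-1]  */
        i = i + 1;          /* after the increment: 1 <= i <= n        */
    }
}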

101 Inter-procedural Region Analysis
Pointer Analysis → Bounds Analysis → Region Analysis (Symbolic Regions Accessed By Execution of Each Procedure) → Data Race Detection

Here is the general structure of the compiler. The compiler uses pointer analysis as a foundation to disambiguate memory accesses via pointers. This analysis is done at the granularity of allocation blocks. Here, all of the elements of an array are merged together into a single allocation block. The next stages in the compiler refine this memory access information with symbolic region information within each array. The second phase is an intra-procedural analysis, called bounds analysis. The goal here is to derive symbolic bounds for the pieces of memory accessed by each load and store in the program. The compiler next uses the symbolic information about load and store accesses to compute, for each procedure, the region of memory that the whole computation of the procedure accesses. The compiler derives these access regions for procedures using an inter-procedural analysis, called region analysis, shown in the third step of this diagram. Finally, the compiler uses the symbolic region information in a parallelization stage to detect sequences of independent calls and generate parallel code to execute them concurrently. I will go into more detail about the last three stages in this diagram.

102 Inter-procedural Region Analysis
GOAL: Compute accessed regions of memory for each procedure. E.g., “f(p,n) accesses [p, p+n-1]”

Same approach:
Set up target bounds of accessed regions
Build a constraint system to compute these bounds

Constraint system: accessed regions for a procedure must include
1. Regions accessed by statements in the procedure
2. Regions accessed by invoked procedures

The goal here is to compute, for each procedure, the symbolic regions of memory accessed by the whole computation of the procedure. The region analysis derives facts like “the whole execution of f accesses the region with lower bound p and upper bound p+n-1”. The general approach here is similar to the one used during the bounds analysis: set up target symbolic bounds, then build a constraint system to compute these bounds. Here, the analysis generates symbolic bounds for the access regions of each procedure. It then builds constraints by imposing that the regions accessed by each procedure should include: first, all the regions accessed by the invoked procedures, and second, all the regions accessed by load and store statements in the body of the procedure. Let's see what this means in our example.

103 Region Analysis in Example
void f(char *p, int n) { if (n > CUTOFF) { spawn f(p, n/2); spawn f(p+n/2, n/2); sync; } else { int i = 0; while (i < n) { *(p+i) += 1; i++; } } }

Access region of the base-case update: [ p, p+n-1 ]

So far, the compiler knows from the bounds analysis the access region for the pointer-based update in the base case. This region is shown in yellow.

104 Region Analysis in Example
f(p,n) accesses [ l(p,n), u(p,n) ]

void f(char *p, int n) { if (n > CUTOFF) { spawn f(p, n/2); spawn f(p+n/2, n/2); sync; } else { int i = 0; while (i < n) { *(p+i) += 1; i++; } } }

Access region of the base-case update: [ p, p+n-1 ]

The compiler next sets up a symbolic access region for the whole execution of procedure f. The goal of the compiler is to find the symbolic lower and upper bounds, l and u, of this region.

105 Region Analysis in Example
f(p,n) accesses [ l(p,n), u(p,n) ]

void f(char *p, int n) { if (n > CUTOFF) { spawn f(p, n/2); spawn f(p+n/2, n/2); sync; } else { int i = 0; while (i < n) { *(p+i) += 1; i++; } } }

Access regions of the recursive calls: [ l(p,n/2), u(p,n/2) ] and [ l(p+n/2,n/2), u(p+n/2,n/2) ]
Access region of the base-case update: [ p, p+n-1 ]

Given the access region for f, the compiler computes the access regions for the recursive calls by substituting the actual parameters for the formal parameters. To compute l and u, the compiler builds a constraint system which imposes that the access region for f (shown in red) should include the access regions for the call statements and the load/store statements in the body of f (shown in yellow).

106 Derive Constraint System
Region constraints:
[ l(p,n/2), u(p,n/2) ] ⊆ [ l(p,n), u(p,n) ]
[ l(p+n/2,n/2), u(p+n/2,n/2) ] ⊆ [ l(p,n), u(p,n) ]
[ p, p+n-1 ] ⊆ [ l(p,n), u(p,n) ]

Reduce to inequalities between lower/upper bounds
Further reduce to a linear program and solve: l(p,n) = p, u(p,n) = p+n-1
Access region for f(p,n): [p, p+n-1]

The compiler thus derives a system of region constraints. These immediately translate into inequality constraints between the lower and upper bounds. The compiler uses the same techniques as in the bounds analysis to solve this constraint system of symbolic inequalities. With the solution l(p,n) = p and u(p,n) = p+n-1, the recursive calls access [p, p+n/2-1] and [p+n/2, p+n-1], both of which are contained in [p, p+n-1], so the constraints are indeed satisfied.

107 Data Race Detection Pointer Analysis Bounds Analysis Region Analysis
Data Race Detection: Check if Parallel Threads Are Independent

Here is the general structure of the compiler. The compiler uses pointer analysis as a foundation to disambiguate memory accesses via pointers. This analysis is done at the granularity of allocation blocks. Here, all of the elements of an array are merged together into a single allocation block. The next stages in the compiler refine this memory access information with symbolic region information within each array. The second phase is an intra-procedural analysis, called bounds analysis. The goal here is to derive symbolic bounds for the pieces of memory accessed by each load and store in the program. The compiler next uses the symbolic information about load and store accesses to compute, for each procedure, the region of memory that the whole computation of the procedure accesses. The compiler derives these access regions for procedures using an inter-procedural analysis, called region analysis, shown in the third step of this diagram. Finally, the compiler uses the symbolic region information in a parallelization stage to detect sequences of independent calls and generate parallel code to execute them concurrently. I will go into more detail about the last three stages in this diagram.

108 Data Race Detection Dependence testing of two statements
Do accessed regions intersect?
Based on comparing upper and lower bounds of accessed regions
Absence of data races: check if all the statements that execute in parallel are independent

This phase is relatively straightforward. It uses the symbolic region information for dependence testing at call statements. Here the idea is to compare the symbolic bounds of the regions accessed by call statements. Using such dependence testing, the compiler finds sequences of independent calls and generates parallel code that executes them concurrently.
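As a concrete illustration of the test, here is a minimal sketch of the disjointness check on two regions with numeric bounds; the region type and the independent() helper are hypothetical names, and in the compiler the same comparison is carried out on symbolic bounds:

#include <stdbool.h>

/* A contiguous accessed region [lo, hi] inside one allocation block. */
typedef struct {
    char *lo;   /* lower bound: first byte accessed */
    char *hi;   /* upper bound: last byte accessed  */
} region;

/* Two statements are independent when their accessed regions do not
   intersect, i.e. one region ends strictly before the other begins. */
static bool independent(region a, region b)
{
    return a.hi < b.lo || b.hi < a.lo;
}

For the two spawned calls in the example, the accessed regions are [p, p+n/2-1] and [p+n/2, p+n-1]; the first ends immediately before the second begins, so the test succeeds and the calls may safely execute in parallel.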

109 Data Race Detection f(p,n) accesses [ p, p+n-1 ]
void f(char *p, int n) { if (n > CUTOFF) { spawn f(p, n/2); spawn f(p+n/2, n/2); sync; } else { int i = 0; while (i < n) { *(p+i) += 1; i++; } } }

In our example, the compiler uses the access region of f to compute the regions accessed by the recursive calls, then detects that these symbolic regions are independent.

110 Data Race Detection f(p,n) accesses [ p, p+n-1 ]
void f(char *p, int n) { if (n > CUTOFF) { spawn f(p, n/2); spawn f(p+n/2, n/2); sync; } else { int i = 0; while (i < n) { *(p+i) += 1; i++; } } }

Regions accessed by the two spawned calls: [ p, p+n/2-1 ] and [ p+n/2, p+n-1 ]

In our example, the compiler uses the access region of f to compute the regions accessed by the recursive calls, then detects that these symbolic regions are independent.

111 Data Race Detection No data races ! void f(char *p, int n)
{ if (n > CUTOFF) { spawn f(p, n/2); spawn f(p+n/2, n/2); sync; } else { int i = 0; while (i < n) { *(p+i) += 1; i++; } } }

No data races!

In our example, the compiler uses the access region of f to compute the regions accessed by the recursive calls, then detects that these symbolic regions are independent.

112 Fundamental Property of the Analysis: No Fixed Point Computations
The analysis does not use fixed-point computations:
The problem is reduced to a linear program
The solution to the linear program directly gives the symbolic lower and upper bounds
Fixed-point approaches:
Termination is not guaranteed: the analysis domain of symbolic expressions has infinite ascending chains
Use imprecise techniques to ensure termination:
Artificially truncate the number of iterations
Use imprecise widening operators
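For contrast only (this operator is not part of the analysis presented here), here is a minimal sketch of the classic interval widening that a fixed-point approach would use to force termination; any bound that is still moving after an iteration is pushed all the way to infinity, which is exactly where the imprecision comes from:

#include <math.h>

typedef struct { double lo, hi; } interval;

/* Standard interval widening: compare the previous iterate with the new
   one and give up on any bound that is still changing. */
static interval widen(interval old, interval next)
{
    interval w;
    w.lo = (next.lo < old.lo) ? -HUGE_VAL : old.lo;   /* lower bound still falling */
    w.hi = (next.hi > old.hi) ?  HUGE_VAL : old.hi;   /* upper bound still growing */
    return w;
}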

113 Scope of Symbolic Analysis
Symbolic regions within each allocation block
Accessed regions depend on the program input
Does not compute regions within recursive structures, e.g. lists, trees, graphs; shape analysis techniques are required in this case
Symbolic bounds are:
Polynomial expressions
Expressed in terms of the initial values of the parameters (can be extended to initial values of global variables)

114 3. Uses of Pointer Analysis and Symbolic Analysis
So far I have presented how to use the symbolic information for automatic parallelization of sequential code. I will now present other uses of the extracted symbolic information.

115 Uses of Pointer and Symbolic Information
Transformations: Automatic Parallelization of Sequential Programs; Bounds Checks Elimination for Safe Programs
Verifications: Data Race Detection for Parallel Programs; Array Bounds Checking for Unsafe Programs

At a very general level there are two main uses of program analysis. First, program analysis can be used for program transformations. Given a certain computation, the compiler uses analysis results to optimize the computation and generate efficient code. Second, program analysis can be used for verifications. Given some efficient, optimized code, the compiler uses analysis results to verify that the code does not violate certain safety conditions. The interesting thing about our analysis is that it is neutral with respect to this duality. Given a sequential program, we can use symbolic information to automatically parallelize the program. Or given the parallel version of the program, we can use symbolic information to check if there are any data races. Similarly, given a program written in an unsafe language like C, we can check if there are any array bounds violations. Or given a program written in a safe language like Java, we can use the symbolic information to eliminate redundant array bounds checks.
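As an illustration of the two bounds-checking uses, here is a hypothetical sketch (base_case_checked and alloc_size are made-up names, and the assert stands in for the check a safe language would insert automatically). The analysis derives 0 <= i <= n-1 and the access region [p, p+n-1]; if it can also establish that the block p points into holds at least n bytes, the check can never fail, so a safe language can eliminate it and an unsafe program can be reported free of bounds violations at this access:

#include <assert.h>

/* Hypothetical illustration: alloc_size is the size in bytes of the
   allocation block that p points into. */
void base_case_checked(char *p, int n, int alloc_size)
{
    for (int i = 0; i < n; i++) {
        assert(0 <= i && i < alloc_size);   /* dynamic bounds check          */
        *(p + i) += 1;                      /* analysis: accesses [p, p+n-1] */
    }
}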

116 Experimental Results Implementation SUIF Infrastructure
lp_solve linear programming solver
Cilk multithreaded language
Benchmarks:
Sorting programs: QuickSort, MergeSort
Dense matrix programs: Matrix Multiplication, LU
Stencil computation: Heat
Branch and Bound: Knapsack

117 Experimental Results Two versions of each benchmark
Sequential version written in C
Multithreaded version written in Cilk
Experiments:
Data Race Detection for the multithreaded versions
Array Bounds Violation Detection for both sequential and multithreaded versions
Automatic Parallelization for the sequential version
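As a sketch of what the two versions of the running example look like side by side (the sequential version is simply the divide-and-conquer code with the Cilk keywords removed; in Cilk the procedure itself would also carry the cilk keyword, which the slides omit, and CUTOFF is just a tuning constant):

#define CUTOFF 64   /* illustrative value */

/* Sequential version, written in C */
void f_seq(char *p, int n)
{
    if (n > CUTOFF) {
        f_seq(p, n/2);
        f_seq(p + n/2, n/2);
    } else {
        int i = 0;
        while (i < n) { *(p + i) += 1; i++; }
    }
}

/* Multithreaded version, written in Cilk */
cilk void f(char *p, int n)
{
    if (n > CUTOFF) {
        spawn f(p, n/2);
        spawn f(p + n/2, n/2);
        sync;
    } else {
        int i = 0;
        while (i < n) { *(p + i) += 1; i++; }
    }
}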

118 Data Races and Array Bounds Violations
Application    Data races (multithreaded)    Array Bounds Violations (sequential)
QuickSort      NO                            NO
MergeSort      NO                            NO
BlockMul       NO                            NO
NoTempMul      NO                            NO
LU             NO                            NO
Knapsack       YES                           NO
Heat           NO                            NO

119 Automatic Parallelization
Automatically parallelized applications (speedup graphs omitted): QuickSort, MergeSort, Heat, BlockMul, NoTempMul, LU

120 Related Work Pointer Analysis of Sequential Programs
Landi, Ryder (PLDI 92); Choi, Burke, Carini (POPL 93); Emami, Ghiya, Hendren (PLDI 94); Wilson, Lam (PLDI 95); Ruf (PLDI 95); Steensgaard (POPL 96); Shapiro, Horwitz (POPL 97), …
Analysis of Multithreaded Programs: Knoop, Steffen, Vollmer (TOPLAS 96); Whaley, Rinard (OOPSLA 99); Salcianu, Rinard (PPoPP 01)
Symbolic Analysis of Loop Variables and Array Sections: Havlak, Kennedy (TPDS 91); Blume, Eigenmann (IPPS 95); Haghighat, Polychronopoulos (LCPC 93)
Parallelization of Recursive Procedures: Rugina, Rinard (PPoPP 99); Gupta, Mukhopadhyay, Sinha (PACT 99)
Array Bounds Checking: Suzuki, Ishihata (POPL 77); Gupta (PLDI 90); Kolte, Wolfe (PLDI 95); Xi, Pfenning (PLDI 98); Wagner, Foster, Brewer, Aiken (NDSS 00); Bodik, Gupta, Sarkar (PLDI 00)
Data Race Detection: Savage, Burrows, Nelson, Sobalvarro, Anderson (SOSP 97), …

I will briefly go over some related work. There has been a great deal of related work in the areas of array section analysis, symbolic analysis of loop variables, parallelization of recursive procedures, and array bounds checking. Most of these techniques concentrate on the analysis of loops and loop conditions to extract the needed information. In contrast, our framework performs inter-procedural analysis and can handle recursion. Compared to the previous work on the symbolic analysis and parallelization of recursive procedures, our approach presents a cleaner and more general analysis framework by reducing the problem to linear programs. The previous approaches used iterative fixed-point analyses and had to artificially truncate the number of iterations to ensure termination.

121 Conclusion Novel pointer analysis for multithreaded programs
Models interactions between parallel threads
Expresses the problem using dataflow equations
Novel framework for symbolic bounds analysis:
Uses symbolic constraint systems
Reduces problem to linear programs
Analysis uses:
Parallelization, data race detection
Detecting array bounds violations
Array bounds check elimination

122 Future Work Analysis of multithreaded programs:
Shape analysis
General dataflow framework
Application of static analysis techniques to:
Software Engineering: automatic detection of errors
Computer Security: buffer overruns, information flow analysis
Computer Architecture: compiler support for VLIW and DSP architectures

