Bitwidth Analysis with Application to Silicon Compilation


1 Bitwidth Analysis with Application to Silicon Compilation
Mark Stephenson, Jonathan Babb, Saman Amarasinghe
MIT Laboratory for Computer Science

2 Goal
For a program written in a high-level language, automatically find the minimum number of bits needed to represent:
- Each static variable in the program
- Each operation in the program

3 Usefulness of Bitwidth Analysis
- Higher language abstraction
- Enables other compiler optimizations
- Synthesizing application-specific processors

4 DeepC Compiler Targeted to FPGAs
C/Fortran program → Suif frontend → pointer alias and other high-level analyses → bitwidth analysis → Raw parallelization → MachSuif codegen → DeepC specialization → Verilog → traditional CAD optimizations → physical circuit

5 Usefulness of Bitwidth Analysis
- Higher language abstraction
- Enables other compiler optimizations
- Synthesizing application-specific processors
- Optimizing for power-aware processors
- Extracting more parallelism for single instruction, multiple data (SIMD) processors

6 Bitwidth Opportunities
Runtime profiling reveals plenty of bitwidth opportunities: for the SPECint95 benchmark suite, over 50% of operands use less than half the number of bits specified by the programmer.

7 Analysis Constraints
- Bitwidth results must maintain program correctness for all input data sets
- Results are not runtime/data dependent
- A static analysis can do very well, even in light of this constraint

8 Bitwidth Extraction
Use abundant hints in the source language to discover bitwidths with near-optimal precision.
Caveats:
- Analysis limited to fixed-point variables
- Assumes source program correctness

9 The Hints
Bitwidth-refining constructs:
1. Arithmetic operations
2. Boolean operations
3. Bitmask operations
4. Loop induction variable bounding
5. Clamping operations
6. Type castings
7. Static array index bounding

10 1. Arithmetic Operations
Example:
int a;
unsigned b;
a = random();
b = random();   /* a: 32 bits, b: 32 bits */
a = a / 2;      /* a: 31 bits, b: 32 bits */
b = b >> 4;     /* a: 31 bits, b: 28 bits */
Arithmetic operations such as divide can be used to reduce bitwidth. Before the divide instruction executes, a's bitwidth is 32; after the divide by 2, only 31 bits are needed to represent the variable (the magnitude is at most 2^30). The slide also shows a right shift: b's value is 32 bits before the shift, and because the value has been shifted right 4 places, only 28 bits are required afterwards.
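The narrowing on this slide can be checked concretely. Below is a minimal C sketch of our own (not part of the original deck) that counts the bits needed for the worst-case values of a and b after the two operations; bits_unsigned is a hypothetical helper.

#include <stdio.h>
#include <stdint.h>

/* Bits needed to represent a non-negative value v (1 bit for v == 0). */
static int bits_unsigned(uint64_t v) {
    int n = 1;
    while (v >>= 1) n++;
    return n;
}

int main(void) {
    /* Signed a after a / 2: magnitude at most 2^30 - 1, plus a sign bit. */
    int64_t a_max = (int64_t)INT32_MAX / 2;        /* 2^30 - 1 */
    printf("a after a / 2 : %d bits\n", bits_unsigned((uint64_t)a_max) + 1);

    /* Unsigned b after b >> 4: at most 2^28 - 1. */
    uint64_t b_max = (uint64_t)UINT32_MAX >> 4;    /* 2^28 - 1 */
    printf("b after b >> 4: %d bits\n", bits_unsigned(b_max));
    return 0;
}

This prints 31 and 28 bits, matching the slide's annotations.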

11 2. Boolean Operations
Example:
int a;
a = (b != 15);   /* a: 32 bits before; 1 bit after */
In the C programming language, it is common to use a 32-bit integer data type to represent a boolean variable, as in this example, so identifying such operations can be very profitable. Before the boolean operation, 32 bits are required to represent a; after it, only 1 bit is required.

12 3. Bitmask Operations
Example:
int a;
a = random() & 0xff;   /* a: 32 bits before; 8 bits after */
Many codes use bitmask operations. In this example, even though we don't know the return value of random(), we know that a requires at most 8 bits because it is ANDed with the quantity 0xff.

13 4. Loop Induction Variable Bounding
Applicable to for-loop induction variables.
Example:
int i;                      /* i: 32 bits */
for (i = 0; i < 6; i++) {
}                           /* i: 3 bits */
For loops determine the range of values that the induction variable can assume, and this range can be converted to a bitwidth. Within and after this loop, i requires only 3 bits to represent the range of values from 0 to 6.

14 5. Clamping Optimization
Multimedia codes often simulate saturating instructions.
Example:
int valpred;
if (valpred > 32767)
    valpred = 32767;
else if (valpred < -32768)
    valpred = -32768;
/* valpred: 32 bits before the clamp; 16 bits after */
Many multimedia codes contain sequences of instructions simulating saturating instructions; in other words, the programmer wants to clamp the range of a variable. In this example, taken from our adpcm benchmark, the programmer restricts valpred to values representable in 16 bits, so after the saturating instructions execute, only 16 bits are required to represent valpred.

15 6. Type Casting (Part I)
Example:
int a;
char b;
a = b;   /* before: a: 32 bits, b: 8 bits; after: a: 8 bits, b: 8 bits */
Type promotion also serves as a good tool for reducing bitwidths. Here a is an integer and b is a character, so after the assignment of the 8-bit b to a, a's value can occupy at most 8 bits.

16 6. Type Casting (Part II)
Example:
int a;
char b;
b = a;   /* before: a: 32 bits, b: 8 bits; after: a: 8 bits, b: 8 bits */
Here the assignment goes the other way, truncating the 32-bit a into the 8-bit b. Assuming source program correctness (no intended loss of data), a's value must already fit in 8 bits, so both a and b require only 8 bits after the assignment.

17 7. Array Index Optimization
An index into an array can be bounded based on the bounds of the array.
Example:
int a, b;
int X[1024];
X[a] = X[4*b];   /* before: a: 32 bits, b: 32 bits; after: a: 10 bits, b: 8 bits */
Assuming program correctness, we can use an array's bounds to restrict the ranges of variables that index into it; otherwise a buffer overrun would result. Any index into X must lie in [0, 1023], so a needs at most 10 bits, and since 4*b <= 1023 implies b <= 255, b needs at most 8 bits. This slide also alludes to another optimization we implemented, backward propagation: the bounds information is used to re-compute the bitwidths of ancestor instructions, explained in a later slide.
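As a rough illustration of the backward step, here is a hedged C sketch under our own assumptions (back_prop_mul is a hypothetical helper, not the authors' code): from the index expression 4*b and the array bounds, b's range is recovered by running the multiplication's transfer function in reverse.

#include <stdio.h>

/* A data-range: all integers between lo and hi, inclusive. */
typedef struct { long lo, hi; } Range;

/* Backward transfer for idx = k * b with k > 0: recover b's range
   from the range the index is allowed to take. */
static Range back_prop_mul(Range idx, long k) {
    Range b = { idx.lo / k, idx.hi / k };
    return b;
}

int main(void) {
    Range idx = { 0, 1023 };            /* X[1024] => index in [0, 1023] */
    Range b = back_prop_mul(idx, 4);    /* => <0, 255>, i.e. 8 bits */
    printf("b: <%ld, %ld>\n", b.lo, b.hi);
    return 0;
}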

18 Propagating Data-Ranges
Data-flow analysis. Three candidate lattices:
1. Bitwidth
2. Vector of bits
3. Data-ranges
Everything shown to this point has been high level; now for the details of the implementation. We perform data-flow analysis, and we explored three candidate lattices. The first simply keeps track of a variable's bitwidth. For example, assume a's bitwidth is initially 4:
a: 4 bits
a = a + 1      (propagating bitwidths)
a: 5 bits
Propagating bitwidths, we have to conservatively assume that even an increment instruction always results in a carry, when in all likelihood it does not.

19 Propagating Data-Ranges
The second lattice we considered, a vector of bits, has advantages over the bitwidth lattice, but it suffers from the same arithmetic imprecision:
a: 1X
a = a + 1      (propagating bit vectors)
a: XXX
(X marks an unknown bit: 1X + 1 is either 011 or 100, so all three result bits are unknown.)

20 Propagating Data-Ranges
We decided to propagate data-ranges, which are simply all the integers between a lower and an upper bound. Of the three lattices, the data-range lattice is the only one that handles arithmetic expressions well, and all the code we examined had some degree of arithmetic computation. With data-ranges, the required number of bits does not always change across an arithmetic operation:
a: <0,13>
a = a + 1      (propagating data-ranges)
a: <1,14>
In both cases 4 bits are sufficient to represent a, and it is easy to compute the number of bits needed to represent a data-range (see the sketch below).
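To make the bits-from-range computation concrete, here is a minimal C sketch of our own (assuming two's-complement representation; range_add, bits_for, and range_bits are hypothetical helpers), with an addition transfer function matching the <0,13> + 1 = <1,14> example:

#include <stdio.h>
#include <stdint.h>

typedef struct { int64_t lo, hi; } Range;   /* all integers in [lo, hi] */

/* Transfer function for addition: add the bounds pointwise. */
static Range range_add(Range a, Range b) {
    Range r = { a.lo + b.lo, a.hi + b.hi };
    return r;
}

static int bits_for(uint64_t v) {           /* bits to hold v; 0 needs 0 */
    int n = 0;
    while (v) { n++; v >>= 1; }
    return n;
}

/* Bits needed to represent every value in r (sign bit if r dips below 0). */
static int range_bits(Range r) {
    if (r.lo >= 0)
        return r.hi ? bits_for((uint64_t)r.hi) : 1;
    uint64_t neg = (uint64_t)(-(r.lo + 1)); /* two's complement: -1 -> 0 */
    uint64_t pos = r.hi < 0 ? 0 : (uint64_t)r.hi;
    int nb = bits_for(neg), pb = bits_for(pos);
    return (nb > pb ? nb : pb) + 1;
}

int main(void) {
    Range a = { 0, 13 };
    Range one = { 1, 1 };
    Range b = range_add(a, one);            /* <1, 14> */
    printf("before: %d bits, after: %d bits\n", range_bits(a), range_bits(b));
    return 0;                               /* both print 4 */
}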

21 Propagating Data-Ranges
- Propagate data-ranges forward and backward over the control-flow graph using the transfer functions described in the paper.
- Use Static Single Assignment (SSA) form, with extensions to:
  - gracefully handle pointers and arrays
  - extract data-range information from conditional statements
With the lattice set, we propagate the data-ranges both forward and backward over the control-flow graph; an example follows. We chose SSA form because, in the common case of forward propagation, it is an efficient form for data-range propagation. The next slides show how we extended SSA form to extract data-range information from conditional statements; the extensions for gracefully handling pointers and arrays are in the paper.

22 Example of Data-Range Propagation
a0 = input()
a1 = a0 + 1
branch: a1 < 0
true path:
    a2 = a1 : (a1 < 0)    (range-refinement function)
    a3 = a2 + 1
false path:
    a4 = a1 : (a1 >= 0)
    c0 = a4
a5 = φ(a3, a4)
b0 = array[a5]
These are the extensions to SSA form for extracting data-range information from conditional statements; we call them range-refinement functions. They allow us to restrict the range of a predicate variable based on the outcome of the test. The block on the true path corresponds to the case where the predicate holds, so there the value of the predicate variable a1 is known to be less than 0; a second refinement function covers the case where the predicate is false.

23 Example of Data-Range Propagation
a0 = input()              <-128, 127>
a1 = a0 + 1               <-127, 127>
branch: a1 < 0
true path:
    a2 = a1 : (a1 < 0)    <-127, -1>
    a3 = a2 + 1           <-126, 0>
false path:
    a4 = a1 : (a1 >= 0)   <0, 127>
    c0 = a4
a5 = φ(a3, a4)            <-126, 127>
b0 = array[a5]            array's bounds are [0:9], so a5 is refined to <0, 9>
Backward propagation then tightens the earlier ranges: a4 and c0 become <0, 9>, a2 becomes <-1, -1>, a1 becomes <-1, 9>, and a0 becomes <-2, 8>.
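Below is a hedged sketch of what a range-refinement function for a test like a1 < 0 might look like (our own illustration; the paper's transfer functions are more general). On the true edge the upper bound is clamped below the constant; on the false edge the lower bound is clamped at it.

typedef struct { long lo, hi; } Range;

/* Refine x's range across the branch on "x < c".
   taken != 0 means the true edge (x < c); otherwise x >= c holds. */
static Range refine_lt(Range x, long c, int taken) {
    if (taken) {
        if (x.hi > c - 1) x.hi = c - 1;   /* e.g. a2 = a1 : (a1 < 0) */
    } else {
        if (x.lo < c) x.lo = c;           /* e.g. a4 = a1 : (a1 >= 0) */
    }
    return x;
}

Applied to a1's forward range <-127, 127> with c = 0, this yields <-127, -1> on the true edge and <0, 127> on the false edge, matching the slide.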

24 What to do with Loops?
As we just saw, straight-line code turns out to be fairly straightforward to analyze; loops, on the other hand, present a bit of a challenge. In traditional data-flow analysis, you iterate over the control-flow graph, applying transfer functions until a fixed point is reached.

25 What to do with Loops?
Finding the fixed point around back edges will often saturate data-ranges:
a = 0;
for (y = 1; y < 100; y++)
    a = a + 5;
Iterating to a fixed point saturates even the simplest arithmetic expression: here a fixed point is reached only when both linear sequences (a and y) saturate. Because instructions in loops comprise the bulk of dynamically executed instructions, it is important that we analyze them well.

26 The Loop Solution
Classify groups of dependent instructions into sequences. A linear sequence:
i = i + 1;  // counter
j = k + n;
k = j + 1;
Then call a solver to find a closed-form solution that approximates the range.
Our solution accurately determines the data-ranges of operands and expressions in loops: we find closed-form solutions for commonly occurring sequences, where a sequence is a group of mutually dependent instructions (an example follows), and then use the closed-form solutions to determine the final ranges.

27 Finding the Closed-Form Solution
a0 = 0
for i = 1 to 10
    a1 = a0 + 1
    for j = 1 to 10
        a2 = a1 + 2
    for k = 1 to 10
        a3 = a2 + 3
... = a3 + 4
As stated before, a sequence is a group of mutually dependent instructions. In this example there is one sequence, comprising all of the assignments shown (highlighted on the original slide).

28 Finding the Closed-Form Solution
a0 = 0                    <0,0>
for i = 1 to 10
    a1 = a0 + 1           <1,460>
    for j = 1 to 10
        a2 = a1 + 2       <3,480>
    for k = 1 to 10
        a3 = a2 + 3       <24,510>
... = a3 + 4              <510,510>
Shown to the right of each instruction in the sequence is the actual range of values that instruction takes on. Each instruction in the sequence takes on a different range, and it turns out to be non-trivial to find the exact ranges.

29 Finding the Closed-Form Solution
(Same example as above.) While it is non-trivial to find the exact per-instruction ranges, we can easily find a conservative range of <0,510> for the entire sequence.

30 Solving the Linear Sequence
for i = 1 to 10           iteration count: <1,10>
    a1 = a0 + 1
    for j = 1 to 10       iteration count: <1,100>
        a2 = a1 + 2
    for k = 1 to 10       iteration count: <1,100>
        a3 = a2 + 3
... = a3 + 4
There are several types of sequences to detect and solve, not just linear sequences, and each requires a different method to compute its closed-form solution. The algorithm for linear sequences begins by calculating the iteration count of each loop, defined as the number of times the instructions in the loop body will execute.

31 Solving the Linear Sequence
for i = 1 to 10           <1,10> * <1,1> = <1,10>
    a1 = a0 + 1
    for j = 1 to 10       <1,100> * <2,2> = <2,200>
        a2 = a1 + 2
    for k = 1 to 10       <1,100> * <3,3> = <3,300>
        a3 = a2 + 3
... = a3 + 4
We then use the iteration count and the growth of each instruction to determine how much the instruction contributes to the growth of the sequence. Sum all the contributions together, and take the data-range union with the initial value:
(<1,10> + <2,200> + <3,300>) ∪ <0,0> = <0,510>
A sketch of this computation follows.
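Putting the two steps together, here is a minimal C sketch of the linear-sequence solver (our reconstruction of the slide's arithmetic, not the Bitwise source; radd, rmul, and runion are hypothetical helpers): each instruction contributes (iteration count) * (per-iteration increment), and the contributions are summed and then unioned with the sequence's initial range.

#include <stdio.h>

typedef struct { long lo, hi; } Range;

static Range radd(Range a, Range b) { return (Range){ a.lo + b.lo, a.hi + b.hi }; }

/* Multiply ranges; valid here because counts and increments are non-negative. */
static Range rmul(Range a, Range b) { return (Range){ a.lo * b.lo, a.hi * b.hi }; }

static Range runion(Range a, Range b) {
    return (Range){ a.lo < b.lo ? a.lo : b.lo, a.hi > b.hi ? a.hi : b.hi };
}

int main(void) {
    Range init = { 0, 0 };                          /* a0 = 0 */
    Range counts[] = { {1,10}, {1,100}, {1,100} };  /* iteration counts */
    Range incs[]   = { {1,1},  {2,2},   {3,3}   };  /* per-iteration growth */

    Range growth = { 0, 0 };
    for (int i = 0; i < 3; i++)
        growth = radd(growth, rmul(counts[i], incs[i]));  /* <6, 510> */

    Range final = runion(init, radd(init, growth));       /* <0, 510> */
    printf("<%ld, %ld>\n", final.lo, final.hi);
    return 0;
}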

32 Summary
Developed Bitwise, a compiler that automatically determines integer bitwidths:
- propagates value ranges
- loop analysis
Demonstrated savings when targeting silicon from high-level languages onto FPGAs:
- 57% less area
- up to 86% improvement in clock speed
- less than 50% of the power
In summary, we created the Bitwise compiler, which determines operand bitwidths with excellent precision using a suite of techniques: standard data-flow analysis, sophisticated loop analysis, and pointer analysis. We demonstrated substantial savings when targeting silicon.

