Compilers for Embedded Systems


1 Compilers for Embedded Systems
Lecture 1: Integrated Systems of Hardware and Software
V. I. Kelefouras, University of Patras, Department of Electrical and Computer Engineering, VLSI lab
Date: 12/11/2017

2 Loop unroll transformation (1)
Creates additional copies of the loop body. Always safe.

//C-code1
for (i=0; i < 100; i++)
    A[i] = B[i];

//C-code2
for (i=0; i < 100; i+=4) {
    A[i]   = B[i];
    A[i+1] = B[i+1];
    A[i+2] = B[i+2];
    A[i+3] = B[i+3];
}

In code1, i takes the values 0, 1, 2, 3, ..., while in code2 it takes 0, 4, 8, ...; in code2, 4 iterations have been unrolled.
Why do that?
Pros: reduces the number of instructions; increases instruction-level parallelism
Cons: increases code size; increases register pressure
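The codes above assume the trip count is a multiple of the unroll factor. As a minimal sketch (not from the slides) of what unrolling looks like when that does not hold, a remainder (epilogue) loop handles the leftover iterations; the function name is assumed:

/* Unroll by 4 with an epilogue loop for the remaining iterations. */
void copy_unrolled(float *A, const float *B, int N) {
    int i;
    for (i = 0; i + 3 < N; i += 4) {   /* main unrolled loop */
        A[i]   = B[i];
        A[i+1] = B[i+1];
        A[i+2] = B[i+2];
        A[i+3] = B[i+3];
    }
    for (; i < N; i++)                 /* epilogue: remaining 0-3 iterations */
        A[i] = B[i];
}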

3 Loop unroll transformation (2)

// C code1
for (i=0; i<100; i++) {
    A[i] = B[i];
}

// assembly code1 - loop control executed 100 times
loop_i:
    ...
    inc i          // increment i
    cmp i, 100     // compare i to 100
    jl  loop_i     // jump to loop_i if i < 100

// C code2
for (i=0; i<100; i+=4) {
    A[i]   = B[i];
    A[i+1] = B[i+1];
    A[i+2] = B[i+2];
    A[i+3] = B[i+3];
}

// assembly code2 - loop control executed 100/4 times
loop_i:
    ...
    add i, 4       // i = i + 4
    cmp i, 100     // compare i to 100
    jl  loop_i     // jump to loop_i if i < 100

For the C code to run on the target platform it must be translated into assembly and then into binary code; this slide shows the assembly a typical compiler generates with loop unrolling applied (code2) and without it (code1). The for loop becomes three assembly instructions: an add, a compare, and a conditional jump. In code1 these three instructions execute 100 times, while in code2 they execute only 100/4 = 25 times. Loop unrolling therefore reduces the number of arithmetic instructions:
- fewer add instructions for i (i = i + 4, executed a quarter as often as i = i + 1)
- fewer compare instructions (i against 100)
- fewer jump instructions

4 Loop unroll transformation (3)
(Same codes and assembly as the previous slide.) Because the loop-control instructions execute 100/4 times instead of 100 times:
- execution time is reduced
- energy consumption in the execution unit and the instruction fetch unit is reduced

5 Scalar replacement transformation
Converts an array reference into a scalar reference. Always safe.

//Code-1
for (i=0; i < 100; i++){
    A[i] = … + B[i];
    C[i] = … + B[i];
    D[i] = … + B[i];
}

//Code-2
for (i=0; i < 100; i++){
    t = B[i];
    A[i] = … + t;
    C[i] = … + t;
    D[i] = … + t;
}

In Code-1, the B[i] array reference is accessed 3 times for every i, i.e., 300 times in total.
Pros: reduces the number of load/store instructions; reduces the number of memory accesses; reduces the number of arithmetic instructions
Cons: introduces extra dependencies and may disable other transformations
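A compilable sketch of the same transformation (assumed example, not from the slides): the constants stand in for the slide's elided expressions, and restrict tells the compiler the arrays do not alias, so keeping B[i] in a register is visibly safe.

/* Scalar replacement: B[i] is loaded once per iteration instead of three times. */
void kernel(float * restrict A, float * restrict C, float * restrict D,
            const float * restrict B, int n) {
    for (int i = 0; i < n; i++) {
        float t = B[i];      /* one load, kept in a register */
        A[i] = 1.0f + t;
        C[i] = 2.0f + t;
        D[i] = 3.0f + t;
    }
}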

6 Scalar Replacement Transformation example (1)

// C-code1
for (i=0; i<N; i++)
    for (j=0; j<N; j++)
        for (k=0; k<N; k++)
            C[i][j] += A[i][k] * B[k][j];

// C-code2
for (i=0; i<N; i++)
    for (j=0; j<N; j++) {
        tmp = C[i][j];
        for (k=0; k<N; k++) {
            tmp += A[i][k] * B[k][j];
        }
        C[i][j] = tmp;
    }

C[i][j] is not affected by the k loop, so in C-code1 it is redundantly loaded/stored from/to memory for every k. A load/store instruction needs 1-3 CPU cycles.
[Figure: memory hierarchy, with C[0][0] travelling between main memory, the L2 unified cache, the L1 data cache, and the register file (RF) next to the CPU.]

7 Scalar Replacement Transformation example (2)
tmp C N A B N N // C code of MMM for (i=0; i<N; i++) for (j=0; j<N; j++) { tmp=C[i][j]; for (k=0; k<N; k++) { tmp += A[i][k] * B[k][j]; } C[i][j] = tmp; ... ... ... N N the number of L/S instructions is reduced N2 instead of N3 loads and stores for C array the number of arithmetical instructions is reduced less address computations for C the number of L1 data accesses is reduced N2 instead of N3 L1 accesses for C array
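A quick count behind the first bullet (my arithmetic, not on the slide): in C-code1 the statement C[i][j] += ... performs one load and one store of C[i][j] per k iteration, i.e., 2N³ accesses to C in total; after scalar replacement, C is loaded and stored once per (i,j) pair, i.e., 2N² accesses, a factor-of-N reduction.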

8 Scalar Replacement Transformation (3)
Main memory L2 unified cache Faster and smaller Cache lines L1 instruction cache L1 data cache This pie shows energy consumption in HW components of the initial MMM code on ARM. Most of the energy is consumed on memory hierarchy RF words CPU Dynamic power consumption in memory hierarchy is reduced by reducing the number of memory accesses The power consumption in Instruction fetch unit and execution unit is reduced by reducing the number of instructions

9 Think Pair Share Exercise

//code1
N = … ;
for (i=0; i < N; i++)
    A[i] = B[i];

//code2
for (i=0; i < N; i+=10000) {
    A[i]   = B[i];
    A[i+1] = B[i+1];
    A[i+2] = B[i+2];
    A[i+3] = B[i+3];
    …
    A[i+9999] = B[i+9999];
}

When is code2 faster than code1?
a) Always
b) Never
c) It depends on the hardware architecture
d) It is impossible to know

Answer: it depends on the hardware architecture. When the code2 size becomes larger than the L1 instruction cache size, code2 is no longer efficient. Code1 consists of about 9 assembly instructions; these are loaded once from main memory into the L1 instruction cache and are then fetched from L1 on every iteration. Code2, in contrast, consists of a huge number of assembly instructions, which may not fit in L1. If L1 is smaller than the code2 size, its instructions are repeatedly fetched from the L2 cache instead; keep in mind that L2 is much slower and more energy-hungry than L1.
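A rough size estimate (my numbers, for illustration only): if each copied statement compiles to about 4 instructions of about 4 bytes each, the 10000-statement body of code2 occupies roughly 10000 x 4 x 4 = 160 KB, far larger than a typical 32 KB L1 instruction cache, so every pass over the body misses in the L1 instruction cache.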

10 Single Instruction Multiple Data (SIMD) (1)
- Modern processors provide vector assembly instructions to increase performance
- Modern compilers perform auto-vectorization
- Dedicated HW supports a variety of vector instructions as well as wide registers
- Array operations are normally implemented with vector rather than scalar assembly instructions (a minimal sketch follows)
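To make the scalar/vector contrast concrete, a minimal sketch (assumed example, not from the slides); the SSE version processes four floats per instruction and assumes n is a multiple of 4:

#include <xmmintrin.h>   /* SSE intrinsics */

void add_scalar(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i++)        /* one float per iteration */
        c[i] = a[i] + b[i];
}

void add_sse(float *c, const float *a, const float *b, int n) {
    for (int i = 0; i < n; i += 4) {   /* four floats per iteration */
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
}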

11 Single Instruction Multiple Data (SIMD) (2)
Intel MMX technology
- 8 mmx registers of 64 bits, an extension of the floating-point registers
- Each register can be handled as 8 8-bit, 4 16-bit, 2 32-bit, or 1 64-bit operands
- An entire L1 cache line is loaded into the RF in 1-3 cycles
Intel SSE technology
- 8/16 xmm registers of 128 bits (32-bit architectures support 8 registers only)
- Each register can be handled as anything from 16 8-bit to 2 64-bit operands
Intel AVX technology
- 8/16 ymm registers of 256 bits (32-bit architectures support 8 registers only)
- Each register can be handled as anything from 32 8-bit to 4 64-bit operands

12 Single Instruction Multiple Data (SIMD) (3)

13 Single Instruction Multiple Data (SIMD) (4)
- Vector load/store instructions work only on data written in consecutive main memory addresses
- Aligned load/store instructions are faster than unaligned ones
- MMX instructions have lower latency, but SSE instructions have higher throughput
- MMX instructions are preferred for 64-bit operations
- The packing/unpacking overhead may be high
- We can use both mmx and xmm registers
- SSE memory and arithmetic instructions are executed in parallel

14 Basic SSE Instructions (1)
__m128 _mm_load_ps (float * p) – Loads four SP FP values. The address must be 16-byte-aligned.
__m128 _mm_loadu_ps (float * p) – Loads four SP FP values. The address need not be 16-byte-aligned.
[Figure: an aligned load brings A[0..3] from a single L1 cache line into a 128-bit register, while a misaligned load (e.g., A[1..4]) may span two cache lines on its way from main memory through the L2 unified cache and the L1 data cache (faster and smaller) to the RF.]

15 Basic SSE Instructions (2)
__m128 _mm_load_ps(float * p) – Loads four SP FP values. The address must be 16-byte-aligned.
__m128 _mm_loadu_ps(float * p) – Loads four SP FP values. The address need not be 16-byte-aligned.

float A[N] __attribute__((aligned(16)));   // forces Modulo(Address, 16) = 0

[Figure: with A aligned this way, loading A[0..3] is an aligned load; loads starting at A[1], A[2], or A[3] are misaligned.]
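A short sketch of the rule above (assumed example): with A forced to a 16-byte boundary, loads at offsets that are multiples of 16 bytes may use _mm_load_ps, while anything else needs _mm_loadu_ps.

#include <xmmintrin.h>

float A[16] __attribute__((aligned(16)));   /* address % 16 == 0 */

void loads(void) {
    __m128 v0 = _mm_load_ps(&A[0]);    /* aligned: byte offset 0  */
    __m128 v1 = _mm_load_ps(&A[4]);    /* aligned: byte offset 16 */
    __m128 v2 = _mm_loadu_ps(&A[1]);   /* byte offset 4: must use the unaligned load */
    (void)v0; (void)v1; (void)v2;      /* silence unused warnings */
}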

16 Basic SSE Instructions (3)
void _mm_store_ps(float * p, __m128 a) – Stores four SP FP values. The address must be 16-byte-aligned.
void _mm_storeu_ps(float * p, __m128 a) – Stores four SP FP values. The address need not be 16-byte-aligned.
__m128 _mm_mul_ps(__m128 a, __m128 b) – Multiplies the four SP FP values of a and b. XMM1 = _mm_mul_ps(XMM1, XMM0)
__m128 _mm_mul_ss(__m128 a, __m128 b) – Multiplies the lower SP FP values of a and b; the upper 3 SP FP values are passed through from a. XMM1 = _mm_mul_ss(XMM1, XMM0)
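A small illustration of the two multiply flavours (assumed example; lane 0 is the lowest element, and _mm_set_ps takes its arguments highest lane first):

#include <xmmintrin.h>

void muls(float *out_ps, float *out_ss) {
    __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);   /* lanes 0..3 = [1 2 3 4] */
    __m128 b = _mm_set1_ps(10.0f);                   /* all lanes = 10 */
    _mm_storeu_ps(out_ps, _mm_mul_ps(a, b));         /* [10 20 30 40] */
    _mm_storeu_ps(out_ss, _mm_mul_ss(a, b));         /* [10  2  3  4] */
}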

17 Basic SSE Instructions (4)
__m128 _mm_unpackhi_ps (__m128 a, __m128 b) – Selects and interleaves the upper two SP FP values from a and b. XMM0 = _mm_unpackhi_ps(XMM0, XMM1)
__m128 _mm_unpacklo_ps (__m128 a, __m128 b) – Selects and interleaves the lower two SP FP values from a and b. XMM0 = _mm_unpacklo_ps(XMM0, XMM1)
__m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8) – Selects four specific SP FP values from a and b, based on the mask imm8. num0 = _mm_shuffle_ps(num1, num2, _MM_SHUFFLE(1,0,1,0));

18 Basic SSE Instructions (5)
__m128 _mm_hadd_ps (__m128 a, __m128 b) – Adds adjacent vector elements (horizontal add).
void _mm_store_ss (float * p, __m128 a) – Stores the lower SP FP value.
__m128 _mm_shuffle_ps(__m128 a, __m128 b, unsigned int imm8) – Selects four specific SP FP values from a and b, based on the mask imm8. num0 = _mm_shuffle_ps(num1, num2, _MM_SHUFFLE(1,0,1,0));
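Combining hadd and store_ss gives the horizontal-sum idiom used in the MVM example that follows; a hedged sketch (note that _mm_hadd_ps is SSE3, so compile with -msse3 or later):

#include <pmmintrin.h>   /* SSE3: _mm_hadd_ps */

float hsum(__m128 v) {       /* v = [a b c d]                 */
    v = _mm_hadd_ps(v, v);   /* [a+b  c+d  a+b  c+d]          */
    v = _mm_hadd_ps(v, v);   /* [a+b+c+d  ...  in every lane] */
    float s;
    _mm_store_ss(&s, v);     /* store lane 0 */
    return s;
}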

19 Example – MVM with SSE

for (i=0; i!=N; i++){
    num3 = _mm_setzero_ps();
    for (j=0; j!=N; j+=4){
        num0 = _mm_load_ps(&A[i][j]);
        num1 = _mm_load_ps(X + j);
        num3 += _mm_mul_ps(num0, num1);
    }
    num4 = _mm_hadd_ps(num3, num3);
    num4 = _mm_hadd_ps(num4, num4);
    _mm_store_ss((float *)Y + i, num4);
}

[Figure: Y = A x X with A an N x N matrix; num0 holds four elements of a row of A, num1 the matching four elements of X, and num3 the running vector sum for y0.]

After the j loop finishes its execution, num3 contains the output data of y0: num3 = [ya, yb, yc, yd] and y0 = ya+yb+yc+yd.
After the 1st hadd: num4 = [ya+yb, yc+yd, ya+yb, yc+yd]
After the 2nd hadd: num4 = [ya+yb+yc+yd, ya+yb+yc+yd, ya+yb+yc+yd, ya+yb+yc+yd]

20 Example – MVM with SSE (2)

for (i=0; i!=N; i+=2){
    num5 = _mm_setzero_ps();
    num6 = _mm_setzero_ps();
    for (j=0; j!=N; j+=4){
        num3 = _mm_load_ps(&A[i][j]);
        num4 = _mm_load_ps(X + j);
        num5 += _mm_mul_ps(num3, num4);
        num3 = _mm_load_ps(&A[i+1][j]);
        num6 += _mm_mul_ps(num3, num4);
    }
    num5 = _mm_hadd_ps(num5, num5);
    num5 = _mm_hadd_ps(num5, num5);   // two hadds reduce all four lanes, as on the previous slide
    _mm_store_ss((float *)Y + i, num5);
    num6 = _mm_hadd_ps(num6, num6);
    num6 = _mm_hadd_ps(num6, num6);
    _mm_store_ss((float *)Y + i + 1, num6);
}

This is the previous code after loop unroll (factor 2 on the i loop) and scalar replacement: the X array is accessed two times less (data reuse through num4), at the cost of using more registers.

[Figure: Y = A x X; num5 and num6 accumulate two consecutive rows of A against the same num4 elements of X.]

21 dL1 accesses: N + N² + N²/unroll_factor (for the Y, A, and X arrays, respectively)

22 Speeding up MVM for regular matrices using SIMD (4)
There are several ways to sum the Y array's intermediate results:
a) accumulate the four values of each XMM register, pack the results into new registers, and store each register directly
b) accumulate the four values of each XMM register and store each single value separately
c) pack the Y values into new registers in such a way that elements of different registers are added

23 Example – MVM with SSE (5)
Assume the previous code with unroll factor 4; therefore, 4 XMM registers are needed for the results:
y1: [y1a y1b y1c y1d]   y2: [y2a y2b y2c y2d]   y3: [y3a y3b y3c y3d]   y4: [y4a y4b y4c y4d]
Storing each register separately costs 2 hadd instructions (to get Y[0] = y1a+y1b+y1c+y1d) plus 1 store_ss instruction per register; normally hadd needs more than 5 cycles, and store_ss latency is about 2x that of store_ps.
If the partial sums are instead transposed across registers,
[y1a y2a y3a y4a]   [y1b y2b y3b y4b]   [y1c y2c y3c y4c]   [y1d y2d y3d y4d]
then 3 add_ps instructions plus 1 store_ps instruction produce Y[0], Y[1], Y[2], Y[3] at once.

24 Example – MVM with SSE (6)
Assume the previous code with unroll factor 4; therefore, 4 XMM registers are needed for the results:
y1: [y1a y1b y1c y1d]   y2: [y2a y2b y2c y2d]   y3: [y3a y3b y3c y3d]   y4: [y4a y4b y4c y4d]
m1 = unpacklo_ps(y1, y2)   ->  m1: [y1c y2c y1d y2d]
m2 = unpacklo_ps(y3, y4)   ->  m2: [y3c y4c y3d y4d]
k1 = shuffle_ps(m1, m2, (1,0,1,0))   ->  k1: [y1c y2c y3c y4c]
k2 = shuffle_ps(m1, m2, (3,2,3,2))   ->  k2: [y1d y2d y3d y4d]
Apply the same procedure using unpackhi_ps() to obtain k3: [y1a y2a y3a y4a] and k4: [y1b y2b y3b y4b].
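A hedged sketch of this 4x4 register transposition plus the final reduction; it is written with the intrinsics' element-0-first ordering (the convention of Intel's stock _MM_TRANSPOSE4_PS macro), whereas the slide's diagrams draw the lanes in the opposite order, so unpacklo/unpackhi appear swapped relative to the slide:

#include <xmmintrin.h>

/* Transpose four partial-sum registers in place. */
void transpose4(__m128 *y1, __m128 *y2, __m128 *y3, __m128 *y4) {
    __m128 lo12 = _mm_unpacklo_ps(*y1, *y2);   /* [y1a y2a y1b y2b] */
    __m128 lo34 = _mm_unpacklo_ps(*y3, *y4);   /* [y3a y4a y3b y4b] */
    __m128 hi12 = _mm_unpackhi_ps(*y1, *y2);   /* [y1c y2c y1d y2d] */
    __m128 hi34 = _mm_unpackhi_ps(*y3, *y4);   /* [y3c y4c y3d y4d] */
    *y1 = _mm_shuffle_ps(lo12, lo34, _MM_SHUFFLE(1,0,1,0));   /* [y1a y2a y3a y4a] */
    *y2 = _mm_shuffle_ps(lo12, lo34, _MM_SHUFFLE(3,2,3,2));   /* [y1b y2b y3b y4b] */
    *y3 = _mm_shuffle_ps(hi12, hi34, _MM_SHUFFLE(1,0,1,0));   /* [y1c y2c y3c y4c] */
    *y4 = _mm_shuffle_ps(hi12, hi34, _MM_SHUFFLE(3,2,3,2));   /* [y1d y2d y3d y4d] */
}

/* After the transpose, 3 add_ps + 1 aligned store produce Y[0..3] at once. */
void reduce_store(float *Y, __m128 y1, __m128 y2, __m128 y3, __m128 y4) {
    transpose4(&y1, &y2, &y3, &y4);
    _mm_store_ps(Y, _mm_add_ps(_mm_add_ps(y1, y2), _mm_add_ps(y3, y4)));   /* Y must be 16-byte-aligned */
}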

25 MMM – Project 1b (1)

// scalar, tiled code
for (jj=0; jj!=M; jj+=Tile)
    for (i=0; i!=M; i++)
        for (j=jj; j!=jj+Tile; j++)
            for (k=0; k!=M; k++)
                C[M*i+j] += A[M*i+k] * Btrans[M*j+k];

// vectorized with SSE
for (jj=0; jj!=M; jj+=Tile)
    for (i=0; i!=M; i++)
        for (j=jj; j!=jj+Tile; j++){
            num3 = _mm_setzero_ps();
            for (k=0; k!=M; k+=4){
                num0 = _mm_load_ps(A + M*i + k);
                num1 = _mm_load_ps(Btrans + M*j + k);
                num3 += _mm_mul_ps(num0, num1);
            }
            num4 = _mm_hadd_ps(num3, num3);
            num4 = _mm_hadd_ps(num4, num4);
            _mm_store_ss((float *)C + M*i + j, num4);
        }

[Figure: C = A x Btrans; XMM0 holds four elements of a row of A, XMM1 the matching four elements of a row of Btrans (i.e., a column of B), XMM2 the accumulator; the j loop walks a Tile-wide strip of C.]

26 MMM – Project 1b (2)
2 rows of A[] (2 x M x 4 bytes) and Tile columns of Btrans[] (Tile x M x 4 bytes) fit in the L1 data cache:
(Tile + 2) x M x 4 ≈ L1 data cache size
L2 accesses = N² + N²/Tile + N²
Important: the tiles are written in consecutive main memory locations.
[Figure: the memory hierarchy (main memory, L2 unified cache, L1 instruction/data caches, RF, CPU) beside the tiled C = A x Btrans picture from the previous slide.]
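A worked instance of the sizing rule (my numbers, for illustration only): assuming M = 512 single-precision elements per row and a 32 KB L1 data cache, (Tile + 2) x 512 x 4 ≈ 32768 gives Tile ≈ 16 - 2 = 14, so a tile of about 14 rows of Btrans plus two rows of A fills L1.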

27 If-condition on SSE

for (i=0; i < n; i++)
    if ( x[i] > 2 || x[i] < -2 )
        a[i] += x[i];

[Figure: lane-by-lane illustration; comparing the four x lanes against the constants 2 and -2 yields a mask of all-ones/all-zeros lanes, and only the lanes that satisfy the condition (here a[i] and a[i+1]) receive a[i] + x[i], while a[i+2] and a[i+3] are left unchanged.]
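A hedged sketch of how this if-condition can be vectorized with SSE compare and mask instructions (assumed implementation, not taken from the slides; assumes n is a multiple of 4):

#include <xmmintrin.h>

void masked_add(float *a, const float *x, int n) {
    __m128 two    = _mm_set1_ps(2.0f);
    __m128 negtwo = _mm_set1_ps(-2.0f);
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 va = _mm_loadu_ps(a + i);
        /* mask lane = all ones where (x > 2) || (x < -2), else all zeros */
        __m128 mask = _mm_or_ps(_mm_cmpgt_ps(vx, two), _mm_cmplt_ps(vx, negtwo));
        /* zero out the lanes that fail the condition, then add */
        va = _mm_add_ps(va, _mm_and_ps(mask, vx));
        _mm_storeu_ps(a + i, va);
    }
}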

28 Thank you
Date: 22/11/2017

