OPTIMIZING C CODE FOR THE ARM PROCESSOR
Optimizing code takes time and reduces source code readability.
It is usually done for functions that are critical for performance or power consumption and are executed frequently.
It is usually done in combination with profiling.

LOCAL VARIABLES
ARM registers are 32 bits wide, so 32-bit data types are the most efficient choice for local variables.
Use int and unsigned int, and avoid char and short.
The only exception is when you want 8-bit or 16-bit wraparound (modulo arithmetic) to occur.
unsigned int is more efficient than int for division.
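
The advice on narrow types can be illustrated with a minimal sketch. The function names are hypothetical and not taken from the slides. The first two functions compute the same sum; the third shows the one case where a narrow type is the natural choice, because wraparound is the intended behaviour.

```c
#include <stdint.h>

/* Hypothetical helper: 8-bit loop counter. On ARM the compiler has to
   keep i within 0..255, which typically costs an extra instruction
   per iteration. */
int sum_with_char_counter(const int *data, uint8_t n)
{
    uint8_t i;   /* narrow type: avoid unless wraparound is wanted */
    int sum = 0;
    for (i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Same loop with a register-width counter: no range fix-up needed. */
int sum_with_int_counter(const int *data, unsigned int n)
{
    unsigned int i;
    int sum = 0;
    for (i = 0; i < n; i++) {
        sum += data[i];
    }
    return sum;
}

/* Intended wraparound: an 8-bit sequence number that goes from 255
   back to 0. */
uint8_t next_sequence_number(uint8_t seq)
{
    return (uint8_t)(seq + 1u);
}
```

Whether the narrow counter really costs an extra instruction depends on the compiler and optimization level, so it is worth checking the generated assembly.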

LOOP STRUCTURES (incrementing for loop)

int checksum_v5(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 0; i < 64; i++) {
        sum += *(data++);
    }
    return sum;
}

checksum_v5
        MOV     r2,r0               ; r2 = data
        MOV     r0,#0               ; sum = 0
        MOV     r1,#0               ; i = 0
checksum_v5_loop
        LDR     r3,[r2],#4          ; r3 = *(data++)
        ADD     r1,r1,#1            ; i++
        CMP     r1,#0x40            ; compare i, 64
        ADD     r0,r3,r0            ; sum += r3
        BCC     checksum_v5_loop    ; if (i < 64) goto loop
        MOV     pc,r14              ; return sum

LOOP STRUCTURES (decrementing for loop)

int checksum_v6(int *data)
{
    unsigned int i;
    int sum = 0;

    for (i = 64; i != 0; i--) {
        sum += *(data++);
    }
    return sum;
}

checksum_v6
        MOV     r2,r0               ; r2 = data
        MOV     r0,#0               ; sum = 0
        MOV     r1,#0x40            ; i = 64
checksum_v6_loop
        LDR     r3,[r2],#4          ; r3 = *(data++)
        SUBS    r1,r1,#1            ; i-- and set flags
        ADD     r0,r3,r0            ; sum += r3
        BNE     checksum_v6_loop    ; if (i != 0) goto loop
        MOV     pc,r14              ; return sum
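
The count-down idea also works with a do-while loop, which skips the initial loop test. The following is a minimal sketch under the assumption that the caller guarantees N >= 1; the name checksum_dowhile is hypothetical and not from the slides.

```c
/* Hypothetical variant of checksum_v6. Assumes N >= 1, because a
   do-while body always runs at least once. */
int checksum_dowhile(const int *data, unsigned int N)
{
    int sum = 0;
    do {
        sum += *(data++);
    } while (--N != 0);   /* count down to zero: no separate compare */
    return sum;
}
```

If N can be zero, the function needs a guard before the loop, which gives back part of the saving.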

LOOP UNROLLING

int checksum_v7(int *data, unsigned int N)
{
    int sum = 0;

    /* N must be a nonzero multiple of 4 */
    do {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        N -= 4;
    } while (N != 0);
    return sum;
}

checksum_v7
        MOV     r2,#0               ; sum = 0
checksum_v7_loop
        LDR     r3,[r0],#4          ; r3 = *(data++)
        SUBS    r1,r1,#4            ; N -= 4 and set flags
        ADD     r2,r3,r2            ; sum += r3
        LDR     r3,[r0],#4          ; r3 = *(data++)
        ADD     r2,r3,r2            ; sum += r3
        LDR     r3,[r0],#4          ; r3 = *(data++)
        ADD     r2,r3,r2            ; sum += r3
        LDR     r3,[r0],#4          ; r3 = *(data++)
        ADD     r2,r3,r2            ; sum += r3
        BNE     checksum_v7_loop    ; if (N != 0) goto loop
        MOV     r0,r2               ; r0 = sum
        MOV     pc,r14              ; return r0
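
Because checksum_v7 subtracts 4 from N on every pass and stops only when N reaches zero, it works only when N is a nonzero multiple of four. One possible way to lift that restriction is sketched below: the unrolled loop handles the full groups of four and a short second loop handles the remainder. The name checksum_unrolled is hypothetical.

```c
/* Hypothetical extension of checksum_v7: accepts any N, including 0. */
int checksum_unrolled(const int *data, unsigned int N)
{
    int sum = 0;
    unsigned int blocks = N >> 2;   /* number of full groups of four */
    unsigned int rest = N & 3;      /* 0..3 leftover elements */

    while (blocks != 0) {
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        sum += *(data++);
        blocks--;
    }
    while (rest != 0) {
        sum += *(data++);
        rest--;
    }
    return sum;
}
```

The extra loop adds code size and a few instructions of overhead, so the restricted version is still preferable when the caller can guarantee a multiple of four.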

LOOP UNROLLING EXAMPLE
Unroll the following loop by a factor of 2, 4, and 8:

for (i = 0; i < 64; i++) {
    a[i] = b[i] + c[i+1];
}

Factor of 2
for (i = 0; i < 32; i++) {
    a[2*i]   = b[2*i]   + c[2*i+1];
    a[2*i+1] = b[2*i+1] + c[2*i+1+1];
}

Factor of 4
for (i = 0; i < 16; i++) {
    a[4*i]   = b[4*i]   + c[4*i+1];
    a[4*i+1] = b[4*i+1] + c[4*i+1+1];
    a[4*i+2] = b[4*i+2] + c[4*i+2+1];
    a[4*i+3] = b[4*i+3] + c[4*i+3+1];
}

Factor of 8
for (i = 0; i < 8; i++) {
    a[8*i]   = b[8*i]   + c[8*i+1];
    a[8*i+1] = b[8*i+1] + c[8*i+1+1];
    a[8*i+2] = b[8*i+2] + c[8*i+2+1];
    a[8*i+3] = b[8*i+3] + c[8*i+3+1];
    a[8*i+4] = b[8*i+4] + c[8*i+4+1];
    a[8*i+5] = b[8*i+5] + c[8*i+5+1];
    a[8*i+6] = b[8*i+6] + c[8*i+6+1];
    a[8*i+7] = b[8*i+7] + c[8*i+7+1];
}

REGISTER ALLOCATION
Limit the number of local variables in the inner loop of a function to 12.
Use the most important variables inside the innermost loop; this tells the compiler which ones to keep in registers.

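CALLING FUNCTIONS
Try to restrict functions to four arguments, because the first four are passed in registers r0-r3 and any further ones are passed on the stack.
Use structures to group related arguments, and pass structure pointers instead of many separate arguments.
Define small functions in the same source file as the functions that call them, and before those callers, so that the compiler can inline them.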
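
The structure-pointer advice can be sketched as follows. The Rect type and both function names are hypothetical examples, not from the slides. The two functions compute the same value; they differ only in how the inputs reach the callee.

```c
/* Hypothetical example type: groups five related values. */
typedef struct {
    int x0, y0;   /* top-left corner */
    int x1, y1;   /* bottom-right corner */
    int scale;    /* multiplier applied to the area */
} Rect;

/* Five scalar arguments: under the ARM procedure call standard the
   fifth one does not fit in r0-r3 and is passed on the stack. */
int scaled_area_args(int x0, int y0, int x1, int y1, int scale)
{
    return (x1 - x0) * (y1 - y0) * scale;
}

/* One pointer argument: it travels in r0, and the fields are loaded
   from memory inside the callee. */
int scaled_area_ptr(const Rect *r)
{
    return (r->x1 - r->x0) * (r->y1 - r->y0) * r->scale;
}
```

The pointer version pays for loads inside the callee, so it is most useful when the values already live in a structure or are passed through several call levels.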

REGISTER ALLOCATION
Limit the number of inner-loop variables to 12 so that they can all be held in registers.

SUMMARY
Use signed int and unsigned int types for local variables, function arguments, and return values.
The most efficient form of loop is the do-while loop that counts down to zero.
Unroll important loops.
Try to limit functions to four arguments.
Avoid divisions; multiply by the reciprocal instead.
Use the inline assembler.
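
Multiplication by a reciprocal can be sketched for one fixed divisor. The example below divides an unsigned 32-bit value by 10 using a 64-bit multiply and a shift; the function name div10 is hypothetical, and the constant is specific to the divisor 10. Other divisors need their own constant and shift.

```c
#include <stdint.h>

/* Divide an unsigned 32-bit value by 10 without a division.
   0xCCCCCCCD is 2^35 / 10 rounded up; the 64-bit product shifted
   right by 35 equals x / 10 for every 32-bit x. */
uint32_t div10(uint32_t x)
{
    return (uint32_t)(((uint64_t)x * 0xCCCCCCCDu) >> 35);
}
```

Many compilers already apply this transformation when the divisor is a compile-time constant, so writing it by hand is mainly useful when inspecting the generated code shows a call to a division routine.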

ARM INLINE ASSEMBLY

#include <stdio.h>

int main(void)
{
    int n1, n2, m;

    n1 = 5;
    n2 = 3;
    __asm   // inline assembly code
    {
        MUL m, n1, n2
    }
    printf("The result is %d\n", m);
    return 0;
}

USING INLINE ASSEMBLY
Inline assembly is used for ARM instructions that the C compiler does not generate, for example coprocessor instructions and instruction set extensions.
It creates portability issues.

ALTERNATIVE: CALLING ASSEMBLY FUNCTION FROM C

#include <stdio.h>

int n1, n2, m;              // global, so that the assembly file can IMPORT them
extern void multip(void);

int main(void)
{
    n1 = 5;                 // assigning numbers
    n2 = 3;
    multip();               // calling function
    printf("The result is %d\n", m);
    return 0;
}

Assembly function

        AREA    example, CODE, READONLY
        EXPORT  multip              ; external function name
        IMPORT  n1                  ; input
        IMPORT  n2                  ; input
        IMPORT  m                   ; result variable
multip                              ; function begins
        LDR     r3,=n1              ; load data from memory to registers
        LDR     r1,[r3]
        LDR     r3,=n2
        LDR     r2,[r3]
        MUL     r0,r1,r2
        LDR     r3,=m
        STR     r0,[r3]             ; store result to m memory location
        MOV     pc,lr               ; return from call
        END

PORTABILITY ISSUES
Char type: plain char is unsigned on ARM but signed on many other processors.
Alignment: ARM load and store instructions (LDR, STR) assume that the address is a multiple of the size of the type being loaded or stored.
Endianness: ARM is little-endian by default and can be configured as big-endian.
Inline assembly: separate inline assembly into small inlined functions.
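
Each of the four issues has a common defensive idiom, sketched below. All function names are hypothetical. The inline assembly uses GCC-style syntax, which is an assumption; the slides use the armcc __asm { } form. The assembly path is compiled only for ARM state under a GCC-compatible compiler, and every other build uses the plain C fallback.

```c
#include <stdint.h>
#include <string.h>

/* Inline assembly: keep it in one small inline wrapper so callers
   never see it and non-ARM builds fall back to plain C.
   GCC-style syntax is assumed here. */
static inline int mul32(int a, int b)
{
#if defined(__GNUC__) && defined(__arm__) && !defined(__thumb__)
    int r;
    __asm__("mul %0, %1, %2" : "=&r"(r) : "r"(a), "r"(b));
    return r;
#else
    return a * b;
#endif
}

/* Char type: say signed or unsigned explicitly when the sign matters. */
static inline int is_negative_byte(signed char c)
{
    return c < 0;
}

/* Alignment: copy the bytes instead of dereferencing a pointer that
   may not be aligned for uint32_t. */
static inline uint32_t read_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Endianness: assemble the value from individual bytes so the result
   is the same on little-endian and big-endian configurations. */
static inline uint32_t read_u32_le(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Note that read_u32_unaligned returns the value in the host byte order, while read_u32_le always interprets the bytes as little-endian; which one is appropriate depends on where the data comes from.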

EXAMPLE
Write a program that reads 8-element row and column vectors from memory and
– multiplies both by a scalar also found in memory
– calculates the scalar product of the two vectors
– Assume that no partial product exceeds 32 bits
– Use v1 = [1 2 3 4 5 6 7 8], v2 = [0 1 2 3 4 5 6 7]T, s = 5 as test inputs
Unroll the loop by two and by four.
Repeat using inline assembly for the multiplications.
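
One possible starting point for the exercise is sketched below in plain C, covering only the unroll-by-two case. All names are hypothetical, and the vectors are held in local arrays as a stand-in for "memory". The unroll-by-four and inline assembly variants are left open.

```c
enum { VEC_LEN = 8 };

/* Multiply every element of v by the scalar s, in place. */
void scale_vector(int *v, int s)
{
    unsigned int i = VEC_LEN;
    do {
        *v *= s;
        v++;
    } while (--i != 0);
}

/* Scalar product, unrolled by two, counting down to zero. */
int dot_product_unroll2(const int *a, const int *b)
{
    int sum = 0;
    unsigned int i = VEC_LEN / 2;
    do {
        sum += *(a++) * *(b++);
        sum += *(a++) * *(b++);
    } while (--i != 0);
    return sum;
}

/* Runs the exercise with the test inputs given on the slide. */
int run_example(void)
{
    int v1[VEC_LEN] = {1, 2, 3, 4, 5, 6, 7, 8};
    int v2[VEC_LEN] = {0, 1, 2, 3, 4, 5, 6, 7};
    const int s = 5;

    scale_vector(v1, s);
    scale_vector(v2, s);
    return dot_product_unroll2(v1, v2);
}
```

With the slide's inputs the unscaled scalar product is 168, and scaling both vectors by 5 multiplies it by 25, so run_example() returns 4200.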
