- 1 - EE898- Compiler Compilers for embedded systems: Why are compilers an issue?  Many reports about low efficiency of standard compilers  Special.

- 1 - EE898- Compiler Compilers for embedded systems: Why are compilers an issue?  Many reports about low efficiency of standard compilers  Special features of embedded processors have to be exploited.  High levels of optimization more important than compilation speed.  Compilers can help to reduce the energy consumption (energy optimization).  Compilers could help to meet real-time constraints.  Less legacy problems than for PCs.  There is a large variety of instruction sets.  Design space exploration for optimized processors makes sense  Many reports about low efficiency of standard compilers  Special features of embedded processors have to be exploited.  High levels of optimization more important than compilation speed.  Compilers can help to reduce the energy consumption (energy optimization).  Compilers could help to meet real-time constraints.  Less legacy problems than for PCs.  There is a large variety of instruction sets.  Design space exploration for optimized processors makes sense

- 2 - EE898- Compiler Use of assembly languages in embedded systems Assembler (DSP) Assembler (µController) C-Code (µController) C-Code (DSP) [Paulin, 1995 ] Similar situation more recently

- 3 - EE898- Compiler Optimizations considered Energy-aware compilation Compilation for digital signal processors Compilation for multimedia processors Compilation for VLIW (very long instruction word) processors Compilation for network processors Compiler generation, retargetable compilers and design space exploration Energy-aware compilation Compilation for digital signal processors Compilation for multimedia processors Compilation for VLIW (very long instruction word) processors Compilation for network processors Compiler generation, retargetable compilers and design space exploration

- 4 - EE898- Compiler Efforts for Reducing Energy Device Level –Development of Low Power Devices –Reducing Power Supply Voltage –Reducing Threshold Voltage Circuit Level –Gated Clock –Path Transistor Logic –Asynchronous Circuits System Level Fabrication level Device Level –Development of Low Power Devices –Reducing Power Supply Voltage –Reducing Threshold Voltage Circuit Level –Gated Clock –Path Transistor Logic –Asynchronous Circuits System Level Fabrication level

- 5 - EE898- Compiler Optimization for low-energy the same as optimization for high performance? int a[1000]; c = a; for (i = 1; i < 100; i++) { b += *c; b += *(c+7); c += 1; } int a[1000]; c = a; for (i = 1; i < 100; i++) { b += *c; b += *(c+7); c += 1; } LDR r3, [r2, #0] ADD r3,r0,r3 MOV r0,#28 LDR r0, [r2, r0] ADD r0,r3,r0 ADD r2,r2,#4 ADD r1,r1,#1 CMP r1,#100 BLT LL3 ADD r3,r0,r2 MOV r0,#28 MOV r2,r12 MOV r12,r11 MOV r11,rr10 MOV r0,r9 MOV r9,r8 MOV r8,r1 LDR r1, [r4, r0] ADD r0,r3,r1 ADD r4,r4,#4 ADD r5,r5,#1 CMP r5,#100 BLT LL3 2231 cycles 16.47 µJ 2096 cycles 19.92 µJ No ! High-performance if available memory bandwidth fully used; low-energy consumption if memories are at stand-by mode Reduced energy if more values are kept in registers

- 6 - EE898- Compiler Energy models Commercial tools frequently very imprecise Model of Tiwari (Dissertation, Princeton 1996): Cost of instructions and of transitions between instructions; Does not separate out the cost of memory access Model of Simunic, de Micheli (DAC 99): Model based on data sheets; does not require measurements. Does not take transitions into account. Russell, Jacome (ICCD, 1998): based on precise measurement for two fixed configurations; cannot predict effect of changes to memory architecture. Lee (LCTES 2001): detailed analysis of the effect pipeline stages; does not include multi-cycle operations and stalls  Dedicated energy models. Commercial tools frequently very imprecise Model of Tiwari (Dissertation, Princeton 1996): Cost of instructions and of transitions between instructions; Does not separate out the cost of memory access Model of Simunic, de Micheli (DAC 99): Model based on data sheets; does not require measurements. Does not take transitions into account. Russell, Jacome (ICCD, 1998): based on precise measurement for two fixed configurations; cannot predict effect of changes to memory architecture. Lee (LCTES 2001): detailed analysis of the effect pipeline stages; does not include multi-cycle operations and stalls  Dedicated energy models.

- 7 - EE898- Compiler Energy Model by Knauer and Steinke Data Memory Instruction Memory Instr IAddr Data VDD ALU Multi- plier Barrel Shifter Register File Instr. Decoder & Control Logic Instr Imm Reg Value Reg# Opcode ARM7 DAddr E total = E cpu_instr + E cpu_data + E mem_instr + E mem_data

- 8 - EE898- Compiler Instruction dependent costs in the CPU E cpu_instr =  MinCostCPU(Opcode i ) +  1 *  w(Imm i,j ) + ß 1 *  h(Imm i-1,j, Imm i,j ) +  2 *  w(Reg i,k ) + ß 2 *  h(Reg i-1,k, Reg i,k ) +  3 *  w(RegVal i,k ) + ß 3 *  h(RegVal i-1,k, RegVal i,k ) +  4 *  w(IAddr i ) + ß 4 *  h(IAddr i-1, IAddr i ) + FUCost(Instr i-1,Instr i ) E cpu_instr =  MinCostCPU(Opcode i ) +  1 *  w(Imm i,j ) + ß 1 *  h(Imm i-1,j, Imm i,j ) +  2 *  w(Reg i,k ) + ß 2 *  h(Reg i-1,k, Reg i,k ) +  3 *  w(RegVal i,k ) + ß 3 *  h(RegVal i-1,k, RegVal i,k ) +  4 *  w(IAddr i ) + ß 4 *  h(IAddr i-1, IAddr i ) + FUCost(Instr i-1,Instr i ) Cost for a sequence of m instructions w: number of ones; h: Hamming distance; FUCost: cost of switching functional units , ß: determined through experiments

- 9 - EE898- Compiler Other costs E cpu_data =   5 * w(DAddr i ) + ß 5 * h(DAddr i-1, DAddr i ) +  6 * w(Data i ) + ß 6 * h(Data i-1, Data i ) E mem_instr =  MinCostMem(InstrMem,Word_width i ) +  7 * w(IAddr i ) + ß 7 * h(IAddr i-1, IAddr i ) +  8 * w(IData i ) + ß 8 * h(IData i-1, IData i ) E mem_data =  MinCostMem (DataMem, Direction, Word_width i ) +  9 * w(DAddr i ) + ß 9 * h(DAddr i-1, DAddr i ) +  10 * w(Data i ) + ß 10 * h(Data i-1, Data i )

- 10 - EE898- Compiler Results It is not important, which address bit is set to ‘1’ The number of ‚1‘s in the address bus is irrelevant The cost of flipping a bit on the address bus is independent of the bit position. It is not important, which data bit is set to ‘1’ The number of ‚1‘s on the data bus has a minor effect (3%) The cost of flipping a bit on the data bus is independent of the bit position. It is not important, which address bit is set to ‘1’ The number of ‚1‘s in the address bus is irrelevant The cost of flipping a bit on the address bus is independent of the bit position. It is not important, which data bit is set to ‘1’ The number of ‚1‘s on the data bus has a minor effect (3%) The cost of flipping a bit on the data bus is independent of the bit position.

- 11 - EE898- Compiler Compiler optimizations for improving energy efficiency  Energy-aware scheduling  Energy-aware instruction selection  Operator strength reduction: e.g. replace * by + and <<  Minimize the bitwidth of loads and stores  Standard compiler optimizations with energy as a cost function  Energy-aware scheduling  Energy-aware instruction selection  Operator strength reduction: e.g. replace * by + and <<  Minimize the bitwidth of loads and stores  Standard compiler optimizations with energy as a cost function E.g.: Register pipelining: for i:= 0 to 10 do C:= 2 * a[i] + a[i-1]; R2:=a[0]; for i:= 1 to 10 do begin R1:= a[i]; C:= 2 * R1 + R2; R2 := R1; end;  Exploitation of the memory hierarchy

- 12 - EE898- Compiler 3 key problems for future memory systems Energy Access times 1.(Average) Speed 2.Energy/Power 3.Predictability/Worst Case Execution Time 1.(Average) Speed 2.Energy/Power 3.Predictability/Worst Case Execution Time smaller, faster, less energy

- 13 - EE898- Compiler 1. (Average) Speed Speed gap between processor and main DRAM increases 2 4 8 245 Speed years CPU (1.5-2 p.a.) DRAM (1.07 p.a.) 31 early 60ties (Atlas): page fault ~ 2500 instructions 2002 (2 GHz µP): access to DRAM ~ 500 instructions  penalty for cache miss soon same as for page fault in Atlas early 60ties (Atlas): page fault ~ 2500 instructions 2002 (2 GHz µP): access to DRAM ~ 500 instructions  penalty for cache miss soon same as for page fault in Atlas [P. Machanik: Approaches to Addressing the Memory Wall, TR Nov. 2002, U. Brisbane]  2x every 2 years 1 0

- 14 - EE898- Compiler 2. Power/Energy 0 0.50.5 1 1.51.5 2 2.52.5 641282565121024204840968192 Memory size [bytes] Energy per access [nJ] Example (CACTI Model): [Steinke et al., Inf 12, UniDo, 2002]

- 15 - EE898- Compiler 3. Predictability/WCET Predictability: For satisfying timing constraints in hard real-time systems, predictability is the most important concern; pre run-time scheduling is often the only practical means of providing predictability in a complex system [Xu, Parnas]  Time-triggered, statically scheduled operating systems What about memory accesses? –Currently available caches don‘t solve the problem: Improve the average case behavior Use “non-deterministic“ cache replacement algorithms  Scratch-pad/tightly coupled memory based predictability Predictability: For satisfying timing constraints in hard real-time systems, predictability is the most important concern; pre run-time scheduling is often the only practical means of providing predictability in a complex system [Xu, Parnas]  Time-triggered, statically scheduled operating systems What about memory accesses? –Currently available caches don‘t solve the problem: Improve the average case behavior Use “non-deterministic“ cache replacement algorithms  Scratch-pad/tightly coupled memory based predictability

- 16 - EE898- Compiler Hierarchical memories using scratch pad memories (SPM) Address space ARM7TDMI cores, well-known for low power consumption scratch pad memory 0 FFF.. main SPM processor Hierarchy Example no tag memory

- 17 - EE898- Compiler Exploitation of SPM Which segment (array, loop, etc.) to be stored in SPM? Gain g i and size s i for each segment i. Maximise gain G =  g i, respecting constraint K   s i. Static memory allocation: Solution: knapsack algorithm. Dynamic reloading: Finding optimal reloading points. Processor SPM capacity K board Main memory (On-board) ? For i.{ } for j..{ } while... Repeat call... Array... Int... Array Example:

- 18 - EE898- Compiler Reduction in energy and average run-time Multi_sort (mix of sort algorithms) Cycles

- 19 - EE898- Compiler Energy consumption per functional unit, as a function of the SPM size Parameters different from previous slide

- 20 - EE898- Compiler Hardware-support for block-copying The DMA unit was modeled in VHDL, simulated, synthesized. Unit only makes up 4% of the processor chip. The unit can be put to sleep when it is unused. The DMA unit was modeled in VHDL, simulated, synthesized. Unit only makes up 4% of the processor chip. The unit can be put to sleep when it is unused. Code size reductions of up to 23% for a 256 byte SPM were determined using the DMA unit instead of the dynamic approach that uses processor instructions for copying. DMA MEMORY CPU

- 1 - EE898- Compiler Compilers for embedded systems: Why are compilers an issue?  Many reports about low efficiency of standard compilers  Special.

Similar presentations

Presentation on theme: "- 1 - EE898- Compiler Compilers for embedded systems: Why are compilers an issue?  Many reports about low efficiency of standard compilers  Special."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

- 1 - EE898- Compiler Compilers for embedded systems: Why are compilers an issue?  Many reports about low efficiency of standard compilers  Special.

Similar presentations

Presentation on theme: "- 1 - EE898- Compiler Compilers for embedded systems: Why are compilers an issue?  Many reports about low efficiency of standard compilers  Special."— Presentation transcript:

Similar presentations

About project

Feedback