Embedded Systems Seminar Heterogeneous Memory Management for Embedded Systems By O.Avissar, R.Barua and D.Stewart. Presented by Kumar Karthik.



2 Heterogeneous Memory Heterogeneous = composed of different types of memory. Embedded systems typically come with a small amount of on-chip SRAM, a moderate amount of off-chip SRAM, a considerable amount of off-chip DRAM, and large amounts of EEPROM (Flash memory).

3 Relative RAM Costs and Latencies
Latency: on-chip SRAM < off-chip SRAM < on-chip DRAM < off-chip DRAM
Cost: on-chip SRAM > off-chip SRAM > on-chip DRAM > off-chip DRAM

4 Caches in Embedded Chips Caches are power hungry, and cache miss penalties make it hard to give real-time performance guarantees. Solution: do away with caches and create a non-overlapping address space for systems with heterogeneous memory units (DRAM, SRAM, EEPROM).

5 Memory Allocation in ES Memory allocation for program data is done by the embedded systems programmer, in software, as current compilers are not capable of doing it over heterogeneous memory units. This code is written in assembly: tedious and non-portable. Solution: an intelligent compilation strategy that can achieve optimal memory allocation in embedded systems.


7 Memory Allocation Example

8 The Need for Profiling Recall the RAM latencies: allocation is optimal if the most frequently accessed code sections are stored in the memory unit with the lowest latency. The access frequencies of memory references therefore need to be measured. Solution: profiling.

9 Intelligent Compilers The intelligent compiler must be able to:
1. Optimally allocate memory to program data
2. Base memory allocation on frequency estimates collected through profiling
3. Correlate memory accesses with the variables they access
Task 3 demands inter-procedural pointer analysis, which is costly.

10 Profiling Instead of pointer analysis, a more efficient statistical method is used: each accessed address is checked against a table of address ranges for the different variables. This provides exact statistics, unlike pointer analysis.

11 Memory Access Times The total access time (the sum over all memory accesses in the program) needs to be minimized. The formulation is first defined for global variables and then extended for heap and stack variables.

12 Formulation for Global Variables Key terms:
T_rj * N_r(v_i) – total time taken for the N_r reads of variable v_i when it is stored on memory unit j.
T_wj * N_w(v_i) – total time taken for the N_w writes of variable v_i when it is stored on memory unit j.
I_j(v_i) – a 0/1 integer variable: 1 if v_i is allocated on unit j, 0 otherwise.

13 Formulation for Global Variables
Total access time = ∑ (j=1..U) ∑ (i=1..G) I_j(v_i) * [T_rj * N_r(v_i) + T_wj * N_w(v_i)]
U = number of memory units, G = number of global variables.
The term T_rj * N_r(v_i) + T_wj * N_w(v_i) contributes to the inner sum only if variable v_i is stored in memory unit j (otherwise I_j(v_i) = 0 and the whole term is 0).

14 0/1 Integer Linear Program Solver The 0/1 integer linear program solver searches the assignments in the summation for the one with the lowest total memory access time and returns this solution to the compiler. The solution is the optimal memory allocation. MATLAB is used as the solver in this paper.

15 Constraints The following constraints also hold:
The embedded processor allows at most one memory access per cycle; overlapping memory latencies are not considered.
Every variable is allocated on exactly one memory unit.
The sum of the sizes of all the variables allocated to a particular memory unit must not exceed the size of that unit.
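For a tiny instance, the objective of slide 13 and the constraints above can be brute-forced rather than handed to an ILP solver. The sketch below is illustrative only: the unit latencies, capacities, and profile counts are invented, and a real compiler would pass the same 0/1 formulation to a solver instead of enumerating.

```python
from itertools import product

# Hypothetical latencies (cycles) and capacities (bytes) for two memory
# units, plus invented per-variable profile data.
units = [
    {"name": "on-chip SRAM", "t_read": 1, "t_write": 1, "size": 64},
    {"name": "off-chip DRAM", "t_read": 10, "t_write": 12, "size": 1024},
]
variables = [
    {"name": "a", "size": 40, "n_reads": 500, "n_writes": 100},
    {"name": "b", "size": 40, "n_reads": 50, "n_writes": 10},
    {"name": "c", "size": 20, "n_reads": 300, "n_writes": 300},
]

def total_access_time(assignment):
    """Objective of slide 13: sum of T_rj*N_r(v_i) + T_wj*N_w(v_i)."""
    return sum(
        units[j]["t_read"] * v["n_reads"] + units[j]["t_write"] * v["n_writes"]
        for v, j in zip(variables, assignment)
    )

def feasible(assignment):
    """Capacity constraint of slide 15: allocated sizes fit each unit."""
    for j, u in enumerate(units):
        used = sum(v["size"] for v, k in zip(variables, assignment) if k == j)
        if used > u["size"]:
            return False
    return True

# Each assignment places every variable on exactly one unit, which is
# what the 0/1 variables I_j(v_i) encode.
best = min(
    (a for a in product(range(len(units)), repeat=len(variables)) if feasible(a)),
    key=total_access_time,
)
for v, j in zip(variables, best):
    print(v["name"], "->", units[j]["name"])
```

With these invented numbers the hot variables a and c land in SRAM and the rarely touched b is pushed to DRAM, which is exactly the behavior the profile-driven objective is meant to produce.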

16 Stack Variables The formulation is extended to local variables, procedure parameters, and return variables (collectively known as stack variables). Stacks are sequentially allocated abstractions, much like arrays. Distributing stacks over heterogeneous memory units further optimizes memory allocation.

17 Stack split example

18 Distributed Stacks Multiple stack pointers are needed: in the example, two stack pointers have to be incremented on procedure entry (one for each split of the stack) and two have to be decremented on leaving the procedure. Maintaining two stack pointers induces overhead.
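The bookkeeping described above can be sketched as a small simulation: one stack pointer per memory region, each adjusted on entry and exit by however many bytes of the frame land in that region. The region names and frame layout below are invented for illustration; in the compiler this would be a few extra pointer-adjust instructions in the prologue and epilogue, not a runtime object.

```python
# Simulation of a distributed stack: one stack pointer per memory
# region that a procedure's split frame touches.
class DistributedStack:
    def __init__(self, regions):
        self.sp = {r: 0 for r in regions}   # one stack pointer per region

    def enter(self, frame):
        # frame maps region -> bytes of locals placed in that region
        for region, nbytes in frame.items():
            self.sp[region] += nbytes       # bump each affected pointer

    def leave(self, frame):
        for region, nbytes in frame.items():
            self.sp[region] -= nbytes       # pop each affected pointer

stack = DistributedStack(["sram", "dram"])
frame_f = {"sram": 16, "dram": 32}          # f's frame split across 2 units
stack.enter(frame_f)
print(stack.sp)                              # both pointers advanced
stack.leave(frame_f)
print(stack.sp)                              # both pointers restored
```

The overhead the slide mentions is visible here: a split frame costs two pointer updates on entry and two on exit, where an unsplit frame would cost one of each.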

19 Distributed Stacks The software overhead is tolerated for long-running procedures and eliminated for short procedures by allocating each such stack frame to a single memory unit (one stack pointer per procedure). Distributed stacks are implemented by the compiler for ease of use: the abstraction of the stack as a contiguous data structure is maintained for the programmer.

20 Comparison to Globals Stack variables have limited lifetimes compared to globals: they are 'live' while a particular procedure is executing and can be garbage collected once the procedure is exited. Hence variables with non-overlapping lifetimes can share the same address space, and their total size can be larger than that of the memory unit they are stored in.
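The sharing argument can be made concrete with a toy calculation: the capacity a unit actually needs is the peak of the sizes that are live at the same time, not the sum of all sizes. The (start, end) lifetime intervals and byte counts below are invented for illustration.

```python
# Variables with disjoint lifetimes can share space: required capacity
# is the peak simultaneously-live size, not the sum of all sizes.
frames = {                      # variable -> (size in bytes, live interval)
    "f_locals": (48, (1, 2)),
    "g_locals": (64, (3, 4)),   # g runs only after f has returned
    "h_locals": (32, (3, 4)),   # h's locals live alongside g's
}

def peak_live_size(frames):
    # Check every lifetime endpoint; the peak occurs at one of them.
    events = sorted({t for _, (s, e) in frames.values() for t in (s, e)})
    return max(
        sum(size for size, (s, e) in frames.values() if s <= t <= e)
        for t in events
    )

total = sum(size for size, _ in frames.values())   # naive: sum of all sizes
peak = peak_live_size(frames)                      # what the unit must hold
print(total, peak)
```

Here the three variables total 144 bytes, but because f's locals are dead before g and h run, a 96-byte unit suffices, which is the relaxation the slide describes.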

21 Formulation for Stack Frames There are two ways of extending the method to handle stack variables. In the first, each procedure's stack frame is stored in a single memory unit. No multiple stack pointers are needed, yet the stack is still distributed, as different stack frames may be allocated to different memory units.

22 Stack-extended Formulation Total access time = time taken to access global variables + time taken to access stack variables. The f_i terms range over the functions in the program (as each function has a stack frame).

23 Constraints Each stack frame may be stored in at most one memory unit. The stack reaches its maximum size when a call-graph leaf node is reached. A call-graph leaf node is the most deeply nested procedure called; the program's allocation will fit into memory if the frames along all paths to leaf nodes of the call graph fit into memory.
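The leaf-node feasibility check above amounts to computing the worst-case stack demand: the maximum total frame size along any root-to-leaf path of the call graph. A minimal sketch, with an invented call graph and invented frame sizes (and ignoring recursion, which needs separate handling):

```python
# Worst-case stack demand = max sum of frame sizes over any
# root-to-leaf path in the (acyclic) call graph.
call_graph = {"main": ["f", "g"], "f": ["h"], "g": [], "h": []}
frame_size = {"main": 32, "f": 64, "g": 128, "h": 16}  # bytes, invented

def max_stack_depth(fn):
    callees = call_graph[fn]
    if not callees:                       # call-graph leaf node
        return frame_size[fn]
    return frame_size[fn] + max(max_stack_depth(c) for c in callees)

print(max_stack_depth("main"))
```

In this toy graph the path main -> g (32 + 128 bytes) dominates main -> f -> h (32 + 64 + 16 bytes), so it is the path the constraint must check against each unit's capacity.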

24 Stack-extended Formulation In the second alternative, stack variables from the same procedure can be mapped to different memory units. Stack variables are thus treated like globals in the total access time formulation. However, memory requirements are relaxed as in the stack-frame case, based on the disjoint lifetimes of the stack variables.

25 Heap-extended Formulation Heap data cannot be allocated statically, as allocation frequencies and block sizes are unknown at compile time. Calls such as malloc() fall into this category, so the allocation has to be estimated using a good heuristic. Each static heap allocation site is treated as a variable v in the formulation.

26 Heap-extended Formulation The number of references to each site is counted through profiling. The variable's size is bounded by a finite multiple of the total size of memory allocated at that site. If a malloc() site allocates 20 bytes 8 times over in a program, 160 bytes is the size of v; this is multiplied by a safety factor of 2 to give 320 bytes as the allocation size for this site.
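The sizing heuristic is a single multiplication, shown here with the slide's own numbers (20 bytes, 8 calls, safety factor 2); the function name is mine, not the paper's.

```python
# Slide 26's heuristic: bound a heap site's "variable" size by a
# safety-factor multiple of its profiled total allocation.
def site_allocation_size(bytes_per_call, calls_profiled, safety_factor=2):
    return bytes_per_call * calls_profiled * safety_factor

print(site_allocation_size(20, 8))   # 20 bytes * 8 calls * 2 = 320
```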

27 Heap-extended Formulation This optimizes for the common case. Calls like malloc() are cloned for each memory level, each of which maintains its own free list. If the allocation size is exceeded at runtime (the maximum size is passed as a parameter for each call site), a memory block from a slower and larger memory is returned.

28 Heap-extended Formulation Latency would thus be ≤ the latency of the slowest memory. If real-time guarantees are needed, all heap allocation must be assumed to go to the slowest memory.

29 Experiment The compiler was implemented as an extension to the widely used GCC cross-compiler, targeting the Motorola M-Core processor. The benchmarks used represent code in typical embedded applications. Runtimes were normalized to a configuration using only the fastest memory type (SRAM); slower memories were then introduced for subsequent tests to measure runtimes.

30 Results

31 Results Using 20% SRAM and the rest DRAM still produces runtimes close to the all-SRAM case: cheaper, without much performance loss. This shows that (at least for the benchmark programs) the memory allocation is close to optimal. The FIB benchmark, a linear recurrence computing Fibonacci numbers, is an exception, with an equal number of accesses to all variables.

32 Experiment 2 Ample DRAM and EEPROM were provided while the SRAM size was varied for each of the benchmark programs. This helps determine the minimum amount of SRAM needed to maintain performance reasonably close to the 100% SRAM case.

33 FIR Benchmark

34 Matrix multiplication benchmark

35 Fibonacci series benchmark

36 Byte to ASCII converter

37 Results It is clear that the most frequently accessed code is between 10 and 20% of the entire program. This portion is successfully placed in SRAM through the profile-based optimization.

38 Comparing Stack frames and stack variables

39 Results The BMM benchmark is used as it has the largest number of functions/procedures (hence the largest number of stack frames/variables). Allocating stack variables individually to different units performs better in theory, due to the finer granularity and thus a more customized allocation. The difference is apparent at the smaller SRAM sizes.

40 Applications The approach in the paper can be used to determine an optimal trade-off between minimizing SRAM size and meeting performance requirements.

41 Adapting to Pre-emption In context-switching environments, the data of every live program has to reside in memory at all times. The variables of all the live programs are combined, and the formulation is solved after multiplying each variable's access frequencies by the relative frequency of its context. An optimal allocation is achieved in this case as well.
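The merging step above can be sketched as a weighted sum over contexts: each program's profiled access counts are scaled by how often its context runs, and the merged variable set is then fed to the usual allocation formulation. The context weights and access counts below are invented.

```python
# Merge per-context profiles into one weighted profile for allocation.
contexts = [
    {"weight": 0.75, "accesses": {"p1.buf": 1000, "p1.idx": 400}},
    {"weight": 0.25, "accesses": {"p2.tab": 2000}},
]

merged = {}
for ctx in contexts:
    for var, n in ctx["accesses"].items():
        # Scale each program's access counts by its context frequency.
        merged[var] = merged.get(var, 0) + ctx["weight"] * n

print(merged)
```

The merged counts then play the role of N_r and N_w in the single-program formulation, so a variable in a rarely scheduled context competes for fast memory with correspondingly less weight.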

42 Summary A compiler method to distribute program data efficiently among heterogeneous memories:
No caching hardware is used
Static allocation across memory units
Stack distribution
Optimality guarantee
Runtime depends on relative access frequencies

43 Related Work There is not much work on cache-less embedded chips with heterogeneous memory units; the memory allocation task is usually left to the programmer. The compiler method is better for larger, more complex programs: it is error-free and portable across different systems with only minor modifications to the compiler.

44 Related Work Panda et al. and Sjodin et al. have researched memory allocation in cached embedded chips. Cached systems spend more effort on minimizing cache misses than on minimizing memory access times, with no optimality guarantee. Earlier studies take into account only 2 memory levels (SRAM and DRAM), while this formulation can be extended to N levels of memory.

45 Related Work Dynamic allocation strategies are also possible but are not explored here. Software caching (emulation of a cache in fast memory) is one option; methods to overcome its software overhead need to be devised, and its inability to provide real-time guarantees should be addressed. THE END

