Embedded Systems Seminar: Heterogeneous Memory Management for Embedded Systems. By O. Avissar, R. Barua and D. Stewart. Presented by Kumar Karthik.


Heterogeneous Memory
Heterogeneous = different types of memory. Embedded systems come with a small amount of on-chip SRAM, a moderate amount of off-chip SRAM, a considerable amount of off-chip DRAM and large amounts of EEPROM (Flash memory).

Relative RAM costs and Latencies
Latency: on-chip SRAM < off-chip SRAM < on-chip DRAM < off-chip DRAM
Cost: on-chip SRAM > off-chip SRAM > on-chip DRAM > off-chip DRAM

Caches in Embedded Chips
Caches are power hungry, and cache miss penalties make it hard to give real-time performance guarantees. Solution: do away with caches and create a non-overlapping address space for systems with heterogeneous memory units (DRAM, SRAM, EEPROM).

Memory Allocation in ES
Memory allocation for program data is done by the embedded systems programmer, in software, as current compilers are not capable of allocating across heterogeneous memory units. The code is written in assembly: tedious and non-portable. Solution: an intelligent compilation strategy that can achieve optimal memory allocation in embedded systems.

Memory Allocation Example

The need for Profiling
Recall the relative RAM latencies: an allocation is optimal if the most frequently accessed program sections are stored in the memory unit with the lowest latency. The access frequencies of memory references therefore need to be measured. Solution: profiling.

Intelligent Compilers
The intelligent compiler must be able to:
1. Optimally allocate memory to program data
2. Base memory allocation on frequency estimates collected through profiling
3. Correlate memory accesses with the variables they access
Task 3 demands inter-procedural pointer analysis, which is costly.

Profiling
Instead of pointer analysis, a more efficient statistical method is used: each accessed address is checked against a table of address ranges for the different variables. This provides exact statistics, as opposed to pointer analysis.
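A minimal sketch of such an address-range lookup, in C. The table layout, names and linear scan here are illustrative assumptions; the paper does not specify its implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* One entry per program variable: its address range plus access counters. */
typedef struct {
    uintptr_t start;        /* first byte of the variable */
    uintptr_t end;          /* one past its last byte     */
    unsigned long reads;
    unsigned long writes;
} var_range;

/* Invoked by the instrumented program on every memory access: find the
   variable whose range contains the address and bump its counter.        */
void profile_access(var_range *table, size_t n, uintptr_t addr, int is_write)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= table[i].start && addr < table[i].end) {
            if (is_write) table[i].writes++;
            else          table[i].reads++;
            return;
        }
    }
}
```

The counters collected this way become the N_r(v_i) and N_w(v_i) terms used in the formulation below.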

Memory Access Times
The total access time (the sum over all memory accesses in the program) needs to be minimized. The formulation is first defined for global variables and then extended to stack and heap variables.

Formulation for global variables
Key terms:
T_rj · N_r(v_i) — total time taken for N_r reads of variable v_i stored on memory unit j
T_wj · N_w(v_i) — total time taken for N_w writes of variable v_i stored on memory unit j
I_j(v_i) — the 0/1 integer variables: I_j(v_i) = 1 if v_i is allocated to unit j, 0 otherwise

Formulation for global variables
Total access time:
∑_{j=1..U} ∑_{i=1..G} I_j(v_i) · [ T_rj · N_r(v_i) + T_wj · N_w(v_i) ]
where U = number of memory units and G = number of global variables. The term T_rj · N_r(v_i) + T_wj · N_w(v_i) contributes to the inner sum only if variable v_i is stored in memory unit j (if not, I_j(v_i) = 0 and the whole term is 0).

0/1 integer linear program solver
The 0/1 integer linear program solver tries out the feasible combinations of the summation to arrive at the lowest total memory access time and returns this solution to the compiler. The solution is the optimal memory allocation. MATLAB is used as the solver in this paper.
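To make the search concrete, here is a brute-force enumerator for a toy instance in C. This is purely illustrative: the paper hands the problem to MATLAB's ILP solver, and the unit count, latencies, sizes and profile counts below are invented:

```c
#include <stdio.h>
#include <limits.h>

#define U 2            /* memory units: 0 = SRAM, 1 = DRAM (hypothetical) */
#define G 3            /* global variables                                */

/* Hypothetical per-unit read/write latencies (cycles) and capacities.    */
static const long T_r[U] = { 1, 10 }, T_w[U] = { 1, 10 };
static const long cap[U] = { 8, 1024 };

/* Hypothetical per-variable profile counts and sizes (bytes).            */
static const long N_r[G]    = { 1000, 50, 400 };
static const long N_w[G]    = {  200, 10, 100 };
static const long size_v[G] = {    4,  8,   4 };

int main(void)
{
    long best = LONG_MAX;
    int  best_asg[G] = { 0 };

    /* Each assignment is a base-U number: digit i = unit of variable i.  */
    long total_asgs = 1;
    for (int i = 0; i < G; i++) total_asgs *= U;

    for (long a = 0; a < total_asgs; a++) {
        int  asg[G];
        long used[U] = { 0 }, cost = 0, x = a;
        int  feasible = 1;
        for (int i = 0; i < G; i++) {
            asg[i] = x % U; x /= U;
            used[asg[i]] += size_v[i];
            cost += T_r[asg[i]] * N_r[i] + T_w[asg[i]] * N_w[i];
        }
        for (int j = 0; j < U; j++)
            if (used[j] > cap[j]) feasible = 0;     /* capacity constraint */
        if (feasible && cost < best) {
            best = cost;
            for (int i = 0; i < G; i++) best_asg[i] = asg[i];
        }
    }

    printf("min total access time = %ld cycles\n", best);
    for (int i = 0; i < G; i++)
        printf("v%d -> unit %d\n", i, best_asg[i]);
    return 0;
}
```

Real ILP solvers prune this exponential space with branch-and-bound rather than enumerating all U^G assignments.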

Constraints
The following constraints also hold:
1. The embedded processor allows at most one memory access per cycle; overlapping memory latencies are not considered.
2. Every variable is allocated on only one memory unit.
3. The sum of the sizes of all the variables allocated to a particular memory unit must not exceed the size of the unit.
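In symbols, the last two constraints can be written as follows, using S(v_i) for the size of variable v_i and S_j for the capacity of unit j (symbol names ours):

```latex
\sum_{j=1}^{U} I_j(v_i) = 1 \quad \forall\, i \in \{1,\dots,G\}
\qquad
\sum_{i=1}^{G} I_j(v_i)\, S(v_i) \le S_j \quad \forall\, j \in \{1,\dots,U\}
```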

Stack variables
The formulation is extended to local variables, procedure parameters and return variables (collectively known as stack variables). Stacks are sequentially allocated abstractions, much like arrays. Distributing stacks over heterogeneous memory units further optimizes the memory allocation.

Stack split example

Distributed Stacks
Multiple stack pointers: in the example, two stack pointers have to be incremented on procedure entry (one for each split of the stack) and two have to be decremented on leaving the procedure. Maintaining two stack pointers induces overhead.

Distributed Stacks
The software overhead is tolerated for long-running procedures; for short procedures it is eliminated by allocating each stack frame to a single memory unit (one stack pointer per procedure). Distributed stacks are implemented by the compiler for ease of use: the abstraction of the stack as a contiguous data structure is maintained for the programmer.
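A rough sketch of what compiler-generated prologue/epilogue code for a split frame might look like, written as C over two software stacks. Real output would be target assembly, and the region names, sizes and split chosen here are invented for illustration:

```c
#include <stdint.h>

/* Two software stacks in different memory regions. In a real build these
   would be placed by the linker script; plain arrays stand in for them.  */
static _Alignas(8) uint8_t sram_stack[256];    /* fast on-chip SRAM  */
static _Alignas(8) uint8_t dram_stack[4096];   /* slow off-chip DRAM */
static uint8_t *sp_sram = sram_stack + sizeof sram_stack;
static uint8_t *sp_dram = dram_stack + sizeof dram_stack;

void some_procedure(void)
{
    /* Prologue: bump both stack pointers, one per split of the frame. */
    sp_sram -= 8;     /* hot locals, placed in SRAM by the allocator   */
    sp_dram -= 64;    /* cold buffer, placed in DRAM                   */

    int32_t *hot_counter = (int32_t *)sp_sram;
    uint8_t *cold_buffer = sp_dram;

    *hot_counter = 0;
    for (int i = 0; i < 64; i++) {
        cold_buffer[i] = (uint8_t)i;
        (*hot_counter)++;
    }

    /* Epilogue: pop both splits before returning. */
    sp_sram += 8;
    sp_dram += 64;
}
```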

Comparison to globals
Stack variables have limited lifetimes compared to globals: they are 'live' while a particular procedure is executing and can be reclaimed once the procedure exits. Hence variables with non-overlapping lifetimes can share the same address space, and their total size can be larger than that of the memory unit they are stored in.

Formulation for Stack Frames
There are two ways of extending the method to handle stack variables. In the first, each procedure's stack frame is stored in a single memory unit:
1. No multiple stack pointers
2. Still a distributed stack, as different stack frames may be allocated to different memory units

Stack-extended formulation
Total access time = time taken to access global variables + time taken to access stack variables. The f_i's range over the functions in the program (as each function has a stack frame).
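One plausible rendering of the stack-frame-level objective, treating each frame f_k like a variable with aggregated read/write counts. The slide's formula did not survive transcription, so the notation here is ours, with F the number of functions:

```latex
\text{Total time} =
\sum_{j=1}^{U} \sum_{i=1}^{G} I_j(v_i)\,\bigl[T_{rj} N_r(v_i) + T_{wj} N_w(v_i)\bigr]
+ \sum_{j=1}^{U} \sum_{k=1}^{F} I_j(f_k)\,\bigl[T_{rj} N_r(f_k) + T_{wj} N_w(f_k)\bigr]
```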

Constraints
Each stack frame may be stored in at most one memory unit. The stack reaches its maximum size when a call-graph leaf node is reached (a call-graph leaf node is the most deeply nested procedure called). Hence the program's allocation will fit into memory if the stack frames along every path to a leaf node of the call graph fit into memory.
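In symbols, one plausible form of the resulting capacity constraint, with P the set of root-to-leaf paths in the call graph and S(f_k) the frame size of function f_k (notation again ours); globals share each unit's capacity with the frames live along the path:

```latex
\sum_{i=1}^{G} I_j(v_i)\, S(v_i) + \sum_{f_k \in p} I_j(f_k)\, S(f_k) \le S_j
\qquad \forall\, j \in \{1,\dots,U\},\;\; \forall\, p \in P
```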

Stack-extended formulation
The second alternative: stack variables from the same procedure can be mapped to different memory units. Stack variables are thus treated like globals, with the total access time taking the same form as for globals. However, the memory-capacity requirements are relaxed as in the stack-frame case, based on the disjoint lifetimes of the stack variables.

Heap-extended formulation
Heap data cannot be allocated statically, as the allocation frequencies and block sizes are unknown at compile time. Calls such as malloc() fall into this category. The allocation has to be estimated using a good heuristic: each static heap allocation site is treated as a variable v in the formulation.

Heap-extended formulation
The number of references to each site is counted through profiling. The variable's size is bounded as a finite multiple of the total size of memory allocated at that site. For example, if a malloc() site allocates 20 bytes 8 times over in a program, 160 bytes is the size of v, which is multiplied by a safety factor of 2 to give 320 bytes as the allocation size for this site.

Heap-extended formulation
This optimizes for the common case. Calls like malloc() are cloned for each memory level, and each level maintains its own free list. If the allocation size is exceeded at runtime (the maximum size is passed as a parameter for each call site), a memory block from slower and larger memory is returned.
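A simplified sketch of per-level cloned allocation with fallback, in C. The two-level setup, the bump allocator standing in for the per-level free lists, and all names are assumptions; the slides do not show the paper's runtime interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memory regions; in a real build these would be placed by
   the linker. A bump allocator stands in for the per-level free lists.   */
static uint8_t sram_pool[512], dram_pool[65536];
static size_t  sram_top = 0,   dram_top = 0;

static void *sram_alloc(size_t n)            /* returns NULL when full */
{
    if (sram_top + n > sizeof sram_pool) return NULL;
    void *p = &sram_pool[sram_top]; sram_top += n; return p;
}
static void *dram_alloc(size_t n)
{
    if (dram_top + n > sizeof dram_pool) return NULL;
    void *p = &dram_pool[dram_top]; dram_top += n; return p;
}

/* Clone of malloc() for one call site: 'budget' is the profiled maximum
   for this site (e.g. the 320 bytes computed above), passed as a
   parameter; '*used' tracks consumption. Overflow falls back to DRAM.    */
void *site_malloc(size_t n, size_t budget, size_t *used)
{
    if (*used + n <= budget) {
        void *p = sram_alloc(n);
        if (p) { *used += n; return p; }    /* common case: fast memory */
    }
    return dram_alloc(n);                   /* overflow: slower memory  */
}
```

The per-site budget caps how much of the fast level a site may consume, so an unexpectedly heavy site degrades to DRAM instead of starving other sites.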

Heap-extended formulation
The resulting latency is at most the latency of the slowest memory. If real-time guarantees are needed, all heap allocation must be assumed to go to the slowest memory.

Experiment
The compiler was implemented as an extension to the commonly used GCC cross-compiler, targeting the Motorola M-Core processor. The benchmarks used represent code from typical applications. Runtimes were normalized to the case of using only the fastest memory type (SRAM); slower memories were then introduced in subsequent tests and the runtimes measured.

Results

Results
Using 20% SRAM and the rest DRAM still produces runtimes close to the all-SRAM case: cheaper, without much of a performance loss. This shows that (at least for the benchmark programs) the memory allocation is optimal. FIB, which computes Fibonacci numbers with a linear recurrence, is an exception, as it has an equal number of accesses to all variables.

Experiment 2
Ample DRAM and EEPROM was provided while the SRAM size was varied for each of the benchmark programs. This helps determine the minimum amount of SRAM needed to maintain performance reasonably close to the 100% SRAM case.

FIR Benchmark

Matrix multiplication benchmark

Fibonacci series benchmark

Byte to ASCII converter

Results
It is clear that the most frequently accessed code makes up 10-20% of the entire program. This portion is successfully placed in SRAM through the profile-based optimization.

Comparing Stack frames and stack variables

Results
The BMM benchmark is used as it has the largest number of functions/procedures (and hence the most stack frames/variables). Allocating stack variables to different units performs better in theory, due to the finer granularity and thus a more custom allocation; the difference is apparent at the smaller SRAM sizes.

Applications
The approach in the paper can be used to determine an optimal trade-off between the minimum SRAM size and meeting performance requirements.

Adapting to pre-emption
In context-switching environments, the data of all live programs has to be resident in memory at any given time. The variables of all the live programs are therefore combined, and the formulation is solved with each context's variables weighted by the context's relative frequency. An optimal allocation is achieved in this case as well.
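A plausible form of the combined objective, weighting each context c's access counts by its relative frequency φ_c over C contexts (notation ours; the slides give no formula):

```latex
\text{Total time} =
\sum_{c=1}^{C} \phi_c \sum_{j=1}^{U} \sum_{i=1}^{G_c}
I_j\bigl(v_i^{(c)}\bigr)\,\bigl[T_{rj} N_r\bigl(v_i^{(c)}\bigr) + T_{wj} N_w\bigl(v_i^{(c)}\bigr)\bigr]
```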

Summary
A compiler method to distribute program data efficiently among heterogeneous memories:
1. Caching hardware is not used
2. Static allocation across memory units
3. Stack distribution
4. Optimality guarantee
5. Runtime depends on relative access frequencies

Related work
There is not much work on cache-less embedded chips with heterogeneous memory units; the memory allocation task is usually left to the programmer. The compiler method is better for larger, more complex programs: it is error-free and portable across different systems with only minor modifications to the compiler.

Related work
Panda et al. and Sjodin et al. have researched memory allocation in cached embedded chips. Cached systems spend more effort on minimizing cache misses than on minimizing memory access times, and offer no optimality guarantee. Earlier studies also take into account only two memory levels (SRAM and DRAM), while this formulation can be extended to N levels of memory.

Related work
Dynamic allocation strategies are also possible, but are not explored here. Software caching (emulation of a cache in fast memory) is an option; methods to overcome its software overhead need to be devised, and its inability to provide real-time guarantees should be addressed.

THE END