Published by Avice Reeves. Modified over 10 years ago.
1
Dynamic Memory Management for new embedded systems
David Atienza (DACYA/UCM), Stelios Mamagkakis (VLSI D&T Center, Xanthi), Marc Leeman (IMEC vzw, Leuven), Francky Catthoor (IMEC vzw, Leuven), José M. Mendías (DACYA/UCM), Dimitrios Soudris (VLSI D&T Center, Xanthi)
2
New embedded systems?
- New consumer devices (e.g. mobiles, PDAs)
- Main features:
1) More complex than traditional embedded devices (complex memory hierarchy, CPU, real-time OSes)
2) Portable, so battery capacity is limited
3) Must preserve “relatively” high performance (real time)
4) Many applications usually run concurrently
3
New applications?
- Multimedia and wireless network protocols: scalable video rendering, 3D virtual reality games, wireless protocols
- Main features:
1) Complex high-level design (e.g. C++, Java)
2) Very dynamic (variable use of resources)
3) Power hungry
4) Intensive memory use (accesses and footprint)
4
Why optimize these systems?
- No optimizations: new applications on these platforms run out of memory or battery fast, and real-time behaviour is lost!
- Users do not want these problems, so new embedded systems need to be optimized!
5
Outline: 1) Memory subsystem in new embedded devices (Static vs dynamic memory allocation) 2) Dynamic Memory management mechanism 3) Dynamic memory subsystem refinement: 3.1) Dynamic Data Type Refinement (Application Level) 3.2) Dynamic Memory Management Refinement (OS Level) 4) Real life case studies and results 5) Enhanced Dynamic Memory Management (Multi-level) – Real life case study and results 6) Conclusions and future work
6
Memory subsystem in new embedded devices
1) Multi-level memory hierarchy (e.g. caches, etc.)
2) High-performance buses (e.g. AMBA bus)
[Figure: ARM processor core, MMU, instruction cache, data cache, scratchpad memory, AMBA bus, main memory]
- Not enough: the memory subsystem accounts for up to 70% of total system power and can cause high performance degradation (>20x) [Vijaykrisnan2002, Catthoor2000]
- Highly optimized: access and storage optimizations for dynamic memory (DM) are needed!
7
Memory: static memory vs dynamic memory
Scenario 1 - Compile-time (worst-case) allocation. Data: scalable 3D decoding (per object).
[Figure: memory size over time (t1 to t4) for Objects 1 to 3 decoded at low, medium and high quality. Verdict: NO!]
8
Memory: static memory vs dynamic memory
Scenario 2 - Run-time allocation. Data: scalable 3D decoding (per object).
[Figure: memory size over time (t1 to t5) for Objects 1 to 5 decoded at low, medium and high quality. Verdict: OK!]
Memory usage scales to current input!
9
Overview of options for memory allocation (results of the 3D Image Reconstruction case study)
- Worst-case static memory solutions are not possible, or do not work in extreme cases of input data
- Dynamic solutions achieve better results
- As shown later: well-designed custom solutions can further improve on standard DM mechanisms
10
DM management works at 2 levels
1) Applications use SW functions, C++: new()/delete()
2) Real-time OS support: DM manager
[Figure: over time (t1 to t6) the application issues new(O1), new(O2), new(O3), delete(O2), new(O4), new(O5); the RTOS Dynamic Memory Manager places objects O1 to O5 in the heap in main memory. Result: fragmentation!!]
11
1) Which parts use the DM SW functions?
Embedded application:
- Algorithms (functionality)
- Static data (e.g. frames)
- Dynamic data (e.g. objects to render)
Dynamic Data Types (DDT 1 to DDT n): structured data (sets of objects), created with new(Object)
[Figure: two example DDTs, labelled PA(key1) / AR(key2) and LAR(key1) / PA(key2), each with Layer 1, Layer 2 and data]
12
2) RTOS support, DM manager to use?
- Partition manager: suitable for one allocation size
  - One fixed block size
  - First fit allocation order
  - Simple: low energy consumption
- Region allocator: suitable for several allocation sizes
  - Many block sizes, doubly-linked list
  - First fit, splitting, coalescing (best fit approximation)
  - Complex: higher energy consumption
[Figure: both managers hold global info and chains of free and used blocks with free headers; new requests are served from the free blocks. In the region allocator this leads to fragmentation!!]
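The partition manager described above can be sketched in a few lines. This is only an illustrative sketch, not the RTOS implementation from the slides: the class name `PartitionManager`, its template parameters and the intrusive free list are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <new>

// Hypothetical fixed-block partition manager: one block size, and the free
// blocks are chained through a link stored inside each free block.
template <std::size_t BlockSize, std::size_t BlockCount>
class PartitionManager {
    struct Node { Node* next; };
    static_assert(BlockSize >= sizeof(Node) && BlockSize % alignof(Node) == 0,
                  "each block must be able to hold an aligned free-list link");
public:
    PartitionManager() {
        // Put every block of the partition on the free list.
        for (std::size_t i = 0; i < BlockCount; ++i) release(storage_ + i * BlockSize);
    }
    // With a single block size, first fit is simply "take the head".
    void* allocate() {
        if (free_ == nullptr) return nullptr;  // partition exhausted
        Node* n = free_;
        free_ = n->next;
        --free_count_;
        return n;
    }
    // Returning a block pushes it back on the head of the free list.
    void release(void* p) {
        free_ = new (p) Node{free_};
        ++free_count_;
    }
    std::size_t free_blocks() const { return free_count_; }
private:
    alignas(std::max_align_t) unsigned char storage_[BlockSize * BlockCount];
    Node* free_ = nullptr;
    std::size_t free_count_ = 0;
};
```

Because every block has the same size, no splitting or coalescing is needed, which is why this kind of manager stays simple. A region allocator would add per-block headers with sizes.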
13
Design flow for new embedded systems
Specification of the system at very high level (e.g. C++ or UML)
Refinement of DM subsystem (2 levels):
1. Dynamic Data Types Refinement (DDTR)
2. DM Managers Refinement (DMMR)
Further optimizations in static data
Final implementation
14
DDTs in new embedded applications
1) Complex control flow (many sub-algorithms) => many DDTs with different (and irregular) behaviour interacting in time
2) Implementations are designed from a functional point of view, not for efficient access or mapping to memory (e.g. clustering of data)
3) Complex implementations => combinations of trees, arrays, singly linked lists, doubly linked lists, …
Result: very expensive (e.g. in energy) if not well designed
15
Proposed Dynamic Data Type Refinement steps
1) Specification of the required DDTs (from the algorithm)
2) Working implementation of DDTs (e.g. C++)
3) Run-time simulation and profiling acquisition
4) Refinement of current design (interaction and implementation of DDTs)
5) Final implementation of DDTs
Problems:
- Insertion of profiling: error-prone. Automated ways?
- Huge design space of DDTs: time-consuming. High-level estimations?
Solutions proposed:
- Library of auto-profiled DDTs
- Structured reports generation
- Heuristics to limit the exploration
- Analytical high-level estimations
16
Library of auto-profiled DDTs
1.- Library of complex multi-level DDTs:
- Based on initial exploration in Matlab with traces of real programs
- Multiplatform compatible: ANSI C++ compliant
- Basic data types (e.g. int) or user-defined (e.g. objects)
- List of relevant DDTs (i.e. 14) for multimedia included:
1) Basic data types: singly and doubly linked lists, array (AR) and pointer-AR
2) 2-layered combinations of ARs and pointer-ARs
3) Singly linked list - AR
4) Doubly linked list - AR
5) Binary trees
6) DDTs with pattern optimizations (e.g. fast access to the last element)
2.- Extension: mechanism to create new DDTs in the library based on template classes (or “mixins”)
17
“Mixin” concept and how we use it
● Y. Smaragdakis (2002): a method to specify extensions of classes without defining up front exactly which classes are extended
● Uses in our library of DM managers:
1) Specifying a subclass while the parent class is a template parameter:
template <class SuperClass> class Mixin : public SuperClass { /* mixin definitions */ };
2) Using a template class inside another class:
template <class SuperClass> class Myclass { SuperClass *data; /* template class definitions */ };
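A minimal sketch of the two uses of template classes listed on this slide. The names `IntStore`, `CountingMixin` and `Holder` are hypothetical and exist only for illustration; they are not taken from the authors' library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical base class: a plain container of ints.
class IntStore {
public:
    void put(int v) { data_.push_back(v); }
    std::size_t size() const { return data_.size(); }
private:
    std::vector<int> data_;
};

// Use 1: a mixin. The parent class is a template parameter, so the same
// extension (counting writes) can be layered over any class with put().
template <class SuperClass>
class CountingMixin : public SuperClass {
public:
    void put(int v) { ++writes_; SuperClass::put(v); }
    std::size_t writes() const { return writes_; }
private:
    std::size_t writes_ = 0;
};

// Use 2: the template parameter is held as a member instead of inherited.
template <class SuperClass>
class Holder {
public:
    Holder() : data(new SuperClass) {}
    ~Holder() { delete data; }
    Holder(const Holder&) = delete;
    Holder& operator=(const Holder&) = delete;
    SuperClass* data;
};
```

Layers stack by nesting template arguments, for example `Holder<CountingMixin<IntStore>>`, and the compiler can inline the forwarding calls since every type is known at compile time.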
18
Examples of DDT definitions:
● Generic and basic DDTs:
template class Array{... };
template class DLList{... };
template class BTTree{... };
● One-level concrete DDTs:
class F_array : public Array {};
class I_DLList: public DLList {};
class D_BTTree: public BTTree {};
● New multi-level concrete DDTs:
class ARARInteg: public Array >{};
class DLLARInteg: public DLList >{};
class BTSLLDoub: public BTTree >{};
class ARARAREmployee: public Array<2,Employee, \ Array > >{};
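The template arguments of these definitions are not shown in this transcript, so the following is only a sketch of how such one-level and multi-level DDTs could be declared. The parameter lists (`N`, `T`), the sizes 4 and 8, and the use of `std::list` inside `DLList` are all assumptions, not the library's actual signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <list>

// Assumed parameter list: element count and element type.
template <std::size_t N, class T>
class Array {
public:
    T& operator[](std::size_t i) { return items_[i]; }
    static constexpr std::size_t capacity() { return N; }
private:
    T items_[N]{};  // value-initialised elements
};

// Assumed parameter list: element type only. std::list is doubly linked.
template <class T>
class DLList {
public:
    void push_back(const T& v) { items_.push_back(v); }
    T& back() { return items_.back(); }
    std::size_t size() const { return items_.size(); }
private:
    std::list<T> items_;
};

// One-level concrete DDTs: a generic DDT bound to a basic type.
class F_array : public Array<8, float> {};
class I_DLList : public DLList<int> {};

// Multi-level concrete DDTs: the element type is itself a DDT.
class ARARInteg : public Array<4, Array<4, int>> {};
class DLLARInteg : public DLList<Array<4, int>> {};
```

Nesting the element type is what produces the layered structures (array of arrays, list of arrays) that the library explores.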
19
Structured reports of DDT implementations at run time
1) Profiling is already inserted in all the DDT implementations of the library
2) Information reported at run time from the DDTs:
1. Read and write accesses
2. Memory footprint behaviour
3. Access pattern to the data => method calls to data (e.g. sequential)
3) Graphical tool (based on Gtk/Perl) to perform code parsing and profiling insertion in new DDTs
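As a rough illustration of what "profiling already inserted" in a DDT could look like, the sketch below counts read and write accesses and tracks peak footprint. `AccessProfile` and `ProfiledArray` are hypothetical names, and the real library's report format is not shown in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical counters mirroring the information listed on the slide.
struct AccessProfile {
    std::size_t reads = 0;
    std::size_t writes = 0;
    std::size_t peak_bytes = 0;  // memory footprint high-water mark
};

// A dynamic array whose accessors update a shared profile object.
template <class T>
class ProfiledArray {
public:
    explicit ProfiledArray(AccessProfile& p) : profile_(p) {}
    void push_back(const T& v) {
        items_.push_back(v);
        ++profile_.writes;
        std::size_t bytes = items_.size() * sizeof(T);
        if (bytes > profile_.peak_bytes) profile_.peak_bytes = bytes;
    }
    const T& read(std::size_t i) const { ++profile_.reads; return items_[i]; }
    void write(std::size_t i, const T& v) { ++profile_.writes; items_[i] = v; }
private:
    std::vector<T> items_;
    AccessProfile& profile_;
};
```

The application code only sees `push_back`, `read` and `write`, so swapping one profiled DDT for another does not require touching the algorithm.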
20
Our run-time exploration of DDTs
1) Heuristics based on clustering of blocks => possibly up to one per DDT
2) Unified exploration loop during usual execution
3) Refinement is done in a post-processing phase
[Flowchart: Start exploration -> acquiring profiling: profile objects at normal execution speed of the application (with instrumented DDTs from the library of DDTs for multimedia) -> evaluation: heuristics evaluation -> Finished? No: repeat the loop. Yes: exploration finished]
21
Post-processing phase (refinement)
● Automated refinement process:
- Input: acquired run-time profiling information
- Graphical evaluation tool
- Analytical power model (0.8 to 0.13 µm tech. nodes, based on Cacti v3.0)
- Output: global optimal (Pareto) points and trade-offs: power consumption / memory footprint / execution time
● Further refinements are possible with run-time behaviour information of DDTs: global control flow simplification => Intermediate Variable Elimination (additional global gains!)
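Selecting the Pareto points among profiled candidates can be sketched as a simple dominance filter. The `Candidate` struct and the function names are assumptions for this example; the authors' tool is not shown in the slides, and lower is taken to be better for all three metrics.

```cpp
#include <cassert>
#include <vector>

// One profiled DDT implementation; lower is better for every metric.
struct Candidate {
    double power;
    double footprint;
    double time;
};

// a dominates b if a is no worse in every metric and strictly better in one.
bool dominates(const Candidate& a, const Candidate& b) {
    bool no_worse = a.power <= b.power && a.footprint <= b.footprint && a.time <= b.time;
    bool better = a.power < b.power || a.footprint < b.footprint || a.time < b.time;
    return no_worse && better;
}

// Keep only the candidates that no other candidate dominates.
std::vector<Candidate> pareto_front(const std::vector<Candidate>& all) {
    std::vector<Candidate> front;
    for (const Candidate& c : all) {
        bool dominated = false;
        for (const Candidate& other : all) {
            if (dominates(other, c)) { dominated = true; break; }
        }
        if (!dominated) front.push_back(c);
    }
    return front;
}
```

The designer then picks among the surviving points according to which of power, footprint or execution time matters most for the target platform.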
22
Simplification of control flow: Intermediate Variable Elimination phase
● Interaction between DDTs in a global context:
1) Complex algorithms consist of many smaller ones: data generation and consumption (DSP)
2) Each step performs “some” transformations: filtering of points, proximity, selection, …
3) Injective relationship: remove buffers when the index function is simple enough and no intermediate results are needed later
Very significant additional global gains!
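A toy illustration of the idea, under the assumption that each produced element is consumed exactly once and in order (the simplest injective index relationship). Both function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Before: step 1 writes an intermediate dynamic buffer, step 2 reads it back.
int sum_scaled_buffered(const std::vector<int>& in) {
    std::vector<int> tmp;                      // intermediate DDT
    for (int v : in) tmp.push_back(2 * v);     // step 1: transformation
    int sum = 0;
    for (int v : tmp) sum += v;                // step 2: consumption
    return sum;
}

// After: element i of tmp is only ever read as element i, and tmp is not
// needed later, so the buffer is removed and the two steps are fused.
int sum_scaled_fused(const std::vector<int>& in) {
    int sum = 0;
    for (int v : in) sum += 2 * v;
    return sum;
}
```

The fused version performs no dynamic allocation for the intermediate data and roughly halves the memory accesses, which is where the additional global gains come from.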
23
Automation tool for Dynamic Data Type Refinement
24
Design flow for new embedded systems
Specification of the system at very high level (e.g. C++ or UML)
Refinement of DM subsystem (2 levels):
1. Dynamic Data Types Refinement (DDTR)
2. DM Managers Refinement (DMMR)
Further optimizations in static data
Final implementation
25
Problems in creating custom DM managers
● DM management is left to the OS => general-purpose DM managers, not custom ones! (Lea Allocator, Linux-based systems 2003)
● Custom DM managers? No guidelines! Only the designers' experience (try-and-test phase):
1) Huge design space to explore manually (e.g. organization of memory blocks, fit algorithms) (Wilson et al. 1995)
2) Frameworks to build and profile custom DM managers are not available (Berger2001, Attardi98)
26
Small example of different choices in DM managers
Several options to decide:
- DM manager with information in the blocks?
- DM manager with coalescing service or not?
Both options are possible; which is best for the current application?
New methodologies to decide the best options are needed!
[Figure: block with header; one block size vs several block sizes; no coalescing vs coalescing]
27
Proposed DM management methodology
Main phases:
1) Profiling of the application's dynamic behavior to detect the most commonly occurring data type accesses
2) Systematic exploration of possible DM management solutions from a structured (orthogonalized) design space, for a certain cost function
3) Efficient code implementation and empirical evaluation of promising DM management solutions, using our own high-level C++ library to create them
28
Proposed Dynamic Memory Management Refinement steps
1) Profiling of the application's dynamic behavior (identification of DDT access patterns)
2) Exploration of possible DM management solutions for certain constraints (e.g. power, performance, etc.)
3) Implementation and run-time evaluation of promising custom DM management candidates
4) Final selection of custom DM manager
29
Profiling of application’s DM behavior Dynamic data = organized data structures: Dynamic Data Types (DDTs) –Allocation sizes of each structure in the DDTs –Temporal behavior of each DDT –Memory footprint of each DDT –Interaction of the DDTs (spatial locality) => From our Dynamic Data Type Refinement
30
Exploration of possible DM management solutions to minimize memory footprint
1) Definition of a structured design space for DM management:
- Orthogonal categories to create custom DM managers
- Categories propagate dependencies and make the design space exploration feasible
2) Definition of a suitable order to traverse the design space that reduces one or more cost functions
31
Our design space for DM management According to basic blocks defined in DM managers: Orthogonal decision trees inside each category
32
Complete DM management design space ● All important state-of-the-art DM managers covered within our DM design space: – Binary buddy, Double buddy – Simple segregated fit – First fit, next fit, best fit allocation orders – Kingsley allocator (among the fastest, N-Gage) – Region allocators (fast, embedded RTOS: RTEMS) – Complex region-segregated fit – Win XP real time (fast) – Doug Lea Allocator (Linux, best trade-off) – Obstacks (custom, optimized for stack behavior, gcc) – Xalloc (custom, variation of regions-stack, Apache) – … Main problem: Huge design space! Order of decision trees?
33
Interdependencies help to explore the DM management design space
Interdependencies exist between orthogonal trees. These interdependencies make the exploration feasible: not all combinations of trees are realistic!
- A2) Block sizes in the DM manager: one or several? (One block size / Several block sizes)
- A5) Flexible block size DM manager: coalesce or not? (No coalescing / Coalescing)
[Figure: the two decision trees, marked (1) and (2), next to a block with header]
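To make the coalescing decision concrete, the sketch below models a heap purely as offsets and sizes and lets the caller switch coalescing on or off. It is an illustrative model only: `RegionModel` and its interface are assumptions, and no real memory is managed.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical model of a heap: the free list maps block offset -> block size.
class RegionModel {
public:
    static std::size_t no_block() { return static_cast<std::size_t>(-1); }
    explicit RegionModel(std::size_t size) { free_[0] = size; }

    // First fit with splitting: take the first block that is large enough
    // and return the unused remainder to the free list.
    std::size_t allocate(std::size_t n) {
        for (auto it = free_.begin(); it != free_.end(); ++it) {
            if (it->second < n) continue;
            std::size_t off = it->first;
            std::size_t rest = it->second - n;
            free_.erase(it);
            if (rest > 0) free_[off + n] = rest;
            return off;
        }
        return no_block();
    }

    // Free a block; optionally merge it with adjacent free neighbours.
    void release(std::size_t off, std::size_t n, bool coalesce) {
        auto it = free_.emplace(off, n).first;
        if (!coalesce) return;
        auto next = std::next(it);
        if (next != free_.end() && it->first + it->second == next->first) {
            it->second += next->second;
            free_.erase(next);
        }
        if (it != free_.begin()) {
            auto prev = std::prev(it);
            if (prev->first + prev->second == it->first) {
                prev->second += it->second;
                free_.erase(it);
            }
        }
    }
    std::size_t free_blocks() const { return free_.size(); }
private:
    std::map<std::size_t, std::size_t> free_;
};
```

With a single block size the coalescing branch can never produce a useful larger block, which is one way to see why the A2 and A5 decisions depend on each other.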
34
E.g. final order for reduced memory footprint (interdependencies and factors of influence)
[Figure: the decision trees of the design space, numbered (1) to (12) in traversal order]
35
Our approach to implement and evaluate custom DM managers 1) Object-oriented library to compose DM managers: - ANSI C++ code with “mixins” that can be efficiently optimised by current compilers (e.g. gcc, Visual C++) - Custom DM managers composed by basic components (e.g. fit algorithms, memory blocks organizations, etc.) - Fast creation and debugging of custom DM managers 2) Profiling framework can be easily inserted in the library to profile the candidates (e.g. memory footprint, energy, etc. ) 3) Post-processing phase to compare the DM candidates
36
Advantages of our mixin-based approach for DM managers over traditional implementations
1) Direct equivalence between our DM design space and its implementation classes
2) Independent layers => good maintainability and fast changes in parts of DM managers (compare e.g. the Lea allocator: 15000 lines of complex C code)
3) Extensive code reuse => implementation classes are reused among different DM managers through common method interfaces
4) Profiling can be done in a similar way to DDTs
37
Graphical example of custom DM manager created by composition of our basic blocks
[Figure: DMMHeap, the final custom DM manager, chooses which manager to use according to allocation size. Blocks shown: Best Fit, First Fit, binary tree blocks structure, doubly-linked list blocks structure, OS interface heap (4B physical blocks), OS interface heap (8B physical blocks)]
38
Example of custom DM manager created with our categories of basic blocks
/* Basic blocks for heap requests to the system */
template class BasicHeap: public TypeClass {};
/* Data types of the Dynamic Memory (DM) manager: DLL -> Doubly Linked List and BTT -> Binary Tree */
template class DLList {/*Implem. DLL*/ };
template class BTTree { /*Imp. BTT */};
/* Two basic data types instantiated for the memory manager */
class I_DLList : public DLList > {};
class D_BTTree : public BTTree >{};
/* DM manager with 2 seg. fit lists of data types, best or first fit policy */
class DMMHeap: public SegLists, // 1st segList
FirstFit // 2nd segList
> {};
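Since the template arguments of the slide's code are not visible in this transcript, here is a hedged sketch of the same composition idea: fit policies and size segregation as independent building blocks. The interfaces of `FirstFit`, `BestFit`, `FreeList` and `SegLists` below are assumptions for illustration, the 64-byte threshold is arbitrary, and free blocks are modelled by size only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fit policies: return the index of the chosen free block, or -1 if none fits.
struct FirstFit {
    static int pick(const std::vector<std::size_t>& sizes, std::size_t n) {
        for (std::size_t i = 0; i < sizes.size(); ++i)
            if (sizes[i] >= n) return static_cast<int>(i);
        return -1;
    }
};
struct BestFit {
    static int pick(const std::vector<std::size_t>& sizes, std::size_t n) {
        int best = -1;
        for (std::size_t i = 0; i < sizes.size(); ++i)
            if (sizes[i] >= n && (best < 0 || sizes[i] < sizes[best]))
                best = static_cast<int>(i);
        return best;
    }
};

// A free list whose fit policy is a template parameter.
template <class FitPolicy>
class FreeList {
public:
    void add_free(std::size_t size) { sizes_.push_back(size); }
    // Returns the size of the block consumed, or 0 if no block fits.
    std::size_t take(std::size_t n) {
        int i = FitPolicy::pick(sizes_, n);
        if (i < 0) return 0;
        std::size_t got = sizes_[i];
        sizes_.erase(sizes_.begin() + i);
        return got;
    }
private:
    std::vector<std::size_t> sizes_;
};

// Segregation layer: routes each request by size to one of two sub-managers.
template <std::size_t Threshold, class Small, class Large>
class SegLists {
public:
    Small small;
    Large large;
    std::size_t take(std::size_t n) {
        return n <= Threshold ? small.take(n) : large.take(n);
    }
};

// Final manager: first fit for small requests, best fit for large ones.
class DMMHeap : public SegLists<64, FreeList<FirstFit>, FreeList<BestFit>> {};
```

Changing a design decision (for example best fit everywhere) is a one-line change to the `DMMHeap` declaration, which is the maintainability argument made for the mixin-based library.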
39
Profiling framework for DM managers
1) Reuse of the interface for the profiling of DDTs
2) Profiling already inserted in all the classes of our library of DM managers
3) Information reported at run time from the DM managers:
- Memory footprint behaviour
- Access pattern to the data due to DM managers (e.g. allocations/deallocations, etc.)
- Fragmentation in the managers
- Classified for each implementation part of the DM managers (e.g. fit algorithms, internal data structures, etc.)
40
Code example of integration of profiling objects in our custom DM managers
/* Declaration of profile objects of our common profile framework */
_profile *prof1, *prof2, *prof3;
class DMMHeap: public SegLists >, // 1st segList with profile object
FirstFit > // 2nd segList with profile object
> > > {};
- Easy insertion of our C++ profiling framework in the original structure of custom DM managers: few new parts are required!
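One possible way to attach a profile object to a manager layer is a profiling mixin wrapped around any heap class. This is a sketch under assumptions: `Profile`, `SystemHeap` and `Profiled` are hypothetical names and do not reproduce the `_profile` type or the template arguments of the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical profile object shared between manager layers.
struct Profile {
    std::size_t allocations = 0;
    std::size_t deallocations = 0;
    std::size_t live_bytes = 0;
    std::size_t peak_bytes = 0;
};

// Bottom layer: requests memory from the system.
class SystemHeap {
public:
    void* allocate(std::size_t n) { return std::malloc(n); }
    void release(void* p, std::size_t) { std::free(p); }
};

// Profiling mixin: wraps any heap layer and updates the profile object.
template <class SuperHeap>
class Profiled : public SuperHeap {
public:
    explicit Profiled(Profile* prof) : prof_(prof) {}
    void* allocate(std::size_t n) {
        void* p = SuperHeap::allocate(n);
        if (p != nullptr) {
            ++prof_->allocations;
            prof_->live_bytes += n;
            if (prof_->live_bytes > prof_->peak_bytes) prof_->peak_bytes = prof_->live_bytes;
        }
        return p;
    }
    void release(void* p, std::size_t n) {
        SuperHeap::release(p, n);
        ++prof_->deallocations;
        prof_->live_bytes -= n;
    }
private:
    Profile* prof_;
};
```

Only the wrapping layer is new; the wrapped heap is unchanged, which matches the slide's point that few new parts are required.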
41
Case study 1: 3D reconstruction algorithm
Matching of points in consecutive frames (“like” motion detection)
- 1,500 2D points to ‘match’ on average
- Size: 700000 lines of C++ code
- Sources of uncertainty:
1) Unknown input image sizes
2) Additional intermediate DDTs
42
Initial and optimised DDTs in the 3D reconstruction algorithm (DDTR phase)
[Figure: initial DDTs implementation vs final DDTs implementation]
43
Results obtained with our DMMR phase
● Memory footprint reduction:
[Chart: memory footprint of different DM managers (2 frames), in MBytes (axis 0 to 2.50): Kingsley (Win32), Regions, our DM manager]
● Execution time reduced; the overhead added to total execution time is not significant (600 frames, 20 s):
[Chart: overhead due to DM managers, time in secs (axis 0 to 2): our custom DM manager, Regions, Kingsley (Win32)]
44
Final results 3D Reconstruction case study Overall gains of almost 2 orders of magnitude!
45
Case Study 2: Virtual reality game
- Real images interact with 3D generated objects
- Initially designed for embedded devices (e.g. Trimedia 1300)
- Unpredictable behaviour:
  - Objects on the screen
  - Wall detection
  - User movements
[Figure: initial image and processed image]
46
Overall results when minimizing power consumption
Final comparison with original DM implementation:
- Total memory saving: 22.48%
- Total power consumption saving: 75.3%
[Charts: DDTs behaviour with 6 images; global Pareto points for the DDTs]
47
Examples of control flow simplification: Intermediate Variable Elimination
48
Case study 3: 3D rendering system
● Scalable rendering of objects according to available system resources (QoS): scalable 3D coding (3D mesh + 2D texture)
[Figure: low-quality decoding vs high-quality decoding]
- Size: 5000 lines of C++ code
- Usual “static” mem. footprint: 12 MB (real scenes with 7 objects)
- Sources of uncertainty:
1) Movements of the user (QoS)
2) Several DDTs: 3D points and 3D faces
49
Results of DDTR (reducing memory footprint)
● Energy results with two different memory hierarchies (with and without cache): very different results!
[Charts: without cache; with cache]
● Memory footprint reduction (35%)
50
Results of DMMR (reducing memory footprint)
● Memory footprint improved:
[Chart: memory footprint of different DM managers (5 objects), in MBytes (axis 0 to 4.50): Kingsley (Win32), Obstacks, our DM manager, Lea (Linux)]
● Execution may be slower:
[Chart: execution time of different DM managers, time in secs (axis 0 to 10): Lea (Linux), Kingsley (Win32), Obstacks, our DM manager]
(Trade-offs between execution time, memory footprint and power)
51
Final results of the 3D rendering case study
Results normalized to the original implementation
- Great gains in memory footprint: almost 60%, but with a significant penalty in energy consumption!
(Trade-offs exist between memory footprint, energy, etc.)
52
Multi-level DM management
- Use the whole memory hierarchy: scratchpad memory (low energy) and main memory (high energy)
[Figure: a dynamic embedded application issues new(O1), new(O2), new(O3), free(O2), new(O4), new(O5), …; the RTOS Dynamic Memory Manager places objects O1 to O5 across the scratchpad and the heap in main memory]
53
Hardware extensions for multi-level DM management: DMA
● New part: a DMA-like transfer controller used for DM management, which moves blocks between main memory and scratchpad.
- Operates in parallel with the processor.
- More energy-efficient than the processor for handling data transfers.
- Low setup time.
[Figure: ARM core, MMU, caches, scratchpad, DMA transfer controller, AMBA bus, external memory]
54
Efficient partition of scratchpad for DM management
Compile-time decisions on how the memory hierarchy is partitioned for dynamic allocation
[Figure: for each dynamic scenario (1 and 2) of a certain application, DDT 1 to DDT n+1 are split between “main memory + cache” and “scratchpad + DMA”]
55
Case Study 4: Network scheduling application (simulated on MPARM-Bologna)
● Wireless networks -> Deficit Round Robin (DRR) application (NetBench benchmark suite)
● Forwarding scheduler algorithm used in many routers
[Figure: stream data arrives as packets in queues (T1 … T5) and is forwarded (T2)]
- Size: 500 lines of C++ code
- Sources of uncertainty:
1) Variable number of packets arriving at any moment in time
2) Multiple packet sizes allowed and variable number of queues
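For readers unfamiliar with the benchmark, one round of Deficit Round Robin can be sketched as follows. This is a generic textbook-style sketch, not the NetBench code; `Queue`, `drr_round` and the quantum value used in any call are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// One flow queue: pending packet sizes (bytes) and its deficit counter.
struct Queue {
    std::deque<std::size_t> packets;
    std::size_t deficit = 0;
};

// Runs one DRR round over all queues and returns, in forwarding order,
// the index of the queue each forwarded packet came from.
std::vector<std::size_t> drr_round(std::vector<Queue>& queues, std::size_t quantum) {
    std::vector<std::size_t> sent;
    for (std::size_t q = 0; q < queues.size(); ++q) {
        Queue& cur = queues[q];
        if (cur.packets.empty()) continue;     // idle queues earn no credit
        cur.deficit += quantum;
        // Forward packets while the head packet fits in the deficit.
        while (!cur.packets.empty() && cur.packets.front() <= cur.deficit) {
            cur.deficit -= cur.packets.front();
            cur.packets.pop_front();
            sent.push_back(q);
        }
        if (cur.packets.empty()) cur.deficit = 0;  // no credit carried when empty
    }
    return sent;
}
```

The queues grow and shrink with traffic, and packet sizes vary, so both the number and the size of allocations are unknown at compile time; this is the dynamic behaviour the case study targets.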
56
Multi-level DMMR has a large impact on energy and performance
- Caches and design-time techniques are not the best choice for dynamic systems
- Large differences between DM managers (well-guided customization is needed!)
[Chart: energy and execution time, normalized to Region/Partition; value labels 30%, 20%, 70% and 18%, 14%, 10%]
57
Execution time results (cycles) in multiprocessors
AMBA bus, 3 cycles L2 latency, ARM7TDMI cores
Gains scale and increase further with the scratchpad+DMA approach for DM management!
[Chart: cycles vs number of processors]
58
Conclusions ● Fast system design flow for embedded systems thanks to automation and libraries (DDTR,DMMR) – 2 to 4 weeks for applications of similar complexity ● Promising results in new dynamic multimedia and wireless network applications: – Significant speedup, and memory footprint and power consumption reductions – Trade-offs of memory footprint, performance, power consumed in DM management are possible ● Multi-layered DM management proven to be useful (also some companies agree…Infineon )
59
Future work ● Hardware support of some operations for DM management: – Significant speedups/power savings thanks to additional operations moved to HW (e.g. pointer-chasing) – Implementation of some kind of MMU for DM that the DM managers can control to move the data ● Run time switching of DM managers for different scenarios (e.g. just in software or with reconfigurable HW)