CMPE 511, Fall 2006. Hybrid (Software-Hardware) Dynamic Memory Allocator. Prepared by Mustafa Özgür Akduran, Boğaziçi Üniversitesi, İstanbul, 2006.
Outline: Introduction; Related Research; Proposed Hybrid Allocator; Complexity and Performance Comparison; Conclusion; References.
Introduction. Memory management functions need efficient implementations, in terms of both memory usage and execution performance. Modern programming languages depend on dynamic memory allocation (DMA) and garbage collection.
Introduction: Dynamic Memory Management (DMM). In current systems, 42% of execution time is spent on memory management. Research is still active on two goals: good execution performance and memory locality. How can free chunks of memory be obtained? Through a software allocator or through a hardware allocator. The slide's diagram is labeled: DMM, pure software, low-cost allocator.
Introduction. Software allocators use different search techniques to organize the available chunks of free memory. Disadvantage: the search can lie on the allocator's critical path and become a major performance bottleneck. Hardware allocators use parallel search to speed up memory allocation and improve performance; they can also hide the execution latency of freeing objects and of coalescing free chunks of memory. Disadvantage: potential hardware complexity.
Introduction. A new hybrid software-hardware allocator builds on the PHK (Poul-Henning Kamp) allocation algorithm used in FreeBSD and on Chang’s hardware allocator. The aim is to balance hardware complexity against performance by using hardware and software together.
Related Research: PHK (Poul-Henning Kamp) Allocator. The two most popular general-purpose open-source allocators are: 1. Doug Lea's allocator, used in Linux. 2. PHK, used in FreeBSD. The difference between them is less than 3% on the memory-allocation-intensive benchmarks in SPEC CPU2000. The PHK allocator was chosen because of its suitability for hardware/software co-design. FreeBSD (Berkeley Software Distribution) is an advanced operating system for x86-compatible architectures (including Pentium® and Athlon™). It is derived from BSD, the version of UNIX® developed at the University of California, Berkeley, and it is developed and maintained by a large team of individuals.
Related Research: PHK (Poul-Henning Kamp) Allocator. PHK is a page-based allocator in which each page contains objects of only one size. For a large object, a sufficient number of pages is allocated. For small objects (less than half a page), the object size is padded up to the nearest power of 2. The allocator keeps a page directory for all allocated pages, and a bitmap of allocation information is created at the beginning of each small-object page. When allocating a small object, the PHK allocator performs a linear search on the bitmap to find the first available chunk in that page.
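The small-object path described in this slide can be sketched in a few lines. This is a simplified model, not the FreeBSD implementation: the names `padded_size` and `SmallObjectPage` are hypothetical, the 4096-byte page and 16-byte minimum chunk are assumptions taken from later slides, and the bitmap is kept outside the page instead of at its beginning.

```python
PAGE_SIZE = 4096   # assumed page size (the value used later in the slides)
MIN_CHUNK = 16     # assumed smallest chunk size


def padded_size(request: int) -> int:
    """Round a small request up to the next power of two (at least MIN_CHUNK)."""
    size = MIN_CHUNK
    while size < request:
        size *= 2
    return size


class SmallObjectPage:
    """One page holding chunks of a single size, tracked by a bitmap."""

    def __init__(self, chunk_size: int):
        self.chunk_size = chunk_size
        # One entry per chunk; True means the chunk is free.
        self.free = [True] * (PAGE_SIZE // chunk_size)

    def alloc(self):
        """Linear search for the first free chunk; return its byte offset or None."""
        for index, is_free in enumerate(self.free):
            if is_free:
                self.free[index] = False
                return index * self.chunk_size
        return None

    def release(self, offset: int) -> None:
        """Mark the chunk at the given byte offset as free again."""
        self.free[offset // self.chunk_size] = True
```

The loop in `alloc` is the linear search that the hybrid design later replaces with a parallel hardware search; its cost grows with the number of chunks per page, so it is worst for the smallest object sizes.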
Related Research: Chang’s Hardware Allocator. It is based on the buddy system described by Knuth. The buddy memory allocation technique divides memory into partitions to satisfy a memory request as suitably as possible. It repeatedly splits memory into halves to approximate a best fit. Compared with the memory allocation techniques (such as paging) used by modern operating systems such as MS Windows and Linux, buddy memory allocation is relatively easy to implement and does not require a memory management unit. Chang’s algorithm is a first-fit method based on a binary OR-tree and a binary AND-tree.
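The halving step of the buddy system can be illustrated with a small sketch. The function name `buddy_split` is hypothetical, and the sketch assumes the starting block size is a power of two and at least as large as the request; it shows only splitting, not the coalescing of freed buddies.

```python
def buddy_split(total: int, request: int):
    """Halve a block until a further split would no longer fit the request.

    Returns the size of the block handed out and the sizes of the
    buddies left free along the way.
    """
    block = total
    free_buddies = []
    while block // 2 >= request:
        block //= 2
        free_buddies.append(block)  # the other half stays free
    return block, free_buddies


allocated, leftover = buddy_split(1024, 100)
print(allocated, leftover)
```

For a 100-byte request from a 1024-byte block, the block is split three times and a 128-byte block is returned, leaving 28 bytes of internal fragmentation.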
Related Research: Chang’s Hardware Allocator. Each leaf node of the OR-tree represents the base size, the smallest unit of memory that can be allocated. Together, the leaves of the OR-tree represent the entire memory. The AND-tree has the same number of leaves as the OR-tree. The input of the AND-tree is generated from the OR gates of the OR-tree through a complex interconnection network.
Related Research: Chang’s Hardware Allocator. The OR-tree determines whether there is a large enough space for the allocation request. The AND-tree finds the beginning address of that memory chunk. The bits corresponding to the memory chunk are then flipped in the bit-vector.
Related Research. The interconnection between the OR-tree and the AND-tree is the most complex part of Chang’s allocator. The interconnection has the same critical path delay as the OR/AND-tree. The final allocation result is produced from the output of the AND-tree through a set of multiplexers. The hardware complexity, in number of gates, is O(n log n), where n is the number of memory chunks. Critical path delay: [value missing in the source].
Proposed Hybrid Allocator: problems of hardware-only and software-only allocators. Pure hardware allocators based on the buddy system: 1. The complexity of the hardware increases with the size of the memory managed. 2. Poor object locality. Software allocators: poor execution performance.
Proposed Hybrid Allocator. The new hybrid allocator: 1. Uses small, fixed hardware to help manage the memory. 2. Has a software portion, based on the PHK algorithm, that provides better object locality than the buddy system. 3. Has a hardware portion that improves the execution performance of the software portion.
Proposed Hybrid Allocator. The software portion is responsible for: 1. Creating page indexes. 2. Allocating large objects (more than half a page) without any assistance from the hardware. 3. For a small object, locating the bitmap of a page with free memory and issuing a search request to the hardware. The hardware portion: 1. Searches the page index (or bitmap) in parallel to find a free chunk. 2. Marks the bitmap to indicate an allocation.
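The split of work between the two portions can be sketched as a dispatch on request size. This is an illustrative model only: `plan_allocation` and its return labels are hypothetical names, and the 4096-byte page and 16-byte minimum chunk are assumed from the later slides.

```python
PAGE_SIZE = 4096   # assumed page size (from the later slides)
MIN_CHUNK = 16     # assumed smallest preselected object size


def plan_allocation(request: int):
    """Decide which portion of the hybrid allocator serves a request."""
    if request <= 0:
        raise ValueError("request must be positive")
    if request > PAGE_SIZE // 2:
        # Large object: software allocates whole pages, no hardware help.
        pages = (request + PAGE_SIZE - 1) // PAGE_SIZE
        return ("software-pages", pages)
    # Small object: software picks a page of the padded size, then the
    # hardware searches that page's bitmap for a free chunk.
    chunk = MIN_CHUNK
    while chunk < request:
        chunk *= 2
    return ("hardware-search", chunk)
```

A request of exactly half a page (2048 bytes) is treated as small here, which matches the largest preselected object size in the later slides.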
Proposed Hybrid Allocator. The OR-tree determines whether there is a free chunk in a page (similar to Chang’s system). The AND-tree locates the position of the first free chunk in the page (also similar to Chang’s system). Because each OR-tree and AND-tree pair is dedicated to one object size, the complex interconnections between the OR-tree and the AND-tree are not needed (unlike Chang’s design).
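What the two trees compute for a single page can be modeled in software. The sketch below is a functional model under stated assumptions, not the gate-level design: a bit value of 1 is assumed to mean "free", the bitmap length is assumed to be a power of two, and the position search walks the OR-tree levels rather than reproducing the AND-tree wiring. The names `or_tree` and `first_free` are hypothetical.

```python
def or_tree(bits):
    """Pairwise OR reduction; returns all levels from the leaves to the root."""
    levels = [list(bits)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] | prev[i + 1] for i in range(0, len(prev), 2)])
    return levels


def first_free(bits):
    """Index of the first free chunk (bit == 1), or None if the page is full."""
    levels = or_tree(bits)
    if levels[-1][0] == 0:
        return None  # root is 0: no free chunk anywhere in the page
    index = 0
    # Descend from just below the root to the leaves, preferring the left child.
    for level in reversed(levels[:-1]):
        index *= 2
        if level[index] == 0:
            index += 1  # left subtree has no free chunk, so go right
    return index


def or_gate_count(bits):
    """Number of two-input OR gates in the reduction tree."""
    return sum(len(level) for level in or_tree(bits)[1:])
```

The descent takes one step per tree level, so the search depth grows with the logarithm of the number of chunks, in contrast to the linear scan of the software-only allocator.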
Proposed Hybrid Allocator. The MUX uses the opcode to select the address of the bit that needs to be flipped: if the opcode is “alloc”, the address from the AND-tree is chosen; if the opcode is “free”, the address from the request is selected. D-latches serve as storage into which the bitmap is loaded from the page, in accordance with the allocation size. The DEMUX decodes the address from the MUX.
Proposed Hybrid Allocator. Bit-flippers use the decoded address and the opcode to determine how to flip the desired bit. Figure: block diagram of the proposed hardware component (for a page size of 4096 bytes and an object size of 16 bytes).
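The data path formed by the MUX, DEMUX, and bit-flippers can be modeled as one function per request. This is a behavioral sketch under assumptions that are not in the slides: a bit value of 1 means "free", the bitmap is a Python list standing in for the D-latches, and a simple scan stands in for the AND-tree output. The name `hardware_step` is hypothetical.

```python
def hardware_step(bitmap, opcode, request_index=None):
    """Process one request against a page bitmap; return the affected index."""
    if opcode == "alloc":
        # MUX selects the address produced by the search logic.
        address = next((i for i, bit in enumerate(bitmap) if bit == 1), None)
        if address is None:
            return None  # page is full, nothing to flip
        new_bit = 0      # allocated
    elif opcode == "free":
        # MUX selects the address carried by the request itself.
        address = request_index
        new_bit = 1      # free again
    else:
        raise ValueError("unknown opcode: " + repr(opcode))

    # DEMUX: decode the address into a one-hot select vector.
    select = [int(i == address) for i in range(len(bitmap))]

    # Bit-flippers: only the selected latch changes value.
    for i, selected in enumerate(select):
        if selected:
            bitmap[i] = new_bit
    return address
```

Usage: starting from `bitmap = [0, 0, 1, 1]`, an "alloc" returns index 2 and clears that bit, and a later "free" with `request_index=0` sets bit 0 back to 1.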
Proposed Hybrid Allocator. Overall design of the system with 4096-byte pages. The hardware needed to support the bitmap differs for different object sizes. In our design, the preselected object sizes range from 16 bytes to 2048 bytes, and hardware is included to support pages for each of these sizes. A MUX selects the hardware unit responsible for supporting objects of a given size. The larger the object size, the smaller the amount of hardware needed to support the bitmap indicating the availability of chunks in that page.
Proposed Hybrid Allocator. With 4096-byte pages, there are 8 object sizes, ranging from 16 bytes to 2048 bytes. Allocating 2048-byte objects needs trees with two leaves; allocating 16-byte objects needs trees with 256 leaves. A binary tree with 256 leaves has 255 internal nodes, so the 16-byte case needs only 255 AND gates and 255 OR gates. The overall system needs 1+3+7+15+31+63+127+255 = 502 AND gates and 502 OR gates. This is a very small amount compared with the billions of transistors available on modern processor chips.
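The gate count on this slide can be reproduced with a short calculation. The sketch assumes, as the slide implies, that each object size gets one binary tree of two-input gates with one leaf per chunk, so a tree with n leaves needs n - 1 gates; the names `gate_counts` and `OBJECT_SIZES` are hypothetical.

```python
PAGE_SIZE = 4096
OBJECT_SIZES = [16 * 2 ** k for k in range(8)]  # 16, 32, ..., 2048 bytes


def gate_counts(page_size, object_sizes):
    """Gates per tree: one leaf per chunk, so (chunks per page - 1) gates."""
    return {size: page_size // size - 1 for size in object_sizes}


counts = gate_counts(PAGE_SIZE, OBJECT_SIZES)
total = sum(counts.values())
print(counts)
print(total)  # 502, for each of the AND and OR tree sets
```

The total depends only on the page size and the smallest object size, not on how much memory the allocator manages.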
Complexity and Performance Comparison: Complexity Comparison. Existing hardware allocator designs implement the buddy system. The amount of hardware used to implement a buddy allocator depends on the size of the memory, which makes buddy-system-based allocators not scalable. Our design has much lower hardware complexity than Chang’s allocator (buddy system).
Complexity and Performance Comparison. Notation: M is the total dynamic memory size, P is the page size, and S is the smallest allocated object size.
Complexity and Performance Comparison: Performance Analysis. The hardware-assisted PHK allocator and a conventional CPU are evaluated using the SimpleScalar simulation tool set V2.0.
Complexity and Performance Comparison. We show the reduction in memory management execution cycles, normalized to the cycles originally spent on memory management functions by the software-only allocator. The cfrac application shows the best performance improvement. Its average object size is 8 bytes, which means that most allocated pages contain 256 objects (with a smallest chunk size of 16 bytes, 4096 / 16 = 256). A linear search over that many objects in the software implementation is very slow. The hardware speeds up the search, leading to a 76.2% normalized performance improvement over software-only allocation. The espresso benchmark, with an average object size of 250 bytes, shows the least improvement with the hybrid allocator. Pages allocated for espresso contain fewer than 20 objects, so the linear search is not significant, and the hardware allocator shows only a 48.0% normalized performance improvement. The other benchmarks have average object sizes of 16 to 48 bytes, so their gains are smaller than cfrac's but larger than espresso's. On average, the hybrid allocator reduces memory management time by 58.9%. The average overall execution speedup of our design over a software-only allocator implementation is 12.7%.
Conclusion. Compared with hardware-only allocators, our design has: 1. Significantly lower hardware complexity. 2. Lower critical path delays. 3. A fixed hardware complexity that depends on the size of a memory page, not on the total user memory being managed. Compared with software-only allocators: 1. Overall execution performance is 12.7% better on memory-intensive benchmarks. 2. Memory management efficiency is improved by 58.9%.
Conclusion: Future Work. Explore variable-sized pages such that every page holds the same number of allocated objects. All bitmaps would then have the same number of bits, so the design would need only one AND-tree and OR-tree pair. That would further reduce the hardware complexity and would also improve the memory management efficiency of allocators for large objects.
References
W. Li, S. P. Mohanty and K. Kavi, “A Page-based Hybrid (Software-Hardware) Dynamic Memory Allocator”, IEEE Computer Architecture Letters (accepted in July 2006 for a future issue).
J. M. Chang and E. F. Gehringer, “A High-Performance Memory Allocator for Object-Oriented Systems”, IEEE Transactions on Computers, Mar. 1996, pp. 357-366.
P. H. Kamp, “Malloc(3) revisited”, http://phk.freebsd.dk/pubs/malloc.pdf
D. E. Knuth, The Art of Computer Programming, Vol. I: Fundamental Algorithms, Addison-Wesley, 1968.
D. Burger and T. M. Austin, “The SimpleScalar Tool Set, V2.0”, Tech Report CS-1342, University of Wisconsin-Madison, Jun. 1997.