We think you have liked this presentation. If you wish to download it, please recommend it to your friends in any social system. Share buttons are a little bit lower. Thank you!
Presentation is loading. Please wait.
Published byClaire Hicks
Modified over 2 years ago
© Renesas Technology Corp., All rights reserved. Renesas Technology Paul Mundt Superpages Revisted: Transparent Application of Large TLBs on Embedded Systems
Non-Confidential © Renesas Technology Corp., All rights reserved. 2 The Coverage Problem TLBs on embedded systems tend to roughly be around 32 or 64 entries on average. PAGE_SIZE is generally 4k or 8k. 8k is commonly seen on platforms with aliasing VIPT L1 caches that wish to avoid aliasing altogether. These same systems support a wide range of page sizes to maximize TLB coverage in a controlled environment. Very few embedded platforms make use of this fact in the kernel today. The classic embedded problem, the desire for small page sizes as well as a reduction in the amount of time spent servicing TLB misses. While generally spending the entire product lifetime under heavy memory pressure, often without swap. For anyone unfortunate enough to be taking notes, this is what we are focusing on today.
Non-Confidential © Renesas Technology Corp., All rights reserved. The Coverage Problem Most of these systems lack global PTE bits and must wire TLB slots with fixed translations to preserve across a TLB flush. A scheme often used/abused for fixmaps. This all depletes entries and reduces coverage even more. Some systems employ special support for fixed section mappings. Often PMD-granular section mapping. Which generally see a reduction in permission bits over their PTE counterparts. Other times extra TLBs are added. These may include more emphasis on caching behaviour and less on access control. A common case for mapping lowmem cached/uncached and flipping section bits in ioremap(). And may or may not include a miss exception. Generally pre-faulted with fixed mappings, or piggybacked on top of the first-level TLB miss. Some of these will use their own page tables. Making superpage promotion/demotion even more tedious.
Non-Confidential © Renesas Technology Corp., All rights reserved. Ways that large TLBs are presently used ioremap() and friends ioremap() and friends cached/uncached lowmem mapping cached/uncached lowmem mapping hugetlbfs/libhugetlb hugetlbfs/libhugetlb Sparsemem vmemmap Sparsemem vmemmap Architecture extensions for abusing software-managed TLB entry loading. Architecture extensions for abusing software-managed TLB entry loading.
Non-Confidential © Renesas Technology Corp., All rights reserved. Ways that large TLBs are presently not used Dynamically Dynamically ie, Page frame coalescing for order promotion and other academic evils. ie, Page frame coalescing for order promotion and other academic evils. Generally Generally Huge pages and large contiguous physical blocks of memory are a scarce resource, and generally cease to be available shortly after system startup. Huge pages and large contiguous physical blocks of memory are a scarce resource, and generally cease to be available shortly after system startup. Fortunately workloads are generally fixed, access patterns and behaviour is well understood, and performance degradation is quantifiable. Fortunately workloads are generally fixed, access patterns and behaviour is well understood, and performance degradation is quantifiable. Combined with things like ZONE_MOVABLE, it's still possible to get back to a state where large TLBs can still be used often enough to provide a measurable performance boost. Combined with things like ZONE_MOVABLE, it's still possible to get back to a state where large TLBs can still be used often enough to provide a measurable performance boost.
Non-Confidential © Renesas Technology Corp., All rights reserved. Overview of the SH-4A/SH-X2 MMU Architecture Semi-harvard TLB layout. 64-entry Unified TLB containing I/D-TLB entries. 4-entry I-TLB, loaded from the U-TLB and managed entirely by hardware. PMB (Privileged Space Mapping Buffer) 16-entry section-mapping TLB. No miss exception, so must be either statically configured or loaded from a special page table piggybacking the TLB miss. Limited access control capabilities, only usable for kernel mappings. Usable for user mappings if using single-address space mode for uClinux on MMU parts where the role of the MMU is scaled back. Wide range of page sizes. 4k/8k/64k/256k/1MB/4MB/64MB in TLB. 512MB also on SH-5 TLB, generally used for lowmem mapping. 16MB/64MB/128MB/512MB in PMB. 64-bit PTEs for extended TLB mode, primarily for extended protection bits, which also necessitates a 64-bit pgprot.
Non-Confidential © Renesas Technology Corp., All rights reserved. ioremap() and friends Nothing terribly interesting to note here. The largest possible sizes are used and scaled down from. Often establishing both PMD and PTE mappings, depending on size. This is also a tie-in for multiple TLBs. SH-X2 and later parts first attempt PMB mappings to avoid building up TLB-bound page tables unnecessarily. Has the undesirable side-effect that things like unmap_vm_area() have a lot of linear scanning to do before finding TLB-bound page tables (if at all) when doing PMB tear-down. Not a hot path by any stretch of the imagination, so only mildly bothersome.
Non-Confidential © Renesas Technology Corp., All rights reserved. cached/uncached lowmem mapping Greatly simplifies ioremap() and friends in that caching attributes for a given mapping can be adjusted by flipping the high bits. Some SH parts (and most MIPS, too) support identity- mapped segmentation with varying cache attributes natively without having to go through the TLB. While this makes changing caching attributes easy and greatly reduces TLB misses, it also wreaks havoc on available virtual address space, negating the usefulness of things like sparsemem vmemmap, crimping vmalloc space, etc. Others simply set up mappings to provide the same functionality.
Non-Confidential © Renesas Technology Corp., All rights reserved. hugetlbfs/libhugetlb Generally requires static reservation of entries at boot time. Variable page sizes are supportable, but the optimal sizes will generally depend on target workload. While there is no one size fits all, and huge pages are a scarce resource, they are also the simplest interface for applications getting a handle on large TLBs. Still requires careful profiling of the workload to determine optimal number of reservations, sizes, etc. Most applications have no awareness of huge pages, so a more transparent option is needed for bridging the two. So we opt for the simplest solution any time a minor userspace problem arises -- hacking the libc directly and telling applications what they really want.
Non-Confidential © Renesas Technology Corp., All rights reserved. libhugetlb and uClibc, together at last uClibc contains a pluggable allocator backend that makes experimenting with various implementations fairly trivial, which makes it a good match for experimenting with libhugetlb. With integration of libhugetlb in the allocator backend, large allocations that match the huge page size are automatically backed with huge pages if they are available. This behaviour can be enabled/disabled explicitly for applications, or simply left to the hugetlb code to try and do the best job it can. Even with fairly dumb implementation, applications using large allocs for anonymous memory get a measurable performance gain. In sample workloads this has resulted in gains from 5-25% on a loaded system. If application access patterns are more random, or only small allocations are employed by the system workload, performance degradation is marginal. However, memory is wasted if more pages are reserved than are used.
Non-Confidential © Renesas Technology Corp., All rights reserved. Sparsemem vmemmap Similar to the ioremap() implementation, where the mappings are established using the largest possible size and scaling down. Far too heavy on virtual address space to be of general use on constrained platforms... but a reasonable trade-off for platforms that value more effective TLB utilization over address space conservation. If you aren't counting individual pages of available virtual address space while subsequently losing count of the number of bounce buffers, this can mean you!
Non-Confidential © Renesas Technology Corp., All rights reserved. Magical Architecture Extensions (MAE) Systems with software-managed TLBs allow for much greater flexibility in how the TLB miss is serviced. Which may not even entail loading an entry in to the TLB that took the initial miss. A combination of order recording and page hinting can trivially be used for platforms with enough free PTE bits to transparently construct a TLB entry that matches the entry on the initial page fault. In practice this can be optimized down to a handful of additional instructions given that the page table walking is taking place regardless of order hint. Presently this does not handle PTE->PMD transitions, which remains something to investigate in future work. Page hinting introduced by s390 for aiding guests and hypervisors. In the case of faulting in a section mapping, alignment hinting can offer suggestions whether walking an additional page table is worth the extra overhead or not. In the case where access is mispredicted, overhead increases, but integrity is maintained as the TLB is still loaded as a fallback.
Non-Confidential © Renesas Technology Corp., All rights reserved. Magical Architecture Extensions (MAE) In the case of hardware loaded TLBs, fewer optimizations can be made, requiring PTEs to be order encoded, and requiring many more generic code changes. While many optimizations can still be made for these platforms, the lack of software-management prohibits any intelligent decision making at TLB miss time. On the other hand, it is not always a net benefit, as the system needs to support page sizes that match the most frequently used page orders. It always depends on the workload. Fortunately on embedded systems, these things are measurable!
Non-Confidential © Renesas Technology Corp., All rights reserved. Summary While there are many trivial options for using large TLBs more transparently on embedded systems, there has been little to no effort in this area. In the past interest in this area has been an all or nothing. Empirical testing on the other hand suggests that even making use of large TLBs for anonymous memory is measurable enough to make it worthwhile. Even if TLBs grow larger and the coverage problem is lessened, systems will still prefer a tiny page size. For embedded systems in general, having control over the applications and workloads that the system is exposed to allows for careful profiling and figuring out where the bottlenecks are. While being under constant TLB pressure is not often considered a big problem on modern systems, measurement has shown that there is still significant performance degradation on even the most typical embedded workloads.
Non-Confidential © Renesas Technology Corp., All rights reserved. Questions?
Non-Confidential © Renesas Technology Corp., All rights reserved. Concerns?
Chapter 12 Technology. INTRODUCTION This chapter considers technology in general, with some limited emphasis on software. The life cycle and software.
Address Translation Tore Larsen Material developed by: Kai Li, Princeton University.
Chapters 7 & 8 Memory Management Operating Systems: Internals and Design Principles, 6/E William Stallings Patricia Roy Manatee Community College, Venice,
Introduction to Software Exploitation Corey K.. All materials is licensed under a Creative Commons Share Alike license.
Linux: The Guts By Sam Evans and John Massey. History Of Linux ❖ 1984: Richard Stallman quits his job at MIT, and starts working on the GNU project. ❖
Intermediate x86 Part 2 Xeno Kovah – 2010 xkovah at gmail.
Public Information Version 3.1: 1/1/2012 Introducing Instant Business Intelligence To IT BI Project Managers What you need, when you need it
What is an Operating System? A program that acts as an intermediary between a user of a computer and the computer hardware. Operating system goals: Execute.
SISTEMA EMBEBIDO. OPERATING SYSTEMS TYPES The Linux OS is monolithic. Generally operating systems come in three flavors: real-time executive, monolithic,
IOS103 OPERATING SYSTEM VIRTUAL MEMORY. Objectives At the end of the course, the student should be able to: Define virtual memory; Discuss the demand.
Memory Management Background Swapping Contiguous Memory Allocation Paging Structure of the Page Table Segmentation Example: The Intel Pentium.
Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B+-Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered.
Design and Implementation Issues Today Design issues for paging systems Implementation issues Segmentation Next I/O.
LIS650 lecture 3 CSS positioning & site architecture Thomas Krichel 2009–02–08.
Debugging Tools Towards better use of system tools to weed the nasty critters out of your programs.
PCI Bus CENG Spring Dr. Yuriy ALYEKSYEYENKOV 2 The PCI (Peripheral Component Interconnect) bus was developed as a low-cost, processor-independent.
What happened to IPv5? and other oft asked IPv6 questions The Internet Society, IPv6 and You Susan Estrada.
ADAMA UNIVERSITY SCHOOL OF ENGENEERING & INFORMATION TECHNOLOGIES Sem. II,2010/11 INSTRUCTOR TARIKU W OPERATING SYSTEM (IT3016)
1 Principles of Virtual Memory Virtual Memory, Paging, Segmentation.
® IBM Software Group © 2008 IBM Corporation A new feature providing mainframe development flexibility David Myers Rational Developer for System z Product.
File Systems. Storing Information Applications can store it in the process address space Why is it a bad idea? –Size is limited to size of virtual address.
OS Organization Continued Andy Wang COP 5611 Advanced Operating Systems.
Vipul Patel Ideas … Please …
LIS650 lecture 3 CSS layout & site architecture Thomas Krichel
VMware vCenter Server High Availability Product Support Engineering VMware Confidential.
Main Memory. Goals for Today Protection: Address Spaces –What is an Address Space? –How is it Implemented? Address Translation Schemes –Segmentation –Paging.
Database System Concepts, 6 th Ed. ©Silberschatz, Korth and Sudarshan See for conditions on re-usewww.db-book.com Chapter 10: Storage and.
Introduction to cloud computing Jiaheng Lu Department of Computer Science Renmin University of China
© 2013 A. Haeberlen NETS 212: Scalable and Cloud Computing 1 University of Pennsylvania Storage at Facebook December 3, 2013.
© 2016 SlidePlayer.com Inc. All rights reserved.