Superpages Revisited: Transparent Application of Large TLBs on Embedded Systems
Paul Mundt, Renesas Technology
© Renesas Technology Corp., All rights reserved.

The Coverage Problem

TLBs on embedded systems tend to be small, typically around 32 or 64 entries. PAGE_SIZE is generally 4k or 8k; 8k is commonly seen on platforms with aliasing VIPT L1 caches that wish to avoid aliasing altogether. These same systems support a wide range of page sizes precisely to maximize TLB coverage in a controlled environment, yet very few embedded platforms make use of this fact in the kernel today. This is the classic embedded problem: the desire for small page sizes combined with a reduction in the amount of time spent servicing TLB misses, while generally spending the entire product lifetime under heavy memory pressure, often without swap. For anyone unfortunate enough to be taking notes, this is what we are focusing on today.
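TLB coverage is simply the entry count multiplied by the page size each entry maps. A minimal, self-contained C sketch of the arithmetic, using the figures above (not taken from any particular part's manual):

    #include <stdio.h>

    int main(void)
    {
        const unsigned long entries = 64;      /* typical embedded U-TLB */
        const unsigned long page_sizes[] = {
            4096, 8192, 65536, 1UL << 20       /* 4k, 8k, 64k, 1MB */
        };

        for (unsigned int i = 0; i < sizeof(page_sizes) / sizeof(page_sizes[0]); i++)
            printf("%lu entries x %7lu B pages = %6lu KiB of coverage\n",
                   entries, page_sizes[i], entries * page_sizes[i] / 1024);
        return 0;
    }

With 4k pages a fully populated 64-entry TLB covers only 256 KiB; at 64k it already covers 4 MiB, which is the whole point of chasing larger sizes.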

The Coverage Problem

Most of these systems lack global PTE bits, and must wire TLB slots with fixed translations to preserve them across a TLB flush; a scheme often used (and abused) for fixmaps. This all depletes entries and reduces coverage even more. Some systems employ special support for fixed section mappings, often PMD-granular, which generally see a reduction in permission bits relative to their PTE counterparts. Other times extra TLBs are added, which may place more emphasis on caching behaviour and less on access control; a common case is mapping lowmem cached/uncached and flipping section bits in ioremap(). These extra TLBs may or may not include a miss exception, and are generally pre-faulted with fixed mappings or piggybacked on top of the first-level TLB miss. Some of them use their own page tables, making superpage promotion/demotion even more tedious.
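To make the wiring concrete, here is a hedged sketch of pinning a fixed translation into a reserved slot. Every name here is hypothetical, standing in for whatever the port's low-level TLB interface actually provides:

    #define PAGE_MASK      (~0xfffUL)   /* 4k pages assumed */
    #define ENTRY_PRESENT  (1UL << 0)   /* illustrative flag bits */
    #define ENTRY_DIRTY    (1UL << 1)

    struct tlb_entry {
        unsigned long vaddr;   /* virtual page number */
        unsigned long paddr;   /* physical frame */
        unsigned long flags;   /* validity, protection, caching */
    };

    /* Hypothetical low-level op: writes into a slot that the replacement
     * policy never touches, so the full flush forced on ASID rollover
     * (no global PTE bit, remember) leaves the translation intact. */
    extern void tlb_write_wired_slot(unsigned int slot, const struct tlb_entry *e);

    static void wire_fixmap(unsigned int slot, unsigned long vaddr,
                            unsigned long paddr)
    {
        struct tlb_entry e = {
            .vaddr = vaddr & PAGE_MASK,
            .paddr = paddr & PAGE_MASK,
            .flags = ENTRY_PRESENT | ENTRY_DIRTY,
        };

        tlb_write_wired_slot(slot, &e);
    }

Every slot wired this way is one the replacement policy can no longer use, which is exactly the coverage loss described above.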

Ways that large TLBs are presently used

ioremap() and friends
cached/uncached lowmem mapping
hugetlbfs/libhugetlb
Sparsemem vmemmap
Architecture extensions for abusing software-managed TLB entry loading

Ways that large TLBs are presently not used

Dynamically: i.e., page frame coalescing for order promotion and other academic evils.

Generally: huge pages and large contiguous physical blocks of memory are a scarce resource, and generally cease to be available shortly after system startup. Fortunately, embedded workloads are generally fixed, access patterns and behaviour are well understood, and performance degradation is quantifiable. Combined with things like ZONE_MOVABLE (see the boot-time example below), it is still possible to get back to a state where large TLBs can be used often enough to provide a measurable performance boost.
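ZONE_MOVABLE is carved out on the kernel command line, which keeps a region populated only by migratable allocations that can later be compacted back into large contiguous runs. The size below is purely illustrative; the right value is workload-dependent:

    movablecore=64M

This reserves 64MB for movable-only allocations; kernelcore= expresses the same split from the other direction.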

Overview of the SH-4A/SH-X2 MMU Architecture

Semi-Harvard TLB layout: a 64-entry unified TLB (U-TLB) containing I/D-TLB entries, plus a 4-entry I-TLB, loaded from the U-TLB and managed entirely by hardware.

PMB (Privileged Space Mapping Buffer): a 16-entry section-mapping TLB. It has no miss exception, so it must either be statically configured or loaded from a special page table piggybacking the TLB miss. Limited access control capabilities leave it usable only for kernel mappings, or for user mappings when using single-address-space mode for uClinux on MMU parts, where the role of the MMU is scaled back.

Wide range of page sizes: 4k/8k/64k/256k/1MB/4MB/64MB in the TLB (512MB additionally on the SH-5 TLB, generally used for the lowmem mapping), and 16MB/64MB/128MB/512MB in the PMB.

64-bit PTEs in extended TLB mode, primarily for extended protection bits, which also necessitates a 64-bit pgprot.

ioremap() and friends

Nothing terribly interesting to note here: the largest possible sizes are used and scaled down from, often establishing both PMD and PTE mappings depending on the size. This is also a tie-in for multiple TLBs: SH-X2 and later parts first attempt PMB mappings, to avoid building up TLB-bound page tables unnecessarily. This has the undesirable side effect that things like unmap_vm_area() have a lot of linear scanning to do before finding TLB-bound page tables (if there are any at all) when doing PMB tear-down. Not a hot path by any stretch of the imagination, so it is only mildly bothersome.
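A hedged sketch of the largest-size-first selection, using the SH-4A U-TLB sizes from the MMU overview; the real ioremap() path additionally builds the PMD/PTE entries and, on SH-X2, tries the PMB first:

    #include <stddef.h>

    /* U-TLB page sizes on SH-4A, smallest to largest. */
    static const unsigned long utlb_sizes[] = {
        4UL << 10, 8UL << 10, 64UL << 10, 256UL << 10,
        1UL << 20, 4UL << 20, 64UL << 20,
    };

    /* Pick the largest page size that fits in the remaining length and
     * keeps the physical address naturally aligned.  The caller loops,
     * consuming the mapping from largest to smallest. */
    static unsigned long pick_page_size(unsigned long phys, unsigned long len)
    {
        size_t i = sizeof(utlb_sizes) / sizeof(utlb_sizes[0]);

        while (i-- > 0) {
            unsigned long sz = utlb_sizes[i];

            if (sz <= len && (phys & (sz - 1)) == 0)
                return sz;
        }

        return utlb_sizes[0];   /* degenerate case: base page size */
    }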

cached/uncached lowmem mapping

This greatly simplifies ioremap() and friends, in that the caching attributes for a given mapping can be adjusted simply by flipping the high address bits. Some SH parts (and most MIPS parts, too) natively support identity-mapped segmentation with varying cache attributes, without having to go through the TLB; others simply set up mappings to provide the same functionality. While this makes changing caching attributes easy and greatly reduces TLB misses, it also wreaks havoc on the available virtual address space, negating the usefulness of things like sparsemem vmemmap, crimping vmalloc space, and so on.
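On SH the cached and uncached identity segments (P1 and P2) differ in a single high address bit, so converting between the two views of lowmem is pure arithmetic; a minimal sketch, with the segment bases as on SH-4 (MIPS KSEG0/KSEG1 work the same way):

    #define P1SEG 0x80000000UL   /* identity-mapped, cached   */
    #define P2SEG 0xa0000000UL   /* identity-mapped, uncached */

    /* Re-express a cached P1 address as its uncached P2 alias.  No TLB
     * entry is consumed and no page table is touched. */
    static inline void *p1_to_p2(void *p)
    {
        return (void *)(((unsigned long)p & 0x1fffffffUL) | P2SEG);
    }

    static inline void *p2_to_p1(void *p)
    {
        return (void *)(((unsigned long)p & 0x1fffffffUL) | P1SEG);
    }

The cost is address space: P1 and P2 each consume 512MB of the 4GB virtual space, which is exactly the havoc referred to above.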

hugetlbfs/libhugetlb

Generally requires static reservation of pages at boot time. Variable page sizes are supportable, but the optimal sizes will generally depend on the target workload. While there is no one-size-fits-all, and huge pages are a scarce resource, they are also the simplest interface through which applications can get a handle on large TLBs. This still requires careful profiling of the workload to determine the optimal number of reservations, sizes, and so on. Most applications have no awareness of huge pages, so a more transparent option is needed for bridging the two. So we opt for the simplest solution any time a minor userspace problem arises: hacking the libc directly and telling applications what they really want.
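For reference, the application-visible interface is just a file on a hugetlbfs mount. A minimal userspace sketch, assuming pages have been reserved via /proc/sys/vm/nr_hugepages and that a mount exists at /mnt/huge (both paths are illustrative):

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define LENGTH (4UL << 20)   /* must be a multiple of the huge page size */

    int main(void)
    {
        int fd = open("/mnt/huge/example", O_CREAT | O_RDWR, 0600);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        void *p = mmap(NULL, LENGTH, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) {
            perror("mmap");
            return 1;
        }

        ((volatile char *)p)[0] = 1;   /* touch: faults in one huge page */

        munmap(p, LENGTH);
        close(fd);
        unlink("/mnt/huge/example");
        return 0;
    }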

libhugetlb and uClibc, together at last

uClibc contains a pluggable allocator backend that makes experimenting with various implementations fairly trivial, which makes it a good match for experimenting with libhugetlb. With libhugetlb integrated into the allocator backend, large allocations that match the huge page size are automatically backed with huge pages when they are available. This behaviour can be enabled/disabled explicitly per application, or simply left to the hugetlb code to do the best job it can. Even with a fairly dumb implementation, applications using large allocations for anonymous memory see a measurable performance gain; in sample workloads this has resulted in gains of 5-25% on a loaded system. If application access patterns are more random, or the system workload only employs small allocations, the performance degradation is marginal. However, memory is wasted if more pages are reserved than are used.
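A hedged sketch of the allocator-backend idea. The function name is hypothetical, and MAP_HUGETLB (which postdates this work; the original integration went through a hugetlbfs mount) is used only to keep the example short:

    #define _GNU_SOURCE
    #include <stdlib.h>
    #include <sys/mman.h>

    #define HUGE_PAGE_SIZE (4UL << 20)   /* assumption: one configured size */

    /* Large allocations try a hugetlb-backed anonymous mapping first and
     * quietly fall back to the regular heap if no huge pages are free. */
    void *backend_alloc(size_t size)
    {
        if (size >= HUGE_PAGE_SIZE) {
            size_t rounded = (size + HUGE_PAGE_SIZE - 1) & ~(HUGE_PAGE_SIZE - 1);
            void *p = mmap(NULL, rounded, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

            if (p != MAP_FAILED)
                return p;                /* backed by huge pages */
        }

        return malloc(size);             /* normal small-page path */
    }

A real backend also has to remember which path an allocation took, so that the corresponding free can munmap() hugetlb-backed blocks rather than handing them to the heap.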

Sparsemem vmemmap

Similar to the ioremap() implementation: the mappings are established using the largest possible size, scaling down as needed. Far too heavy on virtual address space to be of general use on constrained platforms, but a reasonable trade-off for platforms that value more effective TLB utilization over address space conservation. If you aren't counting individual pages of available virtual address space while subsequently losing count of the number of bounce buffers, this can mean you!

Magical Architecture Extensions (MAE)

Systems with software-managed TLBs allow for much greater flexibility in how the TLB miss is serviced, which may not even entail loading an entry into the TLB that took the initial miss. On platforms with enough free PTE bits, a combination of order recording and page hinting can trivially be used to transparently construct a TLB entry that matches the order recorded at the initial page fault. In practice this can be optimized down to a handful of additional instructions, given that the page table walk takes place regardless of the order hint. Presently this does not handle PTE-to-PMD transitions, which remains something to investigate in future work. Page hinting was introduced by s390 for aiding guests and hypervisors. When faulting in a section mapping, alignment hinting can offer suggestions as to whether walking an additional page table is worth the extra overhead or not. Where an access is mispredicted, overhead increases, but integrity is maintained as the TLB is still loaded as a fallback.
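A hedged sketch of what the order-hinted refill might look like on a software-managed TLB. The helper names and the choice of PTE bits are illustrative assumptions, not the actual SH implementation:

    #define PAGE_SHIFT       12
    #define PAGE_SIZE        (1UL << PAGE_SHIFT)

    #define PTE_ORDER_SHIFT  9                      /* assumed free PTE bits */
    #define PTE_ORDER_MASK   (0x7UL << PTE_ORDER_SHIFT)

    /* Hypothetical low-level helpers provided by the port. */
    extern unsigned long walk_page_table(unsigned long addr);
    extern void load_utlb_entry(unsigned long vaddr, unsigned long pte,
                                unsigned int order);

    static void tlb_refill(unsigned long fault_addr)
    {
        /* The walk happens on every miss regardless, so pulling the order
         * hint out of the PTE costs only a handful of extra instructions. */
        unsigned long pte = walk_page_table(fault_addr);
        unsigned int order = (pte & PTE_ORDER_MASK) >> PTE_ORDER_SHIFT;

        /* A non-zero order means this PTE belongs to a naturally aligned
         * 2^order run of pages: load one large entry covering all of it.
         * Order 0 degenerates to an ordinary base-page refill. */
        load_utlb_entry(fault_addr & ~((PAGE_SIZE << order) - 1), pte, order);
    }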

Magical Architecture Extensions (MAE)

In the case of hardware-loaded TLBs, fewer optimizations can be made: PTEs must be order-encoded, and many more generic code changes are required. While many optimizations can still be made for these platforms, the lack of software management prohibits any intelligent decision making at TLB miss time. On the other hand, it is not always a net benefit, as the system needs to support page sizes that match the most frequently used page orders. It always depends on the workload. Fortunately, on embedded systems, these things are measurable!

Summary

While there are many trivial options for using large TLBs more transparently on embedded systems, there has been little to no effort in this area; in the past, interest has been all or nothing. Empirical testing, on the other hand, suggests that even making use of large TLBs for anonymous memory alone is measurable enough to make it worthwhile. Even if TLBs grow larger and the coverage problem is lessened, embedded systems will still prefer a tiny page size. For embedded systems in general, having control over the applications and workloads the system is exposed to allows for careful profiling and figuring out where the bottlenecks are. While constant TLB pressure is not often considered a big problem on modern systems, measurement has shown significant performance degradation on even the most typical embedded workloads.

Questions?

Concerns?