Energy-Efficient Address Translation

Presentation transcript:

Energy-Efficient Address Translation
Vasileios Karakostas, Jayneel Gandhi, Adrian Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman S. Unsal

Executive Summary
- Problem: TLBs consume energy, especially on hits
- Insight: increased TLB reach reduces TLB pressure
- Lite mechanism: selectively resize L1 TLB resources to save energy
  - TLBLite design: commodity processors with huge pages
  - RMMLite design: RMM with support for range translations [ISCA '15]
- Results: 23%-71% reduction in address translation energy

Outline
- Motivation + Opportunity
- Energy-Efficient Address Translation
- Results

Virtual Memory is not free
- Performance overhead due to page walks (addressed by previous work)
- Energy overhead due to TLB lookups: paid on every memory operation, > 90% of it in the L1 TLBs*
- This work: leverage TLB reach to improve the energy efficiency of TLBs
*Sodani's/Intel keynote at MICRO 2011

Base 4KB Pages
- Per-core TLB hierarchy; focus on data TLBs
- Lookup path: Core -> L1 TLB; on a miss -> L2 TLB; on an L2 miss -> page walk via the page table walker
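
A minimal sketch of this lookup path in Python (illustrative only: TLB capacities and eviction are abstracted away, and the dict-based structures are assumptions, not the hardware design):

    PAGE_SIZE = 4096  # 4KB base pages

    def translate(vaddr, l1_tlb, l2_tlb, page_table):
        """Translate vaddr; l1_tlb, l2_tlb, page_table are dicts of VPN -> PPN."""
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in l1_tlb:                     # L1 TLB hit: the common case
            ppn = l1_tlb[vpn]
        elif vpn in l2_tlb:                   # L1 miss, L2 hit: refill L1
            ppn = l1_tlb[vpn] = l2_tlb[vpn]
        else:                                 # L2 miss: page walk, refill both
            ppn = page_table[vpn]
            l2_tlb[vpn] = l1_tlb[vpn] = ppn
        return ppn * PAGE_SIZE + offset

Every memory operation enters at the top of this function, which is why the L1 lookup dominates translation energy even when it hits.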

Base 4KB Pages
- Performance overhead: up to 50% in page walks
- Energy overhead: 60% in L1 TLB accesses, 35% in page walks

Huge Pages
- Separate L1 TLBs for 4KB and 2MB pages, backed by the shared L2 TLB and the page table walker

Huge Pages
- Performance improves, but energy increases by 4% (up to 43%) due to the separate L1 TLBs
- 91% of energy goes to the L1 TLBs

Redundant Memory Mappings [ISCA '15]
- Range translations: arbitrarily large mappings from contiguous virtual pages to contiguous physical pages, with uniform protection

Redundant Memory Mappings [ISCA '15]
- Adds an L2 range TLB alongside the 4KB/2MB L1 TLBs and the L2 TLB

Redundant Memory Mappings [ISCA '15]
- Performance improves a lot: increases L2 TLB reach, eliminates page walks
- Energy still high: the "innocent" L1 TLB hits! 98% of energy in the L1 TLBs

State of the Art
- 4KB pages, huge pages, and RMM trade off performance against dynamic energy
- Our goal: improved performance and reduced energy

Key Observation
- Can we leverage increased TLB reach to save energy? Yes!
- Larger reach naturally takes pressure off the L1 TLBs: a single 2MB entry == 512 x 4KB entries
- So why access all L1 TLB entries, especially for 4KB pages? Look up fewer entries -> reduce dynamic energy with similar performance
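
The reach arithmetic behind this observation, as a quick check (the 64-entry and 32-entry L1 TLB sizes are the baseline configurations given later in the Results section):

    # TLB reach = entries x page size; a single 2MB entry covers the same
    # address space as 512 separate 4KB entries.
    assert (2 * 1024 * 1024) // (4 * 1024) == 512
    reach_4kb = 64 * 4 * 1024            # 64-entry 4KB L1 TLB -> 256 KB reach
    reach_2mb = 32 * 2 * 1024 * 1024     # 32-entry 2MB L1 TLB -> 64 MB reach
    print(reach_4kb // 1024, "KB vs", reach_2mb // 2**20, "MB")  # 256 KB vs 64 MB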

Outline
- Motivation + Opportunity
- Energy-Efficient Address Translation
  - The Lite mechanism
  - TLBLite design for huge pages
  - RMMLite design for range translations
- Results

The Lite Mechanism
- Goal: save TLB energy with similar performance
- Utility-based monitoring [Dropsho et al. PACT '02, Qureshi et al. MICRO '06]: track the distance of TLB hits from the MRU position to infer the utility of the active ways
- Way disabling

The Lite Mechanism
- L1 TLB distance counters track the MRU-distance of hits to monitor utility: C0 counts hits at distance 0, C1 at distance 1, C2 at distances 2-3
- Example: a hit on the MRU entry of a set (distance == 0) increments C0

The Lite Mechanism
- Example: a hit at distance 1 (e.g., page 30 in set 0) increments C1, and the entry moves to the MRU position

The Lite Mechanism
- After many TLB accesses the monitoring interval ends, e.g. with C0 = 95, C1 = 46, C2 = 2
- Few hits at distances 2-3 means ways 2-3 are less useful: disable them [Albonesi, MICRO '99]

The Lite Mechanism
- With ways disabled, every TLB lookup probes fewer entries and saves energy
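
Putting the last few slides together, a minimal sketch of Lite's monitoring loop. The C0/C1/C2 grouping is from the slides; the 5% disable threshold and the interval-reset policy are assumptions, since the slides do not give the exact decision rule:

    NUM_WAYS = 4
    counters = [0, 0, 0]   # C0: distance 0, C1: distance 1, C2: distances 2-3

    def record_hit(lru_stack, vpn):
        """Count the MRU-distance of a hit, then move the entry to MRU."""
        dist = lru_stack.index(vpn)        # 0 = MRU position
        counters[min(dist, 2)] += 1        # distances 2 and 3 share counter C2
        lru_stack.insert(0, lru_stack.pop(dist))

    def ways_to_enable(threshold=0.05):
        """At interval end: if ways 2-3 served few hits, probe only ways 0-1."""
        total = sum(counters) or 1
        active = 2 if counters[2] / total < threshold else NUM_WAYS
        counters[:] = [0, 0, 0]            # start a new monitoring interval
        return active

With the slide's example values (C0 = 95, C1 = 46, C2 = 2), C2 holds under 2% of hits, so the lookup drops to two ways until a later interval shows the far ways are useful again.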

TLBLite Design
- Applies Lite to both L1 TLBs (4KB and 2MB), in front of the shared L2 TLB

RMMLite Design
- Adds an L1 range TLB next to the Lite-managed L1 TLBs, backed by the L2 TLB and L2 range TLB
- Its high hit ratio means fewer L1 TLB misses, which lets Lite disable more ways

RMMLite Design
- The L1 range TLB is small but efficient: just 4 entries, each holding an arbitrarily large mapping from virtual to physical memory
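
A minimal sketch of what an L1 range-TLB lookup might look like. The 4-entry fully associative organization is from the slides; the base/pages/offset field names and the 1GB example are illustrative assumptions:

    # A range translation maps [base_vpn, base_vpn + pages) to contiguous
    # physical pages with one offset, so 4 entries can cover huge regions.
    from collections import namedtuple

    Range = namedtuple("Range", "base_vpn pages offset")  # offset = ppn - vpn

    def range_tlb_lookup(ranges, vpn):
        for r in ranges:                   # tiny (e.g. 4-entry), fully associative
            if r.base_vpn <= vpn < r.base_vpn + r.pages:
                return vpn + r.offset      # one add yields the physical page
        return None                        # fall back to the page-based TLBs

    # Example: one entry covers a 1GB region (262144 x 4KB pages' worth of reach).
    heap = Range(base_vpn=0x10000, pages=262144, offset=0x40000)
    assert range_tlb_lookup([heap], 0x10005) == 0x50005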

Outline
- Motivation + Opportunity
- Energy-Efficient Address Translation
- Results

Methodology
- Developed an MMU simulator using Pin, CACTI, and profiling with Linux pagemap
- Baseline: Intel Sandy Bridge
- Dynamic energy and performance models for the address translation path
- TLB-intensive workloads from SPEC 2006, BioBench, and PARSEC
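
A minimal sketch of how a dynamic-energy model of this kind might combine event counts with per-access energies. The structure follows the slide's description; the event names and picojoule values are placeholders, not the paper's CACTI numbers:

    # Dynamic translation energy = sum over structures of accesses x energy/access.
    ENERGY_PJ = {"l1_tlb_way": 1.0, "l2_tlb": 8.0, "page_walk": 50.0}  # assumed

    def translation_energy(counts):
        """counts maps event name -> occurrences, e.g. total L1 ways probed."""
        return sum(counts.get(event, 0) * pj for event, pj in ENERGY_PJ.items())

    # Lite reduces the l1_tlb_way count per lookup, shrinking the first term.
    print(translation_energy({"l1_tlb_way": 4_000_000, "l2_tlb": 50_000,
                              "page_walk": 1_000}))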

Metrics (geometric means; detailed results in paper)
- Miss cycles: cycles spent in address translation
- Energy: dynamic energy spent in address translation

4KB Baseline
- L1 TLB: 64 entries, 4-way; L2 TLB: 512 entries, 4-way (results normalized to this configuration)
- High performance overhead: page walks
- High energy overheads: 60% in the L1 TLB, 35% in page walks

2MB (Huge Pages)
- 4KB configuration + an L1 2MB TLB (32 entries, 4-way)
- Performance improves, but energy increases by 4% on average and up to 43% due to the separate L1 TLBs

TLBLite
- 2MB configuration + Lite
- Similar performance to 2MB; reduces energy by 23%
- 49% of lookups use fewer than 4 ways in the L1 4KB TLB

RMM [ISCA '15]
- 2MB configuration + an L2 range TLB (32 entries, fully associative)
- Even better performance, but energy is still high (similar to 2MB pages) because of the L1 TLBs

RMMLite
- RMM configuration + an L1 range TLB (4 entries, fully associative) + Lite
- Better performance vs. RMM: fewer L1 TLB misses
- Reduces dynamic energy by 71%: 84% of hits come from the L1 range TLB, and 63% of lookups use 1 way in the 4KB TLB

Summary
- Observation: increased TLB reach reduces TLB pressure
- TLBLite and RMMLite match the performance of huge pages and RMM while reducing dynamic energy by 23% and 71%, respectively

Thank you!

BACKUP SLIDES

Summary
- Problem: TLBs consume energy, especially on hits
- Insight: increased TLB reach reduces TLB pressure
- Lite mechanism: selectively resize L1 TLB resources to save energy
  - TLBLite design: commodity processors with huge pages
  - RMMLite design: RMM with support for range translations [ISCA '15]
- Results: 23%-71% reduction in address translation energy

Related Work
- Optimizing TLBs for energy efficiency:
  - Circuit techniques [ISLPED '97]
  - Partitioning TLBs [ISLPED '03, CASES '06, ISLPED '13]
  - Filtering TLB requests [ISLPED '05, TVLSI '07]
  - Dynamically resizing the TLB [MICRO '00]: single reference bit per TLB entry; targets a monolithic TLB
  - Selective TLB lookups [MICRO '02, ISPASS '04, CODES '04]: require compiler support and special registers

Related Work
- TLB-Pred [HPCA '15]: all page sizes in a single set-associative TLB; see paper for comparison
- Virtual caches: defer address translation until a cache miss, but increase hardware complexity (synonyms, protection)

Miss Cycles Overhead: TLB-Intensive Workloads

Energy Overhead: TLB-Intensive Workloads

Miss Cycles and Energy Overheads: SPEC 2006
- TLBLite and RMMLite reduce dynamic energy by 26% and 72%, respectively

Miss Cycles and Energy Overheads: PARSEC
- TLBLite and RMMLite reduce dynamic energy by 20% and 66%, respectively

Percentage of L1 TLB lookups with active ways:

              TLBLite                      RMMLite
          4 ways  2 ways  1 way       4 ways  2 ways  1 way
    4KB   51.2%   32.9%   15.9%       25.9%   10.4%   63.7%
    2MB   81.1%    9.0%    9.9%        -----   -----   -----

Distribution of L1 TLB hits:

             TLBLite    RMMLite
    4KB        --        15.9%
    2MB       35.6%       --
    Range      --        84.1%

Comparison with TLB-Pred [HPCA '15]
- Same as the 2MB configuration, with perfect prediction; both the L1 and L2 TLBs hold 4KB and 2MB pages
- Improved performance; energy drops vs. the 2MB configuration
- Still, RMMLite is more energy-efficient, and prediction is orthogonal to range translations

Impact of Eager Paging