Energy-Efficient Address Translation

Vasileios Karakostas, Jayneel Gandhi, Adrian Cristal, Mark D. Hill, Kathryn S. McKinley, Mario Nemirovsky, Michael M. Swift, Osman S. Unsal

Executive Summary
Problem: TLBs consume energy, especially hits
Energy-Efficient Address Translation
  Insight: Increased TLB reach reduces TLB pressure
  Lite mechanism: selectively resize L1 TLB resources to save energy
  TLBLite design: commodity processors with huge pages
  RMMLite design: RMM with support for range translations [ISCA ’15]
Results: 23% - 71% reduction in address translation energy

Outline
Motivation + Opportunity
Energy-Efficient Address Translation
Results

Virtual Memory is not free
Performance overhead due to page walks (previous work)
Energy overhead due to TLB lookups
  On every memory operation
  > 90% in L1 TLBs
This work: Leverage TLB reach to improve energy-efficiency
[Figure label: TLBs]
*Sodani’s/Intel keynote at MICRO 2011

Base 4KB Pages
Per-core TLB hierarchy
Focus on data TLBs
[Figure: Core -> L1 TLB (hit/miss) -> L2 TLB (miss) -> page walk by the Page Table Walker]

Base 4KB Pages
Performance overhead
  Up to 50% in page walks
Energy overhead
  60% in L1 TLB accesses
  35% in page walks
[Figure: Core, L1 TLB, L2 TLB, Page Table Walker]

Huge Pages
[Figure: Core, L1 TLB split into a 4KB TLB and a 2MB TLB, L2 TLB, Page Table Walker]

Huge Pages
Performance improves, but..
Energy increases by 4%
  Up to 43% increase
  Separate L1 TLBs
  91% of energy in L1 TLBs
[Figure: Core, 4KB TLB, 2MB TLB, L2 TLB, Page Table Walker]

Redundant Memory Mappings [ISCA ’15]
Range Translations
  Arbitrarily large mappings from contiguous virtual pages to contiguous physical pages with uniform protection
[Figure: Virtual Memory mapped to Physical Memory]
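A minimal sketch of what a single range translation could look like in software, to make the definition above concrete. The field names (base, limit, offset, prot) and the example addresses are assumptions chosen for illustration; the slides do not specify an entry format.

```python
from dataclasses import dataclass
from typing import Optional

PAGE_SIZE = 4096  # 4KB base pages


@dataclass(frozen=True)
class RangeTranslation:
    """One range translation (hypothetical field names)."""
    base: int    # first virtual address of the range (page-aligned)
    limit: int   # one past the last virtual address of the range
    offset: int  # physical address = virtual address + offset
    prot: str    # uniform protection for the whole range, e.g. "rw"

    def translate(self, vaddr: int) -> Optional[int]:
        """Return the physical address, or None if vaddr is outside the range."""
        if self.base <= vaddr < self.limit:
            return vaddr + self.offset
        return None


# Illustrative mapping: 1024 contiguous virtual pages onto
# 1024 contiguous physical pages (addresses are made up).
VBASE = 0x4000_0000
PBASE = 0x1000_0000
r = RangeTranslation(base=VBASE,
                     limit=VBASE + 1024 * PAGE_SIZE,
                     offset=PBASE - VBASE,
                     prot="rw")

inside = r.translate(VBASE + 5 * PAGE_SIZE + 12)
outside = r.translate(VBASE + 1024 * PAGE_SIZE)
```

A single comparison against base and limit covers the whole range, regardless of how many pages it spans.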

[Figure: Core, 4KB TLB, 2MB TLB, L2 TLB, L2 range TLB]

Performance improves a lot
  Increases L2 TLB reach
  Eliminates page walks
Energy still high
  The “innocent” L1 TLB hits!
  98% of energy in L1 TLBs
[Figure: Core, 4KB TLB, 2MB TLB, L2 TLB, L2 range TLB]

Our goal: improved performance and reduced energy
[Figure: Performance and Dynamic Energy of the state of the art: 4KB Pages, Huge Pages, RMM]

Key Observation
Can we leverage increased TLB reach to save energy? Yes!
  Naturally take pressure off L1 TLBs
Larger Reach
  Single 2MB entry == 512 x 4KB entries
Why access all L1 TLB entries, esp. for 4KB pages?
  Look up fewer entries
  Reduce dynamic energy
  Similar performance
[Figure: Core, 4KB TLB, 2MB TLB, L2 TLB]
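The reach arithmetic behind this observation can be checked with a few lines. This is a sketch: the helper name tlb_reach is made up here, and the entry counts are the L1 configurations listed in the results slides (64-entry 4KB TLB, 32-entry 2MB TLB).

```python
KB = 1024
MB = 1024 * KB


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every entry maps one page."""
    return entries * page_size


# One 2MB entry covers as much memory as this many 4KB entries.
entries_equiv = (2 * MB) // (4 * KB)

reach_l1_4kb = tlb_reach(64, 4 * KB)  # 64-entry L1 4KB TLB
reach_l1_2mb = tlb_reach(32, 2 * MB)  # 32-entry L1 2MB TLB

print(entries_equiv, reach_l1_4kb // KB, "KB", reach_l1_2mb // MB, "MB")
```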

The Lite mechanism
TLBLite design for huge pages
RMMLite design for range translations
Results

The Lite Mechanism
Goal: Save TLB energy with similar performance
Utility-based monitoring [Dropsho et al. PACT ’02, Qureshi et al. MICRO ’06]
  Distance of TLB hits from MRU position
  Inferring utility of active ways
Way disabling

The Lite Mechanism
L1 TLB distance counters
  Track distance of hits
  Monitor utility
  C0: distance 0
  C1: distance 1
  C2: distance 2-3
Access to page 35: hit in SET 1 with distance == 0
  Counters after the hit: C0 = 1 (C1 and C2 not incremented)
  SET 0 (MRU to LRU): page 10, page 30, page 90, page 50
  SET 1 (MRU to LRU): page 35, page 25, page 55, page 75

The Lite Mechanism
Access to page 30: hit in SET 0 with distance == 1
  Counters after the hit: C0 = 1, C1 = 1 (C2 not incremented)
  SET 0 (MRU to LRU): page 10, page 30, page 90, page 50
  SET 1 (MRU to LRU): page 35, page 25, page 55, page 75
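The two accesses in this walkthrough can be replayed with a small software model. This is a sketch, not the hardware design: the class name, the modulo set-indexing rule, and the omission of miss handling are all assumptions made for illustration.

```python
class DistanceMonitor:
    """Counts TLB hits by their distance from the MRU position (4-way sets)."""

    def __init__(self, sets):
        # Each set is a list of page numbers ordered from MRU to LRU.
        self.sets = [list(s) for s in sets]
        self.counters = [0, 0, 0]  # C0, C1, C2

    def _set_index(self, page):
        # Assumption: set index = page number modulo the number of sets.
        return page % len(self.sets)

    def access(self, page):
        """Look up a page; on a hit, count its distance and promote it to MRU."""
        ways = self.sets[self._set_index(page)]
        if page not in ways:
            return None  # miss: refill and replacement are omitted in this sketch
        distance = ways.index(page)
        # C0 counts distance 0, C1 distance 1, C2 distances 2-3.
        self.counters[min(distance, 2)] += 1
        ways.insert(0, ways.pop(distance))
        return distance


# Set contents as drawn on the slides (MRU first).
monitor = DistanceMonitor([[10, 30, 90, 50], [35, 25, 55, 75]])
first = monitor.access(35)   # hit at the MRU position of SET 1
second = monitor.access(30)  # hit one position away from MRU in SET 0
print(first, second, monitor.counters)
```

The counters are shared across sets, so one small group of counters summarizes the utility of each way position for the whole TLB.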

The Lite Mechanism
After many TLB accesses (interval ends)
  Counters: C0 = 95, C1 = 46, C2 = 2
  Ways 2-3 less useful
  Disable them [Albonesi, MICRO ’99]
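One way the end-of-interval decision could be expressed is sketched below. The slides do not give the decision rule, so the threshold max_lost_fraction and its default value are hypothetical; the sketch only relies on the LRU property that hits at distance N or more would be lost if just N ways stayed active.

```python
def choose_active_ways(c0: int, c1: int, c2: int,
                       max_lost_fraction: float = 0.05) -> int:
    """Pick 4, 2, or 1 active ways from one interval's distance counters.

    max_lost_fraction is a hypothetical tuning knob: the largest share of
    hits we accept turning into misses by disabling ways.
    """
    total = c0 + c1 + c2
    if total == 0:
        return 4  # no hits observed: keep all ways (assumption)
    # With 1 active way, every hit at distance >= 1 would be lost.
    if (c1 + c2) / total <= max_lost_fraction:
        return 1
    # With 2 active ways, every hit at distance 2-3 would be lost.
    if c2 / total <= max_lost_fraction:
        return 2
    return 4


# Counter values from the slide: C0 = 95, C1 = 46, C2 = 2.
ways = choose_active_ways(95, 46, 2)
print(ways)
```

With the slide's counters, only about 1.4% of hits fall in ways 2-3, while roughly a third fall outside the MRU way, so this rule keeps two ways active.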

The Lite Mechanism
Save energy on every TLB lookup
[Figure: counters C0, C1, C2; lookup of page XY in SET 0 and SET 1]

TLBLite design
[Figure: Core, 4KB TLB with Lite, 2MB TLB with Lite, L2 TLB]

RMMLite design
High hit ratio
Fewer L1 TLB misses
Disables more ways
[Figure: Core, 4KB TLB, L1 range TLB, 2MB TLB, Lite, L2 TLB, L2 range TLB]

RMMLite design
L1-range TLB small but efficient (4 entries)
  arbitrarily-large mappings
[Figure: Core, 4KB TLB, L1 range TLB, Lite, L2 TLB, L2 range TLB; Virtual Memory mapped to Physical Memory]

Results

Methodology
Developed MMU Simulator
  Pin, Cacti, and profiling with Linux Pagemap
  Baseline: Intel Sandy Bridge
Dynamic energy & performance models
  For the address translation path
TLB intensive workloads
  Spec2006, BioBench, and Parsec
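The slides do not describe the energy model itself. A common way to build a dynamic energy model for the translation path is to multiply event counts from a simulator by a per-event energy, and the sketch below assumes that linear form. Every number in it is a placeholder in arbitrary units, not a value from the paper or from Cacti.

```python
def translation_energy(event_counts, energy_per_event):
    """Sum of (event count x energy per event) over the translation path."""
    return sum(n * energy_per_event[name] for name, n in event_counts.items())


# Placeholder per-event energies in arbitrary units (NOT measured values).
energy_per_event = {"l1_tlb_lookup": 1.0, "l2_tlb_lookup": 4.0, "page_walk": 50.0}

# Placeholder event counts for one hypothetical run.
event_counts = {"l1_tlb_lookup": 1000, "l2_tlb_lookup": 50, "page_walk": 5}

total = translation_energy(event_counts, energy_per_event)
l1_share = (event_counts["l1_tlb_lookup"]
            * energy_per_event["l1_tlb_lookup"]) / total
print(total, round(l1_share, 3))
```

Under a model of this shape, L1 TLB lookups dominate whenever they vastly outnumber the other events, even if each lookup is cheap.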

Miss Cycles: Cycles spent in address translation
Energy: Dynamic energy spent in address translation
Geometric mean (detailed results in paper)

4KB
[Chart: Miss Cycles and Energy (normalized)]
L1 TLB: 64 entries, 4-way
L2 TLB: 512 entries, 4-way
High performance overhead
  page walks
High energy overheads
  60% in L1 TLB
  35% in page walks

2MB
[Chart: Miss Cycles and Energy]
4KB configuration + L1 2MB TLB (32 entries, 4-way)
Performance improves, but..
Energy increases
  4% on average and up to 43%
  Separate L1 TLBs

TLBLite
[Chart: Miss Cycles and Energy]
2MB configuration + Lite
Similar performance with 2MB
Reduces energy by 23%
  49% of lookups with fewer than 4 ways in L1 4KB TLB

RMM [ISCA ’15]
[Chart: Miss Cycles and Energy]
2MB configuration + L2 range TLB (32 entries, fully assoc.)
Even better performance, but..
Energy is still high
  Similar to 2MB pages: L1 TLBs

RMMLite
[Chart: Miss Cycles and Energy]
RMM configuration + L1 range TLB (4 entries, fully assoc.) + Lite
Better performance vs. RMM
  Fewer L1 TLB misses
Reduces dynamic energy by 71%
  84% of hits from L1 range TLB
  63% of lookups with 1 way in 4KB

Summary
Observation: Increased TLB reach reduces TLB pressure
[Figure: Performance and Dynamic Energy for 4KB pages, Huge pages, RMM, TLBLite, RMMLite]
Dynamic energy reduction: TLBLite 23%, RMMLite 71%

Thank you!

BACKUP SLIDES

Related Work
Optimizing TLBs for energy-efficiency
  Circuit techniques [ISLPED ’97]
  Partitioning TLBs [ISLPED ’03, CASES ’06, ISLPED ’13]
  Filtering TLB requests [ISLPED ’05, TVLSI ’07]
  Dynamically resizing TLB [MICRO ’00]
    Single reference bit per TLB entry
    Targets monolithic TLB
  Selective TLB lookups [MICRO ’02, ISPASS ’04, CODES ’04]
    Compiler support and special registers

Related Work
TLB-Pred [HPCA ’15]
  All page sizes into single set-associative TLB
  See paper for comparison
Virtual caches
  Defer address translation until cache miss
  Increase hardware complexity (synonyms, protection)

Miss Cycles Overhead: TLB Intensive Workloads

Energy Overhead: TLB Intensive Workloads

Miss Cycles and Energy Overheads: Spec2006
TLBLite and RMMLite reduce the dynamic energy by 26% and 72%

Miss Cycles and Energy Overheads: PARSEC
TLBLite and RMMLite reduce the dynamic energy by 20% and 66%

Percentage of L1 TLB lookups with active ways
             TLBLite                      RMMLite
       4-ways  2-ways  1-way       4-ways  2-ways  1-way
4KB    51.2%   32.9%   15.9%       25.9%   10.4%   63.7%
2MB    81.1%    9.0%    9.9%       -----
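The percentages above tie back to the headline numbers in the results slides, and a short calculation makes the link explicit. This is a sketch: the dictionary layout is made up here, and the average number of active ways is a derived illustration, not a figure reported in the slides.

```python
# Share of L1 4KB TLB lookups by number of active ways (percent).
tlblite_4kb = {4: 51.2, 2: 32.9, 1: 15.9}
rmmlite_4kb = {4: 25.9, 2: 10.4, 1: 63.7}


def fewer_than_all_ways(dist, all_ways=4):
    """Percent of lookups performed with at least one way disabled."""
    return sum(pct for ways, pct in dist.items() if ways < all_ways)


def average_active_ways(dist):
    """Lookup-weighted mean number of active ways (derived quantity)."""
    return sum(ways * pct for ways, pct in dist.items()) / 100.0


tlblite_reduced = fewer_than_all_ways(tlblite_4kb)  # about 49% of lookups
tlblite_avg = average_active_ways(tlblite_4kb)
rmmlite_avg = average_active_ways(rmmlite_4kb)
print(round(tlblite_reduced, 1), round(tlblite_avg, 3), round(rmmlite_avg, 3))
```

The first value matches the "49% of lookups with fewer than 4 ways" figure quoted for TLBLite, and the 1-way share for RMMLite matches the quoted 63%.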

Distribution of L1 TLB hits
TLBLite RMMLite 4KB 2MB Range 15.9% 35.6% 84.1%

Comparison with TLB-Pred [HPCA ’15]
Same as 2MB conf.
Perfect prediction
Both L1 and L2 TLB hold 4KB and 2MB pages
Improved performance
Energy reduces vs. 2MB conf.
Still, RMMLite more energy-efficient
Orthogonal to range translations

Impact of Eager Paging
