Energy Efficient D-TLB and Data Cache Using Semantic-Aware Multilateral Partitioning, School of Electrical and Computer Engineering, Georgia Institute of Technology


Energy Efficient D-TLB and Data Cache Using Semantic-Aware Multilateral Partitioning
Hsien-Hsin “Sean” Lee and Chinnakrishnan Ballapuram
School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA
ISLPED 2003

Background
 Address translation and caches are major processor power contributors
 The I-TLB and d-TLB are looked up for every instruction fetch and memory reference
 TLBs are fully associative
 A superscalar processor needs a multi-ported design, which increases power consumption
 Wide-issue machines may issue multiple memory references in the same cycle

Virtual Memory Space Partitioning
 Based on programming language semantics
 Non-overlapped subdivisions
 Split code and data: I-Cache and D-Cache
 Split data into regions:
 Stack (grows downward)
 Heap (grows upward)
 Global (static)
 Read-only (static)
(Slide figure: the ARM architecture memory map from min mem to max mem: code region, read-only region, static global data region, heap growing upward, stack growing downward, and a protected/reserved region at the top.)
 A program's distinct access behavior in each region creates an opportunity to reduce power
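The split above can be sketched as a simple classifier over the virtual address. The boundary constants below are hypothetical stand-ins for the base and bound values a real system would hold in registers; they are not taken from the talk.

```python
# Sketch of semantic region classification for a virtual address.
# All boundary values are hypothetical placeholders; real hardware would
# compare against base/bound registers set up by the OS and loader.

CODE_TOP   = 0x0800_0000  # end of the code (text) region (assumed)
GLOBAL_TOP = 0x1000_0000  # end of static/global data (assumed)
HEAP_TOP   = 0x4000_0000  # current top of the heap, grows upward (assumed)
STACK_BASE = 0x8000_0000  # stack grows downward from here (assumed)

def classify(vaddr: int) -> str:
    """Map a virtual address to its semantic region."""
    if vaddr < CODE_TOP:
        return "code"
    if vaddr < GLOBAL_TOP:
        return "global"
    if vaddr < HEAP_TOP:
        return "heap"
    if vaddr < STACK_BASE:
        return "stack"
    return "protected"

print(classify(0x7fff_0000))  # stack
```

Because the regions are non-overlapping, a handful of magnitude comparisons is enough to pick a partition, which is what makes the hardware routing cheap.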

Outline of the Talk
 Motivation: the distinct access behavior and locality of each region are analyzed for energy reduction
 Semantic-Aware Multilateral Partitioning (SAM)
 Semantic-Aware d-TLB (SAT)
 Semantic-Aware d-Cachelets (SAC)
 Selective multi-porting SAM architecture
 Performance / energy / area evaluation
 Conclusions

Footprint of Stack Page Accesses
 (Plot: the x-axis shows the working set size; the y-axis shows the number of TLB entries required)
 Only two stack pages cover all stack accesses; the stack band is small
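The measurement behind this plot can be sketched as counting distinct pages per region over an address trace. The trace below is synthetic (the study drove its measurements from simulation), but it shows the shape of the result: stack accesses pack into a narrow band of pages.

```python
# Count the distinct 4KB pages touched per semantic region in a trace.
# The trace below is synthetic and purely illustrative.

PAGE_SHIFT = 12  # 4KB pages

def page_footprint(trace):
    """trace: iterable of (region, vaddr) pairs -> {region: distinct page count}."""
    pages = {}
    for region, vaddr in trace:
        pages.setdefault(region, set()).add(vaddr >> PAGE_SHIFT)
    return {region: len(p) for region, p in pages.items()}

# Toy trace: 100 stack accesses packed into a narrow band (two pages),
# 20 heap accesses striding across 20 different pages.
trace  = [("stack", 0x7fff_e000 + i * 0x40) for i in range(100)]
trace += [("heap", 0x1000_0000 + i * 0x5000) for i in range(20)]
print(page_footprint(trace))  # {'stack': 2, 'heap': 20}
```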

Footprint of Global and Heap Page Accesses
 The number of heap pages (y-axis) and the heap working set (x-axis) are larger than those of stack and global
 heap band >> global band > stack band

Compulsory d-TLB Misses
 (Plot: number of compulsory TLB misses for the stack, global, and heap regions across MiBench benchmarks blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia and SPEC2000 benchmarks bzip2, gcc, mcf, parser, with the harmonic mean)
 Highly active heap accesses evict useful stack and global entries through conflict misses

Compulsory d-Cache Misses
 (Plot: number of compulsory cache misses per region)
 Stack and global working sets are smaller than the heap's
 Smaller stack and global caches are enough to capture most memory accesses to these semantic regions

Dynamic Data Memory Distribution
 ~40% of dynamic memory accesses go to the stack, and they are concentrated on only a few pages
 Every 4 memory accesses ≈ 2 stack, 1 global, and 1 heap
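This ~2:1:1 mix is what makes the partitioning pay off: most accesses can be served by small, cheap structures. A back-of-envelope estimate makes the point; the per-access energies below are illustrative assumptions (a small cachelet costs less per access than a large unified cache), not the talk's Wattch-derived figures.

```python
# Back-of-envelope energy estimate under the ~2:1:1 stack/global/heap mix.
# The per-access energies are illustrative assumptions, NOT the talk's
# Wattch-derived numbers.

unified_energy = 1.00  # normalized per-access energy of the 32KB unified cache
cachelet_energy = {"stack": 0.30, "global": 0.30, "heap": 0.55}  # assumed
mix = {"stack": 0.50, "global": 0.25, "heap": 0.25}  # ~2:1:1 access split

# Weight each cachelet's per-access energy by how often it is accessed.
sam_energy = sum(mix[r] * cachelet_energy[r] for r in mix)
saving = 1 - sam_energy / unified_energy
print(f"relative energy {sam_energy:.3f}, saving {saving:.0%}")
```

The exact saving depends entirely on the assumed per-access energies; the structure of the argument (small structures times a skewed access mix) is what carries over from the talk.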

Semantic-Aware Memory Architecture
 A data address router steers each virtual address from the processor, using the ld_data_base_reg, ld_env_base_reg, and ld_data_bound_reg boundary registers, to the sTLB, gTLB, or uTLB and the matching sCache, gCache, or hCache, all backed by a unified L2 cache
 Smaller stack and global TLBs and smaller stack and global caches reduce power consumption
 Most memory references go to the sTLB and sCache
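The router's steering decision can be sketched as follows. The register names come from the slide, but how each register maps to a region boundary, and every address value, are assumptions made for illustration.

```python
# Sketch of the data address router: steer each data reference to the
# sTLB/sCache, gTLB/gCache, or uTLB/hCache pair by comparing the virtual
# address against boundary registers. The register names are from the
# slide; their mapping to boundaries, and all values, are assumptions.

ld_data_base_reg  = 0x1000_0000  # start of static/global data (assumed)
ld_data_bound_reg = 0x4000_0000  # end of global data / start of heap (assumed)
ld_env_base_reg   = 0x7000_0000  # stack region begins above this (assumed)

def route(vaddr: int) -> str:
    """Return the TLB/cachelet pair that services this data reference."""
    if vaddr >= ld_env_base_reg:
        return "sTLB/sCache"  # stack: the most frequent destination
    if ld_data_base_reg <= vaddr < ld_data_bound_reg:
        return "gTLB/gCache"
    return "uTLB/hCache"      # heap and anything else

print(route(0x7fff_0000))  # sTLB/sCache
```

In hardware these are just parallel magnitude comparators on the address, so the routing adds little latency or energy of its own.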

Semantic-Aware TLB Misses (Heap)
 (Plot: number of TLB misses and TLB miss rate vs. number of TLB entries)
 The number of hTLB misses does not come down even at 512 TLB entries

Semantic-Aware TLB Misses (Global)
 (Plot: number of TLB misses and TLB miss rate vs. number of TLB entries)
 The number of gTLB misses saturates at 8 TLB entries

Semantic-Aware TLB Misses (Stack)
 (Plot: number of TLB misses and TLB miss rate vs. number of TLB entries)
 sTLB misses saturate faster than global and heap misses

Semantic-Aware Cache Misses
 (Plot: number of cache misses and cache miss rate vs. cache size in KB)
 The stack shows a much more stable working set size than the other two; global saturates at a reasonable size

Simulation Infrastructure
 Target architecture: ARM
 Performance: SimpleScalar
 Power: integrated Wattch power model
 Access time / area: CACTI 3.0

Execution engine: out-of-order
Fetch / decode / issue / commit width: 4 / 4 / 4 / 4
L1 / L2 / memory latency: 1 / 6 / 150 cycles
TLB hit / miss latency: 1 / 30 cycles
L1 cache baseline: direct-mapped 32KB
L1 stack / global / heap cachelets: 8KB / 8KB / 16KB
L2 cache: 4-way 512KB
Cache line size: 32B

Design Effectiveness of SAM
 (Plot for blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia, bzip2, gcc, mcf, parser, and their average: performance ratio, d-TLB energy with SAT, and L1 d-Cache energy with SAC)
 ~4% performance loss
 ~35% energy savings

Multi-porting Effectiveness of SAM

Multi-porting Access Time / Die Area
(Table: baseline unified cache vs. Semantic-Aware Cachelets (SAC), with total SAC area and area savings; the access time (ns), area (mm2), and percentage figures did not survive transcription)
 32KB unified baseline vs. 8KB sCachelet + 8KB gCachelet + 16KB hCachelet
 64KB unified baseline vs. 16KB sCachelet + 16KB gCachelet + 32KB hCachelet
 R/W ports row: 2 / 2 / 1 / 1
 Area savings come with ~4% performance loss

Conclusions
 Presented the Semantic-Aware Multilateral partitioning technique to reduce d-TLB and data cache energy consumption
 data TLB: 36% energy savings
 data cache: 34% energy savings
 at a 4% performance loss
 Selective multi-porting SAM further reduces energy and area
 data TLB: 47% energy savings
 data cache: 45% energy savings
 at a 4% performance loss


Distribution of Parallel TLB Activity
 (Plot: distribution of the number of parallel TLB accesses per cycle)

Cost-Effective TLB Configuration
(Table: baseline dTLB vs. sTLB / gTLB / hTLB sizes per benchmark, with columns Bf, Bc, Cj, Dj, Dij, Fft, Rij, Pat, Bz, Gc, Par; the entry values did not survive transcription)


Design Effectiveness of SAM (Energy and Speed)
 (Plot: energy and speed for blowfish, bitcount, cjpeg, djpeg, dijkstra, fft, rijndael, patricia, bzip2, gcc, mcf, parser, and the average)