LEMap: Controlling Leakage in Large Chip-multiprocessor Caches via Profile-guided Virtual Address Translation Jugash Chandarlapati Mainak Chaudhuri Indian Institute of Technology, Kanpur


LEMap: Controlling Leakage in Large Chip-multiprocessor Caches via Profile-guided Virtual Address Translation Jugash Chandarlapati Mainak Chaudhuri Indian Institute of Technology, Kanpur

Low Energy Map: Motivation
37% L2 cache energy
10% dead time per page

Motivation
Past work has exploited this dead time at cache block grain
For large last-level caches the book-keeping overhead becomes enormous
Good news: large potential for dead time exploitation at page grain
–By-product: smart involvement of the OS
Design a smart VA-to-PA mapping that clusters virtual pages accessed together in time, so that the average size of an idle region increases

Highlights
Three major contributions
–First proposal to exploit smart virtual address translation schemes for region-based leakage control in large multi-banked shared CMP NUCAs
–A new application-directed page placement system call to realize the leakage-aware translation
–7% total system energy saving, 50% L2 cache energy saving, and 52% L2 cache power saving for an 8-core CMP with a 16 MB shared L2 cache at 65 nm on selected SPLASH-2, SPEC OMP, and DIS applications

LEMap: Basic idea
[Figure: baseline, showing one time window. Eight cores (C0-C7) connect through a crossbar to sixteen L2 subbanks (B0-B15), each with L2 bank control; idle subbanks are put in drowsy mode.]

LEMap: Basic idea
[Figure: LEMap, showing the same time window. With LEMap's page-to-subbank mapping, accesses concentrate on fewer subbanks, leaving more subbanks idle and in drowsy mode.]

LEMap: Basic idea
Map a cluster of virtual pages that are accessed together onto a few subbanks
–Improves effectiveness of the low-power drowsy mode due to a larger number of idle subbanks
–Can power down a subbank after the last access to the cluster of virtual pages mapped onto it
–Takes care of proximity by choosing a subbank for a cluster of virtual pages such that average access latency is minimized (important for NUCAs)
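The proximity point on this slide can be sketched as a simple selection over candidate subbanks. This is an illustrative sketch only: the latency table, function names, and weighting by repeated core ids are assumptions, not the paper's actual placement policy.

```python
# Hypothetical sketch of proximity-aware subbank selection for a page
# cluster in a NUCA cache. All names here are illustrative assumptions.

def pick_subbank(accessing_cores, latency, candidate_subbanks):
    """Return the candidate subbank with the lowest average access
    latency over the cores that touch this cluster's pages.

    accessing_cores: list of core ids (repeated ids weight hot cores higher)
    latency: latency[core][subbank] = NUCA access latency in cycles
    """
    def avg_latency(sb):
        return sum(latency[c][sb] for c in accessing_cores) / len(accessing_cores)
    return min(candidate_subbanks, key=avg_latency)
```

Weighting by access counts (rather than just the set of cores) biases the choice toward the cores that access the cluster most often, which matches the goal of minimizing average, not worst-case, latency.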

Implementing LEMap
Collect a (core id, virtual page id, timestamp) tuple for each L2 cache access via a profile run
Cluster the virtual pages by timestamp using a hierarchical agglomerative clustering algorithm
–Grow the birth and death times of a cluster gradually until the cluster size exceeds the subbank size
Map each cluster onto a physical subbank via an application-directed page placement system call that takes a vector of VPNs
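The offline clustering step above can be sketched as follows, assuming 4 KB pages and the 128 KB subbank grain used in the evaluation. The greedy birth-time merge below is a simplification of the hierarchical agglomerative clustering the talk describes; all names are illustrative, not the paper's implementation.

```python
# Minimal sketch of LEMap-style offline page clustering. Page and
# subbank sizes are assumptions (4 KB pages, 128 KB leakage grain).
PAGE_SIZE = 4 * 1024
SUBBANK_SIZE = 128 * 1024
MAX_PAGES = SUBBANK_SIZE // PAGE_SIZE  # 32 pages fit in one subbank

def cluster_pages(profile):
    """profile: iterable of (core_id, vpn, timestamp) tuples from a
    profile run. Returns a list of page clusters, each small enough to
    map onto one physical subbank."""
    # Per-page lifetime: birth (first access) and death (last access).
    life = {}
    for _core, vpn, t in profile:
        birth, death = life.get(vpn, (t, t))
        life[vpn] = (min(birth, t), max(death, t))
    # Merge pages in birth-time order, growing the current cluster's
    # time window until adding a page would exceed the subbank size.
    clusters, current = [], []
    for vpn, _lifetime in sorted(life.items(), key=lambda kv: kv[1][0]):
        if len(current) == MAX_PAGES:
            clusters.append(current)
            current = []
        current.append(vpn)
    if current:
        clusters.append(current)
    return clusters
```

Each resulting cluster of VPNs would then be handed to the page placement system call, so all pages with overlapping lifetimes land in the same subbank and that subbank can be powered down after the cluster's last access.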

Simulation results
Done on an 8-core CMP with a 16 MB shared L2 cache, with leakage controlled at 128 KB subbank grain (16 banks)
Models dynamic and leakage (gate and subthreshold) power of all on-chip components at 65 nm, including the memory controller (leakage model extracted from HSpice simulations)
Models DRAM dynamic power following Micron technical notes
Executes eight explicitly parallel shared-memory applications drawn from the SPLASH-2, SPEC OMP, and DIS suites

Simulation results: L2 cache power
52% L2 cache power saving

Simulation results: L2 cache energy
50% L2 cache energy saving

Simulation results: Total energy
7% total energy saving

Simulation results: Execution time
3% loss on average

Summary
A novel virtual-to-physical address translation mechanism to control leakage in large shared CMP caches
Uses profile information to optimize virtual page to physical subbank placement in parallel programs
Controls leakage at subbank grain to reduce the timekeeping overhead of drowsy mode
Achieves 50% L2 cache energy saving and 7% total energy saving on eight benchmark programs compared to drowsy

Acknowledgment
Intel: graduate fellowship
IBM: faculty award

LEMap: Controlling Leakage in Large Chip-multiprocessor Caches via Profile-guided Virtual Address Translation Jugash Chandarlapati Mainak Chaudhuri Indian Institute of Technology, Kanpur THANK YOU!