Optimizing Communication and Capacity in 3D Stacked Cache Hierarchies
Presenter: Aniruddha Udipi
N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S. Makineni and D. Newell
University of Utah and Intel STL

Motivation
– Many-core designs require large cache capacity for performance
– SRAM has low latency and consumes less power
– DRAM has 8X the density, but poor latency/power characteristics
– Can we design a hybrid SRAM-DRAM cache that takes advantage of both technologies?
– Can we build a customized on-chip network specifically targeted at such a design?

Proposal - 3D Stacked Hybrid Cache
(Figure: a DRAM bank stacked vertically on an SRAM bank)
– Not an option in a conventional 2D design
– 3D mixed-process stacking enables a single vertical SRAM/DRAM bank

Executive Summary
3D stacked hybrid cache design, with synergistic proposals to improve performance and power efficiency:
– Optimizing capacity: a reconfigurable cache hierarchy
– Optimizing communication: page coloring for effective data placement (reduced communication) and a tailor-made on-chip interconnection network (quicker communication)
Up to 62% performance increase

Outline
– Overview of 3D Technology
– Technique I - Reconfigurable Cache Hierarchy
– Technique II - Page Coloring
– Technique III - On-chip Interconnection Network
– Evaluation
– Conclusions

3D Technology
+ Mixed-process integration possible
+ High-speed vertical interconnects
- Thermal issues
(Figure: cross-section of a two-die stack showing through-silicon vias (TSVs), die-to-die vias, bulk and active silicon for each die, metal layers, I/O bumps, and the heat sink. Source: Black et al., MICRO 06)

Baseline Model
– Lower die: 16 processing cores
– Upper die: 16 SRAM banks with a grid-based on-chip network

Outline
– Overview of 3D Technology
– Technique I - Reconfigurable Cache Hierarchy
– Technique II - Page Coloring
– Technique III - On-chip Interconnection Network
– Evaluation
– Conclusions

Technique I - Reconfigurable Hierarchy
Increase capacity by stacking a DRAM bank on each SRAM cache bank, and reconfigure the bank size based on demand. This is more compelling with 3D and NUCA:
– Spare capacity on die 3 does not intrude on the layout of the second die or steal capacity from neighboring caches
– The cache is already partitioned into NUCA banks, so additional banks do not complicate the logic much
– Access time grows less than linearly with capacity
– Dramatic increase in capacity, with no gradation: only two choices
– The DRAM can be turned off for small working-set sizes

Proposed Reconfigurable Cache Model
– Die 1: 16 cores
– Die 2: 16 SRAM banks and the tree interconnect
– Die 3: 16 DRAM banks, no interconnect
– An inter-die via pillar sends requests from each core to the L2 SRAM (one pillar per core, not shown)
– An inter-die via pillar accesses the portion of L2 in DRAM (one pillar per sector, not shown)

University of Utah 11 Simple heuristic for enabling/disabling DRAM bank: Every Reconfiguration Interval, –If usage is low and cache-bank miss-rate is low disable DRAM bank above –If usage is high and cache-bank miss-rate is high enable DRAM bank above Reconfiguration interval is every 10 million cycles All cores are stalled for 100K cycles during reconfiguration Proposed Reconfiguration Policy

Cache Organization
(Figure: an SRAM bank with tag, data, and adaptive arrays, stacked below a DRAM bank. The adaptive arrays become tag arrays for the ways held in DRAM; as access pressure goes from low to high, ways are added and total capacity grows from 1 MB to 9 MB.)

Cache Organization
– SRAM banks have three memory arrays: a tag array, a data array, and an adaptive array (which can act as both tag and data)
– Whenever DRAM banks are switched on, their tags are implemented in part of the SRAM, enabling quick tag lookup
– The increased capacity manifests as additional ways, so cache lines in SRAM need not be flushed on reconfiguration
– Two ways of data remain available at low latency; moving MRU data to these ways further increases efficiency
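A minimal Python model of a bank lookup under this organization; the structure and field names are illustrative, not from the paper. The key point it shows: tags for the DRAM ways live in the SRAM adaptive array, so one tag probe (parallel in hardware) decides the hit location before any slow DRAM data access is issued:

```python
class TagArray:
    """Set-indexed tag store; match() returns the hitting way or None."""
    def __init__(self, sets: int, ways: int):
        self.tags = [[None] * ways for _ in range(sets)]
    def match(self, set_idx: int, tag: int):
        ways = self.tags[set_idx]
        return ways.index(tag) if tag in ways else None

def lookup(bank, set_idx: int, tag: int):
    sram_way = bank.sram_tags.match(set_idx, tag)        # SRAM-resident ways
    dram_way = (bank.adaptive_tags.match(set_idx, tag)   # DRAM-way tags, held in SRAM
                if bank.dram_enabled else None)
    if sram_way is not None:
        return bank.sram_data[set_idx][sram_way]         # fast SRAM data access
    if dram_way is not None:
        return bank.dram_data[set_idx][dram_way]         # slower DRAM data access
    return None                                          # bank miss
```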

Why is this better than an L2/L3 hierarchy?
– An L2/L3 organization pays an additional access penalty on an L2 miss before the L3 is accessed to service the request; in our scheme, all tags are looked up in parallel in the SRAM
– An additional level implies additional coherence complexity
– Our experiments show a non-trivial performance degradation when implementing SRAM/DRAM as L2/L3 compared to our scheme

Outline
– Overview of 3D Technology
– Technique I - Reconfigurable Cache Hierarchy
– Technique II - Page Coloring
– Technique III - On-chip Interconnection Network
– Evaluation
– Conclusions

Technique II - Page Coloring
The OS controls which physical page number is assigned to each virtual page, and thereby controls the cache index; this can be manipulated to redirect cache-line placement.
– Cache view of the address: Tag | Index | Offset
– Physical address: Physical Page Number | Page Offset
– The page color is the portion of the index bits that falls within the physical page number, i.e., the bits the OS controls
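A small sketch of how the color falls out of the address bits, assuming the 4 KB pages from the methodology slide plus an illustrative 64 B line size and set count; the OS steers a page to a color (and hence a bank) by choosing a physical frame whose frame-number bits match:

```python
PAGE_OFFSET_BITS = 12   # 4 KB pages (from the methodology slide)
LINE_OFFSET_BITS = 6    # 64 B cache lines (assumed)
SET_INDEX_BITS = 10     # 1024 sets per bank (assumed)

# Color bits = the set-index bits that lie above the page offset,
# i.e., the index bits the OS controls through frame assignment.
COLOR_BITS = LINE_OFFSET_BITS + SET_INDEX_BITS - PAGE_OFFSET_BITS  # 4 bits -> 16 colors

def page_color(physical_addr: int) -> int:
    return (physical_addr >> PAGE_OFFSET_BITS) & ((1 << COLOR_BITS) - 1)

# Two frames that differ only in the color bits index different banks.
assert page_color(0x3000) != page_color(0x4000)
```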

Page Coloring
– Page coloring is employed to map data to banks based on proximity to cores; we assume an offline oracle page-coloring implementation
– Policies depend on two criteria: whether a page is private or shared, and whether a page holds data or code
– Shared data places more capacity pressure on the banks that hold it

Proposed Page Coloring Schemes
– Share4:D+I: shared data and code mapped to the central 4 banks
– Rp:I+Share4:D: shared data mapped to the central 4 banks; code replicated in each core's local bank
– Share16:D+I: shared data and code distributed across all 16 banks
In all schemes, private pages are colored to the bank closest to their core. (A sketch of these mappings follows.)
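A hedged sketch of how the three policies might map a page to one of the 16 banks. The central-bank set {5, 6, 9, 10} comes from the evaluation slides later in the deck; the modulo placement and the Page fields are illustrative assumptions:

```python
from dataclasses import dataclass

CENTER_BANKS = (5, 6, 9, 10)   # central banks of the 4x4 array (see evaluation)

@dataclass
class Page:
    color: int       # page color chosen by the OS
    shared: bool     # private vs. shared (criterion 1)
    is_code: bool    # code vs. data (criterion 2)

def choose_bank(page: Page, core_id: int, policy: str) -> int:
    if not page.shared:
        return core_id                         # private pages stay in the local bank
    if policy == "Share4:D+I":                 # shared data & code to central 4 banks
        return CENTER_BANKS[page.color % 4]
    if policy == "Rp:I+Share4:D":              # code replicated locally, data central
        return core_id if page.is_code else CENTER_BANKS[page.color % 4]
    if policy == "Share16:D+I":                # shared data & code across all 16 banks
        return page.color % 16
    raise ValueError(policy)
```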

Outline
– Overview of 3D Technology
– Technique I - Reconfigurable Cache Hierarchy
– Technique II - Page Coloring
– Technique III - On-chip Interconnection Network
– Evaluation
– Conclusions

Technique III - Interconnection Network
(Figure: tree network overlaid on the 4x4 bank array, showing links and routers; several routers are saved relative to the baseline grid.)

On-chip Tree Network
– The traffic pattern is predictable: data moves between the core and either the shared central banks or its private local bank
– Decreased router overhead
– Saves energy and time
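A back-of-the-envelope illustration of the router-overhead argument, comparing router traversals on a 4x4 mesh (standard dimension-ordered routing) against a two-level tree that funnels the 16 banks toward the central banks. The tree topology here is an assumption for illustration, not the exact one in the paper:

```python
def mesh_routers(src, dst):
    # One router is traversed at every tile on an X-then-Y mesh path.
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1]) + 1

def tree_routers(levels=2):
    # A request climbs one router per tree level, regardless of position.
    return levels

# Corner core (3,3) to central bank (1,1):
print(mesh_routers((3, 3), (1, 1)))  # 5 routers on the mesh path
print(tree_routers())                # 2 routers on the assumed tree
```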

Synergy Between Proposals
– Page coloring makes the cache S-NUCA (no search needed) and produces a radiating traffic pattern
– The tree network is tailored to exactly that radiating traffic pattern
– The hybrid 3D cache increases bank capacity at low latency, so data need not spill into neighboring banks

Outline
– Overview of 3D Technology
– Technique I - Reconfigurable Cache Hierarchy
– Technique II - Page Coloring
– Technique III - On-chip Interconnection Network
– Evaluation
– Conclusions

Methodology
– Intel ManySim trace-based simulator
– CACTI cache model for area, power, and access latencies
– HotSpot 4.0 for thermal evaluation
– 16 cores, 32nm process, 4 GHz clock
– 4 KB page granularity
– 1 MB SRAM banks and 8 MB DRAM banks
– SAP, SPECjbb, TPC-C, and TPC-E commercial multi-threaded workload traces

Workload Characterization
– The working-set size of code pages is only 0.6% of that of data pages
– On average, code pages receive 57% of accesses

Page Coloring Evaluation
– Capacity constraints favor distributing shared pages
– Code replication is favorable when capacity is available

Interconnect Evaluation
– Network power savings of up to 48%
– Most accesses are local, thanks to code replication
– The remaining accesses are spread roughly randomly across banks

Hybrid Cache Evaluation
(Figure: four configurations compared: Base-No-PC (cores + SRAM L2), Base-2x-No-PC (cores + doubled SRAM L2), Base-3-level (cores + SRAM L2 + DRAM L3), and the proposed chip (cores + hybrid SRAM/DRAM L2))
– The reconfigurable cache (with code replication) performs 55% better than the first baseline
– Reconfiguration incurs a ~5% IPC drop in exchange for its power savings

SRAM-DRAM Hits without Reconfiguration
Most accesses are to SRAM ways, except in the shared banks (5, 6, 9, 10).

SRAM-DRAM Hits with Reconfiguration

Reconfiguration Policy
– Shared banks have their DRAM always enabled
– For SPECjbb, DRAM is always enabled because the majority of pages are private

Related Work
– Reconfigurable caches in 2D: Ranganathan et al. (ISCA 00), Balasubramonian et al. (MICRO 00), Zhang et al. (ISCA 03)
– 3D cache hierarchies: Liu et al. (IEEE D&T 05), Loi et al. (DAC 06), Kgil et al. (ASPLOS 06), Loh (ISCA 08)
– Page coloring for NUCA: Cho et al. (MICRO 06), Awasthi et al. (HPCA 09), Chaudhuri (HPCA 09)
– 3D NUCA interconnects: Li et al. (ISCA 06)
Ours is the first paper to propose a hybrid SRAM/DRAM cache, a targeted tree network, and the combination of all of these into a 3D hierarchy.

Key Contributions
A synergistic, communication- and capacity-optimized 3D cache design:
– A reconfigurable cache that improves performance while reducing power
– OS-based page coloring for reduced communication
– A tailor-made on-chip network for quicker communication
Significant increase in efficiency:
– Performance improvement of up to 62%
– Network power savings of up to 48%
– Typical thermal effect of +7 °C

Thank you. Questions?