Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores — Aniruddha N. Udipi, Naveen Muralimanohar*, Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, Norm Jouppi*

Presentation transcript:

Rethinking DRAM Design and Organization for Energy-Constrained Multi-Cores
Aniruddha N. Udipi, Naveen Muralimanohar*, Niladrish Chatterjee, Rajeev Balasubramonian, Al Davis, Norm Jouppi*
University of Utah and *HP Labs

Why a complete DRAM redesign?
DRAM has been optimized for high density, low cost-per-bit, and energy efficiency
[Figure: DRAM cost-per-bit over time; JEDEC SDRAM standard, June 1994]
Time for DRAM's own "right-hand turn": rethink design for modern constraints

Memory Trends
Energy
– Large-scale systems attribute 25-40% of total power to the memory subsystem
– Capital acquisition costs = operating costs over 3 years
– Energy is a first-order design constraint
Access patterns
– Increasing socket, core, and thread counts
– The final memory request stream is extremely random
– Cannot design for locality

Memory Trends [figure slides]

Memory Trends (contd.)
Access patterns
– What is the exact degree of overfetch?
DRAM Reliability
– Critical applications require chipkill-level reliability
– Building fault tolerance out of unreliable components is expensive
– Schroeder et al., SIGMETRICS 2009

Related Work
Overfetch – Ahn et al. (SC '09), Ware et al. (ICCD '06), Sudan et al. (ASPLOS '10)
DRAM low-power modes – Hur et al. (HPCA '08), Fan et al. (ISLPED '01), Pandey et al. (HPCA '06)
DRAM redesign – Loh (ISCA '08), Beamer et al. (ISCA '10)
Chipkill mechanisms – Yoon and Erez (ASPLOS '10)

Executive Summary
Rethink DRAM design for modern constraints
– Low locality, reduced energy consumption, optimized TCO
Selective Bitline Activation (SBA)
– Minimal design changes
– Considerable dynamic energy reductions for small latency and area penalties
Single Subarray Access (SSA)
– Significant changes to the memory interface
– Large dynamic and static energy savings
Chipkill-level reliability
– Reduced energy and storage overheads for reliability

Outline
DRAM systems overview
Selective Bitline Activation (SBA)
Single Subarray Access (SSA)
Chipkill-level reliability
Conclusion

Basic Organization
[Figure: on-chip memory controller connected over a memory bus (channel) to a DIMM; each rank is a group of DRAM chips (devices); each chip contains banks of arrays, supplies 1/8th of the row buffer, and outputs one word of data]

Basic DRAM Operation
[Figure: RAS brings a row into the row buffer; CAS selects the cache line within it; one bank shown in each DRAM chip]

Outline
DRAM systems overview
Selective Bitline Activation (SBA)
Single Subarray Access (SSA)
Chipkill-level reliability
Conclusion

Selective Bitline Activation
Activate only the bitlines corresponding to the requested cache line – reduces dynamic energy
– Some area overhead, depending on access granularity
– We pick a 16-cache-line activation granularity for a 12.5% area overhead
Requires no changes to the interface, and only minimal control changes
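As a rough illustration of where SBA's dynamic-energy savings come from, the sketch below counts bitlines driven per access; the 8 KB row size is an assumption for illustration, not a figure from the talk.

```python
# Sketch of SBA's dynamic-energy effect: activate 16 cache lines' worth of
# bitlines instead of the whole row. ROW_BYTES is an assumed page size.

CACHE_LINE_BYTES = 64
ROW_BYTES = 8 * 1024            # assumed DRAM row (page) size
SBA_GRANULARITY_LINES = 16      # the talk's choice: 16 cache lines

def bitlines_activated(lines_fetched: int) -> int:
    """Bitlines driven when `lines_fetched` cache lines are activated."""
    return lines_fetched * CACHE_LINE_BYTES * 8

baseline = bitlines_activated(ROW_BYTES // CACHE_LINE_BYTES)  # entire row
sba = bitlines_activated(SBA_GRANULARITY_LINES)               # 16 lines only

print("baseline bitlines:", baseline)            # 65536
print("SBA bitlines:     ", sba)                 # 8192
print("reduction:         %dx" % (baseline // sba))  # 8x, under these assumptions
```

Under these assumed parameters the activation footprint shrinks 8x; the actual energy reduction also depends on peripheral circuitry, which SBA does not change.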

Outline
DRAM systems overview
Selective Bitline Activation (SBA)
Single Subarray Access (SSA)
Chipkill-level reliability
Conclusion

Key Idea
Incandescent light bulb: low purchase cost ($), high operating cost (W) – commodity
Energy-efficient light bulb: higher purchase cost, much lower operating cost – value addition
It's worth a small increase in capital costs to gain large reductions in operating costs
And not 10X – just 15-20%!

Wishlist of features
Eliminate overfetch
– Disregard locality
Increase opportunities for power-down
Increase parallelism
Enable efficient reliability mechanisms

SSA Architecture
[Figure: memory controller connected to a DIMM over an addr/cmd bus and a data bus split into 8-bit per-chip links; each DRAM chip contains banks of subarrays, each with its own bitlines and row buffer, plus a global interconnect to I/O; an entire 64-byte cache line is served by one subarray in one chip]

SSA Basics
The entire DRAM chip is divided into small subarrays
The width of each subarray is exactly one cache line
An entire cache line is fetched from a single subarray in a single DRAM chip – SSA
Groups of subarrays are combined into "banks" to keep peripheral circuit overheads low
Close-page policy and "posted-RAS", similar to SBA
The data bus to the processor is essentially split into 8 narrow buses
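The bullets above can be made concrete with an address-mapping sketch: in SSA each cache line lives entirely in one subarray of one chip. The field widths below (8 chips, 4 banks, 128 subarrays per bank) are illustrative assumptions, not the paper's exact layout.

```python
# Illustrative SSA-style address mapping: a physical address selects one
# (chip, bank, subarray, row); the whole line comes from that one subarray.
# Field widths are assumptions for illustration.

LINE_BITS = 6       # 64-byte cache-line offset
CHIP_BITS = 3       # 8 DRAM chips per DIMM
BANK_BITS = 2       # 4 banks per chip (assumed)
SUBARRAY_BITS = 7   # 128 subarrays per bank, each one cache line wide (assumed)

def ssa_decode(addr: int):
    """Map a physical address to (chip, bank, subarray, row)."""
    addr >>= LINE_BITS                        # strip the line offset
    chip = addr & ((1 << CHIP_BITS) - 1)      # consecutive lines hit different chips
    addr >>= CHIP_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    subarray = addr & ((1 << SUBARRAY_BITS) - 1)
    addr >>= SUBARRAY_BITS
    return chip, bank, subarray, addr         # remaining bits select the row

print(ssa_decode(0x0000))   # → (0, 0, 0, 0)
print(ssa_decode(0x0040))   # → (1, 0, 0, 0): the next line goes to a different chip
```

Interleaving consecutive lines across chips lets independent requests proceed in parallel on the 8 narrow buses while idle chips stay powered down.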

SSA Operation
[Figure: the address is routed to a single subarray in a single DRAM chip, which supplies the entire cache line; subarrays in the other chips remain in sleep mode or serve other parallel accesses]

SSA Impact
Energy reduction
– Dynamic: fewer bitlines activated
– Static: smaller activation footprint – more and longer spells of inactivity – better power-down
Latency impact
– Limited pins per cache line – serialization latency
– Higher bank-level parallelism – shorter queuing delays
Area increase
– More peripheral circuitry and I/O at finer granularities – area overhead (< 5%)

Methodology
Simics-based simulator ('ooo-micro-arch' and 'trans-staller')
FCFS/FR-FCFS scheduling policies
Address mapping and DRAM models from DRAMSim
DRAM data from Micron datasheets
Area/energy numbers from a heavily modified CACTI 6.5
PARSEC/NAS/STREAM benchmarks
8 single-threaded OOO cores, 32 KB L1, 2 MB L2
2 GHz processor, 400 MHz DRAM

Dynamic Energy Reduction
Moving to a close-page policy alone: 73% energy increase on average
Compared to open-page: 3X reduction with SBA, 6.4X with SSA

Contributors to energy consumption
[Figure: energy breakdown by cache lines activated per access – an entire row's worth in the baseline, 16 cache lines in SBA, 1 cache line in SSA]

Static Energy – Power-Down Modes
Current DRAM chips already support several low-power modes
Consider the low-overhead power-down mode: 5.5X lower energy, 3-cycle wakeup time
For a constant 5% latency increase:
– 17% low-power operation in the baseline
– 80% low-power operation in SSA
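Plugging the slide's numbers into a simple weighted average shows roughly how much SSA's extra power-down time helps; the normalized active static power of 1.0 is an assumption for illustration.

```python
# Weighted-average static power using the slide's figures: the power-down
# mode draws 5.5x less static power; the baseline spends 17% of time
# powered down versus 80% for SSA (at the same 5% latency budget).

ACTIVE_POWER = 1.0              # normalized; assumption for illustration
POWERED_DOWN = ACTIVE_POWER / 5.5

def avg_static_power(low_power_fraction: float) -> float:
    """Time-weighted average of active and powered-down static power."""
    return (1 - low_power_fraction) * ACTIVE_POWER + low_power_fraction * POWERED_DOWN

base = avg_static_power(0.17)
ssa = avg_static_power(0.80)
print("baseline: %.3f" % base)
print("SSA:      %.3f" % ssa)
print("static-power ratio: %.2fx" % (base / ssa))
```

Under this simple model, SSA's larger power-down fraction cuts average static power by roughly 2.5x; the real saving depends on the actual mode currents.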

Latency Characteristics
Impact of open/close-page policy: 17% decrease (10 of 12 benchmarks) or 28% increase (2 of 12)
Posted-RAS adds about 10%
Serialization/queuing-delay balance in SSA: 30% decrease (6 of 12) or 40% increase (6 of 12)

Contributors to Latency [figure]

Outline
DRAM systems overview
Selective Bitline Activation (SBA)
Single Subarray Access (SSA)
Chipkill-level reliability
Conclusion

DRAM Reliability
Many server applications require chipkill-level reliability – tolerating the failure of an entire DRAM chip
One example from existing systems:
– A 64-bit word requires 8 bits of ECC
– Each of these 72 bits must be read from a different chip; otherwise a chip failure causes a multi-bit error in the 72-bit field – unrecoverable!
– Reading 72 chips – significant overfetch!
Chipkill is even more of a concern for SSA, since the entire cache line comes from a single chip

Proposed Solution
Approach similar to RAID-5
[Figure: cache lines L0-L63, each with a local checksum C appended, striped across the eight data chips of the DIMM; global parity blocks P0-P7 are rotated across a ninth chip position, RAID-5 style. L – cache line, C – local checksum, P – global parity]

Chipkill Design
Two-tier error protection
Tier-1 protection – self-contained error detection
– 8-bit checksum per cache line – 1.625% storage overhead
– Every cache-line read is now slightly longer
Tier-2 protection – global error correction
– RAID-like striped parity across 8+1 chips
– 12.5% storage overhead
Error-free access (common case)
– 1-chip reads
– 2-chip writes – leads to some bank contention
– 12% IPC degradation
Erroneous access
– 9-chip operation
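The two tiers can be sketched as follows; this is an illustrative toy, not the paper's exact implementation (the additive checksum and 8-byte chip blocks are assumptions for brevity).

```python
# Toy sketch of two-tier chipkill: a local per-line checksum detects errors
# (tier 1); RAID-5-style XOR parity across the chips corrects them (tier 2).

def checksum(data: bytes) -> int:
    """Tier 1: toy 8-bit checksum stored alongside each cache line."""
    return sum(data) & 0xFF

def make_parity(blocks):
    """Tier 2: XOR parity across the data chips' blocks (RAID-5 style)."""
    parity = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            parity[i] ^= b
    return bytes(parity)

def reconstruct(surviving, parity):
    """Rebuild a failed chip's block from the surviving blocks plus parity."""
    return make_parity(list(surviving) + [parity])

blocks = [bytes([i * 17] * 8) for i in range(8)]  # blocks from 8 data chips
parity = make_parity(blocks)

# Chipkill scenario: chip 3 fails. A checksum mismatch flags the error
# (tier 1); the lost block is rebuilt from the 7 survivors plus parity (tier 2).
failed = 3
rebuilt = reconstruct(blocks[:failed] + blocks[failed + 1:], parity)
assert rebuilt == blocks[failed]
print("rebuilt block matches original")
```

This also shows why common-case reads touch 1 chip and writes touch 2 (data plus the parity chip), while only an erroneous access needs all 9.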

Outline
DRAM systems overview
Selective Bitline Activation (SBA)
Single Subarray Access (SSA)
Chipkill-level reliability
Conclusion

Key Contributions
Redesign of the DRAM microarchitecture
Substantial chip access-energy savings (up to 6X)
Overall, performance is a wash
Minor area impact (12% with SBA, 4.5% with SSA)
Two-tier chipkill-level reliability with minimal energy and storage overheads

Now is the time for new architectures
Take modern constraints into account
– Energy is far more critical today than before
– Cost-per-bit is perhaps less important – optimize TCO
– Operating costs over 3 years = capital acquisition costs
– Memory reliability is important for many server applications
The memory system's "right-hand turn" is long overdue