Understanding Latency Variation in Modern DRAM Chips Experimental Characterization, Analysis, and Optimization Kevin Chang Abhijith Kashyap, Hasan Hassan,

Slides:



Advertisements
Similar presentations
DRAM background Fully-Buffered DIMM Memory Architectures: Understanding Mechanisms, Overheads and Scaling, Garnesh, HPCA'07 CS 8501, Mario D. Marino, 02/08.
Advertisements

Improving DRAM Performance by Parallelizing Refreshes with Accesses
Outline Memory characteristics SRAM Content-addressable memory details DRAM © Derek Chiou & Mattan Erez 1.
Thank you for your introduction.
A Case for Refresh Pausing in DRAM Memory Systems
4/17/20151 Improving Memory Bank-Level Parallelism in the Presence of Prefetching Chang Joo Lee Veynu Narasiman Onur Mutlu* Yale N. Patt Electrical and.
A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu.
Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture
Data Mapping for Higher Performance and Energy Efficiency in Multi-Level Phase Change Memory HanBin Yoon*, Naveen Muralimanohar ǂ, Justin Meza*, Onur Mutlu*,
Reducing Read Latency of Phase Change Memory via Early Read and Turbo Read Feb 9 th 2015 HPCA-21 San Francisco, USA Prashant Nair - Georgia Tech Chiachen.
Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories HanBin Yoon, Justin Meza, Naveen Muralimanohar*, Onur Mutlu,
Parallel Application Memory Scheduling Eiman Ebrahimi * Rustam Miftakhutdinov *, Chris Fallin ‡ Chang Joo Lee * +, Jose Joao * Onur Mutlu ‡, Yale N. Patt.
Justin Meza Qiang Wu Sanjeev Kumar Onur Mutlu Revisiting Memory Errors in Large-Scale Production Data Centers Analysis and Modeling of New Trends from.
1 Tiered-Latency DRAM: A Low Latency and A Low Cost DRAM Architecture Donghyuk Lee, Yoongu Kim, Vivek Seshadri, Jamie Liu, Lavanya Subramanian, Onur Mutlu.
Page Overlays An Enhanced Virtual Memory Framework to Enable Fine-grained Memory Management Vivek Seshadri Gennady Pekhimenko, Olatunji Ruwase, Onur Mutlu,
Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.
Optimizing DRAM Timing for the Common-Case Donghyuk Lee Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, Onur Mutlu Adaptive-Latency.
A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu.
Achieving High Performance and Fairness at Low Cost Lavanya Subramanian, Donghyuk Lee, Vivek Seshadri, Harsha Rastogi, Onur Mutlu 1 The Blacklisting Memory.
Simultaneous Multi-Layer Access Improving 3D-Stacked Memory Bandwidth at Low Cost Donghyuk Lee, Saugata Ghose, Gennady Pekhimenko, Samira Khan, Onur Mutlu.
Optimizing DRAM Timing for the Common-Case Donghyuk Lee Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, Onur Mutlu Adaptive-Latency.
Carnegie Mellon University, *Seagate Technology
Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Lavanya Subramanian 1.
Improving Multi-Core Performance Using Mixed-Cell Cache Architecture
UH-MEM: Utility-Based Hybrid Memory Management
Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance
Reducing DRAM Latency at Low Cost by Exploiting Heterogeneity
Aya Fukami, Saugata Ghose, Yixin Luo, Yu Cai, Onur Mutlu
Dynamic Random Access Memory (DRAM) Basic
A Requests Bundling DRAM Controller for Mixed-Criticality System
Low-Cost Inter-Linked Subarrays (LISA) Enabling Fast Inter-Subarray Data Movement in DRAM Kevin Chang Prashant Nair, Donghyuk Lee, Saugata Ghose, Moinuddin.
RAM Chapter 5.
Ambit In-Memory Accelerator for Bulk Bitwise Operations Using Commodity DRAM Technology Vivek Seshadri Donghyuk Lee, Thomas Mullins, Hasan Hassan, Amirali.
RFVP: Rollback-Free Value Prediction with Safe to Approximate Loads
Bank-aware Dynamic Cache Partitioning for Multicore Architectures
Thesis Oral Kevin Chang
ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality
Prof. Gennady Pekhimenko University of Toronto Fall 2017
Application Slowdown Model
What Your DRAM Power Models Are Not Telling You: Lessons from a Detailed Experimental Study Saugata Ghose, A. Giray Yağlıkçı, Raghav Gupta, Donghyuk Lee,
Eiman Ebrahimi, Kevin Hsieh, Phillip B. Gibbons, Onur Mutlu
Hasan Hassan, Nandita Vijaykumar, Samira Khan,
Vulnerabilities in MLC NAND Flash Memory Programming: Experimental Analysis, Exploits, and Mitigation Techniques Yu Cai, Saugata Ghose, Yixin Luo, Ken.
Row Buffer Locality Aware Caching Policies for Hybrid Memories
Jeremie S. Kim Minesh Patel Hasan Hassan Onur Mutlu
Exploiting Inter-Warp Heterogeneity to Improve GPGPU Performance
Achieving High Performance and Fairness at Low Cost
Yixin Luo Saugata Ghose Yu Cai Erich F. Haratsch Onur Mutlu
Session 1A at am MEMCON Detecting and Mitigating
MEMCON: Detecting and Mitigating Data-Dependent Failures by Exploiting Current Memory Content Samira Khan, Chris Wilkerson, Zhe Wang, Alaa Alameldeen,
DRAM SCALING CHALLENGE
CARP: Compression-Aware Replacement Policies
Yaohua Wang, Arash Tavakkol, Lois Orosa, Saugata Ghose,
Reducing DRAM Latency via
Yaohua Wang, Arash Tavakkol, Lois Orosa, Saugata Ghose,
Jeremie S. Kim Minesh Patel Hasan Hassan Onur Mutlu
Samira Khan University of Virginia Nov 14, 2018
15-740/ Computer Architecture Lecture 19: Main Memory
DRAM Hwansoo Han.
CROW A Low-Cost Substrate for Improving DRAM Performance, Energy Efficiency, and Reliability Hi, my name is Hasan. Today, I will be presenting CROW, which.
Minesh Patel Jeremie S. Kim
Demystifying Complex Workload–DRAM Interactions: An Experimental Study
Computer Architecture Lecture 10b: Memory Latency
Rajeev Balasubramonian
Haonan Wang, Adwait Jog College of William & Mary
Computer Architecture Lecture 6a: ChargeCache
Architecting Phase Change Memory as a Scalable DRAM Alternative
Presented by Florian Ettinger
DSPatch: Dual Spatial pattern prefetcher
Presentation transcript:

Understanding Latency Variation in Modern DRAM Chips Experimental Characterization, Analysis, and Optimization Kevin Chang Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, Onur Mutlu v1.3

Main Memory Latency Lags Behind 64x 16x 1.2x Long DRAM latency → performance bottleneck In-memory DB, Spark, JVM, … [Clapp+ (Intel), IISWC’15] Google warehouse-scale workloads [Kanev+ (Google), ISCA’15]

Why is Latency High? DRAM latency: Delay as specified in DRAM standards Doesn’t reflect true DRAM device latency Imperfect manufacturing process → latency variation High standard latency chosen to increase yield DRAM A DRAM B DRAM C Standard Latency Manufacturing Variation High Low DRAM Latency

1 Understand and characterize latency variation in modern DRAM chips Goals 1 1 Understand and characterize latency variation in modern DRAM chips 2 2 Develop a mechanism that exploits latency variation to reduce DRAM latency

Outline Motivation and Goals DRAM Background Experimental Methodology Characterization Results Mechanism: Flexible-Latency DRAM Conclusion

High-Level DRAM Organization chip DRAM Channel DIMM (Dual in-line memory module)

DRAM Chip Internals … DRAM Cell … … Row Buffer 8KB (128 cache lines)

1 2 3 DRAM Operations ACTIVATE: Store the row into the row buffer READ: Select the target cache line and drive to CPU to CPU 3 PRECHARGE: Prepare the array for a new ACTIVATE

DRAM Timing Parameters Activation latency: tRCD (13ns / 50 cycles) 1 Precharge latency: tRP (13ns / 50 cycles) 2 Command Data Duration ACTIVATE READ PRECHARGE 1 Cache line (64B) Next ACT

DRAM Latency Variation Imperfect manufacturing process → latency variation DRAM A DRAM C DRAM B Slow cells High Low DRAM Latency

Experimental Questions Imperfect manufacturing process → latency variation Can we show latency variation in these parameters? How large is latency variation in modern DRAM chips? Can we identify the properties of slow cells with long latency? Can we isolate slow cells to make DRAM faster?

Experimental Methodology Tool that enables us to freely issue DRAM commands Existing systems: Commands are generated and controlled by HW Custom FPGA-based infrastructure PCIe DDR3 PC FPGA DIMM C++ programs to specify commands Generate command sequence

Experiments Swept each timing parameter to read data Time step of 2.5ns (FPGA cycle time) Quantified timing errors: bit flips when using reduced latency Tested 240 DDR3 DRAM chips from three vendors 30 DIMMs Manufacturing dates: 2011 – 2013 Capacity: 1GB Ambient temperature: 20oC

Outline Motivation and Goals DRAM Background Experimental Methodology Characterization Results Activation latency Precharge latency Mechanism: Flexible-Latency DRAM Conclusion

Activation Latency: Key Observation Observation: ACT errors are isolated in the cells read in the first cache line 1 Not fully activated Row Buffer 1 ? 1 ? 1 Second read w/ sufficient activation time First read after ACT 1 tRCD X Command ACTIVATE READ READ Actual ACT Time

Variation in Activation Errors Results from 7500 rounds over 240 chips Max Min Quartiles No ACT Errors Many errors 13.1ns standard Rife w/ errors Very few errors Modern DRAM chips exhibit significant variation in activation latency Different characteristics across DIMMs

Spatial Locality of Activation Errors One DIMM @ tRCD=7.5ns Activation errors are concentrated at certain columns of cells

Strong Pattern Dependence DIMM A DIMM B DIMM C Row buffer design is biased towards 1 over 0 [Lim+, ISSCC’12] > 4 orders of magnitude Activation errors have a strong dependence on the stored data patterns

Precharge Latency: Key Observation Observation: PRE errors occur in multiple cache lines in the row activated after a precharge 1 Not fully precharged Incorrectly sensed data Row Buffer 1 1 tRP Command PRECHARGE ACTIVATE Actual PRE Time

Variation in Precharge Errors Results from 4000 rounds over 240 chips Many errors No PRE Errors Rife w/ errors 13.1ns standard Few errors Different characteristics across DIMMs Modern DRAM chips exhibit significant variation in precharge latency

Spatial Locality of Precharge Errors One DIMM @ tRP=7.5ns Precharge errors are concentrated at certain rows of cells

Outline Motivation and Goals DRAM Background Experimental Methodology Characterization Results Mechanism: Flexible-Latency DRAM Conclusion

Mechanism to Reduce DRAM Latency Observations DRAM timing errors are concentrated on certain regions All cells operate without errors at 10ns tRCD and tRP Flexible-LatencY (FLY) DRAM A software-transparent design that reduces latency Key idea: 1) Divide memory into regions of different latencies 2) Memory controller: Use lower latency for regions without slow cells; higher latency for other regions

FLY-DRAM Evaluation Methodology Cycle-level simulator: Ramulator [CAL’15] https://github.com/CMU-SAFARI/ramulator 8-core system with DDR3 memory Benchmarks: SPEC2006, TPC, STREAM, random 40 8-core workloads Performance metric: Weighted Speedup (WS)

FLY-DRAM Configurations tRCD 12% 93% 99% 13% 74% Profiles of 3 real DIMMs tRP

FLY-DRAM improves performance by exploiting latency variation in DRAM Results 19.7% 19.5% 17.6% 13.3% FLY-DRAM improves performance by exploiting latency variation in DRAM

Other Results in the Paper Error-correcting codes (ECC) Effective at correcting activation errors Restoration latency Significant margin to complete without errors Effect of temperature Difference is not statistically significant to draw conclusion

Conclusion First to experimentally demonstrate and analyze latency variation behavior within real DRAM chips Show across 240 DRAM chips that: All cells work below standard latency Some regions of cells work even faster, but slow cells in other regions start to fail Error rate is data-dependent FLY-DRAM reduces latency by using low latency for regions without slow cells and high latency for others 13%/17%/19% speedup based on profiles of 3 real DIMMs https://github.com/CMU-SAFARI/DRAM-Latency-Variation-Study

Understanding Latency Variation in Modern DRAM Chips Experimental Characterization, Analysis, and Optimization Kevin Chang Abhijith Kashyap, Hasan Hassan, Saugata Ghose, Kevin Hsieh, Donghyuk Lee, Tianshi Li, Gennady Pekhimenko, Samira Khan, Onur Mutlu

Backup Slides

Infrastructure Temperature Controller FPGA DIMM Heater

DRAM DIMMs

Activation Latency Variation by DRAM Models

Activation Errors in Data Bursts

Effect of ECC on Activation Errors

Activation Errors by Temperature

Precharge Latency Variation by DRAM Models