Presentation transcript:

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case
Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, Onur Mutlu

2 Timing Parameters. A real system (x86 CPU with memory controller, DDR3-1600 DRAM module) runs SPEC mcf, Parsec, GUPS, Memcached, and Apache. Reducing the timing parameters from the standard (11 – 11 – 28) to (8 – 8 – 19) cuts the mcf runtime from 527 min to 477 min, a 10.5% improvement, with no errors.
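As a quick sanity check on those numbers, here is a minimal sketch that converts the two timing triples from cycles to nanoseconds. It assumes (the slide does not state this explicitly) that the triple is (tRCD, tRP, tRAS) in bus-clock cycles of a DDR3-1600 interface, i.e. an 800 MHz clock:

```c
/* Convert the (11-11-28) and (8-8-19) timing triples to nanoseconds,
 * assuming they are (tRCD, tRP, tRAS) in cycles of DDR3-1600's
 * 800 MHz clock (1.25 ns per cycle). The interpretation of the triple
 * is an assumption, not stated on the slide. */
#include <stdio.h>

int main(void) {
    const double cycle_ns   = 1.0 / 0.8;            /* 800 MHz -> 1.25 ns  */
    const int standard[3]   = { 11, 11, 28 };       /* worst-case timings  */
    const int reduced[3]    = {  8,  8, 19 };       /* common-case timings */
    const char *names[3]    = { "tRCD", "tRP", "tRAS" };

    for (int i = 0; i < 3; i++)
        printf("%-4s standard %2d cyc = %5.2f ns, reduced %2d cyc = %5.2f ns\n",
               names[i], standard[i], standard[i] * cycle_ns,
               reduced[i],  reduced[i]  * cycle_ns);
    return 0;
}
```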

3 Reducing DRAM Timing Why can we reduce DRAM timing parameters without any errors?

4 Executive Summary
Observations:
– DRAM timing parameters are dictated by the worst-case cell (the smallest cell across all products at the highest temperature).
– DRAM operates at lower temperatures than the worst case.
Idea: Adaptive-Latency DRAM – optimize DRAM timing parameters for the common case (a typical DIMM operating at low temperatures).
Analysis: characterization of 115 DIMMs – great potential to lower DRAM timing parameters (17 – 54%) without any errors.
Real-system performance evaluation – significant performance improvement (14% for memory-intensive workloads) without errors over 33 days.

5 1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

6 DRAM Stores Data as Charge. Three steps of charge movement between the DRAM cell and the sense-amplifier: 1. Sensing, 2. Restore, 3. Precharge.

7 DRAM Charge over Time. [Figure: cell and sense-amplifier charge vs. time for data 0 and data 1 during sensing and restore; the timing parameters include a margin between what is needed in theory and what is used in practice.] Why does DRAM need the extra timing margin?

8 1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

9 Two Reasons for Timing Margin
1. Process Variation – DRAM cells are not equal – leads to extra timing margin for a cell that can store a large amount of charge.
2. Temperature Dependence – DRAM leaks more charge at higher temperature – leads to extra timing margin when operating at low temperature.

10 DRAM Cells are Not Equal. Ideal: same size → same charge → same latency. Real: different sizes (largest vs. smallest cell) → different charge → different latency. Large variation in cell size → large variation in charge → large variation in access latency.

11 Process Variation. A DRAM cell consists of a capacitor, an access transistor, a contact, and the bitline. Process variation affects ❶ cell capacitance, ❷ contact resistance, and ❸ transistor performance. A small cell can store only a small charge: small cell capacitance, high contact resistance, and a slow access transistor → high access latency.

12 Two Reasons for Timing Margin
1. Process Variation – DRAM cells are not equal – leads to extra timing margin for a cell that can store a large amount of charge.
2. Temperature Dependence – DRAM leaks more charge at higher temperature – leads to extra timing margin for cells that operate at low temperature.

13 At room temperature, cells leak little charge; at hot temperature (85°C), they leak a lot. Cells store small charge at high temperature and large charge at low temperature → large variation in access latency.
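To make the temperature dependence concrete, here is a rough illustrative model. The exponential discharge, the decay constant, and the "leakage roughly doubles every 10°C" rate are assumptions for this sketch only, not figures from the talk:

```c
/* Illustrative model of remaining DRAM cell charge vs. temperature.
 * Assumptions (not from the slides): the cell discharges exponentially
 * over the refresh interval, and leakage roughly doubles every +10°C,
 * so the decay constant halves every +10°C. */
#include <math.h>
#include <stdio.h>

double remaining_charge(double temp_c, double refresh_interval_ms) {
    const double full_charge = 1.0;     /* normalized cell charge            */
    const double tau_85c_ms  = 2000.0;  /* assumed decay constant at 85°C    */
    double tau_ms = tau_85c_ms * pow(2.0, (85.0 - temp_c) / 10.0);
    return full_charge * exp(-refresh_interval_ms / tau_ms);
}

int main(void) {
    /* Compare the common case (55°C) with the worst case (85°C). */
    printf("55C, 64 ms refresh: %.3f of full charge\n", remaining_charge(55.0, 64.0));
    printf("85C, 64 ms refresh: %.3f of full charge\n", remaining_charge(85.0, 64.0));
    return 0;
}
```

The point the model illustrates is only the direction of the effect: at lower temperature more charge remains at the end of a refresh interval, which is what leaves slack in the worst-case timing parameters.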

14 DRAM Timing Parameters DRAM timing parameters are dictated by the worst-case –The smallest cell with the smallest charge in all DRAM products –Operating at the highest temperature Large timing margin for the common-case

15 Our Approach. We optimize DRAM timing parameters for the common case – the smallest cell with the smallest charge in a DRAM module – operating at the current temperature. The common-case cell has more charge than the worst-case cell → we can lower latency for the common case.

16 1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

17 Key Observations
1. Sensing: sense cells with extra charge faster → lower sensing latency.
2. Restore: no need to fully restore cells with extra charge → lower restore latency.
3. Precharge: no need to fully precharge bitlines for cells with extra charge → lower precharge latency.
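These three observations map onto standard DDR3 timing parameters (sensing: tRCD; restore: tRAS for reads and tWR for writes; precharge: tRP, as the control-factors slide later notes). A small sketch that tabulates that mapping, using the reduction figures the talk reports for the 115-DIMM characterization at 55°C:

```c
/* Mapping of the three key observations to DDR3 timing parameters,
 * with the latency reductions reported for typical DIMMs at 55°C
 * in the 115-DIMM characterization summary. */
#include <stdio.h>

struct timing_opt {
    const char *step;        /* DRAM access step            */
    const char *parameter;   /* timing parameter it governs */
    double      reduction;   /* reported reduction, percent */
};

int main(void) {
    struct timing_opt opts[] = {
        { "Sensing",         "tRCD", 17.3 },
        { "Restore (read)",  "tRAS", 37.3 },
        { "Restore (write)", "tWR",  54.8 },
        { "Precharge",       "tRP",  35.2 },
    };
    for (size_t i = 0; i < sizeof(opts) / sizeof(opts[0]); i++)
        printf("%-16s %-5s -%.1f%%\n",
               opts[i].step, opts[i].parameter, opts[i].reduction);
    return 0;
}
```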

18 Observation 1. Faster Sensing. More charge → strong charge flow → faster sensing. For a typical DIMM at low temperature: more charge → faster sensing. 115-DIMM characterization: sensing timing (tRCD) reduced by 17% with no errors.

19 Observation 2. Reducing Restore Time. A larger cell and less leakage → extra charge → no need to fully restore the charge. For a typical DIMM at low temperature: more charge → restore time reduction. 115-DIMM characterization: read (tRAS) reduced by 37% and write (tWR) by 54% with no errors.

20 Observation 3. Reducing Precharge Time. The bitline swings between empty (0V) and full (Vdd); after sensing, precharge sets the bitline back to half-full charge. For a typical DIMM at lower temperature, does the bitline really need to be fully precharged?

21 Observation 3. Reducing Precharge Time. Even with a bitline that is not fully precharged, a cell with more charge still produces strong sensing, whether the next access is to an empty or a full cell. For a typical DIMM at lower temperature: more charge → precharge time reduction. 115-DIMM characterization: precharge timing (tRP) reduced by 35% with no errors.

22 Key Observations
1. Sensing: sense cells with extra charge faster → lower sensing latency.
2. Restore: no need to fully restore cells with extra charge → lower restore latency.
3. Precharge: no need to fully precharge bitlines for cells with extra charge → lower precharge latency.

23 1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

24 Adaptive-Latency DRAM
Key idea: optimize DRAM timing parameters online.
Two components:
– The DRAM manufacturer profiles multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM.
– The system monitors the DRAM temperature and uses the appropriate DRAM timing parameters.
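A minimal sketch of the second component follows. The table layout, the temperature thresholds, and the tWR values are illustrative assumptions (only the 11-11-28 and 8-8-19 triples come from the motivation slide); the talk does not specify this interface:

```c
/* Sketch of AL-DRAM's online selection: the memory controller keeps the
 * manufacturer-profiled timing sets and picks the one that matches the
 * DIMM's current temperature. Values and thresholds are illustrative. */
#include <stdio.h>

struct timing_set {
    int max_temp_c;             /* set is valid up to this temperature */
    int trcd, trp, tras, twr;   /* timing parameters, in memory cycles */
};

/* Hypothetical per-DIMM profile: faster timings at lower temperatures,
 * falling back to the standard values at the worst case (85°C). */
static const struct timing_set profile[] = {
    { 55,  8,  8, 19,  8 },
    { 70,  9,  9, 23, 10 },
    { 85, 11, 11, 28, 12 },
};

const struct timing_set *select_timings(int dimm_temp_c) {
    for (size_t i = 0; i < sizeof(profile) / sizeof(profile[0]); i++)
        if (dimm_temp_c <= profile[i].max_temp_c)
            return &profile[i];
    return &profile[sizeof(profile) / sizeof(profile[0]) - 1]; /* worst case */
}

int main(void) {
    int temp = 34;  /* e.g., a server cluster operating under 34°C */
    const struct timing_set *t = select_timings(temp);
    printf("At %d C: tRCD=%d tRP=%d tRAS=%d tWR=%d cycles\n",
           temp, t->trcd, t->trp, t->tras, t->twr);
    return 0;
}
```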

25 1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

26 DRAM Temperature
DRAM temperature measurements: a server cluster operates at under 34°C and a desktop at under 50°C, while the DRAM standard is optimized for 85°C.
Previous works show that DRAM temperature is low (El-Sayed+ SIGMETRICS 2012, Liu+ ISCA 2007) and that DRAM temperature can be kept low (David+ ICAC 2011, Liu+ ISCA 2007, Zhu+ ITHERM 2008).
DRAM operates at low temperatures in the common case.

27 DRAM Testing Infrastructure: FPGAs with the DIMMs under test, a temperature controller with a heater, and a host PC.

28 Test Pattern. Single cache line test (read/write): write the data, wait for the refresh interval (64 – 512 ms), then access and verify it. Multiple single-cache-line tests are overlapped to simulate power noise and coupling.
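A minimal sketch of that write → wait → access → verify sequence is below. The "DRAM" is just a host array, the 0xAA pattern is an assumption, and wait_ms is a stub; they stand in for the FPGA-based test infrastructure shown on the previous slide:

```c
/* Sketch of the single-cache-line test pattern: write, wait for a
 * refresh interval, access, verify. The fake_dram array and stubs
 * stand in for the real FPGA-driven DIMM under test. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define CACHE_LINE_BYTES 64
#define NUM_LINES        1024

static uint8_t fake_dram[NUM_LINES][CACHE_LINE_BYTES];   /* stand-in for the DIMM */

static void dram_write(unsigned line, const uint8_t *d) { memcpy(fake_dram[line], d, CACHE_LINE_BYTES); }
static void dram_read(unsigned line, uint8_t *d)        { memcpy(d, fake_dram[line], CACHE_LINE_BYTES); }
static void wait_ms(unsigned ms)                        { (void)ms; /* real test waits 64-512 ms */ }

/* One single-cache-line test: returns the number of mismatching bytes. */
static int test_cache_line(unsigned line, unsigned refresh_interval_ms) {
    uint8_t pattern[CACHE_LINE_BYTES], readback[CACHE_LINE_BYTES];
    memset(pattern, 0xAA, sizeof(pattern));       /* test data pattern (assumption) */

    dram_write(line, pattern);                    /* 1. write                       */
    wait_ms(refresh_interval_ms);                 /* 2. wait one refresh interval   */
    dram_read(line, readback);                    /* 3. access                      */

    int errors = 0;                               /* 4. verify                      */
    for (int i = 0; i < CACHE_LINE_BYTES; i++)
        if (readback[i] != pattern[i])
            errors++;
    return errors;
}

int main(void) {
    int total = 0;
    for (unsigned line = 0; line < NUM_LINES; line++)
        total += test_cache_line(line, 512);      /* longest interval = least charge */
    printf("total byte errors: %d\n", total);
    return 0;
}
```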

29 Control Factors
Timing parameters – sensing: tRCD; restore: tRAS (read), tWR (write); precharge: tRP.
Temperature: 55 – 85°C.
Refresh interval: 64 – 512 ms – a longer refresh interval leads to smaller charge; the standard refresh interval is 64 ms.

30 1. Timings ↔ Charge. [Results: errors vs. reduced timing for sensing, restore (read), restore (write), and precharge, at 85°C with refresh intervals of 64, 128, 256, and 512 ms.] More charge enables more timing parameter reduction.

31 2. Timings ↔ Temperature. [Results: errors vs. reduced timing for sensing, restore (read), restore (write), and precharge, at 55, 65, 75, and 85°C with a 512 ms refresh interval.] Lower temperature enables more timing parameter reduction.

32 3. Summary of 115 DIMMs
Latency reduction for read & write (55°C) – read latency: 32.7%, write latency: 55.1%.
Latency reduction for each timing parameter (55°C) – sensing: 17.3%; restore: 37.3% (read), 54.8% (write); precharge: 35.2%.

33 1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

34 Real System Evaluation Method
System – CPU: AMD 4386 (8 cores, 3.1 GHz, 8 MB LLC); DRAM: 4 GByte DDR3 (800 MHz clock); OS: Linux; storage: 128 GByte SSD.
Workload – 35 applications from SPEC, STREAM, Parsec, Memcached, Apache, GUPS.

35 Single-Core Evaluation. [Results: per-workload performance improvement; average improvements of 1.4%, 6.7%, and 5.0% (all-35-workload average).] AL-DRAM improves performance on a real system.

36 Multi-Core Evaluation. [Results: per-workload performance improvement; average improvements of 2.9% and 10.4% (all-35-workload average); the summary slides report 14% for memory-intensive workloads.] AL-DRAM provides higher performance for multi-programmed and multi-threaded workloads.

37 Conclusion
Observations:
– DRAM timing parameters are dictated by the worst-case cell (the smallest cell across all products at the highest temperature).
– DRAM operates at lower temperatures than the worst case.
Idea: Adaptive-Latency DRAM – optimize DRAM timing parameters for the common case (a typical DIMM operating at low temperatures).
Analysis: characterization of 115 DIMMs – great potential to lower DRAM timing parameters (17 – 54%) without any errors.
Real-system performance evaluation – significant performance improvement (14% for memory-intensive workloads) without errors over 33 days.

Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case
Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, Onur Mutlu