Mellow Writes: Extending Lifetime in Resistive Memories through Selective Slow Write Backs
Lunkai Zhang, Diana Franklin, Frederic T. Chong, Brian Neely, Dmitri Strukov, Yuan Xie

An Everyday Experience: Sharpening a Knife
Do it harshly…
– Takes a shorter time.
– Bad for the knife.
Do it gently…
– Takes a longer time.
– Good for the knife.

A Similar Trade-off Also Works for Resistive Memory Write Operations!
For typical resistive memory technologies, slower writes are predicted to have a quadratic endurance advantage!
Write with higher power…
– Takes a shorter time.
– Lower endurance, shorter lifetime.
Write with lower power…
– Takes a longer time.
– Higher endurance, longer lifetime.
Citation: D. B. Strukov, “Endurance-write-speed tradeoffs in nonvolatile memories,” Applied Physics A, vol. 122, no. 4, pp. 1–4, 2016.
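To make the quadratic advantage concrete: if cell endurance scales with the square of the write pulse duration, Endurance(t) ∝ t², then stretching a write to 3× its normal latency yields 3² = 9× the endurance. These are exactly the 3.00× latency / 9.00× endurance write modes used in the evaluation below.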

A Single Write Latency Is NOT Enough!
– A single shorter write latency => memory lifetime suffers for some applications.
– A single longer write latency => performance suffers for some other applications.
Is it possible to let a system adaptively use different write speeds, so we can improve the lifetime without loss of performance?

Relatively Low Bank Utilization
Memory banks are idle for most of the time.
Is it possible to use the bank idle time to slowly write back the data?

Schemes
Mellow Writes
– Bank-Aware Mellow Writes
– Eager Mellow Writes
Wear Quota

Schemes
Mellow Writes
– Bank-Aware Mellow Writes
– Eager Mellow Writes
Wear Quota

Motivation: Bank-Level Imbalance
(Figure: number of awaiting writes per bank in the write queue.)
Bank 0 has only 1 memory block to be written back: it is less likely that the write queue will be blocked by Bank 0.
Bank 2 has more memory blocks to be written back: it is more likely that the write queue will be blocked by Bank 2.

Bank-Aware Mellow Writes
Approach: slowly write back a memory block only when there is no other memory block queued for the same bank.
(Figure: number of awaiting writes per bank.)
– Write back the only memory block for Bank 0 at slow speed.
– Write back the current memory block for Bank 2 at normal speed.
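Below is a minimal sketch of this decision in the write scheduler, assuming a controller that tracks per-bank pending-write counts (the class and method names, and the memory.write interface, are illustrative, not from the paper):

```python
# Minimal sketch of Bank-Aware Mellow Writes (illustrative names, not the
# paper's implementation). A write is issued slowly only when no other
# queued write targets the same bank, so the slow write cannot block
# later writes to that bank.

from collections import deque

SLOW_FACTOR = 3  # 3x latency => ~9x endurance under the quadratic model

class BankAwareMellowScheduler:
    def __init__(self, num_banks):
        self.write_queue = deque()            # entries: (bank, block_addr)
        self.pending_per_bank = [0] * num_banks

    def enqueue(self, bank, block_addr):
        self.write_queue.append((bank, block_addr))
        self.pending_per_bank[bank] += 1

    def issue_next(self, memory):
        """Pop one write and choose its speed based on bank contention."""
        bank, block_addr = self.write_queue.popleft()
        self.pending_per_bank[bank] -= 1
        if self.pending_per_bank[bank] == 0:
            # No other write waits on this bank: a slow write is safe.
            memory.write(bank, block_addr, latency_factor=SLOW_FACTOR)
        else:
            # Bank is contended: use a normal-speed write.
            memory.write(bank, block_addr, latency_factor=1)

class StubMemory:
    def write(self, bank, addr, latency_factor):
        print(f"bank {bank} addr {addr:#x} @ {latency_factor}x latency")

# Demo mirroring the slide: Bank 0 has one pending write, Bank 2 has two.
sched = BankAwareMellowScheduler(num_banks=4)
sched.enqueue(0, 0x1000)
sched.enqueue(2, 0x2000)
sched.enqueue(2, 0x3000)
mem = StubMemory()
sched.issue_next(mem)  # bank 0: no other pending write -> slow (3x)
sched.issue_next(mem)  # bank 2: another write still queued -> normal
sched.issue_next(mem)  # bank 2: now the only one left -> slow
```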

Simulated System
– OoO Alpha core.
– 32KB L1 I/D-$, 256KB L2$, 2MB L3$ (LLC).
– 4GB resistive main memory (ReRAM technology): 16 banks, 32-entry read/write queues, write drain, Start-Gap wear leveling (1.0x latency = 150ns, 1.00x endurance = 5.0 × 10^6 writes).
– Norm Writes (1.0x): 1.00x latency, 1.00x endurance; no write cancellation.
– Slow Writes (3.0x): 3.00x latency, 9.00x endurance; with write cancellation.
– Eight-year lifetime requirement.

Effectiveness of Bank-Aware Mellow Writes
– No noticeable performance degradation.
– Geomean 87% lifetime improvement compared with All-Norm.
– 4 out of 11 applications meet the 8-year lifetime requirement.

Schemes
Mellow Writes
– Bank-Aware Mellow Writes
– Eager Mellow Writes
Wear Quota

Motivation: Write Scheduling Imbalance in a Memory Bank with Bank-Aware Mellow Writes
(Figure: one bank's write timeline; bursty intervals are marked “Too Crowded!” while the idle intervals between them are marked “Wasted!”)
So… is it possible to reschedule the writes? If we can evenly reschedule the writes, the wasted idle intervals can absorb slow write backs.

Eager Mellow Writes
We predict which dirty cache lines in the last-level cache will not be written again before their eviction, and eagerly and slowly write back these cache lines.
In a sense, we treat the last-level cache as a large write buffer, in which we find suitable write backs to fill the idle memory intervals.

Choosing Cache Lines for Eager Mellow Writes
In this paper, we choose dirty cache lines that are predicted to be useless as the candidates for Eager Mellow Writes. That is, cache lines that will not be accessed again before their eviction.
(Figure: a last-level cache with Sets 0–3; lines predicted useless are candidates for Eager Mellow Writes if dirty, lines predicted useful are not.)

A Utility-Based Approach to Predicting Useless Cache Lines
For an LRU set-associative last-level cache (LLC):
– Add an access counter for each LRU stack position in the LLC.
– Increment the corresponding counter on every hit to an LRU position.
– At the end of each time slice (500,000 cycles), choose the consecutive least-used LRU positions whose counters sum to less than 1/32 of all LLC accesses.
– In the next time slice, treat cache lines at these LRU positions as useless; they can be eagerly written back.
Citation: Moinuddin K. Qureshi and Yale N. Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006.
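A minimal sketch of the selection step, assuming the consecutive run of least-used positions is anchored at the LRU end of the stack (function and variable names are illustrative):

```python
# Illustrative sketch of the utility-based "useless position" selection.
# hit_counts[i] = hits at LRU stack position i during the last time slice
# (0 = MRU, ways - 1 = LRU). Positions are scanned from the LRU end, and
# the longest consecutive run whose summed hits stay under 1/32 of all
# LLC accesses is marked useless for the next slice.

def select_useless_positions(hit_counts, total_llc_accesses):
    threshold = total_llc_accesses / 32
    useless = []
    running_sum = 0
    # Walk from the LRU end toward the MRU end.
    for pos in range(len(hit_counts) - 1, -1, -1):
        running_sum += hit_counts[pos]
        if running_sum >= threshold:
            break
        useless.append(pos)
    return useless  # dirty lines at these positions are eager-write candidates

# Example: 16-way LLC, 400,000 accesses in the slice, threshold = 12,500.
hits = [90_000, 70_000, 50_000, 40_000, 30_000, 25_000, 20_000, 18_000,
        15_000, 12_000, 10_000, 6_000, 5_000, 4_000, 3_000, 2_000]
print(select_useless_positions(hits, sum(hits)))  # -> [15, 14, 13]
```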

Architectural-Level Modifications
(Figure: memory controller datapath with two additions.)
– Eager Mellow Write requests sent from the last-level cache.
– An Eager Mellow Queue in the memory controller: lowest priority, no write drains, just slow writes.
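A sketch of the arbitration these rules imply, with reads served first, then the normal write queue, and the Eager Mellow Queue only when the others are empty (queue names and the return convention are assumptions, not the paper's specification):

```python
# Illustrative arbitration for a controller with an added Eager Mellow Queue.
from collections import deque

def pick_next_request(read_q, write_q, eager_mellow_q):
    """Serve reads first, then demand write backs, then eager slow writes."""
    if read_q:
        return read_q.popleft(), "normal"
    if write_q:
        # The normal write queue keeps its usual write-drain behavior.
        return write_q.popleft(), "normal"
    if eager_mellow_q:
        # Eager Mellow Queue: lowest priority, never triggers a write
        # drain, and its entries are always written at the slow speed.
        return eager_mellow_q.popleft(), "slow"
    return None, "idle"

reads, writes, eager = deque(), deque(), deque([0x4000])
print(pick_next_request(reads, writes, eager))  # -> (16384, 'slow')
```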

Effectiveness of Eager Mellow Writes
– No performance degradation; even some performance benefit.
– Geomean 158% lifetime improvement compared with All-Norm.
– 6 out of 11 applications meet the 8-year lifetime requirement.
– 5 applications still suffer from short lifetime!

Schemes
Mellow Writes
– Bank-Aware Mellow Writes
– Eager Mellow Writes
Wear Quota

Wear Quota
Partition the time into time slices.
Wear Quota (per bank): the average available wear of each time slice:
Wear Quota = (Total Amount of Available Wear of Resistive Main Memory / Expected Lifetime) × Time Slice
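As a rough back-of-the-envelope illustration (assuming 64B memory blocks and ideal wear leveling, neither of which is specified on this slide): the simulated 4GB memory with 5.0 × 10^6 endurance has 4 GiB / 64 B ≈ 6.7 × 10^7 blocks, giving roughly 6.7 × 10^7 × 5.0 × 10^6 ≈ 3.4 × 10^14 block writes of total available wear; spread over an 8-year lifetime (≈ 2.5 × 10^8 seconds), the quota works out to about 1.3 × 10^6 normal-speed writes per second on average.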

Wear Quota
(Figure: four consecutive time slices, each comparing its actual wear against the per-slice Wear Quota.)
– Time Slice 1: Mellow Writes policy; wear within quota.
– Time Slice 2: Mellow Writes policy; wear exceeds quota.
– Time Slice 3: All-Slow Writes policy, forced by the previous slice's overrun; wear within quota.
– Time Slice 4: Mellow Writes policy again; wear within quota.
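A minimal sketch of the per-slice policy switch this example implies (the exact switch-back rule and all names are assumptions, not the paper's specification):

```python
# Illustrative per-slice Wear Quota control loop. Each bank tracks the
# wear spent in the current slice; exceeding the quota forces the next
# slice to the All-Slow policy, which throttles wear at some performance
# cost. Staying within quota restores the normal Mellow Writes policy.

MELLOW, ALL_SLOW = "mellow_writes", "all_slow_writes"

class WearQuotaController:
    def __init__(self, total_wear, lifetime_slices):
        self.quota = total_wear / lifetime_slices  # average wear per slice
        self.policy = MELLOW
        self.slice_wear = 0.0

    def record_write(self, wear_cost):
        # e.g. 1.0 per normal write, 1/9 per 3x slow write (9x endurance).
        self.slice_wear += wear_cost

    def end_of_slice(self):
        # Policy chosen here applies to the following slice.
        self.policy = ALL_SLOW if self.slice_wear > self.quota else MELLOW
        self.slice_wear = 0.0

# Demo mirroring the four-slice example above (quota = 100 wear units).
ctrl = WearQuotaController(total_wear=400, lifetime_slices=4)
for slice_wear in (80, 130, 60, 90):   # slice 2 overruns its quota
    ctrl.record_write(slice_wear)
    ctrl.end_of_slice()
    print(ctrl.policy)  # mellow, all_slow, mellow, mellow
```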

Effectiveness of Wear Quota
– All 11 applications meet the 8-year lifetime requirement.
– Does not degrade the performance if the lifetime requirement is already met.
– Degrades the performance only when necessary!

How About Energy?
– Operation level: a 3x slow write consumes 66% more energy compared with a normal write.
– Total memory energy of the execution: on average, less than 50% more memory energy compared with the All-Norm policy.
– An affordable cost compared with the lifetime benefit.
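One way to sanity-check the 66% figure (an inference from E = P × t, not a breakdown given in the talk): a 3× longer write pulse at roughly 0.55× the power gives 3 × 0.55 ≈ 1.66× the energy per write. The whole-execution overhead stays below 50% because only a fraction of the writes are actually issued at the slow speed.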

Sensitivity to the Analytic Model
– In a typical ReRAM technology, slow writes are predicted to achieve a quadratic endurance benefit compared with default-speed writes.
– Across a wider range of device parameters, the endurance benefit could range from linear to cubic.
– What happens if the endurance benefit differs? Even with a pessimistic linear endurance benefit, we can still achieve a 47% lifetime improvement.
Citation: D. B. Strukov, “Endurance-write-speed tradeoffs in nonvolatile memories,” Applied Physics A, vol. 122, no. 4, pp. 1–4, 2016.

Conclusion
– A new dynamic trade-off between write latency and endurance.
– Two Mellow Writes schemes that improve lifetime without sacrificing performance.
– A Wear Quota scheme that guarantees a minimal lifetime with relatively small performance loss.
– Low hardware overhead and easy to implement.

Thanks!
Lunkai Zhang
The University of Chicago