
1 Samira Khan University of Virginia Mar 29, 2016
COMPUTER ARCHITECTURE CS 6354 Emerging Memory Technologies Samira Khan University of Virginia Mar 29, 2016 The content and concept of this course are adapted from CMU ECE 740

2 AGENDA Logistics Review from last lecture Emerging Memory Technology

3 LOGISTICS Exam I Solution will be uploaded to Piazza
Contact Elaheh to get your graded paper back. Honor Code: students pledge never to lie, cheat, or steal, and accept that the consequence for breaking this pledge is permanent dismissal from the University. Any misconduct will be reported without mercy. Consider this my final warning.

4 Grade Distribution

5 Milestone I Presentations
How do you think you did? The project is 50% of the grade. There are no assignments and the exam is easy, so that you can focus on the project. A 3-credit course requires more than 10 hours per week: you should have spent more than 60 hours on the proposal + milestone 1, more than 90 hours by milestone 2, and more than 120 hours by the final presentation. It is very easy to see who tried and who did not even try. If you do not want to fail, work harder! If you have to retake the course, the Fall 2016 offering will have a higher standard!

6 SOLUTIONS TO THE DRAM SCALING PROBLEM
Two potential solutions Tolerate DRAM (by taking a fresh look at it) Enable emerging memory technologies to eliminate/minimize DRAM Do both Hybrid memory systems

7 (11 – 11 – 28) (8 – 8 – 19) Runtime: 527min x86 CPU Runtime: 477min
SPEC mcf Apache GUPS Memcached Parsec -10.5% (no error) MemCtrl (11 – 11 – 28) Timing Parameters DRAM Module (8 – 8 – 19) A system consists of a processor and DRAM based-memory system that are connected by memory channel. DRAM has many timing parameters as minimum number of clock cycles between commands. Memory controller follows these timing parameters when accessing DRAM. We execute a benchmark with using the standard timing parameters. Then, we intentionally reduce the timing parameters and execute same benchmark again. When comparing these two cases, we observe significant performance improvement by reducing timing parameters, for example, 10% performance improvement for spec mcf benchmark. Without any errors.----- DDR3 1600MT/s ( )
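As a rough sanity check on what those two timing triples mean in absolute terms, the sketch below converts them into nanoseconds. It assumes the triples are (tRCD, tRP, tRAS) in clock cycles on a DDR3-1600 module (800 MHz clock, 1.25 ns per cycle); the slide does not spell out which parameters the three numbers denote, so treat the mapping as an assumption.

```c
#include <stdio.h>

/* Hypothetical sketch: convert DRAM timing triples (in clock cycles)
 * into nanoseconds, assuming a DDR3-1600 module (800 MHz clock). */
int main(void) {
    const double ns_per_cycle = 1.25;          /* 1 / 800 MHz */
    int standard[3] = {11, 11, 28};            /* assumed (tRCD, tRP, tRAS) */
    int reduced[3]  = {8, 8, 19};

    int std_cycles = 0, red_cycles = 0;
    for (int i = 0; i < 3; i++) {
        std_cycles += standard[i];
        red_cycles += reduced[i];
    }
    printf("standard: %d cycles = %.2f ns per row access\n",
           std_cycles, std_cycles * ns_per_cycle);
    printf("reduced:  %d cycles = %.2f ns per row access\n",
           red_cycles, red_cycles * ns_per_cycle);
    printf("savings:  %.2f ns (%.1f%%)\n",
           (std_cycles - red_cycles) * ns_per_cycle,
           100.0 * (std_cycles - red_cycles) / std_cycles);
    return 0;
}
```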

8 Why can we reduce DRAM timing parameters without any errors?
Reducing DRAM Timing: why can we reduce DRAM timing parameters without any errors? I will mostly talk about reducing DRAM timing parameters. Before explaining the details of our work, I will give the high-level questions and our answers. The first question is: why does reducing DRAM timing parameters work? Our observation is that DRAM timing parameters are overly conservative, dictated by the slowest DRAM chip at the highest operating temperature. The second question is: how much can we reduce DRAM timing parameters? Our DRAM latency characterization shows that it depends on the operating temperature and the DRAM chip. The third question is: is this approach safe? Our answer is yes. We exploit only the extra margin in the timing parameters, and we performed a 33-day-long evaluation with reduced timing parameters and observed no errors. Let me explain the details of our work.

9 Executive Summary Observations Idea: Adaptive-Latency DRAM
DRAM timing parameters are dictated by the worst-case cell (the smallest cell across all products at the highest temperature). DRAM operates at lower temperatures than the worst case. Idea: Adaptive-Latency DRAM optimizes DRAM timing parameters for the common case (a typical DIMM operating at low temperatures). Analysis: characterization of 115 DIMMs shows great potential to lower DRAM timing parameters (17 – 54%) without any errors. Real-system performance evaluation shows a significant performance improvement (14% for memory-intensive workloads) without errors (33 days). To summarize: DRAM timing parameters are dictated by the worst-case cell. We propose Adaptive-Latency DRAM, which optimizes DRAM timing parameters for the common case. Our characterization shows significant potential to lower DRAM timing parameters, and our real-system evaluation shows that our mechanism provides a significant performance improvement without introducing errors.

10 2. Reasons for Timing Margin in DRAM
1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation. This is the outline. I will first briefly explain DRAM operation and the two reasons for timing margin in DRAM. Then we will discuss our key observations and propose our mechanism, Adaptive-Latency DRAM. Finally, we will present the DRAM latency characterization results and the real-system performance evaluation results. Let me start with DRAM operation.

11 DRAM Stores Data as Charge
DRAM Cell. Three steps of charge movement: 1. Sensing, 2. Restore, 3. Precharge. The figure on the right shows the DRAM organization, which consists of DRAM cells and sense-amplifiers. There are three steps in accessing DRAM. First, when DRAM selects one of the rows in its cell array, the selected cells drive their charge to the sense-amplifiers, and the sense-amplifiers detect that charge, recognizing the data in the cells; we call this operation sensing. Second, after sensing the data, the sense-amplifiers drive charge back to the cells to reconstruct the original data; we call this operation restore. Third, after restoring the cell data, DRAM deselects the accessed row and returns the sense-amplifiers to their original state for the next access; we call this operation precharge. As these three steps show, DRAM operation is charge movement between the cell and the sense-amplifier.

12 Why does DRAM need the extra timing margin?
DRAM Charge over Time. [Figure: cell and sense-amplifier charge over time during sensing and restore, for data 1 and data 0, showing the margin between the timing parameters used in practice and what is needed in theory.] Timing parameters exist to ensure sufficient movement of charge. The figure shows the timeline for accessing DRAM on the x-axis and the amount of charge in the DRAM cell and sense-amplifier on the y-axis. When a row of cells is selected, each cell drives its charge toward the corresponding sense-amplifier. Once the driven charge is more than enough to detect the data, the sense-amplifier starts to restore the data by driving charge back toward the cell, and this restore operation finishes when the cell is fully charged. Ideally, the DRAM timing parameters would be just long enough for this charge movement. In practice, however, there is a large margin in the DRAM timing parameters. Why is this margin required?
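To make the notion of a timing margin concrete, here is a small toy model (not from the paper) that treats the bitline as charging toward the cell value with a simple exponential and asks how long it takes to cross a sensing threshold; the time constants and the threshold are invented for illustration.

```c
#include <math.h>
#include <stdio.h>

/* Toy model (illustrative only): bitline voltage charging toward the
 * cell value as v(t) = 1 - exp(-t/tau). A cell with more charge drives
 * the bitline with a smaller effective time constant, so it crosses
 * the sensing threshold sooner and needs less of the timing margin. */
static double time_to_threshold(double tau_ns, double threshold) {
    return -tau_ns * log(1.0 - threshold);   /* solve 1 - e^(-t/tau) = threshold */
}

int main(void) {
    const double threshold = 0.75;     /* assumed fraction needed for reliable sensing */
    double tau_worst_ns   = 10.0;      /* assumed worst-case (small, leaky) cell */
    double tau_typical_ns = 6.0;       /* assumed typical cell with extra charge */

    printf("worst-case cell: %.1f ns to sense\n",
           time_to_threshold(tau_worst_ns, threshold));
    printf("typical cell:    %.1f ns to sense\n",
           time_to_threshold(tau_typical_ns, threshold));
    printf("a timing parameter set for the worst case leaves this gap as margin\n");
    return 0;
}
```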

13 2. Reasons for Timing Margin in DRAM
1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

14 Two Reasons for Timing Margin
1. Process Variation: DRAM cells are not equal. Leads to extra timing margin for a cell that can store a large amount of charge. 2. Temperature Dependence: DRAM leaks more charge at higher temperature. Leads to extra timing margin when operating at low temperature. There are two reasons for the margin in DRAM timing parameters: process variation and temperature dependence. Let me introduce the first factor, process variation.

15 DRAM Cells are Not Equal
Ideal vs. Real. Ideal: cells have the same size, the same charge, and the same latency. Real: cells range from the smallest to the largest, with different sizes (large variation in cell size), different charge (large variation in charge), and different latency (large variation in access latency). Ideally, DRAM would have uniform cells of the same size and therefore the same access latency. Unfortunately, due to process variation, each cell has a different size and therefore a different access latency, so DRAM shows large variation in both the charge in its cells and the access latency to its cells.

16 Process Variation DRAM Cell Small cell can store small charge
❶ Cell Capacitance ❷ Contact Resistance ❸ Transistor Performance. A small cell can store only a small amount of charge. [Figure: a DRAM cell consisting of a capacitor, a bitline contact, and an access transistor.] Small cell capacitance, high contact resistance, and a slow access transistor all lead to high access latency.

17 Two Reasons for Timing Margin
1. Process Variation: DRAM cells are not equal. Leads to extra timing margin for a cell that can store a large amount of charge. 2. Temperature Dependence: DRAM leaks more charge at higher temperature. Leads to extra timing margin for cells that operate at low temperature. Let me introduce the second factor, temperature dependence.

18 Charge Leakage ∝ Temperature
Room temperature vs. hot temperature (85°C): small leakage vs. large leakage. DRAM leakage increases the variation. DRAM cells lose charge over time and lose more charge at higher temperature. Therefore, a DRAM cell holds the least charge when operating at a hot temperature, for example 85°C, the highest temperature guaranteed by the DRAM standard. This temperature dependence increases the variation in DRAM cell charge: cells store little charge at high temperature and more charge at low temperature, leading to large variation in access latency.

19 DRAM Timing Parameters
DRAM timing parameters are dictated by the worst case: the smallest cell with the smallest charge across all DRAM products, operating at the highest temperature. This leaves a large timing margin for the common case.

20 Can lower latency for the common-case
Our Approach: we optimize DRAM timing parameters for the common case, i.e., the smallest cell with the smallest charge in a given DRAM module, operating at the current temperature. The common-case cell has more charge than the worst-case cell, so we can lower the latency for the common case.

21 2. Reasons for Timing Margin in DRAM
1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

22 Key Observations 1. Sensing 2. Restore 3. Precharge
1. Sensing: sense cells with extra charge faster → lower sensing latency. 2. Restore: no need to fully restore cells with extra charge → lower restore latency. 3. Precharge: no need to fully precharge bitlines for cells with extra charge → lower precharge latency. Remember that there are three steps in DRAM operation: sensing, restore, and precharge. In all three steps, the extra charge enables a potential timing parameter reduction. Let me introduce these three scenarios in order.

23 More charge → Faster sensing
Observation 1. Faster Sensing. Typical DIMM at low temperature: timing (tRCD) 17% ↓ with no errors (115 DIMM characterization). More charge → strong charge flow → faster sensing. The first step in DRAM operation is sensing. In the typical case, there is more charge in the DRAM cell than in the worst case. This extra charge drives the sense-amplifier more strongly, so the sense-amplifier can detect the cell's data faster and complete the sensing operation sooner. In our DRAM latency profiling of 115 DRAM modules, we observe a potential 17% reduction in the timing parameter corresponding to sensing. As a result: typical DIMM at low temperature → more charge → faster sensing.

24 More charge → Restore time reduction
Observation 2. Reducing Restore Time. Typical DIMM at low temperature: read (tRAS) 37% ↓, write (tWR) 54% ↓, with no errors (115 DIMM characterization). Larger cell & less leakage → extra charge → no need to fully restore the charge. The second step in DRAM operation is restore. Due to their larger size and lower leakage, typical DRAM cells hold more charge than the worst-case cell. Even if a typical cell does not restore this extra charge during the restore operation, it still ends up with more charge than the worst-case cell, guaranteeing reliable operation. By reducing the restore time for reads and writes, our DRAM latency characterization shows significant potential to reduce the corresponding timing parameters: 37% for reads and 54% for writes. As a result: typical DIMM at lower temperature → more charge → restore time reduction.

25 Typical DIMM at Lower Temperature
Observation 3. Reducing Precharge Time. Typical DIMM at lower temperature. [Figure: a bitline connected to a sense-amplifier, with charge levels empty (0V), half, and full (Vdd), during sensing and precharge.] The third step is precharge. Previously we described precharge simply as going back to the original state; let me give more detail. To detect the data in a DRAM cell, DRAM connects the sense-amplifier to the cell through the bitline. DRAM stores a bit of data as one of two charge states, empty or fully charged, and during sensing and restore the bitline moves to one of these two states. After the access finishes, the bitline must return to the halfway point between the two states; we call this precharge. Unfortunately, the bitline is long, connecting hundreds of cells, so precharge takes a long time. Precharge: setting the bitline to the half-full charge level.

26 More charge → Precharge time reduction
Observation 3. Reducing Precharge Time. Access a full cell after accessing an empty cell: timing (tRP) 35% ↓ with no errors (115 DIMM characterization). Not fully precharged, but more charge → strong sensing. Let us look at the impact of reducing the precharge time with a simple example. The sense-amplifier first accesses an empty cell, and because we reduce the corresponding timing parameter, the bitline is not fully precharged. DRAM then accesses a fully charged cell. Due to the bias remaining on the bitline, it takes longer to access the fully charged cell. However, if the cell has enough extra charge, the strong charge flow overcomes the bias on the bitline, leading to a reliable access. Our DRAM latency evaluation shows that in the typical case we can reduce the corresponding timing parameter by 35% without errors. Therefore: typical DIMM at lower temperature → more charge → precharge time reduction.

27 Key Observations 1. Sensing 2. Restore 3. Precharge
1. Sensing: sense cells with extra charge faster → lower sensing latency. 2. Restore: no need to fully restore cells with extra charge → lower restore latency. 3. Precharge: no need to fully precharge bitlines for cells with extra charge → lower precharge latency.

28 2. Reasons for Timing Margin in DRAM
1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

29 Adaptive-Latency DRAM
Key idea: optimize DRAM timing parameters online. Two components: (1) the DRAM manufacturer profiles multiple sets of reliable DRAM timing parameters at different temperatures for each DIMM; (2) the system monitors the DRAM temperature and uses the appropriate DRAM timing parameters.
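A minimal sketch of the second component, assuming a hypothetical per-DIMM table of timing sets profiled by the manufacturer; the field names, temperature thresholds, and cycle counts below are illustrative placeholders, not values from the paper.

```c
#include <stdio.h>

/* Hypothetical per-temperature timing set, profiled per DIMM. */
struct timing_set {
    int max_temp_c;                /* valid up to this DRAM temperature */
    int tRCD, tRP, tRAS, tWR;      /* in memory-controller clock cycles */
};

/* Illustrative table: tighter timings allowed at lower temperatures,
 * falling back to the DRAM-standard values at the 85 C worst case. */
static const struct timing_set table[] = {
    { 55,  8,  8, 19,  8 },   /* assumed reduced set for cool operation */
    { 70,  9,  9, 23, 10 },   /* assumed intermediate set */
    { 85, 11, 11, 28, 12 },   /* standard worst-case set */
};

static const struct timing_set *pick_timings(int dram_temp_c) {
    for (unsigned i = 0; i < sizeof(table) / sizeof(table[0]); i++)
        if (dram_temp_c <= table[i].max_temp_c)
            return &table[i];
    return &table[sizeof(table) / sizeof(table[0]) - 1]; /* clamp to worst case */
}

int main(void) {
    int temp = 34;  /* e.g., a server-cluster DRAM temperature */
    const struct timing_set *t = pick_timings(temp);
    printf("at %d C: tRCD=%d tRP=%d tRAS=%d tWR=%d\n",
           temp, t->tRCD, t->tRP, t->tRAS, t->tWR);
    return 0;
}
```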

30 2. Reasons for Timing Margin in DRAM
1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

31 DRAM operates at low temperatures in the common-case
DRAM Temperature DRAM temperature measurement Server cluster: Operates at under 34°C Desktop: Operates at under 50°C DRAM standard optimized for 85°C DRAM operates at low temperatures in the common-case Previous works – DRAM temperature is low El-Sayed+ SIGMETRICS 2012 Liu+ ISCA 2007 Previous works – Maintain DRAM temperature low David+ ICAC 2011 Liu+ ISCA 2007 Zhu+ ITHERM 2008

32 DRAM Testing Infrastructure
Temperature controller, heater, FPGAs, and PC. This is a photo of our infrastructure, which we built mostly from scratch. For higher throughput, we employ eight FPGAs, all of which are enclosed in a thermally regulated environment.

33 Control Factors Timing parameters Temperature: 55 – 85°C
Sensing: tRCD. Restore: tRAS (read), tWR (write). Precharge: tRP. Temperature: 55 – 85°C. Refresh interval: 64 – 512 ms (a longer refresh interval leads to less charge; the standard refresh interval is 64 ms).
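A sketch of what a characterization sweep over these control factors could look like as test-harness code; the set_temperature, set_refresh_interval, set_timings, and test_dimm functions are hypothetical stand-ins for the FPGA infrastructure (stubbed here so the sketch compiles), and the timing step sizes are illustrative.

```c
#include <stdio.h>

/* Stubs standing in for the FPGA-based test infrastructure
 * (hypothetical interface, not the authors' actual code). */
static void set_temperature(int celsius)      { (void)celsius; }
static void set_refresh_interval(int ms)      { (void)ms; }
static void set_timings(int tRCD, int tRAS, int tWR, int tRP)
{ (void)tRCD; (void)tRAS; (void)tWR; (void)tRP; }
static long test_dimm(int dimm_id)            { (void)dimm_id; return 0; }

/* Sweep the control factors listed on the slide and record errors.
 * Only the structure matters; the timing steps are illustrative. */
static void characterize(int dimm_id) {
    static const int temps_c[]    = { 55, 65, 75, 85 };
    static const int refresh_ms[] = { 64, 128, 256, 512 };

    for (unsigned t = 0; t < 4; t++) {
        set_temperature(temps_c[t]);
        for (unsigned r = 0; r < 4; r++) {
            set_refresh_interval(refresh_ms[r]);
            for (int step = 0; step <= 3; step++) {   /* shrink timings stepwise */
                set_timings(11 - step, 28 - 3 * step, 12 - step, 11 - step);
                printf("T=%dC refresh=%dms step=%d errors=%ld\n",
                       temps_c[t], refresh_ms[r], step, test_dimm(dimm_id));
            }
        }
    }
}

int main(void) { characterize(0); return 0; }
```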

34 Timings ↔ Charge
1. Timings ↔ Charge. Temperature: 85°C; refresh interval: 64, 128, 256, 512 ms. [Figure: number of errors (log scale, 10 to 10^5) as the timing parameters for sensing, restore (read), restore (write), and precharge are reduced, for each refresh interval.] More charge enables more timing parameter reduction.

35 Timings ↔ Temperature
2. Timings ↔ Temperature. Temperature: 55, 65, 75, 85°C; refresh interval: 512 ms. [Figure: number of errors (log scale, 10 to 10^5) as the timing parameters for sensing, restore (read), restore (write), and precharge are reduced, for each temperature.] Lower temperature enables more timing parameter reduction.

36 3. Summary of 115 DIMMs Latency reduction for read & write (55°C)
Read Latency: 32.7% Write Latency: 55.1% Latency reduction for each timing parameter (55°C) Sensing: 17.3% Restore: 37.3% (read), 54.8% (write) Precharge: 35.2%

37 2. Reasons for Timing Margin in DRAM
1. DRAM Operation Basics 2. Reasons for Timing Margin in DRAM 3. Key Observations 4. Adaptive-Latency DRAM 5. DRAM Characterization 6. Real System Performance Evaluation

38 Real System Evaluation Method
CPU: AMD 4386 (8 cores, 3.1 GHz, 8 MB LLC). DRAM: 4 GByte DDR3-1600 (800 MHz clock). OS: Linux. Storage: 128 GByte SSD. Workloads: 35 applications from SPEC, STREAM, Parsec, Memcached, Apache, GUPS.

39 Performance Improvement
Single-Core Evaluation. [Chart: performance improvement for each of the 35 workloads; average improvements of 5.0%, 1.4%, and 6.7% are marked for different workload groups, including the all-35-workload average.] AL-DRAM improves performance on a real system.

40 Performance Improvement multi-programmed & multi-threaded workloads
Multi-Core Evaluation. [Chart: performance improvement for each of the 35 workloads; average improvements of 10.4%, 14.0%, and 2.9% are marked for different workload groups, including the all-35-workload average.] AL-DRAM provides higher performance for multi-programmed and multi-threaded workloads.

41 Conclusion Observations Idea: Adaptive-Latency DRAM
DRAM timing parameters are dictated by the worst-case cell (the smallest cell across all products at the highest temperature). DRAM operates at lower temperatures than the worst case. Idea: Adaptive-Latency DRAM optimizes DRAM timing parameters for the common case (a typical DIMM operating at low temperatures). Analysis: characterization of 115 DIMMs shows great potential to lower DRAM timing parameters (17 – 54%) without any errors. Real-system performance evaluation shows a significant performance improvement (14% for memory-intensive workloads) without errors (33 days). Let me conclude: DRAM timing parameters are dictated by the worst-case cell. We propose Adaptive-Latency DRAM, which optimizes DRAM timing parameters for the common case. Our characterization shows significant potential to lower DRAM timing parameters, and our real-system evaluation shows that our mechanism provides a significant performance improvement without introducing errors.

42 SOLUTIONS TO THE DRAM SCALING PROBLEM
Two potential solutions Tolerate DRAM (by taking a fresh look at it) Enable emerging memory technologies to eliminate/minimize DRAM Do both Hybrid memory systems

43 SOLUTION 2: EMERGING MEMORY TECHNOLOGIES
Some emerging resistive memory technologies seem more scalable than DRAM (and they are non-volatile) Example: Phase Change Memory Expected to scale to 9nm (2022 [ITRS]) Expected to be denser than DRAM: can store multiple bits/cell But, emerging technologies have shortcomings as well Can they be enabled to replace/augment/surpass DRAM? Even if DRAM continues to scale there could be benefits to examining and enabling these new technologies.

44 HYBRID MEMORY SYSTEMS CPU
Meza+, “Enabling Efficient and Scalable Hybrid Memories,” IEEE Comp. Arch. Letters, 2012. Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012 Best Paper Award. DRAMCtrl PCM Ctrl DRAM Phase Change Memory (or Tech. X) Fast, durable Small, leaky, volatile, high-cost Large, non-volatile, low-cost Slow, wears out, high active energy Hardware/software manage data allocation and movement to achieve the best of multiple technologies

45 THE PROMISE OF EMERGING TECHNOLOGIES
Likely need to replace/augment DRAM with a technology that is Technology scalable And at least similarly efficient, high performance, and fault-tolerant or can be architected to be so Some emerging resistive memory technologies appear promising Phase Change Memory (PCM)? Spin Torque Transfer Magnetic Memory (STT-MRAM)? Memristors? And, maybe there are other ones Can they be enabled to replace/augment/surpass DRAM?

46 CHARGE VS. RESISTIVE MEMORIES
Charge Memory (e.g., DRAM, Flash) Write data by capturing charge Q Read data by detecting voltage V Resistive Memory (e.g., PCM, STT-MRAM, memristors) Write data by pulsing current dQ/dt Read data by detecting resistance R

47 LIMITS OF CHARGE MEMORY
Difficult charge placement and control Flash: floating gate charge DRAM: capacitor charge, transistor leakage Reliable sensing becomes difficult as charge storage unit size reduces

48 EMERGING RESISTIVE MEMORY TECHNOLOGIES
PCM Inject current to change material phase Resistance determined by phase STT-MRAM Inject current to change magnet polarity Resistance determined by polarity Memristors/RRAM/ReRAM Inject current to change atomic structure Resistance determined by atom distance

49 WHAT IS PHASE CHANGE MEMORY?
Phase change material (chalcogenide glass) exists in two states. Amorphous: low optical reflectivity and high electrical resistivity. Crystalline: high optical reflectivity and low electrical resistivity. PCM is a resistive memory: high resistance (0), low resistance (1). A PCM cell can be switched between states reliably and quickly.

50 HOW DOES PCM WORK? Write: change phase via current injection
SET: sustained current heats the cell above Tcryst → crystalline, low resistance. RESET: a large current pulse heats the cell above Tmelt and it is quenched → amorphous, high resistance. Read: detect the phase via the material's resistance (amorphous vs. crystalline). [Figure: a PCM cell consisting of an access device and a memory element.] Photo courtesy: Bipin Rajendran, IBM. Slide courtesy: Moinuddin Qureshi, IBM.
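A toy model of the two write operations and the resistance-based read, just to make the state transitions concrete; the resistance values and the threshold are invented for the example, and the large-vs-small current distinction is only reflected in the comments.

```c
#include <stdio.h>

/* Toy PCM cell model (illustrative only). */
enum phase { AMORPHOUS, CRYSTALLINE };

struct pcm_cell {
    enum phase state;
    long writes;            /* track wear: each write is a phase change */
};

/* RESET: large current pulse, melt and quench -> amorphous (logical 0). */
static void pcm_reset(struct pcm_cell *c) { c->state = AMORPHOUS;   c->writes++; }

/* SET: smaller, sustained current above Tcryst -> crystalline (logical 1). */
static void pcm_set(struct pcm_cell *c)   { c->state = CRYSTALLINE; c->writes++; }

/* Read: sense resistance; high resistance = 0, low resistance = 1. */
static int pcm_read(const struct pcm_cell *c) {
    double resistance_ohm = (c->state == AMORPHOUS) ? 1e6 : 1e3; /* made-up values */
    return resistance_ohm < 1e5;   /* low resistance -> logical 1 */
}

int main(void) {
    struct pcm_cell cell = { AMORPHOUS, 0 };
    pcm_set(&cell);
    printf("after SET:   read %d\n", pcm_read(&cell));
    pcm_reset(&cell);
    printf("after RESET: read %d (writes so far: %ld)\n",
           pcm_read(&cell), cell.writes);
    return 0;
}
```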

51 OPPORTUNITY: PCM ADVANTAGES
Scales better than DRAM, Flash Requires current pulses, which scale linearly with feature size Expected to scale to 9nm (2022 [ITRS]) Prototyped at 20nm (Raoux+, IBM JRD 2008) Can be denser than DRAM Can store multiple bits per cell due to large resistance range Prototypes with 2 bits/cell in ISSCC’08, 4 bits/cell by 2012 Non-volatile Retain data for >10 years at 85C No refresh needed, low idle power
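The multiple-bits-per-cell point relies on the wide resistance range: the range is split into several bands, each encoding a different bit pattern. Below is a small illustrative sketch of such a 2-bit decode; the band boundaries are invented for the example.

```c
#include <stdio.h>

/* Illustrative 2-bit multi-level-cell decode: the wide PCM resistance
 * range is split into four bands, each mapped to a 2-bit value.
 * The thresholds below are made up for the example. */
static unsigned decode_mlc(double resistance_ohm) {
    if (resistance_ohm < 5e3)  return 0x3;  /* fully crystalline */
    if (resistance_ohm < 5e4)  return 0x2;  /* mostly crystalline */
    if (resistance_ohm < 5e5)  return 0x1;  /* mostly amorphous   */
    return 0x0;                             /* fully amorphous    */
}

int main(void) {
    double samples[] = { 2e3, 2e4, 2e5, 2e6 };
    for (int i = 0; i < 4; i++)
        printf("R = %.0e ohm -> bits %u%u\n", samples[i],
               (decode_mlc(samples[i]) >> 1) & 1, decode_mlc(samples[i]) & 1);
    return 0;
}
```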

52 PHASE CHANGE MEMORY PROPERTIES
Surveyed prototypes from (ITRS, IEDM, VLSI, ISSCC) Derived PCM parameters for F=90nm Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.

53 PHASE CHANGE MEMORY PROPERTIES: LATENCY
Latency is comparable to, but slower than, DRAM. Read latency 50 ns: 4x DRAM, 10^-3x NAND Flash. Write latency 150 ns: 12x DRAM. Write bandwidth 5-10 MB/s: 0.1x DRAM, 1x NAND Flash.

54 PHASE CHANGE MEMORY PROPERTIES
Dynamic energy: 40 uA read, 150 uA write; 2-43x DRAM, 1x NAND Flash. Endurance: writes induce a phase change at 650°C, and contacts degrade from thermal expansion/contraction; 10^8 writes per cell; 10^-8x DRAM, 10^3x NAND Flash. Cell size: 9-12F^2 using BJT, single-level cells; 1.5x DRAM, 2-3x NAND (will scale with feature size).
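To get a feel for what a 10^8-write endurance implies, here is a back-of-the-envelope lifetime sketch under invented assumptions (a 16 GB PCM main memory, 1 GB/s of sustained 64-byte writes, and either ideal wear leveling or a 1 MB hot set absorbing every write); these numbers are illustrative, not a lifetime model from the slides.

```c
#include <stdio.h>

int main(void) {
    /* All numbers below are illustrative assumptions, not measured values. */
    const double endurance      = 1e8;              /* writes per cell */
    const double capacity_bytes = 16.0 * (1 << 30); /* 16 GB PCM main memory */
    const double write_rate_bps = 1.0e9;            /* 1 GB/s of sustained writes */
    const double line_bytes     = 64.0;             /* writes land in 64 B lines */

    /* Ideal wear leveling: every line absorbs an equal share of writes. */
    double lines        = capacity_bytes / line_bytes;
    double writes_per_s = write_rate_bps / line_bytes;
    double ideal_hours  = (lines * endurance / writes_per_s) / 3600.0;

    /* No wear leveling: assume a 1 MB hot region takes every write. */
    double hot_lines    = (1 << 20) / line_bytes;
    double worst_hours  = (hot_lines * endurance / writes_per_s) / 3600.0;

    printf("ideal wear leveling:            ~%.0f hours (~%.1f years)\n",
           ideal_hours, ideal_hours / (24 * 365));
    printf("1 MB hot set, no wear leveling: ~%.1f hours\n", worst_hours);
    return 0;
}
```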

55 PHASE CHANGE MEMORY: PROS AND CONS
Pros over DRAM: better technology scaling, non-volatility, low idle power (no refresh). Cons: higher latencies, ~4-15x DRAM (especially write); higher active energy, ~2-50x DRAM (especially write); lower endurance (a cell dies after ~10^8 writes); reliability issues (resistance drift). Challenges in enabling PCM as a DRAM replacement/helper: mitigate PCM shortcomings, find the right way to place PCM in the system, and ensure secure and fault-tolerant PCM operation.

56 PCM-BASED MAIN MEMORY: SOME QUESTIONS
Where to place PCM in the memory hierarchy? Hybrid OS controlled PCM-DRAM Hybrid OS controlled PCM and hardware-controlled DRAM Pure PCM main memory How to mitigate shortcomings of PCM? How to take advantage of (byte-addressable and fast) non-volatile main memory?

57 PCM-BASED MAIN MEMORY (I)
How should PCM-based (main) memory be organized? Hybrid PCM+DRAM [Qureshi+ ISCA’09, Dhiman+ DAC’09, Meza+ IEEE CAL’12]: How to partition/migrate data between PCM and DRAM

58 HYBRID MEMORY SYSTEMS: CHALLENGES
Partitioning Should DRAM be a cache or main memory, or configurable? What fraction? How many controllers? Data allocation/movement (energy, performance, lifetime) Who manages allocation/movement? What are good control algorithms? How do we prevent degradation of service due to wearout? Design of cache hierarchy, memory controllers, OS Mitigate PCM shortcomings, exploit PCM advantages Design of PCM/DRAM chips and modules Rethink the design of PCM/DRAM with new requirements
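To make the data allocation/movement question concrete, here is a toy policy that counts accesses per page and migrates hot pages to DRAM while the rest stay in PCM; this is a simplified illustration for discussion, not the caching policy from the cited papers, and the thresholds and sizes are invented.

```c
#include <stdio.h>

#define NPAGES 8          /* toy address space of 8 pages */
#define DRAM_SLOTS 2      /* DRAM can hold only 2 of them */
#define HOT_THRESHOLD 4   /* accesses before a page is considered hot */

/* Toy hybrid-memory placement: hot pages migrate to DRAM, cold pages
 * stay in PCM. Illustrative only. */
static int access_count[NPAGES];
static int in_dram[NPAGES];
static int dram_used;

static void access_page(int page) {
    access_count[page]++;
    if (!in_dram[page] && access_count[page] >= HOT_THRESHOLD) {
        if (dram_used < DRAM_SLOTS) {          /* room in DRAM: migrate */
            in_dram[page] = 1;
            dram_used++;
            printf("migrate page %d to DRAM\n", page);
        }
        /* else: a real policy would evict the coldest DRAM page here */
    }
}

int main(void) {
    int trace[] = { 0, 0, 0, 0, 3, 5, 5, 5, 5, 1 };  /* made-up access trace */
    for (unsigned i = 0; i < sizeof(trace) / sizeof(trace[0]); i++)
        access_page(trace[i]);
    for (int p = 0; p < NPAGES; p++)
        printf("page %d: %d accesses, %s\n", p, access_count[p],
               in_dram[p] ? "DRAM" : "PCM");
    return 0;
}
```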

59 PCM-BASED MAIN MEMORY (II)
How should PCM-based (main) memory be organized? Pure PCM main memory [Lee et al., ISCA’09, Top Picks’10]: How to redesign entire hierarchy (and cores) to overcome PCM shortcomings

60 ASIDE: STT-RAM BASICS Magnetic Tunnel Junction (MTJ) Cell
Reference layer: fixed magnetic orientation. Free layer: parallel or anti-parallel to the reference layer. Cell: an MTJ plus an access transistor and bit/sense lines. Read: apply a small voltage across the bitline and sense line and sense the current. Write: push a large current through the MTJ; the direction of the current determines the new orientation of the free layer. Kultursay et al., "Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative," ISPASS 2013. [Figure: an MTJ (reference layer, barrier, free layer) encoding logical 0 or 1, connected through an access transistor to the word line, bit line, and sense line.] Moving from DRAM to STT-RAM, instead of a capacitor that stores charge, we now have a magnetic tunnel junction with a reference layer that has a fixed magnetic orientation and a free layer that can be parallel or anti-parallel to it; this determines the resistance and, in turn, the data stored in the MTJ. In addition to the MTJ, the cell again has an access transistor. Reading requires a small voltage applied across the bitline and sense line; the resulting current, which varies with the MTJ's resistance, is sensed. To write data, we push a large current through the MTJ, which re-aligns the free layer; the direction of the current determines whether it becomes parallel or anti-parallel, i.e., logical 0 or 1.

61 ASIDE: STT MRAM: PROS AND CONS
Pros over DRAM Better technology scaling Non volatility Low idle power (no refresh) Cons Higher write latency Higher write energy Reliability? Another level of freedom Can trade off non-volatility for lower write latency/energy (by reducing the size of the MTJ)

62 AN INITIAL STUDY: REPLACE DRAM WITH PCM
Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009. Surveyed prototypes from (e.g. IEDM, VLSI, ISSCC) Derived “average” PCM parameters for F=90nm

63 Results: Naïve Replacement of DRAM with PCM
Replace DRAM with PCM in a 4-core, 4MB L2 system PCM organized the same as DRAM: row buffers, banks, peripherals 1.6x delay, 2.2x energy, 500-hour average lifetime Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.

64 ARCHITECTING PCM TO MITIGATE SHORTCOMINGS
Idea 1: Use multiple narrow row buffers in each PCM chip → reduces array reads/writes → better endurance, latency, and energy. Idea 2: Write into the array at cache-block or word granularity → reduces unnecessary wear. [Figure: DRAM vs. PCM array and row-buffer organization.]
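Idea 2 can be sketched as tracking which cache blocks in a row buffer are dirty and writing back only those, instead of the whole row; the row and block sizes below are arbitrary choices for the example, not parameters from the paper.

```c
#include <stdio.h>
#include <string.h>

#define ROW_BYTES   2048
#define BLOCK_BYTES 64
#define NBLOCKS     (ROW_BYTES / BLOCK_BYTES)

/* Sketch of write-back at cache-block granularity (Idea 2): only dirty
 * blocks are written to the PCM array, reducing unnecessary wear. */
struct row_buffer {
    unsigned char data[ROW_BYTES];
    unsigned char dirty[NBLOCKS];   /* one dirty bit per cache block */
};

static void write_block(struct row_buffer *rb, int block,
                        const unsigned char *src) {
    memcpy(&rb->data[block * BLOCK_BYTES], src, BLOCK_BYTES);
    rb->dirty[block] = 1;
}

/* Returns how many bytes actually get written into the PCM array. */
static int writeback(struct row_buffer *rb) {
    int written = 0;
    for (int b = 0; b < NBLOCKS; b++)
        if (rb->dirty[b]) {
            /* the actual PCM array write for block b would go here */
            rb->dirty[b] = 0;
            written += BLOCK_BYTES;
        }
    return written;
}

int main(void) {
    struct row_buffer rb = {0};
    unsigned char block[BLOCK_BYTES] = {0xAB};
    write_block(&rb, 3, block);
    write_block(&rb, 7, block);
    printf("wrote back %d of %d bytes\n", writeback(&rb), ROW_BYTES);
    return 0;
}
```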

65 RESULTS: ARCHITECTED PCM AS MAIN MEMORY
1.2x delay, 1.0x energy, 5.6-year average lifetime Scaling improves energy, endurance, density Caveat 1: Worst-case lifetime is much shorter (no guarantees) Caveat 2: Intensive applications see large performance and energy hits Caveat 3: Optimistic PCM parameters?

66 OTHER OPPORTUNITIES WITH EMERGING TECHNOLOGIES
Merging of memory and storage e.g., a single interface to manage all data New applications e.g., ultra-fast checkpoint and restore More robust system design e.g., reducing data loss Processing tightly-coupled with memory e.g., enabling efficient search and filtering

67 COORDINATED MEMORY AND STORAGE WITH NVM (I)
The traditional two-level storage model is a bottleneck with NVM. Volatile data in memory → a load/store interface. Persistent data in storage → a file system interface. Problem: operating system (OS) and file system (FS) code to locate, translate, and buffer data become performance and energy bottlenecks with fast NVM stores. [Figure: the two-level store, with labels Load/Store; fopen, fread, fwrite, ...; virtual memory; operating system and file system; processor and caches; address translation; persistent (e.g., phase-change) memory; storage (SSD/HDD); main memory.]

68 COORDINATED MEMORY AND STORAGE WITH NVM (I)
Goal: unify memory and storage management in a single unit to eliminate the wasted work to locate, transfer, and translate data. This improves both energy and performance, and simplifies the programming model as well. [Figure: unified memory/storage; the processor and caches issue loads/stores to a Persistent Memory Manager, which manages persistent (e.g., phase-change) memory and provides feedback.] Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013.

69 THE PERSISTENT MEMORY MANAGER (PMM)
Exposes a load/store interface to access persistent data: applications can directly access persistent memory, with no conversion, translation, or location overhead for persistent data. Manages data placement, location, persistence, and security, to get the best of multiple forms of storage. Manages metadata storage and retrieval; this can lead to overheads that need to be managed. Exposes hooks and interfaces for system software, to enable better data placement and management decisions. Meza+, "A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory," WEED 2013. Let's take a closer look at the high-level design goals of a Persistent Memory Manager, or PMM. The most noticeable feature of the PMM, from the system software perspective, is that it exposes a load/store interface to access all persistent data in the system. This means that applications can allocate portions of persistent memory just as they would allocate portions of volatile DRAM memory in today's systems. Of course, with a diverse array of devices managed by persistent memory, it is important for the PMM to leverage each of their strengths, and mitigate each of their weaknesses, as much as possible. To this end, the PMM is responsible for managing how volatile and persistent data are placed among devices, ensuring that data can be located efficiently, persistence guarantees are met, and any security rules are enforced. The PMM is also responsible for managing metadata storage and retrieval; we will later evaluate how such overheads affect system performance. In addition, to achieve the best performance and energy efficiency for the applications running on a system, the system software should provide hints that the hardware can use to make more efficient data placement decisions.

70 THE PERSISTENT MEMORY MANAGER (PMM)
Persistent objects. To tie everything together, here is how a persistent memory would operate. Code allocates persistent objects and accesses them using loads and stores, no different from ordinary memory in today's machines. Optionally, the software can provide hints about its access patterns or requirements. The Persistent Memory Manager then uses this access information and the software hints to allocate, locate, migrate, and access data among the heterogeneous array of devices present in the system, attempting to strike a balance between application-level guarantees, like persistence and security, and performance and energy efficiency. The PMM uses access and hint information to allocate, locate, migrate, and access data in the heterogeneous array of devices.
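A sketch of what this programming model could look like from application code, using a hypothetical pmm_alloc call as a stand-in for the PMM's allocation interface; the API is invented for illustration (the slide does not define one), and it is stubbed with calloc here so the sketch compiles and runs.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical PMM allocation call: returns memory that the Persistent
 * Memory Manager would keep persistent across runs. Stubbed with calloc
 * purely so this sketch is self-contained. */
static void *pmm_alloc(const char *name, size_t size) {
    (void)name;                    /* a real PMM would key persistence on this */
    return calloc(1, size);
}

struct counter { long value; };    /* a persistent object */

int main(void) {
    /* Allocate a persistent object and use plain loads/stores on it;
     * no fopen/fread/fwrite and no serialization step. */
    struct counter *c = pmm_alloc("app/run_counter", sizeof(*c));
    if (!c) return 1;
    c->value++;                    /* ordinary store to persistent data */
    printf("this program has run %ld time(s)\n", c->value);
    /* With a real PMM, the updated value would survive the next run. */
    free(c);
    return 0;
}
```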

71 ENERGY BENEFITS OF A SINGLE-LEVEL STORE
~24X ~5X Results for PostMark

72 ENERGY BENEFITS OF A SINGLE-LEVEL STORE
~16X ~5X Results for PostMark

73 ENABLING AND EXPLOITING NVM: ISSUES
Many issues and ideas, from the technology layer to the algorithms layer. Enabling NVM and hybrid memory: How to tolerate errors? How to enable secure operation? How to tolerate performance and power shortcomings? How to minimize cost? Exploiting emerging technologies: How to exploit non-volatility? How to minimize energy consumption? How to exploit NVM on chip? [Figure: the computing stack: problems, algorithms, programs, user, runtime system (VM, OS, MM), ISA, microarchitecture, logic, devices.]

74 THREE PRINCIPLES FOR (MEMORY) SCALING
Better cooperation between devices and the system Expose more information about devices to upper layers More flexible interfaces Better-than-worst-case design Do not optimize for the worst case Worst case should not determine the common case Heterogeneity in design (specialization, asymmetry) Enables a more efficient design (No one size fits all) These principles are related and sometimes coupled

75 SUMMARY OF EMERGING MEMORY TECHNOLOGIES
Key trends affecting main memory End of DRAM scaling (cost, capacity, efficiency) Need for high capacity Need for energy efficiency Emerging NVM technologies can help PCM or STT-MRAM more scalable than DRAM and non-volatile But, they have shortcomings: latency, active energy, endurance We need to enable promising NVM technologies by overcoming their shortcomings Many exciting opportunities to reinvent main memory at all layers of computing stack

76 Samira Khan University of Virginia Mar 29, 2016
COMPUTER ARCHITECTURE CS 6354 Emerging Memory Technologies Samira Khan University of Virginia Mar 29, 2016 The content and concept of this course are adapted from CMU ECE 740

