International Symposium on Microarchitecture. Fine-grained Power Budgeting to Improve Write Throughput of MLC PCM. Lei Jiang¹, Youtao Zhang², Bruce R. Childers² and Jun Yang¹, University of Pittsburgh.

Presentation transcript:

Fine-grained Power Budgeting to Improve Write Throughput of MLC PCM. Lei Jiang¹, Youtao Zhang², Bruce R. Childers² and Jun Yang¹. ¹Electrical and Computer Engineering Department, ²Computer Science Department, University of Pittsburgh, Pittsburgh. International Symposium on Microarchitecture.

Phase Change Memory (PCM) (slide 2)
– The number of cores (C#) keeps rising: ARM Cortex-A15 (4 cores), Intel Xeon (8 cores), AMD Bulldozer (16 cores).
– The working set of a single thread (WSST) also keeps growing (e.g., Memcached).
– Required memory capacity = C# x WSST, so capacity must scale: can DRAM keep up, or is PCM the answer?
– Figures are from the ARM, Intel, AMD, VoltDB, Memcached, MySQL and Samsung websites.

Multi-Level Cell and PCM Write (slides 3-4)
– MLC raises capacity ↑ and lowers cost-per-bit ↓, but requires a large resistance difference between levels.
– Figure (voltage vs. time): programming applies SET pulses V set,0, V set,1, V set,2, …, each checked against V verify.
– Writes need a higher-than-Vdd write voltage, so they cost more write power and energy.
– Writes are non-deterministic: the number of programming iterations varies from cell to cell.

PCM DIMM and Chip Architecture 5 1Bridge Chip [FANG_PACT2011] : handles non-deterministic write Iteration Manager (IM): iterative programming algorithm 2Local Charge Pump (LCP): boosts voltage and current for writes

Power Constraint and Solution for SLC (slide 6)
DIMM-level power constraint (DLPC) [HAY_MICRO’11]:
– One DIMM supports only 560 concurrent RESETs (power tokens), roughly one 512-bit (64B) line write at a time: poor write throughput.
SLC power management (SPM) [HAY_MICRO’11]:
– The memory controller (MC) approximately estimates the number of cells each buffered write will change.
– Power tokens are allocated based on that estimate and reclaimed after a fixed write latency.
– SPM can write ~8 64B lines concurrently (assuming a 15% cell-change rate), achieving near-full (ideal) write throughput.
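The SPM token accounting above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's controller: the 560-token budget comes from the slide, while estimate_changed_cells, try_issue_write and the commented reclaim hook are hypothetical names.

```c
#include <stdbool.h>
#include <stdint.h>

#define DIMM_TOKENS 560           /* concurrent RESETs one DIMM can power */

static int free_tokens = DIMM_TOKENS;

/* MC-side estimate of how many cells a buffered line write will change
 * (SLC: one bit per cell, so count differing bits old vs. new). */
static int estimate_changed_cells(const uint8_t *old_line,
                                  const uint8_t *new_line, int bytes)
{
    int changed = 0;
    for (int i = 0; i < bytes; i++) {
        uint8_t diff = old_line[i] ^ new_line[i];
        while (diff) {            /* count set bits in the XOR */
            changed += diff & 1;
            diff >>= 1;
        }
    }
    return changed;
}

/* Issue a write only if enough tokens remain; tokens are reclaimed
 * after a fixed SLC write latency (event scheduling elided). */
static bool try_issue_write(int changed_cells)
{
    if (changed_cells > free_tokens)
        return false;             /* stall until tokens are reclaimed */
    free_tokens -= changed_cells;
    /* schedule_reclaim(changed_cells, SLC_WRITE_LATENCY); */
    return true;
}
```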

A Different Story on MLC (slide 7)
Power demand is higher, but the DLPC does not increase:
– MLC has larger write power.
– MLC needs larger memory lines and a larger LLC, so each write changes more cells: lower write throughput.
Non-deterministic writes on MLC:
– Can power tokens still be reclaimed after a fixed latency? Only the worst-case write latency is safe, so tokens stay reserved long after most writes finish and are wasted.
– SPM does NOT work on MLC: it loses 67% of the ideal write throughput (figure: Ideal vs. SPM on MLC).

In Addition: Chip-Level Power Constraint (slide 8)
– The total number of cells written per chip is limited too.
– The limit is introduced by the local charge pump (LCP) [LEE_JSSCC’09]: an LCP's power-supply ability is proportional to its area.
– LCPs already cost 15%-20% chip area overhead [CHOI_ISSCC’12].

DIMM and Chip Power Constraints Example (slide 9)
Figure: a DIMM whose banks stripe across chips, each chip with its own power budget; one chip is already hot.
1. Write-A (to bank 0) obeys both the DIMM and chip power constraints, so it can go to bank 0.
2. Write-B (to bank 1) violates the hot chip's power constraint, so it has to be stopped even though DIMM-level power remains.
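A hedged sketch of the two-level admission check this example implies; the 8-chip layout matches the deck's DIMM, but write_admissible, the token counters and the per-chip demand vector are illustrative assumptions.

```c
#include <stdbool.h>

#define NCHIPS 8

static int dimm_free;            /* remaining DIMM-level power tokens */
static int chip_free[NCHIPS];    /* remaining per-chip power tokens   */

/* A write is admissible only if its per-chip demand fits every chip's
 * budget AND its total demand fits the DIMM-level budget. */
static bool write_admissible(const int demand[NCHIPS])
{
    int total = 0;
    for (int c = 0; c < NCHIPS; c++) {
        if (demand[c] > chip_free[c])
            return false;        /* like WR-B: one hot chip blocks the write */
        total += demand[c];
    }
    return total <= dimm_free;   /* like WR-A: fits the DIMM budget too */
}
```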

Performance with Both Power Constraints (slide 10)
– Together, the DIMM-level and chip-level power constraints hurt write throughput and performance badly: a 49% loss vs. ideal (figure: Ideal vs. SPM with DIMM and chip constraints).

Simple Solutions? (slide 11)
Intra-line wear leveling [ZHOU_ISCA’09]:
– Periodically shift each line by N bytes, spreading cell changes across chips.
Scheduling for power constraints:
– Reorder queued writes (e.g., swap WR-B and WR-C) so conflicting writes no longer collide.
Figure: shifting bytes turns a conflict into no conflict (4x throughput); reordering B and C does the same (1.5x throughput).

But They Do NOT Help (slide 12)
Legend: PWL = intra-line wear leveling with no overhead; Scheduling = scheduling writes under both power constraints; N x local = enlarging the local charge pump N times.
– PWL: no effect. Scheduling: no effect.
– 1.5 x local: no effect. 2 x local: ≈ the DIMM-only case, but at 100% LCP area overhead!
(Figure compares DIMM+chip, PWL, Scheduling, 1.5xLocal and 2xLocal against the DIMM-only case.)

How to tackle the chip-level power constraint? (slide 13)

Global Charge Pump (slide 14)
1. The GCP balances power supply among chips.
2. Power of the GCP + LCPs ≤ the DIMM-level power constraint.
3. Each sub-array is powered by either the GCP or its LCP, not both.
4. The GCP's long wires add large wire resistance [OH_JSSC’06], so its efficiency is low.
5. There is a tradeoff between power utilization and pump efficiency.
(Figure: the bridge chip hosts the IM and GCP, feeding the LCPs across the DIMM.)
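A toy version of rules 2-4 as an admission function; the per-LCP budget (560/8 = 70), the name power_subarray and the ceiling-style efficiency adjustment are assumptions layered on the slide's rules, not the paper's circuit-level design.

```c
#include <stdbool.h>

#define DIMM_BUDGET 560   /* DIMM-level power tokens (from slide 6)     */
#define LCP_BUDGET   70   /* assumed per-chip LCP supply: 560 / 8 chips */

static int dimm_used;     /* tokens drawn by the GCP plus all LCPs      */

/* Power one sub-array's demand from exactly one source (rule 3),
 * preferring the efficient local pump and falling back to the GCP. */
static bool power_subarray(int demand, int *lcp_used, double gcp_eff)
{
    if (*lcp_used + demand <= LCP_BUDGET &&
        dimm_used + demand <= DIMM_BUDGET) {
        *lcp_used += demand;          /* local pump covers it */
        dimm_used += demand;
        return true;
    }
    /* Global pump: wire losses (rule 4) inflate the draw on the
     * shared DIMM budget (rule 2). */
    int gcp_draw = (int)(demand / gcp_eff + 0.999);
    if (dimm_used + gcp_draw <= DIMM_BUDGET) {
        dimm_used += gcp_draw;
        return true;
    }
    return false;                     /* the write must stall */
}
```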

Global Charge Pump (slide 15)
– A GCP at 100% efficiency can relieve the chip-level power constraint.
– At 50% efficiency, wire losses cancel the benefit of the GCP!

Cell Mapping (slide 16)
– A 64B line = 256 2-bit cells, striped across 8 chips.
– Naïve mapping (NE): consecutive cells (e.g., cells 0-31) map to the same chip.
– Vertical interleaving (VIM): Chip# = Cell# mod 8, so adjacent cells land on different chips.

Can We Do Even Better? (slide 17)
– Braided interleaving (BIM): Chip# = (Cell# − Cell#/16) mod 8 (integer division), which skews the interleaving pattern every 16 cells.
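The three mappings reduce to small index functions. VIM and BIM follow the formulas on slides 16-17; NE's exact formula is not shown, so the consecutive-cells version below (32 cells per chip) is an assumption, as is the chip_load helper used to compare how evenly each mapping spreads cell changes.

```c
#define NCHIPS 8
#define CELLS  256                /* one 64B line of 2-bit cells */

/* NE (assumed): 32 consecutive cells per chip. */
static int map_ne(int cell)  { return cell / (CELLS / NCHIPS); }
/* VIM: Chip# = Cell# mod 8 (from slide 16). */
static int map_vim(int cell) { return cell % NCHIPS; }
/* BIM: Chip# = (Cell# - Cell#/16) mod 8, integer division (slide 17). */
static int map_bim(int cell) { return (cell - cell / 16) % NCHIPS; }

/* Per-chip load for a set of changed cells under a given mapping. */
static void chip_load(int (*map)(int), const int *changed, int n,
                      int load[NCHIPS])
{
    for (int c = 0; c < NCHIPS; c++) load[c] = 0;
    for (int i = 0; i < n; i++) load[map(changed[i])]++;
}
```

A flatter chip_load histogram means a write is less likely to trip any single chip's power budget, which is why V/BIM lets a lower-efficiency GCP match a higher-efficiency one on the next slide.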

Effectiveness of Cell Mapping (slide 18)
– GCP + V/BIM at 70% efficiency ≈ GCP at 100% efficiency!
– GCP + V/BIM at 50% efficiency even beats GCP alone at 70% efficiency.

Can we utilize the DIMM-level power budget better? (slide 19)

Iteration Power Management (slide 20)
Example: write A changes 50 cells, write B changes 60 cells; RESET and SET iterations differ in power and latency.
– Ideally, A and B overlap their iterations and complete in 9 units of time.
– Under SPM on MLC, B must wait for A's worst-case power reservation, and the pair completes in 16 units of time.

Iteration Power Management (slide 21)
Proposed IPM: allocate and reclaim power tokens per programming iteration (each RESET or SET pulse) instead of per whole write.
– Same example (A: 50 cell changes, B: 60 cell changes): IPM completes both in 12 units of time.
– Adding the Multi-RESET (MR) optimization brings this down to 10 units of time.
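A minimal sketch of IPM's token flow, assuming the same DIMM-wide token pool as SPM; iteration_t, begin_iteration and end_iteration are hypothetical names, and real hardware would fold these checks into the iteration manager.

```c
#include <stdbool.h>

#define DIMM_TOKENS 560

/* One programming iteration (a RESET or SET pulse) with its own
 * power demand and latency, per the slide's power/latency table. */
typedef struct { int power; int latency; } iteration_t;

static int free_tokens = DIMM_TOKENS;

/* Grab tokens for just the next iteration of a write. */
static bool begin_iteration(const iteration_t *it)
{
    if (it->power > free_tokens)
        return false;         /* only this pulse waits, not the whole write */
    free_tokens -= it->power;
    return true;
}

/* Reclaim tokens as soon as the actual pulse finishes, rather than
 * after a worst-case whole-write latency as SPM must on MLC. */
static void end_iteration(const iteration_t *it)
{
    free_tokens += it->power;
}
```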

Experimental Methodology (slide 22)
In-order 8-core 4GHz CMP processor:
– L1: private, 32KB I / 32KB D.
– L2: private, 2MB, 64B lines.
– L3: off-chip DRAM, private, 32MB, 256B lines.
4GB 2-bit MLC PCM main memory:
– One DIMM, single rank, 8 banks.
– 24-entry read/write queue [HAY_MICRO’11]; reads first, writes scheduled when no read is pending.
– When the write queue fills, a write burst issues all writes until the queue is empty.
– RESET: 500 cycles, 300μA, 480μW. SET: 1000 cycles, 150μA, 90μW.
– MLC non-deterministic write model [QURESHI_HPCA’10].
Benchmarks: SPEC2006, BioBench, MiBench and STREAM.

Effectiveness of IPM (slide 23)
Results figure: write throughput improves by 2.4x and performance by 76% (one bar reaches 86%).

Conclusions (slide 24)
Growing core counts and working sets demand large, scalable main memory: MLC PCM.
Two power restrictions on MLC PCM:
– A limited DIMM-level power constraint.
– A small chip-level power constraint.
Global charge pump: overcomes the chip-level power constraint.
Iteration power management: better utilizes the DIMM-level power budget.
Our techniques improve write throughput by 2.4x and performance by 76%.
