Fine-grained Power Budgeting to Improve Write Throughput of MLC PCM
International Symposium on Microarchitecture
Lei Jiang (1), Youtao Zhang (2), Bruce R. Childers (2) and Jun Yang (1)
(1) Electrical and Computer Engineering Department, (2) Computer Science Department
University of Pittsburgh, Pittsburgh

Slide 2: Phase Change Memory (PCM)
DRAM → PCM?
– # of cores (C#) ↑: ARM Cortex-A15, 4 cores; Intel Xeon, 8 cores; AMD Bulldozer, 16 cores
– Working set of a single thread (WSST) ↑, from small to large (e.g., MemCached)
– Memory capacity ↑ = C# x WSST
Figures are from the ARM, Intel, AMD, VoltDB, Memcached, MySQL and Samsung websites.

Slide 3: Multi-Level Cell and PCM write
– Multi-level cell: capacity ↑, cost-per-bit ↓
– Large resistance difference
– Write voltage is higher than Vdd
– Nondeterministic write
[Figure: write voltage over time, with pulses labeled V_verify, V_set,0, V_set,1, V_set,2, …]

Slide 4: Multi-Level Cell and PCM write
– Multi-level cell: capacity ↑, cost-per-bit ↓
– Large resistance difference
– Write voltage is higher than Vdd → more write power and energy
– Nondeterministic write → write is non-deterministic
[Figure: write voltage over time, with pulses labeled V_verify, V_set,0, V_set,1, V_set,2, …]

Slide 5: PCM DIMM and Chip Architecture
1. Bridge chip [FANG_PACT2011]: handles non-deterministic writes
   – Iteration Manager (IM): runs the iterative programming algorithm
2. Local Charge Pump (LCP): boosts voltage and current for writes

Slide 6: Power Constraint and Solution for SLC
DIMM level power constraint (DLPC) [HAY_MICRO’11]
– One DIMM supports only 560 concurrent RESETs (power tokens)
– That is about one 512-bit (64B) write at a time: poor write throughput
SLC power management (SPM) [HAY_MICRO’11]
– The memory controller (MC) approximately estimates the # of written cells in cache
– Power tokens are allocated based on the estimated number
– Tokens are reclaimed after a fixed write latency
– Can write ~8 64B lines concurrently (assuming a 15% cell changing rate)
[Figure: Ideal vs. SPM; SPM reaches approximately full write throughput]
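The token accounting described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the controller from [HAY_MICRO’11]: `TokenPool` and `lines_admitted` are hypothetical names, one token is assumed to cover one cell RESET, and the per-line estimate simply rounds 15% of 512 cells up.

```python
import math

DIMM_TOKENS = 560      # concurrent RESETs one DIMM supports (from the slide)
CELLS_PER_LINE = 512   # SLC: one cell per bit of a 64B line
CHANGE_RATE = 0.15     # assumed fraction of cells that actually change


class TokenPool:
    """Hypothetical DIMM-level power token pool."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.free = capacity

    def try_allocate(self, tokens):
        """Take tokens for a write if enough are free; report success."""
        if tokens > self.free:
            return False
        self.free -= tokens
        return True

    def reclaim(self, tokens):
        """Return tokens once the fixed write latency has elapsed."""
        self.free = min(self.capacity, self.free + tokens)


def lines_admitted(tokens_per_line, capacity=DIMM_TOKENS):
    """Count how many line writes fit in the pool at the same time."""
    pool = TokenPool(capacity)
    admitted = 0
    while pool.try_allocate(tokens_per_line):
        admitted += 1
    return admitted


# Budget every cell of the line (no estimate available).
worst_case = lines_admitted(CELLS_PER_LINE)
# Budget only the cells expected to change.
estimated = lines_admitted(math.ceil(CELLS_PER_LINE * CHANGE_RATE))
print(worst_case, estimated)
```

With these rounding assumptions the sketch admits one line under worst-case budgeting and seven under the 15% estimate, which is in the same range as the ~8 lines quoted on the slide.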

Slide 7: A Different Story on MLC
Higher power demand, but DLPC does not increase
– MLC has larger write power
– MLC needs a larger memory line size and LLC
– More cell changes, lower write throughput
Nondeterministic write on MLC
– Reclaim power tokens after a fixed latency?
– The worst case write latency must be used → power tokens are wasted
SPM does NOT work on MLC!
[Figure: Ideal vs. SPM on MLC, annotated 67%]

Slide 8: In Addition: Chip Level Power Constraint
Total # of cells written per chip is limited too
– Introduced by the local charge pump (LCP) [LEE_JSSCC’09]
– LCP power supply ability ∝ LCP area
– 15%-20% area overhead [CHOI_ISSCC’12]

Slide 9: DIMM and Chip Power Constraints Example
[Figure: two banks (bank 0, bank 1) spread over chips 0-2; each chip has a budget of 4 tokens and the DIMM has a budget of 12, dropping to 8 after WR-A is issued. WR-A targets bank 0 and WR-B targets bank 1; WR-B creates a hot chip.]
1. Write-A obeys both the DIMM and chip power constraints. It can go to bank 0.
2. Write-B violates the chip power constraint. It has to be stopped.
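The admission rule in this example can be sketched as a check against both budgets. The per-chip budget of 4 and the DIMM budget of 12 come from the slide; the per-chip demands of the two writes are hypothetical values chosen to reproduce the outcome (WR-A admitted, WR-B blocked by a hot chip), and `can_issue` and `issue` are illustrative names.

```python
def can_issue(demand, chip_free, dimm_free):
    """Return True if a write fits both the DIMM and every chip budget.

    demand[i] is the number of power tokens the write needs on chip i.
    """
    if sum(demand) > dimm_free:
        return False
    return all(need <= free for need, free in zip(demand, chip_free))


def issue(demand, chip_free, dimm_free):
    """Deduct an admitted write's tokens; return the new budgets."""
    new_chip_free = [free - need for need, free in zip(demand, chip_free)]
    return new_chip_free, dimm_free - sum(demand)


chip_free = [4, 4, 4]   # per-chip budgets, as on the slide
dimm_free = 12          # DIMM budget, as on the slide

wr_a = [2, 2, 0]        # hypothetical demand: 4 tokens over two chips
wr_b = [3, 0, 0]        # hypothetical demand: 3 tokens, all on chip 0

a_ok = can_issue(wr_a, chip_free, dimm_free)
chip_free, dimm_free = issue(wr_a, chip_free, dimm_free)
b_ok = can_issue(wr_b, chip_free, dimm_free)
print(a_ok, dimm_free, b_ok)
```

In this sketch WR-B is rejected even though 8 DIMM tokens remain, because chip 0 has only 2 tokens left.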

Slide 10: Performance with Both Power Constraints
DIMM and chip power constraints hurt write throughput and performance a lot!
[Figure: series labeled Ideal, SPM, DIMM and Chip, annotated 49%]

Slide 11: Simple Solutions?
Intra-line wear leveling [ZHOU_ISCA’09]
– Periodically shift N bytes for one line
Scheduling for power constraints
– Reorder writes
[Figure: a write queue WR-A, WR-B, WR-C, WR-D shown before and after each technique. Shifting bytes turns a conflict into no conflict (4x throughput). Reordering B and C, giving WR-A, WR-C, WR-B, WR-D, turns a conflict into no conflict (1.5x throughput).]

Slide 12: But They Do NOT Help
– PWL: intra-line wear leveling without overhead
– Scheduling: scheduling writes under both power constraints
– N x local: enlarging the local charge pump
[Figure: bars for DIMM+chip, PWL, Scheduling, 1.5xLocal, 2xLocal and DIMM only]
– PWL and Scheduling: no effect
– 1.5x local: no effect
– 2x local ≈ DIMM only case, but 100% overhead!

Slide 13: How to tackle the chip level power constraint?

Slide 14: Global Charge Pump
1. The GCP balances the power supply among chips
2. Power of GCP + LCPs ≤ DIMM level power constraint
3. Each sub-array is powered by either the GCP or an LCP, not both
4. Long wire → large resistance on the wire [OH_JSSC’06] → low efficiency
5. Tradeoff between power utilization and efficiency
[Figure: DIMM with the bridge chip (IM and GCP) and an LCP in each chip]
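One way to picture the utilization/efficiency tradeoff in points 2 to 5 is the simplified model below. It is an assumption-laden sketch, not the design from the talk: `dimm_tokens_drawn` is a hypothetical helper, a chip whose demand fits its LCP is served locally at full efficiency, and any other chip is served entirely by the GCP, whose losses are modeled by dividing the demand by the pump efficiency.

```python
def dimm_tokens_drawn(demand, lcp_free, gcp_efficiency):
    """Estimate DIMM-level tokens drawn by one write (simplified model).

    demand[i]   tokens the write needs on chip i
    lcp_free[i] tokens the local charge pump of chip i can still supply
    A chip is powered by its LCP or by the GCP, never by both.
    """
    if not 0.0 < gcp_efficiency <= 1.0:
        raise ValueError("gcp_efficiency must be in (0, 1]")
    total = 0.0
    for need, free in zip(demand, lcp_free):
        if need <= free:
            total += need                   # served locally by the LCP
        else:
            total += need / gcp_efficiency  # served by the GCP, with losses
    return total


lcp_free = [2, 2, 4]    # hypothetical remaining LCP budgets
demand = [3, 0, 0]      # hypothetical write that overloads chip 0
for eff in (1.0, 0.7, 0.5):
    print(eff, round(dimm_tokens_drawn(demand, lcp_free, eff), 2))
```

Under this model the write that a chip budget alone would block can proceed, but at 50% efficiency it draws twice as many DIMM-level tokens as at 100%.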

Slide 15: Global Charge Pump
– GCP + 50% eff. cancels the benefit of GCP!
– GCP + 100% eff. can relieve the chip level power constraint!

Slide 16: Cell Mapping
64B line = 256 cells
– Naïve Mapping (NE)
[Figure: cells 0 … 255 laid out across chips 0-7, with cells 0 … 31 placed together]
– Vertical Interleaving (VIM): Chip# = Cell# mod 8
[Figure: cells assigned to chips 0-7 in rotation]
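The two mappings on this slide can be compared with a short Python sketch. The VIM formula is taken from the slide; the naive mapping is assumed to place 32 consecutive cells on each chip, and `chip_load` is a hypothetical helper that counts changed cells per chip.

```python
from collections import Counter

CHIPS = 8
CELLS_PER_LINE = 256   # 64B line stored in 2-bit cells


def naive_chip(cell):
    """Naive mapping (assumed): 32 consecutive cells per chip."""
    return cell // (CELLS_PER_LINE // CHIPS)


def vim_chip(cell):
    """Vertical interleaving: Chip# = Cell# mod 8 (from the slide)."""
    return cell % CHIPS


def chip_load(changed_cells, mapping):
    """Number of changed cells that land on each chip."""
    counts = Counter(mapping(cell) for cell in changed_cells)
    return [counts.get(chip, 0) for chip in range(CHIPS)]


changed = range(16)    # hypothetical write changing 16 adjacent cells
print(chip_load(changed, naive_chip))
print(chip_load(changed, vim_chip))
```

For this burst of adjacent changes the naive mapping puts all 16 cells on chip 0, while VIM puts 2 on every chip.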

Slide 17: Can We Do Even Better?
Braided Interleaving (BIM): Chip# = (Cell# - Cell# / 16) mod 8
[Figure: cells 31 … 0 with their chip assignments across chips 0-7, extending to cell 255]
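To see what the braided formula changes, the sketch below compares it with vertical interleaving on a write whose changed cells are spaced 8 apart. Treating `Cell# / 16` as integer division is an assumption, and the access pattern and the helper `chip_load` are hypothetical.

```python
from collections import Counter

CHIPS = 8


def vim_chip(cell):
    """Vertical interleaving: Chip# = Cell# mod 8."""
    return cell % CHIPS


def bim_chip(cell):
    """Braided interleaving: Chip# = (Cell# - Cell# / 16) mod 8.

    The division is assumed to be integer division.
    """
    return (cell - cell // 16) % CHIPS


def chip_load(changed_cells, mapping):
    """Number of changed cells that land on each chip."""
    counts = Counter(mapping(cell) for cell in changed_cells)
    return [counts.get(chip, 0) for chip in range(CHIPS)]


# Hypothetical write that changes every 8th cell of a 256-cell line.
changed = range(0, 256, 8)
print(chip_load(changed, vim_chip))
print(chip_load(changed, bim_chip))
```

With VIM all 32 changed cells fall on chip 0; with BIM the same cells are spread as 4 per chip, because the chip assignment rotates by one every 16 cells.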

Slide 18: Effectiveness of Cell Mapping
– GCP + V/BIM + 70% eff. ≈ GCP + 100% eff.!
– GCP + V/BIM + 50% eff. > GCP + 70% eff.?

Slide 19: Can we utilize the DIMM level power budget much better?

Slide 20: Iteration Power Management
A: 50 cell changes; B: 60 cell changes
RESET: power 2, latency 1; SET: power 1, latency 2
Total: 80
[Figure: timelines for A (RESET 50, then SETs 40, 26, 12) and B (RESET 60, then SETs 36, 20, 12, 2)]
– Ideally: complete in 9 units of time
– SPM on MLC: B has to wait; complete in 16 units of time

Slide 21: Iteration Power Management
A: 50 cell changes; B: 60 cell changes
RESET: power 2, latency 1; SET: power 1, latency 2
Total: 80
[Figure: per-iteration power token use of A and B under the proposed IPM and under Multi RESET]
– Proposed IPM: complete in 12 units of time
– Multi RESET (MR): complete in 10 units of time
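The benefit of budgeting per iteration can be illustrated by comparing how many token-time units one write occupies under the two policies. This is a rough sketch rather than the scheduler from the talk: `token_time` is a hypothetical helper, the per-iteration values for write A are read from the slide-20 figure and treated as cell counts (an assumption), and one token is assumed to power one cell RESET or two cell SETs, following the 2:1 power ratio on the slide.

```python
RESET_TOKENS_PER_CELL = 1.0   # assumption: one token powers one cell RESET
SET_TOKENS_PER_CELL = 0.5     # assumption: SET draws half the RESET power
RESET_LATENCY = 1             # time units, from the slide
SET_LATENCY = 2               # time units, from the slide


def token_time(cells_per_iteration, per_iteration):
    """Token-time units held by one write.

    cells_per_iteration[0] is the RESET iteration; the rest are SETs.
    per_iteration=False holds the RESET-level allocation for the whole
    write; per_iteration=True holds only what each iteration needs.
    """
    peak = cells_per_iteration[0] * RESET_TOKENS_PER_CELL
    total = 0.0
    for index, cells in enumerate(cells_per_iteration):
        is_reset = index == 0
        latency = RESET_LATENCY if is_reset else SET_LATENCY
        if per_iteration:
            per_cell = RESET_TOKENS_PER_CELL if is_reset else SET_TOKENS_PER_CELL
            held = cells * per_cell
        else:
            held = peak
        total += held * latency
    return total


write_a = [50, 40, 26, 12]    # RESET, then three SET iterations
fixed = token_time(write_a, per_iteration=False)
iterative = token_time(write_a, per_iteration=True)
print(fixed, iterative)
```

Under these assumptions write A occupies 350 token-time units when its allocation is held for the whole write and 128 when tokens are released after each iteration; the difference is budget that another write such as B could use sooner.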

Slide 22: Experimental Methodology
In-order 8-core 4GHz CMP processor
– L1: private, 32KB instruction / 32KB data
– L2: private 2MB, 64B line
– L3: DRAM off-chip, private 32MB, 256B line
4GB 2-bit MLC PCM main memory
– One DIMM, single-rank, 8 banks
– R/W queue with 24 entries [HAY_MICRO’11]
– Reads first; writes are scheduled when there is NO read
– When the queue is full, a write burst issues all writes until the queue is empty
– RESET: 500 cycles, 300μA, 480μW
– SET: 1000 cycles, 150μA, 90μW
– MLC non-deterministic write model [QURESHI_HPCA’10]
Benchmarks
– SPEC2006, BioBench, MiBench and STREAM

Slide 23: Effectiveness of IPM
[Figure: results annotated x2.4, 76% and 86%]

Slide 24: Conclusions
Increasing # of cores and enlarging working sets
– Large and scalable main memory: MLC PCM
Two power restrictions on MLC PCM
– Limited DIMM level power constraint
– Small chip level power constraint
Global charge pump
– Overcomes the chip level power constraint
Iteration power management
– Better utilizes the DIMM level power budget
Our techniques achieve
– Write throughput ↑ by x2.4; performance ↑ by 76%
