
1 Writeback-Aware Bandwidth Partitioning for Multi-core Systems with PCM
Miao Zhou, Yu Du, Bruce Childers, Rami Melhem, Daniel Mossé (University of Pittsburgh)
http://www.cs.pitt.edu/PCM

2 Introduction
- DRAM memory is not energy efficient
  - Data centers are energy hungry
  - DRAM consumes 20-40% of that energy
- Apply PCM as main memory
  - Energy efficient, but slower reads, much slower writes, and a shorter lifetime
- Hybrid memory: add a DRAM cache in front of PCM
  - Improves performance (lower LLC miss rate, so fewer slow PCM reads)
  - Extends lifetime (lower LLC writeback rate, so fewer PCM writes)
- How do we manage the shared resources?
[Figure: 4-core CMP with private L1/L2 caches, a shared DRAM LLC, and PCM main memory]

3 Shared Resource Management
- CMP systems share resources: the last-level cache and the memory bandwidth
- Unmanaged resources cause interference and poor performance; partitioning the resources reduces interference and improves performance

                         DRAM main memory                Hybrid main memory
  Cache partitioning     UCP [Qureshi et al., MICRO-39]  WCP [Zhou et al., HiPEAC'12]
  Bandwidth partitioning RBP [Liu et al., HPCA'10]       This work

- Utility-based Cache Partitioning (UCP): tracks utility (LLC hits/misses) and minimizes overall LLC misses
- Read-only Bandwidth Partitioning (RBP): partitions the bus bandwidth based on LLC miss information
- Writeback-aware Cache Partitioning (WCP): tracks and minimizes LLC misses and writebacks
- Questions for this work:
  1. Is read-only (LLC miss) information enough?
  2. Is bus bandwidth still the bottleneck?
[Figure: 4-core CMP with private L1/L2 caches, a shared LLC, and main memory]

4 Bandwidth Partitioning
- An analytic model guides the run-time partitioning (a sketch of the resulting control loop follows):
  - Use queuing theory to model memory delay
  - Monitor performance to estimate the parameters of the model
  - Find the partition that maximizes the system's performance
  - Enforce the partition at run time
- DRAM vs. hybrid main memory: PCM writes are extremely slow and power hungry
- Issues specific to hybrid main memory:
  - Is the bottleneck the bus bandwidth or the device bandwidth?
  - Can we ignore the bandwidth consumed by LLC writebacks?
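A minimal sketch of how such an epoch-based control loop could be organized. This is only an illustration of the steps listed above, not the authors' implementation; the function names and the three hooks (sample_counters, apply_partition, solve_partition) are assumptions standing in for hardware counters, the bandwidth regulator, and the analytic-model solver described on the later slides.

```python
# Illustrative epoch-based partitioning loop (not the paper's code).

EPOCH_CYCLES = 5_000_000  # epoch length used in the talk (5 million cycles)

def partitioning_loop(num_cores, total_bandwidth, epochs,
                      sample_counters, apply_partition, solve_partition):
    """Run `epochs` partitioning epochs.

    sample_counters() -> (llc_misses, llc_writebacks): per-core counts for the epoch
    apply_partition(partition): program the bandwidth regulator
    solve_partition(miss_rates, wb_rates, total_bandwidth) -> next partition
    """
    partition = [total_bandwidth / num_cores] * num_cores  # start with an even split
    for _ in range(epochs):
        apply_partition(partition)                 # 1. enforce the current partition
        misses, writebacks = sample_counters()     # 2. monitor one epoch
        miss_rates = [m / EPOCH_CYCLES for m in misses]
        wb_rates = [w / EPOCH_CYCLES for w in writebacks]
        # 3. let the analytic model pick the partition for the next epoch
        partition = solve_partition(miss_rates, wb_rates, total_bandwidth)
    return partition
```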

5 Device Bandwidth Utilization
[Figure: device bandwidth utilization for a DRAM-only system vs. a hybrid DRAM + PCM system]
- DRAM memory: low device bandwidth utilization; memory reads (LLC misses) dominate
- Hybrid memory: high device bandwidth utilization; memory writes (LLC writebacks) often dominate

6 RBP on Hybrid Main Memory
RBP vs. SHARE (no partitioning):
1. RBP outperforms SHARE for workloads dominated by PCM reads (LLC misses)
2. RBP loses to SHARE for workloads dominated by PCM writes (LLC writebacks)
A new bandwidth partitioning scheme is necessary for hybrid memory.
[Figure: RBP vs. SHARE plotted against the percentage of device bandwidth consumed by PCM writes (LLC writebacks), from 10% to 90%]

7 Writeback-Aware Bandwidth Partitioning (WBP)
- Focuses on the collective bandwidth of the PCM devices
- Considers LLC writeback information
- Token bucket algorithm (see the sketch below):
  - Device service units = tokens
  - Tokens are allocated among applications every epoch (5 million cycles)
- Analytic model:
  - Maximizes weighted speedup
  - Models contention for bandwidth as queuing delay
  - Difficulty: a write is blocking only when the write queue is full
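A minimal token-bucket sketch of how per-application PCM service could be metered within an epoch. It illustrates the general token-bucket idea the slide names, not the paper's actual mechanism; the class, its parameters, and the per-request costs are assumptions.

```python
# Illustrative token-bucket regulator: each application gets a token budget per
# epoch; a PCM request is admitted only if its owner still has tokens.

class TokenBucketRegulator:
    def __init__(self, tokens_per_epoch):
        # tokens_per_epoch: tokens granted to each application for one epoch
        self.budget = list(tokens_per_epoch)
        self.remaining = list(tokens_per_epoch)

    def new_epoch(self, tokens_per_epoch=None):
        """Refill the buckets at an epoch boundary (optionally with a new partition)."""
        if tokens_per_epoch is not None:
            self.budget = list(tokens_per_epoch)
        self.remaining = list(self.budget)

    def try_issue(self, app_id, cost=1):
        """Admit a PCM request costing `cost` device service units, if possible."""
        if self.remaining[app_id] >= cost:
            self.remaining[app_id] -= cost
            return True
        return False  # request must wait until the next refill


# Example: 3 applications; writes cost more service units than reads on PCM.
reg = TokenBucketRegulator([1000, 500, 500])
assert reg.try_issue(0, cost=1)   # read from application 0
assert reg.try_issue(1, cost=8)   # write from application 1
reg.new_epoch([800, 700, 500])    # apply the next epoch's partition
```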

8 Analytic Model for Bandwidth Partitioning
- For a single core, an additive CPI formula:
  CPI = CPI_LLC∞ + LLC miss freq. * LLC miss penalty
  where CPI_LLC∞ is the CPI with an infinite LLC and the second term is the CPI due to LLC misses.
- The memory is modeled as a queue: requests arrive at the LLC miss rate λ_m and are served at the memory bandwidth α, so the miss penalty is the queuing delay (the time to serve a request).
- For a CMP, each core i has its own LLC miss rate λ_m,i and is allocated memory bandwidth α_i; the goal is to choose the α_i that maximize weighted speedup (written out below).
[Figure: queuing diagram with one queue per core, fed at rate λ_m,i and drained at rate α_i]
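One way to write the model out. The slide fixes only the additive CPI structure and the queuing view; the M/M/1 closed form for the delay, the CPI_alone baseline, and the exact weighted-speedup objective are assumptions used to make the formulation concrete.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% \lambda_{m,i}: LLC miss rate of core i;  \alpha_i: bandwidth allocated to core i.
% The M/M/1 delay and the weighted-speedup baseline are assumptions.
\begin{align*}
  \mathrm{CPI}_i(\alpha_i) &= \mathrm{CPI}_i^{\mathrm{LLC}\infty}
      + \lambda_{m,i}\, T_{\mathrm{mem}}(\alpha_i),
  \qquad T_{\mathrm{mem}}(\alpha_i) \approx \frac{1}{\alpha_i - \lambda_{m,i}} \\
  \max_{\alpha_1,\dots,\alpha_N}\;
      \sum_{i=1}^{N} \frac{\mathrm{CPI}_i^{\mathrm{alone}}}{\mathrm{CPI}_i(\alpha_i)}
  &\quad \text{subject to} \quad
      \sum_{i=1}^{N} \alpha_i \le \alpha_{\mathrm{total}},
  \quad \alpha_i > \lambda_{m,i}
\end{align*}
\end{document}
```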

9 Analytic Model for WBP
- Taking the LLC writebacks into account:
  CPI = CPI_LLC∞ + LLC miss freq. * LLC miss penalty + LLC writeback freq. * LLC writeback penalty
  where the writeback contribution is weighted by P, the probability that writebacks are on the critical path (a write is blocking only when the write queue is full).
- Each core i now has an LLC miss rate λ_m,i and an LLC writeback rate λ_w,i, and is allocated read memory bandwidth α_i and write memory bandwidth β_i; reads and writes queue separately (read queue RQ, write queue WQ).
- The goal is still to maximize weighted speedup (written out below); the new question is how to determine P.
[Figure: per-core read/write queues (RQ, WQ) fed at rates λ_m,i and λ_w,i and drained at rates α_i and β_i, with the writeback term scaled by P]
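Written out in the same style as the read-only model. Again, the slide fixes only the additive structure and the factor P; the delay terms T_r and T_w, the combined device-bandwidth constraint, and the baseline are assumptions.

```latex
\documentclass{article}
\usepackage{amsmath}
\begin{document}
% \lambda_{m,i}, \lambda_{w,i}: LLC miss and writeback rates of core i;
% \alpha_i, \beta_i: read and write bandwidth allocated to core i;
% P: probability that writebacks are on the critical path;
% B_{\mathrm{PCM}}: collective PCM device bandwidth (constraint form is an assumption).
\begin{align*}
  \mathrm{CPI}_i(\alpha_i,\beta_i) &= \mathrm{CPI}_i^{\mathrm{LLC}\infty}
      + \lambda_{m,i}\, T_r(\alpha_i)
      + P\,\lambda_{w,i}\, T_w(\beta_i) \\
  \max_{\{\alpha_i\},\{\beta_i\}}\;
      \sum_{i=1}^{N} \frac{\mathrm{CPI}_i^{\mathrm{alone}}}{\mathrm{CPI}_i(\alpha_i,\beta_i)}
  &\quad \text{subject to} \quad
      \sum_{i=1}^{N} \left(\alpha_i + \beta_i\right) \le B_{\mathrm{PCM}}
\end{align*}
\end{document}
```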

10 Dynamic Weight Adjustment (DWA)
- Chooses P based on the expected number of executed instructions (EEI)
- Each candidate weight p_1, ..., p_m is fed to WBP, which produces a candidate partition (its α's and β's); for each candidate, an expected EEI is estimated using the measured bandwidth-utilization ratios and the actual EEI of the last epoch, and one candidate is selected as the P for the next epoch (a sketch of one plausible selection rule follows)
- Bandwidth Utilization ratio (BU) = utilized bandwidth : allocated bandwidth
[Figure: candidate weights p_1..p_m each drive WBP to a candidate partition and an estimated EEI_1..EEI_m; together with the actual EEI and BU_1..BU_N, one candidate is chosen as P]
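A small sketch of this selection step under the reading above. The scoring rule (scale last epoch's instruction count by the predicted CPI change and keep the best candidate) and every helper here are assumptions; the slide only states that P is chosen from candidate weights based on EEI and the bandwidth-utilization ratios.

```python
# Hypothetical Dynamic Weight Adjustment step: evaluate candidate critical-path
# weights, estimate the expected executed instructions (EEI) for each, and keep
# the best one for the next epoch.

def estimate_eei(actual_instructions, cpi_now, cpi_predicted):
    """Scale last epoch's instruction count by the predicted CPI improvement."""
    return actual_instructions * (cpi_now / cpi_predicted)

def dynamic_weight_adjustment(candidates, solve_wbp, predict_cpi,
                              actual_instructions, cpi_now):
    """Return (best_p, best_partition) among the candidate weights.

    candidates:  iterable of candidate P values, e.g. [0.1, 0.3, ..., 0.9]
    solve_wbp:   p -> partition   (WBP's model-based solver, assumed given)
    predict_cpi: (partition, p) -> system CPI predicted by the analytic model
    """
    best = None
    for p in candidates:
        partition = solve_wbp(p)              # alpha/beta allocation for this weight
        cpi_pred = predict_cpi(partition, p)
        eei = estimate_eei(actual_instructions, cpi_now, cpi_pred)
        if best is None or eei > best[0]:
            best = (eei, p, partition)
    _, best_p, best_partition = best
    return best_p, best_partition
```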

11 Architecture Overview
- BUMon tracks the required information during an epoch
- DWA and WBP compute the bandwidth partition for the next epoch
- The Bandwidth Regulator enforces that configuration

12 Enforcing Bandwidth Partitioning

13 Simulation Setup
- Configuration:
  - 8-core CMP, 168-entry instruction window
  - Private 4-way 64KB L1, private 8-way 2MB L2
  - Partitioned 32MB LLC, 12.5 ns latency
  - 64GB PCM, 4 channels of 2 ranks each, 50 ns read latency, 1000 ns write latency
- Benchmarks:
  - SPEC CPU2006, classified into 3 types (W, R, RW) based on whether PCM reads and/or writes dominate bandwidth consumption
  - 15 workloads (Light, High)
- Sensitivity studies on write latency, number of channels, and number of cores

14 Effective Read Latency
1. Different workloads favor a different policy (partitioning weight)
2. WBP+DWA matches the best static policy (partitioning weight)
3. WBP+DWA reduces the effective read latency by 31.9% over RBP

15 Throughput
1. The best weight varies across workloads (the writeback intensity drives the choice of weight)
2. WBP+DWA achieves performance comparable to the best static weight
3. WBP+DWA improves throughput by 24.2% over RBP

16 Fairness (Harmonic IPC) WBP+DWA improves fairness by an average of 16.7% over RBP

17 Conclusions
- PCM device bandwidth is the bottleneck in hybrid memory
- Writeback information is important (LLC writebacks consume a substantial portion of memory bandwidth)
- WBP can better partition the PCM bandwidth
- WBP outperforms RBP by an average of 24.9% in weighted speedup

18 Thank you. Questions?

