
1 Extending the Effectiveness of 3D-Stacked DRAM Caches with an Adaptive Multi-Queue Policy (G. H. Loh)
Presented by Bismita Srichandan, Semra Kul, Rasanjalee Disanayaka

2 Outline
Introduction
What are 3D-stacked caches?
Multi-queue cache algorithm
Implementation
Adaptive Multi-Queue
Conclusion

3 Introduction
Main goal of the paper: a multi-queue cache replacement policy that removes dead cache lines and maintains isolation between cores.
3D-integration technology: used to combat the "memory wall".
Memory wall: the disparity in speed between the CPU and the memory outside the CPU chip.

4 3D-stacked cache
Why is a 3D-stacked cache needed?
Construct a 3D last-level cache (LLC) before full 3D core architectures arrive.
The LLC is built from DRAM.
Each cache set is organized into multiple queues.

5 3D-stacked cache (cont.)
DRAM: the machine must refresh the data periodically.
SRAM: the data is retained as long as power is supplied.
Set-dueling approach: dynamically adapt the sizes of the queues and decide how to advance lines between queues.

6 3D cache organizations
Organization (d) in the figure is the most cost-effective and efficient, offering 8 times the cache capacity of the baseline in figure (a).

7 Physical organization of 3D Caches
The access time is the same for any row of an SRAM array.
Reading a DRAM row destroys its content (destructive read).
Row-buffer management: precharge policy vs. open-page policy.

8 Basic Multi-queue Algorithm
Each queue entry has a 'u' (used) bit that is set to zero on insertion.
A hit on the line sets the 'u' bit to 1.
When the line reaches the end of the queue, it is evicted if its 'u' bit is still zero; otherwise it is advanced to the next region (a minimal sketch follows).
In a multi-core system, each core has its own FIFO queue.
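A minimal, illustrative sketch of this per-set mechanism, assuming one first-level FIFO queue feeding a shared next region; the class and field names are ours, not the paper's:

```python
from collections import deque

class MultiQueueSet:
    """Sketch of one cache set under the basic multi-queue policy:
    a first-level FIFO queue whose survivors advance into a shared region."""

    def __init__(self, queue_size, shared_size):
        self.queue = deque()          # first-level FIFO queue (one per core in the paper)
        self.shared = deque()         # next region a surviving line advances into
        self.queue_size = queue_size
        self.shared_size = shared_size

    def access(self, tag):
        # A hit anywhere sets the line's 'u' (used) bit.
        for region in (self.queue, self.shared):
            for entry in region:
                if entry["tag"] == tag:
                    entry["u"] = True
                    return True       # hit
        self._insert(tag)
        return False                  # miss

    def _insert(self, tag):
        # New lines enter the first-level queue with u = 0.
        self.queue.appendleft({"tag": tag, "u": False})
        if len(self.queue) > self.queue_size:
            victim = self.queue.pop() # line reaching the end of the queue
            if victim["u"]:
                # Referenced while in the queue: advance it to the next region.
                victim["u"] = False
                self.shared.appendleft(victim)
                if len(self.shared) > self.shared_size:
                    self.shared.pop() # evict from the shared region
            # else: u still 0 -> the line is simply evicted (a "dead" line).
```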

9 Cache behavior 1
Cache lines that are inserted into the cache but never referenced again ("dead" lines).
In LRU, the unused data stays in the cache until its turn comes to be released.
With the queueing system, such lines are evicted quickly.

10 Cache behavior 2
With respect to temporal locality, LRU and multi-queue behave differently.
LRU: eviction happens late, even when the line is unused.
Multi-Queue: an unused line is evicted as soon as it leaves the first queue.

11 Cache behavior 3 Isolation and protection between the different cores.
In the figure on the left, Core 1 has a high access rate with no reuse, which causes many misses for Core 0.

12 Implementation Issues
Processors: clock-based pseudo-LRU replacement policy (4 cores).
Overhead: one 'u' bit per entry, a single counter for the current clock position, and an extra counter per queue to track the queue head.
(A sketch of clock-based victim selection follows.)
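A hedged sketch of the clock (pseudo-LRU) victim selection referred to above; the data layout (a list of dicts with a 'u' flag) and the function name are illustrative assumptions:

```python
def clock_select_victim(lines, hand):
    """Clock victim selection: one 'u' bit per entry plus a single rotating
    hand (counter) per set, as mentioned on the slide."""
    while True:
        if not lines[hand]["u"]:
            victim = hand                     # first line without its used bit set
            hand = (hand + 1) % len(lines)    # advance the hand past the victim
            return victim, hand
        # Give the line a second chance and move the hand on.
        lines[hand]["u"] = False
        hand = (hand + 1) % len(lines)
```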

13 Implementation Issues
Row buffer (RB): single-cycle line shuffling between queues.
A multi-queue design complication: moving data between queues.
With the DRAM array: supply source and destination column addresses to the mux and demux.
The power for manipulating data in the RB is less than in an SRAM-based cache (precharging bit lines, powering the sense amplifiers, …).
An RB is not efficient in an SRAM cache and might slow down some access patterns.

14 Configuration
Baseline system: quad-core processor.
Shared, inclusive 4MB, 16-way cache with clock replacement policy.
Multiple prefetchers using FIFO.

15 Configuration
DRAM with the same footprint as the 4MB SRAM: 32 MB capacity (up to 8 banks), up to 128-way set associative, line sizes up to 512 bytes.
The best configuration for the 32 MB DRAM: 4 banks, 64-way set associative, 128-byte cache lines.
Queue sizes (per core): Q = 8, S = 12, SLRU = 20.
Workloads: multi-programmed mixes of memory-intensive programs (SPEC2006).
LLC metrics: MPKI (misses per thousand instructions) and IPC (instructions per cycle); definitions below.
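Stated explicitly, the two metrics named on the slide are defined as follows:

```latex
\mathrm{MPKI} = \frac{\text{LLC misses}}{\text{committed instructions}} \times 1000,
\qquad
\mathrm{IPC} = \frac{\text{committed instructions}}{\text{cycles}}
```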

16 Evaluation

17 Evaluation

18 Evaluation
For multi-core performance simulations:
Fast-forward each program 500 million instructions while warming the cache.
Then simulate until each program has committed 250 million instructions.
Statistics are collected up to that limit; each core continues executing afterwards, contending with the other cores for shared resources.
Metrics: throughput and speedup (the commonly used definitions are sketched below).
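The slide's formula images do not survive in this transcript; the definitions below are the ones commonly used for multi-programmed workloads, given here as an assumption rather than a reproduction of the paper's exact equations:

```latex
\text{IPC throughput} = \sum_{i} \mathrm{IPC}_i^{\text{shared}},
\qquad
\text{Weighted speedup} = \sum_{i} \frac{\mathrm{IPC}_i^{\text{shared}}}{\mathrm{IPC}_i^{\text{alone}}},
\qquad
\text{Fair speedup} = \frac{n}{\sum_{i} \mathrm{IPC}_i^{\text{alone}} / \mathrm{IPC}_i^{\text{shared}}}
```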

19 Performance
Clock replacement: stacking additional cache keeps a larger fraction of the working set on the chip.

20 Performance
32 MB 3D-stacked DRAM, four policies compared:
Baseline clock replacement
TADIP (thread-aware dynamic insertion policy)
UCP (utility-based cache partitioning)
Multi-queue (MQ) cache management

21 Performance
MQ: 23.6% performance improvement, ahead of UCP.
UCP gives more performance on some workloads thanks to its adaptation to dynamic changes in per-core memory requirements.
Inclusion: TADIP does not perform well because of the inclusion property the LLC enforces.
MQ avoids the problem: 64-way set associativity with Q = 8 entries means lines are not evicted from the queue too quickly.
Without the second-level queue, the speedup drops to 15.4%.

22 Performance
Per-core hit distribution (4 cores):
Core 0 (MIX01): 58.6% of LLC hits in the first-level queue, 35.8% in the shared second-level queue, 5.6% in the clock-managed region.
Core 1 (MIX01): all hits in the first-level queue…

23 ADAPTIVE MULTI-QUEUE (AMQ)
For some workloads, UCP can achieve higher IPC throughput than the MQ approach.
Reason: UCP dynamically partitions the cache to reduce overall misses, whereas MQ uses statically partitioned queues, which may sometimes result in:
Over-provisioned queues for some cores: dead lines stay longer than they should.
Under-provisioned queues for some cores: early eviction of lines that will be referenced in the near future.
Solution: an adaptive MQ (AMQ) that uses dynamic partitioning.

24 Adaptive Multi-Queue (AMQ)
AMQ dynamically adjusts queue sizes based on the needs of each core. Instead of allowing arbitrary queue sizes, the authors restrict the queues to only a few choices (a simpler approach). But a method is still needed to choose among these few choices!

25 Multi-Set Dueling
Number of possible unique queue-size configurations for an n-core system: |Q|^n × |S|, given |Q| possible sizes for each first-level queue and |S| sizes for the second-level queue.
Finding the best parameters in such a potentially large configuration space may be daunting.
To tackle this problem, the authors propose a simple generalization of the set-dueling principle.
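A quick check of the size of this space using the AMQ parameters given later (|Q| = 4 first-level choices, |S| = 4 second-level sizes, n = 4 cores):

```latex
|Q|^{n} \times |S| = 4^{4} \times 4 = 1024 \ \text{unique configurations}
```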

26 Set-Dueling Principle –(DIP)
Proposed for the Dynamic Insertion Policy (DIP).
Objective: adaptively choose the better of two different policies.
The idea: dedicate a small (but statistically significant) number of cache sets that always follow fixed policies.

27 Set-Dueling Principle –(DIP)
Process: a few leader sets always manage their lines using a fixed policy P0, and a few other leader sets always use policy P1.
Policy Selection Counter (PSEL):
Decremented when misses occur in leader sets following P0.
Incremented when misses occur in leader sets following P1.
PSEL thus estimates which policy causes more misses, based on the observed behavior of these sampled leader sets.
The remaining follower sets simply use the policy that should result in fewer misses, as indicated by the PSEL counter (a minimal sketch follows).
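A minimal sketch of this mechanism, assuming a particular leader-set assignment and counter width (both are illustrative choices, not the paper's):

```python
class SetDueling:
    """DIP-style set dueling: a few leader sets per policy plus one PSEL counter."""

    def __init__(self, num_sets, num_leaders=32, counter_bits=10):
        self.psel = 0
        self.max = (1 << (counter_bits - 1)) - 1
        self.min = -(1 << (counter_bits - 1))
        # Dedicate a few sets to each fixed policy; the rest are followers.
        self.leaders_p0 = set(range(0, num_leaders))
        self.leaders_p1 = set(range(num_leaders, 2 * num_leaders))

    def on_miss(self, set_index):
        # Misses in P0 leader sets decrement PSEL; misses in P1 leader sets increment it.
        if set_index in self.leaders_p0:
            self.psel = max(self.min, self.psel - 1)
        elif set_index in self.leaders_p1:
            self.psel = min(self.max, self.psel + 1)

    def policy_for(self, set_index):
        if set_index in self.leaders_p0:
            return "P0"
        if set_index in self.leaders_p1:
            return "P1"
        # Followers use whichever policy the counter says misses less:
        # a high PSEL means P1 leaders missed more, so followers pick P0.
        return "P0" if self.psel >= 0 else "P1"
```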

28 Set-Dueling Principle in Multi-Core Context -(TADIP)
In a multi-core context, each individual core may wish to follow a different policy.
TADIP, the multi-core extension of DIP, introduced per-core leader sets with per-core PSEL counters.
Figure: multi-core, two-policy-per-core selection.

29 Set-Dueling Principle in Multi-Core Context –(TADIP)
Figure: one group of eight leader sets (Leader Set 1 through Leader Set 8) for four cores (Core 0 through Core 3).
Each set is annotated with a policy vector <ρc0, ρc1, ρc2, ρc3>, where ci represents Core i and ρci indicates the policy followed by Core i for this set.
In each group of leader sets there is one leader set per policy, per core.
Example:
The first leader set always applies policy P0 to Core 0.
The second leader set always uses P1 for Core 0.
The remaining cores (Core 1 through Core 3) do not use a fixed policy in these sets and simply follow the policy specified by their respective PSEL counters.

30 Set-Dueling Principle in Multi-Core Context –(TADIP)
A miss in a set where Core 0 is forced to always follow P0 → PSEL0 is decremented.
A miss in a set where Core 0 is forced to always follow P1 → PSEL0 is incremented.
For all remaining sets, including leader sets for other cores, cache decisions involving Core 0 use the policy f0 chosen by PSEL0.
The leader-set structure is symmetric for all remaining cores.
Each core can choose the policy that works best for it, but the determination of what is "best" accounts for the policy selections of the other cores.

31 Set-Dueling –(AMQ)
The set-dueling approaches of both DIP and TADIP assume that each core chooses between only two policies.
The selection of a queue size in the MQ approach is effectively a "policy" decision.
For |Q| > 2, the authors use a multi-set dueling approach.

32 Multi-Set-Dueling –(AMQ)
Consider the case Q = {Qa, Qb, Qc, Qd} shown in the figure, focusing on Core 0's first-level queue.
For the first leader set, Core 0 always uses a first-level queue of size Qa; for the second leader set, Core 0 always uses size Qb.
Misses in the first leader set cause the counter PSELab0 to be decremented; misses in the second leader set increment it.
The third set follows the policy φab0, which sets Core 0's queue size (in this set) to Qa or Qb based on PSELab0.
φ indicates a partial follower (partial because the sizes Qc and Qd are not considered).
A miss in the set following φab0 causes a "meta-policy" counter MPSEL0 to be decremented.
Figure: Sets 1, 2, 4, and 5 are leader sets; Sets 3 and 6 are partial followers.

33 Multi-Set-Dueling –(AMQ)
The next three sets (sets 4, 5, 6) are similar to the first three, except that one always sets Core 0's first-level queue size to Qc, the next to Qd, and the third to the better of these two (φcd0).
A miss in the set following policy φcd0 causes MPSEL0 to be incremented.
Finally, all other follower sets set Core 0's first-level queue size according to policy f0, which is determined by the results of PSELab0, PSELcd0, and MPSEL0 (see the sketch below).
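A sketch of how the follower policy f0 could be resolved from the three counters just described; the function and parameter names are ours, and the sign convention mirrors the slides (a miss under the first option of a pair decrements its counter, a miss under the second increments it):

```python
def choose_queue_size(psel_ab, psel_cd, mpsel, Q=("Qa", "Qb", "Qc", "Qd")):
    """Resolve the follower queue size for one core from its two pairwise
    PSEL counters and its meta-policy counter MPSEL."""
    best_ab = Q[0] if psel_ab >= 0 else Q[1]   # winner of the Qa/Qb duel
    best_cd = Q[2] if psel_cd >= 0 else Q[3]   # winner of the Qc/Qd duel
    # MPSEL duels the two partial followers: misses under the ab-follower
    # decrement it, misses under the cd-follower increment it, so a
    # non-negative MPSEL favors the ab winner.
    return best_ab if mpsel >= 0 else best_cd
```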

34 Multi-Set-Dueling –(AMQ)
The next six sets repeat the process to determine the size of Core 1's first-level queue.
This repeats again for Core 2 (not fully shown) and Core 3 (not shown at all).
Likewise, another six leader sets (also not shown) determine the size of the shared second-level queue.

35 Multi-Set-Dueling –(AMQ)
For the adaptive multi-queue (AMQ) approach: the first-level queues use one of four policies Q = {0s, 0m, 4, 8}, and the second-level queue selects one of four sizes S = {0, 4, 8, 12}.
For the first-level queues there are actually two choices of zero-sized queue:
Policy 0s: the queue has no entries; incoming cache lines are inserted into the second-level queue.
Policy 0m: similar, except that lines are inserted directly into the main clock-based region of the set.
(An illustrative insertion dispatch is sketched below.)
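An illustrative dispatch of an incoming line based on the selected first-level policy; the region objects and their insert() methods are assumptions made for the sketch:

```python
def insert_line(line, policy, first_level_queue, second_level_queue, clock_region):
    """Route a newly inserted line according to the core's first-level queue policy."""
    if policy == "0s":
        second_level_queue.insert(line)   # no first-level entries: go to the shared queue
    elif policy == "0m":
        clock_region.insert(line)         # bypass both queues: straight to the clock region
    else:                                 # policy is a queue size (4 or 8 entries)
        first_level_queue.insert(line)
```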

36 Stability Issues
Problem: instability.
With so many policy and meta-policy decisions, the overall system can become unstable, rapidly switching through many different configurations without converging on a good one.
Solution: slow down the rate of policy change. Two methods (sketched below):
A simple time delay: independent of the actual PSEL values, once a policy change has been made, no other change may occur until at least δ cycles have elapsed, although the PSEL counters are still updated.
Hysteresis on the PSEL counters: when a PSEL counter goes negative, it must actually drop below −h before the policy change is invoked; similarly, the counter must rise above +h to switch the policy back.
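A sketch combining the two damping mechanisms above; the parameter values (δ, h) and the class name are illustrative assumptions:

```python
class StablePolicySelector:
    """Policy selection damped by a minimum delay of `delta` cycles between
    switches and a hysteresis band of +/- h on the PSEL counter."""

    def __init__(self, delta=100_000, h=64):
        self.delta = delta
        self.h = h
        self.psel = 0
        self.policy = "P0"
        self.last_switch_cycle = 0

    def update(self, cycle, delta_psel):
        self.psel += delta_psel           # counters keep updating regardless of the delay
        if cycle - self.last_switch_cycle < self.delta:
            return self.policy            # too soon since the last change
        # Hysteresis: only switch once the counter clears the +/- h band.
        if self.policy == "P0" and self.psel < -self.h:
            self.policy, self.last_switch_cycle = "P1", cycle
        elif self.policy == "P1" and self.psel > self.h:
            self.policy, self.last_switch_cycle = "P0", cycle
        return self.policy
```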

37 Occasional Lines with Long Reuse Distances
Problem: early eviction.
The queue size must match up reasonably well with the actual reuse distances of each core.
The multi-set dueling approach selects the queue size that most closely covers the majority of a core's cache-line reuse patterns, but there may still be a significant number of lines whose reuse distances are simply longer than the queue size (early eviction).
Solution: include a pardon probability and statistical trace cache filtering.
Method (sketched below):
If a line's 'u' bit is set, it is always advanced to the next region of the cache.
If the 'u' bit is zero, then with some probability Ppardon the line is advanced anyway.
Four possible pardon probabilities: P = {0, 1/32, 1/8, 1}.
Multi-set dueling is used to select Ppardon on a per-core basis.
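A sketch of the pardon mechanism at the moment a line reaches the end of its queue, assuming a deque-like next region (the function name and data layout are ours):

```python
import random

def advance_or_evict(line, next_region, p_pardon=1/32):
    """Advance a line whose 'u' bit is set; otherwise pardon it with
    probability p_pardon so occasional long-reuse-distance lines survive.
    The slide's choices are P = {0, 1/32, 1/8, 1}."""
    if line["u"] or random.random() < p_pardon:
        line["u"] = False                 # clear the used bit as it advances
        next_region.appendleft(line)      # promote to the next region of the set
        return True                       # line survived
    return False                          # line evicted
```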

38 IPC throughput results for AMQ

39 IPC throughput results for AMQ
Additional performance gains beyond the simple stacking of DRAM as a cache:
UCP: 18.9% improvement; MQ-static: 23.6%; AMQ: 25.7%.
The stability mechanisms provide a small net benefit, increasing the geometric-mean improvement to 27.6% over the baseline 32MB cache.
Dynamic pardon-probability selection provides another small boost, bringing the performance gain to 29.1%.
The AMQ technique achieves 75.6% of the performance difference between the 32MB and 64MB clock-managed DRAM caches.

40 AMQ’s adaptations over time
Figure 14 shows, for each benchmark of each workload, how much time each first-level queue spends configured at different sizes, as well as how much time the shared second-level queue spends at its different sizes.

41 AMQ’s adaptations over time
The first four columns (dark shading on top) of each workload correspond to the per-core queues from Core 0 to Core 3. The 0s and 0m policies correspond to a zero-sized first-level queue with direct insertion into the second-level queue and main region, respectively. The fifth column (light shading on top) is for the shared queue. While a few individual programs find a queue size and then stick with it throughout the traced execution, there are others that clearly vary (i.e., adapt) over time.

42 Weighted speedup and Fair speedup

43 Weighted speedup and Fair speedup
Overall, AMQ performs well on these metrics. For the fair speedup metric, AMQ with stability and pardoning performs better than UCP and always better than clock, indicating that there are no significant concerns over fairness.

44 Conclusion
In this paper, the authors revisit the simple application of 3D integration to stack a DRAM layer as a large last-level cache.
They show that the physical architecture of the DRAM and its peripheral logic, which traditionally increases the complexity of the memory interface, actually provides an opportunity to derive benefit from these otherwise inconvenient structures.

45 Questions?

