1 Computer Architecture Lecture 13: Memory Interference and Quality of Service (II) Prof. Onur Mutlu ETH Zürich Fall 2017 2 November 2017

2 Summary of Yesterday Shared vs. private resources in multi-core systems Memory interference and the QoS problem Memory scheduling

3 Agenda for Today Memory scheduling wrap-up
Other approaches to mitigate and control memory interference Source Throttling Data Mapping Thread Scheduling Multi-Core Cache Management

4 Quick Summary Papers "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems” "The Blacklisting Memory Scheduler: Achieving High Performance and Fairness at Low Cost" "Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems” "Parallel Application Memory Scheduling” "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning"

5 Predictable Performance: Strong Memory Service Guarantees

6 Goal: Predictable Performance in Complex Systems
[Figure: CPUs, a GPU, and HWAs sharing a cache, DRAM and hybrid memory controllers, and DRAM and hybrid memories] Heterogeneous agents: CPUs, GPUs, and HWAs Main memory interference between CPUs, GPUs, and HWAs How to allocate resources to heterogeneous agents to mitigate interference and provide predictable performance?

7 Strong Memory Service Guarantees
Goal: Satisfy performance/SLA requirements in the presence of shared main memory, heterogeneous agents, and hybrid memory/storage Approach: Develop techniques/models to accurately estimate the performance loss of an application/agent in the presence of resource sharing Develop mechanisms (hardware and software) to enable the resource partitioning/prioritization needed to achieve the required performance levels for all applications All the while providing high system performance Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” HPCA 2013. Subramanian et al., “The Application Slowdown Model,” MICRO 2015.

8 Predictable Performance Readings (I)
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems" Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Pittsburgh, PA, March 2010. [Slides (pdf)]

9 Predictable Performance Readings (II)
Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. [Slides (pptx)]

10 Predictable Performance Readings (III)
Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu, "The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory" Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]

11 MISE: Providing Performance Predictability in Shared Main Memory Systems
Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, Onur Mutlu Hello everyone. My name is Lavanya Subramanian. Today, I will present “MISE: Providing perfomance …”. This is work done in collaboration with my co-authors Vivek ….

12 Unpredictable Application Slowdowns
This graph shows two applications from the SPEC CPU 2006 suite, leslie3d and gcc. These applications are run on either core of a two core system and share main memory. The y-axis shows each application’s slowdown as compared to when the application is run standalone on the same system. Now, let’s look at leslie3d. It slows down by 2x when run with gcc. Let’s look at another case when leslie3d is run with another application, mcf. Leslie now slows down by 5.5x, while it slowed down by 2x when run with gcc. Leslie experiences different slowdowns when run with different applications. More generally, an application’s performance depends on which applications it is running with. Such performance unpredictability is undesirable…. An application’s performance depends on which application it is running with

13 Need for Predictable Performance
There is a need for predictable performance When multiple applications share resources Especially if some applications require performance guarantees Example 1: In mobile systems Interactive applications run with non-interactive applications Need to guarantee performance for interactive applications Example 2: In server systems Different users’ jobs consolidated onto the same server Need to provide bounded slowdowns to critical jobs Our Goal: Predictable performance in the presence of memory interference …because there is a need for predictable performance When multiple applications share resources and Especially when some applications require guarantees on performance. For example, in mobile systems, interactive applications are run with non-interactive ones. And there is a need to provide guarantees on performance to the interactive applications. In server systems, different users’ jobs are heavily consolidated onto the same server. In such a scenario, there is a need to achieve bounded slowdowns for critical jobs of users. These are a couple of examples of systems that need predictable performance. There are many other systems (such as …) in which performance predictability is necessary. Therefore, our goal in this work is to provide predictable performance in the presence of memory interference.

14 Outline 1. Estimate Slowdown 2. Control Slowdown Key Observations
Implementation MISE Model: Putting it All Together Evaluating the Model 2. Control Slowdown Towards this goal, we propose to first estimate application slowdowns due to main memory interference. And once we have accurate slowdown estimates, we propose to leverage them to control application slowdowns.

15 Outline 1. Estimate Slowdown 2. Control Slowdown Key Observations
Implementation MISE Model: Putting it All Together Evaluating the Model 2. Control Slowdown Providing Soft Slowdown Guarantees Minimizing Maximum Slowdown First, let me describe the key observations that enable us to estimate application slowdowns.

16 Slowdown: Definition Slowdown of an application is the ratio of its performance when it is run stand-alone on a system to its performance when it is sharing the system with other applications. An application's performance when it is sharing the system with other applications can be measured in a straightforward manner. However, measuring an application's alone performance, without actually running it alone, is harder. This is where our first observation comes in.

17 Key Observation 1 For a memory-bound application,
Performance ∝ Memory request service rate (measured on an Intel Core i7, 4 cores, memory bandwidth: 8.5 GB/s) We observe that for a memory-bound application, its performance is roughly proportional to the rate at which its memory requests are serviced. This plot shows the results of our experiment on an Intel 4-core system. We run omnetpp, a SPEC CPU 2006 benchmark, along with another memory hog whose memory intensity can be varied. This, in turn, lets us vary the rate at which the memory requests of omnetpp are serviced. The x-axis shows the request service rate of omnetpp, normalized to when it is run alone. The y-axis shows the performance of omnetpp, normalized to when it is run alone. As can be seen, there is a very strong, y=x-like correlation between request service rate and performance. We observe the same trend for various memory-bound applications; here are a couple more examples. Therefore, memory request service rate can be used as a proxy for performance. Now, going back to the slowdown equation: we can replace performance by request service rate, based on our first observation, enabling us to express slowdown as the ratio of alone to shared request service rates. The shared request service rate can be measured in a relatively straightforward manner (easy) when an application is run along with other applications. Measuring the alone request service rate, without actually running the application alone, is harder. This is where our second observation comes into play.

18 Key Observation 2 Request Service Rate Alone (RSRAlone) of an application can be estimated by giving the application highest priority in accessing memory Highest priority → Little interference (almost as if the application were run alone) We observe that an application's alone request service rate can be estimated by giving the application highest priority in accessing memory. Because, when an application has highest priority, it receives little interference from other applications, almost as though the application were run alone.

19 Key Observation 2 1. Run alone 2. Run with another application 3. Run with another application: highest priority
[Figure: request buffer state, service order, and time units for each of the three cases] Here is an example. An application, let's call it the red application, is run by itself on a system. Two of its requests are queued at the memory controller request buffers. Since these are the only two requests queued up for main memory at this point, the memory controller schedules them back to back and they are serviced in two time units. Now, the red application is run with another application, the blue application. The blue application's request makes its way between two of the red application's requests. A naïve memory scheduler could potentially schedule the requests in the order in which they arrived, delaying the red application's second request by one time unit, as compared to when the application was run alone. Next, let's consider running the red and blue applications together, but give the red application highest priority. In this case, the memory scheduler schedules the two requests of the red application back to back, servicing them in two time units. Therefore, when the red application is given highest priority, its requests are serviced as though the application were run alone.

20 Memory Interference-induced Slowdown Estimation (MISE) model for memory bound applications
Based on these two observations, our memory interference induced slowdown estimation (MISE) model for memory bound applications is as follows. Now, let’s analyze why this request service rate ratio based model holds for memory bound applications
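
Written out, based on the two observations above (performance tracks request service rate, and RSRAlone is estimated by giving the application highest priority), the request-service-rate-ratio model for a memory-bound application is:

    Slowdown = RSRAlone / RSRShared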

21 Memory phase slowdown dominates overall slowdown
Key Observation 3 Memory-bound application Compute Phase Memory Phase No interference Req Req Req time With interference Req Req Req A memory-bound application spends significant fraction of its execution time waiting for memory. Let’s first look at the case when a memory-bound application is not receiving any interference. The application does a small amount of compute at the core and then sends out a bunch of memory requests. The application cannot do any more compute until these requests are serviced. Now, when this application experiences interference, the length of its compute phase does not change. However, the time taken to service the memory requests increases. In essence, since the application spends most of its time in the memory phase, the memory induced slowdown of the application dominates the overall slowdown of the application. time Memory phase slowdown dominates overall slowdown

22 Only memory fraction (α) slows down with interference
Key Observation 3 Non-memory-bound application Compute Phase Memory Phase Memory Interference-induced Slowdown Estimation (MISE) model for non-memory bound applications No interference time With interference However, this is not the case for a non-memory-bound application. A non-memory-bound application spends a sizeable fraction of its time doing compute at the core and a smaller fraction of time waiting for memory. When the application experiences interference, the length of the compute phase does not change, whereas the memory phase slows down with interference. Specifically, if alpha is the fraction of time the application spends in the memory phase, then 1 – alpha is the fraction of time it spends in the compute phase. The 1 – alpha fraction does not change with memory interference; however, the alpha fraction slows down as the ratio of request service rate alone to request service rate shared. We augment the MISE model to take this into account. Slowdown for non-memory-bound applications can be expressed as the sum of two quantities: 1 – alpha, the compute fraction that does not change with interference, and the memory-induced slowdown. time Only memory fraction (α) slows down with interference
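
Written out, with α the fraction of time the application spends in its memory phase (as described above), the model for a non-memory-bound application is:

    Slowdown = (1 - α) + α × (RSRAlone / RSRShared)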

23 1. Estimate Slowdown 2. Control Slowdown Outline Key Observations
Implementation MISE Model: Putting it All Together Evaluating the Model 2. Control Slowdown Providing Soft Slowdown Guarantees Minimizing Maximum Slowdown Having discussed the key observations behind our MISE model, let me describe its implementation.

24 Interval Based Operation
time Measure RSRShared, Estimate RSRAlone Measure RSRShared, Estimate RSRAlone MISE operates on an interval basis. Execution time is divided into intervals. During an interval, each application’s shared request service rate and alpha are measured. And alone request service rate is estimated. At the end of the interval, slowdown is estimated as a function of these three quantities. This repeats during every interval. Now, let us look at how each of these quantities is measured/estimated. Estimate slowdown Estimate slowdown

25 Measuring RSRShared and α
Request Service Rate Shared (RSRShared) Per-core counter to track number of requests serviced At the end of each interval, measure Memory Phase Fraction (α) Count number of stall cycles at the core Compute fraction of cycles stalled for memory Shared request service rate can be measured using per-core counters to track the number of requests serviced during each interval. At the end of the interval, RSRShared can be computed as the number of requests serviced divided by the length of the interval. The memory phase fraction of each application can be computed by counting the number of cycles the core stalls waiting for memory and then computing what fraction of the interval length these cycles are. Measuring RSRShared and alpha is the straightforward part. Now, for the more interesting part, estimating the alone request service rate.
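
As a concrete illustration of these per-interval measurements, here is a minimal sketch; the counter fields, interval length, and example values are illustrative assumptions, not the paper's hardware implementation.

    #include <stdio.h>

    /* Illustrative per-core counters sampled at the end of an interval. */
    typedef struct {
        unsigned long requests_serviced;    /* requests serviced this interval    */
        unsigned long memory_stall_cycles;  /* cycles the core stalled on memory  */
    } CoreCounters;

    /* RSRShared: requests serviced per cycle while sharing the system. */
    double rsr_shared(CoreCounters c, unsigned long interval_cycles) {
        return (double)c.requests_serviced / (double)interval_cycles;
    }

    /* alpha: fraction of the interval spent stalled on memory. */
    double memory_phase_fraction(CoreCounters c, unsigned long interval_cycles) {
        return (double)c.memory_stall_cycles / (double)interval_cycles;
    }

    int main(void) {
        CoreCounters c = { 50000, 4000000 };   /* example counter values           */
        unsigned long interval = 5000000;      /* 5M-cycle interval (illustrative) */
        printf("RSRShared = %f req/cycle, alpha = %f\n",
               rsr_shared(c, interval), memory_phase_fraction(c, interval));
        return 0;
    }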

26 Estimating Request Service Rate Alone (RSRAlone)
Goal: Estimate RSRAlone How: Periodically give each application highest priority in accessing memory Divide each interval into shorter epochs At the beginning of each epoch Memory controller randomly picks an application as the highest priority application At the end of an interval, for each application, estimate Our goal is to estimate the alone request service rate. To do this, we periodically give each application the highest priority in accessing memory, based on observation 2. How exactly do we do this? First, each interval is divided into shorter epochs. At the beginning of each epoch, the memory controller picks one application at random and designates it the highest priority application. This way each application is given highest priority during different epochs of the interval. At the end of the interval, ARSR is estimated for each application as the ratio of the number of requests of the application serviced during the high priority epochs to the number of cycles the application was given highest priority. Now, is this an accurate way of estimating ARSR? It turns out it is not: we have been ignoring an inaccuracy all along.

27 Inaccuracy in Estimating RSRAlone
When an application has highest priority Still experiences some interference [Figure: request buffer state, service order, and time units under high priority; the final timeline marks the interference cycles] That is, even when an application has highest priority, it still experiences some interference from other applications. Here is an example showing why. The red application has highest priority and is run along with another application, the blue application. First, there is just one request in the request buffer and it is from the red application. Since this is the only request, it is scheduled by the memory controller and completes in one time unit. While this request is being serviced, there comes another request from the blue application. Our memory controller is work conserving and schedules another application's request if there is no request from the red application. So, it schedules the blue application's request next. However, just as the blue application's request is being scheduled comes another request from the red application. Ideally, we would like to preempt the blue application's request. However, because of the way DRAM operates, preempting the blue application's request incurs overhead. Therefore, the red application's request would have to wait until the blue application's request completes and only then is it serviced. These extra cycles that the red application spent waiting for the blue application's request to complete are interference cycles that it would not have experienced had it been running alone.

28 Accounting for Interference in RSRAlone Estimation
Solution: Determine and remove interference cycles from RSRAlone calculation A cycle is an interference cycle if a request from the highest priority application is waiting in the request buffer and another application’s request was issued previously The solution to account for this interference is to determine and remove such interference cycles from the ARSR computation. How we do this is to subtract these cycles from the number of high priority epoch cycles. Now, the question is how do we determine which cycles are interference cycles. We deem a cycle an interference cycle if …. (read from slide)
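
A minimal sketch tying slides 26 through 28 together: RSRAlone is estimated from the high-priority epochs with the interference cycles subtracted, and then combined with RSRShared and α in the MISE equation. The field names and numbers are illustrative assumptions, not the hardware described in the paper.

    #include <stdio.h>

    /* Per-application statistics gathered over one interval (illustrative). */
    typedef struct {
        unsigned long hp_requests;         /* requests serviced while highest priority */
        unsigned long hp_epoch_cycles;     /* cycles the app held highest priority     */
        unsigned long interference_cycles; /* highest-priority cycles spent waiting
                                              behind another application's request     */
        double rsr_shared;                 /* measured as on slide 25                  */
        double alpha;                      /* memory phase fraction, slide 25          */
    } MiseStats;

    double rsr_alone(MiseStats s) {
        /* Remove interference cycles from the high-priority epoch length (slide 28). */
        unsigned long effective_cycles = s.hp_epoch_cycles - s.interference_cycles;
        return (double)s.hp_requests / (double)effective_cycles;
    }

    double mise_slowdown(MiseStats s) {
        double ratio = rsr_alone(s) / s.rsr_shared;
        /* Non-memory-bound form; with alpha = 1 this reduces to the memory-bound model. */
        return (1.0 - s.alpha) + s.alpha * ratio;
    }

    int main(void) {
        MiseStats s = { 12000, 500000, 40000, 0.010, 0.8 };  /* example values */
        printf("Estimated slowdown = %.2f\n", mise_slowdown(s));
        return 0;
    }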

29 Outline 1. Estimate Slowdown 2. Control Slowdown Key Observations
Implementation MISE Model: Putting it All Together Evaluating the Model 2. Control Slowdown Providing Soft Slowdown Guarantees Minimizing Maximum Slowdown Having described the key observations behind our model and its implementation, we will put it all together

30 MISE Model: Putting it All Together
Interval Interval time Measure RSRShared, Estimate RSRAlone Measure RSRShared, Estimate RSRAlone MISE operates on an interval basis. Execution time is divided into intervals. During an interval, each application’s shared request service rate and alpha are measured. And alone request service rate is estimated. At the end of the interval, slowdown is estimated as a function of these three quantities. Now, let us look at how each of these quantities is measured/estimated. Estimate slowdown Estimate slowdown

31 Outline 1. Estimate Slowdown 2. Control Slowdown Key Observations
Implementation MISE Model: Putting it All Together Evaluating the Model 2. Control Slowdown Providing Soft Slowdown Guarantees Minimizing Maximum Slowdown Now, having described the model and its implementation, let me present an evaluation of the model.

32 Previous Work on Slowdown Estimation
STFM (Stall Time Fair Memory) Scheduling [Mutlu+, MICRO '07] FST (Fairness via Source Throttling) [Ebrahimi+, ASPLOS '10] Per-thread Cycle Accounting [Du Bois+, HiPEAC '13] Basic Idea: estimate slowdown from stall times (shared stall time is easy to measure; alone stall time is hard) The most relevant previous works on slowdown estimation are STFM, stall time fair memory scheduling, and FST, fairness via source throttling. FST employs similar mechanisms as STFM for memory-induced slowdown estimation, therefore I will focus on STFM for the rest of the evaluation. STFM estimates slowdown as the ratio of alone to shared stall times. Stall time shared is easy to measure, analogous to shared request service rate. Stall time alone is harder, and STFM estimates it by counting the number of cycles an application receives interference from other applications.

33 Two Major Advantages of MISE Over STFM
Advantage 1: STFM estimates alone performance while an application is receiving interference → Hard MISE estimates alone performance while giving an application the highest priority → Easier Advantage 2: STFM does not take into account the compute phase for non-memory-bound applications MISE accounts for the compute phase → Better accuracy Let me now present the two major advantages of MISE over STFM, qualitatively. First, STFM estimates the alone performance of an application while it is receiving interference. This is hard because of BLP effects. MISE addresses this difficulty by giving an application highest priority and then estimating its alone performance. Second, STFM does not take into account the compute phase for non-memory-bound applications. MISE accounts for the compute phase and hence achieves better accuracy.

34 Methodology 4 cores 1 channel, 8 banks/channel DDR3 1066 DRAM
Configuration of our simulated system 4 cores 1 channel, 8 banks/channel DDR3 1066 DRAM 512 KB private cache/core Workloads SPEC CPU2006 300 multiprogrammed workloads Next, I will quantitatively compare MISE and STFM. Before that, let me present our methodology. We carry out our evaluations on a 4-core, 1-channel DDR3 1066 DRAM system. Our workloads are composed of SPEC CPU 2006 applications. We build 300 multiprogrammed workloads from these applications.

35 Quantitative Comparison
SPEC CPU 2006 application leslie3d This graph shows the actual slowdown of leslie3d, a memory-bound application, with time (in million cycles). Here are slowdown estimates from STFM and here are slowdown estimates from MISE. As can be seen, MISE’s slowdown estimates are much closer to the actual slowdowns. This is because MISE estimates leslie’s alone performance while giving it highest priority, whereas STFM tries to estimate it while leslie receives interference from other applications.

36 Comparison to STFM Average error of MISE: 8.2%
Average error of STFM: 29.4% (across 300 workloads) [Figure: slowdown estimation error for individual applications, including cactusADM, GemsFDTD, soplex, wrf, calculix, and povray] Several more applications. For most applications, MISE's slowdown estimates are more accurate. On average, MISE's error is 8.2%, while STFM's error is 29.4%, across our 300 workloads.

37 Outline 1. Estimate Slowdown 2. Control Slowdown Key Observations
Implementation MISE Model: Putting it All Together Evaluating the Model 2. Control Slowdown Providing Soft Slowdown Guarantees Minimizing Maximum Slowdown I have described our MISE model to estimate application slowdowns. Now, once we have slowdown estimates from our model, we can leverage them to control slowdowns. Let me present our first mechanism to control slowdowns – providing soft slowdown guarantees.

38 Providing “Soft” Slowdown Guarantees
Goal 1. Ensure QoS-critical applications meet a prescribed slowdown bound 2. Maximize system performance for other applications Basic Idea Allocate just enough bandwidth to QoS-critical application Assign remaining bandwidth to other applications The goal of this mechanism is to ensure that QoS-critical applications meet a prescribed slowdown bound while maximizing system performance for the other applications. The basic idea is to allocate just enough bandwidth to the QoS-critical application, while assigning the remaining bandwidth among the other applications.

39 MISE-QoS: Mechanism to Provide Soft QoS
Assign an initial bandwidth allocation to the QoS-critical application Estimate slowdown of the QoS-critical application using the MISE model After every N intervals If slowdown > bound B + ε, increase bandwidth allocation If slowdown < bound B - ε, decrease bandwidth allocation When the slowdown bound is not met for N intervals Notify the OS so it can migrate/de-schedule jobs
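
A hedged sketch of this control loop; the bandwidth-allocation step size, the epsilon band, the N-interval threshold, and all example numbers are illustrative assumptions, not values from the MISE paper.

    #include <stdio.h>

    /* One adjustment step of the MISE-QoS idea, run every N intervals (slide 39). */
    double adjust_bandwidth(double alloc, double slowdown, double bound,
                            double epsilon, int *violations) {
        const double step = 0.05;           /* illustrative step size */
        if (slowdown > bound + epsilon) {
            alloc += step;                  /* give the QoS-critical app more bandwidth  */
            (*violations)++;
        } else if (slowdown < bound - epsilon) {
            alloc -= step;                  /* free bandwidth for the other applications */
            *violations = 0;
        } else {
            *violations = 0;
        }
        if (alloc > 1.0) alloc = 1.0;
        if (alloc < 0.0) alloc = 0.0;
        return alloc;
    }

    int main(void) {
        double alloc = 0.3, bound = 3.33, eps = 0.1;
        int violations = 0;
        double slowdowns[] = { 4.0, 3.8, 3.4, 3.2, 2.9 };   /* example MISE estimates */
        for (int i = 0; i < 5; i++) {
            alloc = adjust_bandwidth(alloc, slowdowns[i], bound, eps, &violations);
            if (violations >= 3)   /* "bound not met for N intervals": involve the OS */
                printf("notify OS: consider migrating/de-scheduling jobs\n");
            printf("interval %d: bandwidth allocation = %.2f\n", i, alloc);
        }
        return 0;
    }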

40 Methodology Each application (25 applications in total) is considered the QoS-critical application Run with 12 sets of co-runners of different memory intensities Total of 300 multiprogrammed workloads Each workload run with 10 slowdown bound values Baseline memory scheduling mechanism Always prioritize the QoS-critical application [Iyer+, SIGMETRICS 2007] Other applications' requests scheduled in FR-FCFS order [Zuravleff+, US Patent 1997, Rixner+, ISCA 2000]

41 A Look at One Workload
[Figure: slowdowns of the QoS-critical and non-QoS-critical applications for slowdown bounds of 10, 3.33, and 2] MISE is effective in meeting the slowdown bound for the QoS-critical application and improving performance of the non-QoS-critical applications

42 Effectiveness of MISE in Enforcing QoS
Across 3000 data points:
                         Predicted Met    Predicted Not Met
  QoS Bound Met              78.8%              2.1%
  QoS Bound Not Met           2.2%             16.9%
MISE-QoS meets the bound for 80.9% of workloads MISE-QoS correctly predicts whether or not the bound is met for 95.7% of workloads AlwaysPrioritize meets the bound for 83% of workloads

43 Performance of Non-QoS-Critical Applications
When slowdown bound is 10/3 MISE-QoS improves system performance by 10% Higher performance when bound is loose

44 Outline 1. Estimate Slowdown 2. Control Slowdown Key Observations
Implementation MISE Model: Putting it All Together Evaluating the Model 2. Control Slowdown Providing Soft Slowdown Guarantees Minimizing Maximum Slowdown

45 Other Results in the Paper
Sensitivity to model parameters Robust across different values of model parameters Comparison of STFM and MISE models in enforcing soft slowdown guarantees MISE significantly more effective in enforcing guarantees Minimizing maximum slowdown MISE improves fairness across several system configurations

46 Summary Uncontrolled memory interference slows down applications unpredictably Goal: Estimate and control slowdowns Key contribution MISE: An accurate slowdown estimation model Average error of MISE: 8.2% Key Idea Request Service Rate is a proxy for performance Request Service Rate Alone estimated by giving an application highest priority in accessing memory Leverage slowdown estimates to control slowdowns Providing soft slowdown guarantees Minimizing maximum slowdown

47 MISE: Pros and Cons Upsides: Downsides:
Simple new insight to estimate slowdown Much more accurate slowdown estimations than prior techniques (STFM, FST) Enables a number of QoS mechanisms that can use slowdown estimates to satisfy performance requirements Downsides: Slowdown estimation is not perfect - there are still errors Does not take into account caches and other shared resources in slowdown estimation

48 More on MISE Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. [Slides (pptx)]

49 Handling Memory Interference In Multithreaded Applications
Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Parallel Application Memory Scheduling" Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. [Slides (pptx)]

50 Multithreaded (Parallel) Applications
Threads in a multi-threaded application can be inter-dependent As opposed to threads from different applications Such threads can synchronize with each other Locks, barriers, pipeline stages, condition variables, semaphores, … Some threads can be on the critical path of execution due to synchronization; some threads are not Even within a thread, some “code segments” may be on the critical path of execution; some are not

51 Critical Sections Enforce mutually exclusive access to shared data
Only one thread can be executing it at a time Contended critical sections make threads wait → threads causing serialization can be on the critical path Each thread:
    loop {
        Compute              // non-critical section (N)
        lock(A)
        Update shared data   // critical section (C)
        unlock(A)
    }

52 Barriers Synchronization point
Threads have to wait until all threads reach the barrier Last thread arriving at the barrier is on the critical path Each thread:
    loop1 { Compute }
    barrier
    loop2 { Compute }

53 Stages of Pipelined Programs
Loop iterations are statically divided into code segments called stages Threads execute stages on different cores Thread executing the slowest stage is on the critical path
    loop {
        Compute1   // stage A
        Compute2   // stage B
        Compute3   // stage C
    }

54 Handling Interference in Parallel Applications
Threads in a multithreaded application are inter-dependent Some threads can be on the critical path of execution due to synchronization; some threads are not How do we schedule requests of inter-dependent threads to maximize multithreaded application performance? Idea: Estimate limiter threads likely to be on the critical path and prioritize their requests; shuffle priorities of non-limiter threads to reduce memory interference among them [Ebrahimi+, MICRO’11] Hardware/software cooperative limiter thread estimation: Thread executing the most contended critical section Thread executing the slowest pipeline stage Thread that is falling behind the most in reaching a barrier PAMS Micro 2011 Talk
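
A heavily hedged sketch of the limiter-thread idea above: pick the thread most likely on the critical path from simple per-thread indicators, then prioritize its memory requests. The metrics, their units, and the selection rule are illustrative stand-ins, not the hardware/software interface PAMS actually defines.

    #include <stdio.h>

    /* Illustrative per-thread hints (not the PAMS interface). */
    typedef struct {
        int  id;
        long lock_contention;  /* waiting caused in its most contended critical section */
        long stage_time;       /* time per iteration of the pipeline stage it executes  */
        long barrier_deficit;  /* how far behind the thread is in reaching a barrier    */
    } ThreadInfo;

    /* Crude stand-in: the limiter is the thread with the largest of the three
       indicators (most contended critical section, slowest stage, most behind). */
    int estimate_limiter(const ThreadInfo *t, int n) {
        int limiter = t[0].id;
        long best = -1;
        for (int i = 0; i < n; i++) {
            long score = t[i].lock_contention;
            if (t[i].stage_time > score)      score = t[i].stage_time;
            if (t[i].barrier_deficit > score) score = t[i].barrier_deficit;
            if (score > best) { best = score; limiter = t[i].id; }
        }
        return limiter;   /* this thread's memory requests would be prioritized */
    }

    int main(void) {
        ThreadInfo threads[] = { {0, 10, 5, 0}, {1, 80, 5, 0}, {2, 10, 5, 30} };
        printf("estimated limiter thread: %d\n", estimate_limiter(threads, 3));
        return 0;
    }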

55 Prioritizing Requests from Limiter Threads
[Figure: execution timelines of Threads A, B, C, and D showing non-critical sections, critical sections 1 and 2, barriers, and waiting for sync or lock; the most contended critical section is 1, the identified limiter thread is C, and prioritizing the limiter thread's requests saves cycles on the critical path]

56 Parallel App Mem Scheduling: Pros and Cons
Upsides: Improves the performance of multi-threaded applications Provides a mechanism for estimating “limiter threads” Opens a path for slowdown estimation for multi-threaded applications Downsides: What if there are multiple multi-threaded applications running together? Limiter thread estimation can become complex

57 More on PAMS Eiman Ebrahimi, Rustam Miftakhutdinov, Chris Fallin, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Parallel Application Memory Scheduling" Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. [Slides (pptx)]

58 Other Ways of Handling Memory Interference

59 Fundamental Interference Control Techniques
Goal: to reduce/control inter-thread memory interference 1. Prioritization or request scheduling 2. Data mapping to banks/channels/ranks 3. Core/source throttling 4. Application/thread scheduling

60 Designing QoS-Aware Memory Systems: Approaches
Smart resources: Design each shared resource to have a configurable interference control/reduction mechanism QoS-aware memory controllers QoS-aware interconnects QoS-aware caches Dumb resources: Keep each resource free-for-all, but reduce/control interference by injection control or data mapping Source throttling to control access to memory system QoS-aware data mapping to memory controllers QoS-aware thread scheduling to cores

61 Memory Channel Partitioning
Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning" 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. [Slides (pptx)] MCP Micro 2011 Talk

62 Observation: Modern Systems Have Multiple Channels
Core Memory Red App Memory Controller Channel 0 Core Memory Blue App Memory Controller Channel 1 A new degree of freedom Mapping data across multiple channels Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

63 Data Mapping in Current Systems
Core Memory Red App Memory Controller Page Channel 0 Core Memory Blue App Memory Controller Channel 1 Causes interference between applications’ requests Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

64 Partitioning Channels Between Applications
Core Memory Red App Memory Controller Page Channel 0 Core Memory Blue App Memory Controller Channel 1 Eliminates interference between applications’ requests Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

65 Overview: Memory Channel Partitioning (MCP)
Goal Eliminate harmful interference between applications Basic Idea Map the data of badly-interfering applications to different channels Key Principles Separate low and high memory-intensity applications Separate low and high row-buffer locality applications Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

66 Key Insight 1: Separate by Memory Intensity
High memory-intensity applications interfere with low memory-intensity applications in shared memory channels [Figure: time units 1 through 5 for the red and blue applications on channels 0 and 1 under conventional page mapping vs. channel partitioning; partitioning saves cycles] Map data of low and high memory-intensity applications to different channels

67 Key Insight 2: Separate by Row-Buffer Locality
High row-buffer locality applications interfere with low row-buffer locality applications in shared memory channels [Figure: request buffer state and service order (time units 1 through 6) for requests R0 to R4 on channels 0 and 1 under conventional page mapping vs. channel partitioning; partitioning saves cycles] Map data of low and high row-buffer locality applications to different channels

68 Memory Channel Partitioning (MCP) Mechanism
Hardware 1. Profile applications 2. Classify applications into groups 3. Partition channels between application groups 4. Assign a preferred channel to each application 5. Allocate application pages to preferred channel System Software Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.
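
A sketch of steps 2 through 4, taking the profile from step 1 as input, under illustrative assumptions: memory intensity measured as MPKI, row-buffer locality as row-buffer hit rate, and the thresholds and three-channel assignment chosen only for this example, not taken from the MCP paper.

    #include <stdio.h>

    /* Illustrative application profile for MCP-style classification. */
    typedef struct {
        const char *name;
        double mpki;          /* misses per kilo-instruction (memory intensity proxy) */
        double rb_hit_rate;   /* row-buffer hit rate (row-buffer locality proxy)      */
    } AppProfile;

    /* Assign a preferred channel: low-intensity applications share one channel,
       high-intensity applications are split by row-buffer locality.             */
    int preferred_channel(AppProfile a) {
        const double intensity_threshold = 5.0;   /* illustrative */
        const double locality_threshold  = 0.5;   /* illustrative */
        if (a.mpki < intensity_threshold) return 0;             /* low intensity     */
        return (a.rb_hit_rate >= locality_threshold) ? 1 : 2;   /* split by locality */
    }

    int main(void) {
        AppProfile apps[] = {
            { "appA",  1.2, 0.7 },   /* low memory intensity                     */
            { "appB", 25.0, 0.2 },   /* high intensity, low row-buffer locality  */
            { "appC", 30.0, 0.9 },   /* high intensity, high row-buffer locality */
        };
        for (int i = 0; i < 3; i++)
            printf("%s -> preferred channel %d\n",
                   apps[i].name, preferred_channel(apps[i]));
        return 0;
    }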

69 Interval Based Operation
Current Interval → Next Interval (time) 1. Profile applications 2. Classify applications into groups 3. Partition channels between groups 4. Assign preferred channel to applications 5. Enforce channel preferences

70 Observations Applications with very low memory-intensity rarely access memory → Dedicating channels to them results in precious memory bandwidth waste They have the most potential to keep their cores busy → We would really like to prioritize them They interfere minimally with other applications → Prioritizing them does not hurt others

71 Integrated Memory Partitioning and Scheduling (IMPS)
Always prioritize very low memory-intensity applications in the memory scheduler Use memory channel partitioning to mitigate interference between other applications Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

72 Hardware Cost Memory Channel Partitioning (MCP)
Only profiling counters in hardware No modifications to memory scheduling logic 1.5 KB storage cost for a 24-core, 4-channel system Integrated Memory Partitioning and Scheduling (IMPS) A single bit per request Scheduler prioritizes based on this single bit Muralidhara et al., “Memory Channel Partitioning,” MICRO’11.

73 Performance of Channel Partitioning
Averaged over 240 workloads [Figure: system performance improvements of 1%, 5%, 7%, and 11% marked for the compared schemes] Better system performance than the best previous scheduler at lower hardware cost

74 Combining Multiple Interference Control Techniques
Combined interference control techniques can mitigate interference much more than a single technique alone can do The key challenge is: Deciding what technique to apply when Partitioning work appropriately between software and hardware

75 MCP and IMPS: Pros and Cons
Upsides: Keeps the memory scheduling hardware simple Combines multiple interference reduction techniques Can provide performance isolation across applications mapped to different channels General idea of partitioning can be extended to smaller granularities in the memory hierarchy: banks, subarrays, etc. Downsides: Reacting is difficult if workload changes behavior after profiling Overhead of moving pages between channels restricts benefits

76 More on Memory Channel Partitioning
Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning" Proceedings of the 44th International Symposium on Microarchitecture (MICRO), Porto Alegre, Brazil, December 2011. [Slides (pptx)]

77 Fundamental Interference Control Techniques
Goal: to reduce/control inter-thread memory interference 1. Prioritization or request scheduling 2. Data mapping to banks/channels/ranks 3. Core/source throttling 4. Application/thread scheduling

78 Fairness via Source Throttling
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems" 15th Intl. Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Pittsburgh, PA, March 2010. [Slides (pdf)] FST ASPLOS 2010 Talk

79 Many Shared Resources Shared Memory Resources Shared Cache On-chip
[Figure: Cores 0, 1, 2, ..., N share on-chip memory resources (a shared cache and a memory controller) and, across the chip boundary, off-chip DRAM Banks 0, 1, 2, ..., K]

80 The Problem with “Smart Resources”
Independent interference control mechanisms in caches, interconnect, and memory can contradict each other Explicitly coordinating mechanisms for different resources requires complex implementation How do we enable fair sharing of the entire memory system by controlling interference in a coordinated manner?

81 Source Throttling: A Fairness Substrate
Key idea: Manage inter-thread interference at the cores (sources), not at the shared resources Dynamically estimate unfairness in the memory system Feed back this information into a controller Throttle cores’ memory access rates accordingly Whom to throttle and by how much depends on performance target (throughput, fairness, per-thread QoS, etc) E.g., if unfairness > system-software-specified target then throttle down core causing unfairness & throttle up core that was unfairly treated Ebrahimi et al., “Fairness via Source Throttling,” ASPLOS’10, TOCS’12.

82 Fairness via Source Throttling (FST)
Two components (interval-based) Run-time unfairness evaluation (in hardware) Dynamically estimates the unfairness (application slowdowns) in the memory system Estimates which application is slowing down which other Dynamic request throttling (hardware or software) Adjusts how aggressively each core makes requests to the shared resources Throttles down request rates of cores causing unfairness Limit miss buffers, limit injection rate

83 Fairness via Source Throttling (FST) [ASPLOS’10]
[Figure: interval-based operation. Each interval, slowdown estimation feeds the Runtime Unfairness Evaluation, which passes the Unfairness Estimate, App-slowest, and App-interfering to Dynamic Request Throttling] 1- Estimating system unfairness 2- Find app. with the highest slowdown (App-slowest) 3- Find app. causing most interference for App-slowest (App-interfering) if (Unfairness Estimate > Target) { 1- Throttle down App-interfering (limit injection rate and parallelism) 2- Throttle up App-slowest }
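
A minimal sketch of the per-interval decision logic on this slide. The slowdown and interference inputs would come from the runtime unfairness evaluation hardware; the unfairness definition (maximum over minimum slowdown) and the 10% throttling step are assumptions for illustration, not the exact FST implementation.

    #include <stdio.h>

    #define NUM_APPS 4

    /* One FST-style interval: estimate unfairness, find App-slowest and
       App-interfering, then adjust their throttling levels.             */
    void fst_interval(const double slowdown[NUM_APPS],
                      const double interference_to_slowest[NUM_APPS],
                      double throttle[NUM_APPS], double target_unfairness) {
        int slowest = 0;
        double max_sd = 0.0, min_sd = 1e9, max_intf = -1.0;

        for (int i = 0; i < NUM_APPS; i++) {
            if (slowdown[i] > max_sd) { max_sd = slowdown[i]; slowest = i; }
            if (slowdown[i] < min_sd)   min_sd = slowdown[i];
        }
        double unfairness = max_sd / min_sd;   /* one common unfairness definition */

        if (unfairness > target_unfairness) {
            int interfering = (slowest + 1) % NUM_APPS;
            for (int i = 0; i < NUM_APPS; i++)
                if (i != slowest && interference_to_slowest[i] > max_intf) {
                    max_intf = interference_to_slowest[i];
                    interfering = i;
                }
            throttle[interfering] -= 0.10;   /* throttle down App-interfering */
            throttle[slowest]     += 0.10;   /* throttle up App-slowest       */
            if (throttle[interfering] < 0.02) throttle[interfering] = 0.02;
            if (throttle[slowest] > 1.0)      throttle[slowest] = 1.0;
        }
    }

    int main(void) {
        double sd[NUM_APPS]   = { 1.2, 3.5, 1.4, 1.1 };   /* estimated slowdowns */
        double intf[NUM_APPS] = { 0.6, 0.0, 0.2, 0.1 };   /* interference caused */
        double thr[NUM_APPS]  = { 1.0, 1.0, 1.0, 1.0 };   /* throttling levels   */
        fst_interval(sd, intf, thr, 1.4);
        for (int i = 0; i < NUM_APPS; i++)
            printf("app %d throttling level: %.2f\n", i, thr[i]);
        return 0;
    }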

84 Dynamic Request Throttling
Goal: Adjust how aggressively each core makes requests to the shared memory system Mechanisms: Miss Status Holding Register (MSHR) quota Controls the number of concurrent requests accessing shared resources from each application Request injection frequency Controls how often memory requests are issued to the last level cache from the MSHRs

85 Dynamic Request Throttling
Throttling level assigned to each core determines both MSHR quota and request injection rate (total # of MSHRs: 128)
  Throttling level   MSHR quota   Request injection rate
  100%               128          Every cycle
  50%                64           Every other cycle
  25%                32           Once every 4 cycles
  10%                12           Once every 10 cycles
  5%                 6            Once every 20 cycles
  4%                 5            Once every 25 cycles
  3%                 3            Once every 30 cycles
  2%                 2            Once every 50 cycles

86 System Software Support
Different fairness objectives can be configured by system software Keep maximum slowdown in check Estimated Max Slowdown < Target Max Slowdown Keep slowdown of particular applications in check to achieve a particular performance target Estimated Slowdown(i) < Target Slowdown(i) Support for thread priorities Weighted Slowdown(i) = Estimated Slowdown(i) x Weight(i)
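
A small sketch of how these configurable objectives could be expressed in system software; the data structure and example values are illustrative, only the three checks themselves come from the slide.

    #include <stdio.h>
    #include <stdbool.h>

    #define NUM_APPS 4

    /* Illustrative per-application state; slowdown estimates come from FST. */
    typedef struct {
        double estimated_slowdown;
        double target_slowdown;   /* per-application target (second objective) */
        double weight;            /* thread priority weight (third objective)  */
    } AppState;

    /* Objective 1: keep the maximum slowdown in check. */
    bool max_slowdown_ok(const AppState *a, int n, double target_max) {
        double max_sd = 0.0;
        for (int i = 0; i < n; i++)
            if (a[i].estimated_slowdown > max_sd) max_sd = a[i].estimated_slowdown;
        return max_sd < target_max;
    }

    /* Objective 2: keep each application's slowdown below its own target. */
    bool per_app_slowdown_ok(const AppState *a, int n) {
        for (int i = 0; i < n; i++)
            if (a[i].estimated_slowdown >= a[i].target_slowdown) return false;
        return true;
    }

    /* Objective 3: weighted slowdown to reflect thread priorities. */
    double weighted_slowdown(AppState a) {
        return a.estimated_slowdown * a.weight;
    }

    int main(void) {
        AppState apps[NUM_APPS] = {
            { 1.3, 2.0, 1.0 }, { 2.6, 3.0, 2.0 }, { 1.1, 2.0, 1.0 }, { 1.8, 2.0, 1.0 }
        };
        printf("max slowdown in check: %d\n", max_slowdown_ok(apps, NUM_APPS, 3.0));
        printf("per-app targets met:   %d\n", per_app_slowdown_ok(apps, NUM_APPS));
        printf("weighted slowdown of app 1: %.2f\n", weighted_slowdown(apps[1]));
        return 0;
    }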

87 Source Throttling Results: Takeaways
Source throttling alone provides better performance than a combination of “smart” memory scheduling and fair caching Decisions made at the memory scheduler and the cache sometimes contradict each other Neither source throttling alone nor “smart resources” alone provides the best performance Combined approaches are even more powerful Source throttling and resource-based interference control

88 Source Throttling: Ups and Downs
Advantages + Core/request throttling is easy to implement: no need to change the memory scheduling algorithm + Can be a general way of handling shared resource contention + Can reduce overall load/contention in the memory system Disadvantages - Requires slowdown estimations → difficult to estimate - Thresholds can become difficult to optimize → throughput loss due to too much throttling → can be difficult to find an overall-good configuration

89 More on Source Throttling (I)
Eiman Ebrahimi, Chang Joo Lee, Onur Mutlu, and Yale N. Patt, "Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems" Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), Pittsburgh, PA, March 2010. [Slides (pdf)]

90 More on Source Throttling (II)
Kevin Chang, Rachata Ausavarungnirun, Chris Fallin, and Onur Mutlu, "HAT: Heterogeneous Adaptive Throttling for On-Chip Networks" Proceedings of the 24th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), New York, NY, October 2012. [Slides (pptx) (pdf)]

91 More on Source Throttling (III)
George Nychis, Chris Fallin, Thomas Moscibroda, Onur Mutlu, and Srinivasan Seshan, "On-Chip Networks from a Networking Perspective: Congestion and Scalability in Many-core Interconnects" Proceedings of the 2012 ACM SIGCOMM Conference (SIGCOMM), Helsinki, Finland, August 2012. [Slides (pptx)]

92 Fundamental Interference Control Techniques
Goal: to reduce/control interference 1. Prioritization or request scheduling 2. Data mapping to banks/channels/ranks 3. Core/source throttling 4. Application/thread scheduling Idea: Pick threads that do not badly interfere with each other to be scheduled together on cores sharing the memory system

93 Application-to-Core Mapping to Reduce Interference
Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi, "Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. [Slides (pptx)] Key ideas: Cluster threads to memory controllers (to reduce across chip interference) Isolate interference-sensitive (low-intensity) applications in a separate cluster (to reduce interference from high-intensity applications) Place applications that benefit from memory bandwidth closer to the controller

94 Multi-Core to Many-Core
The processor industry has adopted multi-cores as a power-efficient way for throughput scaling. Today, as processors move from a few cores to a few hundred cores, we face several new challenges. Multi-Core Many-Core

95 Many-Core On-Chip Communication
Applications Memory Controller Light Heavy Shared Cache Bank $ In a many-core processor with several hundred cores, communication plays a key role in system design. I show an example of on-chip communication. An application generates requests to the memory subsystem, which can either be satisfied by caches or have to go all the way to main memory via the memory controllers. Some applications generate few requests and are light, and some generate lots of requests and are thus heavy.

96 Problem: Spatial Task Scheduling
Applications Cores In this paper we try to solve the problem of spatial task scheduling. For spatial scheduling there will be lots of applications with diverse characteristics and there will be lots of cores. The key question is how to map these applications to cores to maximize performance, fairness, and energy efficiency. How to map applications to cores?

97 Challenges in Spatial Task Scheduling
Applications Cores How to reduce communication distance? How to reduce destructive interference between applications? When doing the application-to-core mapping we need to answer several questions. The first obvious question is how do we reduce the distance that requests travel. Then, how do we ensure that requests generated by two different applications do not destructively interfere with each other and degrade the overall performance. Finally, if we have applications that are very sensitive to memory system performance, how do we prioritize them and still ensure good fairness and load balance. How to prioritize applications to improve throughput?

98 Application-to-Core Mapping
Improve Bandwidth Utilization Clustering Balancing Isolation Radial Mapping We take four steps in A2C. First, we adopt clustering to improve locality and reduce interference. Then, we balance the load across the clusters. Next, we isolate sensitive applications so that they suffer minimal interference. Finally, we schedule critical applications close to shared memory system resources. Improve Locality Reduce Interference

99 Step 1 — Clustering Inefficient data mapping to memory and caches
Controller In a naive scheduler, the requests generated by an application could go to all corners of the chip. How do we cluster the data so that requests generated by an application are satisfied by the resources within a cluster? Inefficient data mapping to memory and caches

100 Step 1 — Clustering Improved Locality Reduced Interference Cluster 0
If you could somehow make sure to map the data to the resources within a cluster, then there are two benefits: first, you can improve communication locality; second, you can also reduce the interference between requests generated by different applications. Say we first divide the chip into clusters. Improved Locality Reduced Interference

101 System Performance System performance improves by 17%
Radial mapping was consistently beneficial for all workloads System performance improves by 17%

102 Network Power Average network power consumption reduces by 52%

103 More on App-to-Core Mapping
Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi, "Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. [Slides (pptx)]

104 Interference-Aware Thread Scheduling
An example from scheduling in compute clusters (data centers) Data centers can be running virtual machines

105 How to dynamically schedule VMs onto hosts?
Virtualized Cluster VM VM VM VM App App App App How to dynamically schedule VMs onto hosts? Distributed Resource Management (DRM) policies Core0 Core1 Host LLC DRAM Core0 Core1 Host LLC DRAM In a virtualized cluster, we have a set of physical hosts and many virtual machines. One key question in managing the virtualized cluster is how to dynamically schedule the virtual machines onto the physical hosts. To achieve high resource utilization and energy savings, virtualized systems usually employ distributed resource management (DRM) policies to solve this problem.

106 Conventional DRM Policies
Based on operating-system-level metrics e.g., CPU utilization, memory capacity demand VM App VM App App App Memory Capacity Core0 Core1 Host LLC DRAM Core0 Core1 Host LLC DRAM VM CPU A conventional DRM policy would map them in this manner. Conventional DRM policies usually make the mapping decision based on operating-system-level metrics, such as CPU utilization and memory capacity, and fit the virtual machines into physical hosts dynamically. Here is an example: let us use the height to indicate the CPU utilization and the width to indicate the memory capacity demand. If we have four virtual machines like this, we can fit them into the two hosts. App

107 Microarchitecture-level Interference
Host Core0 Core1 VMs within a host compete for: Shared cache capacity Shared memory bandwidth VM App VM App LLC However, another key aspect that impacts virtual machine performance is microarchitecture-level interference. Virtual machines within a host compete for: 1) shared cache capacity: VMs will evict each other's data from the last-level cache, causing an increase in memory access latency and thus performance degradation; and 2) shared memory bandwidth: VMs' memory requests also contend at the different components of main memory, such as channels, ranks, and banks, also resulting in performance degradation. One question raised here is: can operating-system-level metrics reflect the microarchitecture-level resource interference? DRAM Can operating-system-level metrics capture the microarchitecture-level resource interference?

108 Microarchitecture Unawareness
VM Operating-system-level metrics CPU Utilization Memory Capacity 92% 369 MB 93% 348 MB Microarchitecture-level metrics LLC Hit Ratio Memory Bandwidth 2% 2267 MB/s 98% 1 MB/s App Host Core0 Core1 Host LLC DRAM Memory Capacity Core0 Core1 VM App VM App VM App VM App VM We profiled many benchmarks and found that many of them exhibit similar CPU utilization and memory capacity demand. Take STREAM and gromacs for example: their CPU utilization and memory capacity demand are similar. If we use conventional DRM to schedule them, this is a possible mapping, with the two STREAM instances running on the same host. However, although the CPU and memory usage are similar, the microarchitecture-level resource usage is totally different: STREAM has a very low LLC hit ratio and very high memory bandwidth demand, while gromacs is the opposite. Thus, in this mapping, this host may already be experiencing heavy contention. CPU App App STREAM gromacs LLC DRAM

109 Impact on Performance SWAP Core0 Core1 Host LLC DRAM Core0 Core1 Host
Memory Capacity VM App VM App VM App VM App VM Conventional DRM policies are not aware of microarchitecture-level contention; how much benefit would we get if we considered it? We did an experiment using the STREAM and gromacs benchmarks. When the two STREAM instances are located on the same host, the harmonic mean of IPC across all the VMs is 0.30. CPU SWAP App App STREAM gromacs

110 We need microarchitecture-level interference awareness in DRM!
Impact on Performance We need microarchitecture-level interference awareness in DRM! 49% Core0 Core1 Host LLC DRAM Host Memory Capacity Core0 Core1 VM App VM App VM App VM VM However, after we migrate STREAM to the other host and migrate gromacs back, performance improves by nearly fifty percent, which indicates that we need microarchitecture-level interference awareness in distributed resource management. CPU App App App STREAM gromacs LLC DRAM

111 A-DRM: Architecture-aware DRM
Goal: Take into account microarchitecture-level shared resource interference Shared cache capacity Shared memory bandwidth Key Idea: Monitor and detect microarchitecture-level shared resource interference Balance microarchitecture-level resource usage across the cluster to minimize memory interference while maximizing system performance A-DRM takes into account two main sources of interference at the microarchitecture level: shared last-level cache capacity and memory bandwidth. The key idea of A-DRM is to monitor microarchitecture-level resource usage and balance it across the cluster.
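
A hedged sketch of the detect-and-balance idea: flag a host whose aggregate memory bandwidth demand is far above the cluster average, making one of its VMs a migration candidate. The metric, the 1.5x threshold, and the two-host example are illustrative assumptions, not A-DRM's actual detection policy.

    #include <stdio.h>

    #define NUM_HOSTS 2

    /* Per-host microarchitecture-level usage, as a profiler might report it. */
    typedef struct {
        double mem_bw_gbs;     /* aggregate memory bandwidth demand of its VMs (GB/s) */
        double llc_miss_rate;  /* aggregate last-level cache miss rate of its VMs     */
    } HostProfile;

    /* Return the index of a host whose bandwidth demand is well above average,
       or -1 if no microarchitecture-level imbalance is detected.               */
    int find_overloaded_host(const HostProfile *h, int n) {
        double avg = 0.0;
        for (int i = 0; i < n; i++) avg += h[i].mem_bw_gbs;
        avg /= n;
        for (int i = 0; i < n; i++)
            if (h[i].mem_bw_gbs > 1.5 * avg) return i;   /* illustrative threshold */
        return -1;
    }

    int main(void) {
        HostProfile hosts[NUM_HOSTS] = { { 18.0, 0.40 }, { 2.0, 0.05 } };
        int src = find_overloaded_host(hosts, NUM_HOSTS);
        if (src >= 0)
            printf("host %d is bandwidth-overloaded: migrate a heavy VM away\n", src);
        else
            printf("no microarchitecture-level interference detected\n");
        return 0;
    }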

112 A-DRM: Architecture-aware DRM
Hosts Controller A-DRM: Global Architecture-aware Resource Manager OS+Hypervisor VM App Profiling Engine Architecture-aware Interference Detector ••• Architecture-aware Distributed Resource Management (Policy) In addition to conventional DRM, A-DRM adds 1) an architectural resource profiler at each host, and 2) two components to the Controller: an architecture-aware interference detector, invoked at each scheduling interval to detect microarchitecture-level shared resource interference, and an architecture-aware DRM policy that, once such interference is detected, determines VM migrations to mitigate it. CPU/Memory Capacity Architectural Resources Profiler Migration Engine

113 More on Architecture-Aware DRM
Hui Wang, Canturk Isci, Lavanya Subramanian, Jongmoo Choi, Depei Qian, and Onur Mutlu, "A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters" Proceedings of the 11th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE), Istanbul, Turkey, March 2015. [Slides (pptx) (pdf)]

114 Interference-Aware Thread Scheduling
Advantages + Can eliminate/minimize interference by scheduling “symbiotic applications” together (as opposed to just managing the interference) + Less intrusive to hardware (less need to modify the hardware resources) Disadvantages and Limitations -- High overhead to migrate threads and data between cores and machines -- Does not work (well) if all threads are similar and they interfere

115 Summary: Fundamental Interference Control Techniques
Goal: to reduce/control interference 1. Prioritization or request scheduling 2. Data mapping to banks/channels/ranks 3. Core/source throttling 4. Application/thread scheduling Best is to combine all. How would you do that?

116 Summary: Memory QoS Approaches and Techniques
Approaches: Smart vs. dumb resources Smart resources: QoS-aware memory scheduling Dumb resources: Source throttling; channel partitioning Both approaches are effective in reducing interference No single best approach for all workloads Techniques: Request/thread scheduling, source throttling, memory partitioning All approaches are effective in reducing interference Can be applied at different levels: hardware vs. software No single best technique for all workloads Combined approaches and techniques are the most powerful Integrated Memory Channel Partitioning and Scheduling [MICRO’11] MCP Micro 2011 Talk

117 Summary: Memory Interference and QoS
QoS-unaware memory → uncontrollable and unpredictable system Providing QoS awareness improves performance, predictability, fairness, and utilization of the memory system Discussed many new techniques to: Minimize memory interference Provide predictable performance Many new research ideas needed for integrated techniques and closing the interaction with software

118 What Did We Not Cover? Prefetch-aware shared resource management
DRAM-controller-cache co-design Cache interference management Interconnect interference management Write-read scheduling

119 Some Other Ideas …

120 Decoupled DMA w/ Dual-Port DRAM [PACT 2015]

121 Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM
Decoupled Direct Memory Access Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, Onur Mutlu

122 Logical System Organization
processor CPU access main memory IO access In the logical organization, the system consists of the processor, main memory, and IO devices. The processor accesses main memory to get data for executing instructions (CPU access). IO devices transfer data to main memory (IO access). Therefore, main memory connects the processor and IO devices as an intermediate layer. IO devices Main memory connects processor and IO devices as an intermediate layer

123 Physical System Implementation
High Pin Cost in Processor processor IO access High Contention in Memory Channel CPU access main memory IO access In the physical implementation, however, IO devices are connected to the processor, and the DMA engine in the processor manages the IO traffic through the memory channel. This difference between the logical and physical organization leads to high contention in the memory channel and a high pin count in the processor chip. IO devices

124 Enabling IO channel, decoupled & isolated from CPU channel
Our Approach processor IO access CPU access main memory IO access To avoid these problems, our approach provides a direct connection between IO devices and main memory, which enables IO accesses without interfering with CPU accesses. We call this new memory system architecture Decoupled Direct Memory Access. IO devices Enabling IO channel, decoupled & isolated from CPU channel

125 Executive Summary Problem
CPU and IO accesses contend for the shared memory channel Our Approach: Decoupled Direct Memory Access (DDMA) Design a new DRAM architecture with two independent data ports → Dual-Data-Port DRAM Connect one port to the CPU and the other port to IO devices → Decouple CPU and IO accesses Application Communication between compute units (e.g., CPU – GPU) In-memory communication (e.g., bulk in-memory copy/init.) Memory-storage communication (e.g., page fault, IO prefetch) Result Significant performance improvement (20% in 2 ch. & 2 rank system) CPU pin count reduction (4.5%) This is the summary of today’s talk. The problem is that CPU and IO accesses contend for the memory channel. To mitigate this problem, we design a new DRAM with two independent data ports, called Dual-Data-Port DRAM. We connect one port to the CPU and the other port to IO devices, thereby isolating CPU and IO traffic. We call this Decoupled Direct Memory Access (DDMA). We can use this new memory system for communication between compute units, in-memory communication, and memory-storage communication. We evaluate our DDMA-based memory system and show that it provides significant performance improvement with a reduction in CPU pin count.

126 Outline 1. Problem 1. Problem 2. Our Approach 3. Dual-Data-Port DRAM
This is the outline of today’s talk. We first describe more details of the problem in conventional memory system. Then we provide the overview of our proposed memory system. Then we introduce one of major component, dual-data-port DRAM. We explain how we can leverage our new memory system. We, then, provide our evaluation methodology and results. 4. Applications for DDMA 5. Evaluation

127 Problem 1: Memory Channel Contention
(Figure: the processor chip contains the memory controller, the DMA engine, and IO interfaces to graphics, network, storage, and USB; all CPU and IO data moves over the single memory channel to the DRAM chips.) This is a more detailed view of a system. The CPU is connected to main memory over the memory channel, and the memory controller manages accesses to main memory. IO devices are connected to the CPU over IO interfaces on the CPU chip, and the direct memory access (DMA) unit controls the IO traffic. When there is a request for data on an IO device, the requested data is first transferred to main memory over both the IO interface and the memory channel; the CPU then accesses main memory again to bring in the requested data. In today’s multi-core systems, all of these data transfers can happen concurrently, leading to memory channel contention.

128 Problem 1: Memory Channel Contention
33.5% on average (Figure: fraction of execution time spent on IO accesses for GPGPU workloads.) To see the potential memory channel contention, we observe the IO traffic of GPGPU workloads. The x-axis shows the GPGPU workloads and the y-axis shows the fraction of execution time consumed by IO accesses. Some GPGPU workloads spend most of their execution time on computation, but many spend a significant fraction of total execution time transferring data between the CPU and the GPU: 33.5% on average across 8 GPGPU workloads. Therefore, there is high demand for IO bandwidth, leading to high memory channel contention. A large fraction of the execution time is spent on IO accesses

129 Problem 2: High Cost for IO Interfaces
(Figure: pin breakdown of an Intel i7 processor, with power pins (959 pins in total) and without power pins (359 pins); categories: memory (2 ch.), IO interface, power, others.) The second problem is the high area cost of IO interfaces on the processor chip. The chart shows the processor pin count of an Intel i7 processor: 10.6% of all pins are used for IO interfaces, and when power pins are excluded, about 28.4% of the remaining pins are used for IO interfaces. Therefore, integrating IO interfaces on the processor chip leads to a high processor pin count and potentially a large processor chip area. Integrating IO interface on the processor chip leads to high area cost

130 Shared Memory Channel Memory channel contention for IO access and CPU access High area cost for integrating IO interfaces on processor chip In summary, IO and CPU accesses contend for memory channel. Integrating more IO interfaces for higher bandwidth leads to high area cost on processor chip.

131 Outline 1. Problem 2. Our Approach 3. Dual-Data-Port DRAM
This is the outline of today’s talk. We first describe more details of the problem in conventional memory system. Then we provide the overview of our proposed memory system. Then we introduce one of major component, dual-data-port DRAM. We explain how we can leverage our new memory system. We, then, provide our evaluation methodology and results. 4. Applications for DDMA 5. Evaluation

132 ? Our Approach CPU Processor Chip memory controller DMA CTRL. DMA
(Figure: the IO interfaces and part of the DMA engine move off the processor chip onto a DMA-IO chip; the processor keeps a DMA control unit and a control channel to the off-chip DMA-IO; the DRAM becomes a Dual-Data-Port DRAM with Port 1 to the CPU and Port 2 to the DMA-IO chip serving graphics, network, storage, and USB.) As explained, our approach directly connects IO devices to main memory. To this end, we first migrate the IO interfaces and a fraction of the DMA unit out of the processor chip; we call this chip the DMA-IO chip. A DMA control unit remains within the processor chip and manages the off-chip DMA-IO. To connect this off-chip DMA-IO to main memory, we devise a new DRAM architecture, Dual-Data-Port DRAM, which has two independent data ports, both managed by the memory controller.

133 ? Our Approach Decoupled Direct Memory Access CPU ACCESS IO ACCESS CPU
(Figure: Decoupled Direct Memory Access. CPU accesses use Port 1 of the Dual-Data-Port DRAM through the processor's memory controller; IO accesses from graphics, network, storage, and USB use Port 2 through the off-chip DMA IO interface, under the processor's DMA control.)

134 Outline 1. Problem 2. Our Approach 3. Dual-Data-Port DRAM
This is the outline of today’s talk. We first describe more details of the problem in conventional memory system. Then we provide the overview of our proposed memory system. Then we introduce one of major component, dual-data-port DRAM. We explain how we can leverage our new memory system. We, then, provide our evaluation methodology and results. 4. Applications for DDMA 5. Evaluation

135 Background: DRAM Operation
(Figure: the memory controller at the CPU drives a control channel and a data channel; inside the DRAM chip, peripheral logic connects the control port and the data port to the banks.) We first explain the conventional DRAM organization and its operation. A DRAM chip consists of multiple banks and peripheral logic, and is connected to the processor chip over the memory channel, which consists of a control channel and a data channel. The DRAM peripheral logic connects these channels to the banks through a control path and a data path. Accessing DRAM starts by activating a bank; the bank is then ready to transfer data. On receiving a read command, data is transferred over the data path and the data channel. In summary, the DRAM peripheral logic: i) controls banks, and ii) transfers data over the memory channel

136 Problem: Single Data Port
(Figure: many banks, but a single data port.) This figure shows an access scenario with multiple banks. Both banks are ready to serve data, and the memory controller has requests for both. On receiving a read command, one bank transfers its data; only after that transfer finishes can the memory controller issue the next read command and transfer the next bank’s data. Even though multiple banks are ready to transfer data, the peripheral logic can transfer data only one burst at a time. Requests are served serially due to the single data port

137 Problem: Single Data Port
(Timing: on the control port, each read command takes one cycle; on the data port, each data burst occupies multiple cycles, so with a single data port the two DATA bursts serialize, whereas with two data ports they overlap.) Specifically, issuing a command takes one clock cycle on the control channel, but transferring data takes multiple cycles on the data channel, for example 4 cycles for a burst of 8. Therefore, the shared data path is what serializes DRAM accesses. What about a DRAM with two data ports?
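To make the serialization concrete, here is a small sketch (mine, not from the paper) that counts how many data-bus cycles a batch of reads needs with one versus two data ports, assuming the 1-cycle command and 4-cycle burst from the slide and ignoring all other DRAM timing constraints.

def total_data_cycles(num_reads, num_data_ports, burst_cycles=4):
    # Each data port streams one burst at a time; reads are spread across ports.
    bursts_per_port = -(-num_reads // num_data_ports)  # ceiling division
    return bursts_per_port * burst_cycles

if __name__ == "__main__":
    for ports in (1, 2):
        print(ports, "port(s):", total_data_cycles(8, ports), "data-bus cycles for 8 reads")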

138 twice the bandwidth & independent data ports with low overhead
Dual-Data-Port DRAM (Figure: each bank's data bus feeds a per-bank multiplexer that can route data to data port 1 (upper) or data port 2 (lower); a port-select signal on the control channel drives the multiplexers. Overhead: area +1.6%, 20 extra pins.) We add a second data port and data path. To manage the connection between a bank and the two data ports, we add a multiplexer per bank; the multiplexer can connect the bank data bus to either the upper or the lower data port. To control the multiplexers, we add a port-select signal to the control channel. We call this new DRAM architecture Dual-Data-Port DRAM: twice the bandwidth & independent data ports with low overhead

139 DDP-DRAM Memory System
(Figure: the control channel now carries a port select; data port 1 connects to the CPU channel and data port 2 connects to the IO channel and the DDMA IO interface.) We integrate this Dual-Data-Port DRAM into a system by connecting one data port to the CPU and the other to the DDMA-IO. In our architecture, both data ports and the DDMA-IO are controlled by the memory controller in the processor.

140 Three Data Transfer Modes
CPU Access: Access through CPU channel DRAM read/write with CPU port selection IO Access: Access through IO channel DRAM read/write with IO port selection Port Bypass: Direct transfer between channels DRAM access with port bypass selection Our Dual-Data-Port DRAM supports three access modes: CPU access, IO access, and port bypass. Let us take a look at each mode in more detail.

141 memory controller at CPU
1. CPU Access Mode memory controller at CPU control channel with CPU channel CPU channel CPU channel control channel with port select data port 1 data port bank bank READY periphery mux read control port control port bank In CPU access mode, memory controller simply issues read command with CPU port selection. Then, DRAM periphery builds the connection between bank and CPU port, and transfer data to CPU, which is same as DRAM access in conventional systems. mux data port 2 IO channel DDMA IO interface

142 memory controller at CPU
2. IO Access Mode memory controller at CPU control channel with port select control channel with IO channel CPU channel data port 1 bank READY bank periphery mux read control port control port bank In IO Access mode, memory controller issues read/write command with IO port selection. Then, DRAM peripheral logic builds the connection between bank and IO port, and transfer data to IO channel. DDMA-IO, then, read data from IO channel. mux data port 2 data port IO channel IO channel DDMA IO interface

143 memory controller at CPU
3. Port Bypass Mode memory controller at CPU control channel with port select control channel with port bypass data port 2 data port 1 CPU channel CPU channel data port bank bank periphery mux control port bank In Port Bypass mode, memory controller issues read/write command with both ports selection. Then, DRAM peripheral logic builds the connection between these two ports, then memory controller and DDMA-IO directly communicate over DRAM peripheral logic. mux data port IO channel IO channel DDMA IO interface

144 Outline 1. Problem 2. Our Approach 3. Dual-Data-Port DRAM
This is the outline of today’s talk. We first describe more details of the problem in conventional memory system. Then we provide the overview of our proposed memory system. Then we introduce one of major component, dual-data-port DRAM. We explain how we can leverage our new memory system. We, then, provide our evaluation methodology and results. 4. Applications for DDMA 5. Evaluation

145 Three Applications for DDMA
Communication b/w Compute Units CPU-GPU communication In-Memory Communication and Initialization Bulk page copy/initialization Communication b/w Memory and Storage Serving page fault/file read & write First, DDMA supports communication between compute units. For example, CPU-GPU communication. Second, DDMA supports In-memory communication and initialization. For example, bulk page copy and initialization can perform through DDMA. Third, DDMA support communication between memory and storage. For example, page fault, file read and write. This feature also enables aggressive IO prefetching. Let us take a look at more details of these mechanisms.

146 1. Compute Unit ↔ Compute Unit
(Figure: a CPU system and a GPU system, each with its own DDP-DRAM, memory controller, DDMA control, and DDMA IO interface; data flows from the CPU DRAM through the CPU DDMA-IO to the GPU DDMA-IO and into the GPU DRAM.) This figure shows a CPU system and a GPU system, each with its own memory system. The communication scenario is transferring data from CPU memory to GPU memory. The CPU memory controller issues read commands with IO port selection to transfer data to the CPU DDMA-IO. The CPU DDMA-IO then transfers the data to the GPU DDMA-IO, which notifies the GPU of the incoming data. Finally, the GPU memory controller issues write commands to transfer the data into GPU DRAM. Transfer data through DDMA without interfering with CPU/GPU memory accesses

147 2. In-Memory Communication
(Figure: within one system, data moves from a source location in DDP-DRAM out through the IO port to the DDMA-IO and back in to a destination location.) The communication scenario is migrating data from one location to another within DRAM. The memory controller first issues a read command with IO port selection to transfer the data to the DDMA-IO, and then issues a write command with IO port selection to transfer the data back into DRAM. Transfer data in DRAM through DDMA without interfering with CPU memory accesses

148 3. Memory ↔ Storage (Figure: storage is the source; data moves through the DDMA IO interface into DDP-DRAM as the destination, with an acknowledgment to the CPU.) The communication scenario in this figure is transferring data from storage to main memory. The DDMA-IO first brings the data in from storage and then notifies the CPU of the incoming traffic. The memory controller then issues write commands with IO port selection to transfer the data into DRAM. Transfer data from storage through DDMA without interfering with CPU memory accesses

149 Outline 1. Problem 2. Our Approach 3. Dual-Data-Port DRAM
This is the outline of today’s talk. We first describe more details of the problem in conventional memory system. Then we provide the overview of our proposed memory system. Then we introduce one of major component, dual-data-port DRAM. We explain how we can leverage our new memory system. We, then, provide our evaluation methodology and results. 4. Applications for DDMA 5. Evaluation

150 Evaluation Methods System Workloads Processor: 4 – 16 cores
LLC: 16-way associative, 512KB private cache-slice/core Memory: 1 – 4 ranks and 1 – 4 channels Workloads Memory intensive: SPEC CPU2006, TPC, stream (31 benchmarks) CPU-GPU communication intensive: polybench (8 benchmarks) In-memory communication intensive: apache, bootup, compiler, filecopy, mysql, fork, shell, memcached (8 in total) We model systems that have 4 to 16 cores and memory systems that have 1 to 4 ranks and 1 to 4 channels. We build workloads by randomly selecting from three benchmark categories: memory-intensive benchmarks from SPEC CPU2006, TPC, and stream; CPU-GPU communication benchmarks from Polybench; and in-memory communication-intensive benchmarks.

151 High performance improvement
Performance (2 Channel, 2 Rank) (Figure: two bar charts of performance improvement vs. core count, one for CPU-GPU communication-intensive workloads and one for in-memory communication-intensive workloads.) We first observe the performance improvement in a 2-rank, 2-channel memory system. The left chart is for CPU-GPU communication-intensive workloads; the x-axis is the number of cores and the y-axis is weighted speedup improvement. Our mechanism provides significant performance improvement, and the improvement grows with core count: more cores generate more in-flight requests to main memory, leading to higher contention in the memory channel, so decoupling IO accesses from the memory channel helps more. We observe similar results for in-memory communication-intensive workloads. High performance improvement More performance improvement at higher core count

152 Performance increases with rank count
Performance on Various Systems (Figure: performance improvement vs. channel count and vs. rank count.) Now we vary the memory system configuration in terms of the number of ranks and channels. When increasing the number of channels, our mechanism provides smaller benefits, mainly due to reduced memory channel contention in systems with more channels. When increasing the number of ranks, our mechanism provides larger benefits, mainly due to higher contention in systems with more ranks. In summary, our mechanism helps more with more ranks and fewer channels. Performance increases with rank count

153 DDMA achieves higher performance at lower processor pin count
DDMA vs. Dual Channel (Figure: left, performance of a 1-channel baseline, 1 channel + DDMA, and 2 channels; right, their processor pin counts: 1103, 959, and 915 pins appear in the chart.) We compare our mechanism against simply adding a channel. Our mechanism provides significant performance improvement over the 1-channel baseline, though less than a 2-channel memory system. However, adding a memory channel requires many additional processor pins (144 pins per channel), whereas DDMA reduces processor pins because the IO interfaces move off the processor chip. DDMA achieves higher performance at lower processor pin count

155 More on Decoupled DMA Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, and Onur Mutlu, "Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM" Proceedings of the 24th International Conference on Parallel Architectures and Compilation Techniques (PACT), San Francisco, CA, USA, October 2015. [Slides (pptx) (pdf)]

155 Prof. Onur Mutlu ETH Zürich Fall 2017 2 November 2017
Computer Architecture Lecture 13: Memory Interference and Quality of Service (II) Prof. Onur Mutlu ETH Zürich Fall 2017 2 November 2017

156 We did not cover the following slides in lecture
We did not cover the following slides in lecture. These are for your preparation for the next lecture.

157 Multi-Core Caching Issues

158 Multi-core Issues in Caching
How does the cache hierarchy change in a multi-core system? Private cache: Cache belongs to one core (a shared block can be in multiple caches) Shared cache: Cache is shared by multiple cores CORE 0 CORE 1 CORE 2 CORE 3 L2 CACHE DRAM MEMORY CONTROLLER CORE 0 CORE 1 CORE 2 CORE 3 DRAM MEMORY CONTROLLER L2 CACHE L2 CACHE

159 Shared Caches Between Cores
Advantages: High effective capacity Dynamic partitioning of available cache space No fragmentation due to static partitioning Easier to maintain coherence (a cache block is in a single location) Shared data and locks do not ping pong between caches Disadvantages Slower access Cores incur conflict misses due to other cores’ accesses Misses due to inter-core interference Some cores can destroy the hit rate of other cores Guaranteeing a minimum level of service (or fairness) to each core is harder (how much space, how much bandwidth?)

160 Shared Caches: How to Share?
Free-for-all sharing Placement/replacement policies are the same as a single core system (usually LRU or pseudo-LRU) Not thread/application aware An incoming block evicts a block regardless of which threads the blocks belong to Problems Inefficient utilization of cache: LRU is not the best policy A cache-unfriendly application can destroy the performance of a cache friendly application Not all applications benefit equally from the same amount of cache: free-for-all might prioritize those that do not benefit Reduced performance, reduced fairness

161 Handling Shared Caches
Controlled cache sharing Approach 1: Design shared caches but control the amount of cache allocated to different cores Approach 2: Design “private” caches but spill/receive data from one cache to another More efficient cache utilization Minimize the wasted cache space by keeping out useless blocks by keeping in cache blocks that have maximum benefit by minimizing redundant data

162 Controlled Cache Sharing: Examples
Utility based cache partitioning Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002. Fair cache partitioning Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. Shared/private mixed cache mechanisms Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009. Hardavellas et al., “Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches,” ISCA 2009.

163 Efficient Cache Utilization: Examples
Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006. Qureshi et al., “Adaptive Insertion Policies for High Performance Caching,” ISCA 2007. Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing,” PACT 2012. Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012.

164 Controlled Shared Caching

165 Hardware-Based Cache Partitioning

166 Utility Based Shared Cache Partitioning
Goal: Maximize system throughput Observation: Not all threads/applications benefit equally from caching  simple LRU replacement not good for system throughput Idea: Allocate more cache space to applications that obtain the most benefit from more space The high-level idea can be applied to other shared resources as well. Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002.

167 Marginal Utility of a Cache Way
Utility U(a→b) = Misses with a ways – Misses with b ways (Figure: misses per 1000 instructions vs. number of ways from a 16-way 1MB L2, illustrating low-utility, high-utility, and saturating-utility applications.)
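As a concrete illustration (with made-up numbers, not measured data), the utility of going from a ways to b ways is just the drop in an application's miss curve:

def utility(miss_curve, a, b):
    # U(a -> b) = misses with a ways - misses with b ways (b > a)
    return miss_curve[a] - miss_curve[b]

mpki = {1: 50, 2: 30, 4: 12, 8: 10, 16: 9}   # hypothetical saturating miss curve
print(utility(mpki, 1, 4))    # high utility: 38 fewer misses per 1000 instructions
print(utility(mpki, 8, 16))   # saturating utility: only 1 fewer miss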

168 Utility Based Shared Cache Partitioning Motivation
(Figure: MPKI vs. number of ways from a 16-way 1MB L2 for equake and vpr, comparing UTIL and LRU allocations.) Improve performance by giving more cache to the application that benefits more from cache

169 Utility Based Cache Partitioning (III)
(Figure: Core1 and Core2, each with private I$/D$ and a per-core UMON, share an L2 cache backed by main memory; the partitioning algorithm (PA) sits between the UMONs and the shared cache.) Three components: Utility Monitors (UMON) per core Partitioning Algorithm (PA) Replacement support to enforce partitions

170 1. Utility Monitors For each core, simulate the LRU policy using a separate tag store called the ATD (auxiliary tag directory/store) Hit counters in the ATD count hits per recency position LRU is a stack algorithm: hit counts → utility E.g. hits(2 ways) = H0 + H1 (Figure: the ATD mirrors the sets of the main tag store (MTD) and keeps counters H0 (MRU) through H15 (LRU).)
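A minimal sketch of one sampled UMON set, written as software for clarity (the real UMON is a small hardware tag directory): it simulates LRU for a single core and counts hits per recency position, so hits with w ways is just the sum of the first w counters.

class UMONSet:
    def __init__(self, num_ways=16):
        self.num_ways = num_ways
        self.stack = []                       # tags ordered MRU (index 0) -> LRU
        self.hit_counters = [0] * num_ways    # H0 (MRU) ... H15 (LRU)

    def access(self, tag):
        if tag in self.stack:
            pos = self.stack.index(tag)       # recency position at the time of the hit
            self.hit_counters[pos] += 1
            self.stack.remove(tag)
        elif len(self.stack) == self.num_ways:
            self.stack.pop()                  # evict the ATD's LRU entry
        self.stack.insert(0, tag)             # (re)insert at the MRU position

    def hits_with(self, ways):
        return sum(self.hit_counters[:ways])  # e.g., hits(2 ways) = H0 + H1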

171 Utility Monitors

172 Dynamic Set Sampling + Extra tags incur hardware and power overhead
Dynamic Set Sampling reduces overhead [Qureshi, ISCA’06] 32 sets sufficient (analytical bounds) Storage < 2kB/UMON (Figure: the sampled-set ATD (UMON-DSS) keeps tags and H0…H15 counters for only a subset of the MTD’s sets.) Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.

173 2. Partitioning Algorithm
Evaluate all possible partitions and select the best With a ways to core1 and (16−a) ways to core2: Hits_core1 = (H0 + H1 + … + H(a−1)) from UMON1 Hits_core2 = (H0 + H1 + … + H(16−a−1)) from UMON2 Select the a that maximizes (Hits_core1 + Hits_core2) Partitioning is done once every 5 million cycles
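For two cores, the partitioning step is a brute-force sweep over all splits of the 16 ways, scored with the UMON hit counters; a sketch:

def best_two_core_partition(h_core1, h_core2, total_ways=16):
    # h_coreX[i] = hit counter Hi from core X's UMON (MRU position 0 ... LRU 15)
    best_a, best_hits = 1, -1
    for a in range(1, total_ways):               # a ways to core 1, the rest to core 2
        hits = sum(h_core1[:a]) + sum(h_core2[:total_ways - a])
        if hits > best_hits:
            best_a, best_hits = a, hits
    return best_a, total_ways - best_a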

174 3. Enforcing Partitions: Way Partitioning
Way partitioning support: [Suh+ HPCA’02, Iyer ICS’04] Each line has core-id bits On a miss, count ways_occupied in the set by the miss-causing app If ways_occupied < ways_given, the victim is the LRU line of another app; otherwise, the victim is the LRU line of the miss-causing app
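A sketch of that victim-selection rule (the per-line core-id field and the MRU-to-LRU ordering of the set are assumptions of this illustration, not a literal hardware description):

def pick_victim(cache_set, miss_core, ways_given):
    # cache_set: list of (core_id, tag) tuples ordered MRU -> LRU
    occupied = sum(1 for core_id, _ in cache_set if core_id == miss_core)
    if occupied < ways_given[miss_core]:
        # Under quota: evict the LRU line that belongs to some other core.
        candidates = [line for line in cache_set if line[0] != miss_core]
    else:
        # At or over quota: evict the LRU line of the miss-causing core itself.
        candidates = [line for line in cache_set if line[0] == miss_core]
    return candidates[-1]   # LRU among the candidates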

175 Performance Metrics Three metrics for performance: Weighted Speedup (default metric) → perf = IPC1/SingleIPC1 + IPC2/SingleIPC2 → correlates with reduction in execution time Throughput → perf = IPC1 + IPC2 → can be unfair to low-IPC applications Hmean-fairness → perf = hmean(IPC1/SingleIPC1, IPC2/SingleIPC2) → balances fairness and performance
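The three metrics computed from per-application IPCs measured alone and when sharing (the numbers below are made up):

from statistics import harmonic_mean

ipc_alone  = {"app1": 1.0, "app2": 2.0}
ipc_shared = {"app1": 0.8, "app2": 1.5}

speedups = [ipc_shared[a] / ipc_alone[a] for a in ipc_alone]
weighted_speedup = sum(speedups)               # default metric
throughput       = sum(ipc_shared.values())    # can be unfair to low-IPC apps
hmean_fairness   = harmonic_mean(speedups)     # balances fairness and performance
print(weighted_speedup, throughput, hmean_fairness)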

176 Weighted Speedup Results for UCP

177 IPC Results for UCP UCP improves average throughput by 17%

178 Any Problems with UCP So Far?
- Scalability to many cores - Non-convex curves? Time complexity of partitioning is low for two cores (number of possible partitions ≈ number of ways) Possible partitions increase exponentially with cores For a 32-way cache, possible partitions: 4 cores → 6545, 8 cores → 15.4 million The problem is NP-hard → need a scalable partitioning algorithm
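The partition counts on the slide follow from a stars-and-bars count of the ways an N-way cache can be split among C cores (allowing a core to receive zero ways); a quick check:

from math import comb

def num_partitions(ways, cores):
    # The ways are interchangeable; distribute `ways` among `cores`, zero allowed.
    return comb(ways + cores - 1, cores - 1)

print(num_partitions(32, 2))   # 33, roughly "number of ways" for two cores
print(num_partitions(32, 4))   # 6545
print(num_partitions(32, 8))   # 15380937, about 15.4 million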

179 Greedy Algorithm [Stone+ ToC ’92]
Greedy Algorithm (GA) allocates 1 block to the app that has the max utility for one block. Repeat till all blocks allocated Optimal partitioning when utility curves are convex Pathological behavior for non-convex curves Stone et al., “Optimal Partitioning of Cache Memory,” IEEE ToC 1992.

180 Problem with Greedy Algorithm
Problem: GA considers benefit only from the immediate block. Hence, it fails to exploit large gains from looking ahead In each iteration, the utility for 1 block: U(A) = 10 misses, U(B) = 0 misses (Figure: misses vs. blocks assigned.) All blocks are assigned to A, even if B has the same miss reduction with fewer blocks

181 Lookahead Algorithm Marginal Utility (MU) = Utility per cache resource
MUab = Uab/(b−a) GA considers MU for 1 block. LA (Lookahead Algorithm) considers MU for all possible allocations Select the app that has the max value of MU. Allocate it as many blocks as required to get that max MU Repeat until all blocks are assigned

182 Lookahead Algorithm Example
Iteration 1: MU(A) = 10/1 block, MU(B) = 80/3 blocks → B gets 3 blocks Next five iterations: MU(A) = 10/1 block, MU(B) = 0 → A gets 1 block each (Figure: misses vs. blocks assigned.) Result: A gets 5 blocks and B gets 3 blocks (optimal) Time complexity ≈ ways²/2 (512 ops for 32 ways)
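A sketch of the Lookahead Algorithm driven by per-application miss curves; the curves below reproduce the slide's example, where A saves 10 misses per block and B saves 80 misses only once it holds 3 blocks.

def lookahead_partition(miss_curves, total_blocks):
    # miss_curves[app][n] = misses of `app` when it holds n blocks
    alloc = {app: 0 for app in miss_curves}
    free = total_blocks
    while free > 0:
        best = None   # (marginal utility, app, extra blocks)
        for app, curve in miss_curves.items():
            a = alloc[app]
            for b in range(a + 1, min(a + free, len(curve) - 1) + 1):
                mu = (curve[a] - curve[b]) / (b - a)   # MU(a -> b)
                if best is None or mu > best[0]:
                    best = (mu, app, b - a)
        if best is None:
            break
        mu, app, extra = best
        alloc[app] += extra                            # give the winner its blocks
        free -= extra
    return alloc

# Slide's example: A saves 10 misses per block; B saves 80 misses once it has 3 blocks.
curve_A = [100, 90, 80, 70, 60, 50, 50, 50, 50]
curve_B = [80, 80, 80, 0, 0, 0, 0, 0, 0]
print(lookahead_partition({"A": curve_A, "B": curve_B}, 8))   # {'A': 5, 'B': 3}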

183 UCP Results Four cores sharing a 2MB 32-way L2 Mix1 Mix2 Mix3 Mix4
(Figure: weighted speedup of LRU, UCP(Greedy), UCP(Lookahead), and UCP(EvalAll) for Mix1 (gap-applu-apsi-gzp), Mix2 (swm-glg-mesa-prl), Mix3 (mcf-applu-art-vrtx), and Mix4 (mcf-art-eqk-wupw).) LA performs similarly to EvalAll, with low time complexity

184 Utility Based Cache Partitioning
Advantages over LRU + Improves system throughput + Better utilizes the shared cache Disadvantages - Fairness, QoS? Limitations - Scalability: Partitioning limited to ways. What if you have numWays < numApps? - Scalability: How is utility computed in a distributed cache? - What if past behavior is not a good predictor of utility?

185 Fair Shared Cache Partitioning
Goal: Equalize the slowdowns of multiple threads sharing the cache Idea: Dynamically estimate slowdowns due to sharing and assign cache blocks to balance slowdowns Approximate slowdown with change in miss rate Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004.

186 Dynamic Fair Caching Algorithm
(Figure: the algorithm tracks, for each thread, its alone miss rate, its shared miss rate, and a target partition, and revisits the target partition every repartitioning interval.)

187 Dynamic Fair Caching Algorithm
1st interval: At the very beginning, static profiling gives an alone miss rate of 20% for P1 and 5% for P2, and the target partitions start equal (P1: 256KB, P2: 256KB). After the first interval, dynamic profiling measures shared miss rates of 20% for P1 and 15% for P2.

188 Dynamic Fair Caching Algorithm
Repartition: We evaluate the fairness metric (M3), which aims to equalize the increase in miss rate: P1’s miss rate did not increase (20%/20%), while P2’s tripled (15%/5%). We therefore grow P2’s partition, using a 64KB partition granularity: the target partition becomes P1: 192KB, P2: 320KB.

189 Dynamic Fair Caching Algorithm
2nd interval: The new target partition (P1: 192KB, P2: 320KB) is applied. After this interval, the shared miss rates are 20% for P1 and 10% for P2; P2’s miss rate drops from 15% to 10%.

190 Dynamic Fair Caching Algorithm
Repartition: Evaluating the slowdowns again (P1: 20%/20%, P2: 10%/5%), P2 still suffers the larger increase, so its partition grows again: the target partition becomes P1: 128KB, P2: 384KB.

191 Dynamic Fair Caching Algorithm
3rd interval: After the third interval, dynamic profiling measures shared miss rates of 25% for P1 and 9% for P2, with the target partition at P1: 128KB, P2: 384KB.

192 Dynamic Fair Caching Algorithm
Do rollback if Δ < T_rollback, where Δ = MR_old − MR_new: There is a hidden rollback step. Because we increased P2’s partition, we expect its miss rate to drop; here P2’s miss rate fell only from 10% to 9%, which is smaller than the rollback threshold, so we cancel the previous decision and return to the previous partition (P1: 192KB, P2: 320KB).
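A sketch of one repartitioning decision with the rollback check; the 64KB granularity matches the slides, while the rollback threshold and the two-thread setup are illustrative assumptions.

GRANULARITY = 64            # KB moved per repartitioning decision
ROLLBACK_THRESHOLD = 0.02   # required miss-rate reduction to keep a decision (assumed)

def repartition(alone_mr, shared_mr, partition):
    # Approximate each thread's slowdown by its miss-rate increase vs. running alone.
    slowdown = {t: shared_mr[t] / alone_mr[t] for t in partition}
    loser = max(slowdown, key=slowdown.get)     # most slowed-down thread
    winner = min(slowdown, key=slowdown.get)
    partition[loser] += GRANULARITY
    partition[winner] -= GRANULARITY
    return loser                                # thread whose quota grew

def maybe_rollback(grown_thread, old_mr, new_mr, partition, old_partition):
    # If the extra space did not reduce the miss rate enough, undo the change.
    if old_mr[grown_thread] - new_mr[grown_thread] < ROLLBACK_THRESHOLD:
        partition.clear()
        partition.update(old_partition)

# Numbers from the walkthrough: P1 alone 20%, P2 alone 5%; shared 20% and 15%.
part = {"P1": 256, "P2": 256}
repartition({"P1": 0.20, "P2": 0.05}, {"P1": 0.20, "P2": 0.15}, part)
print(part)   # {'P1': 192, 'P2': 320}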

193 Advantages/Disadvantages of the Approach
+ Reduced starvation + Better average throughput + Block granularity partitioning Disadvantages and Limitations - Alone miss rate estimation can be incorrect - Scalable to many cores? - Is this the best (or a good) fairness metric? - Does this provide performance isolation in cache?

194 Software-Based Shared Cache Partitioning

195 Software-Based Shared Cache Management
Assume no hardware support (demand based cache sharing, i.e. LRU replacement) How can the OS best utilize the cache? Cache sharing aware thread scheduling Schedule workloads that “play nicely” together in the cache E.g., working sets together fit in the cache Requires static/dynamic profiling of application behavior Fedorova et al., “Improving Performance Isolation on Chip Multiprocessors via an Operating System Scheduler,” PACT 2007. Cache sharing aware page coloring Dynamically monitor miss rate over an interval and change virtual to physical mapping to minimize miss rate Try out different partitions

196 OS Based Cache Partitioning
Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008. Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. Static cache partitioning Predetermines the amount of cache blocks allocated to each program at the beginning of its execution Divides shared cache to multiple regions and partitions cache regions through OS page address mapping Dynamic cache partitioning Adjusts cache quota among processes dynamically Page re-coloring Dynamically changes processes’ cache usage through OS page address re-mapping

197 Page Coloring Physical memory divided into colors
Colors map to different cache sets Cache partitioning: ensure two threads are allocated pages of different colors (Figure: memory pages of Thread A and Thread B map to disjoint regions of an n-way cache.)

198 Page Coloring Physically indexed caches are divided into multiple regions (colors). All cache lines in a physical page are cached in one of those regions (colors). Physically indexed cache Virtual address virtual page number page offset OS control Address translation … … Physical address physical page number Page offset OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits). Let me introduce page coloring technique first. 1. A virtual address can be divide into two parts: Virtual page number and page offset (In our experimental system, the page size is 4KB, so length of page offset bits are 12) 2. The virtual address is converted to physical address by address translation controlled by OS. Virtual page number is mapped to a physical page number and page offset remain same. 3. The physical address is used in a physically indexed cache. The cache address is divided into tag, set index and block offset. 4. The common bits between physical page number and set index are referred as page color bits. 5. The physically indexed cache is then divided into multiple regions. Each OS page will cached in one of those regions. Indexed by the page color. In a summary, 1) 2)… = Cache address Cache tag Set index Block offset page color bits

199 Static Cache Partitioning using Page Coloring
Physical pages are grouped into page bins according to their page color; the shared cache is partitioned between two processes through OS address mapping. Pages in the same bin are cached in the same cache color of the physically indexed cache. Two processes run on a dual-core processor, and each needs its virtual pages mapped to physical pages. If the OS restricts each process’s pages to a subset of the page bins, each process uses only part of the cache; we call this restriction co-partitioning. Cost: main memory space needs to be partitioned, too.

200 Dynamic Cache Partitioning via Page Re-Coloring
Pages of a process are organized into linked lists by their colors (a page color table per process). Memory allocation guarantees that pages are evenly distributed across all the lists (colors) to avoid hot spots. For dynamic partitioning: suppose 3 colors are initially assigned to a process, with pages allocated round-robin among those colors. If the policy expands the allocation from 3 to 4 colors, one of every 4 pages is selected for migration and re-colored to the new color. Page re-coloring has three steps: 1. Allocate a physical page in the new color 2. Copy the memory contents from the old page to the new page 3. Free the old page. These steps are executed for every page that needs to be re-colored.

201 Dynamic Partitioning in a Dual-Core System
Init: partition the cache as (8:8). Then, until the workload finishes: run the current partition (P0:P1) for one epoch, try one epoch for each of the two neighboring partitions (P0−1 : P1+1) and (P0+1 : P1−1), and choose the partitioning with the best policy-metric measurement (e.g., lowest cache miss rate) for the next epoch. This is a simple greedy policy that emulates the policy of the HPCA 2002 paper.
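A sketch of that greedy epoch loop; run_epoch is a stand-in for "apply the split, run one epoch, and return the policy metric (e.g., the combined miss rate)".

def tune_partition(run_epoch, total_colors=16, epochs=100):
    p0 = total_colors // 2                       # init: (8:8)
    history = []
    for _ in range(epochs):                      # or: loop until the workload finishes
        candidates = [p for p in (p0 - 1, p0, p0 + 1) if 1 <= p <= total_colors - 1]
        metric = {p: run_epoch(p, total_colors - p) for p in candidates}
        p0 = min(metric, key=metric.get)         # keep the split with the best (lowest) metric
        history.append((p0, total_colors - p0))  # partition used for the next epoch
    return history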

202 Experimental Environment
Dell PowerEdge 1950 Two-way SMP, Intel dual-core Xeon 5160 Shared 4MB L2 cache, 16-way 8GB Fully Buffered DIMM Red Hat Enterprise Linux 4.0 (Linux 2.6 kernel) Performance counter tools from HP (Pfmon) Divide L2 cache into 16 colors We conducted our experiments on a two-way SMP machine with two Intel dual-core Xeon processors; each processor has a 4MB, 16-way L2 cache shared by its two cores. The machine has four 2GB Fully Buffered DIMMs as main memory. We installed RHEL 4.0 with a 2.6 kernel, patched the kernel with HP’s hardware performance counter tools, and extended the kernel to support both static and dynamic cache partitioning. The L2 cache is divided into 16 colors.

203 Performance – Static & Dynamic
Aim: minimize the combined miss rate. For RG-type and some RY-type workloads, static partitioning outperforms dynamic partitioning; for RR-type and the remaining RY-type workloads, dynamic partitioning outperforms static partitioning. This figure compares the best static partitioning against the dynamic partitioning policy, both using combined miss rate as the policy metric. Note that the static partitioning assumes full profiling information is available; without such profiling information, dynamic partitioning performs comparably to static partitioning. Looking at workloads individually, some perform better under static partitioning and others under dynamic partitioning.

204 Software vs. Hardware Cache Management
Software advantages + No need to change hardware + Easier to upgrade/change algorithm (not burned into hardware) Disadvantages - Large granularity of partitioning (page-based versus way/block) - Limited page colors  reduced performance per application (limited physical memory space!), reduced flexibility - Changing partition size has high overhead  page mapping changes - Adaptivity is slow: hardware can adapt every cycle (possibly) - Not enough information exposed to software (e.g., number of misses due to inter-thread conflict)

205 Private/Shared Caching

206 Private/Shared Caching
Example: Adaptive spill/receive caching Goal: Achieve the benefits of private caches (low latency, performance isolation) while sharing cache capacity across cores Idea: Start with a private cache design (for performance isolation), but dynamically steal space from other cores that do not need all their private caches Some caches can spill their data to other cores’ caches dynamically Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009.

207 Revisiting Private Caches on CMP
Private caches avoid the need for a shared interconnect ++ fast latency, tiled design, performance isolation (Figure: Cores A–D, each with a private cache, all connected to memory.) Problem: when one core needs more cache and another core has spare cache, private-cache CMPs cannot share capacity

208 Cache Line Spilling Spill an evicted line from one cache to a neighbor cache - Co-operative caching (CC) [Chang+ ISCA’06] Problem with CC: performance depends on the parameter (spill probability); all caches spill as well as receive → limited improvement Goal: Robust High-Performance Capacity Sharing with Negligible Overhead Chang and Sohi, “Cooperative Caching for Chip Multiprocessors,” ISCA 2006.

209 Spill-Receive Architecture
Each cache is either a spiller or a receiver, but not both - Lines from a spiller cache are spilled to one of the receivers - Evicted lines from a receiver cache are discarded (Figure: four caches with S/R bits, e.g., A and C spilling, B and D receiving.) What is the best N-bit binary string that maximizes the performance of the spill-receive architecture? → Dynamic Spill-Receive (DSR) Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009.

210 Dynamic Spill-Receive via “Set Dueling”
Divide the cache in three: spiller sets, receiver sets, and follower sets (which follow the winner of spiller vs. receiver) n-bit PSEL counter: misses to spiller sets: PSEL--; misses to receiver sets: PSEL++ MSB of PSEL decides the policy for the follower sets: MSB = 0, use spill; MSB = 1, use receive (Flowchart: monitor → choose → apply, using a single counter.)
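A sketch of the set-dueling idea: the follower sets copy whichever policy's dedicated sets are missing less. The real DSR hardware folds the comparison into a single saturating PSEL counter and reads only its MSB; the two explicit counters below are a simplification of that, not the exact DSR logic.

class SetDueling:
    def __init__(self, sample_sets=32):
        self.spiller_sets = set(range(0, sample_sets))                 # always spill
        self.receiver_sets = set(range(sample_sets, 2 * sample_sets))  # always receive
        self.spiller_misses = 0
        self.receiver_misses = 0

    def record_miss(self, set_index):
        if set_index in self.spiller_sets:
            self.spiller_misses += 1
        elif set_index in self.receiver_sets:
            self.receiver_misses += 1

    def follower_policy(self):
        # Followers adopt the policy whose dedicated sets incur fewer misses.
        return "spill" if self.spiller_misses <= self.receiver_misses else "receive"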

211 Dynamic Spill-Receive Architecture
Each cache learns whether it should act as a spiller or receiver (Figure: in each cache, one sampled set (Set X) is always-spill and another (Set Y) is always-receive; a miss to Set X in any cache and a miss to Set Y update that cache’s PSEL counter in opposite directions (PSEL A, PSEL B, PSEL C, PSEL D). PSEL A then decides the policy for all sets of Cache A except X and Y.)

212 Experimental Setup Baseline Study: Benchmarks:
4-core CMP with in-order cores Private Cache Hierarchy: 16KB L1, 1MB L2 10 cycle latency for local hits, 40 cycles for remote hits Benchmarks: 6 benchmarks that have extra cache: “Givers” (G) 6 benchmarks that benefit from more cache: “Takers” (T) All 4-thread combinations of 12 benchmarks: 495 total Five types of workloads: G4T0 G3T1 G2T2 G1T3 G0T4

213 Results for Weighted Speedup
On average, DSR improves weighted speedup by 13%

214 Distributed Caches

215 Caching for Parallel Applications
(Figure: a tiled many-core chip; each tile has a core and an L2 cache slice.) Data placement determines performance Goal: place data on chip close to where they are used

216 Efficient Cache Utilization

217 Efficient Cache Utilization: Examples
Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006. Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address both Cache Pollution and Thrashing,” PACT 2012. Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012.

218 The Evicted-Address Filter
Vivek Seshadri, Onur Mutlu, Michael A. Kozuch, and Todd C. Mowry, "The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing" Proceedings of the 21st ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, September 2012. Slides (pptx)

219 Cache Utilization is Important
Large latency Memory Core Core Last-Level Cache All high-performance processors employ on-chip caches. In most modern systems, multiple cores share the last-level cache. With the large latency of main memory access and increasing contention between multiple cores, effective cache utilization is critical for system performance. For effective cache utilization, the cache should be filled with blocks that are likely to be reused. Increasing contention Effective cache utilization is important

220 Reuse Behavior of Cache Blocks
Different blocks have different reuse behavior Access Sequence: A B C A B C S T U V W X Y Z A B C High-reuse blocks: A, B, C Low-reuse blocks: S T U V W X Y Z In this access sequence, blocks A, B and C are accessed repeatedly; we refer to such blocks as high-reuse blocks. The remaining blocks are not reused; we refer to them as low-reuse blocks. Ideally, the cache should be filled with high-reuse blocks such as A, B and C, but traditional cache management policies such as LRU suffer from two problems.

221 Cache Pollution Problem: Low-reuse blocks evict high-reuse blocks
Problem: low-reuse blocks evict high-reuse blocks (Animation: under LRU, a cache holding high-reuse blocks A–H inserts low-reuse blocks S, T, U at the MRU position, pushing A, B, C out.) Idea: Predict the reuse behavior of missed blocks. Insert low-reuse blocks at the LRU position. The first problem, cache pollution, is one where blocks with low reuse evict blocks with high reuse from the cache. Although the cache is filled with high-reuse blocks, the LRU policy inserts even low-reuse blocks at the MRU position, evicting more useful blocks, so future accesses to blocks A, B and C miss in the cache. To address this problem, prior work proposed predicting the reuse behavior of missed blocks and inserting low-reuse blocks at the LRU position, which mitigates the impact of pollution.

222 Cache Thrashing Cache Problem: High-reuse blocks evict each other
Problem: high-reuse blocks evict each other (Animation: under LRU, a working set A–K that is larger than the cache continuously evicts itself.) Idea: Insert at the MRU position with a very low probability (bimodal insertion policy). The second problem, thrashing, occurs when the number of high-reuse blocks exceeds what the cache can hold; the LRU policy then makes these high-reuse blocks evict each other, leading to zero cache hits. Prior work addresses this with the bimodal insertion policy, which inserts blocks at the MRU position with a very low probability, so the majority of blocks are inserted at the LRU position. For a thrashing working set, a fraction of the working set is thereby retained in the cache, yielding hits for at least that fraction. Qureshi+, “Adaptive insertion policies for high performance caching,” ISCA 2007.

223 Handling Pollution and Thrashing
Need to address both pollution and thrashing concurrently Cache Pollution Need to distinguish high-reuse blocks from low-reuse blocks The motivation behind this work is that prior works do not address both pollution and thrashing concurrently. Prior works on pollution do not have control over the number of high-reuse blocks. As a result, if there are more high-reuse blocks than what the cache can handle, these approaches can suffer from thrashing. On the other hand, prior works on cache thrashing have no mechanism to distinguish high-reuse blocks from low-reuse blocks. As a result, when a working set is not thrashing, these approaches can suffer from pollution. Our goal in this work is to design a mechanism to address both cache pollution and thrashing. In this talk, I will present our mechanism, the Evicted-address filter, that achieves our goal. Cache Thrashing Need to control the number of blocks inserted with high priority into the cache

224 Reuse Prediction High reuse Miss Missed-block ? Low reuse
Keep track of the reuse behavior of every cache block in the system The reuse prediction problem asks the following question. On a cache miss, does the missed cache block have high reuse or low reuse? One possible way to answer this question is to keep track of the reuse behavior of every block in the system. But such an approach is impractical due to its high storage overhead and look-up latency. So, how do prior works address this problem? Impractical High storage overhead Look-up latency

225 Approaches to Reuse Prediction
Use program counter or memory region information: 1. Group blocks (e.g., by the program counter that caused their miss) 2. Learn the group’s reuse behavior 3. Predict the reuse of future blocks in that group Previous approaches predict the reuse behavior of missed blocks indirectly using program-counter or memory-region information. Such an approach has two shortcomings. First, blocks belonging to the same program counter may not have the same reuse behavior (same group → same reuse behavior is assumed). Second, the mechanism has no control over the number of blocks predicted to have high reuse, so it can potentially suffer from thrashing.

226 Per-block Reuse Prediction
Use recency of eviction to predict reuse A A Time Accessed soon after eviction Time of eviction Accessed long time after eviction In this work, we propose a mechanism to predict the reuse behavior of a block based on its own past behavior. The key idea behind our mechanism is to use the recency of eviction of a block to predict its reuse on its next access. More specifically, if a block is accessed soon after getting evicted, our mechanism concludes that the block was prematurely evicted and predicts it to have high reuse. On the other hand, if a block is accessed a long time after eviction, our mechanism concludes that the block is unlikely to be accessed in the near future and predicts it to have low reuse. The above idea suggests that it is sufficient to keep track of a small number of recently evicted blocks to distinguish between high-reuse blocks and low-reuse blocks. Time S S

227 Evicted-Address Filter (EAF)
Evicted-block address EAF (Addresses of recently evicted blocks) Cache MRU LRU Yes No In EAF? Based on this observation, our mechanism augments the cache with a structure called the Evicted-address filter that keeps track of addresses of recently evicted blocks. When a block is evicted from the cache, our mechanism inserts its address into the EAF. On a cache miss, if the missed block address is present in the EAF, then the block was evicted recently. Therefore, our mechanism predicts that the block has high reuse and inserts it at the MRU position. It also removes the block address from the EAF to create space for more evicted addresses. On the other hand, if the missed block address is not present in the EAF, there is one of two possibilities: either the block was accessed before and evicted a long time ago or the block is accessed for the first time. In both cases, our mechanism predicts that the block has low reuse and inserts it at the LRU position. Although the EAF is a simple mechanism to predict the reuse behavior of missed blocks, one natural question is how to implement it in hardware? High Reuse Low Reuse Miss Missed-block address

228 Naïve Implementation: Full Address Tags
Recently evicted address EAF ? Need not be 100% accurate 1. Large storage overhead A naïve implementation of the EAF would be to implement it as a tag store where each tag entry corresponds to the address or a recently evicted block. However, such an implementation suffers from two problems: 1) it has a large storage overhead and 2) it requires associative lookups which consume high energy. But notice that the EAF need not be 100% accurate as it is just a prediction structure. This raises the possibility of implementing EAF using a more compact structure which doesn’t have the two shortcomings. So, we consider implementing the EAF using a Bloom filter. 2. Associative lookups – High energy

229 Low-Cost Implementation: Bloom Filter
The EAF need not be 100% accurate, since it is only a prediction structure → implement the EAF using a Bloom filter: low storage overhead and low energy.

230 Bloom Filter   Clear Compact representation of a set Remove Insert
A Bloom filter is a compact representation of a set (here, a set of addresses): 1. a bit vector and 2. a set of hash functions. Insert: compute the hash functions of the address and set the corresponding bits. Test: check whether all the corresponding bits are set; if even one is not set, the address is definitely not in the set, but a test can return a false positive (e.g., W appears present because bits set by X and Y overlap its positions). Remove: not supported for individual addresses, since multiple addresses may map to the same bits; however, all addresses can be cleared at once by resetting the bit vector. The false-positive rate can be made low by sizing the filter appropriately.
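A minimal Bloom filter sketch matching the slide (bit vector plus hash functions; insert, test with possible false positives, and clear, but no per-address remove). The sizes and the hashing scheme are illustrative choices.

class BloomFilter:
    def __init__(self, num_bits=256, num_hashes=2):
        self.num_bits = num_bits
        self.bits = [0] * num_bits
        self.num_hashes = num_hashes

    def _positions(self, addr):
        return [hash((seed, addr)) % self.num_bits for seed in range(self.num_hashes)]

    def insert(self, addr):
        for pos in self._positions(addr):
            self.bits[pos] = 1

    def test(self, addr):                     # may return a false positive
        return all(self.bits[pos] for pos in self._positions(addr))

    def clear(self):                          # resets the whole set at once
        self.bits = [0] * self.num_bits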

231 EAF using a Bloom Filter
Recall the operations on the EAF: on a cache eviction, insert the evicted block address; on a cache miss, test whether the missed block address is present and, if so, remove it; when the EAF is full, remove an address in FIFO order. A Bloom filter supports insert and test, but not the remove operations, so we make two changes to the EAF design: 1. on a cache miss, if the missed block address is present, we retain it in the EAF instead of removing it; 2. when the EAF becomes full, we clear it completely instead of removing one address in FIFO order. With these changes, the EAF can be implemented with a Bloom filter. Bloom-filter EAF: 4x reduction in storage overhead, 1.47% compared to cache size

232 EAF-Cache: Final Design
1. Cache eviction: insert the evicted address into the filter and increment the counter. 2. Cache miss: test whether the missed address is present in the filter; if yes, insert the block at MRU, otherwise insert it with BIP. 3. Counter reaches its maximum: clear the filter and the counter. Let's take a look at the final design of our mechanism, the EAF-cache. It augments the cache with a Bloom filter and a counter that keeps track of the number of addresses inserted into the Bloom filter. On a cache eviction, the evicted address is inserted into the filter and the counter is incremented. On a cache miss, if the missed block address is present in the Bloom filter, the block is inserted at the MRU position; otherwise it is inserted with the bimodal insertion policy. Finally, when the counter reaches its maximum, both the filter and the counter are cleared.
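A sketch of the EAF-cache logic described on this slide, reusing the BloomFilter sketch from earlier; the EAF capacity and the bimodal insertion probability below are assumed placeholder values, not the parameters evaluated in the paper:

```cpp
#include <cstdint>
#include <cstdlib>

// EAF-cache sketch (uses the BloomFilter class sketched earlier).
struct EAFCache {
  BloomFilter eaf;
  int counter = 0;
  static constexpr int kMaxAddresses = 8192;  // assumed EAF capacity ("full" threshold)
  static constexpr int kBimodalOneInN = 64;   // assumed BIP probability: MRU 1 out of N times

  // On a cache eviction: insert the evicted block address and count it;
  // when the counter reaches its maximum, clear both the filter and the counter.
  void onCacheEviction(uint64_t evictedBlockAddr) {
    eaf.insert(evictedBlockAddr);
    if (++counter >= kMaxAddresses) {
      eaf.clear();
      counter = 0;
    }
  }

  // On a cache miss: returns true if the missed block should be inserted at the
  // MRU position, false if it should be inserted at the LRU position (the common
  // case under the bimodal insertion policy).
  bool onCacheMiss(uint64_t missedBlockAddr) {
    if (eaf.test(missedBlockAddr))               // recently evicted, so likely has reuse
      return true;                               // the address is retained in the filter
    return (std::rand() % kBimodalOneInN) == 0;  // bimodal insertion policy
  }
};
```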

233 EAF: Advantages
Apart from its ability to mitigate both pollution and thrashing, the EAF has three other advantages. First, it adds only a Bloom filter and a counter, and hardware implementations of Bloom filters are well studied, so the EAF-cache is simple to implement. Second, both the Bloom filter and the counter sit outside the cache, so our mechanism is easy to design and verify on top of existing cache implementations. Finally, all operations on the EAF happen on a cache miss, so the EAF works well with other techniques that monitor cache hits to improve performance, for example the cache replacement policy. 1. Simple to implement. 2. Easy to design and verify. 3. Works with other techniques (replacement policy).

234 EAF Performance – Summary
This slide presents a summary of our performance results for systems with different numbers of cores. As shown, our mechanisms outperform all prior approaches on all three systems. Also, the multi-core results are significantly better than the single-core results. This is because, in a multi-core system, pollution or thrashing caused by one application affects the useful blocks of other concurrently running applications. (The last two bars are EAF and D-EAF.)

235 More on Evicted Address Filter Cache
Vivek Seshadri, Onur Mutlu, Michael A. Kozuch, and Todd C. Mowry, "The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing" Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, September 2012. Slides (pptx) Source Code

236 Cache Compression

237 Motivation for Cache Compression
Significant redundancy in data: many cache lines contain zero or narrow values (e.g., 0x00000000, 0x0000000B). How can we exploit this redundancy? Cache compression helps: it provides the effect of a larger cache without making the cache physically larger.

238 Background on Cache Compression
[Diagram: the CPU and L1 cache operate on uncompressed data; the L2 cache stores compressed data, so an L2 hit requires decompression.] Key requirements: fast (low decompression latency), simple (avoid complex hardware changes), effective (good compression ratio).

239–242 Summary of Major Works
[Table, built up over four slides: compression mechanisms (Zero, Frequent Value, Frequent Pattern, BΔI) compared on decompression latency, complexity, and compression ratio.]

243 Base-Delta-Immediate Cache Compression
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Philip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry, "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches" Proceedings of the 21st ACM International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, September 2012. Slides (pptx)

244 Executive Summary Off-chip memory latency is high
Large caches can help, but at significant cost Compressing data in cache enables larger cache at low cost Problem: Decompression is on the execution critical path Goal: Design a new compression scheme that has 1. low decompression latency, 2. low cost, 3. high compression ratio Observation: Many cache lines have low dynamic range data Key Idea: Encode cachelines as a base + multiple differences Solution: Base-Delta-Immediate compression with low decompression latency and high compression ratio Outperforms three state-of-the-art compression mechanisms

245 Key Data Patterns in Real Applications
Zero Values: initialization, sparse matrices, NULL pointers (e.g., 0x00000000 0x00000000 0x00000000 0x00000000). Repeated Values: common initial values, adjacent pixels (e.g., 0x000000FF 0x000000FF 0x000000FF 0x000000FF). Narrow Values: small values stored in a big data type (e.g., 0x0000000B). Other Patterns: pointers to the same memory region (e.g., 0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8).

246 How Common Are These Patterns?
SPEC2006, databases, and web workloads with a 2MB L2 cache. “Other Patterns” include Narrow Values. 43% of the cache lines belong to these key patterns.

247 Key Data Patterns in Real Applications
Low Dynamic Range: differences between values are significantly smaller than the values themselves. Zero Values: initialization, sparse matrices, NULL pointers (e.g., 0x00000000 0x00000000 0x00000000 0x00000000). Repeated Values: common initial values, adjacent pixels (e.g., 0x000000FF 0x000000FF 0x000000FF 0x000000FF). Narrow Values: small values stored in a big data type (e.g., 0x0000000B). Other Patterns: pointers to the same memory region (e.g., 0xC04039C0 0xC04039C8 0xC04039D0 0xC04039D8).

248 Key Idea: Base+Delta (B+Δ) Encoding
Example: a 32-byte uncompressed cache line of eight 4-byte values 0xC04039C0, 0xC04039C8, 0xC04039D0, ..., 0xC04039F8 is encoded as the 4-byte base 0xC04039C0 plus eight 1-byte deltas 0x00, 0x08, 0x10, ..., 0x38, i.e., a 12-byte compressed cache line that saves 20 bytes. Fast decompression: vector addition. Simple hardware: arithmetic and comparison. Effective: good compression ratio.
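The encoding on this slide can be sketched in software as follows, for this particular configuration (eight 4-byte values, a 4-byte base, 1-byte deltas); this is an illustration of the idea, not the hardware design:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// Base+Delta sketch for one 32-byte cache line viewed as eight 4-byte values:
// the compressed form is the first value (base) plus one 1-byte delta per value,
// i.e., 4 + 8 = 12 bytes instead of 32.
struct BDeltaLine {
  uint32_t base;
  std::array<int8_t, 8> deltas;
};

std::optional<BDeltaLine> compressBaseDelta(const std::array<uint32_t, 8>& line) {
  BDeltaLine out{line[0], {}};
  for (std::size_t i = 0; i < line.size(); ++i) {
    int64_t delta = int64_t(line[i]) - int64_t(out.base);
    if (delta < INT8_MIN || delta > INT8_MAX)
      return std::nullopt;                     // line is not compressible with this scheme
    out.deltas[i] = static_cast<int8_t>(delta);
  }
  return out;
}

// Decompression is a vector addition: value[i] = base + delta[i].
std::array<uint32_t, 8> decompressBaseDelta(const BDeltaLine& c) {
  std::array<uint32_t, 8> line{};
  for (std::size_t i = 0; i < line.size(); ++i)
    line[i] = c.base + uint32_t(int32_t(c.deltas[i]));
  return line;
}
```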

249 Can We Do Better? Uncompressible cache line (with a single base):
Example: 0x00000000 0x09A40178 0x0000000B 0x09A4A838. Key idea: use more bases, e.g., two instead of one. Pro: more cache lines can be compressed. Cons: it is unclear how to find these bases efficiently, and there is higher overhead (due to the additional bases).

250 B+Δ with Multiple Arbitrary Bases
Two bases is the best option based on our evaluations.

251 How to Find Two Bases Efficiently?
First base: the first element in the cache line (the Base+Delta part). Second base: an implicit base of 0 (the Immediate part). Advantages over two arbitrary bases: better compression ratio and simpler compression logic. The result is Base-Delta-Immediate (BΔI) Compression.
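A hedged sketch of how the implicit zero base extends the previous Base+Delta sketch; the per-value base-selection bit and the 1-byte delta width are one possible configuration chosen for illustration:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>

// BΔI-style encoding sketch: each 4-byte value is stored as a 1-byte delta either
// from the line's first value (the Base+Delta part) or from the implicit base 0
// (the Immediate part), plus one bit per value selecting which base was used.
struct BDILine {
  uint32_t base;                   // first element of the cache line
  std::array<int8_t, 8> deltas;
  uint8_t baseSelect;              // bit i set => deltas[i] is relative to the zero base
};

std::optional<BDILine> compressBDI(const std::array<uint32_t, 8>& line) {
  BDILine out{line[0], {}, 0};
  for (std::size_t i = 0; i < line.size(); ++i) {
    int64_t dBase = int64_t(line[i]) - int64_t(out.base);
    int64_t dZero = int64_t(line[i]);                     // delta from the implicit base 0
    if (dBase >= INT8_MIN && dBase <= INT8_MAX) {
      out.deltas[i] = static_cast<int8_t>(dBase);
    } else if (dZero >= INT8_MIN && dZero <= INT8_MAX) {
      out.deltas[i] = static_cast<int8_t>(dZero);
      out.baseSelect |= static_cast<uint8_t>(1u << i);
    } else {
      return std::nullopt;                                // line stays uncompressed
    }
  }
  return out;
}
```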

252 B+Δ (with two arbitrary bases) vs. BΔI
Average compression ratio is close, but BΔI is simpler

253 BΔI Cache Compression Implementation
Decompressor design: low latency. Compressor design: low cost and complexity. BΔI cache organization: modest complexity.

254 BΔI Decompressor Design
[Diagram: the base B0 from the compressed cache line is added to each of the deltas Δ0–Δ3 in parallel (vector addition) to reconstruct the values V0–V3 of the uncompressed cache line.]

255 BΔI Compressor Design
[Diagram: the 32-byte uncompressed cache line is fed in parallel to eight compression units (8-byte base with 1-, 2-, or 4-byte Δ; 4-byte base with 1- or 2-byte Δ; 2-byte base with 1-byte Δ; Zero; Repeated Values). Compression selection logic picks the unit that yields the smallest compressed size and outputs the compression flag and the compressed cache line.]

256 BΔI Compression Unit: 8-byte B0 1-byte Δ
[Diagram: the 32-byte uncompressed line is viewed as four 8-byte values V0–V3. The base B0 = V0 is subtracted from every value in parallel to produce Δ0–Δ3, and a range check asks whether every element is within a 1-byte range; if yes, the line is stored as B0 plus the four deltas, if no, this unit reports the line as incompressible.]

257 BΔI Cache Organization
[Diagram: a conventional 2-way cache with 32-byte lines versus the BΔI organization, which uses a 4-way tag store and 8-byte segmented data storage with compression-encoding bits per tag.] Twice as many tags; tags map to multiple adjacent segments; 2.3% overhead for a 2 MB cache.

258 Qualitative Comparison with Prior Work
Zero-based designs: ZCA [Dusser+, ICS’09] (zero-content augmented cache) and ZVC [Islam+, PACT’09] (zero-value cancelling); limited applicability (only zero values). FVC [Yang+, MICRO’00]: frequent value compression; high decompression latency and complexity. Pattern-based compression designs: FPC [Alameldeen+, ISCA’04] (frequent pattern compression); high decompression latency (5 cycles) and complexity. C-Pack [Chen+, T-VLSI Systems’10]: practical implementation of an FPC-like algorithm; high decompression latency (8 cycles).

259 Cache Compression Ratios
SPEC2006, databases, and web workloads with a 2MB L2. BΔI achieves the highest compression ratio (1.53).

260 Single-Core: IPC and MPKI
[Chart: per-benchmark IPC improvements and MPKI reductions for the single-core system.] BΔI achieves the performance of a 2X-size cache. Performance improves due to the decrease in MPKI.

261 Multi-Core Workloads Application classification based on
Compressibility: effective cache size increase (Low Compr. (LC) < 1.40, High Compr. (HC) >= 1.40). Sensitivity: performance gain with more cache (Low Sens. (LS) < 1.10, High Sens. (HS) >= 1.10, measured when growing the cache from 512kB to 2MB). Three classes of applications: LCLS, HCLS, HCHS; no LCHS applications. For 2-core workloads: random mixes of each possible class pair (20 mixes per pair, 120 total workloads).
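A tiny sketch of this classification using the thresholds above (the function name and class-name strings are mine):

```cpp
#include <string>

// Classify an application by compressibility (effective cache size increase)
// and sensitivity (performance gain when the cache grows from 512kB to 2MB),
// using the thresholds stated on the slide.
std::string classifyApplication(double compressibility, double sensitivity) {
  std::string c = (compressibility >= 1.40) ? "HC" : "LC";
  std::string s = (sensitivity     >= 1.10) ? "HS" : "LS";
  return c + s;  // observed classes: LCLS, HCLS, HCHS (no LCHS applications were found)
}
```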

262 Multi-Core: Weighted Speedup
If at least one application is sensitive, then performance improves. BΔI's performance improvement is the highest (9.5%).

263 Other Results in Paper IPC comparison against upper bounds
BΔI almost achieves the performance of the 2X-size cache. Sensitivity study of having more than 2X tags: up to a 1.98 average compression ratio. Effect on bandwidth consumption: a 2.31X decrease on average. Detailed quantitative comparison with prior work. Cost analysis of the proposed changes: 2.3% L2 cache area increase.

264 Conclusion A new Base-Delta-Immediate compression mechanism
Key insight: many cache lines can be efficiently represented using a base + delta encoding. Key properties: low-latency decompression, simple hardware implementation, and a high compression ratio with high coverage. Improves cache hit ratio and performance of both single-core and multi-core workloads. Outperforms state-of-the-art cache compression techniques: FVC and FPC.

265 Readings on Memory Compression (I)
Gennady Pekhimenko, Vivek Seshadri, Onur Mutlu, Philip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry, "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches" Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT), Minneapolis, MN, September 2012. Slides (pptx) Source Code

266 Readings on Memory Compression (II)
Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Michael A. Kozuch, Phillip B. Gibbons, and Todd C. Mowry, "Linearly Compressed Pages: A Low-Complexity, Low-Latency Main Memory Compression Framework" Proceedings of the 46th International Symposium on Microarchitecture (MICRO), Davis, CA, December 2013. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)]

267 Readings on Memory Compression (III)
Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Phillip B. Gibbons, Michael A. Kozuch, and Todd C. Mowry, "Exploiting Compressed Block Size as an Indicator of Future Reuse" Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015. [Slides (pptx) (pdf)]

268 Readings on Memory Compression (IV)
Gennady Pekhimenko, Evgeny Bolotin, Nandita Vijaykumar, Onur Mutlu, Todd C. Mowry, and Stephen W. Keckler, "A Case for Toggle-Aware Compression for GPU Systems" Proceedings of the 22nd International Symposium on High-Performance Computer Architecture (HPCA), Barcelona, Spain, March 2016. [Slides (pptx) (pdf)]

269 Readings on Memory Compression (V)
Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, Abhishek Bhowmick, Rachata Ausavarungnirun, Chita Das, Mahmut Kandemir, Todd C. Mowry, and Onur Mutlu, "A Case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling Flexible Data Compression with Assist Warps" Proceedings of the 42nd International Symposium on Computer Architecture (ISCA), Portland, OR, June 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)]

270 Predictable Performance Again: Strong Memory Service Guarantees

271 Remember MISE? Lavanya Subramanian, Vivek Seshadri, Yoongu Kim, Ben Jaiyen, and Onur Mutlu, "MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems" Proceedings of the 19th International Symposium on High-Performance Computer Architecture (HPCA), Shenzhen, China, February 2013. Slides (pptx)

272 Extending Slowdown Estimation to Caches
How do we extend the MISE model to include shared cache interference? Answer: the Application Slowdown Model. Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu, "The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory" Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]

273 Application Slowdown Model
Quantifying and Controlling Impact of Interference at Shared Caches and Main Memory Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, Onur Mutlu

274 Shared Cache and Memory Contention
[Diagram: cores contend for shared cache capacity and for main memory; MISE [HPCA’13] addressed interference at shared main memory.]

275 Cache Capacity Contention
[Diagram: two cores share the cache and main memory.] Applications evict each other’s blocks from the shared cache.

276 Estimating Cache and Memory Slowdowns
[Diagram: many cores share the cache and main memory; slowdowns are estimated using each application’s cache service rate and memory service rate.]

277 Service Rates vs. Access Rates
[Diagram: cache access rate versus cache service rate for an application sharing the cache and main memory.] Request service and access rates are tightly coupled.

278 The Application Slowdown Model
[Diagram: the Application Slowdown Model uses each application’s shared cache access rate as a proxy for its performance.]

279 Real System Studies: Cache Access Rate vs. Slowdown

280 Challenge: How do we estimate the alone cache access rate?
[Diagram: the application of interest is given priority at the shared cache and main memory, and an auxiliary tag store is maintained alongside the shared cache.]

281 Auxiliary tag store tracks such contention misses
[Diagram: a block evicted because of another application’s interference is still present in the auxiliary tag store when the application is given high priority.] The auxiliary tag store tracks such contention misses.

282 Accounting for Contention Misses
Revisiting the alone memory request service rate: cycles spent serving contention misses should not count as high-priority cycles.

283 Alone Cache Access Rate Estimation
Cache contention cycles: the cycles spent serving contention misses, which are identified using the auxiliary tag store while the application is given high priority. The number of high-priority cycles is likewise measured while the application is given high priority.
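Reading this slide together with the previous ones, a sketch of the estimation could look as follows; the exact formulation is in the ASM paper, and this assumes slowdown is computed as the ratio of the alone to the shared cache access rate, with contention-miss cycles excluded from the high-priority epoch (all variable names are mine):

```cpp
#include <cstdint>

// Per-application counters collected over an estimation epoch (a sketch).
struct EpochCounters {
  uint64_t sharedAccesses, sharedCycles;    // measured while running normally (shared)
  uint64_t highPriAccesses, highPriCycles;  // measured while given high priority
  uint64_t cacheContentionCycles;           // cycles spent serving contention misses
                                            // (identified via the auxiliary tag store)
};

// Slowdown estimate: alone cache access rate / shared cache access rate,
// where the alone rate excludes cycles that served contention misses.
double estimateSlowdown(const EpochCounters& c) {
  double carShared = double(c.sharedAccesses) / double(c.sharedCycles);
  double carAlone  = double(c.highPriAccesses) /
                     double(c.highPriCycles - c.cacheContentionCycles);
  return carAlone / carShared;
}
```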

284 Application Slowdown Model (ASM)
[Diagram: ASM combines the measured shared cache access rate with the estimated alone cache access rate to estimate each application’s slowdown.]

285 Previous Work on Slowdown Estimation
STFM (Stall Time Fair Memory Scheduling) [Mutlu et al., MICRO ’07], FST (Fairness via Source Throttling) [Ebrahimi et al., ASPLOS ’10], and Per-thread Cycle Accounting [Du Bois et al., HiPEAC ’13]. The most relevant previous works on slowdown estimation are STFM and FST. FST employs mechanisms similar to STFM’s for memory-induced slowdown estimation, so I will focus on STFM for the rest of the evaluation. STFM estimates slowdown as the ratio of the shared stall time to the alone stall time. The shared stall time is easy to measure, analogous to the shared request service rate. The alone stall time is harder: STFM estimates it by counting the number of cycles an application receives interference from other applications. Counting the interference experienced by each request is difficult; ASM’s estimates are much more coarse-grained, which makes them easier to obtain.

286 Model Accuracy Results
Results for select applications. Average error of ASM’s slowdown estimates: 10%.

287 Leveraging ASM’s Slowdown Estimates
Slowdown-aware resource allocation for high performance and fairness; slowdown-aware resource allocation to bound application slowdowns; VM migration and admission control schemes [VEE ’15]; fair billing schemes in a commodity cloud.

288 Cache Capacity Partitioning
Goal: Partition the shared cache among applications to mitigate contention.

289 Cache Capacity Partitioning
[Diagram: cache ways are partitioned between cores across all sets.] Previous partitioning schemes optimize for miss count. Problem: they are not aware of performance and slowdowns.

290 ASM-Cache: Slowdown-aware Cache Way Partitioning
Key Requirement: slowdown estimates for all possible way partitions; we extend ASM to estimate slowdown for all possible cache way allocations. Key Idea: allocate each way to the application whose slowdown reduces the most.
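A minimal sketch of such a greedy allocation, assuming ASM can return a slowdown estimate for any (application, way count) pair; the slide does not spell out the algorithm, so this is only one plausible reading of the key idea:

```cpp
#include <vector>

// Greedy slowdown-aware way partitioning sketch: give each cache way, one at a
// time, to the application whose estimated slowdown drops the most from one
// additional way. estSlowdown(app, ways) stands in for ASM's estimates.
std::vector<int> partitionWays(int numApps, int numWays,
                               double (*estSlowdown)(int app, int ways)) {
  std::vector<int> alloc(numApps, 0);
  for (int w = 0; w < numWays; ++w) {
    int best = 0;
    double bestReduction = -1.0;
    for (int a = 0; a < numApps; ++a) {
      double reduction = estSlowdown(a, alloc[a]) - estSlowdown(a, alloc[a] + 1);
      if (reduction > bestReduction) {
        bestReduction = reduction;
        best = a;
      }
    }
    alloc[best] += 1;  // this way helps application 'best' the most
  }
  return alloc;
}
```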

291 Memory Bandwidth Partitioning
Goal: Partition the main memory bandwidth among applications to mitigate contention.

292 ASM-Mem: Slowdown-aware Memory Bandwidth Partitioning
Key Idea: Allocate high-priority memory service time to each application in proportion to its slowdown. Application i’s requests are given the highest priority at the memory controller for its fraction of time.
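A small sketch of this idea, assuming application i's share of high-priority time is its slowdown divided by the sum of all applications' slowdowns (a natural reading of the key idea; the paper gives the precise mechanism):

```cpp
#include <vector>

// ASM-Mem sketch: compute each application's fraction of high-priority time at
// the memory controller, proportional to its estimated slowdown.
std::vector<double> highPriorityFractions(const std::vector<double>& slowdowns) {
  double total = 0.0;
  for (double s : slowdowns) total += s;
  std::vector<double> fractions;
  for (double s : slowdowns) fractions.push_back(s / total);
  return fractions;  // an epoch is then time-partitioned according to these fractions
}
```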

293 Coordinated Resource Allocation Schemes
Cache capacity-aware bandwidth allocation: 1. Employ ASM-Cache to partition cache capacity. 2. Drive ASM-Mem with the slowdown estimates from ASM-Cache.

294 Fairness and Performance Results
16-core system, 100 workloads. Significant fairness benefits across different channel counts.

295 Summary. Problem: Uncontrolled memory interference causes high and unpredictable application slowdowns. Goal: Quantify and control slowdowns. Key Contribution: ASM, an accurate slowdown estimation model; average error of ASM: 10%. Key Ideas: the shared cache access rate is a proxy for performance, and the alone cache access rate can be estimated by minimizing memory interference and quantifying cache interference. Applications of Our Model: slowdown-aware cache and memory management to achieve high performance, fairness, and performance guarantees. Source Code Released in January 2016.

296 More on Application Slowdown Model
Lavanya Subramanian, Vivek Seshadri, Arnab Ghosh, Samira Khan, and Onur Mutlu, "The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory" Proceedings of the 48th International Symposium on Microarchitecture (MICRO), Waikiki, Hawaii, USA, December 2015. [Slides (pptx) (pdf)] [Lightning Session Slides (pptx) (pdf)] [Poster (pptx) (pdf)] [Source Code]

