Fairness-oriented Scheduling Support for Multicore Systems


1 Fairness-oriented Scheduling Support for Multicore Systems
Changdae Kim and Jaehyuk Huh, KAIST. Hello, I am Changdae Kim from KAIST, and I'd like to present our paper, Fairness-oriented Scheduling Support for Multicore Systems.

2 Multicores have Uneven Capabilities
Asymmetric multicores · Within-die process variation · DVFS

Recently, multicores are likely to have uneven computing capabilities. First, asymmetric multicores, such as the big.LITTLE architecture, are uneven by design. Second, within-die process variation gives cores different maximum frequencies, due to variation in the manufacturing process. Third, we can create uneven multicores ourselves through DVFS settings.

3 Throughput-maximizing Scheduling
Max-perf: maximize the overall throughput [Kumar'03], [Koufaty'10], [Kwon'11], [Saez'10], [Shelepov'09], [Craeynest'12]
1) Estimate the fast core speedup = (perf on fast core) / (perf on slow core)
2) Assign the fast core to the highest-speedup app

To benefit from uneven multicores, scheduling support is essential, and many papers have proposed schedulers for them. The details differ, but their main goal is the same: maximizing overall throughput. We call this policy max-perf in the paper. These schedulers first estimate each application's fast core speedup, i.e., its performance on a fast core relative to its performance on a slow core. In our example, application one has a fast core speedup of 3.0 and application two has 1.5. The scheduler assigns the fast core to the application with the highest speedup, so application one monopolizes the fast core. This seems reasonable, since application one has twice the speedup. But what if the fast core speedup of application two is 2.9? The max-perf scheduler still assigns all of the fast core to application one. This is very unfair, since application two also has a high fast core speedup.
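To make the policy concrete, here is a minimal C sketch of max-perf (hypothetical names and data; the real schedulers cited above differ in detail): estimate each app's speedup, then hand the fast core to the app with the highest value.

#include <stdio.h>

struct app {
    const char *name;
    double perf_fast;   /* measured performance on a fast core */
    double perf_slow;   /* measured performance on a slow core */
};

/* Fast core speedup: performance on the fast core relative to the slow core. */
static double speedup(const struct app *a)
{
    return a->perf_fast / a->perf_slow;
}

int main(void)
{
    struct app apps[] = {
        { "app1", 3.0, 1.0 },   /* speedup 3.0 */
        { "app2", 2.9, 1.0 },   /* speedup 2.9 */
    };
    int n = sizeof(apps) / sizeof(apps[0]);

    /* Max-perf: the fast core goes to the single highest-speedup app,
     * even if the runner-up is only marginally behind. */
    int best = 0;
    for (int i = 1; i < n; i++)
        if (speedup(&apps[i]) > speedup(&apps[best]))
            best = i;

    printf("fast core -> %s (speedup %.1f)\n",
           apps[best].name, speedup(&apps[best]));
    return 0;
}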

4 Fairness-Maximizing Scheduling
Max-fair: maximize fairness [Kwon'11]
1) Assign an equal share of the fast core to all apps
2) Assign an equal share of the slow core to all apps

On the other hand, we can consider fairness-maximizing scheduling for uneven multicores. Youngjin Kwon proposed a fair scheduler of this kind, and we choose it as our baseline. It works as follows: first, it assigns an equal share of the fast core to all applications, and then an equal share of the slow core. This scheduling is perfectly fair, but it cannot improve throughput at all.

5 Max-Fair vs. Max-Perf

Throughput (average): +15% · MinFairness (minimum): −20% · Uniformity (variance): −43%
Uniformity = 1 − stdev(perf) / avg(perf) [Craeynest'13]

Now let's compare the two schedulers: the fairness-maximizing scheduler, max-fair, and the throughput-maximizing scheduler, max-perf. These are the results for our earlier example, where application one has speedup 3.0 and application two has speedup 1.5. Since max-fair is our baseline, we define performance under max-fair as one, and performance under any other scheduling is relative to max-fair. Here, max-perf improves the performance of application one to 1.5 and degrades the performance of application two to 0.8. We use three metrics for comparison. First, we define overall throughput as the average application performance; compared to max-fair, max-perf has 15% higher throughput. In addition, we use two fairness metrics. MinFairness is the minimum performance among applications; here it degrades by 20% compared to max-fair. The second fairness metric is uniformity, which captures the variance of application performance; we use the formula proposed by Craeynest, 1 − stdev(perf)/avg(perf). This metric takes a value between zero and one, and higher is better. In this example, uniformity drops by 43% compared to max-fair.
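A small, self-contained C sketch of the three metrics as defined above. The division by n−1 in the standard deviation is an assumption on our part; it reproduces the slide's −43% uniformity for the {1.5, 0.8} example.

#include <math.h>
#include <stdio.h>

/* perf[i]: performance of app i, normalized so that max-fair == 1.0.
 * Assumes n >= 2. */
static void fairness_metrics(const double *perf, int n, double *throughput,
                             double *minfairness, double *uniformity)
{
    double sum = 0.0, min = perf[0];
    for (int i = 0; i < n; i++) {
        sum += perf[i];
        if (perf[i] < min)
            min = perf[i];
    }
    double avg = sum / n;

    double var = 0.0;
    for (int i = 0; i < n; i++)
        var += (perf[i] - avg) * (perf[i] - avg);
    double stdev = sqrt(var / (n - 1));   /* sample standard deviation */

    *throughput  = avg;                   /* average performance  */
    *minfairness = min;                   /* minimum performance  */
    *uniformity  = 1.0 - stdev / avg;     /* [Craeynest'13] */
}

int main(void)
{
    double perf[] = { 1.5, 0.8 };         /* the slide's max-perf example */
    double t, m, u;
    fairness_metrics(perf, 2, &t, &m, &u);
    /* prints 1.15, 0.80, 0.57: +15%, -20%, -43% relative to max-fair */
    printf("throughput %.2f  minfairness %.2f  uniformity %.2f\n", t, m, u);
    return 0;
}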

6 Max-Fair vs. Max-Perf (App2 speedup now 2.9)

Throughput (average): +1% · MinFairness (minimum): −49% · Uniformity (variance): −69%
Fairness-maximizing: no throughput↑ · Throughput-maximizing: significantly unfair

Now consider the second example. The fast core speedup of application one is still 3.0, but the fast core speedup of application two is now 2.9. In this case, max-perf improves the performance of application one as before, but the performance of application two is degraded badly. As a result, overall throughput, the average, improves by only 1% over max-fair; MinFairness degrades substantially, by 49%, due to application two's slowdown; and uniformity degrades by 69%. In this example, we lose half of the fairness to gain just 1% of throughput, which is neither fair nor beneficial. We conclude that fairness-maximizing scheduling improves no throughput, while throughput-maximizing scheduling results in a significantly unfair state.

7 Fairness-oriented Scheduling
First fairness, then throughput.

Policies:
Min-fair: guarantee MinFairness, then improve throughput
Similar-min-fair: min-fair + improve uniformity

Implementation issues:
Estimate an accurate value of the fast core speedup
Schedule threads according to their core shares

Between the two extreme schedulers, we propose fairness-oriented scheduling for uneven multicore systems. The main goal is to consider fairness first, and then improve throughput. We propose two policies. The first is min-fair scheduling, which guarantees MinFairness and then tries to improve throughput. The second is similar-min-fair, which adds to min-fair a further attempt to improve uniformity. There are two implementation issues: to guarantee MinFairness, the scheduler needs an accurate estimate of the fast core speedup, and since there are two core types, threads must be scheduled according to their core shares. I will cover both in the following slides.

8 Min-Fair Scheduling

Guarantee the minimum performance that users want (App1 speedup: 3.0, App2 speedup: 2.9, App3 speedup: 1.5)

required fast core share = (target × perf_max-fair − 1) / (speedup − 1)   (details in the paper)

1) The administrator sets a target for MinFairness
2) Give each app its required fast core share → MinFairness > target
3) Give the remaining fast core share to the highest-speedup app → throughput↑
4) Distribute the slow core share to all apps

First, I will explain min-fair scheduling. The policy is to guarantee the minimum performance of all applications at the level users want. To use this scheduler, the system administrator sets the target level of MinFairness. The scheduler then gives each application its required fast core share; this step keeps MinFairness above the target level, that is, it guarantees the minimum performance users want. To calculate the required fast core share, we build a performance model using the fast core speedup; the resulting formula is shown above, and for the details, please refer to the paper. The remaining fast core share goes to the applications with the highest fast core speedups, which also improves throughput. Lastly, the scheduler distributes the slow core share to the applications.
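A hedged C helper for the required-share formula above (the names are ours; the derivation is on the "Required Fast Core Share for Min-Fair" backup slide):

/* Minimum fast core share an app needs so that its throughput relative
 * to max-fair stays above the MinFairness target. Derived from the model
 * perf = slow_core_share + speedup * fast_core_share. */
static double required_fast_share(double target, double perf_max_fair,
                                  double speedup)
{
    double share = (target * perf_max_fair - 1.0) / (speedup - 1.0);
    if (share < 0.0) share = 0.0;   /* target already met on slow cores */
    if (share > 1.0) share = 1.0;   /* cannot exceed one full fast core */
    return share;
}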

9 Similar-Min-Fair Scheduling
Apps with similar speedups → the same throughput improvement (App1 speedup: 3.0, App2 speedup: 2.9, App3 speedup: 1.5)

1) The administrator sets a similarity value and a target for MinFairness
2) Run min-fair scheduling with the target
3) Group similar apps (speedup difference < similarity)
4) Even out the fast core share among the apps in a group

And now, similar-min-fair scheduling. The policy is that applications with similar fast core speedups should gain the same throughput improvement. To use this scheduler, the administrator sets a similarity value as well as the MinFairness target. The scheduler first does what min-fair scheduling does, then groups applications whose fast core speedups are similar, meaning the difference in speedup is smaller than the similarity value. Finally, it evens out the fast core share among the applications in each group, so that applications with similar speedups gain the same throughput improvement.
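A minimal sketch of the grouping-and-evening step. This is one interpretation (an assumption, not the paper's code): apps sorted by descending speedup, with groups formed by chaining neighbors whose speedups differ by less than the similarity value.

/* After min-fair assigns share[i] to each app, even out the fast core
 * share within each group of apps with similar speedups. */
static void even_out_groups(const double *speedup, double *share, int n,
                            double similarity)
{
    int start = 0;
    while (start < n) {
        int end = start + 1;
        /* grow the group while adjacent speedups are "similar" */
        while (end < n && speedup[end - 1] - speedup[end] < similarity)
            end++;

        double sum = 0.0;
        for (int i = start; i < end; i++)
            sum += share[i];
        double even = sum / (end - start);
        for (int i = start; i < end; i++)
            share[i] = even;   /* equal share -> equal throughput boost */

        start = end;
    }
}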

10 Sampling-based Speedup Estimation
Force each thread to run on both types of cores in each interval (similar to [Kumar'03]); measure performance using PMUs (Performance Monitoring Units); the performance metric is Instructions Per Second (IPS).

speedup = (moving average of IPS on fast core) / (moving average of IPS on slow core)

Benefits: architecture-independent mechanism; low error rate (avg. 2.70%, at worst 9.92%).
Overheads: threads must run on both core types (min-fair already does this); frequent thread migration (little impact when cores share an LLC [Craeynest'12]; less than 1.8% at worst in our experiments).

To support the minimum fairness guarantee, the scheduler must estimate the fast core speedup accurately. We implement sampling-based estimation. Kumar proposed a similar mechanism, but that work was simulator-based; we implement it in the Linux kernel and show its feasibility on a real machine. To estimate the speedup, the scheduler forces applications to run on both types of cores in each interval and measures their performance with the performance monitoring units, using instructions per second as the metric. The speedup is then the moving average of IPS on fast cores divided by the moving average of IPS on slow cores. Consider the benefits and overheads of this mechanism. It is architecture independent, so it can be used for all types of uneven multicores, and it shows a low error rate, 2.7% on average. There are two potential overheads. First, threads must run on both types of cores in each interval, which restricts the scheduler; however, fairness-oriented scheduling already gives every application a share of both core types, so this is not a problem in our scheduler. Second, the mechanism incurs frequent thread migration; however, prior work reports that frequent migration has little overhead when cores share a last-level cache, and in our work the migration overhead is less than 1.8% in the worst case.

Note (from Van Craeynest et al. [33], quoted for reference): they quantified migration overhead for both shared and private last-level caches (LLCs) as a function of time-slice granularity, finding it below 1.5% across all workloads for a 4 MB shared LLC with a 1 ms time slice, and below 0.6% for a 2.5 ms time slice. Migration overhead for private LLCs was also small, though slightly higher, because the cache coherence protocol can fetch data from another core's private LLC instead of from memory.
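A simplified C sketch of the moving-average speedup estimate (illustrative structure only; the real implementation lives in the kernel and reads PMU counters, and the smoothing factor here is an assumption):

/* Exponential moving average of IPS samples per core type. */
#define ALPHA 0.25   /* smoothing factor: illustrative, not from the paper */

struct speedup_est {
    double ips_fast_avg;   /* moving average of IPS on fast cores */
    double ips_slow_avg;   /* moving average of IPS on slow cores */
};

static void record_sample(struct speedup_est *e, int on_fast_core,
                          unsigned long long instructions, double seconds)
{
    double ips = instructions / seconds;   /* IPS from PMU counts */
    double *avg = on_fast_core ? &e->ips_fast_avg : &e->ips_slow_avg;
    *avg = (*avg == 0.0) ? ips : (1.0 - ALPHA) * (*avg) + ALPHA * ips;
}

static double estimated_speedup(const struct speedup_est *e)
{
    return e->ips_fast_avg / e->ips_slow_avg;
}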

11 Implementation on Linux Kernel
Add fast_round and slow_round to the thread context: each is incremented when the thread has run on a fast/slow core for fast/slow_core_share × 30 ms. If the two rounds progress together, the thread is scheduled according to its core share ratio. Swap threads to balance fast_round and slow_round. Measure IPS on each type of core. Core share update interval: 2 seconds.

We implement this scheduling mechanism in the Linux kernel. In the thread context we add two variables, fast_round and slow_round, incremented whenever the thread completes a run on a fast or slow core lasting its core share multiplied by 30 ms. Therefore, if the two rounds progress together, the thread's core share ratio is maintained. To make them progress together, the scheduler swaps threads based on their rounds: if a thread running on a fast core has a fast_round larger than its slow_round, like thread one, the scheduler finds a thread running on a slow core whose fast_round is smaller than its slow_round, like thread two, and swaps the two threads. Afterwards, thread one runs on a slow core and is likely to increment its slow_round, so its rounds rebalance, and the same happens in reverse for thread two. In addition, we implement per-thread IPS measurement on each core type, and we set the core share update interval to two seconds: every two seconds, the scheduling policy recalculates the core shares of each application and updates them.
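An illustrative sketch of the round accounting and swap condition (hypothetical data structures, not the actual kernel patch):

struct uneven_thread {
    int fast_round;          /* completed fast-core rounds */
    int slow_round;          /* completed slow-core rounds */
    double fast_core_share;  /* assigned by the policy, 0..1 */
    double slow_core_share;
    int on_fast_core;        /* current placement */
};

/* Called when a thread has run share*30ms on its current core type. */
static void complete_round(struct uneven_thread *t)
{
    if (t->on_fast_core)
        t->fast_round++;
    else
        t->slow_round++;
}

/* Swap candidates: a fast-core thread ahead on fast rounds paired with a
 * slow-core thread ahead on slow rounds, so that both rebalance. */
static int should_swap(const struct uneven_thread *on_fast,
                       const struct uneven_thread *on_slow)
{
    return on_fast->fast_round > on_fast->slow_round &&
           on_slow->fast_round < on_slow->slow_round;
}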

12 Evaluation Methodology
Emulate uneven multicores using DVFS: AMD Phenom II, 2 fast cores (2.8 GHz) and 4 slow cores (0.8 GHz), shared LLC.
big.LITTLE architecture: Odroid-XU3 Lite, 4 big cores and 4 little cores; PMUs not available → use offline speedup values.
Workloads: mixes of SPEC CPU2006, run repeatedly until all apps finish at least once; performance is the execution time of the first run.

For evaluation, we use two platforms. First, we emulate uneven multicores through DVFS settings on an AMD hexa-core processor, configuring two cores as fast cores and four as slow cores; all six cores share a last-level cache. Second, we use an Odroid board with the big.LITTLE architecture, but its performance monitoring units are not available to us, so instead of our online speedup estimation we use offline speedup values. Here I present only the results from the emulated uneven multicores; please refer to the paper for the big.LITTLE results. For workloads, we use mixes of SPEC CPU2006: all applications in a mix run repeatedly until every application has finished at least once, and each application's performance is computed from the execution time of its first run.

13 Result: Diverse Applications
Guarantee MinFairness, then improve throughput. Performance normalized to max-fair; policies: Max-perf, Min-fair(85%), Similar-min-fair(0.2, 85%).

Mix HML.b — Throughput: 107% / 106% / 104% · MinFairness: 62% / 85% · Uniformity: 43% / 72% / 78%

The first mix consists of two high, two medium, and two low fast-core-speedup applications. Under max-perf, the first two applications are boosted substantially because of their high speedups, but the rest suffer slowdowns: overall throughput improves by 7%, but MinFairness drops to 62% and uniformity to 43%. With the same mix, min-fair scheduling with an 85% target succeeds in guaranteeing the targeted minimum fairness: the performance of the last four applications improves, and as MinFairness rises, uniformity rises too. Moreover, this scheduling still improves overall throughput by 6%, just 1% below max-perf. However, the two high-speedup applications show different throughput improvements; similar-min-fair scheduling addresses this problem and improves uniformity further.

14 Result: Similar Applications
Improve fairness without affecting throughput. Performance normalized to max-fair; policies: Max-perf, Min-fair(85%), Similar-min-fair(0.2, 85%).

Throughput: 100% (all policies) · MinFairness: 56% / 85% / 87% · Uniformity: 33% / 72% / 90%

The second mix consists of applications with similarly high fast core speedups. In this case, max-perf cannot improve throughput, since increasing one application's performance decreases another's, yet MinFairness and uniformity are degraded badly. Min-fair scheduling solves this problem by guaranteeing a minimum performance for all applications. Similar-min-fair with a similarity of 0.2 groups the four applications and gives them the same fast core share, which improves uniformity further.

15 Result: All Mixes

Throughput: achieves 51% of max-perf · MinFairness: violations of at most 3% (−3%, −1%, −1%, −1% in the missed mixes) · Uniformity: avg. 24%↑

These are the results across all mixes. The three graphs show throughput, MinFairness, and uniformity, and the three bars in each graph are the scheduling policies: max-perf, min-fair, and similar-min-fair. The results are complex, so I summarize similar-min-fair. First, MinFairness is guaranteed in ten of the fourteen workload mixes, and in the missed cases the violation is less than 3%. Second, similar-min-fair improves uniformity by 24% on average, compared to max-perf. Third, even though fairness improves substantially, similar-min-fair still achieves 51% of the throughput of max-perf scheduling.

16 Conclusion

Fairness-oriented scheduling for uneven multicores: first fairness, then throughput
Architecture-independent speedup estimation with high accuracy and little overhead
Implemented on Linux kernel 3.7.3; real-machine results:
MinFairness: mostly guaranteed (missed by less than 3%)
Uniformity: avg. 24%↑
Throughput: achieves 51% of max-perf

Let me conclude. We propose fairness-oriented scheduling for uneven multicore systems; the scheduler considers fairness first and then improves throughput. To support the fairness guarantee, we implement an architecture-independent speedup estimation with high accuracy and little overhead. Our scheduler is implemented in the Linux kernel, and the real-machine results show that it works as intended: MinFairness is mostly guaranteed above the target level, uniformity improves by 24% on average, and throughput also improves, achieving 51% of max-perf scheduling. Thanks for listening, and, actually, I have a request.

17 Speak Slowly, Slowly and Slowly

I am not good at English, especially at listening, so please speak slowly, slowly and slowly. Or, pick a question below (numbers refer to backup slides):

1) Why just TWO core types → 20
2) Effect of cache, or other resources → 21
3) Other works on fairness for uneven multicores → 22
4) Details of the required fast core share → 26
5) Multithreaded applications? → 27
6) 2 s (long) interval… how about short applications? → 28
7) Is max-fair really fair? → 29
8) Why is (min)fairness important? → 30
9) Does it lose the energy efficiency of AMP? → 31
10) How to measure the accuracy of the speedup? → 32
11) A complex scheduler may burden the CPU → 33
12) Less than 1.8% overhead → what is the overhead? → 34
A) You may treat just corner cases. → 35
B) About the title… → 36

18 Backup Slides

19 For More Core Types

With 3+ core types, problems arise in: sampling-based speedup estimation; fast/slow_round-based scheduling; and calculating the required fast core share to guarantee MinFairness → future work.

Possible solutions for speedup estimation:
Performance model with special hardware support: Scheduling heterogeneous multi-cores through Performance Impact Estimation (PIE) [ISCA'12]
PMU-based modeling is enough for process variation: Koala: a platform for OS-level power management [EuroSys'09]
[Kumar'06]: not much difference between 2 and 4 core types: Core Architecture Optimization for Heterogeneous Chip Multiprocessors [PACT'06]

Our implementation and experiments support only two core types, which is a limitation of our work. With three or more core types, several problems arise. First, sampling-based speedup estimation may become a burden, since a thread must run on three or more core types in each interval. Second, fast/slow_round-based scheduling no longer works; we could add a round per core type, but the scheduling complexity increases greatly. Third, the calculation of the required fast core share must be revised for more types. This is our future work. For speedup estimation, we can consider two mechanisms. First, Van Craeynest et al. proposed accurate performance modeling for asymmetric multicore processors (PIE), which we could adopt for speedup estimation on big.LITTLE; however, it requires special hardware support, so we may not be able to realize it. Second, PMU-based modeling works when only frequency differs among cores, as with process variation; the Koala paper showed that PMU-based modeling achieves a low error rate in that case. Moreover, Kumar's paper reports little difference in performance improvement between two and four core types, so our work retains its value even though it supports only two core types.

20 Effect of Shared Cache

Beyond the scope of our paper: the unevenness of the cores has the primary impact. (Figure: app runtime with different co-runners under max-fair; >20% difference in only 2 mixes, while the fast core speedup of these apps is 1.7–3.5.)

Considering the effect of the shared cache, or other shared resources, is beyond the scope of our paper. However, I found that in our environment performance unevenness has the dominant effect, and I want to show some data that is not in the paper. The x-axis shows applications and the y-axis the normalized runtime; each point is the runtime of an application in a specific mix. The applications in the graph belong to two or more workload mixes, so the spread represents the effect of co-runners, that is, of shared resources, on performance. In most cases the performance variance is less than 20%. Note that the fast core speedup of these applications is between 1.7 and 3.5: the relative performance on the fast core can be severalfold, while co-runners affect only about 20% of the performance. This supports our claim. There are two workload mixes that cause more than 20% performance variation; they contain many memory-intensive workloads. I consider these corner cases, which other components can address; for example, hardware-based cache partitioning can be used to avoid this problem.

21 Related Work: scaled load balancing
More threads on the fast core; computing power is determined by a (fixed) benchmark score; load balancing is based on load / computing power.

Example — Load: 4, 2, 2, 2 · Computing power: 2, 1, 1, 1 · Scaled load: 2, 2, 2, 2 (fast core, slow core, slow core, slow core)

Comparison:
Not aware of application characteristics, although applications have different fast core speedups
Needs more threads than cores to improve fairness
No guaranteed fairness metric

There are three prior works on fairness-aware scheduling for uneven multicore processors. First, Tong Li proposed scaled load balancing: briefly speaking, the scheduler just puts more threads on fast cores, with the count determined by the relative computing power, which comes from a fixed benchmark score. However, this work is not aware of application characteristics, even though applications have different fast core speedups. Also, it needs more threads than cores to improve fairness; in our work, the number of threads does not matter.

[1] Tong Li et al., "Efficient operating system scheduling for…", SC'07

22 Related Work: R%-fair

R% → max-fair / (100−R)% → max-perf (App1 speedup: 3.0, App2 speedup: 2.9, App3 speedup: 1.5)

1) The administrator sets the R% ratio, e.g. 50%
2) Give R% of the fast core share evenly
3) Give the remaining fast core share to the highest-speedup app
4) Distribute the slow core share to the apps

Comparison: does not guarantee any fairness metric.

Second, Youngjin Kwon proposed R%-fair. Simply put, the scheduler works as max-fair for R percent of the time and as max-perf for the rest. The administrator sets the R value; the scheduler distributes R% of the fast core share evenly, gives the remaining fast core share to the highest-speedup application, and distributes the slow core share. However, this scheduler does not guarantee any fairness metric, and the effect of the R value has not been clearly investigated.

[2] Youngjin Kwon et al., "Virtualizing performance asymmetric…", ISCA'11
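An illustrative C sketch of R%-fair share assignment, under the assumption of one fast core whose total share is 1 (a hypothetical helper, not Kwon's code):

/* Distribute one fast core's share among n apps under R%-fair:
 * r_percent of the core is split evenly; the remainder goes to the
 * app with the highest fast core speedup. */
static void rpercent_fair(const double *speedup, double *fast_share,
                          int n, double r_percent)
{
    double even = (r_percent / 100.0) / n;
    int best = 0;
    for (int i = 0; i < n; i++) {
        fast_share[i] = even;
        if (speedup[i] > speedup[best])
            best = i;
    }
    fast_share[best] += 1.0 - r_percent / 100.0;  /* remaining share */
}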

23 Related Work: Guaranteed fairness
Estimate uniformity for each scheduling interval, and do max-fair if uniformity < threshold.

Example (threshold = 0.85), 1 ms timeline:
Policy:     max-perf, max-perf, max-perf, max-fair, max-fair, max-perf, max-fair, max-perf
Uniformity: 0.93, 0.86, 0.79, 0.83, 0.87, 0.80, 0.85, 0.82

Comparison:
Simulation-based study
Requires new hardware support to estimate the fast core speedup
Dependent on the architecture
Guarantees uniformity, not minimum fairness as ours does

The most recent work is by Van Craeynest et al. The scheduler estimates uniformity for each scheduling interval and works as max-fair when uniformity is below the threshold level; otherwise it works as max-perf. In the example, the scheduler runs max-perf and tracks uniformity; once uniformity drops below 0.85, it does max-fair until uniformity rises above the threshold again, then goes back to max-perf, and so on. For comparison, this work is simulator-based and requires new hardware support that is not currently available in real machines. Also, it guarantees uniformity, whereas I believe MinFairness is more important in the cloud computing era.

[3] Kenzo Van Craeynest et al., "Fairness-aware scheduling on…", PACT'13
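A tiny sketch of the threshold-switching control loop described above (our reconstruction of the decision rule, not code from [3]):

/* Each 1 ms interval: fall back to max-fair whenever the measured
 * uniformity drops below the threshold; otherwise run max-perf. */
enum policy { MAX_PERF, MAX_FAIR };

static enum policy pick_policy(double uniformity, double threshold)
{
    return (uniformity < threshold) ? MAX_FAIR : MAX_PERF;
}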

24 Related Work: Comparisons
Comparison across [1] scaled load balancing, [2] R%-fair, [3] guaranteed fairness, and sim-min-fair (proposed):
Aware of application characteristics? NO for [1]; YES for the others
Estimate speedup at runtime? / Implemented on a real machine? / Require extra HW support? (see the preceding slides: [3] is simulation-based and needs extra hardware; ours runs on a real machine with runtime estimation)
Guaranteed fairness metric: none for [1] and [2]; uniformity for [3]; minimum fairness for ours

And this is the final comparison table.

[1] Tong Li et al., "Efficient operating system scheduling for…", SC'07
[2] Youngjin Kwon et al., "Virtualizing performance asymmetric…", ISCA'11
[3] Kenzo Van Craeynest et al., "Fairness-aware scheduling on…", PACT'13

25 Required Fast Core Share for Min-Fair
How do we calculate the required fast core share? We use a performance model:

perf = slow_core_share + speedup × fast_core_share

where performance is relative to running entirely on slow cores. By definition,

throughput_app = perf / perf_max-fair
               = (slow_core_share + speedup × fast_core_share) / (slow_core_share_max-fair + speedup × fast_core_share_max-fair)

MinFairness > target requires that, for all apps, throughput_app > target, since MinFairness is the minimum. For a single-threaded application, slow_core_share = 1 − fast_core_share, so the inequality becomes a function of fast_core_share alone. Solving it gives

fast_core_share_required = (target × perf_max-fair − 1) / (speedup − 1)

To calculate the required fast core share, we made this performance model using the fast core speedup: the performance of an application is its slow core share plus its speedup multiplied by its fast core share. To keep MinFairness above the target value, every application's throughput, defined as the performance relative to max-fair, must be larger than the target. Plugging the performance model into the inequality, substituting 1 − fast_core_share for the slow core share, and solving for fast_core_share yields the minimum fast core share required to guarantee minimum fairness.
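For completeness, the same derivation in LaTeX (our transcription of the slide's algebra, assuming a single-threaded app so that the slow share is 1 minus the fast share):

\begin{align*}
\mathit{perf} &= \mathit{slow\_share} + \mathit{speedup} \cdot \mathit{fast\_share}
  = 1 + (\mathit{speedup} - 1)\,\mathit{fast\_share} \\
\mathit{throughput} &= \frac{\mathit{perf}}{\mathit{perf}_{\text{max-fair}}} > \mathit{target} \\
\Rightarrow\; 1 + (\mathit{speedup} - 1)\,\mathit{fast\_share} &> \mathit{target} \cdot \mathit{perf}_{\text{max-fair}} \\
\Rightarrow\; \mathit{fast\_share}_{\text{required}} &= \frac{\mathit{target} \cdot \mathit{perf}_{\text{max-fair}} - 1}{\mathit{speedup} - 1}
\end{align*}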

26 Multi-threaded Applications
This paper investigates inter-application fairness with single-threaded applications.

Solution for future work: a 2-level scheduler
Inter-app: this paper, with a modified speedup (IPS = sum of instructions of all threads / sum of execution time of all threads)
Intra-app: max-fair

Second solution: this paper + utility-based acceleration (Utility-based acceleration of multithreaded applications on asymmetric CMPs [ISCA'13])
Inter-app: this paper, with utility as the speedup (utility: application-level speedup when a thread runs on the fast core)
Intra-app: the highest-utility thread monopolizes the application's fast core share

First of all, this paper investigates inter-application fairness with single-threaded applications; this is the first step of research on fairness in uneven multicores. For multithreaded applications, we can consider a two-level scheduler. At the inter-application level, our current work can be used, but the fast core speedup must be modified, for example by using the aggregated IPS of all threads instead of a single thread's IPS, as sketched below. At the intra-application level, that is, among the threads of an application, we can use max-fair, for two reasons: threads in an application are likely to have similar characteristics, and programmers today still assume symmetric multicore processors. The second solution applies the utility-based acceleration mechanism of the ISCA 2013 paper to our work: inter-application scheduling uses our approach with utility as the application's speedup, and within an application the thread with the highest utility monopolizes that application's fast core share, not the whole system's fast cores.
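A sketch of the aggregated-IPS modification mentioned above (hypothetical struct; the paper only outlines this as future work):

/* Application-level IPS for a multithreaded app: aggregate the
 * instruction counts and execution times of all threads on one core
 * type, then divide (as proposed for the inter-app speedup). */
struct thread_stats {
    unsigned long long insts;  /* instructions retired on this core type */
    double exec_seconds;       /* time run on this core type */
};

static double app_ips(const struct thread_stats *threads, int nthreads)
{
    unsigned long long insts = 0;
    double secs = 0.0;
    for (int i = 0; i < nthreads; i++) {
        insts += threads[i].insts;
        secs  += threads[i].exec_seconds;
    }
    return insts / secs;
}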

27 2 Second (Long) Interval
Target long-term stable phases; with a fine-grained quantum, short-term phases change rapidly.

[1] Composite Cores: Pushing Heterogeneity into a Core [MICRO'12]
[2] Trace Based Switching For A Tightly Coupled Heterogeneous Core [MICRO'13]
[3] DynaMOS: Dynamic Schedule Migration for Heterogeneous Cores [MICRO'15]

The interval may be shortened if we can predict the next short-term phase [3] and reduce the thread migration overhead [1]. For short programs, memoization may help: start with the speedup from a previous run.

We use a two-second interval for core share updates and speedup estimation, which can be a limitation of our work. The two-second interval targets long-term stable phases, which SPEC CPU exhibits at this granularity. A fine-grained quantum might improve throughput, but short-term phases change very rapidly, and we may not be able to capture that potential at the OS level. There is a series of works that exploits fine-grained scheduling intervals, but they require new hardware designs. Our scheduling interval could be shortened if predicting the next phase were possible and the thread migration overhead were drastically reduced; in fact, these are what that series of papers does. In any case, for short applications, memoization may help the scheduler react immediately: the fast core speedup from a previous run can serve as the starting value for a new run.

28 What is Fair in Uneven Multicores?
Two proposals for fairness in uneven multicores:
1) Equal slowdown from an isolated run on the fast core [Craeynest'13] — hard to estimate at runtime
2) Equal core share of each core type [Kwon'11]

What is fair in uneven multicores is a controversial issue, and there have been two proposals. The first, by Van Craeynest et al., says that if the threads in a system show equal slowdown relative to an isolated run on the fast core, they are fairly scheduled. For example, two applications may show different performance in isolated and consolidated environments, but if their slowdowns are the same, they are scheduled fairly. However, it is difficult to estimate the isolated performance at runtime, and that work relies on special hardware support; since we want to implement our scheduler on a real machine, we cannot choose this definition. We therefore choose the second one, as I explained earlier. It is based solely on CPU cycles, and the performance under this scheduling can be estimated at runtime once we can estimate the fast core speedup.

29 Why is Fairness Important
Fairness is basic to an OS scheduler: the Linux scheduler is the Completely Fair Scheduler; fairness prevents starvation and helps real-time apps meet deadlines.
Fairness supports priority and weight (but this is not investigated in this paper).
Fairness supports consolidated systems (e.g., cloud computing): low performance of an application may be acceptable, but low performance of a VCPU is not.
Guaranteeing MinFairness [ours] vs. uniformity [Craeynest'13]: MinFairness corresponds to minimum performance (important in the cloud), and after guaranteeing MinFairness we can still improve throughput, whereas throughput improvement directly affects uniformity.

First, fairness is fundamental to an OS scheduler: the current Linux scheduler is named the Completely Fair Scheduler, and fairness relates to preventing starvation and serving real-time applications. Second, fairness is the basis for priority and weight: once we can schedule threads fairly, we can apply priorities or weights to some of them; however, this is not investigated in this paper and is part of our future work. Third, fairness is more important today, as consolidated systems become popular: low performance of an application on a private system may be acceptable, but low performance of a VCPU cannot be. As for the metric to guarantee, we choose MinFairness for two reasons. Guaranteeing MinFairness directly guarantees minimum performance, which is very important in the cloud computing era, and after guaranteeing MinFairness we can still improve throughput, as we did. In contrast, an application's throughput improvement directly affects uniformity, so there is less room for throughput gains under that metric.

30 Get Fairness → Lose Energy Efficiency

Energy efficiency comes from throughput improvement: more fairness → less throughput → less energy efficiency. There is a trade-off between throughput and fairness, and this paper offers an option on that trade-off: an adjustable fairness level if you want more energy efficiency.

Most of the energy efficiency of uneven multicores comes from throughput improvement within the power budget. Gaining more fairness directly reduces throughput, and thus energy efficiency: there is a trade-off between the two. The graph relates the achieved throughput of min-fair under different parameters: as the target level of MinFairness increases, the achieved throughput decreases. So more fairness means less throughput and less energy efficiency. However, fairness is still important, and this paper offers an option to choose the fairness level: if you want more energy efficiency, that is, more throughput improvement, you can set a lower level of fairness.

31 Accuracy of Speedup Estimation
Error: avg. 2.70% / max 9.92%
Real: relative execution time from pinned runs
Estimated: average speedup over all intervals under max-fair

32 Overhead: CPU Time

CPU time used by the scheduler per scheduling interval (speedup estimation and core share calculation): at worst 0.17 ms → about 0.002% CPU utilization with a 2 s interval and 6 cores.

33 Overhead: Throughput degradation
Native Linux vs. max-fair (with and without speedup estimation). Setup: 2 "fast" and 4 "slow" cores, actually running at the same frequency. Worst case: 1.8% throughput degradation.

34 Fairness Analysis of max-perf
Histogram of max-perf results with ideal scheduling: all combinations of 23 SPEC CPU2006 benchmarks on 2 fast / 4 slow cores. Max-perf loses ~30% of minimum fairness to gain ~4% throughput. Ideal assumptions: (1) the fast core speedup is known, (2) no scheduling overhead, (3) no thread migration overhead, (4) no resource sharing effects.

