
1 Efficient Operating System Scheduling for Performance-Asymmetric Multi-core Architectures
Authors: Tong Li, Dan Baumberger, David A. Koufaty, and Scott Hahn (Systems Technology Lab, Intel Corporation)
Source: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing

2 Outline
- Introduction
- Scheduling for Performance-Asymmetric Architectures
- Evaluation
- Conclusion

3 Introduction
Over the next decade, we expect to see processors with tens or even hundreds of cores on a chip. Recent research advocates performance-asymmetric multi-core architectures, in which all cores share the same instruction set but have different performance characteristics, delivering higher performance at lower cost.
[Diagram: four cores with the same instruction set but different performance characteristics.]

4 Introduction (cont.)
OS schedulers traditionally assume homogeneous hardware and do not work well directly on asymmetric architectures. This paper presents AMPS (Asymmetric Multi-Processor Scheduler), which efficiently supports both SMP- and NUMA-style performance-asymmetric architectures.

5 Outline
- Introduction
- Scheduling for Performance-Asymmetric Architectures
- Evaluation
- Conclusion

6 Scheduling for Performance-Asymmetric Architectures
Run-queue models:
- Distributed run-queue model
- Centralized run-queue model
Scheduling policies:
- Thread-dependent policies
- Thread-independent policies

7 Scheduling for Performance-Asymmetric Architectures (cont.)
Optimization metrics:
- Performance
- Fairness
- Repeatability
Three components of AMPS:
1. Asymmetric-aware load balancing
2. Faster-core-first scheduling
3. NUMA-aware migration

8 Asymmetric-Aware Load Balancing
AMPS approximates core computing power using core frequencies.
Quantifying core computing power:
- Define a core's scaled computing power as P.
- The core with the lowest frequency has P = 1.
- A core with F times higher frequency has P = F × S, where S is a scaling factor and S < 1.
In our asymmetric model, one unit of time on a core with P = x is equivalent to x units of time on a core with P = 1.

9 Asymmetric-Aware Load Balancing (cont.)
Conventional OSes define the load of a core as the number of threads in its run queue, i.e., the run-queue length.
Load balancing:
- For a core with scaled power P, define its scaled load as L = run-queue length / P.
- Let L_max and L_min be the maximum and minimum scaled loads in the system.
- Define an asymmetric system to be load-balanced if L_max − L_min ≤ 1.
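As a concrete illustration, here is a minimal Python sketch of the scaled-power, scaled-load, and balance definitions above. The `Core` data layout and the value of the scaling factor S are assumptions for illustration, not taken from the AMPS implementation.

```python
from dataclasses import dataclass

S = 0.9  # scaling factor S < 1; the exact value is an assumption


@dataclass
class Core:
    frequency: float   # core clock frequency
    runqueue_len: int  # number of threads in the core's run queue


def scaled_power(core: Core, min_freq: float) -> float:
    # The slowest core has P = 1; a core F times faster has P = F * S.
    f = core.frequency / min_freq
    return 1.0 if f == 1.0 else f * S


def scaled_load(core: Core, min_freq: float) -> float:
    # L = run-queue length / P
    return core.runqueue_len / scaled_power(core, min_freq)


def is_load_balanced(cores: list[Core]) -> bool:
    # Balanced if and only if L_max - L_min <= 1.
    min_freq = min(c.frequency for c in cores)
    loads = [scaled_load(c, min_freq) for c in cores]
    return max(loads) - min(loads) <= 1.0
```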

10 Asymmetric-Aware Load Balancing (cont.)
[Diagram: two four-core examples in which cores 0-1 have P = 2 and cores 2-3 have P = 1.
Ex1: the scaled loads (including 1.5, 3, and 2) give L_max − L_min = 1.5 > 1, so the system is not load-balanced.
Ex2: every core has scaled load L = 2, giving L_max − L_min = 0, so the system is load-balanced.]

11 Faster-Core-First Scheduling
Faster-core-first scheduling lets threads run on the more powerful cores whenever those cores are under-utilized (i.e., L < 1), yielding better performance.

12 Faster-Core-First Scheduling (cont.)
Faster-core-first algorithm:
1. For a newly created thread, AMPS computes the new scaled load of each core, assuming the thread would run on it.
2. It starts the thread on the core with the minimum new scaled load.
3. Ties are broken in favor of the faster core.
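A minimal sketch of this placement rule, reusing the `Core` type and `scaled_power` helper from the earlier sketch (again an illustrative assumption, not the AMPS source code):

```python
def place_new_thread(cores: list[Core]) -> Core:
    min_freq = min(c.frequency for c in cores)

    def new_load(core: Core) -> float:
        # Scaled load the core would have if the new thread ran on it.
        return (core.runqueue_len + 1) / scaled_power(core, min_freq)

    # Minimize the new scaled load; on ties, -frequency prefers the faster core.
    return min(cores, key=lambda c: (new_load(c), -c.frequency))
```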

13 NUMA-Aware Migration
When a thread migrates to a new core, it incurs compulsory cache misses. OS schedulers therefore generally avoid migrations unless the system load is significantly unbalanced. AMPS's asymmetric-aware load balancing, however, naturally migrates threads to faster cores whenever those cores are under-utilized.

14 SMP (Symmetric Multi-Processing)
[Diagram: cores 0-3 sharing a single memory controller and DRAM.]

15 NUMA (Non-Uniform Memory Access)
[Diagram: node 0 (cores 0-1) and node 1 (cores 2-3), each with its own memory controller and DRAM, connected by a scalable interconnect.]

16 NUMA-Aware Migration (cont.)
Experiments show that the overhead of a thread migration is negligible in SMP systems but can be significant in NUMA systems. Thus, we extend AMPS with NUMA-aware migration policies.

17 NUMA-Aware Migration (cont.)
How do we determine whether a migration is likely to be beneficial?
Migration overhead prediction: tracking each thread's working set would let us check the relevant conditions, but since resident sets are much easier to track than working sets, AMPS instead tracks the resident set of each thread on each node.

18 NUMA-Aware Migration (cont.)
Prediction algorithm: when a thread T migrates from core A to core B, the algorithm predicts the migration overhead to be high if all of the following conditions are true:
1. Core A and core B are in different nodes.
2. Core A is in the node for which thread T has the maximum RSS (resident set size) counter value across all nodes.
3. Thread T's RSS counter value for core A's node is greater than the LLC (last-level cache) size of core B.
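A minimal sketch of this three-condition predictor, assuming a per-thread dictionary of per-node RSS counters and explicit node and LLC-size parameters (an illustrative data layout, not the paper's code):

```python
def migration_overhead_high(rss_per_node: dict[int, int],
                            src_node: int, dst_node: int,
                            dst_llc_size: int) -> bool:
    # Condition 1: source and destination cores are in different nodes.
    if src_node == dst_node:
        return False
    # Condition 2: the source node holds the thread's largest resident set.
    if rss_per_node[src_node] != max(rss_per_node.values()):
        return False
    # Condition 3: that resident set exceeds the destination core's LLC size.
    return rss_per_node[src_node] > dst_llc_size
```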

19 NUMA-Aware Migration (cont.)
Thread migration policies:
- The Always policy
- The Same-Node policy
- The RSS policy
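A hedged reading of how the three policies might gate a migration: Always permits every migration, Same-Node permits only intra-node migrations (both inferred from the policy names), and RSS consults the predictor sketched above. The dispatch function itself is illustrative.

```python
def allow_migration(policy: str, rss_per_node: dict[int, int],
                    src_node: int, dst_node: int, dst_llc_size: int) -> bool:
    if policy == "always":
        return True                  # permit every migration
    if policy == "same-node":
        return src_node == dst_node  # permit only intra-node migrations
    if policy == "rss":              # permit unless predicted to be costly
        return not migration_overhead_high(rss_per_node, src_node,
                                           dst_node, dst_llc_size)
    raise ValueError(f"unknown policy: {policy}")
```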

20 Outline
- Introduction
- Scheduling for Performance-Asymmetric Architectures
- Evaluation
- Conclusion

21 Evaluation
We implemented AMPS in Linux kernel 2.6.16 and used CPU clock modulation to emulate performance asymmetry on SMP and NUMA systems.

22 Evaluation (cont.)
To emulate slower cores, some cores are set to run at 50% of their full duty cycle, logically making their frequencies 50% lower than those of the remaining cores.
The benchmarks spend most of their time in parallel phases and a small fraction in sequential phases, and they over-subscribe the system with more threads than cores.

23 SMP Evaluation
Performance: [Chart: per-benchmark speedups of AMPS over stock Linux; a median of 1.16 and a value of 1.44 are highlighted.]

24 SMP Evaluation (cont.)
Fairness: [Chart: fairness results; values of 25%, 72%, and 88% are highlighted.]

25 SMP Evaluation (cont.)
Repeatability: AMPS schedules threads onto faster cores whenever possible, whereas stock Linux may schedule a thread on a faster core in one run but on a slower core in another.

26 SMP Evaluation (cont.)
Software migration overhead: AMPS introduces a large number of extra migrations, but they result in only a negligible amount of runtime overhead.

27 SMP Evaluation (cont.)
Hardware migration overhead: the miss ratios of the instruction TLB, data TLB, and trace cache are nearly identical under stock Linux and AMPS; in addition, hardware prefetching is more effective with AMPS, which reduces the number of L2 misses.

28 NUMA Evaluation
We evaluate AMPS in two NUMA configurations:
1. NUMA-1: 8 faster cores in 2 nodes; the remaining 24 cores are slower. This represents system-wide asymmetry.
2. NUMA-2: each of the 8 nodes contains 1 faster core. This represents asymmetry within a socket.

29 NUMA Evaluation (cont.)
Performance:
- The Always policy performs much worse than stock Linux for most benchmarks because of significant migration overheads. However, Ammp and Galgel obtain good speedups because they have small per-thread working sets, so frequent migrations have little impact.
- The Same-Node policy brings performance close to stock Linux.
- The RSS policy leads AMPS to outperform stock Linux for every benchmark, with speedups ranging from 1.02 to 2.61.

30 NUMA Evaluation (cont.)
Migration overhead: the RSS policy achieves the best performance of the three because it constrains migrations for benchmarks with large working sets but allows more migrations for Ammp and Galgel, whose working sets are relatively small.

31 Outline
- Introduction
- Scheduling for Performance-Asymmetric Architectures
- Evaluation
- Conclusion

32 Conclusion
This paper proposed AMPS, an OS scheduler that efficiently manages both SMP- and NUMA-style performance-asymmetric architectures. AMPS is easy to deploy: it requires only simple modifications to existing OSes and no changes to applications. The evaluation demonstrated that AMPS improves on stock Linux for asymmetric systems in three respects: performance, fairness, and repeatability of performance measurements.


