Age Based Scheduling for Asymmetric Multiprocessors
Nagesh B Lakshminarayana, Jaekyu Lee & Hyesoon Kim

Outline
Background and Motivation
Age Based Scheduling
Evaluation
Conclusion

Asymmetric (Chip) Multiprocessors
Heterogeneous architectures where all cores have the same ISA but different performance
[Figure: a heterogeneous architecture with two processing element types, PE A and PE B]

Asymmetric (Chip) Multiprocessors
Potential for better performance than SMPs occupying the same area and consuming the same power
[Figure: Symmetric Chip Multiprocessor (SMP/CMP) vs. Asymmetric Chip Multiprocessor (AMP/ACMP), each with cores Core0–Core3]

AMPs present new challenges; thread scheduling is one of them.

Scheduling in Multiprocessor OSes
Thread Assignment – assign to the least loaded core (sketched below)
Load Balancing – make the load on all cores uniform
Idle Balancing – move threads from busy cores to idle cores
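
As context for these mechanisms (this sketch is not from the slides), assignment to the least loaded core can be expressed roughly as follows; the run-queue-length load metric and all names here are illustrative assumptions.

```python
def assign_to_least_loaded(core_runqueues, thread):
    """Conventional, asymmetry-unaware assignment: pick the core whose
    run queue is currently shortest and enqueue the thread there."""
    core = min(core_runqueues, key=lambda c: len(core_runqueues[c]))
    core_runqueues[core].append(thread)
    return core

# Example: four cores that the scheduler assumes to be identical.
runqueues = {"core0": [], "core1": [], "core2": [], "core3": []}
print(assign_to_least_loaded(runqueues, "thread0"))  # -> "core0"
```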

Scheduling in Multiprocessor OSes
Current schedulers assume that all cores are identical, which results in bad performance and application instability.
[Chart: Parsec benchmarks on a (real) AMP using the Linux scheduler; configurations: all-fast = 16 cores at 2 GHz, half-half = 8 cores at 2 GHz + 8 cores at 1 GHz, all-slow = 16 cores at 1 GHz]

Problem with Current Scheduling
Current schedulers do not take advantage of the fast core.

Outline
Background and Motivation
Age Based Scheduling (ABS)
Evaluation
Conclusion

Motivation for Age Based Scheduling
Many compute-intensive multithreaded applications follow the fork-join model
Milestones (barriers) in thread execution
[Figure: application model – the main thread forks worker threads, which synchronize at barriers and eventually join]

Symmetry of Applications
Threads created together are symmetric
– Based on instruction count
– Degree of Symmetry = Std Dev / Average (see the sketch below)
[Chart: degree of symmetry of the Parsec benchmarks; symmetric benchmarks are those with a degree of symmetry <= 0.1]
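
As a rough illustration of this metric (not code from the paper), the degree of symmetry can be computed from per-thread instruction counts as below; the counts and the use of the population standard deviation are assumptions made for the example.

```python
import statistics

def degree_of_symmetry(instruction_counts):
    """Degree of symmetry = standard deviation / average of the
    per-thread instruction counts between milestones."""
    return statistics.pstdev(instruction_counts) / statistics.mean(instruction_counts)

# Hypothetical per-thread retired-instruction counts for one barrier interval.
counts = [1.02e9, 0.98e9, 1.01e9, 0.99e9]
print(degree_of_symmetry(counts))  # well under 0.1, so this run counts as symmetric
```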

Insight
exe_dur(T1) = exe_dur(T2) = exe_dur(T3) = exe_dur(T4)
It is difficult to predict the absolute execution duration, so predict the relative execution duration instead.
[Figure: threads T1–T4 running toward a common barrier; execution duration = ?]

Putting It Together
Applications follow the fork-join model with milestones in between
Many applications are symmetric
The relative execution duration to the next milestone is easy to predict
→ Age Based Scheduling

What is Age?
Age is the progress made by a thread towards its next milestone.

Age Calculation
Threads created together have the same age
As a thread executes, it ages
Reset age when a milestone is crossed
[Figure: timelines of threads A and B, with ages t_A and t_B. Age starts at 0 at creation, grows during execution (e.g., t_A = 30, t_B = 50), and is reset to 0 when a milestone (barrier or termination) is crossed. X denotes an unknown age, assumed to be a large value.]
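
A minimal sketch (not the authors' implementation) of the age bookkeeping described on this slide, assuming age is measured in retired instructions; the Thread class and callback names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Thread:
    tid: int
    age: int = 0  # progress toward the next milestone, e.g. retired instructions

def on_progress(thread, retired_instructions):
    # As a thread executes, it ages.
    thread.age += retired_instructions

def on_milestone(thread):
    # Reset age when a milestone (barrier or termination) is crossed.
    thread.age = 0

# Threads created together start with the same age (zero).
workers = [Thread(tid=i) for i in range(4)]
```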

Age Based Scheduling Algorithm
To make a scheduling decision:
Calculate the remaining execution duration to the next milestone based on age
Assign threads with longer remaining execution durations to the fast cores – Longest Job to Fast Core First (LJFCF)
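
A simplified sketch of one LJFCF decision, assuming the remaining execution durations have already been estimated (as on the later slides) and one thread per core; the rem_exe values reuse the example from the "Working of the Algorithm" slide, while the function and core names are illustrative.

```python
def ljfcf_assign(threads, fast_cores, slow_cores, remaining):
    """Longest Job to Fast Core First: sort threads by predicted remaining
    execution duration and fill the fast cores before the slow ones."""
    ordered = sorted(threads, key=lambda t: remaining[t], reverse=True)
    cores = list(fast_cores) + list(slow_cores)  # fast cores are filled first
    return dict(zip(ordered, cores))             # assumes one thread per core

# rem_exe values from the "Working of the Algorithm" slide.
remaining = {"A": 50, "B": 70, "C": 90, "D": 30}
assignment = ljfcf_assign(["A", "B", "C", "D"], ["fast0"],
                          ["slow0", "slow1", "slow2"], remaining)
print(assignment)  # thread C, with the longest remaining duration, gets the fast core
```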

Application of LJFCF
Apply whenever:
– A thread is created
– A core becomes idle
– The reassignment timer expires (for load balancing)

Working of the Algorithm
[Figure: timeline of thread T1 from creation (t_A = 0) toward a barrier milestone. At the scheduling decision t_A = 30; with the age at the barrier predicted to be X, the remaining execution duration is rem_exe = (X – 30).]

Remaining Execution Duration (I)
Track the progress of threads
Using prediction [AGE]
– Predict that all threads have the same inter-milestone distance
[Figure: timelines of threads A and B; every inter-milestone interval (creation to barrier, barrier to termination) is predicted to have the same length X, so t_A and t_B each run from 0 to X.]
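
A hedged sketch of the AGE predictor: all threads are assumed to share the same inter-milestone distance, so the remaining work is that distance minus the current age. The distance constant below is a placeholder, not a value from the paper.

```python
# Assumed common inter-milestone distance (the slides treat the true value as
# unknown and use a large default); measured here in retired instructions.
PREDICTED_DISTANCE = 10**9

def remaining_exec_duration_age(age, distance=PREDICTED_DISTANCE):
    """AGE predictor: every thread is predicted to have the same
    inter-milestone distance, so remaining work = distance - age."""
    return max(distance - age, 0)
```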

Remaining Execution Duration (II)
Using profiling [AGE(PROF)]
– Threads have different inter-milestone distances, calculated from a metric obtained by profiling
[Figure: timelines of threads A and B; thread A's inter-milestone distance is X while thread B's is rX, where r comes from the profiler. There is only one r value for each thread.]
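
Extending the previous sketch to AGE(PROF): each thread carries a single profiled ratio r that scales its predicted inter-milestone distance. The r values shown are invented for illustration.

```python
PREDICTED_DISTANCE = 10**9  # baseline inter-milestone distance X (placeholder)

def remaining_exec_duration_prof(age, r, distance=PREDICTED_DISTANCE):
    """AGE(PROF) predictor: thread-specific inter-milestone distance r * X,
    where r is the one per-thread ratio obtained from the profiler."""
    return max(r * distance - age, 0.0)

# Hypothetical profiled ratios: thread B is expected to do half the work of A.
profiled_r = {"A": 1.0, "B": 0.5}
```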

Working of the Algorithm
[Figure: one fast core and several slow cores running threads A–D with rem_exe(A) = 50, rem_exe(B) = 70, rem_exe(C) = 90, rem_exe(D) = 30. Thread C, which has the longest remaining execution duration, is assigned to the fast core.]

Benefit of Age Based Scheduling
Asymmetry aware
Utilizes all cores
Gives all threads opportunities to run on fast cores

Implementation
OS
– Track progress using performance counters
– Disable the counter on interrupts
Compiler (AGE(PROF))
– Passes the profiled information: one value for each thread

Outline
Background and Motivation
Age Based Scheduling
Evaluation
Conclusion

Evaluation
Simulation-based experiments
– Trace + execution hybrid simulator
– Locks and barriers are modeled
– Context switch and migration overhead simulated
– 10 ms time slice for each thread
Machine configuration
– 1 fast core, 7 slow cores, 8:1 speed ratio (others are in the paper)
Benchmarks
– Symmetric: Parsec (simmedium input)
– Asymmetric: Splash-2, OMPSCR, SuperLU

Comparisons with Other Policies

Policy            Description
Linux             Linux O(1) scheduler
RR                Threads are assigned to fast cores in a round-robin fashion
SCALEDLD [Li'07]  Fast-core-first assignment, asymmetry-aware load balancing (baseline)
FCA-AGE           Fast-core-first assignment with age-based periodic reassignment
AGE               Age-based assignment and reassignment using prediction
AGE(PROF)         Age-based assignment and reassignment using profiling
AGE(ORACLE)       Age-based assignment and reassignment using an oracle

LJFCF vs Other Policies (I)
Parsec benchmarks; baseline: SCALEDLD
* The default Linux policy, which performs considerably worse than the other policies, is not shown.

Policy         Avg % reduction over SCALEDLD
RR
FCA-AGE        9.8
AGE            10.4
AGE(PROF)      13.2
AGE(ORACLE)    15.4

LJFCF vs Other Policies (II)
Asymmetric benchmarks; baseline: SCALEDLD

Policy         Avg % reduction over SCALEDLD
FCA-AGE        8.2
AGE            7.7
AGE(PROF)      9.4
AGE(ORACLE)    13.1

Idle Cycles
Linux scheduler – most of the idle cycles are contributed by the fast core
SCALEDLD – keeps the same thread(s) on the fast core
AGE – assigns different threads to the fast core

Different AMP Configurations
The need for asymmetry-aware scheduling increases as cores become more asymmetric
AGE-based policies show more improvement over SCALEDLD as asymmetry increases
X/1: the ratio of fast-core to slow-core speed is X:1

Outline
Background and Motivation
Age Based Scheduling
Evaluation
Conclusion

Conclusion
Age Based Scheduling (ABS) for asymmetric multiprocessors
– ABS assumes threads created at the same time are symmetric
– ABS assigns threads to cores based on their predicted remaining execution durations
– Predictions are made based on the age of threads
Improvement of 10.4% (prediction) and 13.2% (profiling) for Parsec, and 7.6% (prediction) and 9.4% (profiling) for the asymmetric benchmarks, over Li's mechanism

THANK YOU