Real-Time Scheduling for Multiprocessor Platforms


1 Real-Time Scheduling for Multiprocessor Platforms
Marko Bertogna Scuola Superiore S.Anna, Pisa, Italy

2 Outline
Why do we need multiprocessor RTS?
Why are multiprocessors difficult?
Global vs partitioned scheduling
Existing scheduling algorithms
Schedulability results
Evaluation of existing techniques
Future research

3 As Moore’s law goes on… The number of transistors per chip doubles every 18 to 24 months: the number of transistors that can be inexpensively placed on an integrated circuit grows exponentially, doubling approximately every two years. Patterson & Hennessy: the number of cores will double every 18 months, while power, clock frequency and costs will remain constant.

4 …heating becomes a problem
Pentium Tejas cancelled! [Chart: power density (W/cm²) vs. year for Intel processors, from the 4004, 8008, 8080, 8085, 8086, 286, 386, 486 through the Pentium, P1, P2, P3 and P4, rising past the hot-plate level toward nuclear-reactor levels] P ∝ V · f: clock speed limited to less than 4 GHz

5 Why did it happen? Integration technology constantly improves…

6 CMOS manufacturing process
Half-pitch: half the distance between identical features in an array of transistors

7 Technology trends
Reduced gate sizes
Higher frequencies allowed
More space available
BUT
Physical limits of semiconductor-based microelectronics
Larger dynamic power consumed
Leakage current becomes important
Problematic data synchronization (GALS)

8 Lead Microprocessors frequency doubled every 2 years
[Chart: clock frequency (MHz) vs. year, 1970–2010, from the 4004, 8008, 8080 and 8085 up through the 8086, 286, 386, 486, Pentium® proc and P6. Courtesy, Intel]

9 Intel’s timeline

10 Die Size Growth
Die size grows by 14% every two years to satisfy Moore’s Law

11 Frequency and power f = operating frequency
V = supply voltage (V roughly ∝ f). P = Pdynamic + Pstatic = power consumed. Pdynamic ∝ A C V² f (the main contributor down to hundreds of nm), due to charging and discharging capacitive loads: A is the fraction of gates actively switching, C is the total capacitance load of all gates. Pstatic ∝ V Ileak (due to leakage; important when going below 100 nm). Leakage current, the source of static power consumption, is a combination of subthreshold and gate-oxide leakage: Ileak = Isub + Iox. Reducing the transistors’ operating voltage yields a quadratic reduction of the power consumed; however, it also causes a more than linear reduction of the device’s working frequency, limiting the performance of the system.
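The two contributions to the power model above can be computed separately. A toy sketch; every numeric value below is made up for illustration and is not a real process parameter:

```python
def dynamic_power(a, c, v, f):
    """Dynamic power: P_dyn = A * C * V^2 * f (activity factor A,
    switched capacitance C, supply voltage V, frequency f)."""
    return a * c * v ** 2 * f

def static_power(v, i_leak):
    """Static power: P_static = V * Ileak (leakage current Ileak)."""
    return v * i_leak

# Halving V cuts dynamic power by 4x, but (since V roughly tracks f)
# it also forces a lower operating frequency.
p_hi = dynamic_power(a=0.1, c=1e-8, v=1.2, f=3e9) + static_power(1.2, 1.0)
p_lo = dynamic_power(a=0.1, c=1e-8, v=0.6, f=3e9) + static_power(0.6, 1.0)
```

The quadratic dependence on V is exactly why voltage scaling was so attractive until the frequency and leakage problems of the next slides appeared.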

12 Design considerations (I)
P = A C V² f + V Ileak. Number of transistors grows → A grows. Die size grows → C grows. Reducing V would allow a quadratic reduction of dynamic power, but clock frequency would decrease more than linearly (since V roughly ∝ f), unless the threshold voltage Vth is reduced as well. But, again, there are problems: Ileak = Isub + Igox increases when Vth is low!

13 Design considerations (II)
P = A C V² f + V Ileak. Reducing Vth and gate dimensions → leakage current becomes dominant in recent process technologies → static dissipation. Static power dissipation is always present unless the device state is lost. Summarizing: there is no way out for classic frequency scaling on single-core systems!

14 Power delivery and dissipation
[Chart: power (W) vs. year, 1971–2008, from the 4004 (under 1 W) through the 8008, 8080, 8085, 8086, 286, 386, 486 and Pentium® proc, extrapolating toward 500 W, 1.5 kW, 5 kW and 18 kW. Courtesy, Intel]

15 UltraSPARC Power consumption
The largest amount of power is consumed by the cores and by leakage

16 Keeping Moore’s law alive
Exploit the immense number of transistors in other ways: reduce gate sizes while keeping the frequency sufficiently low; use a higher number of slower logic gates. In other words: switch to multicore systems!

17 Solution
Denser chips with transistors operating at lower frequencies: MULTICORE SYSTEMS

18 What is this? David May’s B0042 board - 42 Transputers

19 The multicore invasion (high-end)
Intel’s Core 2, Itanium, Xeon: 2, 4 cores
AMD’s Opteron, Athlon 64 X2, Phenom: 2, 4 cores
IBM’s POWER7: 8 cores
IBM-Toshiba-Sony Cell processor: 8 cores (PSX3)
Sun’s Niagara UltraSPARC: 8 cores
Microsoft’s Xenon: 3 cores (Xbox 360)
Tilera’s TILE64: 64 cores
Others (network processors, DSP, GPU, …)
GPU: Graphics Processing Unit (graphics cards). The MPC8641D processor (Power Architecture) is included in Freescale’s evaluation board referenced at the link.

20 The multicore invasion (embedded)
ARM’s MPCore: 4 cores
ALTERA’s Nios II: x cores
Network processors are being replaced by multicore chips (Broadcom’s 8-core processors)
DSP: TI, Freescale, Atmel, Picochip (up to 300 cores, communication domain)
The telecommunications market was one of the first to need a new design of parallel datapath packet processing: multiple-core processors were adopted very quickly for the datapath and the control plane. These MPUs are replacing the traditional network processors that were based on proprietary micro- or pico-code.

21 How many cores in the future?
Application dependent
Typically few for high-end computing
Many trade-offs: transistor density, technology limits, Amdahl’s law

22 Beyond 2 billion transistors/chip
Intel’s Tukwila: Itanium-based, 2.046 B transistors, quad-core, 65 nm technology, 2 GHz at 170 W, 30 MB cache, 2-way SMT → 8 threads per clock

23 How many cores in the future?
Intel’s 80-core prototype already available
Able to transfer a TB of data/s (while the Core 2 Duo reaches 1.66 GB of data/s)

24 How many cores in the future?
Berkeley: weather simulation at 1.5 km resolution, 1000× realtime, 3M custom-tailored Tensilica cores
Petaflop computer: power around 2.5 MW, estimated for 2011, cost around 75 M$; the main obstacle will be software

25 Supercomputers Petaflop supercomputers (current supercomputers have Teraflop computing power): IBM’s Bluegene/P, 3 Pflops, quad-core Power PC, ready before 2010

26 Prediction Patterson & Hennessy: “number of cores will double every 18 months, while power, clock frequency and costs will remain constant”. Due to power considerations, chip designers are opting to increase overall per-socket performance by adding more processing cores rather than increasing clock speed: John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach, fourth edition. Morgan Kaufmann, San Francisco, 2006.

27 Amdahl’s law Originally formulated for speed-ups in a portion of a program
Later adapted to measure the speedup obtained by increasing the number of processors. P = parallel portion of a given application, N = number of processors/cores. The total speedup obtained by increasing N is S(N) = 1 / ((1 − P) + P/N)

28 Considerations on Amdahl’s law
For N arbitrarily large, the maximum speedup tends to 1 / (1 − P)
In practice, performance/price falls rapidly as N is increased once there is even a small serial component (1 − P)
Example: P = 90% → (1 − P) = 10% → speedup < 10
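The speedup formula above fits in one line; the 90% example reproduces the figure on the slide:

```python
def amdahl_speedup(p, n):
    """Total speedup with parallel fraction p on n processors:
    S(N) = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# P = 90%: even with a huge number of cores the speedup stays below
# the 1 / (1 - P) = 10 asymptote.
few_cores = amdahl_speedup(0.9, 4)      # modest gain on 4 cores
many_cores = amdahl_speedup(0.9, 10**6)
```

Note how quickly the returns diminish: going from 4 cores to a million cores gains barely a factor of three here.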

29 Amdahl’s law

30 Consequences of Amdahl’s law
“Law of diminishing returns”: even picking optimal improvements, each successive improvement yields a smaller return; the same holds when adding processors
Considering also memory, bus and I/O bottlenecks, the situation gets worse
Parallel computing is only useful for limited numbers of processors, or for problems with very high values of P → “embarrassingly parallel problems”

31 Embarrassingly parallel problems
Problems for which no particular effort is needed to segment the work into a very large number of parallel tasks, and there is no essential dependency (or communication) between those parallel tasks: 1 × 4 GHz = 2 × 2 GHz. Examples: GPU-handled problems, 3D projection (independent rendering of each pixel), brute-force searching in cryptography

32 Performance boost with multicore
Interrupts can be handled on an idle processor instead of preempting the running process (also for programs written for single core): not faster execution, but a smoother appearance. For inherently parallel applications (graphic operations, servers, compilers, distributed computing) the speedup is proportional to the number of processors, with limitations due to serialized RAM access and cache coherency. The main benefit to an ordinary user will be improved multitasking performance, which may apply more often than expected: operating systems, antivirus programs and other background processes (including audio and video controls) already run many threads. The largest boost will likely be noticed in improved response time while running CPU-intensive processes such as antivirus scans, defragmenting, ripping/burning media (requiring file conversion), or searching folders. For example, if an automatic virus scan starts while a movie is being watched, the movie is far less likely to lag, as the antivirus program is assigned to a different processor than the one running the movie playback.

33 Less likely to benefit from multicores
I/O bound tasks
Tasks composed of a series of pipeline-dependent calculations
Tasks that frequently communicate with each other
Tasks that contend for shared resources

34 Exploiting multicores
Multicore-capable OSes: Windows NT 4.0/2000/XP/2003 Server/Vista; Linux and Unix-like systems; Mac OS; VxWorks, QNX, etc.
Multi-threaded applications: multicore optimizations for game engines (Half-Life 2: Episode Two, Crysis, etc.)
Software development tools

35 Parallel programming Existing parallel programming models:
OpenMP
MPI
IBM’s X10
Intel’s TBB (abstraction for C++)
Sun’s Fortress
Cray’s Chapel
Cilk (Cilk++)
Codeplay’s Sieve C++
Rapidmind Development Platform

36 Identical vs heterogeneous cores
ARM’s MPCore: 4 identical ARMv6 cores. STI’s Cell processor: one Power Processor Element (PPE) and 8 Synergistic Processing Elements (SPEs)

37 Allocating tasks to processors
Possible partitioning choices: partition by CPU load; partition by information-sharing requirements; partition by functionality. Use the least possible number of processors, or run at the lowest possible frequency: the choice depends on considerations like fault tolerance, power consumption, temperature, etc.

38 Real-time scheduling theory for multiprocessors

39 Different models
Deadline model: implicit, constrained, arbitrary
Task model: periodic (synchronous, asynchronous), sporadic, generalized multiframe, recurring (DAG), …
Priority model: static (fixed) task priority, static job priority, arbitrary
Migration model: global, partitioned, job-level partitioned, clustered, etc.

40 System model Platform with m identical processors
Task set t with n periodic or sporadic tasks ti
Period or minimum inter-arrival time Ti
Worst-case execution time Ci
Deadline Di
Utilization Ui = Ci/Ti, density λi = Ci/min(Di,Ti)
A periodic task releases a job exactly every Ti time units; a sporadic task releases jobs at least Ti time units apart. A periodic task set is synchronous when all tasks release their first job at the same time.

41 Problems addressed (w.r.t. a given task model)
Run-time scheduling problem: given a set of tasks with real-time requirements, find a schedule that meets all timing constraints.
Schedulability problem: given a set of tasks and a scheduling algorithm, determine in a reasonable amount of time whether the produced schedule violates any deadline.
Feasibility problem: given a set of tasks, determine whether any schedule exists that meets all deadlines.

42 Assumptions Independent tasks Job-level parallelism prohibited
the same job cannot be simultaneously executed on more than one processor. Preemption and migration support: for global schedulers, a preempted task can resume its execution on a different processor. The cost of preemption/migration is integrated into the task WCET. The independence assumption can later be removed by considering blocking times and shared-resource protocols.

43 Uniprocessor RT Systems
Solid theory (starting from the 70s) Optimal schedulers Tight schedulability tests for different task models Shared resource protocols Bandwidth reservation schemes Hierarchical schedulers RTOS support Power-aware, limited-preemptions schedulers, analysis tools, testing, QoS support, control systems

44 EDF for uniprocessor systems
Optimality: if a collection of jobs can be scheduled by some scheduling algorithm, then it is schedulable with EDF as well
Bounded number of preemptions
Efficient implementations
Exact feasibility conditions: linear test for implicit deadlines (Utot ≤ 1); pseudo-polynomial test for constrained and arbitrary deadlines [Baruah et al. 90]
EDF optimality on a uniprocessor: if a periodic task system can be scheduled upon a given uniprocessor (by any algorithm) then it is also scheduled by EDF on the same platform. EDF priorities are fixed within each job, so the number of preemptions is bounded (by the number of jobs in an interval), and the implementation is simple. There is a very simple necessary and sufficient schedulability condition for implicit-deadline systems. For constrained-deadline systems: for synchronous systems (with U < 1) there are known necessary and sufficient tests with pseudo-polynomial complexity [Baruah et al. 90, demand function]; for asynchronous systems the problem is co-NP-complete in the strong sense, and there are pseudo-polynomial-time sufficient schedulability tests.
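The linear implicit-deadline condition is a one-line sum. A sketch:

```python
def edf_uniprocessor_implicit(tasks):
    """Exact EDF feasibility test on one processor for implicit
    deadlines (Di = Ti): schedulable iff Utot <= 1.
    tasks: iterable of (Ci, Ti) pairs."""
    return sum(c / t for c, t in tasks) <= 1.0
```

This test is both necessary and sufficient only in the implicit-deadline case; constrained and arbitrary deadlines need the pseudo-polynomial demand test cited above.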

45 Uniprocessor feasibility
EDF optimal for arbitrary job collections
Sporadic or synchronous periodic, implicit deadlines: linear test, Utot ≤ 1
Sporadic or synchronous periodic, constrained or arbitrary deadlines: unknown complexity; pseudo-polynomial test if Utot < 1 (simulate EDF until Utot/(1 − Utot) · max(Ti − Di))
Asynchronous periodic, constrained or arbitrary deadlines: strong NP-hard; exponential test (simulate EDF until 2H + Dmax + rmax)
Design questions: with dynamic scheduling, do I partition or allow migration among processors? If migration is allowed, is it unrestricted, or only at job boundaries? Do I always use the same priority for every instance of a task (FP), let it change between jobs (EDF), or even during each job’s execution (Pfair)? Global EDF is a task-level dynamic priority scheduling algorithm with unrestricted migration (priority is fixed within a job, but migration is always allowed), so it belongs to the global scheduling class.
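The pseudo-polynomial test in the table, i.e. the processor-demand criterion of [Baruah et al. 90], checks the demand bound function at every absolute deadline up to the bound Utot/(1 − Utot) · max(Ti − Di). A sketch, assuming integer task parameters and Utot < 1:

```python
def edf_demand_test(tasks):
    """Necessary-and-sufficient EDF test for constrained-deadline
    sporadic tasks on one processor (processor demand criterion).
    tasks: list of (C, T, D) with integer parameters and Utot < 1."""
    utot = sum(c / t for c, t, d in tasks)
    if utot > 1.0:
        return False
    assert utot < 1.0, "sketch assumes Utot < 1"
    # Check dbf at every absolute deadline up to the bound.
    limit = max(max(d for c, t, d in tasks),
                utot / (1.0 - utot) * max(t - d for c, t, d in tasks))
    points = sorted({d + k * t
                     for c, t, d in tasks
                     for k in range(int((limit - d) // t) + 1)})
    for pt in points:
        demand = sum(((pt - d) // t + 1) * c
                     for c, t, d in tasks if pt >= d)
        if demand > pt:            # demand exceeds available time
            return False
    return True
```

The inner sum is exactly dbf(t) = Σ (⌊(t − Di)/Ti⌋ + 1)·Ci over tasks with Di ≤ t.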

46 RT scheduling for uniprocessors
Optimal priority assignments for sporadic and synchronous periodic task systems: RM for implicit deadlines, DM for constrained deadlines
Exact FP schedulability conditions
Response Time Analysis for fixed-priority systems: the response time of task k is given by the fixed point of Rk in the iteration Rk = Ck + Σi&lt;k ⌈Rk/Ti⌉ Ci
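The fixed-point iteration reads as follows in code. A sketch for implicit deadlines, with tasks listed highest priority first:

```python
from math import ceil

def response_time(tasks, k):
    """Worst-case response time of task k under fixed priorities:
    the fixed point of  R = Ck + sum_{i<k} ceil(R / Ti) * Ci.
    tasks: list of (Ci, Ti), index 0 = highest priority.
    Returns None if R grows past the period (deadline miss)."""
    c_k, t_k = tasks[k]
    r = c_k
    while True:
        nxt = c_k + sum(ceil(r / t_i) * c_i for c_i, t_i in tasks[:k])
        if nxt == r:
            return r
        if nxt > t_k:          # implicit deadline Dk = Tk exceeded
            return None
        r = nxt
```

The iteration starts from R = Ck and is monotonically non-decreasing, so it either converges to the smallest fixed point or exceeds the deadline.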

47 Uniprocessor static priorities
Sporadic or synchronous periodic, implicit deadlines: RM optimality
Sporadic or synchronous periodic, constrained deadlines: DM optimality
Sporadic or synchronous periodic, arbitrary deadlines: unknown complexity; Audsley’s bottom-up algorithm (exponential complexity)
Asynchronous Periodic

48 Uniprocessor static priority feasibility
Sporadic or synchronous periodic, implicit deadlines: pseudo-polynomial test (simulate RM until Tmax, or RTA)
Sporadic or synchronous periodic, constrained deadlines: pseudo-polynomial test (simulate DM until Dmax, or RTA)
Sporadic or synchronous periodic, arbitrary deadlines: unknown complexity; Audsley’s bottom-up algorithm (exponential)
Asynchronous periodic: unknown complexity / strong NP-hard

49 Uniprocessor static priority schedulability
Sporadic or synchronous periodic, implicit deadlines: pseudo-polynomial simulation until Tmax, or RTA
Sporadic or synchronous periodic, constrained deadlines: pseudo-polynomial simulation until Dmax, or RTA
Sporadic or synchronous periodic, arbitrary deadlines: unknown complexity; Lehoczky’s test (exponential)
Asynchronous periodic: strong NP-hard; simulation until 2H + rmax, or other exponential tests

50 Multiprocessors are difficult
“The simple fact that a task can use only one processor even when several processors are free at the same time adds a surprising amount of difficulty to the scheduling of multiple processors” [Liu’69]

51 Multiprocessor RT Systems
First theoretical results starting from 2000 Many NP-hard problems Few optimal results Heuristic approaches Simplified task models Only sufficient schedulability tests Limited RTOS support

52 Global vs partitioned scheduling
Single system-wide queue (global scheduling) instead of multiple per-processor queues (partitioned approach). From queueing theory: lower AVERAGE response time for global schedulers than for partitioned approaches; when there is no precise information on the tasks, a global approach seems more efficient. Other advantages: number of preemptions, simple implementation, easy rescheduling, reclaiming. Disadvantages: cache affinity (HW mitigates migration cost).

53 Partitioned Scheduling
The scheduling problem reduces to: a bin-packing problem (task-to-processor assignment: NP-hard in the strong sense, well known, various heuristics used: FF, NF, BF, FFDU, BFDD, etc.) + a uniprocessor scheduling problem (EDF: Utot ≤ 1; RM: RTA; ...). Global (work-conserving) and partitioned approaches are incomparable.

54 Global scheduling on SMP
There is a global queue in which ready tasks are placed, ordered according to a given policy. The first m tasks are scheduled upon the m CPUs: when a CPU is free, the first task is removed from the queue and scheduled. When a new task arrives with priority higher than one of the executing tasks, it preempts the executing task with lowest priority. If a task on a different CPU finishes its execution, the preempted task can “migrate” to the free CPU and continue its execution.

55 Global scheduling on SMP
When a task finishes its execution, the next one in the queue is scheduled on the available CPU

56 Global scheduling on SMP
When a higher priority task arrives, it preempts the task with the lowest priority among the executing ones

57 Global scheduling on SMP
When another task ends its execution, the preempted task can resume its execution: task t4 “migrated” from CPU3 to CPU1

58 Global scheduling The m highest-priority ready jobs are always the ones executing. Work-conserving scheduler: no processor is ever idled when a task is ready to execute.
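The dispatching rule walked through in slides 54–57 boils down to a few lines. A sketch; lower number = higher priority, and all names are mine:

```python
def global_dispatch(ready, m):
    """Global scheduling: of all ready jobs, the m with highest
    priority execute; the rest wait in the single system-wide queue.
    ready: list of (priority, job) pairs, lower value = higher priority."""
    queue = sorted(ready)
    running = [job for _, job in queue[:m]]
    waiting = [job for _, job in queue[m:]]
    return running, waiting

# Five ready jobs, three CPUs: t1..t3 run, t4 and t5 wait.
running, waiting = global_dispatch(
    [(3, "t3"), (1, "t1"), (5, "t5"), (2, "t2"), (4, "t4")], m=3)
```

Preemption and migration fall out of re-running this rule at every release and completion: a job that drops out of the top m is preempted, and may later resume on any free CPU.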

59 Global scheduling: advantages
Load automatically balanced
Easier re-scheduling (dynamic loads, selective shutdown, etc.)
Lower average response time (see queueing theory)
More efficient reclaiming and overload management
Number of preemptions
Migration cost: can be mitigated by proper HW (e.g., MPCore’s Direct Data Intervention)
Few schedulability tests → further research needed
No need for heavy load-balancing routines
More efficient when there is no precise information on the tasks
More robust to overload conditions
Cache affinity and thrashing: Snoop Control Unit with DDI
Notes: load is intrinsically balanced. Andersson (“to partition or not”): with a good dispatcher the number of preemptions is lower with the global approach. Among algorithms of the same priority class, comparing the top schedulers, the success ratio is also higher with the global method; but the classes are incomparable: for each of them there are task sets that can be scheduled only by the other. Moreover, with both the partitioned and non-partitioned methods, deciding whether a (synchronous) task set is feasible on m processors is NP-hard in the strong sense. The implementations of most non-partitioned schedulers have just a global queue that feeds every CPU. When a task joins or leaves the system, rescheduling a task set with the best-ranked partitioned approaches is more complex than with the global ones, so frequent rescheduling is not convenient. When a task executes for less than its worst-case execution time, the unused bandwidth can easily be reclaimed by other tasks without rescheduling, giving better responsiveness for soft real-time processes. Cache affinity measures how many cache misses an architecture suffers: with a good dispatcher, the cache misses typical of global scheduling can be significantly reduced, so it does not suffer a great disadvantage compared to the partitioned solution, and in some situations it can even be better. Consider an unbalanced load that, with partitioning, puts all the tasks on the first CPU: if the tasks have a large footprint compared to the cache size, there will be many cache misses/refills, which global scheduling with a good dispatcher can avoid. Task presences (the number of different CPUs on which each task executes) and migrations are usually few. In any case, when the number of tasks is high compared to the number of CPUs, the number of migrations increases, lowering global scheduling performance due to cache misses. Several architectures already mitigate this: ARM’s MPCore has, among other migration features, a “Direct Data Intervention” mechanism that allows a processor to read data directly out of another processor’s cache, and the parallelism of new FPGAs bypasses the bus-contention bottleneck (ALTERA’s Avalon switch fabric interconnect). To reduce migration costs (cache coherency, context switching with register saving and kernel queues, refilling the cache with the task’s context), solutions are given by FPGA features and by new chips like ARM’s MPCore (cache-to-cache transfers reduce the cache misses needed to access main memory and allow data to stream between processors).

60 Global Scheduling problem
Pfair optimal only for implicit deadlines (Utot ≤ m), with preemption and synchronization issues
No optimal scheduler known for more general task models
Classic schedulers are not optimal: Dhall’s effect
Hybrid schedulers: EDF-US, RM-US, DM-DS, AdaptiveTkC, fpEDF, EDF(k), EDZL, …
No optimal algorithm is known for constrained and arbitrary deadlines. GPS is optimal, as Pfair, for implicit deadlines.

61 EDF can fail at very low utilizations
Dhall’s effect. Example: m processors, n = m+1 tasks, Di = Ti; t1,…,tm = (1, T−1) (m light tasks), tm+1 = (T, T) (1 heavy task). As T grows, Utot → 1, yet there is a DEADLINE MISS: EDF fails to schedule some feasible task sets with utilization arbitrarily close to 1. Simulations with typical task sets show that this situation is not common: the bound is much tighter than average scheduling performance. Scheduling algorithms based on EDF (EDF(k), priD, EDFfp) overcome Dhall’s effect, reaching a higher utilization bound: EDF(k) and EDFfp are incomparable with EDF; priD is a little more complex; EDFfp = EDF-US(1/2) reaches (m+1)/2, which is the class limit (Andersson et al. proved that no static or job-level dynamic scheduling algorithm can have a schedulability bound higher than (m+1)/2).
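Dhall's example is easy to reproduce numerically. A sketch of the arithmetic only, not a full EDF simulator:

```python
def dhall_total_utilization(m, big_t):
    """Utot for m light tasks (C=1, T=big_t-1) plus one heavy task
    (C=T=big_t): m/(big_t - 1) + 1, approaching 1 as big_t grows."""
    return m * (1.0 / (big_t - 1)) + 1.0

def heavy_task_slack(big_t):
    """All jobs released at t=0: the light jobs have earlier deadlines
    (T-1 < T), so under EDF they run first on all m CPUs and the heavy
    task starts at t=1. It then has big_t - 1 time units left for
    big_t units of work: negative slack, i.e. a guaranteed miss."""
    return (big_t - 1) - big_t   # always -1
```

Pushing T up makes the light tasks arbitrarily light, so the total utilization approaches 1 while the miss persists.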

62 Hybrid schedulers
EDF-US, RM-US, DM-DS, fpEDF: give highest static priority to the heaviest tasks and schedule the remaining ones with EDF/RM/DM
EDF(k), RM(k), DM(k): give highest priority to the heaviest k tasks and schedule the remaining ones with EDF/RM/DM
AdaptiveTkC: assign priorities according to the function (T − k·C)
EDZL: schedule tasks with EDF, raising the priority of jobs that reach zero laxity

63 Global vs partitioned There are task sets that are schedulable only with a global scheduler Example: t1=(1,2); t2=(2,3); t3=(2,3) Valid also for global FP assigning p2 > p1 > p3

64 Global vs partitioned There are task sets that are schedulable only with a partitioned scheduler Example: t1=(2,3); t2=(3,4); t3=(5,15); t4=(5,20) Processor 1 Processor 2

65 Global vs partitioned t1=(2,3); t2=(3,4); t3=(5,15); t4=(5,20)
In the interval [0,12) there are 9 jobs → 9! possible job priority assignments
For all of them there is either a deadline miss or an idle slot in [0,12)
Since the total utilization equals m → deadline miss

66 Global vs partitioned (FP)
There are task sets that are schedulable only with a partitioned scheduler Example: t1=(4,6); t2=(7,12); t3=(4,12); t4=(10,24) All 4!=24 global priority assignments lead to deadline miss Processor 1 Processor 2

67 Global vs partitioned (FP)
Example: p1>p2>p3>p4: t1=(4,6); t2=(7,12); t3=(4,12); t4=(10,24)

68 Partitioned scheduling heuristics
First Fit (FF)
Best Fit (BF)
Worst Fit (WF)
Next Fit (NF)
First Fit Decreasing (FFD)
Best Fit Decreasing (BFD)
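First Fit combined with the uniprocessor EDF test (Utot ≤ 1 per processor) fits in a dozen lines. A sketch; the other heuristics differ only in which bin is tried first:

```python
def first_fit_edf(utilizations, m):
    """Assign tasks (given as utilizations) to m processors with
    First Fit, admitting a task on CPU i only if its load stays <= 1
    (the uniprocessor EDF test). Returns the bins, or None on failure."""
    bins = [[] for _ in range(m)]
    loads = [0.0] * m
    for u in utilizations:
        for i in range(m):
            if loads[i] + u <= 1.0 + 1e-12:   # EDF admission test
                bins[i].append(u)
                loads[i] += u
                break
        else:
            return None   # u does not fit on any processor
    return bins
```

FFD and BFD simply sort the utilizations in decreasing order before running the same loop.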

69 Partitioned schedulability
Lopez et al.: EDF-FF gives the best utilization bound among all possible partitioning methods. The bound is Utot ≤ (m + 1)/2. A refined bound, when Umax is the maximum utilization among all tasks, is Utot ≤ (β·m + 1)/(β + 1), where β = ⌊1/Umax⌋ is the maximum number of tasks of utilization Umax that fit into one processor.

70 Partitioned schedulability

71 Tightness of the bound
The bound is tight. Take tasks of utilization Umax: by definition of β, β + 1 tasks of utilization Umax do not fit into one processor, i.e. (β + 1)·Umax > 1, so the set is well defined. With enough such tasks, at least one processor should be allocated β + 1 or more of them, but they do not fit → deadline miss.

72 Partitioned schedulability
BF, BFD and FFD also give the same utilization bound, but with higher computational complexity. Among fixed-priority systems, the best utilization bound is given by FFD and BFD; the bound for fixed priority is somewhat more complicated (see Lopez et al.’04). Note that these are not “optimal” algorithms, but not even an optimal algorithm could achieve a better utilization bound.

73 Global schedulers (implicit)
Pfair algorithms are optimal for (periodic and sporadic) systems with implicit deadlines: Utot ≤ m
Based on GPS, with a lag bounded by one: in any interval [0, t), a task with utilization U will execute for an amount W with U·t − 1 &lt; W &lt; U·t + 1
Different Pfair algorithms (PF, PD, PD2) [see Anderson et al.]
Other optimal algorithms: LLREF, EKG, BF
All these algorithms suffer from a large number of preemptions/migrations

74 Global scheduling (constrained)
No optimal algorithm is known for constrained- or arbitrary-deadline systems
No optimal on-line algorithm is possible for arbitrary collections of jobs [Leung and Whitehead]
Even for sporadic task systems, optimality requires clairvoyance [Fisher et al.’09]

75 Global scheduling: main results
Many sufficient schedulability tests: GFB (RTSJ’01), BAK (RTSS’03 → TPDS’05), BAR (RTSS’07), LOAD (ECRTS’07, ECRTS’08, RTSJ’08 → RTSJ’09), BCL (ECRTS’05 → TPDS’09), RTA (RTSS’07), FF-DBF (ECRTS’09). Baruah’s fpEDF test is tight. Andersson for RM: Ui ≤ m/(3m − 2). Load = max over intervals [t1, t2) of Σ η(t1, t2)/(t2 − t1) (for sporadic tasks: max_t Σ dbf(t)/t). Feasibility results: Fisher, Baruah: load-based pseudo-polynomial test (ECRTS’06, improved in ECRTS’07); Baker, Cirinei: load-based pseudo-polynomial necessary test (RTSS’06); Andersson, Tovar: EKG for implicit deadlines (RTCSA’06).

76 Global scheduling: main results
Utilization-based tests (implicit deadlines): EDF → Goossens et al.: Utot ≤ m(1 − Umax) + Umax; fpEDF → Baruah: Utot ≤ (m+1)/2; RM-US → Bertogna et al.: Utot ≤ (m+1)/3
Polynomial tests: EDF, FP → Baker: O(n²) and O(n³) tests; EDZL → Cirinei, Baker: O(n²) test; EDF, FP, WC → Bertogna et al.: O(n²) test
Pseudo-polynomial tests: EDF, FP → Fisher, Baruah: load-based tests; EDF, FP, WC → Bertogna et al.: RTA; EDF → Baruah et al.: BAR and FF-DBF

77 Global schedulability tests
Few dominance results Most tests are incomparable Different possible metrics for evaluation

78 Possible metrics for evaluation
Percentage of schedulable task sets detected: over a randomly generated load; depends on the task generation method
Processor speedup factor s: all feasible task sets pass the test on a platform in which all processors are s times as fast (how much the processor speed must increase for the test to verify the schedulability of every feasible task set)
Run-time complexity
Sustainability and predictability properties: the test still succeeds if Ci decreases, or Ti or Di increases

79 Processor speedup factor
All feasible task sets pass the schedulability test on a platform in which all processors are s times as fast
Phillips et al.’97: each collection of jobs that is feasible on m processors can be scheduled with EDF when processors are (2 − 1/m) times as fast
A test is better if its speedup bound s is smaller: the closer the bound to 2 − 1/m, the better the test.

80 Sustainability A scheduling algorithm is sustainable iff schedulability of a task set is preserved when (1) decreasing execution requirements, (2) increasing periods or inter-arrival times, (3) increasing relative deadlines. Baker and Baruah [ECRTS’09]: global EDF for sporadic task sets is sustainable w.r.t. points 1 and 2. Sustainable schedulability tests exist for non-sustainable scheduling algorithms, and vice versa.

81 The GFB test
For implicit-deadline systems (Di = Ti), a task set is schedulable with EDF on a platform with m identical processors if Utot ≤ m(1 − Umax) + Umax, where Umax = maxi{Ci/Ti} is the utilization of the heaviest task and Utot is the total utilization. Linear complexity, utilization-based test (tight). It is a sufficient test for periodic and sporadic task sets, using only the utilizations of the tasks. Implicit deadline: D = T (constrained deadline: D ≤ T); λ = C/D is the worst-case request. It is TIGHT: for every utilization higher than the bound, there is a task set that EDF cannot schedule (this does not mean the test is necessary and sufficient; it is not). Only a few parameters are needed: the total utilization and the utilization of the heaviest task.

82 The GFB test Utot ≤ m(1 − Umax) + Umax, with Umax = max{Ci/Ti} and Utot the total utilization
If the task set contains a heavy task, i.e. one with a high utilization factor, the test cannot guarantee that EDF will schedule task sets even with very low total utilization. Remember that Dhall’s effect is not common (simulations show that the average utilization of EDF-schedulable task sets is much higher), but the mere existence of this effect forces the utilization bound to be this low (it is in fact the tight utilization bound). Another test proposed by Baker also uses information on execution times, deadlines and periods of the tasks to obtain a different test.

83 GFB Density-based test, linear complexity.
Sustainable w.r.t. all parameters.

84 Density-based tests EDF: λtot ≤ m(1 − λmax) + λmax
EDF-DS[1/2]: λtot ≤ (m+1)/2 [ECRTS'05]: gives highest priority to the (at most m−1) tasks having λi ≥ 1/2 and schedules the remaining ones with EDF
DM: λtot ≤ m(1 − λmax)/2 + λmax
DM-DS[1/3]: λtot ≤ (m+1)/3 [OPODIS'05]: gives highest priority to the (at most m−1) tasks having λi ≥ 1/3 and schedules the remaining ones with DM (constrained deadlines only)
The EDF and DM tests have performance inversely proportional to the largest density; the hybrid-scheduler tests are the best density-based tests in their class. We would like to derive more efficient tests, paying some additional complexity (polynomial or pseudo-polynomial).
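A sketch of the first two conditions in the list, assuming tasks given as (C, D, T) triples with density λi = Ci/min(Di, Ti); the function names are illustrative:

```python
def edf_density_test(tasks, m):
    """Density-based global EDF test: λtot <= m(1 - λmax) + λmax.
    Sufficient only. tasks: list of (C, D, T) triples."""
    lam = [c / min(d, t) for c, d, t in tasks]
    return sum(lam) <= m * (1 - max(lam)) + max(lam)

def edf_ds_test(tasks, m):
    """EDF-DS[1/2] hybrid test: λtot <= (m + 1) / 2. Assumes the
    scheduler gives top priority to the (at most m-1) densest tasks."""
    lam = [c / min(d, t) for c, d, t in tasks]
    return sum(lam) <= (m + 1) / 2

# A heavy task defeats the plain density test but not the hybrid one:
print(edf_density_test([(9, 10, 10), (3, 10, 10)], m=2))  # False
print(edf_ds_test([(9, 10, 10), (3, 10, 10)], m=2))       # True
```

The contrast in the example is exactly why the hybrid tests are the best in this class: removing the densest tasks from EDF's care neutralizes the largest-density penalty.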

85 Critical instant A particular configuration of releases that leads to the largest possible response time of a task. It is possible to derive exact schedulability tests by analyzing just the critical-instant situation. Uniprocessor FP and EDF: a critical instant occurs when all tasks arrive synchronously and all jobs are released as soon as permitted. Response Time Analysis for uniprocessors: under FP, the worst-case response time of a task is given by the first instance released at a critical instant, obtained as the fixed point of Rk in the response-time iteration; under EDF, it is given by an instance in a busy interval starting with a critical instant.

86 Multiprocessor anomaly
Synchronous periodic arrival of jobs is not a critical instant on multiprocessors: consider τ1 = (1,1,2), τ2 = (1,1,3), τ3 = (5,6,6). In the synchronous periodic situation all deadlines are met, but delaying the second job of τ2 by one unit leads to a deadline miss (example from [Bar07]). We therefore need to identify pessimistic situations to derive sufficient schedulability tests.

87 Problem window (figure: an interfering task τi with parameters Ci, Di, Ti and carry-in contribution εi, and the interfered task τk with parameters Ck, Dk, inside a window of length L ending at τk's first missed deadline)

88 Adopted techniques Consider the interference on the problem job; bound the interference with the workload; use an upper bound on the workload. Existing schedulability tests differ in: the problem-window selection L; the carry-in bound εi in the considered window; the amount of each contribution (BAK, LOAD, BCL, RTA); the number of carry-in contributions (BAR, LOAD); the total amount of all contributions (FF-DBF, GFB).

89 Introducing the interference
Ik = total interference suffered by task τk; Iki = interference of task τi on task τk (figure: a schedule on CPU1–CPU3 over the interval [rk, rk+Rk], showing the sub-intervals in which τk is kept from executing by interfering tasks). For our schedulability analysis we introduce the term Ik, the interference suffered in an interval by a task: the total length of all sub-intervals in which the task is ready but cannot execute due to higher-priority jobs.

90 Limiting the interference
It is sufficient to consider at most the portion (Rk − Ck + 1) of each term Iki in the sum Ik (figure as in the previous slide). It can be proved that the WCRT of τk is given by the fixed point of the resulting iteration.

91 Bounding the interference
Exactly computing the interference is complex. Pessimistic assumptions: bound the interference of a task with its workload, and use an upper bound on the workload.

92 Bounding the workload Consider a situation in which the first job executes as close as possible to its deadline and successive jobs execute as soon as possible (figure: window of length L, task parameters Ci, Di, Ti, last-job contribution εi). The bound is Ni·Ci + εi, where Ni is the number of jobs entirely contained in the interval in the densest possible packing (i.e., the number of releases in the window, excluding the last job) and εi is the contribution of the last job.
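In code, the bound sketched above reads as follows; integer task parameters are assumed, and this is one common form of the bound (as used in the BCL/RTA family; exact constants vary slightly across papers):

```python
def workload_bound(C, D, T, L):
    """Upper bound on the work a task (C, D, T) can execute in any
    window of length L, in the densest packing: the carry-in job
    finishes exactly at its deadline, later jobs arrive as soon as
    possible. Returns N*C + eps with N = floor((L + D - C)/T)."""
    span = L + D - C
    if span < 0:
        return 0
    n = span // T                  # jobs entirely contained in the window
    eps = min(C, span - n * T)     # contribution of the last (partial) job
    return n * C + eps

print(workload_bound(2, 5, 5, 10))  # 6: two full jobs plus a 2-unit carry
```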

93 RTA for generic global schedulers
An upper bound on the WCRT of task k is given by the fixed point of Rk in the iteration; the bound holds for any work-conserving scheduler. The slack of task k is then at least Sk = Dk − Rk.

94 Improvement using slack values
Same densest-packing situation as in the workload slide: the first job executes as close as possible to its deadline, successive jobs as soon as possible (figure with Ci, Di, Ti and the last-job contribution εi; Ni jobs excluded the last one, plus the last job's contribution).

95 Improvement using slack values
The same situation, but the carry-in job is now assumed to complete at least Si time units before its deadline, so the slack Si shrinks the window available to the interfering jobs (figure with Ci, Di, Ti, Si).

96 Improvement using slack values
The same situation expressed through the response-time bound Ri of the interfering task instead of its slack (figure with Ri, Di, Ti and jobs of length Ci across a window of length L).

97 RTA for generic global schedulers
An upper bound on the WCRT of task k is given by the fixed point of Rk in the iteration. If a fixed point Rk ≤ Dk is reached for every task k in the system, the task set is schedulable with any work-conserving global scheduler.
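A sketch of the fixed-point iteration in Python, with integer task parameters and the workload bound from the earlier slides (the function name and task representation are illustrative):

```python
def rta_global(tasks, m, k):
    """Response-time bound for task k under any work-conserving global
    scheduler on m processors: each interfering task contributes at
    most min(workload, R - C_k + 1), and the sum is divided by m.
    tasks: list of integer (C, D, T). Returns R_k <= D_k on success,
    None if the iteration exceeds the deadline (inconclusive)."""
    def workload(i, L):
        # densest-packing workload bound of task i in a window of length L
        C, D, T = tasks[i]
        n = (L + D - C) // T
        return n * C + min(C, L + D - C - n * T)

    Ck, Dk, _ = tasks[k]
    R = Ck
    while R <= Dk:
        interference = sum(min(workload(i, R), R - Ck + 1)
                           for i in range(len(tasks)) if i != k)
        R_next = Ck + interference // m
        if R_next == R:
            return R          # fixed point reached within the deadline
        R = R_next
    return None

print(rta_global([(1, 4, 4), (1, 4, 4), (2, 4, 4)], m=2, k=2))  # 4
```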

98 Iterative schedulability test
1. All response times Ri are initialized to Di.
2. Compute the response-time bound for tasks 1,…,n; if smaller than the old value, update Ri; if Ri > Di, mark the task as temporarily not schedulable.
3. If all tasks have Ri ≤ Di, return success.
4. If no response time has been updated for tasks 1,…,n, return fail.
5. Otherwise, return to point 2.
The theorems can be applied to every task in the system, using each time the most recently computed values for the slack of the interfering tasks; the analysis is then repeated starting from the slack values of the previous iteration. The first task, which at the previous iteration could not assume any slack for the interfering tasks, can now take advantage of the positive slacks computed for the other tasks, leading to a lower worst-case response time. If the target is to verify the schedulability of the system, the procedure can stop successfully as soon as every task has a response-time bound no larger than its deadline. If a task still has Rk > Dk, it is temporarily set aside, waiting for a slack increase of the potentially interfering tasks; if no update takes place during a whole round over all tasks, there is no possibility of further improvement and the test fails. If instead the target is the tightest possible response-time estimate, the procedure can continue until no response time changes. Note that every slack bound is monotonically non-decreasing, since at each step the considered interference can only be lower than or equal to that of the preceding step; this makes it possible to bound the overall complexity of the slack-based analysis.
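The loop above can be sketched as follows: a simplified rendition under the slack-aware workload of the previous slides, not the exact formulation of the paper, with an illustrative task representation:

```python
def iterative_test(tasks, m):
    """Iterative schedulability test (sketch): slack bounds start at
    zero, response-time bounds are recomputed with the most recent
    slacks of the interfering tasks, and the loop stops on success,
    failure, or when a whole round brings no improvement.
    tasks: list of integer (C, D, T)."""
    n = len(tasks)
    slack = [0] * n   # lower bounds on each task's slack (monotone)

    def workload(i, L):
        # densest packing, with the carry-in job assumed to finish
        # slack[i] time units before its deadline
        C, D, T = tasks[i]
        span = L + D - C - slack[i]
        if span < 0:
            return 0
        nj = span // T
        return nj * C + min(C, span - nj * T)

    def response_bound(k):
        Ck, Dk, _ = tasks[k]
        R = Ck
        while R <= Dk:
            I = sum(min(workload(i, R), R - Ck + 1)
                    for i in range(n) if i != k)
            R_next = Ck + I // m
            if R_next == R:
                return R
            R = R_next
        return None   # temporarily not schedulable

    while True:
        updated, all_ok = False, True
        for k in range(n):
            R = response_bound(k)
            if R is None:
                all_ok = False
            elif tasks[k][1] - R > slack[k]:
                slack[k] = tasks[k][1] - R   # improved slack bound
                updated = True
        if all_ok:
            return True    # every task has R_k <= D_k
        if not updated:
            return False   # no improvement in a full round: give up
```

Because slacks never decrease and larger slacks can only shrink the workload bounds, each round can only tighten the response times, which is what bounds the number of rounds.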

99 RTA refinement for Fixed Priority
Under fixed priorities, the interference of lower-priority tasks on task k is always null, so the sum ranges only over higher-priority tasks; an upper bound on the WCRT of task k is again given by the fixed point of Rk in the iteration. We can thus exploit information on the scheduling algorithm in use to tighten the bounds on interference and workload.
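Under FP the generic iteration simplifies, since only higher-priority tasks interfere. A sketch, assuming tasks are listed in decreasing priority order (names and task representation are illustrative):

```python
def rta_fp(tasks, m, k):
    """FP refinement: the interference sum ranges over i < k only,
    i.e. the higher-priority tasks. tasks: list of integer (C, D, T),
    in decreasing priority order. Returns R_k <= D_k or None."""
    def workload(i, L):
        # densest-packing workload bound of task i in a window of length L
        C, D, T = tasks[i]
        n = (L + D - C) // T
        return n * C + min(C, L + D - C - n * T)

    Ck, Dk, _ = tasks[k]
    R = Ck
    while R <= Dk:
        I = sum(min(workload(i, R), R - Ck + 1) for i in range(k))
        R_next = Ck + I // m
        if R_next == R:
            return R
        R = R_next
    return None
```

The highest-priority task suffers no interference at all, so its bound is simply its execution time.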

100 RTA refinement for EDF A different bound can be derived by analyzing the worst-case workload in a situation in which the interfering and interfered tasks have a common deadline and all jobs execute as late as possible (figure with Ri, Di, Ti, Ci and the interfered task's deadline Dk).

101 RTA refinement for EDF An upper bound on the WCRT of task k is given by the fixed point of Rk in the corresponding iteration (same figure as the previous slide).

102 Complexity Pseudo-polynomial complexity; fast average behavior.
Lower complexity for Fixed Priority systems: at most one slack update per task if slacks are updated in decreasing priority order, since improving the slack bound of lower-priority tasks has no influence on the slack of higher-priority tasks. Complexity can also be reduced by limiting the number of rounds. The theoretical complexity O(n²·Dmax) is significantly higher than the average complexity observed in practice (millions of task sets per minute).

103 Polynomial complexity test
A simpler test can be derived by avoiding the iterations on the response times: a lower bound on the slack of τk is computed directly, while the iteration on the slack values stays the same. Performance is comparable to the RTA-based test, with complexity down to O(n²). An EDZL adaptation exists (it requires that fewer than m+1 tasks have negative slack).

104 BAK Polynomial complexity: O(n³); a more pessimistic version runs in O(n²).
Performance is not as good as the other tests. The condition used by BAK is derived by considering a particular scheduling window that allows a bound on the maximum carry-in contribution of each task.

105 BAR Go back until the first instant at which some processor is idled: at most (m−1) tasks can then have carry-in. The particularity of BAR is precisely this bound on the total number of carry-in contributions, limited to (m−1). Case without carry-in.

106 BAR The corresponding condition for the case with carry-in contributions.

107 BAR When Utot < m, the test has pseudo-polynomial complexity (the values Ak to check can be limited).

108 LOAD Computing the load is exponential in the worst case; polynomial and pseudo-polynomial approximations exist. LOAD uses a different bound on the total number of carry-in contributions, ⌈μ⌉ − 1, along with another bound on each individual carry-in contribution.

109 LOAD Sustainable, with a proven processor speedup bound.

110 RTA An upper bound on the WCRT of task k is given by the fixed point of Rk in the iteration, iteratively refining the response-time bounds using already-computed values. Pseudo-polynomial complexity; sustainable w.r.t. task periods. The peculiar advantage of BCL and RTA is the iterative estimation of the maximum carry-in contribution of each task (the total number of carry-in contributions, however, is not bounded). Both RTA and BCL are sustainable with respect to task periods; more work is needed to verify their sustainability with respect to execution times and deadlines.

111 FF-DBF (figure: problem window with jobs of length Ci and parameters Di, Ti) The forced-forward demand bound function considers the task executing at a speed s, with dmax ≤ s ≤ 1. The iterative way in which the problem window is defined in FF-DBF allows a bound on the total amount of carry-in that can be imposed on any task; a similar, although weaker, bound holds in the GFB case as well.

112 FF-DBF (construction continued from the previous slide)

113 FF-DBF Pseudo-polynomial complexity. Best known processor speedup bound: 2 − 1/m (speedup-optimal).

114 Multiprocessor feasibility
Deadline model (columns: implicit / constrained / arbitrary) vs. task model (rows):
Sporadic: linear test Utot ≤ m (implicit); unknown complexity, synchronous periodic not a critical instant (constrained, arbitrary)
Synchronous periodic: Horn's algorithm in (0,H]; unknown complexity for non-implicit deadlines
Asynchronous periodic: strongly NP-hard

115 Multiprocessor run-time scheduling
Deadline model (columns: implicit / constrained / arbitrary) vs. task model (rows):
Sporadic: P-fair, GPS (implicit); requires clairvoyance (constrained, arbitrary)
Synchronous periodic: P-fair, GPS, LLREF, EKG, BF (implicit); unknown complexity, clairvoyance not needed, Horn's algorithm in (0,H] (constrained, arbitrary)
Asynchronous periodic: unknown complexity; clairvoyance not needed

116 Feasibility conditions
(diagram: spectrum of task sets) Not feasible when Utot > m, load > m, or load* > m. Feasible when Σi Ci/min(Di,Ti) ≤ m. In the region in between, only sufficient feasibility and schedulability tests are available.

117 Multiprocessor static job priority feasibility
Deadline model (columns: implicit / constrained / arbitrary) vs. task model (rows):
Sporadic: unknown complexity; synchronous periodic not a critical instant
Synchronous periodic: simulation until the hyperperiod for all N! job-priority assignments
Asynchronous periodic: strongly NP-hard
Design questions behind these classes: with dynamic scheduling, do we partition or allow migration among processors? If migration is allowed, is it unrestricted or only at job boundaries? Do we keep the same priority for every instance of a task (FP), allow it to change between jobs (EDF), or even during each job's execution (P-fair)? Global EDF is a task-level dynamic-priority algorithm with unrestricted migration (the priority is fixed within a job, but migration is always allowed), so it belongs to the global scheduling class.

118 Multiprocessor static job priority schedulability
Deadline model (columns: implicit / constrained / arbitrary) vs. task model (rows):
Sporadic: unknown complexity; synchronous periodic not a critical instant
Synchronous periodic: simulation until the hyperperiod
Asynchronous periodic: strongly NP-hard

119 Multiprocessor static priority run-time scheduling
Deadline model (columns: implicit / constrained / arbitrary) vs. task model (rows):
Periodic (synchronous or asynchronous): unknown complexity; Cucu's optimal priority assignment
Sporadic

120 Multiprocessor static priority feasibility
Deadline model (columns: implicit / constrained / arbitrary) vs. task model (rows):
Sporadic: unknown complexity; synchronous periodic not a critical instant
Synchronous periodic: strongly NP-hard; simulation until the hyperperiod for all n! priority assignments
Asynchronous periodic: simulation on an exponential feasibility interval for all n! priority assignments

121 Multiprocessor static priority schedulability
Deadline model (columns: implicit / constrained / arbitrary) vs. task model (rows):
Sporadic: unknown complexity; synchronous periodic not a critical instant
Synchronous periodic: simulation until the hyperperiod
Asynchronous periodic: strongly NP-hard; simulation on an exponential feasibility interval

122 Conclusions Multiprocessor real-time systems are a promising field to explore. The few existing results are still far from tight conditions. Future work: find tighter schedulability tests; take shared resources into account; integrate into the Resource Reservation framework.

123 Bibliography
Sustainability: Sanjoy Baruah and Alan Burns. Sustainable scheduling analysis. In Proceedings of the IEEE Real-Time Systems Symposium, Rio de Janeiro, December 2006.
Speedup: Cynthia A. Phillips, Cliff Stein, Eric Torng, and Joel Wein. Optimal time-critical scheduling via resource augmentation. In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, El Paso, Texas, 4-6 May 1997.
BAR test: Sanjoy Baruah. Techniques for multiprocessor global schedulability analysis. In Proceedings of the IEEE Real-Time Systems Symposium, Tucson, December 2007.
FF-DBF test: Sanjoy Baruah, Vincenzo Bonifaci, Alberto Marchetti-Spaccamela, and Sebastian Stiller. Implementation of a speedup-optimal global EDF schedulability test. In Proceedings of the EuroMicro Conference on Real-Time Systems, Dublin, Ireland, July 2009.
RTA test: Marko Bertogna and Michele Cirinei. Response-time analysis for globally scheduled symmetric multiprocessor platforms. In 28th IEEE Real-Time Systems Symposium (RTSS), Tucson, Arizona (USA), 2007.
BCL: Marko Bertogna, Michele Cirinei, and Giuseppe Lipari. Schedulability analysis of global scheduling algorithms on multiprocessor platforms. IEEE Transactions on Parallel and Distributed Systems, 20(4):553-566, April 2009.
GFB test: Joel Goossens, Shelby Funk, and Sanjoy Baruah. Priority-driven scheduling of periodic task systems on multiprocessors. Real-Time Systems, 25(2-3):187-205, 2001.

124 The end




