1 Smart Scheduling and Dispatching Policies Lecture 7: Smart Scheduling and Dispatching Policies
2 Single Server Model (M/G/1) Poisson arrival process with rate λ. Load ρ = λE[X] < 1, where X is the job size (service requirement). Job sizes with huge variance are everywhere in CS (Bounded Pareto model): CPU lifetimes of UNIX jobs [Harchol-Balter, Downey 96]; supercomputing job sizes [Schroeder, Harchol-Balter 00]; Web file sizes [Crovella, Bestavros 98; Barford, Crovella 98]; IP flow durations [Shaikh, Rexford, Shin 99]. Properties: huge variability; decreasing failure rate (D.F.R.); top-heavy, i.e., the top 1% of jobs make up half the load.
3 Outline Smart scheduling: performance metrics; policies classification; examples; scheduling policies comparison (fairness, latency). Task assignment problem: supercomputing and web server models; optimal dispatching/scheduling policies. [Cartoon captions: “Ladies first!”; “+ they’ll go out first”]
4 Smart scheduling: Motivation (I) Why does scheduling matter? [Cartoon: users complaining “Why doesn’t it work?!”, “Bla, bla, bla…”, “Grrrrr!”] Delay is due to other users who are currently sharing the service!
5 Smart scheduling: Motivation (II) The goal of smart scheduling is to reduce mean delay “for free”, i.e., by simply serving jobs in the “right order”, with no additional resources. Which is the right order to schedule jobs? The answer strongly depends on: system load; job size distribution.
6 Smart scheduling: Performance metrics (I) Common metrics to compare scheduling policies: E[T], mean response time; E[N], mean number of jobs in system; E[TQ], mean waiting time (= E[T] − E[S], where E[S] is the mean service time); Slowdown, SD = T/S (response time normalized by the running time). Meaning: if a job takes twice as long to run due to system load, it suffers a Slowdown factor of 2, etc. Job response time should be proportional to its running time. Ideally: small jobs → small response times; big jobs → big response times.
7 Smart scheduling: Performance metrics (II) Starvation/fairness metrics: a low average Slowdown doesn’t necessarily mean fairness (large jobs may starve). Good metric: E[SD(x)], the expected slowdown of a job of size x, i.e., mean slowdown as a function of x. Is E[SD] = E[T]/E[S]? No! First we need to derive E[SD(x)] = E[T(x)]/x, where T(x) is the response time of a job of size x. Then we get the mean SD by averaging over the job size density f(x): E[SD] = ∫ E[SD(x)] f(x) dx.
8 Scheduling policies: classification Definitions. Preemptive policy: a job may be stopped and later resumed from the same point where it was stopped. Size-based policy: it uses knowledge of the job size. Classification: Non-Preemptive, Non-Size-Based Policies; Preemptive, Non-Size-Based Policies; Non-Preemptive, Size-Based Policies; Preemptive, Size-Based Policies. Focus on the M/G/1 queue (Poisson arrivals with rate λ, general job size distribution).
9 Non-Preemptive, Non-Size-Based Policies (I) Non-preemptive policies (each job is run to completion) that don’t assume knowledge of job size are: FCFS (First-Come-First-Served) or FIFO: jobs are served in the order they arrive, and each job is run to completion before the next job receives service (e.g., call centers, supercomputing centers). LCFS (Last-Come-First-Served, non-preemptive): when the server frees up, it always chooses the last arrived job and runs that job to completion (jobs are piled onto a stack). RANDOM: when the server frees up, it chooses a random job to run next (mostly of theoretical interest).
10 Non-Preemptive, Non-Size-Based Policies (II) Interesting property: all non-preemptive service orders that do not make use of job sizes have the same distribution on the number of jobs in the system (time until completion is equal in distribution for all these policies). Hence, they have the same E[T] and E[N]. What about E[SD]? For all these policies (in an M/G/1): E[TQ] = λE[S²] / (2(1−ρ)), which is proportional to job size variability and independent of the job’s own size. Thus, E[SD] = E[(TQ + S)/S] = 1 + E[TQ]·E[1/S]. Since E[1/S] and E[TQ] are the same for each policy → they have the same E[SD].
11 Preemptive, Non-Size-Based Policies (I) So far: non-preemptive, non-size-based service. E[T] can be very high when job size variability is high. Intuition: short jobs queue up behind long jobs. Processor-Sharing (PS): when a job arrives, it immediately shares the capacity with all the current jobs (e.g., Round-Robin CPU scheduling). + PS allows short jobs to get out quickly, helps to reduce E[T] and E[SD] (compared to FCFS), and increases system throughput (different jobs run simultaneously). - PS is not better than FCFS on every arrival sequence. + Mean response time for PS is insensitive to job size variability: E[T]M/G/1/PS = E[S] / (1−ρ), where ρ is the system utilization (load).
12 Preemptive, Non-Size-Based Policies (II) Performance of the M/G/1/PS system. Response time of a job of size x: E[T(x)]PS = x / (1−ρ) (think of Little’s law!). Mean Slowdown: E[SD(x)]PS = 1 / (1−ρ). Constant Slowdown (independent of the size x)! In non-preemptive, non-size-based scheduling, E[SD(x)] for small jobs was greater than for large jobs. Under PS, all jobs have the same Slowdown → FAIR scheduling.
13 Preemptive, Non-Size-Based Policies (III) Preemptive-LCFS: a new arrival preempts the job in service; when that arrival completes, the preempted job is resumed. E[T(x)] and E[SD(x)] are as in the PS case. + With respect to PS: fewer preemptions (only 2 per job). Can we lower E[SD(x)]? → We would like lower SD for smaller jobs, but we don’t know the size of the jobs! FB (Foreground-Background) or LAS (Least Attained Service). Idea: to reduce E[SD], use knowledge of the job’s age (an indicator of remaining CPU demand), and serve the job with the lowest age.
14 Foreground-Background scheduling (cont’d) Used to control execution of multiple processes on a single processor: two queues (F and B) and one server. Idea of FB: the job with the lowest CPU age gets the CPU to itself. If several jobs have the same lowest CPU age, they share the CPU using PS. Performance depends on how well a job’s age predicts its remaining size (this depends on the job size distribution)! Jobs enter queue F (PS service). When a job hits a certain age a, it is moved to queue B. Jobs in B get service only when queue F is empty.
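The LAS rule described above (always serve the job with the least attained service) can be sketched in discrete time. This is only an illustrative approximation, not the slides’ own method: the function name `las_completion_times`, the fixed `quantum`, and the assumption that all jobs arrive at time 0 with sizes that are multiples of the quantum are all hypothetical choices made for the example.

```python
def las_completion_times(sizes, quantum=1):
    """Discrete-time approximation of FB/LAS for jobs that all arrive at time 0.

    Assumption (not from the slides): sizes are multiples of `quantum`.
    Returns a dict mapping job index -> completion time.
    """
    attained = [0] * len(sizes)   # CPU age of each job
    finished = {}
    clock = 0
    while len(finished) < len(sizes):
        active = [j for j in range(len(sizes)) if j not in finished]
        # Serve the job with the least attained service (lowest age).
        job = min(active, key=lambda j: attained[j])
        attained[job] += quantum
        clock += quantum
        if attained[job] >= sizes[job]:
            finished[job] = clock
    return finished


completions = las_completion_times([3, 1, 2])
```

Because serving a job raises its age, jobs tied at the lowest age end up alternating quanta, which approximates the PS sharing among equal-age jobs as the quantum shrinks.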
15 Non-Preemptive, Size-Based Policies (I) Size-based policies are a special case of Priority Queueing, which is often used in computer systems, e.g., databases (differentiated levels of service), scheduling of HTTP requests, high/low-priority transactions. Size-based scheduling can improve the performance of a system tremendously! Priority queueing (non-preemptive): consider an M/G/1 priority queue with n classes, where class 1 has the highest priority and class n the lowest. The class k job arrival rate is λk = λ pk. The time in queue for jobs of priority k, E[TQ(k)], accounts for three delays: waiting for the job in service; waiting for the jobs already in the queue of equal or higher priority; waiting for the jobs of higher priority arriving after the class-k job. For high-priority classes, E[TQ(k)]NP-Priority < E[TQ]FCFS, since a class-k job only sees the load due to jobs of class ≤ k.
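The slide’s formula did not survive extraction. As a hedged reconstruction, the standard textbook expression for the non-preemptive M/G/1 priority queue is given below; the symbols ρ_i (load of class i), S_i (class-i job size) and S_e (excess of the overall job size S) are notation assumed here, not taken from the slide.

```latex
E[T_Q(k)]^{\text{NP-Priority}}
  = \frac{\rho\, E[S_e]}
         {\left(1 - \sum_{i=1}^{k} \rho_i\right)\left(1 - \sum_{i=1}^{k-1} \rho_i\right)},
\qquad
\rho_i = \lambda_i E[S_i], \quad
\rho = \sum_{i=1}^{n} \rho_i, \quad
\rho\, E[S_e] = \frac{\lambda E[S^2]}{2}
```

The numerator is the expected wait for the job in service, the factor with the sum up to k accounts for queued jobs of equal or higher priority, and the factor with the sum up to k−1 accounts for higher-priority jobs arriving later.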
16 Non-Preemptive, Size-Based Policies (II) Question: if you want to minimize E[T], who should have higher priority: large or small jobs? Theorem: consider an NP-Priority M/G/1 with two classes of jobs, small (S) and large (L). To minimize E[T], class S jobs should have priority over class L jobs (since E[S_S] < E[S_L]). SJF (non-preemptive Shortest Job First): whenever the server is free, it chooses the job with the smallest size (once a job is running, it is never interrupted). Under heavy-tailed distributions, E[TQ] is smaller than under FCFS (since most jobs are small). But mean delay is still proportional to the variance → large delays for very high variance. Small jobs can still get stuck behind a big one that is already running → need for preemption!
17 Preemptive, Size-Based Policies So far: non-preemptive policies → higher delay under highly variable job size distributions. Preemptive priority queueing. PSJF (Preemptive Shortest Job First): similar to the SJF policy, the job in service is the job with the smallest original size, and a preemption occurs if a smaller job arrives. Mean response time is far lower than under SJF (PSJF is far less sensitive to variability in the job size distribution). It is better than the non-preemptive case because the delay of a class-k job depends only on the variability of the first k priority classes!
18 SRPT SRPT (Shortest Remaining Processing Time): whenever the server is free, the job chosen is the one with the shortest remaining processing time. It is a preemptive policy: a new arrival may preempt the current job in service if it has a shorter remaining processing time. Compared to PSJF: SRPT takes into account the remaining service requirement and not just the original job size, so the overall mean response time is lower. Compared to FB: in SRPT, a job gains priority as it receives more service; in FB, a job has highest priority when it first enters. In an M/G/1 → E[T(x)]SRPT ≤ E[T(x)]FB.
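To make the effect of SRPT concrete, here is a minimal sketch that replays a fixed arrival sequence under FCFS and under SRPT and returns the mean response time of each. The function names and the toy job list are illustrative assumptions; this is a deterministic replay, not an M/G/1 analysis.

```python
import heapq


def mean_response_fcfs(jobs):
    """jobs: list of (arrival_time, size). Serve in arrival order, no preemption."""
    clock, total = 0.0, 0.0
    for arrival, size in sorted(jobs):
        clock = max(clock, arrival) + size
        total += clock - arrival
    return total / len(jobs)


def mean_response_srpt(jobs):
    """Preemptive SRPT: always run the job with the least remaining work."""
    pending = sorted(jobs)            # not yet arrived, ordered by arrival time
    heap = []                         # (remaining_work, arrival_time) of jobs in system
    clock, total, i, n = 0.0, 0.0, 0, len(pending)
    while i < n or heap:
        if not heap:                  # server idle: jump to the next arrival
            clock = max(clock, pending[i][0])
        while i < n and pending[i][0] <= clock:
            heapq.heappush(heap, (pending[i][1], pending[i][0]))
            i += 1
        remaining, arrival = heapq.heappop(heap)
        next_arrival = pending[i][0] if i < n else float("inf")
        run = min(remaining, next_arrival - clock)   # run until done or next arrival
        clock += run
        if run < remaining:
            heapq.heappush(heap, (remaining - run, arrival))  # preempted, re-queued
        else:
            total += clock - arrival
    return total / n


jobs = [(0, 10), (1, 1), (2, 1)]      # one long job followed by two short ones
fcfs_mean = mean_response_fcfs(jobs)
srpt_mean = mean_response_srpt(jobs)
```

On this toy sequence the two short jobs are stuck behind the long one under FCFS, while SRPT lets them out first at the cost of a slightly longer response time for the long job.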
19 Policies comparison: mean response time (I) M/G/1 queue, job size distribution is Bounded Pareto. [Plots: E[T] under FCFS, SJF, LAS, PS and SRPT, versus ρ and versus C2, where C2 is the squared coefficient of variation.] Versus ρ: SJF/FCFS delay is very high even for low ρ; SRPT/LAS delay increases only slightly with ρ; SRPT has the lowest delay. Versus C2: SJF/FCFS delay increases with C2; LAS delay decreases with C2 (DFR needs higher C2); PS and SRPT are invariant to C2. Source: Prof. Mor Harchol-Balter,
20 Policies comparison: mean response time (II) Weibull distribution, ρ = 0.7. [Plot annotations: fast increase with C2; invariant to job size variability; requires higher C2 to perform well.] Source: Prof. Mor Harchol-Balter,
21 Exercise M/G/1 queue. Job size distribution: Bounded Pareto. The load is ρ = 0.9. The (very) biggest job in the job size distribution has size x = 10^10. Question: is E[T(x)] lower under SRPT scheduling or under PS scheduling? Source: Prof. Mor Harchol-Balter,
22 Exercise: Solution Small jobs should favor SRPT. Large jobs have the lowest priority under SRPT, but they get treated equally under PS (equal time-sharing). Thus, it seems much better for “Mr. Max” to go to the PS queue: E[T(x)]PS should be far lower than E[T(x)]SRPT. Source: Prof. Mor Harchol-Balter,
23 Exercise: Solution (cont’d) The largest job prefers SRPT to PS, but almost all jobs ( %) prefer SRPT to PS by more than a factor of 2. 99% of jobs prefer SRPT to PS by more than a factor of 5. But how can this be? Can every job really do better in expectation under SRPT than under PS? All-Can-Win Theorem! (For the Bounded Pareto distribution, it holds for ρ < 0.96.) [Plot annotation: < 5 times] Source: Prof. Mor Harchol-Balter,
24 SRPT: Fairness SRPT is optimal with respect to mean response time. In practice, it is not used for scheduling jobs: job size is not always known, and PS is preferred in web servers, unless serving static requests. What about fairness? A policy is fair if each job has the same expected SD, regardless of its size. SRPT vs PS: is SRPT worse for large jobs? All-Can-Win Theorem: in an M/G/1, if ρ < 0.5, then E[T(x)]SRPT ≤ E[T(x)]PS (for all job size distributions, for all x). Intuition: once a large job starts to get service, it gains priority; under light load even a job of large size x could do worse under PS than under SRPT (because of its higher residence time).
25 Summary on scheduling single server (M/G/1): E[T] Poisson arrival process, load ρ < 1. Smart scheduling greatly improves mean response time (e.g., SRPT). Variability of the job size distribution is key. Let’s order the policies based on E[T], from low E[T] to high E[T]: SRPT < LAS < PS < SJF < FCFS. SRPT: OPT for all arrival sequences. LAS: requires D.F.R. (Decreasing Failure Rate). PS: insensitive to E[S²]. SJF: surprisingly bad (E[S²] term). FCFS: ~E[S²] (shorts caught behind longs). No “Starvation!” Even the biggest jobs prefer SRPT to PS. Source: Prof. Mor Harchol-Balter,
27 Outline I. Review of scheduling in single-server. II. Supercomputing (router + FCFS hosts). III. Web server farm model (router + PS hosts). IV. Towards Optimality … (router + SRPT hosts). Metric: mean response time, E[T].
28 Supercomputing Model [Figure: Poisson process → router (routing/assignment policy) → FCFS hosts] Jobs are not preemptible. Jobs are processed in FCFS order. Assume hosts are identical. Jobs are i.i.d. ~ G: highly variable size distribution. Size may or may not be known; initially assume it is known.
29 Q: Compare Routing Policies for E[T]? Supercomputing model: Poisson process → router (routing policy) → FCFS hosts; jobs i.i.d. ~ G, highly variable. Candidate policies: Round-Robin; Join-Shortest-Queue (go to the host with the fewest jobs); Least-Work-Left (go to the host with the least total work); Central-Queue-Shortest-Job, M/G/k/SJF (a host grabs the shortest job when free); Size-Interval Splitting (jobs are split up by size among hosts).
30 Supercomputing model (II) Policies ordered from high E[T] to low E[T]: 1. Round-Robin: jobs are assigned to hosts (servers) in a cyclical fashion. 2. Join-Shortest-Queue: go to the host with the fewest jobs. 3. Least-Work-Left (equalize the total work): go to the host with the least total work (sum of sizes of jobs there). 4. Central-Queue-Shortest-Job (M/G/k/SJF): a host grabs the shortest job when free. 5. Size-Interval Splitting: jobs are split up by size among hosts; each host is assigned a size interval (e.g., Short/Medium jobs go to the first host, Long jobs go to the second host). Hypothesis: job size is known!
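The immediate-dispatch rules listed above can be written as small selection functions. This is a minimal sketch under stated assumptions: each host’s queue is represented as a list of the sizes of the jobs currently there, ties go to the lowest-numbered host, and the names `round_robin`, `join_shortest_queue`, `least_work_left`, `size_interval` and the `cutoffs` parameter are hypothetical.

```python
import itertools


def round_robin(num_hosts):
    """Return a function that yields host indices cyclically."""
    counter = itertools.cycle(range(num_hosts))
    return lambda: next(counter)


def join_shortest_queue(queues):
    """Pick the host with the fewest jobs (queues[h] = list of job sizes at host h)."""
    return min(range(len(queues)), key=lambda h: len(queues[h]))


def least_work_left(queues):
    """Pick the host with the least total work (sum of job sizes there)."""
    return min(range(len(queues)), key=lambda h: sum(queues[h]))


def size_interval(job_size, cutoffs):
    """Size-Interval Splitting: host i serves sizes up to cutoffs[i];
    the last host serves everything larger than the final cutoff."""
    for host, cutoff in enumerate(cutoffs):
        if job_size <= cutoff:
            return host
    return len(cutoffs)


queues = [[10], [1, 1, 1]]            # host 0: one big job; host 1: three small jobs
jsq_choice = join_shortest_queue(queues)   # fewest jobs
lwl_choice = least_work_left(queues)       # least total work
```

The example queues show why the two greedy rules can disagree: JSQ counts jobs, LWL sums their sizes, and only Size-Interval Splitting keeps small jobs away from hosts that hold big ones.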
31 What if job size is not known? The TAGS algorithm: “Task Assignment by Guessing Size” [Harchol-Balter, JACM 02]. [Figure: outside arrivals enter Host 1; size limit s for Host 1, size limit m for Host 2; then Host 3.] Answer: when a job reaches the size limit for its host, it is killed and restarted from scratch at the next host. Explain where used: Microsoft example, UNIX example; supercomputing centers actually do this if you can’t predict the runtime of your job, although they do it preemptively.
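The kill-and-restart rule can be traced for a single job with a few lines of code. This is an illustrative sketch only: the function name `tags_run`, the `cutoffs` list (one size limit per host except the last), and the convention that a job whose size exactly equals the limit still completes are assumptions, not details from the slide.

```python
def tags_run(job_size, cutoffs):
    """Trace one job through TAGS.

    cutoffs[i] is the size limit of host i; the last host has no limit.
    Returns (host where the job finally completes, total work consumed),
    where total work includes the wasted work of every killed attempt.
    """
    work = 0
    for host, cutoff in enumerate(cutoffs):
        if job_size <= cutoff:
            return host, work + job_size
        work += cutoff            # job is killed after running for `cutoff` units
    return len(cutoffs), work + job_size


small = tags_run(5, [10, 100])     # completes at the first host
large = tags_run(500, [10, 100])   # killed twice, restarted from scratch each time
```

The wasted work grows only with the cutoffs a job has crossed, so small jobs pay nothing and large jobs pay a bounded overhead relative to their own size.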
32 Results of Analysis Bounded Pareto jobs. Policies compared: Random, Least-Work-Left, TAGS. 2 hosts only; system load = ; mean job size = 3000. JSQ is in between LWL and Random. [Plot axis: from high variability to lower variability]
33 Supercomputing model (III) Summary: this model is stuck with FCFS at the servers. It is important to find a routing/dispatching policy that helps smalls not be stuck behind bigs → Size-Interval Splitting. By isolating small jobs, we can achieve the effects of smart single-server policies. Greedy routing policies (JSQ, LWL) are poor: they don’t provide isolation for smalls and are not good under high variability workloads. We don’t need to know the size (TAGS = Task Assignment by Guessing Size).
34 Web server farm model (I) Examples: Cisco Local Director, IBM Network Dispatcher, Microsoft SharePoint, etc. [Figure: Poisson process → router (routing policy) → PS hosts] HTTP requests are immediately dispatched to a server. Requests are fully preemptible. Processor-Sharing (an HTTP request receives “constant” service). Jobs are i.i.d. with distribution G (heavy-tailed job size distribution for Web sites).
35 Web server farm model (II) 1. Random. 2. Join-Shortest-Queue: go to the host with the fewest jobs. 3. Least-Work-Left: go to the host with the least total work. 4. Size-Interval Splitting: jobs are split up by size among hosts. [Plot: E[T] for JSQ, LWL, RAND and SIZE; 8 servers, ρ = 0.9, C2 = 50. Annotations: Shortest-Queue is better (high variance distribution); same for E[T], but not great.] Source: Prof. Mor Harchol-Balter,
36 Optimal dispatching/scheduling scheme (I) What is the optimal dispatching + scheduling pair? Central-queue-SRPT looks very good. Is Central-queue-SRPT always optimal for a server farm? No! It does not minimize E[T] on every arrival sequence! Practical issue: jobs must be immediately dispatched (they cannot be held in a central queue)! Assumptions: jobs are fully preemptible within a queue; job size is known.
37 Optimal dispatching/scheduling scheme (II) Claim: the optimal dispatching/scheduling pair, given immediate dispatch, uses SRPT at the hosts. [Figure: incoming jobs → router (immediately dispatches jobs) → SRPT hosts] Intuition: SRPT is very effective at getting short jobs out → it reduces E[N], and thus the mean response time E[T] (Little’s law) → narrow the search to policies with SRPT at the hosts!
38 Optimal dispatching/scheduling scheme (III) The optimal immediate dispatching policy is not obvious! RANDOM task assignment performs well: each queue looks like an M/G/1/SRPT queue with arrival rate λ/k. Idea: short jobs should be spread out over the SRPT servers → IMD algorithm (Immediate Dispatching). Divide jobs into size classes (e.g., small, medium, large) and assign each job to the server with the fewest jobs of that size class. Each server should have some small, some medium and some large jobs (so that SRPT can be maximally effective). IMD performance is as good as Central-Queue-SRPT. Almost no stochastic analysis is available (analysis exists for worst cases)!
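The IMD dispatching rule can be sketched as follows. This is a minimal illustration, not the published algorithm’s exact specification: the name `imd_dispatch`, the `class_bounds` thresholds that define the size classes, and the per-host counter table `counts` are assumptions made for the example, and departures (which would decrement the counters) are not modeled.

```python
import bisect


def imd_dispatch(job_size, class_bounds, counts):
    """Send the job to the host holding the fewest jobs of its size class.

    class_bounds: sorted upper bounds of all classes but the last,
                  e.g. [10, 100] -> small (<=10), medium (<=100), large.
    counts[h][c]: number of class-c jobs currently assigned to host h.
    Ties go to the lowest-numbered host. Returns the chosen host index.
    """
    size_class = bisect.bisect_left(class_bounds, job_size)
    host = min(range(len(counts)), key=lambda h: counts[h][size_class])
    counts[host][size_class] += 1
    return host


counts = [[0, 0, 0], [0, 0, 0]]       # 2 hosts, 3 size classes
placements = [imd_dispatch(s, [10, 100], counts) for s in (1, 2, 50, 3)]
```

Balancing each class separately, rather than the total queue length, is what keeps every host supplied with a mix of small, medium and large jobs for SRPT to exploit.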
39 Summary Supercomputing (router + FCFS hosts): need Size-Interval Splitting to combat job size variability and enable good performance. Web server farm model (router + PS hosts): job size variability is not an issue; LWL and JSQ perform well. Optimal dispatching/scheduling pair (router + SRPT hosts): both (IMD and Central-Queue-SRPT) have similar worst-case E[T]; almost exclusively worst-case analysis, so it is hard to compare with the above results; need stochastic research. Source: Prof. Mor Harchol-Balter,
40 Exercises Ex. 1 – Slowdown Jobs arrive at a server which services them in FCFS order. The average arrival rate is λ = 1/2 job/sec. The job sizes (service times) are independently and identically distributed according to a random variable S, where S = 1 with probability ¾ and S = 2 otherwise. Suppose E[T] = 29/12. Compute the mean slowdown, E[SD], where the slowdown of job j is defined as Slow(j) = T(j)/S(j), T(j) is the response time of job j and S(j) is the size of job j. Solution: recall the definition of response time for a FCFS queue: T = TQ + S. Here, TQ is the waiting or queueing time. Thus, E[SD] = E[T/S] = E[(TQ + S)/S] = 1 + E[TQ/S].
41 Exercises Since the server is FCFS, a particular job’s waiting time is independent of its service time. This fact allows us to break up the expectation, giving us: E[SD] = 1 + E[TQ]·E[1/S]. The distribution of S is given, so we calculate E[S] and E[1/S] using the definition of expectation: E[S] = 5/4 and E[1/S] = 7/8. With E[TQ] = E[T] − E[S] = 29/12 − 5/4 = 7/6, we get E[SD] = 1 + (29/12 − 5/4)·(7/8) = 97/48. If the service order had been SJF, would the same technique have worked for computing mean slowdown? In the SJF case, S and TQ are not independent, so we can’t split the expectation as we did above. They are not independent because job size affects the queueing order: short jobs get to jump to the front of the queue under SJF, and hence their TQ is shorter.
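The arithmetic of Ex. 1 can be checked with exact fractions. This is a small verification sketch; the variable names are illustrative, and the check of E[T] uses the standard M/G/1 FCFS waiting-time formula E[TQ] = λE[S²]/(2(1−ρ)), which is an addition here rather than part of the exercise statement.

```python
from fractions import Fraction as F

lam = F(1, 2)                          # arrival rate, jobs/sec
dist = {1: F(3, 4), 2: F(1, 4)}        # P(S = 1) = 3/4, P(S = 2) = 1/4

mean_s = sum(s * p for s, p in dist.items())             # E[S]
mean_s2 = sum(s * s * p for s, p in dist.items())        # E[S^2]
mean_inv_s = sum(F(1, s) * p for s, p in dist.items())   # E[1/S]

rho = lam * mean_s                                       # load
mean_tq = lam * mean_s2 / (2 * (1 - rho))                # M/G/1 FCFS waiting time
mean_t = mean_tq + mean_s                                # should match the given 29/12

# FCFS: TQ is independent of S, so E[TQ/S] = E[TQ] * E[1/S].
mean_slowdown = 1 + mean_tq * mean_inv_s
print(mean_t, mean_slowdown)
```

The computed E[T] agrees with the value the exercise supplies, which confirms that the given 29/12 is consistent with λ and the distribution of S.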
42 Exercises Ex. 2 – FCFS-SJF-RR CPU scheduling Compute the average waiting time for processes with the following next CPU burst times (ms) and ready queue order: P1: 20, P2: 12, P3: 8, P4: 16, P5: 4.
43 Exercises Solution FCFS: Gantt chart: P1 0-20, P2 20-32, P3 32-40, P4 40-56, P5 56-60. Average waiting time: 148/5 = 29.6. + Very simple algorithm. - Long waiting time!
44 Exercises Solution SJF: Gantt chart: P5 0-4, P3 4-12, P2 12-24, P4 24-40, P1 40-60. Waiting time: T1 = 40, T2 = 12, T3 = 4, T4 = 24, T5 = 0. Average waiting time: 16. + Shorter average waiting time. - Requires future knowledge.
45 Exercises Solution RR scheduling: give each process a unit of time (time slice, quantum) of execution on the CPU, then move to the next process in the queue. Continue until all processes have completed. Hypothesis: time quantum of 4. Gantt chart: P1 0-4, P2 4-8, P3 8-12, P4 12-16, P5 16-20 (P5 completes), P1 20-24, P2 24-28, P3 28-32 (P3 completes), P4 32-36, P1 36-40, P2 40-44 (P2 completes), P4 44-48, P1 48-52, P4 52-56 (P4 completes), P1 56-60 (P1 completes).
46 Exercises Waiting time under RR (completion time minus burst time): T1: 60-20 = 40, T2: 44-12 = 32, T3: 32-8 = 24, T4: 56-16 = 40, T5: 20-4 = 16. Average waiting time: 30.4. - Average waiting time is high. + Good average response time (important for interactive/time-sharing systems). Using a smaller quantum increases overhead. Same exercise with other scheduling disciplines!
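The three schedules of Ex. 2 can be reproduced with a short script, which also makes it easy to retry the exercise with another quantum or discipline. This is a sketch with illustrative function names; it assumes all processes are ready at time 0 in the order P1..P5 and ignores context-switch overhead.

```python
from collections import deque

BURSTS = {"P1": 20, "P2": 12, "P3": 8, "P4": 16, "P5": 4}   # ms, ready-queue order


def waits_in_order(order, bursts):
    """Waiting time of each process when run to completion in the given order."""
    clock, waits = 0, {}
    for name in order:
        waits[name] = clock
        clock += bursts[name]
    return waits


def fcfs_waits(bursts):
    return waits_in_order(list(bursts), bursts)


def sjf_waits(bursts):
    return waits_in_order(sorted(bursts, key=bursts.get), bursts)


def rr_waits(bursts, quantum):
    """Round-Robin: waiting time = completion time - burst time."""
    remaining = dict(bursts)
    queue = deque(bursts)
    clock, waits = 0, {}
    while queue:
        name = queue.popleft()
        run = min(quantum, remaining[name])
        clock += run
        remaining[name] -= run
        if remaining[name] > 0:
            queue.append(name)          # not finished: back to the tail
        else:
            waits[name] = clock - bursts[name]
    return waits


def average(waits):
    return sum(waits.values()) / len(waits)


for label, waits in [("FCFS", fcfs_waits(BURSTS)),
                     ("SJF", sjf_waits(BURSTS)),
                     ("RR q=4", rr_waits(BURSTS, 4))]:
    print(label, waits, average(waits))
```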
47 Exercises Ex. 3 - LCFS Derive the mean queueing time E[TQ]LCFS by conditioning on whether an arrival finds the system busy or idle. Solution: with probability 1 − ρ, the arrival finds the system idle. In that case E[TQ] = 0. With probability ρ, the arrival finds the system busy and has to wait for the whole busy period started by the job in service. The job in service has remaining size Se. Thus the arrival has to wait for the expected duration of a busy period started by Se, which we denote by E[B(Se)] = E[Se]/(1−ρ).
48 Exercises You can derive this fact by first deriving the mean length of a busy period started by a job of size x, namely E[B(x)] = x/(1−ρ), and then deriving E[B(Se)] by conditioning on the probability that Se equals x. Putting these two pieces together, and using E[Se] = E[S²]/(2E[S]) and ρ = λE[S], we have E[TQ]LCFS = (1−ρ)·0 + ρ·E[Se]/(1−ρ) = λE[S²]/(2(1−ρ)). As expected, this is exactly the mean queueing time under FCFS.
49 Exercises Ex. 4 – Server Farm Suppose you have a distributed server system consisting of two hosts. Each host is a time-sharing host. Host 1 is twice as fast as Host 2. Jobs arrive to the system according to a Poisson process with rate λ = 1/9. The job service requirements come from some general distribution D and have mean 3 seconds if run on Host 1. When a job enters the system, with probability p = 3/4 it is sent to Host 1, and with probability 1 − p = 1/4 it is sent to Host 2. Question: what is the mean response time for jobs?
50 Exercises Solution: [Figure: Poisson (1/9) arrivals split with probability 3/4 to a PS host with mean job size 3 sec. and with probability 1/4 to a PS host with mean job size 6 sec.] The mean response time is simply: E[T] = ¾ (mean response time at server 1) + ¼ (mean response time at server 2). But server 1 (2) is just an M/G/1/PS server, which has the same mean response time as an M/M/1/FCFS server, namely just E[S]/(1−ρ). Server 1 has arrival rate 1/12 and load 1/4, so its mean response time is 3/(1 − 1/4) = 4 sec.; server 2 has arrival rate 1/36 and load 1/6, so its mean response time is 6/(1 − 1/6) = 7.2 sec. Thus, E[T] = ¾·4 + ¼·7.2 = 4.8 sec.
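The Ex. 4 computation can be checked with exact fractions. This is a verification sketch with illustrative names; it relies on the fact that probabilistic splitting of a Poisson stream gives an independent Poisson stream at each host, and on the M/G/1/PS formula E[T] = E[S]/(1−ρ) quoted in the slides.

```python
from fractions import Fraction as F


def ps_mean_response(arrival_rate, mean_size):
    """Mean response time of an M/G/1/PS queue: E[S] / (1 - rho)."""
    rho = arrival_rate * mean_size
    assert rho < 1, "queue must be stable"
    return mean_size / (1 - rho)


lam = F(1, 9)          # total arrival rate
p = F(3, 4)            # probability of routing to Host 1

t1 = ps_mean_response(p * lam, F(3))         # Host 1: mean job size 3 sec.
t2 = ps_mean_response((1 - p) * lam, F(6))   # Host 2: half the speed, so 6 sec.
mean_t = p * t1 + (1 - p) * t2
print(t1, t2, mean_t)
```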