Fair Share Scheduling Ethan Bolker Mathematics & Computer Science UMass Boston Queen’s University March 23, 2001.


1 Fair Share Scheduling Ethan Bolker Mathematics & Computer Science UMass Boston eb@cs.umb.edu www.cs.umb.edu/~eb Queen’s University March 23, 2001

2 References
www.bmc.com/patrol/fairshare
www.cs.umb.edu/~eb/goalmode
Acknowledgements: Yiping Ding, Jeff Buzen, Dan Keefe, Oliver Chen, Chris Thornley, Aaron Ball, Tom Larard, Anatoliy Rikun, Liying Song

3 Coming Attractions
Queueing theory primer
Fair share semantics
Priority scheduling; conservation laws
Predicting response times from shares
– analytic formula
– experimental validation
– applet simulation
Implementation geometry

4 Transaction Workload
A stream of jobs visiting a server (ATM, time-shared CPU, printer, …)
Jobs queue when the server is busy
Input:
– arrival rate: λ jobs/sec
– service demand: s sec/job
Performance metrics:
– server utilization: u = λs (must be < 1)
– response time: r sec/job (average; the quantity we want to predict)
– degradation: d = r/s

5 Response Time Computations
r and d measure queueing delay
r ≥ s (d ≥ 1), unless parallel processing is possible
Randomness really matters:
– r = s (d = 1) if arrivals are scheduled (best case, no waiting)
– r >> s for bulk arrivals (worst case, maximum delays)
Theorem. If arrivals are Poisson and service is exponentially distributed (M/M/1), then d = 1/(1−u) and r = s/(1−u)
Think: a virtual server with speed 1−u
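The theorem's formulas are easy to exercise numerically; a minimal sketch (the function name is mine, not the talk's):

```python
# A quick check of the M/M/1 formulas above: u = lambda * s,
# d = 1/(1-u), r = s/(1-u).
def mm1_metrics(arrival_rate, service_demand):
    u = arrival_rate * service_demand
    assert u < 1, "queue is unstable unless u < 1"
    d = 1 / (1 - u)
    r = service_demand / (1 - u)
    return u, d, r

# Slide 6's example: u = 95% gives average degradation 20
u, d, r = mm1_metrics(arrival_rate=0.95, service_demand=1.0)
assert abs(d - 20.0) < 1e-9 and abs(r - 20.0) < 1e-9
```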

6 M/M/1
Essential nonlinearity, often counterintuitive:
– at u = 95% the average degradation is 1/(1 − 0.95) = 20,
– but 1 customer in 20 has no wait at all (5% idle time)
A useful guide even when its hypotheses fail:
– accurate enough (±30%) for real computer systems
– d depends only on u: many small jobs have the same impact as a few large jobs
– a faster system means smaller s, hence smaller u; since r = s/(1−u), a double win: less service, less wait
– when waiting is costly and the server cheap (telephones): want u ≈ 0
– when the server is costly (doctors): want u ≈ 1, but scheduled

7 Scheduling for Performance
Customers want good response times
Decreasing u is expensive
High-end Unix offerings from HP, IBM, and Sun include fair share scheduling packages that let an administrator allocate scarce resources (CPU, processes, bandwidth) among workloads
How do these packages behave?
Model them as black boxes, independent of internals
Limit the study to CPU shares on a uniprocessor

8 Multiple Job Streams
Multiple workloads with utilizations u1, u2, …; U = Σ ui < 1
If there is no workload prioritization, then all degradations are equal: di = 1/(1−U)
Share allocations are de facto prioritizations
Study the degradation vector V = (d1, d2, …)

9 Share Semantics
Suppose workload w has CPU share fw
Normalize shares so that Σw fw = 1
w gets fraction fw of the CPU time slices when at least one of its jobs is ready for service
Can it use more if competing workloads are idle?
– No: think share = cap
– Yes: think share = guarantee

10 Shares As Caps
Good for accounting (sell a fraction of a web server)
Available now from IBM and HP, soon from Sun
Straightforward (boring): workloads are isolated
Each runs on a virtual processor with speed scaled by its share f
Compared with a dedicated system: utilization u becomes u/f, response time r becomes r(1−u)/(f−u); need f > u!
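The cap figures follow from running M/M/1 on a speed-f virtual processor: service time becomes s/f and utilization u/f, so the capped response time is (s/f)/(1 − u/f) = s/(f−u) = r(1−u)/(f−u). A quick numeric check (function names are mine):

```python
# Shares-as-caps: a workload capped at fraction f runs on a virtual
# processor of speed f, so service time is s/f and utilization u/f.
def response_dedicated(s, u):
    return s / (1 - u)                 # plain M/M/1

def response_capped(s, u, f):
    assert f > u, "need f > u, or the capped queue saturates"
    return (s / f) / (1 - u / f)       # M/M/1 at speed f; equals s/(f-u)

s, u, f = 1.0, 0.3, 0.5
r = response_dedicated(s, u)
# the slide's identity: capped response time = r(1-u)/(f-u)
assert abs(response_capped(s, u, f) - r * (1 - u) / (f - u)) < 1e-12
```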

11 Shares As Guarantees
Good for performance and economy (use otherwise idle resources)
Shares make a difference only when there are multiple workloads
A large share resembles a high priority; a share may be less than a utilization
Workload interaction is subtle, often unintuitive, and hard to explain

12 Modeling Performance
[diagram: the OS, running complex scheduling software, is measured frequently for per-workload response times; the measurements update a Model that answers queries and reports using fast analytic algorithms]

13 Modeling
Real system:
– complex, dynamic, frequent state changes
– hard to tease out cause and effect
Model:
– static snapshot, dealing in averages and probabilities
– fast, enlightening answers to "what if" questions
Abstraction helps you understand the real system
Start with a study of priority scheduling

14 Priority Scheduling
Priority state: order the workloads by priority (ties OK)
– two workloads, 3 states: 12, 21, [12]
– three workloads, 13 states: 123 (6 = 3! ordered states like this), [12]3 (3 of these), 1[23] (3 of these), [123] (1 state with no priorities)
– n workloads: f(n) states, n! of them ordered (simplex lock combinations)
p(s) = prob(state = s) = fraction of time spent in state s
V(s) = degradation vector when the state is s (measure this, or compute it using queueing theory)
V = Σs p(s) V(s) (the time average is a convex combination)
The achievable region is the convex hull of the vectors V(s)
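The state count f(n) above (3 for two workloads, 13 for three) is the number of ordered set partitions, the Fubini numbers; a sketch using the recurrence "choose the top tie-class, then order the rest":

```python
from math import comb
from functools import lru_cache

# f(n) = number of priority states for n workloads (ordered set
# partitions): choose which k workloads tie for top priority, then
# recursively order the remaining n - k.
@lru_cache(maxsize=None)
def states(n):
    if n == 0:
        return 1
    return sum(comb(n, k) * states(n - k) for k in range(1, n + 1))

assert states(2) == 3 and states(3) == 13   # matches the slide
```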

15 Two Workloads
[figure: the (d1, d2) plane showing V(12) (wkl 1 high priority), V(21), and V([12]) (no priorities) on the line d1 = d2, with the achievable region between them]

16 Two Workloads
[figure: as above, comparing the average 0.5 V(12) + 0.5 V(21) with V([12])]

17 Two Workloads
[figure: as above; note that u1 < u2, so workload 2's effect on workload 1 is large]

18 Conservation
No Free Lunch Theorem. The weighted average degradation is constant, independent of the priority scheduling scheme: Σi (ui/U) di = 1/(1−U)
Provable from some hypotheses
Observable in some real systems
Sometimes false: shortest job first minimizes average response time (printer queues, supermarket express checkout lines)

19 Conservation
For any proper subset A of the workloads, imagine giving those workloads top priority. Then we can pretend the other workloads don't exist, and Σi∈A (ui/U(A)) di = 1/(1−U(A))
When the workloads in A have lower priorities they have higher degradations, so in general Σi∈A (ui/U(A)) di ≥ 1/(1−U(A))
These 2^n − 2 linear inequalities determine the convex achievable region R
R is a permutahedron, with only n! vertices

20 Two Workloads
[figure: conservation law: (d1, d2) lies on the line (u1/U) d1 + (u2/U) d2 = 1/(1−U); d1 = workload 1 degradation, d2 = workload 2 degradation]

21 Two Workloads
[figure: the constraint d1 ≥ 1/(1−u1) resulting from workload 1]

22 Two Workloads
[figure: workload 1 runs at high priority, giving the vertex V(1,2) = (1/(1−u1), 1/((1−u1)(1−U))), which meets the workload-1 constraint d1 ≥ 1/(1−u1) with equality]

23 Two Workloads
[figure: adds V(2,1) and the constraint d2 ≥ 1/(1−u2); both vertices lie on the conservation line (u1/U) d1 + (u2/U) d2 = 1/(1−U)]

24 Two Workloads
[figure: the achievable region R is the segment of the conservation line (u1/U) d1 + (u2/U) d2 = 1/(1−U) between V(1,2) = (1/(1−u1), 1/((1−u1)(1−U))) and V(2,1), cut out by d1 ≥ 1/(1−u1) and d2 ≥ 1/(1−u2)]
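The two-workload picture can be verified numerically from the vertex formula on the slides; a sketch (V(2,1) is obtained by swapping roles):

```python
# Vertex of the achievable region when workload `u_hi` has high priority:
# d_hi = 1/(1-u_hi), d_lo = 1/((1-u_hi)(1-U)), per the slides' V(1,2).
def vertex(u_hi, u_lo):
    U = u_hi + u_lo
    return 1 / (1 - u_hi), 1 / ((1 - u_hi) * (1 - U))

u1, u2 = 0.3, 0.5
U = u1 + u2
d1, d2 = vertex(u1, u2)            # workload 1 at high priority
# conservation: (u1/U) d1 + (u2/U) d2 = 1/(1-U) at both vertices
assert abs((u1 / U) * d1 + (u2 / U) * d2 - 1 / (1 - U)) < 1e-9
d2s, d1s = vertex(u2, u1)          # workload 2 at high priority
assert abs((u1 / U) * d1s + (u2 / U) * d2s - 1 / (1 - U)) < 1e-9
```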

25 Three Workloads
The degradation vector (d1, d2, d3) lies on the plane u1 d1 + u2 d2 + u3 d3 = C
We know a constraint for each workload w: uw dw ≥ Cw
Conservation applies to each pair of workloads as well: u1 d1 + u2 d2 ≥ C12
The achievable region has one vertex for each priority ordering of the workloads, 3! = 6 in all
Hence its name: the permutahedron

26 Three Workload Permutahedron
[figure: a hexagon in (d1, d2, d3)-space on the plane u1 d1 + u2 d2 + u3 d3 = C, with vertices including V(1,2,3) and V(2,1,3)]
3! = 6 vertices (priority orders)
2^3 − 2 = 6 edges (conservation constraints)

27 Experimental evidence

28 Four Workload Permutahedron
4! = 24 vertices (ordered states)
2^4 − 2 = 14 facets (proper subsets: conservation constraints)
74 faces (states)
Simplicial geometry and transportation polytopes, Trans. Amer. Math. Soc. 217 (1976), 138.

29 Map Shares to Degradations (Two Workloads)
Suppose f1, f2 > 0 and f1 + f2 = 1
Model: the system operates in state
– 12 with probability f1
– 21 with probability f2
(independently of who is on the queue)
Average degradation vector: V = f1 V(12) + f2 V(21)

30 Predict Degradations From Shares (Two Workloads) [slide footer: Fair Share Scheduling, Dec 13, 2000]
A reasonable modeling assumption: f1 = 1, f2 = 0 means workload 1 runs at high priority
For arbitrary shares, the workload priority order is
– (1,2) with probability f1
– (2,1) with probability f2
(probability = fraction of time)
Compute the average workload degradation:
d1 = f1 × (wkl 1 degradation at high priority) + f2 × (wkl 1 degradation at low priority)
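The recipe above can be written out; a sketch using the priority degradation formulas from the earlier two-workload slides (function names are mine):

```python
# Average each workload's high- and low-priority degradations,
# weighted by the share-derived probabilities f1 and f2 = 1 - f1.
def predict_degradations(u1, u2, f1):
    U, f2 = u1 + u2, 1 - f1
    d_high = lambda u: 1 / (1 - u)                    # at high priority
    d_low = lambda u_hi: 1 / ((1 - u_hi) * (1 - U))   # behind workload u_hi
    d1 = f1 * d_high(u1) + f2 * d_low(u2)
    d2 = f2 * d_high(u2) + f1 * d_low(u1)
    return d1, d2

# f1 = 1 reduces to workload 1 at pure high priority, as the slide says
d1, d2 = predict_degradations(0.3, 0.5, f1=1.0)
assert abs(d1 - 1 / (1 - 0.3)) < 1e-12
```

Since V is a convex combination of the two vertices, the predicted pair stays on the conservation line for every choice of f1.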

31 Model validation

32 Model validation

33 Map Shares to Degradations (Three, or n, Workloads)
prob(123) = f1 f2 f3 / [(f1 + f2 + f3)(f2 + f3)(f3)]
Theorem: these n! probabilities sum to 1
– an interesting identity generalizing the addition of fractions
– prove it by induction, or by coupon collecting
V = Σ over ordered states s of prob(s) V(s)
O(n!) work (indeed Θ(n!)), good enough for n ≤ 9
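The product formula above can be checked by brute force; a sketch (it also reproduces the 0.08 computed on slide 38 for shares 32/48/20):

```python
from itertools import permutations

# prob(order): at each step the next-highest-priority workload is
# chosen with probability proportional to its share among those left,
# giving the product formula on the slide.
def prob(order, f):
    p, remaining = 1.0, sum(f[w] for w in order)
    for w in order:
        p *= f[w] / remaining
        remaining -= f[w]
    return p

f = [0.32, 0.48, 0.20]                 # shares for workloads 1, 2, 3
total = sum(prob(s, f) for s in permutations(range(3)))
assert abs(total - 1.0) < 1e-12        # the n! probabilities sum to 1
assert abs(prob((2, 0, 1), f) - 0.08) < 1e-12   # order (3,1,2), slide 38
```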

34 Model validation

35 Model validation

36 The Fair Share Applet
The screen captures on the next slides are from www.bmc.com/patrol/fairshare
Experiment with "what if" fair share modeling
Watch a simulation
The random virtual job generator for the simulation is the same one used to generate random real jobs for our benchmark studies

37 Three Transaction Workloads
[screen capture: three workloads, labeled 1, 2, 3]
Each workload's utilization is 0.32 jobs/second × 1.0 seconds/job = 0.32 = 32%
The CPU is 96% busy, so the average (conserved) response time is 1.0/(1 − 0.96) = 25 seconds
Individual workload average response times depend on the shares

38 Three Transaction Workloads
[screen capture: share values 32.0, 48.0, 20.0, with a displayed sum of 80.0]
Normalized f3 = 0.20 means that 20% of the time workload 3 (development) would be dispatched at highest priority
During that time, the workload priority order is (3,1,2) for 32/80 of the time and (3,2,1) for 48/80
Probability(priority order is 312) = 0.20 × (32/80) = 0.08

39 Three Transaction Workloads
Response times predicted by the formulas on the previous slide
The average predicted response time, weighted by throughput, is 25 seconds (as expected)
Hard to understand intuitively; software helps

40 Three Transaction Workloads
[screen capture; note the change from 32%]

41 Simulation
[screen capture: jobs currently on the run queue]

42 When the Model Fails
A real CPU uses round-robin scheduling to deliver time slices
Short jobs never wait for long jobs to complete
That resembles shortest job first, so the response time conservation law fails
At high utilization, simulation shows smaller response times than the model predicts
So the response time conservation law yields conservative predictions

43 Scaling Degradation Predictions
V = Σ over ordered states s of prob(s) V(s)
Each s is a permutation of (1, 2, …, n); think of it as a vector in n-space
Those n! vectors lie on a sphere
For large n they are pretty densely packed
Think of prob(s) as a discrete approximation to a probability distribution on the sphere; then V is an integral

44 Monte Carlo
loop sampleSize times:
– choose a permutation s at random from the distribution determined by the shares
– compute the degradation vector V(s)
– accumulate V += V(s)
then divide: V /= sampleSize
sampleSize = 40000 works well, independent of n!
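A runnable version of the loop above, under two assumptions: permutations are drawn by the share-proportional rule of slide 33, and V(s) is computed from the cumulative priority formula implied by the two-workload slides (the function names and that helper are mine):

```python
import random

def sample_order(f, rng):
    # draw a permutation: repeatedly pick the next workload with
    # probability proportional to its share among those remaining
    left, order = list(range(len(f))), []
    while left:
        w = rng.choices(left, weights=[f[i] for i in left])[0]
        order.append(w)
        left.remove(w)
    return order

def degradation_vector(order, u):
    # d_w = 1/((1 - H)(1 - H - u_w)), where H is the utilization of
    # higher-priority workloads (matches V(1,2) on the earlier slides)
    d, H = [0.0] * len(u), 0.0
    for w in order:
        d[w] = 1 / ((1 - H) * (1 - H - u[w]))
        H += u[w]
    return d

def monte_carlo(f, u, sample_size=40000, seed=1):
    rng = random.Random(seed)
    V = [0.0] * len(u)
    for _ in range(sample_size):
        for i, di in enumerate(degradation_vector(sample_order(f, rng), u)):
            V[i] += di / sample_size
    return V
```

With this degradation formula each sampled V(s) satisfies conservation exactly, so the Monte Carlo average does too, whatever the sample size.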

45 Map Shares to Degradations (Geometry)
Interpret the shares as barycentric coordinates in the (n−1)-simplex
Study the geometry of the map from the simplex to the (n−1)-dimensional permutahedron
Easy when n = 2: each is a line segment and the map is linear

46 Mapping a Triangle to a Hexagon
[figure: the share triangle, with edges f1 = 1, f1 = 0, f3 = 0, f3 = 1, mapped by M to the hexagon with vertices 132, 123, 213, 312, 321, 231; the f1 = 1 edge maps to the "wkl 1 high priority" side and f1 = 0 to the "wkl 1 low priority" side]

47 Mapping a Triangle to a Hexagon
[figure: the triangle edges f1 = 1 and f1 = 0, the latter labeled {23}]

48 Mapping a Triangle to a Hexagon



