Scalable and Coordinated Scheduling for Cloud-Scale computing


1 Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing
심윤석

2 Index
Background
Goals & Challenges of Apollo
Apollo Framework
Evaluation
Conclusion

3 Background
SCOPE: jobs are compiled into a DAG (directed acyclic graph)
Job, Stage, Task
150 DOG

4 Background

5 Goals & Challenges
Goals: minimize job latency and maximize cluster utilization
Challenges: scale, heterogeneous workload, maximizing resource utilization

6 Goals & Challenges
Scale
Jobs process gigabytes to petabytes of data
100,000 scheduling requests per second at peak
Clusters contain over 20,000 servers
Clusters run up to 170,000 tasks in parallel

7 Goals & Challenges
Heterogeneous Workload
Execution times range from short (seconds) to long (hours)
I/O-bound and CPU-bound tasks
Various resource requirements (e.g. memory, cores)
Data locality matters for long tasks; scheduling latency matters for short tasks

8 Goals & Challenges
Maximize Utilization
Workload fluctuates regularly, especially CPU utilization

9 Apollo Framework

10 Apollo Framework
Distributed and Coordinated Scheduler

11 Apollo Framework
Estimation-Based Scheduling

12 Apollo Framework
Wait-Time Update

13 Apollo Framework
Wait-Time Matrix
Represents server load
Lightweight
Expected wait time
Future resource availability
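A minimal sketch of how a wait-time matrix lookup might work, assuming the matrix is indexed by CPU and memory requirement levels and each cell holds the expected wait in seconds. The class name `WaitTimeMatrix`, the bucket values, and the lookup rule (smallest bucket that covers the request) are illustrative assumptions, not details taken from the slides.

```python
from bisect import bisect_left


class WaitTimeMatrix:
    """Hypothetical per-server table: expected wait for a task of a given size."""

    def __init__(self, cpu_levels, mem_levels, wait_seconds):
        self.cpu_levels = cpu_levels      # ascending CPU core buckets (assumed)
        self.mem_levels = mem_levels      # ascending memory buckets in GB (assumed)
        self.wait_seconds = wait_seconds  # wait_seconds[i][j] for cpu i, mem j

    def expected_wait(self, cpu, mem_gb):
        """Return the wait for the smallest bucket covering the request, or None."""
        i = bisect_left(self.cpu_levels, cpu)
        j = bisect_left(self.mem_levels, mem_gb)
        if i == len(self.cpu_levels) or j == len(self.mem_levels):
            return None  # request is larger than anything this server advertises
        return self.wait_seconds[i][j]


matrix = WaitTimeMatrix(
    cpu_levels=[1, 2, 4],
    mem_levels=[4, 8, 16],
    wait_seconds=[[0, 5, 20], [5, 10, 30], [15, 25, 60]],
)
print(matrix.expected_wait(2, 6))
```

Keeping the table small is what makes it cheap to publish frequently; a scheduler only needs one lookup per candidate server.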

14 Apollo Framework
Estimation-Based Scheduling
Goal: minimize task completion time
Stable matching algorithm
Task completion time equation:
$E = I + W + R$
E: estimated task completion time
I: initialization time
W: wait time
R: runtime
Including server failure cost:
$C = P_{succ} E + K (1 - P_{succ}) E$
C: final estimated completion time
P_succ: success probability
K: server failure penalty
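The two completion-time equations on this slide can be sketched in code, together with a simplified stable matching of tasks to servers. The function names, the one-task-per-server restriction, and the rule that a server keeps whichever proposing task has the lower estimate are assumptions made for illustration; this is a plain deferred-acceptance sketch and may differ from the variant Apollo actually uses.

```python
def estimated_completion(init, wait, run, p_succ, k_fail):
    """C = P_succ * E + K * (1 - P_succ) * E, with E = I + W + R."""
    e = init + wait + run
    return p_succ * e + k_fail * (1.0 - p_succ) * e


def stable_match(cost):
    """Match tasks to servers; cost[t][s] is task t's estimate on server s.

    Simplified deferred acceptance (assumption): each task proposes to servers
    in order of increasing cost, and a server holds at most one task, keeping
    the proposer with the lower cost. Returns a dict of task -> server.
    """
    n_tasks = len(cost)
    n_servers = len(cost[0]) if cost else 0
    prefs = [sorted(range(n_servers), key=lambda s: cost[t][s])
             for t in range(n_tasks)]
    next_choice = [0] * n_tasks
    held = {}                       # server -> task currently held
    free = list(range(n_tasks))
    while free:
        t = free.pop()
        if next_choice[t] >= n_servers:
            continue                # task has tried every server; stays unmatched
        s = prefs[t][next_choice[t]]
        next_choice[t] += 1
        current = held.get(s)
        if current is None:
            held[s] = t
        elif cost[t][s] < cost[current][s]:
            held[s] = t             # server prefers the new proposer
            free.append(current)
        else:
            free.append(t)          # rejected; will propose to next choice
    return {t: s for s, t in held.items()}


print(estimated_completion(init=2, wait=10, run=30, p_succ=0.9, k_fail=3))
print(stable_match([[10, 20], [12, 30]]))
```

In the second call both tasks prefer server 0; the task with the lower estimate keeps it and the other falls back to server 1.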

15 Apollo Framework
Distributed and Coordinated Scheduler
One scheduler per job
Each scheduler makes independent decisions based on global status
Conflicts can occur

16 Apollo Framework
Correcting Conflicts (Correction Mechanism)
Re-evaluates prior scheduling decisions
Duplicate scheduling
Confidence
Scattering completion time
Randomization
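A rough sketch of two of the correction ideas listed on this slide: re-evaluating an earlier decision to decide whether a duplicate is worth scheduling, and adding a small random factor to estimates so that independent schedulers are less likely to pick the same server. The function names, the 20% gain threshold, and the 5% jitter spread are hypothetical values chosen only for illustration.

```python
import random


def should_duplicate(old_estimate, new_best_estimate, min_gain=0.2):
    """True when a fresh estimate beats the original by at least min_gain.

    min_gain is a fraction of the original estimate (assumed threshold).
    """
    if old_estimate <= 0:
        return False
    return (old_estimate - new_best_estimate) / old_estimate >= min_gain


def jittered(estimate, spread=0.05, rng=random):
    """Inflate an estimate by a random factor in [0, spread] (assumed range)."""
    return estimate * (1.0 + rng.uniform(0.0, spread))


print(should_duplicate(100.0, 70.0))   # large improvement
print(should_duplicate(100.0, 95.0))   # improvement too small to bother
print(jittered(100.0, rng=random.Random(0)))
```

A threshold of this kind keeps duplicates rare, which fits the low trigger rate reported in the evaluation.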

17 Apollo Framework
Opportunistic Scheduling
Maximizes utilization
Random scheduling for fairness
Opportunistic Task
Can be preempted
Can be upgraded to a regular task
Only consumes idle resources
Opportunistic tasks can use resources when no regular tasks exist
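The opportunistic-task rules on this slide can be sketched as a small server model. The `Server` class, its slot-based capacity, and the choice to preempt the most recently started opportunistic task are assumptions for illustration only.

```python
class Server:
    """Hypothetical server with a fixed number of task slots."""

    def __init__(self, slots):
        self.slots = slots
        self.regular = []
        self.opportunistic = []

    def idle(self):
        return self.slots - len(self.regular) - len(self.opportunistic)

    def start_opportunistic(self, task):
        """Opportunistic tasks only consume idle slots."""
        if self.idle() > 0:
            self.opportunistic.append(task)
            return True
        return False

    def start_regular(self, task):
        """Start a regular task, preempting an opportunistic one if needed.

        Returns the list of preempted tasks, or None if every slot is
        already held by a regular task.
        """
        preempted = []
        if self.idle() == 0:
            if not self.opportunistic:
                return None
            preempted.append(self.opportunistic.pop())
        self.regular.append(task)
        return preempted

    def upgrade(self, task):
        """Promote a running opportunistic task to a regular task."""
        self.opportunistic.remove(task)
        self.regular.append(task)


server = Server(slots=2)
server.start_opportunistic("o1")
server.start_opportunistic("o2")
print(server.start_regular("r1"))  # preempts one opportunistic task
server.upgrade("o1")
print(server.regular, server.opportunistic)
```

Because regular tasks can always reclaim slots, opportunistic work raises utilization without delaying regular tasks in this model.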

18 Evaluation
Apollo at Scale
Scheduling Quality
Evaluating Completion Time Estimates
Correction Effectiveness
Stable Matching Efficiency

19 Evaluation
Apollo at Scale
Runs 170,000 tasks in parallel
Tracks 14,000,000 pending tasks
Well utilized on weekdays (90% median CPU utilization)

20 Evaluation
Scheduling Quality
80% of recurring jobs run faster
Significantly improved wait time
Performance close to an oracle scheduler (no scheduling latency, conflicts, failures …)

21 Evaluation
Evaluating Completion Time Estimates

22 Evaluation
Correction Effectiveness
82% success rate
< 0.5% trigger rate
Stable Matching Efficiency

23 Conclusion
Minimize job latency: loosely coordinated distributed scheduler, high-quality scheduling
Maximize cluster utilization: opportunistic scheduling

24 reference sessions/presentation/boutin files/osdi14_slides_boutin.pdf

