Scalable and Coordinated Scheduling for Cloud-Scale Computing

Apollo: Scalable and Coordinated Scheduling for Cloud-Scale Computing 심윤석

INDEX
  Background
  Goals & Challenges of Apollo
  Apollo Framework
  Evaluation
  Conclusion

Background SCOPE DAG (directed acyclic graph) Job Stage Task Compile
150 DOG

Background

Goals & Challenges
Minimize Job Latency & Maximize Cluster Utilization
  Scaling
  Heterogeneous workload
  Maximize resource utilization

Goals & Challenges
Scale
  Jobs process GB to PB of data
  100,000 scheduling requests/sec (at peak time)
  Clusters contain over 20,000 servers
  Clusters run up to 170,000 tasks in parallel

Goals & Challenges
Heterogeneous workload
  Short (seconds) & long (hours) execution times
  I/O-bound and CPU-bound tasks
  Various resource requirements (e.g. memory, cores)
  Data locality (long tasks) & scheduling latency (short tasks)

Goals & Challenges
Maximize Utilization
  Workload fluctuates regularly
  Especially CPU utilization

Apollo Framework

Apollo Framework Distributed and Coordinated Scheduler

Apollo Framework Estimation Based Scheduling

Apollo Framework Wait-Time Update

Apollo Framework
Wait-Time Matrix
  Represents server load
  Lightweight
  Expected wait time
  Future resource availability
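One way to picture the wait-time matrix is as a small per-server lookup table. The sketch below is only an illustration under assumed details: the class name `WaitTimeMatrix`, the CPU and memory bucket layout, and all numbers are hypothetical and do not come from the slides.

```python
from bisect import bisect_left


class WaitTimeMatrix:
    """Hypothetical per-server table: expected wait (seconds) by resource request."""

    def __init__(self, cpu_buckets, mem_buckets, wait_seconds):
        self.cpu_buckets = cpu_buckets    # ascending core counts, e.g. [1, 2, 4]
        self.mem_buckets = mem_buckets    # ascending memory sizes in GB
        self.wait_seconds = wait_seconds  # wait_seconds[i][j] for cpu_buckets[i], mem_buckets[j]

    def expected_wait(self, cores, mem_gb):
        # Round each request up to the smallest bucket that can hold it.
        i = bisect_left(self.cpu_buckets, cores)
        j = bisect_left(self.mem_buckets, mem_gb)
        if i == len(self.cpu_buckets) or j == len(self.mem_buckets):
            raise ValueError("request exceeds the largest bucket")
        return self.wait_seconds[i][j]


# Made-up numbers: larger requests wait longer because fewer slots can fit them.
matrix = WaitTimeMatrix(
    cpu_buckets=[1, 2, 4],
    mem_buckets=[4, 8, 16],
    wait_seconds=[[0, 0, 5], [0, 10, 20], [15, 30, 60]],
)
print(matrix.expected_wait(cores=2, mem_gb=6))  # looks up the (2 cores, 8 GB) cell
```

A table like this stays lightweight because a scheduler only needs one lookup per candidate server to obtain the wait time used in its completion-time estimate.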

Apollo Framework
Estimation-Based Scheduling
  Goal: minimize task completion time
  Stable matching algorithm
Task completion time equation
  $E = I + W + R$
  E: estimated task completion time
  I: initialization time
  W: wait time
  R: runtime
Including server failure cost
  $C = P_{succ} E + K (1 - P_{succ}) E$
  C: final estimated completion time
  P_succ: success probability
  K: server failure penalty
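The two equations on this slide can be evaluated directly. The following is a minimal sketch: the function names, the server names, and all numeric values are hypothetical, and picking the single cheapest server per task is a simplification that stands in for the stable matching step named on the slide.

```python
def estimated_completion_time(init_time, wait_time, runtime):
    """E = I + W + R."""
    return init_time + wait_time + runtime


def final_cost(e, p_succ, k):
    """C = P_succ * E + K * (1 - P_succ) * E."""
    if not 0.0 <= p_succ <= 1.0:
        raise ValueError("p_succ must be a probability")
    return p_succ * e + k * (1.0 - p_succ) * e


def pick_server(candidates, k):
    """Return the candidate with the lowest C (greedy; not stable matching)."""
    def cost(name):
        init_time, wait_time, runtime, p_succ = candidates[name]
        e = estimated_completion_time(init_time, wait_time, runtime)
        return final_cost(e, p_succ, k)
    return min(candidates, key=cost)


# Hypothetical candidates: (I, W, R, P_succ), times in seconds.
candidates = {
    "server_a": (2, 10, 30, 0.99),  # busier but reliable: E = 42, C is about 42.84
    "server_b": (2, 0, 30, 0.80),   # idle but flaky:      E = 32, C is about 44.80
}
print(pick_server(candidates, k=3))
```

With a failure penalty of K = 3, the flaky server loses even though its raw estimate E is ten seconds lower, which is the effect the failure term is meant to capture.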

Apollo Framework
Distributed and Coordinated Scheduler
  One scheduler per job
  Each scheduler makes independent decisions based on global status
  Conflicts can occur

Apollo Framework Correcting Conflicts (Correction Mechanism)
Re-evaluates prior scheduling decisions Duplicate Scheduling Confidence Scattering completion time Randomization
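Of the correction techniques listed, randomization is the easiest to sketch: each scheduler perturbs its completion-time estimates slightly so that independent schedulers are less likely to all pick the same server. The function names, the uniform jitter, and the numbers below are assumptions for illustration only.

```python
import random


def jittered(estimates, max_jitter, rng):
    """Add a small random offset to each server's estimated completion time."""
    return {server: e + rng.uniform(0.0, max_jitter) for server, e in estimates.items()}


def choose(estimates, max_jitter, rng):
    """Pick the server with the lowest perturbed estimate."""
    noisy = jittered(estimates, max_jitter, rng)
    return min(noisy, key=noisy.get)


# Hypothetical estimates (seconds) seen by every scheduler at the same moment.
estimates = {"server_a": 40.0, "server_b": 40.5, "server_c": 90.0}

# Each seed stands in for one independent per-job scheduler.
picks = {choose(estimates, 2.0, random.Random(seed)) for seed in range(50)}
print(sorted(picks))
```

Without jitter every scheduler would choose `server_a`; with jitter the near-tied servers can share the load, while a clearly worse server such as `server_c` is still never chosen.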

Apollo Framework
Opportunistic Scheduling
  Maximizes utilization
  Random scheduling -> fairness
Opportunistic Task
  Can be preempted
  Can be upgraded to a regular task
  Only consumes idle resources
  Can use resources only if no regular task exists
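The opportunistic-task rules above can be sketched as a toy server model. Everything here is an assumption made for illustration: the `Server` class, the slot-based capacity, and the choice to raise an error instead of queueing when the server is full of regular tasks.

```python
class Server:
    """Toy server with a fixed number of slots (hypothetical model)."""

    def __init__(self, slots):
        self.slots = slots
        self.regular = []
        self.opportunistic = []

    def idle_slots(self):
        return self.slots - len(self.regular) - len(self.opportunistic)

    def start_opportunistic(self, task):
        """Run only on idle capacity; never displaces anything."""
        if self.idle_slots() <= 0:
            return False
        self.opportunistic.append(task)
        return True

    def start_regular(self, task):
        """Regular tasks may preempt an opportunistic one; returns preempted tasks."""
        preempted = []
        if self.idle_slots() <= 0:
            if not self.opportunistic:
                raise RuntimeError("server is full of regular tasks")
            preempted.append(self.opportunistic.pop())
        self.regular.append(task)
        return preempted

    def upgrade(self, task):
        """Promote a running opportunistic task so it can no longer be preempted."""
        self.opportunistic.remove(task)
        self.regular.append(task)


server = Server(slots=2)
server.start_opportunistic("o1")
server.start_opportunistic("o2")
print(server.start_regular("r1"))  # the most recently started opportunistic task is preempted
```

In this model a regular task always wins a contested slot, so opportunistic work raises utilization without delaying regular tasks.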

Evaluation Apollo at Scale Scheduling Quality
Evaluating Estimates Completion Time Correction Effectiveness Stable matching Efficiency

Evaluation
Apollo at Scale
  Runs 170,000 tasks in parallel
  Tracks 14,000,000 pending tasks
  Well utilized on weekdays (90% median CPU utilization)

Evaluation
Scheduling Quality
  80% of recurring jobs got faster
  Significantly improved wait time
  Performance similar to an oracle scheduler (no scheduling latency, conflicts, failures …)

Evaluation Evaluating Estimates Completion Time

Evaluation
Correction Effectiveness
  82% success rate
  < 0.5% trigger rate
Stable Matching Efficiency

Conclusion Minimize Job Latency Maximize Cluster Utilization
Loosely Coordinated Distributed Scheduler High Quality Scheduling Maximize Cluster Utilization Opportunistic Scheduling

reference sessions/presentation/boutin files/osdi14_slides_boutin.pdf
