SLAQ: Quality-Driven Scheduling for Distributed Machine Learning


1 SLAQ: Quality-Driven Scheduling for Distributed Machine Learning
Logan Stafman*, Haoyu Zhang*, Andrew Or, Michael J. Freedman

2 Machine Learning in Data Science
Machine learning in clusters has become ubiquitous. Growth in training data and model size → high resource contention.

3 Machine Learning Overview
Repeatedly update the model; the model is available after each iteration. Loss is an indicator of how well a model is trained. [Diagram: Choose Initial Model → Update Model → Model/Loss]
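To make the loop on this slide concrete, here is a minimal sketch (assuming plain NumPy gradient descent on squared loss, not the paper's distributed Spark jobs) of iterative training that exposes the model and its loss after every iteration:

# Minimal sketch (not SLAQ's code): iterative training where the model is
# usable after every pass and the loss tracks how well it is trained.
import numpy as np

def train(X, y, iterations=100, lr=0.1):
    w = np.zeros(X.shape[1])               # choose initial model
    for i in range(iterations):
        grad = X.T @ (X @ w - y) / len(y)  # gradient of squared loss
        w -= lr * grad                     # update model
        loss = np.mean((X @ w - y) ** 2)   # loss after this iteration
        yield w, loss                      # model and loss available every iteration

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + rng.normal(scale=0.1, size=1000)
    for step, (w, loss) in enumerate(train(X, y)):
        if step % 20 == 0:
            print(f"iteration {step:3d}  loss {loss:.4f}")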

4 Cluster Schedulers Use Fairness
Use “fair-share” algorithms to evenly allocate resources. ML jobs have three key features that make fair-share an inappropriate choice: they are approximate, they show diminishing returns, and training is an exploratory process.

5 Machine Learning is Approximate
[Diagram: the ML job sends tasks to workers; each worker holds a model replica and data shards and sends model updates back to the job's model.]

6 ML Jobs Have Diminishing Returns
80% of the work is done in 20% of the time. Common across many ML algorithms. Only applicable to convex optimization problems.

7 ML Training is Exploratory
Data scientists run and rerun experiments while varying hyperparameters, dataset features, and model structures, and might launch several concurrently. [Diagram: Modify Features / Modify Hyperparameters / Modify Model Structure → Run ML Training Algorithm]

8 Our Solution: SLAQ SLAQ uses knowledge of ML applications’ losses
Allocate resources based on loss reduction, not resource fairness. This helps applications quickly produce approximate models with high predictive power.

9 SLAQ Design Challenges
Find a universal way to measure how much work an application does. Accurately predict an application's loss and runtime. Do this online, as many of these jobs are “one-off” jobs.

10 SLAQ Approach ML jobs send progress reports to scheduler
Predicts future loss online. [Diagram: each ML job sends loss updates to the scheduler; the scheduler makes predictions and updates the resource allocation, sending tasks to workers that hold model replicas and data shards and return model updates.]
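A hedged sketch of the reporting loop this slide implies: each job pushes its latest loss and CPU time to the scheduler after every iteration, and the scheduler keeps a per-job history it can later fit to predict future loss. The ProgressReport and Scheduler names are illustrative, not SLAQ's actual classes.

# Sketch of per-iteration progress reporting (illustrative names, not SLAQ's API).
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class ProgressReport:
    job_id: str
    iteration: int
    loss: float
    cpu_seconds: float

class Scheduler:
    def __init__(self):
        # job_id -> [(cumulative cpu time, loss), ...]
        self.loss_history = defaultdict(list)

    def update(self, report: ProgressReport):
        self.loss_history[report.job_id].append((report.cpu_seconds, report.loss))
        # ...fit a curve to the history and recompute the allocation here...

sched = Scheduler()
sched.update(ProgressReport("job-A", iteration=1, loss=0.9, cpu_seconds=4.2))
sched.update(ProgressReport("job-A", iteration=2, loss=0.6, cpu_seconds=8.5))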

11 Unifying Different ML Metrics
Criteria for a unified quality metric: applicable to all algorithms, comparable magnitudes, known range, predictable. Candidate metrics: Accuracy / PRC / F1 Score / Confusion Matrix.

12 Unifying Different ML Metrics
Criteria: applicable to all algorithms, comparable magnitudes, known range, predictable. Candidate metrics: Accuracy / PRC / F1 Score / Confusion Matrix, Loss, Normalized Loss, ∆Loss, Normalized ∆Loss.
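A small sketch of the normalized ∆Loss idea, assuming each iteration's loss reduction is divided by the largest reduction seen so far for that job, so jobs with very different absolute loss scales become comparable (SLAQ's exact bookkeeping may differ in detail):

# Sketch of a normalized delta-loss metric (assumed normalization scheme).
def normalized_delta_loss(losses):
    """losses: per-iteration loss values for one job, oldest first."""
    normalized = []
    largest_drop = 0.0
    for prev, cur in zip(losses, losses[1:]):
        drop = prev - cur
        largest_drop = max(largest_drop, abs(drop))
        normalized.append(drop / largest_drop if largest_drop > 0 else 0.0)
    return normalized

print(normalized_delta_loss([10.0, 4.0, 2.5, 1.8, 1.5]))
# -> [1.0, 0.25, 0.116..., 0.05]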

13 ML algorithms have similar normalized ∆Loss

14 Iteration CPU-time is Predictable
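One way to read this slide in code: if per-iteration CPU time is roughly stable, an exponentially weighted moving average over recent iterations predicts the next one. This estimator is an assumption for illustration, not necessarily the one SLAQ uses.

# Sketch of a runtime predictor: exponentially weighted moving average of
# recent per-iteration CPU times (assumed estimator, for illustration only).
def predict_next_cpu_time(cpu_times, alpha=0.5):
    estimate = cpu_times[0]
    for t in cpu_times[1:]:
        estimate = alpha * t + (1 - alpha) * estimate
    return estimate

print(predict_next_cpu_time([4.1, 4.3, 4.0, 4.2]))  # ~4.15 seconds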

15 Fitting Loss Curves to Predict Progress
Convex optimization processes converge at O(1/n) or O(1/n²) for sublinear rates, and at O(μⁿ) with μ ≤ 1 for superlinear convergence. Use a weighted loss history to fit a curve.
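A sketch of the curve-fitting step under these assumptions, using SciPy to fit a 1/n-shaped curve to a loss history with recent iterations weighted more heavily; the model family and weighting scheme here are illustrative choices, not SLAQ's exact ones.

# Sketch: fit a sublinear 1/n-style curve to the loss history, weighting
# recent iterations more, then extrapolate to a future iteration.
import numpy as np
from scipy.optimize import curve_fit

def sublinear(n, a, b, c):
    return a / (n + b) + c          # O(1/n)-shaped convergence

iters = np.arange(1, 21, dtype=float)
losses = 5.0 / (iters + 2.0) + 0.3 + np.random.default_rng(1).normal(0, 0.01, 20)

# smaller sigma = larger weight, so recent points count more in the fit
weights = np.linspace(2.0, 0.5, len(iters))
params, _ = curve_fit(sublinear, iters, losses, p0=(1.0, 1.0, 0.0),
                      sigma=weights, bounds=(0, np.inf))

print("predicted loss at iteration 40:", sublinear(40.0, *params))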

16 How To Assign Resources
Assigning is an optimization problem. Different optimization goals are available, as with fair resource scheduling: maximizing total quality vs. maximizing minimum quality.
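As one concrete reading of the “maximize total quality” goal, here is a greedy sketch that hands out cores one at a time, each to the job whose predicted loss reduction from one more core is currently largest; predicted_reduction is a hypothetical stand-in for the curve-fit predictions above, not SLAQ's actual interface.

# Greedy allocation sketch for the maximize-total-quality objective.
import heapq

def allocate(jobs, total_cores, predicted_reduction):
    """jobs: list of job ids; predicted_reduction(job, cores) -> expected extra
    loss reduction from giving that job its (cores+1)-th core."""
    alloc = {j: 0 for j in jobs}
    # max-heap (negated) keyed on the predicted marginal benefit of the next core
    heap = [(-predicted_reduction(j, 0), j) for j in jobs]
    heapq.heapify(heap)
    for _ in range(total_cores):
        benefit, j = heapq.heappop(heap)
        alloc[j] += 1
        heapq.heappush(heap, (-predicted_reduction(j, alloc[j]), j))
    return alloc

# Toy example: job A still improves quickly, job B is nearly converged.
gains = {"A": [1.0, 0.6, 0.3, 0.2, 0.1], "B": [0.4, 0.1, 0.05, 0.02, 0.01]}
print(allocate(["A", "B"], 4, lambda j, c: gains[j][min(c, 4)]))
# -> {'A': 3, 'B': 1}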

17 Experimental Setup 20 Amazon c3.8xlarge instances; total of 600+ cores
Tested with many ML algorithms, including classification, regression, and unsupervised learning: GBT, SVM, Logistic Regression, Linear Regression, K-Means, LDA, MLPC (most algorithms in Spark MLlib). Varying model sizes, 10 KB–10 MB. 200 GB+ training data. Synthetic trace of ML jobs arriving on average every 15 seconds.

18 SLAQ jobs have lower average loss
Average loss is lower → average model quality is higher.

19 Resources belong to newer jobs
For models that converged to 97% or less of their total loss, SLAQ is faster; more than twice as fast for models converged to 80%.

20 Conclusion SLAQ is a scheduler for ML algorithms that allocates resources based on quality improvement (loss reduction), not resource fairness. Average quality improvement up to 73%, delay reduction up to 44%.

