
1 Efficient Response Time Predictions by Exploiting Application and Resource State Similarities
Hui Li, David Groep, Lex Wolters
Grid 2005, Seattle, WA, November 14, 2005

2 Outline
- Problem Statement
- Similarity Definition
- The IBL-based Prediction Algorithm
- Parameter Optimization via GA
- Experimental Results
- Conclusions and Future Work

3 Problem Statement
- Context: large-scale Grids such as LCG
- Target: computing resources such as clusters and parallel supercomputers
- Source: historical workload traces
- Goal: develop a practically useful technique for job response time predictions
- Purpose: provide dynamic information for metascheduling decision support

4 The LCG Case (http://lcg.web.cern.ch/LCG/)

5 The LCG Challenges
- A scalable production environment (~211 sites, 16,854 CPUs, 5 PB of storage)
- Many options remain after matchmaking and authorization filtering
- How does the resource broker make a good selection of candidate sites?
- What makes a good metric? Sites may not want to publish their policies.

6 The NIKHEF Site

7 Job Response Times on Resources
Job response time is a dynamic performance metric, defined as the time elapsed from a job's submission to its completion:
Response Time = Application Run Time + Queue Wait Time

8 Related Work
Predictions based on historical observations:
- Similarity templates [Smith et al., 1998]: run time
- Instance-based learning [Kapadia et al., 1999]: run time
- Scheduler simulation [Smith, 1999; Li et al., 2004]: wait time
"Learning it from data":
- Can scheduling rules and policies be discovered by mining historical data?
- How can they be used for wait time predictions?

9 Progress: Problem Statement | Similarity Definition | The IBL-based Prediction Algorithm | Parameter Optimization via GA | Experimental Results | Conclusions and Future Work

10 Job Similarity
Job attributes recorded in traces that characterize a job:
- Group, user, queue, executable name, #CPUs, requested run time, arrival time of day (plus, where available, executable arguments and node specification)
These attributes are natural predictors for run times; here they are also used for queue wait times.

11 Resource State Similarity
Definition: the pool of running and queued jobs on the resource at the time a prediction is made.
Assumption: "similar" jobs under "similar" resource states will most likely have similar wait times.
Key problems:
- How to define attributes that represent a resource state?
- How to incorporate local policies into the attributes for a more fine-grained similarity comparison?

12 Resource State Attributes
- VecRunJobs: categorized number of running jobs
- VecQueueJobs: categorized number of queued jobs
- VecAlreadyRun: categorized sum of elapsed run time multiplied by #CPUs, over running jobs
- VecRunRemain: categorized sum of remaining run time multiplied by #CPUs, over running jobs
- AlreadyQueue: categorized sum of time already spent queued multiplied by #CPUs, over queued jobs
- QueueDemand: categorized sum of requested run time multiplied by #CPUs, over queued jobs
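A minimal sketch of how two of these attributes might be computed from a snapshot of the resource's job pool, categorized by a policy attribute (here the group/VO). The Job record and its field names are illustrative assumptions, not from the paper:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical job snapshot; field names are assumptions for illustration.
record Job(String group, int cpus, double elapsedRunTime,
           double remainingRunTime, double queuedTime,
           double requestedRunTime) {}

class ResourceState {
    // Sum of (elapsed run time * #CPUs) per policy category (group/VO),
    // i.e. a sketch of the VecAlreadyRun attribute.
    static Map<String, Double> vecAlreadyRun(List<Job> running) {
        Map<String, Double> v = new HashMap<>();
        for (Job j : running)
            v.merge(j.group(), j.elapsedRunTime() * j.cpus(), Double::sum);
        return v;
    }

    // Sum of (requested run time * #CPUs) per category over queued jobs,
    // i.e. a sketch of the QueueDemand attribute.
    static Map<String, Double> queueDemand(List<Job> queued) {
        Map<String, Double> v = new HashMap<>();
        for (Job j : queued)
            v.merge(j.group(), j.requestedRunTime() * j.cpus(), Double::sum);
        return v;
    }
}
```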

13 Policy Attributes
- Credential attributes commonly used in scheduling policy expressions: group (VO), user, and queue
- Examples: Maui (NIKHEF), Catalina (SDSC)
- The policy attributes are embedded into the resource state attributes via categorization

14 Resource State Example
Policy attributes: group (VO); resource attributes: VecRunJobs and VecQueueJobs. Two example states, with counts per VO:
State 1: RunJobs {Atlas: 30, Lhcb: 60}, QueueJobs {Atlas: 45, Lhcb: 50}
State 2: RunJobs {cms: 30, Alice: 60}, QueueJobs {Atlas: 45, Lhcb: 50}

15 Progress: Problem Statement | Similarity Definition | The IBL-based Prediction Algorithm | Parameter Optimization via GA | Experimental Results | Conclusions and Future Work

16 Instance Based Learning
- A nonparametric learning technique
- Store training data in a historical database; make predictions by applying an induction model to data entries "near" the query
- Two key components: the distance function and the induction model

17 The Distance Function
An extended Heterogeneous Euclidean-Overlap Metric (HEOM).
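The formula itself appears only as an image in this transcript. As a reference sketch, the standard (unextended) HEOM that the paper builds on is usually written as below, with per-attribute weights w_a added here to reflect the GA-tuned weight genes on the chromosome slides:

```latex
\[
  \mathrm{HEOM}(x, y) = \sqrt{\sum_{a=1}^{m} w_a \, d_a(x_a, y_a)^2},
  \qquad
  d_a(x_a, y_a) =
  \begin{cases}
    1 & \text{if } x_a \text{ or } y_a \text{ is missing,} \\
    \mathrm{overlap}(x_a, y_a) & \text{if attribute } a \text{ is nominal,} \\
    \dfrac{|x_a - y_a|}{\max_a - \min_a} & \text{if attribute } a \text{ is numeric,}
  \end{cases}
\]
```

where overlap(x_a, y_a) is 0 if the two nominal values are equal and 1 otherwise.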

18 The Distance Function (cont'd)
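A minimal Java sketch of an HEOM-style distance over a few mixed job attributes: 0/1 overlap for nominal attributes, range-normalized absolute difference for numeric ones, each squared and weighted. The attribute set, weights, and value ranges are illustrative assumptions, not the paper's exact extension:

```java
// HEOM-style distance sketch; in the paper the weights are tuned by the GA.
class HeomDistance {
    // Overlap metric for nominal attributes: 0 if equal, 1 otherwise.
    static double nominal(String a, String b) {
        return a.equals(b) ? 0.0 : 1.0;
    }

    // Range-normalized difference for numeric attributes.
    static double numeric(double a, double b, double min, double max) {
        return max > min ? Math.abs(a - b) / (max - min) : 0.0;
    }

    // Distance over a hypothetical job: (group, user, #CPUs, requested run time).
    static double distance(String g1, String u1, int cpu1, double rt1,
                           String g2, String u2, int cpu2, double rt2,
                           double[] w) {
        double s = 0.0;
        s += w[0] * Math.pow(nominal(g1, g2), 2);
        s += w[1] * Math.pow(nominal(u1, u2), 2);
        s += w[2] * Math.pow(numeric(cpu1, cpu2, 1, 1152), 2);    // CPU range assumed
        s += w[3] * Math.pow(numeric(rt1, rt2, 0, 36 * 3600), 2); // run-time range assumed
        return Math.sqrt(s);
    }
}
```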

19 The Induction Models
- Weighted Average (WA)
- Linear Locally Weighted Regression (LLWR)
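A sketch of the simpler induction model, Weighted Average: predict as a kernel-weighted mean of the k nearest neighbors' observed times. The Gaussian kernel is an assumption; the bandwidth h corresponds to the GA-tuned bandwidth genes on the chromosome slide:

```java
// Weighted Average (WA) sketch: kernel-weighted mean over the
// k nearest historical jobs, weighted by distance to the query.
class WeightedAverage {
    static double predict(double[] neighborDistances,
                          double[] neighborTimes, double h) {
        double num = 0.0, den = 0.0;
        for (int i = 0; i < neighborDistances.length; i++) {
            // Gaussian kernel: closer neighbors get larger weights.
            double w = Math.exp(-Math.pow(neighborDistances[i] / h, 2));
            num += w * neighborTimes[i];
            den += w;
        }
        return den > 0 ? num / den : Double.NaN;
    }
}
```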

20 Progress: Problem Statement | Similarity Definition | The IBL-based Prediction Algorithm | Parameter Optimization via GA | Experimental Results | Conclusions and Future Work

21 Parameter Optimization by GA
- A Genetic Algorithm implementation using standard operators such as selection, mutation, and crossover
- Real encoding vs. binary encoding
- Chromosomes are structured to match the different objectives (i.e. run time or wait time)
- Objective function: average prediction error

22 Chromosomes
Run time:
- (WAg, WAu, WAe, WAn, WAr, WAtod), (#CPUs), (method), (neighbor size), (history size), (bandwidth type), (bandwidth)
Wait time:
- (WPg, WPu, WPq), (WAg, WAu, WAe, WAn, WAr, WAtod), (WSrj, WSqj, WSalrr, WSalrq, WSrrem, WSqdem), (#CPUs, queue demand credential, queue demand total), (method), (neighbor size), (history size), (bandwidth type), (bandwidth)
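A minimal sketch of real-encoded GA operators over such a parameter vector. Gene bounds, mutation rate, and the error callback are illustrative assumptions; the objective, per the previous slide, is average prediction error (to be minimized):

```java
import java.util.Random;
import java.util.function.ToDoubleFunction;

// Real-encoded GA sketch: one chromosome = one parameter vector
// (attribute weights, neighbor size, history size, bandwidth, ...).
class SimpleGa {
    static final Random RNG = new Random(42);

    // Uniform crossover: each gene copied from either parent.
    static double[] crossover(double[] a, double[] b) {
        double[] child = new double[a.length];
        for (int i = 0; i < a.length; i++)
            child[i] = RNG.nextBoolean() ? a[i] : b[i];
        return child;
    }

    // Gaussian mutation clamped to [lo, hi] bounds (bounds are assumptions).
    static void mutate(double[] genes, double[] lo, double[] hi, double rate) {
        for (int i = 0; i < genes.length; i++)
            if (RNG.nextDouble() < rate) {
                genes[i] += RNG.nextGaussian() * 0.1 * (hi[i] - lo[i]);
                genes[i] = Math.max(lo[i], Math.min(hi[i], genes[i]));
            }
    }

    // Binary tournament selection on prediction error (lower is better).
    static double[] select(double[][] pop, ToDoubleFunction<double[]> error) {
        double[] a = pop[RNG.nextInt(pop.length)];
        double[] b = pop[RNG.nextInt(pop.length)];
        return error.applyAsDouble(a) <= error.applyAsDouble(b) ? a : b;
    }
}
```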

23 Progress: Problem Statement | Similarity Definition | The IBL-based Prediction Algorithm | Parameter Optimization via GA | Experimental Results | Conclusions and Future Work

24 Experimental Setup
Real traces with diverse characteristics:
- NIKHEF cluster: ~300 CPUs, up to 3 GB memory per node, Ethernet connections. Maui scheduler with backfilling; policies based on groups (VOs) and users.
- SDSC Blue Horizon: IBM SP, 1,152 CPUs. Catalina scheduler with backfilling; policies based on queues.
Evaluation is done on multiple Intel Xeon machines with 4 CPUs and 3 GB of shared memory.

25 Methodology
Prediction accuracy:
- Average Absolute Error (AAE)
- Average Relative Error = AAE / average real value
- Relative Error = (Est - Real) / (Est + Real)
Prediction time:
- Average execution time per prediction, in milliseconds
Workload traces are divided into training sets and test sets:
- On NIKHEF, we test one month of trace data at a time over consecutive months, with parameters trained on the preceding two months of data.
- On SDSC, we test every three months, with training on the preceding six months.
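A small sketch of these accuracy metrics in code, following the definitions above:

```java
// Error metrics from the methodology slide: AAE, average relative error
// (AAE / mean real value), and the symmetric per-job relative error.
class ErrorMetrics {
    static double averageAbsoluteError(double[] est, double[] real) {
        double sum = 0.0;
        for (int i = 0; i < est.length; i++)
            sum += Math.abs(est[i] - real[i]);
        return sum / est.length;
    }

    static double averageRelativeError(double[] est, double[] real) {
        double meanReal = 0.0;
        for (double r : real) meanReal += r;
        meanReal /= real.length;
        return averageAbsoluteError(est, real) / meanReal;
    }

    // Symmetric relative error, bounded in (-1, 1) for positive times.
    static double relativeError(double est, double real) {
        return (est - real) / (est + real);
    }
}
```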

26 Absolute Prediction Error

          Run Time              Wait Time             Response Time
Name      Abs. Err   Rel. Err   Abs. Err   Rel. Err   Abs. Err   Rel. Err
NIKHEF    324.6 min  0.58       299.3 min  0.73       560.5 min  0.57
SDSC01    35.9 min   0.49       376.7 min  0.89       391.4 min  0.79
SDSC02    50.1 min   0.51       690.2 min  0.70       705.7 min  0.65

27 Relative Prediction Error (Run Time)

28 Relative Prediction Error (Wait Time)

29 Error Analysis

Name     Wait Time t (sec)    Job %   Abs. Err    Rel. Err
NIKHEF   0 < t < 1000         48.5    6.5 min     2.8
         1000 < t < 10000     12.7    61.7 min    0.85
         t > 10000            38.8    704 min     0.68
SDSC01   0 < t < 1000         53.6    10.9 min    6.9
         1000 < t < 10000     20.0    93.6 min    1.3
         t > 10000            26.4    1195 min    0.77
SDSC02   0 < t < 1000         50.4    8.1 min     4.2
         1000 < t < 10000     20.0    68.3 min    0.97
         t > 10000            29.6    2167 min    0.66

30 Error Analysis

31 Optimized Parameters

Name    Period       Policy     Method           History       BW
NIK.    Jun-Jul'04   [g,u,q]    104-WA | 1-WA    3309 | 5409   GBS(k=0.5) | NBS
NIK.    Jul-Aug'04   [u,q]      125-WA | 1-WA    7822 | 3681   GBS(k=0.6) | k=1.6
NIK.    Aug-Sep'04   [q]        115-WA | 48-WA   4324 | 5435   GBS(k=1.2) | NBS
NIK.    Sep-Oct'04   [g,q]      22-WA | 1-WA     7188 | 4967   GBS(k=0.5) | k=2.0
NIK.    Oct-Nov'04   [g,u,q]    18-WA | 1-WA     5108 | 3900   NBS | GBS(k=0.8)
SD.01   Jan-Jun'01   [g]        1-WA | 1-WA      5756 | 3914   GBS(k=1.4) | k=1.5
SD.01   Apr-Oct'01   [g,q]      1-WA | 27-WA     6878 | 5230   GBS(k=0.5) | NBS
SD.02   Jan-Jun'02   [g,u]      1-WA | 1-WA      6062 | 2925   NBS | NBS
SD.02   Apr-Oct'02   [g,u,q]    1-WA | 1-WA      7514 | 3672   GBS(k=1.8) | k=0.5

32 Prediction Time

          Run time (no cache)   Run time (cache)    Wait time
Name      mean     std          mean     std        mean     std
NIKHEF    38 ms    28 ms        10 ms    8 ms       313 ms   185 ms
SDSC      30 ms    32 ms        23 ms    17 ms      461 ms   516 ms

33 Progress: Problem Statement | Similarity Definition | The IBL-based Prediction Algorithm | Parameter Optimization via GA | Experimental Results | Conclusions and Future Work

34 Conclusions
- A response time prediction technique based on Instance Based Learning
- A novel resource state similarity that incorporates policies
- Automatic parameter selection
- "Efficient" and "more general": e.g., "I'm VO 1; how many jobs can you tolerate before reaching a maximum response time of X?"

35 Future Work
- Accuracy (global vs. local tuning)
- Performance (search structure)
- PDM: a Java-based toolkit for mining performance data in the Grid

36 References
- Mining Performance Data for Metascheduling Decision Support in the Grid. Technical Report 2005-07, LIACS, Leiden University, 2005. http://www.liacs.nl/~hli/pub.htm
- PDM Toolkit: http://www.liacs.nl/~hli/pdm

