Efficient Response Time Predictions by Exploiting Application and Resource State Similarities Hui Li, David Groep, Lex Wolters Nov 14th, 2005.

Similar presentations
Evaluating the Cost-Benefit of Using Cloud Computing to Extend the Capacity of Clusters Presenter: Xiaoyu Sun.
Making Time-stepped Applications Tick in the Cloud Tao Zou, Guozhang Wang, Marcos Vaz Salles*, David Bindel, Alan Demers, Johannes Gehrke, Walker White.
Item Based Collaborative Filtering Recommendation Algorithms
Efficient Event-based Resource Discovery Wei Yan*, Songlin Hu*, Vinod Muthusamy +, Hans-Arno Jacobsen +, Li Zha* * Chinese Academy of Sciences, Beijing.
Alex Cheung and Hans-Arno Jacobsen August, 14 th 2009 MIDDLEWARE SYSTEMS RESEARCH GROUP.
Using Parallel Genetic Algorithm in a Predictive Job Scheduling
School of Computer Science and Engineering Finding Top k Most Influential Spatial Facilities over Uncertain Objects Liming Zhan Ying Zhang Wenjie Zhang.
A system Performance Model Instructor: Dr. Yanqing Zhang Presented by: Rajapaksage Jayampthi S.
Scheduling of parallel jobs in a heterogeneous grid environment Scheduling of parallel jobs in a heterogeneous grid environment Each site has a homogeneous.
SKELETON BASED PERFORMANCE PREDICTION ON SHARED NETWORKS Sukhdeep Sodhi Microsoft Corp Jaspal Subhlok University of Houston.
Senior Design Project: Parallel Task Scheduling in Heterogeneous Computing Environments Senior Design Students: Christopher Blandin and Dylan Machovec.
Pre-GDB on Batch Systems (Bologna)11 th March Torque/Maui PIC and NIKHEF experience C. Acosta-Silva, J. Flix, A. Pérez-Calero (PIC) J. Templon (NIKHEF)
A Heuristic Bidding Strategy for Multiple Heterogeneous Auctions Patricia Anthony & Nicholas R. Jennings Dept. of Electronics and Computer Science University.
Security-Driven Heuristics and A Fast Genetic Algorithm for Trusted Grid Job Scheduling Shanshan Song, Ricky Kwok, and Kai Hwang University of Southern.
July 13, “How are Real Grids Used?” The Analysis of Four Grid Traces and Its Implications IEEE Grid 2006 Alexandru Iosup, Catalin Dumitrescu, and.
Data Mining – Intro.
MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering.
Euro-Par 2007, Rennes, 29th August 1 The Characteristics and Performance of Groups of Jobs in Grids Alexandru Iosup, Mathieu Jan *, Ozan Sonmez and Dick.
1 Enabling Large Scale Network Simulation with 100 Million Nodes using Grid Infrastructure Hiroyuki Ohsaki Graduate School of Information Sci. & Tech.
A Metadata Based Approach For Supporting Subsetting Queries Over Parallel HDF5 Datasets Vignesh Santhanagopalan Graduate Student Department Of CSE.
Performance Issues in Parallelizing Data-Intensive applications on a Multi-core Cluster Vignesh Ravi and Gagan Agrawal
Meta Scheduling Sathish Vadhiyar Sources/Credits/Taken from: Papers listed in “References” slide.
Young Suk Moon Chair: Dr. Hans-Peter Bischof Reader: Dr. Gregor von Laszewski Observer: Dr. Minseok Kwon 1.
1 Time & Cost Sensitive Data-Intensive Computing on Hybrid Clouds Tekin Bicer David ChiuGagan Agrawal Department of Compute Science and Engineering The.
Policy-based CPU-scheduling in VOs Catalin Dumitrescu, Mike Wilde, Ian Foster.
Grid Workload Management & Condor Massimo Sgaravatto INFN Padova.
GA-Based Feature Selection and Parameter Optimization for Support Vector Machine Cheng-Lung Huang, Chieh-Jen Wang Expert Systems with Applications, Volume.
© Copyright 2012 Hewlett-Packard Development Company, L.P. The information contained herein is subject to change without notice. LogKV: Exploiting Key-Value.
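Southern Taiwan University of Science and Technology, Department of Computer Science and Information Engineering. A web page usage prediction scheme using sequence indexing and clustering techniques Adviser: Yu-Chiang Li Speaker: Gung-Shian Lin Date:2010/10/15.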
Chapter 3 System Performance and Models. 2 Systems and Models The concept of modeling in the study of the dynamic behavior of simple system is be able.
Intelligent Database Systems Lab 1 Advisor: Dr. Hsu Graduate: Jian-Lin Kuo Author: Silvia Nittel Kelvin T.Leung Amy Braverman National Yunlin University of Science and Technology.
Your university or experiment logo here Caitriana Nicholson University of Glasgow Dynamic Data Replication in LCG 2008.
Optimization Problems - Optimization: In the real world, there are many problems (e.g. Traveling Salesman Problem, Playing Chess ) that have numerous possible.
Euro-Par, A Resource Allocation Approach for Supporting Time-Critical Applications in Grid Environments Qian Zhu and Gagan Agrawal Department of.
IPDPS 2005, slide 1 Automatic Construction and Evaluation of “Performance Skeletons” ( Predicting Performance in an Unpredictable World ) Sukhdeep Sodhi.
Data Mining – Intro. Course Overview Spatial Databases Temporal and Spatio-Temporal Databases Multimedia Databases Data Mining.
Computer Science and Engineering Predicting Performance for Grid-Based P. 1 IPDPS’07 A Performance Prediction Framework.
Job scheduling algorithm based on Berger model in cloud environment Advances in Engineering Software (2011) Baomin Xu,Chunyan Zhao,Enzhao Hua,Bin Hu 2013/1/251.
TeraGrid Advanced Scheduling Tools Warren Smith Texas Advanced Computing Center wsmith at tacc.utexas.edu.
Caitriana Nicholson, CHEP 2006, Mumbai Caitriana Nicholson University of Glasgow Grid Data Management: Simulations of LCG 2008.
MROrder: Flexible Job Ordering Optimization for Online MapReduce Workloads School of Computer Engineering Nanyang Technological University 30 th Aug 2013.
EGEE-III INFSO-RI Enabling Grids for E-sciencE EGEE and gLite are registered trademarks Abel Carrión Ignacio Blanquer Vicente Hernández.
Efficient Management of Time Series in P2P: Application to Technical Analysis and the Study of Mobile Objects G. Gardarin, B. Nguyen, L. Yeh, K. Zeitouni,
Scheduling MPI Workflow Applications on Computing Grids Juemin Zhang, Waleed Meleis, and David Kaeli Electrical and Computer Engineering Department, Northeastern.
Author Utility-Based Scheduling for Bulk Data Transfers between Distributed Computing Facilities Xin Wang, Wei Tang, Raj Kettimuthu,
Learning Photographic Global Tonal Adjustment with a Database of Input / Output Image Pairs.
Analysis of job submissions through the EGEE Grid Overview The Grid as an environment for large scale job execution is now moving beyond the prototyping.
Instance Discovery and Schema Matching With Applications to Biological Deep Web Data Integration Tantan Liu, Fan Wang, Gagan Agrawal {liut, wangfa,
WMS baseline issues in Atlas Miguel Branco Alessandro De Salvo Outline  The Atlas Production System  WMS baseline issues in Atlas.
Chapter 4 CPU Scheduling. 2 Basic Concepts Scheduling Criteria Scheduling Algorithms Multiple-Processor Scheduling Real-Time Scheduling Algorithm Evaluation.
2004 Queue Scheduling and Advance Reservations with COSY Junwei Cao Falk Zimmermann C&C Research Laboratories NEC Europe Ltd.
Architecture for Resource Allocation Services Supporting Interactive Remote Desktop Sessions in Utility Grids Vanish Talwar, HP Labs Bikash Agarwalla,
1 Performance Impact of Resource Provisioning on Workflows Gurmeet Singh, Carl Kesselman and Ewa Deelman Information Science Institute University of Southern.
ItemBased Collaborative Filtering Recommendation Algorithms 1.
Dynamic Resource Allocation for Shared Data Centers Using Online Measurements By- Abhishek Chandra, Weibo Gong and Prashant Shenoy.
OPERATING SYSTEMS CS 3502 Fall 2017
CPU SCHEDULING.
Authors: Sajjad Rizvi, Xi Li, Bernard Wong, Fiodar Kazhamiaka
Table 1. Advantages and Disadvantages of Traditional DM/ML Methods
CPU Scheduling G.Anuradha
Module 5: CPU Scheduling
3: CPU Scheduling Basic Concepts Scheduling Criteria
A Characterization of Approaches to Parrallel Job Scheduling
University of Wisconsin-Madison
Declarative Transfer Learning from Deep CNNs at Scale
Process Scheduling B.Ramamurthy 4/11/2019.
Process Scheduling B.Ramamurthy 4/7/2019.
Experiences in Running Workloads over OSG/Grid3
Module 5: CPU Scheduling
Module 5: CPU Scheduling
Presentation transcript:

11/14/05 Grid'05, Seattle, WA

Outline
- Problem Statement
- Similarity Definition
- The IBL-based Prediction Algorithm
- Parameter Optimization via GA
- Experimental Results
- Conclusions and Future Work

Problem Statement
- Context: large-scale Grids such as LCG
- Target: computing resources such as clusters and parallel supercomputers
- Source: historical workload traces
- Goal: develop a practically useful technique for predicting job response times
- Purpose: provide dynamic information for metascheduling decision support

The LCG Case

The LCG Challenges
- Scalable production environment (~211 sites, … CPUs, 5 PB storage)
- Many candidate sites remain after matchmaking and authorization filtering
- How does the resource broker select good candidate sites?
- What makes a good metric? Sites may not want to publish their policies.

The NIKHEF Site

Job Response Times on Resources
Job response time serves as a dynamic performance metric. It is defined as the time elapsed from a job’s submission to its completion:
Response time = Application Run Time + Queue Wait Time
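
As an illustration of this definition, the sketch below derives the three quantities from a job's submission, start, and end timestamps. The record type and its field names are hypothetical and are not taken from any particular trace format.

```python
from dataclasses import dataclass


@dataclass
class JobTimes:
    """Hypothetical trace record; timestamps are seconds since the epoch."""
    submit: float
    start: float
    end: float

    @property
    def wait_time(self) -> float:
        # Queue wait time: from submission until the job starts running.
        return self.start - self.submit

    @property
    def run_time(self) -> float:
        # Application run time: from start until completion.
        return self.end - self.start

    @property
    def response_time(self) -> float:
        # Response time = queue wait time + application run time,
        # which equals end - submit.
        return self.wait_time + self.run_time


job = JobTimes(submit=1000.0, start=1600.0, end=5200.0)
print(job.wait_time, job.run_time, job.response_time)  # 600.0 3600.0 4200.0
```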

Related Work
Predictions based on historical observations:
- Similarity templates [Smith et al., 98]: run time
- Instance-based learning [Kapadia et al., 99]: run time
- Scheduler simulation [Smith 99; Li et al., 04]: wait time
“Learning it from data”:
- Can scheduling rules and policies be discovered by mining historical data?
- How can they be used for wait time predictions?

Job Similarity
Job attributes recorded in traces that characterize a job: group, user, queue, executable name, #CPUs, requested run time, arrival time of day (executable arguments*, node specification*)
Job similarity is a natural basis for predicting run times. Here it is also used for predicting queue wait times.
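
A minimal sketch of how one job record with these attributes might be represented for similarity comparison. The `Job` class and its field names are assumptions made for illustration. So is expressing arrival time of day in UTC hours. The optional attributes marked * are omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Job:
    # Attributes named on the slide; the field names are hypothetical.
    group: str
    user: str
    queue: str
    executable: str
    ncpus: int
    requested_runtime: float  # seconds
    arrival: float            # submission time, seconds since the epoch

    def time_of_day(self) -> float:
        """Arrival time of day in hours, 0.0 <= tod < 24.0 (UTC, an assumption)."""
        t = datetime.fromtimestamp(self.arrival, tz=timezone.utc)
        return t.hour + t.minute / 60 + t.second / 3600

    def features(self) -> tuple:
        # Nominal attributes first (compared by equality), then numeric ones.
        return (self.group, self.user, self.queue, self.executable,
                self.ncpus, self.requested_runtime, self.time_of_day())


job = Job("atlas", "user42", "long", "run.sh", 1, 7200.0, 1099310400.0)
print(job.features())  # ('atlas', 'user42', 'long', 'run.sh', 1, 7200.0, 12.0)
```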

Resource State Similarity
Definition: the resource state is the pool of running and queued jobs on the resource at the time a prediction is made.
Assumption: “similar” jobs under “similar” resource states will most likely have similar wait times.
Key problems:
- How should attributes be defined to represent a resource state?
- How can local policies be incorporated into these attributes for a more fine-grained similarity comparison?

Resource State Attributes
- VecRunJobs: categorized number of running jobs
- VecQueueJobs: categorized number of queued jobs
- VecAlreadyRun: categorized sum, over running jobs, of elapsed run time × #CPUs
- VecRunRemain: categorized sum, over running jobs, of remaining run time × #CPUs
- AlreadyQueue: categorized sum, over queued jobs, of time already spent queued × #CPUs
- QueueDemand: categorized sum, over queued jobs, of run time × #CPUs

Policy Attributes
- Credential attributes commonly used in scheduling policy expressions: group (VO), user, and queue
  - e.g. Maui (NIKHEF), Catalina (SDSC)
- The policy attributes are embedded into the resource state attributes via categorization
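
The attribute definitions and the categorization step can be sketched together. Each attribute becomes a mapping from a policy category (here the job's group/VO) to a value. All class, field, and function names are hypothetical. For queued jobs, "run time" is assumed to mean the requested run time. For a policy set with several attributes, `key` could return a tuple such as (group, user).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningJob:
    group: str
    ncpus: int
    elapsed: float      # seconds already run
    remaining: float    # estimated seconds still to run


@dataclass
class QueuedJob:
    group: str
    ncpus: int
    queued_for: float   # seconds already spent waiting
    runtime: float      # requested run time in seconds (assumption)


ATTRIBUTES = ("VecRunJobs", "VecQueueJobs", "VecAlreadyRun",
              "VecRunRemain", "AlreadyQueue", "QueueDemand")


def resource_state(running, queued, key=lambda job: job.group):
    """Return each resource state attribute as a dict: category -> value.
    `key` maps a job to its policy category (here the group/VO)."""
    state = {name: defaultdict(float) for name in ATTRIBUTES}
    for job in running:
        c = key(job)
        state["VecRunJobs"][c] += 1
        state["VecAlreadyRun"][c] += job.elapsed * job.ncpus
        state["VecRunRemain"][c] += job.remaining * job.ncpus
    for job in queued:
        c = key(job)
        state["VecQueueJobs"][c] += 1
        state["AlreadyQueue"][c] += job.queued_for * job.ncpus
        state["QueueDemand"][c] += job.runtime * job.ncpus
    return {name: dict(values) for name, values in state.items()}


running = [RunningJob("atlas", 2, 100.0, 500.0), RunningJob("lhcb", 1, 50.0, 10.0)]
queued = [QueuedJob("atlas", 4, 30.0, 600.0)]
state = resource_state(running, queued)
print(state["VecAlreadyRun"])  # {'atlas': 200.0, 'lhcb': 50.0}
print(state["QueueDemand"])    # {'atlas': 2400.0}
```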

Resource State Example
Policy attribute set = {group}; resource attributes = VecRunJobs and VecQueueJobs
State 1: RunJobs = {Atlas: 30, Lhcb: 60, …}, QueueJobs = {Atlas: 45, Lhcb: 50, …}
State 2: RunJobs = {cms: 30, Alice: 60, …}, QueueJobs = {Atlas: 45, Lhcb: 50, …}
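
Reading the two example states as listed, a short sketch shows why categorization matters: the aggregate counts are identical, but the per-VO vectors differ. The Euclidean distance over the union of categories, with missing categories counted as 0, is an illustrative choice and not the talk's exact metric.

```python
import math


def category_distance(a: dict, b: dict) -> float:
    """Euclidean distance between two categorized vectors; a category that
    is absent from one state counts as 0 there (illustrative choice)."""
    categories = set(a) | set(b)
    return math.sqrt(sum((a.get(c, 0) - b.get(c, 0)) ** 2 for c in categories))


state1 = {"RunJobs": {"Atlas": 30, "Lhcb": 60}, "QueueJobs": {"Atlas": 45, "Lhcb": 50}}
state2 = {"RunJobs": {"cms": 30, "Alice": 60}, "QueueJobs": {"Atlas": 45, "Lhcb": 50}}

for attr in ("RunJobs", "QueueJobs"):
    totals = (sum(state1[attr].values()), sum(state2[attr].values()))
    print(attr, totals, category_distance(state1[attr], state2[attr]))
```

Both states have 90 running and 95 queued jobs in total, so uncategorized attributes would see no difference. The categorized RunJobs distance is sqrt(9000), about 94.9, while QueueJobs gives 0.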

Instance Based Learning
- A nonparametric learning technique
- Training data are stored in a historical database. Predictions are made by applying an induction model to the data entries “near” the query.
- Two key components: the distance function and the induction model
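
A minimal sketch of the instance-based prediction loop follows. It assumes the history is a list of (attribute vector, observed value) pairs. The distance function is passed in, and the induction model is simply the unweighted mean of the k nearest neighbours. The `history_size` cut-off may correspond to the "history size" parameter in the GA chromosomes, but how the authors apply that parameter is an assumption here.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A stored instance: (attribute vector, observed value, e.g. a run time).
Instance = Tuple[Sequence[float], float]


def ibl_predict(history: List[Instance], query: Sequence[float],
                distance: Callable[[Sequence[float], Sequence[float]], float],
                k: int = 5, history_size: Optional[int] = None) -> float:
    """Find the k stored instances nearest to `query` and apply an
    induction model to them (here simply the unweighted mean)."""
    if history_size is not None:
        # Keep only the most recent `history_size` entries.
        history = history[max(len(history) - history_size, 0):]
    if not history:
        raise ValueError("no historical instances to learn from")
    nearest = sorted(history, key=lambda inst: distance(inst[0], query))[:k]
    return sum(value for _, value in nearest) / len(nearest)


def dist(a: Sequence[float], b: Sequence[float]) -> float:
    # Toy one-dimensional distance used only for this example.
    return abs(a[0] - b[0])


history = [((1.0,), 10.0), ((2.0,), 12.0), ((10.0,), 100.0)]
print(ibl_predict(history, (1.5,), dist, k=2))                  # 11.0
print(ibl_predict(history, (1.5,), dist, k=2, history_size=1))  # 100.0
```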

The Distance Function
An extended Heterogeneous Euclidean-Overlap Metric (HEOM)

The Distance Function (cont’d)
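
The HEOM formula itself did not survive in this transcript. For reference, here is a sketch of the standard HEOM with per-attribute weights. It uses overlap distance (0 or 1) for nominal attributes, range-normalized absolute difference for numeric attributes, and 1 for missing values, combined as a weighted Euclidean norm. The talk's extended variant may differ. The weights are included only because the GA chromosomes carry attribute weights, and the example job vectors are hypothetical.

```python
import math
from typing import Optional, Sequence


def heom(x: Sequence, y: Sequence, nominal: Sequence[bool],
         ranges: Sequence[Optional[float]],
         weights: Optional[Sequence[float]] = None) -> float:
    """Weighted Heterogeneous Euclidean-Overlap Metric between x and y.

    nominal[a]: True if attribute a is nominal (compared by overlap).
    ranges[a]:  max - min of numeric attribute a in the history (None
                for nominal attributes).
    weights[a]: per-attribute weight; defaults to 1.0 for every attribute.
    """
    if weights is None:
        weights = [1.0] * len(x)
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if xa is None or ya is None:
            d = 1.0                               # missing value: maximal distance
        elif nominal[a]:
            d = 0.0 if xa == ya else 1.0          # overlap metric
        else:
            r = ranges[a]
            # Range-normalized difference; a zero range is treated as no difference.
            d = abs(xa - ya) / r if r else 0.0
        total += weights[a] * d * d
    return math.sqrt(total)


# Hypothetical job vectors: (group, user, #CPUs, requested run time in s).
q = ("atlas", "u1", 4, 3600.0)
h = ("atlas", "u2", 8, 7200.0)
print(heom(q, h, nominal=(True, True, False, False),
           ranges=(None, None, 16, 14400.0)))  # about 1.061
```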

The Induction Models
- Weighted Average (WA)
- Linear Locally Weighted Regression (LLWR)
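
The two induction models can be sketched as follows, assuming a Gaussian kernel K(d) = exp(-(d/h)^2) with bandwidth h. The weighted average needs only the neighbours' distances and target values. The LLWR sketch is restricted to a single numeric input so that the weighted least-squares fit has a closed form. The kernel choice and all function names are assumptions.

```python
import math
from typing import Sequence


def gaussian_kernel(d: float, bandwidth: float) -> float:
    # Weight decays with distance; the bandwidth controls how fast.
    return math.exp(-(d / bandwidth) ** 2)


def weighted_average(dists: Sequence[float], targets: Sequence[float],
                     bandwidth: float) -> float:
    """WA: kernel-weighted mean of the neighbours' target values."""
    w = [gaussian_kernel(d, bandwidth) for d in dists]
    total = sum(w)
    if total == 0.0:                       # all weights underflowed
        return sum(targets) / len(targets)
    return sum(wi * y for wi, y in zip(w, targets)) / total


def llwr_1d(xs: Sequence[float], ys: Sequence[float], xq: float,
            bandwidth: float) -> float:
    """LLWR with one numeric input: fit y ~ b0 + b1*x by weighted least
    squares, weighting points by their distance to xq, then evaluate at xq."""
    w = [gaussian_kernel(abs(x - xq), bandwidth) for x in xs]
    sw = sum(w)
    if sw == 0.0:
        return sum(ys) / len(ys)
    xbar = sum(wi * x for wi, x in zip(w, xs)) / sw
    ybar = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - xbar) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - xbar) * (y - ybar) for wi, x, y in zip(w, xs, ys))
    if sxx == 0.0:                         # no spread in x: fall back to the mean
        return ybar
    return ybar + (sxy / sxx) * (xq - xbar)


xs, ys = [0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]   # y = 2x + 1
print(llwr_1d(xs, ys, 5.0, 2.0))
print(weighted_average([abs(x - 5.0) for x in xs], ys, 2.0))
```

On data with a linear trend, LLWR extrapolates along the fitted line, giving about 11.0 at x = 5 for y = 2x + 1. The weighted average can never exceed the largest neighbour value, which is 7.0 here.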

Parameter Optimization by GA
- Genetic algorithm implementation using standard operators: selection, mutation, and crossover
- Real encoding vs. binary encoding
- Chromosomes are structured to match different objectives (i.e. run time or wait time prediction)
- Objective: minimize the average prediction error
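
A minimal real-encoded GA with the standard operators listed above (tournament selection, blend crossover, Gaussian mutation), plus elitism. The operator variants, rates, and the toy objective are assumptions. In the talk, the objective would instead be the average prediction error of the predictor configured by each chromosome.

```python
import random
from typing import Callable, List, Sequence, Tuple


def real_coded_ga(objective: Callable[[Sequence[float]], float],
                  bounds: Sequence[Tuple[float, float]],
                  pop_size: int = 30, generations: int = 50,
                  mutation_rate: float = 0.1, seed: int = 0):
    """Minimize `objective` over a box with a simple real-encoded GA.

    Returns (best individual, its objective value, best value per generation).
    """
    rng = random.Random(seed)

    def tournament(pop, fit):
        # Binary tournament selection: the fitter of two random individuals.
        a, b = rng.sample(range(len(pop)), 2)
        return pop[a] if fit[a] <= fit[b] else pop[b]

    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    fit = [objective(ind) for ind in pop]
    history: List[float] = []
    for _ in range(generations):
        # Elitism: the current best individual is copied unchanged.
        elite = pop[min(range(pop_size), key=fit.__getitem__)]
        children = [list(elite)]
        while len(children) < pop_size:
            p1, p2 = tournament(pop, fit), tournament(pop, fit)
            # Blend (arithmetic) crossover: the child lies between its parents.
            alpha = rng.random()
            child = [alpha * u + (1 - alpha) * v for u, v in zip(p1, p2)]
            # Gaussian mutation, clipped back into the bounds.
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < mutation_rate:
                    child[i] = min(max(child[i] + rng.gauss(0, 0.1 * (hi - lo)), lo), hi)
            children.append(child)
        pop = children
        fit = [objective(ind) for ind in pop]
        history.append(min(fit))
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best], history


best, score, history = real_coded_ga(
    lambda v: sum((x - 0.3) ** 2 for x in v), bounds=[(0.0, 1.0)] * 3)
print(best, score)
```

Elitism guarantees that the best objective value never gets worse from one generation to the next.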

Chromosomes
- Run time: (WAg, WAu, WAe, WAn, WAr, WAtod), (#CPUs), (method), (neighbor size), (history size), (bandwidth type), (bandwidth)
- Wait time: (WPg, WPu, WPq), (WAg, WAu, WAe, WAn, WAr, WAtod), (WSrj, WSqj, WSalrr, WSalrq, WSrrem, WSqdem), (#CPUs, queue demand credential, queue demand total), (method), (neighbor size), (history size), (bandwidth type), (bandwidth)
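
A sketch of how a real-encoded run-time chromosome might be decoded into predictor settings, assuming every gene lies in [0, 1]. The gene order follows the run-time chromosome listed above. The maximum neighbour size, history size, and bandwidth are hypothetical placeholders, and so is the list of bandwidth types.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Hypothetical option lists; the actual encodings are not given in the talk.
METHODS = ("WA", "LLWR")
BANDWIDTH_TYPES = ("fixed", "GBS", "NBS")


@dataclass
class RunTimeParams:
    attr_weights: Tuple[float, ...]  # WAg, WAu, WAe, WAn, WAr, WAtod
    cpu_weight: float                # weight of the #CPUs attribute
    method: str
    neighbor_size: int
    history_size: int
    bandwidth_type: str
    bandwidth: float


def _pick(options: Sequence[str], gene: float) -> str:
    # Map a gene in [0, 1] onto one of several discrete choices.
    return options[min(int(gene * len(options)), len(options) - 1)]


def decode_run_time(genes: Sequence[float], max_neighbors: int = 128,
                    max_history: int = 8192,
                    max_bandwidth: float = 2.0) -> RunTimeParams:
    """Decode a 12-gene run-time chromosome; the max_* limits are hypothetical."""
    if len(genes) != 12:
        raise ValueError("expected 12 genes")
    return RunTimeParams(
        attr_weights=tuple(genes[0:6]),
        cpu_weight=genes[6],
        method=_pick(METHODS, genes[7]),
        neighbor_size=1 + round(genes[8] * (max_neighbors - 1)),
        history_size=1 + round(genes[9] * (max_history - 1)),
        bandwidth_type=_pick(BANDWIDTH_TYPES, genes[10]),
        bandwidth=genes[11] * max_bandwidth,
    )


params = decode_run_time([0.5] * 6 + [1.0, 0.0, 0.0, 1.0, 0.99, 0.25])
print(params.method, params.neighbor_size, params.history_size,
      params.bandwidth_type, params.bandwidth)  # WA 1 8192 NBS 0.5
```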

Experimental Setup
Real traces with diverse characteristics:
- NIKHEF cluster: ~300 CPUs, up to 3 GB memory per node, Ethernet connections; Maui scheduler with backfilling; policies based on groups (VOs) and users.
- SDSC Blue Horizon: IBM SP, 1152 CPUs; Catalina scheduler with backfilling; policies based on queues.
The evaluation was run on multiple Intel Xeon machines, each with 4 CPUs and 3 GB of shared memory.

Methodology
Prediction accuracy:
- Average Absolute Error (AAE)
- Average Relative Error = AAE / average real value
- Relative Error = (Est - Real) / (Est + Real)
Prediction time:
- Average execution time per prediction, in milliseconds
Workload traces are divided into training sets and test sets:
- On NIKHEF, each test set is one month of trace data (consecutive months are tested in turn), with parameters trained on the preceding two months of data.
- On SDSC, each test set covers three months, with training on the preceding six months.
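
The accuracy metrics and the train/test split above can be sketched as follows. The function names are hypothetical. The slide's relative error is undefined when both the estimate and the real value are zero, so returning 0.0 in that case is an added convention. For the splits, NIKHEF would use `train_len=2, test_len=1` (months) and SDSC would use `train_len=6, test_len=3`.

```python
from typing import Iterator, Sequence, Tuple


def average_absolute_error(est: Sequence[float], real: Sequence[float]) -> float:
    # AAE: mean of |estimate - real| over all predictions.
    return sum(abs(e - r) for e, r in zip(est, real)) / len(real)


def average_relative_error(est: Sequence[float], real: Sequence[float]) -> float:
    # AAE divided by the average real value.
    return average_absolute_error(est, real) / (sum(real) / len(real))


def relative_error(est: float, real: float) -> float:
    # (Est - Real) / (Est + Real): lies in [-1, 1] for non-negative inputs.
    # Returning 0.0 when both are zero is an added convention.
    if est + real == 0:
        return 0.0
    return (est - real) / (est + real)


def rolling_windows(n_periods: int, train_len: int,
                    test_len: int) -> Iterator[Tuple[range, range]]:
    """Yield (train, test) period indices: train on the preceding
    train_len periods, test on the next test_len, then slide by test_len."""
    start = 0
    while start + train_len + test_len <= n_periods:
        split = start + train_len
        yield range(start, split), range(split, split + test_len)
        start += test_len


print(average_absolute_error([10.0, 30.0], [20.0, 20.0]))   # 10.0
print(average_relative_error([10.0, 30.0], [20.0, 20.0]))   # 0.5
print(relative_error(30.0, 10.0))                           # 0.5
print(list(rolling_windows(5, 2, 1)))
```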

Absolute Prediction Error

Name     Run Time              Wait Time             Response Time
         Abs. Err   Rel. Err   Abs. Err   Rel. Err   Abs. Err   Rel. Err
NIKHEF   … min      …          … min      …          … min      0.57
SDSC01   … min      …          … min      …          … min      0.79
SDSC02   … min      …          … min      …          … min      0.65

Relative Prediction Error (Run Time)

Relative Prediction Error (Wait Time)

Error Analysis

Name     Wait Time t (sec)   Job %   Abs. Err   Rel. Err
NIKHEF   0 < t < …           …       … min      …
         … < t < …           …       61.7 min   …
         t > …               …       704 min    …
SDSC01   0 < t < …           …       … min      …
         … < t < …           …       93.6 min   …
         t > …               …       1195 min   …
SDSC02   0 < t < …           …       … min      …
         … < t < …           …       68.3 min   …
         t > …               …       2167 min   …

Error Analysis

Optimized Parameters

Name    Period       Policy    Method         History      BW
NIK.    Jun-Jul’04   [g,u,q]   104-WA|1-WA    3309|5409    GBS(k=0.5)|NBS
NIK.    Jul-Aug’04   [u,q]     125-WA|1-WA    7822|3681    GBS(k=0.6)|k=1.6
NIK.    Aug-Sep’04   [q]       115-WA|48-WA   4324|5435    GBS(k=1.2)|NBS
NIK.    Sep-Oct’04   [g,q]     22-WA|1-WA     7188|4967    GBS(k=0.5)|k=2.0
NIK.    Oct-Nov’04   [g,u,q]   18-WA|1-WA     5108|3900    NBS|GBS(k=0.8)
SD.01   Jan-Jun’01   [g]       1-WA|1-WA      5756|3914    GBS(k=1.4)|k=1.5
SD.01   Apr-Oct’01   [g,q]     1-WA|27-WA     6878|5230    GBS(k=0.5)|NBS
SD.02   Jan-Jun’02   [g,u]     1-WA|1-WA      6062|2925    NBS|NBS
SD.02   Apr-Oct’02   [g,u,q]   1-WA|1-WA      7514|3672    GBS(k=1.8)|k=0.5

Prediction Time

Name     Run time (no cache)   Run time (cache)   Wait time
         mean     std          mean     std       mean     std
NIKHEF   38 ms    28 ms        10 ms    8 ms      313 ms   185 ms
SDSC     30 ms    32 ms        23 ms    17 ms     461 ms   516 ms

Conclusions
- A response time prediction technique based on Instance Based Learning
- A novel resource state similarity that incorporates policies
- Automatic parameter selection
- “Efficient” and “more general”
  - “I’m VO 1; how many jobs can you tolerate before reaching a max. response time of X?”

Future Work
- Accuracy (global vs. local tuning)
- Performance (search structure)
- PDM: a Java-based toolkit for mining performance data in the Grid

References
- Mining Performance Data for Metascheduling Decision Support in the Grid, Technical Report, LIACS, Leiden University
- PDM Toolkit