
Resource and Test Management in Grids


1 Resource and Test Management in Grids
Dick Epema, Catalin Dumitrescu, Hashim Mohamed, Alexandru Iosup, Ozan Sonmez
Parallel and Distributed Systems Group, Delft University of Technology
The Grid Initiative Summer School, Bucharest, RO, 2006

2 Outline
Koala: Processor and Data Co-Allocation in Grids
- The Co-Allocation Problem in Grids
- The Design of Koala
- Koala and the DAS Community
- The Future of Koala
GrenchMark: Analyzing, Testing, and Comparing Grids
- A Brief Introduction to Grid Computing
- Grid Performance Evaluation Issues
- The GrenchMark Architecture
- Experience with GrenchMark
Take home message

3 The Co-allocation Problem in Grids (1) Motivation
Co-allocation = the simultaneous allocation of resources in multiple clusters to single applications that consist of multiple components.
Reasons:
- Use more resources than are available at a single cluster at a given time
- Create a specific virtual environment (e.g., a visualization cluster, geographically spread data)
- Achieve reliability through replication on multiple clusters
- Avoid resource contention on the same site (e.g., batches)
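For concreteness, a co-allocation request can be pictured as one job with several components, each asking for processors, possibly at a user-chosen cluster. The sketch below is a minimal illustration, not Koala's actual interface; the field and cluster names are hypothetical.

```python
# A minimal sketch of a co-allocation request: one job asks for processors
# in several clusters at once. Names here are hypothetical, not Koala's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class JobComponent:
    processors: int          # CPUs needed by this component
    cluster: Optional[str]   # target cluster, or None if the scheduler may choose

# One application, three components spread over up to three clusters.
request = [
    JobComponent(processors=32, cluster="cluster-delft"),  # placement fixed by user
    JobComponent(processors=16, cluster=None),             # scheduler decides
    JobComponent(processors=16, cluster=None),
]
```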

4 The Co-allocation Problem in Grids (2) Overall Example
[Figure: KOALA maintains a global queue and co-allocates global jobs across clusters; each cluster has a local queue with its own local scheduler (LS) handling local jobs, with load sharing between them. Source: Dick Epema]

5 The Co-allocation Problem in Grids (3) Details: Processors and Data Co-Alloc.
- Jobs have access to processors and data at many sites
- Files are stored at different file sites; replicas may exist
- The scheduler decides on job component placement at execution sites
- Jobs can be of high or low priority
Source: Hashim Mohamed

6 The Co-allocation Problem in Grids (4) Details: Co-Allocated Job Types
- Fixed jobs: job component sizes and placement fixed by the user
- Non-fixed jobs: job component sizes fixed by the user, placement by scheduler decision
- Semi-fixed jobs: job component sizes fixed by the user; placement of some components fixed by the user, of the others by scheduler decision
- Flexible jobs: job component sizes and placement both by scheduler decision
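The taxonomy can be summarized as a small table of who decides what; the plain-Python rendering below follows the reconstruction above and is not Koala code.

```python
# Who fixes component sizes and placement for each co-allocated job type.
JOB_TYPES = {
    "fixed":      {"sizes": "user",      "placement": "user"},
    "non-fixed":  {"sizes": "user",      "placement": "scheduler"},
    "semi-fixed": {"sizes": "user",      "placement": "user (some) + scheduler (rest)"},
    "flexible":   {"sizes": "scheduler", "placement": "scheduler"},
}
```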

7 The Koala Design
- Selection: placing the job components
- Control: transferring the executable and input files
- Instantiation: claiming the resources selected for each job component
- Run: submitting, then monitoring job execution (fault tolerance)
Source: Hashim Mohamed
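The four phases can be read as a simple per-job pipeline. The sketch below uses hypothetical stub helpers to show the ordering only; none of these names are Koala's real interfaces.

```python
# The four Koala phases as a pipeline for one job; every helper below is a
# hypothetical stub standing in for the real machinery.
def select(job):                     # Selection: choose a cluster per component
    return {"component-0": "cluster-A"}

def stage_files(job, placement):     # Control: copy executable and input files
    pass

def claim(placement):                # Instantiation: claim the selected resources
    pass

def run_and_monitor(job, placement): # Run: submit, then watch for faults
    pass

def schedule(job):
    placement = select(job)
    stage_files(job, placement)
    claim(placement)
    run_and_monitor(job, placement)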

8 The Koala Selection Step Many Placement Policies
Originally supported co-allocation policies:
- Worst-Fit: balance job components across sites
- Close-to-Files: take into account the locations of input files to minimize transfer times
- (Flexible) Cluster Minimization: mitigate inter-cluster communication; can also split the job automatically
But different application types require different ways of component placement. So:
- Modular structure with pluggable policies
- Take into account the internal structure of applications
- Use monitoring information (dynamic placement)
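To illustrate how such a policy can be stated, here is a minimal sketch of the Worst-Fit idea, placing each component on the feasible cluster with the most idle processors. This is a plain reading of the policy name, not Koala's implementation; the cluster names are invented.

```python
# Worst-Fit sketch: place each job component on the feasible cluster with the
# most idle CPUs, which tends to balance components across sites.
def worst_fit(components, idle):
    """components: {name: CPUs needed}; idle: {cluster: free CPUs}."""
    placement = {}
    for comp, cpus in components.items():
        candidates = [c for c, free in idle.items() if free >= cpus]
        if not candidates:
            return None                       # no feasible placement: the job waits
        best = max(candidates, key=idle.get)  # emptiest cluster that still fits
        placement[comp] = best
        idle[best] -= cpus
    return placement

print(worst_fit({"c0": 16, "c1": 8}, {"site-a": 24, "site-b": 40, "site-c": 12}))
# -> {'c0': 'site-b', 'c1': 'site-a'} (site-b drops to 24; tie broken by dict order)
```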

9 The Koala Instantiation Step Adv. Reservations or On-The-Fly
[Figure: claiming resources via advance reservations or on-the-fly. Source: Hashim Mohamed]
Use system feedback (dynamic claiming).

10 The Koala Selection Step HOCs: Exploiting Application Structure
Higher-Order Components (HOCs): pre-packaged software components with generic patterns of parallel behavior.
- Patterns: master-worker, pipelines, wavefront
- Benefits: facilitates parallel programming in grids; enables user-transparent scheduling in grids
- Most important additional middleware: a translation layer that builds a performance model from the HOC patterns and the user-supplied application parameters
- Supported by KOALA (with the University of Münster)
- Initial results: up to 50% reduction in runtimes
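To make the "performance model from pattern plus parameters" idea concrete: for a master-worker HOC, a translation layer could estimate the makespan from just the task count, per-task time, and worker count. The formula below is an illustrative first-order model, not the actual KOALA/Münster model.

```python
import math

# First-order makespan model for a master-worker HOC: with n independent tasks
# of duration t on w workers, roughly ceil(n / w) rounds of work are needed.
def master_worker_makespan(n_tasks, t_task, n_workers, startup=0.0):
    return startup + math.ceil(n_tasks / n_workers) * t_task

print(master_worker_makespan(100, 30.0, 16))  # 7 rounds * 30 s = 210.0 s
```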

11 The Koala Instantiation Step The Runners
Problem: how to support many application types, each with specific (and difficult) requirements?
Solution: runners (= interface modules). Currently supported:
- Any type of single-component job
- MPI/DUROC jobs
- Ibis jobs
- HOC applications
API for extensions: write your own!
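A sketch of what such a pluggable runner interface could look like; this is hypothetical, since Koala's real runner API is not shown on the slide. One subclass per application type adapts the generic scheduling machinery to that type's submission requirements.

```python
from abc import ABC, abstractmethod

# Hypothetical runner interface (not Koala's actual API).
class Runner(ABC):
    @abstractmethod
    def stage_in(self, job): ...   # transfer executable and input files
    @abstractmethod
    def submit(self, job): ...     # start the (possibly co-allocated) job
    @abstractmethod
    def monitor(self, job): ...    # poll status, trigger fault handling

class MPIRunner(Runner):           # e.g., for MPI/DUROC jobs
    def stage_in(self, job): print("staging MPI binaries for", job)
    def submit(self, job):   print("submitting via DUROC:", job)
    def monitor(self, job):  print("monitoring", job)
```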

12 Koala and the DAS Community
Extensive experience working in real Grid environments: over 25,000 completed jobs!
Koala has been released on the DAS in Sep [ ]
- Hands-on tutorials (last in Spring 2006)
- Documentation (web site)
- Papers: IEEE Cluster’04, Dagstuhl FGG’04, EGC’05, IEEE CCGrid’05, IEEE eScience’06, etc.
Koala helps you get results: IEEE CCGrid’06, others submitted.

13 The Future of Koala (1) DAS-3
Support for more application types, e.g., workflows and parameter sweep applications. Scheduling your application?
Communication-aware and application-aware scheduling policies:
- Take into account the communication pattern of applications when co-allocating
- Also schedule bandwidth (in DAS-3)
[Figure: DAS-3 topology (CPUs, R, NOC)]

14 The Future of Koala (2)
- Support heterogeneity (DAS-3; DAS-2 + DAS-3; DAS-3 + Grid’5000 + RoGRID)
- Peer-to-peer structure instead of a hierarchical grid scheduler

15 The Future of Koala (3) Usage Service Level Agreements (uSLAs)
- Want to give each partner its share of the resources
- Prevent abusive behavior
- Rules for a “mostly free” system usage pattern (decay-based uSLA mechanism)
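A sketch of how a decay-based mechanism can work: past consumption is discounted over time, so a partner who stays under its share regains priority. The decay factor and rules below are illustrative assumptions, not the actual uSLA definition.

```python
# Decay-based usage accounting sketch: usage is re-weighted each accounting
# period, so old consumption matters ever less (decay=0.5 is an assumption).
def update_usage(decayed_usage, new_usage, decay=0.5):
    return decay * decayed_usage + new_usage

def over_share(decayed_usage, share, capacity):
    """True if a partner's decayed usage exceeds its agreed share of capacity."""
    return decayed_usage > share * capacity

usage = 0.0
for consumed in [100, 100, 0, 0]:            # CPU-hours per accounting period
    usage = update_usage(usage, consumed)
    print(round(usage, 1), over_share(usage, share=0.25, capacity=400))
# Heavy early use trips the limit (150 > 100), then decays back under it.
```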

16 Other Koala-Related Work: Resource Provisioning and Scheduling for World-Wide Data-Sharing Services
- Grid systems provisioning resources for world-wide data-sharing services, e.g., P2P file sharing
- Provision resources quickly, give guarantees
- No adverse impact on the current level of service
- Prevent abusive behavior

17 Outline
Koala: Processor and Data Co-Allocation in Grids
- The Co-Allocation Problem in Grids
- The Design of Koala
- Koala and the DAS Community
- The Future of Koala
GrenchMark: Analyzing, Testing, and Comparing Grids
- A Brief Introduction to Grid Computing
- Grid Performance Evaluation Issues
- The GrenchMark Architecture
- Experience with GrenchMark
Take home message

18 A Brief Introduction to Grid Computing
Typical grid environment (e.g., the DAS):
- Applications [!]
- Resources: compute (clusters), storage, (dedicated) network
- Virtual Organizations, projects (e.g., VL-e), groups, users
Grids vs. (traditional) parallel production environments:
- Dynamic
- Heterogeneous
- Very large scale (world-wide)
- No central administration
→ Most problems are NP-hard and need experimental validation.
Note that applications are the starting point for the majority of grids (grid infrastructures have been designed to support specific applications used by their respective grid communities). However, most grids support large groups of researchers ( ), sometimes with very different requirements (e.g., the DAS).
Key point: we need experimental validation.

19 Experimental Environments Real-World Testbeds
DAS, NorduGrid, Grid3/OSG, Grid’5000, …
Pros:
- True performance, also shows “it works!”
- Infrastructure in place
Cons:
- Time-intensive
- Exclusive access (repeatability)
- Controlled-environment problem (limited scenarios)
- Workload structure (little or no realistic data)
- What to measure (new environment)

20 Experimental Environments Simulated and Emulated Testbeds
GridSim, SimGrid, GangSim, MicroGrid, … Essentially a trade-off between precision and speed.
Pros:
- Exclusive access (repeatability)
- Controlled environment (unlimited scenarios)
Cons:
- Synthetic Grids: what to generate? how to generate? (clusters, disks, network, VOs, groups, users, applications, etc.)
- Workload structure (little or no realistic data)
- What to measure (new environment)
- Validity of results (accuracy vs. time)

21 Grid Performance Evaluation Current Practice
Performance indicators: define my own metrics, or use U and AWT/ART, or both.
Workload structure:
- Run my own workload; mostly an “all users are created equal” assumption (unrealistic)
- No comparisons are made (incompatible workloads)
- No repeatability of results (e.g., background load)
Need a common performance evaluation framework for Grids.
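For reference, the acronyms expand to utilization (U), average wait time (AWT), and average response time (ART). A minimal sketch of the usual definitions, assuming jobs carry submit/start/finish timestamps:

```python
# Common definitions behind U and AWT/ART (a sketch, not a standard's text).
def awt(jobs):
    return sum(j["start"] - j["submit"] for j in jobs) / len(jobs)

def art(jobs):   # response time = wait time + run time
    return sum(j["finish"] - j["submit"] for j in jobs) / len(jobs)

def utilization(jobs, total_cpus, period):
    busy = sum(j["cpus"] * (j["finish"] - j["start"]) for j in jobs)
    return busy / (total_cpus * period)
```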

22 Test Management: The Generic Problem of Analyzing, Testing, and Comparing Grids
Use cases for automatically analyzing, testing, and comparing Grids:
- Comparisons for system design and procurement
- Functionality testing and system tuning
- Performance testing/analysis of grid applications
For grids, this problem is hard!
- Testing in real environments is difficult
- Grids change rapidly
- Validity of tests

23 Test Management: A Generic Solution to Analyzing, Testing, and Comparing Grids
“Generate and run synthetic grid workloads, based on real and synthetic applications.”
Current alternatives (not covering all problems):
- Benchmarking with real/synthetic applications (representative?)
- User-defined test management (statistically sound?)
Advantages of using synthetic grid workloads:
- Statistically sound composition of benchmarks
- Statistically sound test management
- Generic: covers the use cases’ broad spectrum (to be shown)
Solution: build synthetic grid workloads using real and synthetic applications, and run them in real or simulated environments.

24 Grid Performance Evaluation Current Issues
Test management: perform a representative test in a real(istic) Grid.
Performance indicators: what are the metrics for the new environment?
Workload structure: which general aspects are important? which Grid-specific aspects need to be addressed?
Need a common performance evaluation framework for Grids: GrenchMark.

25 GrenchMark: a Framework for Analyzing, Testing, and Comparing Grids
What’s in a name? grid benchmark → working towards a generic tool for the whole community: helping standardize testing procedures. But benchmarks are too early; we use synthetic grid workloads instead.
What’s it about?
- A systematic approach to analyzing, testing, and comparing grid settings, based on synthetic workloads
- A set of metrics and workload units for analyzing grid settings [JSSPP’06]
- A set of representative grid applications, both real and synthetic
- Easy-to-use tools to create synthetic grid workloads
- A flexible, extensible framework

26 GrenchMark Overview: Easy to Generate and Run Synthetic Workloads
[Figure: GrenchMark overview, from workload generation to execution]

27 … but More Complicated Than You Think
Workload structure:
- User-defined and statistical models
- Dynamic job arrivals
- Burstiness and self-similarity
- Feedback, background load
- Machine usage assumptions
- Users, VOs
Metrics:
- A(W) run/wait/response time
- Efficiency, makespan
- Failure rate [!]
(Grid) notions: co-allocation, interactive jobs, malleable, moldable, …
Measurement methods:
- Long workloads
- Saturated / non-saturated system
- Start-up, production, and cool-down scenarios
- Scaling the workload to the system
Applications: synthetic and real.
Workload definition language: base language layer, extended language layer.
Other: the same workload can be used for both simulations and real environments.
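To illustrate the "dynamic job arrivals" and "burstiness" items above: a sketch of two arrival generators, a plain Poisson process and a crude two-rate variant. The models and rates are illustrative choices, not GrenchMark's actual generators.

```python
import random

# Poisson arrivals: exponentially distributed inter-arrival times.
def poisson_arrivals(n_jobs, rate_per_s):
    t, times = 0.0, []
    for _ in range(n_jobs):
        t += random.expovariate(rate_per_s)   # gap until the next job
        times.append(t)
    return times

# Crude burstiness: occasionally switch to a much higher arrival rate.
def bursty_arrivals(n_jobs, base_rate, burst_rate, p_burst=0.2):
    t, times = 0.0, []
    for _ in range(n_jobs):
        rate = burst_rate if random.random() < p_burst else base_rate
        t += random.expovariate(rate)
        times.append(t)
    return times
```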

28 GrenchMark Overview: Unitary and Composite Applications
Unitary applications: sequential, MPI, Java RMI, Ibis, …
Composite applications:
- Bag of tasks
- Chain of jobs
- Directed Acyclic Graph-based (Standard Task Graph Archive)
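A sketch of a composite job as a DAG, with each task mapped to its set of predecessors; bags of tasks (no edges) and chains (a single path) are the two degenerate cases. The task names are invented for illustration.

```python
# A composite job as a DAG: task -> tasks that must finish first.
dag = {
    "prep":    set(),                 # no predecessors: an entry task
    "sim-a":   {"prep"},
    "sim-b":   {"prep"},
    "collect": {"sim-a", "sim-b"},    # runs only after both simulations finish
}

def ready(done):
    """Tasks whose predecessors have all completed."""
    return [t for t, deps in dag.items() if t not in done and deps <= done]

print(ready(set()))        # ['prep']
print(ready({"prep"}))     # ['sim-a', 'sim-b']
```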

29 GrenchMark Overview: Workload Description Files
Format:
- Number of jobs
- Composition and application types
- Co-allocation and number of components
- Inter-arrival and start time
Language extensions: e.g., combining four workloads into one.
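As an illustration of the fields listed above, here is a hypothetical workload description rendered as a Python dict; GrenchMark's real file syntax and field names are not shown on the slide and will differ.

```python
# Hypothetical workload description (field names invented for illustration).
workload = {
    "njobs": 100,                     # number of jobs
    "apptype": "mpi",                 # composition / application type
    "coallocated": True,              # co-allocation on or off
    "ncomponents": 4,                 # number of job components
    "arrival": "poisson(rate=0.1)",   # inter-arrival time model
    "start_at": "+60s",               # start time relative to workload launch
}
```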

30 Performance Indicators
Time-, resource-, and system-related metrics:
- Traditional: utilization, A(W)RT, A(W)WT, A(W)SD
- New: waste, fairness (or service quality reliability)
Workload completion and failure metrics (“in Grids, functionality may be even more important than performance”):
- Workload Completion (WC)
- Task and Enabled Task Completion (TC, ETC)
- System Failure Factor (SFF)
A. Iosup, D.H.J. Epema (TU Delft), C. Franke, A. Papaspyrou, L. Schley, B. Song, R. Yahyapour (U. Dortmund), On modeling synthetic workloads for Grid performance evaluation, JSSPP’06.
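A sketch of plain-reading definitions for the completion metrics; the precise JSSPP'06 definitions may differ, so treat these as assumptions.

```python
# Completion metric sketches (definitions assumed from the metric names).
def workload_completion(jobs):
    """WC: fraction of submitted jobs that finished successfully."""
    return sum(j["ok"] for j in jobs) / len(jobs)

def task_completion(tasks):
    """TC: fraction of all tasks that completed."""
    return sum(t["ok"] for t in tasks) / len(tasks)

def enabled_task_completion(tasks):
    """ETC: TC restricted to tasks whose predecessors finished, so they could run."""
    enabled = [t for t in tasks if t["enabled"]]
    return sum(t["ok"] for t in enabled) / len(enabled)
```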

31 Using GrenchMark: Grid System Analysis
Performance testing: test the performance of an application (for sequential, MPI, and Ibis applications).
- Report runtimes, waiting times, grid middleware overhead
- Automatic results analysis
What-if analysis: evaluate potential situations.
- System change
- Grid inter-operability
- Special situations: spikes in demand

32 Using GrenchMark: Functionality Testing in Grid Environments
System functionality testing: show the ability of the system to run various types of applications.
- Report failure rates [arguably, functionality in grids is even more important than performance! Even a controlled system like the DAS shows a 10% job failure rate]
Periodic system testing: evaluate the current state of the grid.
- Replay workloads

33 Using GrenchMark: Comparing Grid Settings
Single-site vs. co-allocated jobs: compare the success rate of single-site and co-allocated jobs in a system without reservation capabilities.
- Single-site jobs do 20% better vs. small co-allocated jobs (<32 CPUs) and 30% better vs. large co-allocated jobs [setting- and workload-dependent!]
Unitary vs. composite jobs: compare the success rate of unitary and composite jobs, with and without failure-handling mechanisms.
- Both reach 100% with a simple retry mechanism [setting- and workload-dependent!]

34 A GrenchMark Success Story: Releasing the Koala Grid Scheduler on the DAS
Koala: a grid scheduler with co-allocation capabilities. The DAS: the Dutch Grid, ~200 researchers. Initially: Koala, a tested (!) scheduler, in a pre-release version.
Test specifics:
- 3 different job submission modules
- Workloads with different job requirements, inter-arrival rates, co-allocated vs. single-site jobs, …
- Evaluate: job success rate, Koala overhead and bottlenecks
Results:
- 5,000+ jobs successfully run (all workloads); functionality tests
- 2 major bugs found on the first day, 10+ bugs overall (all fixed)
- KOALA is now officially released on the DAS (full credit to the KOALA developers, thanks for testing with GrenchMark)

35 GrenchMark: Iterative Research Roadmap
From a simple functional system to a complex, extensible one, towards a community effort:
- Simple functional system: A. Iosup, J. Maassen, R.V. van Nieuwpoort, D.H.J. Epema, Synthetic Grid Workloads with Ibis, KOALA, and GrenchMark, CoreGRID IW, Nov 2005.
- Complex extensible system: A. Iosup, D.H.J. Epema, GrenchMark: A Framework for Analyzing, Testing, and Comparing Grids, IEEE CCGrid’06, May 2006.
- Open-GrenchMark: a community effort (JSSPP’06, University of Dortmund)

36 Towards Open-GrenchMark: Grid traces, Simulators, Benchmarks
Distributed testing: integrate with DiPerF (C. Dumitrescu, I. Raicu, M. Ripeanu, and M.I. Andreica).
Grid trace analysis: automatic tools for analyzing grid traces.
Use in conjunction with simulators: the ability to generate workloads that can be used in simulated environments (e.g., GangSim, GridSim, …).
Grid benchmarks: analyze the requirements for domain-specific grid benchmarks.
A. Iosup, C. Dumitrescu, D.H.J. Epema (TU Delft), H. Li, L. Wolters (U. Leiden), How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications, IEEE Grid’06.

37 Outline
Koala: Processor and Data Co-Allocation in Grids
- The Co-Allocation Problem in Grids
- The Design of Koala
- Koala and the DAS Community
- The Future of Koala
GrenchMark: Analyzing, Testing, and Comparing Grids
- A Brief Introduction to Grid Computing
- Grid Performance Evaluation Issues
- The GrenchMark Architecture
- Experience with GrenchMark
Take home message

38 Take home message
PDS Group / TU Delft: resource and test management in Grid systems.
Koala: Processor and Data Co-Allocation in Grids [ ]
- Grid scheduling with co-allocation and fault tolerance
- Many placement policies available
- Extensible runners system
- Tutorials, on-line documentation, papers
GrenchMark: Analyzing, Testing, and Comparing Grids [grenchmark.st.ewi.tudelft.nl]
- A generic tool for the whole community
- Generates diverse grid workloads
- Easy to use, flexible, portable, extensible, …
- Tutorials, papers

39 Questions? Remarks? Observations? All welcome!
Thank you! grenchmark.st.ewi.tudelft.nl/

