A Grid Research Toolbox


1 A Grid Research Toolbox
A. Iosup, O. Sonmez, H. Mohamed, S. Anoep, D.H.J. Epema (PDS Group, ST/EWI, TU Delft)
M. Jan (LRI/INRIA Futurs Paris, INRIA)
I. Raicu, C. Dumitrescu, I. Foster (U. Chicago)
H. Li, L. Wolters (LIACS, U. Leiden)
Paris, France

2 Grid Research: Science or Engineering?
When is work in grid computing science?
- Studying systems to uncover their hidden laws
- Designing innovative systems
- Proposing novel algorithms
- Methodological aspects: repeatable experiments to verify and extend hypotheses
When is work in grid computing engineering?
- Showing that the system works in a common case, or in a special case of great importance (e.g., weather prediction)
- When our students can do it (H. Casanova's argument)

3 Grids are far from being reliable job execution environments
So at the moment our students cannot work in grid computing engineering!
- Server: …% reliable; Small Cluster: 99.999% reliable
- Production Cluster: 5x decrease in failure rate after the first year [Schroeder and Gibson, DSN'06]
- CERN LCG jobs: 74.71% successful, 25.29% unsuccessful (source: dboard-gr.cern.ch, May'07)
- DAS-2: >10% of jobs fail [Iosup et al., CCGrid'06]
- TeraGrid: 20-45% failures [Khalili et al., Grid'06]
- Grid3: 27% failures, 5-10 retries [Dumitrescu et al., GCC'05]
Notes: We see each day that a commercially acceptable server needs to be …% reliable. That's one error every … hours! As soon as we increase in size, failures become more frequent. A top cluster manufacturer advertises only "three nines" reliability. We are not unhappy with this value. [goal for grids] It is well known that the reliability of a system is lower at the beginning and towards the end of its life (the famous bath-tub shape). Production clusters have been found to be xx to yy % reliable after passing the ??-year margin [Sch06]. With grids, we seem to still be at the beginning: errors occur very frequently. Recent studies have shown that ... [Ios06, Kha06]
References:
[Ios06] A. Iosup and D.H.J. Epema. GrenchMark: A framework for analyzing, testing, and comparing grids. In CCGRID, pages 313-320. IEEE Computer Society, 2006.
[Kha06] O. Khalili, J. He, et al. Measuring the performance and reliability of production computational grids. In GRID. IEEE Computer Society, 2006.
[Dum05] C. Dumitrescu, I. Raicu, and I.T. Foster. Experiences in running workloads over Grid3. In H. Zhuge and G. Fox, editors, GCC, volume 3795 of LNCS, pages 274-286, 2005.
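To put such availability figures in perspective, here is a minimal sketch converting an availability fraction into expected downtime per year; the levels shown are illustrative only, and the slide's elided server figure stays elided above.

```python
# Convert an availability fraction into expected downtime per year.
# The availability levels below are illustrative, not trace-derived.
HOURS_PER_YEAR = 365.25 * 24

def downtime_hours_per_year(availability: float) -> float:
    """Expected unavailable hours per year at a given availability."""
    return (1.0 - availability) * HOURS_PER_YEAR

for label, a in [("three nines", 0.999), ("five nines", 0.99999)]:
    print(f"{label} ({a}): {downtime_hours_per_year(a):.2f} h/year of downtime")
```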

4 Grid Research Problem: For Science We Are Missing Both Data and Tools
Lack of data:
- Grid infrastructure: number and type of resources, resource availability and failures
- Grid workloads: arrival process, resource consumption
Lack of tools:
- Simulators: SimGrid, GridSim, MicroGrid, GangSim, OptorSim, MONARC, …
- Testing tools that operate in real environments: DiPerF, QUAKE/FAIL-FCI
We have problems to solve in grid computing (as a science)!

5 Research Questions
Q1: How to exchange grid data? (e.g., Grid * Archive)
Q2: What are the characteristics of grids? (e.g., infrastructure, workload)
Q3: How to test and evaluate grids?

6 Outline
- Introduction and Motivation
- Meeting ALEAE Goals: A Grid Research Toolbox
  - Q1: Exchange Grid Data (The Grid Workloads Archive)
  - Q2: Grid Characteristics (Grid Workloads, Grid Infrastructure)
  - Q3: Grids Testing and Evaluation (not in this session)

7 ALEAE/AMOMI Goals
- Provide models and algorithmic solutions in the field of resource management that cope with uncertainties in large-scale distributed systems.
- Use experimental research to validate the proposed models, and evaluate the algorithms using simulation or large-scale environments such as Grid'5000, in order to improve both models and algorithms.

8 A Grid Research Toolbox
Hypothesis: (a) is better than (b). For scenario 1, …
[Diagram: the toolbox components, labeled 1-3, including DGSim]

9 Outline
- Introduction and Motivation
- Meeting ALEAE Goals: A Grid Research Toolbox
  - Q1: Exchange Grid Data (The Grid Workloads Archive)
  - Q2: Grid Characteristics (Grid Workloads, Grid Infrastructure)
  - Q3: Grids Testing and Evaluation (not in this session)

10 3.1. The Grid Workloads Archive [1/7] Motivation and Goals
Motivation: little is known about real grid use
- No grid workloads available (except "my grid")
- No standard way to share them
The Grid Workloads Archive: making it easy to share grid workload traces and the research associated with them
- Understand how real grids are used
- Address the challenges facing grid resource management (both research and practice)
- Develop and test grid resource management solutions
- Perform realistic simulations
A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D. Epema, The Grid Workloads Archive, FGCS 24, 672-686, 2008.

11 3.1. The Grid Workloads Archive [2/7] Content
6 traces online (typical trace: 1.5 yrs, >750K jobs, >250 users)
A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D. Epema, The Grid Workloads Archive, FGCS 24, 672-686, 2008.

12 3.1. The Grid Workloads Archive [3/7] Presentation
[Screenshot annotations: "Many jobs", "Low utiliz.", "Signature", "Used?", "Compare with others", "More detailed information"]
Workload signature: a simple six-category description
Easy to see which traces are fit/unfit for your experiment
A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D. Epema, The Grid Workloads Archive, FGCS 24, 672-686, 2008.

13 3.1. The Grid Workloads Archive [4/7] Approach
- Standard data format (GWF): share traces with the community; use extensions for specific modeling aspects; text-based, easy to parse for custom tasks
- Additional SQL-compatible data format (GWF-SQLite)
- Automated trace analysis: provide ready-to-use tools to the community; promote results availability and comparability
- Automated trace ranking: help non-experts with their trace selection process
A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D. Epema, The Grid Workloads Archive, FGCS 24, 672-686, 2008.

14 3.1. The Grid Workloads Archive [5/7] Approach: GWF Example
[GWF trace excerpt; columns include SubmitTime, WaitTime [s], RunTime, used #CPUs, used Mem [KB], and requested #CPUs]
A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D. Epema, The Grid Workloads Archive, FGCS 24, 672-686, 2008.

15 3.1. The Grid Workloads Archive [6/7] Approach: Standard Data Format (GWF)
Goals:
- Provide a unitary format for grid workloads;
- Same format in plain text and relational DB (SQLite/SQL92);
- To ease adoption, base it on the Standard Workload Format (SWF) of the Parallel Workloads Archive.
Existing (inherited from SWF):
- Identification data: Job/User/Group/Application ID
- Time and status: Submit/Start/Finish Time, Job Status and Exit code
- Request vs. consumption: CPU/Wallclock/Mem
Added:
- Job submission site
- Job structure: bags-of-tasks, workflows
- Extensions: co-allocation, reservations, others possible
A. Iosup, H. Li, M. Jan, S. Anoep, C. Dumitrescu, L. Wolters, D. Epema, The Grid Workloads Archive, FGCS 24, 672-686, 2008.
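To make the format concrete, a minimal parsing sketch follows. The column order, the field subset, and the comment convention are assumptions for illustration (borrowed from the SWF heritage mentioned above); the authoritative layout is in the GWA format specification.

```python
# A minimal sketch of reading a text GWF trace. The column positions
# below are hypothetical; SWF-style traces use ';' for comment lines.
from dataclasses import dataclass

@dataclass
class GwfJob:
    job_id: int
    submit_time: int  # seconds since trace start
    wait_time: int    # seconds
    run_time: int     # seconds
    used_cpus: int
    user_id: int
    site_id: int      # job submission site (a GWF addition over SWF)

def parse_gwf_line(line: str) -> GwfJob:
    """Parse one whitespace-separated GWF record."""
    fields = line.split()
    return GwfJob(*(int(fields[i]) for i in range(7)))

def load_trace(path: str):
    with open(path) as fh:
        return [parse_gwf_line(l) for l in fh
                if l.strip() and not l.lstrip().startswith(";")]
```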

16 3.1. The Grid Workloads Archive [7/7] Approach: Automated Trace Analysis
- General information
- System-wide characteristics: utilization, job arrival rate
- Job characteristics: parallel vs. sequential jobs
- User and group characteristics: analysis for Top-10 users, analysis for Top-10 groups
- Performance: number of running/waiting jobs; throughput, number of completed jobs
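As one concrete instance of this automated analysis, here is a minimal sketch of computing average system utilization, reusing the hypothetical GwfJob records from the parsing sketch above (the system's CPU count is an assumed input):

```python
def average_utilization(jobs, total_cpus: int) -> float:
    """Delivered CPU time divided by CPU time available over the trace span."""
    starts = [j.submit_time + j.wait_time for j in jobs]
    span = max(s + j.run_time for s, j in zip(starts, jobs)) - min(starts)
    delivered = sum(j.run_time * j.used_cpus for j in jobs)
    return delivered / (span * total_cpus)
```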

17 (End of Part I) Take Home Message
There is a need to make grid traces available:
- to drive the design of grid middleware components for resource management
- to test and benchmark (new) grid middleware components
- to use as input for realistic simulation studies
The Grid Workloads Archive:
- Long-term grid traces, easy to select
- Standard grid workload format (GWF), easy to use
- Tools for automatic trace analysis
Add your traces!


19 Outline
- Introduction and Motivation
- Meeting ALEAE Goals: A Grid Research Toolbox
  - Q1: Exchange Grid Data (The Grid Workloads Archive)
  - Q2: Grid Characteristics (Grid Workloads, Grid Infrastructure)
  - Q3: Grids Testing and Evaluation (not in this session)

20 What do we expect from Grids?
Note: "Now" is June 2006.

21 What do we expect from grids? (1/3)
Goal: Grid systems will be dynamic, heterogeneous, very large-scale (world-wide), and have no central administration
The "all users are interested" assumption: grids support many VOs and many users, and they all use the grid intensively
The "0-day learning curve" assumption: a grid must support many different projects in a way that facilitates an immediate jump to productivity
The theory sounds good, but how about the practice? Let's look at real grid traces!

22 What do we expect from grids? (2/3) Grids vs. Parallel Production Envs.
The "HPC successor" assumption: the grid's typical user comes from the HPC world, and the typical workload can be described with well-known models of parallel production environment workloads
- We will have from the beginning lots of parallel jobs
- We will have from the beginning lots of power-of-2 sized jobs
- We will handle from the beginning tightly-coupled parallel jobs
The "infinite power" assumption: the grid can deliver any computing power demand, and in particular will offer at least computing power comparable to traditional parallel production environments
Promising in theory, but how about in practice? Let's look at real grid traces!

23 What do we expect from grids? (3/3) Grid Computing Works!
The "Power of Sums is Greater Than the Sum of Powers" assumption: by coupling individual environments, the grid will provide better performance than using the individual environments alone
- Higher throughput
- Shorter wait times
- Shorter response times
- Same or lower job slowdown
The "there is a need for the Power of Sums" assumption: institutions have the will and the need to couple their environments into a larger, grid-based environment
I'm sure it's true, but what happens in practice? Let's look at real grid traces!

24 The Analysis

25 The Grid Traces: LCG, Grid3, TeraGrid, the DAS
Production grids: LCG, Grid3, TeraGrid
Academic grid: the DAS (two overlapping traces: local, and shared as grid platform)
Features: long traces (6+ months), active environments (500K+ jobs per trace, 100s of users)
Note: the '*' sign marks restrictions due to data (un)availability; we have had data for only one site (RAL) for the CERN trace, only one VO/group (US ATLAS) for the Grid3 trace, and only one site (Catalin???) for the TeraGrid trace.

26 Trace Analysis: System-Wide Characteristics
System utilization is on average 60-80% for production grids, and <20% for academic grids
Average job size is 1 (that is, there are no [!] tightly-coupled jobs, only conveniently parallel ones)

27 Trace Analysis: VO, Group, and User Characteristics
Top 2-5 groups/users dominate the workload
Top groups/users are constant submitters
The week's top group/user is not always the same
Notes: The graph shows stacked histograms of the number of jobs/CPU time per group, over time. Besides the obvious "top 2-5 dominate", the graph shows that: (a) the top groups/users are constant submitters (almost no stack brick/layer, i.e., the number of jobs/CPU time for the week represented by the brick, is empty); (b) the top group/user for a specific week is not always the same.

28 Trace Analysis: Evolution of Grid Environments
Everything evolves! The infrastructure, the projects, and the users
Similar submission patterns: typical stages, automated-tool learning
Notes: We knew systems are supposed to evolve, but we show evidence that it really happens. We show sample results for the change of rate for grids, grid projects, and grid users. We identify several typical stages in this process, such as the user's process of learning towards producing an automated tool, or the project's production and cool-down phases (maybe even a warm-up phase, but it is short in our example). For the system (see paper), we observe that users have learning curves which leave the system unused in the beginning; it then steadily gets more use, until the utilization stabilizes.

29 Answers for “How are Real Grids Used?”

30 What did we expect from grids? (1/3)
The "all users are interested" assumption: grids support many VOs and many users, and they all use the grid intensively
The "0-day learning curve" assumption: a grid must support many different projects in a way that facilitates an immediate jump to productivity

31 What did we expect from grids? (1/3) … and what have we observed
The "all users are interested" assumption: grids support many VOs and many users, and they all use the grid intensively (top 2-5 groups/users dominate the workload)
The "0-day learning curve" assumption: a grid must support many different projects in a way that facilitates an immediate jump to productivity

32 What did we expect from grids? (1/3) … and what have we observed
The "all users are interested" assumption: grids support many VOs and many users, and they all use the grid intensively (top 2-5 groups/users dominate the workload)
The "0-day learning curve" assumption: a grid must support many different projects in a way that facilitates an immediate jump to productivity (learning curves of up to 60 days, 100 days for an automated tool development)

33 What did we expect from grids? (2/3) Grids vs. Parallel Production Envs.
The "HPC successor" assumption: the grid's typical user comes from the HPC world, and the typical workload can be described with well-known models of parallel production environment workloads
- We will have from the beginning lots of parallel jobs
- We will have from the beginning lots of power-of-2 sized jobs
- We will handle from the beginning tightly-coupled parallel jobs
The "infinite power" assumption: the grid can deliver any computing power demand, and in particular will offer at least computing power comparable to traditional parallel production environments

34 Grids vs. Parallel Production Envs. … and what have we observed
The "HPC successor" assumption: the grid's typical user comes from the HPC world, and the typical workload can be described with well-known models of parallel production environment workloads
- No parallel jobs
- No power-of-2 sized jobs
- No tightly-coupled parallel jobs
The "infinite power" assumption: the grid can deliver any computing power demand, and in particular will offer at least computing power comparable to traditional parallel production environments

35 Grids vs. Parallel Production Envs. … and what have we observed
The "HPC successor" assumption: the grid's typical user comes from the HPC world, and the typical workload can be described with well-known models of parallel production environment workloads
- No parallel jobs
- No power-of-2 sized jobs
- No tightly-coupled parallel jobs
The "infinite power" assumption: the grid can deliver any computing power demand, and in particular offers at least computing power comparable to traditional parallel production environments

36 Grid Computing @ Work … and what have we observed
The "Power of Sums is Greater Than the Sum of Powers" assumption: by coupling individual environments, the grid will provide better performance than using the individual environments alone
- Higher throughput
- Shorter wait times
- Shorter response times
- Same or lower job slowdown
The "there is a need for the Power of Sums" assumption: institutions have the will and the need to couple their environments into a larger, grid-based environment

37 Grid Computing @ Work … and what have we observed
The "Power of Sums is Greater Than the Sum of Powers" assumption: by coupling individual environments, the grid will provide better performance than using the individual environments alone
- Higher throughput (as predicted)
- Reality: much higher wait times (NOT as predicted)
- Shorter response times
- Reality: for short jobs, much higher job slowdown (NOT as predicted)
The "there is a need for the Power of Sums" assumption: institutions have the will and the need to couple their environments into a larger, grid-based environment

38 Grid Computing @ Work … and what have we observed
The "Power of Sums is Greater Than the Sum of Powers" assumption: by coupling individual environments, the grid will provide better performance than using the individual environments alone
- Higher throughput (as predicted)
- Reality: much higher wait times (NOT as predicted)
- Shorter response times
- Reality: for short jobs, much higher job slowdown (NOT as predicted)
The "there is a need for the Power of Sums" assumption: (there already is a large community using grids!)

39 4.1. Grid Workloads [1/4] Analysis
Summary: grid workloads differ, e.g., from those of parallel production environments
Traces: LCG, Grid3, TeraGrid, and DAS; long traces (6+ months), active environments (500K+ jobs per trace, 100s of users), >4 million jobs in total
Analysis: system-wide, VO, group, and user characteristics; environment and user evolution; system performance
Selected findings:
- Almost no parallel jobs
- Top 2-5 groups/users dominate the workloads
- Performance problems: high job wait times, high failure rates
A. Iosup, C. Dumitrescu, D.H.J. Epema, H. Li, L. Wolters, How are Real Grids Used? The Analysis of Four Grid Traces and Its Implications, Grid 2006.

40 4.1. Grid Workloads [2/4] Analysis Summary: Grids vs. Parallel Production Systems
Similar CPUTime/year, 5x larger arrival bursts; LCG cluster daily peak: 22.5k jobs
[Figure: arrival patterns, Grids vs. Parallel Production Environments (large clusters, supercomputers)]
A. Iosup, D.H.J. Epema, C. Franke, A. Papaspyrou, L. Schley, B. Song, R. Yahyapour, On Grid Performance Evaluation using Synthetic Workloads, JSSPP'06.

41 4.1. Grid Workloads [3/4] More Analysis: Special Workload Components
Bags-of-Tasks (BoTs): a BoT is a set of jobs that start at most Δ seconds after the first job. A Parameter Sweep Application is a BoT in which all jobs run the same binary.
Workflows (WFs): a WF is a set of jobs with precedence constraints (think Directed Acyclic Graph).
[Figure: BoT and WF structure over time, time in units]
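A minimal sketch of the Δ-based BoT identification defined above, reusing the hypothetical GwfJob records from the GWF sketch earlier; the per-user grouping is an assumption here, and everything else is illustrative:

```python
DELTA = 120  # seconds; the batch analysis on the next slide uses Delta <= 120 s

def group_bots(jobs, delta=DELTA):
    """Partition each user's jobs into bags-of-tasks using the Delta rule."""
    bots = []
    for user in {j.user_id for j in jobs}:
        batch = []
        for job in sorted((j for j in jobs if j.user_id == user),
                          key=lambda j: j.submit_time):
            # A job opens a new BoT if it arrives more than delta seconds
            # after the first job of the current BoT.
            if batch and job.submit_time - batch[0].submit_time > delta:
                bots.append(batch)
                batch = []
            batch.append(job)
        if batch:
            bots.append(batch)
    return bots
```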

42 4.1. Grid Workloads [3/4] BoTs are predominant in grids
Selected findings:
- Batches are predominant in grid workloads: up to 96% of the CPU time
- Average batch size (Δ≤120s) is … (500 max)
- 75% of the batches are sized 20 jobs or less

            Grid'5000    NorduGrid     GLOW (Condor)
Submissions 26k          50k           13k
Jobs        808k (951k)  738k (781k)   205k (216k)
CPU time    193y (651y)  2192y (2443y) 53y (55y)

A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, The Characteristics and Performance of Groups of Jobs in Grids, Euro-Par 2007, LNCS vol. 4641, 2007.

43 4.1. Grid Workloads [3/4] Workflows exist, but they seem small
Selected findings (from the traces):
- Loose coupling
- Graphs with 3-4 levels
- Average WF size is 30/44 jobs
- 75%+ of WFs are sized 40 jobs or less; 95% are sized 200 jobs or less
S. Ostermann, A. Iosup, R. Prodan, D.H.J. Epema, and T. Fahringer, On the Characteristics of Grid Workflows, CoreGRID Integrated Research in Grid Computing (CGIW), 2008.

44 4.1. Grid Workloads [4/4] Modeling Grid Workloads: Feitelson adapted
Adapted to grids: the percentage of parallel jobs and other parameter values. Validated with 4 grid and 7 parallel production environment traces.
A. Iosup, D.H.J. Epema, T. Tannenbaum, M. Farrellee, and M. Livny, Inter-Operating Grids Through Delegated MatchMaking, ACM/IEEE Conference on High Performance Networking and Computing (SC), 2007.

45 4.1. Grid Workloads [4/4] Modeling Grid Workloads: adding users, BoTs
- Single arrival process for both BoTs and parallel jobs
- Reduce the over-fitting and complexity of "Feitelson adapted" by removing the RunTime-Parallelism correlated model
- Validated with 7 grid workloads
A. Iosup, O. Sonmez, S. Anoep, and D.H.J. Epema, The Performance of Bags-of-Tasks in Large-Scale Distributed Systems, HPDC, 2008.
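For flavor, a minimal sketch of such a generator: one arrival process emits BoTs whose sizes and task runtimes are sampled from distributions. The distribution choices and all parameter values below are illustrative placeholders, not the fitted models of the cited HPDC'08 paper.

```python
# A minimal sketch of a BoT workload generator with a single arrival process.
import random

def synthetic_workload(n_bots=100, mean_iat=60.0):
    """Return (arrival_time, bot_id, task_id, runtime_s) tuples."""
    t, tasks = 0.0, []
    for bot_id in range(n_bots):
        t += random.expovariate(1.0 / mean_iat)      # one arrival process
        size = 1 + int(random.expovariate(1.0 / 4))  # BoT size (placeholder)
        for task_id in range(size):
            runtime = random.lognormvariate(5.0, 1.5)  # seconds (placeholder)
            tasks.append((t, bot_id, task_id, runtime))
    return tasks
```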

46 Outline
- Introduction and Motivation
- Meeting ALEAE Goals: A Grid Research Toolbox
  - Q1: Exchange Grid Data (The Grid Workloads Archive)
  - Q2: Grid Characteristics (Grid Workloads, Grid Infrastructure)
  - Q3: Grids Testing and Evaluation (not in this session)

47 4.2. Grid Infrastructure [1/3] Existing resource models and data
Compute resources:
- Commodity clusters [Kee et al., SC'04]
- Desktop grid resource availability [Kondo et al., FGCS'07]
Network resources:
- Structural generators: GT-ITM [Zegura et al., 1997]
- Degree-based generators: BRITE [Medina et al., 2001]
Storage resources, other resources: ?
Static! Resource dynamics, evolution, … NOT considered
Source: H. Casanova

48 4.2. Grid Infrastructure [2/3] Resource dynamics in cluster-based grids
Environment: Grid'5000 traces
- job traces, 05/ /2006 (30 mo., 950K jobs)
- resource availability traces, 05/ /2006 (18 mo., 600K events)
Resource availability model for multi-cluster grids
Grid-level availability: 70%
A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, On the Dynamic Resource Availability in Grids, Grid 2007, Sep 2007.

49 4.2. Grid Infrastructure [2/3] Correlated Failures
Definition: a correlated failure is a maximal set of failures, ordered by increasing event time, with time parameter Δ, in which for any two successive failures E and F, ts(F) - ts(E) ≤ Δ, where ts(·) returns the timestamp of the event; Δ = … s.
Size of correlated failures (grid-level view): range 1-339, average 11
Cluster span: range 1-3, average 1.06; failures "stay" within their cluster
[Figure: average CDF of the size of correlated failures, grid-level view]
A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, On the Dynamic Resource Availability in Grids, Grid 2007, Sep 2007.
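A minimal sketch of this grouping rule, reading "successive failures" as consecutive events in time order; the Δ value (elided on the slide) and the sample events below are placeholders:

```python
# Group failure events into correlated-failure sets: scan events in time
# order and close the current group whenever the gap between two
# successive failures exceeds delta.
def correlated_failure_groups(events, delta):
    """events: iterable of (timestamp, node_id); returns lists of events."""
    groups, current = [], []
    for ev in sorted(events):
        if current and ev[0] - current[-1][0] > delta:
            groups.append(current)
            current = []
        current.append(ev)
    if current:
        groups.append(current)
    return groups

sample = [(0.0, "n1"), (5.0, "n2"), (500.0, "n3")]
sizes = [len(g) for g in correlated_failure_groups(sample, delta=60.0)]
print(sizes)  # [2, 1]: the first two failures form one correlated set
```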

50 4.2. Grid Infrastructure [2/3] Dynamics Model
Model components: MTBF, MTTR, correlation (see article for per-cluster parameter values)
- Assume no correlation of failure occurrence between clusters
- Which site/cluster? f_s, the fraction of failures at cluster s
- Weibull distribution for failure inter-arrival time (IAT); shape parameter > 1 means an increasing hazard rate: the longer a node is online, the higher the chance that it will fail
A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, On the Dynamic Resource Availability in Grids, Grid 2007, Sep 2007.
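A minimal sketch of sampling from this dynamics model: draw a Weibull inter-arrival time (shape > 1), then pick the affected cluster from the per-cluster failure fractions f_s. All parameter values below are placeholders; the fitted per-cluster values are in the cited Grid'07 paper.

```python
import random

CLUSTER_FRACTIONS = {"c1": 0.5, "c2": 0.3, "c3": 0.2}  # f_s (placeholders)
SHAPE, SCALE = 1.5, 3600.0  # Weibull shape > 1; scale in seconds (placeholders)

def next_failure():
    """Sample one failure event: (inter-arrival time [s], affected cluster)."""
    iat = random.weibullvariate(SCALE, SHAPE)
    cluster = random.choices(list(CLUSTER_FRACTIONS),
                             weights=list(CLUSTER_FRACTIONS.values()))[0]
    return iat, cluster
```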

51 4.2. Grid Infrastructure [3/3] Evolution Model
A. Iosup, O. Sonmez, and D. Epema, DGSim: Comparing Grid Resource Management Architectures through Trace-Based Simulation, Euro-Par 2008.

52 Q1, Q2: What are the characteristics of grids (e.g., infrastructure, workload)?
Grid workloads are very different from those of other systems, e.g., parallel production environments (large clusters, supercomputers):
- Batches of jobs are predominant [Euro-Par'07, HPDC'08]
- Almost no parallel jobs [Grid'06]
- Workload model [SC'07, HPDC'08]
Grid resources are not static:
- Resource dynamics model [Grid'07]
- Resource evolution model [EuroPar'08]
The Grid Workloads Archive: easy to share grid workload traces and the research associated with them


54 Outline
- Introduction and Motivation
- Meeting ALEAE Goals: A Grid Research Toolbox
  - Q1: Exchange Grid Data (The Grid Workloads Archive)
  - Q2: Grid Characteristics (Grid Workloads, Grid Infrastructure)
  - Q3: Grids Testing and Evaluation

55 4.1. GrenchMark: Introduction and Motivation A Layered View of Grids
Layer 1: Hardware + OS (automated; simple)
Layers 2-4: Grid Middleware Stack (automated/user control; simple to complex)
- Low level: file transfers, local resource allocation, etc.
- High level: grid scheduling
- Very high level: application environments (e.g., distributed objects)
Layer 5: Grid Applications (user control)
[Diagram, top to bottom: Grid Applications | Grid Very High Level MW | Grid High Level MW | Grid Low Level MW | HW + OS; layers 2-4 form the Grid MW Stack]

56 4.1. GrenchMark: Introduction and Motivation Complexity => More Testing Required (Grid reality worse than theory)
NMI Build-and-Test Environment at U. Wisc.-Madison: 112 hosts, >40 platforms (e.g., X86-32/Solaris/5, X86-64/RH/9)
Serves >50 grid middleware packages: Condor, Globus, VDT, gLite, GridFTP, RLS, NWS, INCA(-2), APST, NINF-G, BOINC, …
Two years of functionality tests ('04-'06): over 1 in 3 runs has at least one failure! Test or perish!
In today's grids, reliability is more important than performance!
A. Iosup, D.H.J. Epema, P. Couvares, A. Karp, M. Livny, Build-and-Test Workloads for Grid Middleware: Problem, Analysis, and Applications, CCGrid, 2007.

57 4.1. GrenchMark: Testing in LSDCSs Analyzing, Testing, and Comparing Grids
Use cases for automatically analyzing, testing, and comparing grids (or grid middleware):
- Functionality testing and system tuning
- Performance testing/analysis of grid applications
- Reliability testing of grid middleware
For grids, this problem is hard!
- Testing in real environments is difficult
- Grids change rapidly
- Validity of tests

58 4.1. GrenchMark: Testing in LSDCSs Requirements for Grids Testing
- Representative workload generation
- Test reproducibility: reproduce conditions; record and store provenance data
- High performance: workload generation and submission; test results management (towards OLAP)
- Extensibility, automation, ease of use: extend the testing system for new environments; single-click testing; templates for common testing scenarios

59 4.1. GrenchMark: Testing in LSDCSs GrenchMark: a Framework for Analyzing, Testing, and Comparing Grids
What's in a name? grid benchmark → working towards a generic tool for the whole community: help standardize testing procedures; but benchmarks are too early, so we use synthetic grid workloads instead.
What's it about?
- A systematic approach to analyzing, testing, and comparing grid settings, based on synthetic workloads
- A set of metrics for analyzing grid settings
- A set of representative grid applications, both real and synthetic
- Easy-to-use tools to create synthetic grid workloads
- A flexible, extensible framework
GrenchMark evolved from a grid-specific testing framework into a framework for testing large-scale distributed computing systems.

60 4.1. GrenchMark: Testing in LSDCSs Architecture Overview

61 4.1. GrenchMark: Testing in LSDCSs Experiments: Testing Design Adequacy
System functionality testing: show the ability of the system to run various types of applications; report failure rates [10% job failure rate in a controlled system like the DAS!]
Periodic system testing: evaluate the current state of the grid; replay workloads

62 4.1. GrenchMark: Testing in LSDCSs Experiments: Testing Performance
Testing application performance: test the performance of an application (for sequential, MPI, and Ibis applications)
- Report runtimes, waiting times, grid middleware overhead
- Automatic results analysis
What-if analysis: evaluate potential situations
- System change
- Grid inter-operability
- Special situations: spikes in demand

63 4.1. GrenchMark: Testing in LSDCSs Testing a Large-Scale Environment (1/2)
Testing a 1,500-processor Condor environment
- Workloads of 1,000 jobs, grouped by 2, 10, 20, 50, 100, 200
- Each test finishes 1h after the last submission
Results:
- >150,000 jobs submitted
- >100,000 jobs successfully run; >2 yr of CPU time in 1 week
- 5% of jobs failed (much less than other grids' average)
- 25% of jobs did not start in time and were cancelled

64 4.1. GrenchMark: Testing in LSDCSs Testing a Large-Scale Environment (2/2)
Performance metrics at the system, job, operational, application, and service levels

65 4.1. GrenchMark: Testing in LSDCSs Releasing the Koala Grid Scheduler on the DAS
Koala: a grid scheduler with co-allocation capabilities; the DAS: the Dutch grid, ~200 researchers. Initially a tested (!) pre-release version of Koala.
Test specifics:
- 3 different job submission modules
- Workloads with different job requirements, inter-arrival rates, co-allocated vs. single-site jobs, …
- Evaluate: job success rate, Koala overhead and bottlenecks
Results:
- 5,000+ jobs successfully run (all workloads); functionality tests
- 2 major bugs found the first day, 10+ bugs overall (all fixed)
KOALA is now officially released on the DAS (full credit to the KOALA developers; thanks for testing with GrenchMark)
A. Iosup, J. Maassen, R.V. van Nieuwpoort, D.H.J. Epema, Synthetic Grid Workloads with Ibis, KOALA, and GrenchMark, CoreGRID IW, Nov 2005.

66 4.1. GrenchMark: Testing in LSDCSs Validating the LittleBird Peer-to-Peer Protocol
LittleBird, part of Tribler:
- Tribler: a peer-to-peer file-sharing system, >100,000 downloads/year
- LittleBird: an epidemic protocol that exchanges swarm information between peers that are currently or were recently members of a swarm
Test specifics:
- Large-scale test environment based on GrenchMark
- Workload: peer arrivals/departures
- Evaluate: job success rate, overhead and bottlenecks
Results (by Jelle Roozenburg):
- Functionality proof
- Protocol is scalable, resilient against DoS and pollution attacks
LittleBird is currently being integrated with Tribler.

67 4.1. GrenchMark: Testing in LSDCSs Current Work: ServMark
Blending DiPerF and GrenchMark (DiPerF + GrenchMark = ServMark). Tackles two orthogonal issues:
- Multi-sourced testing (multi-user scenarios, scalability)
- Generating and running dynamic test workloads with complex structure (real-world scenarios, flexibility)
Adds:
- Coordination and automation layers
- A fault-tolerance module

68 4.2. DGSim: Simulating Multi-Cluster Grids Goal and Challenges
Simulate various grid resource management architectures:
- Multi-cluster grids
- Grids of grids (THE grid)
Challenges:
- Many types of architectures
- Generating and replaying grid workloads
- Management of the simulations: many repetitions of a simulation for statistical relevance; simulations with many parameters; managing results (e.g., analysis tools); enabling collaborative experiments
[Figure: two GRM architectures]

69 4.2. DGSim: Simulating Multi-Cluster Grids Overview
We envision four main roles for the DGSim community member. The Experiment Designer creates experiment-specific configuration files for DGSim. The Simulator Developer develops simulation modules for new scenarios. The Quality Assurance (QA) Team Member certifies the correctness of the results. The Investigator analyzes the experiment results.
The Experiment Manager toolbox controls the execution of the experiments.
The Workload Generation toolbox can load existing workloads or generate new workloads based on a given model. The workload model parameters may be specified by the Experiment Designer, extracted automatically from existing traces, or iteratively computed until the generated workload has the characteristics specified by the Experiment Designer.
The Simulation toolbox, built on a discrete-event simulator, can simulate the experiment scenarios and store the results in a structure that facilitates further processing.
The Data Warehousing toolbox processes the results of the experiments into usable data for both generic and experiment-specific purposes.
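Since the Simulation toolbox rests on a discrete-event simulator, here is a generic minimal sketch of such a core (an illustration of the technique, not DGSim code): a time-ordered event queue whose handlers may schedule further events.

```python
import heapq
import itertools

class DiscreteEventSimulator:
    def __init__(self):
        self._queue = []
        self._counter = itertools.count()  # tie-breaker for equal times
        self.now = 0.0

    def schedule(self, delay, handler, *args):
        """Enqueue handler(*args) to fire `delay` time units from now."""
        heapq.heappush(self._queue,
                       (self.now + delay, next(self._counter), handler, args))

    def run(self, until=float("inf")):
        """Process events in time order until the queue drains or `until`."""
        while self._queue and self._queue[0][0] <= until:
            self.now, _, handler, args = heapq.heappop(self._queue)
            handler(*args)

sim = DiscreteEventSimulator()
sim.schedule(10.0, lambda: print(f"job arrives at t={sim.now}"))
sim.run()
```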

70 4.2. DGSim: Simulating Multi-Cluster Grids Simulated Architectures (Sep 2007)
DGSim currently simulates six grid resource management architectures: independent clusters operated with job push (sep-c), independent clusters operated with matchmaking (condor), centralized meta-scheduling with job pull (cern), centralized meta-scheduling with job push (koala), federated clusters with matchmaking (fcondor), and inter-operated grids with delegated matchmaking (dmm).
[Diagram labels: independent, centralized, hierarchical, hybrid hierarchical/decentralized, decentralized]

71 4.2. DGSim: Simulating Multi-Cluster Grids Sample Use [1/2]
Investigate mechanisms for inter-operating grids:
- New mechanism: DMM
- Trace-based performance evaluation through simulations (real and model-based traces)
- Simulate Grid'5000 + DAS-2
Result: DMM > Condor flocking >> centralized grid schedulers >> independent clusters
A. Iosup, D.H.J. Epema, T. Tannenbaum, M. Farrellee, M. Livny, Inter-Operating Grids through Delegated MatchMaking, SC, 2007.

72 4.2. DGSim: Simulating Multi-Cluster Grids Sample Use [2/2]
What is the performance impact of dynamic grid resource availability?
Four models for grid resource availability information:

Model | Resource availability | Availability information delay
SA    | Static                | (none)
KA    | Dynamic               | On-time (0)
AMA   | Dynamic               | Short period
HMA   | Dynamic               | Long period

- Trace-based performance evaluation through simulations (real traces; simulate Grid'5000)
- Result: KA = AMA > HMA >> SA; goodput decreases with intervention delay
[Figure: Avg. Norm. Goodput [cpuseconds/day/proc] per model: SA, KA, AMA 60s, AMA 1h, HMA 1w, HMA 1mo, HMA Never]
A. Iosup, M. Jan, O. Sonmez, and D.H.J. Epema, On the Dynamic Resource Availability in Grids, Grid 2007, 2007.

73 4.2. DGSim: Simulating Multi-Cluster Grids Current Work
- More grid architectures: peer-to-peer
- Scheduling policies: resource-level; user/application-level
- First release

74 Q3: How to test and evaluate grids?
GrenchMark: testing large-scale distributed systems
- Framework for testing in real environments: performance, reliability, functionality
- Uniform process: metrics, workloads
- Real tool available: grenchmark.st.ewi.tudelft.nl, dev.globus.org/wiki/Incubator/ServMark
DGSim: simulating multi-cluster grids
- Many types of architectures
- Generating and replaying grid workloads
- Management of the simulations

75 Outline
- Introduction and Motivation
- A Grid Research Toolbox
- Grid Characteristics
- Grids Testing and Evaluation
- Collaboration Opportunities

76 Collaboration Opportunities
Understanding how real grids work: modeling aspects of grid infrastructure and workloads
The Grid Workloads Archive (easy to share grid workload traces and the research associated with them):
- Traces from the Austrian Grid added to the GWA
- Using traces from the GWA
GrenchMark (testing large-scale distributed systems):
- Testing in the Austrian Grid
- Evaluating the performance of Askalon using GrenchMark/ServMark
DGSim (simulating multi-cluster grids):
- New grid resource management models and algorithms
dev.globus.org/wiki/Incubator/ServMark
grenchmark.st.ewi.tudelft.nl

