June 3, 2015
ServMark: A Hierarchical Architecture for Testing Grids
Santiago, Chile
A. Iosup, H. Mohamed, D.H.J. Epema (PDS Group, ST/EWI, TU Delft), et al.


ServMark: A Hierarchical Architecture for Testing Grids
A. Iosup, H. Mohamed, D.H.J. Epema (PDS Group, ST/EWI, TU Delft), C. Dumitrescu (TU Delft/U. Muenster), I. Raicu, I. Foster (U. Chicago), M. Ripeanu (UBC), C. Stratan, M. Andreica, N. Tapus (UPB)

1. Introduction and Motivation
1.1. A Layered View of Grids (1/2)
- Layer 1: Hardware + OS (automated; simple)
- Layers 2-4: Grid Middleware Stack (automated/user control; simple to complex)
  - Low Level: file transfers, local resource allocation, etc.
  - High Level: grid scheduling
  - Very High Level: application environments (e.g., distributed objects)
- Layer 5: Grid Applications (user control; simple to complex)
(Diagram: HW + OS; Grid Low Level MW; Grid High Level MW; Grid Very High Level MW; Grid Applications. The middle three layers form the Grid MW Stack.)

1. Introduction and Motivation
1.1. A Layered View of Grids (2/2)
- Unitary applications
  - Single processor, single node
  - Parallel/distributed: MPI, Java RMI, Ibis, ProActive, ...
- Composite applications
  - Bag of tasks
  - Chain of jobs
  - Directed Acyclic Graph-based
  - Workflows

1. Introduction and Motivation
1.2. Complexity => More Testing Required (theory)
- Small Cluster: ...% reliable
- Production Cluster: 5x decrease in failure rate after first year [Sch06]
- DAS: >10% jobs fail [Ios06]
- Server: ...% reliable
- TeraGrid: 20-45% failures [Kha06]
- Grid3: 27% failures, 5-10 retries [Dum05]
Today's grids are complex and error-prone!

1. Introduction and Motivation
1.2. Complexity => More Testing Required (grid reality worse than theory)
- NMI Build-and-Test Environment at U. Wisconsin-Madison: 112 hosts, >40 platforms (e.g., X86-32/Solaris/5, X86-64/RH/9)
- Serves >50 grid middleware packages: Condor, Globus, VDT, gLite, GridFTP, RLS, NWS, INCA(-2), APST, NINF-G, BOINC, ...
- Two years of functionality tests ('04-'06): over 1 in 3 runs have at least one failure!
(1) Test or perish! (2) In today's grids, reliability is more important than performance!

Outline
1. Introduction and Motivation
2. Testing in Grids
3. The ServMark Architecture
4. Experiments and Success Stories
5. Conclusion

2. Testing in Grids
2.1. Analyzing, Testing, and Comparing Grids
- Use cases for automatically analyzing, testing, and comparing grids (or grid middleware):
  - Functionality testing and system tuning
  - Performance testing/analysis of grid applications
  - Reliability testing of grid middleware
  - ...
- For grids, this problem is hard!
  - Testing in real environments is difficult
  - Grids change rapidly
  - Validity of tests
  - ...

2. Testing in Grids
2.2. Requirements for Grid Testing
1. Representative workload generation
2. Accurate testing
3. Scalable testing
4. Reliable testing
5. Extensibility, automation, ease of use
   - Extend the testing system for new requirements/environments
   - Single-click testing
   - Ease of use is at least as important as functionality

Outline
1. Introduction and Motivation
2. Testing in Grids
3. The ServMark Architecture
4. Experiments and Success Stories
5. Conclusions

3. The ServMark Architecture
3.1. ServMark: Generic (Grid) Testing
- What's in a name? Grid service benchmarking
- What's it about?
  - A systematic approach to analyzing, testing, and comparing grid settings, based on synthetic workloads
  - A set of metrics for analyzing grid settings
  - A set of representative grid applications, both real and synthetic
  - Easy-to-use tools to create synthetic grid workloads
  - Automated, single-click testing
  - A flexible, extensible framework
ServMark can generate and run synthetic grid workloads to benchmark grid services.
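The synthetic-workload idea above can be sketched in a few lines. This is an illustrative model only, with made-up job types and exponential inter-arrival times as a first-order arrival model; it is not ServMark's actual workload format.

```python
import random

def generate_workload(num_jobs, mean_interarrival_s, job_types=("sequential", "mpi")):
    """Generate a synthetic workload as a list of (submit_time, job_type) pairs.

    Inter-arrival times are drawn from an exponential distribution, a common
    first-order model of job arrivals. All names here are illustrative.
    """
    t = 0.0
    workload = []
    for _ in range(num_jobs):
        t += random.expovariate(1.0 / mean_interarrival_s)  # exponential gap
        workload.append((round(t, 3), random.choice(job_types)))
    return workload
```

A generated workload can then be handed to a submitter that pushes each job to the middleware at its scheduled time; real generators also model burstiness and self-similarity, which a plain exponential process does not capture.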

3. The ServMark Architecture
3.2. ServMark Design
- Blends DiPerF and GrenchMark, two performance evaluation tools
- Tackles two orthogonal issues:
  - Multi-sourced testing (multi-user scenarios, scalability) with DiPerF
  - Generating and running dynamic test workloads with complex structure (real-world scenarios, flexibility) with GrenchMark
- Adds:
  - Coordination and automation layers
  - A fault-tolerance module
(Diagram: DiPerF + GrenchMark combined into ServMark)

3. The ServMark Architecture
3.3. DiPerF Overview: Easy to Run Large-Scale Grid Tests
- Components: Controller, Tester & Submitter, Analyzer
- Large-scale grid tests (Grid3)
- Scalable testing (4,000 testers)
- Works with GT2, ssh
(Diagram: Controller dispatches Tester #1, Tester #2, ..., Tester #N against the Grid Service; results flow to the Analyzer.)

3. The ServMark Architecture
3.4. GrenchMark Overview: Easy to Generate and Run Synthetic Workloads

3. The ServMark Architecture
... but More Complicated Than You Think
- Workload structure: user-defined and statistical models
- Dynamic job arrivals: burstiness and self-similarity; feedback, background load; machine usage assumptions; users, VOs
- Metrics A(W): run/wait/response time; efficiency, makespan; failure rate [!]
- Notions: co-allocation, interactive jobs, malleable, moldable, ...
- Measurement methods: long workloads; saturated / non-saturated system; start-up, production, and cool-down scenarios; scaling the workload to the system
- Applications: synthetic and real
- Workload definition language: base language layer; extended language layer
- Other: the same workload can be used for both simulations and real environments
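The A(W) metrics named above are simple to compute once per-job records are available. The sketch below assumes each record carries submit/start/finish timestamps and a failure flag; these field names are illustrative, not GrenchMark's actual trace format.

```python
def workload_metrics(jobs):
    """Compute basic per-workload metrics from job records.

    Each job is a dict with 'submit', 'start', 'finish' timestamps (seconds)
    and a boolean 'failed' flag. Field names are illustrative.
    """
    done = [j for j in jobs if not j["failed"]]
    if not done:
        return {"failure_rate": 1.0 if jobs else 0.0}
    wait = [j["start"] - j["submit"] for j in done]    # time in the queue
    resp = [j["finish"] - j["submit"] for j in done]   # wait + run time
    return {
        "avg_wait_time": sum(wait) / len(wait),
        "avg_response_time": sum(resp) / len(resp),
        "makespan": max(j["finish"] for j in done) - min(j["submit"] for j in done),
        "failure_rate": (len(jobs) - len(done)) / len(jobs),
    }
```

Note that failure rate is computed over all submitted jobs, while the time metrics only average over finished ones; mixing the two populations is a common source of misleading numbers.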

3. The ServMark Architecture
3.5. The Current Status
- ServMark: test services in large-scale environments (e.g., grids)
- Blending DiPerF and GrenchMark: done
- Extending the analysis module for high-order statistics and model fitting (expected Q4 2007)
- Extending for testing grid workflow engines (expected Q4 2007)
- Testing, testing, testing...
- ServMark is a Globus Incubator Project
- Functionality proof: testing HTTP servers
- First release in October 2006; new release scheduled for Q...
ServMark is a hierarchical grid services tester, and part of the Globus Incubator Project.

Outline
1. Introduction and Motivation
2. Testing in Grids
3. The ServMark Architecture
4. Experiments and Success Stories
5. Conclusions

4. Experiments and Success Stories
4.1. Testing Design Adequacy
- System functionality testing: show the ability of the system to run various types of applications
  - Report failure rates [10% job failure rate in a controlled system like the DAS!]
- Periodic system testing: evaluate the current state of the grid
  - Replay workloads
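Replay-based periodic testing can be sketched as follows: walk a recorded trace of (submit time, job) pairs and sleep between submissions so that the live system sees the original arrival pattern. The `submit` callback and the `speedup` knob are illustrative assumptions, not ServMark's actual replay interface.

```python
import time

def replay_trace(trace, submit, speedup=1.0):
    """Replay a recorded workload trace against a live system.

    `trace` is a list of (submit_time_s, job) pairs sorted by time; `submit`
    is a hypothetical callback that hands one job to the grid middleware.
    Sleeping between submissions reproduces the original inter-arrival
    pattern, optionally compressed by `speedup`. Illustrative sketch only.
    """
    if not trace:
        return 0
    t0 = trace[0][0]
    wall0 = time.time()
    submitted = 0
    for t, job in trace:
        # Sleep until this job's (scaled) submission time has arrived.
        delay = (t - t0) / speedup - (time.time() - wall0)
        if delay > 0:
            time.sleep(delay)
        submit(job)
        submitted += 1
    return submitted
```

Running the same trace at regular intervals and comparing failure rates over time is one way to turn a one-off experiment into a periodic health check.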

4. Experiments and Success Stories
4.2. Testing Performance (1/2)
- Testing application performance: test the performance of an application (sequential, MPI, or Ibis applications)
  - Report runtimes, waiting times, grid middleware overhead
  - Automatic results analysis
- What-if analysis: evaluate potential situations
  - System change
  - Grid inter-operability
  - Special situations: spikes in demand

4. Experiments and Success Stories
4.3. Testing Performance (2/2)
- Single-site vs. co-allocated jobs: compare the success rates of single-site and co-allocated jobs in a system without reservation capabilities
  - Single-site jobs: 20% better vs. small co-allocated jobs (<32 CPUs), 30% better vs. large co-allocated jobs [setting- and workload-dependent!]
- Unitary vs. composite jobs: compare the success rates of unitary and composite jobs, with and without failure-handling mechanisms
  - Both 100% with a simple retry mechanism [setting- and workload-dependent!]

4. Experiments and Success Stories
4.4. Testing Reliability
- Testing a 1,500-processor Condor environment: what is the success rate of single-processor jobs when submitted in batches of various sizes?
  - Workloads of 1,000 jobs, grouped in batches of 2, 10, 20, 50, 100, 200
  - Goal: finish jobs at most 1h after the last submission
- Results:
  - >150,000 jobs submitted
  - >100,000 jobs successfully run; >2 years of CPU time consumed in 1 week
  - 5% of jobs failed (much less than other grids' average)
  - 25% of jobs did not start in time and were cancelled
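The batching and the deadline check in this experiment are simple to express. The sketch below assumes a flat list of jobs and a map of per-job completion times; both shapes are illustrative, not the actual test harness.

```python
def make_batches(jobs, batch_size):
    """Split a workload into submission batches of at most batch_size jobs,
    mirroring the experiment's grouping (2, 10, 20, 50, 100, 200)."""
    return [jobs[i:i + batch_size] for i in range(0, len(jobs), batch_size)]

def success_rate(outcomes, deadline_s=3600):
    """Classify job outcomes against the 1-hour-after-last-submission goal.

    `outcomes` maps a job id to its completion time measured from the last
    submission (seconds), or None for a job that failed. Illustrative shape.
    """
    finished = sum(
        1 for t in outcomes.values() if t is not None and t <= deadline_s
    )
    return finished / len(outcomes)
```

Sweeping `batch_size` and plotting `success_rate` per batch size reproduces the shape of the experiment: larger batches stress the scheduler's queue and tend to push more jobs past the deadline.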

4. Experiments and Success Stories
4.5. Releasing the Koala Grid Scheduler on the DAS
- Koala: a grid scheduler with co-allocation capabilities
- DAS: the Dutch grid, ~200 researchers
- Initially a pre-release version of Koala, a tested (!) scheduler
- Test specifics:
  - 3 different job submission modules
  - Workloads with different job requirements, inter-arrival rates, co-allocated vs. single-site jobs, ...
  - Evaluate: job success rate, Koala overhead and bottlenecks
- Results:
  - 5,000+ jobs successfully run (all workloads); functionality tests
  - 2 major bugs found on the first day, 10+ bugs overall (all fixed)
- KOALA is now officially released on the DAS (full credit to the KOALA developers; thanks for testing with GrenchMark)

4. Experiments and Success Stories
4.6. The LittleBird Peer-to-Peer Protocol
- LittleBird, part of Tribler
  - Tribler: a peer-to-peer file-sharing system, >100,000 downloads/year
  - LittleBird: an epidemic protocol that exchanges swarm information between peers that are currently or were recently members of a swarm
- Test specifics:
  - Large-scale test environment based on GrenchMark
  - Workload: peer arrivals and departures
  - Evaluate: success rate, protocol overhead and bottlenecks
- Results (by Jelle Roozenburg):
  - Functionality proof
  - The protocol is scalable and resilient against DoS and pollution attacks
  - LittleBird is going to be integrated into Tribler

Outline
1. Introduction and Motivation
2. Testing in Grids
3. The ServMark Architecture
4. Experiments and Success Stories
5. Conclusions

5. Conclusion
- Today's grids are complex and error-prone
- In today's grids, reliability is more important than performance, so: test or perish!
- ServMark can generate and run synthetic grid workloads
- ServMark is a hierarchical grid services tester

Thank you! Questions? Remarks? Observations?
ServMark
Alexandru Iosup, TU Delft [google: "iosup"]