Workload Management Workpackage Massimo Sgaravatto INFN Padova.

Slides:



Advertisements
Similar presentations
WP1 Grid Workload Management Massimo Sgaravatto INFN Padova
Advertisements

Installation and evaluation of the Globus toolkit WP 1 INFN-GRID Workload management WP 1 DATAGRID WP 2.1 INFN-GRID Massimo Sgaravatto INFN Padova.
INFN & Globus activities Massimo Sgaravatto INFN Padova.
Grid Workload Management (WP 1) Report to INFN-GRID TB Massimo Sgaravatto INFN Padova.
Work Package 1 Installation and Evaluation of the Globus Toolkit Massimo Sgaravatto INFN Padova.
Evaluation of the Globus Toolkit: Status Roberto Cucchi – INFN Cnaf Antonia Ghiselli – INFN Cnaf Giuseppe Lo Biondo – INFN Milano Francesco Prelz – INFN.
Condor-G: A Computation Management Agent for Multi-Institutional Grids James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke Reporter: Fu-Jiun.
A Computation Management Agent for Multi-Institutional Grids
Condor and GridShell How to Execute 1 Million Jobs on the Teragrid Jeffrey P. Gardner - PSC Edward Walker - TACC Miron Livney - U. Wisconsin Todd Tannenbaum.
WP 1 Grid Workload Management Massimo Sgaravatto INFN Padova.
CMS HLT production using Grid tools Flavia Donno (INFN Pisa) Claudio Grandi (INFN Bologna) Ivano Lippi (INFN Padova) Francesco Prelz (INFN Milano) Andrea.
DataGrid is a project funded by the European Union 22 September 2003 – n° 1 EDG WP4 Fabric Management: Fabric Monitoring and Fault Tolerance
EU-GRID Work Program Massimo Sgaravatto – INFN Padova Cristina Vistoli – INFN Cnaf as INFN members of the EU-GRID technical team.
1 Software & Grid Middleware for Tier 2 Centers Rob Gardner Indiana University DOE/NSF Review of U.S. ATLAS and CMS Computing Projects Brookhaven National.
GRID workload management system and CMS fall production Massimo Sgaravatto INFN Padova.
Universität Dortmund Robotics Research Institute Information Technology Section Grid Metaschedulers An Overview and Up-to-date Solutions Christian.
Status of Globus activities within INFN Massimo Sgaravatto INFN Padova for the INFN Globus group
LNL M.Biasotto, Bologna, 20 novembre Providing the Grid Information Service with information of local farms Massimo Biasotto – INFN LNL Massimo.
Globus activities within INFN Massimo Sgaravatto INFN Padova for the INFN Globus group
INFN-GRID Globus evaluation Massimo Sgaravatto INFN Padova for the INFN Globus group
Report on the INFN-GRID Globus evaluation Massimo Sgaravatto INFN Padova for the INFN Globus group
GRID Workload Management System Massimo Sgaravatto INFN Padova.
Globus activities within INFN Massimo Sgaravatto INFN Padova for the INFN Globus group
Workload Management Massimo Sgaravatto INFN Padova.
First steps implementing a High Throughput workload management system Massimo Sgaravatto INFN Padova
Status of Globus activities within INFN (update) Massimo Sgaravatto INFN Padova for the INFN Globus group
First ideas for a Resource Management Architecture for Productions Massimo Sgaravatto INFN Padova.
Evaluation of the Globus GRAM Service Massimo Sgaravatto INFN Padova.
Resource Management Reading: “A Resource Management Architecture for Metacomputing Systems”
INFN-GRID Globus evaluation (WP 1) Massimo Sgaravatto INFN Padova for the INFN Globus group
Workload Management WP Status and next steps Massimo Sgaravatto INFN Padova.
GRID The GRID distribution toolkit at INFN Flavia Donno (INFN Pisa) Andrea Sciaba` (INFN Pisa) Zhen Xie (INFN Pisa) presented by Massimo Sgaravatto (INFN.
WP9 Resource Management Current status and plans for future Juliusz Pukacki Krzysztof Kurowski Poznan Supercomputing.
Job Submission Condor, Globus, Java CoG Kit Young Suk Moon.
Grid Workload Management & Condor Massimo Sgaravatto INFN Padova.
DataGrid WP1 Massimo Sgaravatto INFN Padova. WP1 (Grid Workload Management) Objective of the first DataGrid workpackage is (according to the project "Technical.
The Grid System Design Liu Xiangrui Beijing Institute of Technology.
Grid Workload Management Massimo Sgaravatto INFN Padova.
Condor: High-throughput Computing From Clusters to Grid Computing P. Kacsuk – M. Livny MTA SYTAKI – Univ. of Wisconsin-Madison
Report from USA Massimo Sgaravatto INFN Padova. Introduction Workload management system for productions Monte Carlo productions, data reconstructions.
Ames Research CenterDivision 1 Information Power Grid (IPG) Overview Anthony Lisotta Computer Sciences Corporation NASA Ames May 2,
July 11-15, 2005Lecture3: Grid Job Management1 Grid Compute Resources and Job Management.
What is SAM-Grid? Job Handling Data Handling Monitoring and Information.
Globus Toolkit Massimo Sgaravatto INFN Padova. Massimo Sgaravatto Introduction Grid Services: LHC regional centres need distributed computing Analyze.
Review of Condor,SGE,LSF,PBS
GRID Zhen Xie, INFN-Pisa, on DataGrid WP6 meeting1 Globus Installation Toolkit Zhen Xie On behalf of grid-release team INFN-Pisa.
CEOS Working Group on Information Systems and Services - 1 Data Services Task Team Discussions on GRID and GRIDftp Stuart Doescher, USGS WGISS-15 May 2003.
Proposal for a IS schema Massimo Sgaravatto INFN Padova.
6/23/2005 R. GARDNER OSG Baseline Services 1 OSG Baseline Services In my talk I’d like to discuss two questions:  What capabilities are we aiming for.
International Symposium on Grid Computing (ISGC-07), Taipei - March 26-29, 2007 Of 16 1 A Novel Grid Resource Broker Cum Meta Scheduler - Asvija B System.
Report on the INFN-GRID Globus evaluation Massimo Sgaravatto INFN Padova for the INFN Globus group
GRID The GRID distribution toolkit at INFN Flavia Donno (INFN Pisa) Andrea Sciaba` (INFN Pisa) Zhen Xie (INFN Pisa) presented by Massimo Sgaravatto (INFN.
Condor on WAN D. Bortolotti - INFN Bologna T. Ferrari - INFN Cnaf A.Ghiselli - INFN Cnaf P.Mazzanti - INFN Bologna F. Prelz - INFN Milano F.Semeria - INFN.
Tool Integration with Data and Computation Grid “Grid Wizard 2”
6 march Building the INFN Grid Proposal outline a.ghiselli,l.luminari,m.sgaravatto,c.vistoli INFN Grid meeting, milano.
Summary from WP 1 Parallel Section Massimo Sgaravatto INFN Padova.
Grid Compute Resources and Job Management. 2 Grid middleware - “glues” all pieces together Offers services that couple users with remote resources through.
JSS Job Submission Service Massimo Sgaravatto INFN Padova.
4/9/ 2000 I Datagrid Workshop- Marseille C.Vistoli Wide Area Workload Management Work Package DATAGRID project Parallel session report Cristina Vistoli.
Status of Globus activities Massimo Sgaravatto INFN Padova for the INFN Globus group
Grid Workload Management (WP 1) Massimo Sgaravatto INFN Padova.
Grid Activities in CMS Asad Samar (Caltech) PPDG meeting, Argonne July 13-14, 2000.
First evaluation of the Globus GRAM service Massimo Sgaravatto INFN Padova.
The EPIKH Project (Exchange Programme to advance e-Infrastructure Know-How) gLite Grid Introduction Salma Saber Electronic.
Towards a High Performance Extensible Grid Architecture Klaus Krauter Muthucumaru Maheswaran {krauter,
Workload Management Workpackage
Basic Grid Projects – Condor (Part I)
Wide Area Workload Management Work Package DATAGRID project
I Datagrid Workshop- Marseille C.Vistoli
GRID Workload Management System for CMS fall production
Presentation transcript:

Workload Management Workpackage Massimo Sgaravatto INFN Padova

Overview Goal: define and implement a suitable architecture for distributed scheduling and resource management in a GRID environment Large heterogeneous environment PC farms and not supercomputers used in HEP Large numbers (thousands) of independent users in many different sites Different applications with different requirements HEP Monte Carlo productions, reconstructions and production analyses “Scheduled” activities Goal: throughput maximization HEP individual physics analyses “Chaotic”, non-predictable activities Goal: latency minimization …

Overview Many challenging issues : Optimizing the choice of execution location based on the availability of data, computation and network resources Optimal co-allocation and advance reservation of CPU, data, network Uniform interface to possible different local resource management systems Priorities, policies on resource usage Reliability Fault tolerance Scalability … INFN responsibility in DataGrid

Tasks Job resource specification and job description Method to define and publish the resources required by a job Job control language (submission language, API, GUI) Partitioning programs for parallel execution “Decomposition” of single jobs in multiple, “smaller” jobs that can be executed in parallel Exploitation of task and data parallelism

Tasks Scheduling Definition and implementation of scheduling policies to find the best match between job requirements and available resources Co-allocation and advance reservation Resource management Services Bookkeeping, accounting, logging, authentication, authorization

Effort breakdown (mm) FundedUnfunded INFN DATAMAT1080 CESnet PPARC

Workload Management in the INFN-GRID project Integration, adaptation and deployment of middleware developed within the DataGrid project GRID software must enable physicists to run their jobs using all the available GRID resources in a “transparent” way HEP applications classified in 3 different “classes”, with incremental level of complexity Workload management system for Monte Carlo productions Goal: throughput maximization Implementation strategy: code migration (moving the application where the processing will be performed) Workload management system for data reconstruction and production analysis Goal: throughput maximization Implementation strategy: code migration + data migration (moving the data where the processing will be performed, and collecting the outputs in a central repository) Workload management system for individual physics analysis “Chaotic” processing Goal: latency minimization Implementation strategy: code migration + data migration + remote data access (accessing data remotely) for client/server applications

First Activities and Results CMS-HLT use case (Monte Carlo production) analyzed in terms of GRID requirements and GRID tools availability Discussions with Globus team and Condor team Good and productive collaborations already in place Definition of a possible high throughput workload management system architecture Use of Globus and Condor mechanisms But major developments needed

High throughput workload management system architecture (simplified design) Globus GRAM CONDOR Globus GRAM LSF Globus GRAM PBS globusrun Site1 Site2Site3 condor_submit (Globus Universe) Condor-G Master Grid Information Service (GIS) Submit jobs (using Class-Ads) Resource Discovery Information on characteristics and status of local resources Local Resource Management Systems Globus GRAM as uniform interface to different local resource management systems Condor-G able to provide reliability Use of Condor tools for job monitoring, logging, … Master chooses in which Globus resources the jobs must be submitted Farms Other info

First Activities and Results On going activities in putting together the various building blocks Globus deployment Installed on ~ 35 hosts in 11 different sites INFNGRID distribution toolkit to make Globus deployment easier and more automatic INFN customizations User and host certificates signed by INFN CA Preliminary architecture for GIS

Dc=bo, Dc=infn, dc=it,o=grid Bologna GIIS INFN ATLAS GIIS GIIS Dc=mi,Dc=infn, dc=it,o=grid Exp=atlas, o=grid Top Level INFN GIIS Dc=infn,dc=it, o=grid Milano GIS Architecture (test phase) GRIS Implemented Implemented using INFNGRID distribution To be implemented

First Activities and Results Evaluation of Globus GRAM Tests with job submissions on remote resources Globus GRAM as uniform interface to different underlying resource management systems (LSF, Condor, PBS) Problems related with scalability and fault tolerance (Globus jobmanager) Evaluation of Globus RSL as uniform language to describe resources More flexibility is required (Condor ClassAds model) “Cooperation” between GRAM and GIS The information on characteristics and status of local resources and on jobs is not enough As local resources we must consider Farms and not the single workstations

First Activities and Results Evaluation of Condor-G It works, but some problems must be fixed: Very difficult to understand about errors Problems with log files Problems with scalability in the submitting machine Condor-G is not able to provide fault tolerance and robustness (because Globus doesn’t provide these features) Fault tolerance only in the submitting side Condor team is already working to fix some of these problems

Some next steps Tests with real applications and real environments CMS fall production Evaluation of the new Condor-G implementation (when ready) GIS – ClassAds translator (Globus team ?) Face the problems !! Master development !!!

Other info